Predicting Budget from Transportation Research Grant Description: An Exploratory Analysis of Text Mining and Machine Learning Techniques

Singhal, Ayush; Gopalakrishnan, Kasthurirangan; Khaitan, Siddhartha

doi:https://dx.doi.org/10.22115/scce.2017.49604

Document type

Text
Regular Article

Document language

English

Abstract

Funding agencies such as the U.S. National Science Foundation (NSF), the U.S. National Institutes of Health (NIH), and the Transportation Research Board (TRB) of The National Academies maintain publicly available online grant databases that document a variety of information on grants funded over the past few decades. In this paper, based on a quantitative analysis of the TRB's Research In Progress (RIP) online database, we explore the feasibility of automatically estimating the appropriate funding level from the textual description of a transportation research project. We use statistical Text Mining (TM) and Machine Learning (ML) techniques to build this model from the more than 14,000 records of the TRB's RIP research grant data. Natural Language Processing (NLP) based text representation models, namely Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and Doc2Vec, are used to vectorize the project descriptions and generate semantic vectors. Each of these representations is then used to train supervised regression models such as Random Forest (RF) regression. Of the three latent feature generation models, LDA gave the lowest Mean Absolute Error (MAE). However, the correlation coefficients indicate that it is not very feasible to accurately predict the funding level directly from the unstructured project abstract, given the large variations in source agencies, subject areas, and funding levels. With separate prediction models for different types of funding agencies, funding levels were better correlated with the project abstract.
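The abstract judges the models by two measures, the Mean Absolute Error and a correlation coefficient, and these answer different questions: MAE reports the typical size of the error in budget units, while correlation reports whether predictions rise and fall with the actual budgets. The sketch below computes both using only the Python standard library. It is an illustration, not the authors' code: the function names and the budget figures are made up, and Pearson's r is assumed here because the record does not say which correlation coefficient the paper uses.

```python
import math
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def pearson_r(actual, predicted):
    """Pearson correlation; returns NaN when either input is constant."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a, mean_p = mean(actual), mean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        # Correlation is undefined for a constant series.
        return math.nan
    return cov / math.sqrt(var_a * var_p)


# Hypothetical budgets in thousands of USD (not taken from the RIP database).
actual_budgets = [100.0, 200.0, 300.0, 400.0]
predicted_budgets = [150.0, 150.0, 350.0, 350.0]

mae = mean_absolute_error(actual_budgets, predicted_budgets)
r = pearson_r(actual_budgets, predicted_budgets)
print(f"MAE = {mae:.1f} thousand USD, Pearson r = {r:.3f}")
```

A model that predicts the same budget for every project can still reach a moderate MAE while having no defined correlation with the actual budgets, which is one reason to report both measures when comparing representations.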

Keywords

Text Mining
Transportation research
Natural Language Processing (NLP)
Big Data
Deep Learning
Soft Computing
Data Mining

Publication date

2017-10-01
1396-07-09 (Solar Hijri calendar)

Publisher

Pouyan Press

Author affiliations

R&D, Contata Solutions, LLC, Minneapolis, Minnesota, USA
Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA
Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

ISSN

2588-2872

URI

https://dx.doi.org/10.22115/scce.2017.49604
http://www.jsoftcivil.com/article_49604.html
https://iranjournals.nlai.ir/handle/123456789/44853