A Joint Semantic Vector Representation Model for Text Clustering and Classification

Momtazi, S.; Rahbar, A.; Salami, D.; Khanijazani, I.

doi:https://dx.doi.org/10.22044/jadm.2019.7400.1876

Author(s)

Momtazi, S.; Rahbar, A.; Salami, D.; Khanijazani, I.

Document type

Text
Research/Original/Regular Article

Document language

English

Abstract

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use semantic models for document vector representation. Latent Dirichlet allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this paper, we first study the conceptual difference between the two models and show that they behave differently and capture semantic features of texts from different perspectives. We then propose a hybrid approach for document vector representation that benefits from the advantages of both models. Experimental results on the 20 Newsgroups dataset show the superiority of the proposed model over each of the baselines on both text clustering and classification tasks. We achieved a 2.6% improvement in F-measure for text clustering and a 2.1% improvement in F-measure for text classification compared to the best baseline model.

Keywords

Text mining
Semantic representation
Topic modeling
Neural document embedding
Document and Text Processing

Issue number

Publication date

2019-07-01 (Gregorian)
1398-04-10 (Solar Hijri)

Publisher

Shahrood University of Technology

Author affiliation

Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran (all four authors)

ISSN

2322-5211
2322-4444

URI

https://dx.doi.org/10.22044/jadm.2019.7400.1876
http://jad.shahroodut.ac.ir/article_1457.html
https://iranjournals.nlai.ir/handle/123456789/294922