DocumentCode :
3673656
Title :
Fast Text Classification Using Randomized Explicit Semantic Analysis
Author :
Aibek Musaev;De Wang;Saajan Shridhar;Calton Pu
Author_Institution :
Georgia Inst. of Technol., Atlanta, GA, USA
fYear :
2015
Firstpage :
364
Lastpage :
371
Abstract :
Document classification, or document categorization, is one of the most studied areas in computer science due to its importance. The problem is to assign a document, based on its text, to one or more classes or categories from a predefined set. We propose a new approach for fast text classification using randomized explicit semantic analysis (RS-ESA). It builds on a state-of-the-art approach to word sense disambiguation that relies on Wikipedia, the largest encyclopedia in existence. Our method reduces the Wikipedia repository by random sampling, which yields a throughput an order of magnitude higher than that of the original explicit semantic analysis. The RS-ESA approach has been implemented as part of the LITMUS project, which needs to classify social media data into items that are relevant or irrelevant to landslides as a natural disaster. We demonstrate that our approach achieves 96% precision when classifying social media landslide data collected in December 2014. We also demonstrate the generality of the proposed approach by using it to separate factual texts from fictional ones, based on Wikipedia articles and fan fiction stories, where we achieve 97% precision.
Keywords :
"Encyclopedias","Internet","Electronic publishing","Terrain factors","Media","Training"
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration (IRI), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/IRI.2015.62
Filename :
7301000