DocumentCode :
676945
Title :
Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages
Author :
Kondor, Daniel ; Csabai, Istvan ; Dobos, Lubomir ; Szule, Janos ; Barankai, Norbert ; Hanyecz, Tamas ; Sebok, Tamas ; Kallus, Zsofia ; Vattay, Gabor
Author_Institution :
Dept. of Phys. of Complex Syst., Eotvos Lorand Univ., Budapest, Hungary
fYear :
2013
fDate :
2-5 Dec. 2013
Firstpage :
393
Lastpage :
398
Abstract :
Principal component analysis (PCA) and related techniques have been successfully employed in natural language processing. Text mining applications in the age of the online social media (OSM) face new challenges due to properties specific to these use cases (e.g. spelling issues specific to texts posted by users, the presence of spammers and bots, service announcements, etc.). In this paper, we employ a Robust PCA technique to separate typical outliers and highly localized topics from the low-dimensional structure present in language use in online social networks. Our focus is on identifying geospatial features among the messages posted by the users of the Twitter microblogging service. Using a dataset which consists of over 200 million geolocated tweets collected over the course of a year, we investigate whether the information present in word usage frequencies can be used to identify regional features of language use and topics of interest. Using the PCA pursuit method, we are able to identify important low-dimensional features, which constitute smoothly varying functions of the geographic location.
Keywords :
data mining; feature extraction; natural language processing; principal component analysis; social networking (online); text analysis; PCA pursuit method; Twitter microblogging service; geo-tagged Twitter messages; geographic location; language use; natural language processing; online social media; online social networks; principal component analysis; regional characteristic estimation; regional feature identification; robust PCA technique; text mining applications; word usage frequencies; Cities and towns; Meteorology; Principal component analysis; Robustness; Sparse matrices; Twitter;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on
Conference_Location :
Budapest
Print_ISBN :
978-1-4799-1543-9
Type :
conf
DOI :
10.1109/CogInfoCom.2013.6719277
Filename :
6719277