DocumentCode :
2415804
Title :
Authorship attribution of web forum posts
Author :
Pillay, Sangita R. ; Solorio, Thamar
Author_Institution :
Dept. of Comput. & Inf. Sci., Univ. of Alabama at Birmingham, Birmingham, AL, USA
fYear :
2010
fDate :
18-20 Oct. 2010
Firstpage :
1
Lastpage :
7
Abstract :
Extracting useful information from user-generated text on the web is an important, ongoing research area in natural language processing, machine learning, and data mining. Online tools such as email, newsgroups, blogs, and web forums provide an effective communication platform for millions of users around the globe, with the added advantage of anonymity. Millions of people post information on web forums daily, and the possibility that anonymous users exchange sensitive information on these forums cannot be ruled out. This paper proposes a two-stage approach that combines unsupervised and supervised learning to perform authorship attribution on web forum posts. In the first stage, clustering techniques are used to group the data into stylistically similar clusters. In the second stage, the clusters resulting from stage one are used as features to train different machine learning classifiers. The two-stage approach aims to reduce the complexity of the classification task and to boost prediction accuracy.
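The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract names no specific features, clustering algorithm, or classifiers: the three toy stylometric features, the plain k-means in stage one, the one-hot cluster indicator, the nearest-centroid classifier in stage two, and all function names (`style_features`, `kmeans`, `augment`, `train`, `predict`) are hypothetical.

```python
import math
from collections import defaultdict

def style_features(text):
    """Toy stylometric vector (assumed): mean word length, punctuation rate, uppercase rate."""
    words = text.split()
    n_chars = max(len(text), 1)
    mean_word_len = sum(len(w) for w in words) / max(len(words), 1)
    punct_rate = sum(c in ".,;:!?" for c in text) / n_chars
    upper_rate = sum(c.isupper() for c in text) / n_chars
    return [mean_word_len, punct_rate, upper_rate]

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

def kmeans(points, k, iters=20):
    """Stage 1: plain k-means, initialised deterministically from the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(groups[i]) if groups[i] else centroids[i] for i in range(k)]
    return centroids

def augment(p, centroids):
    """Append a one-hot cluster indicator so the cluster becomes a feature."""
    one_hot = [0.0] * len(centroids)
    one_hot[nearest(p, centroids)] = 1.0
    return p + one_hot

def train(posts, authors, k=2):
    """Stage 2: nearest-centroid classifier over cluster-augmented features."""
    feats = [style_features(t) for t in posts]
    centroids = kmeans(feats, k)
    by_author = defaultdict(list)
    for f, a in zip(feats, authors):
        by_author[a].append(augment(f, centroids))
    profiles = {a: mean(vs) for a, vs in by_author.items()}
    return centroids, profiles

def predict(text, centroids, profiles):
    """Attribute a post to the author whose profile is closest."""
    x = augment(style_features(text), centroids)
    return min(profiles, key=lambda a: dist(x, profiles[a]))

# Made-up example posts from two hypothetical authors with distinct styles.
posts = [
    "HELLO!!! THIS IS GREAT!!!",
    "WOW!!! SO COOL!!!",
    "the committee considered several alternatives carefully",
    "researchers investigated stylometric characteristics extensively",
]
authors = ["a", "a", "b", "b"]
centroids, profiles = train(posts, authors)
```

Calling `predict(new_post, centroids, profiles)` then returns one of the training author labels. A real system would use a much richer feature set and stronger classifiers; the sketch only shows how cluster membership from an unsupervised stage can be fed to a supervised stage as an extra feature.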
Keywords :
Internet; authorisation; data mining; learning (artificial intelligence); natural language processing; pattern classification; pattern clustering; task analysis; Web forum posts; authorship attribution; classification task; clustering techniques; data mining; machine learning classifiers; natural language processing; prediction accuracy; sensitive information exchange; supervised learning; unsupervised learning; Accuracy; Classification algorithms; Classification tree analysis; Feature extraction; Machine learning; Machine learning algorithms; Training; Authorship attribution; clustering; machine learning classifiers; stylometry; text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
eCrime Researchers Summit (eCrime), 2010
Conference_Location :
Dallas, TX
ISSN :
2159-1237
Print_ISBN :
978-1-4244-7760-9
Type :
conf
DOI :
10.1109/ecrime.2010.5706693
Filename :
5706693