DocumentCode
2415804
Title
Authorship attribution of web forum posts
Author
Pillay, Sangita R. ; Solorio, Thamar
Author_Institution
Dept. of Comput. & Inf. Sci., Univ. of Alabama at Birmingham, Birmingham, AL, USA
fYear
2010
fDate
18-20 Oct. 2010
Firstpage
1
Lastpage
7
Abstract
Extracting useful information from user-generated text on the web is an important ongoing research area in natural language processing, machine learning, and data mining. Online tools such as email, newsgroups, blogs, and web forums provide an effective communication platform for millions of users around the globe, with the added advantage of anonymity. Millions of people post information on web forums daily, and the possibility that anonymous users exchange sensitive information on these forums cannot be ruled out. This paper proposes a two-stage approach that combines unsupervised and supervised learning to perform authorship attribution on web forum posts. In the first stage, clustering techniques are used to group the data into stylistically similar clusters. In the second stage, the clusters from stage one are used as features to train different machine learning classifiers. The two-stage approach aims to reduce the complexity of the classification task and improve prediction accuracy.
Keywords
Internet; authorisation; data mining; learning (artificial intelligence); natural language processing; pattern classification; pattern clustering; task analysis; Web forum posts; authorship attribution; classification task; clustering techniques; data mining; machine learning classifiers; natural language processing; prediction accuracy; sensitive information exchange; supervised learning; unsupervised learning; Accuracy; Classification algorithms; Classification tree analysis; Feature extraction; Machine learning; Machine learning algorithms; Training; Authorship attribution; clustering; machine learning classifiers; stylometry; text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
eCrime Researchers Summit (eCrime), 2010
Conference_Location
Dallas, TX
ISSN
2159-1237
Print_ISBN
978-1-4244-7760-9
Type
conf
DOI
10.1109/ecrime.2010.5706693
Filename
5706693