DocumentCode :
1761746
Title :
Cross-Language Learning from Bots and Users to Detect Vandalism on Wikipedia
Author :
Tran, Khoi-Nguyen ; Christen, Peter
Author_Institution :
Research School of Computer Science, Australian National University, Canberra, ACT, Australia
Volume :
27
Issue :
3
fYear :
2015
fDate :
March 1 2015
Firstpage :
673
Lastpage :
685
Abstract :
Vandalism, the malicious modification of articles, is a serious problem for open access encyclopedias such as Wikipedia. The use of counter-vandalism bots is changing the way Wikipedia identifies and bans vandals, but their contributions are often neither considered nor discussed. In this paper, we propose novel text features capturing the invariants of vandalism across five languages to learn and compare the contributions of bots and users in the task of identifying vandalism. We construct computationally efficient features that highlight the contributions of bots and users, and generalize across languages. We evaluate our proposed features through classification performance on revisions of five Wikipedia languages, totaling over 500 million revisions of over nine million articles. As a comparison, we evaluate these features on the small PAN Wikipedia vandalism data sets, used by previous research, which contain approximately 62,000 revisions. We show differences in the performance of our features on the PAN and the full Wikipedia data set. With the appropriate text features, vandalism bots can be effective across different languages while learning from only one language. Our ultimate aim is to build the next generation of vandalism detection bots based on machine learning approaches that can work effectively across many languages.
Keywords :
Web sites; encyclopaedias; learning (artificial intelligence); text analysis; Wikipedia languages; counter-vandalism bots; cross-language learning; machine learning approaches; open access encyclopedias; small PAN Wikipedia vandalism data sets; text features; vandalism detection bots; Conferences; Electronic publishing; Encyclopedias; Feature extraction; Internet; Maintenance engineering; Bots; Wikipedia; cross-language learning; editors; feature engineering; transfer learning; users; vandalism;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2014.2339844
Filename :
6857333