DocumentCode :
3165128
Title :
How Much Noise Is Too Much: A Study in Automatic Text Classification
Author :
Agarwal, Sumeet ; Godbole, Shantanu ; Punjani, Diwakar ; Roy, Shourya
Author_Institution :
IIT Delhi, Delhi
fYear :
2007
fDate :
28-31 Oct. 2007
Firstpage :
3
Lastpage :
12
Abstract :
Noise is a stark reality in real-life data. Especially in the domain of text analytics, it has a significant impact as data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as on-line chat, SMS, email, newsgroups and blogs, automatically transcribed text from speech, and automatically recognized text from printed or handwritten material. Gigabytes of such data are being generated every day on the Internet, in contact centers, and on mobile phones. Researchers have looked at various text mining issues such as pre-processing and cleaning noisy text, information extraction, rule learning, and classification for noisy text. This paper focuses on the issues faced by automatic text classifiers in analyzing noisy documents coming from various sources. The goal of this paper is to bring out and study the effect of different kinds of noise on automatic text classification. Does the nature of such text warrant moving beyond traditional text classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets. We present interesting results on real-life noisy datasets from various CRM domains.
Keywords :
data mining; pattern classification; text analysis; automatic text classification; data cleaning; data processing cycle; noisy text analytics; noisy text document analysis; text mining; Automatic speech recognition; Blogs; Cleaning; Data processing; Handwriting recognition; Internet; Mobile handsets; Text categorization; Text mining; Text recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3018-5
Type :
conf
DOI :
10.1109/ICDM.2007.21
Filename :
4470224