Title :
Research on Data Cleaning in Text Clustering
Author :
Yuhang, Zhang ; Yue, Wang ; Wei, Yang
Author_Institution :
Coll. of Technol. & Econ., Liaoning Tech. Univ.(LNTU), Fuxin, China
Abstract :
Data cleaning in the preprocessing stage of text clustering currently tends to remove, by mistake, words that have discriminative power. This paper proposes a more reasonable data cleaning method to address that problem. The method takes into account the appearance of words from new fields. For rare word filtering, it considers both the importance of a word in the whole text collection, namely its word frequency, and its importance in the text in which it appears, namely its weighting. In this way the method avoids assigning new-field words to an existing category, so that filtering is comparatively accurate and the text clustering result is more precise. Finally, text clustering is performed with the C-means algorithm, and the results verify that the method improves the accuracy of text clustering.
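The two-criterion rare word filter described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' formulation: the abstract gives no formulas, so the use of TF-IDF as the in-document weighting, the function name `filter_rare_words`, and the thresholds `min_freq` and `min_weight` are all assumptions made for the example.

```python
import math
from collections import Counter


def filter_rare_words(docs, min_freq=3, min_weight=0.5):
    """Return the set of words to remove from a tokenized corpus.

    docs is a list of token lists. A word is removed only if it is rare
    in the whole collection AND carries little weight in every document
    where it appears. Both thresholds are hypothetical values.
    """
    n_docs = len(docs)
    # Importance in the whole collection: raw word frequency.
    collection_freq = Counter(tok for doc in docs for tok in doc)
    # Number of documents containing each word (for the IDF term).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))

    # Importance inside a document: TF-IDF is assumed here as the weighting.
    max_weight = {}
    for doc in docs:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word]) + 1.0
            max_weight[word] = max(max_weight.get(word, 0.0), tf * idf)

    return {
        word
        for word, freq in collection_freq.items()
        if freq < min_freq and max_weight[word] < min_weight
    }


docs = [
    ["data", "text", "cluster", "data", "text", "cluster", "data", "typo"],
    ["data", "text", "cluster", "text"],
    ["genome", "genome", "data"],  # a short text from a "new field"
]
removed = filter_rare_words(docs)
```

In this toy corpus both `typo` and `genome` are rare across the collection, so a frequency-only filter would drop both. Under the combined rule, `genome` is kept because it dominates the one text in which it appears, which is the kind of new-field word the abstract aims to preserve.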
Keywords :
pattern classification; pattern clustering; text analysis; word processing; C-means algorithm; data cleaning; text clustering; word filtering; word frequency; Cleaning; Clustering algorithms; Dispersion; Equations; Filtering; Mathematical model; Vocabulary; weighting
Conference_Titel :
Information Technology and Applications (IFITA), 2010 International Forum on
Conference_Location :
Kunming
Print_ISBN :
978-1-4244-7621-3
Electronic_ISBN :
978-1-4244-7622-0
DOI :
10.1109/IFITA.2010.73