DocumentCode :
1936247
Title :
Exploring Wikipedia and Query Log's Ability for Text Feature Representation
Author :
Li, Bing ; Chen, Qing-cai ; Yeung, Daniel S. ; Ng, Wing W Y ; Wang, Xiao-long
Author_Institution :
Harbin Inst. of Technol. Shenzhen, Shenzhen
Volume :
6
fYear :
2007
fDate :
19-22 Aug. 2007
Firstpage :
3343
Lastpage :
3348
Abstract :
The rapid growth of Internet technology requires better management of Web page contents. Much text mining research has been conducted in areas such as text categorization, information retrieval, and text clustering. When machine learning methods or statistical models are applied to data on such a large scale, the first step is to represent a text document in a form that computers can handle. Traditionally, single words are employed as features in the vector space model, and together they make up the feature space for all text documents. The single-word-based representation assumes word independence and does not consider the relations between words, which may cause information loss. This paper proposes Wiki-Query segmented features for text classification, in the hope of making better use of the text information. The experimental results show that a much better F1 value is achieved than with the classical single-word-based text representation. This suggests that Wikipedia and query segmented features can better represent a text document.
Keywords :
Internet; text analysis; Internet technology; Web page contents; Wikipedia; information retrieval; text categorization; text classification; text clustering; text feature representation; text mining; vector space model; Content management; Information retrieval; Internet; Large-scale systems; Learning systems; Technology management; Text categorization; Text mining; Web pages; Wikipedia; Query-Log; Text feature representation; Wikipedia (Wiki); Word-Based model;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics, 2007 International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4244-0973-0
Electronic_ISBN :
978-1-4244-0973-0
Type :
conf
DOI :
10.1109/ICMLC.2007.4370725
Filename :
4370725