DocumentCode :
3166725
Title :
Local Word Bag Model for Text Categorization
Author :
Pu, Wen ; Liu, Ning ; Yan, Shuicheng ; Yan, Jun ; Xie, Kunqing ; Chen, Zheng
Author_Institution :
Peking Univ., Peking
fYear :
2007
fDate :
28-31 Oct. 2007
Firstpage :
625
Lastpage :
630
Abstract :
Many text processing applications adopt the bag-of-words (BOW) model for document representation, in which each document is represented as a vector of weighted terms or n-grams, and the cosine distance between two vectors is used as the similarity measure. Despite its great success in information retrieval and text categorization, the conventional BOW model ignores detailed local text information, i.e., the co-occurrence pattern of words at the sentence or paragraph level. In this paper, we propose a novel approach that represents a document as a set of local tf-idf vectors, which we call local word bags (LWB). By encapsulating the local information distributed across a document into multiple LWBs, we can measure the similarity of two documents via the partial match of their corresponding local bags. To perform the matching efficiently, we introduce the local word bag kernel (LWB kernel), a variant of the VG-Pyramid match kernel. The new kernel enables discriminative machine learning methods such as SVMs to compute the partial matching between two sets of LWBs in linear time, after a one-time hierarchical clustering procedure over all local bags at the initialization stage. Experiments on real-world datasets demonstrate the effectiveness of the new approach.
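The representation described in the abstract can be illustrated with a minimal sketch, given here only as an aid to reading and not as the authors' method. It assumes sentence-level local bags, a smoothed idf computed over all local bags, and a naive quadratic-time best-match average in place of the LWB kernel; the linear-time VG-Pyramid variant and the hierarchical clustering step are not reproduced. All function names (`local_bags`, `idf_table`, `tfidf`, `cosine`, `partial_match`) are hypothetical.

```python
import math
import re
from collections import Counter


def local_bags(doc):
    """Split a document into sentences; return one term-count bag per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    return [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]


def idf_table(all_bags):
    """Smoothed idf, treating each local bag as one unit (an assumption)."""
    n = len(all_bags)
    # Iterating a Counter yields each distinct term once per bag.
    df = Counter(term for bag in all_bags for term in bag)
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(bag, idf):
    """Weight raw term counts by idf to get a sparse local tf-idf vector."""
    return {t: c * idf.get(t, 0.0) for t, c in bag.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def partial_match(xs, ys):
    """Naive stand-in for partial matching: average, over the smaller set,
    of each vector's best cosine match in the larger set. Quadratic time."""
    if not xs or not ys:
        return 0.0
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    return sum(max(cosine(a, b) for b in large) for a in small) / len(small)


doc_a = "Kernels compare sets of vectors. The weather was nice today."
doc_b = "Support vector machines use kernels to compare vectors. Stock prices fell."
bags_a, bags_b = local_bags(doc_a), local_bags(doc_b)
idf = idf_table(bags_a + bags_b)
vecs_a = [tfidf(b, idf) for b in bags_a]
vecs_b = [tfidf(b, idf) for b in bags_b]
score = partial_match(vecs_a, vecs_b)
print(round(score, 3))
```

In this toy pair, only the first sentences share terms, so the score reflects one matched local bag and one unmatched one; a single whole-document vector would blend the two topics together.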
Keywords :
information retrieval; learning (artificial intelligence); text analysis; word processing; VG-Pyramid match kernel; cosine distance; discriminative machine learning; documents representation; information retrieval; local word bag model; text categorization; text processing; Asia; Data engineering; Data mining; Extraterrestrial measurements; Kernel; Laboratories; Machine intelligence; Support vector machines; Text categorization; Text processing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3018-5
Type :
conf
DOI :
10.1109/ICDM.2007.69
Filename :
4470301