DocumentCode :
3249084
Title :
Phrase-based document similarity based on an index graph model
Author :
Hammouda, Khaled M. ; Kamel, Mohamed S.
Author_Institution :
Dept. of Syst. Design Eng., Waterloo Univ., Ont., Canada
fYear :
2002
fDate :
2002
Firstpage :
203
Lastpage :
210
Abstract :
Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the document index graph, which indexes web documents based on phrases, rather than single terms only. The semi-structured web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The document index graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.
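The abstract does not give the construction of the document index graph, so the following is only a minimal illustrative sketch of the general idea of indexing documents by phrases in a word graph. It assumes that nodes are words, that a directed edge links each pair of consecutive words in a sentence, and that each edge records where it occurs. The class name `PhraseIndexGraph`, its methods, and the sample sentences are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class PhraseIndexGraph:
    """Toy phrase index (an assumption, not the paper's exact model).

    Nodes are words; a directed edge (w1, w2) exists when w2 directly
    follows w1 in some sentence. Each edge stores its occurrences.
    """

    def __init__(self):
        # (word, next_word) -> set of (doc_id, sentence_no, position)
        self.edges = defaultdict(set)

    def add_document(self, doc_id, sentences):
        """Index every consecutive word pair of every sentence."""
        for s_no, sentence in enumerate(sentences):
            words = sentence.lower().split()
            for pos in range(len(words) - 1):
                edge = (words[pos], words[pos + 1])
                self.edges[edge].add((doc_id, s_no, pos))

    def matching_phrases(self, doc_a, doc_b):
        """Return the maximal word sequences (2+ words) shared by two docs."""
        # Every place where the same edge occurs in both documents:
        # (sent_a, pos_a, sent_b, pos_b) -> (w1, w2)
        aligned = {}
        for (w1, w2), occs in self.edges.items():
            a_occs = [(s, p) for d, s, p in occs if d == doc_a]
            b_occs = [(s, p) for d, s, p in occs if d == doc_b]
            for sa, pa in a_occs:
                for sb, pb in b_occs:
                    aligned[(sa, pa, sb, pb)] = (w1, w2)

        phrases = set()
        for (sa, pa, sb, pb), (w1, w2) in aligned.items():
            # Skip occurrences that merely continue an earlier shared run.
            if (sa, pa - 1, sb, pb - 1) in aligned:
                continue
            words = [w1, w2]
            k = 1
            # Extend the run while the next edge is shared at the same offset.
            while (sa, pa + k, sb, pb + k) in aligned:
                words.append(aligned[(sa, pa + k, sb, pb + k)][1])
                k += 1
            phrases.add(" ".join(words))
        return phrases


# Hypothetical sample data, for illustration only.
graph = PhraseIndexGraph()
graph.add_document("d1", ["river rafting trips", "wild river adventures"])
graph.add_document("d2", ["river rafting vacation plan", "wild river adventures"])
print(sorted(graph.matching_phrases("d1", "d2")))
```

In this sketch, shared phrases are recovered by following edges that both documents traverse at aligned positions. How matched phrases are weighted and combined with single-term similarity is not specified in the abstract and is left out here.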
Keywords :
Web sites; directed graphs; indexing; text analysis; Web documents; document clustering techniques; document data set; document index graph; document structure capture; matching phrases weights; phrase representation; phrase-based document similarity; single term analysis; single term weights; vector space model; Clustering methods; Data engineering; Data mining; Data models; Design engineering; Functional analysis; System analysis and design; Systems engineering and theory; Web mining
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
Type :
conf
DOI :
10.1109/ICDM.2002.1183904
Filename :
1183904