DocumentCode :
2805248
Title :
Analysis of Web Search Engine Clicked Documents
Author :
Nettleton, David F. ; Calderón-Benavides, Liliana ; Baeza-Yates, Ricardo
Author_Institution :
Web Res. Group, Univ. Pompeu Fabra, Barcelona
fYear :
2006
fDate :
Oct. 2006
Firstpage :
209
Lastpage :
219
Abstract :
In this paper we process and analyze Web search engine query and click data from the perspective of the documents (URLs) selected. We initially define possible document categories and select descriptive variables to define the documents. The URL dataset is preprocessed and analyzed using some traditional statistical methods, and then processed by the Kohonen (1984) SOM clustering technique, which we use to produce a two-level clustering. The clusters are interpreted in terms of the document categories and variables defined initially. Then we apply the C4.5 (Quinlan, 1993) rule induction algorithm to produce a decision tree for the document category. The objective of the paper is to apply a systematic data mining process to click data, contrasting non-supervised (Kohonen) and supervised (C4.5) methods to cluster and model the data, in order to identify document profiles which relate to theoretical user behavior and document (URL) organization.
Keywords :
Internet; classification; data mining; decision trees; document handling; pattern clustering; query processing; search engines; self-organising feature maps; C4.5 rule induction; Kohonen SOM clustering; Web search engine; click data analysis; data clustering; data mining; decision tree; document analysis; document categorization; document organization; query analysis; statistical methods; two level clustering; Clustering algorithms; Data mining; Data processing; Decision trees; Input variables; Partitioning algorithms; Predictive models; Search engines; Uniform resource locators; Web search;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Congress, 2006. LA-Web '06. Fourth Latin American
Conference_Location :
Cholula
Print_ISBN :
0-7695-2693-4
Type :
conf
DOI :
10.1109/LA-WEB.2006.6
Filename :
4022112