DocumentCode :
670245
Title :
FICWAN: frequent itemset clustering of web articles by analyzing the article neighborhood
Author :
Kucecka, Tomas ; Chuda, Daniela ; Sladecek, Peter
Author_Institution :
Fac. of Inf. & Inf. Technol., Slovak Univ. of Technol., Bratislava, Slovakia
fYear :
2013
fDate :
19-21 Nov. 2013
Firstpage :
509
Lastpage :
514
Abstract :
Document clustering is the process of organizing text data into clusters, where a cluster usually represents a group of topic-related documents. The most effective text clustering approaches are based on frequent itemsets. A popular algorithm that uses this approach is FIHC (Frequent Itemset-based Hierarchical Clustering), and many modifications of it have been proposed in recent years. In this paper we focus on clustering web articles, a special type of text data: they contain hyperlinks that connect them to other articles on the web. We propose FICWAN, a modification of FIHC that is especially suited to web data. We show that by considering the neighborhood of a web article, together with its HTML tags and CSS, we can significantly improve the quality of the resulting clusters. We evaluated our approach on several corpora, and it clearly outperformed FIHC.
Keywords :
Internet; data mining; information retrieval; pattern clustering; text analysis; CSS; FICWAN frequent itemset clustering; FIHC; HTML tags; Web articles clustering; article neighborhood; document clustering; frequent itemset-based hierarchical clustering; hyperlinks; topic related document; Cascading style sheets; Clustering algorithms; HTML; Informatics; Itemsets; Partitioning algorithms; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence and Informatics (CINTI), 2013 IEEE 14th International Symposium on
Conference_Location :
Budapest
Print_ISBN :
978-1-4799-0194-4
Type :
conf
DOI :
10.1109/CINTI.2013.6705250
Filename :
6705250