DocumentCode
670245
Title
FICWAN: frequent itemset clustering of web articles by analyzing the article neighborhood
Author
Kucecka, Tomas ; Chuda, Daniela ; Sladecek, Peter
Author_Institution
Fac. of Inf. & Inf. Technol., Slovak Univ. of Technol., Bratislava, Slovakia
fYear
2013
fDate
19-21 Nov. 2013
Firstpage
509
Lastpage
514
Abstract
Document clustering is the process of organizing text data into clusters, where a cluster usually represents a group of topic-related documents. Most effective text clustering approaches are based on frequent itemsets. A popular algorithm that uses this approach is FIHC (Frequent Itemset-based Hierarchical Clustering). In recent years, many modifications have been made to this algorithm. In this paper we focus on clustering web articles, which represent a special type of text data: they contain hyperlinks through which they are linked with other articles on the web. We propose FICWAN, a modification of FIHC that is especially suited for web data. We show that by considering the neighborhood of a web article, together with its HTML tags and CSS, we are able to significantly improve the quality of the created clusters. We experimented with our approach on several corpora, and FICWAN clearly outperformed FIHC.
Keywords
Internet; data mining; information retrieval; pattern clustering; text analysis; CSS; FICWAN frequent itemset clustering; FIHC; HTML tags; Web articles clustering; article neighborhood; document clustering; frequent itemset-based hierarchical clustering; hyperlinks; topic related document; Cascading style sheets; Clustering algorithms; HTML; Informatics; Itemsets; Partitioning algorithms; Web pages;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence and Informatics (CINTI), 2013 IEEE 14th International Symposium on
Conference_Location
Budapest
Print_ISBN
978-1-4799-0194-4
Type
conf
DOI
10.1109/CINTI.2013.6705250
Filename
6705250