FICWAN: frequent itemset clustering of web articles by analyzing the article neighborhood

Author

Kucecka, Tomas ; Chuda, Daniela ; Sladecek, Peter

Author_Institution

Fac. of Inf. & Inf. Technol., Slovak Univ. of Technol., Bratislava, Slovakia

fYear

2013

fDate

19-21 Nov. 2013

Firstpage

509

Lastpage

514

Abstract

Document clustering is the process of organizing text data into clusters, where a cluster usually represents a group of topic-related documents. The most effective text clustering approaches are based on frequent itemsets. A popular algorithm that uses this approach is FIHC (Frequent Itemset-based Hierarchical Clustering), and many modifications of it have been proposed in recent years. In this paper we focus on clustering web articles, a special type of text data: they contain hyperlinks that connect them to other articles on the web. We propose FICWAN, a modification of FIHC that is especially suited to web data. We show that by considering the neighborhood of a web article, together with its HTML tags and CSS, we are able to significantly improve the quality of the created clusters. We experimented with our approach on several corpora, and its results clearly outperformed those of FIHC.

Keywords

Internet; data mining; information retrieval; pattern clustering; text analysis; CSS; FICWAN frequent itemset clustering; FIHC; HTML tags; Web articles clustering; article neighborhood; document clustering; frequent itemset-based hierarchical clustering; hyperlinks; topic related document; Cascading style sheets; Clustering algorithms; HTML; Informatics; Itemsets; Partitioning algorithms; Web pages

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence and Informatics (CINTI), 2013 IEEE 14th International Symposium on

Conference_Location

Budapest

Print_ISBN

978-1-4799-0194-4

Type

conf

DOI

10.1109/CINTI.2013.6705250

Filename

6705250