• DocumentCode
    670245
  • Title

    FICWAN: frequent itemset clustering of web articles by analyzing the article neighborhood

  • Author

    Kucecka, Tomas ; Chuda, Daniela ; Sladecek, Peter

  • Author_Institution
    Fac. of Inf. & Inf. Technol., Slovak Univ. of Technol., Bratislava, Slovakia
  • fYear
    2013
  • fDate
    19-21 Nov. 2013
  • Firstpage
    509
  • Lastpage
    514
  • Abstract
    Document clustering is the process of organizing text data into clusters, where a cluster usually represents a group of topic-related documents. The most effective text clustering approaches are based on frequent itemsets. A popular algorithm that uses this approach is FIHC (Frequent Itemset-based Hierarchical Clustering). In recent years, many modifications have been made to this algorithm. In this paper we focus on clustering web articles, which represent a special type of text data: they contain hyperlinks through which they are linked with other articles on the web. We propose FICWAN, an algorithm that modifies FIHC and is especially suited for web data. We show that by considering the neighborhood of a web article, its HTML tags, and its CSS, we are able to significantly improve the quality of the created clusters. We experimented with our approach on several corpora, and it clearly outperformed FIHC.
  • Keywords
    Internet; data mining; information retrieval; pattern clustering; text analysis; CSS; FICWAN frequent itemset clustering; FIHC; HTML tags; Web articles clustering; article neighborhood; document clustering; frequent itemset-based hierarchical clustering; hyperlinks; topic related document; Cascading style sheets; Clustering algorithms; HTML; Informatics; Itemsets; Partitioning algorithms; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Informatics (CINTI), 2013 IEEE 14th International Symposium on
  • Conference_Location
    Budapest
  • Print_ISBN
    978-1-4799-0194-4
  • Type

    conf

  • DOI
    10.1109/CINTI.2013.6705250
  • Filename
    6705250