• DocumentCode
    3039238
  • Title

    Topic Modelling Used to Improve Arabic Web Pages Clustering

  • Author

    Alghamdi, Hanan ; Selamat, Ali

  • Author_Institution
    Fac. of Comput., Univ. Teknol. Malaysia, Johor Bahru, Malaysia
  • fYear
    2015
  • fDate
    26-29 April 2015
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    The main purpose of topic modelling is to provide machine-understandable, semantic annotation of the textual content of the Web. It aims to extract knowledge rather than unrelated information. In this paper, we evaluate the impact of using a topic model (which is intended to represent documents as a combination of topics, where each topic is a mix of vectors) on improving document clustering results. We compare the clustering results obtained using PLSA and LSA. The experiments were performed on a set of common newspaper websites that have high-dimensional data, and we use purity, mean intra-cluster distance (MICD), and the Davies-Bouldin index (DBI) for clustering evaluation. We obtained favorable clustering results, especially in the context of the Arabic language, as PLSA was effective in minimizing MICD, increasing purity, and lowering DBI.
  • Keywords
    Internet; knowledge acquisition; natural language processing; pattern clustering; probability; text analysis; Arabic Web page clustering; DBI; Davies-Bouldin index; LDA; MICD; PLSA; Web textual content; knowledge extraction; latent Dirichlet allocation; mean intracluster distance; probabilistic latent semantic analysis; semantic annotation; topic modelling; Clustering algorithms; Computational modeling; Indexes; Matrix decomposition; Probabilistic logic; Semantics; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing (ICCC), 2015 International Conference on
  • Conference_Location
    Riyadh
  • Print_ISBN
    978-1-4673-6617-5
  • Type

    conf

  • DOI
    10.1109/CLOUDCOMP.2015.7149662
  • Filename
    7149662