• DocumentCode
    545455
  • Title

    Mining contents in Web page using cosine similarity

  • Author

    Nyein, Swe Swe

  • Author_Institution
    Univ. of Comput. Studies, Mandalay, Myanmar
  • Volume
    2
  • fYear
    2011
  • fDate
    11-13 March 2011
  • Firstpage
    472
  • Lastpage
    475
  • Abstract
    Web pages typically contain a large amount of information that is not part of the main content of the page, e.g., banner ads, navigation bars, copyright and privacy notices, and advertisements that are unrelated to the main content (the relevant information). In this paper, an algorithm is proposed that extracts the main content from web documents. The algorithm is based on a Content Structure Tree (CST). First, the proposed system uses an HTML parser to construct a DOM (Document Object Model) tree, from which it builds a CST that can easily separate the main content blocks from the other blocks. The proposed system then introduces a cosine similarity measure to evaluate which parts of the CST represent the less important content of the page and which parts represent the more important content. The proposed system can rank the documents by their similarity values and extracts the top-ranked documents as the most relevant to the query.
  • Keywords
    Internet; data mining; hypermedia markup languages; tree data structures; CST; DOM; HTML parser; content structure tree; copy right notices; cosine similarity; document object model; navigation bars; privacy notices; web documents; web page mining contents; Algorithm design and analysis; Data mining; HTML; Image segmentation; Semantics; Visualization; Web pages; CST tree; Cosine Similarity; DOM tree;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Research and Development (ICCRD), 2011 3rd International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-61284-839-6
  • Type

    conf

  • DOI
    10.1109/ICCRD.2011.5764177
  • Filename
    5764177
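
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the CST construction is not shown, the content blocks are assumed to be already extracted as plain strings, and the helper names (`term_vector`, `cosine_similarity`, `rank_blocks`), the tokenization, and the raw term-frequency weighting are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (raw term frequencies).

    Assumption: the abstract does not say how terms are weighted.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector shares no direction with anything
    return dot / (norm_a * norm_b)


def rank_blocks(query, blocks):
    """Return (score, block) pairs sorted from most to least similar to the query."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(b)), b) for b in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical content blocks standing in for the text of CST nodes.
    blocks = [
        "Home About Contact Login",
        "Cosine similarity ranks web content blocks",
        "Copyright 2011 privacy notice",
    ]
    for score, block in rank_blocks("cosine similarity web content", blocks):
        print(f"{score:.3f}  {block}")
```

Blocks that share no terms with the query score 0.0, so navigation and notice text of this kind falls to the bottom of the ranking while the block that overlaps with the query rises to the top.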