• DocumentCode
    3703548
  • Title

    A text block context informations based multiple Web contents extraction

  • Author

    Wonmoon Song;Myungwon Kim

  • Author_Institution
    Strategic Business Team, ONYCOM, Seoul, Republic of Korea
  • fYear
    2015
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    In the Web environment, it has become important to quickly and accurately extract contents such as main-content, menu-list, article-list, and comments from Web documents in order to provide Web services suited to users' needs. In this paper, we propose an efficient method that extracts various contents from Web documents. In the method, text blocks are separated from the document, and context information is extracted and used to classify the content type of each text block. Context information consists of documenting patterns and structural features of a Web document. For documenting patterns, we use in/out link information, which extends the word/link density proposed in previous work. For structural features, distances between text blocks and the parent tags of the target text block are used. We evaluated our method on a published data set and a data set that we collected. The experimental results show that our method performs about 17 percentage points better in accuracy for multiple-contents extraction and about 14 percentage points better in F-measure for main-content extraction than existing methods.
  • Keywords
    "Feature extraction","HTML","Context","Visualization","Data mining","Standards","XML"
  • Publisher
    ieee
  • Conference_Titel
    Data Science and Advanced Analytics (DSAA), 2015 IEEE International Conference on
  • Print_ISBN
    978-1-4673-8272-4
  • Type

    conf

  • DOI
    10.1109/DSAA.2015.7344829
  • Filename
    7344829