• DocumentCode
    615372
  • Title

    Two-stage Web Record Extraction

  • Author

    Qing Yang ; Chunxia Zhang ; Zhendong Niu

  • Author_Institution
    Sch. of Comput. Sci., Beijing Inst. of Technol., Beijing, China
  • fYear
    2013
  • fDate
    26-28 April 2013
  • Firstpage
    783
  • Lastpage
    788
  • Abstract
    Extracting structured data from the Web is a challenging subtask of information extraction. On the Web, structured data are usually presented as lists, records, or tables. Existing methods must identify the boundaries of data regions before separating them into records. Because records do not always have the same number of items or occur in consecutive sections, these methods often fail on complicated or noisy pages. In this paper, we propose a fully automatic method called Two-Stage Web Record Extraction (TSWRE) to extract records from an open-domain corpus. This approach uses a bottom-up analysis that starts with visually similar attribute sequences. It first identifies attribute sequences based on distinct tag paths of the ordered DOM tree of the document. It then exploits the position-interleaving characteristics of the attribute sequences to estimate how likely the sequences are to belong to the same records. Empirical experiments show that our method achieves promising performance compared to existing methods and is scalable to a large corpus.
  • Keywords
    Internet; information retrieval; text analysis; trees (mathematics); bottom-up analysis; data region; distinct tag paths; information extraction; open domain corpus; ordered DOM tree; position interleave characteristics; structured data extraction; two-stage Web record extraction; visually similar attribute sequences; Abstracts; Art; HTML; Road transportation; Information extraction; data record extraction; multiple sequence alignment;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science & Education (ICCSE), 2013 8th International Conference on
  • Conference_Location
    Colombo
  • Print_ISBN
    978-1-4673-4464-7
  • Type

    conf

  • DOI
    10.1109/ICCSE.2013.6554015
  • Filename
    6554015
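
The two stages named in the abstract (grouping text by distinct tag path, then checking how the resulting sequences interleave by position) can be illustrated with a minimal Python sketch. This is not the authors' TSWRE implementation: the names `TagPathCollector`, `attribute_sequences`, and `interleave_score`, the list of void tags, and the switch-counting score are all assumptions made here for illustration, and the abstract does not specify how the paper measures interleaving.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags without a closing tag; skipped so they do not pollute the path stack.
# (Illustrative subset, not an exhaustive list.)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Stage 1 (assumed form): group text nodes by their root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, root first
        self.position = 0    # document-order index of the next text node
        self.sequences = defaultdict(list)  # tag path -> [(position, text)]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.sequences[path].append((self.position, text))
            self.position += 1


def attribute_sequences(html):
    """Return {tag_path: [(position, text), ...]} for one HTML document."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.sequences)


def interleave_score(positions_a, positions_b):
    """Stage 2 (hypothetical measure): fraction of adjacent pairs, in merged
    document order, that switch from one sequence to the other.
    1.0 means strict alternation; values near 0 mean two separate blocks."""
    merged = sorted([(p, "a") for p in positions_a] +
                    [(p, "b") for p in positions_b])
    if len(merged) < 2:
        return 0.0
    switches = sum(1 for x, y in zip(merged, merged[1:]) if x[1] != y[1])
    return switches / (len(merged) - 1)


if __name__ == "__main__":
    page = ("<ul><li><b>Pen</b><i>$1</i></li>"
            "<li><b>Ink</b><i>$2</i></li>"
            "<li><b>Pad</b><i>$3</i></li></ul>")
    seqs = attribute_sequences(page)
    names = [pos for pos, _ in seqs["ul/li/b"]]
    prices = [pos for pos, _ in seqs["ul/li/i"]]
    print(interleave_score(names, prices))
```

In this toy page the name and price sequences alternate strictly, so a high score suggests they are attributes of the same records; two sequences that occupy separate regions of the page would score low under the same measure.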