• DocumentCode
    2226728
  • Title
    A hybrid method for Web data extraction
  • Author
    Wang, Yu; Zhou, Lizhu
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2003
  • fDate
    13-17 Oct. 2003
  • Firstpage
    417
  • Lastpage
    420
  • Abstract
    Web data extraction refers to technology that helps people find the information they want on the Web. We first classify existing data extraction algorithms into two classes, top-down and bottom-up, and then analyze their strengths and weaknesses in terms of extraction accuracy. On the basis of this analysis, we present a hybrid algorithm, bi-direction data extraction (BiDDE for short), which combines the strengths of both top-down and bottom-up algorithms while avoiding their weaknesses. The experimental results show that BiDDE not only achieves higher accuracy than either the top-down or the bottom-up algorithm, but also delivers satisfactory performance.
  • Keywords
    Internet; hypermedia markup languages; information retrieval; tree searching; HTML documents; Web data extraction; bi-direction data extraction algorithm; bottom-up algorithms; top-down algorithms; Algorithm design and analysis; Bidirectional control; Computer science; Data mining; Databases; HTML; Particle separators; Web pages; XML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on
  • Print_ISBN
    0-7695-1932-6
  • Type
    conf
  • DOI
    10.1109/WI.2003.1241229
  • Filename
    1241229