• DocumentCode
    2903333
  • Title

    DTM - Extracting Data Records from Search Engine Results Page Using Tree Matching Algorithm

  • Author

    Hong, Jer Lang ; Siew, Eugene ; Egerton, Simon

  • Author_Institution
    Monash Univ., Darul Ehsan, Malaysia
  • fYear
    2009
  • fDate
    4-7 Dec. 2009
  • Firstpage
    149
  • Lastpage
    154
  • Abstract
    In this paper, we develop a non-visual automatic wrapper for extracting data records from search engine results pages. The novel techniques in our wrapper are (1) filtering rules to detect and filter out irrelevant data records, (2) a tree matching algorithm using frequency measures to increase the speed of data extraction, and (3) an algorithm to calculate the number and size of the components of data records to detect the correct data region. Results show that our wrapper is as robust as, and in many cases outperforms, state-of-the-art wrappers such as ViNT and DEPTA. This wrapper could have significant speed advantages when processing large volumes of website data, which could be helpful in meta search engine development.
  • Keywords
    Web sites; information filtering; search engines; DEPTA; ViNT; data record extraction; dummy tree matching; filtering rules; meta search engine development; nonvisual automatic wrapper; search engine results page; web sites data; Data mining; Filtering algorithms; Frequency measurement; Matched filters; Metasearch; Pattern matching; Pattern recognition; Search engines; Tree data structures; Velocity measurement; Information Extraction; Search Engine; Wrapper Generation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Soft Computing and Pattern Recognition, 2009. SOCPAR '09. International Conference of
  • Conference_Location
    Malacca
  • Print_ISBN
    978-1-4244-5330-6
  • Electronic_ISBN
    978-0-7695-3879-2
  • Type

    conf

  • DOI
    10.1109/SoCPaR.2009.40
  • Filename
    5368635