• DocumentCode
    2298294
  • Title
    Data Extraction Based on Index Path in Web
  • Author
    Gao, Ya ; Yuan, Fang ; Zhang, Ming
  • Author_Institution
    Key Lab. in Machine Learning & Comput. Intelligence, Hebei Univ., Baoding, China
  • Volume
    3
  • fYear
    2010
  • fDate
    6-7 March 2010
  • Firstpage
    157
  • Lastpage
    160
  • Abstract
    Web data extraction aims to obtain the information that users want from Web pages. To extract valuable data more accurately, this paper proposes a new method called data extraction based on index path in Web (DEIP). The approach establishes an index path for each text node using the XML DOM, defines the data-rich prefix by keywords in the index path, generates an extraction rule, and obtains a wrapper accordingly. The wrapper can automatically extract data in the same domain from a Website. The method depends on the continuity, the structural similarity, and the location relations of the useful information in Web pages, not on the HTML tags. Experiments indicate that the method performs well in both recall and precision of data extraction.
  • Keywords
    Internet; XML; information retrieval; HTML tag; Web pages; Web site; XML DOM; data extraction; extraction rule; index path; structural similarity; Computer science; Data mining; Databases; HTML; Search engines; Web page design; Web search; DOM
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Education Technology and Computer Science (ETCS), 2010 Second International Workshop on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-1-4244-6388-6
  • Electronic_ISBN
    978-1-4244-6389-3
  • Type
    conf
  • DOI
    10.1109/ETCS.2010.291
  • Filename
    5459747
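
The abstract describes building an index path for each text node with the XML DOM and then selecting data by a path prefix. The record does not define the index path precisely, so the following is only a minimal sketch under an assumption: the index path is taken to be the sequence of child positions from the document root down to the text node. The function names `index_paths` and `extract_by_prefix`, the sample page, and the hand-picked prefix are hypothetical illustrations, not the authors' DEIP implementation, which derives the prefix from keywords.

```python
from xml.dom import minidom


def index_paths(xml_text):
    """Return (index_path, text) pairs for every non-empty text node.

    Assumption: an index path is the tuple of child positions leading
    from the root element (position 0) to the text node.
    """
    doc = minidom.parseString(xml_text)
    results = []

    def walk(node, path):
        if node.nodeType == node.TEXT_NODE:
            text = node.data.strip()
            if text:  # skip whitespace-only nodes
                results.append((tuple(path), text))
            return
        for i, child in enumerate(node.childNodes):
            walk(child, path + [i])

    walk(doc.documentElement, [0])
    return results


def extract_by_prefix(pairs, prefix):
    """Keep the text of nodes whose index path starts with `prefix`."""
    prefix = tuple(prefix)
    return [text for path, text in pairs if path[:len(prefix)] == prefix]


# Hypothetical well-formed page: a navigation block and a data-rich list.
PAGE = (
    "<html><body>"
    "<div><p>Home</p></div>"
    "<div><ul><li>Title A</li><li>Title B</li></ul></div>"
    "</body></html>"
)

pairs = index_paths(PAGE)
# Prefix chosen by hand here; it points at the <ul> in the second <div>.
records = extract_by_prefix(pairs, (0, 0, 1, 0))
```

In this sketch the two list items share the prefix `(0, 0, 1, 0)` while the navigation text does not, which illustrates how a rule based on node position rather than tag names can separate data-rich content from the rest of the page.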