• DocumentCode
    2369909
  • Title

    On precision and recall of multi-attribute data extraction from semistructured sources

  • Author

    Yang, Guizhen ; Mukherjee, Saikat ; Ramakrishnan, I.V.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of Buffalo, NY, USA
  • fYear
    2003
  • fDate
    19-22 Nov. 2003
  • Firstpage
    395
  • Lastpage
    402
  • Abstract
    Machine learning techniques for data extraction from semistructured sources exhibit different precision and recall characteristics. However, to date, the formal relationship between learning algorithms and their impact on these two metrics remains unexplored. We propose a formalization of precision and recall of extraction and investigate the complexity-theoretic aspects of learning algorithms for multiattribute data extraction based on this formalism. We show that there is a tradeoff between precision/recall of extraction and computational efficiency, and present experimental results to demonstrate the practical utility of these concepts in designing scalable data extraction algorithms that improve recall without compromising precision.
  • Keywords
    Internet; computational complexity; data mining; learning (artificial intelligence); complexity-theoretic aspects; machine learning algorithms; multiattribute data extraction; semistructured sources; Animals; Computational efficiency; Computer science; Data engineering; Hospitals; Labeling; Machine learning; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
  • Print_ISBN
    0-7695-1978-4
  • Type

    conf

  • DOI
    10.1109/ICDM.2003.1250945
  • Filename
    1250945