• DocumentCode
    1637313
  • Title
    Scalable Feature Extraction from Noisy Documents
  • Author
    Lecerf, Loïc; Chidlovskii, Boris
  • Author_Institution
    Xerox Res. Centre Eur., Meylan, France
  • fYear
    2009
  • Firstpage
    361
  • Lastpage
    365
  • Abstract
    We address metadata recognition in layout-oriented documents. We treat the problem as a classification task and propose a method for the automatic extraction of relevant features in the presence of content and structural noise caused by scanning, OCR, and segmentation problems. The method is based on automatic analysis of the documents and requires no particular preprocessing. It mines the documents and determines frequent patterns, which are both literal patterns and their generalizations. We also propose a sampling technique that processes a sample of the documents and uses Chernoff bounds to estimate pattern frequency in the entire dataset. As the number of frequent patterns serving as feature candidates grows, the method applies a scalable feature selection step to determine the features most relevant to a given classification task. A series of evaluations on two collections shows that the method performs comparably to rules written manually by domain experts.
  • Keywords
    data mining; document image processing; feature extraction; frequency estimation; image classification; image sampling; image segmentation; optical character recognition; Chernoff bound; OCR; classification task; content noise; layout-oriented document; literal pattern; metadata recognition; noisy document mining; pattern frequency estimation; sampling technique; scalable feature extraction; structural noise; Europe; Optical character recognition software; Performance analysis; Performance evaluation; Sampling methods; Text analysis; Writing; feature selection; noisy data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4244-4500-4
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2009.227
  • Filename
    5277667