• DocumentCode
    3020386
  • Title

    Document ranking by layout relevance

  • Author

    Huang, May ; DeMenthon, Daniel ; Doermann, David ; Golebiowski, Lynn

  • Author_Institution
    Language & Media Process. Lab., Maryland Univ., College Park, MD, USA ; Booz Allen Hamilton
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    362
  • Abstract
    This paper describes the development of a new document ranking system based on layout similarity. The user has a need represented by a set of "wanted" documents, and the system ranks documents in the collection according to this need. Rather than performing complete document analysis, the system extracts text lines, and models layouts as relationships between pairs of these lines. This paper explores three novel feature sets to support scoring in large document collections. First, pairs of lines are used to form quadrilaterals, which are represented by their turning functions. A non-Euclidean distance is used to measure similarity. Second, the quadrilaterals are represented by 5D Euclidean vectors, and third, each line is represented by a 5D Euclidean vector. We compare the classification performance and computation speed of these three feature sets using a large database of diverse documents including forms, academic papers and handwritten pages in English and Arabic. The approach using quadrilaterals and turning functions produces slightly better results, but the approach using vectors to represent text lines is much faster for large document databases.
  • Keywords
    computational geometry; document handling; handwritten character recognition; natural languages; very large databases; 5D Euclidean vector; document ranking system; handwritten pages; large document databases; layout relevance; non-Euclidean distance; quadrilaterals; text lines; turning functions; Business; Educational institutions; Image retrieval; Laboratories; Lamps; Optical character recognition software; Performance analysis; Spatial databases; Text analysis; Turning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type

    conf

  • DOI
    10.1109/ICDAR.2005.92
  • Filename
    1575570
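
The abstract mentions representing quadrilaterals by their turning functions and comparing them with a non-Euclidean distance. The sketch below illustrates the general idea only and is not the authors' implementation: the function names `turning_function`, `value_at` and `turning_distance` are hypothetical, the quadrilateral is assumed to be given as four (x, y) vertices, and the distance is a simplified L2 comparison with a fixed starting vertex (the full turning-function metric would also minimize over starting points and rotations).

```python
import bisect
import math


def turning_function(vertices):
    """Return (breakpoints, angles) for a closed polygon.

    breakpoints[i] is the normalized arc length at which edge i starts;
    angles[i] is the cumulative heading (radians) while walking edge i.
    """
    n = len(vertices)
    edges = [
        (vertices[(i + 1) % n][0] - vertices[i][0],
         vertices[(i + 1) % n][1] - vertices[i][1])
        for i in range(n)
    ]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    perimeter = sum(lengths)
    headings = [math.atan2(dy, dx) for dx, dy in edges]

    angles = [headings[0]]
    for prev, cur in zip(headings, headings[1:]):
        # Signed turn between consecutive edges, wrapped to [-pi, pi).
        turn = (cur - prev + math.pi) % (2 * math.pi) - math.pi
        angles.append(angles[-1] + turn)

    breakpoints = [0.0]
    for length in lengths[:-1]:
        breakpoints.append(breakpoints[-1] + length / perimeter)
    return breakpoints, angles


def value_at(breakpoints, angles, s):
    """Evaluate the step function at normalized arc length s in [0, 1)."""
    return angles[bisect.bisect_right(breakpoints, s) - 1]


def turning_distance(poly_a, poly_b):
    """L2 distance between two turning functions (simplified: no
    minimization over starting vertex or rotation)."""
    bp_a, ang_a = turning_function(poly_a)
    bp_b, ang_b = turning_function(poly_b)
    cuts = sorted(set(bp_a + bp_b + [1.0]))
    total = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2.0
        diff = value_at(bp_a, ang_a, mid) - value_at(bp_b, ang_b, mid)
        total += diff * diff * (hi - lo)
    return math.sqrt(total)


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
big_square = [(5, 5), (7, 5), (7, 7), (5, 7)]
rectangle = [(0, 0), (2, 0), (2, 1), (0, 1)]

print(turning_distance(square, big_square))
print(turning_distance(square, rectangle))
```

Because arc length is normalized by the perimeter and only edge directions enter the function, the comparison in this sketch is unaffected by translation and uniform scaling, which is why the two squares compare as identical while the rectangle does not.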