• DocumentCode
    2196810
  • Title

    Layout and language: preliminary investigations in recognizing the structure of tables

  • Author

    Hurst, Matthew ; Douglas, Shona

  • Author_Institution
    Language Technol. Group, Edinburgh Univ., UK
  • Volume
    2
  • fYear
    1997
  • fDate
    18-20 Aug 1997
  • Firstpage
    1043
  • Abstract
    Describes a prototype system for assigning table cells to their proper place in the logical structure of the table, based on a simple model of table structure combined with a number of measures of cohesion between cells. A framework is presented for examining the effect of particular variables on the performance of the system, and preliminary results are presented showing the effect of cohesion measures based on the simplest domain-independent analyses, with the aim of allowing future comparison with more knowledge-intensive analyses based on natural language processing. These baseline results suggest that very simple string-based cohesion measures are not sufficient to support the extraction of tuples as we require. Future work will pursue the aim of more adequate approximations to a notional subtype/supertype definition of the relationship between value cells and label cells.
  • Keywords
    document handling; natural languages; pattern recognition; spreadsheet programs; cell cohesion measures; domain-independent analyses; knowledge-intensive analyses; natural language processing; string-based measures; system performance; table cell assignment; table layout; table logical structure recognition; tuple extraction; Communications technology; Data mining; HTML; Humans; Information retrieval; Natural language processing; Particle measurements; Performance analysis; Prototypes; SGML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
  • Conference_Location
    Ulm
  • Print_ISBN
    0-8186-7898-4
  • Type

    conf

  • DOI
    10.1109/ICDAR.1997.620668
  • Filename
    620668