DocumentCode
2061239
Title
An interactive system to extract structured text from a geometrical representation
Author
Poirier, Benoit ; Dagenais, Michel
Author_Institution
Dept. de Genie Electr., Ecole Polytech. de Montreal, Que., Canada
Volume
1
fYear
1997
fDate
18-20 Aug 1997
Firstpage
342
Abstract
The proliferation of electronic document formats impedes the dissemination and management of documents. Indeed, a common format with structural information is required to support document indexing and navigation. While in some formats it is easy to decode and preserve the document structure information, often the only easily obtainable representation is PostScript, where only the geometrical information remains. Even if an organization is willing to convert all its document-producing activities to a structure-preserving format such as HTML, the existing documents still need to be converted. The paper addresses the difficult problem of extracting the structure of a document from a geometrical representation. An interactive tool to extract the document content and structure from a geometric representation (PostScript) has been developed. It successfully analyzes several documents produced with different tools, and produces structural information using the HyperText Markup Language (HTML). The end user, when presented with the extracted document structure, can interactively modify it if needed. The tool is easily extended to recognize new constructs and is aimed at organizations needing to convert numerous documents for searching and browsing on intranets or on the Internet.
Keywords
Internet; document image processing; hypermedia; information retrieval; interactive systems; page description languages; word processing; HTML; HyperText Markup Language; PostScript; common format; document content extraction; document indexing; document structure information; electronic document formats; extracted document structure; geometrical information; geometrical representation; interactive system; interactive tool; intranets; structural information; structure preserving format; structured text extraction; Data mining; Decoding; Impedance; Indexing; Information analysis; Markup languages; Navigation
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
Conference_Location
Ulm
Print_ISBN
0-8186-7898-4
Type
conf
DOI
10.1109/ICDAR.1997.619868
Filename
619868