DocumentCode :
1993517
Title :
Document transformation system from papers to XML data based on pivot XML document method
Author :
Ishitani, Yasuto
Author_Institution :
Corporate R&D Center, Toshiba Corp., Kawasaki, Japan
fYear :
2003
fDate :
3-6 Aug. 2003
Firstpage :
250
Abstract :
This paper proposes a new method for document transformation that uses OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. First, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Second, the hierarchical structure of the document elements is extracted and described as a DOM tree. Third, an XML parser converts this document structure into a pivot XML document, expressed as XHTML. Finally, the XML parser transforms the pivot XML document into the target XML document using XSLT scripts or specific programs. Experimental results show that the method is effective in transforming printed documents into various XML documents.
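The last two stages described in the abstract (structure to pivot document, pivot document to target XML) can be illustrated with a minimal sketch. It is not the paper's implementation: the `extracted` list stands in for hypothetical OCR and layout-analysis output, the pivot is a simplified XHTML-like tree without namespaces, the target `article` vocabulary is invented for illustration, and a plain Python function plays the role of a "specific program" instead of an XSLT script.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the element-extraction stage: (element type, text) pairs.
extracted = [
    ("title", "Sample Paper"),
    ("authors", "A. Author"),
    ("heading", "1. Introduction"),
    ("paragraph", "First paragraph."),
    ("paragraph", "Second paragraph."),
    ("heading", "2. Method"),
    ("paragraph", "Third paragraph."),
]


def build_pivot(elements):
    """Build a simplified XHTML-like pivot tree; each heading opens a new section div."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    section = body  # paragraphs before any heading attach directly to body
    for kind, text in elements:
        if kind == "title":
            ET.SubElement(body, "h1").text = text
        elif kind == "authors":
            ET.SubElement(body, "p", {"class": "authors"}).text = text
        elif kind == "heading":
            section = ET.SubElement(body, "div", {"class": "section"})
            ET.SubElement(section, "h2").text = text
        else:
            ET.SubElement(section, "p").text = text
    return html


def pivot_to_target(pivot):
    """Map the pivot tree to an invented 'article' vocabulary (assumed target format)."""
    body = pivot.find("body")
    article = ET.Element("article")
    ET.SubElement(article, "title").text = body.findtext("h1")
    ET.SubElement(article, "author").text = body.findtext("p[@class='authors']")
    for div in body.findall("div[@class='section']"):
        sec = ET.SubElement(article, "section", {"name": div.findtext("h2")})
        for p in div.findall("p"):
            ET.SubElement(sec, "para").text = p.text
    return article


pivot = build_pivot(extracted)
target = pivot_to_target(pivot)
print(ET.tostring(target, encoding="unicode"))
```

Because every target format is derived from the same pivot tree, supporting another output vocabulary in this sketch would only require another function like `pivot_to_target`, leaving the extraction and pivot-building steps unchanged.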
Keywords :
XML; document image processing; grammars; optical character recognition; tree data structures; DOM tree; OCR; XHTML; XML data; XML parser; XSLT script; document image; document structure; document transformation system; pivot XML document method; Data conversion; Electronic commerce; Electronic government; Image analysis; Image converters; Optical character recognition software; Technology management; Text analysis
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
Print_ISBN :
0-7695-1960-1
Type :
conf
DOI :
10.1109/ICDAR.2003.1227668
Filename :
1227668