DocumentCode :
2830213
Title :
Text extraction on Windows®-based documents
Author :
Ray, Brian ; Chiang, Chia-Chu ; Melescue, Jim
Author_Institution :
Dept. of Comput. Sci., Arkansas Univ., AR, USA
fYear :
2005
fDate :
16-18 Aug. 2005
Firstpage :
205
Lastpage :
210
Abstract :
Syntel LLC is the developer of a mail presorting application called AutoMail®, which needs to alter bank statements that are being printed. For this and other applications, it is sometimes impossible to exert any control over the document creation software, but changes to the printed documents must nevertheless be made. The purpose of this project is to retrieve data which has been sent to the Microsoft Windows® printing subsystem, parse the data, modify sections of text contained within each document, and continue the print process, leaving the document unmolested except for the altered sections of text. This is done by processing enhanced metafile (EMF) documents, and generating XML documents formatted to be easily read by the software modules responsible for actually altering the text data. During some phase of the print process on Microsoft Windows operating systems, each page will exist as an EMF document. Each EMF document consists of a number of entries describing drawing operations. Those drawing operations which are found to pertain to text output in the important spatial regions of the document are converted to plain text. This text, along with certain formatting and positioning information, is written to the XML file. All other drawing operations are included in the XML file as "black box" entities, so that the document can be repackaged after processing. Repackaging is accomplished by creating new text drawing operations, reinserting the other drawing operations, and using the Windows® API to print the resulting EMF document.
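The extract, alter, and repackage cycle described in the abstract can be illustrated with a minimal sketch. It does not parse real EMF data and is not the authors' implementation: `Record` is a hypothetical, simplified stand-in for an EMF drawing record, and the XML element names (`page`, `text`, `opaque`), the rectangular region test, and the helper functions are assumptions made for illustration only. The constant 84 is the EMF record type for Unicode text output (EMR_EXTTEXTOUTW).

```python
import base64
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional, Tuple

EMR_EXTTEXTOUTW = 84  # EMF record type for Unicode text output


@dataclass
class Record:
    """Hypothetical, simplified stand-in for one EMF drawing record."""
    rtype: int                  # record type number
    raw: bytes                  # original record bytes, kept verbatim
    x: int = 0                  # text reference point (text records only)
    y: int = 0
    text: Optional[str] = None  # decoded string (text records only)


Region = Tuple[int, int, int, int]  # left, top, right, bottom


def in_region(rec: Record, region: Region) -> bool:
    left, top, right, bottom = region
    return left <= rec.x <= right and top <= rec.y <= bottom


def page_to_xml(records: List[Record], region: Region) -> ET.Element:
    """Emit text in the region as plain text; keep everything else opaque."""
    page = ET.Element("page")
    for rec in records:
        is_text = rec.rtype == EMR_EXTTEXTOUTW and rec.text is not None
        if is_text and in_region(rec, region):
            node = ET.SubElement(page, "text", x=str(rec.x), y=str(rec.y))
            node.text = rec.text
        else:
            # "Black box" entry: the bytes are preserved, not interpreted.
            node = ET.SubElement(page, "opaque", type=str(rec.rtype))
            node.text = base64.b64encode(rec.raw).decode("ascii")
    return page


def replace_text(page: ET.Element, old: str, new: str) -> None:
    """Stand-in for the module that alters the extracted text."""
    for node in page.iter("text"):
        if node.text == old:
            node.text = new


def xml_to_records(page: ET.Element) -> List[Record]:
    """Rebuild the record list in its original order for repackaging."""
    out: List[Record] = []
    for node in page:
        if node.tag == "text":
            # A new text drawing operation would be generated from these fields.
            out.append(Record(EMR_EXTTEXTOUTW, b"",
                              int(node.get("x")), int(node.get("y")), node.text))
        else:
            out.append(Record(int(node.get("type")),
                              base64.b64decode(node.text or "")))
    return out
```

Keeping the untouched records as opaque payloads in document order is what lets the page be reassembled without the tool having to understand every drawing operation. In a real system the rebuilt records would still have to be serialized back into EMF and played to the printer through the Windows API, which this sketch omits.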
Keywords :
XML; data structures; operating systems (computers); postal services; text analysis; Microsoft Windows operating system; Microsoft Windows printing subsystem; Windows API; Windows-based documents; XML documents; XML file; data parsing; data retrieval; document processing; enhanced metafile documents; text data alteration; text drawing operation; text extraction; Application software; Computer science; Information retrieval; Operating systems; Postal services; Printers; Printing; Programming; Systems engineering and theory; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems Engineering, 2005. ICSEng 2005. 18th International Conference on
Print_ISBN :
0-7695-2359-5
Type :
conf
DOI :
10.1109/ICSENG.2005.80
Filename :
1562853