Title :
A block segmentation method for document images with complicated column structures
Author_Institution :
IBM Japan Ltd., Yamato city, Kanagawa, Japan
Abstract :
This paper presents a novel block segmentation method for document images that can be applied to various document formats. Some documents have complicated column structures, in which some figure areas have no surrounding rectangles and others cut across text areas. To segment such documents into text and figure areas, the proposed approach analyzes the text areas first and then detects the figure areas from the information obtained about the text areas. The overall process is as follows. First, character strings are merged into text groups by analyzing regularity in the text areas. Next, border lines of columns are detected by linking the edges of the text groups. The whole page is then segmented into small blocks according to the border lines. The blocks are then unified by using the column information, and some unified blocks are detected. Finally, a projection profile method is applied to the unified blocks to detect text areas and figure areas. The method was applied to 61 pages of Japanese technical papers and magazines, and 93.3% of the text areas and 93.2% of the figure areas were detected correctly.
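The abstract names a projection profile method for the final step but gives no details. As a rough illustration only, and not the authors' implementation, the sketch below assumes a unified block is a binary raster (a list of rows, 1 = ink, 0 = background) and uses a hypothetical heuristic: a block looks like text when its horizontal profile splits into several ink runs of similar height. The function names, the thresholds, and the classification rule are all assumptions made for this example.

```python
from typing import List, Tuple

Block = List[List[int]]  # assumed representation: rows of 0/1 pixels


def horizontal_profile(block: Block) -> List[int]:
    """Count ink pixels (value 1) in each row of a binary block."""
    return [sum(row) for row in block]


def runs_above(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of consecutive
    profile entries that are greater than the threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def looks_like_text(block: Block, max_height_spread: int = 2,
                    min_lines: int = 2) -> bool:
    """Hypothetical rule (not from the paper): treat the block as text
    if it contains at least min_lines ink runs whose heights differ by
    no more than max_height_spread rows."""
    runs = runs_above(horizontal_profile(block))
    if len(runs) < min_lines:
        return False
    heights = [end - start for start, end in runs]
    return max(heights) - min(heights) <= max_height_spread
```

Under these assumptions, a block with regularly spaced ink rows separated by blank rows is classified as text, while a block that is inked on every row produces a single run and is treated as a figure candidate.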
Keywords :
document image processing; image segmentation; merging; Japanese magazines; Japanese technical papers; block segmentation method; block unification; border lines; character strings; complicated column structures; document formats; document images; figure areas; page segmentation; projection profile method; regularity; text areas; text groups; Cities and towns; Databases; Image analysis; Image edge detection; Information analysis; Joining processes; Laboratories; Publishing; Text recognition
Conference_Title :
Proceedings of the Second International Conference on Document Analysis and Recognition, 1993
Conference_Location :
Tsukuba Science City
Print_ISBN :
0-8186-4960-7
DOI :
10.1109/ICDAR.1993.395775