DocumentCode :
2528161
Title :
Chart image understanding and numerical data extraction
Author :
Mishchenko, Ales ; Vassilieva, Natalia
Author_Institution :
CEA (formerly: HP Labs. contractor), Grenoble, France
fYear :
2011
fDate :
26-28 Sept. 2011
Firstpage :
115
Lastpage :
120
Abstract :
Chart images in digital documents are an important source of valuable information that is largely under-utilized for data indexing and information extraction purposes. We developed a framework to automatically extract the data carried by charts and convert them to XML format. The proposed algorithm classifies an image by chart type, detects graphical and textual components, and extracts semantic relations between graphics and text. Classification is performed by a novel model-based method, which was extensively tested against state-of-the-art supervised learning methods and showed high accuracy, comparable to that of the best supervised approaches. The proposed text detection algorithm is applied prior to optical character recognition and leads to a significant improvement in the text recognition rate (up to 20 times better). The analysis of graphical components and their relations to textual cues allows the chart data to be recovered. For testing purposes, a benchmark set was created with the XML/SWF Chart tool. By comparing the recovered data with the original data used for chart generation, we are able to evaluate our information extraction framework and confirm its validity.
Keywords :
XML; image classification; learning (artificial intelligence); optical character recognition; text analysis; XML format; XML/SWF Chart tool; chart data recovering; chart generation; chart image understanding; chart type; data indexing; digital documents; graphical component detection; image classification; information extraction; model-based method; numerical data extraction; optical character recognition; semantic relation extraction; supervised learning method; testing purpose; text detection algorithm; text recognition rate; textual component detection; Accuracy; Data mining; Feature extraction; Image color analysis; Image edge detection; Optical character recognition software; Text recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Digital Information Management (ICDIM), 2011 Sixth International Conference on
Conference_Location :
Melbourne, QLD
ISSN :
Pending
Print_ISBN :
978-1-4577-1538-9
Type :
conf
DOI :
10.1109/ICDIM.2011.6093320
Filename :
6093320