Regional Information Center for Science and Technology - Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines

DocumentCode :

1633951

Title :

Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines

Author :

Liu, Ying ; Bai, Kun ; Mitra, Prasenjit ; Giles, C. Lee

Author_Institution :

Coll. of Inf. Sci. & Technol., Penn State Univ., University Park, PA, USA

fYear :

2009

Firstpage :

1006

Lastpage :

1010

Abstract :

With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification, and retrieval. Tables, a ubiquitous document component, are an important information source. Accurately detecting table boundaries is crucial for many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text from PDFs directly is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of the two algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.

Keywords :

data structures; document handling; information retrieval; ubiquitous computing; PDF documents; document structure; sequence error; table boundary detection; text extraction tools; ubiquitous document component; Computer errors; Data mining; Educational institutions; Filtering; HTML; Image analysis; Image converters; Information analysis; Optical character recognition software; Text analysis

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on

Conference_Location :

Barcelona

ISSN :

1520-5363

Print_ISBN :

978-1-4244-4500-4

Electronic_ISBN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2009.138

Filename :

5277535

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1633951