DocumentCode
595010
Title
Learning the characteristics of critical cells from web tables
Author
Nagy, G.
Author_Institution
Rensselaer Polytech. Inst., Troy, NY, USA
fYear
2012
fDate
11-15 Nov. 2012
Firstpage
1554
Lastpage
1557
Abstract
Critical Cells (CCs) are identified to partition a web table into mutually exclusive regions of stub, column header, row header, data, and neutral cells. Every table cell (including titles and footnotes outside the table proper but usually within the HTML table tags) is classified into one of six classes based on cell features extracted from the target cell and its eight neighbors. Changing the domain of maximization over posteriors results in the assignment of exactly four CCs to each table. The average number of interactions required for error-free table data extraction can be reduced by more than 75% by alternating between graphic interaction and auto-assignment.
Keywords
Internet; feature extraction; learning (artificial intelligence); CC; Web tables; auto-assignment; cell-feature extraction; column header; critical cell characteristics; error-free table data extraction; graphic interaction; learning; neutral cells; row header; stub; Algorithm design and analysis; Data mining; Feature extraction; HTML; Training; Visualization
fLanguage
English
Publisher
IEEE
Conference_Titel
Pattern Recognition (ICPR), 2012 21st International Conference on
Conference_Location
Tsukuba
ISSN
1051-4651
Print_ISBN
978-1-4673-2216-4
Type
conf
Filename
6460440