DocumentCode :
595013
Title :
A graph-based method of newspaper article reconstruction
Author :
Liangcai Gao ; Zhi Tang ; Xiaoyan Lin ; Yongtao Wang
Author_Institution :
Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
fYear :
2012
fDate :
11-15 Nov. 2012
Firstpage :
1566
Lastpage :
1569
Abstract :
The primary information units in a newspaper are its articles. Article reconstruction from newspapers, which comprises article aggregation and reading order recovery, is known to be a challenging task because of the complexity of multi-article page layouts. In this paper, we propose a novel approach to article reconstruction using a bipartite graph framework, which models the complex relationships between text blocks as one-to-one correspondences and accomplishes the task by finding the optimal matching on this graph. During the optimization process, various information sources, including geometric layout and linguistic and semantic content, are mined in the bipartite graph model to handle the wide range of complex newspaper layouts. Moreover, unlike existing methods, we perform the two sub-tasks of article reconstruction in reverse order: we first detect the reading order of the text blocks and then use that reading order to aggregate the blocks belonging to the same article. Experimental results on 3,312 newspaper pages containing 23,184 articles demonstrate that our method outperforms state-of-the-art methods for newspaper article reconstruction. In addition, this method has been adopted in several large-scale newspaper digitalization projects.
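The idea of recovering reading order first and aggregating blocks afterwards can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block layout, the `cost` function, the `end_cost` penalty, and the name `reconstruct_articles` are all hypothetical, and an exhaustive search stands in for an efficient assignment solver such as the Hungarian algorithm. Each block appears once as a possible predecessor and once as a possible successor, and extra "end" slots let any block close an article.

```python
from itertools import permutations

INF = float("inf")

# Hypothetical toy page: block id -> (column, row, topic label).
blocks = {
    "A1": (0, 0, "sports"),
    "A2": (0, 1, "sports"),
    "B1": (1, 0, "politics"),
    "B2": (1, 1, "politics"),
}

def cost(a, b):
    """Assumed cost of block b directly following block a (lower is better)."""
    (ca, ra, ta), (cb, rb, tb) = blocks[a], blocks[b]
    if (cb, rb) <= (ca, ra):
        return INF  # a successor must come later in column-major order (no cycles)
    geometric = abs(cb - ca) + abs(rb - ra - 1)  # 0 when b sits directly below a
    semantic = 0 if ta == tb else 2              # stand-in for content similarity
    return geometric + semantic

def reconstruct_articles(ids, cost, end_cost=1.0):
    """Find the cheapest one-to-one successor assignment, then chain it into articles."""
    n = len(ids)
    best_total, best_perm = INF, None
    # Right-hand slots 0..n-1 are real successors; slots n..2n-1 mean "article ends here".
    for perm in permutations(range(2 * n), n):
        total = 0.0
        for i, j in enumerate(perm):
            total += cost(ids[i], ids[j]) if j < n else end_cost
        if total < best_total:
            best_total, best_perm = total, perm
    # Reading order: follow successor links starting from blocks with no predecessor.
    succ = {ids[i]: ids[j] for i, j in enumerate(best_perm) if j < n}
    followers = set(succ.values())
    articles = []
    for head in ids:
        if head in followers:
            continue
        chain = [head]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        articles.append(chain)
    return articles

articles = reconstruct_articles(list(blocks), cost)
print(articles)
```

The exhaustive search is only practical for a handful of blocks. The sketch also relies on the assumed cost function forbidding backward links, so that the matched successor links cannot form a cycle.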
Keywords :
graph theory; information resources; text analysis; article aggregation; bipartite graph framework; geometric layout; graph-based method; information sources; linguistic content; multiarticle page layout complexity; newspaper article reconstruction; newspaper digitalization projects; one-to-one correspondences; optimization process; primary information units; reading order recovery; semantic content; text blocks; Bipartite graph; Complexity theory; Image edge detection; Layout; Optimal matching; Semantics; Visualization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Pattern Recognition (ICPR), 2012 21st International Conference on
Conference_Location :
Tsukuba
ISSN :
1051-4651
Print_ISBN :
978-1-4673-2216-4
Type :
conf
Filename :
6460443