• DocumentCode
    595013
  • Title
    A graph-based method of newspaper article reconstruction
  • Author
    Liangcai Gao ; Zhi Tang ; Xiaoyan Lin ; Yongtao Wang
  • Author_Institution
    Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
  • fYear
    2012
  • fDate
    11-15 Nov. 2012
  • Firstpage
    1566
  • Lastpage
    1569
  • Abstract
    The primary information units in a newspaper are its articles. Article reconstruction, which comprises article aggregation and reading order recovery, is a challenging task because of the complexity of multi-article page layouts. In this paper, we propose a novel approach to article reconstruction based on a bipartite graph framework, which models the complex relationships between text blocks as one-to-one correspondences and accomplishes the task by finding the optimal matching on this graph. During optimization, several information sources, including geometric layout and linguistic and semantic content, are exploited in the bipartite graph model to handle the wide range of complex newspaper layouts. Moreover, unlike existing methods, we perform the two sub-tasks of article reconstruction in reverse order: we first detect the reading order of the text blocks and then use that reading order to aggregate blocks belonging to the same article. Experimental results on 3312 newspaper pages with 23184 articles demonstrate that our method outperforms the state-of-the-art methods for newspaper article reconstruction. In addition, this method has been adopted in several large-scale newspaper digitalization projects.
  • Keywords
    graph theory; information resources; text analysis; article aggregation; bipartite graph framework; geometric layout; graph-based method; information sources; linguistic content; multiarticle page layout complexity; newspaper article reconstruction; newspaper digitalization projects; one-to-one correspondences; optimization process; primary information units; reading order recovery; semantic content; text blocks; Bipartite graph; Complexity theory; Image edge detection; Layout; Optimal matching; Semantics; Visualization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition (ICPR), 2012 21st International Conference on
  • Conference_Location
    Tsukuba
  • ISSN
    1051-4651
  • Print_ISBN
    978-1-4673-2216-4
  • Type
    conf
  • Filename
    6460443
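
The abstract's idea of recovering reading order as a one-to-one matching, then aggregating blocks into articles, can be illustrated with a small sketch. This is not the authors' implementation: the cost values, the `END` marker, the function names, and the brute-force search are all assumptions made for illustration. The paper derives its matching weights from geometric, linguistic, and semantic cues that this record does not spell out, and a real system would use a proper optimal-matching algorithm instead of enumeration.

```python
from itertools import permutations

END = None  # hypothetical marker: the block has no successor (its article ends here)


def has_cycle(succ):
    """Return True if following successor links from any block loops back to it."""
    n = len(succ)
    for start in range(n):
        node, steps = succ[start], 0
        while node is not END and steps < n:
            if node == start:
                return True
            node = succ[node]
            steps += 1
    return False


def best_successors(cost, end_cost):
    """Brute-force the minimum-cost one-to-one assignment of blocks to successors.

    cost[i][j] is an assumed cost of block j directly following block i;
    end_cost is an assumed cost of a block ending its article.
    """
    n = len(cost)
    # Right-hand side of the bipartite graph: n real blocks plus n END slots,
    # so that any number of blocks may terminate an article.
    right = list(range(n)) + [END] * n
    best, best_total = None, float("inf")
    for perm in permutations(range(2 * n), n):
        succ = [right[k] for k in perm]
        total = sum(end_cost if s is END else cost[i][s] for i, s in enumerate(succ))
        if total < best_total and not has_cycle(succ):
            best, best_total = succ, total
    return best


def group_articles(succ):
    """Aggregate blocks into articles by following the recovered reading order."""
    followers = {s for s in succ if s is not END}
    articles = []
    for head in range(len(succ)):
        if head in followers:
            continue  # not the first block of an article
        chain, node = [], head
        while node is not END:
            chain.append(node)
            node = succ[node]
        articles.append(chain)
    return articles


# Toy page with four text blocks; the costs are made up for this example.
cost = [
    [5, 1, 5, 5],
    [5, 5, 5, 5],
    [5, 5, 5, 1],
    [5, 5, 5, 5],
]
successors = best_successors(cost, end_cost=2)
articles = group_articles(successors)
print(articles)
```

In this toy setup, the reading order is fixed first (each block is matched to at most one successor), and articles fall out as the chains of that order, mirroring the "reverse order" of sub-tasks described in the abstract. The enumeration grows factorially with the number of blocks, so it is only suitable for very small illustrations.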