• DocumentCode
    1814345
  • Title

    Automatic extraction of titles from general documents using machine learning

  • Author

    Yunhua Hu ; Hang Li ; Yunbo Cao ; Meyerzon, D.

  • Author_Institution
    Comput. Sci. Dept., Xi'an Jiaotong Univ.
  • fYear
    2005
  • fDate
    7-11 June 2005
  • Firstpage
    145
  • Lastpage
    154
  • Abstract
    We propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether it could be possible to conduct automatic title extraction from general documents. As a case study, we consider extraction from Office including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. Precision and recall for title extraction from Word are 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895 respectively in an experiment on intranet data. Other important new findings in this work include that we can train models in one domain and apply them to another domain, and more surprisingly we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
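The precision and recall figures quoted in the abstract can be illustrated with a minimal sketch. The abstract does not say how a correct extraction is counted, so the function below assumes one title per document and an exact match after lowercasing and whitespace normalization; the names `normalize` and `precision_recall` and the dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional, Tuple


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def precision_recall(
    extracted: Dict[str, Optional[str]],
    annotated: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Compute precision and recall of extracted titles against annotations.

    Both arguments map a document id to a title, or to None when no title
    was extracted (or annotated) for that document. Matching rule (an
    assumption of this sketch): normalized strings must be equal.
    """
    n_extracted = sum(1 for title in extracted.values() if title)
    n_annotated = sum(1 for title in annotated.values() if title)
    n_correct = 0
    for doc_id, title in extracted.items():
        gold = annotated.get(doc_id)
        if title and gold and normalize(title) == normalize(gold):
            n_correct += 1
    # Guard against division by zero when nothing was extracted or annotated.
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_annotated if n_annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    extracted = {"a": "Annual Report", "b": "Wrong Line", "c": None}
    annotated = {"a": "annual  report", "b": "Budget Plan", "c": "Memo", "d": None}
    print(precision_recall(extracted, annotated))
```

In this toy input, two titles are extracted, three are annotated, and one matches, so precision is 1/2 and recall is 1/3. A looser matching rule (for example, token overlap) would change both numbers.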
  • Keywords
    document handling; information retrieval; intranets; learning (artificial intelligence); meta data; Office; PowerPoint; Word; automatic title extraction; book chapters; brochures; document retrieval; font size; formatting information; general documents; intranet data; letters; machine learning; precision; presentations; recall; reports; search ranking; technical papers; title annotation; training data; Asia; Books; Computer science; Data mining; Entropy; Information retrieval; Machine learning; Permission; Software measurement; Training data; information extraction; machine learning; metadata extraction; search;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Digital Libraries, 2005. JCDL '05. Proceedings of the 5th ACM/IEEE-CS Joint Conference on
  • Conference_Location
    Denver, CO
  • Print_ISBN
    1-58113-876-8
  • Type

    conf

  • DOI
    10.1145/1065385.1065418
  • Filename
    4118530