• DocumentCode
    2012055
  • Title
    Collecting Handwritten Nom Character Patterns from Historical Document Pages
  • Author
    Truyen Van Phan ; Zhu, Bilan ; Nakagawa, Masaki
  • Author_Institution
    Dept. of Comput. & Inf. Sci., Tokyo Univ. of Agric. & Technol., Tokyo, Japan
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    344
  • Lastpage
    348
  • Abstract
    In this paper, we present methods for segmenting Nom historical documents and clustering character patterns to build a Nom character pattern database. Nom is an ideographic script used to write Vietnamese from the 10th century to the 20th century. However, this heritage is nearly lost. To preserve the wisdom and knowledge expressed in Nom, recognition and digitization are indispensable. Because there is no OCR for Nom yet, we have to start by collecting patterns. We employed a projection-profile-based method to segment hundreds of pages into individual characters. We then implemented a combination of Chinese OCR-based clustering and K-means clustering to group characters into categories. The experiment shows that the proposed system can help collect character patterns effectively. Moreover, it revealed that many character classes have been lost or left uncategorized so far.
  • Keywords
    document image processing; handwritten character recognition; history; optical character recognition; pattern clustering; Chinese OCR-based clustering; K-means clustering; Nom character pattern database; Nom historical documents segmentation; Vietnamese; character patterns clustering; handwritten Nom character patterns collection; historical document pages; ideographic script; Accuracy; Character recognition; Databases; Image segmentation; Libraries; Noise; Optical character recognition software; Chu Nom; Han Nom; Vietnamese ancient text; clustering; document image analysis; historical document; offline character database; pattern collection; segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type
    conf
  • DOI
    10.1109/DAS.2012.25
  • Filename
    6195391