• DocumentCode
    548452
  • Title
    Simplified-traditional Chinese character conversion based on multi-data resources: Towards a fused conversion algorithm
  • Author
    Hao, Tianyong; Zhu, Chunshen
  • Author_Institution
    Dept. of Chinese, Translation & Linguistics, City Univ. of Hong Kong, Hong Kong, China
  • fYear
    2011
  • fDate
    21-23 June 2011
  • Firstpage
    50
  • Lastpage
    56
  • Abstract
    In recent years, communication between Chinese communities in different parts of the world has steadily increased. However, the traditional Chinese characters used in Taiwan, Hong Kong and Macao differ extensively in both formation and usage from the simplified Chinese characters used in mainland China and Singapore, and these differences can unexpectedly hinder verbal communication. Although researchers and industry have already produced many conversion methods, their precision is still not high enough for professional use, especially in one-to-many cases. To solve this seemingly technical but actually linguistic problem, this paper proposes a new priority-based multi-data resources management model. With this model, conversion can be more context-sensitive and human-controllable, and thus more reliable. A new algorithm called Fused Conversion Algorithm from Multi-Data resources (FCMD) is also presented. This algorithm combines the advantages of reverse maximum matching and an N-Gram-based statistical model to make the system more responsive to contextual nuances. After parameter training on a huge LDC corpus, the conversion precision of the proposed method reaches 90.2% on one-to-many cases, which are the most difficult part of Chinese character conversion, with an overall precision of 99.7%. Its experimental performance in terms of precision and efficiency promises a significant improvement over state-of-the-art models.
  • Keywords
    linguistics; natural languages; statistical analysis; Hong Kong; Macao; N-Gram-based statistical model; Singapore; Taiwan; fused conversion algorithm; priority-based multidata resources management model; simplified-traditional Chinese character conversion; verbal communications; Algorithm design and analysis; Data models; Dictionaries; Encyclopedias; Internet; Resource management; Training; Chinese character conversion; FCMD algorithm; multi-data resources; reverse maximum matching;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Next Generation Information Technology (ICNIT), 2011 The 2nd International Conference on
  • Conference_Location
    Gyeongju
  • Print_ISBN
    978-1-4577-0266-2
  • Electronic_ISBN
    978-89-88678-39-8
  • Type
    conf
  • Filename
    5967471
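
The abstract names reverse maximum matching as one of the two techniques that FCMD combines. The sketch below is not the authors' implementation; it only illustrates how reverse maximum matching can resolve a one-to-many case such as simplified 发, which maps to traditional 發 or 髮 depending on the word it appears in. The tables `PHRASE_TABLE` and `CHAR_TABLE` and the function `convert_rmm` are hypothetical names with toy contents, and the N-Gram statistical component and the priority-based resource model are left out entirely.

```python
# Hypothetical toy resources; a real system would load far larger tables.
# Phrase entries carry the context needed to pick the right variant of 发.
PHRASE_TABLE = {"头发": "頭髮", "理发": "理髮", "发现": "發現"}
# Per-character fallback used when no phrase matches (default variant only).
CHAR_TABLE = {"头": "頭", "发": "發", "现": "現"}


def convert_rmm(text, phrases=PHRASE_TABLE, chars=CHAR_TABLE):
    """Convert simplified text to traditional using reverse maximum matching.

    Scans from the end of the text, always trying the longest phrase first,
    and falls back to a single-character mapping when nothing matches.
    """
    max_len = max((len(p) for p in phrases), default=1)
    pieces = []          # converted pieces, collected right to left
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            chunk = text[end - size:end]
            if chunk in phrases:
                pieces.append(phrases[chunk])
                break
            if size == 1:
                # No phrase covers this character: use the default mapping,
                # or keep the character unchanged if it has none.
                pieces.append(chars.get(chunk, chunk))
        end -= size      # size is the length of the piece just consumed
    return "".join(reversed(pieces))


print(convert_rmm("我发现头发"))
```

In this toy setup the phrase table decides between 發 and 髮, while an isolated 发 falls back to the default 發. That fallback is exactly where a statistical model such as the N-Gram component mentioned in the abstract would be expected to help.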