Title :
Mining Pinyin-to-character conversion rules from large-scale corpus: a rough set approach
Author :
Wang, Xiaolong ; Chen, Qingcai ; Yeung, Daniel S.
Author_Institution :
Dept. of Comput. Sci. & Technol., Harbin Inst. of Technol., China
fDate :
4/1/2004 12:00:00 AM
Abstract :
This paper introduces a rough set technique for mining Pinyin-to-character (PTC) conversion rules. It first presents a text-structuring method, which constructs a language information table for each pinyin from a corpus, and applies it to a free-form textual corpus. Data generalization and rule extraction algorithms are then used to eliminate redundant information and extract consistent PTC conversion rules. The design of our model also addresses several important issues, such as the long-distance dependency problem, the storage requirements of the rule base, and the consistency of the extracted rules. The performance of the extracted rules and the effects of different model parameters are evaluated experimentally. The results show that, with the smoothing method, high conversion precision (0.947) and recall (0.84) can be achieved even for rules represented directly by pinyin rather than words. A comparison with the baseline tri-gram model also shows that our method and the tri-gram language model complement each other well.
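The pipeline summarized in the abstract (structure the text into a per-pinyin information table, generalize, keep only consistent rules) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the toy corpus, the choice of neighboring pinyin at offsets -1 and +1 as condition attributes, and the names `build_table` and `extract_rules` are all assumptions made for illustration. Generalization is approximated here by trying smaller attribute subsets first, and a rule is kept only when every matching row agrees on the character, which corresponds to the certain rules of rough set theory.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus: each sentence is a list of (pinyin, character) pairs.
CORPUS = [
    [("wo", "我"), ("shi", "是"), ("ren", "人")],
    [("ta", "他"), ("shi", "是"), ("ren", "人")],
    [("cheng", "城"), ("shi", "市"), ("li", "里")],
    [("cheng", "城"), ("shi", "市"), ("da", "大")],
    [("you", "有"), ("shi", "事"), ("ma", "吗")],
    [("you", "有"), ("shi", "市"), ("ma", "吗")],  # conflicts with the row above
]
OFFSETS = (-1, 1)  # assumed context window: previous and next pinyin
PAD = "<pad>"      # placeholder used at sentence boundaries


def build_table(corpus, target, offsets=OFFSETS):
    """One row per occurrence of `target`: (context pinyin tuple, character)."""
    rows = []
    for sentence in corpus:
        for i, (pinyin, char) in enumerate(sentence):
            if pinyin != target:
                continue
            context = tuple(
                sentence[i + o][0] if 0 <= i + o < len(sentence) else PAD
                for o in offsets
            )
            rows.append((context, char))
    return rows


def extract_rules(rows, offsets=OFFSETS):
    """Emit consistent rules, preferring those with the fewest conditions."""
    rules, covered = [], set()
    for size in range(1, len(offsets) + 1):
        for attrs in combinations(range(len(offsets)), size):
            # Group all rows by their values on the chosen attributes.
            groups = defaultdict(list)
            for idx, (context, _) in enumerate(rows):
                groups[tuple(context[a] for a in attrs)].append(idx)
            for key, members in groups.items():
                decisions = {rows[m][1] for m in members}
                # Keep a rule only if the group is consistent (one character)
                # and it covers at least one row not already explained.
                if len(decisions) == 1 and not covered.issuperset(members):
                    condition = tuple((offsets[a], v) for a, v in zip(attrs, key))
                    rules.append((condition, decisions.pop()))
                    covered.update(members)
    return rules


table = build_table(CORPUS, "shi")
rules = extract_rules(table)
for condition, char in rules:
    print(condition, "->", char)
```

In this toy data the two rows with context ("you", "ma") map to different characters, so they fall in the boundary region and yield no rule. That is one way to read the abstract's emphasis on the consistency of the extracted rules; a smoothing step such as the one the abstract mentions would be needed to handle such cases, and it is not modeled here.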
Keywords :
data mining; natural languages; rough set theory; text analysis; Pinyin-to-character conversion rule mining; baseline tri-gram model; consistent PTC conversion rules; data generalization; free-form textual corpus; high precision conversion; language information table; large-scale corpus; long-distance dependency problem; model parameters; recall rates; redundant information; rough set approach; rough set technique; rule extraction algorithms; smoothing method; text-structuring method; Computer science; Context modeling; Data mining; Error analysis; Large-scale systems; Merging; Natural language processing; Natural languages; Smoothing methods; Speech recognition;
Journal_Title :
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
DOI :
10.1109/TSMCB.2003.817101