Title :
Chinese coding type identification based on Kolmogorov complexity theory
Author :
He, Gang ; Zhu, Ning ; Wu, Xiaochun ; Xu, Qiuchen
Author_Institution :
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
Identifying the coding type of Chinese text is a major and challenging issue in Chinese web content audit and analysis. In this paper we develop a novel algorithm, based on the theory of Kolmogorov complexity, to identify the coding type of the Chinese characters in a given text segment. An array of text compressors is used as a set of filters to evaluate the information distance between the text under examination and training corpora encoded in different coding types. According to Kolmogorov theory, this information distance can be used to decide the coding type. A particular compression algorithm is used to minimize computational complexity by separating the codebook training stage from the compression stage. Finally, we present experimental results that confirm the accuracy and performance of the algorithm. The results also show that the algorithm is especially efficient, compared with n-gram algorithms, when the text segment under examination is short.
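The compressor-as-filter idea in the abstract can be illustrated with a minimal sketch. It is not the authors' compressor: it assumes Python's standard zlib with a preset dictionary (`zdict`) as a stand-in, where the training corpus for each candidate encoding plays the role of a codebook prepared ahead of time. The compressed size of the sample given that dictionary serves as a rough approximation of the conditional Kolmogorov complexity, and the encoding with the smallest size is selected. The names `conditional_size`, `identify_encoding`, `TRAINING_TEXT`, and `CORPORA` are hypothetical, and the tiny training text is illustrative only.

```python
import zlib

def conditional_size(sample: bytes, corpus: bytes) -> int:
    """Compressed size of `sample` when `corpus` is the preset dictionary.

    Assumption: this size approximates K(sample | corpus). zlib only uses
    the last 32 KB of the dictionary, so a real corpus would need trimming
    or a different compressor.
    """
    comp = zlib.compressobj(level=9, zdict=corpus)
    return len(comp.compress(sample) + comp.flush())

def identify_encoding(sample: bytes, corpora: dict) -> str:
    """Return the candidate encoding whose corpus compresses `sample` best."""
    return min(corpora, key=lambda name: conditional_size(sample, corpora[name]))

# Illustrative training text using characters available in all three encodings.
TRAINING_TEXT = "我是中文的人天地山水日月上下大小你他不在有一二三"

# "Training stage": encode the corpus once per candidate coding type.
CORPORA = {enc: TRAINING_TEXT.encode(enc) for enc in ("gbk", "big5", "utf-8")}

if __name__ == "__main__":
    unknown = "天地山水日月上下".encode("big5")
    print(identify_encoding(unknown, CORPORA))
```

Preparing the dictionaries once and reusing them for every query mirrors, in spirit, the separation of training and compression stages described in the abstract; with such a small corpus the sketch only works reliably when the sample shares byte sequences with the training text.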
Keywords :
data compression; encoding; text analysis; Chinese characters; Chinese coding type identification; Chinese web content audit; Kolmogorov complexity theory; information distance; n-gram algorithms; text compressors; text segment; Accuracy; Algorithm design and analysis; Books; Complexity theory; Encoding; Grippers; Training; Chinese encoding identification; Kolmogorov complexity; information distance; text compression;
Conference_Title :
Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6851-5
DOI :
10.1109/ICNIDC.2010.5657789