DocumentCode :
3434127
Title :
Chinese coding type identification based on Kolmogorov complexity theory
Author :
He, Gang ; Zhu, Ning ; Wu, Xiaochun ; Xu, Qiuchen
Author_Institution :
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2010
fDate :
24-26 Sept. 2010
Firstpage :
293
Lastpage :
297
Abstract :
Identifying the coding type of Chinese text is a major and challenging problem in Chinese web content audit and analysis. In this paper we develop a novel algorithm, based on Kolmogorov complexity theory, that identifies the coding type of the Chinese characters in a given text segment. An array of text compressors is used as filters to evaluate the information distance between the text under examination and training corpora coded in different coding types. According to Kolmogorov theory, this information distance can be used to decide the coding type. A particular compression algorithm is used to minimize computational complexity by separating the codebook training stage from the compression stage. Finally, we present experimental results that confirm the accuracy and performance of the algorithm. The results also show that the algorithm is especially efficient compared with n-gram algorithms when the text segment under examination is short.
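The compressor-as-filter idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not name the compressor or the corpora, so the sketch assumes zlib as a stand-in compressor, a single toy sentence as the training text, and three candidate encodings (UTF-8, GB2312, Big5). The names `SAMPLE`, `REFERENCES`, `extra_cost`, and `identify_encoding` are hypothetical. The extra compressed size of a segment appended to a reference corpus serves as a rough proxy for the information distance; the encoding whose corpus yields the smallest extra cost is chosen.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude upper bound on its complexity."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, segment: bytes) -> int:
    """Approximate the complexity of `segment` given `reference`:
    the extra bytes needed to compress the segment after the reference."""
    return compressed_size(reference + segment) - compressed_size(reference)


def identify_encoding(segment: bytes, references: Mapping[str, bytes]) -> str:
    """Return the name of the reference corpus that compresses the segment best."""
    return min(references, key=lambda name: extra_cost(references[name], segment))


# Toy training text (an assumption); its characters exist in both GB2312 and Big5.
SAMPLE = "中文信息的分析是一件不容易的工作"

# One reference corpus per candidate encoding (the encoding list is an assumption).
REFERENCES = {enc: SAMPLE.encode(enc) for enc in ("utf-8", "gb2312", "big5")}

if __name__ == "__main__":
    unknown = SAMPLE[2:10].encode("big5")  # bytes whose encoding we pretend not to know
    print(identify_encoding(unknown, REFERENCES))
```

Unlike this sketch, which recompresses the whole reference corpus for every query, the paper reports separating codebook training from compression to reduce computation.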
Keywords :
data compression; encoding; text analysis; Chinese characters; Chinese coding type identification; Chinese web content audit; Kolmogorov complexity theory; information distance; n-gram algorithms; text compressors; text segment; Accuracy; Algorithm design and analysis; Books; Complexity theory; Encoding; Grippers; Training; Chinese encoding identification; Kolmogorov complexity; text compression
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6851-5
Type :
conf
DOI :
10.1109/ICNIDC.2010.5657789
Filename :
5657789