Title :
Text mining of bilingual parallel corpora with a measure of semantic similarity
Author :
Lee, Chunghong ; Yang, Hsin-Chang
Author_Institution :
Dept. of Inf. Manage., Chang Jung Univ., Tainan, Taiwan
Abstract :
This paper describes a new application of a text-mining algorithm to bilingual parallel corpora. The ultimate goal, pursued in the context of a Chinese-English machine translation project, is to develop a language-neutral method for discovering similar documents in multilingual text collections. Using a variant of automatic clustering based on a neural network approach, namely the self-organizing map (SOM), we conducted several experiments to uncover associated documents in Chinese-English bilingual parallel corpora and in a hybrid Chinese-English corpus. The experiments yield some interesting results and suggest a couple of directions for future work in multilingual information discovery. In addition, to explore how this machine learning approach bears on the linguistic problem of mining sensible linguistic elements from multilingual texts, we examined the resulting term associations and text associations from the viewpoint of cross-lingual text similarity. To evaluate the semantic relatedness of the mined bilingual texts, we applied a semantic similarity measure to the resulting bilingual document clusters and word clusters. The paper presents SOM-based algorithms for multilingual text mining that automatically group similar multilingual texts (i.e., Chinese and English texts), along with a means of measuring their semantic similarity to resolve the difficulties of syntactic and semantic ambiguity in multilingual information access.
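The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the shared vocabulary, the map size, the learning schedule, and the use of cosine similarity as a stand-in for the paper's semantic similarity measure are all assumptions made for illustration. The Chinese text is assumed to be already segmented into tokens.

```python
import math
import random

def vectorize(tokens, vocab):
    """L2-normalized term-frequency vector over a shared bilingual vocabulary."""
    counts = [float(tokens.count(term)) for term in vocab]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts

def cosine(u, v):
    """Cosine similarity; used here only as a stand-in similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def best_matching_unit(weights, x):
    """Grid position (row, col) of the unit whose weights are closest to x."""
    best, best_dist = None, float("inf")
    for pos, w in weights.items():
        d = sum((a - b) ** 2 for a, b in zip(w, x))
        if d < best_dist:
            best, best_dist = pos, d
    return best

def train_som(vectors, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighborhood
        for x in vectors:
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])  # pull unit toward input
    return weights

# Hypothetical toy data: a hybrid Chinese-English vocabulary and three documents.
vocab = ["bank", "银行", "river", "河流"]
docs = [["bank", "银行"], ["river", "河流"], ["银行", "bank"]]
vectors = [vectorize(d, vocab) for d in docs]

weights = train_som(vectors)
units = [best_matching_unit(weights, v) for v in vectors]
sim_same = cosine(vectors[0], vectors[2])
sim_diff = cosine(vectors[0], vectors[1])
```

In this sketch, documents that land on the same or neighboring map units are treated as associated, and `cosine` gives a rough score for how related two mapped documents are. A real system would need proper Chinese word segmentation and a much larger vocabulary and map.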
Keywords :
data mining; language translation; learning (artificial intelligence); natural languages; self-organising feature maps; text analysis; Chinese-English bilingual parallel corpora; Chinese-English machine translation project; SOM; automatic clustering techniques; bilingual document clusters; bilingual parallel corpora; cross-lingual text similarity; hybrid Chinese-English corpus; language-neutral method; machine learning approach; multilingual information access; multilingual information discovery; multilingual text collections; multilingual text mining; neural net approach; self-organizing map; semantic ambiguity; semantic relatedness; semantic similarity; semantic similarity measure; sensible linguistics elements; syntactic ambiguity; term associations; text associations; text sources; word clusters; Clustering algorithms; Data mining; Information management; Machine learning; Natural languages; Neural networks; Open source software; Self organizing feature maps; Text categorization; Text mining;
Conference_Title :
Systems, Man, and Cybernetics, 2001 IEEE International Conference on
Conference_Location :
Tucson, AZ
Print_ISBN :
0-7803-7087-2
DOI :
10.1109/ICSMC.2001.969857