Title :
Text mining of multilingual corpora via computing semantic relatedness
Author :
Lee, Chung-Hong ; Yang, Hsin-Chang
Author_Institution :
Dept. of Inf. Manage., Chang Jung Univ., Tainan, Taiwan
Abstract :
This paper describes a new application of a text-mining algorithm to bilingual corpora. In the past, most approaches to measuring semantic relatedness were based on edge-counting methods over a semantic network such as WordNet. Such methods are not well suited to applications in specific domains for which standard lexical knowledge bases are not available. In this work, we propose an alternative solution for acquiring semantic relatedness from text corpora by means of a machine learning technique, namely self-organizing maps (SOM). We present a hybrid approach to discovering a concept-based feature map containing word clusters and document clusters from multilingual text collections. Using SOM-based automatic clustering techniques, we have conducted several experiments to uncover associated documents in Chinese-English bilingual parallel corpora and in a hybrid Chinese-English corpus. In essence, this work provides a method for automatic text clustering that resolves some of the language difficulties in concept discovery and categorization from multilingual text corpora.
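As an illustration only, the kind of self-organizing-map clustering the abstract refers to might be sketched as below. The record gives no implementation details, so everything here is an assumption rather than the authors' method: the 3x3 map size, the linearly decaying learning rate and neighbourhood width, the toy mixed Chinese-English vocabulary, the binary term vectors, and the helper names train_som and best_matching_unit are all hypothetical.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the map unit whose weight vector is closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    idx = np.unravel_index(np.argmin(dists), dists.shape)
    return tuple(int(i) for i in idx)

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood (assumed settings)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every map unit, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood shrinks
            bmu = np.array(best_matching_unit(weights, x))
            dist2 = np.sum((grid - bmu) ** 2, axis=-1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            # Pull the winner and its neighbours toward the input vector.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

# Hypothetical hybrid vocabulary and binary term vectors (made-up toy data).
vocab = ["market", "市场", "stock", "football", "足球", "goal"]
docs = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],   # same terms as document 0
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

weights = train_som(docs)
units = [best_matching_unit(weights, d) for d in docs]
for i, unit in enumerate(units):
    print(f"document {i} -> map unit {unit}")
```

Documents that land on the same or neighbouring map units would be read as related. Word clusters could be obtained in the same way by training on the transposed matrix, so that each word is described by the documents it occurs in; that too is an assumption about one possible setup, not a description of the paper's procedure.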
Keywords :
classification; data mining; learning (artificial intelligence); self-organising feature maps; text analysis; Chinese-English bilingual parallel corpora; automatic clustering techniques; bilingual corpora; concept discovery; document clusters; edge counting methods; experiments; lexical knowledge bases; machine learning technique; multilingual corpora; self-organizing maps; semantic network; semantic relatedness; text mining; word clusters; Data mining; History; Information management; Information systems; Machine learning; Measurement standards; Natural languages; Text categorization; Text mining;
Conference_Title :
Systems, Man and Cybernetics, 2002 IEEE International Conference on
Print_ISBN :
0-7803-7437-1
DOI :
10.1109/ICSMC.2002.1176326