Title :
Combining Modern Machine Translation Software with LSI for Cross-Lingual Information Processing
Author :
Bradford, Russell ; Pozniak, John
Author_Institution :
Agilex Technologies, Inc., Chantilly, VA, USA
Abstract :
The growing internationalization of business and social interactions poses significant challenges in implementing multilingual information systems. For applications requiring retrieval, clustering, and categorization of multilingual document collections, cross-lingual application of latent semantic indexing (LSI) has a number of characteristics that make it potentially attractive. However, this technique is dependent upon the availability of applicable parallel corpora. Historically, such corpora have been quite limited in size and scope. In this paper, we provide new results regarding implementation of cross-lingual LSI text processing systems employing parallel corpora produced using modern machine translation (MT) products. We present measurements using the Reuters 21578 test set to demonstrate three key points regarding this combined LSI/modern MT approach: (1) for some languages, this approach can create parallel corpora of sufficient fidelity to support effective multilingual and cross-lingual LSI applications, (2) the technique is not particularly sensitive to details of LSI parameters, and (3) multiple languages can be represented in a single LSI space with little degradation in performance.
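The cross-lingual LSI idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy parallel corpus in which each English training document is joined to a hand-written Spanish counterpart standing in for MT output, and it uses raw term counts with no term weighting. All names (parallel_docs, train_lsi, fold_in, cosine) are hypothetical and chosen only for illustration. Each joined document becomes one column of a term-document matrix, a truncated SVD yields a single shared space, and monolingual queries are folded into that space.

```python
import numpy as np

# Toy parallel corpus (an assumption for illustration): each English text is
# paired with a Spanish counterpart that stands in for machine-translated output.
parallel_docs = [
    ("oil prices rise", "precios petroleo suben"),
    ("oil exports fall", "exportaciones petroleo caen"),
    ("wheat harvest grows", "cosecha trigo crece"),
    ("wheat exports rise", "exportaciones trigo suben"),
]


def build_vocab(docs):
    """Map every term from both languages to a row index."""
    terms = sorted({t for en, es in docs for t in (en + " " + es).split()})
    return {t: i for i, t in enumerate(terms)}


def term_vector(text, vocab):
    """Raw term-count vector; terms outside the vocabulary are ignored."""
    v = np.zeros(len(vocab))
    for t in text.split():
        if t in vocab:
            v[vocab[t]] += 1.0
    return v


def train_lsi(docs, k):
    """Build the joined term-document matrix and keep the top-k SVD factors."""
    vocab = build_vocab(docs)
    # One column per training document: English text concatenated with its pair.
    A = np.column_stack([term_vector(en + " " + es, vocab) for en, es in docs])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :k], s[:k]


def fold_in(text, vocab, U_k, s_k):
    """Project a monolingual text into the shared space: q^T U_k S_k^{-1}."""
    return (term_vector(text, vocab) @ U_k) / s_k


def cosine(a, b):
    """Cosine similarity between two vectors in the LSI space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    vocab, U_k, s_k = train_lsi(parallel_docs, k=2)
    q_en = fold_in("oil", vocab, U_k, s_k)
    q_es = fold_in("petroleo", vocab, U_k, s_k)
    q_other = fold_in("trigo", vocab, U_k, s_k)
    print("oil vs petroleo:", cosine(q_en, q_es))
    print("oil vs trigo:   ", cosine(q_en, q_other))
```

On this toy data "oil" and "petroleo" occur in exactly the same training documents, so they fold in to the same vector and score higher than the unrelated pair "oil" and "trigo". A realistic system would add term weighting, a much larger parallel corpus, and a far higher number of dimensions.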
Keywords :
indexing; information systems; language translation; text analysis; MT products; Reuters 21578 test; business internationalization; cross-lingual LSI applications; cross-lingual LSI text processing systems; cross-lingual information processing; latent semantic indexing; machine translation software; multilingual document collection categorization; multilingual document collection clustering; multilingual document collection retrieval; multilingual information systems; parallel corpora; social interactions; Abstracts; Large scale integration; Matrix decomposition; Semantics; Standards; Training; Vectors; cross-lingual; machine translation; multilingual
Conference_Title :
2014 11th International Conference on Information Technology: New Generations (ITNG)
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4799-3187-3
DOI :
10.1109/ITNG.2014.52