Title :
Automatic Construction of English-Vietnamese Parallel Corpus through Web Mining
Author :
Dang, Van B. ; Ho, Bao-Quoc
Author_Institution :
Fac. of Inf. Technol., Univ. of Natural Sci., Ho Chi Minh City
Abstract :
Parallel corpora have become an essential resource for multilingual natural language processing, and large volumes of parallel text are now available on the Internet. In this paper, we propose a simple but reliable method for constructing an English-Vietnamese parallel corpus through Web mining. Our system automatically downloads and detects parallel Web pages in a given domain to construct a parallel corpus that is well aligned at the paragraph level and consists of completely clean text. The proposed technique can easily be applied to other language pairs. Experiments have been conducted and show promising results.
Keywords :
Internet; data mining; natural language processing; Web mining; automatic English-Vietnamese parallel corpus construction; multilingual natural language processing; parallel Web pages; Detectors; Dictionaries; Filtering; Information retrieval; Information technology; Large-scale systems; Uniform resource locators; parallel corpus
Conference_Title :
Research, Innovation and Vision for the Future, 2007 IEEE International Conference on
Conference_Location :
Hanoi
Print_ISBN :
1-4244-0694-3
DOI :
10.1109/RIVF.2007.369166