DocumentCode :
2910887
Title :
Using HTML Tags to Improve Parallel Resources Extraction
Author :
Feng, Yan-hui ; Hong, Yu ; Tang, Wei ; Yao, Jian-min ; Zhu, Qiao-ming
Author_Institution :
Sch. of Comput. Sci. & Technol., Soochow Univ., Suzhou, China
fYear :
2011
fDate :
15-17 Nov. 2011
Firstpage :
255
Lastpage :
259
Abstract :
This paper proposes a new approach to extracting parallel resources (bilingual sentences and bilingual terms) from bilingual web pages, which have a primary language and a secondary language (the secondary-language text is often a translation of the primary-language text). Our method consists of four tasks: 1) parsing the web page into a DOM tree and segmenting the inner text of each node into a series of monolingual snippets; 2) selecting adjacent snippet pairs that are in different languages and have high translation scores as seeds for the next task; 3) constructing comprehensive wrappers from the selected seeds, which capture both HTML and surface formatting styles; 4) mining candidate instances and selecting good instances by their similarity to the seeds. In this paper, we propose, for the first time, to segment text by HTML tags and to select potential parallel resources by ranking all extracted candidates. The experimental results indicate that our method can also be applied to bilingual pages written in other language pairs, and that our approach is effective in improving parallel resource extraction.
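Tasks 1 and 2 of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `TextNodeCollector`, `segment`, and `adjacent_pairs` are hypothetical, the language test (CJK ideographs versus everything else) is an assumption chosen for a Chinese-English page, and the translation-score ranking of task 2 is omitted because the abstract does not say how it is computed.

```python
import re
from html.parser import HTMLParser

# Assumption: "Chinese" means the CJK Unified Ideographs block; any other
# text is labelled "other". A real system would use a proper language detector.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
CJK_SET = r"\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"
# A run of CJK characters/punctuation, or a run of anything else.
RUN = re.compile(rf"[{CJK_SET}]+|[^{CJK_SET}]+")


class TextNodeCollector(HTMLParser):
    """Collects the non-empty text of each node, so tags act as boundaries."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def segment(text):
    """Split one text node into (language, snippet) monolingual runs."""
    snippets = []
    for match in RUN.finditer(text):
        snippet = match.group().strip()
        if snippet:
            lang = "zh" if CJK_CHAR.search(snippet) else "other"
            snippets.append((lang, snippet))
    return snippets


def adjacent_pairs(html):
    """Return adjacent snippet pairs whose languages differ (seed candidates)."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    snippets = [s for text in collector.texts for s in segment(text)]
    return [(a[1], b[1]) for a, b in zip(snippets, snippets[1:]) if a[0] != b[0]]


page = "<ul><li>机器翻译 machine translation</li><li><b>数据挖掘</b> data mining</li></ul>"
for pair in adjacent_pairs(page):
    print(pair)
```

Every adjacent cross-language pair is returned, including pairs that straddle two list items; under the abstract's description, such non-translations would be filtered out by the translation score before seeds are chosen.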
Keywords :
Internet; data mining; information retrieval; natural language processing; text analysis; DOM tree; HTML tags; Web page parsing; bilingual Web page; bilingual sentence; bilingual term; comprehensive wrapper; instance mining; monolingual snippets; parallel resource extraction; primary language; secondary language; text segmentation; translation score; Computational linguistics; Data mining; HTML; Noise measurement; Web pages; Bilingual Resource; HTML Tags; Web Data Mining;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Asian Language Processing (IALP), 2011 International Conference on
Conference_Location :
Penang
Print_ISBN :
978-1-4577-1733-8
Type :
conf
DOI :
10.1109/IALP.2011.23
Filename :
6121515