DocumentCode :
3414012
Title :
Sentence alignment for web page text based on vector space model
Author :
Zhang, GuanHong ; Odbal
Author_Institution :
Dept. of Comput. Sci. & Technol., Hefei Univ., Hefei, China
fYear :
2012
fDate :
24-26 Aug. 2012
Firstpage :
167
Lastpage :
170
Abstract :
Parallel web pages often contain noisy, non-parallel sentences, and web page structure is of only limited use for the sentence alignment task on web page text. The most straightforward way to align sentences is to use a translation lexicon. However, a major obstacle to this approach is the lack of a dictionary for training. This paper presents a method for automatically aligning Mongolian-Chinese parallel text on the Web using a vector space model. The vector space model is an algebraic model that represents any object as a vector of identifiers, such as index terms. In the statistically based vector space model, a sentence is conceptually represented by a vector of keywords extracted from the text. The extracted keywords are content words, known as terms, and the weight of a term in a sentence vector can be determined by the tf-idf method. The chi-square statistic (CHI) is used to compute the association between bilingual words. Once the term weights are determined, the similarity between sentence vectors is computed with the cosine measure. The experimental results indicate that the method is accurate and efficient enough to be applied without human intervention.
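The weighting and similarity steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the smoothed idf formula, the `term_map` lexicon (a hypothetical stand-in for word pairs selected by a CHI association score), the `threshold` value, and all function names are assumptions made for illustration, and the tokens are placeholders rather than real Mongolian or Chinese words.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse tf-idf vector (term -> weight) per tokenized sentence."""
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for sent in sentences for term in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        # Smoothed idf is an assumption; the abstract only says "tf-idf".
        vectors.append({
            term: (count / len(sent)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project(vector, term_map):
    """Map source-language terms into the target vocabulary.

    term_map is a hypothetical one-to-one lexicon; terms without an
    entry are dropped.
    """
    projected = {}
    for term, weight in vector.items():
        if term in term_map:
            target = term_map[term]
            projected[target] = projected.get(target, 0.0) + weight
    return projected


def align(src_sents, tgt_sents, term_map, threshold=0.1):
    """Pair each source sentence with its most similar target sentence."""
    tgt_vecs = tfidf_vectors(tgt_sents)
    if not tgt_vecs:
        return []
    pairs = []
    for i, vec in enumerate(tfidf_vectors(src_sents)):
        src_vec = project(vec, term_map)
        scores = [cosine(src_vec, tgt_vec) for tgt_vec in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        # Pairs below the (assumed) threshold are treated as non-parallel.
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Placeholder tokens: lowercase for the source side, uppercase for the target side.
src = [["a", "b"], ["c"]]
tgt = [["Z"], ["X", "Y"]]
lexicon = {"a": "X", "b": "Y", "c": "Z"}
alignment = align(src, tgt, lexicon)
```

The greedy best-match step is a simplification; it does not enforce one-to-one alignment or sentence order, both of which a full aligner would likely need to consider.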
Keywords :
Web sites; natural language processing; parallel processing; text analysis; Mongolian-Chinese parallel text; Web page structure; Web page text; algebraic model; bilingual words; content words; cosine measure; human intervention; keywords extraction; parallel web pages; sentence alignment; sentence vector; translation lexicon; unparallel sentences; vector space model; HTML; Manganese; Chinese scripts; Mongolian scripts
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science and Information Processing (CSIP), 2012 International Conference on
Conference_Location :
Xi'an, Shaanxi
Print_ISBN :
978-1-4673-1410-7
Type :
conf
DOI :
10.1109/CSIP.2012.6308821
Filename :
6308821