DocumentCode :
2669429
Title :
A Template-Based Tibetan Web Text Information Extraction Method
Author :
Chuncheng, Xiang ; Yu, Weng
Author_Institution :
Nat. Language Resource Monitoring & Res. Center, Minzu Univ. of China, Beijing, China
fYear :
2011
fDate :
1-3 Nov. 2011
Firstpage :
218
Lastpage :
221
Abstract :
To build a large Tibetan corpus, the authors propose a simple and effective method for extracting text information from Tibetan Web pages. Most Web pages contain a great deal of noise unrelated to the main text, which makes it difficult to collect the required text accurately and completely. The seven major Tibetan Web sites analyzed all serve their content by combining database records with fixed dynamic Web templates; based on this characteristic, the article presents a template-based method for Web text information extraction. Experiments show that the method can identify and extract the text through a regular expression that filters out the noise, so it is feasible and applicable and may play a significant role in Tibetan corpus construction.
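The abstract does not give the templates or expressions used, so the following is only a minimal sketch of the general idea of template-based extraction with a regular expression. The template marker (`<div class="content">`), the function name `extract_text`, and the sample page are all hypothetical and are not taken from the paper.

```python
import re
from html import unescape

# Hypothetical template: assumes the site wraps the article body in
# <div class="content"> ... </div> and that this block has no nested <div>.
CONTENT_RE = re.compile(r'<div class="content">(.*?)</div>',
                        re.DOTALL | re.IGNORECASE)

# Matches any remaining HTML tag inside the content block.
TAG_RE = re.compile(r'<[^>]+>')


def extract_text(page_html):
    """Return the text inside the template's content block.

    Returns None when the page does not match the assumed template,
    so that navigation bars, footers and other noise are never returned.
    """
    match = CONTENT_RE.search(page_html)
    if match is None:
        return None
    text = TAG_RE.sub(' ', match.group(1))  # drop inline markup
    text = unescape(text)                   # decode HTML entities
    return ' '.join(text.split())           # normalise whitespace


if __name__ == '__main__':
    page = ('<html><div id="nav">Home | News</div>'
            '<div class="content"><p>བོད་ཡིག</p><br/>ཚིག</div>'
            '<div id="footer">Copyright</div></html>')
    print(extract_text(page))
```

In practice each site would need its own expression, written after inspecting that site's template; a regular expression of this kind also breaks on nested markup, where an HTML parser would be the more robust choice.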
Keywords :
Web sites; information retrieval; natural languages; text analysis; Tibetan Web pages; dynamic Web templates; template-based Tibetan Web text information extraction; Accuracy; Data mining; Educational institutions; Noise; Training; Web pages; Text Information Extraction; Tibetan Information Processing; Tibetan language websites; Web Templates
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Networks and Intelligent Systems (ICINIS), 2011 4th International Conference on
Conference_Location :
Kunming
Print_ISBN :
978-1-4577-1626-3
Type :
conf
DOI :
10.1109/ICINIS.2011.7
Filename :
6104732