Title :
A Template-Based Tibetan Web Text Information Extraction Method
Author :
Chuncheng, Xiang ; Yu, Weng
Author_Institution :
Nat. Language Resource Monitoring & Res. Center, Minzu Univ. of China, Beijing, China
Abstract :
To support the construction of a large Tibetan corpus, this paper proposes a simple and effective method for extracting text information from Tibetan Web pages. Most Web pages contain a large amount of noise unrelated to the main text, which makes it difficult to collect the required text accurately and completely. An analysis of seven major Tibetan Web sites shows that they serve content by combining database records with fixed dynamic Web templates; based on this observation, the paper presents a template-based method for Web text information extraction. Experiments show that the method can identify and extract text information using a regular expression that filters out the noise, which suggests that it is feasible and applicable and could play a significant role in Tibetan corpus construction.
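The idea of template-based extraction with a regular expression can be illustrated with a minimal Python sketch. It is not the authors' implementation: the site key `example-site`, the `article-content` class name, and the function `extract_text` are hypothetical, chosen only to show how a per-site pattern might isolate the text region and discard surrounding noise such as navigation bars, scripts, and footers.

```python
import re
from html import unescape

# Hypothetical per-site templates: each regex captures the article region in
# a named group "body". The pattern is an illustrative assumption, not a
# template taken from the paper or from any real Tibetan Web site.
SITE_TEMPLATES = {
    "example-site": re.compile(
        r'<div\s+class="article-content">(?P<body>.*?)</div>',
        re.DOTALL | re.IGNORECASE,
    ),
}

SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def extract_text(html, site):
    """Return the main text of a page, or None if the template does not match."""
    match = SITE_TEMPLATES[site].search(html)
    if match is None:
        return None
    body = SCRIPT_STYLE_RE.sub(" ", match.group("body"))  # drop embedded scripts/styles
    body = TAG_RE.sub(" ", body)                          # drop remaining markup
    body = unescape(body)                                 # decode HTML entities
    return WS_RE.sub(" ", body).strip()                   # normalize whitespace
```

Because the pattern is non-greedy and stops at the first closing `</div>`, this sketch assumes the content container holds no nested `div` elements; a real template would have to be written against each site's actual markup.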
Keywords :
Web sites; information retrieval; natural languages; text analysis; Tibetan Web pages; dynamic Web templates; template-based Tibetan Web text information extraction; Accuracy; Data mining; Educational institutions; Noise; Training; Web pages; Text Information Extraction; Tibetan Information Processing; Tibetan language websites; Web Templates;
Conference_Titel :
Intelligent Networks and Intelligent Systems (ICINIS), 2011 4th International Conference on
Conference_Location :
Kunming
Print_ISBN :
978-1-4577-1626-3
DOI :
10.1109/ICINIS.2011.7