Regional Information Center for Science and Technology - A Template-Based Tibetan Web Text Information Extraction Method

DocumentCode :

2669429

Title :

A Template-Based Tibetan Web Text Information Extraction Method

Author :

Chuncheng, Xiang ; Yu, Weng

Author_Institution :

Nat. Language Resource Monitoring & Res. Center, Minzu Univ. of China, Beijing, China

fYear :

2011

fDate :

1-3 Nov. 2011

Firstpage :

218

Lastpage :

221

Abstract :

To build a large Tibetan corpus, the authors propose a simple and effective method of text information extraction from Tibetan Web pages. Most web pages contain a great deal of noise unrelated to the main text, which makes it difficult to collect the required text accurately and completely. After analyzing the characteristics of seven major Tibetan Web sites, which serve information by combining database records with fixed dynamic web templates, the authors present a template-based Web text information extraction method. Experiments show that the method can identify and extract text information using a regular expression that filters out the noise, so it could play a significant role in Tibetan corpus construction and appears both feasible and broadly applicable.
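
The abstract does not give the actual templates or regular expressions, so the following is only a minimal Python sketch of the general idea, under stated assumptions: each site is assumed to wrap its article text in a fixed template container (here a hypothetical `<div class="article-body">` with no nested `div`), and one regular expression per site captures that container while everything outside it is discarded as noise. The names `SITE_TEMPLATES`, `extract_text`, and `tibetan_ratio` are illustrative and do not come from the paper.

```python
import re
from html import unescape

# Hypothetical per-site templates: one regex per site that captures the
# article body. The class name "article-body" is an assumption.
SITE_TEMPLATES = {
    "example-site": re.compile(
        r'<div class="article-body">(?P<body>.*?)</div>', re.DOTALL
    ),
}

# Inline noise inside the captured body: scripts/styles, then any tags.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

# The Tibetan Unicode block spans U+0F00 to U+0FFF.
TIBETAN_RE = re.compile(r"[\u0F00-\u0FFF]")


def extract_text(html, site):
    """Return the text inside the site's template container, or ''."""
    match = SITE_TEMPLATES[site].search(html)
    if match is None:
        return ""
    body = SCRIPT_RE.sub("", match.group("body"))
    text = unescape(TAG_RE.sub(" ", body))
    return " ".join(text.split())  # collapse runs of whitespace


def tibetan_ratio(text):
    """Fraction of non-whitespace characters in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if TIBETAN_RE.match(c)) / len(chars)
```

A check such as `tibetan_ratio` could be used to discard extracted passages that are mostly non-Tibetan before adding them to a corpus; that filtering step is likewise an assumption, not something the abstract describes.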

Keywords :

Web sites; information retrieval; natural languages; text analysis; Tibetan Web pages; dynamic Web templates; template-based Tibetan Web text information extraction; Accuracy; Data mining; Educational institutions; Noise; Training; Web pages; Text Information Extraction; Tibetan Information Processing; Tibetan language websites; Web Templates

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Intelligent Networks and Intelligent Systems (ICINIS), 2011 4th International Conference on

Conference_Location :

Kunming

Print_ISBN :

978-1-4577-1626-3

Type :

conf

DOI :

10.1109/ICINIS.2011.7

Filename :

6104732

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2669429