Regional Information Center for Science and Technology (RICEST) - The Dynamic Web Pages Information Extraction Algorithm Based on Sequence Alignment

DocumentCode :

2818770

Title :

The Dynamic Web Pages Information Extraction Algorithm Based on Sequence Alignment

Author :

Guo, Dongwei ; Li, Dan ; Liu, Miao ; Liu, Yanbin ; Chen, Sha

Author_Institution :

College of Comput. Sci. & Technol., Jilin Univ., Changchun, China

Volume :

fYear :

2009

fDate :

24-26 April 2009

Firstpage :

784

Lastpage :

786

Abstract :

In this paper, the 'common framework' is defined as the information that is unrelated to the core content of Web pages and shared by Web pages from the same source, such as headers, footers, advertisements, navigation aids, Flash, etc. Sequence alignment is adopted in the information extraction algorithm to detect the common framework. After the common framework is eliminated from the Web pages, the remaining data fields are more suitable for information extraction. Experiments on data-intensive Web pages from real-world Websites evaluated the effect of the alignment parameter on extraction results, and the effect of the common framework detection phase on reducing data volume and increasing extraction accuracy. The experimental results convincingly demonstrate the validity of this approach.
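
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the record does not give their alignment method or its parameter, so the sketch assumes a plain longest-common-subsequence alignment over HTML tokens as a stand-in. The names `tokenize`, `align_common`, and `strip_common` are hypothetical and chosen only for illustration. Tokens that align across two pages from the same source are treated as common framework, and whatever is left is kept as candidate data.

```python
import re

# Assumption: a page is modelled as a flat sequence of tag and text tokens.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty, stripped text tokens."""
    tokens = []
    for tok in TOKEN_RE.findall(html):
        tok = tok.strip()
        if tok:
            tokens.append(tok)
    return tokens


def align_common(a, b):
    """Return index pairs (i, j) matched by an LCS alignment of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def strip_common(page_a, page_b):
    """Return the tokens of page_a that do not align with page_b."""
    a, b = tokenize(page_a), tokenize(page_b)
    matched = {i for i, _ in align_common(a, b)}
    return [tok for i, tok in enumerate(a) if i not in matched]


if __name__ == "__main__":
    page_a = "<html><div>Site header</div><p>Article one</p><div>Footer</div></html>"
    page_b = "<html><div>Site header</div><p>Article two</p><div>Footer</div></html>"
    print(strip_common(page_a, page_b))  # ['Article one']
```

In this toy case the header, footer, and surrounding tags align across both pages and are discarded, leaving only the page-specific text. A real system would likely need a scored alignment with gap and mismatch penalties and more than two sample pages, which is presumably where the alignment parameter mentioned in the abstract comes in.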

Keywords :

Internet; information retrieval; Websites; data quantity; dynamic Web pages information extraction algorithm; sequence alignment; Classification tree analysis; Computer science; Data mining; HTML; Hidden Markov models; Kernel; Network servers; Tail; Web pages; Web server;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Computational Sciences and Optimization, 2009. CSO 2009. International Joint Conference on

Conference_Location :

Sanya, Hainan

Print_ISBN :

978-0-7695-3605-7

Type :

conf

DOI :

10.1109/CSO.2009.200

Filename :

5193809

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2818770