DocumentCode :
583146
Title :
Deep Web Repeated Pattern Discovering Based on the Largest Block Strategy
Author :
Ye, Feiyue ; Tang, Haibo ; Luo, Xiangfeng
Author_Institution :
Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China
fYear :
2012
fDate :
27-29 Oct. 2012
Firstpage :
1082
Lastpage :
1086
Abstract :
Repeated patterns are common in the query result pages of deep web sites, and mining them gives access to the deep web's back-end data. Most existing algorithms for discovering repeated patterns rely on traditional web information extraction methods, whose recall and accuracy are not high, so obtaining the repeated pattern accurately and completely remains difficult. We propose a method based on the largest block strategy to discover such patterns. The core of the method is to use the largest block strategy to locate the repeated pattern layer: it navigates quickly to the region containing the entity data, analyzes the subtree in that region, and finally derives the simplified repeated pattern of the deep web site. Experimental results show that this method obtains repeated pattern data more accurately and more completely than traditional methods. It also addresses the multi-pattern problem, which other methods have not solved.
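The abstract gives only a high-level outline, so the following is a minimal sketch of one possible reading, not the authors' algorithm. It assumes that the "largest block" is the subtree holding the most text, and that the "repeated pattern layer" is the first node on that path with at least two children sharing the same tag structure. The names `Node`, `TreeBuilder`, `find_repeated_layer`, `min_repeats`, and `SAMPLE` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed short list, not exhaustive).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag name, children, and direct text length."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node

    def size(self):
        # Assumption: block size = amount of text in the whole subtree.
        return self.text_len + sum(child.size() for child in self.children)

    def signature(self):
        # Structural fingerprint: own tag plus the tags of direct children.
        return (self.tag, tuple(child.tag for child in self.children))


class TreeBuilder(HTMLParser):
    """Builds a Node tree, tolerating unmatched closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def find_repeated_layer(html, min_repeats=2):
    """Descend into the largest block until a repeated child structure appears.

    Returns (container tag, repeated signature, repeat count) or None.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = builder.root
    while node.children:
        groups = defaultdict(list)
        for child in node.children:
            groups[child.signature()].append(child)
        # The group of same-shaped children that covers the most text.
        biggest = max(groups.values(), key=lambda g: sum(c.size() for c in g))
        if len(biggest) >= min_repeats:
            return node.tag, biggest[0].signature(), len(biggest)
        # No repetition here: follow the largest block one level down.
        node = max(node.children, key=Node.size)
    return None


SAMPLE = """<html><body>
<div><a>Home</a></div>
<div><ul>
<li><b>Item one</b><span>First description text</span></li>
<li><b>Item two</b><span>Second description text</span></li>
<li><b>Item three</b><span>Third description text</span></li>
</ul></div>
<p>Footer</p>
</body></html>"""

if __name__ == "__main__":
    print(find_repeated_layer(SAMPLE))
```

On the sample page the sketch skips the navigation and footer blocks, descends into the text-heavy `div`, and stops at the `ul` whose three `li` children share one structure. A page with several distinct record layouts (the multi-pattern case mentioned in the abstract) would need more than one signature group per layer, which this sketch does not attempt.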
Keywords :
Internet; Web sites; data handling; Web information extraction methods; deep web repeated pattern discovering; largest block strategy; pattern data; pattern mining; query result pages; Accuracy; Clustering algorithms; Data mining; Feature extraction; HTML; Deep Web; Repeated Pattern; Web Information Extraction; the Largest Block
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer and Information Technology (CIT), 2012 IEEE 12th International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4673-4873-7
Type :
conf
DOI :
10.1109/CIT.2012.220
Filename :
6392057