Title :
Deep Web Databases Sampling Approach Based on Probability Selection and Rule Mining
Author :
Xu, Yang ; Wang, Shu-Liang ; Tian, Jian-Wei
Author_Institution :
Int. Sch. of Software, Wuhan Univ., Wuhan, China
Abstract :
A large portion of the data on the Web lies in the hidden databases of the Deep Web. These databases can be accessed only through their query interfaces, so information about their contents can be obtained only by data sampling. An efficient and uniform data sampling approach is very important to other research, such as data source selection and ranking, because the data samples give insight into the data quality, freshness and coverage of the databases. However, existing hidden database samplers are very inefficient because many queries are wasted in the sampling walks. In this paper, we propose a sampling approach based on probability selection and rule mining to solve this problem. First, we leverage the historical valid walks to calculate the valid probability of each attribute value. Based on the valid probability, we give priority to the attribute values with the largest valid probability, which guides the sampler to find a valid sampling path earlier. Meanwhile, we save the underflow walk paths to mine underflow rules, which are used during sampling to steer the sampler away from underflow walks. The experimental results indicate that our approach improves sampling efficiency by detecting valid paths earlier and avoiding many underflow queries.
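The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy table ROWS, the DOMAINS map, the top-k limit K, the class GuidedSampler and the Laplace-smoothed probability estimate are all assumptions made for illustration. The sketch simplifies rule mining to remembering each underflow path as a rule, and it omits any correction needed to keep the sample uniform.

```python
from collections import Counter

# Hypothetical toy data standing in for a hidden database.
ROWS = [
    {"make": "ford", "color": "red", "doors": 2},
    {"make": "ford", "color": "red", "doors": 4},
    {"make": "ford", "color": "blue", "doors": 4},
    {"make": "audi", "color": "blue", "doors": 4},
    {"make": "audi", "color": "blue", "doors": 2},
]
DOMAINS = {"make": ["audi", "ford"], "color": ["red", "blue"], "doors": [2, 4]}
K = 1  # assumed top-k limit of the query interface


def query(predicates):
    """Classify a conjunctive query as 'underflow', 'valid' or 'overflow'."""
    hits = [r for r in ROWS if all(r[a] == v for a, v in predicates)]
    if not hits:
        return "underflow", []
    if len(hits) <= K:
        return "valid", hits
    return "overflow", []


class GuidedSampler:
    def __init__(self):
        self.valid_counts = Counter()  # (attr, value) -> valid walks that used it
        self.valid_walks = 0
        self.underflow_rules = set()   # predicate sets known to return nothing
        self.queries_issued = 0

    def valid_probability(self, pred):
        # Assumed estimator: smoothed share of past valid walks containing pred.
        return (self.valid_counts[pred] + 1) / (self.valid_walks + 2)

    def is_known_underflow(self, path):
        # Any superset of an underflowing predicate set also underflows.
        return any(rule <= path for rule in self.underflow_rules)

    def walk(self):
        path = []
        for attr in DOMAINS:
            # Try the values with the largest valid probability first.
            ranked = sorted(DOMAINS[attr],
                            key=lambda v: -self.valid_probability((attr, v)))
            for value in ranked:
                candidate = frozenset(path + [(attr, value)])
                if self.is_known_underflow(candidate):
                    continue  # pruned by a rule, no query spent
                self.queries_issued += 1
                status, hits = query(candidate)
                if status == "underflow":
                    self.underflow_rules.add(candidate)
                    continue
                path.append((attr, value))
                if status == "valid":
                    self.valid_walks += 1
                    self.valid_counts.update(path)
                    return hits
                break  # overflow: refine with the next attribute
            else:
                return []  # no usable value for this attribute
        return []  # still overflowing after all attributes
```

On this toy data the first walk spends one query on an underflow (audi, red). The second walk ranks blue ahead of red because of the recorded valid walk, so it reaches the same valid path with one query fewer, and any later path containing (audi, red) is pruned without a query.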
Keywords :
Internet; data mining; database management systems; probability; sampling methods; data quality; data sampling; deep Web databases sampling; probability selection; rule mining; Data mining; Legged locomotion; Probability; Sampling methods; Search engines; Spatial databases;
Conference_Title :
Computational Intelligence and Software Engineering, 2009. CiSE 2009. International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-4507-3
Electronic_ISBN :
978-1-4244-4507-3
DOI :
10.1109/CISE.2009.5362897