Title :
Probability Model Based Hidden Databases Sampling Approach
Author :
Jian-Wei Tian ; Shi-Jun Li ; Qi Lu
Author_Institution :
Sch. of Comput., Wuhan Univ., Wuhan
Abstract :
A great portion of the data on the Web lies in the hidden databases of the deep Web. These databases can only be accessed through their query interfaces. An efficient and uniform data sampling approach is important to other research, because data samples give insight into the quality, freshness, and size of the data in these databases. However, existing hidden database samplers are inefficient, because many queries are wasted during the sampling walks. In this paper, we propose a probability-model-based sampling approach to solve this problem. First, we leverage historical underflow walks to calculate the underflow probability of each attribute value. Then, based on these probabilities, we give priority to executing the attribute values with the largest underflow probability. The experimental results indicate that our approach improves sampling efficiency by detecting underflow earlier and avoiding many wasted queries.
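The prioritization idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `UnderflowStats`, the function `order_predicates`, the `(attribute, value)` predicate tuples, and the Laplace-smoothed estimate are all assumptions introduced here for illustration, since the abstract does not specify how the underflow probability is computed.

```python
from collections import defaultdict


class UnderflowStats:
    """Hypothetical bookkeeping of past walks per (attribute, value) predicate."""

    def __init__(self):
        self.tried = defaultdict(int)        # times the predicate was added to a query
        self.underflowed = defaultdict(int)  # times that query returned no results

    def record(self, predicate, underflow):
        """Record one historical observation for a predicate."""
        self.tried[predicate] += 1
        if underflow:
            self.underflowed[predicate] += 1

    def probability(self, predicate):
        """Estimated underflow probability.

        Laplace smoothing is an assumption of this sketch, so that
        predicates never seen before get a neutral estimate of 0.5.
        """
        return (self.underflowed[predicate] + 1) / (self.tried[predicate] + 2)


def order_predicates(predicates, stats):
    """Return predicates sorted so the most underflow-prone are issued first."""
    return sorted(predicates, key=stats.probability, reverse=True)


# Example with made-up history: ("color", "red") underflowed in 3 of 4 walks.
stats = UnderflowStats()
for _ in range(3):
    stats.record(("color", "red"), True)
stats.record(("color", "red"), False)
stats.record(("make", "ford"), False)

ordered = order_predicates(
    [("make", "ford"), ("year", 2008), ("color", "red")], stats
)
```

Issuing the predicate with the highest estimated underflow probability first means a walk that is going to underflow tends to do so after fewer queries, which is the source of the efficiency gain the abstract describes.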
Keywords :
database management systems; probability; query processing; attribute values; data quality; deep Web; hidden databases sampling; historical underflow; probability model; query interfaces; underflow probability; Data mining; Databases; Histograms; Probability; Query processing; Sampling methods; Search engines; Virtual manufacturing; Web sites
Conference_Title :
Wireless Communications, Networking and Mobile Computing, 2008. WiCOM '08. 4th International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-2107-7
Electronic_ISBN :
978-1-4244-2108-4
DOI :
10.1109/WiCom.2008.2575