Probability Model Based Hidden Databases Sampling Approach

Author

Jian-Wei Tian ; Shi-Jun Li ; Qi Lu

Author_Institution

Sch. of Comput., Wuhan Univ., Wuhan

fYear

2008

fDate

12-14 Oct. 2008

Firstpage

1

Lastpage

4

Abstract

A large portion of the data on the Web lies in the hidden databases of the deep Web, which can only be accessed through query interfaces. An efficient and uniform data sampling approach is important to other research work, because data samples give insight into the quality, freshness, and size of the data in these databases. However, existing hidden database samplers are inefficient, because many queries are wasted in the sampling walks. In this paper, we propose a probability-model-based sampling approach to solve this problem. First, we use historical underflow walks to calculate the underflow probability of attribute values. Then, based on these probabilities, we give priority to querying the attribute values with the largest underflow probability. The experimental results indicate that our approach improves sampling efficiency by detecting underflow earlier and avoiding many wasted queries.
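
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formula for the underflow probability, so the smoothed frequency estimate below, the class name `UnderflowModel`, and the representation of an attribute value as an `(attribute, value)` tuple are all assumptions made for illustration. The sketch records which attribute values appeared in past walks and whether each walk underflowed (returned no results), then orders the candidate values so that those most likely to underflow are queried first.

```python
from collections import defaultdict


class UnderflowModel:
    """Hypothetical estimator of per-value underflow probability."""

    def __init__(self):
        self.seen = defaultdict(int)         # walks that used the value
        self.underflowed = defaultdict(int)  # of those, walks that underflowed

    def record_walk(self, predicates, underflow):
        """Store one historical walk and whether it ended in underflow."""
        for pred in predicates:
            self.seen[pred] += 1
            if underflow:
                self.underflowed[pred] += 1

    def probability(self, pred):
        """Assumed estimate: add-one smoothed underflow frequency.

        A value that has never been seen gets 0.5.
        """
        return (self.underflowed[pred] + 1) / (self.seen[pred] + 2)

    def order(self, predicates):
        """Return predicates sorted by underflow probability, largest first."""
        return sorted(predicates, key=self.probability, reverse=True)


# Usage with made-up history: two underflow walks and one successful walk.
model = UnderflowModel()
model.record_walk([("make", "Saab"), ("color", "red")], underflow=True)
model.record_walk([("make", "Saab"), ("year", 2008)], underflow=True)
model.record_walk([("color", "red"), ("year", 2008)], underflow=False)

ordered = model.order([("color", "red"), ("make", "Saab"), ("year", 2008)])
print(ordered[0])  # the value to query first in the next walk
```

Placing the most underflow-prone value first means that a walk destined to return no results tends to fail after fewer queries, which is the source of the efficiency gain the abstract describes.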

Keywords

database management systems; probability; query processing; attribute values; data quality; deep Web; hidden databases sampling; historical underflow; probability model; query interfaces; underflow probability; Data mining; Databases; Histograms; Probability; Query processing; Sampling methods; Search engines; Virtual manufacturing; Web sites;

fLanguage

English

Publisher

IEEE

Conference_Titel

Wireless Communications, Networking and Mobile Computing, 2008. WiCOM '08. 4th International Conference on

Conference_Location

Dalian

Print_ISBN

978-1-4244-2107-7

Electronic_ISBN

978-1-4244-2108-4

Type

conf

DOI

10.1109/WiCom.2008.2575

Filename

4680764