DocumentCode :
116648
Title :
Estimating the size of hidden data sources by queries
Author :
Yan Wang ; Jie Liang ; Jianguo Lu
Author_Institution :
Sch. of Inf., Central Univ. of Finance & Econ., Beijing, China
fYear :
2014
fDate :
17-20 Aug. 2014
Firstpage :
712
Lastpage :
719
Abstract :
The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors. Estimating the size of a hidden data source has been a challenging problem. Most existing methods are derived from the classic capture-recapture methods. Another approach is based on a large query pool, but it is inaccurate because the document frequencies of the queries in the pool have a large variance. To address this problem, we propose a new method that constructs the query pool from a sample of the target data source, so that document frequency variance is reduced while most of the documents are still covered. Our method is tested on a variety of large textual corpora and outperforms the baseline random query method and Broder et al.'s estimation method on all the datasets.
Keywords :
document handling; query processing; baseline random query method; capture-recapture methods; document frequency variance; document query frequency; hidden data source size estimation; query pool; textual corpora; Dictionaries; Educational institutions; Estimation; Indexes; Measurement; Nickel; Hidden data source; document frequency; estimator; pool-based sampling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on
Conference_Location :
Beijing
Type :
conf
DOI :
10.1109/ASONAM.2014.6921664
Filename :
6921664