Regional Information Center for Science and Technology - Estimating the size of hidden data sources by queries

DocumentCode :

116648

Title :

Estimating the size of hidden data sources by queries

Author :

Yan Wang ; Jie Liang ; Jianguo Lu

Author_Institution :

Sch. of Inf., Central Univ. of Finance & Econ., Beijing, China

fYear :

2014

fDate :

17-20 Aug. 2014

Firstpage :

712

Lastpage :

719

Abstract :

The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors. Estimating the size of a hidden data source has been a challenging problem. Most existing methods are derived from the classic capture-recapture methods. Another approach is based on a large query pool. This approach is inaccurate because the document frequencies of the queries in the pool vary widely. Targeting this problem, we propose a new method that constructs the query pool from a sample of the target data source, so that document frequency variance is reduced while most of the documents are still covered. Our method is tested on a variety of large textual corpora, and outperforms the baseline random query method and Broder et al.'s estimation method on all the datasets.

Keywords :

document handling; query processing; baseline random query method; capture-recapture methods; document frequency variance; document query frequency; hidden data source size estimation; query pool; textual corpora; Dictionaries; Educational institutions; Estimation; Indexes; Measurement; Nickel; Hidden data source; document frequency; estimator; pool-based sampling;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on

Conference_Location :

Beijing

Type :

conf

DOI :

10.1109/ASONAM.2014.6921664

Filename :

6921664

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=116648