Title :
Joint Optimization of Index Freshness and Coverage in Real-Time Search Engines
Author :
Shin, Yongwook ; Lim, Junseok ; Park, Jonghun
Author_Institution :
Dept. of Ind. Eng., Seoul Nat. Univ., Seoul, South Korea
Abstract :
Real-time search engines are increasingly indexing web content using data streams, since a number of web sources, including news and social media sites, now deliver up-to-date information via streams. Accordingly, a crucial challenge for a real-time search engine using data streams is to improve index freshness, which depends primarily on the latencies incurred during the fetching and indexing processes. Retrieval latency is the time lag between document publication and fetching, while indexing latency is the delay before a fetched document is indexed, which is caused by finite indexing capacity. The problem of retrieval latency can be satisfactorily addressed by appropriate fetch scheduling or by recent real-time content notification protocols. However, as the total volume of real-time content grows rapidly, indexing latency becomes a challenging problem. Furthermore, the need to maximize index coverage makes it more difficult to reduce indexing latency under limited indexing capacity. We consider the problem of jointly optimizing indexing latency and index coverage, in which their relative importance can be adjusted, and propose an optimization model based on inventory control theory. Extensive experiments have been conducted to validate the proposed model, and the results suggest that the proposed approach outperforms the alternatives.
Keywords :
Internet; document handling; indexing; information retrieval; optimisation; protocols; real-time systems; scheduling; search engines; Web content indexing; data streams; document fetching; document publication; fetching process; fetching scheduling; index coverage; index freshness; indexing capacity finiteness; indexing latency; inventory control theory; joint optimization; optimization model; real-time content notification protocols; real-time search engines; retrieval latency; Indexing; Inventory control; Real time systems; Search engines; Feed; real-time search; search engine
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
DOI :
10.1109/TKDE.2011.144