DocumentCode :
3121709
Title :
Weighted Proximity Best-Joins for Information Retrieval
Author :
Thonangi, Risi ; He, Hao ; Doan, AnHai ; Wang, Haixun ; Yang, Jun
Author_Institution :
Dept. of Comput. Sci., Duke Univ., Durham, NC
fYear :
2009
fDate :
March 29 2009-April 2 2009
Firstpage :
234
Lastpage :
245
Abstract :
We consider the problem of efficiently computing weighted proximity best-joins over multiple lists, with applications in information retrieval and extraction. We are given a multi-term query and, for each query term, a list of all its matches with scores, sorted by location. The problem is to find the overall best matchset, consisting of one match from each list, such that the combined score according to a scoring function is maximized. We study three types of functions that consider both individual match scores and proximity of match locations in scoring a matchset. We present algorithms that exploit the properties of the scoring functions in order to achieve time complexities linear in the size of the match lists. Experiments show that these algorithms greatly outperform the naive algorithm based on taking the cross product of all match lists. Finally, we extend our algorithms to an alternative problem definition applicable to information extraction, where we need to find all good matchsets in a document.
Keywords :
information retrieval; information extraction; weighted proximity best-joins; Application software; Computer science; Data engineering; Data mining; Dentistry; Engineering profession; Information retrieval; Personal communication networks; Portable computers; Search engines; Entity Search; Information Retrieval; Information Scoring; Ranking; Weighted Joins; Weighted Proximity
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
Conference_Location :
Shanghai
ISSN :
1084-4627
Print_ISBN :
978-1-4244-3422-0
Electronic_ISBN :
1084-4627
Type :
conf
DOI :
10.1109/ICDE.2009.61
Filename :
4812406