Title :
AML: Efficient Approximate Membership Localization within a Web-Based Join Framework
Author :
Li, Zhixu ; Sitbon, Laurianne ; Wang, Liwei ; Zhou, Xiaofang ; Du, Xiaoyong
Author_Institution :
Sch. of Inf. Technol. & Electr. Eng., Univ. of Queensland, Brisbane, QLD, Australia
Abstract :
In this paper, we propose a new type of dictionary-based entity recognition problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but its many redundant matches make the AME process inefficient and degrade the performance of real-world applications that use the extracted substrings. The AML problem aims to locate nonoverlapped substrings, which are a better approximation of the true matched substrings and carry no overlapped redundancies. To perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study on several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML as applied to a proposed web-based join framework, a search-based approach that joins two tables using dictionary-based entity recognition from web documents. The results not only show the advantage of AML over AME but also demonstrate the effectiveness of our search-based approach.
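The contrast between AME and AML described in the abstract can be illustrated with a small sketch. This is not the paper's P-Prune algorithm: it generates every candidate first and filters afterwards, which is exactly the cost P-Prune is said to avoid. The token-level Jaccard similarity, the `max_len` window bound, the greedy best-score selection, and the function names `ame` and `aml_naive` are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end_exclusive, dictionary_entry, similarity)
Match = Tuple[int, int, str, float]


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard similarity (an assumed similarity measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def ame(tokens: List[str], dictionary: List[str],
        threshold: float, max_len: int = 4) -> List[Match]:
    """AME-style extraction: return every token window whose similarity
    to some dictionary entry reaches the threshold, overlaps included."""
    matches: List[Match] = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            window = tokens[start:end]
            for entry in dictionary:
                sim = jaccard(window, entry.lower().split())
                if sim >= threshold:
                    matches.append((start, end, entry, sim))
    return matches


def aml_naive(matches: List[Match]) -> List[Match]:
    """AML-style localization, naive version: keep the highest-scoring
    matches greedily and drop any match that overlaps a kept one."""
    chosen: List[Match] = []
    for m in sorted(matches, key=lambda m: (-m[3], m[0], m[1])):
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)


tokens = "the university of queensland in brisbane".split()
dictionary = ["University of Queensland"]

candidates = ame(tokens, dictionary, threshold=0.6)
located = aml_naive(candidates)
print(len(candidates), located)
```

On this toy input, the AME step yields five overlapping windows around the same mention, while the localization step keeps only the exact span covering tokens 1 to 3.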
Keywords :
Internet; dictionaries; document handling; string matching; AME process; AML problem; P-prune algorithm; Web documents; Web-based join framework; approximate membership extraction; approximate membership localization; dictionary-based entity recognition problem; extracted substrings; nonoverlapped substring localization; overlapped redundant matched substrings; real-world data sets; search-based approach; true matched substrings; Approximation algorithms; Approximation methods; Correlation; Dictionaries; Pattern matching; Web search; AML; Web-based join; approximate membership location
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
DOI :
10.1109/TKDE.2011.178