• DocumentCode
    2457470
  • Title

    An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

  • Author

    Deng, Dong ; Li, Guoliang ; Feng, Jianhua

  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2012
  • fDate
    1-5 April 2012
  • Firstpage
    762
  • Lastpage
    773
  • Abstract
    Dictionary-based entity extraction has attracted much attention from the database community recently, which locates sub strings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between sub strings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. First, they need to tune many parameters to achieve high performance. Second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index segments. To extract similar entities, we search segments from the document, and extend the matching segments in both entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.
  • Keywords
    approximation theory; dictionaries; indexing; information retrieval; search problems; string matching; tree data structures; approximate matching; database community; dictionary-based approximate entity extraction; document substring; edit-distance constraints; edit-distance thresholds; entity partitioning; entity substring; error tolerance; extension-based method; extraction performance improvement; partition scheme optimization; segment indexing; segment matching; segment search; trie-based method; Complexity theory; Dictionaries; Heuristic algorithms; Indexes; Partitioning algorithms; Search problems; Strontium;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering (ICDE), 2012 IEEE 28th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-0042-1
  • Type

    conf

  • DOI
    10.1109/ICDE.2012.29
  • Filename
    6228131