• DocumentCode
    2457694
  • Title

    Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach

  • Author

    Sun, Chong ; Naughton, Jeffrey F. ; Barman, Siddharth

  • Author_Institution
    Comput. Sci. Dept., Univ. of Wisconsin, Madison, WI, USA
  • fYear
    2012
  • fDate
    1-5 April 2012
  • Firstpage
    882
  • Lastpage
    893
  • Abstract
    We consider the approximate string membership checking (ASMC) problem of extracting all the strings or substrings in a document that approximately match some string in a given dictionary. To solve this problem, the current state-of-the-art approach involves first applying an approximate, fast filter, then applying a more expensive exact verification algorithm to the strings that pass the filter. Correspondingly, many string filters have been proposed. We note that different filters are good at eliminating different strings, depending on the characteristics of the strings in both the documents and the dictionary. We suspect that no single filter will dominate all other filters everywhere. Given an ASMC problem instance and a set of string filters, we need to select the optimal filter to maximize the performance. Furthermore, in our experiments we found that in some cases a sequence of filters dominates any of the filters of the sequence in isolation, and that the best set of filters and their ordering depend upon the specific problem instance encountered. Accordingly, we propose that the approximate match problem be viewed as an optimization problem, and evaluate a number of techniques for solving this optimization problem.
  • Keywords
    document handling; formal verification; information filters; string matching; ASMC; approximate fast filter; approximate string membership checking problem; document substring; exact verification algorithm; multiple filter optimization-based approach; optimal filter; optimization problem; string filters; Approximation algorithms; Approximation methods; Dictionaries; Estimation; Matched filters; Optimization; Pipelines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2012 IEEE 28th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-0042-1
  • Type

    conf

  • DOI
    10.1109/ICDE.2012.68
  • Filename
    6228141