• DocumentCode
    2944729
  • Title

    Compressed Dictionary Matching with One Error

  • Author

    Hon, Wing-Kai ; Ku, Tsung-Han ; Shah, Rahul ; Thankachan, Sharma V. ; Vitter, Jeffrey Scott

  • Author_Institution
    Dept of CS, Nat. Tsing Hua Univ., Hsinchu, Taiwan
  • fYear
    2011
  • fDate
    29-31 March 2011
  • Firstpage
    113
  • Lastpage
    122
  • Abstract
    Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick, 1975) where occ denotes the number of occurrences. The space requirement is O(n) words. In the {approximate} dictionary matching problem with one error, we consider a substring of T[i..j] an occurrence of P whenever the edit distance between T[i..j] and P is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires O(n+ dlog{d}) words of space and reports all occurrences in O(|T|log{d}log{log{d}}+occ) time, and by Ferragina et al. (1999), which requires O(n^{1+epsilon}) words of space and reports all occurrences in O(|T|loglog n + occ) time. Recently, there have been successes in compressing the dictionary matching index while keeping the query time optimal (Belazzougui, 2010, Hon et al., 2010). However, a compressed index for approximate dictionary matching problem is still open. In this paper, we propose the first such index which requires an optimal nH_k+O(n)+o(nlogsigma)-bit index space, where H_k denotes the kth-order empirical entropy of D, and sigma is the size of alphabet set from which all the characters in D and T are drawn. The query time of our index is O(σ|T|log3 n log log n + occ).
  • Keywords
    computational complexity; data compression; dictionaries; indexing; query processing; string matching; AC automaton; O(n) word; compressed dictionary matching; kth order empirical entropy; query text; Approximation algorithms; Automata; Computers; Dictionaries; Entropy; Indexes; Pattern matching; Approximate Matching; Compressed Indexes; Dictionary Matching; Suffix Tree;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference (DCC), 2011
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-1-61284-279-0
  • Type

    conf

  • DOI
    10.1109/DCC.2011.18
  • Filename
    5749469