• DocumentCode
    2581133
  • Title

    Invited Paper: Pattern Matching Using n-Gram Sampling of Cumulative Algebraic Signatures: Preliminary Results

  • Author

    Litwin, Witold ; MOKADEM, Riad ; Rigaux, Philippe ; Thomas, Schwarz

  • Author_Institution
    Dauphine Univ., Paris
  • fYear
    0
  • fDate
    0-0 0
  • Firstpage
    251
  • Lastpage
    255
  • Abstract
    We propose a novel string (pattern) matching algorithm called n-gram search. We intend it for the records stored once and searched many times in a database or a file, especially those organized in a scalable distributed data structure, (SDDS), over a grid or a structured P2P net. We presume that the records are encoded into their cumulative algebraic signatures, providing incidental confidentiality of stored data. The search starts with pre-processing the pattern, calculating the logarithmic algebraic signature (LAS) of the pattern and the LASs of every n-gram in it. The value of n ges 1 is a parameter that one may tune. The search attempts to match the LASs of n-grams in the pattern towards dynamically calculated LASs, sampled over n-grams in the records. A mismatch generates a shift of up to K - n symbols towards next sample, where K is the pattern length. The whole process is parallel over the SDDS servers and does not require any local decoding. For an M-symbol long record, the unsuccessful search, measured as number of match attempts, costs O ((M - K)/(K - n+1)). The 2-grams should typically suffice, leading to O ((M - K)/(K - 1)). We show that the algorithm particularly efficient for larger strings and records, i.e., with e-documents or DNA data. Preliminary results show then the n-gram search about (K - n + 1) faster than our previous algorithms and among the fastest known, e.g., probably often faster than Boyer-Moore
  • Keywords
    data structures; distributed databases; grid computing; parallel processing; peer-to-peer computing; search problems; string matching; P2P net; cumulative algebraic signatures; logarithmic algebraic signature; n-gram sampling; n-gram search; pattern matching; scalable distributed data structure; string matching algorithm; Content addressable storage; Costs; DNA; Data structures; Decoding; Distributed databases; Galois fields; Pattern matching; Sampling methods; Special issues and sections; SDDS; algebraic signatures.; distributed pattern matching; grid; scalable; structured P2P;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Database and Expert Systems Applications, 2006. DEXA '06. 17th International Workshop on
  • Conference_Location
    Krakow
  • ISSN
    1529-4188
  • Print_ISBN
    0-7695-2641-1
  • Type

    conf

  • DOI
    10.1109/DEXA.2006.80
  • Filename
    1698345