Author :
Hon, Wing-Kai ; Patil, Manish ; Shah, Rahul ; Thankachan, Sharma V.
Author_Institution :
Dept. of Comput. Sci., Nat. Tsing Hua Univ., Hsinchu, Taiwan
Abstract :
Property matching is a biologically motivated problem where the task is to find those occurrences of an online pattern P in a string text T (of size n), such that the matched part in T satisfies some conceptual property. The property of a string is a set π of (possibly overlapping) intervals {(s1, f1), (s2, f2),⋯} corresponding to the part of T, and an occurrence of a pattern P at T[i..(i + |P| - 1)] is a valid output under the property π only if T[i..(i + |P| - 1)] is completely contained in some interval (sj, fj) ∈ π. Algorithmically this problem can be solved in time linear to the size of text. Amir et al. (2008) introduced the indexing version of this problem, where they preprocess the text in O(n log σ + n log log n) time and maintain an O(n log n) bits index, named Property Suffix Tree (PST), where σ denotes the alphabet size. PST can perform property matching in optimal O(|P| log σ + occπ) time, where occπ is the number of occurrences of P in T which satisfies the property. Later, Iliopoulos and Rahman (2008) proposed an alternative index which can be constructed in linear time. Recently Kopelowitz (2010) considered the dynamic version of this problem where intervals can be added or deleted. However, all these indexes requires space of O(n log n) bits, which can be much more than the size of the text (n log σ bits). In this paper, we propose the first index for property matching which takes space close to the entropy compressed space requirement of the text. Our compressed index takes |CSA| + n(2 + ∈ + o(1)) bits space and can perform query answering in O(t(|P|) + 1/∈ occπtSA) time, where |CSA| is the size of compressed suffix array (CSA), t(|P|) and tSA are the time for searching a pattern of length |P| and the time for computing the suffix array value using CSA, respectively, and e is a constant. We also introduce a dynamic index, taking |CSA| + n(2 + ∈ + o(1)) + O(|π|log n) bits of space, which can perform query answering in O(t(|P|) + 1/∈ occπ (log n/log log n + tSA) log n) time and can update (insert or delete) an interval (s, f) in O((f - s + 1)(log n + log |π| + tSA)) time.
Keywords :
computational complexity; indexing; string matching; tree data structures; biologically motivated problem; bit index; compressed suffix array; online pattern; property matching; property suffix tree; query answering; string text; Arrays; Bioinformatics; Genomics; Indexing; Pattern matching; Compressed Indexes; Pattern Matching; Property Matching; Suffix Tree;