Pattern matching with character classes
Chaim Linhart, TAU
In the pattern matching with character classes problem the goal is to find
all occurrences of a pattern p of length m in a text t of length n, where
each pattern position is a set of characters drawn from a finite
alphabet Σ (i.e., p[j]Σ), and the text is a string over Σ.
The pattern is said to match the text at position i if the set p[j] contains
the character t[i+j-1], for all j, 1≤j≤m. Like other general
string matching problems, this problem can be solved in O(|Σ|nlogm) time
(assuming the RAM computational model) using boolean codes and computing |Σ|
convolutions by means of the Fast Fourier Transform (FFT). We present a novel
coding scheme, which is logn/logm times faster. In particular, if m^|Σ|=O(n),
our algorithm computes a single convolution and runs in time O(nlogm).
This matches the running time of the fastest algorithms for pattern matching
with "don't cares", which is a special case of our problem. A major advantage
of our algorithm is that it allows a tradeoff between the running time and
the RAM word size. We also develop a randomized variant of our algorithm for
solving the subset matching problem, in which both the pattern and the text
are comprised of subsets of Σ.