Pattern matching with character classes

Chaim Linhart, TAU

In the pattern matching with character classes problem the goal is to find all occurrences of a pattern p of length m in a text t of length n, where each pattern position is a set of characters drawn from a finite alphabet Σ (i.e., p[j]Σ), and the text is a string over Σ. The pattern is said to match the text at position i if the set p[j] contains the character t[i+j-1], for all j, 1≤j≤m. Like other general string matching problems, this problem can be solved in O(|Σ|nlogm) time (assuming the RAM computational model) using boolean codes and computing |Σ| convolutions by means of the Fast Fourier Transform (FFT). We present a novel coding scheme, which is logn/logm times faster. In particular, if m^|Σ|=O(n), our algorithm computes a single convolution and runs in time O(nlogm). This matches the running time of the fastest algorithms for pattern matching with "don't cares", which is a special case of our problem. A major advantage of our algorithm is that it allows a tradeoff between the running time and the RAM word size. We also develop a randomized variant of our algorithm for solving the subset matching problem, in which both the pattern and the text are comprised of subsets of Σ.