Seeds for Homology Search in DNA -
Two Topics: Indel Seeds and Parameter Free Calculation of Sensitivity
Gary Benson
Boston University
Standard search techniques for detecting homology in DNA sequences start by
detecting small matching parts, called seeds, between a query sequence and
database sequences. Contiguous seed models (k-mers, k-tuples, etc.) have
existed for many years. Recently, spaced seeds were shown to be more
sensitive than contiguous seeds without increasing the random hit rate. To
determine the superiority of one seed over another, a homology model for
sequence alignment must be chosen. Previous studies have assumed that matches
and mismatches occur within these alignments, but not insertions and deletions
(indels). This is perhaps appropriate when searching for protein coding
sequences (<5% of the human genome), but is inappropriate when looking for
repeats in the majority of genomic sequence where indels are common. In the
first half of this talk, I discuss a model of homologous sequence alignment
which includes indels and describe a new seed model, called indel seeds, which
explicitly allows indels. Indel seeds significantly outperform contiguous and
spaced seeds when homologies include indels and as an example, I present
results from a search for inverted repeats in the dog genome using both indel
and spaced seeds.
Optimal seed selection for a given homology model is a resource intensive
activity because essentially all possible seed shapes must be tested. For
spaced seeds with 11 exact matches and 7 wildcard positions in a seed of
length 18, the number of seeds is over 11,000 and each seed describes 128
"patterns" whose joint sensitivity must be evaluated. The evaluation process
further requires that the probability parameters in the homology model be
specified in advance. When the parameters change, the entire calculation must
be rerun. In the second half of this talk, I show how to eliminate the need
for prior parameter specification. The ideas presented follow from a simple
observation: given a homology model, the alignments hit by a particular seed
remain the same regardless of the probability parameters. Only
the weights assigned to those alignments change. Therefore, if we know all the
hits, we can easily (and quickly) find optimal seeds.
This is joint work with Denise Mak, PhD graduate student in Bioinformatics at
Boston University, and has appeared at ISMB 2006 and APBC 2007.