Seeds for Homology Search in DNA - Two Topics: Indel Seeds and Parameter Free Calculation of Sensitivity

Gary Benson
Boston University

Standard search techniques for detecting homology in DNA sequences start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models (k-mers, k-tuples, etc.) have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed over another, a homology model for sequence alignment must be chosen. Previous studies have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In the first half of this talk, I discuss a model of homologous sequence alignment which includes indels and describe a new seed model, called indel seeds, which explicitly allows indels. Indel seeds significantly outperform contiguous and spaced seeds when homologies include indels and as an example, I present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
Optimal seed selection for a given homology model is a resource intensive activity because essentially all possible seed shapes must be tested. For spaced seeds with 11 exact matches and 7 wildcard positions in a seed of length 18, the number of seeds is over 11,000 and each seed describes 128 "patterns" whose joint sensitivity must be evaluated. The evaluation process further requires that the probability parameters in the homology model be specified in advance. When the parameters change, the entire calculation must be rerun. In the second half of this talk, I show how to eliminate the need for prior parameter specification. The ideas presented follow from a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds.

This is joint work with Denise Mak, PhD graduate student in Bioinformatics at Boston University, and has appeared at ISMB 2006 and APBC 2007.