EVEREST: Automatic Identification and Classification of Protein Domains
in All Protein Sequences
Elon Portugaly
HUJI
Proteins are composed of one or more building blocks, known as
domains. These domains can be classified into families according to
their evolutionary origin and biological function. Whereas sequencing
technologies have advanced immensely in recent years, yielding
vast sequence databases, our knowledge of higher-level properties of
proteins, such as their structure and function, is still limited. To
date there are no satisfactory computational tools for large-scale
determination of protein domains and their classification. Such tools
could assist in annotating the vast sequence databases, bridging the
gap between our knowledge at the sequence level and our knowledge of
structure and function. This talk describes our attempts at creating
such an automatic domain identification and classification tool.
Our method, named EVEREST (EVolutionary Ensembles of REcurrent
SegmenTs), combines methods from the fields of finite metric spaces,
machine learning and statistical modeling. The process begins by
constructing a library of protein segments that emerge from an
all-vs-all pairwise sequence comparison. These segments are then
clustered into putative domain families, and the best putative families
are selected using machine learning techniques. A statistical model is
then created for each of
the chosen families. This procedure is then iterated, with the
aforementioned statistical models being used to scan all protein
sequences, recreating a database of segments, which is then clustered
as before.
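
The iterative loop described above can be illustrated with a toy sketch.
Everything in it is an assumption made for illustration and not the
EVEREST implementation: exact substring matching (via Python's difflib)
stands in for pairwise sequence comparison, single-linkage union-find
stands in for the clustering step, a "spans at least two proteins" rule
stands in for the machine learning selection, and the most common member
string stands in for the statistical model. All function names are
hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_links(proteins, min_len=4):
    """All-vs-all comparison: link exactly matching segments.

    A segment is a tuple (protein_id, start, end).
    """
    links = []
    for (pa, sa), (pb, sb) in combinations(sorted(proteins.items()), 2):
        matcher = SequenceMatcher(None, sa, sb, autojunk=False)
        for m in matcher.get_matching_blocks():
            if m.size >= min_len:  # also skips the size-0 sentinel block
                links.append(((pa, m.a, m.a + m.size),
                              (pb, m.b, m.b + m.size)))
    return links


def cluster(links):
    """Group linked segments into putative families (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for seg in list(parent):
        groups[find(seg)].add(seg)
    return list(groups.values())


def select(families, min_proteins=2):
    """Toy selection rule: keep families seen in several proteins."""
    return [f for f in families
            if len({pid for pid, _, _ in f}) >= min_proteins]


def build_model(family, proteins):
    """Toy 'model': the most common segment string in the family."""
    strings = Counter(proteins[pid][s:e] for pid, s, e in family)
    return strings.most_common(1)[0][0]


def scan(models, proteins):
    """Scan all proteins with each model; link hits of the same model."""
    links = []
    for model in models:
        hits = []
        for pid, seq in sorted(proteins.items()):
            start = seq.find(model)
            while start != -1:
                hits.append((pid, start, start + len(model)))
                start = seq.find(model, start + 1)
        links.extend(zip(hits, hits[1:]))
    return links


def everest_sketch(proteins, rounds=2):
    """Segments -> clusters -> selection -> models -> rescan, iterated."""
    links = pairwise_links(proteins)
    families = []
    for _ in range(rounds):
        families = select(cluster(links))
        models = [build_model(f, proteins) for f in families]
        links = scan(models, proteins)  # rebuilt segment database
    return families


# Made-up sequences: P1 shares one region with P2 and another with P3.
proteins = {
    "P1": "MKTAYIAKQRWWGGSLVPEL",
    "P2": "HHMKTAYIAKQR",
    "P3": "GGSLVPELCC",
}
for family in everest_sketch(proteins):
    print(sorted(family))
```

On these made-up sequences the sketch reports two families, one per
shared region, so P1 is treated as a two-domain protein. A real pipeline
would replace each stand-in with the corresponding component named in
the text.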
I will describe several aspects of the EVEREST algorithm, discuss the
advantages and disadvantages of a fully-automatic domain family
definition process as compared with semi-manual ones (such as Pfam)
and present an execution of EVEREST over the Swiss-Prot database. In
this run, 13,569 domain families are defined, covering 83% of the
amino acids of the Swiss-Prot database. EVEREST annotates 8,816
proteins (8% of the database) that are not annotated by Pfam A.
Additionally, in 18,234 proteins (16% of the database), EVEREST
annotates a part of the protein that is not annotated by Pfam A.
Performance tests versus Pfam and SCOP show that EVEREST recovers 60%
of Pfam A families and 53% of SCOP families with high accuracy, and
suggests previously unknown domain families with at least 43%
fidelity. Manual analysis often suggests that an EVEREST family is a
valid interpretation of the data, even when it differs from the
interpretation suggested by Pfam. This leads us to
assume that the aforementioned 43% is an underestimate of the actual
fidelity of EVEREST families.
The EVEREST procedure is scalable and we intend to run it on
larger databases as well, providing many more novel domain
families. The EVEREST library of domain families is accessible for
browsing and downloading at http://www.everest.cs.huji.ac.il.