EVEREST: Automatic Identification and Classification of Protein Domains in All Protein Sequences

Elon Portugaly
HUJI

Proteins are composed of one or more building blocks, known as domains. These domains can be classified into families according to their evolutionary origin and biological function. Although sequencing technologies have advanced immensely in recent years, producing vast sequence databases, our knowledge of higher-level properties of proteins, such as their structure and function, is still limited. To date, there are no satisfactory computational tools for large-scale determination of protein domains and their classification. Such tools could assist in annotating the vast sequence databases, bridging the gap between our knowledge at the sequence level and our knowledge of structure and function. This talk describes our attempt to create such an automatic domain identification and classification tool.
Our method, named EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), combines methods from the fields of finite metric spaces, machine learning and statistical modeling. The process begins by constructing a library of protein segments that emerge from an all-vs-all pairwise sequence comparison. It then clusters these segments into putative domain families. The best putative families are selected using machine learning techniques, and a statistical model is created for each chosen family. The procedure is then iterated: the statistical models are used to scan all protein sequences, re-creating a database of segments, which is then clustered as before.
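To make the first steps of the pipeline concrete, here is a minimal toy sketch in Python. It is not the EVEREST implementation. It assumes that an exact longest-common-substring search can stand in for pairwise sequence alignment, that single-linkage grouping can stand in for the clustering step, and that a simple size threshold can stand in for the machine-learning selection. The function names, the `min_len` and `min_members` parameters, and the example sequences are all hypothetical. Model building and the iterative rescan are omitted.

```python
from difflib import SequenceMatcher
from itertools import combinations


def recurrent_segments(proteins, min_len=4):
    """All-vs-all comparison (toy): for each pair of proteins, record the
    longest exact common substring as a pair of matching segments.
    A segment is (protein_id, start, end) with a half-open range."""
    pairs = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(proteins.items(), 2):
        matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
        m = matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))
        if m.size >= min_len:
            pairs.append(((id_a, m.a, m.a + m.size),
                          (id_b, m.b, m.b + m.size)))
    return pairs


def cluster_segments(pairs):
    """Single-linkage clustering with union-find: two segments end up in
    the same cluster if a chain of pairwise matches connects them."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for seg_a, seg_b in pairs:
        parent[find(seg_a)] = find(seg_b)

    clusters = {}
    for seg in list(parent):
        clusters.setdefault(find(seg), set()).add(seg)
    return list(clusters.values())


def select_families(clusters, min_members=2):
    """Stand-in for the selection step: keep clusters with enough members.
    (The method described above uses machine learning here instead.)"""
    return [c for c in clusters if len(c) >= min_members]


# Hypothetical sequences sharing two made-up "domains".
proteins = {
    "P1": "MKTAYIAKDDDD",
    "P2": "SSSMKTAYIAK",
    "P3": "MKTAYIAKHHGQWERTLP",
    "P4": "GQWERTLPNNN",
}
families = select_families(cluster_segments(recurrent_segments(proteins)))
for family in families:
    print(sorted(family))
```

On this toy input the sketch groups the segments into two putative families: one with members on P1, P2 and P3, and one with members on P3 and P4. Real sequence data would need alignment-based comparison and tolerance for segments whose boundaries differ slightly between matches, which exact coordinate matching cannot provide.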
I will describe several aspects of the EVEREST algorithm, discuss the advantages and disadvantages of a fully automatic domain family definition process as compared with semi-manual ones (such as Pfam), and present an execution of EVEREST over the Swiss-Prot database. In this run, 13,569 domain families are defined, covering 83% of the amino acids of the Swiss-Prot database. EVEREST annotates 8,816 proteins (8% of the database) that are not annotated by Pfam A. Additionally, in 18,234 proteins (16% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests versus Pfam and SCOP show that EVEREST recovers 60% of Pfam A families and 53% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 43% fidelity. A manual analysis of EVEREST families often suggests that the EVEREST family is a valid interpretation of the data, even when it differs from the interpretation suggested by Pfam. This leads us to believe that the aforementioned 43% is an underestimate of the actual fidelity of EVEREST families.
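As an illustration of what a residue-coverage figure such as the 83% above could mean, the following Python sketch computes the fraction of amino acids that fall inside at least one annotated domain. This is one plausible reading only; the abstract does not specify how the figure was computed. The function names, the half-open `(start, end)` interval convention, and the example data are assumptions made for illustration.

```python
def covered_residues(length, domains):
    """Count residues of one protein that lie in at least one domain.
    Domains are half-open (start, end) intervals; overlapping domains
    are counted once."""
    covered = [False] * length
    for start, end in domains:
        for i in range(max(start, 0), min(end, length)):
            covered[i] = True
    return sum(covered)


def coverage(annotations):
    """Fraction of all residues in the database covered by some domain.
    `annotations` maps protein id -> (sequence length, list of domains)."""
    total = sum(length for length, _ in annotations.values())
    hit = sum(covered_residues(length, doms)
              for length, doms in annotations.values())
    return hit / total


# Hypothetical annotations: one protein with two overlapping domains,
# one protein with no annotated domain.
annotations = {
    "P1": (100, [(0, 50), (40, 80)]),
    "P2": (100, []),
}
fraction = coverage(annotations)
print(f"{fraction:.0%} of residues covered")
```

Counting residues rather than proteins means that a long, fully annotated protein contributes more to the figure than a short one, and that a protein with only one annotated region still contributes its covered part.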
The EVEREST procedure is scalable, and we intend to run it on larger databases as well, which should provide many more novel domain families. The EVEREST library of domain families is accessible for browsing and downloading at http://www.everest.cs.huji.ac.il.