High-throughput genomics
The last decade has seen an explosion of large-scale biological data. The human genome has been sequenced, along with those of numerous other species, and we are approaching an era in which personal genomes will be sequenced for broad clinical use. DNA microarrays are used to measure gene expression levels under a variety of conditions, and more than half a million microarray profiles are available today. Next-generation sequencing techniques provide fast measurements of a plethora of biological entities at a rapidly decreasing cost (a recent report says sequencing cost drops by 50% every five months).
We have been developing methods for analysis of large-scale gene expression data. Grouping the data into modules is a key step for such analysis, and we have developed, among other methods, the Click clustering algorithm, the Samba algorithm for biclustering, and the Matisse method that finds modules using expression and protein interaction networks. Our "flagship" is the Expander platform, which incorporates these algorithms and many others into a streamlined, user-friendly analysis.
The tools are developed in close collaboration with experimentalists, and are in broad use by the community for a variety of projects and species. Our own collaborations have included, among others, human DNA damage, immune system and cell cycle, yeast genomics, human embryonic stem cells, and human pathogens.
Method development is ongoing: we are adapting our methods to new data types (e.g. genetic interactions, next-generation sequencing) and to new, challenging biomedical problems as they arise.
Biological networks
Proteins do not work in isolation. They interact, form complexes, change as they transmit and receive signals, etc. Hence, to understand the behavior of biological cells and systems, genes and proteins should be viewed as a network. Understanding these networks, an area sometimes called systems biology, is key to 21st-century biology and medicine. We are developing methods and tools for different facets of this challenge:
- A signaling-networks knowledge base: Spike is a highly curated database for specific human signaling pathways, accompanied by visualization and analysis tools.
- Modeling: We develop methods to model regulatory networks both in steady state, using the MetaReg approach for probabilistic Boolean networks, and over time, using Petri nets to capture network dynamics.
- Dissection and inference of biological networks, aiming to improve the models by refining their logic and topology, evaluating the effects of perturbations on the models, etc. These analyses utilize tools from machine learning, graph algorithms and verification theory. Models of human cancer and yeast processes have been analyzed.
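To make the notion of a steady state in the modeling item above concrete, here is a minimal sketch. It assumes a plain deterministic Boolean network with synchronous updates, which is a simplification and not the probabilistic MetaReg model or a Petri net. The three genes and their rules are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical three-gene network (illustration only, not taken from any real pathway).
# Each rule computes a gene's next value from the current state of the whole network.
RULES = {
    "A": lambda s: s["A"],                 # A is an external input that keeps its value
    "B": lambda s: s["A"] and not s["C"],  # B is activated by A and repressed by C
    "C": lambda s: s["B"],                 # C is activated by B
}


def step(state):
    """Apply every rule once, synchronously, and return the next state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def steady_states():
    """Enumerate all 2^n states and keep those that the update leaves unchanged."""
    genes = sorted(RULES)
    found = []
    for values in product([False, True], repeat=len(genes)):
        state = dict(zip(genes, values))
        if step(state) == state:
            found.append(state)
    return found


if __name__ == "__main__":
    for state in steady_states():
        print(state)
```

In this toy network the only steady state has all three genes off. When the input A is on, B and C form a negative feedback loop and the network oscillates instead of settling, which is the kind of behavior a dynamic model is needed to describe. Exhaustive enumeration is feasible only for very small networks.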
Transcription regulation
We seek to better understand mechanisms of transcription regulation, including transcription factor and microRNA control and their evolution. We first developed the Prima method for testing enrichment in known regulatory motifs, later the Amadeus software for accurate de novo motif finding based on co-regulated gene groups, and then the Allegro software for discovering DNA motifs from sequence and expression data. We are extending and improving these methods to handle protein-binding microarray data and intend to accommodate large-scale epigenetic data to improve motif finding accuracy. Extensions to multiple species include comparative analysis of cell cycle regulatory motifs and the exciting area of the dynamics of motif evolution.
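As an illustration of what testing a gene group for enrichment in a known motif can look like, the sketch below computes a hypergeometric tail probability. This is a common choice for such tests and is an assumption here: the statistic actually used by Prima is not described in this text. The function name, its parameters, and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """Return P(X >= observed) for a hypergeometric variable X.

    population: total number of genes considered (background set)
    successes:  genes in the background whose promoter contains the motif
    draws:      size of the co-regulated gene group being tested
    observed:   genes in that group whose promoter contains the motif
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical numbers: 6000 background genes, 300 with the motif,
    # a group of 50 genes of which 12 carry the motif.
    print(hypergeom_tail(6000, 300, 50, 12))
```

A small tail probability suggests the motif occurs in the group more often than expected by chance. In practice the result would also need correction for testing many motifs.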
Another important aspect of regulation is microRNAs. Some of our motif finding algorithms mentioned above were applied successfully to non-coding RNAs as well. For example, Amadeus identified the 21U-RNAs in C. elegans, and Allegro discovered microRNA motifs in human stem cell data. The Fame algorithm identifies microRNA function based on computational microRNA target prediction and predicts microRNA-regulated pathways.
Human disease
High-throughput genomics data, biological networks and their regulation must all be integrated in the effort to understand disease processes. Among the problems we address are:
- Analysis of expression profiles of patients along with auxiliary information. The ultimate goal here is better classification of patients and improved disease prognosis. The Degas algorithm provides network-based biomarker selection. We are part of the GenePark EU project that aims to identify blood biomarkers in Parkinson's disease patients.
- Chromosomal aberrations in cancer. By analyzing a very large number of low-resolution cancer karyotypes, we were able to gain new insights into aneuploidy, the phenomenon where an abnormal number of whole chromosomes is present in cancer cells. The Stack website provides searchable analysis results.
- Disease genetics. Analysis of specific disease-causing mutations. In collaboration with clinicians, we contributed to studies on Crohn's disease, breast cancer, substance abuse and obesity. We also developed analytical methods for SNP, haplotype and population-based analysis. See our Gevalt software tool.
Genome rearrangements
We want to understand how genomes evolve by rearrangements. For the genome changes that occur during evolution between species, the classical model assumes that the rearrangement events are reversals only in the single-chromosome case, and reversals and translocations in the multi-chromosome case. One wishes to find a scenario with the minimum number of events that explains the difference between two genomes. In cancer, the somatic changes that occur as the cancer genome evolves are more complicated, and the model must also include duplications and deletions. The analysis of such problems is typically combinatorial and uses methods from graph theory, approximation and complexity.
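The single-chromosome problem described above can be made concrete with a small sketch. It assumes genomes are written as signed permutations (each gene is an integer, and the sign gives its orientation) and finds the minimum number of reversals by brute-force breadth-first search. This is only an illustration for tiny examples, not the method used in our work; the function names are hypothetical.

```python
from collections import deque


def reversals(perm):
    """Yield every signed permutation reachable from `perm` by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            # A reversal inverts the order of a segment and flips each gene's orientation.
            segment = tuple(-gene for gene in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def reversal_distance(source, target):
    """Return the minimum number of reversals turning `source` into `target`."""
    source, target = tuple(source), tuple(target)
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        for nxt in reversals(perm):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("target is not a signed rearrangement of source")


if __name__ == "__main__":
    print(reversal_distance((2, 1), (1, 2)))
```

Because breadth-first search explores genomes in order of increasing number of events, the first time the target is reached gives a minimum scenario. The search space grows as 2^n · n!, so this approach is practical only for a handful of genes.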
Other topics that we have been studying recently include some aspects of evolution and horizontal gene transfer, comparative protein complex prediction, host-pathogen interactions, and the adaptation of methods developed for high-throughput genomics analysis to problems in post-silicon hardware testing and to the dissection of fMRI brain signals.