PRIMA - software for promoter analysis

By Ran Elkon, Chaim Linhart, Roded Sharan, Ron Shamir and Yossi Shiloh,
Tel-Aviv University, Jan. 2003.
Last updated on Aug. 2004.

Overview

PRIMA (PRomoter Integration in Microarray Analysis) is a program for finding transcription factors (TFs) whose binding sites are enriched in a given set of promoters. PRIMA is typically used for the analysis of large-scale gene expression data. Microarray ('DNA chip') measurements point to alterations in gene expression levels under varying biological conditions, but they do not directly reveal the transcriptional networks that underlie the observed changes. PRIMA aims to identify the TFs that take part in these networks. The basic biological assumption is that genes that are co-expressed over multiple biological conditions are regulated by common TFs, and are therefore expected to share common regulatory elements in their promoters. By utilizing human genomic sequences and models for binding sites (BSs) of known TFs, PRIMA identifies TFs whose BSs are significantly over-represented in a given set of promoters.
New: Please read the PRIMA Updates section below.

Databases

PRIMA requires two data collections:
(1) Human promoters. We constructed a set of putative promoters of known human genes by extracting the 1200 bp of human genomic sequence upstream of each gene's putative transcription start site (TSS), based on the gene start annotations (the human genome was downloaded from NCBI in July 2001). Human repetitive sequences are masked. The set contains putative promoters for 12981 human genes (we call it the '13K set'). The 13K set can be downloaded here (4MB, gzipped).
(2) Models for BSs recognized by TFs. PRIMA uses the common position weight matrix (PWM) model to represent the binding sites recognized by TFs. In our analysis, PWMs were obtained from the TRANSFAC database [3].
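
As an illustration of how a promoter window relative to a TSS might be cut out of a chromosome sequence, here is a minimal Python sketch. It is not the code used to build the 13K set. The function names, the 0-based TSS coordinate convention, and the use of 'N' for masked bases are all assumptions made for this example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown symbols become 'N'."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def extract_promoter(chrom_seq, tss, strand, upstream=1200, downstream=0):
    """Return the putative promoter around a TSS, read 5' to 3' on the gene's strand.

    Assumptions (hypothetical conventions, not taken from PRIMA):
      - tss is the 0-based index of the first transcribed base;
      - strand is '+' or '-';
      - windows that run past a chromosome end are truncated.
    """
    if strand == "+":
        start = max(0, tss - upstream)
        end = min(len(chrom_seq), tss + downstream)
        return chrom_seq[start:end].upper()
    if strand == "-":
        # On the minus strand, upstream bases lie at higher coordinates.
        start = max(0, tss - downstream + 1)
        end = min(len(chrom_seq), tss + upstream + 1)
        return reverse_complement(chrom_seq[start:end])
    raise ValueError("strand must be '+' or '-'")
```

With the defaults (upstream=1200, downstream=0) the window matches the description of the 13K set; calling it with upstream=1000, downstream=200 gives a window of the kind used for the newer promoter sets listed under Updates.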

Algorithm

PRIMA gets as input two sets of genes: a target set (e.g., a list of co-expressed genes found in a microarray experiment) and a background set (e.g., the 13K set), and for each PWM P it performs the following steps:
(a) Compute a similarity threshold T(P). Subsequences in the scanned promoters with similarity scores above this threshold are considered as 'hits' of P (i.e., putative binding sites of the TF modeled by the PWM).
(b) Scan the promoters of the target and the background sets for identification of hits of P.
(c) Employ a statistical test to examine whether hits of P are significantly over-represented in the target set with respect to the background set.
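
Steps (b) and (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not PRIMA's implementation: the PWM is assumed to be a list of per-position score dictionaries, the threshold T(P) is taken as a given input (its computation in step (a) is described in [1] and is not reproduced here), only the given strand is scanned, and a hypergeometric tail probability over "promoters with at least one hit" stands in for the statistical test. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def pwm_score(pwm, site):
    """Sum the per-position scores of a site; pwm is a list of {base: score} dicts."""
    return sum(column[base] for column, base in zip(pwm, site))


def count_hits(pwm, promoter, threshold):
    """Step (b): count windows whose score is above the threshold."""
    width = len(pwm)
    hits = 0
    for i in range(len(promoter) - width + 1):
        window = promoter[i:i + width]
        if any(base not in BASES for base in window):
            continue  # skip windows that overlap masked ('N') positions
        if pwm_score(pwm, window) > threshold:
            hits += 1
    return hits


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n promoters from N, of which K contain a hit."""
    total = math.comb(N, n)
    upper = min(n, K)
    favorable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return favorable / total


def enrichment_pvalue(pwm, threshold, target, background):
    """Step (c), assuming the target genes are a subset of the background.

    target and background map gene names to promoter sequences.
    """
    with_hit = {gene for gene, seq in background.items()
                if count_hits(pwm, seq, threshold) > 0}
    k = sum(1 for gene in target if gene in with_hit)
    return hypergeom_tail(k, len(target), len(with_hit), len(background))
```

A small p-value from enrichment_pvalue suggests that hits of the PWM are over-represented in the target set relative to the background. In practice the p-values would also need correction for the number of PWMs tested.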

The full details of the algorithm and the relevant computational analysis are described in [1].
PRIMA is written in Perl and C, and runs under Windows and Linux.

Usage

See the "README.txt" file provided with the software. A sample output file can be viewed here.

Application

We demonstrated the utility of PRIMA in deciphering regulatory mechanisms that control gene expression in [1]. In this study we analyzed the human cell cycle dataset published in [2], which recorded genome-wide gene expression levels over multiple time points during the progression of the cell cycle in the human HeLa cell line. PRIMA revealed 8 TFs whose binding sites were significantly over-represented in the promoters of cell cycle-regulated genes. The enrichment of some of these factors was specific to certain phases of the cell cycle.
The eight circles in the Figure below correspond to the TFs that were highly enriched in promoters of cell cycle-regulated genes. Each circle is divided into 5 zones, corresponding to cell cycle phases. The number adjacent to each zone is the ratio between the prevalence of the TF's hits in the promoters of that phase cluster and their prevalence in the 13K set of background promoters. Note that several TFs show a tendency towards specific cell cycle phases: e.g., over-representation of the E2F PWM in promoters of the G1/S and S clusters, and its under-representation in promoters of the M/G1 cluster.
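
The prevalence ratio attached to each zone can be written as a one-line computation. The sketch below assumes that "prevalence" means the fraction of promoters containing at least one hit, which is one possible reading; the function name and the counts in the usage example are invented for illustration and are not results from [1].

```python
def prevalence_ratio(cluster_with_hit, cluster_size,
                     background_with_hit, background_size):
    """Ratio of hit prevalence in a phase cluster to that in the background.

    Values above 1 indicate over-representation in the cluster,
    values below 1 indicate under-representation.
    """
    cluster_prevalence = cluster_with_hit / cluster_size
    background_prevalence = background_with_hit / background_size
    return cluster_prevalence / background_prevalence


# Toy numbers only: 30 of 100 cluster promoters and 1500 of 10000
# background promoters contain a hit.
ratio = prevalence_ratio(30, 100, 1500, 10000)
```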


License

A new version of PRIMA is integrated in the EXPANDER software (see also Updates), which is freely available for academic use.
The standalone version of PRIMA is freely available for academic use under the following license agreement.
It is also available for non-academic use under appropriate licensing. Please contact Ron Shamir or Chaim Linhart for further information.

Updates

Aug. '04: New promoter sequences (from 1000 bp upstream to 200 bp downstream of the TSS, with repetitive sequences masked out), downloaded from Ensembl [5]:
    HumanPromoters_v19.txt.zip - 19,565 human promoters, Ensembl release 19.34b.
    MousePromoters_v19.txt.zip - 20,028 mouse promoters, Ensembl release 19.32.
In order to run PRIMA on these promoters, please download the EXPANDER software [4].

Oct. '03: A new version of PRIMA is now available as part of the EXPANDER package [4]. It utilizes precomputed fingerprint files (for both human and mouse) and is therefore much faster.

The fingerprint of a gene is the number of hits (putative binding sites) of each TF identified in its promoter. The standalone version of PRIMA recomputes the fingerprints in each execution. While this allows more flexibility (e.g., in choosing the thresholds for declaring hits), it is very time-consuming. EXPANDER, on the other hand, executes PRIMA on a fixed set of precomputed fingerprints, which were constructed as follows. A set of about 17,000 human promoter sequences, spanning from 1000 bp upstream of the TSS to 200 bp downstream of the TSS, was scanned in order to locate putative BSs (hits). The scan was performed for each TF motif (PWM) in TRANSFAC (version 5.4, April '02) [3] that corresponds to a human TF. The fingerprints of all human promoters are supplied with EXPANDER. The human promoter sequences were downloaded from Ensembl (release 13.30) [5]. Another set of fingerprints was prepared for mouse promoters (15,000 promoters, Ensembl release 13.30).
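
To make the fingerprint idea concrete, here is a minimal in-memory sketch. It is not EXPANDER's file format or code. For brevity, an exact-match word stands in for a PWM with its threshold, and the motif names and helper functions are hypothetical.

```python
def count_occurrences(seq, word):
    """Count (possibly overlapping) exact occurrences of word in seq."""
    width = len(word)
    return sum(1 for i in range(len(seq) - width + 1)
               if seq[i:i + width] == word)


def build_fingerprints(promoters, motifs):
    """Scan every promoter once and store the hit count of every motif.

    promoters maps gene name -> sequence; motifs maps motif name -> word.
    The result maps gene name -> {motif name: number of hits}.
    """
    return {gene: {name: count_occurrences(seq, word)
                   for name, word in motifs.items()}
            for gene, seq in promoters.items()}


def genes_with_hits(fingerprints, motif_name):
    """Look up, without rescanning, which genes have at least one hit."""
    return {gene for gene, counts in fingerprints.items()
            if counts[motif_name] > 0}
```

Once the fingerprints are built, any enrichment test on a new target set needs only dictionary lookups, which is why precomputation trades flexibility in the thresholds for speed.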

For most users we recommend using EXPANDER, both for promoter analysis and other computational and visualization tasks.

PRIMA is accessible via the "Made In Israel" bioinformatics portal.

References

[1]  Elkon, R., Linhart, C., Sharan, R., Shamir, R., and Shiloh, Y., "Genome-wide In-silico Identification of Transcriptional Regulators Controlling Cell Cycle in Human Cells",
Genome Research, Vol. 13(5), pp. 773-780, 2003.
[2]  Whitfield, M.L., Sherlock, G., Saldanha, A.J., Murray, J.I., Ball, C.A., Alexander, K.E., Matese, J.C., Perou, C.M., Hurt, M.M., Brown, P.O., and Botstein, D., "Identification of genes periodically expressed in the human cell cycle and their expression in tumors",
Mol Biol Cell, Vol. 13, pp. 1977-2000, 2002.
[3]  Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F., "TRANSFAC: an integrated system for gene expression regulation",
Nucleic Acids Res, Vol. 28, pp. 316-319, 2000.
[4]  EXPANDER - A Gene Expression Analysis and Visualization Software - http://acgt.cs.tau.ac.il/expander/expander.html
[5]  The Ensembl Project - http://www.ensembl.org
