Overview 

Download and install
How to run AmadeusPBM/Allegro/Amadeus
Important notes
Why "Allegro"?


Download and install

  1. Download the software from the download page. After you download the zip file, please extract its contents to a folder of your choice, e.g., D:\MyTools. This should create a directory named AmadeusPBM_v1.0 in the above folder.
  2. Download regulatory sequences from the download page. Save the sequences zip file in the Allegro directory, e.g., under D:\MyTools\AmadeusPBM_v1.0 in the previous example. Then, extract the zip file and make sure the sequences files reside in the data\sequences\[org] folder, where [org] is the name of the organism. For example, if you download human promoters, you should verify that the directory D:\MyTools\AmadeusPBM_v1.0\data\sequences\Human contains a file called "masked_promoters.all.txt".
  3. Further instructions, hardware/software requirements and notes are detailed in the README.txt file supplied with the installation.
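
The directory check described in step 2 can be scripted. The following is a minimal sketch, not part of the Allegro distribution: the function names are hypothetical, and the default file name is the human promoter file mentioned above (other organisms or sequence types may use different file names).

```python
from pathlib import Path


def expected_sequence_file(install_dir, organism,
                           filename="masked_promoters.all.txt"):
    """Build the path data/sequences/[org]/<filename> under the install folder."""
    return Path(install_dir) / "data" / "sequences" / organism / filename


def check_install(install_dir, organism):
    """Return (found, path) for the expected sequence file."""
    path = expected_sequence_file(install_dir, organism)
    return path.is_file(), path
```

For the example above, check_install(r"D:\MyTools\AmadeusPBM_v1.0", "Human") should report that the file was found once the sequences zip has been extracted.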

Note that Allegro requires Java version 1.5 or later (get Java here) and at least 1 GB of RAM. Allegro should run properly on Windows (2000/XP/Vista) and Linux (Debian). It has not been tested on Mac OS.

How to run AmadeusPBM/Allegro/Amadeus

The following step-by-step instructions explain how to run Allegro and Amadeus and how to analyze their output.
  Screenshot of the Allegro input panel

  1. Launch the application
    Execute/double-click the "run.bat" file in the Allegro folder. If you wish to analyze a very large expression file or multiple datasets, you may need to allocate more memory for Java. To do so, either execute the "run_1.3G_mem.bat" file, or edit the "run.bat" file and change the value of "-Xmx". Note that you will need at least 2 GB of RAM for such large executions.

  2. Choose the data type
    Select "PBM" if you are analyzing a PBM dataset. This will run the AmadeusPBM algorithm. Follow the rest of the instructions on this page.

    Select "Expression" if you are analyzing a gene expression dataset (or similar types of data that assign one or more values to each gene). This will run the Allegro algorithm. Follow the rest of the instructions on this page.
    Select "Target set" if you would like to execute the Amadeus algorithm in order to:
    (a) discover motifs that are over-represented in a supplied list of genes (or several such lists); or (b) find motifs with global spatial features (i.e., motifs that appear non-uniformly along the promoters, between the two strands, or among the chromosomes). For more information on the input of Amadeus, please read the Amadeus overview page.

  3. Choose the sequence type (if "Expression" or "Target set")
    Choose the type of sequences you would like to analyze: promoters (to analyze both strands) or 3' UTRs (single strand).

  4. Set files
    Choose the organism and its sequences & expression files.

    PBM file (in case of "PBM"): Use the "Browse" button to select a file. Note that the PBM file should be in de Bruijn format (from the UniPROBE database), in which each sequence's prefix contains 36 informative base pairs, and a tab-delimited number as a measure of the binding.

    Sequences (if "Expression" or "Target set"): When you choose an organism, the default sequences file is set to the file you can download from the download page. Use the "Browse" button to select a different file. Note that the sequence file should be in FASTA format, in which the header of each sequence contains its name (>geneName), and the TSS of each gene is assumed to be located at the end of the sequence. Repetitive elements and other sequences you'd like to ignore in the analysis, such as protein-coding sequences, should be masked out with N's. The sequence file may contain up to 65,000 sequences, each up to 16,000 bases long.
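
    A custom sequence file can be checked against the limits stated above before it is loaded. The following is a minimal sketch under stated assumptions: check_fasta is a hypothetical helper, not part of Allegro, and it checks only the header format, the number of sequences and their lengths (it does not detect duplicate names or unexpected characters).

```python
import io

MAX_SEQUENCES = 65_000  # limits quoted in the text above
MAX_LENGTH = 16_000


def check_fasta(handle):
    """Return (number of sequences, list of problems) for a FASTA stream."""
    problems = []
    lengths = {}
    name = None
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is taken as the gene name.
            name = line[1:].strip()
            if not name:
                problems.append(f"line {line_no}: header without a gene name")
            lengths[name] = 0
        elif name is None:
            problems.append(f"line {line_no}: sequence data before the first header")
        else:
            lengths[name] += len(line)
    if len(lengths) > MAX_SEQUENCES:
        problems.append(f"{len(lengths)} sequences (limit is {MAX_SEQUENCES})")
    for seq_name, length in lengths.items():
        if length > MAX_LENGTH:
            problems.append(f"{seq_name}: {length} bases (limit is {MAX_LENGTH})")
    return len(lengths), problems


sample = io.StringIO(">ENSG0001\nNNNNACGTACGT\n>ENSG0002\nACGT\n")
print(check_fasta(sample))  # prints (2, [])
```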

    Expression (if "Expression"): Use the "Browse" button to select the expression data file. The file should be tab-delimited, and contain one line per gene. Each line should contain the id of the gene in the first column, and the expression values in the other columns. Specify the range of columns you'd like to include in the analysis using the "Cols" boxes. For example, if you'd like to ignore the 2nd column in the expression file, use "Cols" 3 to -1 ("-1" stands for the last column). The first line in the expression file should specify the names (titles) of the experimental conditions. If the second line starts with ">SERIES", then it should indicate the series name of each condition. This feature is optional, and is intended for presentation purposes only - it has no effect on the algorithm's results. The rest of the lines in the expression file should contain the expression values, as mentioned above.
    An example of the expression file format can be found here. See also the file "TLRs_RAW264.7.avg.txt" in the "expression" directory of the installation.
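
    To make the layout concrete, here is a small sketch that builds a tiny expression file in memory and reads it back with a column range. It is an illustration only, not Allegro's parser: the gene ids, condition names and series names are made up, the layout of the ">SERIES" line (one name per column) is an assumption, and the column numbering (1-based, inclusive, with -1 meaning the last column) is inferred from the "Cols" example above.

```python
import io


def read_expression(handle, first_col=2, last_col=-1):
    """Parse a tab-delimited expression file; columns are 1-based, inclusive."""
    rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    header = rows[0]
    last = len(header) if last_col == -1 else last_col
    pick = slice(first_col - 1, last)
    conditions = header[pick]
    body = rows[1:]
    series = None
    # Optional second line that names the series of each condition.
    if body and body[0][0].startswith(">SERIES"):
        series = body[0][pick]
        body = body[1:]
    values = {row[0]: [float(v) for v in row[pick]] for row in body}
    return conditions, series, values


example = (
    "gene\tdescription\tt0\tt1\tt2\n"
    ">SERIES\t-\tLPS\tLPS\tLPS\n"
    "ENSG01\tfoo\t0.0\t1.2\t2.5\n"
    "ENSG02\tbar\t0.0\t-0.3\t-1.4\n"
)
# "Cols" 3 to -1: skip the 2nd column and use the rest.
conditions, series, values = read_expression(io.StringIO(example), 3, -1)
print(conditions)  # prints ['t0', 't1', 't2']
```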

    Important notes:
    (1) Allegro does not perform any type of normalization to the data. Thus, it is up to the user to pre-process the data properly, as explained here.
    (2) The gene ids should match those in the sequences file. If you are using our sequences files, the gene names should be either Ensembl or Entrez gene ids. Genes in the expression file that aren't found in the sequences file, or vice versa, are ignored.

    Expression levels:
    As described in the paper, Allegro uses discrete expression levels in its CWM expression model. The discretization is performed either using fixed cutoffs or using percentiles. If you choose "Cutoffs", you should specify the lower bound of each expression level except the lowest one. For example, if the cutoffs are "1.5, -1", Allegro will use three levels: 1.5 and up, between -1 and 1.5, and less than -1. If you specify "Percentiles" as "10,90", for example, then Allegro will use two expression levels in each condition: the top 10%, and the rest 90%. Note that the cutoffs/percentiles apply to all the conditions in the expression file.
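
    The two discretization modes can be sketched as follows. This is an illustration of the description above, not Allegro's code: the function names are hypothetical, level 0 denotes the highest expression level, a value equal to a cutoff is assumed to fall into the upper level, and the percentile list is assumed to give the size of each level from the top down (so "10,90" means the top 10% and then the remaining 90%). Tie handling is also an assumption.

```python
def discretize_by_cutoffs(values, cutoffs):
    """Assign each value a level; each cutoff is the lower bound of one level."""
    bounds = sorted(cutoffs, reverse=True)
    levels = []
    for v in values:
        level = len(bounds)  # lowest level unless a bound is reached
        for i, bound in enumerate(bounds):
            if v >= bound:
                level = i
                break
        levels.append(level)
    return levels


def discretize_by_percentiles(values, percentiles):
    """Assign levels by rank; percentiles are level sizes, highest level first."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    levels = [len(percentiles) - 1] * n
    start = 0
    cumulative = 0.0
    for level, pct in enumerate(percentiles):
        cumulative += pct
        end = round(n * cumulative / 100)
        for i in order[start:end]:
            levels[i] = level
        start = end
    return levels


print(discretize_by_cutoffs([2.0, 1.5, 0.0, -1.0, -3.0], [1.5, -1]))
# prints [0, 0, 1, 1, 2]
```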

    PBM parameters:
    As described in the paper, AmadeusPBM uses 3 parameters: the scoring scheme, k, and the number of k-mers. Each is configurable through the GUI. The scoring scheme is either average, median or WMW. k (the k-mer length) should be set between 7 and 10. The default number of top k-mers is 1000, but you may want to experiment with other values.

    After all the fields have been assigned, press the "Add" button. Repeat these steps in order to add more organisms/expression files to the analysis.

  5. Set general parameters
    Running mode: choose between faster execution and more comprehensive analysis.
    From/to position: determine the range of sequences that will be scanned for the motifs.
    Motif length: the length of the motifs to search for. We recommend values of 8-10. In "PBM" mode, set it to at most the chosen k.
    Known motifs DB: a file that contains PWMs (matrices) of known motifs, in Transfac format. The motifs discovered by Allegro are compared to the PWMs in the file, and similarities are reported. By default, Allegro uses Transfac and miRBase for comparison in promoter and 3'-UTR analysis, respectively. Species-specific miRBase files are also supplied in the installation - see "data/miRNA/README.txt".
    Analyze pairs: choose whether to perform motif-pair analysis, which searches for co-occurring motifs.

    Amadeus/Allegro uses additional parameters that have pre-defined default values and cannot be controlled via the GUI, as explained here.

  6. Select score(s) for ranking the motifs
    Each motif considered by Allegro is evaluated using one or more scores. When several scores are chosen, you may assign them different weights. All scores are combined into a single p-value.
    Enrichment: (obligatory) evaluates the over-representation of the motif in the sequences of the genes that share the expression profile Allegro inferred for the motif. Choose one of the variants: "hypergeometric" or "binned". The latter accounts for length and GC biases (i.e., when the genes' expression profiles are correlated with the length and/or GC-content of their regulatory sequences), as described in the paper.
    Strand bias, localization, chromosomal preference: (optional) evaluate global spatial features of the motif, namely, whether it is distributed unevenly between the strands, along the sequences, or among the chromosomes.
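
    As a reminder of what a hypergeometric enrichment test measures, the sketch below computes the standard hypergeometric tail probability. It is the textbook formula only and should not be read as Allegro's exact computation (the "binned" variant and the combination of several scores are not shown); the function and parameter names are hypothetical.

```python
from math import comb


def hypergeometric_tail(total_genes, motif_genes, profile_genes, overlap):
    """P(X >= overlap) when profile_genes are drawn at random from total_genes,
    of which motif_genes contain a hit of the motif."""
    upper = min(profile_genes, motif_genes)
    ways = sum(
        comb(motif_genes, k) * comb(total_genes - motif_genes, profile_genes - k)
        for k in range(overlap, upper + 1)
    )
    return ways / comb(total_genes, profile_genes)


# 1000 genes, 100 with a motif hit, 50 matching the profile, 15 in both.
p = hypergeometric_tail(1000, 100, 50, 15)
```

    A small p means that the overlap between the motif's targets and the genes sharing the profile is larger than expected by chance.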

  7. Start the analysis
    Click the "Run" button to start the analysis. Other buttons in the bottom panel are: Stop run, Save textual output to file, Save parameters to file, Load parameters from file.

  8. Output of Allegro
    Allegro produces both textual and graphical output.
    At the top of the textual "Output" tab, Allegro reports general statistics on the supplied input, e.g., the number of sequences, their average length and their base frequencies, the number of conditions read from the expression file, the distribution of discrete expression levels in each condition, and more. Check these stats to verify that your input was read correctly.
    Once the analysis is completed, Allegro shows the discovered motifs in the graphical "Results" tab (see figure below).
    If pairs analysis was chosen, the results are shown in an additional tab.

    As described in the paper, Allegro uses a novel non-parametric model called CWM (Condition Weight Matrix) to describe the expression profile of a group of co-regulated genes. For each candidate motif, Allegro fits a CWM to its putative targets using a cross-validation-like procedure. The genes whose expression values match the CWM (above a computed threshold) are called the CWM targets. In order to ascertain whether the motif and the CWM are significantly correlated, Allegro computes one of two enrichment scores, as chosen by the user: the HG score, or the binned enrichment score. Allegro utilizes the efficient motif search engine of Amadeus to enumerate a huge number of candidate motifs and to converge to high-scoring ones.


    For each discovered motif, Allegro reports its p-value, its graphical logo, the scores it attained (all scores are shown; those used for computing the p-value are marked in bold face), the CWM expression profile fitted to its targets, statistics on the number of hits and targets, and a list of similar known motifs from Transfac/miRBase ("Divergence" closer to 0 means higher similarity). Additional information is presented in several pop-up screens (see figure):
    (a) The list of k-mers that comprise the motif (i.e., pass the PWM cutoff).
    (b) Expression profile of the motif's targets that are also its CWM targets (i.e., genes whose sequence contains a hit of the motif, and whose expression values match the CWM). Multiple types of views are available (mean expression profile, per-gene expression matrix).
    (c) A histogram of the locations of the motif's hits in the regulatory sequences. Location 0 is the TSS (or, in 3' UTR analysis, the 3' end of the sequence).
    (d) A list of genes whose promoters/3'-UTRs contain a hit of the reported motif. This list can be exported for further analysis. Also shown are a graphical representation of the CWM fitted to the motif's targets (bottom left), and the mean expression profile of: [i] all the targets of the motif (middle right), [ii] the motif's targets that are also its CWM targets (bottom right).
    (e) The logo of the chosen PWM from Transfac.

Important notes

  1. Default values, batch runs: The algorithms have many parameters that control the way motifs are searched and reported. The most important parameters can be set by the user via the graphical user interface (GUI). However, other parameters are always set with pre-defined default values. For example, by default, Amadeus reports motifs that occur in at most 25% of the background sequences; elements that appear more frequently are often not biologically interesting.
    If you wish to modify the default values of Amadeus/Allegro, or execute the programs in batch (command-line) mode, you first need to create a parameters file for the execution - you can do this using the graphical interface (i.e., set all the parameters, run Amadeus/Allegro to make sure they're ok, and then click the "Save parameters to file" button). Then, run Amadeus/Allegro by simply adding "file [filename]" to the command line, where "filename" is the name of the parameters file you saved. For example:
        java -Xmx800m -jar Allegro_v1.0.jar file myParams.txt
    Note: On Windows, you need to execute the above line in a command ("DOS") window (click "Start"->"Run", type "cmd" and hit Enter); on Linux, execute the above line in a shell (xterm/terminal) window. Please contact us in case you need further assistance.
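
    When several parameters files have been saved, the command above can be issued once per file from a script. The following is a minimal sketch that assumes the jar name and memory setting of the example command and that "java" is on the PATH; build_command and run_batch are hypothetical helpers, not part of the distribution.

```python
import subprocess


def build_command(params_file, jar="Allegro_v1.0.jar", max_heap="800m"):
    """Assemble the command line shown above for one parameters file."""
    return ["java", f"-Xmx{max_heap}", "-jar", jar, "file", params_file]


def run_batch(params_files, **kwargs):
    """Run the tool once per parameters file; stop at the first failure."""
    for params_file in params_files:
        subprocess.run(build_command(params_file, **kwargs), check=True)
```

    For example, run_batch(["myParams.txt", "otherParams.txt"]) would execute the two analyses one after the other from the installation folder.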

  2. Pre-processing the expression data: Before supplying the expression data to Allegro, you should pre-process it according to the experimental setting and what type of expression profiles you wish to find. For example, when measuring expression values along a time course, it is usually recommended to compute the log fold-change in each time point relative to the first time point. For expression profiles in various tissues or under different conditions (i.e., when there is no clear base condition), one could standardize the genes' profiles to mean 0 and SD 1, or compute the log ratio relative to the average value. Note that you should NOT filter out genes (e.g., those that did not change in any condition), since Allegro utilizes as many genes as possible to enhance the power of its statistical tests. Pre-processing can be performed using any gene expression analysis software, such as our Expander platform, or by other means (Excel, R, Perl script, etc.).
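
    The two transformations mentioned above can be sketched per gene as follows. This is only an illustration: the function names are hypothetical, base-2 logarithms and the population standard deviation are assumptions, and the right choice depends on your experimental setting.

```python
import math
import statistics


def log_fold_change(profile):
    """Log2 ratio of each time point relative to the first one.
    Assumes strictly positive expression values."""
    base = profile[0]
    return [math.log2(v / base) for v in profile]


def standardize(profile):
    """Shift and scale a gene's profile to mean 0 and SD 1."""
    mean = statistics.fmean(profile)
    sd = statistics.pstdev(profile)
    if sd == 0:
        return [0.0] * len(profile)  # flat profile: nothing to scale
    return [(v - mean) / sd for v in profile]


print(log_fold_change([1.0, 2.0, 4.0]))  # prints [0.0, 1.0, 2.0]
```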


Why "Allegro"?
First, the name Allegro is an acronym of "A Log-Likelihood based Engine for Gene expression Regulatory motifs Over-representation discovery". Second, in music, allegro refers to a quick, lively tempo; the Allegro software allows a fast (and accurate) analysis of gene expression datasets in the context of regulatory motif discovery. Third, Allegro builds upon the Amadeus motif discovery platform, so we wanted the name of our new method to be from the musical domain as well. One of the most popular and recognized classical pieces is the first movement of Wolfgang Amadeus Mozart's Eine kleine Nachtmusik, which is in sonata-allegro form. The background of the Allegro logo shows the sheet music of this piece.