Using the TANGO program
----------------------

TANGO is a program for functional annotation of gene sets. It uses
pre-processed tables of genes and their GO annotations, and performs
hyper-geometric enrichment tests between sets of genes that share an
annotation and the sets of genes given as input. Importantly, TANGO
corrects for multiple testing at two levels: across the multiple GO
classes and across the multiple tested sets. It does so by bootstrapping
and estimating the empirical p-value distribution for the evaluated sets.

TANGO is a standalone program available for Linux and Windows.
Precompiled annotation tables are available through the Expander system.
See www.cs.tau.ac.il:~amos for downloading the current version.

1.Running TANGO
------------

You should run TANGO like this:

TANGO parameter_file

TANGO reads the input files specified in the parameter file and writes
its output to another file, which is also specified in the parameter
file. The files and parameters are described below.

2.Precompiled annotation files
----------------------------

TANGO uses preprocessed GO annotation files to map genes to known
annotations. All files are tab-delimited text files. A set of tables
always describes the annotation of a single organism.

varob.txt - maps internal variable ids to external gene identifiers
(ORF, Locuslink etc.)

Field 0: Internal variable id (Number)
Field 1: Variable Name (String)
Field 2: Variable External Key (String)

Example line: 21043    TP53    7157

goclskey.txt - key file containing the names of all annotation
categories. Typically, each category reflects one GO attribute, and the
genes associated with it are all genes that are annotated with this
attribute, or with an attribute that specializes it.

Field 0: Internal annotation category id
Field 1: GO id (or any external id for the annotation source)
Field 2: Category name
Field 3: Number of genes annotated with this category (not used by TANGO)

Example line: 0    GO:0008289    lipid binding    15

clsassoc.txt - this table associates internal variable ids (key to
varob.txt) with annotation categories (key to goclskey.txt)

Field 0: Category id (key to goclskey)
Field 1: Gene id (key to varob)

3.TANGO Input files
----------------

TANGO processes two input files. The first defines the set of genes that
should be considered as the background. Typically this set includes the
entire genome, or only the genes that were printed on the chip used to
generate the clusters/biclusters, or only the genes that survived the
filtering that preceded the analysis that generated the clusters. The
second file defines the actual sets (clusters/biclusters) to annotate.

chip.txt - defines the background set

Field 0: Gene external key (points to field 2 in the varob table). In
other words, a list of locuslink ids (mammals), orf codes (yeast),
flybase ids (fly) etc.

sets.txt - defines the sets to annotate

Field 0: Gene external key (points to field 2 in the varob table)
Field 1: Set Id (serial number for the sets to annotate)

Example:

YOR348C    0
YPL265W    0
YPL274W    0
YAL067C    1
YBL042C    1
YBR021W    1

This defines 2 sets of yeast genes, each with 3 genes.

4.TANGO output file
------------------

TANGO generates a tab-delimited text file including all significant
annotations. The format is as follows:

Field 0: set id (key to sets.txt)
Field 1: annotation name (name from goclskey)
Field 2: uncorrected hyper-geometric p-value (log10)
Field 3: corrected hyper-geometric p-value (log10)
Field 4: fraction of genes in the set annotated with the category
Field 5: number of genes in the set annotated with the category
Field 6: category external id (field 1 in goclskey)

5.TANGO Parameter file
-------------------

TANGO comes with a parameter file that controls the input files it uses,
as well as important parameters.
The file is formatted as an INI file, with "scopes" (bracket-delimited
names on their own lines) and "options" (assignments of values to
parameters in the format option=value). The ordering of options is not
important as long as each option appears below its appropriate scope.

Here is an example of the parameter file; explanations are below:

#file starts here
[Random]
Seed=19
[Tables]
varob=/data/yeast/varob.txt
goclskey=/data/yeast/annots/go/goclskey.txt
clsassoc=/data/yeast/annots/go/clsassoc.txt
ChipOrfs=chip.txt
SetsOrfs=sets.txt
AnnotReport=annots.txt
[TANGO]
BootstrapNum = 1000
MinClsSize=5
MaxClsSize=1000
MinClsInter=4
MaxPvToRep=0.01
FilterRedPVThres = 0.05
#file ends here

Random::Seed - controls the pseudo-random sequence used for
bootstrapping. Running TANGO twice with the same seed and the same data
will generate the SAME results.

Tables::varob - the full path of the varob file (see section 2)

Tables::goclskey - the full path of the goclskey file (see section 2)

Tables::clsassoc - the full path of the clsassoc file (see section 2)

Tables::ChipOrfs - the full path of the chip.txt file (see section 3)

Tables::SetsOrfs - the full path of the sets.txt file (see section 3)

Tables::AnnotReport - the TANGO output file (see section 4)

TANGO::BootstrapNum - the number of bootstraps to perform. The corrected
p-value will always be greater than or equal to 1/BootstrapNum, but since
the output report provides the uncorrected value as well as the corrected
one, 1000 is generally enough. The running time of the program grows
linearly with this value, so choose it carefully.

TANGO::MinClsSize - the minimal size of a category to consider for
annotation. Categories with fewer annotated genes than this number will
not be considered. Use this to save time and reduce the number of
spurious results.

TANGO::MaxClsSize - the maximal size of a category to consider for
annotation. Categories with more annotated genes than this number will
not be considered. Use this to prevent very general annotations (e.g.,
metabolism).

TANGO::MinClsInter - the minimal number of genes that must be both
annotated with the category and part of the annotated set for the
category to be considered for annotation. Setting this to 0 allows
annotations supported by a single gene, which are prone to false
positives. Although these will be corrected by the bootstrap procedure,
we recommend excluding them to increase statistical power.

TANGO::MaxPvToRep - the maximal (uncorrected) p-value to report.

TANGO::FilterRedPVThres - the maximal conditional p-value to consider
when filtering annotations of the same set. TANGO filters results by
performing conditional hyper-geometric tests for one category, assuming
the observed enrichment in the other. Whenever this conditional p-value
is higher than the threshold set by this parameter, TANGO removes the
weaker annotation of the two.
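
For readers who want to sanity-check the uncorrected p-value reported in
field 2 of the output file, the standard hyper-geometric enrichment test
can be sketched as follows. This is a minimal illustration in Python, not
TANGO's own code: the function and parameter names are hypothetical, and
the sketch covers only the uncorrected upper-tail test, not the bootstrap
correction or the conditional filtering described above.

```python
from math import comb, log10


def hypergeom_enrichment_pvalue(background_size, category_size, set_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    background_size: number of genes in the background (chip.txt)
    category_size:   background genes annotated with the category
    set_size:        number of genes in the tested set (one set in sets.txt)
    overlap:         genes in the set that are annotated with the category

    All names are hypothetical; this is not TANGO's implementation.
    """
    if not 0 <= category_size <= background_size:
        raise ValueError("category_size must be between 0 and background_size")
    if not 0 <= set_size <= background_size:
        raise ValueError("set_size must be between 0 and background_size")

    max_overlap = min(category_size, set_size)
    if overlap > max_overlap:
        return 0.0  # such an overlap is impossible

    not_in_category = background_size - category_size
    total = comb(background_size, set_size)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(category_size, k) * comb(not_in_category, set_size - k)
        for k in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total


# Toy numbers (made up): 10 background genes, 4 in the category,
# a set of 3 genes of which 2 are in the category.
p = hypergeom_enrichment_pvalue(10, 4, 3, 2)
print(p, log10(p))
```

With the toy numbers above the p-value is 40/120 = 1/3. The log10 of the
value is printed as well because the TANGO report stores p-values on a
log10 scale.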