
File Formats

 

Expression data file format:

1) Suffix: no limitations. 
2) Separating token: tab delimiter.
3) Format:

1st line: contains a string like ‘probeId’ and a tab delimiter, followed by a string like ‘geneSymbol’ and a tab delimiter, followed by the names of all conditions separated by tab delimiters. The symbol column is optional – if the file does not contain a symbol column, please specify it in the Advanced Input Dialog box (see Input Data section).

2nd line (optional): contains the string ‘>SERIES’ and a tab delimiter, followed by the string ‘SYMBOL’ and a tab delimiter (if there is a symbol column), followed by the series names of the conditions (one series assigned to each condition), separated by tab delimiters.

Next lines: Each subsequent line consists of the probe ID (an identifier string that is unique to each probe on the chip), followed by a string representing the full gene name (if the name is missing, the field can be left empty, but its tab delimiter must still be present), followed by its expression values (all tab delimited). If the expression file contains missing values, Expander either replaces them with a preset value (0 by default) or estimates them using the KNN (K-Nearest Neighbors) method, depending on the user selection in the data load dialog box.

*For examples, see the files ‘expressionData1.txt’ and ‘expressionData2.txt’ in the Expander/sample_input_files/ directory.
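The layout described above can be sketched as a small reader. This is only an illustration of the file format, not Expander's own loader: the function name `parse_expression`, its parameters and the sample data are hypothetical, and missing values are simply replaced by a preset value (the KNN option is not shown).

```python
import io

def parse_expression(handle, has_symbol=True, missing_value=0.0):
    """Read a tab-delimited expression file; return (conditions, series, rows)."""
    lines = [ln.rstrip("\r\n") for ln in handle if ln.strip()]
    first = 2 if has_symbol else 1            # index of the first condition column
    conditions = lines[0].split("\t")[first:]
    series = None
    body = lines[1:]
    if body and body[0].startswith(">SERIES"):  # optional 2nd line
        series = body[0].split("\t")[first:]
        body = body[1:]
    rows = {}
    for ln in body:
        fields = ln.split("\t")
        symbol = fields[1] if has_symbol else ""
        # Empty fields are treated as missing values (assumption of this sketch).
        values = [float(v) if v.strip() else missing_value for v in fields[first:]]
        values += [missing_value] * (len(conditions) - len(values))
        rows[fields[0]] = (symbol, values)
    return conditions, series, rows

# Hypothetical sample content, not taken from the supplied sample files.
SAMPLE_EXPRESSION = (
    "probeId\tgeneSymbol\tcond1\tcond2\n"
    ">SERIES\tSYMBOL\tcontrol\ttreated\n"
    "p1\tTP53\t1.5\t2.5\n"
    "p2\t\t3.0\t\n"
)

conditions, series, rows = parse_expression(io.StringIO(SAMPLE_EXPRESSION))
print(conditions, series, rows)
```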

If the data is not in the above format, it may be possible to load it using the ‘Advanced’ dialog box, which appears upon pressing the ‘Advanced’ button in the Expression Data load dialog box (see Advanced Input Dialog box in Input Data section).

 

Expression data with detection calls file format:

1) Suffix: no limitations. 
2) Separating token: tab delimiter.
3) Format:

1st line: contains a string like ‘probeId’ and a tab delimiter, followed by a string like ‘geneSymbol’ and a tab delimiter, followed by the titles of the condition columns and the detection call columns, alternately, separated by tab delimiters. Each condition title is followed by the title of its detection column.

The symbol column is optional – if the file does not contain a symbol column, please specify it in the Advanced Input Dialog box (see Input Data section).

Next lines: Each subsequent line consists of the probe ID (an identifier string that is unique to each probe on the chip), followed by a string representing the full gene name (if the name is missing, the field can be left empty, but its tab delimiter must still be present), followed by its expression values and detection call values, alternately (all tab delimited). Each expression value is followed by its detection call (P, M or A). If the expression file contains missing values, Expander either replaces them with a preset value (0 by default) or estimates them using the KNN (K-Nearest Neighbors) method, depending on the user selection in the data load dialog box.

*For an example, see the file ‘expressionWithDetection.txt’ in the Expander/sample_input_files/ directory.
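As a rough illustration of the alternating value/detection layout, the sketch below splits each line into expression values and detection calls. The function name `parse_with_detection` and the sample data are hypothetical, and the sketch assumes that the symbol column is present and that no values are missing.

```python
import io

VALID_CALLS = ("P", "M", "A")

def parse_with_detection(handle):
    """Return (conditions, rows), where rows maps probe ID -> (symbol, values, calls)."""
    lines = [ln.rstrip("\r\n") for ln in handle if ln.strip()]
    header = lines[0].split("\t")
    conditions = header[2::2]        # detection column titles sit at header[3::2]
    rows = {}
    for ln in lines[1:]:
        fields = ln.split("\t")
        pairs = fields[2:]
        values = [float(v) for v in pairs[0::2]]   # every other field, from the first
        calls = pairs[1::2]                        # the detection call after each value
        bad = [c for c in calls if c not in VALID_CALLS]
        if bad:
            raise ValueError(f"unexpected detection call(s) {bad} for probe {fields[0]}")
        rows[fields[0]] = (fields[1], values, calls)
    return conditions, rows

# Hypothetical sample content.
SAMPLE_DETECTION = (
    "probeId\tgeneSymbol\tc1\tc1_det\tc2\tc2_det\n"
    "p1\tTP53\t10.0\tP\t2.0\tA\n"
)

print(parse_with_detection(io.StringIO(SAMPLE_DETECTION)))
```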

If the data is not in the above format, it may be possible to load it using the ‘Advanced’ dialog box, which appears upon pressing the ‘Advanced’ button in the Expression Data load dialog box (see Advanced Input Dialog box in Input Data section).

 

Gene Sets file format:

1) Suffix: no limitations

2) Format: Each line contains a gene ID, a gene symbol (optional) and the name/number of its set (separated by tabs/spaces). The gene IDs are expected to be of the same convention used in the GO annotation and TF fingerprint files.  For details regarding the Gene ID convention that is used for each organism, refer to the Supplied files section.

*For an example, see the file ‘geneSetsData1.txt’ under the Expander/sample_input_files/ directory (see Sample input files for more details).
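A minimal sketch of reading such a file is given below. It assumes that the first field is always the gene ID, that the last field is the set name/number, and that set names contain no spaces; the function name `parse_gene_sets` and the sample IDs are illustrative only.

```python
from collections import defaultdict

def parse_gene_sets(text):
    """Map each set name/number to the list of gene IDs assigned to it."""
    sets = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()          # splits on tabs and spaces alike
        if len(fields) < 2:
            continue                   # skip blank or incomplete lines
        gene_id, set_name = fields[0], fields[-1]
        sets[set_name].append(gene_id) # an optional symbol in the middle is ignored
    return dict(sets)

# Hypothetical sample content.
SAMPLE_GENE_SETS = "7157\tTP53\tset1\n672 BRCA1 set1\n1017\tset2\n"

print(parse_gene_sets(SAMPLE_GENE_SETS))
```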

Gene Rank file format:

1) Suffix: no limitations

2) Format: Each line contains a gene ID and a rank number (separated by tab/space), where the top-ranked gene has rank 1. The gene IDs are expected to be of the same convention used in the Gene Set Enrichment Analysis.  For details regarding the Gene ID convention that is used for each organism, refer to the Supplied files section.
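The following sketch reads a rank file and lists the top-ranked genes. It assumes that ranks are integers; the names `parse_gene_ranks` and `top_genes` and the sample IDs are hypothetical.

```python
def parse_gene_ranks(text):
    """Map gene ID -> rank, where rank 1 is the top-ranked gene."""
    ranks = {}
    for line in text.splitlines():
        fields = line.split()          # tab or space separated
        if len(fields) != 2:
            continue                   # skip blank or malformed lines
        ranks[fields[0]] = int(fields[1])
    return ranks

def top_genes(ranks, n):
    """Return the n gene IDs with the smallest rank numbers."""
    return sorted(ranks, key=ranks.get)[:n]

# Hypothetical sample content.
SAMPLE_RANKS = "672 2\n7157\t1\n1017 3\n"

print(top_genes(parse_gene_ranks(SAMPLE_RANKS), 2))
```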

ChIP-Seq file format:

1)   Suffix: BED or GFF3

2)   Format: Please refer to the following links explaining the formats:

·    BED

·    GFF3 – note that the "Score" field is a Q-value, which can either range between 0 and 1 or be given as –log10 values.

Note that the files should not contain any headers and should contain only the peaks data.
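As a hedged illustration of preparing such a file, the sketch below drops lines that are commonly used as headers in these formats ('#' comment and directive lines, and BED 'track' and 'browser' lines) and converts between a Q-value and its –log10 form. The function names and sample data are hypothetical, and whether a given file needs this clean-up depends on the tool that produced it.

```python
import math

def strip_header_lines(lines):
    """Keep only data lines: drop blanks, '#' lines and BED 'track'/'browser' lines."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.split()[0] in ("track", "browser"):
            continue
        kept.append(line.rstrip("\r\n"))
    return kept

def neg_log10_to_q(score):
    """Convert a -log10(Q) score back to a Q-value in the 0-1 range."""
    return 10.0 ** (-score)

def q_to_neg_log10(q):
    """Convert a Q-value (0 < q <= 1) to its -log10 form."""
    return -math.log10(q)

# Hypothetical sample content.
SAMPLE_PEAKS = [
    "track name=peaks",
    "# a comment line",
    "chr1\t100\t200\tpeak1\t0.01",
    "",
    "chr2\t300\t400\tpeak2\t0.05",
]

print(strip_header_lines(SAMPLE_PEAKS))
```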

 

Probes Filter file format:

Each line contains a single identifier. Identifiers can be probe IDs, gene IDs or gene symbols (but not a mixture of these identifier types).

 

ID conversion file format:

1) Suffix: Currently, there are no limitations regarding the file name suffix.

2) Format: Each line contains the probe ID as it appears in the data file, a tab separator and the corresponding gene ID (e.g. Entrez/Locus-Link IDs for mouse and human genes and ORF codes for yeast).  The second field can be left blank, indicating no conversion for that probe ID.

* It is possible that several probe IDs in the data file will be mapped to the same gene ID (e.g. several ESTs from the same gene).
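The mapping described above, including blank second fields and several probes sharing one gene ID, can be sketched as follows. The function names and sample IDs are hypothetical.

```python
from collections import defaultdict

def parse_id_conversion(text):
    """Map probe ID -> gene ID; probes with a blank gene field are left out."""
    probe_to_gene = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        probe, _, gene = line.partition("\t")
        gene = gene.strip()
        if gene:                       # blank second field: no conversion
            probe_to_gene[probe.strip()] = gene
    return probe_to_gene

def probes_by_gene(probe_to_gene):
    """Group probe IDs that map to the same gene ID."""
    grouped = defaultdict(list)
    for probe, gene in probe_to_gene.items():
        grouped[gene].append(probe)
    return dict(grouped)

# Hypothetical sample content: p2 has no conversion, p1 and p3 share a gene.
SAMPLE_CONVERSION = "p1\t7157\np2\t\np3\t7157\n"

mapping = parse_id_conversion(SAMPLE_CONVERSION)
print(mapping, probes_by_gene(mapping))
```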

 

Clustering files format:

1) Suffix: no limitations.

2) Format: Each line contains the probeID, a tab separator and name/number of its cluster. The number 0 is reserved for probes that are left unclustered. The file does not have to contain all probes in the data. If a probe does not appear in the file, it is automatically set as unclustered.  

*For an example, see the file ‘expressionData1Clustering.sol’ (a clustering solution for the data file ‘expressionData1.txt’) under the Expander/sample_input_files/ directory (see Sample Input Files section for more details).
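The rule that absent probes count as unclustered can be sketched as below. The function name `load_clustering` and the sample data are hypothetical; cluster names are kept as strings, so the reserved value appears as "0".

```python
UNCLUSTERED = "0"   # reserved for probes that are left unclustered

def load_clustering(text, all_probes):
    """Return probe ID -> cluster name/number for every probe in all_probes."""
    assigned = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        probe, cluster = line.split("\t", 1)
        assigned[probe] = cluster.strip()
    # Probes that do not appear in the file default to unclustered.
    return {probe: assigned.get(probe, UNCLUSTERED) for probe in all_probes}

# Hypothetical sample content: p4 is in the data but not in the clustering file.
SAMPLE_CLUSTERING = "p1\t1\np2\t2\np3\t0\n"

print(load_clustering(SAMPLE_CLUSTERING, ["p1", "p2", "p3", "p4"]))
```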

 

Biclustering files format:

1) Suffix: `.bic`.
2) Format: the file is composed of two parts, described below.

Part 1 presents a summary of the biclusters found.

·    It begins with the string `[Bick]` in the first line.

·    The following lines contain the bicluster's ID followed by its score, separated by a tab delimiter (one line for each bicluster).

Part 2 presents the probesets and the conditions contained in each bicluster.

·    It begins with the string `[Bicd]` in the first line.

·    The following lines contain the bicluster ID, the type of element ('0' for condition, '1' for probe) and the element ID (name of condition or probe ID), separated by tab delimiters.
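A minimal sketch of reading the two parts is shown below. The function name `parse_bic` and the sample content are hypothetical, and the sketch assumes that scores are numeric.

```python
def parse_bic(text):
    """Return (scores, members) parsed from the two parts of a .bic file."""
    scores = {}     # bicluster ID -> score (part 1)
    members = {}    # bicluster ID -> {"conditions": [...], "probes": [...]} (part 2)
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in ("[Bick]", "[Bicd]"):
            section = line
            continue
        fields = line.split("\t")
        if section == "[Bick]":
            scores[fields[0]] = float(fields[1])
        elif section == "[Bicd]":
            bic_id, kind, element = fields[:3]
            entry = members.setdefault(bic_id, {"conditions": [], "probes": []})
            # '0' marks a condition, '1' marks a probe.
            entry["conditions" if kind == "0" else "probes"].append(element)
    return scores, members

# Hypothetical sample content.
SAMPLE_BIC = (
    "[Bick]\n1\t12.5\n2\t8.0\n"
    "[Bicd]\n1\t0\tcond1\n1\t1\tp1\n1\t1\tp2\n2\t0\tcond2\n2\t1\tp3\n"
)

print(parse_bic(SAMPLE_BIC))
```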

 

Background set files format:

1) Suffix: no limitation.

2) Format: each line should contain one gene ID. The gene IDs are expected to be of the same convention used in the annotation and TF fingerprint files for the organism you are working on (please refer to the Supplied Files section).

 

Gene annotations/categories files format (for the general enrichment analysis):

1) Suffix: no limitation.

2) Format: each line should contain one gene ID and an annotation/category name separated by a tab delimiter. The gene IDs are expected to be of the same convention used in the annotation and TF fingerprint files for the organism you are working on (please refer to the Supplied Files section).
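For illustration, such a file can be grouped by category as in the sketch below. The function name `parse_annotations`, the sample IDs and the category names are hypothetical.

```python
from collections import defaultdict

def parse_annotations(text):
    """Map each annotation/category name to the set of gene IDs carrying it."""
    categories = defaultdict(set)
    for line in text.splitlines():
        if "\t" not in line:
            continue                   # skip blank or malformed lines
        gene_id, category = line.split("\t", 1)
        categories[category.strip()].add(gene_id.strip())
    return dict(categories)

# Hypothetical sample content; category names may contain spaces.
SAMPLE_ANNOTATIONS = "7157\tcell cycle\n672\tDNA repair\n1017\tcell cycle\n"

print(parse_annotations(SAMPLE_ANNOTATIONS))
```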

 

 

 

