SELMAP - high-throughput MITOMI-SELEX

SELMAP technology was developed by Rada Golodintsky, Dana Chen and Dorit Avrahami in Doron Gerber's lab at Bar-Ilan University.

The software to analyze the data was developed by Yaron Orenstein in Ron Shamir's Computational Genomics group at Tel Aviv University.

Get the data

all_data.zip

The file contains all the sequencing files produced and analyzed in the study in fastq.gz format.

all_processed_data.zip

The file contains the processed data in the form of 6-mer scores and PWM files.

Get the software

Java executable distribution (SELEX.jar)

Java executable distribution (SELEX-est.jar)

These Java distributions are the officially supported executables for SELMAP. Each binary is self-contained and should work out of the box.

The software is freely available under the GNU Lesser General Public License, version 3, or any later version at your option.

SELMAP is a research tool, still in the development stage. Hence, it is not presented as error-free, accurate, complete, useful, suitable for any specific application or free from any infringement of any rights. The Software is licensed AS IS, entirely at the user's own risk.

How to use it

java -jar SELEX.jar <k> <barcode> <length> <output_file> <output_file_RC> <output_file_pwms> <seed> <cycle_0_file> <cycle_1_file> ...

SELEX-est takes the same arguments and additionally computes a k-mer ratio score based on estimated k-mer frequencies in the initial round.

Example runs:

java -jar SELEX.jar 6 TAGCTC 18 Pho4_Full_6mers.txt Pho4_Full_6mers_RC.txt Pho4_Full_pwm.txt CACGTG Library1_Cycle0_barcode_TAGCTC.fastq Pho4_Cycle1_Full_Chip_lib1.fastq Pho4_Cycle2_Full_Chip_lib1.fastq Pho4_Cycle3_Full_Chip_lib1.fastq
java -jar SELEX.jar 6 TAGCTC 18 Pho4_Library1_6mers.txt Pho4_Library1_6mers_RC.txt Pho4_Library1_pwm.txt CACGTG Library1_Cycle0_barcode_TAGCTC.fastq Pho4_2libraries_cycle2.fastq
java -jar SELEX.jar 6 ACTGAA 18 Pho4_Library2_6mers.txt Pho4_Library2_6mers_RC.txt Pho4_Library2_pwm.txt CACGTG Library2_Cycle0_barcode_ACTGAA.fastq Pho4_2libraries_cycle2.fastq
java -jar SELEX.jar 6 TAGCTC 18 atERF2_Library1_6mers.txt atERF2_Library1_6mers_RC.txt atERF2_Library1_pwm.txt GCCGCC Library1_Cycle0_barcode_TAGCTC.fastq atERF2_2libraries_cycle2.fastq
java -jar SELEX.jar 6 ACTGAA 18 atERF2_Library2_6mers.txt atERF2_Library2_6mers_RC.txt atERF2_Library2_pwm.txt GCCGCC Library2_Cycle0_barcode_ACTGAA.fastq atERF2_2libraries_cycle2.fastq atERF2_2libraries_cycle3.fastq
java -Xmx4096m -jar SELEX-est.jar 10 TAGCTC 18 BTD_results_10mers.txt BTD_results_10mers_rc.txt BTD_results_10mers_pwm.txt CGGGCGCGCC Library1_Cycle0_barcode_TAGCTC.fastq BTD-rnd-1_S15_L001_001.fastq.18 BTD-rnd-2_S16_L001_001.fastq.18 BTD-rnd-3_S17_L001_001.fastq.18
java -Xmx4096m -jar SELEX-est.jar 10 TAGCTC 18 Pho4_Full_10mers.txt Pho4_Full_10mers_RC.txt Pho4_Full_10pwm.txt CCCACGTGGG Library1_Cycle0_barcode_TAGCTC.fastq Pho4_Cycle1_Full_Chip_lib1.fastq Pho4_Cycle2_Full_Chip_lib1.fastq Pho4_Cycle3_Full_Chip_lib1.fastq
java -Xmx4096m -jar SELEX-est.jar 10 ACTGAA 18 atERF2_Library2_10mers.txt atERF2_Library2_10mers_RC.txt atERF2_Library2_10pwm.txt CTGCGCCGCC Library2_Cycle0_barcode_ACTGAA.fastq atERF2_2libraries_cycle2.fastq atERF2_2libraries_cycle3.fastq
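
When many proteins or libraries are processed, it can help to assemble these command lines programmatically. The following Python sketch only mirrors the argument order of the usage line above; the function name build_selex_command and its parameter names are hypothetical and are not part of SELMAP. The returned list could be passed to subprocess.run.

```python
from typing import List, Optional, Sequence


def build_selex_command(
    jar: str,
    k: int,
    barcode: str,
    length: int,
    out_kmers: str,
    out_kmers_rc: str,
    out_pwms: str,
    seed: str,
    cycle_files: Sequence[str],
    max_heap_mb: Optional[int] = None,
) -> List[str]:
    """Return the argv list for one SELEX.jar or SELEX-est.jar run.

    Hypothetical helper: it follows the documented argument order
    <k> <barcode> <length> <output_file> <output_file_RC>
    <output_file_pwms> <seed> <cycle_0_file> <cycle_1_file> ...
    """
    # Assumption: the usage line shows a cycle 0 file plus at least one later cycle.
    if len(cycle_files) < 2:
        raise ValueError("expected the cycle 0 file and at least one later cycle file")
    cmd = ["java"]
    if max_heap_mb is not None:
        # Standard JVM heap option, as used in the SELEX-est examples (-Xmx4096m).
        cmd.append(f"-Xmx{max_heap_mb}m")
    cmd += ["-jar", jar, str(k), barcode, str(length),
            out_kmers, out_kmers_rc, out_pwms, seed]
    cmd += list(cycle_files)
    return cmd


# Rebuild one of the example runs listed above.
example_cmd = build_selex_command(
    "SELEX.jar", 6, "TAGCTC", 18,
    "Pho4_Library1_6mers.txt", "Pho4_Library1_6mers_RC.txt",
    "Pho4_Library1_pwm.txt", "CACGTG",
    ["Library1_Cycle0_barcode_TAGCTC.fastq", "Pho4_2libraries_cycle2.fastq"],
)
print(" ".join(example_cmd))
```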

Interpreting the output

The 6-mer scores output should look like this:
Kmer    Count_0 Freq_0  Count_1 Freq_1  Ratio_1 Ratio_0 Count_2 Freq_2  Ratio_2 Ratio_0 Count_3 Freq_3  Ratio_3 Ratio_0
AAAAAA  342     1.4E-4  208     1.3E-4  0.928   0.928   235     1.2E-4  0.918   0.851   90      3.3E-5  0.264   0.225
CAAAAA  380     1.6E-4  218     1.4E-4  0.875   0.875   271     1.4E-4  1.010   0.884   184     6.8E-5  0.469   0.415
GAAAAA  434     1.8E-4  237     1.5E-4  0.833   0.833   294     1.5E-4  1.008   0.839   136     5.0E-5  0.320   0.268
TAAAAA  452     1.9E-4  277     1.8E-4  0.935   0.935   301     1.6E-4  0.882   0.825   128     4.7E-5  0.294   0.242
ACAAAA  399     1.7E-4  273     1.8E-4  1.044   1.044   302     1.6E-4  0.898   0.938   233     8.6E-5  0.533   0.500
...
Each line starts with a k-mer string, corresponding to a unique k-mer.
It then lists the statistics of that k-mer in each cycle, which serve as binding scores.
Count_i = k-mer count in cycle i.
Freq_i = k-mer frequency in cycle i.
Ratio_i = Freq_i / Freq_(i-1), the enrichment relative to the previous cycle.
Ratio_0 = Freq_i / Freq_0, the enrichment relative to cycle 0; this column is repeated after each Ratio_i.
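
As an illustration of how this table can be consumed, the Python sketch below reads the whitespace-separated columns by position (the Ratio_0 header appears once per cycle, so column names alone are ambiguous) and ranks k-mers by Freq_i / Freq_0 in the last cycle. The names CycleStats, parse_kmer_scores and top_kmers are hypothetical, and the layout is assumed to match the sample shown above: two columns for cycle 0, then four columns for each later cycle.

```python
from typing import Dict, Iterable, List, NamedTuple


class CycleStats(NamedTuple):
    count: int         # Count_i
    freq: float        # Freq_i
    ratio_prev: float  # Ratio_i = Freq_i / Freq_(i-1)
    ratio_init: float  # Ratio_0 = Freq_i / Freq_0


def parse_kmer_scores(lines: Iterable[str]) -> Dict[str, List[CycleStats]]:
    """Map each k-mer to its statistics for cycles 1, 2, ... (cycle 0 columns are skipped)."""
    scores: Dict[str, List[CycleStats]] = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header row, and any "..." placeholder.
        if len(fields) < 3 or fields[0] == "Kmer":
            continue
        rest = fields[3:]  # drop Kmer, Count_0, Freq_0
        if len(rest) % 4 != 0:
            raise ValueError(f"unexpected column count for k-mer {fields[0]}")
        scores[fields[0]] = [
            CycleStats(int(rest[i]), float(rest[i + 1]),
                       float(rest[i + 2]), float(rest[i + 3]))
            for i in range(0, len(rest), 4)
        ]
    return scores


def top_kmers(scores: Dict[str, List[CycleStats]], n: int = 3) -> List[str]:
    """Return the n k-mers with the highest Freq_i / Freq_0 in the last cycle."""
    return sorted(scores, key=lambda k: scores[k][-1].ratio_init, reverse=True)[:n]


# The sample rows shown above.
KMER_SAMPLE = """\
Kmer Count_0 Freq_0 Count_1 Freq_1 Ratio_1 Ratio_0 Count_2 Freq_2 Ratio_2 Ratio_0 Count_3 Freq_3 Ratio_3 Ratio_0
AAAAAA 342 1.4E-4 208 1.3E-4 0.928 0.928 235 1.2E-4 0.918 0.851 90 3.3E-5 0.264 0.225
CAAAAA 380 1.6E-4 218 1.4E-4 0.875 0.875 271 1.4E-4 1.010 0.884 184 6.8E-5 0.469 0.415
GAAAAA 434 1.8E-4 237 1.5E-4 0.833 0.833 294 1.5E-4 1.008 0.839 136 5.0E-5 0.320 0.268
TAAAAA 452 1.9E-4 277 1.8E-4 0.935 0.935 301 1.6E-4 0.882 0.825 128 4.7E-5 0.294 0.242
ACAAAA 399 1.7E-4 273 1.8E-4 1.044 1.044 302 1.6E-4 0.898 0.938 233 8.6E-5 0.533 0.500
"""

kmer_scores = parse_kmer_scores(KMER_SAMPLE.splitlines())
print(top_kmers(kmer_scores))  # ['ACAAAA', 'CAAAAA', 'GAAAAA']
```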

The PWM output should look like this:

A:	0.183147132396698	0.22810086607933044	0.11032721400260925	0.8135622143745422	0.9158732891082764	. . .	
C:	0.2849732041358948	0.14351284503936768	0.32613304257392883	0.028032639995217323	0.017520010471343994	. . .
G:	0.26065367460250854	0.10847616195678711	0.21236343681812286	0.11921338737010956	0.04402954876422882	. . .
T:	0.2712264955043793	0.5199114084243774	0.35117655992507935	0.03919265791773796	0.022574998438358307	. . .

Each line is of the form nucleotide: [tab] probability_pos_1 [tab] probability_pos_2 [tab] . . .

nucleotide
The line contains probabilities of this nucleotide in all positions.

probability_pos_i
The probability of the nucleotide in position i.
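
A PWM file in this layout can be loaded with a few lines of Python. The sketch below is illustrative only: parse_pwm and consensus are hypothetical helper names, the sample holds just the first five positions shown above, and taking the most probable nucleotide at each position is one simple way to summarize a PWM rather than something SELMAP itself outputs.

```python
from typing import Dict, Iterable, List


def parse_pwm(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Read lines of the form 'A:<tab>p1<tab>p2...' into {nucleotide: [p1, p2, ...]}."""
    pwm: Dict[str, List[float]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, _, values = line.partition(":")
        pwm[label.strip()] = [float(v) for v in values.split()]
    return pwm


def consensus(pwm: Dict[str, List[float]]) -> str:
    """Pick the most probable nucleotide at each position."""
    length = len(pwm["A"])
    return "".join(
        max("ACGT", key=lambda nt: pwm[nt][i]) for i in range(length)
    )


# First five positions of the sample PWM shown above.
PWM_SAMPLE = (
    "A:\t0.183147132396698\t0.22810086607933044\t0.11032721400260925\t0.8135622143745422\t0.9158732891082764\n"
    "C:\t0.2849732041358948\t0.14351284503936768\t0.32613304257392883\t0.028032639995217323\t0.017520010471343994\n"
    "G:\t0.26065367460250854\t0.10847616195678711\t0.21236343681812286\t0.11921338737010956\t0.04402954876422882\n"
    "T:\t0.2712264955043793\t0.5199114084243774\t0.35117655992507935\t0.03919265791773796\t0.022574998438358307\n"
)

sample_pwm = parse_pwm(PWM_SAMPLE.splitlines())
print(consensus(sample_pwm))  # CTTAA
```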

Citing SELMAP

SELMAP can be cited as follows:
SELMAP - SELEX Affinity Landscape Mapping of transcription factor binding sites using integrated microfluidics
Dana Chen*, Yaron Orenstein*, Rada Golodintsky, Chaim Wachtel, Michal Pellach, Dorit Avrahami, Avital Ovadia-Shochat, Hila Shir-Shapira, Adi Kedmi, Tamar Juven-Gershon, Ron Shamir and Doron Gerber.
* Authors contributed equally to the work.
Scientific Reports (2016).

Get in touch

For questions or suggestions, please feel free to contact Yaron Orenstein.