Accurate estimation of expression levels of homologous genes in
RNA-seq experiments
Eran Halperin
Abstract:
Next-generation high-throughput sequencing (NGS) is poised to
replace array-based technologies as the experiment of choice for
measuring RNA expression levels. Several groups have demonstrated the
power of this new approach (RNA-seq), making significant and novel
contributions and simultaneously proposing methodologies for the
analysis of RNA-seq data. In a typical experiment, millions of short
sequences (reads) are sampled from an RNA extract and mapped back to a
reference genome. The number of reads mapping to each gene is used as
a proxy for its corresponding RNA concentration. A significant challenge
in analyzing RNA expression of homologous genes is the large fraction of
the reads that map to multiple locations in the reference genome.
Currently, these reads are either dropped from the analysis, or a naive
algorithm is used to estimate their underlying distribution. In this
work, we present a rigorous alternative for handling the reads generated
in an RNA-seq experiment within a probabilistic model for RNA-seq data;
we develop maximum-likelihood-based methods for estimating the model
parameters. In contrast to previous methods, our model takes into
account the fact that the DNA of the sequenced individual is not a
perfect copy of the reference sequence. We show with both simulated and
real RNA-seq data that our new method improves the accuracy and power of
RNA-seq experiments.
This is joint work with Noah Zaitlen and Bogdan Pasaniuc.
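
To make the multi-mapping problem described in the abstract concrete, the sketch below shows a generic expectation-maximization (EM) procedure that assigns each ambiguous read fractionally to the genes it maps to and re-estimates relative abundances until they stabilize. It is a textbook-style illustration and not the model presented in this talk: it assumes every read maps to at least one gene, and it ignores gene length, sequencing errors, and differences between the sequenced individual and the reference. The function name `em_multiread_abundance` and the toy genes "A" and "B" are hypothetical.

```python
from typing import Dict, List


def em_multiread_abundance(
    reads: List[List[str]], genes: List[str], iters: int = 200
) -> Dict[str, float]:
    """Estimate relative gene abundances from reads that may map to several genes.

    reads[i] lists the genes that read i maps to (assumed non-empty).
    Simplifying assumptions (not from the abstract): equal gene lengths,
    no sequencing errors, no variation relative to the reference.
    """
    # Start from a uniform guess over all genes.
    theta = {g: 1.0 / len(genes) for g in genes}
    n = len(reads)
    for _ in range(iters):
        # E-step: split each read across its candidate genes in
        # proportion to the current abundance estimates.
        counts = {g: 0.0 for g in genes}
        for hits in reads:
            total = sum(theta[g] for g in hits)
            for g in hits:
                counts[g] += theta[g] / total
        # M-step: new abundances are the normalized expected counts.
        theta = {g: counts[g] / n for g in genes}
    return theta


if __name__ == "__main__":
    # Toy data: 30 reads unique to A, 10 unique to B, 20 mapping to both.
    toy_reads = [["A"]] * 30 + [["B"]] * 10 + [["A", "B"]] * 20
    print(em_multiread_abundance(toy_reads, ["A", "B"]))
```

In this toy example, dropping the 20 ambiguous reads would discard a third of the data, whereas the EM estimate uses all 60 reads and converges toward the 3:1 ratio implied by the uniquely mapped reads.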