Exploring the sampling universe of RNA-seq

Stefanie Tauber; Arndt von Haeseler

doi:10.1515/sagmb-2012-0049

Published/Copyright: April 16, 2013

From the journal Statistical Applications in Genetics and Molecular Biology Volume 12 Issue 2

Abstract

How deep is deep enough? Although RNA sequencing is a well-established technology, the sequencing depth required to detect all expressed genes is not known. If we set aside the biological overhead and meta-information, we are dealing with a classical sampling process. Such sampling processes are well known from population genetics and have been thoroughly investigated. Here we use the Pitman Sampling Formula to model the sampling process of RNA sequencing. In doing so, we characterize the sampling by means of two parameters that capture the combined effect of different sequencing technologies, protocols and their associated biases. We distinguish between two levels of sampling: the number of reads per gene and the number of reads starting at each position of a specific gene. The latter allows us to evaluate the theoretical expectation of uniform coverage and the performance of sequencing protocols in that respect. Most importantly, given a pilot sequencing experiment, we provide an estimate of the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.
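
To make the sampling view concrete, the following is a minimal sketch of the standard sequential (urn) scheme associated with the two-parameter Pitman Sampling Formula, read here as "each new read either hits a gene seen before or detects a new one". It is an illustration under stated assumptions, not the estimation procedure of the article: the parameter names `theta` and `sigma`, the function names, and the example values are all chosen for this sketch, and it assumes theta > 0 and 0 <= sigma < 1.

```python
import random


def expected_distinct(n, theta, sigma):
    """Expected number of distinct genes after n reads (assumes theta > 0, 0 <= sigma < 1)."""
    k = 0.0
    for i in range(n):
        # After i reads with k distinct genes on average, the next read detects
        # a new gene with probability (theta + sigma * k) / (theta + i).
        k += (theta + sigma * k) / (theta + i)
    return k


def simulate_reads(n, theta, sigma, seed=0):
    """Draw n reads one at a time; return the read count of each detected gene."""
    rng = random.Random(seed)
    counts = []  # counts[j] = number of reads assigned to the j-th detected gene
    for i in range(n):
        k = len(counts)
        p_new = (theta + sigma * k) / (theta + i)
        if rng.random() < p_new:
            counts.append(1)  # a gene not seen so far
        else:
            # An already detected gene j is chosen with weight counts[j] - sigma.
            weights = [c - sigma for c in counts]
            j = rng.choices(range(k), weights=weights)[0]
            counts[j] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    counts = simulate_reads(10_000, theta=50.0, sigma=0.3, seed=1)
    print("genes detected in simulation:", len(counts))
    print("expected number of genes:", round(expected_distinct(10_000, 50.0, 0.3), 1))
```

With `sigma = 0` the scheme reduces to the one-parameter Ewens case. Comparing `expected_distinct(n + m, ...)` with `expected_distinct(n, ...)` gives a rough prior expectation of how many new genes an additional `m` reads would detect under this model. An estimator conditioned on an observed pilot sample, as discussed in the abstract, would differ from this unconditional expectation.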

Keywords: RNA sequencing; sampling; modeling RNA-seq; deep sequencing; Pitman sampling formula

Corresponding authors: Stefanie Tauber and Arndt von Haeseler: Center for Integrative Bioinformatics, Max F Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
