Abstract
How deep is deep enough? Although RNA-sequencing is a well-established technology, the sequencing depth required to detect all expressed genes is not known. Stripped of its biological overhead and meta-information, RNA-sequencing is a classical sampling process. Such sampling processes are well known from population genetics and have been thoroughly investigated. Here we use the Pitman Sampling Formula to model the sampling process of RNA-sequencing. This characterizes the sampling by two parameters that capture the combined effect of different sequencing technologies, protocols and their associated biases. We distinguish between two levels of sampling: the number of reads per gene, and the number of reads starting at each position of a specific gene. The latter allows us to evaluate the theoretical expectation of uniform coverage and to assess how well sequencing protocols meet it. Most importantly, given a pilot sequencing experiment, we provide an estimate of the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.
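To make the two-parameter sampling model concrete, the following is a minimal sketch of the standard Pitman Sampling Formula and its sequential (urn-style) sampling rule, written in Python. It is an illustration under stated assumptions, not the estimator evaluated in the paper: the parameter names `sigma` and `theta`, all function names, and the restriction to 0 <= sigma < 1 and theta > 0 are choices made here. Each "gene" is a block of the partition and each "read" is one draw.

```python
import math
import random


def log_rising(x, m):
    """Log of the rising factorial x (x+1) ... (x+m-1); equals 0 for m == 0."""
    return sum(math.log(x + i) for i in range(m))


def log_pitman_eppf(counts, sigma, theta):
    """Log-probability of one set partition with the given block sizes.

    counts: reads per gene (block sizes n_1..n_k), each >= 1.
    Assumes 0 <= sigma < 1 and theta > 0 (hypothetical parameter names).
    """
    n = sum(counts)
    k = len(counts)
    log_p = sum(math.log(theta + i * sigma) for i in range(1, k))
    log_p -= log_rising(theta + 1.0, n - 1)
    log_p += sum(log_rising(1.0 - sigma, c - 1) for c in counts)
    return log_p


def prob_new_gene(n, k, sigma, theta):
    """Probability that read n+1 hits a gene not seen among the first n reads."""
    return (theta + k * sigma) / (theta + n)


def simulate_reads(n_reads, sigma, theta, seed=0):
    """Draw reads one at a time; return the list of read counts per gene."""
    rng = random.Random(seed)
    counts = []
    for n in range(n_reads):
        k = len(counts)
        if rng.random() < prob_new_gene(n, k, sigma, theta):
            counts.append(1)  # a previously unseen gene is detected
        else:
            # an already seen gene j is hit with weight n_j - sigma
            weights = [c - sigma for c in counts]
            j = rng.choices(range(k), weights=weights)[0]
            counts[j] += 1
    return counts


if __name__ == "__main__":
    counts = simulate_reads(2000, sigma=0.5, theta=10.0, seed=1)
    print("reads:", sum(counts), "genes detected:", len(counts))
    print("P(next read is a new gene):",
          prob_new_gene(sum(counts), len(counts), 0.5, 10.0))
```

In this sketch, `prob_new_gene` shows why two parameters suffice to describe how quickly new genes keep appearing as depth grows: the chance of a new detection depends only on the number of reads so far, the number of genes seen so far, and the pair (`sigma`, `theta`).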
©2013 by Walter de Gruyter Berlin Boston
Articles in the same Issue
- Approximate Bayesian computation (ABC) gives exact results under the assumption of model error
- Modeling the DNA copy number aberration patterns in observational high-throughput cancer data
- Exploring the sampling universe of RNA-seq
- Detection of epigenetic changes using ANOVA with spatially varying coefficients
- Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery
- Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures
- A novel method for analyzing genetic association with longitudinal phenotypes
- Two optimization strategies of multi-stage design in clinical proteomic studies