Abstract
Conservative statistical tests are often used in complex multiple testing settings in which computing the type I error may be difficult. In such tests, the reported p-value for a hypothesis can understate the evidence against the null hypothesis, and statistical power may consequently be lost. False Discovery Rate adjustments, used in multiple comparison settings, can worsen this effect. We present a computationally efficient and test-agnostic calibration technique that can substantially reduce the conservativeness of such tests. As a consequence, a smaller sample size might be sufficient to reject the null hypothesis for true alternatives, and experimental costs can be lowered. We apply the calibration technique to the results of DESeq, a popular method for detecting differentially expressed genes from RNA sequencing data. The increase in power may be particularly high in small sample size experiments, which are often used in preliminary studies and funding applications.
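To make the power loss described above concrete, the following is a minimal sketch, not the calibration technique of the article. It implements the standard Benjamini-Hochberg step-up adjustment and compares the number of discoveries for a set of p-values against the same set after a toy conservative inflation. The p-values, the doubling used to mimic conservativeness, and the function names `bh_adjust` and `count_discoveries` are all assumptions made for illustration.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_discoveries(pvalues, alpha=0.05):
    """Count hypotheses rejected at FDR level alpha after BH adjustment."""
    return sum(q <= alpha for q in bh_adjust(pvalues))


# Hypothetical p-values: five small ones (signals) and fifteen null-like ones.
exact = [0.0001, 0.0005, 0.001, 0.004, 0.008] + [0.1 + 0.05 * k for k in range(15)]

# Toy conservative test (an assumption): every p-value is doubled, capped at 1.
conservative = [min(1.0, 2 * p) for p in exact]

if __name__ == "__main__":
    print("discoveries, exact p-values:       ", count_discoveries(exact))
    print("discoveries, conservative p-values:", count_discoveries(conservative))
```

In this toy setting the inflated p-values yield fewer discoveries at the same FDR level, which is the kind of loss a calibration step would aim to recover.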
Acknowledgments
The numerical simulations shown in this article were performed on computational resources provided under EU FP7 project EGI-InSPIRE (contract number RI-261323) and described in Atanassov et al. (forthcoming). The authors would like to thank the Associate Editor and the anonymous reviewers for their thoughtful suggestions and comments, which resulted in a significant improvement of this work.
References
Anders, S. and W. Huber (2010): “Differential expression analysis for sequence count data,” Genome Biol., 11, R106+. DOI: 10.1186/gb-2010-11-10-r106.
Anders, S. and W. Huber (2013): “Differential expression of RNA-Seq data at the gene level – the DESeq package,” http://www.bioconductor.org/packages/2.12/bioc/html/DESeq.html.
Atanassov, E., T. Gurov, A. Karaivanova, S. Ivanovska, M. Durchova, D. Georgiev and D. Dimitrov (forthcoming): “Tuning for scalability on hybrid HPC cluster,” Math. Industry.
Benjamini, Y. and Y. Hochberg (1995): “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” J. Roy. Stat. Soc. B Met., 57, 289–300.
Bolouri, H. and W. L. Ruzzo (2012): “Integration of 198 ChIP-seq datasets reveals human cis-regulatory regions,” J. Comput. Biol., 19, 989–997.
Bottomly, D., N. A. Walter, J. E. E. Hunter, P. Darakjian, S. Kawane, K. J. Buck, R. P. Searles, M. Mooney, S. K. McWeeney and R. Hitzemann (2011): “Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays,” PLoS One, 6, e17820+. DOI: 10.1371/journal.pone.0017820.
Brooks, A. N., L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner and B. R. Graveley (2011): “Conservation of an RNA regulatory map between Drosophila and mammals,” Genome Res., 21, 193–202.
Burgess, D. (2013): “Genetic screens: RNA-seq into the toolkit,” Nat. Rev. Genet., 14, 154–155. DOI: 10.1038/nrg3432.
Cantor, R. M., K. Lange and J. S. Sinsheimer (2010): “Prioritizing GWAS results: a review of statistical methods and recommendations for their application,” Am. J. Hum. Genet., 86, 6–22.
Casella, G. and R. L. Berger (2002): Statistical inference, Thomson learning, 2nd edition, Duxbury Press: Belmont, CA.
Devlin, B. and K. Roeder (1999): “Genomic control for association studies,” Biometrics, 55, 997–1004. DOI: 10.1111/j.0006-341X.1999.00997.x.
Efron, B. (2008): “Microarrays, empirical Bayes and the two-groups model,” Stat. Sci., 23, 1–22.
Fodor, A. A., T. L. Tickle and C. Richardson (2007): “Towards the uniform distribution of null p values on Affymetrix microarrays,” Genome Biol., 8, R69.
Frazee, A., B. Langmead and J. Leek (2011): “ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets,” BMC Bioinformatics, 12, 449+. DOI: 10.1186/1471-2105-12-449.
Furey, T. S. (2012): “ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions,” Nat. Rev. Genet., 13, 840–852.
Hindorff, L. A., P. Sethupathy, H. A. Junkins, E. M. Ramos, J. P. Mehta, F. S. Collins and T. A. Manolio (2009): “Potential etiologic and functional implications of genome-wide association loci for human diseases and traits,” Proc. Natl. Acad. Sci., 106, 9362–9367.
Hubbard, R. and M. Bayarri (2003): “Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing,” Am. Stat., 57, 171–178.
Kvam, V. M., P. Liu and Y. Si (2012): “A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data,” Am. J. Botany, 99, 248–256. DOI: 10.3732/ajb.1100340.
Lehmann, E. L. and J. P. Romano (2005): Testing statistical hypotheses (Springer Texts in Statistics), Springer: New York, NY.
Mardis, E. R. (2007): “ChIP-seq: welcome to the new frontier,” Nat. Methods, 4, 613–613.
McCarthy, M. I., G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. Ioannidis and J. N. Hirschhorn (2008): “Genome-wide association studies for complex traits: consensus, uncertainty and challenges,” Nat. Rev. Genet., 9, 356–369.
Mortazavi, A., B. Williams, K. McCue, L. Schaeffer and B. Wold (2008): “Mapping and quantifying mammalian transcriptomes by RNA-Seq,” Nat. Methods, 5, 621–628.
Muralidharan, O., G. Natsoulis, J. Bell, H. Ji and N. R. Zhang (2012): “Detecting mutations in mixed sample sequencing data using empirical Bayes,” Ann. Appl. Stat., 6, 1047–1067.
Price, A. L., N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick and D. Reich (2006): “Principal components analysis corrects for stratification in genome-wide association studies,” Nat. Genet., 38, 904–909.
R Core Team (2013): “R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria,” http://www.R-project.org/.
Sellke, T., M. J. Bayarri and J. O. Berger (2001): “Calibration of p values for testing precise null hypotheses,” Am. Stat., 55, 62–71.
Skol, A. D., L. J. Scott, G. R. Abecasis and M. Boehnke (2006): “Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies,” Nat. Genet., 38, 209–213.
Soneson, C. and M. Delorenzi (2013): “A comparison of methods for differential expression analysis of RNA-seq data,” BMC Bioinformatics, 14, 91+. DOI: 10.1186/1471-2105-14-91.
Wang, W. Y., B. J. Barratt, D. G. Clayton and J. A. Todd (2005): “Genome-wide association studies: theoretical and practical concerns,” Nat. Rev. Genet., 6, 109–118.
Wang, Z., M. Gerstein and M. Snyder (2009): “RNA-Seq: a revolutionary tool for transcriptomics,” Nat. Rev. Genet., 10, 57–63.
Wasserman, L. (2006): All of nonparametric statistics, Springer: New York, NY.
Supplemental Material
The online version of this article (DOI: 10.1515/sagmb-2013-0074) offers supplementary material, available to authorized users.
©2014 by De Gruyter
Articles in the same Issue
- Frontmatter
- Research Articles
- When is Menzerath-Altmann law mathematically trivial? A new approach
- Covariate adjusted differential variability analysis of DNA methylation with propensity score method
- P-value calibration for multiple testing problems in genomics
- Robust methods to detect disease-genotype association in genetic association studies: calculate p-values using exact conditional enumeration instead of simulated permutations or asymptotic approximations
- Markovianness and conditional independence in annotated bacterial DNA
- Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories
- Corrigendum
- Biological pathway selection through Bayesian integrative modeling