Abstract
Gene Set Enrichment Analysis (GSEA) is a basic tool for the treatment of genomic data. Its test statistic is based on a cumulated weight function, and the distribution of that statistic under the null hypothesis is evaluated by Monte-Carlo simulation. Here, it is proposed to subtract from the cumulated weight function its asymptotic expectation, and then to scale the result. Under the null hypothesis, the convergence in distribution of the new test statistic is proved, using the theory of empirical processes. The limiting distribution needs to be computed only once, and can then be used for many different gene sets, which results in large savings in computing time. The test defined in this way is called the Weighted Kolmogorov Smirnov (WKS) test. Using expression data from the GEO repository, tested against the MSig Database C2, the classical GSEA test and the new procedure have been compared. Our conclusion is that, beyond its mathematical and algorithmic advantages, the WKS test could in many cases be more informative than the classical GSEA test.
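The centering-and-scaling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`wks_statistic`, `simulate_null`, `p_value`, `weight`, `cum_weight`) is hypothetical. It assumes that the genes of a set are represented by their relative ranks in (0, 1], that the weight function g is nonnegative with a known integral G playing the role of the asymptotic expectation, that the scaling factor is the square root of the set size, and that the statistic is the supremum of the absolute centered process. The null distribution is approximated here by simulating uniform ranks for one fixed set size, as a stand-in for the limiting distribution.

```python
import bisect
import math
import random


def wks_statistic(positions, weight, cum_weight):
    """Return sqrt(n) * sup_t |S_n(t) - G(t)| for relative ranks in (0, 1].

    S_n is the cumulated weight of the gene set divided by its size n,
    and G (cum_weight) is the integral of the weight function g (weight).
    Assumes g >= 0, so the supremum is reached at a jump or at t = 1.
    """
    n = len(positions)
    running = 0.0
    best = 0.0
    for u in sorted(positions):
        # Deviation just before the jump at u.
        best = max(best, abs(running / n - cum_weight(u)))
        running += weight(u)
        # Deviation just after the jump at u.
        best = max(best, abs(running / n - cum_weight(u)))
    # Deviation at the right end of the interval.
    best = max(best, abs(running / n - cum_weight(1.0)))
    return math.sqrt(n) * best


def simulate_null(n, weight, cum_weight, n_sim=2000, seed=0):
    """Sorted Monte-Carlo sample of the statistic for n uniform random ranks."""
    rng = random.Random(seed)
    return sorted(
        wks_statistic([rng.random() for _ in range(n)], weight, cum_weight)
        for _ in range(n_sim)
    )


def p_value(observed, null_sample):
    """Upper-tail Monte-Carlo p-value against a sorted null sample."""
    exceed = len(null_sample) - bisect.bisect_left(null_sample, observed)
    return (exceed + 1) / (len(null_sample) + 1)
```

In this sketch, `simulate_null` would be called once, and the resulting sorted sample reused in `p_value` for every gene set, which mirrors the computing-time argument made above. With the constant weight g(u) = 1 and G(u) = u, the statistic reduces to the scaled two-sided Kolmogorov-Smirnov distance to the uniform distribution.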
Acknowledgments
This work was supported by Laboratoire d’Excellence TOUCAN (Toulouse Cancer). The authors are grateful to Sophie Rousseaux and Jean-Jacques Fournié for using the WKS test on many different datasets; their feedback helped improve the code. The authors are also indebted to the reviewers for important remarks and suggestions.
Funding: Labex Toucan (Toulouse Cancer).
©2015 by De Gruyter
Articles in the same Issue
- Frontmatter
- Research Articles
- A novel method to prioritize RNAseq data for post-hoc analysis based on absolute changes in transcript abundance
- A mutual information estimator with exponentially decaying bias
- Bayes factors based on robust TDT-type tests for family trio design
- Modeling gene-covariate interactions in sparse regression with group structure for genome-wide association studies
- Weighted Kolmogorov Smirnov testing: an alternative for Gene Set Enrichment Analysis
- Application of the fractional-stable distributions for approximation of the gene expression profiles
- Software and Application Notes
- CSI: a nonparametric Bayesian approach to network inference from multiple perturbed time series gene expression data
- TopKLists: a comprehensive R package for statistical inference, stochastic aggregation, and visualization of multiple omics ranked lists