Abstract
The literature shows that classifiers perform differently across datasets and that correlations within datasets affect classifier performance. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary, and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call filtering and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria, which yielded 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods. Homogeneity of correlation matrices between and within datasets of etiological disease categories was assessed using Box’s M statistic on permuted samples. We found that correlation structures differ significantly between datasets of the same and/or different etiological disease categories, and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
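As an illustration of the homogeneity test named above, the following is a minimal sketch of Box's M statistic with a permutation-based p-value. It is not the authors' implementation. The function names (`box_m`, `standardize`, `permutation_pvalue`), the within-group standardization used to turn covariances into correlations, and the label-shuffling scheme are all assumptions made for the example. The sketch also assumes each group has at least two variables and more samples than variables, so that the matrices are non-singular.

```python
import numpy as np


def box_m(groups):
    """Box's M for a list of (samples x variables) arrays.

    M = (N - k) ln|S_pooled| - sum_i (n_i - 1) ln|S_i|
    """
    k = len(groups)
    sizes = np.array([g.shape[0] for g in groups])
    total = sizes.sum()
    covs = [np.cov(g, rowvar=False) for g in groups]
    pooled = sum((n - 1) * c for n, c in zip(sizes, covs)) / (total - k)

    def logdet(m):
        # slogdet returns (sign, log|det|); keep the log-determinant
        return np.linalg.slogdet(m)[1]

    return (total - k) * logdet(pooled) - sum(
        (n - 1) * logdet(c) for n, c in zip(sizes, covs)
    )


def standardize(g):
    """Center and scale each variable (assumed step: makes the group's
    covariance matrix equal to its correlation matrix)."""
    return (g - g.mean(axis=0)) / g.std(axis=0, ddof=1)


def permutation_pvalue(groups, n_perm=999, seed=0):
    """Observed M and a permutation p-value from shuffled group labels."""
    rng = np.random.default_rng(seed)
    scaled = [standardize(g) for g in groups]
    observed = box_m(scaled)
    pooled = np.vstack(scaled)
    # Split points that reproduce the original group sizes
    cuts = np.cumsum([g.shape[0] for g in scaled])[:-1]
    exceed = 0
    for _ in range(n_perm):
        shuffled = pooled[rng.permutation(pooled.shape[0])]
        if box_m(np.split(shuffled, cuts)) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)
```

A call such as `permutation_pvalue([dataset_a, dataset_b])`, where each argument is a hypothetical samples-by-probesets array restricted to a small probeset subset, returns the observed statistic and its permutation p-value. A small p-value would suggest that the correlation structures of the two datasets are not homogeneous.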
Acknowledgments
This study was financially supported by the VIRGO consortium, which is funded by the Netherlands Genomics Initiative and by the Dutch Government (FES0908). The funding agencies in no way influenced the outcome or conclusions of the study.
Supplemental Material
The online version of this article (DOI: 10.1515/sagmb-2014-0003) offers supplementary material, available to authorized users.
©2014 by De Gruyter
Articles in the same Issue
- Frontmatter
- Research Articles
- When is Menzerath-Altmann law mathematically trivial? A new approach
- Covariate adjusted differential variability analysis of DNA methylation with propensity score method
- P-value calibration for multiple testing problems in genomics
- Robust methods to detect disease-genotype association in genetic association studies: calculate p-values using exact conditional enumeration instead of simulated permutations or asymptotic approximations
- Markovianness and conditional independence in annotated bacterial DNA
- Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories
- Corrigendum
- Biological pathway selection through Bayesian integrative modeling