Abstract
Canonical correlation analysis (CCA) is a multivariate technique that takes two datasets and finds pairs of linear combinations, one from each dataset, that are as highly correlated as possible. Each subsequent pair of linear combinations is orthogonal to the preceding pairs, so each pair contributes new information. By looking at the magnitude of the coefficient values, we can find out which variables can be grouped together, and thus better understand multiple interactions that are otherwise difficult to compute or grasp intuitively. CCA appears to have powerful applications to high-throughput data, where it can be used to discover, for example, relationships between gene expression and gene copy number variation. One of the biggest problems with CCA is that the number of variables (often upwards of 10,000) makes biological interpretation of the linear combinations nearly impossible. To limit the number of variables in the output, we employ a method known as sparse canonical correlation analysis (SCCA), and add estimation that is resistant to extreme observations and other types of deviant data. In this paper, we demonstrate the success of resistant estimation in variable selection using SCCA. Additionally, we use SCCA to find multiple canonical pairs, which extends what can be learned about the datasets at hand. Again, resistant estimators provide more accurate estimates than standard estimators in the multiple canonical correlation setting. R code is available and documented at https://github.com/hardin47/rmscca.
Acknowledgments
JC was supported by the Pomona College Summer Undergraduate Research Program and the Department of Mathematics at Pomona College. JR was supported by a grant to Pomona College from the Howard Hughes Medical Institute through the Precollege and Undergraduate Science Education Program. JH was supported by the Institute for Pure and Applied Mathematics, National Science Foundation Grant DMS-0931852.
Appendix
Consider Y1, the first dimension of Y, which is centered at the first p1 dimensions of the random variable X. Because most of the correlation among the dimensions of Y comes from their dependence on X, we let ΣYY be a diagonal matrix. In contrast, ΣXX has ρ (= 0.2) at the appropriate off-diagonal elements and 1 on the diagonal.
The first diagonal entry of ΣYY, σYY,11, is chosen so that Cor(yl1, yl2) = ρ.
By increasing the variance for each of the simulated Y variables involved in the true linear relationships, we create correlations of ρ (=0.2 in our simulations) between the Y variables in a group. The cross-covariance matrix between X and Y (ΣXY) is not pre-specified, but rather it is given by the relationship between ΣXX, ΣYY, and B.
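The variance calculation described above can be illustrated with a small numerical sketch. It is not the authors' code, and it rests on assumptions that are not stated in this appendix: two Y variables in a group are each taken to be the same linear combination b'X plus independent noise with variance s, so that their covariance is v = b'ΣXX b, their correlation is v / (v + s), and the cross-covariance with X is ΣXX b. The group size, the dimension p, the loading vector b, and all function names are hypothetical choices made for illustration.

```python
# Sketch only. The model y_j = b'X + e_j, the group size, and the loading
# vector b are illustrative assumptions, not taken from the paper.

RHO = 0.2  # within-group correlation quoted in the appendix


def block_sigma_xx(group_sizes, p, rho=RHO):
    """Build a p x p matrix with 1 on the diagonal, rho between
    variables in the same group, and 0 everywhere else."""
    sigma = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    start = 0
    for size in group_sizes:
        for i in range(start, start + size):
            for j in range(start, start + size):
                if i != j:
                    sigma[i][j] = rho
        start += size
    return sigma


def quad_form(b, sigma):
    """Return b' Sigma b, the variance of the linear combination b'X."""
    n = len(b)
    return sum(b[i] * sigma[i][j] * b[j] for i in range(n) for j in range(n))


def noise_variance_for_target_cor(b, sigma_xx, rho=RHO):
    """Solve v / (v + s) = rho for s, where v = b' Sigma_XX b."""
    v = quad_form(b, sigma_xx)
    return v * (1.0 - rho) / rho


def implied_cor(b, sigma_xx, s):
    """Correlation between two Y variables sharing b'X, each with noise variance s."""
    v = quad_form(b, sigma_xx)
    return v / (v + s)


def cross_cov_column(b, sigma_xx):
    """Cov(X, y_j) = Sigma_XX b under the assumed model."""
    n = len(b)
    return [sum(sigma_xx[i][j] * b[j] for j in range(n)) for i in range(n)]


# Hypothetical example: p = 6 X variables, the first 3 forming one group.
sigma_xx = block_sigma_xx(group_sizes=[3], p=6)
b = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
s = noise_variance_for_target_cor(b, sigma_xx)
cor = implied_cor(b, sigma_xx, s)
sigma_xy_col = cross_cov_column(b, sigma_xx)

if __name__ == "__main__":
    print("noise variance:", s)
    print("implied correlation:", cor)
    print("cross-covariance column:", sigma_xy_col)
```

Under these assumptions, a larger noise variance s lowers the correlation between the two Y variables, so s can be tuned to reach the target ρ, and the cross-covariance follows from ΣXX and b rather than being set directly.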
©2016 by De Gruyter