Resistant multiple sparse canonical correlation

Jacob Coleman; Joseph Replogle; Gabriel Chandler; Johanna Hardin

doi:10.1515/sagmb-2014-0081

Published/Copyright: March 10, 2016

From the journal Statistical Applications in Genetics and Molecular Biology, Volume 15, Issue 2

Abstract

Canonical correlation analysis (CCA) is a multivariate technique that takes two datasets and forms the most highly correlated possible pairs of linear combinations between them. Each subsequent pair of linear combinations is orthogonal to the preceding pair, meaning that new information is gleaned from each pair. By looking at the magnitude of the coefficient values, we can find out which variables can be grouped together, and thus better understand multiple interactions that are otherwise difficult to compute or grasp intuitively. CCA appears to have quite powerful applications to high-throughput data; for example, we can use it to discover relationships between gene expression and gene copy number variation. One of the biggest problems with CCA is that the number of variables (often upwards of 10,000) makes biological interpretation of the linear combinations nearly impossible. To limit variable output, we employ a method known as sparse canonical correlation analysis (SCCA), while adding estimation that is resistant to extreme observations or other types of deviant data. In this paper, we demonstrate the success of resistant estimation in variable selection using SCCA. Additionally, we use SCCA to find multiple canonical pairs for extended knowledge about the datasets at hand. Again, using resistant estimators provides more accurate estimates than standard estimators in the multiple canonical correlation setting. R code is available and documented at https://github.com/hardin47/rmscca.

Keywords: Pearson correlation; shrinkage estimate; Spearman correlation

Corresponding author: Johanna Hardin, Department of Mathematics, Pomona College, 610 N. College Ave., Claremont, CA 91711, USA, e-mail: jo.hardin@pomona.edu

Acknowledgments

JC was supported by the Pomona College Summer Undergraduate Research Program and the Department of Mathematics at Pomona College. JR was supported by a grant to Pomona College from the Howard Hughes Medical Institute through the Precollege and Undergraduate Science Education Program. JH was supported by the Institute for Pure and Applied Mathematics, National Science Foundation Grant DMS-0931852.

Appendix

Consider the first dimension of Y, Y₁, which is centered at the sum of the first p₁ dimensions of the random variable X. Because most of the correlation among the dimensions of Y comes from their shared dependence on X, we let Σ_YY be a diagonal matrix. In contrast, Σ_XX has ρ (= 0.2) at the appropriate off-diagonal elements and 1 on the diagonal.

Below is the derivation for the first diagonal entry of Σ_YY, σ_YY,11. The goal is to find σ_YY,11 such that Cor(Y_l1, Y_l2) = ρ, where Y_l1 and Y_l2 are two Y variables in the same group for observation l.

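$$
\begin{aligned}
Y_l &\sim \mathrm{MVN}_q(\mu_l, \Sigma_{YY}), \quad \text{where } \mu_l = X_l \times B, \quad l = 1, \ldots, n \\
Y_l &= X_l \times B + \varepsilon_l, \quad \text{where } \varepsilon_l \sim \mathrm{MVN}_q(0, \Sigma_{YY}), \quad l = 1, \ldots, n \\
Y_{l1} &= \sum_{i=1}^{p_1} X_{li} + \varepsilon_{l1}, \quad \text{where } \varepsilon_{l1} \overset{iid}{\sim} N(0, \sigma_{YY,11}) \\
\mathrm{Var}(Y_{l1}) &= \mathrm{Var}\Big(\sum_{i=1}^{p_1} X_{li} + \varepsilon_{l1}\Big) = p_1 \sigma_{XX,11} + (p_1^2 - p_1)\,\sigma_{XX,12} + \mathrm{Var}(\varepsilon_{l1}) \\
\mathrm{Var}(Y_{l1}) &= p_1 + (p_1^2 - p_1)\rho + \sigma_{YY,11} \quad \text{(WLOG)} \\
\mathrm{Cov}(Y_{l1}, Y_{l2}) &= \mathrm{Cov}\Big(\sum_{i=1}^{p_1} X_{li} + \varepsilon_{l1},\ \sum_{i=1}^{p_1} X_{li} + \varepsilon_{l2}\Big) = p_1 \sigma_{XX,11} + p_1(p_1 - 1)\,\sigma_{XX,12} + \mathrm{Cov}(\varepsilon_{l1}, \varepsilon_{l2}) \\
&= p_1 + p_1(p_1 - 1)\rho \quad \text{(WLOG)} \\
\mathrm{Cor}(Y_{l1}, Y_{l2}) &= \frac{p_1 + (p_1^2 - p_1)\rho}{p_1 + (p_1^2 - p_1)\rho + \sigma_{YY,11}} = \rho \\
\sigma_{YY,11} &= \Big(\frac{1}{\rho} - 1\Big)\big(p_1 + (p_1^2 - p_1)\rho\big)
\end{aligned}
$$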
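As a quick arithmetic check of the final expression, the following worked example plugs in ρ = 0.2 (the value used in the simulations) together with p₁ = 5. The group size p₁ = 5 is an assumption chosen only for illustration; it is not stated in this appendix.

```latex
\begin{aligned}
p_1 + (p_1^2 - p_1)\rho &= 5 + (25 - 5)(0.2) = 9 \\
\sigma_{YY,11} &= \Big(\frac{1}{0.2} - 1\Big)(9) = 4 \times 9 = 36 \\
\mathrm{Cor}(Y_{l1}, Y_{l2}) &= \frac{9}{9 + 36} = \frac{9}{45} = 0.2 = \rho
\end{aligned}
```

Under these assumed values, the noise variance of 36 is four times the variance contributed by the X variables, which is what pulls the correlation between two Y variables in the group down to 0.2.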

By increasing the variance for each of the simulated Y variables involved in the true linear relationships, we create correlations of ρ (=0.2 in our simulations) between the Y variables in a group. The cross-covariance matrix between X and Y (Σ_XY) is not pre-specified, but rather it is given by the relationship between Σ_XX, Σ_YY, and B.
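The effect of this variance choice can be checked with a small simulation. The sketch below is not the authors' R code; it is a minimal standard-library Python illustration under stated assumptions: the group size p1 = 5 is hypothetical, the function names (`noise_variance`, `simulate_pair`, `sample_corr`) are made up for this example, and the equicorrelated X variables are generated with a shared-factor construction that is one convenient way to obtain unit variances and pairwise correlation ρ.

```python
import math
import random


def noise_variance(p1, rho):
    """sigma_YY,11 = (1/rho - 1) * (p1 + (p1^2 - p1) * rho), as derived in the appendix."""
    signal = p1 + (p1 * p1 - p1) * rho  # variance of the sum of p1 equicorrelated X's
    return (1.0 / rho - 1.0) * signal


def sample_corr(a, b):
    """Pearson sample correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def simulate_pair(n, p1, rho, seed=0):
    """Simulate n observations of two Y variables that share the same p1 X variables.

    Assumption: each X_i = sqrt(rho) * Z0 + sqrt(1 - rho) * Z_i with independent
    standard normals, which gives Var(X_i) = 1 and Cov(X_i, X_j) = rho.
    """
    rng = random.Random(seed)
    sd_eps = math.sqrt(noise_variance(p1, rho))
    y1, y2 = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        x = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(p1)]
        total = sum(x)
        # Independent noise for each Y, matching a diagonal Sigma_YY.
        y1.append(total + rng.gauss(0.0, sd_eps))
        y2.append(total + rng.gauss(0.0, sd_eps))
    return y1, y2


if __name__ == "__main__":
    p1, rho = 5, 0.2  # p1 is a hypothetical group size; rho matches the text
    y1, y2 = simulate_pair(20000, p1, rho, seed=1)
    print("noise variance:", noise_variance(p1, rho))
    print("sample correlation of Y1 and Y2:", round(sample_corr(y1, y2), 3))
```

With a large number of simulated observations, the sample correlation between the two Y variables should land close to ρ, up to sampling error.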

Published Online: 2016-3-10

Published in Print: 2016-4-1
