Abstract
Omic data are characterized by the presence of strong dependence structures that result either from data acquisition or from some underlying biological processes. Applying statistical procedures that do not adjust the variable selection step to the dependence pattern may result in a loss of power and the selection of spurious variables. The goal of this paper is to propose a variable selection procedure within the multivariate linear model framework that accounts for the dependence between the multiple responses. We focus on a specific type of dependence, in which the responses of a given individual are modelled as a time series. We propose a novel Lasso-based approach within the framework of the multivariate linear model that takes the dependence structure into account by using the covariance structures of different types of stationary processes for the random error matrix. Our numerical experiments show that including the estimation of the covariance matrix of the random error matrix in the Lasso criterion dramatically improves the variable selection performance. Our approach is successfully applied to an untargeted LC-MS (Liquid Chromatography-Mass Spectrometry) data set made of African copal samples. Our methodology is implemented in the R package MultiVarSel, which is available from the Comprehensive R Archive Network (CRAN).
Acknowledgement
This project has been funded by La mission pour l’interdisciplinarité du CNRS in the frame of the DEFI ENVIROMICS (project AREA). The authors thank the Musée François Tillequin for providing the samples from the Guibourt Collection.
Appendix A
Let vec(A) denote the vectorization of the matrix A formed by stacking the columns of A into a single column vector. Let us apply the vec operator to Model (2), then
Let
where we used that vec(AXB) = (B′ ⊗ A) vec(X), with ⊗ denoting the Kronecker product; see Mardia, Kent & Bibby (1979, Appendix A.2.5). In this equation, B′ denotes the transpose of the matrix B. Thus,
where
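The vectorization identity from Mardia, Kent & Bibby (1979) on which this appendix relies, vec(AXB) = (B′ ⊗ A) vec(X), can be checked numerically. The sketch below is only an illustration and is not taken from the MultiVarSel package: the helper names (`vec`, `kron`, `matmul`, `transpose`, `matvec`) and the small integer matrices are assumptions chosen for the example, written in plain Python so that the comparison is exact.

```python
# Illustrative check of vec(AXB) = (B' kron A) vec(X).
# All helper names and matrices here are hypothetical, chosen for the example.

def vec(M):
    """Stack the columns of M (a list of rows) into a single flat list."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]

def transpose(M):
    """Return the transpose of M."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Ordinary matrix product of A and B."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def kron(A, B):
    """Kronecker product of A and B: block (i, j) equals A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def matvec(M, v):
    """Product of the matrix M with the flat vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Small integer matrices so that both sides can be compared exactly.
A = [[1, 2, 0],
     [0, 1, 3]]          # 2 x 3
X = [[1, 4],
     [2, 5],
     [3, 6]]             # 3 x 2
B = [[2, 1],
     [0, 1]]             # 2 x 2

lhs = vec(matmul(matmul(A, X), B))          # vec(AXB)
rhs = matvec(kron(transpose(B), A), vec(X))  # (B' kron A) vec(X)

print(lhs == rhs)  # prints True
```

Taking the right-hand factor to be an identity matrix gives the special case vec(AX) = (I ⊗ A) vec(X), which is the usual way a matrix-valued regression is rewritten as a single vector-valued one.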
Appendix B
Let us apply the vec operator to Model (5) where
Hence,
where
References
Audoin, C., V. Cocandeau, O. Thomas, A. Bruschini, S. Holderith, and G. Genta-Jouve (2014): “Metabolome consistency: additional parazoanthines from the mediterranean zoanthid parazoanthus axinellae,” Metabolites, 4, 421–432. doi:10.3390/metabo4020421
Bates, D. and M. Maechler (2017): Matrix: sparse and dense matrix classes and methods. R package version 1.2-8. https://CRAN.R-project.org/package=Matrix.
Boccard, J. and S. Rudaz (2016): “Exploring omics data from designed experiments using analysis of variance multiblock orthogonal partial least squares,” Anal. Chim. Acta, 920, 18–28. doi:10.1016/j.aca.2016.03.042
Brockwell, P. and R. Davis (1991): Time series: theory and methods, Springer Series in Statistics, Springer-Verlag, New York. doi:10.1007/978-1-4419-0320-4
Dieterle, F., A. Ross, G. Schlotterbeck, and H. Senn (2006): “Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. application in 1h nmr metabonomics,” Anal. Chem., 78, 4281–4290. doi:10.1021/ac051632c
Faraway, J. J. (2004): Linear models with R, Chapman & Hall/CRC, New York. doi:10.4324/9780203507278
Friedman, J., T. Hastie, and R. Tibshirani (2010): “Regularization paths for generalized linear models via coordinate descent,” J. Stat. Softw., 33, 1–22. doi:10.18637/jss.v033.i01
Hrydziuszko, O. and M. R. Viant (2012): “Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline,” Metabolomics, 8, 161–174. doi:10.1007/s11306-011-0366-4
Kirwan, J., D. Broadhurst, R. Davidson, and M. Viant (2013): “Characterising and correcting batch variation in an automated direct infusion mass spectrometry (dims) metabolomics workflow,” Anal. Bioanal. Chem., 405, 5147–5157. doi:10.1007/s00216-013-6856-7
Kuhl, C., R. Tautenhahn, C. Boettcher, T. R. Larson, and S. Neumann (2012): “CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets,” Anal. Chem., 84, 283–289. doi:10.1021/ac202450g
Lê Cao, K.-A., S. Boitard, and P. Besse (2011): “Sparse pls discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems,” BMC Bioinformatics, 12, 253. doi:10.1186/1471-2105-12-253
Mardia, K., J. Kent, and J. Bibby (1979): Multivariate analysis, Probability and mathematical statistics, Academic Press, London.
Meinshausen, N. and P. Bühlmann (2010): “Stability selection,” J. R. Stat. Soc., 72, 417–473. doi:10.1111/j.1467-9868.2010.00740.x
Muller, K. E. and P. W. Stewart (2006): Linear model theory: univariate, multivariate, and mixed models, John Wiley & Sons. doi:10.1002/0470052147
Nicholson, J. K., J. C. Lindon, and E. Holmes (1999): “‘Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data,” Xenobiotica, 29, 1181–1189. doi:10.1080/004982599238047
Perrot-Dockès, M., C. Lévy-Leduc, L. Sansonnet, and J. Chiquet (2018): “Variable selection in multivariate linear models with high-dimensional covariance matrix estimation,” J. Multivar. Anal., 166, 78–97. doi:10.1016/j.jmva.2018.02.006
R Core Team (2017): R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
Ren, S., A. A. Hinzman, E. L. Kang, R. D. Szczesniak, and L. J. Lu (2015): “Computational and statistical analysis of metabolomics data,” Metabolomics, 11, 1492–1513. doi:10.1007/s11306-015-0823-6
Rothman, A. J., E. Levina, and J. Zhu (2010): “Sparse multivariate regression with covariance estimation,” J. Comput. Graph. Stat., 19, 947–962. doi:10.1198/jcgs.2010.09188
Saccenti, E., H. C. J. Hoefsloot, A. K. Smilde, J. A. Westerhuis, and M. M. W. B. Hendriks (2013): “Reflections on univariate and multivariate analysis of metabolomics data,” Metabolomics, 10, 361–374. doi:10.1007/s11306-013-0598-6
Smith, C., E. Want, G. O’Maille, R. Abagyan, and G. Siuzdak (2006): “XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification,” Anal. Chem., 78, 779–787. doi:10.1021/ac051437y
Smith, R., A. Mathis, and J. Prince (2014): “Proteomics, lipidomics, metabolomics: a mass spectrometry tutorial from a computer scientist’s point of view,” BMC Bioinformatics, 15, S9. doi:10.1186/1471-2105-15-S7-S9
Tibshirani, R. (1996): “Regression shrinkage and selection via the Lasso,” J. R. Stat. Soc. B, 58, 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x
Verdegem, D., D. Lambrechts, P. Carmeliet, and B. Ghesquière (2016): “Improved metabolite identification with midas and magma through ms/ms spectral dataset-driven parameter optimization,” Metabolomics, 12, 1–16. doi:10.1007/s11306-016-1036-3
Zhang, A., H. Sun, P. Wang, Y. Han, and X. Wang (2012): “Modern analytical techniques in metabolomics analysis,” Analyst, 137, 293–300. doi:10.1039/C1AN15605E
Zhang, H., Y. Zheng, G. Yoon, Z. Zhang, T. Gao, B. Joyce, W. Zhang, J. Schwartz, P. Vokonas, E. Colicino, A. Baccarelli, L. Hou, and L. Liu (2017): “Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study,” Stat. Appl. Genet. Mol. Biol., 16, 159–171. doi:10.1515/sagmb-2016-0073
©2018 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Research Articles
- Assessing genome-wide significance for the detection of differentially methylated regions
- A test for detecting differential indirect trans effects between two groups of samples
- A variable selection approach in the multivariate linear model: an application to LC-MS metabolomics data