A variable selection approach in the multivariate linear model: an application to LC-MS metabolomics data

Marie Perrot-Dockès; Céline Lévy-Leduc; Julien Chiquet; Laure Sansonnet; Margaux Brégère; Marie-Pierre Étienne; Stéphane Robin; Grégory Genta-Jouve

doi:10.1515/sagmb-2017-0077

Published/Copyright: September 8, 2018

From the journal Statistical Applications in Genetics and Molecular Biology Volume 17 Issue 5

Abstract

Omic data are characterized by the presence of strong dependence structures that result either from data acquisition or from some underlying biological processes. Applying statistical procedures that do not adjust the variable selection step to the dependence pattern may result in a loss of power and the selection of spurious variables. The goal of this paper is to propose a variable selection procedure within the multivariate linear model framework that accounts for the dependence between the multiple responses. We focus on a specific type of dependence, in which the responses of a given individual are modelled as a time series. We propose a novel Lasso-based approach within the framework of the multivariate linear model that takes the dependence structure into account by using different covariance structures of stationary processes for the random error matrix. Our numerical experiments show that including the estimation of the covariance matrix of the random error matrix in the Lasso criterion dramatically improves the variable selection performance. Our approach is successfully applied to an untargeted LC-MS (Liquid Chromatography-Mass Spectrometry) data set of African copal samples. Our methodology is implemented in the R package MultiVarSel, which is available from the Comprehensive R Archive Network (CRAN).

Keywords: metabolomics; multivariate linear model; time series; variable selection

Acknowledgement

This project has been funded by La mission pour l’interdisciplinarité du CNRS in the frame of the DEFI ENVIROMICS (project AREA). The authors thank the Musée François Tillequin for providing the samples from the Guibourt Collection.

Appendix A

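Let $\operatorname{vec}(A)$ denote the vectorization of the matrix $A$, formed by stacking the columns of $A$ into a single column vector. Applying the vec operator to Model (2) gives

$$\operatorname{vec}(Y) = \operatorname{vec}(XB + E) = \operatorname{vec}(XB) + \operatorname{vec}(E).$$

Let $\mathcal{Y} = \operatorname{vec}(Y)$, $\mathcal{B} = \operatorname{vec}(B)$ and $\mathcal{E} = \operatorname{vec}(E)$. Hence,

$$\mathcal{Y} = \operatorname{vec}(XB) + \mathcal{E} = (I_q \otimes X)\,\mathcal{B} + \mathcal{E},$$

where we used the identity

$$\operatorname{vec}(AXB) = (B' \otimes A)\operatorname{vec}(X),$$

see Mardia, Kent & Bibby (1979, Appendix A.2.5), applied to the product $X B I_q$. In this identity, $B'$ denotes the transpose of the matrix $B$. Thus,

$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E},$$

where $\mathcal{X} = I_q \otimes X$ and $\mathcal{Y}$, $\mathcal{B}$ and $\mathcal{E}$ are vectors of size $nq$, $pq$ and $nq$, respectively.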
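The vectorization step above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, small arbitrary dimensions and simulated Gaussian matrices, and the names `vec` and `X_big` are illustrative only. It builds the stacked design matrix with a Kronecker product and verifies that the vectorized model reproduces vec(Y).

```python
import numpy as np


def vec(a):
    """Stack the columns of a 2-D array into a single 1-D vector."""
    # order="F" reads the array column by column, matching the vec operator.
    return np.asarray(a).reshape(-1, order="F")


# Simulated data (assumption: sizes and distributions are arbitrary).
rng = np.random.default_rng(0)
n, p, q = 5, 3, 4
X = rng.normal(size=(n, p))   # design matrix, n x p
B = rng.normal(size=(p, q))   # coefficient matrix, p x q
E = rng.normal(size=(n, q))   # error matrix, n x q
Y = X @ B + E                 # multivariate linear model

# Stacked design matrix I_q (Kronecker) X, of size nq x pq.
X_big = np.kron(np.eye(q), X)

# vec(Y) should equal (I_q kron X) vec(B) + vec(E).
identity_holds = np.allclose(vec(Y), X_big @ vec(B) + vec(E))
print(X_big.shape, identity_holds)
```

Because the stacked design matrix is block diagonal with q copies of X, a sparse representation would be preferable for realistic sizes; the dense `np.kron` call is used here only to keep the check short.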

Appendix B

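Applying the vec operator to Model (5), where $\Sigma_q^{-1/2}$ is replaced by $\widehat{\Sigma}_q^{-1/2}$, gives

$$\operatorname{vec}(Y\widehat{\Sigma}_q^{-1/2}) = \operatorname{vec}(XB\widehat{\Sigma}_q^{-1/2}) + \operatorname{vec}(E\widehat{\Sigma}_q^{-1/2}) = \big((\widehat{\Sigma}_q^{-1/2})' \otimes X\big)\operatorname{vec}(B) + \operatorname{vec}(E\widehat{\Sigma}_q^{-1/2}).$$

Hence,

$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E},$$

where $\mathcal{Y} = \operatorname{vec}(Y\widehat{\Sigma}_q^{-1/2})$, $\mathcal{X} = (\widehat{\Sigma}_q^{-1/2})' \otimes X$, $\mathcal{B} = \operatorname{vec}(B)$ and $\mathcal{E} = \operatorname{vec}(E\widehat{\Sigma}_q^{-1/2})$.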
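The whitened vectorization can be illustrated in the same way. This is a hedged sketch under stated assumptions: the covariance is taken to be that of an AR(1) process with a hypothetical coefficient `phi`, it is treated as known rather than estimated, and the helper `inv_sqrt_sym` is an illustrative name, not a function of the MultiVarSel package. The code checks the Kronecker identity for the whitened model and confirms that the transformation turns the assumed covariance into the identity.

```python
import numpy as np


def vec(a):
    """Stack the columns of a 2-D array into a single 1-D vector."""
    return np.asarray(a).reshape(-1, order="F")


def inv_sqrt_sym(sigma):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    eigval, eigvec = np.linalg.eigh(sigma)
    # V diag(1 / sqrt(lambda)) V': scale each eigenvector column, then recombine.
    return (eigvec / np.sqrt(eigval)) @ eigvec.T


rng = np.random.default_rng(1)
n, p, q = 6, 2, 5
phi = 0.7  # hypothetical AR(1) coefficient, chosen only for illustration

# AR(1) covariance with unit innovation variance:
# Sigma[i, j] = phi^|i - j| / (1 - phi^2).
idx = np.arange(q)
sigma_hat = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi**2)
S = inv_sqrt_sym(sigma_hat)   # plays the role of the whitening matrix

X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
# Each row of E has covariance sigma_hat (rows are independent).
E = rng.normal(size=(n, q)) @ np.linalg.cholesky(sigma_hat).T
Y = X @ B + E

# Whitened, vectorized model: vec(Y S) = (S' kron X) vec(B) + vec(E S).
X_big = np.kron(S.T, X)
whitened_identity_holds = np.allclose(vec(Y @ S), X_big @ vec(B) + vec(E @ S))

# The whitening matrix maps the assumed covariance to the identity.
whitening_ok = np.allclose(S @ sigma_hat @ S, np.eye(q))
print(whitened_identity_holds, whitening_ok)
```

In practice the covariance has to be estimated, so the transformed errors are only approximately uncorrelated; the exact identity checked here holds because the sketch whitens with the same matrix that generated the errors.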

References

Audoin, C., V. Cocandeau, O. Thomas, A. Bruschini, S. Holderith, and G. Genta-Jouve (2014): “Metabolome consistency: additional parazoanthines from the Mediterranean zoanthid Parazoanthus axinellae,” Metabolites, 4, 421–432. doi:10.3390/metabo4020421

Bates, D. and M. Maechler (2017): Matrix: sparse and dense matrix classes and methods. R package version 1.2-8. https://CRAN.R-project.org/package=Matrix

Boccard, J. and S. Rudaz (2016): “Exploring omics data from designed experiments using analysis of variance multiblock orthogonal partial least squares,” Anal. Chim. Acta, 920, 18–28. doi:10.1016/j.aca.2016.03.042

Brockwell, P. and R. Davis (1991): Time series: theory and methods, Springer Series in Statistics, Springer-Verlag, New York. doi:10.1007/978-1-4419-0320-4

Dieterle, F., A. Ross, G. Schlotterbeck, and H. Senn (2006): “Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics,” Anal. Chem., 78, 4281–4290. doi:10.1021/ac051632c

Faraway, J. J. (2004): Linear models with R, Chapman & Hall/CRC, New York. doi:10.4324/9780203507278

Friedman, J., T. Hastie, and R. Tibshirani (2010): “Regularization paths for generalized linear models via coordinate descent,” J. Stat. Softw., 33, 1–22. doi:10.18637/jss.v033.i01

Hrydziuszko, O. and M. R. Viant (2012): “Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline,” Metabolomics, 8, 161–174. doi:10.1007/s11306-011-0366-4

Kirwan, J., D. Broadhurst, R. Davidson, and M. Viant (2013): “Characterising and correcting batch variation in an automated direct infusion mass spectrometry (DIMS) metabolomics workflow,” Anal. Bioanal. Chem., 405, 5147–5157. doi:10.1007/s00216-013-6856-7

Kuhl, C., R. Tautenhahn, C. Boettcher, T. R. Larson, and S. Neumann (2012): “CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets,” Anal. Chem., 84, 283–289. doi:10.1021/ac202450g

Lê Cao, K.-A., S. Boitard, and P. Besse (2011): “Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems,” BMC Bioinformatics, 12, 253. doi:10.1186/1471-2105-12-253

Mardia, K., J. Kent, and J. Bibby (1979): Multivariate analysis, Probability and mathematical statistics, Academic Press, London.

Meinshausen, N. and P. Bühlmann (2010): “Stability selection,” J. R. Stat. Soc. B, 72, 417–473. doi:10.1111/j.1467-9868.2010.00740.x

Muller, K. E. and P. W. Stewart (2006): Linear model theory: univariate, multivariate, and mixed models, John Wiley & Sons. doi:10.1002/0470052147

Nicholson, J. K., J. C. Lindon, and E. Holmes (1999): “‘Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data,” Xenobiotica, 29, 1181–1189. doi:10.1080/004982599238047

Perrot-Dockès, M., C. Lévy-Leduc, L. Sansonnet, and J. Chiquet (2018): “Variable selection in multivariate linear models with high-dimensional covariance matrix estimation,” J. Multivar. Anal., 166, 78–97. doi:10.1016/j.jmva.2018.02.006

R Core Team (2017): R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

Ren, S., A. A. Hinzman, E. L. Kang, R. D. Szczesniak, and L. J. Lu (2015): “Computational and statistical analysis of metabolomics data,” Metabolomics, 11, 1492–1513. doi:10.1007/s11306-015-0823-6

Rothman, A. J., E. Levina, and J. Zhu (2010): “Sparse multivariate regression with covariance estimation,” J. Comput. Graph. Stat., 19, 947–962. doi:10.1198/jcgs.2010.09188

Saccenti, E., H. C. J. Hoefsloot, A. K. Smilde, J. A. Westerhuis, and M. M. W. B. Hendriks (2013): “Reflections on univariate and multivariate analysis of metabolomics data,” Metabolomics, 10, 361–374. doi:10.1007/s11306-013-0598-6

Smith, C., E. Want, G. O’Maille, R. Abagyan, and G. Siuzdak (2006): “XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification,” Anal. Chem., 78, 779–787. doi:10.1021/ac051437y

Smith, R., A. Mathis, and J. Prince (2014): “Proteomics, lipidomics, metabolomics: a mass spectrometry tutorial from a computer scientist’s point of view,” BMC Bioinformatics, 15, S9. doi:10.1186/1471-2105-15-S7-S9

Tibshirani, R. (1996): “Regression shrinkage and selection via the Lasso,” J. R. Stat. Soc. B, 58, 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x

Verdegem, D., D. Lambrechts, P. Carmeliet, and B. Ghesquière (2016): “Improved metabolite identification with MIDAS and MAGMa through MS/MS spectral dataset-driven parameter optimization,” Metabolomics, 12, 1–16. doi:10.1007/s11306-016-1036-3

Zhang, A., H. Sun, P. Wang, Y. Han, and X. Wang (2012): “Modern analytical techniques in metabolomics analysis,” Analyst, 137, 293–300. doi:10.1039/C1AN15605E

Zhang, H., Y. Zheng, G. Yoon, Z. Zhang, T. Gao, B. Joyce, W. Zhang, J. Schwartz, P. Vokonas, E. Colicino, A. Baccarelli, L. Hou, and L. Liu (2017): “Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study,” Stat. Appl. Genet. Mol. Biol., 16, 159–171. doi:10.1515/sagmb-2016-0073
