MBPCA-OS: an exploratory multiblock method for variables of different measurement levels. Application to study the immune response to SARS-CoV-2 infection and vaccination
Abstract
Studying a large number of variables measured on the same observations and organized in blocks, known as multiblock data, is becoming standard in several domains, especially in biology. To explore the relationships between all these variables, at both the block and the variable level, several exploratory multiblock methods have been proposed. However, most of them are designed only for numeric variables, whereas real data sets often contain variables of different measurement levels (i.e., numeric, nominal, ordinal). In this article, we focus on exploratory multiblock methods that handle each variable at its appropriate measurement level. Multi-Block Principal Component Analysis with Optimal Scaling (MBPCA-OS) is proposed and applied to multiblock data from the CURIE-O-SA French cohort, in which variables of different measurement levels are organized in four blocks. The objective is to study the immune responses according to SARS-CoV-2 infection and vaccination statuses, symptoms and participants’ characteristics.
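To make the notion of handling a variable at its measurement level more concrete, the following is a minimal sketch of a quantification step as commonly used in optimal scaling methods. It is not the MBPCA-OS algorithm itself: the function names, the toy data and the score vector are hypothetical, and only the numeric and nominal levels are illustrated. A numeric variable is simply centered and scaled, whereas each category of a nominal variable is replaced by the mean of a given score vector over the observations in that category, and the result is standardized.

```python
import numpy as np


def standardize(values):
    """Center a vector and scale it to unit mean square (left unchanged if constant)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    rms = np.sqrt(np.mean(centered ** 2))
    return centered / rms if rms > 0 else centered


def quantify_numeric(values):
    """Numeric level: only a linear transformation (centering and scaling) is allowed."""
    return standardize(values)


def quantify_nominal(categories, scores):
    """Nominal level: each category receives the mean of `scores` over its observations.

    `scores` stands for a component (score vector) assumed to be available
    from a previous iteration; it is an assumption of this sketch.
    """
    categories = np.asarray(categories)
    scores = np.asarray(scores, dtype=float)
    quantified = np.empty_like(scores)
    for level in np.unique(categories):
        mask = categories == level
        quantified[mask] = scores[mask].mean()
    return standardize(quantified)


# Hypothetical toy data: four observations, one nominal variable, one score vector.
status = ["infected", "infected", "not_infected", "not_infected"]
component = [1.0, 3.0, -1.0, -3.0]
print(quantify_nominal(status, component))
print(quantify_numeric([1.0, 2.0, 3.0]))
```

In an alternating scheme, such a quantification step would typically be alternated with the update of the components until convergence; an ordinal level would additionally require the quantifications to respect the order of the categories, for instance through monotone regression.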
Acknowledgments
At the Institut Pasteur, we thank the Recombinant Protein Production and Purification core facility for SARS-CoV-2 protein preparation, the Molecular Biophysics core facility for their quality checking, and Yves L. Janin (Unit of Chemistry and Biocatalysis) for providing the Hikarazine, the LuLISA substrate. For serum sample management, we also thank the whole ICAReB team and the COVID-19 support staff at the Institut Pasteur, as well as the teams from the Eurofins Biomnis Sample Library and from CerbaHealthcare. We also thank the personnel of the Institut Curie who volunteered to participate in the Curiosa study, which was set up and managed by the staff of the clinical and laboratory departments of the Institut Curie.
- Research ethics: Not applicable.
- Author contributions: Wrote the manuscript: MP, EV, AH, SB. Designed the study: OL. Managed the data: OL. Analyzed and visualized the data: MP, EV, SB. Revised the manuscript: MP, EV, AH, SB, OL.
- Competing interests: The authors state no competing interests.
- Research funding: The blood and clinical study at Institut Curie was funded by Fondation de France, Agence Nationale de la Recherche (ANR-21-COVR-002) and institutional funding from Institut Curie.
CURIE-O-SA data (N = 4383). Visualization of the correlation matrix of
Participants’ categories | log10LN mean [95 % CI] | log10LS mean [95 % CI] | log10PNT mean [95 % CI] |
---|---|---|---|
Covid- Non.vacc | 3.35 [3.33;3.36] | 3.16 [3.14;3.18] | 4.7 [4.68;4.71] |
Covid- vacc | 3.2 [3.17;3.25] | 5.04 [5;5.08] | 2.7 [2.63;2.79] |
Covid+ Non.vacc | 4.3 [4.22;4.39] | 4.25 [4.15;4.35] | 4.09 [4;4.19] |
Covid+ vacc | 4.21 [3.92;4.52] | 5.23 [4.12;5.37] | 2.57 [2.18;2.95] |
CURIE-O-SA data (N = 4383). Means and 95 % confidence intervals of the serological assays (X3) for the participants’ categories obtained by crossing infection and vaccination statuses.
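As an illustration of how a group mean and its 95 % confidence interval can be obtained for a log10-transformed assay, a minimal sketch is given below. It assumes a normal approximation (mean ± 1.96 standard errors); the source does not state how the reported intervals were computed, and the function name and the sample values are hypothetical.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95 % interval.

    Assumption of this sketch: the interval is mean +/- 1.96 * standard error,
    with the standard error based on the sample standard deviation.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width


# Hypothetical log10-transformed assay values for one participants' category.
log10_values = [3.0, 3.2, 3.4, 3.6, 3.8]
mean, lower, upper = mean_ci95(log10_values)
print(f"{mean:.2f} [{lower:.2f};{upper:.2f}]")
```

With groups as large as those implied by N = 4383, the normal approximation and a Student t interval would be nearly identical; for a small group, a t quantile would be the more cautious choice.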
© 2023 Walter de Gruyter GmbH, Berlin/Boston