MBPCA-OS: an exploratory multiblock method for variables of different measurement levels. Application to study the immune response to SARS-CoV-2 infection and vaccination

Martin Paries; Evelyne Vigneau; Adeline Huneau; Olivier Lantz; Stéphanie Bougeard

doi:10.1515/ijb-2023-0062

Published/Copyright: December 13, 2023

From the journal The International Journal of Biostatistics, Volume 20, Issue 2

Abstract

Studying a large number of variables measured on the same observations and organized in blocks, known as multiblock data, is becoming standard in several domains, especially in biology. To explore the relationships between all these variables, at both the block level and the variable level, several exploratory multiblock methods have been proposed. However, most of them are designed only for numeric variables. In practice, some data sets contain variables of different measurement levels (i.e., numeric, nominal, ordinal). In this article, we focus on exploratory multiblock methods that handle variables at their appropriate measurement level. Multi-Block Principal Component Analysis with Optimal Scaling (MBPCA-OS) is proposed and applied to multiblock data from the CURIE-O-SA French cohort. In this study, the variables are of different measurement levels and organized in four blocks. The objective is to study the immune responses according to SARS-CoV-2 infection and vaccination status, symptoms, and participants' characteristics.

Keywords: multiblock analysis; exploratory analysis; optimal scaling; level of scaling; categorical variables; SARS-CoV-2
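
To make the idea of optimal scaling concrete, the following is a minimal sketch of one quantification step for a nominal variable, in the spirit of alternating least squares approaches. It is an illustration under stated assumptions, not the MBPCA-OS algorithm or the authors' implementation: the function name `quantify_nominal`, the toy data, and the choice of standardizing to unit variance are all hypothetical. Given a vector of component scores, the least-squares quantification of a nominal variable assigns to each category the mean score of the observations in that category.

```python
from collections import defaultdict
from math import sqrt


def quantify_nominal(categories, scores):
    """Quantify a nominal variable against a vector of scores.

    Each category receives the mean of `scores` over its observations
    (the least-squares fit under a nominal restriction). The quantified
    variable is then centered and scaled to unit variance.
    Hypothetical helper for illustration only.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, score in zip(categories, scores):
        sums[level] += score
        counts[level] += 1

    # One numeric value per category: the mean score within that category.
    level_means = {level: sums[level] / counts[level] for level in sums}
    quantified = [level_means[level] for level in categories]

    # Center and standardize so the result is comparable to a scaled numeric variable.
    n = len(quantified)
    grand_mean = sum(quantified) / n
    centered = [q - grand_mean for q in quantified]
    scale = sqrt(sum(c * c for c in centered) / n)
    if scale == 0:
        return centered  # a single category carries no information
    return [c / scale for c in centered]


# Toy example (made-up data): two categories and four observations.
toy_categories = ["a", "a", "b", "b"]
toy_scores = [1.0, 3.0, 5.0, 7.0]
toy_quantified = quantify_nominal(toy_categories, toy_scores)
```

In an iterative procedure, a step of this kind would typically alternate with an update of the component scores until convergence. An ordinal variable would additionally require the category values to respect the category order, which this sketch does not handle.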

Corresponding author: Stéphanie Bougeard, Anses, Epidemiology, Health and Welfare, Laboratory of Ploufragan-Plouzané-Niort, Ploufragan, France, E-mail: Stephanie.BOUGEARD@anses.fr

Acknowledgments

From the Institut Pasteur, we thank the Recombinant Protein Production and Purification core facility for SARS-CoV-2 protein preparation, the Molecular Biophysics core facility for quality checking, and Yves L. Janin (Unit of Chemistry and Biocatalysis) for providing Hikarazine, the LuLISA substrate. For serum sample management, we also thank the whole ICAReB team and the COVID-19 support staff at Institut Pasteur, as well as the teams from the Eurofins Biomnis Sample Library and CerbaHealthcare. Finally, we thank the personnel of the Institut Curie who volunteered to participate in the Curiosa study, which was set up and managed by the staff of the clinical and laboratory departments of the Institut Curie.

Research ethics: Not applicable.
Author contributions: Wrote the manuscript: MP, EV, AH, SB. Designed the study: OL. Managed the data: OL. Analyzed and visualized the data: MP, EV, SB. Revised the manuscript: MP, EV, AH, SB, OL.
Competing interests: The authors state no competing interests.
Research funding: The blood and clinical study at Institut Curie was funded by the Fondation de France, the Agence Nationale de la Recherche (ANR-21-COVR-002), and institutional funding from Institut Curie.

Appendix A

CURIE-O-SA data (N = 4383). Visualization of the correlation matrix of X̃ (quantified variables), obtained through MBPCA-OS with two components. Color intensity reflects the strength of the correlation: a strong positive correlation (close to 1) is colored red, whereas a strong negative correlation (close to −1) is colored blue.

Appendix B

Participants’ categories	log10LN mean [95 % CI]	log10LS mean [95 % CI]	log10PNT mean [95 % CI]
Covid- Non.vacc	3.35 [3.33;3.36]	3.16 [3.14;3.18]	4.7 [4.68;4.71]
Covid- vacc	3.2 [3.17;3.25]	5.04 [5;5.08]	2.7 [2.63;2.79]
Covid+ Non.vacc	4.3 [4.22;4.39]	4.25 [4.15;4.35]	4.09 [4;4.19]
Covid+ vacc	4.21 [3.92;4.52]	5.23 [4.12;5.37]	2.57 [2.18;2.95]

CURIE-O-SA data (N = 4383). Means and 95 % confidence intervals of the serological assays (block X₃) by participant category, where categories are defined by combining infection and vaccination statuses.
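
As an illustration of how a summary of this kind can be computed, the sketch below derives a mean and a 95 % confidence interval for one group of values. It assumes a normal-approximation interval (mean ± 1.96 standard errors); this page does not state how the intervals in the table were obtained, so the method, the function name `mean_ci95`, and the sample values are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci95(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95 % CI.

    Hypothetical helper: the interval is mean +/- z * s / sqrt(n), where s
    is the sample standard deviation. Requires at least two values.
    """
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


# Made-up log10-scale measurements for a single participant category.
example_values = [1.0, 2.0, 3.0, 4.0, 5.0]
example_mean, example_lower, example_upper = mean_ci95(example_values)
```

Applied separately to each participant category and each assay, a helper like this would produce one "mean [lower;upper]" cell per combination. For small groups, a Student t quantile would be a more cautious choice than the fixed value 1.96.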

Received: 2023-06-05

Accepted: 2023-11-02

Published Online: 2023-12-13
