Streamlining feature elaboration and statistics analysis in metabolomics: the GetFeatistics R-package

Gianfranco Frigerio

doi:10.1515/jib-2025-0047

Article, Open Access

Published/Copyright: 24 December 2025

From the journal: Journal of Integrative Bioinformatics

Abstract

Metabolomics studies require complex data processing pipelines to ensure data quality and extract meaningful biological insights. GetFeatistics is an R-package developed to streamline the elaboration and statistical analysis of metabolomics data. For targeted analyses, the package enables calibration curve-based quantification with different data weighting options. For untargeted studies, it includes dedicated functions to import feature tables from tools like patRoon and MS-DIAL, assign annotation confidence levels, and filter features based on pooled quality control (QC) criteria, including options for group-specific pooled QCs. The package also provides functions for univariate and multivariate statistical analyses, notably streamlined regression modelling with fixed effects, mixed-effects models for longitudinal data, and Tobit regression for censored values that fall outside the limits of detection. Output tables are concise and informative, facilitating interpretation and reporting, while output visualisations are fully customisable via the ggplot grammar. Additional functionalities include automated retrieval of chemical properties from PubChem, ontology classification via ClassyFire, and pathway enrichment analysis using the FELLA package. GetFeatistics is publicly available on GitHub, with comprehensive documentation and a step-by-step vignette. By integrating key steps of the metabolomics workflow, the package aims to facilitate both exploratory studies and large-scale epidemiological applications in metabolomics research.

Keywords: metabolomics workflow; non-targeted metabolomics; computational metabolomics; open-source software; data preprocessing

1 Introduction

Metabolomics is the youngest of the four main “omics” sciences – following genomics, transcriptomics, and proteomics – and as such, still requires methodological development and improvement. Metabolomic approaches can be broadly categorised into targeted and untargeted strategies. Targeted metabolomics aims to accurately quantify a predefined set of metabolites, whereas untargeted metabolomics seeks to capture the widest possible spectrum of metabolites to explore biological differences without prior assumptions.

In untargeted workflows, the inclusion of pooled quality control (QC) samples during sample preparation is a key step to monitor data quality and analytical reproducibility throughout the acquisition process [1]. Community-driven initiatives such as the Metabolomics Quality Assurance and Quality Control Consortium (mQACC) are actively working to promote the implementation of QC practices in metabolomics studies [2]. In addition, the use of group-specific QC samples has been proposed to prevent the dilution of low-abundance metabolites that may fall below the instrument’s detection limit when using a single pooled QC for all groups [3].

Following instrumental analysis with liquid chromatography coupled to high-resolution mass spectrometry (LC-MS/MS), data elaboration is usually the most time-consuming and demanding step of the metabolomics workflow. Open-source tools such as XCMS [4] are used to extract metabolic features from raw data files, where a feature is defined as a peak with a specific mass-to-charge ratio (m/z) and retention time (rt). Additional tools such as CAMERA [5] assist with the grouping of adducts and isotopes, while annotation software such as MetFrag [6] and databases such as PubChemLite [7] support metabolite annotation. All of these have been integrated into the patRoon R package [8, 9]. A widely adopted alternative is MS-DIAL, a user-friendly software that spans the same pipeline from feature extraction to annotation, the latter made possible by the extensive publicly available MSP libraries [10, 11].

Once features have been extracted and, where possible, annotated, they are typically filtered against QC-based quality cut-offs; intensities are then pre-processed to deal with missing values and departures from normality; and finally statistical analyses are applied to highlight biologically relevant associations. Since the metabolome is influenced by several factors, it is essential to properly account for potential confounding variables, particularly in observational epidemiological studies, and this can be achieved by applying regression models with multiple covariates. For longitudinal designs, mixed-effects models that incorporate both fixed and random effects are a suitable approach [12, 13]. In the context of targeted analyses, Tobit regression models are a suitable approach to handle censored data, i.e., metabolite concentrations that fall below or above quantification limits [14, 15].

Presented here is the R-package “GetFeatistics” (feature and statistics). The tool is designed to streamline the data elaboration of targeted and untargeted metabolomics studies by offering an R-based pipeline for data processing, particularly filtering based on pooled QCs, and for statistical analyses, notably mixed-effects and Tobit regression models.

2 Implementation and workflow

The GetFeatistics package has been developed in the R language and is publicly available on GitHub. Most of its functions rely on the tidyverse ecosystem [16], and all the visualisations are created as “ggplot” objects, thus allowing further customisation with the ggplot grammar. The website of the package, built using the pkgdown framework [17], provides extensive documentation for each function, and the vignette shows the application of the workflow on sample data. Figure 1 illustrates the overall data processing workflow, highlighting the functions involved in each step.

Figure 1: Typical workflow for data elaboration that can be implemented with the GetFeatistics R-package. Each green rectangle represents a function, and each purple rectangle contains a brief description of it. The logo of the package is reported in the centre of the picture. Example graphs were generated using mock data and from a previously published study under the author’s retained rights [13].

2.1 Targeted analyses

The package provides straightforward calculation of absolute concentrations from signal intensities via the function get_targeted_elaboration. The function’s arguments include: (I) a table of intensities for samples, quality controls, and calibration standards used to build the curve; (II) a table reporting the nominal concentrations for quality controls and calibration levels; and (III) optionally, a table mapping each analyte to its internal standard and specifying the desired weighting scheme. Because calibration data often have higher variance at higher concentrations, applying weighted least squares (e.g., 1/Y, 1/X or 1/X²) reduces bias and improves accuracy at low concentrations [18, 19]. The function returns calculated absolute concentrations and accuracy (%) for standards and quality controls. These results can be passed to plot_calibration_curves to visualise the fitted calibration curves.
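The effect of weighting can be illustrated with a minimal base-R sketch. It does not call GetFeatistics and is not meant to reproduce the internals of get_targeted_elaboration; the calibration values, the column names, and the helper back_calc are invented for illustration. It fits the same straight line with and without 1/X² weights and reports the accuracy (%) of the back-calculated concentration at each calibration level.

```r
# Mock calibration data (invented values): nominal concentration and response.
calib <- data.frame(
  conc     = c(1, 2, 5, 10, 50, 100),
  response = c(10.4, 19.5, 51.0, 98.0, 510.0, 985.0)
)

# Same linear model, fitted without weights and with 1/X^2 weights.
fit_unweighted <- lm(response ~ conc, data = calib)
fit_weighted   <- lm(response ~ conc, data = calib, weights = 1 / conc^2)

# Hypothetical helper: invert response = intercept + slope * conc.
back_calc <- function(fit, response) {
  (response - coef(fit)[["(Intercept)"]]) / coef(fit)[["conc"]]
}

# Accuracy (%) = back-calculated concentration / nominal concentration * 100.
calib$acc_unweighted <- 100 * back_calc(fit_unweighted, calib$response) / calib$conc
calib$acc_weighted   <- 100 * back_calc(fit_weighted,   calib$response) / calib$conc

print(round(calib, 1))
```

Comparing the two accuracy columns at the lowest levels shows how much the choice of weights influences the low end of the curve, which is usually where the unweighted fit performs worst.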

2.2 Untargeted analyses

Following feature extraction, grouping, and annotation using external tools such as patRoon or MS-DIAL, the GetFeatistics workflow is built on two core data frames: “featTable”, where rows represent features and columns represent samples, containing intensity values; and “featINFO”, where rows again represent features, while columns contain the feature metadata such as m/z, retention time, chemical annotations, and other descriptors of the feature or of the annotated molecule. Dedicated functions are available to import and structure these data frames from patRoon (get_feat_info_from_patRoon, get_feat_table_from_patRoon) and MS-DIAL outputs (get_feat_info_from_MSDial, get_feat_table_from_MSDial). Confidence levels of annotated compounds can be automatically assigned according to the annotation grading proposed by Schymanski et al. [20], implementing in particular the MoNAScore cut-offs for patRoon and the dot product and fragment presence cut-offs for MS-DIAL, as reported by Talavera Andújar et al. [21]. Known molecules can be searched in the feature list based on their exact mass; moreover, if the retention time is also available, i.e., when analytical standards have been analysed under the same instrumental conditions, matching features can be identified and assigned confidence level 1 (checkmolecules_in_feat_table).
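As a rough illustration of the two-table layout and of matching a known molecule by exact mass and retention time, the following base-R sketch may help. The column names (featname, mz, rt), the tolerances, and all values are assumptions made for this example; they are not taken from the package, and the code does not call checkmolecules_in_feat_table.

```r
# Mock "featINFO"-like table: one row per feature, columns hold metadata.
featINFO <- data.frame(
  featname = c("F1", "F2", "F3"),
  mz       = c(180.0634, 195.0877, 300.1234),
  rt       = c(65.2, 210.5, 402.0)  # seconds
)

# Mock "featTable"-like table: rows are features, columns are samples.
featTable <- data.frame(
  featname = featINFO$featname,
  sample_1 = c(1.2e5, 3.4e4, 8.0e3),
  sample_2 = c(1.1e5, 2.9e4, 7.5e3)
)

# Hypothetical known compound: caffeine as [M+H]+, with the retention time
# of an analytical standard run under the same conditions (invented value).
known <- list(name = "caffeine", mz = 195.0877, rt = 212.0)

ppm_tol <- 5   # assumed mass tolerance (ppm)
rt_tol  <- 10  # assumed retention time tolerance (s)

# A feature matches when both the mass error and the rt shift are in tolerance.
ppm_error <- 1e6 * (featINFO$mz - known$mz) / known$mz
hit <- abs(ppm_error) <= ppm_tol & abs(featINFO$rt - known$rt) <= rt_tol

print(featINFO[hit, ])
```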

To improve the quality of the dataset and retain only biologically meaningful features, the package includes a function (QCs_proces) that filters out features with: (I) poor reproducibility in pooled QCs (based on the relative standard deviation of QC intensities); (II) insufficient presence across QC injections (based on detection frequency in QCs); (III) high blank contribution (based on the blank/QC mean intensity ratio); (IV) unreasonable response in diluted QC samples (based on an acceptable range for the QC/diluted QC mean intensity ratio). When group-specific QC pools are available, the filtering retains features meeting quality thresholds in at least one group [3].
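The logic of criteria (I) to (III), together with the "at least one group" rule, can be sketched in base R as follows. This is only an illustration under stated assumptions: the intensity values, the thresholds, and the helper passes_in_group are invented, the diluted-QC criterion is left out for brevity, and the code is not the package's implementation.

```r
# Mock intensities: rows are features, columns are injections (invented values).
int <- rbind(
  F1 = c(QC_A1 = 100, QC_A2 = 105, QC_A3 = 98,  QC_B1 = 5,   QC_B2 = 50,  QC_B3 = 20,  blank = 2),
  F2 = c(QC_A1 = 40,  QC_A2 = 90,  QC_A3 = 10,  QC_B1 = 30,  QC_B2 = 80,  QC_B3 = 5,   blank = 1),
  F3 = c(QC_A1 = 200, QC_A2 = 210, QC_A3 = 190, QC_B1 = 205, QC_B2 = 195, QC_B3 = 200, blank = 150),
  F4 = c(QC_A1 = NA,  QC_A2 = NA,  QC_A3 = 60,  QC_B1 = 300, QC_B2 = 310, QC_B3 = 290, blank = 3)
)

# Two group-specific QC pools (hypothetical groups A and B).
qc_groups <- list(A = c("QC_A1", "QC_A2", "QC_A3"),
                  B = c("QC_B1", "QC_B2", "QC_B3"))

# Assumed thresholds, chosen only for this example.
max_rsd         <- 30   # maximum relative standard deviation in QCs (%)
min_freq        <- 0.5  # minimum fraction of QC injections with a detection
max_blank_ratio <- 0.2  # maximum blank / mean QC intensity ratio

# Hypothetical helper: does each feature pass all criteria within one QC pool?
passes_in_group <- function(cols) {
  qc          <- int[, cols, drop = FALSE]
  qc_mean     <- rowMeans(qc, na.rm = TRUE)
  rsd         <- 100 * apply(qc, 1, sd, na.rm = TRUE) / qc_mean
  freq        <- rowMeans(!is.na(qc))
  blank_ratio <- int[, "blank"] / qc_mean
  !is.na(rsd) & rsd <= max_rsd & freq >= min_freq & blank_ratio <= max_blank_ratio
}

# One logical column per QC pool; a feature is kept if it passes in any pool.
pass_by_group <- sapply(qc_groups, passes_in_group)
keep <- rowSums(pass_by_group) >= 1

print(pass_by_group)
print(names(keep)[keep])
```

In this mock dataset F1 is reproducible only in pool A and F4 only in pool B, so both are retained, whereas F2 (irreproducible in both pools) and F3 (high blank contribution) are discarded. A single pooled QC evaluated across all six injections would have rejected F1 and F4 as well.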

2.3 Statistical analyses

A summary table with descriptive statistics can be generated for each sample group (gentab_descr). Normality is assessed via density plots (test_normality_density_plot), Q–Q plots (test_normality_q_q_plot), and the Shapiro–Wilk test (test_normality_Shapiro_table). Data can then be processed via missing value imputation, log-transformation, centering, and scaling (transf_data). A suite of functions supports univariate statistical testing, including t-tests (gentab_P.t.test), fold change analysis (gentab_FC, gentab_FC_more_than2levels), and one- and two-way ANOVA, with or without interaction terms (gentab_P.1wayANOVA_posthocTukeyHSD, gentab_P.2wayANOVA_posthocTukeyHSD). Because these tests are applied to many metabolites at once, p-values can be corrected using false discovery rate (FDR) procedures [22]. Results can be visualised through volcano plots (Volcano_ttest_FC) and box plots (getBoxplots). The outcome tables from statistical analyses can be enriched with the corresponding feature metadata (addINFO_to_table).
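The need for multiple-testing correction can be made concrete with a small base-R sketch on simulated data. It uses only t.test and p.adjust from base R, not the package functions listed above; the group labels, dimensions, and the size of the simulated shift are arbitrary assumptions.

```r
set.seed(1)

# Simulated, already log-transformed intensities: 20 samples x 50 features.
n_per_group <- 10
n_feat      <- 50
group <- factor(rep(c("control", "case"), each = n_per_group),
                levels = c("control", "case"))
x <- matrix(rnorm(2 * n_per_group * n_feat),
            nrow = 2 * n_per_group,
            dimnames = list(NULL, paste0("F", seq_len(n_feat))))

# Introduce a true difference in the first five features only.
x[group == "case", 1:5] <- x[group == "case", 1:5] + 3

# One Welch t-test per feature, then Benjamini-Hochberg FDR correction.
p_raw <- apply(x, 2, function(v) t.test(v ~ group)$p.value)
p_fdr <- p.adjust(p_raw, method = "BH")

res <- data.frame(feature = colnames(x), p_value = p_raw, p_fdr = p_fdr)
print(head(res[order(res$p_value), ], 10))
```

With 50 tests at a nominal 0.05 threshold, a few unadjusted p-values are expected to fall below the threshold by chance alone; the adjusted column is never smaller than the raw one and is the safer basis for reporting.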

For multivariate analysis, a single function (gentab_lm_long) allows the construction of regression models across the entire metabolite dataset, incorporating one or more covariates, with the option of building linear models with fixed effects, mixed-effects models with the “lmerTest” package [23, 24], or Tobit regression for censored data (specifically suited in this workflow for values below or above detection limits) via the “AER” package [25]. The output of the function is a table with complete effect estimates, with the related confidence intervals and p-values (including FDR-corrected values), for each metabolite-covariate pair. Results can be summarised visually with a volcano plot (Volcano_lm). Principal component analysis and heatmaps with hierarchical clustering can also be performed with extensive customisation options (getPCA, getHeatMap).
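To show why censoring matters, the sketch below compares a naive linear model with a Tobit model on simulated data for a single metabolite. It does not call gentab_lm_long or the AER package; instead it fits the Tobit model directly with survreg from the survival package (assumed to be installed, as it normally ships with R as a recommended package), which is the routine that AER's tobit function wraps. The covariates, effect sizes, and limit of detection are invented.

```r
library(survival)

set.seed(42)
n        <- 200
age      <- rnorm(n, mean = 50, sd = 10)
exposure <- rbinom(n, 1, 0.5)

# Simulated "true" concentration, depending on exposure and age.
latent <- 1 + 0.8 * exposure + 0.02 * age + rnorm(n, sd = 1)

# Assumed lower limit of detection: values below it are reported at the limit.
lod      <- 2
observed <- pmax(latent, lod)
detected <- latent > lod  # TRUE = measured value, FALSE = left-censored

# Naive linear model that treats the substituted values as real measurements.
fit_lm <- lm(observed ~ exposure + age)

# Tobit model: Gaussian regression that accounts for left-censoring at the LOD.
fit_tobit <- survreg(Surv(observed, detected, type = "left") ~ exposure + age,
                     dist = "gaussian")

print(rbind(lm = coef(fit_lm), tobit = coef(fit_tobit)))
cat("Censored observations:", sum(!detected), "of", n, "\n")
```

Running one such model per metabolite and collecting the estimates, confidence intervals, and FDR-corrected p-values in a long table is the kind of repetitive task the text describes as being handled by a single function.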

2.4 Miscellaneous

The package also contains functions (getChemData, add_ChemData_to_featINFO) to automatically retrieve chemical properties, synonyms, and identifiers from PubChem using the Power User Gateway (PUG) [26], and chemical ontology classification via the ClassyFire taxonomy [27]. Additionally, a function (do_FELLA_enrichment_analysis) is provided for pathway enrichment analysis using the FELLA package [28].
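For orientation, a property lookup through PUG-REST boils down to requesting a URL of a fixed shape. The base-R sketch below only builds such a URL and performs no network request; the helper pug_property_url is hypothetical and is not claimed to reflect how getChemData is implemented.

```r
# Hypothetical helper: build a PUG-REST URL that asks PubChem for selected
# properties of a compound identified by name (no request is sent here).
pug_property_url <- function(name, properties) {
  paste0(
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/",
    utils::URLencode(name, reserved = TRUE),
    "/property/", paste(properties, collapse = ","),
    "/JSON"
  )
}

props <- c("MolecularFormula", "MonoisotopicMass", "InChIKey")
url_caffeine <- pug_property_url("caffeine", props)
url_citric   <- pug_property_url("citric acid", props)  # space gets encoded

print(url_caffeine)
print(url_citric)
```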

Finally, the package includes a functionality (merge_results) to merge results from multiple chromatographic runs and/or ionization modes, prioritising overlapping compounds according to user-defined criteria (e.g., lower mass error, retention time shift, or p-value).
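The idea of prioritising overlapping compounds can be sketched in base R as follows, here using the lowest absolute mass error as the criterion. The tables, column names, and values are invented, and the code does not call merge_results.

```r
# Mock result tables from two ionisation modes (invented values).
pos <- data.frame(compound  = c("caffeine", "tryptophan", "creatinine"),
                  mode      = "positive",
                  ppm_error = c(1.2, 4.0, 0.8))
neg <- data.frame(compound  = c("tryptophan", "citric acid"),
                  mode      = "negative",
                  ppm_error = c(1.5, 2.1))

# Stack the tables, sort by absolute mass error, and keep the first
# (i.e., best) row for every compound.
all_res <- rbind(pos, neg)
all_res <- all_res[order(abs(all_res$ppm_error)), ]
merged  <- all_res[!duplicated(all_res$compound), ]
merged  <- merged[order(merged$compound), ]

print(merged)
```

Tryptophan is detected in both modes; the negative-mode row is retained because its mass error (1.5 ppm) is lower than that of the positive-mode row (4.0 ppm).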

3 Conclusions

In conclusion, the GetFeatistics R-package streamlines the processing and statistical analysis of both targeted and untargeted metabolomics data. It integrates a wide range of functionalities; its novelty lies particularly in the filtering of features based on pooled QC criteria, including the use of separate pooled QCs per sample group, and in the efficient construction of multiple regression models, including mixed-effects and Tobit models. The package is publicly available, fully documented, actively maintained, and open to contributions from the community.

Corresponding author: Gianfranco Frigerio, Center for Omics Sciences (COSR), IRCCS San Raffaele Scientific Institute, Milan, Italy; Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, Luxembourg; and Department of Clinical Sciences and Community Health, University of Milan, and Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico, Milan, Italy, E-mail: frigerio.gianfranco@hsr.it

Funding source: Fonds National de la Recherche Luxembourg

Award Identifier / Grant number: A18/BM/12341006

Acknowledgments

All the people of the Proteomics and Metabolomics group (ProMeFa) of the Center for Omics Sciences (COSR), IRCCS San Raffaele Scientific Institute, are acknowledged for the support. Albina Rastoder is acknowledged for helping with the testing of some functions during her two-month internship at the Luxembourg Centre for Systems Biomedicine (LCSB) of the University of Luxembourg. Prof. Dr. Emma Schymanski is thanked for the support and suggestions, and the entire Environmental Cheminformatics group of the LCSB is acknowledged for the help and feedback. Prof. Dr. Silvia Fustinoni and the group of the Laboratory of Environmental and Industrial Toxicology at the Department of Clinical Sciences and Community Health, University of Milan, and Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico are acknowledged for the support.

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: GF conceived the idea of the package; ideated, wrote, developed, tested, and is maintaining all the functions and the documentation; and wrote this manuscript.
Use of Large Language Models, AI and Machine Learning Tools: OpenAI’s ChatGPT (GPT-4o) was used solely to suggest improvements to English grammar and clarity of the manuscript text. The author selectively implemented and curated these suggestions, further edited the text, and takes full responsibility for the content.
Conflict of interest: The author states no conflict of interest.
Research funding: GF acknowledges funding support from the Luxembourg National Research Fund (FNR) for project A18/BM/12341006.
Data availability: The GetFeatistics R-package is freely available under a GPL-3.0 license on GitHub at https://github.com/FrigerioGianfranco/GetFeatistics.

References

1. Broadhurst, D, Goodacre, R, Reinke, SN, Kuligowski, J, Wilson, ID, Lewis, MR, et al. Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics 2018;14:72. https://doi.org/10.1007/s11306-018-1367-3.

2. Kirwan, JA, Gika, H, Beger, RD, Bearden, D, Dunn, WB, Goodacre, R, et al. Quality assurance and quality control reporting in untargeted metabolic phenotyping: mQACC recommendations for analytical quality management. Metabolomics 2022;18:70. https://doi.org/10.1007/s11306-022-01926-3.

3. Frigerio, G, Moruzzi, C, Mercadante, R, Schymanski, EL, Fustinoni, S. Development and application of an LC-MS/MS untargeted exposomics method with a separated pooled quality control strategy. Molecules 2022;27. https://doi.org/10.3390/molecules27082580.

4. Smith, CA, Want, EJ, O’Maille, G, Abagyan, R, Siuzdak, G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 2006;78:779–87. https://doi.org/10.1021/ac051437y.

5. Kuhl, C, Tautenhahn, R, Böttcher, C, Larson, TR, Neumann, S. CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal Chem 2012;84:283–9. https://doi.org/10.1021/ac202450g.

6. Ruttkies, C, Schymanski, EL, Wolf, S, Hollender, J, Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J Cheminf 2016;8:3. https://doi.org/10.1186/s13321-016-0115-9.

7. Schymanski, EL, Kondić, T, Neumann, S, Thiessen, PA, Zhang, J, Bolton, EE. Empowering large chemical knowledge bases for exposomics: pubchemlite meets MetFrag. J Cheminf 2021;13:1–15. https://doi.org/10.1186/s13321-021-00489-0.

8. Helmus, R, ter Laak, TL, van Wezel, AP, de Voogt, P, Schymanski, EL. patRoon: open source software platform for environmental mass spectrometry based non-target screening. J Cheminf 2021;13:1–25. https://doi.org/10.1186/s13321-020-00477-w.

9. Helmus, R, van de Velde, B, Brunner, AM, ter Laak, TL, van Wezel, AP, Schymanski, EL. patRoon 2.0: improved non-target analysis workflows including automated transformation product screening. J Open Source Softw 2022;7:4029. https://doi.org/10.21105/joss.04029.

10. Tsugawa, H, Ikeda, K, Takahashi, M, Satoh, A, Mori, Y, Uchino, H, et al. A lipidome atlas in MS-DIAL 4. Nat Biotechnol 2020;38:1159–63. https://doi.org/10.1038/s41587-020-0531-2.

11. Matsuzawa, Y, Tsugawa, H, Matsuzawa, Y, Nishida, K, Takahashi, M, Buyantogtokh, B. MS-DIAL metabolomics MSP spectral kit containing EI-MS, MS/MS, and CCS values – last edited in Aug. 8th, 2024 [Internet]; 2024. https://systemsomicslab.github.io/compms/msdial/main.html#MSP [Accessed 8 Jan 2025].

12. Carnoli, AJ, Lohuis, PO, Buydens, LMC, Tinnevelt, GH, Jansen, JJ. Linear mixed-effects models in chemistry: a tutorial. Anal Chim Acta 2024;1304:342444. https://doi.org/10.1016/j.aca.2024.342444.

13. Rigamonti, AE, Frigerio, G, Caroli, D, De Col, A, Cella, SG, Sartorio, A, et al. A metabolomics-based investigation of the effects of a short-term body weight reduction program in a cohort of adolescents with obesity: a prospective interventional clinical study. Nutrients 2023;15:529. https://doi.org/10.3390/nu15030529.

14. McDonald, JF, Robert, AM. The uses of tobit analysis. Rev Econ Stat 1980;62:318–21. https://doi.org/10.2307/1924766.

15. Frigerio, G, Favero, C, Savino, D, Mercadante, R, Albetti, B, Dioni, L, et al. Plasma metabolomic profiling in 1391 subjects with overweight and obesity from the SPHERE study. Metabolites 2021;11. https://doi.org/10.3390/metabo11040194.

16. Wickham, H, Averick, M, Bryan, J, Chang, W, McGowan, LD, François, R, et al. Welcome to the tidyverse. J Open Source Softw 2019;4:1686. https://doi.org/10.21105/joss.01686.

17. Wickham, H, Hesselberth, J, Salmon, M, Roy, O, Brüggemann, S. pkgdown: make static HTML documentation for a package [Internet]; 2024. Available from: https://CRAN.R-project.org/package=pkgdown.

18. Gu, H, Liu, G, Wang, J, Aubry, AF, Arnold, ME. Selecting the correct weighting factors for linear and quadratic calibration curves with least-squares regression algorithm in bioanalytical LC-MS/MS assays and impacts of using incorrect weighting factors on curve stability, data quality, and assay performance. Anal Chem 2014;86:8959–66. https://doi.org/10.1021/ac5018265.

19. Sonawane, SS, Chhajed, SS, Attar, SS, Kshirsagar, SJ. An approach to select linear regression model in bioanalytical method validation. J Anal Sci Technol 2019;10:1. https://doi.org/10.1186/s40543-018-0160-2.

20. Schymanski, EL, Jeon, J, Gulde, R, Fenner, K, Ruff, M, Singer, HP, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 2014;48:2097–8. https://doi.org/10.1021/es5002105.

21. Talavera, AB, Aurich, D, Aho, VTE, Singh, RR, Cheng, T, Zaslavsky, L, et al. Studying the Parkinson’s disease metabolome and exposome in biological samples through different analytical and cheminformatics approaches: a pilot study. Anal Bioanal Chem 2022;414:7399–419. https://doi.org/10.1007/s00216-022-04207-z.

22. Benjamini, Y, Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B 1995;57. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.

23. Kuznetsova, A, Brockhoff, PB, Christensen, RHB. lmerTest package: tests in linear mixed effects models. J Stat Software 2017;82:1–26. https://doi.org/10.18637/jss.v082.i13.

24. Bates, D, Mächler, M, Bolker, B, Walker, S. Fitting linear mixed-effects models using lme4. J Stat Software 2015;67:1–48. https://doi.org/10.18637/jss.v067.i01.

25. Kleiber, C, Zeileis, A. Applied econometrics with R [Internet]; 2008. Available from: https://CRAN.R-project.org/package=AER. https://doi.org/10.32614/CRAN.package.AER.

26. Kim, S, Thiessen, PA, Cheng, T, Yu, B, Bolton, EE. An update on PUG-REST: RESTful interface for programmatic access to PubChem. Nucleic Acids Res 2018;46:W563–70. https://doi.org/10.1093/nar/gky294.

27. Djoumbou Feunang, Y, Eisner, R, Knox, C, Chepelev, L, Hastings, J, Owen, G, et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminf 2016;8:61. https://doi.org/10.1186/s13321-016-0174-y.

28. Picart-Armada, S, Fernández-Albert, F, Vinaixa, M, Yanes, O, Perera-Lluna, A. FELLA: an R package to enrich metabolomics data. BMC Bioinf 2018;19:538. https://doi.org/10.1186/s12859-018-2487-5.

Received: 2025-09-09

Accepted: 2025-12-05

Published Online: 2025-12-24

This work is licensed under the Creative Commons Attribution 4.0 International License.
