Unveiling the power of R: a comprehensive perspective for laboratory medicine data analysis

  • Chaochao Ma and Ling Qiu
Published/Copyright: March 11, 2025

Abstract

The R language has gained traction in laboratory medicine for its statistical power and dynamic tools like RMarkdown and RShiny. However, there is limited literature summarizing R packages and functions tailored for laboratory medicine, making it difficult for clinical laboratory workers to access these tools. Additionally, varying algorithms across R packages can lead to inconsistencies in published reports. This review addresses these challenges by providing an overview of R’s evolution and its key features, followed by a summary of statistical methods implemented in R, including platform comparisons, precision verification, factor analysis, and the establishment of reference intervals (RIs). We also highlight the development and validation of predictive models using techniques such as linear and logistic regression, decision trees, random forests, support vector machines, naive Bayes, K-Nearest Neighbors, k-means clustering, and backpropagation neural networks – all implemented in R. To ensure transparency and reproducibility in research, a checklist is provided for authors publishing papers using R for data analysis in laboratory medicine. In the final section, the potential of R in big data analytics is explored, focusing on standardized reporting through RMarkdown and the creation of user-friendly data visualization platforms with RShiny. Moreover, the integration of large language models (LLMs), such as ChatGPT, is discussed for their benefits in enhancing R programming, automating reporting, and offering insights from data analysis, thus improving the efficiency and accuracy of laboratory data analysis.


Corresponding author: Ling Qiu, Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Science, Beijing 100730, P.R. China; and State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Science, Beijing 100730, P.R. China, E-mail:

Acknowledgments

The code in this article was written in R (version 4.3.1). The manuscript was edited using RMarkdown (source code available upon request from the corresponding author). Images were edited using WPS Office software (version 6.7.1). Additionally, this research utilized Chat GPT-4 for coding assistance and language improvement.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: Chaochao Ma wrote and revised this manuscript. Ling Qiu made suggestions for the revision of the manuscript. The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: Chat GPT-4 for coding assistance and language improvement.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: The study was supported by the National Natural Science Foundation of China (72274218).

  7. Data availability: Not applicable.

Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/cclm-2024-1193).


Received: 2024-10-14
Accepted: 2025-02-18
Published Online: 2025-03-11
Published in Print: 2025-07-28

© 2025 Walter de Gruyter GmbH, Berlin/Boston
