Unveiling the power of R: a comprehensive perspective for laboratory medicine data analysis

Chaochao Ma; Ling Qiu

doi:10.1515/cclm-2024-1193

Published/Copyright: March 11, 2025

From the journal Clinical Chemistry and Laboratory Medicine (CCLM) Volume 63 Issue 8

Abstract

The R language has gained traction in laboratory medicine for its statistical power and dynamic tools like RMarkdown and RShiny. However, there is limited literature summarizing R packages and functions tailored for laboratory medicine, making it difficult for clinical laboratory workers to access these tools. Additionally, varying algorithms across R packages can lead to inconsistencies in published reports. This review addresses these challenges by providing an overview of R’s evolution and its key features, followed by a summary of statistical methods implemented in R, including platform comparisons, precision verification, factor analysis, and the establishment of reference intervals (RIs). We also highlight the development and validation of predictive models using techniques such as linear and logistic regression, decision trees, random forests, support vector machines, naive Bayes, k-nearest neighbors, k-means clustering, and backpropagation neural networks, all implemented in R. To ensure transparency and reproducibility in research, a checklist is provided for authors publishing papers that use R for data analysis in laboratory medicine. In the final section, the potential of R in big data analytics is explored, focusing on standardized reporting through RMarkdown and the creation of user-friendly data visualization platforms with RShiny. Moreover, the integration of large language models (LLMs), such as ChatGPT, is discussed for its benefits in enhancing R programming, automating reporting, and offering insights from data analysis, thus improving the efficiency and accuracy of laboratory data analysis.

Keywords: R language; laboratory medicine; data analysis; code; predictive models

Corresponding author: Ling Qiu, Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Science, Beijing 100730, P.R. China; and State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Science, Beijing 100730, P.R. China, E-mail: lingqiubjbigdata@126.com

Acknowledgments

The code in this article was written in the R language (version 4.3.1). The manuscript was edited using RMarkdown (source code available upon request from the corresponding author). Images were edited using WPS Office software (version 6.7.1). Additionally, this research utilized Chat GPT-4 for coding assistance and language improvement.

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: Chaochao Ma wrote and revised this manuscript. Ling Qiu made suggestions for the revision of the manuscript. The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Use of Large Language Models, AI and Machine Learning Tools: Chat GPT-4 was used for coding assistance and language improvement.
Conflict of interest: The authors state no conflict of interest.
Research funding: The study was supported by the National Natural Science Foundation of China (72274218).
Data availability: Not applicable.

Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/cclm-2024-1193).

Received: 2024-10-14

Accepted: 2025-02-18

Published Online: 2025-03-11

Published in Print: 2025-07-28
