Abstract
Objectives
In the context of exploratory data analysis and machine learning, standardization of laboratory results is an important pre-processing step. Variable proportions of pathological results in routine datasets lead to shifts in the mean (µ) and standard deviation (σ) and thus cause problems for the classical z-score transformation. This study therefore investigates whether the zlog transformation compensates for these disadvantages and makes the results more meaningful from a medical perspective.
Methods
The results presented here were obtained with the statistical software environment R; the underlying dataset comes from the UC Irvine Machine Learning Repository. We compare the zlog and z-score transformations for five different dimension reduction methods, hierarchical clustering, and four supervised classification methods.
Results
With the zlog transformation, we obtain better results in this study than with the z-score transformation for dimension reduction, clustering, and classification methods. By compensating for the disadvantages of the z-score transformation, the zlog transformation allows more meaningful medical conclusions.
Conclusions
We recommend using the zlog transformation of laboratory results for pre-processing when exploratory data analysis and machine learning techniques are applied.
Introduction
Exploratory data analysis and machine learning (ML) for multivariate datasets have attracted considerable attention in numerous research areas, including laboratory medicine [1, 2]. For many statistical methods used in this context, it is important to standardize the values so that they can be compared with each other despite different scales.
In this study, we analyze a medical machine learning dataset [3] that includes 10 different analytes measured in blood donors and hepatitis C patients as well as demographic data such as sex and age. The most popular method to bring all numerical data onto a common scale is the z-score transformation, which expresses each value x as a deviation from the mean µ in multiples of the standard deviation σ:

$$z = \frac{x - \mu}{\sigma} \tag{1}$$
Since the z-score transformation is not robust against pathological outliers, we compare the results of this conventional normalization method with the more recent zlog transformation, in which the mean and standard deviation are not calculated from the data itself, but from the logarithms of the lower and upper reference limits LL and UL of the respective analytes [4]:

$$\operatorname{zlog}(x) = \left(\log x - \frac{\log LL + \log UL}{2}\right) \cdot \frac{3.92}{\log UL - \log LL} \tag{2}$$
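To make the two formulas concrete, the following is a minimal R sketch (not the authors' original code); the example values are fictitious, and the reference limits are the female GGT limits from Table 1.

```r
# z-score (Equation (1)): mean and SD estimated from the data themselves
zscore <- function(x) (x - mean(x)) / sd(x)

# zlog (Equation (2)): mean and SD derived from the logarithms of the
# reference limits; the interval UL - LL spans 2 * 1.96 standard deviations
zlog <- function(x, LL, UL) {
  mu    <- (log(LL) + log(UL)) / 2
  sigma <- (log(UL) - log(LL)) / 3.92
  (log(x) - mu) / sigma
}

ggt <- c(12, 35, 70, 250, 650)     # fictitious GGT values in U/L
zscore(ggt)                        # depends on the sample composition
zlog(ggt, LL = 7.49, UL = 73.92)   # female GGT limits, cf. Table 1
```

Note that the zlog value of a given measurement does not change when pathological cases are added to the dataset, whereas its z-score does.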
The aim of our study is to find out which normalization method is superior for dimension reduction as well as unsupervised and supervised machine learning algorithms [4].
Materials and methods
The HCV dataset used in our study is freely available from the UC Irvine Machine Learning Repository [3] and has already been used successfully for machine learning applications [2, 5]. Here, it forms the basis for the comparison of the two standardization methods mentioned above. The data contain measurements from 583 individuals aged 23–77 years; 61 % of them are male and 39 % female.
The participants are separated into five categories [5] with the following means and standard deviations of age: 519 healthy blood donors (47.2±9.6), seven blood donors with the suspicion of liver disease (57.6±11.1), 20 patients diagnosed with histologically inconspicuous hepatitis C (40.7±11.1), 12 patients diagnosed with liver fibrosis (49.7±12.1), and 24 patients diagnosed with liver cirrhosis (54.3±8.7).
Ten analytes, which are commonly used as parameters for the early- and late-stage functionality of the liver, are reported: albumin (ALB), bilirubin (BIL), alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), cholesterol (CHOL), cholinesterase (CHE), creatinine (CREA), gamma-glutamyltransferase (GGT), and total protein (PROT). For the analysis of the data and the preparation of the graphics, we used the statistical software environment R (Version 4.1.2, www.r-project.org) and the following packages: MASS, neuralnet, tsne, umap, class and e1071 [6], [7], [8], [9], [10].
To project all 10 analytes onto comparable scales, we transformed the absolute values into z-scores (Equation (1)) and zlog values (Equation (2)). Both calculations result in dimensionless relative values with a mean of approximately 0 and a scatter range between roughly −10 and +10 for the majority of values. For the zlog transformation, we calculated gender-specific lower and upper reference limits according to ref. [11] as the 2.5th and 97.5th percentiles of the respective values measured in the healthy blood donor subpopulation (Table 1). Regardless of the analytical method or the measuring unit, the common reference interval for all zlog values is thus −1.96 to +1.96, which corresponds to the interval between the 2.5th and 97.5th percentiles of a standard normal distribution [4].
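The following sketch shows how such limits can be derived in base R; the data frame and column names (`hcv`, `Sex`, `Category`, `GGT`) are assumptions about the file layout, not guaranteed to match the repository file exactly.

```r
# Healthy blood donor subpopulation; the category label is an assumption
donors <- subset(hcv, Category == "0=Blood Donor")

# 2.5th and 97.5th percentiles as lower and upper reference limits
ref_limits <- function(x) {
  q <- quantile(x, probs = c(0.025, 0.975), na.rm = TRUE)
  setNames(q, c("LL", "UL"))
}

# Limits per sex for one analyte, cf. Table 1
sapply(split(donors$GGT, donors$Sex), ref_limits)
```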
Table 1: Reference intervals derived from the dataset as 2.5th and 97.5th percentiles of blood donor measurements.

| Analyte | Unit | Female LL | Female UL | Male LL | Male UL |
|---|---|---|---|---|---|
| ALB | g/L | 32.07 | 50.04 | 35.03 | 51.35 |
| ALP | U/L | 36.51 | 103.13 | 40.93 | 105.90 |
| ALT | U/L | 10.12 | 45.22 | 11.65 | 71.03 |
| AST | U/L | 14.92 | 41.15 | 17.28 | 46.98 |
| BIL | µmol/L | 2.40 | 17.63 | 2.90 | 25.40 |
| CHE | kU/L | 4.58 | 11.02 | 5.42 | 12.33 |
| CHOL | mmol/L | 3.63 | 7.56 | 3.71 | 7.76 |
| CREA | µmol/L | 52.00 | 93.65 | 63.75 | 111.00 |
| GGT | U/L | 7.49 | 73.92 | 10.48 | 99.63 |
| PROT | g/L | 62.61 | 79.63 | 64.18 | 80.75 |

LL, lower limit; UL, upper limit.
We applied 10 ML methods to the transformed values, four of them being supervised and the other six unsupervised. The goal of supervised machine learning is to predict a known output, in our case the predefined classes (diagnoses) mentioned above, whereas unsupervised techniques try to find hidden structures or correlations in the data without knowing anything about these classes. Ideally, the structures or clusters identified by unsupervised algorithms should reflect the known medical entities.
For unsupervised ML, we applied the following dimensionality reduction methods (DRM): Principal Component Analysis (PCA) [12, 13], Sammon Mapping [14], Autoencoder [15, 16] in two variants (the first with only one hidden neuron layer, the second a more complex one with several hidden layers), t-Distributed Stochastic Neighbor Embedding (t-SNE) [17, 18], and Uniform Manifold Approximation and Projection (UMAP) [19]. Dimensionality reduction aims to reduce the number of variables in high-dimensional datasets without substantial loss of information. In this study, we used dimensionality reduction to display the ten-dimensional dataset as two-dimensional scatter plots (Figures 2–6). As another exploratory data analysis technique, we used Hierarchical Cluster Analysis (HCA) [20], which orders and groups the cases according to similarities in the measured values.
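The following R sketch outlines how these methods can be called with the packages listed above. It is an illustration under stated assumptions: `X` stands for the prepared matrix of transformed values, and the Ward linkage in the clustering step is our illustrative choice, since the text does not fix a linkage method.

```r
library(MASS)   # sammon()
library(tsne)   # tsne()
library(umap)   # umap()

# X: numeric matrix of transformed values (cases x 10 analytes)
pc  <- prcomp(X)$x[, 1:2]       # PCA: first two principal components
sm  <- sammon(dist(X))$points   # Sammon mapping (requires distinct rows)
ts2 <- tsne(X, k = 2)           # t-SNE embedding in two dimensions
um  <- umap(X)$layout           # UMAP embedding

plot(pc, xlab = "PC1", ylab = "PC2")   # cf. Figure 2

# Hierarchical clustering of the cases (rows); linkage chosen for illustration
hc <- hclust(dist(X), method = "ward.D2")
plot(hc)                               # dendrogram, cf. Figure 7
```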
Figure 1: Boxplots and ranges of zlog and z-score values for 10 analytes. The dashed vertical lines indicate the common zlog reference interval of 0±1.96.
Figure 2: Principal component analysis (PCA) with z-score and zlog transformation. The left plot shows the result after z-score standardization, and the right plot shows the respective dimensionality reduction after zlog transformation. PC1 stands for the first principal component and PC2 for the second.
Figure 3: Sammon mapping with z-score and zlog transformation. For more details see Figure 2.
Figure 4: Autoencoder with z-score and zlog transformation. The upper two graphs use a simple and the lower two graphs a complex autoencoder. For more details see Figure 2.
Figure 5: t-Distributed stochastic neighbor embedding (t-SNE) with z-score and zlog transformation. For more details see Figure 2.
Finally, we applied the following supervised methods for the prediction of known categories: Linear Discriminant Analysis (LDA) [21], k-Nearest Neighbors (kNN) with k=1 and k=3 [2], linear and non-linear Support Vector Machines (SVM) [22], and Artificial Neural Networks (ANN) with one and three hidden layers [23]. For the supervised experiments, we restricted the analysis to the four well-characterized classes listed in Table 3, excluding the group of suspected blood donors because this subgroup was too small (n=7) for reliable detection.
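A hedged sketch of how such classifiers can be fitted with the packages named in this section; `X` denotes the matrix of transformed values and `y` the factor of the four diagnostic classes, both assumed to be prepared beforehand.

```r
library(MASS)    # lda()
library(e1071)   # svm()
library(class)   # knn()

# X: transformed analytes (matrix); y: factor with the four diagnostic classes
fit_lda  <- lda(X, grouping = y)
pred_lda <- predict(fit_lda, X)$class
table(Predicted = pred_lda, Actual = y)    # confusion matrix, cf. Table 3

fit_svm  <- svm(X, y, kernel = "radial")   # non-linear SVM; "linear" for the linear variant
pred_knn <- knn(train = X, test = X, cl = y, k = 3)
```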
Table 2: Percentage of correctly classified results with four supervised machine learning algorithms: A, k-nearest neighbour; B, support vector machine; C, linear discriminant analysis; D, neural network with three hidden layers.

| Transformation | A (k=1) | A (k=3) | B (linear) | B (non-linear) | C | D |
|---|---|---|---|---|---|---|
| z-score | 94.6 | 94.3 | 94.8 | 95.1 | 94.6 | 98.6 |
| zlog | 96.2 | 95.7 | 96.2 | 95.8 | 97.0 | 99.1 |
To cross-validate the results, we used the leave-one-out method, an iterative procedure often applied to small datasets [5]. Given a total of n samples, each ML model was trained on n−1 samples, and its performance was tested on the single sample left out. This process was repeated n times, once for each sample in the dataset.
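In R, the leave-one-out scheme can be written as a simple loop, sketched here for the k=1 nearest-neighbor classifier (`X` and `y` as above; not the authors' original code):

```r
library(class)

# Leave-one-out cross-validation: train on n-1 cases, test on the case left out
n    <- nrow(X)
pred <- character(n)
for (i in seq_len(n)) {
  pred[i] <- as.character(knn(train = X[-i, ],
                              test  = X[i, , drop = FALSE],
                              cl    = y[-i], k = 1))
}
mean(pred == as.character(y))   # overall accuracy, cf. Table 2
```

For kNN specifically, the function knn.cv() from the class package implements the same scheme directly.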
Results
Table 1 summarizes the results of the direct estimation of reference intervals. We deliberately determined the 2.5th and 97.5th percentiles from the blood donor data itself rather than taking them from the literature, in order to avoid the need for external sources for standardization. Our results reflect the values given in the literature quite well [24], with only a few deviations (e. g. the upper limits of GGT). For the purposes of the present study, this accuracy is sufficient.
Figure 1 gives an overview of the transformed values for the 10 analytes in a boxplot format. The blue boxes represent the zlog values and the red ones represent the respective z-score values. As expected, all boxes are within the common reference interval of −1.96 to +1.96 (vertical dashed lines), and all medians (vertical lines inside the boxes) are close to zero. It is noticeable, however, that the boxes (i. e. the central 50 %) of the z-score values are extremely narrow for ALT, AST, BIL, CREA, and GGT, whereas the boxes of the zlog values all have about the same size with a reasonable width compared to the common reference interval. In addition, the zlog values scatter quite symmetrically around the reference interval, whereas most z-score values are shifted to the right so that low values (e. g. ALB, ALT, CHE in liver cirrhosis) are not well represented.
Figure 6: Uniform Manifold Approximation and Projection (UMAP) with z-score and zlog transformation. For more details see Figure 2.
In other words, the zlog values preserve more information and reflect the clinical significance of the analytes better than the z-score values. In fact, most laboratory analytes do not follow a normal distribution; instead, they tend to be right-skewed. Consequently, the zlog transformation, which operates on logarithms, helps to approximate a normal distribution.
Figures 2–6 visualize the results of the dimensionality reduction techniques in two-dimensional scattergrams. The crucial question to be answered here is whether the original information about the different subgroups (inconspicuous and suspected blood donors, hepatitis with and without histological changes) is retained or blurred in the graphics.
The figures confirm the above expectation. Figure 2 shows the results of Principal Component Analysis (PCA), one of the oldest and most common dimensionality reduction techniques. It is based on feature extraction, which means that new features are created that capture the key information contained in the 10 original features (analytes). On the left side of Figure 2 the majority of z-score values for hepatitis and fibrosis are lost in the black cloud of normal blood donors. Only cirrhotic cases and suspected blood donors are clearly separated from the rest. The zlog values shown on the right resolve significantly more pathological cases: more than half of hepatitis patients without histological signs and the majority of fibrosis patients are found outside the black cloud of healthy blood donors.
Although Sammon Mapping (Figure 3) and Autoencoders (Figure 4) work differently from PCA, the results are very similar. Sammon Mapping aims to preserve the pairwise distances between the points in the high-dimensional space when mapping them to a lower-dimensional space. Autoencoders are neural network architectures designed to learn compressed representations of the input data. Figure 4 shows that the Autoencoder results are independent of whether we use a simple (one layer) or a complex (several layers) autoencoder.
The results of t-SNE and UMAP are even more impressive. These advanced, relatively new methods assume that the multidimensional data lie on a nonlinear data structure (a so-called manifold) and try to preserve the local structures while projecting them to a low-dimensional representation. Figures 5 and 6 show that both techniques resolve the healthy blood donors as distinguishable points and not as dense black areas as in Figures 2–4. With z-score as the underlying standardization technique, the suspected blood donors and the three hepatitis categories are scattered somewhere between the black dots, whereas with zlog transformation they appear as clearly separated clusters. It can also be seen that UMAP takes absolute distances into account, in contrast to t-SNE, which focuses only on the local neighborhood of the points. Therefore, UMAP places the diseased cases farther away from the healthy group.
The result of hierarchical clustering is shown in Figure 7. As expected from the “boxes” in Figure 1, the z-scores are mostly concentrated around zero, which results in pale colors in the upper graphic, whereas the zlog values reflect the distribution of normal and pathological values better leading to stronger colors in the lower graphic. Thus, not surprisingly, there is a better separation of the diseased groups from the healthy blood donors with the zlog transformation, especially visible in the cirrhosis patients who cluster well together.
Figure 7: Results of the hierarchical cluster analysis (HCA). Each row represents a single case, and each column stands for an analyte. The dendrograms are displayed on the left side of each clustergram and indicate how closely the laboratory profiles of the cases relate to each other. The upper graphic shows the results obtained with z-scores, the lower graphic those obtained with zlog values.
Table 2 summarizes the results of the four classifiers kNN, SVM, LDA, and ANN. The accuracy with zlog values is consistently better than that with z-scores. This becomes even more evident when we look at the individual results for the four classes. Table 3 takes the LDA method as an example. Here, the zlog values perform considerably better than the z-scores for the three pathological classes. This effect is particularly pronounced in the case of fibrosis, which is difficult to diagnose with routine biomarkers [5]: more than 40 % are correctly identified using zlog values, but less than 20 % with z-scores. Interestingly, the high accuracy for healthy blood donors is independent of whether z-scores or zlog values are used.
Table 3: Comparison of z-score transformation and zlog transformation, taking LDA as an example of a classification method. Rows give the predicted class, columns the actual class.

3A: Classification with z-scores (overall accuracy 94.6 %)

| Predicted | Blood donor | Hepatitis | Fibrosis | Cirrhosis |
|---|---|---|---|---|
| Blood donor | 518 | 8 | 4 | 1 |
| Hepatitis | 0 | 7 | 6 | 3 |
| Fibrosis | 1 | 4 | 2 | 3 |
| Cirrhosis | 0 | 1 | 0 | 17 |
| Correctly classified, % | 99.8 | 35.0 | 16.7 | 70.8 |

3B: Classification with zlog values (overall accuracy 97.0 %)

| Predicted | Blood donor | Hepatitis | Fibrosis | Cirrhosis |
|---|---|---|---|---|
| Blood donor | 518 | 1 | 0 | 0 |
| Hepatitis | 0 | 13 | 7 | 0 |
| Fibrosis | 1 | 4 | 5 | 2 |
| Cirrhosis | 0 | 2 | 0 | 22 |
| Correctly classified, % | 99.8 | 65.0 | 41.7 | 91.7 |
Detailed results for all supervised methods are shown in Supplementary Table 1. The distinction between blood donors and cirrhosis cases is consistently the most successful, while the separation of fibrosis cases is the least. The zlog values almost always perform better than the z-scores; the two transformations reach equally high accuracy only with neural networks for blood donors and cirrhosis, and with the other three methods for blood donors. The poor performance of the Support Vector Machines in the detection of fibrosis is remarkable; here, the Neural Networks are superior to all other methods.
Discussion
In medical diagnostics and prognostics, zlog transformation has already proven to be a useful scaling and normalization method for laboratory data. Published fields of application include standardization of electronic health records [4], plausibility testing of reference intervals [24], or outcome prediction in severely ill children [25]. This paper adds a new aspect to this list by introducing zlog values as an alternative to z-score scaling in the preparation of laboratory data for machine learning.
One main difference between the two approaches is that the calculation of z-scores is based on the entire dataset including all pathological values, while the zlog transformation refers only to the central 95 % of healthy individuals [4]. In the case of z-scores, this results in a comparatively large scatter in the denominator of Equation (1) and thus in a substantial loss of information, because the majority of values are concentrated around zero. In addition, the position and variance of z-scores are strongly influenced by individual extreme values, so that they are frequently shifted to the right in a relatively unpredictable manner (Figure 1).
Another advantage of zlog values over z-scores is that relevant covariates can be taken into account in the normalization. In our case, we only distinguished between women and men when computing the reference intervals and the zlog values. But especially when children are part of the cohort, age-dependent reference intervals play a crucial role, and the zlog values would automatically account for this age dependence.
Because the zlog values are based on comparably stable parameters of position and variance derived from healthy individuals, they do not suffer from these disadvantages. They retain the information contained in the original data and are mostly projected onto a scale ranging from roughly −10 to +10. Increased and decreased pathological values are represented about equally well, while the z-score normalization often loses information about disease states with decreased values. This is a possible explanation for the poor performance of the z-score-based classification experiments in detecting fibrosis and cirrhosis of the liver, since these severe disease states are characterized by low production rates of proteins such as albumin, cholinesterase or alanine aminotransferase [5, 24].
One limitation of the zlog approach is that the calculation requires reference limits, which may not always be available in data mining and machine learning projects. In our study, reference intervals were calculated according to the recommended standard method [11] as the 2.5th and 97.5th percentiles of a relatively large non-diseased cohort included in the dataset (Table 1). If no such well-defined subset is available, reference limits can usually be taken from the assay insert sheet provided by the manufacturer or derived from the data itself with a so-called indirect method, as long as the proportion of diseased individuals is not too high [26]. The R packages reflimR and refineR can be used for this purpose [27, 28]; a minimal sketch follows below.
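As an illustration only: the sketch below assumes a numeric vector of routine results for a single analyte; the simulated mixture is fictitious, and reflim() and findRI() are, to our understanding, the main estimation functions of reflimR and refineR, which should be verified against the current package documentation.

```r
library(reflimR)   # indirect reference limit estimation [27]
library(refineR)   # indirect reference interval estimation [28]

# Simulated routine results: mostly non-diseased, some pathological values
x <- c(rlnorm(950, meanlog = 3, sdlog = 0.25),
       rlnorm(50,  meanlog = 4, sdlog = 0.50))

rl <- reflim(x)          # assumed main function of reflimR
ri <- findRI(Data = x)   # assumed main function of refineR
print(ri)
```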
In our study, we provide many examples to demonstrate that the zlog standardization is clearly superior to z-scores, whenever exploratory data analysis and machine learning algorithms are applied to laboratory data. This is true both for supervised and unsupervised techniques. We achieved particularly impressive results with zlog values when using modern methods such as t-SNE (Figure 5) and UMAP (Figure 6) for dimensionality reduction, as well as in the differentiation of the three hepatitis states with LDA (Table 3) and artificial Neural Networks (Supplementary Table 1). Although our experiments with a publicly available HCV dataset are quite conclusive, the applicability of the zlog approach should be evaluated with a broader range of clinical examples in the future.
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: Open source data were used: https://archive.ics.uci.edu/dataset/571/hcv+data.
References
1. Rabbani, N, Kim, G, Suarez, C, Chen, J. Application of machine learning in routine laboratory medicine: current state and future directions. Clin Biochem 2022;103:1–7. https://doi.org/10.1016/j.clinbiochem.2022.02.011.
2. Oladimeji, O, Oladimeji, A, Olayanju, O. Machine learning models for diagnostic classification of hepatitis C tests. Front Health Informat 2021;10:70. https://doi.org/10.30699/fhi.v10i1.274.
3. HCV data. UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/ml/datasets/HCV+data [Accessed 10 March 2022].
4. Hoffmann, G, Klawonn, F, Lichtinghagen, R, Orth, M. The zlog value as a basis for the standardization of laboratory results. J Lab Med 2017;41:23–31. https://doi.org/10.1515/labmed-2017-0135.
5. Hoffmann, G, Bietenbeck, A, Lichtinghagen, R, Klawonn, F. Using machine learning techniques to generate laboratory diagnostic pathways — a case study. J Lab Precis Med 2018;3:58. https://doi.org/10.21037/jlpm.2018.06.01.
6. Venables, WN, Ripley, BD. Modern applied statistics with S, 4th ed. New York: Springer; 2002. ISBN 0-387-95457-0.
7. Fritsch, S, Guenther, F, Guenther, MF. Package 'neuralnet': training of neural networks. Available at: https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf [Accessed 10 March 2022].
8. Donaldson, J. tsne: t-distributed stochastic neighbor embedding for R (t-SNE). R package version 0.1-3; 2016. Available at: https://CRAN.R-project.org/package=tsne.
9. Konopka, T. umap: uniform manifold approximation and projection. R package version 0.2.10.0; 2023. Available at: https://CRAN.R-project.org/package=umap [Accessed 01 June 2024].
10. Meyer, D, Dimitriadou, E, Hornik, K, Weingessel, A, Leisch, F. e1071: misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.7-9; 2021. Available at: https://CRAN.R-project.org/package=e1071.
11. Horowitz, G, Altaie, S, Boyd, J, Ceriotti, F, Garg, U, Horn, P, et al. Defining, establishing, and verifying reference intervals in the clinical laboratory; tech rep document EP28-A3C. Wayne, PA, USA: Clinical & Laboratory Standards Institute; 2010.
12. Abdi, H, Williams, LJ. Principal component analysis. WIREs Comp Stat 2010;2:433–59. https://doi.org/10.1002/wics.101.
13. Jolliffe, IT, Cadima, J. Principal component analysis: a review and recent developments. Phil Trans R Soc A 2016;374:20150202. https://doi.org/10.1098/rsta.2015.0202.
14. Sammon, JW. A nonlinear mapping for data structure analysis. IEEE Trans Comput 1969;C-18:401–9. https://doi.org/10.1109/t-c.1969.222678.
15. Bank, D, Koenigstein, N, Giryes, R. Autoencoders. arXiv 2020; arXiv:2003.05991.
16. Hinton, GE, Salakhutdinov, RR. Reducing the dimensionality of data with neural networks. Science 2006;313:504–7. https://doi.org/10.1126/science.1127647.
17. van der Maaten, L, Hinton, G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
18. Cook, JA, Sutskever, I, Mnih, A, Hinton, GE. Visualizing similarity data with a mixture of maps. In: Proc 11th International Conference on Artificial Intelligence and Statistics, vol 2; 2007:67–74 pp.
19. McInnes, L, Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv e-prints 1802.03426; 2018. https://doi.org/10.21105/joss.00861.
20. Zhang, Z, Murtagh, F, Van Poucke, S, Lin, S, Lan, P. Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R. Ann Transl Med 2017;5:75. https://doi.org/10.21037/atm.2017.02.05.
21. Patil, MD, Sane, SS. Dimension reduction: a review. Int J Comput Appl 2014;92:23–9. https://doi.org/10.5120/16094-5390.
22. Saberi-Karimian, M, Khorasanchi, Z, Ghazizadeh, H, Tayefi, M, Saffar, S, Ferns, GA, et al. Potential value and impact of data mining and machine learning in clinical diagnostics. Crit Rev Clin Lab Sci 2021;58:275–96. https://doi.org/10.1080/10408363.2020.1857681.
23. Cadamuro, J. Rise of the machines: the inevitable evolution of medicine and medical laboratories intertwining with artificial intelligence — a narrative review. Diagnostics 2021;11:1399. https://doi.org/10.3390/diagnostics11081399.
24. Thomas, L. Clinical laboratory diagnostics; 2020. Available at: https://www.clinical-laboratory-diagnostics.com/.
25. Klawitter, S, Hoffmann, G, Holdenrieder, S, Kacprowski, T, Klawonn, F. A zlog-based algorithm and tool for plausibility checks of reference intervals. Clin Chem Lab Med 2023;61:260–5. https://doi.org/10.1515/cclm-2022-0688.
26. Jones, G, Haeckel, R, Loh, T, Sikaris, K, Streichert, T, Katayev, A, et al. Indirect methods for reference interval determination: review and recommendations. Clin Chem Lab Med 2019;57:20–9. https://doi.org/10.1515/cclm-2018-0073.
27. Hoffmann, G, Klawitter, S, Klawonn, F. reflimR: reference limit estimation using routine laboratory data. R package version 1.0.6. Available at: https://github.com/reflim/reflimR [Accessed 01 June 2024].
28. Ammer, T, Rank, C, Schuetzenmeister, A. refineR: reference interval estimation using real-world data. R package version 1.6.1; 2023. Available at: https://CRAN.R-project.org/package=refineR.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/labmed-2024-0051).
© 2024 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.