Abstract
Objectives
The Friedewald, Martin, and Sampson equations are widely used for calculating low-density lipoprotein cholesterol (LDL-C). We aimed to develop a new machine-learning-based formula to address errors and uncertainties associated with these equations.
Methods
We collected LDL-C, total cholesterol, triglyceride (TG), and high-density lipoprotein cholesterol measurements from 2,270 patients using the Abbott Architect c16000 analyzer for the training dataset. Three independent test datasets were generated using the homogeneous enzymatic method: 40,941 results from Architect c16000, 42,520 from Beckman Coulter AU5800, and 5,813 from Cobas c501 analyzers. Equations were developed using machine learning and symbolic regression in Python.
Results
The novel equation (mg/dL): Y-LDL-C = TC – HDL-C – (√TG) × TC/100. The formula demonstrated better performance across all datasets with the lowest root mean square error (9.9, 8.1, 16.0), mean absolute error (6.8, 5.9, 12.8), and mean absolute percentage error (5.2, 4.6, 9.9 %), along with the highest R2 values (0.94, 0.95, 0.84) and accuracy rates (95.7, 95.3, 71.1 %). For samples with TG<400 mg/dL (<4.516 mmol/L), analytical accuracy was 96.3, 95.4, and 71.5 %, with minimal clinical error rates (0.6, 0.6, 3.5 %). Even with TG≥400 mg/dL (≥4.516 mmol/L), accuracy remained high (72.7, 91.2, 65.3 %) with low clinical error rates (3.9, 1.3, 11.7 %). The novel formula consistently outperformed the Friedewald, Sampson, and extended Martin equations. A key limitation is that the equation was derived using enzymatic LDL-C rather than beta-quantification.
Conclusions
The novel formula shows improved concordance in LDL-C estimations compared to existing equations, demonstrating better performance across diverse patient populations and laboratory systems.
Introduction
Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, with low-density lipoprotein cholesterol (LDL-C) playing a pivotal role in risk stratification and therapeutic decision-making [1], [2], [3], [4]. Accurate measurement or estimation of LDL-C is therefore essential for effective clinical management. While reference methods such as beta-quantification provide high accuracy, they are labor-intensive, time-consuming, and costly. Direct enzymatic assays offer an alternative but are also expensive and not universally accessible.
The most commonly used method in clinical practice is the Friedewald equation; however, it tends to underestimate LDL-C levels at lower concentrations and becomes unreliable when triglyceride (TG) levels exceed 400 mg/dL [5], [6]. To address these limitations, the Sampson equation was developed to improve accuracy in hypertriglyceridemic patients, extending applicability to TG levels up to 800 mg/dL [7]. The Martin equation enhances the Friedewald approach by incorporating adjustable factors, but its reliance on pre-calculated lookup tables limits practical implementation and complicates integration into hospital automation systems [8]. Since the original Friedewald method was introduced in 1972, over 20 alternative LDL-C estimation formulas have been proposed [9].
In parallel, the increasing integration of LDL-C estimation methods with machine learning (ML) approaches highlights the growing need for more precise and adaptable prediction tools [10], [11], [12]. The present study aims to develop a robust LDL-C estimation formula that offers improved accuracy, particularly at clinically critical thresholds such as high TG and low LDL-C levels. Additionally, the proposed formula is designed for ease of implementation in both laboratory and hospital information systems, enhancing its clinical utility.
Materials and methods
Ethics committee approval
This study was conducted in accordance with the principles of the Declaration of Helsinki. Ethical approval was obtained from the Scientific Research Ethics Committee of the University of Health Sciences, Trabzon Faculty of Medicine, Turkey (Approval Date: February 27, 2024; Approval Number: 2024/59).
Lipid measurements and study variables
To develop accurate machine learning models for predicting low-density lipoprotein cholesterol (LDL-C) levels, a range of clinical variables – including age, sex, total cholesterol (TC), triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), and directly measured LDL-C (d-LDL-C) – were incorporated as input features (Table 1). Retrospective data were retrieved from 2,270 patients who received care at Akcaabat Hackali Baba State Hospital (Trabzon, Turkey) between January 1 and October 31, 2023, via the hospital’s laboratory information system.
Basic patient characteristics.
Variable | Train data (n=2,270) Abbott Architect | Test data 1 (n=40,941) Abbott Architect | Test data 2 (n=42,520) Beckman Coulter | Test data 3 (n=5,813) Roche Cobas |
---|---|---|---|---|
Age, years, median (IQR:Q1-Q3) | 53 (24:42–66) | 57 (23:46–69) | 56 (25:42–67) | 42 (20:34–54) |
Male, n (%) | 957 (42.2 %) | 14,448 (35.3 %) | 15,180 (35.7 %) | 2,462 (42.35 %) |
Female, n (%) | 1,313 (57.8 %) | 26,493 (64.7 %) | 27,340 (64.3 %) | 3,351 (57.65 %) |
TC, mg/dL, median (IQR:Q1-Q3) | 196 (61:166–227) | 194 (63:164–227) | 203 (67:171–238) | 190 (57:163–220) |
HDL-C, mg/dL, median (IQR:Q1-Q3) | 48 (16:41–57) | 49 (16:42–58) | 48 (16:41–57) | 49 (18:41–59) |
LDL-C, mg/dL, median (IQR:Q1-Q3) | 123 (49:99–148) | 126 (53:100–153) | 130 (50:106–156) | 129 (52:104–156) |
TG, mg/dL, median (IQR:Q1-Q3) | 125 (101:85–186) | 122 (94:84–178) | 124 (93:87–180) | 112 (88:79–167) |
LDL-C<70, mg/dL, n (%) | 117 (5.2 %) | 2,109 (5.2 %) | 1,141 (2.7 %) | 240 (4.1 %) |
LDL-C 70–99, mg/dL, n (%) | 472 (20.8 %) | 8,113 (19.8 %) | 6,993 (16.4 %) | 990 (17.0 %) |
LDL-C 100–129, mg/dL, n (%) | 710 (31.3 %) | 11,840 (28.9 %) | 12,898 (30.3 %) | 1,716 (29.5 %) |
LDL-C 130–159, mg/dL, n (%) | 590 (26.0 %) | 10,416 (25.4 %) | 11,931 (28.1 %) | 1,540 (26.5 %) |
LDL-C 160–189, mg/dL, n (%) | 275 (12.1 %) | 5,720 (14.0 %) | 6,547 (15.4 %) | 860 (14.8 %) |
LDL-C≥190, mg/dL, n (%) | 106 (4.7 %) | 2,743 (6.7 %) | 3,010 (7.1 %) | 467 (8.0 %) |
TG<100, mg/dL, n (%) | 796 (35.1 %) | 14,541 (35.5 %) | 14,492 (34.1 %) | 2,401 (41.3 %) |
TG 100–149, mg/dL, n (%) | 607 (26.7 %) | 11,820 (28.9 %) | 12,585 (29.6 %) | 1,625 (28.0 %) |
TG 150–199, mg/dL, n (%) | 374 (16.5 %) | 6,763 (16.5 %) | 7,140 (16.8 %) | 869 (14.9 %) |
TG 200–399, mg/dL, n (%) | 422 (18.6 %) | 6,974 (17.0 %) | 7,327 (17.2 %) | 814 (14.0 %) |
TG≥400, mg/dL, n (%) | 71 (3.1 %) | 843 (2.1 %) | 976 (2.3 %) | 104 (1.8 %) |
Venous blood samples were collected from the antecubital vein of participants following an overnight fast. Serum samples were analyzed for lipid profiles using enzymatic methods on an Architect c16000 analyzer (Abbott Laboratories, Abbott Park, IL, USA). TC was measured through a series of enzymatic reactions involving cholesteryl ester hydrolase, cholesterol oxidase, and peroxidase. TG was quantified through hydrolysis to glycerol and subsequent oxidation, with hydrogen peroxide measured colorimetrically. HDL-C and LDL-C were quantified using homogeneous enzymatic colorimetric assays after selective removal of interfering lipoproteins.
These records constituted the training dataset. Three independent test datasets were also obtained from hospitals within the province and included lipid profiles measured between 2023 and 2024 using the homogeneous enzymatic method. Test Dataset 1 consisted of 40,941 samples measured on the Architect c16000 analyzer, Test Dataset 2 included 42,520 samples from the AU5800 analyzer (Beckman Coulter Inc., Brea, CA, USA), and Test Dataset 3 comprised 5,813 samples from the Cobas c501 analyzer (Roche Diagnostics, Mannheim, Germany). No data were excluded based on extreme values; all available results were included in the analysis. Basic demographic and lipid profile characteristics of patients across datasets are summarized in Table 1.
Analytical performance evaluation
The analytical performance of lipid measurements was assessed by calculating the total analytical error (%TEa), incorporating both systematic error (%Bias) and random error (%CV), using the formula: %TEa=%Bias + 1.65 × %CV.
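As a minimal sketch of the total error formula above, the function below combines a bias and a CV into a single percentage. The bias value in the example is hypothetical (the text reports CVs and final %TEa values, not the individual biases), and taking the absolute value of the bias is an assumption made here so that a negative bias still adds to the total.

```python
def total_error_percent(bias_percent: float, cv_percent: float, z: float = 1.65) -> float:
    """Total analytical error (%) following %TEa = %Bias + 1.65 x %CV.

    Assumption: the bias enters as an absolute value.
    """
    return abs(bias_percent) + z * cv_percent


# Hypothetical 1.0 % bias combined with the 2.31 % CV reported for TC
# at 109.29 mg/dL.
print(round(total_error_percent(1.0, 2.31), 2))  # 4.81
```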
Internal quality control was performed daily using two levels of Technopath Multichem S control materials (lot numbers: 11811210 and 11811212) on the Abbott Architect c16000 analyzer during the training phase. The observed coefficients of variation for the first six months were as follows:
TC: 2.31 % at 109.29 mg/dL, 3.03 % at 170.24 mg/dL
HDL-C: 2.94 % at 33.6 mg/dL, 2.88 % at 48.35 mg/dL
LDL-C: 2.22 % at 66.45 mg/dL, 2.55 % at 106.5 mg/dL
TG: 3.33 % at 82.3 mg/dL, 2.78 % at 132 mg/dL
External quality assurance was ensured via participation in the Bio-Rad EQAS Monthly Program 21. The calculated %TEa values were as follows:
Training dataset: TC: 3.87 %, HDL-C: 1.72 %, LDL-C: 6.84 %, TG: 2.23 %
Abbott test dataset: TC: 4.68 %, HDL-C: 1.62 %, LDL-C: 9.03 %, TG: 2.29 %
Beckman test dataset: TC: 5.36 %, HDL-C: 7.00 %, LDL-C: 6.63 %, TG: 1.20 %
Roche test dataset: TC: 4.84 %, HDL-C: 1.68 %, LDL-C: 8.37 %, TG: 1.18 %
All values were within the acceptable limits defined by the 2019 Clinical Laboratory Improvement Amendments (CLIA) guidelines [13].
Algorithm development
Symbolic regression, an evolutionary computation-based machine learning technique, was employed to derive mathematical expressions that best describe the relationship between input variables and LDL-C levels. The algorithm was implemented using the gplearn library (version 0.4.2). Unlike traditional regression methods, symbolic regression evolves equation structures dynamically via genetic programming.
The model was trained on the 2,270-patient dataset with the following hyperparameters: population size=1,000, generations=20, tournament size=20. The function set included addition, subtraction, multiplication, division, and square root operations. To enhance interpretability and clinical feasibility, model parsimony was prioritized using a high parsimony coefficient. Model fitness was optimized using mean absolute error (MAE) as the cost function. Genetic operators included a crossover probability of 0.7 and a mutation probability of 0.1.
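The reported hyperparameters could map onto the keyword arguments of gplearn's `SymbolicRegressor` roughly as sketched below. This is an illustration, not the authors' script. The text does not say which mutation operator the 0.1 probability refers to, gives no numeric parsimony coefficient, and mentions no random seed, so those three entries are assumptions kept in a separate dictionary. The names `X_train` and `y_train` in the comments are hypothetical.

```python
# Settings stated in the text, expressed as SymbolicRegressor keyword arguments.
reported_settings = {
    "population_size": 1000,
    "generations": 20,
    "tournament_size": 20,
    "function_set": ("add", "sub", "mul", "div", "sqrt"),
    "metric": "mean absolute error",
    "p_crossover": 0.7,
}

# Assumptions, not taken from the text.
assumed_settings = {
    "p_subtree_mutation": 0.1,      # assumed to be the reported "mutation probability"
    "parsimony_coefficient": 0.01,  # placeholder; the text only says "high"
    "random_state": 0,              # placeholder seed for reproducibility
}

settings = {**reported_settings, **assumed_settings}

# With gplearn installed, the estimator would be built and fitted roughly as:
#   from gplearn.genetic import SymbolicRegressor
#   model = SymbolicRegressor(**settings).fit(X_train, y_train)
#   print(model._program)  # the evolved expression
```

gplearn requires the crossover and mutation probabilities to sum to at most 1, which these values satisfy.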
The modelling pipeline utilized scikit-learn (v1.2.2) for performance evaluation, NumPy (v1.24.3) and pandas (v2.0.2) for data handling, and matplotlib (v3.7.2) and seaborn (v0.12.2) for visualization. The final formula derived from the training set was directly validated on all three independent external datasets without further tuning.
Results
Machine learning-based analysis indicated that LDL-C levels did not show statistically significant correlations with age or sex. As a result, the following formula was derived using symbolic regression: The Novel Equation (mg/dL): LDL-C=TC – HDL-C – (√TG) × TC/100.
This expression was formulated to account for variability in the TG: Very-low-density lipoprotein cholesterol (VLDL-C) ratio, integrating a nonlinear component to enhance estimation accuracy, particularly in the presence of high triglyceride levels.
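A short sketch of the novel equation as a function may help when implementing it in a laboratory information system. The function names are illustrative. The second helper expresses the TG-dependent term as an equivalent TG divisor, 100 × √TG / TC, which is one way to read the statement that the formula accounts for a variable TG:VLDL-C ratio.

```python
import math


def y_ldl_c(tc: float, hdl_c: float, tg: float) -> float:
    """Novel equation from the text; inputs and result in mg/dL."""
    return tc - hdl_c - math.sqrt(tg) * tc / 100


def implied_tg_divisor(tc: float, tg: float) -> float:
    """TG divided by the TG-dependent term, i.e. 100 * sqrt(TG) / TC."""
    return tg / (math.sqrt(tg) * tc / 100)


# Worked example: TC = 200, HDL-C = 50, TG = 144 mg/dL.
# sqrt(144) = 12, so the TG-dependent term is 12 * 200 / 100 = 24.
print(y_ldl_c(200, 50, 144))          # 126.0
print(implied_tg_divisor(200, 144))   # 6.0
print(implied_tg_divisor(200, 400))   # 10.0
```

For a fixed TC, the implied divisor rises with TG, so a smaller fraction of TG is subtracted at high TG levels.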
Comparative evaluation of LDL-C estimation formulas
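LDL-C estimation methods differ primarily in how they calculate VLDL-C. The Friedewald equation uses a fixed TG:VLDL-C ratio by dividing TG by 5 [14]. Some modified formulas retain the Friedewald structure but use different fixed divisors, which still fail to address individual TG:VLDL-C variability. The Martin equation introduces a dynamic divisor based on stratified TG and non-HDL-C levels, improving precision [15]. The Sampson equation uses a bivariate quadratic regression derived from β-quantification data, in which the TG-dependent term combines a linear TG term, a TG × non-HDL-C interaction, and a quadratic TG term.
The Sampson equation (mg/dL): LDL-C=TC/0.948 – HDL-C/0.971 – (TG/8.56 + TG × non-HDL-C/2140 – TG²/16100) – 9.44.
Our proposed formula instead introduces the term (√TG) × TC/100, which scales the TG contribution by TC and grows with the square root of TG rather than linearly.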
Regression and bias analysis
LDL-C estimates were calculated in Python using the Friedewald, Sampson, Martin (180 cells), extended Martin (additional 240 cells), and Y-LDL-C formulas, and the estimated values were compared with the measured LDL-C values. The Friedewald and Sampson equations have not been validated for triglyceride levels exceeding 400 mg/dL and 800 mg/dL, respectively; however, their estimates at these levels were included for comparative analysis.
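Two of the comparator formulas can be sketched as plain functions, as below. The Friedewald form follows the description in the text (TG divided by 5). The Sampson coefficients are quoted from the published equation as it is commonly reported, not from this text, and should be checked against the original reference before use. The Martin and extended Martin equations are omitted because they depend on lookup tables that are not reproduced here.

```python
def friedewald_ldl_c(tc: float, hdl_c: float, tg: float) -> float:
    """Friedewald estimate (mg/dL): fixed TG:VLDL-C ratio of 5."""
    return tc - hdl_c - tg / 5


def sampson_ldl_c(tc: float, hdl_c: float, tg: float) -> float:
    """Sampson estimate (mg/dL); coefficients as commonly reported (assumption)."""
    non_hdl_c = tc - hdl_c
    tg_term = tg / 8.56 + tg * non_hdl_c / 2140 - tg ** 2 / 16100
    return tc / 0.948 - hdl_c / 0.971 - tg_term - 9.44


# Example profile: TC = 200, HDL-C = 50, TG = 150 mg/dL.
print(friedewald_ldl_c(200, 50, 150))          # 120.0
print(round(sampson_ldl_c(200, 50, 150), 1))   # 123.4
```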
We assessed the performance of our newly developed formula alongside the Friedewald, Sampson, and Extended Martin equations using three independent test datasets. To evaluate the agreement between the formulas and the homogeneous assay results, we performed Passing-Bablok regression analysis (Supplement 1) and Bland-Altman analysis (Figure 1).

Bland–Altman plots showing the agreement between calculated LDL-C values using the Friedewald (A, E, I), Sampson (B, F, J), extended Martin (C, G, K), and Y-LDL-C (D, H, L) formulas and directly measured LDL-C concentrations on Abbott Architect (A–D, n=40,941), Beckman Coulter AU5800 (E–H, n=42,520), and Roche Cobas (I–L, n=5,813) platforms. Each panel shows the mean difference (solid green line) and the limits of agreement (±1.96 SD; dashed blue lines).
In the Bland-Altman analysis, the Friedewald estimate (F-LDL-C) exhibited negative bias across all platforms, with mean differences of −10.48 mg/dL on Abbott Architect (Figure 1A), −5.12 mg/dL on Beckman Coulter (Figure 1E), and −16.01 mg/dL on Roche Cobas (Figure 1I). The 95 % limits of agreement ranged from 16.17 to −37.14 mg/dL, 24.48 to −34.73 mg/dL, and 8.97 to −40.98 mg/dL, respectively. The Sampson estimate (S-LDL-C) demonstrated similar negative bias patterns with mean differences of −7.58 mg/dL (Abbott), −2.23 mg/dL (Beckman), and −12.75 mg/dL (Roche). The 95 % limits of agreement were 14.74 to −29.89 mg/dL, 23.09 to −27.55 mg/dL, and 8.95 to −34.44 mg/dL, respectively (Figure 1B, F, J). The extended Martin estimate (EM-LDL-C) showed mean differences of −7.37 mg/dL (Abbott), −2.23 mg/dL (Beckman), and −13.43 mg/dL (Roche), with 95 % limits of agreement ranging from 10.23 to −24.98 mg/dL, 17.24 to −21.70 mg/dL, and 7.57 to −34.43 mg/dL (Figure 1C, G, K). The Y-LDL-C formula presented the lowest overall bias with mean differences of −4.92 mg/dL (Abbott), −0.34 mg/dL (Beckman), and −1.27 mg/dL (Roche). The 95 % limits of agreement were 11.84 to −21.69 mg/dL, 15.95 to −16.59 mg/dL, and 12.01 to −23.86 mg/dL, indicating the narrowest ranges among all formulas (Figure 1D, H, L).
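The Bland-Altman quantities reported above can be computed as in the following sketch, which returns the mean difference and the limits of agreement at ±1.96 SD. The sign convention (estimated minus measured) and the use of the sample standard deviation are assumptions, and the data in the example are made up for illustration.

```python
import statistics


def bland_altman(measured, estimated):
    """Return (mean difference, lower limit, upper limit) of agreement."""
    diffs = [e - m for m, e in zip(measured, estimated)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


# Illustrative values in mg/dL, not study data.
measured = [100, 120, 140, 160]
estimated = [98, 121, 137, 158]
mean_diff, lower, upper = bland_altman(measured, estimated)
print(mean_diff)  # -1.5
```

Because the limits are symmetric around the mean difference, the midpoint of any reported pair of limits should equal the reported mean difference, which gives a quick consistency check on published values.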
The performance of the four formulas was compared across all three datasets using the following statistical metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R-squared (R2), Concordance Correlation Coefficient (CCC), Quadratic Weighted Kappa (QWK), Bias, Concordance Rate (CR), Overestimation Rate (OER), and Underestimation Rate (UER) (Table 2).
Comparison of measured LDL-C concentrations and the performance of estimation formulas across three independent datasets.
| Dataset | Formula | RMSE | MAE | MAPE | R2 | CCC | QWK | Bias | CR | OER | UER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Test Data (n=40,941) Abbott Architect | Friedewald | 17.170 | 11.698 | 9.43 % | 0.812 | 0.910 | 0.867 | −10.483 | 65.11 % | 1.60 % | 33.29 % |
| | Sampson | 13.677 | 9.325 | 7.45 % | 0.881 | 0.941 | 0.896 | −7.577 | 70.89 % | 2.30 % | 26.81 % |
| | Ext. Martin | 11.613 | 8.511 | 6.85 % | 0.914 | 0.957 | 0.908 | −7.366 | 72.55 % | 1.46 % | 25.99 % |
| | Yayla | 9.868 | 6.825 | 5.19 % | 0.938 | 0.967 | 0.924 | −4.925 | 78.21 % | 2.57 % | 19.22 % |
| Test Data (n=42,520) Beckman Coulter | Friedewald | 15.950 | 10.956 | 8.61 % | 0.815 | 0.917 | 0.875 | −5.124 | 67.73 % | 8.34 % | 23.92 % |
| | Sampson | 13.108 | 9.570 | 7.39 % | 0.875 | 0.943 | 0.897 | −2.232 | 70.93 % | 10.35 % | 18.72 % |
| | Ext. Martin | 10.181 | 7.601 | 5.94 % | 0.925 | 0.965 | 0.920 | −2.228 | 75.86 % | 7.57 % | 16.57 % |
| | Yayla | 8.091 | 5.981 | 4.58 % | 0.952 | 0.976 | 0.933 | −0.344 | 80.88 % | 8.31 % | 10.81 % |
| Test Data (n=5,813) Roche Cobas | Friedewald | 20.447 | 16.403 | 13.15 % | 0.731 | 0.873 | 0.835 | −15.994 | 50.54 % | 0.24 % | 49.22 % |
| | Sampson | 16.171 | 13.783 | 11.04 % | 0.832 | 0.917 | 0.862 | −13.243 | 57.61 % | 0.40 % | 41.99 % |
| | Ext. Martin | 17.177 | 14.260 | 11.43 % | 0.810 | 0.906 | 0.860 | −13.428 | 55.98 % | 0.89 % | 43.13 % |
| | Yayla | 16.008 | 12.788 | 9.97 % | 0.835 | 0.912 | 0.865 | −10.924 | 60.50 % | 2.32 % | 37.18 % |

RMSE, root mean squared error; MAE, mean absolute error; MAPE, mean absolute percentage error; R2, coefficient of determination; CCC, concordance correlation coefficient; QWK, quadratic weighted kappa; CR, concordance rate; OER, overestimation rate; UER, underestimation rate.
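The continuous agreement metrics reported in Table 2 can be sketched in plain Python as follows. The definitions used here are the common ones and are assumptions about the authors' exact implementation: bias is the mean of estimated minus measured, R2 is 1 − SS_res/SS_tot as in scikit-learn's `r2_score`, and CCC follows Lin's formula with population (divide-by-n) variances. The example values are made up.

```python
import math


def agreement_metrics(measured, estimated):
    """RMSE, MAE, MAPE (%), R2, CCC and bias for paired values in mg/dL."""
    n = len(measured)
    errors = [e - m for m, e in zip(measured, estimated)]
    mean_m = sum(measured) / n
    mean_e = sum(estimated) / n
    var_m = sum((m - mean_m) ** 2 for m in measured) / n
    var_e = sum((e - mean_e) ** 2 for e in estimated) / n
    cov = sum((m - mean_m) * (e - mean_e)
              for m, e in zip(measured, estimated)) / n
    ss_res = sum(err ** 2 for err in errors)
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(err) for err in errors) / n,
        "MAPE": 100 * sum(abs(err) / m for err, m in zip(errors, measured)) / n,
        "R2": 1 - ss_res / (var_m * n),
        "CCC": 2 * cov / (var_m + var_e + (mean_m - mean_e) ** 2),
        "Bias": sum(errors) / n,
    }


# Illustrative values in mg/dL, not study data.
result = agreement_metrics([100, 120, 140], [98, 121, 137])
print({name: round(value, 3) for name, value in result.items()})
```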
To further evaluate formula performance across clinically relevant LDL-C ranges, we conducted a stratified analysis by LDL-C category. Patients were classified into six groups based on their LDL-C levels: <70 mg/dL, 70–99 mg/dL, 100–129 mg/dL, 130–159 mg/dL, 160–189 mg/dL, and ≥190 mg/dL. The distribution of patients in each group is provided in Table 1 and Table 3. To assess the agreement between direct and estimated LDL-C values within each group, a series of statistical measures were applied, including RMSE, MAE, MAPE, CCC, Mean Difference (MD), Mean Percentage Difference (MPD), CR, UER, and OER. The results of these statistical analyses are presented in Table 3.
Performance of calculation formulas according to LDL-C categories.
| | Test Data (n=40,941) Abbott Architect | | | | | | | | Test Data (n=42,520) Beckman Coulter | | | | | | | | Test Data (n=5,813) Roche Cobas | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LDL-C | n | Formula | RMSE | MAE | MAPE% | CR% | OER% | UER% | n | Formula | RMSE | MAE | MAPE% | CR% | OER% | UER% | n | Formula | RMSE | MAE | MAPE% | CR% | OER% | UER% |
| 0–69 | 2,109 | Friedewald | 10.1 | 7.2 | 12.4 | 97.1 | 2.9 | – | 1,141 | Friedewald | 13.0 | 8.7 | 14.6 | 95.2 | 4.8 | – | 240 | Friedewald | 25.6 | 13.0 | 24.3 | 96.3 | 3.8 | – |
| | | Sampson | 8.1 | 6.2 | 10.8 | 97.9 | 2.1 | – | | Sampson | 9.6 | 7.6 | 12.8 | 96.0 | 4.0 | – | | Sampson | 19.8 | 11.4 | 21.7 | 95.4 | 4.6 | – |
| | | Martin | 7.6 | 5.9 | 10.3 | 97.8 | 2.2 | – | | Martin | 8.1 | 6.8 | 11.5 | 96.4 | 3.6 | – | | Martin | 29.4 | 12.4 | 23.6 | 92.1 | 7.9 | – |
| | | Yayla | 6.6 | 3.5 | 6.3 | 90.4 | 9.6 | – | | Yayla | 4.7 | 3.7 | 6.1 | 84.6 | 15.4 | – | | Yayla | 29.8 | 10.8 | 21.9 | 83.8 | 16.3 | – |
| 70–99 | 8,113 | Friedewald | 13.0 | 8.9 | 10.3 | 76.1 | 1.6 | 22.4 | 6,993 | Friedewald | 13.4 | 8.9 | 10.2 | 73.5 | 7.4 | 19.1 | 990 | Friedewald | 22.4 | 13.3 | 15.4 | 67.7 | 0.2 | 32.1 |
| | | Sampson | 9.2 | 6.8 | 8.0 | 80.0 | 2.1 | 17.9 | | Sampson | 9.7 | 7.1 | 8.1 | 77.0 | 8.0 | 15.0 | | Sampson | 12.9 | 11.0 | 12.7 | 73.3 | 0.8 | 25.9 |
| | | Martin | 8.1 | 6.4 | 7.5 | 83.2 | 1.5 | 15.3 | | Martin | 7.8 | 6.1 | 7.0 | 82.0 | 5.7 | 12.3 | | Martin | 15.7 | 11.2 | 12.9 | 75.3 | 1.8 | 22.9 |
| | | Yayla | 6.1 | 3.9 | 4.5 | 90.8 | 4.1 | 5.1 | | Yayla | 5.7 | 4.4 | 5.0 | 85.5 | 10.9 | 3.7 | | Yayla | 11.0 | 8.3 | 9.5 | 82.7 | 4.8 | 12.4 |
| 100–129 | 11,840 | Friedewald | 15.6 | 10.8 | 9.4 | 66.2 | 1.9 | 31.9 | 12,898 | Friedewald | 15.1 | 10.0 | 8.7 | 68.6 | 8.7 | 22.7 | 1,716 | Friedewald | 17.3 | 15.0 | 13.1 | 53.9 | 0.2 | 45.9 |
| | | Sampson | 11.5 | 8.2 | 7.1 | 72.6 | 2.8 | 24.6 | | Sampson | 11.3 | 8.2 | 7.2 | 71.9 | 11.0 | 17.1 | | Sampson | 13.9 | 12.4 | 10.8 | 62.0 | 0.3 | 37.7 |
| | | Martin | 9.8 | 7.7 | 6.7 | 74.4 | 1.8 | 23.8 | | Martin | 8.8 | 6.6 | 5.8 | 76.7 | 7.9 | 15.4 | | Martin | 14.0 | 12.9 | 11.2 | 60.3 | 0.4 | 39.3 |
| | | Yayla | 7.5 | 5.4 | 4.7 | 84.1 | 2.7 | 13.2 | | Yayla | 6.9 | 5.3 | 4.6 | 82.6 | 9.7 | 7.7 | | Yayla | 12.0 | 10.6 | 9.2 | 68.8 | 1.7 | 29.5 |
| 130–159 | 10,416 | Friedewald | 18.6 | 13.1 | 9.1 | 57.2 | 1.4 | 41.4 | 11,931 | Friedewald | 16.9 | 11.7 | 8.1 | 63.4 | 9.9 | 26.7 | 1,540 | Friedewald | 19.8 | 17.4 | 12.2 | 39.7 | 0.1 | 60.2 |
| | | Sampson | 14.5 | 10.2 | 7.1 | 64.8 | 2.5 | 32.7 | | Sampson | 13.8 | 10.2 | 7.1 | 66.9 | 12.9 | 20.2 | | Sampson | 16.3 | 14.4 | 10.0 | 48.1 | 0.1 | 51.8 |
| | | Martin | 12.1 | 9.2 | 6.4 | 66.4 | 1.3 | 32.3 | | Martin | 10.4 | 7.9 | 5.5 | 72.3 | 9.5 | 18.1 | | Martin | 16.3 | 14.8 | 10.3 | 47.4 | 0.5 | 52.1 |
| | | Yayla | 9.9 | 7.5 | 5.2 | 73.5 | 1.4 | 25.1 | | Yayla | 8.2 | 6.3 | 4.4 | 80.3 | 8.0 | 11.7 | | Yayla | 15.0 | 13.2 | 9.2 | 53.2 | 1.3 | 45.5 |
| 160–189 | 5,720 | Friedewald | 20.9 | 14.6 | 8.5 | 51.5 | 1.6 | 46.8 | 6,547 | Friedewald | 17.4 | 12.7 | 7.4 | 59.8 | 10.4 | 29.9 | 860 | Friedewald | 21.4 | 19.2 | 11.1 | 31.6 | 0.1 | 68.3 |
| | | Sampson | 17.3 | 12.1 | 7.0 | 57.9 | 2.5 | 39.6 | | Sampson | 15.6 | 11.9 | 6.9 | 62.9 | 12.9 | 24.2 | | Sampson | 18.3 | 16.3 | 9.4 | 39.3 | 0.2 | 60.5 |
| | | Martin | 14.7 | 10.8 | 6.3 | 59.3 | 1.5 | 39.3 | | Martin | 12.0 | 9.1 | 5.3 | 68.8 | 9.4 | 21.8 | | Martin | 18.4 | 17.0 | 9.8 | 35.6 | 0.2 | 64.2 |
| | | Yayla | 12.9 | 10.2 | 5.9 | 60.4 | 1.0 | 38.5 | | Yayla | 9.6 | 7.3 | 4.2 | 74.4 | 6.1 | 19.5 | | Yayla | 18.5 | 17.1 | 9.9 | 34.9 | 0.5 | 64.7 |
| ≥190 | 2,743 | Friedewald | 23.4 | 16.1 | 7.6 | 61.8 | – | 38.2 | 3,010 | Friedewald | 18.7 | 14.0 | 6.7 | 74.8 | – | 25.2 | 467 | Friedewald | 23.9 | 21.4 | 10.2 | 43.3 | – | 56.7 |
| | | Sampson | 22.2 | 15.0 | 7.0 | 65.7 | – | 34.3 | | Sampson | 18.5 | 14.2 | 6.7 | 76.7 | – | 23.3 | | Sampson | 22.2 | 19.4 | 9.2 | 46.7 | – | 53.3 |
| | | Martin | 18.8 | 13.1 | 6.2 | 64.6 | – | 35.4 | | Martin | 15.1 | 11.3 | 5.3 | 79.6 | – | 20.4 | | Martin | 21.8 | 19.8 | 9.4 | 45.0 | – | 55.0 |
| | | Yayla | 18.8 | 14.4 | 6.7 | 60.9 | – | 39.1 | | Yayla | 13.0 | 9.5 | 4.5 | 77.4 | – | 22.6 | | Yayla | 24.0 | 22.1 | 10.4 | 40.0 | – | 60.0 |

RMSE, root mean squared error; MAE, mean absolute error; MAPE, mean absolute percentage error; CR, concordance rate; UER, underestimation rate; OER, overestimation rate.
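The category-based rates in Table 3 can be illustrated with the sketch below. It assumes that CR is the share of samples whose estimated LDL-C falls in the same category as the measured value, and that OER and UER are the shares placed in a higher or lower category. The text does not spell out these definitions, but they are consistent with the empty UER cells in the lowest category and the empty OER cells in the highest. The cut-points are the six groups named in the text; the sample values are made up.

```python
from bisect import bisect_right

# Lower bounds of the categories 70-99, 100-129, 130-159, 160-189, >=190 mg/dL.
LDL_C_CUTS = [70, 100, 130, 160, 190]


def ldl_c_category(value: float) -> int:
    """Index 0 (<70 mg/dL) to 5 (>=190 mg/dL) of the LDL-C category."""
    return bisect_right(LDL_C_CUTS, value)


def concordance_rates(measured, estimated):
    """Return CR, OER and UER in percent for paired LDL-C values."""
    same = over = under = 0
    for m, e in zip(measured, estimated):
        cat_m, cat_e = ldl_c_category(m), ldl_c_category(e)
        if cat_e == cat_m:
            same += 1
        elif cat_e > cat_m:
            over += 1
        else:
            under += 1
    n = len(measured)
    return {"CR": 100 * same / n, "OER": 100 * over / n, "UER": 100 * under / n}


# Illustrative values in mg/dL, not study data.
print(concordance_rates([65, 100, 135, 195], [72, 98, 131, 180]))
# {'CR': 25.0, 'OER': 25.0, 'UER': 50.0}
```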
To evaluate the performance of different LDL-C estimation methods across various TG levels, patients were categorized into five groups: <100 mg/dL, 100–149 mg/dL, 150–199 mg/dL, 200–399 mg/dL, and ≥400 mg/dL. To assess the agreement between direct and estimated LDL-C values within each group, a series of statistical measures were applied, including RMSE, MAE, MAPE, CCC, QW-Kappa, MD, MPD, CR, UER, and OER. The results of these statistical analyses are presented in Table 4.
Performance of calculation formulas according to TG categories.
| | Test data 1 (n=40,941) Abbott Architect | | | | | | | | Test data 2 (n=42,520) Beckman Coulter | | | | | | | | Test data 3 (n=5,813) Roche Cobas | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TG | n | Formula | RMSE | MAE | MAPE% | CR% | OER% | UER% | n | Formula | RMSE | MAE | MAPE% | CR% | OER% | UER% | n | Formula | RMSE | MAE | MAPE% | CR% | OER% | UER% |
| <100 | 14,541 | Friedewald | 6.5 | 4.7 | 4.6 | 85.2 | 3.2 | 11.6 | 14,492 | Friedewald | 8.8 | 6.7 | 5.9 | 78.2 | 15.5 | 6.2 | 2,401 | Friedewald | 12.9 | 11.7 | 10.4 | 61.2 | 0.1 | 38.7 |
| | | Sampson | 6.4 | 4.5 | 4.5 | 85.8 | 4.2 | 10.0 | | Sampson | 9.6 | 7.3 | 6.3 | 76.7 | 17.7 | 5.5 | | Sampson | 12.1 | 11.0 | 9.9 | 63.5 | 0.2 | 36.3 |
| | | Martin | 7.1 | 5.4 | 5.3 | 83.1 | 1.5 | 15.5 | | Martin | 8.0 | 6.1 | 5.4 | 80.1 | 11.6 | 8.3 | | Martin | 14.5 | 13.5 | 11.9 | 57.0 | 0.0 | 42.9 |
| | | Yayla | 6.5 | 4.6 | 4.1 | 85.2 | 2.2 | 12.7 | | Yayla | 7.2 | 5.5 | 4.9 | 81.9 | 11.8 | 6.3 | | Yayla | 14.1 | 12.5 | 10.5 | 59.4 | 0.1 | 40.5 |
| 100–149 | 11,820 | Friedewald | 10.4 | 8.6 | 7.4 | 72.4 | 1.2 | 26.4 | 12,585 | Friedewald | 9.8 | 7.7 | 6.2 | 76.2 | 9.1 | 14.7 | 1,625 | Friedewald | 17.1 | 15.5 | 12.3 | 49.8 | 0.2 | 50.0 |
| | | Sampson | 8.6 | 6.7 | 5.7 | 78.4 | 2.3 | 19.3 | | Sampson | 9.6 | 7.5 | 5.8 | 76.8 | 12.6 | 10.5 | | Sampson | 14.8 | 13.0 | 10.3 | 57.6 | 0.3 | 42.1 |
| | | Martin | 9.1 | 7.4 | 6.1 | 77.0 | 1.4 | 21.6 | | Martin | 8.7 | 6.7 | 5.3 | 79.3 | 9.6 | 11.1 | | Martin | 16.1 | 14.3 | 11.1 | 54.7 | 0.3 | 45.0 |
| | | Yayla | 8.4 | 6.3 | 4.8 | 80.2 | 1.9 | 17.9 | | Yayla | 7.5 | 5.7 | 4.3 | 81.7 | 9.8 | 8.4 | | Yayla | 15.4 | 13.0 | 9.6 | 58.5 | 0.8 | 40.7 |
| 150–199 | 6,763 | Friedewald | 16.3 | 14.0 | 11.1 | 56.2 | 0.7 | 43.1 | 7,140 | Friedewald | 12.9 | 10.3 | 8.2 | 67.6 | 4.2 | 28.2 | 869 | Friedewald | 20.0 | 18.7 | 13.7 | 41.8 | 0.1 | 58.1 |
| | | Sampson | 13.1 | 10.6 | 8.3 | 66.9 | 1.2 | 31.9 | | Sampson | 10.7 | 8.3 | 6.3 | 74.6 | 6.4 | 19.0 | | Sampson | 16.7 | 15.2 | 10.8 | 53.5 | 0.2 | 46.3 |
| | | Martin | 12.3 | 9.8 | 7.4 | 69.3 | 1.3 | 29.4 | | Martin | 9.8 | 7.4 | 5.5 | 77.5 | 5.8 | 16.6 | | Martin | 16.1 | 14.4 | 10.0 | 55.9 | 0.3 | 43.7 |
| | | Yayla | 10.9 | 8.0 | 5.6 | 75.3 | 2.7 | 22.0 | | Yayla | 8.4 | 6.0 | 4.2 | 80.9 | 7.4 | 11.7 | | Yayla | 14.8 | 12.5 | 8.3 | 63.2 | 1.7 | 35.1 |
| 200–399 | 6,974 | Friedewald | 27.1 | 24.1 | 17.6 | 31.5 | 0.8 | 67.7 | 7,327 | Friedewald | 23.1 | 19.9 | 14.6 | 41.9 | 0.8 | 57.3 | 814 | Friedewald | 27.3 | 25.0 | 18.5 | 30.7 | 0.7 | 68.6 |
| | | Sampson | 21.7 | 18.8 | 13.3 | 44.1 | 1.3 | 54.6 | | Sampson | 17.7 | 14.9 | 10.7 | 54.6 | 1.3 | 44.1 | | Sampson | 21.9 | 19.6 | 13.8 | 44.5 | 1.1 | 54.4 |
| | | Martin | 16.7 | 13.7 | 9.6 | 57.9 | 2.3 | 39.8 | | Martin | 12.7 | 10.2 | 7.2 | 68.2 | 2.2 | 29.6 | | Martin | 17.5 | 14.7 | 10.3 | 56.5 | 3.3 | 40.2 |
| | | Yayla | 13.8 | 10.3 | 6.9 | 70.9 | 4.7 | 24.4 | | Yayla | 9.5 | 7.0 | 4.6 | 79.2 | 4.3 | 16.5 | | Yayla | 15.5 | 11.9 | 8.6 | 65.7 | 8.4 | 25.9 |
| ≥400 | 843 | Friedewald | 62.1 | 55.1 | 40.0 | 8.9 | 0.6 | 90.5 | 976 | Friedewald | 59.1 | 54.4 | 39.0 | 5.9 | 0.0 | 94.1 | 104 | Friedewald | 75.9 | 51.2 | 45.0 | 20.2 | 1.9 | 77.9 |
| | | Sampson | 45.8 | 40.4 | 27.8 | 13.8 | 0.7 | 85.5 | | Sampson | 41.8 | 39.9 | 27.5 | 7.5 | 0.0 | 92.5 | | Sampson | 41.3 | 33.4 | 27.6 | 26.9 | 5.8 | 67.3 |
| | | Martin | 31.5 | 25.4 | 17.4 | 33.9 | 4.3 | 61.8 | | Martin | 26.1 | 23.3 | 16.0 | 30.9 | 0.0 | 69.1 | | Martin | 54.8 | 27.2 | 27.1 | 40.4 | 16.3 | 43.3 |
| | | Yayla | 21.5 | 15.8 | 11.7 | 59.8 | 17.2 | 23.0 | | Yayla | 12.5 | 9.2 | 6.0 | 72.1 | 7.3 | 20.6 | | Yayla | 48.1 | 26.1 | 28.7 | 46.2 | 39.4 | 14.4 |

RMSE, root mean squared error; MAE, mean absolute error; MAPE, mean absolute percentage error; CR, concordance rate; UER, underestimation rate; OER, overestimation rate.
The performance of the LDL-C estimation formulas depends on TG level. At TG<100 mg/dL, all formulas perform relatively well, with CR of 77–86 % on the Abbott and Beckman platforms. At TG≥400 mg/dL, performance drops markedly for every formula, and the Friedewald formula performs worst (CR as low as 5.9 %, MAPE 39–45 %). The Y-LDL-C formula shows the highest CR in most TG categories and platforms, with the largest advantage at higher TG levels (≥200 mg/dL). The Martin and Sampson formulas generally perform better than Friedewald, but not as well as Y-LDL-C. RMSE increases with TG level for all formulas. MAPE exceeds 10 % for the Friedewald and Sampson formulas on all three platforms at TG≥200 mg/dL.
To visualize how each formula classifies patients across LDL-C categories, we created confusion matrices for each of the three datasets (Figure 2). The matrices compare the predicted LDL-C category with the category assigned by the homogeneous assay. Diagonal cells represent correctly classified samples, where the predicted LDL-C category matches the measured category, while off-diagonal cells indicate misclassifications, including false positives and false negatives, which are crucial for assessing the clinical reliability of each method.

Confusion matrices comparing the clinical categorization agreement between calculated LDL-C values using the Friedewald (A, E, I), Sampson (B, F, J), extended Martin (C, G, K), and Y-LDL-C (D, H, L) formulas and directly measured LDL-C concentrations on Abbott Architect (A–D, n=40,941), Beckman Coulter (E–H, n=42,520), and Roche Cobas (I–L, n=5,813) platforms. Darker blue diagonal cells indicate higher concordance between calculated and measured LDL-C clinical categories. Overall accuracy percentages are shown at the top of each matrix.
Error Grid Analysis is a tool for assessing the accuracy of LDL-C estimation formulas and for understanding how estimation errors affect clinical decision-making. The National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III) has established a TEa threshold of 12 % as an acceptable limit, and we adopted the same 12 % TEa value in our study. For Error Grid Analysis, we selected clinical decision thresholds of 70 mg/dL and 190 mg/dL. To enhance the analysis, we stratified the dataset by triglyceride (TG) level into two groups: TG<400 mg/dL and TG≥400 mg/dL (Figure 3).
In the TG<400 mg/dL group, when analyzing 40,098 LDL-C results obtained from the Abbott Architect c16000 analyzer (Test Dataset 1), the proportion of calculated LDL-C values falling within the 12 % TEa threshold (zones a + b) was as follows: 75.7 % for F-LDL-C, 83.7 % for S-LDL-C, 89.2 % for EM-LDL-C, and 96.3 % for Y-LDL-C. Similarly, analysis of 41,544 LDL-C results from the Beckman Coulter AU5800 analyzer (Test Dataset 2) revealed that 79.4 % of F-LDL-C, 84.0 % of S-LDL-C, 89.9 % of EM-LDL-C, and 95.4 % of Y-LDL-C values were within the 12 % TEa threshold. In contrast, when analyzing 5,709 LDL-C results from the Roche Cobas c501 analyzer (Test Dataset 3), the proportions were lower, with 52.5 % for F-LDL-C, 63.5 % for S-LDL-C, 59.8 % for EM-LDL-C, and 71.5 % for Y-LDL-C falling within the 12 % TEa threshold.
Error Grid Analysis results further demonstrated clinically significant misclassifications at low and high LDL-C cut-points (zones e + f + g + h). In Test Dataset 1, the proportion of values falling within these clinically relevant error zones was 4.2 % for F-LDL-C, 2.6 % for S-LDL-C, 1.6 % for EM-LDL-C, and 0.6 % for Y-LDL-C. Similarly, in Test Dataset 2, the respective error rates were 3.5 % for F-LDL-C, 2.6 % for S-LDL-C, 1.6 % for EM-LDL-C, and 0.6 % for Y-LDL-C. In contrast, Test Dataset 3 exhibited higher clinically significant error rates, with 6.8 % for F-LDL-C, 5.2 % for S-LDL-C, 4.8 % for EM-LDL-C, and 3.5 % for Y-LDL-C falling within these critical zones (Figure 3) (Supplement 2).

Error grid analysis for LDL-C measurements in samples with TG<400 mg/dL. Panels (A–L) show scatter plots comparing different LDL-C estimation methods against the homogeneous enzymatic assay across three analytical platforms. The error grid zones categorize results based on analytical accuracy and clinical impact: Zones a and b: Within ±12 % proportional error, considered analytically accurate. Zones c and d: Errors exceeding 12 % but without clinical impact. Zones e and f: Clinically significant misclassifications at high LDL-C thresholds. Zones g and h: Clinically significant misclassifications at low LDL-C thresholds. Panel (M) presents TE% accuracy for each method (zones a + b). Panel (N) quantifies analytical errors (c + d), while Panel (O) shows clinically significant errors (e–h). Panel (P) illustrates the LDL-C error grid structure. Numbers in shaded regions indicate misclassification counts.
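The three zone groups described in the caption can be approximated with the simplified sketch below. It is not a reproduction of the published grid geometry. It assumes that a result is analytically accurate (a + b) when the estimate lies within ±12 % of the measured value, that a larger error is clinically significant (e–h) when the estimate and the measurement fall on opposite sides of the 70 or 190 mg/dL threshold, and that all remaining results belong to c + d. It does not separate the individual zones within each group.

```python
def error_grid_group(measured: float, estimated: float,
                     tea: float = 0.12, thresholds=(70, 190)) -> str:
    """Classify one result as 'a+b', 'c+d' or 'e-h' (simplified assumption)."""
    if abs(estimated - measured) <= tea * measured:
        return "a+b"  # within the allowable total error
    crosses = any((measured < t) != (estimated < t) for t in thresholds)
    return "e-h" if crosses else "c+d"


# Illustrative pairs (measured, estimated) in mg/dL, not study data.
print(error_grid_group(100, 105))  # a+b
print(error_grid_group(100, 85))   # c+d
print(error_grid_group(75, 60))    # e-h
```

Under this simplification, a pair that straddles a threshold but stays within 12 % is still counted as analytically accurate.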
For TG≥400 mg/dL, the analysis of LDL-C results included a smaller number of patients compared to the TG<400 mg/dL group (Test Dataset 1: n=843, Test Dataset 2: n=976, Test Dataset 3: n=104). When evaluating LDL-C results, the proportion of calculated LDL-C values falling within the 12 % TEa threshold (zones a + b) decreased across all four formulas in all three datasets. However, Y-LDL-C consistently demonstrated higher agreement with the TEa threshold compared to other estimation methods. Similarly, analysis revealed an increase in clinically significant misclassifications at both low and high LDL-C cut-points (zones e + f + g + h) across all three datasets (Figure 4) (Supplement 3).

Error grid analysis for LDL-C measurements in samples with TG≥400 mg/dL. Panels (A–L) show scatter plots comparing different LDL-C estimation methods against the homogeneous enzymatic assay across three analytical platforms. The error grid zones categorize results based on analytical accuracy and clinical impact: Zones a and b: Within ±12 % proportional error, considered analytically accurate. Zones c and d: Errors exceeding 12 % but without clinical impact. Zones e and f: Clinically significant misclassifications at high LDL-C thresholds. Zones g and h: Clinically significant misclassifications at low LDL-C thresholds. Panel (M) presents TE% accuracy for each method (zones a + b). Panel (N) quantifies analytical errors (c + d), while Panel (O) shows clinically significant errors (e–h). Panel (P) illustrates the LDL-C error grid structure. Numbers in shaded regions indicate misclassification counts.
Discussion
Although beta quantification remains the gold standard for LDL-C measurement, its clinical application is often constrained by high costs and operational complexity [16]. Homogeneous enzymatic assays provide a practical alternative but still cost more than estimation-based approaches [17]. Among estimation methods, the Friedewald formula remains the most widely used due to its simplicity and long history of validation; however, it performs poorly in patients with low LDL-C or high triglyceride (TG) levels [5], [14]. This has become even more clinically problematic in the modern era, where LDL-C targets are significantly lower and the global rise in obesity has led to a substantial increase in triglyceride levels across populations.
When the Friedewald and Martin equations are compared to beta quantification, studies have reported misclassification rates of cardiovascular disease risk as high as 30 and 35 %, respectively [18]. Other formulas, such as those developed by Anandaraja, Chen, and Cordova, have shown inferior performance compared to the Sampson equation, particularly in hypertriglyceridemic individuals [8]. This is clinically significant, as elevated TG levels greatly complicate the accurate assessment of LDL-C, a key determinant in cardiovascular risk stratification. A comparative study utilizing the Vertical Auto Profile (VAP) method found that in patients with TG levels between 400 and 799 mg/dL, the extended Martin equation demonstrated the highest accuracy (60.3 %), followed by the standard Martin (59.2 %) and Chen (57.7 %) equations [8]. Recently, machine learning (ML) approaches – such as random forest models and two-step ML algorithms – have shown promise in improving LDL-C prediction, especially in scenarios with high TG and low LDL-C levels [11], [19], [20], [21]. However, despite their strong performance, these complex models often lack clinical usability due to their computational complexity. Notably, while studies such as those by Kim et al. [11] and Singh et al. [20] have emphasized the potential of ML-based LDL-C prediction, they have not developed a simple, clinically implementable equation using ML.
Our study addresses this gap by proposing the Y-LDL-C equation, a symbolic regression-based formula incorporating the nonlinear term √TG × TC/100 as a proxy for VLDL-C. This formula demonstrated superior performance across three independent test datasets.
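As an illustration only, the Y-LDL-C equation stated in the abstract (mg/dL) can be written next to the classic Friedewald equation, which estimates VLDL-C as TG/5. The function names below are hypothetical, and inputs are assumed to be in mg/dL.

```python
import math


def y_ldl_c(tc, hdl_c, tg):
    """Y-LDL-C (mg/dL) = TC - HDL-C - sqrt(TG) * TC / 100."""
    return tc - hdl_c - math.sqrt(tg) * tc / 100.0


def friedewald_ldl_c(tc, hdl_c, tg):
    """Friedewald LDL-C (mg/dL) = TC - HDL-C - TG / 5."""
    return tc - hdl_c - tg / 5.0


# Illustrative inputs (not patient data): TC = 200, HDL-C = 50 mg/dL.
for tg in (100.0, 400.0):
    print(tg, y_ldl_c(200.0, 50.0, tg), friedewald_ldl_c(200.0, 50.0, tg))
```

With these illustrative inputs the two equations agree at TG = 100 mg/dL (both give 130 mg/dL), whereas at TG = 400 mg/dL the Friedewald VLDL-C term (80 mg/dL) is twice the nonlinear term (40 mg/dL), so the Friedewald estimate is 70 mg/dL against 110 mg/dL for Y-LDL-C.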
Our findings, summarized in Table 2, highlight significant variability among formulas in terms of agreement metrics. The Y-LDL-C formula consistently outperformed all others across platforms, yielding the lowest RMSE values, MAE values, and MAPE values, along with the highest concordance rates. The most notable performance was on the Beckman platform, where the Y-LDL-C formula achieved an R2 of 0.952 and a CCC of 0.976.
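For readers who wish to reproduce such agreement metrics on their own paired data, a minimal standard-library sketch follows. It assumes that R2 is the coefficient of determination (1 − SS_res/SS_tot) and that CCC is Lin's concordance correlation coefficient with population variances; the exact definitions used in the study are not restated in this section, so these choices are assumptions, as is the function name.

```python
import math


def agreement_metrics(measured, estimated):
    """Return RMSE, MAE, MAPE (%), R2 and CCC for paired LDL-C values (mg/dL)."""
    n = len(measured)
    errors = [e - m for m, e in zip(measured, estimated)]

    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mae = sum(abs(d) for d in errors) / n
    mape = 100.0 * sum(abs(d) / m for d, m in zip(errors, measured)) / n

    mean_m = sum(measured) / n
    mean_e = sum(estimated) / n
    var_m = sum((m - mean_m) ** 2 for m in measured) / n
    var_e = sum((e - mean_e) ** 2 for e in estimated) / n
    cov = sum((m - mean_m) * (e - mean_e) for m, e in zip(measured, estimated)) / n

    # Assumption: R2 as coefficient of determination against the measured values.
    r2 = 1.0 - sum(d * d for d in errors) / (n * var_m)
    # Lin's concordance correlation coefficient (population form).
    ccc = 2.0 * cov / (var_m + var_e + (mean_m - mean_e) ** 2)

    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2, "ccc": ccc}
```

The measured values must be non-zero and not all identical, otherwise MAPE or R2 is undefined.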
The extended Martin equation performed better than the Friedewald and Sampson equations but still fell short of the accuracy achieved by the Y-LDL-C formula. As LDL-C levels increased, all formulas showed a gradual decline in concordance; nevertheless, the Y-LDL-C equation maintained the highest overall accuracy, particularly in the <70 mg/dL category – a critical threshold for high-risk patients requiring intensive lipid-lowering therapy (Table 3). While the Friedewald formula substantially underestimated LDL-C in this range, the Y-LDL-C formula exhibited only slight overestimation and maintained lower RMSE and MAE values. Similar trends were observed in the 70–99, 100–129, 130–159, 160–189, and ≥190 mg/dL categories, where the Y-LDL-C equation consistently yielded the highest agreement.
TG level stratification revealed marked limitations of the Friedewald formula, particularly above 150 mg/dL (Table 4). In contrast, the Y-LDL-C equation showed robust performance even at TG levels ≥200 mg/dL and ≥400 mg/dL, where the other formulas degraded significantly.
The widespread underestimation observed in other formulas – especially Friedewald – raises clinical concerns about the risk of undertreatment. If LDL-C is underestimated in high-TG patients, opportunities for timely intervention may be missed.
Several studies have demonstrated that replacing the Friedewald equation with alternative LDL-C estimation methods may result in clinically relevant patient reclassification. Zafrir et al. [22] reported that the Martin/Hopkins and Sampson equations reclassified 10.8 and 7.5 % of patients, respectively, into higher LDL-C categories. These discrepancies were more pronounced among individuals with elevated triglycerides and low LDL-C concentrations. Specifically, among patients with triglyceride levels between 200 and 399 mg/dL and LDL-C <70 mg/dL, reclassification occurred in 65 % of cases when comparing Martin to Friedewald, 44 % for Sampson vs. Friedewald, and 37 % for Martin vs. Sampson. In our study (Test Dataset 1), homogeneous enzymatic measurements reclassified 68.5 % of patients estimated by Friedewald, 55.9 % by Sampson, 42.1 % by extended Martin, and 29.1 % by our novel formula within the triglyceride range of 200–399 mg/dL. Similar results were observed in Test Datasets 2 and 3.
In a VOYAGER analysis, Palmer et al. [23] showed that the Martin/Hopkins equation tended to produce higher LDL-C values than the Friedewald formula, especially at lower LDL-C levels. Notably, 23 % of patients who met the <70 mg/dL target according to Friedewald were reclassified as above target when Martin/Hopkins was applied. Reclassification rates were lower for higher LDL-C thresholds, with 8 % for <100 mg/dL and 2 % for <130 mg/dL. As reported by Dinc Asarcikli et al., cardiology outpatients were reclassified into higher-risk categories when the Martin/Hopkins and Sampson equations were used instead of the Friedewald formula. In the same study, among patients with TG<400 mg/dL, 28.2 % of those in the high-risk group (LDL-C<70 mg/dL) and 27.6 % in the very high-risk group (LDL-C<55 mg/dL) – originally considered below target by Friedewald – were reassigned as above target when Martin/Hopkins was applied. However, these patients constituted 4 and 7 % of the overall population, respectively. Similarly, the Sampson equation led to reclassification in 13 % of patients with TG>400 mg/dL, though this subgroup made up 3 % of the total cohort [24].
The confusion matrices and error grid analysis in our study also showed that diagnostic and treatment decisions varied depending on the formula and analyzer used. For example, among patients with true LDL-C levels between 70 and 99 mg/dL, 23.3, 19.2, and 15.7 % were incorrectly classified into the <70 mg/dL risk category by the Friedewald, Sampson, and extended Martin equations, respectively, in Dataset 1, whereas the rate was lower with the novel equation (9.1 %). Likewise, in this LDL-C subgroup, the corresponding rates of false low classification were 26.2, 23, 17.6, and 15.4 % in Dataset 2 and 32.3, 26.7, 24.7, and 17.3 % in Dataset 3. Error grid analysis provided further evidence of the Y-LDL-C equation’s clinical applicability. In datasets with TG<400 mg/dL, the Y-LDL-C formula placed 96.3, 95.4, and 71.5 % of values within the 12 % TEa threshold. In comparison, the Friedewald formula placed only 75.7, 79.4, and 52.5 % of values within this threshold. More strikingly, in the TG≥400 mg/dL group, the Y-LDL-C formula maintained performance with 72.7, 91.2, and 56.3 % of values within the threshold across the three platforms, whereas the Friedewald formula achieved only 7.5, 5.1, and 7.8 %, respectively. These findings suggest that the Y-LDL-C equation can reliably estimate LDL-C even in high-TG scenarios where other equations typically fail.
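The false low classification rates quoted above can be illustrated with a short sketch that bins paired values into the LDL-C categories used in this article (<70, 70–99, 100–129, 130–159, 160–189, and ≥190 mg/dL). The function names are hypothetical, and assigning non-integer values to the bin whose lower bound they have reached is an assumption.

```python
from bisect import bisect_right

CUTS = [70, 100, 130, 160, 190]    # lower bounds of successive categories (mg/dL)
LABELS = ["<70", "70-99", "100-129", "130-159", "160-189", ">=190"]


def ldl_category(value):
    """Map an LDL-C value in mg/dL to its risk category label."""
    return LABELS[bisect_right(CUTS, value)]


def false_low_rate(measured, estimated, true_label="70-99", wrong_label="<70"):
    """Percentage of samples in true_label (by measured LDL-C) that the
    estimate places in wrong_label; returns None if the subgroup is empty."""
    subgroup = [e for m, e in zip(measured, estimated)
                if ldl_category(m) == true_label]
    if not subgroup:
        return None
    wrong = sum(1 for e in subgroup if ldl_category(e) == wrong_label)
    return 100.0 * wrong / len(subgroup)
```

Applying the same binning to both measured and estimated values and counting the label pairs would yield the full confusion matrix for a given equation.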
Evaluations of homogeneous LDL-C assays have demonstrated that their coefficients of variation (CVs) typically remain below 3 %, consistently meeting the National Cholesterol Education Program (NCEP) performance goal of <4 %. In contrast, the Friedewald calculation has shown a CV of approximately 4 % in expert laboratories, a value that may be exceeded in routine clinical practice. Regarding accuracy, all homogeneous assays have been certified by the Cholesterol Reference Method Laboratory Network, indicating acceptable agreement with reference methods – particularly in normolipidemic samples. However, given the variety of instrument platforms available, not all systems have undergone systematic evaluation for potential bias [25].
A multicenter study by Miida et al. (2012) assessed the precision and accuracy of homogeneous LDL-C assays in comparison with the beta-quantification (BQ) reference method using fresh serum samples from both non-diseased and diseased individuals. The maximum inter- and intra-assay CVs were 1.8 and 1.5 %, respectively, with eight reagents exhibiting CVs of ≤1.0 %. The mean bias ranged from −0.5 to 1.8 % in non-diseased individuals and from −0.7 to 1.6 % in diseased patients. These findings support the reliability of homogeneous assays in healthy individuals, though careful interpretation is warranted in clinical settings involving hypertriglyceridemia [26].
Nevertheless, we acknowledge that direct comparison of our newly developed formula with ultracentrifugation-based beta-quantification methods would provide a more definitive validation and should be considered a priority for future research.
Furthermore, to enhance accessibility and practical application, we developed a mobile application designed for clinical biochemists. The app allows users to calculate LDL-C using the Y-LDL-C, Friedewald, Sampson, and extended Martin equations and includes a non-HDL-C calculator. It is available on both major platforms and can be downloaded by scanning the QR code (Figure 5):

QR code for downloading the mobile app.
Apple App Store: https://apps.apple.com/us/app/ldl-c-calculator/id6683293461.
Google Play Store: https://play.google.com/store/apps/details?id=com.tanxe.spinner&hl=tr.
Study limitations
This study has several limitations. First, although beta quantification is the reference method for LDL-C measurement, we relied on homogeneous enzymatic d-LDL-C measurements as our comparator. While this reflects real-world laboratory practices, further validation of the novel formula using beta quantification or vertical ultracentrifugation in multicenter studies is warranted.
Second, our high-TG patient sample was relatively small – 843 cases in Test Dataset 1, 976 in Test Dataset 2, and only 104 in Test Dataset 3 – potentially limiting the generalizability of our findings in this subgroup.
Moreover, although the formula was derived from a dataset of 2,270 patients, its performance was validated only on three external datasets. Variables such as lipoprotein (a) and remnant lipoproteins were not separately accounted for in this study, which may influence LDL-C estimation. Additionally, treatment data for the included patients were not available, restricting our ability to control for medication effects. Finally, further studies exploring adaptive divisors (instead of the fixed 100 in the nonlinear term) based on LDL-C or TG stratification may yield even higher accuracy in future iterations of the formula.
To address these limitations, we are planning larger, multicenter prospective studies to further evaluate and refine the novel equation.
Conclusions
In this study, we developed a novel LDL-C estimation formula using a machine learning-based symbolic regression approach. The Y-LDL-C equation demonstrated higher concordance with homogeneous enzymatic measurements than the Friedewald, Sampson, and extended Martin equations, particularly in individuals with elevated triglyceride levels, and consistently delivered lower error rates across a wide range of LDL-C and TG concentrations. Owing to its mathematical simplicity, clinical adaptability, and robustness at high TG levels, the Y-LDL-C formula may represent a practical and precise alternative for LDL-C estimation in routine laboratory settings.
Acknowledgments
The authors wish to thank the laboratory staff members who worked on this study.
Research ethics: This study was completed in accordance with the tenets of the Declaration of Helsinki. Ethics committee approval was received for this study from the University of Health Sciences Trabzon Faculty of Medicine Scientific Research Ethics Committee (Approval Date: February 27, 2024; Approval Number: 2024/59).
Informed consent: Informed consent was obtained from all individuals included in this study, or their legal guardians or wards.
Author contributions: Conceptualization – S.Y., H.A., A.Y.; Design – S.Y., A.Y., H.A.; Supervision – H.A.; Resources – S.Y., A.Y.; Materials – S.Y., A.Y.; Data Collection and/or Processing – S.Y., A.Y., H.A.; Analysis and/or Interpretation – S.Y., A.Y.; Literature Search – S.Y., A.Y.; Writing – S.Y., A.Y.; Critical Revision – S.Y., A.Y., H.A.
Use of Large Language Models, AI and Machine Learning Tools: None declared.
Conflict of interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Research funding: No funding was received for this research.
Data availability: The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
1. Ference, BA, Ginsberg, HN, Graham, I, Ray, KK, Packard, CJ, Bruckert, E, et al. Low-density lipoproteins cause atherosclerotic cardiovascular disease. 1. Evidence from genetic, epidemiologic, and clinical studies. A consensus statement from the European Atherosclerosis Society Consensus Panel. Eur Heart J 2017;38:2459–72. https://doi.org/10.1093/eurheartj/ehx144.
2. Grundy, SM, Stone, NJ, Bailey, AL, Beam, C, Birtcher, KK, Blumenthal, RS, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: executive summary: a report of the American College of Cardiology/American Heart Association task force on clinical practice guidelines. Circulation 2019;139:e1046–e81. https://doi.org/10.1161/cir.0000000000000624.
3. Who.int [Internet]. [cited April 15]. Available from: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).
4. Yusuf, S, Joseph, P, Rangarajan, S, Islam, S, Mente, A, Hystad, P, et al. Modifiable risk factors, cardiovascular disease, and mortality in 155 722 individuals from 21 high-income, middle-income, and low-income countries (PURE): a prospective cohort study. Lancet 2020;395:795–808. https://doi.org/10.1016/s0140-6736(19)32008-2.
5. Basaran, O. Is it time to abandon Friedewald formula? New equations for LDL-C calculation. Turk Kardiyol Dern Ars 2021;49:615–8. https://doi.org/10.5543/tkda.2021.21265.
6. Friedewald, WT, Levy, RI, Fredrickson, DS. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin Chem 1972;18:499–502. https://doi.org/10.1093/clinchem/18.6.499.
7. Sampson, M, Wolska, A, Cole, J, Zubiran, R, Otvos, JD, Meeusen, JW, et al. Accuracy and clinical impact of estimating low-density lipoprotein-cholesterol at high and low levels by different equations. Biomedicines 2022;10. https://doi.org/10.3390/biomedicines10123156.
8. Sajja, A, Park, J, Sathiyakumar, V, Varghese, B, Pallazola, VA, Marvel, FA, et al. Comparison of methods to estimate low-density lipoprotein cholesterol in patients with high triglyceride levels. JAMA Netw Open 2021;4:e2128817. https://doi.org/10.1001/jamanetworkopen.2021.28817.
9. Samuel, C, Park, J, Sajja, A, Michos, ED, Blumenthal, RS, Jones, SR, et al. Accuracy of 23 equations for estimating LDL cholesterol in a clinical laboratory database of 5,051,467 patients. Glob Heart 2023;18:36. https://doi.org/10.5334/gh.1214.
10. Cubukcu, HC, Topcu, DI. Estimation of low-density lipoprotein cholesterol concentration using machine learning. Lab Med 2022;53:161–71. https://doi.org/10.1093/labmed/lmab065.
11. Kim, Y, Lee, WK, Lee, W. Prediction of low-density lipoprotein cholesterol levels using machine learning methods. Lab Med 2024;55:471–84. https://doi.org/10.1093/labmed/lmad114.
12. Tsigalou, C, Panopoulou, M, Papadopoulos, C, Karvelas, A, Tsairidis, D, Anagnostopoulos, K. Estimation of low-density lipoprotein cholesterol by machine learning methods. Clin Chim Acta 2021;517:108–16. https://doi.org/10.1016/j.cca.2021.02.020.
13. Centers for Medicare & Medicaid Services. Clinical laboratory improvement Amendments (CLIA) regulations and guidance. U.S. Department of Health & Human Services; 2019. Available from: https://www.cms.gov/Regulations-and-Guidance/Legislation/CLIA.
14. Visseren, FLJ, Mach, F, Smulders, YM, Carballo, D, Koskinas, KC, Back, M, et al. 2021 ESC Guidelines on cardiovascular disease prevention in clinical practice. Eur Heart J 2021;42:3227–337. https://doi.org/10.1093/eurheartj/ehab484.
15. Martin, SS, Blaha, MJ, Elshazly, MB, Toth, PP, Kwiterovich, PO, Blumenthal, RS, et al. Comparison of a novel method vs the Friedewald equation for estimating low-density lipoprotein cholesterol levels from the standard lipid profile. JAMA 2013;310:2061–8. https://doi.org/10.1001/jama.2013.280532.
16. Sampson, M, Ling, C, Sun, Q, Harb, R, Ashmaig, M, Warnick, R, et al. A new equation for calculation of low-density lipoprotein cholesterol in patients with normolipidemia and/or hypertriglyceridemia. JAMA Cardiol 2020;5:540–8. https://doi.org/10.1001/jamacardio.2020.0013.
17. Nauck, M, Warnick, GR, Rifai, N. Methods for measurement of LDL-cholesterol: a critical assessment of direct measurement by homogeneous assays versus calculation. Clin Chem 2002;48:236–54. https://doi.org/10.1093/clinchem/48.2.236.
18. Koch, CD, El-Khoury, JM. New Sampson low-density lipoprotein equation: better than Friedewald and Martin-Hopkins. Clin Chem 2020;66:1120–1. https://doi.org/10.1093/clinchem/hvaa126.
19. Ayyıldız, H, Arslan Tuncer, S. Determination of the effect of red blood cell parameters in the discrimination of iron deficiency anemia and beta thalassemia via neighborhood component analysis feature selection-based machine learning. Chemometr Intell Lab Syst 2019:103886. https://doi.org/10.1016/j.chemolab.2019.103886.
20. Singh, G, Hussain, Y, Xu, Z, Sholle, E, Michalak, K, Dolan, K, et al. Comparing a novel machine learning method to the Friedewald formula and Martin-Hopkins equation for low-density lipoprotein estimation. PLoS One 2020;15:e0239934. https://doi.org/10.1371/journal.pone.0239934.
21. Weissler, EH, Naumann, T, Andersson, T, Ranganath, R, Elemento, O, Luo, Y, et al. The role of machine learning in clinical research: transforming the future of evidence generation. Trials 2021;22:537. https://doi.org/10.1186/s13063-021-05489-x.
22. Zafrir, B, Saliba, W, Flugelman, MY. Comparison of novel equations for estimating low-density lipoprotein cholesterol in patients undergoing coronary angiography. J Atheroscler Thromb 2020;27:1359–73. https://doi.org/10.5551/jat.57133.
23. Palmer, MK, Barter, PJ, Lundman, P, Nicholls, SJ, Toth, PP, Karlson, BW. Comparing a novel equation for calculating low-density lipoprotein cholesterol with the Friedewald equation: a VOYAGER analysis. Clin Biochem 2019;64:24–9. https://doi.org/10.1016/j.clinbiochem.2018.10.011.
24. Dinc Asarcikli, L, Kis, M, Guvenc, TS, Tosun, V, Acar, B, Avci Demir, F, et al. Usefulness of novel Martin/Hopkins and Sampson equations over Friedewald equation in cardiology outpatients: a CVSCORE-TR substudy. Int J Clin Pract 2021;75:e14090. https://doi.org/10.1111/ijcp.14090.
25. Rifai, N. Tietz textbook of laboratory medicine, 7th ed. St. Louis, Missouri: Elsevier Saunders; 2023.
26. Miida, T, Nishimura, K, Okamura, T, Hirayama, S, Ohmura, H, Yoshida, H, et al. A multicenter study on the precision and accuracy of homogeneous assays for LDL-cholesterol: comparison with a beta-quantification method using fresh serum obtained from non-diseased and diseased subjects. Atherosclerosis 2012;225:208–15. https://doi.org/10.1016/j.atherosclerosis.2012.08.022.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/tjb-2025-0171).
© 2025 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.