Application of machine learning tools for feature selection in the identification of prognostic markers in COVID-19

Sprockel Diaz John Jaime; Hector Fabio Restrepo Guerrero; Juan Jose Diaztagle Fernandez

doi:10.1515/em-2022-0132

Published/Copyright: April 17, 2023

From the journal Epidemiologic Methods Volume 12 Issue 1

Abstract

Objective

To identify prognostic markers by applying machine learning strategies to feature selection.

Methods

This was an observational, retrospective, multi-center study of hospitalized patients with a confirmed diagnosis of COVID-19 in three hospitals in Colombia. Eight strategies were applied to select prognostic-related characteristics. A logistic regression model was built from each of the eight resulting sets of variables, and its ability to predict the outcome was evaluated. The primary endpoint was transfer to intensive care or in-hospital death.

Results

The database consisted of 969 patients of which 486 had complete data. The main outcome occurred in 169 cases. The development database included 220 patients, 137 (62.3%) were men with a median age of 58.2, 39 (17.7%) were diabetic, 62 (28.2%) had high blood pressure, and 32 (14.5%) had chronic lung disease. Thirty-three variables were identified as prognostic markers, and those selected most frequently were: LDH, PaO2/FIO2 ratio, CRP, age, neutrophil and lymphocyte counts, respiratory rate, oxygen saturation, ferritin, and HCO3. The eight logistic regressions developed were validated on 266 patients in whom similar results were reached (accuracy: 65.8–72.9%).

Conclusions

The combined use of strategies for selecting characteristics through machine learning techniques makes it possible to identify a broad set of prognostic markers in patients hospitalized for COVID-19 for death or hospitalization in intensive care.

Keywords: COVID-19; feature selection; machine learning; prognostic; risk factors

Introduction

One of the most relevant tasks for a clinician caring for a patient with coronavirus disease 2019 (COVID-19) is to identify the risk of developing complications in order to plan how to manage the patient (outpatient or in-hospital management) and whether to implement specific therapeutic measures. To this end, a large amount of research around the world has focused on identifying the variables related to the prognosis of individuals at various stages of this infection (Figliozzi et al. 2020; Liang et al. 2020).

From a medical standpoint, risk factor analysis has historically consisted of the initial assessment of each particular variable, or univariate analysis, which is then supplemented by the results from an evaluation of a subset chosen under criteria related to the evidence of their individual association, or multivariate analysis. This analytical framework, which was initially designed to establish associations between risk factors and diseases, was subsequently extended in order to identify prognostic variables or markers (Fletcher and Fletcher 2014). Nevertheless, significant limitations for this strategy have been identified, and this has led to a search for new approaches on how to deal with prognostic markers (Goldstein, Navar, and Carter 2017; Pepe et al. 2004).

In the research on COVID-19, a review of the studies that attempted to identify these prognostic markers reveals variability in the lists of variables presented (Rod, Oviedo-Trespalacios, and Cortes-Ramirez 2020). Although agreement on most of the identified factors could be expected, the procedure was found to be sensitive to variation in the experimental conditions, which are determined by the type of population explored, the geographical area, and the timing of the pandemic outbreak, among other aspects (Di Castelnuovo et al. 2020; van Halem et al. 2020; Xu et al. 2020). An additional factor that varies across these studies and could affect the final results is how the variables are selected (Bursac et al. 2008; Núñez, Steyerberg, and Núñez 2011).

In the fields of machine learning, pattern recognition, and data science, the selection of variables or characteristics is also a preliminary (preprocessing) step carried out before prediction tools are developed. The goal of this selection is threefold: (a) improve prediction performance, (b) provide faster and more cost-effective predictors, and (c) provide a better understanding of the underlying process that generated the data (Guyon and Elisseeff 2003). The management of huge amounts of data (big data) has led to the development of increasingly sophisticated and successful strategies, reflected in the improved generalization capacity of the models fed with the variables that result from these processes (Li et al. 2017).

Artificial intelligence tools have been used in different processes in the fight against the COVID-19 pandemic, including diagnosis, public health, and clinical and therapeutic decision-making (Chen and See 2020; Rasheed et al. 2021); so far, however, their possible usefulness in identifying prognostic markers has not been explored. The objective of this study is to explore an alternate route for identifying prognostic markers by leveraging the capacity of machine learning strategies for feature selection.

Methodology

A sub-analysis was done of a multi-center retrospective observational study of patients hospitalized in the emergency department or general ward for SARS-CoV-2/COVID-19 viral pneumonia confirmed by real-time polymerase chain reaction (RT-PCR) test in nasal swabs between April 15 and December 31, 2020 in three fourth-level clinical care hospitals in Bogota, Colombia. The following patients were excluded: those who were admitted directly to the Intensive Care Unit (ICU) having been referred after a 72-h stay in another institution, those for whom the outcome under study was unknown, pregnant women, and patients with any condition that could seriously affect their short-term survival as recorded on their medical chart.

The patients were screened based on the census of hospitalized patients in the three institutions, and their data were entered into an electronic form built using the data items recommended by the WHO International Severe Acute Respiratory and Emerging Infections Consortium (ISARIC).

Statistical analysis

Descriptive statistics

Qualitative variables are reported with absolute frequencies and percentages. The quantitative variables are summarized with measurements of central tendency and dispersion chosen according to the distribution of the variables (assessed using the Shapiro–Wilk test).

Primary outcome

The end point consisted of admission to the ICU or in-hospital death.

Initial selection of variables

A group of variables related to the primary outcome was selected on the basis of what had been reported in the literature (Table 1). The clinical laboratory variables were taken from measurements obtained between admission and up to five days after hospitalization.

Table 1:

Variables included.

Component evaluated by variable	Variables
Demographic data	Sex, age
Clinical presentation	Type of symptoms, fever, cough, odynophagia, rhinorrhea, chest pain, myalgias or arthralgias, asthenia or malaise, dyspnea, confusion, nausea, diarrhea, headache, abdominal pain, and smell or taste disorders
Medical history	Heart disease, high blood pressure, lung disease, chronic renal disease, neurological disease, diabetes, smoking, rheumatological disease, number of diseases, use of ACE inhibitors or angiotensin II receptor blockers
Vital signs	Temperature, heart rate, respiratory rate, systolic blood pressure, diastolic blood pressure, oxygen saturation, use of supplemental oxygen
Laboratories	Hemoglobin, hematocrit, red blood cell distribution width (RDW), leucocytes, neutrophils, lymphocytes, platelets, urea nitrogen, creatinine, C-reactive protein, lactate dehydrogenase, troponin, D-dimer, ferritin, pH, PaO₂/FiO₂ ratio, pCO2, HCO3, lactate
Radiological	Chest X-ray infiltrates

Identification of prognostic markers

Eight strategies, most of which are available in the CARET (Classification And REgression Training) machine learning package (Kuhn 2008), were applied to identify prognostic markers in R v.4.0.2 (R Foundation, Vienna, Austria), using the population for which complete data for all the variables were available. These methods are briefly described below.

Step-wise Forward and Backward Selection (SW): This is a variation of the usual step-wise procedure in which variables are added in forward steps while others are removed in backward steps, starting from an initial model that includes all of them (Derksen and Keselman 1992).
Least absolute shrinkage and selection operator (LASSO): This is a regression analysis method that selects and regularizes variables by penalizing large weights (coefficient values), a penalty known as the “L1” norm. As a result, the process shrinks the coefficients of some characteristics to exactly zero and thus eliminates unnecessary variables. It is applied by using the glmnet package in R (Friedman, Hastie, and Tibshirani 2010).
Recursive feature selection (RFS): This is a basic backward selection of predictors (Derksen and Keselman 1992). It begins by building a model on the full set of predictors and calculating an importance score for each predictor. The least important predictors are then eliminated, the model is rebuilt, and the process is repeated. The subset sizes to be evaluated must be specified, and the size that optimizes the performance criterion is used to select the predictors according to their importance ranking. It is implemented in CARET with the rfe() function.
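
To illustrate how an L1 penalty can drive coefficients to exactly zero, the following minimal Python sketch applies the soft-thresholding operator, which gives the LASSO solution in the special case of an orthonormal design. It is not the glmnet procedure used in the study; the coefficient values and the penalty `lam` are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients (illustrative values, not study results).
ols_coefs = {"age": 1.5, "ldh": -0.9, "headache": 0.3, "sex": -0.1}
lam = 0.5  # hypothetical penalty strength

# Apply the L1 shrinkage to every coefficient.
lasso_coefs = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}

# Variables whose coefficient survives the penalty are the "selected" ones.
selected = [name for name, b in lasso_coefs.items() if b != 0.0]
print(selected)
```

Under these assumed values, the two small coefficients fall below the penalty and are set to zero, which is the mechanism by which the method discards variables.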

The following methods are derived from machine learning and were applied as wrappers, which evaluate multiple models generated by including or excluding predictive variables in order to identify the combination that maximizes model performance. They are search algorithms that treat predictors as inputs and use a model metric as the optimization objective.

Recursive Partitioning and Regression Trees: These are decision trees generated automatically through the CART (Classification And Regression Trees) methodology, which refers to decision tree algorithms that can be used for classification or regression problems (Breiman 1984). When the model is run, it generates a ranking of variable importance, which is used to select the variables for other models. It is implemented in R with the rpart() function.
Stochastic Gradient Boosting (SGB): This is a model that generates multiple decision trees that are then combined by boosting. It can be very effective at finding a set of predictors that better explain the variance in the response variable. It is implemented with the gbm training method in CARET.
BORUTA: This is a wrapper-type feature selection algorithm built on the random forest classifier (Kursa and Rudnicki 2010). The characteristics do not compete with each other but with randomized copies of themselves, and a test based on the binomial distribution then decides which variables are included.
Genetic Algorithm (GA): These are stochastic search algorithms inspired by the basic principles of biological evolution and natural selection. They have been successfully applied to solving optimization problems. In R, supervised feature selection can be carried out with GA using the gafs() function of the CARET package (Scrucca 2013).
Simulated Annealing (SA): The annealing process that occurs with particles within a material can be abstracted and applied to feature selection. In this case, the goal is to find the set of characteristics that best predicts the response. This is a global search algorithm that makes small random changes to the current solution and, subject to an acceptance criterion, allows a suboptimal solution to be accepted in the expectation that a better one will appear later. It is implemented in CARET with the safs() function (Kuhn and Johnson 2013).
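
The wrapper idea can be sketched with a small simulated annealing search over feature subsets. This is a simplified Python illustration under stated assumptions, not the safs() implementation in CARET: the function `anneal_features`, the scoring function `toy_score`, the set `INFORMATIVE`, and all parameter values are hypothetical, and the toy score stands in for a resampled model metric.

```python
import math
import random


def anneal_features(features, score, n_iter=200, t0=1.0, cooling=0.98, seed=0):
    """Search feature subsets by adding or dropping one feature at a time."""
    rng = random.Random(seed)
    current = set(rng.sample(features, k=len(features) // 2))
    current_score = score(current)
    best, best_score = set(current), current_score
    temp = t0
    for _ in range(n_iter):
        candidate = set(current)
        # Small random change: toggle membership of one randomly chosen feature.
        candidate.symmetric_difference_update({rng.choice(features)})
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Accept improvements always; accept worse subsets with probability
        # exp(delta / temp), which shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temp *= cooling
    return best, best_score


INFORMATIVE = {"ldh", "pafi", "crp"}  # hypothetical "truly useful" markers


def toy_score(subset):
    # Stand-in for a model metric: reward informative features, penalize size.
    return len(subset & INFORMATIVE) - 0.1 * len(subset)


features = ["ldh", "pafi", "crp", "sex", "headache", "fever"]
best, best_score = anneal_features(features, toy_score)
print(sorted(best), round(best_score, 2))
```

In a real wrapper, `toy_score` would be replaced by the performance of a model refitted on each candidate subset, which is what makes these methods computationally expensive.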

Development and comparison of logistic regression

Eight multivariate logistic regression models were built, one for each set of variables identified through the previously described strategies.

The eight strategies were validated in the subgroup of patients for whom information on all of the variables analyzed initially was not available. For each strategy, the validation was carried out with the patients who had data on the variables identified by that strategy. No imputation was done in this population because, in many cases, several characteristics were absent, so a high risk of error was estimated a priori.
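
The per-strategy restriction to patients with data on the selected variables can be illustrated with a short Python sketch. The records, variable sets, and the helper `complete_cases` are hypothetical and serve only to show the filtering logic; they do not reproduce the study database.

```python
def complete_cases(records, variables):
    """Keep records that have a non-missing value for every listed variable."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]


# Hypothetical patient records; None marks a missing measurement.
patients = [
    {"id": 1, "age": 61, "ldh": 430, "ferritin": 790},
    {"id": 2, "age": 58, "ldh": None, "ferritin": 640},
    {"id": 3, "age": 70, "ldh": 512, "ferritin": None},
]

# Hypothetical variable sets for two selection strategies.
strategy_vars = {"A": ["age", "ldh"], "B": ["age", "ferritin"]}

# Each strategy is validated only on the patients with data for its own variables.
usable = {
    name: [r["id"] for r in complete_cases(patients, variables)]
    for name, variables in strategy_vars.items()
}
print(usable)
```

Because no values are imputed, the set of usable patients can differ from one strategy to another.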

The operating characteristics of each regression model were calculated by building contingency tables for each population (training and validation) and deriving from them the accuracy, precision, sensitivity, specificity, predictive values, and likelihood ratios (LR). Each model was also assessed with the Akaike information criterion, the Hosmer–Lemeshow goodness-of-fit test, and the adjusted R² coefficient of determination.
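
As a worked illustration of how these measures follow from a contingency table, the Python sketch below uses the standard definitions; the function name `operating_characteristics` is an assumption, and the counts are those reported in Table 5 for the step-wise model in the development population. The sketch does not guard against empty cells, which would cause a division by zero.

```python
def operating_characteristics(tp, tn, fp, fn):
    """Derive common performance measures from a 2x2 contingency table."""
    sens = tp / (tp + fn)   # sensitivity (recall)
    spec = tn / (tn + fp)   # specificity
    ppv = tp / (tp + fp)    # positive predictive value (precision)
    npv = tn / (tn + fn)    # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "lr_pos": sens / (1 - spec),        # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,        # negative likelihood ratio
        "f1": 2 * ppv * sens / (ppv + sens),
    }


# Counts from Table 5, step-wise model, development population.
m = operating_characteristics(tp=50, tn=118, fp=33, fn=19)
print({name: round(value, 3) for name, value in m.items()})
```

With these counts the function reproduces the corresponding column of Table 5 up to rounding (for example, an accuracy of about 76.4% and a positive likelihood ratio of about 3.32).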

The work was approved by the ethics and research committees at each one of the institutions and filling out an informed consent document was not considered necessary given the retrospective nature of the study. This study received funding from the call for proposals under Research Promotion number DI-I-0631-20 of the research division of the Fundación Universitaria de Ciencias de la Salud (Health Sciences University Foundation).

Results

Between April 15 and December 31, 949 patients were included, of whom 399 came from San Jose Hospital, 394 from El Tunal Hospital, and 156 from San Jose University Children’s Hospital. Complete data for all variables were obtained for 486 patients, of whom 220 were selected for the development database; in this group, the main outcome occurred in 83 cases [36 (16.4%) deaths and 72 (32.7%) transfers to ICU]. The remaining 266 patients were used for the validation; among them, the primary outcome occurred in 52 cases [44 (16.5%) deaths and 73 (27.4%) transfers to ICU]. Figure 1 shows the flowchart of selected patients and the applied methodology.

Figure 1:

Flowchart of methodology.

The characteristics of the population and the laboratory results are described in Tables 2 and 3. As in other cohorts of hospitalized patients, men predominated (137, 62.3%). The median age was 58 years (IQR 21.5). The most frequent comorbidities were diabetes (39, 17.7%), high blood pressure (62, 28.2%), and chronic lung disease (32, 14.5%). The median duration of symptoms prior to admission was 8 days (IQR 4). The most frequent symptoms were cough in 200 patients (90.9%), fever in 175 (79.5%), and dyspnea in 176 (80.0%). The majority had infiltrates on the chest X-ray (192, 87.3%). In the laboratory results, the median white blood cell count was 7,870 cells/µL (IQR 4,692); the median ferritin was 787 ng/mL (IQR 1,086); lactate dehydrogenase, 429 U/L (IQR 380.8); and C-reactive protein, 14.4 mg/L (IQR 18.8). Troponin was positive in 39 patients (17.7%).

Table 2:

General population characteristics.

Characteristic	Development (n=220)	Validation (n=266)
Sex, n (%)
Women	83 (37.7%)	110 (41.4%)
Men	137 (62.3%)	156 (58.6%)
Age, years
Median (IQR)	58 (21.5)	63 (22.0)
Greater than 65 years, n (%)	72 (32.7%)	125 (47.0%)
Overweight, n (%)	47 (22.3%)	49 (18.4%)
Obesity, n (%)	45 (20.4%)	42 (15.8%)
Employed as a health worker, n (%)	8 (3.5%)	5 (1.9%)
Comorbidities, n (%)
Median (IQR)	1 (2)	1 (2)
At least one	128 (58.2%)	198 (74.4%)
Hypertension	62 (28.2%)	111 (41.7%)
Diabetes	39 (17.7%)	55 (20.7%)
Chronic heart disease (except high blood pressure)	14 (6.3%)	28 (10.5%)
Chronic renal disease	13 (5.9%)	14 (5.3%)
Smoking	56 (25.4%)	51 (19.2%)
Chronic pulmonary disease	32 (14.5%)	32 (12.0%)
Chronic neurological disease	18 (8.2%)	24 (9.0%)
Active cancer	3 (1.4%)	9 (3.4%)
HIV infection	1 (0.4%)	2 (0.8%)
Cirrhosis	0	1 (0.4%)
Duration of illness prior to admission to hospitalization (days), median (IQR)	8 (4)	7 (4)
Reported symptoms, n (%)
Dyspnea	176 (80%)	220 (82.7%)
Fever	175 (79.5%)	187 (70.3%)
Cough	200 (90.9%)	242 (91.0%)
Myalgias or arthralgias	111 (50.4%)	121 (45.5%)
Diarrhea	45 (20.4%)	63 (23.7%)
Rhinorrhea	22 (10%)	25 (9.4%)
Sore throat	63 (28.6%)	54 (20.3%)
Headache	27 (12.3%)	32 (12.0%)
Vital signs on admission to hospital, median (IQR)
Temperature, °C	36.7 (1.0)	36.5 (0.8)
Heart rate (beats per minute)	92 (26.2)	90 (27.0)
Respiratory rate (breaths per minute)	20 (3.0)	20 (6.0)
Systolic blood pressure, mmHg	123 (24.2)	123 (26.0)
Oxygen saturation, %	86 (10.0)	85 (10.0)
Alteration of the mental state, n (%)	7 (3.2%)	15 (5.6%)
Presence of infiltrates in the initial chest X-ray, n (%)	192 (87.3%)	235 (88.3%)

IQR, interquartile range; HIV, human immunodeficiency virus.

Table 3:

Summary of the laboratory results for the two populations evaluated.

Characteristic	Development (n=220)	Validation (n=266)
White blood cell count (cells per μL), median (IQR)	7,870 (4,692)	8,200 (4,605)
Lymphocyte count (cells per μL), median (IQR)	900 (650)	900 (580)
Platelet count (cells per μL), median (IQR)	215,000 (93,250)	224,000 (99,000)
Lactate, mmol/L, median (IQR)	1.6 (0.8)	1.7 (0.7)
Creatinine, mg/dL, median (IQR)	0.8 (0.3)	0.8 (0.3)
High-sensitivity C-reactive protein, mg/L, median (IQR)	14.4 (18.8)	30.8 (142.4)
Ferritin, ng/mL, median (IQR)	787 (1,086)	789 (1,174)
D-dimer, μg/mL, median (IQR)	785 (745)	1,037 (1,163)
Lactate dehydrogenase, U/L, median (IQR)	429 (380.8)	430 (296.5)
Coagulation times greater than 5 s, n (%)	41/161 (25.5%)	41/166 (25.3%)
Positive high-sensitivity troponin I, n (%)	39 (17.7%)	66 (24.8%)

IQR, interquartile range.

Identified risk factors

The eight feature selection methods identified 33 variables that could be considered prognostic markers for death or transfer to ICU (Table 4). The variables most frequently selected by the different methods were LDH and the PaO₂/FiO₂ ratio (7 times each); C-reactive protein (CRP; 6 times); age, neutrophil count, and lymphocyte count (5 times each); and respiratory rate, oxygen saturation, ferritin, and bicarbonate (HCO3, listed as sodium bicarbonate in Table 4; 4 times each).

Table 4:

Risk factors identified by various characteristic selection algorithms.

Origin of the variables	Step-wise forward and backward selection	LASSO	Recursive partitioning and regression trees	Stochastic gradient boosting	BORUTA	Recursive feature selection	Genetic algorithm	Simulated annealing
Number of variables	14	8	9	8	11	20	5	12
Sex	–	–	X	–	–	–	–	–
Age	X	X	–	X	X	X	–	–
Odynophagia	X	X	–	–	–	–	–	–
Myalgias or arthralgias	–	–	–	–	–	–	–	X
Confusion	–	–	–	–	–	–	–	X
Headache	–	–	–	–	–	–	–	X
Chest pain	–	–	–	–	–	–	–	X
Diarrhea	–	X	–	–	–	–	–	–
High blood pressure	–	–	–	–	–	–	–	X
Lung disease	X	X	–	–	–	–	–	–
Chronic renal disease	X	X	–	–	–	–	–	–
Neurological disease	X	X	–	–	–	–	–	–
Number of diseases	–	–	–	X	–	–	–	X
Heart rate	–	–	–	–	–	X	–	–
Systolic blood pressure	X	–	–	–	–	X	–	–
Diastolic blood pressure	X	–	–	–	–	X	–	–
Respiratory rate	X	X	–	–	X	X	–	–
Oxygen saturation	–	X	X	–	X	X	–	–
Leukocytes	–	–	–	–	–	X	–	–
Neutrophils	X	–	X	–	X	X	X	–
Lymphocytes	X	–	X	X	X	X	–	–
Platelets	X	–	–	–	–	X	–	X
Erythrocyte distribution width (RDW-CV)	–	–	–	–	–	X	–	–
Creatinine	–	–	X	–	X	X	–	–
Urea nitrogen	–	–	X	–	–	X	–	–
Lactate dehydrogenase	X	–	X	X	X	X	X	X
C-reactive protein	X	–	X	X	X	X	–	X
D-dimer	–	–	–	X	–	X	–	–
Ferritin	–	–	–	X	X	X	X	–
Sodium bicarbonate	–	–	–	–	X	X	X	X
PaO₂/FiO₂ ratio	X	–	X	X	X	X	X	X
Lactate	–	–	–	–	–	X	–	–
Chest X-ray infiltrates	–	–	–	–	–	–	–	X

LASSO, least absolute shrinkage and selection operator.
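
The selection frequencies quoted in the text can be recomputed by tallying the marks in each row of Table 4. The Python sketch below transcribes only a subset of rows (the ten most frequent variables plus two others for contrast); the names `METHODS`, `ROWS`, and `selection_counts` are assumptions introduced for illustration.

```python
# Column order of Table 4.
METHODS = ["SW", "LASSO", "CART", "SGB", "BORUTA", "RFS", "GA", "SA"]

# Selected rows transcribed from Table 4 ("X" = selected, "-" = not selected).
ROWS = {
    "Age": "XX-XXX--",
    "Respiratory rate": "XX--XX--",
    "Oxygen saturation": "-XX-XX--",
    "Neutrophils": "X-X-XXX-",
    "Lymphocytes": "X-XXXX--",
    "Lactate dehydrogenase": "X-XXXXXX",
    "C-reactive protein": "X-XXXX-X",
    "Ferritin": "---XXXX-",
    "Sodium bicarbonate": "----XXXX",
    "PaO2/FiO2 ratio": "X-XXXXXX",
    "Sex": "--X-----",
    "D-dimer": "---X-X--",
}


def selection_counts(rows):
    """Count how many methods selected each variable."""
    return {var: marks.count("X") for var, marks in rows.items()}


def selected_by(var, rows=ROWS):
    """List the methods that selected a given variable."""
    return [m for m, mark in zip(METHODS, rows[var]) if mark == "X"]


counts = selection_counts(ROWS)
# Rank by frequency (descending), breaking ties alphabetically.
for var, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(n, var)
```

A tally of this kind treats every method as an equal vote; it does not weight methods by the performance of the models built from their variable sets.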

Prognostic validation of the multivariate logistic regression models

The different logistic regression models were validated in 266 patients. The prognostic performance of each model for the primary endpoint in the development and validation populations is shown in Table 5. The models showed very similar discriminative capacity (validation accuracy between 65.8 and 72.9%). The models generated with the variables selected through SW, BORUTA, and RFS showed the best result in the development population, with an accuracy of 76.4% each, while the models generated with the variables selected using SGB and RFS had the best performance in the validation population, with an accuracy of 72.9% each.

Table 5:

Result of the performance in development populations and validation of the different logistic regression models created starting from the sets of selected variables.

Origin of the variables	Step-wise forward and backward selection		LASSO		Recursive partitioning and regression trees		Stochastic gradient boosting		BORUTA		Recursive feature selection		Genetic algorithm		Simulated annealing	
Model	LR training	LR validation	LR training	LR validation	LR training	LR validation	LR training	LR validation	LR training	LR validation	LR training	LR validation	LR training	LR validation	LR training	LR validation
True positive	50	23	34	24	47	17	42	13	49	22	50	20	39	21	39	21
True negative	118	164	120	151	120	176	119	181	119	171	118	174	119	168	119	168
False positive	33	21	49	20	36	27	41	31	34	22	33	24	44	23	44	23
False negative	19	58	17	71	17	46	18	41	18	51	19	48	18	54	18	54
Total	220	266	220	266	220	266	220	266	220	266	220	266	220	266	220	266
Accuracy, %	76.4	70.3	70.0	65.8	75.9	72.6	73.2	72.9	76.4	72.6	76.4	72.9	71.8	71.1	71.8	71.1
Sensitivity (recall, %)	72.5	28.4	66.7	25.3	73.4	27.0	70.0	24.1	73.1	30.1	72.5	29.4	68.4	28.0	68.4	28.0
Specificity, %	78.1	88.6	71.0	88.3	76.9	86.7	74.4	85.4	77.8	88.6	78.1	87.9	73.0	88.0	73.0	88.0
Pos pred value, %	60.2	52.3	41.0	54.5	56.6	38.6	50.6	29.5	59.0	50.0	60.2	45.5	47.0	47.7	47.0	47.7
Neg pred value, %	86.1	73.9	87.6	68.0	87.6	79.3	86.9	81.5	86.9	77.0	86.1	78.4	86.9	75.7	86.9	75.7
LR +	3.316	2.501	2.299	2.160	3.182	2.029	2.732	1.646	3.291	2.644	3.316	2.426	2.535	2.325	2.535	2.325
LR −	0.352	0.808	0.469	0.846	0.345	0.842	0.403	0.889	0.345	0.789	0.352	0.803	0.433	0.819	0.433	0.819
Precision, %	60.2	52.3	41.0	54.5	56.6	38.6	50.6	29.5	59.0	50.0	60.2	45.5	47.0	47.7	47.0	47.7
F1-score	65.8	36.8	50.7	34.5	63.9	31.8	58.7	26.5	65.3	37.6	65.8	35.7	55.7	35.3	55.7	35.3

LASSO, least absolute shrinkage and selection operator; Pred, predictive; LR, logistic regression (in the Model row) or likelihood ratio (in the LR + and LR − rows).

The diagnostics for each model showed that the model generated with the variables selected through SW had the best (lowest) Akaike information criterion, 249.3. The CART and SA models showed poor calibration, as indicated by a significant p-value in the Hosmer–Lemeshow test (p=0.0001889 and 0.0425, respectively; Table 6). In all models, the independent variables explained only a small percentage of the variability in the response; the maximum was for the SW model, with an adjusted R-squared of barely 28%.

Table 6:

Diagnostic of each of the multivariate logistic regression models built from selected sets of variables.

Model	Akaike information criterion	Hosmer and Lemeshow goodness of fit (GOF) test, X-squared	Hosmer and Lemeshow GOF test, p-Value	Multiple R-squared	Adjusted R-squared	F-statistic	p-Value
Step-wise forward and backward selection	249.329	0.95085	0.9985	0.3236	0.2775	7.007	0.00000000
LASSO	292.056	9.3518	0.3135	0.1572	0.1169	3.898	0.00007150
Recursive partitioning and regression trees	279.253	30.275	0.0002	0.1976	0.1632	5.745	0.00000040
Stochastic gradient boosting	269.414	11.741	0.1632	0.1607	0.1289	5.049	0.00000941
BORUTA	270.556	4.3534	0.8239	0.2426	0.2025	6.056	0.00000001
Recursive feature selection	272.350	6.5625	0.5845	0.2964	0.2256	4.191	0.00000006
Genetic algorithm	283.633	11.884	0.1565	0.1511	0.1313	7.619	0.00000131
Simulated annealing	304.874	15.992	0.0425	0.1227	0.0719	2.413	0.00602600

LASSO, least absolute shrinkage and selection operator.
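
The relationship between the multiple and adjusted R-squared columns can be checked with the standard adjustment formula, 1 − (1 − R²)(n − 1)/(n − p − 1). The Python sketch below is illustrative only: it assumes n = 220 (the development population) and takes p as the number of variables listed for each method in Table 4, treating each selected variable as a single model term, which is an assumption that may not hold for every model.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


N_DEVELOPMENT = 220  # size of the development population

# (multiple R-squared from Table 6, number of variables from Table 4)
models = {
    "Step-wise selection": (0.3236, 14),
    "Genetic algorithm": (0.1511, 5),
}

for name, (r2, p) in models.items():
    print(name, round(adjusted_r2(r2, N_DEVELOPMENT, p), 4))
```

Under these assumptions the results are close to the adjusted values reported in Table 6 (0.2775 and 0.1313), with small differences attributable to rounding of the inputs. The penalty grows with the number of predictors, so a model with many variables needs a proportionally higher multiple R-squared to keep the same adjusted value.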

Discussion

The experiments carried out in our analysis identified, in a relatively small population, a series of variables that are related to a requirement for intensive care or death in patients hospitalized for COVID-19. They are therefore considered prognostic markers. The ten most frequently identified by the various algorithms were LDH, the PaO₂/FiO₂ ratio, CRP, age, neutrophil and lymphocyte counts, respiratory rate, oxygen saturation, ferritin, and bicarbonate.

The prognostic markers identified in our study have also been identified in other studies done in various parts of the world. Age and CRP are the most consistent (Rod, Oviedo-Trespalacios, and Cortes-Ramirez 2020), while the neutrophil count, lymphocyte count, and LDH have been documented in several studies and classified as moderately consistent (Rod, Oviedo-Trespalacios, and Cortes-Ramirez 2020). In a meta-analysis involving 58 studies, Noor and Islam (2020) found that age >65 years (RR 3.59, 95% CI [1.87–6.90], p<0.001) and admission to ICU (RR 3.72, 95% CI [2.70–5.13], p<0.001) were the most statistically significant factors related to in-hospital mortality. In another meta-analysis that evaluated hematological and immunological markers, Elshazli et al. (2020) identified nine variables related to progression to a severe phenotype, among them the neutrophil count (OR=2.62). Four were related to admission to ICU, including the neutrophil count (OR=6.25), and another four were related to mortality, among them the neutrophil count (OR=6.25) and CRP (OR=7.09). A systematic review by Izcovich et al. (2020) evaluated 207 studies and identified 49 variables that could provide prognostic information on mortality and severe disease, including variables identified by the strategies used in our study.

Of the methods evaluated for selecting variables, three (SW, LASSO, and RFS) are derived from traditional statistics and are considered improvements on the usual practice of building multivariate models (Bursac et al. 2008; Taylor and Tibshirani 2015). The remaining five are tools used in machine learning, three of which are based on decision trees. Even though strategies with different mathematical foundations were used, the logistic regression models generated from each subset of selected variables showed similar prognostic performance (the confidence intervals of all the accuracy measurements overlapped). This can be interpreted as a similar ability to identify prognostic markers. In general, the models were well calibrated, and the results in the validation population were not very different from those in the training population, suggesting that there was no substantial overfitting. The R-squared values indicated that the models explained a low percentage of the variance in the outcome, with the methodological limitations that this statistic may have (Cox and Snell 1989). This finding calls for consideration of the complexity and wide range of risk patterns that COVID-19 may present, as highlighted by the varied results of a large number of studies.

The evaluation of prognostic markers remains a major challenge in medical research. The traditional statistical methods used in clinical studies to evaluate etiological associations or “risk factors” are not always suitable for determining the potential performance of a marker for classifying or predicting results for individuals. Relatively often, the performance of these markers can be described as disappointing (Pepe et al. 2004). Three analytical challenges in these types of studies limit the usefulness of logistic regression for this purpose: the assumption of linearity between risk factors and outcome, the heterogeneity of effects, and the presence of many prognostic-related variables. These considerations may affect the performance of logistic regression models as real-world prognostic tools (Goldstein, Navar, and Carter 2017). In addition, the purpose of feature selection when establishing prognostic models is to determine which combination of variables will offer the best performance when a regression or classification model is generated (Chandrashekar and Sahin 2014). A variable considered useless by itself can provide a significant improvement in performance when taken with others. Furthermore, perfectly correlated variables are redundant in the sense that no additional information is obtained when they are added together (Guyon and Elisseeff 2003). The challenge of correcting for the effects of selection is complex, since selective decisions may occur at different stages in the process of analysis (Pepe et al. 2004). There are several proposals to solve this problem, such as the false discovery rate strategy (Benjamini, Heller, and Yekutieli 2009), sample division, adaptive regression techniques (Taylor and Tibshirani 2015), Bayesian probabilistic models (García-Donato, Castellanos, and Quirós 2021), and machine learning techniques (Nilsson et al. 2019). The last of these is the approach evaluated in the current paper.

One of the areas in which the identification of prognostic markers has been changing in the last few years is genetics, where genome-wide association (GWA) scans have been used. This technique makes inferences from variations in polymorphic regions associated with the condition of interest among a large number of evaluated regions (García-Donato, Castellanos, and Quirós 2021; Nilsson et al. 2019). It has suffered from a reproducibility problem attributed to “selective inference” (Taylor and Tibshirani 2015). One possible explanation is that logistic regression models ignore “heterogeneous effects,” which can generate misleading evidence for large subgroups of individuals (Benjamini 2020). The problem is similar to the one faced in clinical practice, where an effort is made to identify prognostic markers from a large number of initial variables.

This study has several limitations. One is the size of the population, since no sample size calculation was carried out. Another was the impossibility of recognizing variables that boost the performance of others within a model but that, on their own, make no difference. Although a significant number of works have developed prognostic models for COVID-19 from which the constituent variables can be extracted (Bottino et al. 2021; Wang et al. 2021), the discussion of individual variables does not take the results of those models into account when the variables are analyzed (for example, obesity (Yang et al. 2021) or ferritin (Taneri et al. 2020)); this is therefore a field to be explored in which there is still much to say. We must also note that the parameters used for feature selection were not modified; they were those preset by the R packages used.

The possibility of identifying, or at least corroborating, previously reported prognostic factors supports local decision-making with greater certainty. Future work will include testing the approach in different populations (larger in size or covering another spectrum of disease) as well as in other clinical entities. A hypothesis worth studying is whether an ensemble of prediction tools fed with different sets of variables might show better predictive performance along with a better ability to generalize.

Conclusions

The combined use of feature selection strategies derived from machine learning made it possible, starting from a small population of patients hospitalized for COVID-19, to identify a broad variety of prognostic markers for death or hospitalization in intensive care that coincide with those identified in other studies.

Corresponding author: John Jaime Sprockel Diaz, School of Medicine, Internal Medicine Department, Fundación Universitaria de Ciencias de la Salud (Health Sciences University Foundation), Hospital de San Jose (San Jose Hospital), Calle 10 No. 10-75, Bogota, DC, Colombia; and Intensive Care Unit at the Health Services Unit – Hospital de Tunal (Tunal Hospital), Bogota, Colombia, Phone: +(57) 3184009973, E-mail: jjsprockel@fucsalud.edu.co

Funding source: Fundación Universitaria de Ciencias de la Salud (Health Sciences University Foundation)

Award Identifier / Grant number: DI-I-0631-20

Research funding: The current study received funding from the call for proposals under Research Promotion number DI-I-0631-20 of the research division of the Fundación Universitaria de Ciencias de la Salud (Health Sciences University Foundation).
Author contributions: All authors participated in the planning, design, data analysis, and preparation of this article. All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Competing interests: Authors state no conflict of interest.
Informed consent: Not applicable.
Ethical approval: The work was approved by the ethics and research committees at each one of the institutions.

References

Benjamini, Y. 2020. “Selective Inference: The Silent Killer of Replicability.” Harvard Data Science Review 2 (4). https://doi.org/10.1162/99608f92.fc62b261, https://hdsr.mitpress.mit.edu/pub/l39rpgyc/release/1.

Benjamini, Y., R. Heller, and D. Yekutieli. 2009. “Selective Inference in Complex Research.” Philosophical Transactions of the Royal Society of London A Mathematical, Physical, and Engineering Sciences 367 (1906): 4255–71. https://doi.org/10.1098/rsta.2009.0127.

Bottino, F., E. Tagliente, L. Pasquini, A. D. Napoli, M. Lucignani, L. Figà-Talamanca, and A. Napolitano. 2021. “COVID Mortality Prediction with Machine Learning Methods: A Systematic Review and Critical Appraisal.” Journal of Personalized Medicine 11 (9): 893. https://doi.org/10.3390/jpm11090893.

Breiman, L. 1984. Classification and Regression Trees. New York: Kluwer Academic Publishers.

Bursac, Z., C. H. Gauss, D. K. Williams, and D. W. Hosmer. 2008. “Purposeful Selection of Variables in Logistic Regression.” Source Code for Biology and Medicine 3: 17. https://doi.org/10.1186/1751-0473-3-17.

Chandrashekar, G., and F. Sahin. 2014. “A Survey on Feature Selection Methods.” Computers & Electrical Engineering 40 (1): 16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024.

Chen, J., and K. C. See. 2020. “Artificial Intelligence for COVID-19: Rapid Review.” Journal of Medical Internet Research 22 (10): e21476. https://doi.org/10.2196/21476.

Cox, D. R., and E. J. Snell. 1989. The Analysis of Binary Data, 2nd ed. London: Chapman and Hall.

Derksen, S., and H. J. Keselman. 1992. “Backward, Forward and Stepwise Automated Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables.” British Journal of Mathematical and Statistical Psychology 45 (2): 265–82. https://doi.org/10.1111/j.2044-8317.1992.tb00992.x.

Di Castelnuovo, A., M. Bonaccio, S. Costanzo, A. Gialluisi, A. Antinori, N. Berselli, L. Blandi, R. Bruno, R. Cauda, G. Guaraldi, I. My, L. Menicanti, G. Parruti, G. Patti, S. Perlini, F. Santilli, C. Signorelli, G. G. Stefanini, A. Vergori, A. Abdeddaim, W. Ageno, A. Agodi, P. Agostoni, L. Aiello, S. Al Moghazi, F. Aucella, G. Barbieri, A. Bartoloni, C. Bologna, P. Bonfanti, S. Brancati, F. Cacciatore, L. Caiano, F. Cannata, L. Carrozzi, A. Cascio, A. Cingolani, F. Cipollone, C. Colomba, A. Crisetti, F. Crosta, G. B. Danzi, D. D’Ardes, K. de Gaetano Donati, F. Di Gennaro, G. Di Palma, G. Di Tano, M. Fantoni, T. Filippini, P. Fioretto, F. M. Fusco, I. Gentile, L. Grisafi, G. Guarnieri, F. Landi, G. Larizza, A. Leone, G. Maccagni, S. Maccarella, M. Mapelli, R. Maragna, R. Marcucci, G. Maresca, C. Marotta, L. Marra, F. Mastroianni, A. Mengozzi, F. Menichetti, J. Milic, R. Murri, A. Montineri, R. Mussinelli, C. Mussini, M. Musso, A. Odone, M. Olivieri, E. Pasi, F. Petri, B. Pinchera, C. A. Pivato, R. Pizzi, V. Poletti, F. Raffaelli, C. Ravaglia, G. Righetti, A. Rognoni, M. Rossato, M. Rossi, A. Sabena, F. Salinaro, V. Sangiovanni, C. Sanrocco, A. Scarafino, L. Scorzolini, R. Sgariglia, P. G. Simeone, E. Spinoni, C. Torti, E. M. Trecarichi, F. Vezzani, G. Veronesi, R. Vettor, A. Vianello, M. Vinceti, R. De Caterina, and L. Iacoviello. 2020. “Common Cardiovascular Risk Factors and In-Hospital Mortality in 3,894 Patients with COVID-19: Survival Analysis and Machine Learning-Based Findings from the Multicentre Italian CORIST Study.” Nutrition, Metabolism, and Cardiovascular Diseases 30 (11): 1899–913. https://doi.org/10.1016/j.numecd.2020.07.031.

Elshazli, R. M., E. A. Toraih, A. Elgaml, M. El-Mowafy, M. El-Mesery, M. N. Amin, M. H. Hussein, M. T. Killackey, M. S. Fawzy, and E. Kandil. 2020. “Diagnostic and Prognostic Value of Hematological and Immunological Markers in COVID-19 Infection: A Meta-Analysis of 6320 Patients.” PLoS One 15 (8): e0238160. https://doi.org/10.1371/journal.pone.0238160.

Figliozzi, S., P. G. Masci, N. Ahmadi, L. Tondi, E. Koutli, A. Aimo, K. Stamatelopoulos, M. Dimopoulos, A. L. P. Caforio, and G. Georgiopoulos. 2020. “Predictors of Adverse Prognosis in COVID-19: A Systematic Review and Meta-Analysis.” European Journal of Clinical Investigation 50 (10): e13362. https://doi.org/10.1111/eci.13362.

Fletcher, R. H., and S. W. Fletcher. 2014. Clinical Epidemiology: The Essentials, 5th ed., 272. Philadelphia: Lippincott Williams & Wilkins.

Friedman, J., T. Hastie, and R. Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.

García-Donato, G., M. E. Castellanos, and A. Quirós. 2021. “Bayesian Variable Selection with Applications in Health Sciences.” Mathematics 9 (3): 218. https://doi.org/10.3390/math9030218.

Goldstein, B. A., A. M. Navar, and R. E. Carter. 2017. “Moving beyond Regression Techniques in Cardiovascular Risk Prediction: Applying Machine Learning to Address Analytic Challenges.” European Heart Journal 38 (23): 1805–14. https://doi.org/10.1093/eurheartj/ehw302.

Guyon, I., and A. Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3: 1157–82.

Izcovich, A., M. A. Ragusa, F. Tortosa, M. A. L. Marzio, C. Agnoletti, A. Bengolea, A. Ceirano, F. Espinosa, E. Saavedra, V. Sanguine, A. Tassara, C. Cid, H. N. Catalano, A. Agarwal, F. Foroutan, and G. Rada. 2020. “Prognostic Factors for Severity and Mortality in Patients Infected with COVID-19: A Systematic Review.” PLoS One 15 (11): e0241955. https://doi.org/10.1371/journal.pone.0241955.

Kuhn, M. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.

Kuhn, M., and K. Johnson. 2013. Applied Predictive Modeling. New York: Springer-Verlag. https://www.springer.com/gp/book/9781461468486 (accessed January 15, 2021). https://doi.org/10.1007/978-1-4614-6849-3.

Kursa, M. B., and W. R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13. https://doi.org/10.18637/jss.v036.i11.

Li, J., K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. 2017. “Feature Selection: A Data Perspective.” ACM Computing Surveys 50 (6): 1–45. https://doi.org/10.1145/3136625.

Liang, W., H. Liang, L. Ou, B. Chen, A. Chen, C. Li, Y. Li, W. Guan, L. Sang, J. Lu, Y. Xu, G. Chen, H. Guo, J. Guo, Z. Chen, Y. Zhao, S. Li, N. Zhang, N. Zhong, and J. He. 2020. “Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients with COVID-19.” JAMA Internal Medicine 180 (8): 1081–9. https://doi.org/10.1001/jamainternmed.2020.2033.

Nilsson, A., C. Bonander, U. Strömberg, and J. Björk. 2019. “Assessing Heterogeneous Effects and Their Determinants via Estimation of Potential Outcomes.” European Journal of Epidemiology 34 (9): 823–35. https://doi.org/10.1007/s10654-019-00551-0.

Noor, F. M., and M. M. Islam. 2020. “Prevalence and Associated Risk Factors of Mortality Among COVID-19 Patients: A Meta-Analysis.” Journal of Community Health 45 (6): 1270–82. https://doi.org/10.1007/s10900-020-00920-x.

Núñez, E., E. W. Steyerberg, and J. Núñez. 2011. “Regression Modeling Strategies.” Revista Espanola de Cardiologia 64 (6): 501–7. https://doi.org/10.1016/j.rec.2011.01.017.

Pepe, M. S., H. Janes, G. Longton, W. Leisenring, and P. Newcomb. 2004. “Limitations of the Odds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker.” American Journal of Epidemiology 159 (9): 882–90. https://doi.org/10.1093/aje/kwh101.

Rasheed, J., A. Jamil, A. A. Hameed, F. Al-Turjman, and A. Rasheed. 2021. “COVID-19 in the Age of Artificial Intelligence: A Comprehensive Review.” Interdisciplinary Sciences: Computational Life Sciences 13 (2): 153–75. https://doi.org/10.1007/s12539-021-00431-w.

Rod, J. E., O. Oviedo-Trespalacios, and J. Cortes-Ramirez. 2020. “A Brief-Review of the Risk Factors for Covid-19 Severity.” Revista de Saúde Pública 54: 60. https://doi.org/10.11606/s1518-8787.2020054002481.

Scrucca, L. 2013. “GA: A Package for Genetic Algorithms in R.” Journal of Statistical Software 53 (4): 1–37. https://doi.org/10.18637/jss.v053.i04.

Taneri, P. E., S. A. Gómez-Ochoa, E. Llanaj, P. F. Raguindin, L. Z. Rojas, Z. M. Roa-Díaz, D. Salvador, D. Groothof, B. Minder, D. Kopp-Heim, W. E. Hautz, M. F. Eisenga, O. H. Franco, M. Glisic, and T. Muka. 2020. “Anemia and Iron Metabolism in COVID-19: A Systematic Review and Meta-Analysis.” European Journal of Epidemiology 35 (8): 763–73. https://doi.org/10.1007/s10654-020-00678-5.

Taylor, J., and R. J. Tibshirani. 2015. “Statistical Learning and Selective Inference.” Proceedings of the National Academy of Sciences 112 (25): 7629–34. https://doi.org/10.1073/pnas.1507583112.

van Halem, K., R. Bruyndonckx, J. van der Hilst, J. Cox, P. Driesen, M. Opsomer, E. Van Steenkiste, B. Stessel, J. Dubois, and P. Messiaen. 2020. “Risk Factors for Mortality in Hospitalized Patients with COVID-19 at the Start of the Pandemic in Belgium: A Retrospective Cohort Study.” BMC Infectious Diseases 20 (1): 897. https://doi.org/10.1186/s12879-020-05605-3.

Wang, L., Y. Zhang, D. Wang, X. Tong, T. Liu, S. Zhang, J. Huang, L. Chen, H. Fan, and M. Clarke. 2021. “Artificial Intelligence for COVID-19: A Systematic Review.” Frontiers of Medicine 8: 704256. https://doi.org/10.3389/fmed.2021.704256.

Xu, P. P., R. H. Tian, S. Luo, Z. Y. Zu, B. Fan, X. M. Wang, K. Xu, J. T. Wang, J. Zhu, J. C. Shi, F. Chen, Z. H. Yan, R. P. Wang, W. Chen, W. H. Fan, C. Zhang, M. J. Lu, Z. Y. Sun, C. S. Zhou, L. N. Zhang, F. Xia, L. Qi, W. Zhang, J. Zhong, X. X. Liu, Q. R. Zhang, G. M. Lu, and L. J. Zhang. 2020. “Risk Factors for Adverse Clinical Outcomes with COVID-19 in China: A Multicenter, Retrospective, Observational Study.” Theranostics 10 (14): 6372–83. https://doi.org/10.7150/thno.46833.

Yang, J., C. Tian, Y. Chen, C. Zhu, H. Chi, and J. Li. 2021. “Obesity Aggravates COVID-19: An Updated Systematic Review and Meta-Analysis.” Journal of Medical Virology 93 (5): 2662–74. https://doi.org/10.1002/jmv.26677.

Received: 2022-08-26

Accepted: 2023-03-21

Published Online: 2023-04-17

https://doi.org/10.1515/em-2022-0132

Keywords for this article

COVID-19; feature selection; machine learning; prognostic; risk factors