Abstract
Cancer classification and gene selection are important applications of DNA microarray gene expression data analysis. Because DNA microarray data suffer from high dimensionality, automatic gene selection methods are used to enhance the classification performance of expert classifier systems. In this paper, a new penalized logistic regression method that performs simultaneous gene coefficient estimation and variable selection in DNA microarray data is discussed. The method exploits prior information about the gene coefficients to improve the classification accuracy of the underlying model. A coordinate descent algorithm with screening rules is given to compute the gene coefficient estimates of the proposed method efficiently. The performance of the method is examined on five high-dimensional cancer classification datasets using the area under the curve, the number of selected genes, the misclassification rate, and the F-score. The real data analysis results indicate that the proposed method achieves good cancer classification performance, with a small misclassification rate and a large area under the curve and F-score, at the cost of some sparsity of the underlying model. Hence, the proposed method can be regarded as a reliable penalized logistic regression method for high-dimensional cancer classification.
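The abstract refers to fitting penalized logistic regression by coordinate descent. The paper's estimator additionally incorporates prior information about the gene coefficients, which is not reproduced here; as an illustration of the general machinery only, the following is a minimal sketch of coordinate descent for plain L1-penalized logistic regression, using the quadratic approximation of Friedman et al. (2010) with the working weights fixed at the upper bound 0.25 (a common simplification). Feature standardization and a Bernoulli response in {0, 1} are assumed.

```python
import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0)
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def cd_logistic_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for L1-penalized logistic regression (sketch).

    At each outer iteration the log-likelihood is replaced by a weighted
    least-squares surrogate built from a working response z; each
    coefficient is then updated by soft thresholding. The weights are
    fixed at 0.25, the upper bound on the Bernoulli variance p(1 - p).
    The prior-information penalty of the paper is NOT implemented here.
    """
    n, p = X.shape
    beta = np.zeros(p)
    b0 = 0.0
    w = 0.25
    for _ in range(n_iter):
        eta = b0 + X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        z = eta + (y - mu) / w            # working response
        b0 = np.mean(z - X @ beta)        # unpenalized intercept
        for j in range(p):
            # partial residual excluding coordinate j
            r_j = z - b0 - X @ beta + X[:, j] * beta[j]
            num = soft_threshold(w * (X[:, j] @ r_j) / n, lam)
            beta[j] = num / (w * (X[:, j] @ X[:, j]) / n)
    return b0, beta
```

With a large penalty every coefficient is thresholded to zero, which is the mechanism the screening rules cited in the abstract exploit: coordinates whose gradient cannot exceed the threshold are discarded before the inner loop is ever run.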
Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
© 2022 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Research Articles
- Survival analysis using deep learning with medical imaging
- Using a population-based Kalman estimator to model the COVID-19 epidemic in France: estimating associations between disease transmission and non-pharmaceutical interventions
- Approximate reciprocal relationship between two cause-specific hazard ratios in COVID-19 data with mutually exclusive events
- Sensitivity of estimands in clinical trials with imperfect compliance
- Highly robust causal semiparametric U-statistic with applications in biomedical studies
- Hierarchical Bayesian bootstrap for heterogeneous treatment effect estimation
- Penalized logistic regression with prior information for microarray gene expression classification
- Bayesian learners in gradient boosting for linear mixed models
- Unequal allocation of sample/event sizes with considerations of sampling cost for testing equality, non-inferiority/superiority, and equivalence of two Poisson rates
- HiPerMAb: a tool for judging the potential of small sample size biomarker pilot studies
- Heterogeneity in meta-analysis: a comprehensive overview
- On stochastic dynamic modeling of incidence data
- Power of testing for exposure effects under incomplete mediation
- Exact correction factor for estimating the OR in the presence of sparse data with a zero cell in 2 × 2 tables
- Right-censored partially linear regression model with error in variables: application with carotid endarterectomy dataset
- Assessing HIV-infected patient retention in a program of differentiated care in sub-Saharan Africa: a G-estimation approach
- Prediction-based variable selection for component-wise gradient boosting