Use of SVM-based ensemble feature selection method for gene expression data analysis

  • Shizhi Zhang and Mingjin Zhang
Published/Copyright: July 14, 2022

Abstract

Gene selection is a key step in gene expression data analysis. This paper proposes an SVM-based ensemble feature selection method. First, the method builds many subsets by Monte Carlo sampling. Second, it ranks all the features on each subset and integrates these rankings into a final ranking list. Finally, it determines the optimum feature set by a backward feature elimination strategy. The method is applied to four public datasets (Leukemia, Prostate, Colorectal, and SMK_CAN), yielding 7, 10, 13, and 32 selected features, respectively. The AUC values obtained on the independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and that the method is therefore effective for gene selection from gene expression data.
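The three steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the subsets are drawn by subsampling samples, each subset is ranked by the absolute weights of a linear SVM, the rankings are combined by rank sum, and backward elimination is scored by AUC on a held-out validation split. The SVM is a small Pegasos-style subgradient solver without a bias term, standing in for a real SVM library. All function names and the toy data are hypothetical.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent for a linear SVM (no bias); labels are +1/-1."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, t = [0.0] * d, 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):          # one shuffled pass over the data
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]  # shrink (regularization step)
            if margin < 1:                          # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def ensemble_ranking(X, y, n_subsets=20, frac=0.8, seed=0):
    """Step 1 and 2: Monte Carlo subsets, per-subset |w| ranking, rank-sum aggregation."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    rank_sum = [0] * d
    for b in range(n_subsets):
        idx = rng.sample(range(n), max(2, int(frac * n)))   # assumed: subsample samples
        w = train_linear_svm([X[i] for i in idx], [y[i] for i in idx], seed=b)
        order = sorted(range(d), key=lambda j: -abs(w[j]))  # assumed criterion: |w_j|
        for pos, j in enumerate(order):
            rank_sum[j] += pos
    return sorted(range(d), key=lambda j: rank_sum[j])      # best feature first

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def backward_elimination(Xtr, ytr, Xva, yva, ranking):
    """Step 3: drop the lowest-ranked feature one at a time; keep the best-scoring set."""
    best_score, best_set = -1.0, list(ranking)
    for k in range(len(ranking), 0, -1):
        keep = ranking[:k]
        w = train_linear_svm([[x[j] for j in keep] for x in Xtr], ytr)
        scores = [sum(wj * x[j] for wj, j in zip(w, keep)) for x in Xva]
        s = auc(scores, yva)
        if s >= best_score:                        # ties favor the smaller set
            best_score, best_set = s, keep
    return best_set, best_score

def make_toy_data(n, d=10, seed=1):
    """Hypothetical data: features 0 and 1 carry the class signal, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [3 * label + rng.gauss(0, 0.3) for _ in range(2)]
        row += [rng.gauss(0, 1) for _ in range(d - 2)]
        X.append(row)
        y.append(label)
    return X, y

X, y = make_toy_data(60)
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]
ranking = ensemble_ranking(Xtr, ytr)
selected, score = backward_elimination(Xtr, ytr, Xva, yva, ranking)
print("final ranking:", ranking)
print("selected features:", selected, "validation AUC:", score)
```

On real expression data the features would normally be standardized first, and the choice of aggregation rule and elimination criterion would follow the full paper rather than the assumptions made here.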


Corresponding author: Mingjin Zhang, School of Chemistry and Chemical Engineering, Qinghai Normal University, Xining 810016, P.R. China, E-mail:

Funding source: Qinghai Provincial Natural Science Fund

Award Identifier / Grant number: 2022-ZJ-769

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This work was supported by the Qinghai Provincial Natural Science Fund of China (No. 2022-ZJ-769).

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Received: 2022-01-15
Revised: 2022-06-20
Accepted: 2022-07-01
Published Online: 2022-07-14

© 2022 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 8 October 2025 from https://www.degruyterbrill.com/document/doi/10.1515/sagmb-2022-0002/html