Abstract
Gene selection is a key step in gene expression data analysis. This paper proposes an SVM-based ensemble feature selection method. First, the method builds many subsets by Monte Carlo sampling. Second, it ranks all the features on each subset and integrates the rankings into a final ranking list. Finally, the optimum feature set is determined by a backward feature elimination strategy. The method is applied to four public datasets (Leukemia, Prostate, Colorectal, and SMK_CAN), yielding 7, 10, 13, and 32 selected features, respectively. The AUC values obtained on the independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and that the method is therefore effective for gene selection from gene expression data.
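The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the Monte Carlo subsets are taken to be random subsets of samples, each subset is ranked by the absolute weights of a linear SVM (trained here with a simple Pegasos-style subgradient routine rather than LIBSVM), rankings are integrated by summing rank positions, and backward elimination is scored by accuracy on a held-out validation split. All function names, parameter values, and the synthetic data are hypothetical.

```python
import random

def train_linear_svm(X, y, lam=0.1, epochs=30, seed=0):
    """Pegasos-style subgradient descent; labels in {-1, +1}, no bias term (assumption)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, t = [0.0] * d, 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one shuffled pass over the samples
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrinkage
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def take(X, feats):
    """Keep only the columns listed in feats, in that order."""
    return [[row[j] for j in feats] for row in X]

def accuracy(w, X, y):
    hits = sum((sum(wj * xj for wj, xj in zip(w, row)) >= 0) == (yi > 0)
               for row, yi in zip(X, y))
    return hits / len(y)

def ensemble_ranking(X, y, n_subsets=30, frac=0.8, seed=0):
    """Steps 1 and 2: rank features on random sample subsets, then sum the ranks."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(2, int(frac * n))
    rank_sum = [0] * d
    for b in range(n_subsets):
        idx = rng.sample(range(n), k)  # Monte Carlo subset of samples (assumption)
        if len({y[i] for i in idx}) < 2:
            continue  # skip a subset that contains a single class
        w = train_linear_svm([X[i] for i in idx], [y[i] for i in idx], seed=b)
        order = sorted(range(d), key=lambda j: -abs(w[j]))
        for rank, j in enumerate(order):
            rank_sum[j] += rank
    return sorted(range(d), key=lambda j: rank_sum[j])  # best feature first

def backward_elimination(X_tr, y_tr, X_val, y_val, ranking):
    """Step 3: drop the lowest-ranked feature one at a time, keep the best subset."""
    feats = list(ranking)
    best_feats, best_acc = feats[:], -1.0
    while feats:
        w = train_linear_svm(take(X_tr, feats), y_tr)
        acc = accuracy(w, take(X_val, feats), y_val)
        if acc >= best_acc:  # on ties, prefer the smaller feature set
            best_feats, best_acc = feats[:], acc
        feats.pop()  # remove the currently lowest-ranked feature
    return best_feats, best_acc

# Synthetic data for illustration only: 3 informative features out of 10.
rng = random.Random(42)
n, d, informative = 80, 10, {0, 1, 2}
y = [1 if i % 2 == 0 else -1 for i in range(n)]
X = [[(1.5 * yi if j in informative else 0.0) + rng.gauss(0, 1) for j in range(d)]
     for yi in y]
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

ranking = ensemble_ranking(X_tr, y_tr)
selected, acc = backward_elimination(X_tr, y_tr, X_val, y_val, ranking)
print("ranking:", ranking)
print("selected:", selected, "validation accuracy:", acc)
```

On real expression data, the selected features would be assessed on an independent test set, as the abstract does with AUC. The validation accuracy here only stands in for whatever criterion the paper uses to pick the optimum set.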
Funding source: Qinghai Provincial Natural Science Fund
Award Identifier / Grant number: 2022-ZJ-769
- Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
- Research funding: This work was supported by Qinghai Provincial Natural Science Fund (No. 2022-ZJ-769) of China.
- Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
© 2022 Walter de Gruyter GmbH, Berlin/Boston