Comparisons of classification methods for viral genomes and protein families using alignment-free vectorization

Hsin-Hsiung Huang; Shuai Hao; Saul Alarcon; Jie Yang

doi:10.1515/sagmb-2018-0004

Published/Copyright: June 30, 2018

From the journal Statistical Applications in Genetics and Molecular Biology Volume 17 Issue 4

Abstract

In this paper, we propose a statistical classification method based on discriminant analysis that uses the first and second moments of the positions of each nucleotide in a genome sequence as features, and we compare its performance with that of other classification methods as well as with the natural vector for comparative genomic analysis. We also examine the normality of the proposed features. The statistical classification models considered include linear discriminant analysis, quadratic discriminant analysis, diagonal linear discriminant analysis, the k-nearest-neighbor classifier, logistic regression, support vector machines, and classification trees. All of these classifiers are tested on a viral genome dataset and a protein dataset to predict viral Baltimore labels, viral family labels, and protein family labels.

Keywords: viral genomes; protein; family labels; Natural Vector; statistical classification models
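
As an illustration of the kind of alignment-free feature vector the abstract describes, the following minimal sketch computes, for each nucleotide, its count, the mean of its positions (first moment), and the second central moment of its positions. The function name `position_moment_features`, the inclusion of counts, the 1-based indexing, and the plain variance normalization are assumptions made for illustration; the abstract does not specify the exact definition or scaling used in the paper.

```python
from collections import defaultdict


def position_moment_features(seq, alphabet="ACGT"):
    """Hypothetical helper: per-letter count, mean position, and
    second central moment of positions (1-based indexing assumed)."""
    positions = defaultdict(list)
    for i, ch in enumerate(seq.upper(), start=1):
        if ch in alphabet:  # letters outside the alphabet (e.g. N) are skipped
            positions[ch].append(i)

    features = []
    for letter in alphabet:
        pos = positions[letter]
        n = len(pos)
        if n == 0:
            # Assumed convention: an absent letter contributes zeros.
            features.extend([0, 0.0, 0.0])
            continue
        mean = sum(pos) / n                              # first moment
        second = sum((p - mean) ** 2 for p in pos) / n   # second central moment
        features.extend([n, mean, second])
    return features


if __name__ == "__main__":
    print(position_moment_features("ACGTA"))
```

Passing a 20-letter amino-acid alphabet as `alphabet` would give an analogous vector for protein sequences. The resulting fixed-length vectors could then serve as input to any of the classifiers listed in the abstract.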

Published Online: 2018-06-30

https://doi.org/10.1515/sagmb-2018-0004
