Abstract
A novel non-parametric method for mutual information estimation is presented. The method is suited for informative feature selection in classification and regression problems. The performance of the method is demonstrated on the problem of classifying stable short peptides.
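As an illustration only (not the estimator proposed in this work), the following sketch shows how a simple non-parametric mutual information score, here a plug-in histogram estimate computed with NumPy, can be used to rank features by their relevance to a class label; the function names and toy data are hypothetical.

```python
# Illustration only: a simple plug-in (histogram) mutual information score
# used to rank features for classification. This is NOT the non-parametric
# estimator proposed in the paper; names and toy data are hypothetical.
import numpy as np

def mi_feature_label(x, y, n_bins=10):
    """Plug-in estimate of I(X; Y) between a continuous feature x and a
    discrete class label y, obtained by equal-width binning of x."""
    edges = np.histogram_bin_edges(x, bins=n_bins)
    x_bin = np.digitize(x, edges)                 # bin index per sample
    classes = {c: j for j, c in enumerate(np.unique(y))}
    joint = np.zeros((n_bins + 2, len(classes)))  # joint frequency table
    for xb, yc in zip(x_bin, y):
        joint[xb, classes[yc]] += 1
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                                  # avoid log(0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

# Toy ranking: the first feature carries class information, the second is noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = np.column_stack([y + 0.5 * rng.normal(size=500),
                     rng.normal(size=500)])
scores = [mi_feature_label(X[:, j], y) for j in range(X.shape[1])]
print(scores)   # the informative feature should receive the larger score
```

In practice, histogram estimates are sensitive to the choice of bins; nearest-neighbour estimators are usually preferred for continuous features, which is part of the motivation for the estimator studied in the paper.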
Funding statement: The work was supported by the RFBR (project No. 20–04–01085).
Appendix A. Nonparametric estimation of mutual information
Substituting representation (1.5) into the functional Je(ŵ, λ), we get
The first summand is transformed to the form
where
The second summand is transformed to the form
where
We calculate the last summand:
The calculation uses the property of the scalar product in the Hilbert space with the reproducing kernel K(z, t), namely ⟨K(z, u), K(t, u)⟩ = K(z, t). Denoting by K the matrix with the elements K_ij = K(x_i, y_i, x_j, y_j), we finally obtain the expression
The minimum of the latter functional is attained at the vector
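The displayed equations of this appendix, including the closed-form minimizing vector, are not reproduced above. As a purely illustrative sketch, assume (this is an assumption, not the paper's functional) that after the kernel expansion Je reduces to the quadratic form J(α) = ||Kα − b||²/n + λ αᵀKα for some data vector b; its minimizer then solves the linear system (KᵀK/n + λK)α = Kᵀb/n. The NumPy code below builds the Gram matrix K and solves this system; the kernel and the toy data are placeholders. Whatever the exact form of Je, the point of the reduction is the same: the reproducing property turns the minimization over the Hilbert space into a finite-dimensional problem in the coefficient vector.

```python
# Illustrative sketch only: the paper's functional Je(w, lambda) and its
# closed-form minimizer are not reproduced here. We ASSUME a generic
# regularized quadratic objective
#     J(alpha) = ||K @ alpha - b||^2 / n + lam * alpha.T @ K @ alpha,
# whose minimizer solves (K.T @ K / n + lam * K) @ alpha = K.T @ b / n.
import numpy as np

def gram_matrix(Z, kernel):
    """Gram matrix with entries K[i, j] = kernel(Z[i], Z[j]) over the
    joint sample points z_i = (x_i, y_i)."""
    n = len(Z)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(Z[i], Z[j])
    return K

def rbf(z1, z2, gamma=1.0):
    # Placeholder reproducing kernel K(z, t) on the joint argument (x, y).
    return np.exp(-gamma * np.sum((np.asarray(z1) - np.asarray(z2)) ** 2))

def solve_coefficients(Z, b, lam=1e-2, kernel=rbf):
    """Coefficient vector alpha minimizing the assumed quadratic objective."""
    n = len(Z)
    K = gram_matrix(Z, kernel)
    A = K.T @ K / n + lam * K
    rhs = K.T @ b / n
    # A small jitter keeps the solve stable when K is nearly singular.
    return np.linalg.solve(A + 1e-10 * np.eye(n), rhs)

# Toy usage with placeholder joint observations (x_i, y_i) and target vector b.
rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 2))   # each row is a joint point (x_i, y_i)
b = np.ones(50)                # placeholder right-hand side
alpha = solve_coefficients(Z, b)
print(alpha[:5])
```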
© 2020 Walter de Gruyter GmbH, Berlin/Boston