Application of mutual information estimation for predicting the structural stability of pentapeptides

A. I. Mikhalskii, I. V. Petrov, V. V. Tsurko, A. A. Anashkina, and A. N. Nekrasov
Published/Copyright: October 30, 2020

Abstract

A novel non-parametric method for mutual information estimation is presented. The method is suited for informative feature selection in classification and regression problems. Its performance is demonstrated on the problem of classifying stable short peptides.

MSC 2010: 92-04; 62-07; 62G07

Funding statement: The work was supported by the RFBR (project No. 20–04–01085).

Appendix A. Nonparametric estimation of mutual information

Substituting representation (1.5) into the functional $J_e(\hat{w}, \lambda)$, we get

$$
J_e(\hat{w},\lambda)=\frac{1}{2n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\sum_{l=1}^{n}\alpha_{l}K(x_{i},y_{j},x_{l},y_{l})\right)^{2}-\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{n}\alpha_{l}K(x_{i},y_{i},x_{l},y_{l})+\frac{\lambda}{2}\left\|\sum_{l=1}^{n}\alpha_{l}K(\cdot,\cdot,x_{l},y_{l})\right\|_{L_{2}}^{2}+C.
$$

The first summand is transformed to the form

$$
\frac{1}{2n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\sum_{l=1}^{n}\alpha_{l}K(x_{i},y_{j},x_{l},y_{l})\right)^{2}
=\frac{1}{2n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{n}\sum_{m=1}^{n}\alpha_{l}K(x_{i},y_{j},x_{l},y_{l})\,\alpha_{m}K(x_{i},y_{j},x_{m},y_{m})
=\frac{1}{2n^{2}}\sum_{l=1}^{n}\sum_{m=1}^{n}\alpha_{l}\alpha_{m}\sum_{i=1}^{n}\sum_{j=1}^{n}K(x_{i},y_{j},x_{l},y_{l})K(x_{i},y_{j},x_{m},y_{m})
=\frac{1}{2}\sum_{l=1}^{n}\sum_{m=1}^{n}\alpha_{l}\alpha_{m}H_{lm}
$$

where $H_{lm}=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}K(x_{i},y_{j},x_{l},y_{l})K(x_{i},y_{j},x_{m},y_{m})$.
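
To make the construction concrete, here is a minimal numerical sketch of assembling the matrix $H$, assuming one-dimensional samples and an illustrative Gaussian kernel $K(x,y,x',y')=\exp\bigl(-((x-x')^{2}+(y-y')^{2})/(2\sigma^{2})\bigr)$. The kernel actually used in the article is not restated in this appendix, and the function name `build_H` and the bandwidth `sigma` are our own choices.

```python
import numpy as np

def gauss_kernel(x, y, xl, yl, sigma=1.0):
    # Illustrative Gaussian kernel on the pair (x, y); an assumption,
    # since the appendix does not restate the article's kernel.
    return np.exp(-((x - xl) ** 2 + (y - yl) ** 2) / (2.0 * sigma ** 2))

def build_H(x, y, sigma=1.0):
    """H_lm = (1/n^2) sum_{i,j} K(x_i, y_j, x_l, y_l) K(x_i, y_j, x_m, y_m)."""
    n = len(x)
    # G[i, j, l] = K(x_i, y_j, x_l, y_l), evaluated on the full product grid;
    # note the grid pairs every x_i with every y_j, not only the observed pairs
    G = gauss_kernel(x[:, None, None], y[None, :, None],
                     x[None, None, :], y[None, None, :], sigma)
    # Contract over the n^2 grid points (i, j) to obtain the n x n matrix H
    return np.einsum('ijl,ijm->lm', G, G) / n ** 2
```

The intermediate tensor requires $O(n^{3})$ memory, which is acceptable for moderate sample sizes; for larger $n$ the double sum over $(i,j)$ can be accumulated in chunks.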

The second summand is transformed to the form

$$
\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{n}\alpha_{l}K(x_{i},y_{i},x_{l},y_{l})=\frac{1}{n}\sum_{l=1}^{n}\alpha_{l}\sum_{i=1}^{n}K(x_{i},y_{i},x_{l},y_{l})=\sum_{l=1}^{n}\alpha_{l}h_{l}
$$

where $h_{l}=\frac{1}{n}\sum_{i=1}^{n}K(x_{i},y_{i},x_{l},y_{l})$.
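
A matching sketch for the vector $h$, under the same assumed Gaussian kernel; in contrast to $H$, only the observed pairs $(x_i, y_i)$ enter. The name `build_h` is again our illustrative choice.

```python
import numpy as np

def build_h(x, y, sigma=1.0):
    """h_l = (1/n) sum_i K(x_i, y_i, x_l, y_l)."""
    # D[i, l] = K(x_i, y_i, x_l, y_l): the kernel between observed pairs only
    D = np.exp(-((x[:, None] - x[None, :]) ** 2
                 + (y[:, None] - y[None, :]) ** 2) / (2.0 * sigma ** 2))
    return D.mean(axis=0)  # average over i for each l
```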

Finally, we calculate the last summand:

$$
\frac{\lambda}{2}\left\|\sum_{l=1}^{n}\alpha_{l}K(\cdot,\cdot,x_{l},y_{l})\right\|_{L_{2}}^{2}
=\frac{\lambda}{2}\left\langle\sum_{l=1}^{n}\alpha_{l}K(\cdot,\cdot,x_{l},y_{l}),\sum_{m=1}^{n}\alpha_{m}K(\cdot,\cdot,x_{m},y_{m})\right\rangle
=\frac{\lambda}{2}\sum_{l=1}^{n}\sum_{m=1}^{n}\alpha_{l}\alpha_{m}\bigl\langle K(\cdot,\cdot,x_{l},y_{l}),K(\cdot,\cdot,x_{m},y_{m})\bigr\rangle
=\frac{\lambda}{2}\sum_{l=1}^{n}\sum_{m=1}^{n}\alpha_{l}\alpha_{m}K(x_{l},y_{l},x_{m},y_{m}).
$$

The calculation uses the property of the scalar product in the Hilbert space with reproducing kernel $K(z,t)$, namely, $\langle K(z,u), K(t,u)\rangle = K(z,t)$. Denoting the matrix with elements $K_{ij} = K(x_i, y_i, x_j, y_j)$ by $K$, we finally obtain the expression

$$
J_e(\alpha,\lambda)=\frac{1}{2}\alpha^{T}H\alpha-\alpha^{T}h+\frac{\lambda}{2}\alpha^{T}K\alpha+C.
$$

The minimum of the latter functional is attained at the vector

$$
\alpha=(H+\lambda K)^{-1}h.
$$
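
In code, this final step is a single linear solve of $(H+\lambda K)\alpha = h$. The sketch below reuses `build_H` and `build_h` from the sketches above and keeps the same illustrative Gaussian kernel; the regularization weight `lam` and bandwidth `sigma` are assumed tuning parameters, not values from the article.

```python
import numpy as np

def fit_alpha(x, y, sigma=1.0, lam=0.1):
    # Gram matrix K_lm = K(x_l, y_l, x_m, y_m) on the observed pairs
    K = np.exp(-((x[:, None] - x[None, :]) ** 2
                 + (y[:, None] - y[None, :]) ** 2) / (2.0 * sigma ** 2))
    H = build_H(x, y, sigma)
    h = build_h(x, y, sigma)
    # alpha = (H + lam K)^{-1} h; a linear solve is preferred to an
    # explicit matrix inverse for numerical stability
    return np.linalg.solve(H + lam * K, h)

# Toy usage on dependent data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)
alpha = fit_alpha(x, y)
```
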
Received: 2019-10-18
Revised: 2020-07-09
Accepted: 2020-09-18
Published Online: 2020-10-30
Published in Print: 2020-10-27

© 2020 Walter de Gruyter GmbH, Berlin/Boston
