Abstract
The transformer model for sequence mining has brought a paradigmatic shift to many domains, including biological sequence mining. However, transformers suffer from quadratic complexity, i.e., O(l²), where l is the sequence length, which increases both training and prediction time. Therefore, the work herein introduces a simple, generalized, and fast transformer architecture for improved protein function prediction (PFP). The proposed architecture uses a combination of CNN and global-average pooling to effectively shorten the protein sequences. The shortening process reduces the quadratic complexity of the transformer, resulting in a complexity of O((l/2)²). This architecture is utilized to develop a PFP solution at the sub-sequence level. Furthermore, focal loss is employed to ensure balanced training on hard-to-classify examples. The proposed multi-sub-sequence-based solution utilizing an average-pooling layer (with stride = 2) produced improvements of +2.50 % (BP) and +3.00 % (MF) when compared to Global-ProtEnc Plus. The corresponding improvements when compared to Lite-SeqCNN are +4.50 % (BP) and +2.30 % (MF).
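The two quantitative claims above — that stride-2 pooling roughly halves the sequence length, cutting the quadratic attention cost by about 4×, and that focal loss down-weights easy examples — can be sketched in plain Python. This is an illustrative sketch only: the function names are ours, not from the paper, and the focal loss follows the standard formulation of Lin et al. (2017) with the usual α and γ defaults.

```python
import math

def pooled_length(l: int, stride: int = 2) -> int:
    # Average pooling with the given stride shortens a length-l
    # sequence to roughly l/stride tokens (ceiling for odd lengths).
    return math.ceil(l / stride)

def attention_cost_ratio(l: int, stride: int = 2) -> float:
    # Self-attention cost scales with the square of sequence length,
    # so halving the length cuts the quadratic term to ~1/4.
    return (pooled_length(l, stride) ** 2) / (l ** 2)

def focal_loss(p: float, gamma: float = 2.0, alpha: float = 0.25) -> float:
    # Focal loss for a positive example predicted with probability p:
    # the (1 - p)^gamma factor shrinks the loss of well-classified
    # (easy) examples, focusing training on the hard ones.
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

For example, `attention_cost_ratio(1000)` is 0.25, and a confidently correct prediction (p = 0.9) contributes far less focal loss than a borderline one (p = 0.5).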
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Alshahrani, M., Khan, M.A., Maddouri, O., Kinjo, A.R., Queralt-Rosinach, N., and Hoehndorf, R. (2017). Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33: 2723–2730, https://doi.org/10.1093/bioinformatics/btx275.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. Nat. Genet. 25: 25–29, https://doi.org/10.1038/75556.
Boadu, F., Cao, H., and Cheng, J. (2023). Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 39(Suppl. 1): i318–i325, https://doi.org/10.1093/bioinformatics/btad208.
Cai, Y., Wang, J., and Deng, L. (2020). SDN2GO: an integrated deep learning model for protein function prediction. Front. Bioeng. Biotechnol. 8: 391, https://doi.org/10.3389/fbioe.2020.00391.
Cao, Y. and Shen, Y. (2021). TALE: transformer-based protein function annotation with joint sequence–label embedding. Bioinformatics 37: 2825–2833, https://doi.org/10.1093/bioinformatics/btab198.
Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. (2017). ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22, https://doi.org/10.3390/molecules22101732.
Clark, W.T. and Radivojac, P. (2011). Analysis of protein function and its prediction from amino acid sequence. Proteins: Struct., Funct., Bioinf. 79: 2086–2096, https://doi.org/10.1002/prot.23029.
Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. (2021). ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44: 7112–7127, https://doi.org/10.1109/tpami.2021.3095381.
Giri, S.J., Dutta, P., Halani, P., and Saha, S. (2020). MultiPredGO: deep multi-modal protein function prediction by amalgamating protein structure, sequence, and interaction information. IEEE J. Biomed. Health Inform. 25: 1832–1838, https://doi.org/10.1109/jbhi.2020.3022806.
Huang, Y., Niu, B., Gao, Y., Fu, L., and Li, W. (2010). CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics 26: 680–682, https://doi.org/10.1093/bioinformatics/btq003.
Kabir, A. and Shehu, A. (2022). GOProFormer: a multi-modal transformer method for gene ontology protein function prediction. Biomolecules 12: 1709, https://doi.org/10.3390/biom12111709.
Kim, G.B., Kim, J.Y., Lee, J.A., Norsigian, C.J., Palsson, B.O., and Lee, S.Y. (2023). Functional annotation of enzyme-encoding genes using deep learning with transformer layers. Nat. Commun. 14: 7370, https://doi.org/10.1038/s41467-023-43216-z.
Kingma, D.P. and Ba, J. (2014). Adam: a method for stochastic optimization. In: Proc. 3rd int. conf. learn. representations, 2015, pp. 1–11.
Kulmanov, M. and Robert, H. (2020). DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 36: 422–429, https://doi.org/10.1093/bioinformatics/btz595.
Kulmanov, M., Khan, M.A., and Hoehndorf, R. (2018). DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34: 660–668, https://doi.org/10.1093/bioinformatics/btx624.
Kumar, V., Deepak, A., Ranjan, A., and Prakash, A. (2023). Lite-SeqCNN: a light-weight deep CNN architecture for protein function prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 2242–2253, https://doi.org/10.1109/TCBB.2023.3240169.
Kumar, V., Deepak, A., Ranjan, A., and Prakash, A. (2024a). CrossPredGO: a novel light-weight cross-modal multi-attention framework for protein function prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 21: 1709–1720, https://doi.org/10.1109/TCBB.2024.3410696.
Kumar, V., Deepak, A., Ranjan, A., and Prakash, A. (2024b). Bi-SeqCNN: a novel light-weight bi-directional CNN architecture for protein function prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 21: 1922–1933, https://doi.org/10.1109/tcbb.2024.3426491.
Li, M., Shi, W., Zhang, F., Zeng, M., and Li, Y. (2022). A deep learning framework for predicting protein functions with co-occurrence of GO terms. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 833–842, https://doi.org/10.1109/tcbb.2022.3170719.
Lin, T.Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, https://doi.org/10.1109/ICCV.2017.324.
Oliveira, G.B., Pedrini, H., and Dias, Z. (2023). TEMPROT: protein function annotation using transformers embeddings and homology search. BMC Bioinf. 24: 242, https://doi.org/10.1186/s12859-023-05375-0.
Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 701–710, https://doi.org/10.1145/2623330.2623732.
Rahardja, S., Wang, M., Nguyen, B.P., Fränti, P., and Rahardja, S. (2022). A lightweight classification of adaptor proteins using transformer networks. BMC Bioinf. 23: 461, https://doi.org/10.1186/s12859-022-05000-6.
Ranjan, A., Fahad, M.S., Fernández-Baca, D., Deepak, A., and Tripathi, S. (2019). Deep robust framework for protein function prediction using variable-length protein sequences. IEEE ACM Trans. Comput. Biol. Bioinf. 17: 1648–1659, https://doi.org/10.1109/TCBB.2019.2911609.
Ranjan, A., Fahad, M.S., Fernández-Baca, D., Tripathi, S., and Deepak, A. (2022). MCWS-transformers: towards an efficient modeling of protein sequences via multi context-window based scaled self-attention. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 1188–1199, https://doi.org/10.1109/TCBB.2022.3173789.
Ranjan, A., Tiwari, A., and Deepak, A. (2021). A sub-sequence based approach to protein function prediction via multi-attention based multi-aspect network. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 94–105, https://doi.org/10.1109/tcbb.2021.3130923.
Rentzsch, R. and Orengo, C.A. (2013). Protein function prediction using domain families. BMC Bioinf. 14: 1–14, https://doi.org/10.1186/1471-2105-14-s3-s5.
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 118: e2016239118, https://doi.org/10.1073/pnas.2016239118.
Sharma, L., Deepak, A., Ranjan, A., and Krishnasamy, G. (2023). A novel hybrid CNN and BiGRU-attention based deep learning model for protein function prediction. Stat. Appl. Genet. Mol. Biol. 22: 20220057, https://doi.org/10.1515/sagmb-2022-0057.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15: 1929–1958.
UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Res. 43: D204–D212, https://doi.org/10.1093/nar/gku989.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In: Advances in neural information processing systems, pp. 5998–6008.
Wang, H., Yan, L., Huang, H., and Ding, C. (2016). From protein sequence to protein function via multi-label linear discriminant analysis. IEEE ACM Trans. Comput. Biol. Bioinf. 14: 503–513, https://doi.org/10.1109/tcbb.2016.2591529.
Wenzel, M., Grüner, E., and Strodthoff, N. (2024). Insights into the inner workings of transformer models for protein function prediction. Bioinformatics 40: btae031, https://doi.org/10.1093/bioinformatics/btae031.
Zhang, X., Guo, H., Zhang, F., Wang, X., Wu, K., Qiu, S., Liu, B., Wang, Y., Hu, Y., and Li, J. (2023). HNetGO: protein function prediction via heterogeneous network transformer. Briefings Bioinf. 24: bbab556, https://doi.org/10.1093/bib/bbab556.
Zhang, F., Song, H., Zeng, M., Li, Y., Kurgan, L., and Li, M. (2019). DeepFunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions. Proteomics 19: 1900019, https://doi.org/10.1002/pmic.201900019.
Zhang, F., Song, H., Zeng, M., Wu, F.X., Li, Y., Pan, Y., and Li, M. (2020). A deep learning framework for gene ontology annotations with sequence- and network-based information. IEEE ACM Trans. Comput. Biol. Bioinf. 18: 2208–2217, https://doi.org/10.1109/tcbb.2020.2968882.
Zhang, X., Wang, L., Liu, H., Zhang, X., Liu, B., Wang, Y., and Li, J. (2021). Prot2GO: predicting GO annotations from protein sequences and interactions. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 2772–2780, https://doi.org/10.1109/tcbb.2021.3139841.
© 2025 Walter de Gruyter GmbH, Berlin/Boston