Abstract
The transformer model for sequence mining has brought a paradigmatic shift to many domains, including biological sequence mining. However, transformers suffer from quadratic complexity, i.e., O(l²), where l is the sequence length, which increases both training and prediction time. Therefore, the work herein introduces a simple, generalized, and fast transformer architecture for improved protein function prediction (PFP). The proposed architecture uses a combination of CNN and global-average pooling to effectively shorten the protein sequences. The shortening process reduces the quadratic complexity of the transformer, resulting in a complexity of O((l/2)²). This architecture is utilized to develop a PFP solution at the sub-sequence level. Furthermore, focal loss is employed to ensure balanced training on hard-to-classify examples. The proposed multi-sub-sequence-based solution utilizing an average-pooling layer (with stride = 2) produced improvements of +2.50 % (BP) and +3.00 % (MF) when compared to Global-ProtEnc Plus. The corresponding improvements when compared to Lite-SeqCNN are +4.50 % (BP) and +2.30 % (MF).
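The two quantitative claims above — that stride-2 pooling roughly halves the sequence length, cutting the quadratic attention cost by about 4×, and that focal loss down-weights easy examples — can be sketched in plain Python. This is an illustrative sketch only: the function names are ours, not from the paper, and the focal loss follows the standard formulation of Lin et al. (2017) with the usual α and γ defaults.

```python
import math

def pooled_length(l: int, stride: int = 2) -> int:
    # Average pooling with the given stride shortens a length-l
    # sequence to roughly l/stride tokens (ceiling for odd lengths).
    return math.ceil(l / stride)

def attention_cost_ratio(l: int, stride: int = 2) -> float:
    # Self-attention cost scales with the square of sequence length,
    # so halving the length cuts the quadratic term to ~1/4.
    return (pooled_length(l, stride) ** 2) / (l ** 2)

def focal_loss(p: float, gamma: float = 2.0, alpha: float = 0.25) -> float:
    # Focal loss for a positive example predicted with probability p:
    # the (1 - p)^gamma factor shrinks the loss of well-classified
    # (easy) examples, focusing training on the hard ones.
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

For example, `attention_cost_ratio(1000)` is 0.25, and a confidently correct prediction (p = 0.9) contributes far less focal loss than a borderline one (p = 0.5).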
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Alshahrani, M., Khan, M.A., Maddouri, O., Kinjo, A.R., Queralt-Rosinach, N., and Hoehndorf, R. (2017). Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33: 2723–2730, https://doi.org/10.1093/bioinformatics/btx275.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. Nat. Genet. 25: 25–29, https://doi.org/10.1038/75556.
Boadu, F., Cao, H., and Cheng, J. (2023). Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 39(Suppl. 1): i318–i325, https://doi.org/10.1093/bioinformatics/btad208.
Cai, Y., Wang, J., and Deng, L. (2020). SDN2GO: an integrated deep learning model for protein function prediction. Front. Bioeng. Biotechnol. 8: 391, https://doi.org/10.3389/fbioe.2020.00391.
Cao, Y. and Shen, Y. (2021). TALE: transformer-based protein function annotation with joint sequence–label embedding. Bioinformatics 37: 2825–2833, https://doi.org/10.1093/bioinformatics/btab198.
Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. (2017). ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22, https://doi.org/10.3390/molecules22101732.
Clark, W.T. and Radivojac, P. (2011). Analysis of protein function and its prediction from amino acid sequence. Proteins: Struct., Funct., Bioinf. 79: 2086–2096, https://doi.org/10.1002/prot.23029.
Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. (2021). ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44: 7112–7127, https://doi.org/10.1109/tpami.2021.3095381.
Giri, S.J., Dutta, P., Halani, P., and Saha, S. (2020). MultiPredGO: deep multi-modal protein function prediction by amalgamating protein structure, sequence, and interaction information. IEEE J. Biomed. Health Inform. 25: 1832–1838, https://doi.org/10.1109/jbhi.2020.3022806.
Huang, Y., Niu, B., Gao, Y., Fu, L., and Li, W. (2010). CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics 26: 680–682, https://doi.org/10.1093/bioinformatics/btq003.
Kabir, A. and Shehu, A. (2022). GOProFormer: a multi-modal transformer method for gene ontology protein function prediction. Biomolecules 12: 1709, https://doi.org/10.3390/biom12111709.
Kim, G.B., Kim, J.Y., Lee, J.A., Norsigian, C.J., Palsson, B.O., and Lee, S.Y. (2023). Functional annotation of enzyme-encoding genes using deep learning with transformer layers. Nat. Commun. 14: 7370, https://doi.org/10.1038/s41467-023-43216-z.
Kingma, D.P. and Ba, J. (2014). Adam: a method for stochastic optimization. In: Proc. 3rd int. conf. learn. representations, 2015, pp. 1–11.
Kulmanov, M. and Robert, H. (2020). DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 36: 422–429, https://doi.org/10.1093/bioinformatics/btz595.
Kulmanov, M., Khan, M.A., and Hoehndorf, R. (2018). DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34: 660–668, https://doi.org/10.1093/bioinformatics/btx624.
Kumar, V., Deepak, A., Ranjan, A., and Prakash, A. (2023). Lite-SeqCNN: a light-weight deep CNN architecture for protein function prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 2242–2253, https://doi.org/10.1109/TCBB.2023.3240169.
Kumar, V., Deepak, A., Ranjan, A., and Prakash, A. (2024a). CrossPredGO: a novel light-weight cross-modal multi-attention framework for protein function prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 21: 1709–1720, https://doi.org/10.1109/TCBB.2024.3410696.
Kumar, V., Deepak, A., Ranjan, A., and Prakash, A. (2024b). Bi-SeqCNN: a novel light-weight bi-directional CNN architecture for protein function prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 21: 1922–1933, https://doi.org/10.1109/tcbb.2024.3426491.
Li, M., Shi, W., Zhang, F., Zeng, M., and Li, Y. (2022). A deep learning framework for predicting protein functions with co-occurrence of GO terms. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 833–842, https://doi.org/10.1109/tcbb.2022.3170719.
Lin, T.Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, https://doi.org/10.1109/ICCV.2017.324.
Oliveira, G.B., Pedrini, H., and Dias, Z. (2023). TEMPROT: protein function annotation using transformers embeddings and homology search. BMC Bioinf. 24: 242, https://doi.org/10.1186/s12859-023-05375-0.
Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 701–710, https://doi.org/10.1145/2623330.2623732.
Rahardja, S., Wang, M., Nguyen, B.P., Fränti, P., and Rahardja, S. (2022). A lightweight classification of adaptor proteins using transformer networks. BMC Bioinf. 23: 461, https://doi.org/10.1186/s12859-022-05000-6.
Ranjan, A., Fahad, M.S., Fernández-Baca, D., Deepak, A., and Tripathi, S. (2019). Deep robust framework for protein function prediction using variable-length protein sequences. IEEE ACM Trans. Comput. Biol. Bioinf. 17: 1648–1659, https://doi.org/10.1109/TCBB.2019.2911609.
Ranjan, A., Fahad, M.S., Fernández-Baca, D., Tripathi, S., and Deepak, A. (2022). MCWS-transformers: towards an efficient modeling of protein sequences via multi context-window based scaled self-attention. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 1188–1199, https://doi.org/10.1109/TCBB.2022.3173789.
Ranjan, A., Tiwari, A., and Deepak, A. (2021). A sub-sequence based approach to protein function prediction via multi-attention based multi-aspect network. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 94–105, https://doi.org/10.1109/tcbb.2021.3130923.
Rentzsch, R. and Orengo, C.A. (2013). Protein function prediction using domain families. BMC Bioinf. 14: 1–14, https://doi.org/10.1186/1471-2105-14-s3-s5.
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 118: e2016239118, https://doi.org/10.1073/pnas.2016239118.
Sharma, L., Deepak, A., Ranjan, A., and Krishnasamy, G. (2023). A novel hybrid CNN and BiGRU-attention based deep learning model for protein function prediction. Stat. Appl. Genet. Mol. Biol. 22: 20220057, https://doi.org/10.1515/sagmb-2022-0057.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15: 1929–1958.
UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Res. 43: D204–D212, https://doi.org/10.1093/nar/gku989.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In: Advances in neural information processing systems, pp. 5998–6008.
Wang, H., Yan, L., Huang, H., and Ding, C. (2016). From protein sequence to protein function via multi-label linear discriminant analysis. IEEE ACM Trans. Comput. Biol. Bioinf. 14: 503–513, https://doi.org/10.1109/tcbb.2016.2591529.
Wenzel, M., Grüner, E., and Strodthoff, N. (2024). Insights into the inner workings of transformer models for protein function prediction. Bioinformatics 40: btae031, https://doi.org/10.1093/bioinformatics/btae031.
Zhang, X., Guo, H., Zhang, F., Wang, X., Wu, K., Qiu, S., Liu, B., Wang, Y., Hu, Y., and Li, J. (2023). HNetGO: protein function prediction via heterogeneous network transformer. Briefings Bioinf. 24: bbab556, https://doi.org/10.1093/bib/bbab556.
Zhang, F., Song, H., Zeng, M., Li, Y., Kurgan, L., and Li, M. (2019). DeepFunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions. Proteomics 19: 1900019, https://doi.org/10.1002/pmic.201900019.
Zhang, F., Song, H., Zeng, M., Wu, F.X., Li, Y., Pan, Y., and Li, M. (2020). A deep learning framework for gene ontology annotations with sequence- and network-based information. IEEE ACM Trans. Comput. Biol. Bioinf. 18: 2208–2217, https://doi.org/10.1109/tcbb.2020.2968882.
Zhang, X., Wang, L., Liu, H., Zhang, X., Liu, B., Wang, Y., and Li, J. (2021). Prot2GO: predicting GO annotations from protein sequences and interactions. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 2772–2780, https://doi.org/10.1109/tcbb.2021.3139841.
© 2025 Walter de Gruyter GmbH, Berlin/Boston