Leveraging transformers for semi-supervised pathogenicity prediction with soft labels

Pablo Enrique Guillem; Marco Zurdo-Tabernero; Noelia Egido Iglesias; Ángel Canal-Alonso; Liliana Durón Figueroa; Guillermo Hernández; Angélica González-Arrieta; Fernando de la Prieta

doi:10.1515/jib-2024-0047

Published/Copyright: June 23, 2025

From the journal: Journal of Integrative Bioinformatics, Volume 22, Issue 2

Abstract

The rapid advancement of Next-Generation Sequencing (NGS) technologies has revolutionized the field of genomics, producing large volumes of data that necessitate sophisticated analytical techniques. This paper introduces a Deep Learning model designed to predict the pathogenicity of genetic variants, a vital component in advancing personalized medicine. The model is trained on a dataset derived from the analysis of NGS outputs, containing a combination of well-defined and ambiguous genetic variants. By employing a semi-supervised learning approach, the model efficiently utilizes both confidently labeled and less certain data. At the core of the methodology is the Feature Tokenizer Transformer architecture, which processes both numerical and categorical genomic information. The preprocessing pipeline includes key steps such as data imputation, scaling, and encoding to ensure high data quality. The results highlight the model’s impressive accuracy, particularly in detecting confidently labeled variants, while also addressing the impact of its predictions on less certain (soft-labeled) data.

Keywords: deep learning; genomics; pathogenicity prediction; next-generation sequencing; variant classification; precision medicine

1 Introduction

The field of genomics has experienced a dramatic transformation with the arrival of Next-Generation Sequencing (NGS) technologies [1], leading to a surge in the volume and complexity of genetic data available for research and clinical applications. This wealth of data presents both opportunities and challenges, particularly in the identification of pathogenic genetic variants associated with various diseases [2]. Effective analysis of such extensive datasets requires the development of advanced computational methods capable of distinguishing pathogenic variants from benign ones with high accuracy.

In recent years, machine learning (ML) and deep learning (DL) have emerged as powerful tools in genomics, offering the ability to automatically extract meaningful patterns from complex and high-dimensional data [3]. Traditional approaches for variant pathogenicity prediction often rely on fixed biological features, such as conservation scores or functional annotations, which may not fully capture the underlying complexities of genetic data [4]. In contrast, DL models, particularly those leveraging architectures like transformers, have shown promise in capturing intricate relationships within data, enhancing the predictive capabilities of these models [5].

This paper presents a novel deep learning model that leverages the Feature Tokenizer Transformer (FTT) architecture [6] to predict the pathogenicity of genetic variants. Our model is designed to integrate both numerical and categorical features from genomic data, utilizing a semi-supervised learning framework [7] to make the best use of labeled and unlabeled data. By employing a combination of supervised learning for well-defined cases and unsupervised techniques for ambiguous examples, the model aims to provide a more nuanced understanding of variant pathogenicity.

Furthermore, our approach addresses the significant challenge of label uncertainty in genomic datasets, particularly in the context of variants of uncertain significance (VUS). By refining the classification process into a binary format, we enhance the model’s ability to discern between benign and pathogenic variants while accounting for the probabilistic nature of the predictions. This capability is crucial for advancing personalized medicine and improving clinical decision-making based on genomic information.

In the following sections, we detail the methodology, including data preprocessing, model architecture, and training strategies, and present a thorough evaluation of the model’s performance. The results underscore the potential of this DL model in enhancing the accuracy of pathogenicity predictions, providing a foundation for further research and clinical applications.

This work is based on the initial findings presented at the 2024 PACBB conference [8].

2 State of the art

Predicting the pathogenicity of genomic variants has long been a central task in bioinformatics and clinical genetics. Over time, numerous methodologies have emerged, drawing on an expanding catalog of annotations that include evolutionary conservation, biochemical properties, and functional assays, as well as collective knowledge encoded in publicly available databases. The following section provides an overview of representative tools and their underlying principles, accompanied by a measured examination of commonly discussed limitations. Many of these tools are continually evolving, so updates to their software or databases may address some of the issues described below.

ClinPred [9] is a machine learning-based tool that employs random forest and gradient boosting models over features derived from dbNSFP and gnomAD. It specializes in missense variants, demonstrating strong predictive performance for protein-altering substitutions and outperforming other ensemble methods. ClinPred is widely used in clinical and research pipelines, though its performance on noncoding or splice-altering mutations is limited due to its training data focus on missense variants. Additionally, because it relies on periodically updated databases, older static versions may lack newly discovered variants.

REVEL (Rare Exome Variant Ensemble Learner) [10] integrates outputs from 13 functional prediction tools, including SIFT, PolyPhen-2, MutationTaster, and PROVEAN, within a random forest model. Optimized for rare missense variants, REVEL demonstrates superior performance in distinguishing pathogenic from benign substitutions. However, as an ensemble model, its interpretability can be affected by systematic biases or disagreements among its constituent predictors.

MetaSVM and MetaLR [11] are ensemble-based classifiers that aggregate multiple functional scores, utilizing a support vector machine (MetaSVM) and logistic regression (MetaLR) to enhance predictive sensitivity and specificity. These tools perform well for coding variants but remain limited for large insertions/deletions (indels) and deep intronic variants, which lack comprehensive feature representation.

MAGPIE [12] extends predictive capabilities to synonymous, nonsynonymous, and certain noncoding variants using a machine learning pipeline trained on ClinVar. MAGPIE integrates population frequency data, structural annotations, and additional genomic features, making it a versatile method for clinical variant interpretation. However, its performance depends on the completeness of reference databases, and frequent retraining may be necessary to adapt to newly discovered variants.

Ensemble-based classifiers like VEST4 [13] and M-CAP [14] refine pathogenicity predictions by consolidating multiple computational scores. VEST4 prioritizes rare missense variants using supervised learning, while M-CAP employs gradient boosting to enhance discrimination of uncertain missense variants. Both tools demonstrate high accuracy but are limited by feature coverage variability across different genomic regions and populations.

Evolutionary conservation is central to predictors like MutationAssessor [15] and PrimateAI [16]. MutationAssessor derives functional impact from multiple sequence alignments, while PrimateAI utilizes primate-specific genomic data within a deep learning framework. Both methods excel at evaluating missense variants in highly conserved regions but provide limited insight into noncoding regions and structural variations.

Structure-based predictors, including PolyPhen-2 [17], SIFT4G [18], and LIST-S2 [19], leverage biochemical properties and protein structures to assess mutation effects. These tools are highly informative when reliable structural data exist but offer limited utility for variants in noncoding regions or within genes lacking high-resolution structural models.

Deep learning approaches, such as DANN [20], extend variant interpretation by encoding complex, nonlinear relationships across multiple genomic features. DANN improves classification for both coding and noncoding variants but requires extensive, well-labeled training data to prevent overfitting.

Overall, computational approaches have significantly improved the sensitivity and specificity of pathogenicity predictions, with many integrated into clinical workflows. However, persistent challenges remain, particularly in classifying non-missense variants, addressing the reliance on periodically updated databases, and resolving ambiguities in Variants of Uncertain Significance (VUS). The semi-supervised deep learning framework introduced in this paper seeks to address these challenges by systematically incorporating both well-labeled and uncertain data, broadening model coverage, and facilitating frequent updates as genomic knowledge expands.

3 Methodology

3.1 Dataset

The dataset employed to train the deep learning model was developed internally and is designed to reflect the typical format and content produced by advanced next-generation sequencing (NGS) analysis tools. The input data is in the form of a CSV file, derived from the annotation of a VCF file sourced from the ClinVar database, updated as of August 19, 2024. This file was processed using a custom-built bioinformatics pipeline. ClinVar, managed by the National Institutes of Health (NIH), compiles and curates information linking genetic variants to their clinical implications, drawing on submissions from research labs, clinical testing services, and expert review panels to provide comprehensive variant classifications and associated phenotypes [21].

3.1.1 Genetic analysis pipeline

This pipeline, which processes Illumina [22] germline whole-exome sequencing data, operates under the Nextflow workflow management system [23]. It is designed to ensure parallel task execution, reproducible results, and compatibility with diverse computing environments. Its primary steps include:

Quality Control and Preprocessing: Initial quality control is performed on the data to filter out low-quality reads, preparing it for deeper analysis.
Alignment: The Burrows-Wheeler Aligner-Maximum Exact Match (bwa-mem) [24] algorithm is employed to align sequencing reads to a reference genome, which is the prerequisite for locating genetic variants in the subsequent step.
Variant Calling: Advanced algorithms are used to identify a range of genetic variants, such as single nucleotide variants, insertions, deletions, and structural variants.
Annotation: In this step, the functional impact of genetic variants is analyzed and contextualized within biological and clinical frameworks. The present analysis begins at this stage, utilizing the VCF file from ClinVar as input. Annotation was conducted with Ensembl Variant Effect Predictor (VEP) v.109 [25], OpenCravat v. 2.2.9 [26], and TAPES v. 0.1 [27], each contributing distinct insights into variant significance.

3.1.2 Dataset composition

The resulting dataset is extensive, offering a thorough overview of the genomic data utilized for pathogenicity prediction. It comprises 376 columns, which are organized into six primary categories.

Basic Annotation: Variants are documented with essential genomic details, such as gene names, chromosomal positions, and allelic variations.
Pathogenicity Predictors: Computational tools integrated into VEP and OpenCravat assess the likelihood of a variant being pathogenic. These tools use various models and algorithms to evaluate the potential impact of variants on gene function and structure.
Population Allele Frequencies: Data on the frequency of specific alleles in various populations is incorporated. This information helps determine the rarity of variants, which is crucial for assessing their potential pathogenicity.
Clinical Information: Data linking genetic variations to clinical presentations, including patient symptoms and known disease associations, is included to evaluate the real-world impact of variants. This category is supported by information from ClinVar, which classifies variants based on clinical significance, expert reviews, and established guidelines.
Evolutionary Metrics: Conservation of genetic regions across species is assessed to understand their biological importance. Tools that evaluate evolutionary conservation help in understanding the functional relevance of genetic regions and highlight potential pathogenic consequences of mutations.
Other Annotation Sources: Additional insights are provided through integration with resources that offer valuable context on drug-gene interactions and literature-backed variant-disease associations.

The VEP utilizes a variety of plugins to analyze genetic variants. These plugins are designed to provide insights into different aspects of variant effects, such as pathogenicity, functional impact, evolutionary conservation, and gene function (refer to Table 1).

Table 1:

Ensembl VEP plugins grouped by functionality.

Category	Plugin examples	Description
Pathogenicity and functional impact prediction	CADD [28], FATHMM [29], FATHMM_MKL [30], REVEL [10], LoF [31], LoFtool [32], SpliceAI [33]	Predicts variant impact and pathogenicity using multiple models.
Conservation and evolutionary constraint	Conservation [34], AncestralAllele, PrimateAI, G2P [35]	Evaluates conservation and predicts functional consequences.
Phenotypic and disease association	DisGeNET [36], Mastermind [37], phenotypes	Links variants to diseases and phenotypes.
Gene function and expression	LOEUF [31], GO [38], miRNA	Provides insights into gene function and expression.
Splicing and protein structure	MaxEntScan [39], GeneSplicer [40], ProteinSeqs	Analyzes impacts on splicing and protein structure.
Variant frequency and population	dbNSFP [41], gnomADc [42], Gwava	Provides allele frequencies and population-based predictions.
Miscellaneous	IntAct [43], NearestGene, LocalID	Additional context like protein interactions and local IDs.

OpenCravat employs several annotators to enhance the interpretation of genetic variants. These annotators are designed to provide detailed information about variant pathogenicity, clinical relevance, and population frequencies. Annotators focused on functional impact prediction evaluate the potential deleterious effects of variants, whereas those addressing clinical relevance link variants to known diseases and cancer mutations. Additionally, OpenCravat includes resources for population frequency data and additional context from various databases, which support the assessment of variant rarity and its implications (refer to Table 2).

Table 2:

OpenCravat annotators grouped by functionality.

Category	Annotator examples	Description
Functional impact prediction	CScape coding [44], dbscSNV [45], ChasmPlus [46]	Assesses the impact of coding variants and splicing.
Clinical relevance and disease association	CIViC [47], COSMIC [48], cancer genome interpreter [49], CIViC gene	Links variants to diseases and cancer mutations.
Population frequency and GWAS	gnomAD gene, GWAS catalog [50], GRASP [51]	Provides allele frequencies and links to GWAS traits.
Conservation and evolutionary constraint	LINSIGHT [52], LOFtool, RVIS [53]	Scores conservation and mutation intolerance.
Gene and protein function	GO, NCBIGene [54], HG19	Provides insights into gene function and gene-related data.
Regulatory elements	Ensembl regulatory build [55], ESS gene	Annotates variants within regulatory elements.
Miscellaneous	dbSNP [56], MUPIT [57], LitVar [58]	Provides additional context including local IDs and variant effects.

In addition to the VEP and OpenCravat tools, which provide detailed insights into genetic variants through functional impact prediction and clinical relevance, TAPES is employed specifically for variant prioritization. TAPES uses American College of Medical Genetics and Genomics (ACMG) criteria [59] to systematically evaluate variants, focusing on distinguishing between those with pathogenic potential and those considered benign or of uncertain significance. The specific criteria are displayed in Figure 2.

3.1.3 Data preprocessing

The data preprocessing phase was essential in preparing the genomic data for our deep learning-based pathogenicity prediction model. The script used for this process applied multiple steps to ensure the data was well-suited for training (see Figure 1).

Loading Data: Data was loaded from CSV files in chunks, using custom converters to enforce data types and manage missing values. This method facilitated efficient memory usage and allowed processing of large genomic datasets.
Feature Categorization and Processing: Dataset features were grouped according to their type. Some features were one-hot encoded, others were binarized based on keyword presence, and specific features underwent custom processing depending on their characteristics.
Imputation and Scaling: A heuristic approach was taken for imputation, filling missing values with worst-case assumptions when domain knowledge supported it. In cases without such assumptions, the mean or median was used based on the feature’s distribution. Features with more than 25 % missing values were excluded to maintain data reliability. For scaling, we applied standard scaling to normally distributed features, while min-max scaling was used for non-normally distributed features, ensuring values ranged between zero and one. This dual approach minimized biases and optimized the data for various analytical methods.
Feature Selection: In addition to discarding features with excessive missing data, a variance threshold was applied to remove features with minimal variability, as such features are unlikely to enhance predictive performance. No further feature selection was performed: given the dataset’s high dimensionality, we relied on the deep learning model’s capacity to identify the important patterns within a large feature set. This strategy simplified preprocessing and allowed the model to leverage the full scope of the available data.
Categorical Encoding: Categorical variables were encoded using an ordinal method, especially for features with inherent rankings, such as confidence levels and annotations. The target variable was numerically encoded based on ACMG guidelines.
Final Dataset Assembly: The processed features, along with the encoded target variable, were consolidated to create the final dataset, fully prepared for model training. This preprocessing pipeline ensured the data was consistent, properly formatted, and optimized for the deep learning algorithms, ensuring that the model could efficiently learn from and generalize across the dataset.
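
As a rough illustration of the imputation, scaling, and feature-selection steps listed above, the following minimal sketch operates on plain Python lists. Only the 25 % missing-value cutoff is taken from the text. The column names, the variance threshold value, and the use of median imputation and min-max scaling for every column are assumptions made for the example; this is not the script used in the study.

```python
from statistics import median, pvariance

MAX_MISSING_FRACTION = 0.25   # from the text: drop features above 25 % missing
VARIANCE_THRESHOLD = 1e-4     # assumed value; the paper does not state it


def preprocess(columns):
    """Drop sparse or near-constant columns, impute, and min-max scale.

    `columns` maps a feature name to a list of floats, with None for missing.
    """
    processed = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > MAX_MISSING_FRACTION:
            continue  # too many missing values to impute reliably
        observed = [v for v in values if v is not None]
        fill = median(observed)  # median imputation (assumed default)
        imputed = [fill if v is None else v for v in values]
        lo, hi = min(imputed), max(imputed)
        if hi == lo:
            continue  # constant feature: zero variance
        scaled = [(v - lo) / (hi - lo) for v in imputed]  # range [0, 1]
        if pvariance(scaled) < VARIANCE_THRESHOLD:
            continue  # near-constant feature
        processed[name] = scaled
    return processed


# Hypothetical toy columns, for illustration only.
toy = {
    "cadd_phred": [10.0, None, 30.0, 20.0, 25.0],   # 20 % missing: kept
    "sparse_score": [None, None, 0.4, None, 0.1],   # 60 % missing: dropped
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],     # no variability: dropped
}
result = preprocess(toy)
print(result)
```

In a real pipeline, the choice between worst-case, mean, and median imputation, and between standard and min-max scaling, would be made per feature as described above.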

Figure 1:

Data preprocessing pipeline.

Figure 2:

The chart categorizes each criterion by evidence type and strength for benign (left) or pathogenic (right) assertions. Abbreviations: BS (benign strong), BP (benign supporting), FH (family history), LOF (loss-of-function), MAF (minor allele frequency), path. (pathogenic), PM (pathogenic moderate), PP (pathogenic supporting), PS (pathogenic strong), PVS (pathogenic very strong). Extracted from [59].

3.2 Classification target exploration

In the dataset, six unique labels are defined according to the ACMG guidelines, covering a spectrum of pathogenicity and certainty in classification. These categories are: “Benign auto”, “Benign”, “Likely Benign”, “Likely Pathogenic”, “Pathogenic”, and “VUS” (Variant of Uncertain Significance). Each of these categories corresponds to specific criteria combinations based on the ACMG framework.

Benign auto: This label refers to variants automatically classified as benign based on the strongest evidence. The designation stems from the presence of the BA1 criterion, which indicates a very high allele frequency in control populations, demonstrating that the variant is too common to cause a rare genetic disorder. This criterion is often applied as an exclusionary filter, meaning that if a variant meets BA1, it can be classified as benign without needing to assess additional evidence [60].
Benign: Variants classified as “Benign” meet stringent evidence of non-pathogenicity. In our dataset, a variant is labeled “Benign” if it satisfies at least two BS (Benign Strong) criteria, such as observation in a healthy individual without disease or computational evidence suggesting no functional impact.
Likely Benign: A “Likely Benign” label is assigned when the evidence leans towards non-pathogenicity but is not definitive. This classification is applied when at least one BS criterion is met alongside BP (Benign Supporting) criteria, which are weaker lines of evidence, or if two BP criteria are satisfied. These variants are considered unlikely to be disease-causing but do not meet the stringent evidence required for a full benign classification.
Pathogenic: This label denotes variants with strong evidence of pathogenicity. A variant is classified as “Pathogenic” when it satisfies the PVS1 criterion (a null variant in a gene where loss of function is a known disease mechanism), in combination with other strong (PS), moderate (PM), or supporting (PP) pathogenicity criteria. Such variants are highly likely to be disease-causing.
Likely Pathogenic: “Likely Pathogenic” variants have significant evidence pointing towards pathogenicity but fall short of the certainty required for a definitive pathogenic classification. This label is applied when a variant meets moderate (PM) and supporting (PP) criteria or satisfies PVS1 combined with other weaker evidence. While these variants are considered likely to be disease-causing, the evidence is not as conclusive as for the “Pathogenic” category.
VUS: These variants fall into an ambiguous category where the available evidence is insufficient to determine whether they are pathogenic or benign. These variants do not meet the criteria for any of the other categories, often due to a lack of data or conflicting evidence regarding their impact.

However, this setup is not ideal for machine learning targets, as it mixes pathogenicity with classification certainty. To refine this, we aim to focus solely on pathogenicity, allowing the inherent probabilistic characteristics of ML algorithms to account for classification uncertainty.

To achieve this, we convert the problem into a binary classification task, where the model predicts either a benign or a pathogenic outcome. As a first step, we group the six initial labels by certainty into two categories: hard labels (Benign auto, Benign, and Pathogenic), which map directly onto a binary outcome, and soft labels (Likely Benign, Variant of Uncertain Significance (VUS), and Likely Pathogenic), whose binary outcome is left for the model to assign. It is important to understand that the majority of our dataset is softly labeled, with over 80 % of the data classified as VUS and only about 2 % having definitive, hard labels. This imbalance in label distribution underscores the need for a semi-supervised learning approach to effectively leverage the available data.
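
The grouping can be sketched as a small mapping. The encoding 0 = benign and 1 = pathogenic, and the use of `None` for soft labels awaiting a pseudo-label, are conventions assumed for this example. The counts are the per-label row sums of the test-set confusion matrix (Table 3) and are used here only to check the stated label proportions.

```python
# Hard labels map directly to a binary target (assumed: 0 = benign, 1 = pathogenic).
HARD_LABELS = {"Benign auto": 0, "Benign": 0, "Pathogenic": 1}
# Soft labels receive no target up front; the model assigns one later.
SOFT_LABELS = {"Likely Benign", "VUS", "Likely Pathogenic"}


def to_binary_target(acmg_label):
    """Return 0, 1, or None (soft label) for one of the six ACMG-derived labels."""
    if acmg_label in HARD_LABELS:
        return HARD_LABELS[acmg_label]
    if acmg_label in SOFT_LABELS:
        return None
    raise ValueError(f"unknown label: {acmg_label!r}")


# Test-set counts per original label (row sums of Table 3).
test_counts = {
    "Benign auto": 884,
    "Benign": 17,
    "Likely Benign": 28_104,
    "VUS": 345_995,
    "Likely Pathogenic": 16_324,
    "Pathogenic": 6_079,
}
total = sum(test_counts.values())
vus_share = test_counts["VUS"] / total
hard_share = sum(n for k, n in test_counts.items() if k in HARD_LABELS) / total
print(f"VUS: {vus_share:.1%}, hard labels: {hard_share:.1%}")
```

On the test split, VUS account for roughly 87 % of variants and hard labels for just under 2 %, in line with the proportions quoted above.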

4 Architecture and training

The model architecture chosen for our pathogenicity predictor is the FTT [6]. This selection was driven by the necessity to effectively handle both numerical and categorical features in our dataset, a key strength of the FTT architecture. In an FTT, after all features are properly encoded, they are tokenized using an embedding layer within the model, as depicted in Figure 3.

Figure 3:

Illustration of feature tokenization within the FTT architecture for k features and latent dimension d. Adapted from [6].

This embedding or tokenization process converts features into dense vectors of a predetermined size, allowing the model to capture more intricate patterns in the data. In the context of genomic data, this method is particularly advantageous for handling features such as gene sequences or categorical variables with a large number of unique values, as it enables the model to develop a more nuanced understanding of the data.

A dedicated ‘classification’ token ([CLS]) is then added at the beginning of the latent representation for each sample. This is a standard technique when utilizing transformers for classification tasks (as seen in [61]). The stack of tokens created through this process is passed through the model’s multiple transformer layers. Ultimately, the transformed [CLS] token is fed into a classification head, which consists of a simple linear layer that outputs the class probabilities. The overall architecture is summarized in Figure 4.

Figure 4:

Summary of FTT architecture. Extracted from [6].
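
To make the tokenization step concrete, the sketch below follows the feature tokenizer formulation of [6]: a numerical feature x_j becomes the token x_j · W_j + b_j, and a categorical feature becomes a looked-up embedding plus a bias. It is written in plain Python for readability; in a real model these would be learnable parameters in a deep learning framework. The function names, the random initialization range, and the toy feature counts are assumptions of this example.

```python
import random


def make_tokenizer(n_numeric, cat_cardinalities, d, seed=0):
    """Create randomly initialized tokenizer parameters for latent dimension d."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-0.1, 0.1) for _ in range(d)]

    return {
        "num_weight": [vec() for _ in range(n_numeric)],
        "num_bias": [vec() for _ in range(n_numeric)],
        "cat_embed": [[vec() for _ in range(c)] for c in cat_cardinalities],
        "cat_bias": [vec() for _ in cat_cardinalities],
        "cls": vec(),
    }


def tokenize(params, x_num, x_cat):
    """Map one sample to a stack of (1 + k) tokens, each of dimension d."""
    tokens = [params["cls"]]  # [CLS] token placed first
    # Numerical feature: scale a per-feature weight vector and add a bias.
    for x, w, b in zip(x_num, params["num_weight"], params["num_bias"]):
        tokens.append([x * wi + bi for wi, bi in zip(w, b)])
    # Categorical feature: look up the embedding of the category index, add a bias.
    for idx, table, b in zip(x_cat, params["cat_embed"], params["cat_bias"]):
        tokens.append([ei + bi for ei, bi in zip(table[idx], b)])
    return tokens


# Toy sample: two numerical features and two categorical ones (3 and 5 levels).
params = make_tokenizer(n_numeric=2, cat_cardinalities=[3, 5], d=4)
tokens = tokenize(params, x_num=[0.0, 1.5], x_cat=[2, 0])
print(len(tokens), "tokens of dimension", len(tokens[0]))
```

The resulting token stack is what the transformer layers consume; only the transformed first token is passed to the classification head.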

Before training and evaluation, the data was split into training, validation, and test sets with a 70/15/15 ratio. The split was stratified so that the class imbalance remained consistent across the three subsets.
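
A stratified 70/15/15 split can be sketched as follows, using only the standard library. The function name, the seed, and the toy labels are illustrative assumptions; the split is performed independently within each label so that every subset preserves the label proportions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists with per-label proportions preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy, imbalanced label list for illustration only.
labels = ["VUS"] * 80 + ["Likely Benign"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```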

4.1 Semi-supervised training

A semi-supervised approach was used to train the model in order to maximize the utility of softly labeled data, which might otherwise be underutilized or misinterpreted. We employed a straightforward pseudo-labeling technique as proposed in [62], using equal weights for both labeled and pseudo-labeled losses. This method was chosen over more recent approaches (such as consistency regularization [63, 64]) due to its simplicity and, crucially, to avoid introducing synthetic or augmented data, which may raise concerns about interpretability, particularly within the medical community.

Initially, the model is trained using only the hard-labeled samples with binary outcomes, as described previously. After convergence, the trained model is used in inference mode to predict the classes of softly labeled samples. Those samples with a prediction confidence score exceeding a specified threshold are added to the training set with their predicted labels. The model is then retrained with this expanded dataset. This process is repeated until no additional samples meet the confidence threshold or a predetermined maximum number of iterations is reached. In our setup, we used a confidence threshold of 95 % and allowed up to 10 iterations.
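
The iterative procedure can be sketched as a short loop. The threshold of 95 % and the limit of 10 iterations come from the text; everything else is an assumption of this example, in particular `fit_toy`, a hypothetical one-dimensional stand-in for the FTT that is included only so the loop can run end to end.

```python
import math

CONFIDENCE_THRESHOLD = 0.95  # from the text
MAX_ITERATIONS = 10          # from the text


def pseudo_label(fit, labeled, unlabeled):
    """Iteratively move confidently predicted samples into the training set.

    `fit(labeled)` returns a function mapping a sample to P(pathogenic).
    `labeled` is a list of (sample, label) pairs; `unlabeled` a list of samples.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = fit(labeled)  # initial training on hard labels only
    for _ in range(MAX_ITERATIONS):
        still_unlabeled, newly_labeled = [], []
        for sample in unlabeled:
            p = model(sample)
            confidence = max(p, 1.0 - p)
            if confidence > CONFIDENCE_THRESHOLD:
                newly_labeled.append((sample, int(p >= 0.5)))
            else:
                still_unlabeled.append(sample)
        if not newly_labeled:
            break  # no additional sample meets the threshold
        labeled += newly_labeled
        unlabeled = still_unlabeled
        model = fit(labeled)  # retrain on the expanded set
    return model, labeled, unlabeled


def fit_toy(labeled):
    """Hypothetical stand-in model: a logistic curve between the two class means."""
    def mean(xs):
        return sum(xs) / len(xs)

    m0 = mean([x for x, y in labeled if y == 0])
    m1 = mean([x for x, y in labeled if y == 1])
    midpoint, scale = (m0 + m1) / 2, 8.0 / (m1 - m0)
    return lambda x: 1.0 / (1.0 + math.exp(-scale * (x - midpoint)))


hard = [(0.0, 0), (0.1, 0), (0.9, 1), (1.0, 1)]   # toy hard-labeled samples
soft = [0.05, 0.95, 0.5]                          # toy soft-labeled samples
model, labeled, remaining = pseudo_label(fit_toy, hard, soft)
print(len(labeled), "labeled;", remaining, "left without a pseudo-label")
```

In this toy run, the two samples close to a class mean receive pseudo-labels in the first iteration, while the sample at the decision boundary never reaches the threshold and stays out of the training set.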

At the end of the semi-supervised iterations, only a small number of Likely Benign and Likely Pathogenic cases had not reached the confidence threshold and were therefore excluded from the training and validation data. The excluded fraction was below 1 % for every label and data split.

For optimization, we employed the Adam algorithm with weight decay [65]. To further enhance generalization and mitigate overfitting, we incorporated several standard regularization techniques during training, most notably dropout and label smoothing. Dropout [66, 67] involves randomly disabling a portion of a layer’s output to prevent overfitting, and it was applied to the attention, feed-forward, and residual layers within the transformer blocks. Label smoothing [68, 69] adds uniform noise to the labels, accounting for potential labeling errors and enhancing model generalization. This is a critical aspect in our setup, given the likelihood of misclassifying some softly labeled samples during the semi-supervised process.
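
The effect of label smoothing on a binary target can be illustrated with a few lines of Python. The sketch assumes the common formulation in which the one-hot target is mixed with a uniform distribution over the two classes, using the smoothing parameter of 0.05 reported for this model; the exact variant used in the study is not specified in the text.

```python
import math


def smooth_label(y, epsilon=0.05):
    """Mix a binary target with a uniform distribution over the two classes."""
    return y * (1.0 - epsilon) + epsilon / 2.0


def binary_cross_entropy(p, target):
    """Cross-entropy between predicted P(pathogenic) and a (possibly soft) target."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


target = smooth_label(1)  # smoothed "pathogenic" target
loss_overconfident = binary_cross_entropy(0.999, target)
loss_calibrated = binary_cross_entropy(target, target)
print(f"target={target:.3f}, "
      f"loss at p=0.999: {loss_overconfident:.3f}, "
      f"loss at p=target: {loss_calibrated:.3f}")
```

With a smoothed target, an extremely confident prediction incurs a higher loss than a prediction equal to the target itself, which discourages the model from becoming overconfident on pseudo-labels that may be wrong.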

Regarding hyperparameters, the token dimension was set to 64, with three stacked transformer layers, each containing four heads in the multi-headed attention sublayers and a hidden dimension of 456 in their feed-forward sublayers. Dropout rates were configured at 0.1 for all attention layers, feed-forward layers and residual connections. A gated Gaussian error linear unit (GELU) was used as activation function for the non-linearity. We opted for prenormalization instead of postnormalization, favoring ease of optimization over peak performance [6]. For the semi-supervised parameters, the confidence threshold was set at 95 %, and a maximum of 10 iterations was allowed. The Adam optimizer had a learning rate of 10⁻⁴ with a weight decay of 10⁻⁵. Finally, the label smoothing parameter was set to 0.05.
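
For reference, the reported hyperparameters can be collected in a single configuration object. The class and field names below are hypothetical and chosen for this example; only the values are taken from the text. The per-head dimension assumes the token dimension is split evenly across attention heads, as is usual in multi-head attention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FTTConfig:
    # Architecture
    token_dim: int = 64
    n_layers: int = 3
    n_heads: int = 4
    ffn_hidden_dim: int = 456
    prenormalization: bool = True
    # Regularization
    attention_dropout: float = 0.1
    ffn_dropout: float = 0.1
    residual_dropout: float = 0.1
    label_smoothing: float = 0.05
    # Optimization (Adam with weight decay)
    learning_rate: float = 1e-4
    weight_decay: float = 1e-5
    # Semi-supervised loop
    confidence_threshold: float = 0.95
    max_pseudo_label_iterations: int = 10


config = FTTConfig()
head_dim = config.token_dim // config.n_heads  # assumed even split across heads
print(config)
print("per-head dimension:", head_dim)
```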

5 Results and discussion

We developed a DL model to predict the pathogenicity of genetic variants, and its effectiveness was evaluated using a confusion matrix. The analysis considered the six specified labels, and aimed to distinguish between benign and pathogenic variants through a binary classification approach (refer to Table 3).

Table 3:

Confusion matrix on test set (original labels against hard predicted labels).

		Predicted label
		Benign	Pathogenic
Original label	Benign auto	883	1
	Benign	17	0
	Likely benign	27,962	142
	VUS	266,535	79,460
	Likely pathogenic	4,691	11,633
	Pathogenic	0	6,079

We also extracted the distribution of predictions for each target label before applying the final sigmoid layer used to binarize the output of the model. This is shown in Figure 5. The plot shows the high confidence the model has in its predictions, with only a negligible number of predictions lying in the interval between −2 and 2. This confidence has no medical implications by itself, but it establishes clear predictions to test in future work.

The model demonstrated strong accuracy in correctly identifying hard labels: of the 6,980 test-set variants labeled ‘Benign auto’, ‘Benign’, or ‘Pathogenic’, only one (a ‘Benign auto’ variant) was misclassified. Accurately predicting ‘Pathogenic’ variants is particularly crucial due to the significant implications of misclassification.
For ‘VUS’, which have no ground truth, the model predicted 79,460 of the 345,995 test-set variants (about 23 %) as pathogenic, indicating its potential usefulness for identifying pathogenic candidates among uncertain cases.
The handling of ‘Likely Pathogenic’, ‘VUS’, and ‘Likely Benign’ variants appears to effectively simplify the soft label challenge into a binary format. However, due to the inherent uncertainty in these labels, additional study is necessary to refine their interpretation.
The strong confidence of the model predictions provides a solid basis for testing once more hard-labeled data becomes publicly available.

Figure 5:

Violin plot showing the distribution of predictions, before applying sigmoid function, for each target label in the dataset.
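
To relate the pre-sigmoid interval between −2 and 2 to probabilities, the short sketch below evaluates the sigmoid at its endpoints. It assumes the pre-sigmoid outputs are logits of the pathogenic class.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


for logit in (-2.0, 0.0, 2.0):
    print(f"logit {logit:+.1f} -> P(pathogenic) = {sigmoid(logit):.3f}")
```

Under that assumption, a prediction outside the interval corresponds to a probability below about 0.12 or above about 0.88 for the pathogenic class.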

Beyond these technical observations, there are potential implications for clinical practice. In particular, the semi-supervised approach could aid in reclassifying VUS by identifying those most likely to be pathogenic, thus guiding more focused laboratory investigations or functional studies. Likewise, the ability to integrate uncertain samples may eventually help genomics laboratories or clinical testing facilities reduce the time spent on manual curation, provided that large-scale prospective validations confirm the reliability of these semi-supervised predictions. However, because the data in this study derive primarily from public repositories without longitudinal follow-up, we do not claim immediate applicability for clinical diagnostics. Future studies that incorporate patient-level outcomes, cross-reference the model’s predictions with established clinical reports, and conduct experimental assays on borderline cases will be essential steps toward realizing this framework’s practical utility in genomic medicine.

We performed a direct comparison of our method with ClinPred [9] and REVEL [10], both based on decision trees (Table 4). Our method achieves higher specificity and precision than both, at the cost of lower sensitivity than both and lower accuracy than ClinPred. This is due to the large number of false negatives predicted by our model (see Likely Pathogenic variants in Table 3). However, it is important to highlight a key difference between our approach and the one in [9]. While they consider some soft-labeled variants (Likely Pathogenic and Likely Benign) as ground truth, we handle them differently, classifying them dynamically as described in Section 4.1. Rather than assuming certainty in these labels, our semi-supervised approach allows the model to learn from them adaptively, refining predictions based on broader patterns in the data. This not only reduces the risk of propagating label noise but also enhances the model’s ability to generalize to novel variants with uncertain classifications. Future work could compare the performance of ClinPred under a similar semi-supervised setup, which would provide valuable insights into the trade-offs between precision and recall in pathogenicity prediction.

Table 4:

Comparison between different methods, with data extracted from [9]. Abbreviations: FPR, false positive rate; MCC, Matthews correlation coefficient.

	Sensitivity %	Specificity %	FPR	Accuracy	Precision	F1 score	MCC
FTT (ours)	79.06	99.51	0.005	0.91	0.99	0.88	0.82
ClinPred	93.58	94.10	0.060	0.94	0.86	0.90	0.85
REVEL	82.55	89.27	0.110	0.87	0.75	0.78	0.70

The bold values represent the best-performing score for each metric across the three methods.
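
For readers who want to relate the columns of Table 4 to one another, the sketch below computes the same metrics from a binary confusion matrix using their standard definitions. The counts in the example are hypothetical and are not taken from this study or from [9]; the function name is likewise illustrative.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on pathogenic variants
    specificity = tn / (tn + fp)            # recall on benign variants
    fpr = fp / (fp + tn)                    # equals 1 - specificity
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fpr,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=80, fp=5, tn=95, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A model with few false positives but many false negatives, as in this hypothetical example, shows the pattern discussed above: high specificity and precision together with reduced sensitivity.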

6 Conclusions

This study introduced a DL model designed to predict the pathogenicity of genetic variants, a crucial advancement for personalized medicine and the integration of genomics into clinical practice. The model employs a semi-supervised learning strategy combined with the FTT architecture to manage the inherent complexities of genomic data. Our results indicate that the model can effectively classify genetic variants ranging from benign to clearly pathogenic, although the full implications of the soft labels remain somewhat ambiguous.

Although the initial findings are encouraging, there are still areas for potential improvement. Future research will aim to enhance the model by expanding the training dataset, incorporating a wider array of genomic features, and improving its ability to detect subtle genetic variations. Moreover, continued validation through functional studies and clinical correlations will be necessary to ensure the model’s applicability and reliability in real-world clinical environments.

Corresponding author: Marco Zurdo-Tabernero, BISITE Research Group, University of Salamanca, Salamanca, Spain; and Institute of Biomedical Research of Salamanca (IBSAL), University of Salamanca, Salamanca, Spain, E-mail: marcohperez@usal.es

Funding source: Agencia Estatal de Investigación

Award Identifier / Grant number: CNS2022-135101

Acknowledgements

This work was supported by RESILIENCE: REcurSos médicos mediante InteLIgENCIA artificial explicable. Reference: CNS2022-135101. Funding body: Agencia Estatal de Investigación (MCIN/AEI/10.13039/501100011033) and European Union NextGenerationEU/PRTR.

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: P.E.G. and M.Z.T. jointly led the development of the AI models and experimental setup; P.E.G. focused more on implementation and coding, while M.Z.T. contributed more extensively to the manuscript writing. N.E.I. managed dataset curation, annotation, and bioinformatics analysis. A.C.A. coordinated the project and ensured methodological consistency. L.D.F. assisted in initial data handling and supported project administration. G.H., A.G.A., and F.d.l.P. supervised the study, provided strategic input, and contributed to manuscript revision. All authors reviewed and approved the final manuscript.
Use of Large Language Models, AI and Machine Learning Tools: None declared.
Conflict of interest: The authors state no conflict of interest.
Research funding: This work was supported by RESILIENCE: REcurSos médicos mediante InteLIgENCIA artificial explicable. Reference: CNS2022-135101. Funding body: Agencia Estatal de Investigación (MCIN/AEI/10.13039/501100011033) and European Union NextGenerationEU/PRTR.
Data availability: Not applicable.

Appendix A: Dataset variable descriptions

This appendix provides a detailed list of the variables included in the dataset used for pathogenicity prediction (Table A1).

Table A1:

Dataset description.

Variable name	Description
Chromosome (hg38)	Chromosome (hg38)
Position (hg38)	Position (hg38)
Reference allele	Reference allele
Alternative allele	Alternative allele
Gene (HUGO)	Gene name by HUGO
MANE SELECT	The MANE Select set consists of one transcript at each locus across the genome that is representative of biology at that locus.
MANE PLUS CLINICAL	The MANE plus clinical set includes additional transcripts for genes where MANE select alone is not sufficient to report all ‘pathogenic (P)’ or ‘likely pathogenic (LP)’ clinical variants available in public resources.
HGVSc	HGVS coding variant presentation from VEP
HGVSp	HGVS protein variant presentation from VEP
HGVSg	HGVS genomic variant presentation from VEP
Probability path	Pathogenicity probability score by TAPES
Prediction ACMG	Prediction by the American College of Medical Genetics and Genomics (ACMG)
Start position	Start position
End position	End position
Chromosome (hg19)	Chromosome (hg19)
Position (hg19)	Position (hg19)
Transcript	Transcript ID by Ensembl
Sequence Ontology	Variant consequence by SO
cDNA change	cDNA change
Protein change	Protein change
All mappings	All mapping transcripts
Ensembl gene ID	Gene ID by Ensembl
Ensembl feature ID	Feature ID by Ensembl
Feature type	Feature type
CDS strand	Coding sequence (CDS) strand (+ or -)
cDNA position	Relative position of base pair in cDNA sequence
CDS position	Relative position of base pair in coding sequence
Protein position	Relative position of amino acid in protein
Aminoacids	Amino acids involved in the variant
Codons	Codons involved in the variant
Impact	Impact of variant predicted by Ensembl
Distance	Distance to the nearest gene
Strand	Indicates the strand where the variant is located, + (1) or - (−1)
HGNC ID	HUGO gene nomenclature committee ID
TSL	Transcript support level
APPRIS	APPRIS isoform annotation
CCDS	CCDS id
ENSP	Ensembl ENSP id
Exon	Exon number
Intron	Intron number
Clinical significance	ClinVar clinical significance
Pubmed	PubMed publications identifiers
1000Gp3 AC	Alternative allele counts in the whole 1000 genomes phase 3 (1000Gp3) data
1000Gp3 AF	Alternative allele frequency in the whole 1000Gp3 data
ALSPAC AC	Alternative allele count in called genotypes in UK10K ALSPAC cohort
ALSPAC AF	Alternative allele frequency in called genotypes in UK10K ALSPAC cohort
Aloft confidence	Confidence level of Aloft pred. Values can be ‘high confidence’ (p < 0.05) or ‘low confidence’ (p > 0.05)
Aloft fraction transcripts affected	The fraction of the transcripts of the gene affected i.e. No. of transcripts affected by the SNP/Total no. of protein coding transcripts for the gene
Aloft prediction	Final classification predicted by ALoFT. Values can be Tolerant, Recessive or Dominant
Aloft Probability Dominant	Probability of the SNP being classified as dominant disease-causing by ALoFT
Aloft Probability Recessive	Probability of the SNP being classified as recessive disease-causing by ALoFT
Aloft Probability Tolerant	Probability of the SNP being classified as benign by ALoFT
AltaiNeandertal	Genotype of a deep sequenced Altai Neanderthal
Ancestral allele	Ancestral allele based on 8 primates EPO. Ancestral alleles by Ensembl 84. The following comes from its original README file: ACTG – high-confidence call, ancestral state supported by the other two sequences; actg – low-confidence call, ancestral state supported by one sequence only; N – failure, the ancestral state is not supported by any other sequence; ‘-’ – the extant species contains an insertion at this position; ‘.’ – no coverage in the alignment
BayesDel addAF prediction	Prediction of BayesDel addAF score based on the authors’ recommendation, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0.0692655
BayesDel addAF rankscore	The rankscore is the ratio of the rank of the score over the total number of BayesDel_addAF scores in dbNSFP
BayesDel addAF score	A deleteriousness prediction meta-score for SNVs and indels with inclusion of MaxAF. The higher the score, the more likely the variant is pathogenic. The author-suggested cutoff between deleterious (‘D’) and tolerated (‘T’) is 0.0692655
BayesDel noAF prediction	Prediction of BayesDel noAF score based on the authors’ recommendation, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is −0.0570105
BayesDel noAF rankscore	The rankscore is the ratio of the rank of the score over the total number of BayesDel_noAF scores in dbNSFP
BayesDel noAF score	A deleteriousness prediction meta-score for SNVs and indels without inclusion of MaxAF. The higher the score, the more likely the variant is pathogenic. The author-suggested cutoff between deleterious (‘D’) and tolerated (‘T’) is −0.0570105
DANN rankscore	The rankscore is the ratio of the rank of the score over the total number of DANN scores in dbNSFP
DANN score	DANN is a functional prediction score retrained based on the training data of CADD using a deep neural network. Scores range from 0 to 1. A larger number indicates a higher probability of being damaging
DEOGEN2 prediction	Prediction of DEOGEN2 score based on the authors’ recommendation, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0.5
DEOGEN2 rankscore	The rankscore is the ratio of the rank of the score over the total number of DEOGEN2 scores in dbNSFP
DEOGEN2 score	A deleteriousness prediction score ‘which incorporates heterogeneous information about the molecular effects of the variants, the domains involved, the relevance of the gene and the interactions in which it participates’. It ranges from 0 to 1. The larger the score, the more likely the variant is deleterious. The authors suggest a threshold of 0.5 for separating damaging vs tolerant variants
Denisova	Genotype of a deep sequenced Denisova
Eigen-PC-Phred coding	Eigen PC score in phred scale
Eigen-PC-raw coding	Eigen PC score for genome-wide SNVs. A functional prediction score based on conservation, allele frequencies, deleteriousness prediction (for missense SNVs) and epigenomic signals (for synonymous and non-coding SNVs) using an unsupervised learning method
Eigen-PC-raw coding rankscore	The rankscore is the ratio of the rank of the score over the total number of eigen-PC-raw scores in dbNSFP
Eigen-Phred coding	Eigen score in phred scale
Eigen-raw coding	Eigen score for coding SNVs. A functional prediction score based on conservation, allele frequencies, and deleteriousness prediction using an unsupervised learning method
Eigen-raw coding rankscore	The rankscore is the ratio of the rank of the score over the total number of eigen-raw scores in dbNSFP
FATHMM converted rankscore	FATHMM scores were first converted to FATHMMnew = 1-(FATHMMori+16.13)/26.77, then ranked among all FATHMMnew scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of FATHMMnew scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The scores range from 0 to 1
FATHMM prediction	If a FATHMM score is ≤ −1.5, the corresponding nsSNV is predicted as ‘D(AMAGING)’; otherwise it is predicted as ‘T(OLERATED)’
FATHMM score	FATHMM default score (weighted for human inherited-disease mutations with disease Ontology). Scores range from −16.13 to 10.64. The smaller the score the more likely the SNP has damaging effect
GERP++ NR	GERP++ neutral rate
GERP++ RS	GERP++ RS score, the larger the score, the more conserved the site. Scores range from −12.3 to 6.17
GERP++ RS rankscore	The rankscore is the ratio of the rank of the score over the total number of GERP++ RS scores in dbNSFP
GM12878 confidence value	0 – highly significant scores (approx. p < 0.003); 1 – significant scores (approx. p < 0.05); 2 – informative scores (approx. p < 0.25); 3 – other scores (approx. p ≥ 0.25)
GM12878 fitCons rankscore	The rankscore is the ratio of the rank of the score over the total number of GM12878 fitCons scores in dbNSFP
GM12878 fitCons score	fitCons score predicts the fraction of genomic positions belonging to a specific function class (defined by epigenomic ‘fingerprint’) that are under selective pressure. Scores range from 0 to 1, with a larger score indicating that a higher proportion of nucleic sites in the functional class to which the genomic position belongs are under selective pressure, and therefore that the position is more likely to be functionally important. GM12878 fitCons scores are based on cell type GM12878
GTEx V8 gene	Target gene of the (significant) eQTL SNP
GTEx V8 tissue	Tissue type of the expression data with which the eQTL/gene pair is detected
H1-hESC confidence value	0 – highly significant scores (approx. p < 0.003); 1 – significant scores (approx. p < 0.05); 2 – informative scores (approx. p < 0.25); 3 – other scores (approx. p ≥ 0.25)
H1-hESC fitCons rankscore	The rankscore is the ratio of the rank of the score over the total number of H1-hESC fitCons scores in dbNSFP
H1-hESC fitCons score	fitCons score predicts the fraction of genomic positions belonging to a specific function class (defined by epigenomic ‘fingerprint’) that are under selective pressure. Scores range from 0 to 1, with a larger score indicating that a higher proportion of nucleic sites in the functional class to which the genomic position belongs are under selective pressure, and therefore that the position is more likely to be functionally important. H1-hESC fitCons scores are based on cell type H1-hESC
HUVEC confidence value	0 – highly significant scores (approx. p < 0.003); 1 – significant scores (approx. p < 0.05); 2 – informative scores (approx. p < 0.25); 3 – other scores (approx. p ≥ 0.25)
HUVEC fitCons rankscore	The rankscore is the ratio of the rank of the score over the total number of HUVEC fitCons scores in dbNSFP
HUVEC fitCons score	fitCons score predicts the fraction of genomic positions belonging to a specific function class (defined by epigenomic ‘fingerprint’) that are under selective pressure. Scores range from 0 to 1, with a larger score indicating that a higher proportion of nucleic sites in the functional class to which the genomic position belongs are under selective pressure, and therefore that the position is more likely to be functionally important. HUVEC fitCons scores are based on cell type HUVEC
Interpro domain	Domain or conserved site on which the variant locates. Domain annotations come from interpro database. The number in the brackets following a specific domain is the count of times interpro assigns the variant position to that domain, typically coming from different predicting databases. Multiple entries separated by ‘;’
LIST-S2 prediction	Prediction of LIST-S2 score based on the authors’ recommendation, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0.85
LIST-S2 rankscore	The rankscore is the ratio of the rank of the score over the total number of LIST-S2 scores in dbNSFP
LIST-S2 score	A deleteriousness prediction score for nonsynonymous SNVs. The range of the score in dbNSFP is from 0 to 1. The higher the score, the more likely the variant is pathogenic. The author-suggested cutoff between deleterious (‘D’) and tolerated (‘T’) is 0.85
LRT Omega	Estimated nonsynonymous-to-synonymous-rate ratio (Omega, reported by LRT)
LRT converted rankscore	LRTori scores were first converted as LRTnew = 1 − LRTori × 0.5 if Omega < 1, or LRTnew = LRTori × 0.5 if Omega ≥ 1. Then LRTnew scores were ranked among all LRTnew scores in dbNSFP. The rankscore is the ratio of the rank over the total number of the scores in dbNSFP. The scores range from 0.00162 to 0.8433
LRT prediction	LRT prediction, D(eleterious), N(eutral) or U(nknown), which is not solely determined by the score.
LRT score	The original LRT two-sided p-value (LRTori), ranges from 0 to 1
M-CAP prediction	Prediction of M-CAP score based on the authors’ recommendation, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0.025
M-CAP rankscore	The rankscore is the ratio of the rank of the score over the total number of M-CAP scores in dbNSFP
M-CAP score	M-CAP is a hybrid ensemble score. Scores range from 0 to 1. The larger the score the more likely the SNP has damaging effect.
MPC rankscore	The rankscore is the ratio of the rank of the score over the total number of MPC scores in dbNSFP
MPC score	A deleteriousness prediction score for missense variants based on regional missense constraint. The range of MPC score is 0–5. The larger the score, the more likely the variant is pathogenic
MVP rankscore	The rankscore is the ratio of the rank of the score over the total number of MVP scores in dbNSFP
MVP score	A pathogenicity prediction score for missense variants using a deep learning approach. The range of MVP score is from 0 to 1. The larger the score, the more likely the variant is pathogenic. The authors suggest thresholds of 0.7 and 0.75 for separating damaging vs tolerant variants in constrained genes (ExAC pLI ≥ 0.5) and non-constrained genes (ExAC pLI < 0.5), respectively.
MetaLR prediction	Prediction of MetaLR based ensemble prediction score, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0.5
MetaLR rankscore	The rankscore is the ratio of the rank of the score over the total number of MetaLR scores in dbNSFP. The scores range from 0 to 1
MetaLR score	Logistic regression (LR) based ensemble prediction score, which incorporated 10 scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, MutationAssessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 Genomes populations. A larger value means the SNV is more likely to be damaging. Scores range from 0 to 1
MetaSVM prediction	Prediction of SVM based ensemble prediction score, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0
MetaSVM rankscore	The rankscore is the ratio of the rank of the score over the total number of MetaSVM scores in dbNSFP. The scores range from 0 to 1
MetaSVM score	Support vector machine (SVM) based ensemble prediction score, which incorporated 10 scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, MutationAssessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 Genomes populations. A larger value means the SNV is more likely to be damaging.
MutPred protID	UniProt accession or Ensembl transcript ID used for MutPred score calculation
MutPred rankscore	The rankscore is the ratio of the rank of the score over the total number of MutPred scores in dbNSFP
MutPred score	General MutPred score. Scores range from 0 to 1. The larger the score the more likely the SNP has damaging effect
MutationAssessor prediction	MutationAssessor’s functional impact of a variant–predicted functional, i.e. high (‘H’) or medium (‘M’), or predicted non-functional, i.e. low (‘L’) or neutral (‘N’). The score cutoffs between ‘H’ and ‘M’, ‘M’ and ‘L’, and ‘L’ and ‘N’, are 3.5, 1.935 and 0.8, respectively
MutationAssessor rankscore	The rankscore is the ratio of the rank of the score over the total number of MAori scores in dbNSFP. The scores range from 0 to 1
MutationAssessor score	MutationAssessor functional impact combined score
MutationTaster prediction	MutationTaster prediction, ‘A’ (‘disease_causing_automatic’), ‘D’ (‘disease_causing’), ‘N’ (‘polymorphism’) or ‘P’ (‘polymorphism_automatic’)
MutationTaster converted rankscore	The MTori scores were first converted. If the prediction is ‘A’ or ‘D’ MTnew = MTori; if the prediction is ‘N’ or ‘P’, MTnew = 1-MTori. Then MTnew scores were ranked among all MTnew scores in dbNSFP. If there are multiple scores of an SNV, only the largest MTnew was used in ranking. The rankscore is the ratio of the rank of the score over the total number of MTnew scores in dbNSFP. The scores range from 0.08979 to 0.81001.
MutationTaster score	MutationTaster p-value, ranges from 0 to 1.
PROVEAN prediction	If PROVEAN is ≤ −2.5, the corresponding nsSNV is predicted as ‘D(amaging)’; otherwise it is predicted as ‘N(eutral)’.
PROVEAN converted rankscore	PROVEANori scores were first converted to PROVEANnew = 1 − (PROVEANori + 14)/28, then ranked among all PROVEANnew scores in dbNSFP. The rankscore is the ratio of the rank of the PROVEANnew score over the total number of PROVEANnew scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The scores range from 0 to 1
PROVEAN score	Scores range from −14 to 14. The smaller the score the more likely the SNP has damaging effect.
PrimateAI prediction	Prediction of PrimateAI score based on the authors’ recommendation, ‘T(olerated)’ or ‘D(amaging)’. The score cutoff between ‘D’ and ‘T’ is 0.803
PrimateAI rankscore	The rankscore is the ratio of the rank of the score over the total number of PrimateAI scores in dbNSFP
PrimateAI score	A pathogenicity prediction score for missense variants based on common variants of non-human primate species using a deep neural network. The range of PrimateAI score is 0–1. The larger the score, the more likely the variant is pathogenic
Reliability index	Number of observed component scores (except the maximum frequency in the 1000 genomes populations) for MetaSVM and MetaLR. Ranges from 1 to 10. As MetaSVM and MetaLR scores are calculated based on imputed data, the less missing component scores, the higher the reliability of the scores and predictions.
SIFT4G prediction	If SIFT4G is < 0.05 the corresponding nsSNV is predicted as ‘D(amaging)’; otherwise it is predicted as ‘T(olerated)’.
SIFT4G converted rankscore	SIFT4G scores were first converted to SIFT4Gnew = 1 − SIFT4G, then ranked among all SIFT4Gnew scores in dbNSFP. The rankscore is the ratio of the rank of the SIFT4Gnew score over the total number of SIFT4Gnew scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented
SIFT4G score	Scores range from 0 to 1. The smaller the score the more likely the SNP has damaging effect
SIFT prediction	If SIFT is smaller than 0.05 the corresponding nsSNV is predicted as ‘D(amaging)’; otherwise it is predicted as ‘T(olerated)’.
SIFT converted rankscore	SIFTori scores were first converted to SIFTnew = 1 − SIFTori, then ranked among all SIFTnew scores in dbNSFP. The rankscore is the ratio of the rank of the SIFTnew score over the total number of SIFTnew scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The rankscores range from 0.00964 to 0.91255
SIFT score	Scores range from 0 to 1. The smaller the score the more likely the SNP has damaging effect
SiPhy 29way logOdds	SiPhy score based on 29 mammals genomes. The larger the score, the more conserved the site
SiPhy 29way logOdds rankscore	The rankscore is the ratio of the rank of the score over the total number of SiPhy_29way_logOdds scores in dbNSFP
SiPhy 29way pi	The estimated stationary distribution of A, C, G and T at the site, using SiPhy algorithm based on 29 mammals genomes.
Uniprot ACC	Uniprot accession number
Uniprot entry	Uniprot entry ID
VindijiaNeandertal	Genotype of a deep sequenced Vindija Neandertal
bStatistic	Background selection (B) value estimates. Ranges from 0 to 1000. It estimates the expected fraction (×1000) of neutral diversity present at a site. Values close to 0 represent near-complete removal of diversity as a result of background selection, and values near 1000 indicate absence of background selection
bStatistic converted rankscore	bStatistic scores were first converted to -bStatistic, then ranked among all -bStatistic scores in dbNSFP. The rankscore is the ratio of the rank of -bStatistic over the total number of -bStatistic scores in dbNSFP
ClinVar MedGen ID	MedGen ID from ClinVar
ClinVar OMIM ID	OMIM ID from ClinVar
ClinVar Orphanet ID	Orphanet ID from ClinVar
ClinVar HGVS	HGVS nomenclature from ClinVar
ClinVar ID	ClinVar ID
ClinVar review	Review status of variant from ClinVar
ClinVar Trait	Trait associated with variant according to ClinVar
Fathmm-MKL coding group	The groups of features (labeled A-J) used to obtain the score
Fathmm-MKL coding prediction	If a fathmm-MKL coding score is > 0.5 the corresponding nsSNV is predicted as ‘D(AMAGING)’; otherwise it is predicted as ‘N(EUTRAL)’
Fathmm-MKL coding rankscore	The rankscore is the ratio of the rank of the score over the total number of fathmm-MKL coding scores in dbNSFP
Fathmm-MKL coding score	Fathmm-MKL p-values. Scores range from 0 to 1. SNVs with scores > 0.5 are predicted to be deleterious, and those < 0.5 are predicted to be neutral or benign. Scores close to 0 or 1 have the highest confidence. Coding scores are trained using 10 groups of features
Fathmm-XF coding prediction	If a fathmm-XF coding score is > 0.5, the corresponding nsSNV is predicted as ‘D(AMAGING)’; otherwise it is predicted as ‘N(EUTRAL)’
Fathmm-XF coding rankscore	The rankscore is the ratio of the rank of the score over the total number of fathmm-XF coding scores in dbNSFP
Fathmm-XF coding score	Fathmm-XF p-values. Scores range from 0 to 1. SNVs with scores > 0.5 are predicted to be deleterious, and those < 0.5 are predicted to be neutral or benign. Scores close to 0 or 1 have the highest confidence. Coding scores are trained using 10 groups of features
gnomAD exomes AC	Alternative allele count in the whole gnomAD exome samples (125,748 samples)
gnomAD exomes AF	Alternative allele frequency in the whole gnomAD exome samples (125,748 samples)
gnomAD exomes AN	Total allele count in the whole gnomAD exome samples (125,748 samples)
gnomAD exomes POPMAX AC	Allele count in the population with the maximum AF
gnomAD exomes POPMAX AF	Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
gnomAD exomes POPMAX AN	Total number of alleles in the population with the maximum AF
gnomAD exomes flag	Information from gnomAD exome data indicating whether the variant falls within low-complexity (lcr), segmental duplication (segdup), or decoy regions. The flag can be ‘.’ for high-quality PASS or not reported/polymorphic in gnomAD exomes, ‘lcr’ for within lcr, ‘segdup’ for within segdup, or ‘decoy’ for within a decoy region
gnomAD genomes AC	Alternative allele count in the whole gnomAD genome samples (71,702 samples)
gnomAD genomes AF	Alternative allele frequency in the whole gnomAD genome samples (71,702 samples)
gnomAD genomes AN	Total allele count in the whole gnomAD genome samples (71,702 samples)
gnomAD genomes POPMAX AC	Allele count in the population with the maximum AF
gnomAD genomes POPMAX AF	Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
gnomAD genomes POPMAX AN	Total number of alleles in the population with the maximum AF
gnomAD genomes flag	Information from gnomAD genome data indicating whether the variant falls within low-complexity (lcr), segmental duplication (segdup), or decoy regions. The flag can be ‘.’ for high-quality PASS or not reported/polymorphic in gnomAD genomes, ‘lcr’ for within lcr, ‘segdup’ for within segdup, or ‘decoy’ for within a decoy region
Integrated confidence value	0 – highly significant scores (approx. p < 0.003); 1 – significant scores (approx. p < 0.05); 2 – informative scores (approx. p < 0.25); 3 – other scores (approx. p ≥ 0.25)
Integrated fitCons score	fitCons score predicts the fraction of genomic positions belonging to a specific function class (defined by epigenomic ‘fingerprint’) that are under selective pressure. Scores range from 0 to 1, with a larger score indicating that a higher proportion of nucleic sites in the functional class to which the genomic position belongs are under selective pressure, and therefore that the position is more likely to be functionally important. Integrated (i6) scores are integrated across three cell types (GM12878, H1-hESC and HUVEC)
Integrated fitCons rankscore	The rankscore is the ratio of the rank of the score over the total number of integrated fitCons scores in dbNSFP
phastCons100way vertebrate	phastCons conservation score based on the multiple alignments of 100 vertebrate genomes (including human). The larger the score, the more conserved the site. Scores range from 0 to 1.
phastCons100way vertebrate rankscore	The rankscore is the ratio of the rank of the score over the total number of phastCons100way_vertebrate scores in dbNSFP
phastCons17way primate	A conservation score based on 17way alignment primate set. The larger the score, the more conserved the site. Scores range from 0 to 1.
phastCons17way primate rankscore	The rank of the phastCons17way_primate score among all phastCons17way_primate scores in dbNSFP
phastCons30way mammalian	phastCons conservation score based on the multiple alignments of 30 mammalian genomes (including human). The larger the score, the more conserved the site. Scores range from 0 to 1
phastCons30way mammalian rankscore	The rankscore is the ratio of the rank of the score over the total number of phastCons30way_mammalian scores in dbNSFP
phyloP100way vertebrate	phyloP (phylogenetic p-values) conservation score based on the multiple alignments of 100 vertebrate genomes (including human). The larger the score, the more conserved the site
phyloP100way vertebrate rankscore	The rankscore is the ratio of the rank of the score over the total number of phyloP100way_vertebrate scores in dbNSFP
phyloP17way primate	A conservation score based on 17way alignment primate set. The larger the score, the more conserved the site
phyloP17way primate rankscore	The rank of the phyloP17way_primate score among all phyloP17way_primate scores in dbNSFP
phyloP30way mammalian	phyloP (phylogenetic p-values) conservation score based on the multiple alignments of 30 mammalian genomes (including human). The larger the score, the more conserved the site
phyloP30way mammalian rankscore	The rankscore is the ratio of the rank of the score over the total number of phyloP30way_mammalian scores in dbNSFP
rs dbSNP	rs number from dbSNP 151
PVS1 contrib	Null variant in a gene where loss of function (LOF) is a known mechanism of disease.
PS1 contrib	Same amino acid change as a previously established pathogenic variant regardless of nucleotide change.
PS2 contrib merged	De novo (both maternity and paternity confirmed) in a patient with the disease and no family history.
PS3 contrib	Well-established in vitro or in vivo functional studies supportive of a damaging effect on the gene or gene product.
PS4 contrib	The prevalence of the variant in affected individuals is significantly increased compared to the prevalence in controls.
PM1 contrib	Located in a mutational hot spot and/or critical and well-established functional domain (e.g. active site of an enzyme) without benign variation.
PM2 contrib	Absent from controls (or at extremely low frequency if recessive) (see Appendix A1) in Exome Sequencing Project, 1000 Genomes or ExAC.
PM4 contrib	Protein length changes due to in-frame deletions/insertions in a non-repeat region or stop-loss variants.
PM5 contrib	Novel missense change at an amino acid residue where a different missense change determined to be pathogenic has been seen before.
PP2 contrib	Missense variant in a gene that has a low rate of benign missense variation and where missense variants are a common mechanism of disease.
PP3 contrib	Multiple lines of computational evidence support a deleterious effect on the gene or gene product (conservation, evolutionary, splicing impact, etc).
PP5 contrib	Reputable source recently reports variant as pathogenic but the evidence is not available to the laboratory to perform an independent evaluation.
BS1 contrib	Allele frequency is greater than expected for disorder.
BS2 contrib	Observed in a healthy adult individual for a recessive (homozygous), dominant (heterozygous), or X-linked (hemizygous) disorder with full penetrance expected at an early age.
BS3 contrib	Well-established in vitro or in vivo functional studies show no damaging effect on protein function or splicing.
BP1 contrib	Missense variant in a gene for which primarily truncating variants are known to cause disease.
BP3 contrib	In-frame deletions/insertions in a repetitive region without a known function.
BP4 contrib	Multiple lines of computational evidence suggest no impact on gene or gene product (conservation, evolutionary, splicing impact, etc).
BP6 contrib	Reputable source recently reports variant as benign but the evidence is not available to the laboratory to perform an independent evaluation.
BP7 contrib	A synonymous (silent) variant for which splicing prediction algorithms predict no impact to the splice consensus sequence nor the creation of a new splice site AND the nucleotide is not highly conserved.
BA1 contrib	Allele frequency is above 5 % in the Exome Sequencing Project, 1000 Genomes, or ExAC.
ALFA Total AC	Allele count for each ALT allele for the total (global) across all populations.
ALFA Total AF	The ratio of the allele count for each ALT allele for the population over the total allele count for the population, including REF
CCR Percentile	Percentile ranges from 90 to 100. 100 represents complete constraint, the highest constrained region in the model.
CCR Synonymous variant Density	Synonymous variant density of the CCR region, computed from variants that are SNPs and do not change amino acids or stop/start codons. Multiple alleles at the same base pair are allowed.
CCR CpG	CpG dinucleotide density of the whole CCR region.
CCR cov score	The score of length scaled by coverage proportion at 10× for each base pair.
CCR residual	Raw residual value from the linear regression model.
CCR residual Percentile	Raw residual percentile, not weighted by proportion of exome represented.
CHASMplus P-value	P-value reflecting the statistical significance of obtaining the achieved or a higher CHASMplus score.
CHASMplus score	High scores reflect a greater likelihood that a mutation is a driver (scores range from 0.0 to 1.0).
CHASMplus transcript	Transcript ID.
CIViC description	Clinical Interpretation of Variants in Cancer (CIViC) description.
CIViC clinical Actionability score	Represents the accumulation of evidence.
CIViC diseases	Disease names.
CIViC ID	ID number to link in CIViC database.
CIViC gene description	Clinical Interpretation of Variants in Cancer (CIViC) gene description.
CScape coding score	Scores are p-values. Scores above 0.5 are predicted to be deleterious, while those below 0.5 are predicted to be neutral or benign. P-values close to the extremes (0 or 1) are the highest-confidence predictions that yield the highest accuracy.
CGC class	Driver class.
CGC inheritance	Somatic or germline inheritance.
CGC tumor type (Somatic)	Type of tumor according to location (caused by somatic mutations).
CGC tumor type (germline)	Type of tumor according to location (caused by germline mutations).
CGC link	Link to the CGC database.
Cancer gene Landscape class	Class. Oncogenes and tumor suppressor genes described by Vogelstein et al.
CVDKP IBS	Irritable bowel syndrome.
CVDKP CAD	Coronary artery disease.
CVDKP BMI	Body mass index and obesity.
CVDKP Atrial Fibrillation	Atrial Fibrillation.
CVDKP diabetes	Type 2 Diabetes.
DGI category	A gene category in DGIdb refers to a set of genes belonging to a group that is deemed potentially druggable.
DGI interaction	An interaction type describes the nature of the association between a particular gene and drug.
DGI name	Drug name.
DGI score	A higher score indicates that there is a greater interaction between the drug and the gene. The score depends on numbers of drug and gene partners, as well as number of supporting publications and sources.
DGI ChEMBL	ChEMBL ID. ChEMBL is a manually curated database of bioactive molecules with drug-like properties.
DGI PMID	Link to literature.
Ensembl regulatory build region	Regions that are predicted to regulate gene expression.
Ensembl regulatory build ensr	An ensembl stable ID consists of five parts; ENS(species)(object type)(identifier).(version).
ESS gene	Essential or non-essential genes.
ESS gene indispensability score	A probability prediction of the gene being essential.
ESS gene indispensability prediction	Essential (E) or loss-of-function tolerant (N) based on Gene_indispensability_score.
GRASP NHLBI	NHLBI key.
GRASP PMID	PubMed ID.
GRASP phenotype	Phenotype.
GWAS catalog disease	Disease/Trait.
GWAS catalog Odds Ratio/Beta Coeff	Odds Ratio/Beta Coeff.
GWAS catalog P-value	P-value.
GWAS catalog PMID	PubMed ID.
GWAS catalog initial sample	Initial sample.
GWAS catalog replication sample	Replication sample.
GWAS catalog risk allele	Risk allele.
GWAS catalog confidence interval	Confidence interval.
GO biological process name	GO biological process name.
GO biological process ID	GO biological process ID.
GO Cellular component name	GO Cellular component name.
GO Cellular component ID	GO Cellular component ID.
GO molecular function name	GO molecular function name.
GO molecular function ID	GO molecular function ID.
LitVar rs ID	Reference SNP ID.
LoFtool score	A percentile score for gene intolerance to functional change. The lower the score, the higher the gene's intolerance to functional change.
NCBIGene description	NCBIGene description.
NCBIGene entrez	NCBIGene entrez.
RVIS EVS	Residual variation intolerance score, a measure of intolerance to mutational burden. The higher the score, the more tolerant the gene is to mutational burden. Based on EVS (ESP6500) data.
RVIS Percentile EVS	The percentile rank of the gene based on RVIS. The higher the percentile, the more tolerant the gene is to mutational burden. Based on EVS (ESP6500) data.
RVIS FDR ExAC	A gene’s FDR p-value for preferential LoF depletion among ExAC.
RVIS ExAC	ExAC-based RVIS; setting ‘common’ MAF filter at 0.05 % in at least one of the six individual ethnic strata from ExAC.
RVIS Percentile ExAC	Genome-Wide percentile for the new ExAC-based RVIS; setting ‘common’ MAF filter at 0.05 % in at least one of the six individual ethnic strata from ExAC.
SpliceAI Acceptor gain score	Probability of the variant being splice-altering. Cutoffs for binary prediction are 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision).
SpliceAI Acceptor loss score	Probability of the variant being splice-altering. Cutoffs for binary prediction are 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision).
SpliceAI Donor gain score	Probability of the variant being splice-altering. Cutoffs for binary prediction are 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision).
SpliceAI Donor loss score	Probability of the variant being splice-altering. Cutoffs for binary prediction are 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision).
SpliceAI Acceptor gain position	Location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
SpliceAI Acceptor loss position	Location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
SpliceAI Donor gain position	Location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
SpliceAI Donor loss position	Location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
VCF INFO Phred	VCF info Phred.
VCF INFO filter	VCF info filter.
VCF INFO zygosity	VCF info Zygosity: hom (homozygous) or het (heterozygous).
VCF INFO Alternate reads	Count of alternate reads.
VCF INFO Total reads	Count of total reads.
VCF INFO AF	VCF info AF.
dbscSNV AdaBoost	AdaBoost score. If the score is > 0.6, it predicts that the splicing will be changed; otherwise it predicts that the splicing will not be changed.
dbscSNV random forest	Random forest score. If the score is > 0.6, it predicts that the splicing will be changed; otherwise it predicts that the splicing will not be changed.
gnomAD gene transcript	Gene transcript ID.
gnomAD gene Obv/Exp LoF	Observed/Expected for loss of function variants.
gnomAD gene Obv/Exp Mis	Observed/Expected for missense variants.
gnomAD gene Obv/Exp Syn	Observed/Expected for synonymous variants.
gnomAD gene LoF Z-score	Z-score for loss of function variants.
gnomAD gene Mis Z-score	Z-score for missense variants.
gnomAD gene Syn Z-score	Z-score for synonymous variants.
gnomAD gene pLI	The probability of being loss-of-function intolerant (intolerant of both heterozygous and homozygous LoF variants).
gnomAD gene pRec	The probability of being intolerant of homozygous, but not heterozygous, LoF variants.
gnomAD gene pNull	The probability of being tolerant of both heterozygous and homozygous LoF variants.
p(HI)	Estimated probability of haploinsufficiency of the gene.
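
Several descriptions in the table above state numeric decision rules (the SpliceAI cutoffs of 0.2, 0.5 and 0.8, the dbscSNV threshold of 0.6, the CScape boundary at 0.5) or simple ratios (allele frequency, rankscore). As a minimal sketch, and not the authors' preprocessing code, these rules could be expressed in Python as follows. All function and constant names are hypothetical. Treating the SpliceAI cutoff as inclusive and assigning a CScape score of exactly 0.5 to the neutral class are assumptions, since the table does not specify boundary behavior.

```python
from typing import Optional

# Thresholds quoted in the feature table; the constant names are illustrative.
SPLICEAI_CUTOFFS = {"high_recall": 0.2, "recommended": 0.5, "high_precision": 0.8}
DBSCSNV_CUTOFF = 0.6
CSCAPE_CUTOFF = 0.5


def spliceai_is_splice_altering(score: Optional[float],
                                mode: str = "recommended") -> Optional[bool]:
    """Binary call from one SpliceAI score (assumption: the cutoff is inclusive)."""
    if score is None:
        return None  # a missing annotation stays missing
    return score >= SPLICEAI_CUTOFFS[mode]


def dbscsnv_alters_splicing(score: Optional[float]) -> Optional[bool]:
    """dbscSNV AdaBoost / random forest rule: splicing changes if score > 0.6."""
    if score is None:
        return None
    return score > DBSCSNV_CUTOFF


def cscape_label(score: Optional[float]) -> Optional[str]:
    """CScape rule: above 0.5 is deleterious, below 0.5 is neutral or benign.

    Assumption: a score of exactly 0.5 falls into the neutral class.
    """
    if score is None:
        return None
    return "deleterious" if score > CSCAPE_CUTOFF else "neutral_or_benign"


def allele_frequency(alt_count: int, total_count: int) -> Optional[float]:
    """ALT allele count over the total allele count (REF included)."""
    if total_count <= 0:
        return None
    return alt_count / total_count


def rankscore(rank: int, n_scores: int) -> Optional[float]:
    """Rank of a score divided by the total number of scores."""
    if n_scores <= 0:
        return None
    return rank / n_scores


if __name__ == "__main__":
    # Made-up values, for illustration only.
    print(spliceai_is_splice_altering(0.35, mode="high_recall"))
    print(dbscsnv_alters_splicing(0.72))
    print(cscape_label(0.91))
    print(allele_frequency(12, 4800))
    print(rankscore(750, 1000))
```

Returning `None` for missing inputs keeps absent annotations distinguishable from negative calls, so a later imputation step can still decide how to handle them.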

References

1. Wadman, M. James Watson’s genome sequenced at high speed. Nature 2008;452:788–9. https://doi.org/10.1038/452788b.Search in Google Scholar PubMed

2. Gunning, AC, Fryer, V, Fasham, J, Crosby, AH, Ellard, S, Baple, EL, et al.. Assessing performance of pathogenicity predictors using clinically relevant variant datasets. J Med Genet 2021;58:547–55. https://doi.org/10.1136/jmedgenet-2020-107003.Search in Google Scholar PubMed PubMed Central

3. Qi, H, Zhang, H, Zhao, Y, Chen, C, Long, JJ, Chung, WK, et al.. MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun 2021;12:510. https://doi.org/10.1038/s41467-020-20847-0.Search in Google Scholar PubMed PubMed Central

4. Alarcon, JLC, Enriquez, JA, Sánchez-Cabo, F. Frequency Conservation Score (FCS): the power of conservation and allele frequency for variant pathogenic prediction. bioRxiv 2019:805051.10.1101/805051Search in Google Scholar

5. LeCun, Y, Bengio, Y, Hinton, G. Deep learning. Nature 2015;521:436–44. https://doi.org/10.1038/nature14539.Search in Google Scholar PubMed

6. Gorishniy, Y, Rubachev, I, Khrulkov, V, Babenko, A. Revisiting deep learning models for tabular data. Adv Neural Inf Process Syst 2021;34:18932–43.Search in Google Scholar

7. Zhu, X, Goldberg, AB. Introduction to semi-supervised learning. Cham, Switzerland: Springer Nature; 2022.Search in Google Scholar

8. Guillem, PE, Zurdo-Tabernero, M, Durón Figueroa, L, Canal-Alonso, Á, Hernández, G, González-Arrieta, A, et al.. Transformer-enhanced pathogenicity prediction with soft labels in a semi-supervised setup. In: International Conference on Practical Applications of Computational Biology & Bioinformatics. Cham, Switzerland: Springer; 2024:41–50 pp.10.1007/978-3-031-87873-2_5Search in Google Scholar

9. Alirezaie, N, Kernohan, KD, Hartley, T, Majewski, J, Hocking, TD. ClinPred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am J Hum Genet 2018;103:474–83. https://doi.org/10.1016/j.ajhg.2018.08.005.Search in Google Scholar PubMed PubMed Central

10. Ioannidis, NM, Rothstein, JH, Pejaver, V, Middha, S, McDonnell, SK, Baheti, S, et al.. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 2016;99:877–85. https://doi.org/10.1016/j.ajhg.2016.08.016.Search in Google Scholar PubMed PubMed Central

11. Dong, C, Wei, P, Jian, X, Gibbs, R, Boerwinkle, E, Wang, K, et al.. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet 2015;24:2125–37. https://doi.org/10.1093/hmg/ddu733.Search in Google Scholar PubMed PubMed Central

12. Liu, Y, Zhang, T, You, N, Wu, S, Shen, N. MAGPIE: accurate pathogenic prediction for multiple variant types using machine learning approach. Genome Med 2024;16:3. https://doi.org/10.1186/s13073-023-01274-4.Search in Google Scholar PubMed PubMed Central

13. Carter, H, Douville, C, Stenson, PD, Cooper, DN, Karchin, R. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genom 2013;14:1–16. https://doi.org/10.1186/1471-2164-14-s3-s3.Search in Google Scholar

14. Jagadeesh, KA, Wenger, AM, Berger, MJ, Guturu, H, Stenson, PD, Cooper, DN, et al.. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet 2016;48:1581–6. https://doi.org/10.1038/ng.3703.Search in Google Scholar PubMed

15. Reva, B, Antipin, Y, Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011;39:e118. https://doi.org/10.1093/nar/gkr407.Search in Google Scholar PubMed PubMed Central

16. Sundaram, L, Gao, H, Padigepati, SR, McRae, JF, Li, Y, Kosmicki, JA, et al.. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 2018;50:1161–70. https://doi.org/10.1038/s41588-018-0167-z.Search in Google Scholar PubMed PubMed Central

17. Schmidt, A, Röner, S, Mai, K, Klinkhammer, H, Kircher, M, Ludwig, KU. Predicting the pathogenicity of missense variants using features derived from AlphaFold2. Bioinformatics 2023;39:btad280. https://doi.org/10.1093/bioinformatics/btad280.Search in Google Scholar PubMed PubMed Central

18. Vaser, R, Adusumalli, S, Leng, SN, Sikic, M, Ng, PC. SIFT missense predictions for genomes. Nat Protoc 2016;11:1–9. https://doi.org/10.1038/nprot.2015.123.Search in Google Scholar PubMed

19. Malhis, N, Jacobson, M, Jones, SJ, Gsponer, J. LIST-S2: taxonomy based sorting of deleterious missense mutations across species. Nucleic Acids Res 2020;48:W154–61. https://doi.org/10.1093/nar/gkaa288.Search in Google Scholar PubMed PubMed Central

20. Quang, D, Chen, Y, Xie, X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 2015;31:761–3. https://doi.org/10.1093/bioinformatics/btu703.Search in Google Scholar PubMed PubMed Central

21. Landrum, MJ, Lee, JM, Benson, M, Brown, G, Chao, C, Chitipiralla, S, et al.. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 2016;44:D862–8. https://doi.org/10.1093/nar/gkv1222.Search in Google Scholar PubMed PubMed Central

22. Slatko, BE, Gardner, AF, Ausubel, FM. Overview of next-generation sequencing technologies. Curr Protoc Mol Biol 2018;122:e59. https://doi.org/10.1002/cpmb.59.Search in Google Scholar PubMed PubMed Central

23. Di Tommaso, P, Chatzou, M, Floden, EW, Barja, PP, Palumbo, E, Notredame, C. Nextflow enables reproducible computational workflows. Nat Biotechnol 2017;35:316–9. https://doi.org/10.1038/nbt.3820.Search in Google Scholar PubMed

24. Sarmento, C, Guimarães, S, Kılınç, GM, Götherström, A, Pires, AE, Ginja, C, et al.. A study on Burrows-Wheeler Aligner’s performance optimization for Ancient DNA mapping. In: Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021). Cham, Switzerland: Springer; 2022:105–14 pp.10.1007/978-3-030-86258-9_11Search in Google Scholar

25. McLaren, W, Gil, L, Hunt, SE, Riat, HS, Ritchie, GRS, Thormann, A, et al.. The Ensembl variant effect predictor. Genome Biol 2016;17:122. https://doi.org/10.1186/s13059-016-0974-4.Search in Google Scholar PubMed PubMed Central

26. Pagel, KA, Kim, R, Moad, K, Busby, B, Zheng, L, Tokheim, C, et al.. Integrated Informatics analysis of cancer-Related variants. JCO Clinical Cancer Inf 2020;4:310–7. https://doi.org/10.1200/cci.19.00132.Search in Google Scholar PubMed PubMed Central

27. Xavier, A, Scott, RJ, Talseth-Palmer, BA. TAPES: a tool for assessment and prioritisation in exome studies. PLoS Comput Biol 2019;15:1–9. https://doi.org/10.1371/journal.pcbi.1007453.Search in Google Scholar PubMed PubMed Central

28. Rentzsch, P, Schubach, M, Shendure, J, Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019;47:D886–94. https://doi.org/10.1093/nar/gky1016.Search in Google Scholar PubMed PubMed Central

29. Shihab, HA, Gough, J, Cooper, DN, Stenson, PD, Barker, GL, Edwards, KJ, et al.. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat 2013;34:57–65. https://doi.org/10.1002/humu.22225.Search in Google Scholar PubMed PubMed Central

30. Shihab, HA, Rogers, MF, Gough, J, Mort, M, Cooper, DN, Day, INM, et al.. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 2015;31:1536–43. https://doi.org/10.1093/bioinformatics/btv009.Search in Google Scholar PubMed PubMed Central

31. Karczewski, KJ, Francioli, LC, Tiao, G, Cummings, BB, Alföldi, J, Wang, Q, et al.. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581:434–43. https://doi.org/10.1038/s41586-020-2308-7.Search in Google Scholar PubMed PubMed Central

32. Fadista, JA, Oskolkov, N, Hansson, O, Groop, L. LoFtool: a gene intolerance score based on loss-of-function variants in 60 706 individuals. Bioinformatics 2016;33:471–4. https://doi.org/10.1093/bioinformatics/btv602.Search in Google Scholar PubMed

33. Jaganathan, K, Kyriazopoulou Panagiotopoulou, S, McRae, JF, Darbandi, SF, Knowles, D, Li, YI, et al.. Predicting splicing from primary sequence with deep learning. Cell 2019;176:535–48.e24. https://doi.org/10.1016/j.cell.2018.12.015.Search in Google Scholar PubMed

34. Vilella, AJ, Severin, J, Ureta-Vidal, A, Heng, L, Durbin, R, Birney, E. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 2009;19:327–35. https://doi.org/10.1101/gr.073585.107.Search in Google Scholar PubMed PubMed Central

35. Thormann, A, Halachev, M, McLaren, W, Moore, DJ, Svinti, V, Campbell, A, et al.. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat Commun 2019;10:2373. https://doi.org/10.1038/s41467-019-10016-3.Search in Google Scholar PubMed PubMed Central

36. Piñero, J, Bravo, À, Queralt-Rosinach, N, Gutiérrez-Sacristán, A, Deu-Pons, J, Centeno, E, et al.. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res 2017;45:D833–9. https://doi.org/10.1093/nar/gkw943.Search in Google Scholar PubMed PubMed Central

37. Chunn, LM, Nefcy, DC, Scouten, RW, Tarpey, RP, Chauhan, G, Lim, MS, et al.. Mastermind: a comprehensive genomic association search engine for empirical evidence curation and genetic variant interpretation. Front Genet 2020;11:577152. https://doi.org/10.3389/fgene.2020.577152.Search in Google Scholar PubMed PubMed Central

38. Ashburner, M, Ball, CA, Blake, JA, Botstein, D, Butler, H, Cherry, JM, et al.. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9. https://doi.org/10.1038/75556.Search in Google Scholar PubMed PubMed Central

39. Yeo, G, Burge, CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 2004;11:377–94. https://doi.org/10.1089/1066527041410418.Search in Google Scholar PubMed

40. Pertea, M, Lin, X, Salzberg, SL. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res 2001;29:1185–90. https://doi.org/10.1093/nar/29.5.1185.Search in Google Scholar PubMed PubMed Central

41. Liu, X, Jian, X, Boerwinkle, E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat 2011;32:894–9. https://doi.org/10.1002/humu.21517.Search in Google Scholar PubMed PubMed Central

42. Chen, S, Francioli, LC, Goodrich, JK, Collins, RL, Kanai, M, Wang, Q, et al.. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 2024;625:92–100. https://doi.org/10.1038/s41586-023-06045-0.Search in Google Scholar PubMed PubMed Central

43. del Toro, N, Shrivastava, A, Ragueneau, E, Meldal, B, Combe, C, Barrera, E, et al.. The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res 2021;50:D648–53. https://doi.org/10.1093/nar/gkab1006.Search in Google Scholar PubMed PubMed Central

44. Rogers, MF, Shihab, HA, Gaunt, TR, Campbell, C. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Sci Rep 2017;7:11597. https://doi.org/10.1038/s41598-017-11746-4.Search in Google Scholar PubMed PubMed Central

45. Jian, X, Boerwinkle, E, Liu, X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res 2014;42:13534–44. https://doi.org/10.1093/nar/gku1206.Search in Google Scholar PubMed PubMed Central

46. Tokheim, C, Karchin, R. CHASMplus reveals the scope of somatic missense mutations driving human cancers. Cell Syst 2019;9:9–23.e8. https://doi.org/10.1016/j.cels.2019.05.005.Search in Google Scholar PubMed PubMed Central

47. Griffith, M, Spies, NC, Krysiak, K, McMichael, JF, Coffman, AC, Danos, AM, et al.. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat Genet 2017;49:170–4. https://doi.org/10.1038/ng.3774.Search in Google Scholar PubMed PubMed Central

48. Tate, JG, Bamford, S, Jubb, HC, Sondka, Z, Beare, DM, Bindal, N, et al.. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res 2019;47:D941–7. https://doi.org/10.1093/nar/gky1015.Search in Google Scholar PubMed PubMed Central

49. Tamborero, D, Rubio-Perez, C, Deu-Pons, J, Schroeder, MP, Vivancos, A, Rovira, A, et al.. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med 2018;10:25. https://doi.org/10.1186/s13073-018-0531-8.Search in Google Scholar PubMed PubMed Central

50. Sollis, E, Mosaku, A, Abid, A, Buniello, A, Cerezo, M, Gil, L, et al.. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res 2023;51:D977–85. https://doi.org/10.1093/nar/gkac1010.Search in Google Scholar PubMed PubMed Central

51. Leslie, R, O’Donnell, CJ, Johnson, AD. GRASP: analysis of genotype-phenotype results from 1,390 genome-wide association studies and corresponding open access database. Bioinformatics 2014;30:i185–94. https://doi.org/10.1093/bioinformatics/btu273.Search in Google Scholar PubMed PubMed Central

52. Huang, YF, Gulko, B, Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat Genet 2017;49:618–24. https://doi.org/10.1038/ng.3810.Search in Google Scholar PubMed PubMed Central

53. Petrovski, S, Wang, Q, Heinzen, EL, Allen, AS, Goldstein, DB. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet 2013;9:e1003709. https://doi.org/10.1371/journal.pgen.1003709.Search in Google Scholar PubMed PubMed Central

54. Gene [Internet]. Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine (US); 2004 [cited 2024 Sep 5]. Available from: https://www.ncbi.nlm.nih.gov/gene/ Search in Google Scholar

55. Zerbino, DR, Wilder, SP, Johnson, N, Juettemann, T, Flicek, PR. The Ensembl regulatory build. Genome Biol 2015;16:56. https://doi.org/10.1186/s13059-015-0621-5.Search in Google Scholar PubMed PubMed Central

56. Sherry, ST, Ward, MH, Kholodov, M, Baker, J, Phan, L, Smigielski, E, et al.. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11. https://doi.org/10.1093/nar/29.1.308.Search in Google Scholar PubMed PubMed Central

57. Niknafs, N, Kim, D, Kim, R, Diekhans, M, Ryan, M, Stenson, PD, et al.. MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum Genet 2013;132:1235–43. https://doi.org/10.1007/s00439-013-1325-0.Search in Google Scholar PubMed PubMed Central

58. Allot, A, Peng, Y, Wei, CH, Lee, K, Phan, L, Lu, Z. LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 2018;46:W530–6. https://doi.org/10.1093/nar/gky355.Search in Google Scholar PubMed PubMed Central

59. Richards, S, Aziz, N, Bale, S, Bick, D, Das, S, Gastier-Foster, J, et al.. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of medical genetics and genomics and the association for molecular pathology. Genet Med 2015;17:405–24. https://doi.org/10.1038/gim.2015.30.Search in Google Scholar PubMed PubMed Central

60. Ghosh, R, Harrison, SM, Rehm, HL, Plon, SE, Biesecker, LG. On behalf of ClinGen sequence variant interpretation working group. updated recommendation for the benign stand-alone ACMG/AMP criterion. Hum Mutat 2018;39:1633–41.10.1002/humu.23642Search in Google Scholar PubMed PubMed Central

61. Devlin, J, Chang, MW, Lee, K, Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv 2018:181004805.Search in Google Scholar

62. Lee, D. Pseudo-label : the simple and efficient semi-supervised learning method for deep neural networks. In: ICML 2013 Workshop: Challenges in Representation Learning. Atlanta, GA: Workshop Proceedings; 2013:1–6 pp.Search in Google Scholar

63. Xie, Q, Dai, Z, Hovy, E, Luong, T, Le, Q. Unsupervised data augmentation for consistency training. Adv Neural Inf Process Syst 2020;33:6256–68.Search in Google Scholar

64. Fan, Y, Kukleva, A, Dai, D, Schiele, B. Revisiting consistency regularization for semi-supervised learning. Int J Comput Vis 2023;131:626–43. https://doi.org/10.1007/s11263-022-01723-4.Search in Google Scholar

65. Loshchilov, I, Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv 2017:171105101.Search in Google Scholar

66. Srivastava, N, Hinton, G, Krizhevsky, A, Sutskever, I, Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–58.Search in Google Scholar

67. Garbin, C, Zhu, X, Marques, O. Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimed Tool Appl 2020;79:12777–815. https://doi.org/10.1007/s11042-019-08453-9.Search in Google Scholar

68. Szegedy, C, Vanhoucke, V, Ioffe, S, Shlens, J, Wojna, Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). Las Vegas, NV: IEEE Computer Society; 2016:2818–26 pp.10.1109/CVPR.2016.308Search in Google Scholar

69. Zhang, CB, Jiang, PT, Hou, Q, Wei, Y, Han, Q, Li, Z, et al.. Delving deep into label smoothing. IEEE Trans Image Process 2021;30:5984–96. https://doi.org/10.1109/tip.2021.3089942.Search in Google Scholar

Received: 2024-10-18

Accepted: 2025-03-19

Published Online: 2025-06-23

This work is licensed under the Creative Commons Attribution 4.0 International License.
