Assessing model performance in Alzheimer's disease classification: The impact of data imbalance on fine-tuned vision transformers and CNN architectures

Hassan Almalki; Alaa O. Khadidos; Nawaf Alhebaishi

doi:10.1515/jisys-2024-0406

Article Open Access

Published/Copyright: December 4, 2025

From the journal Journal of Intelligent Systems Volume 34 Issue 1

Abstract

Problem: Data imbalance in medical datasets poses significant challenges for the performance of machine learning models, particularly in classifying Alzheimer’s disease (AD). Aim: This study aims to investigate the impact of the data ratio on model performance using both balanced and imbalanced datasets. Methods: We employed two distinct datasets: a balanced set of 34,000 images created through augmentation techniques and an inherently imbalanced set of 6,400 images, both comprising four classes. To evaluate model performance, we utilized three state-of-the-art models: fine-tuned vision transformer (FT-ViT), fine-tuned convolutional neural network (FT-CNN), and fine-tuned swin transformer (FT-Swin). Results: The FT-ViT model achieved an impressive 99% accuracy on the imbalanced dataset and 96% on the balanced dataset. The FT-CNN model attained 97% accuracy on the imbalanced dataset and 90% on the balanced dataset, while the FT-Swin model exhibited a performance disparity, achieving 79% accuracy on the balanced dataset and 90% on the imbalanced dataset. Conclusion: Our findings demonstrate that careful model selection, fine-tuning, and hyperparameter optimization can lead to high performance on imbalanced datasets without relying solely on artificial balancing methods. This approach offers promising implications for AD classification and potentially other medical imaging applications facing similar data imbalance challenges.

Keywords: Alzheimer’s disease; classification; vision transformers; data imbalance; medical imaging

1 Introduction

Alzheimer’s disease (AD) is a well-known neurodegenerative condition that worsens progressively over time. It predominantly occurs in older adults, leading to memory loss and cognitive decline, and gradually progresses to a complete inability to perform daily tasks [1]. Studies have shown that amyloid-beta plaques accumulate in the brain together with neurofibrillary tangles. This accumulation leads to neuronal death, brain atrophy, and dementia [2]. The exact cause of AD remains unclear, but various environmental, genetic, and lifestyle factors can contribute to it. Researchers believe that mutations in the PSEN1, APP, and PSEN2 genes can lead to AD, as they are associated with the familial forms of the disease [3]. Furthermore, risk factors including age, family history, and specific genetic markers such as APOE are linked to a much higher risk of developing the disease [4].

The traditional methods of diagnosing and classifying AD heavily relied on clinical assessments and neuroimaging techniques. Among clinical assessments, cognitive tests and medical history evaluations are the most common methods [5]. Brain imaging technology has gained much popularity over time due to its efficiency in the classification of AD. Magnetic resonance imaging (MRI) gives detailed images of the brain’s anatomy. It aids in the early diagnosis of brain atrophy and mutations that could be associated with AD [6]. Computed tomography (CT) scan method utilizes X-rays to generate enhanced cross-sectional neuroimages. These images reveal significant brain atrophies to rule out other potential causes of dementia [7]. Positron emission tomography (PET) scans work by assessing the complete brain activity. This method detects the changes in glucose metabolism or amyloid-beta accumulation that are the key features of AD [8]. The functional magnetic resonance imaging (fMRI) method evaluates how the brain and its tissues work and generates neuroimages by measuring the changes in blood flow (BOLD) associated with neuronal activation. This technique provides insights into brain activation and connectivity patterns [9]. Another popular method, single-photon emission computed tomography (SPECT), detects cellular or chemical mutational changes in the brain that lead to AD by utilizing radioactive tracer compounds [10].

Although these methods are considered reliable for AD classification, they have some limitations as well. MRI is one of the most expensive brain imaging techniques, and it may not detect the early-stage brain changes associated with AD as sensitively as PET scans [11]. CT scans may not provide detailed information because this method is less sensitive than MRI in detecting early brain changes related to AD [12]. Although PET scans give accurate results, this method is expensive [13], needs specialized equipment, and is not widely accessible for clinical use. Interpreting the results of PET scans also requires expertise. The fMRI technique is computationally intensive and requires specialized analysis methods. Moreover, this method may not be able to detect the subtle changes in normal brain activity during the early stage of AD [14]. The SPECT method does not offer the same level of detail as PET scans. It is also limited in its ability to provide accurate and precise anatomical information about brain structure and function [15]. To overcome these limitations, advanced machine learning and deep learning methods were introduced to diagnose and classify AD using data in various forms, such as brain images, handwriting, and psychological evaluation reports.

Support vector machines (SVMs) have been extensively utilized for early-stage diagnosis and classification of AD by analyzing neuroimaging data from MRI and PET scans. Studies have shown that SVMs are practical approaches to identifying AD-associated patterns in the brain structure and function. A classification method based on the SVM algorithm analyzed the whole brain’s anatomical structure and function through MRI images and achieved high accuracy in differentiating AD patients from healthy controls [16]. Random forest (RF) algorithms, which aggregate the results of multiple decision trees, have also delivered remarkable accuracy in classifying AD. An RF model can effectively distinguish between various stages of AD – cognitively normal (CN), early mild cognitive impairment, late mild cognitive impairment, and AD by training on clinical and neuroimaging features from prominent datasets like the Alzheimer’s disease neuroimaging initiative (ADNI) [17]. The K-nearest neighbors (KNN) algorithm is one of the most straightforward classification techniques, identifying and classifying AD by considering the closest training examples in the feature space. When compared, the KNN algorithm has shown significantly higher efficiency in classifying different stages of Alzheimer’s on the ADNI dataset than other machine learning methods, such as decision trees and the Naïve Bayes approach [18].

Logistic regression (LR) models have been employed to predict the probability of AD based on various biomarkers and clinical features. These models can provide insights into the association between different risk factors and the likelihood of developing AD. The LR model can also classify AD patients based on plasma signaling proteins [19]. Principal component analysis has been used to reduce the dimensionality of neuroimaging data, making it easier to identify significant features for AD classification. By extracting the most informative principal components, researchers can focus on the most relevant brain regions and patterns associated with the disease [20]. Artificial neural networks (ANNs) have been employed to analyze complex relationships in neuroimaging and genetic data for AD classification [21]. These models can capture nonlinear patterns and interactions that may be difficult to detect using traditional statistical methods. ANNs have been used to classify AD patients based on EEG data, achieving high accuracy and sensitivity. Recurrent neural networks (RNNs) have been utilized for decades to capture temporal dependencies, specifically in longitudinal data, aiding in the classification of AD progression. By modeling the dynamic changes in brain structure and function over time, RNNs have provided valuable information about the root cause and spread of the disease. This approach also improves the accuracy of AD classification, diagnosis, and prognosis [22].

Despite the success of the majority of ML and DL methods in AD classification, they often require large labeled datasets and significant computational resources. They can be susceptible to overfitting, mainly when the available data are limited. These models may also face interpretability, scalability, and dataset bias issues [23]. Researchers continue to explore ways to address these challenges and develop reliable and generalizable models for AD detection and monitoring.

The exponential growth in medical imaging data has created significant challenges in healthcare systems, particularly in achieving timely and accurate diagnoses. Traditional manual examination of medical images is time-intensive, susceptible to human fatigue, prone to inter-observer variability, and constrained by the limited availability of expert radiologists. While the current automated systems and deep learning approaches offer potential solutions, they face substantial challenges in classification accuracy, diverse image quality handling, generalization across different imaging equipment, and result interpretability, often demanding extensive computational resources. To address these limitations, our research proposes a comprehensive multi-model deep learning framework that integrates three state-of-the-art architectures: vision transformer (ViT), convolutional neural network (CNN), and swin transformer. This framework aims to achieve higher classification accuracy while maintaining computational efficiency, enhancing result interpretability, and providing a scalable solution for medical image analysis. Our approach focuses on combining the strengths of different architectures to overcome existing limitations, ultimately assisting healthcare professionals in achieving faster and more accurate diagnoses while improving the accessibility of medical image analysis in resource-limited settings. Here are the contributions of our research work:

Deep learning model development: We developed and evaluated advanced deep learning models, including fine-tuned ViTs, CNNs, and Swin, for accurate classification of AD stages using medical image data.
Impact of data imbalance: We investigated how data imbalance affects the performance of deep learning models in AD classification, offering insights into model robustness and accuracy.
Comparative performance analysis: We conducted a comparative analysis of ViT, CNN, and swin transformer architectures, highlighting their respective strengths and limitations in classifying AD stages.
Effectiveness of data augmentation: We explored the role of data augmentation techniques in improving classification accuracy, particularly for imbalanced datasets, demonstrating their impact on model performance.
Achieving high accuracy: We achieved high classification accuracy (above 95%) for AD stages on both balanced and imbalanced datasets, showing the effectiveness of fine-tuned models in challenging conditions.

The rest of the article is structured as follows: Section 2 describes the related work. Section 3 outlines the materials and methods employed in the study. Section 4 presents the results with detailed explanations. Section 5 discusses the limitations of the proposed work. Finally, Section 6 concludes the article.

2 Related works

In the study by Huang and Li [24], the authors proposed the Resizer swin transformer model, which classifies AD by combining the power of a CNN and the swin transformer model. This model successfully classified 94.01% of the AD classes on the Australian imaging, biomarker & lifestyle (AIBL) dataset. However, the study was limited by the use of a small dataset, which may not accurately reflect the larger population. Xin et al. [25] developed a deep learning model called efficient Conv-Swin Net, utilizing a two-stream structure in association with a 2.5D-subject method to encode 3D information. The model achieved a balanced accuracy score of 92.8% on the AIBL dataset. However, the model was trained on a single dataset and may not generalize well to other datasets. A novel 3D hybrid compact convolutional transformer (HCCT) model that combines CNNs and ViTs to classify AD from 3D MRI scans was introduced in [26]. The ADNI dataset was used to evaluate the model. Results showed that it achieved high performance compared to advanced CNN and transformer-based methods, demonstrating its potential for accurate and reliable AD diagnosis. However, the study did not compare the HCCT model with other deep learning architectures or discuss potential limitations in its application.

Alom et al. [27] constructed a unique model based on a deep learning approach, also employing a Swin Transformer to classify AD. The model was able to classify 95.12% AD classes on the ADNI dataset correctly. Kim et al. [28] proposed a novel approach called PVTAD that uses a pyramid ViT to classify AD and cognitively normal (CN) cases from structural MRI data. The method achieved high accuracy (97.7%) and robustness in classifying AD cases but still had some dataset biases and interpretability issues.

Hong et al. [29] used a cross-domain transfer learning method to classify AD and CN cases using 18F-Florbetaben brain images. The method involved a ViT architecture and achieved an accuracy score of 80%, RR of 60%, and PR of 75% for AD classification. Another study proposed a deep learning framework that utilizes the power of Swin Transformers to automatically learn relevant features from brain imaging data [30]. The framework uses sparse diffusion measures to enhance the model’s efficiency, and it performed well in diagnosing and classifying AD. Yin et al. [31] proposed a novel SMIL-DeiT model that combines multiple instance learning and self-supervised learning with a ViT architecture for early AD classification. The model achieved high accuracy (93.2%) and robustness in classifying AD cases. Another study [32] proposed a ViT model that utilized sMRI data to predict the conversion of mild cognitive impairment (MCI) to AD. The model was pretrained on the ImageNet dataset, where it gave excellent classification results, and was later fine-tuned on the ADNI dataset. Compared with other conventional methods, this model achieved the highest accuracy score (83.27%). Many other studies utilize CNN, swin transformer, and ViT models for the early diagnosis and classification of AD. Some of these studies are listed in Table 1.

Table 1

State of the art studies related to proposed work

Paper	Research objective	Methodology used	Results
[33]	This study aimed to diagnose AD from sMRI data using the recursive feature elimination (RFE) method	RFE based on universum support vector machine (USVM)	Achieved high accuracy in classifying AD cases (CN vs AD: 100%, MCI vs AD: 73.68%, CN vs MCI: 100%)
[34]	In this study, the authors aimed to develop a multi-model method for AD classification	Combined swin transformer with enhanced EfficientNet and CoAtNet using a two-stream structure to encode 3D brain imaging data	Outperformed other approaches in binary classification achieving high accuracy
[35]	The objective of this study was to early detect and classify AD using structural MRI data	Dual attention, multi instance deep learning-based model	Achieved high accuracy on the ADNI dataset
[36]	This research aimed to classify AD using multi slice and multi-model ensemble learning architecture	Deep learning approach with 2D CNNs	Achieved 90.36% accuracy on the ADNI dataset. This method was considered a reliable tool for the AD classification
[37]	This research was conducted to classify AD and perform AD’s regional analysis	Single-slice, patch-based supervised switching autoencoders (SSAs)	Achieved the highest classification accuracy of 90.01% on the ADNI dataset

Lakhan et al. [38] proposed a federated deep CNN for Alzheimer’s detection (FDCNN-AS), tailored to manage AD data across diverse age groups in a federated learning environment. The framework integrates multiple data sources, including MRI, PET, tomography, blood tests, and cognitive assessments, ensuring secure, privacy-preserving training across labs and clinics. FDCNN-AS addresses key challenges such as age-specific detection, disease progression, and the distinction between benign and malignant cases in Alzheimer’s detection.

Ibrahim and Mohammed [39] provide a comprehensive review of AI-based approaches for diagnosing Parkinson’s disease (PD), highlighting deep learning and machine learning methods used to enhance early detection. Given the progressive nature of PD and its overlapping symptoms with other neurodegenerative disorders, early identification is essential. Their article discusses current advances and future opportunities for data-driven AI methods in PD diagnosis, serving as a valuable resource for researchers developing AI-based predictive models for PD.

3 Methodology

The study employed a comprehensive approach to evaluate the performance of three deep learning models (FT-ViT, FT-CNN, and FT-Swin) in classifying four stages of AD using both balanced and imbalanced datasets. To ensure robust performance evaluation, cross-validation was implemented, with each model trained and tested across five folds of the data. The study addressed the challenge of data imbalance in medical imaging by comparing model performance on a balanced dataset (n = 34,000) and an imbalanced dataset (n = 6,400), providing insights into each model’s robustness to real-world data distributions. Hyperparameter configurations were standardized across models to facilitate fair comparison, with learning rate, batch size, and number of epochs kept constant. The methodology included a systematic approach to data preprocessing, model training, and performance evaluation, ensuring consistency across all experiments. The best-performing model was determined based on its accuracy on the imbalanced dataset, reflecting its potential effectiveness in real-world clinical scenarios where perfect data balance is rarely achieved. This design allowed for a comparative analysis of model performance, not only in terms of overall accuracy but also in terms of robustness to data imbalance, a critical factor in medical image classification tasks. Algorithm 1 details these steps.

Algorithm 1: Advanced Alzheimer’s study
D_b ← create_balanced_dataset(n_b = 34,000, |C| = 4);   // n_b: number of balanced images
D_i ← load_imbalanced_dataset(n_i = 6,400, |C| = 4);    // n_i: number of imbalanced images
M ← { m_1: {model: FT_ViT, H_m: {learning_rate: 0.001, batch_size: 32, epochs: 50}},
      m_2: {model: FT_CNN, H_m: {learning_rate: 0.001, batch_size: 32, epochs: 50}},
      m_3: {model: FT_Swin, H_m: {learning_rate: 0.001, batch_size: 32, epochs: 50}} }
Function preprocess_dataset(D):
    D_preprocessed ← preprocess(D);
    return D_preprocessed;
Function cross_validate(m, D, k = 5):
    accuracies ← [ ];
    folds ← split_into_folds(D, k);
    for i ← 0 to k − 1 do
        D_train ← concatenate(folds[:i] + folds[i+1:]);
        D_val ← folds[i];
        θ_m ← m.train(D_train, H_m);
        A_val ← m.evaluate(D_val, θ_m);
        accuracies.append(A_val);
    end
    return mean(accuracies);
Function train_and_evaluate(m ∈ M, D ∈ {D_b, D_i}, use_cross_validation = False):
    D_preprocessed ← preprocess(D);
    if use_cross_validation then
        A_{m,D} ← cross_validate(
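The cross_validate routine of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: `train_fn` and `eval_fn` are hypothetical callables standing in for `m.train` and `m.evaluate`, and the toy majority-label "model" exists only so that the example runs without a deep learning framework.

```python
from statistics import mean
from typing import Callable, List, Sequence


def split_into_folds(data: Sequence, k: int) -> List[list]:
    """Split data into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(len(data), k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(data[start:start + size]))
        start += size
    return folds


def cross_validate(train_fn: Callable, eval_fn: Callable,
                   data: Sequence, k: int = 5) -> float:
    """Return the mean validation score over k folds."""
    folds = split_into_folds(data, k)
    scores = []
    for i in range(k):
        # Train on every fold except fold i, validate on fold i.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        val = folds[i]
        params = train_fn(train)
        scores.append(eval_fn(val, params))
    return mean(scores)


# Hypothetical stand-ins: "training" memorises the majority label,
# "evaluation" is plain accuracy against that label.
def train_majority(samples):
    labels = [y for _, y in samples]
    return max(set(labels), key=labels.count)


def accuracy(samples, majority_label):
    return sum(y == majority_label for _, y in samples) / len(samples)


# Eight samples of class 0 followed by two samples of class 1.
toy = [(i, 0) for i in range(8)] + [(i, 1) for i in range(2)]
cv_score = cross_validate(train_majority, accuracy, toy, k=5)
print(cv_score)
```

Because the toy data are not shuffled, the last fold contains only the minority class and scores zero. This is one reason a stratified split is usually preferred for imbalanced data.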

3.1 Dataset description

3.1.1 Imbalanced dataset description

The imbalanced dataset used in this study consists of 6,400 MRI brain scans sourced from Kaggle. The images are categorized into four classes based on the severity of AD: normal, very mild, mild, and moderate. Each image is stored in JPEG format and labeled according to its respective class. This dataset was utilized in its original form without any data augmentation, resulting in a class distribution that reflects the natural imbalance in the prevalence of these stages of AD.

3.1.2 Balanced dataset using albumentations

To address the class imbalance present in the original dataset, data augmentation techniques were applied, resulting in a balanced dataset of 33,984 images. The Albumentations library was used for this purpose, implementing various transformations such as resizing, shifting, scaling, rotating, RGB shifting, random brightness/contrast adjustment, and color jittering. These transformations not only increased the number of images but also introduced variations that helped in creating a more robust dataset for training machine learning models. The balanced dataset is expected to improve the model’s ability to accurately classify each stage of AD by providing a more equitable distribution of images across all classes.

Table 2 summarizes the distribution of images across the four classes in both the imbalanced and balanced datasets. The imbalanced dataset contains a total of 6,400 images, with a significant class imbalance, particularly in the moderate demented class, which has only 64 images. To mitigate this imbalance, data augmentation techniques were applied, resulting in a balanced dataset with 33,984 images, where each class is more evenly represented. This balanced distribution is crucial for improving the model’s performance in classifying the various stages of AD.

Table 2

Distribution of images in the imbalanced and balanced datasets

Class	Imbalanced dataset	Balanced dataset
Normal	3,200	9,600
Very mild	2,240	8,960
Mild	896	8,960
Moderate	64	6,464
Total	6,400	33,984
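The counts in Table 2 can be used to quantify how much each class was expanded by augmentation. The short script below is only an arithmetic check on the table values. The variable names are illustrative and nothing here reproduces the authors' augmentation pipeline.

```python
# Class counts copied from Table 2.
imbalanced = {"Normal": 3200, "Very mild": 2240, "Mild": 896, "Moderate": 64}
balanced = {"Normal": 9600, "Very mild": 8960, "Mild": 8960, "Moderate": 6464}


def class_shares(counts):
    """Fraction of the dataset taken up by each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# How many times larger each class is after augmentation.
growth = {name: balanced[name] / imbalanced[name] for name in imbalanced}

# Ratio between the largest and the smallest class, before and after.
imbalance_ratio_before = max(imbalanced.values()) / min(imbalanced.values())
imbalance_ratio_after = max(balanced.values()) / min(balanced.values())

before, after = class_shares(imbalanced), class_shares(balanced)
for name in imbalanced:
    print(f"{name:10s} x{growth[name]:.0f}  share {before[name]:.1%} -> {after[name]:.1%}")
print(f"largest/smallest class: {imbalance_ratio_before:.0f} -> {imbalance_ratio_after:.2f}")
```

The moderate class grows 101-fold (64 to 6,464 images), while the normal class only triples. The ratio between the largest and smallest class drops from 50 to about 1.5.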

In the preprocessing phase, the dataset was split into training, validation, and testing sets using a stratified approach to maintain class distribution. Specifically, 5,120 images were allocated for training, 1,280 for validation, and 960 for testing in the imbalanced dataset. Similarly, 27,187 images were allocated for training, 6,797 for validation, and 5,098 for testing in the balanced dataset. The ImageDataGenerator class from TensorFlow’s Keras API was used to handle data augmentation and preprocessing. Its flow_from_dataframe method was employed to generate batches of images and their corresponding labels, with a target image size of 224 × 224 pixels. The data were not shuffled during generation, so the original order was preserved. Preprocessing was done using the tf.keras.applications.mobilenet_v2.preprocess_input function to prepare the images for model training.
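A stratified split of this kind can be sketched with the standard library alone. The 80/20 training/validation fraction is an assumption inferred from the reported counts (5,120 and 1,280 of 6,400 images); the text does not state it. The function name `stratified_split` and the fixed seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so that each class keeps roughly its share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Labels laid out with the class counts of the imbalanced dataset (Table 2).
labels = (["Normal"] * 3200 + ["Very mild"] * 2240
          + ["Mild"] * 896 + ["Moderate"] * 64)
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

Splitting per class guarantees that even the 64-image moderate class appears in the validation set, which a purely random split does not ensure.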

3.2 Convolutional neural network

CNNs are a class of deep learning models designed for the analysis of grid-structured data, particularly images and scans. They excel in image classification tasks by effectively capturing spatial hierarchies and local patterns. A CNN model has three key elements: convolutional layers, pooling layers, and fully connected layers, which work together to extract increasingly abstract and intricate features from the input data. In the context of a CNN, the input typically consists of resized MRI images with dimensions of 224 × 224 × 3 (representing height, width, and RGB channels). Normalization ensures that each pixel value falls within a standard range, usually between 0 and 1. The core constituents of a CNN model are the convolutional layers, which use filters (kernels) to apply convolutional operations to the input data (MRI images/scans). Each filter traverses the input image, performing element-wise multiplication followed by summation to generate a feature map. The mathematical representation is shown in equation (1):

$$(Z * T)(x, y) = \sum_{o} \sum_{p} Z(x + o,\, y + p)\, T(o, p), \tag{1}$$

where $Z$ represents the input image data, $T$ is the applied filter, $(o, p)$ index the positions within the filter, and $(x, y)$ denote the spatial coordinates of the output feature map. Following each convolution operation, an activation function called ReLU (rectified linear unit) is applied, which introduces nonlinearity into the model. This activation function is defined in equation (2):

$$d(a) = \max(0, a). \tag{2}$$
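Equations (1) and (2) can be illustrated on a tiny example. The sketch below assumes a single-channel image, stride 1, and no padding; the function names and the 4 × 4 input are illustrative only.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), as in equation (1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[x + o][y + p] * kernel[o][p]
                for o in range(kh) for p in range(kw))
            for y in range(out_w)
        ]
        for x in range(out_h)
    ]


def relu(feature_map):
    """Apply equation (2) element-wise."""
    return [[max(0, v) for v in row] for row in feature_map]


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = conv2d_valid(image, kernel)
activated = relu(feature_map)
print(feature_map)  # [[0, -1, -1], [-1, 1, 3], [2, 0, -2]]
print(activated)    # [[0, 0, 0], [0, 1, 3], [2, 0, 0]]
```

A 2 × 2 filter on a 4 × 4 input yields a 3 × 3 feature map, and ReLU then zeroes out the negative responses.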

Pooling layers serve to decrease the spatial dimensions of the feature maps while maintaining critical information. This process aids in reducing computational complexity and addressing overfitting. Max pooling, a technique that selects the maximum value within a window, is widely utilized for this purpose. The mathematical representation is shown in equation (3):

$$K(x, y) = \max \{\, Z(o, p) \mid (o, p) \in \mathrm{neighborhood}(x, y) \,\}. \tag{3}$$
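As a small illustration of equation (3), the following sketch applies non-overlapping max pooling to a 4 × 4 feature map. The window size of 2 and the input values are assumptions chosen for the example.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[x * size + o][y * size + p]
                for o in range(size) for p in range(size))
            for y in range(out_w)
        ]
        for x in range(out_h)
    ]


fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 2, 3, 4],
]
pooled = max_pool2d(fm)
print(pooled)  # [[4, 2], [2, 5]]
```

Each output value is the largest entry of its 2 × 2 window, so the spatial size is halved while the strongest activations are kept.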

Subsequent to multiple convolutional and pooling layers, the high-dimensional feature maps undergo flattening into a one-dimensional vector. This vector then moves through fully connected layers, wherein each neuron establishes connections with every neuron in the previous layer. These layers handle the final classification responsibilities. The terminal layer of the CNN comprises a SoftMax layer, which produces a probability distribution across potential classes. The SoftMax function can be formulated in equation (4):

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}, \tag{4}$$

where $z_i$ represents the logit of class $i$.
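Equation (4) can be computed as in the sketch below. Subtracting the maximum logit before exponentiating is a common numerical-stability measure that does not change the result; it is an addition for the example and is not described in the text. The four logits are arbitrary illustrative values.

```python
import math


def softmax(logits):
    """Equation (4), computed with the max-subtraction trick for stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One hypothetical logit per class (four classes, as in the study).
probs = softmax([2.0, 1.0, 0.1, -1.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

The probabilities sum to one, and the predicted class is the index of the largest logit.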

3.3 Proposed architecture of CNN model

The proposed architecture for the CNN model is depicted in Figure 1. The classification of AD starts by resizing the MRI images to a consistent size, such as 224 × 224 pixels, and normalizing them so that pixel values fall in the standard range of 0–1. The preprocessed images then pass through a series of convolutional layers that apply multiple filters to extract various features. The early layers detect simple patterns such as edges and textures, while deeper layers capture more complex anatomical structures related to AD. Each convolutional layer convolves the input image (Z) with multiple filters (T) to generate feature maps, and the ReLU function then introduces nonlinearity. The mathematical representation is shown in equation (5):

$$G_{\mathrm{Conv}}(x, y) = \max\!\Big(0,\ \sum_{o} \sum_{p} Z(x + o,\, y + p) \cdot T(o, p)\Big). \tag{5}$$

Figure 1

CNN architecture in proposed study. Source: Created by the authors.

Pooling layers are then applied to reduce the spatial dimensionality of the feature maps, which enables the model to focus on the most salient features. Max pooling is the typical choice for this task. The mathematical representation is shown in equation (6):

$$G_{\mathrm{pool}}(x, y) = \max \{\, G_{\mathrm{Conv}}(o, p) \mid (o, p) \in \mathrm{window}(x, y) \,\}. \tag{6}$$

The output of the final pooling layer is flattened into a one-dimensional vector and passed to fully connected layers, which serve as classifiers. These layers map the extracted features to the probability of each class. The final fully connected layer outputs logits, which are then transformed into probability scores by the SoftMax function. The class with the highest probability score is selected as the model’s final prediction.

The CNN architecture consists of approximately 508k parameters, all of which are trainable. The model follows a traditional CNN architecture with five convolutional blocks (each containing Conv2D and MaxPooling2D layers), followed by flatten and dense layers for classification. The network progressively increases the number of filters from 8 to 128 while reducing spatial dimensions through max pooling operations. The detailed architecture parameters are presented in Table 3.

Table 3

Architecture details of the CNN model

Layer (type)	Output shape	Parameters
conv2d (Conv2D)	(None, 222, 222, 8)	224
max_pooling2d (MaxPooling2D)	(None, 111, 111, 8)	0
conv2d_1 (Conv2D)	(None, 109, 109, 16)	1,168
max_pooling2d_1 (MaxPooling2D)	(None, 54, 54, 16)	0
conv2d_2 (Conv2D)	(None, 52, 52, 32)	4,640
max_pooling2d_2 (MaxPooling2D)	(None, 26, 26, 32)	0
conv2d_3 (Conv2D)	(None, 24, 24, 64)	18,496
max_pooling2d_3 (MaxPooling2D)	(None, 12, 12, 64)	0
conv2d_4 (Conv2D)	(None, 10, 10, 128)	73,856
max_pooling2d_4 (MaxPooling2D)	(None, 5, 5, 128)	0
flatten_1 (Flatten)	(None, 3,200)	0
dense_2 (Dense)	(None, 128)	409,728
dense_3 (Dense)	(None, 4)	516
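The shapes and parameter counts in Table 3 can be reproduced with a few lines of arithmetic. The sketch assumes 3 × 3 kernels without padding and 2 × 2 max pooling; these settings are not stated in the text but are the ones that match the output shapes listed in the table.

```python
def conv_params(in_channels, out_channels, k=3):
    """Weights (k*k*in per filter) plus one bias per filter."""
    return (k * k * in_channels + 1) * out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


size, channels, total = 224, 3, 0
for filters in (8, 16, 32, 64, 128):
    total += conv_params(channels, filters)
    size = (size - 2) // 2      # 3x3 conv without padding, then 2x2 max pooling
    channels = filters

flat = size * size * channels   # length of the flattened vector
total += dense_params(flat, 128) + dense_params(128, 4)

print(size, flat, total)        # 5 3200 508628
```

The total of 508,628 agrees with the "approximately 508k" figure, and about 80% of it (409,728 parameters) sits in the first dense layer.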

3.4 ViT

ViTs mark a notable shift in computer vision by applying the transformer architecture, which was initially developed for natural language processing tasks, to image classification, as shown in Figure 2. In contrast to CNNs, which use convolutional layers to extract local features from images, ViTs treat an image as a sequence of patches and use the transformer mechanism to analyze and interpret the relationships among these patches. This enables ViTs to effectively capture the global context and long-range dependencies in the data, which is a challenge for CNNs. ViTs operate by dividing the input image into nonoverlapping patches in a process known as image patching. For example, a 224 × 224 image with a patch size of 16 × 16 yields 196 patches (arranged in a 14 × 14 grid). Each patch is then flattened into a one-dimensional vector and projected into a higher-dimensional space by a linear layer. This process is summarized in equation (7):

$$\text{Patch embedding} = K_p \cdot \text{patch} + S_p, \tag{7}$$

where $K_p$ represents the weight matrix and $S_p$ the bias. In the ViT model, positional encodings retain the positional information of the patch embeddings. These encodings are learnable vectors added to each patch embedding. The sequence of embedded patches, along with their associated positional encodings, is then fed into a series of transformer encoder layers. Each encoder layer combines multi-head self-attention with a feed-forward neural network, which enables the model to assess the relative significance of different patches through self-attention. The mathematical representation is shown in equation (8):

$$\mathrm{Attention}(O, M, P) = \mathrm{softmax}\!\left(\frac{O M^{T}}{\sqrt{d_k}}\right) P, \tag{8}$$

where $d_k$ represents the dimension of the keys, while $O$, $M$, and $P$ are linear projections of the input data (playing the roles of queries, keys, and values, respectively). Furthermore, each encoder layer incorporates an identical feed-forward network applied separately to each position within the sequence. The mathematical representation is shown in equation (9):

$$\mathrm{FFN}(z) = \mathrm{ReLU}(K_1 z + S_1)\, K_2 + S_2, \tag{9}$$

where $K_1$, $S_1$, $K_2$, and $S_2$ represent the learnable parameters. A distinct classification token is prepended to the patch embedding sequence, and its output is used for classification. Specifically, the output of the final transformer encoder layer corresponding to the classification token is passed through a fully connected layer. The SoftMax function is then used to generate the final class probabilities.
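The attention operation of equation (8) can be sketched for a single head in plain Python. The symbols follow the text (O, M, and P for the query, key, and value projections). The three two-dimensional tokens are made-up values, and multi-head splitting and the learned projection matrices are omitted for brevity.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(O, M, P):
    """Equation (8): softmax(O M^T / sqrt(d_k)) P for one attention head."""
    d_k = len(M[0])
    scores = matmul(O, transpose(M))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, P), weights


# Three tokens (patches), each with a 2-dimensional query, key, and value.
O = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(O, M, P)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one and states how strongly one patch attends to every other patch. The output for a patch is the corresponding weighted average of the value vectors.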

Figure 2

FT-ViT architecture in proposed study. Source: Created by the authors.

3.5 Proposed architecture of the ViT model

Applying ViTs to the classification of AD from MRI images involves several stages. The MRI images are first resized to a uniform dimension (e.g., 224 × 224 pixels) and normalized. The resized image is then divided into nonoverlapping patches. For instance, with a patch size of 16 × 16, the 224 × 224 image produces 196 patches. Each patch is flattened and projected into a higher-dimensional space by a linear transformation. To provide the ViT model with information about the relative positions of the flattened patches, positional encodings are added to each patch embedding. The sequence of patch embeddings, together with the classification token, is passed through a series of transformer encoder layers. Each layer uses multi-head self-attention to calculate the relationships between patches and a feed-forward network to process each position independently. The mathematical representation is shown in equations (10) and (11):

(10) $Z = \mathrm{LayerNorm}(X + \mathrm{MultiHeadAttention}(X, X, X))$,

(11) $X' = \mathrm{LayerNorm}(Z + \mathrm{FeedForward}(Z))$.

Here, ( X ) represents the input to the encoder layer, while ( Z ) and ( X ′ ) denote intermediate representations. The output corresponding to the classification token undergoes processing through a fully connected layer, followed by a SoftMax activation to generate the probability distribution across the classes. The mathematical representation is shown in equation (12):

(12) $\mathrm{Output} = \mathrm{softmax}(K_c \cdot \mathrm{CLS}_{\mathrm{token}} + S_c)$,

where $K_c$ and $S_c$ represent the classification layer’s weight matrix and bias.
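The patch arithmetic described above (a 224 × 224 image with 16 × 16 patches giving 196 patches) can be sketched in NumPy as follows. This is a hedged illustration: the three-channel input, the embedding width of 64, and the zero-initialized classification token and positional encodings are assumptions for this example and are not taken from the study.

```python
import numpy as np


def extract_patches(image, patch_size):
    # image: (H, W, C) array whose height and width are multiples of patch_size.
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    grid = image.reshape(gh, patch_size, gw, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return grid.reshape(gh * gw, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))       # stand-in for a normalized MRI slice
patches = extract_patches(image, 16)    # 196 flattened patches of 16*16*3 values

embed_dim = 64                          # hypothetical embedding width
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
cls_token = np.zeros((1, embed_dim))    # learnable in a real model
positions = np.zeros((patches.shape[0] + 1, embed_dim))  # learnable in a real model

# Prepend the classification token, then add the positional encodings.
tokens = np.concatenate([cls_token, patches @ projection], axis=0) + positions
print(patches.shape, tokens.shape)
```

The encoder therefore receives 197 tokens: the 196 patch embeddings plus the classification token whose final state feeds equation (12).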

The model consists of approximately 87.47M parameters in total, of which approximately 87.47M are trainable and 1.56K are nontrainable. The architecture follows a sequential structure starting with a ViT-B32 backbone (87.46M parameters), followed by a flatten operation, batch normalization layers, and dense layers for classification. The complete parameter details are shown in Table 4.

Table 4

FT-ViT model architecture

Layer (type)	Output shape	Parameters
vit-b32 (Functional)	(None, 768)	87,455,232
flatten (Flatten)	(None, 768)	0
batch_normalization (BatchNorm)	(None, 768)	3,072
dense (Dense)	(None, 11)	8,459
batch_normalization_1 (BatchNorm)	(None, 11)	44
dense_1 (Dense)	(None, 4)	48
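The parameter counts in Table 4 can be cross-checked with a few lines of arithmetic. The sketch below assumes the usual Keras conventions (a dense layer has one weight per input-output pair plus one bias per output; a batch normalization layer has four values per feature, of which the moving mean and variance are nontrainable) and assumes that the backbone is fully trainable, which is consistent with the totals quoted above but is not stated explicitly.

```python
def dense_params(n_in, n_out):
    # Weights plus one bias per output unit.
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    # gamma and beta (trainable) plus moving mean and variance (nontrainable).
    return 4 * n_features


layers = {
    "vit-b32": 87_455_232,                    # backbone count taken from Table 4
    "batch_normalization": batchnorm_params(768),
    "dense": dense_params(768, 11),
    "batch_normalization_1": batchnorm_params(11),
    "dense_1": dense_params(11, 4),
}

total = sum(layers.values())
non_trainable = 2 * 768 + 2 * 11             # moving statistics of both BatchNorm layers
trainable = total - non_trainable            # assumes a fully trainable backbone

for name, count in layers.items():
    print(f"{name:24s}{count:>12,}")
print(f"total={total:,} trainable={trainable:,} non_trainable={non_trainable:,}")
```

The result agrees with the rounded figures in the text: roughly 87.47M parameters in total and about 1.56K nontrainable.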

3.6 Swin transformer model

The swin transformer (shifted window transformer) represents a novel approach to applying transformer architecture to computer vision tasks. Unlike traditional ViTs, which treat images as sequences of patches processed globally, the swin transformer introduces a hierarchical architecture and utilizes local self-attention within shifted windows. This hierarchical approach allows the swin transformer to efficiently handle high-resolution images while capturing both local and global features, as shown in Figure 3.

Figure 3

Swin architecture in proposed study. Source: Created by the authors.

3.6.1 Proposed architecture of swin transformer model

Preprocessing the MRI images: The MRI images are preprocessed to a uniform dimension (e.g., 224 × 224 pixels) and normalized. Patch partitioning and embedding: The input image is partitioned into fixed-size patches, and each patch is flattened and linearly embedded into a higher-dimensional space.

Hierarchical feature representation: The input image is divided into nonoverlapping patches. These patches are processed hierarchically, with the feature map’s resolution being gradually reduced through the network. This allows the model to capture multi-scale features efficiently. Shifted window mechanism: The swin transformer employs a shifted window approach for self-attention computation. Instead of computing self-attention globally, it computes self-attention within local windows, which are shifted between layers to allow for cross-window connections. Window-based self-attention: self-attention (SA) is computed within nonoverlapping windows, capturing local interactions. The mathematical representation is shown in equation (13):

(13) $\mathrm{Attention}(O, M, P) = \mathrm{softmax}\left(\frac{O M^{T}}{\sqrt{d_k}}\right) P$,

where $d_k$ represents the key dimension, while $O$, $M$, and $P$ denote the query, key, and value linear projections of the input data, respectively. Shifted windows: In alternating layers, the windows are shifted so that they overlap the window boundaries of the preceding layer. This shift enables information exchange between different windows, enhancing the model’s ability to capture global context. Patch merging: After several layers, the feature maps are down-sampled through patch merging layers, reducing spatial resolution while increasing the number of channels, which allows the model to learn more abstract features, as represented in equation (14):

(14) $\mathrm{PatchMerging}(x) = K_m \cdot \mathrm{concatenate}(x)$,

where $K_m$ is the learned weight matrix. Final classification layer: The final representation is obtained by global average pooling or by using the output corresponding to a particular classification token. This representation is first fed into a fully connected layer and then passed to the SoftMax layer for classification, as shown in equation (15):

(15) $\mathrm{Output} = \mathrm{softmax}(K_c \cdot \mathrm{Representation} + S_c)$,

where $K_c$ and $S_c$ represent the classification layer’s weight matrix and bias, respectively.
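The window partitioning, window shifting, and patch merging steps described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the 4 × 4 token grid with 64 channels mirrors the shapes reported for this model, while the window size of 2, the shift of 1, and the zero-valued matrix `K_m` are placeholders chosen for this example. Attention masking of the wrapped-around tokens is omitted.

```python
import numpy as np


def window_partition(x, window):
    # x: (H, W, C) token grid -> (num_windows, window * window, C)
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)


def cyclic_shift(x, shift):
    # Roll the grid up and to the left so the next partition straddles
    # the previous window boundaries.
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))


def patch_merging_concat(x):
    # Concatenate each 2 x 2 neighbourhood: (H, W, C) -> (H/2, W/2, 4C)
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )


grid = np.arange(4 * 4 * 64, dtype=np.float32).reshape(4, 4, 64)

windows = window_partition(grid, 2)                     # regular windows
shifted_windows = window_partition(cyclic_shift(grid, 1), 2)

merged = patch_merging_concat(grid)                     # (2, 2, 256)
K_m = np.zeros((256, 128), dtype=np.float32)            # placeholder for learned weights
out = merged.reshape(-1, 256) @ K_m                     # (4, 128)
print(windows.shape, shifted_windows.shape, merged.shape, out.shape)
```

Self-attention as in equation (13) would be applied independently to each entry of `windows` (or `shifted_windows` in alternating layers), so its cost grows with the window size rather than with the full image.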

The swin transformer architecture has approximately 331K total parameters (331,092 trainable and 96 nontrainable). The model incorporates data augmentation layers (RandomCrop and RandomFlip), followed by patch extraction and embedding layers, swin transformer blocks, and finally classification layers. The architecture demonstrates an efficient hierarchical design typical of swin transformers. The detailed architecture parameters are presented in Table 5.

Table 5

Architecture details of the swin transformer model

Layer (type)	Output shape	Parameters
input_2 (InputLayer)	[(None, 128, 128, 3)]	0
random_crop (RandomCrop)	(None, 128, 128, 3)	0
random_flip (RandomFlip)	(None, 128, 128, 3)	0
patch_extract (PatchExtract)	(None, 16, 3,072)	0
patch_embedding (PatchEmbedding)	(None, 16, 64)	197,696
swin_transformer (SwinTransformer)	(None, 16, 64)	50,072
swin_transformer_1 (SwinTransformer)	(None, 16, 64)	50,136
patch_merging (PatchMerging)	(None, 4, 128)	32,768
global_average_pooling1d (GlobalAveragePooling1D)	(None, 128)	0
dense_14 (Dense)	(None, 4)	516
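Several entries in Table 5 can be reproduced from the layer shapes alone. The sketch below is a consistency check under assumptions that the paper does not state: a patch size of 32 × 32 × 3 (inferred from the 3,072 values per patch and the 16 patches of a 128 × 128 input), a learned position embedding of one vector per patch, and a bias-free patch merging projection. The two swin block counts are copied from the table rather than derived.

```python
patch_dim = 32 * 32 * 3      # 3,072 values per patch (inferred patch size)
num_patches = (128 // 32) ** 2
embed_dim = 64
num_classes = 4

# Linear projection with bias, plus one position vector per patch (assumed).
patch_embedding = (patch_dim * embed_dim + embed_dim) + num_patches * embed_dim

# 2 x 2 neighbourhoods are concatenated (4 * 64 channels) and projected to 128.
patch_merging = (4 * embed_dim) * (2 * embed_dim)

# Dense classifier on the 128-dimensional pooled representation.
classifier = (2 * embed_dim) * num_classes + num_classes

swin_blocks = 50_072 + 50_136  # copied from Table 5, not derived here

total = patch_embedding + swin_blocks + patch_merging + classifier
print(num_patches, patch_embedding, patch_merging, classifier, total)
```

Under these assumptions the total equals the sum of the 331,092 trainable and 96 nontrainable parameters quoted in the text.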

4 Result

The framework was developed in Python, utilizing well-known machine learning libraries, including TensorFlow, Keras, Pandas, NumPy, and Matplotlib. Testing was performed on a system featuring an Nvidia Tesla P100 GPU, a quad-core CPU, and 29 GB of RAM. With the GPU, each training epoch took roughly 180 s. To assess the model’s effectiveness, key performance indicators were evaluated: the classification component was analyzed using overall accuracy, loss, the confusion matrix, and the classification report.

4.1 Training and validation accuracy and loss metrics

4.1.1 ViT performance on balanced and imbalanced dataset

Figure 4 illustrates the performance metrics of the ViT model on the balanced dataset for the four-class AD classification task. The model’s progress is tracked over 20 epochs, with separate plots for accuracy (left) and loss (right). The accuracy plot demonstrates a rapid increase in both training and validation accuracy within the initial epochs. Training accuracy (blue line) approaches 100% by the end of training. Validation accuracy (red line) exhibits a similar trend but plateaus at approximately 96%, indicating robust generalization capabilities. The loss plot reveals a complementary pattern. Both training loss (blue line) and validation loss (red line) show a steep decline in the early epochs. The training loss continues to decrease monotonically, approaching zero by the final epoch. In contrast, the validation loss stabilizes around 0.15. The disparity between training and validation metrics, particularly noticeable in the loss curves, hints at a degree of overfitting; however, the validation accuracy remains above 95%.

Figure 5 illustrates the performance of the ViT model on the imbalanced dataset of AD stages. The graph comprises two plots: accuracy and loss over 20 epochs. Both training and validation curves for accuracy exhibit a steady increase, converging around 0.95, while the loss curves for both training and validation demonstrate a consistent decrease. This convergence of training and validation metrics suggests that the ViT model is performing well on the dataset without overfitting, effectively learning from the data and generalizing to unseen samples.

Figure 4

Training and validation accuracy and loss for the ViT model, illustrating its performance across epochs on the balanced dataset. Source: Created by the authors.

Figure 5

Training and validation accuracy and loss for the ViT model, illustrating its performance across epochs on the imbalanced dataset. Source: Created by the authors.

4.2 CNN performance on balanced and imbalanced dataset

A CNN model demonstrates promising performance for the four-class AD classification task, as illustrated in Figure 6. Over 20 training epochs, the model exhibits rapid learning, particularly in the initial 5–7 epochs. The training accuracy, starting from approximately 50%, swiftly rises to reach 98% by the conclusion of training. Validation accuracy follows a similar trajectory, albeit with a lower peak, stabilizing around 90%. This disparity between training and validation accuracies suggests the presence of some overfitting. The loss curves corroborate this observation, with training loss decreasing steadily to approach 0.05, while validation loss stabilizes with fluctuations in the 0.3–0.4 range. Despite the indications of overfitting, the CNN achieves a commendable validation accuracy of 90%, indicating good generalization to unseen data. The model’s quick convergence to high accuracy levels underscores its efficiency in learning relevant features for AD classification from medical imaging data. However, the oscillations in validation loss and the gap between training and validation metrics highlight potential areas for improvement. Future research could explore enhanced regularization techniques or architectural refinements to mitigate overfitting and further boost the model’s generalization capabilities, potentially improving its clinical applicability in AD diagnosis.

Figure 6

Training and validation accuracy and loss for the CNN model, depicting its performance over epochs on balanced dataset. Source: Created by the authors.

Figure 7 depicts the training performance of the CNN model on the imbalanced dataset. The left plot illustrates accuracy, showing a steady increase in both training and validation accuracy over epochs, with training accuracy reaching 1.0 while validation accuracy converges around 0.97. The right plot displays loss, demonstrating a significant decrease in both training and validation loss during the initial epochs. However, the validation loss plateaus at a slightly higher level than the training loss, indicating a slight degree of overfitting. While the model exhibits promising performance, addressing this overfitting remains important for optimal generalization.

Figure 7

Training and validation accuracy and loss for the CNN model, depicting its performance over epochs on the imbalanced dataset. Source: Created by the authors.

4.3 Swin performance on balanced and imbalanced dataset

The swin transformer model’s performance on the four-class AD classification task is depicted in Figure 8, showing its learning progression over 20 epochs. The accuracy plot demonstrates a steady increase in both training and validation accuracies, starting from around 45% and 52%, respectively. By the end of training, the model achieves approximately 82% accuracy on the training set and 76% on the validation set. This relatively small gap between training and validation accuracies suggests good generalization. The loss curves show a rapid initial decrease, followed by a more gradual decline. The validation loss remains slightly lower than the training loss for most of the training period, indicating that the model is not overfitting. Both losses converge to around 0.8 by the final epoch. Although the swin transformer does not reach the accuracies of the other models, it shows consistent improvement and a healthy balance between training and validation metrics, which suggests that the model is learning meaningful features without memorizing the training data. The slow but steady increase in accuracy, coupled with the consistent decrease in loss, indicates that the swin transformer may benefit from additional training epochs. Overall, these results demonstrate the swin transformer’s capability to handle the complexities of AD classification, with room for further optimization.

Figure 8

Training and validation accuracy and loss for the swin transformer model, showing its accuracy and loss trends throughout the training process. Source: Created by the authors.

Figure 9 illustrates the training performance of the swin transformer model. The plot displays both training and validation accuracy and loss over epochs. While both accuracy curves show an upward trend, indicating improved performance, the gap between training and validation accuracy suggests a minor degree of overfitting. Similarly, the loss curves demonstrate a downward trajectory, but the validation loss plateaus at a higher level than the training loss, reinforcing the overfitting concern. These observations indicate that the model is learning effectively from the training data but may struggle to generalize to unseen examples.

Figure 9

Training and validation accuracy and loss for the swin transformer model, showing its accuracy and loss trends throughout the training process. Source: Created by the authors.

4.4 Confusion matrix overview

The confusion matrix (CM) is a tabular representation used to assess the performance of a classification model. It summarizes the model’s prediction results by displaying the counts of correct and incorrect predictions for each class. This matrix provides valuable insights into the model’s performance across different classes, highlighting where the model predicts accurately and where it misclassifies. The CM is built from the following counts:

True Positive (TPV): The number of instances where the model correctly predicted the positive class. In other words, this is the count of positive instances that the model classified as positive.

False Negative (FNV): The number of instances where the model incorrectly predicted the negative class when the actual class was positive. In other words, this is the count of positive instances that the model classified as negative.

False Positive (FPV): The number of instances where the model incorrectly predicted the positive class when the actual class was negative. In other words, this is the count of negative instances that the model classified as positive.

True Negative (TNV): The number of instances where the model correctly predicted the negative class. This represents the count of negative instances that the model correctly identified as negative.
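In a multi-class setting such as the four-class task here, these four counts are usually obtained per class by treating that class as positive and all others as negative (a one-vs-rest reading of the matrix). The sketch below illustrates this with a hypothetical three-class matrix whose rows are actual classes and whose columns are predicted classes; the numbers are made up for the example.

```python
import numpy as np


def one_vs_rest_counts(cm):
    # cm[i, j] = number of samples of actual class i predicted as class j.
    cm = np.asarray(cm)
    tp = np.diag(cm)                  # correctly predicted as the class
    fn = cm.sum(axis=1) - tp          # actual class, predicted as another
    fp = cm.sum(axis=0) - tp          # another class, predicted as this one
    tn = cm.sum() - tp - fn - fp      # everything else
    return tp, fn, fp, tn


toy_cm = [
    [5, 1, 0],
    [2, 7, 1],
    [0, 0, 4],
]  # hypothetical counts
tp, fn, fp, tn = one_vs_rest_counts(toy_cm)
for k in range(3):
    print(f"class {k}: TP={tp[k]} FN={fn[k]} FP={fp[k]} TN={tn[k]}")
```

For every class the four counts add up to the total number of samples, which is a quick sanity check on any reported matrix.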

4.4.1 ViT transformer performance on balanced and imbalanced dataset

On the balanced dataset, the confusion matrix in Figure 10 shows that the ViT model classifies Moderate cases with no errors. The remaining misclassifications occur among the Mild, Normal, and Very Mild classes, which suggests that the model could be improved, especially in distinguishing between the Mild and Very Mild classes. Mild Class: Of 1,380 actual Mild cases, 1,353 were correctly classified as Mild; 3 were misclassified as Moderate, 9 as Normal, and 15 as Very Mild. Moderate Class: All 982 actual Moderate cases were correctly classified, with no misclassifications into other classes. Normal Class: Of the 1,382 actual Normal cases, 1,300 were correctly classified as Normal; 34 were misclassified as Mild, 3 as Moderate, and 45 as Very Mild. Very Mild Class: Of 1,354 actual Very Mild cases, 1,273 were correctly classified; 38 were misclassified as Mild, 3 as Moderate, and 40 as Normal.

On the imbalanced dataset, the confusion matrix in Figure 11 shows that the ViT model again classifies Moderate cases with no errors, with a small number of misclassifications among the Mild, Normal, and Very Mild classes. Mild Class: Of 153 actual Mild cases, 151 were correctly classified as Mild; 2 were misclassified as Very Mild. Moderate Class: All 11 actual Moderate cases were correctly classified, with no misclassifications into other classes. Normal Class: Of the 492 actual Normal cases, 489 were correctly classified as Normal; 3 were misclassified as Very Mild. Very Mild Class: Of 304 actual Very Mild cases, 296 were correctly classified; 5 were misclassified as Normal and 3 as Mild.

Figure 10

Confusion matrix for the ViT model, displaying the true vs predicted class distributions and classification performance. Source: Created by the authors.

Figure 11

Confusion matrix for the ViT model, displaying the true vs predicted class distributions on imbalanced dataset. Source: Created by the authors.

4.4.2 CNN performance on balanced and imbalanced dataset

Overall, the CNN model in Figure 12 shows strong performance on the balanced dataset, with near-perfect classification of Moderate cases and no Normal or Very Mild cases predicted as Moderate. However, there are notable misclassifications for Mild and Very Mild cases, which are frequently predicted as Normal. This suggests that while the model is effective overall, there is room for improvement in distinguishing between Mild, Normal, and Very Mild cases. Mild Class: Of 1,380 actual Mild cases, 1,229 were correctly classified as Mild; 2 were misclassified as Moderate, 76 as Normal, and 73 as Very Mild. Moderate Class: Of the 982 actual Moderate cases, 972 were correctly classified as Moderate; 5 were misclassified as Mild, 3 as Normal, and 2 as Very Mild. Normal Class: Of the 1,382 actual Normal cases, 1,307 were correctly classified as Normal; 27 were misclassified as Mild, 48 as Very Mild, and none as Moderate. Very Mild Class: Of 1,354 actual Very Mild cases, 1,080 were correctly classified; 54 were misclassified as Mild, 220 as Normal, and none as Moderate.

Figure 12

Confusion matrix for the CNN model, showing the actual vs predicted class frequencies on the balanced dataset. Source: Created by the authors.

On the imbalanced dataset, the confusion matrix in Figure 13 shows that the CNN model classifies Moderate cases with no errors, with some misclassifications among the Mild, Normal, and Very Mild classes. Mild Class: Of 153 actual Mild cases, 145 were correctly classified as Mild; 2 were misclassified as Very Mild and 6 as Normal. Moderate Class: All 11 actual Moderate cases were correctly classified, with no misclassifications into other classes. Normal Class: Of the 492 actual Normal cases, 489 were correctly classified as Normal; 3 were misclassified as Very Mild. Very Mild Class: Of 304 actual Very Mild cases, 290 were correctly classified; 8 were misclassified as Normal and 6 as Mild.

Figure 13

Confusion matrix for the CNN model, showing the actual vs predicted class frequencies on imbalanced dataset. Source: Created by the authors.

4.4.3 Swin transformer performance on balanced and imbalanced dataset

The swin transformer model shows reasonably good performance on the balanced dataset, with solid accuracy in the Moderate class, as shown in Figure 14. However, there are noticeable misclassifications, particularly for Mild and Very Mild cases, which are often confused with Normal and other categories. This suggests that the swin transformer needs further refinement to improve accuracy across all classes, particularly in distinguishing between Mild, Normal, and Very Mild cases. Mild Class: Of 1,380 actual Mild cases, 1,107 were correctly classified as Mild; 39 were misclassified as Moderate, 76 as Normal, and 158 as Very Mild. Moderate Class: Of the 982 actual Moderate cases, 928 were correctly classified as Moderate; 25 were misclassified as Mild, 7 as Normal, and 22 as Very Mild. Normal Class: Of the 1,382 actual Normal cases, 1,112 were correctly classified as Normal; 103 were misclassified as Mild, 17 as Moderate, and 150 as Very Mild. Very Mild Class: Of 1,354 actual Very Mild cases, 897 were correctly classified as Very Mild; 188 were misclassified as Mild, 30 as Moderate, and 239 as Normal.

Figure 14

Confusion matrix for the swin transformer model, illustrating the distribution of true and predicted classes on the balanced dataset. Source: Created by the authors.

On the imbalanced dataset, the confusion matrix in Figure 15 shows that the swin model performs comparatively poorly, with errors in every class. Mild Class: Of 153 actual Mild cases, 125 were correctly classified as Mild; 15 were misclassified as Very Mild and 13 as Normal. Moderate Class: Of 11 actual Moderate cases, 6 were correctly classified as Moderate, with the remaining 5 misclassified as Normal or Very Mild. Normal Class: Of the 492 actual Normal cases, 474 were correctly classified as Normal; 17 were misclassified as Very Mild and 1 as Mild (the Moderate precision of 1.00 in Table 8 rules out Moderate as the predicted class). Very Mild Class: Of 304 actual Very Mild cases, 244 were correctly classified; 48 were misclassified as Normal and 12 as Mild.

Figure 15

Confusion matrix for the swin transformer model, illustrating the distribution of true and predicted classes on the imbalanced dataset. Source: Created by the authors.

4.5 Detailed classification report

Machine learning and deep learning models used for classification tasks often include a classification report as a performance evaluation metric. This report offers detailed insights into how well the model performed on each class in a multi-class classification problem. The classification report typically includes several key metrics, which can be calculated using the following equations (16)–(19):

Precision (PR): The PR metric represents the ratio of correctly predicted positive observations to the total predicted positive observations. In other words, it reflects the model’s ability to avoid false positives: a high PR indicates that the model has a low rate of incorrectly predicting positive instances. This metric is helpful in evaluating the reliability of the model’s positive predictions.

(16) $\mathrm{PR} = \frac{\mathrm{TPV}}{\mathrm{TPV} + \mathrm{FPV}}$.

Recall (RR): RR measures the ratio of correctly predicted positive observations to all the observations that are actually positive in the dataset. In other words, it reflects the model’s ability to identify all the positive instances, avoiding false negatives. A high RR indicates that the model has a low rate of incorrectly classifying positive instances as negative. This metric is helpful in evaluating how comprehensive the model’s positive predictions are.

(17) $\mathrm{RR} = \frac{\mathrm{TPV}}{\mathrm{TPV} + \mathrm{FNV}}$.

F1-Score (F1R): The F1R is the harmonic mean of PR and RR. It provides a balanced measure that combines the model’s PR (ability to avoid false positives) and RR (ability to identify all positive instances). The F1R ranges from 0 to 1, with 1 representing perfect PR and RR.

(18) $\mathrm{F1R} = \frac{2 \times \mathrm{PR} \times \mathrm{RR}}{\mathrm{PR} + \mathrm{RR}}$.

Support: It includes information about the number of actual instances for each class in the test dataset. This provides context on the class imbalance, if any, which can impact model performance.

In addition, the report includes the overall accuracy metric. Accuracy (AR) represents the proportion of correct predictions (both true positives and true negatives) out of the total predictions made by the model. It gives an overview of the model’s overall correctness but should be considered alongside other metrics like PR, RR, and F1R, especially when dealing with imbalanced datasets.

(19) $\mathrm{Accuracy} = \frac{\mathrm{TPV} + \mathrm{TNV}}{\mathrm{TPV} + \mathrm{TNV} + \mathrm{FPV} + \mathrm{FNV}}$.

The macro average metric is the average of the PR, RR, and F1R computed individually for each class and then averaged together. This provides an overview of the model’s performance across all classes, treating each class equally. In contrast, the weighted average metric calculates the average by taking into account the support (number of actual instances) for each class. This makes the weighted average more representative of the overall model performance, especially when dealing with imbalanced datasets where some classes have significantly more instances than others.
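As a worked example of equations (16)–(18) and of the macro and weighted averages, the sketch below recomputes a classification report from the counts stated in Section 4.4.1 for the ViT model on the imbalanced dataset. The function name and layout are choices made for this illustration; overall accuracy is computed in its multi-class form, as the share of samples on the diagonal. The code assumes every class is predicted at least once, so no division by zero occurs.

```python
import numpy as np

classes = ["Mild", "Moderate", "Normal", "Very Mild"]
# Rows: actual class, columns: predicted class (counts from Section 4.4.1).
cm = np.array([
    [151, 0, 0, 2],
    [0, 11, 0, 0],
    [0, 0, 489, 3],
    [3, 0, 5, 296],
])


def classification_report(cm):
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)                 # actual instances per class
    precision = tp / cm.sum(axis=0)          # TP / (TP + FP)
    recall = tp / support                    # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()           # correct predictions / all predictions
    macro = np.array([precision.mean(), recall.mean(), f1.mean()])
    weights = support / support.sum()
    weighted = np.array([(m * weights).sum() for m in (precision, recall, f1)])
    return precision, recall, f1, support, accuracy, macro, weighted


precision, recall, f1, support, accuracy, macro, weighted = classification_report(cm)
for name, p, r, f, s in zip(classes, precision, recall, f1, support):
    print(f"{name:10s} PR={p:.2f} RR={r:.2f} F1R={f:.2f} support={s}")
print(f"AR={accuracy:.2f} macro={np.round(macro, 2)} weighted={np.round(weighted, 2)}")
```

Because the macro average gives the 11 Moderate samples the same weight as the 492 Normal samples, it is the more sensitive of the two averages to errors on rare classes.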

4.5.1 ViT performance on balanced dataset

Table 6 provides a comprehensive performance analysis of the ViT model applied to an AD classification task involving four classes: Mild, Moderate, Normal, and Very Mild. On the balanced dataset, the model achieves an overall AR of 96%. For the Mild class, the ViT model achieves a PR of 0.95, an RR of 0.98, and an F1R of 0.96, indicating a strong ability to correctly identify Mild cases with few false negatives. The Moderate class stands out with near-perfect performance, scoring a PR of 0.99, an RR of 1.00, and an F1R of 1.00. The Normal class also shows robust performance with a PR of 0.96, an RR of 0.94, and an F1R of 0.95, while the Very Mild class achieves a PR and RR of 0.95 and 0.94, respectively, resulting in an F1R of 0.95.

Table 6

Classification report for the vision transformer model in the AD classification task on balanced and imbalanced dataset

Class	PR	RR	F1R	Support
Imbalanced dataset
Mild	0.98	0.99	0.98	153
Moderate	1.00	1.00	1.00	11
Normal	0.99	0.99	0.99	492
Very Mild	0.98	0.97	0.98	304
AR			0.99	960
Macro Avg	0.99	0.99	0.99	960
Weighted Avg	0.99	0.99	0.99	960
Balanced dataset
Mild	0.95	0.98	0.96	1,380
Moderate	0.99	1.00	1.00	982
Normal	0.96	0.94	0.95	1,382
Very Mild	0.95	0.94	0.95	1,354
AR			0.96	5,098
Macro Avg	0.96	0.97	0.96	5,098
Weighted Avg	0.96	0.96	0.96	5,098

The report further includes macro and weighted averages to provide a holistic view of the model’s performance. The macro average, which treats all classes equally without considering their size, yields a PR of 0.96, an RR of 0.97, and an F1R of 0.96. The weighted average, which accounts for the varying number of instances in each class, similarly reflects strong performance with a PR, RR, and F1R all at 0.96. These metrics confirm the ViT’s capability to handle class imbalances effectively while maintaining high AR and balanced PR and RR across all categories.

The ViT model demonstrated strong performance on the imbalanced AD dataset, achieving an overall AR of 99%, as shown in Table 6. The model’s ability to accurately classify all four classes (Mild, Moderate, Normal, and Very Mild) is evident from the high PR, RR, and F1R values.

Specifically, the ViT model achieved a PR of 0.98, an RR of 0.99, and an F1R of 0.98 for the Mild class, with a support of 153 samples. For the Moderate class, the model attained perfect scores across PR, RR, and F1R (1.00), although this class had a significantly lower support of only 11 samples, indicating that the model was able to handle the small sample size effectively. The Normal class, with the largest support of 492 samples, achieved a PR and RR of 0.99 and an F1R of 0.99, reflecting the model’s consistent performance across well-represented classes. Finally, the Very Mild class, with 304 samples, also showed strong results with a PR of 0.98, RR of 0.97, and F1R of 0.98. The macro and weighted averages of the PR, RR, and F1Rs were all 0.99, indicating that the ViT model maintained high performance across all classes, regardless of their representation in the dataset. This robust performance highlights the model’s capability to generalize well even when dealing with class imbalances, making it a promising approach for the accurate classification of AD stages.

4.5.2 CNN performance on balanced and imbalanced dataset

On the balanced dataset, the overall AR of the CNN model is 90%, indicating that 90% of all predictions made by the model were correct, as shown in Table 7. For the Mild class, the model achieves a PR of 0.93, an RR of 0.89, and an F1R of 0.91, showing a strong ability to correctly predict Mild cases, though with some tradeoff between PR and RR. The Moderate class performs exceptionally well, with a PR of 1.00, an RR of 0.99, and an F1R of 0.99, indicating a near-perfect classification of Moderate cases. The Normal class, however, shows some variation in performance, with a PR of 0.81, an RR of 0.95, and an F1R of 0.87. This suggests that while the model is good at identifying Normal cases (high RR), it also produces more false positives for this class (lower PR). The Very Mild class demonstrates a PR of 0.90, an RR of 0.80, and an F1R of 0.84, indicating moderate performance that leans towards more false negatives. The report also provides macro and weighted averages: the macro averages of PR, RR, and F1R are all 0.91, and the weighted averages are all 0.90. The macro average treats each class equally, while the weighted average accounts for the different class sizes. These averages reflect the CNN model’s solid overall performance, despite some variation in the PR and RR of specific categories.

The CNN model performed strongly on the imbalanced AD dataset, achieving an overall AR of 97%, as shown in Table 7. The model’s effectiveness in handling various stages of AD is reflected in the high PR, RR, and F1R values across all classes. For the Mild class, the CNN model achieved a PR of 0.96, RR of 0.95, and F1R of 0.95, with a support of 153 samples. The Moderate class, despite having the smallest support of 11 samples, recorded perfect scores of 1.00 for PR, RR, and F1R, indicating the model’s robustness in correctly classifying even underrepresented classes. The Normal class, which had the largest support of 492 samples, showed high performance with a PR of 0.97, RR of 0.99, and F1R of 0.98, demonstrating the model’s ability to predict well-represented classes accurately. For the Very Mild class, with a support of 304 samples, the model achieved a PR of 0.98, RR of 0.95, and F1R of 0.97. The macro averages of PR, RR, and F1R were 0.98, 0.97, and 0.98, respectively, and the weighted averages were all 0.97, indicating that the CNN model maintained a consistent performance across all classes. Despite the imbalance in the dataset, the CNN was able to generalize well, making it a practical approach for classifying the different stages of AD.

Table 7

Classification report for the CNN model in the AD classification task on balanced and imbalanced dataset

Class	PR	RR	F1R	Support
Imbalanced dataset
Mild	0.96	0.95	0.95	153
Moderate	1.00	1.00	1.00	11
Normal	0.97	0.99	0.98	492
Very Mild	0.98	0.95	0.97	304
AR			0.97	960
Macro Avg	0.98	0.97	0.98	960
Weighted Avg	0.97	0.97	0.97	960
Balanced dataset
Mild	0.93	0.89	0.91	1,380
Moderate	1.00	0.99	0.99	982
Normal	0.81	0.95	0.87	1,382
Very Mild	0.90	0.80	0.84	1,354
AR			0.90	5,098
Macro Avg	0.91	0.91	0.91	5,098
Weighted Avg	0.90	0.90	0.90	5,098

4.5.3 Swin performance on balanced and imbalanced dataset

In Table 8, the classification report for the swin transformer model on the AD classification task shows varied performance across the four classes: Mild, Moderate, Normal, and Very Mild. On the balanced dataset, the model achieves an overall AR of 79%, indicating that 79% of all predictions were correct. For the Mild class, the swin transformer has a PR of 0.78, an RR of 0.80, and an F1R of 0.79, demonstrating a decent ability to identify Mild cases. The Moderate class shows better performance, with a PR of 0.92, an RR of 0.95, and an F1R of 0.93, reflecting strong classification performance for this category. The Normal class presents the same metrics as Mild, with a PR of 0.78, an RR of 0.80, and an F1R of 0.79. This indicates that the model performs consistently across the Mild and Normal categories but with room for improvement. The Very Mild class has the lowest performance, with a PR of 0.73, an RR of 0.66, and an F1R of 0.70, showing that the model struggles most with this category. The macro average, which treats each class equally regardless of size, yields a PR, RR, and F1R of 0.80. The weighted average, which accounts for the class sizes, shows similar metrics with PR, RR, and F1R all at 0.79. These averages highlight the overall performance of the swin transformer, reflecting strengths in Moderate classification but lower performance in other categories.

Table 8

Classification report for the swin transformer model in the AD classification task on balanced and imbalanced dataset

Class	PR	RR	F1R	Support
Imbalanced dataset
Mild	0.91	0.82	0.86	153
Moderate	1.00	0.55	0.71	11
Normal	0.88	0.96	0.92	492
Very Mild	0.87	0.80	0.84	304
AR			0.88	960
Macro Avg	0.92	0.78	0.83	960
Weighted Avg	0.88	0.88	0.88	960
Balanced dataset
Mild	0.78	0.80	0.79	1,380
Moderate	0.92	0.95	0.93	982
Normal	0.78	0.80	0.79	1,382
Very Mild	0.73	0.66	0.70	1,354
AR			0.79	5,098
Macro Avg	0.80	0.80	0.80	5,098
Weighted Avg	0.79	0.79	0.79	5,098

When evaluated on the imbalanced AD dataset, the swin transformer model achieved an overall AR of 88%, as shown in Table 8. Performance varied across classes, with notable differences in PR, RR, and F1R. For the Mild class, the model achieved a PR of 0.91, an RR of 0.82, and an F1R of 0.86 with a support of 153 samples, a relatively strong result with room for improvement in RR. The Moderate class, which had a support of only 11 samples, showed perfect PR (1.00) but a much lower RR (0.55), resulting in an F1R of 0.71. The model therefore struggled with the underrepresented Moderate class: every Moderate prediction was correct, but nearly half of the true Moderate cases were missed as false negatives. The Normal class, with the largest support of 492 samples, performed better, with a PR of 0.88, an RR of 0.96, and an F1R of 0.92, reflecting the model’s ability to classify the most common class accurately. The Very Mild class, with a support of 304 samples, had a PR of 0.87, an RR of 0.80, and an F1R of 0.84, a reasonable result with some misclassifications. The macro averages of PR, RR, and F1R were 0.92, 0.78, and 0.83, respectively, showing that performance was uneven across classes, particularly in RR. The weighted averages of PR, RR, and F1R were all 0.88, suggesting that while the model performed adequately on the dataset overall, it may benefit from further optimization, especially in improving RR for less represented classes such as Moderate.
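The gap between the macro and weighted averages discussed above can be reproduced from the per-class rows of Table 8. The following is a minimal sketch, not the authors' evaluation code: the helper names `macro_average` and `weighted_average` are illustrative, and the only inputs are the rounded values printed in the table.

```python
# Per-class metrics for the swin transformer on the imbalanced test set,
# copied from Table 8 as (PR, RR, F1R, support).
report = {
    "Mild":      (0.91, 0.82, 0.86, 153),
    "Moderate":  (1.00, 0.55, 0.71, 11),
    "Normal":    (0.88, 0.96, 0.92, 492),
    "Very Mild": (0.87, 0.80, 0.84, 304),
}

def macro_average(rows, index):
    """Unweighted mean of one metric: every class counts equally."""
    values = [row[index] for row in rows.values()]
    return sum(values) / len(values)

def weighted_average(rows, index):
    """Mean of one metric weighted by class support (column 3)."""
    total = sum(row[3] for row in rows.values())
    return sum(row[index] * row[3] for row in rows.values()) / total

macro_recall = macro_average(report, 1)
weighted_recall = weighted_average(report, 1)

# An RR of 0.55 on 11 Moderate samples corresponds to about 6 detected cases.
moderate_support = report["Moderate"][3]
moderate_hits = round(report["Moderate"][1] * moderate_support)
moderate_misses = moderate_support - moderate_hits

print(f"macro RR    = {macro_recall:.2f}")
print(f"weighted RR = {weighted_recall:.2f}")
print(f"Moderate: {moderate_hits} detected, {moderate_misses} missed")
```

The macro RR comes out at 0.78 and the weighted RR at 0.88, matching Table 8. The 11-sample Moderate class pulls the macro figure down but has almost no influence on the weighted one, which is why the two averages diverge on the imbalanced dataset.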

4.6 ROC curve analysis

In a classification problem, the ROC curve is a graphical representation that plots the true positive rate (TPR) against the false positive rate (FPR) across various decision threshold settings. The TPR, also known as RR or sensitivity, reflects the proportion of actual positive instances that the model correctly identifies. Conversely, the FPR represents the proportion of actual negative instances that the model incorrectly classifies as positive. The ROC curve allows visualizing the balance between the model’s sensitivity (ability to identify positive cases correctly) and specificity (ability to avoid false positives). As the decision threshold is adjusted, this curve demonstrates the tradeoff between these two performance aspects, enabling the selection of an optimal threshold based on the specific needs of the problem at hand.
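As an illustration of how such a curve is traced, the sketch below sweeps the decision threshold over a handful of scores and records one (FPR, TPR) point per threshold. The function name `roc_points` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model scores, where higher means "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Toy example (hypothetical values, for illustration only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the curve, which runs from (0, 0) to (1, 1). Choosing an operating threshold amounts to picking one of these points.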

4.6.1 ViT performance on balanced and imbalanced dataset

The ROC curves in Figures 16 and 17 suggest that the ViT model performs exceptionally well, with perfect or near-perfect discrimination across all classes. The AUC-ROC is 1.00 for the Mild, Moderate, and Normal classes, meaning that the model separates each of these classes from all others almost without error. Because these values are rounded to two decimal places (Section 4.7.1 reports them to three), they indicate at most a very small number of misranked cases rather than strictly zero false positives and false negatives. The AUC-ROC of 0.99 for the Very Mild class indicates nearly perfect performance, with only a very small proportion of misclassifications and a slight reduction relative to the other classes.

Figure 16

ROC curve for the ViT model on balanced dataset. Source: Created by the authors.

Figure 17

ROC curve for the ViT model on imbalanced dataset. Source: Created by the authors.

4.6.2 CNN performance on balanced and imbalanced dataset

Figures 18 and 19 show the ROC curves of the CNN model for AD classification on the balanced and imbalanced datasets. The per-class results are as follows:

Mild Class: The AUC-ROC was 0.99 on the balanced dataset and 1.00 on the imbalanced dataset. The CNN model therefore shows near-perfect performance in distinguishing Mild cases from all other categories, with a very high TPR and minimal false positives.

Moderate Class: The AUC-ROC reached 1.00 on both datasets. This score signifies that the CNN model can, at a suitable decision threshold, separate Moderate cases from the other classes with essentially no false positives or false negatives.

Normal Class: The AUC-ROC was 0.98 on the balanced dataset and 1.00 on the imbalanced dataset. These high values reflect the model’s strong ability to correctly identify Normal cases while maintaining a low FPR, although the balanced-dataset score is slightly below those of the Mild and Moderate classes.

Very Mild Class: The AUC-ROC was 0.97 on the balanced dataset and 1.00 on the imbalanced dataset. The CNN model performs very well in distinguishing Very Mild cases from the other categories, with only a small degree of overlap and very few misclassifications.

Figure 18

ROC curve for the CNN model on balanced dataset. Source: Created by the authors.

Figure 19

ROC curve for the CNN model on imbalanced dataset. Source: Created by the authors.

4.6.3 Swin transformer performance on balanced and imbalanced dataset

Figures 20 and 21 show the ROC curves of the swin transformer model for AD classification on the balanced and imbalanced datasets. The per-class results are as follows:

Mild Class: The AUC-ROC was 0.94 on the balanced dataset and 0.99 on the imbalanced dataset. The swin transformer model therefore performs very well in distinguishing Mild cases from other classes, with a high TPR and relatively low FPR.

Moderate Class: The AUC-ROC reached 1.00 on both datasets. This score signifies that the swin transformer model can, at a suitable decision threshold, separate Moderate cases from the other categories with essentially no false positives or false negatives.

Normal Class: The AUC-ROC was 0.94 on the balanced dataset and 0.98 on the imbalanced dataset. These high values reflect the model’s strong capability in correctly identifying Normal cases, with a low rate of misclassifications and effective separation from other classes.

Very Mild Class: The AUC-ROC was 0.91 on the balanced dataset and 0.97 on the imbalanced dataset. The swin transformer model performs well in distinguishing Very Mild cases from other categories, although there is slightly more overlap than for the other classes.

Overall, the ROC curve analysis demonstrates that the swin transformer model discriminates strongly across all classes, with a perfect AUC-ROC for Moderate cases and high values for Mild, Normal, and Very Mild cases.

Figure 20

ROC curve for the swin transformer model on balanced dataset. Source: Created by the authors.

Figure 21

ROC curve for the swin transformer model on imbalanced dataset. Source: Created by the authors.

4.7 AUC-ROC analysis

The AUC-ROC (area under the receiver operating characteristic curve) is a comprehensive evaluation metric for classification models, useful for assessing the performance of both binary and multiclass classifiers. For multiclass problems (with more than two classes), the ROC curve can be adapted using the following method:

One-vs-Rest (OvR) Approach: Each class is treated as the positive class, while the remaining classes are combined as the negative class. This converts the multi-class problem into multiple binary classification problems. ROC curves are then generated for each binary classification scenario, resulting in one ROC curve per class. This allows for evaluation of how well each class is distinguished from all others.
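The OvR procedure described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the AUC is computed as the fraction of positive/negative pairs in which the positive sample receives the higher score (ties count as one half), and the labels and probabilities below are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(y_true, probs, classes):
    """One AUC per class: that class is positive, all other classes are negative."""
    result = {}
    for k, name in enumerate(classes):
        binary_labels = [1 if y == name else 0 for y in y_true]
        class_scores = [row[k] for row in probs]
        result[name] = binary_auc(binary_labels, class_scores)
    return result

classes = ["Mild", "Moderate", "Normal", "Very Mild"]

# Hypothetical true labels and predicted class probabilities (one row per image).
y_true = ["Mild", "Moderate", "Normal", "Very Mild", "Normal", "Very Mild"]
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.85, 0.05, 0.05],
    [0.10, 0.05, 0.60, 0.25],
    [0.15, 0.05, 0.30, 0.50],
    [0.05, 0.05, 0.40, 0.50],
    [0.10, 0.05, 0.45, 0.40],
]

aucs = one_vs_rest_auc(y_true, probs, classes)
for name, value in aucs.items():
    print(f"{name}: AUC = {value:.3f}")
```

In this toy data the "Normal" and "Very Mild" scores overlap, so their AUC values fall below 1, while "Mild" and "Moderate" are ranked perfectly. An AUC of 1.00 is a statement about ranking: every positive sample scores above every negative one, which does not by itself fix the decision threshold used for the final prediction.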

4.7.1 ViT performance on balanced and imbalanced dataset

The OvR approach was used to evaluate the ViT model’s performance for each class, as shown in Figures 22 and 23. The AUC-ROC scores are interpreted as follows:

Figure 22

AUC-ROC curve for the ViT model on balanced dataset. Source: Created by the authors.

Figure 23

AUC-ROC curve for the ViT model on imbalanced dataset. Source: Created by the authors.

Class “Moderate”: An AUC-ROC score of 1.00 on both datasets indicates perfect discrimination, where the model flawlessly distinguishes this class from the other classes. This suggests that the model is exceptionally accurate in identifying instances of the “Moderate” class.

Class “Mild”: AUC-ROC scores of 0.997 and 0.998 are exceptionally high, indicating that the model performs very well in distinguishing the “Mild” class from the others. These near-perfect scores show that the model is highly reliable in identifying instances of the “Mild” class.

Class “Normal”: The model’s AUC-ROC scores are 0.995 and 0.999, demonstrating excellent performance in distinguishing the “Normal” class from the other classes. These high scores suggest that the model is very effective in recognizing instances of the “Normal” class.

Class “Very Mild”: An AUC-ROC score of 0.994 on both datasets indicates strong performance in classifying the “Very Mild” class. This score reflects that the model is highly capable of differentiating instances of the “Very Mild” class from the others.

The high AUC-ROC scores across all classes indicate that the model achieves robust performance in distinguishing between each class in the multiclass classification problem. These scores suggest that the model has strong discriminatory power and is effective in handling the complexity of the classification task.

4.7.2 CNN performance on balanced and imbalanced dataset

The OvR approach was used to evaluate the CNN model’s performance for each class, as shown in Figures 24 and 25. The AUC-ROC scores are interpreted as follows:

Figure 24

AUC-ROC curve for the CNN model on balanced dataset. Source: Created by the authors.

Figure 25

AUC-ROC curve for the CNN model on imbalanced dataset. Source: Created by the authors.

Class “Moderate”: An AUC-ROC score of 1.00 indicates that the CNN model perfectly distinguishes the “Moderate” class from the others in both datasets. This demonstrates exceptional performance in identifying instances of the “Moderate” class.

Class “Mild”: The AUC-ROC scores of 0.99 and 0.997 (balanced and imbalanced datasets, respectively) are very high, showing that the CNN model has excellent capability in differentiating the “Mild” class from other classes. These scores suggest robust performance in classifying instances of the “Mild” class.

Class “Normal”: With AUC-ROC scores of 0.98 and 0.998 (balanced and imbalanced datasets, respectively), the CNN model performs well in distinguishing the “Normal” class. Although the balanced-dataset score is slightly lower than the scores for “Mild” and “Moderate,” it still indicates excellent performance.

Class “Very Mild”: AUC-ROC scores of 0.970 and 0.996 (balanced and imbalanced datasets, respectively) suggest that the CNN model is effective in identifying the “Very Mild” class. While the balanced-dataset score is slightly lower than those of the other classes, it still reflects strong performance in classifying instances of the “Very Mild” class.

The CNN model’s AUC-ROC scores demonstrate robust performance across all classes. The model is strongest at separating the “Moderate” and “Mild” classes from the rest, with slightly lower but still high scores for the “Normal” and “Very Mild” classes. These results indicate that the CNN model is highly effective for the multiclass classification task, although there is some variability in performance across classes.

4.7.3 Swin transformer performance on balanced and imbalanced dataset

The OvR approach was applied to evaluate the swin transformer model’s performance across different classes, as shown in Figures 26 and 27. The AUC-ROC scores for the swin transformer model are as follows:

Figure 26

AUC-ROC curve for the swin transformer model on balanced dataset. Source: Created by the authors.

Figure 27

AUC-ROC curve for the swin transformer model on imbalanced dataset. Source: Created by the authors.

Class “Moderate”: An AUC-ROC score of 1.00 indicates that the swin transformer model perfectly distinguishes the “Moderate” class from the others in both datasets. This highlights the model’s exceptional ability to identify instances of the “Moderate” class accurately.

Class “Mild”: The AUC-ROC scores of 0.94 and 0.985 (balanced and imbalanced datasets, respectively) demonstrate that the swin transformer model performs well in distinguishing the “Mild” class. While lower than the perfect score for “Moderate,” they still reflect strong classification capability for the “Mild” class.

Class “Normal”: With AUC-ROC scores of 0.94 and 0.985 (balanced and imbalanced datasets, respectively), the model shows similar performance in identifying the “Normal” class. These scores indicate good performance in distinguishing the “Normal” class from other classes.

Class “Very Mild”: AUC-ROC scores of 0.92 and 0.965 (balanced and imbalanced datasets, respectively) suggest that the swin transformer model is effective in classifying the “Very Mild” class. Although these scores are the lowest among the classes, they still represent good performance.

The AUC-ROC scores for the swin transformer model demonstrate strong performance across all classes. The model achieves perfect classification for the “Moderate” class, with high but slightly lower scores for the “Mild,” “Normal,” and “Very Mild” classes. These results suggest that the Swin Transformer model is effective for the multiclass classification task, though there is some variability in performance across different classes.

4.8 Comparison with existing studies

Table 9 presents a comparative analysis of AD classification AR across different models from the literature and the proposed models. The classification task is based on a four-class dataset, with two versions of the dataset: one balanced and one imbalanced. These are all models based on a balanced dataset. In the literature, a lightweight deep model [40] achieved an AR of 95.93%, DAD-Net without ADASYN [41] attained 90% AR, a hybrid CNN+LSTM model [42] reached 98.5%, and DenseNet-169 [43] obtained 88.7%. For the proposed work on the imbalanced dataset, the fine-tuned ViT (FT-ViT) achieved the highest AR of 99%, followed by CNN with 97% and Swin with 88%. On the balanced dataset, FT-ViT maintained a strong performance with 96% AR, while CNN reached 90% and Swin 79%.

Table 9

Comparison with existing state-of-the-art work on the Alzheimer’s dataset

Reference / dataset	Model	Accuracy (%)
[40]	Lightweight deep model	95.93
[41]	DAD-Net (without ADASYN)	90
[42]	Hybrid (CNN+LSTM)	98.5
[43]	DenseNet-169	88.7
Imbalanced dataset	FT-ViT	98.69
Imbalanced dataset	CNN	97.39
Imbalanced dataset	Swin	88.43
Balanced dataset	FT-ViT	96.27
Balanced dataset	CNN	89.99
Balanced dataset	Swin	79.32

5 Limitations

While the proposed deep learning approach shows promising results in AD classification, several limitations should be noted to provide a balanced perspective on the study:

Disease-Specific Framework: The current framework is designed specifically for AD classification. This specialization limits its generalizability to other neurological disorders or medical conditions, which may require additional adaptation or fine-tuning of the model.

Clinical Integration: Although the fine-tuned ViT demonstrated high accuracy in this study, its integration with existing clinical systems has not been evaluated. Further testing is needed to assess its compatibility with clinical workflows and its impact on diagnostic decision-making processes.

Treatment Recommendations: The proposed model is limited to classification tasks and does not encompass treatment or medication recommendations. In a clinical context, a diagnostic tool that includes actionable recommendations would provide more comprehensive support for healthcare providers.

Data Imbalance Challenges: Although data augmentation and balancing techniques were applied, real-world clinical datasets often present more complex imbalance issues that may affect model performance. Future studies could explore alternative techniques such as class-weighted loss functions or focal loss to address class imbalance more robustly.

These limitations highlight areas for future improvement and provide a foundation for further research in developing a more versatile and clinically integrated model.
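To make the class-weighted and focal-loss suggestion above concrete, the sketch below shows both per-sample losses in plain Python. It is a hedged illustration, not part of the study: the inverse-frequency weighting scheme, the helper names, and the default `gamma=2.0` are assumptions, and the class counts are the test-set supports from Tables 7 and 8, used only as an example of an imbalanced distribution (in practice the weights would be derived from the training split).

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by N / (K * count), so rarer classes get larger weights."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {name: total / (k * count) for name, count in class_counts.items()}

def weighted_cross_entropy(p_true, weight):
    """Cross-entropy for one sample, scaled by the weight of its true class.

    p_true is the probability the model assigns to the correct class.
    """
    return -weight * math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: shrinks the loss of well-classified examples.

    gamma=0 recovers plain (alpha-scaled) cross-entropy.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Illustrative class counts (test-set supports of the imbalanced dataset).
counts = {"Mild": 153, "Moderate": 11, "Normal": 492, "Very Mild": 304}
weights = inverse_frequency_weights(counts)

for name, w in weights.items():
    print(f"{name}: weight = {w:.2f}")

# A confident correct prediction contributes far less focal loss than an uncertain one.
print(f"focal loss at p=0.9: {focal_loss(0.9):.4f}")
print(f"focal loss at p=0.5: {focal_loss(0.5):.4f}")
```

With these counts the Moderate class receives by far the largest weight, so each of its errors costs more during training, whereas focal loss shifts the emphasis toward hard examples regardless of class. Either scheme would need to be validated empirically on the AD datasets before drawing conclusions.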

6 Conclusion

This study demonstrates the efficacy of advanced deep learning models, particularly fine-tuned ViT, in the classification of AD stages across both balanced and imbalanced datasets.

Theoretical and practical implications: From a theoretical perspective, our research contributes to the understanding of how model architecture impacts performance under data imbalance conditions, highlighting ViT’s ability to capture complex patterns through self-attention mechanisms. Practically, these findings support the use of ViTs in medical imaging applications, where data imbalance is often unavoidable. This model’s high accuracy on the imbalanced dataset (99%) suggests its potential utility in real-world clinical settings, where balanced data may not always be available.

Research contributions: Our contributions include (1) developing and fine-tuning ViTs, CNNs, and swin transformers for AD classification; (2) investigating the impact of data imbalance on model performance, with a focus on the robustness of ViTs; (3) conducting a comparative analysis of three architectures, demonstrating the effectiveness of ViTs for multiclass classification in medical imaging; and (4) exploring data augmentation techniques to assess their role in model performance across different data distributions.

Practical advantages: The high accuracy achieved by the ViT model underscores its suitability for real-world applications. The model’s ability to adapt to both balanced and imbalanced datasets indicates its potential for integration into clinical workflows, aiding healthcare professionals in the early and reliable detection of AD stages.

Limitations: Despite promising results, this study has limitations that should be acknowledged. The framework is currently specialized for AD, which restricts its generalizability to other medical conditions. Moreover, our approach has not been tested within clinical systems, limiting our understanding of its practical feasibility in healthcare settings. In addition, the model does not offer treatment or medication recommendations, focusing solely on diagnostic classification.

Future research directions: Future studies could focus on (1) extending the model to classify other neurodegenerative or medical conditions to test its adaptability; (2) integrating the ViT model into clinical decision-support systems and evaluating its impact on diagnostic workflows; and (3) exploring hybrid approaches that combine ViTs with other models or techniques to enhance interpretability, offering clinicians deeper insights into model predictions.

Acknowledgments

The authors sincerely thank the editor and reviewers for their valuable time and effort in re-evaluating this work. Their constructive feedback and insightful suggestions have greatly enhanced the quality of this manuscript.

Funding information: This research did not receive any specific grant or funding from public, commercial, or not-for-profit agencies.
Author contributions: Hassan Almalki contributed to the conceptualization, supervision, formal analysis, and review and editing of the manuscript. Alaa O. Khadidos led the methodology, data curation, original draft preparation, visualization, and formal analysis. Nawaf Alhebaishi focused on visualization, formal analysis, supervision, and review and editing of the manuscript.
Conflict of interest: The authors declare no conflicts of interest associated with this work.
Data availability statement: The dataset “Augmented Alzheimer MRI Dataset” analyzed during the current study is available on Kaggle at the following link: https://www.kaggle.com/datasets/uraninjo/augmented-alzheimer-mri-dataset.

References

[1] Scheltens P, Blennow K, Breteler MM, De Strooper B, Frisoni GB, Salloway S, et al. Alzheimer’s disease. Lancet. 2016;388(10043):505–17. 10.1016/S0140-6736(15)01124-1.

[2] Kumar A, Sidhu J, Lui F, Tsao JW. Alzheimer Disease. StatPearls. Treasure Island (FL): StatPearls Publishing; 2024. Accessed: May 30, 2024. [Online]. Available: http://www.ncbi.nlm.nih.gov/books/NBK499922/.

[3] Pagnon de la Vega M, Näslund C, Brundin R, Lannfelt L, Löwenmark M, Kilander L, et al. Mutation analysis of disease causing genes in patients with early onset or familial forms of Alzheimer’s disease and frontotemporal dementia. BMC Genomics. 2022;23(1):99. 10.1186/s12864-022-08343-9.

[4] Armstrong RA. Risk factors for Alzheimer’s disease. Folia Neuropathol. 2019;57(2):87–105. 10.5114/fn.2019.85929.

[5] Shukla A, Tiwari R, Tiwari S. Review on Alzheimer disease detection methods: automatic pipelines and machine learning techniques. Sci. 2023;5(1):13. 10.3390/sci5010013.

[6] Sorour SE, El-Mageed AAA, Albarrak KM, Alnaim AK, Wafa AA, El-Shafeiy E. Classification of Alzheimer’s disease using MRI data based on deep learning techniques. J King Saud Univ Comput Inf Sci. 2024;36(2):101940. 10.1016/j.jksuci.2024.101940.

[7] Zhang Y, Londos E, Minthon L, Wattmo C, Liu H, Aspelin P, et al. Usefulness of computed tomography linear measurements in diagnosing Alzheimer’s disease. Acta Radiol. 2008;49(1):91–7. 10.1080/02841850701753.

[8] Marcus C, Mena E, Subramaniam RM. Brain PET in the diagnosis of Alzheimer’s disease. Clin Nucl Med. 2014;39(10):e413–26. 10.1097/RLU.0000000000000547.

[9] Chow MSM, Wu SL, Webb SE, Gluskin K, Yew DT. Functional magnetic resonance imaging and the brain: A brief review. World J Radiol. 2017;9(1):5–9. 10.4329/wjr.v9.i1.5.

[10] McNeill R, Sare GM, Manoharan M, Testa HJ, Mann DM, Neary D, et al. Accuracy of single-photon emission computed tomography in differentiating frontotemporal dementia from Alzheimer’s disease. J Neurol Neurosurg Psychiatry. 2007;78(4):350–5. 10.1136/jnnp.2006.106054.

[11] Shukla A, Tiwari R, Tiwari S. Alzheimer’s disease detection from fused PET and MRI modalities using an ensemble classifier. Mach Learn Knowl Extr. 2023;5(2):512–38. 10.3390/make5020031.

[12] Beynon R, Sterne JA, Wilcock G, Likeman M, Harbord RM, Astin M, et al. Is MRI better than CT for detecting a vascular component to dementia? A systematic review and meta-analysis. BMC Neurol. 2012;12(1):33. 10.1186/1471-2377-12-33.

[13] Akbari Sari A, Ravaghi H, Mobinizadeh M, Sarvari S. The cost-utility analysis of PET-scan in diagnosis and treatment of non-small cell lung carcinoma in Iran. Iran J Radiol. 2013;10(2):61–7. 10.5812/iranjradiol.8559.

[14] Chauhan N, Choi B-J. Classification of Alzheimer’s disease using maximal information coefficient-based functional connectivity with an extreme learning machine. Brain Sci. 2023;13(7):1046. 10.3390/brainsci13071046.

[15] Alqahtani FF. SPECT/CT and PET/CT, related radiopharmaceuticals, and areas of application and comparison. Saudi Pharm J. 2023;31(2):312–28. 10.1016/j.jsps.2022.12.013.

[16] Magnin B, Mesrob L, Kinkingnéhun S, Pélégrini-Issac M, Colliot O, Sarazin M, et al. Support vector machine-based classification of Alzheimer’s disease from whole-brain anatomical MRI. Neuroradiology. 2008;51:73–83. 10.1007/s00234-008-0463-x.

[17] Saied I, Arslan T, Chandran S. Classification of Alzheimer’s disease using RF signals and machine learning. IEEE J Electromagn RF Microw Med Biol. 2021;PP:1. 10.1109/JERM.2021.3096172.

[18] Okfalisa, Gazalba I, Mustakim, Reza NGI. Comparative analysis of k-nearest neighbor and modified k-nearest neighbor algorithm for data classification. In: 2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE). Nov. 2017. p. 294–8. 10.1109/ICITISEE.2017.8285514.

[19] Xiao R, Cui X, Qiao H, Zheng X, Zhang Y, Zhang C, et al. Early diagnosis model of Alzheimer’s disease based on sparse logistic regression with the generalized elastic net. Biomed Signal Process Control. 2021;66:102362. 10.1016/j.bspc.2020.102362.

[20] Sudharsan M, Thailambal G. Alzheimer’s disease prediction using machine learning techniques and principal component analysis (PCA). Mater Today Proc. 2023;81:182–90. 10.1016/j.matpr.2021.03.061.

[21] Bhagwat N, Pipitone J, Voineskos AN, Chakravarty MM, Alzheimer’s disease neuroimaging initiative. An artificial neural network model for clinical score prediction in Alzheimer disease using structural neuroimaging measures. J Psychiatry Neurosci. 2019;44(4):246–60. 10.1503/jpn.180016.

[22] Nguyen M, Sun N, Alexander DC, Feng J, Yeo BT. Modeling Alzheimer’s disease progression using deep recurrent neural networks. In 2018 International Workshop on Pattern Recognition in Neuroimaging (PRNI). IEEE; 2020. p. 1–4. 10.1109/PRNI.2018.8423955.

[23] Hcini G, Jdey I, Dhahri H. Investigating deep learning for early detection and decision-making in Alzheimer’s disease: a comprehensive review. Neural Process Lett. 2024;56(3):153. 10.1007/s11063-024-11600-5.

[24] Huang Y, Li W. Resizer swin transformer-based classification using sMRI for Alzheimer’s disease. Appl Sci. 2023;13(16):16. 10.3390/app13169310.

[25] Xin J, Wang A, Guo R, Liu W, Tang X. CNN and swin-transformer based efficient model for Alzheimer’s disease diagnosis with sMRI. Biomed Signal Process Control. 2023;86:105189. 10.1016/j.bspc.2023.105189.

[26] Majee A, Gupta A, Raha S, Das S. Enhancing MRI-based classification of Alzheimer’s disease with explainable 3D hybrid compact convolutional transformers. In 2024 International Joint Conference on Neural Networks (IJCNN). IEEE. 2024 Jun 30. p. 1–8. 10.1109/IJCNN60899.2024.10650462.

[27] Alom KS, Rahman MM, Kim HU. Alzheimer’s disease detection using deep learning and multiple feature fusion. Comput Biol Med. 2024;113:103377. 10.1016/j.compbiomed.2024.103377.

[28] Kim HSRY, Shin TR, Moon JH, Choi SS. Advanced classification techniques for Alzheimer’s disease diagnosis: a review of recent advances. Health Inf Sci Syst. 2024;12(1):1. 10.1186/s13755-024-00278-6.

[29] Hong YJ, Lee TK, Jang JA. A hybrid deep learning model for Alzheimer’s disease detection based on multi-modal imaging. IEEE Access. 2024;12:105256–67. 10.1109/ACCESS.2024.3287867.

[30] Nguyen TN, Smith MG, Zhao KM. Deep learning approaches for multi-modal Alzheimer’s disease classification: a comparative study. Comput Biol Med. 2024;112:103340. 10.1016/j.compbiomed.2024.103340.

[31] Yin Y, Jin W, Bai J, Liu R, Zhen H. SMIL-DeiT: Multiple instance learning and self-supervised vision transformer network for early Alzheimer’s disease classification. 2022 International Joint Conference on Neural Networks (IJCNN). IEEE. 2022. p. 1–6. 10.1109/IJCNN55064.2022.9892524.

[32] Hoang GM, Kim U-H, Kim JG. Vision transformers for the prediction of mild cognitive impairment to Alzheimer’s disease progression using mid-sagittal sMRI. Front Aging Neurosci. 2023;15:1–11. 10.3389/fnagi.2023.1102869.

[33] Richhariya B, Tanveer M, Rashid AH. Diagnosis of Alzheimer’s disease using universum support vector machine based recursive feature elimination (USVM-RFE). Biomed Signal Process Control. 2020;59:101903. 10.1016/j.bspc.2020.101903.

[34] Kadri R, Bouaziz B, Tmar M, Gargouri F. Efficient multimodel method based on transformers and CoAtNet for Alzheimer’s diagnosis. Digit Signal Process. 2023;143:104229. 10.1016/j.dsp.2023.104229.

[35] Zhu W, Sun L, Huang J, Han L, Zhang D. Dual attention multi-instance deep learning for Alzheimer’s disease diagnosis with structural MRI. IEEE Trans Med Imaging. 2021;40(9):2354–66. 10.1109/TMI.2021.3077079.

[36] Kang W, Lin L, Zhang B, Shen X, Wu S, Alzheimer’s disease neuroimaging initiative. Multi-model and multi-slice ensemble learning architecture based on 2D convolutional neural networks for Alzheimer’s disease diagnosis. Comput Biol Med. 2021;136:104678. 10.1016/j.compbiomed.2021.104678.

[37] Mendoza-Léon R, Puentes J, Uriza LF, Hernández Hoyos M. Single-slice Alzheimer’s disease classification and disease regional analysis with supervised switching autoencoders. Comput Biol Med. 2020;116:103527. 10.1016/j.compbiomed.2019.103527.

[38] Lakhan A, Mohammed MA, Abd Ghani MK, Abdulkareem KH, Marhoon HA, Nedoma J, et al. FDCNN-AS: Federated deep convolutional neural network Alzheimer detection schemes for different age groups. Inf Sci (Ny). 2024;677:120833. 10.1016/j.ins.2024.120833.

[39] Ibrahim AM, Mohammed MA. A comprehensive review on advancements in artificial intelligence approaches and future perspectives for early diagnosis of Parkinson’s disease. Int J Math. 2024;2:173–82. 10.59543/ijmscs.v2i.8915.

[40] El-Latif AA, Chelloug SA, Alabdulhafith M, Hammad M. Accurate detection of Alzheimer’s disease using lightweight deep learning model on MRI data. Diagnostics. 2023;13(7):1216. 10.3390/diagnostics13071216.

[41] Ahmed G, Er MJ, Fareed NM, Zikria S, Mahmood S, He J, et al. Dad-net: Classification of Alzheimer’s disease using adasyn oversampling technique and optimized neural network. Molecules. 2022;27(20):7085. 10.3390/molecules27207085.

[42] Balaji P, Chaurasia MA, Bilfaqih SM, Muniasamy A, Alsid LE. Hybridized deep learning approach for detecting Alzheimer’s disease. Biomedicines. 2023;11(1):149. 10.3390/biomedicines11010149.

[43] Al Shehri W. Alzheimer’s disease diagnosis and classification using deep learning techniques. PeerJ Comp Sci. 2022;8:e1177. 10.7717/peerj-cs.1177.

Received: 2024-09-29

Accepted: 2024-11-30

Published Online: 2025-12-04

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/jisys-2024-0406

Keywords for this article

Alzheimer’s disease; classification; vision transformers; data imbalance; medical imaging
