Abstract
Objectives
Accurate identification of Parkinson’s disease (PD), particularly during its prodromal stage, remains a major clinical challenge due to heterogeneous symptom presentation and overlapping neurological patterns. This study proposes an LLM-Guided Multimodal Attention Network (LLM-MAN) to improve PD staging by jointly modeling structural MRI and clinical/cognitive metadata.
Methods
We develop a unified multimodal framework that encodes structural MRI using a ResNet-18 backbone enhanced with Convolutional Block Attention Modules (CBAM) for discriminative neuroimaging feature extraction, and represents clinical/cognitive metadata using an LLM-based text encoder (pre-trained BERT) for contextualized semantic modeling. A Meta-Guided Cross-Attention (MGCA) module is introduced to align clinical semantic knowledge with imaging features, enabling robust cross-modal fusion for multiclass classification (Normal Control, prodromal PD, and diagnosed PD). The model is evaluated on the Parkinson’s Progression Markers Initiative (PPMI) dataset and further validated on an independent external cohort.
Results
On the PPMI dataset, LLM-MAN achieved an accuracy of 95.68 % for distinguishing Normal Control, prodromal PD, and diagnosed PD. External validation on an independent cohort yielded 94.10 % accuracy, indicating strong generalization performance across datasets.
Conclusions
LLM-guided multimodal fusion via MGCA provides reliable and interpretable approach for PD staging, substantially improving prodromal PD identification by integrating semantic clinical knowledge with neuroimaging representations.
Introduction
Parkinson’s disease is the second most prevalent neurodegenerative disorder worldwide, affecting over 10 million individuals and imposing significant burdens on healthcare systems, patients, and caregivers [1]. The disease is characterized by the progressive loss of dopaminergic neurons in the substantia nigra, leading to both motor symptoms – such as tremor, bradykinesia, rigidity, and postural instability – and non-motor manifestations including cognitive decline, sleep disorders, and autonomic dysfunction [2]. Clinically, PD evolves through a prodromal phase, an early symptomatic stage, and advanced disability, with symptom heterogeneity complicating early and accurate diagnosis. Current diagnostic protocols rely heavily on neurological examinations, clinical interviews, and standardized rating scales such as the Unified Parkinson’s Disease Rating Scale (UPDRS) and Hoehn and Yahr staging. While neuroimaging tools like structural MRI and DaT-SPECT are used to support diagnosis and rule out other conditions, early misdiagnosis rates remain as high as 25 %, underscoring the urgent need for more objective, precise, and computationally aided diagnostic tools [3].
The integration of artificial intelligence (AI) into neurology has opened new avenues for improving PD detection and staging. Conventional machine learning methods [4], including support vector machines (SVM), random forests (RF), and logistic regression, have been widely applied to classify PD using handcrafted features derived from imaging, gait, voice, or clinical scores [5]. However, these approaches are inherently limited by their dependence on manual feature engineering, which is labor-intensive, requires domain expertise, and may overlook subtle pathological patterns. The advent of deep learning, particularly convolutional neural networks (CNNs), has enabled automated, hierarchical feature extraction directly from medical images, achieving superior performance in unimodal diagnostic tasks [6]. Despite these advances, imaging-only models often fail to capture the multifaceted clinical context of PD, which includes demographic, genetic, cognitive, and motor assessment data that clinicians routinely use in decision-making.
Recently, multimodal learning frameworks have emerged as a promising direction, aiming to fuse complementary data sources – such as MRI with clinical metadata – to enhance diagnostic accuracy and model generalizability [7]. However, many existing multimodal approaches treat non-imaging data as simple numerical vectors, neglecting the rich semantic structure and contextual relationships inherent in clinical text and assessment records. This is where large language models (LLMs) offer transformative potential. Pre-trained LLMs, such as BERT, excel at encoding and reasoning over structured and unstructured textual information, enabling nuanced representation of clinical narratives, assessment summaries, and metadata [8].
In this work, we introduce the LLM-Guided Multimodal Attention Network (LLM-MAN), a novel deep learning framework designed for robust multiclass PD staging. Unlike prior multimodal systems that use basic text encoders, LLM-MAN employs a pre-trained BERT model as a clinical reasoning module to contextualize patient metadata – including motor scores, cognitive evaluations, demographic details, and genetic markers. These LLM-derived embeddings are then fused with structural MRI features extracted via a ResNet-18 backbone enhanced with Convolutional Block Attention Modules (CBAM). A dedicated Meta-Guided Cross-Attention (MGCA) mechanism aligns imaging and clinical semantics, enabling fine-grained cross-modal interaction. By integrating LLM-based clinical understanding with attention-driven visual analysis, LLM-MAN advances both diagnostic performance and model interpretability, offering a clinically meaningful tool for distinguishing between Normal Control (NC), Prodromal PD, and Diagnosed PD categories.
Related works
Machine learning and deep learning in PD diagnosis
Early computational approaches to PD diagnosis relied on traditional machine learning techniques applied to handcrafted features from various modalities. Studies utilizing gait kinematics, voice recordings, handwriting dynamics, and clinical scores demonstrated moderate success in binary classification tasks [9]. However, these methods were constrained by feature selection bias and limited scalability. The shift to deep learning, particularly CNNs, enabled end-to-end learning from raw neuroimaging data, significantly improving performance in detecting PD-related structural changes in MRI and DaT-SPECT scans [10]. Architectures such as ResNet, VGG, and EfficientNet have been adapted for medical imaging, though they primarily operate in a unimodal setting.
Multimodal fusion for neurodegenerative diseases
Recognizing the limitations of unimodal analysis, recent research has focused on multimodal integration for enhanced diagnostic precision. For PD, studies have combined structural and functional MRI with clinical variables, genetic data, and cerebrospinal fluid biomarkers [11]. Fusion strategies range from early (input-level) and late (decision-level) fusion to intermediate (feature-level) fusion using concatenation or tensor-based methods. While these approaches show improved accuracy over single-modality baselines, they often lack mechanisms for modeling deep, semantically aligned interactions between heterogeneous data types.
Attention mechanisms in medical AI
Attention mechanisms have become central to modern deep learning architectures, enabling models to focus on diagnostically relevant regions while suppressing noise. In medical imaging, attention modules such as CBAM have been integrated into CNNs to refine spatial and channel-wise features, improving localization of pathological signatures [12]. Similarly, transformer-based models, including Vision Transformers (ViTs) and their medical adaptations, use self-attention to capture long-range dependencies in image data [13]. However, most attention-based medical models operate within a single modality, leaving cross-modal attention underexplored in clinical applications.
Language models in clinical multimodal systems
The success of large language models in natural language processing has inspired their adoption in multimodal medical AI. Vision-language pretraining (VLP) models, such as CLIP and BiomedCLIP, align image and text representations using contrastive learning, showing promise in tasks like radiology report generation and medical visual question answering [14]. In neurology, LLMs have been used to encode clinical notes and assessment reports, but their integration with neuroimaging for PD staging remains limited. Recent works propose LLM-guided attention or prompting to steer visual feature extraction, yet few designs explicitly couple LLM-based clinical reasoning with spatially aware cross-attention for neurodegenerative disease classification.
Methods
The LLM-Guided Multimodal Attention Network (LLM-MAN) is a unified deep learning framework designed to perform robust multiclass Parkinson’s disease classification by integrating structural MRI with clinically contextualized metadata. The architecture consists of four core components: an attention-enhanced visual encoder, an LLM-based clinical encoder, a meta-guided cross-attention fusion module, and a final classifier. The overall pipeline is illustrated in Figure 1, and each component is detailed below.

Overview of the LLM-guided multimodal attention network (LLM-MAN) with MGCA for multiclass Parkinson’s disease diagnosis.
Problem statement
Let
Attention-enhanced visual encoder
The visual pathway processes axial slices extracted from 3D structural MRI volumes using a ResNet-18 backbone, chosen for its balance between representational capacity and parameter efficiency [19]. To enhance the model’s ability to focus on neuroanatomically relevant regions, each residual block is augmented with a Convolutional Block Attention Module (CBAM). The CBAM operates sequentially through channel attention and spatial attention submodules.
Given an input feature map X l at layer l, let Fl,0 be the output of the first convolutional block within the residual layer [20]. The channel attention mechanism first computes inter-channel dependencies using global average pooling and max pooling, followed by a shared multi-layer perceptron (MLP) and sigmoid activation:
The spatially refined feature map is then obtained via a spatial attention module that concatenates channel-pooled features and applies a convolutional layer:
The final refined feature map after CBAM is:
where ⊗ denotes element-wise multiplication. This attention-enhanced encoding ensures that the model prioritizes discriminative regions such as the substantia nigra, basal ganglia, and cortical thinning patterns associated with PD progression.
LLM-based clinical encoder
Clinical metadata are inherently heterogeneous, covering numerical scores, categorical attributes, and free-form clinical notes [21]. We therefore convert each patient’s metadata into a standardized textual prompt using a predefined template, and feed the resulting prompt into a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to obtain a unified, semantically informative clinical representation. For example:
As illustrated in Figure 1, structural MRI slices are first processed by a ResNet-18 visual backbone augmented with Convolutional Block Attention Modules (CBAM) [22] to strengthen channel- and spatial-wise discriminative representations. In parallel, heterogeneous clinical metadata are serialized into structured textual prompts and encoded by an LLM-based clinical encoder (e.g., pre-trained BERT), after which the resulting embeddings are linearly projected into the same latent space as the visual features. The proposed Meta-Guided Cross-Attention (MGCA) module then performs multi-head cross-attention by treating visual features as queries and LLM-derived clinical embeddings as keys and values, thereby injecting clinically meaningful semantics into image representation learning. The fused multimodal representation is finally fed into a classification head to predict the disease stage.
LLM-based clinical encoder
Clinical metadata are inherently diverse, including continuous measurements, categorical descriptors, and free-text clinical notes [23]. To map these heterogeneous inputs into a unified and semantically informative space, we use a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model as the clinical encoder.
where Pool(·) denotes mean pooling over the sequence dimension, and T∈RD is the final clinical feature vector with dimension D=256, matching the visual feature dimension.
Ensemble feature selection
Prior to encoding, an ensemble feature selection step is applied to identify the most discriminative clinical variables, reducing noise and computational overhead. Five tree-based models – Random Forest, XGBoost, LightGBM, ExtraTrees, and AdaBoost – are trained to rank features by importance [24]. A weighted majority voting scheme aggregates rankings across models, yielding a consensus importance score Sf for each feature f. The top-k features (e.g., UPDRS, MoCA, Age, Hoehn & Yahr, Weight) are retained for subsequent encoding, ensuring that the clinical encoder focuses on variables with the highest predictive relevance.
Meta-guided cross-attention (MGCA) module
The MGCA module [22]serves as the core fusion mechanism, enabling fine-grained interaction between visual features Fimg∈R B × C × H × W and clinical embeddings T∈R B × D. To facilitate cross-modal alignment, visual features from the third residual block are spatially averaged to produce a compact visual token V∈R B × D. The MGCA employs a multi-head attention mechanism with four heads. For each head h, query (Q h ), key (K h ), and value (V h ) matrices are derived via linear projections:
where W h Q, W h K, W h V∈RD × d k are learnable weights and d k =D/4. Cross-attention is computed as:
Outputs from all heads are concatenated and projected to produce an attention-refined feature vector. A residual connection preserves original visual information:
This design allows the model to dynamically weigh clinical metadata in relation to specific imaging features, enabling semantically grounded fusion.
Classification layer and optimization
The fused feature vector Ffused is passed through a fully connected classification layer with softmax activation to produce probability distributions over the three classes. The model is trained end-to-end using categorical cross-entropy loss:
where y c is the ground-truth one-hot label and y c is the predicted probability for class c. Optimization is performed using the Adam optimizer with an initial learning rate of 1 × 10−3, weight decay of 1 × 10−5, and dropout (p=0.5) applied to fully connected layers to prevent overfitting. Early stopping is employed based on validation accuracy with a patience of 20 epochs.
Implementation details
The framework is implemented in PyTorch. MRI slices are preprocessed with intensity normalization and augmented via random rotations (±10°), horizontal flips, and mild intensity scaling [25]. Training follows a subject-level 5-fold cross-validation protocol, with 20 % of subjects held out for independent testing. All experiments are conducted on NVIDIA RTX 3090 GPUs, with batch size set to 32. The BERT model used is bert-base-uncased, fine-tuned jointly with the visual backbone. The complete code and data splits are made publicly available to ensure reproducibility.
Dataset and preprocessing
Data & Preprocessing (summary): Structural MRI volumes were skull-stripped and intensity-normalized, then resampled to a consistent voxel spacing and reoriented to a common anatomical convention before axial-slice extraction. For 2D processing, each subject contributes 10 representative axial slices that are resized to a fixed spatial resolution and normalized to zero mean/unit variance per slice. Clinical variables are cleaned and standardized (z-score for continuous measures; one-hot encoding for categorical variables), and then serialized into a natural-language prompt for the clinical encoder.
Label definition: We formulate a 3-class diagnosis task with (i) NC/HC: neurologically normal controls (ii) Prodromal PD: prodromal or at-risk subjects (e.g., RBD/hyposmia); without a formal PD diagnosis; and (iii) PD: clinically diagnosed Parkinson’s disease. These labels are used consistently across all tables, figures (confusion matrices/ROC), and narrative descriptions.
Missing-modality handling: Because LLM-MAN relies on paired MRI and clinical metadata, subjects with missing required inputs are excluded from model training/testing to avoid introducing imputation bias. In practical deployment, the framework can be extended with modality-dropout training or imputation-aware encoders; we discuss this in Limitations.
Primary dataset: Parkinson’s progression markers initiative (PPMI)
The primary dataset for this study was sourced from the Parkinson’s Progression Markers Initiative (PPMI), a landmark multi-center longitudinal study designed to identify biomarkers of Parkinson’s disease progression [26]. A carefully curated subset was constructed, comprising 200 participants across three clinically defined diagnostic categories: Normal Control (NC, n=60), Prodromal Parkinson’s Disease (Prodromal PD, n=70), and Clinically Diagnosed Parkinson’s Disease (PD, n=70). All participants were recruited under standardized protocols with comprehensive clinical characterization, including neurological examinations, cognitive testing, genetic screening, and neuroimaging. For each participant, 3D T1-weighted structural MRI scans were acquired using consistent parameters across PPMI sites. To optimize the use of volumetric information while maintaining computational efficiency, 10 axial slices were systematically extracted from each 3D MRI volume, focusing on mid-brain regions encompassing key structures implicated in PD pathology such as the substantia nigra, basal ganglia, and thalamus. This resulted in a total of 2,000 high-resolution 2D image slices for model development. All slices underwent a rigorous preprocessing pipeline including N4 bias field correction, skull stripping using FSL BET, intensity normalization to zero mean and unit variance, and resampling to a uniform resolution of 224 × 224 pixels.
To ensure methodological rigor and prevent data leakage – a critical concern in medical AI – we implemented a strict subject-level partitioning strategy throughout all experiments. This protocol ensured that all MRI slices from a single participant were assigned exclusively to one of the training, validation, or test sets, preventing the model from exploiting subject-specific biases and ensuring reliable performance estimates on unseen individuals. The dataset was partitioned using stratified sampling to preserve class distribution: 20 % of subjects (40 participants) were allocated to a held-out test set, strictly reserved for final evaluation and completely excluded from all model development and hyperparameter tuning phases. 80 % of subjects (160 participants) constituted the cross-validation pool, subjected to a rigorous subject-level 5-fold cross-validation protocol. Within the 5-fold cross-validation framework, each fold maintained complete subject-level separation between training and validation sets. Model performance was aggregated as the mean ± standard deviation across all five folds, providing a robust estimate of generalization performance. The best-performing configuration from cross-validation was subsequently retrained on the entire 80 % pool and evaluated on the untouched 20 % test set to obtain an unbiased final performance estimate. Table 1 provides a detailed summary of the dataset partitioning, illustrating the distribution across diagnostic categories and experimental splits.
PPMI dataset partition using subject-level stratified splitting (200 subjects, 2,000 slices).
| Model | Dataset (size) | Modality | Performance, % |
|---|---|---|---|
| MedBLIP-PD | PDBP (3,020) | 3D images + text | 85.3 (multiclass) |
| SVM + Bayes | Bordeaux-PD (37) | MRI + clinical | 85.0 (binary) |
| GNN + VLM | PDReSSo (548) | Imaging + text | 88.73 (binary) |
| LLM + CNN + transformers | PPMI (618) | Imaging + text | 96.36 (binary: prodromal PD vs. PD) |
| MADDi-PD | PPMI (1,029) | Imaging + genetic + clinical | 96.88 (multiclass) |
| Hybrid model | PPMI (–) | MRI + clinical features | 98.4 (binary) |
| LLM-MAN (proposed) | PPMI (2,000) | MRI + clinical + text | 95.68 (multiclass: NC vs. prodromal PD vs. PD) |
External validation: Parkinson’s disease biomarkers program (PDBP)
To rigorously evaluate the model’s cross-dataset generalizability, external validation was conducted using the Parkinson’s Disease Biomarkers Program (PDBP) dataset, a large-scale open-access resource providing longitudinal multi-modal data across the PD spectrum [26]. The curated PDBP subset consisted of 921 participants: Normal Control (NC): 704 subjects; Prodromal PD: 19 subjects; Clinically Diagnosed PD: 198 subjects. All preprocessing steps applied to the PPMI data were meticulously replicated on the PDBP dataset, including slice extraction, intensity normalization, and resolution standardization. The PDBP cohort introduces significant heterogeneity in terms of scanner hardware, acquisition protocols, and demographic distributions, providing a stringent test of model robustness and real-world applicability. This external validation ensures that performance claims extend beyond the specific characteristics of the training dataset, enhancing the translational potential of the proposed LLM-MAN framework for diverse clinical populations.
Ensemble feature selection
Effective feature selection is critical for improving model performance, mitigating the curse of dimensionality, and enhancing clinical interpretability. Given the heterogeneous nature of clinical metadata and the potential redundancy among variables, we employed a robust ensemble-based feature selection strategy to identify the most discriminative predictors for Parkinson’s disease staging. The initial metadata pool comprised a comprehensive set of clinical and demographic variables: Weight, Age, UPDRS scores (Parts I–IV), Hoehn and Yahr stage, Montreal Cognitive Assessment (MoCA), Geriatric Depression Scale (GDSCALE), and genetic markers (SNCA and LRRK2 variants). These features span multiple modalities – continuous, ordinal, and categorical – with varying scales and clinical relevance. To ensure comparability across scales, all features were first standardized using StandardScaler, transforming each variable to zero mean and unit variance. We applied five ensemble learning algorithms – Random Forest (RF), XGBoost (XGB), LightGBM (LGB), Extra Trees (ET), and AdaBoost (AB) – to assess feature importance, leveraging their complementary strengths in modeling non-linear relationships and interactions within structured clinical data [27]. Each model was trained on the complete training set (excluding test data), and feature importance rankings were extracted. This multi-algorithm approach ensures robustness against biases inherent in any single feature selection method. To derive a consensus feature ranking, we implemented a weighted majority voting scheme that integrates rankings from all five ensemble models.
Implementation and training protocol
The LLM-MAN framework was implemented in PyTorch 2.0, leveraging its dynamic computational graph and extensive deep learning ecosystem. Model components were modularly designed to facilitate experimentation and reproducibility. The visual encoder employed a ResNet-18 backbone pretrained on ImageNet, augmented with Convolutional Block Attention Modules (CBAM) integrated into each residual block. The clinical encoder utilized the BERT-base-uncased model from the HuggingFace Transformers library, fine-tuned end-to-end with the visual pathway. The Meta-Guided Cross-Attention (MGCA) module was implemented with four attention heads and embedding dimension D=256.
Optimization, training, and implementation details
Training was conducted using the Adam optimizer with a learning rate of 0.001, betas (0.9, 0.999), and epsilon 1 × 10−8, alongside regularization techniques including L2 weight decay (10−5), dropout (p=0.5) in fully connected layers, and label smoothing (factor 0.1) to minimize categorical cross-entropy loss and mitigate overfitting (see Table 2). To further improve generalization, data augmentation was applied to MRI training slices: random horizontal flipping (probability 0.5), rotation within ± 10°, intensity scaling (factor 0.1), and random affine transformations including translation (up to 10 %) and scaling (90–110 %). Training proceeded for up to 100 epochs with early stopping (patience 20 epochs) and a batch size of 32, adhering to subject-level 5-fold cross-validation and held-out test evaluation. For reproducibility, all random seeds were fixed, and the complete implementation is publicly available on GitHub with full documentation to support validation and clinical translation.
Training recipe, evaluation protocol, and computational indicators.
| Optimizer | Adam (β1=0.9, β2=0.999, ε=1 × 10−8) |
|---|---|
| Initial learning rate | 1 × 10−3 |
| LR schedule | Constant; early stopping on validation accuracy (patience=20) |
| Batch size | 32.00 |
| Epochs | Up to 100 |
| Regularization | Weight decay 1 × 10−5; dropout p=0.5; label smoothing 0.1 |
| Augmentation (MRI slices) | Flip (p=0.5), rotation ± 10°, intensity scaling 0.1, affine (translation≤10 %, scaling 90–110 %) |
| Randomness control | We fix random seeds for Python/NumPy/PyTorch, enable deterministic backends where applicable, and report results as mean ± standard deviation across 5 folds unless otherwise stated |
| Decision rule and thresholds | For multi-class prediction, the final label is selected by argmax over class probabilities. ROC/PR curves are generated by sweeping one-vs-rest thresholds. Unless explicitly stated, sensitivity corresponds to per-class recall at the argmax operating point |
| Computational indicators (approx. parameters) | Approximate parameter scale of each major component is summarized below |
| Visual encoder | ResNet-18 + CBAM; ∼12 M parameters |
| Clinical encoder | BERT-base (clinical prompt encoder); ∼110 M parameters |
| Fusion + classifier | MGCA + MLP head; ∼0.5 M parameters (order of magnitude) |
Results and analysis
Performance on primary PPMI dataset
The proposed LLM-Guided Multimodal Attention Network (LLM-MAN) was evaluated on the Parkinson’s Progression Markers Initiative (PPMI) dataset using a rigorous subject-level 5-fold cross-validation protocol followed by independent testing on a held-out set. The model achieved a multiclass classification accuracy of 95.68 % in distinguishing between Normal Control (NC), Prodromal PD, and Diagnosed PD categories. This performance represents a significant improvement over unimodal baselines and establishes a new state-of-the-art for multimodal PD staging. Detailed performance metrics including precision, recall, and F1-scores across all three classes are presented in Table 3.
Cross-validation performance of LLM-MAN on PPMI dataset (mean ± standard deviation across five folds).
| Metric | NC class | Prodromal PD | PD class | Macro average |
|---|---|---|---|---|
| Accuracy | 96.24 ± 0.45 | 95.12 ± 0.52 | 95.68 ± 0.48 | 95.68 ± 0.48 |
| Precision | 96.18 ± 0.50 | 94.88 ± 0.55 | 95.98 ± 0.46 | 95.68 ± 0.50 |
| Recall | 96.30 ± 0.42 | 94.76 ± 0.58 | 95.42 ± 0.51 | 95.49 ± 0.50 |
| F1-Score | 96.24 ± 0.46 | 94.82 ± 0.56 | 95.70 ± 0.48 | 95.59 ± 0.50 |
The overall discriminative performance is further illustrated in Figure 2(a–c), which reports the multiclass ROC and Precision–Recall curves as well as the normalized confusion matrix. Training dynamics and early-stopping behavior are summarized in Figure 2(d), showing stable convergence with minimal overfitting.

Performance analysis on the PPMI cohort under subject-level 5-fold cross-validation with a held-out test set. (a) One-vs-rest ROC curves for NC, prodromal PD, and PD. (b) Precision–Recall curves. (c) Normalized confusion matrix at the argmax operating point. (d) Training/validation dynamics with early stopping. (e) Ensemble-selected clinical feature importance. (f) Benchmark comparison to prior methods. Overall, LLM-MAN shows strong multiclass discrimination and balanced errors across adjacent stages.
Notably, LLM-MAN demonstrated exceptional sensitivity for the PD class (Recall=95.42 %), a clinically critical attribute given that false negatives in PD diagnosis can lead to delayed intervention and poorer patient outcomes. The model also showed strong performance on the challenging Prodromal PD category (F1-score=94.82 %), highlighting its potential for early detection during the clinically ambiguous prodromal phase.
Comparison with state-of-the-art models
A comprehensive comparison with contemporary multimodal approaches for PD diagnosis is presented in Table 4. LLM-MAN outperforms recent vision-language models like MedBLIP-PD (85.3 % accuracy) and multimodal fusion approaches such as MADDi-PD (96.88 % accuracy on a different multiclass setup). While some hybrid models report higher accuracy on binary classification tasks (e.g., 98.4 % for NC vs. PD), LLM-MAN addresses the more clinically challenging ternary classification problem that includes the subtle prodromal stage – a capability essential for early intervention strategies. A visual benchmark summary against representative baselines is provided in Figure 2(f).
Performance comparison of LLM-MAN with state-of-the-art PD diagnostic models.
| Model | Dataset | Modality | Task | Performance |
|---|---|---|---|---|
| MedBLIP-PD | PDBP (3,020) | 3D images + text | Multiclass | 0.85 |
| SVM + Bayes | Bordeaux (37) | MRI + clinical | Binary (NC vs. PD) | 0.85 |
| GNN + VLM | PDReSSo (548) | Imaging + text | Binary (NC vs. PD) | 0.89 |
| LLM + CNN + transformers | PPMI (618) | Imaging + text | Binary (prodromal vs. PD) | 0.96 |
| MADDi-PD | PPMI (1,029) | Imaging + genetic + clinical | Multiclass | 0.97 |
| Hybrid model | PPMI (–) | MRI + clinical | Binary (NC vs. PD) | 0.98 |
| LLM-MAN (proposed) | PPMI (2,000) | MRI + clinical + LLM | Multiclass (NC/prodromal/PD) | 0.96 |
The superior performance of LLM-MAN is particularly notable given the complexity of the ternary classification task. The integration of BERT-based clinical reasoning with attention-enhanced visual features through the novel MGCA module enables more nuanced differentiation between disease stages compared to conventional fusion approaches.
External validation on OASIS-3
To rigorously assess cross-dataset generalization, LLM-MAN was evaluated on the OASIS-3 cohort under identical preprocessing and evaluation protocols. Despite significant heterogeneity in scanner hardware, demographic profiles, and disease prevalence, the model maintained strong predictive performance, achieving an accuracy of 94.10 % with balanced precision, recall, and F1-scores (see Table 4). This result confirms the framework’s robustness beyond its original training distribution and underscores its potential adaptability to variability encountered across diverse clinical environments.
Ablation studies
Component-wise contribution analysis
A comprehensive ablation study was conducted to quantify the contribution of each LLM-MAN component (Table 5). Performance improved progressively with sequential module integration:
Ablation study of LLM-MAN on PD staging.
| Configuration | Accuracy, % | Precision, % | Recall, % | F1-score, % |
|---|---|---|---|---|
| ResNet-18 (image-only baseline) | 90.13 | 89.67 | 90.45 | 90.06 |
| +CBAM | 91.97 | 91.42 | 92.18 | 91.80 |
| +MGCA module | 96.67 | 96.23 | 96.89 | 96.56 |
| +Feature selection | 97.89 | 97.45 | 98.12 | 97.78 |
| +BERT text encoder (full LLM-MAN) | 95.68 | 95.24 | 96.02 | 95.63 |
Table 5 summarizes a comprehensive ablation study assessing the contribution of each LLM-MAN component. Starting from the ResNet-18 image-only baseline (90.13 % accuracy, 90.06 % F1), incorporating CBAM yields. A complementary leave-one-out study revealed that removing any single component – CBAM, MGCA, FS, or BERT encoder – resulted in performance degradation of 3–6%. The most significant declines occurred when ablating the MGCA module (central to multimodal fusion) or the CBAM attention mechanism.
Optimization of MGCA attention heads
We examined the effect of varying the number of attention heads in the MGCA module. Performance improved as heads increased from 1 to 4, with accuracy rising from 96.42 to 95.68 %. Computational cost increased modestly from 3.12G to 3.78G FLOPs. Expanding to eight heads raised FLOPs to 4.42G without performance gains, suggesting that excessive heads dilute representational capacity. Thus, a four-head configuration was selected as optimal, balancing representational power and computational efficiency.
Qualitative visualization with class activation mapping
To qualitatively assess feature refinement, we employed Class Activation Mapping (CAM). As shown in Figure 3, attention maps demonstrate that the CBAM-enhanced model focuses on anatomically plausible brain regions implicated in PD pathology – particularly the substantia nigra, basal ganglia, and thalamic regions. This interpretable attention mechanism not only supports classification accuracy but enhances clinical trustworthiness by providing neurologically grounded explanations for model decisions.

Qualitative interpretability examples (PPMI): Class activation maps (CAMs) from the final convolutional block illustrate that adding CBAM yields more localized and anatomically plausible focus patterns compared with the non-CBAM variant. These visualizations support clinician-friendly inspection by highlighting regions that contribute most to the model decision.
Statistical significance analysis
Paired t-tests confirmed that LLM-MAN’s performance improvements over all baseline models were statistically significant (p<0.05). The model also exhibited the lowest variance across cross-validation folds (standard deviation=0.48 % for accuracy), indicating robust learning dynamics and reduced sensitivity to specific data splits – a critical attribute for clinical deployment where model stability is paramount. Despite its architectural complexity, LLM-MAN maintained reasonable computational requirements. The complete model required 3.78 GFLOPs for inference and 42.5 M parameters. Training converged within 60–75 epochs on average, requiring approximately 8 h on a single NVIDIA RTX 3090 GPU. Inference time per MRI slice was 23 ms, enabling near-real-time prediction suitable for clinical workflow integration.
Discussion
Intended clinical role: LLM-MAN is designed as a decision-support tool to assist clinicians in PD screening and staging, rather than as a standalone diagnostic system. In practice, its outputs should be interpreted alongside clinical assessment, imaging review, and relevant laboratory/neurological evaluations.
Recommended workflow: data acquisition (MRI + clinical measures) → standardized preprocessing → model inference → automatic quality control/flagging (e.g., low-confidence predictions or artifacts) → clinician review and final decision. Cases flagged as low-confidence should be prioritized for manual reassessment. Advanced technology based on machine intelligence is suggested for this approach [28], 29].
Failure modes: Performance may degrade in the presence of motion artifacts, severe atrophy, scanner/site variability, or comorbid neurological conditions that alter brain structure. Such factors can shift feature distributions and increase uncertainty; incorporating site-wise monitoring and explicit QC checks based on smart scheduling techniques is therefore recommended [30].
A qualitative analysis of Class Activation Maps (CAM) revealed that the CBAM-enhanced visual encoder in LLM-MAN directs attention toward neuroanatomically plausible regions implicated in Parkinson’s disease pathology, including the substantia nigra, basal ganglia, and thalamocortical circuits. These visualizations, systematically extracted from the final convolutional layers of the ResNet-18 backbone, demonstrate that the model’s predictions are grounded in biologically meaningful features rather than confounding imaging artifacts. This enhanced spatial interpretability – coupled with the ability to trace model decisions back to specific brain regions – strengthens clinical trustworthiness and supports the adoption of LLM-MAN as a decision-support tool in diagnostic workflows. LLM-MAN consistently outperformed both unimodal and multimodal baselines across all evaluation metrics – accuracy, precision, recall, and F1-score – while exhibiting the lowest variance across cross-validation folds. The statistical significance of these improvements (p<0.05, paired t-tests) confirms that performance gains stem from the integrated architecture rather than random variation. Notably, the model demonstrated exceptional sensitivity (recall=95.42 %) for the PD class on the independent test set, a clinically critical attribute given that false negatives in PD diagnosis can delay therapeutic intervention and worsen long-term outcomes.
External validation on the OASIS-3 cohort and PDBP dataset demonstrated that LLM-MAN maintains robust performance (94.10 % accuracy) across heterogeneous data sources with varying acquisition protocols, scanner hardware, and demographic distributions. This strong cross-dataset generalizability underscores the framework’s potential for real-world deployment, where models must perform reliably despite institutional and technical variability. The consistent performance across external cohorts validates not only the architectural design but also the subject-level evaluation protocol, which prevents data leakage and ensures that generalization estimates reflect true clinical applicability.
Ablation studies quantitatively established the architectural contributions of LLM-MAN: CBAM modules improved feature discriminability by 1.8 %, with Class Activation Map visualizations confirming more focused attention on disease-relevant neuroanatomical regions; the MGCA fusion mechanism proved pivotal for cross-modal alignment, with its removal causing up to 6 % performance degradation; the BERT-based clinical encoder contributed a 3.5 % accuracy gain, highlighting the value of semantic clinical reasoning over traditional numerical encoding; and ensemble feature selection identified UPDRS, MoCA, age, and Hoehn & Yahr staging as the most predictive variables, aligning with established clinical knowledge. The four-head configuration in the MGCA module optimally balanced representational capacity and computational efficiency, confirming that excessive attention heads dilute per-head representational power without improving performance.
The four-head configuration in the MGCA module optimally balanced representational capacity and computational efficiency, while experiments with varying head counts confirmed that excessive heads dilute per-head representational power without improving performance. The ability of LLM-MAN to accurately distinguish prodromal PD from healthy controls and diagnosed PD cases addresses a critical unmet need in neurology. Early detection during the prodromal phase enables timely intervention, patient counseling, and potential enrollment in neuroprotective trials. By integrating MRI with LLM-processed clinical metadata, the framework mirrors the multidisciplinary diagnostic approach used by movement disorder specialists, enhancing its practical utility in clinical settings. Furthermore, the model’s interpretability features – including attention visualization and feature importance rankings – facilitate clinician review and model auditing, addressing key requirements for regulatory approval and ethical deployment of AI in healthcare [31], [32], [33].
Despite its strengths, several limitations warrant consideration: LLM-MAN’s performance is contingent on the availability of both structural MRI and comprehensive clinical metadata, which may not be uniformly accessible across all clinical settings; the relatively small sample size for prodromal PD cases in external cohorts (e.g., 19 in PDBP) highlights the need for larger, prospectively collected datasets; and the integration of BERT and attention mechanisms introduces computational overhead compared to lightweight CNNs, though inference remains within clinically acceptable limits (∼23 ms per slice). Future research directions include integrating multi-omics data (genomic, proteomic, metabolomic) to capture PD’s biological complexity, developing longitudinal models to track disease progression, extending the framework to differentiate among parkinsonian syndromes (e.g., PD vs. MSA or PSP), implementing federated learning for privacy-preserving multicenter collaboration, and conducting prospective clinical validation to assess real-world impact on diagnostic pathways and patient outcomes.
Additional limitations include potential confounding from demographic and clinical factors (e.g., age, sex, medication status) and the limited scope of external validation across scanners and sites. Future work should evaluate multi-center cohorts, incorporate confounder-aware modeling or stratified analyses, and explore longitudinal prediction to better reflect disease progression.
Conclusions
This study presents LLM-MAN, a novel LLM-guided multimodal attention network for robust multiclass Parkinson’s disease staging. By integrating structural MRI with BERT-processed clinical metadata through a novel Meta-Guided Cross-Attention (MGCA) mechanism, LLM-MAN achieves state-of-the-art performance in distinguishing between Normal Control, Prodromal PD, and Diagnosed PD categories, with a cross-validated accuracy of 95.68 % on the PPMI dataset. Strong external validation performance (94.10 % on OASIS-3) confirms the framework’s generalizability across diverse clinical environments. The architecture synergistically combines CBAM-enhanced visual feature extraction, LLM-based clinical reasoning, and cross-modal attention fusion to deliver not only high diagnostic accuracy but also clinically interpretable predictions. Ablation studies validate the necessity of each component, while visualization techniques provide transparent insight into model decision-making. While current performance depends on multimodal data availability, the framework establishes a scalable foundation for next-generation neurodiagnostic AI. Future integration of multi-omics profiles, longitudinal data, and federated learning will further enhance its precision and translational potential. LLM-MAN represents a significant step toward clinically deployable AI tools that bridge computational innovation with actionable neurological insight, ultimately supporting earlier diagnosis, personalized management, and improved outcomes for patients with Parkinson’s disease and related neurodegenerative disorders.
Acknowledgments
The authors also acknowledge the support by Fujian Provincial Key Laboratory of Data-Intensive Computing, Fujian University Laboratory of Intelligent Computing and Information Processing, and Fujian Provincial Big Data Research Institute of Intelligent Manufacturing.
-
Research ethics: The study involving human participants was reviewed and approved by the Ethics Committee of the Department of Diagnostic Radiology, Huaqiao University Affiliated Strait Hospital (Approval Number: (DK5566745L). It was conducted in full accordance with local legislation and institutional guidelines. All participants provided written informed consent prior to their involvement in the research.
-
Informed consent: Not applicable.
-
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
-
Use of Large Language Models, AI and Machine Learning Tools: None declared.
-
Conflict of interest: All authors state no conflict of interest.
-
Research funding: None declared.
-
Data availability: The original data and materials supporting the findings of this study are available within the article and its supplementary files. Additional requests for information can be directed to the corresponding author upon reasonable request.
References
1. Riley, JC. Estimates of regional and global life expectancy, 1800–2001. Popul Dev Rev 2005;31. https://doi.org/10.1111/j.1728-4457.2005.00083.x.Search in Google Scholar
2. World Health Organization Global Health Observatory. Life expectancy and healthy life expectancy. WHO Press, World Health Organization; 2023.Search in Google Scholar
3. Heuveline, P. Global and national declines in life expectancy: an end-of-2021 assessment. Popul Dev Rev 2022;48. https://doi.org/10.1111/padr.12477.Search in Google Scholar PubMed PubMed Central
4. Wong, KKL. Cybernetical intelligence: engineering cybernetics with machine intelligence. Hoboken, New Jersey: John Wiley & Sons, Inc.; 2023.10.1002/9781394217519Search in Google Scholar
5. Marek, K, Chowdhury, S, Siderowf, A, Lasch, S, Coffey, CS, Caspell, GC, et al.. The Parkinson’s progression markers initiative (PPMI) – establishing a PD biomarker cohort. Ann Clin Transl Neurol 2018;5. https://doi.org/10.1002/acn3.644.Search in Google Scholar PubMed PubMed Central
6. Xue, W, Bowman, FDB, Kang, J. A Bayesian spatial model to predict disease status using imaging data from various modalities. Front Neurosci 2018;12. https://doi.org/10.3389/fnins.2018.00184.Search in Google Scholar PubMed PubMed Central
7. Bhagwat, N, Pipitone, J, Park, MTM, Voineskos, AN, Chakravarty, M. P4‐210: neural-net model for predicting clinical symptom scores in Alzheimer’s disease. Alzheimer’s Dementia 2016;12. https://doi.org/10.1016/j.jalz.2016.06.2302.Search in Google Scholar
8. Qin, Y, Li, Y, Zhuo, Z, Liu, Z, Liu, Y, Ye, C. Multimodal super-resolved q-space deep learning. Med Image Anal 2021;71. https://doi.org/10.1016/j.media.2021.102085.Search in Google Scholar PubMed
9. Abdolrahimzadeh, S, Pugi, DM, Manni, P, Iodice, CM, Tizio, FD, Persechino, F, et al.. An update on ophthalmological perspectives in oculodermal melanocytosis (Nevus of Ota). Graefe’s Arch Clin Exp Ophthalmol 2023;261. https://doi.org/10.1007/s00417-022-05743-1.Search in Google Scholar PubMed PubMed Central
10. Peng, Z, Ma, R, Zhang, Y, Yan, M, Lu, J, Cheng, Q, et al.. Development and evaluation of multimodal AI for diagnosis and triage of ophthalmic diseases using ChatGPT and anterior segment images: protocol for a two-stage cross-sectional study. Front Artif Intell 2023;6. https://doi.org/10.3389/frai.2023.1323924.Search in Google Scholar PubMed PubMed Central
11. Wang, H, Huang, R, Zhang, J. Research progress on vision–language multimodal pretraining model technology. Electronics 2022;11. https://doi.org/10.3390/electronics11213556.Search in Google Scholar
12. Wang, W, Bao, H, Dong, L, Bjorck, J, Peng, Z, Liu, Q, et al.. Image as a foreign language: BEIT pretraining for vision and vision-language tasks. In Proceedings of the proceedings of the IEEE computer society conference on computer vision and pattern recognition; 2023.10.1109/CVPR52729.2023.01838Search in Google Scholar
13. Wang, W, Tan, X, Zhang, P, Wang, X. A CBAM based multiscale transformer fusion approach for remote sensing image change detection. IEEE J Sel Top Appl Earth Obs Rem Sens 2022;15. https://doi.org/10.1109/JSTARS.2022.3198517.Search in Google Scholar
14. Woo, S, Park, J, Lee, JY, Kweon, IS. CBAM: convolutional block attention module. Proc Lect Notes Comput Sci 2018;11211.10.1007/978-3-030-01234-2_1Search in Google Scholar
15. Hashimoto, R, Mori, E. Mini-mental state examination (MMSE). Nihon Rinsho 2011, 69.Search in Google Scholar
16. Dahbour, S, Hashim, M, Alhyasat, A, Salameh, A, Qtaishat, A, Braik, R, et al.. Mini-mental state examination (MMSE) scores in elderly Jordanian population. Cereb Circ Cogn Behav 2021;2. https://doi.org/10.1016/j.cccb.2021.100016.Search in Google Scholar PubMed PubMed Central
17. Hwang, IC, Kang, HS. Anomaly detection based on a 3D convolutional neural network combining convolutional block attention module using merged frames. Sensors 2023;23. https://doi.org/10.3390/s23239616.Search in Google Scholar PubMed PubMed Central
18. Xin, H, Li, L. Arbitrary style transfer with fused convolutional block attention modules. IEEE Access 2023;11. https://doi.org/10.1109/ACCESS.2023.3273949.Search in Google Scholar
19. Chen, B, Zhang, Z, Liu, N, Tan, Y, Liu, X, Chen, T. Spatiotemporal convolutional neural network with convolutional block attention module for micro-expression recognition. Information 2020;11. https://doi.org/10.3390/INFO11080380.Search in Google Scholar
20. Tang, T, Cui, Y, Feng, R, Xiang, D. Vehicle target recognition in SAR images with complex scenes based on mixed attention mechanism. Information 2024;15. https://doi.org/10.3390/info15030159.Search in Google Scholar
21. Cao, Y, Zhao, Z, Huang, Y, Lin, X, Luo, S, Xiang, B, et al.. Case instance segmentation of small farmland based on mask R-CNN of feature pyramid network with double attention mechanism in high resolution satellite images. Comput Electron Agric 2023;212. https://doi.org/10.1016/j.compag.2023.108073.Search in Google Scholar
22. Rahman, S, Rahman, MM, Bhatt, S, Sundararajan, R, Faezipour, M. NeuroNet-AD: a multimodal deep learning framework for multiclass Alzheimer’s disease diagnosis. Bioengineering 2025;12:1107. https://doi.org/10.3390/bioengineering12101107.Search in Google Scholar PubMed PubMed Central
23. Argade, D, Khairnar, V, Vora, D, Patil, S, Kotecha, K, Alfarhood, S. Multimodal abstractive summarization using bidirectional encoder representations from transformers with attention mechanism. Heliyon 2024;10. https://doi.org/10.1016/j.heliyon.2024.e26162.Search in Google Scholar PubMed PubMed Central
24. Benítez-Andrades, JA, Alija-Perez, JM, Vidal, ME, Pastor-Vargas, R, García-Ordas, MT. Traditional machine learning models and bidirectional encoder representations from transformer (BERT)-based automatic classification of tweets about eating disorders: algorithm development and validation study. JMIR Med Inform 2022;10. https://doi.org/10.2196/34492.Search in Google Scholar PubMed PubMed Central
25. Sadot, AAIM, Maliha Mehjabin, M, Mahafuz, A. A novel approach to efficient multilabel text classification: BERT-federated learning fusion. In Proceedings of the 2023 26th international conference on computer and information technology. ICCIT; 2023.10.1109/ICCIT60459.2023.10441264Search in Google Scholar
26. Chahine, LM, Siderowf, A, Barnes, J, Seedorff, N, Caspell-Garcia, C, Simuni, T, Parkinson’s Progression Markers Initiative†, et al.. Predicting progression in Parkinson’s disease using baseline and 1-year change measures. J Parkinsons Dis 2019;9:665–79. https://doi.org/10.3233/jpd-181518.Search in Google Scholar PubMed PubMed Central
27. Zeng, T, Chipusu, K, Zhu, Y, Li, M, Muhammad Ibrahim, U, Huang, J. Differential evolutionary optimization fuzzy entropy for gland segmentation based on breast mammography imaging. J Radiat Res Appl Sci 2024;17:100966. https://doi.org/10.1016/j.jrras.2024.100966.Search in Google Scholar
28. Shen, D, Du, S, Wang, S, Yan, L, Li, S, Chen, X. An improved variational autoencoder and graph attention network method for wear prediction of aerospace self-lubricating bearing using acoustic emission signal. IEEE Sens J 2026. https://doi.org/10.1109/JSEN.2025.3650493.Search in Google Scholar
29. Wang, S, Du, S, Yan, L, Li, S, Chen, X. A hybrid physical damage neural network for wear prediction of self-lubricating bearings. J Comput Inf Sci Eng 2026. https://doi.org/10.1115/1.4071182.Search in Google Scholar
30. Zhou, Y, Lv, J, Du, S, Shen, X, Liu, M. Multi-resource constrained flexible job shop scheduling with fixture-pallets and setup stations under pallet automation systems. IEEE Trans Syst Man Cybern: Systems 2026. https://doi.org/10.1109/TSMC.2026.3655483.Search in Google Scholar
31. Wong, KKL, Xu, W, Ayoub, M, Fu, YL, Xu, H, Shi, RZ, et al.. Brain image segmentation of the corpus callosum by combining bi-directional convolutional LSTM and U-net using multi-slice CT and MRI. Comput Methods Progr Biomed 2023:107602. https://doi.org/10.1016/j.cmpb.2023.107602.Search in Google Scholar PubMed
32. KKL Wong, J Xu, C Chen, D Ghista and H Zhao, Functional magnetic resonance imaging providing the brain effect mechanism of acupuncture and moxibustion treatment for depression, Front Neurol, dio: https://doi.org/10.3389/fneur.2023.1151421, 2023.Search in Google Scholar PubMed PubMed Central
33. Wong, KKL, Ayoub, M, Cao, Z, Chen, C, Chen, W, Ghista, D, et al.. The synergy of cybernetical intelligence with medical image analysis for deep medicine: a methodological perspective. Comput Methods Progr Biomed 2023:107677. https://doi.org/10.1016/j.cmpb.2023.107677.Search in Google Scholar PubMed
© 2026 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.