Abstract
Communication barriers faced by elderly individuals and stroke patients with speech impairments pose significant challenges in daily interactions. While sign language serves as a vital means of communication, those struggling to speak may encounter difficulties in conveying their messages effectively. This research addresses the issue by proposing a system for generating audio-visual avatars capable of translating sign gestures into written and spoken language, thereby offering a comprehensive communication tool for individuals with special needs. The proposed method integrates YOLOv8, U-Net-based segmentation, and MobileNet classifiers to accurately recognize and classify sign gestures. YOLOv8n was used for gesture detection and classification; traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 based on multi-stage image segmentation were used for segmentation; and MobileNetV1 and MobileNetV2 were used for classification. Using the improved first-order motion model, the generated avatars enabled the real-time translation of sign motions into text and speech and facilitated interactive conversation in both Arabic and English. The evaluation findings demonstrated the system’s value, showing that traditional U-Net produced the best results in gesture segmentation and YOLOv8n performed best in gesture classification. This study contributes to advancing assistive communication technologies, offering insights into optimizing gesture recognition and avatar generation for enhanced communication support in elderly and stroke patient care. The YOLOv8n model achieved 0.956 precision and 0.939 recall for detecting and classifying gestures, while MobileNetV1 and MobileNetV2 reached classification accuracies of 0.94 and 0.79, respectively.
1 Introduction
Communication between older people who have had a stroke and their caregivers is a major challenge. When people suffer from speech difficulty due to stroke, they are unable to communicate effectively, which can isolate them and deepen feelings of loneliness and depression. This situation requires deep understanding and patience, as well as the use of alternative means of communication, such as body language and gestures, to facilitate communication and improve the quality of care and interaction between patients and their caregivers.
Because of the lack of data on sign gestures for elderly stroke patients in Iraq, it was necessary to create a dataset. In addition, some gestures are similar, such as “I want to eat” and “I want to change my clothes,” “Hello” and “Finished,” and “What” and “Why,” which causes confusion when interpreting them. YOLOv8n successfully addressed all these challenges except for the “Why” gesture. MobileNetV1 for classification, combined with U-Net for segmentation, was therefore used to overcome the difficulty that YOLOv8n faced in classifying the “Why” gesture.
Virtual reality has attracted interest in several technical areas [1]. An avatar is a digital representation of a user or their character [2], also known as a digital human [3] or a virtual human [4]. A number of intelligent platforms have been used to generate life-like virtual avatars that imitate patients with whom trainee doctors can communicate; Convai is one example of such a platform [5]. In addition, a toolkit called XFace is used for animation, face recognition, speech generation, facial expression analysis, and other interactive functions; this technology is also utilized to build virtual avatars [6]. The challenges faced by elderly people who find it difficult to speak or hear are multifaceted, affecting their ability to communicate and fulfill their daily needs [7,8]. Avatars allow users to interact socially, work, play, and express themselves [9]. Common communication problems in seniors include hearing loss, stuttering, weakened facial muscles, and inability to write; these issues can be caused by various health problems, such as stroke [10]. Generating avatars using available platforms and tools involves several challenges, such as avatar animation, realistic appearance, cost of creation, privacy and security, and system integration. Deep learning models can help solve some of these issues. “Generating an audio-visual avatar by an improved first-order motion model” refers to using an enhanced version of the first-order motion model to create an animated representation that combines both auditory and visual elements. The first-order motion model is a framework for transferring motion between video sequences, initially developed for applications such as face and image reenactment. Improving this model typically involves modifications that increase motion accuracy or enhance the overall quality of the results. In this work, an audio-visual avatar is generated using the improved first-order motion model, which translates the sign gestures and converts them into written and audio text, based on the results of YOLOv8 and MobileNetV1.
YOLOv8n is used for detection and classification; the traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 are used for segmentation; and MobileNetV1 is used for classification. Based on the classification results, the first-order motion model is used to generate an interactive avatar that can speak English and Arabic and translate the meaning of the sign gesture motions into text and speech. The YOLOv8n model achieved values of 0.956, 0.939, and 0.971 for precision, recall, and mAP50, respectively, for detecting and classifying gestures. MobileNetV1 gained values of 0.94, 0.94, and 0.94 for accuracy, precision, and recall, respectively, for classification. Traditional U-Net for segmentation achieved 0.9845 accuracy and 0.918 Dice coefficient. U-Net with VGG16 for segmentation achieved 0.9809 accuracy and 0.889 Dice coefficient. U-Net with MobileNetV2 for segmentation obtained 0.9478, 0.953, 0.9765, and 0.8034 for accuracy, Dice coefficient, precision, and recall, respectively. The generated audio-visual avatar’s performance was 85.56%.
This study makes a unique contribution in the following ways:
constructing a dataset of hand sign gestures specifically tailored to elderly individuals affected by stroke who have trouble speaking, collected in Iraqi nursing homes;
proposing a mask-annotation pipeline for segmenting the hand gestures using the U-Net model, thresholding segmentation, K-means clustering, and the ImgLab tool;
developing an audio-visual avatar capable of converting the recognized gestures into written text and speech for improved communication;
integrating advanced models such as YOLOv8 and U-Net with MobileNetV1 for recognizing and classifying sign gestures, which is a robust methodological approach.
This work advances assistive communication technologies and makes a significant contribution to the field of elderly and stroke patient care.
As illustrated in Section 4.2 (Quantitative analysis), the proposed avatar achieved excellent results. The remainder of this article is structured as follows: the literature review is summarized in Section 2, the methodology is described in Section 3, the results and discussion are presented in Section 4, and the conclusions, along with suggestions for future research directions, are presented in Section 5.
2 Literature review
An audio-visual avatar is a virtual person that blends visual and audible signs to improve human–computer interaction and the user experience [11,12]. Recent advancements in computer vision, artificial intelligence, and natural language processing have generated a lot of interest in the creation of visual and auditory avatars [13]. In addition to exploring the most recent developments and applications in this area, this section reviews the literature that has already been written on the production of audio-visual avatars.
In 2024, Zhang et al. [14] introduced Virbo, an intelligent talking avatar video generation system. It offers personalized functions, multilingual customization, voice cloning, face swapping, talking avatar dubbing, and visual special effect rendering. Virbo generates photo-realistic, lip-synchronized videos with better accuracy and authenticity and can create videos comparable to professional productions. Future work aims to enhance the speaker’s voice emotions and facial expressions. In 2024, López et al. [15] developed a novel hand gesture recognition (HGR) model using electromyography (EMG) signals and spectrograms. They evaluated a convolutional neural network–long short-term memory (CNN-LSTM) model and a post-processing algorithm. The results showed that memory cells improved the recognition accuracy by 3.29% compared to CNN models. However, post-processing had a more significant impact on recognition accuracy than memory cells in LSTM networks. This suggests that incorporating post-processing algorithms can enhance HGR models’ accuracy and robustness against EMG signal variability.
In 2021, García et al. [16] presented a description of an avatar production system, including the key technological specifications for designing avatars that embody auditory hallucinations, and an assessment of the system from the perspectives of both patients and therapists. Character Creator, Poser, Unity Multipurpose Avatar, and Adobe Fuse CC were all utilized for avatar creation. In 2023, Lu et al. [17] used a convenience sampling method to select 13 participants with hearing loss from a support group for senior citizens in the southern part of Ireland. Participants were interviewed in a semi-structured manner; interviews were audio-recorded and transcribed using NVivo 12. Themes related to the challenges faced in recent healthcare interactions and recommendations for improving comprehensive healthcare communication were identified using Clarke and Braun’s thematic analysis technique. In 2022, Zhang et al. [18] introduced a 3D animated high-fidelity human model. However, it has some limitations: (1) it is based on skinned multi-person linear (SMPL) projections, and (2) it cannot capture extremely fine actions, such as facial emotion variations. In 2022, Athira et al. [19] proposed a cutting-edge vision-based movement recognition system for signers that can recognize single-handed dynamic gestures, double-handed gestures, and fingerspelled phrases in Indian sign language (ISL) from live video. Support vector machine (SVM) was used for classification. The recognition accuracy was 89% for single-handed dynamic gestures and 91% for fingerspelling gestures. In 2021, Li et al. [20] presented a pipeline that creates a 3D human avatar from a single RGB photograph, producing a texture map for the entire body and three-dimensional human geometry. The technique separates the human body into its component parts, fits them to a parametric model, and warps them into the desired shape. From the frontal photos, InferGAN infers the unseen back texture. Using MoCap data, their human avatars can easily be rigged and animated. A mobile application demonstrates the effectiveness of the solution for AR applications, showing its reliability and efficiency on both public and private datasets. In 2021, Sharma and Singh [21] proposed a deep learning-based CNN model for identifying gesture-based sign language. The model outperforms conventional CNN architectures in classification accuracy while using fewer parameters. On the ISL and American sign language (ASL) datasets, during training and testing, the VGG-11 and VGG-16 models achieved the highest accuracies of 99.96 and 100%, respectively. According to experimental assessments, the model surpassed other strategies, identifying the most gestures with the least amount of error, and rotation and scaling transformations had no effect on it. In 2020, Thies et al. [22] introduced a pioneering technique for audio-based facial re-enactment that can be applied across diverse audio sources, enabling the production of a talking-head video from an audio sequence of another individual as well as the generation of realistic videos driven by a synthetic voice. This implies that text-driven video synthesis with synchronized artificial voices is possible. In 2020, Gupta and Rajan [23] used inertial measurement data from an accelerometer and a gyroscope to classify continuously signed words from ISL.
A modified time-LeNet architecture was proposed, alongside time-LeNet and a multichannel deep CNN (MC-DCNN), to address over-fitting. The models were compared in terms of complexity, loss, and classification precision: time-LeNet obtained 79.70% accuracy compared to MC-DCNN’s 83.94%. In 2019, Molano et al. [24] proposed the Candide parametric mask, a method for producing emotive avatars at runtime, speeding up 3D animation. It produced a variety of emotions, ranging from straightforward winks to complicated expressions, and was inspired by the Ekman emotional model. In 2019, García et al. [25] presented a scheme for creating avatar-based treatments. The strategy was based on a previously described tool to improve social cognition in people with cognitive impairment. First, the criteria for facial emotion identification in avatar-based treatments were developed. The supporting instrument for the therapy was then briefly described, and the administration of treatments was explained. For both the clinician and the patient, treatment execution was separated into pre-therapy and treatment phases. Table 1 presents a comparison between our generated avatar and those generated by earlier studies for specific activities. Table 2 presents a comparison between our suggested approaches for classifying gestures of elderly and stroke patients who are unable to speak and earlier research that has classified sign language.
Comparison of earlier studies that generated an avatar
Work | Dataset | Method | Audio-visual avatar aim | Gaps |
---|---|---|---|---|
Zhang et al. [14] | HDTF dataset | Virbo | Offers Virbo, an intelligent talking avatar video creation system | The system does not support video files as inputs for the user interface and needs more diverse character speech videos |
García et al. [16] | Their own dataset | Adobe Fuse CC tool, Mixamo tool, Autodesk 3D Studio Max tool, and Unity tool | Configuration of auditory hallucination avatars | Limited sample size (patients: 29; therapists: 20) |
Zhang et al. [18] | MPV, UBC, DeepFashion, and SHHQ | A 3D parametric human model SMPL and a deformation network | 3D-aware clothed human avatar | Does not provide a detailed evaluation or comparison with previous methods |
Li et al. [20] | DeepFashion | InferGAN | For general use | The body and limbs should not intersect, and some clothing features are lost, while the rebuilding utilized SMPL |
Our proposed audio-visual avatar | Constructing our dataset | Improved first-order motion model | Serve the elderly and people having special needs and aid those who take care of them | Attempted to overcome most of the gaps in previous works |
Comparison between earlier research that has classified sign language
Work | Method | Result | Gaps |
---|---|---|---|
López et al. [15] | CNN-LSTM model | Accuracy: 90.55% | Only recognizes a small set of hand gestures, employs a CNN-LSTM architecture with a high number of learnable parameters, requiring a large amount of training data, and not suitable for some real-time applications |
Athira et al. [19] | SVM | Recognition accuracy: single-handed dynamic gestures, 89%; fingerspelling gestures, 91% | Data availability, needing enhancements for robustness in real-world scenarios with diverse environmental settings and limited real-time exploration |
Sharma and Singh [21] | CNN | Accuracy: 100% for the ASL dataset; 99.96% for the ISL dataset | Its generalization to unseen data or different sign languages may need further exploration |
Gupta and Rajan [23] | Time-LeNet + MC-DCNN | MC-DCNN accuracy: 83.94%; time-LeNet average accuracy: 79.70%; modified time-LeNet accuracy: 81.62% | Lacks comparison with other state-of-the-art models, and the robustness of the models to variations in signing styles, speeds, or environmental conditions is not explored |
Our proposed methods | YOLOv8 + U-Net + MobileNetV1 | YOLOv8: precision, 0.95473; recall, 0.94035. MobileNetV1: accuracy, 94%. MobileNetV2: accuracy, 79% | Attempted to overcome most of the gaps in previous works |
When comparing our work with previous studies, we found the following: Zhang et al. [14] used a speaking avatar to build an intelligent system; García et al. [16] used an avatar to represent auditory hallucinations; Zhang et al. [18] featured an avatar of a human wearing three-dimensional clothing; and Li et al. [20] created an avatar for public use. In our work, we used avatars to facilitate communication between the patient and the doctor. Regarding the technology used to create the avatar, Zhang et al. [14] used Virbo; García et al. [16] utilized Adobe Fuse CC, Mixamo, Autodesk 3D Studio Max, and Unity; Zhang et al. [18] employed a 3D parametric human model (SMPL) and a deformation network; and Li et al. [20] used InferGAN. In the present work, we improved the performance of the first-order motion model by enhancing the input image to create a more realistic avatar, and we also proposed a multi-stage image segmentation method to enhance the classification accuracy.
3 Method and implementation
Figure 1 depicts the schematic of the proposed sign gesture interpreter system to serve the elderly and stroke patients having special needs and provide assistance to those who take care of them:

Schematic of the suggested system.
The system operates bidirectionally (from the patient to the doctor and vice versa), but this research focuses on the forward phase.
3.1 Gesture hand dataset preprocessing
The performance of the proposed communication enhancement system for older people with special needs, specifically those with speech difficulties, was evaluated using a constructed dataset. This was necessary because there is a lack of standard, publicly accessible datasets for individuals who are neither mute nor deaf. The constructed dataset consists of gesture-hand descriptions of specific things that elders need in daily life. It was created by visiting several nursing homes for the elderly and conducting interviews with both the residents and those in charge of their care. After settling on gestures and movements that meet the needs of the elderly, images of these movements were acquired, and a group of people was recruited to perform them. The proposed dataset contained 26 classes; each class had 110 images of different samples, with image dimensions of 640 × 640. The dataset included images of people of various ages and genders, with various backgrounds and orientations, based on the daily needs of elderly and stroke patients. The dataset was named “Sign Gestures for Elderly and Stroke Patients” (SGESP). The images were captured using web and mobile cameras, and the dataset included hand gestures for Iraqi elderly and stroke patients, as illustrated in Table 3.
Signs for elderly people constructed dataset and their meanings
The dataset was prepared for both semantic segmentation and instance segmentation. It was annotated in two ways so that it is suitable for instance segmentation and semantic segmentation: bounding-box annotation (manual labeling) and mask images (produced using manual labeling [JSON files], K-means clustering, U-Net, and threshold segmentation). The dataset was augmented ×4 [horizontal flipping, 5% rotation, 3% blurring, and 2% noise].
In this work, preprocessing included cropping, resizing, and converting images to JPG. Manual cropping was performed to remove uninteresting areas of the image, which helped focus on the main content, and antialiasing was applied during resizing. These steps improved the readiness of the images for visual analysis and for training artificial intelligence models. All dataset images were resized to 640 × 640 pixels to ensure reliable model training. The total number of original images was 2,860, which increased to 10,808 after augmentation. The proposed dataset was divided into 70% training, 20% validation, and 10% testing, and the hold-out method was used for training: about 10,010 images were used for training, 512 for validation, and 286 for testing. The dataset contained 10,808 text files as annotation labels.
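As a hedged illustration of this preprocessing and augmentation stage, the sketch below resizes images to 640 × 640 with antialiasing, converts them to JPG, and produces the four augmented variants described above; the directory layout, rotation angle (one reading of “5% rotation”), blur kernel, and noise level are assumptions rather than the authors’ exact script.

```python
# Illustrative preprocessing/augmentation sketch (assumed paths and parameters).
import glob, os
import cv2
import numpy as np
from PIL import Image

SRC_DIR, DST_DIR = "dataset/raw", "dataset/preprocessed"   # assumed folder layout
os.makedirs(DST_DIR, exist_ok=True)

for path in glob.glob(os.path.join(SRC_DIR, "*", "*.*")):
    img = Image.open(path).convert("RGB")
    img = img.resize((640, 640), Image.LANCZOS)            # antialiased resize to 640 x 640
    base = os.path.splitext(os.path.basename(path))[0]
    cls = os.path.basename(os.path.dirname(path))
    out_dir = os.path.join(DST_DIR, cls)
    os.makedirs(out_dir, exist_ok=True)
    img.save(os.path.join(out_dir, base + ".jpg"), "JPEG")  # conversion to JPG

    # x4 augmentation: horizontal flip, small rotation, slight blur, mild noise.
    arr = np.array(img)
    h, w = arr.shape[:2]
    flipped = cv2.flip(arr, 1)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0)     # assumed reading of "5% rotation"
    rotated = cv2.warpAffine(arr, M, (w, h))
    blurred = cv2.GaussianBlur(arr, (3, 3), 0)              # assumed reading of "3% blurring"
    noisy = np.clip(arr + np.random.normal(0, 0.02 * 255, arr.shape), 0, 255).astype(np.uint8)
    for tag, aug in [("flip", flipped), ("rot", rotated), ("blur", blurred), ("noise", noisy)]:
        cv2.imwrite(os.path.join(out_dir, f"{base}_{tag}.jpg"),
                    cv2.cvtColor(aug, cv2.COLOR_RGB2BGR))   # PIL gives RGB; OpenCV writes BGR
```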
3.1.1 Proposed multistage image segmentation method
In the first stage of the proposed method, the entire preprocessed dataset was segmented by thresholding in order to produce image masks for semantic annotation; about 10% of the dataset images were accurately segmented. In the next stage, K-means clustering was applied to the remaining 90% of the dataset images, and about 30% of the resulting images were accurately segmented. In the third stage, the ImgLab tool was applied to the remaining images (about 60%) to obtain JSON files, which were then converted into segmented images; about 10% of the resulting images were accurately segmented. At that point, about 50% of the images were correctly segmented and were used to train U-Net to obtain the best weights, while the remaining (incorrectly segmented) 50% were used for testing. U-Net then produced close to 100% correctly segmented images, which were used as masks for semantic annotation. The procedure is summarized in Algorithm 3.3, and an illustrative sketch of the thresholding and clustering stages follows the algorithm below.
Algorithm 3.3: Annotating the sign gesture dataset for semantic annotations
Input: Proposed dataset images
Output: Dataset images annotated for semantic segmentation
Begin:
Step 1: For i = 1 to 26 Do // to enter every dataset folder
Step 2: Segmentation by thresholding:
  For j = 1 to 110 Do // to read every dataset image in each folder
    - Img = Read(image)
    - Segment the image by thresholding segmentation using equation (2)
    - If the segmented region of the hand gesture is accurate, then
      - save the selected image in the gesture-mask folder
      - delete the selected image from the original dataset
  End for j loop.
Step 3: Segmentation by K-means clustering:
  For j = 1 to 110 Do // to read every remaining image in each folder
    - Img = Read(image)
    - Segment the image by K-means clustering using algorithm (2)
    - If the segmented region of the hand gesture is accurate, then
      - append the selected image to the gesture-mask folder
      - delete the selected image from the original dataset
  End for j loop.
Step 4: Segmentation by ImgLab:
  For j = 1 to 110 Do // to read every remaining image in each folder
    - Img = Read(image)
    - Segment the image with the ImgLab tool to obtain the JSON file
    - Convert the JSON file to segmented images
    - If the segmented region of the hand gesture is accurate, then
      - append the selected image to the gesture-mask folder
      - delete the selected image from the original dataset
  End for j loop.
Step 5: Semantic segmentation using U-Net:
  For j = 1 to 110 Do // to read every remaining image in each folder
    - Img = Read(image)
    - Train U-Net using the resulting gesture-mask folder
    - Obtain the best weight
    - Test the images remaining in the original dataset using the trained U-Net // images that have not been selected
    - Append the segmented images to the gesture-mask folder
    - Obtain the final segmented image using U-Net
  End for j loop.
Return the mask image for each image in the dataset
End
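For concreteness, the following sketch illustrates the first two stages of this pipeline (thresholding and K-means clustering) with OpenCV. It is an illustrative sketch only: the thresholding rule (Otsu here), the number of clusters, the hand-cluster heuristic, and the file names are assumptions, not the exact settings of equation (2) or algorithm (2) referenced above.

```python
# Illustrative sketch of the thresholding and K-means stages (assumed parameters).
import cv2
import numpy as np

def threshold_mask(image_bgr):
    """Stage 1: binary hand mask via (assumed) Otsu thresholding on the grayscale image."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask

def kmeans_mask(image_bgr, k=2):
    """Stage 2: cluster pixel colors into k groups and keep the cluster assumed to be the hand."""
    pixels = image_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    labels = labels.reshape(image_bgr.shape[:2])
    # Heuristic: treat the brighter cluster as skin/hand. In the paper, only masks judged
    # accurate are kept; the rest move on to the next stage.
    hand_cluster = int(np.argmax(centers.sum(axis=1)))
    return (labels == hand_cluster).astype(np.uint8) * 255

img = cv2.imread("gesture_sample.jpg")          # assumed file name
mask_stage1 = threshold_mask(img)
mask_stage2 = kmeans_mask(img)
cv2.imwrite("gesture_sample_mask.png", mask_stage1)
```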
3.2 Proposed methodology
3.2.1 Detection and classification by YOLOv8
YOLOv8 was adopted because it offers improved accuracy, faster speeds, anchor-free detection, the ability to focus on different areas of the image, and real-time processing. In this work, we employed YOLOv8 to detect and classify the gestures produced by elderly people with special needs.
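As a hedged illustration of this stage, the following sketch shows how a YOLOv8n model can be trained and queried with the ultralytics package; the dataset description file (sgesp.yaml) and confidence threshold are assumptions, while the 640-pixel image size and 196 epochs follow the settings reported later in the paper.

```python
# Minimal YOLOv8n detection/classification sketch using the ultralytics package.
from ultralytics import YOLO

# Train from the pretrained nano weights on the (assumed) SGESP dataset description file,
# which would list the 26 gesture classes and the train/val/test image folders.
model = YOLO("yolov8n.pt")
model.train(data="sgesp.yaml", epochs=196, imgsz=640)

# Validate (reports precision, recall, mAP50, mAP50-95) and run inference on a test image.
metrics = model.val()
results = model.predict("test_gesture.jpg", conf=0.25)   # assumed confidence threshold
for r in results:
    for box in r.boxes:
        print(r.names[int(box.cls)], float(box.conf))     # predicted gesture label and confidence
```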
3.2.2 Segmentation and classification by U-Net and MobileNetV1
U-Net is a powerful model for producing detailed, high-quality segmentation masks. It was utilized in this study for its flexibility, effective feature learning, high performance, versatility, robustness to limited data, and ease of implementation. MobileNetV1 performs well in image classification and computer vision tasks; its fast execution speed makes it suitable for real-time applications, and it achieves an acceptable balance between accuracy and performance, which makes it a good option for tasks such as image classification.
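A compact Keras sketch of a traditional U-Net of the kind used here for binary hand-mask segmentation is given below; the input size, depth, and filter counts are illustrative assumptions rather than the authors’ exact architecture (which has 1,941,105 parameters, per Table 4).

```python
# Illustrative traditional U-Net for binary hand-gesture segmentation (assumed sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(input_shape)
    # Encoder
    c1 = conv_block(inputs, 16); p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 32);     p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 64);     p3 = layers.MaxPooling2D()(c3)
    # Bottleneck
    b = conv_block(p3, 128)
    # Decoder with skip connections
    u3 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
    c4 = conv_block(layers.concatenate([u3, c3]), 64)
    u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.concatenate([u2, c2]), 32)
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c5)
    c6 = conv_block(layers.concatenate([u1, c1]), 16)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c6)   # binary hand mask
    return Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```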
3.2.3 Classification by MobileNetV2
MobileNetV2 was employed because it offers good performance, increased accuracy, a lightweight architecture, faster processing speeds, and real-time processing. In this work, we utilized MobileNetV2 to classify the gestures produced by elderly people with special needs.
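The hedged sketch below shows how MobileNetV1 and MobileNetV2 classification heads over the 26 gesture classes can be assembled with Keras transfer learning; the input size, frozen backbone, dropout rate, and optimizer are assumptions, not the authors’ reported configuration.

```python
# Illustrative MobileNetV1/MobileNetV2 classifiers for the 26 gesture classes (assumed settings).
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 26

def build_classifier(backbone_name="mobilenet_v1", input_shape=(224, 224, 3)):
    if backbone_name == "mobilenet_v1":
        base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                               input_shape=input_shape)
    else:
        base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                                 input_shape=input_shape)
    base.trainable = False                      # assumed: frozen backbone, train the head only
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = Model(base.input, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model

mobilenet_v1_clf = build_classifier("mobilenet_v1")
mobilenet_v2_clf = build_classifier("mobilenet_v2")
```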
3.2.4 Generating audio-visual avatar
This section discusses creating and improving avatars using an improved first-order motion model. The following subsections describe how audio-visual avatar generation was improved:
3.2.4.1 Super-resolution GAN (SRGAN) model
Super-resolution models convert low-resolution images into high-quality images; the success of the enhancement depends on the quality of the training and of the dataset. In this work, the source image was enhanced by SRGAN before being used in the first-order motion model.
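As one hedged illustration of this source-image enhancement step, the sketch below uses the open-source gfpgan package (the model named in Algorithm 3.12 below); the weight file, upscale factor, and file names are assumptions, and the authors’ SRGAN variant may differ from this face-restoration approach.

```python
# Hedged sketch of source-image enhancement with the open-source gfpgan package
# (assumed interface and model file; the paper also mentions SRGAN for this step).
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.3.pth",   # assumed pretrained weight file
    upscale=2,                     # assumed upscale factor
    arch="clean",
    channel_multiplier=2,
)

source = cv2.imread("source_face.jpg")                 # assumed source image for the avatar
_, _, enhanced = restorer.enhance(
    source, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("source_face_enhanced.jpg", enhanced)      # later fed to the first-order motion model
```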
3.2.4.2 Generation of audio-visual avatar using the first-order motion model
Image animation involves creating a video sequence that animates the object in a source image based on the motion of a driving video. A generator network models occlusions arising from the target motions, while a self-supervised formulation separates appearance and motion information. Across several object categories and benchmarks, this approach outperformed competing frameworks. Generating video avatars using the improved first-order motion model is illustrated in Algorithm 3.12.
Algorithm 3.12: Generating a video avatar using the improved first-order motion model
Input: Single source face image I, driving video D (frame-by-frame face images)
Output: Video avatar
Begin:
Step 1: Apply the GFPGAN model to I
Step 2: For each image in the D frames Do
  - Apply the key-point detector to both I and D
  - Use the self-supervised approach based on the Monkey-Net model to move the key points of I according to D
  - Split D into a number of frames (F1, F2, F3, …, Fn)
  - For i = 1 to n Do
    - Make I mimic the face in Fi and save the result as Ii
  - Merge (I1, I2, …, In) and save the video
End for
Return Video avatar
End
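A hedged sketch of Steps 2 and 3 of Algorithm 3.12, based on the demo helpers (load_checkpoints, make_animation) published with the public first-order-model repository, is shown below; the checkpoint, config, and file names are assumptions, and the source image is assumed to have already been enhanced in Step 1.

```python
# Hedged sketch of driving-video animation with the public first-order-model demo helpers
# (assumed checkpoint/config names; source image assumed to be pre-enhanced as in Step 1).
import imageio
import numpy as np
from skimage.transform import resize
from demo import load_checkpoints, make_animation   # from the first-order-model repository

source_image = resize(imageio.imread("source_face_enhanced.jpg"), (256, 256))[..., :3]
reader = imageio.get_reader("driving_video.mp4")
driving_video = [resize(frame, (256, 256))[..., :3] for frame in reader]

generator, kp_detector = load_checkpoints(
    config_path="config/vox-256.yaml",        # assumed config shipped with the repository
    checkpoint_path="vox-cpk.pth.tar",        # assumed pretrained checkpoint
)
# Key points of the source image are moved according to the driving frames (Steps 2-3).
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)

imageio.mimsave("video_avatar.mp4",
                [(255 * np.clip(f, 0, 1)).astype(np.uint8) for f in predictions], fps=25)
```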
3.2.4.3 Designing audio-visual avatar
Audio and text were embedded in a video avatar to generate an audio-visual avatar. This process involved integrating multimedia elements to create a dynamic and interactive representation. Video editing based on the OpenCV library was used to select or create a video avatar and add text to display information or subtitles using the cv2.putText() function. Audio was created using a text-to-speech (TTS) algorithm. TTS is a technology that converts written text into spoken words, employing algorithms to analyze, segment, process, and synthesize speech and produce audio output, here based on the gTTS() function from the gTTS (Google Text-to-Speech) library. The produced audio, whether speech or sound effects, was then incorporated into the video with synchronization. Lip-syncing techniques can be applied to match mouth movements with spoken words using video editing based on the set_audio() function from the MoviePy library. Additionally, the text can be made interactive and integrated with both the audio and visual content. Generating audio-visual avatars is illustrated in Algorithm 3.13, followed by an illustrative code sketch.
Algorithm 3.13: Generating the audio-visual avatar
Input: Video avatar, text (English and Arabic)
Output: Audio-visual avatar
Begin:
Step 1: Convert the Arabic and English text to audio with the TTS algorithm using the gTTS() function from the gTTS library
Step 2: Save the resulting audio
Step 3: Embed the Arabic and English audio into the video avatar using the set_audio() function from the MoviePy library
Step 4: Embed the Arabic and English text into the video avatar using the cv2.putText() function from the OpenCV library
Step 5: Save the resulting video
Return Audio-visual avatar
End
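A hedged end-to-end sketch of Algorithm 3.13 with gTTS, OpenCV, and MoviePy is given below; the file names are assumptions, the set_audio() call assumes the MoviePy 1.x API, and cv2.putText() renders Latin script only, so Arabic captions would need a shaping-aware text renderer in practice.

```python
# Hedged sketch of Algorithm 3.13: text-to-speech, caption overlay, and audio embedding.
# Assumes MoviePy 1.x (set_audio) and a Latin-script caption for cv2.putText.
import cv2
from gtts import gTTS
from moviepy.editor import VideoFileClip, AudioFileClip

# Steps 1-2: convert the gesture translation to speech and save it.
translation_en = "I want to drink water"
gTTS(text=translation_en, lang="en").save("speech_en.mp3")
# gTTS(text="...", lang="ar").save("speech_ar.mp3")    # Arabic audio is produced the same way

# Step 4: burn the caption into every frame of the video avatar with OpenCV.
cap = cv2.VideoCapture("video_avatar.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w, h = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("avatar_captioned.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.putText(frame, translation_en, (20, h - 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2, cv2.LINE_AA)
    writer.write(frame)
cap.release(); writer.release()

# Steps 3 and 5: attach the generated audio and save the audio-visual avatar.
clip = VideoFileClip("avatar_captioned.mp4").set_audio(AudioFileClip("speech_en.mp3"))
clip.write_videofile("audio_visual_avatar.mp4", codec="libx264", audio_codec="aac")
```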
4 Results and discussion
The results are discussed based on the experimental setting and quantitative analysis.
4.1 Experimental setting
Google Colaboratory (Google Colab Pro) with 25 GB RAM and 200 GB storage, together with the Windows operating system, was used for all of the studies. The implementation used Python along with TensorFlow, PyTorch, Keras, Matplotlib, OpenCV, NumPy, Pandas, MoviePy, and Pygame. As detailed in Table 4, the models used in the training phase of the proposed system had the following parameter counts:
Number of parameters used in the proposed model
Model | Total parameters | Trainable parameters | Non-trainable parameters |
---|---|---|---|
YOLOv8 for detection and classification | 3,010,718 | — | — |
U-Net for segmentation | 1,941,105 | 1,941,105 | 0 |
U-Net with VGG16 for segmentation | 16,195,617 | 1,480,929 | 14,714,688 |
U-Net with MobileNetV2 for segmentation | 416,209 | 409,025 | 7,184 |
MobileNetV1 for classification | 3,296,154 | 3,274,266 | 21,888 |
MobileNetV2 for classification | 6,532,368 | 2,045,274 | 396,544 |
4.2 Quantitative analysis
The results of YOLOv8 were evaluated using benchmark performance metrics. Recall, precision, F1 measure, and mean average precision from epoch 0 to epoch 195 are illustrated in Table 5 and Figure 2. Epoch 146 shows the best results.
Comparison of the results of YOLOv8 from epoch 0 to epoch 195
Epoch | Train/box loss | Train/cls_loss | Train/dfl_loss | Metrics/precision(B) | Metrics/recall(B) | Metrics/mAP50(B) | Metrics/mAP50-95(B) | Val/box loss | Val/cls loss | Val/dfl loss |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.1778 | 4.0693 | 1.5151 | 0.25686 | 0.33499 | 0.23295 | 0.15871 | 1.1288 | 3.0318 | 1.4391 |
1 | 1.1586 | 2.9944 | 1.4688 | 0.4818 | 0.46913 | 0.46801 | 0.31006 | 1.2319 | 2.3013 | 1.5262 |
2 | 1.1802 | 2.6073 | 1.4557 | 0.47785 | 0.51797 | 0.52939 | 0.32774 | 1.3268 | 2.2235 | 1.6005 |
3 | 1.2184 | 2.384 | 1.4543 | 0.58504 | 0.6689 | 0.70701 | 0.43662 | 1.3896 | 1.5356 | 1.5991 |
: | : | : | : | : | : | : | : | : | : | : |
: | : | : | : | : | : | : | : | : | : | : |
146 | 0.74929 | 0.66212 | 1.1051 | 0.95499 | 0.9406 | 0.97009 | 0.77275 | 0.86865 | 0.4348 | 1.125 |
: | : | : | : | : | : | : | : | : | : | : |
: | : | : | : | : | : | : | : | : | : | : |
193 | 0.72875 | 0.63179 | 1.0947 | 0.95444 | 0.94036 | 0.96907 | 0.77275 | 0.86746 | 0.4248 | 1.1259 |
194 | 0.71511 | 0.61902 | 1.0884 | 0.95463 | 0.94033 | 0.96829 | 0.77179 | 0.8674 | 0.42481 | 1.1258 |
195 | 0.71785 | 0.6283 | 1.0931 | 0.95473 | 0.94035 | 0.96954 | 0.7725 | 0.86727 | 0.42433 | 1.1258 |
Bold values represent the best training and validation performance of the model, which likely corresponds to the optimal epoch (146) for selecting the best weight and final model.

Results of YOLOv8.
Figure 2 illustrates the values of these metrics for different training iterations. It appears that the training loss and classification loss decrease over time, which suggests that the model is learning to better predict bounding boxes and classify objects. The precision and recall metrics also increased, which suggests that the model is making better detections.
Moreover, the confusion matrix was calculated, as illustrated in Figure 3. The classes “Hello,” “I am good,” “Wait,” “I want to eat,” “Finished,” “What,” “I am married,” “I want to change my clothes,” “I cannot hear,” “Stop,” “Listen to me,” “Please,” “I love you,” “It is time,” and “Help me” achieved perfect results (1.00), while the classes “Thank you,” “Correct,” “I want to drink water,” “Upset,” “Come,” and “Please” obtained excellent results in the range of 0.91–0.95. The classes “I want to take a shower,” “I am not sure,” and “Me” obtained very good results in the range of 0.80–0.89. On the other hand, the class “You” obtained 0.73 and the class “Why” achieved 0.59.

Confusion matrix.
The U-Net segmentation assessment metrics for each model in Google Colab are presented in Table 6. The MobileNetV1 and MobileNetV2 classification assessment metrics are presented in Table 7. The MobileNetV1 classification assessment metrics for each class in Google Colab are shown in Table 8.
U-Net segmentation assessment metrics for each model in Google Colab
Model | Accuracy | Dice coefficient |
---|---|---|
U-Net for segmentation | 0.9845 | 0.918 |
U-Net with VGG16 for segmentation | 0.9809 | 0.889 |
U-Net with MobileNetV2 for segmentation | 0.9478 | 0.953 |
MobileNetV1 and MobileNetV2 classification assessment metrics
Model | Accuracy | Precision | Recall |
---|---|---|---|
MobileNetV1 for classification | 0.94 | 0.94 | 0.94 |
MobileNetV2 for classification | 0.79 | 0.76 | 0.71 |
MobileNetV1 classification assessment metrics for each class in Google Colab Pro
Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Hello | 1.00 | 0.97 | 0.99 | 40 |
I am good | 0.97 | 0.85 | 0.91 | 40 |
Thank you | 0.97 | 0.97 | 0.97 | 40 |
I want to take a shower | 0.90 | 0.93 | 0.91 | 40 |
Wait | 0.97 | 0.97 | 0.97 | 40 |
Correct | 0.90 | 0.93 | 0.91 | 40 |
I want to eat | 0.93 | 0.93 | 0.93 | 40 |
Finished | 1.00 | 1.00 | 1.00 | 40 |
I cannot hear | 0.87 | 0.97 | 0.92 | 40 |
I want to change my clothes | 1.00 | 0.93 | 0.96 | 40 |
I want to drink water | 0.97 | 0.88 | 0.92 | 40 |
Me | 0.86 | 0.90 | 0.88 | 40 |
You | 0.88 | 0.88 | 0.88 | 40 |
Upset | 0.91 | 0.97 | 0.94 | 40 |
Come | 0.95 | 0.95 | 0.95 | 40 |
I am not sure | 0.88 | 0.93 | 0.90 | 40 |
I am married | 0.95 | 0.95 | 0.95 | 40 |
What | 1.00 | 0.97 | 0.99 | 40 |
Why | 0.86 | 0.90 | 0.88 | 40 |
Excuse me | 0.90 | 0.90 | 0.90 | 40 |
I love you | 1.00 | 0.97 | 0.99 | 40 |
Please | 0.93 | 1.00 | 0.96 | 40 |
Listen to me | 0.97 | 0.95 | 0.96 | 40 |
Accuracy | | | 0.94 | 1,040 |
Macro avg. | 0.94 | 0.94 | 0.94 | 1,040 |
Weighted avg. | 0.94 | 0.94 | 0.94 | 1,040 |
The video avatar was generated from a driving video and a specific source image using the first-order motion model, as shown in Figure 4(a); audio and text were then embedded in the video avatar to generate the audio-visual avatar, as shown in Figure 4(b).

Generate audio-visual avatar: (a) generate Avatar and (b) generate audio-visual avatar after embedding audio and text to video avatar.
It can be concluded that, when the proposed system used YOLOv8 and MobileNetV1 for classification, YOLOv8 gave better results than MobileNetV1 in classifying hand gesture signs. Likewise, when the proposed system used the traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 for segmentation, the traditional U-Net gave the best result in segmenting hand gesture signs. The results of the detection and classification stages, as well as the segmentation and classification stages, were evaluated using several images, video samples, and several DNN models.
For detecting and classifying hand sign gestures, the YOLOv8 model was used, with 3,010,718 total parameters. The benchmark evaluation metrics gave precision, recall, mAP50, and mAP50-95 values of 0.956, 0.939, 0.971, and 0.775, respectively, on Google Colaboratory (Google Colab Pro) with 25 GB RAM and 200 GB storage.
To segment the hand sign gestures, the proposed system used the traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 models for semantic segmentation, with 1,941,105, 16,195,617, and 416,209 parameters, respectively, on Google Colaboratory (Google Colab Pro) with 25 GB RAM and 200 GB storage. Accuracy and Dice coefficient values were used to evaluate the segmentation, giving 0.9845 and 0.918 for the traditional U-Net model, 0.9809 and 0.889 for U-Net with VGG16, and 0.9478 and 0.953 for U-Net with MobileNetV2.
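For reference, the Dice coefficient reported above can be computed as 2|A ∩ B|/(|A| + |B|) between the predicted and ground-truth masks; a minimal NumPy sketch (the binarization threshold and smoothing constant are assumptions) follows.

```python
# Dice coefficient between a predicted mask and a ground-truth mask (assumed smoothing term).
import numpy as np

def dice_coefficient(pred, target, threshold=0.5, smooth=1e-6):
    """2*|A intersect B| / (|A| + |B|) for binary masks; `smooth` avoids division by zero."""
    pred_bin = (pred >= threshold).astype(np.float32)
    target_bin = (target >= threshold).astype(np.float32)
    intersection = np.sum(pred_bin * target_bin)
    return (2.0 * intersection + smooth) / (np.sum(pred_bin) + np.sum(target_bin) + smooth)
```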
To classify the segmented hand sign gestures, MobileNetV1 was used, with 3,296,154 parameters. The evaluation metrics used were accuracy, precision, and recall, which gave 0.94, 0.94, and 0.94, respectively.
The improved audio-visual avatar, which was used to translate elderly hand sign gestures, achieved an effective performance of 85.56%, with 25 volunteers answering 15 pre-prepared questions, indicating its effectiveness in providing care to elderly individuals.
The evaluation metrics indicate that the work of Sharma and Singh [21] outperformed all studies due to their design using a deep neural network (CNN), which led to high accuracy in classifying ISL and ASL gestures. This model outperformed VGG-11 and VGG-16, demonstrating the ability to classify with high accuracy and a low error rate. Additionally, it was stable against transformations in rotation and scaling, as demonstrated in Table 9. However, our proposed system is superior to previous studies because it includes an avatar to enhance communication between the patient and the doctor. In our research, we constructed a dataset specifically designed for elderly individuals with stroke in Iraq. This dataset includes fundamental hand motions identified by visiting several nursing homes for the elderly to meet their daily needs, filling a gap where no such dataset previously existed in Iraq.
5 Conclusions and future scope
In conclusion, this research has successfully achieved its main objective of developing an effective deep learning-based system for translating elderly gestures into audio and text using avatars. The creation and utilization of the SGESP dataset highlight the importance of incorporating real-life scenarios and contexts into gesture recognition systems. Segmenting hand motions using multi-stage segmentation proved to be 100% effective. Some signals, such as “I want to eat” and “I want to change my clothes,” “Hello” and “Finished,” and “What” and “Why,” are similar and cause confusion for the system. YOLOv8n successfully addressed all these challenges except for the “Why” gesture. In order to solve this confusion, U-Net was used for the segmentation process. MobileNetV1 was then used for classification to prevent the confusion between “What” and “Why” that appeared when using YOLOv8 for classification purposes.
-
Funding information: The authors state no funding involved.
-
Author contributions: Kawther Thabt performed the conceptualization, methodology, software development, validation, formal analysis, investigation, resource management, data curation, original draft preparation, visualization, and data measurement. Abdulamir Abdullah was involved in planning, review, supervision, and project administration. Both the authors discussed the results and commented on the manuscript.
-
Conflict of interest: The authors state no conflict of interest.
-
Data availability statement: Most datasets generated and analyzed in this study are included in the submitted manuscript. The other datasets are available on reasonable request to the corresponding author with the attached information.
References
[1] Čujan Z, Fedorko G, Mikušová N. Application of virtual and augmented reality in automotive. Open Eng. 2020;10(1):113–9. 10.1515/eng-2020-0022.
[2] Brown T, Burleigh TL, Schivinski B, Bennett S, Gorman-Alesi A, Blinka L, et al. Translating the user-avatar bond into depression risk: A preliminary machine learning study. J Psychiatr Res. 2024;170:328–39. 10.1016/j.jpsychires.2023.12.038.
[3] Yang D, Sun M, Zhou J, Lu Y, Song Z, Chen Z, et al. Expert consensus on the “Digital Human” of metaverse in medicine. Clin eHealth. 2023;6:159–63. 10.1016/j.ceh.2023.11.005.
[4] Pauw LS, Sauter DA, van Kleef GA, Lucas GM, Gratch J, Fischer AH. The avatar will see you now: Support from a virtual human provides socio-emotional benefits. Comput Hum Behav. 2022;136:107368. 10.31234/osf.io/5u6hz.
[5] Sardesai N, Russo P, Martin J, Sardesai A. Utilizing generative conversational artificial intelligence to create simulated patient encounters: a pilot study for anaesthesia training. Postgrad Med J. 2024;100:qgad137. 10.1093/postmj/qgad137.
[6] Basori AH, Ali IR. Emotion expression of avatar through eye behaviors, lip synchronization and MPEG4 in virtual reality based on Xface toolkit: Present and future. Procedia-Soc Behav Sci. 2013;97:700–6. 10.1016/j.sbspro.2013.10.290.
[7] Lu LL, Henn P, O’Tuathaigh C, Smith S. Patient–healthcare provider communication and age-related hearing loss: a qualitative study of patients’ perspectives. Ir J Med Sci. 2024;193(1):277–84. 10.1007/s11845-023-03432-4.
[8] Hailu GN, Abdelkader M, Meles HA, Teklu T. Understanding the support needs and challenges faced by family caregivers in the care of their older adults at home. A qualitative study. Clin Interv Aging. 2024;19:481–90. 10.2147/cia.s451833.
[9] Alfiras M, Bojiah J, Mohammed MN, Ibrahim FM, Ahmed HM, Abdullah OI. Powered education based on Metaverse: Pre- and post-COVID comprehensive review. Open Eng. 2023;13(1):20220476. 10.1515/eng-2022-0476.
[10] Zijun L, Xu Y, Yujia Y, Zhiqiang X. Elderly onset of MELAS carried an M. 3243A > G mutation in a female with deafness and visual deficits: A case report. Clin Case Rep. 2024;12(3):e8438. 10.1002/ccr3.8438.
[11] Zhen R, Song W, He Q, Cao J, Shi L, Luo J. Human-computer interaction system: A survey of talking-head generation. Electronics. 2023;12(1):218. 10.3390/electronics12010218.
[12] Azofeifa JD, Noguez J, Ruiz S, Molina-Espinosa JM, Magana AJ, Benes B. Systematic review of multimodal human-computer interaction. Informatics. 2022;9(1):13. 10.3390/informatics9010013.
[13] Khan NS, Abid A, Abid K. A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation. Cogn Comput. 2020;12:748–65. 10.1007/s12559-020-09731-7.
[14] Zhang J, Chen J, Wang C, Yu Z, Liu C, Qi T, et al. Virbo: Multimodal multilingual avatar video generation in digital marketing. arXiv preprint arXiv:2403.11700; 2024. 10.48550/arXiv.2403.11700.
[15] López LIB, Ferri FM, Zea J, Caraguay ÁLV, Benalcázar ME. CNN-LSTM and post-processing for EMG-based hand gesture recognition. Intell Syst Appl. 2024;22:200352. 10.1016/j.iswa.2024.200352.
[16] García AS, Fernández-Sotos P, Vicente-Querol MA, Sánchez-Reolid R, Rodriguez-Jimenez R, Fernández-Caballero A. Co-design of avatars to embody auditory hallucinations of patients with schizophrenia: A study on patients’ feeling of satisfaction and psychiatrists’ intention to adopt the technology. Virtual Real. 2021;27:1–16. 10.1007/s10055-021-00558-7.
[17] Lu LLM, Henn P, O’Tuathaigh C, Smith S. Patient–healthcare provider communication and age-related hearing loss: a qualitative study of patients’ perspectives. Ir J Med Sci (1971-). 2023;193:1–8. 10.1007/s11845-023-03432-4.
[18] Zhang J, Jiang Z, Yang D, Xu H, Shi Y, Song G, et al. Avatargen: a 3D generative model for animatable human avatars. In: European Conference on Computer Vision. Cham: Springer Nature Switzerland; 2022. p. 668–85. 10.1007/978-3-031-25066-8_39.
[19] Athira PK, Sruthi CJ, Lijiya AA. A signer independent sign language recognition with co-articulation elimination from live videos: an Indian scenario. J King Saud Univ - Comput Inf Sci. 2022;34(3):771–81. 10.1016/j.jksuci.2019.05.002.
[20] Li Z, Chen L, Liu C, Zhang F, Li Z, Gao Y, et al. Animated 3D human avatars from a single image with GAN-based texture inference. Comput Graph. 2021;95:81–91. 10.1016/j.cag.2021.01.002.
[21] Sharma S, Singh S. Vision-based hand gesture recognition using deep learning for the interpretation of sign language. Expert Syst Appl. 2021;182:115657. 10.1016/j.eswa.2021.115657.
[22] Thies J, Elgharib M, Tewari A, Theobalt C, Nießner M. Neural voice puppetry: Audio-driven facial reenactment. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer International Publishing; 2020. p. 716–31. 10.1007/978-3-030-58517-4_42.
[23] Gupta R, Rajan S. Comparative analysis of convolution neural network models for continuous Indian sign language classification. Procedia Comput Sci. 2020;171:1542–50. 10.1016/j.procs.2020.04.165.
[24] Molano JSV, Díaz GM, Sarmiento WJ. Parametric facial animation for affective interaction workflow for avatar retargeting. Electron Notes Theor Comput Sci. 2019;343:73–88. 10.1016/j.entcs.2019.04.011.
[25] García AS, Navarro E, Fernández-Caballero A, González P. Towards the design of avatar-based therapies for enhancing facial affect recognition. In: Ambient Intelligence–Software and Applications, 9th International Symposium on Ambient Intelligence. Springer International Publishing; 2019. p. 306–13. 10.1007/978-3-030-01746-0_36.
© 2025 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.