Abstract
Communication barriers faced by elderly individuals and stroke patients with speech impairments pose significant challenges in daily interactions. While sign language serves as a vital means of communication, those struggling to speak may encounter difficulties in conveying their messages effectively. This research addresses the issue by proposing a system for generating audio-visual avatars capable of translating sign gestures into written and spoken language, thereby offering a comprehensive communication tool for individuals with special needs. The proposed method integrates YOLOv8, U-Net-based segmentation, and MobileNet classifiers to accurately recognize and classify sign gestures. YOLOv8n was used for gesture detection and classification; traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 based on multi-stage image segmentation were used for segmentation; and MobileNetV1 and MobileNetV2 were used for classification. Using the improved first-order motion model, the generated avatars enabled the real-time translation of sign motions into text and speech and facilitated interactive conversation in both Arabic and English. The evaluation findings demonstrated the system’s value, showing that traditional U-Net produced the best results in gesture segmentation and YOLOv8n performed best in gesture classification. This study contributes to advancing assistive communication technologies, offering insights into optimizing gesture recognition and avatar generation for enhanced communication support in elderly and stroke patient care. The YOLOv8n model achieved 0.956 precision and 0.939 recall for detecting and classifying gestures, while MobileNetV1 and MobileNetV2 reached classification accuracies of 0.94 and 0.79, respectively.
1 Introduction
Communication between older people who have had a stroke and their caregivers is a major challenge. When people suffer from speech difficulty due to stroke, they are unable to communicate effectively, which can isolate them and deepen feelings of loneliness and depression. This situation requires deep understanding and patience, as well as the use of alternative means of communication, such as body language and gestures, to facilitate communication and improve the quality of care and interaction between patients and their caregivers.
Because of the lack of data on sign gestures for elderly stroke patients in Iraq, it was necessary to create a dataset. In addition, some gestures are similar, such as “I want to eat” and “I want to change my clothes,” “Hello” and “Finished,” and “What” and “Why,” which causes confusion when interpreting them. YOLOv8n successfully addressed all these challenges except for the “Why” gesture. MobileNetV1 for classification, combined with U-Net for segmentation, was therefore used to overcome the difficulty that YOLOv8n faced in classifying the “Why” gesture.
Virtual reality has attracted interest in several technical areas [1]. An avatar is a digital representation of a user or their character [2], also known as a digital human [3] or a virtual human [4]. A number of intelligent platforms have been used to generate life-like virtual avatars that imitate patients with whom trainee doctors can communicate; Convai is one example of such a platform [5]. In addition, a toolkit called XFace is used for animation, face recognition, speech generation, facial expression analysis, and other interactive functions; this technology is also utilized to build virtual avatars [6]. The challenges faced by elderly people who find it difficult to speak or hear are multifaceted, affecting their ability to communicate and fulfill their daily needs [7,8]. Avatars allow users to interact socially, work, play, and express themselves [9]. Common communication problems in seniors include hearing loss, stuttering, weakened facial muscles, and inability to write; these issues can be caused by various health problems, such as stroke [10]. Generating avatars using available platforms and tools involves several challenges, such as avatar animation, realistic appearance, cost of creation, privacy and security, and system integration. Deep learning models can help solve some of these issues. “Generating an audio-visual avatar by an improved first-order motion model” refers to using an enhanced version of the first-order motion model to create an animated representation that combines both auditory and visual elements. The first-order motion model is a framework for transferring motion between video sequences, initially developed for applications such as face and image reenactment. Improving this model typically involves modifications that increase motion accuracy or enhance the overall quality of the results. In this work, an audio-visual avatar is generated using the improved first-order motion model, which translates the sign gestures and converts them into written and audio text, based on the results of YOLOv8 and MobileNetV1.
YOLOv8n is used for detection and classification; the traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 are used for segmentation; and MobileNetV1 is used for classification. Based on the classification results, the first-order motion model is used to generate an interactive avatar that can speak English and Arabic and translate the meaning of the sign gesture motions into text and speech. The YOLOv8n model achieved values of 0.956, 0.939, and 0.971 for precision, recall, and mAP50, respectively, for detecting and classifying gestures. MobileNetV1 gained values of 0.94, 0.94, and 0.94 for accuracy, precision, and recall, respectively, for classification. Traditional U-Net for segmentation achieved 0.9845 accuracy and 0.918 Dice coefficient. U-Net with VGG16 for segmentation achieved 0.9809 accuracy and 0.889 Dice coefficient. U-Net with MobileNetV2 for segmentation obtained 0.9478, 0.953, 0.9765, and 0.8034 for accuracy, Dice coefficient, precision, and recall, respectively. The generated audio-visual avatar’s performance was 85.56%.
This study makes a unique contribution in the following ways:
constructing a dataset of hand sign gestures specifically tailored to elderly individuals affected by stroke who have trouble speaking, collected in Iraqi nursing homes;
proposing a mask-annotation pipeline for segmenting the hand gestures using the U-Net model, thresholding segmentation, K-means clustering, and the ImgLab tool;
developing an audio-visual avatar capable of converting the recognized gestures into written text and speech for improved communication;
integrating advanced models such as YOLOv8 and U-Net with MobileNetV1 for recognizing and classifying sign gestures, which is a robust methodological approach.
This work advances assistive communication technologies and makes a significant contribution to the field of elderly and stroke patient care.
As illustrated in Section 4.2 (Quantitative analysis), the proposed avatar achieved excellent results. The remainder of this article is structured as follows: the literature review is summarized in Section 2, the methodology is described in Section 3, the results and discussion are presented in Section 4, and the conclusions, along with suggestions for future research directions, are presented in Section 5.
2 Literature review
An audio-visual avatar is a virtual person that blends visual and audible signs to improve human–computer interaction and the user experience [11,12]. Recent advancements in computer vision, artificial intelligence, and natural language processing have generated a lot of interest in the creation of visual and auditory avatars [13]. In addition to exploring the most recent developments and applications in this area, this section reviews the literature that has already been written on the production of audio-visual avatars.
In 2024, Zhang et al. [14] introduced Virbo, an intelligent talking avatar video generation system. It offers personalized functions, multilingual customization, voice cloning, face swapping, talking avatar dubbing, and visual special effect rendering. Virbo generates photo-realistic, lip-synchronized videos with better accuracy and authenticity and can create videos comparable to professional productions. Future work aims to enhance the speaker’s voice emotions and facial expressions. In 2024, López et al. [15] developed a novel hand gesture recognition (HGR) model using electromyography (EMG) signals and spectrograms. They evaluated a convolutional neural network–long short-term memory (CNN-LSTM) model and a post-processing algorithm. The results showed that memory cells improved the recognition accuracy by 3.29% compared to CNN models. However, post-processing had a more significant impact on recognition accuracy than memory cells in LSTM networks. This suggests that incorporating post-processing algorithms can enhance HGR models’ accuracy and robustness against EMG signal variability.
In 2021, García et al. [16] presented a description of an avatar production system, including the key technological specifications for designing avatars that embody auditory hallucinations, and an assessment of the system from the perspectives of both patients and therapists. Character Creator, Poser, Unity Multipurpose Avatar, and Adobe Fuse CC were all utilized for avatar creation. In 2023, Lu et al. [17] used a convenience sampling method to select 13 participants with hearing loss from a support group for senior citizens in the southern part of Ireland. Participants were interviewed in a semi-structured manner; interviews were audio-recorded and transcribed using NVivo 12. Themes related to the challenges faced in recent healthcare interactions and recommendations for improving comprehensive healthcare communication were identified using Clarke and Braun’s thematic analysis technique. In 2022, Zhang et al. [18] introduced a 3D animated high-fidelity human model. However, it has some limitations: (1) it is based on skinned multi-person linear (SMPL) projections, and (2) it cannot capture extremely fine actions, such as facial emotion variations. In 2022, Athira et al. [19] proposed a cutting-edge vision-based movement recognition system for signers that can recognize single-handed dynamic gestures, double-handed gestures, and fingerspelled phrases in Indian sign language (ISL) from live video. Support vector machine (SVM) was used for classification. The recognition accuracy was 89% for single-handed dynamic gestures and 91% for fingerspelling gestures. In 2021, Li et al. [20] presented a pipeline that creates a 3D human avatar from a single RGB photograph, producing a texture map for the entire body and three-dimensional human geometry. The technique separates the human body into its component parts, fits them to a parametric model, and warps them into the desired shape. From the frontal photos, InferGAN infers the unseen back texture. Using MoCap data, their human avatars can easily be rigged and animated. A mobile application demonstrates the effectiveness of the solution for AR applications, showing its reliability and efficiency on both public and private datasets. In 2021, Sharma and Singh [21] proposed a deep learning-based CNN model for identifying gesture-based sign language. The model outperforms conventional CNN architectures in classification accuracy while using fewer parameters. On the ISL and American sign language (ASL) datasets, during training and testing, the VGG-11 and VGG-16 models achieved the highest accuracies of 99.96 and 100%, respectively. According to experimental assessments, the model surpassed other strategies, identifying the most gestures with the least amount of error, and rotation and scaling transformations had no effect on it. In 2020, Thies et al. [22] introduced a pioneering technique for audio-based facial re-enactment that can be applied across diverse audio sources, enabling the production of a talking-head video from an audio sequence of another individual as well as the generation of realistic videos driven by a synthetic voice. This implies that text-driven video synthesis with synchronized artificial voices is possible. In 2020, Gupta and Rajan [23] used inertial measurement data from an accelerometer and a gyroscope to classify continuously signed words from ISL.
A modified time-LeNet architecture was proposed, alongside time-LeNet and a multichannel deep CNN (MC-DCNN), to address over-fitting. The models were compared in terms of complexity, loss, and classification precision: time-LeNet obtained 79.70% accuracy compared to MC-DCNN’s 83.94%. In 2019, Molano et al. [24] proposed the Candide parametric mask, a method for producing emotive avatars at runtime, speeding up 3D animation. It produced a variety of emotions, ranging from straightforward winks to complicated expressions, and was inspired by the Ekman emotional model. In 2019, García et al. [25] presented a scheme for creating avatar-based treatments. The strategy was based on a previously described tool to improve social cognition in people with cognitive impairment. First, the criteria for facial emotion identification in avatar-based treatments were developed. The supporting instrument for the therapy was then briefly described, and the administration of treatments was explained. For both the clinician and the patient, treatment execution was separated into pre-therapy and treatment phases. Table 1 presents a comparison between our generated avatar and those generated by earlier studies for specific activities. Table 2 presents a comparison between our suggested approaches for classifying gestures of elderly and stroke patients who are unable to speak and earlier research that has classified sign language.
Comparison of earlier studies that generated an avatar
Work | Dataset | Method | Audio-visual avatar aim | Gaps |
---|---|---|---|---|
Zhang et al. [14] | HDTF dataset | Virbo | Offers Virbo, an intelligent talking avatar video creation system | The system does not support video files as inputs for the user interface and needs more diverse character speech videos |
García et al. [16] | Their own dataset | Adobe Fuse CC tool, Mixamo tool, Autodesk 3D Studio Max tool, and Unity tool | Configuration of auditory hallucination avatars | Limited sample size (patients: 29; therapists: 20) |
Zhang et al. [18] | MPV, UBC, DeepFashion, and SHHQ | A 3D parametric human model SMPL and a deformation network | 3D-aware clothed human avatar | Does not provide a detailed evaluation or comparison with previous methods |
Li et al. [20] | DeepFashion | InferGAN | For general use | The body and limbs should not intersect, and some clothing features are lost, while the rebuilding utilized SMPL |
Our proposed audio-visual avatar | Constructing our dataset | Improved first-order motion model | Serve the elderly and people having special needs and aid those who take care of them | Attempted to overcome most of the gaps in previous works |
Comparison between earlier research that has classified sign language
Work | Method | Result | Gaps |
---|---|---|---|
López et al. [15] | CNN-LSTM model | Accuracy: 90.55% | Only recognizes a small set of hand gestures, employs a CNN-LSTM architecture with a high number of learnable parameters, requiring a large amount of training data, and not suitable for some real-time applications |
Athira et al. [19] | SVM | Recognition accuracy: single-handed dynamic gestures, 89%; fingerspelling gestures, 91% | Data availability, needing enhancements for robustness in real-world scenarios with diverse environmental settings and limited real-time exploration |
Sharma and Singh [21] | CNN | Accuracy: 100% for the ASL dataset; 99.96% for the ISL dataset | Its generalization to unseen data or different sign languages may need further exploration |
Gupta and Rajan [23] | Time-LeNet + MC-DCNN | MC-DCNN accuracy: 83.94%; time-LeNet average accuracy: 79.70%; modified time-LeNet accuracy: 81.62% | Lacks comparison with other state-of-the-art models, and the robustness of the models to variations in signing styles, speeds, or environmental conditions is not explored |
Our proposed methods | YOLOv8 + U-Net + MobileNetV1 | YOLOv8: precision, 0.95473; recall, 0.94035. MobileNetV1: accuracy, 94%. MobileNetV2: accuracy, 79% | Attempted to overcome most of the gaps in previous works |
When comparing our work with previous studies, we found the following: Zhang et al. [14] used a speaking avatar to build an intelligent system; García et al. [16] used an avatar to represent auditory hallucinations; Zhang et al. [18] featured an avatar of a human wearing three-dimensional clothing; and Li et al. [20] created an avatar for public use. In our work, we used avatars to facilitate communication between the patient and the doctor. Regarding the technology used to create the avatar, Zhang et al. [14] used Virbo; García et al. [16] utilized Adobe Fuse CC, Mixamo, Autodesk 3D Studio Max, and Unity; Zhang et al. [18] employed a 3D parametric human model (SMPL) and a deformation network; and Li et al. [20] used InferGAN. In the present work, we improved the performance of the first-order motion model by enhancing the input image to create a more realistic avatar, and we also proposed a multi-stage image segmentation method to enhance the classification accuracy.
3 Method and implementation
Figure 1 depicts the schematic of the proposed sign gesture interpreter system to serve the elderly and stroke patients having special needs and provide assistance to those who take care of them:

Schematic of the suggested system.
The system operates bidirectionally (from the patient to the doctor and vice versa), but this research focuses on the forward phase.
3.1 Gesture hand dataset preprocessing
The performance of the proposed communication enhancement system for older people with special needs, specifically those with speech difficulties, was evaluated using a constructed dataset. This was necessary because there is a lack of standard, publicly accessible datasets for individuals who are neither mute nor deaf. The constructed dataset consists of gesture-hand descriptions of specific things that elders need in daily life. It was created by visiting several nursing homes for the elderly and conducting interviews with both the residents and those in charge of their care. After settling on gestures and movements that meet the needs of the elderly, images of these movements were acquired, and a group of people was recruited to perform them. The proposed dataset contained 26 classes; each class had 110 images of different samples, with image dimensions of 640 × 640. The dataset included images of people of various ages and genders, with various backgrounds and orientations, based on the daily needs of elderly and stroke patients. The dataset was named “Sign Gestures for Elderly and Stroke Patients” (SGESP). The images were captured using web and mobile cameras, and the dataset included hand gestures for Iraqi elderly and stroke patients, as illustrated in Table 3.
Signs for elderly people constructed dataset and their meanings
The dataset was prepared for both semantic segmentation and instance segmentation. It was annotated in two ways so that it is suitable for instance segmentation and semantic segmentation: bounding-box annotation (manual labeling) and mask images (produced using manual labeling [JSON files], K-means clustering, U-Net, and threshold segmentation). The dataset was augmented ×4 [horizontal flipping, 5% rotation, 3% blurring, and 2% noise].
In this work, preprocessing included cropping, resizing, and converting images to JPG. Manual cropping was performed to remove uninteresting areas of the image, which helped focus on the main content, and antialiasing was applied during resizing. These steps improved the readiness of the images for visual analysis and for training artificial intelligence models. All dataset images were resized to 640 × 640 pixels to ensure reliable model training. The total number of original images was 2,860, which increased to 10,808 after augmentation. The proposed dataset was divided into 70% training, 20% validation, and 10% testing, and the hold-out method was used for training: about 10,010 images were used for training, 512 for validation, and 286 for testing. The dataset contained 10,808 text files as annotation labels.
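As a hedged illustration of this preprocessing and augmentation stage, the sketch below resizes images to 640 × 640 with antialiasing, converts them to JPG, and produces the four augmented variants described above; the directory layout, rotation angle (one reading of “5% rotation”), blur kernel, and noise level are assumptions rather than the authors’ exact script.

```python
# Illustrative preprocessing/augmentation sketch (assumed paths and parameters).
import glob, os
import cv2
import numpy as np
from PIL import Image

SRC_DIR, DST_DIR = "dataset/raw", "dataset/preprocessed"   # assumed folder layout
os.makedirs(DST_DIR, exist_ok=True)

for path in glob.glob(os.path.join(SRC_DIR, "*", "*.*")):
    img = Image.open(path).convert("RGB")
    img = img.resize((640, 640), Image.LANCZOS)            # antialiased resize to 640 x 640
    base = os.path.splitext(os.path.basename(path))[0]
    cls = os.path.basename(os.path.dirname(path))
    out_dir = os.path.join(DST_DIR, cls)
    os.makedirs(out_dir, exist_ok=True)
    img.save(os.path.join(out_dir, base + ".jpg"), "JPEG")  # conversion to JPG

    # x4 augmentation: horizontal flip, small rotation, slight blur, mild noise.
    arr = np.array(img)
    h, w = arr.shape[:2]
    flipped = cv2.flip(arr, 1)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0)     # assumed reading of "5% rotation"
    rotated = cv2.warpAffine(arr, M, (w, h))
    blurred = cv2.GaussianBlur(arr, (3, 3), 0)              # assumed reading of "3% blurring"
    noisy = np.clip(arr + np.random.normal(0, 0.02 * 255, arr.shape), 0, 255).astype(np.uint8)
    for tag, aug in [("flip", flipped), ("rot", rotated), ("blur", blurred), ("noise", noisy)]:
        cv2.imwrite(os.path.join(out_dir, f"{base}_{tag}.jpg"),
                    cv2.cvtColor(aug, cv2.COLOR_RGB2BGR))   # PIL gives RGB; OpenCV writes BGR
```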
3.1.1 Proposed multistage image segmentation method
In the first stage of the proposed method, the entire preprocessed dataset was segmented by thresholding in order to produce image masks for semantic annotation; about 10% of the dataset images were accurately segmented. In the next stage, K-means clustering was applied to the remaining 90% of the dataset images, and about 30% of the resulting images were accurately segmented. In the third stage, the ImgLab tool was applied to the remaining images (about 60%) to obtain JSON files, which were then converted into segmented images; about 10% of the resulting images were accurately segmented. At that point, about 50% of the images were correctly segmented and were used to train U-Net to obtain the best weights, while the remaining (incorrectly segmented) 50% were used for testing. U-Net then produced close to 100% correctly segmented images, which were used as masks for semantic annotation. The procedure is summarized in Algorithm 3.3, and an illustrative sketch of the thresholding and clustering stages follows the algorithm below.
Algorithm 3.3: Annotating the sign gesture dataset for semantic annotations
Input: Proposed dataset images
Output: Dataset images annotated for semantic segmentation
Begin:
Step 1: For i = 1 to 26 Do // to enter every dataset folder
Step 2: Segmentation by thresholding:
  For j = 1 to 110 Do // to read every dataset image in each folder
    - Img = Read(image)
    - Segment the image by thresholding segmentation using equation (2)
    - If the segmented region of the hand gesture is accurate, then
      - save the selected image in the gesture-mask folder
      - delete the selected image from the original dataset
  End for j loop.
Step 3: Segmentation by K-means clustering:
  For j = 1 to 110 Do // to read every remaining image in each folder
    - Img = Read(image)
    - Segment the image by K-means clustering using algorithm (2)
    - If the segmented region of the hand gesture is accurate, then
      - append the selected image to the gesture-mask folder
      - delete the selected image from the original dataset
  End for j loop.
Step 4: Segmentation by ImgLab:
  For j = 1 to 110 Do // to read every remaining image in each folder
    - Img = Read(image)
    - Segment the image with the ImgLab tool to obtain the JSON file
    - Convert the JSON file to segmented images
    - If the segmented region of the hand gesture is accurate, then
      - append the selected image to the gesture-mask folder
      - delete the selected image from the original dataset
  End for j loop.
Step 5: Semantic segmentation using U-Net:
  For j = 1 to 110 Do // to read every remaining image in each folder
    - Img = Read(image)
    - Train U-Net using the resulting gesture-mask folder
    - Obtain the best weight
    - Test the images remaining in the original dataset using the trained U-Net // images that have not been selected
    - Append the segmented images to the gesture-mask folder
    - Obtain the final segmented image using U-Net
  End for j loop.
Return the mask image for each image in the dataset
End
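For concreteness, the following sketch illustrates the first two stages of this pipeline (thresholding and K-means clustering) with OpenCV. It is an illustrative sketch only: the thresholding rule (Otsu here), the number of clusters, the hand-cluster heuristic, and the file names are assumptions, not the exact settings of equation (2) or algorithm (2) referenced above.

```python
# Illustrative sketch of the thresholding and K-means stages (assumed parameters).
import cv2
import numpy as np

def threshold_mask(image_bgr):
    """Stage 1: binary hand mask via (assumed) Otsu thresholding on the grayscale image."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask

def kmeans_mask(image_bgr, k=2):
    """Stage 2: cluster pixel colors into k groups and keep the cluster assumed to be the hand."""
    pixels = image_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    labels = labels.reshape(image_bgr.shape[:2])
    # Heuristic: treat the brighter cluster as skin/hand. In the paper, only masks judged
    # accurate are kept; the rest move on to the next stage.
    hand_cluster = int(np.argmax(centers.sum(axis=1)))
    return (labels == hand_cluster).astype(np.uint8) * 255

img = cv2.imread("gesture_sample.jpg")          # assumed file name
mask_stage1 = threshold_mask(img)
mask_stage2 = kmeans_mask(img)
cv2.imwrite("gesture_sample_mask.png", mask_stage1)
```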
3.2 Proposed methodology
3.2.1 Detection and classification by YOLOv8
YOLOv8 was adopted because it offers improved accuracy, faster speeds, anchor-free detection, the ability to focus on different areas of the image, and real-time processing. In this work, we employed YOLOv8 to detect and classify the gestures produced by elderly people with special needs.
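As a hedged illustration of this stage, the following sketch shows how a YOLOv8n model can be trained and queried with the ultralytics package; the dataset description file (sgesp.yaml) and confidence threshold are assumptions, while the 640-pixel image size and 196 epochs follow the settings reported later in the paper.

```python
# Minimal YOLOv8n detection/classification sketch using the ultralytics package.
from ultralytics import YOLO

# Train from the pretrained nano weights on the (assumed) SGESP dataset description file,
# which would list the 26 gesture classes and the train/val/test image folders.
model = YOLO("yolov8n.pt")
model.train(data="sgesp.yaml", epochs=196, imgsz=640)

# Validate (reports precision, recall, mAP50, mAP50-95) and run inference on a test image.
metrics = model.val()
results = model.predict("test_gesture.jpg", conf=0.25)   # assumed confidence threshold
for r in results:
    for box in r.boxes:
        print(r.names[int(box.cls)], float(box.conf))     # predicted gesture label and confidence
```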
3.2.2 Segmentation and classification by U-Net and MobileNetV1
U-Net is a powerful model for producing detailed, high-quality segmentation masks. It was utilized in this study for its flexibility, effective feature learning, high performance, versatility, robustness to limited data, and ease of implementation. MobileNetV1 performs well in image classification and computer vision tasks; its fast execution speed makes it suitable for real-time applications, and it achieves an acceptable balance between accuracy and performance, which makes it a good option for tasks such as image classification.
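A compact Keras sketch of a traditional U-Net of the kind used here for binary hand-mask segmentation is given below; the input size, depth, and filter counts are illustrative assumptions rather than the authors’ exact architecture (which has 1,941,105 parameters, per Table 4).

```python
# Illustrative traditional U-Net for binary hand-gesture segmentation (assumed sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(input_shape)
    # Encoder
    c1 = conv_block(inputs, 16); p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 32);     p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 64);     p3 = layers.MaxPooling2D()(c3)
    # Bottleneck
    b = conv_block(p3, 128)
    # Decoder with skip connections
    u3 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
    c4 = conv_block(layers.concatenate([u3, c3]), 64)
    u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.concatenate([u2, c2]), 32)
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c5)
    c6 = conv_block(layers.concatenate([u1, c1]), 16)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c6)   # binary hand mask
    return Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```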
3.2.3 Classification by MobileNetV2
MobileNetV2 was employed because it offers good performance, increased accuracy, a lightweight architecture, faster processing speeds, and real-time processing. In this work, we utilized MobileNetV2 to classify the gestures produced by elderly people with special needs.
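The hedged sketch below shows how MobileNetV1 and MobileNetV2 classification heads over the 26 gesture classes can be assembled with Keras transfer learning; the input size, frozen backbone, dropout rate, and optimizer are assumptions, not the authors’ reported configuration.

```python
# Illustrative MobileNetV1/MobileNetV2 classifiers for the 26 gesture classes (assumed settings).
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 26

def build_classifier(backbone_name="mobilenet_v1", input_shape=(224, 224, 3)):
    if backbone_name == "mobilenet_v1":
        base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                               input_shape=input_shape)
    else:
        base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                                 input_shape=input_shape)
    base.trainable = False                      # assumed: frozen backbone, train the head only
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = Model(base.input, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model

mobilenet_v1_clf = build_classifier("mobilenet_v1")
mobilenet_v2_clf = build_classifier("mobilenet_v2")
```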
3.2.4 Generating audio-visual avatar
This section discusses creating and improving avatars using an improved first-order motion model. The following subsections describe how audio-visual avatar generation was improved:
3.2.4.1 Super-resolution GAN (SRGAN) model
Super-resolution models convert low-resolution images into high-quality images; the success of the enhancement depends on the quality of the training and of the dataset. In this work, the source image was enhanced by SRGAN before being used in the first-order motion model.
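As one hedged illustration of this source-image enhancement step, the sketch below uses the open-source gfpgan package (the model named in Algorithm 3.12 below); the weight file, upscale factor, and file names are assumptions, and the authors’ SRGAN variant may differ from this face-restoration approach.

```python
# Hedged sketch of source-image enhancement with the open-source gfpgan package
# (assumed interface and model file; the paper also mentions SRGAN for this step).
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.3.pth",   # assumed pretrained weight file
    upscale=2,                     # assumed upscale factor
    arch="clean",
    channel_multiplier=2,
)

source = cv2.imread("source_face.jpg")                 # assumed source image for the avatar
_, _, enhanced = restorer.enhance(
    source, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("source_face_enhanced.jpg", enhanced)      # later fed to the first-order motion model
```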
3.2.4.2 Generation of audio-visual avatar using the first-order motion model
Image animation involves creating a video sequence that animates the object in a source image based on the motion of a driving video. A generator network models occlusions arising from the target motions, while a self-supervised formulation separates appearance and motion information. Across several object categories and benchmarks, this approach outperformed competing frameworks. Generating video avatars using the improved first-order motion model is illustrated in Algorithm 3.12.
Algorithm 3.12: Generating a video avatar using the improved first-order motion model
Input: Single source face image I, driving video D (frame-by-frame face images)
Output: Video avatar
Begin:
Step 1: Apply the GFPGAN model to I
Step 2: For each image in the D frames Do
  - Apply the key-point detector to both I and D
  - Use the self-supervised approach based on the Monkey-Net model to move the key points of I according to D
  - Split D into a number of frames (F1, F2, F3, …, Fn)
  - For i = 1 to n Do
    - Make I mimic the face in Fi and save the result as Ii
  - Merge (I1, I2, …, In) and save the video
End for
Return Video avatar
End
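A hedged sketch of Steps 2 and 3 of Algorithm 3.12, based on the demo helpers (load_checkpoints, make_animation) published with the public first-order-model repository, is shown below; the checkpoint, config, and file names are assumptions, and the source image is assumed to have already been enhanced in Step 1.

```python
# Hedged sketch of driving-video animation with the public first-order-model demo helpers
# (assumed checkpoint/config names; source image assumed to be pre-enhanced as in Step 1).
import imageio
import numpy as np
from skimage.transform import resize
from demo import load_checkpoints, make_animation   # from the first-order-model repository

source_image = resize(imageio.imread("source_face_enhanced.jpg"), (256, 256))[..., :3]
reader = imageio.get_reader("driving_video.mp4")
driving_video = [resize(frame, (256, 256))[..., :3] for frame in reader]

generator, kp_detector = load_checkpoints(
    config_path="config/vox-256.yaml",        # assumed config shipped with the repository
    checkpoint_path="vox-cpk.pth.tar",        # assumed pretrained checkpoint
)
# Key points of the source image are moved according to the driving frames (Steps 2-3).
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)

imageio.mimsave("video_avatar.mp4",
                [(255 * np.clip(f, 0, 1)).astype(np.uint8) for f in predictions], fps=25)
```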
3.2.4.3 Designing audio-visual avatar
Audio and text were embedded in a video avatar to generate an audio-visual avatar. This process involved integrating multimedia elements to create a dynamic and interactive representation. Video editing based on the OpenCV library was used to select or create a video avatar and add text to display information or subtitles using the cv2.putText() function. Audio was created using a text-to-speech (TTS) algorithm. TTS is a technology that converts written text into spoken words, employing algorithms to analyze, segment, process, and synthesize speech and produce audio output, here based on the gTTS() function from the gTTS (Google Text-to-Speech) library. The produced audio, whether speech or sound effects, was then incorporated into the video with synchronization. Lip-syncing techniques can be applied to match mouth movements with spoken words using video editing based on the set_audio() function from the MoviePy library. Additionally, the text can be made interactive and integrated with both the audio and visual content. Generating audio-visual avatars is illustrated in Algorithm 3.13, followed by an illustrative code sketch.
Algorithm 3.13: Generating the audio-visual avatar
Input: Video avatar, text (English and Arabic)
Output: Audio-visual avatar
Begin:
Step 1: Convert the Arabic and English text to audio with the TTS algorithm using the gTTS() function from the gTTS library
Step 2: Save the resulting audio
Step 3: Embed the Arabic and English audio into the video avatar using the set_audio() function from the MoviePy library
Step 4: Embed the Arabic and English text into the video avatar using the cv2.putText() function from the OpenCV library
Step 5: Save the resulting video
Return Audio-visual avatar
End
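A hedged end-to-end sketch of Algorithm 3.13 with gTTS, OpenCV, and MoviePy is given below; the file names are assumptions, the set_audio() call assumes the MoviePy 1.x API, and cv2.putText() renders Latin script only, so Arabic captions would need a shaping-aware text renderer in practice.

```python
# Hedged sketch of Algorithm 3.13: text-to-speech, caption overlay, and audio embedding.
# Assumes MoviePy 1.x (set_audio) and a Latin-script caption for cv2.putText.
import cv2
from gtts import gTTS
from moviepy.editor import VideoFileClip, AudioFileClip

# Steps 1-2: convert the gesture translation to speech and save it.
translation_en = "I want to drink water"
gTTS(text=translation_en, lang="en").save("speech_en.mp3")
# gTTS(text="...", lang="ar").save("speech_ar.mp3")    # Arabic audio is produced the same way

# Step 4: burn the caption into every frame of the video avatar with OpenCV.
cap = cv2.VideoCapture("video_avatar.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w, h = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("avatar_captioned.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.putText(frame, translation_en, (20, h - 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2, cv2.LINE_AA)
    writer.write(frame)
cap.release(); writer.release()

# Steps 3 and 5: attach the generated audio and save the audio-visual avatar.
clip = VideoFileClip("avatar_captioned.mp4").set_audio(AudioFileClip("speech_en.mp3"))
clip.write_videofile("audio_visual_avatar.mp4", codec="libx264", audio_codec="aac")
```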
4 Results and discussion
The results are discussed based on the experimental setting and quantitative analysis.
4.1 Experimental setting
Google Colaboratory (Google Colab Pro) with 25 GB RAM and 200 GB storage, together with the Windows operating system, was used for all of the studies. The implementation used Python along with TensorFlow, PyTorch, Keras, Matplotlib, OpenCV, NumPy, Pandas, MoviePy, and Pygame. As detailed in Table 4, the models used in the training phase of the proposed system had the following parameter counts:
Number of parameters used in the proposed model
Model | Total parameters | Trainable parameters | Non-trainable parameters |
---|---|---|---|
YOLOv8 for detection and classification | 3,010,718 | — | — |
U-Net for segmentation | 1,941,105 | 1,941,105 | 0 |
U-Net with VGG16 for segmentation | 16,195,617 | 1,480,929 | 14,714,688 |
U-Net with MobileNetV2 for segmentation | 416,209 | 409,025 | 7,184 |
MobileNetV1 for classification | 3,296,154 | 3,274,266 | 21,888 |
MobileNetV2 for classification | 6,532,368 | 2,045,274 | 396,544 |
4.2 Quantitative analysis
The results of YOLOv8 were evaluated using benchmark performance metrics. Recall, precision, F1 measure, and mean average precision from epoch 0 to epoch 195 are illustrated in Table 5 and Figure 2. Epoch 146 shows the best results.
Comparison of the results of YOLOv8 from epoch 0 to epoch 195
Epoch | Train/box loss | Train/cls_loss | Train/dfl_loss | Metrics/precision(B) | Metrics/recall(B) | Metrics/mAP50(B) | Metrics/mAP50-95(B) | Val/box loss | Val/cls loss | Val/dfl loss |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.1778 | 4.0693 | 1.5151 | 0.25686 | 0.33499 | 0.23295 | 0.15871 | 1.1288 | 3.0318 | 1.4391 |
1 | 1.1586 | 2.9944 | 1.4688 | 0.4818 | 0.46913 | 0.46801 | 0.31006 | 1.2319 | 2.3013 | 1.5262 |
2 | 1.1802 | 2.6073 | 1.4557 | 0.47785 | 0.51797 | 0.52939 | 0.32774 | 1.3268 | 2.2235 | 1.6005 |
3 | 1.2184 | 2.384 | 1.4543 | 0.58504 | 0.6689 | 0.70701 | 0.43662 | 1.3896 | 1.5356 | 1.5991 |
: | : | : | : | : | : | : | : | : | : | : |
: | : | : | : | : | : | : | : | : | : | : |
146 | 0.74929 | 0.66212 | 1.1051 | 0.95499 | 0.9406 | 0.97009 | 0.77275 | 0.86865 | 0.4348 | 1.125 |
: | : | : | : | : | : | : | : | : | : | : |
: | : | : | : | : | : | : | : | : | : | : |
193 | 0.72875 | 0.63179 | 1.0947 | 0.95444 | 0.94036 | 0.96907 | 0.77275 | 0.86746 | 0.4248 | 1.1259 |
194 | 0.71511 | 0.61902 | 1.0884 | 0.95463 | 0.94033 | 0.96829 | 0.77179 | 0.8674 | 0.42481 | 1.1258 |
195 | 0.71785 | 0.6283 | 1.0931 | 0.95473 | 0.94035 | 0.96954 | 0.7725 | 0.86727 | 0.42433 | 1.1258 |
Bold values represent the best training and validation performance of the model, which likely corresponds to the optimal epoch (146) for selecting the best weight and final model.

Results of YOLOv8.
Figure 2 illustrates the values of these metrics for different training iterations. It appears that the training loss and classification loss decrease over time, which suggests that the model is learning to better predict bounding boxes and classify objects. The precision and recall metrics also increased, which suggests that the model is making better detections.
Moreover, the confusion matrix was calculated, as illustrated in Figure 3. The classes “Hello,” “I am good,” “Wait,” “I want to eat,” “Finished,” “What,” “I am married,” “I want to change my clothes,” “I cannot hear,” “Stop,” “Listen to me,” “Please,” “I love you,” “It is time,” and “Help me” achieved perfect results (1.00), while the classes “Thank you,” “Correct,” “I want to drink water,” “Upset,” “Come,” and “Please” obtained excellent results in the range of 0.91–0.95. The classes “I want to take a shower,” “I am not sure,” and “Me” obtained very good results in the range of 0.80–0.89. On the other hand, the class “You” obtained 0.73 and the class “Why” achieved 0.59.

Confusion matrix.
The U-Net segmentation assessment metrics for each model in Google Colab are presented in Table 6. The MobileNetV1 and MobileNetV2 classification assessment metrics are presented in Table 7. The MobileNetV1 classification assessment metrics for each class in Google Colab are shown in Table 8.
U-Net segmentation assessment metrics for each model in Google Colab
Model | Accuracy | Dice coefficient |
---|---|---|
U-Net for segmentation | 0.9845 | 0.918 |
U-Net with VGG16 for segmentation | 0.9809 | 0.889 |
U-Net with MobileNetV2 for segmentation | 0.9478 | 0.953 |
MobileNetV1 and MobileNetV2 classification assessment metrics
Model | Accuracy | Precision | Recall |
---|---|---|---|
MobileNetV1 for classification | 0.94 | 0.94 | 0.94 |
MobileNetV2 for classification | 0.79 | 0.76 | 0.71 |
MobileNetV1 classification assessment metrics for each class in Google Colab Pro
Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Hello | 1.00 | 0.97 | 0.99 | 40 |
I am good | 0.97 | 0.85 | 0.91 | 40 |
Thank you | 0.97 | 0.97 | 0.97 | 40 |
I want to take a shower | 0.90 | 0.93 | 0.91 | 40 |
Wait | 0.97 | 0.97 | 0.97 | 40 |
Correct | 0.90 | 0.93 | 0.91 | 40 |
I want to eat | 0.93 | 0.93 | 0.93 | 40 |
Finished | 1.00 | 1.00 | 1.00 | 40 |
I cannot hear | 0.87 | 0.97 | 0.92 | 40 |
I want to change my clothes | 1.00 | 0.93 | 0.96 | 40 |
I want to drink water | 0.97 | 0.88 | 0.92 | 40 |
Me | 0.86 | 0.90 | 0.88 | 40 |
You | 0.88 | 0.88 | 0.88 | 40 |
Upset | 0.91 | 0.97 | 0.94 | 40 |
Come | 0.95 | 0.95 | 0.95 | 40 |
I am not sure | 0.88 | 0.93 | 0.90 | 40 |
I am married | 0.95 | 0.95 | 0.95 | 40 |
What | 1.00 | 0.97 | 0.99 | 40 |
Why | 0.86 | 0.90 | 0.88 | 40 |
Excuse me | 0.90 | 0.90 | 0.90 | 40 |
I love you | 1.00 | 0.97 | 0.99 | 40 |
Please | 0.93 | 1.00 | 0.96 | 40 |
Listen to me | 0.97 | 0.95 | 0.96 | 40 |
Accuracy | | | 0.94 | 1,040 |
Macro avg. | 0.94 | 0.94 | 0.94 | 1,040 |
Weighted avg. | 0.94 | 0.94 | 0.94 | 1,040 |
The video avatar was generated from a driving video and a specific source image using the first-order motion model, as shown in Figure 4(a); audio and text were then embedded in the video avatar to generate the audio-visual avatar, as shown in Figure 4(b).

Generate audio-visual avatar: (a) generate Avatar and (b) generate audio-visual avatar after embedding audio and text to video avatar.
It can be concluded that, when the proposed system used YOLOv8 and MobileNetV1 for classification, YOLOv8 gave better results than MobileNetV1 in classifying hand gesture signs. Likewise, when the proposed system used the traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 for segmentation, the traditional U-Net gave the best result in segmenting hand gesture signs. The results of the detection and classification stages, as well as the segmentation and classification stages, were evaluated using several images, video samples, and several DNN models.
For detecting and classifying hand sign gestures, the YOLOv8 model was used, with 3,010,718 total parameters. The benchmark evaluation metrics gave precision, recall, mAP50, and mAP50-95 values of 0.956, 0.939, 0.971, and 0.775, respectively, on Google Colaboratory (Google Colab Pro) with 25 GB RAM and 200 GB storage.
To segment the hand sign gestures, the proposed system used the traditional U-Net, U-Net with VGG16, and U-Net with MobileNetV2 models for semantic segmentation, with 1,941,105, 16,195,617, and 416,209 parameters, respectively, on Google Colaboratory (Google Colab Pro) with 25 GB RAM and 200 GB storage. Accuracy and Dice coefficient values were used to evaluate the segmentation, giving 0.9845 and 0.918 for the traditional U-Net model, 0.9809 and 0.889 for U-Net with VGG16, and 0.9478 and 0.953 for U-Net with MobileNetV2.
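For reference, the Dice coefficient reported above can be computed as 2|A ∩ B|/(|A| + |B|) between the predicted and ground-truth masks; a minimal NumPy sketch (the binarization threshold and smoothing constant are assumptions) follows.

```python
# Dice coefficient between a predicted mask and a ground-truth mask (assumed smoothing term).
import numpy as np

def dice_coefficient(pred, target, threshold=0.5, smooth=1e-6):
    """2*|A intersect B| / (|A| + |B|) for binary masks; `smooth` avoids division by zero."""
    pred_bin = (pred >= threshold).astype(np.float32)
    target_bin = (target >= threshold).astype(np.float32)
    intersection = np.sum(pred_bin * target_bin)
    return (2.0 * intersection + smooth) / (np.sum(pred_bin) + np.sum(target_bin) + smooth)
```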
To classify the segmented hand sign gestures, MobileNetV1 was used, with 3,296,154 parameters. The evaluation metrics used were accuracy, precision, and recall, which gave 0.94, 0.94, and 0.94, respectively.
The improved audio-visual avatar, which was used to translate elderly hand sign gestures, achieved an effective performance of 85.56%, with 25 volunteers answering 15 pre-prepared questions, indicating its effectiveness in providing care to elderly individuals.
The evaluation metrics indicate that the work of Sharma and Singh [21] outperformed all studies due to their design using a deep neural network (CNN), which led to high accuracy in classifying ISL and ASL gestures. This model outperformed VGG-11 and VGG-16, demonstrating the ability to classify with high accuracy and a low error rate. Additionally, it was stable against transformations in rotation and scaling, as demonstrated in Table 9. However, our proposed system is superior to previous studies because it includes an avatar to enhance communication between the patient and the doctor. In our research, we constructed a dataset specifically designed for elderly individuals with stroke in Iraq. This dataset includes fundamental hand motions identified by visiting several nursing homes for the elderly to meet their daily needs, filling a gap where no such dataset previously existed in Iraq.
5 Conclusions and future scope
In conclusion, this research has successfully achieved its main objective of developing an effective deep learning-based system for translating elderly gestures into audio and text using avatars. The creation and utilization of the SGESP dataset highlight the importance of incorporating real-life scenarios and contexts into gesture recognition systems. Segmenting hand motions using multi-stage segmentation proved to be 100% effective. Some signals, such as “I want to eat” and “I want to change my clothes,” “Hello” and “Finished,” and “What” and “Why,” are similar and cause confusion for the system. YOLOv8n successfully addressed all these challenges except for the “Why” gesture. In order to solve this confusion, U-Net was used for the segmentation process. MobileNetV1 was then used for classification to prevent the confusion between “What” and “Why” that appeared when using YOLOv8 for classification purposes.
-
Funding information: The authors state no funding involved.
-
Author contributions: Kawther Thabt performed the conceptualization, methodology, software development, validation, formal analysis, investigation, resource management, data curation, original draft preparation, visualization, and data measurement. Abdulamir Abdullah was involved in planning, review, supervision, and project administration. Both the authors discussed the results and commented on the manuscript.
-
Conflict of interest: The authors state no conflict of interest.
-
Data availability statement: Most datasets generated and analyzed in this study are included in the submitted manuscript. The other datasets are available on reasonable request to the corresponding author with the attached information.
References
[1] Čujan Z, Fedorko G, Mikušová N. Application of virtual and augmented reality in automotive. Open Eng. 2020;10(1):113–9. 10.1515/eng-2020-0022.
[2] Brown T, Burleigh TL, Schivinski B, Bennett S, Gorman-Alesi A, Blinka L, et al. Translating the user-avatar bond into depression risk: A preliminary machine learning study. J Psychiatr Res. 2024;170:328–39. 10.1016/j.jpsychires.2023.12.038.
[3] Yang D, Sun M, Zhou J, Lu Y, Song Z, Chen Z, et al. Expert consensus on the “Digital Human” of metaverse in medicine. Clin eHealth. 2023;6:159–63. 10.1016/j.ceh.2023.11.005.
[4] Pauw LS, Sauter DA, van Kleef GA, Lucas GM, Gratch J, Fischer AH. The avatar will see you now: Support from a virtual human provides socio-emotional benefits. Comput Hum Behav. 2022;136:107368. 10.31234/osf.io/5u6hz.
[5] Sardesai N, Russo P, Martin J, Sardesai A. Utilizing generative conversational artificial intelligence to create simulated patient encounters: a pilot study for anaesthesia training. Postgrad Med J. 2024;100:qgad137. 10.1093/postmj/qgad137.
[6] Basori AH, Ali IR. Emotion expression of avatar through eye behaviors, lip synchronization and MPEG4 in virtual reality based on Xface toolkit: Present and future. Procedia-Soc Behav Sci. 2013;97:700–6. 10.1016/j.sbspro.2013.10.290.
[7] Lu LL, Henn P, O’Tuathaigh C, Smith S. Patient–healthcare provider communication and age-related hearing loss: a qualitative study of patients’ perspectives. Ir J Med Sci. 2024;193(1):277–84. 10.1007/s11845-023-03432-4.
[8] Hailu GN, Abdelkader M, Meles HA, Teklu T. Understanding the support needs and challenges faced by family caregivers in the care of their older adults at home. A qualitative study. Clin Interv Aging. 2024;19:481–90. 10.2147/cia.s451833.
[9] Alfiras M, Bojiah J, Mohammed MN, Ibrahim FM, Ahmed HM, Abdullah OI. Powered education based on Metaverse: Pre- and post-COVID comprehensive review. Open Eng. 2023;13(1):20220476. 10.1515/eng-2022-0476.
[10] Zijun L, Xu Y, Yujia Y, Zhiqiang X. Elderly onset of MELAS carried an M. 3243A > G mutation in a female with deafness and visual deficits: A case report. Clin Case Rep. 2024;12(3):e8438. 10.1002/ccr3.8438.
[11] Zhen R, Song W, He Q, Cao J, Shi L, Luo J. Human-computer interaction system: A survey of talking-head generation. Electronics. 2023;12(1):218. 10.3390/electronics12010218.
[12] Azofeifa JD, Noguez J, Ruiz S, Molina-Espinosa JM, Magana AJ, Benes B. Systematic review of multimodal human-computer interaction. Informatics. 2022;9(1):13. 10.3390/informatics9010013.
[13] Khan NS, Abid A, Abid K. A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation. Cogn Comput. 2020;12:748–65. 10.1007/s12559-020-09731-7.
[14] Zhang J, Chen J, Wang C, Yu Z, Liu C, Qi T, et al. Virbo: Multimodal multilingual avatar video generation in digital marketing. arXiv preprint arXiv:2403.11700; 2024. 10.48550/arXiv.2403.11700.
[15] López LIB, Ferri FM, Zea J, Caraguay ÁLV, Benalcázar ME. CNN-LSTM and post-processing for EMG-based hand gesture recognition. Intell Syst Appl. 2024;22:200352. 10.1016/j.iswa.2024.200352.
[16] García AS, Fernández-Sotos P, Vicente-Querol MA, Sánchez-Reolid R, Rodriguez-Jimenez R, Fernández-Caballero A. Co-design of avatars to embody auditory hallucinations of patients with schizophrenia: A study on patients’ feeling of satisfaction and psychiatrists’ intention to adopt the technology. Virtual Real. 2021;27:1–16. 10.1007/s10055-021-00558-7.
[17] Lu LLM, Henn P, O’Tuathaigh C, Smith S. Patient–healthcare provider communication and age-related hearing loss: a qualitative study of patients’ perspectives. Ir J Med Sci (1971-). 2023;193:1–8. 10.1007/s11845-023-03432-4.
[18] Zhang J, Jiang Z, Yang D, Xu H, Shi Y, Song G, et al. Avatargen: a 3D generative model for animatable human avatars. In: European Conference on Computer Vision. Cham: Springer Nature Switzerland; 2022. p. 668–85. 10.1007/978-3-031-25066-8_39.
[19] Athira PK, Sruthi CJ, Lijiya AA. A signer independent sign language recognition with co-articulation elimination from live videos: an Indian scenario. J King Saud Univ - Comput Inf Sci. 2022;34(3):771–81. 10.1016/j.jksuci.2019.05.002.
[20] Li Z, Chen L, Liu C, Zhang F, Li Z, Gao Y, et al. Animated 3D human avatars from a single image with GAN-based texture inference. Comput Graph. 2021;95:81–91. 10.1016/j.cag.2021.01.002.
[21] Sharma S, Singh S. Vision-based hand gesture recognition using deep learning for the interpretation of sign language. Expert Syst Appl. 2021;182:115657. 10.1016/j.eswa.2021.115657.
[22] Thies J, Elgharib M, Tewari A, Theobalt C, Nießner M. Neural voice puppetry: Audio-driven facial reenactment. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer International Publishing; 2020. p. 716–31. 10.1007/978-3-030-58517-4_42.
[23] Gupta R, Rajan S. Comparative analysis of convolution neural network models for continuous Indian sign language classification. Procedia Comput Sci. 2020;171:1542–50. 10.1016/j.procs.2020.04.165.
[24] Molano JSV, Díaz GM, Sarmiento WJ. Parametric facial animation for affective interaction workflow for avatar retargeting. Electron Notes Theor Comput Sci. 2019;343:73–88. 10.1016/j.entcs.2019.04.011.
[25] García AS, Navarro E, Fernández-Caballero A, González P. Towards the design of avatar-based therapies for enhancing facial affect recognition. In: Ambient Intelligence–Software and Applications, 9th International Symposium on Ambient Intelligence. Springer International Publishing; 2019. p. 306–13. 10.1007/978-3-030-01746-0_36.
© 2025 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.