Article · Open Access

Indoor environment monitoring method based on the fusion of audio recognition and video patrol features

  • Wei Zhang, Jianjun Yang, Ying Jiang, Yuling Chen and Yifan Zhang
Published/Copyright: 9 June 2025

Abstract

To accurately detect abnormal behavior of business hall employees in indoor working environments and ensure the normal operation of the store, deep learning techniques are combined to build an audio and video feature extraction network. First, separate video and audio feature extraction models are built using a convolutional neural network and a long short-term memory network; an abnormal behavior detection model based on audio and video feature fusion is then built using multi-channel modules. The model divides audio and video data into multiple channels and performs feature extraction and recognition on each. To assess the performance and robustness of the proposed model, experiments were conducted on multiple datasets. Compared with several existing neural network models, the detection accuracy, recall, F1 value, and mean average precision of the proposed model on the video-audio dataset are 97.85, 96.98, 95.61, and 35.61%, respectively. The method fuses video and audio data effectively, improving the accuracy and robustness of abnormal behavior recognition. In addition, user and expert satisfaction with the model reached 0.92 and 0.97, respectively. The novelty of this study lies in the comprehensive analysis of audio and video data through multi-channel processing: by simultaneously extracting and fusing video frames and audio signals, the model significantly improves the detection of abnormal behaviors in complex indoor scenes.

1 Introduction

In recent years, with the advancement of computer technology, abnormal behavior detection (ABD) has become an important research direction. ABD refers to the identification and classification of abnormal behaviors in specific scenarios to help people better understand and respond to various security issues [1]. In anomaly recognition, many experts have conducted a series of studies using deep learning (DL) techniques [2]. However, current methods often focus only on identifying abnormal behaviors in specific scenarios while ignoring information from other parts of the video. In addition, traditional feature extraction (FE) models often focus on a single type of feature, making them prone to significant extraction errors. The continuous advancement of DL technology has given it strong processing capabilities in image recognition; convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are two widely used DL architectures [3,4]. The motivation for this research stems from the need for ABD in indoor work environments, especially in high-traffic public places such as power business halls, where timely detection of and response to abnormal employee behavior is crucial for maintaining safety and improving operational efficiency. Despite the rapid development of DL technology, most existing ABD methods are limited to a single data source, resulting in insufficient accuracy of information extraction. Against this background, this article proposes an ABD model based on multichannel audio and video feature fusion (MCAVFF), aiming to improve the recognition of abnormal behavior. The novelty of the research lies in the proposed ABD model, which combines CNN and LSTM and effectively integrates audio and video information through multichannel feature extraction to improve the accuracy and robustness of abnormal behavior recognition.
The research structure is divided into four parts. The first part reviews the research progress in related fields, and focuses on the existing ABD techniques and their limitations. The second part describes the feature extraction methods of the MCAVFF model, including CNN-Convolutional LSTM (ConvLSTM) structure for video feature extraction and AlexNet-LSTM structure for audio feature extraction. The third part verifies the validity of the model through experiments and analyzes the experimental results compared with the existing methods. The fourth part summarizes the research results and puts forward the prospect of the future research direction.

2 Related works

Currently, many experts have applied various neural networks to face recognition, anomaly detection, and audio and video FE, and have achieved certain research results [5]. Singh et al. proposed a new automatic monitoring method for stampede incidents in crowded scenes to avoid pedestrian casualties. In the proposed method, a new set concept was used to detect anomalies in crowded-scene video data, while ConvNets served as the feature extractor. The research findings indicated that the model had good feature recognition accuracy [6]. Zhang et al. designed a valve stiction detection method using CNN. The method fully considered the common characteristics of industrial time series signals and automatically developed multiple strategies for learning time-scale features. Compared with traditional FE methods, it could automatically learn time series features collected from industrial control loops, thereby improving the accuracy and efficiency of feature detection. The findings denoted that the method required no manual FE and had high accuracy in detecting valve stiction features [7]. Wang and Liu put forward an improved target detection algorithm based on a fast region-based CNN to address the errors of traditional CNNs in target detection. In addition, an improved interpolation algorithm was proposed for the bounding box localization stage, replacing bilinear interpolation with Newton parabolic interpolation to reduce noise interference in target detection. The experimental outcomes indicated that the method could effectively extract target features and complete the target detection task [8]. Kolli and Tatavarthi constructed a fraud detection model using an optimized deep recursive neural network to reduce fraudulent transactions in e-commerce.
The fraud detection model not only effectively detected fraudulent activities but also recorded the characteristics of fraudulent data, thereby helping to prevent electronic transaction fraud [9].

With the continuous growth of DL technology, it has been widely studied in audio and video FE. Zhang et al. proposed a new video copy detection method using a deep CNN for FE. This method could not only use the CNN to encode video image content but also retain the recognition ability of neural networks, ensuring that the deep CNN fully extracts video features. The experimental outcomes denoted that the video FE method had high accuracy [10]. Aiming at the low detection accuracy and efficiency of traditional double-compressed adaptive multi-rate audio recording detection, Büker and Hanili used a CNN to extract features and a support vector machine to model the extracted audio features. The research results showed that the CNN could extract different audio feature signals very well. In addition, double compression severely affects high-frequency audio, resulting in low accuracy of audio FE in high-frequency regions [11].

To sum up, a series of DL technologies, including CNNs and recurrent neural networks, have achieved certain research results in image recognition, face detection, and anomaly monitoring. Many experts have applied neural networks to recognition tasks to improve recognition accuracy and efficiency. However, research on monitoring fixed indoor environments remains relatively scarce, and there is no dedicated indoor environment monitoring method. This research takes a business hall as its object and extracts multimedia video and audio features using DL technology to identify abnormal staff behavior in the indoor environment.

3 Abnormal behavior monitoring of indoor environmental workers based on MCAVFF

This research aims at the power business hall in the indoor environment. First, CNN and LSTM networks are utilized to extract the video features taken by the high-definition camera in the business hall, and a video feature detection model is constructed. Then, the audio features in the multimedia are extracted, and an audio feature detection model is built. Finally, by combining the audio and video models, an MCAVFF abnormal behavior monitoring model is constructed, aiming to constantly monitor the abnormal behavior of the personnel in the store.

3.1 Abnormal behavior video feature detection based on CNN-ConvLSTM network structure

CNN is a commonly used DL method in image recognition. A CNN extracts feature information by processing images through convolutional, pooling, and fully connected layers. AlexNet is an optimized neural network based on CNN, mainly used in image recognition, and consists of three different layer types, namely, convolutional, pooling, and fully connected layers [12]. Each level uses a different learning rate and loss function to learn image features for classification or recognition. This study utilizes the CNN-based AlexNet for video FE; the FE process is shown in Figure 1.

Figure 1
Flow chart of video FE with AlexNet.

Figure 1 is the flowchart of video FE using AlexNet. Traditional FE networks take the raw video frames directly as input. However, since using the difference of adjacent images as input improves the FE effect, this study first performs subtraction (differencing) on adjacent input images and then applies AlexNet for FE. AlexNet comprises convolutional, pooling, and fully connected layers. To improve training speed, a ReLU activation is added after each convolutional layer, which increases the sparsity of the network. Local response normalization is applied after each pooling layer to avoid information loss during extraction.
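The differencing step described above can be sketched as follows. This is a minimal illustration with an illustrative function name, not the paper's implementation; it simply subtracts each frame from the next and keeps the absolute difference that AlexNet would then consume.

```python
import numpy as np

def frame_difference(frames):
    """Absolute difference of each pair of adjacent video frames.

    frames: array of shape (T, H, W) holding T consecutive grayscale frames.
    Returns (T-1, H, W) difference images that emphasize motion, which are
    fed into the FE network instead of the raw frames.
    """
    frames = frames.astype(np.float32)
    return np.abs(frames[1:] - frames[:-1])

# toy example: three 2x2 frames with a moving bright pixel
clip = np.array([[[0, 0], [0, 0]],
                 [[10, 0], [0, 0]],
                 [[10, 10], [0, 0]]])
diff = frame_difference(clip)
```

Static background cancels out in the difference images, so only the moving regions carry energy into the network.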

Due to the temporal nature of video features, using AlexNet alone can only extract action features at each moment, without continuity. This study therefore builds a temporal signal model with LSTM on top of AlexNet to extract features with temporal variation. Unlike a traditional CNN, an LSTM can observe and record changes across consecutive video frames and can handle samples of different temporal lengths. To enable the LSTM to characterize local features through convolution like a CNN, this study replaces the fully connected operations in the LSTM with convolutions, yielding a ConvLSTM network as the final temporal model. The calculation of the ConvLSTM input gate is shown in Eq. (1) [13].

(1) $i_t = \sigma(W_i \ast [h_{t-1}, x_t] + b_i)$,

where $i_t$ expresses the input gate at time $t$; $W_i$ indicates the input-gate weight; $\ast$ denotes the convolution operation; $\sigma$ means the sigmoid function; $b_i$ denotes the threshold of the input gate; $h_{t-1}$ represents the output of the previous layer of the network at time $t-1$; and $x_t$ denotes the input of the local network at time $t$.

(2) $f_t = \sigma(W_f \ast [h_{t-1}, x_t] + b_f)$,

where $f_t$ expresses the forget gate at time $t$; $W_f$ stands for the forget-gate weight; and $b_f$ indicates the threshold of the forget gate.

(3) $O_t = \sigma(W_O \ast [h_{t-1}, x_t] + b_O)$,

where $O_t$ indicates the output gate at time $t$; $W_O$ refers to the output-gate weight; and $b_O$ stands for the threshold of the output gate.

(4) $\tilde{C}_t = \tanh(W_C \ast [h_{t-1}, x_t] + b_C)$,

where $\tilde{C}_t$ stands for the candidate memory content at time $t$; $W_C$ refers to the weight of the memory unit; $b_C$ means the threshold of the memory unit; and $\tanh$ denotes the hyperbolic tangent function.

(5) $C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$,

where $\circ$ means Hadamard (element-wise) multiplication. Multiplying the corresponding elements in this way gives the update rule for the memory unit; $C_{t-1}$ expresses the memory unit at time $t-1$.

(6) $h_t = O_t \circ \tanh(C_t)$,

where $h_t$ means the output value of the local network at time $t$. Combining ConvLSTM with AlexNet yields the CNN-ConvLSTM abnormal behavior video feature detection network structure illustrated in Figure 2.

Figure 2
Structure of CNN-ConvLSTM anomalous behavior video feature detection network.

Figure 2 shows the structure diagram of the abnormal behavior video feature detection network constructed with the CNN-ConvLSTM network. In Figure 2, the network comprises four parts, namely, video frame differencing, CNN, LSTM, and fully connected layers. First, the video is divided into individual frames and each pair of adjacent frames is differenced; AlexNet then performs FE. After FE is completed, ConvLSTM models the temporal information, and finally a fully connected layer produces the output.

3.2 Research on end-to-end abnormal audio feature detection based on original waveform

In addition to processing the video captured by the HD camera in the business hall and extracting video frames for identification and detection, the research further identifies and monitors the audio information in the indoor environment, aiming to judge various behaviors through audio analysis and thus monitor staff dynamics in real time. Before analyzing a segment of indoor audio, signal preprocessing is required, mainly comprising routine operations such as pre-emphasis, framing, and windowing. In this study, the total duration of each audio clip is assumed to be 2 s, with a frame length of 20 ms. The pre-emphasis calculation is indicated in Eq. (7) [14].

(7) $y(n) = x(n) - a\,x(n-1)$,

where $x(n)$ means the original audio signal sequence; $n$ indicates the speech sampling time; $y(n)$ expresses the audio signal sequence after pre-emphasis; $a$ is the pre-emphasis coefficient; and $x(n-1)$ refers to the audio signal value at time $n-1$.

Due to the nonlinear nature of the staff’s audio signals collected in the indoor business hall, it is necessary to perform framing processing on the audio signals to ensure their integrity is preserved during the extraction. This study adopts the overlapping segmentation method for frame segmentation, with a frame length of 20 ms and a frame shift of 10 ms. After framing the audio signal, windowing is also necessary to prevent signal distortion during the processing. Hamming and rectangular windows are two common window functions, and their expressions are denoted in Eqs. (8) and (9).

(8) $w(n) = \begin{cases} 1, & 0 \le n < N \\ 0, & \text{otherwise} \end{cases}$

Eq. (8) is the window function of a rectangular window, where $w(n)$ is the window function, $n$ is the sample index within the window, and $N$ is the window length (a constant).

(9) $w(n) = \begin{cases} 0.54 - 0.46 \cos\!\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n < N \\ 0, & \text{otherwise} \end{cases}$

Eq. (9) is the window function of the Hamming window, with $w(n)$, $n$, and $N$ defined as above. Because the rectangular window loses waveform detail in the high-frequency part, the Hamming window better preserves audio detail and keeps the waveform more complete. Therefore, the Hamming window function is selected for windowing the audio signal in this study.

Because the audio events in surveillance videos are mostly audio clips with inconsistent durations, there is a significant difference in audio features between different periods. To better extract audio features, the study utilizes Mel-frequency cepstral coefficients for FE. The Mel cepstrum coefficient can combine the auditory features perceived by the human ear with the speech generation mechanism, thereby extracting audio features that are more in line with the actual situation [15,16]. The FE of Mel cepstrum coefficients is shown in Figure 3.

Figure 3
Diagram of audio FE of Mel-frequency cepstral coefficients.

Figure 3 shows the audio FE diagram of the Mel cepstrum coefficients. A segment of original audio is first preprocessed to remove redundant interference signals, and a discrete Fourier transform is then performed to obtain the spectrum. The spectrum is squared to obtain the energy spectrum, which is filtered by M Mel band-pass filters to obtain the output power spectrum. Finally, an inverse discrete Fourier transform yields the static characteristics of the Mel cepstrum coefficients, and the dynamic characteristics are obtained by differencing the static characteristics. Linear predictive coding can calculate different sound characteristics according to an ideal acoustic model; combining linear predictive coding with Mel cepstrum coefficients can further determine the source and type of sound in the video. The calculation of the audio prediction error is indicated in Eqs. (10) and (11).

(10) $\tilde{x}(n) = \sum_{i=1}^{p} a_i\, x(n-i)$,

where $p$ refers to the linear prediction order; $x(n)$ and $x(n-i)$ stand for the current sound signal and the signal $i$ samples earlier, respectively; and $a_i$ indicates the $i$-th linear prediction coefficient.

(11) $e(n) = x(n) - \tilde{x}(n)$,

where $\tilde{x}(n)$ stands for the predicted sound signal and $e(n)$ expresses the prediction error.

When Mel cepstrum coefficients are used to extract audio features, a series of transformations is required, so the coefficients only vary within a limited frequency range and cannot capture all features of the audio. Moreover, it is difficult to ensure the accuracy and stability of the feature parameters during real-time audio processing and analysis. Therefore, a DL network is further used to build an audio feature detection model, as shown in Figure 4.

Figure 4
Structure of the original audio FE network.

Figure 4 denotes the structure diagram of the original audio FE network constructed using AlexNet and LSTM. As shown in Figure 4, a complete multimedia audio segment first undergoes framing, and the waveform of each frame is then fed into AlexNet for audio FE. After AlexNet processes the audio signal, the features are passed to the LSTM network to construct a temporal model. The output of the last LSTM unit carries the most effective features of the entire audio segment and is connected to the fully connected layer to complete the final audio feature classification, facilitating subsequent system recognition.
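The temporal modeling step, in which per-frame features are run through an LSTM and the last unit's state summarizes the clip, can be sketched in plain numpy. Feature and hidden dimensions, the stacked-weight layout, and the random initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(feats, W, b, hidden=8):
    """Run a plain LSTM over per-frame features and return the final
    hidden state, used here as the clip-level audio descriptor.

    feats: (T, D) sequence of per-frame features (e.g. from AlexNet).
    W: (4*hidden, hidden + D) stacked gate weights; b: (4*hidden,) biases.
    """
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in feats:
        z = W @ np.concatenate([h, x]) + b
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # memory update
        h = o * np.tanh(c)                   # hidden state
    return h

rng = np.random.default_rng(3)
D, hidden = 16, 8
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + D))
b = np.zeros(4 * hidden)
summary = lstm_last_state(rng.normal(size=(30, D)), W, b, hidden)
```

The final hidden state is a fixed-size vector regardless of clip length, which is what allows a fully connected classifier to follow it.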

3.3 Abnormal behavior monitoring based on MCAVFF

Multichannel feature fusion refers to fusing feature information from multiple channels, such as the frequency-domain, time-domain, or spatial parts of an audio signal, to generate a comprehensive feature vector representing the signal. It can be used in multiple audio processing tasks such as speech recognition, audio classification, and speech synthesis. To facilitate the observation of abnormal staff behavior in the indoor environment, this research combines the video and audio detection models built above and uses multichannel feature fusion to reduce the inherent defects of single-channel feature recognition, further improving recognition and detection accuracy. For a complete piece of multimedia information, shot segmentation is first required to separate the audio content from the video content, ensuring that each video sequence contains both visual and auditory information. The shot segmentation process is expressed in Figure 5.

Figure 5
Flow chart of shot segmentation.

Figure 5 shows the specific process of shot segmentation, where a complete video is first divided into multiple segments. The visual and audio information in these segments is then extracted for feature processing. In the above research, the AlexNet-ConvLSTM model was constructed for processing video features, while the AlexNet-LSTM model was constructed for processing audio features. To further simplify the FE task, the study uses the multichannel principle to fuse the features of the two models and then sends them together into the LSTM network for FE. The resulting indoor environment monitoring model based on the DL network is denoted in Figure 6.
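At its core, the fusion step reduces to aligning the two per-timestep feature sequences and concatenating them into one multichannel sequence before the shared LSTM. A minimal sketch, assuming illustrative feature sizes and names:

```python
import numpy as np

def fuse_channels(video_feats, audio_feats):
    """Concatenate per-timestep video and audio features into one
    multichannel sequence for the shared temporal model.

    video_feats: (T, Dv) per-frame video features (e.g. AlexNet-ConvLSTM)
    audio_feats: (T, Da) per-frame audio features (e.g. AlexNet-LSTM)
    """
    assert video_feats.shape[0] == audio_feats.shape[0], "align timesteps first"
    return np.concatenate([video_feats, audio_feats], axis=1)

# 50 timesteps of 256-dim video features and 128-dim audio features
fused = fuse_channels(np.ones((50, 256)), np.zeros((50, 128)))
```

Concatenation keeps both modalities available to every LSTM step, letting the temporal model weigh visual and acoustic evidence jointly rather than deciding per channel.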

Figure 6
Indoor environment monitoring model with DL network.

Figure 6 shows the indoor environment monitoring model finally built with the DL network, which is composed of two channels: a video processing channel and an audio processing channel. In the video channel, AlexNet is used to extract differential information between adjacent video frames and identify their differential features. In the audio channel, AlexNet processes the raw waveform features of each frame of the audio signal, and the adjacent features at each time step are fused and fed into the LSTM model for temporal modeling. In summary, the final indoor environment monitoring model can assist managers in monitoring the working status of business hall staff and identifying abnormal behaviors. When dealing with audio events whose durations do not match the video, the study adopts a separation-and-association approach: shot segmentation divides the video into independent segments so that the audio and video content of each segment can be analyzed separately. During audio processing, the integrity of the audio signal is ensured through signal processing techniques such as pre-emphasis, framing, and Hamming windowing. In feature extraction, combining Mel cepstrum coefficients with other methods extracts both the time-domain and frequency-domain features of the audio signal and enhances the recognition of irregular audio events.

4 Monitoring results of abnormal behaviors of indoor environmental workers based on MCAVFF

To further test the performance of the indoor environment monitoring model built, the research first analyzed the detection results of abnormal behavior video and audio features, and proved that its performance was far better than other models. Subsequently, the research applied the final designed MCAVFF model to real life and tested its recognition ability in practical applications.

4.1 Analysis of abnormal behavior video feature detection results

To test the effectiveness of AlexNet-ConvLSTM, the raised network model was compared with traditional video feature detection models. The self-made video action dataset was selected as the validation dataset for this experiment. First, the loss values of different models were tested as a function of iteration times, as shown in Figure 7.

Figure 7
Variation in loss values of different video feature models with the number of iterations. (a) Variation in loss curve of AlexNet-ConvLSTM network. (b) Variation in loss curve of CNN network.

Figure 7(a) and (b) show the variation in loss values with the number of iterations for the AlexNet-ConvLSTM and traditional CNN models, respectively. As shown in Figure 7(a), as the number of iterations increased, the training and validation losses of the AlexNet-ConvLSTM model decreased continuously. When the number of iterations reached 35, the model began to stabilize. As shown in Figure 7(b), the training and validation losses of the traditional CNN model also decreased continuously, but both fluctuated significantly. When the number of iterations reached 45, the CNN model began to stabilize. The results in Figure 7 show that ConvLSTM, by introducing the LSTM's temporal modeling capability, captures dynamic information in video data more effectively, thereby improving the learning efficiency and effectiveness of the model.

Figure 8(a) and (b) show the variation in detection accuracy with the number of iterations for the AlexNet-ConvLSTM and traditional CNN models, respectively. As shown in Figure 8(a), as the number of iterations increased, the training and validation accuracy of the AlexNet-ConvLSTM model rose continuously. When the number of iterations reached 25, the model began to stabilize, with a detection accuracy of around 0.95. From Figure 8(b), the training and validation accuracy of the traditional CNN model fluctuated constantly, and the range of variation in training accuracy was much greater than that of the validation accuracy. When the number of iterations was 15, the detection accuracy of the CNN model stabilized at around 0.80. The results in Figure 8 highlight the effectiveness of the model in processing complex video data, indicating that it can enable efficient abnormal behavior monitoring in practical applications.

Figure 8
Variation in test accuracy of different video feature models with the number of iterations. (a) Variation in detection accuracy of AlexNet-ConvLSTM network. (b) Variation in detection accuracy of CNN network.

4.2 Analysis of abnormal audio feature detection results

To test the performance of the audio FE model AlexNet-LSTM, the proposed network model was compared with traditional audio feature detection models. A self-made voice dataset was selected as the validation dataset for this experiment. First, the loss values of the different models were tested as a function of the number of iterations, as denoted in Figure 9.

Figure 9
Variation in loss values with the iteration times for different audio feature models. (a) Variation in loss curve of AlexNet-LSTM network. (b) Variation in loss curve of LSTM network.

Figure 9(a) and (b) show the variation in loss values with the number of iterations for the AlexNet-LSTM and traditional LSTM models, respectively. As shown in Figure 9(a), as the number of iterations increased, the training and validation losses of the audio FE model AlexNet-LSTM decreased continuously. When the number of iterations reached 60, the model began to stabilize. As shown in Figure 9(b), the training and validation losses of the traditional LSTM model also decreased continuously; however, the training loss varied significantly and coincided less often with the validation loss. When the number of iterations reached 80, the LSTM model tended to stabilize.

Figure 10(a) and (b) show the variation in detection accuracy with the number of iterations for the AlexNet-LSTM and traditional LSTM models, respectively. As shown in Figure 10(a), as the number of iterations increased, the training and validation accuracy of the AlexNet-LSTM model on the audio dataset rose continuously. When the number of iterations reached 50, the model began to stabilize, with a detection accuracy of around 0.97. From Figure 10(b), the training and validation accuracy of the traditional LSTM model fluctuated significantly. When the number of iterations was 60, the detection accuracy of the LSTM model stabilized at around 0.90. The analysis of Figures 9 and 10 further shows that the convergence and detection accuracy of the AlexNet-LSTM model are significantly better than those of the traditional LSTM model. This result not only highlights the importance of deep learning in audio signal processing but also the potential of such models for real-time monitoring and emergency response, providing better analytical tools for related fields.

Figure 10
Variation in test accuracy of different audio feature models with the number of iterations. (a) Variation in detection accuracy of AlexNet-LSTM network. (b) Variation in detection accuracy of LSTM network.

4.3 Analysis of detection results of the MCAVFF model

The above research proved that the video and audio feature detection models, AlexNet-ConvLSTM and AlexNet-LSTM, performed well in both video and audio FE. To further demonstrate the excellent FE ability of MCAVFF models, a comparative study was conducted on their detection results.

Table 1 shows the detection accuracy, recall rate, F1 value, and mean average precision (MAP) values of CNN, LSTM, and multichannel audio video feature fusion models in a single video dataset, a single sound dataset, and a video sound mixed dataset, respectively. From Table 1, the detection accuracy, recall, F1 value, and MAP value of the proposed model in the video-audio dataset were 97.85, 96.98, 95.61, and 35.61%, respectively. In summary, the MCAVFF model performed better than traditional CNN and LSTM models under various indicators.

Table 1

Performance test results of different models

Models         Dataset      Precision (%)  Recall rate (%)  F1 value (%)  MAP value (%)
CNN            Video        88.26          87.46            88.65          8.56
               Audio        81.26          81.17            82.12         10.69
               Video-audio  84.68          86.84            85.25         11.82
LSTM           Video        83.65          82.14            81.03          9.54
               Audio        88.06          87.45            88.69         12.78
               Video-audio  86.32          86.55            86.94         13.54
MCAVFF model   Video        92.19          93.06            93.26         23.58
               Audio        93.18          94.24            94.08         22.12
               Video-audio  97.85          96.98            95.61         35.61
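The first three columns of Table 1 follow the standard definitions of precision, recall, and F1. A minimal sketch for a binary normal/abnormal labelling, with an illustrative function name:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = abnormal),
    matching the first three metric columns of Table 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy labelling: 3 abnormal and 2 normal samples
p, r, f = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

F1 is the harmonic mean of precision and recall, so a model must balance false alarms against missed detections to score well on all three columns at once.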

Figure 11 shows the satisfaction of users and experts with various FE models. Figure 11(a) shows the satisfaction of users with various FE models. From Figure 11(a), the user satisfaction levels for CNN, LSTM, and multi-channel fusion models were 0.76, 0.78, and 0.92, respectively. Figure 11(b) shows the satisfaction of experts with various FE models. From Figure 11(b), experts’ satisfaction with the three models was 0.72, 0.82, and 0.97, respectively. The positive feedback from users and experts further verified the performance and practicability of the model, indicating that it effectively met the needs of practical applications and provided reliable technical support for indoor environmental monitoring and safety management.

Figure 11

User and expert satisfaction with various FE models. (a) User satisfaction with various feature extraction models. (b) Expert satisfaction with various feature extraction models.

5 Conclusion

To better monitor the abnormal behavior of business hall staff in the indoor environment, this study used CNN and LSTM networks to build a series of FE networks and applied them to the identification of abnormal staff behavior. The results indicated that, compared with the CNN model, AlexNet-ConvLSTM reached a stable loss value by the 35th iteration, and its detection accuracy remained around 0.95 from the 25th iteration onward. Compared with the LSTM model, the loss value of AlexNet-LSTM stabilized by the 60th iteration, and its detection accuracy remained around 0.97 from the 50th iteration onward. By combining the video feature detection model with the audio feature detection model, the final multi-channel audio-video feature fusion model achieved a detection precision, recall, F1 value, and MAP of 97.85%, 96.98%, 95.61%, and 35.61%, respectively, on the video-audio dataset, with user satisfaction of 0.92 and expert satisfaction of 0.97. In summary, the multi-channel audio and video FE model built in this study performs well. In terms of technical advantages, the MCAVFF model can not only effectively capture the temporal dynamics in video but also further enhance the recognition of behavioral features through audio signals. This collaborative processing of audio and video features significantly improves the accuracy and robustness of abnormal behavior identification, especially in complex environments. The research method provides an effective solution for the development of real-time monitoring systems and helps improve the efficiency of ABD in commercial and public safety management. The innovation of this research lies in applying multi-channel feature fusion to ABD, which overcomes the limitations of previous single-modality analysis.
Through the fusion of audio and video features, the model improves the accuracy of recognizing abnormal behaviors in complex scenes. However, the study still has some limitations, including a lack of practical application data; the model also remains to be verified in different application environments and scenarios. Future studies can therefore validate the model on larger and more diverse datasets, and more complex audio and video feature fusion strategies can be explored that incorporate additional types of sensor data to further improve the accuracy of ABD.
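The audio-video fusion summarized above can be illustrated with a minimal late-fusion sketch. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the per-modality anomaly scores, the 0.6/0.4 weighting, and the 0.5 threshold are all hypothetical.

```python
def late_fusion(video_score: float, audio_score: float,
                w_video: float = 0.6, w_audio: float = 0.4) -> float:
    """Combine per-modality anomaly scores with a weighted average
    (weights are illustrative, not taken from the paper)."""
    return w_video * video_score + w_audio * audio_score

def detect_abnormal(video_scores, audio_scores, threshold: float = 0.5):
    """Flag a segment as abnormal when its fused score exceeds the threshold."""
    return [late_fusion(v, a) > threshold
            for v, a in zip(video_scores, audio_scores)]

# Three hypothetical segments: both modalities high, both low, mixed
flags = detect_abnormal([0.9, 0.2, 0.7], [0.8, 0.1, 0.3])
print(flags)  # → [True, False, True]
```

The third segment shows the motivation for fusion: a video score near the decision boundary is pushed over the threshold only when the audio channel contributes corroborating evidence.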

  1. Funding information: The research is supported by the “Internet +” marketing service electronic channel unified identity authentication and service supervision key technology research project (No. SGZJ0000KXJS1700321).

  2. Author contributions: Wei Zhang and Jianjun Yang: writing – original draft preparation; Ying Jiang and Yuling Chen: methodology, and Yifan Zhang: writing – review and editing. All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: Authors state no conflict of interest.

  4. Data availability statement: The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Received: 2024-09-14
Revised: 2024-12-18
Accepted: 2025-02-19
Published Online: 2025-06-09

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
