Article · Open Access

Indoor environment monitoring method based on the fusion of audio recognition and video patrol features

  • Wei Zhang, Jianjun Yang, Ying Jiang, Yuling Chen and Yifan Zhang
Published/Copyright: 9 June 2025

Abstract

To accurately detect abnormal behavior of business hall employees in indoor working environments and ensure the normal operation of the store, deep learning techniques are combined to build an audio and video feature extraction network. First, separate video and audio feature extraction models are built using a convolutional neural network and a long short-term memory network; an abnormal behavior detection model based on audio and video feature fusion is then built using multi-channel modules. The model divides audio and video data into multiple channels and performs feature extraction and recognition on each. To assess the performance and robustness of the proposed model, experiments were conducted on multiple datasets. Compared with several existing neural network models, the detection accuracy, recall, F1 value, and mean average precision of the proposed model on the video-audio dataset are 97.85, 96.98, 95.61, and 35.61%, respectively. The method fuses video and audio data effectively, improving the accuracy and robustness of abnormal behavior recognition. In addition, user and expert satisfaction with the model reached 0.92 and 0.97, respectively. The novelty of this study lies in the comprehensive analysis of audio and video data through multi-channel processing: by simultaneously extracting and fusing video frames and audio signals, the model significantly improves the detection of abnormal behaviors in complex indoor scenes.

1 Introduction

In recent years, with the advancement of computer technology, abnormal behavior detection (ABD) has become an important research direction. ABD refers to the identification and classification of abnormal behaviors in specific scenarios to help people better understand and respond to various security issues [1]. In anomaly recognition, many experts have conducted a series of studies using deep learning (DL) techniques [2]. However, current methods often focus only on identifying abnormal behaviors in specific scenarios while ignoring information from other parts of the video. In addition, traditional feature extraction (FE) models often focus on a single type of feature, making them prone to significant extraction errors. The continuous advancement of DL technology has given it strong processing capabilities in image recognition; convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are two widely used DL architectures [3,4]. The motivation for this research stems from the need for ABD in indoor work environments, especially in high-traffic public places such as power business halls, where timely detection of and response to abnormal employee behavior is crucial for maintaining safety and improving operational efficiency. Despite the rapid development of DL technology, most existing ABD methods are limited to a single data source, resulting in insufficient accuracy of information extraction. Against this background, this article proposes an ABD model based on multichannel audio and video feature fusion (MCAVFF), aiming to improve the recognition of abnormal behavior. The novelty of the research lies in the proposed ABD model, which combines CNN and LSTM and effectively integrates audio and video information through multichannel feature extraction to improve the accuracy and robustness of abnormal behavior recognition.
The research structure is divided into four parts. The first part reviews the research progress in related fields, and focuses on the existing ABD techniques and their limitations. The second part describes the feature extraction methods of the MCAVFF model, including CNN-Convolutional LSTM (ConvLSTM) structure for video feature extraction and AlexNet-LSTM structure for audio feature extraction. The third part verifies the validity of the model through experiments and analyzes the experimental results compared with the existing methods. The fourth part summarizes the research results and puts forward the prospect of the future research direction.

2 Related works

Currently, many experts have applied various neural networks to face recognition, anomaly detection, and audio and video FE, and have achieved certain research results [5]. Singh et al. proposed a new automatic monitoring method for stampede incidents in crowded scenes to avoid pedestrian casualties. In the proposed method, a new set concept was used to detect anomalies in crowded-scene video data, while ConvNets served as the feature extractor. The research findings indicated that the model had good feature recognition accuracy [6]. Zhang et al. designed a valve stiction detection method using CNN. The method fully considered the common characteristics of industrial time series signals and automatically developed multiple strategies for learning time-scale features. Compared with traditional FE methods, it could automatically learn time series features collected from industrial control loops, thereby improving the accuracy and efficiency of feature detection. The findings denoted that the method required no manual FE and had high accuracy in detecting valve stiction features [7]. Wang and Liu put forward an improved target detection algorithm based on a fast region-based CNN to address the errors of traditional CNNs in target detection. In addition, an improved interpolation algorithm was proposed for the bounding box localization stage, replacing bilinear interpolation with Newton parabolic interpolation to reduce noise interference in target detection. The experimental outcomes indicated that the method could effectively extract target features and complete the target detection task [8]. Kolli and Tatavarthi constructed a fraud detection model using an optimized deep recursive neural network to reduce fraudulent transactions in e-commerce.
The fraud detection model not only effectively detected fraudulent activities but also recorded the characteristics of fraudulent data, thereby helping to prevent electronic transaction fraud [9].

With the continuous growth of DL technology, it has been widely studied in audio and video FE. Zhang et al. proposed a new video copy detection method using a deep CNN for FE. This method could not only use the CNN to encode video image content but also retain the recognition ability of neural networks, ensuring that the deep CNN fully extracts video features. The experimental outcomes denoted that the video FE method had high accuracy [10]. Aiming at the low detection accuracy and efficiency of traditional double-compressed adaptive multi-rate audio recording detection, Büker and Hanili used a CNN to extract features and a support vector machine to model the extracted audio features. The research results showed that the CNN could extract different audio feature signals very well. In addition, double compression severely affects high-frequency audio, resulting in low accuracy of audio FE in high-frequency regions [11].

To sum up, a series of DL technologies, including CNNs and recurrent neural networks, have achieved certain research results in image recognition, face detection, and anomaly monitoring. Many experts have applied neural networks to recognition tasks to improve recognition accuracy and efficiency. However, research on monitoring fixed indoor environments remains relatively scarce, and there is no dedicated indoor environment monitoring method. This research takes a business hall as its object and extracts multimedia video and audio features using DL technology to identify abnormal staff behavior in the indoor environment.

3 Abnormal behavior monitoring of indoor environmental workers based on MCAVFF

This research aims at the power business hall in the indoor environment. First, CNN and LSTM networks are utilized to extract the video features taken by the high-definition camera in the business hall, and a video feature detection model is constructed. Then, the audio features in the multimedia are extracted, and an audio feature detection model is built. Finally, by combining the audio and video models, an MCAVFF abnormal behavior monitoring model is constructed, aiming to constantly monitor the abnormal behavior of the personnel in the store.

3.1 Abnormal behavior video feature detection based on CNN-ConvLSTM network structure

CNN is a commonly used DL method in image recognition. A CNN extracts feature information by processing images through convolutional, pooling, and fully connected layers. AlexNet is an optimized neural network based on CNN, mainly used in image recognition, and consists of three different layer types, namely, convolutional, pooling, and fully connected layers [12]. Each level uses a different learning rate and loss function to learn image features for classification or recognition. This study utilizes the CNN-based AlexNet for video FE; the FE process is shown in Figure 1.

Figure 1
Flow chart of video FE with AlexNet.

Figure 1 is the flowchart of video FE using AlexNet. Traditional FE networks take the raw video frames directly as input. However, since using the difference of adjacent images as input improves the FE effect, this study first performs subtraction (differencing) on adjacent input images and then applies AlexNet for FE. AlexNet comprises convolutional, pooling, and fully connected layers. To improve training speed, a ReLU activation is added after each convolutional layer, which increases the sparsity of the network. Local response normalization is applied after each pooling layer to avoid information loss during extraction.
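The differencing step described above can be sketched as follows. This is a minimal illustration with an illustrative function name, not the paper's implementation; it simply subtracts each frame from the next and keeps the absolute difference that AlexNet would then consume.

```python
import numpy as np

def frame_difference(frames):
    """Absolute difference of each pair of adjacent video frames.

    frames: array of shape (T, H, W) holding T consecutive grayscale frames.
    Returns (T-1, H, W) difference images that emphasize motion, which are
    fed into the FE network instead of the raw frames.
    """
    frames = frames.astype(np.float32)
    return np.abs(frames[1:] - frames[:-1])

# toy example: three 2x2 frames with a moving bright pixel
clip = np.array([[[0, 0], [0, 0]],
                 [[10, 0], [0, 0]],
                 [[10, 10], [0, 0]]])
diff = frame_difference(clip)
```

Static background cancels out in the difference images, so only the moving regions carry energy into the network.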

Due to the temporal nature of video features, using AlexNet alone can only extract action features at each moment, without continuity. This study therefore builds a temporal signal model with LSTM on top of AlexNet to extract features with temporal variation. Unlike a traditional CNN, an LSTM can observe and record changes across consecutive video frames and can handle samples of different temporal lengths. To enable the LSTM to characterize local features through convolution like a CNN, this study replaces the fully connected operations in the LSTM with convolutions, yielding a ConvLSTM network as the final temporal model. The calculation of the ConvLSTM input gate is shown in Eq. (1) [13].

(1) $i_t = \sigma(W_i \ast [h_{t-1}, x_t] + b_i)$,

where $i_t$ expresses the input gate at time $t$; $W_i$ indicates the input-gate weight; $\ast$ denotes the convolution operation; $\sigma$ means the sigmoid function; $b_i$ denotes the threshold of the input gate; $h_{t-1}$ represents the output of the previous layer of the network at time $t-1$; and $x_t$ denotes the input of the local network at time $t$.

(2) $f_t = \sigma(W_f \ast [h_{t-1}, x_t] + b_f)$,

where $f_t$ expresses the forget gate at time $t$; $W_f$ stands for the forget-gate weight; and $b_f$ indicates the threshold of the forget gate.

(3) $O_t = \sigma(W_O \ast [h_{t-1}, x_t] + b_O)$,

where $O_t$ indicates the output gate at time $t$; $W_O$ refers to the output-gate weight; and $b_O$ stands for the threshold of the output gate.

(4) $\tilde{C}_t = \tanh(W_C \ast [h_{t-1}, x_t] + b_C)$,

where $\tilde{C}_t$ stands for the candidate memory content at time $t$; $W_C$ refers to the weight of the memory unit; $b_C$ means the threshold of the memory unit; and $\tanh$ denotes the hyperbolic tangent function.

(5) $C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$,

where $\circ$ means Hadamard (element-wise) multiplication. Multiplying the corresponding elements in this way gives the update rule for the memory unit; $C_{t-1}$ expresses the memory unit at time $t-1$.

(6) $h_t = O_t \circ \tanh(C_t)$,

where $h_t$ means the output value of the local network at time $t$. Combining ConvLSTM with AlexNet yields the CNN-ConvLSTM abnormal behavior video feature detection network structure illustrated in Figure 2.

Figure 2
Structure of CNN-ConvLSTM anomalous behavior video feature detection network.

Figure 2 shows the structure diagram of the abnormal behavior video feature detection network constructed with the CNN-ConvLSTM network. In Figure 2, the network comprises four parts, namely, video frame differencing, CNN, LSTM, and fully connected layers. First, the video is divided into individual frames and each pair of adjacent frames is differenced; AlexNet then performs FE. After FE is completed, ConvLSTM models the temporal information, and finally a fully connected layer produces the output.

3.2 Research on end-to-end abnormal audio feature detection based on original waveform

In addition to processing the video captured by the HD camera in the business hall and extracting video frames for identification and detection, the research further identifies and monitors the audio information in the indoor environment, aiming to judge various behaviors through audio analysis and thus monitor staff dynamics in real time. Before analyzing a segment of indoor audio, signal preprocessing is required, mainly comprising routine operations such as pre-emphasis, framing, and windowing. In this study, the total duration of each audio clip is assumed to be 2 s, with a frame length of 20 ms. The pre-emphasis calculation is indicated in Eq. (7) [14].

(7) $y(n) = x(n) - a\,x(n-1)$,

where $x(n)$ means the original audio signal sequence; $n$ indicates the speech sampling time; $y(n)$ expresses the audio signal sequence after pre-emphasis; $a$ is the pre-emphasis coefficient; and $x(n-1)$ refers to the audio signal value at time $n-1$.

Due to the nonlinear nature of the staff’s audio signals collected in the indoor business hall, it is necessary to perform framing processing on the audio signals to ensure their integrity is preserved during the extraction. This study adopts the overlapping segmentation method for frame segmentation, with a frame length of 20 ms and a frame shift of 10 ms. After framing the audio signal, windowing is also necessary to prevent signal distortion during the processing. Hamming and rectangular windows are two common window functions, and their expressions are denoted in Eqs. (8) and (9).

(8) $w(n) = \begin{cases} 1, & 0 \le n < N \\ 0, & \text{otherwise} \end{cases}$

Eq. (8) is the window function of a rectangular window, where $w(n)$ is the window function, $n$ is the sample index within the window, and $N$ is the window length (a constant).

(9) $w(n) = \begin{cases} 0.54 - 0.46 \cos\!\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n < N \\ 0, & \text{otherwise} \end{cases}$

Eq. (9) is the window function of the Hamming window, with $w(n)$, $n$, and $N$ defined as above. Because the rectangular window loses waveform detail in the high-frequency part, the Hamming window better preserves audio detail and keeps the waveform more complete. Therefore, the Hamming window function is selected for windowing the audio signal in this study.

Because the audio events in surveillance videos are mostly audio clips with inconsistent durations, there is a significant difference in audio features between different periods. To better extract audio features, the study utilizes Mel-frequency cepstral coefficients for FE. The Mel cepstrum coefficient can combine the auditory features perceived by the human ear with the speech generation mechanism, thereby extracting audio features that are more in line with the actual situation [15,16]. The FE of Mel cepstrum coefficients is shown in Figure 3.

Figure 3
Diagram of audio FE of Mel-frequency cepstral coefficients.

Figure 3 shows the audio FE diagram of the Mel cepstrum coefficients. A segment of original audio is first preprocessed to remove redundant interference signals, and a discrete Fourier transform is then performed to obtain the spectrum. The spectrum is squared to obtain the energy spectrum, which is filtered by M Mel band-pass filters to obtain the output power spectrum. Finally, an inverse discrete Fourier transform yields the static characteristics of the Mel cepstrum coefficients, and the dynamic characteristics are obtained by differencing the static characteristics. Linear predictive coding can calculate different sound characteristics according to an ideal acoustic model; combining linear predictive coding with Mel cepstrum coefficients can further determine the source and type of sound in the video. The calculation of the audio prediction error is indicated in Eqs. (10) and (11).

(10) $\tilde{x}(n) = \sum_{i=1}^{p} a_i\, x(n-i)$,

where $p$ refers to the linear prediction order; $x(n)$ and $x(n-i)$ stand for the current sound signal and the signal $i$ samples earlier, respectively; and $a_i$ indicates the $i$-th linear prediction coefficient.

(11) $e(n) = x(n) - \tilde{x}(n)$,

where $\tilde{x}(n)$ stands for the predicted sound signal and $e(n)$ expresses the prediction error.

When Mel cepstrum coefficients are used to extract audio features, a series of transformations is required, so the coefficients only vary within a limited frequency range and cannot capture all features of the audio. Moreover, it is difficult to ensure the accuracy and stability of the feature parameters during real-time audio processing and analysis. Therefore, a DL network is further used to build an audio feature detection model, as shown in Figure 4.

Figure 4
Structure of the original audio FE network.

Figure 4 denotes the structure diagram of the original audio FE network constructed using AlexNet and LSTM. As shown in Figure 4, a complete multimedia audio segment first undergoes framing, and the waveform of each frame is then fed into AlexNet for audio FE. After AlexNet processes the audio signal, the features are passed to the LSTM network to construct a temporal model. The output of the last LSTM unit carries the most effective features of the entire audio segment and is connected to the fully connected layer to complete the final audio feature classification, facilitating subsequent system recognition.
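The temporal modeling step, in which per-frame features are run through an LSTM and the last unit's state summarizes the clip, can be sketched in plain numpy. Feature and hidden dimensions, the stacked-weight layout, and the random initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(feats, W, b, hidden=8):
    """Run a plain LSTM over per-frame features and return the final
    hidden state, used here as the clip-level audio descriptor.

    feats: (T, D) sequence of per-frame features (e.g. from AlexNet).
    W: (4*hidden, hidden + D) stacked gate weights; b: (4*hidden,) biases.
    """
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in feats:
        z = W @ np.concatenate([h, x]) + b
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # memory update
        h = o * np.tanh(c)                   # hidden state
    return h

rng = np.random.default_rng(3)
D, hidden = 16, 8
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + D))
b = np.zeros(4 * hidden)
summary = lstm_last_state(rng.normal(size=(30, D)), W, b, hidden)
```

The final hidden state is a fixed-size vector regardless of clip length, which is what allows a fully connected classifier to follow it.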

3.3 Abnormal behavior monitoring based on MCAVFF

Multichannel feature fusion refers to fusing feature information from multiple channels, such as the frequency-domain, time-domain, or spatial parts of an audio signal, to generate a comprehensive feature vector representing the signal. It can be used in multiple audio processing tasks such as speech recognition, audio classification, and speech synthesis. To facilitate the observation of abnormal staff behavior in the indoor environment, this research combines the video and audio detection models built above and uses multichannel feature fusion to reduce the inherent defects of single-channel feature recognition, further improving recognition and detection accuracy. For a complete piece of multimedia information, shot segmentation is first required to separate the audio content from the video content, ensuring that each video sequence contains both visual and auditory information. The shot segmentation process is expressed in Figure 5.

Figure 5
Flow chart of shot segmentation.

Figure 5 shows the specific process of shot segmentation, where a complete video is first divided into multiple segments. The visual and audio information in these segments is then extracted for feature processing. In the above research, the AlexNet-ConvLSTM model was constructed for processing video features, while the AlexNet-LSTM model was constructed for processing audio features. To further simplify the FE task, the study uses the multichannel principle to fuse the features of the two models and then sends them together into the LSTM network for FE. The resulting indoor environment monitoring model based on the DL network is denoted in Figure 6.
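At its core, the fusion step reduces to aligning the two per-timestep feature sequences and concatenating them into one multichannel sequence before the shared LSTM. A minimal sketch, assuming illustrative feature sizes and names:

```python
import numpy as np

def fuse_channels(video_feats, audio_feats):
    """Concatenate per-timestep video and audio features into one
    multichannel sequence for the shared temporal model.

    video_feats: (T, Dv) per-frame video features (e.g. AlexNet-ConvLSTM)
    audio_feats: (T, Da) per-frame audio features (e.g. AlexNet-LSTM)
    """
    assert video_feats.shape[0] == audio_feats.shape[0], "align timesteps first"
    return np.concatenate([video_feats, audio_feats], axis=1)

# 50 timesteps of 256-dim video features and 128-dim audio features
fused = fuse_channels(np.ones((50, 256)), np.zeros((50, 128)))
```

Concatenation keeps both modalities available to every LSTM step, letting the temporal model weigh visual and acoustic evidence jointly rather than deciding per channel.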

Figure 6
Indoor environment monitoring model with DL network.

Figure 6 shows the indoor environment monitoring model finally built with the DL network, which is composed of two channels: a video processing channel and an audio processing channel. In the video channel, AlexNet is used to extract differential information between adjacent video frames and identify their differential features. In the audio channel, AlexNet processes the raw waveform features of each frame of the audio signal, and the adjacent features at each time step are fused and fed into the LSTM model for temporal modeling. In summary, the final indoor environment monitoring model can assist managers in monitoring the working status of business hall staff and identifying abnormal behaviors. When dealing with audio events whose durations do not match the video, the study adopts a separation-and-association approach: shot segmentation divides the video into independent segments so that the audio and video content of each segment can be analyzed separately. During audio processing, the integrity of the audio signal is ensured through signal processing techniques such as pre-emphasis, framing, and Hamming windowing. In feature extraction, combining Mel cepstrum coefficients with other methods extracts both the time-domain and frequency-domain features of the audio signal and enhances the recognition of irregular audio events.

4 Monitoring results of abnormal behaviors of indoor environmental workers based on MCAVFF

To further test the performance of the indoor environment monitoring model built, the research first analyzed the detection results of abnormal behavior video and audio features, and proved that its performance was far better than other models. Subsequently, the research applied the final designed MCAVFF model to real life and tested its recognition ability in practical applications.

4.1 Analysis of abnormal behavior video feature detection results

To test the effectiveness of AlexNet-ConvLSTM, the raised network model was compared with traditional video feature detection models. The self-made video action dataset was selected as the validation dataset for this experiment. First, the loss values of different models were tested as a function of iteration times, as shown in Figure 7.

Figure 7
Variation in loss values of different video feature models with the number of iterations. (a) Variation in loss curve of AlexNet-ConvLSTM network. (b) Variation in loss curve of CNN network.

Figure 7(a) and (b) show the variation in loss values with the number of iterations for the AlexNet-ConvLSTM and traditional CNN models, respectively. As shown in Figure 7(a), as the number of iterations increased, the training and validation losses of the AlexNet-ConvLSTM model decreased continuously. When the number of iterations reached 35, the model began to stabilize. As shown in Figure 7(b), the training and validation losses of the traditional CNN model also decreased continuously, but both fluctuated significantly. When the number of iterations reached 45, the CNN model began to stabilize. The results in Figure 7 show that ConvLSTM, by introducing the LSTM's temporal modeling capability, captures dynamic information in video data more effectively, thereby improving the learning efficiency and effectiveness of the model.

Figure 8(a) and (b) show the variation in detection accuracy with the number of iterations for the AlexNet-ConvLSTM and traditional CNN models, respectively. As shown in Figure 8(a), as the number of iterations increased, the training and validation accuracy of the AlexNet-ConvLSTM model rose continuously. When the number of iterations reached 25, the model began to stabilize, with a detection accuracy of around 0.95. From Figure 8(b), the training and validation accuracy of the traditional CNN model fluctuated constantly, and the range of variation in training accuracy was much greater than that of the validation accuracy. When the number of iterations was 15, the detection accuracy of the CNN model stabilized at around 0.80. The results in Figure 8 highlight the effectiveness of the model in processing complex video data, indicating that it can enable efficient abnormal behavior monitoring in practical applications.

Figure 8
Variation in test accuracy of different video feature models with the number of iterations. (a) Variation in detection accuracy of AlexNet-ConvLSTM network. (b) Variation in detection accuracy of CNN network.

4.2 Analysis of abnormal audio feature detection results

To test the performance of the audio FE model AlexNet-LSTM, the proposed network model was compared with traditional audio feature detection models. A self-made voice dataset was selected as the validation dataset for this experiment. First, the loss values of the different models were tested as a function of the number of iterations, as denoted in Figure 9.

Figure 9
Variation in loss values with the iteration times for different audio feature models. (a) Variation in loss curve of AlexNet-LSTM network. (b) Variation in loss curve of LSTM network.

Figure 9(a) and (b) show the variation in loss values with the number of iterations for the AlexNet-LSTM and traditional LSTM models, respectively. As shown in Figure 9(a), as the number of iterations increased, the training and validation losses of the audio FE model AlexNet-LSTM decreased continuously. When the number of iterations reached 60, the model began to stabilize. As shown in Figure 9(b), the training and validation losses of the traditional LSTM model also decreased continuously; however, the training loss varied significantly and coincided less often with the validation loss. When the number of iterations reached 80, the LSTM model tended to stabilize.

Figure 10(a) and (b) show the variation in detection accuracy with the number of iterations for the AlexNet-LSTM and traditional LSTM models, respectively. As shown in Figure 10(a), as the number of iterations increased, the training and validation accuracy of the AlexNet-LSTM model on the audio dataset rose continuously. When the number of iterations reached 50, the model began to stabilize, with a detection accuracy of around 0.97. From Figure 10(b), the training and validation accuracy of the traditional LSTM model fluctuated significantly. When the number of iterations was 60, the detection accuracy of the LSTM model stabilized at around 0.90. The analysis of Figures 9 and 10 further shows that the convergence and detection accuracy of the AlexNet-LSTM model are significantly better than those of the traditional LSTM model. This result not only highlights the importance of deep learning in audio signal processing but also the potential of such models for real-time monitoring and emergency response, providing better analytical tools for related fields.

Figure 10
Variation in test accuracy of different audio feature models with the number of iterations. (a) Variation in detection accuracy of AlexNet-LSTM network. (b) Variation in detection accuracy of LSTM network.

4.3 Analysis of detection results of the MCAVFF model

The above research proved that the video and audio feature detection models, AlexNet-ConvLSTM and AlexNet-LSTM, performed well in both video and audio FE. To further demonstrate the excellent FE ability of MCAVFF models, a comparative study was conducted on their detection results.

Table 1 shows the detection accuracy, recall rate, F1 value, and mean average precision (MAP) values of CNN, LSTM, and multichannel audio video feature fusion models in a single video dataset, a single sound dataset, and a video sound mixed dataset, respectively. From Table 1, the detection accuracy, recall, F1 value, and MAP value of the proposed model in the video-audio dataset were 97.85, 96.98, 95.61, and 35.61%, respectively. In summary, the MCAVFF model performed better than traditional CNN and LSTM models under various indicators.

Table 1

Performance test results of different models

Models         Dataset      Precision (%)  Recall rate (%)  F1 value (%)  MAP value (%)
CNN            Video        88.26          87.46            88.65          8.56
               Audio        81.26          81.17            82.12         10.69
               Video-audio  84.68          86.84            85.25         11.82
LSTM           Video        83.65          82.14            81.03          9.54
               Audio        88.06          87.45            88.69         12.78
               Video-audio  86.32          86.55            86.94         13.54
MCAVFF model   Video        92.19          93.06            93.26         23.58
               Audio        93.18          94.24            94.08         22.12
               Video-audio  97.85          96.98            95.61         35.61
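The first three columns of Table 1 follow the standard definitions of precision, recall, and F1. A minimal sketch for a binary normal/abnormal labelling, with an illustrative function name:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = abnormal),
    matching the first three metric columns of Table 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy labelling: 3 abnormal and 2 normal samples
p, r, f = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

F1 is the harmonic mean of precision and recall, so a model must balance false alarms against missed detections to score well on all three columns at once.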

Figure 11 shows the satisfaction of users and experts with various FE models. Figure 11(a) shows the satisfaction of users with various FE models. From Figure 11(a), the user satisfaction levels for CNN, LSTM, and multi-channel fusion models were 0.76, 0.78, and 0.92, respectively. Figure 11(b) shows the satisfaction of experts with various FE models. From Figure 11(b), experts’ satisfaction with the three models was 0.72, 0.82, and 0.97, respectively. The positive feedback from users and experts further verified the performance and practicability of the model, indicating that it effectively met the needs of practical applications and provided reliable technical support for indoor environmental monitoring and safety management.

Figure 11

User and expert satisfaction with various FE models. (a) User satisfaction with various feature extraction models. (b) Expert satisfaction with various feature extraction models.

5 Conclusion

To better monitor the abnormal behavior of business hall staff in the indoor environment, this study used CNN and LSTM networks to build a series of FE networks and applied them to the identification of abnormal staff behavior. The results indicated that, compared with the CNN model, AlexNet-ConvLSTM reached a stable loss value by the 35th iteration, and its detection accuracy remained around 0.95 from the 25th iteration onward. Compared with the LSTM model, the loss value of AlexNet-LSTM stabilized by the 60th iteration, and its detection accuracy remained around 0.97 from the 50th iteration onward. By combining the video feature detection model with the audio feature detection model, the final multi-channel audio-video feature fusion model achieved a detection precision, recall, F1 value, and MAP of 97.85%, 96.98%, 95.61%, and 35.61%, respectively, on the video-audio dataset, with user satisfaction of 0.92 and expert satisfaction of 0.97. In summary, the multi-channel audio and video FE model built in this study performs well. In terms of technical advantages, the MCAVFF model can not only effectively capture the temporal dynamics in video but also further enhance the recognition of behavioral features through audio signals. This collaborative processing of audio and video features significantly improves the accuracy and robustness of abnormal behavior identification, especially in complex environments. The research method provides an effective solution for the development of real-time monitoring systems and helps improve the efficiency of ABD in commercial and public safety management. The innovation of this research lies in applying multi-channel feature fusion to ABD, which overcomes the limitations of previous single-modality analysis.
Through the fusion of audio and video features, the model improves the accuracy of recognizing abnormal behaviors in complex scenes. However, the study still has some limitations, including a lack of practical application data; the model also remains to be verified in different application environments and scenarios. Future studies can therefore validate the model on larger and more diverse datasets, and more complex audio and video feature fusion strategies can be explored that incorporate additional types of sensor data to further improve the accuracy of ABD.
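The audio-video fusion summarized above can be illustrated with a minimal late-fusion sketch. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the per-modality anomaly scores, the 0.6/0.4 weighting, and the 0.5 threshold are all hypothetical.

```python
def late_fusion(video_score: float, audio_score: float,
                w_video: float = 0.6, w_audio: float = 0.4) -> float:
    """Combine per-modality anomaly scores with a weighted average
    (weights are illustrative, not taken from the paper)."""
    return w_video * video_score + w_audio * audio_score

def detect_abnormal(video_scores, audio_scores, threshold: float = 0.5):
    """Flag a segment as abnormal when its fused score exceeds the threshold."""
    return [late_fusion(v, a) > threshold
            for v, a in zip(video_scores, audio_scores)]

# Three hypothetical segments: both modalities high, both low, mixed
flags = detect_abnormal([0.9, 0.2, 0.7], [0.8, 0.1, 0.3])
print(flags)  # → [True, False, True]
```

The third segment shows the motivation for fusion: a video score near the decision boundary is pushed over the threshold only when the audio channel contributes corroborating evidence.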

  1. Funding information: The research is supported by the “Internet +” marketing service electronic channel unified identity authentication and service supervision key technology research project (No. SGZJ0000KXJS1700321).

  2. Author contributions: Wei Zhang and Jianjun Yang: writing – original draft preparation; Ying Jiang and Yuling Chen: methodology, and Yifan Zhang: writing – review and editing. All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: Authors state no conflict of interest.

  4. Data availability statement: The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Received: 2024-09-14
Revised: 2024-12-18
Accepted: 2025-02-19
Published Online: 2025-06-09

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
