Article Open Access

Accurate and real-time object detection in crowded indoor spaces based on the fusion of DBSCAN algorithm and improved YOLOv4-tiny network

Jianing Shen and Yang Zhou
Published/Copyright: July 10, 2023

Abstract

Real-time object detection is an integral part of internet of things (IoT) applications and an important research field of computer vision. Existing lightweight algorithms cannot handle target occlusion well in detection tasks within narrow indoor scenes, resulting in a large number of missed detections and misclassifications. To this end, an accurate real-time multi-scale detection method that integrates the density-based spatial clustering of applications with noise (DBSCAN) algorithm and an improved You Only Look Once (YOLO)v4-tiny network is proposed. First, the neck network of the YOLOv4-tiny model is improved so that the detailed information of the shallow network boosts the average precision of the model in identifying dense small objects, and the Cross mini-Batch Normalization strategy is adopted to improve the accuracy of statistical information. Second, the DBSCAN clustering algorithm is fused with the modified network to achieve better clustering effects. Finally, the Mosaic data enrichment technique is adopted during model training to improve the capability of the model to recognize occluded targets. Experimental results show that, compared with the original YOLOv4-tiny algorithm, the mAP values of the improved algorithm on the self-constructed dataset are significantly improved, and the processing speed meets the requirements of real-time applications on embedded devices. The performance of the proposed model on the public datasets PASCAL VOC07 and PASCAL VOC12 is also better than that of other advanced lightweight algorithms, and the detection ability for occluded objects is significantly improved, meeting the requirements of mobile terminals for real-time detection in crowded indoor environments.

1 Introduction

Dense object detection is a special application task of computer vision [1]. It has wide application value in fields such as the control and identification of suspicious objects [2], intelligent collection of pedestrian traffic statistics [3], and abnormal behavior detection [4]. A crowded indoor space is a single-background scene that is less influenced by factors such as illumination, angle, and internal structure. The difficulty lies in the fact that, when faced with mutual occlusion between multiple objects and incomplete camera framing, traditional lightweight algorithms produce a large number of missed detections and false alarms. It is therefore of great research significance to develop a lightweight detection algorithm that can accurately and quickly detect occluded small targets within crowded indoor spaces in the internet of things (IoT) environment [5].

Taking elevator car detection as an example, in the field of elevator safety management, the information provided by elevator monitoring video is often used as the basis for judging the situation in the car and taking corresponding measures. Current elevators generally have video acquisition devices on top of the car, but these only provide real-time observation, video storage, and evidence collection, without functions such as timely warning and real-time monitoring. Although elevator managers can observe the video images and take timely measures after discovering forbidden objects, they cannot watch the surveillance video at all times, which may cause omissions in work. It is therefore necessary to compensate for the shortage of human monitoring with image acquisition equipment, processors, and target detection algorithms, that is, to design an intelligent video monitoring system that can detect prohibited objects in elevators around the clock and realize the detection of prohibited target intrusions based on elevator monitoring video [6]. By alerting management personnel in a timely manner, the workload of relevant personnel is reduced and elevator riding safety is guaranteed.

Traditional object detection methods rely on manual feature extraction and complete detection tasks through classifiers such as the support vector machine (SVM) [7]. The recognition process of such methods is complex and highly subjective, and it is not sensitive to occluded salient regions, which leads to poor performance in terms of detection accuracy, processing speed, and robustness. Recently, with the continuous progress of convolutional neural networks (CNNs), the concept of an end-to-end framework has been applied to target detection algorithms, such as the single shot multibox detector (SSD) [8] and the You Only Look Once (YOLO) series [9]. These algorithms combine the classification process and the regression network into a single stage, which significantly improves the trade-off between detection accuracy and speed and is suitable for deployment on mobile terminals for target detection in crowded indoor spaces.

Existing algorithms such as YOLOv4-tiny network [10] achieve good detection performance in dense scenes, but they also have the following shortcomings:

  1. The backbone network is too lightweight, and the contour features of the feature maps evolve insufficiently during the layer-by-layer transfer process, so the network cannot effectively learn enough features of occluded targets during training;

  2. The neck network structure is too simple, with low efficiency when fusing feature maps of different sizes, and it is prone to lose detailed edge information;

  3. The traditional K-means clustering has limitations in the post-processing stage, which can easily lead to missed detection.

The main contributions of the proposed method are listed as follows:

  1. An adaptive multi-scale detection algorithm is proposed, taking crowded indoor spaces such as elevator cars, bus cars, and passenger aircraft cabins as the main research scenarios. Based on the YOLOv4-tiny model, which is suitable for embedded hardware platforms, the neck network of the model is improved to boost the recognition accuracy for small targets. The Cross mini-Batch Normalization (CmBN) technique replaces the batch normalization (BN) technique of the original YOLOv4-tiny model to ensure the estimation accuracy of statistical information and reduce estimation errors. The Mosaic data enrichment strategy is introduced in the training phase to improve the utilization of feature maps of different scales and alleviate the loss of edge information;

  2. An improved clustering method combining DBSCAN and K-means algorithm is proposed. First, initial clustering is performed by DBSCAN clustering algorithm, and then multi-scale clustering is added to further obtain the local and overall information of the targets. After multi-scale clustering and convolution operations, the initial feature map is obtained. Afterward, the accurate center point position is derived with the K-means algorithm, which effectively accelerates the convergence and improves the classification accuracy for dense small targets. As a result, the classification effect of objects in different datasets and complex environments is improved, and the resistance to noise and interference is also improved.

The remainder of this article is organized as follows. Section 2 introduces the related research, and Section 3 explains the background knowledge of YOLOv4-tiny. Section 4 describes the proposed method specifically, including the improved YOLOv4-tiny network structure and the enhanced clustering algorithm for bounding box determination. In Section 5, the experimental results and discussion are presented. Finally, Section 6 summarizes the full text.

2 Related research

Before the rise of the deep learning (DL) methodology, traditional object recognition techniques mainly utilized the more obvious features such as image corners, textures, and contours, using feature descriptors such as the histogram of oriented gradients [11], scale-invariant feature transform [12], speeded-up robust features [13], Haar-like features [14], and local binary patterns [15]; the recognition and classification of the extracted object features were then performed with template matching [16], boosting [17], SVM, and other methods. However, traditional recognition algorithms have shortcomings such as relying on manual experience for feature selection and poor robustness in complex application scenarios.

As hardware equipment and DL technology evolved, DL-based frameworks began to emerge in the field of object detection. Classic DL models include CNNs, deep belief networks, deep residual networks, and autoencoders (AE) [18]. LeCun et al. [19] first applied artificial neural networks to recognize handwritten characters, and the LeNet [20] proposed later marked the beginning of research on deep convolutional networks in the area of object recognition. Later, the AlexNet proposed by Krizhevsky et al. [21] achieved the best result of an 11% false recognition rate in the visual object classes (VOC) challenge, which greatly improved the practicability of neural networks in target recognition. The VGGNet [22] proposed by Oxford University further deepened the neural network and extracted deeper abstract features of the image. GoogLeNet [23], developed by Google, improved the computational efficiency of neural networks, and new versions including Inception-v2, Inception-v3, Inception-v4, and Xception [24] were subsequently proposed.

The ResNet [25] developed by He et al. from Microsoft Research is a 152-layer deep neural network. The model won the ImageNet Large Scale Visual Recognition Challenge 2015 competition with a significantly lower error rate and fewer parameters than the VGGNet. In addition, there are also studies on shallow networks and unsupervised DL models. For example, Chen et al. [26] proposed using an unsupervised sparse AE network to learn from randomly sampled image patches and finally training a softmax classifier to classify the objects. In 2014, Girshick et al. [27] proposed the region-based CNN (R-CNN), in which a region-selective search algorithm obtains 2,000 candidate bounding boxes before the convolution operations are performed, which reduces redundant convolution operations and realizes multi-target recognition for a single image. Since then, the CNN has gained considerable attention in the area of object recognition.

On the basis of these classic networks, by improving the way candidate regions are obtained and by modifying the network structure, two types of object detection methods have been developed: the candidate region-based (also known as two-stage) algorithms and the regression-based (also known as one-stage) algorithms. In general, the candidate region-based algorithms have better recognition performance, while the regression-based algorithms have faster processing speed [28].

The popular candidate region-based object detection models include SPP-Net [26], Fast R-CNN [29], Faster R-CNN [30], and Mask R-CNN [31]. These algorithms mainly involve two tasks: candidate region acquisition and candidate region identification. SPP-Net reduces the computing load of the network through pyramid pooling layers. Fast R-CNN and Faster R-CNN improve the network in different aspects to increase the detection speed. The key improvement of Faster R-CNN is to use the region proposal network (RPN) instead of the selective search algorithm, which further improves the detection speed. The RPN is currently the most accurate candidate bounding box localization method, so Faster R-CNN has very high localization accuracy. On the basis of Faster R-CNN, researchers have proposed Mask R-CNN, RetinaNet [32], etc.

The regression-based (one-stage) object detection algorithms seek a compromise between recognition speed and recognition accuracy: the selection of candidate bounding boxes, feature extraction, object classification, and bounding box prediction are all regressed into a single network, and the location and category of the target are obtained directly at the output layer, so the recognition speed improves significantly. The regression-based object recognition algorithms mainly include YOLO [9], SSD [8], and EfficientNet [33], which combine fast processing speed with good detection accuracy.

In 2016, Redmon et al. proposed the YOLO network [9], which no longer uses a region proposal strategy; instead, the entire input image is divided into multiple grids, and each grid is responsible for forecasting the objects whose centers fall within it. End-to-end prediction is achieved by running a single CNN pass on an image, which significantly accelerates object recognition. YOLO also has some shortcomings: its positioning accuracy is lower than that of the candidate region-based algorithms, it cannot identify small targets well, and its generalization performance is poor. Liu et al. proposed SSD [8] at the European Conference on Computer Vision 2016, in which ideas from the RPN structure are used to improve on the YOLO network. By mapping objects at different scales through different convolutional layers, the detection accuracy for small targets is enhanced, while retaining the fast computing speed of YOLO and the high accuracy of Faster R-CNN. YOLO is superior to the SSD algorithm in terms of detection speed, but it suffers from missed detections for mutually occluded targets. In addition, researchers have developed simplified versions such as YOLOv3-tiny [34] and YOLOv4-tiny [10] based on the original YOLO network. These models are relatively simple and efficient with fewer parameters, significantly reducing storage and computing requirements. Among them, YOLOv4-tiny is significantly better than other lightweight models in terms of training time and detection speed, making it applicable to embedded devices.

3 YOLOv4-tiny

In the YOLO series, region classification proposals are integrated into a single neural network for the predictions of bounding boxes and classification probabilities, where the input image is partitioned into S × S grid cells and detection is performed in a single evaluation stage. To improve accuracy, the YOLOv2 model [35] introduces BN and direct position prediction strategies and replaces fully connected layers with convolutional layers to accelerate the training and detection processes. The YOLOv3 model [34] uses darknet-53 as the backbone, which utilizes 53 convolutional layers for feature extraction. The improved CSPdarknet-53 is used as the backbone in YOLOv4 [36], where feature extraction and connection are split into two parts using cross-stage partial (CSP) connections.

YOLOv4-tiny [10] is a lightweight version of the YOLOv4 model that uses CSPDarkNet53-tiny as its backbone network; the network structure of YOLOv4-tiny is shown in Figure 1. CSPDarkNet53-tiny uses three simplified CSP blocks with the residual modules removed from the CSP network. To further reduce computational complexity, the Mish activation function is replaced by the leaky rectified linear unit, and a feature pyramid network is used to extract two feature maps of different sizes for predicting the recognition results, reducing the number of model parameters and the computational load, thereby promoting application in embedded systems and mobile devices. However, due to the simpler network structure, the detection performance of YOLOv4-tiny is also significantly lower than that of the YOLOv4 network, especially for small targets.

Figure 1: Structure of YOLOv4-tiny.

4 Proposed method

Based on the original YOLOv4-tiny model, this study designs a dense object recognition model for indoor spaces. The YOLOv4-tiny model has a simplified structure and fast inference speed and is suitable for embedded hardware platforms. However, the recognition accuracy of the model for small targets and overlapping targets within dense scenes needs to be improved, and further optimization and improvement are required.

4.1 Improvement of the network structure

The backbone network of the YOLOv4-tiny model contains three CSP modules, namely CSP1, CSP2, and CSP3. Among them, the CSP2 layer includes precise location information and more detailed information, but less semantic information. The CSP3 layer includes a larger amount of semantic information but less detailed information; its location information is relatively rough, and the position and detail information of many small targets can be lost. To improve the recognition precision of the YOLOv4-tiny model for densely crowded targets, we propose an improved YOLOv4-tiny model, the detailed structure of which is shown in Figure 2. In the figure, Conv denotes the convolution operation, Leaky is the Leaky ReLU activation function, and Maxpool represents the max pooling operation. The route module performs the concatenation operation, and CSP1, CSP2, and CSP3 are CSP network modules. UP1 and UP2 represent upsampling 1 and upsampling 2 operations, respectively. The input image resolution is 416 × 416, and the dimensions of the output feature maps are 13 × 13, 26 × 26, and 52 × 52, respectively.

Figure 2: Structure of improved YOLOv4-tiny model, CBL module, and CSP module.

From Figure 2, it can be seen that the proposed model adds a path connected to the CSP2 layer of the backbone network in the neck network of the original YOLOv4-tiny model, expanding the detection scales from two to three. The outputs of the CSP2 layer and the upsampling layer (UP layer) are concatenated in the channel dimension, fusing the feature maps from two different layers; then, through two CBL modules, a feature map downsampled by eight times relative to the input resolution is obtained. For an input image with a resolution of 416 × 416 pixels, the improved YOLOv4-tiny model outputs feature maps at three scales with resolutions of 13 × 13, 26 × 26, and 52 × 52 pixels, respectively. Since occluded objects have small salient areas, the original YOLOv4-tiny network is prone to losing a lot of edge detail information. In the proposed model, a large-scale feature map is introduced into the improved network, which enables it to capture more detailed image information. By adding this third detection scale, the output feature maps are expanded from two sizes (13 × 13 and 26 × 26) to three (13 × 13, 26 × 26, and 52 × 52), thereby enhancing the learning capability of the shallow network and reducing the information loss of the shallow layers during training.
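To make the added detection path concrete, the following is a minimal PyTorch-style sketch of the third branch, assuming a 416 × 416 input and illustrative channel widths; the module and parameter names (ExtraScaleHead, neck_ch, csp2_ch) are hypothetical and do not come from the original implementation.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU block, the basic unit of YOLOv4-tiny."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )
    def forward(self, x):
        return self.block(x)

class ExtraScaleHead(nn.Module):
    """Third detection branch: upsample the 26x26 neck feature, concatenate
    it with the CSP2 backbone output, and predict on a 52x52 grid."""
    def __init__(self, neck_ch=128, csp2_ch=128, num_classes=3, boxes=3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")     # 26x26 -> 52x52
        self.cbl1 = CBL(neck_ch + csp2_ch, 128)
        self.cbl2 = CBL(128, 256)
        self.head = nn.Conv2d(256, boxes * (5 + num_classes), 1)  # b*(5+C) channels

    def forward(self, neck_feat, csp2_feat):
        x = torch.cat([self.up(neck_feat), csp2_feat], dim=1)     # channel concat
        return self.head(self.cbl2(self.cbl1(x)))

# Shape check: 24 = 3 boxes * (5 + 3 classes) channels on the 52x52 grid
out = ExtraScaleHead()(torch.zeros(1, 128, 26, 26), torch.zeros(1, 128, 52, 52))
assert out.shape == (1, 24, 52, 52)
```

The 1 × 1 head convolution outputs $b \times (5 + C)$ channels per grid cell, matching the prediction format described in Section 4.3.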

4.2 CmBN strategy

In object detection tasks, due to the limited memory capacity of graphics cards, the BN strategy estimates statistical information inaccurately, which can easily increase model errors. This drawback is more pronounced in resource-constrained embedded devices and can significantly degrade model performance. To this end, the CmBN strategy is used to replace the BN strategy of the original YOLOv4-tiny model.

During model training, each batch of image data in the training set is evenly divided into several mini-batches and passed to the model for forward propagation. The weights of the model do not change until a single iteration is completed, so the statistics of different mini-batches in the same batch can be directly accumulated. The execution process of the CmBN strategy is as follows:

  1. When calculating the mean and standard deviation of the ith mini-batch, the output information of the convolutional layer for the (i − 1)th mini-batch is combined;

  2. The output of the convolution layer of the ith mini-batch of data is normalized to a mean of 0 and a variance of 1;

  3. Using learnable parameters, linearly transform the normalized output of the convolutional layers of the ith mini-batch to enhance its expressive ability.

The forward pass calculation process of the CmBN strategy is as follows:

(1) $\bar{\varsigma}_j^l = \dfrac{1}{j}\sum_{\tau=1}^{j}\bar{\varsigma}_{j-\tau}^{l} + \dfrac{1}{m}\sum_{i=1}^{m}x_{j,i}^{l}$,

(2) $\bar{v}_j^l = \dfrac{1}{t}\sum_{\tau=1}^{j-1}\left(\bar{\varsigma}_{j-\tau}^{l}\right)^2 + \dfrac{1}{m}\sum_{i=1}^{m}\left(x_{j,i}^{l}\right)^2$,

(3) $\bar{\varphi}_j^l = \sqrt{\bar{v}_j^l - \left(\bar{\varsigma}_j^l\right)^2}$,

(4) $\breve{x}_{j,i}^{l} = \dfrac{x_{j,i}^{l} - \bar{\varsigma}_j^l}{\sqrt{\left(\bar{\varphi}_j^l\right)^2 + \delta}}$,

(5) $y_{j,i}^{l} = \gamma\,\breve{x}_{j,i}^{l} + \beta$,

where $l$ denotes the index of the convolutional layer and $j$ denotes the index of the mini-batch. $m$ represents the size of the mini-batch, $i$ is the index of a sample within the mini-batch, and $\tau$ is the serial number of the mini-batch index. $x_{j,i}^{l}$ is the output at the $l$th convolutional layer for the $i$th sample of the $j$th mini-batch. $\bar{\varsigma}_j^l$ and $\bar{\varsigma}_{j-\tau}^{l}$ are the average values of the output at the $l$th convolutional layer for the $j$th and $(j-\tau)$th mini-batches, respectively. $\bar{v}_j^l$ is the mean of the squared outputs at the $l$th convolutional layer for the $j$th mini-batch. $\bar{\varphi}_j^l$ is the standard deviation of the output at the $l$th convolutional layer for the $j$th mini-batch. $\delta$ is a small constant added for numerical stability. $\breve{x}_{j,i}^{l}$ is the normalized output at the $l$th convolutional layer for the $i$th sample of the $j$th mini-batch. $\gamma$ and $\beta$ are learnable scaling and translation parameters, respectively. $y_{j,i}^{l}$ is the result obtained after performing CmBN on the output of the $l$th convolutional layer for the $i$th sample of the $j$th mini-batch.
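The following NumPy sketch illustrates the accumulation idea behind equations (1)–(5) for a single layer, assuming the batch has already been split into mini-batches; it is a simplified illustration (scalar $\gamma$ and $\beta$, per-layer statistics), not the actual training code.

```python
import numpy as np

def cmbn_forward(minibatches, gamma=1.0, beta=0.0, delta=1e-5):
    """Cross mini-Batch Normalization over one batch that was split into
    mini-batches. Statistics are accumulated over all mini-batches seen so
    far (eqs. (1)-(2)) instead of using only the current one; the current
    mini-batch is then normalized and affinely transformed (eqs. (3)-(5))."""
    mean_acc, sq_acc, outputs = [], [], []
    for x in minibatches:                           # x: conv output, shape (m, features)
        mean_acc.append(x.mean())
        sq_acc.append((x ** 2).mean())
        mu = np.mean(mean_acc)                      # eq. (1): accumulated mean
        v = np.mean(sq_acc)                         # eq. (2): accumulated mean of squares
        var = v - mu ** 2                           # eq. (3): variance (phi squared)
        x_hat = (x - mu) / np.sqrt(var + delta)     # eq. (4): normalization
        outputs.append(gamma * x_hat + beta)        # eq. (5): scale and shift
    return outputs

# Example: one batch of 128 samples split into 4 mini-batches of 32
batch = [np.random.randn(32, 64) for _ in range(4)]
normed = cmbn_forward(batch)
```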

4.3 Target bounding box prediction

In the majority of DL-based target detection algorithms, the convolutional layers only extract the features of the targets and then pass the acquired features to a classifier or regressor for result prediction. The YOLO series of algorithms instead use a 1 × 1 convolution kernel to complete the target prediction, so that the spatial dimension of the obtained prediction map equals that of the feature map. Each grid cell in the prediction map of YOLOv4-tiny predicts 3 target bounding boxes, so the feature map contains $b \times (5 + C)$ channels of information per grid cell, where $C$ is the total number of categories, $b$ represents the number of bounding boxes obtained, and different targets have their own unique bounding boxes. $(5 + C)$ is the number of parameters for each bounding box, consisting of the center coordinates, the size, the target score, and the confidences of the $C$ classes. The relationship between the predicted bounding boxes and the network output can be expressed as follows:

(6) $b_x = \sigma(t_x) + c_x$,

(7) $b_y = \sigma(t_y) + c_y$,

(8) $b_w = p_w e^{t_w}$,

(9) $b_h = p_h e^{t_h}$.

From Figure 3, it can be seen that the predicted center coordinate is $(b_x, b_y)$, where $b_w$ and $b_h$ are the width and height of the forecasted bounding box, respectively. $t_x$ and $t_y$ represent the offsets of the predicted target center relative to the grid cell coordinates $(c_x, c_y)$, and the offsets $\sigma(t_x)$ and $\sigma(t_y)$ are constrained to the [0, 1] interval by the sigmoid function, while $p_w$ and $p_h$ are the width and height of the anchor prior.

Figure 3: Schematic diagram of bounding box prediction.
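As a quick illustration, the decoding step of equations (6)–(9) can be written as follows (a minimal sketch; the function name decode_box and the plain-tuple interfaces are illustrative):

```python
import numpy as np

def decode_box(t, cell, anchor):
    """Decode one raw prediction (tx, ty, tw, th) into a box via
    eqs. (6)-(9): sigmoid offsets within the grid cell plus anchor scaling."""
    tx, ty, tw, th = t
    cx, cy = cell                    # top-left corner of the grid cell
    pw, ph = anchor                  # anchor (prior) width and height
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    bx = sig(tx) + cx                # eq. (6)
    by = sig(ty) + cy                # eq. (7)
    bw = pw * np.exp(tw)             # eq. (8)
    bh = ph * np.exp(th)             # eq. (9)
    return bx, by, bw, bh
```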

The YOLOv4-tiny algorithm consists of two stages, training and testing. During the training stage, a large amount of data must be fed into the model, and during the prediction stage, the candidate bounding boxes are used to determine whether any target falls into the candidate box. If a target falls within the bounding box, its probability is as follows:

(10) $C = p_r(\text{Pre} \mid \text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}$.

In object detection, images vary in size and content. To determine the initial positions of the candidate bounding boxes in YOLOv4-tiny, the K-means clustering algorithm is utilized. As an unsupervised learning method, the K-means algorithm specifies K cluster centers and assigns each surrounding sample to the nearest center. Through repeated iterations, the cluster center values are updated. When the intra-class difference is small and the inter-class difference is large, the desired effect is achieved; the "difference" is measured by the distance from a sample point to the centroid of the class to which it belongs. The Euclidean distance is generally used to measure this difference and can be calculated as follows:

(11) $d(x, \mu) = \sqrt{\sum_{i=1}^{n}(x_i - \mu_i)^2}$,

where $x$ is a sample point within the class, $\mu$ is the centroid of the class, $n$ is the dimensionality of the sample points, and $i$ is the index of each dimension.

However, the appropriate selection of the K value in the K-means clustering algorithm is often difficult, and it directly affects the clustering result. Moreover, the results of the K-means method can generally only guarantee a locally optimal clustering and are sensitive to noise interference. In addition, the K-means clustering algorithm struggles to achieve an ideal clustering effect for non-convex data and data with large differences in scale. In practice, the number of targets and the number of object categories in an image are generally unknown, and the objects may also be distributed in a very scattered way. Therefore, if clustering is performed according to a fixed K value, the distance between the obtained center points and the actual positions may be too large. In contrast, the DBSCAN clustering algorithm has no bias regarding the shape of the clusters and does not need the number of clusters as input; only the appearance radius of the target needs to be set. In addition, the impact of the center point being too far from the actual location can be alleviated by clustering based on the density-reachability concept. Therefore, this article proposes a DB-K clustering method that combines the DBSCAN and K-means algorithms (a minimal sketch follows below). First, through the DBSCAN clustering algorithm, an initial clustering is obtained without specifying the center points, yielding several complete clusters. Then multi-scale clustering is added to further obtain the local and overall information of the object; specifically, multi-scale clustering is performed on the contour information of the object. After multi-scale clustering and convolution operations, the initial feature maps are obtained. Finally, these clusters are used as input data, and the K-means algorithm divides the clusters so that accurate center point positions are obtained. This method effectively speeds up convergence on the dataset and improves the classification accuracy for small objects.
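The following scikit-learn sketch shows the core DBSCAN-then-K-means idea under simplifying assumptions (2-D points, illustrative eps and min_samples values); the multi-scale clustering step is omitted, and the function name db_k_cluster is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def db_k_cluster(points, eps=0.08, min_samples=5):
    """DB-K hybrid: DBSCAN proposes the clusters (and hence the K value and
    the initial centers), then K-means refines the center positions."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    cluster_ids = [c for c in np.unique(labels) if c != -1]   # -1 marks noise
    if not cluster_ids:
        raise ValueError("DBSCAN found no clusters; adjust eps/min_samples")
    init_centers = np.stack([points[labels == c].mean(axis=0)
                             for c in cluster_ids])
    km = KMeans(n_clusters=len(cluster_ids), init=init_centers, n_init=1)
    return km.fit(points)              # refined centers in km.cluster_centers_

# Example mirroring the 400-point experiment of Figure 4
pts = np.random.rand(400, 2)
result = db_k_cluster(pts)
```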

In the DBSCAN clustering algorithm, the closeness of the sample distribution in a neighborhood is described by the parameter pair $(\varepsilon, \text{MinPts})$. Let the input sample set be $D = (x_1, x_2, \ldots, x_m)$; if the number of samples in the $\varepsilon$-neighborhood of $x_j$ satisfies $|N_\varepsilon(x_j)| \geq \text{MinPts}$, then $x_j$ is added to the core sample set, $\Omega = \Omega \cup \{x_j\}$. Subsequently, all subsample sets are searched by the neighborhood distance threshold to update the sample set of the current cluster, and the set of unvisited samples is updated accordingly. Finally, the output after iteration is $C = (c_1, c_2, \ldots, c_k)$.

In the original YOLOv4-tiny, the seed points of the K-means clustering algorithm are chosen randomly, which increases the randomness of the clustering and can lead to a poor clustering effect. This article proposes a strategy to reduce the randomness of seed point selection. First, the K value is obtained through the error sum of squares method, as shown in equation (12), and the first cluster centers are obtained through the K-means algorithm. Afterward, the obtained K clusters are analyzed and the closest clusters are merged, thereby reducing the number of cluster centers; the number of clusters decreases correspondingly in the next clustering pass, approaching an ideal number of clusters. After several iterations, when the convergence of the evaluation function reaches the expected value, the best clustering effect is obtained:

(12) $\text{SSE} = \sum_{i=1}^{k}\sum_{p \in c_i} \lVert p - m_i \rVert^2$.
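A short sketch of how equation (12) can drive the choice of K (using scikit-learn, whose inertia_ attribute is exactly this SSE; the k_max bound is an assumption):

```python
from sklearn.cluster import KMeans

def sse_curve(points, k_max=10):
    """SSE of eq. (12) for K = 1..k_max; the 'elbow' of this curve guides
    the initial choice of K."""
    return [KMeans(n_clusters=k, n_init=10).fit(points).inertia_
            for k in range(1, k_max + 1)]
```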

In the experiment, 400 random points are selected and tested with the K-means clustering algorithm and the proposed DB-K hybrid clustering algorithm, respectively. In the K-means clustering algorithm, the K value is specified as 3, and the experimental results are shown in Figure 4(a). In the proposed DB-K clustering algorithm, the value of K is also 3, but it is determined automatically by the DBSCAN clustering algorithm; the experimental results are shown in Figure 4(b). It can be clearly seen from the experimental results that the method proposed in this article has a better clustering effect.

Figure 4: Performance comparison of different clustering methods. (a) K-means clustering method and (b) DB-K clustering method.

Therefore, by integrating the DBSCAN clustering algorithm and improving the traditional K-means clustering algorithm, the classification performance of object detection on different datasets and in complex environments is enhanced, and the robustness to noise and interference is also improved. With the proposed DB-K clustering algorithm, the object recognition performance of the neural network is significantly improved.

5 Experiment results and analysis

The experiments in this article are conducted in a Linux environment configured with Ubuntu 18.04, compute unified device architecture (CUDA) 11.0, and the CUDA deep neural network library (CUDNN) 8.0. The hardware platform is equipped with an 8-core Intel 10400F CPU, a GTX 960 graphics card, and 16 GB of RAM. During the experiments, the GPU is used to speed up the training process. The self-made dataset is randomly split into training and testing sets in a ratio of 7:3, and multiple groups of ablation experiments are set up to verify the effect of each improvement strategy on the model, thus obtaining the optimal model. To further validate the performance of the proposed algorithm, a comparative experiment with the most commonly used lightweight algorithms is conducted on the PASCAL VOC07 + 12 [37,38] public dataset, and the performance of the different algorithms is compared in terms of the mean average precision (mAP) and detection speed (FPS).

5.1 Experiment datasets

To analyze the recognition performance of the proposed lightweight model in crowded indoor spaces with occluded objects, two datasets are used: the PASCAL VOC07 + 12 public dataset, which includes 16,551 training images and 4,952 testing images, and a self-made dataset. The self-made dataset involves four different scenarios, with 10,432 images captured from elevator cars and 9,139 images captured from bus cars, passenger aircraft cabins, and natural scenes. The training, validation, and test sets are partitioned in a ratio of 7:2:1. Among them, 70% of the pictures exhibit mutual occlusion or incomplete camera framing; in this article, such pictures are defined as complex images. They effectively prevent the overfitting caused by the single background of dense spaces during training and improve the generalization capability of the model in crowded indoor scenarios. The dataset defines three detection categories: person, electric bicycle, and bicycle. Baby carriages, trolleys, furniture, other goods, and pets are used as negative samples to enhance the reliability of the model. The LabelImg tool designed by Tzutalin [39] is used to annotate the target region of each picture in proportion to the picture area, and the aspect ratio of the pictures is limited to within 3:1 to make the predicted bounding boxes fit the targets more closely. The statistics of the self-built dataset are shown in Table 1.

Table 1

Statistics of the self-built dataset

Positive categories: person, electric bicycle, bicycle. Negative categories: baby carriage, trolley, furniture, other goods, pets.

| Category | Person | Electric bicycle | Bicycle | Baby carriage | Trolley | Furniture | Other goods | Pets | Total |
|---|---|---|---|---|---|---|---|---|---|
| Number | 24,475 | 16,528 | 7,544 | 1,497 | 1,382 | 864 | 1,497 | 3,801 | 57,588 |
| % | 42.5 | 28.7 | 13.1 | 2.6 | 2.4 | 1.5 | 2.6 | 6.6 | 100 |

Objects in dense indoor scenes such as elevators and carriages are prone to mutual overlap and occlusion, which seriously affects the accuracy of target recognition. Aiming at this problem, this study introduces the Mosaic data augmentation technique [40] during model training to improve the model's capability to recognize occluded objects. Example images after the Mosaic process are shown in Figure 5. Before each iteration, the DL framework not only acquires images from the training set but also generates new images through Mosaic data augmentation; the newly generated images and the original images are then combined and fed into the model for training. Mosaic data augmentation randomly selects four images from the training set, randomly crops them, and splices the cropped images in sequence to obtain a new image with the same resolution as the training set images. During random cropping, part of a target's bounding box may be cropped, simulating the effect of the object being occluded. In addition, this study optimizes Mosaic data augmentation and proposes an improved version that uses the intersection-over-union (IoU) as an indicator, setting a threshold in line with the standards used when calibrating the dataset to filter the object bounding boxes in the newly generated image. The improved Mosaic strategy generates a new image following the steps of Mosaic data augmentation and then filters the object bounding boxes: if the IoU between an object bounding box in the new image and the corresponding box in the original image is less than the threshold, the box is deleted from the new image and it is considered that there is no object there; otherwise, the box is retained and it is considered that a target is present (a sketch of this filtering follows below).

Figure 5: Examples of the training data enhancement. (a) Original images and (b) images after the enhanced Mosaic process.
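A minimal sketch of the IoU-based box filtering follows; the threshold value 0.3 and the assumption that both box lists share a coordinate frame are illustrative, not taken from the article.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def filter_mosaic_boxes(cropped_boxes, original_boxes, thr=0.3):
    """Improved-Mosaic filtering: keep a cropped box in the stitched image
    only if its IoU with the corresponding source box reaches the threshold."""
    return [cb for cb, ob in zip(cropped_boxes, original_boxes)
            if iou(cb, ob) >= thr]
```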

5.2 Evaluation metric

The inference speed (ms/frame), i.e., the time required for the model to recognize one image, is used as the evaluation metric for recognition speed. The average precision (AP) is used as the evaluation metric for recognition accuracy; AP is calculated from the precision (P) and recall (R) of the model. P, R, AP, and mAP are calculated as follows:

(13) $P = \dfrac{\text{TP}}{\text{TP} + \text{FP}} \times 100\%$,

(14) $R = \dfrac{\text{TP}}{\text{TP} + \text{FN}} \times 100\%$,

(15) $\text{AP} = \int_{0}^{1} P(R)\,\mathrm{d}R \times 100\%$,

(16) $\text{mAP} = \dfrac{1}{C} \sum_{c \in C} \text{AP}(c)$,

where TP is the number of correctly detected positive samples, TN is the number of correctly detected negative samples, FP is the number of negative samples falsely detected as positive, and FN is the number of positive samples that are missed. P represents the percentage of correctly detected positive samples among all detected positive samples. R represents the percentage of correctly detected positive samples among all ground-truth positive samples. In the final evaluation, AP is the comprehensive evaluation of a single category, and the greater the AP value, the better the accuracy for that category. mAP evaluates the level of the entire network, C is the number of classes contained in the whole dataset, and c denotes a single class.
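These metrics map directly to a few lines of NumPy (a minimal sketch; numerical integration over the precision-recall curve is one common way to realize equation (15)):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (13)-(14): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Eq. (15): AP as the area under the precision-recall curve,
    integrated numerically over recall."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    """Eq. (16): mAP as the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```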

5.3 Model training parameters

The model parameters in the experiment are set as follows: the input image is 416 × 416; the number of epochs is set to 300; the batch size is 128 for the first 70 epochs and 32 for the last 230 epochs. The learning rate is 1 × 10⁻³ for the first 70 epochs and 1 × 10⁻⁴ for the last 230 epochs. The stochastic gradient descent optimization strategy is employed for model training, with the learning rate gradually decreased as training progresses to improve convergence. Specifically, the momentum parameter is set to 0.9 to accelerate the convergence of the optimization process. The loss curves during training are shown in Figure 6.
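This schedule can be expressed compactly (a sketch assuming PyTorch; the helper name make_optimizer_and_scheduler is illustrative):

```python
import torch

def make_optimizer_and_scheduler(model):
    """SGD with momentum 0.9 and the two-phase schedule described above:
    learning rate 1e-3 for the first 70 epochs, then 1e-4 for the rest.
    (The batch-size change from 128 to 32 at epoch 70 would be handled
    by rebuilding the DataLoader and is omitted here.)"""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[70], gamma=0.1)  # 1e-3 -> 1e-4 at epoch 70
    return optimizer, scheduler
```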

Figure 6: Loss curves during training.

As the number of epochs increases, the loss values of both models continue to decrease. After 70 epochs of training, the loss curves stabilize, with no underfitting or overfitting. The loss values of the original YOLOv4-tiny algorithm and the improved algorithm converge to around 2.3 and 1.9, respectively, which shows that the recognition accuracy of the model keeps improving and that the hyperparameters of the proposed algorithm are set reasonably.

5.4 Comparison with the original YOLOv4-tiny

The test results of the original YOLOv4-tiny and the proposed model on the self-built dataset are shown in Table 2, which compares mAP, inference speed, and model size. Complex images account for 70% of the dataset and correspond to situations where the targets occlude each other or the camera view is incomplete. The mAP values of the original YOLOv4-tiny model on the complex images and on all images of the test dataset are 75.47 and 84.25%, respectively, and those of the proposed model are 7.94 and 8.32 percentage points higher, respectively. The neck network of the improved YOLOv4-tiny model reduces the loss of low-level feature map details and location information in the backbone network. The newly added feature maps have a smaller receptive field for detecting blurred and occluded targets in images and improve the model's ability to detect small targets. In addition, the DB-K clustering algorithm further improves the classification effect. Compared with the original YOLOv4-tiny, the model size of the proposed model increases by 2 MB, and the inference time per image increases by 0.37 ms. The improved model only adds three convolutional layers, one upsampling operation, and one concatenation operation to the neck network, and the backbone network does not change. Therefore, the improved YOLOv4-tiny model maintains a fast inference speed while improving the detection accuracy.

Table 2

Comparison results on the self-built dataset

| Model | mAP (%), complex images | mAP (%), all images | Inference speed (ms/frame) | Model size (MB) |
|---|---|---|---|---|
| YOLOv4-tiny | 75.47 | 84.25 | 3.42 | 23.5 |
| Proposed model | 83.41 | 92.57 | 3.79 | 25.5 |

The mAP curves of the proposed model and the original YOLOv4-tiny on the PASCAL VOC dataset are shown in Figure 7. From the figure, it can be observed that, with the proposed model, the mAP for recognizing objects of various scales and different categories improves significantly. For example, compared with the original model, the APs for categories containing a large number of small targets, such as potted plant, boat, and bird, increase by 7, 4, and 3%, respectively. The results validate that the proposed algorithm has a better detection effect on scenes containing many occluded and small targets, which further proves its superiority.

Figure 7: Performance comparison on PASCAL VOC dataset. (a) Original YOLOv4-tiny and (b) proposed model.

5.5 Ablation study

The ablation analysis is performed by combining the original YOLOv4-tiny model with different improvement strategies; training and performance evaluation are carried out on the self-built dataset containing four types of scenes to validate the contribution of each module to recognition accuracy while guaranteeing real-time performance. The test results of the different module combinations are shown in Table 3, all of which are trained using the proposed enhanced Mosaic technique. From the table, it can be seen that each improved module contributes to the overall performance to a different degree. Among them, the introduction of DBSCAN in model 4 contributes the most, increasing mAP by 4.99%, which proves that the proposed clustering algorithm significantly improves the classification effect in complex environments while increasing the robustness to noise and interference. The mAP value of the original YOLOv4-tiny (model 1) is 84.25%, and the mAP value after applying the neck network optimization (model 2) is 86.77%, an increase of 2.52%, indicating that this optimization strengthens the extraction of detailed information from shallow layers and makes the training for occluded objects more thorough. Model 3 replaces BN with CmBN on the basis of model 2. BN can only use the output features of the convolution layers from the current mini-batch, so its statistics are not accurate enough, resulting in poorer performance. CmBN, on the other hand, effectively enlarges the sample size by accumulating information from different mini-batches and makes the estimation of the statistics more accurate.

Table 3

Results of the ablation tests

| Model | Neck network | Normalization | Clustering | mAP (%) |
|---|---|---|---|---|
| 1 | Original | BN | K-means | 84.25 |
| 2 | Improved | BN | K-means | 86.77 |
| 3 | Improved | CmBN | K-means | 87.58 |
| 4 | Improved | CmBN | Improved DB-K | 92.57 |

5.6 Comparison with other state-of-the-art models

To further verify the effectiveness of the proposed model, a comparison experiment is carried out on the PASCAL VOC07 + 12 public dataset between the proposed model and other state-of-the-art models, including Faster RCNN [30], YOLOv4 [36], YOLOv3-tiny [34], and YOLOv4-tiny [10]. In addition, the DL frameworks and the trained models are ported to the Jetson nano and Raspberry Pi 4B to test the inference speed of the models on embedded hardware platforms. The results are listed in Table 4. From Table 4, it can be observed that the advantage of large networks is high detection accuracy: the two-stage Faster RCNN using ResNet50 as the backbone and the classic one-stage YOLOv4 achieve mAPs of 83.77 and 90.01%, respectively. However, these two models are too large, at 330 MB and 256 MB, respectively, making such networks difficult to deploy to mobile terminals with limited computing capacity. The mAP of the proposed method is comparable to that of the large Faster RCNN network, only 1.41% behind, while the size of the proposed model is only about 1/13 of that of Faster RCNN. Compared with the 64.3 million parameters of the YOLOv4 model, the proposed model has only 6.1 million parameters, less than 1/10. The advantage of lightweight networks is a balance between detection speed and accuracy that enables real-time detection on mobile terminals, at the cost of weaker performance in complex scenes. Compared with the two popular lightweight models YOLOv3-tiny and YOLOv4-tiny, the proposed method significantly improves mAP while satisfying the real-time detection requirements. The proposed model does not introduce much extra computing power or memory overhead in the inference process; it retains the advantages of a simplified structure and fast inference while ensuring high recognition accuracy. This proves that the proposed model is suitable for deployment on embedded hardware platforms.

Table 4

Comparison results with different methods

| Model | Backbone network | mAP (%) | Model size (MB) | Parameters (million) | GTX 960 (ms/frame) | Jetson nano (ms/frame) | Raspberry Pi (ms/frame) |
|---|---|---|---|---|---|---|---|
| Faster RCNN | ResNet50 | 83.77 | 330 | 81.7 | 41.55 | — | — |
| YOLOv4 | CSPDarknet53 | 90.01 | 256 | 64.3 | 14.58 | 812 | 19,350 |
| YOLOv3-tiny | Darknet53-tiny | 70.22 | 33.1 | 6.2 | 4.92 | 274 | 3,680 |
| YOLOv4-tiny | CSPDarknet53-tiny | 75.49 | 23.5 | 5.9 | 3.42 | 152 | 2,177 |
| Proposed method | CSPDarknet53-tiny | 82.36 | 25.5 | 6.1 | 3.79 | 183 | 2,601 |

6 Conclusion

To address the poor performance of the original YOLOv4-tiny algorithm in detecting occluded objects in dense indoor scenes, a modified target detection model is proposed based on the YOLOv4-tiny algorithm, with three improvements: 1) the neck network structure of the original YOLOv4-tiny model is modified so that the model can learn more information from occluded objects; 2) the CmBN strategy is used instead of the BN strategy, and the model error is reduced by accumulating the outputs of the convolutional layers; 3) the DBSCAN clustering algorithm is incorporated into the proposed network, and the anchor point coordinates are then determined through the improved K-means clustering, further increasing the detection accuracy. Experimental results show that the mAP values of the proposed algorithm on the self-built dataset and the PASCAL VOC07 + 12 dataset are 92.57 and 82.36%, respectively, which are 8.32 and 6.87% higher than those of the original YOLOv4-tiny model. This proves that the performance of the proposed algorithm is significantly improved over the original YOLOv4-tiny model. The inference speeds of the proposed algorithm on the embedded platforms Jetson nano and Raspberry Pi are 183 and 2,601 ms/frame, respectively, indicating that the processing speed can satisfy the requirements of different real-time applications; occluded targets can be detected quickly and accurately, and the proposed model is suitable for practical target detection in crowded indoor spaces. There are still some areas for improvement. Although the recognition accuracy of the proposed algorithm has been greatly improved, it is limited by the lightweight nature of the backbone network: in general scenes, the detection accuracy for small objects is still lower than that of complex networks. In the future, we will continue to optimize the backbone network. To reuse local features and strengthen the fusion of global features, the CBM and CBL modules within the CSP module could be connected in a dense connection structure, and the CBM modules of different CSP modules could be tensor-spliced after the upsampling operation to further integrate the feature information of the shallow and deep layers. In addition, we will attempt to augment the experimental dataset for small object detection in general scenarios.

  1. Funding information: This work was not supported by any funding.

  2. Author contributions: Jianing Shen: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization; Yang Zhou: Writing – original draft, Writing – review & editing.

  3. Conflict of interest: The authors declare that there is no conflict of interest regarding the publication of this article.

  4. Data availability statement: The data used to support the findings of this study are included within the article.

References

[1] Zou Z, Shi Z, Guo Y, Ye J. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055; 2019.

[2] Meng Z, Zhang M, Wang H. CNN with pose segmentation for suspicious object detection in MMW security images. Sensors. 2020;20(17):4974. doi:10.3390/s20174974.

[3] Teknomo K, Takeyama Y, Inamura H. Tracking system to automate data collection of microscopic pedestrian traffic flow. arXiv preprint arXiv:1609.01810; 2016.

[4] Ko KE, Sim K. Deep convolutional framework for abnormal behavior detection in a smart surveillance system. Eng Appl Artif Intell. 2018;67:226–34. doi:10.1016/j.engappai.2017.10.001.

[5] Murthy CB, Hashmi MF, Bokde ND, Geem ZW. Investigations of object detection in images/videos using various deep learning techniques and embedded platforms – A comprehensive review. Appl Sci. 2020;10(9):3280. doi:10.3390/app10093280.

[6] Lan S, Gao Y, Jiang S. Computer vision for system protection of elevators. Journal of Physics: Conference Series. Vol. 1848. Issue 1. IOP Publishing; 2021. p. 012156. doi:10.1088/1742-6596/1848/1/012156.

[7] Wang S. A review of gradient-based and edge-based feature extraction methods for object detection. 2011 IEEE 11th International Conference on Computer and Information Technology. IEEE; 2011. p. 277–82. doi:10.1109/CIT.2011.51.

[8] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: Single shot multibox detector. European Conference on Computer Vision. Cham: Springer; 2016. p. 21–37. doi:10.1007/978-3-319-46448-0_2.

[9] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 779–88. doi:10.1109/CVPR.2016.91.

[10] Wang CY, Bochkovskiy A, Liao HYM. Scaled-YOLOv4: Scaling cross stage partial network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021. p. 13029–38. doi:10.1109/CVPR46437.2021.01283.

[11] Ren H, Li ZN. Object detection using edge histogram of oriented gradient. 2014 IEEE International Conference on Image Processing (ICIP). IEEE; 2014. p. 4057–61. doi:10.1109/ICIP.2014.7025824.

[12] Gilani SAM. Object recognition by modified scale invariant feature transform. Third International Workshop on Semantic Media Adaptation and Personalization. IEEE; 2008. p. 33–9. doi:10.1109/SMAP.2008.12.

[13] Farooq J. Object detection and identification using SURF and BoW model. International Conference on Computing, Electronic and Electrical Engineering (ICE Cube). IEEE; 2016. p. 318–23. doi:10.1109/ICECUBE.2016.7495245.

[14] Lienhart R, Maydt J. An extended set of Haar-like features for rapid object detection. Proceedings. International Conference on Image Processing. Vol. 1. IEEE; 2002. p. I. doi:10.1109/ICIP.2002.1038171.

[15] Trefný J, Matas J. Extended set of local binary patterns for rapid object detection. Computer Vision Winter Workshop; 2010. p. 1–7.

[16] Mostafa A, Sander OE. Application of template matching for improving classification of urban railroad point clouds. Sensors. 2016;16(12):2112. doi:10.3390/s16122112.

[17] Bühlmann P, Hothorn T. Boosting algorithms: Regularization, prediction and model fitting. Stat Sci. 2007;22(4):477–505. doi:10.1214/07-STS242.

[18] Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: A review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32. doi:10.1109/TNNLS.2018.2876865.

[19] LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51. doi:10.1162/neco.1989.1.4.541.

[20] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324. doi:10.1109/5.726791.

[21] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1106–14.

[22] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556; 2014.

[23] Ballester P, Araujo RM. On the performance of GoogLeNet and AlexNet applied to sketches. Thirtieth AAAI Conference on Artificial Intelligence; 2016. doi:10.1609/aaai.v30i1.10171.

[24] Chollet F. Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 1251–8. doi:10.1109/CVPR.2017.195.

[25] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–8. doi:10.1109/CVPR.2016.90.

[26] Chen S, Liu H, Zeng X, Qian S, Yu J, Guo W. Image classification based on convolutional denoising sparse autoencoder. Math Probl Eng. 2017;2017:5218247. doi:10.1155/2017/5218247.

[27] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 580–7. doi:10.1109/CVPR.2014.81.

[28] Lu X, Li Q, Li B, Yan J. MimicDet: Bridging the gap between one-stage and two-stage object detection. European Conference on Computer Vision. Cham: Springer; 2020. p. 541–57. doi:10.1007/978-3-030-58568-6_32.

[29] Girshick R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision; 2015. p. 1440–8.

[30] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst. 2015;28.

[31] He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 2961–9. doi:10.1109/ICCV.2017.322.

[32] Afif M, Ayachi R, Said Y, Pissaloux E, Atri M. An evaluation of RetinaNet on indoor object detection for blind and visually impaired persons assistance navigation. Neural Process Lett. 2020;51(3):2265–79. doi:10.1007/s11063-020-10197-9.

[33] Tan M, Le QV. EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. PMLR; 2019. p. 6105–14.

[34] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767; 2018. http://arxiv.org/abs/1804.02767.

[35] Redmon J, Farhadi A. YOLO9000: Better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 7263–71. doi:10.1109/CVPR.2017.690.

[36] Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934; 2020.

[37] Everingham M, Van Gool L, Williams CK, Winn J. The PASCAL visual object classes (VOC) challenge. Int J Comput Vis. 2010;88(2):303–38. doi:10.1007/s11263-009-0275-4.

[38] Shetty S. Application of convolutional neural network for image classification on Pascal VOC challenge 2012 dataset. arXiv preprint arXiv:1607.03785; 2016.

[39] Tzutalin D. LabelImg is a graphical image annotation tool to label object bounding boxes in images; 2015. https://github.com/tzutalin/labelImg.

[40] Yun S, Han D, Chun S, Oh SJ, Yoo Y, Choe J, et al. CutMix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 6023–32. doi:10.1109/ICCV.2019.00612.

Received: 2022-11-21
Revised: 2023-02-12
Accepted: 2023-03-21
Published Online: 2023-07-10

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

  39. Biomedical event extraction using pre-trained SciBERT
  40. Sentiment analysis method of consumer comment text based on BERT and hierarchical attention in e-commerce big data environment
  41. An intelligent decision methodology for triangular Pythagorean fuzzy MADM and applications to college English teaching quality evaluation
  42. Ensemble of explainable artificial intelligence predictions through discriminate regions: A model to identify COVID-19 from chest X-ray images
  43. Image feature extraction algorithm based on visual information
  44. Optimizing genetic prediction: Define-by-run DL approach in DNA sequencing
  45. Study on recognition and classification of English accents using deep learning algorithms
  46. Review Articles
  47. Dimensions of artificial intelligence techniques, blockchain, and cyber security in the Internet of medical things: Opportunities, challenges, and future directions
  48. A systematic literature review of undiscovered vulnerabilities and tools in smart contract technology
  49. Special Issue: Trustworthy Artificial Intelligence for Big Data-Driven Research Applications based on Internet of Everythings
  50. Deep learning for content-based image retrieval in FHE algorithms
  51. Improving binary crow search algorithm for feature selection
  52. Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm
  53. A study on predicting crime rates through machine learning and data mining using text
  54. Deep learning models for multilabel ECG abnormalities classification: A comparative study using TPE optimization
  55. Predicting medicine demand using deep learning techniques: A review
  56. A novel distance vector hop localization method for wireless sensor networks
  57. Development of an intelligent controller for sports training system based on FPGA
  58. Analyzing SQL payloads using logistic regression in a big data environment
  59. Classifying cuneiform symbols using machine learning algorithms with unigram features on a balanced dataset
  60. Waste material classification using performance evaluation of deep learning models
  61. A deep neural network model for paternity testing based on 15-loci STR for Iraqi families
  62. AttentionPose: Attention-driven end-to-end model for precise 6D pose estimation
  63. The impact of innovation and digitalization on the quality of higher education: A study of selected universities in Uzbekistan
  64. A transfer learning approach for the classification of liver cancer
  65. Review of iris segmentation and recognition using deep learning to improve biometric application
  66. Special Issue: Intelligent Robotics for Smart Cities
  67. Accurate and real-time object detection in crowded indoor spaces based on the fusion of DBSCAN algorithm and improved YOLOv4-tiny network
  68. CMOR motion planning and accuracy control for heavy-duty robots
  69. Smart robots’ virus defense using data mining technology
  70. Broadcast speech recognition and control system based on Internet of Things sensors for smart cities
  71. Special Issue on International Conference on Computing Communication & Informatics 2022
  72. Intelligent control system for industrial robots based on multi-source data fusion
  73. Construction pit deformation measurement technology based on neural network algorithm
  74. Intelligent financial decision support system based on big data
  75. Design model-free adaptive PID controller based on lazy learning algorithm
  76. Intelligent medical IoT health monitoring system based on VR and wearable devices
  77. Feature extraction algorithm of anti-jamming cyclic frequency of electronic communication signal
  78. Intelligent auditing techniques for enterprise finance
  79. Improvement of predictive control algorithm based on fuzzy fractional order PID
  80. Multilevel thresholding image segmentation algorithm based on Mumford–Shah model
  81. Special Issue: Current IoT Trends, Issues, and Future Potential Using AI & Machine Learning Techniques
  82. Automatic adaptive weighted fusion of features-based approach for plant disease identification
  83. A multi-crop disease identification approach based on residual attention learning
  84. Aspect-based sentiment analysis on multi-domain reviews through word embedding
  85. RES-KELM fusion model based on non-iterative deterministic learning classifier for classification of Covid19 chest X-ray images
  86. A review of small object and movement detection based loss function and optimized technique
Downloaded on 12.9.2025 from https://www.degruyterbrill.com/document/doi/10.1515/jisys-2022-0268/html
Scroll to top button