Mine underground object detection algorithm based on TTFNet and anchor-free

Zhen Song; Xuwen Qing; Meng Zhou; Yuting Men

doi:10.1515/comp-2024-0015


Published/Copyright: November 26, 2024


From the journal Open Computer Science Volume 14 Issue 1

Abstract

To address the poor object detection performance caused by uneven lighting and high noise in underground mines, this study proposes an object detection algorithm for underground mines based on TTFNet (training-time-friendly network). First, the CenterNet and TTFNet algorithms are introduced. Then, pooling is introduced into the basic CSPNet structure to design a lightweight feature extraction network; in addition, the feature fusion method of the original algorithm is optimized, and the residual shrinkage network structure is improved and introduced into the object detection task. Experiments were conducted on a self-built underground data set. The results show that, compared with the original algorithm, the proposed algorithm maintains similar accuracy while significantly reducing the number of model parameters; compared with anchor-based detection algorithms, it achieves similar overall performance.

Keywords: anchor-free; underground object detection; convolutional neural network; lightweight model

1 Introduction

With the rapid development of the social economy, the demand for mineral resources is increasing day by day. As shallow mineral resources are exhausted, mining is gradually moving deep underground [1]. The modern mining industry urgently needs to develop toward intelligence, digitalization, and automation [2]. Object detection is one of the key technologies for achieving these goals and can be widely used in underground intelligent monitoring and in the autonomous driving and shoveling of load-haul-dump (LHD) machines [3]. However, in underground mines, uneven lighting, large amounts of dust, and limited equipment severely affect the effectiveness of detection algorithms [4].

At present, deep learning-based methods have become the mainstream of object detection. They can be divided into anchor-based and anchor-free algorithms. Anchor-based algorithms such as Faster Regions with CNN features (Faster RCNN), Single Shot MultiBox Detector (SSD), and the You Only Look Once (YOLO) series can achieve high accuracy and speed [5,6]. However, anchor-based algorithms require more hyper-parameters, consume more time and computing resources, and may also lead to an imbalance between positive and negative samples [7]. Therefore, many anchor-free object detection methods have been proposed recently. CornerNet [8] predicts the upper-left and lower-right corners of the target from heatmaps generated by the network and then pairs the corner points to form the predicted box. Fully Convolutional One-Stage Object Detection (FCOS) [9] directly predicts the category of each point on the feature map and its distances to the box borders to complete the detection. To solve the problem of overlapping boxes, its authors use the Feature Pyramid Network (FPN) structure to assign targets to different scales for prediction.

Vision technology has already been studied extensively for underground applications. Li et al. [10] proposed an algorithm based on an improved Scale-Invariant Feature Transform (SIFT) for personnel detection in hazardous underground areas. This method relies on traditional hand-crafted feature extraction; it is prone to false detections in the complex underground environment, and its detection speed and reliability are poor. Li et al. [11] proposed a pedestrian detection algorithm based on Faster RCNN, but the algorithm produces many background errors, runs slowly, and converges slowly. Cui and Wang [12] used the YOLOv4 [13] algorithm to detect whether underground workers wear masks. The method is fast and accurate, but the size and parameter count of the model were not optimized, which greatly limits its use in mines with limited equipment. Zhang et al. [14] combined SSD with LeNet, a handwritten digit recognition network, to recognize underground personnel and digits and achieved high accuracy and recall, but the detection rate of the algorithm still needs to be improved. Hao et al. [15] designed an algorithm for detecting foreign objects on coal mine conveyor belts by optimizing the YOLOv5 loss function and introducing depthwise separable convolution and an attention module; it detects large foreign objects on the conveyor belt quickly and accurately under uneven illumination. Zhang and Jiang [16] used GhostNet as the backbone network of YOLOv4 and integrated an attention mechanism and an inverted residual structure to develop an algorithm for drilling depth detection, which accurately counts drill pipes and calculates the drilling depth. Du [17] combined YOLOv3 detection on infrared images with high-precision ranging technologies to develop a collision avoidance system for coal mine transportation equipment, which achieved a good safety protection effect.

Although there are many studies on the application of object detection in underground mining, little research combines object detection with mining equipment to improve automatic mining. Therefore, based on the training-time-friendly network (TTFNet) [18], this study designs an underground object detection algorithm for underground LHDs with few parameters, high accuracy, fast training, and good robustness.

2 Method

2.1 CenterNet algorithm

Compared with the anchor-free algorithms described above, CenterNet [19] is more concise and can achieve higher accuracy and speed. It is easy to extend to other tasks [20] such as target tracking, pose detection, instance segmentation, and 3D detection, so it can be extended to a variety of scenes in the mine. As shown in Figure 1, CenterNet models the target as a point to identify and localize the objects in an image. First, the image is fed into a convolutional neural network to generate a heatmap whose peaks are the centers of objects of the corresponding category. Object detection is then completed by regressing the bounding box directly from each center.

Figure 1

CenterNet algorithm schematic diagram.

The structure of the CenterNet algorithm is shown in Figure 2. The network takes an image of size 512 × 512 as input. The backbone network first extracts a feature map of size 16 × 16, which is then upsampled to 128 × 128. This feature map is fed into the prediction module, whose three branches predict the center, the center offset, and the box width and height of the target, respectively.

Figure 2

CenterNet algorithm structure.

The loss function of CenterNet consists of three parts: the center-point classification loss, the center-point offset loss, and the box size loss. For any input image $I \in \mathbb{R}^{W \times H \times 3}$, the goal of the CenterNet algorithm is to obtain a heatmap through a convolutional neural network:

$$\hat{Y} \in [0, 1]^{\frac{W}{r} \times \frac{H}{r} \times C}, \tag{1}$$

where $W$ and $H$ represent the width and height of the image, $r = 4$ represents the output stride, and $C$ represents the number of classes. $\hat{Y}_{xyc} = 1$ means that $(x, y)$ is the center point of an object of class $c$ on the heatmap, and $\hat{Y}_{xyc} = 0$ means background. Only the center point is regarded as a positive sample.

On the downsampled feature map, the radius of a Gaussian circle is calculated from the size of the object box, and all ground-truth center points are then splatted onto a heatmap $Y \in [0, 1]^{\frac{W}{r} \times \frac{H}{r} \times C}$ with the Gaussian kernel

$$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right), \tag{2}$$

where $\sigma_p$ is a standard deviation related to the size of the target, and $\tilde{p}_x$ and $\tilde{p}_y$ are the coordinates of the center point on the downsampled image, rounded down. If the Gaussian distributions of multiple targets of the same category overlap, the element-wise maximum is taken at each position of the overlap area. In addition, for each labeled object, the algorithm calculates the center coordinates of the ground-truth box:

$$P = \left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right), \tag{3}$$

where $(x_1, y_1, x_2, y_2)$ are the corner coordinates of the object's box. Therefore, given the predicted heatmap value $\hat{Y}_{xyc}$ and the calculated heatmap value $Y_{xyc}$, the location loss is designed as

$$L_k = -\frac{1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}), & \text{if } Y_{xyc} = 1, \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}), & \text{otherwise}, \end{cases} \tag{4}$$

where $\alpha$ and $\beta$ are the hyperparameters of the focal loss, set to 2 and 4, respectively, and $N$ is the number of key points in the image. This loss compensates for the imbalance between positive and negative samples and improves the mining of hard samples.
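As an illustration of equation (4), the following minimal sketch evaluates the penalty-reduced focal loss on flat lists of heatmap values. It is a plain-Python sketch, not the authors' implementation; the function name, the clamping constant `eps`, and the example values are assumptions made for illustration.

```python
import math

def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Focal loss of eq. (4) over flat lists of heatmap values.

    pred   -- predicted heatmap values in (0, 1)
    target -- Gaussian-splatted ground-truth values in [0, 1]
    """
    total = 0.0
    num_pos = 0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        if t == 1.0:
            # positive sample: an object center
            total += (1.0 - p) ** alpha * math.log(p)
            num_pos += 1
        else:
            # negative sample, down-weighted near a center by (1 - t)^beta
            total += (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return -total / max(num_pos, 1)  # N = number of key points

# Hypothetical three-pixel heatmap: one center, one neighbor, one background pixel.
target = [1.0, 0.8, 0.0]
good = center_focal_loss([0.9, 0.3, 0.1], target)  # confident and correct
bad = center_focal_loss([0.1, 0.9, 0.9], target)   # confident and wrong
```

In this example the wrong prediction yields a much larger loss than the correct one, and the pixel next to the center (target 0.8) contributes little because of the factor $(1 - Y_{xyc})^{\beta}$.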

Since the center point of the target is predicted on the downsampled feature map, a precision error occurs when it is remapped to the original image size. The algorithm therefore adds a center-point offset loss:

$$L_{\text{off}} = \frac{1}{N} \sum_{P} \left| \hat{O}_{\tilde{P}} - \left( \frac{P}{r} - \tilde{P} \right) \right|, \tag{5}$$

where $\hat{O}_{\tilde{P}}$ is the offset predicted by the network at the rounded-down center $\tilde{P}$.
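The regression target of the offset branch can be worked through on a single box. The sketch below assumes an output stride of 4 and a hypothetical box; the function name `center_offset` is illustrative and not taken from the paper.

```python
import math

def center_offset(box, r=4):
    """Return the integer heatmap cell and the sub-cell offset of a box center.

    box -- (x1, y1, x2, y2) in input-image pixels
    r   -- output stride
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0        # box center, eq. (3)
    px, py = math.floor(cx / r), math.floor(cy / r)  # rounded-down center on the heatmap
    return (px, py), (cx / r - px, cy / r - py)      # offset target P / r - floor(P / r)

# Hypothetical box: center (130.5, 70.5) maps to 32.625 and 17.625 on the heatmap.
cell, offset = center_offset((101, 50, 160, 91))
```

Here `cell` is `(32, 17)` and `offset` is `(0.625, 0.625)`; adding the offset to the cell index and multiplying by the stride recovers the original center.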

On the size regression branch, let any target box belonging to category $C_k$ be represented as $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$. CenterNet selects $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ as the regression target for the size prediction and represents the predicted value with $\hat{s} \in \mathbb{R}^{\frac{W}{r} \times \frac{H}{r} \times 2}$, so the size loss can be expressed as

$$L_{\text{size}} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{s}_k - s_k \right|. \tag{6}$$

To sum up, the total loss function of the CenterNet algorithm can be expressed as

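$$L_{\text{det}} = L_k + \lambda_{\text{size}} L_{\text{size}} + \lambda_{\text{off}} L_{\text{off}}, \tag{7}$$

where $\lambda_{\text{size}} = 0.1$ and $\lambda_{\text{off}} = 1$ are the weight coefficients of the loss function.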
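A short sketch of how the size loss and the weighted total might be computed is given below. It assumes that the absolute value in equation (6) is the L1 norm over width and height; the function names and the numeric inputs are hypothetical.

```python
def size_l1_loss(pred_sizes, boxes):
    """Eq. (6): mean L1 distance between predicted and true (width, height)."""
    total = 0.0
    for (pw, ph), (x1, y1, x2, y2) in zip(pred_sizes, boxes):
        total += abs(pw - (x2 - x1)) + abs(ph - (y2 - y1))
    return total / max(len(boxes), 1)

def centernet_total_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """Eq. (7): weighted sum of heatmap, size, and offset losses."""
    return l_k + lambda_size * l_size + lambda_off * l_off

# One hypothetical 40 x 60 box predicted as 38 x 63.
l_size = size_l1_loss([(38.0, 63.0)], [(10.0, 20.0, 50.0, 80.0)])
l_det = centernet_total_loss(l_k=0.5, l_size=l_size, l_off=0.2)
```

The size error is measured in pixels and is therefore numerically much larger than the other two terms, which is consistent with its small weight of 0.1.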

2.2 TTFNet detection algorithm

Although CenterNet achieves good overall performance, its training converges slowly compared with other algorithms. The TTFNet authors attribute this to the fact that only the target center is regarded as a positive sample for size regression during CenterNet training. To shorten the training time, TTFNet uses all pixels within a certain range of the target center as samples for box regression and adopts the Gaussian probability as the weight of each regression sample, emphasizing the samples close to the target center. This allows the network to train with more positive samples, which speeds up convergence in a way similar to increasing the learning rate. Combined with CenterNet's original design, TTFNet achieves a better balance between training time, inference speed, and accuracy.

As shown in Figure 3, TTFNet splits object detection into two parts: object localization and size regression. For center localization, TTFNet takes the actual aspect ratio of the object into account and uses two-dimensional Gaussian kernels with different widths and heights to generate the heatmap. Given the $m$th annotated box, which belongs to class $C_m$, the box is first mapped linearly onto the feature map scale. Then, the 2D Gaussian kernel

$$K_m(x, y) = \exp\left(-\frac{(x - \tilde{p}_x)^2}{2\sigma_x^2} - \frac{(y - \tilde{p}_y)^2}{2\sigma_y^2}\right) \tag{8}$$

is used to produce $H_m \in \mathbb{R}^{1 \times \frac{H}{r} \times \frac{W}{r}}$, where $\sigma_x = \eta w / 6$, $\sigma_y = \eta h / 6$, $\eta = 0.54$, and $w$ and $h$ are the width and height of the box, respectively. The peak of the heatmap is the center of the corresponding box of class $C_m$. TTFNet then predicts $\hat{H}_m \in \mathbb{R}^{1 \times \frac{H}{r} \times \frac{W}{r}}$ in a way similar to CenterNet. During training, only the peak of the Gaussian distribution is regarded as a positive sample. As in CenterNet, given the predicted heatmap value $\hat{H}_{xyc}$ and the calculated heatmap value $H_{xyc}$, the localization loss is a modified focal loss:

$$L_{\text{loc}} = -\frac{1}{M} \sum_{xyc} \begin{cases} (1 - \hat{H}_{xyc})^{\alpha} \log(\hat{H}_{xyc}), & \text{if } H_{xyc} = 1, \\ (1 - H_{xyc})^{\beta} (\hat{H}_{xyc})^{\alpha} \log(1 - \hat{H}_{xyc}), & \text{else}, \end{cases} \tag{9}$$

where $M$ is the number of annotated boxes.
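The elliptical kernel of equation (8) can be sketched for a single box as follows. This is a simplified illustration: the kernel is evaluated over the whole feature map without truncation, the box is assumed to have positive width and height, and the function name and example box are hypothetical.

```python
import math

def ttfnet_gaussian_heatmap(box, feat_w, feat_h, r=4, eta=0.54):
    """Heatmap for one box, indexed as heat[y][x], following eq. (8)."""
    x1, y1, x2, y2 = (v / r for v in box)            # map the box to feature scale
    w, h = x2 - x1, y2 - y1
    cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)  # rounded-down center
    sigma_x, sigma_y = eta * w / 6.0, eta * h / 6.0
    heat = [[0.0] * feat_w for _ in range(feat_h)]
    for y in range(feat_h):
        for x in range(feat_w):
            heat[y][x] = math.exp(-(x - cx) ** 2 / (2 * sigma_x ** 2)
                                  - (y - cy) ** 2 / (2 * sigma_y ** 2))
    return heat, (cx, cy)

# Hypothetical 64 x 32 pixel box in a 128 x 128 image (32 x 32 feature map).
heat, (cx, cy) = ttfnet_gaussian_heatmap((32, 48, 96, 80), feat_w=32, feat_h=32)
```

Because the box is twice as wide as it is high, the response two cells to the right of the center (`heat[16][18]`) is larger than the response two cells below it (`heat[18][16]`), which is the aspect-ratio awareness described above.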

Figure 3

TTFNet algorithm structure.

For regression, as shown in Figure 4, TTFNet uses the same Gaussian kernel as the localization branch to construct a sub-region around the target center and treats all pixels in this region as positive samples for size regression. The regression target is the distance from the sample point to the four boundaries. For a pixel $(i, j)$ in the sub-region and the downsampling rate $r$, the regression target is the distance $(w_l, h_t, w_r, h_b)_{ij}^{m}$ from $(ir, jr)$ to the four boundaries of the corresponding box, so the predicted box at $(i, j)$ can be expressed as

$$\hat{x}_1 = ir - \hat{w}_l s, \quad \hat{y}_1 = jr - \hat{h}_t s, \quad \hat{x}_2 = ir + \hat{w}_r s, \quad \hat{y}_2 = jr + \hat{h}_b s, \tag{10}$$

where $s = 16$ is a fixed scalar used to enlarge the predicted values and ease optimization, $(\hat{w}_l, \hat{h}_t, \hat{w}_r, \hat{h}_b)$ are the values predicted by the network, and $(\hat{x}_1, \hat{y}_1, \hat{x}_2, \hat{y}_2)$ is the decoded box. Because the predicted box is expressed directly on the image scale, the algorithm does not need an additional center offset loss. TTFNet uses the Generalized Intersection over Union (GIoU) loss between the predicted bounding box and the ground truth as the loss function of the regression branch, which is expressed as

$$L_{\text{reg}} = \frac{1}{N_{\text{reg}}} \sum_{(i, j) \in A_m} \mathrm{GIoU}(\hat{B}_{ij}, B_m) \times W_{ij}, \tag{11}$$

where $N_{\text{reg}}$ is the number of regression samples, $\hat{B}_{ij}$ is the predicted bounding box, and $B_m$ is the ground truth. Furthermore, because targets of different scales produce different numbers of samples, $W_{ij}$ is used to balance the loss contributed by targets of different sizes and thus improve the detection of small targets. The sample weights $W_{ij}$ are defined as follows:

$$W_{ij} = \begin{cases} \log(a_m) \times \dfrac{G_m(i, j)}{\sum_{(x, y) \in A_m} G_m(x, y)}, & (i, j) \in A_m, \\ 0, & (i, j) \notin A_m, \end{cases} \tag{12}$$

where $a_m$ is the area of the $m$th annotated box, $G_m(i, j)$ is the Gaussian probability value at $(i, j)$, and $A_m$ is the Gaussian region.
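The effect of the normalization in equation (12) can be checked numerically. In the sketch below the Gaussian region is represented as a dictionary of pixel coordinates, the box area is assumed to be greater than 1 so that the logarithm is positive, and all names and values are illustrative assumptions.

```python
import math

def regression_weights(gauss, area):
    """Eq. (12): weights for the pixels inside the Gaussian region A_m.

    gauss -- dict {(i, j): G_m(i, j)} for the pixels in A_m
    area  -- a_m, the box area (assumed > 1)
    Pixels outside A_m are simply absent, which corresponds to weight 0.
    """
    norm = sum(gauss.values())
    return {ij: math.log(area) * g / norm for ij, g in gauss.items()}

# Hypothetical three-pixel region, weighted for a small and a large box.
gauss = {(5, 5): 1.0, (4, 5): 0.5, (6, 5): 0.5}
small = regression_weights(gauss, area=100.0)
large = regression_weights(gauss, area=10000.0)
```

The weights of one box always sum to $\log(a_m)$, independently of how many pixels the region contains. A box with 100 times the area therefore receives only twice the total weight in this example, so large targets do not dominate the regression loss.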

Figure 4

Training sample definition strategies of CenterNet and TTFNet.
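The box decoding step of the regression branch can be illustrated with a few lines of arithmetic. The sketch follows the TTFNet formulation, in which the left and top distances are subtracted from the sample location and the right and bottom distances are added; the function name and the example prediction are hypothetical.

```python
def decode_box(i, j, pred, r=4, s=16):
    """Decode a box on the image scale from the outputs at feature pixel (i, j).

    pred -- (w_l, h_t, w_r, h_b): raw network outputs, each multiplied by s
    """
    w_l, h_t, w_r, h_b = pred
    x1 = i * r - w_l * s  # left edge
    y1 = j * r - h_t * s  # top edge
    x2 = i * r + w_r * s  # right edge
    y2 = j * r + h_b * s  # bottom edge
    return (x1, y1, x2, y2)

# Feature pixel (20, 10) corresponds to image location (80, 40).
box = decode_box(i=20, j=10, pred=(1.0, 0.5, 2.0, 1.5))
```

The result is `(64.0, 32.0, 112.0, 64.0)`. Note that the sample pixel does not have to be the box center, which is what allows every pixel in the Gaussian region to act as a regression sample.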

In summary, the total loss of the TTFNet algorithm can be expressed as follows:

$$L_{\text{ttf}} = w_{\text{loc}} L_{\text{loc}} + w_{\text{reg}} L_{\text{reg}}, \tag{13}$$

where $w_{\text{loc}} = 1.0$ and $w_{\text{reg}} = 5.0$.
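The regression term of this total relies on GIoU. The following sketch computes GIoU for two axis-aligned boxes and combines the two loss terms with the weights given above. It assumes boxes with positive area and the common loss form 1 − GIoU; the function names and example boxes are illustrative.

```python
def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2); the value lies in (-1, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # area of the smallest box enclosing both a and b
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def ttfnet_total_loss(l_loc, l_reg, w_loc=1.0, w_reg=5.0):
    """Eq. (13): weighted sum of localization and regression losses."""
    return w_loc * l_loc + w_reg * l_reg

gt = (0.0, 0.0, 4.0, 4.0)
loss_same = 1.0 - giou(gt, gt)                     # perfect prediction
loss_far = 1.0 - giou((10.0, 0.0, 14.0, 4.0), gt)  # no overlap at all
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU keeps decreasing as the boxes move apart, so `loss_far` exceeds 1 and still provides a useful training signal.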

2.3 Algorithm improvement

The TTFNet algorithm performs very well: it can shorten the training time several-fold, or even more than tenfold, while maintaining fast inference and better accuracy than CenterNet. Therefore, the following improvements are built on the TTFNet algorithm.

2.3.1 Backbone network

Wang et al. [21] attributed the heavy inference computation of earlier neural network architectures to duplicated gradient information during network optimization, and they designed the CSPNet structure shown in Figure 5 using the idea of splitting and merging. In this structure, the feature map is first divided into two parts in a given proportion; the two parts are propagated forward separately and then merged into a complete feature map. This structure not only enriches the gradient propagation paths but also reduces memory traffic, balances the computation, and improves the learning efficiency of the network.

Figure 5

CSPNet structure.

The underground object detection task involves only a few types of targets, and the computing equipment is limited, so a very deep or very wide backbone network is not needed to extract features. However, considering the complex underground environment, this article introduces pooling into the basic module of CSPNet to enhance the robustness of the algorithm and further improve the computing speed. For a downsampling rate of 2, the improved CSPNet structure is shown in Figure 6. The right branch of the module is kept unchanged, and 2 × 2 average pooling or max pooling is added before the convolution of the left branch to perform the downsampling. For ease of description, these two structures are called AvgCSPNet and MaxCSPNet, respectively.
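The difference between the two pooling variants can be seen on a small single-channel map. The sketch below is plain Python rather than the authors' PyTorch code, assumes even height and width, and uses a made-up input.

```python
def pool2x2(fmap, mode="max"):
    """2 x 2 pooling with stride 2 on a single-channel map given as a list of rows."""
    out = []
    for y in range(0, len(fmap), 2):
        row = []
        for x in range(0, len(fmap[0]), 2):
            window = [fmap[y][x], fmap[y][x + 1],
                      fmap[y + 1][x], fmap[y + 1][x + 1]]
            row.append(max(window) if mode == "max" else sum(window) / 4.0)
        out.append(row)
    return out

# Hypothetical 4 x 4 map with isolated strong responses (8.0 and 4.0).
fmap = [[0.0, 0.0, 1.0, 1.0],
        [0.0, 8.0, 1.0, 1.0],
        [2.0, 2.0, 0.0, 4.0],
        [2.0, 2.0, 0.0, 0.0]]
max_out = pool2x2(fmap, "max")
avg_out = pool2x2(fmap, "avg")
```

Both variants halve the resolution without adding any parameters. Max pooling keeps the isolated peaks (`[[8.0, 1.0], [2.0, 4.0]]`), whereas average pooling spreads them over the window (`[[2.0, 1.0], [2.0, 1.0]]`).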

Figure 6

CSPNet structure used in this paper.

2.3.2 Multi-layer feature fusion

As shown in Figure 7(a), the multi-layer feature fusion method adopted by the TTFNet algorithm directly adds the low-level features to the upsampled features of the same dimension. However, this addition may weaken some of the learned features. To retain these features as far as possible, this study instead concatenates the low-level features and the upsampled features along the channel dimension. A 1 × 1 convolution is then used to reduce the result to the required number of channels, which completes the multi-layer feature fusion. The improved structure is shown in Figure 7(b).
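The relation between the two fusion schemes can be illustrated at a single pixel, where a 1 × 1 convolution reduces to a matrix-vector product. The feature values and weight matrices below are hypothetical and chosen by hand; in the network the weights would be learned.

```python
def fuse_add(low, up):
    """Addition-based fusion: element-wise sum of two C-channel feature vectors."""
    return [a + b for a, b in zip(low, up)]

def fuse_concat_1x1(low, up, weight):
    """Concatenate along channels (2C values), then apply a 1 x 1 convolution.

    weight has shape [C_out][2C]; bias is omitted for brevity.
    """
    cat = low + up
    return [sum(w * v for w, v in zip(row, cat)) for row in weight]

low, up = [1.0, 2.0], [3.0, -2.0]

# Weights that reproduce plain addition: addition is a special case.
w_add = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]]
# Weights that keep channel 0 of `low` and channel 1 of `up` unmixed.
w_keep = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]

added = fuse_add(low, up)
same_as_add = fuse_concat_1x1(low, up, w_add)
kept = fuse_concat_1x1(low, up, w_keep)
```

With addition, the second channel cancels to zero (2.0 + (−2.0)), which is one way features can be weakened. Concatenation followed by a 1 × 1 convolution can reproduce addition but can also learn other combinations, at the cost of the extra convolution weights.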

Figure 7

Multi-layer feature fusion. (a) TTFNet upsampling and (b) our upsampling.

2.3.3 Residual shrinkage structure

The underground working environment is harsh, and dust is easily produced. Combined with uneven lighting and equipment movement, this means that the images taken by the camera on the LHD usually contain a lot of noise. In addition, because targets have irregular shapes, the annotated boxes inevitably contain some background or overlap other targets, which can also be regarded as noise during detection. Therefore, it is necessary to strengthen the network's ability to deal with noise.

The deep residual shrinkage network [22] is a network structure originally proposed for fault diagnosis. Its core idea is to let the neural network learn a set of small positive thresholds and then set the features whose absolute values are below the thresholds to zero. In theory, because noise contributes little to detecting the target and its feature values are small, it is set to zero during training, which reduces noise and accelerates the convergence of the algorithm. Therefore, this method is introduced into the network to reduce the impact of underground noise on detection. The residual shrinkage unit (RSBU) can be regarded as a special filter, and its basic structure is shown in Figure 8(a). The structure first takes the absolute value of the input features and computes the mean of each channel, and then learns a scale factor for each channel through two fully connected layers and the sigmoid function. Finally, the threshold of each channel is obtained by multiplying the scale factor by the mean of the corresponding channel, and soft thresholding is applied. Let $x_c$ and $y_c$ denote the input and output values of channel $c$, respectively; soft thresholding can then be described as

$$y_c = \begin{cases} x_c - \tau_c, & x_c > \tau_c, \\ 0, & -\tau_c \le x_c \le \tau_c, \\ x_c + \tau_c, & x_c < -\tau_c, \end{cases} \tag{14}$$

$$\tau_c = \alpha_c \cdot \operatorname{average}_{h, w} \left| x_{h, w, c} \right|, \tag{15}$$

$$\alpha_c = \frac{1}{1 + e^{-z_c}}, \tag{16}$$

where $\tau_c$ is the threshold of channel $c$, $\alpha_c$ is the scale factor, and $z_c$ is the output of the fully connected layers. The residual shrinkage unit uses two fully connected layers to learn the soft threshold, but this process involves dimensionality reduction, and fully connected layers are inefficient and slow to propagate, which consumes computing resources and time.
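Equations (14) to (16) can be traced on a single channel with a few numbers. In the sketch below, the fully connected output `z` is passed in as a hypothetical scalar instead of being computed by layers, and the activations are made up.

```python
import math

def soft_threshold(x, tau):
    """Eq. (14): shrink x toward zero by tau and zero out small values."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def shrink_channel(values, z):
    """Apply eqs. (15) and (16) to all activations of one channel.

    values -- activations x_{h,w,c} of the channel, flattened
    z      -- output z_c of the fully connected layers (hypothetical input here)
    """
    alpha = 1.0 / (1.0 + math.exp(-z))                       # eq. (16), in (0, 1)
    tau = alpha * sum(abs(v) for v in values) / len(values)  # eq. (15)
    return [soft_threshold(v, tau) for v in values], tau

# Mean absolute value is 2.0 and z = 0 gives alpha = 0.5, so tau = 1.0.
out, tau = shrink_channel([4.0, -0.5, 0.5, -3.0], z=0.0)
```

The two small activations are set to zero and the two large ones are shrunk by the threshold, giving `[3.0, 0.0, 0.0, -2.0]`. Because the scale factor is below 1, the threshold never exceeds the mean absolute value of the channel, so the channel is not zeroed out entirely unless all its activations share the same magnitude.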

Figure 8

Structure of the residual shrinkage unit before and after improvement. (a) Residual shrinkage unit and (b) efficient residual shrinkage unit.

Therefore, the original structure is improved as shown in Figure 8(b): the fully connected layers are replaced by a 1D convolution, which avoids dimensionality reduction, improves the efficiency of the neural network, and reduces the consumption of computing resources [23]. In addition, whereas the threshold learned through the fully connected layers is derived from a single feature layer on its own, the 1D convolution makes it possible to exploit the interaction between multiple different feature layers and thus further enhance the learning ability of the network.

In addition, we use the SiLU function as the activation function to retain some negative features and enhance the nonlinearity while achieving a certain regularization effect. The improved structure is called the efficient residual shrinkage unit (ECRSBU). The ECRSBU adds only a very small number of parameters and computations; it can be regarded as a special attention mechanism and can be used in a plug-and-play manner.
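One possible reading of the improved threshold branch is sketched below: a small 1D convolution slides over the per-channel mean values, so each channel's scale factor depends on its neighboring channels. The kernel size of 3, the zero padding, the function names, and all numeric values are assumptions made for illustration; the paper does not specify them here.

```python
import math

def conv1d_same(signal, kernel):
    """1D convolution over the channel axis with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def channel_thresholds(channel_means, kernel):
    """Thresholds tau_c = sigmoid(z_c) * mean_c, with z from the 1D convolution."""
    z = conv1d_same(channel_means, kernel)
    return [m / (1.0 + math.exp(-zc)) for m, zc in zip(channel_means, z)]

def silu(x):
    """SiLU activation x * sigmoid(x); small negative inputs are not zeroed."""
    return x / (1.0 + math.exp(-x))

means = [2.0, 1.0, 4.0, 0.5]                    # hypothetical mean |x| per channel
mixed = conv1d_same(means, [1.0, 1.0, 1.0])     # each output sees its neighbors
taus = channel_thresholds(means, [0.0, 0.0, 0.0])  # zero kernel: alpha = 0.5 everywhere
```

The convolution has as many weights as its kernel length, regardless of the number of channels, which is where the saving relative to fully connected layers comes from. `silu(-1.0)` is about −0.27, which illustrates how SiLU retains part of the negative response.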

2.3.4 Overall structure of the algorithm

A feature extraction network, shown in Figure 9, is constructed on the basis of the CSPNet structure. After preprocessing and data augmentation, the original image is resized to 512 × 512 and fed into the backbone network. After five downsampling steps, a feature map of size 16 × 16 is obtained. A feature map of size 128 × 128 is then obtained through three upsampling steps using deformable convolution and bilinear interpolation. During each upsampling step, the shallow features are fused with the current features. Finally, after ECRSBU processing, the features are fed into the detection head and decoded to obtain the network output.
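The spatial sizes along this pipeline follow directly from the number of sampling steps. The short check below assumes that every step changes the resolution by a factor of 2; the function name is illustrative.

```python
def feature_sizes(input_size=512, num_down=5, num_up=3):
    """Spatial size after each downsampling and upsampling step (factor 2 each)."""
    sizes = [input_size]
    for _ in range(num_down):
        sizes.append(sizes[-1] // 2)
    for _ in range(num_up):
        sizes.append(sizes[-1] * 2)
    return sizes

sizes = feature_sizes()
output_stride = sizes[0] // sizes[-1]
```

This gives 512 → 256 → 128 → 64 → 32 → 16 → 32 → 64 → 128, so the output stride is 512 / 128 = 4, matching the value of $r$ used in the loss definitions.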

Figure 9

Overall structure.

3 Experiment

3.1 Data preparation

This study is based on the Pulang Copper Mine in Shangri-La, Yunnan Province, the largest underground non-ferrous metal mine in China. Its underground LHD, a Sandvik LH514, is operated by video-based remote control. All data in the experiment were collected from LHD cameras at different angles during operation, at a resolution of 1,280 × 720 pixels. Selected frames were used as training data, giving a total of 2,149 images. Three categories were annotated (ore in the bucket, ore heap, and bucket) to study automatic shoveling and working-state judgment of the LHD. All images were annotated with the Labeling software after preprocessing. The labeling is shown in Figure 10, and Figure 11 shows the data set details.

Figure 10

Downhole data labeling.

Figure 11

Dataset details.

3.2 Experiments and results

All experiments in this study were built and trained on Windows 10 with an AMD Ryzen 7 4800H CPU, an NVIDIA GeForce GTX 1650 Ti GPU, the PyTorch 1.8.1 deep learning framework with CUDA 11.1, and Python 3.9.2.

To evaluate the quality of the model comprehensively, this study uses indicators commonly used in object detection: precision (P), recall (R), mean average precision (mAP), and mAP@.5:.95.
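For reference, precision and recall follow from the counts of true positives, false positives, and false negatives, and mAP@.5:.95 is conventionally the mean of the average precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below uses hypothetical counts and is not tied to the evaluation code of this study.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def iou_thresholds():
    """IoU thresholds averaged over in mAP@.5:.95 (COCO convention)."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed objects.
p, r = precision_recall(tp=90, fp=10, fn=30)
thresholds = iou_thresholds()
```

With these counts the precision is 0.9 and the recall is 0.75: most detections are correct, but a quarter of the objects are missed. A detector can therefore have high precision and low recall at the same time.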

In the experiment, the improved model uses mosaic, flipping, color transformation, and other data augmentation methods, with SiLU as the activation function. Because the TTFNet algorithm converges quickly, training runs for only 24 or 60 epochs. The network input image size is 512 × 512, the mini-batch size is 4, and the Adam optimizer is used with the momentum set to 0.9. The learning rate follows a cosine annealing schedule with an initial value of 0.0002.
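A cosine annealing schedule with these settings might look as follows. The sketch assumes one update per epoch, a minimum learning rate of 0, and no warm restarts, none of which are stated in the text.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init=2e-4, lr_min=0.0):
    """Learning rate at a given epoch under a single cosine decay cycle."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term

# Schedule for a 60-epoch run, evaluated at epochs 0 to 60.
schedule = [cosine_annealing_lr(e, 60) for e in range(61)]
```

The rate starts at 0.0002, passes through half of that value at epoch 30, and reaches the minimum at epoch 60, changing slowly at both ends and fastest in the middle of training.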

3.3 Backbone network comparison experiment

Figure 12 shows the convergence of different backbone networks trained on the underground data set. All the networks tend to converge after 20 training epochs, and the network structure proposed in this study converges to a similar level with far fewer parameters.

Figure 12

Loss function.

Table 1 compares the specific experimental results. Compared with the ResNet backbones, the recall and mAP of the improved network are slightly lower, but the precision is higher, and the overall level is similar. In addition, MaxCSPNet, which uses max pooling on the network branch, achieves a recall 1.5 percentage points higher, a mAP@.5:.95 1.7 percentage points higher, and a mAP 4.3 percentage points higher than AvgCSPNet, which uses average pooling, indicating that MaxCSPNet extracts features clearly better than AvgCSPNet. Moreover, the MaxCSPNet backbone achieves accuracy close to that of the other backbone networks while significantly reducing the number of parameters and the computational overhead.

Table 1

Backbone network comparison experiment

Backbone network	Epochs	P (%)	R (%)	mAP@.5:.95 (%)	mAP (%)	Parameters (M)
ResNet18	60	97.7	89.6	73.8	90.5	14.4
ResNet34	60	97.5	90.5	75.0	90.1	24.5
MobileNet	60	96.4	77.9	66.8	76.7	10.1
CSPNet	60	98.9	82.8	68.7	82.6	10.5
AvgCSPNet	60	98.8	86.8	71.7	85.0	7.6
MaxCSPNet	60	98.0	88.3	73.4	89.3	7.6

3.4 Ablation experiment

Table 2 compares the results of adding different structures to the model. Improving the feature fusion method of the network improves the performance of the algorithm. Adding the RSBU structure to the backbone network improves the precision by 0.5 percentage points, the recall by 0.9 percentage points, and the mAP@.5:.95 by 0.5 percentage points on the underground data set, which illustrates the effectiveness of RSBU in the object detection algorithm. When RSBU is replaced with ECRSBU, the precision decreases by 0.5 percentage points, while the recall improves by 0.2 percentage points and the mAP@.5:.95 improves by 0.7 percentage points. Moreover, ECRSBU introduces fewer parameters to the network and is computationally more efficient. However, when the CBAM attention structure is further added to the model, the overall performance of the algorithm decreases rather than improving as expected. When CBAM is replaced by the efficient channel attention (ECA) structure, the performance of the algorithm improves further. A possible reason is that the residual shrinkage structure itself is equivalent to a spatial attention mechanism, so adding the mixed attention of CBAM makes the attention effect excessive and eventually reduces the performance of the algorithm, whereas ECA, which contains only channel attention, complements the residual shrinkage structure.

Table 2

Ablation experiment

Name of the experiment	P (%)	R (%)	mAP@.5:.95 (%)
MaxCSPNet (TTFNet contact)	97.8	87.8	73.3
MaxCSPNet	98.0	88.3	73.4
MaxCSPNet + RSBU	98.5	89.2	73.9
MaxCSPNet + ECRSBU	98.0	89.4	74.6
MaxCSPNet + ECRSBU + CBAM	98.4	85.6	70.6
MaxCSPNet + ECRSBU + ECA	98.7	89.7	75.3

To explore how the relative position of the residual shrinkage structure and the attention mechanism in the network affects the algorithm, the experiments in Table 3 were conducted; Figure 13 shows the corresponding changes in the loss function. The results show that the algorithm achieves faster speed and higher performance when the residual shrinkage structure is placed in front of the attention mechanism.

Table 3

Ablation experiment

Structural sequence	P (%)	R (%)	mAP@.5:.95 (%)	FPS
MaxCSPNet + ECRSBU + CBAM	98.4	85.6	70.6	47
MaxCSPNet + CBAM + ECRSBU	98.4	86.3	70.4	45
MaxCSPNet + ECRSBU + ECA	98.7	89.7	75.3	47
MaxCSPNet + ECA + ECRSBU	98.7	89.4	74.9	46

Figure 13

Different structural sequence loss.

3.5 Comparative experiment

To further verify the performance of the algorithm, we compared it with current mainstream object detection algorithms on the underground data set; the results are shown in Table 4. The Faster RCNN algorithm has a high recall on the underground data set, but it often misidentifies the background as a target. The indicators of the YOLO series algorithms all reach a high level. The precision of the CenterNet algorithm is significantly higher than that of the other algorithms, but its recall is low. Compared with the CenterNet algorithm, our algorithm improves the precision by 1.6 percentage points, the recall by 3.8 percentage points, and the mAP@.5:.95 by 4 percentage points, which brings it close to the current mainstream object detection algorithms based on anchors and Non-Maximum Suppression (NMS). Among the anchor-free algorithms, its performance is slightly lower than that of YOLOx, but our algorithm is simpler. In addition, the experiments show that algorithms based on the CenterNet method consistently achieve high precision on underground data sets with poor image quality, which is very important for underground equipment to identify targets accurately and act accurately.

Table 4

Comparison of experimental results

Algorithm	Size	Epochs	P(%)	R(%)	mAP@.5:.95(%)
Faster RCNN	400	200	67.2	97.6	—
YOLOv3	640	300	92.5	93.6	72.1
YOLOv4	416	300	93.2	94.2	73.6
YOLOv5	640	300	94.8	91.4	71.8
YOLOv7	640	200	95.2	92.6	74.1
YOLOx	480	200	96.6	92.3	75.6
CornerNet	511	200	95.2	84.7	69.5
CenterNet	512	200	97.1	85.9	71.3
Ours	512	60	98.7	89.7	75.3

3.6 Presentation of results

The heatmap and results of the algorithm are shown in Figure 14.

Figure 14

Algorithm-detected heatmap (top) and results (bottom).

4 Conclusion

In this article, a new anchor-free algorithm for detecting underground targets based on an improved TTFNet is proposed. Aimed at the special environment of underground mines, a new lightweight feature extraction network is designed by introducing pooling into CSPNet, and the residual shrinkage network structure is improved and added to enhance the ability of the algorithm to deal with noise. The model is trained on an underground mine data set that was collected and labeled independently. The experimental results show that the improved algorithm achieves higher accuracy and that its overall performance is close to that of current mainstream detection algorithms based on anchor boxes and non-maximum suppression. In comparison, the post-processing of the proposed algorithm is simple, no anchor boxes need to be designed manually, and the model converges quickly. The algorithm is therefore better suited to applications that require rapid deployment, have limited computing resources, or operate in harsh environments.

Funding information: Authors state no funding involved.
Author contributions: Zhen Song and Xuwen Qing: conceived the idea of the study. Zhen Song, Xuwen Qing, and Yuting Men: conducted research, wrote code and experiments; Xuwen Qing, Meng Zhou, and Yuting Men: wrote and translated the initial draft of this article; All authors discussed the results and revised the manuscript.
Conflict of interest: The authors declare no conflict of interest.
Data availability statement: The data presented in this manuscript are available from the corresponding author upon request.

References

[1] Q. Sun, Research on the control of autonomous driving of intelligent scraper, M. S. Thesis, Department of Control Engineering, University of Jinan, Jinan, China, 2020. Search in Google Scholar

[2] Y. Yao, “Application and development of large-scale and efficient intelligent mining technology and equipment in underground mines,” Mining Equipment, vol. 4, pp. 17–20, 2018. Search in Google Scholar

[3] D. Jiang and L. Wang, “Present situation and development trend of self-loading technology for underground load-haul-dump,” Gold Science and Technology, vol. 29, no.1, pp. 35–42, 2021. Search in Google Scholar

[4] J. Wu, Research on image enhancement and target tracking in underground mine, M.S. Thesis, Dept. College of Inf. and Comput., Taiyuan University of Technology, Taiyuan, China, 2021. Search in Google Scholar

[5] S. Ren, K. He, R. Girshick, J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6. pp. 1137–1149, 2016. 10.1109/TPAMI.2016.2577031Search in Google Scholar PubMed

[6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, et al., “SSD: Single shot multibox detector,” In: Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 21–37, doi: 10.1007/978-3-319-46448-0_2.

[7] T. Zhang, Z. Li, Z. Sun, and L. Zhu, “A fully convolutional anchor-free object detector,” The Visual Computer, vol. 39, no. 2, pp. 569–580, 2023, doi: 10.1007/s00371-021-02357-2.

[8] H. Law and J. Deng, “CornerNet: Detecting objects as paired keypoints,” International Journal of Computer Vision, vol. 128, no. 2, pp. 642–656, 2020, doi: 10.1007/s11263-019-01204-1.

[9] Z. Tian, C. Shen, H. Chen, and T. He, FCOS: Fully convolutional one-stage object detection, ICCV, Seoul, Korea, 2019, doi: 10.48550/arXiv.1904.01355.

[10] D. Li, J. Qian, and Y. Chai, “The object detecting in dangerous areas for coal mine underground,” Journal of China Coal Society, vol. 36, no. 3, pp. 527–532, 2011.

[11] W. Li, C. Wei, and L. Wang, “Improved Faster RCNN approach for pedestrian detection in underground coal mine,” Computer Engineering and Applications Journal, vol. 55, no. 4, pp. 200–207, 2019.

[12] T. Cui and L. Wang, “Research on application of YOLOV4 object detection algorithm in monitoring on masks wearing of coal miners,” Journal of Safety Science and Technology, vol. 17, no. 10, pp. 66–71, 2021.

[13] A. Bochkovskiy, C. Y. Wang, and H. Y. M. Liao, YOLOv4: Optimal speed and accuracy of object detection, 2020. [Online]. https://arxiv.org/pdf/2004.10934.pdf.

[14] F. Zhang, J. Luan, D. Cui, and Z. Xu, “SSD-LeNet based method of mine moving target detection and recognition,” Journal of Mining Science and Technology, vol. 6, no. 1, pp. 100–108, 2021.

[15] S. Hao, X. Zhang, X. Ma, S. Sun, and H. Wen, “Foreign object detection in coal mine conveyor belt based on CBAM-YOLOv5,” Journal of China Coal Society, vol. 47, no. 11, pp. 4147–4156, 2022.

[16] D. Zhang and Y. Jiang, “Lightweight target detection method of drilling rig based on attention mechanism and inverse residual structure,” Journal of Electronic Measurement and Instrument, vol. 36, no. 11, pp. 201–210, 2022.

[17] C. Du, “Anti-collision system of mining and transportation equipment in coal mine based on multi-technology integration,” Journal of China Coal Society, vol. 45, no. S2, pp. 1060–1068, 2020.

[18] Z. Liu, T. Zeng, G. Xu, Z. Yang, H. Liu, and D. Cai, “Training-time-friendly network for real-time object detection,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11685–11692, 2020, doi: 10.1609/aaai.v34i07.6838.

[19] X. Zhou, D. Wang, and P. Krähenbühl, Objects as points, 2019. [Online]. https://arxiv.org/pdf/1904.07850.pdf.

[20] Z. Luo, Object detection algorithm in complex background based on machine vision, M.S. Thesis, Department of Mechanical Engineering, University of Electronic Science and Technology of China, Chengdu, China, 2021.

[21] C. Y. Wang, H. Y. Mark Liao, Y. H. Wu, P. Y. Chen, J. W. Hsieh, and I. H. Yeh, CSPNet: A new backbone that can enhance learning capability of CNN, CVPRW, Seattle, WA, USA, 2020, pp. 1571–1580, doi: 10.1109/CVPRW50498.2020.00203.

[22] M. Zhao, S. Zhong, X. Fu, B. Tang, and M. Pecht, “Deep residual shrinkage networks for fault diagnosis,” IEEE Transactions on Industrial Informatics, vol. 16, no. 7, pp. 4681–4690, 2019, doi: 10.1109/TII.2019.2943898.

[23] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, ECA-Net: Efficient channel attention for deep convolutional neural networks, CVPR, Seattle, WA, USA, 2020, pp. 11531–11539, doi: 10.1109/CVPR42600.2020.01155.

Received: 2023-03-12

Revised: 2024-05-23

Accepted: 2024-08-13

Published Online: 2024-11-26

This work is licensed under the Creative Commons Attribution 4.0 International License.
