Abstract
Failure to wear a helmet correctly is a significant cause of injury and death in the construction industry and industrial production. Traditional supervision relies predominantly on manual oversight, which is costly and inefficient. To address this issue, the present work introduces an intelligent detection technology based on deep learning that trains the YOLOv5 algorithm on a helmet dataset, enabling real-time assessment of whether personnel are wearing helmets correctly and promptly issuing warnings when violations are detected. To mitigate missed and false detections in complex backgrounds, the model is further enhanced by optimizing the Generalized Intersection over Union, Distance Intersection over Union, and Complete Intersection over Union loss functions and by improving the Mosaic-9 data enhancement algorithm. Empirical results validate the system’s efficacy: the optimized YOLOv5 algorithm achieves a precision rate of 93.16% and a recall rate of 88.96%. These findings underscore the system’s ability to accurately identify workers’ improper helmet usage. The enhanced YOLOv5-based intelligent detection technology provides a more efficient and accurate method for monitoring helmet compliance in the construction industry and industrial production, addressing the limitations of traditional manual supervision while maintaining precision in complex operational contexts.
1 Introduction
Helmets can mitigate and disperse head injuries resulting from external impacts, thereby reducing the likelihood of fatalities and severe injuries in the event of a safety-related incident [1]. Monitoring and enforcing safety helmet compliance primarily rely on manual methods. Nevertheless, conventional manual oversight falls short of meeting the demands of contemporary construction safety management.
Target detection through video analysis and image understanding has garnered growing interest. Presently, two primary classes of algorithms are employed for the automatic identification of helmet wearing: those rooted in machine vision and those based on deep learning. Machine vision-based methods for helmet detection leverage features such as helmet shape, size, and colour. To illustrate, Rubaiyat et al. [2] extracted features from construction sites and helmets, employing the Histogram of Oriented Gradients (HOG) for helmet detection. Li et al. [3] used support vector machines (SVMs) and classifiers such as random forests to identify worker and helmet positions. e Silva et al. [4] employed the Hough transform and a multi-layer perceptron for feature extraction and detection analysis. In machine vision-based target detection algorithms, as depicted in Figure 1, a sliding window strategy is employed to attain localized target detection using windows of varying sizes and aspect ratios. This method hinges on feature engineering and requires extracting target regions from manually crafted features. Such features have clear limitations and give little control over the algorithm’s generalization. The method proves inadequate for precise safety helmet detection when the construction environment or lighting conditions change, or when the helmet target is partially obscured.
![Figure 1: Schematic representation of the detection steps in conventional target detection methods [5].](/document/doi/10.1515/comp-2024-0017/asset/graphic/j_comp-2024-0017_fig_001.jpg)

Schematic representation of the detection steps in conventional target detection methods [5].
In recent years, research both domestically and internationally has shifted its emphasis from manual feature-based algorithms to deep learning-based target detection. Girshick et al. [6] introduced the Region-based Convolutional Neural Network (R-CNN) in 2014. This model applies a CNN to bottom-up region proposals for target detection, combining candidate region selection with CNN feature extraction. Building upon this foundation, Girshick [7] and Ren et al. [8] developed Fast R-CNN and Faster R-CNN, respectively. These models streamline candidate frame generation by incorporating the region proposal network and region-of-interest pooling operations.
Consequently, both accuracy and detection speed improve significantly. Fang et al. [9], along with Chen et al. [10] and Guo et al. [11], developed helmet detection algorithms that combine Faster R-CNN with deep CNNs. These algorithms employ a multiscale training mechanism and a feature map fusion technique, addressing occluded and small-target detection and thus enhancing overall detection performance. Their drawback, however, lies in generally slow detection speed caused by the two-stage candidate region approach, limiting their suitability for real-time detection tasks.
Target detection algorithms are primarily categorized into two groups. One comprises regression-based (single-stage) detection algorithms, mainly including You Only Look Once (YOLO) [12], the Single Shot MultiBox Detector (SSD) [13], and RetinaNet [14]. Given the significant safety risks in the construction industry that helmet wearing helps mitigate, Wei et al. [15] proposed a deep learning-based method for real-time helmet detection among workers: the YOLOv5s network is enhanced with multiple attention mechanisms, a BiFPN bidirectional feature pyramid network, and Soft-NMS post-processing, and the loss function is optimized to accelerate convergence and detection. Their bilateral feature enhancement layer (BiFEL)-YOLOv5s [15] model, which integrates the BiFPN network and Focal-EIoU Loss, achieves a 0.9% increase in average precision, a 2.8% boost in recall rate, and improved detection speed, making it well suited for real-time helmet detection in diverse work environments. Li et al. [16] developed a deep learning method for real-time detection of missing safety helmets at construction sites, utilizing the SSD-MobileNet algorithm based on convolutional neural networks. The model was trained on 3,261 images sourced from video monitoring systems and web crawlers, with the dataset divided into training, validation, and test sets at a ratio of approximately 8:1:1. Experimental results demonstrate that the model detects unsafe helmet usage with high accuracy and efficiency, contributing to enhanced worker safety on construction sites.
Hayat and Morgado-Dias [17] introduced a real-time safety helmet detection system for construction sites using the YOLO architecture. This high-speed system can process 45 frames per second, making it suitable for real-time applications. It was trained on a benchmark dataset of 5,000 hard hat images, split into 60% for training, 20% for testing, and 20% for validation. The YOLOv5x model achieved a mean average precision of 92.44%, showing strong performance even in low-light conditions.
Qian and Yang [18] introduced YOLO_CA, a lightweight model derived from YOLOv5, designed to detect construction workers’ helmet usage automatically and to address the inefficiency and high cost of manual monitoring. The model incorporates coordinate attention (CA) for improved accuracy in complex scenes, employs depthwise separable convolution for parameter compression, replaces the C3 module with a Ghost-based variant, and adopts CIOU_Loss. YOLO_CA demonstrates a 1.13% increase in mean average precision (mAP), a 17.5% reduction in giga floating-point operations per second (GFLOPs), a 6.84% decrease in total parameters, and a 39.58% boost in frames per second, making it well suited for lightweight embedding. Li et al. [19] introduced a streamlined AI detection method for municipal engineering that integrates deep learning with object detection technology, focusing on identifying individuals wearing masks and safety helmets while minimizing computational demands and improving accuracy. By utilizing ShuffleNetv2 for feature extraction and ECA attention for generating detailed feature maps, the method achieves a 4.3% improvement in mean average precision and reduces parameters and computational load by 54.8%.
The other category comprises candidate-region-based (two-stage) target detection methods, beginning with R-CNN [6] and advancing to Fast R-CNN [7] and the more efficient Faster R-CNN [8]. To ensure real-time performance in helmet-wearing detection, we adopt a single-stage YOLOv5-based algorithm. The regression approach consolidates target detection and feature extraction within a single network, eliminating the need for laborious candidate-region segmentation and producing detection results more directly. This alleviates the computational burden of the detection process, substantially improving detection speed and fulfilling the real-time requirements crucial for helmet-wearing detection.
The study proposes a real-time helmet detection method that combines an optimized loss function with an enhanced version of the YOLOv5 algorithm and the Mosaic-9 data enhancement variant. This approach is designed to overcome challenges commonly faced in the construction industry, especially in complex environments where traditional detection algorithms struggle. The optimized loss function, drawing on Generalized Intersection over Union (GIOU), Distance Intersection over Union (DIOU), and Complete Intersection over Union (CIOU), improves the model’s handling of overlapping bounding boxes and enhances localization accuracy by providing more informative gradients during training. The Mosaic-9 technique, a data augmentation strategy that combines multiple images into a single training sample, allows the model to learn from diverse scenarios, improving its robustness to environmental variations. The improved YOLOv5 architecture strengthens feature extraction and multiscale detection, which is crucial for accurately detecting helmets at varying distances. The approach addresses environmental challenges such as cluttered backgrounds and uneven lighting by training the model on images captured under different lighting scenarios, ensuring adaptability to both bright and low-light conditions. The method targets real-time performance, leveraging the speed of the YOLOv5 architecture to process frames quickly, which is essential for timely alerts in safety-critical environments. By achieving higher precision and recall through comprehensive training and continuous feedback mechanisms, the model correctly identifies workers wearing helmets and minimizes false positives, significantly enhancing safety compliance in the construction industry.
2 Deep learning network structure design
2.1 Network structure design
The YOLOv5 framework consists of three main parts: the backbone, the neck, and the prediction head [20]. YOLOv5 employs the Cross Stage Partial Darknet53 (CSPDarknet53) framework as its backbone, utilizing a spatial pyramid pooling layer to extract feature information from the input image. The Path Aggregation Network neck fuses the extracted features and produces feature maps at three scales. The YOLO prediction head uses these feature maps to detect objects. The architectural design of the network framework is shown in Figure 2.

Structure of YOLOv5 target detection network.
The input is a helmet detection image, and the backbone network predominantly comprises two integral components: the Focus structure and the Cross Stage Partial (CSP) structure [21]. The Focus structure performs slicing operations before the image enters the backbone network, preserving as much of the original input information as possible. The CSP structure serves as a transitional layer for input features, leveraging convolutional techniques to bolster the learning capability of the CNN. The neck combines a Feature Pyramid Network (FPN) with a PAN: the FPN up-samples high-level feature information for object recognition at different scales, whereas the PAN forms the bottom pyramid, conveying localization features in a bottom-up manner. The prediction head operates on the resulting feature maps, generating bounding boxes and predicting object categories.
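As a concrete illustration of the slicing operation, the following minimal PyTorch sketch shows how a Focus-style layer rearranges a three-channel input into twelve channels at half resolution before a convolution. It mirrors the publicly documented YOLOv5 Focus module rather than the exact code used in this work, and the channel counts are illustrative.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-interleaved sub-images, concatenate them
    along the channel axis, then fuse with a convolution: (c, H, W) -> (4c, H/2, W/2)."""
    def __init__(self, in_channels=3, out_channels=32, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size,
                              stride=1, padding=kernel_size // 2)

    def forward(self, x):
        # Take every second pixel in four phase-shifted patterns so no
        # information from the original image is discarded.
        sliced = torch.cat([x[..., ::2, ::2],    # top-left pixels
                            x[..., 1::2, ::2],   # bottom-left pixels
                            x[..., ::2, 1::2],   # top-right pixels
                            x[..., 1::2, 1::2]], # bottom-right pixels
                           dim=1)
        return self.conv(sliced)

# Example: a 640 x 640 RGB image becomes a 32-channel 320 x 320 feature map.
feature = Focus()(torch.randn(1, 3, 640, 640))
print(feature.shape)  # torch.Size([1, 32, 320, 320])
```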
2.2 YOLOv5 algorithm
The operational flow of the YOLOv5 algorithm is as follows. Upon input of the helmet image to the network, as shown in Figure 3, YOLOv5 partitions the image into an S × S grid. Each grid cell is responsible for predicting any target whose centre falls within that cell. To avoid image distortion, the original image is resized to S × S dimensions through padding or scaling. To address the inherent variability in the sizes and scales of helmet targets, a range of grid dimensions is used: 13 × 13, 26 × 26, and 52 × 52. These grids segment the image into distinct regions, enabling the identification and classification of large, medium, and small targets, respectively. This deliberate grid differentiation improves the capture of helmet targets spanning various scales, thereby bolstering the network’s overall detection performance. For each grid cell, the prediction comprises a set of five essential parameters: the coordinates of the target’s centre point (“x” and “y”), the dimensions of the target (“width” and “height”), and a confidence score indicating the likelihood that a target is present.

Schematic diagram of YOLOv5 target detection network implementation.
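To make the per-cell prediction concrete, the hedged sketch below decodes a raw head output of shape (S, S, 5) into absolute boxes and confidences. The sigmoid-based centre-offset decoding follows the usual YOLO convention, while the size decoding is a simplified stand-in for YOLOv5’s anchor-based decoding, so the sketch is illustrative only.

```python
import torch

def decode_grid_predictions(raw, img_size=416):
    """Decode an (S, S, 5) tensor of raw predictions into absolute boxes.

    Each cell predicts (tx, ty, tw, th, confidence); the centre offsets are
    squashed into the cell with a sigmoid, and the sizes are taken as a
    fraction of the image as a simplified stand-in for anchor-based decoding.
    """
    S = raw.shape[0]
    cell = img_size / S
    boxes = []
    for row in range(S):
        for col in range(S):
            tx, ty, tw, th, conf = raw[row, col]
            cx = (col + torch.sigmoid(tx)) * cell   # centre x in pixels
            cy = (row + torch.sigmoid(ty)) * cell   # centre y in pixels
            w = torch.sigmoid(tw) * img_size        # width in pixels
            h = torch.sigmoid(th) * img_size        # height in pixels
            boxes.append((cx.item(), cy.item(), w.item(), h.item(),
                          torch.sigmoid(conf).item()))
    return boxes

preds = decode_grid_predictions(torch.randn(13, 13, 5))
print(len(preds))  # 169 candidate boxes for the 13 x 13 grid
```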
Data partitioning and scrutiny are pivotal tenets in deep learning, serving as crucial safeguards against the perils of overfitting. Furthermore, using correlation diagrams is a highly productive approach for pursuing insightful exploratory data analysis. These diagrams facilitate the visual representation of intricate interrelationships within the dataset, empowering researchers to understand the underlying data dynamics comprehensively. As shown in Figure 4, YOLOv5 generates an association map to display the features of the provided image and bounding box. Within this map, four bars provide statistical insights into the parameters “x,” “y,” “height,” and “width.” Visual observation is employed to assess the correlation among distinct labels. When two labels frequently co-occur, the corresponding histogram exhibits a notable peak or heightened prominence. Conversely, in cases where two labels infrequently co-occur, the histogram displays diminished height or the absence of a discernible peak.

Labels_correlogram.
Examining the labels_correlogram proves highly advantageous for comprehending the interrelationships among target labels within the dataset. It makes it possible to determine which labels frequently appear together and how strongly they are correlated, yielding a better understanding of the links between targets and contributing to the optimization of the detection algorithm’s performance and accuracy.
3 Improved YOLOv5 algorithm
Enhancing the YOLOv5 algorithm can significantly elevate the accuracy, speed, and adaptability of target detection, rendering it more competitive and practical for real-world applications. To address challenges such as missed detections caused by intricate construction site backgrounds, the variability of helmet wearing among construction workers, the risk of false helmet detections arising from background similarity, and the preservation of fine-grained features during model training, a series of model optimization techniques have been employed. These optimizations encompass the refinement of the loss functions, specifically GIOU, DIOU, and CIOU, as well as improvements to the Mosaic-9 data augmentation algorithm, all aimed at augmenting model performance.
3.1 Optimization of the loss function
Bounding box regression is a crucial step in target detection; although IOU_Loss and GIOU_Loss are widely used for bounding box regression, existing methods still exhibit problems [22]. The original YOLOv5 framework employs the GIOU_Loss strategy for prediction, which raises concerns about slow convergence and regression inaccuracies. In this work, DIOU and CIOU are used as optimization strategies to improve the loss function of the YOLOv5 helmet detection algorithm. The aim is to evaluate the effect of different loss functions on helmet-wearing detection performance and to obtain a more significant performance improvement.
3.1.1 Intersection over Union (IOU)_Loss
The IOU metric is a standard measure for quantifying the extent of overlap between two bounding boxes. In instances where the area of overlap between two bounding boxes is substantial, the IOU value is correspondingly higher. This fundamental principle for calculating the IOU Loss is elucidated in Figure 5, where the prediction box (box_p) is delineated in red, whereas the ground truth box (box_gt) is depicted in white.

The fundamental principle for calculating the IOU Loss. (a) Predictive and real frames. (b) Intersection of the two. (c) Union of the two.
The intersection of the predicted frame with the true frame is denoted as set A, whereas the union set is represented as set B. The mathematical definitions of sets A and B are provided by (1) and (2), respectively.
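Written in the standard form implied by this description, the definitions referenced as (1) and (2) are:

```latex
A = \mathrm{box\_p} \cap \mathrm{box\_gt}, \qquad (1) \\
B = \mathrm{box\_p} \cup \mathrm{box\_gt}. \qquad (2)
```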
The calculation of the IOU is illustrated in Figure 6.

Schematic diagram of IOU calculation.
The IOU_Loss calculation formula is represented in (3).
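With A and B as defined above, the IOU and the loss referenced as (3) take their standard forms:

```latex
\mathrm{IOU} = \frac{|A|}{|B|}, \qquad
\mathrm{IOU\_Loss} = 1 - \mathrm{IOU}. \qquad (3)
```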
IOU_Loss exhibits certain limitations, rendering it ineffective in addressing the scenarios illustrated in Figure 7.

Special cases where IOU_Loss cannot be optimized. (a) Cases where the two do not overlap coverage. (b) Overlap situation 1. (c) Overlap situation 2.
In Figure 7(a), the prediction box does not overlap with the true box, so the gradient cannot be calculated and the loss cannot reflect the gap between the two. In Figure 7(b) and (c), the predicted and real frames of the helmet target overlap, but their intersection and union areas are equal in the two cases, so they yield the same IOU. It is therefore impossible to distinguish between overlap situations 1 and 2, as they incur the same loss. IOU_Loss cannot effectively handle instances where the predicted frame does not intersect the true frame or where the intersection and union areas are equivalent.
3.1.2 GIOU_Loss
The analysis of IOU reveals the following limitations: (1) IOU returns a value of 0 when there is no intersection between the prediction frame and the real frame, rendering gradient computation infeasible; (2) in cases where the intersection of the prediction frame and the real frame equals the area of their union, determining the relative positioning of the two frames becomes impossible. GIOU_Loss is designed to enhance IOU_Loss by addressing these challenges, as shown in Figure 8.

Schematic diagram of GIOU_Loss. (a) Predictive and real frames. (b) Minimum outer rectangular box. (c) Difference set.
In Figure 8, a minimum bounding box C (coloured in green) is introduced. This bounding box entirely encompasses both the prediction and true boxes, effectively resolving the problem of the prediction box failing to intersect with the true box, resulting in an indeterminate solution. By computing the disparity between the outer bounding box C and the union set B within the IOU framework, the differential set depicted in Figure 8(c) can be obtained, characterized by an area less than that of C, which is non-zero. This facilitates the continuation of gradient calculations. The calculation principle of GIOU_Loss is shown in (4).
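With C denoting the minimum enclosing box and B the union defined earlier, the standard form of the quantity referenced as (4) is:

```latex
\mathrm{GIOU} = \mathrm{IOU} - \frac{|C \setminus B|}{|C|}, \qquad
\mathrm{GIOU\_Loss} = 1 - \mathrm{GIOU}. \qquad (4)
```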
However, when the real box completely contains the predicted box, the difference set remains the same regardless of where the predicted box sits inside it, so the relative position of the anchor box cannot be distinguished, leading to a certain loss of accuracy.
3.1.3 DIOU_Loss
To overcome the defects of IOU_Loss and GIOU_Loss, DIOU_Loss is proposed, which incorporates the normalized distance between the prediction frame and the real frame. The calculation principle of DIOU_Loss is shown in Figure 9.

Schematic diagram of DIOU_Loss. (a) Minimum outer rectangular box. (b) Calculation of diagonal distance. (c) Calculation of the distance from the centre point.
First, the distance between the predicted and real frames is measured by calculating the diagonal distance c in the minimum outer rectangular frame. At the same time, the distance d between the two centre points is calculated to assess the proximity between them further. The specific calculation principle of DIOU_Loss is shown in (5).
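In its standard form, the loss referenced as (5) is:

```latex
\mathrm{DIOU\_Loss} = 1 - \mathrm{IOU} + \frac{d^{2}}{c^{2}}, \qquad (5)
```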
where c² represents the square of the diagonal length of the minimum bounding rectangular box, and d² denotes the square of the distance between the two centre points.
IOU_Loss and GIOU_Loss cannot handle the particular case in which one box encloses the other: the GIOU penalty term vanishes and the loss degenerates to IOU_Loss, conveying no information about the relative position of the boxes. The advantage of DIOU_Loss is that it explicitly penalizes the normalized distance between the two centre points, so the loss still differentiates such configurations and the gradient calculation can proceed. As a result, DIOU_Loss performs better when this particular case is taken into consideration.
3.1.4 CIOU_Loss
A good target frame regression function must combine three key factors: overlap area, centre point distance, and aspect ratio. Analysing these factors together shows that IOU_Loss considers only the overlap region, whereas GIOU_Loss still relies heavily on the IOU calculation. To evaluate the matching of bounding boxes more comprehensively, DIOU_Loss is introduced, which considers not only the overlapping area of the bounding boxes but also the distance between their centre points. Building on DIOU_Loss, the normalization of the aspect ratio is further incorporated, yielding CIOU_Loss. The calculation principle of CIOU_Loss can be expressed by (6)–(8).
where α represents a positive trade-off parameter designed to harmonize losses related to the aspect ratio, and υ quantifies the aspect ratio normalization. The formal definitions of υ and α are provided in (7) and (8).
The trade-off parameter α is defined by (8) in terms of υ and the IOU.
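In their standard forms, consistent with the description above, the expressions referenced as (6)–(8) are:

```latex
\mathrm{CIOU\_Loss} = 1 - \mathrm{IOU} + \frac{d^{2}}{c^{2}} + \alpha \, \upsilon, \qquad (6) \\
\upsilon = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}, \qquad (7) \\
\alpha = \frac{\upsilon}{(1 - \mathrm{IOU}) + \upsilon}. \qquad (8)
```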
This methodological enhancement emphasizes overlapping regions during regression, particularly when overlaps are minimal or non-existent.
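As a practical illustration, the following minimal PyTorch sketch computes the CIOU loss for axis-aligned boxes given in (x1, y1, x2, y2) format. It follows the standard formulation above and is not the exact implementation used in training.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIOU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection and union areas (the IOU term).
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared centre distance d^2 and squared diagonal c^2 of the enclosing box.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    d2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term v and trade-off weight alpha.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return (1 - iou + d2 / c2 + alpha * v).mean()

# Example: loss between one predicted box and its ground truth.
print(ciou_loss(torch.tensor([[50., 50., 150., 150.]]),
                torch.tensor([[60., 60., 160., 170.]])))
```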
3.2 Improved Mosaic data enhancement methods
To increase the diversity of the dataset, given the diversity and complexity of the construction sites, a data enhancement approach is adopted to expand the dataset. Data enhancement is an effective technical tool to generate more diverse and representative data samples by transforming and expanding the original data. In a variable environment such as a construction site, data augmentation enables better simulation and response to various real-world situations, improving the robustness and generalization of the model. Different viewing angles, lighting conditions, occlusion, and other factors are introduced through data augmentation to make the training data closer to the actual scene, improving the accuracy and reliability of the model at various construction sites.
3.2.1 Image preprocessing
Construction sites exhibit a multitude of complexities and variations. To extend the diversity of the dataset, data enhancement is applied, including changes to the hue, saturation, brightness, and contrast of the images as well as operations such as panning and zooming. These operations make it possible to simulate construction scenarios across different periods and weather conditions, such as cloudy, sunny, early morning, and dusk, as shown in Figure 10. A transformation between the HSV (hue, saturation, and value) colour space and the image space is adopted. The method entails two key steps: (1) establishing data enhancement coefficients in the HSV colour space and (2) applying the corresponding data enhancement factors in the image space. Data enhancement enables the detection model to adapt better to various real-world situations and improves its accuracy and robustness across all types of construction sites.

Schematic diagram of data enhancement.
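The sketch below illustrates the kind of HSV-space colour jitter described above, assuming OpenCV and random per-channel gains; the gain ranges and the file name are illustrative rather than the values used in this work.

```python
import cv2
import numpy as np

def augment_hsv(image_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly jitter hue, saturation, and value to simulate different
    times of day and weather; per-channel gains are drawn uniformly around 1."""
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hue, sat, val = cv2.split(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV))

    dtype = image_bgr.dtype  # uint8
    x = np.arange(0, 256, dtype=r.dtype)
    lut_hue = ((x * r[0]) % 180).astype(dtype)         # OpenCV hue range is [0, 180)
    lut_sat = np.clip(x * r[1], 0, 255).astype(dtype)
    lut_val = np.clip(x * r[2], 0, 255).astype(dtype)

    hsv = cv2.merge((cv2.LUT(hue, lut_hue), cv2.LUT(sat, lut_sat), cv2.LUT(val, lut_val)))
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Example: create a brightness/colour-shifted copy of a site image (hypothetical file name).
augmented = augment_hsv(cv2.imread("site_image.jpg"))
```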
3.2.2 Mosaic data enhancement improvements
The training data images are pre-processed more meticulously to further improve detection accuracy. The Mosaic data enhancement method is adopted, which extends and improves on the principles of CutMix [23]. Its core concept is to stitch multiple images together in a mosaic to generate new training samples. In contrast to CutMix, the Mosaic method does not merely merge two images but randomly scales, crops, and arranges four images to create an entirely new image. The essential advantage of this methodology lies in its capacity to faithfully replicate the intricacies of real-world scenarios, thereby enhancing the network’s ability to accommodate diverse environments and backgrounds [24].
YOLOv5 must handle helmet-wearing detection in different work contexts, where complex backgrounds may contain a variety of interfering elements; helmet detection results can be disturbed by buildings, trees, and the sky. Hence, to improve the algorithm’s detection ability in different situations, data enhancement is emphasized to give the network better generalization, so that helmet wearing can be detected accurately in various complex work situations.
The improved Mosaic-9 data enhancement method is shown in Figure 11. The original Mosaic data augmentation has been enhanced with an additional stage that merges a 3 × 3 arrangement of images into a single output image. This augmentation process increases the diversity and complexity of the data, enabling more accurate simulation of the varied changes observed in real-world scenes. The helmet detection dataset gains increased richness through this enhanced Mosaic augmentation technique, which significantly amplifies the diversity and comprehensiveness of the training dataset, strengthens the capacity to detect helmets across various scales, and notably enhances the precision of detecting smaller targets. The improvements to the Mosaic data augmentation methodology encompass three primary stages: scaling, edge-overflow cropping, and internal-overlap cropping.

Mosaic data enhancement schematic.
Step 1: The image dimensions are scaled to conform to the prescribed input constraints of width (W) and height (H). This operation uses scaling multipliers, denoted S_x for the horizontal X-axis and S_y for the vertical Y-axis.
where S_W is the minimum scaling factor for width in the horizontal direction, and S_H is the minimum scaling factor for height in the vertical direction; ΔS_W is the length of the random interval for the width scaling multiplier in the horizontal direction, and ΔS_H is the corresponding length for the height scaling multiplier in the vertical direction. S_W, S_H, ΔS_W, and ΔS_H are hyperparameters, and f_rand denotes the random value function.
The scaled image coordinates are defined as [(A_i, B_i), (C_i, D_i)], which can be obtained from (11)–(14).
where t_1 and t_2 represent the ratios, along the horizontal X-axis, of the distance from the origin to the image’s upper-left corner to the total width; t_3 and t_4 represent the corresponding ratios, along the vertical Y-axis, of the distance from the origin to the image’s upper-left corner to the total height; t_1, t_2, t_3, and t_4 are hyperparameters.
Step 2: The nine images generated by random scaling in Step 1 are subjected to stitching and cropping operations. The images are placed at specified locations according to the coordinates. Because this placement may occasionally extend beyond the designated bounding box, a trimming operation removes any pixel data exceeding the confines of the training image. This step ensures precise handling of regions extending beyond the bounding box, maintaining data integrity and accuracy, as shown in (15) and (16).
Step 3: In this step, overlapping regions within the image are subjected to precision trimming. A set of parallel lines is introduced, bisecting the image along rows and columns. These lines demarcate the interface between the random region and the surrounding area, with the length of the generated segmentation lines denoted as Δt_i. t_i is the ratio of the distance between the coordinates of the segmentation line and the origin to the size of the boundary. F_i are the coordinates of the segmentation line, which can be obtained from (17).
The improved data enhancement algorithm enriches the helmet detection data, enabling the detection of helmet wearing at different scales and improving the detection accuracy of small helmet targets in the image. The data augmentation training process is presented schematically in Figure 12, which shows composite images assembled through splicing and cropping operations. Subsequent processing of these composite images includes the removal of extraneous pixel regions, a critical step that ensures the identification of safety helmet usage in diverse scenarios aligns with the practical requirements of the target detection task.

Training process image of YOLOv5 data enhancement.
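The following minimal sketch illustrates the core idea of the Mosaic-9 stitching in Steps 1–3: nine randomly scaled images are placed on a 3 × 3 canvas, and pixels overflowing each tile are cropped away. Label handling and the internal-overlap trimming of Step 3 are omitted, so this is an illustration under those simplifying assumptions, with hypothetical file paths.

```python
import random
import cv2
import numpy as np

def mosaic9(image_paths, out_size=640, s_min=0.5, s_range=0.5):
    """Stitch nine images into one out_size x out_size training image.

    Each source image is randomly rescaled and then cropped to its
    (out_size/3) x (out_size/3) tile, so pixels overflowing the tile are removed.
    """
    assert len(image_paths) == 9
    tile = out_size // 3
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey background
    for idx, path in enumerate(image_paths):
        img = cv2.imread(path)
        scale = s_min + random.random() * s_range            # random scaling multiplier
        img = cv2.resize(img, None, fx=scale, fy=scale)
        row, col = divmod(idx, 3)
        y0, x0 = row * tile, col * tile
        h, w = min(img.shape[0], tile), min(img.shape[1], tile)
        canvas[y0:y0 + h, x0:x0 + w] = img[:h, :w]            # edge-overflow cropping
    return canvas

# Example with nine hypothetical helmet-dataset images.
mosaic = mosaic9([f"images/train/{i:04d}.jpg" for i in range(9)])
```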
4 Experiments and analyses
4.1 Experimental platforms
Table 1 shows the experimental configuration. The experimental model training process uses the PyTorch deep learning framework. During the testing phase, an Intel® Xeon E3-1230 v3 processor is used as the processing unit to ensure efficient and stable computing performance. The experimental procedures are executed within the Python 3.8 environment, with PyCharm serving as the integrated development environment for project management and debugging.
Description of the environment configuration
Name | Type |
---|---|
CPU | Intel(R) Xeon(R) CPU E3-1230 v3 |
Graphics Card | NVIDIA GeForce GTX 1060 (6G) |
RAM | 16GB DDR3 1,600 MHz |
Hard Drive | KIOXIA-EXCERIA SATA SSD (500GB) |
4.2 Dataset creation and data preprocessing
The dataset’s quality plays a pivotal role in determining the effectiveness of the model’s detection capabilities. To meet the research requirements, the experiments were conducted employing open-source datasets. The preprocessing of image samples within the dataset encompassed the following steps: (1) The image annotation tool LabelImg was used to annotate the image samples in the open-source dataset with helmet targets. During the labelling process, the smallest outer rectangular box of the helmet was selected as the real box to reduce the redundant pixel information on the box’s background. (2) An XML file corresponding to each image’s name was generated upon image annotation. Figure 13 illustrates this annotation process. The XML file comprehensively records essential information, including the primary location data, category names, and coordinate specifics about the helmet tags. (3) A proprietary dataset configuration file was established, replacing the original VOC.yaml with hat.yaml. The number of category classes was adjusted to two to accommodate the helmet category. (4) The annotation files in XML format underwent conversion into the YOLO data format, resulting in the generation of corresponding TXT format files.

Schematic diagram of image tag and xml file tag information.
After all images were labelled, the helmet dataset was built, and the model was trained on this helmet image dataset for helmet-wearing detection, so that helmets can be accurately marked and located while workers are on the construction site.
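For step (4) of the preprocessing, the following sketch illustrates one way to convert a LabelImg XML annotation into a YOLO-format TXT file, assuming the standard VOC layout and the two classes used here (“hat” and “person”); the file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

CLASSES = ["hat", "person"]  # the two classes declared in hat.yaml

def voc_to_yolo(xml_path, out_dir="labels"):
    """Convert one LabelImg XML annotation to a YOLO-format TXT file:
    one 'class cx cy w h' line per object, all values normalized to [0, 1]."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = obj.find("name").text
        if cls not in CLASSES:
            continue
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        cx, cy = (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h
        w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        lines.append(f"{CLASSES.index(cls)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    out = Path(out_dir) / (Path(xml_path).stem + ".txt")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines))

voc_to_yolo("annotations/worker_0001.xml")  # hypothetical file name
```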
4.3 Visualization of training data information
The standardized production of the dataset was completed, and the application scene features of the dataset were analysed to optimize and refine the helmet-wearing detection model. The labelled object classes and box scales were analysed based on the statistical label information. As illustrated in Figure 14, the dataset comprises approximately 90,000 sample instances, categorized into 10,000 instances of individuals wearing helmets, labelled “hat,” and 80,000 instances of individuals without helmets, labelled “person.” This distribution ensures a robust dataset for training, featuring ample samples within each category. Density maps, normalized by the amount of data, of the categories across multiple scenarios versus the dimensions of the labelled boxes were compiled, together with the distributions of the labels’ centre coordinates (center_xy) and of their widths and heights.

Statistical graph of data volume information of the training set.
The scale sizes of the targets show diverse distribution characteristics, and there are targets of different scales. Among these targets, the number of small-scale targets is large and occupies a significant proportion. The multiscale distribution ensures that the target detection task is more generalized, as the algorithms need to effectively handle the detection and recognition of targets of different scales.
4.4 TensorBoard training process visualization
The model training process is conducted on a local computing platform, and it is paramount to meticulously assess and compare the model’s training and validation loss profiles. To facilitate this evaluation, TensorBoard, a visualization tool for monitoring the training process, is used; it allows an intuitive examination of loss and metric variations across epochs, as shown in Figure 15. The tabs on TensorBoard’s top navigation bar visually represent how losses and metrics change with each epoch. In addition, TensorBoard can be used to track training speed, learning rate, and other values.

TensorBoard user interface.
The experimental model was managed with dedicated storage for training records. As shown in Figure 16, TensorBoard is invoked in PyCharm through the command-line terminal: it is started before training and pointed at the folder directory where the training results are recorded. The specific operations are as follows: (1) launch the command-line terminal within PyCharm, (2) invoke the TensorBoard function to initiate the TensorBoard interface, and (3) configure TensorBoard to monitor the designated directory for recording and storing training outcomes before commencing the training process.

Setting the variable log_dir logging directory.
During the network training phase, a variable named “log_dir” was generated to represent the folder name. A timestamp, depicted in red within the directory name, was incorporated to prevent potential conflicts with existing logging directories. This practice ensures the accurate and non-overlapping recording of training data.
4.5 Evaluation metrics and performance analysis of the improved YOLOv5 algorithm
4.5.1 Indicators for model evaluation
As shown in Table 2, the target detection evaluation metrics of precision rate and recall rate are introduced to quantify the model’s helmet detection performance.
Evaluation indicators for target detection
Evaluation indicators | Predicted positive | Predicted negative |
---|---|---|
Actual positive (True) | TP | FN |
Actual negative (False) | FP | TN |
The definitions of Precision and Recall are provided as follows:
Precision rate: The precision rate quantifies the proportion of the model’s positive predictions that are true positives.
The recall rate, also known as the detection rate, is the proportion of true positive instances among all ground-truth objects that should be detected.
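In terms of the entries in Table 2, these rates take their standard forms:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}.
```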
After the training visualization tooling and the directory for recording training results were set up, the performance of the Mosaic-9 + CIOU_Loss + YOLOv5 model was analysed against the CIOU_Loss + YOLOv5 model, using the unimproved YOLOv5 model as a benchmark. The hyperparameter settings during training were kept consistent across models.
As demonstrated in Table 3, the configuration of hyperparameters substantially influences the model’s performance and behaviour, and the judicious selection of appropriate hyperparameters ensures consistent and dependable performance. In this study, we undertake a comprehensive performance analysis of three distinct models: the original YOLOv5 model, the CIOU_Loss + YOLOv5 model, and the Mosaic-9 + CIOU_Loss + YOLOv5 model, examining their behaviour and characteristics in depth. The testing results are presented in Figure 17(a)–(d).
Hyperparameter settings
Name | Hyperparameter |
---|---|
Initial learning rate | lr0: 0.01 |
Final learning rate | lrf: 0.2 |
Weight decay | weight_decay: 0.0005 |
Learning rate warm-up epoch | warmup_epochs: 3.0 |
Classification loss coefficient | cls: 0.5 |
Positive-sample weight for classification loss | cls_pw: 1.0 |
Objectness loss coefficient | obj: 1.0 |
Positive-sample weight for objectness loss | obj_pw: 1.0 |

Test results of the model. (a) Precision rate, (b) Recall rate, (c) training set loss, and (d) loss of validation set.
As shown in Figure 17(c) and (d), both the training set loss and the validation set loss decrease gradually as training proceeds. This indicates that the network’s parameter settings are appropriate and the network structure is reasonably designed, implying that the network is in an ideal state during learning. During YOLOv5 training, the precision and recall curves also vary as the epochs increase. As shown in Figure 17(a) and (b), the number of training rounds is set to 50 epochs. In the initial stage (epochs 0–20), the network learns more features and patterns as training proceeds, and the precision rate increases progressively. As training continues (epochs 20–50), the precision and recall rates gradually stabilize, indicating that the network has learnt the key features and patterns for the target detection task. By 50 epochs, every model exhibits commendable precision and recall. More specifically, the original YOLOv5 model attains its highest precision and recall at 91.74 and 85.95%, respectively; the YOLOv5 model enhanced solely with CIOU_Loss achieves 92.07 and 86.52%, respectively; and the YOLOv5 model augmented with Mosaic-9 + CIOU_Loss yields the most robust detection results, with precision and recall peaking at 93.16 and 88.96%, respectively.
4.5.2 Comparative experimental analysis of loss function performance
To verify the superiority of the CIOU_Loss optimization strategy in the regression task, GIOU_Loss, DIOU_Loss, and CIOU_Loss were each selected as the loss function in separate experiments. Three experiments were conducted with the same dataset samples to train the all-weather YOLOv5 model optimally, and the comparative results of the different loss functions are shown in Figure 18.

Loss function comparison of experimental results. (a) GIOU_Loss, (b) DIOU_Loss, and (c) CIOU_Loss.
According to Figure 18, the trend of the all-weather model can be observed as the number of training iterations increases under different loss functions. From the beginning of the iterations to the end of training, the loss values decrease steadily in all three cases. The final loss value of the model trained with the GIOU_Loss function on the helmet dataset is 0.058, whereas using DIOU_Loss as the loss function results in a loss value of 0.052, an improvement of 0.006; DIOU_Loss aligns better with the bounding-box regression objective than GIOU_Loss. After additionally considering the overlap region, centre point distance, and aspect ratio, CIOU_Loss achieves a loss value of 0.044 in the final iteration.
In contrast, CIOU_Loss converges faster than the first two loss functions, and the improvement value is 0.008 compared to DIOU_Loss, which is a more significant improvement. According to the experimental results, the CIOU_Loss loss function performs more efficiently when optimizing the helmet detection model. A comparison of the experimental detection results is shown in Table 4.
Comparative experimental results of missed and false detection rates

Model | Missed detection rate (%) | False detection rate (%) |
---|---|---|
YOLOv5 GIOU_Loss | 5.98 | 7.86 |
YOLOv5 DIOU_Loss | 5.87 | 7.59 |
YOLOv5 CIOU_Loss | 5.64 | 7.32 |
In network training, the loss function is an intuitive indicator of the model’s convergence stability as the number of iterations increases. As the iteration count rises, the learning curve of the enhanced YOLOv5 algorithm progressively smoothens, indicating gradual convergence, while the loss value steadily declines towards the optimal threshold. Reductions in both the missed detection and false detection rates accompany this trend. A comparative analysis with the original YOLOv5 model shows that employing CIOU_Loss as the loss function yields swifter and more precise regression, substantiating the model’s efficacy.
4.5.3 Improved Mosaic-9 data enhancement performance comparison experiments
An improved data enhancement method, Mosaic-9, is proposed, which significantly increases detection accuracy for the helmet-wearing detection task and satisfies the need for accurate detection under various weather conditions, including severe weather and complex environments. Scenes at different periods and under different weather conditions, including cloudy, sunny, early morning, and dusk, were first simulated using the HSV colour and image spaces. A more comprehensive preprocessing procedure was then applied to the training images to enhance detection precision, and the Mosaic data enhancement method was extended and improved based on the CutMix principles to increase the images’ diversity and information richness. Through such comprehensive preprocessing strategies, the feature information in the images can be mined more effectively, improving the model’s detection ability and achieving higher precision and accuracy; the approach has led to significant progress. To assess the effectiveness of the Mosaic-9 data enhancement algorithm in improving model performance, two comparative experimental groups were set up with the same hyperparameters and number of iterations. The first group was trained with the default, unimproved data enhancement method as the baseline, whereas the second group was trained with the improved Mosaic-9 data enhancement algorithm. Based on the empirical findings, the Mosaic-9 algorithm was successfully integrated into the YOLOv5 framework; this integration not only augmented the diversity of the dataset but also efficiently leveraged data recycling, yielding a notable performance enhancement. Figure 19 illustrates the mAP comparisons across the two sets of trial experiments.

PR curve. (a) Before improvement and (b) after improvement.
In this section, mAP is used as the evaluation metric to assess the performance of the models trained with the Mosaic-9 data enhancement algorithm. The mAP correlates positively with model performance: higher average detection accuracy corresponds to a better model. Figure 19(a) and (b) depicts the comparative analysis before and after the proposed enhancements to the Mosaic data enhancement algorithm. From the experimental results in Figure 19, the mAP of the unimproved data enhancement algorithm is 0.935, whereas the mAP using the improved Mosaic-9 data enhancement algorithm reaches 0.944. Comparing the unimproved data enhancement method with the improved Mosaic-9 algorithm in YOLOv5 verifies the effectiveness of the enhanced optimization method, providing a feasible and effective optimization strategy for the target detection task and a valuable reference for further research on the YOLOv5 model and improved data enhancement methods.
This section is dedicated to the performance comparison evaluation of the enhanced Mosaic data enhancement method. The improved data enhancement algorithm Mosaic-9 shows better performance in helmet-wearing detection. The experimental results after Mosaic-9 data enhancement are presented in Table 5. The evaluation results are analysed, and the model’s mAP is improved by 0.009 after the enhancement. Ultimately, a comparative assessment is conducted to evaluate the performance of the enhanced Mosaic-9 method, substantiating its efficacy.
Experimental results after Mosaic-9 data enhancement
Data enhancement | Group number | Patchwork image format |
---|---|---|
Mosaic-9 | 1 | Six in one |
Mosaic-9 | 2 | Nine in one |
Mosaic-9 | 3 | Four in one |
4.5.4 Comparative experimental results with other existing methods
The precision and recall of the existing techniques and of the improved YOLOv5 helmet-wearing detection algorithm are compared in Table 6. Rubaiyat et al. [2], using HOG, achieved a precision of 81.13% and a recall of 74.59%, while Li et al. [3], using SVM, showed a slightly lower precision of 80.07%, with no recall reported. e Silva et al. [4] combined HOG with local binary patterns (LBP) for an improved precision of 87.84%, though recall is again not reported. Wei et al. [15], using BiFEL-YOLOv5s, reported 86.5% precision and 77.9% recall, balancing both metrics. Li et al. [16], using SSD-MobileNet, achieved 90% precision but a lower recall of 77%. Hayat and Morgado-Dias [17], with YOLO, showed high performance at 92.44% precision and 87.24% recall. Qian and Yang [18], using YOLO_CA, achieved 91.53% precision and 86% recall. Li et al. [19] further improved YOLOv5s to achieve 87.5% precision and 87% recall. Fan et al. [24], using Faster R-CNN, achieved 87.3% precision and 85.9% recall. The highest precision and recall are obtained with the improved YOLOv5 (CIOU_Loss + Mosaic-9), reaching 93.16 and 88.96%, respectively, demonstrating superior performance in both accuracy and coverage. The Mosaic-9 + CIOU_Loss + YOLOv5 all-weather helmet-wearing detection model designed on the improved YOLOv5 therefore attains higher precision and recall than the existing techniques and is selected as the final model for helmet-wearing detection.
Comparative experimental results with other existing methods
Methods | Precision (%) | Recall (%) |
---|---|---|
Rubaiyat et al. [2] (HOG) | 81.13 | 74.59 |
Li et al. [3] (SVM) | 80.07 | — |
e Silva et al. [4] (HOG + LBP) | 87.84 | — |
Wei et al. [15] (BiFEL-YOLOv5s) | 86.5 | 77.9 |
Li et al. [16] (SSD-MobileNet algorithm) | 90 | 77 |
Hayat and Morgado-Dias [17] (YOLO) | 92.44 | 87.24 |
Qian and Yang [18] (YOLO_CA) | 91.53 | 86 |
Li et al. [19] (Improved YOLOv5s) | 87.5 | 87 |
Fan et al. [24] (Faster R-CNN) | 87.3 | 85.9 |
Improved YOLO (CIOU_Loss + Mosaic-9) | 93.16 | 88.96 |
4.6 Model deployment and testing
4.6.1 Overall structural design of helmet-wearing detection method
The overall structure of the helmet-wearing detection method is shown in Figure 20. First, the helmet targets in the images are labelled by an image annotation tool to build the helmet dataset required for training. Then, a suitable convolutional neural network model is designed, and a deep learning environment is built to train the model. Further, the YOLOv5 model is trained on NVIDIA GTX1060 (GPU), and the trained model is used to recognize staff-wearing scenes captured by surveillance cameras. Ultimately, the YOLOv5 model detects helmet-wearing information and transmits the detection results to the backend for real-time monitoring.

Schematic design of the overall structure of the detection method.
4.6.2 Helmet-wearing model testing
The effectiveness of the all-weather model based on Mosaic-9 + CIOU_Loss for helmet-wearing detection is verified by comparing the feature extraction capability of the original model with that of the series-optimized all-weather model. The model can therefore be applied in the final practical detection scheme.
As shown in Figure 21, the YOLOv5 model is effectively implemented on an NVIDIA GeForce GTX 1060, catering to the demands of on-site helmet inspection. A video file captured from an active construction site is employed as input data to assess the model’s performance under load. The results are readily discernible through real-time inspection of successive frame images.

Experimental platform.
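A hedged sketch of the kind of frame-by-frame inference loop used in such a test is shown below, assuming a custom-trained weight file (best.pt) loaded through the public torch.hub interface for YOLOv5 and OpenCV video capture; the file names and the confidence threshold are illustrative.

```python
import cv2
import torch

# Load custom helmet-detection weights through the public YOLOv5 hub entry point.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.5  # confidence threshold for reporting a detection

cap = cv2.VideoCapture("construction_site.mp4")  # hypothetical site recording
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[..., ::-1])              # BGR -> RGB for inference
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        label = results.names[int(cls)]
        if label == "person":                      # head detected without a helmet
            print(f"Warning: worker without helmet (confidence {conf:.2f})")
cap.release()
```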
Figure 22 shows a comparison of the improved YOLOv5 algorithm’s detection results in different scenes. Through the experimental platform, personnel wearing helmets at the construction site can be monitored from the backend in real time, and accurate warning information can be provided.

Comparison of detection results after model improvement. (a) Complex-construction scene detection. (b) Cross-density scene detection. (c) Low-light scene detection. (d) Remote-scene detection.
Figure 22(a) shows the test results of construction personnel helmet-wearing detection performed in complex scenes. In these intricate scenes, the model adeptly detects the cranial region of construction personnel, offering precise discrimination regarding the presence or absence of helmets. The model can recognize helmets in various head postures, lighting conditions, and occlusion situations and accurately classify and locate them. Test results show that the improved YOLOv5 model achieves high detection accuracy and robustness in helmet-wearing detection in complex scenes.
Figure 22(b) shows the results of helmet-wearing detection for construction personnel in cross-density scenes. In the cross-density scenes, the heads of construction workers may cover each other, the lighting conditions may be uneven, and the personnel density is high, increasing the difficulty of Helmet-wearing detection. The test results show that the improved YOLOv5 model performs satisfactorily in helmet-wearing detection in cross-density scenes. The model can differentiate between construction personnel wearing helmets and those without, delivering precise detection outcomes. It is essential to ensure safety and compliance on construction sites.
Figure 22(c) shows the construction worker helmet-wearing detection results in a low-light scene at night. In the case of low light at night, the image’s brightness is low, and there may be shadows and uneven lighting, which can cause some difficulties in the target detection task. However, the improved YOLOv5 model is robust to light variations and can learn feature representations for different light conditions during training, and thus can be adapted to helmet-wearing detection in low-light environments at night. The test results show that the YOLOv5 model performs well in helmet-wearing detection in low-light scenes at night. It can quickly and accurately identify the head area of construction workers and determine whether they are wearing a helmet (red arrows show people without helmets, which are recognized as persons), providing reliable detection results even in low light conditions.
Figure 22(d) shows the helmet-wearing detection results for construction workers in the long-distance scene. In the long-distance scene, the construction worker may become relatively small, and the details in the image will be reduced, which will cause specific difficulties in the target detection task. However, the improved YOLOv5 model has strong target detection capability and multiscale feature fusion and can accurately locate and recognize targets in images of different scales. The test results show that the YOLOv5 model performs well in helmet-wearing detection in long-distance scenes. It can quickly and accurately detect the head region of a distant construction worker and determine whether he is wearing a helmet. The model still provides reliable detection results even at long distances and with relatively small targets.
5 Conclusion
This work presents a smart helmet recognition method utilizing an improved YOLOv5 network as a detector. This approach bolsters the detection algorithm’s accuracy, speed, and adaptability through loss function optimization and data enhancement strategies. The main findings are outlined as follows:
The CIOU loss function is combined with the improved Mosaic-9 data enhancement. The improved YOLOv5 increases the diversity of feature information through Mosaic-9 data augmentation, and more accurate and effective feature representations are gradually learnt during training by minimizing the CIOU-measured gap between the predictions and the real labels.
To make the experiments more versatile, the optimization strategy increases the dataset samples’ richness and complexity, including small targets, dense targets, target occlusion, and target colours.
Through comprehensive experimental performance analysis, our proposed strategy markedly enhances precision and recall metrics in the training of the YOLOv5 algorithm using the helmet dataset. This results in an impressive precision rate of 93.16% and a commendable recall rate of 88.96%, meeting the stringent requirements of real-time detection. Our optimized model’s detection results emphasize its capacity to precisely identify targets in video footage, spanning diverse scenarios, including daylight and nighttime conditions, even in complex environments. Moreover, the model exhibits real-time proficiency in localizing and tracking construction workers. The proposed helmet detection model could be improved through refinement and optimization, including experimenting with different network configurations, layers, or architectures. Continuous learning mechanisms and cross-industry applications could enhance the model’s effectiveness and adaptability.
- Funding information: This research was funded by the school-enterprise cooperation project of the Yancheng Institute of Technology (2023021615).
- Author contributions: Hua Liang and Liqin Yang are responsible for designing the framework, analysing performance, validating the results, and writing the article. Jinhua Chen, Xin Liu, and Guihua Hang are responsible for collecting the information required for the framework, providing software, conducting critical reviews, and administering the process.
- Conflict of interest: The authors declare no conflicts of interest.
- Data availability statement: No datasets were generated or analysed during the current study.
References
[1] H. Song, X. Zhang, J. Song, and J. Zhao, “Detection and tracking of safety helmet based on DeepSort and YOLOv5,” Multimed. Tools Appl., vol. 82, pp. 10781–10794, 2023. doi: 10.1007/s11042-022-13305-0.
[2] A. H. Rubaiyat, T. T. Toma, M. Kalantari-Khandani, S. A. Rahman, L. Chen, Y. Ye, et al., “Automatic detection of helmet uses for construction safety,” In 2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), Omaha, NE, USA, 13–16 October 2016, pp. 135–142. doi: 10.1109/WIW.2016.045.
[3] J. Li, H. Liu, T. Wang, M. Jiang, S. Wang, K. Li, et al., “Safety helmet wearing detection based on image processing and machine learning,” In 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI), Doha, Qatar, 4–6 February 2017, pp. 201–205. doi: 10.1109/ICACI.2017.7974509.
[4] R. R. e Silva, K. R. Aires, and R. D. Veras, “Helmet detection on motorcyclists using image descriptors and classifiers,” In 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, Rio de Janeiro, Brazil, 26–30 August 2014, pp. 141–148. doi: 10.1109/SIBGRAPI.2014.28.
[5] Y. Guan, W. Li, T. Hu, and Q. Hou, “Design and implementation of safety helmet detection system based on YOLOv5,” In 2021 2nd Asia Conference on Computers and Communications (ACCC), Singapore, 24–27 September 2021, pp. 69–73. doi: 10.1109/ACCC54619.2021.00018.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, USA, 23–28 June 2014, pp. 580–587. doi: 10.1109/CVPR.2014.81.
[7] R. Girshick, “Fast R-CNN,” In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 7–13 December 2015, pp. 1440–1448. doi: 10.1109/ICCV.2015.169.
[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017. doi: 10.1109/TPAMI.2016.2577031.
[9] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. M. Rose, et al., “Detecting non-hardhat-use by a deep learning method from far-field surveillance videos,” Autom. Constr., vol. 85, pp. 1–9, 2018. doi: 10.1016/j.autcon.2017.09.018.
[10] S. Chen, W. Tang, T. Ji, H. Zhu, Y. Ouyang, and W. Wang, “Detection of safety helmet wearing based on improved Faster R-CNN,” In 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020, pp. 1–7. doi: 10.1109/IJCNN48605.2020.9207574.
[11] S. Guo, D. Li, Z. Wang, and X. Zhou, “Safety helmet detection method based on Faster R-CNN,” In Artificial Intelligence and Security: 6th International Conference, ICAIS 2020, Hohhot, China, 17–20 July 2020, pp. 423–434. doi: 10.1007/978-981-15-8086-4_40.
[12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 26–July 1, 2016, pp. 779–788. doi: 10.1109/CVPR.2016.91.
[13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, et al., “SSD: Single shot multibox detector,” In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, pp. 21–37. doi: 10.1007/978-3-319-46448-0_2.
[14] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, 2018. doi: 10.1109/TPAMI.2018.2858826.
[15] L. Wei, P. Liu, H. Ren, and D. Xiao, “Research on helmet wearing detection method based on deep learning,” Sci. Rep., vol. 14, no. 1, p. 7010, 2024. doi: 10.1038/s41598-024-57433-z.
[16] Y. Li, H. Wei, Z. Han, J. Huang, and W. D. Wang, “Deep learning-based safety helmet detection in engineering management based on convolutional neural networks,” Adv. Civ. Eng., vol. 2020, pp. 1–10, 2020. doi: 10.1155/2020/9703560.
[17] A. Hayat and F. Morgado-Dias, “Deep learning-based automatic safety helmet detection system for construction safety,” Appl. Sci., vol. 12, no. 16, p. 8268, 2022. doi: 10.3390/app12168268.
[18] S. Qian and M. Yang, “Detection of safety helmet-wearing based on the YOLO_CA model,” Comput. Mater. Continua, vol. 77, no. 3, pp. 3349–3366, 2023. doi: 10.32604/cmc.2023.043671.
[19] S. Li, Y. Lv, X. Liu, and M. Li, “Detection of safety helmet and mask wearing using improved YOLOv5s,” Sci. Rep., vol. 13, no. 1, p. 21417, 2023. doi: 10.1038/s41598-023-48943-3.
[20] F. Zhou, H. Zhao, and Z. Nie, “Safety helmet detection based on YOLOv5,” In 2021 IEEE International Conference on Power Electronics, Computer Applications (ICPECA), Shenyang, China, 22–24 January 2021, pp. 6–11. doi: 10.1109/ICPECA51329.2021.9362711.
[21] P. Jiang, D. Ergu, F. Liu, Y. Cai, and B. Ma, “A review of YOLO algorithm developments,” Procedia Comput. Sci., vol. 199, pp. 1066–1073, 2022. doi: 10.1016/j.procs.2022.01.135.
[22] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, “Bounding box regression with uncertainty for accurate object detection,” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), California, USA, 16–20 June 2019, pp. 2888–2897.
[23] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, October 27–November 2, 2019, pp. 6023–6032. doi: 10.1109/ICCV.2019.00612.
[24] Z. Fan, C. Peng, L. Dai, F. Cao, J. Qi, and W. Hua, “A deep learning-based ensemble method for helmet-wearing detection,” PeerJ Comput. Sci., vol. 6, p. 311, 2020. doi: 10.7717/peerj-cs.311.
© 2024 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.