
IG-YOLOv5-based underwater biological recognition and detection for marine protection

  • Jialu Huo and Qing Jiang
Published/Copyright: December 29, 2023

Abstract

Underwater biological detection is of great significance to marine protection. However, traditional target detection techniques face challenges such as insufficient feature extraction for small targets and a low feature utilization rate. To address these challenges, an underwater biological detection model, IG-YOLOv5, based on the idea of feature reuse is proposed. An Improved-Ghost module with feature reuse is designed: it adds a batch normalization operation to the identity mapping branch, fuses features with the Add operation, and uses the Sigmoid Linear Unit activation function, which is smoother around zero. The proposed model uses the Improved-Ghost module to reconstruct the CSPDarknet structure of YOLOv5, realizing both a lighter model and improved accuracy. In addition, to cope with changes in target size and shape in the underwater environment, the loss function is optimized to Wise-IoU v3, which improves the accuracy and robustness of the detection results. The results show that the IG-YOLOv5 model performs well on the 2021URPC data set, with 0.5 mAP reaching 74.2%, 4.3 percentage points higher than that of the YOLOv5 model, while requiring 2.7 GFLOPs fewer floating-point operations. In summary, the IG-YOLOv5 model achieves high accuracy and robustness in underwater target detection, the Wise-IoU criterion evaluates the quality of detection results more accurately, and the model is suitable for underwater robots, underwater monitoring, and other fields with practical application value.

1 Introduction

Underwater biological detection can provide a detailed observation and evaluation of marine biological community, so as to better understand the distribution, quantity, and ecological role of marine life [1,2]. This information holds substantial importance for marine resource management and preservation, offering a crucial scientific foundation for the efficient utilization of these resources. It is helpful in the rational development of underwater resources, especially marine resources, and will bring high-quality development to coastal areas [3,4]. Scientists can monitor the changes of marine biodiversity, identify rare species, monitor the number of different species, and understand their distribution. This is very important for protecting endangered species and maintaining ecological balance. In addition, underwater biological detection can also help to assess the health of the marine environment, monitor and predict the changes in the marine environment, assess the influence of human actions on marine ecosystems and the environment, and take effective management measures. Moreover, it urges the government, environmental protection organizations, and relevant parties to take action to jointly protect our precious marine resources. For example, using detection technology, we can track the migration pattern of fish, avoid overfishing, and ensure the sustainable development of fisheries. Scientists use biological detection technology to monitor the impact of marine pollution on marine life, such as plastic pollution, chemical pollution, or noise pollution. Through long-term data collection, we can better understand how pollution affects marine ecology and formulate corresponding countermeasures.

In summary, by adopting underwater biological detection technology, we can better protect marine resources, maintain ecological balance, and ensure that future generations benefit from a healthy marine environment.

Underwater biological detection faces many technical and environmental challenges. For example, changes in water depth, water quality, and lighting conditions hamper underwater cameras and other visual equipment, and visibility is often poor. Compared with the land environment, data transmission underwater is slower and more difficult, especially wireless transmission. The current approaches for target detection can be categorized into two groups: conventional methods and deep learning-based techniques [5].

The majority of conventional techniques revolve around region-based target detection: a series of candidate frames is generated by a candidate frame generation algorithm, and the frames are classified and regressed one by one [6]. This approach is mainly based on hand-designed feature extractors and classifiers and usually adopts sliding-window detection. However, it has the following limitations: (1) the feature extractor is hand-crafted and has no learnable parameters, (2) sliding-window detection is computationally expensive, (3) target localization is inaccurate, (4) it is not suitable for complex scenes, and (5) it is difficult to cope with occlusion and posture changes.

In the realm of deep learning-based approaches, significant advancements have been achieved within the domain of target detection [7]. R-CNN [8], Fast R-CNN [9], and Faster R-CNN [10] belong to the two-stage target detection family, but they cannot meet real-time detection requirements. In terms of model architecture, more models offering improvements in both speed and accuracy have been proposed, such as SSD [11], RetinaNet [12], and the YOLO family (YOLO [13], YOLO9000 [14], YOLOv3 [15], YOLOv4 [16], YOLOv5 [17], etc.). Multi-scale detection techniques provide a direction for addressing the detection of small objects and further improve detection accuracy; there are also improvements on the prediction side, such as the popularization of anchor-free [18] prediction. In recent years, these methods have made progress in image recognition, data analysis, automatic detection, and model prediction, but many challenges remain, such as too many model parameters, slow training, and low inference efficiency.

In recent years, underwater target detection has received continuous attention, and many scholars have conducted in-depth research in this field.

In 2019, Li et al. [19] addressed the degradation of underwater images with an enhancement preprocessing algorithm based on contrast-limited adaptive histogram equalization. Zhao et al. [20] overcame the problem of few and homogeneous data samples by using data augmentation and a Retinex-based image preprocessing algorithm. In 2020, Jia and Liu [21] used the MSRCR (Multi-Scale Retinex with Color Restoration) [22] algorithm to enhance underwater images. In 2021, Liu and Liang [23] introduced an enhancement approach for underwater optical images that relies on background light estimation and improved adaptive transmission fusion. In 2022, Hao et al. [24] used an optimized MSRCR algorithm to enhance underwater images. In 2023, Chen et al. [25] introduced a contrast-limited adaptive histogram equalization algorithm to preprocess input images and cope with weak underwater light.

The research listed above focuses on improving the detection effect by image preprocessing algorithm and enhancing underwater image data, but there is still room for improvement in extracting effective features and feature utilization. At the same time, the aforementioned methods have not solved the problems of large model and many parameters and failed to achieve both speed and accuracy. Therefore, the IG-YOLOv5 model is put forward in this study, aiming at lightening the model in the field of underwater biological detection and improving the detection performance and speed.

IG-YOLOv5 model improves accuracy, lightweight, and detection speed, which has a far-reaching impact on marine protection. First of all, accurate data is helpful to formulate more effective protected area policies, conduct population recovery assessment, and support ecological research; second, it reduces the need for direct observation of divers, thus reducing the interference of human activities on the fragile marine environment, which is particularly useful for studying protected, remote, or difficult-to-reach areas. Then, it improves the detection speed of the model; provides almost real-time monitoring; quickly identifies sudden environmental threats, such as illegal fishing, pollution leakage, or other activities that endanger biodiversity; and relevant departments can intervene more quickly, which may prevent or alleviate the damage to marine ecosystems. Finally, the sharing of IG-YOLOv5 model can improve the public’s knowledge and understanding of marine protection issues.

The IG-YOLOv5 model is experimentally proven to be effective. The innovation of this study lies in:

  1. To maintain model accuracy while reducing parameters, the structure is re-parameterized for effective feature reuse, and an underwater target recognition network model, IG-YOLOv5, based on the idea of feature reuse is proposed.

  2. The Improved-Ghost, a new module with feature reuse, is designed, and the YOLOv5 backbone network is reconstructed with it. The resulting model is lightweight and has an efficient inference framework.

  3. To enhance the model’s generalization and robustness for underwater target detection, a strategic allocation of gradient gain is implemented in the loss function. This results in optimizing the loss function as Wise-IoU v3 Loss, incorporating bounding box regression along with a dynamic focusing mechanism.

The remaining sections of this article are outlined as follows: Section 2 introduces the YOLOv5 structure. Section 3 discusses the theory behind the IG-YOLOv5 model. In Section 4, the performance of the IG-YOLOv5 model is assessed using the 2021 URPC data set. Section 5 concludes the article with final remarks.

2 Related technologies

2.1 YOLOv5

The first official version of YOLOv5 was released by Ultralytics on June 25, 2020, and it was improved on the basis of YOLOv4. YOLOv5 model integrates CSPDarknet53 network, multi-scale detection strategy, and Squeeze and Excitation attention [26] mechanism to enhance the model’s performance. YOLOv5 has a more flexible architecture, which provides four predefined model architectures, namely, S, M, L, and X. Among them, YOLOv5’s S model is a lightweight target detection model, which adopts a relatively small network structure, ensures high detection accuracy, has faster reasoning speed and lower memory requirements, and is suitable for embedded devices and mobile applications. This experiment is based on YOLOv5’s S model.

The YOLOv5’s S model network architecture consists of a progressive convolutional neural network. It can be segmented into four primary sections: input layer, backbone layer, neck layer, and head layer [16]. Figure 1 illustrates the network configuration of YOLOv5-6.0.

Figure 1: Network architecture diagram of YOLOv5-6.0.

Input module: The input image size of the YOLOv5 model is 640*640, and this part adopts operations such as Mosaic data augmentation, adaptive image resizing, and anchor box computation optimization [16].

The backbone network realizes the feature extraction function. It uses a convolutional neural network to extract image features, and different feature extraction architectures such as ResNet [27] and DarkNet are commonly used. In YOLOv5 6.0, the backbone is mainly composed of the CBS module (C stands for Conv, B for BatchNorm2d, and S for the Sigmoid Linear Unit (SiLU) activation function), the CSPDarkNet53 structure, and the SPPF (Fast Spatial Pyramid Pooling) module. Compared with the previous version, the first layer of version 6.0 replaces the Focus module with a 6*6 convolution layer, which makes the whole structure more efficient on GPU devices. The CBS module encapsulates a Conv2d layer, batch normalization, and the SiLU activation (a Sigmoid followed by an element-wise multiplication). YOLOv5 draws on the idea of CSPNet and applies it to the DarkNet53 backbone. In version 6.0, the C3 module replaces the older BottleneckCSP module, which reduces parameters and improves detection accuracy; the C3 module contains three convolution layers and multiple Bottleneck modules. YOLOv5-6.0 also adopts the SPPF module, in which several cascaded pooling kernels of reduced size replace the single large pooling kernels of the old Spatial Pyramid Pooling module; this fuses feature maps from different receptive fields, enriches the expressive ability of the feature maps, and improves running speed. SPPF consists of two CBS modules and three MaxPool2d operations, which enhance the invariance of image features and reduce the dimensionality of the information extracted from the convolution layers.
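As a concrete illustration of the modules described above, the following is a minimal PyTorch sketch of the CBS and SPPF blocks. It follows the textual description rather than the official Ultralytics source, so the class names, default kernel sizes, and channel handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv -> BatchNorm2d -> SiLU, as described for the CBS block."""
    def __init__(self, c_in, c_out, k=1, s=1, p=None):
        super().__init__()
        p = k // 2 if p is None else p                 # same-style padding
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Fast Spatial Pyramid Pooling: three cascaded applications of one small
    max-pool replace the single large pooling kernels of the older SPP block."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, 1, 1)
        self.cv2 = CBS(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # fuse feature maps from different receptive fields
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```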

Neck module draws on the thought of feature pyramid FPN [28] and PANet [29] to realize the transmission of semantic information. YOLOv5 6.0 is similar to previous versions.

There are three detection layers in the Head network, corresponding to the three feature maps of different sizes produced by the Neck. The Head can be subdivided into two parts: the detection head and the post-processing step. The detection head is a branch network operating on each feature map; it predicts the position, category, and bounding box of each object. The post-processing step then filters the head's predictions with the non-maximum suppression algorithm to remove repeated and conflicting prediction boxes and obtain the final detection results.
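A minimal sketch of this post-processing step is shown below, using the NMS operator from torchvision; the confidence and IoU thresholds are illustrative placeholders, not the values used by YOLOv5.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, iou_thresh=0.45, conf_thresh=0.25):
    """Filter raw head predictions: drop low-confidence boxes, then suppress
    overlapping boxes with non-maximum suppression.
    boxes: (N, 4) tensor in (x1, y1, x2, y2) format; scores: (N,) confidences."""
    keep_conf = scores > conf_thresh
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    keep = nms(boxes, scores, iou_thresh)          # indices of boxes to keep
    return boxes[keep], scores[keep]
```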

2.2 Ghost module

The Ghost module was proposed by Han et al. in 2020. The basic Ghost module splits the original convolutional layer into two parts and employs a reduced number of filters to produce the intrinsic feature maps [30].

As depicted in Figure 2, the Ghost module comprises conventional convolution, grouped convolution, and an identity operation.

Figure 2: Ghost module [30].

Ordinary convolution operation is used to generate a compact-sized feature map, and the number of output channels at this time is m, which is less than the total output channels n.

Formulating the operation of a regular convolutional layer to generate m feature maps involves

(1) $Y = X * f$,

where $*$ signifies the convolution operation, $Y \in \mathbb{R}^{h \times w \times m}$ represents the intrinsic $m$-channel feature map, and $f \in \mathbb{R}^{c \times k \times k \times m}$ stands for the employed filters. Besides, $h$ and $w$ denote the output data's height and width, while $k \times k$ corresponds to the kernel size of the filters $f$.

To obtain the remaining maps and reach the desired $n$ feature maps, grouped convolution applies cheap linear operations to the intrinsic feature maps to generate more features:

(2) $y_{ij} = \Phi_{i,j}(y_i), \quad i = 1, \ldots, m, \; j = 1, \ldots, s$,

where $y_i$ stands for the $i$th intrinsic feature map within $Y$, and $\Phi_{i,j}$ is the $j$th linear operation (excluding the final identity mapping) responsible for producing the $j$th ghost feature map $y_{ij}$.

Finally, as shown in Figure 2, the intrinsic feature maps are carried through by an identity mapping and combined with the ghost feature maps to form the output.

The Ghost module employs standard convolution to create initial feature maps and then utilizes economical linear operations for feature enhancement and channel expansion. Because of these procedures, the Ghost module produces an equal count of feature maps as the conventional convolution layer. This allows seamless integration of the Ghost module into existing network architectures and mitigates computational expenses.
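For illustration, a minimal PyTorch sketch of the Ghost module as described in [30] is given below: a primary convolution produces the intrinsic maps, a cheap grouped (depthwise) convolution produces the ghost maps, and the two parts together give the $n$ output channels. The ratio, kernel sizes, and use of ReLU inside the block are illustrative assumptions (the ratio should evenly divide the output channels).

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of the Ghost module [30]: a primary convolution makes m intrinsic
    feature maps, a cheap depthwise convolution makes the remaining ghost maps
    from them, and intrinsic + ghost maps together give the n output channels."""
    def __init__(self, c_in, c_out, ratio=2, k=1, dw_k=3):
        super().__init__()
        c_intrinsic = c_out // ratio                   # m intrinsic channels
        c_ghost = c_out - c_intrinsic                  # remaining ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_intrinsic, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(c_intrinsic),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                    # grouped / depthwise conv
            nn.Conv2d(c_intrinsic, c_ghost, dw_k, 1, dw_k // 2,
                      groups=c_intrinsic, bias=False),
            nn.BatchNorm2d(c_ghost),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)                            # intrinsic maps
        return torch.cat([y, self.cheap(y)], dim=1)    # intrinsic + ghost maps
```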

However, the efficiency of the serial operators in the Ghost module still leaves room for improvement, and this study proposes a more efficient module on this basis.

3 Method

3.1 Improved-Ghost module

Improved-Ghost module is optimized on the basis of Ghost and RepGhost [31] module. Improved-Ghost continues to follow the idea of Ghost feature reuse and realizes implicit feature reuse through re-parameterization, which is convenient for the model to better understand and identify the objects in the image and ensure high detection accuracy in underwater environment. On the one hand, Improved-Ghost provides richer and more comprehensive information by combining different levels of features, which is helpful to improve the understanding and detection accuracy of IG-YOLOv5 model. By integrating multi-scale features, it is convenient for the network to obtain information from different levels, improve the recognition ability of targets in complex background, and thus maintain high detection accuracy in complex underwater environment. On the other hand, the connection mode of feature reuse makes gradient and information flow more effectively in the network, which is helpful to train deeper and stronger models.

The specific optimization contents are as follows:

  1. Add operation with feature fusion function is used to replace the traditional Concat operation, thus improving efficiency.

    In terms of information integration of feature maps, the Concat operation splices two tensors and expands their dimensions, whereas the Add operation sums the tensors element-wise: the amount of information in each channel increases while the tensor dimensions remain unchanged, which is beneficial to the final classification of images.

    Under the same conditions, with an independent convolution kernel for each output channel, the Concat and Add operations can be expressed as follows:

    (3) $\mathrm{Out}_{\mathrm{concat}} = \sum_{i=1}^{n} X_i * K_i + \sum_{i=1}^{n} Y_i * K_{i+n}$,

    (4) $\mathrm{Out}_{\mathrm{add}} = \sum_{i=1}^{n} (X_i + Y_i) * K_i$,

    where $X_i$ and $Y_i$ represent the $i$th channels of the two inputs and $K_i$ represents the convolution kernel of the corresponding channel. From equations (3) and (4), it is clear that the Add operation offers benefits in terms of parameter and computation reduction; a short numerical illustration is given after this list.

  2. Replace the original ReLU function with the SiLU activation function, which is smoother around zero, and move it after the Add operation to enable re-parameterization and speed up inference.

The ReLU (Rectified Linear Unit) function is a piecewise linear function with good computational properties that improves the training efficiency of neural networks. Compared with ReLU, the SiLU function is smoother near zero. SiLU is built on the Sigmoid function, whose output is limited to the range (0,1), and it achieves better performance in visual tasks.

The specific formulas of ReLU function and SiLU function are as follows:

(5) $f_{\mathrm{ReLU}}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0, \end{cases}$

(6) $f_{\mathrm{SiLU}}(x) = x \cdot \mathrm{sigmoid}(x) = \dfrac{x}{1 + e^{-x}}$.

From Figures 3 and 4, the smoother and more expressive nature of the SiLU function around zero becomes apparent.

Figure 3: ReLU function diagram.

Figure 4: SiLU function diagram.

SiLU is a smooth function, which means that it can create a smoother feature map and help capture subtle patterns in data. Its performance on negative values enables it to transmit negative activation, which is helpful for the network to capture more complex relationships between features. In addition, it can also allow the network to adaptively adjust the slope of activation during the training process, which is helpful for more effective gradient flow and richer feature extraction. SiLU is used in the Improved-Ghost module, because it allows the gradient to flow more effectively in the deep network and helps the network learn more abundant feature representation.

  3. Add a BN (Batch Normalization) operation to the identity mapping branch to introduce nonlinearity during training, which makes the branch convenient to fuse for fast inference. BN stabilizes the inputs of the neural network by reducing the "internal covariate shift." Specifically, it normalizes the input at each layer of the network so that its mean is close to 0 and its variance close to 1. In this way, convergence is accelerated, over-fitting is reduced, and the performance of the model is further improved. BN is used in the Improved-Ghost module to increase the diversity of feature maps and encourage the module to learn more diversified feature representations.
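The practical difference between the two fusion operations in point 1 can be seen directly in tensor shapes and in the parameter count of the convolution that follows; the short, illustrative PyTorch comparison below uses arbitrary channel counts:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 80, 80)          # two feature maps with identical shapes
y = torch.randn(1, 64, 80, 80)

cat_out = torch.cat([x, y], dim=1)      # Concat: channels double -> (1, 128, 80, 80)
add_out = x + y                         # Add: shape unchanged    -> (1, 64, 80, 80)

# A following 3x3 convolution to 64 output channels therefore needs
# twice as many weights after Concat as after Add.
conv_after_cat = nn.Conv2d(128, 64, 3, padding=1, bias=False)
conv_after_add = nn.Conv2d(64, 64, 3, padding=1, bias=False)
print(sum(p.numel() for p in conv_after_cat.parameters()))  # 73728
print(sum(p.numel() for p in conv_after_add.parameters()))  # 36864
```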

The structure of the Improved-Ghost module is illustrated in Figure 5, featuring two branches: a shortcut branch and a connected branch responsible for implementing the operators. The module differs slightly between the training and inference stages. During inference, the operator branch is more concise, containing only a 1*1 convolution, a conventional convolution, and the SiLU activation function, which is advantageous in terms of computation and inference speed.

Figure 5: Structure diagram of the Improved-Ghost module.
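To make the description concrete, the following is a hedged, training-time PyTorch sketch of an Improved-Ghost-style block built from the three optimizations above (Add fusion, SiLU after the Add, and BN on the identity branch). It assumes a primary 1*1 convolution followed by a cheap depthwise convolution as the operator branch; the exact layer configuration of the authors' module may differ, and the re-parameterization that folds the BN-only branch for inference is omitted.

```python
import torch
import torch.nn as nn

class ImprovedGhost(nn.Module):
    """Training-time sketch of an Improved-Ghost-style block: a primary 1x1
    convolution produces intrinsic maps, a cheap depthwise convolution produces
    additional maps, the identity branch carries the intrinsic maps through a
    BatchNorm, the branches are fused by Add (not Concat), and SiLU is applied
    after the fusion. The BN-only identity branch can be folded into the
    depthwise convolution for inference (re-parameterization), which is not
    shown here."""
    def __init__(self, c_in, c_out, dw_k=3):
        super().__init__()
        self.primary = nn.Sequential(                  # 1x1 conv -> BN
            nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.cheap = nn.Sequential(                    # depthwise conv -> BN
            nn.Conv2d(c_out, c_out, dw_k, 1, dw_k // 2,
                      groups=c_out, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.identity_bn = nn.BatchNorm2d(c_out)       # BN on the shortcut branch
        self.act = nn.SiLU()                           # activation after Add

    def forward(self, x):
        y = self.primary(x)
        return self.act(self.cheap(y) + self.identity_bn(y))  # fusion by Add
```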

3.2 IG-YOLOv5 model

Underwater target detection needs both accuracy and efficiency. Although the YOLOv5 network structure has obvious advantages in real-time detection, it still suffers from too many model parameters, slow training, and low inference efficiency. To achieve efficient, real-time, and accurate underwater target detection, it needs to be improved further. Therefore, the YOLOv5 network is modified to address these problems.

YOLOv5-6.0 uses a structure called CSPDarknet to extract deep features. CSP stands for Cross Stage Partial, a network design strategy that improves information flow and gradient flow by sharing features between different stages of the network through partial cross-stage skip connections, thus improving performance and reducing computational cost. Darknet is a lightweight neural network foundation. Combining the CSP strategy with the Darknet structure creates an efficient and powerful model suitable for real-time target detection and other computer vision tasks. The CSPDarknet structure splits the input feature map into two segments: one continues to pass forward, and the other is processed and then spliced with the first. This structure helps avoid the vanishing gradient problem. In terms of parameters, although they are reduced compared with the previous version, there is still considerable room for optimization. In terms of detection effect, it handles large-target detection well, but there is significant potential for improvement on small-target detection tasks. Therefore, this article reconstructs the backbone layer of YOLOv5 with the Improved-Ghost module, making the structure lighter and giving it stronger learning ability.

The network constructed in this article consists of five CBS modules, four Improved-Ghost modules, and one SPPF module. The structural diagram is shown in Figure 6.

Figure 6: IG-YOLOv5 model network structure diagram.
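Consistent with this composition (five CBS modules, four Improved-Ghost modules, and one SPPF module), a schematic assembly of the reconstructed backbone might look like the sketch below, reusing the CBS, SPPF, and ImprovedGhost classes sketched earlier. Channel widths and strides are illustrative placeholders, not the exact configuration, and the taps that feed intermediate feature maps to the neck are omitted.

```python
import torch.nn as nn

# Schematic reconstruction of the IG-YOLOv5 backbone: CBS blocks handle
# downsampling, Improved-Ghost blocks replace the original C3 stages, and
# SPPF closes the stack. Uses the CBS, ImprovedGhost, and SPPF sketches above.
def build_ig_backbone():
    return nn.Sequential(
        CBS(3, 32, k=6, s=2, p=2),     # stem (6x6 convolution, as in YOLOv5-6.0)
        CBS(32, 64, k=3, s=2),
        ImprovedGhost(64, 64),
        CBS(64, 128, k=3, s=2),
        ImprovedGhost(128, 128),
        CBS(128, 256, k=3, s=2),
        ImprovedGhost(256, 256),
        CBS(256, 512, k=3, s=2),
        ImprovedGhost(512, 512),
        SPPF(512, 512),
    )
```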

IG-YOLOv5 uses mosaic algorithm in image processing to enhance data, which effectively simulates the visual conditions in underwater environment, and is also an effective means for the model to deal with different underwater environments, especially in underwater environment, which helps the model to adapt to the appearance of targets under different visual angles, different distances, and different light conditions. Based on the Improved-Ghost module, the backbone network is reconstructed, and using the feature reuse mechanism, the model’s understanding of complex or low-contrast underwater scenes is enhanced by copying key feature maps and enhancing and reorganizing them at multiple levels. In addition, the Improved-Ghost module introduces lightweight original features, which can capture the fine structure of data without significantly increasing the calculation cost. In the underwater environment, this can help the model to distinguish fuzzy or partially occluded objects. In summary, the aforementioned adjustments make IG-YOLOv5 more efficient and accurate in underwater target detection, especially when facing the unique visual challenges of underwater environment.

3.3 Wise-IoU loss

The effectiveness of target detection relies on the formulation of the loss function. YOLOv5-6.0 uses CIoU [32] as the loss function to help improve the accuracy and stability of target detection. CIoU is an improvement based on IoU metric and DIoU [33] metric, which considers the position difference and size difference between the prediction frame and the real frame. The distance and overlap between the pre-selected frame and the real frame are optimized during training. Compared with the traditional IoU, it has better performance in dealing with targets with different scales and shapes. There is still much room for optimization in terms of robustness and computational complexity.

Therefore, in this study, Wise-IoU [34] is used to design the loss function. Wise-IoU uses a dynamic non-monotonic focusing mechanism instead of plain IoU to evaluate the quality of anchor boxes and applies a sensible gradient gain allocation strategy. This allows the loss to focus on anchor boxes of ordinary quality, improving detection ability as a whole and yielding better robustness. Wise-IoU considers not only the position and scale information between the prediction box and the ground-truth box but also their shape information. Compared with CIoU, Wise-IoU handles targets of different shapes, sizes, and orientations better, thus improving the robustness of target detection. Wise-IoU reduces computational complexity by introducing a weight matrix; compared with CIoU, it trains and infers faster and performs better on devices with limited computing resources. At the same time, the weight matrix controls the importance of different parts, improving the interpretability of the loss function. This enhancement can boost the precision and resilience of target detection in real-world scenarios.

Wise-IoU Loss has three focusing mechanisms, namely, Wise-IoU v1, Wise-IoU v2, and Wise-IoU v3. Wise-IoU v1 constructs an attention-based bounding box loss: it builds distance attention from a distance measure and obtains two attention terms, $R_{\mathrm{WIoU}}$ and $L_{\mathrm{IoU}}$. $R_{\mathrm{WIoU}}$, ranging from 1 to $e$, significantly amplifies $L_{\mathrm{IoU}}$ for ordinary-quality anchor boxes, while $L_{\mathrm{IoU}}$, ranging from 0 to 1, significantly reduces $R_{\mathrm{WIoU}}$ for high-quality anchor boxes, so that when the anchor box overlaps the target box well, attention focuses on the distance between their center points. Wise-IoU v2 constructs a monotonic focusing coefficient and introduces the mean of $L_{\mathrm{IoU}}$ as a normalization factor, which makes the model focus on difficult targets, improves classification performance, and expedites convergence in the later stages of training. Wise-IoU v3 defines an outlier degree $\beta$ to characterize the quality of an anchor box; a non-monotonic focusing coefficient is constructed from $\beta$ and applied to Wise-IoU v1. To improve the positioning ability of the IG-YOLOv5 model for small underwater targets, this study optimizes the loss function as Wise-IoU v3 with the dynamic focusing mechanism; the specific formula is as follows:

(7) $L_{\mathrm{WIoUv3}} = r \, L_{\mathrm{WIoUv1}}, \qquad r = \dfrac{\beta}{\delta \, \alpha^{\beta - \delta}}$,

where $r$ represents the gradient gain, and $\alpha$ and $\delta$ are hyperparameters.

Among them, the outlier is defined as:

(8) $\beta = \dfrac{L_{\mathrm{IoU}}}{\overline{L_{\mathrm{IoU}}}} \in [0, +\infty)$,

where $L_{\mathrm{IoU}}$ is the IoU loss of the current anchor box and $\overline{L_{\mathrm{IoU}}}$ is its exponential moving average.
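A hedged PyTorch sketch of this loss is given below for axis-aligned boxes in (x1, y1, x2, y2) format. The structure follows equations (7) and (8) and the Wise-IoU v1 distance attention; the hyperparameter values (alpha, delta, and the momentum of the running mean) are illustrative and not necessarily those used in this study.

```python
import torch

def wise_iou_v3(pred, target, iou_mean, alpha=1.9, delta=3.0, momentum=0.01):
    """Sketch of Wise-IoU v3 [34] for (N, 4) boxes in (x1, y1, x2, y2) format.
    `iou_mean` is the running mean of the IoU loss maintained by the caller."""
    # intersection / union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # smallest enclosing box -> distance attention R_WIoU (Wise-IoU v1)
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    cw, ch = (c_rb - c_lt)[:, 0], (c_rb - c_lt)[:, 1]
    px = (pred[:, 0] + pred[:, 2]) / 2 - (target[:, 0] + target[:, 2]) / 2
    py = (pred[:, 1] + pred[:, 3]) / 2 - (target[:, 1] + target[:, 3]) / 2
    r_wiou = torch.exp((px ** 2 + py ** 2) / (cw ** 2 + ch ** 2 + 1e-7).detach())
    l_wiou_v1 = r_wiou * l_iou

    # non-monotonic focusing coefficient (Wise-IoU v3), eqs. (7)-(8)
    beta = l_iou.detach() / iou_mean                  # outlier degree
    r = beta / (delta * alpha ** (beta - delta))      # gradient gain
    new_mean = (1 - momentum) * iou_mean + momentum * l_iou.detach().mean()
    return (r * l_wiou_v1).mean(), new_mean
```

In training, `iou_mean` would be carried across batches so that the outlier degree reflects the current average quality of the anchor boxes.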

Images in underwater environment are usually affected by visual degradation, such as dim light, turbid water, and particle interference. This will seriously affect the performance of traditional IoU loss functions, because they usually cannot handle these visual artifacts well. Wise-IoU v3 can guide the model prediction more effectively by considering the “benefit degree,” thus improving the accuracy of detection in complex environment, which is helpful to predict the boundary box of the target more accurately even if the target is irregular in shape or deformed in complex underwater environment.

4 Experiments

4.1 Experimental platform and data set

The hardware environment of this experiment is Windows 11(x64) operating system, 32G memory, and NVIDIA RTX 3060 (6GB) GPU. The software environment is Pytorch 1.10.0 architecture and PyCharm platform. The specific experimental environment configuration is shown in Table 1.

Table 1

Configuration of experimental environment

Project Experimental environment
Operating system Windows 11(x64)
CPU 12th Gen Intel(R) Core(TM) i9-12900H 2.50GHz
GPU NVIDIA RTX3060(6GB)
Memory size 32GB
Python 3.9.13
Accelerated environment CUDA11.6

In this study, the pre-training weight file used in the experiment is Yolov5s.pt, and the default hyperparameters are used for data augmentation. The learning rate is set to 0.01, the number of training epochs to 300, the batch size to 16, and the number of data-loading workers to 8; the image resolution is unified to 640*640. The relevant experimental parameters are configured as shown in Table 2.

Table 2

Configuration of experimental parameters

Parameter Configuration
Weights Yolov5s.pt
Hyp hyp.scratch-low.yaml
Epochs 300
Learning rate 0.01
Batch size 16
Workers 8
Image size 640*640
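For reference, a training run with the settings in Tables 1 and 2 would correspond to a command along the following lines, assuming the standard Ultralytics YOLOv5 train.py script; the data set YAML name (urpc2021.yaml) is a placeholder for illustration:

```
python train.py --weights yolov5s.pt --data urpc2021.yaml --hyp hyp.scratch-low.yaml \
    --epochs 300 --batch-size 16 --imgsz 640 --workers 8
```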

The data set utilized in this experiment is named 2021URPC, which was provided in the underwater vehicle target detection algorithm competition. There are 6,575 underwater images in this data set, which are divided into five categories: holothurian, echinus, scallop, starfish, and waterweeds. Figure 7 illustrates a sample from the data set.

Figure 7: Data set example.

As depicted in Figure 8(a), the number of targets in each category is analyzed: sea urchins are the most abundant, followed by scallops, starfish, sea cucumbers, and aquatic plants. As shown in Figure 8(b), the normalized target location map indicates a higher density of targets horizontally and a comparatively more scattered distribution vertically. In addition, the normalized target size map in Figure 8(c) shows that target sizes are relatively concentrated, with the majority being small. To meet the needs of training, validation, and testing, the data set is randomly divided into a training set, a validation set, and a test set containing 4,141, 1,776, and 658 images, respectively.

Figure 8: Statistical results of the data set. (a) Bar chart of the number of targets in each class; (b) normalized target location map; (c) normalized target size map.

4.2 Evaluation indicators

In the experiments, precision, recall, mean average precision (mAP), and floating-point operations (FLOPs) are used as performance evaluation indexes for the target detection method. The precision, recall, and mAP formulas are as follows:

(9) $\mathrm{Precision} = \dfrac{TP}{TP + FP} \times 100\%$,

(10) $\mathrm{Recall} = \dfrac{TP}{TP + FN} \times 100\%$,

(11) $\mathrm{mAP} = \dfrac{1}{K} \sum_{k=1}^{K} \mathrm{AP}(P, R, k)$,

where TP represents correctly detected positive samples, FP stands for negative samples incorrectly detected as positive, and FN denotes positive samples that are missed. mAP is obtained from the PR curve. In this experiment, the average precision is computed at an IoU threshold of 0.5, and $K$ signifies the number of target detection categories; within this study, $K = 5$.

FLOPs, the number of floating-point operations, is generally used to measure the complexity of a model and usually only counts the multiplication and addition operations:

(12) $\mathrm{FLOPs} = \mathrm{params} \times H \times W$.
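As a small illustration of how these indexes are computed, the following sketch evaluates precision and recall from detection counts and computes a VOC-style all-point average precision from a PR curve; it is a generic reference implementation, not the exact evaluation code used for the experiments.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from per-class detection counts (eqs. (9)-(10))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under a PR curve whose recall values are sorted in increasing order;
    mAP is the mean of this AP over the K classes (eq. (11))."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # enforce a monotonically decreasing precision envelope before integrating
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))
```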

4.3 URPC data set

To validate the underwater target detection capability of the IG-YOLOv5, an evaluation was conducted comparing both the YOLOv5 and the IG-YOLOv5. The test set’s detection outcomes using the IG-YOLOv5 model are depicted in Figure 9.

Figure 9: Experimental results of the IG-YOLOv5 model.

As shown in Figure 10, the IG-YOLOv5 model improves the detection of all target categories, especially the echinus (sea urchin) category, which reaches a 0.5_mAP of 91.5%. The average precision (0.5 mAP) of the IG-YOLOv5 model over all categories is 74.2%.

Figure 10: Precision-recall curve of the IG-YOLOv5 model.

From the precision-recall curve, it can be seen that the IG-YOLOv5 model performs well on four categories: holothurian, echinus, scallop, and starfish, all of which exceed or approach the average, but it still does not reach an ideal level on the waterweeds category. Table 3 compares the number of samples per category with the evaluation indicators. As evident from the table, the evaluation indexes for sea urchins, which have abundant data and are easy to identify, are higher than those of other species, whereas the indexes for aquatic plants, which have very few samples and shapes that are hard to localize, rank last. Analysis shows that the low scores for aquatic plants are mainly due to the very small number of training samples and the difficulty of recognizing their shape.

Table 3

Comparison between the number of data sets and evaluation indicators

Category Quantity Precision Recall 0.5 mAP
Holothurian 6,074 0.784 0.63 0.692
Echinus 24,346 0.883 0.854 0.915
Scallop 8,687 0.776 0.709 0.777
Starfish 7,180 0.86 0.816 0.88
Waterweeds 82 0.626 0.667 0.443

IG-YOLOv5 does not reach an ideal value in identifying aquatic plants, but compared with the YOLOv5 model it improves precision by 19%, recall by 33.34%, and 0.5_mAP by 24.2%. The data show that IG-YOLOv5 has clear advantages for organisms with few training samples and hard-to-identify morphology.

The confusion matrix assesses the result accuracy of the IG-YOLOv5 model. In the confusion matrix, columns depict the predicted category proportions, while rows indicate the actual category proportions in the data, as shown in Figure 11. The findings from Figure 11 demonstrate that the accurate prediction rates for echinus, starfish, and scallop categories are 89, 85, and 77%, respectively. This suggests the IG-YOLOv5 model’s strong accuracy.

Figure 11: Confusion matrix of the IG-YOLOv5 model.

In addition, this study also provides the curve of loss value, including frame loss, target loss, and classification loss. In this study, the loss function is optimized. Wise-IoU v3 is used as the loss function of the bounding box, and the lower the value, the higher the accuracy. Target loss refers to the mean of target detection loss, with lower values indicating greater accuracy. A smaller classification loss value corresponds to higher accuracy. As shown in Figure 12, with the increase of the number of iterations, the loss value shows a steady decline and finally stability, and reaches convergence after 200 iterations.

Figure 12: Loss value variation curve for the data set.

4.4 Comparative experimental results and analysis

To further validate the superiority of the proposed IG-YOLOv5 model, it is trained and tested on the 2021 URPC data set and compared with Faster R-CNN, RTMDet [35], VarifocalNet [36], GFL [37], and YOLOv5 using the same evaluation metrics. Table 4 displays the comparison of specific results.

Table 4

Comparison of experimental results

Method Backbone 0.5 mAP Model size/M GFLOPs
Faster-RCNN [10] ResNet-50 57.2 41.37 128
RTMDet [35] CSPNeXt 66.6 8.89 14.84
VarifocalNet [36] ResNet-50 61.8 32.89 198.66
GFL [37] ResNet-50 56.3 32.44 122.88
YOLOv5 [17] CSPDarkNet 69.9 6.70 15.8
IG-YOLOv5 Improved-Ghost 74.2 5.77 13.1

Faster R-CNN, a typical representative of two-stage target detection, is a classic detection model built on a ResNet-50 backbone. RTMDet, which uses an RoI Transformer module and a feature pyramid network, can achieve state-of-the-art results on multiple tasks; using CSPNeXt as the backbone, its 0.5 mAP reaches 66.6%, better than Faster R-CNN. VarifocalNet is a target detection algorithm using the Varifocal Loss function, with a ResNet-50 backbone. GFL is a loss-function algorithm designed to address class imbalance in target detection, also with a ResNet-50 backbone; it does not perform well in terms of 0.5 mAP, model size, or GFLOPs.

By comparing the experimental results, the IG-YOLOv5 model has obvious advantages over the comparison models in mAP, model size, and FLOPs. IG-YOLOv5 is 17.9 percentage points higher than the GFL model in mAP, 35.6 M smaller than the Faster R-CNN model, and uses about 1/15 of the FLOPs of the VarifocalNet model.

The IG-YOLOv5 model outperforms the YOLOv5 model on all evaluation metrics. It is particularly impressive on mAP, where the IG-YOLOv5 model increases the mAP by 4.3 percentage points over the YOLOv5 model at an IoU threshold of 0.5.

The IG-YOLOv5 model can effectively improve the mAP while reducing the computational complexity, and it is more suitable for the real-time detection of underwater organisms.

4.5 Ablation experiments results and analysis

This study conducts ablation experiments to assess how the individual enhancements affect model performance. We reconstruct the backbone network using the designed Improved-Ghost module and, separately, optimize the loss function with Wise-IoU v3; the ablation runs verify the contribution of each modification to network performance. The results of the experiments are presented in Table 5.

Table 5

Comparison of ablation experiments

Model Improved-Ghost Wise-IoU v3 P/% R/% 0.5 mAP/% Model size/M GFLOPS
YOLOv5 — — 78.3 64.8 69.9 6.70 15.8
A √ — 79.6 71.9 73.8 5.77 13.1
B — √ 77.5 66 74.7 6.70 15.8
IG-YOLOv5 √ √ 78.6 73.5 74.2 5.77 13.1

The "√" in the table signifies that the corresponding enhancement is used. From Table 5, it can be concluded that adding either the Improved-Ghost module or the Wise-IoU v3 loss function improves overall detection performance (most clearly the mAP value) compared with the YOLOv5 baseline, and that the Improved-Ghost module additionally reduces the model's size and computation.

From the point of view of accuracy, Model A performs best in this index, reaching 79.6%. From the recall index, IG-YOLOv5 scored the highest, with 73.5%. On the mAP value, the B model and IG-YOLOv5 are basically the same, which are 74.7 and 74.2%, respectively. In terms of the model size reflecting the complexity and storage requirements of the model, both the A model and the IG-YOLOv5 model are 5.77 M, which is smaller than the YOLOv5 and B model. Type A and IG-YOLOv5 have the lowest GFLOPS, both of which are 13.1, which means that they may be more efficient in calculation.

In terms of functional characteristics, the IG-YOLOv5 model performs well across many indicators, which shows that the Improved-Ghost module and Wise-IoU v3 together bring a clear performance improvement. Model B, which uses Wise-IoU v3 alone, performs well on 0.5 mAP, but its P and R are slightly lower, suggesting that this change benefits the IoU-related performance of the model while its effect on other indicators is mixed. The GFLOPS analysis shows that model A and IG-YOLOv5 are the most computationally efficient, which is especially important for application scenarios that require rapid response or have limited computing resources.

After reconstructing the backbone structure, the model increases the mAP value by 3.9% and reduces GFLOPs by 2.7. Improved-Ghost’s reconstruction of the YOLOv5 backbone significantly bolsters the network’s functionality.

Although the mAP value after optimizing the loss function Wise-IoU v3 based on the reconstruction of the backbone structure is slightly smaller than that after optimizing the loss function Wise-IoU v3 alone, from a comprehensive point of view, the model in this study has advantages in accuracy, recall, mAP value, and model lightweight.

In this study, the YOLOv5 and IG-YOLOv5 models are tested and compared on the same data set. Figure 13 compares the experimental results, in which (a) is the ground truth, (b) is the YOLOv5 model, and (c) is the IG-YOLOv5 model. The figure clearly shows instances of missed and false detections by the YOLOv5 model. The IG-YOLOv5 model reduces the false detection rate caused by the complex environment and improves detection accuracy.

Figure 13: Improved comparison example of experimental results: (a) true value (left), (b) YOLOv5 (middle), and (c) IG-YOLOv5 (right).

4.6 Ablation experiments for IoU

To illustrate the advantages of Wise-IoU in the IG-YOLOv5 model, ablation experiments on the IoU loss were compared. Model A uses IG-YOLOv5 with CIoU as the loss function, and model C uses IG-YOLOv5 with SIoU.

As can be seen from Table 6, IG-YOLOv5 model has the highest recall rate, which is nearly 2% higher than the other two models. In terms of 0.5 mAP, compared with A and C models, the IG-YOLOv5 model is in the first place. Although the accuracy of C model is slightly higher, IG-YOLOv5 performs better in both recall and 0.5 mAP. Wise-IoU v3 method used by IG-YOLOv5 is more effective than CIoU and SIoU in improving recall rate and overall mAP.

Table 6

Comparison of ablation experiments for IoU

Model IoU P/% R/% 0.5 mAP/%
A CIoU 79.6 71.9 73.8
C SIoU 79.7 71.5 73.0
IG-YOLOv5 Wise IoU v3 78.6 73.5 74.2

4.7 Experimental results and analysis of adding noise

To further verify the robustness of IG-YOLOv5, a noise-corrupted version of the 2021URPC data set was created by adding salt-and-pepper noise to all 6,575 pictures. Figure 14 compares the data before and after adding noise.

Figure 14: Comparison of data before and after adding noise: raw data (left), data after adding noise (right).
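A minimal sketch of how such salt-and-pepper corruption can be generated is shown below; the noise amount and salt/pepper ratio are illustrative assumptions, since the exact corruption level used in the experiment is not restated here.

```python
import numpy as np

def add_salt_pepper(image, amount=0.02, salt_ratio=0.5):
    """Corrupt a uint8 image (grayscale or color) with salt-and-pepper noise.
    `amount` is the fraction of corrupted pixels and is an illustrative value."""
    noisy = image.copy()
    h, w = image.shape[:2]
    n = int(amount * h * w)
    ys, xs = np.random.randint(0, h, n), np.random.randint(0, w, n)
    n_salt = int(n * salt_ratio)
    noisy[ys[:n_salt], xs[:n_salt]] = 255   # salt (white) pixels
    noisy[ys[n_salt:], xs[n_salt:]] = 0     # pepper (black) pixels
    return noisy
```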

In this study, the YOLOv5 and IG-YOLOv5 models are tested and compared on the same noisy data set. As can be seen from Table 7, the precision of IG-YOLOv5 is 14.3 percentage points higher than that of YOLOv5, a considerable improvement for the target detection task. The recall increases by 2.1 percentage points; although the increase is smaller than that of precision, it shows that IG-YOLOv5 detects more positive samples. At an IoU threshold of 0.5, the mAP of IG-YOLOv5 is 61%, 1.2 percentage points higher than the 59.8% of YOLOv5. IG-YOLOv5 outperforms YOLOv5 in all indicators shown in Table 7, which further confirms its advantages in detection performance. From this analysis, it can be concluded that IG-YOLOv5 is not only superior to YOLOv5 in the key performance indicators but also shows clear advantages in model size and computational complexity.

Table 7

Comparison of ablation experiments after adding noise

Model P/% R/% 0.5 mAP/% Model size/M GFLOPS
YOLOv5 51 64.2 59.8 28 15.8
IG-YOLOv5 65.3 66.3 61 24 13.1

Figure 15 shows the comparison of experimental results of the IG-YOLOv5 model on the noisy data set, in which (a) is the real value, (b) is the YOLOv5 model, and (c) is the IG-YOLOv5 model. It can be seen from the figure that both IG-YOLOv5 and YOLOv5 have some false detection phenomena, but compared with YOLOv5, the IG-YOLOv5 model shows higher confidence. This makes IG-YOLOv5 have higher potential value in various application scenarios.

Figure 15: Comparative example of experimental results on noisy data sets: (a) true value (left), (b) YOLOv5 (middle), and (c) IG-YOLOv5 (right).

5 Conclusion

In this study, we propose an underwater target recognition network model, IG-YOLOv5, based on the idea of feature reuse. Building on this idea, the Improved-Ghost module is proposed, the backbone structure of YOLOv5 is reconstructed to make the model lightweight, and the loss function is optimized as Wise-IoU, which improves the performance and accuracy of the model while keeping an efficient computation speed. The experimental outcomes show that the proposed IG-YOLOv5 model performs satisfactorily on the 2021URPC public data set and is more accurate and efficient than the Faster R-CNN, RTMDet, VarifocalNet, GFL, and YOLOv5 models in underwater target detection.

IG-YOLOv5 model has a wide range of potential practical applications. Robots can use IG-YOLOv5 to identify topographic features, such as seabed cracks, stones or other obstacles, and map seabed topography through real-time image analysis. In the search and rescue mission of missing ships or planes, underwater robots can use IG-YOLOv5 to identify and locate lost objects or other important clues. In the security monitoring of public places or private properties, IG-YOLOv5 can be used to detect and track unauthorized persons or vehicles. Of course, IG-YOLOv5 can also be trained to identify abnormal or suspicious behaviors, such as intrusion, fighting, or other security threats, and give an alarm in real time. To sum up, the high precision and computational efficiency of IG-YOLOv5 make it an ideal choice for underwater robots and monitoring systems, which can provide real-time and accurate target detection and tracking functions and meet the needs of various practical applications.

The advantages of high performance, real-time detection, and a lightweight design make the IG-YOLOv5 model suitable for practical deployment in marine protection and underwater monitoring. Of course, there are challenges and limitations when deploying in actual underwater scenes, such as hardware limitations; long-term deployment in remote or difficult-to-access locations may lead to maintenance problems, so the system needs to recover automatically from failures and run reliably. In addition, it is necessary to (1) ensure that no private information unrelated to the research is captured during deployment; (2) ensure that technology deployment does not negatively affect breeding areas, mammal migration routes, or other sensitive areas; and (3) consider how the technology interacts with the existing ecosystem and its possible long-term impact.

To make the IG-YOLOv5 model contribute to a wider field of marine science and protection, it is planned to provide the IG-YOLOv5 model to researchers and organizations involved in marine protection, so as to obtain data feedback in actual marine scenes. In terms of technology, it is planned to enhance it in the following aspects to further improve its performance and versatility.

  1. The speed of model processing is further improved by hardware acceleration.

  2. By training the model on more diverse data sets, the generalization ability of the model is enhanced, so that it can identify a wider range of object categories and perform well in an unprecedented environment.

  3. Considering the spatial characteristics of underwater environment, develop technologies that can understand and manipulate three-dimensional spatial information, so as to improve the understanding of dynamic objects and complex scenes.

  4. Integrating more advanced fault detection mechanism and allowing system self-diagnosis and recovery is very important for remote or difficult access deployment.

The work on this underwater biological detection model, which must observe rapidly moving marine organisms and cope with dynamic environments, shows that real-time detection is essential; this is the key requirement for scenarios that need rapid response, such as autonomous vehicles, emergency response systems, or real-time transaction analysis. At the same time, interdisciplinary collaboration is essential for achieving more comprehensive and innovative solutions to complex projects.

The work of this study will bring new development opportunities to the field of underwater target detection. The IG-YOLOv5 model excels not only in precise underwater target detection and identification but also holds significance across various application scenarios, such as marine resources development, marine environmental monitoring, and maritime security. Moving forward, we will persist in refining and enhancing the underwater target detection model to attain heightened accuracy and resilience. Through continuous research and improvement, we will provide more reliable technical support for underwater environment and development and create more opportunities for human beings to explore unknown fields.

  1. Funding information: This work was not supported by any funding projects.

  2. Conflict of interest: The authors report no conflict of interest.

  3. Data availability statement: Data are contained within the article or Supplementary Materials.

References

[1] Wang X, Zhu Y, Li D, Zhang G. Underwater target detection based on reinforcement learning and ant colony optimization. J Ocean Univ China. 2022;21(2):323–30.10.1007/s11802-022-4887-4Search in Google Scholar

[2] Zhou X, Ding W, Jin W. Microwave-assisted extraction of lipids, carotenoids, and other compounds from marine resources. Innovative and emerging technologies in the bio-marine food sector. 2022. p. 375–94.10.1016/B978-0-12-820096-4.00012-2Search in Google Scholar

[3] Gao S, Sun H, Huang X, Hui Y, Ge S. Performance audit evaluation of marine development projects based on SPA and BP neural network model. Open Geosci. 2023;15:20220470.10.1515/geo-2022-0470Search in Google Scholar

[4] Sun H, Gao S, Liu J, Liu W. Research on comprehensive benefits and reasonable selection of marine resources development types. Open Geosci. 2022;14:141–50.10.1515/geo-2022-0341Search in Google Scholar

[5] Zhang W, Sun W. Research on small moving target detection algorithm based on complex scene. J Phys: Conf Ser. 2021;1738(1):1742–6596.10.1088/1742-6596/1738/1/012093Search in Google Scholar

[6] Fu H, Song G, Wang Y. Improved YOLOv4 marine target detection combined with CBAM. Symmetry. 2021;13(4):623.10.3390/sym13040623Search in Google Scholar

[7] Francesco P, Philip HS, Torr P, Dokania K. An impartial take to the cnn vs transformer robustness contest. Computer Vision – ECCV 2022: 17th European Conference, Tel 21 Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII. Cham: Springer Nature Switzerland; 2022. p. 466–80.10.1007/978-3-031-19778-9_27Search in Google Scholar

[8] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR. 2013.10.1109/CVPR.2014.81Search in Google Scholar

[9] Girshick R. Fast R-CNN. Comput Vis Pattern Recognit. arXiv. 2015;1504:08083.10.1109/ICCV.2015.169Search in Google Scholar

[10] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. Comput Vis Pattern Recognit, arXiv. 2015;1506:01497.Search in Google Scholar

[11] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al. SSD:Single shot multi-box detector. 14th European Conference on Computer Vision. Cham: Springer; 2016. p. 21–37.10.1007/978-3-319-46448-0_2Search in Google Scholar

[12] Lin T, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE; 2017.10.1109/ICCV.2017.324Search in Google Scholar

[13] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE; 2016.10.1109/CVPR.2016.91Search in Google Scholar

[14] Redmon J, Ali F. YOLO9000: better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 7263–71.10.1109/CVPR.2017.690Search in Google Scholar

[15] Redmon J, Ali F. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767; 2018.Search in Google Scholar

[16] Bochkovskiy A, Wang C, Liao HM. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934; 2020.Search in Google Scholar

[17] Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 2778–88.10.1109/ICCVW54120.2021.00312Search in Google Scholar

[18] Tian Z, Shen C, Chen H, He T. FCOS: Fully convolutional one-stage object detection. Comput Vis Pattern Recognit. arXiv:1904.01355. 2019.10.1109/ICCV.2019.00972Search in Google Scholar

[19] Li Q, Li Y, Niu J. Real-time detection of underwater fish targets based on improved YOLO transfer learning. Pattern Recognit Artif Intell. 2019;32(3):193–203. in Chinese.Search in Google Scholar

[20] Zhao D, Liu X, Sun Y, Wu R, Hong J, Ruan C. Underwater crab identification method based on machine vision. J Agric Mach. 2019;50(3):151–8. (in Chinese).Search in Google Scholar

[21] Jia Z, Liu X. Target detection of marine animals based on YOLO and image enhancement. Electron Meas Technol. 2020;43(14):84–8. (in Chinese.Search in Google Scholar

[22] Teng L, Xue F, Bai Q. Remote sensing image enhancement via edge-preserving multiscale retinex. IEEE Photonics J. 2019;1–10.10.1109/JPHOT.2019.2902959Search in Google Scholar

[23] Liu K, Liang Y. Enhancement of underwater optical images based on background light estimation and improved adaptive transmission fusion. Opt Express. 2021;29(18):28307–28.10.1364/OE.428626Search in Google Scholar PubMed

[24] Hao K, Wang K, Wang B, Zhao L, Wang BB, Wang CQ. Underwater biological detection algorithm based on image enhancement and improvement of YOLOv3. J Jilin Univ (Eng Ed). 2022;52(5):1088–97. (in Chinese).Search in Google Scholar

[25] Chen YL, Dong SJ, Sun SZ, Yan KB. Improved detection algorithm of underwater biological targets in low light of YOLOv5 [J/OL]. J Beijing Univ Aeronaut Astronaut. 2023;7:1–13. in Chinese 10.13700/J.BH.1001-5965.Search in Google Scholar

[26] Xu Q, Su J, Wang Y, Zhang J, Zhong Y. Few-Shot learning based on double pooling squeeze and excitation attention. Electronics. 2023;12(1):27.10.3390/electronics12010027Search in Google Scholar

[27] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Comput Vis Pattern Recognit. arXiv. 2015;1512:03385.10.1109/CVPR.2016.90Search in Google Scholar

[28] Lin T, Dollár P, Girshick RB, He K, Hariharan B, Belongie SJ. Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 936–44.10.1109/CVPR.2017.106Search in Google Scholar

[29] Shu L, Lu Q, Haifang Q, Shi J, Jia J. Path aggregation network for instance segmentation. Comput Vis Pattern Recognit. arXiv. 01534, 1803. p. 2018.Search in Google Scholar

[30] Han K, Wang Y, Tian Q, Guo J, Xu C, Xu C. Ghostnet: More features from cheap operations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 1580–9.10.1109/CVPR42600.2020.00165Search in Google Scholar

[31] Chen C, Guo Z, Zeng H, Xiong P, Dong J. RepGhost: A hardware-efficient ghost module via re-parameterization. arXiv. 2022;2211:06088v1.Search in Google Scholar

[32] Zheng Z, Wang P, Ren D, Liu W, Ye R, Hu Q, et al. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans Cybern. 2020;52(8):8574–86.10.1109/TCYB.2021.3095305Search in Google Scholar PubMed

[33] Zheng ZH, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU Loss: Faster and better learning for bounding box regression. Proceedings of the AAAI Conference on Artificial Intelligence; 2020. p. 12993–3000.10.1609/aaai.v34i07.6999Search in Google Scholar

[34] Tong Z, Chen Y, Xu Z, Yu R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. Comput Vis Pattern Recognit. arXiv. 2023;2301:10051.Search in Google Scholar

[35] Lyu CQ, Zhang WW, Huang HA, Zhou Y, Wang YD, Liu YY, et al. RTMDet: An empirical study of designing real-time object detectors. Comput Vis Pattern Recognit. arXiv. 2022;2212:07784.Search in Google Scholar

[36] Zhang H, Wang Y, Dayoub F, Sunderhauf N. VarifocalNet: An IoU-aware Dense Object Detector. Comput Vis Pattern Recognit, arXiv: 2008.13367; 2021.10.1109/CVPR46437.2021.00841Search in Google Scholar

[37] Li X, Wang W, Wu L, Chen S, Hu X, Li J, et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv:2006.04388; 2020.

Received: 2023-08-18
Revised: 2023-10-30
Accepted: 2023-11-24
Published Online: 2023-12-29

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.