
CRNet: Context feature and refined network for multi-person pose estimation

Lanfei Zhao and Zhihua Chen
Published/Copyright: June 27, 2022

Abstract

Multi-person pose estimation is a challenging problem. Bottom-up methods have been widely studied because the prediction speed of top-down methods depends on the number of people in the input image, making them difficult to apply in real-time environments. Solving the problems of scale sensitivity and quantization error in bottom-up methods requires a model that can predict multi-scale keypoints and refine quantization error. To this end, we propose a context feature and refined network for multi-person pose estimation (CRNet). We use a multi-scale feature pyramid and context features to achieve scale invariance: global and local features are extracted and fused by attentional feature fusion (AFF) to obtain context features that adapt to multi-scale keypoints. In addition, we propose an efficient refined network to address quantization error and use multi-resolution supervised learning to further improve the prediction accuracy of CRNet. Comprehensive experiments are conducted on two benchmarks, the COCO and MPII datasets. The average precision of CRNet reaches 72.1 and 80.2%, respectively, surpassing most state-of-the-art methods.

1 Introduction

Multi-person pose estimation is a challenging problem in the field of computer vision. It aims to locate the keypoints of human joints (such as elbows, wrists, and knees) or body parts in an input image and group them into independent person instances. It is widely applied in human–computer interaction, human behavior recognition, security monitoring, etc. [1]. The application of convolutional neural networks (CNNs) has greatly improved the prediction accuracy of multi-person pose estimation [2,3,4,5]. CNN-based multi-person pose estimation methods can be divided into top-down and bottom-up methods.

Because the prediction speed of top-down methods is affected by the number of people in the natural scene, and their prediction accuracy depends heavily on the person detector, they are difficult to apply in real-time settings. Bottom-up methods are therefore receiving more and more attention. Bottom-up methods predict all human joint keypoints in the input image simultaneously. However, the multiple scales of keypoints in an image are one of the main factors limiting bottom-up methods, so the extraction of context features containing multi-scale information is being widely studied. In addition, state-of-the-art methods also face the problem of quantization error. Although some methods [2,6] address quantization error, they are difficult to adapt to keypoints of all scales.

In this work, we propose a bottom-up multi-person pose estimation method based on context features and a refined network (CRNet), aiming to enhance the scale invariance of the network and reduce quantization error. First, a context feature extraction method is proposed that extracts global and local features and fuses multi-scale features with an attention mechanism to enhance scale invariance. Second, a quantization error refined network is proposed that automatically repairs quantization error, and multi-resolution supervision is used to facilitate CRNet learning. Finally, the effectiveness of the proposed methods is verified on two mainstream multi-person pose estimation benchmark datasets: the COCO dataset [7] and the MPII dataset [8].

Our contributions are summarized as follows:

  1. We propose the context feature to overcome the difficulty of detecting multi-scale keypoints and improve the prediction accuracy of the model.

  2. We propose a refined network to cope with the inherent quantization error problem of the bottom-up methods to further improve the accuracy of multi-person pose estimation.

  3. Compared with the state-of-the-art methods, our CRNet model achieves competitive results on two mainstream benchmarks.

The remainder of this article is structured as follows:

Section 2 introduces the related work of multi-person pose estimation and the network architecture of HRNet. Section 3 discusses the proposed method, where Sections 3.1 and 3.2, respectively, introduce the structure of context feature and refined network, and Section 3.3 introduces our CRNet model. The experimental setup and result analysis are described in detail in Section 4. Section 5 summarizes the proposed methods and the effectiveness of the network.

2 Related work

2.1 Top-down methods

The top-down methods combine single-person pose estimation with an object detection algorithm: a person detector detects all human instances in the image, and the pose of each person is then estimated. G-RMI [9] uses a fully convolutional residual network to estimate the pose of each detected person on top of Faster R-CNN. CPN [10] proposes a two-stage cascaded pyramid network that introduces online hard keypoint mining to predict difficult keypoints. Simple Baseline [11] uses several deconvolution layers to increase the resolution of the feature maps output by a residual network. The main idea of the high-resolution network (HRNet) [12] is to maintain high resolution by connecting multi-resolution subnets in parallel, which allows HRNet to achieve state-of-the-art results on multiple benchmark datasets.

2.2 Bottom-up methods

Different from the top-down methods, the bottom-up methods first predict all human keypoints in the image and then assign the keypoints to individual persons with a grouping algorithm. DeepCut [13] is the first bottom-up multi-person pose estimation method; it uses an integer linear programming algorithm to solve the association problem of keypoints. OpenPose [14] uses a two-branch network to predict keypoint heatmaps and part affinity fields, respectively. Associative Embedding [15] is an end-to-end multi-person pose estimation model based on the hourglass network that efficiently predicts keypoint heatmaps and grouping information simultaneously. HigherHRNet [16] uses HRNet as the feature extraction network; deconvolution layers similar to those in Simple Baseline output higher-resolution keypoint heatmaps, combined with the grouping method of associative embedding. This model can effectively predict small-scale keypoints.

2.3 HRNet

HRNet [12] uses a stem structure consisting of two strided convolutional layers to quickly downsample the input image by a factor of four and takes the result as the input of the main body. As shown in Figure 1, the main body of HRNet is divided into four stages. A high-resolution subnet forms the first stage, and new stages are formed by gradually adding lower-resolution subnets, with the multi-resolution subnets connected in parallel. Therefore, the parallel subnets of each stage consist of the multi-resolution subnets of the previous stage plus one lower-resolution subnet. To improve the robustness of the network, multi-scale feature fusion units are included in each stage of HRNet so that subnets with different resolutions can interact with each other.

Figure 1: The structure of HRNet.

Although HRNet is a powerful backbone network that estimates conventional human poses well, its prediction accuracy on multi-scale keypoints needs improvement. The coarse downsampling of the stem structure loses small-scale keypoint information, making it difficult to estimate the poses of small-scale persons, and HRNet cannot obtain global semantic information, which leads to incorrect estimation of large-scale human poses. For these reasons, we propose the context feature to enhance the scale invariance of HRNet. In addition, a refined network is proposed to solve the inherent quantization error problem of the bottom-up methods.

3 Our method

3.1 Context feature

The purpose of the context feature is to enable the network to adapt to multi-scale human keypoints and enhance scale invariance. As shown in Figure 2, the context feature extraction module (CFEM) is composed of two parts: (a) multi-scale feature extraction and (b) attentional feature fusion. The multi-scale feature extraction module is responsible for extracting multi-scale features, and the attentional feature fusion module uses the attention mechanism to fuse them.

Figure 2: Context feature extraction module. (a) Multi-scale feature extraction, (b) attentional feature fusion.

3.1.1 Multi-scale feature extraction

The receptive field is very important for multi-person pose estimation [3,17]. To address the inability of plain CNNs to effectively extract global semantic information, we propose a multi-scale feature extraction module that extracts global and local context information simultaneously. As shown in Figure 2(a), the module has two branches: the left branch extracts global information, and the right branch extracts local information. For global information, we use global average pooling (GAP) to compress the spatial information of the input feature $X \in \mathbb{R}^{H \times W \times C}$ with $C$ channels and feature maps of size $H \times W$, and then use two point-wise convolution layers to learn the global information. Batch normalization and a ReLU activation follow each convolution, yielding the global feature $F_g \in \mathbb{R}^{1 \times 1 \times C}$ containing global context information. Formally, the process can be summarized as

(1) $F_g = f_g(g(X); W_g)$,

where $f_g(\cdot)$ denotes the global convolution and $W_g$ is the corresponding parameter set. Given an input feature $X$, $X_c(i, j)$ denotes the value at coordinate position $(i, j)$ in channel $c$, and the GAP $g(X)$ is calculated by

(2) $g(X) = \dfrac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} X_c(i, j)$.

For local information, we adopt a structure similar to the bottleneck residual block to extract local features containing more detailed information. Given an input feature $X \in \mathbb{R}^{H \times W \times C}$, the local feature $F_l \in \mathbb{R}^{H \times W \times C}$ is computed as

(3) $F_l = f_l(X; W_l)$,

where $f_l(\cdot)$ denotes the local convolution and $W_l$ is the corresponding parameter set.
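
For concreteness, a minimal PyTorch sketch of the two branches in Figure 2(a), corresponding to equations (1)–(3), is given below. The class name, reduction ratio, and exact layer widths are our assumptions; the article specifies only the operations (GAP and two point-wise convolutions with batch normalization and ReLU for the global branch, and a bottleneck-style residual path for the local branch).

```python
import torch.nn as nn

class MultiScaleExtraction(nn.Module):
    """Sketch of the two-branch extraction in Figure 2(a); widths are assumed."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        # Global branch: GAP followed by two point-wise convolutions (eqs. 1-2).
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # g(X): H x W -> 1 x 1
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Local branch: bottleneck-style residual path (eq. 3).
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        f_g = self.global_branch(x)   # F_g in R^{1 x 1 x C}
        f_l = self.local_branch(x)    # F_l in R^{H x W x C}
        return f_g, f_l
```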

3.1.2 Attentional feature fusion

The feature fusion based on the attention mechanism can effectively highlight information-rich features [18]. Since attentional feature fusion (AFF) [19] only uses channel attention, we add a spatial attention mechanism on the basis of AFF to construct an attentional feature fusion module for fusing multi-scale features. As shown in Figure 2(b), the module takes the global feature $F_g$ and local feature $F_l$ from the multi-scale feature extraction module as input, merges them, and feeds the result into the channel-spatial attention module (CSAM) to compute the attention weight. The attention weight is multiplied by the input features to obtain the attention features, which are fused into the context feature $Y \in \mathbb{R}^{H \times W \times C}$. The resulting feature contains global and local context information and can effectively predict multi-scale human keypoints. Let $M(\cdot)$ represent the CSAM; the process can be expressed as

(4) $Y = M(F_g \oplus F_l) \otimes F_g + (1 - M(F_g \oplus F_l)) \otimes F_l$,

where $\oplus$ and $\otimes$ represent broadcast addition and multiplication, respectively. In Figure 2(b), the dashed line represents $1 - M(F_g \oplus F_l)$. It is worth noting that the output of $M(F_g \oplus F_l)$ is a real number between 0 and 1, and so is $1 - M(F_g \oplus F_l)$, which allows the attentional feature fusion module to weight $F_g$ and $F_l$ with attention.
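
A minimal sketch of the fusion rule in equation (4) follows, assuming a CSAM module (sketched in Section 3.1.2) that returns an attention map with values in (0, 1):

```python
import torch.nn as nn

class AttentionalFeatureFusion(nn.Module):
    """Fuses F_g and F_l per eq. (4); `attention` plays the role of M(.)."""
    def __init__(self, attention: nn.Module):
        super().__init__()
        self.attention = attention

    def forward(self, f_g, f_l):
        w = self.attention(f_g + f_l)       # M(F_g (+) F_l), values in (0, 1)
        # Broadcasting expands F_g from 1 x 1 to the H x W spatial size of F_l.
        return w * f_g + (1.0 - w) * f_l    # eq. (4)
```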

As shown in Figure 3, both channel and spatial attention mechanisms are used in CSAM. Given an intermediate feature $X \in \mathbb{R}^{H \times W \times C}$ as input, the output feature $X' \in \mathbb{R}^{H \times W \times C}$ refined by CSAM is obtained as

(5) $X' = s(M_c(X) \oplus M_s(X)) \otimes X$,

where $M_c(X)$ and $M_s(X)$ represent channel attention and spatial attention, respectively, and $s(\cdot)$ denotes the sigmoid activation function.

Figure 3: Channel-spatial attention module.

Channel attention is generated based on the relationships between feature channels. First, GAP is used to spatially compress the input features, and then the dependencies between channels are learned through two point-wise convolutional layers. The process can be expressed as

(6) $M_c(X) = f_c(g(X); W_c)$,

where $f_c(\cdot)$ denotes the channel convolution and $W_c$ is the corresponding parameter set.

Spatial attention is generated based on the spatial dependencies between features. Maximum pooling $\mathrm{MP}_c$ and average pooling $\mathrm{AP}_c$ are performed along the channel direction, and the outputs are concatenated. A $3 \times 3$ convolutional layer then learns the spatial relationships of the features to obtain the spatial attention $M_s(X)$. The process can be expressed as

(7) $M_s(X) = f_s([\mathrm{AP}_c(X); \mathrm{MP}_c(X)]; W_s)$,

where $f_s(\cdot)$ denotes the spatial convolution and $W_s$ is the corresponding parameter set.
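
Under the same assumptions, CSAM can be sketched as follows. Here the module returns the attention map $s(M_c(X) \oplus M_s(X))$ of equations (5)–(7), and the final multiplication by the input in equation (5) is performed by the caller, as in equation (4); the reduction ratio is again our assumption.

```python
import torch
import torch.nn as nn

class CSAM(nn.Module):
    """Channel-spatial attention (eqs. 5-7); returns an attention map in (0, 1)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        # Channel attention: GAP + two point-wise convolutions (eq. 6).
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )
        # Spatial attention: 3 x 3 convolution over concatenated pooled maps (eq. 7).
        self.spatial = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x):
        m_c = self.channel(x)                             # M_c(X): (B, C, 1, 1)
        avg = torch.mean(x, dim=1, keepdim=True)          # AP_c(X): (B, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)         # MP_c(X): (B, 1, H, W)
        m_s = self.spatial(torch.cat([avg, mx], dim=1))   # M_s(X): (B, 1, H, W)
        return torch.sigmoid(m_c + m_s)                   # broadcasts to (B, C, H, W)
```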

3.2 Refined network

To address quantization error, previous methods add an offset vector to the maximum activation position of the predicted heatmaps. The stacked hourglass network [2] uses a standard offset equal to 1/4 of the unit vector from the maximal activation toward the second maximal activation. However, this method relies only on an empirical formula without theoretical derivation. DARK [6] performs a Taylor series expansion at the maximum position of the heatmaps to obtain the locations of the predicted keypoints. When the ground-truth position of a keypoint is far from the predicted maximum position of the heatmap, the Taylor expansion condition is not met, so this method struggles with keypoints that have a large range of motion, such as wrists and ankles. In this work, we exploit the powerful nonlinear expressiveness of CNNs and propose a refined network to solve the quantization error problem. The goal of the refined network is to find a nonlinear function $f_{\mathrm{refine}}(\cdot)$ that maps keypoint positions with error to precise positions. Given the heatmap $H^h \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times N}$ containing information about $N$ keypoints from the backbone network, the refined heatmap $H^r \in \mathbb{R}^{H \times W \times N}$ produced by the refined network is computed as

(8) $H^r = f_{\mathrm{refine}}(H^h; W_r)$,

where $W_r$ is the parameter set of the refined network.

To balance efficiency and accuracy, as shown in Figure 4, the refined network uses two deconvolution layers to upsample the heatmaps output by the backbone network to the input image size. Four bottleneck residual blocks are used for refinement learning, and skip connections across blocks enhance the refined features. Finally, a point-wise convolution layer matches the number of channels to obtain the calibrated heatmap. A sketch of this structure is given below.
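
A possible PyTorch realization is sketched here. The intermediate channel width (32), deconvolution kernel size, and exact placement of the skip connection are our assumptions, since the article specifies only the component counts.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Standard bottleneck residual block used for refinement learning."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class RefinedNetwork(nn.Module):
    """Sketch of Figure 4: two deconvolutions (x4 upsampling), four bottleneck
    blocks with a cross-block skip connection, and a point-wise output conv."""
    def __init__(self, num_joints: int = 17, width: int = 32):
        super().__init__()
        self.up = nn.Sequential(  # upsample H/4 x W/4 heatmaps to H x W
            nn.ConvTranspose2d(num_joints, width, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, width, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
        )
        self.blocks = nn.ModuleList([Bottleneck(width) for _ in range(4)])
        self.head = nn.Conv2d(width, num_joints, kernel_size=1)  # channel matching

    def forward(self, h):             # h: (B, N, H/4, W/4) heatmaps from the backbone
        x = self.up(h)
        skip = x                      # skip connection across the refinement blocks
        for block in self.blocks:
            x = block(x)
        return self.head(x + skip)    # refined heatmaps H^r, eq. (8)
```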

Figure 4: The refined network.

3.3 CRNet

Our CRNet takes HRNet [12] as the backbone and improves it with the proposed context feature and refined network. The overall architecture of CRNet is shown in Figure 5.

  1. Due to the coarse downsampling of the stem in HRNet, important detailed position information is lost and the importance of low-level features is ignored. To solve this problem, we use a feature pyramid module instead of the stem structure for downsampling. The module accepts input at three scales, which better extracts low-level and small-scale features and alleviates the problem that small-scale keypoints cannot be detected due to coarse downsampling.

  2. The proposed CFEM replaces the residual blocks of the 1/8-, 1/16-, and 1/32-resolution subnets in HRNet, while the residual blocks of the 1/4-resolution subnet are retained. The improved HRNet can extract global and local features and better fuse multi-scale features via the attention mechanism, enhancing the scale invariance of the network.

  3. A refined network is added to the end of the improved HRNet so that the network can calibrate the quantization error and output more accurate keypoint heatmaps. In addition, a multi-resolution loss function is used to supervise CRNet learning; a sketch of this loss follows the equations below. In this work, the mean square error loss is used. $L_h$ and $L_r$ are the losses at the two scales corresponding to the improved HRNet and the refined network, respectively. The total loss of CRNet is

(9) $\mathrm{Loss} = \alpha L_h + (1 - \alpha) L_r$,

where $\alpha$ denotes the balance factor; ablation experiments (Section 4.4.2) show that prediction accuracy is highest when it is set to 0.9. $L_h$ and $L_r$ are, respectively, calculated by

(10) $L_h = \dfrac{1}{N} \sum_{i=1}^{N} \| \hat{H}^h_i - H^h_i \|_2^2$,

(11) $L_r = \dfrac{1}{N} \sum_{i=1}^{N} \| \hat{H}^r_i - H^r_i \|_2^2$,

where $\hat{H}^h_i \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4}}$ and $\hat{H}^r_i \in \mathbb{R}^{H \times W}$ are, respectively, the label heatmaps of the $i$th keypoint at the two scales. Let $x$ denote the 2D coordinates on the heatmap and $\tilde{x}_i$ denote the ground-truth coordinates of the $i$th keypoint; $\hat{H}_i$ is generated by

(12) $\hat{H}_i(x) = \dfrac{1}{2\pi\sigma^2} \exp\left(-\dfrac{\|x - \tilde{x}_i\|_2^2}{2\sigma^2}\right)$,

where $\sigma$ is the standard deviation, set to 1 in this work.
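
The following sketch shows how the label heatmaps of equation (12) and the total loss of equations (9)–(11) could be computed; averaging the squared error over all heatmap elements rather than only over the $N$ keypoints is a common implementation choice and is our assumption here.

```python
import numpy as np
import torch

def gaussian_heatmap(height: int, width: int, center_xy, sigma: float = 1.0):
    """Label heatmap for one keypoint, eq. (12); center_xy is the ground truth."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - center_xy[0]) ** 2 + (ys - center_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def crnet_loss(pred_h, target_h, pred_r, target_r, alpha: float = 0.9):
    """Total multi-resolution loss, eq. (9): alpha * L_h + (1 - alpha) * L_r."""
    l_h = torch.mean((pred_h - target_h) ** 2)   # eq. (10): 1/4-resolution MSE
    l_r = torch.mean((pred_r - target_r) ** 2)   # eq. (11): full-resolution MSE
    return alpha * l_h + (1.0 - alpha) * l_r
```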

Figure 5: The overall architecture of CRNet.

4 Experiments

Experiments are performed on two mainstream multi-person pose estimation benchmarks (COCO dataset [7] and MPII dataset [8]) to verify the effectiveness of the proposed methods in enhancing network scale invariance and optimizing quantization error.

4.1 Experiment setup

4.1.1 Datasets

We evaluate the proposed CRNet model on two widely adopted 2D benchmark datasets: the COCO dataset [7] and the MPII dataset [8]. The train/val/test sets of the COCO keypoint detection dataset contain 57k, 5k, and 20k images, respectively. The standard evaluation metric of COCO is Object Keypoint Similarity (OKS), calculated by

(13) $\mathrm{OKS} = \dfrac{\sum_{i=1}^{17} \exp\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_{i=1}^{17} \delta(v_i > 0)}$,

where $i$ indexes the keypoints, $d_i$ is the Euclidean distance between the predicted keypoint position and the ground-truth position, $v_i$ indicates whether the keypoint is visible, $s$ is the object scale, $k_i$ is a per-keypoint constant controlling attenuation, and $\delta(\cdot)$ is the step function. We report the standard average precision: AP50 (AP at OKS = 0.50), AP75, mAP (the mean of AP scores at 10 OKS thresholds, 0.50, 0.55, …, 0.90, 0.95), $\mathrm{AP}^M$ for medium objects, and $\mathrm{AP}^L$ for large objects.
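
As an illustration, equation (13) can be evaluated for a single person instance as follows; this is a sketch in which `k` holds the per-keypoint constants published with the COCO API and the object scale satisfies $s^2 = \text{area}$.

```python
import numpy as np

def oks(pred, gt, visible, k, area):
    """Object Keypoint Similarity, eq. (13).
    pred, gt: (17, 2) keypoint coordinates; visible: (17,) flags v_i;
    k: (17,) per-keypoint constants k_i; area: object scale squared, s^2."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances d_i^2
    sim = np.exp(-d2 / (2.0 * area * k ** 2))    # per-keypoint similarity
    labeled = visible > 0                        # delta(v_i > 0)
    return sim[labeled].sum() / max(labeled.sum(), 1)
```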

The MPII Human Pose dataset contains 5,602 multi-person images, of which 3,844 are used to train the CRNet model and the remaining 1,758 for testing. The dataset also provides more than 28,000 annotated single-person pose samples, and each person is annotated with 16 keypoints. The MPII dataset uses the mean average precision (mAP) of keypoint detection to evaluate model accuracy.

4.1.2 Data augmentation

We use conventional data augmentation strategies for multi-person pose estimation. For the COCO dataset, the training samples are augmented with random rotation of −30 to 30 degrees, random scaling of 0.75 to 1.5 times, random cropping to 512 × 512 or 640 × 640, and random flipping. For the 512 × 512 input size, CRNet generates ground-truth heatmaps at resolutions of 128 × 128 and 512 × 512, while for the 640 × 640 input size, it generates heatmaps at 160 × 160 and 640 × 640. For the MPII dataset, the training samples are augmented with random cropping to 384 × 384, random rotation of −40 to 40 degrees, random scaling of 0.7 to 1.3 times, and random flipping.
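
A minimal sketch of one such augmentation step for the COCO setting, assuming OpenCV, is shown below. The helper name and keypoint handling are illustrative; a complete pipeline would also swap left/right joint indices after flipping.

```python
import cv2
import numpy as np

def augment(image, keypoints, out_size=512):
    """Random rotation (-30..30 deg), scaling (0.75..1.5), crop to out_size, flip.
    keypoints: (K, 2) array of pixel coordinates."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-30, 30)
    scale = np.random.uniform(0.75, 1.5)
    # Rotate and scale about the image center, then translate to the output center.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale * out_size / max(h, w))
    m[0, 2] += out_size / 2 - w / 2
    m[1, 2] += out_size / 2 - h / 2
    image = cv2.warpAffine(image, m, (out_size, out_size))
    keypoints = np.hstack([keypoints, np.ones((len(keypoints), 1))]) @ m.T
    if np.random.rand() < 0.5:                        # random horizontal flip
        image = image[:, ::-1].copy()
        keypoints[:, 0] = out_size - 1 - keypoints[:, 0]
    return image, keypoints
```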

4.1.3 Training

For the COCO dataset, the Adam optimizer [20] is used to update parameters. The base learning rate is 1 × 10⁻³ and drops to 1 × 10⁻⁴ and 1 × 10⁻⁵ at the 200th and 260th epochs, respectively; the proposed CRNet model is trained for a total of 300 epochs. For the MPII dataset, we randomly select 350 samples from the multi-person training set as a validation set and use the remaining multi-person and single-person samples for training. RMSprop [21] is used as the optimizer with a base learning rate of 0.003; training runs for 250 epochs, and the learning rate is halved at epochs 150, 170, 200, and 230.
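
The COCO schedule corresponds to a standard step decay; a sketch (with a stand-in model, since only the schedule matters here):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 17, kernel_size=1)  # stand-in for CRNet; illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[200, 260], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over the COCO training set would run here ...
    scheduler.step()  # lr: 1e-3 -> 1e-4 at epoch 200 -> 1e-5 at epoch 260
```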

4.2 Results on COCO dataset

We evaluate CRNet on the COCO test-dev set. Table 1 compares the proposed CRNet model with state-of-the-art bottom-up methods. The bottom-up implementation of HRNet is still strong, with mAP reaching 64.1%, close to the Associative Embedding network, while its parameters and FLOPs are only 10 and 19% of those of Associative Embedding, respectively. With HRNet-W32 as the backbone, our CRNet reaches 69.2% mAP, better than most networks and only slightly lower than HigherHRNet with HRNet-W48, while being more favorable in parameters and FLOPs. With HRNet-W48 as the backbone, CRNet achieves the best accuracy, with mAP reaching 72.1%; compared with HigherHRNet, the accuracy for medium and large scales improves by 1.9 and 1.1%, respectively, indicating that the proposed CRNet can effectively predict multi-scale human poses.

Table 1

Comparisons with state-of-the-arts on the COCO test-dev set

Method            Backbone    Input size  Params  GFLOPs  mAP   AP50  AP75  AP^M  AP^L
OpenPose [14]     —           —           —       —       61.8  84.9  67.5  57.1  68.2
AE [15]           Hourglass   512         277.8M  206.9   65.5  86.8  72.3  60.6  72.6
PersonLab [22]    ResNet-152  1401        68.7M   405.5   68.7  89.0  75.4  64.1  75.5
PifPaf [23]       —           —           —       —       66.7  —     —     62.4  72.9
HRNet*            HRNet-W32   512         28.5M   38.9    64.1  86.3  70.4  57.4  73.9
SPM [24]          Hourglass   —           —       —       66.9  88.5  72.9  62.6  73.1
HigherHRNet [16]  HRNet-W48   640         63.8M   154.3   70.5  89.3  77.2  66.6  75.8
CRNet (Ours)      HRNet-W32   512         30.2M   54.7    69.2  87.6  73.6  64.8  74.5
CRNet (Ours)      HRNet-W48   640         68.5M   197.4   72.1  89.8  78.3  68.5  76.9

* represents a bottom-up implementation.

4.3 Results on MPII dataset

Table 2 presents the evaluation results of the proposed CRNet on the MPII test-dev set. Previous methods achieve high accuracy on relatively fixed or large keypoints or parts such as the head and shoulders, but the prediction accuracy for flexible keypoints such as wrists and ankles remains low, because these keypoints are easily occluded and their spatial locations are changeable. The network must both learn multi-scale context information and refine the quantization error, which previous methods find difficult to achieve simultaneously. The proposed CRNet combines multi-scale context features with a refined network, so it can predict multi-scale keypoints and refine the quantization error inherent in bottom-up methods. Our CRNet model achieves 80.2% mAP on the MPII multi-person test-dev set, surpassing the previous state-of-the-art models. As shown in Figure 6, the prediction accuracy of CRNet is better than previous methods on most keypoints; elbow, wrist, knee, and ankle improve by 1.7, 1.9, 1.5, and 2.1%, respectively, demonstrating that the proposed CRNet better handles difficult keypoints or parts.

Table 2

Comparison with state-of-the-arts on the MPII test-dev set

Method                     Head  Sho.  Elb.  Wri.  Hip.  Knee  Ank.  mAP
Insafutdinov et al. [25]   88.8  85.2  75.9  64.9  74.2  68.8  60.5  74.3
Duan et al. [26]           88.4  86.3  70.4  63.4  73.6  72.5  66.7  74.6
Cao et al. [14]            91.2  87.6  77.7  66.8  75.4  68.9  61.7  75.6
Newell et al. [15]         92.1  89.3  78.9  69.8  76.2  71.6  64.7  77.5
Fieraru et al. [27]        91.8  89.5  80.4  69.6  77.3  71.7  65.5  78.0
Nie et al. [24]            89.7  87.4  80.4  72.4  76.7  74.9  68.3  78.5
CRNet (Ours)               92.3  88.9  82.1  74.3  76.9  76.4  70.4  80.2

Bold values represent the best results for the corresponding metric.

Figure 6: Comparison of prediction accuracy of all keypoints.

4.4 Ablation experiments

4.4.1 Effectiveness of each component in CRNet

To analyze the effectiveness of each individual component in the proposed CRNet, we perform a series of ablation experiments on the COCO validation set using HRNet-W32 as the backbone network. Figure 7 shows the five network structures adopted in the ablation experiments: Figure 7(a) shows the original HRNet, Figure 7(b) adds the feature pyramid structure, Figure 7(c) shows the improved HRNet with the context feature, Figure 7(d) adds the refined network, and Figure 7(e) uses multi-resolution supervision. The experimental results are presented in Table 3.

Figure 7: The network structure of the ablation experiment.

Table 3

Results of ablation experiments on the COCO validation set

Network       Feature pyramid  Context feature  Refined network  Multi-resolution supervision  mAP   AP^M  AP^L
HRNet         —                —                —                —                             63.9  56.7  72.4
CRNet (Ours)  ✓                —                —                —                             64.5  57.2  72.5
CRNet (Ours)  ✓                ✓                —                —                             67.5  62.9  74.1
CRNet (Ours)  ✓                ✓                ✓                —                             66.7  62.6  73.7
CRNet (Ours)  ✓                ✓                ✓                ✓                             68.8  65.1  74.6

Bold values represent the best results for the corresponding metric.

4.4.1.1 Feature pyramid structure

We use a feature pyramid to replace the stem structure in HRNet for downsampling. Table 3 shows that mAP improves by 0.6% with the feature pyramid, because the stem structure discards important low-level features while the feature pyramid retains them. In addition, the feature pyramid alleviates the loss of small-scale keypoint information caused by coarse downsampling.

4.4.1.2 Context feature

To solve the scale-sensitivity problem of HRNet, we propose the context feature module and use it to replace the residual blocks in HRNet. Table 3 shows that mAP reaches 67.5% with context features. Both $\mathrm{AP}^M$ and $\mathrm{AP}^L$ improve greatly, indicating that, compared with HRNet, our method better predicts multi-scale keypoints.

4.4.1.3 Refined network

After adding the refined network alone, mAP decreases by 0.8%. This is caused by the lack of multi-resolution supervision: the network still treats the refined network as part of the backbone, and since the context-feature-based HRNet is already powerful, simply appending the refined network makes the model prone to overfitting.

4.4.1.4 Multi-resolution supervision

After using the refined network together with multi-resolution supervision, mAP reaches 68.8%. The multi-resolution supervision effectively separates the roles of the improved HRNet and the refined network: the former extracts important features and outputs heatmaps, while the latter refines them. Table 3 shows a significant improvement in $\mathrm{AP}^M$, demonstrating that the quantization error of small- and medium-scale keypoints is more severe than that of large-scale ones.

4.4.2 The impact of balance factor

To verify the influence of the balance factor in equation (9), the value of $\alpha$ is gradually increased from 0 to 1 with a stride of 0.1. HRNet-W32 is used as the backbone for training, and the test results on the COCO validation set are shown in Figure 8. mAP is highest when $\alpha$ is 0.9, so $\alpha$ is set to 0.9 in this work.

Figure 8: The impact of the balance factor on the network.

4.4.3 The impact of input size

We propose the context feature to solve the problem of scale sensitivity: features at multiple scales are considered and fused with the attention mechanism. To verify its effectiveness in enhancing scale invariance, we evaluate the impact of input image size on HRNet and the context-feature-improved HRNet on the COCO validation set. HRNet-W32 is used as the backbone, with three input resolutions of 256 × 256, 384 × 384, and 512 × 512. Two important conclusions can be drawn from Table 4: (a) as the input size decreases, the prediction accuracy of both HRNet and the improved HRNet decreases to varying degrees, confirming that the scale-sensitivity problem exists; (b) compared with HRNet, our method loses less precision, especially when the resolution is reduced to a lower level, demonstrating that the context feature can indeed enhance scale invariance and is more robust to different scales.

Table 4

The effect of input size on the COCO validation set

Network                  Input size  mAP   AP50  AP75  AP^M  AP^L
HRNet                    256 × 256   55.9  79.4  61.7  53.2  64.4
HRNet + Context feature  256 × 256   61.2  83.7  67.6  57.5  68.5
HRNet                    384 × 384   61.7  84.8  67.9  55.5  69.3
HRNet + Context feature  384 × 384   65.8  86.4  71.3  60.3  73.2
HRNet                    512 × 512   63.9  85.3  70.1  56.7  72.4
HRNet + Context feature  512 × 512   67.5  88.6  72.9  62.9  74.1

Bold values represent the best results for the corresponding metric.

4.4.4 Effectiveness of the refined network

In this work, we propose a refined network to address quantization error. To verify its effectiveness, we compare it with several offset methods: no offset (taking the heatmap's maximum activation directly), the standard offset [2], and DARK [6]. HRNet-W32 is used as the backbone with an input resolution of 512 × 512. From the results on the COCO validation set in Table 5, we draw two conclusions: (a) the standard offset method brings a 1.4% mAP improvement, indicating that the original HRNet does suffer from quantization error and that the offset method suppresses it to a certain extent; (b) compared with DARK, our refined network improves mAP by 0.5%, showing that the refined network suppresses quantization error better than the offset methods and thus further improves the prediction accuracy of the model.

Table 5

Effectiveness of the refined network on the COCO validation set

Network                  mAP   AP50  AP75  AP^M  AP^L
HRNet + no offset        63.9  85.3  70.1  56.7  72.4
HRNet + standard offset  65.3  85.7  70.8  59.3  73.0
HRNet + DARK             66.4  86.2  71.6  61.4  73.8
HRNet + refined network  66.9  87.4  72.5  62.3  73.9

Bold values represent the best results for the corresponding metric.

4.5 Visualization of inference results

Figure 9 visualizes the inference results of the proposed CRNet for conventional poses on the COCO and MPII datasets and shows that CRNet performs excellently on conventional human pose estimation. The visualization of multi-scale human pose inference is shown in Figure 10, where Figure 10(a) shows the output of HRNet and Figure 10(b) the output of the proposed CRNet. In Figure 10(a), HRNet produces incorrect or missing estimates for large-scale or small-scale keypoints, mainly because the stem module loses small-scale keypoint information and HRNet cannot obtain global context information for large-scale human bodies. The feature pyramid structure and context feature proposed in this work effectively solve the scale-sensitivity problem of HRNet, and the proposed refined network further improves the prediction accuracy of multi-person pose estimation.

Figure 9: Visualization of conventional poses. (a) Inference results on the COCO dataset. (b) Inference results on the MPII dataset.

Figure 10: Visualization of multi-scale poses. (a) Multi-scale pose inference results of HRNet. (b) Multi-scale pose inference results of our CRNet.

5 Conclusion

Aiming at the problems of scale sensitivity and quantization error in bottom-up multi-person pose estimation, we proposed CRNet, a context feature and refined network for multi-person pose estimation based on HRNet. We use a multi-scale feature pyramid and context features to handle multi-scale variation: multi-scale features are extracted and fused with the attentional feature fusion method to obtain context features, which effectively enhance the scale invariance of the network. In addition, we propose a simple but efficient refined network to solve the quantization error problem, and CRNet is trained with multi-resolution supervision. The average precision of CRNet on the COCO and MPII multi-person test-dev sets is 72.1 and 80.2%, respectively, outperforming most bottom-up state-of-the-art methods.

Conflict of interest: The authors state no conflict of interest.

References

[1] Chen Y, Tian Y, He M. Monocular human pose estimation: A survey of deep learning-based methods. Comput Vis Image Underst. 2020;192:1–20. doi:10.1016/j.cviu.2019.102897.

[2] Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. Proceedings of the European Conference on Computer Vision. Amsterdam, Netherlands. Berlin: Springer; 2016, October 8–16. p. 483–99. doi:10.1007/978-3-319-46484-8_29.

[3] Chu X, Yang W, Ouyang W, Ma C, Yuille AL, Wang X. Multi-context attention for human pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA. Piscataway: IEEE; 2017, July 21–26. p. 1831–40. doi:10.1109/CVPR.2017.601.

[4] Nie X, Feng J, Xing J, Xiao S, Yan S. Hierarchical contextual refinement networks for human pose estimation. IEEE Trans Image Process. 2018;28(2):924–36. doi:10.1109/TIP.2018.2872628.

[5] Wang Z, Liu G, Tian G. A parameter efficient human pose estimation method based on densely connected convolutional module. IEEE Access. 2018;6:58056–63. doi:10.1109/ACCESS.2018.2874307.

[6] Zhang F, Zhu X, Dai H, Ye M, Zhu C. Distribution-aware coordinate representation for human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA. Piscataway: IEEE; 2020, June 16–20. p. 7093–102. doi:10.1109/CVPR42600.2020.00712.

[7] Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision. Zurich, Switzerland. Berlin: Springer; 2014, September 5–12. p. 740–55. doi:10.1007/978-3-319-10602-1_48.

[8] Andriluka M, Pishchulin L, Gehler P, Schiele B. 2D human pose estimation: New benchmark and state of the art analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA. Piscataway: IEEE; 2014, June 23–28. p. 3686–93. doi:10.1109/CVPR.2014.471.

[9] Papandreou G, Zhu T, Kanazawa N, Toshev A, Tompson J, Bregler C, et al. Towards accurate multi-person pose estimation in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA. Piscataway: IEEE; 2017, July 21–26. p. 4903–11. doi:10.1109/CVPR.2017.395.

[10] Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J. Cascaded pyramid network for multi-person pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA. Piscataway: IEEE; 2018, June 19–23. p. 7103–12. doi:10.1109/CVPR.2018.00742.

[11] Xiao B, Wu H, Wei Y. Simple baselines for human pose estimation and tracking. Proceedings of the European Conference on Computer Vision. Munich, Germany. Berlin: Springer; 2018, September 8–14. p. 466–81. doi:10.1007/978-3-030-01231-1_29.

[12] Sun K, Xiao B, Liu D, Wang J. Deep high-resolution representation learning for human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA. Piscataway: IEEE; 2019, June 15–21. p. 5693–703. doi:10.1109/CVPR.2019.00584.

[13] Pishchulin L, Insafutdinov E, Tang S, Andres B, Andriluka M, Gehler PV, et al. DeepCut: Joint subset partition and labeling for multi person pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA. Piscataway: IEEE; 2016, June 26–July 1. p. 4929–37. doi:10.1109/CVPR.2016.533.

[14] Cao Z, Simon T, Wei SE, Sheikh Y. Realtime multi-person 2D pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA. Piscataway: IEEE; 2017, July 21–26. p. 7291–9. doi:10.1109/CVPR.2017.143.

[15] Newell A, Huang Z, Deng J. Associative embedding: End-to-end learning for joint detection and grouping. Adv Neural Inf Process Syst. 2017;30:2277–87.

[16] Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA. Piscataway: IEEE; 2020, June 16–20. p. 5386–95. doi:10.1109/CVPR42600.2020.00543.

[17] Li J. Research on bottom-up approaches for multi-person pose estimation. PhD thesis. Hefei: University of Science and Technology of China; 2021.

[18] Su K, Yu D, Xu Z, Geng X, Wang C. Multi-person pose estimation with enhanced channel-wise and spatial information. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA. Piscataway: IEEE; 2019, June 15–21. p. 5674–82. doi:10.1109/CVPR.2019.00582.

[19] Dai Y, Gieseke F, Oehmcke S, Wu Y, Barnard K. Attentional feature fusion. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Virtual. Piscataway: IEEE; 2021, January 5–9. p. 3560–9. doi:10.1109/WACV48630.2021.00360.

[20] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980; 2014.

[21] Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn. 2012;4(2):26–31.

[22] Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. Proceedings of the European Conference on Computer Vision. Munich, Germany. Berlin: Springer; 2018, September 8–14. p. 269–86. doi:10.1007/978-3-030-01264-9_17.

[23] Kreiss S, Bertoni L, Alahi A. PifPaf: Composite fields for human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA. Piscataway: IEEE; 2019, June 15–21. p. 11977–86. doi:10.1109/CVPR.2019.01225.

[24] Nie X, Feng J, Zhang J, Yan S. Single-stage multi-person pose machines. Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, Korea. Piscataway: IEEE; 2019, October 27–November 2. p. 6951–60. doi:10.1109/ICCV.2019.00705.

[25] Insafutdinov E, Andriluka M, Pishchulin L, Tang S, Levinkov E, Andres B, et al. ArtTrack: Articulated multi-person tracking in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA. Piscataway: IEEE; 2017, July 21–26. p. 6457–65. doi:10.1109/CVPR.2017.142.

[26] Duan P, Wang T, Cui M, Sang H, Sun Q. Multi-person pose estimation based on a deep convolutional neural network. J Vis Commun Image Represent. 2019;62:245–52. doi:10.1016/j.jvcir.2019.05.010.

[27] Fieraru M, Khoreva A, Pishchulin L, Schiele B. Learning to refine human pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, USA. Piscataway: IEEE; 2018, June 19–23. p. 205–14. doi:10.1109/CVPRW.2018.00058.

Received: 2021-12-31
Revised: 2022-02-27
Accepted: 2022-05-01
Published Online: 2022-06-27

© 2022 Lanfei Zhao and Zhihua Chen, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
