Basketball action recognition by fusing video recognition techniques with an SSD target detection algorithm

Weizhao He; Bin Li

doi:10.1515/comp-2025-0027

Article Open Access

Published/Copyright: May 7, 2025

From the journal Open Computer Science Volume 15 Issue 1

Abstract

In the digital era, computer vision technology has become a key tool in sports analysis and is widely used for basketball action recognition (BAR). To improve the recognition accuracy and efficiency of technical actions in basketball, a BAR system that integrates three-dimensional convolutional neural network (3D-CNN) video recognition technology with a single-shot multibox detector (SSD) target detection algorithm is proposed. First, the basketball player’s action sequence is captured by the video recognition technique, and then the SSD target detection algorithm is used for real-time detection and localization of the player’s key parts. Combining these two techniques makes it possible to recognize and classify a wide range of basketball actions more accurately, including but not limited to shooting, dribbling, passing, and defending. The experimental results indicated that the proposed model achieved the best recognition performance on continuous video input of 16 frames, with faster response and convergence and good stability during computation. In the application comparison experiments, the average recognition rate of the proposed model was 89.5% on the original frames and 92.5% on the cropped frames, and the mean average precision was 91.69%. For the five basketball actions other than shoot, the misjudgment rate of the proposed model was the lowest among the three methods, which significantly improved the accuracy and effectiveness of action recognition. In addition, the accuracy of the proposed model in practical applications reached 94.50% and its processing time was 95 ms/frame, both significantly better than the other two compared models, indicating that the model is suitable for real-time BAR tasks and can provide efficient and accurate results.

Keywords: basketball action recognition; video recognition techniques; SSD target detection; 3D convolutional networks; action classification

1 Introduction

In modern sports science research, action recognition and analysis of athletes is an important means to improve training effects and game performance [1]. For coaches and athletes, accuracy and real-time action recognition are crucial because basketball is a fast-paced, high-intensity activity [2]. Target detection algorithms (TDAs) and video recognition technologies have become widely used in the field of sports action recognition in recent years due to the rapid development of computer vision systems and deep learning technology [3]. In basketball, the diversity and complexity of technical movements require the recognition system not only to have high accuracy but also to be able to respond in real time. Video recognition technology uses multidimensional convolutional neural networks (CNNs) to analyze continuous video frames to capture and understand the dynamic behavior of athletes. Among them, single-shot multibox detector (SSD) TDA is widely used in the field of target detection because of its efficient target detection capability and fast processing speed [4,5]. Many scholars have conducted a lot of research on target detection as well as basketball action recognition (BAR) algorithms. For SSD in small target recognition and the occlusion problem between objects, Qu et al. presented a single target detection with multi-scale feature fusion and feature improvement. According to experimental data, the suggested method performed better and had a 2.4% increase in detection accuracy over the original SSD approach [6]. Liu et al. suggested a bidirectional feature fusion pyramid SSD technique by developing a bidirectional feature pyramid structure and introducing coordinated attention. This achieved a bi-directional fusion of feature layers and improved attention to important information. The study’s findings showed that the suggested method’s mAP was 76.48%, 3.52% higher than that of the SSD algorithm, and that its foreign object detection accuracy was 98.26% [7]. 
A study on some uncontrollable disturbing factors for face detection in video sequences was carried out by Liu et al. The findings showed that the method can realize real-time monitoring, which in turn improves face detection accuracy [8]. The three-dimensional convolutional neural network (3D-CNN) is a deep learning technique for learning time-series data that can extract rich spatio-temporal features from video data, enabling accurate recognition of complex motion behaviors [9,10]. Lin and Lin presented a dynamic thermal management scheme based on predictive rate control and task allocation for 3D-CNN accelerator systems. The technique controlled the workload and parallelism of the 3D-CNN accelerator through rate control, temperature prediction, workload shifting, and workload adjustment. According to the findings, the peak temperature of the 3D-CNN accelerator could be kept within a preset temperature range [11]. An approach to violence recognition was put forth by Keceli and Kaya, who used 3D-CNN and transfer learning to classify videos of violent activities. The approach used transfer learning to create deep features from the input video frames and then classified them using a 3D-CNN classifier. The results showed that the suggested method achieved high classification accuracy on the movie and field hockey game datasets [12].

The BAR technique is to automatically recognize and classify various technical actions executed by basketball players on the court using computer vision and machine learning algorithms [13]. Zhang and Wang suggested a feature extraction (FE)-based method for basketball turn and dribble visual image recognition in an effort to address the diversity of spatio-temporal features of the action, the increase in the amount of data, and the complexity of the network. The approach created the association between the velocity field of basketball turn and dribble and the grayscale of image frames by adding optical flow pictures and processed the image frames using CNN, and the testing results verified the usefulness of the suggested model [14]. In an effort to increase the precision of basketball player shoot action detection evaluation, Xu presented a graph CNN-based approach. The technique created a model to extract the shot action features by using 3D graph CNN to analyze the optical flow data of the movie. According to experimental tests, the suggested model’s standard rate of action recognition evaluation was 0.8648 with a 0.0066 error, and it performed well in terms of recognition accuracy when it came to shoot actions [15]. Zhang et al. suggested a new image segmentation-based shoot-action recognition method for basketball players in an attempt to optimize the accuracy and time consumption of traditional shoot-action recognition methods for basketball players. The method processed the shooting action image by thresholding and edge segmentation techniques and used the kernel function to extract the change features to realize the recognition of basketball players’ shooting actions. According to the experimental data, the method’s accuracy was approximately 98%, and the processing time was about 2.1 s [16]. 
Tong and Wang proposed a basketball pose recognition model that combines a graph convolutional network (GCN) with a spatiotemporal GCN and is capable of processing time-series graph-structured data. The model constructs a spatiotemporal graph convolution model of the human pose sequence through graph convolution operations. The experiments showed that the basketball pose recognition accuracy of the proposed model was about 95.58%, better than the other comparative models [17]. Chen and Ni proposed a motion recognition method that combines computer vision and big data to address the issue of improper posture affecting the performance of basketball players. The method is applied to athletes’ daily training and competitions. Building on mainstream motion recognition models, 3D graph convolution is used to improve 3D convolution and to enhance the expression of the spatial structure and temporal features of skeleton sequences. Channel and spatial attention mechanisms are introduced to study the weight distribution of key points and strong features in pose recognition. In tests on real data, the model maintained high recognition performance and ran stably [18].

To summarize, most of the research on BAR techniques is based on neural network (NN) algorithms. However, relying solely on NNs or other single algorithms does little to improve the efficiency of action recognition, and the many interfering factors reduce accuracy. Meanwhile, most research focuses on static or simplified environments, and research on real-time action recognition and multi-scene adaptability in complex environments remains insufficient. In view of this, the research integrates 3D-CNN-based video recognition technology with SSD TDA, aiming to improve the accuracy and efficiency of BAR. By processing and analyzing the video data of basketball games, the feasibility and advantages of this fusion method in practical applications are verified to provide technical support for future intelligent sports training systems.

2 Methods and materials

The study first performs image feature frame extraction based on SSD TDA and combines feature maps of different dimensions to determine the location of the target, which is trained on the VOC dataset for further production of cropped frames. On this basis, the continuous dynamics in video images are processed using a dual-resolution 3D-CNN video recognition technique to comprehensively analyze the video data in both spatial and temporal dimensions. Finally, a 3D-CNN-SSD BAR model is suggested by combining 3D-CNN and SSD TDA.

2.1 Cropped frame construction based on SSD TDA

The SSD algorithm works by placing a fixed set of prediction frames on the input image. The position of the target object is determined by combining predictions from feature maps of varying sizes and dimensions. The non-maximum suppression algorithm is then used to eliminate duplicate prediction frames and those that do not meet expectations, and finally the frames that match the targets in the original image are obtained. The network framework of SSD is shown in Figure 1 [19].

Figure 1

SSD network framework.

In Figure 1, the network framework of SSD consists of two parts: a truncated visual geometry group (VGG) network and additional CNN layers with decreasing scale. Targets are detected simultaneously on feature maps of varying scales. The input image is passed through the VGG network and the convolutional layers for FE. The input image is first resized to an edge length of 300 and then passes through two initialized convolutional kernel (CK) computations, each followed by ReLU activation. During the CK computation, image boundary feature information may be lost as the image matrix is gradually reduced. To preserve the boundary features, zero padding is applied to the image boundary, i.e., SAME padding is used. This padding preserves the integrity of the image during computation, so that edge information is retained in the final feature map. Detection is then performed on the extracted feature maps by further CNN computation. The number of CKs required for classification on a feature map is the product of the number of a priori frames and the number of classification categories, and the number of CKs used to determine the location of the bounding box (BBO) is the product of the number of a priori frames and 4. During prediction, each a priori frame generates one BBO, and the role of the a priori frames on the effective feature layers is shown in Figure 2.

Figure 2

Feature maps and reference boxes. (a) Image with GT boxes, (b) 16 × 16 feature map, and (c) 8 × 8 feature map.

Figure 2(a) shows the original image labeled with blue and red true frames. Figure 2(b) and (c) show schematic diagrams of the prediction frames corresponding to the two true frames, respectively. The remaining prediction frames are negative samples. Each target in the feature map corresponds to multiple prediction frames. The convolution operation analyzes the features within the prediction region to produce a classification confidence score for the target and maps the boundary coordinates back to the four corner points in the original image. The degree of overlap between true and predicted frames is measured by the intersection over union (IoU). The IoU expression is shown as follows [20]:

(1) $$\mathrm{IoU} = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}.$$
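As a minimal sketch of equation (1), the following function computes the IoU of two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are illustrative assumptions and are not taken from the study.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For two 2 × 2 boxes offset by one unit in each direction, the overlap area is 1 and the union area is 7, so the function returns 1/7.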

In equation (1), A denotes the true frame and B denotes the predicted frame. The derived feature maps are increasingly abstract with an increase of CNN layers. The initial layer feature maps of the CNN contain detailed information about the image data, such as object borders. The neuron model in CNN only extracts features from local regions of the image and constructs higher level feature connections through these local features. SSD enhances the model’s ability to recognize objects of different sizes through the prediction frame. The prediction frame scale calculation formula is shown as follows:

(2) $$s_k = b_{\min} + \frac{b_{\max} - b_{\min}}{n - 1}(k - 1), \quad k \in [1, n].$$

In equation (2), $s_k$ denotes the scale of the prediction frames of the $k$th feature map. $b_{\min}$ and $b_{\max}$ denote the minimum and maximum ratio of the predicted BBO to the original image, respectively. $k$ indexes the feature maps across the CNN layers, and $n$ denotes the number of feature maps. The height and width of the default box (DB) are as follows:

(3) $$h_k^{a_r} = \frac{s_k}{\sqrt{a_r}}, \qquad w_k^{a_r} = s_k \sqrt{a_r}.$$

In equation (3), $a_r$ denotes the prediction box aspect ratio, $h_k^{a_r}$ denotes the height of the DB, and $w_k^{a_r}$ denotes the width of the DB. For $a_r = 1$, an extra prediction box is added, whose scale is as follows:

(4) $$s_k' = \sqrt{s_k s_{k+1}}.$$

In equation (4), $s_k'$ denotes the size of the additional prediction frames, at which point the corresponding position of the feature map contains six prediction frames. The center of the prediction frame is shown as follows:

(5) $$\left( \frac{i + 0.5}{\lvert f_k \rvert}, \frac{j + 0.5}{\lvert f_k \rvert} \right).$$
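Equations (2) to (5) can be sketched together as a small default-box generator. This is an illustration under stated assumptions rather than the study's implementation: the values `b_min = 0.2` and `b_max = 0.9` and the aspect-ratio set are the defaults commonly quoted for SSD, the 8 × 8 feature map follows Figure 2(c), and the function names are hypothetical.

```python
import math

def prior_scales(b_min=0.2, b_max=0.9, n=6):
    """Equation (2): scales s_1..s_n spaced linearly between b_min and b_max."""
    return [b_min + (b_max - b_min) / (n - 1) * (k - 1) for k in range(1, n + 1)]

def default_boxes(f_k, s_k, s_next, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Boxes (c1, c2, w, h) in relative image units for an f_k x f_k feature map."""
    # Equation (3): width grows and height shrinks with sqrt(a_r).
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Equation (4): one extra square box at the geometric mean of adjacent scales.
    extra = math.sqrt(s_k * s_next)
    shapes.append((extra, extra))
    boxes = []
    for i in range(f_k):
        for j in range(f_k):
            # Equation (5): box centre for feature-map cell (i, j).
            c1, c2 = (i + 0.5) / f_k, (j + 0.5) / f_k
            boxes.extend((c1, c2, w, h) for w, h in shapes)
    return boxes

scales = prior_scales()
# Six boxes per cell (five aspect ratios plus the extra square box).
six_per_cell = default_boxes(8, scales[0], scales[1])
# Four boxes per cell when the widest and narrowest aspect ratios are dropped.
four_per_cell = default_boxes(8, scales[0], scales[1], aspect_ratios=(1.0, 2.0, 0.5))
```

With these assumed settings, an 8 × 8 feature map yields 8 × 8 × 6 = 384 boxes in the full configuration and 8 × 8 × 4 = 256 boxes in the reduced one.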

In equation (5), $i$ and $j$ denote the row and column position within the feature map, respectively, and $\lvert f_k \rvert$ is the size of the $k$th-layer feature map. Each position of the feature map can thus produce six prediction frames of different sizes. To reduce training time and computational complexity, only four a priori frames are used in some convolutional layers, excluding default frames with an aspect ratio of 3, since too many default frames complicate the training process. Because the prediction frames on these effective feature layers are larger, using fewer of them avoids redundant prediction frames.

During the prediction process of SSD, the system generates multiple BBOs from the prediction boxes. To prevent overlapping BBOs in the prediction results, the study uses the non-maximum suppression algorithm to further refine the prediction results. This effectively reduces redundant prediction boxes and improves detection accuracy. First, the DBs are filtered according to the set confidence threshold, and the BBOs with confidence lower than the set value are discarded. The study uses the SSD target detection technique to construct the prediction model, loading the weight files obtained from training on the VOC dataset to recognize the human body in a single frame image. The VOC dataset covers a total of 21 different categories such as vehicles, animals, and furniture. During recognition, detections of non-human categories are ignored so that the model returns only the human body. Lastly, the boundary coordinates returned by the SSD model are used with the OpenCV library to crop the original image and create the cropped frames needed for the study. The process is shown in Figure 3 [21].

Figure 3

Cropped frame process: (a) original frame, (b) SSD detection, and (c) crop frame.

Figure 3 shows the cropped frame generation process, which utilizes the SSD TDA to remove the invalid parts of the image, leaving only the parts that need to be recognized. To meet the training requirements of 3D-CNN cropped frames, the study tries to select frames with fewer interfering elements when collecting the dataset for basketball technical movement analysis to ensure the clarity and simplicity of the image frames. As a result, the cropped frames are made to contain almost no interfering elements, thus improving the accuracy and efficiency of model training.
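The filtering and cropping steps of this subsection (confidence thresholding, non-maximum suppression, keeping only human detections, and cropping) can be sketched as follows. This is a simplified illustration, not the study's code: the detection tuple layout `(label, score, box)`, the thresholds 0.5 and 0.45, and the function names are assumptions, and NumPy slicing stands in for the OpenCV-based cropping mentioned in the text.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) pixel coordinates."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Greedy non-maximum suppression; returns kept indices, best score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = [int(i) for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(box_iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

def crop_person(frame, detections, conf_thresh=0.5, iou_thresh=0.45):
    """Crop the highest-scoring 'person' box from frame, or return None."""
    # Detections of every other category are ignored.
    people = [(score, box) for label, score, box in detections if label == "person"]
    if not people:
        return None
    scores = np.array([s for s, _ in people])
    boxes = [b for _, b in people]
    keep = nms(boxes, scores, conf_thresh, iou_thresh)
    if not keep:
        return None
    height, width = frame.shape[:2]
    x1, y1, x2, y2 = (int(round(v)) for v in boxes[keep[0]])
    # Clamp the box to the image before slicing (rows are y, columns are x).
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return frame[y1:y2, x1:x2]

frame = np.zeros((100, 200), dtype=np.uint8)  # placeholder grayscale frame
detections = [
    ("person", 0.90, (10, 20, 60, 90)),
    ("person", 0.80, (12, 22, 62, 92)),  # near-duplicate, suppressed by NMS
    ("chair", 0.95, (0, 0, 50, 50)),     # non-human category, ignored
]
crop = crop_person(frame, detections)
```

In a full pipeline the crop would then be resized to the 64 × 64 input of the cropped-frame network; that step is omitted here to keep the sketch dependency-free.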

2.2 3D-CNN-based video recognition technique

In the process of recognizing basketball technical actions, the input data are video frame sequences. The recognition process not only needs to analyze the spatial representation of the action, but also consider the temporal relationship between the actions in the action sequence. The 3D-CNN video recognition technique adds a temporal dimension to the traditional two-dimensional (2D) convolution, which is able to comprehensively analyze the video data in both temporal and spatial dimensions [22,23]. As a result, 3D-CNN is able to capture action images and process continuous dynamics in the sequence. The difference between 2D convolution and 3D convolution operations in processing video data is shown in Figure 4 [24].

Figure 4

Schematic diagram of (a) 2D and (b) 3D convolution operations.

Figure 4(a) shows the 2D convolution operation. In Figure 4(b), the 3D convolution spans three consecutive frames (a temporal span of 3), which allows the dynamic changes across the frames to be captured. Similar to 2D convolution, 3D convolution also employs weight sharing to extract specific features from the video frame stack. To enhance the expression of features, in addition to expanding the CK, the diversity of features can be increased by optimizing the settings of the CKs. The training effect of a CNN is influenced by multiple aspects, including algorithm structure and hyperparameter configuration. Therefore, to enhance model training efficiency and reduce training time, the study applies a dual-resolution enhancement to the 3D-CNN network model. The improved dual-resolution 3D-CNN algorithm architecture is shown in Figure 5.
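The effect of a temporal span of 3 can be illustrated with a naive single-channel 3D convolution. This is a didactic sketch under simplifying assumptions (one channel, stride 1, no padding, a hand-made temporal-difference kernel); the function name `conv3d_valid` is hypothetical and the loop-based code is not intended to reflect the study's network implementation.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Single-channel 3D cross-correlation with stride 1 and no padding.

    clip has shape (T, H, W); kernel has shape (kt, kh, kw). The same kernel
    weights are reused at every position, which is the weight sharing
    described in the text.
    """
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# A 3 x 3 x 3 kernel that subtracts the first frame's local mean from the third's.
kernel = np.zeros((3, 3, 3))
kernel[0] = -1.0 / 9.0
kernel[2] = 1.0 / 9.0

static_clip = np.ones((16, 8, 8))                             # nothing changes over time
ramp_clip = np.arange(16).reshape(16, 1, 1) * np.ones((1, 8, 8))  # brightness rises by 1 per frame

static_response = conv3d_valid(static_clip, kernel)
ramp_response = conv3d_valid(ramp_clip, kernel)
```

The static clip produces a zero response everywhere, while the brightening clip produces a constant nonzero response, which is information a 2D convolution applied to each frame separately cannot provide.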

Figure 5

Dual resolution 3D-CNN algorithm architecture.

In Figure 5, the video clip used as input to the network has a resolution of 112 × 112 and a length of 16 frames. The input is divided into two parts: one part is the original frame (OF) image after video preprocessing, and the other is the output of the SSD TDA processing, circled by the red border in the figure, with a size of 64 × 64 pixels. Images are usually composed of pixels with different proportions of the RGB color channels, so feeding the original image data directly into the NN increases the amount of input data [25]. In the recognition of basketball movements, color information has little practical utility, and grayscale images can be used to reduce the amount of computation. Even in grayscale images, each pixel still varies over a range of intensity values, which is sufficient to retain most of the gradient and other key information of the original image. As an example, a 16-frame input is shown in Figure 6 [26].

Figure 6

3D-CNN network model architecture: (a) original frame 3D-CNN and (b) crop frame 3D-CNN.

Figure 6(a) represents the 3D-CNN network architecture for the OF, and Figure 6(b) shows the 3D-CNN network architecture for cropped frames. In Figure 6(a), the input is first scaled to a resolution of 112 × 112 for 16 consecutive frames and then processed through 5 convolutional layers and 5 pooling layers (PL). The size of the CK is uniformly 3 × 3 × 3 with a step size of 1 × 1 × 1. The first PL uses a window of 2 × 2 × 1 and a step size of 2 × 2 × 1 in order to avoid premature reduction of the temporal dimension. The pooling window and step size for the remaining layers are 2 × 2 × 2. In Figure 6(b), the input has a resolution of 64 × 64 pixels and 16 consecutive frames and passes through 4 convolutional layers and 4 PLs. The CK and step size settings are the same as in the OF network, and the window and step size of all PLs are 2 × 2 × 2. After the same two fully connected layers (FCL), the final classification results are output through the softmax layer. To improve the training efficiency, a weight initialization approach is used to merge the time series information with the spatial properties of the image frames: the 2D convolutional layer weights pre-trained on ImageNet are used as the initial values of the 3D convolutional weights. The mean initialization is shown in the following equation [27]:

(6) $$W_t^{3D} = \frac{W^{2D}}{T}.$$

In equation (6), $W^{2D}$ denotes the 2D weight parameters, $W_t^{3D}$ denotes the 3D convolutional weight parameters at temporal position $t$, and $T$ denotes the temporal extent of the convolution layer. This weight initialization can be generalized by proportional scaling, subject to the constraint

(7) $$\sum_{t=1}^{T} W_t^{3D} = W^{2D}.$$

As shown in equation (7), this method is capable of adapting to diverse network architectures and parameter ranges. The method produces diverse results by incorporating randomness and dividing by an uncertain constant. This method allows the use of any combination of constant values as shown in the following equation:

(8) $$W_t^{3D} = \alpha_t W^{2D}, \quad \alpha_t > 0, \quad \sum_{t=1}^{T} \alpha_t = 1.$$

In equation (8), $\alpha_t$ represents the scaling coefficient applied to the initial weights at temporal position $t$. Negative-weight initialization then relaxes the positivity condition: the first temporal slice ($t = 1$) receives a value larger than the other initialized values, while the remaining slices take negative values obtained by dividing by a specific constant, so that the constraint in equation (7) still holds. The initialization process of negative weights is shown in the following equation:

(9) $$W_t^{3D} = \alpha_t W^{2D}, \quad \alpha_t = \begin{cases} \dfrac{2T - 1}{T}, & t = 1 \\ -\dfrac{1}{T}, & 2 \le t \le T \end{cases}.$$
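The initialization schemes in equations (6) to (9) can be checked numerically with a short sketch. It assumes that the coefficients for $2 \le t \le T$ in equation (9) are negative ($-1/T$), which is the reading under which the sum constraint of equation (7) holds; the function names and the random 3 × 3 kernel are hypothetical stand-ins for pre-trained 2D weights.

```python
import numpy as np

def inflate(w2d, alphas):
    """W_t^{3D} = alpha_t * W^{2D}; stacks T temporal slices on a new leading axis."""
    alphas = np.asarray(alphas, dtype=float)
    # Reshape alphas to (T, 1, ..., 1) so they broadcast against the 2D kernel.
    return alphas.reshape((-1,) + (1,) * w2d.ndim) * w2d[None, ...]

def mean_alphas(T):
    """Equation (6): every temporal slice receives W^{2D} / T."""
    return np.full(T, 1.0 / T)

def negative_alphas(T):
    """Equation (9), assuming negative coefficients for t >= 2."""
    alphas = np.full(T, -1.0 / T)
    alphas[0] = (2.0 * T - 1.0) / T
    return alphas

rng = np.random.default_rng(0)
w2d = rng.normal(size=(3, 3))  # stand-in for one pre-trained 2D kernel
T = 3

w3d_mean = inflate(w2d, mean_alphas(T))
w3d_negative = inflate(w2d, negative_alphas(T))
```

In both cases summing the 3D kernel over its temporal axis recovers the 2D kernel, so a clip of identical frames initially produces the same response as the 2D network would on a single frame.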

With the weights initialized as in equation (9), the OF and cropped frames are processed using the 3D-CNN networks of the two resolutions to generate weight files as model parameters. The sequence of action frames is then passed through the networks to extract feature vectors, followed by feature integration for classification. Finally, fusion is performed by summing the feature maps to overcome the resolution difference problem. The final feature representation of the input action frames is shown in the following equation:

(10) $$C = X_{i,j,d}^{a} + X_{i,j,d}^{b}.$$

In equation (10), $C$ denotes the final feature of the input continuous technical action frames. $X_{i,j,d}^{a}$ and $X_{i,j,d}^{b}$ denote the $j$th feature of a technical action in the $i$th frame for types $a$ and $b$, respectively. $d$ denotes the feature dimension index, with $1 \le d \le D$, where $D$ is the dimension of the feature map. After FE and fusion are completed, the output data are used as input to a support vector machine for the final action recognition task [28]. The research fuses an improved 3D-CNN video recognition technique with SSD TDA to propose a 3D-CNN-SSD BAR model. One benefit of this model is that it can effectively recognize basketball moves by utilizing both the quick target detection capabilities of SSD and the spatio-temporal FE capability of 3D-CNN at the same time.
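To make the resolution difference concrete, the sketch below traces the feature-map shape through the pooling layers of both branches as described for Figure 6 and then applies the summation of equation (10). Several points are assumptions rather than statements from the study: window sizes are read as height × width × time, the 3 × 3 × 3 convolutions are assumed to use SAME padding (so only pooling changes the shape), pooling is assumed to be unpadded with floor rounding, and the feature dimension `D = 8` is purely illustrative.

```python
import numpy as np

def pool_out(size, window):
    """Output length of a pooling step with stride equal to window and no padding."""
    return (size - window) // window + 1

def trace(shape, pools):
    """Follow an (H, W, T) shape through a list of pooling windows."""
    shapes = [shape]
    for window in pools:
        shape = tuple(pool_out(s, w) for s, w in zip(shape, window))
        shapes.append(shape)
    return shapes

# OF branch: first pool keeps all 16 frames, the remaining four halve every axis.
of_shapes = trace((112, 112, 16), [(2, 2, 1)] + [(2, 2, 2)] * 4)
# Cropped-frame branch: four pools that halve every axis.
crop_shapes = trace((64, 64, 16), [(2, 2, 2)] * 4)

def fuse(feat_a, feat_b):
    """Equation (10): element-wise sum of two equally shaped feature vectors."""
    if feat_a.shape != feat_b.shape:
        raise ValueError("branch features must share one shape before summation")
    return feat_a + feat_b

D = 8  # hypothetical FCL output dimension, chosen only for the demo
rng = np.random.default_rng(0)
fused = fuse(rng.normal(size=D), rng.normal(size=D))
```

Under these assumptions the OF branch ends at 3 × 3 × 1 and the cropped-frame branch at 4 × 4 × 1, so the convolutional outputs cannot be added directly; summation becomes possible once both branches have passed through fully connected layers of the same width.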

3 Results

The study is conducted by setting the experimental parameters, building the experimental environment, and first selecting the images with different frame counts from the Kinetics dataset and the UCF50 dataset to train the recognition accuracy of the 3D-CNN-SSD model. Then, the model is compared with the OpenPose human posture recognition algorithm and discrete wavelet transform (DWT) for performance comparison experiments on both training models of OFs and cropped frames [29,30]. Finally, taking 16 frames of RGB video data as an example, different basketball actions are selected for recognition verification and analyzing the application effect.

3.1 3D-CNN-SSD model performance test

The code is written in Python 3.7, and experiments are conducted on the Ubuntu 16.04 operating system. The batch size is 32, the number of training epochs is 40, and stochastic gradient descent is the optimizer. The 3D-CNN-SSD model is trained using different numbers of input frames from the Kinetics dataset and the UCF50 dataset, and the effect of the frame count on recognition accuracy is compared. Table 1 presents the findings.

Table 1

Recognition accuracy results of different input frame rates

Dataset	Frames	Accuracy (%)
Kinetics	3	87.8
Kinetics	7	88.2
Kinetics	11	90.4
Kinetics	16	93.1
Kinetics	20	92.6
UCF50	3	78.1
UCF50	7	88.2
UCF50	11	89.5
UCF50	16	92.6
UCF50	20	91.8

The outcomes illustrate that when the input frames consist of 16 continuous video image frames, the algorithm obtains the highest recognition performance of 93.1% in the Kinetics dataset and 92.6% in the UCF50 dataset (Table 1). To test the model performance, the 3D-CNN-SSD model proposed in the study is compared with the OpenPose algorithm and DWT for experiments on both training models, OF and cropped frame. The test outcomes are displayed in Figure 7.

Figure 7

Accuracy of different algorithms on different training models: (a) original frame training results and (b) crop frame training results.

Figure 7(a) and (b) show the training accuracy of different algorithms on the OF and cropped frame models, respectively. In the OF training results, the 3D-CNN-SSD model has the highest accuracy and is the first to reach the steady state, at the 20th epoch. In the cropped-frame training model, the 3D-CNN-SSD model also has better accuracy than the remaining two recognition models and enters the steady state the fastest, at the 7th epoch. The cross-entropy loss function (LF) is used for training on both models. The results are shown in Figure 8.

Figure 8

Loss curves of different algorithms in different training models: (a) original frame training and (b) crop frame training.

Figure 8 (a) and (b) show the LF curves for different algorithms trained on the OF model and the cropped frame model, respectively. The loss values on both training models show a decreasing trend, indicating that the models are learning the data features and improving the performance of the models at the same time. In the training results of the OF model, the LF curve of the 3D-CNN-SSD model first reaches a steady state at the 20th epoch, with the fastest convergence and the lowest loss value of 1.1. Similarly, the loss curve of the 3D-CNN-SSD model in cropped frames first reaches a steady state at the 17th epoch and the lowest loss value of 1.3. The results exhibit that the 3D-CNN-SSD model has good generalization ability and a low risk of overfitting. Then, 60 test samples are selected from the OF and cropped frame training models, respectively, to test the response performance of the three recognition methods. The results are shown in Figure 9.

Figure 9

Identify response time test results: (a) original frame training and (b) crop frame training.

Figure 9(a) and (b) illustrate the variation of response time for different algorithms in the OF and cropped frame training models, respectively. As the number of test samples increases, the response times of the three recognition methods also change. In both training models, the response time of the 3D-CNN-SSD model is always shorter than that of the other two algorithms, so it completes image recognition the fastest of the three. Continuous loop testing is then performed for the three BAR methods. The test time is set to 150 s, and the performance parameters of the different algorithms are recorded as curves. Resource overhead indicators are used to measure the resources consumed while running, such as time, memory, processor capacity, and energy. Figure 10 displays the outcomes of the continuous loop test.

Figure 10

Continuous cycle test results: (a) OpenPose, (b) DWT, and (c) 3D-CNN-SSD.

Comparing Figure 10(a), (b), and (c) shows that the three BAR methods exhibit different performance fluctuation characteristics in the 150-s continuous loop recognition test. In Figure 10(c), the amplitude variations of the 3D-CNN-SSD curve are small and regular, and the difference between its maximum and minimum values changes periodically. This indicates that the model allocates resources more uniformly in the recognition task and that the computing process is relatively stable. When the changes in resource overhead metrics are compared, it becomes evident that the OpenPose and DWT methods in Figure 10(a) and (b) fluctuate more during computation than the 3D-CNN-SSD model: the difference between the maximum and minimum values widens significantly, and the amplitude and frequency of the fluctuations increase. The results indicate that the 3D-CNN-SSD model has better stability.

3.2 Effect of 3D-CNN-SSD model application

To verify the application performance of the 3D-CNN-SSD model proposed in the study, the study uses 16 frames of RGB video data as input. Six different basketball movements are selected for recognition, including crossover, Eurostep, feint, Shamgod, spinmove, and shoot, and the results obtained after testing are shown in Figure 11.

Figure 11

Confusion matrix for different training models: (a) original frame and (b) crop frame.

Figure 11(a) and (b) show the confusion matrices of the models trained on raw frame data and cropped frame data, respectively. In the OF training model, the direction-shifting actions are sometimes misclassified as Eurostep or Shamgod actions, probably because these actions share common features with direction-shifting. Although there are some classification errors, the 3D-CNN-SSD model shows high accuracy in recognizing the shoot action, with a recognition rate (RR) as high as 96%, and its average RR reaches 89.5%. In the cropped frame model, the 3D-CNN-SSD model removes some interfering elements in the image by target detection, and the average RR improves to 92.5%. The shoot action again has the highest RR, at 98%. The recognition accuracy of different algorithms for six basketball actions and the mean average precision (MAP) results are shown in Table 2.

Table 2

Accuracy of BAR

Action	OpenPose (%)	DWT (%)	3D-CNN-SSD (%)
Crossover	80.46	86.13	91.64
Eurostep	83.26	85.16	89.46
Feint	81.33	87.63	90.25
Shamgod	85.64	86.28	92.65
Spinmove	82.47	85.46	91.78
Shoot	84.56	87.65	94.36
MAP	82.95	86.39	91.69
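As a quick arithmetic check, the MAP row of Table 2 can be reproduced by averaging the six per-action accuracies of each method. The values below are copied from the table; the variable names are illustrative.

```python
# Per-action accuracies (%) from Table 2, in the order
# crossover, Eurostep, feint, Shamgod, spinmove, shoot.
accuracy = {
    "OpenPose": [80.46, 83.26, 81.33, 85.64, 82.47, 84.56],
    "DWT": [86.13, 85.16, 87.63, 86.28, 85.46, 87.65],
    "3D-CNN-SSD": [91.64, 89.46, 90.25, 92.65, 91.78, 94.36],
}

# MAP here is the plain mean over the six actions.
mean_accuracy = {name: sum(vals) / len(vals) for name, vals in accuracy.items()}

# Gain of the proposed model over each baseline, in percentage points.
gain = {
    name: mean_accuracy["3D-CNN-SSD"] - mean_accuracy[name]
    for name in ("OpenPose", "DWT")
}

for name, value in mean_accuracy.items():
    print(f"{name}: {value:.2f}")
```

The means agree with the tabulated MAP values to within rounding, and the gaps between the proposed model and the two baselines come out at roughly 8.7 and 5.3 percentage points.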

In Table 2, the MAPs of the OpenPose, DWT, and 3D-CNN-SSD models for the six actions are 82.95, 86.39, and 91.69%, respectively, so the 3D-CNN-SSD model improves on OpenPose and DWT by 8.74 and 5.30 percentage points, respectively. Among the action categories, the 3D-CNN-SSD model has its lowest accuracy, 89.46%, in recognizing Eurostep actions. This is because the motion differences of Eurostep in the video sequences are not significant: most of the time basketball players maintain similar postures in consecutive frames, resulting in weak frame-to-frame connections. For the other five actions, such as crossover, feint, and shoot, the model achieves an accuracy of more than 90%, a significant improvement in recognition. Therefore, the 3D-CNN-SSD model can fully extract the motion information in video frames for recognition. Table 3 displays the misclassification outcomes of the models during the recognition process.

Table 3

Misjudgment rate of BAR

Case	Correct action	Misjudged as	OpenPose (%)	DWT (%)	3D-CNN-SSD (%)
1	Crossover	Shamgod	12.58	9.46	6.49
2	Eurostep	Shoot	10.23	9.12	4.46
3	Feint	Spinmove	11.65	8.46	7.16
4	Shamgod	Eurostep	9.34	7.49	6.46
5	Spinmove	Shamgod	10.46	8.14	5.57
6	Shoot	—	—	—	—

Table 3 lists the misjudgments observed for the six basketball actions. All actions except shoot (Case 6) were misjudged at least once. Cases 1 and 5 show crossover and spinmove, respectively, being misjudged as Shamgod. Case 2 shows Eurostep being misjudged as shoot, Case 3 shows feint being misjudged as spinmove, and Case 4 shows Shamgod being misjudged as Eurostep. The experimental results show that 3D-CNN-SSD has the lowest misjudgment rate in all five cases. In Case 1, for example, the misjudgment rates of OpenPose, DWT, and 3D-CNN-SSD are 12.58, 9.46, and 6.49%, respectively. To assess the model’s performance in basketball motion video recognition, the study applies cluster analysis to the high-dimensional features output by the network’s fully connected layer (FCL). The results are shown in Figure 12.
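How well action classes separate in feature space can also be summarized numerically. The following sketch is an assumption for illustration, not the clustering procedure used in the study: it compares the mean distance of samples to their own class centroid (compactness) with the mean distance between class centroids (separation). The two-dimensional feature vectors are hypothetical stand-ins for FCL outputs.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def cluster_separation(
    features: Sequence[Sequence[float]], labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (mean intra-class distance, mean inter-centroid distance)."""
    groups: Dict[str, list] = defaultdict(list)
    for vec, label in zip(features, labels):
        groups[label].append(vec)
    centroids = {label: centroid(vecs) for label, vecs in groups.items()}

    # Compactness: distance of every sample to its own class centroid.
    intra = [
        math.dist(vec, centroids[label])
        for label, vecs in groups.items()
        for vec in vecs
    ]
    # Separation: distance between every pair of class centroids.
    names = sorted(centroids)
    inter = [
        math.dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    mean_intra = sum(intra) / len(intra)
    mean_inter = sum(inter) / len(inter) if inter else 0.0
    return mean_intra, mean_inter


# Hypothetical 2-D features; real FCL outputs are high-dimensional.
toy_features = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = ["shoot", "shoot", "feint", "feint"]
mean_intra, mean_inter = cluster_separation(toy_features, toy_labels)
print(mean_intra, mean_inter)
```

A small intra-class distance together with a large inter-centroid distance corresponds to tight, well-separated clusters.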

Figure 12

Visualization of feature clustering: (a) OpenPose, (b) DWT, and (c) 3D-CNN-SSD.

Figure 12(a), (b), and (c) show the feature clustering visualization results of OpenPose, DWT, and 3D-CNN-SSD, respectively. Points of different colors distinguish the action categories, with each color representing the data of one specific action. In Figure 12(a), when OpenPose is used for BAR, the feature data of the different basketball actions are more dispersed, and the data of different categories overlap. For example, part of the shoot data falls into the Eurostep category, and part of the Eurostep data falls into the feint and spinmove categories. In Figure 12(b), the DWT algorithm also mixes the feature data of some action classes during recognition, which are lecture and Eurostep, crossover, and spinmove. The other actions do not overlap, and the spacing between classes is more obvious. In Figure 12(c), the 3D-CNN-SSD model separates the features effectively in the feature space by enhancing the differentiation between action categories, so that similar data cluster more closely. However, some data in the figure still have unclear boundaries with other categories, which may lead the model to misjudge actions. Such misjudgments may arise because certain actions look similar in the video frames, so their features are not clearly differentiated, which affects the accuracy of action recognition.

To evaluate the performance of the proposed 3D-CNN-SSD model in actual basketball games, the study tested it on video datasets from multiple real games. The test data cover various complex situations, such as athlete occlusion, motion blur, rapidly changing backgrounds, different lighting conditions, and crowd interference, and comprise 100 basketball game videos from different matches with the key movements annotated. The model outputs were compared with the manually annotated actions, the algorithms were evaluated by confusion matrix analysis, and the accuracy, recall, precision, and F1 score were calculated. The results are shown in Table 4.
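The four scores can be derived from a multi-class confusion matrix as sketched below. The sketch assumes macro averaging (an unweighted mean of the per-class values), since the averaging scheme is not stated in the paper; the matrix is a hypothetical toy example rather than the study's data.

```python
from typing import Dict, Sequence


def macro_metrics(cm: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical toy matrix: rows = annotated action, columns = predicted action.
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
scores = macro_metrics(toy_cm)
print({name: round(value, 4) for name, value in scores.items()})
```

Precision is computed over columns (what the model predicted) and recall over rows (what was annotated), which is why the two can diverge when a model over-predicts a particular action.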

Table 4

Comparison of actual application results

Metric	OpenPose	DWT	3D-CNN-SSD
Accuracy	85.60%	88.20%	94.50%
Recall	81.30%	87.50%	92.70%
Precision	84.20%	89.10%	95.30%
F1 score	82.70%	87.80%	94.00%
Real-time performance	120 ms/frame	110 ms/frame	95 ms/frame

According to Table 4, the accuracy of 3D-CNN-SSD reaches 94.50%, significantly higher than that of the other two methods, indicating that this model is the most accurate in BAR. OpenPose has relatively low accuracy, possibly because it is designed mainly for pose estimation and has difficulty handling complex motion blur and occlusion. The recall of 3D-CNN-SSD, 92.70%, is also better than that of the other two methods. In addition, 3D-CNN-SSD achieves the best precision among all models, 95.30%; precision measures the proportion of the actions recognized by the model that are correct. The high precision indicates that the model effectively reduces misidentification and that most of the detected actions are correct. The relatively low precision of OpenPose and DWT may be related to misidentification in certain complex scenarios. In terms of F1 score, 3D-CNN-SSD again performs best at 94.00%, demonstrating the model’s overall performance advantage. Finally, the processing time of 3D-CNN-SSD is 95 ms/frame (about 10.5 frames per second), the shortest of the three models, indicating that its processing speed is suitable for real-time applications. OpenPose has the longest processing time at 120 ms/frame, which may affect tasks with high real-time requirements. DWT takes 110 ms/frame, slightly slower than 3D-CNN-SSD but still within a relatively low latency range.
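The per-frame latencies in Table 4 can be converted to throughput and compared directly. The sketch below assumes that frames are processed one after another, so that throughput is the reciprocal of the per-frame latency; the function names are illustrative.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Throughput implied by a per-frame latency, assuming sequential processing."""
    if ms_per_frame <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_frame


def latency_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Relative latency reduction of the candidate versus the baseline, in percent."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Per-frame latencies (ms) copied from Table 4.
latency_ms = {"OpenPose": 120.0, "DWT": 110.0, "3D-CNN-SSD": 95.0}

fps = {name: frames_per_second(ms) for name, ms in latency_ms.items()}
for name, value in fps.items():
    print(f"{name}: {value:.2f} frames/s")

for name in ("OpenPose", "DWT"):
    saved = latency_reduction(latency_ms[name], latency_ms["3D-CNN-SSD"])
    print(f"3D-CNN-SSD is {saved:.1f}% faster per frame than {name}")
```

Whether a given throughput is sufficient depends on the frame rate of the input video and on how many frames the application needs to analyze, neither of which is specified here.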

4 Discussion and conclusion

To further improve the effectiveness of computer vision technology in sports applications, the study proposes a BAR system that integrates 3D-CNN video recognition technology with the SSD target detection algorithm (TDA). Feature vectors are extracted and fused by a multi-resolution 3D-CNN, which is combined with the SSD TDA to crop the motion region and reduce the computation spent on redundant data. The performance and application effect of the proposed model were then examined in comparison experiments with other algorithms. The results showed that, across training and test runs on different datasets, the algorithm recognized actions most effectively with an input of 16 consecutive video frames. The algorithm performance comparison tests were conducted on this basis. In both the original frame (OF) training model and the cropped frame training model, the 3D-CNN-SSD model achieved the highest accuracy, the fastest convergence, and the lowest loss value, and was the first to reach a stable state, which indicated that the model has superior generalization ability. In terms of response performance, the 3D-CNN-SSD model had the shortest response time of the three algorithms when completing image recognition. The continuous cycle test showed that the 3D-CNN-SSD model curves varied with small amplitude and periodic regularity, indicating better stability than the other two methods. After the performance tests, the study selected six different basketball actions to analyze the application effect. The confusion matrices of the models trained on the OF data and the cropped frame data showed that the 3D-CNN-SSD model recognized the shoot action with RRs of 96 and 98%, with average RRs of 89.5 and 92.5%, respectively. Moreover, the MAP of this model was 91.69%, which was 8.74 and 5.30 percentage points higher than that of OpenPose and DWT, respectively. Except for Eurostep, with a recognition accuracy of 89.46%, the recognition accuracy for the remaining five actions exceeded 90%. Meanwhile, the 3D-CNN-SSD model achieved the lowest misjudgment rate in all five misjudgment cases. The performance of 3D-CNN-SSD on real-world basketball games was also significantly better than that of the other two models, with accuracy, recall, and precision of 94.50, 92.70, and 95.30%, respectively. Finally, the clustering visualization of the feature extraction (FE) results showed that the 3D-CNN-SSD model not only increased the differentiation between action classes but also reduced the distance between samples of the same class, which significantly improved BAR.

Although the research has achieved good performance in multiple aspects, limitations remain. The computational cost is high, especially when processing high-resolution videos, and the computation time of the model is still relatively long. In addition, occlusion and multiple overlapping targets are inevitable when analyzing basketball technique movements in video. To improve real-time performance in practical applications, future work could combine more advanced image processing techniques and deep learning models and adopt more efficient inference methods.

Funding information: The authors state no funding involved.
Author contributions: Weizhao He provided the concept and wrote the draft, and Bin Li revised this paper critically. Both authors reviewed this paper carefully and approved this submission.
Conflict of interest: The authors state no conflict of interest.
Ethical approval: An ethics statement was not required for this study type, as no human or animal subjects or materials were used.
Informed consent: Not applicable.
Data availability statement: All data generated or analyzed during this study are included in this article. Further enquiries can be directed to the corresponding author.

References

[1] N. Luo, H. Yu, Z. You, Y. Li, T. Zhou, and N. Han, “Fuzzy logic and neural network-based risk assessment model for import and export enterprises: A review,” J. Data Sci. Intell. Syst., vol. 1, no. 1, pp. 2–11, 2023.10.47852/bonviewJDSIS32021078Search in Google Scholar

[2] X. Li, R. Luo, and F. U. Islam, “Tracking and detection of basketball movements using multi-feature data fusion and hybrid YOLO-T2LSTM network,” Soft Comput., vol. 28, no. 2, pp. 1653–1667, 2024.10.1007/s00500-023-09512-ySearch in Google Scholar

[3] S. M. Sima, M. Hou, X. Zhang, J. Ding, and Z. Feng, “Action recognition algorithm based on skeletal joint data and adaptive time pyramid,” Sig., Image Video Process., vol. 16, no. 6, pp. 1615–1622, 2022.10.1007/s11760-021-02116-9Search in Google Scholar

[4] H. Pan, Y. Li, and D. Zhao, “Recognizing human behaviors from surveillance videos using the SSD algorithm,” J. Supercomput., vol. 77, no. 7, pp. 6852–6870, 2021.10.1007/s11227-020-03578-3Search in Google Scholar

[5] L. Xin, Z. Jing, Y. Zhen, L. Jiayi, and J. Wenchen, “Research on ultra-wideband radar life detection algorithm based on SE and SSD,” IEEE Sens. J., vol. 23, no. 12, pp. 13478–13488, 2023.10.1109/JSEN.2023.3268164Search in Google Scholar

[6] Z. Qu, X. Shang, S. F. Xia, T. M. Yi, and D. Y. Zhou, “A method of single-shot target detection with multi‐scale feature fusion and feature enhancement,” IET Image Process, vol. 16, no. 6, pp. 1752–1763, 2022.10.1049/ipr2.12445Search in Google Scholar

[7] Q. Liu, J. Bi, J. Zhang, X. Bu, and N. Hanajima, “B-FPN SSD: an SSD algorithm based on a bidirectional feature fusion pyramid,” Vis. Comput., vol. 39, no. 12, pp. 6265–6277, 2023.10.1007/s00371-022-02727-4Search in Google Scholar

[8] Y. Liu, R. Liu, S. Wang, D. Yan, B. Peng, and T. Zhang, “Video face detection based on improved SSD model and target tracking algorithm,” J. Web Eng., vol. 21, no. 2, pp. 545–568, 2022.10.13052/jwe1540-9589.21218Search in Google Scholar

[9] Z. Z. Xu and W. J. Zhang, “3D-CNN hand pose estimation with end-to-end hierarchical model and physical constraints from depth images,” Neural Netw. World, vol. 33, no. 1, pp. 35–48, 2023.10.14311/NNW.2023.33.003Search in Google Scholar

[10] H. Kim and Y. Chung, “Effect of input data video interval and input data image similarity on learning accuracy in 3D-CNN,” Int. Int., Broadcast. Commun., vol. 13, no. 2, pp. 208–217, 2021.Search in Google Scholar

[11] J. Y. Lin and S. Y. Lin, “Temperature-prediction based rate-adjusted time and space map algorithm for 3D-CNN accelerator systems,” IEEE Trans. Comput., vol. 72, no. 10, pp. 2767–2780, 2023.10.1109/TC.2023.3269696Search in Google Scholar

[12] A. S. Keceli and A. Kaya, “Violent activity classification with transferred deep features and 3D-CNN,” Sig., Image Video Process., vol. 17, no. 1, pp. 139–146, 2023.10.1007/s11760-022-02213-3Search in Google Scholar

[13] Z. Hao, X. Wang, and S. Zheng, “Recognition of basketball players’ action detection based on visual image and Harris corner extraction algorithm,” J. Intell. Fuzzy Syst., vol. 40, no. 4, pp. 7589–7599, 2021.10.3233/JIFS-189579Search in Google Scholar

[14] B. Zhang and T. Wang, “Visual image recognition of basketball turning and dribbling based on feature extraction,” Trait. du. Signal., vol. 39, no. 6, pp. 2115–2121, 2022.10.18280/ts.390624Search in Google Scholar

[15] J. Xu, “Recognition method of basketball players’ shooting action based on graph convolution neural network,” Int. J. Reasoning-based Intell. Syst., vol. 14, no. 4, pp. 227–232, 2022.10.1504/IJRIS.2022.126650Search in Google Scholar

[16] C. Zhang, M. Wang, and L. Zhou, “Recognition method of basketball players’ throwing action based on image segmentation,” Int. J. Biom., vol. 15, no. 2, pp. 121–133, 2023.10.1504/IJBM.2023.129216Search in Google Scholar

[17] J. Tong and F. Wang, “Basketball sports posture recognition technology based on improved graph convolutional neural network,” J. Advan. Comput. Intell. Intell. Inform., vol. 28, no. 3, pp. 552–561, 2024.10.20965/jaciii.2024.p0552Search in Google Scholar

[18] D. Chen and Z. Ni, “Action recognition method of basketball training based on big data technology,” Int. J. Adv. Comput. Sci. Appl., vol. 15, no. 2, pp. 340–350, 2024.10.14569/IJACSA.2024.0150236Search in Google Scholar

[19] Q. Gao, Y. Chen, and Z. Ju, “Oropharynx visual detection by using a multi-attention single-shot multibox detector for human–robot collaborative oropharynx sampling,” IEEE Trans. Human-Mach. Syst., vol. 53, no. 6, pp. 1073–1082, 2023.10.1109/THMS.2023.3324664Search in Google Scholar

[20] L. Wei, H. Huang, and X. Yu, “Intersection-Over-Union Similarity-Based Nonmaximum suppression for human pose estimation in crowded scenes,” IEEE Trans. Cognit. Dev. Syst., vol. 16, no. 2, pp. 511–520, 2023.10.1109/TCDS.2023.3276372Search in Google Scholar

[21] P. Yin and W. Yu, “Nonlinear dynamical system iteration applied in video face feature extraction and recognition,” Evol. Syst., vol. 15, no. 2, pp. 397–412, 2024.10.1007/s12530-023-09562-5Search in Google Scholar

[22] K. Hara, “Recent advances in video action recognition with 3D convolutions,” IEICE Trans. Fundam. Electron., Commun. Comput. Sci., vol. 104, no. 6, pp. 846–856, 2021.10.1587/transfun.2020IMP0012Search in Google Scholar

[23] J. Guo, X. Nie, Y. Ma, K. Shaheed, I. Ullah, and Y. Yin, “Attention based consistent semantic learning for micro-video scene recognition,” Infor. Sci., vol. 543, pp. 504–516, 2021.10.1016/j.ins.2020.05.064Search in Google Scholar

[24] H. Chen, J. Zhao, and Q. Zhang, “Rotation-equivariant spherical vector networks for objects recognition with unknown poses,” Vis. Comput., vol. 40, no. 3, pp. 2089–2101, 2024.10.1007/s00371-023-02904-zSearch in Google Scholar

[25] S. Balasundaram and S. Krishnamoorthy, “Unsupervised learning-based recognition and extraction for intelligent automatic video retrieval,” Photogramm. Rec., vol. 37, no. 180, pp. 453–489, 2022.10.1111/phor.12427Search in Google Scholar

[26] X. Yan, G. Kou, F. Xiao, D. Zhang, and X. Gan, “Region-based demand forecasting in bike-sharing systems using a multiple spatiotemporal fusion neural network,” Soft Comput., vol. 27, no. 8, pp. 4579–4592, 2023.10.1007/s00500-022-07691-8Search in Google Scholar

[27] X. Wang and Z. Liang, “Hybrid network model based on 3D convolutional neural network and scalable graph convolutional network for hyperspectral image classification,” IET Image Process., vol. 17, no. 1, pp. 256–273, 2023.10.1049/ipr2.12632Search in Google Scholar

[28] S. Jain and R. Rastogi, “Parametric non-parallel support vector machines for pattern classification,” Mach. Learn., vol. 113, no. 4, pp. 1567–1594, 2024.10.1007/s10994-022-06238-0Search in Google Scholar

[29] B. J. Jo and S. K. Kim, “Comparative analysis of OpenPose, PoseNet, and MoveNet models for pose estimation in mobile devices,” Trait. du. Signal., vol. 39, no. 1, pp. 119–124, 2022.10.18280/ts.390111Search in Google Scholar

[30] J. Y. Li, C. H. Zhao, and G. D. Zhang, “Digital watermarking algorithm based on 4-level discrete wavelet transform and discrete fractional angular transform,” Opt. Appl., vol. 51, no. 4, pp. 605–619, 2021.10.37190/oa210411Search in Google Scholar

Received: 2024-11-17

Revised: 2025-01-23

Accepted: 2025-01-24

Published Online: 2025-05-07

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/comp-2025-0027

Keywords for this article

basketball action recognition; video recognition techniques; SSD target detection; 3D convolutional networks; action classification
