
Animation video frame prediction based on ConvGRU fine-grained synthesis flow

Xue Duan
Published/Copyright: February 10, 2025

Abstract

Due to the complexity and dynamism of animated scenes, frame prediction in animated videos is a challenging task. To improve the playback frame rate of animated videos, an innovative method combining a convolutional neural network with a convolutional gated recurrent unit (ConvGRU) is used to refine the synthesized flow in animation video frame prediction. The results indicated that the average prediction accuracy of the proposed model was 99.64%, and the training effect was good. The peak signal-to-noise ratios on the three datasets were 31.26, 36.63, and 22.15 dB, and the structural similarities were 0.958, 0.886, and 0.813, respectively. The maximum Learned Perceptual Image Patch Similarity of the proposed model was 0.144. These results indicate that the model achieves excellent prediction accuracy and visual quality and can capture the complex dynamics and fine details of animated scenes. The contribution of this study is to provide technical support for improving the accuracy of frame prediction in animated videos, which helps promote the intelligent development of animation production.

1 Introduction

Video prediction uses a given sequence of continuous frames to predict future video frames. Animation video frame prediction plays a crucial role in applications such as computer-generated animation, video game development, and virtual reality, and the ability to accurately predict future frames enhances the visual quality and realism of animation content. However, due to the complexity and dynamism of animation, predicting accurate frames remains a challenging task [1]. The convolutional neural network (CNN), with its excellent feature extraction and pattern recognition capabilities, can effectively capture dynamic changes and temporal relationships in animation video tasks. In the current field of deep learning, the gated recurrent unit (GRU) has achieved significant results in image processing and sequence data. ConvGRU combines CNN and GRU and can effectively capture temporal dependencies and spatial information in video sequences, thereby generating more accurate and visually coherent animation frames [2]. For complex scenes, traditional video frame prediction methods may not accurately capture details and complex actions [3]. For example, when there are many moving objects or complex backgrounds in the scene, traditional methods may be unable to effectively predict the next frame [4]. Therefore, to avoid the need for too many reference frames, a motion estimation (ME) network is designed based on CNN, and a ConvGRU fine-grained synthesis flow algorithm is proposed. Combining the reverse distortion algorithm with the bi-linear interpolation algorithm, the specific pixel values in the extracted video frames are calculated to obtain high-precision predicted frames. The research comprises four parts. The first part reviews video frame prediction and ConvGRU. The second part presents animation video frame prediction based on the ConvGRU fine-grained synthesis flow algorithm. The third part reports the prediction results. The fourth part concludes the study.

2 Related works

Researchers have proposed various methods for video frame prediction and have achieved certain results. Xu et al. developed a motion-aware future frame prediction method for video anomaly detection that incorporated saliency perception to avoid an imbalanced information distribution between video foreground and background. This method improved the ability to represent moving targets, and the results showed that the network had certain advantages [5]. Hassan et al. used long short-term memory (LSTM) and generative adversarial networks to locate experimental subjects and analyze motion paths for predicting future human motion trajectories. The results showed that the method ultimately reduced displacement error by 41% [6]. Aslam et al. developed an auto-encoder based on deep multiplicative attention to achieve anomaly detection in videos. The context vector determined the output of the decoder, and the global attention mechanism participated in the weighted calculation. The results showed that the method improved the running time by 0.015 s [7]. RS combined LSTM with a ResNeXt-50 deep-fake detection algorithm to combat video morphing attacks and extract complex features. The results showed that this method had high detection accuracy [8].

ConvGRU has been widely applied in fields such as image processing and natural language processing, and many scholars have studied it extensively. Sreeja and Kovoor focused on the precise detection of suspicious events in surveillance videos. A multi-layer CNN and a stacked bidirectional GRU extracted sequence-level and frame-level features, respectively, which helped improve the recognition rate. The results showed that the model had generalization ability and effectiveness [9]. To achieve multi-agent micro-grid energy management, Afrasiabi et al. utilized the accelerated alternating direction method of multipliers to predict the parameters required by intelligent agents based on ConvGRU, searching for the optimal operating point and enhancing the convergence of distributed algorithms. The results indicated that the method had good performance [10]. Xu et al. designed an AM-ConvGRU model that combined channel attention blocks for predicting typhoon paths, extracting nonlinear three-dimensional features of typhoons with complex high-dimensional attributes. The results showed that the model had good prediction accuracy [11]. Zhang et al. proposed a ConvGRU spatio-temporal prediction model to extract glacier velocity and analyze its past and present spatio-temporal variations. The fluctuations in glacier velocity time-series data were captured, and the results showed that the model had high accuracy [12]. Tian et al. developed a GA-ConvGRU model for precipitation nowcasting that handled multi-modal and skewed intensity distributions, generating more realistic and accurate extrapolations. The results showed that the model had certain applicability [13].

In summary, current research on ConvGRU mainly focuses on improving the model, and its ability to model long-term dependencies still needs improvement. Given the advantages of ConvGRU in processing sequence data, its application prospects remain broad, especially in practical video analysis. Therefore, further exploring the research potential and application prospects of ConvGRU has a positive impact on the field of deep learning.

3 Animation video frame prediction based on ConvGRU fine-grained synthesis flow

The research first designs an ME network based on CNN to address the problems of a single search direction and a limited reference range. Afterwards, to obtain high-precision predicted frames, the ConvGRU fine-grained synthesis flow algorithm is proposed. At the same time, the reverse distortion algorithm and the bi-linear interpolation algorithm are combined to optimize the computational complexity.

3.1 CNN-based ME network

Animation video is a media form that simulates motion by quickly and continuously playing static images. However, producing high-quality animated videos requires a significant amount of time and resources, especially when drawing each frame. To improve production efficiency and reduce the cost of animated videos, the efficiency of frame prediction needs to be improved. Animation video frame prediction refers to predicting the next frame of an image based on the known preceding frames [14]. To address the excessive number of reference frames required by video frame prediction and the ME process, a mathematical model of video frame prediction is constructed, as shown in Eq. (1).

(1) $p(I_{T+1} \mid I_{1:T}) = \int p(I_{T+1} \mid \bar{z}_{1:T})\, p(\bar{z}_{1:T} \mid I_{1:T})\, \mathrm{d}\bar{z}_{1:T}$,

where $I_{1:T}$ is the existing video frame sequence, $T$ is the sequence length, and $I_{T+1}$ is the future video frame. $p(\bar{z}_{1:T} \mid I_{1:T})$ is the ME process, and $p(I_{T+1} \mid \bar{z}_{1:T})$ represents motion compensation applied to the ME result. The current reference frame count is generally 10–20 frames. Predicting a large number of future frames requires continuous iteration, which results in long delays and poor timeliness in animation video processing. To address these issues, an animation video frame prediction network that combines ConvGRU fine-grained synthesis flow is designed. The flowchart of this network is shown in Figure 1.
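As a hedged illustration of this iterative prediction process, the sketch below rolls a predictor forward frame by frame; `predict_next` is a hypothetical stand-in for the full network described in this section, not the paper's implementation.

```python
# A minimal sketch of autoregressive multi-frame rollout. The callable
# `predict_next` is hypothetical: it stands in for a network that
# estimates motion from I_{1:T} and synthesizes I_{T+1} as in Eq. (1).
def rollout(predict_next, frames, n_future, n_ref=10):
    """Iteratively predict n_future frames from an initial sequence.

    Each predicted frame is fed back as a new reference, which is why
    long rollouts accumulate delay and estimation error.
    """
    frames = list(frames)
    for _ in range(n_future):
        frames.append(predict_next(frames[-n_ref:]))  # use the last n_ref reference frames
    return frames[-n_future:]
```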

Figure 1: Schematic diagram of the animation video frame prediction network flow.

In Figure 1, the inter-frame temporal content information of the reference frames is extracted by the ME network module. Combined with motion residual calculation and synthesis flow calculation, the ConvGRU fine-grained synthesis flow algorithm is completed. Finally, combined with the reverse distortion algorithm, the predicted frame is obtained. Inter-frame prediction uses the correlation between video frames to filter out temporally redundant information and generally includes two steps: ME and motion compensation. The main task of ME is to find the optimal matching block in a historical reference frame for the macroblock currently being encoded. After finding the optimal matching block, ME outputs a motion vector, which is the position coordinate of the reference block relative to the current block [15]. The ME principle is shown in Figure 2.
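As a hedged, self-contained illustration of classical block-matching ME (not the learned ME network proposed here), the following NumPy sketch performs a full search over a ±8-pixel window using the sum of absolute differences (SAD); block size, search range, and the SAD criterion are illustrative choices.

```python
import numpy as np

# Full-search block matching: for one macroblock of the current frame,
# find the reference block with minimum SAD inside a search window and
# return the displacement (dy, dx) as the motion vector.
def block_match(cur, ref, y, x, block=16, search_range=8):
    target = cur[y:y + block, x:x + block]
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + block > ref.shape[0] or rx + block > ref.shape[1]:
                continue  # skip candidates that fall outside the reference frame
            cand = ref[ry:ry + block, rx:rx + block]
            sad = np.abs(target.astype(np.int32) - cand.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv  # displacement of the best-matching reference block
```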

Figure 2: ME principle diagram.

In Figure 2, current video coding standards mainly use block-based inter-frame encoding. The principle is to use ME to find the reference block with the smallest difference in an adjacent reconstructed reference frame, and the reconstructed value serves as the predicted value of the current block. The displacement from the reference block to the current block is the motion vector, and taking reconstructed values as predicted values is motion compensation. The ME network can extract motion and occlusion information between reference frames and predict initial motion vectors, soft masks, and convolutional kernel weights [16,17]. The soft mask matrix $M$ is used to avoid occlusion issues; a soft mask is a technique used in image processing and computer vision to obscure or hide specific areas of an image. The initial motion vector obtained by ME is $(u, v)$. The one-dimensional vertical and horizontal kernels are combined via the matrix outer product into a two-dimensional convolution kernel $K(x, y)$, as shown in Eq. (2).

(2) $K(x, y) = k_v (k_u)^{\mathrm{T}}$.
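A minimal NumPy sketch of Eq. (2); the example 1D kernels below are illustrative placeholders, since in the network these vectors are predicted per pixel rather than fixed.

```python
import numpy as np

# Eq. (2): a 2D adaptive kernel K assembled as the outer product of a
# vertical 1D kernel k_v and a horizontal 1D kernel k_u.
k_v = np.array([0.25, 0.5, 0.25])   # example vertical kernel (sums to 1)
k_u = np.array([0.25, 0.5, 0.25])   # example horizontal kernel (sums to 1)
K = np.outer(k_v, k_u)              # K(x, y) = k_v (k_u)^T, shape (3, 3)
assert np.isclose(K.sum(), 1.0)     # separability preserves normalization
```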

To avoid losing information from previous levels and to mitigate the impact of vanishing gradients as network depth increases, an ME network is designed based on CNN. The schematic diagram of the ME network structure is shown in Figure 3.

Figure 3: Schematic diagram of ME network structure.

In Figure 3, the ME network structure consists of an encoding end and a decoding end. The encoding side ensures that the convolutional kernels can extract the content and difference information of the two reference frames. Initial feature extraction is completed through a convolutional layer with a 7 × 7 kernel. To reduce the resolution, a convolutional block is formed by combining three convolutional layers with 3 × 3 kernels and an average pooling layer. The encoding end uses skip connections to the decoding end. The decoding end replaces the pooling layer with a bi-linear up-sampling layer to restore the resolution of the feature map. Finally, the features are output to sub-networks that predict motion vectors, the soft mask $M$, and convolutional kernel weights; the sub-networks have similar structures but do not share weights. The ME network extracts inter-frame temporal content information from the existing video frame sequence $I_{1:T}$, predicts temporal motion vectors, soft masks, and adaptive convolution kernels, and transforms $I_{1:T}$ to obtain the rough predicted frame $\bar{I}_{T+1}$ [18].
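A minimal PyTorch sketch of this encoder-decoder structure follows; the exact channel widths, activations, and output heads are assumptions for illustration, and the adaptive kernel-weight heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Three 3x3 conv layers followed by average pooling (encoder side)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        feat = self.body(x)
        return self.pool(feat), feat  # pooled output + pre-pool skip feature

class MENet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(6, 32, 7, padding=3)  # two stacked RGB reference frames
        self.enc1 = ConvBlock(32, 64)
        self.enc2 = ConvBlock(64, 96)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec2 = nn.Conv2d(96 + 96, 64, 3, padding=1)  # upsampled + skip
        self.dec1 = nn.Conv2d(64 + 64, 32, 3, padding=1)
        self.flow_head = nn.Conv2d(32, 2, 3, padding=1)   # motion vector (u, v)
        self.mask_head = nn.Conv2d(32, 1, 3, padding=1)   # soft mask M

    def forward(self, ref_pair):
        x = torch.relu(self.stem(ref_pair))
        x, s1 = self.enc1(x)
        x, s2 = self.enc2(x)
        x = torch.relu(self.dec2(torch.cat([self.up(x), s2], dim=1)))
        x = torch.relu(self.dec1(torch.cat([self.up(x), s1], dim=1)))
        return self.flow_head(x), torch.sigmoid(self.mask_head(x))
```

The skip connections reuse pre-pooling features, so the decoder can recover spatial detail lost to average pooling, which is the stated purpose of the encoder-decoder design.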

3.2 Animation video frame prediction based on ConvGRU fine-grained flow algorithm and reverse distortion algorithm

In animation video frame prediction, if the predicted frame is obtained by applying motion compensation to the reference frames alone, the lack of high-dimensional content features cannot compensate for the detail loss caused by ME errors, resulting in blurred predicted frames [19,20]. To avoid this, a video frame content extractor combining feature pyramids is developed on the basis of CNN, obtaining multi-scale features with rich detail information from the reference frames [21]. By combining CNN for extracting spatial features with GRU for extracting temporal features, the encoded high-dimensional temporal motion information is processed to generate bias values that correct errors in the synthesized flow, thereby achieving the correct pixel mapping relationship [22]. The internal structure of ConvGRU is shown in Figure 4.

Figure 4: Internal structure of ConvGRU.

In Figure 4, the reset gate, update gate, and hidden gate are all used for deep feature extraction, and each consists of a convolutional layer. The concatenation operation $\mathrm{cat}$ along the channel dimension combines the input state $x_t$ with the past hidden state $h_{t-1}$, which is converted into the current hidden state $h_t$ through the reset, update, and hidden gates. The reset gate is shown in Eq. (3).

(3) $r_t = \mathrm{Conv}_R(\mathrm{cat}[h_{t-1}, x_t])$.

The update gate is shown in Eq. (4).

(4) $u_t = \mathrm{Conv}_U(\mathrm{cat}[h_{t-1}, x_t])$,

where $\odot$ denotes the Hadamard product, an element-wise operation that plays an important role in the training and inference of neural networks [23,24] and appears in Eqs. (5) and (6). The hidden gate is shown in Eq. (5).

(5) $z_t = \mathrm{Conv}_Z(\mathrm{cat}[x_t, h_{t-1} \odot r_t])$.

The current hidden state $h_t$ is shown in Eq. (6).

(6) $h_t = h_{t-1} \odot (1 - u_t) + u_t \odot z_t$.
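Eqs. (3)–(6) map directly onto a convolutional GRU cell. The sketch below assumes sigmoid activations on the reset and update gates and a tanh candidate state, which is the standard ConvGRU form; the equations above leave the activations implicit.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell following Eqs. (3)-(6)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.conv_r = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)  # reset gate, Eq. (3)
        self.conv_u = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)  # update gate, Eq. (4)
        self.conv_z = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)  # hidden gate, Eq. (5)

    def forward(self, x_t, h_prev):
        r_t = torch.sigmoid(self.conv_r(torch.cat([h_prev, x_t], dim=1)))
        u_t = torch.sigmoid(self.conv_u(torch.cat([h_prev, x_t], dim=1)))
        z_t = torch.tanh(self.conv_z(torch.cat([x_t, h_prev * r_t], dim=1)))
        return h_prev * (1 - u_t) + u_t * z_t   # Eq. (6), Hadamard products
```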

To improve the accurate prediction of future frame motion, the ConvGRU fine-grained synthesis flow algorithm is proposed. The algorithm encodes motion residuals at multiple levels and extracts high-dimensional temporal dependency information. This information is then decoded to obtain bias values of the same resolution. The bias value and $\bar{f}_{T+1 \to T}$ are vector-summed to obtain the fine-grained synthesis flow $f_{T+1 \to T}$. High-quality future video frames $I_{T+1}$ are obtained through the reverse distortion algorithm [25]. The ConvGRU fine-grained synthesis flow algorithm in the network includes a learnable parameter module. Based on the internal structure of ConvGRU, the ConvGRU fine-grained synthesis flow network structure is illustrated in Figure 5.

Figure 5: Schematic diagram of ConvGRU fine-grained synthesis flow network structure.

In Figure 5, the network consists of a pyramid structure, a motion residual encoder, and a ConvGRU module. The ConvGRU module consists of three levels, and the channel dimension at the same level is 256. The motion residual encoder shares weights within the same level and remains independent across levels. The reference frame is the rough predicted frame $\bar{I}_{T+1}$ obtained by the motion-aware convolution (MAC) algorithm, as shown in Eq. (7).

(7) $\bar{I}_{T+1} = M \odot \mathrm{MAC}(I_{T-1}; K_{T-1}, f_{T-1}) + (J - M) \odot \mathrm{MAC}(I_T; K_T, f_T)$,

where $f_{T-1}$ and $f_T$ are the time-domain motion vectors, $K_{T-1}$ and $K_T$ are the adaptive convolution kernel weights, and $J$ is the all-ones matrix with the same resolution as the reference frame. The coarse predicted frame $\bar{I}_{T+1}$ and the residual of the nearest-neighbor frame assist in refining the synthesized flow. The reverse distortion algorithm is used to reversely recover deformations of images or graphics in videos; its basic principle is to restore a deformed image or video to its original shape by computing the inverse transformation of the forward deformation. The reverse distortion algorithm and the bi-linear interpolation algorithm are shown in Figure 6.
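A hedged sketch of Eq. (7) as a soft-mask blend of two motion-compensated references; `mac` is a placeholder callable standing in for the motion-aware convolution operator, which is not specified in code form here.

```python
import torch

# Eq. (7): the coarse prediction blends two motion-compensated reference
# frames with the soft mask M; J is the all-ones matrix.
def coarse_prediction(mac, I_prev, I_cur, K_prev, K_cur, f_prev, f_cur, M):
    J = torch.ones_like(M)                      # all-ones matrix, same resolution as M
    warped_prev = mac(I_prev, K_prev, f_prev)   # MAC(I_{T-1}; K_{T-1}, f_{T-1})
    warped_cur = mac(I_cur, K_cur, f_cur)       # MAC(I_T; K_T, f_T)
    return M * warped_prev + (J - M) * warped_cur
```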

Figure 6: Schematic diagram of (a) reverse distortion algorithm and (b) bi-linear interpolation algorithm.

In Figure 6(a), the reverse distortion algorithm calculates the optical flow $f_{t+1 \to t}$ from image $t+1$ to image $t$. For each integer pixel position of $I_{t+1}$, the corresponding pixel in $I_t$ is found, achieving the synthesis of $I_{t+1}$. There are no holes or occlusions, because the pixels of $I_t$ and the pixels of the image to be synthesized have a single mapping relationship. In Figure 6(b), since the estimated optical flow generally does not point to integer pixel positions, the specific pixel values are calculated using the bi-linear interpolation algorithm. $Q_{11}$, $Q_{12}$, $Q_{21}$, and $Q_{22}$ are the pixel values of the nearest neighboring integer pixel points, and $d_x$ and $d_y$ represent the offsets of the sampling point $p$ from the $y$-axis and $x$-axis, respectively. The value of $p$ is shown in Eq. (8).

(8) $p = (1 - d_y)(1 - d_x) Q_{11} + (1 - d_y) d_x Q_{12} + (1 - d_x) d_y Q_{21} + d_x d_y Q_{22}$.

In the horizontal direction, the derivative with respect to $d_x$ is shown in Eq. (9).

(9) $\dfrac{\partial p}{\partial d_x} = -(1 - d_y) Q_{11} + (1 - d_y) Q_{12} - d_y Q_{21} + d_y Q_{22}$.

In the vertical direction, the derivative with respect to $d_y$ is shown in Eq. (10).

(10) $\dfrac{\partial p}{\partial d_y} = -(1 - d_x) Q_{11} - d_x Q_{12} + (1 - d_x) Q_{21} + d_x Q_{22}$.
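Together, the reverse distortion step and Eqs. (8)–(10) amount to differentiable backward warping with bilinear sampling. Below is a minimal PyTorch sketch, assuming flow given in pixel units and using `grid_sample` as the bilinear sampler; PyTorch's autograd supplies exactly the gradients of Eqs. (9) and (10).

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp img (B, C, H, W) with flow f_{t+1->t} (B, 2, H, W) in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=img.dtype, device=img.device),
        torch.arange(W, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    x_src = xs + flow[:, 0]    # sub-pixel source x-coordinates
    y_src = ys + flow[:, 1]    # sub-pixel source y-coordinates
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack(
        [2 * x_src / (W - 1) - 1, 2 * y_src / (H - 1) - 1], dim=-1
    )
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```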

4 Prediction results of animation video frame based on ConvGRU fine-grained synthesis flow

The influence of the ME network and its time-domain motion vector parameters on animation video frame prediction is analyzed, and the performance of the ConvGRU fine-grained synthesis flow algorithm and its improved version is verified.

4.1 ME network and parameter analysis of time-domain motion vectors

The experimental CPU is an Intel Core i7-9700 (3.00 GHz), the operating system is Ubuntu 18.04, and the graphics memory is 16 GB. The animation video datasets selected for the experiment are Creative Flow+ and AnimeRun. The model parameters, their settings, and the rationale for each are shown in Table 1.

Table 1: Basic principles of model parameters and their settings

| Parameter | Numerical value | Basic principle |
|---|---|---|
| Exponential decay rates of momentum and RMSProp terms | 0.9, 0.999 | Balance gradient updates |
| Initial learning rate | 0.001 | Ensure stable convergence of the model |
| Batch size | 8 | Balance memory consumption and training speed |
| Number of convolutional layers | 32, 64, 96 | Improve the expressive power of the model |
| Convolutional kernel size | 5 × 5 | Capture features at different scales |
| Number of time-domain motion vectors | 5 | Balance against the time complexity of the model |
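The decay rates 0.9 and 0.999 in Table 1 are most naturally read as the two beta parameters of the Adam optimizer; under that assumption, a minimal configuration sketch (with a stand-in module) is:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 5, padding=2)  # stand-in module; 5 x 5 kernel as in Table 1
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # initial learning rate, for stable convergence
    betas=(0.9, 0.999),  # exponential decay rates of the momentum and RMSProp terms
)
batch_size = 8           # balances memory consumption and training speed
```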

The experimental evaluation indicators are Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity (SSIM), and Peak Signal-to-Noise Ratio (PSNR). To investigate the impact of the number of time-domain motion vectors on algorithm performance, the convolution kernel size is kept at 11 × 11. The PSNR and SSIM results of the model on the Creative Flow+, AnimeRun simple, and AnimeRun difficult datasets with 1, 5, and 11 time-domain motion vectors are shown in Figure 7.
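A hedged sketch of computing PSNR and SSIM with scikit-image on uint8 frames (LPIPS would additionally require the `lpips` package and a pretrained perceptual network, omitted here); the random arrays are placeholders for real frame pairs.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """Return (PSNR in dB, SSIM) for a predicted/ground-truth frame pair."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim

pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # placeholder frames
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate(pred, gt))
```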

Figure 7: Comparison of objective evaluation results for different numbers of time-domain motion vectors. (a) Comparison of PSNR results. (b) Comparison of SSIM results.

In Figure 7(a), on the Creative Flow+ dataset, the model had the highest PSNR (37.51 dB) when the number of time-domain motion vectors was 5. On the AnimeRun simple and difficult datasets, the PSNR was highest with 11 time-domain motion vectors, at 35.83 and 24.62 dB, respectively. In Figure 7(b), on the Creative Flow+, AnimeRun simple, and difficult datasets, the SSIM was highest with 11 time-domain motion vectors, with values of 0.984, 0.981, and 0.862, respectively. Increasing the number of estimated time-domain motion vectors allows animation video information to be searched quickly, thereby improving predicted frame quality. To analyze the impact of convolution kernel size on the predicted frames, the number of time-domain motion vectors is kept at 5. The PSNR and SSIM results for kernel sizes of 1 × 1, 5 × 5, and 11 × 11 on the Creative Flow+, AnimeRun simple, and difficult datasets are shown in Figure 8.

Figure 8: Comparison of objective evaluation results for different convolutional kernel sizes. (a) Comparison of PSNR results. (b) Comparison of SSIM results.

In Figure 8(a), with an 11 × 11 convolution kernel, the PSNR of the predicted animation video frames on the three datasets was 36.82, 37.08, and 37.51 dB, respectively. In Figure 8(b), with an 11 × 11 kernel, the SSIM was 0.984, 0.980, and 0.860, respectively. To study the impact of the ME network module on prediction accuracy, the AnimeRun dataset is selected to train models with and without the ME network module. The PSNR results for the running and walking states of the characters are shown in Figure 9.

Figure 9: Comparison of PSNR results of different models in running and walking states of characters. (a) Running state. (b) Walking state.

In Figure 9(a), after 300 iterations, the PSNR with and without the ME module was 29.40 and 29.30 dB, respectively. In Figure 9(b), after 300 iterations on the walking state of the animated video character, the PSNR with and without the ME module was 26.78 and 26.58 dB, respectively; the ME module thus increased PSNR by 0.2 dB. To verify the impact of learning rate on model training, the total loss during training is compared for learning rates of 0.01 and 0.001, as shown in Figure 10.

Figure 10: Comparison of total loss values at learning rates of (a) 0.01 and (b) 0.001.

In Figure 10(a), when the learning rate was 0.01, the total training loss curve first increased and then decreased, with a highest total loss value of about 26.58, which then decreased by 8.1%. In Figure 10(b), when the learning rate was 0.001, the total loss curve gradually decreased with increasing iterations; at epochs 50–75 the total loss slightly increased, after which it fluctuated between 21.8 and 22.0.

4.2 Performance analysis of the improved algorithm

To verify the performance of the improved algorithm, the results of spatial information transfer and time backtracking (SITB) [26], Bayesian DeNet [27], and ConvGRU algorithms on the Creative Flow+ and AnimeRun datasets are compared, as shown in Table 2.

Table 2: Comparison of the three algorithms on different datasets

| Algorithm type | Creative Flow+ LPIPS | Creative Flow+ SSIM | Creative Flow+ PSNR (dB) | AnimeRun LPIPS | AnimeRun SSIM | AnimeRun PSNR (dB) |
|---|---|---|---|---|---|---|
| Bayesian DeNet | 0.071 | 0.869 | 24.91 | 0.032 | 0.944 | 29.09 |
| SITB | 0.056 | 0.884 | 26.45 | 0.028 | 0.949 | 30.28 |
| ConvGRU | 0.073 | 0.886 | 26.63 | 0.033 | 0.958 | 31.26 |

On the Creative Flow+ dataset, the ConvGRU algorithm achieved the highest SSIM and PSNR among the three algorithms, 0.886 and 26.63 dB, with an LPIPS of 0.073. On the AnimeRun dataset, the SSIM and PSNR of the ConvGRU algorithm were 0.958 and 31.26 dB, respectively, an improvement in performance. The ConvGRU algorithm ran nearly 50% longer than the SITB algorithm but was 13.342 s faster than the Bayesian DeNet algorithm. Experiments are also conducted on a self-made high-resolution animation video dataset, comparing the three algorithms under different motion modes and motion magnitudes. The results are shown in Figure 11.

Figure 11: Comparison of PSNR, SSIM, and LPIPS results of different algorithms on datasets of different levels. (a) Comparison of PSNR results. (b) Comparison of SSIM results. (c) Comparison of LPIPS results.

In Figure 11(a), the PSNR of the improved ConvGRU algorithm on the simple, medium, and difficult datasets was 38.96, 34.05, and 27.90 dB, respectively. In Figure 11(b), the SSIM of the improved ConvGRU algorithm on the simple, medium, and difficult datasets was 0.987, 0.968, and 0.933, respectively; compared with the SITB algorithm, the improved ConvGRU algorithm increased SSIM by 0.005 on the simple dataset. In Figure 11(c), the improved ConvGRU algorithm had the lowest LPIPS on the simple dataset, 0.014. To further analyze the improved algorithm, the objective quality of the model before and after the training improvement is compared on the Creative Flow+, AnimeRun simple, and difficult datasets, as shown in Figure 12.

Figure 12: Comparison of PSNR, SSIM, and LPIPS results between unimproved and improved models on different datasets. (a) Comparison of PSNR results. (b) Comparison of SSIM results. (c) Comparison of LPIPS results.

In Figure 12(a), the PSNR of the improved model on the three datasets was 31.26, 36.63, and 22.15 dB, respectively. The improved model shows the greatest quality gain on the difficult dataset, indicating enhanced prediction accuracy. In Figure 12(b), the SSIM of the improved model on the three datasets was 0.958, 0.886, and 0.813, respectively. In Figure 12(c), the LPIPS of the improved model on the three datasets was 0.033, 0.073, and 0.144, respectively. The actual detection performance of the improved algorithm on the Creative Flow+, AnimeRun, and self-made datasets is shown in Figure 13.

Figure 13: The frequency of frame anomalies on different datasets. (a) Anomalous frequencies on the Creative Flow+ and AnimeRun datasets. (b) Frequency of anomalies on the self-made dataset.

In Figure 13(a), at frames 0–55, the frame sequence frequency of the Creative Flow+ dataset was at a relatively low level, averaging 1.95 frames per second, and the characters in the animated video sample were in a normal walking state. In the AnimeRun dataset, the characters were in a normal walking state before frame 35; after frame 35, the frequency of screen changes remained high. In Figure 13(b), at frame 50 of the self-made dataset, a running scene appeared, with a frequency of 0.28 frames per second. At frame 950, the character details were magnified ten times relative to the previous frame sequence, with a frequency of 0.97 frames per second. The improved algorithm has good detection results and high accuracy. The training, validation, and testing sets are divided in a ratio of 6:2:2, and the improved model is trained for 400 epochs. The model training results are shown in Table 3.

Table 3: Numerical comparison of improved model results

| Set | Metric | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Average |
|---|---|---|---|---|---|---|---|
| Training set | RMSE | 0.0423 | 0.0435 | 0.0447 | 0.0464 | 0.0476 | 0.0449 |
| Training set | Accuracy (%) | 99.72 | 99.72 | 99.69 | 99.67 | 99.64 | 99.69 |
| Validation set | RMSE | 0.0431 | 0.0407 | 0.0390 | 0.0446 | 0.0584 | 0.0452 |
| Validation set | Accuracy (%) | 99.64 | 99.74 | 99.70 | 99.68 | 99.42 | 99.64 |
| Testing set | RMSE | 0.0479 | 0.0483 | 0.0454 | 0.0488 | 0.0487 | 0.0478 |
| Testing set | Accuracy (%) | 99.65 | 99.66 | 99.59 | 99.50 | 99.59 | 99.60 |

The average accuracy of the improved model on the training, validation, and testing sets was 99.69%, 99.64%, and 99.60%, respectively. The training results are good, with an average prediction accuracy of 99.64% on the validation set. The improved model achieves excellent performance in both prediction accuracy and visual quality.

5 Conclusion

The study combined ConvGRU with the synthesis flow algorithm for animation video frame prediction, effectively capturing temporal correlation and spatial information and achieving accurate, realistic frame synthesis. The results showed that the improved model reached its highest PSNR of 37.51 dB on the Creative Flow+ dataset when the number of time-domain motion vectors was 5. After 300 iterations, the PSNR of the running state of the animated video characters gradually increased; the PSNR with and without the ME module was 29.40 and 29.30 dB, respectively. Between iterations 128 and 142, there was a decrease in PSNR with the ME module, possibly due to errors in the dataset sample videos. When the learning rate was 0.001, the total loss curve gradually decreased with increasing iterations. When the learning rate was 0.01, the total training loss curve first increased and then decreased, with a highest total loss value of about 26.58, and the model accuracy decreased by 8.1%. On the Creative Flow+ dataset, the LPIPS, SSIM, and PSNR of the ConvGRU algorithm were 0.073, 0.886, and 26.63 dB, respectively. On the AnimeRun dataset, the ConvGRU algorithm improved performance, with an SSIM of 0.958 and a PSNR of 31.26 dB. Compared with existing methods, the improved method has superior performance and higher prediction accuracy. Accurate frame prediction can be used in animation production and video game development to reduce the workload of animators, improve production efficiency, and enhance the visual effects of animations and games. However, the model is sensitive to the ME parameters and the hyper-parameters of ConvGRU, and improper parameter settings may lead to significant performance degradation. In the future, methods such as grid search, random search, or Bayesian optimization can be used to determine the optimal parameter combination and further improve the stability of model performance.

  1. Funding information: None.

  2. Author contributions: Xue Duan conducted the data collection and analysis and wrote the manuscript.

  3. Conflict of interest: The author declares no conflict of interest.

  4. Data availability statement: The datasets analyzed in this study can be obtained from the Creative Flow+ and AnimeRun repositories.

References

[1] Li Y, Wang J, Sun X, Li Z, Liu M, Gui G. Smoothing-aided support vector machine based nonstationary video traffic prediction towards B5G networks. IEEE Trans Veh Technol. 2020;69(7):7493–502. doi:10.1109/TVT.2020.2993262.

[2] Nsugbe E. Toward a self-supervised architecture for semen quality prediction using environmental and lifestyle factors. Artif Intell Appl. 2023;1(1):35–42. doi:10.47852/bonviewAIA2202303.

[3] Zhao S, Zhao L. Forecasting long-term electric power demand by linear semiparametric regression. AIEM. 2022;11(1):29–31.

[4] Hilda. Make an impression in 60 seconds: Video on social media, the main weapon of professional marketers – Trends 2023. Acta Inform Malays. 2024;8(1):52–5. doi:10.26480/aim.02.2024.56.59.

[5] Xu H, Liu W, Xing W, Wei X. Motion-aware future frame prediction for video anomaly detection based on saliency perception. Signal Image Video Process. 2022;16(8):2121–9. doi:10.1007/s11760-022-02174-7.

[6] Hassan MA, Khan MUG, Iqbal R, Riaz O, Bashir AK, Tariq U. Predicting human's future motion trajectories in video streams using generative adversarial network. Multimed Tools Appl. 2024;83(5):15289–311. doi:10.1007/s11042-021-11457-z.

[7] Aslam N, Kolekar MH. DeMAAE: deep multiplicative attention-based autoencoder for identification of peculiarities in video sequences. Vis Comput. 2024;40(3):1729–43. doi:10.1007/s00371-023-02882-2.

[8] RS SN. Video morphing attack detection using convolutional neural networks on deep fake detection algorithm. Educ Admin Theory Pract. 2024;30(5):3589–603.

[9] Sreeja MU, Kovoor BC. An aggregated deep convolutional recurrent model for event-based surveillance video summarisation: a supervised approach. IET Comput Vis. 2021;15(4):297–311. doi:10.1049/cvi2.12044.

[10] Afrasiabi M, Mohammadi M, Rastegar M, Kargarian A. Multi-agent microgrid energy management based on deep learning forecaster. Energy. 2019;186:115873. doi:10.1016/j.energy.2019.115873.

[11] Xu G, Xian D, Fournier-Viger P, Li X, Ye Y, Hu X. AM-ConvGRU: a spatio-temporal model for typhoon path prediction. Neural Comput Appl. 2022;34(8):5905–21. doi:10.1007/s00521-021-06724-x.

[12] Zhang Y, Zhang L, He Y, Yao S, Yang W, Cao S, et al. Analysis of the future trends of typical mountain glacier movements along the Sichuan-Tibet Railway based on ConvGRU network. Int J Digit Earth. 2023;16(1):762–80. doi:10.1080/17538947.2022.2152884.

[13] Tian L, Li X, Ye Y, Xie P, Li Y. A generative adversarial gated recurrent unit model for precipitation nowcasting. IEEE Geosci Remote Sens Lett. 2019;17(4):601–5. doi:10.1109/LGRS.2019.2926776.

[14] Li Y, Hou G, Zhang X. Deep convolutional LSTM for dynamic facial expression recognition. IEEE Trans Cybern. 2019;49(6):2150–63.

[15] Li T, Wang X, Liu H, Zhang L. Video prediction with appearance and motion conditions. Proc IEEE Conf Comput Vis Pattern Recognit (CVPR). 2019;6(2):11341–50.

[16] Kalchbrenner N, Oord A, Simonyan K, Danihelka I, Vinyals O, Graves A, et al. Video pixel networks. In: Proceedings of the International Conference on Machine Learning (ICML); 2016. p. 1771–9.

[17] Liu Z, He Q, Peng Z. Interactive visual simulation modeling for structural response prediction and damage detection. IEEE Trans Ind Electron. 2021;69(1):868–78. doi:10.1109/TIE.2021.3050365.

[18] Bao W, Lai WS, Zhang X, Gao Z, Yang MH. MEMC-Net: motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans Pattern Anal Mach Intell. 2019;43(3):933–48. doi:10.1109/TPAMI.2019.2941941.

[19] Gogoi S, Peesapati R. Design and implementation of an efficient multi-pattern motion estimation search algorithm for HEVC/H.265. IEEE Trans Consum Electron. 2021;67(4):319–28. doi:10.1109/TCE.2021.3126670.

[20] Li T, Zhang M, Qi W, Asma E, Qi J. Deep learning-based joint PET image reconstruction and motion estimation. IEEE Trans Med Imaging. 2021;41(5):1230–41. doi:10.1109/TMI.2021.3136553.

[21] Mohseni M, Santhanam S, Williams J, Thakker A, Nataraj C. Systematic fatigue spectrum editing by fast wavelet transform and genetic algorithm. Fatigue Fract Eng Mater Struct. 2022;45(1):69–83. doi:10.1111/ffe.13583.

[22] Zhan Y, Guan J, Zhao Y. An adaptive second-order sliding-mode observer for permanent magnet synchronous motor with an improved phase-locked loop structure considering speed reverse. Trans Inst Meas Control. 2020;42(5):1008–21. doi:10.1177/0142331219880712.

[23] Wang ZJ, Turko R, Shaikh O, Park H, Das N, Hohman F, et al. CNN explainer: learning convolutional neural networks with interactive visualization. IEEE Trans Vis Comput Graph. 2020;27(2):1396–406. doi:10.1109/TVCG.2020.3030418.

[24] Mourot L, Hoyet L, Le Clerc F, Schnitzler F, Hellier P. A survey on deep learning for skeleton-based human animation. Comput Graph Forum. 2022;41(1):122–57. doi:10.1111/cgf.14426.

[25] Yan B, Wang L, Zhang Y, Zhang Y, Zhang L. Video prediction with convolutional LSTM networks using temporal relation networks. IEEE Trans Circuits Syst Video Technol. 2020;30(2):450–63.

[26] Yuan P, Guan Y, Huang J. Video prediction based on spatial information transfer and time backtracking. Signal Image Video Process. 2022;16(3):825–33. doi:10.1007/s11760-021-02023-z.

[27] Yang X, Gao Y, Luo H, Liao C, Cheng KT. Bayesian DeNet: monocular depth prediction and frame-wise fusion with synchronized uncertainty. IEEE Trans Multimed. 2019;21(11):2701–13. doi:10.1109/TMM.2019.2912121.

Received: 2024-08-16
Revised: 2024-10-10
Accepted: 2024-10-21
Published Online: 2025-02-10

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
