
Animation video frame prediction based on ConvGRU fine-grained synthesis flow

Xue Duan
Published/Copyright: February 10, 2025

Abstract

Due to the complexity and dynamism of animated scenes, frame prediction in animated videos is a challenging task. To improve the playback frame rate of animated videos, an innovative method combining a convolutional neural network with a convolutional gated recurrent unit (ConvGRU) is used to refine the synthesized flow in animation video frame prediction. The results indicated that the average prediction accuracy of the proposed model was 99.64%, and the training effect was good. The peak signal-to-noise ratios on the three datasets were 31.26, 36.63, and 22.15 dB, and the structural similarities were 0.958, 0.886, and 0.813, respectively. The maximum Learned Perceptual Image Patch Similarity of the proposed model was 0.144. These results indicate that the model achieves excellent prediction accuracy and visual quality and can capture the complex dynamics and fine details of animated scenes. The contribution of this study is to provide technical support for improving the accuracy of frame prediction in animated videos, which helps promote the intelligent development of animation production.

1 Introduction

Video prediction uses a given sequence of continuous frames to predict future video frames. Animation video frame prediction plays a crucial role in applications such as computer-generated animation, video game development, and virtual reality, and the ability to accurately predict future frames enhances the visual quality and realism of animation content. However, due to the complexity and dynamism of animation, predicting accurate frames remains a challenging task [1]. The convolutional neural network (CNN), with its excellent feature extraction and pattern recognition capabilities, can effectively capture dynamic changes and temporal relationships in animation video tasks. In the current field of deep learning, the gated recurrent unit (GRU) has achieved significant results in image processing and sequence data. ConvGRU combines CNN and GRU and can effectively capture temporal dependencies and spatial information in video sequences, thereby generating more accurate and visually coherent animation frames [2]. For complex scenes, traditional video frame prediction methods may not accurately capture details and complex actions [3]. For example, when there are many moving objects or complex backgrounds in the scene, traditional methods may be unable to effectively predict the next frame [4]. Therefore, to avoid the need for too many reference frames, a motion estimation (ME) network is designed based on CNN, and a ConvGRU fine-grained synthesis flow algorithm is proposed. Combining the reverse distortion algorithm with the bi-linear interpolation algorithm, the specific pixel values in the extracted video frames are calculated to obtain high-precision predicted frames. The research comprises four parts. The first part reviews video frame prediction and ConvGRU. The second part presents animation video frame prediction based on the ConvGRU fine-grained synthesis flow algorithm. The third part reports the prediction results. The fourth part concludes the study.

2 Related works

Researchers have proposed various methods for video frame prediction and have achieved certain results. Xu et al. developed a motion-aware future frame prediction method for video anomaly detection that incorporated saliency perception to avoid an imbalanced information distribution between video foreground and background. This method improved the ability to represent moving targets, and the results showed that the network had certain advantages [5]. Hassan et al. used long short-term memory (LSTM) and generative adversarial networks to locate experimental subjects and analyze motion paths for predicting future human motion trajectories. The results showed that the method ultimately reduced displacement error by 41% [6]. Aslam et al. developed an auto-encoder based on deep multiplicative attention to achieve anomaly detection in videos. The context vector determined the output of the decoder, and the global attention mechanism participated in the weighted calculation. The results showed that the method improved the running time by 0.015 s [7]. RS combined LSTM with a ResNeXt-50 deep-fake detection algorithm to combat video morphing attacks and extract complex features. The results showed that this method had high detection accuracy [8].

ConvGRU has been widely applied in fields such as image processing and natural language processing, and many scholars have studied it extensively. Sreeja and Kovoor focused on the precise detection of suspicious events in surveillance videos. A multi-layer CNN and a stacked bidirectional GRU extracted sequence-level and frame-level features, respectively, which helped improve the recognition rate. The results showed that the model had generalization ability and effectiveness [9]. To achieve multi-agent micro-grid energy management, Afrasiabi et al. utilized the accelerated alternating direction method of multipliers to predict the parameters required by intelligent agents based on ConvGRU, searching for the optimal operating point and enhancing the convergence of distributed algorithms. The results indicated that the method had good performance [10]. Xu et al. designed an AM-ConvGRU model that combined channel attention blocks for predicting typhoon paths, extracting nonlinear three-dimensional features of typhoons with complex high-dimensional attributes. The results showed that the model had good prediction accuracy [11]. Zhang et al. proposed a ConvGRU spatio-temporal prediction model to extract glacier velocity and analyze its past and present spatio-temporal variations. The fluctuations in glacier velocity time-series data were captured, and the results showed that the model had high accuracy [12]. Tian et al. developed a GA-ConvGRU model for precipitation nowcasting that handled multi-modal and skewed intensity distributions, generating more realistic and accurate extrapolations. The results showed that the model had certain applicability [13].

In summary, current research on ConvGRU mainly focuses on improving the model, and its ability to model long-term dependencies still needs improvement. Given the advantages of ConvGRU in processing sequence data, its application prospects remain broad, especially in practical video analysis. Therefore, further exploring the research potential and application prospects of ConvGRU has a positive impact on the field of deep learning.

3 Animation video frame prediction based on ConvGRU fine-grained synthesis flow

The research first designs an ME network based on CNN to address the problems of a single search direction and a limited reference range. Afterwards, to obtain high-precision predicted frames, the ConvGRU fine-grained synthesis flow algorithm is proposed. At the same time, the reverse distortion algorithm and the bi-linear interpolation algorithm are combined to optimize the computational complexity.

3.1 CNN-based ME network

Animation video is a media form that simulates motion by quickly and continuously playing static images. However, producing high-quality animated videos requires a significant amount of time and resources, especially when drawing each frame. To improve production efficiency and reduce the cost of animated videos, the efficiency of frame prediction needs to be improved. Animation video frame prediction refers to predicting the next frame of an image based on the known preceding frames [14]. To address the excessive number of reference frames required by video frame prediction and the ME process, a mathematical model of video frame prediction is constructed, as shown in Eq. (1).

(1) $p(I_{T+1} \mid I_{1:T}) = \int p(I_{T+1} \mid \bar{z}_{1:T})\, p(\bar{z}_{1:T} \mid I_{1:T})\, \mathrm{d}\bar{z}_{1:T}$,

where $I_{1:T}$ is the existing video frame sequence, $T$ is the sequence length, and $I_{T+1}$ is the future video frame. $p(\bar{z}_{1:T} \mid I_{1:T})$ is the ME process, and $p(I_{T+1} \mid \bar{z}_{1:T})$ represents motion compensation applied to the ME result. The current reference frame count is generally 10–20 frames. Predicting a large number of future frames requires continuous iteration, which results in long delays and poor timeliness in animation video processing. To address these issues, an animation video frame prediction network that combines ConvGRU fine-grained synthesis flow is designed. The flowchart of this network is shown in Figure 1.
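As a hedged illustration of this iterative prediction process, the sketch below rolls a predictor forward frame by frame; `predict_next` is a hypothetical stand-in for the full network described in this section, not the paper's implementation.

```python
# A minimal sketch of autoregressive multi-frame rollout. The callable
# `predict_next` is hypothetical: it stands in for a network that
# estimates motion from I_{1:T} and synthesizes I_{T+1} as in Eq. (1).
def rollout(predict_next, frames, n_future, n_ref=10):
    """Iteratively predict n_future frames from an initial sequence.

    Each predicted frame is fed back as a new reference, which is why
    long rollouts accumulate delay and estimation error.
    """
    frames = list(frames)
    for _ in range(n_future):
        frames.append(predict_next(frames[-n_ref:]))  # use the last n_ref reference frames
    return frames[-n_future:]
```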

Figure 1: Schematic diagram of the animation video frame prediction network flow.

In Figure 1, the inter-frame temporal content information of the reference frames is extracted by the ME network module. Combined with motion residual calculation and synthesis flow calculation, the ConvGRU fine-grained synthesis flow algorithm is completed. Finally, combined with the reverse distortion algorithm, the predicted frame is obtained. Inter-frame prediction uses the correlation between video frames to filter out temporally redundant information and generally includes two steps: ME and motion compensation. The main task of ME is to find the optimal matching block in a historical reference frame for the macroblock currently being encoded. After finding the optimal matching block, ME outputs a motion vector, which is the position coordinate of the reference block relative to the current block [15]. The ME principle is shown in Figure 2.
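As a hedged, self-contained illustration of classical block-matching ME (not the learned ME network proposed here), the following NumPy sketch performs a full search over a ±8-pixel window using the sum of absolute differences (SAD); block size, search range, and the SAD criterion are illustrative choices.

```python
import numpy as np

# Full-search block matching: for one macroblock of the current frame,
# find the reference block with minimum SAD inside a search window and
# return the displacement (dy, dx) as the motion vector.
def block_match(cur, ref, y, x, block=16, search_range=8):
    target = cur[y:y + block, x:x + block]
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + block > ref.shape[0] or rx + block > ref.shape[1]:
                continue  # skip candidates that fall outside the reference frame
            cand = ref[ry:ry + block, rx:rx + block]
            sad = np.abs(target.astype(np.int32) - cand.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv  # displacement of the best-matching reference block
```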

Figure 2: ME principle diagram.

In Figure 2, current video coding standards mainly use block-based inter-frame encoding. The principle is to use ME to find the reference block with the smallest difference in an adjacent reconstructed reference frame, and the reconstructed value serves as the predicted value of the current block. The displacement from the reference block to the current block is the motion vector, and taking reconstructed values as predicted values is motion compensation. The ME network can extract motion and occlusion information between reference frames and predict initial motion vectors, soft masks, and convolutional kernel weights [16,17]. The soft mask matrix $M$ is used to avoid occlusion issues; a soft mask is a technique used in image processing and computer vision to obscure or hide specific areas of an image. The initial motion vector obtained by ME is $(u, v)$. The one-dimensional vertical and horizontal kernels are combined via the matrix outer product into a two-dimensional convolution kernel $K(x, y)$, as shown in Eq. (2).

(2) $K(x, y) = k_v (k_u)^{\mathrm{T}}$.
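A minimal NumPy sketch of Eq. (2); the example 1D kernels below are illustrative placeholders, since in the network these vectors are predicted per pixel rather than fixed.

```python
import numpy as np

# Eq. (2): a 2D adaptive kernel K assembled as the outer product of a
# vertical 1D kernel k_v and a horizontal 1D kernel k_u.
k_v = np.array([0.25, 0.5, 0.25])   # example vertical kernel (sums to 1)
k_u = np.array([0.25, 0.5, 0.25])   # example horizontal kernel (sums to 1)
K = np.outer(k_v, k_u)              # K(x, y) = k_v (k_u)^T, shape (3, 3)
assert np.isclose(K.sum(), 1.0)     # separability preserves normalization
```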

To avoid losing information from previous levels and to mitigate the impact of vanishing gradients as network depth increases, an ME network is designed based on CNN. The schematic diagram of the ME network structure is shown in Figure 3.

Figure 3: Schematic diagram of ME network structure.

In Figure 3, the ME network structure consists of an encoding end and a decoding end. The encoding side ensures that the convolutional kernels can extract the content and difference information of the two reference frames. Initial feature extraction is completed through a convolutional layer with a 7 × 7 kernel. To reduce the resolution, a convolutional block is formed by combining three convolutional layers with 3 × 3 kernels and an average pooling layer. The encoding end uses skip connections to the decoding end. The decoding end replaces the pooling layer with a bi-linear up-sampling layer to restore the resolution of the feature map. Finally, the features are output to sub-networks that predict motion vectors, the soft mask $M$, and convolutional kernel weights; the sub-networks have similar structures but do not share weights. The ME network extracts inter-frame temporal content information from the existing video frame sequence $I_{1:T}$, predicts temporal motion vectors, soft masks, and adaptive convolution kernels, and transforms $I_{1:T}$ to obtain the rough predicted frame $\bar{I}_{T+1}$ [18].
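A minimal PyTorch sketch of this encoder-decoder structure follows; the exact channel widths, activations, and output heads are assumptions for illustration, and the adaptive kernel-weight heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Three 3x3 conv layers followed by average pooling (encoder side)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        feat = self.body(x)
        return self.pool(feat), feat  # pooled output + pre-pool skip feature

class MENet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(6, 32, 7, padding=3)  # two stacked RGB reference frames
        self.enc1 = ConvBlock(32, 64)
        self.enc2 = ConvBlock(64, 96)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec2 = nn.Conv2d(96 + 96, 64, 3, padding=1)  # upsampled + skip
        self.dec1 = nn.Conv2d(64 + 64, 32, 3, padding=1)
        self.flow_head = nn.Conv2d(32, 2, 3, padding=1)   # motion vector (u, v)
        self.mask_head = nn.Conv2d(32, 1, 3, padding=1)   # soft mask M

    def forward(self, ref_pair):
        x = torch.relu(self.stem(ref_pair))
        x, s1 = self.enc1(x)
        x, s2 = self.enc2(x)
        x = torch.relu(self.dec2(torch.cat([self.up(x), s2], dim=1)))
        x = torch.relu(self.dec1(torch.cat([self.up(x), s1], dim=1)))
        return self.flow_head(x), torch.sigmoid(self.mask_head(x))
```

The skip connections reuse pre-pooling features, so the decoder can recover spatial detail lost to average pooling, which is the stated purpose of the encoder-decoder design.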

3.2 Animation video frame prediction based on ConvGRU fine-grained flow algorithm and reverse distortion algorithm

In animation video frame prediction, if the predicted frame is obtained by applying motion compensation to the reference frames alone, the lack of high-dimensional content features cannot compensate for the detail loss caused by ME errors, resulting in blurred predicted frames [19,20]. To avoid this, a video frame content extractor combining feature pyramids is developed on the basis of CNN, obtaining multi-scale features with rich detail information from the reference frames [21]. By combining CNN for extracting spatial features with GRU for extracting temporal features, the encoded high-dimensional temporal motion information is processed to generate bias values that correct errors in the synthesized flow, thereby achieving the correct pixel mapping relationship [22]. The internal structure of ConvGRU is shown in Figure 4.

Figure 4: Internal structure of ConvGRU.

In Figure 4, the reset gate, update gate, and hidden gate are all used for deep feature extraction, and each consists of a convolutional layer. The concatenation operation $\mathrm{cat}$ along the channel dimension combines the input state $x_t$ with the past hidden state $h_{t-1}$, which is converted into the current hidden state $h_t$ through the reset, update, and hidden gates. The reset gate is shown in Eq. (3).

(3) $r_t = \mathrm{Conv}_R(\mathrm{cat}[h_{t-1}, x_t])$.

The update gate is shown in Eq. (4).

(4) $u_t = \mathrm{Conv}_U(\mathrm{cat}[h_{t-1}, x_t])$,

where $\odot$ denotes the Hadamard product, an element-wise operation that plays an important role in the training and inference of neural networks [23,24] and appears in Eqs. (5) and (6). The hidden gate is shown in Eq. (5).

(5) $z_t = \mathrm{Conv}_Z(\mathrm{cat}[x_t, h_{t-1} \odot r_t])$.

The current hidden state $h_t$ is shown in Eq. (6).

(6) $h_t = h_{t-1} \odot (1 - u_t) + u_t \odot z_t$.
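Eqs. (3)–(6) map directly onto a convolutional GRU cell. The sketch below assumes sigmoid activations on the reset and update gates and a tanh candidate state, which is the standard ConvGRU form; the equations above leave the activations implicit.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell following Eqs. (3)-(6)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.conv_r = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)  # reset gate, Eq. (3)
        self.conv_u = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)  # update gate, Eq. (4)
        self.conv_z = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)  # hidden gate, Eq. (5)

    def forward(self, x_t, h_prev):
        r_t = torch.sigmoid(self.conv_r(torch.cat([h_prev, x_t], dim=1)))
        u_t = torch.sigmoid(self.conv_u(torch.cat([h_prev, x_t], dim=1)))
        z_t = torch.tanh(self.conv_z(torch.cat([x_t, h_prev * r_t], dim=1)))
        return h_prev * (1 - u_t) + u_t * z_t   # Eq. (6), Hadamard products
```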

To improve the accurate prediction of future frame motion, the ConvGRU fine-grained synthesis flow algorithm is proposed. The algorithm encodes motion residuals at multiple levels and extracts high-dimensional temporal dependency information. This information is then decoded to obtain bias values of the same resolution. The bias value and $\bar{f}_{T+1 \to T}$ are vector-summed to obtain the fine-grained synthesis flow $f_{T+1 \to T}$. High-quality future video frames $I_{T+1}$ are obtained through the reverse distortion algorithm [25]. The ConvGRU fine-grained synthesis flow algorithm in the network includes a learnable parameter module. Based on the internal structure of ConvGRU, the ConvGRU fine-grained synthesis flow network structure is illustrated in Figure 5.

Figure 5: Schematic diagram of ConvGRU fine-grained synthesis flow network structure.

In Figure 5, the network consists of a pyramid structure, a motion residual encoder, and a ConvGRU module. The ConvGRU module consists of three levels, and the channel dimension at the same level is 256. The motion residual encoder shares weights within the same level and remains independent across levels. The reference frame is the rough predicted frame $\bar{I}_{T+1}$ obtained by the motion-aware convolution (MAC) algorithm, as shown in Eq. (7).

(7) $\bar{I}_{T+1} = M \odot \mathrm{MAC}(I_{T-1}; K_{T-1}, f_{T-1}) + (J - M) \odot \mathrm{MAC}(I_T; K_T, f_T)$,

where $f_{T-1}$ and $f_T$ are the time-domain motion vectors, $K_{T-1}$ and $K_T$ are the adaptive convolution kernel weights, and $J$ is the all-ones matrix with the same resolution as the reference frame. The coarse predicted frame $\bar{I}_{T+1}$ and the residual of the nearest-neighbor frame assist in refining the synthesized flow. The reverse distortion algorithm is used to reversely recover deformations of images or graphics in videos; its basic principle is to restore a deformed image or video to its original shape by computing the inverse transformation of the forward deformation. The reverse distortion algorithm and the bi-linear interpolation algorithm are shown in Figure 6.
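A hedged sketch of Eq. (7) as a soft-mask blend of two motion-compensated references; `mac` is a placeholder callable standing in for the motion-aware convolution operator, which is not specified in code form here.

```python
import torch

# Eq. (7): the coarse prediction blends two motion-compensated reference
# frames with the soft mask M; J is the all-ones matrix.
def coarse_prediction(mac, I_prev, I_cur, K_prev, K_cur, f_prev, f_cur, M):
    J = torch.ones_like(M)                      # all-ones matrix, same resolution as M
    warped_prev = mac(I_prev, K_prev, f_prev)   # MAC(I_{T-1}; K_{T-1}, f_{T-1})
    warped_cur = mac(I_cur, K_cur, f_cur)       # MAC(I_T; K_T, f_T)
    return M * warped_prev + (J - M) * warped_cur
```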

Figure 6: Schematic diagram of (a) reverse distortion algorithm and (b) bi-linear interpolation algorithm.

In Figure 6(a), the reverse distortion algorithm calculates the optical flow $f_{t+1 \to t}$ from image $t+1$ to image $t$. For each integer pixel position of $I_{t+1}$, the corresponding pixel in $I_t$ is found, achieving the synthesis of $I_{t+1}$. There are no holes or occlusions, because the pixels of $I_t$ and the pixels of the image to be synthesized have a single mapping relationship. In Figure 6(b), since the estimated optical flow generally does not point to integer pixel positions, the specific pixel values are calculated using the bi-linear interpolation algorithm. $Q_{11}$, $Q_{12}$, $Q_{21}$, and $Q_{22}$ are the pixel values of the nearest neighboring integer pixel points, and $d_x$ and $d_y$ represent the offsets of the sampling point $p$ from the $y$-axis and $x$-axis, respectively. The value of $p$ is shown in Eq. (8).

(8) $p = (1 - d_y)(1 - d_x) Q_{11} + (1 - d_y) d_x Q_{12} + (1 - d_x) d_y Q_{21} + d_x d_y Q_{22}$.

In the horizontal direction, the derivative with respect to $d_x$ is shown in Eq. (9).

(9) $\dfrac{\partial p}{\partial d_x} = -(1 - d_y) Q_{11} + (1 - d_y) Q_{12} - d_y Q_{21} + d_y Q_{22}$.

In the vertical direction, the derivative with respect to $d_y$ is shown in Eq. (10).

(10) $\dfrac{\partial p}{\partial d_y} = -(1 - d_x) Q_{11} - d_x Q_{12} + (1 - d_x) Q_{21} + d_x Q_{22}$.
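Together, the reverse distortion step and Eqs. (8)–(10) amount to differentiable backward warping with bilinear sampling. Below is a minimal PyTorch sketch, assuming flow given in pixel units and using `grid_sample` as the bilinear sampler; PyTorch's autograd supplies exactly the gradients of Eqs. (9) and (10).

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp img (B, C, H, W) with flow f_{t+1->t} (B, 2, H, W) in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=img.dtype, device=img.device),
        torch.arange(W, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    x_src = xs + flow[:, 0]    # sub-pixel source x-coordinates
    y_src = ys + flow[:, 1]    # sub-pixel source y-coordinates
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack(
        [2 * x_src / (W - 1) - 1, 2 * y_src / (H - 1) - 1], dim=-1
    )
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```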

4 Prediction results of animation video frame based on ConvGRU fine-grained synthesis flow

The influence of the ME network and its time-domain motion vector parameters on animation video frame prediction is analyzed, and the performance of the ConvGRU fine-grained synthesis flow algorithm and its improved version is verified.

4.1 ME network and parameter analysis of time-domain motion vectors

The experimental CPU is an Intel Core i7-9700 (3.00 GHz), the operating system is Ubuntu 18.04, and the graphics memory is 16 GB. The animation video datasets selected for the experiment are Creative Flow+ and AnimeRun. The model parameters, their settings, and the rationale for each are shown in Table 1.

Table 1: Basic principles of model parameters and their settings

| Parameter | Numerical value | Basic principle |
|---|---|---|
| Exponential decay rates of momentum and RMSProp terms | 0.9, 0.999 | Balance gradient updates |
| Initial learning rate | 0.001 | Ensure stable convergence of the model |
| Batch size | 8 | Balance memory consumption and training speed |
| Number of convolutional layers | 32, 64, 96 | Improve the expressive power of the model |
| Convolutional kernel size | 5 × 5 | Capture features at different scales |
| Number of time-domain motion vectors | 5 | Balance against the time complexity of the model |
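The decay rates 0.9 and 0.999 in Table 1 are most naturally read as the two beta parameters of the Adam optimizer; under that assumption, a minimal configuration sketch (with a stand-in module) is:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 5, padding=2)  # stand-in module; 5 x 5 kernel as in Table 1
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # initial learning rate, for stable convergence
    betas=(0.9, 0.999),  # exponential decay rates of the momentum and RMSProp terms
)
batch_size = 8           # balances memory consumption and training speed
```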

The experimental evaluation indicators are Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity (SSIM), and Peak Signal-to-Noise Ratio (PSNR). To investigate the impact of the number of time-domain motion vectors on algorithm performance, the convolution kernel size is kept at 11 × 11. The PSNR and SSIM results of the model on the Creative Flow+, AnimeRun simple, and AnimeRun difficult datasets with 1, 5, and 11 time-domain motion vectors are shown in Figure 7.
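A hedged sketch of computing PSNR and SSIM with scikit-image on uint8 frames (LPIPS would additionally require the `lpips` package and a pretrained perceptual network, omitted here); the random arrays are placeholders for real frame pairs.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """Return (PSNR in dB, SSIM) for a predicted/ground-truth frame pair."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim

pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # placeholder frames
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate(pred, gt))
```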

Figure 7: Comparison of objective evaluation results for different numbers of time-domain motion vectors. (a) Comparison of PSNR results. (b) Comparison of SSIM results.

In Figure 7(a), on the Creative Flow+ dataset, the model had the highest PSNR (37.51 dB) when the number of time-domain motion vectors was 5. On the AnimeRun simple and difficult datasets, the PSNR was highest with 11 time-domain motion vectors, at 35.83 and 24.62 dB, respectively. In Figure 7(b), on the Creative Flow+, AnimeRun simple, and difficult datasets, the SSIM was highest with 11 time-domain motion vectors, with values of 0.984, 0.981, and 0.862, respectively. Increasing the number of estimated time-domain motion vectors allows animation video information to be searched quickly, thereby improving predicted frame quality. To analyze the impact of convolution kernel size on the predicted frames, the number of time-domain motion vectors is kept at 5. The PSNR and SSIM results for kernel sizes of 1 × 1, 5 × 5, and 11 × 11 on the Creative Flow+, AnimeRun simple, and difficult datasets are shown in Figure 8.

Figure 8: Comparison of objective evaluation results for different convolutional kernel sizes. (a) Comparison of PSNR results. (b) Comparison of SSIM results.

In Figure 8(a), with an 11 × 11 convolution kernel, the PSNR of the predicted animation video frames on the three datasets was 36.82, 37.08, and 37.51 dB, respectively. In Figure 8(b), with an 11 × 11 kernel, the SSIM was 0.984, 0.980, and 0.860, respectively. To study the impact of the ME network module on prediction accuracy, the AnimeRun dataset is selected to train models with and without the ME network module. The PSNR results for the running and walking states of the characters are shown in Figure 9.

Figure 9: Comparison of PSNR results of different models in running and walking states of characters. (a) Running state. (b) Walking state.

In Figure 9(a), after 300 iterations, the PSNR with and without the ME module was 29.40 and 29.30 dB, respectively. In Figure 9(b), after 300 iterations on the walking state of the animated video character, the PSNR with and without the ME module was 26.78 and 26.58 dB, respectively; the ME module thus increased PSNR by 0.2 dB. To verify the impact of learning rate on model training, the total loss during training is compared for learning rates of 0.01 and 0.001, as shown in Figure 10.

Figure 10: Comparison of total loss values at learning rates of (a) 0.01 and (b) 0.001.

In Figure 10(a), when the learning rate was 0.01, the total training loss curve first increased and then decreased, with a highest total loss value of about 26.58, which then decreased by 8.1%. In Figure 10(b), when the learning rate was 0.001, the total loss curve gradually decreased with increasing iterations; at epochs 50–75 the total loss slightly increased, after which it fluctuated between 21.8 and 22.0.

4.2 Performance analysis of the improved algorithm

To verify the performance of the improved algorithm, the results of spatial information transfer and time backtracking (SITB) [26], Bayesian DeNet [27], and ConvGRU algorithms on the Creative Flow+ and AnimeRun datasets are compared, as shown in Table 2.

Table 2: Comparison of the three algorithms on different datasets

| Algorithm type | Creative Flow+ LPIPS | Creative Flow+ SSIM | Creative Flow+ PSNR (dB) | AnimeRun LPIPS | AnimeRun SSIM | AnimeRun PSNR (dB) |
|---|---|---|---|---|---|---|
| Bayesian DeNet | 0.071 | 0.869 | 24.91 | 0.032 | 0.944 | 29.09 |
| SITB | 0.056 | 0.884 | 26.45 | 0.028 | 0.949 | 30.28 |
| ConvGRU | 0.073 | 0.886 | 26.63 | 0.033 | 0.958 | 31.26 |

On the Creative Flow+ dataset, the ConvGRU algorithm achieved the highest SSIM and PSNR among the three algorithms, 0.886 and 26.63 dB, with an LPIPS of 0.073. On the AnimeRun dataset, the SSIM and PSNR of the ConvGRU algorithm were 0.958 and 31.26 dB, respectively, an improvement in performance. The ConvGRU algorithm ran nearly 50% longer than the SITB algorithm but was 13.342 s faster than the Bayesian DeNet algorithm. Experiments are also conducted on a self-made high-resolution animation video dataset, comparing the three algorithms under different motion modes and motion magnitudes. The results are shown in Figure 11.

Figure 11: Comparison of PSNR, SSIM, and LPIPS results of different algorithms on datasets of different levels. (a) Comparison of PSNR results. (b) Comparison of SSIM results. (c) Comparison of LPIPS results.

In Figure 11(a), the PSNR of the improved ConvGRU algorithm on the simple, medium, and difficult datasets was 38.96, 34.05, and 27.90 dB, respectively. In Figure 11(b), the SSIM of the improved ConvGRU algorithm on the simple, medium, and difficult datasets was 0.987, 0.968, and 0.933, respectively; compared with the SITB algorithm, the improved ConvGRU algorithm increased SSIM by 0.005 on the simple dataset. In Figure 11(c), the improved ConvGRU algorithm had the lowest LPIPS on the simple dataset, 0.014. To further analyze the improved algorithm, the objective quality of the model before and after the training improvement is compared on the Creative Flow+, AnimeRun simple, and difficult datasets, as shown in Figure 12.

Figure 12: Comparison of PSNR, SSIM, and LPIPS results between unimproved and improved models on different datasets. (a) Comparison of PSNR results. (b) Comparison of SSIM results. (c) Comparison of LPIPS results.

In Figure 12(a), the PSNR of the improved model on the three datasets was 31.26, 36.63, and 22.15 dB, respectively. The improved model shows the greatest quality gain on the difficult dataset, indicating enhanced prediction accuracy. In Figure 12(b), the SSIM of the improved model on the three datasets was 0.958, 0.886, and 0.813, respectively. In Figure 12(c), the LPIPS of the improved model on the three datasets was 0.033, 0.073, and 0.144, respectively. The actual detection performance of the improved algorithm on the Creative Flow+, AnimeRun, and self-made datasets is shown in Figure 13.

Figure 13: The frequency of frame anomalies on different datasets. (a) Anomalous frequencies on the Creative Flow+ and AnimeRun datasets. (b) Frequency of anomalies on the self-made dataset.

In Figure 13(a), at frames 0–55, the frame sequence frequency of the Creative Flow+ dataset was at a relatively low level, averaging 1.95 frames per second, and the characters in the animated video sample were in a normal walking state. In the AnimeRun dataset, the characters were in a normal walking state before frame 35; after frame 35, the frequency of screen changes remained high. In Figure 13(b), at frame 50 of the self-made dataset, a running scene appeared, with a frequency of 0.28 frames per second. At frame 950, the character details were magnified ten times relative to the previous frame sequence, with a frequency of 0.97 frames per second. The improved algorithm has good detection results and high accuracy. The training, validation, and testing sets are divided in a ratio of 6:2:2, and the improved model is trained for 400 epochs. The model training results are shown in Table 3.

Table 3: Numerical comparison of improved model results

| Set | Metric | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Average |
|---|---|---|---|---|---|---|---|
| Training set | RMSE | 0.0423 | 0.0435 | 0.0447 | 0.0464 | 0.0476 | 0.0449 |
| Training set | Accuracy (%) | 99.72 | 99.72 | 99.69 | 99.67 | 99.64 | 99.69 |
| Validation set | RMSE | 0.0431 | 0.0407 | 0.0390 | 0.0446 | 0.0584 | 0.0452 |
| Validation set | Accuracy (%) | 99.64 | 99.74 | 99.70 | 99.68 | 99.42 | 99.64 |
| Testing set | RMSE | 0.0479 | 0.0483 | 0.0454 | 0.0488 | 0.0487 | 0.0478 |
| Testing set | Accuracy (%) | 99.65 | 99.66 | 99.59 | 99.50 | 99.59 | 99.60 |

The average accuracy of the improved model on the training, validation, and testing sets was 99.69%, 99.64%, and 99.60%, respectively. The training results are good, with an average prediction accuracy of 99.64% on the validation set. The improved model achieves excellent performance in both prediction accuracy and visual quality.

5 Conclusion

The study combined ConvGRU with the synthesis flow algorithm for animation video frame prediction, effectively capturing temporal correlation and spatial information and achieving accurate, realistic frame synthesis. The results showed that the improved model reached its highest PSNR of 37.51 dB on the Creative Flow+ dataset when the number of time-domain motion vectors was 5. After 300 iterations, the PSNR of the running state of the animated video characters gradually increased; the PSNR with and without the ME module was 29.40 and 29.30 dB, respectively. Between iterations 128 and 142, there was a decrease in PSNR with the ME module, possibly due to errors in the dataset sample videos. When the learning rate was 0.001, the total loss curve gradually decreased with increasing iterations. When the learning rate was 0.01, the total training loss curve first increased and then decreased, with a highest total loss value of about 26.58, and the model accuracy decreased by 8.1%. On the Creative Flow+ dataset, the LPIPS, SSIM, and PSNR of the ConvGRU algorithm were 0.073, 0.886, and 26.63 dB, respectively. On the AnimeRun dataset, the ConvGRU algorithm improved performance, with an SSIM of 0.958 and a PSNR of 31.26 dB. Compared with existing methods, the improved method has superior performance and higher prediction accuracy. Accurate frame prediction can be used in animation production and video game development to reduce the workload of animators, improve production efficiency, and enhance the visual effects of animations and games. However, the model is sensitive to the ME parameters and the hyper-parameters of ConvGRU, and improper parameter settings may lead to significant performance degradation. In the future, methods such as grid search, random search, or Bayesian optimization can be used to determine the optimal parameter combination and further improve the stability of model performance.

  1. Funding information: None.

  2. Author contributions: Xue Duan conducted the data collection and analysis and wrote the manuscript.

  3. Conflict of interest: The author declares no conflict of interest.

  4. Data availability statement: The datasets analyzed in this study can be obtained from the Creative Flow+ and AnimeRun repositories.

References

[1] Li Y, Wang J, Sun X, Li Z, Liu M, Gui G. Smoothing-aided support vector machine based nonstationary video traffic prediction towards B5G networks. IEEE Trans Veh Technol. 2020;69(7):7493–502. doi:10.1109/TVT.2020.2993262.

[2] Nsugbe E. Toward a self-supervised architecture for semen quality prediction using environmental and lifestyle factors. Artif Intell Appl. 2023;1(1):35–42. doi:10.47852/bonviewAIA2202303.

[3] Zhao S, Zhao L. Forecasting long-term electric power demand by linear semiparametric regression. AIEM. 2022;11(1):29–31.

[4] Hilda. Make an impression in 60 seconds: Video on social media, the main weapon of professional marketers – Trends 2023. Acta Inform Malays. 2024;8(1):52–5. doi:10.26480/aim.02.2024.56.59.

[5] Xu H, Liu W, Xing W, Wei X. Motion-aware future frame prediction for video anomaly detection based on saliency perception. Signal Image Video Process. 2022;16(8):2121–9. doi:10.1007/s11760-022-02174-7.

[6] Hassan MA, Khan MUG, Iqbal R, Riaz O, Bashir AK, Tariq U. Predicting human's future motion trajectories in video streams using generative adversarial network. Multimed Tools Appl. 2024;83(5):15289–311. doi:10.1007/s11042-021-11457-z.

[7] Aslam N, Kolekar MH. DeMAAE: deep multiplicative attention-based autoencoder for identification of peculiarities in video sequences. Vis Comput. 2024;40(3):1729–43. doi:10.1007/s00371-023-02882-2.

[8] RS SN. Video morphing attack detection using convolutional neural networks on deep fake detection algorithm. Educ Admin Theory Pract. 2024;30(5):3589–603.

[9] Sreeja MU, Kovoor BC. An aggregated deep convolutional recurrent model for event-based surveillance video summarisation: a supervised approach. IET Comput Vis. 2021;15(4):297–311. doi:10.1049/cvi2.12044.

[10] Afrasiabi M, Mohammadi M, Rastegar M, Kargarian A. Multi-agent microgrid energy management based on deep learning forecaster. Energy. 2019;186:115873. doi:10.1016/j.energy.2019.115873.

[11] Xu G, Xian D, Fournier-Viger P, Li X, Ye Y, Hu X. AM-ConvGRU: a spatio-temporal model for typhoon path prediction. Neural Comput Appl. 2022;34(8):5905–21. doi:10.1007/s00521-021-06724-x.

[12] Zhang Y, Zhang L, He Y, Yao S, Yang W, Cao S, et al. Analysis of the future trends of typical mountain glacier movements along the Sichuan-Tibet Railway based on ConvGRU network. Int J Digit Earth. 2023;16(1):762–80. doi:10.1080/17538947.2022.2152884.

[13] Tian L, Li X, Ye Y, Xie P, Li Y. A generative adversarial gated recurrent unit model for precipitation nowcasting. IEEE Geosci Remote Sens Lett. 2019;17(4):601–5. doi:10.1109/LGRS.2019.2926776.

[14] Li Y, Hou G, Zhang X. Deep convolutional LSTM for dynamic facial expression recognition. IEEE Trans Cybern. 2019;49(6):2150–63.

[15] Li T, Wang X, Liu H, Zhang L. Video prediction with appearance and motion conditions. Proc IEEE Conf Comput Vis Pattern Recognit (CVPR). 2019;6(2):11341–50.

[16] Kalchbrenner N, Oord A, Simonyan K, Danihelka I, Vinyals O, Graves A, et al. Video pixel networks. In: Proceedings of the International Conference on Machine Learning (ICML); 2016. p. 1771–9.

[17] Liu Z, He Q, Peng Z. Interactive visual simulation modeling for structural response prediction and damage detection. IEEE Trans Ind Electron. 2021;69(1):868–78. doi:10.1109/TIE.2021.3050365.

[18] Bao W, Lai WS, Zhang X, Gao Z, Yang MH. MEMC-Net: motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans Pattern Anal Mach Intell. 2019;43(3):933–48. doi:10.1109/TPAMI.2019.2941941.

[19] Gogoi S, Peesapati R. Design and implementation of an efficient multi-pattern motion estimation search algorithm for HEVC/H.265. IEEE Trans Consum Electron. 2021;67(4):319–28. doi:10.1109/TCE.2021.3126670.

[20] Li T, Zhang M, Qi W, Asma E, Qi J. Deep learning-based joint PET image reconstruction and motion estimation. IEEE Trans Med Imaging. 2021;41(5):1230–41. doi:10.1109/TMI.2021.3136553.

[21] Mohseni M, Santhanam S, Williams J, Thakker A, Nataraj C. Systematic fatigue spectrum editing by fast wavelet transform and genetic algorithm. Fatigue Fract Eng Mater Struct. 2022;45(1):69–83. doi:10.1111/ffe.13583.

[22] Zhan Y, Guan J, Zhao Y. An adaptive second-order sliding-mode observer for permanent magnet synchronous motor with an improved phase-locked loop structure considering speed reverse. Trans Inst Meas Control. 2020;42(5):1008–21. doi:10.1177/0142331219880712.

[23] Wang ZJ, Turko R, Shaikh O, Park H, Das N, Hohman F, et al. CNN explainer: learning convolutional neural networks with interactive visualization. IEEE Trans Vis Comput Graph. 2020;27(2):1396–406. doi:10.1109/TVCG.2020.3030418.

[24] Mourot L, Hoyet L, Le Clerc F, Schnitzler F, Hellier P. A survey on deep learning for skeleton-based human animation. Comput Graph Forum. 2022;41(1):122–57. doi:10.1111/cgf.14426.

[25] Yan B, Wang L, Zhang Y, Zhang Y, Zhang L. Video prediction with convolutional LSTM networks using temporal relation networks. IEEE Trans Circuits Syst Video Technol. 2020;30(2):450–63.

[26] Yuan P, Guan Y, Huang J. Video prediction based on spatial information transfer and time backtracking. Signal Image Video Process. 2022;16(3):825–33. doi:10.1007/s11760-021-02023-z.

[27] Yang X, Gao Y, Luo H, Liao C, Cheng KT. Bayesian DeNet: monocular depth prediction and frame-wise fusion with synchronized uncertainty. IEEE Trans Multimed. 2019;21(11):2701–13. doi:10.1109/TMM.2019.2912121.

Received: 2024-08-16
Revised: 2024-10-10
Accepted: 2024-10-21
Published Online: 2025-02-10

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
