A gap-filling algorithm selection strategy for GRACE and GRACE Follow-On time series based on hydrological signal characteristics of the individual river basins

Hamed Karimi; Siavash Iran-Pour; Alireza Amiri-Simkooei; Masoud Babadi

doi:10.1515/jogs-2022-0129

Enjoy 40% off

academic books on De Gruyter Brill *

Article Open Access

A gap-filling algorithm selection strategy for GRACE and GRACE Follow-On time series based on hydrological signal characteristics of the individual river basins

Hamed Karimi , Siavash Iran-Pour , Alireza Amiri-Simkooei and Masoud Babadi

Published/Copyright: March 20, 2023

Published by

Become an author with De Gruyter Brill

Author Information

From the journal Journal of Geodetic Science Volume 13 Issue 1

Abstract

Gravity recovery and climate experiment (GRACE) and GRACE Follow-On (GRACE-FO) are Earth’s gravity satellite missions with hydrological monitoring applications. However, caused by measuring instrumental problems, there are several temporal missing values in the dataset of the two missions where a long gap between the mission dataset also exists. Recent studies utilized different gap-filling methodologies to fill those data gaps. In this article, we employ a variety of singular spectrum analysis (SSA) algorithms as well as the least squares-harmonic estimation (LS-HE) approach for the data gap-filling. These methods are implemented on six hydrological basins, where the performance of the algorithms is validated for different artificial gap scenarios. Our results indicate that each hydrological basin has its special behaviour. LS-HE outperforms the other algorithms in half of the basins, whereas in the other half, SSA provides a better performance. This highlights the importance of different factors affecting the deterministic signals and stochastic characteristics of climatological time series. To fill the missing values of such time series, it is therefore required to investigate the time series behaviour on their time-invariant and time-varying characteristics before processing the series.

Keywords: Gap-Filling; GRACE Follow-On; Gravity Recovery And Climate Experiment (GRACE); Least Squares-Harmonic Estimation (LS-HE); Singular Spectrum Analysis (SSA)

1 Introduction

Gravity Recovery And Climate Experiment (GRACE), as a joint mission between the National Aeronautics and Space Administration (NASA) and Deutsches Zentrum für Luft- und Raumfahrt (DLR), was launched in March 2002 and ended its science mission in October 2017. The GRACE satellite mission aimed to monitor time-variable gravity field of the Earth (Tapley et al. 2004). The mission measured Earth’s mass changes due to hydro-climatological variations such as groundwater storage changes (Syed et al. 2008, Landerer and Swenson 2012, Feng et al. 2013), ice melting (Chambers et al. 2007, Chen et al. 2007, Nahavandchi et al. 2015), Glacial Isostatic Adjustment (Peltier 2004, Chambers et al. 2010, Wahr and Zhong 2013), etc. After decommission of GRACE, the GRACE Follow-On (GRACE-FO) mission, as a collaboration of NASA and GeoForschungsZentrum (GFZ), was launched on 22 May 2018. The GRACE-FO satellite mission is similar to its predecessor GRACE, though employed more advanced measuring instrumentations such as laser interferometry.

The main temporal resolution of GRACE and GRACE-FO in time-variable gravity field recovery (level-2 and level-3 dataset) is 30 days. However, because of several instrumental problems in both missions, the dataset suffers from missing values. In particular, there is about one-year data gap between GRACE and GRACE-FO. This temporal gap starts from July 2017 and ends in May 2018. Several recent research works have tried to fill the missing values in the GRACE and GRACE-FO time series and the long gap between them.

Rietbroek et al. (2014) used GNSS-derived loading inversion technique to bridge the gaps within GRACE satellite mission data series, where three methods were employed: (1) GPS-I, that is based on GNSS inverting data and additionally constrained over the ocean area by a new regularization method, (2) GPS-C, which is a constrained GNSS solution where additional correction parameters were applied to GNSS data and (3) using seasonal gap-filler method by fitting a trend and a seasonal signal to the data series to bridge the gaps. In evaluation of these methods, the third approach outperformed the others, while GPS-C showed a lower Root Mean Square Error (RMSE) compared to GPS-I. Li et al. (2020) implemented different data-driven techniques to reconstruct (1992-2002) and predict (2017- 2018) GRACE-like gridded Total Water Storage Changes (TWSC) using the climate inputs. They used the combination of data-driven methods, such as Principal Component Analysis (PCA) (Wold et al. 1987), Independent Component Analysis (ICA), Least Squares (LS) (Durbin and Watson 1992), Seasonal and Trend decomposition using Loess (STL), Artificial Neural Network (ANN), autoregressive exogenous (ARX), Multiple Linear Regression (MLR) and additional dataset, such as precipitation, land surface temperature, climate indices, and Sea Surface Temperature (SST) data. In that work, the authors found that ARX and ANN methods work better than MLR algorithm in simulation of the target variables, although because of overfitting problems, they are not robust enough for prediction. Furthermore, they indicated that PCA-LS-MLR is the most robust combination method for prediction and reconstruction of GRACE-like TWSC over all river basins.

Several studies employed other existing satellite missions for the recovery of time-variable gravity field. Among those works, Satellite Laser Ranging (SLR) geodetic observations provide lower degrees of spherical harmonic coefficients of the Earth’s gravity field. Sośnica et al. (2015) recovered time-variable Earth gravity field through SLR observations through up to nine geodetic satellites: LAGEOS-1, LAGEOS-2, Starlette, Stella, AJISAI, LARES, LARES, Larets, BLITS, and Beacon-C. In their research, they showed that low-degree coefficients can be derived from SLR observations which carry good information about large-scale mass transport in the Earth system.

Low Earth Orbiter (LEO) satellites, equipped with an accelerometer, star camera tracker, and geodetic GNSS receiver, can be employed to recover time-variable gravity field. The European Space Agency Earth Explorer mission Swarm is one of these constellations. Swarm Satellite mission consists of three satellites. A pair of them (Swarm-Alpha and Swarm-Charlie) fly side-by-side in a circular, near-polar orbit with an initial altitude and inclination of 450 km and 87.4. The separation between the satellites is around 1–1.5 in the direction of east-west, in longitude (equivalent to approximately 10 sec delay in orbit). The higher satellite (Swarm-Bravo) flies in a circular and near-polar orbit as well, where its altitude and inclination are respectively 530 km and 86.8, and its right ascension of ascending node is close to that of the two other satellites (Friis-Christensen et al. 2008). Jäggi et al. (2016) used the Celestial Mechanics Approach to recover the time variable and static gravity field from Swarm kinematic orbit. They analysed the effect of severe ionosphere activity on precise orbit determination of the Swarm satellite mission. Moreover, they compared their results with the gravity field recovered from GRACE where they concluded that the Swarm satellite mission can recover long-wavelength time-variable gravity signals (up to degree 20).

Lück et al. (2018) used three methods to fill several types of artificial gaps of single, 3 and 18 months. Those methods included the following: (1) interpolating existing monthly GRACE solutions, (2) using monthly Swarm solutions, (3) using the CTAS (directly estimating constant, trend, and annual and semi-annual signal terms) Swarm solution. In their research, in most cases, the third method outperformed the first and second approaches, although it depended on the length of the gap. In the case of single-month or three-month gaps, the first method showed better performance, while the third approach outperformed the others in 18-month gap scenario. It is of important to mention that several previous studies indicated that Swarm has the capability to recover time-variable gravity field of the Earth at lower spherical harmonic degrees (up to d/o 12), i.e., coarser spatial resolution with more noise. Their results showed that trend and annual variations and seasonal components of mass changes can be retrieved from the kinematic orbit of the Swarm mission. However, in this way, there is no benefit to use accelerometer data and kinematic baseline from Swarm satellite missions in gravity inversion; therefore, only GPS-derived kinematic orbit has been utilized for the Earth’s gravity recovery. It should be mentioned here that the GPS antenna on Swarm spacecraft is 8-channel antenna and cannot collect GPS data more than a particular amount; this is an important problem and limitation in precise orbit determination of Swarm satellite mission, especially for the kinematic orbit of the Swarm satellites and consequently for the gravity recovery from the kinematic orbit (ESA 2014, Zangerl 2014, Jose van den IJssel 2015, van den IJssel 2016). Therefore, the gravity field derived from Swarm kinematic orbit cannot substitute the gravity field derived from GRACE and GRACE-FO satellite missions. In fact, because of Swarm coarse spatial resolution and its high level of noise in the lower level data (kinematic orbits), it is preferable to seek a robust mathematical method to bridge the missing values in GRACE and GRACE-FO time series and the long gap between them. For more details, see Bezděk et al. (2016) and Da Encarnação et al. (2016, 2019, 2020).

Forootan et al. (2020) used ICA method to fill the gaps between GRACE and GRACE-FO and the long gap between them. In their study, using ICA, they merged GRACE TWSC with the low spatial resolution TWSC from Swarm, where the aim was to bridge the gaps and missing values and improve the spatial resolution of Swarm TWSC. They employed two approaches in their study, where the first approach was based on the reconstruction of Swarm dataset in the spatial domain and the second approach implemented the reconstruction in the temporal domain. Their results showed that the gap-filling by use of the ICA-based method and merging GRACE and Swarm gravity field dataset has a good performance. In their evaluation, the second approach in the temporal domain outperformed the first approach in the spatial domain.

Yi and Sneeuw (2021) employed SSA to fill the missing values in GRACE and GRACE-FO dataset and the long gap between them. Their work made use of a methodology based on a gap-filling method through SSA in an iteration. This approach is the so-called zero-filling SSA which was proposed by Kondrashov and Ghil (2006). In their research study, they used the dataset of GRACE and GRACE-FO in the spectral domain, i.e., spherical harmonic coefficients of the time variable gravity field for the global solution.

There are several techniques to use SSA to forecast and fill the missing values in time series. These techniques are based on the reconstruction of time series from the original signals, regardless of the noise (Golyandina and Zhigljavsky 2013). However, Khazraei and Amiri-Simkooei (2019) proposed Monte Carlo SSA algorithm, introduced by Allen and Smith (1996), to differentiate between the oscillation parts of a time series including noise through decomposition. The algorithm employs Least Squares-Variance Component Estimation (LS-VCE) to estimate noise model parameters and generates surrogate data using Cholesky decomposition.

Rodrigues and De Carvalho (2013) provided a recurrent imputation method (RIM) based on a weighted average between forecast and hindcast, to bridge time series gaps and missing values. Mahmoudvand and Rodrigues (2016) introduced a new algorithm for gap-filling of a time series, where the method was based on RIM, although it uses bootstrap resampling and a given weighting scheme by sample variances. These two estimates are combined to produce a unique estimation for missing values. Least Squares-Harmonic Estimation (LS-HE) is, on the other hand, a parametric and stationary method for time series analysis, developed by Amiri-Simkooei (2007) and Amiri-Simkooei et al. (2007). This method as a generalized form of the classic Least-Squares Spectral Analysis (LSSA) considers a general linear trend and also the stochastic model of the time series. Moreover, Amiri-Simkooei and Asgari (2012) developed the multivariate LS-HE (MLS-HE), where they implemented their method on time series of total electron content (TEC) in the ionosphere. They detected several significant signals in the time series of TEC, such as diurnal and annual signals with periods of 24/n hours and 365.25/n days (n = 1, 2 …), along with their higher harmonics and their modulated variants.

In summary, several recent studies focused on using external dataset and different methodologies to fill the missing values and long gaps in time-variable gravity field of GRACE and GRACE-FO satellite missions. Our study, however, aims to evaluate and compare the performance of three main methods in gap-filling of the missing values in GRACE and GRACE-FO time series and prediction of the long gap between the two missions. Our focus in this research work is on a few hydrological basins where the special behaviours of each basin may play an important role in the gap-filling procedure. To this end, we assign six scenarios of artificial gaps. The algorithms are implemented in five large water catchments and one sub-catchment. The case studies are selected from various regions of the world that have different hydrological behaviours to reflect the essential application of different techniques for data gap-filling in hydrological dataset, e.g. GRACE and GRACE-FO time series. For this purpose, as it was mentioned earlier, we choose six case studies that are (i) Amazon basin, (ii) Congo river basin, (iii) Gavkhouni (or Gavkhuni) sub-basin, (iv) Aravalli basin, (v) Rhine river basins, and (vi) Tigris river basin. To have more information about the selected study areas, the reader is referred to subsection 3.2. The performance of the aforementioned methods on each artificial gaps is assessed in this research, where a kind of inter-validation is conducted for filling the real gaps. In the Amazon catchment, an additional dataset of the Swarm time variable gravity field is also employed for validation.

The main contribution and novelty of this study are to use the signal characteristics of each individual catchment learnt from the gap-filling of the artificial gaps, to fill the original gaps in the time-series. In this way, it is expected that the hydrological characteristics of each case study are taken into account for the data gap-filling of that specific area which influences the selection procedure of the gap-filling algorithm. We believe that our research work can contribute to any other research field where the characteristics of the signals can be employed for prediction and gap-filling of the missing values in a time series.

This article is organized as follows. In Section 2, we present the methodologies employed for the data gap-filling of our research. Section 3 introduces data and study areas of the work, while the results are discussed in Section 4. Finally, Section 5 summarizes and concludes the research.

2 Methodology

Three different gap-filling methods are introduced in this section: (i) RIM-SSA, (ii) zero-filling SSA, and (iii) LS-HE.

First, SSA is introduced as a general method for time series analysis. Then, the recurrent singular-spectrum procedure is presented as a time-series gap-filling approach. Afterward, we describe an iterative zero-filling SSA algorithm. At the end, LS-HE mathematical formulation for filling the time series missing values is discussed.

2.1 Singular Spectrum Analysis (SSA)

SSA is a data-driven non-stationary method for signal processing and time series analysis. SSA and PCA are both based on Karhunen–Loeve theory of random fields and of stationary random processes, where SSA uses Singular Value Decomposition (SVD) of a Hankel matrix, constructed from the main time series (Ghil et al. 2002). This method can be implemented on time series to decompose it in temporal patterns such as trend, oscillatory signals, and random noise. In the Hankel matrix, the diagonal elements equal together, i.e. i + j = const → a _ij = const, and are constructed from the time series in the embedding step. After embedding and applying SVD on the Hankel matrix, the modes are grouped, extracted from eigentriples into interpretable signals such as the annual and seasonal signals and random noise. Eventually, the time series is reconstructed from the grouped modes by diagonal averaging of the Hankel matrix. Here, we describe SSA in two steps: decomposition and reconstruction (Golyandina and Zhigljavsky 2013).

Consider a time series y = (y ₁, … y _m)^Tby the length of m, which m > 2 and y is a non-zero series, that is, there is at least one i such that y _i ≠ 0. Let L(1 < L < N) be an integer number as the length of the window and K = m − L + 1 (the number of lagged vectors).

2.1.1 First stage: Decomposition

Step 1: Embedding

At the start, the time series should be arranged in a Hankel matrix form, where the original time series is mapped into a sequence of lagged vectors. Consider matrix X as a Hankel matrix, constructed from the main time series. This matrix would be formed as follows:

(1) Y i = ( y i , … , y i + L − 1 ) T , ( 1 ≤ i ≤ K ) ,

where Y _i are so-called lagged vectors. There are K lagged vectors in the Hankel matrix. The length of each lagged vectors equals L. The L-trajectory matrix, constructed from the original time series is

(2) X = [ X 1 : ⋯ : X K ] = ( y i j ) i , j = 1 L , K = y 1 y 2 y 3 ⋯ y K y 2 y 3 y 4 ⋯ y K + 1 y 3 y 4 y 5 ⋯ y K + 2 ⋮ ⋮ ⋮ ⋱ ⋮ y L y L + 1 y L + 2 ⋯ y N .

The lagged vectors are the columns of the trajectory matrix. Each rows and columns are a subset of the original time series (Golyandina and Zhigljavsky 2013).

Step 2: SVD

This step deals with the implementation of SVD on the trajectory matrix, where the matrix is decomposed to the left and right patterns and singular values. The implementation of SVD on matrix X results in:

(3) X = US V T .

The previous equation consists of U , the left pattern, V the right pattern, and S , the singular values including the square roots of eigenvalues of S = XX ^T, i.e. λ ₁, … λ _L. The aforementioned decomposition can be written as follows:

(4) X = X 1 + ⋯ + X d ,

where X i = λ i U i V i T , V i = X T U i / λ i and d is the rank of the trajectory matrix X . The matrices X _i have rank 1; such matrices are so-called elementary matrices. The collection of ( λ i , U _i, V _i) is called ith eigentriple of the SVD of X .

2.1.2 Second stage: reconstruction

Step 3: Eigentriple grouping

In this step, the power of each mode is calculated as:

(5) P i = λ i ∑ i = 1 d λ i × 100 ( % ) .

In this way, each mode is associated to its corresponding eigentriple with its calculated power. The time series can then be divided into d groups, where each of these groups has the power of P _i, i ∈ {1, …, d}. Now, considering the modes up to a special power r < d, it is possible to reconstruct the original time series. Here, one can interpret that the individual modes greater than a specific power value are taken as signals, and therefore, the remaining modes are noise and have no contribution to the reconstructed time series. Therefore, the reconstructed time series gets the following form:

(6) X ˜ = X 1 + ⋯ + X r .

The aforementioned grouping is called elementary grouping.

Step 4: Diagonal averaging

From the nature of Hankelized trajectory matrices related to each eigentriples, a special procedure should be applied to a Hankel matrix to extract the modes or reconstruct the time series from a trajectory matrix. This is called diagonal averaging. Using this method, the trajectory matrix is transformed into a time series. Diagonal averaging operator is defined as follows (Golyandina and Zhigljavsky 2013, Rodrigues and De Carvalho 2013):

(7) D ( X ) = ∑ w = 2 K h w − 1 ∑ ( i , j ) ∈ A W y i , j | A W | ,

where { h l } l = 1 m denotes the canonical basis of ℝ^m. If X = [y _i,j] ∈ M _L×K, then the mapping operator D (.) is defined as D (.):M _L×K → ℝ^m. Here |. | stands for cardinal operator, and

(8) A w = { ( i , j ) | i + j = w } .

Hence, we can decompose the time series to d modes, related to each eigentriple or reconstruct the time series from the first to the rth mode. For more information, the reader refers to Golyandina and Zhigljavsky (2013), Rodrigues and De Carvalho (2013).

2.2 Recurrent singular-spectrum-based procedure

2.2.1 Recurrent forecast method

To fill the gaps in a time series, different methods to predict the missing values are being introduced. First of all, a singular-spectrum-based forecasting method is explained, and in the next step a RIM is provided.

Let y _i as an out-of-sample data forecast, as a linear combination of l − 1 preceding observations:

(9) y i = a i − 1 y i − 1 + ⋯ + a i − ( l − 1 ) y i − ( l − 1 ) , i ≥ 1 ,

for a suitable choice of coefficients a = (a ₁, …, a _i−1)^T. To choose a suitable set of the aforementioned coefficients, consider the matrix constituted by eigenvectors of XX ^T, suitably subdivided as follows:

(10) U = U 1 | U 2 − − − | − − − u 1 | u 2 ,

where U ₁ consists of the first L − 1 components of eigenvectors corresponding to the signal, and u ₁ contains the last components of those eigenvectors. Similarly, U ₂ and u ₂ are defined, although they correspond to the remainder d − r components associated with the noise. According to Golyandina and Osipov (2007), the above-mentioned coefficients are computed as follows:

(11) a = M [ U 1 ⊙ ( u 1 ⊗ l L − 1 ) ] l r 1 − | | u 1 | | 2 ,

where ‖.‖ denotes the Euclidian norm, ⊙ and ⊗ respectively denote component-wise Hadamard and the Kronocker products, and

(12) M = 0 0 ⋯ 0 1 0 0 ⋯ 1 0 ⋮ ⋮ ⋱ ⋮ ⋮ 0 1 ⋯ 0 0 1 0 ⋯ 0 0 .

Based on this algorithm, the one-step-ahead forecast of out-of-sample data series can be written as a linear combination of l − 1 reconstructed values of the time series:

(13) y → m + 1 = ∑ i = 1 L − 1 a i y ˜ m − i ,

where the coefficients are taken from equation (11) and the tildes are used to denote reconstructed time series from the original and oscillatory signals. In general, for further forecasts of out-of-sample values, the following algorithm is proposed:

(14) y → m + 2 = a 1 y → m + 1 + ∑ i = 2 L − 1 a i y ˜ m − i , ⋮ y → m + ( L − 1 ) = ∑ i = 1 L − 1 a i y → m − i ,

for 2, …, L − 1 steps-ahead, respectively. In this article, the notation y → implies the forecast values.

2.2.2 RIM

In this subsection, a recurrent method to fill the missing values in a time series of interest is introduced. This method is based on a weighted average between forecast and hindcast, where the forecast is performed according to subsection 2.2.1. Furthermore, the hindcast values are computed analogously, although in the reverse order of the time series. That means, hindcast is in fact a forecast, but in reverse order (Rodrigues and De Carvalho 2013).

Consider the time series y = (y ₁, …, y _m)^T with a block of k sequential missing values. Let the decomposition of the time series as the following equation to known and missing values:

(15) y = y n y k y n ⁎ ,

where n* = m − n − k. y _n is the n-vector of known values before the missing values, y _k is a k-vector of missing values and y _n* is the n*-vector of known values after missing values. By prediction and estimation of missing values, the filled time series will be as follows:

(16) y ˆ = y n y ˆ k y n ⁎ ,

with y ˆ k as the estimation of missing values. For the purpose of prediction, at the first step, SSA is applied on the first block and the forecast values are computed based on 2.2.1. Simultaneously, the hindcast is performed according to a forecast from the third block in the reverse order of time. The procedure is done by multiplication y _n* to an operator similar to (12). The forecast and hindcast values are defined as:

(17) y → k = ( y → 1 , … , y → k ) T ,

(18) y ← k = ( y ← 1 , … , y ← k ) T ,

where y → k represents the forecast and y ← k stands for the hindcast values. The estimation values for missing values y ˆ k would be then a weighted average between y → k and y ← k . The “RIM” intends to establish an equilibrium between forecast and hindcast derived from recurrent forecast method. Especially, this method employs the following formula to estimate k missing values in the time series:

(19) y ˆ k = θ ⊙ y → k + ( l k − θ ) ⊙ y ← k ,

where

(20) θ = ( θ 1 , … , θ k ) T ,

denotes the weighting coefficients related to the forecasts achieved by recurrent forecast algorithm. For more details about this method, the readers refer to Rodrigues and De Carvalho (2013).

In this contribution, we use several weighting schemes. In the next subsection, these weighting strategies are described in more detail.

2.2.2.1 Linear weighting scheme

Linear weighting strategy is based on the distribution of the missing values in a time series. In fact, the method makes use of the closeness of missing values to the dataset for forecast or hindcast; i.e. if a missing value is closer to the forecast value, the weighting number for forecast is greater than the weighting value for hindcast and vice versa. The weighting function is defined as (Rodrigues and De Carvalho 2013, Mahmoudvand and Rodrigues 2016):

(21) θ i = 1 − i i + 1 , i = 1 , … , k .

2.2.2.2 Modified weighting strategy

In some cases, the number of data points of a time series used to forecast does not equal the number used for hindcast. In this way, the number of data points to forecast or hindcast would affect the prediction, and therefore, the linear weighting scheme is not appropriate. In such cases, Rodrigues and De Carvalho (2013) suggested a new weighting scheme to address this problem. This weighting strategy makes use of:

(22) θ i = ( k + 1 − i ) n ( k + 1 − i ) n + i n ⁎ .

Or in vector notation:

(23) θ = k n k n + n ⁎ , ⋯ , n n + n ⁎ T .

In a special case, where the number of observations used to forecast and hindcast is equal (n = n*), the equation above becomes the linear weighting scheme of equation (21).

2.2.2.3 Stochastic weighting

In this part, we introduce an innovative stochastic weighting scheme for the data gap-filling by SSA. If we have some information about the stochastic properties of time series, it is possible to define a weight function based on the variance of the data used to forecast or hindcast. For this purpose, at first, the variance of forecast and hindcast should be derived according to error propagation law. What it means is based on recurrent forecast (or hindcast) method. In this method, if l is considered as the window size of the trajectory matrix in SSA, l − 1 number of coefficients is calculated and multiplied into the l − 1 data series before the missing value (in the case of forecast). Therefore, it is easy to use error propagation law to calculate the variance of the forecast value (and also the hindcast value in the same way). In order to reach this goal, consider y → as a forecast of a missing value, a as the vector of coefficients multiplied into the l − 1 data points before missing values, and y as the (l − 1) × 1 vector of l − 1 of those data points. Therefore, we have:

(24) y = ( y 1 , … , y l − 1 ) T ,

(25) a = ( a 1 , … , a l − 1 ) T ,

(26) y → = a T y ,

(27) Q y = diag ( σ y 1 2 , … , σ y l − 1 2 ) ,

where Q _y is the covariance matrix of the vector y. Therefore, according to error propagation law, we will have:

(28) σ y → 2 = a T Q y a .

Note that y → in equation (26) is not a vector of a few missing values, but it is a scalar (one value), i.e. a forecast for one missing value. In a recurrent procedure described in subsection 2.2.1, the missing values are filled one by one, where at the end, the recurrent procedure generates the vector of forecast. Such a procedure is also performed to create a vector of hindcast. Finally, the weighted average of the forecast and hindcast is taken for the purpose of gap-filling. It should be noted that the variance in equation (28) is for one missing value as a scalar and not a vector.

From the discussion earlier, it is possible to calculate the variance of each forecast of the missing values. Consider y _f as the n-vector of forecasts for missing values and y _h as the n-vector of hindcasts (in one gap). Now, using the covariance matrix of y _f, y _h we can take a Best Linear Unbiased Estimation (BLUE) of the missing values. Let Q _yf and Q _yh be the covariance matrices of y _f and y _h, their diagonal elements are computed by using the error propagation law, equation (28). Therefore, we have:

(29) y f = ( y f 1 , … , y f n ) T ,

(30) y h = ( y h 1 , … , y h n ) T ,

(31) Q y f = diag ( σ y f 1 2 , … , σ y f n 2 ) ,

(32) Q y h = diag ( σ y h 1 2 , … , σ y h n 2 ) .

The model of observation equations are:

(33) E { y f } = y ˜ ; D { y f } = Q y f ,

(34) E { y h } = y ˜ ; D { y h } = Q y h ,

where E{.} and D{.} are expectation and dispersion operators, respectively. By a combination of (33) and (34), the Gauss–Markov model will be as follows:

(35) E y f y h = I n I n y ˜ ; D y f y h = Q y f 0 0 Q y h .

Based on the Least-Squares theory, we have

(36) x ˆ = ( A T Q y − 1 A ) − 1 A T Q y − 1 y .

Then, y ˜ ˆ is estimated

(37) y ˜ ˆ = ( I n I n ) Q y f − 1 0 0 Q h f − 1 I n I n − 1 ( I n I n ) Q y f − 1 0 0 Q y h − 1 y f y h = ( Q y f − 1 + Q y h − 1 ) − 1 ( Q y f − 1 y f + Q y h − 1 y h ) .

As we can see, equation (37) is a weighted average. This equation is used to predict the missing values by use of stochastic weighting.

2.2.2.4 Linear-stochastic weighting

In this subsection, an innovative combination of linear weighting methods based on data distribution in a time series with stochastic properties of missing values is discussed. The approach is utilized to have a combined weighted average between forecast and hindcast, where the weighted average is according to equation (37), but the covariance matrices are different as follows:

1 Linear part

(38) Q 1 f = diag 1 n + 1 − 1 , … , 1 n + 1 − i , … , 1 n + 1 − n ,

(39) Q 1 h = diag 1 1 , … , 1 i , … , 1 n ,

2 Stochastic part

(40) Q y 2 f = diag ( σ y f 1 2 , … , σ y f n 2 ) ,

(41) Q y 2 h = diag ( σ y h 1 2 , … , σ y h n 2 ) .

For the final covariance matrix, we have

(42) Q f = Q 1 f ⊙ Q 2 f ,

(43) Q h = Q 1 h ⊙ Q 2 h .

Using the covariance matrices for forecast and hindcast according to equation (42) and (43), the LS estimation of y ˜ is then computed based on LS theory and equation (37).

2.3 Zero-filling SSA

Here, we employ a gap-filling algorithm based on an iterative SSA procedure. The zero-filling SSA method has been introduced by Kondrashov and Ghil (2006) and was later employed by Zscheischler and Mahecha (2014). The method provided in these studies is based on setting zero or mean values of time series for the missing values. In the literature mentioned above, an iterative gap-filling procedure was proposed to fill the missing values in a time series. At first, the missing values are replaced by zero (or mean values of the time series). In the second step, in the inner loop, SSA is implemented on the zero-filled time series. After the reconstruction of the time series, extracted from SSA, the missing values (zero values) are then replaced by the values from the first mode. In this loop, the temporal correlation (in a univariate time series) or spatiotemporal correlation (in a spatiotemporal dataset) is utilized to validate the gap-filling procedure where a threshold for the loop is introduced.

In this study, 2.5% of RMSE is used as a criterion. That means when the new RMSE is smaller than 2.5% of the previous RMSE, the inner loop is stopped. After replacing the missing values, SSA is applied to the new time series and the loop continues until it reaches the threshold. The outer loop aims to separate the signal from the noise. After using the first mode in the inner iteration and filling the gaps, the reconstruction of the data is now continued by using the second mode added to the reconstructed missing values from the first mode. In the next iteration of the outer loop, another mode is added to reconstruct the time series. The added modes may represent the noise; therefore, the procedure may increase the RMSE value. If RMSE increases, we stop the outer iteration. For more details, you can refer to the study by Kondrashov et al. (2010).

2.4 LS-HE

LS-HE is a parametric stationary method based on the LS principles and Fourier expansion of a time series, developed by Amiri-Simkooei (2007). In this way, a Gauss-Markov model is considered for the time series, which consists of a functional model and a stochastic model. There is a duality between these two parts that the periodic signals in a time series such as trend, annual, semi-annual, and diurnal frequencies are captured by the functional model, while different types of noise in the time-series such as white noise, flicker noise, or random walk noise are taken in the stochastic part.

Consider a time series in a m-vector y. It can be expressed as a summation of a linear trend and some periodic terms such as in the following equation:

(44) E { y ( t ) } = y 0 + r t + ∑ k = 1 q a k cos ( ω k t ) + b k sin ( ω k t ) .

In a matrix notation, it could be written as:

(45) E { y ( t ) } = A x + ∑ k = 1 q A k x k , D { y } = Q y .

The equation earlier is called the Gauss–Markov model, where A is the design matrix, for example, contains the linear trend terms, and A _k consists of two columns related to the frequency ω _k of the periodic terms:

(46) A k = cos ω k t 1 sin ω k t 1 cos ω k t 2 sin ω k t 2 ⋮ ⋮ cos ω k t m sin ω k t m and x k = a k b k ,

with a _k, b _k, and ω _k are the (un)known real numbers. In fact, if the frequencies ω _k are known, the harmonic coefficients a _k and b _k can be estimated based on the linear LS principles, although in most cases, we do not have exact information about the frequencies existing in a time series. In this situation, in addition, to estimate the harmonic coefficients according to the current linear LS model, the frequencies should be also detected and estimated in the time series. This is the so-called harmonic estimation and is the task of LS-HE.

To find a set of frequencies ω ₁, …, ω _q, and a particular value of q in equation (45), the following null and alternative hypothesis is considered (to start, set i = 1):

(47) H o : E { y } = A x + ∑ k = 1 i − 1 A k x k , .

versus

(48) H a : E { y } = A x + ∑ k = 1 i A k x k .

The detection and validation of ω _i are performed in two steps:

Step I: In the first step, the frequency ω _i and the corresponding design matrix A _i are looked for. The procedure follows the solution to the minimization problem:

(49) ω i = arg min ω j ∥ P [ A ¯ A j ] ⊥ y ∥ Q y − 1 2 = arg min ω j ∥ e ˆ a ∥ Q y − 1 2 ,

where ∥ ⋅ ∥ Q y − 1 2 = ( ⋅ ) T Q y − 1 ( ⋅ ) , A ¯ = [ A A 1 ⋯ A i − 1 ] and e ˆ a is the least-squares residuals under the alternative hypothesis. The matrix A _j has the same structure as equation (46). The frequency, which minimizes the criterion mentioned earlier, is set to be ω _i. This minimization problem can be provided in an equivalent maximization form as the following equation (Teunissen 2003):

(50) ω i = arg max ω j ∥ P A ¯ j y ∥ Q y − 1 2 , with A ¯ j = P A ¯ ⊥ A j ,

with P A ¯ j = A ¯ j ( A ¯ j T Q y − 1 A ¯ j ) − 1 A ¯ j T Q y − 1 . Therefore, equation (50) simplifies to:

(51) ω i = arg max ω j e ˆ o T Q y − 1 A j ( A j T Q y − 1 P A ¯ ⊥ A j ) − 1 A j T Q y − 1 e ˆ o ,

with e ˆ o the least-squares residuals under the null hypothesis.

Step II: In order to test the null hypothesis against the alternative hypothesis H _o, the following statistic is used:

(52) T 2 = ∥ P A ¯ i y ∥ Q y − 1 2 2 σ ˆ a 2 = e ˆ o T Q y − 1 A j ( A j T Q y − 1 P A ¯ ⊥ A j ) − 1 A j T Q y − 1 e ˆ o 2 σ ˆ a 2 ,

where A ¯ i = P A ¯ ⊥ A i and the variance factor σ ˆ a 2 is estimated under the alternative hypothesis. Under H _o, the preceding test statistic has a central Fisher distribution:

(53) T 2 ∼ F α ( 2 , m − n − 2 i ) ,

where α is the level of significance in the central Fisher test. After the detection of unknown frequencies, the design matrix is formed by a linear trend and periodic signals. For the sake of brevity, we name the final design matrix A. The harmonic coefficients are then estimated based on the linear least-squares principles:

(54) x ˆ = ( A T Q y − 1 A ) − 1 A T Q y − 1 y .

Based on the two steps earlier, the periodic terms and sinusoidal frequencies in the time series are detected. After detection and validation of the functional model, i.e. finding the frequencies of interests in the time series, the missing values in the time series are estimated using a suitable detected functional model. In fact, the detected frequencies are introduced to the functional model and the time argument is updated by adding the time argument of missing values into the design matrix. Afterwards, the reconstructed time series, including (now) filled-values, is estimated through the following equation:

(55) y ˆ New = A new x ˆ ,

where A _new is the new design matrix, including rows associated with the missing values, although the same column size as A in equation (54) is kept. In this way, y ˆ New is the reconstructed time series, in which the missing values are filled and predicted by the linear trend and harmonic functions. Amiri-Simkooei and Asgari (2012) used LS-HE and Multivariate LS-HE (MLS-HE) to analyse the time series of TEC and extracted interpretable frequencies from the aforementioned data set. Moreover, Rajabi et al. (2020) used LS-HE and MLS-HE to model and predict the ionosphere TEC maps in their studies.

3 Data and study areas

3.1 Data

3.1.1 GRACE L3 data

In this contribution, we use JPL mascon level-3 grid dataset RL06 version 2 processed by NASA Jet Propulsion Laboratory (JPL). This dataset is derived from mass concentration (mascon) solution based on surface spherical cap basis functions rather than spherical harmonics. The solutions aim to make improvements by using a priori information derived from near-global geophysical models and prevent striping in the solutions (Watkins et al. 2015). In order to reduce the leakage error and separation of land area from oceans, a Coastline Resolution Improvement (CRI) filter was applied to JPL-mascon RL06 v02 dataset (Wiese et al. 2016). The grid time series consist of GRACE and GRACE-FO data. GRACE-FO processing strategy is similar to GRACE, although there are a few differences in sensor analysis and processing of the GRACE-FO data. For more details see Landerer et al. (2020).

The dataset is available to download from https://podaac.jpl.nasa.gov/dataset/TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06_V2.

3.1.2 Swarm L2

Another dataset, used in this article, is the spherical harmonic coefficients derived from the kinematic orbit of the Swarm satellite mission. Several analysis centers provide gravity field solutions from Swarm. In this study, we use the spherical harmonic coefficients, provided by the Astronomical Institute, Czech Academy of Science (ASU), where the processing strategy is based on Bezděk et al. (2016). The gravity field was derived from 3 Swarm satellites (Swarm-Alpha, Swarm-Charlie, and Swarm-Bravo), through the acceleration approach (Guo et al. 2015). The product is solved up to spherical harmonic coefficients degree and order of 40, although they are only significant up to degree and order 15 (Bezděk et al. 2016 and Da Encarnação et al. 2016, 2019, 2020). The solutions can be downloaded from http://www.asu.cas.cz/∼bezdek/vyzkum/geopotencial/index.php.

3.2 Study areas

We choose five river catchments and one sub-catchment, where our introduced gap-filling methodologies are implemented. The time series are in the form of TWSC, extracted from JPL mascon grid data introduced in Section 3. One of the main reasons for selection of these study areas is that we tend to capture different hydrological signal characteristics by the use of different methods explained in the previous section. The selected case studies are five hydrological basins and a sub-basin with different behaviours either stationary or non-stationary from various regions that are distributed around the globe. The study areas of this research article are as follows:

Amazon basin is located in South America with geographical coordinates of about 20S-5N in latitude and 80W-55W in longitude (Figure 1a).
Congo river basin in Central Africa is a sedimentary basin of the Congo river. Its geographical extension is about 15S-10N in latitude and 12E-35E in longitude (Figure 1b).
Gavkhouni (or Gavkhuni) sub-basin is located in the Iranian Plateau in Central Iran, where Gavkhouni wetland is the terminal basin of the Zayandehrud River, east of the city of Isfahan. Its geographical coordinates are limited to about 31N–34N in latitude and 50E–54E in longitude (Figure 1c).
Aravalli (India) basin which is located in India that its geographical coordinates are limited to about 2N–28N in latitude and 68E–76E in longitude (Figure 1d).
Rhine river basin is a catchment of Rhine river, located in the west of central Europe and its geographical coordinates are limited to about 46N–52N in latitude and 4E–12E in longitude (Figure 1e).
Tigris river basin is located in Western Asia by the geographical coordinates of about 33N–39N
in latitude and 39E–48E in longitude (Figure 1f).

Figure 1

Geographical boundaries of (a) Amazon, (b) Congo, (c) Gavkhouni, (d) Aravalli, (e) Rhine, and (f) Tigris river basins.

Amazon basin is located in South America with a tropical climate regime. Congo basin, the basin of Congo River, located in west-central Africa, is the world’s second-largest river basin (next to that of Amazon). Its climate is approximate as the same as the Amazon basin i.e. tropical. In fact, these two basins have almost similar climatological behaviour. The Rhine river basin including German Rhine, French Rhine, Dutch Rijn, Celtic Renos, and Latin Rhenus, lies in Central Europe with humid and mild climate. The Aravalli Mountain Range, located in South Asia, has a complex climate regime. Hence, its climate cannot be generalized into a single exact climatological behaviour. However, the dominant climate in the selected location in this research work is arid, steppe and hot. Gavkhouni sub-basin and Tigris river basin, located in the Middle-East, southwestern Asia, have arid and semi-arid climate regime; however, the higher part of Tigris river basin could be categorized as subtropical.

4 Results

Mascon level-3 data series provides information about time variable total water storage on the continental scale and ocean areas. Originating from time-to-time instrumental problems in GRACE and GRACE-FO measuring devices, the time series include missing values, causing sparse temporal gaps in GRACE and GRACE-FO dataset. The most serious gap in the time series, however, is the long gap between GRACE and GRACE-FO satellite missions. In order to report the data gaps, GRACE missing months are respectively June and July 2002, June 2003, January and June 2011, May and October 2012, March, August and September 2013, February, July and December 2014, June, October and November 2015, April, September and October 2016 and February 2017. Then, there is a long 11-months gap between GRACE and GRACE-FO missions, which started on July 2017 and ended in June, 2018. GRACE Follow-On gaps happened in August and September 2018.

Concerning the gap-filling problem, we here employ several methodologies, described in Section 2. However, lacking reliable information for validation of our gap-filling procedures, we design different artificial gap scenarios, similar to the gaps which exist in the main time series. These scenarios also include a long artificial gap, similar to the long gap between GRACE and GRACE-FO, to simulate the real long gap between two satellite missions. It should be mentioned here that we only make use of GRACE time series to validate the gap-filling methodologies using artificial gaps. Those artificial gaps are chosen between the years 2003 and 2011 in which there is no real gap, and the time span is long enough to derive the main signal signatures. The real gaps outside the validation time span are then filled from the results of the artificial gap-filling procedure where the best filling performance for each specific gap type is taken.

Our artificial gaps include the following:

Short gaps:

at the start of time series (2 data points gap);
at the end of time series (2 data points gap);
sparse gaps in the middle of time series (8 data points gap);
on a peak in the middle of time series (3 data points gap);
on a valley in the middle of time series (5 data points gap).

A long gap similar to the gap between GRACE and GRACE-FO:

- In the middle of the time series (13 data points gap)

It would be of importance to note that our short gaps scenarios consist of different schemes including “continuous a few months” and “sparse individual” missing values in the time series. On the other hand, based on the fact that there is a long gap (about one year) between GRACE and GRACE-FO measurements, we also simulate this gap scenario via a long artificial gap in the middle of time series where there is no data gap, i.e. between the years 2003 and 2011. Concerning these two sectors of missing values, i.e. short gaps and a long gap in the GRACE and GRACE-FO time series, we tackle the issue in two different manners; one is the scenarios for different types of short artificial gaps as previously mentioned, and another one is the long artificial gap.

To review our gap-filling approaches, we employ:

SSA – Linear weighting,
SSA – Stochastic weighting,
SSA – Linear + Stochastic,
SSA – Equation (22) weighting,
LS-HE,
Zero-filling SSA.

It should be noted that the length of the window through SSA data processing is set to 23 months which was determined by the trial-and-error method. However, wherever there are not enough data to choose 23-month window for the case of modified weighting strategy, we select 12-month window length as the other optimal case in our trial-and-error. This selection illustrates the capability of the modified weighting strategy in choosing various window lengths. It is noticeable that the number of data points used for gap-filling procedure depends on the length of the window in RIM-SSA (subsection 2.2.1, equation (9)). Moreover, the level of significance α in Fisher test through LS-HE analysis; i.e., equation (53) is considered α = 0.01.

The focus of this research study is on analysing the time series of GRACE and GRACE-FO in the scales of the hydrological catchment. In fact, we expect that different catchments on the globe show different regional and local hydro-climatological behaviours. To this end, it might be of interest to localize time series analysis, i.e. the gap-filling procedure. We expect that through the global analysis, some information in the data related to a specific catchment or basin might be lost. Therefore, in this contribution, we focus on the performance of a few gap-filling algorithms for particular basins where their particular temporal hydro-climatological variations may play a role.

The results for the selected river basins of this research work are provided in the form of Root Mean Square Error (RMSE), shown in the tables below. In those tables, the numbers identified with “*” and “**” respectively show the maximum and minimum RMSE. Those values are brought for the long gap scenario as well as the mean of the short gap scenarios.

Through our results from the error values for different artificial gap scenarios, for each catchment or the sub-catchment, we fill the real gaps by the best methodology, i.e. the approach which provides the minimum average RMSE for similar short artificial gap scenarios. Similarly, the long gap between GRACE and GRACE-FO is filled by using the methodology that provides minimum RMSE for the artificial long gap. We also try to validate the results in the Amazon basin where coarse-resolution Swarm satellite mission dataset is available. That comparison is employed for the long gap between GRACE and GRACE-FO missions.

As it can be seen in Table 1 for the Amazon catchment, according to RMSE values, for both short artificial gaps and long gap, LS-HE outperforms the other methods. In Table 2, related to Congo basin, for both short and long artificial gaps, the SSA-based methods have the best performance in comparison to LS-HE. For the short gap, the minimum RMSE value happens for the linear weighting scheme, while for the long gap, it is for the modified weighting strategy (equation (22)). Similarly, for Gavkhouni sub-basin in Table 3, the SSA-based algorithms perform better than the LS-HE method. In contrast, in the Aravalli basin (Table 4), the LS-HE method outweighs the SSA algorithms, based on RMSE values. Considering the Rhine river basin in Table 5, the LS-HE algorithm also outperforms the SSA-based methods for the gap-filling of both short- and long-gap scenarios. In Table 6, related to the Tigris river basin, SSA-based methods outperform the LS-HE approach. In fact, a linear weighting scheme for short gaps and zero-filling SSA for long gap are the selected gap-filling algorithms in this river basin.

Table 1