Statistical modeling of COVID-19 deaths with excess zero counts

Sami Khedhiri

doi:10.1515/em-2021-0007

Published/Copyright: October 8, 2021

From the journal Epidemiologic Methods Volume 10 Issue s1

Abstract

Objectives

Modeling and forecasting the possible trajectories of COVID-19 infections and deaths with statistical methods is one of the most important research topics at present. However, statistical models rely on different assumptions and methods and therefore yield different results. One issue in monitoring disease progression over time is how to handle excess zero counts. In this research, we assess the empirical performance of count time series models with and without zero inflation in terms of their fit and forecast accuracy for COVID-19 deaths.

Methods

Two types of models are suggested in the literature for count time series data. The first type is based on Poisson and negative binomial conditional probability distributions, with the negative binomial accommodating over-dispersion in the data, and uses autoregression to account for dependence among the responses. The second type is based on zero-inflated mixed autoregression and also uses exponential family conditional distributions. We study the goodness of fit and forecast accuracy of these count time series models, which are based on autoregressive conditional count distributions with and without zero inflation.

Results

We illustrate these methods using recently published online COVID-19 data for Tunisia, which report daily death counts from March 2020 to February 2021. We perform an empirical analysis and compare the fit and forecast performance of these models for death counts in the presence of an intervention policy. Our findings show that models that account for zero inflation fit the data better and forecast pandemic deaths more accurately.

Conclusions

This paper shows that infectious disease data with excess zero counts are better modeled with zero-inflated models. These models yield more accurate predictions of deaths related to the pandemic than the generalized count data models. In addition, our statistical results indicate that lifting travel restrictions had a significant impact on the surge of COVID-19 deaths. One plausible explanation for the outperformance of zero-inflated models is that the zero values are related to an intervention policy and are therefore structural.

Keywords: conditional exponential family distributions; COVID-19; generalized count series models; mean square forecast error; zero-inflated models

Introduction

Non-Gaussian time series modeling has attracted many researchers recently, and a number of studies have proposed alternative solutions for modeling realistic situations such as count time series. Two widely known probability distributions have been employed to model sequences of observed counts. The first is the Poisson distribution, which is characterized by equidispersion, that is, the equality of its mean and its variance. A random variable exhibits over-dispersion if its variance is greater than its mean. The second is the negative binomial distribution, a common choice for unbounded counts, which is used instead of the Poisson distribution to reflect over-dispersion observed in the data. Since traditional ARMA time series models require Gaussian errors, an assumption that count random variables may not meet, a class of generalized autoregressive moving average models was developed to provide flexible observation-driven models for non-Gaussian time series. In these models, the dependent variable can have a conditional exponential family distribution, such as the Poisson or the negative binomial distribution, given the past history of the process. Following the original generalized autoregressive moving average (GARMA) models developed by Benjamin, Rigby, and Stasinopoulos (2003), a number of subsequent studies presented special cases and variations of the original model, in which a link function is introduced and the conditional mean under that link function takes an ARMA structure (Zheng, Xiao, and Chen 2015).

However, in some cases a specific count, usually zero, occurs more often than the other counts in the series. Overlooking the frequent occurrence of zero values can result in over-dispersion and in biased estimators of the parameters and of their standard deviations, and therefore in misleading statistical inferences. To account for frequent zero values in the data, Lambert (1992) presented zero-inflated models, where the distribution is a mixture of a count model, such as the Poisson or the negative binomial, with one that is degenerate at zero. However, some authors caution that, in order to use these models properly, one has to verify whether the zeros are structural or random (Tang et al. 2018).

In order to proceed further with appropriate statistical modeling of zero inflation with time series of counts, a number of research contributions have emerged recently in the literature to address the issue. Some scholars including Zhu and co-authors (2012) introduced the zero-inflated Poisson (ZIP) and the zero-inflated negative binomial (ZINB) models in a context of integer-valued generalized autoregressive conditional heteroskedasticity framework. Others (Yang, Cavanaugh, and Zamba 2015) developed dynamic models to deal with zero-inflated count time series with both ZIP and ZINB predictive distributions. In this paper, we study the empirical performance of these models and assess their forecast accuracy when the data contains frequent zeros with a particular pattern using COVID-19 data from Tunisia.

Materials and methods

The objective of this research is to use alternative count series models to study how well they fit COVID-19 death records and how accurately they predict them. Several mathematical and statistical models have been introduced recently to study the spread of the coronavirus disease and to forecast its related deaths and infections. For instance, Mohd Yusoff (2020) used a system dynamics methodology to build a model of confirmed COVID-19 cases. However, as argued in Alahmadi et al. (2020), while these computational models are useful for predicting the spread of infectious diseases and allow disease dynamic models to be parameterized with good accuracy, the lack of identification of model parameters can notably limit their usefulness. In addition, publicly available data on daily infections and deaths from developing countries may not always be reliable. With these challenges in mind, we study the forecast performance of competing statistical models, based on parametric assumptions, for coronavirus-related deaths.

We consider two classes of models. The first class includes generalized count data models with an ARMA structure to capture the dependence among observations (ACC). The second class includes zero-inflated models for a time series of counts (ZIM). Although ZIM models seem to be the obvious choice when the data contain frequent zeros, a number of researchers have argued that these models are routinely applied and therefore overused (Warton 2005), and suggest that a negative binomial distribution for the counts may be sufficient to deal with over-dispersion. The validity of this argument depends on whether the observed excess zeros come from a different data generating process; only if they do is there a gain from using ZIM models.

In health studies, and specifically in the presence of a deadly infectious disease, one major issue is how to control the pandemic and how to limit the hospitalizations and deaths related to it. Zero death records may suggest an effective health policy or a proper intervention, and the opposite is true if the policy or the health protocol is not effective. With such data, one needs to know whether the observed zero counts depend on the intervention policy and thus whether these zeros are in fact structural.

Autoregressive conditional count models (ACC)

One type of model for time series of counts specifies that the conditional mean depends on previous count observations and on its own previous values (Liboschik, Fokianos, and Fried 2017), as shown below:

(1) $g(\lambda_t) = \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + \sum_{j=1}^{q} \beta_j \lambda_{t-j}$

where $\alpha_0$, $\alpha_i$, and $\beta_j$ are coefficients to be estimated from the data.

We also need to assume that $Y_t$, given past information, has a Poisson distribution (ACP models) or a negative binomial distribution (ACNB models). In practice, the function g(.) can be either the identity or the log transform, and both specifications are generalized linear models.

The idea is similar to modeling the conditional variance in the generalized autoregressive conditional heteroskedasticity context (Bollerslev 1986). In this paper, we follow this approach as one of two alternative methods to model and predict COVID-19 deaths.
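The recursion in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identity link is used, the coefficient values and the toy series are hypothetical, and the start-up value `lam_init` for the first periods is an assumption rather than anything specified in this paper.

```python
def acc_conditional_means(y, alpha0, alpha, beta, lam_init):
    """Conditional means lambda_t from Eq. (1) with the identity link g(x) = x.

    alpha[i] multiplies Y_{t-1-i} and beta[j] multiplies lambda_{t-1-j}.
    lam_init is an assumed start-up value for the first max(p, q) periods.
    """
    p, q = len(alpha), len(beta)
    start = max(p, q)
    lam = [lam_init] * start
    for t in range(start, len(y)):
        regression_on_counts = sum(alpha[i] * y[t - 1 - i] for i in range(p))
        regression_on_means = sum(beta[j] * lam[t - 1 - j] for j in range(q))
        lam.append(alpha0 + regression_on_counts + regression_on_means)
    return lam


# Hypothetical toy series and coefficients, for illustration only.
y = [0, 0, 1, 0, 3, 5, 2]
lam = acc_conditional_means(y, alpha0=0.5, alpha=[0.3], beta=[0.4], lam_init=1.0)
```

With these toy values the conditional mean stays low while the counts are zero and rises once positive counts enter the recursion, which is the feedback mechanism that Eq. (1) describes.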

The ACC models can be presented with a count series that has the Poisson conditional distribution, and for which the probability mass function is given by,

(2) $P(Y_t = y \mid I_{t-1}) = \frac{\lambda_t^{y} e^{-\lambda_t}}{y!}, \quad y = 0, 1, \ldots$

The conditional mean and variance are $E(Y_t \mid I_{t-1}) = V(Y_t \mid I_{t-1}) = \lambda_t$.

Alternatively, the conditional probability distribution may also be given by the negative binomial as follows,

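(3) $P(Y_t = y \mid I_{t-1}) = \frac{\Gamma(y + \upsilon)}{\Gamma(y + 1)\,\Gamma(\upsilon)} \left(\frac{\upsilon}{\lambda_t + \upsilon}\right)^{\upsilon} \left(\frac{\lambda_t}{\lambda_t + \upsilon}\right)^{y}, \quad y = 0, 1, \ldots$

where $\lambda_t$ refers to the conditional mean, $\lambda_t + \lambda_t^2/\upsilon$ is the conditional variance, $I_{t-1}$ is the σ-field generated by $(Y_0, Y_1, \ldots, Y_{t-1}, \lambda_0)$, and $\upsilon$ is the dispersion coefficient.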
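As a quick numerical check of the two conditional distributions, the following Python sketch evaluates the Poisson and negative binomial probability mass functions and recovers their means and variances by direct summation. The parameter values (mean 4, dispersion coefficient 2) and the truncation point of the sums are arbitrary assumptions chosen for illustration.

```python
import math


def poisson_pmf(y, lam):
    # Poisson probability mass function, computed on the log scale for stability.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def negbin_pmf(y, lam, v):
    # Negative binomial pmf with mean lam and dispersion coefficient v.
    log_p = (math.lgamma(y + v) - math.lgamma(y + 1) - math.lgamma(v)
             + v * math.log(v / (lam + v)) + y * math.log(lam / (lam + v)))
    return math.exp(log_p)


def moments(pmf, upper=2000):
    # Mean and variance by summing over 0..upper-1 (the truncation is an assumption).
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var


lam, v = 4.0, 2.0  # hypothetical values
pois_mean, pois_var = moments(lambda y: poisson_pmf(y, lam))
nb_mean, nb_var = moments(lambda y: negbin_pmf(y, lam, v))
# Poisson: mean and variance both equal lam.
# Negative binomial: mean lam, variance lam + lam**2 / v.
```

The negative binomial variance exceeds its mean by the term involving the dispersion coefficient, which is how this distribution accommodates over-dispersion.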

In general, these models work well with over-dispersed (or under-dispersed) data, which is perhaps why they are so popular in health studies and particularly in disease surveillance. However, with excess zero values their performance may deteriorate, especially if the zeros can be attributed to a different data generating process. As a solution, scholars introduced zero-inflated models to handle the statistical issues that may arise from excess zero counts.

Zero-inflated models

In disease surveillance studies, count series data sometimes have an overabundance of zeros. A common choice for modeling such data is the zero-inflated model, which is a mixture of a standard discrete count distribution and a degenerate distribution at zero. The model coefficients may be estimated with an expectation–maximization algorithm, and time dependence can be introduced in both the count and the zero-inflation components of the model.

Yau, Lee, and Carrivick (2004) used a mixed autoregression model for zero-inflated count time series data in which the dependence is modeled with a random effect. Yang, Zamba, and Cavanaugh (2013, 2015) developed a zero-inflated autoregression model in which the autocorrelation is modeled as a function of past responses, and Sathish et al. (2020) combined both autoregressive and moving average components in the ZINB model. The autoregressive model for count time series with excess zeros may be presented as follows.

Suppose the response series ($Y_t$) is composed of discrete count data and the count series is conditionally distributed as zero-inflated Poisson ($\lambda_t$, $p_t$) with probability mass function given by,

(4) $f_Y(y_t \mid I_{t-1}) = p_t \, I(y_t = 0) + (1 - p_t) \, \frac{\exp(-\lambda_t) \, \lambda_t^{y_t}}{y_t!}$

where $I_{t-1}$ is the information set, and $\lambda_t$ and $p_t$ represent the Poisson intensity and the zero-inflation parameters, respectively.

The conditional mean and variance are $E(Y_t \mid I_{t-1}) = \lambda_t (1 - p_t)$ and $V(Y_t \mid I_{t-1}) = \lambda_t (1 - p_t)(1 + \lambda_t p_t)$. Notice that the conditional variance is greater than the conditional mean, and therefore the ZIP distribution accounts for both over-dispersion and zero inflation.

An alternative conditional distribution is the zero-inflated negative binomial given by,

(5) $f_Y(y_t \mid I_{t-1}) = p_t \, I(y_t = 0) + (1 - p_t) \, \frac{\Gamma(k + y_t)}{\Gamma(k) \, y_t!} \left(\frac{k}{k + \lambda_t}\right)^{k} \left(1 - \frac{k}{k + \lambda_t}\right)^{y_t}$

where $\frac{k}{k + \lambda_t}$ denotes the probability of success for the negative binomial. Also, $\lambda_t$ is the intensity parameter, and k is the dispersion parameter. The conditional mean and variance are $E(Y_t \mid I_{t-1}) = \lambda_t (1 - p_t)$ and $V(Y_t \mid I_{t-1}) = \lambda_t (1 - p_t)\left(1 + \lambda_t p_t + \frac{\lambda_t}{k}\right)$.
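The moment formulas for the ZIP and ZINB distributions can be checked numerically. The sketch below is an illustration only: it implements Eqs. (4) and (5) directly and compares summed moments with the closed-form expressions. The parameter values and the truncation point are assumptions.

```python
import math


def zip_pmf(y, lam, p):
    # Eq. (4): point mass at zero mixed with a Poisson(lam) count distribution.
    pois = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return p * (y == 0) + (1 - p) * pois


def zinb_pmf(y, lam, p, k):
    # Eq. (5): point mass at zero mixed with a negative binomial (mean lam, dispersion k).
    success = k / (k + lam)
    log_nb = (math.lgamma(k + y) - math.lgamma(k) - math.lgamma(y + 1)
              + k * math.log(success) + y * math.log(1 - success))
    return p * (y == 0) + (1 - p) * math.exp(log_nb)


def moments(pmf, upper=3000):
    # Mean and variance by summing over 0..upper-1 (the truncation is an assumption).
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var


lam, p, k = 5.0, 0.3, 2.0  # hypothetical values
zip_mean, zip_var = moments(lambda y: zip_pmf(y, lam, p))
zinb_mean, zinb_var = moments(lambda y: zinb_pmf(y, lam, p, k))

# Closed-form moments stated in the text.
zip_var_formula = lam * (1 - p) * (1 + lam * p)
zinb_var_formula = lam * (1 - p) * (1 + lam * p + lam / k)
```

In both cases the variance exceeds the mean, and the ZINB adds the extra term $\lambda_t / k$ on top of the inflation term $\lambda_t p_t$.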

Lastly, the autoregressive zero inflated model equation is given by,

(6) $g(\lambda_t) = \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + \sum_{j=0}^{m} \pi_j X_j$

The function g(.) may be a log transform of the conditional mean. Eq. (6) includes past observations of the response ($Y_{t-i}$, $i \geq 1$) and possibly other covariates (X).

Results

We apply these two methods to published COVID-19 death data for the North African country of Tunisia, available from Statista (https://www.statista.com/statistics/1185550/number-of-coronavirus-related-deaths-tunisia/). The interesting feature of these data is the specific pattern of the zero counts. Most death counts were zero prior to July 31st, 2020, when the government changed its policy by relaxing the strict travel restrictions in order to boost the economy. The post-intervention sample then shows a significant increase in daily deaths related to the pandemic.

Tunisia had an early success in dealing with the COVID-19 pandemic. In fact the Tunisian government had an effective policy to control and to limit the spread of the disease when the pandemic first hit the country in the early spring of 2020. Right after the first cases were recorded in Tunisia in March 2020, the government responded swiftly with a comprehensive set of measures aiming to slow the spread of the virus. The government also suspended all travel, mandated working from home for non-essential workers, imposed mandatory confinement and nightly curfews, and shut down schools and businesses and banned public gatherings. In addition, the military and the police forces were asked to enforce these strict measures to prevent and control the spread of the virus. However the economic situation in the country was sluggish and researchers suggested that the pandemic is likely to exacerbate Tunisia’s development challenges so more people will probably fall below the poverty line (Kokas et al. 2020) and the economic situation in the country could go from bad to worse if the government does not loosen the restrictions. This might explain why the Tunisian government in late July 2020 decided to lift travel restrictions and lockdown. However, after opening its borders in order to boost the tourism sector and to fix its stalled economy, Tunisia saw a significant spike in the number of its COVID-19 infection cases and more alarming was the number of deaths resulting from the progression of the pandemic. The latest statistics are displayed in Figure 1 and show the updated cumulative numbers of COVID-19 deaths in Tunisia, from March 15, 2020 to September 10, 2021.

Figure 1:

Number of COVID-19 related deaths in Tunisia (March 15, 2020–September 10, 2021). Source: Statista.

The official statistics on deaths related to the pandemic from March 15, 2020 to February 11, 2021, which represents the period covered by this study, indicate that the Tunisian data shows 148 zero values out of a total of 334 sample daily observations. This represents 44.3% of the total sample. Table 1 lists some descriptive statistics for the death counts before and after the end of July 2020, a date which marks government intervention to lift travel restrictions.

Table 1:

Descriptive statistics for the number of COVID-19 related deaths in Tunisia and the number of days with zero death count before and after government intervention to relax travel restrictions (March 15, 2020–February 11, 2021).

Statistic	All sample observations (n = 334)	Observations before relaxation of travel restrictions (n = 139)	Observations after relaxation of travel restrictions (n = 195)
Average number of daily deaths ($\bar{X}$)	22.28	0.348	41.279
Standard deviation of daily deaths (s)	34.12	0.761	37.363
Number of days with zero death count	148	119	29

Figure 2 displays the empirical probability distribution of deaths and clearly shows a large proportion of zero death counts (44.3% of all observations), most of which occurred before the Tunisian government relaxed travel restrictions in late July 2020.

Figure 2:

The probability distribution of COVID-19 deaths in Tunisia (March 15, 2020 to February 11, 2021).

Before presenting the statistical findings of the count time series models and their performance in forecasting COVID-19 deaths, we test the data for possible unit roots using the Augmented Dickey Fuller test (Said and Dickey 1984) and the Phillips–Perron test (Phillips and Perron 1988). The results are displayed in Table 2 below. The choice of lag lengths in the test regressions is based on AIC and BIC information criteria.

Table 2:

Unit root tests for death counts related to COVID-19 in Tunisia (March 15, 2020–February 11, 2021).

ADF unit root test	PP unit root test
Dickey-Fuller = −3.2252, lag order = 5, p-value = 0.085, alternative hypothesis: stationary.	Dickey–Fuller Z (alpha) = −331.41, truncation lag parameter = 4, p-value = 0.01, alternative hypothesis: stationary.

Both ADF and PP tests indicate that the death count series is stationary at the 10% level of significance.

Given that death counts increased sharply after July 31st, 2020, following a government policy to relax travel restrictions, we include an intervention covariate ($X_t$) in the conditional mean model to assess its significance for the observed surge in deaths, and we compare this model with similar count models that do not include the intervention as an additional covariate.

Discussion

We use R code to perform the statistical analysis for this study. We start with a model with intervention (relaxation of travel restrictions). Based on the government's decision to remove travel restrictions in late July 2020, and referring to Liboschik et al. (2016), who studied covariate effects in count time series data, we include in our models a covariate to account for the intervention policy.

We estimate alternative models with different lag lengths and based on AIC and BIC information criteria we select the model that yields the best fit and which is given by:

(7) $\hat{\lambda}_t = 0.11 + 0.512 \, Y_{t-2} + 0.433 \, Y_{t-3} + 2.59 \, X_t$

where $\hat{\lambda}_t$ is the estimated conditional mean at time t, $Y_t$ is the death count at time t, and the intervention variable $X_t$ is modeled as a level shift that takes the value 1 after July 31st, 2020 and the value 0 before that. In Table 3, we report the estimation results of the model in Eq. (7) with Poisson and negative binomial conditional distributions. The models are estimated with the maximum likelihood method (Fokianos 2012), and the results show statistical significance of the coefficients in the models with travel intervention. The coefficients beta_2 and beta_3 correspond to the regression on the second and third lagged observations, respectively. The coefficient eta_1 refers to the single intervention policy variable (X). We can conclude that all these coefficients are statistically significant at the 5% level because their confidence intervals do not contain zero.
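To make the fitted model concrete, the following Python sketch evaluates Eq. (7) for given lagged counts. The coefficients are those reported in Eq. (7); the lagged death counts passed in the second call are hypothetical values chosen only for illustration.

```python
def fitted_mean(y_lag2, y_lag3, x_t):
    """Estimated conditional mean from Eq. (7) (identity link).

    y_lag2, y_lag3: death counts two and three days earlier.
    x_t: intervention indicator, 1 after July 31st, 2020 and 0 before.
    """
    return 0.11 + 0.512 * y_lag2 + 0.433 * y_lag3 + 2.59 * x_t


# Before the intervention, with no deaths on the two lagged days.
mean_before = fitted_mean(0, 0, 0)

# After the intervention, with hypothetical lagged counts of 40 and 38 deaths.
mean_after = fitted_mean(40, 38, 1)
```

With zero lagged counts and no intervention the fitted mean equals the intercept, 0.11 deaths per day. After the intervention the level shift alone adds 2.59 expected deaths per day, before any feedback through the lagged counts.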

Table 3:

Estimation of autoregressive conditional count models.

ACC models	With intervention				Without intervention
Coefficients	Estimate	Std. error	CI (lower)	CI (upper)	Estimate	Std. error	CI (lower)	CI (upper)
Intercept	0.110	0.034	0.042	0.178	0.516	0.063	0.392	0.639
Beta_2	0.512	0.014	0.484	0.539	0.516	0.014	0.489	0.543
Beta_3	0.433	0.014	0.406	0.461	0.449	0.013	0.423	0.476
Eta_1	2.590	0.254	2.093	3.087
ACNB models	With intervention				Without intervention
Coefficients	Estimate	Std. error	CI (lower)	CI (upper)	Estimate	Std. error	CI (lower)	CI (upper)
Intercept	0.110	0.040	0.031	0.188	0.516	0.313	−0.098	1.130
Beta_2	0.512	0.146	0.226	0.797	0.516	0.497	−0.457	1.493
Beta_3	0.433	0.141	0.158	0.709	0.449	0.477	−0.486	1.381
Eta_1	2.590	0.827	0.969	4.211
Sigmasq	1.744				22.204

For ACP model with travel intervention: link function is the identity, distribution family: Poisson, number of coefficients: 4, log-likelihood: −2240.303, AIC: 4488.607, BIC: 4503.839, QIC: 4488.669, standard errors and confidence intervals (level = 95%) obtained by normal approximation.
For ACNB model with travel intervention: link function is the identity, distribution family: negative binomial (with overdispersion coefficient ‘sigmasq’), number of coefficients: 5, log-likelihood: −1909.4699, AIC: 4128.94, BIC: 4147.981, QIC: 4127.434.

The quasi information criterion (QIC) was proposed by Pan (2001) as an alternative to Akaike’s information criterion (AIC) that is properly adjusted for regression analysis based on generalized estimating equations. Here, the QIC is computed for a generalized linear model for time series of counts. For models with the Poisson distribution, the QIC has approximately the same value as the AIC. For models with another distribution, however, it can be a more adequate alternative to the AIC. Recall that a lower value of an information criterion (AIC, BIC, or QIC) indicates a better-fitting model.

Next, we run the same models without intervention; the results are also displayed in Table 3. The coefficients in the models with intervention are more clearly significant than those in the models without intervention. In particular, for the negative binomial distribution without intervention, the reported confidence intervals include zero, which suggests that the coefficients are not statistically significant at the 5% level.

Also, for the nested models, we test the significance of the intervention policy and we obtain the following result:

Chisq-statistic: 312.8429 on 1 degree of freedom, p-value < 0.0001. This indicates that the relaxation of travel restrictions was associated with a statistically significant increase in COVID-19 deaths in Tunisia.
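The p-value of a chi-square statistic with one degree of freedom can be obtained from the complementary error function, because a chi-square variable with one degree of freedom is the square of a standard normal variable. The following Python sketch applies this identity to the reported statistic; it is a check of the arithmetic, not a reproduction of the R output.

```python
import math


def chi2_sf_1df(x):
    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Reported test statistic for the intervention effect.
p_value = chi2_sf_1df(312.8429)

# Sanity check against the familiar 5% critical value (1.959964 squared).
p_at_critical_value = chi2_sf_1df(1.959964 ** 2)
```

The resulting p-value is far below 0.0001, in line with the reported result.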

Furthermore, and in order to evaluate the adequacy of these ACC models where frequent zeros were not dealt with in a specific way, we analyse the statistical performance of the models. First, we refer to the Probability Integral Transform (PIT) which relates to the result that data values that are modeled as being random variables from any given continuous distribution can be converted to random variables having a standard uniform distribution on the interval [0, 1]. Thus, if the model is adequate (the predictive distribution is correct) then PIT should be close to the uniform distribution as discussed in Czado, Gneiting, and Held (2009) and also in Gneiting, Balabdaoui, and Raftery (2007). Figure 3 shows the residuals’ correlogram obtained from the ACC model with the relaxation of travel restrictions.

Figure 3:

Autocorrelation function of the residuals from the autoregressive conditional count model (ACC) with intervention.

Figures 4 and 5 display the probability integral transform of the Poisson distribution and the negative binomial distribution, respectively. It is shown that the histograms of the probability integral transform (PIT) have a non-uniform shape for both distributions, suggesting over dispersion.

Figure 4:

Probability integral transform of the autoregressive conditional count model with intervention, for the Poisson distribution.

Figure 5:

Probability integral transform of the autoregressive conditional count model with intervention, for the negative binomial distribution.

The statistical results show that both predictive distributions are not well calibrated and they do not fit closely to the uniform distribution, likely because of the excess zero counts.
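For count data, the PIT histogram is usually computed in its nonrandomized form (Czado, Gneiting, and Held 2009). The Python sketch below illustrates that construction for Poisson predictive distributions. The observations and predictive means are hypothetical, and this is a minimal illustration rather than the routine used to produce Figures 4 and 5.

```python
import math


def poisson_cdf(y, lam):
    # P(Y <= y) for a Poisson(lam) variable; zero for negative y.
    if y < 0:
        return 0.0
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(y + 1))


def pit_histogram(observations, means, bins=10):
    """Bin probabilities of the nonrandomized PIT histogram for Poisson predictions."""
    def mean_pit(u):
        total = 0.0
        for y, lam in zip(observations, means):
            lower, upper = poisson_cdf(y - 1, lam), poisson_cdf(y, lam)
            if u <= lower:
                continue  # contributes 0
            if u >= upper:
                total += 1.0
            else:
                total += (u - lower) / (upper - lower)
        return total / len(observations)

    edges = [j / bins for j in range(bins + 1)]
    return [mean_pit(edges[j + 1]) - mean_pit(edges[j]) for j in range(bins)]


# Hypothetical series with many zeros, predicted with a constant mean of 3.
observations = [0, 0, 0, 0, 0, 0, 7, 9, 1, 2]
means = [3.0] * len(observations)
heights = pit_histogram(observations, means)
```

Under a well-calibrated model each of the ten bins would hold a probability of about 0.1. In this toy example the first bin holds more than 0.6 because the assumed model assigns too little probability to zero, which is the kind of non-uniform shape that excess zeros produce.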

An additional criterion to assess their fit is marginal calibration, which is the difference between the average predictive cumulative distribution function and the empirical distribution function of the observations. If the predictions from a given model are appropriate, then the marginal calibration should be close to zero, as explained in Christou and Fokianos (2015). Figures 6 and 7 show a substantial difference between the predictive and the observed cumulative distribution functions for both distributions. This result supports our previous finding and again points to a poor fit of the ACC models for the COVID-19 death counts.
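The marginal calibration difference can be sketched as follows in Python, assuming Poisson predictive distributions. The observations and predictive means are hypothetical and serve only to show how excess zeros appear in this diagnostic.

```python
import math


def poisson_cdf(x, lam):
    # P(Y <= x) for a Poisson(lam) variable; zero for negative x.
    if x < 0:
        return 0.0
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(x + 1))


def marginal_calibration(observations, means, thresholds):
    """Average predictive CDF minus empirical CDF, evaluated at each threshold."""
    n = len(observations)
    differences = []
    for x in thresholds:
        average_predictive = sum(poisson_cdf(x, lam) for lam in means) / n
        empirical = sum(1 for y in observations if y <= x) / n
        differences.append(average_predictive - empirical)
    return differences


# Hypothetical series with many zeros, predicted with a constant mean of 3.
observations = [0, 0, 0, 0, 0, 0, 7, 9, 1, 2]
means = [3.0] * len(observations)
diffs = marginal_calibration(observations, means, thresholds=range(0, 21))
```

In this toy example the difference at zero is about -0.55, since the assumed model predicts a zero with probability of roughly 0.05 while 60% of the observations are zero. A large gap near zero is the pattern described for Figures 6 and 7.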

Figure 6:

Marginal calibration for the Poisson probability. The plot displays a significant difference between predictive and empirical distributions specifically for counts close to zero, suggesting a poor fit of the ACC model.

Figure 7:

Marginal calibration for the negative binomial probability. The plot displays a significant difference between predictive and empirical distributions in particular for death counts close to zero, which suggests a poor fit of the ACC model.

In addition to the criteria discussed above regarding the fit of these models, one can assess both calibration and sharpness by computing a single score from alternative scoring rules as described by Gneiting and Katzfuss (2014). In this paper, we choose only four scoring rules; from each we compute a mean score in order to evaluate the probabilistic forecast of the Poisson and the negative binomial distributions. The notion of sharpness refers to the concentration of the predictive distribution and can be measured by the prediction interval width.

The statistical analysis in this study is based on the tscount package in R (Liboschik et al. 2017). We estimate the models and evaluate the model adequacy for each of the predictive distributions by computing mean difference scores. The model with the lowest score is preferable. These scores are summarized in Table 4 and show in fact inconclusive results regarding which distribution has an overall better fit for COVID-19 deaths.

Table 4:

Mean scores which are measures to evaluate the probabilistic forecasts of each distribution.

Mean scores given by each scoring rule	Logarithmic	Quadratic	Rank-prob	Sqerror
Poisson	6.516	0.298	0.431	12.029
Neg. binomial	2.668	0.311	0.451	8.709

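Each numerical score is based on the predictive distribution ($P_i$) and the ith death count observation ($y_i$) and is defined by $\text{score} = \frac{1}{n} \sum_{i=1}^{n} \psi(P_i, y_i)$, where $\psi$ is a penalty function specific to each of the four scoring rules. The mean scores reported in the table are the logarithmic, quadratic, ranked probability, and squared error scores. Since lower scores are preferable, the best values are 2.668 (negative binomial) for the logarithmic score, 0.298 (Poisson) for the quadratic score, 0.431 (Poisson) for the ranked probability score, and 8.709 (negative binomial) for the squared error score.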
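The four scoring rules can be illustrated with a short Python sketch for Poisson predictive distributions. The penalty functions follow the standard definitions for count data in Czado, Gneiting, and Held (2009); the inputs are hypothetical, the infinite sums are truncated at an assumed upper bound, and the code is not the tscount implementation.

```python
import math


def poisson_pmf(x, lam):
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def scores(y, lam, upper=500):
    """Penalty values for one observation y under a Poisson(lam) prediction."""
    pmf = [poisson_pmf(x, lam) for x in range(upper)]
    cdf, running = [], 0.0
    for prob in pmf:
        running += prob
        cdf.append(running)
    return {
        "logarithmic": -math.log(pmf[y]),
        "quadratic": -2.0 * pmf[y] + sum(prob ** 2 for prob in pmf),
        "rankprob": sum((cdf[x] - (1.0 if y <= x else 0.0)) ** 2
                        for x in range(upper)),
        "sqerror": (y - lam) ** 2,
    }


def mean_scores(observations, means):
    """Average each penalty over all observations, as in the score definition."""
    per_observation = [scores(y, lam) for y, lam in zip(observations, means)]
    n = len(per_observation)
    return {name: sum(s[name] for s in per_observation) / n
            for name in per_observation[0]}


# Hypothetical observations and predictive means.
example = mean_scores([0, 3], [3.0, 3.0])
```

All four rules are negatively oriented, so a lower mean score indicates a better probabilistic forecast.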

In the next step, we estimate the zero-inflated Poisson mixed auto regression model (AR-ZIP) and the zero-inflated negative binomial mixed auto regression model (AR-ZINB) with intervention for the COVID-19 death data. Specifically, we use a first order autoregression to represent the dependence between the responses. The results are reported in Table 5 and show a good fit for both count specifications given by each distribution. The AR(1) coefficient and the intervention policy are both significant in the ZIP model.

Table 5:

Estimation of zero-inflated models.

AR(1) ZIP model	With intervention					Without intervention
Coefficients	Estimate		Std. error	z-Value	Pr(>|z|)	Estimate		Std. error	z-Value	Pr(>|z|)
Count model coefficients (Poisson with log link):
Intercept	2.821		0.019	149.6	<2e-16	2.822		0.019	149.5	<2e-16
AR(1)	0.159		0.000	79.53	<2e-16	0.016		0.000	79.54	<2e-16
Zero-inflated model coefficients (binomial with logit link):
Intercept	1.1978		0.201	5.938	3e-09	−0.235		0.110	−2.13	0.033
Intervention	−2.521		0.268	−9.42	<2e-16
	Number of iterations in BFGS optimization: 10					Number of iterations in BFGS optimization: 10
	Log-likelihood: −1692 on 4 degrees of freedom					Log-likelihood: −1746 on 3 degrees of freedom
	Pearson residuals:					Pearson residuals:
	Min	1Q	Median	3Q	Max	Min	1Q	Median	3Q	Max
	−1.1103	−0.4073	−0.4073	0.3930	2.088	−1.056	−1.056	−0.946	1.385	1.611
AR(1) ZINB model	With intervention
Coefficients	Estimate		Std. error	z-Value	Pr(>|z|)
Count model coefficients (negative binomial with log link):
Intercept	0.819		0.124	6.59	4e-11
AR(1)	0.053		0.002	18.61	<2e-16
Log(theta)	0.211		0.126	1.67	0.094
Zero-inflation model coefficients (binomial with logit link):
Intercept	0.753		0.239	3.14	0.002
Intervention	−12.01		40.78	−0.29	0.768
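The zero-inflation coefficients in Table 5 are on the logit scale. As a reading aid, the Python sketch below converts the AR(1) ZIP estimates with intervention into zero-inflation probabilities. It assumes that the logit model describes the probability $p_t$ of a structural zero and that the intervention variable equals 0 before and 1 after July 31st, 2020.

```python
import math


def zero_inflation_probability(intercept, intervention_coef, x):
    # Inverse logit of the linear predictor of the zero-inflation component.
    return 1.0 / (1.0 + math.exp(-(intercept + intervention_coef * x)))


# AR(1) ZIP estimates with intervention, taken from Table 5.
p_before = zero_inflation_probability(1.1978, -2.521, 0)
p_after = zero_inflation_probability(1.1978, -2.521, 1)
```

Under these assumptions the estimated probability of a structural zero falls from about 0.77 before the relaxation of travel restrictions to about 0.21 afterwards.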

The statistical results displayed in Table 5 show the estimation of models with and without the policy of lifting the travel ban. For the AR-ZIP models with and without intervention, we computed the BIC from the reported log-likelihoods and the number of parameters in each model, which confirms that the model with intervention has a better fit. We also find that the relaxation of travel restrictions in Tunisia has a significant impact, and that the forecasts of COVID-19 deaths based on models with intervention are more accurate than those without intervention, as indicated by the root mean square forecast errors reported in Table 6.
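The BIC comparison for the two AR-ZIP models can be sketched in Python as follows. The log-likelihoods and parameter counts come from Table 5; the sample size n = 334 is an assumption, since the effective number of observations used in the estimation is not stated.

```python
import math


def bic(loglik, n_params, n_obs):
    # Bayesian information criterion: -2 log-likelihood plus k times log(n).
    return -2.0 * loglik + n_params * math.log(n_obs)


n_obs = 334  # assumed: total number of daily observations in the sample

bic_with = bic(-1692, 4, n_obs)      # AR(1) ZIP with intervention
bic_without = bic(-1746, 3, n_obs)   # AR(1) ZIP without intervention
```

Under this assumption the BIC is about 3407.2 with intervention and about 3509.4 without, so the model with intervention is preferred despite its extra parameter.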

Table 6:

Root mean square forecast error for the estimated models (forecast horizon h = 40 days).

Estimated model	Model 1	Model 2	Model 3	Model 4
Root mean square forecast error for the number of COVID-19 deaths	24.6703	29.7246	28.9912	33.7968

Model 1: AR-ZIP with intervention.
Model 2: AR-ZIP without intervention.
Model 3: ACP with intervention.
Model 4: ACP without intervention.

In order to compare the empirical performance of the zero-inflated model with intervention and the autoregressive conditional Poisson model with intervention, we compute root mean square forecast errors over a 40-day forecast horizon for each model. The results in Table 6 show that the AR(1)-ZIP model with intervention has the best forecast performance.
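The root mean square forecast error and the relative differences between models can be sketched in Python as follows. The function definitions are generic; the dictionary holds the values reported in Table 6, and the short names used as keys are labels introduced here for illustration.

```python
import math


def rmsfe(actual, forecast):
    # Root mean square forecast error over the forecast horizon.
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(rmsfe_model, rmsfe_reference):
    # Share by which rmsfe_model is lower than rmsfe_reference.
    return 1.0 - rmsfe_model / rmsfe_reference


# Values reported in Table 6 (forecast horizon h = 40 days).
table6 = {
    "AR-ZIP with intervention": 24.6703,
    "AR-ZIP without intervention": 29.7246,
    "ACP with intervention": 28.9912,
    "ACP without intervention": 33.7968,
}

best = table6["AR-ZIP with intervention"]
vs_acp_with = relative_reduction(best, table6["ACP with intervention"])
vs_acp_without = relative_reduction(best, table6["ACP without intervention"])
# vs_acp_with is approximately 0.149 and vs_acp_without approximately 0.270.
```

Measured relative to the competing model's error, the AR-ZIP model with intervention has a forecast error that is roughly 15% lower than the ACP model with intervention and roughly 27% lower than the ACP model without intervention.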

To make a final assessment regarding the superiority of zero-inflation models, we additionally compute the Vuong test (Vuong 1989) for the two non-nested models (Model 1 and Model 3) and we report the test result in Table 7.

Table 7:

Vuong test for non-nested models.

Test	Null hypothesis	Alternative hypothesis	Statistic	p-value
Variance test	H0: Model 1 and Model 3 are indistinguishable	H1: Model 1 and Model 3 are distinguishable	w2 = 120.428	0.173
Non-nested likelihood ratio test	H0: Model fits are equal for the focal population	H1A: Model 1 fits better than Model 3	z = 8.156	<2e-16
Non-nested likelihood ratio test	H0: Model fits are equal for the focal population	H1B: Model 3 fits better than Model 1	z = 8.156	1

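The non-nested likelihood ratio test rejects the hypothesis of equal fit in favor of Model 1 (z = 8.156, p < 2e-16), so Model 1 outperforms Model 3 in fitting the COVID-19 death counts. The variance test, however, does not reject the hypothesis that the two models are indistinguishable at conventional levels (p = 0.173). Overall, the result provides additional evidence that zero-inflated models are better than autoregressive conditional count models at fitting and forecasting deaths related to infectious diseases in the presence of excess zero counts.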
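The z statistic of the non-nested likelihood ratio test is built from the pointwise log-likelihood differences between the two models (Vuong 1989). The Python sketch below shows the basic computation without any correction for the number of parameters. The pointwise log-likelihood values are hypothetical, since the fitted values from Model 1 and Model 3 are not reported here.

```python
import math


def vuong_z(loglik_model1, loglik_model2):
    """Vuong z statistic from pointwise log-likelihoods of two non-nested models.

    Positive values favor the first model, negative values the second.
    """
    differences = [a - b for a, b in zip(loglik_model1, loglik_model2)]
    n = len(differences)
    mean_diff = sum(differences) / n
    var_diff = sum((d - mean_diff) ** 2 for d in differences) / n
    return math.sqrt(n) * mean_diff / math.sqrt(var_diff)


# Hypothetical pointwise log-likelihoods for three observations.
loglik_1 = [-1.0, -2.0, -1.5]
loglik_2 = [-2.0, -4.0, -4.5]
z = vuong_z(loglik_1, loglik_2)
```

Swapping the two models changes only the sign of the statistic, which is why Table 7 reports the same z value for both one-sided alternatives with very different p-values.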

Conclusions

In this paper, we conduct an empirical analysis to compare the statistical performance of two types of models for count time series. Both types use non-Gaussian conditional distributions to account for possible over-dispersion in the data; they differ, however, in how they treat excess zero counts. We evaluate their fit and forecast performance using Tunisian data on daily COVID-19 deaths from March 2020 to February 2021. We find that the zero-inflated models have more significant coefficients and lower root mean square forecast errors. The results displayed in Table 6 show an overall outperformance of the zero-inflated models. In particular, the root mean square forecast error of the AR(1) zero-inflated model with the travel intervention (24.67) is about 15 percent lower than that of the autoregressive conditional Poisson model with intervention (28.99), and about 27 percent lower than that of the ACP model without intervention (33.80). Therefore, in order to model the spread of an infectious disease and to forecast its related deaths accurately, we recommend the application of zero-inflated models when excess zero counts are present in the data. Our findings also indicate the statistical significance of the government's intervention policy to lift travel restrictions, which changed the course of the pandemic death count trajectory.

Corresponding author: Sami Khedhiri, School of Mathematical and Computational Sciences, University of Prince Edward Island, Charlottetown, PE, C1A 4P3, Canada, E-mail: skh.upei@gmail.com

Research funding: No funding received to complete this study.
Author contribution: Not applicable.
Competing interests: There are no competing interests for the author.
Informed consent: Not applicable.
Ethical approval: This study uses publicly available data and information and therefore ethical approval is not required.

References

Alahmadi, A., S. Belet, A. Black, D. Cromer, J. A. Flegg, T. House, P. Jayasundara, J. M. Keith, J. M. McCaw, R. Moss, J. V. Ross, F. M. Shearer, S. T. T. Tun, J. Walker, L. White, J. M. Whyte, A. W. C. Yan, and A. E. Zarebski. 2020. “Influencing Public Health Policy with Data-Informed Mathematical Models of Infectious Diseases: Recent Developments and New Challenges.” Epidemics 32: 1–12. https://doi.org/10.1016/j.epidem.2020.100393.Search in Google Scholar

Benjamin, M. A., R. A. Rigby, and D. M. Stasinopoulos. 2003. “Generalized Autoregressive Moving Average Models.” Journal of the American Statistical Association 98: 214–23. https://doi.org/10.1198/016214503388619238.Search in Google Scholar

Bollorslev, T. 1986. “Generalized Autoregressive Conditional Heteroskedasticity.” Journal of Econometrics 31 (3): 307–27.10.1016/0304-4076(86)90063-1Search in Google Scholar

Christou, V., and K. Fokianos. 2015. “On Count Time Series Predictions.” Journal of Statistical Computation and Simulation 85 (2): 357–73. https://doi.org/10.1080/00949655.2013.823612.Search in Google Scholar

Czado, C., T. Gneiting, and L. Held. 2009. “Predictive Model Assessment for Count Data.” Biometrica 65 (4): 1254–61. https://doi.org/10.1111/j.1541-0420.2009.01191.x.Search in Google Scholar PubMed

Fokianos, K. 2012. “Count Time Series Models.” Time Series Analysis: Methods and Applications 30: 315–47.10.1016/B978-0-444-53858-1.00012-0Search in Google Scholar

Gneiting, T., and M. Katzfuss. 2014. “Probabilitistic Forecasting.” Annual Review of Statistics and Its Application 1: 125–51. https://doi.org/10.1146/annurev-statistics-062713-085831.Search in Google Scholar

Gneiting, T., F. Balabdaoui, and A. E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society B 69 (2): 243–68. https://doi.org/10.1111/j.1467-9868.2007.00587.x.Search in Google Scholar

Kokas, D., G. Lopez-Acevedo, A. R. El Lahga, and V. Mendiratta. 2020. How COVID-19 is Impacting Tunisian Household. Washington, DC: World Bank Blogs.Search in Google Scholar

Lambert, D. 1992. “Zero-inflated Poisson Regression Models with an Application to Defects in Manufacturing.” Technometrics 30: 1–14. https://doi.org/10.2307/1269547.Search in Google Scholar

Liboschik, T., K. Fokianos, and R. Fried. 2017. “tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models.” Journal of Statistical Software 82 (5): 1–51. https://doi.org/10.18637/jss.v082.Search in Google Scholar

Liboschik, T., P. Kerschke, K. Fokianos, and R. Fried. 2016. “Modelling Interventions in INGARCH Processes.” International Journal of Computer Mathematics 93 (4): 640–57. https://doi.org/10.1080/00207160.2014.949250.Search in Google Scholar

Mohd Yusoff, M.-I. 2020. “The Use of System Dynamics Methodology in Building a COVID-19 Confirmed Case Model.” Computational and Mathematical Methods in Medicine. https://doi.org/10.1155/2020/9328414.Search in Google Scholar PubMed PubMed Central

Pan, W. 2001. “Akaike’s Information Criterion in Generalized Estimating Equation.” Biometrics 57: 120–5. https://doi.org/10.1111/j.0006-341x.2001.00120.x.Search in Google Scholar PubMed

Phillips, P. C. B., and P. Perron. 1988. “Testing for a Unit Root Time Series Regression.” Biometrika 75 (2): 335–46. https://doi.org/10.1093/biomet/75.2.335.Search in Google Scholar

Said, S. E., and D. A. Dickey. 1984. “Testing for Unit Roots in Autoregressive-Moving Average Models with Unknown Order.” Biometrika 71 (3): 599–607. https://doi.org/10.1093/biomet/71.3.599.Search in Google Scholar

Sathish, V., S. Mukhopadhyay, and R. Tiwari. 2020. ARMA Models for Zero Inflated Count Time Series. Also available at https://arxiv.org/pdf/2004.10732v1.pdf.Search in Google Scholar

Tang, W., H. He, W. J. Wang, and D. G. Chen. 2018. “Untangle the Structural and Random Zeros in Statistical Modeling.” Journal of Applied Statistics 45 (9): 1714–33. https://doi.org/10.1080/02664763.2017.1391180.Search in Google Scholar PubMed PubMed Central

Vuong, Q. H. 1989. “Likelihood Ratio Test for Model Selection and Non-nested Hypotheses.” Econometrica 57 (2): 307–33. https://doi.org/10.2307/1912557.Search in Google Scholar

Warton, D. I. 2005. “Many Zeros Does Not Mean Zero-Inflation: Comparing the Goodness of Fit of Parametric Models to Multivariate Abundance Data.” Environmetrics 16: 275–89. https://doi.org/10.1002/env.702.Search in Google Scholar

Yang, M., J. E. Cavanaugh, and G. K. Zamba. 2015. “State-Space Models for Count Time Series with Excess Zeros.” Statistical Modelling 15: 70–90. https://doi.org/10.1177/1471082x14535530.Search in Google Scholar

Yang, M., G. K. Zamba, and J. E. Cavanaugh. 2013. “Markov Regression Models for Count Time Series with Excess Zeros: A Partial Likelihood Approach.” Statistical Methodology 14: 26–38. https://doi.org/10.1016/j.stamet.2013.02.001.Search in Google Scholar

Yau, K., A. Lee, and P. Carrivick. 2004. “Modeling Zero-Inflated Count Series with Application to Occupational Health.” Computer Methods and Programs in Biomedicine 74 (1): 47–52. https://doi.org/10.1016/s0169-2607(03)00070-1.Search in Google Scholar

Zheng, T., H. Hiao, and R. Chen. 2015. “Generalized ARMA Models with Martingale Difference Errors.” Journal of Econometrics 189 (2): 492–506. https://doi.org/10.1016/j.jeconom.2015.03.040.Search in Google Scholar

Received: 2021-02-27

Accepted: 2021-09-22

Published Online: 2021-10-08

https://doi.org/10.1515/em-2021-0007

Keywords for this article

conditional exponential family distributions; COVID-19; generalized count series models; mean square forecast error; zero-inflated models