Home Business & Economics Analysis of the Number of Tests, the Positivity Rate and Their Dependency Structure During COVID-19 Pandemic
Article Open Access

Analysis of the Number of Tests, the Positivity Rate and Their Dependency Structure During COVID-19 Pandemic

  • Babak Jamshidi EMAIL logo , Hakim Bekrizadeh , Shahriar Jamshidi Zargaran and Mansour Rezaei
Published/Copyright: April 3, 2023
Become an author with De Gruyter Brill

Abstract

Recent advances in medical instruments, information technology, and unprecedented data sharing allowed scientists to investigate, trace, and monitor the COVID-19 pandemic faster than any previous outbreak. This extraordinary speed makes COVID-19 a medical revolution that causes some unprecedented analyses, discussions, and models. Modeling the dependence between the number of tests and the positivity rate is one of these new issues. Using four classes of copulas (Clayton, Frank, Gumbel, and FGM), this study is the first attempt tom model the dependency. The estimation of the parameters of the copulas is obtained using the maximum likelihood method. To evaluate the goodness of fit of the copulas, we calculate AIC. The computations are conducted on Matlab R2015b, R 4.0.3, Maple 2018a, and EasyFit 5.6. Findings indicate that at the beginning of a typical epidemic, the number of tests is relatively low and the proportion of positivity is high. As time passes, the number of tests increases, and the positivity rate decreases. The epidemic peaks are occasions that violate the stated general rule –due to the early growth of the number of tests. Also, during both peak and non-peak times, the rising number of tests is accompanied by decreasing the positivity rate. We find that the proportion of positivity is more proportional than the number of tests to the number of infected cases. Therefore, the changes in the positivity rate can be considered a representative of the level of the spreading. Approaching zero positivity rate is a good criterion to scale the success of a healthcare system in fighting against an epidemic. Accordingly, the number and accuracy of tests can play a vital role in the quality level of epidemic data.

Key messages

  1. In a country, increasing the positivity rate is more informative than increasing the number of tests to warn about an epidemic peak.

  2. Approaching zero positivity rate is a good criterion to scale the success of a healthcare system in fighting against an epidemic.

  3. Except for the first half of the epidemic peaks, in a country, the higher number of tests is associated with a lower positivity rate.

1 Introduction

Dr. Li Wenliang, a 34-year-old ophthalmologist, warned his colleagues and set the alarm to the society about a new infection caused by a type of coronavirus in December 2019 in Wuhan, China (Jamshidi et al. 2020). Shortly after his warning, all over the world encountered this epidemic. WHO declared this fast speeding infection (COVID-19) in March 2020 as a pandemic. As of January 20, 2023, over 600 million cases, and around seven million deaths involving COVID-19 have been reported around the world. The epidemic COVID-19 is the most informative pandemic throughout history. These unprecedented recorded data give rise to some unprecedented concepts, relationships, analyses, discussions, and models (Jamshidi et al. 2021).

Modeling the dependence between the number of tests and the proportion of positivity (positivity rate) is one of these new issues. The proportion of positivity is a critical measure because it gives us an indication of how widespread infection is in the area of interest. The proportion of positivity helps public health officials answer questions such as:

  1. What is the current level of SARS-CoV-2 (coronavirus) transmission in the community?

  2. Are we doing enough testing for the people who are getting infected (Dowdy and D’souza 2020)?

According to the ratio nature, the high proportion of positivity is due to the high number of positive tests or the low number of total tests. Based on the first possibility, a higher positivity rate suggests higher transmission and that there are likely more people with coronavirus in the community who have not been tested yet. On the other hand, according to the second possibility, a high percentage of positivity means that more testing should probably be done. Accordingly, for policymakers, the high value for this parameter suggests either it is not a good time to relax restrictions aimed at reducing transmission, or it is a good time to add restrictions to slow the spread of disease (Dowdy and D’souza 2020). In this regard, an analytic report segregated by regions in the UK was presented by the Office for National Statistics (Coronavirus 2021).

This study is the first paper to highlight the importance of availability of the required tests and to model the effect of shortage of tests on the positivity rate. Our motivation is overcoming this ambiguity and giving a clear insight into the real status of the epidemic in the population. We want to emphasize the importance of fluctuations of positivity rate over the study time. We also aim to show the effect of available tests on these fluctuations. To highlight the meaningfulness of the criterion of interest, we use the data of 12 countries and compare them in terms of the other criteria. This study aims to investigate the time series of positivity rates individually and together with the time series of the number of tests. This investigation is conducted in two analytic methods: regional and temporal. The individual analysis is mainly undertaken based on the peaks of the spreading of the pandemic. For the regional aspect, among the 221 countries, we selected 12 countries: the USA, India, the UK, Italy, Iran, the UAE, Bolivia, Guatemala, Nigeria, Australia, South Korea, and South Africa. The reasons for selecting these 12 countries are

  1. They are the top countries in the influential indices (Table 1).

  2. Some of them are widely different from the others in some indices (Table 1).

  3. Their positivity rates are greatly dispersed (Table 2).

  4. The numbers and time of peaks are different about them (Table 3).

  5. Their quality of healthcare systems are of different levels.

  6. Their data, especially about the number of tests are relatively well recorded.

  7. They are selected from all continents: the USA and Guatemala from North America; Bolivia from South America; India, Iran, the UAE, and South Korea from Asia; the UK and Italy from Europe; Nigeria and South Africa from Africa; and Australia from Australia.

Table 1:

The information on the influential indicators of COVID-19 in the 12 countries of interest.

Country Total cases Total deaths Total tests Cases per million Deaths per million Tests per million Population
USA 26 M (1) 429 K (1) 299 M (1) 77 K (7) 1293 (11) 901 K (19) 333 M (3)
India 11 M (2) 154 K (3) 192 M (2) 7688 (115) 111 (104) 138 K (107) 1387 M (2)
UK 3647 K (5) 98 K (5) 67 M (5) 54 K (25) 1438 (5) 987 K (17) 68 M (21)
Italy 2467 K (8) 85 K (6) 31 M (8) 41 K (36) 1415 (7) 512 K (40) 60 M (24)
Iran 1373 K (16) 57 K (9) 8635 K (20) 16 K (88) 678 (43) 102 K (126) 85 M (18)
UAE 278 K (43) 792 (90) 25 M (12) 28 K (63) 80 (117) 2466 K (5) 10 M (92)
Bolivia 200 K (53) 9923 (33) 514 K (106) 17 K (84) 844 (34) 44 K (150) 12 M (79)
Guatemala 154 K (67) 5465 (43) 738 K (94) 8519 (112) 302 (73) 41 K (153) 18 M (65)
Nigeria 122 K (76) 1504 (80) 1241 K (75) 582 (180) 7 (178) 5939 (194) 209 M (7)
Australia 29 K (107) 909 (88) 13 M (14) 1121 (163) 35 (138) 495 K (41) 26 M (54)
S Korea 75 K (86) 1360 (82) 5284 K (37) 1500 (155) 27 (148) 108 K (121) 51 M (28)
S Africa 1418 K (15) 41 K (14) 7993 K (22) 24 K (75) 703 (42) 137 K (109) 60 M (25)
  1. The numbers in parentheses indicate the rank of countries among 221 countries in the world. For example, Iran has the 18th population among all countries. The bold numbers display the ranks of the second half of the global ranking. The ranks 111 to 221 are considered the second half. For example, regarding cases per million, India is a country of the second half. The light-highlighted cells show that the country is among the highest quarter of the countries based on the relevant parameter. The darker highlighted cells indicate the country is among the top 5% worldwide. For example, the first row illustrates that except for the criterion of the number of tests per million – where it is among the highest quartile –, the USA is of the top 5% in all indicators.

Table 2:

The properties of the datasets of the countries of interest.

Country Lag Positivity rate Start End Number of days
USA 2 8 27 February 2020 27 January 2021 334
India 3 6 3 March 2020 27 January 2021 329
UK 2 5 21 February 2020 25 January 2021 338
Italy 3 8 19 February 2020 27 January 2021 339
Iran 4 16 1 April 2020 26 January 2021 303
UAE 3 1 11 March 2020 26 January 2021 318
Bolivia 5 39 19 March 2020 24 January 2021 308
Guatemala 5 21 14 March 2020 27 January 2021 316
Nigeria 5 10 17 March 2020 27 January 2021 313
Australia 2 0.02 22 March 2020 26 January 2021 308
S Korea 2 0.01 18 February 2020 30 January 2021 347
S Africa 4 2 14 March 2020 30 January 2021 320
Table 3:

The epidemic peaks of COVID-19 in the countries of interest.

Country First peak Second peak Third peak
USA Early April** Second half of July** From November to January 2021**
India Middle September*** -
UK Middle April*** Early November** Late December***
Italy Late March** Early November***
Iran Late March** Second half of November**
UAE Middle May** January 2021**
Bolivia July to August*** Middle January 2021***
Guatemala
Nigeria June to July**# January 2021***#
Australia Late March** Early August***
S Korea February to March*** Second half of August** December***
S Africa July*** December and January 2021***
  1. All the dates belong to 2020. Otherwise, the year is mentioned. (*): Indicated only by the time series of the number of tests. (**): Indicated only by the time series of the proportion of positive tests. (***): Indicated by both time series. (#): After moving average.

Finally, to illustrate the dependency of the number of tests and the positivity rates, we apply copulas.

Sklar introduced the concept of copulas in 1959 (Sklar 1959). A copula – mainly parametric, partially semi-parametric, and rarely non-parametric – is a function that completely describes the dependency structure. It contains all the information to link the marginal distributions to their joint distribution. Accordingly, to obtain a valid multivariate distribution function, it suffices to combine several marginal distribution functions with any candidate for the copula function. Thus, for the purposes of statistical modeling, it is desirable to have a large collection of copulas at one’s disposal. Copula is widely applied in diverse fields, including environmental studies (Corbella and Stretch 2013; Zhang and Singh 2007) finance (Boubaker and Sghaier 2013; Wang et al. 2014), hydrology (Bekrizadeh et al. 2013), and medical studies (Bekrizadeh 2021; Bekrizadeh and Jamshidi 2017; Bekrizadeh et al. 2017; Chatrabgoun et al. 2020; Li and Fang 2012; Roman et al. 2012; Wienke 2011).

2 Data

The main data sources of the paper are the websites Worldometers (Worldometer website) and Our World in Data (Hasell et al. 2020). We summarize and illustrate all the relevant information about the 12 countries in three (twelve-row) tables and three (twelve-partitioned) figures created on Matlab R2015b.

Table 1 includes the key general indicators up to January 25, 2021. It is worth saying that the total indicators or even per-million indicators do not determine the quality of healthcare systems because there are observable underreported statistics about the countries Bolivia, Guatemala, Nigeria, Iran, and even India. On the other hand, we consider the indicator of the number of tests per one million (the 7th column of Table 1) as a criterion representing the level of facilities, therefore the quality of healthcare systems. Based on the information about this criterion, we define the lags (the distance between the test and diagnosis) for the different healthcare systems.

Table 2 represents the underlying properties of any country. As mentioned before, lag is the difference between the time of testing and the time of receiving the results of the tests, positive or negative, in days. The more facilities a healthcare system has, the more tests that system can do – therefore the lower positivity rate it has. Also, the more facilities a healthcare system has, the lower distance is between the tests and results. Based on the concept of lag, we pair the number of tests on the n th day with the number of results on the ( n + l a g ) th day to obtain the dependency structure by using the copulas. The last column is calculated based on the start and end dates of the period of recording data (the fourth and fifth columns) and the lag (the sixth column), and it displays the number of pairs that we use to obtain the dependency structure for each country.

Generally, during an epidemic wave, the number of new infected individuals increases rapidly to an epidemic peak and then falls more gradually until the epidemic wave is over, and the number of new cases be stabilized. Roughly speaking, the epidemic peaks are the – neighborhood of – time points that corresponds the local maximum of the number of newly infected cases.

3 Methods

3.1 Change Point Detection

We define the epidemic peak as the time neighborhood – or the time point – that X t : the number of new confirmed cases on the t th day, exceeds the mean plus three times standard deviation of the last three weeks for at least a week, that is,

t p e a k = { t | X t i > ( m e a n { X t i 21 , X t i 20 , , X t i 1 } + 3 S D { X t i 21 , X t i 20 , , X t i 1 } ) , f o r i = 1 , 2 , 3 , , 7 } .

This definition is derived from the definition of outlier in regression analysis. These epidemic peaks are local maximums. In addition, it is noticeable that the distance between two successive epidemic peaks must be at least one month. According to this definition, the peaks of Table 3 are obtained for the countries of interest. It is remarkable that except for the peaks of Bolivia – which are almost the same –, the later waves are more acute than previous ones. We must add this point that the more acute peak means the more number of new confirmed cases, therefore the more intense spreading. Finally, it is possible that because of the lack of information at the beginning, this definition misses the epidemic peaks in the initial days.

Mathematically and logically, the number of positive tests (confirmed cases) is affected by the number of tests and positivity rate. The number of cases equals the number of tests multiplied by the positivity rate. Therefore, the increment of the number of cases (as a multiplication) equals the sum of these two:

  1. The number of tests multiplied by the increment of positivity rate, and

  2. The positivity rate multiplied by the increment of the number of tests.

Consequently, the intense changes in the count of cases are due to at least a remarkable change in one of these multiplications. About the countries with a regular increase in the number of tests like the USA, the increment of the proportion of positivity plays the principal role in the peaks.

Table 3 shows that the proportion of positivity is significantly better than the frequency of tests to indicate the peaks of the pandemic. The positivity rate is more associated with the number of cases than the number of tests (90% vs 45%). After moving average, these proportions reach 100% and 50%, respectively.

Countries of the southern and northern hemispheres faced a peak around July and November, respectively, possibly due to falling temperatures.

Figures 1, 2, and 3 consist of 12 subfigures, each of them belonging to one country. The arrangement of the subfigures in all three figures is identical. The horizontal axes in Figures 1 and 3 represent time in days from the start to the end of the period of study for the studied countries (the fourth and fifth columns of Table 2). The vertical axis of Figures 1 and 2 display the number of new tests – conducted on that day – and the proportion of positive tests – reported that day –, respectively. Figure 3 is the plot of the joint distribution of the number of tests on a day and the proportion of positivity on l a g days later.

Figure 1: 
The time series of the number of new tests (daily) in the 12 countries. The peaks of the number of tests coincide with the epidemic peaks of COVID-19 in different countries. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2). r: row & c: column.
Figure 1:

The time series of the number of new tests (daily) in the 12 countries. The peaks of the number of tests coincide with the epidemic peaks of COVID-19 in different countries. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2). r: row & c: column.

Figure 2: 
The time series of the positivity rate (daily) in the 12 countries. The peaks of the positivity rate coincide with the epidemic peaks of COVID-19 in different countries. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2). r: row & c: column.
Figure 2:

The time series of the positivity rate (daily) in the 12 countries. The peaks of the positivity rate coincide with the epidemic peaks of COVID-19 in different countries. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2). r: row & c: column.

Figure 3: 
Scatterplots of the relationship between the number of tests and the positivity rate. Generally, as the number of new tests increases, the positivity rate falls. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2).
Figure 3:

Scatterplots of the relationship between the number of tests and the positivity rate. Generally, as the number of new tests increases, the positivity rate falls. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2).

Figure 1 shows that the peaks of the number of tests coincide with the epidemic peaks of COVID-19 in different countries. For example, in the USA, there are two peaks of the number of tests simultaneous with the second and the third epidemic peaks – mentioned in Table 3. Also, it is obvious that Bolivia has experienced two peaks for the number of tests around 150th and 300th days – from March 19, 2020 – which coincide with the epidemic peaks in Table 3.

The USA, the UK, and the UAE experienced some regularly rising time series. Except for some overruns in epidemic peaks, the patterns of Italy and South Africa are increasing too. The number of tests in Guatemala is increasing, accompanied by an increasing fluctuation. Owing to the restriction by the limited capacity of tests, Iran and Nigeria followed a stepwise trend. Apart from their peaks – one for each of them –, the plots of Australia and South Korea are stationary. In the case of Bolivia, the time series is proportional to the peaks. India is the only country whose time series is initially increasing, then stable, and after that decreasing. Generally, the counties have an increasing trend.

Figure 4 gives us a classification about the countries from the viewpoint of the number of tests: 1. The USA, 2. India, 3. The UK, 4. Italy, Australia, and the UAE, 5. South Korea, South Africa, and Iran, and 6. Nigeria, Guatemala, and Bolivia.

Figure 4: 
The time series of the number of new tests (daily) in the 12 countries. The USA, India, and the UK had the greatest number of tests.
Figure 4:

The time series of the number of new tests (daily) in the 12 countries. The USA, India, and the UK had the greatest number of tests.

Figure 2 illustrates the time series of the positivity rate of the tests (the ratio of the number of positive tests on a day to the number of taken tests on l a g days ago). It is interesting that the subfigures of Figure 2 are more in accordance with the epidemic peaks than their analogous in Figure 1. For example, it is clear that the USA has encountered three peaks. It is worthwhile that the graph of Iran has three peaks while the first of them is missed in Table 3 because of the lack of information at the beginning. A similar situation (being missed by investigation of either the number of tests or the number of confirmed cases while discovered by the analysis of the positivity rate) happens to the epidemic peak in India in late March, the first and the second peaks of the UK, and the epidemic peaks in middle May and the November for the UAE.

Figure 5 illustrates a classification of the countries based on the positivity rate: 1. Nigeria, Guatemala, and Bolivia, 2. South Africa, and Iran, 3. The USA, India, the UK, and Italy, and 4. Australia, the UAE, and South Korea.

Figure 5: 
The time series of the positivity rate (daily) in the 12 countries. Nigeria, Guatemala, and Bolivia had the greatest positivity rates among the 12 countries.
Figure 5:

The time series of the positivity rate (daily) in the 12 countries. Nigeria, Guatemala, and Bolivia had the greatest positivity rates among the 12 countries.

The horizontal and vertical axes of Figure 3 display the number of new tests and the proportion of positivity of them, respectively. Generally, as the number of new tests increases, the positivity rate falls. Since the epidemic peaks are opposing this general rule, it is not very clear to see the opposite direction of the changes. Guatemala, due to lack of epidemic peak, is a good example of this diversely proportional relationship.

If the reason for an increase be the rising number of tests, we expect not to return the previous channel in short term. In addition, the positivity rate does not undertake a remarkable change. On the other hand, it is normal to assume that entering a peak is accompanied by increasing the number of negative tests as well. Consequently, the lack of the growth of negative test results (rising the positivity rate while continuing the previous trend for the frequency of tests) is only reasonable if at least one of the factors of tests accuracy, testing policy, or the viewpoint of the population were changed around that time. Otherwise, there are a remarkable number of un-reported cases belonging the peak. It is noticeable that this company of risings causes the observed acceleration in growth regarding epidemic peaks.

3.2 Copulas

3.2.1 Definition

Copulas are functions that connect multivariate distribution functions to their one-dimensional marginal distribution functions -uniform on the interval [0, 1]. Mathematically speaking, if H is a bivariate distribution function with margins F(X) and G(Y), there must exist a copula C such that H θ (X,Y) = C(F(X),G(Y);θ)), where θ is introduced as the dependence parameter (Sklar 1959). Accordingly, Copula is mostly defined as a function C:[0,1]2 → [0,1] that satisfies boundary conditions:

( P 1 ) C ( x , 0 ) = C ( 0 , x ) = 0 a n d C ( x , 1 ) = C ( 1 , x ) = x , x [ 0,1 ] ,
( P 2 ) ( s 1 , s 2 , t 1 , t 2 ) [ 0,1 ] 4 , s u c h t h a t s 1 s 2 a n d t 1 t 2
C ( s 2 , t 2 ) C ( s 2 , t 1 ) C ( s 1 , t 2 ) + C ( s 1 , t 1 ) 0 .

Eventually, for twice differentiable function C, 2-increasing property (P2) can be replaced by the condition

C ( s , t ) = 2 C ( s , t ) s t 0 ,

where C(s,t) is the so-called copula density. A copula C is symmetric if C(s,t) = C(t,s), for every ( s , t ) [ 0,1 ] 2 , otherwise C is asymmetric. The most well-known, powerful, and applicable copulas are:

  1. FGM copula (Farlie 1960; Morgenstern 1956);

C F G M ( s , t ) = s t ( 1 + θ ( 1 s ) ( 1 t ) ) , θ [ 1 , + 1 ] , ( s , t ) [ 0,1 ] 2
  1. Clayton copula (Clayton 1978);

C C l a y t o n ( s , t ) = ( s β + t β 1 ) 1 β , β ( 0 , + ) , ( s , t ) [ 0,1 ] 2
  1. Frank copula (Genest 1987);

C F r a n k ( s , t ) = 1 α ln ( 1 + ( e α s 1 ) ( e α t 1 ) ( e α 1 ) ) , α ( , + ) , ( s , t ) [ 0,1 ] 2 , a n d
  1. Gumbel copula (Gumbel 1960);

C G u m b e l ( s , t ) = exp ( [ ( ln ( s ) ) σ + ( ln ( s ) ) σ ] 1 σ ) , σ [ 1 , + ) , ( s , t ) [ 0,1 ] 2 .

The parameters of the marginal and copula distributions are estimated using the maximum likelihood method. The computations and illustrations regarding copula theory are conducted in software Maple, R 4.0.3, Maple 2018a, and EasyFit 5.6.

3.2.2 Copula Vs Correlation Coefficient

Measures of dependence are common instruments to summarize a complicated dependence structure in the bivariate case. Pearson’s, Spearman’s rho, and Kendall’s tau correlation coefficients are common statistical measures of dependence structure (Kendall 1970; Pearson 1895; Spearman 1904). The correlation comes in trouble when the random variables are not elliptically distributed. The performance of the copula does not depend on the fact that if you are dealing with elliptical distributions or not. The Pearson’s linear correlation measure (−1 ≤ r ≤ 1) is the most popular and well-known measure between pairwise random variables. Despite its simplicity and plain rationale, Embrechts et al. (2001) noted that ρ is simply a measure of the dependency of elliptical distributions, such as the binormal distribution (the marginals are normally distributed, linked by the Gaussian copula). Moreover, ρ measures a linear relationship itself and does not capture a non-linear one on its own, as noted in (Priest 2003). These properties constitute obvious limitations for modeling the dependency structure. In addition, copulas could be useful to define nonparametric measures of dependence between random variables. Since the values of Kendall’s tau are easy to calculate, this measure is used for observation dependencies. If F(X) and G(Y) are continuous then C(s,t) is unique, else C(s,t) is uniquely determined on the range of F(X) × range of G(Y).

One standard non-parametric dependence measures Kendall’s τk is expressed in the copula form as:

τ k = 4 0 1 0 1 c ( u , v ) C ( u , v ) d u d v 1

The parameter copula is estimated and the relationship between parameter copula and τk is given in the last column of Table 4. The parameter copula in each case measures the degree of dependence and controls the association between two variables. When the parameter approaches 0 there is no dependence, and if the parameter tends to infinity there is a perfect dependence. Schweizer and Wolff (1981) showed that the dependence parameter copula, which characterizes each family of copulas can be related to Kendall’s τk. Therefore, copulas allow modeling both linear and non-linear dependence. Using copulas, regardless of marginal distributions, can model extreme endpoints.

Table 4:

Kendall’s tau of copula function.

Copula Parameter space Kendall’s tau
FGM θ ∈ [−1,+1] τ k  = 2θ/9
Clayton β ∈ (0,+∞) τ k  = β/(β + 2)
Frank α ∈ (−∞,+∞) τ k = 1 + 4 D ( α ) / α , D ( α ) = 1 α 0 x e x 1 d x 1
Gumbel σ ∈ [1,+∞) τ k  = (σ − 1)/σ

3.2.3 Copula Vs Regression

Regression analysis is a statistical method for investigating the relationships between some dependent and some independent variables. The basic form of the regression analysis, ordinary least squares is not suitable for some applications because the relationships are often nonlinear and the probability distribution of the response variable may be non-Gaussian.

The major advantage of copula regression is that there are no restrictions on the probability distributions that can be used. The copula regression is the most appropriate method in non-Gaussian (no need for normality assumption) regression model fitting. Copula functions, connecting the marginal distributions to their joint distributions, are useful in simulating the linear or nonlinear relationships among multivariate data. Copula is a multivariate distribution function with marginally uniform random variables on [0, 1] (the PDF[1] of the CDF[2]). Copula functions have some appealing properties such as they allow scale-free measures of dependence and are useful in constructing families of joint distributions.

4 Results

The presumptions to apply copula theory for a couple of variables are the existence of continuous marginal distributions accompanied with their correlation. Table 5 investigates whether the pair of the frequency of the tests and positivity rate meets the presumptions. The marginal distributions were obtained in EasyFit. It is observable that the generalized Pareto and Weibull distributions had good performance to fit the positivity rates. Also, the correlation in countries with the highest number of tests is negative and it is commonly between −0.2 and −0.3. In countries lacking enough tests, the correlation coefficient is significantly greater – possibly due to the low quality of data and under-reporting. It is noticeable that calculation over the data of Bolivia, Iran, and South Africa, lead even to positive correlations.

Table 5:

The results of fit distribution to data.

Country Frequency of tests data Positivity proportion data Correlation
Marginal Parameters K-S Test Marginal Parameters K-S Test r P-value
Statistic P-value Statistic P-value
USA Rayleigh σ = 864763

γ = −195972
0.0654 0.11266 Gen. Pareto k = −0.14127

σ = 0.06493

μ = 0.03461
0.04538 0.48317 −0.134 0.014
India Logistic σ = 252223

μ = 57602
0.04919 0.19214 Weibull α = 1.8788

β = 0.07082
0.04234 0.58217 −0.236 < 0.01
UK Gen. Pareto k = −0.37332

σ = 280831

μ = −14012
0.05684 0.2165 Weibull (3p) α = 0.78036

β = 0.06417

γ = −0.00833
0.07206 0.05687 −0.213 < 0.01
Italy Log-Logistic (3P) α = 2.6282

β = 87799.0

γ = −16892.0
0.06078 0.15679 Weibull (3p) α = 0.86458

β = 0.07299

γ = −0.00238
0.07386 0.0521 −0.001 0.986
Iran Log-Logistic (3P) α = 8.1298

β = 56499.0

γ = −29429.0
0.00736 0.05101 Burr k = 0.16689

α = 13.051

β = 0.08904
0.05516 0.3040 0.123 0.032
UAE Weibull α = 1.5811

β = 84164.0
0.05992 0.19016 Log-Logistic α = 3.0628

β = 0.01041
0.07394 0.05619 −0.001 0.854
Bolivia Gumbel Max σ = 923.72

μ = 1072.6
0.04872 0.44386 Beta α1 = 1.3627

α2 = 2.9923
0.0332 0.87483 0.189 0.001
Guatemala Dagum k = 0.0587

α = 10.772

β = 5929.2
0.0707 0.08088 Gamma α = 1.9352

β = 0.13129
0.03456 0.8318 −0.329 < 0.01
Nigeria Log-Logistic α = 2.3097

β = 27364

γ = −510.36
0.04696 0.48053 Weibull α = 1.2881

β = 0.2093
0.03527 0.81772 −0.371 < 0.01
Australia Logistic σ = 11052.0

μ = 40909.0
0.04808 0.46066 Frechet β = 0.0043

α = 0.77645
0.05106 0.38529 −0.269 < 0.01
S Korea Burr k = 0.53449

α = 3.5829

β = 8601.9
0.05161 0.30334 Gen. Pareto k = 0.18396

σ = 0.01109

μ = 0.00924
0.03947 0.63718 −0.005 0.926
S Africa Log-Logistic (3P) α = 4.5531

β = 40005.0

γ = −17075.0
0.03938 0.68861 Gen. Pareto k = −0.13432

σ = 0.14687

μ = 0.01868
0.02748 0.96362 0.405 < 0.01
  1. K–S, Kolmogorov–Smirnov; 3p, 3-parameter. The highlighted rows indicate that the correlation are not significant for those countries.

Based on Table 5, we are allowed to look for the suitable copula functions to connect the marginal distributions to find the desired joint distributions for nine of the countries. Notice that the countries without meaningful correlation (Italy, South Korea, and the UAE) were of the countries with the least proportion of positivity of the tests.

Table 6 represents the results of comparing the best candidates from the FGM, Clayton, Frank, and Gumbel families.

Table 6:

The obtained copula to fit the dependency and their performances.

Country Model MLE of θ Kendall’s tau AIC
USA FGM copula −0.47285 −0.1051 −663.3515
India Frank copula −0.77241 −0.1876 −660.0874
UK Frank copula −0.75843 −0.1624 −658.2413
Iran Clayton copula 0.28941 0.1264 −559.8742
Bolivia Clayton copula 0.37651 0.1584 −661.2521
Guatemala Frank copula −0.95054 −0.2743 −663.3011
Nigeria Frank copula −0.84251 −0.3221 −663.2462
Australia Frank copula −0.81262 −0.2138 −662.1021
South Africa Clayton copula 0.46723 0.1894 −664.7824

According to Table 6, Clayton copulas are suitable candidates for the countries with lower per million tests. In addition, Frank copulas can describe a wide variety of countries. Finally, the Gumbel family seems not to be a good option to couple the variables of the frequency of tests and the positivity rate.

We now discuss the simulation of data from the obtained copula models and perform comparisons between correlations in the simulated data and in the observed data based on 1000 simulations. We follow the simulation method proposed by Johnson (1987) and later Nelson (2006).

Figure 6 illustrates the scatter plots of the transformed observed data versus simulated samples of the CDFs of the frequency of tests and positivity proportion variables taken from the fitted copula models in Table 6. It can be seen that the simulated data and the original data have similar dependence patterns. To settle this concern, Table 6 shows the rank correlations between the frequency of tests and positivity proportion variables calculated from the original data and the simulated data of size 1000 taken from the fitted copula models. By comparing these correlations, we can conclude that the results show strong consistency of the estimated and real correlations.

Figure 6: 
Scatter plots of the transformed observed values (·) versus simulated samples (∗) variables from subfamilies of the copula model. USA (r1 c1), India (r1 c2), UK (r1 c3), Iran (r2 c1), Bolivia (r2 c2), Guatemala (r2 c3), Nigeria (r3 c1), Australia (r3 c2), and South Africa (r3 c3). r: row & c: column.
Figure 6:

Scatter plots of the transformed observed values (·) versus simulated samples (∗) variables from subfamilies of the copula model. USA (r1 c1), India (r1 c2), UK (r1 c3), Iran (r2 c1), Bolivia (r2 c2), Guatemala (r2 c3), Nigeria (r3 c1), Australia (r3 c2), and South Africa (r3 c3). r: row & c: column.

Finally, we want to investigate the structure of dependency between the number of tests and positivity rate totally. By collecting the data of the 12 countries, 3877 pairs are obtained whose Kendall’s correlation is −0.1434 (P-value: 2.8464 * 10ˆ−19). In addition, we split the data into two parts: peaks and otherwise. This split restricted us to applying marginal distributions – then copulas – because it causes the gap in the number of tests. Table 7 represents the Kendall’s correlations for the countries of interest. It is worth saying that the correlation coefficient for the variables (the number of tests and positivity rate) is negative in both peaks and otherwise.

Table 7:

The correlation between the number of tests and the positivity rate regarding all countries separated based on the peaks.

Country Kendall’s tau after removing peaks P-value Kendall’s tau for peaks P-value
USA −0.0168 0.8113 −0.4104 <0.0001
India −0.2410 0.0001 0.0993 0.3936
UK 0.2496 0.0015 −0.7017 <0.0001
Italy 0.3309 <0.0001 −0.5127 <0.0001
Iran 0.0779 0.2354 0.1574 0.1898
UAE 0.0387 0.5348 −0.1621 0.2081
Bolivia 0.3028 <0.0001 −0.2402 0.0119
Guatemala −0.2946 <0.0001
Nigeria 0.4474 <0.0001 −0.4197 <0.0001
Australia −0.2337 0.0003 −0.6295 <0.0001
South Korea 0.3134 <0.0001 −0.7214 <0.0001
South Africa 0.1203 0.0744 −0.4517 <0.0001
Total −0.1381 <0.0001 −0.2132 <0.0001
  1. Light or dark bolded figures indicate that the coefficient correlation is significantly positive or negative, respectively.

5 Discussion

Generally, at the beginning of an epidemic, the number of tests is low and the proportion of positivity is high. As time passes, the number of tests rises. Also, as the number of new tests increases, the positivity rate falls. The correlation in countries with high number of tests, higher quality of data, is negative and it is commonly between −0.2 and −0.3. By considering all the data as a set, the Kendall’s coefficients are −0.1434, −0.2132, and −0.1381 for total, peaks, and total after removing peaks, respectively. The positivity rate of the tests is significantly better than the frequency of tests to indicate the peaks of the pandemic. The positivity rate is more associated with the number of cases than the number of tests (90% vs 45%).

The proportion of positivity is more proportional than the number of tests to the number of infected cases. Approaching zero positivity rate is a good criterion to scale the success of a healthcare system in fighting against an epidemic. The number and accuracy of tests can play a vital role in the quality level of epidemic data. The policymakers can consider the factors affecting the positivity rate such as the testing policy, restricted facilities, peaks, fluctuations, and so on, and make decisions to prevent misleading because of them. Considering these factors altogether and inputting them in the models like (Khajanchi et al. 2022, 2021; Mondal and Khajanchi 2022) is the correct way to assess function of any intervention of interest.

The first limitation is the low quality of data for some countries because of the restricted facilities, the low number of tests, and non-organized data collection program. Also, some interpolation and moving average methods were applied to find some missing data regarding the countries of interest and calculating the correlation for the countries with poor data. Out of the 12 countries, Iran, South Africa, Nigeria, Bolivia, and Guatemala have got a relatively restricted number of tests. The data of Italy, the UAE, and South Korea showed no significant correlation. The highest quality and most significant correlations belong to the USA, India, the UK, and Australia.

The present approach using copulas is promising since it allows to take into account a wide range of correlation, frequently observed in medical studies. In fact, the classical multivariate models cannot reproduce all type of correlations. Moreover, the standard models are limited, especially because the choice of the marginal distributions is restricted. The crucial step in the modeling process is the choice of the copula function, which best fits the data. Further work is needed to choose the best copulas able to reproduce the dependence structure of bivariate medical variables. In clinical trials or medical studies, sample size is often an important consideration and is relatively small. The copula-based methodology overcomes this limitation as well, because the algorithm can be used to replicate data for any number of patients. The suggested copula-based methodology presented in this paper is simple and easy to implement.


Corresponding author: Babak Jamshidi, KiTEC, School of Biomedical Engineering & Imaging Sciences, King’s College London, London, SE1 7EU, UK, E-mail:

  1. Research Funding: None to report.

  2. Author contribution: BJ: Idea, Literature, Data, Methods, Programming, Interpretation, First draft. HB: Literature, Methods, Interpretation, Revision. SJZ: Data, Literature, Programming. MR: Design, Final manuscript.

  3. Declaration of Interest Statement: We have no conflict of interest to declare.

References

Boubaker, H., and N. Sghaier. 2013. “Portfolio Optimization in the Presence of Dependent Financial Returns with Long Memory: A Copula Based Approach.” Journal of Banking & Finance 37 (2): 361–77, https://doi.org/10.1016/j.jbankfin.2012.09.006.Search in Google Scholar

Bekrizadeh, H., G. A. Parham, and M. R. Zadkarmi. 2013. “Weighted Clayton Copulas and Their Characterizations: Application to Probable Modeling of the Hydrology Data.” Journal of Data Science 11: 293–303, https://doi.org/10.6339/jds.2013.11(2).1084.Search in Google Scholar

Bekrizadeh, H., and B. Jamshidi. 2017. “A New Class of Bivariate Copulas: Dependence Measures and Properties.” Metron 75: 31–50, https://doi.org/10.1007/s40300-017-0107-1.Search in Google Scholar

Bekrizadeh, H., G. A. Parham, and M. R. Zadkarmi. 2017. “A New Asymmetric Class of Bivariate Copulas for Modeling Dependence.” Communications in Statistics – Simulation and Computation 46 (7): 5594–609, https://doi.org/10.1080/03610918.2016.1169292.Search in Google Scholar

Bekrizadeh, H. 2021. “Generalized Family of Copulas: Definition and Properties.” Thailand Statistician 19 (1): 163–78.10.1080/03610918.2022.2032156Search in Google Scholar

Coronavirus (COVID-19) Infection Survey, UK: 8 January 2021. Also available at https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/coronaviruscovid19infectionsurveypilot/8january2021.Search in Google Scholar

Corbella, S., and D. D. Stretch. 2013. “Simulating a Multivariate Sea Storm Using Archimedean Copulas.” Coastal Engineering 76: 68–78. https://doi.org/10.1016/j.coastaleng.2013.01.011.Search in Google Scholar

Chatrabgoun, O., Hosseinian-far, and A. Daneshkhah. 2020. “Constructing Gene Regulatory Networks from Microarray Data Using Non-gaussian Pair-Copula Graphical Models.” Journal of Bioinformatics and Computational Biology 47 (59): 1–19.10.1142/S0219720020500237Search in Google Scholar

Clayton, D. G. 1978. “A Model for Association in Bivariate Life Tables and its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence.” Biometrika 65 (1): 141–51. https://doi.org/10.1093/biomet/65.1.141.Search in Google Scholar

Dowdy, D., and G. D’souza. 2020. Covid-19 Testing: Understanding the “Percent Positive”. August 10. Also available at https://www.jhsph.edu/covid-19/articles/covid-19-testing-understanding-the-percent-positive.html.Search in Google Scholar

Embrechts, P., F. Lindskog, and A. McNeil. 2001. Modelling Dependence with Copulas and Applications to Risk Management. ETH Zurich: Department of Mathematics.Search in Google Scholar

Farlie, D. G. J. 1960. “The Performance of Some Correlation Coefficients for a General Bivariate Distribution.” Biometrika 47: 307–23. https://doi.org/10.2307/2333302.Search in Google Scholar

Genest, C. 1987. “Frank’s Family of Bivariate Distributions.” Biometrika 74: 549–55. https://doi.org/10.1093/biomet/74.3.549.Search in Google Scholar

Gumbel, E. J. 1960. “Bivariate Exponential Distributions.” Journal of the American Statistical Association 55: 698–707. https://doi.org/10.1080/01621459.1960.10483368.Search in Google Scholar

Hasell, J., E. Mathieu, D. Beltekian, B. Macdonald, C. Giattino, E. Ortiz-Ospina, M. Roser, and H. Ritchie. 2020. “A Cross-Country Database of COVID-19 Testing.” Scientific Data 7: 345. https://doi.org/10.1038/s41597-020-00688-8. https://ourworldindata.org/coronavirus-testing.Search in Google Scholar

Jamshidi, B., M. Rezaei, S. J. Zargaran, and F. Najafi. 2020. “Mathematical Modeling the Epicenters of Coronavirus Disease-2019 (COVID-19) Pandemic.” Epidemiologic Methods 9 (s1): 20200009. https://doi.org/10.1515/em-2020-0009.Search in Google Scholar

Jamshidi, B., H. Bekrizadeh, S. J. Zargaran, M. Rezaei, and F. Najafi. 2021. “Comparing Length of Hospital Stay during COVID-19 Pandemic in the USA, Italy, and Germany.” International Journal for Quality in Health Care 33 (1): 1–11.10.1093/intqhc/mzab050Search in Google Scholar

Johnson, M. E. 1987. Multivariate Statistical Simulation. Hoboken: John Wiley.10.1002/9781118150740Search in Google Scholar

Kendall, M. G. 1970. Rank Correlation Methods. London: Griffin.Search in Google Scholar

Khajanchi, S., K. Sarkar, and S. Banerjee. 2022. “Modeling the Dynamics of COVID-19 Pandemic with Implementation of Intervention Strategies.” European Physics Journal Plus 137: 129. https://doi.org/10.1140/epjp/s13360-022-02347-w.Search in Google Scholar

Khajanchi, S., K. Sarkar, J. Mondal, K. S. Nisar, and S. F. Abdelwahab. 2021. “Mathematical Modeling of the COVID-19 Pandemic with Intervention Strategies.” Results in Physics 25: 104285. https://doi.org/10.1016/j.rinp.2021.104285.Search in Google Scholar

Li, X., and R. Fang. 2012. “A New Family of Bivariate Copulas Generated by Univariate Distributions.” Journal of Data Science 10: 1–17. https://doi.org/10.6339/jds.201201_10(1).0001.Search in Google Scholar

Morgenstern, D. 1956. “Einfache Beispiele Zweidimensionaler Verteilungen.” Mitteilungsblatt fürMathematische Statistik 8: 234–5.Search in Google Scholar

Mondal, J., and S. Khajanchi. 2022. “Mathematical Modeling and Optimal Intervention Strategies of the COVID-19 Outbreak.” Nonlinear Dynamics 109 (1): 177–202. https://doi.org/10.1007/s11071-022-07235-7.Search in Google Scholar

Nelson, R. 2006. An Introduction to Copulas. New York: Springer-Verlag.Search in Google Scholar

Pearson, K. 1895. “Notes on regression and inheritance in the case of two parents.” Proceedings of the Royal Society of London 58: 240–2.10.1098/rspl.1895.0041Search in Google Scholar

Priest, C. 2003. Correlations: What They Mean and More Importantly What They Don’t Mean. Sydney: The Institute of Actuaries of Australia Biennial Convention.Search in Google Scholar

Roman, M., F. Louzada, V. G. Cancho, and J. G. Leite. 2012. “A New Long-Term Survival Distribution for Cancer Data [Internet].” Journal of Data Science 10 (2): 241–58. http://www.jds-online.com/volume-10-number-2-april-2012.10.6339/JDS.201204_10(2).0005Search in Google Scholar

Sklar, A. 1959. Fonctions de répartition à n dimensions et leurs marges, Vol. 8, 229–31. Paris: Publications de L’Institute Statistical University Paris.Search in Google Scholar

Spearman, C. 1904. “The Proof and Measurement of Association between Two Things.” American Journal of Psychology 15 (1): 72–101. https://doi.org/10.2307/1412159.Search in Google Scholar

Schweizer, B., and E. F. Wolff. 1981. “On Nonparametric Measures of Dependence for Random Variables.” Annals of Statistics 9: 879–85. https://doi.org/10.1214/aos/1176345528.Search in Google Scholar

Wang, G. J., C. Xie, P. Zhang, F. Han, and S. Chen. 2014. “Dynamics of Foreign Exchange Networks: A Time-Varying Copula Approach,” Discrete Dynamics in Nature and Society. Article ID 170921, 11 p. https://doi.org/10.1155/2014/170921.Search in Google Scholar

Wienke, A. 2011. Frailty Models in Survival Analysis. New York: Chapman & Hall/CRC biostatistics series.Search in Google Scholar

Worldometer Website. Also available at https://www.worldometers.info/coronavirus/#countriesSearch in Google Scholar

Zhang, L., and V. P. Singh. 2007. “Bivariate Rainfall Frequency Distributions Using Archimedean Copulas.” Journal of Hydrology 332: 93–109. https://doi.org/10.1016/j.jhydrol.2006.06.033.Search in Google Scholar

Received: 2023-01-04
Accepted: 2023-03-20
Published Online: 2023-04-03
Published in Print: 2023-06-27

© 2023 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 29.1.2026 from https://www.degruyterbrill.com/document/doi/10.1515/spp-2023-0002/html
Scroll to top button