Analysis of the Number of Tests, the Positivity Rate and Their Dependency Structure During COVID-19 Pandemic

Babak Jamshidi; Hakim Bekrizadeh; Shahriar Jamshidi Zargaran; Mansour Rezaei

doi:10.1515/spp-2023-0002

Article Open Access

Published/Copyright: April 3, 2023

From the journal Statistics, Politics and Policy Volume 14 Issue 2

Abstract

Recent advances in medical instruments, information technology, and unprecedented data sharing allowed scientists to investigate, trace, and monitor the COVID-19 pandemic faster than any previous outbreak. This extraordinary speed makes COVID-19 a medical revolution that has prompted unprecedented analyses, discussions, and models. Modeling the dependence between the number of tests and the positivity rate is one of these new issues. Using four classes of copulas (Clayton, Frank, Gumbel, and FGM), this study is the first attempt to model this dependency. The parameters of the copulas are estimated by the maximum likelihood method. To evaluate the goodness of fit of the copulas, we calculate the AIC. The computations are conducted in Matlab R2015b, R 4.0.3, Maple 2018a, and EasyFit 5.6. The findings indicate that at the beginning of a typical epidemic, the number of tests is relatively low and the positivity rate is high. As time passes, the number of tests increases and the positivity rate decreases. The epidemic peaks are occasions that violate this general rule, owing to the early growth of the number of tests. Also, during both peak and non-peak times, a rising number of tests is accompanied by a decreasing positivity rate. We find that the positivity rate is more closely proportional to the number of infected cases than the number of tests is. Therefore, changes in the positivity rate can be considered representative of the level of spread. A positivity rate approaching zero is a good criterion for gauging the success of a healthcare system in fighting an epidemic. Accordingly, the number and accuracy of tests can play a vital role in the quality of epidemic data.

Keywords: dependence; number of tests; copula; positivity; peak; correlation

Key messages

Within a country, a rising positivity rate is a more informative warning of an epidemic peak than a rising number of tests.
A positivity rate approaching zero is a good criterion for gauging the success of a healthcare system in fighting an epidemic.
Except during the first half of the epidemic peaks, a higher number of tests within a country is associated with a lower positivity rate.

1 Introduction

Dr. Li Wenliang, a 34-year-old ophthalmologist, warned his colleagues and alerted society to a new infection caused by a type of coronavirus in December 2019 in Wuhan, China (Jamshidi et al. 2020). Shortly after his warning, the whole world encountered this epidemic. In March 2020, the WHO declared this fast-spreading infection (COVID-19) a pandemic. As of January 20, 2023, over 600 million cases and around seven million deaths involving COVID-19 have been reported around the world. COVID-19 is the most informative pandemic in history. These unprecedented recorded data give rise to some unprecedented concepts, relationships, analyses, discussions, and models (Jamshidi et al. 2021).

Modeling the dependence between the number of tests and the proportion of positivity (positivity rate) is one of these new issues. The proportion of positivity is a critical measure because it gives us an indication of how widespread infection is in the area of interest. The proportion of positivity helps public health officials answer questions such as:

What is the current level of SARS-CoV-2 (coronavirus) transmission in the community?
Are we doing enough testing for the people who are getting infected (Dowdy and D’souza 2020)?

Because the positivity rate is a ratio, a high value results from either a high number of positive tests or a low number of total tests. Under the first possibility, a higher positivity rate suggests higher transmission and that there are likely more people with coronavirus in the community who have not been tested yet. Under the second possibility, a high positivity rate means that more testing should probably be done. Accordingly, for policymakers, a high value of this parameter suggests either that it is not a good time to relax restrictions aimed at reducing transmission, or that it is a good time to add restrictions to slow the spread of disease (Dowdy and D’souza 2020). In this regard, an analytic report broken down by region in the UK was presented by the Office for National Statistics (Coronavirus 2021).

This study is the first to highlight the importance of the availability of the required tests and to model the effect of a shortage of tests on the positivity rate. Our motivation is to overcome this ambiguity and to give a clear insight into the real status of the epidemic in the population. We want to emphasize the importance of fluctuations in the positivity rate over the study period, and we also aim to show the effect of the available tests on these fluctuations. To highlight the meaningfulness of the criterion of interest, we use the data of 12 countries and compare them in terms of the other criteria. This study investigates the time series of positivity rates both individually and together with the time series of the number of tests. The investigation is conducted from two analytic perspectives: regional and temporal. The individual analysis is mainly based on the peaks in the spread of the pandemic. For the regional aspect, among the 221 countries, we selected 12: the USA, India, the UK, Italy, Iran, the UAE, Bolivia, Guatemala, Nigeria, Australia, South Korea, and South Africa. The reasons for selecting these 12 countries are:

They are the top countries in the influential indices (Table 1).
Some of them are widely different from the others in some indices (Table 1).
Their positivity rates are greatly dispersed (Table 2).
The number and timing of their peaks differ (Table 3).
The quality of their healthcare systems varies.
Their data, especially on the number of tests, are relatively well recorded.
They are selected from all continents: the USA and Guatemala from North America; Bolivia from South America; India, Iran, the UAE, and South Korea from Asia; the UK and Italy from Europe; Nigeria and South Africa from Africa; and Australia from Australia.

Table 1:

The information on the influential indicators of COVID-19 in the 12 countries of interest.

Country	Total cases	Total deaths	Total tests	Cases per million	Deaths per million	Tests per million	Population
USA	26 M (1)	429 K (1)	299 M (1)	77 K (7)	1293 (11)	901 K (19)	333 M (3)
India	11 M (2)	154 K (3)	192 M (2)	7688 (115)	111 (104)	138 K (107)	1387 M (2)
UK	3647 K (5)	98 K (5)	67 M (5)	54 K (25)	1438 (5)	987 K (17)	68 M (21)
Italy	2467 K (8)	85 K (6)	31 M (8)	41 K (36)	1415 (7)	512 K (40)	60 M (24)
Iran	1373 K (16)	57 K (9)	8635 K (20)	16 K (88)	678 (43)	102 K (126)	85 M (18)
UAE	278 K (43)	792 (90)	25 M (12)	28 K (63)	80 (117)	2466 K (5)	10 M (92)
Bolivia	200 K (53)	9923 (33)	514 K (106)	17 K (84)	844 (34)	44 K (150)	12 M (79)
Guatemala	154 K (67)	5465 (43)	738 K (94)	8519 (112)	302 (73)	41 K (153)	18 M (65)
Nigeria	122 K (76)	1504 (80)	1241 K (75)	582 (180)	7 (178)	5939 (194)	209 M (7)
Australia	29 K (107)	909 (88)	13 M (14)	1121 (163)	35 (138)	495 K (41)	26 M (54)
S Korea	75 K (86)	1360 (82)	5284 K (37)	1500 (155)	27 (148)	108 K (121)	51 M (28)
S Africa	1418 K (15)	41 K (14)	7993 K (22)	24 K (75)	703 (42)	137 K (109)	60 M (25)

The numbers in parentheses indicate the rank of each country among 221 countries in the world. For example, Iran has the 18th largest population among all countries. The bold numbers mark ranks in the second half of the global ranking (ranks 111 to 221). For example, regarding cases per million, India is in the second half. The light-highlighted cells show that the country is in the highest quarter of countries for the relevant parameter. The darker highlighted cells indicate that the country is among the top 5% worldwide. For example, the first row illustrates that the USA is in the top 5% for all indicators except the number of tests per million, where it is in the highest quartile.

Table 2:

The properties of the datasets of the countries of interest.

Country	Lag	Positivity rate	Start	End	Number of days
USA	2	8	27 February 2020	27 January 2021	334
India	3	6	3 March 2020	27 January 2021	329
UK	2	5	21 February 2020	25 January 2021	338
Italy	3	8	19 February 2020	27 January 2021	339
Iran	4	16	1 April 2020	26 January 2021	303
UAE	3	1	11 March 2020	26 January 2021	318
Bolivia	5	39	19 March 2020	24 January 2021	308
Guatemala	5	21	14 March 2020	27 January 2021	316
Nigeria	5	10	17 March 2020	27 January 2021	313
Australia	2	0.02	22 March 2020	26 January 2021	308
S Korea	2	0.01	18 February 2020	30 January 2021	347
S Africa	4	2	14 March 2020	30 January 2021	320

Table 3:

The epidemic peaks of COVID-19 in the countries of interest.

Country	First peak	Second peak	Third peak
USA	Early April**	Second half of July**	From November to January 2021**
India	Middle September***	–	–
UK	Middle April***	Early November**	Late December***
Italy	Late March**	Early November***	–
Iran	Late March**	Second half of November**	–
UAE	Middle May**	January 2021**	–
Bolivia	July to August***	Middle January 2021***	–
Guatemala	–	–	–
Nigeria	June to July**#	January 2021***#	–
Australia	Late March**	Early August***	–
S Korea	February to March***	Second half of August**	December***
S Africa	July***	December and January 2021***	–

All the dates belong to 2020. Otherwise, the year is mentioned. (*): Indicated only by the time series of the number of tests. (**): Indicated only by the time series of the proportion of positive tests. (***): Indicated by both time series. (#): After moving average.

Finally, to illustrate the dependency of the number of tests and the positivity rates, we apply copulas.

Sklar introduced the concept of copulas in 1959 (Sklar 1959). A copula (mainly parametric, sometimes semi-parametric, and rarely non-parametric) is a function that completely describes the dependency structure. It contains all the information needed to link the marginal distributions to their joint distribution. Accordingly, to obtain a valid multivariate distribution function, it suffices to combine several marginal distribution functions with any candidate copula function. Thus, for the purposes of statistical modeling, it is desirable to have a large collection of copulas at one’s disposal. Copulas are widely applied in diverse fields, including environmental studies (Corbella and Stretch 2013; Zhang and Singh 2007), finance (Boubaker and Sghaier 2013; Wang et al. 2014), hydrology (Bekrizadeh et al. 2013), and medical studies (Bekrizadeh 2021; Bekrizadeh and Jamshidi 2017; Bekrizadeh et al. 2017; Chatrabgoun et al. 2020; Li and Fang 2012; Roman et al. 2012; Wienke 2011).

2 Data

The main data sources of the paper are the websites Worldometers (Worldometer website) and Our World in Data (Hasell et al. 2020). We summarize and illustrate all the relevant information about the 12 countries in three (twelve-row) tables and three (twelve-partitioned) figures created on Matlab R2015b.

Table 1 includes the key general indicators up to January 25, 2021. Note that neither the total indicators nor the per-million indicators determine the quality of healthcare systems, because the statistics of Bolivia, Guatemala, Nigeria, Iran, and even India are observably underreported. On the other hand, we consider the number of tests per million (the 7th column of Table 1) as a criterion representing the level of facilities, and therefore the quality of the healthcare system. Based on this criterion, we define the lags (the delay between the test and the diagnosis) for the different healthcare systems.

Table 2 presents the underlying properties of the dataset of each country. As mentioned before, the lag is the difference, in days, between the time of testing and the time of receiving the test results, positive or negative. The more facilities a healthcare system has, the more tests it can perform, and therefore the lower its positivity rate. Likewise, the more facilities a healthcare system has, the shorter the delay between tests and results. Based on the concept of lag, we pair the number of tests on the n-th day with the number of results on the (n + lag)-th day to obtain the dependency structure using copulas. The last column is calculated from the start and end dates of the recording period (the fourth and fifth columns) and the lag (the second column); it gives the number of pairs that we use to obtain the dependency structure for each country.
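The lag pairing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `lagged_positivity` and the daily counts are hypothetical, and the positivity rate is taken as the positives reported on day n + lag divided by the tests conducted on day n.

```python
def lagged_positivity(tests, positives, lag):
    """Pair tests taken on day n with positives reported on day n + lag.

    Returns a list of (number_of_tests, positivity_rate) pairs.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    pairs = []
    for n in range(len(tests) - lag):
        if tests[n] > 0:  # skip days with no recorded tests
            pairs.append((tests[n], positives[n + lag] / tests[n]))
    return pairs


# Hypothetical daily counts, for illustration only.
tests = [100, 200, 400, 400, 500]
positives = [0, 0, 10, 10, 20]
pairs = lagged_positivity(tests, positives, lag=2)
print(pairs)
```

With a series of length N and a lag of l days, the sketch yields at most N − l pairs, which is consistent with how the number of pairs in Table 2 is described.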

Generally, during an epidemic wave, the number of newly infected individuals increases rapidly to an epidemic peak and then falls more gradually until the wave is over and the number of new cases has stabilized. Roughly speaking, the epidemic peaks are the time points (or their neighborhoods) that correspond to local maxima of the number of newly infected cases.

3 Methods

3.1 Change Point Detection

We define the epidemic peak as the time neighborhood (or the time point) at which X_t, the number of new confirmed cases on the t-th day, exceeds the mean plus three times the standard deviation of the preceding three weeks for at least a week, that is,

$$
t_{\mathrm{peak}} = \Bigl\{\, t \;\Big|\; X_{t-i} > \mathrm{mean}\{X_{t-i-21}, X_{t-i-20}, \ldots, X_{t-i-1}\} + 3\,\mathrm{SD}\{X_{t-i-21}, X_{t-i-20}, \ldots, X_{t-i-1}\},\ \text{for } i = 1, 2, 3, \ldots, 7 \,\Bigr\}.
$$

This definition is derived from the definition of an outlier in regression analysis. These epidemic peaks are local maxima. In addition, two successive epidemic peaks must be at least one month apart. The peaks in Table 3 are obtained for the countries of interest according to this definition. Remarkably, except for the peaks of Bolivia, which are almost the same, the later waves are more acute than the earlier ones. Here, a more acute peak means a larger number of new confirmed cases and therefore more intense spreading. Finally, because of the lack of information at the beginning, this definition may miss epidemic peaks in the initial days.
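The peak rule can be sketched in a few lines of Python. This is an illustrative reading of the definition, not the authors' implementation: the names `exceeds_baseline` and `peak_days` are hypothetical, the sample standard deviation (`statistics.stdev`) is an assumption because the text does not say which estimator is used, and the one-month minimum spacing between peaks is not enforced here.

```python
from statistics import mean, stdev


def exceeds_baseline(x, s, window=21, k=3.0):
    """True if x[s] > mean + k * SD of the `window` days preceding day s."""
    if s < window:
        return False  # not enough history to form the baseline
    past = x[s - window:s]
    return x[s] > mean(past) + k * stdev(past)


def peak_days(x, window=21, k=3.0, run=7):
    """Days t such that each of the `run` days before t exceeds its baseline."""
    return [
        t
        for t in range(run, len(x))
        if all(exceeds_baseline(x, t - i, window, k) for i in range(1, run + 1))
    ]


# Hypothetical series: 30 quiet days, then daily doubling for 8 days.
cases = [100 + 2 * (d % 2) for d in range(30)] + [200 * 2 ** k for k in range(8)]
print(peak_days(cases))
```

Because the baseline window rolls forward and absorbs the early days of a wave, only sustained, fast growth satisfies the rule for seven consecutive days; slower waves may call for the moving-average smoothing mentioned in the note to Table 3.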

Mathematically and logically, the number of positive tests (confirmed cases) is affected by the number of tests and positivity rate. The number of cases equals the number of tests multiplied by the positivity rate. Therefore, the increment of the number of cases (as a multiplication) equals the sum of these two:

The number of tests multiplied by the increment of positivity rate, and
The positivity rate multiplied by the increment of the number of tests.

Consequently, intense changes in the count of cases are due to a remarkable change in at least one of these products. For countries with a steady increase in the number of tests, such as the USA, the increment of the positivity rate plays the principal role in the peaks.
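The decomposition above can be checked numerically. The sketch below uses hypothetical figures and a hypothetical function name; it also makes explicit that for discrete day-to-day changes the exact identity carries a third, second-order cross term (the product of the two increments), so the two-term sum in the text is a first-order approximation.

```python
def decompose_increment(tests_prev, rate_prev, tests_now, rate_now):
    """Split the change in cases (tests * rate) into its components."""
    d_tests = tests_now - tests_prev
    d_rate = rate_now - rate_prev
    rate_effect = tests_prev * d_rate    # number of tests x increment of rate
    volume_effect = rate_prev * d_tests  # positivity rate x increment of tests
    interaction = d_tests * d_rate       # second-order cross term
    return rate_effect, volume_effect, interaction


# Hypothetical example: tests rise from 10,000 to 12,000 and the
# positivity rate rises from 5% to 8%.
parts = decompose_increment(10_000, 0.05, 12_000, 0.08)
change_in_cases = 12_000 * 0.08 - 10_000 * 0.05
print(parts, sum(parts), change_in_cases)
```

In this made-up example the rate effect is the largest component, which mirrors the situation the text describes for countries whose testing volume grows steadily.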

Table 3 shows that the positivity rate is significantly better than the frequency of tests at indicating the peaks of the pandemic. The positivity rate is more strongly associated with the number of cases than the number of tests is (90% vs 45%). After applying a moving average, these proportions reach 100% and 50%, respectively.

Countries of the southern and northern hemispheres faced a peak around July and November, respectively, possibly due to falling temperatures.

Figures 1, 2, and 3 each consist of 12 subfigures, one per country. The arrangement of the subfigures is identical in all three figures. The horizontal axes in Figures 1 and 2 represent time in days from the start to the end of the study period for each country (the fourth and fifth columns of Table 2). The vertical axes of Figures 1 and 2 display the number of new tests conducted on that day and the proportion of positive tests reported on that day, respectively. Figure 3 plots the joint distribution of the number of tests on a given day and the positivity rate lag days later.

Figure 1:

The time series of the number of new tests (daily) in the 12 countries. The peaks of the number of tests coincide with the epidemic peaks of COVID-19 in different countries. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2). r: row & c: column.

Figure 2:

The time series of the positivity rate (daily) in the 12 countries. The peaks of the positivity rate coincide with the epidemic peaks of COVID-19 in different countries. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2). r: row & c: column.

Figure 3:

Scatterplots of the relationship between the number of tests and the positivity rate. Generally, as the number of new tests increases, the positivity rate falls. USA (r1 c1), India (r1 c2), UK (r2 c1), Italy (r2 c2), Iran (r3 c1), UAE (r3 c2), Bolivia (r4 c1), Guatemala (r4 c2), Nigeria (r5 c1), Australia (r5 c2), South Korea (r6 c1), and South Africa (r6 c2).

Figure 1 shows that the peaks of the number of tests coincide with the epidemic peaks of COVID-19 in different countries. For example, in the USA, there are two peaks in the number of tests that coincide with the second and third epidemic peaks listed in Table 3. Likewise, Bolivia experienced two peaks in the number of tests around the 150th and 300th days (counted from March 19, 2020), which coincide with the epidemic peaks in Table 3.

The USA, the UK, and the UAE experienced steadily rising time series. Except for some overshoots at epidemic peaks, the patterns of Italy and South Africa are increasing too. The number of tests in Guatemala is increasing, accompanied by increasing fluctuation. Owing to limited testing capacity, Iran and Nigeria followed a stepwise trend. Apart from one peak each, the plots of Australia and South Korea are stationary. In the case of Bolivia, the time series follows the peaks. India is the only country whose time series is initially increasing, then stable, and after that decreasing. Generally, the countries have an increasing trend.

Figure 4 gives us a classification about the countries from the viewpoint of the number of tests: 1. The USA, 2. India, 3. The UK, 4. Italy, Australia, and the UAE, 5. South Korea, South Africa, and Iran, and 6. Nigeria, Guatemala, and Bolivia.

Figure 4:

The time series of the number of new tests (daily) in the 12 countries. The USA, India, and the UK had the greatest number of tests.

Figure 2 illustrates the time series of the positivity rate of the tests (the ratio of the number of positive tests on a day to the number of tests taken lag days earlier). Interestingly, the subfigures of Figure 2 agree with the epidemic peaks more closely than their counterparts in Figure 1. For example, it is clear that the USA encountered three peaks. It is worth noting that the graph of Iran has three peaks, the first of which is missing from Table 3 because of the lack of information at the beginning. A similar situation (a peak missed by examining either the number of tests or the number of confirmed cases but discovered by analyzing the positivity rate) occurs for the epidemic peak in India in late March, the first and second peaks of the UK, and the epidemic peaks in the UAE in mid-May and November.

Figure 5 illustrates a classification of the countries based on the positivity rate: 1. Nigeria, Guatemala, and Bolivia, 2. South Africa, and Iran, 3. The USA, India, the UK, and Italy, and 4. Australia, the UAE, and South Korea.

Figure 5:

The time series of the positivity rate (daily) in the 12 countries. Nigeria, Guatemala, and Bolivia had the greatest positivity rates among the 12 countries.

The horizontal and vertical axes of Figure 3 display the number of new tests and their positivity rate, respectively. Generally, as the number of new tests increases, the positivity rate falls. Since the epidemic peaks run against this general rule, the opposite direction of the changes is not very clear in the plots. Guatemala, which lacks an epidemic peak, is a good example of this inversely proportional relationship.

If an increase in the number of cases is caused by a rising number of tests, we do not expect the series to return to its previous level in the short term, and the positivity rate does not change remarkably. On the other hand, it is natural to assume that entering a peak is accompanied by an increase in the number of negative tests as well. Consequently, a lack of growth in negative test results (a rising positivity rate while the frequency of tests continues its previous trend) is only reasonable if at least one of test accuracy, testing policy, or the attitude of the population changed around that time. Otherwise, a remarkable number of cases belonging to the peak went unreported. Notably, this combination of increases causes the observed acceleration in growth around epidemic peaks.

3.2 Copulas

3.2.1 Definition

Copulas are functions that connect multivariate distribution functions to their one-dimensional marginal distribution functions, which are uniform on the interval [0, 1]. Mathematically speaking, if H is a bivariate distribution function with margins F(X) and G(Y), there must exist a copula C such that H_θ(X,Y) = C(F(X),G(Y);θ), where θ is the dependence parameter (Sklar 1959). Accordingly, a copula is usually defined as a function C:[0,1]² → [0,1] that satisfies the boundary condition (P1) and the 2-increasing condition (P2):

$$
(P1)\quad C(x,0) = C(0,x) = 0 \ \text{ and } \ C(x,1) = C(1,x) = x, \quad \forall x \in [0,1],
$$

$$
(P2)\quad \forall (s_1, s_2, t_1, t_2) \in [0,1]^4 \ \text{ such that } \ s_1 \le s_2 \ \text{ and } \ t_1 \le t_2: \quad C(s_2,t_2) - C(s_2,t_1) - C(s_1,t_2) + C(s_1,t_1) \ge 0.
$$

For a twice-differentiable function C, the 2-increasing property (P2) can be replaced by the condition

$$
c(s,t) = \frac{\partial^2 C(s,t)}{\partial s\,\partial t} \ge 0,
$$

where c(s,t) is the so-called copula density. A copula C is symmetric if C(s,t) = C(t,s) for every (s,t) ∈ [0,1]²; otherwise C is asymmetric. The most well-known, powerful, and applicable copulas are:

FGM copula (Farlie 1960; Morgenstern 1956);

$$
C_{\mathrm{FGM}}(s,t) = st\,\bigl(1 + \theta(1-s)(1-t)\bigr), \quad \theta \in [-1,+1], \quad \forall (s,t) \in [0,1]^2
$$

Clayton copula (Clayton 1978);

$$
C_{\mathrm{Clayton}}(s,t) = \bigl(s^{-\beta} + t^{-\beta} - 1\bigr)^{-1/\beta}, \quad \beta \in (0,+\infty), \quad \forall (s,t) \in [0,1]^2
$$

Frank copula (Genest 1987);

$$
C_{\mathrm{Frank}}(s,t) = -\frac{1}{\alpha}\ln\!\left(1 + \frac{(e^{-\alpha s}-1)(e^{-\alpha t}-1)}{e^{-\alpha}-1}\right), \quad \alpha \in (-\infty,+\infty), \quad \forall (s,t) \in [0,1]^2, \ \text{and}
$$

Gumbel copula (Gumbel 1960);

$$
C_{\mathrm{Gumbel}}(s,t) = \exp\!\left(-\bigl[(-\ln s)^{\sigma} + (-\ln t)^{\sigma}\bigr]^{1/\sigma}\right), \quad \sigma \in [1,+\infty), \quad \forall (s,t) \in [0,1]^2.
$$
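As a quick sanity check on the four families listed above, the following sketch implements their distribution functions and verifies the boundary condition (P1) and the 2-increasing condition (P2) numerically. The function names, the parameter values, and the test rectangle are illustrative assumptions; the Frank copula is undefined at α = 0, which the sketch does not handle.

```python
import math


def fgm(s, t, theta):
    return s * t * (1 + theta * (1 - s) * (1 - t))


def clayton(s, t, beta):
    if s == 0 or t == 0:
        return 0.0  # limiting value on the lower boundary
    return (s ** -beta + t ** -beta - 1) ** (-1 / beta)


def frank(s, t, alpha):
    num = math.expm1(-alpha * s) * math.expm1(-alpha * t)
    return -math.log1p(num / math.expm1(-alpha)) / alpha


def gumbel(s, t, sigma):
    if s == 0 or t == 0:
        return 0.0  # limiting value on the lower boundary
    a = (-math.log(s)) ** sigma + (-math.log(t)) ** sigma
    return math.exp(-(a ** (1 / sigma)))


def satisfies_p1(copula, param, grid=11, tol=1e-9):
    """Check C(x,0) = C(0,x) = 0 and C(x,1) = C(1,x) = x on a grid."""
    for k in range(grid):
        x = k / (grid - 1)
        if abs(copula(x, 0.0, param)) > tol or abs(copula(0.0, x, param)) > tol:
            return False
        if abs(copula(x, 1.0, param) - x) > tol or abs(copula(1.0, x, param) - x) > tol:
            return False
    return True


def rectangle_volume(copula, param, s1, s2, t1, t2):
    """C-volume of [s1, s2] x [t1, t2]; non-negative under (P2)."""
    return (copula(s2, t2, param) - copula(s2, t1, param)
            - copula(s1, t2, param) + copula(s1, t1, param))


# Hypothetical parameter values, one per family.
families = [(fgm, -0.5), (clayton, 2.0), (frank, -1.5), (gumbel, 1.5)]
for copula, param in families:
    print(copula.__name__, satisfies_p1(copula, param),
          rectangle_volume(copula, param, 0.2, 0.6, 0.3, 0.9))
```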

The parameters of the marginal and copula distributions are estimated using the maximum likelihood method. The computations and illustrations regarding copula theory are conducted in R 4.0.3, Maple 2018a, and EasyFit 5.6.
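To make the estimation step concrete, here is a minimal sketch of maximum likelihood fitting for a single family (Clayton) followed by the AIC. It is not the authors' procedure, which relies on R, Maple, and EasyFit. The synthetic data, the golden-section search, the search bounds, and all function names are assumptions, and the search presumes that the log-likelihood has a single maximum inside the bounds.

```python
import math
import random


def clayton_loglik(beta, pairs):
    """Log-likelihood of the Clayton copula density for (u, v) in (0, 1)^2."""
    total = 0.0
    for u, v in pairs:
        total += (math.log1p(beta)
                  - (1 + beta) * (math.log(u) + math.log(v))
                  - (2 + 1 / beta) * math.log(u ** -beta + v ** -beta - 1))
    return total


def fit_clayton(pairs, lo=0.01, hi=10.0, iters=60):
    """Golden-section search for the beta that maximizes the log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = clayton_loglik(c, pairs), clayton_loglik(d, pairs)
    for _ in range(iters):
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = clayton_loglik(c, pairs)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = clayton_loglik(d, pairs)
    return (a + b) / 2


def sample_clayton(beta, n, rng):
    """Synthetic Clayton pairs by conditional inversion (demo data only)."""
    pairs = []
    for _ in range(n):
        u, w = rng.random(), rng.random()
        v = ((w ** (-beta / (1 + beta)) - 1) * u ** -beta + 1) ** (-1 / beta)
        pairs.append((u, v))
    return pairs


rng = random.Random(0)
data = sample_clayton(2.0, 1000, rng)  # synthetic data, true beta = 2
beta_hat = fit_clayton(data)
aic = 2 * 1 - 2 * clayton_loglik(beta_hat, data)  # one free parameter
print(beta_hat, aic)
```

Repeating the same steps with each family's density and keeping the model with the smallest AIC would mirror the comparison described in the text. For real data, the inputs would be the fitted marginal CDF values (or rank-based pseudo-observations) rather than simulated uniforms.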

3.2.2 Copula Vs Correlation Coefficient

Measures of dependence are common instruments for summarizing a complicated dependence structure in the bivariate case. Pearson’s correlation coefficient, Spearman’s rho, and Kendall’s tau are common statistical measures of dependence (Kendall 1970; Pearson 1895; Spearman 1904). Correlation runs into trouble when the random variables are not elliptically distributed, whereas the performance of a copula does not depend on whether the distributions are elliptical. Pearson’s linear correlation coefficient ρ (−1 ≤ ρ ≤ 1) is the most popular and well-known measure of association between pairs of random variables. Despite its simplicity and plain rationale, Embrechts et al. (2001) noted that ρ is simply a measure of the dependency of elliptical distributions, such as the binormal distribution (the marginals are normally distributed and linked by the Gaussian copula). Moreover, ρ measures a linear relationship and does not capture a non-linear one on its own, as noted in (Priest 2003). These properties constitute obvious limitations for modeling the dependency structure. In addition, copulas can be used to define nonparametric measures of dependence between random variables. Since the values of Kendall’s tau are easy to calculate, this measure is used for the dependence of observations. If F(X) and G(Y) are continuous, then C(s,t) is unique; otherwise, C(s,t) is uniquely determined on the range of F(X) × the range of G(Y).

One standard non-parametric dependence measure, Kendall’s τ_k, is expressed in copula form as:

$$
\tau_k = 4\int_0^1\!\!\int_0^1 c(u,v)\,C(u,v)\,du\,dv - 1
$$

The copula parameter is estimated, and the relationship between the copula parameter and τ_k is given in the last column of Table 4. In each case, the copula parameter measures the degree of dependence and controls the association between the two variables. When the parameter approaches 0 there is no dependence, and when the parameter tends to infinity there is perfect dependence. Schweizer and Wolff (1981) showed that the dependence parameter that characterizes each family of copulas can be related to Kendall’s τ_k. Therefore, copulas allow modeling both linear and non-linear dependence. Copulas can also model extreme endpoints regardless of the marginal distributions.

Table 4:

Kendall’s tau of copula function.

Copula	Parameter space	Kendall’s tau
FGM	θ ∈ [−1,+1]	τ_k = 2θ/9
Clayton	β ∈ (0,+∞)	τ_k = β/(β + 2)
Frank	α ∈ (−∞,+∞)	τ_k = 1 + 4D(α)/α, D(α) = (1/α)∫₀^α x/(eˣ − 1) dx − 1
Gumbel	σ ∈ [1,+∞)	τ_k = (σ − 1)/σ
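The relations in Table 4 translate directly into code. In the sketch below, the function names are hypothetical; the Frank case uses the standard form τ = 1 + 4(D₁(α) − 1)/α, where D₁ is the first-order Debye function with upper integration limit α, evaluated here by a simple midpoint rule (an implementation choice, not something taken from the source). The Frank formula is undefined at α = 0.

```python
import math


def tau_fgm(theta):
    return 2 * theta / 9


def tau_clayton(beta):
    return beta / (beta + 2)


def tau_gumbel(sigma):
    return (sigma - 1) / sigma


def debye1(alpha, steps=2000):
    """(1/alpha) * integral from 0 to alpha of x / (e^x - 1) dx, midpoint rule."""
    h = alpha / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x / math.expm1(x)
    return total * h / alpha


def tau_frank(alpha):
    return 1 + 4 * (debye1(alpha) - 1) / alpha


# Hypothetical parameter values, for illustration only.
for name, value in [("FGM", tau_fgm(-0.5)), ("Clayton", tau_clayton(0.3)),
                    ("Frank", tau_frank(-2.0)), ("Gumbel", tau_gumbel(1.5))]:
    print(name, round(value, 4))
```

Note that the FGM family can only reach |τ| ≤ 2/9, and the Clayton and Gumbel families as parameterized in Table 4 only produce non-negative τ, which may help explain why different families suit different countries.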

3.2.3 Copula Vs Regression

Regression analysis is a statistical method for investigating the relationships between dependent and independent variables. Its basic form, ordinary least squares, is not suitable for some applications because the relationships are often nonlinear and the probability distribution of the response variable may be non-Gaussian.

The major advantage of copula regression is that there are no restrictions on the probability distributions that can be used. Copula regression is the most appropriate method for fitting non-Gaussian regression models, since it needs no normality assumption. Copula functions, which connect the marginal distributions to their joint distribution, are useful in simulating linear or nonlinear relationships among multivariate data. A copula is a multivariate distribution function whose margins are uniform random variables on [0, 1] (the PDF^[1] of the CDF^[2]). Copula functions have some appealing properties: they allow scale-free measures of dependence and are useful in constructing families of joint distributions.

4 Results

Applying copula theory to a pair of variables presupposes continuous marginal distributions and correlation between the variables. Table 5 investigates whether the pair formed by the frequency of tests and the positivity rate meets these presumptions. The marginal distributions were obtained in EasyFit. The generalized Pareto and Weibull distributions performed well in fitting the positivity rates. Also, the correlation in the countries with the highest number of tests is negative, commonly between −0.2 and −0.3. In countries lacking enough tests, the correlation coefficient is significantly greater, possibly due to the low quality of data and under-reporting. Notably, the calculations for Bolivia, Iran, and South Africa even lead to positive correlations.
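For readers who want to reproduce the goodness-of-fit column, the one-sample Kolmogorov–Smirnov statistic can be computed as sketched below. The function names and the sample are hypothetical. The Weibull parameters are taken from the India row of Table 5, reading α as the shape and β as the scale, which is an assumption about the parameterization used by EasyFit; P-values are not computed here.

```python
import math


def weibull_cdf(x, shape, scale):
    """Two-parameter Weibull distribution function."""
    return 1 - math.exp(-((x / scale) ** shape)) if x > 0 else 0.0


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


shape, scale = 1.8788, 0.07082  # India row of Table 5 (assumed shape, scale)

# Hypothetical sample placed exactly at the fitted quantiles (i + 0.5) / n,
# so the statistic equals 0.5 / n by construction.
n = 10
sample = [scale * (-math.log(1 - (i + 0.5) / n)) ** (1 / shape) for i in range(n)]
d_stat = ks_statistic(sample, lambda x: weibull_cdf(x, shape, scale))
print(d_stat)
```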

Table 5:

The results of fit distribution to data.

Country	Frequency of tests data				Positivity proportion data				Correlation
	Marginal	Parameters	K-S statistic	K-S P-value	Marginal	Parameters	K-S statistic	K-S P-value	r	P-value
USA	Rayleigh	σ = 864763 γ = −195972	0.0654	0.11266	Gen. Pareto	k = −0.14127 σ = 0.06493 μ = 0.03461	0.04538	0.48317	−0.134	0.014
India	Logistic	σ = 252223 μ = 57602	0.04919	0.19214	Weibull	α = 1.8788 β = 0.07082	0.04234	0.58217	−0.236	< 0.01
UK	Gen. Pareto	k = −0.37332 σ = 280831 μ = −14012	0.05684	0.2165	Weibull (3p)	α = 0.78036 β = 0.06417 γ = −0.00833	0.07206	0.05687	−0.213	< 0.01
Italy	Log-Logistic (3P)	α = 2.6282 β = 87799.0 γ = −16892.0	0.06078	0.15679	Weibull (3p)	α = 0.86458 β = 0.07299 γ = −0.00238	0.07386	0.0521	−0.001	0.986
Iran	Log-Logistic (3P)	α = 8.1298 β = 56499.0 γ = −29429.0	0.00736	0.05101	Burr	k = 0.16689 α = 13.051 β = 0.08904	0.05516	0.3040	0.123	0.032
UAE	Weibull	α = 1.5811 β = 84164.0	0.05992	0.19016	Log-Logistic	α = 3.0628 β = 0.01041	0.07394	0.05619	−0.001	0.854
Bolivia	Gumbel Max	σ = 923.72 μ = 1072.6	0.04872	0.44386	Beta	α₁ = 1.3627 α₂ = 2.9923	0.0332	0.87483	0.189	0.001
Guatemala	Dagum	k = 0.0587 α = 10.772 β = 5929.2	0.0707	0.08088	Gamma	α = 1.9352 β = 0.13129	0.03456	0.8318	−0.329	< 0.01
Nigeria	Log-Logistic	α = 2.3097 β = 27364 γ = −510.36	0.04696	0.48053	Weibull	α = 1.2881 β = 0.2093	0.03527	0.81772	−0.371	< 0.01
Australia	Logistic	σ = 11052.0 μ = 40909.0	0.04808	0.46066	Frechet	β = 0.0043 α = 0.77645	0.05106	0.38529	−0.269	< 0.01
S Korea	Burr	k = 0.53449 α = 3.5829 β = 8601.9	0.05161	0.30334	Gen. Pareto	k = 0.18396 σ = 0.01109 μ = 0.00924	0.03947	0.63718	−0.005	0.926
S Africa	Log-Logistic (3P)	α = 4.5531 β = 40005.0 γ = −17075.0	0.03938	0.68861	Gen. Pareto	k = −0.13432 σ = 0.14687 μ = 0.01868	0.02748	0.96362	0.405	< 0.01

K–S, Kolmogorov–Smirnov; 3p, 3-parameter. The highlighted rows indicate that the correlation is not significant for those countries.

Based on Table 5, we can look for suitable copula functions to connect the marginal distributions and find the desired joint distributions for nine of the countries. Notice that the countries without a meaningful correlation (Italy, South Korea, and the UAE) are among the countries with the lowest positivity rates.

Table 6 represents the results of comparing the best candidates from the FGM, Clayton, Frank, and Gumbel families.

Table 6:

The obtained copula to fit the dependency and their performances.

Country	Model	MLE of θ	Kendall’s tau	AIC
USA	FGM copula	−0.47285	−0.1051	−663.3515
India	Frank copula	−0.77241	−0.1876	−660.0874
UK	Frank copula	−0.75843	−0.1624	−658.2413
Iran	Clayton copula	0.28941	0.1264	−559.8742
Bolivia	Clayton copula	0.37651	0.1584	−661.2521
Guatemala	Frank copula	−0.95054	−0.2743	−663.3011
Nigeria	Frank copula	−0.84251	−0.3221	−663.2462
Australia	Frank copula	−0.81262	−0.2138	−662.1021
South Africa	Clayton copula	0.46723	0.1894	−664.7824

According to Table 6, Clayton copulas are suitable candidates for the countries with lower per million tests. In addition, Frank copulas can describe a wide variety of countries. Finally, the Gumbel family seems not to be a good option to couple the variables of the frequency of tests and the positivity rate.

We now discuss the simulation of data from the obtained copula models and compare the correlations in the simulated data with those in the observed data, based on 1000 simulations. We follow the simulation method proposed by Johnson (1987) and later Nelsen (2006).

Figure 6 shows scatter plots of the transformed observed data against samples simulated from the fitted copula models in Table 6, on the scale of the CDFs of the number of tests and the positivity proportion. The simulated data and the original data show similar dependence patterns. To check this, Table 6 shows the rank correlations between the number of tests and the positivity proportion calculated from the original data and from simulated data of size 1000 drawn from the fitted copula models. Comparing these correlations shows that the estimated and observed correlations are closely consistent.
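The simulation step can be sketched with conditional-inversion sampling, which we assume is the kind of method meant by the references above. The example below is a hedged illustration for the Clayton family only: it draws 1000 pairs using the Iran estimate of θ from Table 6 and computes Kendall's tau of the simulated sample by direct pair counting. The function names and the seed are hypothetical choices, and only the Python standard library is used.

```python
import random


def sample_clayton(theta, n, rng):
    """Draw n pairs (u, v) from a Clayton copula (theta > 0) by conditional inversion."""
    if theta <= 0.0:
        raise ValueError("this sketch only covers theta > 0")
    pairs = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids u == 0
        w = 1.0 - rng.random()  # uniform on (0, 1], conditional probability level
        # Solve dC(u, v)/du = w for v.
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


def kendall_tau(pairs):
    """Kendall's tau-a by counting concordant and discordant pairs, O(n^2)."""
    n = len(pairs)
    score = 0
    for i in range(n - 1):
        ui, vi = pairs[i]
        for j in range(i + 1, n):
            uj, vj = pairs[j]
            prod = (ui - uj) * (vi - vj)
            if prod > 0:
                score += 1
            elif prod < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


rng = random.Random(2023)  # arbitrary seed, for reproducibility only
simulated = sample_clayton(0.28941, 1000, rng)
print(f"simulated Kendall's tau: {kendall_tau(simulated):.4f}")
```

The simulated tau fluctuates around θ/(θ + 2) from run to run, so a single sample of size 1000 should be read as an approximate check rather than an exact reproduction of the tabulated value.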

Figure 6:

Scatter plots of the transformed observed values (·) versus simulated samples (∗) from the fitted copula models. USA (r1 c1), India (r1 c2), UK (r1 c3), Iran (r2 c1), Bolivia (r2 c2), Guatemala (r2 c3), Nigeria (r3 c1), Australia (r3 c2), and South Africa (r3 c3). r: row; c: column.

Finally, we investigate the overall structure of dependency between the number of tests and the positivity rate. Pooling the data of the 12 countries gives 3877 pairs, whose Kendall’s correlation is −0.1434 (P-value: 2.8464 × 10^−19). In addition, we split the data into two parts: peaks and non-peak periods. This split prevented us from fitting marginal distributions, and hence copulas, because it creates gaps in the series of the number of tests. Table 7 presents Kendall’s correlations for the countries of interest. Notably, for the pooled data the correlation between the number of tests and the positivity rate is negative both during peaks and outside them.

Table 7:

The correlation between the number of tests and the positivity rate for all countries, separated into peak and non-peak periods.

Country	Kendall’s tau after removing peaks	P-value	Kendall’s tau for peaks	P-value
USA	−0.0168	0.8113	−0.4104	<0.0001
India	−0.2410	0.0001	0.0993	0.3936
UK	0.2496	0.0015	−0.7017	<0.0001
Italy	0.3309	<0.0001	−0.5127	<0.0001
Iran	0.0779	0.2354	0.1574	0.1898
UAE	0.0387	0.5348	−0.1621	0.2081
Bolivia	0.3028	<0.0001	−0.2402	0.0119
Guatemala	−0.2946	<0.0001	–	–
Nigeria	0.4474	<0.0001	−0.4197	<0.0001
Australia	−0.2337	0.0003	−0.6295	<0.0001
South Korea	0.3134	<0.0001	−0.7214	<0.0001
South Africa	0.1203	0.0744	−0.4517	<0.0001
Total	−0.1381	<0.0001	−0.2132	<0.0001

Light or dark bolded figures indicate that the correlation coefficient is significantly positive or negative, respectively.

5 Discussion

Generally, at the beginning of an epidemic, the number of tests is low and the positivity proportion is high. As time passes, the number of tests rises, and as the number of new tests increases, the positivity rate falls. In countries with a high number of tests, and hence higher-quality data, the correlation is negative and commonly lies between −0.2 and −0.3. Taking all the data as one set, Kendall’s coefficients are −0.1434, −0.2132, and −0.1381 for the total, the peaks, and the total after removing peaks, respectively. The positivity rate is a significantly better indicator of the pandemic peaks than the number of tests. The number of cases is more strongly associated with the positivity rate than with the number of tests (90% vs 45%).

The positivity proportion tracks the number of infected cases more closely than the number of tests does. A positivity rate approaching zero is a good criterion for gauging the success of a healthcare system in fighting an epidemic. The number and accuracy of tests can play a vital role in the quality of epidemic data. Policymakers can consider the factors affecting the positivity rate, such as the testing policy, restricted facilities, peaks, and fluctuations, and make decisions that avoid being misled by them. Considering these factors together and feeding them into models such as those of Khajanchi et al. (2021, 2022) and Mondal and Khajanchi (2022) is the appropriate way to assess the effect of any intervention of interest.

The first limitation is the low quality of data for some countries, caused by restricted facilities, a low number of tests, and unorganized data collection programs. In addition, interpolation and moving-average methods were applied to fill in missing data for the countries of interest and to calculate the correlation for the countries with poor data. Out of the 12 countries, Iran, South Africa, Nigeria, Bolivia, and Guatemala have a relatively limited number of tests. The data of Italy, the UAE, and South Korea showed no significant correlation. The highest-quality data and the most significant correlations belong to the USA, India, the UK, and Australia.

The present copula-based approach is promising because it can accommodate the wide range of correlations frequently observed in medical studies. In fact, classical multivariate models cannot reproduce all types of correlation. Moreover, standard models are limited, especially because the choice of marginal distributions is restricted. The crucial step in the modeling process is choosing the copula function that best fits the data. Further work is needed to identify the copulas best able to reproduce the dependence structure of bivariate medical variables. In clinical trials or medical studies, sample size is often an important consideration and is relatively small. The copula-based methodology helps with this limitation as well, because the algorithm can be used to simulate data for any number of patients. The copula-based methodology presented in this paper is simple and easy to implement.
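The copula choice described above can be sketched as an AIC comparison between one-parameter families. The example below is a minimal illustration, not the authors' implementation: it covers only the FGM and Clayton densities, replaces a numerical optimizer with a simple grid search for the maximum likelihood estimate, and uses a small hypothetical set of pseudo-observations. All names and grid limits are assumptions made for the sketch.

```python
import math


def fgm_logpdf(u, v, theta):
    """Log-density of the FGM copula, c(u, v) = 1 + theta (1 - 2u)(1 - 2v)."""
    return math.log(1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v))


def clayton_logpdf(u, v, theta):
    """Log-density of the Clayton copula for theta > 0 and u, v in (0, 1]."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (math.log(1.0 + theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))


def fit_aic(logpdf, pairs, grid):
    """Grid-search MLE of theta and AIC = 2k - 2 log L with k = 1 parameter."""
    def loglik(theta):
        return sum(logpdf(u, v, theta) for u, v in pairs)

    theta_hat = max(grid, key=loglik)
    return theta_hat, 2.0 - 2.0 * loglik(theta_hat)


# Hypothetical search grids; the FGM grid stays strictly inside [-1, 1].
FGM_GRID = [i / 100 for i in range(-99, 100)]
CLAYTON_GRID = [i / 100 for i in range(1, 501)]


def select_copula(pairs):
    """Return the family with the smallest AIC and the per-family (theta, AIC)."""
    fits = {
        "FGM": fit_aic(fgm_logpdf, pairs, FGM_GRID),
        "Clayton": fit_aic(clayton_logpdf, pairs, CLAYTON_GRID),
    }
    best = min(fits, key=lambda name: fits[name][1])
    return best, fits


# Hypothetical pseudo-observations (values of the fitted marginal CDFs).
demo = [(0.10, 0.20), (0.30, 0.25), (0.50, 0.60),
        (0.70, 0.65), (0.90, 0.80), (0.40, 0.50)]
best, fits = select_copula(demo)
print(best, fits[best])
```

In practice the pairs would be the fitted marginal CDFs evaluated at the observed number of tests and positivity rate, and the Frank and Gumbel families would be added to the comparison in the same way.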

Corresponding author: Babak Jamshidi, KiTEC, School of Biomedical Engineering & Imaging Sciences, King’s College London, London, SE1 7EU, UK, E-mail: babak.jamshidi@kcl.ac.uk

Research Funding: None to report.
Author contribution: BJ: Idea, Literature, Data, Methods, Programming, Interpretation, First draft. HB: Literature, Methods, Interpretation, Revision. SJZ: Data, Literature, Programming. MR: Design, Final manuscript.
Declaration of Interest Statement: We have no conflict of interest to declare.

References

Boubaker, H., and N. Sghaier. 2013. “Portfolio Optimization in the Presence of Dependent Financial Returns with Long Memory: A Copula Based Approach.” Journal of Banking & Finance 37 (2): 361–77. https://doi.org/10.1016/j.jbankfin.2012.09.006.

Bekrizadeh, H., G. A. Parham, and M. R. Zadkarmi. 2013. “Weighted Clayton Copulas and Their Characterizations: Application to Probable Modeling of the Hydrology Data.” Journal of Data Science 11: 293–303. https://doi.org/10.6339/jds.2013.11(2).1084.

Bekrizadeh, H., and B. Jamshidi. 2017. “A New Class of Bivariate Copulas: Dependence Measures and Properties.” Metron 75: 31–50. https://doi.org/10.1007/s40300-017-0107-1.

Bekrizadeh, H., G. A. Parham, and M. R. Zadkarmi. 2017. “A New Asymmetric Class of Bivariate Copulas for Modeling Dependence.” Communications in Statistics – Simulation and Computation 46 (7): 5594–609. https://doi.org/10.1080/03610918.2016.1169292.

Bekrizadeh, H. 2021. “Generalized Family of Copulas: Definition and Properties.” Thailand Statistician 19 (1): 163–78. https://doi.org/10.1080/03610918.2022.2032156.

Coronavirus (COVID-19) Infection Survey, UK: 8 January 2021. Also available at https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/coronaviruscovid19infectionsurveypilot/8january2021.

Corbella, S., and D. D. Stretch. 2013. “Simulating a Multivariate Sea Storm Using Archimedean Copulas.” Coastal Engineering 76: 68–78. https://doi.org/10.1016/j.coastaleng.2013.01.011.

Chatrabgoun, O., Hosseinian-far, and A. Daneshkhah. 2020. “Constructing Gene Regulatory Networks from Microarray Data Using Non-gaussian Pair-Copula Graphical Models.” Journal of Bioinformatics and Computational Biology 47 (59): 1–19. https://doi.org/10.1142/S0219720020500237.

Clayton, D. G. 1978. “A Model for Association in Bivariate Life Tables and its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence.” Biometrika 65 (1): 141–51. https://doi.org/10.1093/biomet/65.1.141.

Dowdy, D., and G. D’souza. 2020. Covid-19 Testing: Understanding the “Percent Positive”. August 10. Also available at https://www.jhsph.edu/covid-19/articles/covid-19-testing-understanding-the-percent-positive.html.

Embrechts, P., F. Lindskog, and A. McNeil. 2001. Modelling Dependence with Copulas and Applications to Risk Management. ETH Zurich: Department of Mathematics.

Farlie, D. G. J. 1960. “The Performance of Some Correlation Coefficients for a General Bivariate Distribution.” Biometrika 47: 307–23. https://doi.org/10.2307/2333302.

Genest, C. 1987. “Frank’s Family of Bivariate Distributions.” Biometrika 74: 549–55. https://doi.org/10.1093/biomet/74.3.549.

Gumbel, E. J. 1960. “Bivariate Exponential Distributions.” Journal of the American Statistical Association 55: 698–707. https://doi.org/10.1080/01621459.1960.10483368.

Hasell, J., E. Mathieu, D. Beltekian, B. Macdonald, C. Giattino, E. Ortiz-Ospina, M. Roser, and H. Ritchie. 2020. “A Cross-Country Database of COVID-19 Testing.” Scientific Data 7: 345. https://doi.org/10.1038/s41597-020-00688-8. https://ourworldindata.org/coronavirus-testing.

Jamshidi, B., M. Rezaei, S. J. Zargaran, and F. Najafi. 2020. “Mathematical Modeling the Epicenters of Coronavirus Disease-2019 (COVID-19) Pandemic.” Epidemiologic Methods 9 (s1): 20200009. https://doi.org/10.1515/em-2020-0009.

Jamshidi, B., H. Bekrizadeh, S. J. Zargaran, M. Rezaei, and F. Najafi. 2021. “Comparing Length of Hospital Stay during COVID-19 Pandemic in the USA, Italy, and Germany.” International Journal for Quality in Health Care 33 (1): 1–11. https://doi.org/10.1093/intqhc/mzab050.

Johnson, M. E. 1987. Multivariate Statistical Simulation. Hoboken: John Wiley. https://doi.org/10.1002/9781118150740.

Kendall, M. G. 1970. Rank Correlation Methods. London: Griffin.

Khajanchi, S., K. Sarkar, and S. Banerjee. 2022. “Modeling the Dynamics of COVID-19 Pandemic with Implementation of Intervention Strategies.” European Physical Journal Plus 137: 129. https://doi.org/10.1140/epjp/s13360-022-02347-w.

Khajanchi, S., K. Sarkar, J. Mondal, K. S. Nisar, and S. F. Abdelwahab. 2021. “Mathematical Modeling of the COVID-19 Pandemic with Intervention Strategies.” Results in Physics 25: 104285. https://doi.org/10.1016/j.rinp.2021.104285.

Li, X., and R. Fang. 2012. “A New Family of Bivariate Copulas Generated by Univariate Distributions.” Journal of Data Science 10: 1–17. https://doi.org/10.6339/jds.201201_10(1).0001.

Morgenstern, D. 1956. “Einfache Beispiele Zweidimensionaler Verteilungen.” Mitteilungsblatt für Mathematische Statistik 8: 234–5.

Mondal, J., and S. Khajanchi. 2022. “Mathematical Modeling and Optimal Intervention Strategies of the COVID-19 Outbreak.” Nonlinear Dynamics 109 (1): 177–202. https://doi.org/10.1007/s11071-022-07235-7.

Nelson, R. 2006. An Introduction to Copulas. New York: Springer-Verlag.

Pearson, K. 1895. “Notes on Regression and Inheritance in the Case of Two Parents.” Proceedings of the Royal Society of London 58: 240–2. https://doi.org/10.1098/rspl.1895.0041.

Priest, C. 2003. Correlations: What They Mean and More Importantly What They Don’t Mean. Sydney: The Institute of Actuaries of Australia Biennial Convention.

Roman, M., F. Louzada, V. G. Cancho, and J. G. Leite. 2012. “A New Long-Term Survival Distribution for Cancer Data [Internet].” Journal of Data Science 10 (2): 241–58. http://www.jds-online.com/volume-10-number-2-april-2012. https://doi.org/10.6339/JDS.201204_10(2).0005.

Sklar, A. 1959. Fonctions de répartition à n dimensions et leurs marges, Vol. 8, 229–31. Paris: Publications de L’Institute Statistical University Paris.

Spearman, C. 1904. “The Proof and Measurement of Association between Two Things.” American Journal of Psychology 15 (1): 72–101. https://doi.org/10.2307/1412159.

Schweizer, B., and E. F. Wolff. 1981. “On Nonparametric Measures of Dependence for Random Variables.” Annals of Statistics 9: 879–85. https://doi.org/10.1214/aos/1176345528.

Wang, G. J., C. Xie, P. Zhang, F. Han, and S. Chen. 2014. “Dynamics of Foreign Exchange Networks: A Time-Varying Copula Approach.” Discrete Dynamics in Nature and Society. Article ID 170921, 11 p. https://doi.org/10.1155/2014/170921.

Wienke, A. 2011. Frailty Models in Survival Analysis. New York: Chapman & Hall/CRC biostatistics series.

Worldometer Website. Also available at https://www.worldometers.info/coronavirus/#countries.

Zhang, L., and V. P. Singh. 2007. “Bivariate Rainfall Frequency Distributions Using Archimedean Copulas.” Journal of Hydrology 332: 93–109. https://doi.org/10.1016/j.jhydrol.2006.06.033.

Received: 2023-01-04

Accepted: 2023-03-20

Published Online: 2023-04-03

Published in Print: 2023-06-27

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/spp-2023-0002

Keywords for this article

dependence; number of tests; copula; positivity; peak; correlation
