Bayesian bivariate Conway–Maxwell–Poisson regression model for correlated count data in sports

Mauro Florez; Michele Guindani; Marina Vannucci

doi:10.1515/jqas-2024-0072

Published/Copyright: August 12, 2024

From the journal: Journal of Quantitative Analysis in Sports, Volume 21, Issue 1

Abstract

Count data play a crucial role in sports analytics, providing valuable insights into various aspects of the game. Models that accurately capture the characteristics of count data are essential for making reliable inferences. In this paper, we propose the use of the Conway–Maxwell–Poisson (CMP) model for analyzing count data in sports. The CMP model offers flexibility in modeling data with different levels of dispersion. Here we consider a bivariate CMP model that models the potential correlation between home and away scores by incorporating a random effect specification. We illustrate the advantages of the CMP model through simulations. We then analyze data from baseball and soccer games before, during, and after the COVID-19 pandemic. The performance of our proposed CMP model matches or outperforms standard Poisson and Negative Binomial models, providing a good fit and an accurate estimation of the observed effects in count data with any level of dispersion. The results highlight the robustness and flexibility of the CMP model in analyzing count data in sports, making it a suitable default choice for modeling a diverse range of count data types in sports, where the data dispersion may vary.

Keywords: Conway–Maxwell–Poisson distribution; intractable likelihood; random effects; Markov chain Monte Carlo; soccer and baseball; COVID-19

Corresponding author: Mauro Florez, Department of Statistics, Rice University, Houston, TX, USA, E-mail: mf53@rice.edu

Acknowledgments

We would like to express our gratitude to Scott Powers for his invaluable assistance and suggestions. We also extend our thanks to the two anonymous referees and the editors, for their insightful comments and recommendations, which have greatly improved the quality of this paper.

Research ethics: Not applicable.
Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Competing interests: The authors state no conflict of interest.
Research funding: None declared.
Data availability: Data available at: https://www.football-data.co.uk/ and https://www.retrosheet.org/.

Appendix A: Proof of Equation (3)

Proof.

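Let $\delta_i = (\delta_{i1}, \delta_{i2})$, where $\delta_{ij} = \exp(b_{ij})$. Then $\delta_i = \exp(b_i) \sim \mathrm{LN}_2(\mu, \Sigma)$, with $\mu = (\mu_1, \mu_2)^T = \exp(0.5 \cdot \mathrm{diag}(D))$ and $\Sigma = \mathrm{diag}(\mu)\,(\exp(D) - 11^T)\,\mathrm{diag}(\mu)$, where $\exp(D)$ is taken elementwise. Also, let $\lambda_{ij} = \exp(x_{ij}^T \beta_j)$, so that $y_{ij} \mid \lambda_{ij}, \delta_{ij}, \nu_{ij} \sim \mathrm{CMP}(\lambda_{ij}\delta_{ij}, \nu_{ij})$. Finally, let $\tilde{\lambda}_{ij} = \lambda_{ij}\mu_j$, $\tilde{\lambda}_i = (\tilde{\lambda}_{i1}, \tilde{\lambda}_{i2})$, and $\tilde{\Lambda}_i = \mathrm{diag}(\tilde{\lambda}_i)$. By the law of iterated expectations, and using the asymptotic approximations of Shmueli et al. (2005), we have

$$E(y_i \mid \beta, \gamma, D) \approx \tilde{\lambda}_i + \frac{1}{2\nu_i} - \frac{1}{2}, \qquad \mathrm{var}(y_i \mid \beta, \gamma, D) \approx V_i^{-1}\tilde{\Lambda}_i + \tilde{\Lambda}_i\,(\exp(D) - 11^T)\,\tilde{\Lambda}_i,$$

where $V_i = \mathrm{diag}(\nu_{i1}, \nu_{i2})$ and the expectation is read elementwise. Because $V_i^{-1}\tilde{\Lambda}_i$ is diagonal, only the second term contributes to the off-diagonal entry. Thus

$$\mathrm{cov}(y_{i1}, y_{i2}) \approx \tilde{\lambda}_{i1}\,(\exp(d_{12}) - 1)\,\tilde{\lambda}_{i2} = \lambda_{i1}\exp(0.5\,d_{11})\,(\exp(d_{12}) - 1)\,\lambda_{i2}\exp(0.5\,d_{22}).$$

Clearly, (3) can be positive or negative depending on the sign of $d_{12}$, i.e., the off-diagonal element of $D$.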
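As a rough numerical illustration of the result above, the following Python sketch evaluates the approximate covariance and compares its log-normal factor with a Monte Carlo estimate. The function names, the example values of D, and the assumption that b_i is bivariate normal with mean zero and covariance D are ours, chosen only for illustration; they are not taken from the authors' code.

```python
import math
import random


def approx_cov(lam1, lam2, d11, d22, d12):
    """Approximate cov(y_i1, y_i2) for D = [[d11, d12], [d12, d22]]."""
    lam1_tilde = lam1 * math.exp(0.5 * d11)
    lam2_tilde = lam2 * math.exp(0.5 * d22)
    return lam1_tilde * (math.exp(d12) - 1.0) * lam2_tilde


def simulated_lognormal_cov(d11, d22, d12, n_draws=200_000, seed=1):
    """Monte Carlo estimate of cov(delta_1, delta_2), delta = exp(b).

    Assumes b ~ N_2(0, D); this is an illustrative assumption.
    """
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix D.
    l11 = math.sqrt(d11)
    l21 = d12 / l11
    l22 = math.sqrt(d22 - l21 ** 2)
    s1 = s2 = s12 = 0.0
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e1 = math.exp(l11 * z1)
        e2 = math.exp(l21 * z1 + l22 * z2)
        s1 += e1
        s2 += e2
        s12 += e1 * e2
    m1, m2 = s1 / n_draws, s2 / n_draws
    return s12 / n_draws - m1 * m2


# With lam1 = lam2 = 1 the formula reduces to cov(delta_1, delta_2).
print(approx_cov(1.0, 1.0, 0.1, 0.1, 0.05))
print(simulated_lognormal_cov(0.1, 0.1, 0.05))
# The sign of the covariance follows the sign of d12.
print(approx_cov(2.0, 3.0, 0.2, 0.2, -0.1))
```

Setting d12 to a negative value yields a negative covariance, which is the property the proof highlights.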

Appendix B: Prior sensitivity analysis

We assess the performance of the bivariate Conway–Maxwell–Poisson (CMP) model under various prior specifications. Specifically, we focus on the Over-dispersed scenario with 1 Season discussed in Section 3 and examine different combinations of values for B₀, G₀, ν₀, and R₀. These combinations are summarized in Table 5. To monitor convergence, we calculate the multivariate potential scale reduction factor (R̂; Brooks and Gelman 1998) using 3 chains, and we assess the sampler’s efficiency through the effective sample size (ESS).

Table 5: Prior sensitivity: values considered in each scenario to assess the sensitivity of the model’s performance to the specification of prior hyperparameters.

Scenarios	A	B	C	D
B₀	0.1I	I	3I	10I
G₀	0.1I	I	3I	10I
ν₀	30	10	10	5
R₀	I	I	0.1I	0.1I

Table 6 reports the mean squared error (MSE) for μ_j in each scenario. Consistent with expectations, scenario A gives the lowest MSE for μ₁ (0.71) and one of the lowest for μ₂ (0.25), which can be attributed to the small variance of the priors employed in this scenario. Scenarios C and D perform worst for μ₁ (1.19 and 1.2), and scenario D is also clearly the worst for μ₂ (0.545), although scenario C attains the lowest MSE for μ₂ (0.22).

Table 6: Mean squared error (MSE) of μ – prior sensitivity analysis.

Scenarios	A	B	C	D
μ₁	0.71	1.01	1.19	1.2
μ₂	0.25	0.25	0.22	0.545

Table 7: Prior sensitivity analysis: multivariate potential scale reduction factor (R̂).

Scenarios	A		B		C		D
R̂	y₁	y₂	y₁	y₂	y₁	y₂	y₁	y₂
β	1.07	1.06	1.15	1.16	1.2	1.21	4.22	1.82
γ	1.05	1.05	1.19	1.17	1.22	1.24	4.74	5.04
b	1.02	1.02	1.03	1.03	1.22	1.24	4.22	3.91
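To make the diagnostic in Table 7 more concrete, the sketch below computes the univariate potential scale reduction factor from several chains. This is a simplification of the multivariate factor of Brooks and Gelman (1998) used in the text, and the helper names and simulated chains are hypothetical, intended only to show how within-chain and between-chain variability are compared.

```python
import random
import statistics


def psrf(chains):
    """Univariate potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances.
    w = statistics.fmean([statistics.variance(c) for c in chains])
    # B / n: sample variance of the chain means.
    b_over_n = statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * w + b_over_n
    return (var_plus / w) ** 0.5


def simulate_chains(means, n=2000, seed=0):
    """Hypothetical stand-in for MCMC output: one normal chain per mean."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in means]


# Chains exploring the same distribution give a value close to 1.
print(psrf(simulate_chains([0.0, 0.0, 0.0])))
# Chains stuck in different regions give a value well above 1.2.
print(psrf(simulate_chains([0.0, 3.0, 6.0])))
```

Values close to 1 indicate that the chains agree, whereas values such as those reported for scenario D indicate that the chains have not mixed.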

Table 8: Prior sensitivity analysis: effective sample size (ESS).

Scenarios	A		B		C		D
ESS	y₁	y₂	y₁	y₂	y₁	y₂	y₁	y₂
β	350.44	408.54	123.25	131.13	98.48	97.13	7.26	10.71
γ	506.3	511.11	195.37	192.4	106.45	111.09	47.51	43.48
b	9,366	9,555.62	9,175.4	9,286.9	2,324.947	2,351.463	2,306.58	2,329.84

Scenarios A and B yield more stable estimates, with all R̂ values below 1.2 (Table 7). In contrast, several trace plots for the chains in scenarios C and D showed mild convergence issues (plots not shown). Additionally, the effective sample size (ESS) analysis (Table 8) indicates higher algorithm efficiency in scenario A, while scenarios C and D show lower efficiency. These findings are likely connected to the specified values of ν₀ and R₀, as well as to the larger covariance matrices B₀ and G₀. Notably, the chains for the random effects display larger R̂ values and smaller ESS in scenarios C and D than in scenarios A and B. This discrepancy may be attributed to the prior expected value of the inverse of the random effects’ covariance matrix D, which is determined by ν₀R₀. Because this expected value is smaller in scenarios C and D, larger values are permitted for the random effects in those scenarios, despite the expectation of low heterogeneity in this simulation scenario.
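The role of ν₀R₀ can be illustrated with a short calculation. The sketch below assumes, for illustration only, that the inverse of D has a Wishart prior with ν₀ degrees of freedom and scale matrix R₀ = r·I in dimension p = 2, so that the prior mean of the inverse is ν₀R₀ and the prior mean of D itself is R₀⁻¹/(ν₀ − p − 1). The function names are hypothetical.

```python
# Hyperparameters from Table 5, with R0 = r0_scale * I.
SCENARIOS = {
    "A": {"nu0": 30, "r0_scale": 1.0},
    "B": {"nu0": 10, "r0_scale": 1.0},
    "C": {"nu0": 10, "r0_scale": 0.1},
    "D": {"nu0": 5, "r0_scale": 0.1},
}


def prior_mean_precision(nu0, r0_scale):
    """Diagonal entry of E[D^{-1}] = nu0 * R0 under the assumed Wishart prior."""
    return nu0 * r0_scale


def prior_mean_covariance(nu0, r0_scale, p=2):
    """Diagonal entry of E[D] = R0^{-1} / (nu0 - p - 1), defined for nu0 > p + 1."""
    if nu0 <= p + 1:
        raise ValueError("prior mean of D requires nu0 > p + 1")
    return (1.0 / r0_scale) / (nu0 - p - 1)


for name, hp in SCENARIOS.items():
    precision = prior_mean_precision(hp["nu0"], hp["r0_scale"])
    covariance = prior_mean_covariance(hp["nu0"], hp["r0_scale"])
    print(name, precision, covariance)
```

Under these assumptions ν₀R₀ equals 30I, 10I, I, and 0.5I in scenarios A to D, so the prior places progressively less weight on small random-effect variances as one moves from A to D.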

Appendix C: Shape parameters γ

Figure 18 shows the estimated shape parameters for the Premier League data analysis. The left sub-figure shows the estimates for the parameters associated with Home Goals. Strong teams, such as Chelsea, Liverpool, and Manchester City, are located below 0, indicating that when other teams play against them the dispersion parameter is lower, which implies a larger variance in the goals scored (over-dispersion). We observe a similar effect for Away Goals. Conversely, weaker teams such as Norwich or Brighton have a shape parameter above 0, indicating that when teams play against them the goals are more under-dispersed. On the x-axis, Manchester United and Manchester City are below 0, which means that the number of goals they score tends to be more over-dispersed than for the other teams. We see the opposite trend for Liverpool, whose goals scored tend to be more under-dispersed. These effects can be read directly in terms of over- or under-dispersion because the intercept effect was 0. The effects clearly differ between Home Goals and Away Goals, supporting the modeling approach we assumed.

Figure 18: Analysis of Premier League data: shape parameters.

Similarly, we plot the estimates for the MLB analysis in Figure 19. One notable case is the Miami Marlins (MIA), with a negative effect on the x-axis for Home Points and a positive effect for Away Points. This means that at home MIA tends to have more dispersed scores, while away their scores are less dispersed than those of the other teams. This highlights the ability of our model to adapt to data of different types and to accommodate the different phenomena and mechanics observed across teams and sports.

Figure 19: Analysis of MLB data: shape parameters.
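The link between the sign of a shape effect and the resulting dispersion, discussed throughout this appendix, can be checked numerically. The sketch below assumes a mode-type CMP parametrization with probability mass proportional to (μ^y / y!)^ν and a log link ν = exp(effect) with zero intercept. Both are assumptions made for illustration, as are the function name and the value μ = 3; they are not necessarily the exact specification used in the paper.

```python
import math


def cmp_moments(mu, nu, y_max=500):
    """Mean and variance for pmf proportional to (mu**y / y!)**nu on 0..y_max."""
    log_w = [nu * (y * math.log(mu) - math.lgamma(y + 1)) for y in range(y_max + 1)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    mean = sum(y * wy for y, wy in enumerate(w)) / z
    var = sum((y - mean) ** 2 * wy for y, wy in enumerate(w)) / z
    return mean, var


# A negative effect gives nu < 1, a positive effect gives nu > 1.
for effect in (-0.5, 0.0, 0.5):
    nu = math.exp(effect)
    mean, var = cmp_moments(3.0, nu)
    print(effect, round(nu, 3), round(var / mean, 3))
```

Under these assumptions, an effect below 0 produces a variance-to-mean ratio above 1 (over-dispersion), an effect of 0 recovers the Poisson case, and an effect above 0 produces a ratio below 1 (under-dispersion).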

References

Backlund, J. and Johdet, N. (2018). A Bayesian approach to predict the number of soccer goals: modeling with Bayesian negative binomial regression. Dissertation, Linköping University, The Division of Statistics and Machine Learning, Available at: https://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-149028.

Baio, G. and Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. J. Appl. Stat. 37: 253–264. https://doi.org/10.1080/02664760802684177.

Benson, A. and Friel, N. (2021). Bayesian inference, model selection and likelihood estimation using fast rejection sampling: the Conway–Maxwell–Poisson distribution. Bayesian Anal. 16: 905–931. https://doi.org/10.1214/20-ba1230.

Benz, L.S. and Lopez, M.J. (2021). Estimating the change in soccer’s home advantage during the Covid-19 pandemic using bivariate Poisson regression. AStA Adv. Stat. Anal.: 1–28. https://doi.org/10.1007/s10182-021-00413-9.

Boshnakov, G., Kharrat, T., and McHale, I.G. (2017). A bivariate Weibull count model for forecasting association football scores. Int. J. Forecast. 33: 458–466. https://doi.org/10.1016/j.ijforecast.2016.11.006.

Brooks, S.P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. J. Comput. Graph. Stat. 7: 434–455. https://doi.org/10.2307/1390675.

Chanialidis, C., Evers, L., Neocleous, T., and Nobile, A. (2018). Efficient Bayesian inference for COM-Poisson regression models. Stat. Comput. 28: 595–608. https://doi.org/10.1007/s11222-017-9750-x.

Chiu, Y. and Chang, C. (2022). Major League Baseball during the COVID-19 pandemic: does a lack of spectators affect home advantage? Humanit. Soc. Sci. Commun. 9: 1–6. https://doi.org/10.1057/s41599-022-01193-6.

Conway, R.W. and Maxwell, W.L. (1962). A queuing model with state dependent service rates. J. Ind. Eng. 12: 132–136.

Dixon, M.J. and Coles, S.G. (1997). Modelling association football scores and inefficiencies in the football betting market. J. R. Stat. Soc., C: Appl. Stat. 46: 265–280. https://doi.org/10.1111/1467-9876.00065.

Fedrizzi, G., Canal, L., and Micciolo, R. (2022). UEFA EURO 2020: a pure game of chance? arXiv preprint arXiv:2203.07531.

Guikema, S.D. and Goffelt, J.P. (2008). A flexible count data regression model for risk analysis. Risk Anal. Int. J. 28: 213–223. https://doi.org/10.1111/j.1539-6924.2008.01014.x.

Higgs, N. and Stavness, I. (2021). Bayesian analysis of home advantage in North American professional sports before and during COVID-19. Sci. Rep. 11: 1–11. https://doi.org/10.1038/s41598-021-93533-w.

Jones, M.B. (2015). The home advantage in Major League Baseball. Percept. Mot. Ski. 121: 791–804. https://doi.org/10.2466/26.pms.121c25x1.

Karlis, D. and Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. J. R. Stat. Soc. Ser. D Statistician 52: 381–393. https://doi.org/10.1111/1467-9884.00366.

Karlis, D. and Ntzoufras, I. (2009). Bayesian modelling of football outcomes: using the Skellam’s distribution for the goal difference. IMA J. Manag. Math. 20: 133–145. https://doi.org/10.1093/imaman/dpn026.

Kleiber, C. and Zeileis, A. (2016). Visualizing count data regressions using rootograms. Am. Stat. 70: 296–303. https://doi.org/10.1080/00031305.2016.1173590.

Kramer, D. (2022). 3 reasons for Seattle’s recent surge. MLB, Available at: https://www.mlb.com/news/mariners-playoff-odds-surging (Accessed 18 April 2024).

Lee, A.J. (1997). Modeling scores in the Premier League: is Manchester United really the best? Chance 10: 15–19. https://doi.org/10.1080/09332480.1997.10554791.

Lopez, M.J. (2016). Persuaded under pressure: evidence from the National Football League. Econ. Inq. 54: 1763–1773. https://doi.org/10.1111/ecin.12341.

Losak, J.M. and Sabel, J. (2021). Baseball home field advantage without fans in the stands. Int. J. Sport Finance 16. https://doi.org/10.32731/ijsf/163.082021.04.

Maher, M.J. (1982). Modelling association football scores. Stat. Neerl. 36: 109–118. https://doi.org/10.1111/j.1467-9574.1982.tb00782.x.

McCarrick, D., Bilalic, M., Neave, N., and Wolfson, S. (2021). Home advantage during the COVID-19 pandemic in European football. Psychol. Sport Exerc. 56: 102013. https://doi.org/10.1016/j.psychsport.2021.102013.

McHale, I. and Scarf, P. (2011). Modelling the dependence of goals scored by opposing teams in international soccer matches. Stat. Model. 11: 219–236. https://doi.org/10.1177/1471082x1001100303.

Murray, I., Ghahramani, Z., and MacKay, D. (2012). MCMC for doubly-intractable distributions. In: Proceedings of the twenty-second conference on uncertainty in artificial intelligence, pp. 359–366.

Payne, E.H., Gebregziabher, M., Hardin, J.W., Ramakrishnan, V., and Egede, L.E. (2018). An empirical approach to determine a threshold for assessing overdispersion in Poisson and negative binomial models for count data. Commun. Stat. Simulat. Comput. 47: 1722–1738. https://doi.org/10.1080/03610918.2017.1323223.

Pettersson-Lidbom, P. and Priks, M. (2010). Behavior under social pressure: empty Italian stadiums and referee bias. Econ. Lett. 108: 212–214. https://doi.org/10.1016/j.econlet.2010.04.023.

Piancastelli, L.S., Friel, N., Barreto-Souza, W., and Ombao, H. (2023). Multivariate Conway–Maxwell–Poisson distribution: Sarmanov method and doubly-intractable Bayesian inference. J. Comput. Graph. Stat. 32: 483–500. https://doi.org/10.1080/10618600.2022.2116443.

Price, K., Cai, H., Shen, W., and Hu, G. (2022). How much does home field advantage matter in soccer games? A causal inference approach for English Premier League analysis. arXiv preprint arXiv:2205.07193.

Reade, J., Schreyer, D., and Singleton, C. (2022). Eliminating supportive crowds reduces referee bias. Econ. Inq. 60: 1416–1436. https://doi.org/10.1111/ecin.13063.

Reep, C., Pollard, R., and Benjamin, B. (1971). Skill and chance in ball games. J. Roy. Stat. Soc. 134: 623–629. https://doi.org/10.2307/2343657.

Shmueli, G., Minka, T.P., Kadane, J.B., Borle, S., and Boatwright, P. (2005). A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution. J. R. Stat. Soc., C: Appl. Stat. 54: 127–142. https://doi.org/10.1111/j.1467-9876.2005.00474.x.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P., and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Roy. Stat. Soc. B Stat. Methodol. 64: 583–639. https://doi.org/10.1111/1467-9868.00353.

Thomas, R. (2019). West Ham are better when they don’t have the ball, which is why they’re thriving away from home. The Athletic, Available at: https://theathletic.com/1467224/2019/12/19/west-ham-are-better-when-they-dont-have-the-ball-which-is-why-theyre-thriving-away-from-home/ (Accessed 18 April 2024).

Tilp, M. and Thaller, S. (2020). Covid-19 has turned home advantage into home disadvantage in the German soccer Bundesliga. Front. Sports Act. Living 2: 593499. https://doi.org/10.3389/fspor.2020.593499.

Vihola, M. (2012). Robust adaptive metropolis algorithm with coerced acceptance rate. Stat. Comput. 22: 997–1008. https://doi.org/10.1007/s11222-011-9269-5.

Received: 2023-06-14

Accepted: 2024-07-06

Published Online: 2024-08-12

Published in Print: 2025-03-26
