Around the goal: examining the effect of the first goal on the second goal in soccer using survival analysis methods

Daniel Nevo; Ya’acov Ritov

doi:10.1515/jqas-2012-0004

Published/Copyright: June 8, 2013

From the journal Journal of Quantitative Analysis in Sports Volume 9 Issue 2

Abstract

In this paper we apply survival analysis techniques to soccer data, treating the scoring of a goal as the event of interest. We focus on the relationship between the time of the first goal in the game and the time of the second goal. To this end, the relevant survival analysis concepts are adjusted to fit the problem and a Cox model is developed for the hazard function. Time dependent covariates and a frailty term are also considered. We also use a reliable propensity score to summarize the pre-game covariates. The conclusions derived from the results are that a first goal could either expedite or impede the scoring of the next goal, depending on the time at which it was scored. Moreover, once a goal is scored, another goal becomes more and more likely as the game progresses. Furthermore, the first goal effect is the same whether the goal was scored or conceded.

Keywords: Cox model; frailty; goals; soccer; survival analysis

Corresponding author: Daniel Nevo, The Hebrew University of Jerusalem – Statistics, Jerusalem, Israel

We thank the Editor and two anonymous reviewers for helpful comments and suggestions. These comments helped us improve the paper’s quality, sharpen the presented questions, model, results and conclusions, and also contributed to the improved clarity of the paper. This research was partially supported by an ISF grant.

Appendix

A Detailed analysis

We now look for a more parsimonious model and compare it with the null model [Model (I)]. If we limit ourselves to models in which all effects are at least weakly significant (p<0.1), we get two possible models, Model (IV) and Model (V). The results for these models are displayed in Tables 4 and 5. Note that both models include TimeOfFirstGoal, though its coefficient and standard error estimates are twice as large in Model (V).
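In generic Cox notation, and assuming that each covariate listed in Tables 4 and 5 enters the linear predictor additively, the two models can be sketched as follows. The symbols h_0 and β_k are illustrative and not necessarily the authors' notation, and which covariates vary with t is left implicit.

```latex
% Model (IV): sketch under the assumptions stated above
h(t \mid Z) = h_0(t)\,\exp\bigl(\beta_1\,\mathrm{ProbWin} + \beta_2\,\mathrm{Season}
  + \beta_3\,\mathrm{RedCardsAway} + \beta_4\,\mathrm{TimeOfFirstGoal}\bigr)

% Model (V): adds the Goal indicator and the time elapsed since the first goal
h(t \mid Z) = h_0(t)\,\exp\bigl(\beta_1\,\mathrm{ProbWin} + \beta_2\,\mathrm{Season}
  + \beta_3\,\mathrm{RedCardsAway} + \beta_4\,\mathrm{Goal}
  + \beta_5\,\mathrm{TimeOfFirstGoal} + \beta_6\,\mathrm{TimeFromFirstGoal}\bigr)
```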

Table 4

Results of Model (IV).

	Coef	Exp (coef)	SE (coef)	z	p-Value
ProbWin	1.911	6.763	0.204	9.365	<0.001
Season	0.154	1.167	0.074	2.085	0.037
RedCardsAway	0.590	1.805	0.190	3.110	0.002
TimeOfFirstGoal	0.006	1.006	0.002	2.517	0.012

Cox model estimates for Model (IV). n=1433. 698 (48.7%) censored. Log likelihood=–4548.573. Likelihood ratio test statistic=105.8 on 4 degrees of freedom, p<0.001. All variables in the model are significant.

Table 5

Results of Model (V).

	Coef	Exp (coef)	SE (coef)	z	p-Value
ProbWin	1.922	6.831	0.204	9.402	<0.001
Season	0.155	1.168	0.074	2.085	0.037
RedCardsAway	0.580	1.787	0.190	3.054	0.002
Goal	–0.360	0.698	0.184	–1.955	0.051
TimeOfFirstGoal	0.012	1.012	0.004	3.006	0.003
TimeFromFirstGoal	0.008	1.008	0.004	1.914	0.056

Cox model estimates for Model (V). n=1433. 698 (48.7%) censored. Log likelihood=–4546.42. Likelihood ratio test statistic=110.1 on 6 degrees of freedom, p<0.001. Looking at this table together with Table 4 does not point conclusively to which model should be preferred.

The three models (including the null model) are nested, and hence a likelihood ratio test can be used. Comparing Model (IV) and Model (V) separately to Model (I) yields p-values of 0.013 and 0.015, respectively. Comparing Model (V) with Model (IV) gives a χ² statistic of 4.3 on 2 degrees of freedom (p=0.116), meaning that Model (V) should not be preferred over Model (IV).
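The likelihood ratio comparisons can be reproduced from the log likelihoods reported under the tables. The Python sketch below is a minimal illustration, not the authors' code. It relies on the fact that, for 2 degrees of freedom, the upper tail of the χ² distribution has the closed form exp(-x/2). The function name `lrt_two_df` is hypothetical.

```python
import math


def lrt_two_df(loglik_small, loglik_big):
    """Likelihood ratio test for nested models differing by exactly 2 parameters.

    For 2 degrees of freedom the chi-square upper tail probability is
    exp(-x / 2), so no statistics library is needed.
    """
    stat = 2.0 * (loglik_big - loglik_small)
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Log likelihoods as reported under Tables 4, 5 and 6.
LOGLIK_IV = -4548.573
LOGLIK_V = -4546.42
LOGLIK_VI = -4545.618

stat_v, p_v = lrt_two_df(LOGLIK_IV, LOGLIK_V)
stat_vi, p_vi = lrt_two_df(LOGLIK_IV, LOGLIK_VI)

print(f"Model (V) vs Model (IV):  chi2 = {stat_v:.2f}, p = {p_v:.3f}")
print(f"Model (VI) vs Model (IV): chi2 = {stat_vi:.2f}, p = {p_vi:.3f}")
```

The closed form only holds for 2 degrees of freedom. A comparison with a different number of added parameters would need a general χ² tail function.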

Note that Model (V) includes the time dependent term TimeFromFirstGoal, which enters the model linearly. One could also consider some other function of this time, or even estimate the function, although this matter is beyond the scope of this paper. The linear function that we have used so far implies a constant marginal effect over time. It is possible that the marginal effect is strongest in the minutes after the first goal and declines during the rest of the game. We emphasize that the first goal effect still increases with time. Therefore, we also consider the natural log function, LogTimeFromFirstGoal. The log function seems to fit the data better, since it preserves the significance of all covariates in the model and improves the likelihood of the model. The results for this model, Model (VI), are displayed in Table 6. If we repeat the likelihood ratio test to compare this model with Model (IV), we get a χ² statistic of 5.91 on 2 degrees of freedom, which yields a borderline p-value of 0.052.

Table 6

Results of Model (VI).

	Coef	Exp (coef)	SE (coef)	z	p-Value
ProbWin	1.927	6.872	0.204	9.425	<0.001
Season	0.156	1.169	0.074	2.098	0.036
RedCardsAway	0.591	1.806	0.190	3.112	0.002
Goal	–0.598	0.550	0.257	–2.328	0.020
TimeOfFirstGoal	0.0116	1.012	0.004	3.189	0.001
LogTimeFromFirstGoal	0.157	1.170	0.070	2.246	0.025

Cox model estimates for Model (VI). n=1433. 698 (48.7%) censored. Log likelihood=–4545.618. Likelihood ratio test statistic=111.7 on 6 degrees of freedom, p<0.001.
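To see how the three first-goal terms of Table 6 combine, the following Python sketch computes the implied hazard ratio for a second goal relative to a state in which no goal has been scored, with all other covariates held equal. It rests on assumptions not stated in this appendix: times are measured in minutes, the three first-goal covariates are zero before the first goal, and the covariates enter the linear predictor additively. It uses point estimates only and ignores their standard errors. The function name is hypothetical.

```python
import math

# Point estimates from Table 6 (Model (VI)).
BETA_GOAL = -0.598
BETA_TIME_OF_FIRST_GOAL = 0.0116
BETA_LOG_TIME_FROM_FIRST_GOAL = 0.157


def first_goal_hazard_ratio(first_goal_minute, current_minute):
    """Hazard ratio implied by the first-goal terms of Model (VI).

    Assumes minutes as the time unit and an additive linear predictor;
    both are assumptions of this sketch.
    """
    if current_minute <= first_goal_minute:
        raise ValueError("current_minute must be later than first_goal_minute")
    elapsed = current_minute - first_goal_minute
    log_ratio = (
        BETA_GOAL
        + BETA_TIME_OF_FIRST_GOAL * first_goal_minute
        + BETA_LOG_TIME_FROM_FIRST_GOAL * math.log(elapsed)
    )
    return math.exp(log_ratio)


# A few illustrative scenarios: (minute of first goal, current minute).
for first, now in [(10, 20), (10, 80), (60, 80)]:
    ratio = first_goal_hazard_ratio(first, now)
    print(f"first goal at {first}', now {now}': hazard ratio = {ratio:.2f}")
```

Under these assumptions, a first goal in the 10th minute gives a ratio below 1 ten minutes later, while a first goal in the 60th minute gives a ratio above 1 by the 80th minute. This matches the statement in the abstract that the first goal can either impede or expedite the next one, depending on when it was scored.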

In order to choose the most suitable model, we tried to fit Models (VI) and (IV) using only the HomeSecondGoalTime observations. Naturally, the variable Goal was dropped from this analysis. Altogether we have 673 observations. The surprising result is that neither model fits well, i.e., the effects related to the first goal are not significant.

Next, a stratified model with no interactions between the two goals, Model (VII), was fitted to the data, with a separate baseline hazard for each of HomeFirstGoalTime and HomeSecondGoalTime. Looking at the differences between the two estimated baseline cumulative hazard functions may reveal which model is more adequate.
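In generic notation, a stratified Cox model of this kind can be sketched as below, assuming that the covariate effects are shared across strata while each stratum has its own baseline hazard. The index j, the symbols h_{0j} and H_{0j}, and the covariate vector Z are illustrative and not necessarily the authors' notation.

```latex
% Stratum j = 1: HomeFirstGoalTime, stratum j = 2: HomeSecondGoalTime
h_j(t \mid Z) = h_{0j}(t)\,\exp\bigl(\beta^{\top} Z\bigr), \qquad j \in \{1, 2\},

% Baseline cumulative hazard of stratum j
H_{0j}(t) = \int_0^t h_{0j}(s)\,ds .
```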

Figure 6 presents these baseline cumulative hazards. The baseline cumulative hazard function for HomeFirstGoalTime seems to be linear. Linearity of the cumulative hazard function implies a constant hazard rate, i.e., the probability of an immediate goal at time t, conditional on no goal having been scored yet, is the same for all t. A constant hazard rate corresponds to the exponential distribution. This result is consistent with the usual Poisson model for the number of goals. The HomeSecondGoalTime baseline hazard, although close to linear, behaves differently at the edges. The hazard between the 20th and the 70th min is similar to the HomeFirstGoalTime hazard. Looking at the hazard ratio reveals that, as time goes by, a second goal is more and more likely to be scored in comparison to a first goal. This result is consistent with Figure 1. These findings strengthen the need for a time dependent term in the HomeSecondGoalTime hazard.

Figure 6

Estimated cumulative baseline hazards for Model (VII). The linear HomeFirstGoalTime cumulative hazard function implies constant hazard rate for the first goal. The HomeSecondGoalTime cumulative hazard function is linear for the first half but then changes with time. Thus, it seems reasonable to consider time dependent covariates for the HomeSecondGoalTime.
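The link between a linear cumulative hazard and a constant hazard rate can be illustrated with simulated data. The Python sketch below draws exponential goal times, censors them at 90 min, and computes the Nelson-Aalen estimate of the cumulative hazard. This is a simple stand-in for illustration and not the baseline estimator from the fitted stratified Cox model. The rate of 0.015 goals per minute and the sample size are hypothetical and were not estimated from the paper's data.

```python
import random


def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard.

    times  -- observed times (event or censoring)
    events -- 1 for an observed event, 0 for a censored observation
    Returns a list of (time, cumulative_hazard) pairs at the event times.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    cumulative = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            cumulative += n_events / at_risk
            curve.append((t, cumulative))
        at_risk -= n_leaving
    return curve


def value_at(curve, t):
    """Evaluate the step function at time t (0 before the first event)."""
    value = 0.0
    for time, hazard in curve:
        if time > t:
            break
        value = hazard
    return value


random.seed(0)
RATE = 0.015         # hypothetical constant hazard, goals per minute
GAME_LENGTH = 90.0   # censoring time in minutes

raw = [random.expovariate(RATE) for _ in range(5000)]
times = [min(x, GAME_LENGTH) for x in raw]
events = [1 if x < GAME_LENGTH else 0 for x in raw]
curve = nelson_aalen(times, events)

# Under a constant hazard, H(t) / t should stay close to RATE for all t.
for minute in (15, 30, 45, 60, 75, 90):
    print(minute, round(value_at(curve, minute) / minute, 4))
```

If the printed slopes stay roughly constant, the estimated cumulative hazard is close to a straight line through the origin, which is the pattern described for HomeFirstGoalTime.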

The hazard for the first 15 min behaves differently. A goal is more likely to be scored if one has already been scored. Models (IV) and (VI) miss this phenomenon. We tried to include the appropriate effect in our models, but it was rejected. It should be noted that only 29 out of the 673 HomeSecondGoalTime observations were in the first 15 min and therefore it is harder to model them appropriately.

We summarize the model choice discussion by stating that the first goal may affect the HomeSecondGoalTime through a few small effects. Combined, they are significant, but each of them separately is not strong enough to be found significant. Moreover, we have presented evidence of time dependent changes in the HomeSecondGoalTime hazard. Therefore, we suggest that Model (VI) is more appropriate.

Published Online: 2013-06-08

Published in Print: 2013-06-01
