Abstract
The availability of large databases of athletic performances offers the opportunity to understand age-related performance progression and to benchmark individual performance against the World’s best. We build a flexible Bayesian model of individual performance progression which allows for confounders, such as atmospheric conditions, and can be fitted using Markov chain Monte Carlo. We show how the model can be used to understand performance progression and the age of peak performance in both individuals and the population. We apply the model to both women and men in 100 m sprinting and weightlifting. In both disciplines, we find that age-related performance is skewed, that the average population performance trajectories of women and men are quite different, and that age of peak performance is substantially different between women and men. We also find that there is substantial variability in individual performance trajectories and the age of peak performance.
1 Introduction
The availability of large databases of elite athletic performances makes it possible to model age-related performance progression and to benchmark this progression against the World’s best athletes. This information has many applications such as systematically identifying those athletes with the potential for future success in order to prioritise funding and resource allocation (Allen, Vandenbogaerde, and Hopkins 2014), setting performance goals by coaches and/or athletes, or guiding training programmes. These data have repeated measurements, usually taken at irregular intervals (due to the timing of different events/competitions in the international calendar) and with confounders of performance such as wind speed in sprinting, seasonality effects, and competition level. As discussed by Brander, Egan, and Yeung (2014), selection effects due to ability differences across athletes provide a further complication. For example, only exceptional younger athletes will compete at the highest levels of sport, against a wider range of ability levels in their older, more established counterparts. This selection bias may result in an underestimation of the performance gains that an athlete may realize over their early years within senior level competition. As such, cross-sectional analytical methods including least squares or simple averages at different ages can lead to misleading results.
There are two main approaches to the statistical analysis of athletic performance over time. Firstly, a time series analysis of a small number of elite level performances can be used to understand population-level changes. For example, Stephenson and Tawn (2013) analysed world best times per year using extreme value theory to understand changing performance for men and women. Similarly, Gao, Li, and Wang (2020) analysed the finishing times in World Championship and Olympic Games semi-finals and finals using a time series model to understand the effects of technology in swimming, and Kovalchik and Stefani (2013) analysed medal-winning performances at Olympic Games using a regression model to understand whether the difference in performance between men and women was changing over time. Secondly, longitudinal models have been fitted to a database of performances measured over time. For example, in cricket, Boys and Philipson (2019) modelled batsmen’s scores allowing for the effect of age, home advantage, different innings and quality of opposition, whereas Stevenson and Brewer (2021) modelled changing individual performance with a Gaussian process. Egidi and Gabry (2018) built a hierarchical autoregressive model for player ratings in football allowing for individual, team and position effects. In ice hockey, Bradbury (2009) used a quadratic age effect and allowed for other covariates using a random effects model. Wimmer et al. (2011) built a factor model for decathlon performances (across the ten component events) and used spline functions to nonparametrically model the effect of age and the month of the year (this allows for the changes in performance levels over the athletics season).
The longitudinal approach allows for inference at the individual level (unlike the time series approach) and addresses the selection effect by accounting for different performance levels across athletes. This approach was first proposed for the analysis of sports data by Berry, Reese, and Larkey (1999) who fit Bayesian hierarchical models across different individual athletes and competition seasons. The performance for an individual in a particular season is decomposed into an individual effect, a season effect and an individual effect of age (called the individual ageing function). The use of individual ageing functions (rather than a single common ageing function for all individuals) is important since it allows for variability in the effect of ageing which is observed in data (for example, some athletes peaking earlier than others) and is a key part of longitudinal models. Berry, Reese, and Larkey (1999) model the individual ageing function as the sum of a flexible, nonparametric average ageing function and a simple (parametric) function. However, Albert (1999) suggests an alternative approach using a parametric average ageing function but a more flexible model for the departures for each individual ageing function. A range of individual and average ageing functions have been proposed for longitudinal monitoring including quadratic (Boys and Philipson 2019; Bradbury 2009), cubic (Brander, Egan, and Yeung 2014), a two parameter non-linear function (Berthelot et al., 2012; Strand, Nelson, and Grunwald 2018), and Gaussian processes (Stevenson and Brewer 2021).
In this paper, we will use a longitudinal model in a Bayesian framework (for a review of Bayesian methods for analysis of sports data the reader is referred to the work of Santos-Fernandez, Wu, and Mengersen 2019) and concentrate on centimetre-gram-second sports where performance is either measured in centimetres (e.g. distance thrown), grams (e.g. weight lifted) or seconds (e.g. time to run a specific distance). Our approach is in the spirit of Albert (1999) and uses a parametric form for the average ageing function and a flexible model for the individual ageing function. We make several contributions. Firstly, our model is more flexible than previous models in the literature. We assume a polynomial for the average ageing effect, a spline model for the individual ageing function and a skew t distribution for the observation errors. Secondly, we extend the Adaptively Scaled Individual adaptation proposal (Griffin, Latuszynski, and Steel 2021) in a Metropolis–Hastings sampler to allow variable selection in linear regression models with skew t distributed errors. Thirdly, we apply this model to large databases of athlete performances in both 100 m sprinting and weightlifting, and show that Markov chain Monte Carlo (MCMC) methods can be used for inference in this context. Fourthly, we discuss how the Bayesian approach can be used to draw inferences about aspects of an athlete’s performance, such as age of peak performance, at both the population and individual level.
The paper is organized as follows. Section 2 describes the data used in the study. Section 3 introduces the model to be used for the analysis and aspects of the computational methods required. Section 4 presents results for both males and females in 100 m sprinting and weightlifting. Section 5 discusses the approach and possible future work.
2 Data
Our analysis focuses on competition results from 100 m sprinting and weightlifting events obtained from the World Athletics and International Weightlifting Federation websites, respectively, following institutional ethical approval (Prop_72_2017_18). The 100 m sprint data contains results from both male and female sprinters who had at least 5 competition results between 8th January 2001 and 28th August 2021. The database contained 2834 male athletes who have a personal best below 10.5 s and 1297 female athletes who have a personal best below 11.6 s. The male data set had 95,376 observed performances, with the female data set having 48,999 observations. The ages of male athletes ranged from 12 to 45 years, whereas those of female athletes ranged from 12 to 42 years.
The weightlifting data comprised competition results from 1609 male and 1212 female lifters who had at least 5 competition results between 24th April 1998 and 24th May 2021. This led to 14,557 observations in the male data set and 11,690 observations in the female data set. Weightlifters’ ages ranged from 11 to 43 years in the male data set, and from 12 to 41 years in the female data set. The two different lifts that comprise weightlifting competitions (snatch and clean and jerk) were combined into a total lift score. The scores were transformed to Sinclair totals to allow for the differences in the levels of lift scores across weight categories. This allowed us to consider results across weight categories without adjusting for weight category in the model. The Sinclair total is calculated by multiplying the total lift score by a coefficient derived from the lifter’s body weight, the current World record holder’s body weight in the various weight categories, and a coefficient from World record performance from the previous years.
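The Sinclair adjustment described above can be sketched in a few lines. This is a minimal illustration assuming the commonly published form of the coefficient, 10 raised to a·(log10(body weight / b))², with no adjustment for lifters at or above the reference body weight b. The constants `A_EXAMPLE` and `B_EXAMPLE` are illustrative values of the kind published for each Olympic cycle and should be checked against the official tables; the text does not say which cycle's constants were used.

```python
import math


def sinclair_coefficient(body_weight_kg: float, a: float, b: float) -> float:
    """Return the Sinclair coefficient for a lifter.

    a: cycle-specific constant (assumed to come from published tables).
    b: reference body weight in kg; lifters at or above b get coefficient 1.
    """
    if body_weight_kg >= b:
        return 1.0
    return 10.0 ** (a * math.log10(body_weight_kg / b) ** 2)


def sinclair_total(total_kg: float, body_weight_kg: float, a: float, b: float) -> float:
    """Multiply the total lift (snatch + clean and jerk) by the coefficient."""
    return total_kg * sinclair_coefficient(body_weight_kg, a, b)


# Illustrative constants only (an assumption, not taken from the paper).
A_EXAMPLE = 0.751945030
B_EXAMPLE = 175.508

if __name__ == "__main__":
    # A hypothetical 73 kg lifter with a 340 kg total.
    print(round(sinclair_total(340.0, 73.0, A_EXAMPLE, B_EXAMPLE), 1))
```

Lighter lifters receive a coefficient above one, so totals from different weight categories are placed on a common scale before modelling.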
3 Model
We assume that there is a database of measured performances for $M$ athletes. The $i$th athlete has $n_i$ performances, and we define $y_{i,j}$ to be the $j$th performance of the $i$th athlete, which occurs at age $t_{i,j}$ with a $p$-dimensional vector of additional regressors $x_{i,j}$, which are functions of measured confounders or other covariates. This allows the model to include the effects of confounders of performance such as wind speed in sprint running events. We model $y_{i,j}$ as the sum of an individual age effect $h_i(t_{i,j})$, a linear regression effect of $x_{i,j}$ with coefficients $\zeta$, and an observation error term involving $\epsilon_{i,j}$.
Here, $h_i(t)$ is the $i$th athlete’s individual performance trajectory (or ageing function) as a function of age, $\zeta$ are population-level regression coefficients shared by all athletes and events, and $\epsilon_{i,j}$ are observation errors which are assumed to follow a standard skew t distribution (Azzalini and Capitanio 2003). The individual performance trajectory $h_i(t)$ is decomposed into two parts, $h_i(t) = g(t) + f_i(t)$: $g(t)$ represents the population performance trajectory (or ageing function), and $f_i(t)$ represents the difference between the $i$th individual performance trajectory and the population performance trajectory, which we call the excess performance trajectory.
Our modelling approach assumes that $f_i(t)$ can be less smooth than $g(t)$ for some athletes, since $g(t)$ represents the population average whereas individual performance trajectories can change abruptly. Therefore, we use a parametric model for the population performance trajectory and a non-linear regression model for the excess performance trajectory. This follows the modelling approach suggested by Albert (1999). In our analyses, we assume that $g(t)$ is a fourth-order polynomial in age with coefficients $\eta$.
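The decomposition of an expected performance into a population trajectory, an excess trajectory and covariate effects can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the additive form follows the description above, but the spline basis is not specified here, so linear hinge terms `max(t - knot, 0)` are used as a stand-in, and all names (`offset`, `beta`, `knots`) are hypothetical.

```python
def population_trajectory(t, eta):
    """Quartic g(t) = eta[0] + eta[1]*t + ... + eta[4]*t**4 (Horner form)."""
    value = 0.0
    for coef in reversed(eta):
        value = value * t + coef
    return value


def excess_trajectory(t, offset, beta, knots):
    """f_i(t): athlete-specific offset plus hinge terms (assumed basis).

    A zero entry in beta corresponds to a spline term that is excluded
    from the model by the variable selection step.
    """
    return offset + sum(b * max(t - k, 0.0) for b, k in zip(beta, knots))


def expected_performance(t, x, eta, zeta, offset, beta, knots):
    """h_i(t) + x'zeta, i.e. the performance before observation error."""
    covariate_effect = sum(xk * zk for xk, zk in zip(x, zeta))
    individual = population_trajectory(t, eta) + excess_trajectory(t, offset, beta, knots)
    return individual + covariate_effect
```

With every entry of `beta` set to zero, the athlete's trajectory is the population curve shifted by a constant, which corresponds to a flat excess performance trajectory.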
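The choice of a skew t distribution for the observation errors allows for both skewness and kurtosis, which occur due to unusually poor performances (for example, longer times in track events or shorter throws or jumps in field events). Following Azzalini and Capitanio (2003), a random variable $Y$ has a standard skew t distribution with parameters $\alpha$ and $\nu$ if it has density $p(y) = 2\, t_\nu(y)\, T_{\nu+1}\!\left(\alpha y \sqrt{(\nu + 1)/(\nu + y^2)}\right)$, where $t_\nu$ is the density of a Student t distribution with $\nu$ degrees of freedom and $T_{\nu+1}$ is the distribution function of a Student t distribution with $\nu + 1$ degrees of freedom.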
We use a Bayesian analysis and assume the following priors for our model (the parametrization of all distributions is given in Appendix A). The regression coefficients in the population-level performance trajectory are given an uninformative prior, $\pi(\eta) \propto 1$. The regression coefficients of the regressors $x_{i,j}$ are given independent horseshoe priors (Carvalho, Polson, and Scott 2010),
where $C^+$ is the standard half-Cauchy distribution, which has density $p(x) = 2/\{\pi(1 + x^2)\}$ for $x > 0$.
This choice allows us to perform regularisation on $\zeta$ and so shrink out regressors which are not related to the response. The parameters of the excess performance trajectory $f_i(t)$ follow a standard Bayesian variable selection structure by assuming that only some of the coefficients $\beta_{i,j}$ are non-zero. The $j$th regression coefficient $\beta_{i,j}$ is non-zero (and the $j$th spline included in the model) if $\gamma_{i,j} = 1$ and is given the hierarchical prior
Let
where $\gamma_i = (\gamma_{i,1}, \dots, \gamma_{i,K})$ and
The remaining population-level parameters are given the following prior distributions
The prior choices for $\alpha$ and $\nu$ follow those advocated by Frühwirth-Schnatter and Pyne (2010).
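To illustrate the shrinkage behaviour of the horseshoe prior placed on the regression coefficients, the sketch below draws coefficients from one common parametrization: each coefficient is normal with standard deviation equal to the product of a local half-Cauchy scale and a global half-Cauchy scale. This parametrization is an assumption made for illustration; the exact form used in the model is the one given in Appendix A.

```python
import math
import random


def half_cauchy(rng):
    """Draw from the standard half-Cauchy via the inverse-CDF of a Cauchy."""
    return abs(math.tan(math.pi * (rng.random() - 0.5)))


def horseshoe_draw(p, rng):
    """Draw p coefficients with local scales lambda_k and a global scale tau."""
    tau = half_cauchy(rng)
    local_scales = [half_cauchy(rng) for _ in range(p)]
    return [rng.gauss(0.0, 1.0) * lam * tau for lam in local_scales]


if __name__ == "__main__":
    rng = random.Random(2021)
    print([round(z, 3) for z in horseshoe_draw(5, rng)])
```

Repeated draws typically show many coefficients close to zero alongside occasional large values, which is the behaviour that lets the prior shrink out irrelevant regressors while leaving strong effects largely untouched.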
Bayesian inference can be implemented using Markov chain Monte Carlo (MCMC) with a Gibbs sampler. The sampling uses several latent variable representations to give simpler updates of the parameters. Firstly, the standard skew-t distribution can be represented as
where
The same representation can be used for $\tau^2$,
All steps for the Gibbs sampler are given in Appendix A.
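A latent-variable view of the skew t distribution can be illustrated with the standard stochastic representation of Azzalini and Capitanio (2003): a skew-normal variable, built from a half-normal and a normal variable, divided by the square root of a scaled chi-squared variable. The sketch below assumes this representation; the precise latent variables used in the sampler are those in Appendix A and may be parametrized differently.

```python
import math
import random


def sample_skew_normal(alpha, rng):
    """Skew-normal draw: delta*|U0| + sqrt(1 - delta^2)*U1 with U0, U1 ~ N(0, 1)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    u0 = abs(rng.gauss(0.0, 1.0))  # latent half-normal variable
    u1 = rng.gauss(0.0, 1.0)
    return delta * u0 + math.sqrt(1.0 - delta * delta) * u1


def sample_skew_t(alpha, nu, rng):
    """Standard skew t draw: skew-normal divided by sqrt(chi2_nu / nu)."""
    z = sample_skew_normal(alpha, rng)
    w = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared with nu degrees of freedom
    return z / math.sqrt(w / nu)


if __name__ == "__main__":
    rng = random.Random(42)
    draws = [sample_skew_t(1.3, 19.0, rng) for _ in range(5)]
    print([round(d, 3) for d in draws])
```

Conditional on the half-normal and chi-squared variables the draw is Gaussian, which is why representations of this kind lead to simpler Gibbs updates.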
Bayesian variable selection is needed for the spline model of each individual, and we use the recently developed Adaptively Scaled Individual adaptation scheme (Griffin, Latuszynski, and Steel 2021) to update the full conditionals associated with each spline model. Griffin, Latuszynski, and Steel (2021) show that their approach can lead to faster mixing Markov chains than traditional approaches to MCMC in Bayesian variable selection in linear regression models with normal errors (see e.g. Dellaportas, Forster, and Ntzoufras 2000; Tadesse and Vannucci 2021). We extend their approach to linear regression models with skew t error distributions using a latent variable representation.
We define
where
In our Gibbs sampler, the adaptive Metropolis–Hastings method of Griffin, Latuszynski, and Steel (2021) can be used to update $\gamma_i$. The method uses a Metropolis–Hastings update and proposes a new state
for $j = 1, \dots, K$. The tuning parameter $A_{i,j}$ is the probability of adding the $j$th variable to the model and the tuning parameter $D_{i,j}$ is the probability of deleting the $j$th variable from the model. The proposed value is accepted with probability
where
The values of these tuning parameters are adjusted during the MCMC run using the following rule
where $\kappa_i$ is a further individual-specific tuning parameter and
where $\mathrm{logit}_\epsilon(x) = \log(x - \epsilon) - \log(1 - x - \epsilon)$ for $0 \le \epsilon \le 1/2$, $1/2 < \lambda \le 1$, and $a_t$ is the Metropolis–Hastings acceptance probability at the $t$th iteration. Griffin, Latuszynski, and Steel (2021) show that this scheme is optimal with $\eta = 1$ if $\gamma_{i,1}, \dots, \gamma_{i,K}$ are independent under the posterior distribution and that it leads to good performance under different levels of dependence.
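The add/delete proposal and its Metropolis–Hastings acceptance step can be sketched for a single athlete as follows. The lists `add` and `delete` play the role of the tuning parameters $A_{i,j}$ and $D_{i,j}$. The target `toy_log_posterior`, with independent inclusion probabilities `PI`, is a hypothetical stand-in for the real posterior of $\gamma_i$, and the adaptation rule for the tuning parameters is not reproduced.

```python
import math
import random


def proposal_prob(current, proposed, add, delete):
    """Probability of proposing `proposed` from `current` (independent flips)."""
    prob = 1.0
    for g, g_new, a, d in zip(current, proposed, add, delete):
        if g == 0:
            prob *= a if g_new == 1 else 1.0 - a
        else:
            prob *= d if g_new == 0 else 1.0 - d
    return prob


def propose(current, add, delete, rng):
    """Add excluded variables with prob. add[j], delete included ones with prob. delete[j]."""
    proposed = []
    for g, a, d in zip(current, add, delete):
        flip = rng.random() < (a if g == 0 else d)
        proposed.append(1 - g if flip else g)
    return proposed


def acceptance_prob(current, proposed, add, delete, log_post):
    """Metropolis-Hastings acceptance probability for a proposed move."""
    forward = proposal_prob(current, proposed, add, delete)
    backward = proposal_prob(proposed, current, add, delete)
    if backward == 0.0:
        return 0.0
    log_ratio = (log_post(proposed) - log_post(current)
                 + math.log(backward) - math.log(forward))
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


# Hypothetical target: independent inclusion probabilities (illustration only).
PI = [0.2, 0.5, 0.9]


def toy_log_posterior(gamma):
    return sum(math.log(p) if g == 1 else math.log(1.0 - p) for g, p in zip(gamma, PI))


ADD = [min(1.0, p / (1.0 - p)) for p in PI]
DELETE = [min(1.0, (1.0 - p) / p) for p in PI]
```

For this toy target with independent coordinates, setting the add and delete probabilities from the inclusion odds as above gives an acceptance probability of one for every proposed move, which is in line with the optimality statement for the independent case.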
4 Results and discussion
We have analyzed the two sets of data (100 m sprinting and weightlifting) using our model separately for males and females. We include the month of competition as a covariate using January as a baseline in all analyses. Wind is a confounder in sprint events and we include both linear and quadratic effects of wind in the 100 m sprint analyses. The MCMC algorithm was run for 11,000 iterations, with the first 1000 used as burn-in. We saved every 10th iteration after burn-in and ran 10 chains, which gives 10,000 recorded MCMC samples in total.
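The number of recorded samples follows directly from the run settings. The short check below assumes that thinning is applied after the burn-in iterations are discarded; the function name is hypothetical.

```python
def recorded_samples(iterations, burn_in, thin, chains):
    """Samples kept across all chains after discarding burn-in and thinning."""
    per_chain = (iterations - burn_in) // thin
    return per_chain * chains


if __name__ == "__main__":
    # 11,000 iterations, 1000 burn-in, every 10th draw saved, 10 chains.
    print(recorded_samples(11_000, 1_000, 10, 10))
```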
4.1 Ageing and athletic performance
Figures 1 and 2 show inference about the population performance trajectories for 100 m sprinting and weightlifting respectively. As shown in Figure 1, the population performance trajectory suggests that age of peak performance occurs about two years earlier for males than females in the 100 m. The parabolic relationship between age and performance is not symmetric with a greater rate of performance improvement (i.e. faster sprinting times) before peak age, with a reduced rate of performance decline subsequently (i.e. slower sprinting times).

Figure 1: The population performance trajectory for 100 m sprint. The solid line is the posterior median and the dashed lines are 95% credible intervals.

Figure 2: The population performance trajectory for weightlifting. The solid line is the posterior median and the dashed lines are 95% credible intervals.
This asymmetry around the age of peak performance holds for both men and women. This biphasic development of athletic performance was first described by Moore (1975) and has later been characterised using quadratic and polynomial functions (Berthelot et al. 2012; Bongard et al. 2007; Stones and Kozma 1984). As shown by Figure 3, in agreement with the previous literature (Boccia et al. 2017), the rate of improvement in running performance for males appears to be much greater than for females up until their late teens. However, over the early 20s improvement in female sprinting performance surpasses that of their male counterparts. For example, at age 20, average times for women are improving by 0.07 s per year whereas men show an improvement of 0.04 s per year.

Figure 3: The derivative of the population performance trajectory for 100 m sprinting and weightlifting shown as posterior median (solid line) and 95% credible interval (dashed lines). Men and women are shown in blue and red respectively.
There is no evidence of differences in the rate of deterioration of performance between the sexes at older ages. Between the age of peak performance and 30 years of age, both male and female sprint performance deteriorates at a rate of 0.03 s/year, which is similar to previous findings investigating short-duration events in swimming over a similar age range (Tanaka and Seals 1997). The similar magnitudes of decline in short duration sprinting performance with age between male and female athletes suggest that age-related decline in physiological determinants of anaerobic performance is likely to be similar between groups at the elite level. As shown in Figure 2, both male and female lifters show a rapid improvement in performance until an age of peak performance followed by a subsequent decrease in performance. The rate of increase in performance for males is greater than for females at younger ages (up to 23 years old), which is unsurprising, since functional performance capacity increases much faster for males compared to females during puberty. However, this difference in the rate of performance improvement is not evident following maturation between 23 and 30 years. After 30, average men’s performance decreases at a faster rate than women’s. This may be due to different drop-out rates between males and females once athletes pass their age of peak performance.
Interestingly, performance in weightlifting is maintained at a level close to the peak for a much longer period than seen in the 100 m data, with a clear plateau above the 0 point being evident from the mid-20s until approximately 30 in the age-related performance derivative (Figure 3). Indeed, previous research has suggested that task-related duration appears to modulate the decline in physiological functional capacity (Donato et al. 2003). Moreover, there is evidence that muscle function appears to decline less rapidly in the upper limbs compared with the lower limbs (McDonagh, White, and Davies 1984), which may also be a factor in the findings. A slower rate of decline in weightlifting versus sprinting performance may therefore be attributed to a slower decline in anaerobic power of the upper body muscles.
Similar results were found between sprinting and weightlifting for age of peak performance (Figure 4), with men reaching their peak approximately 2 years earlier than women. These findings are in accordance with Haugen et al. (2018), who analysed season best results from World ranked track and field athletes. The current study observed similar results despite analysing all results from athletes competing on the World stage, not just those within the top 100. Reasons for these sex differences are unclear, but previous research has speculated that factors such as hormone-dependent changes in muscle and fat mass, access to specialized training, exposure to training and technique development, and child-bearing in females are all possible causes (Haugen et al. 2018). However, as outlined above, the data suggest that women catch up some of this gender gap in their early and mid-20s, and support the notion that women appear to improve more quickly than men in the years immediately preceding their age of peak performance (Haugen et al. 2018).

Figure 4: Posterior distribution of population peak age for 100 m sprint and weightlifting.
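A posterior distribution for the age of peak performance can be obtained by locating the optimum of the population trajectory for each posterior draw of the polynomial coefficients. The sketch below is one simple way to do this, assuming a grid search over ages; the coefficient draws in `DRAWS` are hypothetical toy values rather than output from the fitted model.

```python
from statistics import median


def polynomial(t, eta):
    """Evaluate eta[0] + eta[1]*t + ... in Horner form."""
    value = 0.0
    for coef in reversed(eta):
        value = value * t + coef
    return value


def peak_age(eta, ages, lower_is_better):
    """Age on the grid with the best value (minimum for times, maximum for lifts)."""
    values = [polynomial(t, eta) for t in ages]
    target = min(values) if lower_is_better else max(values)
    return ages[values.index(target)]


def peak_age_draws(draws, ages, lower_is_better):
    """One peak age per posterior draw of the coefficients."""
    return [peak_age(eta, ages, lower_is_better) for eta in draws]


# Age grid from 15 to 40 in half-year steps, and three toy coefficient draws.
AGES = [15 + 0.5 * k for k in range(51)]
DRAWS = [
    [576.0, -48.0, 1.0, 0.0, 0.0],
    [635.0, -50.0, 1.0, 0.0, 0.0],
    [686.0, -52.0, 1.0, 0.0, 0.0],
]

if __name__ == "__main__":
    peaks = peak_age_draws(DRAWS, AGES, lower_is_better=True)
    print(peaks, median(peaks))
```

The same function applies to an individual athlete if the individual trajectory is evaluated on the grid instead of the population polynomial; a histogram of the resulting peak ages approximates the posterior distribution.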
4.2 Variability of performance in males versus females
Results of the current study demonstrate that the variance in 100 m sprinting performance is larger for females than males (Table 1).
Table 1: 100 m sprint: Parameter estimates for the model with month and wind effect (with regressors) and without either effect (without regressors).
| | Without regressors | | | With regressors | | |
|---|---|---|---|---|---|---|
| | σ² | α | ν | σ² | α | ν |
| Males | 0.0293 | 1.29 | 52.1 | 0.0197 | 1.21 | 18.4 |
| | (0.0181, 0.309) | (0.00, 1.37) | (29.9, 93.1) | (0.0163, 0.0206) | (0.95, 1.30) | (14.6, 21.8) |
| Females | 0.0462 | 1.55 | 49.0 | 0.0307 | 1.35 | 19.6 |
| | (0.0279, 0.0483) | (0.52, 1.64) | (23.0, 73.2) | (0.0215, 0.0338) | (0.53, 1.48) | (14.2, 47.5) |

Figure 5: The posterior predictive distribution of $\epsilon_{i,j}$ for 100 m sprinting and weightlifting for men and women.
In weightlifting, the variability in performance is much larger for males than females. This is in contrast to previous literature which suggests greater within-athlete variability in women than men (McGuigan and Kane 2004; Solberg et al. 2019). The reason for these contrasting findings is unclear, but may be due to the much larger and more representative sample of female lifters used in the current study. Moreover, the inclusion of more recent competition results may also address concerns in previous literature over the level of competitiveness and participation rates of female weightlifting compared to those of their male counterparts. The error distribution is negatively skewed with a heavy tail for both males and females and, in a similar way to the 100 m sprinting, this indicates that an under-performance is more likely than an over-performance. As shown in Figure 5, the size of skewness and kurtosis is very similar between males and females. Even though similar in nature, the error distribution is clearly more skewed and heavier tailed than that of 100 m sprinting. This may be due to the fact that sprinting involves a single performance whereas weightlifting involves a set series of lifts in which the athlete attempts to lift the largest weight possible in order to win the competition. Therefore, lifters may not lift their best possible weight due to either misjudgements of the chosen weights to lift, or competition tactics.
4.3 Effects of covariates
Previous models of athletic performance trajectories have, generally, not included adjustments for confounding factors of performance, such as seasonality or wind effects. For 100 m sprinting, the effects of the wind and month of competition are shown in Table 1 and Figure 6. There is clear evidence that both effects are important for explaining variation in sprinting performance. The variance of the errors in the regression model is substantially reduced when regressors are included, with effects being similar for both males and females.

Figure 6: The effects of month and wind for 100 m sprint.
In agreement with the findings of previous research, the current study demonstrates that wind speed has a strong effect on performance, with greater tail winds associated with faster times (Moinat, Fabius, and Emanuel 2018). Specifically, for males a 1 m per second increase in tail wind improves sprint time by 0.05 s on average. For females, the tail wind advantage is slightly bigger, leading to an average improvement of 0.06 s. Slightly faster sprint times are also shown during the main part of the outdoor track and field racing calendar in June and July. However, interestingly, the seasonality effect is smaller than that of wind, with the sprinting time improving by only 0.04 s in June and July compared to January, suggesting that performance is remarkably stable over the year.
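The way month and wind enter the regression can be sketched by building one row of regressors per performance: eleven month indicators with January as the baseline, followed by linear and quadratic wind terms. The coefficient vector `ZETA_EXAMPLE` is hypothetical. It simply encodes the approximate average effects quoted above for male sprinters (0.04 s faster in June and July, 0.05 s faster per 1 m/s of tail wind) and sets the quadratic wind coefficient to zero, which is an assumption made purely for illustration.

```python
def design_row(month, wind):
    """Regressors for one performance.

    month: calendar month 1..12, with January (1) as the baseline.
    wind: wind speed in m/s, tail wind positive.
    """
    month_dummies = [1.0 if month == m else 0.0 for m in range(2, 13)]
    return month_dummies + [wind, wind * wind]


def covariate_effect(row, zeta):
    """Linear predictor contribution x'zeta for one row."""
    return sum(x * z for x, z in zip(row, zeta))


# Hypothetical coefficients: February..December, then wind and wind squared.
ZETA_EXAMPLE = [0.0, 0.0, 0.0, 0.0, -0.04, -0.04, 0.0, 0.0, 0.0, 0.0, 0.0, -0.05, 0.0]

if __name__ == "__main__":
    # A June race with a 1 m/s tail wind, relative to a still day in January.
    print(round(covariate_effect(design_row(6, 1.0), ZETA_EXAMPLE), 2))
```

Subtracting this effect from an observed time gives a performance adjusted to baseline conditions, which is the kind of adjustment applied before individual trajectories are compared.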
In weightlifting, there was evidence of a small effect across the months for both males and females. Males demonstrated an improvement of 2.5 and females of 1.8 Sinclair Total units in August compared to January. Plots of the results are shown in Figure 7. This weaker effect leads to similar estimated variability for the model with and without regressors and, as such, we only report results for the model with regressors in Table 2.

Figure 7: The effects of month for weightlifting.
Table 2: Weightlifting: Parameter estimates for the model with month effect (with regressors) and without month effect (without regressors).
| | Without regressors | | | With regressors | | |
|---|---|---|---|---|---|---|
| | σ² | α | ν | σ² | α | ν |
| Males | 211 | −1.97 | 7.9 | 206 | −1.91 | 7.8 |
| | (141, 311) | (−2.73, −1.44) | (6.3, 71.4) | (152, 308) | (−2.65, −1.34) | (6.2, 42.2) |
| Females | 134 | −1.73 | 7.3 | 126 | −1.64 | 6.9 |
| | (103, 154) | (−1.97, −1.10) | (5.9, 9.7) | (100, 147) | (−1.93, −0.60) | (5.7, 12.2) |
4.4 Individual performance trajectory analysis
Analysis of performance trajectories provides evidence of a typical rate of progression of successful athletes, which can be used to inform benchmarks for talent identification, to analyse performance during a season in order to identify the impact of performance improvement strategies, and to predict performance itself. For this level of information it is important to consider the performance trajectory at an individual level in comparison both to the athlete’s own historical performance level and to that of the wider cohort of athletes. Within the current model this is provided by the measure of excess performance. In Figures 8 and 9, we present examples of individual performance trajectories (adjusted for seasonality and wind effects) for three male and three female 100 m sprinters from the World’s top 50 (based upon 2021 rankings) across the course of their careers. The data points in panel (b) are adjusted by the posterior mean population performance trajectory, and month and wind effects. Figure 8 shows results for men’s 100 m sprinting. It can be seen that across his career, Athlete 1 consistently performs approximately 0.6 s faster than the rest of the population, bearing in mind that the population displays performance improvement up until the mid-20s, at which point a decline in performance is seen. Similarly, Athlete 3 shows a career profile that evolves at a similar rate to the population, leading to a flat excess performance trajectory, but again consistently better than the wider population. By comparison, Athlete 2 was performing at the population average at age 20, but had almost caught up with Athlete 1 by age 30, showing a much later improvement in performance and an ability to retain performance better than others. At the age of 25 years, Athlete 1 was on average 0.1 s faster than Athlete 2.

Figure 8: Individual performance inference for Athletes 1–3: (a) shows observed performances and posterior median individual performance trajectory (solid line) with 95% credible interval (dashed line), and (b) shows adjusted excess performances and posterior median excess performance trajectory (solid line) with 95% credible interval (dashed line).
Similar patterns of performance are seen in female sprint athletes (Figure 9), such as Athlete 5, whose career performance path largely follows the same polynomial pattern as the wider population, but with a level of excess performance that is between 0.19 and 0.42 s greater. Athlete 6 competed in the heptathlon in the earlier part of her career and specialised in sprinting disciplines later. As such, she realises a greater performance gain of 0.50 s into her early and mid-20s than seen across the wider population, but as she learns the skills of sprinting the rate of her performance improvement slows to match that of the wider population by the age of 30. Her excess performance therefore follows a “v” shaped trajectory. Having come through the American college athletic system, Athlete 4 shows a fast rate of improvement in her performance over her early 20s (0.57 s/year at the initial peak age of 24 years). However, recurrent injuries between the ages of 26 and 27 limit her rate of performance improvement compared to the wider population of sprinters. Following resolution of her injuries, the athlete goes on to demonstrate a second peak in her performance at 29 years of age, when her rate of improvement reaches 0.63 s/year.

Figure 9: Individual performance inference for Athletes 4–6: (a) shows observed performances and posterior median individual performance trajectory (solid line) with 95% credible interval (dashed line), and (b) shows adjusted excess performances and posterior median excess performance trajectory (solid line) with 95% credible interval (dashed line).
Figure 10 shows the posterior distribution of the age of peak performance for six 100 m sprint athletes. There are clearly differences in these distributions. There is clear evidence that Athlete 2 peaks at a later age than Athletes 1, 3 or 6. There are also some differences in shape. Particularly clear is the bi-modal posterior distribution of Athlete 4, which indicates the aforementioned two periods when this athlete was performing at her peak level.

Figure 10: Posterior distributions of individual age of peak performance for Athletes 1–6.
Individual performance trajectories for three male and three female weightlifters ranked in the top 50 (as of August 2021) are shown in Figures 11 and 12. It is evident that elite weightlifting athletes tend to compete fewer times per year than the 100 m sprinters. However, performance trajectories appear similar, with increasing (Athlete 7), polynomial (Athlete 8), and plateauing (Athlete 9) patterns observed. A key difference between the 100 m sprinting and weightlifting data sets is the skewness of the error distribution. The negative skew within the weightlifting data suggests that poor performances are often much further from the typical level than exceptionally good ones. Evidence of this can be seen in the career trajectory of Athlete 8, where, despite a general upward trajectory in his level of excess performance, his worst performances lie outside the 95% credible bounds. A similar rate of performance increase is seen in the trajectory of Athlete 7 over the early stage of his career. The rate of increase in performance shows a tendency towards plateauing towards his late 20s, but excess performance continues to increase linearly, suggesting that the lifter is still continuing to make significant gains in performance over the age-matched wider population of lifters. The performance trajectory of female weightlifters (Figure 12) is much flatter than that of males, with clear evidence of a plateau in lifting performance and excess performance from their mid-20s onwards. Notwithstanding the age polynomial, the median of excess performance suggests that these athletes are consistently performing between 50 and 100 Sinclair units above the wider population of female weightlifters, even into the later stages of their careers.

Figure 11: Individual performance inference for Athletes 7–9: (a) shows observed performances and posterior median individual performance trajectory (solid line) with 95% credible interval (dashed line), and (b) shows adjusted excess performances and posterior median excess performance trajectory (solid line) with 95% credible interval (dashed line).

Figure 12: Individual performance inference for Athletes 10–12: (a) shows observed performances and posterior median individual performance trajectory (solid line) with 95% credible interval (dashed line), and (b) shows adjusted excess performances and posterior median excess performance trajectory (solid line) with 95% credible interval (dashed line).
5 Discussion
This study developed a Bayesian hierarchical model to investigate both population and individual level longitudinal performance trajectories. Performance change was modelled as a function of age using a parametric population trajectory combined with flexible, spline-based individual departures, while accounting for the estimated effects of known covariates within the regression model. The inclusion of these covariates allows us to adjust for confounders and leads to a substantially reduced variance of the observation errors for 100 m sprinting, although the reduction was small for weightlifting, where only a weak month effect was found. By reducing these error variances, the model produced tight-fitting individual performance trajectories that are appropriate for the assessment of changes in athletic performance over the course of an athlete’s career.
The model has the potential to be applied to a wide range of sporting events within the category of centimetre–gram–second sports. For example, in swimming, performance time can be modelled with pool size and depth included as confounders, as both will influence performance; in timed track cycling events, the venue (altitude, humidity, track surface) has an important systematic effect on times, which can be included as a confounder; finally, in longer distance running events, performances can be influenced by factors including the environment and race tactics, and including a “race” effect would allow for consistent differences in race times across a whole field. Within this manuscript we have demonstrated the model on sports measured in centimetres–grams–seconds, but it can also be applied to a wide range of others where performance is measured in rankings (e.g. tennis), scores (e.g. gymnastics), or player metrics provided in competition (e.g. distance covered in team sports).
Several methodological limitations of this study need to be acknowledged. Career trajectories were developed giving equal importance to each performance and did not account for the strength of the competition. It is likely that athletes prioritised certain events, or that performances were affected by the relative strength of their opponents. Data were limited to results appearing on the World Athletics and International Weightlifting Federation websites, and therefore do not include lower-level competitions or junior age-group events. We cannot discount the potential effects of doping within the data, especially given the history of anti-doping rule violations in both 100 m sprinting and weightlifting. Future work will look at extending the model to other centimetre–gram–second sports as well as others where performance is measured by finishing position/rank or individual player metrics. This will involve adjusting for more confounders and modelling the effect of competition strength.
Acknowledgments
We are grateful to Alex Bohl and Lorenzo Gaborini who assisted us in acquiring data from the World Athletics and International Weightlifting Federation websites, respectively.
- Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
- Research funding: This study received funding from the Partnership for Clean Competition (Grant ID: 26), and utilised the University of Kent High Performance Computing Cluster (0000000053).
- Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
Appendix A: Computational method
We define IG(a, b) to be an inverse gamma distribution with shape parameter a and scale parameter b which has density
Ga(a, b) to be a gamma distribution with shape parameter a and scale parameter b which has density
N(μ, Σ) to be a multivariate normal distribution with mean μ and covariance matrix Σ, TN_A(μ, σ²) represents a normal distribution with mean μ and variance σ² truncated to the interval A, C⁺ is the standard half-Cauchy distribution which has density
GIG(p, a, b) is the generalized inverse Gaussian distribution with shape p and scale parameters a and b which has density
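For reference, the densities named above can be written in their usual textbook forms. This is a sketch under stated assumptions rather than a transcription: the gamma density follows the wording above in treating b as a scale (if b is instead read as a rate, replace x/b by bx and b^{-a} by b^{a}), and the GIG density is shown in one common parameterisation, with K_p denoting the modified Bessel function of the second kind.

```latex
% IG(a, b): inverse gamma, shape a, scale b
f_{\mathrm{IG}}(x) = \frac{b^{a}}{\Gamma(a)}\, x^{-a-1} \exp\!\left(-\frac{b}{x}\right), \qquad x > 0

% Ga(a, b): gamma, shape a, scale b (assumed reading of the second parameter)
f_{\mathrm{Ga}}(x) = \frac{1}{\Gamma(a)\, b^{a}}\, x^{a-1} \exp\!\left(-\frac{x}{b}\right), \qquad x > 0

% C^{+}: standard half-Cauchy
f_{C^{+}}(x) = \frac{2}{\pi \left(1 + x^{2}\right)}, \qquad x > 0

% GIG(p, a, b): generalized inverse Gaussian (assumed parameterisation)
f_{\mathrm{GIG}}(x) = \frac{(a/b)^{p/2}}{2 K_{p}\!\left(\sqrt{ab}\right)}\, x^{p-1}
    \exp\!\left(-\frac{1}{2}\left(a x + \frac{b}{x}\right)\right), \qquad x > 0
```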
If Y follows a zero-mean skew-t distribution with parameters ω², α and ν, then Y can be represented as
where X ∼ N(0, 1/W), Z ∼ TN_[0,∞)(0, 1/W) and W ∼ Ga(ν/2, (ων)/2). This allows us to write the model in a convenient form for Gibbs sampling,
where
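The scale-mixture idea behind this representation can be checked by simulation. The following is a minimal sketch and not the authors' code. It assumes the common Azzalini-type form Y = δZ + √(1 − δ²)X with δ = α/√(1 + α²), and it assumes W has shape ν/2 and rate ω²ν/2 so that ω acts as the scale of Y. The function names and these parameter conventions are illustrative choices. Under these assumptions Y has location zero, and its mean is non-zero whenever α ≠ 0.

```python
import math
import random


def sample_skew_t(omega, alpha, nu, rng):
    """Draw one value from a location-zero skew-t via a normal scale mixture.

    Assumed conventions (not taken from the paper):
    W ~ Gamma(shape=nu/2, rate=omega**2 * nu / 2) and
    Y = delta * Z + sqrt(1 - delta**2) * X.
    """
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    w = rng.gammavariate(nu / 2.0, 2.0 / (omega * omega * nu))
    sd = 1.0 / math.sqrt(w)
    x = rng.gauss(0.0, sd)       # X | W ~ N(0, 1/W)
    z = abs(rng.gauss(0.0, sd))  # Z | W ~ N(0, 1/W) truncated to [0, inf)
    return delta * z + math.sqrt(1.0 - delta * delta) * x


def skew_t_mean(omega, alpha, nu):
    """Theoretical mean of the skew-t sampled above (requires nu > 1)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    b_nu = (math.sqrt(nu / math.pi)
            * math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0))
    return omega * delta * b_nu


rng = random.Random(1)
draws = [sample_skew_t(1.0, 3.0, 10.0, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
```

Conditional on W and Z, the draw is Gaussian, which is what makes this kind of augmentation convenient inside a Gibbs sampler. Comparing `sample_mean` with `skew_t_mean(1.0, 3.0, 10.0)` gives a quick sanity check of the representation.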
A.1 Gibbs sampler
We define the following notation:
X_i is an n_i-dimensional vector whose jth entry is d_{i,j} concatenated with x_{i,j}.
The full conditional distributions of the Gibbs sampler are as follows.
We update η and ζ jointly marginalising over
where
We update γ_i using the method in Section 3.
which can be updated using a Metropolis-Hastings random walk on log a,
which can be updated using a Metropolis-Hastings random walk on log g,
which can be updated using a Metropolis-Hastings random walk on log α.
which can be updated using a Metropolis–Hastings random walk on log ν.
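A Metropolis–Hastings random walk on the logarithm of a positive parameter, as used for a, g, α and ν above, can be sketched as follows. This is an illustrative sketch only: the full conditionals are not reproduced here, so a Gamma(shape 3, rate 2) density stands in as a hypothetical target, and the function name, step size and chain length are assumptions. The point to note is the Jacobian term that arises because the proposal is symmetric in log θ rather than in θ.

```python
import math
import random


def mh_log_random_walk(theta, log_target, step, rng):
    """One Metropolis-Hastings update for theta > 0 via a random walk on log(theta)."""
    proposal = theta * math.exp(step * rng.gauss(0.0, 1.0))
    # The proposal is symmetric on the log scale, so the acceptance ratio
    # picks up the Jacobian term log(proposal) - log(theta).
    log_ratio = (log_target(proposal) - log_target(theta)
                 + math.log(proposal) - math.log(theta))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return theta


def log_gamma_target(x):
    # Hypothetical stand-in target: unnormalised log density of Gamma(shape=3, rate=2).
    return 2.0 * math.log(x) - 2.0 * x


rng = random.Random(0)
theta = 1.0
chain = []
for _ in range(50_000):
    theta = mh_log_random_walk(theta, log_gamma_target, 0.8, rng)
    chain.append(theta)

kept = chain[5_000:]  # discard an initial burn-in period
posterior_mean = sum(kept) / len(kept)
```

Working on the log scale keeps every proposal positive without rejection at the boundary. In practice the step size would be tuned, or adapted during burn-in, to reach a reasonable acceptance rate.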
References
Albert, J. 1999. “Comment on ‘Bridging Different Eras in Sports’ by Berry, Reese and Larkey.” Journal of the American Statistical Association 94: 677–80. https://doi.org/10.2307/2669974.
Allen, S. V., T. J. Vandenbogaerde, and W. G. Hopkins. 2014. “Career Performance Trajectories of Olympic Swimmers: Benchmarks for Talent Development.” European Journal of Sport Science 14: 643–51. https://doi.org/10.1080/17461391.2014.893020.
Azzalini, A. 1985. “A Class of Distributions Which Includes the Normal Ones.” Scandinavian Journal of Statistics 12: 171–8.
Azzalini, A., and A. Capitanio. 2003. “Distributions Generated by Perturbation of Symmetry with Emphasis on a Multivariate Skew T-Distribution.” Journal of the Royal Statistical Society: Series B 65: 367–89. https://doi.org/10.1111/1467-9868.00391.
Berry, S. M., C. S. Reese, and P. D. Larkey. 1999. “Bridging Different Eras in Sports.” Journal of the American Statistical Association 94: 661–76. https://doi.org/10.1080/01621459.1999.10474163.
Berthelot, G., S. Len, P. Hellard, M. Tafflet, M. Guillaume, J.-C. Vollmer, B. Gager, L. Quinquis, A. Marc, and J.-F. Toussaint. 2012. “Exponential Growth Combined with Exponential Decline Explains Lifetime Performance Evolution in Individual and Human Species.” Age 34: 1001–9. https://doi.org/10.1007/s11357-011-9274-9.
Boccia, G., P. Moisè, A. Franceschi, F. Trova, D. Panero, A. La Torre, A. Rainoldi, F. Schena, and M. Cardinale. 2017. “Career Performance Trajectories in Track and Field Jumping Events from Youth to Senior Success: The Importance of Learning and Development.” PLoS One 12: 1–15. https://doi.org/10.1371/journal.pone.0170744.
Bongard, V., A. Y. McDermott, G. E. Dallal, and E. J. Schaefer. 2007. “Effects of Age and Gender on Physical Performance.” Age 29: 77–85. https://doi.org/10.1007/s11357-007-9034-z.
Boys, R. J., and P. M. Philipson. 2019. “On the Ranking of Test Match Batsmen.” Journal of the Royal Statistical Society: Series C 68: 161–79. https://doi.org/10.1111/rssc.12298.
Bradbury, J. C. 2009. “Peak Athletic Performance and Ageing: Evidence from Baseball.” Journal of Sports Sciences 27: 599–610. https://doi.org/10.1080/02640410802691348.
Brander, J. A., E. J. Egan, and L. Yeung. 2014. “Estimating the Effects of Age on NHL Player Performance.” Journal of Quantitative Analysis in Sports 10: 241–59. https://doi.org/10.1515/jqas-2013-0085.
Carvalho, C. M., N. G. Polson, and J. G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97: 465–80. https://doi.org/10.1093/biomet/asq017.
Dellaportas, P., J. J. Forster, and I. Ntzoufras. 2000. “Bayesian Variable Selection Using the Gibbs Sampler.” In Generalized Linear Models: A Bayesian Perspective, Vol. 5, edited by D. K. Dey, S. K. Ghosh, and B. K. Mallick, 273–86. Chemical Rubber Company Press. Also available at https://eprints.soton.ac.uk/29960/.
Denison, D. G. T., C. C. Holmes, B. K. Mallick, and A. F. M. Smith. 2002. Bayesian Methods for Nonlinear Classification and Regression. Chichester: Wiley & Sons.
Donato, A. J., K. Tench, D. H. Glueck, D. R. Seals, I. Eskurza, and H. Tanaka. 2003. “Declines in Physiological Functional Capacity with Age: A Longitudinal Study in Peak Swimming Performance.” Journal of Applied Physiology 94: 764–9. https://doi.org/10.1152/japplphysiol.00438.2002.
Egidi, L., and J. Gabry. 2018. “Bayesian Hierarchical Models for Predicting Individual Performance in Soccer.” Journal of Quantitative Analysis in Sports 14: 143–57. https://doi.org/10.1515/jqas-2017-0066.
Frühwirth-Schnatter, S., and S. Pyne. 2010. “Bayesian Inference for Finite Mixtures of Univariate and Multivariate Skew-Normal and Skew-T Distributions.” Biostatistics 11: 317–36. https://doi.org/10.1093/biostatistics/kxp062.
Gao, Z., Y. Li, and Z. Wang. 2020. “Restoring the Real World Records in Men’s Swimming without High-Tech Swimsuits.” Journal of Quantitative Analysis in Sports 16: 291–300. https://doi.org/10.1515/jqas-2019-0087.
Griffin, J. E., K. Latuszynski, and M. F. J. Steel. 2021. “In Search of Lost (Mixing) Time: Adaptive Markov Chain Monte Carlo Schemes for Bayesian Variable Selection with Large P.” Biometrika 108: 53–69. https://doi.org/10.1093/biomet/asaa055.
Haugen, T. A., P. A. Solberg, C. Foster, R. Morán-Navarro, F. Breitschädel, and W. G. Hopkins. 2018. “Peak Age and Performance Progression in World-Class Track-and-Field Athletes.” International Journal of Sports Physiology and Performance 13: 1122–9. https://doi.org/10.1123/ijspp.2017-0682.
Kovalchik, S. A., and R. Stefani. 2013. “Longitudinal Analyses of Olympic Athletics and Swimming Events Find No Gender Gap in Performance Improvement.” Journal of Quantitative Analysis in Sports 9: 15–24. https://doi.org/10.1515/jqas-2012-0007.
Lang, S., and A. Brezger. 2004. “Bayesian P-Splines.” Journal of Computational & Graphical Statistics 13: 183–212. https://doi.org/10.1198/1061860043010.
Ley, E., and M. F. J. Steel. 2009. “On the Effect of Prior Assumptions in Bayesian Model Averaging with Applications to Growth Regression.” Journal of Applied Econometrics 24: 651–74. https://doi.org/10.1002/jae.1057.
Makalic, E., and D. F. Schmidt. 2016. “A Simple Sampler for the Horseshoe Estimator.” IEEE Signal Processing Letters 23: 179–82. https://doi.org/10.1109/lsp.2015.2503725.
McDonagh, M. J., M. J. White, and C. T. Davies. 1984. “Different Effects of Ageing on the Mechanical Properties of Human Arm and Leg Muscles.” Gerontology 30: 49–54. https://doi.org/10.1159/000212606.
McGuigan, M. R., and M. K. Kane. 2004. “Reliability of Performance of Elite Olympic Weightlifters.” The Journal of Strength & Conditioning Research 18: 650–3. https://doi.org/10.1519/12312.1.
Moinat, M., O. Fabius, and K. S. Emanuel. 2018. “Data-driven Quantification of the Effect of Wind on Athletics Performance.” European Journal of Sport Science 18: 1185–90. https://doi.org/10.1080/17461391.2018.1480062.
Rasmussen, C. E., and C. K. I. Williams. 2006. Gaussian Processes for Machine Learning. Cambridge: MIT Press. https://doi.org/10.7551/mitpress/3206.001.0001.
Santos-Fernandez, E., P. Wu, and K. L. Mengersen. 2019. “Bayesian Statistics Meets Sports: A Comprehensive Review.” Journal of Quantitative Analysis in Sports 15: 289–312. https://doi.org/10.1515/jqas-2018-0106.
Solberg, P. A., W. G. Hopkins, G. Paulsen, and T. A. Haugen. 2019. “Peak Age and Performance Progression in World-Class Weightlifting and Powerlifting Athletes.” International Journal of Sports Physiology and Performance 14: 1357–63. https://doi.org/10.1123/ijspp.2019-0093.
Stephenson, A. G., and J. A. Tawn. 2013. “Determining the Best Track Performances of All Time Using a Conceptual Population Model for All Athletics Records.” Journal of Quantitative Analysis in Sports 9: 67–76. https://doi.org/10.1515/jqas-2012-0047.
Stevenson, O. G., and B. J. Brewer. 2021. “Finding Your Feet: A Gaussian Process Model for Estimating the Abilities of Batsmen in Test Cricket.” Journal of the Royal Statistical Society: Series C 70: 481–506. https://doi.org/10.1111/rssc.12470.
Stones, M. J., and A. Kozma. 1984. “Longitudinal Trends in Track and Field Performances.” Experimental Aging Research 10: 107–10. https://doi.org/10.1080/03610738408258552.
Strand, M., D. Nelson, and G. Grunwald. 2018. “Modeling Between-Subject Differences and Within-Subject Changes for Long Distance Runners by Age.” Journal of Quantitative Analysis in Sports 14: 81–90. https://doi.org/10.1515/jqas-2017-0038.
Tadesse, M. G., and M. Vannucci. 2021. Handbook of Bayesian Variable Selection. Boca Raton: Chapman & Hall/CRC. https://doi.org/10.1201/9781003089018.
Tanaka, H., and D. R. Seals. 1997. “Age and Gender Interactions in Physiological Functional Capacity: Insight from Swimming Performance.” Journal of Applied Physiology 82: 846–51. https://doi.org/10.1152/jappl.1997.82.3.846.
Wimmer, N., N. Fenske, P. Pyrka, and L. Fahrmeir. 2011. “Exploring Competition Performance in Decathlon Using Semi-parametric Latent Variable Models.” Journal of Quantitative Analysis in Sports 7: 6. https://doi.org/10.2202/1559-0410.1307.
© 2022 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.