Abstract
This study presents a quantitative framework to compare teams in collective sports with respect to their style of play. The style of play is characterized by the team’s spatial distribution over a collection of frames. As a first step, we introduce an optimal transport-based embedding to map frames into Euclidean space, enabling efficient computation of distances. Then, building on this frame-level analysis, we use quantization to establish a similarity metric between teams based on a collection of frames from their games. For illustration, we present an analysis of a collection of games from the 2021–2022 Ligue 1 season. We successfully retrieve relevant clusters of game situations and calculate the similarity matrix between teams in terms of style of play. Additionally, we demonstrate the effectiveness of the embedding as a preprocessing tool for prediction tasks. Likewise, we apply our framework to analyze the dynamics in the first half of the 2015–2016 NBA season.
Funding source: Machine Learning & Systematic Methods in Finance
Acknowledgments
The authors thank Stats Perform, Matthieu Lille-Palette and Andy Cooper for providing tracking data. They are also grateful to Mathieu Lacome and Sébastien Coustou for insightful discussions. The authors gratefully acknowledge financial support from the chair “Machine Learning & Systematic Methods in Finance” from Ecole Polytechnique.
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: The authors gratefully acknowledge financial support from the chair “Machine Learning & Systematic Methods in Finance” from Ecole Polytechnique.
- Data availability: Data is partly publicly available. The NBA tracking data can be found in https://github.com/linouk23/NBA-Player-Movements/tree/master/data.
A Distance between teams in the NBA
Similarly to Section 3.3, we apply the optimal transport framework to tracking data from the first half of the 2015–2016 NBA season. For each team, the tracking data was subsampled by retaining one out of every 25 frames. We set L = 6 to obtain an embedding in
Figure 9 illustrates an example of the ten clusters identified from the collection of frames across all Golden State Warriors games, and Table 5 provides the percentage of frames in each cluster. We observe that the clustering algorithm effectively distinguishes different phases of play in a consistent manner. Specifically, we identify three distinct defensive settings in clusters 6, 2, and 10, where the defensive block transitions from very deep positions to more advanced ones. Additionally, clusters 3, 5, 8, and 9 represent transitions between defense and offense, characterized by players being spread across the court. The dynamics observed in these clusters differ significantly from those in football. In particular, clusters where players are positioned in the center of the court account for a total of 18.62 % of the frames, unlike in football, where central clusters are more prevalent. This difference is expected because basketball rules prohibit the ball from moving backward from the offensive half of the court. Consequently, basketball is essentially a transition-oriented game where teams alternate between defense and offense.

Figure 9: Example of 10 clusters observed in the frames from the games of the Golden State Warriors. For each cluster, the closest frame to the barycenter and a random example from the cluster are shown. The shaded areas represent the total density of player locations taken from a random sample of 100 frames from each cluster. Clusters 2, 6 and 10 represent defensive phases. Clusters 1 and 4 represent attacking phases while clusters 3, 5, 7, 8 and 9 represent transitions between attack and defense.
Table 5: Rate of occurrence of clusters displayed in Figure 9.
| Cluster | Percentage of frames |
|---|---|
| 1 | 13.04 % |
| 2 | 22.49 % |
| 3 | 4.73 % |
| 4 | 15.66 % |
| 5 | 3.40 % |
| 6 | 17.31 % |
| 7 | 5.22 % |
| 8 | 3.27 % |
| 9 | 7.22 % |
| 10 | 7.66 % |
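The 18.62 % share quoted in the text can be recovered from the table as the sum over clusters 3, 5, 8 and 9, which the text describes as transitions. The short check below is only an arithmetic sketch; the dictionary and function names are illustrative and not part of the original analysis.

```python
# Percentage of frames per cluster, copied from the table above.
cluster_share = {
    1: 13.04, 2: 22.49, 3: 4.73, 4: 15.66, 5: 3.40,
    6: 17.31, 7: 5.22, 8: 3.27, 9: 7.22, 10: 7.66,
}

# Grouping taken from the text: clusters 2, 6 and 10 are defensive,
# clusters 3, 5, 8 and 9 are transitions between defense and offense.
defensive = [2, 6, 10]
transition = [3, 5, 8, 9]


def group_share(clusters):
    """Total percentage of frames falling in the given clusters."""
    return round(sum(cluster_share[c] for c in clusters), 2)


total = round(sum(cluster_share.values()), 2)
transition_total = group_share(transition)
defensive_total = group_share(defensive)
print(total, transition_total, defensive_total)
```

Under this grouping, the three defensive clusters together cover close to half of the frames, while the transition clusters cover less than a fifth.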
Finally, Figure 10 illustrates the similarity in playing styles between NBA teams. This similarity is computed from the collection of frames from all available games in the first half of the NBA season, subsampled at a rate of one frame out of every 25. We observe that the Golden State Warriors exhibit the largest deviation from the other franchises. This is not surprising, as they are credited with introducing a new style of play heavily reliant on 3-pointers. In fact, this season marked the record for the most points scored in the regular season, which naturally leads to a different spatial distribution of play. Interestingly, the Utah Jazz display the second largest deviation from the rest of the teams, and their distance to the Golden State Warriors is the greatest. Figure 11 presents the similarity metric after centering the frames prior to embedding. In this case, the distances become more uniform, and the significant deviations of the Utah Jazz and Golden State Warriors diminish. This suggests that these effects are primarily due to the average locations both teams occupy during their games rather than to the relative placement of players.

Figure 10: Distance matrix between the collections of frames in the games of each team in the NBA.

Figure 11: Distance matrix between the collections of frames in the games of each team in the NBA after centering the data.
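The two preprocessing steps mentioned above, subsampling one frame out of every 25 and centering the frames before embedding, can be sketched as follows. This is a minimal illustration under stated assumptions: centering is taken to mean subtracting the mean player position of each frame, the array layout `(frames, players, 2)` is hypothetical, and the tracking data is replaced by synthetic positions on a 94 by 50 court.

```python
import numpy as np


def subsample(frames, step=25):
    """Keep one frame out of every `step` frames."""
    return frames[::step]


def center(frames):
    """Subtract each frame's centroid so only relative placement remains."""
    return frames - frames.mean(axis=1, keepdims=True)


rng = np.random.default_rng(0)
# Synthetic stand-in for tracking data: 1000 frames, 5 players, (x, y).
frames = rng.uniform([0.0, 0.0], [94.0, 50.0], size=(1000, 5, 2))

kept = subsample(frames)      # shape (40, 5, 2)
centered = center(kept)       # every frame now has zero mean position
```

After centering, two frames with the same relative placement of players become identical regardless of where on the court they occur, which is why location-driven differences between teams are expected to shrink.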
A.1 Predicting team identity
Similarly to Section 3.3.3, we demonstrate that our representation of collections of frames as distributions in the embedding space effectively captures the stylistic identity of teams in the NBA. Specifically, we show that it is possible to predict a team’s identity from an out-of-sample collection of frames with high accuracy, which provides strong evidence that our embedding preserves spatial information while differentiating between unique playing styles.
Using all frames in each test fold yields an average Top-1 accuracy of 99.33 % and a Top-2 accuracy of 100 %. This performance suggests that, given a large enough test sample of frames, the distribution of frames in the embedding captures a team’s style of play almost perfectly. Figure 12 shows how accuracy varies as a function of the number of frames in each subset. As with the football data, larger subsets naturally improve accuracy, as they give a more representative picture of a team’s positioning style. Compared to football, more frames are needed to reach an accuracy of 70 %, but the accuracy ceiling is higher: we achieve 93.8 % Top-1 accuracy with 2000 frames. Overall, the results further confirm that our embedding captures team-specific spatial configurations, enabling accurate and fast team identification.

Figure 12: Accuracy of NBA team identity prediction as a function of test sample size. For each sample size, 100 randomly selected subsets of the test fold are used to compute the accuracy and its standard deviation.
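As an illustration of the Top-1 and Top-2 metrics reported above, the sketch below computes Top-k accuracy from a matrix of distances between test collections and per-team reference collections. It assumes that teams are ranked by increasing distance to the test collection; the prediction procedure itself is described in Section 3.3.3 and is not reproduced here, and the function name and toy distances are hypothetical.

```python
import numpy as np


def top_k_accuracy(distances, true_team, k=1):
    """Fraction of test collections whose true team is among the k nearest.

    distances[i, j] is the distance between test collection i and the
    reference collection of team j; true_team[i] is the index of the team
    that produced test collection i.
    """
    nearest = np.argsort(distances, axis=1)[:, :k]
    hits = (nearest == np.asarray(true_team)[:, None]).any(axis=1)
    return float(hits.mean())


# Toy distances for three test collections and three teams.
distances = np.array([
    [0.1, 0.5, 0.9],
    [0.4, 0.3, 0.8],
    [0.2, 0.9, 0.6],  # true team 2 is only the second nearest
])
true_team = [0, 1, 2]

top1 = top_k_accuracy(distances, true_team, k=1)
top2 = top_k_accuracy(distances, true_team, k=2)
```

In this toy example the third collection is misclassified at Top-1 but recovered at Top-2, which mirrors the gap between the two metrics reported in the text.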
B Proof of Proposition 1
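First, we prove that $\mathrm{Proj}_\theta$ is injective. We proceed by induction on $n$, with the following induction hypothesis.
($H_n$): For any choice of grid $\theta_1, \theta_2, \dots, \theta_L$ such that $L \geq n + 1$ and $\theta_l, \theta_k$ are noncollinear for all distinct $l, k \leq L$, $\mathrm{Proj}_\theta$ is injective.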
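To make the object of the proof concrete, the following sketch builds an embedding of the kind discussed here. It rests on two assumptions that are not taken from this appendix: that the embedding of $n$ points is the vector of their sorted projections along each direction, as is standard for sliced-Wasserstein embeddings, and that the grid consists of evenly spaced angles in $[0, \pi)$, a hypothetical choice that guarantees pairwise noncollinear directions.

```python
import numpy as np


def make_grid(L):
    """L unit directions with angles evenly spaced in [0, pi).

    Assumption for illustration: any two of these directions are
    noncollinear, as the induction hypothesis requires.
    """
    angles = np.pi * np.arange(L) / L
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (L, 2)


def sliced_embedding(points, thetas):
    """Sorted projections of n planar points onto each direction.

    points has shape (n, 2); the result is a flat vector of length L * n.
    """
    projections = points @ thetas.T  # shape (n, L)
    return np.sort(projections, axis=0).T.ravel()


thetas = make_grid(6)  # L = 6 >= n + 1 for n = 5 players
frame = np.array([[10.0, 5.0], [20.0, 30.0], [47.0, 25.0],
                  [60.0, 12.0], [80.0, 40.0]])
shuffled = frame[[3, 0, 4, 1, 2]]  # same positions, different player order
moved = frame.copy()
moved[0, 0] += 1.0                 # one player shifted along x

same = np.allclose(sliced_embedding(frame, thetas),
                   sliced_embedding(shuffled, thetas))
different = not np.allclose(sliced_embedding(frame, thetas),
                            sliced_embedding(moved, thetas))
```

Sorting makes the embedding independent of the order in which players are listed, while moving a single player changes it, which is the behaviour that injectivity formalises.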
If $n = 1$, take $\mu = \delta_x$ and $\nu = \delta_y$ and $\theta_1, \theta_2, \dots, \theta_L$ such that $L \geq 2$ and $\mathrm{Proj}_\theta(\mu) = \mathrm{Proj}_\theta(\nu)$. Then we have
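The base case can be spelled out as follows, assuming that for a Dirac mass the embedding reduces to the vector of projections, $\mathrm{Proj}_\theta(\delta_x) = (\langle \theta_l, x \rangle)_{l \leq L}$. This form is an assumption made here for illustration; the definition of $\mathrm{Proj}_\theta$ is given in the main text.

```latex
\begin{align*}
\mathrm{Proj}_\theta(\delta_x) = \mathrm{Proj}_\theta(\delta_y)
  &\iff \langle \theta_l, x \rangle = \langle \theta_l, y \rangle
        \quad \text{for all } l \leq L \\
  &\iff \langle \theta_l, x - y \rangle = 0
        \quad \text{for all } l \leq L .
\end{align*}
```

Since $\theta_1$ and $\theta_2$ are noncollinear, they span $\mathbb{R}^2$, so $x - y$ is orthogonal to every vector of the plane and hence $x = y$, that is, $\mu = \nu$.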
Let $n \geq 2$ and assume ($H_{n-1}$) is true. Take $\mu$ and $\nu$ in
Therefore, for all $l$ in $\{1, \dots, L\}$, there exists $i_l$ in $\{1, \dots, n\}$ such that
Since $L \geq n + 1$, the pigeonhole principle implies that there exist $l \neq k$ such that $i_l = i_k$. Thus
we can deduce that
We conclude using ($H_{n-1}$) on
Finally, for $\mu$ and $\nu$ in
This justifies that
C Justification of the normalisation term
The following result makes it possible to determine a suitable normalisation factor for the sliced-Wasserstein distance.
Possession values for each team in the analysis, displayed as percentages. Teams are sorted in descending order.
| Team | Possession value (%) |
|---|---|
| Olympique Marseille | 63.85 % |
| PSG | 60.63 % |
| Rennes | 60.15 % |
| Olympique Lyonnais | 58.36 % |
| Nice | 55.15 % |
| Lille | 54.77 % |
| Lens | 51.05 % |
| Clermont | 49.30 % |
| Saint-Étienne | 48.71 % |
| Montpellier | 48.32 % |
| Bordeaux | 48.29 % |
| Strasbourg | 47.72 % |
| Brest | 46.87 % |
| Monaco | 46.62 % |
| Angers SCO | 45.71 % |
| Metz | 45.19 % |
| Nantes | 45.02 % |
| Troyes | 44.22 % |
| Lorient | 43.43 % |
| Reims | 41.63 % |
List of the 100 games used in this study.
| Game | Home team | Away team | Date |
|---|---|---|---|
| 1 | Metz | Lens | 2022-03-13 |
| 2 | PSG | Bordeaux | 2022-03-13 |
| 3 | Strasbourg | Monaco | 2022-03-13 |
| 4 | Angers SCO | Brest | 2022-03-20 |
| 5 | Bordeaux | Montpellier | 2022-03-20 |
| 6 | Lens | Clermont | 2022-03-19 |
| 7 | Lorient | Strasbourg | 2022-03-20 |
| 8 | Olympique Marseille | Nice | 2022-03-20 |
| 9 | Monaco | PSG | 2022-03-20 |
| 10 | Nantes | Lille | 2022-03-19 |
| 11 | Reims | Olympique Lyonnais | 2022-03-20 |
| 12 | Rennes | Metz | 2022-03-20 |
| 13 | Saint-Etienne | Troyes | 2022-03-18 |
| 14 | Clermont | Nantes | 2022-04-03 |
| 15 | Lille | Bordeaux | 2022-04-02 |
| 16 | Olympique Lyonnais | Angers SCO | 2022-04-03 |
| 17 | Metz | Monaco | 2022-04-03 |
| 18 | Montpellier | Brest | 2022-04-03 |
| 19 | Nice | Rennes | 2022-04-02 |
| 20 | PSG | Lorient | 2022-04-03 |
| 21 | Saint-Etienne | Olympique Marseille | 2022-04-03 |
| 22 | Strasbourg | Lens | 2022-04-03 |
| 23 | Troyes | Reims | 2022-04-03 |
| 24 | Angers SCO | Lille | 2022-04-10 |
| 25 | Brest | Nantes | 2022-04-10 |
| 26 | Bordeaux | Metz | 2022-04-10 |
| 27 | Clermont | PSG | 2022-04-09 |
| 28 | Lens | Nice | 2022-04-10 |
| 29 | Lorient | Saint-Etienne | 2022-04-08 |
| 30 | Monaco | Troyes | 2022-04-10 |
| 31 | Reims | Rennes | 2022-04-09 |
| 32 | Strasbourg | Olympique Lyonnais | 2022-04-10 |
| 33 | Lille | Lens | 2022-04-16 |
| 34 | Olympique Lyonnais | Bordeaux | 2022-04-17 |
| 35 | Metz | Clermont | 2022-04-17 |
| 36 | Nantes | Angers SCO | 2022-04-17 |
| 37 | Nice | Lorient | 2022-04-17 |
| 38 | PSG | Olympique Marseille | 2022-04-17 |
| 39 | Rennes | Monaco | 2022-04-15 |
| 40 | Saint-Etienne | Brest | 2022-04-16 |
| 41 | Troyes | Strasbourg | 2022-04-17 |
| 42 | Angers SCO | PSG | 2022-04-20 |
| 43 | Bordeaux | Saint-Etienne | 2022-04-20 |
| 44 | Lens | Montpellier | 2022-04-20 |
| 45 | Lorient | Metz | 2022-04-20 |
| 46 | Olympique Marseille | Nantes | 2022-04-20 |
| 47 | Monaco | Nice | 2022-04-20 |
| 48 | Reims | Lille | 2022-04-20 |
| 49 | Strasbourg | Rennes | 2022-04-20 |
| 50 | Troyes | Clermont | 2022-04-20 |
| 51 | Brest | Olympique Lyonnais | 2022-04-20 |
| 52 | Lille | Strasbourg | 2022-04-24 |
| 53 | Olympique Lyonnais | Montpellier | 2022-04-23 |
| 54 | Metz | Brest | 2022-04-24 |
| 55 | Nantes | Bordeaux | 2022-04-24 |
| 56 | Nice | Troyes | 2022-04-24 |
| 57 | PSG | Lens | 2022-04-23 |
| 58 | Reims | Olympique Marseille | 2022-04-24 |
| 59 | Rennes | Lorient | 2022-04-24 |
| 60 | Saint-Etienne | Monaco | 2022-04-23 |
| 61 | Olympique Marseille | Olympique Lyonnais | 2022-05-01 |
| 62 | Strasbourg | PSG | 2022-04-29 |
| 63 | Angers SCO | Bordeaux | 2022-05-08 |
| 64 | Clermont | Montpellier | 2022-05-08 |
| 65 | Lille | Monaco | 2022-05-06 |
| 66 | Metz | Olympique Lyonnais | 2022-05-08 |
| 67 | Nantes | Rennes | 2022-05-11 |
| 68 | Nice | Saint-Etienne | 2022-05-11 |
| 69 | PSG | Troyes | 2022-05-08 |
| 70 | Reims | Lens | 2022-05-08 |
| 71 | Metz | Angers SCO | 2022-05-14 |
| 72 | Monaco | Brest | 2022-05-14 |
| 73 | Montpellier | PSG | 2022-05-14 |
| 74 | Nice | Lille | 2022-05-14 |
| 75 | Rennes | Olympique Marseille | 2022-05-14 |
| 76 | Bordeaux | Nice | 2022-05-01 |
| 77 | Brest | Clermont | 2022-05-01 |
| 78 | Lens | Nantes | 2022-04-30 |
| 79 | Lorient | Reims | 2022-05-01 |
| 80 | Monaco | Angers SCO | 2022-05-01 |
| 81 | Montpellier | Metz | 2022-05-01 |
| 82 | Rennes | Saint-Etienne | 2022-04-30 |
| 83 | Troyes | Lille | 2022-05-01 |
| 84 | Brest | Strasbourg | 2022-05-07 |
| 85 | Lorient | Olympique Marseille | 2022-05-08 |
| 86 | Bordeaux | Lorient | 2022-05-14 |
| 87 | Olympique Lyonnais | Nantes | 2022-05-14 |
| 88 | Saint-Etienne | Reims | 2022-05-14 |
| 89 | Strasbourg | Clermont | 2022-05-14 |
| 90 | Troyes | Lens | 2022-05-14 |
| 91 | Angers SCO | Montpellier | 2022-05-21 |
| 92 | Brest | Bordeaux | 2022-05-21 |
| 93 | Clermont | Olympique Lyonnais | 2022-05-21 |
| 94 | Lens | Monaco | 2022-05-21 |
| 95 | Lille | Rennes | 2022-05-21 |
| 96 | Lorient | Troyes | 2022-05-21 |
| 97 | Olympique Marseille | Strasbourg | 2022-05-21 |
| 98 | Nantes | Saint-Etienne | 2022-05-21 |
| 99 | PSG | Metz | 2022-05-21 |
| 100 | Reims | Nice | 2022-05-21 |
Proposition 2.
For
Furthermore, in the case where $\nu = \delta_y$ for
To prove Equation (10), consider
Then, for all $l = 1, \dots, L$, we have
Thus,
where
For the choice of grid in Equation (7), we have
This yields
For the second bound, we consider the case where $\nu = \delta_y$ is concentrated in one location. Similar calculations yield
where $\gamma(\Theta)$ is the smallest eigenvalue of $\Theta$ and is given, for the grid in Equation (7), by
© 2025 Walter de Gruyter GmbH, Berlin/Boston