
An optimal transport based embedding to quantify the distance between playing styles in collective sports

Ali Baouan, Mathieu Rosenbaum and Sergio Pulido
Published/Copyright: July 28, 2025

Abstract

This study presents a quantitative framework to compare teams in collective sports with respect to their style of play. The style of play is characterized by the team’s spatial distribution over a collection of frames. As a first step, we introduce an optimal transport-based embedding to map frames into Euclidean space, enabling efficient computation of distances. Then, building on this frame-level analysis, we use quantization to establish a similarity metric between teams based on a collection of frames from their games. For illustration, we present an analysis of a collection of games from the 2021–2022 Ligue 1 season. We successfully retrieve relevant clusters of game situations and calculate the similarity matrix between teams in terms of style of play. Additionally, we demonstrate the effectiveness of the embedding as a preprocessing tool for prediction tasks. Likewise, we apply our framework to analyze the dynamics of the first half of the 2015–2016 NBA season.


Corresponding author: Ali Baouan, Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France, E-mail: 

Mailing address: CMAP, Ecole Polytechnique, route de Saclay, 91128 Palaiseau Cedex, France.


Funding source: Machine Learning & Systematic Methods in Finance

Acknowledgments

The authors thank Stats Perform, Matthieu Lille-Palette and Andy Cooper for providing tracking data. They are also grateful to Mathieu Lacome and Sébastien Coustou for insightful discussions. The authors gratefully acknowledge financial support from the chair “Machine Learning & Systematic Methods in Finance” from Ecole Polytechnique.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: The authors gratefully acknowledge financial support from the chair “Machine Learning & Systematic Methods in Finance” from Ecole Polytechnique.

  7. Data availability: The data is partly publicly available. The NBA tracking data can be found at https://github.com/linouk23/NBA-Player-Movements/tree/master/data.

Appendix

A Distance between teams in the NBA

Similarly to Section 3.3, we apply the optimal transport framework to tracking data from the first half of the 2015–2016 NBA season. For each team, the tracking data was subsampled by retaining one out of every 25 frames. We set $L = 6$ to obtain an embedding in $\mathbb{R}^{5 \times 6}$ by projecting along the grid of directions defined, for $l = 1, \dots, L$, by

$$\theta_l = \left(\cos\left(\frac{\pi (l-1)}{2L}\right),\ \sin\left(\frac{\pi (l-1)}{2L}\right)\right).$$
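The grid and the frame embedding can be sketched as follows. This is a minimal illustration, assuming the embedding stacks, for each direction, the sorted projections of the n player positions (one standard construction for comparing equal-size empirical measures); the function names are ours, not from the paper's code:

```python
import numpy as np

def direction_grid(L):
    """Unit directions theta_l = (cos(pi(l-1)/(2L)), sin(pi(l-1)/(2L))), l = 1..L."""
    angles = np.pi * np.arange(L) / (2 * L)
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (L, 2)

def embed_frame(frame, thetas):
    """Embed a frame of n player positions (n, 2) into R^{n x L}:
    project onto each direction and sort the projections per direction."""
    proj = frame @ thetas.T        # (n, L): entries <theta_l, x_i>
    return np.sort(proj, axis=0)   # order statistics per direction

# Toy example: 5 players (as in the NBA data) and L = 6 gives a point in R^{5x6}
rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 1.0, size=(5, 2))
E = embed_frame(frame, direction_grid(6))
print(E.shape)  # (5, 6)
```

Sorting each column makes the embedding invariant to player ordering, which is what allows plain Euclidean operations on embedded frames downstream.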

Figure 9 illustrates an example of the ten clusters identified from the collection of frames across all Golden State Warriors games, and Table 5 provides the percentage of frames in each cluster. We observe that the clustering algorithm effectively distinguishes different phases of play in a consistent manner. Specifically, we identify three distinct defensive settings in clusters 6, 2, and 10, where the defensive block transitions from very deep positions to more advanced ones. Additionally, clusters 3, 5, 8, and 9 represent transitions between defense and offense, characterized by players being spread across the court. The dynamics observed in these clusters differ significantly from those in football. In particular, clusters where players are positioned in the center of the court account for a total of 18.62 % of the frames, unlike in football, where central clusters are more prevalent. This difference is expected because basketball rules prohibit the ball from moving backward from the offensive half of the court. Consequently, basketball is essentially a transition-oriented game where teams alternate between defense and offense.
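Because the embedding places frames in a Euclidean space, the quantization step can be carried out with standard k-means on the flattened embeddings. The sketch below uses a plain Lloyd iteration as a stand-in for the paper's quantization procedure (an assumption on our part); the occupancy rates it computes correspond to the kind of statistics reported in Table 5.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd k-means on the rows of X (flattened frame embeddings)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                 # nearest-center assignment
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                      # keep empty clusters fixed
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data: 500 frames embedded in R^{5x6}, flattened to R^30, 10 clusters
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 30))
labels, centers = lloyd_kmeans(X, k=10)
rates = np.bincount(labels, minlength=10) / len(labels)  # cluster occupancy rates
```

On real data, the closest frame to each center can then be displayed as the cluster representative, as in Figure 9.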

Figure 9: 
Example of 10 clusters observed in the frames from the games of the Golden State Warriors. For each cluster, the closest frame to the barycenter and a random example from the cluster are shown. The shaded areas represent the total density of player locations taken from a random sample of 100 frames from each cluster. Clusters 2, 6 and 10 represent defensive phases. Clusters 1 and 4 represent attacking phases while clusters 3, 5, 7, 8 and 9 represent transitions between attack and defense.

Table 5:

Rate of occurrence of clusters displayed in Figure 9.

Cluster Percentage of frames
1 13.04 %
2 22.49 %
3 4.73 %
4 15.66 %
5 3.40 %
6 17.31 %
7 5.22 %
8 3.27 %
9 7.22 %
10 7.66 %

Finally, Figure 10 illustrates the similarity in playing styles between NBA teams. This similarity is computed from the collection of frames of all available games in the first half of the season, subsampled at a rate of one frame every 25 frames. We observe that the Golden State Warriors exhibit the largest deviation from the other franchises. This is not surprising, as they are credited with introducing a new style of play heavily reliant on 3-pointers. In fact, this season marked the record for the most points scored in the regular season, which naturally leads to a different spatial distribution of play. Interestingly, the Utah Jazz display the second largest deviation from the rest of the teams, and their distance to the Golden State Warriors is the greatest. Figure 11 presents the similarity metric after centering the frames prior to embedding. In this case, the distance metrics become more uniform, and the significant deviations of the Utah Jazz and Golden State Warriors diminish. This suggests that these effects are primarily due to the average locations both teams occupy during their games rather than the relative placement of players.
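Centering a frame prior to embedding simply subtracts the mean player position, so only the relative placement of players survives. A minimal sketch (our own helper, not from the paper):

```python
import numpy as np

def center_frame(frame):
    """Remove the frame's mean position so only relative placement remains."""
    return frame - frame.mean(axis=0, keepdims=True)

# Two frames with the same formation at different average court locations
rng = np.random.default_rng(2)
shape = rng.uniform(size=(5, 2))
frame_a = shape + np.array([10.0, 4.0])   # formation shifted to one spot
frame_b = shape + np.array([60.0, 20.0])  # same formation, elsewhere on court
print(np.allclose(center_frame(frame_a), center_frame(frame_b)))  # True
```

After centering, the two frames above embed identically, which is exactly why the Jazz and Warriors deviations shrink in Figure 11 if they are driven by average court location.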

Figure 10: 
Distance matrix between the collections of frames in the games of each team in the NBA.

Figure 11: 
Distance matrix between the collections of frames in the games of each team in the NBA after centering the data.

A.1 Predicting team identity

Similarly to Section 3.3.3, we demonstrate that our representation of collections of frames as distributions in the embedding space effectively captures the stylistic identity of teams in the NBA. Specifically, we show that it is possible to predict a team’s identity from an out-of-sample collection of frames with high accuracy, which provides strong evidence that our embedding preserves spatial information while differentiating between unique playing styles.

Using all frames in each test fold yields an average Top-1 accuracy of 99.33 % and a Top-2 accuracy of 100 %. This performance suggests that, with a sufficiently large test sample of frames, the distribution of frames in the embedding captures a team’s style of play essentially perfectly. Figure 12 shows how accuracy varies as a function of the number of frames in each subset. As with the football data, larger subsets naturally improve accuracy, as they provide a more representative picture of a team’s positioning style. Compared to football, more frames are needed to reach an accuracy of 70 %, but the accuracy ceiling is higher: we achieve 93.8 % Top-1 accuracy with 2,000 frames. Overall, the results further confirm that our embedding captures team-specific spatial configurations, enabling accurate and fast team identification.
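The prediction task can be mimicked with a simple nearest-signature rule: represent each collection of frames by its cluster-occupancy histogram over a shared codebook, and assign a test collection to the team with the closest training histogram. This is a hedged sketch of one plausible classifier consistent with the quantization framework, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def occupancy(embeddings, centers):
    """Cluster-occupancy histogram of a collection of embedded frames."""
    d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return np.bincount(labels, minlength=len(centers)) / len(labels)

def predict_team(test_embeddings, centers, signatures):
    """Return the team whose training signature is closest to the test histogram."""
    h = occupancy(test_embeddings, centers)
    return min(signatures, key=lambda t: np.linalg.norm(h - signatures[t]))

# Toy setup: a 2-cluster codebook and two hypothetical teams with opposite usage
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
signatures = {"Team A": np.array([0.9, 0.1]), "Team B": np.array([0.1, 0.9])}
test = np.array([[0.2, 0.1], [0.1, 0.3], [9.8, 10.1]])  # mostly cluster 0
print(predict_team(test, centers, signatures))  # Team A
```

Larger test collections give histograms closer to the team's true cluster distribution, which is consistent with the accuracy gains reported in Figure 12.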

Figure 12: 
Accuracy of NBA team identity prediction as a function of test sample size. For each sample size, 100 randomly selected subsets of the test fold are used to compute the accuracy and its standard deviation.

B Proof of Proposition 1

First, we prove that $\mathrm{Proj}_\theta$ is injective. We proceed by induction on n.

($H_n$): For any choice of grid $\theta_1, \theta_2, \dots, \theta_L$ such that $L \geq n + 1$ and $\theta_l, \theta_k$ are noncollinear for all $l \neq k \leq L$, $\mathrm{Proj}_\theta$ is injective.

If $n = 1$, take $\mu = \delta_x$ and $\nu = \delta_y$ and $\theta_1, \theta_2, \dots, \theta_L$ such that $L \geq 2$ and $\mathrm{Proj}_\theta(\mu) = \mathrm{Proj}_\theta(\nu)$. Then we have $\delta_{\langle \theta_1, x \rangle} = \delta_{\langle \theta_1, y \rangle}$ and $\delta_{\langle \theta_2, x \rangle} = \delta_{\langle \theta_2, y \rangle}$. Equivalently, $\langle \theta_1, x \rangle = \langle \theta_1, y \rangle$ and $\langle \theta_2, x \rangle = \langle \theta_2, y \rangle$, which implies $x = y$ since $\theta_1$ and $\theta_2$ are noncollinear. Thus, $\mu = \nu$.

Let $n \geq 2$ and assume ($H_{n-1}$) is true. Take $\mu$ and $\nu$ in $\mathcal{P}_n^u(\mathbb{R}^2)$ such that $\mathrm{Proj}_\theta(\mu) = \mathrm{Proj}_\theta(\nu)$. We can write $\mu = \sum_{i=1}^{n} \frac{1}{n} \delta_{x_i}$ and $\nu = \sum_{i=1}^{n} \frac{1}{n} \delta_{y_i}$, and we have for all $l$ in $\{1, \dots, L\}$:

$$\sum_{i=1}^{n} \frac{1}{n} \delta_{\langle \theta_l, x_i \rangle} = \sum_{i=1}^{n} \frac{1}{n} \delta_{\langle \theta_l, y_i \rangle}.$$

Therefore, for all $l$ in $\{1, \dots, L\}$, there exists $i_l$ in $\{1, \dots, n\}$ such that

$$\langle \theta_l, x_n \rangle = \langle \theta_l, y_{i_l} \rangle.$$

Since $L \geq n + 1$, the pigeonhole principle guarantees the existence of $l \neq k$ such that $i_l = i_k$. Thus

$$\langle \theta_l, x_n \rangle = \langle \theta_l, y_{i_l} \rangle, \qquad \langle \theta_k, x_n \rangle = \langle \theta_k, y_{i_l} \rangle.$$

Since $\theta_l$ and $\theta_k$ are noncollinear, we can deduce that $x_n = y_{i_l}$. Removing these two atoms, we have for all $h$ in $\{1, \dots, L\}$:

$$\sum_{i=1}^{n-1} \frac{1}{n-1} \delta_{\langle \theta_h, x_i \rangle} = \sum_{i \neq i_l} \frac{1}{n-1} \delta_{\langle \theta_h, y_i \rangle}.$$

We conclude using ($H_{n-1}$) on $\mu' = \sum_{i=1}^{n-1} \frac{1}{n-1} \delta_{x_i}$ and $\nu' = \sum_{i \neq i_l} \frac{1}{n-1} \delta_{y_i}$.

Finally, for $\mu$ and $\nu$ in $\mathcal{P}_n^u(\mathbb{R}^2)$ we have

$$\widehat{SW}_p(\mu, \nu, (\theta_l)_{l \leq L}) = \frac{1}{(nL)^{1/p}} \left\lVert \mathrm{Proj}_\theta(\mu) - \mathrm{Proj}_\theta(\nu) \right\rVert_p.$$

This justifies that $\widehat{SW}_p(\cdot, \cdot, (\theta_l)_{l \leq L})$ is a distance over $\mathcal{P}_n^u(\mathbb{R}^2)$.
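The closed form above makes $\widehat{SW}_p$ a scaled $\ell_p$ distance between embeddings, so the metric axioms are easy to check numerically. A minimal sketch with the grid of Equation (7), using our own helper names:

```python
import numpy as np

def proj(points, thetas):
    """Sorted projections of an n-point cloud along each direction."""
    return np.sort(points @ thetas.T, axis=0)  # shape (n, L)

def sw_hat(mu, nu, thetas, p=2):
    """SW_p-hat between two n-point empirical measures:
    (1 / (nL)^(1/p)) * ||Proj(mu) - Proj(nu)||_p."""
    n, L = mu.shape[0], thetas.shape[0]
    diff = np.abs(proj(mu, thetas) - proj(nu, thetas))
    return (diff ** p).sum() ** (1.0 / p) / (n * L) ** (1.0 / p)

L = 6
angles = np.pi * np.arange(L) / (2 * L)
thetas = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Identity, symmetry, and triangle inequality on random 5-point clouds
rng = np.random.default_rng(3)
a, b, c = rng.normal(size=(3, 5, 2))
assert sw_hat(a, a, thetas) == 0.0
assert np.isclose(sw_hat(a, b, thetas), sw_hat(b, a, thetas))
assert sw_hat(a, c, thetas) <= sw_hat(a, b, thetas) + sw_hat(b, c, thetas) + 1e-12
```

The triangle inequality holds exactly here because the distance is a norm of the difference of two embedding vectors.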

C Justification of the normalisation term

The following result allows us to determine a suitable normalisation factor for the sliced-Wasserstein distance.

Table 6:

Possession values for each team in the analysis, displayed as percentages. Teams are sorted in descending order.

Team Possession value (%)
Olympique Marseille 63.85 %
PSG 60.63 %
Rennes 60.15 %
Olympique Lyonnais 58.36 %
Nice 55.15 %
Lille 54.77 %
Lens 51.05 %
Clermont 49.30 %
Saint-Étienne 48.71 %
Montpellier 48.32 %
Bordeaux 48.29 %
Strasbourg 47.72 %
Brest 46.87 %
Monaco 46.62 %
Angers SCO 45.71 %
Metz 45.19 %
Nantes 45.02 %
Troyes 44.22 %
Lorient 43.43 %
Reims 41.63 %
Table 7:

List of 100 games used in this study. Left: Games 1–33. Center: Games 34–66. Right: Games 67–100.

Game Home team Away team Date
1 Metz Lens 2022-03-13
2 PSG Bordeaux 2022-03-13
3 Strasbourg Monaco 2022-03-13
4 Angers SCO Brest 2022-03-20
5 Bordeaux Montpellier 2022-03-20
6 Lens Clermont 2022-03-19
7 Lorient Strasbourg 2022-03-20
8 Olympique Marseille Nice 2022-03-20
9 Monaco PSG 2022-03-20
10 Nantes Lille 2022-03-19
11 Reims Olympique Lyonnais 2022-03-20
12 Rennes Metz 2022-03-20
13 Saint-Etienne Troyes 2022-03-18
14 Clermont Nantes 2022-04-03
15 Lille Bordeaux 2022-04-02
16 Olympique Lyonnais Angers SCO 2022-04-03
17 Metz Monaco 2022-04-03
18 Montpellier Brest 2022-04-03
19 Nice Rennes 2022-04-02
20 PSG Lorient 2022-04-03
21 Saint-Etienne Olympique Marseille 2022-04-03
22 Strasbourg Lens 2022-04-03
23 Troyes Reims 2022-04-03
24 Angers SCO Lille 2022-04-10
25 Brest Nantes 2022-04-10
26 Bordeaux Metz 2022-04-10
27 Clermont PSG 2022-04-09
28 Lens Nice 2022-04-10
29 Lorient Saint-Etienne 2022-04-08
30 Monaco Troyes 2022-04-10
31 Reims Rennes 2022-04-09
32 Strasbourg Olympique Lyonnais 2022-04-10
33 Lille Lens 2022-04-16
34 Olympique Lyonnais Bordeaux 2022-04-17
35 Metz Clermont 2022-04-17
36 Nantes Angers SCO 2022-04-17
37 Nice Lorient 2022-04-17
38 PSG Olympique Marseille 2022-04-17
39 Rennes Monaco 2022-04-15
40 Saint-Etienne Brest 2022-04-16
41 Troyes Strasbourg 2022-04-17
42 Angers SCO PSG 2022-04-20
43 Bordeaux Saint-Etienne 2022-04-20
44 Lens Montpellier 2022-04-20
45 Lorient Metz 2022-04-20
46 Olympique Marseille Nantes 2022-04-20
47 Monaco Nice 2022-04-20
48 Reims Lille 2022-04-20
49 Strasbourg Rennes 2022-04-20
50 Troyes Clermont 2022-04-20
51 Brest Olympique Lyonnais 2022-04-20
52 Lille Strasbourg 2022-04-24
53 Olympique Lyonnais Montpellier 2022-04-23
54 Metz Brest 2022-04-24
55 Nantes Bordeaux 2022-04-24
56 Nice Troyes 2022-04-24
57 PSG Lens 2022-04-23
58 Reims Olympique Marseille 2022-04-24
59 Rennes Lorient 2022-04-24
60 Saint-Etienne Monaco 2022-04-23
61 Olympique Marseille Olympique Lyonnais 2022-05-01
62 Strasbourg PSG 2022-04-29
63 Angers SCO Bordeaux 2022-05-08
64 Clermont Montpellier 2022-05-08
65 Lille Monaco 2022-05-06
66 Metz Olympique Lyonnais 2022-05-08
67 Nantes Rennes 2022-05-11
68 Nice Saint-Etienne 2022-05-11
69 PSG Troyes 2022-05-08
70 Reims Lens 2022-05-08
71 Metz Angers SCO 2022-05-14
72 Monaco Brest 2022-05-14
73 Montpellier PSG 2022-05-14
74 Nice Lille 2022-05-14
75 Rennes Olympique Marseille 2022-05-14
76 Bordeaux Nice 2022-05-01
77 Brest Clermont 2022-05-01
78 Lens Nantes 2022-04-30
79 Lorient Reims 2022-05-01
80 Monaco Angers SCO 2022-05-01
81 Montpellier Metz 2022-05-01
82 Rennes Saint-Etienne 2022-04-30
83 Troyes Lille 2022-05-01
84 Brest Strasbourg 2022-05-07
85 Lorient Olympique Marseille 2022-05-08
86 Bordeaux Lorient 2022-05-14
87 Olympique Lyonnais Nantes 2022-05-14
88 Saint-Etienne Reims 2022-05-14
89 Strasbourg Clermont 2022-05-14
90 Troyes Lens 2022-05-14
91 Angers SCO Montpellier 2022-05-21
92 Brest Bordeaux 2022-05-21
93 Clermont Olympique Lyonnais 2022-05-21
94 Lens Monaco 2022-05-21
95 Lille Rennes 2022-05-21
96 Lorient Troyes 2022-05-21
97 Olympique Marseille Strasbourg 2022-05-21
98 Nantes Saint-Etienne 2022-05-21
99 PSG Metz 2022-05-21
100 Reims Nice 2022-05-21

Proposition 2.

For $\mu, \nu \in \mathcal{P}_n^u(\mathbb{R}^2)$, we have

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \leq \sqrt{\frac{1 + \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu). \tag{10}$$

Furthermore, in the case where $\nu = \delta_y$ for $y \in \mathbb{R}^2$:

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \geq \sqrt{\frac{1 - \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu). \tag{11}$$

To prove Equation (10), consider $\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ and $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}$, and let $\sigma$ be a permutation such that

$$W_2^2(\mu, \nu) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y_{\sigma(i)} \rVert^2.$$

Then, for all $l = 1, \dots, L$, we have

$$W_2^2(\theta_l \# \mu, \theta_l \# \nu) \leq \frac{1}{n} \sum_{i=1}^{n} \left( \langle \theta_l, x_i \rangle - \langle \theta_l, y_{\sigma(i)} \rangle \right)^2.$$

Thus,

$$\begin{aligned}
\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) &\leq \left( \frac{1}{nL} \sum_{l=1}^{L} \sum_{i=1}^{n} \langle \theta_l, x_i - y_{\sigma(i)} \rangle^2 \right)^{1/2} \\
&= \left( \frac{1}{nL} \sum_{i=1}^{n} \sum_{l=1}^{L} (x_i - y_{\sigma(i)})^T \theta_l \theta_l^T (x_i - y_{\sigma(i)}) \right)^{1/2} \\
&= \left( \frac{1}{nL} \sum_{i=1}^{n} (x_i - y_{\sigma(i)})^T \Theta \, (x_i - y_{\sigma(i)}) \right)^{1/2},
\end{aligned}$$

where $\Theta = \sum_{l=1}^{L} \theta_l \theta_l^T$ is a positive semi-definite matrix. Let $\rho(\Theta)$ be its spectral radius; we have

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \leq \sqrt{\rho(\Theta)} \left( \frac{1}{nL} \sum_{i=1}^{n} \lVert x_i - y_{\sigma(i)} \rVert^2 \right)^{1/2} = \sqrt{\frac{\rho(\Theta)}{L}}\; W_2(\mu, \nu).$$

For the choice of grid in Equation (7), we have

$$\theta_l \theta_l^T = \begin{pmatrix} \cos^2\left(\frac{\pi(l-1)}{2L}\right) & \cos\left(\frac{\pi(l-1)}{2L}\right) \sin\left(\frac{\pi(l-1)}{2L}\right) \\ \cos\left(\frac{\pi(l-1)}{2L}\right) \sin\left(\frac{\pi(l-1)}{2L}\right) & \sin^2\left(\frac{\pi(l-1)}{2L}\right) \end{pmatrix}$$

and hence

$$\Theta = \begin{pmatrix} \frac{L+1}{2} & \frac{\cos\left(\frac{\pi}{2L}\right)}{2 \sin\left(\frac{\pi}{2L}\right)} \\ \frac{\cos\left(\frac{\pi}{2L}\right)}{2 \sin\left(\frac{\pi}{2L}\right)} & \frac{L-1}{2} \end{pmatrix}.$$

This yields $\rho(\Theta) = \frac{L}{2} \left( 1 + \frac{1}{L \sin\left(\frac{\pi}{2L}\right)} \right)$ and we deduce that

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \leq \sqrt{\frac{1 + \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu).$$

For the second bound, we consider the case where $\nu = \delta_y$ is concentrated at a single location. Similar calculations yield

$$\begin{aligned}
\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) &= \left( \frac{1}{nL} \sum_{l=1}^{L} \sum_{i=1}^{n} \left( \langle \theta_l, x_i \rangle - \langle \theta_l, y \rangle \right)^2 \right)^{1/2} \\
&= \left( \frac{1}{nL} \sum_{i=1}^{n} (x_i - y)^T \Theta \, (x_i - y) \right)^{1/2} \\
&\geq \sqrt{\frac{\gamma(\Theta)}{L}} \left( \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y \rVert^2 \right)^{1/2},
\end{aligned}$$

where $\gamma(\Theta)$ is the smallest eigenvalue of $\Theta$, given for the grid in Equation (7) by $\gamma(\Theta) = \frac{L}{2} \left( 1 - \frac{1}{L \sin\left(\frac{\pi}{2L}\right)} \right)$. Therefore, we obtain

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \geq \sqrt{\frac{1 - \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu).$$
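The eigenvalue formulas and both bounds can be checked numerically on random point clouds. In the sketch below (our own code, same grid as Equation (7)), $W_2$ between equal-size empirical measures is computed by brute force over permutations, which is feasible for n = 5:

```python
import numpy as np
from itertools import permutations

L, n = 6, 5
angles = np.pi * np.arange(L) / (2 * L)
thetas = np.stack([np.cos(angles), np.sin(angles)], axis=1)
s = np.sin(np.pi / (2 * L))

# Closed-form extreme eigenvalues of Theta = sum_l theta_l theta_l^T
Theta = thetas.T @ thetas
eigs = np.linalg.eigvalsh(Theta)                        # ascending order
assert np.isclose(eigs[-1], L / 2 * (1 + 1 / (L * s)))  # rho(Theta)
assert np.isclose(eigs[0], L / 2 * (1 - 1 / (L * s)))   # gamma(Theta)

def w2(mu, nu):
    """Exact W_2 between n-point empirical measures via brute-force matching."""
    cost = min(sum(((mu[i] - nu[p[i]]) ** 2).sum() for i in range(len(mu)))
               for p in permutations(range(len(nu))))
    return np.sqrt(cost / len(mu))

def sw_hat2(mu, nu):
    """SW_2-hat: root-mean-square of sorted-projection differences."""
    d = np.sort(mu @ thetas.T, axis=0) - np.sort(nu @ thetas.T, axis=0)
    return np.sqrt((d ** 2).mean())

rng = np.random.default_rng(4)
mu, nu = rng.normal(size=(2, n, 2))
upper = np.sqrt((1 + 1 / (L * s)) / 2)
assert sw_hat2(mu, nu) <= upper * w2(mu, nu) + 1e-12    # Equation (10)

y = rng.normal(size=2)
delta = np.tile(y, (n, 1))                              # nu = delta_y as n copies
lower = np.sqrt((1 - 1 / (L * s)) / 2)
assert sw_hat2(mu, delta) >= lower * w2(mu, delta) - 1e-12  # Equation (11)
```

The eigenvalue checks confirm the closed forms for ρ(Θ) and γ(Θ), and the two assertions at the end exercise the bounds (10) and (11) on a random instance.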


Received: 2025-01-17
Accepted: 2025-06-23
Published Online: 2025-07-28

© 2025 Walter de Gruyter GmbH, Berlin/Boston
