
An optimal transport based embedding to quantify the distance between playing styles in collective sports

Ali Baouan, Mathieu Rosenbaum and Sergio Pulido
Published/Copyright: July 28, 2025

Abstract

This study presents a quantitative framework to compare teams in collective sports with respect to their style of play. The style of play is characterized by the team’s spatial distribution over a collection of frames. As a first step, we introduce an optimal transport-based embedding to map frames into Euclidean space, enabling efficient computation of distances. Then, building on this frame-level analysis, we use quantization to establish a similarity metric between teams based on a collection of frames from their games. For illustration, we present an analysis of a collection of games from the 2021–2022 Ligue 1 season. We successfully retrieve relevant clusters of game situations and calculate the similarity matrix between teams in terms of style of play. Additionally, we demonstrate the effectiveness of the embedding as a preprocessing tool for prediction tasks. Likewise, we apply our framework to analyze the dynamics of the first half of the 2015–2016 NBA season.


Corresponding author: Ali Baouan, Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France, E-mail: 

Mailing address: CMAP, Ecole Polytechnique, route de Saclay, 91128 Palaiseau Cedex, France.


Funding source: Machine Learning & Systematic Methods in Finance

Acknowledgments

The authors thank Stats Perform, Matthieu Lille-Palette and Andy Cooper for providing tracking data. They are also grateful to Mathieu Lacome and Sébastien Coustou for insightful discussions. The authors gratefully acknowledge financial support from the chair “Machine Learning & Systematic Methods in Finance” from Ecole Polytechnique.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: The authors gratefully acknowledge financial support from the chair “Machine Learning & Systematic Methods in Finance” from Ecole Polytechnique.

  7. Data availability: The data is partly publicly available. The NBA tracking data can be found at https://github.com/linouk23/NBA-Player-Movements/tree/master/data.

Appendix

A Distance between teams in the NBA

Similarly to Section 3.3, we apply the optimal transport framework to tracking data from the first half of the 2015–2016 NBA season. For each team, the tracking data was subsampled by retaining one out of every 25 frames. We set $L = 6$ to obtain an embedding in $\mathbb{R}^{5 \times 6}$ by projecting along the grid of directions defined, for $l = 1, \dots, L$, by

$$\theta_l = \left(\cos\left(\frac{\pi (l-1)}{2L}\right),\ \sin\left(\frac{\pi (l-1)}{2L}\right)\right).$$
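The grid and the frame embedding can be sketched as follows. This is a minimal illustration, assuming the embedding stacks, for each direction, the sorted projections of the n player positions (one standard construction for comparing equal-size empirical measures); the function names are ours, not from the paper's code:

```python
import numpy as np

def direction_grid(L):
    """Unit directions theta_l = (cos(pi(l-1)/(2L)), sin(pi(l-1)/(2L))), l = 1..L."""
    angles = np.pi * np.arange(L) / (2 * L)
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (L, 2)

def embed_frame(frame, thetas):
    """Embed a frame of n player positions (n, 2) into R^{n x L}:
    project onto each direction and sort the projections per direction."""
    proj = frame @ thetas.T        # (n, L): entries <theta_l, x_i>
    return np.sort(proj, axis=0)   # order statistics per direction

# Toy example: 5 players (as in the NBA data) and L = 6 gives a point in R^{5x6}
rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 1.0, size=(5, 2))
E = embed_frame(frame, direction_grid(6))
print(E.shape)  # (5, 6)
```

Sorting each column makes the embedding invariant to player ordering, which is what allows plain Euclidean operations on embedded frames downstream.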

Figure 9 illustrates an example of the ten clusters identified from the collection of frames across all Golden State Warriors games, and Table 5 provides the percentage of frames in each cluster. We observe that the clustering algorithm effectively distinguishes different phases of play in a consistent manner. Specifically, we identify three distinct defensive settings in clusters 6, 2, and 10, where the defensive block transitions from very deep positions to more advanced ones. Additionally, clusters 3, 5, 8, and 9 represent transitions between defense and offense, characterized by players being spread across the court. The dynamics observed in these clusters differ significantly from those in football. In particular, clusters where players are positioned in the center of the court account for a total of 18.62 % of the frames, unlike in football, where central clusters are more prevalent. This difference is expected because basketball rules prohibit the ball from moving backward from the offensive half of the court. Consequently, basketball is essentially a transition-oriented game where teams alternate between defense and offense.
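Because the embedding places frames in a Euclidean space, the quantization step can be carried out with standard k-means on the flattened embeddings. The sketch below uses a plain Lloyd iteration as a stand-in for the paper's quantization procedure (an assumption on our part); the occupancy rates it computes correspond to the kind of statistics reported in Table 5.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd k-means on the rows of X (flattened frame embeddings)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                 # nearest-center assignment
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                      # keep empty clusters fixed
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data: 500 frames embedded in R^{5x6}, flattened to R^30, 10 clusters
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 30))
labels, centers = lloyd_kmeans(X, k=10)
rates = np.bincount(labels, minlength=10) / len(labels)  # cluster occupancy rates
```

On real data, the closest frame to each center can then be displayed as the cluster representative, as in Figure 9.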

Figure 9: 
Example of 10 clusters observed in the frames from the games of the Golden State Warriors. For each cluster, the closest frame to the barycenter and a random example from the cluster are shown. The shaded areas represent the total density of player locations taken from a random sample of 100 frames from each cluster. Clusters 2, 6 and 10 represent defensive phases. Clusters 1 and 4 represent attacking phases while clusters 3, 5, 7, 8 and 9 represent transitions between attack and defense.

Table 5:

Rate of occurrence of clusters displayed in Figure 9.

Cluster Percentage of frames
1 13.04 %
2 22.49 %
3 4.73 %
4 15.66 %
5 3.40 %
6 17.31 %
7 5.22 %
8 3.27 %
9 7.22 %
10 7.66 %

Finally, Figure 10 illustrates the similarity in playing styles between NBA teams. This similarity is computed from the collection of frames of all available games in the first half of the season, subsampled at a rate of one frame every 25 frames. We observe that the Golden State Warriors exhibit the largest deviation from the other franchises. This is not surprising, as they are credited with introducing a new style of play heavily reliant on 3-pointers. In fact, this season marked the record for the most points scored in the regular season, which naturally leads to a different spatial distribution of play. Interestingly, the Utah Jazz display the second largest deviation from the rest of the teams, and their distance to the Golden State Warriors is the greatest. Figure 11 presents the similarity metric after centering the frames prior to embedding. In this case, the distance metrics become more uniform, and the significant deviations of the Utah Jazz and Golden State Warriors diminish. This suggests that these effects are primarily due to the average locations both teams occupy during their games rather than the relative placement of players.
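Centering a frame prior to embedding simply subtracts the mean player position, so only the relative placement of players survives. A minimal sketch (our own helper, not from the paper):

```python
import numpy as np

def center_frame(frame):
    """Remove the frame's mean position so only relative placement remains."""
    return frame - frame.mean(axis=0, keepdims=True)

# Two frames with the same formation at different average court locations
rng = np.random.default_rng(2)
shape = rng.uniform(size=(5, 2))
frame_a = shape + np.array([10.0, 4.0])   # formation shifted to one spot
frame_b = shape + np.array([60.0, 20.0])  # same formation, elsewhere on court
print(np.allclose(center_frame(frame_a), center_frame(frame_b)))  # True
```

After centering, the two frames above embed identically, which is exactly why the Jazz and Warriors deviations shrink in Figure 11 if they are driven by average court location.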

Figure 10: 
Distance matrix between the collections of frames in the games of each team in the NBA.

Figure 11: 
Distance matrix between the collections of frames in the games of each team in the NBA after centering the data.

A.1 Predicting team identity

Similarly to Section 3.3.3, we demonstrate that our representation of collections of frames as distributions in the embedding space effectively captures the stylistic identity of teams in the NBA. Specifically, we show that it is possible to predict a team’s identity from an out-of-sample collection of frames with high accuracy, which provides strong evidence that our embedding preserves spatial information while differentiating between unique playing styles.

Using all frames in each test fold yields an average Top-1 accuracy of 99.33 % and a Top-2 accuracy of 100 %. This performance suggests that, with a sufficiently large test sample of frames, the distribution of frames in the embedding captures a team’s style of play essentially perfectly. Figure 12 shows how accuracy varies as a function of the number of frames in each subset. As with the football data, larger subsets naturally improve accuracy, as they provide a more representative picture of a team’s positioning style. Compared to football, more frames are needed to reach an accuracy of 70 %, but the accuracy ceiling is higher: we achieve 93.8 % Top-1 accuracy with 2,000 frames. Overall, the results further confirm that our embedding captures team-specific spatial configurations, enabling accurate and fast team identification.
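The prediction task can be mimicked with a simple nearest-signature rule: represent each collection of frames by its cluster-occupancy histogram over a shared codebook, and assign a test collection to the team with the closest training histogram. This is a hedged sketch of one plausible classifier consistent with the quantization framework, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def occupancy(embeddings, centers):
    """Cluster-occupancy histogram of a collection of embedded frames."""
    d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return np.bincount(labels, minlength=len(centers)) / len(labels)

def predict_team(test_embeddings, centers, signatures):
    """Return the team whose training signature is closest to the test histogram."""
    h = occupancy(test_embeddings, centers)
    return min(signatures, key=lambda t: np.linalg.norm(h - signatures[t]))

# Toy setup: a 2-cluster codebook and two hypothetical teams with opposite usage
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
signatures = {"Team A": np.array([0.9, 0.1]), "Team B": np.array([0.1, 0.9])}
test = np.array([[0.2, 0.1], [0.1, 0.3], [9.8, 10.1]])  # mostly cluster 0
print(predict_team(test, centers, signatures))  # Team A
```

Larger test collections give histograms closer to the team's true cluster distribution, which is consistent with the accuracy gains reported in Figure 12.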

Figure 12: 
Accuracy of NBA team identity prediction as a function of test sample size. For each sample size, 100 randomly selected subsets of the test fold are used to compute the accuracy and its standard deviation.

B Proof of Proposition 1

First, we prove that $\mathrm{Proj}_\theta$ is injective. We proceed by induction on n.

($H_n$): For any choice of grid $\theta_1, \theta_2, \dots, \theta_L$ such that $L \geq n + 1$ and $\theta_l, \theta_k$ are noncollinear for all $l \neq k \leq L$, $\mathrm{Proj}_\theta$ is injective.

If $n = 1$, take $\mu = \delta_x$ and $\nu = \delta_y$ and $\theta_1, \theta_2, \dots, \theta_L$ such that $L \geq 2$ and $\mathrm{Proj}_\theta(\mu) = \mathrm{Proj}_\theta(\nu)$. Then we have $\delta_{\langle \theta_1, x \rangle} = \delta_{\langle \theta_1, y \rangle}$ and $\delta_{\langle \theta_2, x \rangle} = \delta_{\langle \theta_2, y \rangle}$. Equivalently, $\langle \theta_1, x \rangle = \langle \theta_1, y \rangle$ and $\langle \theta_2, x \rangle = \langle \theta_2, y \rangle$, which implies $x = y$ since $\theta_1$ and $\theta_2$ are noncollinear. Thus, $\mu = \nu$.

Let $n \geq 2$ and assume ($H_{n-1}$) is true. Take $\mu$ and $\nu$ in $\mathcal{P}_n^u(\mathbb{R}^2)$ such that $\mathrm{Proj}_\theta(\mu) = \mathrm{Proj}_\theta(\nu)$. We can write $\mu = \sum_{i=1}^{n} \frac{1}{n} \delta_{x_i}$ and $\nu = \sum_{i=1}^{n} \frac{1}{n} \delta_{y_i}$, and we have for all $l$ in $\{1, \dots, L\}$:

$$\sum_{i=1}^{n} \frac{1}{n} \delta_{\langle \theta_l, x_i \rangle} = \sum_{i=1}^{n} \frac{1}{n} \delta_{\langle \theta_l, y_i \rangle}.$$

Therefore, for all $l$ in $\{1, \dots, L\}$, there exists $i_l$ in $\{1, \dots, n\}$ such that

$$\langle \theta_l, x_n \rangle = \langle \theta_l, y_{i_l} \rangle.$$

Since $L \geq n + 1$, the pigeonhole principle guarantees the existence of $l \neq k$ such that $i_l = i_k$. Thus

$$\langle \theta_l, x_n \rangle = \langle \theta_l, y_{i_l} \rangle, \qquad \langle \theta_k, x_n \rangle = \langle \theta_k, y_{i_l} \rangle.$$

Since $\theta_l$ and $\theta_k$ are noncollinear, we can deduce that $x_n = y_{i_l}$. Removing these two atoms, we have for all $h$ in $\{1, \dots, L\}$:

$$\sum_{i=1}^{n-1} \frac{1}{n-1} \delta_{\langle \theta_h, x_i \rangle} = \sum_{i \neq i_l} \frac{1}{n-1} \delta_{\langle \theta_h, y_i \rangle}.$$

We conclude using ($H_{n-1}$) on $\mu' = \sum_{i=1}^{n-1} \frac{1}{n-1} \delta_{x_i}$ and $\nu' = \sum_{i \neq i_l} \frac{1}{n-1} \delta_{y_i}$.

Finally, for $\mu$ and $\nu$ in $\mathcal{P}_n^u(\mathbb{R}^2)$ we have

$$\widehat{SW}_p(\mu, \nu, (\theta_l)_{l \leq L}) = \frac{1}{(nL)^{1/p}} \left\lVert \mathrm{Proj}_\theta(\mu) - \mathrm{Proj}_\theta(\nu) \right\rVert_p.$$

This justifies that $\widehat{SW}_p(\cdot, \cdot, (\theta_l)_{l \leq L})$ is a distance over $\mathcal{P}_n^u(\mathbb{R}^2)$.
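The closed form above makes $\widehat{SW}_p$ a scaled $\ell_p$ distance between embeddings, so the metric axioms are easy to check numerically. A minimal sketch with the grid of Equation (7), using our own helper names:

```python
import numpy as np

def proj(points, thetas):
    """Sorted projections of an n-point cloud along each direction."""
    return np.sort(points @ thetas.T, axis=0)  # shape (n, L)

def sw_hat(mu, nu, thetas, p=2):
    """SW_p-hat between two n-point empirical measures:
    (1 / (nL)^(1/p)) * ||Proj(mu) - Proj(nu)||_p."""
    n, L = mu.shape[0], thetas.shape[0]
    diff = np.abs(proj(mu, thetas) - proj(nu, thetas))
    return (diff ** p).sum() ** (1.0 / p) / (n * L) ** (1.0 / p)

L = 6
angles = np.pi * np.arange(L) / (2 * L)
thetas = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Identity, symmetry, and triangle inequality on random 5-point clouds
rng = np.random.default_rng(3)
a, b, c = rng.normal(size=(3, 5, 2))
assert sw_hat(a, a, thetas) == 0.0
assert np.isclose(sw_hat(a, b, thetas), sw_hat(b, a, thetas))
assert sw_hat(a, c, thetas) <= sw_hat(a, b, thetas) + sw_hat(b, c, thetas) + 1e-12
```

The triangle inequality holds exactly here because the distance is a norm of the difference of two embedding vectors.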

C Justification of the normalisation term

The following result allows us to determine a suitable normalisation factor for the sliced-Wasserstein distance.

Table 6:

Possession values for each team in the analysis, displayed as percentages. Teams are sorted in descending order.

Team Possession value (%)
Olympique Marseille 63.85 %
PSG 60.63 %
Rennes 60.15 %
Olympique Lyonnais 58.36 %
Nice 55.15 %
Lille 54.77 %
Lens 51.05 %
Clermont 49.30 %
Saint-Étienne 48.71 %
Montpellier 48.32 %
Bordeaux 48.29 %
Strasbourg 47.72 %
Brest 46.87 %
Monaco 46.62 %
Angers SCO 45.71 %
Metz 45.19 %
Nantes 45.02 %
Troyes 44.22 %
Lorient 43.43 %
Reims 41.63 %
Table 7:

List of 100 games used in this study. Left: Games 1–33. Center: Games 34–66. Right: Games 67–100.

Game Home team Away team Date
1 Metz Lens 2022-03-13
2 PSG Bordeaux 2022-03-13
3 Strasbourg Monaco 2022-03-13
4 Angers SCO Brest 2022-03-20
5 Bordeaux Montpellier 2022-03-20
6 Lens Clermont 2022-03-19
7 Lorient Strasbourg 2022-03-20
8 Olympique Marseille Nice 2022-03-20
9 Monaco PSG 2022-03-20
10 Nantes Lille 2022-03-19
11 Reims Olympique Lyonnais 2022-03-20
12 Rennes Metz 2022-03-20
13 Saint-Etienne Troyes 2022-03-18
14 Clermont Nantes 2022-04-03
15 Lille Bordeaux 2022-04-02
16 Olympique Lyonnais Angers SCO 2022-04-03
17 Metz Monaco 2022-04-03
18 Montpellier Brest 2022-04-03
19 Nice Rennes 2022-04-02
20 PSG Lorient 2022-04-03
21 Saint-Etienne Olympique Marseille 2022-04-03
22 Strasbourg Lens 2022-04-03
23 Troyes Reims 2022-04-03
24 Angers SCO Lille 2022-04-10
25 Brest Nantes 2022-04-10
26 Bordeaux Metz 2022-04-10
27 Clermont PSG 2022-04-09
28 Lens Nice 2022-04-10
29 Lorient Saint-Etienne 2022-04-08
30 Monaco Troyes 2022-04-10
31 Reims Rennes 2022-04-09
32 Strasbourg Olympique Lyonnais 2022-04-10
33 Lille Lens 2022-04-16
34 Olympique Lyonnais Bordeaux 2022-04-17
35 Metz Clermont 2022-04-17
36 Nantes Angers SCO 2022-04-17
37 Nice Lorient 2022-04-17
38 PSG Olympique Marseille 2022-04-17
39 Rennes Monaco 2022-04-15
40 Saint-Etienne Brest 2022-04-16
41 Troyes Strasbourg 2022-04-17
42 Angers SCO PSG 2022-04-20
43 Bordeaux Saint-Etienne 2022-04-20
44 Lens Montpellier 2022-04-20
45 Lorient Metz 2022-04-20
46 Olympique Marseille Nantes 2022-04-20
47 Monaco Nice 2022-04-20
48 Reims Lille 2022-04-20
49 Strasbourg Rennes 2022-04-20
50 Troyes Clermont 2022-04-20
51 Brest Olympique Lyonnais 2022-04-20
52 Lille Strasbourg 2022-04-24
53 Olympique Lyonnais Montpellier 2022-04-23
54 Metz Brest 2022-04-24
55 Nantes Bordeaux 2022-04-24
56 Nice Troyes 2022-04-24
57 PSG Lens 2022-04-23
58 Reims Olympique Marseille 2022-04-24
59 Rennes Lorient 2022-04-24
60 Saint-Etienne Monaco 2022-04-23
61 Olympique Marseille Olympique Lyonnais 2022-05-01
62 Strasbourg PSG 2022-04-29
63 Angers SCO Bordeaux 2022-05-08
64 Clermont Montpellier 2022-05-08
65 Lille Monaco 2022-05-06
66 Metz Olympique Lyonnais 2022-05-08
67 Nantes Rennes 2022-05-11
68 Nice Saint-Etienne 2022-05-11
69 PSG Troyes 2022-05-08
70 Reims Lens 2022-05-08
71 Metz Angers SCO 2022-05-14
72 Monaco Brest 2022-05-14
73 Montpellier PSG 2022-05-14
74 Nice Lille 2022-05-14
75 Rennes Olympique Marseille 2022-05-14
76 Bordeaux Nice 2022-05-01
77 Brest Clermont 2022-05-01
78 Lens Nantes 2022-04-30
79 Lorient Reims 2022-05-01
80 Monaco Angers SCO 2022-05-01
81 Montpellier Metz 2022-05-01
82 Rennes Saint-Etienne 2022-04-30
83 Troyes Lille 2022-05-01
84 Brest Strasbourg 2022-05-07
85 Lorient Olympique Marseille 2022-05-08
86 Bordeaux Lorient 2022-05-14
87 Olympique Lyonnais Nantes 2022-05-14
88 Saint-Etienne Reims 2022-05-14
89 Strasbourg Clermont 2022-05-14
90 Troyes Lens 2022-05-14
91 Angers SCO Montpellier 2022-05-21
92 Brest Bordeaux 2022-05-21
93 Clermont Olympique Lyonnais 2022-05-21
94 Lens Monaco 2022-05-21
95 Lille Rennes 2022-05-21
96 Lorient Troyes 2022-05-21
97 Olympique Marseille Strasbourg 2022-05-21
98 Nantes Saint-Etienne 2022-05-21
99 PSG Metz 2022-05-21
100 Reims Nice 2022-05-21

Proposition 2.

For $\mu, \nu \in \mathcal{P}_n^u(\mathbb{R}^2)$, we have

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \leq \sqrt{\frac{1 + \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu). \tag{10}$$

Furthermore, in the case where $\nu = \delta_y$ for $y \in \mathbb{R}^2$:

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \geq \sqrt{\frac{1 - \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu). \tag{11}$$

To prove Equation (10), consider $\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ and $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}$, and let $\sigma$ be a permutation such that

$$W_2^2(\mu, \nu) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y_{\sigma(i)} \rVert^2.$$

Then, for all $l = 1, \dots, L$, we have

$$W_2^2(\theta_l \# \mu, \theta_l \# \nu) \leq \frac{1}{n} \sum_{i=1}^{n} \left( \langle \theta_l, x_i \rangle - \langle \theta_l, y_{\sigma(i)} \rangle \right)^2.$$

Thus,

$$\begin{aligned}
\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) &\leq \left( \frac{1}{nL} \sum_{l=1}^{L} \sum_{i=1}^{n} \langle \theta_l, x_i - y_{\sigma(i)} \rangle^2 \right)^{1/2} \\
&= \left( \frac{1}{nL} \sum_{i=1}^{n} \sum_{l=1}^{L} (x_i - y_{\sigma(i)})^T \theta_l \theta_l^T (x_i - y_{\sigma(i)}) \right)^{1/2} \\
&= \left( \frac{1}{nL} \sum_{i=1}^{n} (x_i - y_{\sigma(i)})^T \Theta \, (x_i - y_{\sigma(i)}) \right)^{1/2},
\end{aligned}$$

where $\Theta = \sum_{l=1}^{L} \theta_l \theta_l^T$ is a positive semi-definite matrix. Let $\rho(\Theta)$ be its spectral radius; we have

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \leq \sqrt{\rho(\Theta)} \left( \frac{1}{nL} \sum_{i=1}^{n} \lVert x_i - y_{\sigma(i)} \rVert^2 \right)^{1/2} = \sqrt{\frac{\rho(\Theta)}{L}}\; W_2(\mu, \nu).$$

For the choice of grid in Equation (7), we have

$$\theta_l \theta_l^T = \begin{pmatrix} \cos^2\left(\frac{\pi(l-1)}{2L}\right) & \cos\left(\frac{\pi(l-1)}{2L}\right) \sin\left(\frac{\pi(l-1)}{2L}\right) \\ \cos\left(\frac{\pi(l-1)}{2L}\right) \sin\left(\frac{\pi(l-1)}{2L}\right) & \sin^2\left(\frac{\pi(l-1)}{2L}\right) \end{pmatrix}$$

and hence

$$\Theta = \begin{pmatrix} \frac{L+1}{2} & \frac{\cos\left(\frac{\pi}{2L}\right)}{2 \sin\left(\frac{\pi}{2L}\right)} \\ \frac{\cos\left(\frac{\pi}{2L}\right)}{2 \sin\left(\frac{\pi}{2L}\right)} & \frac{L-1}{2} \end{pmatrix}.$$

This yields $\rho(\Theta) = \frac{L}{2} \left( 1 + \frac{1}{L \sin\left(\frac{\pi}{2L}\right)} \right)$ and we deduce that

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \leq \sqrt{\frac{1 + \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu).$$

For the second bound, we consider the case where $\nu = \delta_y$ is concentrated at a single location. Similar calculations yield

$$\begin{aligned}
\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) &= \left( \frac{1}{nL} \sum_{l=1}^{L} \sum_{i=1}^{n} \left( \langle \theta_l, x_i \rangle - \langle \theta_l, y \rangle \right)^2 \right)^{1/2} \\
&= \left( \frac{1}{nL} \sum_{i=1}^{n} (x_i - y)^T \Theta \, (x_i - y) \right)^{1/2} \\
&\geq \sqrt{\frac{\gamma(\Theta)}{L}} \left( \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y \rVert^2 \right)^{1/2},
\end{aligned}$$

where $\gamma(\Theta)$ is the smallest eigenvalue of $\Theta$, given for the grid in Equation (7) by $\gamma(\Theta) = \frac{L}{2} \left( 1 - \frac{1}{L \sin\left(\frac{\pi}{2L}\right)} \right)$. Therefore, we obtain

$$\widehat{SW}_2(\mu, \nu, (\theta_l)_{l \leq L}) \geq \sqrt{\frac{1 - \frac{1}{L \sin\left(\frac{\pi}{2L}\right)}}{2}}\; W_2(\mu, \nu).$$
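The eigenvalue formulas and both bounds can be checked numerically on random point clouds. In the sketch below (our own code, same grid as Equation (7)), $W_2$ between equal-size empirical measures is computed by brute force over permutations, which is feasible for n = 5:

```python
import numpy as np
from itertools import permutations

L, n = 6, 5
angles = np.pi * np.arange(L) / (2 * L)
thetas = np.stack([np.cos(angles), np.sin(angles)], axis=1)
s = np.sin(np.pi / (2 * L))

# Closed-form extreme eigenvalues of Theta = sum_l theta_l theta_l^T
Theta = thetas.T @ thetas
eigs = np.linalg.eigvalsh(Theta)                        # ascending order
assert np.isclose(eigs[-1], L / 2 * (1 + 1 / (L * s)))  # rho(Theta)
assert np.isclose(eigs[0], L / 2 * (1 - 1 / (L * s)))   # gamma(Theta)

def w2(mu, nu):
    """Exact W_2 between n-point empirical measures via brute-force matching."""
    cost = min(sum(((mu[i] - nu[p[i]]) ** 2).sum() for i in range(len(mu)))
               for p in permutations(range(len(nu))))
    return np.sqrt(cost / len(mu))

def sw_hat2(mu, nu):
    """SW_2-hat: root-mean-square of sorted-projection differences."""
    d = np.sort(mu @ thetas.T, axis=0) - np.sort(nu @ thetas.T, axis=0)
    return np.sqrt((d ** 2).mean())

rng = np.random.default_rng(4)
mu, nu = rng.normal(size=(2, n, 2))
upper = np.sqrt((1 + 1 / (L * s)) / 2)
assert sw_hat2(mu, nu) <= upper * w2(mu, nu) + 1e-12    # Equation (10)

y = rng.normal(size=2)
delta = np.tile(y, (n, 1))                              # nu = delta_y as n copies
lower = np.sqrt((1 - 1 / (L * s)) / 2)
assert sw_hat2(mu, delta) >= lower * w2(mu, delta) - 1e-12  # Equation (11)
```

The eigenvalue checks confirm the closed forms for ρ(Θ) and γ(Θ), and the two assertions at the end exercise the bounds (10) and (11) on a random instance.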


Received: 2025-01-17
Accepted: 2025-06-23
Published Online: 2025-07-28

© 2025 Walter de Gruyter GmbH, Berlin/Boston
