
Unsupervised collaborative learning based on Optimal Transport theory

  • Fatima-Ezzahraa Ben-Bouazza, Younès Bennani, Guénaël Cabanes and Abdelfettah Touzani
Published/Copyright: 7 June 2021

Abstract

Collaborative learning has recently achieved very significant results. It still suffers, however, from several issues, including the type of information that needs to be exchanged, the criteria for stopping the collaboration and the choice of the right collaborators. In this paper, we aim to improve the quality of the collaboration and to resolve these issues via a novel approach inspired by Optimal Transport theory. More specifically, the objective function for the exchange of information is based on the Wasserstein distance, with a bidirectional transport of information between collaborators. This formulation allows the method to learn a stopping criterion and provides a criterion for choosing the best collaborators. Extensive experiments are conducted on multiple data-sets to evaluate the proposed approach.

1 Introduction

Data clustering is one of the main interests in unsupervised Machine Learning research [1]. A large number of clustering algorithms have been proposed in the literature [2], divided into different families based on the cost function to optimize [1, 3].

The clustering task is known to be difficult and to suffer from several issues. Most of the problems come from the fact that unsupervised algorithms work with very little information about the expected result [4]. Therefore, the choice of the cost function to optimize, the algorithm to use and the values of the parameters requires a lot of expertise to obtain the desired output [5]. In addition, modern data-sets are often very large (both in size and dimension) and distributed over several sites [6], which limits the efficiency of most classical clustering algorithms [7].

In an attempt to solve these issues, the scientific community has suggested several ways of combining the results of different algorithms [8]. The approaches proposed in that direction are based on the idea of several algorithms working on the data: either each algorithm optimizes a different cost function or uses different parameter values on the same data-set, or each algorithm works on a subset of the data, usually optimizing the same cost function. These approaches can be classified into two main categories. In Ensemble Learning, several algorithms are trained on the data and the set of results is merged into a global consensus [9]. In Collaborative Clustering, several models are trained simultaneously on the data-set, usually with each algorithm working on a subset of the data, and exchange information during the learning process [10]. In this paper we focus on the latter approach.

Generally speaking, the problem of Collaborative Clustering can be defined as follows: Given a finite number of disjoint data sites, collaborative clustering is a scheme of collective development and reconciliation of fundamental cluster structures across these sites [11]. The general framework for collaborative clustering is based on two principal steps:

  • Local step: Each algorithm will train on the data it has access to and produce a clustering result, e.g. a model of the local data subset.

  • Collaborative step: The algorithms share their output in order to confirm or improve their models, with the goal of finding better clustering solutions.

In this paper, we propose to study the unsupervised collaboration framework through Optimal Transport theory, thus benefiting from this mathematical formalism to analyze and describe the process of collaboration between the different algorithms. In this case, the collaboration, which consists of exchanges of information between algorithms, is modeled in the form of bi-directional or even multi-directional transports.

The rest of the paper is organized as follows. Section 2 reviews the prototype-based methods proposed for the collaborative learning task, and Section 3 develops the background of Optimal Transport theory. In Section 4 we introduce the novel framework of collaborative clustering using Optimal Transport theory. In Section 5 we provide an experimental validation and discuss the quality of the proposed approach. Finally, in Section 6 a conclusion and some perspectives are given.

2 Related work

The first collaborative clustering approach was introduced in [11] under the name “Collaborative Fuzzy Clustering” (CoFC). This approach was based on an extended version of Fuzzy C-Means adapted to distributed data. The algorithm relies on two steps: the first step aims to find c clusters for each collaborator, where each object is assigned to some cluster with a certain membership degree stored in a matrix S. The second step consists in exchanging the information stored in the matrix S or the prototypes of each cluster. The Fuzzy C-Means algorithm is then trained again for each collaborator, taking into account the shared information.

Several studies have since developed algorithms and approaches within this framework, such as CoEM [12], CoFKM [13] and a collaborative EM-like algorithm (EM for Expectation-Maximization) based on Markov Random Fields [14]. All these approaches follow the same principle as Collaborative Fuzzy C-Means.

However, these algorithms display similar limitations: they require the same number of clusters in each site and the same model trained in each site, and the collaboration can only happen between instances of the same algorithm.

Collaborative clustering was also developed based on Self-Organizing Maps (SOM) [15] by adapting the original objective function to distributed data. The main idea was to add a term inspired by the classical SOM neighborhood function to the original SOM objective function, where this term aims to compare the neighborhoods of each prototype in each site. This neighborhood term is adaptable to either horizontal or vertical collaboration. The same principle can also be adapted to Generative Topographic Maps (GTM) [16] with a modification of the M-step of the EM algorithm. The modification consists in adding a collaborative term inspired by penalized likelihood estimation [17].

Another approach proposed in this framework is the SAMARAH algorithm [10, 18], with the advantage of not requiring a smoothness function or the same number of clusters or prototypes. However, it is restricted to horizontal collaboration only, and the principle of solving conflicts based on a pairwise criterion can make the process volatile.

Recent works have extended collaborative clustering to make it more flexible [19, 20], ensuring a collaboration between different algorithms without fixing a unique number of clusters for all of the collaborators. The advantage of this approach is that different families of clustering algorithms can exchange information in a collaborative framework. Nevertheless, one of the most important issues in collaborative clustering is the control of the quality of the information exchanged between collaborators and of the right time to stop the collaboration. In [21, 22], the authors developed a new criterion to select the optimal collaborator. They showed that the diversity between collaborators could have an important impact on the collaboration. Furthermore, a recent study of the influence of diversity on the collaboration, based on entropy [23], showed a trade-off between the quality gain and the diversity between the collaborators.

3 Fundamental background of the proposed approach

In this section we present the mathematical formalization of the Optimal Transport problem and how it can be solved using the Sinkhorn algorithm [24].

3.1 Optimal Transport

Optimal Transport is a well-established theory introduced by Monge [25] to solve the problem of resource allocation. The original aim of this theory was to compute the optimal path of a massive particle from one point to another by minimizing the cost of this move, or transportation. Later, the Monge problem was relaxed by Kantorovich [26], where the problem is transposed to a distribution problem, using linear programming to connect a pair of distributions.

More formally, let $\Omega \subseteq \mathbb{R}^n$ be a measurable space of dimension $n$, and let $\mathcal{P}(\Omega)$ denote the set of probability measures on $\Omega$.

Given two families of data sets in $\Omega$, $X^s = \{x_i^s\}_{i=1}^{n_s}$ and $X^t = \{x_i^t\}_{i=1}^{n_t}$, let $\mu_s$ and $\mu_t$ be their respective distributions over $\Omega_s$ and $\Omega_t$.

The transport map $\gamma$ from $\mu_s$ to $\mu_t$ is defined as the push-forward $\gamma_\# \mu_s = \mu_t$:

(1) $\gamma : \Omega_s \to \Omega_t$

where $\gamma$ transforms the probability measure $\mu_s$ into its image measure, denoted $\gamma_\# \mu_s$, which is another probability measure defined over $\Omega_t$ and satisfying:

(2) $\gamma_\# \mu_s(x) = \mu_s(\gamma^{-1}(x)), \quad \forall x \in \Omega_t$

The Monge-Kantorovich formulation of this problem is a convex relaxation which aims to find a coupling $\gamma$, defined as a joint probability measure over $\Omega_s \times \Omega_t$ with marginals $\mu_s$ and $\mu_t$, that minimizes the cost of transport w.r.t. $c : \Omega_s \times \Omega_t \to \mathbb{R}^+$:

(3) $\gamma^* = \underset{\gamma \in \Pi}{\arg\min} \int_{\Omega_s \times \Omega_t} c(x^s, x^t) \, d\gamma(x^s, x^t)$

where $\Pi$ is the set of all probabilistic couplings in $\mathcal{P}(\Omega_s \times \Omega_t)$ with marginals $\mu_s$ and $\mu_t$, and $\gamma^*$ designates the optimal transportation plan.

This problem admits a unique solution $\gamma^*$, which allows us to define the Wasserstein distance of order $p \in [1, +\infty[$ between $\mu_s$ and $\mu_t$:

(4) $W_p(\mu_s, \mu_t) = \left( \inf_{\gamma \in \Pi(\mu_s, \mu_t)} \int_{\Omega_s \times \Omega_t} d^p(x^s, x^t) \, d\gamma(x^s, x^t) \right)^{\frac{1}{p}}$

where $d$ is a distance and the corresponding cost function $c(x^s, x^t) = d^p(x^s, x^t)$, with $c : \Omega_s \times \Omega_t \to \mathbb{R}^+$, is the cost of transporting the unit mass $x^s$ to $x^t$.

In this work, we focus on the discrete case of the Optimal Transport problem. However, we refer to [27] for more details on the continuous case and the mathematics involved.

We consider the discrete setting of the Optimal Transport problem. This case arises when $\mu_s$ and $\mu_t$ are only accessible through discrete samples. The empirical measures can be defined as:

(5) $\mu_s = \sum_{i=1}^{n_s} p_i^s \delta_{x_i^s} \quad \text{and} \quad \mu_t = \sum_{i=1}^{n_t} p_i^t \delta_{x_i^t}$

where $\delta_{x_i}$ is the Dirac function at $x_i \in \mathbb{R}^n$, and $p_i^s$ and $p_i^t$ are the probability masses associated with the $i$-th samples, such that $\sum_{i=1}^{n_s} p_i^s = \sum_{i=1}^{n_t} p_i^t = 1$.

The Monge-Kantorovich problem consists in finding an optimal coupling (or transportation plan) $\gamma^*$ as a joint probability between $\mu_s$ and $\mu_t$ over $\Omega_s \times \Omega_t$, minimizing the cost of the transport w.r.t. $X^s \in \mathbb{R}^{n_s \times n}$ and $X^t \in \mathbb{R}^{n_t \times n}$, by solving:

(6) $\gamma^* = \underset{\gamma \in \Pi(\mu_s, \mu_t)}{\arg\min} \langle \gamma, C \rangle_F$

with:

$\langle \cdot, \cdot \rangle_F$ the Frobenius dot product,

$C \in \mathbb{R}_+^{n_s \times n_t}$ the transport cost matrix, with $C_{ij}$ given by the cost function $C : X^s \times X^t \to \mathbb{R}^+$,

$\Pi(\mu_s, \mu_t) = \{ \gamma \in \mathbb{R}_+^{n_s \times n_t} \mid \gamma \mathbf{1}_{n_t} = \mu_s, \ \gamma^T \mathbf{1}_{n_s} = \mu_t \}$ the transportation polytope, where $\mathbf{1}_n$ is the $n$-dimensional vector of ones.

This problem admits a unique solution $\gamma^*$ and defines a metric on the space of discrete probability measures, called the Wasserstein distance, as follows:

(7) $W(\mu_s, \mu_t) = \langle \gamma^*, C \rangle_F$

The Wasserstein distance has recently proved very useful in machine learning, for instance in domain adaptation [28], metric learning [29], clustering [30] and multi-level clustering [31, 32]. The particularity of this distance is that it takes into account the geometry of the data through the distance between the samples, which explains its efficiency. In terms of computation, the success of this distance also comes from the work of Cuturi [24], who introduced an algorithm based on entropic regularization, as presented in the next section.
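To make the discrete formulation (5)-(7) concrete, the following sketch computes an optimal coupling and the associated Wasserstein cost between two small empirical measures using the POT (Python Optimal Transport) library; the toy samples and the uniform weights are illustrative assumptions, not data used in the paper.

import numpy as np
import ot  # POT: Python Optimal Transport

# Two empirical measures supported on toy samples (illustrative only)
rng = np.random.RandomState(0)
Xs = rng.randn(50, 2)                      # n_s = 50 source samples in R^2
Xt = rng.randn(60, 2) + 1.0                # n_t = 60 target samples in R^2
a = np.full(50, 1.0 / 50)                  # masses p_i^s, summing to 1
b = np.full(60, 1.0 / 60)                  # masses p_i^t, summing to 1

C = ot.dist(Xs, Xt, metric='sqeuclidean')  # cost matrix C_ij = ||x_i^s - x_j^t||^2
gamma = ot.emd(a, b, C)                    # exact optimal coupling of Eq. (6)
W = np.sum(gamma * C)                      # Wasserstein cost of Eq. (7): <gamma*, C>_F
print(W)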

3.2 Regularized Optimal Transport

Even though the Wasserstein distance has known very significant successes, its computation has always suffered from a very slow convergence, especially in high dimension. This led to the idea of smoothing the objective function by adding an entropic regularization term, introduced in [33] and applied to the Optimal Transport problem in [24], in order to speed up the convergence and improve the stability [34].

This is represented formally by the following minimization problem:

(8) $\gamma_\lambda^* = \underset{\gamma \in \Pi(\mu_s, \mu_t)}{\arg\min} \langle \gamma, C \rangle_F - \frac{1}{\lambda} E(\gamma)$

where $E(\gamma) = -\sum_{i,j}^{n_s, n_t} \gamma_{ij} \log(\gamma_{ij})$ is the entropy of $\gamma$, $\lambda > 0$ is the entropy regularization parameter and $C$ is the cost matrix.

Thanks to the strong convexity of the entropy, the objective function becomes strictly convex. Consequently, the minimization problem (8) admits a unique solution and can be solved by Sinkhorn's fixed point algorithm, based on the following theorem.

Sinkhorn theorem (1967): For any positive matrix $A \in \mathbb{R}_+^{n \times m}$, $a \in \Sigma_n$ and $b \in \Sigma_m$, there exists a unique pair of vectors $(u, v) \in \mathbb{R}_+^n \times \mathbb{R}_+^m$ such that $\mathrm{diag}(u) \, A \, \mathrm{diag}(v) \in \mathcal{U}(a, b)$, and it constitutes a fixed point of the map:

(9) $(u, v) \mapsto \left( a ./ (A v), \ b ./ (A^T u) \right)$

Thanks to the regularized version of optimal transport, we obtain a less sparse, smoother and more stable solution than with the original problem. Another important advantage is that this formulation allows the matrix scaling approach of Sinkhorn-Knopp [35].

The regularized Optimal Transport plan is then found by iteratively computing two scaling vectors u and v such that

(10) $\gamma^* = \mathrm{diag}(u) \, e^{-\lambda C} \, \mathrm{diag}(v)$
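A minimal NumPy sketch of these scaling iterations is given below; the marginals, the cost matrix and the stopping rule are illustrative assumptions, and lam plays the role of the regularization parameter λ of Eq. (8).

import numpy as np

def sinkhorn(a, b, C, lam, n_iter=1000, tol=1e-9):
    # Entropy-regularized OT plan via Sinkhorn scaling.
    # a: (n,) and b: (m,) marginals summing to 1, C: (n, m) cost matrix.
    # Returns gamma = diag(u) K diag(v) with K = exp(-lam * C), as in Eq. (10).
    K = np.exp(-lam * C)                  # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u_prev = u
        u = a / (K @ v)                   # enforce the row marginals
        v = b / (K.T @ u)                 # enforce the column marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]

# Illustrative usage on random marginals and costs
rng = np.random.RandomState(0)
C = rng.rand(5, 7)
gamma = sinkhorn(np.full(5, 1 / 5), np.full(7, 1 / 7), C, lam=50.0)
print(gamma.sum(axis=1))                  # approximately equal to the row marginals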

4 Proposed approach

4.1 Motivation and potential applications

With the development of hardware technology, huge amounts of data represented in different views and with different structures are generated in real-world applications. This kind of data raises a new challenge: adapting the existing clustering algorithms, designed for single-view data, to multi-view data.

To clarify the motivation behind the proposed approach, we present some potential industrial applications, where several organizations or companies use a collection of data sets that could concern either the same or different customers. This could be data describing customers of banking institutions, state organizations, hospitals with medical records, etc. Imagine that all these organizations deal with the same individuals, but every organization may have different characteristics and descriptors for these individuals, linked to its activities. All these organizations may want to apply data mining algorithms to their own data set. On the other hand, they also recognize that, since other data sets contain information about the same individuals, it would be advantageous to learn about the dependencies between them so as to reveal a macro-picture. However, due to ethical considerations and privacy issues, these organizations are forbidden to share their data sets, which prevents the experts from combining all these data sets into a single view and running classical clustering algorithms. For example, the confidentiality requirements on the medical records of patients can deny access to their personal information, and security issues in banking organizations forbid sharing customer information. In addition, experts may hesitate to add more information and characteristics for fear of losing the real structure of the data. In this situation, the exchange of information through the proposed approach guarantees the privacy of the information of each organization, and the control of the collaboration avoids affecting the real structure of the data.

One of the most difficult challenges in collaborative learning is how to choose the right collaborator to collaborate with, which determines the order of the collaborations, not only to increase the local quality of each model, but also to ensure the convergence and to avoid over-fitting (Figure 1).

Figure 1: Collaborative clustering framework with selective information exchange.

Classical collaborative algorithms are based on two steps. The first one consists in clustering the data locally, the second in sending and receiving information between the local models. Despite the quantity of work on this framework, it still requires many restrictions to ensure convergence: usually each algorithm must work on the same representation space and must compute the same number of clusters. These restrictions limit the flexibility of collaborative clustering approaches for the analysis of real data.

On the other hand, Optimal Transport theory has shown very significant results, especially in transfer learning [28] and for the comparison of distributions. Based on this idea, our intuition is to model collaborative learning as a bi-directional knowledge transfer and to improve the optimization of the cost function using the comparison of the distributions of the local subsets, in order to weight the mutual confidence of the collaborators and use a transport plan to transfer the information between them.

In the next section we detail the proposed approach based on Optimal Transport theory, for both vertical and horizontal collaboration.

4.2 Collaborative Learning algorithms

The main goal of the proposed approach is to improve the quality and the stability of the collaboration and to guarantee the convergence without over-fitting. In collaborative clustering, we distinguish two principal approaches: vertical and horizontal collaboration. In vertical collaboration, the collaborators learn from different instances represented in the same space, while in horizontal collaboration the collaborators work on the same instances in different representation spaces.

In general, different frameworks must be used for vertical and horizontal collaboration. Here we propose a unified framework adapted to both approaches.

4.2.1 Local step

Let us consider $r$ collaborators, where the data of each collaborator $v$, $X^v = \{x_i^v\}_{i=1}^{n_v}$ with $x_i^v \in \mathbb{R}^{d_v}$ and $d_v$ the dimension of the representation space of $v$, corresponds to a distribution $\mu_v = \frac{1}{n_v} \sum_{i=1}^{n_v} \delta_{x_i^v}$.

In the local step, we seek to find the centroids $M^v = \{m_1^v, \ldots, m_{k_v}^v\}$, corresponding to a distribution $\nu_v$, that represent the local clusters of each collaborator $v$, so as to minimize the cost of the Optimal Transport plan $L^v = \{l_{ij}^v\}_{i,j=1}^{n_v, k_v}$ between the local data $X^v$ and the centroids $M^v$.

To achieve this, we solve the following minimization problem (11), where the first minimization, over $L^v$, consists in finding the Optimal Transport plan between the data and the centroids, and the second, over $M^v$, aims to update the distribution of the centroids so that the transport plan between the data and the centroids is optimal.

(11) $\underset{L^v \in \Pi(\mu_v, \nu_v), \, M^v}{\arg\min} \ \langle L^v, C(X^v, M^v) \rangle_F - \frac{1}{\lambda} E(L^v)$

subject to $\sum_{j=1}^{k_v} l_{ij}^v = \frac{1}{n_v}$ and $\sum_{i=1}^{n_v} l_{ij}^v = \frac{1}{k_v}$, with $C : X^v \times M^v \to \mathbb{R}^+$ such that $C_{ij} = c(x_i^v, m_j^v) = \| x_i^v - m_j^v \|^2$, the squared Euclidean distance between the sample $x_i^v$ and the centroid $m_j^v$.

It should be noticed that solving (11) is equivalent to Lloyd's problem, which is the Expectation-Maximization algorithm, when $d = 1$ and $p = 2$, without any constraints on the weights. This is why, to solve this problem, we alternate between computing the Sinkhorn matrix $L^v$, which assigns the instances to the closest clusters, and updating the centroids to decrease the transportation cost.

Algorithm 1

Sinkhorn-Means local algorithm

Input : $X^v = \{x_i^v\}_{i=1}^{n_v}$ : data of collaborator $v$ with distribution $\mu_v$
    $k_v$ : number of local clusters
    $\lambda$ : entropic constant
Output : The OT matrix $L^v$ and the centroids $M^v$
1 Initialize the centroids $M^v = \{m_j^v\}_{j=1}^{k_v}$ randomly
2 Compute the associated distribution $\nu_v = \frac{1}{k_v} \sum_{j=1}^{k_v} \delta_{m_j^v}$
3 repeat
4   Compute the OT matrix $L^v = \{l_{ij}^v\}_{i,j=1}^{n_v, k_v}$ :
5     $(L^v)^* = \underset{L^v \in \Pi(\mu_v, \nu_v)}{\arg\min} \langle L^v, C(X^v, M^v) \rangle_F - \frac{1}{\lambda} E(L^v)$
6   Update the centroids $M^v = \{m_j^v\}_{j=1}^{k_v}$ :
      $m_j^v = \sum_i l_{ij}^* x_i^v, \quad 1 \le j \le k_v$
7 until convergence;
8 return $(L^v)^*$ and $M^v$

Algorithm 1 details the computation of the local objective function (11), proceeding similarly to k-means but with the advantage of using the Wasserstein distance. Contrary to k-means, this yields a soft assignment of the data, meaning that the components of the assignment matrix satisfy $l_{ij} \in [0, \frac{1}{n_v}]$. Besides, the penalty term based on the entropy regularization guarantees a solution with higher entropy, which increases the stability of the algorithm and favors a uniform assignment of the instances.
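The Python sketch below mirrors Algorithm 1, reusing the sinkhorn routine sketched in Section 3.2; the random initialization of the centroids, the normalized barycentric update and the fixed number of outer iterations are assumptions made for illustration, not the exact implementation used in the experiments.

import numpy as np

def sinkhorn_means(X, k, lam, n_outer=50, seed=0):
    # Sketch of Algorithm 1: alternate between the regularized OT assignment L^v
    # and the centroid update M^v (assumes the sinkhorn function sketched above).
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    a = np.full(n, 1.0 / n)                    # mu_v: uniform weights on the data
    b = np.full(k, 1.0 / k)                    # nu_v: uniform weights on the centroids
    M = X[rng.choice(n, k, replace=False)]     # random initialization of the centroids
    for _ in range(n_outer):
        C = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)   # C_ij = ||x_i - m_j||^2
        L = sinkhorn(a, b, C, lam)             # OT assignment matrix (step 4)
        M_new = (L.T @ X) / L.sum(axis=0)[:, None]   # barycentric update of step 6,
        if np.allclose(M_new, M):                    # normalized by the column mass
            break
        M = M_new
    return L, M

# Illustrative usage on toy 2-D data
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
L, M = sinkhorn_means(X, k=2, lam=20.0)
labels = L.argmax(axis=1)                      # hard labels from the soft assignment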

4.2.2 Global step

The global step aims to perform the collaboration between the models, where each collaborator can update its local clustering based on information exchanged with the other collaborators, until the clusters stabilize with an improved quality. In the proposed approach, the collaboration step can be seen as two simultaneous phases.

The first phase aims to create an interaction plan based on the Sinkhorn matrix distance, which compares the local distribution of each collaborator to the others. The idea behind this phase is to allow each model to select the best collaborator to exchange information with; in other words, the algorithm also learns the best order of the collaborations at each iteration. The heuristic work in [21] showed that a collaboration with a model proposing a very different data distribution decreases the local quality, while a collaboration between very similar models is ineffective. Thus, the most beneficial collaboration is the one with models of median diversity. Hence, after the construction of the transport plan using the Sinkhorn algorithm, which compares the local structures, the proposed algorithm learns to choose for each model the collaborator with the median distribution similarity.

The second phase consists in exchanging information between collaborators to improve the local quality of each model. More precisely, we transport the distant prototypes to influence the location of the local prototypes, in order to obtain a higher local quality for each collaborator.

Considering the same notation above, we seek to minimize the following objective function:

(12) $\underset{L^v, M^v}{\arg\min} \left\{ \langle L^v, C(X^v, M^v) \rangle_F - \frac{1}{\lambda} E(L^v) + \sum_{\substack{v'=1 \\ v' \neq v}}^{r} \alpha_{v',v} \left( \langle L^{v,v'}, C(M^v, M^{v'}) \rangle_F - \frac{1}{\lambda} E(L^{v,v'}) \right) \right\}$

where the first term deals with the local clustering, and the remaining term is the collaboration term, representing the influence of the distant centroids' distributions on the local centroids' distribution. The $\alpha_{v',v}$ are non-negative coefficients proportional to the diversity between the collaborators and to the difference in local quality, and $L^{v,v'}$ is the Optimal Transport plan between the centroids of the $v$-th and $v'$-th collaborators.

Algorithm 2 details the computation steps of the proposed approach. It shows how the algorithm learns to select the best collaborator to learn from at each iteration, based on Sinkhorn comparisons between the distributions, and how it alternates between influencing the local centroids, using the confidence coefficient relative to the chosen collaborator and its centroid distribution, and updating the centroids with respect to the local instances in order to improve the clusters' quality.

It should be pointed out that in each iteration, each collaborator chooses successively the collaborators to exchange information with, based on the Sinkhorn matrix distance. More accurately, in each iteration, each model exchanges information with the collaborator having the median similarity between the two modelled distributions, computed with the Wasserstein metric. If this exchange increases the quality of the model (here we use the Davies-Bouldin index [36]), the centroids of the model are updated. Otherwise, the selected collaborator is removed from the list of possible collaborators and the process is repeated with the remaining collaborators, until the quality of the clusters stops increasing.
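A schematic sketch of this selection loop for a single collaborator v is given below; it assumes the sinkhorn routine of Section 3.2 and scikit-learn's davies_bouldin_score, and the function name, the dictionary-based interface and the loop structure are illustrative choices, not the exact implementation of Algorithm 2.

import numpy as np
from sklearn.metrics import davies_bouldin_score

def collaboration_round(Xv, Mv, others, alpha, lam):
    # One collaboration round for collaborator v.
    # others: dict {v': centroids M^{v'}}, alpha: dict {v': alpha_{v,v'}}.
    k = Mv.shape[0]
    assign = lambda M: ((Xv[:, None, :] - M[None, :, :]) ** 2).sum(-1).argmin(1)
    best_db = davies_bouldin_score(Xv, assign(Mv))       # internal quality before exchange
    candidates = dict(others)
    while candidates:
        plans, dists = {}, {}
        for vp, Mp in candidates.items():                # Sinkhorn comparison of nu_v and nu_{v'}
            C = ((Mv[:, None, :] - Mp[None, :, :]) ** 2).sum(-1)
            plans[vp] = sinkhorn(np.full(k, 1 / k), np.full(len(Mp), 1 / len(Mp)), C, lam)
            dists[vp] = (plans[vp] * C).sum()
        order = sorted(dists, key=dists.get)
        v_star = order[len(order) // 2]                  # collaborator with median similarity
        M_new = alpha[v_star] * (plans[v_star] @ candidates[v_star])  # step 10 of Algorithm 2
        if davies_bouldin_score(Xv, assign(M_new)) < best_db:         # lower DB = better quality
            return M_new                                 # accept the influenced centroids
        del candidates[v_star]                           # otherwise discard v_star and retry
    return Mv                                            # no beneficial collaboration found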

It must be highlighted that the proposed algorithm can be adapted to both horizontal and vertical collaboration, since the inputs of the algorithm are the distributions that represent the local structure of each collaborator. These can either share the same space but different samples (vertical collaboration), which formally means that $X^v = \{x_i^v\}_{i=1}^{n_v}$ with $x_i^v \in \mathbb{R}^d$, or be built from different spaces but share the same instances (horizontal collaboration), which means $X^v = \{x_i^v\}_{i=1}^{n}$ with $x_i^v \in \mathbb{R}^{d_v}$.

Another important advantage of the proposed algorithm is its adaptability to all prototype-based algorithms: instead of using Sinkhorn-Means as the local algorithm to obtain the centroids, we can use other prototype-based models such as k-means, SOM, EM, etc. The proposed algorithm therefore has the capability to work with hybrid models. This is detailed in Section 5.2.3.

Algorithm 2

Collaborative clustering (Co-OT)

Input : $\{X^v\}_{v=1}^r$ : the $r$ collaborators' data with distributions $\{\mu_v\}_{v=1}^r$
     $\{k_v\}_{v=1}^r$ : the numbers of clusters
    $\lambda$ : the entropic constant
     $\{\alpha_{v,v'}\}_{v,v'=1}^{r}$ : the confidence coefficient matrix
Output : The partition matrices $\{(L^v)^*\}_{v=1}^r$ and the centroids $\{M^v\}_{v=1}^r$
1 Initialize the centroids $M^v = \{m_j^v\}_{j=1}^{k_v}$ randomly
2 Compute the associated distributions $\nu_v = \frac{1}{k_v} \sum_{j=1}^{k_v} \delta_{m_j^v}$, $\forall v \in \{1, \ldots, r\}$
3 repeat
4   for $v = 1, \ldots, r$ do
5     Update the centroids $M^v$ and the partition matrix $(L^v)^*$ using a local algorithm (e.g. Sinkhorn-Means, SOM, GTM, k-Means...)
6     Update the centroids distribution $\nu_v = \frac{1}{k_v} \sum_{j=1}^{k_v} \delta_{m_j^v}$
7     for $v' = 1, \ldots, r$ and $v' \neq v$ do
8       Compute the OT matrix $(L^{v,v'})^* = \{l_{jj'}\}_{j,j'=1}^{k_v, k_{v'}}$ between the centroids of collaborators $v$ and $v'$:
        $(L^{v,v'})^* = \underset{L^{v,v'} \in \Pi(\nu_v, \nu_{v'})}{\arg\min} \langle L^{v,v'}, C(M^v, M^{v'}) \rangle_F - \frac{1}{\lambda} E(L^{v,v'})$
9     Choose the median collaborator:
        $v^* = \underset{v'}{\mathrm{median}} \{(L^{v,v'})^*\}_{v'=1, v' \neq v}^{r}$
10    Update the local centroids based on the collaborator's information, if the internal quality is increased (see below):
        $m_j^v = \alpha_{v,v^*} \sum_{j'} l_{jj'}^{v,v^*} m_{j'}^{v^*}, \quad 1 \le j \le k_v$
11 until convergence;
12 return $\{(L^v)^*\}_{v=1}^r$ and the centroids $\{M^v\}_{v=1}^r$

5 Experiments

5.1 Setting

5.1.1 Data-sets

We consider the following data-sets provided by the UCI Machine Learning Repository [37], described in Table 1. Each data-set is split between several collaborators.

  1. Glass dataset represents oxide content of the glass to determine its type. The study of classification of types of glass was motivated by criminological investigation. Since the glass left at the scene of the crime can be used as evidence...if it is correctly identified!

  2. The Spam base dataset consists of 57 attributes giving information about the frequency of usage of some words, the frequency of capital letters and other insights to detect if the e-mail is a spam or not.

  3. Waveform describes 3 types of waves. Each class is generated from a combination of 2 of 3 “base” waves, and noise (mean 0, variance 1) is added to each attribute of each instance.

  4. Wdbc is the Breast Cancer Wisconsin (Diagnostic) data set. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The target feature records the diagnosis (benign (1) or malignant (2)).

  5. Wine data is the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Table 1

Some characteristics of the experimental real-world data-sets

Data-sets #instances #Attributes #Classes
Glass 214 10 7
Spambase 4601 57 6
Waveform-noise 5000 40 3
Wdbc 569 33 2
Wine 178 13 3

5.1.2 Data-set splitting

In order to test the proposed algorithm experimentally, we first pre-processed the data to create the local subsets.

For vertical collaboration, we aim to create samples from the original data, i.e. different instances represented with the same characteristics. We split the data horizontally into 10 random subsets $X^v$, each subset being represented by a distribution $\mu_v$. We train Algorithm 1 to get the local centroid distributions $\nu_v$, and then apply the collaborative Algorithm 2 between the subsets in order to increase their local quality. To do so, we split the data as shown in Figure 2, where the data base is partitioned into $v$ samples that share the same features.

Figure 2: Splitting of the data for vertical collaboration.

For horizontal collaboration, the main idea is to split each chosen data set into 10 subsets (see Figure 3) that share the same instances but are represented with different features in each subset, selected randomly with replacement. Considering the notation above, each subset $X^v$ is represented by the distribution $\mu_v$, which is the input of Algorithm 1 used to get the distribution of the local centroids $\nu_v$. Algorithm 2 is then applied to influence the location of the local centroids by the centroids of the distant learners, without having access to their local data.
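The two splitting schemes can be reproduced with a few lines of NumPy, as sketched below; the disjoint grouping of instances for the vertical case and the size of the feature subsets for the horizontal case are illustrative assumptions, while the number of subsets (10) and the sampling of features with replacement follow the protocol above.

import numpy as np

def vertical_split(X, n_subsets=10, seed=0):
    # Vertical collaboration: same features, different (here disjoint) groups of instances.
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    return [X[part] for part in np.array_split(idx, n_subsets)]

def horizontal_split(X, n_subsets=10, n_features=5, seed=0):
    # Horizontal collaboration: same instances, random feature subsets
    # drawn with replacement, as in the protocol above.
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    return [X[:, rng.choice(d, n_features, replace=True)] for _ in range(n_subsets)]

# Illustrative usage on a toy data matrix
X = np.random.randn(200, 13)
vertical_views = vertical_split(X)       # 10 sample subsets sharing the 13 features
horizontal_views = horizontal_split(X)   # 10 feature views sharing the 200 instances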

Figure 3: Splitting of the data for horizontal collaboration.

5.1.3 Quality measures

The proposed approach was evaluated with two internal quality indexes, the Davies-Bouldin (DB) and Silhouette indexes, as well as an external criterion, the Adjusted Rand Index (ARI). The Davies-Bouldin (DB) index [36] is defined as follows:

(13) $DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\Delta_n(c_k) + \Delta_n(c_{k'})}{\Delta(c_k, c_{k'})}$

where $K$ is the number of clusters, $\Delta(c_k, c_{k'})$ is the distance between the cluster centers $c_k$ and $c_{k'}$, and $\Delta_n(c_k)$ is the average distance of all elements of cluster $C_k$ to their cluster center $c_k$. This index evaluates the quality of an unsupervised clustering based on the compactness of the clusters and the separation between them: it is the ratio of the within-cluster scatter to the between-cluster separation. The lower the value of the DB index, the better the quality of the clustering.

The Silhouette index [38] is based on the difference between $a(i)$, the average distance between the instance $x_i$ and the instances belonging to the same cluster, and $b(i)$, the average distance between the instance $x_i$ and the instances belonging to other clusters. The closer the Silhouette value is to 1, the better the instances are assigned to their clusters.

(14) $S = \frac{1}{n} \sum_{i=1}^{n} \frac{b(i) - a(i)}{\max(a(i), b(i))}$

Moreover, since the data-sets used in the experiments provide labels, we chose to add an external quality index, the Adjusted Rand Index (ARI) [39].

The Adjusted Rand Index [2] is defined as follows (15):

(15) $ARI = \dfrac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}$

where $n_{ij} = |C_i \cap Y_j|$, $C_i$ is the $i$-th cluster and $Y_j$ is the $j$-th real class provided by the labels of the data-set, $a_i = \sum_j n_{ij}$ is the number of instances in cluster $C_i$ and $b_j = \sum_i n_{ij}$ is the number of instances in class $Y_j$.

The ARI index measures the agreement between two partitions, one provided by the proposed algorithm and the second one provided by the labels of the data-set. The quality is better when the value of ARI is close to 1.
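In practice, the three indexes can be computed with scikit-learn, as in the short sketch below; the toy data and labels are placeholders, and we assume that scikit-learn's implementations match the definitions (13)-(15) closely enough for evaluation purposes.

import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score, adjusted_rand_score

X = np.random.randn(300, 4)                  # local data (placeholder)
pred = np.random.randint(0, 3, size=300)     # clustering result, e.g. L.argmax(axis=1)
truth = np.random.randint(0, 3, size=300)    # ground-truth classes of the data-set

db = davies_bouldin_score(X, pred)           # internal index (13): lower is better
sil = silhouette_score(X, pred)              # internal index (14): closer to 1 is better
ari = adjusted_rand_score(truth, pred)       # external index (15): closer to 1 is better
print(db, sil, ari)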

We therefore applied Algorithm 1 on the local data, then the coefficient matrix $\alpha$ was computed based on a diversity index between the collaborators [21]. This coefficient is used to control the importance of the terms of the collaboration. Algorithm 2 is trained 20 times in order to estimate the mean quality of the collaboration and a 95% confidence interval over the 20 experiments. The experimental results of horizontal collaboration were compared with collaborative SOM [16]. Both approaches were trained on the same subsets and with the same local model, a 3 × 5 map, with the parameters suggested by the authors of the algorithm [16]. The last part of the experiments consists in comparing the proposed algorithm with the collaborative algorithms of the state of the art, where the algorithms are trained on only two collaborators; we followed the same split as in [16] to compare the quality gain brought by the collaboration, based on the Davies-Bouldin (DB) index.

5.1.4 Computation tools

A nice feature of the Wasserstein distance is that its computation can be vectorized, which means that the computation of $n$ distances, whether from one histogram to many or from many to many, can be carried out simultaneously using elementary linear algebra operations. To do so, we use a PyTorch implementation of Sinkhorn-Means running on GPGPUs. Moreover, the collaborators were parallelized in order to run the local algorithms at the same time. For the experimental results, we used an Alienware Area-51m with a GeForce RTX 2080 (NVIDIA) graphics card.
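The sketch below illustrates this vectorization with PyTorch: the cost matrices and the Sinkhorn scaling updates of several problems are computed at once with batched tensor operations that can run on a GPU; the tensor shapes, the regularization value and the fixed number of iterations are illustrative assumptions.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A batch of 8 independent transport problems (illustrative shapes)
Xs = torch.randn(8, 100, 5, device=device)    # 100 source points in R^5 per problem
Xt = torch.randn(8, 120, 5, device=device)    # 120 target points in R^5 per problem
a = torch.full((8, 100), 1 / 100, device=device)
b = torch.full((8, 120), 1 / 120, device=device)

C = torch.cdist(Xs, Xt) ** 2                  # batched squared Euclidean cost matrices
K = torch.exp(-50.0 * C)                      # Gibbs kernels, lambda = 50 (illustrative)
u, v = torch.ones_like(a), torch.ones_like(b)
for _ in range(200):                          # Sinkhorn scaling on all problems at once
    u = a / torch.einsum("bnm,bm->bn", K, v)
    v = b / torch.einsum("bnm,bn->bm", K, u)
gamma = u[:, :, None] * K * v[:, None, :]     # 8 regularized transport plans
W = (gamma * C).sum(dim=(1, 2))               # 8 Sinkhorn costs computed simultaneously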

5.2 Results and discussion

In this section we evaluate the approach on several data-sets for both vertical and horizontal collaboration, based on different quality indexes, both internal and external. We also compare the proposed algorithm with state-of-the-art approaches of collaborative clustering based on prototype exchanges: Self-Organizing Maps collaboration (Co-SOM) and Generative Topographic Maps collaboration (Co-GTM).

5.2.1 Vertical Collaboration case

To evaluate the proposed approach in the vertical collaboration case, we ran Algorithm 2 on several subsets that share the same features but have different sizes and complexities.

As one can see, the proposed approach shows in general an acceptable capacity to improve the DB index of the clustering between before and after a vertical collaboration (Table 3). This is not surprising, considering that the proposed algorithm evaluates the gain of quality based on this index: the DB index is computed at each iteration in order to learn whether or not the collaborator can benefit from the collaboration. To check the validity of the algorithm, we also used the Silhouette internal index. As shown in Table 3, the value of the Silhouette index increases after collaboration, which confirms that the proposed approach increases the quality of each collaborator. However, the quality gain resulting from the collaboration is not always very high for some data-sets. This is due to the structure of the database and its horizontal splitting. If the data is very sparse (notably Spambase), we can observe that the collaboration increases the quality more than for non-sparse data (for example the Waveform data-set).

Table 3 shows the results achieved on this index, highlights the performance of our algorithm and confirms that the quality of each collaborator increases after the collaboration.

As one can see, the results are generally positive, but the difference between the values before and after collaboration, either for the internal indexes (Silhouette and DB) or for the external index (ARI), is not very impressive. This is explained by the horizontal splitting, which gives small subsets that have practically the same structure, which means that the collaboration can be seen as a bidirectional exchange of information between subsets of the same given database.

As we will see later on, this is not the case in horizontal collaboration, in which the impact of the collaboration is more important since the data are represented with different features for each collaborator. In addition, we chose one data set (due to page limitation) to detail the effect of the proposed algorithm on each collaborator. Table 2 shows the values of the different quality indexes for each collaborator built from the Spambase data set, and confirms that the collaboration does increase the quality of most collaborators in the process.

Table 2

Values of the different quality indexes before and after a vertical collaboration for each collaborator built from the Spambase data set.

Models DB Silhouette ARI
Before After Before After Before After
collab1 0.681 0.561 0.524 0.539 0.169 0.165
collab2 0.769 0.540 0.532 0.625 0.118 0.119
collab3 0.751 0.653 0.530 0.548 0.127 0.130
collab4 0.714 0.576 0.538 0.574 0.168 0.169
collab5 0.653 0.673 0.539 0.535 0.149 0.148
collab6 0.714 0.569 0.552 0.556 0.156 0.160
collab7 0.705 0.589 0.536 0.563 0.174 0.154
collab8 0.720 0.576 0.544 0.590 0.163 0.165
collab9 0.717 0.711 0.503 0.577 0.149 0.178
collab10 0.665 0.605 0.495 0.561 0.159 0.192
Table 3

Average values (±CI95%) of the different quality indexes before and after the vertical collaboration for each data set over 20 executions.

Indexes DB Silhouette ARI

Data-sets before after before after before after
Glass Average ±CI95% 0.984 ±0.09 0.689 ±0.17 0.369 ±0.04 0.471 ±0.06 0.223 ±0.05 0.244 ±0.07
Spambase Average ±CI95% 0.711 ±0.02 0.603 ±0.03 0.529 ±0.01 0.567 ±0.01 0.153 ±0.01 0.158 ±0.01
Waveform-noise Average ±CI95% 2.819 ±0.05 2.768 ±0.05 0.078 ±0.002 0.080 ±0.002 0.285 ±0.01 0.291 ±0.01
WDBC Average ±CI95% 0.675 ±0.02 0.629 ±0.05 0.448 ±0.02 0.513 ±0.02 0.290 ±0.03 0.374 ±0.06
Wine Average ±CI95% 0.525 ±0.02 0.496 ±0.02 0.568 ±0.02 0.574 ±0.02 0.306 ±0.03 0.308 ±0.02

Sensitivity Box-Whiskers plots (Figure 4) are drawn for the 20 experiments scores for each dataset before and after collaboration process. They enable us to study the distributional characteristics of scores as well as the level of the scores. To begin with, scores are sorted over the 20 tests. Then four equal sized groups are made from the ordered scores. That is, 25% of all scores are placed in each group. The lines dividing the groups are called quartiles, and the groups are referred to as quartile groups. Usually, we label these groups 1 to 4 starting at the bottom. The median (middle quartile) marks the mid-point of the scores and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less. The middle “box” represents the middle 50% of scores for the group. The range of scores from lower to upper quartile is referred to as the inter-quartile range. The middle 50% of scores fall within the inter-quartile range.

Figure 4: Sensitivity Box-Whiskers plots for the vertical collaboration case.

As can be seen from these graphs, the overall performance behavior shows a clear improvement as a result of the collaboration process. For example, for the DB Index, we can see a decrease in index values for all databases due to the contribution of collaboration. For the other two quality indices, we rather observe an increase in the values of the indices showing an improvement in the qualities of the solutions found.

5.2.2 Horizontal collaboration case

In this section we validate the effectiveness of the proposed approach on different data-sets for horizontal collaboration, where each collaborator represents the instances with different features (in a different representation space), see Figure 3.

We show how the exchange of information between the collaborators can improve the local results of each collaborator. Moreover, we show that the quality gain is much more important compared to classical collaboration (SOM and GTM collaboration).

Besides the Davies-Bouldin index (13), which is used within the algorithm, we validated the proposed approach with the Silhouette index (14) and the Adjusted Rand Index (15).

Table 5 shows that the collaboration step of the proposed approach increases the local quality of the models with regard to the internal indexes DB and Silhouette, in a horizontal collaboration framework, for different data-sets. Similarly, the ARI index values show that the clusters computed by the models are closer to the expected output after the collaborations (Table 5). One can notice that horizontal collaboration, between models that do not share the same representation space, is much more beneficial than vertical collaboration, where the models are computed in the same space. This is due to the fact that in the vertical framework, the random splitting of the data-sets produces subsets of different instances represented in the same space (i.e., with the same features) and with similar distributions, due to the random process of the split. Therefore, each local model should be quite similar to the others and little exploitable information is exchanged in the collaborative step. This is confirmed by the comparison between the index values of the Spambase data set in vertical collaboration (Table 2) and in horizontal collaboration (Table 4), where the difference between the index scores is much more important for each collaborator than in vertical collaboration.

Table 4

Values of the different quality indexes before and after the horizontal collaboration for each collaborator built from the Spambase data set.

Models DB Silhouette ARI
Before After Before After Before After
collab1 0.583 0.565 0.415 0.532 0.045 0.137
collab2 0.751 0.690 0.392 0.452 0.086 0.136
collab3 0.555 0.495 0.543 0.788 0.043 0.135
collab4 1.436 0.578 0.315 0.631 0.073 0.118
collab5 0.714 0.459 0.507 0.717 0.057 0.136
collab6 1.067 0.706 0.287 0.578 0.058 0.139
collab7 1.183 1.099 0.304 0.312 0.157 0.144
collab8 0.722 0.511 0.505 0.470 0.101 0.143
collab9 0.707 0.503 0.435 0.555 0.036 0.136
collab10 1.370 0.418 0.202 0.755 0.069 0.132
Table 5

Average values (±CI95%) of the different quality indexes before and after the horizontal collaboration for each data set over 20 executions.

Indexes DB Silhouette ARI

Data-sets before after before after before after
Glass Average ±CI95% 1.028 ±0.23 0.608 ±0.18 0.335 ±0.02 0.552 ±0.04 0.155 ±0.04 0.237 ±0.01
Spambase Average ±CI95% 0.903 ±0.20 0.481 ±0.12 0.390 ±0.06 0.579 ±0.09 0.072 ±0.02 0.135 ±0.004
Waveform-noise Average ±CI95% 2.578 ±0.15 2.310 ±0.20 0.078 ±0.008 0.108 ±0.01 0.179 ±0.02 0.218 ±0.02
WDBC Average ±CI95% 0.601 ±0.17 0.550 ±0.09 0.483 ±0.02 0.566 ±0.05 0.219 ±0.05 0.439 ±0.13
Wine Average ±CI95% 0.688 ±0.10 0.643 ±0.09 0.470 ±0.06 0.490 ±0.05 0.206 ±0.05 0.212 ±0.05

Sensitivity Box-Whiskers plots (Figure 5) represent a synthesis of the scores into crucial pieces of information identifiable at a glance: position, dispersion, asymmetry and length of the whiskers. The position is characterized by the median line (as well as the middle of the box). The dispersion is given by the length of the box and whiskers (as well as the distance between the ends of the whiskers). The asymmetry is given by the deviation of the median line from the center of the box relative to the length of the box (as well as by the length of the upper whisker relative to the lower one, and by the number of scores on each side). The length of the whiskers is the distance between their ends in relation to the length of the box (and the number of scores specifically marked). These graphs show the same overall performance behavior observed in the case of vertical collaboration: a clear improvement as a result of the collaboration process. This improvement is observed for all the quality indices used.

Figure 5: Sensitivity Box-Whiskers plots for the horizontal collaboration case.

5.2.3 Comparison with other collaborative approaches

In this section, the proposed collaborative Algorithm 2 is based on the Sinkhorn-Means (Sin-Means) local algorithm described in Algorithm 1 (this framework is thereafter called Co-Sin-OT), and we illustrate the adaptability of the proposed collaborative approach by alternatively using Self-Organizing Maps (SOM) as local algorithms (Co-SOM-OT). Both are compared to popular state-of-the-art collaborative algorithms based on Self-Organizing Maps (Co-SOM) [15] and Generative Topographic Maps (Co-GTM) [16]. We focus here on the horizontal collaboration case, as in [15] and [16]. Indeed, horizontal collaboration is usually more useful and applicable to real problems than vertical collaboration; it is also more difficult. In the first part of the experiments, we test the quality of the collaboration for 10 collaborators. As the Co-GTM algorithm is designed for only two collaborators, it is not included in these comparisons. In the second part, only two collaborators are trained and the Co-GTM algorithm is included in the protocol.

The first set of experiments is thus restricted to Co-SOM, Co-SOM-OT and Co-Sin-OT, in order to be able to work with several collaborators. All collaborative approaches are applied on the same subsets. In the SOM-based approaches, each local collaborator starts with the same 5 × 3 SOM. The approaches are compared using the Silhouette index. As shown in Tables 6 to 10, the results obtained with the proposed approach are globally better for this index. One can note that, for some collaborators, the collaboration leads to very similar results in both cases, despite very different qualities before collaboration. The OT-based approach (Co-SOM-OT) provides a much more stable quality improvement over the set of collaborators. In addition, the use of Sinkhorn-Means as the local algorithm (Co-Sin-OT) provides the best results compared to a SOM-based local clustering (Co-SOM and Co-SOM-OT). This can be explained by the fact that the mechanism of the SOM-based collaborative algorithms is constrained by the neighborhood functions. Moreover, it was built for a collaboration between two collaborators, then extended to allow multiple collaborations, unlike the proposed approach, where each learner exchanges information with all of the others at each step of the collaboration.

Table 6

Comparison of SOM-based and Sinkhorn-based collaborative approaches, using the Silhouette index for the Glass data set. The average values (±CI95%) is computed over 20 executions.

Sinkhorn-based SOM-based
Sin-Means Co-Sin-OT Gain SOM Co-SOM Gain Co-SOM-OT Gain
collab1 0.353 0.582 0.229 0.088 0.240 0.152 0.240 0.152
collab2 0.294 0.461 0.167 −0.009 −0.009 0.000 0.131 0.140
collab3 0.351 0.653 0.303 −0.036 −0.036 0.000 0.156 0.192
collab4 0.400 0.569 0.169 0.395 0.395 0.000 0.340 −0.055
collab5 0.256 0.439 0.182 −0.008 0.320 0.329 0.320 0.329
collab6 0.339 0.602 0.263 0.070 0.070 0.000 0.102 0.032
collab7 0.299 0.565 0.266 0.222 0.257 0.034 0.253 0.031
collab8 0.358 0.444 0.086 0.410 0.410 0.000 0.439 0.029
collab9 0.351 0.635 0.284 0.073 0.183 0.110 0.223 0.150
collab10 0.355 0.574 0.220 −0.053 −0.051 0.001 0.003 0.056

Average ±CI95% 0.335 ±0.02 0.552 ±0.04 0.216 ±0.12 0.115 ±0.10 0.177 ±0.10 0.063 ±0.06 0.221 ±0.07 0.106 ±0.06
Table 7

Comparison of SOM-based and Sinkhorn-based collaborative approaches, using the Silhouette index for the Spam-base data set. The average values (±CI95%) is computed over 20 executions.

Sinkhorn-based SOM-based
Sin-Means Co-Sin-OT Gain SOM Co-SOM Gain Co-SOM-OT Gain
collab1 0,415 0,532 0,118 0,224 0,483 0,260 0,346 0,122
collab2 0,392 0,452 0,060 0,038 0,080 0,042 0,124 0,086
collab3 0,543 0,788 0,246 −0,137 −0,137 0,000 0,005 0,142
collab4 0,315 0,631 0,316 −0,308 −0,091 0,216 −0,103 0,205
collab5 0,507 0,717 0,211 −0,101 −0,028 0,073 0,052 0,153
collab6 0,287 0,578 0,291 −0,039 0,036 0,075 0,153 0,192
collab7 0,304 0,312 0,008 −0,035 −0,035 0,000 0,087 0,122
collab8 0,505 0,470 −0,035 0,314 0,524 0,210 0,511 0,197
collab9 0,435 0,555 0,119 −0,260 −0,069 0,191 0,023 0,283
collab10 0,202 0,755 0,553 0,041 0,041 0,000 0,114 0,073

Average ±CI95% 0.390 ±0.06 0.579 ±0.09 0,188 ±0.10 −0.026 ±0.12 0.035 ±0.12 0.106 ±0.06 0,131 ±0.11 0,158 ±0.03
Table 8

Comparison of SOM-based and Sinkhorn-based collaborative approaches, using the silhouette index for the Waveform-noise data set. The average values (±CI95%) is computed over 20 executions.

Sinkhorn-based SOM-based
Sin-Means Co-Sin-OT Gain SOM Co-SOM Gain Co-SOM-OT Gain
collab1 0,108 0,155 0,047 0,025 0,030 0,006 0,064 0,039
collab2 0,074 0,097 0,023 0,036 0,036 0,000 0,062 0,026
collab3 0,075 0,091 0,016 0,070 0,070 0,000 0,069 −0,001
collab4 0,067 0,088 0,021 0,043 0,047 0,003 0,064 0,021
collab5 0,081 0,127 0,046 0,054 0,058 0,004 0,069 0,015
collab6 0,067 0,079 0,012 0,063 0,063 0,000 0,067 0,004
collab7 0,093 0,128 0,036 0,026 0,026 0,000 0,063 0,037
collab8 0,095 0,140 0,045 0,040 0,044 0,004 0,067 0,027
collab9 0,081 0,120 0,039 0,031 0,038 0,007 0,065 0,034
collab10 0,075 0,104 0,029 0,025 0,032 0,007 0,063 0,038

Average ±CI95% 0.078 ±0.008 0.108 ±0.01 0,0313 ±0.008 0.043 ±0.01 0.045 ±0.009 0.003 ±0.01 0,065 ±0.05 0,024 ±0.01
Table 9

Comparison of SOM-based and Sinkhorn-based collaborative approaches, using the silhouette index for the WDBC data set. The average values (±CI95%) is computed over 20 executions.

Sinkhorn-based SOM-based
Sin-Means Co-Sin-OT Gain SOM Co-SOM Gain Co-SOM-OT Gain
collab1 0,470 0,632 0,162 0,233 0,233 0,000 0,244 0,011
collab2 0,514 0,620 0,106 0,185 0,208 0,024 0,211 0,026
collab3 0,485 0,614 0,129 0,246 0,303 0,057 0,401 0,155
collab4 0,521 0,608 0,087 0,015 0,074 0,059 0,126 0,111
collab5 0,422 0,499 0,077 0,102 0,182 0,080 0,341 0,239
collab6 0,490 0,549 0,059 0,240 0,278 0,038 0,334 0,094
collab7 0,516 0,685 0,168 0,298 0,337 0,039 0,445 0,147
collab8 0,388 0,392 0,004 0,125 0,125 0,000 0,323 0,198
collab9 0,518 0,516 -0,003 0,202 0,213 0,011 0,367 0,165
collab10 0,513 0,548 0,035 0,110 0,123 0,013 0,236 0,126

Average ±CI95% 0.483 ±0.02 0.566 ±0.05 0,082 ±0.03 0.175 ±0.05 0.204 ±0.05 0.032 ±0.01 0,303 ±0.06 0,127 ±0.04
Table 10

Comparison of SOM-based and Sinkhorn-based collaborative approaches, using the silhouette index for the Wine data set. The average values (±CI95%) is computed over 20 executions.

Sinkhorn-based SOM-based
Sin-Means Co-Sin-OT Gain SOM Co-SOM Gain Co-SOM-OT Gain
collab1 0.560 0.562 0.002 0.3921 0.4033 0.011 0.446 0.054
collab2 0.559 0.560 0.001 0.4288 0.4288 0.000 0.431 0.002
collab3 0.572 0.573 0.001 0.4945 0.4945 0.000 0.542 0.048
collab4 0.320 0.318 −0.001 0.1223 0.1255 0.003 0.224 0.102
collab5 0.446 0.499 0.054 0.1444 0.2116 0.067 0.221 0.077
collab6 0.394 0.450 0.056 0.1558 0.1768 0.021 0.201 0.045
collab7 0.329 0.351 0.022 0.1107 0.1107 0.000 0.257 0.146
collab8 0.520 0.534 0.014 0.1243 0.2210 0.097 0.348 0.224
collab9 0.444 0.498 0.054 0.0715 0.2286 0.157 0.356 0.284
collab10 0.559 0.560 0.002 0.3923 0.3923 0.000 0.395 0.003

Average ±CI95% 0.470 ±0.06 0.490 ±0.05 0.020 ±0.01 0.244 ±0.10 0.279 ±0.09 0.036 ±0.03 0.342 ±0.07 0.098 ±0.06

In the second set of experiments, we compare the proposed approach to classical collaborative algorithms based on Self-Organizing Maps (Co-SOM) [15] and Generative Topographic Maps (Co-GTM) [16]. The three approaches are compared using the DB index, as in [15, 16]. As shown in Table 11, the results obtained with the proposed approach (Co-Sin-OT) are generally better than those of the classical approaches. The lowest qualities are obtained by the older approach, SOM-based collaborative clustering (Co-SOM), followed by the GTM-based approach (Co-GTM). Unlike Co-SOM and Co-GTM, the proposed approach aims to find a local optimum for each collaborator. More precisely, at the end of the local training, each collaborator exchanges information based on a stopping criterion that ends the collaboration with a given collaborator as soon as the quality of the collaboration starts decreasing, which is not the case in the other approaches. Furthermore, Table 11 compares the quality gain brought by the collaboration for each approach. The proposed approach increases the quality of each collaborator on all of the data-sets, which implies a positive quality gain. On the contrary, in SOM-based collaboration the gain can be negative for some data-sets. Finally, in order to evaluate the general performance of the approaches, we define the following score measurement:

(16) $\mathrm{Score}(M_i) = \sum_j \frac{G(M_i, D_j)}{\max_i G(M_i, D_j)}$
Table 11

Comparison of DB index Between SOM, GTM and OT based approaches on different data-sets for two collaborators

Approaches Co-SOM Co-GTM Co-Sin-OT
Data-sets before after gain before after gain before after gain
Glass Collab1 1.010 0.985 0.002 0.740 0.970 0.000 1.109 0.774 0.253
Collab2 0.902 0.924 1.280 1.050 0.908 0.731
Spambase Collab1 2.924 2.436 −0.077 1.120 1.060 1.014 1.467 1.055 0.187
Collab2 0.960 1.748 0.870 0.900 0.775 0.766
Waveform-noise Collab1 6.488 6.488 0.027 1.140 1.310 0.378 3.592 2.579 0.256
Collab2 7.269 6.898 3.750 1.310 3.732 2.868
Wdbc Collab1 0.640 0.641 0.001 0.970 0.920 0.010 0.755 0.612 0.219
Collab2 0.651 0.649 0.870 0.900 0.928

Score 1.740 1.377 3.581

where $G(M_i, D_j)$ indicates the quality gain of approach $M_i$ on data-set $D_j$. This score gives an overall vision of the best approach over all the data-sets. As shown in Table 11, the best score belongs to the proposed collaboration based on Optimal Transport theory, followed by the GTM collaborative approach (Co-GTM) and the SOM-based collaboration (Co-SOM). These results highlight the performance of the proposed algorithm, due to the strong theoretical background of Optimal Transport theory.
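As an illustration of Eq. (16), the small sketch below computes the score from a matrix of quality gains; the gain values are placeholders, not the actual values of Table 11.

import numpy as np

# gains[i, j] = G(M_i, D_j): quality gain of approach M_i on data-set D_j (placeholder values)
gains = np.array([
    [0.10, 0.05, 0.02],
    [0.20, 0.01, 0.04],
    [0.25, 0.07, 0.03],
])

# Score(M_i) = sum_j G(M_i, D_j) / max_i G(M_i, D_j), as in Eq. (16)
scores = (gains / gains.max(axis=0)).sum(axis=1)
best = scores.argmax()   # index of the approach with the highest overall score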

In order to assess the performance of our approaches, we use the Friedman test and the Nemenyi test recommended in [40]. The Friedman test is conducted to test the null hypothesis that all approaches are equivalent with respect to accuracy. If the null hypothesis is rejected, then the Nemenyi test is performed. In addition, if the average ranks of two approaches differ by at least the critical difference (CD), then it can be concluded that their performances are significantly different. In the Friedman test, we set the significance level α = 0.05. Figure 6 shows a critical difference diagram representing a projection of the approaches' average ranks on an enumerated axis. The approaches are ordered from left (the best) to right (the worst), and a thick line connects the approaches whose average ranks are not significantly different (at the 5% significance level). As shown in Figure 6, Co-Sin-OT achieves a significant improvement over the other techniques (Co-GTM and Co-SOM), since during the collaboration phase it is stable and the process stops the collaboration for some learners when their local quality starts to decrease, which prevents a common issue of collaborative approaches.

Figure 6: Friedman and Nemenyi test for comparing multiple approaches over multiple data sets. Approaches are ordered from left (the best) to right (the worst).

Compared to the most cited approaches of the state of the art, the positive impacts of using collaborative learning based on this theory are:

  1. The proposed algorithm is based on a strong and well-established theory that is becoming increasingly popular in the field of machine learning.

  2. Its strength is highlighted by experimental validation on both artificial and real data-sets.

  3. The stopping criterion that we propose is based on the measure of the quality gain brought by each collaboration; it guarantees the convergence once the quality gain tends towards zero.

  4. The choice of the distant collaborator, which is very important and provides an optimal order of the collaborations. In the proposed algorithm we solve this problem using the Optimal Transport plan, which compares the distributions of the centroids of each site. In this way, each collaborator is able to choose the best one.

  5. The proposed algorithm stops negative collaborations, based on the measure of the quality gain of each collaboration, and updates the centroids only if the quality gain is positive. Otherwise, it moves on to another distant collaborator.

Finally, the proposed approach ensures the adaptability of working with different local models. This leads us to introduce some managerial applications of our work, such as learning management systems, where collaborative learning could offer an interaction between learners that makes them work cooperatively rather than competitively, helps to create sub-networks of collaboration where the diversity is decreased, and manages conflicts by using one-to-one collaboration. Besides, the exchange of information using the proposed algorithm preserves the privacy of each collaborator, ensures the control of the information shared with each collaborator, and filters the received information to avoid affecting the real structure of the local data. Thus, all the collaborators can explore the distributed data, which could contain some mutual information, while keeping control over the received and transmitted information.

However, the proposed algorithm still suffers from some limitations, in particular the assumption of the same dimension in every site, and the curse of high dimensionality from which Optimal Transport still suffers, which leads us to increase the penalty coefficient of the regularization in order to avoid over-fitting.

6 Conclusion

In this paper, we proposed a new framework of collaborative learning inspired by Optimal Transport theory, where the collaborators aim to increase their local quality based on the information exchanged with other learners. We explained the motivation and the intuition behind our approach and proposed a new algorithm of collaborative clustering based on the Wasserstein distance. The proposed approach allows the exchange of information between collaborators in both vertical and horizontal collaboration. The results are stable, and the process stops the collaboration for some learners when their local quality starts to decrease, which prevents a common issue of collaborative approaches.

The approach proposed in this paper is a first step towards a new family of algorithms for the collaborative learning task. We plan to develop further collaborative clustering algorithms based on the Gromov-Wasserstein distance, which allows the comparison of distributions coming from heterogeneous spaces, in order to make collaborative algorithms more flexible and to improve the quality and stability of the collaborations.

There are several perspectives to this work. In the short term, we are working on improving the approach so that the confidence coefficient is learned at each iteration, according to the diversity and the quality of the collaborators. This could be based on comparisons between the sub-sets' distributions using the Wasserstein distance. This would lead us to another extension in which the interactions between collaborators are modeled as a graph in a Wasserstein space, which would allow the construction of a theoretical proof of convergence.

Conflict of interest: Authors state no conflict of interest.

References

[1] Sotiris Kotsiantis and Panayiotis Pintelas. Recent advances in clustering: A brief survey. WSEAS Transactions on Information Science and Applications, 1(1):73–81, 2004.

[2] Junjie Wu, Hui Xiong, and Jian Chen. Adapting the right measures for k-means clustering. In SIGKDD, pages 877–886. ACM, 2009. doi:10.1145/1557019.1557115

[3] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, 1995. doi:10.3115/981658.981684

[4] Miin-Shen Yang and Kuo-Lung Wu. Unsupervised possibilistic clustering. Pattern Recognition, 39(1):5–21, 2006. doi:10.1016/j.patcog.2005.07.005

[5] James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings 1995, pages 194–202. Elsevier, 1995. doi:10.1016/B978-1-55860-377-6.50032-3

[6] Yan Yang and Hao Wang. Multi-view clustering: A survey. Big Data Mining and Analytics, 1(2):83–107, 2018. doi:10.26599/BDMA.2018.9020003

[7] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Unsupervised and semi-supervised clustering: a brief survey. A Review of Machine Learning Techniques for Processing Multimedia Content, 1:9–16, 2004.

[8] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[9] Junjie Wu, Hongfu Liu, Hui Xiong, Jie Cao, and Jian Chen. K-means-based consensus clustering: A unified view. IEEE Transactions on Knowledge and Data Engineering, 27(1):155–169, 2014. doi:10.1109/TKDE.2014.2316512

[10] Germain Forestier, Cédric Wemmert, and Pierre Gançarski. Collaborative multi-strategical classification for object-oriented image analysis. In Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications in conjunction with IbPRIA, pages 80–90, 2007. doi:10.1007/978-3-540-78981-9_4

[11] Witold Pedrycz. Collaborative fuzzy clustering. Pattern Recognition Letters, 23(14):1675–1686, 2002. doi:10.1016/S0167-8655(02)00130-7

[12] Steffen Bickel and Tobias Scheffer. Estimation of mixture models using co-em. In ECML, pages 35–46. Springer, 2005. doi:10.1007/11564096_9

[13] Guillaume Cleuziou, Matthieu Exbrayat, Lionel Martin, and Jacques-Henri Sublemontier. Cofkm: A centralized method for multiple-view clustering. In 2009 Ninth IEEE International Conference on Data Mining, pages 752–757. IEEE, 2009. doi:10.1109/ICDM.2009.138

[14] Tianming Hu, Ying Yu, Jinzhi Xiong, and Sam Yuan Sung. Maximum likelihood combination of multiple clusterings. Pattern Recognition Letters, 27(13):1457–1464, 2006. doi:10.1016/j.patrec.2006.02.013

[15] Nistor Grozavu and Younes Bennani. Topological collaborative clustering. Australian Journal of Intelligent Information Processing Systems, 12(2), 2010.

[16] Mohamad Ghassany, Nistor Grozavu, and Younes Bennani. Collaborative clustering using prototype-based techniques. International Journal of Computational Intelligence and Applications, 11(03):1250017, 2012. doi:10.1142/S1469026812500174

[17] Peter J Green. On use of the EM algorithm for penalized likelihood estimation. Journal of the Royal Statistical Society: Series B (Methodological), 52(3):443–452, 1990. doi:10.1111/j.2517-6161.1990.tb01798.x

[18] Cédric Wemmert. Classification hybride distribuée par collaboration de méthodes non supervisées. PhD thesis, Strasbourg 1, 2000.

[19] Antoine Lachaud, Nistor Grozavu, Basarab Matei, and Younès Bennani. Collaborative clustering between different topological partitions. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4111–4117. IEEE, 2017. doi:10.1109/IJCNN.2017.7966375

[20] Jérémie Sublime, Basarab Matei, Guénaël Cabanes, Nistor Grozavu, Younès Bennani, and Antoine Cornuéjols. Entropy based probabilistic collaborative clustering. Pattern Recognition, 72:144–157, 2017. doi:10.1016/j.patcog.2017.07.014

[21] Parisa Rastin, Guénaël Cabanes, Nistor Grozavu, and Younes Bennani. Collaborative clustering: How to select the optimal collaborators? In 2015 IEEE Symposium Series on Computational Intelligence, pages 787–794. IEEE, 2015. doi:10.1109/SSCI.2015.117

[22] Fatima Ezzahraa Ben Bouazza, Younès Bennani, Guénaël Cabanes, and Abdelfettah Touzani. Collaborative clustering through optimal transport. In International Conference on Artificial Neural Networks, pages 873–885. Springer, 2020. doi:10.1007/978-3-030-61616-8_70

[23] Jérémie Sublime, Guénaël Cabanes, and Basarab Matei. Study on the influence of diversity and quality in entropy based collaborative clustering. Entropy, 21(10):951, 2019. doi:10.3390/e21100951

[24] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[25] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale, 1781.

[26] Leonid V Kantorovich. On the translocation of masses. Journal of Mathematical Sciences, 133(4):1381–1382, 2006. doi:10.1007/s10958-006-0049-2

[27] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[28] Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In ECML, pages 274–289. Springer, 2014. doi:10.1007/978-3-662-44848-9_18

[29] Marco Cuturi and David Avis. Ground metric learning. The Journal of Machine Learning Research, 15(1):533–564, 2014.

[30] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In ICML, pages 685–693, 2014.

[31] Nhat Ho, Xuan Long Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via Wasserstein means. In ICML, pages 1501–1509, 2017.

[32] Fatima Ezzahraa Ben Bouazza, Younès Bennani, Mourad El Hamri, Guénaël Cabanes, Basarab Matei, and Abdelfettah Touzani. Multi-view clustering through optimal transport. Aust. J. Intell. Inf. Process. Syst., 15(3):1–9, 2019.

[33] K Schwarzschild. Sitzungsberichte preuss. Akad. Wiss, 424, 1916.

[34] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33, 2020.

[35] Arthur Mensch and Gabriel Peyré. Online Sinkhorn: Optimal transport distances from sample streams. Advances in Neural Information Processing Systems, 33, 2020.

[36] David L Davies and Donald W Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):224–227, 1979. doi:10.1109/TPAMI.1979.4766909

[37] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.

[38] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987. doi:10.1016/0377-0427(87)90125-7

[39] Douglas Steinley. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386, 2004. doi:10.1037/1082-989X.9.3.386

[40] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

Received: 2020-07-01
Accepted: 2020-12-06
Published Online: 2021-06-07

© 2021 Fatima-Ezzahraa Ben-Bouazza et al., published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
