
Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm

  • Sarah Ghanim Mahmood Al-kababchee, Zakariya Yahya Algamal, and Omar Saber Qasim
Published/Copyright: February 16, 2023

Abstract

Data mining’s primary clustering method has several uses, including gene analysis. In a clustering study, which is an unsupervised learning problem, a set of unlabeled data is divided into clusters using data features. Data in a cluster are more comparable to one another than to those in other groups. However, the number of clusters has a direct impact on how well the K-means algorithm performs, and finding good solutions to such real-world optimization problems requires techniques that properly explore the search space. In this research, an enhancement of K-means clustering is proposed by applying an equilibrium optimization approach. The suggested approach adjusts the number of clusters while simultaneously choosing the best attributes to find the optimal solution. The findings establish the usefulness of the suggested method in comparison to existing algorithms in terms of intra-cluster distances and Rand index on five datasets: the proposed method yields smaller intra-cluster distances and higher Rand index values than the traditional methods. In conclusion, the suggested technique can be successfully employed for data clustering and can offer significant support.

Abbreviations

EOA

equilibrium optimizer algorithm

EOAK-means

equilibrium optimizer algorithm and K-means

ICD

intra-cluster distances

RI

Rand index

CVK-means

cross-validation K-means

1 Introduction

Data clustering, which is the classification of an unlabeled dataset into clusters of comparable items, is one of the most crucial and extensively used data analysis techniques [1,2]. One cluster is made up of objects that are comparable to each other but not to objects from other clusters [3,4]. Clustering has been applied in various fields of science and engineering, including web mining, text mining, image processing, stock prediction, signal processing, biology, and others [5,6]. In general, clustering techniques can be categorized into hierarchical algorithms and partitional algorithms. A well-known class of partitional clustering algorithms is the K-means method, which is simple to construct and efficient in most applications [7,8]. However, the effectiveness of the K-means algorithm hinges on determining K, the number of clusters, and on the initial centroids, which can cause it to become trapped in local optima [9,10,11].

In recent years, the problem of grouping high-dimensional data has become apparent. High-dimensional data clustering is the process of doing cluster analyses on datasets that have a few dozen to thousands of dimensions. High-dimensional data, on the other hand, presents unique obstacles for clustering algorithms that necessitate specific solutions [12]. Standard similarity measurements, as utilized in traditional clustering techniques, are frequently not useful with high-dimensional data [13,14].

In order to solve classification problems, one of the key strategies is feature selection, which relies on selecting a subset of the initial data that significantly affects the clustering process. Many datasets contain misleading and redundant characteristics, which take a long time to evaluate during an exhaustive search of the solution space and confuse the classification process. Keeping only a subset of essential characteristics while eliminating unnecessary features can both enhance clustering accuracy and increase computational efficiency [15,16]. A number of nature-inspired and swarm intelligence-based algorithms, such as the genetic algorithm [17], bat algorithm [18], particle swarm optimization [19], gray wolf optimization [20], and grasshopper algorithm [21], have been introduced over the past few years. Studies on these optimization techniques have shown encouraging results, and they can be used to handle a number of optimization issues, including clustering problems. In 2019, Wu et al. presented a standardized dimensionality reduction model, which combines the K-means clustering algorithm with linear trace ratio analysis to find an effective projection direction. In contrast to existing dimensionality reduction strategies, the proposed model is suitable for supervised, semi-supervised, and unsupervised applications, and an effective and detailed optimization technique is presented for determining the optimal projection matrix W [22]. In 2020, Chen et al. proposed a hybrid algorithm that combines the K-means clustering algorithm with a quantum-inspired ant lion optimization algorithm, combining the advantages of quantum computing and swarm intelligence to improve the clustering algorithm. The proposed algorithm was tested on many standard datasets, from which it was concluded that it can be used efficiently for data clustering and intrusion detection [23]. In 2021, Al-Thanoon et al. improved the binary crow search algorithm (BCSA) for selecting features in big data; the proposed improvement uses the concept of opposition-based learning to define the flight length parameter. The experimental results show that the features selected under the proposed modification are more representative and improve both classification accuracy and computation time; in general, the proposed algorithm outperformed the traditional and other well-known algorithms on the two datasets [15]. Also in 2021, Al-kababchee et al. proposed a binary bat algorithm with penalized regression-based clustering (BPRBC). The experimental results and statistical analysis on three chemical datasets demonstrated that the proposed BPRBC achieves better performance than PRBC and K-means in terms of purity and computational time [24].

The most important problem in cluster analysis is finding the number of clusters. Therefore, in our proposal, the equilibrium optimizer algorithm is used to find an appropriate number of clusters. Because we work with high-dimensional data, the same algorithm is also used to select the most informative features: the clustering algorithm must handle a large amount of data, and distance computation becomes challenging in high dimensions, so our aim was to adapt the clustering algorithm to address these challenges.

In this article, we discuss the use of a hybrid algorithm for cluster analysis that is based on the equilibrium optimizer algorithm [31] and the K-means algorithm [28]. The suggested method’s effectiveness has been evaluated on a number of real, standard datasets from the UCI repository [36], and the outcomes have been contrasted with those of alternative methods. In short, the main contributions can be listed as:

  • An effective strategy is suggested based on the improvement of K-means clustering through the use of an equilibrium optimization strategy.

  • A thorough methodology for feature selection is given that fully exploits feature interaction and prior information while capturing high-order connectivity between features using K-means clustering.

  • K-means clustering is improved by optimally adjusting the number of clusters.

The remainder of the article is structured as follows: A brief introduction to clustering issues and the K-means algorithm is given in Section 2. Section 3 presents the equilibrium optimizer algorithm, and Section 4 describes our suggested algorithm for resolving data clustering issues using the equilibrium optimizer and the K-means algorithm. Section 5 discusses experimental findings and comparisons to other available techniques. Finally, Section 6 summarizes this study’s findings and offers suggestions for further research.

2 K-means clustering

The process of grouping a set of data objects into clusters is known as data clustering, and it is one of the most essential and common data analysis approaches [25]; objects in the same cluster are comparable to one another but different from objects in other clusters [4,26].

Let $S = [X_1, X_2, \dots, X_K]$ be a collection of clusters and $R = [Y_1, Y_2, \dots, Y_N]$ be a set of data items that need to be clustered, where $Y_i \in \mathbb{R}^D$. During clustering, each data point in set $R$ is assigned to one of the $K$ clusters so as to minimize the intra-cluster variance objective function, the sum of squared Euclidean distances between each object $Y_i$ and its cluster center $X_j$. The objective function is the following [27]:

(1) $F(X, Y) = \sum_{i=1}^{N} \min \{ \| Y_i - X_j \|^2 \}, \quad j = 1, 2, \dots, K,$

subject to the constraints

$X_j \neq \phi, \quad \forall j \in \{1, 2, \dots, K\},$

$X_i \cap X_j = \phi, \quad i \neq j \ \text{and} \ i, j \in \{1, 2, \dots, K\},$

$\bigcup_{j=1}^{K} X_j = R.$

Real-valued data vectors of continuous data are divided into a preset number of clusters by the partitional clustering algorithm K-means [28]. Consider a partition $P_C$ of a dataset with $N$ data patterns (each data pattern is represented by a vector $x_j \in \mathbb{R}^m$, where $j = 1, 2, \dots, N$) into $C$ clusters, where $C$ is an algorithm input parameter. The centroid vector $g_c \in \mathbb{R}^m$ (where $c = 1, 2, \dots, C$) represents each cluster.

Clusters are produced in K-means using a dissimilarity measure, the Euclidean distance (equation (2)). At each iteration (until a maximum number of iterations $t_{\max}^{K\text{-means}}$ is reached or another halting condition is met), a new cluster centroid vector is created for each cluster as the mean of its current data vectors (i.e., the data patterns currently assigned to the cluster). The new partition is then created, with each pattern being assigned to the cluster with the closest centroid [29].

(2) $d(x_j, g_c) = \sqrt{ \sum_{k=1}^{m} (x_{jk} - g_{ck})^2 },$

where

(3) $g_c = \frac{1}{N_c} \sum_{x_l \in c} x_l .$

Here, $N_c$ is the number of patterns assigned to cluster $c$. The criterion function for K-means is the within-cluster sum of distances given in equation (4) [29].

(4) $J(P_C) = \sum_{c=1}^{C} \sum_{x_j \in c} d(x_j, g_c).$

Algorithm 1 shows the pseudocode of the K-means algorithm [29]; a minimal Python sketch follows the listed steps.

  1. For $t \leftarrow 0$, choose $C$ patterns at random as the initial cluster centroids $g_c$, $c = 1, 2, \dots, C$.

  2. Assign each pattern $x_j$ to its closest cluster.

  3. Calculate the initial criterion value $J(P_C^0)$ (equation (4)).

  4. While ($t < t_{\max}^{K\text{-means}}$) do

  5. Determine new centroids: for each cluster $c$, the centroid $g_c$ is updated using equation (3).

  6. Determine the new partition: each data pattern $x_j$ is assigned to the cluster with the nearest centroid $g_c$ (equation (2)).

  7. Calculate the new criterion value $J(P_C^t)$ (equation (4)).

  8. $t \leftarrow t + 1$.

  9. End while

  10. Return the final partition $P_C^{t_{\max}^{K\text{-means}}}$.
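As a minimal illustration, the following NumPy sketch implements Algorithm 1; the function name, the random-data example, and the centroid-stability halting condition are our assumptions, not part of the original study.

import numpy as np

def k_means(X, C, t_max=100, seed=0):
    # Step 1: choose C patterns at random as the initial centroids g_c.
    rng = np.random.default_rng(seed)
    g = X[rng.choice(len(X), size=C, replace=False)].copy()
    for t in range(t_max):
        # Steps 2 and 6: assign each pattern x_j to the nearest centroid (equation (2)).
        d = np.linalg.norm(X[:, None, :] - g[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its patterns (equation (3)).
        new_g = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else g[c]
                          for c in range(C)])
        if np.allclose(new_g, g):      # extra halting condition: centroids stable
            break
        g = new_g
    # Equation (4): within-cluster sum of distances J(P_C).
    J = sum(np.linalg.norm(X[labels == c] - g[c], axis=1).sum() for c in range(C))
    return labels, g, J

# Example on random two-dimensional data.
X = np.random.default_rng(1).random((100, 2))
labels, centroids, J = k_means(X, C=3)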

3 Equilibrium optimizer (EO) algorithm

The inspiration for the EO algorithm came from a basic well-mixed dynamic mass balance on a control volume, which employs a mass balance equation to describe the concentration of a nonreactive ingredient in a control volume as a function of its various source and sink processes. The mass balance equation gives the fundamental physics for the mass conservation that applies to mass coming into, going out of, and being generated in a control volume. A first-order ordinary differential equation is used to represent the universal mass-balance equation [30], which states that the change in mass over time is equal to the mass entering the system plus the mass being generated inside the system minus the mass exiting the system [31]:

(5) $V \frac{dC}{dt} = Q C_{\mathrm{eq}} - Q C + G,$

where $Q$ is the volumetric flow rate into and out of the control volume, $V \frac{dC}{dt}$ is the rate of change of mass in the control volume, $C$ is the concentration inside the control volume $V$, $G$ is the rate of mass generation inside the control volume, and $C_{\mathrm{eq}}$ is the concentration at an equilibrium state with no generation. A stable equilibrium is attained when $V \frac{dC}{dt}$ reaches zero. It is possible to solve for $\frac{dC}{dt}$ as a function of $\frac{Q}{V}$, where $\frac{Q}{V}$ is the inverse of the residence time, also known as the turnover rate and denoted $\lambda$ in this context (i.e., $\lambda = \frac{Q}{V}$). Equation (5) can also be rearranged to solve for the concentration in the control volume $C$ as a function of time $t$.

(6) $\frac{dC}{\lambda C_{\mathrm{eq}} - \lambda C + \frac{G}{V}} = dt.$

The integration of equation (6) over time is seen in equation (7).

(7) $\int_{C_0}^{C} \frac{dC}{\lambda C_{\mathrm{eq}} - \lambda C + \frac{G}{V}} = \int_{t_0}^{t} dt.$

This results in

(8) $C = C_{\mathrm{eq}} + (C_0 - C_{\mathrm{eq}}) F + \frac{G}{\lambda V} (1 - F).$

In equation (8), F is calculated as follows:

(9) $F = \exp[-\lambda (t - t_0)],$

where $t_0$ and $C_0$, which depend on the integration interval, are the initial start time and concentration, respectively. Equation (8) can be used either to estimate the concentration in the control volume with a known turnover rate or to determine the average turnover rate by a simple linear regression when the generation rate and the other quantities are known.
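For illustration, the following short Python snippet evaluates equations (8) and (9) numerically; the values of V, Q, G, C0, and Ceq are arbitrary assumptions chosen only to show the limiting behavior, not values from the paper.

import numpy as np

V, Q, G = 2.0, 0.5, 0.1            # control volume, flow rate, generation rate (assumed)
lam = Q / V                        # turnover rate, lambda = Q / V
C0, Ceq, t0 = 1.0, 3.0, 0.0        # initial and equilibrium concentrations, start time

def concentration(t):
    F = np.exp(-lam * (t - t0))                            # equation (9)
    return Ceq + (C0 - Ceq) * F + G / (lam * V) * (1 - F)  # equation (8)

print(concentration(0.0))     # equals C0 at t = t0, since F = 1
print(concentration(1e6))     # approaches Ceq + G/(lam*V) as F -> 0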

Like most meta-heuristic algorithms, EO starts the optimization process from an initial population. The initial concentrations are constructed with uniform random initialization in the search space, according to the number of particles and dimensions:

(10) $C_i^{\mathrm{initial}} = C_{\min} + \mathrm{rand}_i (C_{\max} - C_{\min}), \quad i = 1, 2, \dots, n,$

where $C_{\min}$ and $C_{\max}$ stand for the minimum and maximum values of the dimensions, $\mathrm{rand}_i$ is a random vector in the range [0, 1], and $n$ is the total number of particles in the population. $C_i^{\mathrm{initial}}$ is the starting concentration vector of the $i$-th particle. To identify the candidates for equilibrium, particles are sorted after their fitness function is evaluated.

The algorithm finally converges to its equilibrium state, which is intended to represent the globally optimal state. Because the equilibrium state is unknown at the beginning of the optimization process, only equilibrium candidates are chosen to create a search pattern for the particles. With fewer than four candidates, the approach performs worse on multimodal and composition functions but better on unimodal functions; with more than four candidates, the outcome is reversed. The equilibrium pool is therefore created using five particles, proposed as the equilibrium candidates: the four best particles found so far plus their average.

(11) $C_{\mathrm{eq,pool}} = \{ C_{\mathrm{eq}(1)}, C_{\mathrm{eq}(2)}, C_{\mathrm{eq}(3)}, C_{\mathrm{eq}(4)}, C_{\mathrm{eq(ave)}} \}.$

In each iteration, each particle updates its concentration using a candidate selected at random from this pool, with all candidates having the same probability of being selected.

The exponential term $F$ comes next, contributing to the main concentration updating rule. A precise definition of this term helps EO strike a fair balance between exploration and exploitation. Because the turnover rate in a real control volume can change with time, $\lambda$ is assumed to be a random vector in the range [0, 1]:

(12) $F = e^{-\lambda (t - t_0)},$

where time $t$ is a function of the iteration counter (Iter) and therefore decreases as the number of iterations grows:

(13) $t = \left( 1 - \frac{\mathrm{Iter}}{\mathrm{Max\_iter}} \right)^{ a_2 \frac{\mathrm{Iter}}{\mathrm{Max\_iter}} },$

where $a_2$ is a constant used to control exploitation ability, and Iter and Max_iter represent the current and maximum number of iterations, respectively. In order to ensure convergence by slowing down the search speed, as well as to enhance the algorithm’s capability for exploration and exploitation, $t_0$ is defined as

(14) $t_0 = \frac{1}{\lambda} \ln \left( -a_1 \, \mathrm{sign}(r - 0.5) \left[ 1 - e^{-\lambda t} \right] \right) + t,$

where $a_1$ is a fixed value that governs the exploration capacity: the larger $a_1$, the stronger the exploration and the weaker the exploitation.

Equation (15) shows the revised version of equation (12) with the substitution of equation (14) in equation (12).

(15) $F = a_1 \, \mathrm{sign}(r - 0.5) \left[ e^{-\lambda t} - 1 \right].$

One of the most crucial components of the algorithm, which improves the solution by boosting the exploitation phase, is the generation rate. The following general model defines the generation rate as a first-order exponential decay process:

(16) $G = G_0 e^{-k (t - t_0)},$

where $G_0$ denotes the initial value and $k$ denotes the decay constant. Setting $k = \lambda$ gives the final set of generation rate formulae:

(17) $G = G_0 e^{-\lambda (t - t_0)} = G_0 F,$

where

(18) $G_0 = \mathrm{GCP} \, (C_{\mathrm{eq}} - \lambda C),$

(19) $\mathrm{GCP} = \begin{cases} 0.5 \, r_1, & r_2 \geq \mathrm{GP} \\ 0, & r_2 < \mathrm{GP}, \end{cases}$

where the random numbers $r_1$ and $r_2$ are in the range [0, 1], and the GCP vector is created by repeating the same value obtained from equation (19). GCP, which incorporates the potential contribution of the generation term to the updating process, is described as the generation rate control parameter. Another term, called the generation probability (GP), determines the chance of this contribution, i.e., how many particles employ the generation term to update their states. Equations (18) and (19) determine the mechanism of this contribution, which is applied to each particle individually. For instance, if GCP = 0 for a particle, then $G = 0$ and no generation rate term is used when updating that particle’s dimensions. A fair balance between exploration and exploitation is achieved when GP = 0.5. The final updating rule of the EO is as follows:

(20) $C = C_{\mathrm{eq}} + (C - C_{\mathrm{eq}}) F + \frac{G}{\lambda V} (1 - F),$

where $V$ is regarded as a unit and $F$ is specified in equation (15).

In equation (20), the first term represents the equilibrium concentration, and the second and third terms represent the variations in concentration. The second term is in charge of searching the space globally to choose a promising point; it contributes more to exploration because of the large variations in concentration it produces. The third term contributes to exploitation by increasing the solution’s accuracy around an identified point. The generation rate term, which controls the concentration variations through equation (17), is more exploitative. The second and third terms can have the same or opposite signs, depending on factors such as the concentrations of the particles and equilibrium candidates as well as the turnover rate $\lambda$. The same sign makes the variance large, which improves searching over the entire domain, while opposite signs make it small, which improves local search.

These variations are controlled by the generation rate terms (equations (17)–(19)). Because $\lambda$ takes a different value in each dimension, the large variance affects only the dimensions with small $\lambda$ values. It is important to note that this feature functions similarly to an evolutionary algorithm’s mutation operator and significantly aids EO in exploiting the solutions.

The detailed pseudocode of EO is presented below; a compact Python sketch follows it.

Initialize the particle population $C_i$, $i = 1, \dots, n$ (equation (10))
Assign a large value to the fitness of each equilibrium candidate
Set the free parameters $a_1 = 2$, $a_2 = 1$, and GP = 0.5
While Iter < Max_iter
  For $i = 1$ to the number of particles ($n$)
    Calculate the fitness of particle $i$, fit($C_i$)
    If fit($C_i$) < fit($C_{eq1}$)
      Replace $C_{eq1}$ with $C_i$ and fit($C_{eq1}$) with fit($C_i$)
    Else if fit($C_i$) > fit($C_{eq1}$) & fit($C_i$) < fit($C_{eq2}$)
      Replace $C_{eq2}$ with $C_i$ and fit($C_{eq2}$) with fit($C_i$)
    Else if fit($C_i$) > fit($C_{eq1}$) & fit($C_i$) > fit($C_{eq2}$) & fit($C_i$) < fit($C_{eq3}$)
      Replace $C_{eq3}$ with $C_i$ and fit($C_{eq3}$) with fit($C_i$)
    Else if fit($C_i$) > fit($C_{eq1}$) & fit($C_i$) > fit($C_{eq2}$) & fit($C_i$) > fit($C_{eq3}$) & fit($C_i$) < fit($C_{eq4}$)
      Replace $C_{eq4}$ with $C_i$ and fit($C_{eq4}$) with fit($C_i$)
    End if
  End for
  $C_{ave} = (C_{eq1} + C_{eq2} + C_{eq3} + C_{eq4}) / 4$
  Construct the equilibrium pool $C_{\mathrm{eq,pool}} = \{ C_{\mathrm{eq}(1)}, C_{\mathrm{eq}(2)}, C_{\mathrm{eq}(3)}, C_{\mathrm{eq}(4)}, C_{\mathrm{eq(ave)}} \}$
  Accomplish memory saving (if Iter > 1)
  Assign $t = (1 - \mathrm{Iter}/\mathrm{Max\_iter})^{a_2 \, \mathrm{Iter}/\mathrm{Max\_iter}}$ (equation (13))
  For $i = 1$ to the number of particles ($n$)
    Select one candidate at random from the equilibrium pool (vector)
    Generate the random vectors $\lambda$ and $r$ in [0, 1] used in equation (15)
    Construct $F = a_1 \, \mathrm{sign}(r - 0.5) [e^{-\lambda t} - 1]$ (equation (15))
    Construct GCP by equation (19)
    Construct $G_0 = \mathrm{GCP}(C_{\mathrm{eq}} - \lambda C)$ (equation (18))
    Construct $G = G_0 e^{-\lambda (t - t_0)} = G_0 F$ (equation (17))
    Update concentrations $C = C_{\mathrm{eq}} + (C - C_{\mathrm{eq}}) F + \frac{G}{\lambda V}(1 - F)$ (equation (20))
  End for
  Iter = Iter + 1
End while
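The following compact Python sketch mirrors the pseudocode above under the stated parameter choices (a1 = 2, a2 = 1, GP = 0.5). It is an illustrative implementation, not the authors' code: the candidate update is simplified to keeping the four best particles found so far, the memory-saving step is omitted, and the bounds clipping is our addition.

import numpy as np

def equilibrium_optimizer(fitness, dim, bounds, n=30, max_iter=150, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    C = lo + rng.random((n, dim)) * (hi - lo)             # equation (10)
    a1, a2, GP, V = 2.0, 1.0, 0.5, 1.0                    # free parameters
    eq = [np.zeros(dim) for _ in range(4)]                # C_eq1..C_eq4
    eq_fit = [np.inf] * 4                                 # "large" initial fitness
    for it in range(max_iter):
        for i in range(n):
            f = fitness(C[i])
            for k in range(4):                            # keep the four best so far
                if f < eq_fit[k]:
                    eq.insert(k, C[i].copy()); eq_fit.insert(k, f)
                    eq.pop(); eq_fit.pop()
                    break
        pool = eq + [np.mean(eq, axis=0)]                 # equation (11): four best + average
        t = (1 - it / max_iter) ** (a2 * it / max_iter)   # equation (13)
        for i in range(n):
            Ceq = pool[rng.integers(len(pool))]           # random equilibrium candidate
            lam = rng.random(dim) + 1e-12                 # small offset avoids division by zero
            r = rng.random(dim)
            F = a1 * np.sign(r - 0.5) * (np.exp(-lam * t) - 1)       # equation (15)
            GCP = 0.5 * rng.random() if rng.random() >= GP else 0.0  # equation (19)
            G = GCP * (Ceq - lam * C[i]) * F              # equations (17) and (18)
            C[i] = Ceq + (C[i] - Ceq) * F + G / (lam * V) * (1 - F)  # equation (20)
            C[i] = np.clip(C[i], lo, hi)                  # keep particles inside the bounds
    return eq[0], eq_fit[0]

# Example: minimize the sphere function in five dimensions.
best_x, best_f = equilibrium_optimizer(lambda x: float(np.sum(x**2)), dim=5, bounds=(-10, 10))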

4 The proposed enhancement

In K-means clustering, one element must be fixed in advance: the number of clusters, K. The effectiveness of K-means clustering is strongly influenced by the choice of K, and there have been numerous attempts in the literature to improve K-means performance through appropriate K selection, including techniques based on nature-inspired algorithms [32]. However, none of these existing procedures for choosing K attempts to select features simultaneously. Our approach in this research aims to enhance K-means clustering by optimizing the number of clusters while simultaneously incorporating feature selection. The equilibrium optimizer algorithm (EOA) is a novel meta-heuristic physics-based algorithm and is considered one of the most powerful, fast, and best-performing population-based optimization algorithms. In this study, we propose to handle feature selection and the selection of the number of clusters in K-means by EOA simultaneously. Figure 1 illustrates the solution representation.

Figure 1: Representation of the proposed solution.

Each member of the population holds a position vector composed of both quantitative values (encoding the number of clusters) and binary values that indicate the various attributes: a value of 1 means the corresponding feature is selected, and 0 means it is excluded. The following are the steps of our suggested algorithm.

Step 1: The maximum number of iterations is $t_{\max} = 150$, and the population size is $n_{\mathrm{EOA}} = 30$.

Step 2: The number of clusters, K, is randomly generated from a uniform distribution as $K \sim U(0, 10)$.

Step 3: The remaining positions, which represent the features, are generated as $U(0, 1)$.

Step 4: The fitness function is defined as the total within-cluster variance in equation (21).

(21) $\mathrm{fitness} = \min \sum_{i=1}^{N} \| O_i - C_j \|^2 .$

Step 5: The positions are updated using equation (20). The binary EOA is applied to handle feature selection. In this situation, a p-bit binary string serves as the representation of each member, and the position is updated by forcing the continuous value into a binary space using a transfer function, since the final solution can only include binary values [33]:

(22) $x^{t+1} = \begin{cases} 1, & \text{if } T(\Delta x^{t+1}) > \mathrm{rand} \\ 0, & \text{otherwise,} \end{cases}$

where $\mathrm{rand} \in [0, 1]$ is a random number and $T(x) = 1 / (1 + \exp(-x))$ is the sigmoid transfer function.

Step 6: Steps 4 and 5 are repeated until $t_{\max}$ is reached.
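As an illustration of how a single EOA position could encode both the number of clusters and the feature mask, and how equation (22) binarizes the feature part, consider the following sketch. The encoding layout and the application of the sigmoid directly to the position entries are our assumptions for illustration only.

import numpy as np

def decode_position(pos, rng):
    # First entry -> number of clusters, K ~ U(0, 10), rounded and kept at least 2.
    K = max(2, int(round(pos[0] * 10)))
    # Remaining entries -> feature bits via the sigmoid transfer function T(x)
    # and the binarization rule of equation (22).
    T = 1.0 / (1.0 + np.exp(-pos[1:]))
    mask = T > rng.random(pos.size - 1)
    return K, mask

rng = np.random.default_rng(0)
pos = rng.random(1 + 10)              # one cluster slot + ten feature slots (illustrative)
K, mask = decode_position(pos, rng)
# Equation (21): the fitness would then be the total within-cluster variance of
# K-means run with K clusters on the columns selected by mask.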

5 Results and discussion

The performance of the proposed algorithm, EOAK-means, is investigated by applying it to five different publicly available datasets. Further, the performance of the EOAK-means algorithm was compared with (1) the standard K-means algorithm (K-means) and (2) the K-means algorithm using cross-validation (CV) for selecting the optimum number of clusters (CVK-means).

Table 1 shows the brief description of the used datasets. The selected datasets vary in the number of clusters, C, dimensions, d, and number of observations, N.

Table 1

Description of used datasets

Dataset N d C
H1N1 [34] 479 2,881 2
Biodegradable [35] 1,725 41 2
MLL [36] 72 12,582 3
SRBCT [36] 83 2,308 4
E-coli [36] 336 7 8

To evaluate the effectiveness of the used algorithms, two criteria are used:

  1. Sum of intra-cluster distances (ICD) as an internal quality criterion which is defined in equation (21). Obviously, the smaller the sum of ICD, the higher the quality of the clustering algorithm [37,38].

  2. Rand index (RI) is a well-known external clustering criterion which is defined as

(23) $\mathrm{RI} = \frac{f_1 + f_4}{f_1 + f_2 + f_3 + f_4},$

where $f_1$ is the number of pairs of similar data points assigned to the same cluster in both partitions, $f_2$ is the number of pairs of dissimilar data points assigned to the same cluster, $f_3$ is the number of pairs of similar data points assigned to different clusters, and $f_4$ is the number of pairs of dissimilar data points assigned to different clusters. The RI value lies between 0 and 1, and RI = 1 indicates that the algorithm achieves perfect clustering [39,40].
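A direct implementation of equation (23) from the pairwise counts $f_1$–$f_4$ could look as follows (a sketch; the label vectors in the example are illustrative):

from itertools import combinations

def rand_index(labels_true, labels_pred):
    f1 = f2 = f3 = f4 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:       f1 += 1   # similar pair, same cluster in both
        elif not same_true and same_pred: f2 += 1   # dissimilar pair, same cluster
        elif same_true and not same_pred: f3 += 1   # similar pair, different clusters
        else:                             f4 += 1   # dissimilar pair, different clusters
    return (f1 + f4) / (f1 + f2 + f3 + f4)          # equation (23)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))   # 5 agreeing pairs out of 6 -> 0.833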

Table 2 provides a summary of the ICD values that the clustering techniques produced on the used datasets. The outcomes represent the best, worst, average, and standard deviation (SD) of the 20 achieved solutions. The RI values and their related SDs are listed in Table 3. The best outcomes are in bold font in Tables 2 and 3.

Table 2

Comparison of ICD values of different algorithms

Dataset Criterion EOAK-means K-means CVK-means
H1N1 Best 59.544 84.179 100.931
Average 65.895 91.53 108.282
Worst 74.204 98.839 115.591
SD 6.158 7.106 7.428
Biodegradable Best 71.055 95.687 112.439
Average 78.409 103.04 119.808
Worst 85.812 110.347 127.102
SD 7.666 8.614 8.936
MLL Best 61.92 86.552 103.304
Average 69.271 93.903 110.655
Worst 76.58 101.212 117.964
SD 9.801 10.749 11.071
SRBCT Best 74.488 99.12 115.872
Average 81.839 106.471 123.223
Worst 99.148 124.78 135.532
SD 22.369 23.317 23.639
E. coli Best 33.88 58.512 75.264
Average 41.231 65.863 82.615
Worst 48.54 73.172 89.924
SD 7.139 8.087 8.409
Table 3

RI criterion average for each of the three clustering methods

Dataset Criterion EOAK-means K-means CVK-means
H1N1 RI 0.94 0.887 0.853
SD 1.191 2.071 2.603
Biodegradable RI 0.957 0.904 0.881
SD 0.877 1.757 2.289
MLL RI 0.917 0.864 0.83
SD 1.168 2.048 2.58
SRBCT RI 0.884 0.841 0.817
SD 1.148 2.025 2.557
E. coli RI 0.95 0.918 0.894
SD 1.492 2.042 2.2

As can be observed from Table 2, the proposed algorithm, EOAK-means, obtained the best values of the best, average, and worst solutions for all five datasets. For the H1N1 dataset, the best, average, and worst solutions obtained by EOAK-means are 59.544, 65.895, and 74.204, respectively, which are better than those of the other algorithms. Further, it can be observed that K-means is in second place for all used datasets. Additionally, the proposed EOAK-means algorithm has the most consistent results, with the lowest SD values for all datasets compared with the other algorithms.

From Table 3, we can see that the EOAK-means algorithm achieves the highest average RI on all datasets, followed by K-means and CVK-means. This indicates that EOAK-means is able to form clusters that are close to the target clusters on average. Meanwhile, K-means obtains better performance than CVK-means.

A non-parametric statistical test, the Wilcoxon signed-rank test, is used to further demonstrate the effectiveness of the suggested algorithm. The p-values of the ICD and RI comparisons between EOAK-means and the other two clustering algorithms on the five datasets are shown in Tables 4 and 5, respectively. Tables 4 and 5 show that the EOAK-means clustering algorithm outperforms the K-means and CVK-means clustering algorithms at a significant level: the Wilcoxon signed-rank test rejects the null hypothesis that EOAK-means and each compared algorithm have equivalent performance, confirming significant differences in the performance of the clustering algorithms.

Table 4

Wilcoxon signed-rank test P-values based on ICD values

EOAK-means vs K-means CVK-means
H1N1 0.0002 0.0001
Biodegradable 0.0023 0.0002
MLL 0.0004 0.0004
SRBCT 0.0002 0.0003
E. coli 0.0027 0.0005
Table 5

Wilcoxon signed-rank test P-values based on RI values

EOAK-means vs K-means CVK-means
H1N1 0.0004 0.0001
Biodegradable 0.0022 0.0001
MLL 0.0003 0.0002
SRBCT 0.0005 0.0002
E. coli 0.0020 0.0003
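For reference, the paired comparisons in Tables 4 and 5 can be reproduced with SciPy's wilcoxon function; the two arrays below are illustrative stand-ins for the 20 per-run ICD values, not the study's actual results.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
icd_eoak = rng.normal(66, 6, size=20)      # illustrative stand-in: 20 EOAK-means runs
icd_kmeans = rng.normal(92, 7, size=20)    # illustrative stand-in: 20 K-means runs

stat, p = wilcoxon(icd_eoak, icd_kmeans)   # paired Wilcoxon signed-rank test
print(f"p-value = {p:.4f}")                # a small p rejects the equal-performance null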

6 Conclusion

In this study, an enhancement of K-means clustering is proposed by employing an equilibrium optimization algorithm. The suggested approach adjusts the number of clusters while simultaneously choosing the best attributes to find the optimal solution. Using five datasets, the proposed algorithm, EOAK-means, is contrasted with K-means and CVK-means in terms of intra-cluster distances and Rand index. The outcomes demonstrate that EOAK-means handles the Rand index and intra-cluster distances better than the other methods. Additionally, the statistical analysis based on the Wilcoxon signed-rank test has shown that our suggested algorithm performs better than the other algorithms. As a limitation, the performance of EOAK-means depends on the choice of the algorithm’s parameters. In the future, it will be possible to combine various feature selection techniques simultaneously in order to take advantage of the strengths of each technique; ensemble methods offer greater accuracy and stability than relying on a single feature selection technique alone.

Conflict of interest: Authors state no conflict of interest.

References

[1] Barbakh WA, Wu Y, Fyfe C. Non-standard parameter adaptation for exploratory data analysis. Vol. 249. Berlin, Heidelberg: Springer; 2009. doi:10.1007/978-3-642-04005-4.

[2] Berikov V. Weighted ensemble of algorithms for complex data clustering. Pattern Recognit Lett. 2014;38:99–106. doi:10.1016/j.patrec.2013.11.012.

[3] Han X, Quan L, Xiong X, Almeter M, Xiang J, Lan Y. A novel data clustering algorithm based on modified gravitational search algorithm. Eng Appl Artif Intell. 2017;61:1–7. doi:10.1016/j.engappai.2016.11.003.

[4] Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett. 2010;31(8):651–66. doi:10.1007/978-3-540-87479-9_3.

[5] Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. New York: Springer; 2006.

[6] Everitt BS, Landau S, Leese M, Stahl D. Cluster analysis. 5th ed. John Wiley; 2011. doi:10.1002/9780470977811.

[7] Nguyen T-HT, Dinh DT, Sriboonchitta S, Huynh VN. A method for K-means-like clustering of categorical data. J Ambient Intell Humanized Comput. 2019;2019:1–11. doi:10.1007/s12652-019-01445-5.

[8] Devika TJ, Ravichandran J. A clustering method combining multiple range tests and K-means. Commun Stat Theory Methods. 2021;51:1–56. doi:10.1080/03610926.2021.1872639.

[9] Das P, Das DK, Dey S. A modified Bee Colony Optimization (MBCO) and its hybridization with K-means for an application to data clustering. Appl Soft Comput. 2018;70:590–603. doi:10.1016/j.asoc.2018.05.045.

[10] Liang S, Yu H, Xiang J, Yang W, Chen X, Liu Y, et al. A new approach for data clustering using hybrid artificial bee colony algorithm. Neurocomputing. 2012;97:241–50. doi:10.1016/j.neucom.2012.04.025.

[11] Moslehi F, Haeri A. A novel feature selection approach based on clustering algorithm. J Stat Comput Simul. 2020;91(3):581–604. doi:10.1080/00949655.2020.1822358.

[12] Kriegel H-P, Kröger P, Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data. 2009;3(1):1–58. doi:10.1145/1497577.1497578.

[13] Esmin AAA, Coelho RA, Matwin S. A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data. Artif Intell Rev. 2013;44(1):23–45. doi:10.1007/s10462-013-9400-4.

[14] Steinbach M, Ertöz L, Kumar V. The challenges of clustering high dimensional data. In: New directions in statistical physics. Berlin, Heidelberg: Springer; 2004. p. 273–306. doi:10.1007/978-3-662-08968-2_16.

[15] Al-Thanoon NA, Algamal ZY, Qasim OS. Feature selection based on a crow search algorithm for big data classification. Chemom Intell Lab Syst. 2021;212:104288. doi:10.1016/j.chemolab.2021.104288.

[16] Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF. A review of unsupervised feature selection methods. Artif Intell Rev. 2019;53(2):907–48. doi:10.1007/s10462-019-09682-y.

[17] Moriyama T. Calibration of spaceborne polarimetric SAR data using a genetic algorithm. In: 2009 IEEE International Geoscience and Remote Sensing Symposium. IEEE; 2009. doi:10.1109/IGARSS.2009.5417707.

[18] Nakamura RYM, Pereira LAM, Costa KA, Rodrigues D, Papa JP, et al. Binary bat algorithm for feature selection. In: Swarm intelligence and bio-inspired computation. Elsevier; 2013. p. 225–37. doi:10.1016/B978-0-12-405163-8.00009-0.

[19] Zhang Y, Wang S, Phillips P, Ji G. Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowl Syst. 2014;64:22–31. doi:10.1016/j.knosys.2014.03.015.

[20] Medjahed SA, Ait Saadi T, Benyettou A, Ouali M. Gray wolf optimizer for hyperspectral band selection. Appl Soft Comput. 2016;40:178–86. doi:10.1016/j.asoc.2015.09.045.

[21] Mafarja M, Aljarah I, Faris H, Hammouri AI, Al-Zoubi AM, Mirjalili S. Binary grasshopper optimisation algorithm approaches for feature selection problems. Expert Syst Appl. 2019;117:267–86. doi:10.1016/j.eswa.2018.09.015.

[22] Wu T, Xiao Y, Guo M, Nie F. A general framework for dimensionality reduction of K-means clustering. J Classif. 2020;37(3):616–31. doi:10.1007/s00357-019-09342-4.

[23] Chen J, Qi X, Chen L, Chen F, Cheng G. Quantum-inspired ant lion optimized hybrid k-means for cluster analysis and intrusion detection. Knowl Syst. 2020;203:106167. doi:10.1016/j.knosys.2020.106167.

[24] Al-Kababchee SGM, Qasim OS, Algamal ZY. Improving penalized regression-based clustering model in big data. In: Journal of Physics: Conference Series. IOP Publishing; 2021. doi:10.1088/1742-6596/1897/1/012036.

[25] Han J, Kamber M. Data mining: Concepts and techniques. New York: Academic Press; 2001.

[26] Chandrasekar P, Krishnamoorthi M. BHOHS: A two stage novel algorithm for data clustering. In: 2014 International Conference on Intelligent Computing Applications; 2014. p. 138–42. doi:10.1109/ICICA.2014.38.

[27] Krishnasamy G, Kulkarni AJ, Paramesran R. A hybrid approach for data clustering based on modified cohort intelligence and K-means. Expert Syst Appl. 2014;41(13):6009–16. doi:10.1016/j.eswa.2014.03.021.

[28] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Oakland, CA, USA; 1967.

[29] Pacifico LDS, Ludermir TB. An evaluation of k-means as a local search operator in hybrid memetic group search optimization for data clustering. Nat Comput. 2021;20(3):611–36. doi:10.1007/s11047-020-09809-z.

[30] Pant P, Pant S. A review: Advances in microbial remediation of trichloroethylene (TCE). J Environ Sci. 2010;22(1):116–26. doi:10.1016/S1001-0742(09)60082-6.

[31] Faramarzi A, Heidarinejad M, Stephens B, Mirjalili S. Equilibrium optimizer: A novel optimization algorithm. Knowl Syst. 2020;191:105190. doi:10.1016/j.knosys.2019.105190.

[32] Al Radhwani AMN, Algamal ZY. Improving K-means clustering based on firefly algorithm. J Phys: Conf Ser. 2021;1897(1):012004. doi:10.1088/1742-6596/1897/1/012004.

[33] Too J, Abdullah AR, Mohd Saad N. A new quadratic binary Harris hawk optimization for feature selection. Electronics. 2019;8(10):1130. doi:10.3390/electronics8101130.

[34] Algamal ZY, Qasim MK, Lee MH, Ali H. QSAR model for predicting neuraminidase inhibitors of influenza A viruses (H1N1) based on adaptive grasshopper optimization algorithm. SAR QSAR Environ Res. 2020;31(11):803–14. doi:10.1080/1062936X.2020.1818616.

[35] Al-Fakih AM, Algamal ZY, Qasim MK. An improved opposition-based crow search algorithm for biodegradable material classification. SAR QSAR Environ Res. 2022;33(5):403–15. doi:10.1080/1062936X.2022.2064546.

[36] Blake CL, Merz CJ. UCI repository of machine learning databases; 1998.

[37] Hatamlou A. Black hole: A new heuristic optimization approach for data clustering. Inf Sci. 2013;222:175–84. doi:10.1016/j.ins.2012.08.023.

[38] Pacifico LD, Ludermir TB. An evaluation of k-means as a local search operator in hybrid memetic group search optimization for data clustering. Nat Comput. 2021;20(3):611–36. doi:10.1007/s11047-020-09809-z.

[39] Azhir E, Navimipour NJ, Hosseinzadeh M, Sharifi A, Darwesh A. An efficient automated incremental density-based algorithm for clustering and classification. Future Gener Comput Syst. 2021;114:665–78. doi:10.1016/j.future.2020.08.031.

[40] Gao Y, Wang Z, Xie J, Pan J. A new robust fuzzy c-means clustering method based on adaptive elastic distance. Knowl Syst. 2022;237:107769. doi:10.1016/j.knosys.2021.107769.

Received: 2022-09-05
Revised: 2022-12-11
Accepted: 2022-12-18
Published Online: 2023-02-16

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
