
Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm

  • Sarah Ghanim Mahmood Al-kababchee, Zakariya Yahya Algamal, and Omar Saber Qasim
Published/Copyright: February 16, 2023

Abstract

Data mining’s primary clustering method has several uses, including gene analysis. In a clustering study, which is an unsupervised learning problem, a set of unlabeled data is divided into clusters using data features. Data in a cluster are more comparable to one another than to those in other groups. However, the number of clusters has a direct impact on how well the K-means algorithm performs, and finding good solutions to such real-world optimization problems requires techniques that properly explore the search space. In this research, an enhancement of K-means clustering is proposed by applying an equilibrium optimization approach. The suggested approach adjusts the number of clusters while simultaneously choosing the best attributes to find the optimal solution. The findings establish the usefulness of the suggested method in comparison to existing algorithms in terms of intra-cluster distances and Rand index on five datasets: the proposed method yields smaller intra-cluster distances and higher Rand index values than the traditional methods. In conclusion, the suggested technique can be successfully employed for data clustering and can offer significant support.

Abbreviations

EOA

equilibrium optimizer algorithm

EOAK-means

equilibrium optimizer algorithm and K-means

ICD

intra-cluster distances

RI

Rand index

CVK-means

cross-validation K-means

1 Introduction

Data clustering, which is the classification of an unlabeled dataset into clusters of comparable items, is one of the most crucial and extensively used data analysis techniques [1,2]. One cluster is made up of objects that are comparable to each other but not to objects from other clusters [3,4]. Clustering has been applied in various fields of science and engineering, including web mining, text mining, image processing, stock prediction, signal processing, biology, and others [5,6]. In general, clustering techniques can be categorized into hierarchical algorithms and partitional algorithms. A well-known class of partitional clustering algorithms is the K-means method, which is simple to construct and efficient in most applications [7,8]. However, the effectiveness of the K-means algorithm hinges on determining K, the number of clusters, and on the initial centroids, which can cause it to become trapped in local optima [9,10,11].

In recent years, the problem of grouping high-dimensional data has become apparent. High-dimensional data clustering is the process of doing cluster analyses on datasets that have a few dozen to thousands of dimensions. High-dimensional data, on the other hand, presents unique obstacles for clustering algorithms that necessitate specific solutions [12]. Standard similarity measurements, as utilized in traditional clustering techniques, are frequently not useful with high-dimensional data [13,14].

In order to solve classification problems, one of the key strategies is feature selection, which relies on selecting a subset of the initial data that significantly affects the clustering process. Many datasets contain misleading and redundant characteristics, which take a long time to evaluate during an exhaustive search of the solution space and confuse the classification process. Keeping only a subset of essential characteristics while eliminating unnecessary features can both enhance clustering accuracy and increase computational efficiency [15,16]. A number of nature-inspired and swarm intelligence-based algorithms, such as the genetic algorithm [17], bat algorithm [18], particle swarm optimization [19], gray wolf optimization [20], and grasshopper algorithm [21], have been introduced over the past few years. Studies on these optimization techniques have shown encouraging results, and they can be used to handle a number of optimization issues, including clustering problems. In 2019, Wu et al. presented a standardized dimensionality reduction model, which combines the K-means clustering algorithm with linear trace ratio analysis to find an effective projection direction. In contrast to existing dimensionality reduction strategies, the proposed model is suitable for supervised, semi-supervised, and unsupervised applications, and an effective and detailed optimization technique is presented for determining the optimal projection matrix W [22]. In 2020, Chen et al. proposed a hybrid algorithm that combines the K-means clustering algorithm with a quantum-inspired ant lion optimization algorithm, combining the advantages of quantum computing and swarm intelligence to improve the clustering algorithm. The proposed algorithm was tested on many standard datasets, from which it was concluded that it can be used efficiently for data clustering and intrusion detection [23]. In 2021, Al-Thanoon et al. improved the binary crow search algorithm (BCSA) for selecting features in big data; the proposed improvement uses the concept of opposition-based learning to define the flight length parameter. The experimental results show that the features selected under the proposed modification are more representative and improve both classification accuracy and computation time; in general, the proposed algorithm outperformed the traditional and other well-known algorithms on the two datasets [15]. Also in 2021, Al-kababchee et al. proposed a binary bat algorithm with penalized regression-based clustering (BPRBC). The experimental results and statistical analysis on three chemical datasets demonstrated that the proposed BPRBC achieves better performance than PRBC and K-means in terms of purity and computational time [24].

The most important problem in cluster analysis is finding the number of clusters. Therefore, in our proposal, the equilibrium optimizer algorithm is used to find an appropriate number of clusters. Because we work with high-dimensional data, the same algorithm is also used to select the most informative features: the clustering algorithm must handle a large amount of data, and distance computation becomes challenging in high dimensions, so our aim was to adapt the clustering algorithm to address these challenges.

In this article, we discuss the use of a hybrid algorithm for cluster analysis that is based on the equilibrium optimizer algorithm [31] and the K-means algorithm [28]. The suggested method’s effectiveness has been evaluated on a number of real, standard datasets from the UCI repository [36], and the outcomes have been contrasted with those of alternative methods. In short, the main contributions can be listed as:

  • An effective strategy is suggested based on the improvement of K-means clustering through the use of an equilibrium optimization strategy.

  • A thorough methodology for feature selection is given that fully exploits feature interaction and prior information while capturing high-order connectivity between features using K-means clustering.

  • K-means clustering is improved by optimally adjusting the number of clusters.

The remainder of the article is structured as follows: A brief introduction to clustering issues and the K-means algorithm is given in Section 2. Section 3 presents the equilibrium optimizer algorithm, and Section 4 describes our suggested algorithm for resolving data clustering issues using the equilibrium optimizer and the K-means algorithm. Section 5 discusses experimental findings and comparisons to other available techniques. Finally, Section 6 summarizes this study’s findings and offers suggestions for further research.

2 K-means clustering

The process of grouping a set of data objects into clusters is known as data clustering, and it is one of the most essential and common data analysis approaches [25]; objects in the same cluster are comparable to one another but different from objects in other clusters [4,26].

Let $S = [X_1, X_2, \dots, X_K]$ be a collection of clusters and $R = [Y_1, Y_2, \dots, Y_N]$ be a set of data items that need to be clustered, where $Y_i \in \mathbb{R}^D$. During clustering, each data point in set $R$ is assigned to one of the $K$ clusters so as to minimize the intra-cluster variance objective function, the sum of squared Euclidean distances between each object $Y_i$ and its cluster center $X_j$. The objective function is the following [27]:

(1) $F(X, Y) = \sum_{i=1}^{N} \min \{ \| Y_i - X_j \|^2 \}, \quad j = 1, 2, \dots, K,$

subject to the constraints

$X_j \neq \phi, \quad \forall j \in \{1, 2, \dots, K\},$

$X_i \cap X_j = \phi, \quad i \neq j \ \text{and} \ i, j \in \{1, 2, \dots, K\},$

$\bigcup_{j=1}^{K} X_j = R.$

Real-valued data vectors of continuous data are divided into a preset number of clusters by the partitional clustering algorithm K-means [28]. Consider a partition $P_C$ of a dataset with $N$ data patterns (each data pattern is represented by a vector $x_j \in \mathbb{R}^m$, where $j = 1, 2, \dots, N$) into $C$ clusters, where $C$ is an algorithm input parameter. The centroid vector $g_c \in \mathbb{R}^m$ (where $c = 1, 2, \dots, C$) represents each cluster.

Clusters are produced in K-means using a dissimilarity measure, the Euclidean distance (equation (2)). At each iteration (until a maximum number of iterations $t_{\max}^{K\text{-means}}$ is reached or another halting condition is met), a new cluster centroid vector is created for each cluster as the mean of its current data vectors (i.e., the data patterns currently assigned to the cluster). The new partition is then created, with each pattern being assigned to the cluster with the closest centroid [29].

(2) $d(x_j, g_c) = \sqrt{ \sum_{k=1}^{m} (x_{jk} - g_{ck})^2 },$

where

(3) $g_c = \frac{1}{N_c} \sum_{x_l \in c} x_l .$

Here, $N_c$ is the number of patterns assigned to cluster $c$. The criterion function for K-means is the within-cluster sum of distances given in equation (4) [29].

(4) $J(P_C) = \sum_{c=1}^{C} \sum_{x_j \in c} d(x_j, g_c).$

Algorithm 1 shows the pseudocode of the K-means algorithm [29]; a minimal Python sketch follows the listed steps.

  1. For $t \leftarrow 0$, choose $C$ patterns at random as the initial cluster centroids $g_c$, $c = 1, 2, \dots, C$.

  2. Assign each pattern $x_j$ to its closest cluster.

  3. Calculate the initial criterion value $J(P_C^0)$ (equation (4)).

  4. While ($t < t_{\max}^{K\text{-means}}$) do

  5. Determine new centroids: for each cluster $c$, the centroid $g_c$ is updated using equation (3).

  6. Determine the new partition: each data pattern $x_j$ is assigned to the cluster with the nearest centroid $g_c$ (equation (2)).

  7. Calculate the new criterion value $J(P_C^t)$ (equation (4)).

  8. $t \leftarrow t + 1$.

  9. End while

  10. Return the final partition $P_C^{t_{\max}^{K\text{-means}}}$.
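As a minimal illustration, the following NumPy sketch implements Algorithm 1; the function name, the random-data example, and the centroid-stability halting condition are our assumptions, not part of the original study.

import numpy as np

def k_means(X, C, t_max=100, seed=0):
    # Step 1: choose C patterns at random as the initial centroids g_c.
    rng = np.random.default_rng(seed)
    g = X[rng.choice(len(X), size=C, replace=False)].copy()
    for t in range(t_max):
        # Steps 2 and 6: assign each pattern x_j to the nearest centroid (equation (2)).
        d = np.linalg.norm(X[:, None, :] - g[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its patterns (equation (3)).
        new_g = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else g[c]
                          for c in range(C)])
        if np.allclose(new_g, g):      # extra halting condition: centroids stable
            break
        g = new_g
    # Equation (4): within-cluster sum of distances J(P_C).
    J = sum(np.linalg.norm(X[labels == c] - g[c], axis=1).sum() for c in range(C))
    return labels, g, J

# Example on random two-dimensional data.
X = np.random.default_rng(1).random((100, 2))
labels, centroids, J = k_means(X, C=3)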

3 Equilibrium optimizer (EO) algorithm

The inspiration for the EO algorithm came from a basic well-mixed dynamic mass balance on a control volume, which employs a mass balance equation to describe the concentration of a nonreactive ingredient in a control volume as a function of its various source and sink processes. The mass balance equation gives the fundamental physics for the mass conservation that applies to mass coming into, going out of, and being generated in a control volume. A first-order ordinary differential equation is used to represent the universal mass-balance equation [30], which states that the change in mass over time is equal to the mass entering the system plus the mass being generated inside the system minus the mass exiting the system [31]:

(5) $V \frac{dC}{dt} = Q C_{\mathrm{eq}} - Q C + G,$

where $Q$ is the volumetric flow rate into and out of the control volume, $V \frac{dC}{dt}$ is the rate of change of mass in the control volume, $C$ is the concentration inside the control volume $V$, $G$ is the rate of mass generation inside the control volume, and $C_{\mathrm{eq}}$ is the concentration at an equilibrium state with no generation. A stable equilibrium is attained when $V \frac{dC}{dt}$ reaches zero. It is possible to solve for $\frac{dC}{dt}$ as a function of $\frac{Q}{V}$, where $\frac{Q}{V}$ is the inverse of the residence time, also known as the turnover rate and denoted $\lambda$ in this context (i.e., $\lambda = \frac{Q}{V}$). Equation (5) can also be rearranged to solve for the concentration in the control volume $C$ as a function of time $t$.

(6) $\frac{dC}{\lambda C_{\mathrm{eq}} - \lambda C + \frac{G}{V}} = dt.$

The integration of equation (6) over time is seen in equation (7).

(7) $\int_{C_0}^{C} \frac{dC}{\lambda C_{\mathrm{eq}} - \lambda C + \frac{G}{V}} = \int_{t_0}^{t} dt.$

This results in

(8) $C = C_{\mathrm{eq}} + (C_0 - C_{\mathrm{eq}}) F + \frac{G}{\lambda V} (1 - F).$

In equation (8), F is calculated as follows:

(9) $F = \exp[-\lambda (t - t_0)],$

where $t_0$ and $C_0$, which depend on the integration interval, are the initial start time and concentration, respectively. Equation (8) can be used either to estimate the concentration in the control volume with a known turnover rate or to determine the average turnover rate by a simple linear regression when the generation rate and the other quantities are known.
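For illustration, the following short Python snippet evaluates equations (8) and (9) numerically; the values of V, Q, G, C0, and Ceq are arbitrary assumptions chosen only to show the limiting behavior, not values from the paper.

import numpy as np

V, Q, G = 2.0, 0.5, 0.1            # control volume, flow rate, generation rate (assumed)
lam = Q / V                        # turnover rate, lambda = Q / V
C0, Ceq, t0 = 1.0, 3.0, 0.0        # initial and equilibrium concentrations, start time

def concentration(t):
    F = np.exp(-lam * (t - t0))                            # equation (9)
    return Ceq + (C0 - Ceq) * F + G / (lam * V) * (1 - F)  # equation (8)

print(concentration(0.0))     # equals C0 at t = t0, since F = 1
print(concentration(1e6))     # approaches Ceq + G/(lam*V) as F -> 0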

Like most meta-heuristic algorithms, EO starts the optimization process from an initial population. The initial concentrations are constructed with uniform random initialization in the search space, according to the number of particles and dimensions:

(10) $C_i^{\mathrm{initial}} = C_{\min} + \mathrm{rand}_i (C_{\max} - C_{\min}), \quad i = 1, 2, \dots, n,$

where $C_{\min}$ and $C_{\max}$ stand for the minimum and maximum values of the dimensions, $\mathrm{rand}_i$ is a random vector in the range [0, 1], and $n$ is the total number of particles in the population. $C_i^{\mathrm{initial}}$ is the starting concentration vector of the $i$-th particle. To identify the candidates for equilibrium, particles are sorted after their fitness function is evaluated.

The algorithm finally converges to its equilibrium state, which is intended to represent the globally optimal state. Because the equilibrium state is unknown at the beginning of the optimization process, only equilibrium candidates are chosen to create a search pattern for the particles. With fewer than four candidates, the approach performs worse on multimodal and composition functions but better on unimodal functions; with more than four candidates, the outcome is reversed. The equilibrium pool is therefore created using five particles, proposed as the equilibrium candidates: the four best particles found so far plus their average.

(11) $C_{\mathrm{eq,pool}} = \{ C_{\mathrm{eq}(1)}, C_{\mathrm{eq}(2)}, C_{\mathrm{eq}(3)}, C_{\mathrm{eq}(4)}, C_{\mathrm{eq(ave)}} \}.$

In each iteration, each particle updates its concentration using a candidate selected at random from this pool, with all candidates having the same probability of being selected.

The exponential term $F$ comes next, contributing to the main concentration updating rule. A precise definition of this term helps EO strike a fair balance between exploration and exploitation. Because the turnover rate in a real control volume can change with time, $\lambda$ is assumed to be a random vector in the range [0, 1]:

(12) $F = e^{-\lambda (t - t_0)},$

where time $t$ is a function of the iteration counter (Iter) and therefore decreases as the number of iterations grows:

(13) $t = \left( 1 - \frac{\mathrm{Iter}}{\mathrm{Max\_iter}} \right)^{ a_2 \frac{\mathrm{Iter}}{\mathrm{Max\_iter}} },$

where $a_2$ is a constant used to control exploitation ability, and Iter and Max_iter represent the current and maximum number of iterations, respectively. In order to ensure convergence by slowing down the search speed, as well as to enhance the algorithm’s capability for exploration and exploitation, $t_0$ is defined as

(14) $t_0 = \frac{1}{\lambda} \ln \left( -a_1 \, \mathrm{sign}(r - 0.5) \left[ 1 - e^{-\lambda t} \right] \right) + t,$

where $a_1$ is a fixed value that governs the exploration capacity: the larger $a_1$, the stronger the exploration and the weaker the exploitation.

Equation (15) shows the revised version of equation (12) with the substitution of equation (14) in equation (12).

(15) $F = a_1 \, \mathrm{sign}(r - 0.5) \left[ e^{-\lambda t} - 1 \right].$

One of the most crucial components of the algorithm, which improves the solution by boosting the exploitation phase, is the generation rate. The following general model defines the generation rate as a first-order exponential decay process:

(16) $G = G_0 e^{-k (t - t_0)},$

where $G_0$ denotes the initial value and $k$ denotes the decay constant. Setting $k = \lambda$ gives the final set of generation rate formulae:

(17) $G = G_0 e^{-\lambda (t - t_0)} = G_0 F,$

where

(18) $G_0 = \mathrm{GCP} \, (C_{\mathrm{eq}} - \lambda C),$

(19) $\mathrm{GCP} = \begin{cases} 0.5 \, r_1, & r_2 \geq \mathrm{GP} \\ 0, & r_2 < \mathrm{GP}, \end{cases}$

where the random numbers $r_1$ and $r_2$ are in the range [0, 1], and the GCP vector is created by repeating the same value obtained from equation (19). GCP, which incorporates the potential contribution of the generation term to the updating process, is described as the generation rate control parameter. Another term, called the generation probability (GP), determines the chance of this contribution, i.e., how many particles employ the generation term to update their states. Equations (18) and (19) determine the mechanism of this contribution, which is applied to each particle individually. For instance, if GCP = 0 for a particle, then $G = 0$ and no generation rate term is used when updating that particle’s dimensions. A fair balance between exploration and exploitation is achieved when GP = 0.5. The final updating rule of the EO is as follows:

(20) $C = C_{\mathrm{eq}} + (C - C_{\mathrm{eq}}) F + \frac{G}{\lambda V} (1 - F),$

where $V$ is regarded as a unit and $F$ is specified in equation (15).

In equation (20), the first term represents the equilibrium concentration, and the second and third terms represent the variations in concentration. The second term is in charge of searching the space globally to choose a promising point; it contributes more to exploration because of the large variations in concentration it produces. The third term contributes to exploitation by increasing the solution’s accuracy around an identified point. The generation rate term, which controls the concentration variations through equation (17), is more exploitative. The second and third terms can have the same or opposite signs, depending on factors such as the concentrations of the particles and equilibrium candidates as well as the turnover rate $\lambda$. The same sign makes the variance large, which improves searching over the entire domain, while opposite signs make it small, which improves local search.

These variations are controlled by the generation rate terms (equations (17)–(19)). Because $\lambda$ takes a different value in each dimension, the large variance affects only the dimensions with small $\lambda$ values. It is important to note that this feature functions similarly to an evolutionary algorithm’s mutation operator and significantly aids EO in exploiting the solutions.

The detailed pseudocode of EO is presented below; a compact Python sketch follows it.

Initialize the particle population $C_i$, $i = 1, \dots, n$ (equation (10))
Assign a large value to the fitness of each equilibrium candidate
Set the free parameters $a_1 = 2$, $a_2 = 1$, and GP = 0.5
While Iter < Max_iter
  For $i = 1$ to the number of particles ($n$)
    Calculate the fitness of particle $i$, fit($C_i$)
    If fit($C_i$) < fit($C_{eq1}$)
      Replace $C_{eq1}$ with $C_i$ and fit($C_{eq1}$) with fit($C_i$)
    Else if fit($C_i$) > fit($C_{eq1}$) & fit($C_i$) < fit($C_{eq2}$)
      Replace $C_{eq2}$ with $C_i$ and fit($C_{eq2}$) with fit($C_i$)
    Else if fit($C_i$) > fit($C_{eq1}$) & fit($C_i$) > fit($C_{eq2}$) & fit($C_i$) < fit($C_{eq3}$)
      Replace $C_{eq3}$ with $C_i$ and fit($C_{eq3}$) with fit($C_i$)
    Else if fit($C_i$) > fit($C_{eq1}$) & fit($C_i$) > fit($C_{eq2}$) & fit($C_i$) > fit($C_{eq3}$) & fit($C_i$) < fit($C_{eq4}$)
      Replace $C_{eq4}$ with $C_i$ and fit($C_{eq4}$) with fit($C_i$)
    End if
  End for
  $C_{ave} = (C_{eq1} + C_{eq2} + C_{eq3} + C_{eq4}) / 4$
  Construct the equilibrium pool $C_{\mathrm{eq,pool}} = \{ C_{\mathrm{eq}(1)}, C_{\mathrm{eq}(2)}, C_{\mathrm{eq}(3)}, C_{\mathrm{eq}(4)}, C_{\mathrm{eq(ave)}} \}$
  Accomplish memory saving (if Iter > 1)
  Assign $t = (1 - \mathrm{Iter}/\mathrm{Max\_iter})^{a_2 \, \mathrm{Iter}/\mathrm{Max\_iter}}$ (equation (13))
  For $i = 1$ to the number of particles ($n$)
    Select one candidate at random from the equilibrium pool (vector)
    Generate the random vectors $\lambda$ and $r$ in [0, 1] used in equation (15)
    Construct $F = a_1 \, \mathrm{sign}(r - 0.5) [e^{-\lambda t} - 1]$ (equation (15))
    Construct GCP by equation (19)
    Construct $G_0 = \mathrm{GCP}(C_{\mathrm{eq}} - \lambda C)$ (equation (18))
    Construct $G = G_0 e^{-\lambda (t - t_0)} = G_0 F$ (equation (17))
    Update concentrations $C = C_{\mathrm{eq}} + (C - C_{\mathrm{eq}}) F + \frac{G}{\lambda V}(1 - F)$ (equation (20))
  End for
  Iter = Iter + 1
End while
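The following compact Python sketch mirrors the pseudocode above under the stated parameter choices (a1 = 2, a2 = 1, GP = 0.5). It is an illustrative implementation, not the authors' code: the candidate update is simplified to keeping the four best particles found so far, the memory-saving step is omitted, and the bounds clipping is our addition.

import numpy as np

def equilibrium_optimizer(fitness, dim, bounds, n=30, max_iter=150, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    C = lo + rng.random((n, dim)) * (hi - lo)             # equation (10)
    a1, a2, GP, V = 2.0, 1.0, 0.5, 1.0                    # free parameters
    eq = [np.zeros(dim) for _ in range(4)]                # C_eq1..C_eq4
    eq_fit = [np.inf] * 4                                 # "large" initial fitness
    for it in range(max_iter):
        for i in range(n):
            f = fitness(C[i])
            for k in range(4):                            # keep the four best so far
                if f < eq_fit[k]:
                    eq.insert(k, C[i].copy()); eq_fit.insert(k, f)
                    eq.pop(); eq_fit.pop()
                    break
        pool = eq + [np.mean(eq, axis=0)]                 # equation (11): four best + average
        t = (1 - it / max_iter) ** (a2 * it / max_iter)   # equation (13)
        for i in range(n):
            Ceq = pool[rng.integers(len(pool))]           # random equilibrium candidate
            lam = rng.random(dim) + 1e-12                 # small offset avoids division by zero
            r = rng.random(dim)
            F = a1 * np.sign(r - 0.5) * (np.exp(-lam * t) - 1)       # equation (15)
            GCP = 0.5 * rng.random() if rng.random() >= GP else 0.0  # equation (19)
            G = GCP * (Ceq - lam * C[i]) * F              # equations (17) and (18)
            C[i] = Ceq + (C[i] - Ceq) * F + G / (lam * V) * (1 - F)  # equation (20)
            C[i] = np.clip(C[i], lo, hi)                  # keep particles inside the bounds
    return eq[0], eq_fit[0]

# Example: minimize the sphere function in five dimensions.
best_x, best_f = equilibrium_optimizer(lambda x: float(np.sum(x**2)), dim=5, bounds=(-10, 10))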

4 The proposed enhancement

In K-means clustering, one element must be fixed in advance: the number of clusters, K. The effectiveness of K-means clustering is strongly influenced by the choice of K, and there have been numerous attempts in the literature to improve K-means performance through appropriate K selection, including techniques based on nature-inspired algorithms [32]. However, none of these existing procedures for choosing K attempts to select features simultaneously. Our approach in this research aims to enhance K-means clustering by optimizing the number of clusters while simultaneously incorporating feature selection. The equilibrium optimizer algorithm (EOA) is a novel meta-heuristic physics-based algorithm and is considered one of the most powerful, fast, and best-performing population-based optimization algorithms. In this study, we propose to handle feature selection and the selection of the number of clusters in K-means by EOA simultaneously. Figure 1 illustrates the solution representation.

Figure 1: Representation of the proposed solution.

Each member of the population holds a position vector composed of both quantitative values (encoding the number of clusters) and binary values that indicate the various attributes: a value of 1 means the corresponding feature is selected, and 0 means it is excluded. The following are the steps of our suggested algorithm.

Step 1: The maximum number of iterations is $t_{\max} = 150$, and the population size is $n_{\mathrm{EOA}} = 30$.

Step 2: The number of clusters, K, is randomly generated from a uniform distribution as $K \sim U(0, 10)$.

Step 3: The remaining positions, which represent the features, are generated as $U(0, 1)$.

Step 4: The fitness function is defined as the total within-cluster variance in equation (21).

(21) $\mathrm{fitness} = \min \sum_{i=1}^{N} \| O_i - C_j \|^2 .$

Step 5: The positions are updated using equation (20). The binary EOA is applied to handle feature selection. In this situation, a p-bit binary string serves as the representation of each member, and the position is updated by forcing the continuous value into a binary space using a transfer function, since the final solution can only include binary values [33]:

(22) $x^{t+1} = \begin{cases} 1, & \text{if } T(\Delta x^{t+1}) > \mathrm{rand} \\ 0, & \text{otherwise,} \end{cases}$

where $\mathrm{rand} \in [0, 1]$ is a random number and $T(x) = 1 / (1 + \exp(-x))$ is the sigmoid transfer function.

Step 6: Steps 4 and 5 are repeated until $t_{\max}$ is reached.
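As an illustration of how a single EOA position could encode both the number of clusters and the feature mask, and how equation (22) binarizes the feature part, consider the following sketch. The encoding layout and the application of the sigmoid directly to the position entries are our assumptions for illustration only.

import numpy as np

def decode_position(pos, rng):
    # First entry -> number of clusters, K ~ U(0, 10), rounded and kept at least 2.
    K = max(2, int(round(pos[0] * 10)))
    # Remaining entries -> feature bits via the sigmoid transfer function T(x)
    # and the binarization rule of equation (22).
    T = 1.0 / (1.0 + np.exp(-pos[1:]))
    mask = T > rng.random(pos.size - 1)
    return K, mask

rng = np.random.default_rng(0)
pos = rng.random(1 + 10)              # one cluster slot + ten feature slots (illustrative)
K, mask = decode_position(pos, rng)
# Equation (21): the fitness would then be the total within-cluster variance of
# K-means run with K clusters on the columns selected by mask.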

5 Results and discussion

The performance of the proposed algorithm, EOAK-means, is investigated by applying it to five different publicly available datasets. Further, the performance of the EOAK-means algorithm was compared with (1) the standard K-means algorithm (K-means) and (2) the K-means algorithm using cross-validation (CV) for selecting the optimum number of clusters (CVK-means).

Table 1 shows the brief description of the used datasets. The selected datasets vary in the number of clusters, C, dimensions, d, and number of observations, N.

Table 1

Description of used datasets

Dataset N d C
H1N1 [34] 479 2,881 2
Biodegradable [35] 1,725 41 2
MLL [36] 72 12,582 3
SRBCT [36] 83 2,308 4
E-coli [36] 336 7 8

To evaluate the effectiveness of the used algorithms, two criteria are used:

  1. Sum of intra-cluster distances (ICD) as an internal quality criterion which is defined in equation (21). Obviously, the smaller the sum of ICD, the higher the quality of the clustering algorithm [37,38].

  2. Rand index (RI) is a well-known external clustering criterion which is defined as

(23) $\mathrm{RI} = \frac{f_1 + f_4}{f_1 + f_2 + f_3 + f_4},$

where $f_1$ is the number of pairs of similar data points assigned to the same cluster in both partitions, $f_2$ is the number of pairs of dissimilar data points assigned to the same cluster, $f_3$ is the number of pairs of similar data points assigned to different clusters, and $f_4$ is the number of pairs of dissimilar data points assigned to different clusters. The RI value lies between 0 and 1, and RI = 1 indicates that the algorithm achieves perfect clustering [39,40].
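A direct implementation of equation (23) from the pairwise counts $f_1$–$f_4$ could look as follows (a sketch; the label vectors in the example are illustrative):

from itertools import combinations

def rand_index(labels_true, labels_pred):
    f1 = f2 = f3 = f4 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:       f1 += 1   # similar pair, same cluster in both
        elif not same_true and same_pred: f2 += 1   # dissimilar pair, same cluster
        elif same_true and not same_pred: f3 += 1   # similar pair, different clusters
        else:                             f4 += 1   # dissimilar pair, different clusters
    return (f1 + f4) / (f1 + f2 + f3 + f4)          # equation (23)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))   # 5 agreeing pairs out of 6 -> 0.833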

Table 2 provides a summary of the ICD values that the clustering techniques produced on the used datasets. The outcomes represent the best, worst, average, and standard deviation (SD) of the 20 achieved solutions. The RI values and their related SDs are listed in Table 3. The best outcomes are in bold font in Tables 2 and 3.

Table 2

Comparison of ICD values of different algorithms

Dataset Criterion EOAK-means K-means CVK-means
H1N1 Best 59.544 84.179 100.931
Average 65.895 91.53 108.282
Worst 74.204 98.839 115.591
SD 6.158 7.106 7.428
Biodegradable Best 71.055 95.687 112.439
Average 78.409 103.04 119.808
Worst 85.812 110.347 127.102
SD 7.666 8.614 8.936
MLL Best 61.92 86.552 103.304
Average 69.271 93.903 110.655
Worst 76.58 101.212 117.964
SD 9.801 10.749 11.071
SRBCT Best 74.488 99.12 115.872
Average 81.839 106.471 123.223
Worst 99.148 124.78 135.532
SD 22.369 23.317 23.639
E. coli Best 33.88 58.512 75.264
Average 41.231 65.863 82.615
Worst 48.54 73.172 89.924
SD 7.139 8.087 8.409
Table 3

RI criterion average for each of the three clustering methods

Dataset Criterion EOAK-means K-means CVK-means
H1N1 RI 0.94 0.887 0.853
SD 1.191 2.071 2.603
Biodegradable RI 0.957 0.904 0.881
SD 0.877 1.757 2.289
MLL RI 0.917 0.864 0.83
SD 1.168 2.048 2.58
SRBCT RI 0.884 0.841 0.817
SD 1.148 2.025 2.557
E. coli RI 0.95 0.918 0.894
SD 1.492 2.042 2.2

As can be observed from Table 2, the proposed algorithm, EOAK-means, obtained the best values of the best, average, and worst solutions for all five datasets. For the H1N1 dataset, the best, average, and worst solutions obtained by EOAK-means are 59.544, 65.895, and 74.204, respectively, which are better than those of the other algorithms. Further, it can be observed that K-means is in second place for all used datasets. Additionally, the proposed EOAK-means algorithm has the most consistent results, with the lowest SD values for all datasets compared with the other algorithms.

From Table 3, we can see that the EOAK-means algorithm achieves the highest average RI on all datasets, followed by K-means and CVK-means. This indicates that EOAK-means is able to form clusters that are close to the target clusters on average. Meanwhile, K-means obtains better performance than CVK-means.

A non-parametric statistical test, the Wilcoxon signed-rank test, is used to further demonstrate the effectiveness of the suggested algorithm. The p-values of the ICD and RI comparisons between EOAK-means and the other two clustering algorithms on the five datasets are shown in Tables 4 and 5, respectively. Tables 4 and 5 show that the EOAK-means clustering algorithm outperforms the K-means and CVK-means clustering algorithms at a significant level: the Wilcoxon signed-rank test rejects the null hypothesis that EOAK-means and each compared algorithm have equivalent performance, confirming significant differences in the performance of the clustering algorithms.

Table 4

Wilcoxon signed-rank test P-values based on ICD values

EOAK-means vs K-means CVK-means
H1N1 0.0002 0.0001
Biodegradable 0.0023 0.0002
MLL 0.0004 0.0004
SRBCT 0.0002 0.0003
E. coli 0.0027 0.0005
Table 5

Wilcoxon signed-rank test P-values based on RI values

EOAK-means vs K-means CVK-means
H1N1 0.0004 0.0001
Biodegradable 0.0022 0.0001
MLL 0.0003 0.0002
SRBCT 0.0005 0.0002
E. coli 0.0020 0.0003
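For reference, the paired comparisons in Tables 4 and 5 can be reproduced with SciPy's wilcoxon function; the two arrays below are illustrative stand-ins for the 20 per-run ICD values, not the study's actual results.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
icd_eoak = rng.normal(66, 6, size=20)      # illustrative stand-in: 20 EOAK-means runs
icd_kmeans = rng.normal(92, 7, size=20)    # illustrative stand-in: 20 K-means runs

stat, p = wilcoxon(icd_eoak, icd_kmeans)   # paired Wilcoxon signed-rank test
print(f"p-value = {p:.4f}")                # a small p rejects the equal-performance null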

6 Conclusion

In this study, an enhancement of K-means clustering is proposed by employing an equilibrium optimization algorithm. The suggested approach adjusts the number of clusters while simultaneously choosing the best attributes to find the optimal solution. Using five datasets, the proposed algorithm, EOAK-means, is contrasted with K-means and CVK-means in terms of intra-cluster distances and Rand index. The outcomes demonstrate that EOAK-means handles the Rand index and intra-cluster distances better than the other methods. Additionally, the statistical analysis based on the Wilcoxon signed-rank test has shown that our suggested algorithm performs better than the other algorithms. As a limitation, the performance of EOAK-means depends on the choice of the algorithm’s parameters. In the future, it will be possible to combine various feature selection techniques simultaneously in order to take advantage of the strengths of each technique; ensemble methods offer greater accuracy and stability than relying on a single feature selection technique alone.

Conflict of interest: Authors state no conflict of interest.

References

[1] Barbakh WA, Wu Y, Fyfe C. Non-standard parameter adaptation for exploratory data analysis. Vol. 249. Berlin, Heidelberg: Springer; 2009. doi:10.1007/978-3-642-04005-4.

[2] Berikov V. Weighted ensemble of algorithms for complex data clustering. Pattern Recognit Lett. 2014;38:99–106. doi:10.1016/j.patrec.2013.11.012.

[3] Han X, Quan L, Xiong X, Almeter M, Xiang J, Lan Y. A novel data clustering algorithm based on modified gravitational search algorithm. Eng Appl Artif Intell. 2017;61:1–7. doi:10.1016/j.engappai.2016.11.003.

[4] Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett. 2010;31(8):651–66. doi:10.1007/978-3-540-87479-9_3.

[5] Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. New York: Springer; 2006.

[6] Everitt BS, Landau S, Leese M, Stahl D. Cluster analysis. 5th ed. John Wiley; 2011. doi:10.1002/9780470977811.

[7] Nguyen T-HT, Dinh DT, Sriboonchitta S, Huynh VN. A method for K-means-like clustering of categorical data. J Ambient Intell Humanized Comput. 2019;2019:1–11. doi:10.1007/s12652-019-01445-5.

[8] Devika TJ, Ravichandran J. A clustering method combining multiple range tests and K-means. Commun Stat Theory Methods. 2021;51:1–56. doi:10.1080/03610926.2021.1872639.

[9] Das P, Das DK, Dey S. A modified Bee Colony Optimization (MBCO) and its hybridization with K-means for an application to data clustering. Appl Soft Comput. 2018;70:590–603. doi:10.1016/j.asoc.2018.05.045.

[10] Liang S, Yu H, Xiang J, Yang W, Chen X, Liu Y, et al. A new approach for data clustering using hybrid artificial bee colony algorithm. Neurocomputing. 2012;97:241–50. doi:10.1016/j.neucom.2012.04.025.

[11] Moslehi F, Haeri A. A novel feature selection approach based on clustering algorithm. J Stat Comput Simul. 2020;91(3):581–604. doi:10.1080/00949655.2020.1822358.

[12] Kriegel H-P, Kröger P, Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data. 2009;3(1):1–58. doi:10.1145/1497577.1497578.

[13] Esmin AAA, Coelho RA, Matwin S. A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data. Artif Intell Rev. 2013;44(1):23–45. doi:10.1007/s10462-013-9400-4.

[14] Steinbach M, Ertöz L, Kumar V. The challenges of clustering high dimensional data. In: New directions in statistical physics. Berlin, Heidelberg: Springer; 2004. p. 273–306. doi:10.1007/978-3-662-08968-2_16.

[15] Al-Thanoon NA, Algamal ZY, Qasim OS. Feature selection based on a crow search algorithm for big data classification. Chemom Intell Lab Syst. 2021;212:104288. doi:10.1016/j.chemolab.2021.104288.

[16] Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF. A review of unsupervised feature selection methods. Artif Intell Rev. 2019;53(2):907–48. doi:10.1007/s10462-019-09682-y.

[17] Moriyama T. Calibration of spaceborne polarimetric SAR data using a genetic algorithm. In: 2009 IEEE International Geoscience and Remote Sensing Symposium. IEEE; 2009. doi:10.1109/IGARSS.2009.5417707.

[18] Nakamura RYM, Pereira LAM, Costa KA, Rodrigues D, Papa JP, et al. Binary bat algorithm for feature selection. In: Swarm intelligence and bio-inspired computation. Elsevier; 2013. p. 225–37. doi:10.1016/B978-0-12-405163-8.00009-0.

[19] Zhang Y, Wang S, Phillips P, Ji G. Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowl Syst. 2014;64:22–31. doi:10.1016/j.knosys.2014.03.015.

[20] Medjahed SA, Ait Saadi T, Benyettou A, Ouali M. Gray wolf optimizer for hyperspectral band selection. Appl Soft Comput. 2016;40:178–86. doi:10.1016/j.asoc.2015.09.045.

[21] Mafarja M, Aljarah I, Faris H, Hammouri AI, Al-Zoubi AM, Mirjalili S. Binary grasshopper optimisation algorithm approaches for feature selection problems. Expert Syst Appl. 2019;117:267–86. doi:10.1016/j.eswa.2018.09.015.

[22] Wu T, Xiao Y, Guo M, Nie F. A general framework for dimensionality reduction of K-means clustering. J Classif. 2020;37(3):616–31. doi:10.1007/s00357-019-09342-4.

[23] Chen J, Qi X, Chen L, Chen F, Cheng G. Quantum-inspired ant lion optimized hybrid k-means for cluster analysis and intrusion detection. Knowl Syst. 2020;203:106167. doi:10.1016/j.knosys.2020.106167.

[24] Al-Kababchee SGM, Qasim OS, Algamal ZY. Improving penalized regression-based clustering model in big data. In: Journal of Physics: Conference Series. IOP Publishing; 2021. doi:10.1088/1742-6596/1897/1/012036.

[25] Han J, Kamber M. Data mining: Concepts and techniques. New York: Academic Press; 2001.

[26] Chandrasekar P, Krishnamoorthi M. BHOHS: A two stage novel algorithm for data clustering. In: 2014 International Conference on Intelligent Computing Applications; 2014. p. 138–42. doi:10.1109/ICICA.2014.38.

[27] Krishnasamy G, Kulkarni AJ, Paramesran R. A hybrid approach for data clustering based on modified cohort intelligence and K-means. Expert Syst Appl. 2014;41(13):6009–16. doi:10.1016/j.eswa.2014.03.021.

[28] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Oakland, CA, USA; 1967.

[29] Pacifico LDS, Ludermir TB. An evaluation of k-means as a local search operator in hybrid memetic group search optimization for data clustering. Nat Comput. 2021;20(3):611–36. doi:10.1007/s11047-020-09809-z.

[30] Pant P, Pant S. A review: Advances in microbial remediation of trichloroethylene (TCE). J Environ Sci. 2010;22(1):116–26. doi:10.1016/S1001-0742(09)60082-6.

[31] Faramarzi A, Heidarinejad M, Stephens B, Mirjalili S. Equilibrium optimizer: A novel optimization algorithm. Knowl Syst. 2020;191:105190. doi:10.1016/j.knosys.2019.105190.

[32] Al Radhwani AMN, Algamal ZY. Improving K-means clustering based on firefly algorithm. J Phys: Conf Ser. 2021;1897(1):012004. doi:10.1088/1742-6596/1897/1/012004.

[33] Too J, Abdullah AR, Mohd Saad N. A new quadratic binary Harris hawk optimization for feature selection. Electronics. 2019;8(10):1130. doi:10.3390/electronics8101130.

[34] Algamal ZY, Qasim MK, Lee MH, Ali H. QSAR model for predicting neuraminidase inhibitors of influenza A viruses (H1N1) based on adaptive grasshopper optimization algorithm. SAR QSAR Environ Res. 2020;31(11):803–14. doi:10.1080/1062936X.2020.1818616.

[35] Al-Fakih AM, Algamal ZY, Qasim MK. An improved opposition-based crow search algorithm for biodegradable material classification. SAR QSAR Environ Res. 2022;33(5):403–15. doi:10.1080/1062936X.2022.2064546.

[36] Blake CL, Merz CJ. UCI repository of machine learning databases; 1998.

[37] Hatamlou A. Black hole: A new heuristic optimization approach for data clustering. Inf Sci. 2013;222:175–84. doi:10.1016/j.ins.2012.08.023.

[38] Pacifico LD, Ludermir TB. An evaluation of k-means as a local search operator in hybrid memetic group search optimization for data clustering. Nat Comput. 2021;20(3):611–36. doi:10.1007/s11047-020-09809-z.

[39] Azhir E, Navimipour NJ, Hosseinzadeh M, Sharifi A, Darwesh A. An efficient automated incremental density-based algorithm for clustering and classification. Future Gener Comput Syst. 2021;114:665–78. doi:10.1016/j.future.2020.08.031.

[40] Gao Y, Wang Z, Xie J, Pan J. A new robust fuzzy c-means clustering method based on adaptive elastic distance. Knowl Syst. 2022;237:107769. doi:10.1016/j.knosys.2021.107769.

Received: 2022-09-05
Revised: 2022-12-11
Accepted: 2022-12-18
Published Online: 2023-02-16

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
