
Semi-supervised Classification Based Mixed Sampling for Imbalanced Data

  • Jianhua Zhao and Ning Liu
Published/Copyright: December 31, 2019

Abstract

In practical applications, there are large amounts of imbalanced data containing only a small number of labeled samples. In order to improve classification performance on this kind of problem, this paper proposes a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification (S2MAID), which combines semi-supervised learning, over sampling, under sampling and ensemble learning. Firstly, an under sampling algorithm, US-density, is provided to select samples with high information content from the majority class set for semi-supervised learning. Secondly, a safe semi-supervised learning method is used to label unlabeled samples and expand the labeled set. Thirdly, an over sampling algorithm, SMOTE-density, is provided to turn the imbalanced data set into a balanced one. Fourthly, an ensemble technique is used to generate a strong classifier. Finally, experiments are carried out on imbalanced data containing only a few labeled samples, simulating the semi-supervised learning process. The experimental results verify the proposed S2MAID and show that it achieves better classification performance.

1 Introduction

Imbalanced data classification is the problem where the class distribution of the training data is not balanced and the number of samples in one class is far less than that in the other. It exists widely in real life, for example in genetic testing, bad bank loans and fault diagnosis [1, 2].

In practical applications, people are more concerned about the information of the minority class, and the cost of misclassifying minority class samples is much higher than that of majority class samples. For example, if a cancer patient is diagnosed as normal, the optimal timing of treatment is delayed, which can be life-threatening for the patient; in network intrusion detection, if intrusion behavior is judged to be normal behavior, there is a potential danger of major network security incidents. Therefore, it is necessary to improve the accuracy on minority classes, and this has become an urgent problem in machine learning [3].

Researchers have proposed several different solutions to the classification of imbalanced data. At present, the method for imbalanced data classification can be divided into two levels: one is from the algorithm level [4, 5], the other is from the data processing level [6, 7].

At the algorithm level, the imbalanced data classification method mainly includes ensemble method [4], cost sensitive learning [5], and so on. At the data processing level, it includes over sampling and under sampling, improving the imbalanced data set by some mechanism to obtain a balanced data distribution [6, 7].

At present, there are many imbalanced data classification methods [8, 9, 10], most of which are based on supervised learning. However, supervised learning often needs a large number of labeled samples, and obtaining them may take a lot of manpower and material resources. Therefore, it is necessary to use semi-supervised learning for this classification problem. There are also many semi-supervised learning methods [11, 12, 13], but most of them assume that the data set is balanced. Therefore, it is very important to design semi-supervised algorithms for imbalanced data classification.

For semi-supervised classification of imbalanced data, many researchers have worked in this field and achieved a lot of results. Li et al. [14] proposed a semi-supervised classification method to alleviate the adverse effects of imbalance; it iteratively selects some unlabeled samples and adds them to the minority class to form a balanced data set. Pan et al. [15] proposed an ensemble-framework-based algorithm for imbalanced noisy graph stream classification. Frasca et al. [16] proposed a cost-sensitive neural network algorithm for imbalanced data. Hajizadeh et al. [17] proposed a semi-supervised detection method for imbalanced image data. Li et al. [18] proposed a new label matrix normalization solution for the general imbalance problem. Du and Xu [19] proposed a semi-supervised algorithm based on evidence theory and biased-SVM for imbalanced data sets with a large number of unlabeled samples.

Although there are some semi-supervised classification methods for imbalanced data, there is no consensus on which method deals with imbalanced data best. With the development of society, the social environment and many real-world problems become more and more complex, which makes it difficult and costly to collect useful labels. Research on classification algorithms for imbalanced data with few labeled samples therefore has theoretical significance and practical value.

To solve the classification problem of imbalanced data containing only a few labeled samples, we propose a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification (S2MAID), which combines semi-supervised learning, over sampling, under sampling and ensemble learning. First, the traditional under sampling algorithm is improved into US-density to select samples with high information content from the majority class set, providing data sets for semi-supervised learning. Second, a safe semi-supervised learning method is used to label the unlabeled samples and expand the labeled set. Then, the traditional over sampling algorithm is improved into SMOTE-density to turn the imbalanced data set into a balanced set. Finally, ensemble technology is used to generate a strong classifier and improve performance. Experiments are carried out on imbalanced data with a simulated semi-supervised learning process to verify the proposed algorithm.

The first part of the paper is the introduction; the second part is the related work, which presents semi-supervised learning, over sampling, under sampling and ensemble technology; the third part introduces our proposed algorithm, including US-density, SMOTE-density and S2MAID; the fourth part is the experiment; the fifth part is the conclusion.

2 Related works

2.1 Semi-supervised learning

With the rapid development of data acquisition and storage technology, it has become easier to obtain unlabeled data, but it remains relatively difficult to obtain labeled samples because of the manpower and material resources required. Semi-supervised learning [20, 21] is a kind of method between supervised learning and unsupervised learning, whose purpose is to make full use of abundant unlabeled samples to make up for the lack of labeled samples. Semi-supervised learning is divided into semi-supervised clustering and semi-supervised classification. The main aim of the latter is to study how to use unlabeled samples to help train a supervised classifier when labeled samples are insufficient. Semi-supervised classification mainly includes disagreement-based, generative, discriminative and graph-based methods [20, 21].

The disagreement-based semi-supervised classification exploits unlabeled data by using multiple classifiers; in the process of learning, the unlabeled data serves as a platform for interaction between the classifiers. The original disagreement-based algorithm was developed by Blum and Mitchell [22] in 1998. They assumed that the data set had two sufficiently redundant views meeting the following conditions: first, each set of attributes was sufficient to describe the problem; second, each attribute set was conditionally independent of the other given the class label.

The generative method [23] assumes that the samples and class labels are generated by a set of probability distributions with a certain structural relationship. From these distributions, the labeled sample set L and the unlabeled sample set U are generated.

The discriminative method [24] uses the maximum margin principle to learn a decision boundary from the labeled and unlabeled samples. The purpose of learning is to make the classification hyperplane pass through the low-density data region and to maximize the distance between the classification hyperplane and the nearest samples.

Graph-based learning [25] has been a very active direction of semi-supervised learning in recent years. The essence of the graph-based approach is label propagation.

2.2 SMOTE algorithm [26]

SMOTE is an over sampling method that changes the balance of the data set. Its purpose is to bring the numbers of majority and minority class samples into balance by increasing the number of minority class samples.

In SMOTE, for each sample x of the minority class data set, the nearest K neighboring samples are searched, and N samples are randomly selected from this nearest neighbor set, recorded as y1, y2, ..., yN. Random linear interpolation is carried out between the minority class sample x and each yj (j = 1, 2, ..., N) to construct a new sample zj. The interpolation operation is shown in Formula (1):

(1) $z_j = x + \text{randN}(0,1) \times (y_j - x), \quad j = 1, 2, \ldots, N$

Where randN(0, 1) represents a random number in (0, 1), zj represents the new sample, x represents the minority class sample, and yj represents the j-th selected neighbor of x. These new synthetic minority class samples are merged into the original data set to generate a new training set.

2.3 Under sampling algorithm [27]

The random under sampling method deletes some samples from the majority class sample set at random, but does not process the minority class sample set at all. Because of its randomness and contingency, this method easily loses important information in the majority class and harms classification performance.

Easy Ensemble algorithm randomly extracts multiple subsets from the majority class sample set, then uses each subset and the minority class sample set to form a training set to train a classifier, and finally combines the classification results of multiple classifiers.

Balance Cascade is based on supervised learning combined with Boosting. In the n-th round of training, a subset sampled from the majority class sample set is combined with the minority class sample set to train a base learner H. After training, the samples in the majority class sample set that H classifies correctly are eliminated. In the (n+1)-th round, the classifier is trained by combining a subset of the remaining majority samples with the minority class sample set. Finally, the different base learners are integrated.

KNN-NearMiss is essentially a prototype selection method, which is to select the most representative samples from the majority class sample set for training, mainly to alleviate the problem of information loss in random under-sampling.

2.4 Ensemble learning [28]

Ensemble learning obtains a number of different base classifiers by training a simple classification algorithm on the data, then combines the base classifiers into a strong classifier in some way. Ensemble learning plays an important role in the field of machine learning.

With the development of the integrated learning technology, more and more researchers introduce ensemble learning into the classification of imbalanced data, and get a lot of research results.

Galar et al. [29] developed a new ensemble construction algorithm that combines random under sampling with Boosting. Khoshgoftaar et al. [30] compared the performance of boosting and bagging techniques on imbalanced and noisy binary-class data. Ghazikhani et al. [31] proposed a two-layer online ensemble of neural network classifiers to handle class imbalance and non-stationarity.

The combination of sampling technology and ensemble learning is an effective way to solve the imbalanced data classification problem [29]. However, existing algorithms are often unable to combine the advantages of the two methods effectively. For example, the traditional over sampling technique blurs the boundary between the majority and minority classes, and it enlarges the data set, lowering classification efficiency; under sampling, in turn, may cause some valuable data to be lost. In addition, the choice of ensemble algorithm often affects the classification accuracy.

3 Our Method

In this paper, we propose a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification (S2MAID), which combines semi-supervised learning, over sampling, under sampling and ensemble learning.

In S2MAID, the data set is divided into two parts: the majority class sample set and the minority class sample set. In the majority class sample set there are a small number of labeled samples and a large number of unlabeled samples, and the same holds for the minority class sample set.

In the following, the main idea of S2MAID is first introduced; then the improved algorithms US-density and SMOTE-density are described.

3.1 The basic idea of S2MAID

In S2MAID, a safe semi-supervised learning algorithm and an improved under sampling method US-density are used to label the imbalanced unlabeled samples and expand the number of labeled samples to form a labeled sample set. The improved SMOTE algorithm, named SMOTE-density, is used to turn the imbalanced set into a balanced set. Then an ensemble learning method is used to predict on the labeled sample set after sampling. The framework of S2MAID is shown in Figure 1. The algorithm consists of three parts: data sampling, semi-supervised learning and ensemble learning. First, an improved under sampling method, US-density, is proposed to sample from the majority class set. Second, safe semi-supervised learning is used to label the unlabeled data from the minority class and the data set sampled by US-density. Next, an over sampling method, SMOTE-density, is proposed to balance the minority class set. Finally, a multiple classifier ensemble method is used to generate the final classifier. The implementation of each step is described in detail below.

Figure 1

Algorithm framework diagram

3.2 The under sampling process

In the semi-supervised imbalanced problem, the number of labeled samples is small and the class distribution is imbalanced. In order to improve accuracy, an under-sampling method, US-density, is added to the semi-supervised learning algorithm. By deleting the samples with low information content from the training set and retaining a small number of samples with high information content, the balance between the two classes is maintained and the accuracy of pre-labeling is improved.

For the majority sample set, an under-sampling method based on the data density distribution (US-density) is proposed here. Firstly, the labeled samples in the majority sample set are retained. Secondly, the unlabeled samples in the majority sample set are divided into high-density data clusters and low-density data clusters, and different resampling strategies are applied to clusters of different density to adjust the balance of the data set: samples in high-density data clusters are retained as much as possible, and samples in low-density data clusters are deleted. Finally, the data set handed to semi-supervised learning for labeling is composed of the high-density unlabeled samples of the majority sample set, the labeled samples of the majority sample set, and the minority sample set.

In the first step, the K-nearest neighbor method is used to classify the majority sample set into a noise set, a boundary set and a safety set. In the second step, a point P is selected from the boundary sample set, and a circle with P as its center and r as its radius is considered. If the number of samples in the circle is larger than N/2, where N is the threshold, the point P is deleted; otherwise it is retained. The description of US-density is shown in Algorithm 1, and a Python sketch follows the listing.

Algorithm 1

US-density

Input: The majority sample set M; Threshold N.

Output: A new sample set R

Progress:

1: The k-nearest neighbor method (k=5) is used to classify the majority sample set into noise set, boundary set and safety set.

2: Delete the noise samples and keep the safe samples.

3: A point P is selected from the boundary sample set.

4: A circle with P as its center and r as its radius is considered, and the number of samples in the circle is counted (the threshold N is set to 10 here).

5: If the number of samples in the circle is larger than N/2, delete the point P; otherwise retain it.

6: Repeat steps 3-5 until the set number of samples is reached.

7: All retained samples constitute a set R.

3.3 Semi-supervised learning process

Here, a safe semi-supervised classification algorithm is used to label the unlabeled samples and expand the labeled set. The data set for semi-supervised learning is composed of the data set R selected by Algorithm 1, the labeled samples in the majority set M, and the minority set S. The algorithm is described in Algorithm 2 [32]. Its main idea is as follows. First, each unlabeled sample is pre-labeled. Then, the sample with a pseudo-label is added to the labeled sample set for grouping, training and testing on the labeled set, and the error corresponding to each pseudo-label is calculated. At last, the pseudo-label with the lowest classification error is selected as the candidate label. The loop is executed repeatedly until the unlabeled set is empty.

Algorithm 2

safe semi-supervised classification

Input: Labeled set L; Unlabeled set U; Supervised classifier.

Output: The candidate label ex of the unlabeled sample x after the prediction

Progress:

1: While (U not empty)

2: For i=1 to m /* m denotes the number of categories */

3: Use pseudo-label Mi to mark x in U, record as xi

4: Add xi into L to form training set

5: Group L into k groups, recorded as L(1), L(2), ..., L(k)

6: For j=1 to k

7: Take L(j) as the validation set, the other k-1 groups as the training set

8: Use the training set to train a supervised classifier Cij

9: The supervised classifier Cij is used to test the validation set

10: Record the validation accuracy as acci(j)

11: EndFor

12: Calculate the average accuracy: acci=(acci(1)+acci(2)+...+acci (k))/k

13: Calculate error ei=1- acci

14: EndFor

15: Find the minimum value among e1, e2, ..., em and let ex be the corresponding pseudo-label

16: ex is used as the candidate predicted label of the sample x

17: L = L ∪ {(x, ex)}

18: Repeat step 1

3.4 SMOTE-density

The expanded sample set obtained by the semi-supervised learning above is still imbalanced, so it is necessary to use an over sampling algorithm to balance the data set [33, 34]. Here an algorithm named SMOTE-density is proposed to generate synthetic samples in the minority sample set.

SMOTE-density improves on the SMOTE algorithm. It identifies sparse samples based on sample density and selects them as seed samples [35]. Then the SMOTE method is used to generate synthetic samples between the seed samples and their k-nearest neighbors.

The specific process of the algorithm is as follows:

3.4.1 Determining the neighborhood radius

Suppose x is a sample in the minority sample set and the parameter r is the neighborhood radius of x; then the neighborhood of x is the space in which x is the center and r is the radius.

After experimental comparison, the average distance among all samples in the minority sample set is chosen as the neighborhood radius. The calculation formula is as follows.

(2) $\delta = \frac{2}{m(m-1)} \sum_{1 \le i < j \le m} D(x_i, x_j)$

Where m represents the number of samples in the minority sample set S, D denotes the Euclidean distance.

3.4.2 Calculate the density of each sample in the minority sample set

The δ-neighborhood density of a sample x is the number of samples in the δ-neighborhood of x. The density threshold is the mean of the δ-neighborhood densities of all samples. The calculation formula of the density threshold is as follows.

(3) $DT = \frac{1}{m} \sum_{i=1}^{m} Density_i$

Where m denotes the number of samples in the minority sample set S, Densityi denotes the density of δ-neighborhood of sample i.

3.4.3 Generating the seed set

If the density of sample x is greater than DT, x is a dense sample; if it is less than DT, x is a sparse sample. All sparse samples constitute the seed set of this class.

3.4.4 Generating synthetic samples

According to the idea of SMOTE algorithm, SMOTE-density algorithm synthesizes new samples between seed samples and their neighbors according to Formula (4):

(4) $p = x + \text{randN}(0,1) \times D(x, \mathit{seed})$

Where p denotes the new synthetic sample, x denotes the minority class sample, randN(0, 1) denotes a random number in (0, 1), and D(x, seed) denotes the Euclidean distance between x and the seed.

3.4.5 Forming new data set of the minority sample

Here, the synthetic samples are added into the minority sample set to form a new sample set. The process of generating the seed set mainly includes: calculating the neighborhood radius of the samples in the minority class sample set, calculating the density threshold of the class, and generating the sparse sample set as the seed set [36, 37]. The synthetic samples generated by the SMOTE-density algorithm are distributed between the sparse samples and their neighbors. The description of SMOTE-density is shown in Algorithm 3.

Algorithm 3

SMOTE-density

Input: The majority sample set M; The minority sample set S.

Output: The new data set of the minority sample

Progress:

1: Calculating the neighborhood radius according to formula (2).

2: Calculate the density threshold DT according to formula (3).

3: for i=1 to m

4: if densityi < DT

5: SSET = SSET ∪ {xi}

6: endif

7: endfor

8: Generating seed set SSET.

9: The samples in SSET are over-sampled using SMOTE method according to Formula (4).

10: Forming new data sets of the minority sample.

3.5 The integration process of classifiers

In this step, an integrated classification algorithm is used on the balanced data set. The idea is as follows: different weak classifiers are trained in turn, and the weight of each sample in the next iteration is determined according to whether the sample was classified correctly in the current training round and by its classification error [38]. Finally, the trained classifiers are fused together according to certain weights, forming a strong classifier as the final decision classifier.

4 Experiment and result analysis

4.1 Experimental data set and scheme

The experiment platform is a PC with an Intel Core2 Duo CPU at 2.0 GHz and 2.0 GB of memory, running the Windows XP operating system and the MATLAB R2009b programming environment.

The experiment is carried out on 5 data sets commonly used in the UCI database (http://archive.ics.uci.edu/ml/), which is shown in Table 1.

Table 1

The data set

data set num min/max ratio feature
ionosphere 351 126/225 1.79 34
satimage 6435 1358/5077 3.74 36
segment 2310 330/1980 6.00 19
glass 214 29/185 6.38 10
yeast 1484 163/1321 8.10 8

In Table 1, num denotes the sample size of the data set; min and max denote the sample numbers of the minority and majority classes; ratio denotes the ratio of majority class to minority class; and feature denotes the number of features of the data set. Most of the data sets in Table 1 are multi-class data sets, so it is necessary to convert them into two-class sets. Among them, the ionosphere data set is itself a two-class data set; for the satimage data set, category 3 is used as the minority class and the remaining samples form the majority class; for the segment data set, category 1 is used as the minority class and the remaining samples form the majority class; for the glass data set, category 7 is used as the minority class and the remaining samples form the majority class; for the yeast data set, category 4 is used as the minority class and the remaining samples form the majority class.

For the samples selected in Table 1, the proportion of the training set to the test set is set to 1:1, that is, the training set and the test set each contain 50% of the data set. In order to correctly calculate the classification rate, the training set is further divided into a labeled sample set and an unlabeled sample set. In the first kind of experiment, labeled samples account for 5% of the training set; in the second kind, 10%; and in the third kind, 15%.

Because of the randomness in the experiment, in order to fully verify the classification effect of the algorithm, each data set is trained 100 times and the experimental results are averaged. Before the experiment, the discrete values of all sample attributes are processed numerically, and Formula (5) is used to normalize the input data.

(5) $x_k = (x_k - x_{\min}) / (x_{\max} - x_{\min})$

Where xmin denotes the smallest value in the data set, xmax denotes the largest value, xk on the right side of the equal sign denotes the input data, and xk on the left side denotes the normalized data.

4.2 Evaluating indicator

In the classification of balanced data, accuracy is an important index to evaluate the performance of classifiers. However, this index is not reasonable for imbalanced data, because accuracy assigns the same misclassification cost to all samples.

Here, the confusion matrix shown in Table 2 is used as the evaluation index of imbalanced data. And the description of TP, TN, FN, FP is shown below.

Table 2

Confusion matrix

Category               Experimental positive class   Experimental negative class
Actual positive class  TP                            FN
Actual negative class  FP                            TN
  • True Positive (TP): Correct judgement. A positive sample is judged to be a positive one.

  • True Negative (TN): Correct judgement. A negative sample is judged to be a negative one.

  • False Negative (FN): Erroneous judgement. A positive sample is judged to be a negative one.

  • False Positive (FP): Erroneous judgement. A negative sample is judged to be a positive one.

The calculation of Precision rate is shown in Formula (6).

(6) $Precision = \frac{TP}{TP + FP}$

The calculation of Recall rate is shown in Formula (7).

(7) $Recall = \frac{TP}{TP + FN}$

The calculation of F-value is shown in Formula (8).

(8) $F\text{-}value = \frac{(1 + \lambda^2) \times Recall \times Precision}{\lambda^2 \times Recall + Precision}$

Where λ denotes the relative importance of Recall and Precision.

The calculation of G-mean is shown in Formula (9).

(9) $G\text{-}mean = \sqrt{\frac{TN}{TN + FP} \times Recall}$

Here, F-value and G-mean are used to evaluate the performance of the proposed algorithm.

4.3 Experimental data and results analysis

In order to evaluate the performance of the proposed algorithm S2MAID, it is compared with the methods of Li et al. [14], Frasca et al. [16] and Du and Xu [19], denoted A1, A2 and A3 respectively.

The experiment is carried out repeatedly, and the average values are calculated. The results are shown in Table 3 to Table 8. Tables 3, 4 and 5 show the F-value of the four algorithms, and Tables 6, 7 and 8 show the G-mean. N denotes the proportion of labeled samples and takes the values 5%, 10% and 15%.

Table 3

Comparisons of F-value for minority class data (N=5%)

data set A1 A2 A3 S2MAID
ionosphere 0.7938 0.7754 0.8034 0.8152
satimage 0.8247 0.8175 0.8421 0.8547
segment 0.9018 0.8745 0.9325 0.9541
glass 0.8217 0.8101 0.8210 0.8224
letter 0.4015 0.4715 0.5184 0.5247
Table 4

Comparisons of F-value for minority class data (N=10%)

data set A1 A2 A3 S2MAID
ionosphere 0.8117 0.7928 0.8457 0.8517
satimage 0.8280 0.8314 0.8601 0.8684
segment 0.9241 0.8987 0.9424 0.9674
glass 0.8301 0.8178 0.8324 0.8401
letter 0.4587 0.4940 0.5207 0.5598
Table 5

Comparisons of F-value for minority class data (N=15%)

data set A1 A2 A3 S2MAID
ionosphere 0.8280 0.8024 0.8737 0.8920
satimage 0.8298 0.8274 0.8587 0.8724
segment 0.9455 0.9541 0.9657 0.9742
glass 0.8301 0.8078 0.8418 0.8485
letter 0.4530 0.5235 0.5207 0.5675
Table 6

Comparison of G-mean value for the whole data (N=5%)

data set A1 A2 A3 S2MAID
ionosphere 0.7987 0.7954 0.8278 0.8314
satimage 0.9024 0.9028 0.9232 0.9374
segment 0.9510 0.9474 0.9540 0.9754
glass 0.8017 0.8217 0.8457 0.8654
letter 0.6274 0.5872 0.7012 0.7512
Table 7

Comparison of G-mean value for the whole data (N=10%)

data set A1 A2 A3 S2MAID
ionosphere 0.8025 0.8274 0.8421 0.8578
satimage 0.9134 0.9154 0.9417 0.9574
segment 0.9478 0.9325 0.9587 0.9847
glass 0.8357 0.8210 0.8354 0.8721
letter 0.6184 0.6017 0.7254 0.7412
Table 8

Comparison of G-mean value for the whole data (N=15%)

data set A1 A2 A3 S2MAID
ionosphere 0.8025 0.8341 0.88718 0.8945
satimage 0.9077 0.9201 0.9471 0.9618
segment 0.9065 0.9451 0.9641 0.9898
glass 0.8127 0.8045 0.8421 0.8814
letter 0.5814 0.5987 0.7468 0.7841

From the experimental results, it can be seen that the semi-supervised classification method S2MAID proposed in this paper has a higher recognition rate for minority classes and better stability on the whole data set than the other algorithms.

That is to say, the proposed algorithm can improve the F-value of minority classes without reducing the overall G-mean of the data set. The main reasons are as follows:

  1. The improved under sampling technique US-density reduces the size of the data set while keeping the distribution of the whole data set and selecting samples with high information content.

  2. The improved over sampling technique SMOTE-density increases the number of minority class samples and balances the data distribution.

  3. The safe semi-supervised classification algorithm increases the number of reliably labeled samples, improving classification accuracy.

  4. The ensemble algorithm integrates several weak classifiers into a strong classifier, which improves classification performance. Therefore, the proposed algorithm S2MAID achieves better classification performance than the other algorithms.

5 Conclusion

In practical applications, there are many imbalanced data sets with only a small number of labeled samples. Such problems are important but difficult to deal with. In this paper, a semi-supervised learning algorithm for imbalanced data based on US-density and SMOTE-density is proposed. The experimental results on UCI data sets are compared with existing semi-supervised classification algorithms for imbalanced sets. The results show that the proposed algorithm achieves a higher G-mean on the whole data set and a higher F-value on the minority class. Under similar circumstances, it also has better stability.

Acknowledgement

This work was supported by Shangluo Universities Key Disciplines Project, Discipline name: Mathematics; Natural Science Basic Research Plan in Shaanxi Province of China (No.2015JM6347); Science Research Plan of Shangluo University (No.14SKY026); Horizontal Project of Shangluo University (No.2018HXKY056); Science and technology innovation team construction project of Shangluo University (18scx002).

References

[1] Provost F., Fawcett T., Robust classification for imprecise environments, Mach Learn., 2001, 42(3), 203-231. DOI: 10.1023/A:1007601015854

[2] He H., Garcia E.A., Learning from Imbalanced Data, IEEE Transactions on Knowledge & Data Engineering, 2009, 21(9), 1263-1284. DOI: 10.1109/TKDE.2008.239

[3] Maldonado S., López J., Imbalanced data classification using second-order cone programming support vector machines, Pattern Recogn., 2014, 47(5), 2070-2079. DOI: 10.1016/j.patcog.2013.11.021

[4] Sun Z., Song Q., Zhu X., Sun H., Xu B., Zhou Y., A novel ensemble method for classifying imbalanced data, Pattern Recogn., 2015, 48(5), 1623-1637. DOI: 10.1016/j.patcog.2014.11.014

[5] Castro C.L., Braga A.P., Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data, IEEE Transactions on Neural Networks & Learning Systems, 2013, 24(6), 888-899. DOI: 10.1109/TNNLS.2013.2246188

[6] Barua S., Islam M.A., Yao X., Murase K., MWMOTE–Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning, IEEE Transactions on Knowledge & Data Engineering, 2014, 26(2), 405-425. DOI: 10.1109/TKDE.2012.232

[7] Ng W.W., Hu J., Yeung D.S., Yin S., Roli F., Diversified Sensitivity-Based Undersampling for Imbalance Classification Problems, IEEE T Cybernetics., 2017, 45(11), 2402-2412. DOI: 10.1109/TCYB.2014.2372060

[8] Hong X., Chen S., Harris C.J., A Kernel-Based Two-Class Classifier for Imbalanced Data Sets, IEEE T Neural Networ., 2007, 18(1), 28-41. DOI: 10.1109/TNN.2006.882812

[9] Khan S.H., Hayat M., Bennamoun M., Sohel F., Togneri R., Cost Sensitive Learning of Deep Feature Representations from Imbalanced Data, IEEE Transactions on Neural Networks & Learning Systems, 2015, 29(8), 3573-3587. DOI: 10.1109/TNNLS.2017.2732482

[10] Gao M., Hong X., Chen S., Harris C.J., A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems, Neurocomputing, 2011, 74(17), 3456-3466. DOI: 10.1016/j.neucom.2011.06.010

[11] Zhou Z.H., Li M., Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge & Data Engineering, 2005, 17(11), 1529-1541. DOI: 10.1109/TKDE.2005.186

[12] Yu Z., Lu Y., Zhang J., You J., Wong H.S., Wang Y., et al., Progressive Semisupervised Learning of Multiple Classifiers, IEEE T Cybernetics., 2018, 48(2), 689-702. DOI: 10.1109/TCYB.2017.2651114

[13] Forestier G., Cédric W., Semi-supervised learning using multiple clusterings with limited labeled data, Inform Sciences., 2016, 361-362(C), 48-65. DOI: 10.1016/j.ins.2016.04.040

[14] Li F., Yu C., Yang N., Li G., Kaveh-yazdy F., Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data, The Scientific World Journal, 2013, 1, 1903-1912. DOI: 10.1155/2013/875450

[15] Pan S., Wu J., Zhu X., Zhang C., Graph ensemble boosting for imbalanced noisy graph stream classification, IEEE T Cybernetics., 2015, 45(5), 954-968. DOI: 10.1109/TCYB.2014.2341031

[16] Frasca M., Bertoni A., Re M., Valentini G., A neural network algorithm for semi-supervised node label learning from unbalanced data, Neural Networks, 2013, 43, 84-98. DOI: 10.1016/j.neunet.2013.01.021

[17] Hajizadeh S., Núñez A., Tax D., Semi-supervised Rail Defect Detection from Imbalanced Image Data, IFAC PapersOnLine, 2016, 49(3), 78-83. DOI: 10.1016/j.ifacol.2016.07.014

[18] Li F., Li G., Yang N., Xia F., Label matrix normalization for semisupervised learning from imbalanced data, New Rev Hypermedia M., 2014, 20(1), 5-23. DOI: 10.1080/13614568.2013.846416

[19] Du L., Xu Y., Semi-supervised classification method for imbalanced data based on evidence theory, Application Research of Computers, 2018, 35(2), 342-345.

[20] Kawakita M., Kanamori T., Semi-supervised learning with density-ratio estimation, Mach Learn., 2013, 91(2), 189-209. DOI: 10.1007/s10994-013-5329-8

[21] Belkin M., Niyogi P., Semi-Supervised Learning on Riemannian Manifolds, Mach Learn., 2004, 56(1-3), 209-239. DOI: 10.1023/B:MACH.0000033120.25363.1e

[22] Blum A., Mitchell T., Combining labeled and unlabeled data with co-training, In: Conference on Computational Learning Theory, 1998, 92-100. DOI: 10.1145/279943.279962

[23] Jiang Z., Zhang S., Zeng J., A hybrid generative/discriminative method for semi-supervised classification, Knowl-Based Syst., 2013, 37(2), 137-145. DOI: 10.1016/j.knosys.2012.07.020

[24] Appice A., Guccione P., Malerba D., Transductive hyperspectral image classification: toward integrating spectral and relational features via an iterative ensemble system, Mach Learn., 2016, 103(3), 343-375. DOI: 10.1007/s10994-016-5559-7

[25] Zhuang L., Zhou Z., Yin J., Gao S., Lin Z., Ma Y., et al., Label Information Guided Graph Construction for Semi-Supervised Learning, IEEE T Image Process., 2017, 26(9), 4182-4192. DOI: 10.1109/TIP.2017.2703120

[26] Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P., SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research, 2002, 16, 321-357. DOI: 10.1613/jair.953

[27] Shalizi C.R., Rinaldo A., Consistency under sampling of exponential random graph models, The Annals of Statistics, 2013, 41(2), 508-535. DOI: 10.1214/12-AOS1044

[28] Liu Y., Yao X., Ensemble learning via negative correlation, Neural Networks, 1999, 12(10), 1399-1404. DOI: 10.1016/S0893-6080(99)00073-8

[29] Galar M., Fernández A., Barrenechea E., Herrera F., EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recogn., 2013, 46(12), 3460-3471. DOI: 10.1016/j.patcog.2013.05.006

[30] Khoshgoftaar T.M., Van Hulse J., Napolitano A., Comparing Boosting and Bagging Techniques with Noisy and Imbalanced Data, IEEE Transactions on Systems Man and Cybernetics - Part A: Systems and Humans, 2011, 41(3), 552-568. DOI: 10.1109/TSMCA.2010.2084081

[31] Ghazikhani A., Monsefi R., Yazdi H.S., Ensemble of online neural networks for non-stationary and imbalanced data streams, Neurocomputing, 2013, 122, 535-544. DOI: 10.1016/j.neucom.2013.05.003

[32] Zhao J., Liu N., Malov A., Safe semi-supervised classification algorithm combined with active learning sampling strategy, J Intell Fuzzy Syst., 2018, 35(4), 4001-4010. DOI: 10.3233/JIFS-169722

[33] Dewasurendra M., Vajravelu K., On the Method of Inverse Mapping for Solutions of Coupled Systems of Nonlinear Differential Equations Arising in Nanofluid Flow, Heat and Mass Transfer, Applied Mathematics & Nonlinear Sciences, 2018, 3, 1-14. DOI: 10.21042/AMNS.2018.1.00001

[34] Fernández-Pousa C.R., Perfect Phase-Coded Pulse Trains Generated by Talbot Effect, Applied Mathematics & Nonlinear Sciences, 2018, 3, 23-32. DOI: 10.21042/AMNS.2018.1.00003

[35] Gao W., Wang W., A Tight Neighborhood Union Condition on Fractional (G, F, N', M)-Critical Deleted Graphs, Colloq Math-Warsaw., 2017, 149, 291-298. DOI: 10.4064/cm6959-8-2016

[36] Gao W., Wang W., New Isolated Toughness Condition for Fractional (G, F, N)-Critical Graph, Colloq Math-Warsaw., 2017, 147, 55-65. DOI: 10.4064/cm6713-8-2016

[37] García-Planas M.I., Klymchuk T., Perturbation Analysis of a Matrix Differential Equation ẋ = ABx, Applied Mathematics & Nonlinear Sciences, 2018, 3, 97-104. DOI: 10.21042/AMNS.2018.1.00007

[38] Lakshminarayana G., Vajravelu K., Sucharitha G., Sreenadh S., Peristaltic Slip Flow of a Bingham Fluid in an Inclined Porous Conduit with Joule Heating, Applied Mathematics & Nonlinear Sciences, 2018, 3, 41-54. DOI: 10.21042/AMNS.2018.1.00005

Received: 2019-10-28
Accepted: 2019-11-15
Published Online: 2019-12-31

© 2019 J. Zhao and N. Liu, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
