In class imbalance learning problems, the key focus is how to better recognize examples from the minority class, since misclassifying them is usually more important and costly than misclassifying examples of the majority class. Quite a few ensemble solutions have been proposed in the literature, with varying degrees of success. It is generally believed that diversity in an ensemble can help to improve performance in class imbalance learning. However, no study has investigated diversity in depth, in terms of its definitions and effects, in the context of class imbalance learning. It remains unclear whether diversity has a similar or different impact on the performance of the minority and majority classes.
In this paper, we aim to gain a deeper understanding of if and when ensemble diversity has a positive impact on the classification of imbalanced data sets. First, we explain when and why diversity, measured by the Q-statistic, can bring improved overall accuracy, based on two classification patterns proposed by Kuncheva et al. We define and give insights into good and bad patterns in imbalanced scenarios. Then, the pattern analysis is extended to single-class performance measures, including recall, precision, and F-measure, which are widely used in class imbalance learning. Six different situations of diversity's impact on these measures are obtained through theoretical analysis.
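For concreteness, the pairwise Q-statistic used above follows Kuncheva's definition over the joint correct/incorrect counts of two classifiers. A minimal sketch (the function name and input representation are our own illustrative choices, not taken from the paper):

```python
def q_statistic(correct_i, correct_k):
    """Pairwise Q-statistic between two classifiers, given boolean
    vectors marking which examples each classifier got right."""
    n11 = n00 = n10 = n01 = 0
    for ci, ck in zip(correct_i, correct_k):
        if ci and ck:
            n11 += 1          # both correct
        elif not ci and not ck:
            n00 += 1          # both wrong
        elif ci:
            n10 += 1          # only classifier i correct
        else:
            n01 += 1          # only classifier k correct
    # Q lies in [-1, 1]: 0 means statistical independence; negative
    # values mean the two classifiers tend to err on different examples
    # (i.e., they are more diverse).
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Two classifiers that err on different examples yield a negative Q.
a = [True, True, False, True, False, True]
b = [False, True, True, True, True, False]
```

Note that the denominator vanishes for degenerate count patterns (e.g., one classifier correct everywhere); a robust implementation would guard against that case.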
Finally, to further understand how diversity affects single-class performance and overall performance in class imbalance problems, we carry out extensive experimental studies on both artificial data sets and real-world benchmarks with highly skewed class distributions. We find strong correlations between diversity and the discussed performance measures. Diversity shows a positive impact on the minority class in general. It is also beneficial to the overall performance in terms of AUC and G-mean.

Relationships between Diversity of Classification Ensembles and Single-Class Performance Measures
In a typical imbalanced data set with two classes, one class is heavily under-represented compared to the other, which contains a relatively large number of examples. Class imbalance pervasively exists in many real-world applications, such as medical diagnosis, fraud detection, risk management, and text classification. Rare cases in these domains suffer from higher misclassification costs than common cases. Class imbalance learning is a promising research area that has been drawing more and more attention in data mining and machine learning, since many standard machine learning algorithms have been reported to be less effective when dealing with this kind of problem. The fundamental issue to be resolved is that they tend to ignore or overfit the minority class. Hence, great research effort has been devoted to developing learning models that can predict rare cases more accurately and thereby reduce the total risk.

The difference among individual learners is interpreted as "diversity" in ensemble learning. It has been shown to be one of the main reasons for the success of ensembles, from both theoretical and empirical perspectives. To date, existing studies have discussed the relationship between diversity and overall accuracy. In class imbalance cases, however, overall accuracy is inappropriate and less meaningful.
There is no agreed definition of diversity. Quite a few pairwise and non-pairwise diversity measures have been proposed in the literature, such as the Q-statistic, the double-fault measure, entropy, and generalized diversity.

The attractive features of ensembles have led to a variety of ensemble methods for handling imbalanced data sets at the data and algorithm levels. At the data level, sampling strategies are integrated into the training of each ensemble member. For instance, Li's BEV and Chan and Stolfo's combining model were proposed based on the idea of Bagging: they undersample the majority class and combine the sampled examples with all the minority class examples to form balanced training subsets. SMOTEBoost and DataBoost-IM were designed to alter the imbalanced distribution based on Boosting. Other work takes the classification characteristics of class imbalance learning into account and provides some insight into the class imbalance problem from the view of base learning algorithms, such as decision trees and neural networks. Skewed class distributions and different misclassification costs make the classification difficulty manifest mainly as overfitting to the minority class and overgeneralization to the majority class, because the small class contributes less to the classifier.
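The undersampling-based Bagging idea described above can be sketched as follows. This is a simplified illustration in the spirit of BEV and Chan and Stolfo's combining model, not their exact algorithms; the helper names and the plurality-vote combination are our own assumptions:

```python
import random
from collections import Counter

def under_bagging_subsets(majority, minority, n_learners, seed=0):
    """Form n_learners balanced training subsets: each combines a
    random majority-class sample (without replacement) of the same
    size as the minority class with ALL minority-class examples."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_learners):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sampled + list(minority))
    return subsets

def majority_vote(predictions):
    """Combine the member classifiers' predictions for one example
    by plurality vote."""
    return Counter(predictions).most_common(1)[0][0]
```

One base learner would then be trained on each balanced subset, and their predictions combined by `majority_vote`; because every member sees a different majority-class sample, the members naturally differ, which is one source of the diversity studied in this paper.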