Many data sets in industry are unbalanced, meaning that one class occurs much more often than the others. Classifiers trained on such data often end up predicting the most frequent class for every sample. To solve this problem, we at Sigmoid decided to create a package that implements all oversampling methods. We named it Crucio, and in this article I will tell you about ICOTE (Immune centroids over-sampling method for multi-class classification).
How does ICOTE work?
The logic behind ICOTE is very simple. It can be split into 2 parts:
- Clone generation: In ICOTE, minority samples are interpreted as white immune cells in the immune system. When a “non-self” antigen enters an organism, these cells attack it. Antigens in ICOTE are represented by the majority class samples. Usually, immune cells create many copies of themselves to attack the antigens. ICOTE uses the same idea: it generates clones of minority class samples until the numbers of minority and majority samples are the same.
- Mutant generation: If every immune cell were ineffective against the antigen, simply cloning these cells would be a waste of resources. That’s why immune cells sometimes mutate while cloning, which raises their chances of success. A similar mechanism is used in ICOTE; however, the mutation isn’t random.
The distance to every majority class sample is calculated for every clone. Using this distance, an alpha parameter is calculated for every clone as the inverse of the distance. This alpha is then used to mutate the clone.
In such a way new samples of the minority class are generated.
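The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not Crucio's actual code: the function name `icote_like_sketch` is hypothetical, and the mutation rule (shifting each clone away from the majority samples, weighted by alpha) is an assumed stand-in for the real mutation formula. Only the cloning until balance and alpha being the inverse of the distance come from the description above.

```python
import numpy as np

def icote_like_sketch(X_min, X_maj, seed=42):
    """Illustrative clone-and-mutate oversampling (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_needed = len(X_maj) - len(X_min)
    if n_needed <= 0:
        return X_min.astype(float)

    # Clone generation: copy randomly chosen minority samples
    # until both classes would have the same number of samples.
    idx = rng.integers(0, len(X_min), size=n_needed)
    clones = X_min[idx].astype(float)

    # Distance from every clone to every majority sample.
    diff = clones[:, None, :] - X_maj[None, :, :]  # (n_clones, n_maj, n_features)
    dist = np.linalg.norm(diff, axis=2)            # (n_clones, n_maj)

    # Alpha is the inverse of the distance (epsilon avoids division by zero).
    alpha = 1.0 / (dist + 1e-12)

    # ASSUMPTION: the mutation moves each clone along the alpha-weighted
    # mean direction pointing away from the majority samples.
    shift = (alpha[:, :, None] * diff).mean(axis=1)
    mutants = clones + shift

    # Original minority samples followed by the generated ones.
    return np.vstack([X_min.astype(float), mutants])
```

Because alpha is the inverse of the distance, each term `alpha * diff` is a unit vector, so in this sketch every clone moves by at most one unit. This assumes the features are already on comparable scales.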
NOTE: The Crucio implementation of ICOTE also standardizes the data, so you don’t need to do this yourself; it also brings the data back to its original form after generation.
Using Crucio ICOTE.
If you haven’t installed crucio yet, type the following in the terminal:
pip install crucio
Now we can import the algorithm and create the ICOTE balancer.
from crucio import ICOTE
icote = ICOTE()
balanced_df = icote.balance(df, 'target')
The ICOTE constructor has only 2 parameters:
- binary_columns (list, default=None) : if set, the listed columns that are present in the passed data frame will be transformed back to binary ones after generating new points.
- seed (int, default = 42) : seed for random number generation.
The balance function takes only 2 parameters:
- df (pd.DataFrame) : the pandas data frame that should be balanced.
- target (str) : the name of the column in the data frame whose classes should be balanced.
Now let’s apply it to a data set. We will use the Pokemon dataset. The target column (Legendary) is unbalanced: only 8% of the samples are legendary.
We recommend that, before applying any module from Crucio, you first split your data into train and test sets and balance only the train set. That way you test the performance of the model only on natural data.
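The recommendation above could look roughly like the sketch below. The helper `split_then_balance` is a hypothetical name, not part of Crucio. It assumes the data frame has a unique index and that the balancer exposes the `balance(df, target)` method described earlier. The split is stratified with plain pandas so that the test set keeps the natural class ratio.

```python
import pandas as pd

def split_then_balance(df, target, balancer, test_size=0.2, seed=42):
    """Split first, then oversample only the train part (hypothetical helper)."""
    # Stratified test set: sample the same fraction from every class.
    test_df = df.groupby(target, group_keys=False).sample(
        frac=test_size, random_state=seed
    )
    # Everything that was not sampled becomes the train set.
    train_df = df.drop(test_df.index)

    # Only the train set is balanced; the test set stays natural.
    balanced_train = balancer.balance(train_df, target)
    return balanced_train, test_df
```

With Crucio this would be called as `split_then_balance(df, 'Legendary', ICOTE())`, after which the model is trained on the first returned frame and evaluated on the second.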
The results before and after balancing.
We can see that ICOTE generated a very large region for the minority class. This happened because it tries not to create outliers, that is, not to generate minority samples inside the majority region; however, it sometimes fails, as we can see.
Made with ❤ by Sigmoid.