
Using Crucio SMOTEENN for balancing data

Vladimir Stojoc

May 18, 2021 • 2 min read

This article covers SMOTEENN, a combination of two widely used algorithms for handling imbalanced datasets: SMOTE for oversampling and ENN for undersampling.

I talked about SMOTE in one of my previous articles, so you should check it out first, because here I will focus more on the ENN algorithm.

How does ENN work?

ENN (Edited Nearest Neighbor) is an undersampling technique that looks for noise in the data. It computes the k nearest neighbors (KNN) of every minority class example and, based on them, decides whether there is an imposter among those samples. If there is, that example is deleted.
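To make the editing step more concrete, here is a minimal pure-Python sketch of the classic ENN rule: a sample is dropped when most of its k nearest neighbors carry a different label. This is only an illustration and not Crucio's internal code, which may differ in its details. The function name `edited_nearest_neighbours`, the `edit_labels` parameter, and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3, edit_labels=None):
    """Return the indices of the samples that survive ENN editing.

    A sample is removed when the majority label among its k nearest
    neighbours differs from its own label. `edit_labels` is a hypothetical
    parameter that restricts which classes may lose samples (None = all).
    """
    keep = []
    for i, (point, label) in enumerate(zip(points, labels)):
        if edit_labels is not None and label not in edit_labels:
            keep.append(i)
            continue
        # Distances from sample i to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(points)
            if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label = votes.most_common(1)[0][0]
        if majority_label == label:
            keep.append(i)
    return keep


# Toy data: two clusters, plus one class-0 point sitting inside the class-1 cluster.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.1, 5.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

kept = edited_nearest_neighbours(points, labels, k=3)
print(kept)  # the last point (index 7) is treated as noise and dropped
```

Passing `edit_labels={0}` or `edit_labels={1}` limits the cleaning to a single class, which is how one could mimic a variant that only inspects one side of the imbalance.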

After the ENN step, we simply apply SMOTE to oversample the minority class examples.
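As a reminder of what the oversampling step does, here is a small sketch of SMOTE-style interpolation in plain Python: a new point is placed at a random position on the segment between a minority example and one of its k nearest minority neighbors. The helper name `smote_samples` and the toy data are assumptions for illustration. The defaults `k=5` and `seed=45` simply mirror the defaults listed in this article, and nothing here is taken from Crucio's source code.

```python
import math
import random


def smote_samples(minority, n_new, k=5, seed=45):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (q for j, q in enumerate(minority) if j != base_idx),
            key=lambda q: math.dist(base, q),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_samples(minority, n_new=3, k=2)
for point in new_points:
    print(point)
```

Because every synthetic point is a convex combination of two real minority examples, it always lands inside the region already covered by the minority class.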

Using SMOTEENN from Crucio

If you haven’t installed Crucio yet, just type the following in the command line:

pip install crucio

Now we can import and use the algorithm:

from crucio import SMOTEENN

# df is a pandas DataFrame; 'target' is the name of its label column
enn = SMOTEENN()
new_df = enn.balance(df, 'target')

The SMOTEENN() constructor accepts the following arguments:

k (int > 0, default = 5): The number of nearest neighbors from which SMOTE will sample data points.
seed (int, default = 45): The number used to initialize the random number generator.
binary_columns (list, default = None): The list of binary columns from the data set, so sampled data is approximated to the nearest binary value.
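The binary_columns argument matters because interpolating between two examples produces fractional values, which make no sense for a 0/1 feature. Below is a tiny sketch of the rounding idea described above. The helper `snap_binary` and the column names are hypothetical, and Crucio may implement this step differently.

```python
def snap_binary(row, binary_columns):
    """Round the listed columns of a synthetic row to the nearest of 0 or 1."""
    return {
        name: (1 if value >= 0.5 else 0) if name in binary_columns else value
        for name, value in row.items()
    }


# A synthetic row produced by interpolation: 'is_flying' should be 0 or 1.
synthetic_row = {"attack": 87.4, "speed": 101.2, "is_flying": 0.73}
snapped = snap_binary(synthetic_row, binary_columns=["is_flying"])
print(snapped)  # {'attack': 87.4, 'speed': 101.2, 'is_flying': 1}
```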

The balance() method takes a pandas DataFrame and the name of the target column as parameters.

Example:

I used the same dataset as in the previous article, the one about Legendary Pokemon.

Classifying our Pokemon without any balancing gave us an accuracy of 95.8%, which is pretty good, but we can try to increase it with the SMOTEENN algorithm.

smote = SMOTEENN()
new_df = smote.balance(df, 'Legendary')

Now we get an even better accuracy of 100%. Yes, it is a small dataset, but the technique can work really well in specific situations, so I recommend remembering it and trying it the next time you see an imbalanced dataset.

And here is a little plot demonstrating how new examples were sampled.

Conclusion:

SMOTEENN is an interesting technique that combines both undersampling (using ENN) and oversampling (using SMOTE), and this combination can bring you great results if used wisely.