In simple terms, an imbalanced dataset is one in which one class of the target variable has far more observations than the others. A classic example is a dataset of transactions: usually about 99% of the records are legitimate transactions, and only 1% are fraudulent.
How does SMOTENC work?
SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) is very similar to classic SMOTE but, unlike SMOTE, it can work with both numerical and categorical features.
SMOTENC slightly changes the way a new sample is generated by handling categorical features specially. Each categorical value of a newly generated sample is set to the most frequent category among the nearest neighbors used during generation. And when computing the Euclidean distance for the k-nearest-neighbor search, the algorithm adds, for every categorical feature in which two samples differ, the square of the median of the standard deviations of the numerical columns.
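To make those two tweaks concrete, here is a minimal, illustrative sketch (my own toy code, not Crucio's implementation) of the modified distance and the category vote:

```python
import numpy as np
from collections import Counter

def smotenc_distance(a_num, b_num, a_cat, b_cat, med):
    """Distance between two samples, SMOTE-NC style.

    med is the median of the standard deviations of the numeric columns;
    every categorical mismatch contributes med**2 to the squared distance.
    """
    d2 = float(np.sum((np.asarray(a_num) - np.asarray(b_num)) ** 2))
    d2 += sum(med ** 2 for x, y in zip(a_cat, b_cat) if x != y)
    return np.sqrt(d2)

def vote_category(neighbor_cats):
    """Pick the most frequent category among the nearest neighbors."""
    return Counter(neighbor_cats).most_common(1)[0][0]

# Two samples that agree on the categorical feature: plain Euclidean distance.
d_same = smotenc_distance([0.0, 0.0], [3.0, 4.0], ['red'], ['red'], med=1.0)

# The same samples with a categorical mismatch end up slightly further apart.
d_diff = smotenc_distance([0.0, 0.0], [3.0, 4.0], ['red'], ['blue'], med=1.0)

# A synthetic sample copies the majority category of its neighbors.
category = vote_category(['red', 'red', 'blue'])
```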
Using Crucio SMOTENC
If you haven't installed Crucio yet, run the following in your terminal:
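Assuming the package is published on PyPI under the name crucio, the install command would be:

```shell
pip install crucio
```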
Now we can import the algorithm and create the SMOTENC balancer.

```python
from crucio import SMOTENC

smotenc = SMOTENC()
balanced_df = smotenc.balance(df, 'target')
```
The SMOTENC() constructor can take the following arguments:
- k (int > 0, default = 5) : The number of nearest neighbors from which SMOTE will sample data points.
- seed (int, default = 45): The number used to initialize the random number generator.
- binary_columns (list, default = None): The list of binary columns from the dataset, so that sampled values in those columns are rounded to the nearest binary value.
The balance() method takes as parameters a pandas DataFrame and the name of the target column.
I chose a dataset where we have to predict whether a Pokémon is Legendary or not; the Legendary class makes up only 8% of the dataset, so it is imbalanced.
Before applying any module from Crucio, we recommend first splitting your data into train and test sets and balancing only the train set. That way, you evaluate the model's performance only on natural data.
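A minimal sketch of that split-before-balance workflow (the toy data, column names, and 80/20 split are illustrative, not from the Pokémon dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90 negatives, 10 positives (illustrative only).
df = pd.DataFrame({
    'feature': range(100),
    'target': [0] * 90 + [1] * 10,
})

# Split first, stratifying so both sets keep the original class ratio.
train, test = train_test_split(
    df, test_size=0.2, stratify=df['target'], random_state=45
)

# Balance ONLY the training set; the test set stays natural.
try:
    from crucio import SMOTENC
    smotenc = SMOTENC()
    train = smotenc.balance(train, 'target')
except Exception:
    # Crucio not installed (or the toy data is too small for it);
    # the split-before-balance pattern above is the key point.
    pass
```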
The basic Random Forest algorithm gives an accuracy of approximately 97% when trained on the imbalanced data, so now it's time to test the SMOTENC algorithm.
```python
new_df = smotenc.balance(df, 'Legendary')
```
After balancing the data, accuracy increased from 97% to 100%.
SMOTENC is a very good technique for balancing datasets when you have categorical columns and don't want to lose their predictive power. I encourage you to try it, along with many other balancing methods from Crucio such as SMOTETOMEK, SCUT, SLS, ADASYN, and ICOTE.
Thank you for reading!
Follow Sigmoid on Facebook, Instagram, and LinkedIn: