Using Crucio SMOTE for balancing data
This article opens a new series about Crucio, our brand new Python library made with love by Sigmoid. It is named after the second Unforgivable Curse from the Harry Potter series. The library was created specifically for unbalanced data sets and offers many balancing methods that can be useful in different situations.
If you missed our first library, Kydavra, which was created for feature selection, I suggest you check it out after reading this article.
What is SMOTE?
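SMOTE (Synthetic Minority Oversampling Technique) is a very popular and simple technique for balancing data. It is based on the k-nearest neighbors (KNN) algorithm: each synthetic example is created by interpolating between a minority-class sample and one of its nearest minority-class neighbors.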
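To make the idea concrete, here is a minimal pure-Python sketch of the interpolation step. It is not Crucio's implementation; the function name `smote_sketch` and the toy points are assumptions made for illustration, and only the defaults `k=5` and `seed=45` mirror the arguments described later in this article.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=45):
    """Generate n_new synthetic rows from a list of minority-class rows.

    Illustrative only: a hypothetical helper, not the Crucio API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # 2. Find its k nearest minority neighbours (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # 3. Move a random fraction of the way from the sample to the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features per row (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Every synthetic row lies on a line segment between two real minority rows, which is why SMOTE fills in the region the minority class already occupies instead of simply duplicating points.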
Using SMOTE from Crucio
If you haven't installed Crucio yet, just type the following in the command line:
pip install crucio
Now we have to import and use our algorithm:
from crucio import SMOTE
# df is a pandas DataFrame; 'target' is the name of its label column
smote = SMOTE()
new_df = smote.balance(df, 'target')
The SMOTE() constructor accepts the following arguments:
- k (int > 0, default = 5): The number of nearest neighbors from which SMOTE will sample data points.
- seed (int, default = 45): The number used to initialize the random number generator.
- binary_columns (list, default = None): The list of binary columns in the data set, so that sampled values in those columns are rounded to the nearest binary value.
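The binary_columns argument exists because interpolating between two rows can produce a value such as 0.37 in a column that should only hold 0 or 1. The sketch below shows one way such rounding could work. It is an assumption for illustration, not Crucio's code: `snap_binary` is a hypothetical helper, it takes column positions instead of column names, and treating 0.5 as 1 is an arbitrary choice.

```python
def snap_binary(row, binary_indices):
    """Round the listed positions of an interpolated row to 0 or 1.

    Hypothetical helper for illustration; not part of Crucio.
    """
    snapped = list(row)  # copy so the input row is left untouched
    for idx in binary_indices:
        snapped[idx] = 1 if snapped[idx] >= 0.5 else 0
    return snapped


# An interpolated row: two continuous features and a 0/1 flag that came out as 0.37.
interpolated = [1.42, 60.3, 0.37]
print(snap_binary(interpolated, binary_indices=[2]))  # [1.42, 60.3, 0]
```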
The balance() method takes as parameters a pandas DataFrame and the name of the target column.
For this example I chose a data set where we have to predict the type of a Pokemon (Legendary or not). The Legendary class makes up about 8% of all samples, so it is an imbalanced data set.
The basic Random Forest algorithm gives an accuracy of approximately 88% when trained on the imbalanced data, so now it's time to test our SMOTE algorithm.
new_df = smote.balance(df, 'Legendary')
new_df is now a balanced training set. We train Random Forest on it and test on the same data that we used before balancing, and now it gives us 100% accuracy.
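With only about 8% of the samples in the Legendary class, accuracy can be hard to read on its own: a model that always predicts "not Legendary" already scores about 92%. The sketch below uses made-up labels (not the Pokemon data) to show why it may be worth checking recall on the minority class as well. The helper names `accuracy` and `recall` are illustrative.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of truly positive samples that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# 100 made-up labels with an 8% positive (minority) class.
y_true = [1] * 8 + [0] * 92
# A "model" that ignores its input and always predicts the majority class.
always_majority = [0] * 100

print(accuracy(y_true, always_majority))  # 0.92
print(recall(y_true, always_majority))    # 0.0
```

A high accuracy with zero minority recall is exactly the failure that balancing is meant to address, so comparing recall before and after balancing gives a fuller picture than accuracy alone.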
And here is a little plot demonstrating how new examples were sampled.
SMOTE is a very good technique to use when you have an unbalanced data set, so I encourage you to try it alongside other balancing methods from Crucio such as SMOTETOMEK, SMOTEENN, ADASYN, and ICOTE.
Thank you for reading!