Generate synthetic samples using Crucio MTDF

Nowadays most data sets in industry are unbalanced, meaning that one class occurs much more frequently than the others. In such cases classifiers very often predict every sample as the most frequent class. To solve this problem, we at Sigmoid decided to create a package that would implement all oversampling methods. We named it Crucio, and in this article I will tell you about MTDF (Mega-Trend Diffusion Function).

How does MTDF work?

MTDF uses a common diffusion function to diffuse a set of data. It starts by computing the hset parameter of the minority class (the set of diffusion coefficients) using the following formulas.

Then the Uset is computed.

Next, we must find the lower and upper skewness of the data set. The lower skewness is the ratio of the minority samples in the data, while the upper skewness is the ratio of the majority samples in the data.
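As a minimal sketch of the two ratios described above (not Crucio's internal code), they can be computed directly from the label column. The function name, the toy labels, and the binary-target assumption are ours, for illustration only.

```python
import numpy as np

def skewness_ratios(y, minority_label):
    """Return (lower, upper) skewness as class ratios for a binary target."""
    y = np.asarray(y)
    lower = float(np.mean(y == minority_label))  # share of minority samples
    upper = 1.0 - lower                          # share of majority samples (binary case)
    return lower, upper

# Hypothetical toy labels: 2 minority samples out of 10.
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
lower_skew, upper_skew = skewness_ratios(labels, minority_label=1)
print(lower_skew, upper_skew)
```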
The next step is to compute the a and b vectors: a is the per-feature minimum vector of the data divided by 10, and b is the per-feature maximum vector multiplied by 10.
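The initial bounds can be sketched in a few lines of NumPy. This is only an illustration of the rule as stated here, with a hypothetical feature matrix X; it is not taken from Crucio's source.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([
    [2.0, 50.0],
    [4.0, 80.0],
    [3.0, 65.0],
])

a = X.min(axis=0) / 10   # per-feature minimum divided by 10
b = X.max(axis=0) * 10   # per-feature maximum multiplied by 10

print(a)  # initial lower bounds, one value per feature
print(b)  # initial upper bounds, one value per feature
```

Note that dividing and multiplying by 10 widens the range only for positive feature values; for a negative minimum, dividing by 10 moves the bound toward zero instead.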

Then a and b are updated by the following formula, where phi is the cumulative distribution function (CDF).

Every new sample is a random sample from a uniform distribution between a and b.
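The sampling step can be sketched as follows. The bounds a and b here are placeholder values (in MTDF they would be the updated bounds from the previous step), and the seed and the number of new samples are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility of the sketch

# Placeholder bounds, one value per feature (assumed, not computed by MTDF here).
a = np.array([0.2, 5.0])
b = np.array([40.0, 800.0])
n_new = 5  # hypothetical number of synthetic samples to generate

# Each row is one synthetic sample; each feature is drawn uniformly from [a_j, b_j).
new_samples = rng.uniform(low=a, high=b, size=(n_new, a.size))
print(new_samples.shape)
```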

Using Crucio MTDF.

If you haven't installed Crucio yet, type the following in the terminal:

pip install crucio

Now we can import the algorithm and create the MTDF balancer.

from crucio import MTDF

# df is a pandas data frame that contains the 'target' column
mtdf = MTDF()
# returns the balanced data frame
balanced_df = mtdf.balance(df, 'target')

The MTDF constructor has only 2 parameters:

  • binary_columns (list, default=None) : if set, the listed columns that are present in the passed data frame are converted back to binary ones after the new points are generated.
  • seed (int, default = 42) : seed for random number generation.

The balance function takes only 2 parameters:

  • df (pd.DataFrame) : the pandas data frame that should be balanced.
  • target (str) : the name of the column from the data frame that should be balanced.

Now let's apply it to a data set. We will use the Pokemon dataset. The target column (Legendary) is unbalanced: only about 8% of the samples are legendary.
Before applying any module from Crucio, we recommend that you first split your data into train and test sets and balance only the train set. That way you will test the performance of the model only on natural data.
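One way to follow this recommendation, sketched with pandas only: split per class so that both parts keep the class proportions, then balance just the train part. The helper name stratified_split, the toy data frame, and the 25% test fraction are assumptions made for this illustration.

```python
import pandas as pd

def stratified_split(df, target, test_frac=0.25, seed=42):
    """Split df into train and test parts, sampling the test rows per class."""
    test = df.groupby(target).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)  # everything not picked for the test set
    return train, test

# Hypothetical toy data: 4 legendary rows out of 20.
toy_df = pd.DataFrame({
    "Attack": range(20),
    "Legendary": [True] * 4 + [False] * 16,
})

train_df, test_df = stratified_split(toy_df, "Legendary")
print(len(train_df), len(test_df))

# Only the train part would then be balanced, e.g. with the call from this article:
# balanced_train = MTDF().balance(train_df, 'Legendary')
# test_df stays untouched, so evaluation uses natural data only.
```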
The results before balancing and after.

Conclusion.

MTDF usually doesn't expand the vector space of the minority set much, and its computations are very fast and cheap. You may try it first, and also after applying some more complex oversampling techniques.

Made with ❤ by Sigmoid.
