Solving categorical feature selection with Kydavra ChiSquaredSelector

Sigmoid

Sep 10, 2020 • 3 min read

As mentioned in previous articles about the Kydavra library, feature selection is a very important part of machine learning model development. Unfortunately, there is no single way to arrive at the ideal model, mostly because data comes in many different forms, and different forms call for different approaches. In this article, I would like to share a way to select categorical features using the Kydavra ChiSquaredSelector created by Sigmoid.

Using ChiSquaredSelector from Kydavra library.

As always, for those who are here mostly for the solution to their problem, here are the commands and the code:

To install kydavra, just run the following command in the terminal:

pip install kydavra

Now you can import the selector and apply it to your data set as follows:

from kydavra import ChiSquaredSelector

selector = ChiSquaredSelector()
new_columns = selector.select(df, 'target')

To test it, let’s apply it to the Heart Disease UCI dataset with a small change. Instead of keeping all features, we will drop the numerical columns, so our new dataset will consist only of the following features:

sex, cp, fbs, restecg, exang, slope, ca and thal

The algorithm I chose is SVC. Before feature selection, its cross_val_score was:

0.6691582491582491

After applying ChiSquaredSelector, the cross_val_score became:

0.8452525252525251

The selector kept the following features: sex, cp, exang, slope, ca, thal.
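A minimal sketch of how such a before/after comparison could be run with scikit-learn. The article does not state the SVC hyperparameters, the number of folds, or how the fold scores were aggregated, so the library defaults and the mean used here are assumptions, and the scores above will not necessarily be reproduced. The helper names and the heart.csv file name are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Column names as listed in the article.
ALL_CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]
SELECTED = ["sex", "cp", "exang", "slope", "ca", "thal"]


def mean_cv_score(df, feature_columns, target_column="target"):
    """Mean cross-validated score of a default SVC on the given columns.

    Assumption: default SVC settings and scikit-learn's default number of folds.
    """
    X = df[feature_columns]
    y = df[target_column]
    return cross_val_score(SVC(), X, y).mean()


def compare(csv_path):
    """Score the model on all categorical columns and on the selected subset.

    Assumption: csv_path points to a local copy of the dataset
    that contains the columns above plus a 'target' column.
    """
    df = pd.read_csv(csv_path)
    before = mean_cv_score(df, ALL_CATEGORICAL)
    after = mean_cv_score(df, SELECTED)
    return before, after
```

Calling compare("heart.csv") returns the two mean scores as a tuple, so they can be printed or logged side by side.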

So how does it work?

As with other selectors, ChiSquaredSelector was inspired by statistics, in this case by the Chi2 test. As with p-values, the Chi2 test is used to decide whether the null hypothesis can be rejected. As a reminder:

The null hypothesis is a general statement that there is no relationship between two measured phenomena (in other words, features).

So to find out whether features are related, we need to see whether we can reject the null hypothesis. Technically speaking, ChiSquaredSelector takes the p-values obtained when the chi2 statistics are calculated. To recap:

The p-value is the probability, under a given statistical model and assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed.

So, given a significance level (a parameter of ChiSquaredSelector), we iteratively eliminate the feature with the highest p-value until the remaining p-values no longer exceed that level.
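For reference, the standard Pearson chi-squared test of independence between a categorical feature with r levels and a target with c classes can be written as below. Here O_ij are the observed counts in the contingency table, E_ij are the counts expected under the null hypothesis, and N is the number of samples. This is the general statistical formulation; the exact computation inside the library may differ.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{\left(\sum_{j'=1}^{c} O_{ij'}\right)\left(\sum_{i'=1}^{r} O_{i'j}\right)}{N}

p = P\left(X \ge \chi^2\right),
\qquad
X \sim \chi^2_{(r-1)(c-1)}
```

A large statistic means the observed counts are far from what independence would predict, which gives a small p-value.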
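The elimination idea can be sketched in plain Python. This is an illustration of the principle under stated assumptions, not Kydavra's actual implementation: it uses the standard Pearson test of independence on a contingency table, and the function names and the significance_level default of 0.05 are hypothetical choices for this example.

```python
import math
from collections import Counter


def chi2_sf(x, dof):
    """P(X >= x) for a chi-squared variable with integer degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for 1 and 2 degrees of freedom.
    if dof % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Step up two degrees at a time:
    # Q(k + 2) = Q(k) + half**(k/2) * exp(-half) / Gamma(k/2 + 1).
    while k < dof:
        q += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return q


def chi2_p_value(feature, target):
    """Chi-squared statistic and p-value for two equally long categorical lists."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    row_totals = Counter(feature)
    col_totals = Counter(target)
    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    if dof == 0:
        return stat, 1.0  # a constant column carries no information
    return stat, chi2_sf(stat, dof)


def backward_eliminate(columns, target, significance_level=0.05):
    """Drop the feature with the highest p-value until none exceeds the level.

    columns: dict mapping feature name -> list of category values.
    """
    kept = dict(columns)
    while kept:
        p_values = {name: chi2_p_value(values, target)[1]
                    for name, values in kept.items()}
        worst = max(p_values, key=p_values.get)
        if p_values[worst] <= significance_level:
            break
        del kept[worst]
    return list(kept)


# Toy data: one feature that matches the target and one that is unrelated.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "informative": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "noise": ["x", "y", "x", "y", "x", "y", "x", "y"],
}
selected = backward_eliminate(columns, target)
```

On this toy data the unrelated column has a p-value of 1.0 and is dropped, while the matching column stays. Because each feature is tested against the target on its own in this sketch, the loop gives the same result as a one-pass threshold; it is written iteratively only to mirror the description above.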

BONUS!

If you are interested in why the selector chose some features and left others out, you can always plot the selection process. ChiSquaredSelector has two plotting functions, one for Chi2 values and another for p-values:

selector.plot_chi2()

and for p-values:

selector.plot_p_value()

Each function has the following parameters:

title: the title of the plot.
save: a boolean value; True means the plot will be saved, False means it will not. By default, it is set to False.
file_path: the file path to the newly created plot.

If you want to dig deeper into notions such as the null hypothesis, the Chi2 test, and p-values, or into how this feature selection works, I highly recommend the links at the end of the article. If you tried kydavra, I invite you to leave some feedback and share your experience using it by responding to this form.

Made with ❤ by Sigmoid.

Useful links:
