
Using Kydavra ElasticNetSelector for feature selection

Vladimir Stojoc

Feb 3, 2021 • 3 min read

Even if we don’t live in the world of Harry Potter, anyone can now learn the unforgivable spells and use some rather difficult algorithms by typing just 3 lines of code. The Kydavra library for Python makes this possible.

Using Kydavra ElasticNetSelector.

To install Kydavra, just type the following in the command line:

pip install kydavra

Next, we need to import the model, create the selector, and apply it to our data:

from kydavra import ElasticNetSelector

selector = ElasticNetSelector()
selected_cols = selector.select(df, 'target')

The ElasticNetSelector() accepts 7 parameters:

alpha_start: float, default = -2, the starting point in the greedy search of coefficients for L1 regularization.
alpha_finish: float, default = 0, the finish point in the greedy search of coefficients for L1 regularization.
beta_start: float, default = -2, the starting point in the greedy search of coefficients for L2 regularization.
beta_finish: float, default = 0, the finish point in the greedy search of coefficients for L2 regularization.
n_alphas: int, default = 100, the number of points in greedy search.
extend_step: int, default = 20, the amount by which alpha_start and alpha_finish are updated when the search limits need to be extended.
power: int, default = 2, sets the threshold 10^-power that a feature's weight must exceed for the feature to be kept.
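The defaults of -2 and 0 suggest that the start and finish values are exponents rather than raw coefficients. Assuming they are base-10 exponents of a log-spaced grid (the article does not state the spacing, so this is only an assumption), the candidate values could be generated roughly like this. The helper log_grid is hypothetical and not part of Kydavra.

```python
def log_grid(start, finish, n_points):
    """Return n_points values spaced evenly on a log10 scale.

    Assumption: start and finish are base-10 exponents, so the defaults
    (-2, 0) would cover roughly 0.01 .. 1.0. Illustrative only, not Kydavra code.
    """
    if n_points < 2:
        return [10.0 ** start]
    step = (finish - start) / (n_points - 1)
    return [10.0 ** (start + i * step) for i in range(n_points)]


# Article defaults: alpha_start = -2, alpha_finish = 0, n_alphas = 100
alphas = log_grid(-2, 0, 100)
print(len(alphas), alphas[0], alphas[-1])
```

Under this reading, the same kind of grid would be built for the betas from beta_start and beta_finish.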

The select() function takes the pandas DataFrame and the name of the target column as parameters. It also has an optional parameter ‘cv’ (set to 5 by default), which is the number of folds used in cross-validation.

The algorithm searches through different combinations of alphas and betas to combine L1 and L2 regularization and chooses the best one. If the best alpha or the best beta lies on one of the limit points (start or finish), it extends the limits by extend_step.
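To make the limit-extension idea concrete, here is a minimal sketch for a single coefficient. It assumes the limits are base-10 exponents and that only the boundary that was hit gets moved; how Kydavra actually shifts the limits is not described in this article. The function search_with_extension and the toy loss are hypothetical, and in the real selector the score would be a cross-validated error over both alpha and beta.

```python
import math


def search_with_extension(score, start=-2, finish=0, n_points=100,
                          extend_step=20, max_rounds=10):
    """Grid-search one exponent range, widening it while the best
    point sits on a boundary. Lower score is better.

    Assumption: the limit that was hit moves outward by extend_step.
    """
    for _ in range(max_rounds):
        step = (finish - start) / (n_points - 1)
        grid = [10.0 ** (start + i * step) for i in range(n_points)]
        best_index = min(range(n_points), key=lambda i: score(grid[i]))
        if best_index == 0:
            start -= extend_step       # optimum may lie below the range
        elif best_index == n_points - 1:
            finish += extend_step      # optimum may lie above the range
        else:
            break                      # interior optimum: stop extending
    return grid[best_index], (start, finish)


def toy_loss(coef):
    # Made-up loss with its minimum at 10 ** 1.5, outside the default range
    return (math.log10(coef) - 1.5) ** 2


best, limits = search_with_extension(toy_loss)
print(best, limits)
```

With the toy loss, the first pass ends on the upper limit, so the range is widened once and the second pass finds an interior optimum.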

After finding the optimal values of alpha and beta, the algorithm simply keeps the features whose weights are higher than 10^-power.
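The thresholding step can be sketched in a few lines. This assumes the comparison is made on the absolute value of each weight (the text above only says "higher than"), and the weights below are made up for illustration. select_by_weight is a hypothetical helper, not a Kydavra function.

```python
def select_by_weight(weights, power=2):
    """Keep features whose absolute weight exceeds 10 ** -power."""
    threshold = 10.0 ** -power
    return [name for name, w in weights.items() if abs(w) > threshold]


# Made-up weights purely for illustration (not results from the article)
example_weights = {"area": 0.004, "floor": -0.35, "rooms": 0.0, "hoa": 1.2}
selected = select_by_weight(example_weights, power=2)
print(selected)  # ['floor', 'hoa']
```

A larger power lowers the threshold, so more features with small weights survive.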

Let’s see an example:

I chose a dataset with a regression task, where we have to predict the price (‘total (R$)’). The dataset has the following features:

'city', 'area', 'rooms', 'bathroom', 'parking spaces', 'floor', 'animal', 'furniture', 'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)', 'total (R$)'

Trained on all the features, LinearRegression has a mean absolute error of 1.08701466411.

When ElasticNetSelector is applied, it selects the following features:

'floor', 'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)'

and now LinearRegression has a mean absolute error of 1.0684039933.

So besides reducing the number of columns in the training dataset, it also reduced the model's error slightly (by about 1.7%).
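For readers who want to check the numbers, here is a small sketch of the mean absolute error and of the relative change between the two reported values. The two MAE figures come from the article; the helper function and the toy inputs are only illustrative, and the article does not say how the train and test split was made.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy check of the helper: errors are 0, 0 and 2, so the mean is 2/3
print(mean_absolute_error([1, 2, 3], [1, 2, 5]))

# MAE values reported in the article
mae_all_features = 1.08701466411
mae_selected_features = 1.0684039933

relative_drop = (mae_all_features - mae_selected_features) / mae_all_features
print(f"{relative_drop:.2%}")  # 1.71%
```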

Bonus.

This module also has a plotting function. After applying the select function, you can see why the selector kept some features and dropped others. To plot, just type:

selector.plot_process()

selector.plot_process(regularization_plot = "L2")

The dotted lines are features that were thrown away because their weights were too close to 0. The central vertical dotted line marks the optimal value of alpha (or of beta, in the L2 plot) found by the algorithm.

The plot_process() function has the following parameters:

eps (float, default = 5e-3): the length of the path.
title (string, default = ‘Lasso coef Plot’): the title of the plot.
save (boolean, default = False): if set to True, it will try to save the plot.
file_path (string, default = None): if the save parameter is set to True, the plot is saved using this path.
regularization_plot (string, default = ‘L1’): two possible values, “L1” and “L2”. “L1” shows the plot for changes in L1 alphas; “L2” shows the plot for changes in L2 betas.

Conclusions:

ElasticNetSelector is a selector that uses Lasso and Ridge regularization to select the most useful features. It combines the good parts of both algorithms, and with Kydavra it can be used by typing just 3 lines of code.