Using Kydavra BregmanDivergenceSelector for feature selection

Sometimes it is necessary to select some features from a dataset that are quite similar to a target column. One way is to compare the features with the target using divergence. One such divergence is the Bregman divergence. Kydavra implements a selector based on Bregman divergence named BregmanDivergenceSelector.

What is Bregman divergence?

In statistics, a divergence is a function that establishes the “distance” between two probability distributions, i.e. how much one distribution differs from the other. If we interpret two sets of observed values (for example, two columns of a dataset) as such distributions, we can measure the distance between them.

Bregman divergence is one of many such divergences. For a convex generator function F, it is defined as

D_F(p, q) = F(p) − F(q) − ⟨∇F(q), p − q⟩

and with the generator F(x) = ‖x‖² it reduces to the squared Euclidean distance:

D_F(p, q) = ‖p − q‖²
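For intuition, here is a minimal NumPy sketch of that squared-Euclidean case (just to illustrate the formula; it is not kydavra’s internal implementation):

import numpy as np

def bregman_squared_euclidean(p, q):
    # Bregman divergence generated by F(x) = ||x||^2,
    # which reduces to the squared Euclidean distance
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum((p - q) ** 2)

# toy example: divergence between a feature vector and a target vector
print(bregman_squared_euclidean([0.2, 0.5, 0.9], [0.0, 1.0, 1.0]))  # ≈ 0.3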

We can select features in a dataset based on divergence between them and the target. That is what BregmanDivergenceSelector does.

Using BregmanDivergenceSelector

The BregmanDivergenceSelector constructor has the following parameters:

  1. min_divergence (int, default: 0): the minimum accepted divergence with the target column
  2. max_divergence (int, default: 10): the maximum accepted divergence with the target column

The select method has the following parameters:

  1. dataframe (pd.DataFrame): the dataframe to which the selector is applied
  2. target (str): the name of the column that the feature columns will be compared with

This method returns a list of the column names selected from the dataframe.
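
Putting the two together, a typical call looks like this (the divergence bounds here are just illustrative values, not recommendations):

import pandas as pd
from kydavra import BregmanDivergenceSelector

df = pd.read_csv('./heart.csv')

# keep only features whose divergence from 'target' lies in [1, 5]
cols = BregmanDivergenceSelector(min_divergence=1, max_divergence=5).select(df, 'target')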

Example using BregmanDivergenceSelector

First of all, you should install kydavra if you don’t have it yet:

pip install kydavra

Now, you can import the selector:

from kydavra import BregmanDivergenceSelector

While we are here, let’s also import pandas and load the dataset we will work with:

import pandas as pd

df = pd.read_csv('./heart.csv')

As our selector expects numeric features, let’s keep only the columns that have a numerical data type:

df = df.select_dtypes('number')

Let’s instantiate our selector and select the features, comparing them with the ‘target’ column:

cols = BregmanDivergenceSelector().select(df, 'target')

The selector returns the list of names of the columns whose divergence from the chosen column lies between min_divergence and max_divergence.

With the heart.csv dataset in this example, the selector returns the following columns:

['sex', 'cp', 'fbs', 'restecg', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

Let’s try limiting max_divergence to 1:
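
cols = BregmanDivergenceSelector(max_divergence=1).select(df, 'target')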

We get the following result:

['fbs', 'restecg', 'exang']

It’s that simple: just one line of code and you have your columns.

Use case

Let’s try the selector on a simple use case. We have the Heart Disease UCI dataset, and we want to create a classification model that predicts whether or not a person has heart disease.

You can find the dataset here: https://www.kaggle.com/ronitf/heart-disease-uci.

We will create two models: one trained without feature selection and one trained on the features chosen by our selector.

Let’s import the dataset first:

import pandas as pd

df = pd.read_csv('./heart.csv')
X = df.drop(columns=['target'])
y = df['target']

Now let’s create the model without the selector:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression().fit(X_train, y_train)

Now, let’s add some metrics to be able to compare the two models:

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

print('Without selector:')
print(f'accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf.predict(X_test)):.2f}')

Fine. Let’s do the same thing for the second model. But now we will apply our selector. First, import it:

from kydavra import BregmanDivergenceSelector

We will use the columns that have a divergence between 1 and 3 relative to the target column (which in this dataset is named ‘target’ for convenience):

bregman = BregmanDivergenceSelector(min_divergence=1, max_divergence=3)
cols = bregman.select(df, 'target')
print(f'\nselected columns: {cols}')

Now, let’s select these columns from the DataFrame. The target column remains the same:

X = df[cols]
y = df['target']

Continue with creating the model and printing the metrics:

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf_with_selector = LogisticRegression().fit(X_train, y_train)

print('\nWith selector:')
print(f'accuracy: {accuracy_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf_with_selector.predict(X_test)):.2f}')

Here is what we get as an output:

Without selector:
accuracy: 0.79
recall 0.83
AUC 0.78

selected columns: ['sex', 'cp', 'oldpeak', 'slope', 'ca', 'thal']

With selector:
accuracy: 0.83
recall 0.87
AUC 0.83

We see that the results improved, which is notable considering that the second model was trained on fewer features.

If the right columns are selected, the results can improve compared to a model trained without any selection. That is why it is a good idea to experiment with the selector’s parameters, for example with a small sweep like the one sketched below.
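
As a rough illustration, reusing the imports and the df / y variables from above, you could try a few (min_divergence, max_divergence) pairs and compare the accuracy (the ranges below are arbitrary, and the scores will vary from run to run since train_test_split is random):

for low, high in [(0, 1), (1, 3), (3, 10)]:
    cols = BregmanDivergenceSelector(min_divergence=low, max_divergence=high).select(df, 'target')
    if not cols:
        continue  # no feature falls inside this divergence range
    X_train, X_test, y_train, y_test = train_test_split(df[cols], y)
    clf = LogisticRegression().fit(X_train, y_train)
    print(f'[{low}, {high}] -> {cols}: accuracy {accuracy_score(y_test, clf.predict(X_test)):.2f}')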

Made with ❤ by Sigmoid.

Follow us on Facebook, Instagram and LinkedIn:

https://www.facebook.com/sigmoidAI

https://www.instagram.com/sigmo.ai/

https://www.linkedin.com/company/sigmoid/
