Using Kydavra KullbackLeiblerSelector for feature selection

What is KullbackLeiblerSelector?

It is a feature selector based on the Kullback-Leibler divergence.

A divergence is a measure of the difference between two probability distributions. In machine learning, we can treat data columns as empirical distributions and calculate how different a given feature column is from a target column.

How is it calculated?

Kullback-Leibler divergence, also named relative entropy in information theory, is calculated for discrete distributions using the following formula:

D_KL(P ‖ Q) = Σ_{x ∈ X} P(x) · log( P(x) / Q(x) )

where P and Q are the two distributions and X is the sample space.
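
As a quick illustration, here is the formula evaluated in Python for two small, hand-picked discrete distributions (the numbers are arbitrary and only serve as an example):

import numpy as np

# two discrete distributions over the same sample space X = {0, 1, 2}
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))
kl_pq = np.sum(p * np.log(p / q))
print(kl_pq)  # about 0.025 (natural logarithm); scipy.stats.entropy(p, q) gives the same value

Note that the divergence is not symmetric: D_KL(P ‖ Q) is generally not equal to D_KL(Q ‖ P).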

Using KullbackLeiblerSelector

The KullbackLeiblerSelector constructor has the following parameters:

  1. EPS (float, default: 0.0001): a small value added to the feature column in order to avoid division by zero.
  2. min_divergence (int, default: 0): the minimum accepted divergence with the target column.
  3. max_divergence (int, default: 1): the maximum accepted divergence with the target column.

The select method has the following parameters:

  1. dataframe (pd.DataFrame): the dataframe for which to apply the selector
  2. target (str): the name of the column that the feature columns will be compared with

This method returns a list of the column names selected from the dataframe.
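
To make these parameters concrete, below is a rough sketch of the idea behind such a selector: each feature column (shifted by EPS) is normalized into a discrete distribution, its divergence from the similarly normalized target column is computed, and only the columns whose divergence falls between min_divergence and max_divergence are kept. This is only an illustration of the concept under the assumption of non-negative numeric columns, not kydavra's actual implementation, and the function name kl_select_sketch is made up for this article:

import numpy as np
import pandas as pd

def kl_select_sketch(df: pd.DataFrame, target: str, eps: float = 0.0001,
                     min_divergence: float = 0, max_divergence: float = 1) -> list:
    # illustrative sketch only; assumes non-negative numeric columns
    q = df[target].to_numpy(dtype=float) + eps
    q = q / q.sum()  # normalize the target column into a distribution
    selected = []
    for col in df.columns.drop(target):
        p = df[col].to_numpy(dtype=float) + eps
        p = p / p.sum()  # normalize the feature column into a distribution
        divergence = np.sum(p * np.log(p / q))  # D_KL(feature || target)
        if min_divergence <= divergence <= max_divergence:
            selected.append(col)
    return selected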

Example using KullbackLeiblerSelector

First of all, you should install kydavra if you don't have it yet:

pip install kydavra

Now, you can import the selector:

from kydavra import KullbackLeiblerSelector

Import a dataset and create a dataframe out of it:

import pandas as pd

df = pd.read_csv('./heart.csv')

As our selector expects numeric features, let's keep only the columns that have a numerical data type:

df = df.select_dtypes('number')

Let's instantiate our selector and select the features against the 'target' column:

cols = KullbackLeiblerSelector().select(df, 'target')

The selector returns the list of column names of the features whose divergence from the chosen column lies between min_divergence and max_divergence.

With the heart.csv dataset in this example, the selector returns the following columns:

['age', 'cp', 'trestbps', 'chol', 'thalach', 'slope', 'thal']

If we limit the divergence of the selected columns to be between 1 and 3 relative to the target column, we get the following:

['sex', 'restecg', 'oldpeak']
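
For reference, that result comes from passing the bounds explicitly to the constructor, with the same call that is used in the use case below:

cols = KullbackLeiblerSelector(min_divergence=1, max_divergence=3).select(df, 'target')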

Use case

Continuing with the heart disease dataset, let's create a classification model that predicts whether a patient has heart disease or not. We will observe how the KullbackLeiblerSelector affects the performance of the model.

You can find the dataset here: https://www.kaggle.com/ronitf/heart-disease-uci.

In fact, we will create two models: one trained without feature selection and one trained on the features chosen by our selector.

Let's import the dataset first:

import pandas as pd

df = pd.read_csv('./heart.csv')
X = df.drop(columns=['target'])
y = df['target']

Now let's create the model without the selector:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression().fit(X_train, y_train)

Now, let's compute some metrics so we can compare the two models:

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

print('Without selector:')
print(f'accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf.predict(X_test)):.2f}')

Fine. Let's do the same thing for the second model, but this time we will apply our selector. First, import it:

from kydavra import KullbackLeiblerSelector

We will use the columns that have a divergence between 1 and 3 relative to the target column (which in this dataset is conveniently named 'target'):

kullback = KullbackLeiblerSelector(min_divergence=1, max_divergence=3)
cols = kullback.select(df, 'target')
print(f'\nselected columns: {cols}')

Now, let's select these columns from the DataFrame. The target column remains the same:

X = df[cols]
y = df['target']

Continue with creating the model and printing the metrics:

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf_with_selector = LogisticRegression().fit(X_train, y_train)

print('\nWith selector:')
print(f'accuracy: {accuracy_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf_with_selector.predict(X_test)):.2f}')

Here is what we get as an output:

Without selector:
accuracy: 0.82
recall 0.76
AUC 0.82

selected columns: ['sex', 'restecg', 'oldpeak']

With selector:
accuracy: 0.80
recall 0.88
AUC 0.80

We observe that the accuracy and ROC AUC metrics got slightly worse, while the recall improved considerably. In other cases the results may differ; experimenting with the selector's parameters can change the outcome. Also, don't forget to consider other factors, such as the dataset and the model used.

Made with ❤ by Sigmoid.

Follow us on Facebook, Instagram and LinkedIn:

https://www.facebook.com/sigmoidAI

https://www.instagram.com/sigmo.ai/

https://www.linkedin.com/company/sigmoid/
