Using Kydavra ItakuraSaitoSelector for feature selection

What is ItakuraSaitoSelector?

It is a selector based on the Itakura-Saito divergence, which measures the difference between an original spectrum and an approximation of it. A spectrum can be thought of as a continuous distribution.

Since in Machine Learning we usually deal with observable data, we will consider discrete (finite) distributions.

How is it calculated?

The Itakura-Saito divergence is the Bregman divergence generated by the negative logarithm function. It can be calculated with the following formula:

D_{IS}(P \| Q) = \sum_i \left( \frac{P(i)}{Q(i)} - \log \frac{P(i)}{Q(i)} - 1 \right)

where P and Q are distributions.
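To make the formula concrete, here is a minimal sketch of the computation in NumPy (the helper name and the eps shift are illustrative, not kydavra's internals):

import numpy as np

def itakura_saito(p, q, eps=1e-4):
    # Shift both distributions away from zero so the ratio and the log stay finite.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    ratio = p / q
    return float(np.sum(ratio - np.log(ratio) - 1))

print(itakura_saito([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # 0.0 for identical distributions
print(itakura_saito([0.2, 0.3, 0.5], [0.3, 0.3, 0.4]))  # > 0; note the divergence is not symmetric

Since x − log(x) − 1 ≥ 0 for every positive x, the divergence is always non-negative and is zero only when the two distributions match.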

Using ItakuraSaitoSelector

The ItakuraSaitoSelector constructor has the following parameters:

  1. EPS (float, default: 0.0001): a small value added to the feature column to avoid division by zero.
  2. min_divergence (int, default: 0): the minimum accepted divergence between a feature column and the target column.
  3. max_divergence (int, default: 10): the maximum accepted divergence between a feature column and the target column.

The select method has the following parameters:

  1. dataframe (pd.DataFrame): the dataframe to apply the selector to.
  2. target (str): the name of the column the feature columns will be compared with.

This method returns a list of the column names selected from the dataframe.
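Putting these pieces together, here is a rough sketch of what such a selector might do internally: compute the divergence between each (EPS-shifted) feature column and the target, and keep the columns whose divergence falls in the accepted range. This is an illustration under the parameter names above, not kydavra's actual implementation, and it assumes non-negative numeric columns:

import numpy as np
import pandas as pd

def select_by_is_divergence(df: pd.DataFrame, target: str, eps: float = 0.0001,
                            min_divergence: float = 0, max_divergence: float = 10):
    # EPS keeps the ratio and the log well defined when a column contains zeros.
    q = df[target].to_numpy(dtype=float) + eps
    selected = []
    for col in df.columns.drop(target):
        p = df[col].to_numpy(dtype=float) + eps
        ratio = p / q
        divergence = np.sum(ratio - np.log(ratio) - 1)
        if min_divergence <= divergence <= max_divergence:
            selected.append(col)
    return selected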

Example using ItakuraSaitoSelector

First of all, you should install kydavra if you don’t have it yet:

pip install kydavra

Now, you can import the selector:

from kydavra import ItakuraSaitoSelector

Import a dataset and create a dataframe:

import pandas as pd

df = pd.read_csv('./heart.csv')

Because our selector expects numeric features, let's keep only the columns that have a numerical data type:

df = df.select_dtypes('number')

Let's instantiate our selector and select the features relative to the 'target' column:

cols = ItakuraSaitoSelector().select(df, 'target')

The selector returns the list of column names of the features whose divergence from the chosen column lies between min_divergence and max_divergence.

With the heart.csv dataset in this example, the selector returns the following columns:

['age', 'cp', 'trestbps', 'chol', 'thalach', 'slope', 'thal']
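To narrow that range, pass the thresholds to the constructor:

cols = ItakuraSaitoSelector(min_divergence=1, max_divergence=3).select(df, 'target')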

If we limit the divergence of the selected columns to be between 1 and 3 relative to the target column, we get the following:

['age', 'trestbps', 'chol', 'thalach', 'slope', 'thal']

For divergence between 1 and 2 we get:

['age', 'trestbps', 'chol', 'thalach', 'thal']

Use case

As we started with the heart disease dataset, let's finish with it. We will create a classification model that predicts whether a given patient has heart disease.

You can find the dataset here: https://www.kaggle.com/ronitf/heart-disease-uci.

Let's create two models: one trained without feature selection and one trained on the features chosen by our selector.

Start with importing the dataset:

import pandas as pd

df = pd.read_csv('./heart.csv')
X = df.drop(columns=['target'])
y = df['target']

Now create the model without the selector:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression().fit(X_train, y_train)

Now, let’s add some metrics to be able to compare the two models:

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

print('Without selector:')
print(f'accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf.predict(X_test)):.2f}')

Good. Now let’s do the same thing for the second model and apply the selector. First, import it:

from kydavra import ItakuraSaitoSelector

We will use the columns that have a divergence between 1 and 3 relative to the target column (which in this dataset is named ‘target’ for convenience):

itakura = ItakuraSaitoSelector(min_divergence=1, max_divergence=3)
cols = itakura.select(df, 'target')
print(f'\nselected columns: {cols}')

Now, let’s select these columns from the DataFrame. The target column remains the same:

X = df[cols]
y = df['target']

Continue with creating the model and printing the metrics:

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf_with_selector = LogisticRegression().fit(X_train, y_train)

print('\nWith selector:')
print(f'accuracy: {accuracy_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf_with_selector.predict(X_test)):.2f}')

Here is what the program outputs:

Without selector:
accuracy: 0.79
recall 0.90
AUC 0.79

selected columns: ['age', 'trestbps', 'chol', 'thalach', 'slope', 'thal']

With selector:
accuracy: 0.82
recall 0.86
AUC 0.81

Since the heart disease dataset is small, removing columns means losing precious information the model could train on. But on a bigger dataset with correlated feature columns, using ItakuraSaitoSelector will help improve the model.

Made with ❤ by Sigmoid.

Follow us on Facebook, Instagram and LinkedIn:

https://www.facebook.com/sigmoidAI

https://www.instagram.com/sigmo.ai/

https://www.linkedin.com/company/sigmoid/
