Feature Selection using the Kydavra FisherSelector

A Brief Review of Fisher Score:

Fisher score is one of the most widely used supervised feature selection methods. However, it scores each feature independently under the Fisher criterion, which leads to a suboptimal subset of features. The latest Kydavra release (0.3) provides a generalized Fisher score to jointly select features.

The key idea of the Fisher score is to find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible. The classical approach computes a score for each feature independently and then keeps the top-m ranked features. Because each score is computed independently, the subset chosen by this heuristic can be suboptimal.
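To make the idea concrete, here is a rough sketch of how the classical, feature-by-feature Fisher score could be computed with pandas. This is only an illustration of the formula (between-class variance over within-class variance), not Kydavra's actual implementation, and the function name fisher_score is invented for this example.

import pandas as pd

def fisher_score(df, target):
    # Per-feature ratio of between-class variance to within-class variance.
    features = df.drop(columns=[target])
    overall_mean = features.mean()
    between = pd.Series(0.0, index=features.columns)
    within = pd.Series(0.0, index=features.columns)
    for _, group in features.groupby(df[target]):
        n_k = len(group)
        between += n_k * (group.mean() - overall_mean) ** 2
        within += n_k * group.var(ddof=0)
    # Larger score means the feature separates the classes better;
    # scores.sort_values(ascending=False).head(m) would give the top-m features.
    return between / within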

Using Kydavra FisherSelector:

To get started, make sure you have Kydavra installed on your machine.

pip install kydavra

Next, we need to import the model, create the selector, and apply it to our data.

from kydavra import FisherSelector

selector = FisherSelector(10)
selected_cols = selector.select(df, 'target')

The select function takes as parameters the pandas data frame and the name of the target column.
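The returned value is a plain list of column names (as the output in the Testing section below shows), so it can be used directly to subset the data frame for modelling. The names df and 'target' here are just the ones used in the snippet above:

X = df[selected_cols]
y = df['target']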

FisherSelector() takes the following parameter:

  • n_features (int, default=5): the number of top-ranked features (according to the Fisher score) to keep after feature selection is applied.

Testing

In our test, we use the load_boston data set provided by the sklearn library.

from sklearn.datasets import load_boston

Note: the algorithm works only on numeric data.
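Putting the pieces together, a minimal sketch of preparing the data for the selector could look like the following. The DataFrame construction and the 'target' column name are assumptions about the setup, the cleaning steps from the original experiment are not shown, and note that recent scikit-learn releases (1.2+) no longer include load_boston, so an older version may be needed.

import pandas as pd
from sklearn.datasets import load_boston
from kydavra import FisherSelector

# Build one data frame with the target attached as a regular column.
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['target'] = boston.target

selector = FisherSelector(10)
selected_cols = selector.select(df, 'target')
print(selected_cols)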

After some cleaning, the selector returned the following columns:

['DIS', 'INDUS', 'NOX', 'TAX', 'AGE', 'LSTAT', 'PTRATIO', 'B', 'RM', 'RAD']

Now let’s compare the results before and after applying the selector.


Going further, I compared the Mean Squared Error (MSE) for each model.

BEFORE:

MSE for LinearRegression : 27.22333372429674
MSE for SVR : 34.728436105564406

AFTER:

MSE for LinearRegression : 28.83326001122695
MSE for SVR : 30.82835204515224
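For reference, here is a minimal sketch of how such a comparison can be reproduced, reusing df, boston, and selected_cols from the preparation snippet above. The train/test split, random seed, and default model settings are assumptions, so the exact numbers will differ from those reported here.

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def report_mse(feature_cols):
    # Fit both models on the given columns and print their test MSE.
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df['target'], test_size=0.2, random_state=42)
    for model in (LinearRegression(), SVR()):
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        print(f'MSE for {type(model).__name__} : {mse}')

report_mse(list(boston.feature_names))  # before: all features
report_mse(selected_cols)               # after: selected features only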

Conclusion

The results show that SVR performed better after the selection, while Linear Regression achieved nearly the same error as when trained on all features.

Made with ❤ from Sigmoid.

Follow us on Facebook, Instagram and LinkedIn:

https://www.facebook.com/sigmoidAI

https://www.instagram.com/sigmo.ai/

https://www.linkedin.com/company/sigmoid/
