Selecting the right set of features for data modelling has been shown to improve the performance of supervised and unsupervised learning, to reduce computational costs such as training time and required resources, and, in the case of high-dimensional input data, to mitigate the curse of dimensionality. Computing and using feature importance scores is also an important step towards model interpretability.
Introduction to the Relief Algorithm
The core idea behind Relief algorithms is to estimate the quality of attributes based on how well their values distinguish between instances that are near to each other.
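This idea can be sketched in plain Python. The snippet below is a simplified two-class Relief, not Kydavra's actual implementation: for each sampled instance, a feature's weight grows if the feature differs on the nearest miss (nearest neighbor of the other class) and shrinks if it differs on the nearest hit (nearest neighbor of the same class).

```python
import random

def relief_weights(X, y, n_samples=None, seed=0):
    """Basic two-class Relief: reward features that separate an
    instance from its nearest miss, penalize features that differ
    from its nearest hit."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # per-feature min-max ranges, used to normalize differences to [0, 1]
    lo = [min(row[f] for row in X) for f in range(d)]
    hi = [max(row[f] for row in X) for f in range(d)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    m = n_samples or n
    w = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)
        xi, yi = X[i], y[i]
        # nearest neighbor of the same class (hit) and of the other class (miss)
        hit = min((j for j in range(n) if j != i and y[j] == yi),
                  key=lambda j: dist(xi, X[j]))
        miss = min((j for j in range(n) if y[j] != yi),
                   key=lambda j: dist(xi, X[j]))
        for f in range(d):
            w[f] += (diff(f, xi, X[miss]) - diff(f, xi, X[hit])) / m
    return w
```

On a toy data set where the first feature separates the classes and the second is noise, the first feature receives the higher weight.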
Using Kydavra ReliefFSelector
If you haven’t installed Kydavra yet, you can do it by typing the following command:
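Assuming the package is published on PyPI under the name kydavra, the install is:

```shell
pip install kydavra
```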
Now import the algorithm, fit it and select some features.
from kydavra import ReliefFSelector

selector = ReliefFSelector(10, 7)
selected_cols = selector.select(df, 'target')
The ReliefFSelector constructor takes only 2 parameters:
n_neighbors (int, default=5): the number of neighbors to consider when assigning feature importance scores.
n_features (int, default=10): the number of top features (according to the ReliefF score) to retain after feature selection is applied.
The select function takes as parameters the pandas DataFrame and the name of the target column.
In our test, we use a subset of the UCI machine learning benchmark data sets, namely the Heart Disease data set.
Note: The algorithm works only on numeric data.
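Since the algorithm works only on numeric data, non-numeric columns can be dropped before calling the selector. A small sketch using pandas (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [63, 37, 41],
    'sex': ['M', 'F', 'F'],      # non-numeric column, must be dropped or encoded
    'chol': [233, 250, 204],
    'target': [1, 1, 0],
})

# keep only numeric columns before passing the frame to the selector
numeric_df = df.select_dtypes(include='number')
print(list(numeric_df.columns))  # ['age', 'chol', 'target']
```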
After doing some cleaning, the result our selector gave us looks like the following:
Now let’s compare the results before and after applying the selector.
As we can see, with the help of ReliefFSelector we got a small improvement in our model: an accuracy score about 2% higher than before.
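Such a before/after comparison can be sketched as follows. This example uses synthetic data and scikit-learn; selected_cols is hard-coded here for illustration, whereas in practice it would come from ReliefFSelector.select.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# synthetic stand-in for the Heart Disease data
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
df = pd.DataFrame(X, columns=[f'f{i}' for i in range(10)])

# hypothetical output of the selector, hard-coded for this sketch
selected_cols = ['f0', 'f1', 'f2']

def score(cols):
    """Train a simple model on the given columns and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

print('all features     :', score(list(df.columns)))
print('selected features:', score(selected_cols))
```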
Made with ❤ from Sigmoid.
Follow us on Facebook, Instagram and LinkedIn: