Most machine learning problems today are classification problems. Data scientists and machine learning engineers usually use correlations such as Pearson or Spearman to find the features that correlate most strongly with the predicted value. However, these correlations work best on continuous-continuous pairs of features. That's why we at Sigmoid decided to add to our feature selection library, Kydavra, a method that also works on dichotomous data (a series that takes only 2 values).
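To make the idea concrete, here is a small sketch of point-biserial correlation itself, using scipy's pointbiserialr on toy data (the arrays below are made up for illustration): a dichotomous variable paired with a continuous one.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Dichotomous variable (e.g., condition absent = 0 / present = 1)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Continuous measurement that tends to be higher for group 1
value = np.array([1.2, 2.3, 1.8, 2.0, 3.5, 4.1, 3.8, 4.4])

# Point-biserial correlation: strong positive here, since the two
# groups have clearly separated means
r, p_value = pointbiserialr(group, value)
print(f"point-biserial r = {r:.3f}, p = {p_value:.4f}")
```

A Pearson correlation on the same pair gives the identical value; point-biserial is mathematically a Pearson correlation where one variable happens to be binary.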
Using PointBiserialCorrSelector from the Kydavra library.
As always, for those who are here mostly for the solution to their problem, here are the commands and the code:
To install Kydavra, just type the following line in the terminal or command prompt:
pip install kydavra
Now you can import the selector and apply it on your data set as follows:
from kydavra import PointBiserialCorrSelector
selector = PointBiserialCorrSelector()
new_columns = selector.select(df, 'target')
PointBiserialCorrSelector has the following parameters:
- min_corr — the minimal correlation for a feature to be considered important (default = 0.5)
- max_corr — the maximal correlation for a feature to be considered important (default = 0.8)
- last_level — the number of correlation levels that the selector will take into account. It is recommended not to change it (default = 2).
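To build intuition for what these thresholds do, here is a simplified sketch of correlation-threshold selection: keep each column whose absolute point-biserial correlation with the target falls between min_corr and max_corr. This is only an illustration of the idea, not Kydavra's actual implementation (in particular, the last_level mechanism is not modeled here).

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

def select_by_point_biserial(df, target, min_corr=0.5, max_corr=0.8):
    """Keep columns whose absolute point-biserial correlation with the
    dichotomous target lies in [min_corr, max_corr]. Simplified sketch."""
    kept = []
    for col in df.columns:
        if col == target:
            continue
        r, _ = pointbiserialr(df[target], df[col])
        if min_corr <= abs(r) <= max_corr:
            kept.append((col, abs(r)))
    # Report features in descending order of correlation strength
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [col for col, _ in kept]

# Tiny synthetic frame: 'strong' tracks the target, 'noise' does not
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "target": target,
    "strong": target + rng.normal(0, 0.6, size=200),
    "noise": rng.normal(0, 1, size=200),
})
print(select_by_point_biserial(df, "target"))
```

With the default thresholds, 'strong' passes (its correlation with the target is roughly 0.6) while 'noise' is filtered out.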
So let's test it on the Heart Disease UCI dataset. Note that it was cleaned beforehand.
import pandas as pd
from kydavra import PointBiserialCorrSelector

df = pd.read_csv('cleaned.csv')
selector = PointBiserialCorrSelector()
new_columns = selector.select(df, 'target')
print(new_columns)
['cp', 'thalach', 'exang', 'oldpeak', 'sex', 'slope', 'ca', 'thal']
Note that the features are ordered in descending order of their point-biserial correlation values.
To see the impact of feature selection on different types of models, I decided to train 3 models (LogisticRegression — the linear one, DecisionTreeClassifier — the non-linear one, and SVC with a Gaussian kernel). Before feature selection we had the following cross_val_score results:
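The evaluation loop can be sketched as below. Since the article's cleaned.csv is not reproduced here, this sketch substitutes a synthetic classification dataset (make_classification is a stand-in, not the article's data), so the printed scores will differ from the article's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned Heart Disease data (13 features)
X, y = make_classification(n_samples=300, n_features=13,
                           n_informative=8, random_state=0)

models = {
    "LINEAR": LogisticRegression(max_iter=1000),  # the linear model
    "TREE": DecisionTreeClassifier(random_state=0),  # the non-linear model
    "SVC": SVC(kernel="rbf"),  # Gaussian kernel
}

# Mean 5-fold cross-validated accuracy for each model
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name} - {results[name]:.3f}")
```

Running the same loop once on all 13 columns and once on only the selected columns gives the before/after comparison reported below.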
LINEAR - 0.8346127946127947
TREE - 0.7681481481481482
SVC - 0.8345454545454546
After applying feature selection the scores were:
LINEAR - 0.838047138047138
TREE - 0.7718518518518518
SVC - 0.8418181818181818
So we gained some accuracy on the Tree and the SVC (almost 1%), and now we are using only 8 features instead of 13, which is respectable.
Created with ❤ by Sigmoid.
If you want to dive deeper into how point-biserial correlation works, I highly recommend the links at the end of the article. If you tried Kydavra, I invite you to leave some feedback and share your experience by responding to this form.