Often we have features that are strongly correlated with the target column. However, sometimes they are also correlated with each other, which creates the problem of multicollinearity. One way to deal with it is to drop one of these columns. But we at Sigmoid want to propose a new solution to this problem, implemented in kydavra.
Using LDAReducer from the Kydavra library.
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique that reduces your data frame to a predefined number n of columns. Unlike PCA, however, it takes the target vector into account.
At Sigmoid, we wondered what would happen if, instead of reducing the whole data frame to some n columns, we took the cluster of columns that are strongly correlated with each other and reduced just them to one column. That is how LDAReducer appeared.
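The idea can be sketched with scikit-learn's own LDA. This is a minimal illustration on synthetic data, not kydavra's implementation: the helper `reduce_cluster`, the column names `a`, `b`, `c` and the data itself are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def reduce_cluster(df, cluster, target):
    """Hypothetical helper: replace the columns in `cluster` with one LDA component."""
    lda = LinearDiscriminantAnalysis(n_components=1)
    # LDA is supervised: it uses the target to choose the projection.
    component = lda.fit_transform(df[cluster], df[target])
    reduced = df.drop(columns=cluster)
    # Name the new column by joining the names of the columns it replaces.
    reduced["_".join(cluster)] = component[:, 0]
    return reduced


# Synthetic data (assumption): `a` and `b` are strongly correlated with each
# other and with the target, `c` is independent noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
a = y + rng.normal(scale=0.8, size=200)
b = a + rng.normal(scale=0.3, size=200)
demo = pd.DataFrame({"a": a, "b": b, "c": rng.normal(size=200), "target": y})

reduced = reduce_cluster(demo, ["a", "b"], "target")
print(list(reduced.columns))  # ['c', 'target', 'a_b']
```

Only the correlated cluster is projected; every other column is left untouched, which is the difference from running LDA or PCA over the whole data frame.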
Now let’s see how it works. First, as usual, install kydavra if you haven’t done so already (make sure you have version 0.2.x).
pip install kydavra
Next, we create an object and apply it to the Heart Disease UCI dataset.
from kydavra import LDAReducer
ldar = LDAReducer()
new_df = ldar.reduce(df, 'target')
Applying the reducer with its default settings to the Heart Disease UCI dataset will not change the data frame. This is because no feature has a correlation with the target higher than 0.5. That’s why we highly recommend that you play around with the parameters of the reducer:
- solver (str, [‘svd’, ‘lsqr’, ‘eigen’], default=’svd’) the solver used by the LDA algorithm to reduce the columns.
- method (str, [‘pearson’, ‘kendall’, ‘spearman’], default=’pearson’) the method to compute the correlation matrix.
- min_corr (float, between 0 and 1, default=0.5) the minimal value of the correlation coefficient to be selected for reduction.
- max_corr (float, between 0 and 1, default=0.8) the maximal value of the correlation coefficient to be selected for reduction.
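To get a feel for what the min_corr and max_corr window means, here is a small sketch using pandas. It assumes that the thresholds are applied to the absolute correlation of each feature with the target and that both bounds are inclusive; the helper `candidate_columns` and the toy columns are hypothetical, and this is not kydavra's actual selection code.

```python
import pandas as pd


def candidate_columns(df, target, min_corr=0.5, max_corr=0.8, method="pearson"):
    """Hypothetical helper: features whose |correlation| with the target is in the window."""
    # Absolute correlation of every feature with the target column.
    corr = df.corr(method=method)[target].drop(target).abs()
    # Assumption: both bounds are inclusive.
    return [col for col in corr.index if min_corr <= corr[col] <= max_corr]


# Toy data (assumption): correlations with the target are 1.0, 0.5 and 0.0.
demo = pd.DataFrame({
    "strong": [0, 0, 0, 0, 1, 1, 1, 1],
    "medium": [0, 0, 0, 1, 0, 1, 1, 1],
    "weak":   [0, 1, 0, 1, 0, 1, 0, 1],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})

print(candidate_columns(demo, "target", min_corr=0.4, max_corr=0.7))  # ['medium']
```

Under these assumptions, widening or shifting the window changes which features are eligible for reduction, which is why the defaults can leave a data frame unchanged.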
The reducer constructed below will reduce the data frame
ldar = LDAReducer(min_corr=0.4, max_corr=0.7)
to the following columns:
['sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'oldpeak', 'ca', 'thal', 'target', 'age_exang_slope_thalach']
The column whose name is joined with ‘_’ is the one formed by reducing the columns named in it (age, exang, slope and thalach). This transformation can improve the accuracy of a model. Below is the cross_val_score before and after we applied LDAReducer:
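A before/after comparison of this kind could be set up as in the following sketch. It uses synthetic data instead of the Heart Disease dataset and scikit-learn's LDA inside a pipeline instead of LDAReducer, so the scores it prints say nothing about kydavra itself; the column indices and the choice of LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): columns 0 and 1 are strongly correlated,
# column 2 is independent noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
a = y + rng.normal(scale=0.8, size=300)
b = a + rng.normal(scale=0.3, size=300)
X = np.column_stack([a, b, rng.normal(size=300)])

# "Before": the model sees all original columns.
baseline = LogisticRegression()

# "After": columns 0 and 1 are reduced to one LDA component, the rest pass through.
# Keeping the reduction inside the pipeline means it is refit on each training fold.
reduced = make_pipeline(
    ColumnTransformer(
        [("lda", LinearDiscriminantAnalysis(n_components=1), [0, 1])],
        remainder="passthrough",
    ),
    LogisticRegression(),
)

scores_before = cross_val_score(baseline, X, y, cv=5)
scores_after = cross_val_score(reduced, X, y, cv=5)
print("before:", scores_before.mean())
print("after: ", scores_after.mean())
```

Fitting the reduction inside each fold avoids leaking target information from the validation split into the reduced column.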
We can also reuse this reducer and apply it again to similar data frames:
new_df = ldar.apply(df)
If you have used or tried Kydavra, we warmly invite you to fill out this form and share your experience.
Made with ❤ by Sigmoid.