Filter out the noise from your data with Kydavra PCAFilter

Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques. However, few know that it has a very interesting property: the reduced data can be brought back to the original dimension. Even better, the data brought back to its original size is cleaner, because part of the noise is lost along the way. So, at Sigmoid we decided to create a module that makes it easy to apply this property to pandas data frames.

Using PCAFilter from the Kydavra library.

Principal Component Analysis is a dimensionality reduction technique that reduces your data frame to n predefined columns. However, unlike LDA, it doesn’t take the target vector into account.
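To make the difference concrete, here is a minimal sketch using scikit-learn rather than kydavra. The random data and the binary target are made up purely for illustration; the point is only that PCA is fitted on the features alone, while LDA also needs the target.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # made-up feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up binary target

# PCA is unsupervised: it is fitted on the features only.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it needs the target to find class-separating directions.
# With two classes it can produce at most one component.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```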

Now let’s see how it works. First, as usual, install kydavra if you haven’t done so already (make sure you have version 0.2.x).

pip install kydavra

Next, we create an object and apply it to the Heart Disease UCI dataset.

from kydavra import PCAFilter
# df is a pandas DataFrame holding the dataset, with the label in the 'target' column
filt = PCAFilter()
new_df = filt.filter(df, 'target')

Applying the filter with its default settings to the Heart Disease UCI dataset will not change the data frame much, so I recommend standardizing the data before passing it to the filter.
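One possible way to standardize the data is sketched below. It uses scikit-learn's StandardScaler; the helper name standardize_features and the toy values are hypothetical and only serve the illustration. The target column is left untouched and kept as the last column.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize_features(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return a new frame with every non-target column scaled to zero mean and unit variance."""
    features = df.drop(columns=target)
    scaled_values = StandardScaler().fit_transform(features)
    scaled = pd.DataFrame(scaled_values, columns=features.columns, index=df.index)
    scaled[target] = df[target]  # the target is copied over unchanged, as the last column
    return scaled

# Toy stand-in for the real dataset (values are made up for illustration).
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})
std_df = standardize_features(toy_df, "target")
print(std_df)
```

The resulting frame can then be passed to the filter in place of the raw one.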

Also, we highly recommend searching for the best value of n_components:

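import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

logit = LogisticRegression()  # assumed: the article does not show how logit was defined
n_features = df.shape[1] - 1  # every column except 'target'
for i in range(1, n_features):
    filt = PCAFilter(n_components=i)
    new_df = filt.filter(df, 'target')
    X = new_df.drop(columns='target').values
    y = new_df['target'].values
    print(f"{i} - {np.mean(cross_val_score(logit, X, y))}")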

Run on standardized data, this code gives the following result:

1 - 0.801952861952862
2 - 0.7945454545454546
3 - 0.8092255892255892
4 - 0.8383164983164983
5 - 0.841952861952862
6 - 0.8274074074074076
7 - 0.8236363636363636
8 - 0.8200673400673402
9 - 0.8200673400673402
10 - 0.8274074074074076
11 - 0.842087542087542
12 - 0.8311111111111111

As we can see, the best number of components is 11. The cross_val_score of the model without filtering and without standardizing the data is 0.823. By applying other selectors from kydavra you can raise the accuracy even higher.

How does it work?

PCAFilter uses the PCA algorithm internally. First, the filter lowers the dimensionality of the data (without the target column). Then it brings the data back to its original dimensions; in this way, a lot of the noise goes away.
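The two steps can be sketched with scikit-learn's PCA directly. This is not kydavra's actual source code, only an illustration of the reduce-then-restore idea on synthetic data, where the clean signal is known and the noise is added by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples lying close to a 2-D plane inside a 5-D space.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
clean = latent @ mixing
noisy = clean + 0.1 * rng.normal(size=clean.shape)

# Step 1: project onto the leading principal components (5 -> 2 dimensions).
pca = PCA(n_components=2)
reduced = pca.fit_transform(noisy)

# Step 2: map the reduced data back to the original 5 dimensions.
restored = pca.inverse_transform(reduced)

# Compare how far the noisy and the restored data are from the clean signal.
err_noisy = np.mean((noisy - clean) ** 2)
err_restored = np.mean((restored - clean) ** 2)
print(restored.shape, err_noisy, err_restored)
```

In this synthetic setup the restored data keeps the original shape and sits closer to the clean signal than the noisy input does, because the discarded components carry mostly noise. On real data the clean signal is unknown, which is why the effect is measured indirectly through the cross-validation score.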

If you have used or tried Kydavra, we warmly invite you to fill out this form and share your experience.

Made with ❤ by Sigmoid.
