Split your data into clusters with imperio ClusterizeTransformer

Vladimir Stojoc

Jul 15, 2021 • 3 min read

Feature engineering is the process of transforming your input data so that it is more representative for machine learning algorithms. However, it is often neglected because there is no easy-to-use package for it. That’s why we decided to create one: imperio, our third unforgivable curse.

The technique discussed in this article doesn’t actually transform your existing data; it adds a new column to it, which can be very helpful to your model. The transformer that implements it is called ClusterizeTransformer.

How does ClusterizeTransformer work?

If you have ever heard of unsupervised learning, this technique won’t be hard to follow. Unsupervised learning is a family of algorithms that learn patterns from data without labels, that is, without a target column. Clustering is one such family of techniques: it groups data points based only on the input data. One of the best-known clustering algorithms is KMeans.

I won’t explain in detail how KMeans and other clustering algorithms work. In short, a clustering algorithm finds a way to divide your data into groups, and the number of groups is usually specified by the user. In the example shown here, the data was divided into three clusters: red, blue, and green.

After the clustering algorithm is applied, it returns a label for each data point, for example 1 for green, 2 for red, and 3 for blue. This gives us a new column that holds the cluster number of every data point.
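To make the idea concrete, here is a minimal sketch that uses scikit-learn's KMeans directly, without imperio. The toy points are made up for illustration, and note that scikit-learn numbers clusters starting from 0 rather than 1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): three well-separated groups of 2-D points.
X = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],      # group near (0, 0)
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],      # group near (5, 5)
    [10.0, 0.1], [10.2, 0.0], [9.9, 0.2],    # group near (10, 0)
])

# Ask KMeans for three clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one integer label per row

# Append the labels as an extra column next to the original features.
X_with_cluster = np.column_stack([X, labels])
print(X_with_cluster.shape)  # (9, 3)
```

The original two feature columns are untouched; the third column is the cluster number that a downstream model can use as an additional feature.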

Using imperio ClusterizeTransformer:

All transformers from imperio follow the transformer API of scikit-learn, which makes them fully compatible with scikit-learn pipelines. If you haven’t installed the library yet, you can do so with the following command:

pip install imperio

Now you can import the algorithm, fit it and transform some data.

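from sklearn.cluster import KMeans
from imperio import ClusterizeTransformer

# X is assumed to be a numeric feature matrix (e.g. a NumPy array)
kmeans = KMeans(n_clusters=2)
cluster = ClusterizeTransformer(kmeans)
X_transformed = cluster.fit_transform(X)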

As mentioned, it can easily be used in a scikit-learn pipeline.

from sklearn.pipeline import Pipeline
from imperio import ClusterizeTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

pipe = Pipeline([
    ('std', StandardScaler()),
    ('cluster', ClusterizeTransformer(KMeans(n_clusters=5))),
    ('model', LogisticRegression())
])

Besides the scikit-learn API, imperio transformers have an additional function that allows the transformer to be applied to a pandas DataFrame.

new_df = cluster.apply(df, target = 'target')

The ClusterizeTransformer constructor has the following arguments:

algorithm (Object): The instance of the algorithm that will do the clustering.
column_index (list, default = None): The list of indexes of the columns to apply the transformer on. If set to None it will be applied to all columns.
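To illustrate how these two arguments could interact, here is a rough sketch of a scikit-learn-style transformer. The class name ClusterColumnAppender is hypothetical and this is not imperio's actual implementation; it only mirrors the documented arguments, and it assumes the clustering algorithm exposes predict, as KMeans does.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.cluster import KMeans


class ClusterColumnAppender(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in (not imperio's code): appends cluster labels as a column."""

    def __init__(self, algorithm, column_index=None):
        self.algorithm = algorithm
        self.column_index = column_index

    def _select(self, X):
        # Use every column when column_index is None, else only the listed ones.
        X = np.asarray(X, dtype=float)
        return X if self.column_index is None else X[:, self.column_index]

    def fit(self, X, y=None):
        # Fit a fresh copy so the estimator passed in by the user stays untouched.
        self.algorithm_ = clone(self.algorithm).fit(self._select(X))
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        labels = self.algorithm_.predict(self._select(X))
        # Keep all original columns and add the cluster label as the last one.
        return np.column_stack([X, labels])


# Toy data (made up): cluster on columns 0 and 1 only, ignoring column 2.
X = np.array([[0.0, 1.0, 9.0], [0.1, 1.1, 3.0], [5.0, 6.0, 9.0], [5.1, 6.1, 3.0]])
appender = ClusterColumnAppender(
    KMeans(n_clusters=2, n_init=10, random_state=0), column_index=[0, 1]
)
X_new = appender.fit_transform(X)
print(X_new.shape)  # (4, 4)
```

Because column 2 is excluded through column_index, the first two rows fall into one cluster and the last two into the other, even though column 2 alone would pair them differently.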

The apply function has the following arguments:

df (pd.DataFrame): The pandas DataFrame on which the transformer should be applied.
target (str): The name of the target column.
columns (list, default = None): The list with the names of the columns on which the transformer should be applied.
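As a rough picture of what a DataFrame helper with these arguments might do, here is a sketch built only on pandas and scikit-learn. The function add_cluster_column, the output column name "cluster", the n_clusters parameter, and the toy values are all assumptions for illustration, not imperio's apply. It also assumes the target column is left out of the clustering.

```python
import pandas as pd
from sklearn.cluster import KMeans


def add_cluster_column(df, target, columns=None, n_clusters=2):
    """Hypothetical helper (not imperio's apply): returns a copy of df with a 'cluster' column."""
    # Assumption: default to every column except the target,
    # so the label never leaks into the clustering step.
    if columns is None:
        columns = [c for c in df.columns if c != target]
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    out = df.copy()  # leave the caller's DataFrame unchanged
    out["cluster"] = model.fit_predict(df[columns])
    return out


# Toy data frame (values made up for illustration).
df = pd.DataFrame({
    "age": [29, 31, 62, 65],
    "chol": [180, 190, 280, 290],
    "target": [0, 0, 1, 1],
})
new_df = add_cluster_column(df, target="target")
print(list(new_df.columns))  # ['age', 'chol', 'target', 'cluster']
```

Returning a copy keeps the original data frame intact, which makes it easy to compare a model trained with and without the extra column.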

Now let’s test it on Heart Disease, a classic machine learning dataset. Note that the transformer can be applied only to numeric data, so select the columns if needed.

As we can observe from the confusion matrices, we initially got 75% accuracy, and after applying ClusterizeTransformer we obtained 93% accuracy.

Thank you for reading!

Follow Sigmoid on Facebook, Instagram, and LinkedIn:

https://www.facebook.com/sigmoidAI

https://www.instagram.com/sigmo.ai/

https://www.linkedin.com/company/sigmoid/
