Categorical values are usually replaced with integers. However, this approach is risky for linear models, because the arbitrary numbers can introduce a false correlation. One Hot Encoding (or dummy variables) is a step forward from this technique, but it has its drawbacks too: the matrix becomes very large and, even worse, sparse. That’s why we decided to add a method for handling columns with a lot of categories: FrequencyImputationTransformer.
How does Frequency Imputation work?
The idea behind Frequency Imputation is very simple: you replace each category with its frequency in the column. Suppose half of the values in a column are category A, 30% are category B and the rest are category C. Then A will be replaced with 0.5, B with 0.3 and C with 0.2.
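The mapping described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not imperio's implementation; the `frequency_map` helper and the sample column are hypothetical.

```python
from collections import Counter


def frequency_map(values):
    """Return {category: relative frequency} for a list of categories."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Hypothetical column: 50% A, 30% B, 20% C.
column = ["A"] * 5 + ["B"] * 3 + ["C"] * 2

mapping = frequency_map(column)
# Replace every category with its frequency in the column.
encoded = [mapping[value] for value in column]

print(mapping)  # {'A': 0.5, 'B': 0.3, 'C': 0.2}
```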
The assumption behind this method is that a more frequent category is probably more important for the prediction, so it gets a higher score.
This method fails when two or more categories have the same frequency: they are replaced with the same value and can no longer be told apart, so be careful.
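The collision is easy to reproduce on a toy column. The sketch below uses a made-up column in which B and C occur equally often; it illustrates the limitation and is not code from imperio.

```python
from collections import Counter

# Hypothetical column where B and C occur equally often.
column = ["A", "A", "B", "C"]

counts = Counter(column)
mapping = {category: count / len(column) for category, count in counts.items()}
encoded = [mapping[value] for value in column]

# B and C both map to 0.25, so the encoded column cannot tell them apart.
collided = mapping["B"] == mapping["C"]
distinct_before = len(set(column))   # 3 categories in the raw column
distinct_after = len(set(encoded))   # only 2 distinct values after encoding
```

If the colliding categories behave differently with respect to the target, that difference is lost after encoding.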
Using imperio FrequencyImputationTransformer:
All transformers from imperio follow the transformer API from scikit-learn, which makes them fully compatible with scikit-learn pipelines. If you haven’t installed the library yet, you can do so with the following command:
pip install imperio
Now you can import the transformer, fit it and transform some data. Don’t forget to indicate the indexes of the categorical columns!
from imperio import FrequencyImputationTransformer

freq = FrequencyImputationTransformer(index=[2, 6, 8, 10, 11, 12])
freq.fit(X_train, y_train)
X_transformed = freq.transform(X_test)
Also, you can fit and transform the data at the same time.
X_transformed = freq.fit_transform(X_train, y_train)
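To make the fit/transform split concrete, here is a minimal pure-Python sketch of a frequency encoder with the same method names. It is not imperio's code: the class name `FrequencyEncoderSketch`, the list-of-rows input, and the choice to map categories unseen during fitting to 0.0 are all assumptions made for illustration.

```python
from collections import Counter


class FrequencyEncoderSketch:
    """Hypothetical stand-in that mimics the fit/transform convention."""

    def __init__(self, index):
        self.index = list(index)  # positions of the categorical columns

    def fit(self, X, y=None):
        # Learn one {category: frequency} map per selected column.
        self.maps_ = {}
        for i in self.index:
            counts = Counter(row[i] for row in X)
            self.maps_[i] = {cat: n / len(X) for cat, n in counts.items()}
        return self

    def transform(self, X):
        # Replace categories with the frequencies learned in fit();
        # unseen categories become 0.0 (an assumption of this sketch).
        out = []
        for row in X:
            new_row = list(row)
            for i in self.index:
                new_row[i] = self.maps_[i].get(row[i], 0.0)
            out.append(new_row)
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X_train = [[1.0, "A"], [2.0, "A"], [3.0, "B"], [4.0, "C"]]
X_test = [[5.0, "B"], [6.0, "D"]]  # "D" never appears in X_train

enc = FrequencyEncoderSketch(index=[1]).fit(X_train)
X_test_encoded = enc.transform(X_test)
```

In this sketch the frequencies come from the training data only, so the test set is encoded with the same mapping that was learned during fitting.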
As mentioned, it can easily be used in a scikit-learn pipeline.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(
    [
        ('freq', FrequencyImputationTransformer(index=[10, 11, 12])),
        ('model', LogisticRegression())
    ]
)
Besides the scikit-learn API, imperio transformers have an additional function that allows the transformer to be applied to a pandas data frame.
new_df = freq.apply(df, target = 'target', columns = ['col1'])
The FrequencyImputationTransformer constructor has the following arguments:
- index (list, default = ‘auto’): The list of indexes of the columns to apply the transformer on. If set to ‘auto’ it will find the categorical columns by itself.
- min_int_freq (int, default = 5): The minimal number of categories that a column must have to be transformed. Used only when the index is set to ‘auto’.
The apply function has the following arguments:
- df (pd.DataFrame): The pandas DataFrame on which the transformer should be applied.
- target (str): The name of the target column.
- columns (list, default = None): The list with the names of the columns on which the transformer should be applied.
Now let’s apply it to the Heart Disease UCI data set. We will apply it only to the non-binary columns: ‘cp’, ‘restecg’, ‘exang’, ‘slope’, ‘ca’, and ‘thal’. The confusion matrices before and after applying the transformation are illustrated below. Accuracy improved by 3%.
Made with ❤ by Sigmoid.