How to Impute a Categorical Variable?

Pratik Kumar Roy
Mar 19, 2022

In this blog, I will explain how to handle missing values in a categorical variable.

First of all, let’s brush up on some basics.

Categorical/Discrete Variable: Any random variable that can only take values from a finite set is called a categorical or discrete random variable.

Example: In a dice throw, there are 6 possible outcomes, so the variable takes its value from the set {1,2,3,4,5,6}. Day of the week is also a categorical variable.

In this post, we will use insurance claim data from Kaggle.

Our dataset has missing (null) values, and the affected columns hold categorical data. So let’s start by getting the count of each category in each categorical column.
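Here is a minimal sketch of that first look at the data. The small DataFrame below is a made-up stand-in for the Kaggle data, and the column names `gender` and `vehicle_type` are my own assumptions, not the real column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the insurance claim data (column names are made up)
df = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F", "F", "M"],
    "vehicle_type": ["car", np.nan, "car", "bike", np.nan, "car"],
})

# Number of missing values in each column
null_counts = df.isnull().sum()
print(null_counts)

# Count of each category per column; dropna=False lists NaN as its own entry
category_counts = {col: df[col].value_counts(dropna=False) for col in df.columns}
for col, counts in category_counts.items():
    print(counts)
```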

Method 1: Dropping the nulls

If the number of nulls is very low, we can simply drop the rows that contain them using dropna().
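A small sketch of this, again on a made-up DataFrame (the columns `gender` and `claim_amount` are assumptions for illustration). Passing `subset` limits the check to the categorical column we care about, so rows are not dropped because of nulls elsewhere.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not the real Kaggle columns)
df = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F", "F", "M"],
    "claim_amount": [100, 250, 80, 300, 120, 90],
})

# Drop only the rows where 'gender' is null; dropna returns a new DataFrame
df_dropped = df.dropna(subset=["gender"])
print(df_dropped)
```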

Method 2: Replacing NaN with the most frequent category

This is a very simple method, but it has a disadvantage: filling every NaN with the most frequent category inflates that category’s share, which can lead to bias during prediction.

Let’s see how to implement this:

First, we will get the most frequent element using mode() and then we will replace nulls with this.
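A minimal sketch of these two steps on a made-up column (the name `gender` is an assumption). Note that mode() returns a Series, because there can be ties, so we take its first element.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", "F", "M"]})

# mode() returns a Series (there may be ties); take the first entry
most_frequent = df["gender"].mode()[0]

# Replace nulls with the most frequent category
df["gender_filled"] = df["gender"].fillna(most_frequent)
print(df)
```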

Method 3: Replace NaN with a new value (‘Unknown’)

Here we replace missing values with a new category, ‘Unknown’. This preserves the variance but doesn’t give good results if the percentage of missing data is high.
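This can be sketched with fillna() as follows. The DataFrame and the column name `vehicle_type` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"vehicle_type": ["car", np.nan, "car", "bike", np.nan, "car"]})

# Treat missing values as their own category
df["vehicle_type_filled"] = df["vehicle_type"].fillna("Unknown")
print(df["vehicle_type_filled"].value_counts())
```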

Each of the above methods has its own disadvantages. It all depends on our use case and business requirement.

One more method is to add an extra indicator variable that records which values were missing, and then fill the original variable with either the least frequent or the most frequent category.
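A rough sketch of this idea, assuming a made-up `gender` column and a hypothetical indicator name `gender_missing`. Here the original column is filled with the most frequent category; the least frequent one could be used in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", "F", "M"]})

# Indicator column: 1 where the value was missing, 0 otherwise.
# It must be created before the nulls are filled.
df["gender_missing"] = df["gender"].isnull().astype(int)

# Fill the original column with the most frequent category
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
print(df)
```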

We can also use scikit-learn’s built-in imputer.
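As a sketch, I am assuming here that the imputer meant is `SimpleImputer` from `sklearn.impute`, which accepts string columns when `strategy` is `"most_frequent"` or `"constant"`. The data and column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical example data
df = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F", "F", "M"],
    "vehicle_type": ["car", np.nan, "car", "bike", np.nan, "car"],
})

# Fill each column's nulls with that column's most frequent value
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(imputed)
```

Using `SimpleImputer(strategy="constant", fill_value="Unknown")` instead would give the same result as Method 3.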

One more option is to impute using the CategoricalImputer from the sklearn_pandas library.

Please feel free to ask questions in the comments. Also, if you know of any other methods for handling missing values in categorical variables, please mention them in the comments.
