The issue with replacing missing values in your dataset


Everyone knows they need to replace missing values in their dataset. Most people, however, miss one critical step. Here is what you aren't doing and how you can fix it:

I'll start with an example. A company surveys a group of people. Some of them leave one particular question unanswered. After we collect the data, and before we use it to build a model, we must take care of these missing values.

For simplicity's sake, let's assume that replacing missing answers with the mean of all the other answers is a good approach. We can do that and solve the problem. Most people stop here. That's their mistake.

The fact that a participant chose not to answer a question could be as important as the answer itself! We want to know who didn't answer the question. Imputing the data will completely erase that information from the dataset. You should do something else instead:

Before you replace missing values, create a new column that flags every participant who didn't answer the question. In other words, the new column will have a value of 1 if the row has a missing answer and 0 otherwise. Only then should you impute the missing values.
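Here is a minimal sketch of that order of operations in plain Python. The survey rows, the `answer` column, and the `flag_then_impute` helper are all made up for illustration, and mean imputation is used only because it's the approach assumed in this example:

```python
from statistics import mean

# Hypothetical survey rows; None marks an unanswered question.
responses = [
    {"participant": "a", "answer": 4.0},
    {"participant": "b", "answer": None},
    {"participant": "c", "answer": 2.0},
    {"participant": "d", "answer": None},
    {"participant": "e", "answer": 3.0},
]


def flag_then_impute(rows, column):
    """Add a 0/1 missing-value flag for `column`, then fill the gaps with the mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = mean(observed)  # mean of the answers we actually have

    result = []
    for row in rows:
        is_missing = row[column] is None
        new_row = dict(row)  # copy, so the original rows stay untouched
        # Step 1: record who didn't answer, before that information is lost.
        new_row[f"{column}_missing"] = 1 if is_missing else 0
        # Step 2: only now replace the missing value.
        new_row[column] = fill_value if is_missing else row[column]
        result.append(new_row)
    return result


prepared = flag_then_impute(responses, "answer")
```

If you work with pandas, the same ordering would likely be `df["answer"].isna().astype(int)` for the flag, followed by `fillna` with the column mean. What matters is that the flag is computed before the fill, not after.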

This extra column will tell the model who didn't answer. That information may turn out to be important. It may also be useless. There's only one way to find out: train the model with and without the column and compare the results.
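Before training anything, one quick sanity check is to see whether the outcome you care about looks different for people who skipped the question. The sketch below assumes a hypothetical binary target called `churned` and uses made-up rows, so the numbers prove nothing on their own. It doesn't replace comparing a model with and without the flag on held-out data:

```python
from statistics import mean

# Made-up rows for illustration: the 0/1 missing flag and a hypothetical binary outcome.
rows = [
    {"answer_missing": 0, "churned": 0},
    {"answer_missing": 0, "churned": 0},
    {"answer_missing": 0, "churned": 1},
    {"answer_missing": 0, "churned": 0},
    {"answer_missing": 1, "churned": 1},
    {"answer_missing": 1, "churned": 1},
    {"answer_missing": 1, "churned": 0},
    {"answer_missing": 1, "churned": 1},
]


def outcome_rate_by_flag(rows, flag, target):
    """Return the average of `target` for flagged (1) and unflagged (0) rows."""
    groups = {0: [], 1: []}
    for row in rows:
        groups[row[flag]].append(row[target])
    # Skip a group with no rows so mean() never sees an empty list.
    return {value: mean(targets) for value, targets in groups.items() if targets}


rates = outcome_rate_by_flag(rows, "answer_missing", "churned")
print(rates)
```

A large gap between the two rates suggests the flag carries signal worth keeping. A small gap on its own doesn't mean the column is useless, since the flag could still matter in combination with other features.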

I wanted to summarize this post by reinforcing the need for that extra column. But I think I have better advice: Make sure you understand where your data is coming from and the reason you have missing values in the first place.

I also recorded a YouTube video on this topic for those who prefer video.

Every week, I break down machine learning concepts to give you ideas on applying them in real-life situations. Follow me @svpino to ensure you don't miss what's coming next.