An Introduction to Data Cleaning

You collected 1,200 survey responses. That’s awesome, and hopefully it represents a solid sample of your target population. The catch is that you will likely run into problems with your data, such as missing fields, incomplete lines, type conversion errors, and more. It’s common for data analysts to spend a significant share of their time on data cleaning, frequently between 60% and 80%, so budget for it when planning your research steps.

What is Data Cleaning?

Data cleaning, also called data scrubbing, data cleansing, or data preparation, is the act of taking collected data and making it usable in your preferred statistical software. Cleaning includes removing bad data, creating correct labels and codes, and making everything consistent. Collecting some inconsistent data is sometimes unavoidable.
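To make "correct labels and codes" concrete, here is a minimal sketch in Python of turning inconsistently typed answers into one consistent code. The answers and the CODES mapping are made up for illustration; they are assumptions, not part of any particular survey tool.

```python
# Hypothetical raw answers to a yes/no question, typed inconsistently.
raw_answers = ["Yes", "yes ", "Y", "NO", "n", "No", "maybe"]

# Assumed mapping from tidied-up text to a single numeric code.
CODES = {"yes": 1, "y": 1, "no": 0, "n": 0}

def clean_answer(value):
    """Trim whitespace, lowercase, and map to a code; None marks bad data."""
    return CODES.get(value.strip().lower())

coded = [clean_answer(v) for v in raw_answers]
print(coded)  # [1, 1, 1, 0, 0, 0, None]
```

Anything that cannot be mapped comes back as None, so bad data stays visible instead of being quietly guessed at.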

It’s uncommon for data to be in a usable state as soon as it’s collected. Almost every analyst and researcher, whether in the field or in academia, has data that needs cleaning, so there’s no need to feel like you did a bad job at the collection stage.

Why is Data Cleaning Important?

While it’s true that data cleaning is often the most time-consuming part of any research endeavor, it’s a necessity: it makes the analysis easier in the end and produces results that are more accurate and more reliable.

Think of it like having a conversation with someone: if someone asks you where you were on a certain day, they may find it difficult to believe you if you keep saying “I don’t know” to their questions. Anyone will be much more confident in your answer if you can respond to every question thoroughly. The same goes for data: people will trust your analysis more if the data is clean, comprehensive, and free of missing cases.

How Should I Get Started with Data Cleaning?

There are two points at which you can approach this: before or after data collection. If you haven’t collected any data yet, you can avoid significant data cleaning by designing a good survey that leaves no room for problems or loopholes. If you aren’t collecting data through a survey, interviews, focus groups, etc., and are using existing data instead, make sure that your variables are properly defined beforehand.
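One way to picture "properly defined" variables is to write the rules down before collection and check each response against them. The sketch below is only an illustration in Python: the variable names, types, and ranges in VARIABLES are hypothetical and should be replaced with your own definitions.

```python
# Hypothetical variable definitions, written down before any data is collected.
VARIABLES = {
    "age": {"type": int, "min": 18, "max": 99},
    "satisfaction": {"type": int, "min": 1, "max": 5},
}

def check_response(response):
    """Return a list of problems in one response (an empty list means valid)."""
    problems = []
    for name, rule in VARIABLES.items():
        value = response.get(name)
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
        elif not rule["min"] <= value <= rule["max"]:
            problems.append(f"{name}: {value} is outside {rule['min']}-{rule['max']}")
    return problems

print(check_response({"age": 34, "satisfaction": 4}))     # no problems
print(check_response({"age": 340, "satisfaction": "4"}))  # two problems
```

Catching an out-of-range age or a number stored as text at this stage is far cheaper than discovering it halfway through the analysis.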

If the data is already collected, there are countless ways to clean it. A good first step is to check for missing data and keep only complete cases, using R or other statistical software. An incomplete case is essentially a row of data with a field missing, such as a survey response where someone did not enter their age. If age is important to your analysis, then it’s acceptable to remove that row, as leaving the value as N/A or 0 will only skew the results.
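As a rough sketch of what keeping complete cases looks like, the Python example below drops every row with a missing field and compares the average age with and without the missing values treated as 0. The sample data and the set of "missing" markers are assumptions made for illustration. In R, the base functions complete.cases() and na.omit() serve the same purpose.

```python
import csv
import io

# Hypothetical survey export: respondent 2 left age blank, respondent 4 wrote "N/A".
raw = """id,age,satisfaction
1,34,4
2,,5
3,51,2
4,N/A,3
"""

# Assumed markers that count as "missing" in this data set.
MISSING = {"", "na", "n/a"}

def is_missing(value):
    """True when a field holds one of the assumed missing markers."""
    return value.strip().lower() in MISSING

def is_complete(row):
    """A row is a complete case when none of its fields is missing."""
    return not any(is_missing(value) for value in row.values())

rows = list(csv.DictReader(io.StringIO(raw)))
complete = [row for row in rows if is_complete(row)]

# Mean age over complete cases only.
mean_complete = sum(int(r["age"]) for r in complete) / len(complete)

# Mean age if the missing values were left in as 0 instead.
ages_with_zeros = [0 if is_missing(r["age"]) else int(r["age"]) for r in rows]
mean_with_zeros = sum(ages_with_zeros) / len(ages_with_zeros)

print(len(rows), len(complete))  # 4 2
print(mean_complete)             # 42.5
print(mean_with_zeros)           # 21.25
```

Treating the two missing ages as 0 cuts the average age in half, which is exactly the kind of skew that removing incomplete rows avoids.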

We will be covering several specific examples of how you can clean data, including, among others, a closer look at cleaning for complete cases in R.

Already familiar with R and want to learn more specifics? Sound off in the comments and let us know how we can help!