A Beginner’s Guide To Cleaning Data In R — Part 1

Roshaan Ashraf · Published in Analytics Vidhya · 5 min read · Jul 1, 2020


My journey into the vast world of data has been a fun and enthralling ride. I have been glued to my courses, waiting to finish one so I can proceed to the next. After completing introductory courses, I made my way over to data cleaning. It is no secret that most of the effort in any data science project goes into cleaning the data set and tidying it up for analysis. Therefore, it is crucial to have substantial knowledge about this topic.

Firstly, to understand the need for clean data, we need to look at the workflow of a typical data science project. Data is first accessed, then manipulated and analyzed. Afterward, insights are extracted, and finally they are visualized and reported.


Errors and mistakes in data, if present, can generate errors throughout the entire workflow. Ultimately, the insights used to make critical business decisions end up being incorrect, which may lead to monetary and business losses. Thus, if untidy data is not tackled and corrected in the first step, the compounding effect can be immense.

This guide will serve as a quick onboarding tool for data cleaning by compiling the necessary functions and actions to take. I will briefly describe three common types of data errors and then explain how they can be identified in data sets and corrected. I will also introduce some powerful cleaning and manipulation libraries, including dplyr, stringr, and assertive. These can be installed by running the following code in RStudio (dplyr and stringr are included in the tidyverse):

install.packages("tidyverse")
install.packages("assertive")

1. Incorrect Data Type

When data is imported, there is a possibility that R incorrectly interprets a column’s type, or that the column was wrongly labeled during extraction. For example, a common error is a column of numbers being identified and labeled as a character type.

a) Identification
To identify incorrect data type errors, the glimpse function is used to check the data types of all columns. glimpse is part of the dplyr package, which must be installed and loaded before it can be used. It returns every column along with its data type and a preview of its first few values.

library(dplyr)
glimpse(dataset)

Another form of logical check is the is.* family of functions. There is one for each data type, and each returns a logical output (TRUE/FALSE). Only the most common ones are shown below, but a version exists for every data type. If a numeric column is passed to is.numeric, the output will be TRUE, while a character column passed to is.numeric will return FALSE.

is.numeric(column_name)    # TRUE if the column is numeric
is.character(column_name)  # TRUE if the column is character
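
As a quick illustration, here is a made-up vector of exam marks imported as text (the values are invented for this example):

marks <- c("87%", "45%", "12%")  # numbers imported as text
is.numeric(marks)                # FALSE
is.character(marks)              # TRUE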

b) Correction
After the columns with incorrect data types have been identified, they can simply be converted to the correct type using the as.* functions. For example, if a numeric column has been incorrectly imported as a character type, the as.numeric function will convert it back to numeric.

as.numeric(column_name)
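
In practice, the converted column needs to be written back into the data set. Here is a minimal sketch using dplyr’s mutate, assuming a data frame called dataset with a character column called percentage:

library(dplyr)

# overwrite the column with its numeric version
dataset <- mutate(dataset, percentage = as.numeric(percentage))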

2. Comma/Percentage Problems

In an untidy data set, numbers may be imported along with commas or percentage signs. While these characters make large numbers easier to read, their presence forces the column to be read as a character type.

a) Correction
In this case, the unnecessary characters must first be removed, and then the column converted to the correct data type. This is possible through stringr, a powerful string manipulation library. The str_remove function can be used to correct such an error: it takes the target column/string as the first argument and the pattern to be removed as the second.

library(stringr)
str_remove(column_name, ",")  # removes the first comma from each value
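
Note that str_remove only removes the first match in each value; for numbers containing several commas, such as "1,234,567", str_remove_all is the safer choice. Below is a minimal end-to-end sketch, assuming a hypothetical revenue column imported as text:

library(dplyr)
library(stringr)

# strip every comma, then convert the cleaned text to numeric
dataset <- mutate(dataset, revenue = as.numeric(str_remove_all(revenue, ",")))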

3. Out Of Range Values

In some cases, data is limited to a specific range. For example, the percentage scored in a test is limited to values between 0 and 100, i.e., the range is defined as 0 ≤ P ≤ 100. In this type of error, values fall outside the range and cannot be included in the analysis.

a) Identification
To estimate how many out-of-range values a data set has, we can use the assertive package, which provides check functions. The assert function takes the target column as the first argument, with the lower and upper bounds as named arguments. If every value lies within the range, the function returns silently; if any value falls outside it, it throws an error listing the offending values, which tells you how many out-of-range entries exist.

library(assertive)
assert_all_are_in_closed_range(column_name, lower = 0, upper = 100)  # errors if any value is outside [0, 100]
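
If you only want a count of the offending rows rather than an error, base R can tally them directly. A quick sketch, assuming a numeric column called percentage:

# number of values outside the 0-100 range
sum(dataset$percentage < 0 | dataset$percentage > 100, na.rm = TRUE)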

b) Correction
After the out-of-range values have been identified, there are several ways to tackle them. The values can either be removed, treated as missing values, or replaced with the range limits.

The most common solution is to remove the offending rows using the filter function from the dplyr package, keeping only the values that fall within the range.

filter(dataset, percentage >= 0, percentage <= 100)  # keep rows where percentage is in [0, 100]
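
In a typical dplyr pipeline, the same step is written with the pipe operator so the result can be assigned back to the data set:

library(dplyr)

dataset <- dataset %>%
  filter(percentage >= 0, percentage <= 100)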

To treat them as missing values, we replace the erroneous values with NA (“Not Available”). We can do this using the built-in replace function. The first argument is the target vector, the second is the condition, and the third is the replacement. The following input will replace all values less than 0 or greater than 100 with NA.

replace(dataset$percentage, dataset$percentage < 0, NA)    # below range becomes NA
replace(dataset$percentage, dataset$percentage > 100, NA)  # above range becomes NA

Finally, the out-of-range values can be replaced by the range limits. For example, we can replace an erroneous value of 104% with 100%. The following input will replace all values in the column greater than 100 with 100.

replace(dataset$percentage, dataset$percentage > 100, 100)  # clamp values above 100
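
Clamping both ends at once can also be done with base R’s pmin and pmax; a one-line sketch, assuming the same percentage column:

# values below 0 become 0, values above 100 become 100
dataset$percentage <- pmax(pmin(dataset$percentage, 100), 0)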

These are only three of the most common types of errors faced during data cleaning. I hope my simple approach of identification followed by correction will help newcomers tackle these effectively. In reality, many more exist that must be dealt with using extra caution. As I continue to learn about and tackle them, I will be sure to share similar stories to contribute to the data community.

Practice Resource

In order to practice what you’ve learned, I’ve created a simple practice data set titled “Percentages”, containing the marks for an exam, for you to try these techniques on. You will face the three types of common problems mentioned above. The final result should be a numeric data set with no percentage characters and no out-of-range values (0 ≤ Marks ≤ 100).

Download Link: https://mega.nz/file/cIoBHRDA#FsIYW9xw5Sx2WJco4Wuq3WAjys1UJOd6-bZvwQAoQC4

You might face some problems while cleaning this data set but that’s part of the learning process. Good luck!
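
If you get stuck, the steps above chain together into one short pipeline. Treat this only as a sketch: the file and column names below (Percentages.csv and Marks) are assumptions, so adjust them to match the actual download:

library(dplyr)
library(stringr)

# file name assumed; adjust to match the download
percentages <- read.csv("Percentages.csv", stringsAsFactors = FALSE)

percentages <- percentages %>%
  mutate(Marks = as.numeric(str_remove(Marks, "%"))) %>%  # strip "%" and convert to numeric
  filter(Marks >= 0, Marks <= 100)                        # drop out-of-range values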
