Are you wondering how to treat the outliers in your dataset? In this article we tell you everything you need to know about handling and removing outliers. First we discuss the most important considerations you should keep in mind when determining how to handle outliers. After that, we discuss different methods you can use to handle and remove outliers.
This article also includes a discussion of the different scenarios in which certain outlier handling methods are preferable to others. After reading this article, you will feel confident in your ability to choose how to handle outliers.
Considerations for handling outliers
What should you keep in mind when you are trying to decide how to handle outliers? Here are the main considerations.
- Type of outlier. The first thing you should consider when deciding how to handle your outliers is what kinds of outliers you are dealing with. We will go over the main types of outliers in the following section, but in short there are three different classes of outliers and some types of outliers should be treated differently than others.
- Source of outlier. The next consideration is the source of the outlier. Outliers can come from different sources and some outliers represent true, accurate values whereas others exist due to errors in data collection systems. The source of an outlier is a very important aspect to consider when determining how to handle an outlier.
- Metric robustness. A third consideration you should keep in mind when working with outliers is the metric that you plan to use in your analysis. If you plan to use robust metrics that are not heavily affected by outliers, such as medians, then you might decide to handle outliers differently than if you were using less robust metrics, such as means.
- Model robustness. Along the same lines, if you plan to do any statistical or machine learning modeling then you should also consider how robust your model is to outliers. If you are using a model like a random forest that is robust to outliers then you might handle outliers differently than if you are using a model like a linear regression that is not as robust.
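To make the point about metric robustness concrete, here is a minimal sketch that compares how far the mean and the median move when a single extreme value is added. The sample values are made up for illustration and only the Python standard library is used.

```python
import statistics

# Hypothetical sample: nine typical values plus one extreme value.
clean = [10, 12, 11, 13, 12, 11, 10, 12, 13]
with_outlier = clean + [500]

# How far does each metric move when the extreme value is included?
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(f"mean shift:   {mean_shift:.2f}")    # large: the mean is pulled toward 500
print(f"median shift: {median_shift:.2f}")  # the median does not move here
```

In this toy sample the median stays at 12 while the mean moves by dozens of units, which is why the choice of metric changes how much attention an outlier needs.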
Types of outliers
It is important to consider what type of outlier you are encountering before you decide how to handle it. In this section we will provide a quick overview of the different types of outliers. If you want to learn more about the different types of outliers, then you should check out our article on the types of outliers.
- Global outliers. Global outliers are observations that have extreme values that are easily recognizable on their own. You can recognize global outliers by looking at simple univariate plots of your data. You do not need to look at any other variables or observations to recognize them.
- Contextual outliers. Contextual outliers are outliers that only become apparent when you look at multivariate plots or statistics. They appear in cases where the value of one variable informs the range of expected values for another variable.
- Collective outliers. Collective outliers are observations that do not look anomalous when you look at them by themselves, but stand out as outliers when you look at a collection of observations together.
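The difference between global and contextual outliers can be sketched in a few lines. The example below assumes one common rule of thumb, flagging values that fall more than 1.5 interquartile ranges outside the quartiles, which is only one of many possible detection rules. The temperature readings and the helper names `iqr_fences` and `is_outlier` are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (low, high) fences placed k IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, reference, k=1.5):
    """Check whether value falls outside the fences of the reference values."""
    low, high = iqr_fences(reference, k)
    return value < low or value > high

# Hypothetical daily temperatures in degrees Celsius, grouped by season.
winter = [1, 3, 2, 0, 4, 2, 3, 1, 25]
summer = [24, 26, 25, 27, 23, 26, 25, 24, 26]
all_readings = winter + summer

# Looked at on its own, 25 is an ordinary value for the year as a whole.
flagged_globally = is_outlier(25, all_readings)
# Looked at in the context of winter, the same value stands out.
flagged_in_winter = is_outlier(25, winter)

print(flagged_globally, flagged_in_winter)
```

The univariate check misses the reading because 25 is a normal value for the year as a whole. It only looks anomalous once the season, a second variable, sets the expected range.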
Sources of outliers
The source of an outlier is also important to consider when deciding how to handle an outlier. We will provide a brief description of the different sources of outliers here, but if you want to learn more about the main sources of outliers then this article has a whole section on the main sources of outliers with real examples.
- Random noise. The first source of outliers is random noise. These are real pieces of data that appear due to randomness in a system. Random noise is most likely to cause global and contextual outliers.
- Measurement errors. The next source of outliers is measurement error. These are inaccurate pieces of data that appear due to errors in data measurement and data entry systems. Measurement error can cause global, contextual, and collective outliers.
- Unmeasured confounders. The third source of outliers is unmeasured confounders. This occurs when an unmeasured variable moderates the relationship between one variable and another. Unmeasured confounders can cause data points to appear as contextual outliers.
Handling & removing outliers
Now that we have discussed all of the considerations you should think through before handling and removing outliers, we will discuss how to deal with outliers in your dataset. For each outlier handling method, we will discuss cases in which that method is most appropriate.
In general, you should aim to modify your data as little as possible and remove as little information as possible. That being said, there are definitely cases where observations need to be modified or removed. The methods below are listed roughly in order of preference, with the less destructive methods that should be favored listed first.
Do not change the outlier
The first thing you should consider when you encounter outliers in your data is whether it is appropriate to keep the outlier in your data as is. Here are the main cases where you should do this.
- Real outlier and robust metrics. The first and most common case where you should keep the outlier in your data is when the outlier is real and you are using metrics or models that are robust to outliers. In this case, the outlier should not affect the outcome of your analysis too much so you should keep the outlier in your analysis.
- Contextual outlier and unmeasured confounder. The next case where you should not modify your data is when you have a contextual outlier that is caused by a previously unmeasured confounder that you can add to your dataset. Adding the confounder provides context about the anomalous value and helps to inform your model about why the value looks anomalous.
Modify the outlying values
When to modify outlying values
The next strategy you should consider is keeping the outlying observation in your data but modifying the values that are outliers. This method should be considered in the following situation.
- Anomalous values for only a few variables. The main case when you should keep an observation in your data but modify some of the values associated with it is when the observation only has anomalous values for a few variables. In this case, the observation should still contain a lot of useful information about the other variables that should be preserved.
How to modify outlying values
After you decide that you want to handle an outlier by modifying the values for one or two variables you have to decide the most appropriate way to modify the outlying values.
- Normalize or rescale a variable. The first thing you should consider doing is normalizing or rescaling a variable. This is a good strategy to follow if you have one or two variables for which there are many observations that have outlying values. If the outlying values are all abnormally high, then you can apply a transformation that reins in high values, such as a log transformation, to all of the observations in your dataset. If there are both abnormally high and abnormally low values, you might bucket the values and treat the variable as categorical.
- Truncate to a minimum or maximum value. Your next option is truncating extreme values to a predetermined value. For example, you might choose the maximum possible value you want to appear in your dataset as 100 and truncate any values that are higher than this to 100. This is a good strategy if there are only a few outlying values for a variable, but should not be done if there are many outlying values.
- Remove and treat as missing data. As a final option, you can also remove the outlying values entirely and treat them as missing data. This is most appropriate if the value is clearly incorrect and caused by measurement error. This option is most convenient if your dataset already contains missing data that you need to handle anyway.
Remove the entire outlying observation
The final option that you have for handling outliers is to remove the observation from the dataset entirely. Here are the scenarios when this is the best option.
- Many outlying values. If an observation has outlying values for multiple variables, then it may be appropriate to remove that observation from the dataset. The idea here is that if you are going to have to modify the values for many variables, the observation is going to contribute little real information to your analysis. This strategy is appropriate even in cases where you believe the outlying values to be real (but you are not using robust metrics or models).
- Suspicious or incorrect values. Another situation where it makes sense to remove an outlying observation from your dataset entirely is when you have suspicions that the values associated with that observation might be incorrect. For example, if an observation is part of a set of collective outliers that you think might be caused by measurement error then it might be appropriate to remove it.
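As a rough sketch of the "many outlying values" rule, the snippet below counts how many variables in each observation fall outside an expected range and drops observations that exceed a threshold. The per-variable bounds, the threshold of one outlying value, and the helper names are assumptions chosen for illustration. In practice they would come from your own outlier detection step.

```python
def count_outlying(row, bounds):
    """Count the variables in row whose value falls outside its (low, high) range."""
    return sum(
        1 for name, (low, high) in bounds.items()
        if not low <= row[name] <= high
    )

def drop_outlying_rows(rows, bounds, max_outlying=1):
    """Keep only rows with at most max_outlying outlying values."""
    return [row for row in rows if count_outlying(row, bounds) <= max_outlying]

# Hypothetical expected ranges for each variable.
bounds = {"age": (0, 120), "income": (0, 500), "visits": (0, 50)}

# Hypothetical observations.
rows = [
    {"age": 34, "income": 52, "visits": 3},     # no outlying values
    {"age": 41, "income": 950, "visits": 4},    # one outlying value: kept
    {"age": 250, "income": -5, "visits": 400},  # three outlying values: dropped
]

kept = drop_outlying_rows(rows, bounds, max_outlying=1)
print(len(kept))  # 2
```

The second observation is kept because only one of its values is outlying, so it is a candidate for the value-level fixes described earlier rather than for removal.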