Do you want to learn how to identify and handle outliers in your data? Well, then this is the place to start! In this article we lay the groundwork you need in order to work with outliers by discussing the different types of outliers and the most common sources of outliers.
At the beginning of this article, we discuss what outliers are and why you should care about them. After that, we go over the main types of outliers and anomalies that appear in datasets. We follow this up with a discussion of the most common sources of outliers and the importance of considering the source of your outliers.
What are outliers?
What are outliers? In short, outliers are anomalous observations that appear in your dataset. Anytime you are analyzing data, you should look for outliers and carefully consider how to handle them because sometimes even one anomalous observation can have a large impact on your analysis.
When most people think of outliers, they think of one specific type of outlier called a global outlier. In the next section we will talk about the most common types of outliers that appear in datasets to give you a better idea of the different types of outliers and how to think about each one.
Types of outliers
How many types of outliers are there? And what are the different types of outliers? There are three main types of outliers: global outliers, contextual outliers, and collective outliers. In this section we will provide additional information about each type of outlier.
What is a global outlier? A global outlier is an observation whose value for some variable is clearly anomalous on its own. You can identify it without looking at the values of other variables or at any specific subset of observations.
Global outliers are the type of outlier that first comes to mind when data practitioners hear the word outlier because global outliers are the easiest to spot. Most global outliers manifest as observations that have abnormally high or abnormally low values for a variable, so they can easily be spotted by looking at univariate plots of your data.
What is an example of a global outlier? Imagine you had data on the highest temperature recorded on each day of the year. A global outlier in this context would be a day where the temperature was much higher or lower than on the rest of the days. For example, if the temperature was -20 degrees Fahrenheit or 130 degrees Fahrenheit on one day, that would represent a global outlier.
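To make this concrete, here is a minimal sketch of one way you might flag global outliers in a list of daily high temperatures. It assumes a common rule of thumb that this article does not prescribe: treat any value more than 1.5 times the interquartile range outside the first or third quartile as an outlier. The function name `find_global_outliers` and the temperature values are made up for illustration.

```python
import statistics

def find_global_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the IQR fences.

    The 1.5 * IQR fence is an assumed rule of thumb, not the only option.
    """
    # statistics.quantiles requires Python 3.8 or newer
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical daily highs in degrees Fahrenheit
daily_highs = [68, 71, 73, 70, 69, 72, 130, 74, 70, -20, 71, 69]

outliers = find_global_outliers(daily_highs)
print(outliers)  # [(6, 130), (9, -20)]
```

Notice that the check only ever looks at one variable. That is exactly why it works for global outliers and why it is not enough for the other types of outliers.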
The next type of outlier we will talk about is contextual outliers. Contextual outliers are sometimes also called conditional outliers or multivariate outliers.
Contextual outliers are observations that do not look anomalous when you look at individual variables one by one, but do stand out as anomalous once you start to look at multivariate relationships between variables. The key to thinking about contextual outliers is realizing that one variable can provide “context” that further bounds the expected range of values for another variable.
So what is an example of a contextual outlier? Continuing on with our daily temperature example, an example of a contextual outlier is a day in the winter where the recorded temperature is 92 degrees Fahrenheit. 92 degrees is not an outrageous temperature on its own, so if you were only looking for univariate outliers you might miss it. However, once you condition on the fact that it is winter, it becomes clear that 92 degrees is an anomalous value.
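Here is a rough sketch of what conditioning on the season might look like in code. It assumes each observation comes with a season label and reuses the same 1.5 times interquartile range rule of thumb, applied separately within each season. The helper names `iqr_fences` and `find_contextual_outliers` and all of the data are hypothetical.

```python
import statistics
from collections import defaultdict

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_contextual_outliers(records, k=1.5):
    """Flag temperatures that are unusual for their own season.

    records is a list of (season, temperature) pairs.
    """
    by_season = defaultdict(list)
    for season, temp in records:
        by_season[season].append(temp)
    fences = {s: iqr_fences(temps, k) for s, temps in by_season.items()}
    return [
        (i, season, temp)
        for i, (season, temp) in enumerate(records)
        if not fences[season][0] <= temp <= fences[season][1]
    ]

# Hypothetical daily highs in degrees Fahrenheit, labeled by season
records = [("winter", t) for t in [30, 35, 28, 33, 92, 31, 29, 34]]
records += [("summer", t) for t in [88, 91, 85, 90, 93, 87, 89, 92]]

# Ignoring the season, 92 degrees sits comfortably inside the fences
pooled_low, pooled_high = iqr_fences([temp for _, temp in records])
print(pooled_low <= 92 <= pooled_high)  # True

# Conditioning on the season, the 92 degree winter day stands out
contextual = find_contextual_outliers(records)
print(contextual)  # [(4, 'winter', 92)]
```

The same 92 degree reading appears in the summer data without being flagged, which shows that it is the context, not the value itself, that makes the winter observation anomalous.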
The final type of outlier is a collective outlier. Collective outliers are groups of observations that do not look anomalous on their own, but start to look anomalous once you consider them all together. Unlike contextual outliers, which you can only spot by looking at the relationship between different variables associated with the same observation, collective outliers are outliers that you can only spot by looking at the relationship between different observations.
So what is an example of a collective outlier? Sticking with our temperature example, an example of a collective outlier would be a series of many days where the highest temperature recorded was exactly the same. For example, if the high temperature was 72 degrees Fahrenheit for 30 days straight, then that would be suspicious. 72 degrees is a perfectly average temperature that does not sound any alarm bells on its own, but the fact that there was no fluctuation in temperature over a 30-day window is suspicious.
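One simple way to look for this particular kind of collective outlier is to search for long runs of identical consecutive values. The sketch below does just that. The function name `find_constant_runs` and the 7 day threshold are arbitrary choices made for illustration, and this check only covers the repeated value pattern, not every kind of collective outlier.

```python
from itertools import groupby

def find_constant_runs(values, min_run=7):
    """Return (start_index, length, value) for each run of identical
    consecutive values that is at least min_run observations long."""
    runs = []
    index = 0
    for value, group in groupby(values):
        length = len(list(group))
        if length >= min_run:
            runs.append((index, length, value))
        index += length
    return runs

# Hypothetical daily highs: a few normal days, then 30 identical readings
daily_highs = [70, 68, 71] + [72] * 30 + [69, 73]

stuck_runs = find_constant_runs(daily_highs)
print(stuck_runs)  # [(3, 30, 72)]
```

Each 72 degree reading would pass a univariate check on its own. It is only the length of the run that gives the group away.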
Causes of outliers
In addition to just thinking about the types of outliers that exist, it is also important to think about the different conditions that might cause outliers to pop up. Here are three common causes of outliers in your dataset and the types of outliers these causes are most likely to introduce.
The first possible cause for outliers is just random noise. Sometimes unexpectedly high or low values appear due to randomness when there is no error or suspicious activity going on in the background. Randomness is most likely to cause global outliers, but it is also feasible that it could be responsible for contextual outliers.
Going back to our weather example, it is fully possible that the outlier temperature of -20 degrees Fahrenheit might be observed due to randomness and volatility in our weather systems.
Another common source of outliers in your data is measurement error. This is any error that leads to incorrect measurements being recorded for your variables, and it can take the shape of anything from human error to mechanical failures in sensors. Measurement error can cause any type of outlier, including global, contextual, and collective outliers.
So what is an example of measurement error? Returning to our collective outlier example where the recorded temperature was the same for 30 days straight, it is very possible that this group of outliers could be the result of an equipment or sensor failure. Rather than reading the correct temperature, the equipment may have been frozen at a certain temperature.
Another common reason that you might see outliers in your data is that you are failing to measure an important variable that moderates the relationships between other variables. Unmeasured confounders are most likely to cause observations to be marked as contextual outliers.
Going back to our previous example of a contextual outlier where there was an abnormally hot day in the winter, it is possible that there was another variable that was left unmeasured that caused that day to be abnormally hot. For example, maybe there was a wildfire near the temperature measuring station, and the station was picking up heat radiating off of the fire.
If you had included a variable that accounted for fires in your model, the abnormally hot day would not have seemed so abnormal. This is a great example of why it is important to consider the possible causes of your outliers before determining how to handle your outliers.
If you are ready to move on to learning about how to handle outliers, check out our article about removing and identifying outliers.