Do you want to learn about the differences between oversampling and undersampling in machine learning? Or maybe you want to learn when to use oversampling and when to use undersampling? Either way, you are in the right place. In this article, we explain the differences between oversampling and undersampling for machine learning.
We start out by explaining why you would need to oversample or undersample your data. After that, we provide more detail on what oversampling is and when oversampling is useful. We follow this up with a similar discussion of what undersampling is and why undersampling is useful. Finally, we provide a few heuristics that you can follow to determine whether to oversample or undersample your data.
When should you resample your data?
When should you resample the data that you are using for a machine learning project? In general, you should resample the data you are using for a machine learning project if you examine the distribution of a variable in your dataset and find that the distribution of that variable is highly skewed towards a specific range of values. In particular, you should resample your dataset if you anticipate that the skewed nature of the variable you are looking at will negatively impact your analysis.
And what does a skewed variable look like? If the variable you are looking at is categorical, you might find that most of the observations in your dataset take on one specific category. There may be very few observations in the dataset that take on any other category. If the variable you are looking at is numeric, you might observe that most values that are observed in your dataset fall within a small range of values. This is particularly concerning if you have reason to believe that the true distribution of values across the entire population takes on a larger range of values.
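As a rough illustration of the categorical case, the check described above can be sketched in a few lines of Python. The label values ("ok" and "fraud"), the 95/5 split, and the helper names `class_shares` and `imbalance_ratio` are all made up for this example; they are assumptions, not part of any particular library.

```python
from collections import Counter


def class_shares(labels):
    """Return each category's share of the observations, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def imbalance_ratio(labels):
    """Ratio of the most common category's count to the least common one's."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical outcome variable: 95 "ok" observations and 5 "fraud" observations.
labels = ["ok"] * 95 + ["fraud"] * 5

shares = class_shares(labels)
ratio = imbalance_ratio(labels)

print(shares)  # share of observations per category
print(ratio)   # 19.0: the majority class is 19 times larger than the minority class
```

How large a ratio counts as "highly skewed" depends on the problem, so treat any cutoff you pick as a judgment call rather than a rule.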
In general, it is most common to resample the data that you are using for a machine learning project if you find that the distribution of the outcome variable is highly skewed. That being said, there are some specific cases where it might make sense to resample your data based on the distribution of a feature or covariate that is used in your analysis.
Oversampling for machine learning
Now we will talk more specifically about oversampling data for machine learning models. We will start by discussing what oversampling is and how oversampling is performed. After that, we will talk about the advantages and disadvantages of oversampling for machine learning.
What is oversampling?
What is oversampling? Oversampling is a resampling scheme where you modify the distribution of a variable in your dataset by artificially increasing the number of observations that take on a particular value or range of values for that variable. In most cases, this is done by identifying which values are underrepresented in the dataset and artificially increasing the number of observations that take on those values.
There are two different types of methods that can be used to perform oversampling. The first type of method works by duplicating existing entries that are already present in the dataset to increase the presence of those entries. The second type of method works by adding some noise to the entries that already exist and creating new “synthetic” observations that resemble the existing observations.
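The two approaches described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a production implementation. The function names, the tiny two-feature dataset, the `target_count` parameter, and the Gaussian `noise_scale` are all assumptions made for the example. Both functions assume the minority class has at least one row.

```python
import random


def oversample_by_duplication(rows, labels, minority_label, target_count, seed=0):
    """Randomly duplicate minority rows until that class has target_count rows."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(rows, labels) if label == minority_label]
    n_extra = max(0, target_count - len(minority))
    # Copy each sampled row so the duplicates are independent list objects.
    extra = [list(rng.choice(minority)) for _ in range(n_extra)]
    return rows + extra, labels + [minority_label] * n_extra


def oversample_with_noise(rows, labels, minority_label, target_count,
                          noise_scale=0.05, seed=0):
    """Create synthetic minority rows by adding small Gaussian noise to existing ones."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(rows, labels) if label == minority_label]
    n_extra = max(0, target_count - len(minority))
    extra = [
        [value + rng.gauss(0.0, noise_scale) for value in rng.choice(minority)]
        for _ in range(n_extra)
    ]
    return rows + extra, labels + [minority_label] * n_extra


# Hypothetical dataset: four rows of class "a" and two rows of class "b".
rows = [[0.1, 1.0], [0.2, 1.1], [0.3, 0.9], [0.2, 1.0], [5.0, 7.0], [5.2, 6.8]]
labels = ["a", "a", "a", "a", "b", "b"]

dup_rows, dup_labels = oversample_by_duplication(rows, labels, "b", target_count=4)
noisy_rows, noisy_labels = oversample_with_noise(rows, labels, "b", target_count=4)

print(dup_labels.count("a"), dup_labels.count("b"))  # 4 4
print(noisy_rows[6:])  # the two synthetic "b" rows
```

Both functions return new lists and leave the original rows untouched, which matches the point that oversampling keeps every original record. In practice you would more likely reach for a library such as imbalanced-learn, which provides `RandomOverSampler` for duplication and `SMOTE` for synthetic observations. Note that SMOTE builds synthetic points by interpolating between neighbors rather than by adding noise as this sketch does.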
Advantages of oversampling for machine learning
What are some of the advantages of oversampling for machine learning? In this section, we will talk about the main advantages that oversampling has over other resampling schemes.
- Does not decrease the size of the dataset. One advantage that oversampling has over other resampling schemes is that it does not decrease the size of the dataset. This can be important if you are using a complicated model with many parameters that need to be fit, because it avoids shrinking the amount of data your model has to train on.
- Does not lose any information. Another advantage of oversampling is that it ensures that you do not lose any information that is already present in the dataset. This is because all of the records that are contained in the initial dataset are also contained in the resampled dataset.
Disadvantages of oversampling for machine learning
Now we will talk more about the disadvantages of oversampling for machine learning. Here are some of the main disadvantages of oversampling for machine learning.
- Can increase the chance of overfitting. One disadvantage of oversampling for machine learning is that it can increase the chance of your model overfitting to your training data. Whether you perform oversampling by introducing duplicated observations or synthetic observations that closely resemble existing observations, you are increasing the presence of characteristics in your dataset that were originally only applicable to a small number of observations. These characteristics might be specific details of a few specific observations rather than broadly applicable trends.
Undersampling for machine learning
Now we will talk about undersampling data for machine learning models. We will start out by describing what undersampling is and how undersampling works. After that, we will discuss some of the main advantages of undersampling. Finally, we will discuss some of the main disadvantages of undersampling.
What is undersampling?
What is undersampling? Undersampling is a resampling scheme where you modify the distribution of a variable in your dataset by artificially decreasing the number of observations that take on a particular value or range of values for that variable. This is done by identifying which values are overrepresented in the dataset and decreasing the number of observations that take on those values.
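A minimal sketch of random undersampling, again using only the Python standard library. The function name `undersample`, the toy dataset, and the `target_count` parameter are assumptions made for illustration. Randomly dropping rows is only one way to choose which observations to remove.

```python
import random


def undersample(rows, labels, majority_label, target_count, seed=0):
    """Randomly keep only target_count rows of the majority class.

    Rows from every other class are kept as they are, and the original
    row order is preserved.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(labels) if label == majority_label]
    n_keep = min(target_count, len(majority_idx))
    keep_majority = set(rng.sample(majority_idx, n_keep))
    keep = [
        i for i, label in enumerate(labels)
        if label != majority_label or i in keep_majority
    ]
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical dataset: eight rows of class "a" and two rows of class "b".
rows = [[i] for i in range(10)]
labels = ["a"] * 8 + ["b"] * 2

small_rows, small_labels = undersample(rows, labels, "a", target_count=2)

print(len(small_rows))                                   # 4
print(small_labels.count("a"), small_labels.count("b"))  # 2 2
```

The six "a" rows that are dropped here are gone from the resampled dataset. If you prefer a library implementation, imbalanced-learn provides `RandomUnderSampler` for the same idea.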
Advantages of undersampling for machine learning
What are some of the main advantages of undersampling? Here are some of the main advantages that undersampling has compared to oversampling.
- No need to introduce redundant information into your dataset. The main advantage that undersampling has is that you do not have to add any artificial observations to your dataset that introduce repeated or redundant information into your dataset. Why is this beneficial? This is beneficial because when you duplicate existing observations (or create close analogs of existing observations), you are making it seem like the patterns that are seen in those observations are more widespread than they really are. This can lead to things like models overfitting to specific patterns that were only seen in a few observations in the original dataset.
Disadvantages of undersampling for machine learning
What are some of the main disadvantages of undersampling for machine learning? Here are some of the main disadvantages that undersampling has compared to oversampling.
- Reduces the size of your dataset. The first disadvantage of undersampling for machine learning is that it reduces the size of your dataset. Machine learning models generally perform better when they are trained on larger datasets with more observations, so this can have negative effects on the predictive performance of your model.
- Loses information. The next disadvantage of undersampling is that there is some loss of information. When you permanently remove observations from your dataset, you will naturally lose the information that was contained within those observations.
Oversampling vs undersampling for machine learning
So when should you use oversampling for machine learning? And when would you be better off sticking with undersampling? In this section, we will provide guidelines you can follow to determine when to use oversampling or undersampling for machine learning.
- Oversample if you do not have a particular reason to undersample. The first guideline you can follow to determine whether to oversample or undersample your data is this: you should consider oversampling to be the default option and only reach for undersampling when you have a specific reason to do so. The decrease in data size and loss of information that you incur when you perform undersampling should be avoided when possible.
- Undersample when your data is very large. One example of a situation where you should opt for undersampling is when your data is very large. This is particularly true if you are going to need to reduce the size of your dataset anyway because training the model on the full dataset takes too much time or too many resources. In these situations, you are going to lose some data anyway, and the decrease in the size of your dataset can actually be viewed as an advantage rather than a disadvantage.
- Undersample if your model is overfitting to oversampled data. Another situation where you might want to try out undersampling is if you have fit a model on oversampled data and you see that your model is overfitting to the training dataset. Oversampling is known to increase the chances of overfitting, so changing the resampling scheme that you are using may help to reduce the impact of overfitting.
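The guidelines above can be condensed into a small decision helper. This is only a sketch of the heuristics in this section, not a rule that fits every project. The function name `choose_resampling` and the `max_trainable_rows` threshold are hypothetical, and you would need to pick the threshold based on your own time and resource budget.

```python
def choose_resampling(n_rows, max_trainable_rows, overfits_when_oversampled=False):
    """Suggest a resampling scheme based on the guidelines in this section.

    n_rows: number of observations in the dataset.
    max_trainable_rows: hypothetical limit on how many rows you can afford to train on.
    overfits_when_oversampled: True if a model fit on oversampled data overfit.
    """
    # The data would have to be cut down anyway, so dropping rows costs little.
    if n_rows > max_trainable_rows:
        return "undersample"
    # Oversampling has already been tried and led to overfitting.
    if overfits_when_oversampled:
        return "undersample"
    # Default: avoid shrinking the dataset and losing information.
    return "oversample"


print(choose_resampling(n_rows=5_000, max_trainable_rows=1_000_000))
print(choose_resampling(n_rows=50_000_000, max_trainable_rows=1_000_000))
print(choose_resampling(5_000, 1_000_000, overfits_when_oversampled=True))
```

The three calls print "oversample", "undersample", and "undersample" in that order.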
Other advice on building machine learning models
- Baseline models for machine learning
- Prototypes for machine learning models
- How to choose the right machine learning model
- How to improve a machine learning model