Are you wondering what overfitting is in machine learning? Or perhaps you are wondering how to identify and treat machine learning models that are overfitting? Either way, you are in the right place! In this article we tell you everything you need to know to understand what overfitting is, why it is a problem, and what options you have to fix it.
What is overfitting in machine learning?
What is overfitting? Overfitting is a phenomenon that occurs when a machine learning model pays too much attention to specific details of the data that was used to train it. Rather than focusing on broader trends that generalize across the data, the model focuses on specific details that are relevant to just a few training examples. By focusing on the specific details of a few observations, the model is more likely to pick up on random noise rather than any real signal that exists in the data.
Why is overfitting a problem in machine learning?
Why is overfitting a problem? Overfitting is a problem because machine learning models are generally trained with the intention of making predictions on unseen data, that is, data that was not used for model training. If a model overfits to the training data, it is not able to make good predictions on unseen data. This means that the model cannot be used for the intended use case.
How to detect when a model is overfitting?
How do you detect overfitting in machine learning models? In order to determine whether your machine learning model is overfitting, you have to compare how your model performs on the data it saw during training to how it performs on data it did not see during training.
This is generally done by splitting your dataset into a training dataset that is used during model training and a test dataset that is not touched until the final model has been selected. Usually around 80% of your total data should be allocated to the training dataset and 20% should be allocated to the test dataset.
After you train your model on your training dataset, you can use the model to make predictions on both the training dataset and the test dataset. Performance metrics like root mean square error and log loss can be applied to assess performance on the training data and test data separately.
If you see that your model makes much better predictions on the training data than it does on the test data, then that is a sign that your model is overfitting to the training data.
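The comparison described above can be sketched in a few lines of Python. This is a minimal, self-contained example using numpy and synthetic data: it simulates a simple linear trend with noise, deliberately fits an over-complex model (a degree-15 polynomial), and compares root mean square error on the training split versus the held-out test split. All of the data and the degree-15 choice are illustrative assumptions, not a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy data from a simple underlying linear trend
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x + rng.normal(scale=1.0, size=100)

# 80/20 train/test split
x_train, x_test = x[:80], x[80:]
y_train, y_test = y[:80], y[80:]

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# An intentionally over-complex model: a degree-15 polynomial
# fit to data that actually follows a straight line
coeffs = np.polyfit(x_train, y_train, deg=15)
train_rmse = rmse(y_train, np.polyval(coeffs, x_train))
test_rmse = rmse(y_test, np.polyval(coeffs, x_test))

# A test error well above the training error signals overfitting
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

Because the polynomial has far more flexibility than the underlying trend requires, it chases noise in the training split, so its training error ends up lower than its test error.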
How to prevent overfitting?
So how do you prevent your machine learning models from overfitting? Here are some simple strategies you can use.
Options for preventing overfitting
Adjust your model to prevent overfitting
The first way you can prevent and reduce overfitting is by simplifying your model. More complex models are more likely to overfit, so if there is an easy way to reduce the complexity of your model, that is a great start. The exact path you take to simplify your model will depend on the type of model you are using, but here are a few examples of steps you might take.
- Reduce the number of features in your model. One easy way to reduce the complexity of your model is to reduce the number of features that you are using in your model. Spend some time investigating the features of your model to determine which features are making strong contributions, then remove the rest!
- Use a simpler model. Another option is to trade out the model you are using for a simpler one. What does a simple model look like? Generally you will be better off using a model that has fewer parameters that need to be estimated. Try a linear or logistic regression model if applicable.
- Adjust your hyperparameters. In many cases, you can also reduce complexity without changing the type of model that you are using by using different hyperparameters. For example, if you are using a tree-based model such as a random forest, you can reduce the maximum tree depth of your model. A random forest model that is made up of 20 trees that are only 1 level deep is going to have a lot less complexity than a model that is made up of 1000 trees that are up to 100 levels deep.
Adjust your training data to prevent overfitting
Another option you have when you see that your model is overfitting is to make changes to the training data that is used to train the model. Here are some examples of changes you can make to the training data to reduce overfitting.
- Sample more data. The first option is to simply use more data to train the model. Models that are trained on smaller datasets are in general more likely to overfit than models that are trained on larger datasets. Now of course obtaining a larger training dataset is not always possible, but if it is then that is a great first step to take!
- Augment your data. If you do not have a larger quantity of training data available, you can also augment the data you have to create additional examples that can be used during training. Data augmentation is a process by which you apply transformations to your existing training data to create new synthetic data that can be used in training. It is common in fields such as image analysis. As a simple example, you could flip an image across the vertical axis to create a “mirrored” version of the image that can also be used in training.
- Sample more varied data. Sometimes just sampling more data is not enough. You might actually need to rethink the strategy you are using to sample your data so that you get a more varied dataset in addition to sampling more data points. This is most likely to be the case if a large number of examples in your training sample are identical or nearly identical.
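The mirrored-image augmentation mentioned above is a one-liner in numpy. Here is a toy sketch using a tiny 2x3 array standing in for a real image; the values are arbitrary and only serve to show the transformation.

```python
import numpy as np

# A tiny 2x3 "image" standing in for a real training example
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Flip across the vertical axis to create a mirrored copy
# that can be added to the training set as a new example
mirrored = np.fliplr(image)

print(mirrored)
# [[3 2 1]
#  [6 5 4]]
```

Real augmentation pipelines apply many such transformations (flips, rotations, crops, brightness shifts) on the fly during training, but the principle is the same: each transformed copy is a new training example the model has not literally seen before.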
Adjust your model training routine to prevent overfitting
A third option you have to help prevent a machine learning model from overfitting is to adjust the routine that is being used to train the model. There are many different types of modifications that can be made to the model training routine to help ameliorate the effects of overfitting.
- Regularization (L1 regularization, L2 regularization, etc.). Regularization is another common choice for preventing a model from overfitting. The general idea behind this technique is that a penalty is added that increases proportionately to the size of each of your model coefficients. This has the effect of shrinking model coefficients closer to zero. When the coefficient for a given feature approaches zero, that feature is effectively removed from the model, which reduces model complexity. This technique is commonly used across a wide variety of machine learning models.
- Early stopping. Early stopping is a technique that is typically used to reduce the length of training for neural networks. When neural networks are trained, they generally cycle through the same set of training examples multiple times. These models may become more likely to overfit as they see the same training examples over and over again. In early stopping, model training is paused at certain checkpoints to evaluate how much the model improved since the last checkpoint. Different stopping rules can be applied, but the general idea is that if the model performance has not improved by a lot, or if performance has degraded, then the training routine is stopped early.
- Dropout. Introducing dropout is another technique that is applied to neural networks to reduce overfitting. In this technique, a subset of nodes in the network is selected to have their outputs dropped or ignored for a portion of the training process. This effectively breaks the connection between those nodes and any downstream nodes for that portion of the training process. This technique is meant to simulate model averaging (or combining the predictions of multiple models with different architectures). Different connections are broken during different parts of the training process, so the effective architecture of the model changes throughout the training process.
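The regularization bullet above can be made concrete with scikit-learn. This sketch uses synthetic data in which only 2 of 10 features actually drive the target, then compares an ordinary linear regression against a Lasso model (L1 regularization). The alpha value and data shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# 10 features, but only the first 2 actually drive the target;
# the other 8 are pure noise
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Ordinary least squares: no penalty on coefficient size
plain = LinearRegression().fit(X, y)

# L1-regularized regression: penalty proportional to |coefficient|
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are effectively zero in each model.
# The L1 penalty tends to drive the noise features' coefficients
# to exactly zero, effectively removing them from the model.
plain_zeros = int(np.sum(np.abs(plain.coef_) < 1e-6))
lasso_zeros = int(np.sum(np.abs(lasso.coef_) < 1e-6))
print(f"near-zero coefficients, plain: {plain_zeros}, lasso: {lasso_zeros}")
```

The unpenalized model assigns small but nonzero weights to every noise feature, while the Lasso zeros many of them out, which is the complexity reduction the bullet describes.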