Are you looking for tips on how to improve a machine learning model? Or maybe you are wondering whether you are better off focusing on the model itself or on the dataset it is trained on? Either way, you are in the right place!
In this article we tell you everything you need to know about improving the performance of a machine learning model. We start out by discussing a few general strategies you can employ to improve a machine learning model. Next, we provide examples of specific tactics that fall under each strategy. Finally, we discuss which strategies tend to lead to the largest gains in predictive performance.
Strategies for improving machine learning models
Before we discuss specific tactics that can be used to improve the performance of a machine learning model, we will first define a few high-level strategies that these tactics can be bucketed into. Here are the main strategies you should consider when trying to improve the performance of your machine learning model.
- Enhance your data. The first strategy you should consider when trying to improve your machine learning model does not have much to do with the model at all. Instead, it relates to the data that you are feeding into your model: you can focus your effort on enhancing the dataset that you are using to train it.
- Tune model parameters. The second strategy that you should consider when trying to improve your machine learning model is tuning the parameters of the model you are using. When we talk about tuning model parameters, we are not simply talking about performing a grid search to optimize hyperparameters. Instead, we are broadly referring to any changes or customizations that can be made to your algorithm without using a different model entirely.
- Use a different type of model. The next strategy you should consider when trying to improve the performance of a machine learning model is using a different type of model. There are often many different models that can be used to perform a given task and some will perform better than others.
- Use a different strategy to evaluate performance. Finally, it sometimes makes sense to take a step back and examine how you are evaluating the performance of your model. Sometimes the problem is not the model itself, but the way that you are evaluating the model.
Improving your model by improving your data
One of the best ways that you can improve the performance of a machine learning model is by improving the data that is used to train that model. Here are some examples of modifications you can make to your data to help improve the performance of your model.
- Incorporate new features. The first option you have is to enhance your dataset by incorporating new features into it. By incorporating new features, you are adding new sources of information that the model can consider. This can be hugely valuable if the features you add increase the amount of signal in your dataset. This tactic is useful in almost all situations.
- Remove features without much signal. On the other hand, it may also make sense to remove features that are not providing much signal. Even if a model is supposed to be able to identify the most important features, it gets increasingly difficult to do so as the number of features increases. You are most likely to benefit from this tactic if you have a model that contains many features.
- Transform your features. In some scenarios, it may make sense to apply a transformation to a feature you are already using in your model. This tactic is most useful when you are working with parametric models that make assumptions about the distributions of your features. For example, if you are creating a linear regression model and you have a feature that is highly skewed, you may want to transform that feature to make the distribution look more normal.
- Handle outliers differently. It may also be fruitful to reexamine the way you are handling outliers in your dataset. Are you keeping the outliers or removing them? Can you transform the distributions of your variables to rein in the outliers? How are you identifying which points are treated as outliers? Changing the way you handle outliers is particularly useful for parametric models that are sensitive to outliers, such as linear regression models.
- Handle missing values better. You may also want to rethink the way you are handling missing values. There are many, many strategies to handle missing values and some will perform better than others. This is particularly important if you have a dataset with many missing values.
- Reduce dimensionality. Another option you can consider is using a dimensionality reduction technique to reduce the dimension of your feature space. This is particularly useful if you have many correlated features that contain similar information in your feature space.
- Use more data. If you are not using every last bit of data that is available to you, then it might also help to increase the amount of data you are using to train your model. This technique is particularly useful if you are training a complex model that is overfitting.
- Data augmentation. What if you are using all of the data that is available to you, but you still think that your training dataset should be larger? In this case, it might make sense to use data augmentation methods to create synthetic data to supplement your dataset.
- Resample your dataset. Sometimes it makes sense to resample your dataset to change the distribution of data points that are included. For example, if you have a highly imbalanced outcome then it might make sense to resample the dataset so that it has a more even distribution across the outcome variable. It may also make sense to resample your data if there is an imbalance with respect to important features. For example, if you expect gender to be an important feature in your model and you see that your dataset is composed almost entirely of males, it may make sense to resample the data to resolve this imbalance.
- Change the grain of your outcome variable. In some cases, it makes sense to change the outcome variable you are predicting to modify the grain. One case where this might be useful is if your initial outcome variable is a multiclass variable with many values. You may benefit from bucketing similar values together to reduce the difficulty of the task at hand.
- Change the time window you are using to frame your problem. Sometimes you will find yourself in a situation where you need to predict whether a subject will perform a certain task in the next X days. For example, you might want to predict whether a subject will churn from a product. In these situations, it is generally easier to create accurate predictions for larger time windows that are measured in weeks or months rather than days or hours.
- Improve data correctness. Finally, if your model is performing really poorly then it might make sense to audit the correctness of your data. If the data that you are feeding into your model is not correct, it will impact the model’s ability to detect patterns in the data.
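Two of the tactics above, transforming a skewed feature and resampling an imbalanced outcome, can be sketched in a few lines of plain Python. This is a minimal illustration rather than a recommended pipeline: the `amounts` feature, the `labels` outcome, and the helper names `skewness`, `log_transform`, and `oversample_minority` are all hypothetical, and the log transform assumes the feature is non-negative.

```python
import math
import random
from collections import Counter


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


def log_transform(values):
    """Apply log(1 + x), a common transform for non-negative, right-skewed features."""
    return [math.log1p(v) for v in values]


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical right-skewed feature (e.g. purchase amounts) with one very large value.
amounts = [1, 2, 2, 3, 3, 4, 5, 8, 20, 400]
print("skewness before:", skewness(amounts))
print("skewness after: ", skewness(log_transform(amounts)))

# Hypothetical imbalanced binary outcome: eight negatives and two positives.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = oversample_minority(rows, labels)
print("class counts after oversampling:", Counter(balanced_labels))
```

If you resample, apply it only to the training data. The data you use for evaluation should keep the class distribution you expect to see in production.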
Improving your model by adjusting model parameters
The next strategy you can use to improve the performance of your machine learning model is to modify parameters of the model you are using. When we talk about modifying parameters, we are not just talking about tuning hyperparameters. Instead, we are broadly talking about performing any tweaks that can be made to a model without transforming it into a different model entirely.
- Tune hyperparameters. The first thing you should do when adjusting your model's parameters is tune the hyperparameters that are used in your model. The exact hyperparameters that you need to tune will vary depending on what model you use, but in general you can expect to see a small to moderate performance boost after tuning the hyperparameters of your model.
- Change the loss function. The next avenue you should consider is changing the loss function that is used in your model. Re-evaluate whether the function that the model is optimizing for is appropriate for the task at hand.
- Weight observations differently. Some models allow you to modify the weights that are given to different types of observations during training. By giving more weight to a specific observation, you tell the model that it is particularly important to predict that outcome correctly. This is often done by incorporating weights into the loss function. Changing the way that observations are weighted is another strategy you can explore when your dataset is highly imbalanced over an important feature or the outcome variable.
- Reduce model complexity. If you are training a highly complex model that is overfitting, you may benefit from reducing the complexity of your model. This can mean different things for different models, but one example of something you can do to reduce the complexity of your model is adjust the hyperparameters that you are using, for instance by limiting the depth of a decision tree or increasing the strength of regularization.
- Change the way binary predictions are aggregated into multiclass predictions. This is a tactic that is specific to a subset of multiclass models. For models that cannot be applied directly to multiclass outcomes, multiclass outcomes are often handled indirectly by creating a series of binary models then aggregating the predictions from those models. For example, this strategy is used when applying gradient boosted trees to a multiclass problem. There are different strategies for deciding which binary models are trained and how their predictions are aggregated. Different strategies may be more appropriate for different scenarios.
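To make the idea of weighting observations through the loss function concrete, here is a small sketch in plain Python. It assumes a binary outcome and uses inverse-frequency class weights, `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_log_loss(y_true, p_pred, class_weights):
    """Binary cross-entropy where each observation is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = class_weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical imbalanced outcome: nine negatives and one positive.
y_true = [0] * 9 + [1]
weights = balanced_class_weights(y_true)

# A model that predicts a low probability for everyone, including the lone positive.
p_pred = [0.1] * 10

unweighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, p_pred, weights)
print("class weights:  ", weights)
print("unweighted loss:", unweighted)
print("weighted loss:  ", weighted)
```

With equal weights, the missed positive barely moves the loss because the nine easy negatives dominate the average. With balanced weights, the same mistake is penalized much more heavily, which is the pressure that pushes a model to pay attention to the rare class during training.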
Improving performance by using a different model
Another avenue you have when trying to improve the predictive performance of your model is using a different model. Here are a few options you have when choosing a new model.
- Use a different implementation of the same model. The first option you have is to use a different implementation of the same kind of model. Depending on what programming language you use, there are likely to be multiple different machine learning libraries that contain different implementations of common machine learning models. These different implementations are not identical to one another and there are situations where one implementation will serve you better than another. This tactic is most useful when there is a specific quirk about your dataset that leads you to believe that one model implementation would be better than another.
- Use a different model (for the same type of outcome). The next option you have is to choose a different model. Specifically, we are talking about using another model that is intended to be used for the same type of data or outcome variable. This is generally the best and most straightforward route to take when it comes to choosing a different machine learning model. Check out our article on how to choose the right machine learning model for your data to see examples of models that can be used on different types of datasets.
- Use a different model (for a different type of outcome). If you have tried multiple different models for the same outcome type and you are still not having success, it might make sense to reframe your problem and try a model with a different type of outcome. For example, if you are trying to predict a numeric outcome then it might make sense to binarize your outcome variable and treat the problem as a binary classification problem. This is generally a good option to pursue when you have already tried using a few different models that were designed for the same type of outcome and you have not seen much improvement.
- Combine multiple models into an ensemble. The final option you have is to pool the predictions from multiple models together into a final ensemble model. Ensemble models like this tend to be slower to train and more difficult to interpret, so they should only be used in very specific situations where you are willing to trade prediction time and interpretability for (sometimes relatively small) gains in predictive performance.
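The simplest form of ensembling, averaging the predictions of several models, can be sketched as follows. The true values and the three sets of model predictions are made-up numbers chosen so that the models' errors partially cancel; real gains are often much smaller, as noted above.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ensemble(predictions):
    """Average the predictions of several models, observation by observation."""
    return [sum(column) / len(column) for column in zip(*predictions)]


# Hypothetical true values and predictions from three different models.
y_true = [3.0, 5.0, 2.0, 8.0]
model_predictions = [
    [2.5, 5.5, 2.0, 7.0],  # model A
    [3.5, 4.0, 2.5, 8.5],  # model B
    [3.0, 5.5, 1.0, 9.0],  # model C
]

individual_errors = [mean_squared_error(y_true, p) for p in model_predictions]
ensemble_error = mean_squared_error(y_true, average_ensemble(model_predictions))
print("individual MSEs:", individual_errors)
print("ensemble MSE:   ", ensemble_error)
```

For squared error, the averaged prediction is never worse than the average of the individual models' errors. Whether it beats the best single model depends on how correlated the models' mistakes are, which is why ensembles tend to help most when the underlying models are diverse.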
Reevaluating how to assess model performance
Sometimes when you are seeing poor model performance, the problem might not be the model at all. Instead, the problem might be the way you are assessing the performance of your model. Here are a few tactics you can employ in these situations.
- Modify the metric you are using to assess performance. The first option you have is to examine the metric that you are using to assess model performance. Does it weigh outliers too much? Does it penalize under-prediction and over-prediction the same way? Does it weigh more harmful mistakes more heavily? These are just a few considerations you should keep in mind when evaluating the performance metric you are using.
- Modify the dataset you are using to assess performance. The other option you have is to examine the dataset you are using to assess model performance. Is it representative of the dataset that you expect to see when you put the model into production? Is it diverse enough to evaluate how your model performs in different types of situations? Does it contain data from an appropriate time frame? These are just a few different questions you should consider when evaluating the dataset you are using.
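The questions about metrics above are easier to reason about with numbers in hand. The sketch below compares mean absolute error with root mean squared error on two hypothetical sets of predictions, then uses the pinball (quantile) loss as one example of a metric that treats under-prediction and over-prediction differently. All data values are invented for illustration, and the `quantile=0.9` setting is an arbitrary choice.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pinball_loss(y_true, y_pred, quantile=0.9):
    """Asymmetric loss: with quantile > 0.5, under-prediction costs more than over-prediction."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p  # positive when the model under-predicts
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(y_true)


y_true = [10.0, 10.0, 10.0, 10.0]

# Same total absolute error, spread evenly versus concentrated in one outlier.
small_errors = [12.0, 8.0, 12.0, 8.0]   # four errors of size 2
one_outlier = [10.0, 10.0, 10.0, 18.0]  # one error of size 8

print("MAE: ", mean_absolute_error(y_true, small_errors),
      mean_absolute_error(y_true, one_outlier))
print("RMSE:", root_mean_squared_error(y_true, small_errors),
      root_mean_squared_error(y_true, one_outlier))

# Same error size, but in opposite directions.
under = [8.0, 8.0, 8.0, 8.0]     # always 2 too low
over = [12.0, 12.0, 12.0, 12.0]  # always 2 too high

print("pinball loss (under, over):",
      pinball_loss(y_true, under), pinball_loss(y_true, over))
```

Mean absolute error cannot tell the two sets of predictions apart, while root mean squared error penalizes the single large miss much more. Which behavior you want depends on whether large errors are disproportionately costly in your application.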
What is the best way to improve a model?
So what is the best way to improve the performance of your machine learning model? The true answer to this question is that it depends on the problem you are working on and the data you are using. A more satisfying answer that applies to most cases is that you generally see the best results by improving your dataset. The models that you train are only as good as the data you use to train them, so the data is the real factor that bounds the performance of your model.
Other advice for data science teams
- Avoid knowledge silos
- Use version control
- Perform code review
- Standardize your codebase
- Use unit tests
- Avoid duplication
Check out our article on data science best practices for all of our best recommendations on how to increase the efficacy of data science teams.