Are you looking for advice on machine learning model evaluation? Then you are in the right place! In this article we cover everything you need to know to evaluate your machine learning models in both online and offline settings, including all of the prerequisite background you need along the way.
This post does not contain a list of popular model evaluation metrics. Instead, it lays out a roadmap that is designed to give you a broad overview of the full machine learning model evaluation process.
Business and performance evaluation metrics
The first thing you need to understand before you can evaluate your machine learning models is the difference between business metrics and performance metrics. Performance metrics are often the main focus of articles on machine learning model evaluation, but business metrics are just as important, if not more important, to keep in mind when you evaluate your machine learning models.
Performance metrics for model evaluation
First we will talk about performance metrics for model evaluation. Performance metrics are the types of metrics that first come to mind when you think about model evaluation because they are the metrics that tell you how good your model’s predictions really are.
Performance metrics tell you how close your model’s predictions are to the actual values that were observed. Some common performance metrics that are used to evaluate machine learning models are root mean squared error (RMSE), accuracy, precision, and recall.
Business metrics for model evaluation
So now that we have covered performance metrics, the next question is what is a business metric? A business metric is an important operational metric that relates back to the bottom line of the business. Business metrics are generally metrics that are tracked by product owners and business partners that work with the data team such as lead submission rates, email click through rates, cost of customer acquisition, and gross sales numbers.
Why is it important to pay attention to business metrics? It is important to keep business metrics in mind because these are the metrics that people outside of the core data team generally use to evaluate how a machine learning model is doing. A product manager does not care about the log loss associated with a lead prediction model. Instead, they care about the increase in the lead submission rate that is seen after the lead prediction model is used to implement a product change.
Offline and online model evaluation
Next we will talk about the difference between online and offline model evaluation. Most articles on machine learning model evaluation only focus on offline model evaluation, but in the real world, online model evaluation is even more important than offline model evaluation.
Offline model evaluation
What is offline model evaluation? Offline model evaluation is the process that most people think of when they think of machine learning model evaluation. Offline model evaluation is any model evaluation that you do using offline data before you deploy a model or use any insights generated from a model. Offline model evaluation is most commonly associated with performance metrics, but you can also evaluate your expected impact on business metrics in an offline fashion.
Offline evaluation of performance metrics
Since offline model evaluation is most commonly associated with performance metrics, we will first talk about evaluating performance metrics in an offline fashion. In order to evaluate performance metrics, you simply have to use your model to make predictions on a dataset then use a performance metric to evaluate how close the predictions are to the real values. Here are some best practices for offline performance evaluation.
- Use a test and train dataset. Generally when you are doing offline evaluation of performance, you will use two different datasets to evaluate your model. You will have a training dataset that your model is trained on and a test dataset that is not used to train your model. You should make predictions on both your test and train datasets to make sure that your model performs just as well on new data as it does on data it saw during training.
- Use a baseline model. In addition to training a final model, you should also train a simple baseline model that you can compare the performance of your model to. If you are training a complex deep learning model, then your baseline model might be a simple linear regression model with a few variables. If your main model is a linear regression model, then your baseline model might be simple business logic with no stochastic component.
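To make these practices concrete, here is a minimal sketch of comparing a model against a simple baseline on both train and test sets. It uses scikit-learn with synthetic data, and the model choices are purely illustrative:

```python
# Sketch: compare a more complex model against a simple baseline on
# held-out test data (synthetic data; model choices are illustrative).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500)

# Hold out a test set that the models never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Evaluate each model on both the train and the test data.
for name, m in [("baseline", baseline), ("model", model)]:
    train_rmse = mean_squared_error(y_train, m.predict(X_train)) ** 0.5
    test_rmse = mean_squared_error(y_test, m.predict(X_test)) ** 0.5
    print(f"{name}: train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

A large gap between train and test RMSE for either model would be a sign of overfitting, and the baseline's test RMSE gives you a floor that the more complex model needs to beat.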
Offline evaluation of business metrics
Offline evaluation of business metrics is not always as straightforward as offline evaluation of performance metrics, but it is often possible to infer some sort of relationship between the performance metrics you are using and the business metrics you care about. That being said, you may need to make large assumptions about the relationships between your business metrics and performance metrics.
For example, say you had a model that predicted how likely a potential customer was to sign on to your platform and you wanted to use this model to prioritize which customers your salespeople should speak to. In this situation, it feels natural that there should be a relationship between performance metrics like accuracy and business metrics like the total number of new customers converted. If your model is more accurate then your salespeople will spend their time talking to customers that are more likely to convert, so they should sign more customers on to the platform.
By making some assumptions and defining this relationship mathematically, you can get an estimate of what kind of increase in the number of new customers you might expect to see based on your model’s accuracy. This business metric estimate is an important number that you can share with business partners to help gain momentum for your project.
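As a rough illustration of this kind of estimate, here is a back-of-envelope calculation. Every number below is a made-up assumption, and the key (strong) assumption is that the model's precision among top-ranked leads translates directly into the conversion rate of the calls the sales team makes:

```python
# Back-of-envelope sketch (all numbers are hypothetical assumptions):
# translate model quality into an estimated lift in new customers,
# assuming the sales team makes a fixed number of calls per month.
calls_per_month = 1000
base_conversion_rate = 0.05     # assumed rate with no model prioritization
model_conversion_rate = 0.12    # assumed rate among model-prioritized leads

current_customers = calls_per_month * base_conversion_rate
projected_customers = calls_per_month * model_conversion_rate
estimated_lift = projected_customers - current_customers
print(f"Estimated additional customers per month: {estimated_lift:.0f}")
```

Sharing a number like this with business partners, alongside the assumptions behind it, is far more persuasive than quoting a performance metric on its own.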
Online model evaluation
And what about online model evaluation? Broadly speaking, online model evaluation refers to the process of evaluating how your model is performing after it has actually been deployed and is being used in production. Online evaluation can always be used to evaluate business metrics, and it can often be used to monitor performance metrics as well.
Online evaluation of performance metrics
Online evaluation of performance metrics is not always an easy task, because the product changes you make based on your model's output pollute any data that you collect afterward.
For example, take the case of using sales conversion models to prioritize sales calls. Online performance evaluation is not as straightforward as looking at the data and seeing who converted and who did not. Why is this? Because it is difficult to determine whether a customer that failed to convert was inherently unlikely to convert or was simply deprioritized by the sales team due to the results of your model. Maybe if that customer had received a high volume of sales calls, they actually would have converted.
One option you have is to make predictions with your model but not make any changes based on those predictions. This way you have a set of model predictions and actual results that are not tainted by the results of your model. Otherwise, you may need to use causal inference methods or counterfactual policy evaluation to analyze this sort of data.
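The first option can be sketched as a simple "shadow mode" logger. All names and the toy model below are hypothetical:

```python
# "Shadow mode" sketch (names and toy model are hypothetical): record
# model predictions alongside outcomes without letting the predictions
# influence which leads the sales team works.
shadow_log = []

def score_lead(lead_id, features, model_predict):
    """Score a lead for later evaluation, but do not change behavior."""
    prediction = model_predict(features)
    shadow_log.append({"lead_id": lead_id, "prediction": prediction})
    # The caller ignores the prediction, so the outcomes collected
    # later are untainted by the model.

def record_outcome(lead_id, converted):
    for row in shadow_log:
        if row["lead_id"] == lead_id:
            row["converted"] = converted

# Toy model: "predicts" conversion probability from a single feature.
score_lead("lead-1", {"engagement": 0.9}, lambda f: f["engagement"])
record_outcome("lead-1", True)
print(shadow_log)
```

Once enough outcomes accumulate, you can compute any performance metric on the logged prediction/outcome pairs without the confounding described above.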
Online evaluation of business metrics
Online evaluation of business metrics is much more straightforward because the impact on business metrics is generally measured directly through randomized tests. If you are actively deploying a model that is used in production, you can just set up a randomized test where the model is used for some observations and not for others and see how the business metrics vary across groups.
Even if your model is not getting used in production and you are only using machine learning modeling to gather insights that will be used by the company, you should still try to evaluate your machine learning model on business metrics using a randomized test setup. Continuing along with the previous example, if you are trying to determine the impact of using a conversion model that predicts how likely a customer is to sign up to prioritize sales calls, you can still test this in a randomized way. All you have to do is split your sales team into two randomized groups and let only one team have access to the model's insights.
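A minimal sketch of that randomized setup might look like this; the group sizes and conversion numbers are made up for illustration:

```python
# Sketch: randomly split salespeople into control and treatment groups,
# give only the treatment group access to model insights, then compare
# the business metric (conversions) across groups. Numbers are made up.
import random

random.seed(0)
salespeople = [f"rep_{i}" for i in range(20)]
random.shuffle(salespeople)
treatment = set(salespeople[:10])   # gets the model's insights
control = set(salespeople[10:])     # works as usual

# After the test period, aggregate conversions per group.
conversions = {rep: random.randint(3, 9) for rep in salespeople}
treat_avg = sum(conversions[r] for r in treatment) / len(treatment)
ctrl_avg = sum(conversions[r] for r in control) / len(control)
print(f"treatment avg: {treat_avg:.1f}, control avg: {ctrl_avg:.1f}")
```

In practice you would also run a significance test on the difference between groups before drawing conclusions, but the core idea is just this random assignment.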
Full machine learning model evaluation cycle
- Build a baseline model. You can think of these first two steps as prerequisite steps that you should take before you get to the point of evaluating a machine learning model. The first step is just building a baseline model that you can compare your actual model to.
- Iterate on the baseline model using cross validation. After you build a baseline model, you should iterate on that model by changing the feature set, tuning parameters, and testing out different types of models. You should do all of this using only your training data by using cross validation to evaluate your performance metrics. Do not touch your test data at any point in this step.
- Offline evaluation of performance metrics. After you settle on a final model, you should evaluate your performance metrics using both your test and training data. If your performance metrics are much better on your training data than on your test data, that is a sign that your model is overfitting and you need to add more data, use a less complex model, or use a regularization technique.
- Offline evaluation of business metrics (if applicable). After you evaluate your offline performance metrics, you should try to get an estimate of the impact that your model will have on business metrics. This step of the process often involves making some strong assumptions that should be stated alongside your results. Even though you have to make strong assumptions, this type of model evaluation can be invaluable for getting momentum for your project and getting your project prioritized.
- Deploy your model or insights in a randomized test. After you complete the offline evaluation of your machine learning model, it is time to move to the online evaluation. Before you evaluate your online model, you must deploy your model or insights so that you can collect data on how they perform.
- Online evaluation of business metrics. Perhaps the most crucial step in evaluating your machine learning model is the online evaluation of your business metrics. This is a step that sometimes gets overlooked, but if you take anything away from this article it should be the importance of this step. This is the only way you can prove the value of your machine learning models to the business. Even if you are not formally deploying a model in production, it is generally still possible to run a randomized test where you apply your insights to make improvements for only one subset of your population. This will give you the ability to see the effect that your model had on business metrics.
- Online evaluation of performance metrics (if applicable). Finally, you should try to get an understanding of how your online performance metrics compare to your offline performance metrics if possible. There is sometimes a large difference between online and offline performance metrics. If such a difference exists, identifying and understanding the reason for this difference can help you train better models in the future.
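The first two steps of this cycle can be sketched as follows, using scikit-learn with synthetic data; the candidate models are illustrative:

```python
# Sketch of steps 1-2 of the cycle: hold back a test set, then iterate
# on candidate models using cross validation on the training data only
# (synthetic data; model choices are illustrative).
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The test set is set aside here and not touched during iteration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "baseline (logistic)": LogisticRegression(),
    "candidate (forest)": RandomForestClassifier(n_estimators=50,
                                                 random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

Only after this iteration settles on a final model would you score the held-out test set, as in step 3.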
Other considerations for evaluating models
Now that we have gone over how to evaluate machine learning models using performance metrics and business metrics, we will discuss some other considerations you should keep in mind when building and evaluating machine learning models.
First we will discuss special considerations that you should keep in mind when selecting performance metrics for different types of models. After that, we will discuss some competing factors that should be considered in conjunction with performance metrics.
Evaluating regression models
We will start out by discussing considerations that you should keep in mind when you are selecting performance metrics to evaluate a regression model.
- Impact of outliers. The first thing you should consider when selecting performance metrics for regression models is how large of an impact outliers should have on your performance metrics. Generally when you are working with regression error metrics, the individual errors are either squared or passed through an absolute value function so that all errors have a positive value. If you want to heavily penalize large outliers, you should use a squared error metric like root mean squared error (RMSE). Otherwise, you should use an error metric with an absolute value term like mean absolute error (MAE).
- Impact of under-predicting and over-predicting. Another consideration you should keep in mind is whether your model should be punished more harshly for predicting values that are too high or values that are too low. In some cases, over-predicting is much worse than under-predicting, or vice versa. In these cases, you should use an error metric that reflects this asymmetry, such as quantile (pinball) loss, which weights positive and negative errors differently.
- Interpretability. Not all regression metrics are easy to interpret on their own. For example, it is hard to say what a good value for mean absolute error (MAE) is because that depends on the scale of your outcome variable. If your outcome variable ranges from 0 to 1, then a model with an MAE of 0.8 would be considered bad, whereas if your outcome variable ranges from -500 to 500, then an MAE of 0.8 would be considered great. If you want your error metrics to be interpretable out of the box, then you should use error metrics that are scaled to account for the range of your outcome variable, such as mean absolute percent error (MAPE).
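The outlier sensitivity and interpretability points above can be illustrated with a quick comparison of RMSE, MAE, and MAPE on made-up predictions containing one large miss:

```python
# Illustrative comparison of RMSE, MAE, and MAPE on predictions with
# one large outlier (all numbers are made up).
actual = [100, 110, 95, 105, 100]
predicted = [102, 108, 97, 104, 160]  # last prediction is a large miss

errors = [p - a for p, a in zip(predicted, actual)]
rmse = (sum(e ** 2 for e in errors) / len(errors)) ** 0.5
mae = sum(abs(e) for e in errors) / len(errors)
mape = sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors) * 100

print(f"RMSE: {rmse:.2f}")   # squaring punishes the single outlier heavily
print(f"MAE:  {mae:.2f}")    # the outlier counts only once
print(f"MAPE: {mape:.1f}%")  # scale-free, readable as a percentage
```

Because of the squared term, RMSE comes out far larger than MAE here even though only one of the five predictions is badly off.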
Evaluating binary classification models
Now we will discuss some considerations that you should keep in mind when you are choosing performance metrics for binary classification models.
- Beware of imbalanced data. When you are choosing a performance metric to use for a classification model, you should check whether one of the classes is much more prevalent than the other class. If this is true, then you should avoid using metrics like accuracy that can be easily swayed by unbalanced data. For example, if 99% of your data represented positive outcomes then your model could just predict every observation as a positive outcome and achieve 99% accuracy. This sounds like a great accuracy rating, but a model that predicts the same outcome all the time is not very useful.
- Probability metrics and class metrics. Another consideration you should keep in mind is how you are going to use the output of your model. Are you going to use the probability that the model produces (i.e., the probability that an observation is a positive case) or the binary outcome (i.e., positive or not positive)? Most common binary classification metrics focus on the binary outcome, but if you are going to use the probability outputted by your model then you should opt for a metric that directly evaluates the probability, such as log loss.
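The imbalanced-data pitfall from the first point can be demonstrated in a few lines with toy labels:

```python
# Toy illustration: with 99% positive labels, a model that always
# predicts "positive" scores 99% accuracy despite being useless.
actual = [1] * 99 + [0]       # 99 positives, 1 negative
predicted = [1] * 100         # model predicts positive every time

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
# Recall on the minority (negative) class exposes the problem: the
# model never identifies the one negative case.
neg_recall = sum(a == p == 0 for a, p in zip(actual, predicted)) / 1

print(f"accuracy: {accuracy:.2f}")                 # looks great
print(f"negative-class recall: {neg_recall:.2f}")  # reveals the failure
```

This is why metrics like precision, recall, or balanced accuracy are usually a better fit than plain accuracy for imbalanced problems.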
Evaluating multi-class classification models
Finally, we will talk about considerations that you should keep in mind when working with multi-class classification models.
- Look at class-level metrics. Generally when you work with multi-class classification models, your north star metric will be a metric that summarizes how your model is doing as a whole across all classes. That being said, you should also look at one-vs-all metrics that explain how good your model's predictions are for each class. This will give you an idea of which classes the model performs well on and which classes it does not. This information may provide insights into what data needs to be added to the model or what changes need to be made to the model.
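Here is a quick sketch of that class-level inspection, using scikit-learn's `classification_report` on made-up labels:

```python
# Sketch: look beyond the overall score with per-class (one-vs-all)
# precision and recall for a multi-class model (labels are made up).
from sklearn.metrics import classification_report

actual    = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog"]
predicted = ["cat", "dog", "cat",  "cat", "dog", "bird", "dog", "dog"]

# The per-class rows reveal which classes the model struggles with,
# e.g. "bird" is confused with other classes here.
report = classification_report(actual, predicted,
                               output_dict=True, zero_division=0)
print(classification_report(actual, predicted, zero_division=0))
```

A report like this might show that overall accuracy looks acceptable while one class is being systematically missed, which is exactly the insight that points you toward collecting more data for that class.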
Competing considerations for evaluating models
Many articles and blog posts on the internet would have you believe that performance metrics are the main factor that should be considered when evaluating your machine learning models. Maybe that is true if you are entering a Kaggle competition, but in the real world that is far from the truth! Here are some other factors that need to be considered when evaluating machine learning models.
- Prediction time. One factor that should be considered is the amount of time it takes to make a prediction using your model. This might not be as important if you are using an ad hoc model that only needs to be run once, but if you are building a model that will be used to score data on a regular cadence (whether that be in real time via an API or in batch jobs that run overnight) then you should consider the amount of time it takes to make predictions with your model. Generally, you should favor models that are able to make predictions quicker and with fewer resources.
- Model complexity. Another thing that should be considered is model complexity. Overly complex models are prone to issues like overfitting and instability. Generally, you should favor simple models over complex models if the performance metrics are in the same ballpark. This is especially true if your models will be retrained on a regular cadence.
- Model explainability. A third consideration that should be kept in mind is model explainability. Generally, you should choose a model that is more easily explainable over a model that is not easily explainable if the performance metrics are relatively similar in scale. Explainable models are generally easier to get buy-in for because having the ability to explain what your model is doing and why makes it easier to build trust with your business stakeholders.
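The prediction-time consideration can be checked with a simple timing harness; the stand-in model below is purely illustrative:

```python
# Sketch: measure average prediction latency, which matters when a
# model scores data in real time or on a regular cadence. The predict
# function here is a hypothetical stand-in for a real model.
import time

def predict(features):
    return sum(features) / len(features)

batch = [[0.1, 0.2, 0.3]] * 10_000
start = time.perf_counter()
predictions = [predict(f) for f in batch]
elapsed = time.perf_counter() - start
print(f"avg latency: {elapsed / len(batch) * 1e6:.2f} µs per prediction")
```

Running a harness like this on both your candidate model and your baseline lets you weigh a small gain in a performance metric against a large increase in prediction cost.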