Are you wondering why you should use baseline models for machine learning? Or maybe you are more interested in learning how to build baseline models for machine learning? Either way, we’ve got you covered! In this article we tell you everything you need to know about building baseline models for machine learning.
At the beginning of this article, we discuss what a baseline model for a machine learning project is. After that, we talk about why you should build a baseline model for each of your machine learning projects. Finally, we provide examples of different kinds of baseline models you can use in your machine learning projects.
What is a baseline model?
What is a baseline model in a machine learning project? A baseline model is a very simple model that you can create in a short amount of time. Your baseline model should be created using the same data and outcome variable that will be used to create your actual model. Baseline models can be simple stochastic models or they can be built on rule-based logic.
Generally speaking, if your actual model is a complex, highly parameterized model, then a simple stochastic model would be an appropriate baseline. If your actual model is a fairly simple stochastic model, then a simple baseline that uses easy-to-implement business logic may be more appropriate.
Why use a baseline model for machine learning
Why should you use a baseline model for your machine learning projects? In the following section we will go over some of the main reasons that you should use a baseline model in your machine learning projects.
Understand your data faster
The first high-level reason that you should use a baseline model in your machine learning projects is that it helps you understand your data faster. Here are a few examples of how baseline models help you to understand your data.
- Identify difficult-to-classify observations. By looking at the results of a baseline model, you can get a sneak peek at which observations are the most difficult to classify. You might see, for example, that one subset of your data was easy to classify using simple business logic, but another subset was not so easily classified. This kind of information can help inform the data you use in your model as well as your choice of model.
- Identify difficult classes to classify. Similarly, if you are working on a multi-class classification problem, using a baseline model can give you a preview of which classes are easy to classify and which classes are difficult to classify. You might see, for example, that two classes are very hard to distinguish from each other and decide to group those classes together moving forward.
- Identify low-signal data. If you create a baseline model and find that it has little to no predictive power, that might be an indicator that there is little signal in your data. It is much better to find this out early on after building just a simple model than later on after you have spent weeks building a highly complex model.
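To make the first two points concrete, here is a minimal sketch of how you might inspect the output of a baseline model class by class. The labels, the predictions, and the helper name `per_class_recall` are all made up for illustration; in practice you would plug in your own baseline's predictions.

```python
from collections import Counter, defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of observations in each true class that the baseline got right."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical true labels and baseline predictions (illustrative only).
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "fox", "fox"]
y_pred = ["cat", "cat", "cat", "dog", "fox", "dog", "dog", "dog"]

recall = per_class_recall(y_true, y_pred)

# The class with the lowest recall is the hardest one for this baseline.
hardest = min(recall, key=recall.get)

# Count (true, predicted) pairs for the mistakes to see which classes get mixed up.
confusions = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

print(recall)
print("Hardest class:", hardest)
print("Most common confusion:", confusions.most_common(1)[0])
```

In this toy data the baseline never recovers the "fox" class and mostly mistakes it for "dog", which is the kind of pattern that might prompt you to group two classes together or look for features that separate them.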
Compare your actual model to a benchmark
The next reason you should consider using a baseline model for your machine learning projects is that baseline models give you a good benchmark to compare your actual models against.
- Utilize relative performance metrics. Some performance metrics, such as log loss, are easier to use to compare one model to another than to evaluate on their own. This is because many performance metrics do not have a fixed scale and instead take on different values depending on the data, such as the range of the outcome variable or the balance of the classes. If you have a simple baseline model, you now have a built-in benchmark to measure your actual model against. This can help you distinguish cases where a complex model is needed from cases where simple business logic is sufficient.
- Estimate the potential impact on business metrics. Building out a simple baseline model can also give you an idea of what kind of impact you might be able to have on business metrics. This is especially true if your baseline model is also a stochastic model.
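As a rough illustration of using log loss as a relative metric, the sketch below compares a model against a baseline that predicts the same probability (the observed rate of the positive class) for every observation. The labels and model probabilities are invented for the example, and `log_loss` here is a small hand-written helper rather than a library function.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities from an "actual" model.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
model_probs = [0.8, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.7]

# Baseline: predict the base rate for everyone. For simplicity the base rate
# is computed on the same data; in practice you would estimate it on training
# data and evaluate on held-out data.
base_rate = sum(y_true) / len(y_true)
baseline_probs = [base_rate] * len(y_true)

baseline_loss = log_loss(y_true, baseline_probs)
model_loss = log_loss(y_true, model_probs)

# Share of the baseline's log loss that the model removes (higher is better).
relative_improvement = 1 - model_loss / baseline_loss

print(f"baseline log loss: {baseline_loss:.3f}")
print(f"model log loss:    {model_loss:.3f}")
print(f"relative improvement over baseline: {relative_improvement:.1%}")
```

A raw log loss value is hard to judge in isolation, but "how much lower than the baseline" is a question you can answer and track as you iterate.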
Iterate with speed
- Iterate on your model more quickly. Once you have a simple baseline model built out, you have a good benchmark that you can build off of. This makes it easier to determine whether the modifications you are making to your model actually improve metrics or not, so you can drop efforts that are not providing value sooner and focus on the ones that do improve your metrics.
- Unblock downstream processes. If you have a simple baseline model built out, this also unblocks people who are working on downstream processes that depend on your model and allows them to get to their work faster. For example, if an engineer is helping you with your model deployment, they might be able to start their work using your baseline model as a template while you iterate on the actual model.
- Progress to other projects faster. Building simple baseline models can also help you complete your current project and move on to other projects faster. Why is that? Because sometimes you will build a baseline model then realize that the baseline model is sufficient for your use case. If you find that a quick simple model can get you to the point you need to be at, there is no point in spending weeks or months developing a more complex model.
How to create a baseline model
How do you create a baseline model? In this section, we will give you some examples of common baseline models that are used in machine learning. Most of these models apply to structured tabular data, but the concept of building a baseline model can certainly be extended to problems involving unstructured data.
Baseline regression models
First we will discuss a few simple examples of baselines that can be used for regression problems. You will notice that many of these examples do not involve any stochastic modeling at all.
- Mean or median. The first example of a baseline model we will provide is simply the mean or median of your outcome variable. This just means that you would predict the mean or median value of the outcome variable for every single observation in the dataset. This is an extremely simple benchmark that you can use as a baseline if your actual model is a set of rules or business logic.
- Conditional mean or business logic. The next example is still a simple, deterministic model. Simply choose a variable or two that you believe to be most strongly associated with the outcome and build out some business logic that conditions on those variables. For example, if you are trying to predict the height of a child, you might condition on the age group and weight class that the child falls into. You might, for example, see that the median height for a child in the 5 – 8 year old age group and 50 – 60 pound weight class is 4′ 2″ and decide to use that value for all observations in that age group and weight class. This is a great avenue to pursue if your main model is a relatively simple stochastic model like a linear regression.
- Linear regression. Finally, if you are using a complex model with a lot of features as your main model then a simple linear regression model with a few features is a great baseline model.
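The first two regression baselines above can be sketched in a few lines of plain Python. The rows below are hypothetical (heights in inches), and the function names are our own; the sketch uses the median, but the mean works the same way.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical training rows: (age_group, weight_class, height_in_inches).
train = [
    ("5-8", "50-60", 49.0),
    ("5-8", "50-60", 50.0),
    ("5-8", "50-60", 51.0),
    ("5-8", "60-70", 52.0),
    ("9-12", "60-70", 56.0),
    ("9-12", "60-70", 58.0),
]

# Baseline 1: a single global median used for every observation.
global_median = median(h for _, _, h in train)

# Baseline 2: the median within each (age group, weight class) cell.
groups = defaultdict(list)
for age, weight, height in train:
    groups[(age, weight)].append(height)
group_median = {key: median(values) for key, values in groups.items()}


def predict_global(age_group, weight_class):
    return global_median


def predict_conditional(age_group, weight_class):
    # Fall back to the global median for cells that were never observed.
    return group_median.get((age_group, weight_class), global_median)


def mean_absolute_error(rows, predict):
    return mean(abs(h - predict(a, w)) for a, w, h in rows)


# Evaluated on the training rows for brevity; use held-out data in practice.
print("global median MAE:     ", mean_absolute_error(train, predict_global))
print("conditional median MAE:", mean_absolute_error(train, predict_conditional))
```

The fallback to the global median is a design choice for this sketch: it keeps the baseline usable when a new observation falls into a combination of groups that did not appear in the training data.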
Baseline classification models
Now we will discuss baseline models that you can use for classification problems. If you pay close attention, you will see that the models we suggest for classification problems are very similar to the models we suggest for regression problems.
- Mode. For binary classification problems, the simplest baseline model you could think of is just predicting the mode (or the most common class) of the outcome variable for all observations. This is the analog to predicting the mean or median in regression and is a great baseline model to use if your main model is a set of deterministic rules or business logic.
- Conditional mode or business logic. If your actual model is a simple stochastic model such as a logistic regression model, then it might be more appropriate to use a conditional mode or simple business logic as your baseline model. For example, if you are predicting whether a dog will eat more or less than 2 cups of food per day, then you might want to condition on the size of the dog. If, for example, you see that most large dogs eat more than 2 cups of food, then your baseline could simply classify all large dogs as eating more than 2 cups.
- Logistic regression. Finally, if your actual classification model is a complex model with a lot of features, then a simple stochastic model such as a logistic regression model serves as a great baseline.
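Here is a minimal sketch of the mode and conditional mode baselines, using the dog food example from above. The data is invented for illustration and the function names are our own.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (dog_size, eats_more_than_2_cups).
train = [
    ("large", True), ("large", True), ("large", False),
    ("small", False), ("small", False), ("small", True),
    ("small", False), ("medium", False),
]

# Baseline 1: always predict the most common class overall.
overall_mode = Counter(label for _, label in train).most_common(1)[0][0]

# Baseline 2: predict the most common class within each dog size.
by_size = defaultdict(Counter)
for size, label in train:
    by_size[size][label] += 1
size_mode = {size: counts.most_common(1)[0][0] for size, counts in by_size.items()}


def predict_mode(size):
    return overall_mode


def predict_conditional_mode(size):
    # Fall back to the overall mode for sizes that were never observed.
    return size_mode.get(size, overall_mode)


def accuracy(rows, predict):
    return sum(predict(size) == label for size, label in rows) / len(rows)


# Evaluated on the training rows for brevity; use held-out data in practice.
print("mode accuracy:            ", accuracy(train, predict_mode))
print("conditional mode accuracy:", accuracy(train, predict_conditional_mode))
```

If a logistic regression or a more complex classifier cannot clearly beat numbers like these on held-out data, that is a sign the extra complexity may not be paying for itself.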