Are you wondering what surrogate splits are in the context of machine learning models? Or maybe you want to learn more about why they are used? Well then you are in the right place! In this article, we tell you everything you need to know to understand what surrogate splits are and why they are used in machine learning.
We start out by talking about why surrogate splits are used. This includes a discussion of what types of problems surrogate splits are intended to solve. After that, we discuss what types of machine learning models use surrogate splits. Finally, we discuss how surrogate splits work and how they are implemented in machine learning models.
Why are surrogate splits used?
Why are surrogate splits used in machine learning models? And what types of issues are surrogate splits designed to solve? Surrogate splits were designed to make it easier for machine learning models to handle variables that have missing data. For many machine learning models, you need to pre-process your data to remove rows with missing values or impute missing values before you are able to run the data through the machine learning model. This is not the case if you are using a model that utilizes surrogate splits. Models that use surrogate splits can be applied to datasets that have missing values in one or more variables without any need to pre-process the data to remove the missing values.
What types of models use surrogate splits?
What types of machine learning models use surrogate splits? Surrogate splits are generally used in decision trees and ensemble models that are made up of multiple decision trees like random forests. Introducing surrogate splits into your model can make your model more computationally intensive, so surrogate splits are most commonly used in individual decision trees. That being said, surrogate splits can be used in ensemble models like random forests and they are used in some implementations of these models.
How do surrogate splits work?
So what actually is a surrogate split? And how do surrogate splits work? As we said before, surrogate splits are most commonly used in decision trees and ensemble models that are made up of multiple decision trees. In order to understand how surrogate splits work, it is useful to first understand how decision trees work.
Decision tree models are created by taking a dataset and choosing the best variable that can be used to split the dataset into two parts. For example, you might separate all of the observations that have a value greater than 10 for a given variable into one part and all of the observations that have a value less than or equal to 10 into the other part. After you make your initial split, you apply the same methodology to each of the resulting parts, splitting your data into smaller and smaller parts until you reach some stopping condition. It is not important that you understand exactly how the best split is decided in order to understand how surrogate splits work. All you need to know is that decision tree models successively partition datasets into smaller and smaller pieces by applying a series of splits to the data.
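To make the idea of successive partitioning concrete, here is a minimal sketch in plain Python. The function name `split_rows`, the variable names, and the thresholds are all made up for illustration, and the splits are chosen by hand rather than by any real split-selection rule.

```python
def split_rows(rows, feature, threshold):
    """Partition rows into (left, right) by comparing one feature to a threshold."""
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return left, right


# Toy dataset (hypothetical values, no missing data yet).
rows = [
    {"age": 4, "income": 20},
    {"age": 12, "income": 35},
    {"age": 15, "income": 80},
    {"age": 8, "income": 60},
]

# First split: observations with age > 10 go right, all others go left.
left, right = split_rows(rows, "age", 10)

# Second split: partition the right-hand part again, this time on another variable.
right_left, right_right = split_rows(right, "income", 50)

print(len(left), len(right), len(right_left), len(right_right))
```

A real decision tree would pick each variable and threshold automatically and keep splitting until a stopping condition is met, but the partitioning step itself looks much like this.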
So when do surrogate splits come into play? Surrogate splits are applied when the variable that you choose to split your data on has missing values in it. For observations that have a missing value for the variable you are splitting your data on, you look for another variable whose split divides the data in nearly the same way as the main split, and you use that variable to decide which side the observation goes to instead. If the best substitute variable also has a missing value for that observation, you move on to your next choice until you arrive at a variable that does not have a missing value. This gives you a way to split your data even when the main variable you are splitting on has a missing value.
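The fallback logic described above can be sketched in a few lines of Python. This is a simplified illustration under some assumptions, not the implementation used by any particular library: missing values are represented as `None`, candidate surrogate splits are given by hand, and candidates are ranked by how often they send an observation to the same side as the main split. The names `agreement`, `rank_surrogates`, and `route` are hypothetical.

```python
def goes_left(value, threshold):
    """Return True if a value falls on the left side of a split."""
    return value <= threshold


def agreement(rows, primary, surrogate):
    """Share of rows (with both variables observed) that both splits send the same way."""
    p_feat, p_thr = primary
    s_feat, s_thr = surrogate
    both = [r for r in rows if r[p_feat] is not None and r[s_feat] is not None]
    if not both:
        return 0.0
    same = sum(
        goes_left(r[p_feat], p_thr) == goes_left(r[s_feat], s_thr) for r in both
    )
    return same / len(both)


def rank_surrogates(rows, primary, candidates):
    """Order candidate splits from best to worst mimic of the primary split."""
    return sorted(candidates, key=lambda c: agreement(rows, primary, c), reverse=True)


def route(row, primary, surrogates, default_left=True):
    """Use the primary split if possible, else the first surrogate with a value."""
    for feature, threshold in [primary] + surrogates:
        if row[feature] is not None:
            return goes_left(row[feature], threshold)
    # Every variable is missing: fall back to a fixed direction (an assumption here).
    return default_left


# Toy data (hypothetical). None marks a missing value.
rows = [
    {"age": 4, "height": 100, "income": 60},
    {"age": 8, "height": 125, "income": 20},
    {"age": 12, "height": 150, "income": 70},
    {"age": 15, "height": 165, "income": 30},
    {"age": None, "height": 110, "income": 80},
    {"age": None, "height": None, "income": 90},
]

primary = ("age", 10)
ranked = rank_surrogates(rows, primary, [("income", 50), ("height", 140)])

for row in rows:
    side = "left" if route(row, primary, ranked) else "right"
    print(row, "->", side)
```

In this toy data, the height split agrees with the age split on every fully observed row, so it is tried first. The fifth row is routed using height, and the last row, which is missing both age and height, falls through to income.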