Are you wondering what max depth is in a random forest? Or maybe you want to know more about the range of values you should consider for max depth when tuning a random forest model? Well, either way, you are in the right place! In this article we tell you everything you need to know about the max depth parameter in a random forest model.
First, we will talk about what max depth is and whether max depth is an important parameter to tune in a random forest model. After that, we will talk about what range of values you should consider for max depth. Finally, we will discuss other random forest parameters that are closely related to max depth.
What is max depth in a random forest model?
What does the max depth parameter in a random forest model control? Before we talk about what the max depth parameter controls, we will first take a step back and talk about how random forest models are created. A random forest model is an ensemble model that is made up of a collection of simple models called decision trees.
Decision trees are built by successively partitioning the data into two parts. The first step in making a decision tree is choosing one variable and splitting the data into two parts based on the value of that variable. For example, if the variable you are splitting on is a person’s age, then you might split the data so that every person under 18 goes to one side of the split and every person 18 or over goes to the other. After that, successive splits are made on different variables until a stopping condition has been met.
Sometimes splits are made until each terminal node contains only one item and no further splits are possible. Other times, early stopping conditions are used to define when to stop splitting the data. The maximum depth parameter is exactly that – a stopping condition that limits the number of splits that can be performed in a decision tree. Specifically, the max depth parameter limits the number of levels deep a decision tree can go.
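In scikit-learn, for example, this stopping condition is exposed as the max_depth argument of the random forest classes. Here is a minimal sketch using a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, just for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Limit every tree in the forest to at most 3 levels of splits
forest = RandomForestClassifier(max_depth=3, n_estimators=50, random_state=0)
forest.fit(X, y)

# Every fitted tree in the ensemble respects the limit
depths = [tree.get_depth() for tree in forest.estimators_]
print(max(depths))  # never exceeds 3
```

Without the max_depth argument, scikit-learn grows each tree until every leaf is pure (or until other stopping conditions kick in).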
The diagram below shows an example of a simple decision tree. This decision tree has a depth of two because there are two layers of splits between the original data and the data at the bottom of the tree. First there is a split on age and then there is a second split on weight.
Is max depth an important parameter to tune?
Is max depth an important parameter to tune when you are building a random forest model? The answer to that question is yes – the max depth of your decision trees is one of the most important parameters that you can tune when creating a random forest model. You should tune max depth (or a similar parameter that limits how many splits can happen) anytime you are performing hyperparameter tuning for a random forest model.
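One common way to tune max depth is a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the candidate depths are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Try a few candidate depths with 3-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},  # None = grow trees fully
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```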
What max depth should you use for a random forest model?
What values of max depth should you consider when you are creating a random forest model? In this section we will tell you everything you need to know to be able to answer that question. First we will talk about the advantages and disadvantages of using a large value for max depth. After that, we will talk about the range of values you should consider for max depth.
Advantages of using a large max depth
What are the advantages of using a large value for max depth when creating a random forest model? The main advantage of allowing trees to grow deep is that predictive performance may improve. In general, adding more splits to your trees will result in better predictions (as long as your model is not overfitting). That being said, you will start to see diminishing returns after a certain number of splits.
Disadvantages of using a large max depth
What are the disadvantages of using a large value for max depth when creating a random forest model? The main disadvantage of using a large value for max depth is computational cost. Deep trees take longer to train and make predictions with, so if speed is important to you then you might be better off limiting the size of your trees.
Decision trees that have a large max depth are also more likely to overfit to the data they were trained on than shallow trees with a small max depth. Reducing the max depth parameter is a great way to prevent your decision trees from overfitting. For more information on this, check out our article on random forest overfitting.
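You can see this trade-off directly by comparing training and test accuracy for shallow versus unrestricted trees. This sketch uses deliberately noisy synthetic data (the flip_y argument adds label noise) so the overfitting is easy to provoke:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data: 20% of labels are flipped at random
X, y = make_classification(n_samples=400, n_features=20,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for depth in [2, None]:  # shallow vs unrestricted depth
    rf = RandomForestClassifier(max_depth=depth, n_estimators=100,
                                random_state=0).fit(X_tr, y_tr)
    # (train accuracy, test accuracy)
    results[depth] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))
    print(depth, results[depth])
```

Expect the unrestricted forest to fit the training data much more closely than the depth-2 forest, with a larger gap between its train and test scores.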
Range of values to consider for max depth
What range of values should you consider for max depth? In general, it is good to keep the lower bound on the range of values close to one. There are many cases where random forests with a max depth of one have been shown to be highly effective. The upper bound on the range of values to consider for max depth is a little fuzzier. In general, we recommend trying max depth values ranging from 1 to 20. It may make sense to consider larger values in some cases, but this range will serve you well for most use cases.
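A straightforward way to sweep this range is to cross-validate each candidate depth and keep the best one. A sketch on synthetic data (the small tree count keeps the example fast; in practice you would use more trees):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Score each candidate depth in the recommended 1-20 range
scores = {}
for depth in range(1, 21):
    rf = RandomForestClassifier(max_depth=depth, n_estimators=30,
                                random_state=0)
    scores[depth] = cross_val_score(rf, X, y, cv=3).mean()

best_depth = max(scores, key=scores.get)
print(best_depth)
```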
Here are some factors that will affect the range of max depth values that you should consider.
- Amount of data. The amount of data that you are using to train your model will impact the values that you should use for max depth. The reason for this is that if you do not have a lot of data, you will not be able to make many splits in your trees regardless. As a simple example, if you only have three data points then the maximum possible depth that your trees could reach is two. In this case, it does not make any sense to set a max depth value that is higher than two.
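The three-data-point example above is easy to verify in code: even with no depth limit at all, a tree fit on three points cannot get deeper than two levels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# With only three training points, at most two splits can happen
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([0, 1, 0])

# No depth limit, yet the tree still stops at depth 2
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)
print(tree.get_depth())
```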
Parameters that are similar to max depth
Are there other parameters that are similar to the max depth in a random forest model? Here are a few other parameters that can be used to control the maximum depth of decision trees in a random forest model.
- Max number of leaf nodes. An alternative to limiting the max depth of a tree is limiting the number of leaf nodes in the tree. A leaf node is simply a group of observations at the bottom of the tree that will not be split any further. For example, a decision tree with a single split (a depth of one) has two leaf nodes – one with the observations that fell on the left side of the split and one with the observations that fell on the right side. The main difference between the two limits is the shape of the trees they allow. Limiting max depth constrains every leaf node to sit within a fixed number of levels of the root. Limiting the number of leaf nodes without limiting max depth does not: you could imagine a situation where every successive split after the initial split falls on the right side of the tree. In that case, the leaf nodes on the right side of the initial split would be many levels down, whereas the leaf node on the left side would sit at the first level.
- Max number of splits. Limiting the maximum number of splits in a tree is another alternative to limiting the maximum depth of a tree. Much like the maximum number of leaf nodes, limiting a decision tree based on the maximum number of splits does not constrain the tree to have the majority of the leaf nodes at the same level of the tree.
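In scikit-learn, the leaf-count alternative is exposed as the max_leaf_nodes argument (there is no separate max-splits argument, though in a binary tree the number of splits is always the number of leaves minus one). A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Limit the tree by leaf count instead of by depth
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)

# Leaf count is capped, but the depth is free to vary across branches
print(tree.get_n_leaves(), tree.get_depth())
```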
Related articles

- Hyperparameter tuning for random forests
- Number of trees in random forests
- Mtry in random forests
- When to use random forests
- Random forest overfitting