Are you wondering when to use random forests rather than other machine learning models? Then you are in the right place! In this article we tell you everything you need to know to understand when to use random forests. We start with a discussion of the main advantages and disadvantages of random forests. With that context in mind, we then provide specific examples of cases where you should use random forest models over other machine learning models.
What types of outcomes can random forests handle?
The first thing you should consider when deciding whether to use a random forest model is what your data and your outcome variable look like. This will help you understand whether random forests are even an option for you. What types of outcome variables can be predicted using random forests? Here are some examples of types of outcomes that random forests can handle.
- Numeric outcomes
- Binary outcomes
- Multiclass outcomes
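The three outcome types above can be sketched in a few lines. This example assumes scikit-learn as the implementation (the article does not depend on any particular library) and uses synthetic data purely for illustration:

```python
# Sketch of the three outcome types, assuming scikit-learn as the
# implementation and synthetic data purely for illustration.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Numeric outcome: a regression forest.
X_num, y_num = make_regression(n_samples=200, n_features=5, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_num, y_num)

# Binary outcome: a classification forest with two classes.
X_bin, y_bin = make_classification(n_samples=200, n_classes=2, random_state=0)
clf_bin = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bin, y_bin)

# Multiclass outcome: the same classifier handles more classes natively.
X_mc, y_mc = make_classification(
    n_samples=200, n_classes=3, n_informative=4, random_state=0
)
clf_mc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_mc, y_mc)

print(clf_bin.classes_)  # two class labels
print(clf_mc.classes_)   # three class labels
```

Note that the regressor and the classifier share the same interface; only the choice of estimator changes with the outcome type.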
Advantages and disadvantages of random forests
Before we talk about which situations random forests should be used in, we will start with the advantages and disadvantages of random forest models. This discussion will provide context that helps justify why random forests perform well in specific scenarios.
Advantages of random forests
- Handle interactions well. One of the first advantages of random forests is that they handle interactions well. They are able to handle interactions between variables natively because sequential splits can be made on different variables. This means that you do not need to explicitly encode interactions between different variables in your feature set.
- Handle outliers well. In addition to handling interactions well, random forests also handle outliers well. This is because all values that lie on the same side of a split are treated equally. Values that are just barely above the split threshold are treated the same as values that are orders of magnitude above the threshold. As a result, you do not need to preprocess your data or remove extreme values.
- Handle non-linearity well. Random forests also handle cases where the relationship between a feature and the outcome is non-linear better than many other models. This is because there are no attempts to fit a linear trend to the data during random forest training. This means that there is no need to transform your features to ensure that they have a linear relationship with the outcome variable.
- Handle missing data natively. In addition to handling interactions, outliers, and non-linearity well, many implementations of random forests handle missing data natively. That means that there is no need to impute your missing data with values that may or may not introduce bias.
- Handle high dimensionality well. Random forests also work well in cases where you are handling data with high dimensionality, such as cases where you have many features you want to include. One of the reasons for this is that only a subset of the features are considered at each split.
- Easily parallelizable. Another great fact about random forests is that they are highly parallelizable. This is because random forests are made up of many decision trees that are built independently without receiving any information from any other decision trees. That means that you can easily distribute the computations required to build each decision tree to a different machine to speed up your training.
- Handle multiclass classification outcomes natively. Random forests are also a great option to reach for if you are working with a multiclass outcome that has many classes. This is because many random forest implementations can handle multiclass classification problems natively with a single model rather than training one model for each outcome class.
- Low sensitivity to hyperparameter choices. Another reason to love random forests is that they are not very sensitive to the choice of hyperparameters. While hyperparameter tuning can help to improve performance in random forest models, random forests with poorly chosen hyperparameters still tend to do an okay job.
- Easily understandable. Another benefit of random forests is that they are made up of decision trees, which are relatively easy to explain to curious stakeholders. Stakeholders who have a high-level understanding of how a model works are more likely to trust its predictions.
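Two of the advantages above, interactions and non-linearity, are easy to see in a small sketch. Here, assuming scikit-learn as the implementation, a forest learns an outcome that depends only on the product of two features, which a plain linear model cannot represent; setting n_jobs=-1 also builds the independent trees in parallel:

```python
# Sketch (assuming scikit-learn): a forest captures a pure interaction
# y = x1 * x2 with no feature engineering, while a linear model cannot.
# n_jobs=-1 builds the forest's independent trees in parallel.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] * X[:, 1]  # outcome depends only on the interaction term

forest = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
linear = LinearRegression().fit(X, y)

print(f"forest R^2: {forest.score(X, y):.2f}")
print(f"linear R^2: {linear.score(X, y):.2f}")
```

The forest's fit is close to perfect on this data, while the linear model's R-squared is near zero because neither feature is predictive on its own.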
Disadvantages of random forests
- No directly interpretable coefficients. One disadvantage that random forests have is that they do not produce coefficients that are directly interpretable. That means you need to do some additional work if you want to explain why your random forest model is making the predictions that it is.
- Not peak performance. Another disadvantage of random forests is that other models such as gradient boosted tree models tend to perform better than random forests when proper hyperparameter tuning is performed. This means that random forests might not be your best choice if you are in a situation where small improvements in performance metrics can have a very large impact.
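As an example of the extra interpretation work mentioned above, one common technique is permutation importance, sketched here with scikit-learn (an assumed implementation, not the only option). Each feature is scored by how much randomly shuffling its values degrades model performance:

```python
# Sketch of permutation importance as a substitute for coefficients,
# assuming scikit-learn. With shuffle=False, the informative features
# are the first two columns, so they should dominate the scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: {score:.3f}")
```

Unlike a regression coefficient, a permutation score tells you how much a feature matters, not the direction or size of its effect, so it only partially closes the interpretability gap.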
When to use random forests
Now we will talk about some situations where random forests perform well. These are just a few examples of situations where random forests are a great option!
- Simple baseline. Random forests are a great option to turn to if you want to build a relatively simple baseline to evaluate another model against. For example, if you plan to use a more complex model like a neural network, then building a simple random forest with just a few variables is a great way to evaluate whether your complex model is doing better than a simple baseline model would.
- Prototype. Similarly, if you are building a prototype model to evaluate what kind of impact a machine learning model can have for a specific use case, then a random forest model is a great option. This is because random forest models are not as sensitive to hyperparameters as other models, so it is easy to get a decent model up and running without too much parameter tuning.
- You have a long backlog of projects to get to. Creating a new model for a case where no model exists often provides more value than making slight improvements to the performance metrics of an existing model. This means that you are often better off building a model that gets you 80% of the way to perfection with 20% of the work than putting in a hefty effort to get your model to 85%. Since random forests are not very sensitive to hyperparameters, they are great for situations where you want to get a decent model up and running and then move on to another high-impact project.
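The baseline idea from the list above can be sketched in a few lines. This example assumes scikit-learn and its bundled breast cancer dataset, chosen purely for illustration:

```python
# Sketch of a random forest baseline (assuming scikit-learn): default
# hyperparameters plus cross-validation give a reference score that any
# more complex model should have to beat.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
baseline = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(baseline, X, y, cv=5)
print(f"baseline accuracy: {scores.mean():.3f}")
```

If a neural network or other complex model cannot clearly beat this number, the added complexity is probably not paying for itself.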
When not to use random forests
Now that we have covered situations where random forests are a good option, let's look at some cases where they are not. Here are some examples of cases where random forests might not be your best choice.
- Inference is your main goal. If inference rather than prediction is the main goal of your project, then you might be better off using a more traditional regression model. This is because regression models tend to have coefficients that can be interpreted directly to draw conclusions about the data.
- Small increases in accuracy are crucial. If you are working on a project where you are trying to improve an existing model and small increases in accuracy are absolutely crucial, then random forests might not be your best bet. Similar tree-based models like gradient boosted trees tend to outperform random forests when their hyperparameters are tuned properly.