Are you wondering when you should choose a linear regression model over a similar machine learning model? Well then you are in the right place! In this article we tell you everything you need to know to determine when you should reach for a linear regression model.
This article starts out with a discussion of what kind of outcome variables linear regression is typically used for. After that, some of the main advantages and disadvantages of linear regression are discussed. Finally, we provide specific examples of scenarios where you should and should not use a linear regression model.
What outcomes can you use linear regression for?
What types of outcome variables can you use linear regression for? Linear regression should be used when your outcome variable is a continuous numeric variable. If your outcome variable is binary, categorical, or a count, then you should consider looking into other types of regression models, such as logistic, multinomial, or Poisson regression.
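As a quick illustration, here is a minimal sketch of fitting a linear regression to a continuous numeric outcome. It assumes scikit-learn and NumPy are available, and the data is made up for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two numeric features and a continuous numeric outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Fit the model; the estimates should land close to the true values
# of 4.0 (intercept) and [1.5, -2.0] (coefficients).
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```

If `y` were a yes/no label instead of a number, you would swap `LinearRegression` for a classifier such as logistic regression.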
Advantages and disadvantages of linear regression
Are you wondering what the main advantages and disadvantages of linear regression models are? Here are the most important points to keep in mind.
Advantages of linear regression models
- Interpretable coefficients. One of the main advantages of linear regression models is that they have easily interpretable coefficients that come along with confidence intervals and statistical tests. This is very important if inference is a high priority in the project you are working on. Most other machine learning models do not have the same straightforward interpretation that linear regression models do.
- No hyperparameters. Another advantage of linear regression is that it does not have hyperparameters that need to be tuned. You may need to preprocess your data and select which features to use in your model, but other than that there is no need to run different versions of your model with different hyperparameters.
- Well understood. Another benefit of linear regression is that it is well studied and well understood. Most people who have taken an introductory statistics class have at least heard of linear regression. This means that it tends to be more popular with skeptical stakeholders who do not trust other machine learning models.
- Fast prediction. A final advantage of linear regression is that generating predictions is fast and simple: a prediction is just a weighted sum of the features, so it can be implemented even without dedicated machine learning libraries. This makes it easier to put linear regression models into production at companies that have not built out facilities for serving machine learning models.
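To make the first and last of these points concrete, here is a sketch, using only NumPy on made-up data, of fitting ordinary least squares in closed form and computing approximate 95% confidence intervals for the coefficients by hand:

```python
import numpy as np

# Made-up data: y depends linearly on two features.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form OLS estimate

# Standard errors and approximate 95% confidence intervals.
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof                # residual variance estimate
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_crit = 1.96                               # large-sample approximation to the t critical value
lower, upper = beta - t_crit * se, beta + t_crit * se
```

These are the same quantities a statistics library like statsmodels reports in its summary table; the point is that both fitting and prediction reduce to a few matrix operations.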
Disadvantages of linear regression models
- Thrown off by outliers. One disadvantage of linear regression is that it is easily thrown off by outliers in your dataset. If you are using a linear regression model, you should examine your input data and model artifacts to make sure that the model is not being unduly influenced by outliers.
- Thrown off by correlated features. Another disadvantage of linear regression is that it is easily thrown off if you have multiple highly correlated features in your model. Highly correlated features inflate the variance of the coefficient estimates, which makes the individual coefficients unstable and hard to interpret.
- Need to specify interactions. Another disadvantage of linear regression is that you need to explicitly specify interactions that the model should consider when you build your model. If you do not specify interactions between your features, the model will not recognize and account for these interactions.
- Assumes linearity. Linear regression models also assume that there is a linear relationship between your model features and your outcome variable. This means that you might have to preprocess your model features, for example by log-transforming skewed features, to make the relationship more linear.
- Cannot handle missing data. Most implementations of linear regression cannot handle missing data natively. That means you need to preprocess your data and handle the missing values before you fit your model.
- Not peak predictive performance. Another general disadvantage of linear regression is that it does not generally have peak predictive performance on tabular data. If prediction is your main goal, there are other machine learning models that tend to have better predictive performance.
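The outlier sensitivity mentioned above is easy to demonstrate. In this sketch (NumPy only, made-up data), a single extreme point is enough to drag the fitted slope well away from the true value:

```python
import numpy as np

# Made-up data with a true slope of 2.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=50)

def ols_slope(x, y):
    """Fit y = a + b*x by least squares and return the slope b."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta[1]

clean_slope = ols_slope(x, y)            # close to the true slope of 2

# Add a single extreme outlier at a high-leverage x value.
x_out = np.append(x, 10.0)
y_out = np.append(y, -100.0)
outlier_slope = ols_slope(x_out, y_out)  # dragged well below the true slope
```

One bad point out of fifty moves the slope estimate substantially, which is why inspecting your input data for outliers matters so much with this model.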
When to use a linear regression model
When should you choose to use a linear regression model? Here are some examples of scenarios where you should use a linear regression model over another model.
- Inference is your primary goal. If inference is your primary goal, you are often better off using linear regression than another machine learning model. Linear regression models give you estimates of the magnitude of the relationship between your features and your outcome variable along with other useful values like confidence intervals and statistical tests.
- Baseline model. If you are looking for a simple baseline model that you can use to compare more complicated models against, a linear regression model is a decent choice. This is especially true if you have a relatively clean dataset that does not have many missing values or outliers. One of the main benefits linear regression has in these scenarios is that there are no hyperparameters that need to be tuned, so you only have to fit a single model rather than searching over many configurations.
- Building trust. Since linear regression is a well studied and well publicized model, it is often a good model to reach for when you are still building trust with stakeholders that are skeptical of more complicated machine learning models. After you get buy-in for your linear regression model, you can start to compare the performance of other models to the performance of your linear regression model to show the business value that could be added by upgrading your model.
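The baseline idea can be sketched as follows (NumPy only, made-up data): fit a linear regression on a training split and compare its test error against an even simpler predict-the-mean baseline. Any more complicated model you try later should beat both numbers:

```python
import numpy as np

# Made-up data with a linear signal plus noise.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Simple train/test split.
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# Baseline 0: always predict the training-set mean.
mean_mse = np.mean((y_te - y_tr.mean()) ** 2)

# Baseline 1: ordinary least squares with an intercept.
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
beta = np.linalg.solve(A_tr.T @ A_tr, A_tr.T @ y_tr)
pred = np.column_stack([np.ones(len(X_te)), X_te]) @ beta
linear_mse = np.mean((y_te - pred) ** 2)
```

Here the linear model's test error is far below the mean-predictor's, and that gap is the yardstick a fancier model would need to justify its extra complexity against.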
When not to use linear regression
When should you not use linear regression? Here are some examples of cases where you should avoid using a linear regression model.
- Small improvements in predictive performance have a large impact. If you are operating in a scenario where small improvements in predictive performance can have large impacts on the business, you may be better off reaching for another model. For example, gradient boosted trees tend to have better predictive performance than linear regression models. This is especially true in cases where the relationships between your features and your outcome variable are not perfectly linear.
- You don’t have a lot of time to explore the data. Since linear regression is easily thrown off by things like missing data, outliers, and correlated features, it is not a great choice to turn to if you do not have a lot of time to clean and preprocess your data. In these types of situations, you might be better off turning to a tree-based model, such as a random forest model, that is less sensitive to these issues.
- You have more features than observations. If you have more features in your model than you do observations in your dataset, a standard linear regression is not a good choice. You should either reduce the number of features you are using in your model or use another model that can handle this situation. Ridge regression is one example of a model that can handle this situation.
- You have many correlated features. If you have many features in your model that are correlated with one another, you may be better off using ridge regression. This is a regularized version of regression that handles correlated features much better than a standard regression model.
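The last point can be illustrated with a sketch (NumPy only, made-up data) using two nearly collinear features. Plain OLS still predicts well, because the coefficient sum is pinned down, but the individual coefficients are typically unstable and far from their true values; the ridge penalty keeps them well behaved:

```python
import numpy as np

# Made-up data: x2 is nearly identical to x1, and the true model is y = x1 + x2 + noise.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = x1 + x2 + rng.normal(scale=1.0, size=n)
X = np.column_stack([x1, x2])

# OLS: solve (X'X) beta = X'y. The near-singular X'X makes the individual
# coefficients unstable, often large and opposite-signed.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: solve (X'X + lam*I) beta = X'y. The penalty keeps the inverse well
# behaved, so the shared signal is split roughly evenly across the two features.
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

In practice you would pick the penalty strength by cross-validation rather than fixing it by hand as done here.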
Related articles

- When to use logistic regression
- When to use ordinal logistic regression
- When to use multinomial regression
- When to use random forests
- When to use ridge regression
- When to use LASSO
- When to use support vector machines
- When to use gradient boosted trees
- When to use Poisson regression
- When to use neural networks
- When to use mixed models
Are you trying to figure out which machine learning model is best for your next data science project? Check out our comprehensive guide on how to choose the right machine learning model.