Are you wondering whether to use logistic regression for a data science project? Or maybe you are wondering what advantages logistic regression has over similar models? Either way, you are in luck!
In this article we tell you everything you need to know to determine when to use logistic regression for a data science project. First, we highlight some of the major advantages and disadvantages of logistic regression. After that, we talk about specific situations where you should consider using logistic regression. We also provide some examples of scenarios where logistic regression might not be your best bet.
What type of outcomes can logistic regression handle?
One of the first things you need to think about when deciding which machine learning model to use is the format of your outcome variable. So what types of outcome variables can logistic regression handle? Standard logistic regression can only handle binary outcome variables, that is, outcome variables that have exactly two levels.
If your outcome variable is not binary, then you have two options. The first is to find a different model that better accommodates the type of outcome variable you are using. This is going to be your best option in most cases. If you really do want to use logistic regression, your second option is to reformat your outcome variable so that it is binary.
If your outcome variable is numeric then you can choose a threshold and say that any value above that threshold falls into one category and any value below that threshold falls into the other. If you have a categorical outcome variable with multiple categories, you can combine some of the categories together so that you only have two categories in the end.
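Both reformatting approaches can be sketched in a few lines of plain Python. This is only an illustration: the helper names, the example data, and the threshold of 100 are all made up, and the choice to put values exactly equal to the threshold in the lower category is an assumption you would need to make deliberately in your own project.

```python
def binarize_numeric(values, threshold):
    """Return 1 for values strictly above the threshold, else 0.

    Assumption: values equal to the threshold go to the 0 category.
    """
    return [1 if v > threshold else 0 for v in values]


def binarize_categories(labels, positive_labels):
    """Return 1 for labels in the chosen positive group, else 0."""
    positive = set(positive_labels)
    return [1 if label in positive else 0 for label in labels]


# Hypothetical numeric outcome: order value, cut at 100.
order_values = [20.0, 150.0, 99.5, 310.0]
high_value = binarize_numeric(order_values, threshold=100.0)

# Hypothetical multi-category outcome collapsed into two groups.
statuses = ["active", "cancelled", "paused", "active", "expired"]
churned = binarize_categories(statuses, positive_labels=["cancelled", "expired"])

print(high_value)
print(churned)
```

Keep in mind that either transformation throws away information, which is part of why switching to a model that fits the original outcome is usually the better option.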
Advantages and disadvantages of logistic regression
Before we talk about the specific scenarios where logistic regression should and should not be used, we will first take some time to go over its main advantages and disadvantages. These will inform the recommendations in the sections that follow.
Advantages of logistic regression
- Interpretable coefficients. One of the main advantages of logistic regression is that it provides interpretable coefficients out of the box. Logistic regression is one of the best options you have when you want to be able to give straightforward descriptions of exactly how the features in your model relate to the outcome variable.
- Simple model. Another advantage of logistic regression is that it is a relatively simple model that does not have many parameters that need to be estimated. This means that logistic regression is a good option to turn to if a more complex model is overfitting.
- Well understood model. Logistic regression is a fairly common model that is understood and recognized by many. That means that some people may trust the results of a logistic regression model more than the results of a more complicated model that they are not familiar with. Trust can play an important role in determining whether the results of a model get used or not, so this human factor should not be ignored.
- Fast inference. The calculations that are required to make predictions with a logistic regression model are often faster than those required for more complicated models.
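To make the points about interpretability and fast inference concrete, here is a minimal sketch in plain Python. The intercept, coefficients, and feature names are hypothetical values invented for illustration, not output from a real fitted model. Exponentiating a coefficient gives an odds ratio, and a prediction is just a weighted sum passed through the logistic function.

```python
import math

# Hypothetical fitted values, made up for illustration only.
intercept = -1.5
coefficients = {"age_decades": 0.4, "is_smoker": 0.9}


def odds_ratios(coefs):
    """exp(coefficient) is the multiplicative change in the odds of the
    outcome for a one-unit increase in that feature, holding the others fixed."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


def predict_probability(intercept, coefs, features):
    """Apply the logistic function to the linear predictor."""
    linear = intercept + sum(beta * features[name] for name, beta in coefs.items())
    return 1.0 / (1.0 + math.exp(-linear))


ratios = odds_ratios(coefficients)
p = predict_probability(intercept, coefficients, {"age_decades": 5.0, "is_smoker": 1.0})

for name, ratio in ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
print(f"predicted probability: {p:.3f}")
```

With these made-up numbers, the odds ratio for is_smoker is about 2.46, which you could describe to a stakeholder as "smokers have roughly 2.5 times the odds of the outcome, all else being equal."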
Disadvantages of logistic regression
- Easily thrown off by outliers. Just like linear regression, logistic regression is easily thrown off by outliers. That means that you have to take the time to look at your data and model output to make sure that the model is not being unduly influenced by a few outliers.
- Does not automatically handle interactions. Logistic regression models do not account for interactions natively. Instead, you need to specify any interactions that you want to be included in your model. Failing to specify key interactions could have negative impacts on your results, especially if you are using the logistic regression model for inference.
- Does not automatically handle missing data. Another pitfall of logistic regression is that it does not natively handle missing data. That means that you will have to prepare your data ahead of time and make sure that the missing values are taken care of before you feed the data into the model.
- Struggles with correlated features. Just as with outliers, logistic regression models can be thrown off by correlated features. When multiple features are highly correlated, the coefficient estimates become unstable and hard to interpret, so it is generally not a good idea to include all of them in the same logistic regression model.
- Not peak predictive performance. Logistic regression is not regarded as state of the art when it comes to binary classification, and there are other models that can often make more accurate predictions. When predictive accuracy is your top priority, you may be better off using a model such as gradient boosted trees.
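Several of these pitfalls come down to preparation work you do before fitting the model. The sketch below shows one possible way to handle three of them with the Python standard library: filling in missing values, checking how correlated two features are, and building an interaction term by hand. The data and feature names are hypothetical, and median imputation is just one simple option chosen for illustration, not a recommendation for every dataset.

```python
import math
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical features, made up for illustration.
height_cm = [160.0, None, 175.0, 182.0, 168.0]
height_in = [63.0, 67.0, 68.9, 71.7, 66.1]
dose = [1.0, 2.0, 1.0, 3.0, 2.0]
is_treated = [0, 1, 1, 0, 1]

# Missing data: fill the gap before the feature reaches the model.
height_cm = impute_median(height_cm)

# Correlated features: a correlation near 1 suggests keeping only one of the pair.
r = pearson(height_cm, height_in)

# Interactions: the model only sees this term if you create it yourself.
dose_x_treated = [d * t for d, t in zip(dose, is_treated)]

print(height_cm)
print(round(r, 3))
print(dose_x_treated)
```

Here the two height columns measure the same thing in different units, so their correlation is very close to 1 and only one of them belongs in the model.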
When to use logistic regression
So when should you use a logistic regression model? Here are some examples of scenarios where logistic regression is a good choice.
- Inference. Logistic regression is a great model to turn to if your primary goal is inference, or even if inference is a secondary goal that you place a lot of value on. This is especially true if you need to include confidence intervals or evidence of statistical significance in your analysis.
- Baseline model. Logistic regression is also a great option if you are looking for a simple baseline model that you can use to benchmark more complex machine learning models against. If a more complicated model is not able to perform much better than your simple baseline, then you are probably better off sticking with the simple model.
- Building trust. Since logistic regression is a classic, well-studied statistical model, it is often better received by stakeholders who are skeptical of complicated machine learning models. That makes it a great option to reach for when you are still building trust with those stakeholders.
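For the inference use case, a confidence interval for a coefficient can be computed from its estimate and standard error. The sketch below uses the common Wald interval (estimate plus or minus a normal quantile times the standard error) and only the Python standard library. The estimate of 0.9 and standard error of 0.3 are hypothetical numbers; in practice they would come from your fitted model's output.

```python
import math
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Wald confidence interval for a coefficient on the log-odds scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical coefficient estimate and standard error.
beta, se = 0.9, 0.3

low, high = wald_interval(beta, se)

# Exponentiate the endpoints to get an interval for the odds ratio.
or_low, or_high = math.exp(low), math.exp(high)

# If the interval on the log-odds scale excludes 0, the coefficient is
# statistically significant at the corresponding level.
significant = not (low <= 0.0 <= high)

print(f"95% CI for coefficient: ({low:.3f}, {high:.3f})")
print(f"95% CI for odds ratio: ({or_low:.2f}, {or_high:.2f})")
print(f"significant: {significant}")
```

The Wald interval is only one of several ways to build an interval for a logistic regression coefficient, so treat this as a starting point rather than the definitive method.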
When not to use logistic regression
When should you avoid using logistic regression models? Here are a few examples of scenarios where you should avoid using a logistic regression model.
- You don’t have time to explore the data. Issues like correlated features and outliers have a much larger impact on logistic regression models than on nonparametric models such as random forests. If you do not have a lot of time to explore your data and look out for pitfalls that might affect your model, you might be better off using another model that is more robust to these issues.
- You don’t understand interactions between your variables. Unlike some other machine learning models, logistic regression models do not handle interactions natively. That means that if you do not specify the relevant interactions, your model will not consider interactions between variables. If you have reason to believe that there may be interactions between variables, but you do not have enough context to understand exactly which variables will interact, you might be better off going with a model that handles interactions natively.
- You need maximal performance. If you are in a situation where a very small increase in your model performance metrics is going to deliver a large increase in business value, you may be better off opting for another model.
Related articles
- When to use ordinal logistic regression
- When to use multinomial regression
- When to use random forests
- When to use ridge regression
- When to use LASSO
- When to use support vector machines
- When to use gradient boosted trees
- When to use linear regression
- When to use Poisson regression
- When to use Bayesian regression
- When to use neural networks
- When to use mixed models
Are you trying to figure out which machine learning model is best for your next data science project? Check out our comprehensive guide on how to choose the right machine learning model.