Are you wondering what problems can arise if you train a regression model on a dataset with many correlated features? Or maybe you want to understand what strategies you can employ to train a regression model on such a dataset? Either way, you are in the right place! In this article, we tell you everything you need to know about training regression models on datasets with correlated features.
We start out by discussing what correlated features are and what types of correlated features cause issues for regression models. After that, we provide more detail on the problems that arise if you train a regression model on a dataset with many correlated features. This includes a discussion of what types of situations correlated features cause the most problems in. Finally, we provide strategies that you can use if you need to train a regression model on a dataset with correlated features.
What are correlated features?
What are correlated features? Correlated features are features that are related in such a way that the value of one feature tells you something about what you can expect to see for the value of the other feature.
There are a few different ways that features can be correlated. One example is that features can be positively correlated or negatively correlated. Another is that features can be linearly correlated or non-linearly correlated. In the following sections, we will explain the differences between these types of correlations and provide more information on what types of correlated features cause the most problems in regression models.
Positive correlation and negative correlation
Features that are highly correlated can be either positively correlated or negatively correlated. If two features are positively correlated, it means that an increase in the value of one feature tends to be associated with an increase in the value of the other feature. If two features are negatively correlated, it means that an increase in one feature is associated with a decrease in the other feature.
When it comes to building regression models, it does not make a difference whether a feature is positively correlated or negatively correlated. Both positively correlated features and negatively correlated features can cause issues in regression models.
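To make the distinction concrete, here is a quick sketch using numpy and synthetic data (both our own choices for illustration) showing one positively correlated pair and one negatively correlated pair:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
pos = 2 * x + rng.normal(size=1000)    # tends to rise as x rises
neg = -2 * x + rng.normal(size=1000)   # tends to fall as x rises

# np.corrcoef returns the correlation matrix; [0, 1] is the pairwise value
pos_r = np.corrcoef(x, pos)[0, 1]
neg_r = np.corrcoef(x, neg)[0, 1]
print(pos_r, neg_r)  # one close to +1, the other close to -1
```

Note that the two coefficients have the same magnitude; only the sign differs, which is why the direction of the correlation does not matter for the issues discussed below.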
Linear correlation and nonlinear correlation
In addition to having a positive or negative correlation, features can also have a linear correlation or a non-linear correlation. In this section, we will explain the difference between linear and non-linear correlations. We will also discuss the differential impact that linearly correlated features and non-linearly correlated features have on regression models.
If two features are linearly correlated, it means that the relationship between them can be described by a straight line: a one-unit increase in one feature is associated with a roughly constant change in the other feature, no matter where in the range of values you are. The standard Pearson correlation coefficient can be used to measure the strength of the relationship between linearly correlated features.
If two features are non-linearly correlated, it means that the relationship between one feature and the other is not constant across all values of both features. For example, for low values of one feature, a one-unit increase might be associated with a small change in the other feature, while for high values of the feature, the same increase might be associated with a much larger change. Since the size of the change is not constant across the range, the relationship is not linear. Non-parametric measures of correlation like the Spearman correlation coefficient can be used to measure the strength of the relationship between features that are non-linearly correlated.
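As a rough illustration of why the two coefficients differ, consider a pair of features with a perfectly monotonic but nonlinear relationship. The synthetic data and the use of numpy are our own assumptions here; Spearman is computed as the Pearson correlation of the ranks:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 500)
y = np.exp(x)  # perfectly monotonic, but strongly nonlinear

# Pearson: strength of the *linear* association
pearson = np.corrcoef(x, y)[0, 1]

# Spearman: Pearson correlation of the ranks, captures any monotonic trend
rx = x.argsort().argsort()
ry = y.argsort().argsort()
spearman = np.corrcoef(rx, ry)[0, 1]

print(pearson)   # noticeably below 1 despite the perfect monotonic link
print(spearman)  # exactly 1, because the ranks of x and y are identical
```

In practice you would likely use `scipy.stats.spearmanr` instead of ranking by hand; the manual version is shown only to make the definition explicit.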
When it comes to building regression models, linearly correlated features and nonlinearly correlated features should not be treated equally. Linearly correlated features tend to be more problematic and cause more issues with regression models than features that have a non-linear correlation.
What problems do correlated features cause for regression models?
What types of problems do correlated features cause when training regression models? Here are a few of the most common problems that occur when you train a regression model on a dataset with correlated features.
- Inflated variance estimates. If you include highly correlated features in a regression model, the variance estimates for the coefficients may be inflated. That means that the variance estimates will be larger than they would otherwise be if the correlated features were not included in the model. This inflation does not come from noisier data; it reflects the model's difficulty in attributing variation that the correlated features share to any one of them individually.
- Unstable coefficient estimates. If you include highly correlated features in a regression model, you may also find that your model and coefficient values are highly unstable. That means that small changes to things like the dataset that is used to train the model may lead to large changes in the model and coefficient values.
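One way to see the instability is to refit the same model on bootstrap resamples of a dataset and compare the spread of a coefficient when the features are independent versus nearly collinear. Everything below is an illustrative sketch: the synthetic data and the helper names (`fit_coefs`, `coef_spread`) are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

def fit_coefs(x1, x2, y):
    """Ordinary least squares fit of y on an intercept, x1, and x2."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def coef_spread(corr, n_resamples=200):
    """Std of the x1 coefficient across bootstrap resamples of one dataset."""
    x1 = rng.normal(size=n)
    # x2 is nearly a copy of x1 when corr is high, independent when corr = 0
    x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # bootstrap resample
        coefs.append(fit_coefs(x1[idx], x2[idx], y[idx])[1])
    return np.std(coefs)

spread_low = coef_spread(0.0)    # independent features: stable estimates
spread_high = coef_spread(0.98)  # near-collinear: much larger spread
print(spread_low, spread_high)
```

The spread for the near-collinear case should come out several times larger, even though the sample size and noise level are identical in both runs.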
When do correlated features cause problems in regression models?
Do correlated features cause more problems in some situations than others? Yes. In this section, we will discuss the scenarios where correlated features cause the largest problems.
- When the main goal is inference. In general, correlated features cause more problems in regression models when your primary goal is inference. That is because inflated variance estimates and unstable coefficient estimates are large hurdles that make it difficult to interpret the results of your model. If your primary goal is prediction and you do not care about inference, then these issues are much less of a blocker.
- When the dataset you are using is small. Correlated features tend to cause more problems when you are training your model on a small dataset. This is especially true if your main goal is inference. If you are using a very large dataset, things like variance inflation will not matter as much because your variance estimates will be very small anyway.
- When the features you want to perform inference on are correlated. Many of the side effects that are caused by introducing correlated features into a regression model only impact the features that are correlated. For example, the variance will only be inflated for features that are correlated with one another. If there are other features that are not highly correlated, then their variance will remain unaffected. That means that correlated features are more of a problem when you specifically want to perform inference on those features that are correlated. If the correlated features are control features that you just want to adjust for, then the correlation may not be as large of a problem.
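A common way to quantify this per-feature effect is the variance inflation factor (VIF): for feature j, it is 1 / (1 − R²), where R² comes from regressing feature j on all of the other features. The sketch below uses synthetic data and a hand-rolled `vif` helper (in practice you might use `statsmodels`); notice how the inflation lands only on the correlated columns:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: regress the column
    on all the other columns and return 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + 0.25 * rng.normal(size=500)   # strongly correlated with a
c = rng.normal(size=500)              # independent control feature
v = vif(np.column_stack([a, b, c]))
print(v.round(1))  # large VIFs for a and b, a VIF near 1 for c
```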
How much correlation is acceptable for regression model features?
How much correlation is too much correlation when it comes to features in regression models? Is it okay if features in regression models have low levels of correlation? In this section, we will talk about what level of correlation is acceptable for features in regression models.
We will start out by saying that the level of correlation that is acceptable in a regression model will depend on multiple characteristics of your dataset. There is not a hard and fast cutoff or threshold that can be applied to the correlation coefficient between two features to deterministically say whether those features are too highly correlated or not. That being said, there are some rules of thumb that are commonly used among practitioners. In general, many practitioners start to worry about correlated features when the correlation coefficient between them has an absolute value of around 0.6 or higher.
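If you want to apply a rule of thumb like this in practice, a simple scan of the correlation matrix is enough. The `flag_correlated_pairs` helper below is our own illustrative function applied to synthetic data, not a library API:

```python
import numpy as np

def flag_correlated_pairs(X, names, threshold=0.6):
    """Return feature pairs whose absolute Pearson correlation exceeds
    the given threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], round(corr[i, j], 2)))
    return pairs

rng = np.random.default_rng(7)
a = rng.normal(size=300)
b = 0.9 * a + 0.3 * rng.normal(size=300)   # strongly related to a
c = rng.normal(size=300)                   # independent of both
X = np.column_stack([a, b, c])

pairs = flag_correlated_pairs(X, ["a", "b", "c"])
print(pairs)  # only the (a, b) pair should be flagged
```

Remember that a threshold like 0.6 is a starting point for investigation, not a rule that decides the question for you.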
If your goal is prediction and you do not care about inference at all, then it may be okay to include very highly correlated features in your regression model. In general, correlated features do not start to cause issues for prediction problems until the features are almost perfectly correlated.
What types of regression models are affected by correlated features?
What types of regression models are affected by correlated features? In general, most linear regression models and generalized linear models are susceptible to the types of issues that are caused by correlated features. Here are some examples of models that may be impacted by highly correlated features.
- Linear regression
- Logistic regression
- Poisson regression
- Negative binomial regression
How to handle correlated features in regression models
So how do you prepare a dataset that has many correlated features for use in a regression model? Here are some strategies you can employ.
- Feature selection. The first option that you have when it comes to preparing a dataset with correlated features for a regression model is to employ feature selection to reduce the number of features that are used in your regression model. There are many different strategies that you can use to do this, but the general goal is to reduce the number of correlated features that are used in the regression model by selecting only the most important feature from each group of correlated features to include in your model. The other features are simply dropped and excluded from the model. If you want to learn more about the techniques you can employ to perform this type of feature selection, check out our guide on feature selection for machine learning.
- Regularization. Another option that you have when it comes to training regression models on datasets that have correlated features is using a regularized regression model. Different types of regularization can address the problem of correlated features in different ways. That being said, ridge regression is a particularly good option when you are looking to train a regression model on correlated data. For more information on ridge regression, check out our article on when to use ridge regression.
- Dimensionality reduction. Another strategy you can use to reduce the impact of correlated features is to use dimensionality reduction techniques to compress the information that is shared across the features into a smaller set of uncorrelated features. One thing to note here is that these dimensionality reduction techniques tend to create new features that do not have a straightforward interpretation. This may impact the interpretability of your model.
- Feature transformation. One final strategy that you can consider when you have multiple correlated features in your dataset is applying transformations to one or more of your correlated features. For example, if there is an accepted way to apply a threshold to one of the correlated features to binarize that feature, then that might be an option that you can look into.
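As a small illustration of the regularization option, here is ridge regression in closed form applied to two nearly duplicate features. Everything below, including the synthetic data and the `ridge` helper, is a sketch rather than a recommended implementation (in practice you would typically use `sklearn.linear_model.Ridge`):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly a duplicate of x1
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

def ridge(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

ols = ridge(X, y, 0.0)      # alpha = 0 is plain OLS: coefficients can swing
shrunk = ridge(X, y, 10.0)  # penalty pulls the two coefficients together
print(ols, shrunk)
```

Because the two features carry almost the same information, the penalty cannot tell them apart and spreads the effect roughly evenly between them, which is exactly the stabilizing behavior that makes ridge regression attractive for correlated data.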