A practical guide to feature selection for machine learning

Are you wondering why you should use feature selection techniques to reduce the number of features used in your machine learning model? Or maybe you have decided you want to reduce the number of features you are using but you are unsure what feature selection techniques you should use? Either way, we’ve got you covered! In this article we tell you everything you need to know about feature selection techniques for building machine learning models.

Instead of providing an all-inclusive list of every possible feature selection technique out there, we focus on the most popular and useful feature selection techniques that work well most of the time. For each type of feature selection technique we discuss, we provide information on the pros and cons of that type of technique and the use cases where it performs best.

Benefits of feature selection

Before we talk about how to implement feature selection for a machine learning model, we will take a step back and discuss the benefits of using feature selection techniques to reduce the number of features in your model. So why should you use feature selection techniques? Here are some benefits of using feature selection.

  • Reduce overfitting. One of the main reasons that you should consider using feature selection techniques is if you find that your model is overfitting to the training data and not performing well on data that was not seen during the training period. One of the best ways to improve a model that is overfitting is to reduce the complexity of your model. Reducing the number of features in the model is one possible way to do this.
  • Improve performance metrics. More generally, removing features that do not provide much signal can improve your model's performance metrics even if the model is not overfitting. Including a large number of noisy features that provide little useful information will generally degrade your performance metrics.
  • Reduce training costs. Another reason to use feature selection techniques to reduce the number of features going into your model is to reduce the amount of time and resources it takes to train your model. This can be particularly useful if you are using a large or complex model that is slow to train.
  • Reduce failure rate. Reducing the number of features used in your model also reduces the number of potential failure points that you have. When you add new features to your model, you are also adding additional dependencies on upstream datasets that may fail from time to time.

Feature selection methods for machine learning

Now we will dive into talking about some of the most popular and reliable feature selection techniques. In order to facilitate this discussion, we will split these common feature selection techniques up into four different categories. Specifically, the categories of feature selection techniques we will discuss are as follows.

  • Model free feature selection techniques
  • Feature selection techniques that utilize model artifacts
  • Feature selection techniques that utilize model interpretability methods
  • In-model feature selection techniques

Model free feature selection

  • Description: The first type of feature selection technique we will discuss is model free feature selection. These are techniques that you can implement without ever training any type of machine learning model.
  • Use case: Model free feature selection techniques are great to use in the beginning of the model building process when you are just entering the exploration phase of a project. They can give you an initial directional indication of whether a potential feature you are considering may make a useful contribution or not.
  • Examples:
    • Domain knowledge. One way to select what features should be included in a model is to lean in on domain knowledge related to the application area. Sometimes domain knowledge you have going into the model building process can help to inform what types of features should and should not be included in a model.
    • Univariate plots and analysis. Another model free way to determine whether a feature should be included in a model is to perform some univariate analysis of the variables you are considering. For example, you might want to look at characteristics such as the number of missing values and the spread of the data. In general, features that have an excessive number of missing values and features that take on the same value for the vast majority of observations are less likely to be useful.
    • Bivariate plots and analysis. After you perform some univariate analysis on your potential features, you can also perform a bivariate analysis that looks at the relationship between your outcome variable and each of the potential features. Simple bivariate plots and correlation analyses can provide a directional indication of whether each potential feature is related to your outcome variable.
    • Statistical tests. If you want to take your bivariate analysis a step further, you can use statistical tests like t-tests and chi-square tests to test whether there is a relationship between each of your variables and your outcome variable. If you suspect that some of your variables might not be normally distributed, we recommend looking into nonparametric statistical tests.
    • Weight of evidence: This is a more complex technique that can be used to examine the predictive power of each of your potential features for predicting your outcome variable. Weight of evidence scores only work when you are using categorical features and a binary outcome, so you may need to bin your data if it is continuous. Weight of evidence scores work by comparing the proportion of positive cases to the proportion of negative cases at each level of each categorical variable.
  • Pros:
    • Easy to implement
    • Easy to explain
    • Do not require any model training
    • Generalizable to any type of model
  • Cons:
    • May miss complex interactions between features
  • Our recommendation: You should always use model free feature selection techniques as you start to explore what types of features you want to include in your model. If you are just doing some quick exploration, we recommend using a mixture of univariate and bivariate plots and nonparametric Spearman correlations. If you have more time and want to do a deeper exploration, we recommend using weight of evidence values. A brief sketch of both approaches appears after this list.
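
To make this concrete, here is a minimal sketch of how a Spearman-based ranking and a weight of evidence table could be computed. It assumes a pandas DataFrame `df` whose columns are numeric candidate features plus an outcome column named `target`, and a separate binary outcome coded 0/1 for the weight of evidence part; the column and function names are hypothetical, not from any particular library.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def rank_by_spearman(df: pd.DataFrame, target_col: str = "target") -> pd.Series:
    """Rank candidate features by the absolute Spearman correlation with the outcome."""
    features = [c for c in df.columns if c != target_col]
    scores = {}
    for col in features:
        rho, _ = spearmanr(df[col], df[target_col], nan_policy="omit")
        scores[col] = abs(rho)
    return pd.Series(scores).sort_values(ascending=False)

def weight_of_evidence(feature: pd.Series, binary_target: pd.Series, bins: int = 5) -> pd.Series:
    """Weight of evidence per bin of a (binned) numeric feature against a 0/1 outcome."""
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, binary_target)
    # Share of positive and negative cases that fall into each bin; the small
    # constant keeps the log defined when a bin has no cases of one class.
    pct_pos = (counts[1] + 0.5) / counts[1].sum()
    pct_neg = (counts[0] + 0.5) / counts[0].sum()
    return np.log(pct_pos / pct_neg)

# Example usage (assuming `df` exists, with a hypothetical binary column "converted"):
# print(rank_by_spearman(df).head(10))
# print(weight_of_evidence(df["age"], df["converted"]))
```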

Feature selection with model artifacts

  • Description: The next category of feature selection techniques is feature selection techniques that rely on model artifacts. These are simple feature selection techniques that use artifacts that you get for free, or at least cheaply, after training a model. These artifacts usually take the form of variable importance scores that are calculated during model training.
  • Use cases: Feature selection techniques that use model artifacts are good if you want a quick and dirty way to identify features that contribute very little to the model. They are good for cases where you are working quickly to make a proof of concept model. However, these techniques can be biased in certain cases such as when there are multiple highly correlated variables.
  • Examples:
    • Permutation based feature importance. Permutation based feature importance scores are often included along with trained models because they are relatively easy to calculate. The general idea behind permutation based feature importance scores is that you choose one feature at a time and then shuffle around, or permute, the values for that feature. After that, you can look at how much predictive performance is affected when that feature is effectively eliminated from the equation.
    • Average reduction in impurity. Feature importance scores that are based on reductions in impurity are also commonly included along with tree-based models. The general idea behind this type of feature importance score is that you calculate how much more homogeneous the leaf nodes become after a split is made on a given feature. The idea here is that if a feature is strongly associated with the outcome, then adding a split based on that variable will result in more homogeneous nodes that contain observations with similar outcomes.
  • Pros:
    • Often come for free with a trained model
    • Well understood by many data scientists
  • Cons:
    • Biased in some scenarios (e.g. impurity based measures tend to be biased toward high cardinality features and continuous features with many possible split points)
    • Only applicable to certain types of models
  • Our recommendation: If you are building a quick model for a proof-of-concept then it makes sense to use built-in variable importance scores. However, if you are building a model for the long term, we recommend using feature selection methods that provide more interpretability. The additional interpretation that these methods provide can alert you to potential issues with your data and help to spark ideas for new and better features. If you are going to use built-in feature importance measures, permutation based feature importances are generally preferable to impurity based feature importances; a sketch of both appears after this list.
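
As an illustration of how these artifacts can be pulled out of a trained model, here is a rough sketch using scikit-learn's random forest and its `permutation_importance` helper. The feature matrix `X` and outcome `y` are assumed to already exist; the estimator settings and number of repeats are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumes a feature matrix X and outcome vector y already exist.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances come for free with the fitted tree ensemble.
impurity_importances = model.feature_importances_

# Permutation importances are computed on held-out data by shuffling one
# feature at a time and measuring the resulting drop in score.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]  # feature indices, most important first
```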

Feature selection with model interpretability

  • Description: The third category of feature selection techniques depends on model interpretability methods. These are methods that are explicitly developed to help you understand why a model is making the predictions that it is.
  • Use cases: Feature selection techniques that depend on model interpretability methods are the best option in cases where inference and explainability are at least as important as prediction.
  • Examples:
    • Shapley values. Shapley values are observation-level values that are used to determine how much each observed feature value contributed to the final prediction. This is done by looking at the marginal contribution that the feature makes when you use it in combination with all possible subsets of the other features. For example, if you had 4 features called feature A, feature B, feature C, and feature D and you wanted to calculate the Shapley value for feature A, then you would look at the marginal contribution of feature A when it is combined with just feature B, just feature C, just feature D, both feature B and feature C, and so on. Observation-level Shapley values can be aggregated up to create variable-level feature importance scores for selecting the best features, as shown in the sketch after this list.
    • LIME. LIME is a model interpretability method that works by perturbing the data for a given observation many times, using the original model to make predictions on the perturbed data, and then training a simple interpretable model (typically a linear model, weighted by how close each perturbed point is to the original observation) to approximate how the original model behaves in that local neighborhood.
  • Pros:
    • Often provide additional insight and interpretation, such as the ranges of values for a feature that lead to certain predictions
    • Many methods are generalizable to any type of model (e.g. LIME)
  • Cons:
    • Introduces an extra step after model training
    • Can take a long time to run
  • Our recommendation: You should use model interpretability methods to assist you in the feature selection process any time that you are building a model for the long term. This is especially true if you need to be able to explain why your model makes the predictions that it does.
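
Here is a minimal sketch of turning observation-level Shapley values into feature-level scores using the third-party `shap` package together with a scikit-learn random forest. The feature matrix `X` (a pandas DataFrame) and outcome `y` are assumed to already exist, and the estimator settings are illustrative.

```python
import numpy as np
import shap  # third-party package; assumed to be installed
from sklearn.ensemble import RandomForestRegressor

# Assumes a pandas DataFrame X of features and an outcome vector y already exist.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# TreeExplainer computes observation-level Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape (n_observations, n_features) for regression

# Aggregate to feature-level importance by averaging the absolute contributions,
# then rank features from most to least important.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
ranked_features = X.columns[np.argsort(mean_abs_shap)[::-1]]
```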

In-model feature selection

  • Description: The final type of feature selection method is in-model feature selection. The general idea behind this type of feature selection method is that you can use a model that has a built-in capability to eliminate features that are not making a significant contribution.
  • Examples:
    • LASSO (L1 regularization). One example of a model that has built in feature selection is a LASSO model. This is simply a linear regression model that includes an L1 regularization term, a penalty that is proportional to the absolute value of the regression coefficient for a given feature. The addition of this penalty to the model enables individual regression coefficients to be set to exactly zero in cases where a feature is not associated with the outcome variable. A sketch of this appears after this list.
    • Bayesian models. Some Bayesian models can also perform in-model feature selection if an appropriate prior is placed on the feature coefficients. The selected prior should have a lot of its density at zero so that the model is encouraged to set coefficients to zero. One example of such a prior is a spike and slab prior.
  • Pros:
    • You can sometimes directly use the results of these models without having to retrain any other models or complete any extra steps
  • Cons:
    • More difficult to explain to non-technical audiences
    • Only applicable to certain types of models
    • Not great to use in production because there is no way to guarantee that the selected feature set will remain stable if the model is retrained. This means you may need to use a model with in-model feature selection to perform feature selection and then train another model with the selected feature set, which cancels out the main benefit of this type of feature selection: that it is a one step process.
  • Our recommendation: You should avoid using models with in-model feature selection in production, especially in cases where the model will need to be re-trained. There is no way to guarantee that the same feature set will be selected and that adds many potential sources of instability. If you are training a quick one-off model for inference purposes it might make more sense to use this type of model.
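
As an illustration, here is a rough sketch of LASSO-based selection with scikit-learn, assuming a feature matrix `X` and outcome `y` already exist; the cross-validation settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Assumes a feature matrix X and outcome y already exist; features are standardized
# so the L1 penalty treats them on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks the regularization strength by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Features whose coefficients were shrunk exactly to zero are effectively dropped.
selected = np.flatnonzero(lasso.coef_)
print("selected feature indices:", selected)
```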

A comparative study of feature selection methods

In order to compare how different types of feature selection methods perform in different situations, we created some simulated data and then employed various feature selection methods on that data. In the following section we will discuss the details of these simulations.

Feature selection simulation details

Types of variables generated

First we will talk about what kind of variables we generated in our simulations and how we simulated these variables. To keep things simple, we primarily used numeric variables that were generated from a normal distribution. We only included categorical variables when the scenario we were simulating explicitly required it.

  • Outcome variable. The first variable we simulated was our outcome variable. The outcome variable was generated from a random normal distribution.
  • Random variables. After we generated our outcome variable, we generated random variables that were not at all associated with the outcome variable. Each variable was generated using a normal distribution with a different mean and variance.
  • Related variables. After we simulated our random variables, we simulated some variables that were associated with our outcome variable. These variables were created by generating some random noise from a normal distribution then adding that noise to the outcome variable. In some cases, we squared or cubed the noise.
  • Highly correlated. For some scenarios we also added a set of variables that were highly correlated to one another. We created these variables by selecting one of the variables that was associated with the outcome then simulating some random noise from a normal distribution and adding that noise to the selected variable.
  • Categorical variables. When a scenario called for categorical variables, we created them by generating numeric variables and then binning those variables into categories. A sketch of this generation process appears after this list.
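
The sketch below shows roughly how data along these lines could be generated with NumPy and pandas. The sample size, the number of variables of each type, and the noise levels are illustrative choices rather than the exact settings used in our simulations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000  # illustrative sample size

# Outcome variable drawn from a normal distribution.
y = rng.normal(0, 1, n)

# Random variables with no relationship to the outcome, each with its own mean and variance.
random_vars = {f"random_{i}": rng.normal(rng.uniform(-5, 5), rng.uniform(0.5, 3), n)
               for i in range(10)}

# Related variables: the outcome plus normally distributed noise (sometimes squared or cubed).
related_vars = {
    "related_linear": y + rng.normal(0, 1, n),
    "related_squared": y + rng.normal(0, 1, n) ** 2,
    "related_cubed": y + rng.normal(0, 1, n) ** 3,
}

# Highly correlated variables: one related variable plus a small amount of extra noise.
correlated_vars = {f"correlated_{i}": related_vars["related_linear"] + rng.normal(0, 0.1, n)
                   for i in range(3)}

# Categorical variable: a numeric variable binned into a handful of categories.
categorical = pd.cut(related_vars["related_linear"], bins=5, labels=False)

df = pd.DataFrame({**random_vars, **related_vars, **correlated_vars,
                   "categorical_feature": categorical, "target": y})
```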

Types of scenarios simulated

Now that we have talked about the types of variables that we included in our simulations, we will also talk about the types of scenarios that we simulated.

  • Simple. The first scenario we simulated was a simple scenario where there were not any glaring issues that might make the feature selection task more difficult. There were no categorical variables or correlated variables included in this scenario.
  • High correlation. The second situation we simulated was the scenario where there were multiple highly correlated variables. This was the only scenario where we included variables that were intentionally created to be highly correlated with one another.
  • High noise. The third scenario that was simulated was a high noise scenario. In this scenario, the ratio of noisy variables to related variables was increased.
  • Variable scale. The fourth scenario was a scenario where there was a lot of variability in scale. In all of the previous scenarios, the variables were all on relatively similar scales. In this scenario, some variables were on much larger scales than others.
  • Categorical variables with differing cardinality. The final scenario that was simulated was one with categorical variables. In this scenario there were multiple categorical variables, some of which had higher cardinality than others.

Feature selection methods employed

Here are the feature selection methods that we employed on our simulated data. We only used feature selection techniques that provided concrete scores that could be used to rank our features from most important to least important.

  • Spearman correlation. We used Spearman correlations, which are nonparametric correlations, to examine the relationship between the outcome variable and each feature. In order to select the best features, we simply selected the features with the highest absolute correlations.
  • Permutation feature importances (random forest). We also trained a random forest model then examined the permutation feature importance scores associated with that model. Again, we took the features with the highest feature importance scores as the most important features.
  • Shapley values (random forest). Finally, we took the same random forest model that we used to get the permutation feature importances and used it to generate Shapley values. We aggregated the observation-level values up to get feature level scores, then selected the features with the highest scores.

Results of the feature selection simulation

In order to compare how the different feature selection methods performed, we applied each method to the data simulated for each scenario. Since we knew the exact number of features that were actually associated with the outcome variable in each scenario, we used that number to determine how many features to include in the set of selected features.

After we determined which features were selected by each feature selection method, we compared the list of selected features to the true list of features that were actually associated with the outcome variable. For each feature selection method that we looked at, we examined the percentage of selected features that were actually associated with the outcome variable. The results of this study can be seen below.
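
In code, that evaluation boils down to taking the top-k features from each method's scores and checking how many of them are truly related, roughly as sketched below; the function and feature names are hypothetical and reuse the names from the earlier sketches.

```python
import pandas as pd

def recovery_rate(scores: pd.Series, true_features: set) -> float:
    """Fraction of the top-k ranked features that are truly related to the outcome,
    where k is the number of truly related features."""
    k = len(true_features)
    selected = set(scores.sort_values(ascending=False).index[:k])
    return len(selected & true_features) / k

# Example usage:
# spearman_scores = rank_by_spearman(df)
# true_features = {"related_linear", "related_squared", "related_cubed"}
# print(recovery_rate(spearman_scores, true_features))
```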

[Figure: scores displaying how well different feature selection methods worked for the different simulated scenarios.]

We found that the simple correlation-based feature selection performed best in all scenarios. In the simpler scenarios, the correlation-based method performed relatively similarly to the other methods. In more complex scenarios, such as the scenario where there was a lot of noise and the scenario where there were categorical variables with differences in cardinality, the simple correlation-based method performed much better than the other methods.

These findings reinforce the importance of performing simple univariate and bivariate analyses on your features before adding them to your model.

Caveats

Before we wrap up talking about our feature selection simulations, we want to point out some caveats and limitations of this study. The main limitation of this study is that the data was simulated using relatively simple methods and is therefore likely cleaner and less complex than real world data. Here are some characteristics that apply to our simulated data but might not apply to real world data.

  • All data is normally distributed
  • No complex interactions between variables
  • No missing data
  • No incorrect values

