Are you wondering which machine learning model you should use for your next data science project? Or maybe you are interested in hearing more about the considerations you should keep in mind when determining which machine learning model to use. Well, either way, you are in the right place!
In this article, we tell you everything you need to know to decide which machine learning model is right for you. We start with a discussion of the main considerations you should keep in mind when deciding which machine learning model to use. After that, we provide examples of models that can be used in different situations, along with brief descriptions of where each model performs best.
How to choose which machine learning model to use
Before we dive into talking about specific machine learning models, we will take some time to go over a list of concerns that you should keep in mind when determining which machine learning model to use. We will break these concerns up into three different categories – data, technical, and practical considerations.
Data considerations
When deciding which type of machine learning model you are going to use for a project, you should first think about the data you will use and whether there are any unique concerns you will need to account for. Here are some examples of data-related considerations you should keep in mind when choosing a machine learning model.
- Type of outcome variable. The first thing you should keep in mind when evaluating your dataset is what your outcome variable looks like. This will help narrow down which set of models you have to choose from. Is your outcome variable a simple tabular variable, such as a binary, multiclass, or numeric variable? Is the outcome you are looking for more complex, such as a piece of text or an image? In some cases, you might not even have an outcome variable at all! And that is perfectly fine, but you need to make sure you choose an unsupervised model that can accommodate this setup.
- Types of features. The next thing you should keep in mind when evaluating your dataset is what types of features you are working with. Do you have simple tabular features? Or does your input data include unstructured data like text and images?
- Size of dataset. You should also consider the size of the dataset that you have available to you. Some machine learning models, such as neural networks, require large amounts of data to work well. This is generally because these models have large numbers of parameters that need to be estimated. On the flip side, simpler models with fewer parameters do not typically require large datasets.
- Missing data. Another consideration you should keep in mind when looking over your dataset is the amount of missing data it contains. In addition to the amount of missing data, you should also consider the patterns of missingness. If the missingness is seemingly random, it will be easier to handle than if there are patterns that lead to a higher likelihood of missingness for some data points. Some models, such as random forests, have implementations that can handle missing data natively without any need to impute the missing values.
- Outliers. You should also consider how many outliers you have in your dataset and what types of outliers are present. Some machine learning models are relatively unaffected by large outliers, whereas others are easily thrown off by just a few outliers.
- Correlated features. Another important concern you should keep in mind is whether you have many correlated features in your dataset. Many machine learning models are easily thrown off by highly correlated features. That being said, there are some models, like ridge regression models, that are particularly well suited to handling these situations.
- High dimensionality. Another consideration you should keep in mind when evaluating your dataset is whether you are using a high-dimensional dataset that has many features in it. As an additional consideration, you should look at how the number of features in your model compares to the number of observations. Many machine learning models do not perform well, or cannot be used at all, in cases where you have more features than observations, so this is an important consideration to look out for (see the quick audit sketch after this list).
- Interactions. Another thing you should look out for in your data is whether there are interactions between features, that is, whether the value of one feature moderates the relationship between another feature and the outcome variable. Some models can detect and account for interactions automatically, whereas others require interactions to be specified in the input data in order for their effect to be considered.
- Linearity assumptions. Another consideration you should keep in mind is whether the relationships between your features and your outcome variable are linear. While many machine learning models can account for non-linear relationships between features and outcome variables, some cannot.
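To make a few of these checks concrete, here is a minimal audit sketch, assuming your data is already loaded into a hypothetical pandas DataFrame called `df`. It prints the size of the dataset, flags cases with more features than observations, and reports missingness and the most correlated feature pairs.

```python
import numpy as np
import pandas as pd

def quick_audit(df: pd.DataFrame) -> None:
    """Print a few quick checks: dataset size, missingness, and correlated features."""
    n_rows, n_cols = df.shape
    print(f"{n_rows} observations, {n_cols} features")
    if n_cols >= n_rows:
        print("Warning: more features than observations (high-dimensional data)")

    # Fraction of missing values per column, highest first
    missing = df.isna().mean().sort_values(ascending=False)
    print("Fraction missing per column:\n", missing.head())

    # Largest absolute pairwise correlations among numeric features
    corr = df.select_dtypes("number").corr().abs()
    upper = np.triu(np.ones(corr.shape, dtype=bool))  # mask the diagonal and duplicate pairs
    pairs = corr.where(~upper).stack().sort_values(ascending=False)
    print("Most correlated feature pairs:\n", pairs.head())
```

None of these checks decide the model for you, but running something like `quick_audit(df)` early surfaces the issues described above before you commit to an approach.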
Technical considerations
Next we will talk about technical considerations you should keep in mind when deciding which machine learning model to use. Some of these concerns are specific to model deployment, especially if the model is going to be deployed in a user-facing environment. Others are more broadly related to engineering concerns like the maintainability of your codebase.
- Speed of inference. If you are going to make real-time predictions using your machine learning model in a user-facing environment, you should almost certainly keep in mind the speed of inference. For example, if you are going to be hitting your model from the front end of a website, you need to take care to make sure your model does not have a significant impact on the speed of page load. Some models are able to make predictions much faster than others.
- Speed of training. In addition to the speed of inference, you should also keep in mind the speed of training. The negative repercussions of long training runs tend to be less pronounced than the repercussions of long prediction times, but that does not mean that they do not exist. If your model takes a long time to train, it will slow down your speed of iteration and you may take longer to finish the project.
- Resources required for training. You should also keep in mind the amount of resources required for training. In many cases, computational power is relatively cheap. However, if you are retraining a complex deep learning model that trains over many GPUs every day, expenses can start to add up. This problem will be particularly pronounced if there are limitations on the resource pool available to you.
- Dependencies required for deployment. Another consideration you should keep in mind if you plan to deploy your model in a production environment is the set of dependencies that will be required to make predictions using the model. Some models, like linear regression models, can easily be put into production without the need to use specific machine learning libraries. All you need to do is create some logic that multiplies the features by their corresponding coefficient values and then sums those products up (see the sketch after this list). Other models, such as Bayesian regression, may require you to use niche libraries with a complex web of system-wide dependencies.
- Maintainability. If you are working on a machine learning model that will be used over a long span of time, you should also keep the maintainability of the model code in mind. You may be better off using models that are well understood as opposed to niche models that are not as broadly known, even if the niche models have slightly better predictive performance. You should also use common machine learning libraries such as scikit-learn when possible. This will make it easier for others to maintain the code you wrote.
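To illustrate the point about linear regression above, here is a minimal sketch of serving a fitted model with no machine learning library at all. The feature names and coefficient values are made up for illustration; in practice you would export them from wherever the model was trained.

```python
# Hypothetical coefficients exported from a fitted linear regression
COEFFICIENTS = {"square_feet": 105.0, "num_bedrooms": 9500.0, "age_years": -850.0}
INTERCEPT = 52000.0

def predict(features: dict) -> float:
    """Multiply each feature by its coefficient and add the intercept."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())

print(predict({"square_feet": 1400, "num_bedrooms": 3, "age_years": 20}))
```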
Practical considerations
Finally, we will discuss some practical considerations you should keep in mind when selecting a machine learning model. These concerns relate more to the business context within which you are operating.
- Trust in machine learning models. One practical concern that you should keep in mind when determining which machine learning model to use is the level of trust that your stakeholders have in machine learning models. If you are working with stakeholders that are technology-averse or skeptical of complicated models they do not understand, you may be better off using something like a linear regression model that is well studied and widely adopted. The model is only useful if the results get used.
- Importance of interpretation. Another practical concern you should keep in mind when deciding which machine learning model to use is the relative importance of prediction vs. interpretation for your use case. If your primary goal is interpretation, you should reach for a model that has interpretable coefficients rather than a model chosen purely for peak predictive performance.
- Incremental value of small performance increases. You should also consider the amount of incremental business value that is delivered by small increases in predictive performance. In some cases, small increases in performance can deliver a large amount of business value. In others, this will not be true. All else being equal, if small increases in performance do not deliver large incremental value, then you may be better off using a simpler model than a more complex model that has slightly better predictive performance.
- Time available to tune hyperparameters. Another consideration you should keep in mind when deciding what model to use is the amount of time and effort that is required for tuning model hyperparameters. Some models, like neural networks, are very sensitive to the selection of hyperparameters, which means a lot of time needs to go into selecting the model architecture and hyperparameters. Other models, such as random forests, are not very sensitive to the choice of hyperparameters used. You should consider the amount of time you have available to put into model tuning when you decide what model to use.
- Prior knowledge. A final consideration you should keep in mind is the amount of prior knowledge you have about the business domain. If you have a large amount of prior knowledge that could make valuable contributions to your model, you may be better off using a Bayesian model that allows you to incorporate external knowledge.
Machine learning models for supervised learning
In this section, we will focus on machine learning models that can be used to tackle supervised learning problems. Supervised learning problems are problems where you have at least one distinct outcome variable that you are trying to recover using your model. For example, you might have a numeric, binary, or multiclass variable in your dataset that you want to predict.
Machine learning models for structured data
In the following section, we will focus on machine learning models that are used for structured data. These are models that are used when you have straightforward tabular data in the form of binary, multiclass, numeric, or count variables.
Machine learning models for numeric outcomes
First, we will talk about machine learning models that can be used when you have a continuous outcome variable. A short comparison sketch follows the list.
- When to use Bayesian regression: A great option when you have a small sample size.
- When to use generalized additive models: For when you have nonlinear data but need interpretability.
- When to use gradient boosted trees: Peak predictive performance on tabular data.
- When to use LASSO: These models provide automatic variable selection.
- When to use linear regression: A great option when inference is your main goal.
- When to use mixed models: Can handle situations where some of your observations are not independent, such as when you have multiple measurements per subject.
- When to use neural networks: Generally preferred for unstructured data.
- When to use random forests: A quick and easy model that requires little data preprocessing.
- When to use ridge regression: These models handle correlated features well.
- When to use support vector machines: These models handle high dimensional data well.
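As a rough illustration of how you might compare a couple of the options above, here is a minimal sketch using scikit-learn and cross-validation. The built-in diabetes dataset is just a stand-in for your own data, and the default hyperparameters are not tuned.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Stand-in dataset with a numeric outcome
X, y = load_diabetes(return_X_y=True)

for name, model in [("ridge", Ridge()), ("random forest", RandomForestRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```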
Machine learning models for binary outcomes
Next, we will talk about machine learning models that can be used when you have a categorical outcome variable with only two levels. A short logistic regression sketch follows the list.
- When to use Bayesian regression: A great option when you have a small sample size.
- When to use generalized additive models: For when you have nonlinear data but need interpretability.
- When to use gradient boosted trees: Peak predictive performance on tabular data.
- When to use LASSO: These models provide automatic variable selection.
- When to use logistic regression: A great option when inference is your main goal.
- When to use mixed models: Can handle situations where some of your observations are not independent, such as when you have multiple measurements per subject.
- When to use neural networks: Generally preferred for unstructured data.
- When to use random forests: A quick and easy model that requires little data preprocessing.
- When to use ridge regression: These models handle correlated features well.
- When to use support vector machines: These models handle high dimensional data well.
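Since inference is often the reason to reach for logistic regression, here is a minimal sketch that fits one with scikit-learn and reads the coefficients as odds ratios. The breast cancer dataset is just a stand-in for your own data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset with a binary outcome
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of the positive class
# for a one standard deviation increase in the feature (features are scaled here)
odds_ratios = np.exp(pipeline[-1].coef_[0])
for feature, ratio in sorted(zip(X.columns, odds_ratios), key=lambda pair: pair[1])[:5]:
    print(f"{feature}: odds ratio {ratio:.2f}")
```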
Machine learning models for multiclass outcomes
Multiclass outcomes are categorical outcome variables that have more than two categories. In this section, we will focus on models that can handle multiclass outcomes natively. This means that we will stick to models that can predict multiclass outcomes using a single model.
Some other common machine learning algorithms, such as support vector machines and gradient boosted trees, have implementations that can be used to predict multiclass outcomes. That being said, these implementations generally require training multiple binary classification models and then combining their output to get the final result.
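As a rough sketch of what that wrapping looks like in practice, scikit-learn's OneVsRestClassifier can turn a binary classifier such as a linear support vector machine into a multiclass predictor. The iris dataset is just a stand-in for your own data.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Stand-in dataset with three classes
X, y = load_iris(return_X_y=True)

# Fits one binary SVM per class and predicts the class whose model is most confident
clf = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(clf.predict(X[:5]))
```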
- When to use neural networks: Generally preferred for unstructured data.
- When to use mixed models: Can handle situations where some of your observations are not independent, such as when you have multiple measurements per subject.
- When to use multinomial regression: Great for inference on multiclass variables without ordering.
- When to use ordinal logistic regression: Great for multiclass variables with a natural ordering.
- When to use random forests: A quick and easy model that requires little data preprocessing.
Machine learning models for count outcomes
Next, we will discuss some types of models that can be used when your outcome variable is a count. In some cases, count outcomes can be treated as numeric outcomes. In others, you will need to use a model that is specifically built for count variables. A short Poisson regression sketch follows the list.
- When to use Bayesian regression: A great option when you have a small sample size.
- When to use generalized additive models: For when you have nonlinear data but need interpretability.
- When to use mixed models: Can handle situations where some of your observations are not independent, such as when you have multiple measurements per subject.
- When to use Poisson regression: The simplest regression model for count data.
- For a broader overview, see: A survey of regression models for count data.
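Here is a minimal Poisson regression sketch using statsmodels. The data is simulated purely for illustration, with counts whose expected value increases with a single feature.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Simulated counts with true intercept 0.5 and slope 0.8 on the log scale
y = rng.poisson(np.exp(0.5 + 0.8 * x))

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params)  # estimated intercept and slope on the log scale
```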
Machine learning models for time series data
There are also specific models that are designed to be used for time series data. Time series data is data where you have many repeated measurements of the same quantity taken over time. A short ARIMA sketch follows the list.
- When to use ARIMA models: A classic baseline model for stationary data.
- When to use Fourier ARIMA models: An ARIMA model that can handle multiple seasonality.
- When to use exponential smoothing: A classic baseline model for non-stationary data.
- When to use TBATS models: Appropriate for time series with multiple seasonality.
- When to use Facebook Prophet: A beginner-friendly model that does not require knowledge of time series forecasting and can account for mean shifts.
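Here is a minimal ARIMA sketch using statsmodels. The series is simulated as a placeholder for your own data, and the (p, d, q) order is a starting guess rather than a tuned choice.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# A simple simulated monthly series standing in for real time series data
values = np.cumsum(rng.normal(size=120))
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=120, freq="MS"))

# order=(p, d, q); in practice you would pick these by inspecting the series
# or by comparing information criteria across candidate orders
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))  # forecasts for the next six months
```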
Machine learning models for unstructured data
Next, we will talk about machine learning models that can be used when your data is not in an easy tabular format. Here are some examples of machine learning models that perform well on unstructured data, such as text and image data. A small convolutional network sketch follows the list.
- When to use convolutional neural networks: State of the art for image data.
- When to use recurrent neural networks: Good for sequential data such as text or speech data.
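To give a sense of what a convolutional neural network looks like, here is a minimal sketch using PyTorch (one of several deep learning libraries you could use). The input size, channel counts, and number of classes are arbitrary choices for illustration, and real use would require a training loop and labeled images.

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    """A tiny convolutional network for 32x32 RGB images."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 input channels (RGB)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

# One forward pass on a batch of four random "images"
logits = SmallCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```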
Machine learning models for unsupervised learning
In this section we will talk about machine learning models that are used for unsupervised learning problems. These algorithms can be used in situations where you do not have a specific outcome variable you want to predict, but you do want to identify similarities between observations in your dataset.
Models for clustering
In this section, we will talk about clustering models that can be used when you want to identify similarities between different observations in your dataset. A short k-means sketch follows the list.
- When to use DBSCAN. Robust to outliers and able to detect irregularly shaped clusters.
- When to use Gaussian mixture models. Insensitive to scale and able to account for the fact that some observations share similarities with multiple clusters.
- When to use hierarchical clustering. Great for cases where you need detailed information about which observations are most similar.
- When to use k-means clustering. A fast and well-studied approach that works well in cases where it is reasonable to assume your clusters are spherical.
- When to use spectral clustering. Works well on high-dimensional datasets with many features.
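Here is a minimal k-means sketch using scikit-learn. The data is generated with make_blobs purely for illustration, and the number of clusters is assumed to be known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Simulated data with three roughly spherical clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Scaling matters because k-means relies on Euclidean distances between points
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels[:10])  # cluster assignment for the first ten observations
```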
Models for dimension reduction
In this section, we will talk about models that are used to reduce the dimension of your dataset. These are models that take in a set of input features and condense the information in those input features into a smaller set of transformed features. A short PCA sketch follows the list.
- When to use factor analysis. Provides interpretable output features along with information on which input features contributed to them.
- When to use PCA. A fast and well-studied dimension reduction technique that is guaranteed to produce uncorrelated features.
- When to use t-SNE. Preserves local relationships between observations and is therefore great for visualizing high dimensional datasets in low dimensional spaces.
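Here is a minimal PCA sketch using scikit-learn. The built-in wine dataset is just a stand-in for your own data, and keeping two components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset with thirteen numeric features
X, _ = load_wine(return_X_y=True)

# Standardize first so features measured on large scales do not dominate the components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```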