Are you wondering when you should use mixed models for a data science project? Well then you are in the right place! In this article, we tell you everything you need to know to determine whether mixed models are appropriate for your use case.
This article starts out with a discussion of the types of outcome variables that can be handled with mixed models. After that, we spend some time discussing the main advantages and disadvantages of mixed models. Finally, we provide specific examples of situations where you should and should not use mixed models.
What kind of outcomes can mixed models handle?
A vanilla mixed model is an extension of a linear regression model, which means that it is used to handle continuous numeric outcomes.
However, the term mixed models is often extended to cover generalized linear mixed models, a family of models that can be used for a wide variety of outcome variables, from binary variables to count variables. When reading through this article, you can assume that we are referring to generalized linear mixed models and that the points we discuss apply to the whole family, regardless of the type of outcome variable.
What really sets mixed models apart from the pack is not the types of outcome variables that they can handle, but rather the fact that they can be used for datasets where the observations are not all independent of one another.
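One rough way to check whether observations within a group depend on one another is the intraclass correlation (ICC), the share of total variance that comes from differences between groups. The sketch below is a minimal illustration using the one-way ANOVA estimator, assuming every group has the same number of readings. The function name and the example readings are hypothetical and are not taken from any particular library or dataset.

```python
from statistics import mean


def intraclass_correlation(groups):
    """One-way ANOVA estimate of the ICC, assuming equal group sizes."""
    k = len(groups[0])  # readings per group
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("this simple estimator needs equal group sizes of at least 2")
    n = len(groups)  # number of groups (for example, subjects)

    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Mean square between groups and mean square within groups.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical BMI readings: three subjects, three readings each.
bmi_readings = [
    [21.0, 21.4, 20.8],
    [27.9, 28.3, 28.1],
    [24.2, 23.9, 24.4],
]
icc = intraclass_correlation(bmi_readings)
print(f"ICC = {icc:.3f}")
```

An ICC close to 1 suggests that readings from the same subject are much more alike than readings from different subjects, which is the kind of dependence mixed models are designed to handle. A value near or below zero suggests that grouping adds little.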
Advantages and disadvantages of mixed models
So what are the main advantages and disadvantages of mixed models? Here are some advantages and disadvantages you should keep in mind when deciding whether to use mixed models.
Advantages of mixed models
- Can account for multiple readings on the same subject. One of the main advantages of mixed models is that they can be used in situations where you have multiple readings on the same subject. Most standard machine learning models assume that all of your observations are independent of one another, which is certainly not true if multiple observations came from the same subject. This is the main advantage that distinguishes mixed models from other models.
- Can account for nested structure in data. Another advantage of mixed models is that they can account for data that has a nested or hierarchical structure. They are particularly powerful in cases where subjects that fall in the same area of the hierarchy are similar enough that they cannot be considered independent of one another.
- Interpretable coefficients. Like many other regression models, mixed models provide interpretable coefficients that quantify the relationships between your features and your outcome variable. They are generally very useful in cases where inference is a primary goal.
- Can handle missing measurements and different numbers of measurements per subject. Another specific advantage that mixed models have over similar models like repeated measures ANOVA models is that they will still work in situations where you have different numbers of measurements for each subject or when measurements are taken at different time intervals.
- Can account for multiple features of multiple different types. Another advantage that mixed models have over similar models like repeated measures ANOVA models is that they can be used when you have many different features to account for. They can also be used when you have numeric features or a mixture of numeric and categorical features.
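To see why different numbers of measurements per subject are not a problem, it can help to look at how a random intercept model treats each subject's average. The sketch below is a simplified illustration, not a full model fit. It assumes the between-subject and within-subject variances are already known, and it uses the plain average of all readings as the overall mean, whereas a fitted mixed model would estimate all of these jointly. The function name and the example readings are hypothetical.

```python
from statistics import mean


def partially_pooled_means(readings, between_var, within_var):
    """Shrink each subject's mean toward the overall mean.

    readings maps a subject id to a list of measurements. The two variance
    arguments are assumed known here. A real mixed model estimates them.
    """
    overall_mean = mean(x for values in readings.values() for x in values)
    estimates = {}
    for subject, values in readings.items():
        n = len(values)
        # Subjects with fewer readings get a smaller weight, so their
        # estimate is pulled more strongly toward the overall mean.
        weight = between_var / (between_var + within_var / n)
        estimates[subject] = overall_mean + weight * (mean(values) - overall_mean)
    return estimates


# Hypothetical data: subject "b" has a single reading.
readings = {
    "a": [30.0, 31.0, 29.0, 30.0],
    "b": [30.0],
    "c": [20.0, 21.0, 19.0, 20.0],
}
estimates = partially_pooled_means(readings, between_var=4.0, within_var=4.0)
for subject, value in estimates.items():
    print(subject, round(value, 2))
```

Subjects "a" and "b" have the same raw average, but the estimate for "b" sits closer to the overall mean because it rests on a single reading. Every subject still contributes to the model, whatever number of readings it has.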
Disadvantages of mixed models
- Can be difficult to specify. One of the main disadvantages of mixed models is that it can be difficult to figure out how to specify your model, especially if it is one of your first times using mixed models or if your data has a few different levels of nesting. Mixed models are highly flexible and there are a lot of different types of structures that can be represented using mixed models, but your model is only useful if it is specified correctly. This difficulty is exacerbated by the fact that there are not many beginner-friendly resources that explain how to specify different types of mixed models. Many of the resources that do exist are highly technical and assume that the reader has a strong math background.
- Not available in all common machine learning libraries. Another disadvantage of mixed models is that they are not available in many common machine learning libraries. Mixed models are more commonly used in fields that place value on inference and classical statistics, which means that the more robust implementations are generally found in statistical packages like SAS and Stata.
- Implementation is not standardized across packages. Another difficulty of working with mixed models is that the way they are implemented can vary a lot across different packages and libraries. This can cause some confusion if you need to implement a mixed model using a different technology than you are used to. The differences can range from the way the model is specified to the convergence tests that are used to determine whether model training reached a stable end point.
- More parameters to estimate than standard regression models. Mixed models also have more parameters that need to be estimated than standard regression models. On one hand, this is good because it allows the model to be more flexible. On the other hand, it generally means that you need more data and it increases the chance that your model will not converge to a stable solution.
- General pitfalls of regression. Finally, mixed models are also subject to many of the common pitfalls that affect standard regression models like linear regression and logistic regression. These models can be thrown off by some types of missing data, outliers, correlated features, and interactions that are left out of the model.
When to use mixed models
Are you wondering when you should use mixed models rather than another machine learning model? Here are some examples of scenarios where you should consider using mixed models.
- Multiple measurements on the same subject. If you are working with data that contains multiple measurements for each subject, this is very likely a situation where mixed models would serve you well. For example, if you are working with medical data where an indicator of a person’s health (such as their BMI) was measured multiple times over the course of a year, you would have multiple measurements for each subject. These situations are generally not suitable for standard linear regression models because standard regression models assume that each observation is independent of the others. Observations are certainly not independent if they are taken from the same subject.
- Subjects have hierarchical structure. If the subjects you are measuring have a naturally nested or hierarchical structure, this is likely another situation where mixed models would serve you well. This is especially true if there is reason to believe that subjects that fall in the same area of the hierarchy are more likely to have similar results. For example, if your outcome variable was a standardized test score for a math test that was taken by high school students, you might expect students who attended the same high school, or even the same math class, to have similar scores.
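As a minimal sketch of the repeated measurements scenario, the example below fits a linear mixed model with a random intercept for each subject using the `mixedlm` function from the Python library statsmodels. The data is simulated purely for illustration, and the column names (`subject`, `time`, `y`) and the simulation settings are assumptions rather than part of any real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate hypothetical data: 30 subjects, each measured at 6 time points.
rng = np.random.default_rng(0)
n_subjects, n_visits = 30, 6
subject = np.repeat(np.arange(n_subjects), n_visits)
time = np.tile(np.arange(n_visits), n_subjects)

# Each subject gets its own baseline offset, so readings from the same
# subject are correlated with one another.
subject_offset = rng.normal(0.0, 2.0, n_subjects)
noise = rng.normal(0.0, 1.0, subject.size)
y = 25.0 + 0.5 * time + subject_offset[subject] + noise

data = pd.DataFrame({"subject": subject, "time": time, "y": y})

# Fixed effect for time, random intercept for each subject.
model = smf.mixedlm("y ~ time", data, groups=data["subject"])
result = model.fit()

print(result.fe_params)  # estimated fixed effects (intercept and time slope)
print(result.cov_re)     # estimated variance of the subject-level intercepts
```

If you also expect the effect of time to differ from subject to subject, statsmodels lets you request a random slope by passing `re_formula="~time"` to `mixedlm`. Other packages express the same idea with different syntax.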
When not to use mixed models
And when should you avoid using mixed models? Here are some examples of situations where you should avoid using mixed models.
- Multiple measurements per subject, but it is sufficient to aggregate measurements to the subject level. There may be some scenarios where you can avoid using a mixed model by aggregating all of the data points that depend on one another into a single observation. For example, if you were using medical data (such as BMI data) with multiple observations for each person, there may be scenarios where it would be sufficient to take the average of all the measurements for a single person and use that average as a single data point. Since all of the readings that depended on one another were collapsed into a single data point, we would return to a situation where our data points are all independent of one another and a standard linear regression model would be applicable. Whether you are able to use this simpler solution will depend on exactly how your data is structured and what you are trying to measure.
- Subjects have hierarchical structure, but it is sufficient to train multiple models. There may also be situations where your data has a natural hierarchical structure but it would be sufficient to train a few separate models, each on a specific area of the hierarchy. For example, say we were looking at an academic measure for students in Russia and the United States and we did not have reason to believe students in the same school or classroom would have similar readings, but we did believe that students in the same country would have similar readings. Depending on what we were trying to measure, it might make sense to train two different models – one for Russian students and one for United States students. Once we zoom in and look at a single country, we would expect all of the observations to be independent of one another, so a standard linear regression model would suffice. Again, this strategy may or may not be sufficient depending on exactly how your data is structured and what you are trying to measure.
- Multiple numeric measurements per subject that are all taken at the same interval with no missing measurements. If you have multiple numeric measurements per subject and all of the subjects have the same number of measurements taken at the same intervals, you can sometimes use a repeated measures ANOVA rather than a mixed model. A repeated measures ANOVA is a simple model that is similar to mixed models, but it cannot account for scenarios where there are different numbers of measurements for different subjects. All else being equal, it is generally better to use the simplest model that is appropriate for your situation.
- One model per subject. Sometimes when you have repeated measurements on the same subject, a time series model is more appropriate than a mixed model. In general, it makes sense to use time series models when you are going to train a different model for each subject, and to use mixed models when you want to combine data from multiple subjects into the same model.
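The aggregation approach from the first scenario above can be sketched in a few lines. This is a minimal illustration that assumes the feature of interest is constant within each subject. The helper functions, the column meanings (weekly exercise hours and BMI), and the data are all hypothetical.

```python
from statistics import mean


def aggregate_by_subject(rows):
    """Collapse (subject, x, y) rows into one (mean_x, mean_y) pair per subject."""
    by_subject = {}
    for subject, x, y in rows:
        by_subject.setdefault(subject, []).append((x, y))
    return {
        subject: (mean(x for x, _ in pairs), mean(y for _, y in pairs))
        for subject, pairs in by_subject.items()
    }


def ols_slope_intercept(points):
    """Ordinary least squares fit of y on a single feature x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


# Hypothetical rows: (person, weekly exercise hours, BMI reading).
rows = [
    ("p1", 1.0, 29.8), ("p1", 1.0, 30.2),
    ("p2", 3.0, 26.1), ("p2", 3.0, 25.9), ("p2", 3.0, 26.0),
    ("p3", 5.0, 22.0),
]
per_subject = aggregate_by_subject(rows)  # one independent point per person
slope, intercept = ols_slope_intercept(list(per_subject.values()))
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Note that averaging discards any information about how readings change within a person, so this shortcut only makes sense when that within-subject variation is not part of the question you are trying to answer.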
Other names for mixed models
There are many different names that are used to refer to mixed models. Here are a few examples of other terms that refer to mixed models (or certain types of mixed models).
- Hierarchical regression
- Multilevel models
- Mixed effects models
- Longitudinal regression