Are you wondering what stratified cross validation is and how it differs from regular cross validation? Or maybe you are wondering when you should use stratified cross validation in place of cross validation? Well then you are in the right place! In this article we tell you everything you need to know about stratified cross validation.
We start off with a review of what cross validation is and why it is used. After that, there is a discussion of what stratified cross validation is and how it differs from regular cross validation. Finally, we provide some examples of situations where you should use stratified cross validation in place of regular cross validation.
Why is cross validation used?
Why is cross validation used? Cross validation is used when you are building a predictive model and you want to understand how your model will perform on data that was not seen during training. Cross validation eliminates the need for a separate validation dataset on top of your training data, so you only need one dataset that is used during model training and one dataset that is used to evaluate your final model.
If you do not use cross validation, you will need to have three different datasets. The first is a training dataset that all of your models are trained on. The second is a validation dataset that is used to assess how all of your candidate models perform on unseen data to help decide between them. The third is a pure test dataset that is not touched until you have decided on your final model. This completely untouched data gives the final word on how your chosen model performs on unseen data. Cross validation allows you to use your training data as validation data so that you only need to have a training dataset and a test dataset.
How does cross validation work?
So how does cross validation work? There are multiple different types of cross validation, but in general cross validation has two main stages. In the first stage, you partition your training data into multiple splits. In the second stage, you train your model on one subset of the splits, then use the remaining splits to evaluate how your model performs on unseen data.
You can then repeat the second stage using different subsets of the splits to train your model. Once you have worked through all of the combinations your scheme calls for (in k-fold cross validation, for example, each split is held out exactly once), you can take the average of the validation metrics across all of the runs as your final validation metrics.
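To make the two stages concrete, here is a minimal sketch in plain Python of k-fold cross validation, one common variant in which each split is held out once while the model trains on the rest. The helper names (`k_fold_splits`, `cross_validate`), the toy model that always predicts the training mean, and the synthetic data are all assumptions made for illustration.

```python
import random
from statistics import mean


def k_fold_splits(n_samples, k, seed=0):
    """Stage 1: shuffle the row indices and partition them into k splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives splits whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Stage 2: hold out each split once, train on the rest, average the scores."""
    scores = []
    for held_out in k_fold_splits(len(xs), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            score(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return mean(scores), scores


# Toy "model" (hypothetical): always predict the mean outcome seen in training.
def fit_mean(train_x, train_y):
    return mean(train_y)


def mean_absolute_error(model, val_x, val_y):
    return mean([abs(y - model) for y in val_y])


# Synthetic data, used only to exercise the functions above.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

avg_score, fold_scores = cross_validate(
    xs, ys, k=5, fit=fit_mean, score=mean_absolute_error
)
print("per-split scores:", fold_scores)
print("average score:", avg_score)
```

In practice you would swap `fit_mean` and `mean_absolute_error` for your real training routine and metric. The splitting and averaging logic stays the same.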
What is stratified cross validation?
So what is stratified cross validation? And how does stratified cross validation differ from regular cross validation? The general idea behind stratified cross validation is that you can carefully allocate your data into splits in a way such that the distribution of your outcome variable is the same across all of the different splits.
The exact implementation details of stratified cross validation will look different depending on whether your outcome variable is numeric or categorical. In the following sections we will discuss how stratified cross validation differs for models with numeric outcomes and categorical outcomes.
Stratified cross validation for numeric outcomes
What does stratified cross validation look like for models with numeric outcomes? When you are using stratified cross validation for a numeric outcome, you want to allocate your data into splits in a way such that the mean and standard deviation of the outcome variable are roughly the same across all of the splits.
As an example, imagine you are training a model to predict how much money a person will spend on a credit card in the following month. You examine the distribution of the outcome variable in your training data and see that it has a mean of $1500 and a standard deviation of $200. In this case, you would want to carefully distribute observations into your splits in a way such that the distribution of the outcome within each split has a mean of roughly $1500 and a standard deviation of roughly $200.
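One way to get splits like this, sketched below in plain Python, is to sort the observations by outcome, walk through them in consecutive blocks of k, and randomly deal one observation from each block to each split. This is only one possible approach, chosen here for illustration. The function name `stratified_numeric_splits` and the synthetic spending data are assumptions, not something taken from a specific library.

```python
import random
from statistics import mean, stdev


def stratified_numeric_splits(outcomes, k, seed=0):
    """Deal observations to k splits so each split spans the full outcome range."""
    rng = random.Random(seed)
    # Row indices ordered from the smallest outcome to the largest.
    order = sorted(range(len(outcomes)), key=lambda i: outcomes[i])
    splits = [[] for _ in range(k)]
    for start in range(0, len(order), k):
        block = order[start:start + k]  # k neighbouring outcome values
        split_ids = list(range(k))
        rng.shuffle(split_ids)  # randomise which split gets which neighbour
        for idx, split_id in zip(block, split_ids):
            splits[split_id].append(idx)
    return splits


# Synthetic monthly spend with mean 1500 and standard deviation 200 (assumed).
data_rng = random.Random(42)
spend = [data_rng.gauss(1500, 200) for _ in range(200)]

splits = stratified_numeric_splits(spend, k=5)
for number, split in enumerate(splits):
    values = [spend[i] for i in split]
    print(number, len(values), round(mean(values)), round(stdev(values)))
```

Because every split receives one observation from each block of similar outcomes, the per-split means and standard deviations printed above should all sit close to the values for the full dataset.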
Stratified cross validation for categorical outcomes
What does stratified cross validation look like if your outcome is categorical? Stratified cross validation is even easier for categorical outcomes than for numeric ones. All you have to do is allocate your data such that the proportion of the observations that fall into each class is relatively consistent across all of your splits.
As an example, imagine you were training a model to determine whether someone would default on a bank loan and 90% of your training examples belonged to the negative class (no default). In this scenario, you would want to ensure that around 90% of the observations that were included in each split belonged to the negative class.
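The loan default example can be sketched in a few lines of plain Python: group the rows by class, shuffle within each class, then deal each class out to the splits in turn. The function name `stratified_class_splits` and the 100-row synthetic label list are assumptions for illustration. Libraries such as scikit-learn ship a ready-made equivalent (`StratifiedKFold`), which is usually the better choice in real projects.

```python
import random
from collections import Counter, defaultdict


def stratified_class_splits(labels, k, seed=0):
    """Partition row indices into k splits that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal this class's rows out one at a time, cycling through the splits.
        for position, idx in enumerate(members):
            splits[position % k].append(idx)
    return splits


# Synthetic labels: 90% negative class (no default), 10% positive class.
labels = ["no_default"] * 90 + ["default"] * 10

splits = stratified_class_splits(labels, k=5)
for split in splits:
    print(Counter(labels[i] for i in split))
```

With these 100 rows and 5 splits, every split ends up with 18 non-defaults and 2 defaults, so each one keeps the 90% negative rate of the full dataset. When class counts do not divide evenly by the number of splits, this simple version gives the leftover rows to the earlier splits.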
When should you use stratified cross validation?
When should you use stratified cross validation? Here are a few cases where it might be beneficial to use stratified cross validation rather than regular cross validation.
- Small sample size. The main occasion when you should use stratified cross validation is when you are working with a relatively small dataset. If your dataset is very large, then random allocation alone is likely to give you splits that are relatively well balanced. If your dataset is small, on the other hand, then you are more likely to end up in a situation where the distribution of your outcome variable differs noticeably from one split to the next.
- Imbalanced data. If you are working on a model with a categorical outcome, then you are more likely to need stratified cross validation if your outcome variable has a large class imbalance. If your dataset is on the small side and highly imbalanced, then you are much more likely to end up in a scenario where some of your splits contain barely any observations from the minority class. Again, this becomes less of an issue as the size of your training data increases.
- Many outcome classes. A final instance where stratified cross validation might be useful is if you have a multiclass model with many different outcome classes. When you are in this situation, you are much more likely to run into an issue where at least one of your classes is barely represented in one of your splits. As before, this issue becomes less prevalent as the size of your training data increases.
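A small simulation can illustrate why dataset size and class imbalance matter. The sketch below repeatedly partitions a labelled dataset at random, without stratification, and counts how often at least one split ends up with no minority class observations at all. The function name `share_of_runs_with_empty_split` and the dataset sizes are assumptions picked for illustration.

```python
import random


def share_of_runs_with_empty_split(n_samples, n_minority, k, trials=1000, seed=0):
    """Fraction of random partitions where some split has zero minority rows."""
    rng = random.Random(seed)
    labels = [1] * n_minority + [0] * (n_samples - n_minority)
    bad_runs = 0
    for _ in range(trials):
        rng.shuffle(labels)
        # Unstratified allocation: cut the shuffled labels into k splits.
        splits = [labels[i::k] for i in range(k)]
        if any(sum(split) == 0 for split in splits):
            bad_runs += 1
    return bad_runs / trials


# Both datasets have a 10% minority class, but the sizes differ (assumed).
small = share_of_runs_with_empty_split(n_samples=50, n_minority=5, k=5)
large = share_of_runs_with_empty_split(
    n_samples=5000, n_minority=500, k=5, trials=200
)
print("small dataset:", small)
print("large dataset:", large)
```

With only 5 minority observations spread over 5 splits, the large majority of random partitions leave at least one split without any of them. With 500 minority observations the problem effectively disappears. Stratifying guarantees each split its share of the minority class regardless of sample size.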