Are you wondering why you should version control machine learning models? Or maybe you want to hear more about common frameworks that are used to version control machine learning models? Either way, you are in the right place! In this article, we cover the essentials of version control for machine learning models.
We start out by discussing what machine learning model versioning is. After that, we discuss why it is important to version control machine learning models. Next, we discuss how machine learning model versioning works. Finally, we provide examples of tools that are commonly used for machine learning model versioning.
What is machine learning model versioning?
What does it mean to version control a machine learning model? If you want to version control a machine learning model, the first thing you need is a central repository where you can store many different versions of many different models. Within this repository, each version of each model should have a unique identifier associated with it. This unique identifier can be used to retrieve a particular model version from the central repository. It can also be used to pull metadata about the model such as the time when the model was saved.
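To make the idea concrete, here is a minimal sketch of such a repository, assuming models can be serialized with Python's pickle module and stored on local disk. The ModelRegistry class, the "name:vN" identifier format, and the "churn" model are all hypothetical names chosen for illustration; real frameworks use their own storage layouts and identifier schemes.

```python
import hashlib
import json
import pickle
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class ModelRegistry:
    """Toy file-based registry (hypothetical): one directory per model name."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, name, model):
        """Store a new version of a model and return its unique identifier."""
        payload = pickle.dumps(model)
        model_dir = self.root / name
        model_dir.mkdir(parents=True, exist_ok=True)
        # Version numbers count up from 1; the identifier is "<name>:v<number>".
        number = len(list(model_dir.glob("v*.pkl"))) + 1
        version_id = f"{name}:v{number}"
        (model_dir / f"v{number}.pkl").write_bytes(payload)
        metadata = {
            "id": version_id,
            "saved_at": datetime.now(timezone.utc).isoformat(),
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
        (model_dir / f"v{number}.json").write_text(json.dumps(metadata))
        return version_id

    def _path(self, version_id, suffix):
        name, version = version_id.split(":")
        return self.root / name / f"{version}{suffix}"

    def load(self, version_id):
        """Retrieve a specific model version by its identifier."""
        return pickle.loads(self._path(version_id, ".pkl").read_bytes())

    def metadata(self, version_id):
        """Retrieve the metadata recorded when the version was saved."""
        return json.loads(self._path(version_id, ".json").read_text())


# A plain dict stands in for a trained model in this sketch.
registry = ModelRegistry(tempfile.mkdtemp())
first_id = registry.save("churn", {"weights": [0.1, 0.2]})
second_id = registry.save("churn", {"weights": [0.3, 0.4]})
```

Calling registry.load(first_id) returns the first saved model, and registry.metadata(first_id) returns the time it was saved along with a checksum of the stored file. Keep in mind that pickle files should only be loaded from sources you trust.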
Each time you retrain a model or make a change to a model, a new identifier should be created to track the new model. Here are some examples of changes that might be made to a machine learning model that would warrant a new version of the model being created.
- A feature is added to or removed from the model. If you add a feature to or remove a feature from your machine learning model, then you should save a new version of that model with a new identifier. You should also create a new model version if you modify the calculation that is used to create a certain feature.
- A different set of hyperparameters is used. If you change the set of hyperparameters that is used to train your machine learning model, then you should save a new version of the model with a new identifier.
- The training dataset is resampled or modified. If you modify the dataset that is used to train your model, then you should save a new version of the model with a new identifier. This includes any changes that are made to the dataset that the model is trained on, ranging from changes to your data preprocessing routines to resampling schemes.
- A different type of machine learning model is used. If you change the machine learning model that is used to address a particular problem, then you should also create a new version of your model with a new identifier. For example, if you move from using a random forest model to a logistic regression model then you should create a new model version.
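One way to make sure that each of these changes produces a new identifier is to derive the identifier from the things that define the model. The sketch below is one possible approach, not a standard: the version_id function and its arguments are hypothetical, and it assumes that feature order does not matter. It would not detect a change to how a feature is calculated unless you also included something like a code version in the hashed specification.

```python
import hashlib
import json


def version_id(model_type, features, hyperparameters, dataset_bytes):
    """Derive a short identifier from the inputs that define a trained model."""
    spec = {
        "model_type": model_type,
        "features": sorted(features),  # assumption: feature order is irrelevant
        "hyperparameters": hyperparameters,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
    }
    # Sorting keys gives the same JSON text, and so the same hash, every time.
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Made-up training data used only to illustrate the idea.
data = b"age,plan,churned\n34,basic,0\n51,pro,1\n"

baseline = version_id("random_forest", ["age", "plan"], {"n_estimators": 100}, data)
new_feature = version_id("random_forest", ["age", "plan", "tenure"], {"n_estimators": 100}, data)
new_params = version_id("random_forest", ["age", "plan"], {"n_estimators": 200}, data)
new_data = version_id("random_forest", ["age", "plan"], {"n_estimators": 100}, data + b"29,basic,0\n")
new_model = version_id("logistic_regression", ["age", "plan"], {"C": 1.0}, data)
```

Each of the five calls yields a different identifier, while repeating a call with identical inputs yields the same one.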
Why should you version machine learning models?
Why should you version machine learning models? In this section, we will discuss some of the reasons that you should version control your machine learning models.
- Reliably reproduce previous results. The first reason that you should version control your machine learning models is that it increases the reproducibility of your results. If your models are not labeled with unique identifiers and stored in a common repository where they can be accessed at any time, it can be hard to go back to a specific set of results and remember exactly which version of the model was used to produce those results. If you version control your machine learning models, all you need to do is keep track of the unique identifier for the model that produced each set of results.
- Flexibly deploy different model versions in different environments. If you version control your models and store all of the different models in a central repository, then you can flexibly deploy different versions of your machine learning model in different environments by switching out the unique identifier that is used to load the model. This makes it easy to do things like load a different version of the model into a development or QA environment to test it out before it is pushed to production.
- Flexibly deploy multiple models in the same environment. In addition to making it easier to deploy different model versions in different environments, version controlling your models and storing them in a central repository makes it easier to deploy different versions of the same model in the same environment. This comes in handy when you want to do things like run A/B tests to determine whether one model improves business metrics more than another.
- Easily roll back to previous model versions. Just as version controlling code makes it easier to roll back to a previous version of the code, version controlling machine learning models also makes it easier to roll back to a previous version of the model. This is incredibly useful in situations where a newly deployed model turns out to be inferior to the previous model, or even incorrect.
- Automatically link models back to metadata. Machine learning model versioning frameworks often provide utilities that allow you to automatically link machine learning models back to metadata. For example, if your code and data are also version controlled, then model versioning frameworks might enable you to link model versions to unique identifiers that represent the version of the code that the model was trained using and the version of the data that the model was trained on. This improves reproducibility and makes it easier to answer questions that are asked about the model down the line.
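Several of the benefits above come down to treating the model identifier as a setting that can be swapped. Here is a small sketch of that idea, assuming identifiers are plain strings such as "churn:v1". The Deployments class and the ab_variant function are hypothetical helpers written for this example, not part of any framework.

```python
import hashlib


class Deployments:
    """Tracks which model version identifier each environment serves."""

    def __init__(self):
        # Environment name -> list of deployed version ids, oldest first.
        self.history = {}

    def deploy(self, environment, version_id):
        self.history.setdefault(environment, []).append(version_id)

    def current(self, environment):
        return self.history[environment][-1]

    def rollback(self, environment):
        """Drop the latest deployment and return the version that is restored."""
        versions = self.history[environment]
        if len(versions) < 2:
            raise ValueError(f"no earlier version to roll back to in {environment}")
        versions.pop()
        return versions[-1]


def ab_variant(user_id, control_id, candidate_id, candidate_percent=50):
    """Assign a user to one of two model versions, stably across calls."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer from 0 to 99
    return candidate_id if bucket < candidate_percent else control_id


deployments = Deployments()
deployments.deploy("production", "churn:v1")
deployments.deploy("qa", "churn:v2")          # test the new version in QA first
deployments.deploy("production", "churn:v2")  # then push it to production
restored = deployments.rollback("production")  # the new version underperformed
```

After the rollback, production serves "churn:v1" again while QA still serves "churn:v2". Because ab_variant hashes the user ID, the same user is always routed to the same model version for a given split.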
How do you version control machine learning models?
How do you get started with machine learning model versioning? The easiest way to get started with version controlling machine learning models is to integrate with an existing framework that has been built for machine learning model versioning. In the next section of this article, we discuss some common frameworks that are used for machine learning model versioning.
Tools for machine learning model versioning
What are some examples of popular tools that are used for machine learning model versioning? In this section, we will discuss some of the most common tools that are used for machine learning model versioning. This includes both open source tools that can be used free of cost as well as third party vendor offerings that are paid products.
Here are some of the most common tools that are used for machine learning model versioning.
- MLflow. MLflow is a free, open source framework for managing the lifecycle of machine learning models that can be used to track input parameters and output metrics for model training runs, register trained models, and deploy models into a variety of environments. MLflow can be used with any programming language and any machine learning library through flexible REST APIs. MLflow also maintains libraries that make it particularly easy to implement MLflow functionality in Python, R, and Java codebases.
- DVC. DVC is a free, open source framework for tracking models and datasets that are produced in machine learning projects. It can also be used to perform tasks such as comparing model performance across different training runs. DVC is built on top of Git and is agnostic to both the programming language and the machine learning library used.
- Weights and Biases. Weights and Biases is a paid offering that has free tiers for personal use. It provides a flexible abstraction that allows you to version control many different types of artifacts, including machine learning models.
- Neptune. Neptune is another paid offering that has a free tier for academic and personal use. It has a specific model registry abstraction that serves as a centralized repository for versioned models.
- Comet. Comet is a paid offering that also has a free tier for academic and personal use. It also has a model registry abstraction that can be used to version control machine learning models.