Are you wondering whether you should use version control for data science projects? Or maybe you are more interested in the details of what should and should not be version controlled? Well either way, you are in the right place! In this article, we tell you everything you need to know to understand how version control should be used for data science projects.
We start out by briefly explaining what version control is and why it is useful for data science projects. After that, we go into more detail on what artifacts should be tracked in version control systems for data science projects. Finally, we talk about some version control best practices that should be followed for data science projects.
What is version control?
What is version control? Version control software enables you to keep track of changes that have been made to artifacts such as code, data, and machine learning models over time. Version control systems generally allow you to check in different versions of these artifacts to create checkpoints that can be referred to at any point in the future.
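To make the checkpoint idea concrete, here is a toy sketch of the core mechanism most version control systems share: content is hashed to produce a stable identifier, and each check-in records a new snapshot you can retrieve later. The `TinyStore` class and its method names are hypothetical illustrations, not part of any real tool.

```python
import hashlib
import time


class TinyStore:
    """Toy content-addressed store illustrating version-control checkpoints.

    A hypothetical sketch only; real systems like git add diffs, branches,
    and commit metadata on top of this basic idea.
    """

    def __init__(self):
        self.objects = {}   # checksum -> stored content
        self.history = []   # ordered list of (timestamp, checksum) checkpoints

    def check_in(self, content: bytes) -> str:
        """Snapshot some content and return its checksum identifier."""
        key = hashlib.sha256(content).hexdigest()
        self.objects[key] = content
        self.history.append((time.time(), key))
        return key

    def check_out(self, key: str) -> bytes:
        """Retrieve the exact content of a previous checkpoint."""
        return self.objects[key]
```

Because every checkpoint is addressed by its content hash, rolling back is just a lookup: check out the identifier of any earlier snapshot and you get back exactly what was stored.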
Why should you use version control for data science?
Why should you use version control for a data science project? Here are just a few reasons that you should use version control for data science projects.
- Safeguard against bugs and errors. If you track your artifacts using version control systems, you will have an easy way to roll back to a previous version if a bug is found in the current version. This advantage applies to all kinds of artifacts including code, models, and data.
- Improve reproducibility. Another advantage of version control is that it makes it easier to reproduce your work. In general, if someone has access to the exact code, model, and data you used in your analysis, they should be able to reproduce the analysis.
- Simultaneously work on the same code as coworkers. Another advantage of using version control systems to track changes to your code is that it becomes easier to share code across different computers and then merge changes made on different machines back into a single file. That means you and a coworker can make changes to the same file at the same time.
- Facilitate code review. Another great benefit of using version control systems to track changes to your code is that they make it easy to perform code reviews. Check out our article on code review for data science to hear about all of the benefits your team could gain by introducing code reviews.
Still not convinced? Check out our article on the importance of version control for data science systems to see even more examples of situations where version control can shield you from common pain points.
What should be version controlled in data science projects?
So what should be version controlled in a data science project? In traditional software engineering projects, the code is the main artifact that needs to be version controlled. In data science projects, there are a few additional types of artifacts that should be tracked using version control. Specifically, you should consider tracking the following artifacts using a version control system.
- Code. Code that was used to transform data or produce your analysis.
- Data. Anything from input data to data intermediates to output data.
- Models. Trained machine learning models.
What code should be version controlled in data science projects?
What kind of code should be version controlled for data science projects? The short answer to this question is that almost all of the code you write has the potential to benefit from version control. Even if you are working on a quick ad-hoc analysis for a coworker, storing your code in a version control system can give you confidence that you will be able to roll back to a working state if any bugs get introduced.
As a general rule of thumb, you should version any code that meets any of the following criteria. These criteria are valid no matter whether the code is used to produce a simple metric dashboard or a complex machine learning model.
- Code that will be used in production. It is absolutely critical that code that is used in production, or code that produces artifacts that will be used in production, be tracked using version control. This helps to ensure that you will be able to roll back to a previous working version if any bugs are detected.
- Code that influences important decisions. It is important that code that will be used to make important business decisions is tracked in a version control system. This ensures that the code is available in an accessible location in case someone else needs to understand how an analysis was run or re-run a similar analysis in the future.
- Code that will be rerun on an ongoing basis. Any code that is important enough to be run on an ongoing basis is important enough to be tracked in a version control system. This way you have a backup if something happens to the original code and there are other processes that depend on the output of the code.
What data should be version controlled in data science projects?
What kind of data should be version controlled in data science projects? In general, version control should be used more sparingly for data than it should be for code. The reason for this is that version controlled data can serve as a crutch that people lean on when their data pipelines are not robust and reproducible. Having this crutch readily available can lead to situations where faulty data pipelines do not get the attention they need.
Ideally, anyone should be able to reproduce an analysis given an immutable copy of the raw data used and a version controlled copy of the code that was used. If that is possible and there are no practical concerns that make repeatedly re-running the code infeasible, you may not need to version control your data. However, we do not live in an ideal world and these conditions are not always met.
Here are some examples of situations where it does make sense to version control your data.
- Large datasets that are expensive or time consuming to reproduce. One example of a situation where it might make sense to use a version control system for your data is if you are using a data intermediate that is expensive or slow to produce. In situations like these, it often does not make sense to re-run the entire pipeline that creates the data from scratch every time you want to run an analysis.
- Raw data can be changed without warning. In an ideal world, raw data should be immutable and no one should be able to change it. That being said, we do not always live in an ideal world. If the raw data you are working off of can be changed without any warning, then it makes sense to version control your data so that you can be sure exactly which version of the data was used in your analysis.
- Many people are working on a data pipeline at once. If you are using data from a data pipeline that many people are working on at the same time, it can be hard to know exactly what version of the pipeline was run to produce the dataset you used in your analysis. In these situations, it also makes sense to version control your data.
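Even when you decide not to store full copies of your data, a lightweight alternative is to record a checksum of the dataset at analysis time, so you can later detect whether the raw data changed out from under you. Here is a minimal sketch; the function name `dataset_fingerprint` is a hypothetical helper, not a standard library function.

```python
import hashlib


def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum of a data file, reading in 1 MB chunks.

    Store this fingerprint alongside your analysis results; if the file's
    fingerprint differs later, the raw data has been modified since the
    analysis was run.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Checking the stored fingerprint against a fresh one before re-running an analysis gives you a cheap guarantee that you are working with the same data, without the storage cost of versioning the dataset itself.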
What models should be version controlled in data science projects?
What models should be versioned in data science projects? Here are a few examples of situations where models should be version controlled.
- Models that will be used in production. Any model that is going to be used in any kind of production environment should be tracked using a version control system. This makes it easier to roll back to a previous version if the current version of the model has poor performance or displays unexpected behavior.
- Models that are used to make important decisions. Even if you are training a model to produce a one-off analysis that will only need to be run once, it is best to track the final model you use for the analysis in a version control system. This way you will easily be able to grab the model and run other data through it if questions arise in the future or if the analysis needs to be repeated.
Version control best practices for data science
What are some version control best practices for data science teams? Here are some examples of version control best practices that data science teams should aim to uphold.
- Models should be linked to a specific version of code (and potentially data). Whenever you are saving a model to a model version control system, or a model registry, you should be able to link that model back to the specific version of the code that was used to produce it. This is fairly straightforward to do if the model version control system you are using provides integrations with the version control system you are using for your code.
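If your model registry does not integrate with your code's version control system, a simple fallback is to save a small metadata record next to each model artifact. The sketch below assumes a git repository; the helper names `current_commit` and `model_metadata` are hypothetical, and the git lookup degrades to "unknown" outside a repository.

```python
import hashlib
import json
import subprocess


def current_commit() -> str:
    """Best-effort lookup of the current git commit hash.

    Returns "unknown" when not run inside a git repository.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except Exception:
        return "unknown"


def model_metadata(model_bytes: bytes, code_commit: str) -> str:
    """Return a JSON record linking a model artifact to the code that built it."""
    record = {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "code_commit": code_commit,
    }
    return json.dumps(record, indent=2)
```

Writing this JSON record alongside the serialized model means that, months later, anyone can check out the exact commit that produced the model and verify the artifact has not been altered.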
- Do not use data versioning as a replacement for reproducible pipelines. As mentioned above, you should use data version control sparingly and make sure that you do not rely on version controlled data as a replacement for reliable, reproducible data pipelines.
- Check in atomic changes to your code frequently. It is best to check in changes to your code, or create checkpoints that you can refer back to later, relatively frequently. If you do not check in changes to your code frequently, each checkpoint you check in will have accumulated multiple changes compared to the last checkpoint. This will make it difficult to roll back to the exact version of the code you want if, for example, five major changes happened between one checkpoint and the next and you only want to preserve three of the changes.
- Only check in functional code. While you should check in your code frequently, you should not check the code in before you have tested the code and ensured that it is functional. If you have only completed half of the changes that are needed to make a certain update and the code will not run properly until all of the changes have been made, you should wait until you have completed all of the necessary changes to check in the code.
- Get feedback. It is generally best to perform code review and get feedback when you check in major changes. There are many benefits to performing code review for data science projects. For starters, you will catch bugs in your code faster, have a more standardized code base, and have a great avenue for teammates to learn from one another.