Are you wondering whether you need version control software for your data science projects? In this post, we go over the main benefits of using version control software for data science work. We focus on the importance of version controlling your code, as opposed to other aspects of your projects such as the underlying data. We will save that for another post!
This post is primarily geared towards data professionals who are wondering whether they could benefit from using version control software. That being said, much of the advice we give here is also relevant to people in other fields such as software development.
Why is version control important?
- You can easily roll back to previous versions of code.
- You can track when updates were made to your code.
- You can reduce redundant copies of code.
- You can edit code at the same time as your colleagues.
Rolling back to previous versions
Why might you want to roll back to a previous version of your code? Here are some of the most common reasons.
- Your code is broken or has an elusive bug. This is one of the most compelling reasons for using version control software. If you make updates that break your code or introduce a bug that is difficult to find, you can easily roll back to a working version. Then you can re-introduce your changes one at a time, testing after each one to see which change brings the bug back. This is often much faster and easier than trying to track down where the bug is coming from in the fully modified code.
- You made code changes that are no longer necessary. Did you update your code to add a new feature to your model and then realize that the new feature did not contribute much at all? If you make changes to your code and then your situation changes so that those changes are no longer necessary, you can easily roll back to a previous version of the code with the help of version control software. No need to manually undo all the changes you made.
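The first workflow above, rolling back to a working version and re-introducing changes one at a time, can be sketched in a few lines of Python. This is only an illustration of the idea, not a feature of any particular version control tool. The function `first_breaking_change`, the settings dictionary, and the three example changes are all hypothetical names made up for this example.

```python
from typing import Callable, Optional, Sequence

Settings = dict


def first_breaking_change(
    good_state: Settings,
    changes: Sequence[Callable[[Settings], Settings]],
    still_works: Callable[[Settings], bool],
) -> Optional[int]:
    """Re-apply changes one at a time on top of a known-good state.

    Returns the index of the first change after which the check fails,
    or None if every change passes the check.
    """
    state = good_state
    for index, change in enumerate(changes):
        state = change(state)
        if not still_works(state):
            return index
    return None


# Hypothetical changes made since the last working version.
def add_feature(settings: Settings) -> Settings:
    return {**settings, "features": settings["features"] + ["zip_code"]}


def drop_target(settings: Settings) -> Settings:
    # The buggy change: the target column is accidentally removed.
    return {**settings, "target": None}


def rename_output(settings: Settings) -> Settings:
    return {**settings, "output": "scores_v2.csv"}


def still_works(settings: Settings) -> bool:
    # Stand-in for a real test suite or validation run.
    return settings["target"] is not None


baseline = {"features": ["state"], "target": "churned", "output": "scores.csv"}
culprit = first_breaking_change(
    baseline, [add_feature, drop_target, rename_output], still_works
)
print(culprit)  # 1, the index of drop_target
```

In a real project, each change would typically be a commit and the check would be your test suite or a validation run on held-out data.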
Tracking when code changes
If you use a version control system and regularly check your code in with that system, then you will easily be able to go back and check when you made certain changes to your code. Why is this useful? Here are some situations where having this capability might be helpful.
- You notice a drop in model performance. Imagine this situation – you are looking at metrics on the performance of your model over time. The metrics might relate to the accuracy of the model, or perhaps the speed at which your model scores data. You see a sudden drop in model performance starting on a certain date, but you are unable to remember what changes were made to the code on that date. If the performance degradation is significant, being able to track exactly which changes were added at the time the performance degradation began is invaluable.
- A colleague asks you when code changes were made. We’ve all been there. A colleague stops by your desk or drops you an email and asks what date a key update was made to your data science model or data processing code. It has been a while since you even touched the project in question, or maybe the code change was made by a coworker who is no longer at your company. If you are relying on human-produced change logs, or worse yet, if you do not have any way to track changes to your codebase, it can be very difficult to answer this question. If you have been regularly checking your code in with version control software, you can easily go back and see what date the changes were introduced.
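To make the first scenario concrete, here is a small Python sketch of the lookup you would do once you have a change history: list the changes that landed on or shortly before the day the metrics dropped. The history below is hand-written, hypothetical data, and `Commit` and `commits_near` are names invented for this example. With a real version control tool, the records would come from that tool's log rather than from a list typed by hand.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class Commit:
    day: date
    author: str
    message: str


def commits_near(
    history: Sequence[Commit], drop_day: date, window_days: int = 2
) -> List[Commit]:
    """Return commits made within window_days before drop_day, inclusive."""
    earliest = drop_day - timedelta(days=window_days)
    return [c for c in history if earliest <= c.day <= drop_day]


# Hypothetical change history for a model scoring project.
history = [
    Commit(date(2023, 3, 1), "ana", "Add zip code feature"),
    Commit(date(2023, 3, 13), "ben", "Change missing-value imputation"),
    Commit(date(2023, 3, 14), "ana", "Switch scoring to batch mode"),
    Commit(date(2023, 3, 20), "ben", "Update README"),
]

# Performance dropped on this (hypothetical) date.
suspects = commits_near(history, date(2023, 3, 14))
for commit in suspects:
    print(f"{commit.day} {commit.author}: {commit.message}")
# 2023-03-13 ben: Change missing-value imputation
# 2023-03-14 ana: Switch scoring to batch mode
```

The same lookup answers the colleague's question in the second scenario: the history records who changed what and when, so nobody has to rely on memory.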
Reducing redundant copies of code
Have you ever ended up with a cluttered directory full of different versions of the same notebook or script? This can make it hard to determine exactly which version of the code was used to produce a specific result. There are many reasons you might be tempted to keep multiple copies of the same code around. Here are just a few.
- Preserving legacy code for documentation purposes. Perhaps you want to preserve some legacy code for record keeping purposes, just in case you need to refer back to the old code to see what changes were made between the legacy code and the current code. If you are using a version control system, you can say goodbye to cluttered directories with many copies of the same code. There is no need to save multiple files when you can easily search through all previous states of your code.
- Preserving a copy of working code in case something goes wrong. Imagine this scenario. You are making some large changes to a crucial Python script, and you decide to keep a copy of the original so that you have a working version in case anything goes wrong. You tell yourself that you will delete the old copy once you have validated that your updated code works. Maybe you do remember to delete it, but maybe you forget. This leaves your directory cluttered and, more crucially, makes it difficult to determine which version of the code was used to produce a certain result.
- Making temporary changes in response to ad hoc requests. Sometimes you get requests to look at how the results of your analysis would have changed if slight changes had been applied to your code. For example, someone might ask what would have happened if you had used zip code rather than state as the main location variable. If you know that the change is only going to be relevant for this ad hoc request, you might decide to make another copy of your code, apply the temporary changes to it, and save it alongside the primary copy. This creates yet another file that may be mistaken for your primary file and run at the wrong time. If you use version control software that supports branching, you can make a separate branch in your repository for the ad hoc analysis. This way you have only one version of the code per branch, and you can easily switch between branches to view different versions of your code.
Simultaneously editing code
Are you working on a project alongside other data scientists, analysts, or engineers? If you do not use a version control system, you might find yourselves taking turns and working one at a time to avoid tripping over half-baked changes added by your colleagues. With a version control system that supports branching, each of you can work on a separate version of the same code and then merge the changes back into one unified codebase later.
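If you are curious how two sets of edits can be combined automatically, the following Python sketch shows the basic idea behind a three-way merge: compare each colleague's version against the common starting point and keep whichever side changed a line. This is a deliberately simplified illustration that assumes both edits keep the same number of lines. Real version control tools also handle inserted and deleted lines, and `merge_lines` and the example script are hypothetical names made up for this example.

```python
from typing import List, Sequence, Tuple


def merge_lines(
    base: Sequence[str], mine: Sequence[str], theirs: Sequence[str]
) -> Tuple[List[str], List[int]]:
    """Line-by-line three-way merge for versions with equal line counts.

    Returns the merged lines and the 1-based numbers of conflicting lines.
    On a conflict, my version is kept as a placeholder and the line is flagged.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles edits that keep the line count")

    merged: List[str] = []
    conflicts: List[int] = []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:
            merged.append(m)  # both sides agree (or neither changed the line)
        elif m == b:
            merged.append(t)  # only my colleague changed this line
        elif t == b:
            merged.append(m)  # only I changed this line
        else:
            merged.append(m)  # both changed it differently: needs a human
            conflicts.append(number)
    return merged, conflicts


# The common starting point, then two independent edits to it.
base = [
    "import pandas as pd",
    "df = pd.read_csv('data.csv')",
    "model = fit(df, location='state')",
]
mine = [base[0], "df = pd.read_csv('data_clean.csv')", base[2]]
theirs = [base[0], base[1], "model = fit(df, location='zip_code')"]

merged, conflicts = merge_lines(base, mine, theirs)
print("\n".join(merged))
print(conflicts)  # [] because the two edits touched different lines
```

When both people change the same line in different ways, the line is reported as a conflict instead of being silently overwritten, and a person decides which version to keep.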