Are you looking to learn more about code reviews for data science and analytics teams? In this article, we explain how and why to conduct code reviews for data science projects.
This article starts out with an explanation of why code reviews should be performed for data science projects. After that, we dive into operational details such as who should be performing code reviews and how often code review should be performed for data science and analytics projects. Finally, we provide more details on what you should look out for in data science code reviews.
Why should you do code reviews in data science?
Here are some of the main benefits of performing code reviews for data science and analytics projects.
- Find mistakes in your code. The first reason that you should perform code reviews is so that you can find mistakes in your code before it reaches production. Whether you are building out a pipeline to serve machine learning models in real time or just writing a SQL query that will be used to inform business decisions, getting a few extra pairs of eyes on your code is a great way to reduce the number of bugs that make it through. This will help your team develop a reputation for delivering accurate work, which will help you build trust with stakeholders.
- Teach team members. Code reviews provide a great opportunity to teach and mentor team members. If you see that a team member is consistently writing inefficient code or deviating from the team’s styling guidelines, code reviews are a great time to give that team member a nudge back in the right direction. This will help to increase the quality of output throughout your team.
- Learn from team members. On the flip side, code reviews can also provide a great opportunity to learn from your team members. Sometimes you will pick up tricks that you see your coworkers use when you review their code. This will also help to increase the quality of output throughout your team.
- Disseminate knowledge about important projects. One of the most important reasons to perform code reviews on data science projects is that they help to ensure that multiple people on your team have knowledge about important projects. This means that if an urgent issue comes up while the main person working on that project is out of office, there will be other people on the team who have some level of familiarity with the project. Your team will be better prepared to respond to high priority issues even when some team members are out on vacation.
- Increase consistency in your codebase. Code reviews are also important for maintaining consistency in a codebase. This is important, because the more consistent your team is with code structure and styling, the easier it will be for other team members to maintain code that has been written by their peers. This can help to save you hours down the line.
Who should perform code reviews in data science?
Once you have decided to start performing code reviews within your data science team, one of the next questions that comes up is who should perform them. Here are some tips to help you decide who will perform code reviews for your data science team.
- Seniority. One factor to think about when deciding who should perform code reviews in your team is seniority. Ideally, everyone on your team should perform code reviews at some point. Junior team members can learn a lot from reviewing code written by more senior team members. That being said, it is often a good practice to require that code be reviewed by at least one senior team member.
- Domain knowledge. It is generally good to have code reviewed by at least one team member who works within the same domain or focuses on the same subject matter. This is important because team members who work in the same domain will be better able to identify inconsistencies with standard business logic.
- Impacted teams. Sometimes if you are writing code that will integrate with systems built by another team or code that will impact another team, it makes sense to have code reviewed by someone on an external team. For example, if you are integrating with a system that was built out by a team of developers, then it might make sense to have someone from that team review code that will impact the integration.
How often should you do code reviews in data science?
One of the main factors you should consider when deciding how often to perform code reviews within a data science team is the number of lines of code that will be changed in each review.
A good rule of thumb is that reviewers should not review more than 400 lines of code at a time, and ideally that number should be kept under 200 lines whenever possible. The reason for this is that the more code that is being reviewed at one time, the more likely reviewers are to miss flaws in the code. In practice, this means requesting reviews often enough that each review stays small.
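To make the rule of thumb above concrete, here is a minimal Python sketch that counts the changed lines in a diff and compares the total against the 200 and 400 line thresholds. It assumes the input is text in the format produced by `git diff`. The function names, the constant names, and the verdict labels are hypothetical and chosen just for illustration.

```python
REVIEW_SOFT_LIMIT = 200  # preferred ceiling from the rule of thumb
REVIEW_HARD_LIMIT = 400  # upper bound from the rule of thumb


def count_changed_lines(diff_text: str) -> int:
    """Count added and removed lines in `git diff` style output."""
    changed = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            # A new file section starts; its header lines are not changes.
            in_hunk = False
        elif line.startswith("@@"):
            # Hunk header; the lines that follow are actual file content.
            in_hunk = True
        elif in_hunk and line.startswith(("+", "-")):
            changed += 1
    return changed


def review_size_verdict(changed_lines: int) -> str:
    """Classify a change by size (labels are illustrative assumptions)."""
    if changed_lines > REVIEW_HARD_LIMIT:
        return "split"  # too large to review reliably in one pass
    if changed_lines > REVIEW_SOFT_LIMIT:
        return "large"  # reviewable, but smaller would be better
    return "ok"
```

Only lines inside a hunk are counted, so file headers are skipped while a removed SQL comment (which shows up in a diff as a line starting with `---`) is still counted as a change.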
What to look out for in data science code reviews?
Here are some examples of things you can look out for when conducting code reviews for data science and analytics projects.
- Errors and edge cases. The first thing you should look for when reviewing data science code is errors in the code. Keep an eye out for code that does not function quite in the way that the author intended. Pay special attention to edge cases that the author might not have considered when building out the code. Will the code work properly even in places where data is missing or contains an unexpected value?
- Inconsistency with business logic. In addition to looking for code that does not function as intended, you should also look for inconsistencies with standard business logic. Just because the code is functional does not mean that it is correct.
- Efficiency and scalability. A third topic to keep in mind is the efficiency and scalability of the code. You should pay particular attention to this if the code that is being reviewed will need to be applied to larger and larger datasets as the company scales.
- Consistency with codebase. Another important thing to look out for in code review is consistency with the rest of your team’s codebase. Are there standard styling guidelines that your team adheres to? Are tasks being performed in a non-standard way? These are additional things you should look out for in code reviews. The more consistent your codebase is, the easier it will be for new team members to come in and maintain the code.
- Good patterns. As a final note, code reviews are not just a place to criticize code. They are also a great way to reinforce good behaviors. You should feel free to leave comments pointing out things that are done well.
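As an illustration of the "errors and edge cases" item above, here is a small Python sketch of the kind of function a reviewer might check for missing or unexpected values. The metric, the function name `conversion_rate`, and the choice to return `None` rather than 0 are all assumptions made for this example, not a recommended standard.

```python
import math
from typing import Optional


def conversion_rate(orders: Optional[float], visits: Optional[float]) -> Optional[float]:
    """Return orders / visits, or None when no meaningful rate exists."""
    for value in (orders, visits):
        # Missing data: return None instead of a misleading 0.
        if value is None or math.isnan(value):
            return None
        # Unexpected value: counts should never be negative.
        if value < 0:
            raise ValueError(f"counts must be non-negative, got {value}")
    # Zero visits: the rate is undefined, so avoid dividing by zero.
    if visits == 0:
        return None
    return orders / visits
```

When reviewing a function like this, it is worth asking whether the chosen behavior for each edge case matches what downstream users expect. For example, a dashboard might need to distinguish "no data" from "a rate of zero".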
What code should be reviewed in data science?
What types of code should be reviewed in data science code reviews? In general, the answer here is that most code that is worth writing is worth reviewing. In particular, any code that is being used to inform business decisions or surface data to other users should be reviewed. Here are just a few examples of types of data science code that should be put through code reviews.
- Queries or notebooks that will be used to inform business decisions
- Dashboards that will be used to surface business metrics
- Pipelines that will be used to perform data transformations
Tools for performing data science code reviews
The final topic that we will discuss in this article is tools that can be used to perform code reviews. In general, it is best to keep your code in a version control system such as Git and to perform code reviews through the review workflow, typically pull requests or merge requests, offered by the platform that hosts your repositories. If you are not sure what tools other teams use for version control and code review, we recommend reaching out to development teams at your company to find out.
Best practices for data teams
- Avoid knowledge silos
- Avoid duplication in your codebase
- Standardize your codebase
- Use version control
- Write unit tests
Check out our article on data science best practices for all of our best recommendations on how to increase the efficacy of data science teams.