Do you want to learn more about best practices for data science teams? Well then you are in the right place! In this article, we discuss some of the most important best practices that should be followed by data science teams.
We start out by elaborating on what types of data professionals would benefit from reading this article. After that, we discuss a few different categories of best practices that should be considered by data science teams. Finally, we provide examples of best practices that fall into each category. For each best practice we discuss, we also explain why this best practice helps data science teams operate more efficiently.
Who can benefit from these best practices?
Many of the best practices that we talk about in this article are broadly applicable to all types of data science teams. That means that they will be just as useful for teams that focus on analytics and reporting as they will be for teams that focus on deploying machine learning models in production.
These best practices will also be useful to data scientists across all levels. Whether you are a data science manager who is looking to improve team efficiency or a junior data scientist who is looking to develop your skills, these best practices will serve you well.
Types of data science best practices
In this article, we will talk about three different kinds of best practices that apply to data science teams. Here are the major themes we will discuss.
- Technical standards and processes. First we will talk about best practices that relate to technical skills and processes. This section will include tips on how to produce a codebase that is easy to onboard onto and maintain. These tips will improve the efficiency of data science teams as well as the quality of their results.
- Communication and stakeholder management. Next, we will talk about best practices that enable teams to communicate more effectively. We will put an emphasis on tips for communicating with nontechnical stakeholders and stakeholders in different disciplines. These best practices will help the team to build trust with their stakeholders and get the recognition they deserve.
- Impact and efficiency. After that, we will discuss practical tips that can help data science teams be more impactful. These best practices will help to improve the resilience of data science teams and ensure that teams are working on the right problems at the right time.
Technical standards and processes
- Use version control. Our first recommendation is to use a version control system like Git to track the code that your team writes. Version control systems allow you to save snapshots of your code at different points in time so that you can revisit them in the future. This makes it easier to revert to a previous version of the code after a sneaky bug is introduced. It also improves the reproducibility of your analysis by allowing you to track exactly which version of the code was used to produce a given deliverable. If you want to learn more about the benefits of version control, check out our article on version control for data science teams.
- Perform code reviews. Our next recommendation is to perform code reviews. This simply means that people on your team should review each other’s code before the code is used to make critical business decisions. The benefits of performing code reviews are manifold. At the most basic level, code reviews help to detect mistakes in the code. In addition to this, code review is a powerful learning tool that can accelerate the pace at which teammates learn from one another. If you want to hear more about the benefits of code review, check out our article on code review for data science teams.
- Define standard conventions and follow them. You should also aim to define and follow a set of standard conventions to ensure consistency across different projects and teams. This means that the format of your data, code, and products should be consistent from one project to another. This makes it easier for teammates to onboard onto a new project and reduces the number of questions you get from (rightfully) confused stakeholders who see discrepancies in your teams’ work. If you want to learn more about the benefits of standardization, check out our article on standardization for data science teams.
- Test your code. The next technical best practice is to write formal tests to evaluate your code and ensure that it produces the right results. This will decrease the number of errors that find their way into your work and increase the level of trust that stakeholders have in your team. It will also enable your teammates to iterate faster on existing projects. If you want to hear more about the benefits of testing your data science code, check out our article on unit testing for data science teams.
- Avoid duplication. The next technical best practice for data science teams is to avoid duplication when possible. This principle applies to your code, data, and any data products that your team produces. Reducing duplication reduces the number of logical errors in your data, code, and data products. It also reduces the overall amount of work that the team needs to perform to achieve a given outcome. If you want to hear more about how to detect duplication in your team’s work, check out our article on avoiding duplication in data science teams.
- Use configuration files. You should use configuration files to separate slow-changing code from fast-changing business logic. Because parameters can then be updated without editing the code itself, this reduces the chances that a bug will be introduced into your code and makes it easy for you to run the same piece of code with different sets of parameters. For more information on when and how you should use configuration files, check out our article on configuration files for data science projects.
- Better data trumps a better model. This best practice is particularly useful for data science teams that regularly build machine learning models. When you are looking to improve an existing machine learning model, you will often see more benefit from making improvements to the dataset you are using to train your model than from making improvements to the model itself. For more information on strategies you can use to improve your machine learning models, check out our article on how to improve a machine learning model.
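Two of the tips above, using configuration files and testing your code, can be sketched together in a few lines of Python. This is only a minimal illustration under stated assumptions: the parameter names (min_order_value, discount_rate), the functions load_config and apply_discount, and the discount rule itself are all hypothetical and exist only for this example. In a real project the configuration would live in its own file rather than in a string.

```python
import json

# Hypothetical configuration. In a real project this would live in its own
# file (for example config.json) and be read with json.load.
CONFIG_TEXT = """
{
    "min_order_value": 20.0,
    "discount_rate": 0.1
}
"""


def load_config(text):
    """Parse configuration parameters from a JSON string."""
    return json.loads(text)


def apply_discount(order_values, config):
    """Discount every order at or above the configured minimum value."""
    threshold = config["min_order_value"]
    rate = config["discount_rate"]
    return [
        round(value * (1 - rate), 2) if value >= threshold else value
        for value in order_values
    ]


def test_apply_discount():
    """A small formal test that pins down the expected behavior."""
    config = {"min_order_value": 20.0, "discount_rate": 0.5}
    assert apply_discount([10.0, 20.0, 40.0], config) == [10.0, 10.0, 20.0]


if __name__ == "__main__":
    test_apply_discount()
    print(apply_discount([15.0, 25.0], load_config(CONFIG_TEXT)))
```

The business rule (the threshold and the rate) can change by editing the configuration alone, while the test keeps checking that the slow-changing code still behaves as expected. A test runner such as pytest would pick up test_apply_discount automatically, but calling it directly works as well.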
Communication and stakeholder management
- Focus on the business impact of your projects. Our next recommendation is to focus on the business impact of your projects when communicating with stakeholders outside of your team. It can be tempting to put an emphasis on the technical details of your projects, especially when you spend so much time thinking about them, but you should resist this urge! Speaking in terms of business impact makes it easier to get buy-in for projects and recognition for the work that your team has done. If you want to learn more about how to translate technical impact into business terms, check out our article on translating technical impact into business metrics.
- Continue to report on the impact of your projects well after launch. The next best practice is to continue to track and report on the impact that data science projects have well after they launch. Many times, teams will measure the success of a project shortly after launch, then celebrate and move on. They do not track the continued impact that the project has over time. When you continue to report on the impact of data science projects, you ensure that the team gets the recognition it deserves. This often makes it easier to get more resources allocated to the team.
- Focus on the problem you are solving when showcasing work. When you are showcasing work that you have done to a general audience, it is often best to focus on what problem you are solving and why you are solving that problem rather than getting into nitty-gritty implementation details. This helps to ensure that your audience understands the importance of your work and the impact that it has on the business.
- Use simple examples to convey complex topics. It is often useful to use simple examples to convey complex topics. Sometimes when a data scientist has to explain a highly technical concept to a non-technical audience, the explanation is so abstract and high level that it is difficult for non-technical stakeholders to understand the implications. Providing a simplified example can make the concept more tangible. This helps to increase stakeholder alignment and ensure that everyone has the same expectations about what is being built and what problem is being solved.
- Ask for feedback early and often. You should do your best to ensure that your team asks for feedback early on in the project lifecycle. Specifically, you should make sure that you are presenting your plan and asking for feedback from both technical and nontechnical stakeholders. This helps to make sure that everyone is aligned and the team is solving the right problem. There is nothing worse than spending a large amount of effort on a project just to learn that there has been a miscommunication and you have built the wrong thing. For more recommendations on how to best get feedback on data science projects, check out our article on soliciting feedback for data science projects.
Impact and efficiency
- Start simple and iterate. Whenever you are working on a complex project that will take a meaningful amount of time and effort, you should consider whether it makes sense to build a simple solution first. After you build your simple solution, you can evaluate whether that solution suffices and plan to iterate on your existing solution if necessary. This ensures that you do not spend extra time building a complex solution when a simple solution would suffice. It also gives you a faster route to feedback to help you understand whether you are headed in the right direction. If you want to hear more about how this strategy is applied in machine learning projects, check out our article on baseline models for machine learning.
- Favor projects that can be used by many teams. When you are prioritizing projects against one another, you should consider the number of different areas where the output produced by a project could be used. In general, projects that produce generalizable outputs that can be used by multiple teams are better than projects that produce a very specific output that is only relevant to one team or product area. Check out our article on scaling the impact of data science teams for more on this topic.
- Prioritize projects based on business impact. Our next tip is to prioritize the projects that you work on based on business impact. It is sometimes tempting to work on a project that is not so impactful because it is technically interesting or challenging. It is okay to consider whether a project provides a learning opportunity when deciding what to work on, but don’t forget to consider the impact that the project has on the business. Check out our framework for prioritizing data science projects to learn more about how to prioritize data science projects.
- Avoid knowledge silos. Our final recommendation is to do your best to avoid knowledge silos within data science teams. If you find yourself in a situation where there is only one person who has context about an important area that a team works on, then you likely have knowledge silos in your team. Knowledge silos can lead to situations where multiple team members are working on similar or overlapping work. They can also cause large disruptions when a team member who is working on an important project goes on vacation or leaves the company. For more information on the negative impacts that knowledge silos can have on data science teams, check out our article on preventing knowledge silos in data science teams.
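To make the "start simple and iterate" tip from this section more concrete, here is a minimal sketch of a baseline model in Python. Everything in it is an assumption made for illustration: the helper names fit_mean_baseline and mean_absolute_error are hypothetical, and the numbers are made-up toy values rather than real results. The baseline simply predicts the average of the training targets, which gives you a reference score that any more complex model has to beat.

```python
from statistics import mean


def fit_mean_baseline(train_targets):
    """Return a 'model' that always predicts the mean of the training targets."""
    average = mean(train_targets)
    return lambda rows: [average for _ in rows]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Made-up toy data, used only to show how the pieces fit together.
train_targets = [10.0, 12.0, 14.0]
test_rows = [{"id": 1}, {"id": 2}]
test_targets = [11.0, 15.0]

baseline = fit_mean_baseline(train_targets)
baseline_error = mean_absolute_error(test_targets, baseline(test_rows))
print(f"Baseline mean absolute error: {baseline_error}")
```

If a more complex model cannot clearly beat this error on the same held-out data, the simple solution may already suffice, and you have learned that quickly.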