Do you want to learn more about the negative impact that duplication has on a data science codebase? Or maybe you are more interested in hearing about strategies you can use to avoid duplication in a data science codebase? Well, either way, you are in the right place! In this article we tell you everything you need to know about duplication and its impact on data codebases.
We start out by specifying what we mean by duplication and elaborating on different areas where duplication should be avoided. After that, we discuss some of the main pain points that are caused by excessive duplication. Finally, we provide some strategies to help avoid duplication in a data science codebase.
What do we mean when we say to avoid duplication?
What do we mean when we say that you should avoid duplication in your data codebase? When we talk about avoiding duplication in a data codebase, we are not only talking about avoiding duplicated code or functionality. Instead, we are talking more broadly about duplication in any part of the data ecosystem, ranging from data points in databases to data products that are built on top of your data.
Where can duplication occur in a data codebase?
In the following section, we will talk about three main areas or components of your codebase where you should try to avoid duplication. For each area, we will discuss examples of what duplication in that area might look like. Specifically, we will talk about the following areas.
- Data. This includes everything from raw data that is collected from the source to normalized datasets that are built on top of the raw data.
- Code. This includes any code that is written to collect, transform, or analyze data.
- Data products. This includes any user-facing products that are built to surface data or information about data to end users. This can include anything from user-facing documentation to dashboards or APIs that return predictions from machine learning models.
What does duplication look like in data?
- Tables with similar content. For any type of information that you are looking for, there should be one clear table that contains that type of information. There should be clear boundaries distinguishing what type of data belongs in one table versus another. If you find yourself in a situation where it would make sense for a set of data points to be found in multiple different tables, you should consider whether the tables need to be consolidated.
- The same value captured in multiple tables. Even if the domain of each table is clearly differentiated, there may still be cases where the same value is captured in multiple tables. This should be avoided whenever possible. One table should be chosen to be the primary source of truth for that information.
- Similar values captured in the same table (or across multiple tables). You should also look out for situations where multiple values that are very similar to each other exist. For example, you should avoid cases where you have two variables that are calculated in very similar ways with one slight difference. Instead, you should try to agree on a standard representation of that variable that should be used across all applications.
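As a toy illustration of agreeing on one standard representation, here is a minimal Python sketch (all names and data are hypothetical). Instead of two teams each computing "active users" slightly differently, a single shared definition is used everywhere:

```python
# Hypothetical scenario: one team filtered out test accounts when counting
# active users, another did not, producing two near-identical metrics.
# Standardizing on one shared definition removes the ambiguity.

def active_users(events, exclude_test_accounts=True):
    """The single agreed-upon definition of an 'active user'."""
    users = {e["user_id"] for e in events if e["action"] != "login_failed"}
    if exclude_test_accounts:
        users = {u for u in users if not u.startswith("test_")}
    return users

events = [
    {"user_id": "alice", "action": "view"},
    {"user_id": "test_bot", "action": "view"},
    {"user_id": "bob", "action": "login_failed"},
]

# Every report and dashboard calls the same function.
print(sorted(active_users(events)))  # ['alice']
```

The edge cases (such as whether test accounts count) become explicit, named parameters of one function rather than silent differences between near-duplicate calculations.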
What does duplication look like in code?
- Similar code with small changes. You should avoid copying and pasting code that has similar functionality over and over again within the same file (or across multiple files). If you find yourself copying and pasting blocks of code then making a small change to each instance of the copied code, you should reevaluate the patterns you are following.
- Exact copies of code. You should also avoid copying and pasting functions that you want to use across different files and platforms. Instead, you should examine solutions that allow you to maintain a single version of code that is accessible across different files.
What does duplication look like in data products?
- Similar documentation exists in multiple places. Be conscious of duplication when you are creating documentation to accompany your data, code, or other data products. There should be a single source of truth for every piece of information you want to document. If you need to reference that information in multiple places, you should link back to the main source of truth rather than copying the information.
- Similar metrics with slightly different calculations. You should avoid scenarios where slightly different variations of the same metric are displayed in your dashboards. Rather than maintaining several near-identical versions, you should agree on a standardized calculation and consolidate down to a single version of the metric.
- Predictions from similar models. Whenever you are experimenting with different machine learning models, you should avoid situations where there are predictions from multiple different models available and there is not a clear understanding of which one should be used. Even if multiple versions of the model do exist, you should ensure that there is a single model that is understood to be the production version of the model that is currently in use.
Why should you avoid duplication?
Why should you avoid unnecessary duplication in your data science codebase? Here are just a few examples of pain points that can be caused by unnecessary duplication.
- Duplication leads to more logical contradictions and incorrect logic. The most compelling reason to avoid duplication in a data science codebase is that if you have a lot of duplicated code, you are much more likely to have logical contradictions and incorrect pieces of logic in your codebase. The reason for this is that if you have one piece of logic that is duplicated in many places and you need to update that logic to reflect a change in your environment, it is very easy to miss one or two places where that logic needs to be updated. This leaves you with copies of the legacy logic that are both incorrect and inconsistent with the logic that exists elsewhere in your codebase.
- Duplication leads to more misunderstandings and questions. As if having incorrect logic lurking in your codebase wasn’t bad enough on its own, these inconsistencies can have a secondary effect of increasing the number of questions you get from stakeholders and collaborators. Even if the incorrect logic is located in something like documentation that does not directly affect the results of your models, it will still increase the number of questions that are addressed to your team.
- It is time consuming to make updates in multiple places. Even if you do remember to update every instance of a piece of logic every time, it is still much more time consuming to have to make the same update in multiple different places. In addition to taking up your own time, it may also take up some of your collaborators' time if they have to review each instance of code that you have updated.
- More duplication means more code. In many cases, less is more when it comes to code. Larger bodies of code generally take more time to sort through, understand, and maintain. If you have many instances of duplicated code in your data codebase, you will also increase the number of lines of code that need to be maintained.
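The first pain point above can be shown with a toy Python example (the scenario and values are hypothetical): a constant was copy-pasted into two reports, and when it changed, only one copy was updated.

```python
# Hypothetical scenario: a VAT rate was copy-pasted into two reports.
# When the rate changed, one copy was updated and the other was missed.

VAT_RATE_SALES_REPORT = 0.21    # updated after the rate change
VAT_RATE_FINANCE_REPORT = 0.19  # stale copy that was missed

price = 100.0
print(price * (1 + VAT_RATE_SALES_REPORT))    # the two reports now
print(price * (1 + VAT_RATE_FINANCE_REPORT))  # disagree with each other
```

With a single shared constant, the change would have needed to happen in exactly one place, and the two reports could never have drifted apart.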
How to avoid duplication in data codebases?
How do you avoid unnecessary duplication in a data codebase? Here are some examples of strategies you can employ to reduce duplication in your codebase.
- Use functions rather than copying and pasting code. As a general rule of thumb, you should avoid copying and pasting similar blocks of code over and over again. Instead, you should write parameterized functions that can be applied to your data. For example, if you need to apply a set of transformations to multiple datasets, you are best off wrapping the transformations in a function and then applying that function to each dataset.
- Package shared utilities rather than copying and pasting functions. If you need to use the same function across multiple different files, you should avoid copying and pasting the same function over and over again. Instead, you are better off packaging the common functions into a library that can be imported anywhere you need them.
- Use version control rather than duplicating files. Sometimes you run into a case where you want to perform a slightly different version of an analysis by implementing a few changes in your code. If you want to maintain the original version of your code alongside the modified version, you might be tempted to make a copy of the file you are working on and give it a slightly different name. In general, you should avoid this pattern. Instead, you are better off using a version control system that allows you to branch off of the original code to implement the changes.
- One source of truth for documentation. In addition to avoiding duplication in your code, you should also avoid duplication in your documentation. In general, you should ensure that there is a single source of truth that documents each required piece of information. If you need to reference that piece of information in multiple places, you should include links out to the single source of truth rather than copying and pasting the information in multiple places.
- Look for duplication in code review. If your team performs code reviews, you can explicitly tell them to look out for duplication in the codebase. This way someone will call it out if one teammate writes code or creates a product that overlaps with another artifact they have worked on.
- Assign clear ownership. You can also avoid some duplication in your codebase by ensuring that there is clear ownership assigned to different domains. If there is only one team that works on a particular domain, there is only one team that is likely to design data tables and build data products related to that domain. If there are multiple teams that work on overlapping areas, you are much more likely to run into a scenario where multiple teams are building similar tables or products without communicating with each other.
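The first strategy above, replacing copy-pasted blocks with a parameterized function, can be sketched in Python as follows (column names and data are hypothetical):

```python
# Sketch: rather than copy-pasting the same cleaning steps for each
# dataset and tweaking one detail each time, parameterize the details.

def clean(rows, value_col, scale=1.0):
    """Drop rows missing the value column, then rescale its values."""
    return [
        {**row, value_col: row[value_col] * scale}
        for row in rows
        if row.get(value_col) is not None
    ]

sales = [{"region": "EU", "revenue": 100}, {"region": "US", "revenue": None}]
costs = [{"region": "EU", "cost": 40}]

# One function applied to two datasets with different parameters.
print(clean(sales, "revenue"))
print(clean(costs, "cost", scale=0.01))
```

If the cleaning rules later change, the fix happens once in `clean` rather than in every pasted copy.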
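The second strategy, packaging shared utilities, can also be sketched in Python. This example simulates a shared module with a temporary file; in a real project the module would live in a versioned package (for example, installed with `pip install -e .`), and the module and function names here are hypothetical:

```python
# Sketch: keep one canonical copy of a helper in an importable module
# instead of pasting it into every notebook and script.
import importlib.util
import pathlib
import tempfile

# The single canonical copy of the helper lives in data_utils.py ...
shared_source = "def normalize(x, lo, hi):\n    return (x - lo) / (hi - lo)\n"

with tempfile.TemporaryDirectory() as d:
    module_path = pathlib.Path(d) / "data_utils.py"
    module_path.write_text(shared_source)

    # ... and every script imports it rather than re-defining it.
    spec = importlib.util.spec_from_file_location("data_utils", module_path)
    data_utils = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(data_utils)

    print(data_utils.normalize(5, 0, 10))  # 0.5
```

In practice the dynamic loading above is unnecessary; a plain `from data_utils import normalize` against an installed package achieves the same thing with less ceremony.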
When is duplication okay in a data codebase?
Now that we have talked about cases where you should not allow duplication within your codebase, we will talk about situations where it is okay to allow for some duplication. Here are a few such situations.
- Functionality that looks similar at first glance, but is actually different. There are some cases where you might have code that initially looks similar, but does not actually have the same purpose. In these cases, you will often find yourself needing to add an excessive number of arguments to your function to account for differences in the requirements. If you find yourself adding an excessive number of arguments to the function, take a step back and reconsider whether the functionality required for the different use cases is actually the same.
- To avoid undesired coupling. When you share functionality across different parts of your codebase, it implicitly couples those parts of the codebase together. It sometimes makes sense to allow some duplication in order to avoid tightly coupling different parts of your codebase together.
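The first situation above often shows up as a function that accumulates mode flags. A hedged Python sketch (the functions and data are hypothetical) of what that looks like, and of the simpler alternative:

```python
# Warning sign: one function accumulating flags to serve different use
# cases is often two different functions in disguise.
def summarize(values, trimmed=False, trim_frac=0.2, weighted=False,
              weights=None):
    ...  # flag-driven branches suggest the use cases actually differ

# Clearer: two small functions, even if they share a little arithmetic.
def mean(values):
    return sum(values) / len(values)

def trimmed_mean(values, trim_frac=0.2):
    k = int(len(values) * trim_frac)
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

print(mean([1, 2, 3, 4, 100]))          # 22.0 -- pulled up by the outlier
print(trimmed_mean([1, 2, 3, 4, 100]))  # 3.0 -- extremes dropped first
```

The two small functions are easier to read and test than one function whose behavior depends on a growing set of flags, even though they duplicate a `sum(...) / len(...)` expression.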
Best practices for data teams
- Avoid knowledge silos
- Standardize your codebase
- Perform code reviews
- Use version control
- Write unit tests
- Make use of configuration files
Check out our article on data science best practices for all of our best recommendations on how to increase the efficacy of data science teams.