Are you wondering what parts of your data codebase should be standardized to ensure uniformity? Or maybe you are wondering how to enforce better standardization in your data codebase? Either way, you are in the right place! In this article, we tell you everything you need to understand standardization and the role it plays in a data science codebase.
We start out by clarifying exactly what we mean by standardization and laying out some of the pain points that might occur if a codebase is not standardized. Next, we provide examples of entities that should be standardized in a data codebase. Finally, we provide concrete examples of techniques you can use to achieve a more standardized codebase.
What is standardization?
What do we mean when we talk about standardization of a data science codebase? Broadly, we mean implementing a set of rules that everyone follows to promote uniformity across the codebase. These rules can be applied to a variety of entities, ranging from code that captures raw data to code that trains machine learning models.
A good heuristic to keep in mind when thinking about standardization is that if a codebase is sufficiently standardized then you should not be able to look at a piece of code and immediately know who wrote it. We will discuss the benefits of this in more detail later, but in general it is much easier for teammates to read and maintain each other’s code if every piece of code looks roughly the same.
Standardization within teams and standardization across teams
When talking about standardization of a codebase, it is important to mention that standardization can be applied at different levels. In some cases, parts of a codebase only need to be standardized within a given team. We will call this situation, where rules are implemented and followed by one specific team, standardization within teams.
In other cases, there may be rules that need to be followed by all data professionals working at a company. Since these rules might apply to many people that span across many teams, we will call this standardization across teams.
It is common to encounter situations where codebases are highly standardized within a team, but not across teams. This causes problems when data or code from different teams needs to be combined and used together. Whenever feasible, you should strive to enforce standardization across teams in addition to standardization within teams.
Why is standardization important for data teams?
Why is standardization important for data teams? Here are just a few ways that standardization across the data ecosystem can help to alleviate common pain points.
- Easier to understand and maintain codebase. The more standardization there is across your codebase, the easier it will be for collaborators to understand and maintain your projects. If the general structure and location of artifacts is similar across many projects, it will make it easier for people who are looking for one specific thing to find what they are looking for. If naming conventions are followed and similar entities are named in similar ways, it will be easier to understand what a given entity refers to. Ease of maintenance is particularly important for teams that maintain production code and code that runs repeatedly at a regular interval.
- Increased discoverability. Another benefit of having a standardized data ecosystem is that it is much easier to discover new sources of data or pieces of code that are relevant to your work. If a specific entity is referred to using one standardized term across all databases and code, then it is much easier to find references to that entity using standard search functionality. If multiple different terms are used to refer to a given entity, you risk missing the reference you are looking for if you forget to search for one of the many terms.
- Fewer questions from coworkers and adjacent teams. If consistent naming conventions are followed for code, data, and data products then there will not be as much confusion about what specific entities refer to. This means that there will be fewer questions from stakeholders, adjacent teams, and even collaborators on the same team.
- Fewer bugs. Along the same lines, if there is less confusion about what certain entities represent, there will be fewer bugs and inconsistencies in the codebase. Bugs and inconsistencies often arise as a result of someone misunderstanding a piece of code or data.
- Write less code. If your codebase is more standardized then you will save yourself from having to write unnecessary code. For example, if you ensure that IDs have the same name and type across all tables then you will not have to clean them up before they can be used to join tables.
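To make the last point concrete, here is a minimal sketch in plain Python. The table contents, the column names (`CustomerID`, `customer_id`), and the helper `join_on` are all hypothetical and exist only for illustration. When one table stores the ID as a string under one name and another stores it as an integer under a different name, a cleanup step is needed before every join. When the ID is standardized, the join works directly.

```python
# Hypothetical example data: two "tables" represented as lists of dicts.
orders_messy = [
    {"CustomerID": "001", "total": 20.0},
    {"CustomerID": "002", "total": 35.5},
]
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]


def join_on(left, right, key):
    """Inner-join two lists of dicts on a key with the same name and type."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Without standardization: rename and cast the ID before every join.
orders_cleaned = [
    {"customer_id": int(row["CustomerID"]), "total": row["total"]}
    for row in orders_messy
]
joined_after_cleanup = join_on(orders_cleaned, customers, "customer_id")

# With standardization: the ID already has the same name and type everywhere,
# so no cleanup code is needed.
orders_standard = [
    {"customer_id": 1, "total": 20.0},
    {"customer_id": 2, "total": 35.5},
]
joined_directly = join_on(orders_standard, customers, "customer_id")

print(joined_after_cleanup == joined_directly)
```

The cleanup step in the middle is exactly the kind of code that disappears once IDs share a name and type across tables.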
What parts of a data codebase should be standardized?
What are some examples of entities that should be standardized in a data codebase? Here are some types of entities that should be standardized.
- Data. The first part of the data codebase that should be standardized is the underlying data that is used. This includes all datasets that are used by the data team ranging from the raw data that is collected from the source to cleaned datasets that are built on top of raw data.
- Code. The next part of the codebase that should be standardized is the code itself. This includes any code that is written to collect, transform, and analyze data.
- User-facing products. The final part of the codebase that should be standardized is user-facing products, which we will also refer to as data products. This category includes everything from documentation that is written to explain data to other data professionals to dashboards that display the results of an analysis to business stakeholders.
Standardization of data
- Naming conventions for tables. There should be a standard set of naming conventions that is followed across all tables. These conventions should serve as guidelines for people who are creating new tables. For example, some companies distinguish between raw data and cleaned data by adding a prefix or suffix like “raw” to the table name. In these cases, it is important to ensure that everyone at the company follows the same convention. Otherwise, someone may see a table that contains raw data but is not named appropriately and assume that the data has already been cleaned.
- Modeling paradigms. If you are going to use common data modeling paradigms to structure your raw or cleaned data, you should ensure that the same modeling paradigm is used across the codebase. For example, if many of the tables at your company follow the dimensional modeling paradigm, it is best to codify that and recommend that dimensional modeling is used wherever it is appropriate. This will make it easier for end users to understand how different datasets relate to each other.
- Variable names. There should be a standard set of naming conventions that is followed when new variables are created and named. These conventions should be used to ensure that variable names are formatted in the same way. For example, you do not want to have some variables that contain spaces in their names and others that contain underscores in place of spaces. Beyond that, variables that are going to be used across different tables, such as unique IDs that are used to join tables together, should have the same name in all of the different tables they appear in.
- Variable types. There should also be agreed-upon conventions for variable types. This is especially true for variables that appear across different tables such as IDs that are used to join tables. You should not have shared IDs that are numeric types in some tables and string types in others. Entities such as zip codes and prices should also have standard representations across the codebase. For example, zip codes are usually safest stored as strings so that leading zeros are preserved.
- Variable values. In many cases, it also makes sense to have a standard set of rules surrounding the values of certain variables. This often takes the form of a standard list of values that a given variable can take on. For example, if you have a table that contains information about actions a user completed on a website, you should make sure that similar actions that take place on different pages have similar names. This will make it easier to pull all actions of a given type at one time.
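Several of the data conventions above can be checked automatically. The sketch below is one possible way to do that in plain Python. Every specific in it is an assumption made for illustration: the `raw_` prefix, the snake_case rule for column names, the integer `user_id`, the list of allowed actions, and the function name `check_table`. Your own conventions will likely differ.

```python
import re

# Hypothetical conventions, chosen only for illustration.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
RAW_TABLE_PREFIX = "raw_"
ALLOWED_ACTIONS = {"page_view", "button_click", "form_submit"}
SHARED_ID_TYPES = {"user_id": int}


def check_table(table_name, rows, is_raw):
    """Return a list of human-readable convention violations for one table."""
    problems = []
    # Table naming: raw tables carry the prefix, cleaned tables do not.
    if is_raw != table_name.startswith(RAW_TABLE_PREFIX):
        problems.append(
            f"table '{table_name}': '{RAW_TABLE_PREFIX}' prefix does not match raw status"
        )
    for row in rows:
        for column, value in row.items():
            # Variable names: no spaces or mixed case.
            if not SNAKE_CASE.match(column):
                problems.append(f"column '{column}': not snake_case")
            # Variable types: shared IDs have one agreed type.
            expected_type = SHARED_ID_TYPES.get(column)
            if expected_type is not None and not isinstance(value, expected_type):
                problems.append(f"column '{column}': expected {expected_type.__name__}")
            # Variable values: actions come from a standard list.
            if column == "action" and value not in ALLOWED_ACTIONS:
                problems.append(f"action '{value}': not in the allowed list")
    return problems


clean_rows = [{"user_id": 1, "action": "page_view"}]
messy_rows = [{"User ID": "1", "action": "Clicked Button"}]

print(check_table("raw_events", clean_rows, is_raw=True))
for problem in check_table("events", messy_rows, is_raw=True):
    print(problem)
```

A check like this could run as part of a test suite or a scheduled job, so that deviations are reported soon after a table is created.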
Standardization of code
- Directory structure. The first part of the codebase that should be standardized is the directory structure that is used for similar types of projects. For example, it might make sense for all machine learning projects that are completed to use a similar directory structure. This makes it easier for collaborators to navigate the code for a project and find what they are looking for.
- Imports and dependencies. No matter where dependencies are referenced in your codebase, it is best to have standard guidelines around how dependencies should be grouped and ordered. This makes it easier to scan through a list of dependencies and determine whether a project is dependent on a given package or library.
- Comments. It is also beneficial to have standard guidelines around how and where comments should be incorporated into code. This makes it easier to skim the code and identify any comments that contain important information.
- Naming conventions. Much like how naming conventions can help to make data more discoverable, they can also help to make code more discoverable. It is much easier to use search functionality when you know what terms you should be searching for. Standardization and guidance around naming can help point you in the right direction.
- Style (white space, capitalization, case, etc.). Finally, it is best to stick to a standard set of style guidelines when formatting code. Cohesive codebases that follow the same style guidelines are generally easier to read and maintain. When possible, it is best to stick to the standard style guidelines that are recommended for the language you are using.
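As an illustration of what such guidelines can look like in practice, here is a small Python module that follows conventions recommended by PEP 8, the standard style guide for Python: imports grouped by origin, constants in upper case, classes in CapWords, and functions in snake_case. The module itself (`OrderSummary`, `summarize_orders`, `MINIMUM_ORDER_COUNT`) is hypothetical and only serves to show the conventions side by side.

```python
"""Compute summary statistics for order data (hypothetical module)."""

# Standard library imports form the first group.
import statistics
from dataclasses import dataclass

# Third-party imports would form a second group and local project imports
# a third, each separated by a blank line. None are needed for this sketch.

# Constants use UPPER_CASE_WITH_UNDERSCORES.
MINIMUM_ORDER_COUNT = 2


@dataclass
class OrderSummary:
    """Class names use CapWords."""

    order_count: int
    mean_total: float


def summarize_orders(order_totals):
    """Return an OrderSummary, or None if there are too few orders.

    Function and variable names use snake_case.
    """
    if len(order_totals) < MINIMUM_ORDER_COUNT:
        return None
    return OrderSummary(len(order_totals), statistics.mean(order_totals))


print(summarize_orders([10.0, 20.0]))
```

The specific rules matter less than the fact that every module in the codebase follows the same ones.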
Standardization of data products
- Naming and terminology. It is important to use consistent terminology and naming conventions when you reference similar concepts across different data products. This will help to prevent confusion and reduce the number of questions you get from stakeholders. For example, if a given metric or type of metric is displayed in multiple dashboards, you should refer to that metric by a similar name across those different dashboards.
- Methodology used for calculating metrics. You should also make sure that you calculate similar metrics using similar methodologies whenever possible. For example, if you have many different reports where you calculate the average revenue that is generated by different projects, you should always calculate the average using the same methodology. You should avoid situations where some products use medians and others use means to display similar types of data. This can cause confusion if the numbers do not align.
- Population used to calculate metrics. If there are any segments of your data that you typically exclude when you calculate key metrics, you should make sure that these exclusions are applied uniformly across different products. This will help to avoid confusion in situations where metrics look different for different populations.
- Time windows used to calculate metrics. Along the same lines, you should do your best to ensure that similar time windows are used to calculate metrics across different data products. This will make it easier to compare metrics that are pulled from different data products.
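- Charts and visual presentation of data. In general, it will be easier for stakeholders to digest the information that you show them if it is formatted in a way they are used to. Where possible, use the same chart types, color schemes, and axis conventions for similar data across products.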
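One way to keep methodology, population, and time windows aligned is to define each metric once and have every report call that shared definition. The sketch below assumes a hypothetical 28-day reporting window, a hypothetical excluded segment called `internal_test`, and a hypothetical function `average_project_revenue`. None of these come from a real system.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical shared definitions. Every report imports these instead of
# re-implementing them.
REPORTING_WINDOW_DAYS = 28
EXCLUDED_SEGMENTS = {"internal_test"}


def average_project_revenue(records, as_of):
    """Mean revenue over the standard window, with standard exclusions.

    Each record is a dict with 'segment', 'day' (a date), and 'revenue'.
    Returns None when no records are in scope.
    """
    window_start = as_of - timedelta(days=REPORTING_WINDOW_DAYS)
    in_scope = [
        record["revenue"]
        for record in records
        if record["segment"] not in EXCLUDED_SEGMENTS
        and window_start < record["day"] <= as_of
    ]
    return mean(in_scope) if in_scope else None


records = [
    {"segment": "enterprise", "day": date(2024, 1, 20), "revenue": 100.0},
    {"segment": "internal_test", "day": date(2024, 1, 21), "revenue": 9999.0},
    {"segment": "smb", "day": date(2023, 11, 1), "revenue": 50.0},
    {"segment": "smb", "day": date(2024, 1, 25), "revenue": 200.0},
]
print(average_project_revenue(records, as_of=date(2024, 1, 31)))
```

Because the mean, the exclusions, and the window all live in one place, two dashboards that call this function cannot silently disagree on how the metric is computed.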
How to encourage standardization in your codebase?
So how do you encourage standardization in your data codebase? Here are some strategies that can be employed to encourage standardization in a data codebase.
- Encourage code linting. One way to improve standardization in your code is to encourage everyone to run the code they write through code linters. Linters are tools that scan code for likely bugs and stylistic inconsistencies. Depending on the tool, they can automatically fix or reformat code, or raise alerts when inconsistencies are detected. Linting can be more strictly enforced by making it a mandatory step that must pass before code can be checked into a shared repository.
- Automatically create skeletons for project directories. If you want to ensure that the directory structure that is used is standardized across projects, you can create utilities that automatically set up a directory skeleton for a new project. This can be used to ensure that whenever someone starts working on a new project, they are starting from the same standardized template.
- Assign ownership. In cases where it is not straightforward to automate processes that enforce standardization, it can be very useful to assign owners who are responsible for ensuring standardization across the codebase. For example, if you have event data that represents actions that users took on a website, you can assign one team or person to make sure that similar actions are represented using similar values.
- Well-documented guidelines. In addition to assigning ownership, it is important to have the guidelines that the team should follow well documented in an easily discoverable location. If you want people to follow the guidelines, it is important that everyone knows exactly where they need to go to look up the guidelines when they have questions.
- Encourage code reviews. Code reviews are another great way to encourage uniformity across a codebase. Experienced reviewers who are familiar with the codebase will be able to identify deviations from the standard guidelines. If you want to use code reviews for this specific purpose, you should make sure to tell reviewers they should be on the lookout for inconsistencies and deviations from standard guidelines.
- Focused onboarding materials. Finally, it is important to include documentation about the guidelines that are followed in the set of onboarding materials that is presented to new hires. New hires who are not familiar with the existing codebase are the ones who are most likely to introduce changes that break from the standard guidelines.
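The directory skeleton strategy above is straightforward to automate. Here is a minimal sketch using Python's standard `pathlib` module. The layout itself (`data/raw`, `data/processed`, `notebooks`, `src`, `tests`, plus two placeholder files) and the function name `create_project_skeleton` are assumptions for illustration, not a recommended standard.

```python
import tempfile
from pathlib import Path

# Hypothetical standard layout for a machine learning project.
SKELETON_DIRS = ["data/raw", "data/processed", "notebooks", "src", "tests"]
SKELETON_FILES = ["README.md", "requirements.txt"]


def create_project_skeleton(root):
    """Create the standard directories and empty placeholder files under root."""
    root = Path(root)
    for relative_dir in SKELETON_DIRS:
        # parents=True also creates the project root if it does not exist yet.
        (root / relative_dir).mkdir(parents=True, exist_ok=True)
    for relative_file in SKELETON_FILES:
        (root / relative_file).touch(exist_ok=True)
    return root


# Usage: build a skeleton in a temporary directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    project_root = create_project_skeleton(Path(tmp) / "churn_model")
    for path in sorted(project_root.rglob("*")):
        print(path.relative_to(project_root))
```

Because the function uses `exist_ok=True`, running it again on an existing project is safe: it only fills in whatever is missing.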