Are you wondering whether you should use unit tests in your data science codebase? Or maybe you want to hear more about best practices for incorporating unit tests into a data science codebase? Either way, you are in the right place! In this article, we cover the essentials of unit tests for data science.
We start out by talking about what unit tests are and why they are important for data science projects. After that, we provide more details on situations where it is particularly important to incorporate unit tests into your codebase. Finally, we discuss some best practices for incorporating unit tests into your data science codebase.
What are unit tests?
Before we talk about what unit tests are, we will explain more generally what tests are and what role they play in a codebase. Generally speaking, tests are functions or pieces of code that exercise the main body of code you have written and ensure that it functions as expected.
Unit tests are a specific kind of test that is narrowly scoped to examine one specific piece of functionality. That means that unit tests are generally used to examine how one specific function performs in one specific scenario. For example, if you have a function that is supposed to take in an integer and add one to that number, you might write a unit test that passes the number 5 as an input to the function and checks that the function returns 6.
There are a few helpful paradigms that can be used to help understand what should happen within a unit test. One common paradigm is the arrange, act, assert paradigm. This paradigm specifies that you should first arrange the environment where the code will be tested. After that, you should act by applying the code to the environment. Finally, you should assert whether you achieved the outcome you were expecting.
Looping back to our previous example where we had a function that adds one to a number, you might first arrange the environment by defining a variable with the value 5 that will be passed into the function. You would then act by applying the function to the variable you defined. Finally, you would assert whether the outcome produced by the final step was 6 as expected.
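As a minimal sketch of the arrange, act, assert paradigm applied to this example, the test might look like the following. The function name `add_one` and the test name are illustrative choices, not names taken from any particular codebase.

```python
def add_one(number: int) -> int:
    """Return the input integer plus one."""
    return number + 1


def test_add_one_returns_six_for_five() -> None:
    # Arrange: define the input that will be passed to the function.
    value = 5

    # Act: apply the function to the input.
    result = add_one(value)

    # Assert: check that the outcome matches the expectation.
    assert result == 6
```

If you use a test runner such as pytest, saving this in a file whose name starts with `test_` and running `pytest` should be enough for the test to be collected and run.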
Why use unit tests for data science projects?
Why should you use unit tests for your data science projects? Here are a few reasons why it is important to incorporate unit tests into your data science codebase.
- Find bugs faster. The most obvious benefit that unit tests bring is that they make it easier to find bugs in your code. Specifically, they make it easier to determine exactly where a specific bug is coming from. If each test examines how one specific piece of code performs in one specific environment, all you have to do is look at which test is failing to understand where a bug is coming from.
- Develop faster. There are multiple reasons that unit tests can help you develop code faster and more effectively. For starters, manually testing your code can be complex and time consuming. Unit tests allow you to bypass this process and test your code quickly on small, isolated pieces of toy data.
- Iterate with confidence. Unit tests also make it easier to iterate on code you have already written with confidence. If you are refactoring code that should ultimately maintain the same functionality, you can just run your existing unit tests as you develop to ensure you are not breaking the existing functionality. If you are making a slight modification to the functionality in your code, you can just update your unit test to reflect the functionality you are expecting before you start developing.
- Get others up to speed faster. Unit tests also make it easier for others who are new to your codebase to get up to speed. This is because they can look at your test files to understand exactly what the expected functionality is for each piece of code before they dive into the details of the code.
When to use unit tests for data science projects
When should you incorporate unit tests into your data science codebase? Here are some situations where it is particularly important to incorporate unit tests into your codebase.
- Building out production code. Whenever you write code that is going to be used in production, you should make sure to unit test this code. If your code is going to be used in production, it is important that you have a way to spot bugs early before your code is released. This will improve the level of trust that stakeholders have in your code.
- Refactoring existing code. It is also particularly important to write unit tests when you are refactoring existing code. For example, if you are updating a query to improve the performance and run time of the query, you should include some type of test to ensure that the output of your original code is the same as the output of the updated code.
- Many people touch your codebase. Even if your code is not used directly in production, it is particularly useful to write unit tests on code that is going to be touched by many people who are not necessarily familiar with the codebase. These tests help new people get up to speed faster.
- Jobs that take a long time to run. If you are building out data pipelines or running jobs that take a long time to run, such as a job that trains a complex deep learning model, it is particularly important to unit test your code. This helps to ensure that you do not run into a situation where you kick off a job and wait eight hours just to find that it failed due to a bug in your post-processing code.
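For the refactoring case above, one possible approach is a test that runs the original and the refactored implementation on the same small toy dataset and checks that the outputs agree. The sketch below uses plain Python functions rather than a query, and both function names (`total_by_group_original`, `total_by_group_refactored`) are hypothetical stand-ins for your own code.

```python
from collections import defaultdict


def total_by_group_original(rows):
    """Hypothetical original version: scans every row once per distinct group."""
    totals = {}
    for group in {row["group"] for row in rows}:
        totals[group] = sum(row["value"] for row in rows if row["group"] == group)
    return totals


def total_by_group_refactored(rows):
    """Hypothetical refactored version: a single pass over the rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["group"]] += row["value"]
    return dict(totals)


def test_refactored_totals_match_original():
    # Arrange: a small toy dataset that runs in a fraction of a second.
    rows = [
        {"group": "a", "value": 1},
        {"group": "b", "value": 10},
        {"group": "a", "value": 2},
    ]

    # Act: run both implementations on the same input.
    expected = total_by_group_original(rows)
    actual = total_by_group_refactored(rows)

    # Assert: the refactor must not change the output.
    assert actual == expected
```

Because the test runs on a few rows of toy data, it can also be run before kicking off a long job to catch bugs in code the job depends on.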
Best practices for unit tests
What are some best practices for unit tests in data science codebases? Here are just a few examples of best practices for unit tests.
- Tests should be deterministic. When you are writing unit tests, each test should be deterministic. This means that you should expect to see the same result if you run the test over and over again within the same environment. This is a particularly important rule in data science codebases, which are full of stochasticity. If the code under test relies on random number generation, fix the random seed inside the test.
- Tests should be fast. You should ensure that the tests that you write are fast and can be completed in a relatively short amount of time. This is a particularly important rule to keep in mind when working with data science codebases that often involve jobs that take hours, or even days, to run.
- Consider common data quality issues. You should consider common data quality issues that you regularly encounter when you are deciding what scenarios to test your code in. For example, it may be important to look out for things like missing values and duplicated values that should be handled within your code.
- Test one thing at a time. While there are many different types of scenarios and data quality issues you should look out for in your unit test, you should only test one thing per test. This means that if you need to test how your code handles missing values and how your code handles duplicate records, you should include two separate tests that examine these concerns independently.
- Do not test code from external libraries. You generally do not need to test code that comes from external libraries that have their own unit tests built into them. For example, if you are importing a machine learning model from a common library, you do not need to test that the machine learning model performs as expected. Instead, you should focus on testing code that you wrote yourself such as pre-processing and post-processing code.
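A few of the practices above can be sketched together in a short example: a fixed seed keeps a test on stochastic code deterministic, and missing values and duplicates each get their own test. The helper functions here (`sample_rows`, `drop_missing`, `drop_duplicates`) are hypothetical examples of pre-processing code you might have written yourself, and they use only the Python standard library.

```python
import random


def sample_rows(rows, n, seed):
    """Return n rows sampled without replacement using a seeded generator."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def drop_missing(values):
    """Remove None entries, keeping the original order."""
    return [v for v in values if v is not None]


def drop_duplicates(values):
    """Remove repeated entries, keeping the first occurrence of each."""
    return list(dict.fromkeys(values))


def test_sample_rows_is_repeatable_for_fixed_seed():
    # Deterministic: the same seed must give the same sample on every run.
    rows = list(range(100))
    assert sample_rows(rows, 5, seed=0) == sample_rows(rows, 5, seed=0)


def test_drop_missing_removes_none():
    # One concern per test: this test covers missing values only.
    assert drop_missing([1, None, 2]) == [1, 2]


def test_drop_duplicates_keeps_first_occurrence():
    # One concern per test: this test covers duplicated values only.
    assert drop_duplicates([3, 1, 3, 2, 1]) == [3, 1, 2]
```

Keeping the missing-value and duplicate checks in separate tests means that a failure points directly at the behavior that broke.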
Best practices for data science teams
- Avoid knowledge silos
- Standardize your codebase
- Perform code reviews
- Use version control
- Avoid duplication
- Make use of configuration files
Check out our article on data science best practices for all of our best recommendations on how to increase the efficacy of data science teams.