Now that we have set up our environment, we can dive into the data by performing some basic data quality checks. This post focuses more on summarizing the types of data quality checks that should be implemented for simple tabular data than on writing the code to implement those checks. As always, we will put the code corresponding to these data quality checks up on our GitHub account so that you can see exactly how we checked our data quality.
This article was created as part of a larger case study on developing data science models. That being said, it is also a great standalone resource if you are looking for a gentle introduction to assessing data quality.
Why check data quality?
Why check the quality of your data? Even though this case study is focused more on building reliable, reproducible code than cleaning data and extracting insights, you should still spend some time checking the quality of the data you are using. Your model is only going to be as good as the data that you use to train it, so it is important to understand the quirks and limitations of the data.
Steps for checking data quality
Here are some basic steps you can go through to check the quality of your data. There may be additional steps you need to take depending on the type of data you are using, but these steps are a great place to start.
- Data types. The first thing you should look at after you confirm that your data has been read in properly is the data types associated with each column in your data. Look at the name or description of each column and make sure that the data type of the column seems appropriate. The most common issue that crops up with data types is numeric or timestamp columns that are formatted as strings.
- Missing data. Next you should look at the missing data and null values in your dataset. At bare minimum, you should count the number of null values in each column. If you are looking to get a little more depth, you can also look at whether the number of null values in any of the columns changes over time (assuming your data was collected over a range of dates). You can also check whether the values in one column are more likely to be null when another column takes on a certain value. We will talk more about null values and how to impute values for them in a later post.
- Categorical variable values. Next you should check the values of the categorical variables in your data. Get frequency counts for the number of times each category appears for each variable, then determine whether there are any categories that should be combined into one. Sometimes you will see that a variable has multiple categories that are very similar or even the same category with a misspelling. You may also decide to combine categories that appear very infrequently into one category.
- Continuous variable values. After you check the values of your categorical variables, it is time to check the values of your continuous variables. Look at the range of your variables to check whether there are any values that are obviously too high or too low, such as negative values in columns that should be strictly positive. You should also plot your data to see whether there are any outlying values that look anomalous. We will talk more about outliers and how to handle them in a later post. If there are values that are clearly incorrect, you can either correct those values if there is an obvious correction to make or remove the values and treat them as missing.
- Duplicated records. After you sanity check the values of your variables, it is time to look for duplicated records. This is easiest to do if you have a unique ID that represents each subject in your dataset. If there is no unique ID, then it may be difficult to tell the difference between unintentionally duplicated records and records that just have the same feature values. If there are obvious duplicates in your dataset, you should make sure that duplicates are not expected and remove them.
- Custom rule-based checks. After you check for duplicated records, you should go through your columns and perform custom rule-based checks where appropriate. This just means that you should look at each column, determine whether there is any standard format the values should be in, then check whether the values are in that format. For example, you might check that email addresses have an @ in them and that phone numbers are strictly numeric with the correct number of digits. You may find that you need to perform additional standardization such as removing dashes or parentheses from phone numbers that are formatted incorrectly.
- Look at temporal patterns (optional). This is a check that is important if you are using data that comes from a large range of dates. You should look at the values of your variables and see how they change over time. If you see a sudden jump in the average value of a continuous variable or a sudden drop-off in the frequency of a specific category then you should determine what caused the change. You might find, for example, that the label corresponding to a certain category changed names and you need to combine two different categories that represent the same underlying feature together.
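For simple tabular data, most of the checks above can be sketched in a few lines of pandas. This is only an illustration; the DataFrame and its column names below are hypothetical examples, not part of the case study data:

```python
import pandas as pd

# Hypothetical example data with a few quality problems baked in
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "signup_date": ["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03", None],
    "plan": ["basic", "Basic", "Basic", "premium", None],
    "age": [34, 28, 28, -1, 41],
    "email": ["a@example.com", "b@example.com", "b@example.com",
              "no-at-sign", "c@example.com"],
})

# 1. Data types: numeric or timestamp columns stored as strings show up as "object"
dtypes = df.dtypes

# 2. Missing data: null counts per column
null_counts = df.isna().sum()

# 3. Categorical values: frequency counts reveal near-duplicate categories
#    (here "basic" and "Basic" should probably be combined)
plan_counts = df["plan"].value_counts(dropna=False)

# 4. Continuous values: range checks for obviously impossible values
bad_ages = df[df["age"] < 0]

# 5. Duplicated records: rows that are exact copies of an earlier row
duplicates = df[df.duplicated()]

# 6. Custom rule-based checks: e.g. email addresses must contain an "@"
bad_emails = df[~df["email"].str.contains("@", na=False)]
```

Each of these returns a small summary you can eyeball; in a real analysis you would follow up on anything surprising (for example, by plotting the flagged continuous values or tracing where the duplicates came from).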
Data quality checks for our case study
Are you following along with our case study? In this section we will walk you through all of the steps you need to take to get set up to check the quality of your data. We will be using Jupyter notebooks as they are perfect for ad hoc exploratory analysis like this. We will walk you through all the steps required to get Jupyter notebooks set up to run in your Conda environment.
We will not include the code that we actually used to check our data quality in this post. We conducted a very minimalistic analysis in order to keep our code as simple and easy to follow as possible, but we encourage you to dive a little deeper into the data. That being said, if you do need to refer to the code we used to check the quality of our data, then you can find it here on our public GitHub repository.
1. Create a new branch
As always, the first step in completing this new phase of the case study will be to create a new branch in your git repository. You should give your branch a short descriptive name like data-quality.
```
git checkout main
git pull
git checkout -b data-quality
```
2. Create a location to store your data
Next you should download the data for this case study. You should create a directory within your main project directory and call it data. Within the data directory, you should create another directory called input. You can store all of the input data you use to train your model in this directory.
As a next step, you should add the data directory to your .gitignore file. You do not want to track any large data files in your git repository. All you have to do is open the .gitignore file and add an entry at the bottom that says /data/. The leading slash before the directory name indicates that the directory is in the same directory as the .gitignore file.
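For example, the new entry at the bottom of your .gitignore file would look like this:

```
# keep large data files out of version control
/data/
```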
After you update your .gitignore, you need to commit the file to solidify the changes. You can use the following commands to do this.
```
git add .gitignore
git commit -m 'update gitignore'
```
3. Download data
4. Create a location to store notebooks
After you download your data, set up a location to store your notebooks. We recommend creating a directory called notebooks in your main project directory. Your file structure should look something like this.
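Assuming a project root named after your case study (the exact name does not matter), the layout after these steps would be roughly:

```
case-study-one/
├── data/            # ignored by git
│   └── input/       # raw input data for training your model
├── notebooks/       # Jupyter notebooks for exploratory analysis
└── .gitignore
```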
5. Install ipykernel
After you download the data, you should use pip to install ipykernel in your conda environment. This will allow you to create a kernel that mirrors your conda environment, which you can then use to run notebooks. You should also add ipykernel to your conda yaml file so that it gets installed the next time you build your environment.
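With your conda environment activated, the install itself is one command:

```
pip install ipykernel
```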
After you install ipykernel, you will need to create a kernel that mirrors your conda environment. If your conda environment is named case-study-one, the command will look like this.
```
python -m ipykernel install --user --name case-study-one --display-name "Python (case-study-one)"
```
The name argument sets the internal name under which the kernel is registered. The kernel itself points at whichever Python runs the command, so make sure your conda environment is activated when you run it. The display name argument determines the display name that will appear when you search for your kernel in the Jupyter notebooks interface.
6. Create a notebook
Finally, it is time to create the notebook you will use to check the quality of your data. First you should open the Jupyter notebooks interface by typing the following command in your terminal.
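If you are using the classic notebook interface, the command is:

```
jupyter notebook
```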
This should open the Jupyter notebooks interface in a browser window. You should see a list of all of the files in your main project directory, along with a New button that allows you to create a new notebook. When you click the New button, you should see a list of kernels that you can use to run your notebook.
Before you create a notebook, click into the notebooks directory. Then click New and create a notebook using the kernel with the display name you entered in the previous step. We recommend giving your notebook a short and simple name like data-quality.
7. Check the quality of your data
Now you have a notebook that you can use to check the quality of your data. We recommend writing your own Python code to complete each of the data quality checks. If you need any hints as to how to perform a certain check, you can always refer back to the data quality notebook on our public GitHub repository.
8. Commit your changes
As always, you should finish up by committing your changes, pushing them to the remote repository, and merging your data-quality branch into main.