Are you wondering whether it is a good idea to use configuration files when building a data science project? Or maybe you are more interested in understanding what type of information you should store in configuration files? Either way, you are in the right place!
In this article, we tell you everything you need to know about configuration files and how they should be used in data science projects. We start out by talking about what a configuration file is and what file formats are used for configuration files. After that, we discuss what the main benefits of using configuration files are. Finally, we provide examples of information that would be well suited for a configuration file.
What is a configuration file?
A configuration file is a file that contains parameters that determine how a specific piece of code runs. We use the word parameters broadly here to include any options, settings, or logic that affect the way the code runs. A configuration file should be stored separately from the code itself, and it should not contain any code.
When the code runs, it reads the configuration file from its location and parses it. The parameters extracted from the file are then passed into the code to determine how it runs.
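The read, parse, and pass-in flow described above can be sketched in a few lines of Python. This is only a minimal illustration that assumes a JSON configuration file; the file name `sales_job.json`, the keys `min_amount` and `currency`, and the function `summarize_sales` are hypothetical names made up for the example.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Read a JSON configuration file and return its contents as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_sales(rows, min_amount, currency):
    """Hypothetical job: total the sales at or above a threshold."""
    total = sum(r["amount"] for r in rows if r["amount"] >= min_amount)
    return f"{total:.2f} {currency}"


# Write an example config to a temporary directory so the sketch is runnable.
# In a real project the file would already exist, stored apart from the code.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "sales_job.json"
config_path.write_text(json.dumps({"min_amount": 100, "currency": "USD"}))

# Read and parse the file, then pass the extracted parameters into the code.
config = load_config(config_path)
rows = [{"amount": 50}, {"amount": 150}, {"amount": 200}]
result = summarize_sales(rows, config["min_amount"], config["currency"])
print(result)
```

Notice that `summarize_sales` never mentions a specific threshold or currency. Changing either one means editing the configuration file, not the function.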
What file formats are used for configuration files?
Here are some file formats that are commonly used for configuration files. We focus on the formats that are most common in data science projects.
- YAML. YAML is commonly used for configuration files because it is easy for humans to read and understand. The format was designed with readability in mind, and configuration files are one of its most common use cases. That makes it a great option if you are creating configuration files that you want less technical users to be able to understand and modify.
- JSON. JSON is another file format that is commonly used for configuration files in data science projects. It is a more general format that is used for storing and transporting data across a wide variety of use cases and technologies. It may be the easier choice if you are working on an application that already uses JSON regularly.
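To make the comparison concrete, here is a rough sketch that writes the same made-up settings in both formats and parses them in Python. The setting names are hypothetical. The JSON half uses only the standard library, while the YAML half assumes the third-party PyYAML package is installed and is skipped if it is not.

```python
import json

# The same hypothetical settings written in each format.
YAML_TEXT = """\
model:
  name: random_forest
  n_estimators: 200
features:
  - age
  - income
"""

JSON_TEXT = """\
{
  "model": {"name": "random_forest", "n_estimators": 200},
  "features": ["age", "income"]
}
"""

# JSON support ships with Python.
json_config = json.loads(JSON_TEXT)

# YAML support is an assumption: it needs the third-party PyYAML package.
try:
    import yaml
except ImportError:
    yaml_config = None  # PyYAML is not available, so skip the YAML half
else:
    yaml_config = yaml.safe_load(YAML_TEXT)

print(json_config["model"]["n_estimators"])
if yaml_config is not None:
    # Both formats describe the same nested structure.
    print(yaml_config == json_config)
```

The YAML version has no braces or quotes, which is a large part of why it tends to be easier for less technical readers to edit by hand.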
Why use configuration files for data science projects?
Here are some of the main benefits of using configuration files for data science projects.
- Make it easier for newbies to make changes. One benefit of using configuration files is that they make it easier for new team members who are still onboarding, and for collaborators who work primarily on a different codebase, to make changes to your codebase. This is because the parameters they need to change live in a separate file, so the complexity of the code itself can stay untouched. Depending on where your configuration files are stored and how easy it is to change them, this might even make it possible for non-technical stakeholders to modify the files themselves (if this is something that is desired).
- Reduce the number of bugs introduced into the codebase. One of the main benefits of using configuration files is that they allow you to isolate fast-changing business logic from slow-changing components of the code. Every time you touch a piece of code, there is a chance that you will accidentally introduce a bug into the system. If the fast-changing business logic lives in a configuration file, you will not have to touch the code as frequently, which means fewer opportunities to introduce bugs.
- Toggle between multiple configurations. Another benefit of using configuration files is that it makes it easier to toggle between different configurations of the same system. If you have all of the parameters you are using embedded into your code, then there can only be one version of the parameters implemented at a time. If you have the parameters separated out into a configuration file, then you can store different combinations of parameters in different configuration files and read in different configuration files when you need to perform different tasks.
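The idea of toggling between configurations can be sketched as follows. This is one possible arrangement, not a prescribed one: it assumes one JSON file per configuration in a shared directory, and the names `quick_test`, `full_run`, `sample_fraction`, `n_trials`, and `run_experiment` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_named_config(config_dir, name):
    """Load the configuration file called <name>.json from config_dir."""
    path = Path(config_dir) / f"{name}.json"
    if not path.is_file():
        raise FileNotFoundError(f"No configuration named {name!r} in {config_dir}")
    return json.loads(path.read_text(encoding="utf-8"))


def run_experiment(config):
    """Hypothetical task whose behavior depends entirely on the config."""
    return f"{config['n_trials']} trials on {config['sample_fraction']:.0%} of the data"


# Create two example configurations so the sketch is runnable.
config_dir = Path(tempfile.mkdtemp())
(config_dir / "quick_test.json").write_text(
    json.dumps({"sample_fraction": 0.01, "n_trials": 2})
)
(config_dir / "full_run.json").write_text(
    json.dumps({"sample_fraction": 1.0, "n_trials": 50})
)

# The same code performs different tasks depending on which file is read.
quick = run_experiment(load_named_config(config_dir, "quick_test"))
full = run_experiment(load_named_config(config_dir, "full_run"))
print(quick)
print(full)
```

In practice the configuration name would often come from a command line argument or an environment variable, so switching tasks does not require any code change.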
What information to store in a configuration file
What types of information should you put in a configuration file when you are building out a data science project? The exact types of information will vary depending on what type of project you are working on. Here are a few examples of information that would be well suited for a configuration file.
- Business logic. One example of information that you can store in a configuration file is rapidly changing business logic. For example, if you have a job that filters to a list of users who fit certain criteria, then you can extract the logic that defines those criteria into a configuration file so that you can easily modify the set of users that gets pulled.
- Model parameters. Configuration files are also useful if you are training a machine learning model or a set of machine learning models. If you store the model parameters, or the ranges of model parameters that you want to train models with, in a configuration file, then it will be easy to modify the set of parameters that are used in model training without touching the code.
- Simulation parameters. Similarly, if you are running a large-scale simulation with multiple parameters that can be modified to produce different simulation output, then you would also benefit from using a configuration file. This way you can have all of the parameters of your simulation collected into one source that is easy to modify.
- Resource configurations. If you have to configure the compute resources for the machines that run your large jobs, for example how much memory or how many workers a job gets, this is another great use case for configuration files.
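As a rough illustration of the business logic and model parameter examples, the sketch below keeps user filter criteria and parameter ranges in one JSON configuration and lets the code consume them. Every key, field, and function name here (`user_filter`, `model_param_grid`, `select_users`, `expand_grid`) is hypothetical, and the configuration is inlined as a string only to keep the example self-contained.

```python
import itertools
import json

# Hypothetical configuration; in a real project this would be its own file.
CONFIG_TEXT = """
{
  "user_filter": {"min_purchases": 3, "countries": ["US", "CA"]},
  "model_param_grid": {"max_depth": [3, 5], "learning_rate": [0.1, 0.01]}
}
"""
config = json.loads(CONFIG_TEXT)


def select_users(users, user_filter):
    """Return the ids of users who match the criteria from the config."""
    allowed = set(user_filter["countries"])
    return [
        u["id"]
        for u in users
        if u["purchases"] >= user_filter["min_purchases"] and u["country"] in allowed
    ]


def expand_grid(param_grid):
    """Turn ranges of parameters into every combination to train with."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


users = [
    {"id": 1, "purchases": 5, "country": "US"},
    {"id": 2, "purchases": 1, "country": "US"},
    {"id": 3, "purchases": 4, "country": "FR"},
    {"id": 4, "purchases": 3, "country": "CA"},
]

selected = select_users(users, config["user_filter"])
combos = expand_grid(config["model_param_grid"])
print(selected)
print(len(combos))
```

Adding a country to the filter or another value to a parameter range is a one-line change to the configuration, and neither `select_users` nor `expand_grid` needs to be touched.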