Purpose of this case study
In this case study we will focus on the engineering skills that go into building and deploying machine learning models using code that is stable and reproducible. It is well suited to data professionals who are familiar with data cleaning and basic model building but do not have much experience writing production-ready code and deploying machine learning models.
Here are some of the topics that we will cover in this case study. We will also review basic concepts such as how to check data quality and how to build a machine learning model, but more emphasis will be placed on these engineering topics.
- Version control with Git and GitHub.
- Package and environment management with Conda.
- Object-oriented programming.
- Artifact and parameter tracking with MLflow.
- Developing packages with reusable components.
- Unit tests.
- Docker containers.
- Flask APIs.
More about this case study
In this case study we will work with a simple tabular dataset. This means that all of the variables we choose to use will have straightforward numeric or categorical values. There will be no text data, image data, or other unstructured data used in this case study.
Why are we sticking with simple tabular data? Because the goal of this case study is to teach you about the process of creating a data science model using code that is reliable, reproducible, and easy to iterate on. By using a simple dataset, we will free up more time and headspace to focus on the main objectives of this case study.
In this case study, we will use Python as our main language. Python is a great general-purpose programming language that has a variety of tools built for putting data science models into production. We will track all of the code that is used in this case study in our public GitHub repository.
Dataset for this case study
For the purpose of this case study, we will be using the Bank Marketing Data Set from the UCI Machine Learning Repository. Specifically, we will be using the bank-additional-full.csv file that contains the complete dataset. This is a relatively clean dataset with a clear outcome variable, so it is a great fit for a case study that focuses more on engineering and building reliable, reproducible code than on cleaning data and extracting insights.
This data comes from marketing phone calls that were made by a Portuguese bank. The outcome variable indicates whether or not the person who received the phone call subscribed to a term deposit with the bank. Some other variables in the dataset include the age, job, education level, and marital status of the person who received the phone call.
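As a first look at the data, the following is a minimal sketch of how the file could be read and the outcome variable tallied. It assumes that the CSV is semicolon-delimited and that the outcome column is named "y" with the values "yes" and "no"; check both against your downloaded copy. The function names read_rows and outcome_counts are hypothetical helpers, and the sample rows are made up so that the sketch runs without the real file. It uses only the Python standard library to stay dependency-free; a dataframe library would be an equally reasonable choice.

```python
import csv
import io
from collections import Counter


def read_rows(source):
    """Parse semicolon-delimited text into a list of dicts, one per row.

    `source` is any open text file or file-like object.
    Assumption: the file uses ";" as its delimiter.
    """
    return list(csv.DictReader(source, delimiter=";"))


def outcome_counts(rows, outcome_col="y"):
    """Count how often each outcome value appears.

    Assumption: the outcome column is named "y".
    """
    return Counter(row[outcome_col] for row in rows)


# Made-up rows that mimic the assumed layout of the real file.
sample = io.StringIO(
    "age;job;marital;education;y\n"
    "56;housemaid;married;basic.4y;no\n"
    "37;services;married;high.school;yes\n"
    "40;admin.;married;basic.6y;no\n"
)

rows = read_rows(sample)
counts = outcome_counts(rows)
print(len(rows), "rows")
print(dict(counts))
```

To run the same check on the real data, open bank-additional-full.csv with open(path, newline="") and pass the file object to read_rows in place of the in-memory sample. Looking at the balance between the outcome values early on is a cheap data quality check before any modeling.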