Tracking data with MLflow


Are you looking for an introduction to using MLflow to track the inputs and outputs associated with your scripts? Well then you are in the right place! At the beginning of this article we discuss what MLflow is, why it is useful, and how to use it to track data associated with your scripts. After that, we provide an example of how to train a simple model in Python and use MLflow to track all of the data associated with the model training script.

This article was written as part of a larger case study on developing production ready data science models. That being said, it also serves as a great standalone resource for anyone who wants to learn how to use MLflow tracking. 

What is MLflow?

What is MLflow? MLflow is an open source framework that makes it easy to track all of the data you could possibly want associated with your scripts and model training runs. MLflow has a simple interface and requires minimal code changes, so you can spend less time focusing on tracking and more time focusing on the body of your scripts. 

MLflow also has other functionality beyond its basic tracking functionality, but for the purpose of this article we will focus on the tracking functionality. We recommend that anyone who is just getting started with MLflow start out by looking into its general tracking functionality. This is the most basic and generally applicable functionality that MLflow offers.

MLflow also has model tracking capabilities that help you keep track of your models and all of the metadata associated with them. We will not be discussing those capabilities in this article, but if you want to learn more then check out our article on MLflow model tracking.

When is MLflow tracking useful?

When is MLflow useful? Here are some scenarios when it might be useful to incorporate MLflow into your project. 

  • When you are running the same script with different parameters. Do you find that you are often re-running a script with different inputs? These inputs can be anything from the date range from which some data was pulled to the number of trees in a random forest model. If you frequently re-run scripts with different parameters then MLflow tracking would almost certainly be useful for you because it would help you keep track of which inputs were associated with which outputs.
  • When you are re-running the same script regularly. Do you have scripts that are automatically run at a regular cadence? Then you should look into using MLflow tracking to monitor the outputs of your scripts. This will make it easier to identify issues in your scripts before they affect downstream processes.
  • When there are frequent changes in your underlying data. Do you find that the underlying data your scripts use regularly changes between runs? Then you should use MLflow tracking to keep track of the exact dataset that was used when your script was originally run.

Why use MLflow tracking?

Why should you use MLflow for tracking the inputs and outputs of your scripts? Here are some of the reasons that you should use MLflow. 

  • Track inputs and outputs with minimal code changes. The first reason to use MLflow's tracking functionality is that it makes it easy to figure out which runs of your script were associated with which inputs and outputs. This way, if someone comes back weeks later and asks how you produced a specific output, you have an easy way to go back and see exactly what inputs were used to produce that output.
  • Monitor performance metrics with ease. Do you have processes that get re-run on a regular cadence? If you have MLflow tracking set up then you can easily monitor how metrics and outputs associated with your scripts change over time. This can include anything from the amount of time it takes to run your script to the accuracy metric you get after retraining a model. 
  • Easily reproduce previous runs. Using MLflow makes it easy to go back and reproduce previous runs of your scripts. You can use MLflow to track the exact data that you used when you ran your script so that the exact dataset is available and ready for use the next time you want to run your script.

Basic MLflow concepts

What basic concepts do you need to understand in order to use MLflow? Two of the concepts that are most fundamental to MLflow are those of experiments and runs. You can think of an experiment as a collection of scripts that should all be tracked together in the same location. Generally you should create a new MLflow experiment for each project that you are working on. 

Within an experiment, you can have any number of runs coming from any number of scripts. In general, each time you run one of your scripts you will create a new run. This run will automatically be assigned a unique run id that you can use to look up metadata associated with that run. You can also re-open an existing run and use the same run to log data for multiple successive scripts, but for now we recommend creating a new run for each script. 

A diagram of what an MLflow experiment with multiple different runs might look like.

What kind of data can MLflow track?

What kind of data can be tracked in MLflow? There are three main types of data that MLflow can track: parameters, metrics, and artifacts. In the following sections we will discuss each type of data and provide an example of how to track it using MLflow.

A diagram showing the different types of data that can be tracked using MLflow.

Logging parameters in MLflow

The first type of data that MLflow allows you to log is a parameter. A parameter is generally an input to your process, such as the start date of a data pull or the number of trees in a random forest classifier you are training. Parameters can take either string or numeric values. All you have to do to log a parameter to MLflow is add a line of code that calls the log_param function with the name of the parameter followed by its value as arguments.

import mlflow

n_trees = 5
mlflow.log_param("n_trees", n_trees)

Logging metrics in MLflow

The next kind of data you can track using MLflow is called a metric. A metric is generally an output of your script, such as the accuracy metric associated with the model you trained in the script. A metric can also represent metadata such as the time it took to train your model. Unlike parameters, which can take on string or numeric values, metrics should be strictly numeric. All you have to do to log a metric in MLflow is call the log_metric function and provide it with the name and value of your metric. 

accuracy = 0.8
mlflow.log_metric("accuracy", accuracy)

Logging artifacts in MLflow

The third type of data that MLflow can track is called an artifact. In general, any complex piece of data that you want to track that does not have a simple string or numeric value can be considered an artifact. For example, you might track the pandas DataFrame that you trained your model on in csv format or the image of the diagnostic plots associated with your model in png format.

In order to track artifacts in MLflow, you must first write those artifacts out to a directory on your local machine. That directory can contain anything from one artifact to hundreds of artifacts with different file formats. After you have written your artifacts out to a local directory, you can then call the log_artifacts function with the name of the directory that contains your artifacts to copy all of those files from your local directory into the MLflow tracking server.

import os

import mlflow
import pandas as pd

local_directory = '/tmp/123/'
local_file = os.path.join(local_directory, 'example.csv')

# make sure the directory exists before writing to it
os.makedirs(local_directory, exist_ok=True)
data = pd.DataFrame({'a': [1, 2, 3]})
data.to_csv(local_file)
mlflow.log_artifacts(local_directory)

MLflow tracking example

Now we will walk through an example of adding MLflow tracking to a Python script. For this section of the article, we will be following along with our case study on how to build production ready machine learning models. If you were just looking for a high level explanation of what kind of data you can track in MLflow, you can drop off now. If you want to learn more about our case study, you should check out our case study overview for more details.

For the first step of this exercise, we will create a basic model training script that trains a model using the data that we prepared earlier in our case study. After that we will create a new MLflow experiment and add basic MLflow tracking to our script so that each time the script is run, a new MLflow run is created. 

0. Create your script

The first thing we need to do before we demonstrate how to add MLflow tracking to a script is create a script we want to add MLflow tracking to. For the purpose of this case study, we will create a quick script that reads in some parameters from a configuration file, grabs some training data, trains a random forest classifier, and evaluates the predictions that classifier makes. This is the content of the script we will use. 

import os
import yaml

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

from bank_deposit_classifier.sample import upsample_minority_class
from bank_deposit_classifier.prep_data import DATA_DIR, CONFIG_DIR


config_path = 'train_model.yaml'
test_path = 'intermediate/test.csv'
train_path = 'intermediate/train.csv'

# get parameters from config
config_path_full = os.path.join(CONFIG_DIR, config_path)
with open(config_path_full, 'r') as file:
    config = yaml.load(file, Loader=yaml.FullLoader)
outcome = config.get('outcome')
n_estimators = config.get('n_estimators')
max_features = config.get('max_features')

# get test and train data
test_path_full = os.path.join(DATA_DIR, test_path)
train_path_full = os.path.join(DATA_DIR, train_path)
test = pd.read_csv(test_path_full)
train = pd.read_csv(train_path_full)
train_resampled = upsample_minority_class(train, outcome, 0.5)

# train model
rf = RandomForestClassifier(
    n_estimators=n_estimators,
    max_features=max_features,
    random_state=123
)
rf.fit(train_resampled.drop(outcome, axis=1), train_resampled[outcome])

# evaluate model
# roc_auc_score expects scores, so use the predicted probability of
# the positive class rather than hard class predictions
train_predictions = rf.predict_proba(train_resampled.drop(outcome, axis=1))[:, 1]
test_predictions = rf.predict_proba(test.drop(outcome, axis=1))[:, 1]
train_auc = roc_auc_score(train_resampled[outcome], train_predictions)
test_auc = roc_auc_score(test[outcome], test_predictions)

1. Start up the MLflow tracking server

Now that we have a script to add MLflow tracking to, we need to start up the MLflow tracking server so that we can create an MLflow experiment. The tracking server is just the location you go to in order to look at the data that has been logged for your MLflow experiments. All you need to do to get the tracking server up and running is type the following command into your terminal.

mlflow ui

After this, type http://localhost:5000 into the address bar of your favorite internet browser to navigate to the MLflow tracking UI. You should see something that looks like this pop up in your browser. You will need to spin up this tracking server anytime you want to log new data to your MLflow experiment. 

An example of what the mlflow tracking ui looks like.

2. Create a new experiment

Now that you can view the MLflow tracking UI, it is time to create a new MLflow experiment to track the data associated with this project. Press the plus button in the upper left corner of the screen to create a new MLflow experiment. 

After you press this button, you will be prompted to add a name for your experiment. We called our experiment case-study-one, but you can call your experiment anything that you like! You will also have the option to choose the location where the MLflow data will be stored on your computer. You can leave this empty for now. 

A picture demonstrating how to create an MLflow experiment. There is a text box on the screen that asks the user to put in a name for their MLflow experiment.

3. Start an MLflow run

Now you have a new MLflow experiment to track your data – that was easy! Now that we have created our experiment, we will add code to our model training script to track the data associated with our experiment. The first thing we need to do is add a line of code that tells MLflow to start a new run. 

For the purpose of this walkthrough, we will add all of our MLflow tracking code at the bottom of our model training script so that we do not have to show our entire model training script each time we add new code. That being said, you can also add your MLflow tracking code higher up in your script. You can add the tracking code anywhere you want as long as you start a new run before logging any data.

Before you start your MLflow run, you will also need to tell MLflow where to access your tracking server and what the name of your experiment is. You can use the following code to do this.

import mlflow

mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('case-study-one')
mlflow.start_run()

4. Add MLflow parameter tracking

Now that we have started a new run within our model training script, we can start tracking some parameters. You should log any input variables that you expect to change frequently when you re-run your script. For example, in our model training script we might expect the n_estimators and max_features parameters for our random forest model to change between runs, so we should log them.

Here is the code we would use to track these parameters. 

mlflow.log_param('n_estimators', n_estimators)
mlflow.log_param('max_features', max_features)

5. Add MLflow metric tracking

Now that we have added tracking for our input parameters, it is time to add tracking for our outcome metrics. In our model training script we calculate the AUC score for both the test and training datasets, so we will track these metrics using the following code.

mlflow.log_metric('train_auc', train_auc)
mlflow.log_metric('test_auc', test_auc)

6. Add MLflow artifact tracking

Finally, we will log the dataset that we used to train our model to MLflow. We will do this using the artifact logging functionality. Remember that when we log artifacts to MLflow, we first have to write them to a temporary directory on our local computer. This means that there will be a tiny bit more code involved in logging our artifacts. 

Here is the code we used to write our training dataset out to a temporary directory and then log the contents of that directory to MLflow.

import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.csv')
    train.to_csv(path)
    mlflow.log_artifacts(tmp)

7. End your MLflow run

Now that you have logged all of the data you need, all that is left is to end your MLflow run. Here is what your MLflow tracking code should look like when you are done. 

import mlflow
import tempfile

mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('case-study-one')
mlflow.start_run()

mlflow.log_param('n_estimators', n_estimators)
mlflow.log_param('max_features', max_features)
mlflow.log_metric('train_auc', train_auc)
mlflow.log_metric('test_auc', test_auc)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.csv')
    train.to_csv(path)
    mlflow.log_artifacts(tmp)

mlflow.end_run()

