In the previous step of this case study, we set up a Git repository to version control our code. Now we will set up a package manager to manage our Python environment. This will give us fine-grained control over the package versions we use and let us switch easily between environments that use different package versions. We will use Conda environments in this tutorial because they are language agnostic and beginner friendly.
This article was created as part of a larger case study on developing data science models. That being said, it is also a great standalone resource if you are looking for a gentle introduction to Conda environments.
What is a package manager?
A package manager is a piece of software that installs, updates, and removes packages and keeps track of their versions. Many package managers, including Conda, also let you create and switch between separate environments that contain different versions of the same packages, or even different versions of the same language.
Why use a package manager?
Package managers allow you to create separate environments that isolate your projects from each other. Here are some common situations where this helps.
- Conflicts between different projects. Say you have two different Python projects you are working on, one of which requires version 1.1 of a package and the other of which requires version 2.2 of the same package. With a package manager, you can easily create two different environments, each of which contains a different version of that package.
- Reproducibility across machines. Another argument for using a package manager is that it makes your code more reproducible and easy to run across different computers. Generally when you use a package manager, you can store the exact language and package versions you are using in a separate file. Other people who are reading your code can use that file to create an exact copy of the environment you are using on their own computers.
- Running legacy code. Another situation where a package manager is useful is when you have to run old code that you have not worked on in a while. That code might not work with the updated package versions that you are using for your current projects. You can create a separate environment with older package versions to run your legacy code in.
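As a rough sketch of the reproducibility workflow described above, Conda can write an environment's packages to a file with conda env export, and another machine can rebuild the environment from that file with conda env create. The helper names export_env and recreate_env, the environment name my-project, and the file name environment.yml are placeholders chosen for illustration; they are not part of Conda.

```shell
# Hypothetical helper: write the packages of environment $1 to file $2.
export_env() {
    conda env export --name "$1" > "$2"
}

# Hypothetical helper: rebuild an environment from the exported file $1.
recreate_env() {
    conda env create -f "$1"
}

# Example usage (assumes an environment named my-project already exists):
#   export_env my-project environment.yml
#   recreate_env environment.yml
```

An exported file can include build details that are specific to one operating system. If you plan to share the file across platforms, the --no-builds option of conda env export may give a more portable result.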
Why use Conda for data science projects?
Why did we choose to use Conda for this data science tutorial rather than another package manager? Here are some of the main reasons we decided to go with Conda.
- Language agnostic. The first reason we chose Conda is that it is language agnostic. You can also use Conda as a package manager when you are coding in other popular data science languages such as R, which makes it a good fit for data professionals who work in a variety of languages.
- Integrations with other software. Another reason we decided to stick with Conda is that it has handy integrations with other software such as MLflow.
- Beginner friendly. The final reason we chose Conda is that we find it the most straightforward and easy-to-use package manager for beginners.
We recommend following this Conda installation guide to set up Conda on your local computer. This guide is regularly updated as changes are made to the project and it contains instructions for installing Conda on a variety of operating systems.
Basic Conda commands
Before we guide you through the process of creating your own Conda environment, we will first go through a few basic commands that are useful for beginners.
Create a Conda environment
There are multiple ways to create a new Conda environment. The first is to create an empty environment then install packages individually. You can do this by using the create command and passing the name of the environment. For example, if you want to create an environment called conda-example, this is the command you would use.
conda create --name conda-example
If you need to create a Conda environment that uses a specific version of Python, you can also specify the version of Python you want to use. Your command would look something like this.
conda create --name conda-example python=3.9
You can also create a Conda environment from a yaml file that contains a list of all the package versions you want to use. You can do this by using the env create command with the -f option. This is the best way to specify your environments if you are looking for reproducibility. We will talk more about how Conda yaml files are structured later, but as a quick example, this is how you would create a Conda environment from a file called conda-example.yml.
conda env create -f conda-example.yml
List all Conda environments
If you forget exactly what you called one of your Conda environments, you can see a list of all the Conda environments you have created by using the env list command, like so.
conda env list
Activate a Conda environment
If you want to enter a Conda environment that you have created, you can use the activate command followed by the name of the environment. If you wanted to activate an environment called conda-example, this is what you would type.
conda activate conda-example
Deactivate a Conda environment
Exiting a Conda environment is just as easy as entering one. All you need to do is use the deactivate command. In recent versions of Conda, you do not have to deactivate your current environment before activating another one, because the activate command switches environments for you.
conda deactivate
Creating your first Conda environment
Now that you are familiar with basic Conda commands, it is time to create a Conda environment of your own. If you are following along with our case study, you will want to follow all these steps to create an environment for your project. You should create your Conda environment using a yaml file.
1. Create a new branch
Before you create your Conda environment, you should create a new branch in the GitHub repo you created in the previous step of the case study. First, switch over to your main branch and make sure that branch is up to date with the remote main branch. Then create a new branch called conda-env.
git checkout main
git pull
git checkout -b conda-env
2. Create a Conda yaml file
After you create a new branch to work on, it is time to create a yaml file for your Conda environment. You can think of YAML as just a standardized file structure that is easy to read for humans and computers alike. Yaml files are a common choice for configuration files such as environment configuration files.
Yaml files that are used to make Conda environments generally start with the name of the Conda environment that you want to create. This is generally followed by a list of dependencies that need to be installed in the environment.
In the dependencies section, you can specify the version of Python that you want to use as well as the Python package versions that you want to use. By default the packages that you list under dependencies will be installed using Conda. However, if there are packages that specifically need to be installed using pip then you can add pip to your list of dependencies and list those packages underneath.
If you do not have a specific version of a package in mind, you can just add the package name without a version constraint. If you want to use a specific version or range of versions, you can use the =, >, >=, <, and <= operators to specify the range of versions that are acceptable. Another useful operator that you can use to specify versions is the * operator. This is a wildcard operator that indicates that any number can go in that position. For example, 3.3.* indicates that 3.3.1, 3.3.2, 3.3.3, and so on are all acceptable versions.
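To make the version syntax and the pip section concrete, here is a small illustrative file written from the shell. The heredoc is just one way to create the file; you could equally paste the YAML into a text editor. The environment name, the file name version-examples.yml, and every package and version number below are placeholders rather than recommendations.

```shell
# Write an illustrative environment file; every name and version here is a placeholder.
cat > version-examples.yml <<'EOF'
name: version-examples
dependencies:
  - python=3.9          # any 3.9.x release
  - numpy>=1.21,<2.0    # a range of acceptable versions
  - pandas=1.5.*        # wildcard: 1.5.0, 1.5.1, and so on
  - scipy               # no constraint: let Conda choose
  - pip                 # needed so the pip section below works
  - pip:
      - requests==2.31.0  # installed with pip instead of Conda
EOF
```

Note that a comma between two constraints means both must hold, so the numpy line accepts any version from 1.21 up to, but not including, 2.0.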
For the sake of this case study, we will start out with a simple Conda environment that contains numpy and pandas. We will name the environment case-study-one. Here is what our Conda yaml file will look like.
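A minimal sketch of such a file, assuming no version pins are needed yet. The file name conda-env.yml matches the commands used in the following steps. Listing python explicitly and leaving every package unpinned are assumptions, so adjust them to suit your project.

```shell
# Create conda-env.yml in the root of the repository.
cat > conda-env.yml <<'EOF'
name: case-study-one
dependencies:
  - python
  - numpy
  - pandas
EOF

# Print the file to confirm its contents.
cat conda-env.yml
```

If you later need reproducible installs, you can add version constraints to each line using the operators described above.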
3. Create environment
After you create your Conda yaml file, you can create a Conda environment with one simple command. Simply use the env create command with the -f option.
conda env create -f conda-env.yml
4. Activate environment
Once you create your Conda environment, all that is left is to activate the environment. You can do this using the activate command followed by the name of the environment you want to activate. We set the name field to case-study-one in our yaml file so our environment will be called case-study-one.
conda activate case-study-one
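If you want to confirm that the environment contains the packages you asked for, one option is to import them and print their versions. The sketch below uses conda run, which executes a command inside a named environment without activating it first. The helper name check_env is hypothetical and only there to keep the example reusable.

```shell
# Hypothetical helper: import numpy and pandas inside environment $1 and print their versions.
check_env() {
    conda run --name "$1" python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
}

# Example usage once the environment exists:
#   check_env case-study-one
```

If either import fails, the package was probably not installed, and the error message should name the missing module.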
5. Push to remote main
Finally, you can update your branch on GitHub and push the changes to the remote repository. Typically, it would be overkill to create an entire branch for a change this small. However, for the purpose of this case study, we are going to create a new branch for each step.
Here are the commands that you would use to do this assuming that you named your Conda yaml file conda-env.yml and your branch conda-env.
git add conda-env.yml
git commit -m 'simple conda file'
git push --set-upstream origin conda-env
After you push your local conda-env branch to your remote repository, you will need to create a pull request using the GitHub web interface. Merge this pull request to complete this step of the case study.
Learn more about Conda
Do you want to learn more about Conda? Check out this reference guide to learn more about Conda commands and capabilities.