One of the first steps you should take when starting a new data science project is setting up a new repository to version control your code. For the sake of this case study, we will be using git to version control our code and GitHub to host a version of our git repository online. In the following article we will show you how to set up a GitHub account, create a new repository, and clone the repository onto your local computer.
This article was created as part of a larger case study on developing data science models. That being said, it is also a great standalone resource if you are looking for a gentle introduction to using git and GitHub.
What is version control?
Version control software keeps track of the changes made to your code by taking snapshots of the code at different points in time. These snapshots serve as checkpoints that you can quickly revert to if you introduce a bug into your code or find out that the updates you just made were not necessary.
Many types of version control software also allow you to maintain separate branches or versions of your code that you can easily navigate back and forth between. This is useful if you want to make changes to your code for an ad hoc analysis but also want to maintain a separate version of your code without the changes.
If this is your first time using git, we recommend reading through our git for beginners post to develop an understanding of the git ecosystem and how it operates. We will assume that you understand the topics covered in that post as we progress through this case study.
Why use version control for data science?
Here are some of the most common reasons to use version control for data science projects. If you want even more examples of pitfalls that can be avoided by using version control software, check out this article on why version control is important for data science projects.
- Easy rollbacks. Version control software allows you to easily roll back to previous versions of your code if you make a mistake.
- Automatic change log. Version control software automatically tracks which updates were made to your code at which checkpoints.
- Reduce redundancy. Version control software eliminates the need to store redundant copies of the same code in the same directory by allowing you to maintain different versions of your code on different branches.
- Enable collaboration. Version control makes it easy to work on the code at the same time as your colleagues and then merge the changes back together at a later date.
- Bolster your resume. If nothing else, being familiar with a common version control software such as git is a great resume booster!
Setting up a GitHub repository
1. Sign up for a GitHub account
The first step in setting up a GitHub repository is signing up for a GitHub account. Simply navigate to github.com, click the sign up button, and create an account. You will be asked to enter an email address, username, and password to set up your GitHub account.
2. Create your first GitHub repository
After you create a GitHub account, it is time to create your first repository. Your git repository is the location where all of the code related to a project will be stored. Navigate to the repositories section of the GitHub interface and click the new button.
You should give your GitHub repository a name that clearly identifies what the contents of the repository are. Something like case-study-one would work great if you are working through our first case study. After that, you will be asked questions about whether you want to initialize your repository with some pre-existing files.
We recommend initializing your repository with a README, the default Python gitignore, and the MIT License. In case you are not familiar with these types of files, here is some more information on the files you can initialize your repository with.
- README. The README file is the place to store high level documentation about your project. This might include instructions for how to install and run the code in the repository or a high level description of the contents of the repository. The contents of your README file are automatically displayed when someone navigates to your GitHub repository.
- .gitignore. The gitignore file tells git to ignore certain files or file types and refrain from tracking them. If you have a specific file or directory that you want git to ignore, you can add the path to that file or directory to your gitignore file. If you have a certain file type you want git to refrain from tracking, you can add a pattern for that file type (for example, *.csv) to your gitignore as well. This is a great way to ensure that large data or model files that are constantly changing do not get tracked in your repository. Note that the gitignore file only affects files that git is not already tracking.
- License. The license essentially tells other people how they are allowed to use your code. This includes information on whether they are allowed to copy, modify, and distribute the code. This file does not have any material impact on your code or how GitHub tracks your code.
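To make the .gitignore entry above more concrete, here is a minimal sketch that runs in a throwaway directory. The patterns and file names (data/, *.pkl, notes/scratch.txt, analysis.py) are illustrative assumptions, not the contents of GitHub's default Python gitignore.

```shell
set -eu
# Hypothetical scratch repository so the example does not touch a real project.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Example patterns (assumptions): one directory, one file type, one specific file.
cat > .gitignore <<'EOF'
data/
*.pkl
notes/scratch.txt
EOF

# Create a few placeholder files to test the patterns against.
mkdir -p data notes
touch data/train.csv model.pkl notes/scratch.txt analysis.py

# check-ignore prints every given path that matches a gitignore pattern.
ignored="$(git check-ignore data/train.csv model.pkl notes/scratch.txt analysis.py || true)"
echo "$ignored"
```

In this sketch the data file, the model file, and the scratch note are reported as ignored, while analysis.py is not listed and would still be tracked.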
3. Create a local directory to store your code
After you create your new GitHub repository, you should create a local directory on your computer to store your code in. In the next step, you will learn how to make a copy of your GitHub repository on your local computer, but for now let’s focus on creating the local directory.
We recommend creating a centralized location on your computer where you store all of your data science projects. For example, you might create a directory called data-science within which you have different subdirectories for each data science project you are working on.
For the purposes of this case study, we will create a subdirectory inside our centralized data-science folder called case-study-one. You can create your directory using a command line tool or a point and click UI. The most important thing is that you use a logical directory name and store it in a logical location.
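If you prefer the command line, the directory layout described above can be sketched as follows. Placing the data-science folder in your home directory is an assumption, and PROJECTS_PARENT is a hypothetical variable used here only to make the location easy to override.

```shell
set -eu
# Parent folder for all projects; defaults to the home directory (an assumption).
parent="${PROJECTS_PARENT:-$HOME}"
project_dir="$parent/data-science/case-study-one"

# -p creates any missing parent folders and does nothing if they already exist.
mkdir -p "$project_dir"
cd "$project_dir"
pwd
```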
4. Install git on your computer
Before you can clone your GitHub repository, you need to make sure that git is installed on your computer. We recommend following this guide on the GitHub website to install git. The exact instructions will vary based on the type of operating system you are using.
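Once the installation has finished, a quick way to confirm that git is available from your terminal is to ask for its version. The snippet below is a small sketch of that check; the wording of the messages is our own.

```shell
# command -v reports whether an executable named git is on the PATH.
if command -v git >/dev/null 2>&1; then
    git_status="installed: $(git --version)"
else
    git_status="git not found, install it before continuing"
fi
echo "$git_status"
```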
5. Clone your repository to your local directory
Now that you have set up a location to store your repository on your local computer, all that is left is to clone your repository to your local computer.
If you are using a Mac or Linux based operating system, you can open a terminal window from anywhere on your computer and navigate to the directory you created for your project. If you need help learning how to navigate in the terminal, this is a great tutorial that covers the basics. If you are using a Windows PC, you can right click to open a git bash shell from the directory you created for your project.
First you can inform git that you want to create a git repository in this directory by using the init command as follows. This step is optional when you are cloning an existing repository, because the clone command creates and initializes the local repository on its own; the init command is only required when you start a repository from scratch.
git init
Next, you will need to clone the repository that you created on GitHub to your local computer. In order to do this, you first need to decide how you are going to refer to your GitHub repository so that git knows exactly which repository you want to clone. The easiest way to do this is using the web URL. Navigate to the repository you created on GitHub, click the Code button, and copy the URL associated with the git repository.
Now that you have a way to refer to your remote GitHub repo, you can use the clone command to make a local copy of your repo. Simply type the clone command followed by the URL that you copied from the GitHub web interface. By default, the clone command places the copy in a new subdirectory named after the repository (case-study-one in this example), so navigate into that subdirectory before running the git commands in the rest of this article.
git clone https://github.com/crunchingthedata/case-study-one.git
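If you would like to see what the clone command does without needing network access, the following sketch may help. It assumes a local bare repository as a stand-in for the GitHub URL, and the seed and hosting directory names and the example identity are hypothetical.

```shell
set -eu
work="$(mktemp -d)"

# Build a tiny repository with a README, then publish it as a bare repo.
# The bare repo is a local stand-in for the GitHub URL (assumption for offline use).
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/main      # name the first branch main
git config user.name "Example User"        # placeholder identity for this sketch
git config user.email "user@example.com"
echo "# case-study-one" > README.md
git add README.md
git commit -q -m "initial commit"
mkdir -p "$work/hosting"
git clone -q --bare "$work/seed" "$work/hosting/case-study-one.git"

# Clone into a projects folder; git creates a case-study-one subdirectory.
mkdir -p "$work/data-science"
cd "$work/data-science"
git clone -q "$work/hosting/case-study-one.git"
cd case-study-one
git remote -v      # the clone remembers where it came from under the name origin
ls                 # the README from the remote is now available locally
```

With a real GitHub repository, only the clone line changes: the local path is replaced by the URL copied from the Code button.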
If you are having any trouble cloning your repository onto your local machine, we recommend you refer to the instructions here.
Creating your first branch
Now that you have a local git repository up and running, you can practice making changes to your code and pushing them back to your remote git repository. If you are not familiar with how git works, you will need to read our git for beginners article before completing this section. We will work through the example of a basic git workflow that we present in the aforementioned article.
1. Pull changes from remote master
Start out by making sure that you are on your main or master branch using the branch command. When you use the branch command without specifying the name of a specific branch, it prints out a list of all available branches. The branch that you are on will have an asterisk next to it.
After you make sure that you are on the correct branch, use the pull command to make sure that your branch is up to date with the main branch in the remote repository.
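The two commands described in this step can be sketched as follows. To keep the example runnable offline, it assumes a local bare repository in place of GitHub and a hypothetical colleague who pushes an update; it also assumes the default branch is called main, so substitute master if that is what your repository uses.

```shell
set -eu
work="$(mktemp -d)"

# Stand-in for the GitHub repository: a local bare repo (assumption for offline use).
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# A hypothetical colleague's clone publishes the first commit on main.
git clone -q "$work/remote.git" "$work/colleague"
cd "$work/colleague"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity for this sketch
git config user.email "user@example.com"
echo "# case-study-one" > README.md
git add README.md
git commit -q -m "initial commit"
git push -q origin main

# Your own clone, made before the colleague pushes another change.
git clone -q "$work/remote.git" "$work/case-study-one"
echo "A new sentence." >> README.md
git commit -q -am "update readme"
git push -q origin main

# The step itself: check the current branch, then pull the latest changes.
cd "$work/case-study-one"
git branch                 # the current branch is marked with an asterisk
git pull -q origin main    # fetch the remote main branch and merge it in
```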
2. Branch off of local master
Now that your main branch is up to date with the main branch in the remote repository, you can create a new branch. The easiest way to do this is to create a new branch and automatically navigate to that branch using the checkout command with the -b flag. For the purpose of this tutorial, we will assume you called your new branch git-example.
git checkout -b git-example
3. Work on your new branch
Now that you are on your new branch, it is time to make changes to your files. To keep this simple, we recommend making a small change to your README file. Open the README in your favorite text editor and add a sentence to the file.
After you have saved the changes to your README file, you can use the add command to add your changes to the staging area.
git add README.md
After you add your updated README to the staging area, you can take a snapshot of your updated branch using the commit command. We recommend using the -m option to add a message for your commit.
git commit -m 'add a sentence to readme'
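The branch, add, and commit steps above can be tried end to end in a scratch repository. This is a minimal sketch under a few assumptions: the repository is created locally rather than cloned, the example identity is a placeholder, and the sentence added to the README is made up.

```shell
set -eu
# Hypothetical scratch repository with one commit on main.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity for this sketch
git config user.email "user@example.com"
echo "# case-study-one" > README.md
git add README.md
git commit -q -m "initial commit"

# Create the new branch and edit the README.
git checkout -q -b git-example
echo "This sentence was added on the git-example branch." >> README.md
git status --short          # ' M README.md' means modified but not yet staged

# Stage the change, then take the snapshot.
git add README.md
git status --short          # 'M  README.md' means staged and ready to commit
git commit -q -m 'add a sentence to readme'
git log --oneline           # lists the new commit above the initial one
```

Running the status command before and after the add command is a convenient way to see what will be included in the next commit.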
4. Push to remote
Now you can push the changes you made on your local git-example branch to the remote repository using the push command. You can use the --set-upstream option to let git know that your local branch should track a similarly named branch in the remote repository. You can use the word origin to refer to the remote GitHub repository because that is the original repository that your local repository was cloned from.
git push --set-upstream origin git-example
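To check that the push worked and that the upstream was recorded, you can inspect the branch afterwards. The sketch below again assumes a local bare repository standing in for GitHub, with a placeholder identity and made-up README contents.

```shell
set -eu
work="$(mktemp -d)"

# Stand-in for the GitHub repository (assumption for offline use).
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/case-study-one"
cd "$work/case-study-one"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity for this sketch
git config user.email "user@example.com"
echo "# case-study-one" > README.md
git add README.md
git commit -q -m "initial commit"
git push -q origin main

# Make a change on a new branch.
git checkout -q -b git-example
echo "This sentence was added on the git-example branch." >> README.md
git commit -q -am 'add a sentence to readme'

# Push the branch and record origin/git-example as its upstream.
git push -q --set-upstream origin git-example

git branch -vv                               # shows [origin/git-example] next to the branch
git ls-remote --heads origin git-example     # confirms the branch exists on the remote
```

Because the upstream is now recorded, later pushes from this branch can use the push command without any extra arguments.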
5. Merge into remote master
Now you have an updated version of your git-example branch in your remote GitHub repository. The next step is to merge the branch you have been working on back into your remote main or master branch by creating a merge request (also known as a pull request) using the GitHub web interface.
Navigate to your remote repository. If you just recently pushed code to a branch, then you may see a button asking if you want to create a pull request. This is the easiest way to create a new pull request. If you do not see this button then you can click into the branches tab to see a list of all the branches in your remote repository.
If you click into the branches tab, you should see a list of branches. Alongside each branch, there should be a button for creating a pull request. Click on the button next to your git-example branch.
After you click the button, you will be asked to add a title and description to the pull request. Fill these fields out, make sure that the correct branches are being merged, and then click the create pull request button. Then you can use the merge pull request button on the next screen to merge the branch you were working on into your main or master branch.