A diagram showing a high level overview of different components of the GitHub environment.

Git for beginners

In this article we provide a beginner-friendly explanation of how to use git to version control your code. We cover all of the concepts necessary to understand how git works and how to use it. First, we give a high level overview of version control and common features of version control systems. After that, we discuss concepts that are specific to the git ecosystem. Finally, we go over the git commands that are most useful for beginners who are just starting to dip their toes in the git ecosystem. 

What is version control?

Version control software is software that enables you to keep track of changes that are made to your code. More specifically, version control software enables you to automatically track changes that are made without having to manually update a change log that is prone to human error. Since version control software keeps track of what order changes were made in, it also allows you to rewind changes and revert back to previous versions of the code. 

When we say that version control automatically tracks changes that are made to code, we do not mean that it tracks every single change you make as you go. If it did, there would be many half-baked versions of your code lying around that were riddled with errors. Instead, version control software allows you to take snapshots of your code when it is in a stable state and save these snapshots as checkpoints that you can refer back to in the future. The software will automatically keep track of all changes that were made between subsequent checkpoints so that you can easily refer back to that information.

Another useful feature that many version control systems have is branching. This allows you to create a new branch, or a new version of your code that diverges from the original version of the code at a specific checkpoint. Different branches can be maintained as separate entities or they can be merged back together later on to create a version of the code that contains all of the changes from both branches. This flexibility is useful if you need to create a separate version of your code for an ad hoc analysis or if you want to work on your code at the same time as a colleague then combine the changes later. 

A simple example of how branching and checkpointing works for version control.

Git for version control

There are many version control tools out there, each with its own set of features. We generally stick with git because of its overwhelming popularity. Git is open source version control software that lets you version control your code and maintain copies of a repository in different locations, such as on different computers. 

One of the benefits of choosing git is that there are many hosting platforms that allow you to host git repositories online free of charge. That means that even if your local computer breaks, you still have a backup of your code online that you can download onto another computer. Two of the most common hosting platforms that use git are GitHub and GitLab. We do not have a strong preference for one or the other when it comes to hosting personal projects, but we generally use GitHub.

How does Git work?

In order to understand how to use git to version control your code, it helps to take a step back and learn about the different components that make up the git ecosystem. In this section we will provide a framework for understanding the most important components in the git ecosystem and how they interact with one another. 

There will be some edge cases in which this exact framework will not apply, but if you are just starting out then this framework will get you everywhere you need to go. 

Local and remote repositories

The first thing to understand about git is that different copies of the same git project can exist on different devices. On each device, there will be a git repository where all of the code associated with that project is stored. There is generally one remote repository, which is the web hosted repository that you see when you log into the GitHub website. There can be any number of local repositories, which are the repositories that exist on your local computer or the computers of your colleagues. 

Each person who is working on a project will have their own local repository on their own computer. From their local repository, they can push code to the remote repository and pull code from the remote repository. Generally, the remote repository is the hub through which you will interact with all of the other local repositories. If you want to look at code that your colleague has created in their local repository, they will push the code to the remote repository so that you can pull from the remote repository into your local repository. 
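
The hub arrangement described above can be tried out on a single machine. The following is a minimal sketch rather than a real GitHub setup: a bare repository in a temporary directory stands in for the remote repository, and the directory names alice and bob, the file notes.txt, and the placeholder identity are all made up for illustration.

```shell
set -eu
# Placeholder identity so commits work even where git is not configured yet.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

demo=$(mktemp -d)

# A bare repository stands in for the web hosted remote repository.
# The -c option makes sure the first branch is named master.
git -c init.defaultBranch=master init -q --bare "$demo/remote.git"

# Alice clones the (still empty) remote, commits a file, and pushes it.
git -c init.defaultBranch=master clone -q "$demo/remote.git" "$demo/alice"
cd "$demo/alice"
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m 'add notes'
git push -q origin master

# Bob clones afterwards, so his local repository starts with Alice's commit.
git clone -q "$demo/remote.git" "$demo/bob"

# Alice pushes a second change to the remote.
echo "second draft" >> notes.txt
git add notes.txt
git commit -q -m 'update notes'
git push -q origin master

# Bob pulls the change from the remote into his local repository.
cd "$demo/bob"
git pull -q
cat notes.txt
```

Alice and Bob never talk to each other's repositories directly in this sketch. Every change travels through the remote repository.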

A diagram showing a high level overview of different components of the GitHub environment.

Branches

Now that you understand repositories, we can move on to branches. We have already discussed the concept of branching in version control software, so you should already be familiar with the concept. A branch represents a specific version of the code that can be traced back through all of its previous checkpoints. 

You can think of branches as being nested within repositories. One repository can contain any number of branches, including branches that are unique to that repository as well as branches that are shared across multiple repositories. Whether or not a branch is present in a specific repository just depends on where the branch was created (in the remote repository or a local repository) and whether the branch has been pushed to or pulled from the remote repository.

You can branch off of an existing branch to create a new branch at any point. Git will keep track of the entire history of both branches, including shared checkpoints that occurred before the branches diverged as well as separate checkpoints that happened after the branches diverged. Branches can be maintained separately from one another or merged together to combine all of the changes from both branches.  

There is generally one branch, called the master branch (or main in many newer repositories), that serves as the source of truth. In the interest of keeping the master branch clean, you should not make changes directly to the master branch itself. Instead, you should branch off of the master branch to create a new branch where you implement your changes. You can merge your new branch back into the master branch after you have thoroughly tested your changes.  
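
Here is a minimal sketch of that pattern in a throwaway repository. It uses the merge command, which is not covered elsewhere in this article, and the file name analysis.txt and the identity settings are placeholders chosen for illustration.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

# Create a repository whose first branch is named master.
git -c init.defaultBranch=master init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First checkpoint on master.
echo "stable code" > analysis.txt
git add analysis.txt
git commit -q -m 'initial version'

# Branch off of master and make the change on the new branch instead.
git checkout -q -b new-branch
echo "tested change" >> analysis.txt
git add analysis.txt
git commit -q -m 'add tested change'

# master is untouched until the new branch is merged back in.
git checkout -q master
git merge -q new-branch

# Show the history of master, one line per checkpoint.
git log --oneline
```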

Untracked, staged, and tracked changes

Now that you understand the concept of branches, we can talk about how changes to a branch are tracked. When you are working on a given branch, there are three different states that the changes you make to your code can be in: untracked, staged, or tracked. Changes in different files can be in different states. For the simple workflow in this article you can treat all of the changes in one file as being in the same state, although a file that is edited again after being staged will hold both staged and unstaged changes until you add it again. 

If your changes are untracked, that means from git’s perspective they do not exist. When you take a snapshot of a branch, git does not automatically pick up all changes that were made to all files in your repository. Instead, it only keeps track of changes that you explicitly tell it to keep track of. 

The first step in telling git to keep track of a specific change is to add the file that the change was made in to the staging area. The staging area keeps track of all the files that contain changes that should be included in the next snapshot. You can add and remove files from the staging area as you please before you take a snapshot. Git will only keep track of the changes that are in the staging area at the time the snapshot is taken. 

Once you have added all the files that contain changes you want to track to the staging area, it is time to tell git to take a snapshot. This will move your changes from the staged state to the tracked state. All of the changes that were in the files that were in the staging area will be included in the snapshot. 
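
The three states can be observed with the status command. This is a minimal sketch in a temporary repository. The file name new-file.py and the identity settings are placeholders, and the --short option is used only to keep the output compact.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git -c init.defaultBranch=master init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "print('hello')" > new-file.py

# Untracked: git sees the file but is not keeping track of it ("??").
git status --short

# Staged: the file will be part of the next snapshot ("A ").
git add new-file.py
git status --short

# Tracked: after the snapshot there is nothing left to report.
git commit -q -m 'add new-file.py'
git status --short
```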

A common git workflow

A diagram of how code flows through git environments.

Are you wondering how all of this comes together in practice? Here is an example of what one cycle of adding changes to your code might look like. 

  1. Pull changes from remote master. The first step in the cycle is to make sure that your local master branch is up to date with the remote master branch. 
  2. Branch off of local master. After you have updated your local master branch, create a new branch by branching off of the local master branch. For the sake of this example, let’s pretend you make a new branch and call it new-branch. 
  3. Work on your new branch. Now that you have your new local branch, you can make any code changes that you need to make on that branch. When you are done making changes, add the appropriate files to the staging area then take a snapshot of your branch. You can take multiple snapshots of your branch if there are multiple checkpoints you want to be able to refer back to.
  4. Push to remote. After you are done updating new-branch, you can push your local branch to the remote repository. Before this, new-branch will not have existed in the remote repository. This is because you created it in your own local repository. 
  5. Merge into remote master. Finally, it is time to solidify your changes by merging your new branch into the remote master branch. Now the remote master branch will contain all of the changes you added to be tracked in the local copy of new-branch.  
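
The five steps above can be sketched as a script. This is an illustration under stated assumptions, not the only way to run the cycle: a bare repository in a temporary directory plays the role of the remote, the file code.txt and the identity are placeholders, and step 5 is approximated by merging locally and pushing master, since there is no web interface in this setup.

```shell
set -eu
# Placeholder identity so commits work even where git is not configured yet.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
demo=$(mktemp -d)

# Setup (not part of the cycle): a seed repository with one commit on master,
# copied into a bare repository that plays the role of the remote.
git -c init.defaultBranch=master init -q "$demo/seed"
cd "$demo/seed"
echo "v1" > code.txt
git add code.txt
git commit -q -m 'initial commit'
git clone -q --bare "$demo/seed" "$demo/remote.git"
git clone -q "$demo/remote.git" "$demo/local"
cd "$demo/local"

# 1. Pull changes from remote master.
git checkout -q master
git pull -q

# 2. Branch off of local master.
git checkout -q -b new-branch

# 3. Work on the new branch: edit, stage, then take a snapshot.
echo "v2" >> code.txt
git add code.txt
git commit -q -m 'add v2'

# 4. Push the new branch to the remote.
git push -q --set-upstream origin new-branch

# 5. Merge into remote master (approximated by a local merge plus a push).
git checkout -q master
git merge -q new-branch
git push -q origin master
```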

Basic git commands

Initialize a git repository

The first thing you have to do when creating a new local repository is navigate to the directory where the repository will live and let git know that a git repository should be set up in that directory. You can do this using the init command. 

If you have already created a remote repository via the GitHub web interface, you do not need the init command. Instead, clone the remote repository. This creates a new directory containing a local repository that tracks the remote repository. Use one of the two commands below, not both. 

# Option 1: start a brand new repository in the current directory
git init
# Option 2: copy an existing remote repository into a new directory
git clone https://github.com/crunchingthedata/case-study-one.git

Navigate to a different branch

The first step in navigating between different branches in your local repository is looking at what branches are available. You can use the branch command by itself to get a list of all branches that are available in your local repository. If you just created a new repository, then you will likely only have one branch called main or master. 

git branch

Once you know which branches are available, you can navigate to the one you want using the checkout command followed by the name of the branch. 

git checkout desired-branch

A diagram showing a high level overview of different components of the GitHub environment and basic commands that trigger transitions from one to another.

Create a new branch

There are two commands you will need to know to make a new branch and navigate to your new branch. First you should navigate to the branch that you want to branch off of using the checkout command that we already discussed. After that, you can create a new branch using the branch command with the name of the branch you want to create. After you have created your new branch, you can navigate to it using the checkout command again. The following code will create a new branch called ‘new-branch’ that branches off of master.

git checkout master
git branch new-branch
git checkout new-branch

Git also offers a command that allows you to create a new branch and navigate to that branch all in one step. All you have to do is use the -b option with the checkout command to indicate that you want to create a new branch. 

git checkout master
git checkout -b new-branch
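
One detail worth knowing is that a branch needs at least one snapshot to point at, so in a brand new repository you have to make a first commit before the branch command will work. The sketch below tries both forms in a temporary repository. The file readme.txt, the second branch name another-branch, and the identity settings are placeholders for illustration.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git -c init.defaultBranch=master init -q
git config user.name "Example User"
git config user.email "user@example.com"

# A branch needs at least one commit to point at, so take a first snapshot.
echo "start" > readme.txt
git add readme.txt
git commit -q -m 'first commit'

# Two step form: create the branch, then navigate to it.
git branch new-branch
git checkout -q new-branch

# One step form, shown with a second hypothetical branch name.
git checkout -q master
git checkout -q -b another-branch

# List local branches; the current one is marked with an asterisk.
git branch
```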

Add files to the staging area

Now that you have a local branch to work off of, you can make changes to the files on your local system. When you are ready to move your changes to the staging area, you can use the add command to do so. 

The add command can be used to add a completely new file to the staging area or to add an existing file that has been modified to the staging area. Each time you make new changes to a file, you will need to use the add command again to move those new changes to the staging area. This is true even if you have already added a previous version of the file to the staging area. Here is an example of how to add a file called new-file.py to the staging area.

git add new-file.py
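
The need to add a file again after editing it can be seen with the status command. This is a minimal sketch in a temporary repository, and the file contents are made up for illustration.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git -c init.defaultBranch=master init -q

echo "x = 1" > new-file.py
git add new-file.py

# Edit the file again after it was staged.
echo "y = 2" >> new-file.py

# "AM": the first version is staged (A), the newer edit is not (M).
git status --short

# Add again so the staging area holds the latest version ("A ").
git add new-file.py
git status --short
```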

Remove file from staging area

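Now say that instead of adding a file to your snapshot, you want to take one out. If you staged a file by mistake and have not committed it yet, you can take it back out of the staging area with git restore --staged (available in Git 2.23 and later) or git reset followed by the file name, both of which leave the file on disk untouched. If the file is already part of a previous snapshot, such as a large data file that you no longer need, you can remove it from all future snapshots using the git rm command. Be aware that git rm also deletes the file from your working directory. If you want git to stop tracking the file but keep it on disk, use git rm --cached instead. 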

Here is the command to remove a file called bad-file.csv from the tracked files starting at your next snapshot. 

git rm bad-file.csv
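
The difference between these commands is easiest to see side by side. The following sketch runs in a temporary repository. The file names keep.py, local-only.csv, and scratch.txt and the identity settings are placeholders, and git restore --staged assumes Git 2.23 or later.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git -c init.defaultBranch=master init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "code" > keep.py
echo "1,2,3" > bad-file.csv
echo "4,5,6" > local-only.csv
git add keep.py bad-file.csv local-only.csv
git commit -q -m 'accidentally include data files'

# git rm stages the removal and also deletes the file from disk.
git rm -q bad-file.csv

# git rm --cached stages the removal but leaves the file on disk.
git rm -q --cached local-only.csv
git commit -q -m 'stop tracking data files'

# Unstage a file that was added by mistake and never committed.
echo "draft" > scratch.txt
git add scratch.txt
git restore --staged scratch.txt

# Files git still tracks, followed by files that exist on disk.
git ls-files
ls
```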

Take a snapshot of your code

Now that you have added and removed all of the files that you wish to add and remove from the staging area, it is time to take a snapshot of your code. You can do this using the commit command. 

When you create a commit, git asks you to create a short commit message describing the changes that you made to your code. If you just use the commit command without adding any other options, git will automatically open a text editor in which you can enter the message. You can also add an -m followed by the message that you want to save when you make the commit. 

For simplicity’s sake, we recommend using the -m option that allows you to include your message alongside your commit. Here is an example of what a basic commit looks like. 

git commit -m 'short message describing changes'
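
After committing, the log command lists the snapshots you have taken along with their messages. This is a minimal sketch in a temporary repository. The file contents, commit messages, and identity settings are placeholders for illustration.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git -c init.defaultBranch=master init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First snapshot.
echo "step one" > new-file.py
git add new-file.py
git commit -q -m 'add step one'

# Second snapshot of the same file.
echo "step two" >> new-file.py
git add new-file.py
git commit -q -m 'add step two'

# One line per snapshot, newest first.
git log --oneline
```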

Push a local branch to the remote repository

Now that you have taken a snapshot of your local branch, it is time to push that snapshot to the remote repository. You can do this by making sure you are on the correct branch then using the push command. If you already told git which branch in the remote repository your local branch tracks then this is all you will have to type.  

git push

However, if you have not specified which remote branch your local branch tracks, you will need to specify that now. You can do this by using the --set-upstream flag. The word origin is the default name that git gives to the remote repository that your local repository was originally cloned from.

git push --set-upstream origin new-branch
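
The sketch below shows both forms of push one after the other. It is an illustration under stated assumptions: a bare repository in a temporary directory stands in for the remote, and the file code.txt and the identity are placeholders.

```shell
set -eu
# Placeholder identity so commits work even where git is not configured yet.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
demo=$(mktemp -d)

# Setup: a seed repository copied into a bare repository acting as the remote.
git -c init.defaultBranch=master init -q "$demo/seed"
cd "$demo/seed"
echo "v1" > code.txt
git add code.txt
git commit -q -m 'initial commit'
git clone -q --bare "$demo/seed" "$demo/remote.git"
git clone -q "$demo/remote.git" "$demo/local"
cd "$demo/local"

# Create a local branch that the remote does not know about yet.
git checkout -q -b new-branch
echo "v2" >> code.txt
git add code.txt
git commit -q -m 'add v2'

# First push: name the remote and the branch, and record the link between them.
git push -q --set-upstream origin new-branch

# The local branch now tracks origin/new-branch.
git rev-parse --abbrev-ref 'new-branch@{upstream}'

# Later pushes from this branch need no extra arguments.
echo "v3" >> code.txt
git add code.txt
git commit -q -m 'add v3'
git push -q
```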

Pull a remote branch into your local repository

Oftentimes when you set up a new local repository on your computer, the first thing you want to do is pull a branch that exists in the remote repository into your local computer. You can do this by navigating to a local branch that tracks the remote branch and using the git pull command. 

git pull
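
If the branch exists in the remote repository but not yet in your local repository, checking it out by name creates a local branch that tracks the remote one, provided only one remote has a branch with that name. Here is a minimal sketch of that sequence, with a bare repository in a temporary directory standing in for the remote. The file code.txt and the identity are placeholders.

```shell
set -eu
# Placeholder identity so commits work even where git is not configured yet.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
demo=$(mktemp -d)

# Setup: a remote that already contains master and new-branch.
git -c init.defaultBranch=master init -q "$demo/seed"
cd "$demo/seed"
echo "v1" > code.txt
git add code.txt
git commit -q -m 'initial commit'
git checkout -q -b new-branch
echo "v2" >> code.txt
git add code.txt
git commit -q -m 'add v2'
git checkout -q master
git clone -q --bare "$demo/seed" "$demo/remote.git"

# A fresh local repository starts out on master only.
git clone -q "$demo/remote.git" "$demo/local"
cd "$demo/local"

# Show the branches that exist in the remote repository.
git branch -r

# Create a local new-branch that tracks origin/new-branch and navigate to it.
git checkout -q new-branch

# Bring the local branch up to date with the remote branch.
git pull -q
```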

Resources for learning more about git

Do you have more questions about git and how it works? This is one of our favorite resources for learning more about git.  
