When to start with git, in short
Depending on the people you work with, you can spend a long time as a data scientist without ever using git. You might wonder why some of your friends use such a tool. I consider any of the following three situations a good moment to start using git:
A) You start having two (or many more) files with nearly the same name, like analysis1.py and analysis2.py, and you find that confusing.
B) You want to make some changes but want to be able to switch easily to the previous working version.
C) You want to work in a team.
How we all start coding
When we start coding and doing data analysis, we try new things every day. One day it is a new data set, another day a new visualization library, or maybe just a new model for the same data set. And that works well most of the time. At least, that is what I thought when I started.
How I realized I needed to update my toolbox
Later on, I started to work, and I noticed that the main difference from my previous academic and hobby experience was that a project usually takes longer than one or two days. You are off for the weekend and you do not remember where you left off three days ago. Or you realize you have three files with almost the same name, like analysis.py, analysis_2.py and analysis_final_2.py, and they also contain the same code. That is confusing, and I would then spend hours trying to pick up where I had left off.
That is when I started thinking: “There must be a better way to do this”. Unfortunately, I did not know git, so I did not use it at the time.
Another time, I deleted my latest script by mistake, after spending the previous week on it. I could not recover it and had to start over. That was not much fun, and I thought to myself again: “There should be a better way to do this”.
And why I didn’t use git earlier
Last but not least, even in a working environment among data scientists, I noticed a certain fear of, or discomfort around, git. Most of my colleagues knew it existed and why it existed, but they used it only if they had to. Git could cause some annoying issues that could take a long time to solve. Also, the extensive use of Jupyter Notebooks and cloud environments to run code seemed to make git useless “because the code is just there and is already shared with everyone”.
Using git as a collaboration tool
At some point, I joined a developer team as a data scientist. Since we were deploying our data science code into a production environment, we needed to use git every day to share and review each other’s code. In the beginning, it was tough to use git all the time and I got stuck a few times. Over time I started to notice that git was getting ingrained in my routines and it was very, very useful.
At that point, I started using git very happily as a great addition to my daily job. Now I use it for any type of project and wonder why something similar to git does not exist for ordinary documents, like PowerPoint presentations or Excel sheets.
How git can help your daily job as a data scientist
If I could describe this in one image, it would be this:
Using git, every change you commit gets a timestamp and a short message documenting it. This makes it very easy to keep track of your progress.
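To make the timestamp and the message concrete, here is a minimal sketch rather than a full tutorial. It uses Python to run ordinary git commands in a throwaway folder; the same commands can be typed in a terminal. The file name analysis.py, the commit messages, the demo identity and the small `git` helper function are all assumptions made up for this illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its output as text."""
    cmd = [
        "git",
        # Throwaway identity, so the sketch does not depend on your global config.
        "-c", "user.name=Demo User",
        "-c", "user.email=demo@example.com",
        "-c", "commit.gpgsign=false",
        *args,
    ]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def commit_history_demo() -> list[str]:
    """Commit two versions of a script and return the log, newest first."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")

        script = repo / "analysis.py"  # hypothetical file name
        script.write_text("print('load data')\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "-m", "Load the data set")

        # Edit the same file instead of creating analysis_2.py.
        script.write_text("print('load data')\nprint('remove outliers')\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "-m", "Add outlier filter")

        # One line per commit: short hash | timestamp | message
        log = git(repo, "log", "--date=iso", "--format=%h | %ad | %s")
        return log.splitlines()
```

Calling `commit_history_demo()` returns one line per commit, each with its own hash, date and message, so there is a single analysis.py with a readable history instead of several near-identical copies.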
Using the git branch workflow makes it easy to work on different features at the same time, because each feature branch is separated from the rest. As a consequence, I do not need to finish a new feature right away: it lives on its own branch, apart from the rest of the code.
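The separation between branches can be sketched in the same way. This is one possible illustration, not the only way to organize branches: the branch name feature/new-model, the file name analysis.py, the two model names and the `git` helper are hypothetical. The sketch again runs plain git commands from Python in a temporary folder.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its output as text."""
    cmd = [
        "git",
        # Throwaway identity, so the sketch does not depend on your global config.
        "-c", "user.name=Demo User",
        "-c", "user.email=demo@example.com",
        "-c", "commit.gpgsign=false",
        *args,
    ]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def branch_demo() -> tuple[str, str]:
    """Return the script as seen on the default branch and on a feature branch."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")

        script = repo / "analysis.py"  # hypothetical file name
        script.write_text("model = 'linear regression'\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "-m", "Baseline model")

        # The default branch is called main or master, depending on your git settings.
        default_branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD")

        # Start the experiment on its own branch (hypothetical branch name).
        git(repo, "checkout", "-b", "feature/new-model")
        script.write_text("model = 'random forest'\n")
        git(repo, "commit", "-am", "Try a random forest")
        on_feature = script.read_text()

        # Switching back restores the working version; the experiment stays on its branch.
        git(repo, "checkout", default_branch)
        on_default = script.read_text()

        return on_default, on_feature
```

After `branch_demo()` runs, the first returned value still contains the baseline model and the second contains the experiment, which is why an unfinished feature does not get in the way of the working code.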
So, how to start using git as a data scientist?
With this article I hope to have convinced you that you should start using git as a data scientist. Git has many more advantages than just a better personal workflow. Using GitHub, GitLab or Bitbucket helps you share your code, either with a team or with the world. GitHub offers more features on top of that, such as shared code spaces and many other integrated tools.
Get started
If this article has convinced you, here is what I consider the best tutorial to start with. In the end, you do not need to master git; you just need to know enough to put your code online using the branch workflow!