Using GitHub for Data Science Workflows
GitHub is the largest development platform in the world, with over 73 million developers using it in various capacities. More than 4 million organizations, including many of the world's largest companies, build on the platform.
In fact, 84% of Fortune 500 companies use GitHub for a diverse range of reasons. GitHub is built around Git, a version control system that lets developers around the world monitor, review, and manage changes to a software project's code.
Whether it is a machine learning project or a data science project, Git suits all kinds of development pipelines and supports collaborative Git workflows. GitHub also promotes automation and CI/CD pipelines, offering features such as automated testing, automated builds, and even automated model retraining.
Git allows each developer’s project copy (or working code) to become a repository that contains the history of all the changes. Currently, there are over 200 million repositories being hosted on GitHub’s cloud.
According to Stack Overflow, 87% of developers use Git. Why? Mainly because Git has a distributed architecture, which makes collaboration simple and secure.
GitHub also encourages developers to get involved with open-source projects such as TensorFlow, React (facebook/react), Ansible, and Swift (apple/swift). Thousands of developers can come together and make their own contributions to public projects such as these. One can also control code accessibility and allow only certain users to make changes to software builds.
Git offers performance, security, and flexibility, allowing companies and individuals to review and track any changes before they are added to final builds. Mostly, GitHub is used as a cloud-based repository service that developers can use for free, but due to the enormous scalability and organizational capacity, GitHub also offers private enterprise repositories. GitHub is now one of the most preferred paid repository services for a massive list of companies.
GitHub also has a simple-to-use interface with a state-of-the-art integrated development environment (IDE) for writing code in any language and developing for any platform, whether Linux, Windows, macOS, containers, or ARM (Advanced RISC Machine) devices. GitHub is one of the most powerful platforms for storing, editing, managing, building, and deploying your code.
Unlike Git’s raw interface, GitHub offers a more modern interface with more collaboration and community functionality. When working with GitHub, you can either use GitHub online or download GitHub Desktop. Additionally, if you do not want a GUI (Graphical User Interface), you can opt to use just the GitHub CLI (Command-Line Interface).
GitHub allows developers to maintain DevOps or MLOps best practices and automate everything in between. For instance, GitHub offers various automation tools useful for retraining existing models.
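As a sketch of this kind of automation, a hypothetical GitHub Actions workflow could retrain a model whenever the training data changes. The workflow file path, the `retrain.py` script, and the trigger paths are all illustrative assumptions, not a prescribed setup:

```yaml
# Hypothetical file: .github/workflows/retrain.yml
name: retrain-model
on:
  push:
    paths:
      - "data/**"            # run whenever training data changes
jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python retrain.py   # hypothetical training script
```

A real pipeline would typically also store the resulting model artifact or open a pull request with the updated weights, but the shape of the automation is the same.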
Let’s take a look at the Git workflow for data science and how to collaborate on GitHub for data science projects.
Why GitHub for Data Science?
GitHub is extremely user-friendly and is thus used for all kinds of projects. As a matter of fact, GitHub is even used for writing books and collaborating on research papers. However, it is IT development processes that exploit GitHub’s potential to the fullest.
GitHub is also free for individual developers: you can simply sign up and start hosting a public code repository. Since GitHub can be used for both commercial and open-source projects, the freedom of the platform can be enjoyed by all kinds of IT professionals.
Like other domains, data science requires collaboration, and the ability to work with developers around the world helps speed up data science pipelines. Data scientists can therefore benefit heavily from learning how to use Git repositories, and Git collaboration workflows are a great way to go about it.
Code contributions, recorded as commits, are essential for improving data-driven projects, AI models, and the training data behind machine learning processes. For instance, a model retraining pipeline can benefit greatly from GitHub’s CI/CD approach and automation tools.
Retraining a model on new data after the original build has been committed to a Git repository can fix many issues in the current build without compromising the working portions of the code. Let us understand this with an example.
Let us assume that you are working on the code for a predictive system or recommendation engine. Let us also consider that your build (deployed code) is functional but has a few limitations. Now, when other people collaborate with you on your project, they might commit code that might compromise the functional status of your build.
GitHub allows us to track these changes and recall any version of the working build. When collaborators commit code, Git records each change as a new snapshot while preserving the older builds, which can be reviewed by checking the history of the current version (build). This is why version control software such as Git is essential for any collaborative project, whether in data science or artificial intelligence engineering. Using a Git workflow for data science enables smoother collaboration, especially for multi-team projects.
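This recall-any-version idea can be sketched with a few Git commands. The repository, file names, and commit messages below are purely illustrative:

```shell
# Sketch: a risky change lands, and we recall the file from the previous build.
set -e
demo_repo=$(mktemp -d)              # throwaway repository for the demo
cd "$demo_repo"
git init -q
git config user.email "demo@example.com"
git config user.name  "Demo"

echo "model v1" > model.py
git add model.py
git commit -qm "First working build"

echo "model v2 (buggy)" > model.py
git commit -qam "Risky change"

git log --oneline                    # every version is still in the history
git checkout -q HEAD~1 -- model.py   # recall the file from the older build
cat model.py                         # back to "model v1"
```

Nothing is lost: the risky commit stays in the history, and any earlier snapshot can be restored the same way.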
Data science projects are sometimes quite large, requiring developers, data engineers, and data scientists to work on small sections of massive builds or models. Thus, a sophisticated open-source distributed SCM such as Git provides the efficiency and speed that professionals require.
Aside from commercial or private projects, everyday data science projects on GitHub benefit heavily from dedicated spaces for discussion and from GitHub’s massive developer community. You can resolve doubts by asking other members of the community, or simply invite others to help you fix bugs and glitches.
A simple example of public contribution would be continuous security fixes for open-source projects. Git outperforms other distributed version control systems and source code management tools with its powerful performance and convenient tooling. Developers and data science professionals can also use local branching, staging areas, and multiple workflows through GitHub.
For commercial projects, GitHub’s security remediation tools allow us to identify vulnerabilities so that we can patch compromised portions of the code in private workspaces. Matrix builds also allow us to build and test our deployments on all kinds of platforms, including ARM devices and containers (in addition to traditional operating systems).
We can also automate any part of the testing and deployment process with the help of GitHub actions. More than anything, commercial GitHub spaces can be scaled to any size with massive cloud repositories, allowing us to branch out build after build.
Data scientists also benefit a lot from GitHub’s software package registry, the largest in the world. There are plenty of community-approved (verified) projects that you can use to speed up your development processes and write less code. Once you finish building on top of an existing project, you can share it with your team or make it available to the public through npm and GitHub Packages.
You can do all of this while rewinding unwanted changes and keeping your team members in sync. Because Git repositories preserve complete history, you can keep doing this regardless of how many versions were created or how many times code was committed to the original build.
The source code of an existing project is too important to risk compromising; version control is therefore absolutely essential. Making changes directly to the active or official source code is extremely risky and might cause the entire software to crash without warning.
Without proper version control, data scientists and engineers cannot keep up with the changes as data science projects grow or scale up. Carefully administered merging and branching helps companies improve their models and software. Every individual involved in a project has their own copy of the code repository, containing a history of every single change.
Data science projects were traditionally hosted in a central version control space where all changes were recorded. However, these were not distributed systems, and centralized tools such as Subversion are not as effective as branching out to separate repositories. Git is a DVCS, or Distributed Version Control System, offering a distributed architecture that allows much more freedom and safety in code contribution.
How to Use GitHub for Data Science Workflows
GitHub is used for data science projects due to the availability of secure and private code repositories that have limitless capacities. Git also helps enterprises manage team members and collaborators more effectively than any other development platform.
For data science professionals, committing code in Git is similar to saving a file or contributing and managing data; it also helps them build essential repository skills that carry over to automation, model retraining, and consistent database management.
Gitflow is a branching and merging model that can prove to be extremely useful for data science professionals. This best-practice model focuses on using multiple primary branches and feature branches to sustain a continuous delivery workflow.
Gitflow is crucial for DevOps, and most companies prefer CI/CD pipelines for any kind of project, including data science projects. Gitflow favors larger commits and long-lived branches, delaying the merge of a feature branch until the feature is reviewed and finalized. This lets projects stick to scheduled release cycles while still accepting commits into the primary branches for testing.
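The branch layout above can be sketched with plain Git commands. The branch names follow the common Gitflow convention (`develop` for integration, `feature/*` for feature work); the repository and file below are illustrative:

```shell
# Sketch: a Gitflow-style feature branch, merged into develop after review.
set -e
flow_repo=$(mktemp -d)
cd "$flow_repo"
git init -q
git config user.email "demo@example.com"
git config user.name  "Demo"
git commit -q --allow-empty -m "Initial release"

git branch develop                           # long-lived integration branch
git checkout -q -b feature/retrain develop   # feature work starts from develop

echo "retrain step" > pipeline.py
git add pipeline.py
git commit -qm "Add retraining step"

git checkout -q develop
git merge -q --no-ff feature/retrain -m "Merge reviewed feature"
git branch -d feature/retrain >/dev/null     # feature branch retired after merge
git branch                                   # the primary branches remain
```

The `--no-ff` merge keeps an explicit merge commit in the history, so every reviewed feature remains visible as a unit on the `develop` branch.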
Now, let us find out how we can use GitHub for collaborative data science workflows.
Once you have been invited into a project in your organization or you find a public project you want to get involved with, you will be able to use GitHub’s IDE to make the changes you want to the working build.
You can use any language, such as Python, C++, or R, depending on the project in front of you. You can clone the code into your personal repository, or make your changes on a branch of the build and submit the working code through a pull request.
Now, this branch belongs to you but can also be seen by an admin or approved team member of the project. If your commit is found useful after review, the code will be added to the build and deployed as scheduled. During review and testing, the build manager checks what changes you have made.
If your commit is not accepted, you can still keep your changes in your personal repository, where a copy of the original source is also branched; this is done with the ‘fork’ action on GitHub. Conceptually, a Git workflow resembles adding data to a database, with one key difference: every change made to a Git database is undoable and can be recalled at any time.
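The clone-branch-push flow described above can be sketched locally. Here a local bare repository stands in for the project hosted on GitHub; the branch name and file are illustrative, and the final step on GitHub itself would be opening a pull request:

```shell
# Sketch: clone a shared repository, branch, and push the proposed change.
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare upstream.git      # stands in for the project on GitHub

git clone -q upstream.git mycopy     # your personal copy of the project
cd mycopy
git config user.email "demo@example.com"
git config user.name  "Demo"
git commit -q --allow-empty -m "Base build"
git push -q origin HEAD              # seed the shared history

git checkout -q -b fix/typo          # your branch of the build
echo "patched" > notes.txt
git add notes.txt
git commit -qm "Propose a fix"
git push -q origin fix/typo          # on GitHub, you would now open a pull request
```

Because every clone carries the full history, the shared repository and your copy each hold the complete record of the build, which is exactly the distributed safety the article describes.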
It is almost impossible to get Git to completely erase important data or corrupt builds, because every commit in the database is replicated across the repositories it is pushed to. This makes it easy to recover seemingly lost data and undo unwanted changes.
There are three states that are essential for Gitflow:
- Modified: Files have been changed but not yet committed to the database.
- Staged: Modified files have been marked in their current versions to go into the next commit.
- Committed: Snapshots of the changes have been safely stored in your local repository.
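You can watch a file move through these three states with `git status`. The repository and file below are illustrative:

```shell
# Sketch: one file passing through modified, staged, and committed states.
set -e
state_repo=$(mktemp -d)
cd "$state_repo"
git init -q
git config user.email "demo@example.com"
git config user.name  "Demo"
echo "raw data" > data.csv
git add data.csv
git commit -qm "Baseline"

echo "new data" >> data.csv
git status --short        # " M data.csv" -> modified, not yet staged
git add data.csv
git status --short        # "M  data.csv" -> staged for the next commit
git commit -qm "Update data"
git status --short        # no output: the snapshot is committed
```

The position of the `M` in the short status tells you which state the file is in: the right column is the working tree, the left column is the staging area.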
Now that we have understood the three states, we can move on to the three sections of importance in a Git workflow.
1. The working tree: a single checkout of one version of the project, extracted from the compressed database in the Git directory and placed on disk for you to use or modify.
2. The staging area: a file that stores the data about what will go into your next commit. This file is known as the ‘index’ in Git.
3. The Git directory: where the object database and metadata of your project are stored. This is the core of Git, and it is what gets copied when you clone a repository.
Thus, a collaborative data science workflow in Git first involves modifying files in the working tree and then selectively staging those changes to plan your next commit. Your initial changes are added only to the staging area, not to the full build. Finally, you make a commit, which takes the files as they are in the staging area and permanently stores that snapshot in your Git directory.
For example, if you wish to retrain a model with new data, then you can follow this workflow to make the necessary changes in data and build or train your new model on your personal repository.
Gitflow allows multiple collaborators to work on a single data science project while keeping the original source code, or master branch, intact. Data science projects benefit significantly from constant reviewing, documentation, testing, and commits aimed at gradually improving the master build.
These kinds of distributed workflows help keep the working code stable while still leaving room for developers and contributors to fix, patch, and add deliverables.
When considering a data science online course or online certificate courses offering data science training, it is also worth using GitHub to set up your code repository and host your projects. A data science course is bound to have case studies, assignments, and projects, which will greatly benefit from a Git collaboration workflow.