Data Versioning for CD4ML – Part 1

(This post is the first of a two-part series on a component of CD4ML: data versioning, and how it can be incorporated into an end-to-end CI/CD workflow. This first part presents a preface and a high-level overview of the topic, ending with an introduction to DVC. The second part takes a deeper, more technical dive to give the audience a better understanding of how the tool can be applied to the different aspects of a workflow.)

Table of Contents

  1. Preface
  2. The Data Of It All: Possible issues when working with data.
  3. Tracking & Versioning Data: Tools available for issues raised.
  4. Enter DVC: Introduction to Data Versioning Control tool.
  5. Conclusion

Preface

A few weeks ago, a colleague of mine at AI Singapore published this article highlighting the culture and organisational structure we have adopted when working on data science/AI projects. Having been exposed to multiple projects across varying domains, there’s no doubt that we need to go full-stack. Our AI Apprenticeship Programme, conducted in tandem with the 100E initiative, groups individuals into teams in which 3 or 4 of them do the effective engineering work; these individuals are the main parties churning out the deliverables set throughout the project stint. The nature of the programme has several implications:

  • we have a team of apprentices where each individual is discovering and experiencing the fundamentals of a data science workflow, be it reproducibility, good coding practices, or CI/CD.
  • while differing roles & responsibilities are eventually delegated, an emphasis is placed on them being cognizant of the synergy required to streamline a workflow that spans the whole team.

As the main purpose of this apprenticeship programme is to equip these individuals with the deep skills required to successfully deliver AI projects, it is only natural that they are exposed to the cross-functional roles of a data science team.

The Data Of It All

With that said, there’s one thing common across all the projects: the first stage involves data shared by the project sponsor. To better understand the nature and complexity of the problem at hand, every apprentice and engineer has to be exposed to the tasks of data cleaning, Exploratory Data Analysis (EDA) and the creation of data pipelines. Even though this set of tasks would eventually be delegated to designated individuals, no member of the team should be excused from getting their hands dirty with one of the most time-consuming and crucial components of a data science workflow. After all, if one wishes to experience the realities of data science work, then there’s no way they should be deprived of working with real enterprise data and all its grime, gunk and organisational mess. Ah, the short-lived bliss of downloading datasets with the click of a single button (hi Kaggle!).

So, it’s been established that the whole team would somehow, one way or another, come into contact with the data relevant to the problem statement. The consequence of this is that the initial raw state of the data would branch off into different states, versions and structures, attributable to any of the following (non-exhaustive) factors:

  • engineered features that are formulated by different individuals.
  • differing preferences of directory structures.
  • differing definitions of states of data.

The aforementioned pointers can be further exacerbated by:

  • altered schema/format/structure of raw data, which can occur multiple times if the project sponsor does not have a streamlined data engineering pipeline or no convention is formalised early on.
  • batched delivery of raw data by project sponsor/client.

All of this further adds to the mess that already exists within the raw data itself. Without proper deliberation in the data preparation stage, this could easily snowball into a huge obstacle for the team to overcome before progressing to the modelling stage. The thing is, with formats and structures of processed data, just like with the source code of a project, there isn’t a single acceptable answer. Two different implementations of the source code can still deliver the same outcome. What’s important is to have these different versions of code tracked, versioned and committed to a repository so that developers can move back and forth between them should they choose to. Hence the reason why we have Git, SVN or Mercurial: version control systems (VCS) which allow us to track changes in source code files. But what about data?

See, code (text) files can easily be tracked as they do not take up much space, and every version of such small files can be retained by the repository. However, it’s not advisable to do the same with data, as this could easily cause the size of your repository to blow up. So what are the different ways available for versioning and tracking differing iterations and formats of data? Let’s explore.

Tracking & Versioning Data

Some ways to track and version data:

  1. Multiple (Backup) Folders

With this method, one keeps a separate folder for EVERY version of the processed data. A spreadsheet could be used to track the different versions of the processed data, as well as each version’s dependencies and inputs. If a data scientist would like to refer to an old version of the data because a more recent one does not look promising, they would simply consult the spreadsheet and provide the location of that old version to the script. Now, this is the most straightforward method of versioning and tracking datasets, and the one with the lowest learning curve and barrier, but it is in no way efficient or safe to implement. Data files can easily be corrupted this way and, unless proper permission controls are configured, it is all too easy for someone to accidentally delete or modify the data. Many organisations still do this for their data analysis or data science workflows, and as it is the easiest and least costly approach to implement, it is understandable why they resort to it. However, such a tedious and inelegant method should prompt one to look for better ways.

repository
├── src
├── data
│   ├── raw
│   │   ├── labels
│   │   └── images
│   └── processed
│       ├── 210620_v1
│       │   ├── train
│       │   └── test
│       ├── 210620_v2
│       │   ├── train
│       │   ├── test
│       │   └── holdout
│       ├── 010720_v1
│       └── final_v1
├── models
└── .gitignore

Figure 1: A sample folder tree containing multiple folders for different versions of processed datasets.

Data Folder Name    State        Data Pipeline Commit ID
210620_v1           Processed    54b11d42e12dc6e9f070a8b5095a4492216d5320
210620_v2           Processed    fd6cb176297acca4dbc69d15d6b7f78a2463482f
010720_v1           Processed    ab0de062136da650ffc27cfb57febac8efb84b8d
final_v1            Processed    8cb8c4b672d0615d841d1c797779ee2e0768a7f3

Figure 2: A sample table/spreadsheet that can be used to track data folder versions with commit IDs of pipelines used to generate these artefacts.

  2. Git Large File Storage (Git-LFS)

Git-LFS allows one to push files that exceed the file size limits imposed by repository hosting platforms. This solution leverages special servers hosted by these platforms, each of which sets its own limits for Git-LFS uploads. For example, GitHub Free allows a maximum file size of 2GB to be uploaded to GitHub’s LFS servers. With that said, Git-LFS was not created with data science use cases in mind.
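
As a quick illustration, tracking a dataset with Git-LFS looks something like the following (a minimal sketch, assuming the git-lfs extension is installed and the hosting platform supports it):

# One-time setup of the Git-LFS hooks for the repository
$ git lfs install

# Tell Git-LFS to track CSV files; the pattern is recorded in a .gitattributes file
$ git lfs track "*.csv"

# Commit as usual; the file contents go to the LFS server while Git stores lightweight pointers
$ git add .gitattributes data/dataset.csv
$ git commit -m "Track dataset with Git-LFS"
$ git push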

  3. Pachyderm

A much more elegant solution than the aforementioned ones is Pachyderm. Pachyderm allows for the formulation and chaining of multiple different parts of a data pipeline and, in addition to this orchestration function, it has a data versioning function that keeps track of data as it is processed. It is built on Docker and Kubernetes; while it can be installed locally, it is better deployed and run in a Kubernetes cluster. While this would be an optimal choice for many data science teams out there, smaller or less mature organisations might not find this solution that accessible. Also, having to maintain a Kubernetes cluster without an engineer trained in infrastructure or DevOps would pose additional hurdles. So, is there an alternative that does not pose as high a barrier to adoption? Well, that’s where DVC comes in.

Enter DVC

Three years since its first release, DVC by Iterative.ai is an open-source version control system that has gained a lot of traction within the data science community. It is built for versioning ML pipelines with a great emphasis on reproducibility. One of the best things about it (at least in my opinion) is that it is incredibly easy to install, and its usage is mainly through the command-line interface.

(Note: My talking about DVC here is to raise awareness, summarise some of its key features and hopefully pique your interest. DVC is well-documented and has a forum that cultivates a community, with many of its users blogging about it. This introductory coverage of DVC may feel somewhat redundant, but it is my way of presenting a prequel to the deeper technical coverage to come in Part 2 of this series.)

Installation

Alright. So, how does one get started with DVC? Let’s start with the installation. To install DVC within a Python virtual environment…:

$ pip install dvc

…or to install it system-wide (for Linux):

$ sudo wget \
       https://dvc.org/deb/dvc.list \
       -O /etc/apt/sources.list.d/dvc.list
$ sudo apt update
$ sudo apt install dvc

# For Mac, using Brew:
# brew install dvc

# For Windows, using Chocolatey:
# choco install dvc

That’s it. There is no need to host a server, run a container cluster or pay for the tool. The cache of data/artefact versions can be stored locally (say, in a tmp folder), or on remote storage for shared access. At AI Singapore, we are using an on-premise S3-compatible data store (Dell ECS) as remote storage.
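
As a rough sketch of what configuring such a remote might look like (the remote name, bucket and endpoint below are placeholders, and S3-compatible storage may require installing the S3 extra, e.g. pip install "dvc[s3]"):

# Add an S3-compatible remote and make it the project's default remote
$ dvc remote add -d ecs-remote s3://my-bucket/dvc-cache

# Point DVC at the on-premise endpoint instead of AWS S3
$ dvc remote modify ecs-remote endpointurl https://ecs.example.com

# Credentials are picked up through the usual AWS mechanisms
# (environment variables, ~/.aws/credentials, etc.)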

Features

There are a lot of things one can do with this tool, and I am sure more are to come, judging by the many issue tickets submitted to its GitHub repository. Let me list these features in my own words:

  • data versioning with the option to store caches on supported storage types and tracking through Git.
  • language & framework agnostic (you do not need Python to use the tool).
  • ability to reproduce artefacts generated by pipeline runs tracked through DVC’s commands.
  • compare model performances across different training pipeline runs (a small sketch of this follows the list).
  • easy integration and usage with Iterative.ai’s other tool CML.
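
To give a flavour of the metrics comparison point, here is a hedged sketch. It assumes the training stage declares a metrics file (say, metrics.json, a name chosen purely for illustration) in dvc.yaml, something covered more concretely in Part 2:

# Display metrics produced by the current pipeline run
$ dvc metrics show

# Compare metrics in the workspace against those in the last Git commit
$ dvc metrics diff

# Or compare against a specific revision
$ dvc metrics diff HEAD~1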

Mechanism

One might wonder: how does DVC enable the tracking of datasets? Well, each data file or artefact being tracked is linked to a file with the extension .dvc. These .dvc files are simply YAML-formatted text files containing hash values that track the states of the files. (Do look at this page of the documentation for a reference of the files relevant to DVC’s mechanism.) Since these text files are just as lightweight as the code files trackable by Git, they can be committed, tracked and versioned.


Figure 3: Diagram showcasing the relationship between the components of DVC.

The diagram above (taken from DVC’s website) shows how the file model.pkl is tracked through the file model.pkl.dvc, which is committed to a Git repository. The model file has a reflink to the local cache, and it changes according to the version you call upon through DVC’s commands. (For an explanation of reflinks, one can look into this article.) There’s more to how DVC versions and tracks artefacts and pipelines, but this is the basic coverage.

Basic Workflow

So what does a typical basic workflow with DVC look like? Assuming one has installed DVC, the following commands can be executed to start versioning files:

Let’s create/assume this project folder structure:

$ mkdir sample_proj
$ cd sample_proj
$ mkdir data models
$ wget -c https://datahub.io/machine-learning/iris/r/iris.csv -O ./data/dataset.csv
$ git init
$ tree
.
├── data
│   └── dataset.csv
└── models
  1. Initialise DVC workspace
$ dvc init
$ ls -la
drwxr-xr-x user group .dvc
drwxr-xr-x user group 352 .git
drwxr-xr-x user group 128 data
drwxr-xr-x user group  64 models

This command creates a .dvc/ folder to contain configuration and cache files, which are mainly hidden from the user. This folder is automatically staged to be committed to the Git repository. Do note that this command by default expects a Git repository; the flag --no-scm is needed to initialise DVC in a folder that is not a Git repository.

  2. Start tracking files with DVC
$ dvc add data/dataset.csv
# For tracking a data folder
# dvc add ./data

To start tracking a file, run the command above with reference to either a directory or a file that is meant to be tracked. In this case, we are tracking just the dataset.csv file itself. A file dataset.csv.dvc would be created within the /data folder, to be committed to the Git repo; its contents would look something like what is shown below.
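
For illustration, the generated .dvc file is just a small YAML text file (the hash value here is made up for the sketch):

$ cat data/dataset.csv.dvc
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: dataset.csv

Note that dvc add also writes the data file’s name into a .gitignore within the /data folder, so that Git itself does not track the raw data.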

  3. Track a pipeline/workflow stage along with its dependencies and outputs

Let’s get ahead of ourselves and assume that we have created a Python script, train.py, to train a model, where one of its inputs is the dataset.csv file and its output is the predictive model itself: model.pkl. (A minimal sketch of such a script follows.)
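
Here is one hedged take on what such a train.py could look like; it assumes pandas and scikit-learn are available, and treats the last column of the CSV as the target purely for illustration:

# train.py: a minimal, illustrative training script
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Dependency declared to DVC: the raw dataset
df = pd.read_csv("data/dataset.csv")

# Assume the last column is the target and the rest are features
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Output declared to DVC: the serialised model
with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)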

$ dvc run \
    -n training \
    -d data/dataset.csv \
    -o models/model.pkl \
    python train.py
Running stage 'training' with command:
        python train.py
Creating 'dvc.yaml'
Adding stage 'training' in 'dvc.yaml'
Generating lock file 'dvc.lock'
$ tree
.
├── data
│   ├── dataset.csv
│   └── dataset.csv.dvc
├── models
│   └── model.pkl
├── dvc.lock
├── dvc.yaml
└── train.py

The contents of the dvc.yaml file would be as such:

stages:
    training:
        cmd: 'python train.py'
        deps:
          - data/dataset.csv
        outs:
          - models/model.pkl
  4. Share and update the cache to remote and commit .dvc files
$ dvc push

This command will push whatever is stored in the local cache to a local remote or any other storage you have configured for the project.
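
For a quick local test (a sketch; the directory path below is arbitrary), a plain directory can serve as that remote:

# Use a local directory as the default DVC remote
$ dvc remote add -d localremote /tmp/dvc-storage

# Push the cached data/artefacts to that directory
$ dvc push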

$ git add data/dataset.csv.dvc dvc.yaml dvc.lock
$ git commit -m "Commit DVC files"
$ git push
  5. Reproduce results

Let’s say we push the cache to remote storage. Following this, someone else can clone that same Git repository (containing the relevant DVC files that have been generated), configure a connection to the remote storage, and run the pipelines listed as stages in dvc.yaml.

$ git clone 
$ dvc pull # to get the exact version of the dataset.csv file that was used by the one who initially ran the pipeline
$ dvc repro

After the above commands are executed, the model.pkl file would be generated from the same set of data (versioned through DVC) and the same pipeline (versioned through Git). This showcases the reproducibility end of this whole tooling mechanism.

Conclusion

What I have shared above is me trying to highlight to you, the audience, how easy it can be to execute a workflow that leverages DVC as a data versioning tool. It’s more of a selling point for you to look deeper into DVC’s documentation and guides; it would be an inefficient use of my time to provide further coverage of the tool when the existing documentation already does an excellent job of it. What I am keener to do is to show you how DVC can be plugged into and integrated with an end-to-end workflow consisting of data preparation, model training and tracking, and model serving, all through an attempt at leveraging a CI/CD pipeline. That is something much more complicated, which the documentation does not cover, and it will be the much more technical coverage that comes in Part 2, to be published very soon. So that’s it for now. Until then, take care.
