Image by dvc.org

Data Version Control: Reproducible Machine Learning

Ever had the situation where you changed something in your data cleaning algorithm, only to notice that the first version was actually the correct way of cleaning the data? No problem if you use Git to version your source code. However, now you might have to wait minutes, in extreme cases even hours, until your data is re-processed and identical to the data set that was already on your hard drive yesterday.
Ever wanted to re-use some old model but training the model again takes ages?
Ever shared some data within your team at work just to realize a month later that all of you use different versions of the data set?

Then Data Version Control (DVC) is the tool you have been missing.
DVC is built to make ML models shareable and reproducible. It is designed to handle large files, data sets, machine learning models, and metrics as well as code. DVC is agnostic to the programming language used, since it is controlled via terminal commands.

DVC is the Smart Way to Manage Different Data Versions

Let me start by simply telling you the pros and cons of using DVC. Actually, I like to mention the things about DVC that are not ideal first. One might argue that DVC is only a clever way to store versions of your data like ‘cleaned_data.csv’, ‘cleaned_data_1.csv’, ‘cleaned_data_final.csv’ and so on on your hard drive. And there is some truth to that. Why do that at all? You might want to keep old versions in case some changes you made turn out to be wrong, so overwriting the files is not an option. But this clutters your hard drive and your project folders. DVC handles this far more elegantly; an explanation follows in the next section.

DVC can help you with a lot of things: it versions your data and your machine learning models, enhances reproducibility, and manages large data files easily.
One more thing to consider: it is easier to use DVC if you structure your project into distinct stages, each packed into its own script, as in the example layout below.
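As an illustration, the project layout assumed throughout this article could look like this (the file and folder names are my choices, not DVC requirements):

project/
├── code/
│   ├── import.py      (stage 1: import the raw data)
│   └── modeling.py    (stage 2: train the model, write metrics)
└── data/
    ├── initialData/   (raw data, tracked via dvc add)
    ├── importedData/  (output of the import stage)
    └── modelData/     (model.pkl and metrics.json)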

An Experiment Management Software

DVC works on top of Git repositories, and its usage closely resembles Git. It uses implicit dependency graphs to make data science projects reproducible. It versions data by creating special files in your Git repository that point to the cache. DVC is programming language agnostic as well as ML library agnostic.
In the following, we will dive a little deeper into the details of versioning your data with DVC. As mentioned, DVC creates special files which point to the cache, a hidden storage, to make versions of data or models available. These DVC-files are then tracked by Git, and versions can be easily accessed via Git tags or branches. Using a cache of course means that you still need a lot of additional disk space, and that need increases proportionally with the number of versions of your data. However, even with large data files, DVC is extremely efficient and fast at retrieving a version from the cache, since it uses reflinks [1]. Furthermore, you can create pipelines, push data to remotes, and pull from them. Pipelines are the real deal when it comes to reproducibility, so that is what we will focus on today.
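To give you an idea of what such a DVC-file contains: it is a small YAML file, roughly of the following form (the hashes here are shortened placeholders, and the exact fields vary with the DVC version):

md5: 52c3d6f0...
outs:
- md5: 3863d0e3...
  path: initialData
  cache: true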

How to use DVC
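First, install DVC, for example via pip:

pip install dvc

or via Homebrew on macOS:

brew install dvc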

Once DVC is installed, you can initialize it inside a Git repository. In your terminal, change directory to your project, then run

dvc init
dvc config core.analytics false --global
git add .dvc/config
git commit -m "Initializing DVC"

We want to turn off any anonymized usage statistics. By using the --global flag, we disable analytics for the active user.
Now, we can start versioning our data manually, or build stages and pipelines which will automatically track data for us and recalculate stages if something changes on the data or code side.

There is also the possibility to push our data to a remote storage, as we do with our code (cf. GitHub or GitLab). Hence, it is much easier to share (large) data between team members, and to keep all team members up to date regarding the latest data. You are then able to pass data files around the team in a controllable and trackable way. It also comes in handy when we clone a project: since the DVC-files are tracked by Git, we can easily fetch data files from an existing remote storage. The supported remote types are listed in the documentation [2]; among them are ssh and https as well as Google Drive and some more. To add a local remote, run

dvc remote add -d localremote /path/to/folder

We can push to this remote storage and pull from it as we do with Git; use dvc push and dvc pull for this purpose.
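For example, a team member who clones the project can fetch the data matching the checked-out DVC-files (the repository URL is a placeholder):

git clone https://github.com/your-team/your-project.git
cd your-project
dvc pull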

Define Stages and Pipelines

Our goal is to build a pipeline to easily reproduce results by connecting code and linking the data to its processing steps. Therefore, we transform steps from our project (for example, Python scripts for importing data and for modeling) into DVC stages, which then themselves build the pipeline. The output of each stage will then be tracked by DVC. However, we start with tracking the initial data:

 dvc add data/initialData/ 
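As stated by the prompt message, run the corresponding git add-command; with the folder layout used here, it should look roughly like this:

git add data/initialData.dvc data/.gitignore
git commit -m "track initial data with DVC"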

We now track the entire folder. Using dvc add, there is no need for a separate dvc commit since it is incorporated in the add-command. You can read about updating the initial data, which is not tracked by the pipeline, in the corresponding section below. To build the stage for importing data, run

dvc run \
-d data/initialData/data.csv \
-d code/import.py \
-o data/importedData/data.csv \
-f import.dvc \
python3 code/import.py

in the command line. As stated by the prompt message, run the corresponding git add-command. Running the stage executes the script and automatically tracks all files we defined as outputs via -o. We define dependencies using -d and name the stage via -f. We proceed with the modeling step. One of its dependencies is a file we already track, since it is the output of the previous stage. Also, DVC provides the possibility to track metrics associated with our models (see -m):

dvc run \
-d data/importedData/data.csv \
-d code/modeling.py \
-o data/modelData/model.pkl \
-m data/modelData/metrics.json \
-f model.dvc \
python3 code/modeling.py
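For reference, here is a minimal sketch of what code/modeling.py could look like; the column name, model choice, and metric are illustrative assumptions, not anything prescribed by DVC:

import json
import os
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Dependency of the stage (-d): the output of the import stage
data = pd.read_csv("data/importedData/data.csv")
X = data.drop(columns=["target"])  # hypothetical target column
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

os.makedirs("data/modelData", exist_ok=True)

# Output tracked by DVC (-o)
with open("data/modelData/model.pkl", "wb") as f:
    pickle.dump(model, f)

# Metrics file tracked by DVC (-m)
metrics = {"accuracy": float(accuracy_score(y_test, model.predict(X_test)))}
with open("data/modelData/metrics.json", "w") as f:
    json.dump(metrics, f)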

Again, as stated by the prompt message, run the corresponding git add-command. Our pipeline is now complete; we can inspect it via

dvc pipeline show --ascii model.dvc

If everything has worked, we now push the data, commit all changes and the Python scripts to Git, and give it a tag. This way, we can later switch between different versions of the project, code-related as well as between different data or model versions.

dvc push
git add code/*.py
git commit -m 'v1.0 of the project'
git tag -a v1.0 -m 'v1.0 of the project'

If we want to reproduce our results end-to-end, we simply use

dvc repro model.dvc

but nothing should happen, since neither code nor data has changed. If the code in modeling.py changes, DVC will recognize that something changed (this time only the code) and will rerun just the stages affected by it. If you do this and, additionally, give this new version a new Git tag, you can compare the metrics of the different versions via

dvc metrics show -T

It is now possible to switch to older versions of the data but keep the current code:

git checkout v1.0 import.dvc
dvc checkout
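If you now want to retrain the current code on this older version of the data, you can rerun the pipeline (note that this overwrites the outputs of the affected stages):

dvc repro model.dvc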

Obviously, we can also go back to our first version of the code:

git checkout v1.0
dvc checkout

The dvc checkout-command goes through the DVC-files and retrieves the correct instance of each file from the DVC cache. If data points are added during the course of the project, or flaws in the data are corrected, you no longer need to save the new data in a new folder with a slightly different name (if this is something you have done in the past). In essence, it is still the same data regarding its structure and most of its content. Therefore, a much better approach is to use DVC instead of creating a bunch of new folders which contain mostly unchanged data sets.

Versioning Data without Pipelines

However, the initial data is not tracked by the above pipeline. Therefore, we have to manually tell DVC about changes. Ideally, this should not happen often, as we would like to have only one data delivery at the beginning of a project. In real life, this is unfortunately not the case: often we have to deal with multiple data deliveries during the course of a project. To tell DVC about changes in the initial data, use

dvc remove data/initialData.dvc
-- (update the initial data) --
dvc add data/initialData/
git add data/initialData.dvc
git commit -m "new initial data"

If the cache type reflink works on your system, un-tracking the file with the dvc remove-command is not necessary. However, typing one additional line whenever the initial data gets changed or updated, in order to avoid data corruption, should not be a hurdle.
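After updating and re-adding the initial data, the rest of the pipeline can be brought up to date by rerunning it:

dvc repro model.dvc

DVC will notice that the dependency of the import stage has changed and re-run all affected stages.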

Obviously, this is also the procedure for tracking any other data file which you added manually to DVC. If you choose not to use pipelines, this is the way to go.

Using Git Hooks

We can integrate DVC even tighter into Git by using hooks. Install these by running

dvc install

Three hooks are now activated (if you are inside a Git repository and DVC is initialized):

  • Checkout: git checkout retrieves the DVC-files corresponding to that version. There is no need to run dvc checkout since it is now done automatically, as illustrated after this list.
  • Commit/Reproduce: A change committed to Git could produce new data or change data. Now, the user is reminded to run either dvc commit or dvc repro.
  • Push: dvc push is executed every time we push changes to the Git remote. Hence, we cannot forget to upload new or updated data.
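With the hooks installed, switching the project version, data included, boils down to a single command:

git checkout v1.0

The checkout hook then takes care of retrieving the matching data files from the cache.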

Conclusion

DVC enables us to easily keep track of different data versions by working on top of Git. However, the used disk space grows proportionally with the number of data versions. We can build pipelines so that stages of the project are automatically re-run if the data (or the code) changes. Moreover, data can be pushed to a remote storage, and versions can therefore be easily shared.
DVC makes machine learning reproducible.

‘In science consensus is irrelevant. What is relevant is reproducible results.’ - Michael Crichton

References

[1] https://dvc.org/doc/user-guide/large-dataset-optimization
[2] https://dvc.org/doc/command-reference/remote

Marian Biermann
Data Scientist