
GoCardless adopts DVC for ML data versioning and pipeline management

GoCardless had no automated data version control for their ML processes, which ran as long-running Python scripts, making model provenance tracking and reproducibility of ML artefacts difficult.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Git-synced data versioning

DVC is paired with pre-commit hooks so that data versioning runs automatically in the background, in step with the usual Git workflow.
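The idea behind Git-synced data versioning can be illustrated with a minimal Python sketch. It is not DVC's implementation: the `write_pointer` helper, the `.pointer.json` suffix, and the sample file are hypothetical. It only shows the general pattern, in which a large data file is hashed and a small pointer file carrying that hash is what Git tracks, so a change in the data shows up as a change in a committed file. DVC records an MD5 hash in its own `.dvc` files in a comparable way.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Hash a data file and write a small pointer file next to it.

    Hypothetical helper: the pointer file stands in for the small
    metadata file that would be committed to Git instead of the data.
    """
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    record = {
        "path": data_path.name,
        "md5": digest,
        "size": data_path.stat().st_size,
    }
    pointer.write_text(json.dumps(record, indent=2))
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"  # hypothetical dataset

    # First version of the data and its pointer.
    data.write_text("id,label\n1,0\n")
    first = json.loads(write_pointer(data).read_text())

    # The data changes, so the regenerated pointer changes too.
    data.write_text("id,label\n1,0\n2,1\n")
    second = json.loads(write_pointer(data).read_text())

# A differing hash is what a Git diff of the pointer file would surface.
changed = first["md5"] != second["md5"]
```

A pre-commit hook would run a step like this before each commit, which is why the data version stays aligned with the code version without extra manual commands.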

Tools used

DVC, pre-commit, Jupytext, GCS

Outcome

GoCardless achieved intuitive Git-like data versioning, automated model metrics tracking alongside code, and safe Jupyter notebook peer review via DVC; they recommend DVC for data versioning but plan to migrate away from its pipelining feature.

What failed first

DVC's pipelining had critical gaps: no parallel stage execution, atomic output handling that wipes full datasets on re-run, a single lock file preventing multiple execution environments, and insufficient expressiveness for dynamic YAML-based pipeline definitions, forcing workarounds like dummy dependency files.
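To make the parallel-execution gap concrete, the following Python sketch groups the stages of a small dependency graph into batches that could safely run at the same time. The stage names and the `parallel_batches` helper are hypothetical and are not taken from GoCardless's pipeline or from DVC; the sketch only shows the kind of scheduling that a runner without parallel stage execution leaves unused.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
stages = {
    "prepare": set(),
    "features_a": {"prepare"},
    "features_b": {"prepare"},
    "train": {"features_a", "features_b"},
}


def parallel_batches(graph):
    """Group stages into batches whose dependencies are all satisfied.

    Stages within one batch do not depend on each other, so a scheduler
    could run them concurrently.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every stage runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


batches = parallel_batches(stages)
```

Here `features_a` and `features_b` land in the same batch because both depend only on `prepare`. A strictly sequential runner executes them one after the other even though neither needs the other's output.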

Source

https://mlops.community/blog/experience-report-data-version-control-dvc-for-machine-learning-projects


Grounding & classification

Source type: technical build writeup

20 fields verified against source quotes.

fraud detection, predictive analytics, code diff pr, builder submitted, failure mode described, named customer, production runtime claimed, tools described, workflow described, financial services, employee productivity, technical build writeup, back office ops, data sync enrichment