Data Version Control (DVC) Framework
A Data Version Control (DVC) Framework is an AI artifact version control framework (for data and models).
- AKA: DVC, Data Version Control, DVC Framework.
- Context:
- It can manage Dataset through Git-like commands, replacing manual versioning approaches.
- It can version Machine Learning Model through version control workflows, enabling model lifecycle tracking.
- It can track ML Experiment through Git tags and Git branches, supporting experiment management.
- It can store Data Asset in storage backends, including cloud storage, SSH server, and HDFS system.
- It can maintain Project Reproducibility through version control mechanisms, ensuring reproducible workflows.
- It can replace Spreadsheet Tool for knowledge management, providing structured data tracking.
- It can eliminate Ad-Hoc Script for model versioning, standardizing version management processes.
- ...
- It can range from being a Simple Version Control Tool to being a Complete MLOps Solution, depending on its implementation scope.
- It can range from being a Local Development Utility to being an Enterprise Collaboration Platform, depending on its deployment scale.
- ...
- It can integrate with Git Workflow for version control process.
- It can connect to Cloud Provider for remote storage.
- It can support CI/CD Pipeline for automated deployment.
- ...
- Examples:
- DVC Implementations, such as:
- Open Source Versions, such as:
- DVC (2024) with advanced MLOps features.
- DVC (2020) introducing stable command interface.
- DVC (2017) establishing core versioning capability.
- Enterprise Deployments, such as:
- DVC Use Cases, such as:
- Data Science Projects, such as:
- Team Collaborations, such as:
- Knowledge Repository replacing spreadsheet sharing.
- Team Ledger System replacing document sharing.
- ...
- Counter-Examples:
- Excel Spreadsheet, which lacks version control capability.
- Google Doc, which lacks data versioning feature.
- Manual Version System, which lacks automated tracking.
- Basic Script Solution, which lacks standardized workflow.
- See: MLflow, ML DevOps, Version Control System, ML Pipeline Tool.
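- Illustration: The Git-like dataset versioning described in the Context above can be sketched conceptually as a content-addressed cache plus a small pointer file that a Git repository could track in place of the data itself. This is a minimal sketch under stated assumptions, not DVC's actual implementation or file format: the function names (`track`, `restore`, `demo`), the `.pointer.json` suffix, the cache layout, and the choice of SHA-256 are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Tuple


def file_digest(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents (illustrative hash choice)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file.

    The pointer file (hypothetical format) is what a Git repository would
    version instead of the potentially large data file.
    """
    checksum = file_digest(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        cached.write_bytes(data_file.read_bytes())
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"hash": checksum, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named in a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["hash"][:2] / meta["hash"][2:]
    target = pointer.with_name(meta["path"])
    target.write_bytes(cached.read_bytes())
    return target


def demo() -> Tuple[bytes, bytes]:
    """Track two versions of a dataset, then restore each from its pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        cache = workspace / "cache"
        data = workspace / "train.csv"

        data.write_bytes(b"id,label\n1,cat\n")
        pointer_v1 = track(data, cache).read_text()

        data.write_bytes(b"id,label\n1,cat\n2,dog\n")
        pointer_v2 = track(data, cache).read_text()

        # Switching pointer contents stands in for checking out a Git revision.
        pointer_path = workspace / "train.csv.pointer.json"
        pointer_path.write_text(pointer_v1)
        restored_v1 = restore(pointer_path, cache).read_bytes()
        pointer_path.write_text(pointer_v2)
        restored_v2 = restore(pointer_path, cache).read_bytes()
        return restored_v1, restored_v2


if __name__ == "__main__":
    first, second = demo()
    print(len(first), len(second))
```

In this sketch, swapping the pointer file's contents plays the role of checking out a different Git revision: the pointer is small enough to live in version control, while the cache directory could in principle be synchronized to a storage backend such as those listed in the Context.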
References
2020a
- https://dvc.org/doc
- QUOTE: Data Version Control, or DVC, is a data and ML experiments management tool that takes advantage of the existing engineering toolset that you're already familiar with (Git, CI/CD, etc.)
2020b
- https://github.com/iterative/dvc
- QUOTE: Data Version Control or DVC is an open-source tool for data science and machine learning projects. Key features:
- Simple command line Git-like experience. Does not require installing and maintaining any databases. Does not depend on any proprietary online services.
- Management and versioning of datasets and machine learning models. Data is saved in S3, Google cloud, Azure, Alibaba cloud, SSH server, HDFS, or even local HDD RAID.
- Makes projects reproducible and shareable; helping to answer questions about how a model was built.
- Helps manage experiments with Git tags/branches and metrics tracking.
- DVC aims to replace spreadsheet and document sharing tools (such as Excel or Google Docs) which are being used frequently as both knowledge repositories and team ledgers. DVC also replaces both ad-hoc scripts to track, move, and deploy different model versions; as well as ad-hoc data file suffixes and prefixes.