2022 DataQualityFundamentals
- (Moses et al., 2022) ⇒ Barr Moses, Lior Gavish, and Molly Vorwerck. (2022). “Data Quality Fundamentals.” O'Reilly Media, Inc.
Subject Headings: Data Quality.
Notes
Cited By
Quotes
Book Overview
Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? If you answered yes to any of these questions, this book is for you. These problems affect almost every team, yet they're usually addressed on an ad hoc, reactive basis.
Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.
- Build more trustworthy and reliable data pipelines
- Write scripts to make data checks and identify broken pipelines with data observability (see the sketch after this list)
- Learn how to set and maintain data SLAs, SLIs, and SLOs
- Develop and lead data quality initiatives at your company
- Learn how to treat data services and systems with the diligence of production software
- Automate data lineage graphs across your data ecosystem
- Build anomaly detectors for your critical data assets
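On the data-checks point above, such a check can be as simple as asserting that a table has been updated recently. Here is a minimal freshness-check sketch; this is our illustration, not code from the book, and the table name, timestamp column, and 24-hour SLA are invented assumptions (swap the SQLite stand-in for your own warehouse connector):

```python
import sqlite3  # stand-in client; replace with your warehouse's connector
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)  # assumed SLA: new data lands at least daily

def is_fresh(conn, table: str, ts_column: str) -> bool:
    """Return True if the table's newest row is within the freshness SLA."""
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if latest is None:
        return False  # an empty table is treated as stale
    last_load = datetime.fromisoformat(latest).replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - last_load <= FRESHNESS_SLA

# Hypothetical usage: flag the (invented) orders table if it looks stale.
# conn = sqlite3.connect("warehouse.db")
# if not is_fresh(conn, "orders", "loaded_at"):
#     print("ALERT: orders has not been updated within its SLA")
```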
Table of Contents
Preface
- Conventions Used in This Book · Using Code Examples · O'Reilly Online Learning · How to Contact Us · Acknowledgments
1. Why Data Quality Deserves Attention
- What Is Data Quality? · Framing the Current Moment · Understanding the "Rise of Data Downtime" · Other Industry Trends Contributing to the Current Moment · Summary
2. Assembling the Building Blocks of a Reliable Data System
- Understanding the Difference Between Operational and Analytical Data · What Makes Them Different? · Data Warehouses Versus Data Lakes · Data Warehouses: Table Types at the Schema Level · Data Lakes: Manipulations at the File Level · What About the Data Lakehouse? · Syncing Data Between Warehouses and Lakes · Collecting Data Quality Metrics · What Are Data Quality Metrics? · How to Pull Data Quality Metrics · Using Query Logs to Understand Data Quality in the Warehouse · Using Query Logs to Understand Data Quality in the Lake · Designing a Data Catalog · Building a Data Catalog · Summary
3. Collecting, Cleaning, Transforming, and Testing Data
- Collecting Data · Application Log Data · API Responses · Sensor Data · Cleaning Data · Batch Versus Stream Processing · Data Quality for Stream Processing · Normalizing Data · Handling Heterogeneous Data Sources · Schema Checking and Type Coercion · Syntactic Versus Semantic Ambiguity in Data · Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka · Running Analytical Data Transformations · Ensuring Data Quality During ETL · Ensuring Data Quality During Transformation · Alerting and Testing · dbt Unit Testing · Great Expectations Unit Testing · Deequ Unit Testing · Managing Data Quality with Apache Airflow · Scheduler SLAs · Installing Circuit Breakers with Apache Airflow · SQL Check Operators · Summary
4. Monitoring and Anomaly Detection for Your Data Pipelines
- Knowing Your Known Unknowns and Unknown Unknowns · Building an Anomaly Detection Algorithm · Monitoring for Freshness · Understanding Distribution · Building Monitors for Schema and Lineage · Anomaly Detection for Schema Changes and Lineage · Visualizing Lineage · Investigating a Data Anomaly · Scaling Anomaly Detection with Python and Machine Learning · Improving Data Monitoring Alerting with Machine Learning · Accounting for False Positives and False Negatives · Improving Precision and Recall · Detecting Freshness Incidents with Data Monitoring · F-Scores · Does Model Accuracy Matter? · Beyond the Surface: Other Useful Anomaly Detection Approaches · Designing Data Quality Monitors for Warehouses Versus Lakes · Summary
5. Architecting for Data Reliability
- Measuring and Maintaining High Data Reliability at Ingestion · Measuring and Maintaining Data Quality in the Pipeline · Understanding Data Quality Downstream · Building Your Data Platform · Data Ingestion · Data Storage and Processing · Data Transformation and Modeling · Business Intelligence and Analytics · Data Discovery and Governance · Developing Trust in Your Data · Data Observability · Measuring the ROI on Data Quality · How to Set SLAs, SLOs, and SLIs for Your Data · Case Study: Blinkist · Summary
6. Fixing Data Quality Issues at Scale
- Fixing Quality Issues in Software Development · Data Incident Management · Incident Detection · Response · Root Cause Analysis · Resolution · Blameless Postmortem · Incident Response and Mitigation · Establishing a Routine of Incident Management · Why Data Incident Commanders Matter · Case Study: Data Incident Management at PagerDuty · The DataOps Landscape at PagerDuty · Data Challenges at PagerDuty · Using DevOps Best Practices to Scale Data Incident Management · Summary
7. Building End-to-End Lineage
- Building End-to-End Field-Level Lineage for Modern Data Systems · Basic Lineage Requirements · Data Lineage Design · Parsing the Data · Building the User Interface · Case Study: Architecting for Data Reliability at Fox · Exercise "Controlled Freedom" When Dealing with Stakeholders · Invest in a Decentralized Data Team · Avoid Shiny New Toys in Favor of Problem-Solving Tech · To Make Analytics Self-Serve, Invest in Data Trust · Summary
8. Democratizing Data Quality
- Treating Your "Data" Like a Product · Perspectives on Treating Data Like a Product · Convoy Case Study: Data as a Service or Output · Uber Case Study: The Rise of the Data Product Manager · Applying the Data-as-a-Product Approach · Building Trust in Your Data Platform · Align Your Product's Goals with the Goals of the Business · Gain Feedback and Buy-in from the Right Stakeholders · Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains · Sign Off on Baseline Metrics for Your Data and How You Measure Them · Know When to Build Versus Buy · Assigning Ownership for Data Quality · Chief Data Officer · Business Intelligence Analyst · Analytics Engineer · Data Scientist · Data Governance Lead · Data Engineer · Data Product Manager · Who Is Responsible for Data Reliability? · Creating Accountability for Data Quality · Balancing Data Accessibility with Trust · Certifying Your Data · Seven Steps to Implementing a Data Certification Program · Case Study: Toast's Journey to Finding the Right Structure for Their Data Team · In the Beginning: When a Small Team Struggles to Meet Data Demands · Supporting Hypergrowth as a Decentralized Data Operation · Regrouping, Recentralizing, and Refocusing on Data Trust · Considerations When Scaling Your Data Team · Increasing Data Literacy · Prioritizing Data Governance and Compliance · Prioritizing a Data Catalog · Beyond Catalogs: Enforcing Data Governance · Building a Data Quality Strategy · Make Leadership Accountable for Data Quality · Set Data Quality KPIs · Spearhead a Data Governance Program · Automate Your Lineage and Data Governance Tooling · Create a Communications Plan · Summary
9. Data Quality in the Real World: Conversations and Case Studies
- Building a Data Mesh for Greater Data Quality · Domain-Oriented Data Owners and Pipelines · Self-Serve Functionality · Interoperability and Standardization of Communications · Why Implement a Data Mesh? · To Mesh or Not to Mesh? That Is the Question · Calculating Your Data Mesh Score · A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh · Can You Build a Data Mesh from a Single Solution? · Is Data Mesh Another Word for Data Virtualization? · Does Each Data Product Team Manage Their Own Separate Data Stores? · Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh? · Is the Data Mesh Right for All Data Teams? · Does One Person on Your Team "Own" the Data Mesh? · Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts? · Case Study: Kolibri Games' Data Stack Journey · First Data Needs · Pursuing Performance Marketing · 2018: Professionalize and Centralize · Getting Data-Oriented · Getting Data-Driven · Building a Data Mesh · Five Key Takeaways from a Five-Year Data Evolution · Making Metadata Work for the Business · Unlocking the Value of Metadata with Data Discovery · Data Warehouse and Lake Considerations · Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh · Moving from Traditional Data Catalogs to Modern Data Discovery · Deciding When to Get Started with Data Quality at Your Company · You've Recently Migrated to the Cloud · Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity · Your Data Team Is Growing · Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues · Your Team Has More Data Consumers Than They Did One Year Ago · Your Company Is Moving to a Self-Service Analytics Model · Data Is a Key Part of the Customer Value Proposition · Data Quality Starts with Trust · Summary
10. Pioneering the Future of Reliable Data Systems
- Be Proactive, Not Reactive · Predictions for the Future of Data Quality and Reliability · Data Warehouses and Lakes Will Merge · Emergence of New Roles on the Data Team · Rise of Automation · More Distributed Environments and the Rise of Data Domains · So Where Do We Go from Here?
1. Why Data Quality Deserves Attention
https://www.oreilly.com/library/view/data-quality-fundamentals/9781098112035/ch01.html
What Is Data Quality?
Data quality as a concept is not novel—“data quality” has been around as long as humans have been collecting data!
Over the past few decades, however, the definition of data quality has started to crystallize as a function of measuring the reliability, completeness, and accuracy of data as it relates to the state of what is being reported on. As they say, you can’t manage what you don’t measure, and high data quality is the first stage of any robust analytics program. Data quality is also an extremely powerful way to understand whether your data fits the needs of your business.
For the purpose of this book, we define data quality as the health of data at any stage in its life cycle. Data quality can be impacted at any stage of the data pipeline, before ingestion, in production, or even during analysis.
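To make "measuring the health of data" concrete, dimensions like completeness can be computed directly from a table. A minimal sketch (our illustration, not the book's code) using pandas, with an invented toy table:

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> pd.Series:
    """Fraction of non-null values per column (1.0 means fully complete)."""
    return df.notna().mean()

# Toy example: user_id is 75% complete, email only 50%.
df = pd.DataFrame({
    "user_id": [1, 2, None, 4],
    "email": ["a@x.com", None, None, "d@x.com"],
})
print(completeness(df))
```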
In our opinion, data quality frequently gets a bad rap. Data teams know they need to prioritize it, but it doesn’t roll off the tongue the way “machine learning,” “data science,” or even “analytics” does, and many teams lack the bandwidth or resources to bring on someone full time to manage it. Instead, resource-strapped companies rely on their data analysts and engineers to manage it on the side, diverting them from projects that are perceived as more interesting or innovative.
But if you can’t trust your data or the data products it powers, how can data users trust your team to deliver value? The phrase “no data is better than bad data” gets thrown around a lot by professionals in the space, and while it holds merit, in practice teams rarely have the option of simply going without data.
Data quality issues (or data downtime) are practically unavoidable given the rate at which most companies grow and consume data. But once we understand how to define data quality, it becomes much easier to measure it and to prevent data downtime from causing issues downstream.
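One way to quantify data downtime, which the authors have described elsewhere, is as the time to detect an incident plus the time to resolve it, summed across incidents. A small sketch, with invented incident records purely for illustration:

```python
from datetime import timedelta

# Hypothetical incidents: (time to detection, time to resolution).
incidents = [
    (timedelta(hours=5), timedelta(hours=2)),
    (timedelta(hours=1), timedelta(hours=8)),
]

# Data downtime ≈ sum of (TTD + TTR) across incidents: 16 hours here.
downtime = sum((ttd + ttr for ttd, ttr in incidents), timedelta())
print(downtime)  # 16:00:00
```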
Framing the Current Moment
Understanding the "Rise of Data Downtime"
Other Industry Trends Contributing to the Current Moment
Summary
The rise of the cloud, distributed data architectures and teams, and the move toward data productization have put the onus on data leaders to help their companies drive toward more trustworthy data (leading to more trustworthy analytics). Achieving reliable data is a marathon, not a sprint, and involves many stages of your data pipeline. Further, committing to improving data quality is much more than a technical challenge; it's very much organizational and cultural, too. In the next chapter, we'll discuss some technologies your team can use to prevent broken pipelines and build repeatable, iterative processes and frameworks with which to better communicate, address, and even prevent data downtime.
...
References
- (Moses et al., 2022) ⇒ Barr Moses, Lior Gavish, and Molly Vorwerck. (2022). “Data Quality Fundamentals.” O'Reilly Media, Inc.