2022 DataQualityFundamentals

Subject Headings: Data Quality.

Notes

Cited By

Quotes

Book Overview

Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? If you answered yes to any of these questions, this book is for you. These problems affect almost every team, yet they're usually addressed ad hoc and reactively.

Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.

  • Build more trustworthy and reliable data pipelines
  • Write scripts to run data checks and identify broken pipelines with data observability (a minimal example follows this list)
  • Learn how to set and maintain data SLAs, SLIs, and SLOs
  • Develop and lead data quality initiatives at your company
  • Learn how to treat data services and systems with the diligence of production software
  • Automate data lineage graphs across your data ecosystem
  • Build anomaly detectors for your critical data assets
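
As a taste of what such data checks look like in practice, here is a minimal sketch of a freshness and null-rate check. It is an illustration only, not code from the book; the "orders" table, its columns, and the SQLite connection are hypothetical stand-ins for your own warehouse.

    import sqlite3
    from datetime import datetime, timedelta, timezone

    conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse

    # Freshness: has the table received rows in the last 24 hours?
    # Assumes loaded_at is stored as an ISO-8601 string with a UTC offset,
    # e.g., "2024-01-01T00:00:00+00:00".
    (last_loaded,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()
    stale = (last_loaded is None
             or datetime.fromisoformat(last_loaded)
             < datetime.now(timezone.utc) - timedelta(hours=24))

    # Completeness: what fraction of a critical column is NULL?
    # (In SQLite, "customer_id IS NULL" evaluates to 0 or 1, so SUM counts NULLs.)
    total, nulls = conn.execute(
        "SELECT COUNT(*), SUM(customer_id IS NULL) FROM orders"
    ).fetchone()
    null_rate = (nulls or 0) / total if total else 1.0

    if stale or null_rate > 0.01:
        print(f"ALERT: stale={stale}, customer_id null rate={null_rate:.2%}")

Chapters 3 and 4 of the book show how tools like dbt tests, Great Expectations, and machine learning-based monitors generalize handwritten checks like these.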

Table of Contents

Preface
   Conventions Used in This Book
   Using Code Examples
   O’Reilly Online Learning
   How to Contact Us
   Acknowledgments
1. Why Data Quality Deserves Attention
   What Is Data Quality?
   Framing the Current Moment
       Understanding the “Rise of Data Downtime”
       Other Industry Trends Contributing to the Current Moment
   Summary
2. Assembling the Building Blocks of a Reliable Data System
   Understanding the Difference Between Operational and Analytical Data
   What Makes Them Different?
   Data Warehouses Versus Data Lakes
       Data Warehouses: Table Types at the Schema Level
       Data Lakes: Manipulations at the File Level
       What About the Data Lakehouse?
       Syncing Data Between Warehouses and Lakes
   Collecting Data Quality Metrics
       What Are Data Quality Metrics?
       How to Pull Data Quality Metrics
       Using Query Logs to Understand Data Quality in the Warehouse
       Using Query Logs to Understand Data Quality in the Lake
   Designing a Data Catalog
   Building a Data Catalog
   Summary
3. Collecting, Cleaning, Transforming, and Testing Data
   Collecting Data
       Application Log Data
       API Responses
       Sensor Data
   Cleaning Data
   Batch Versus Stream Processing
   Data Quality for Stream Processing
   Normalizing Data
       Handling Heterogeneous Data Sources
       Schema Checking and Type Coercion
       Syntactic Versus Semantic Ambiguity in Data
       Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka
   Running Analytical Data Transformations
       Ensuring Data Quality During ETL
       Ensuring Data Quality During Transformation
   Alerting and Testing
       dbt Unit Testing
       Great Expectations Unit Testing
       Deequ Unit Testing
   Managing Data Quality with Apache Airflow
       Scheduler SLAs
       Installing Circuit Breakers with Apache Airflow
       SQL Check Operators
   Summary
4. Monitoring and Anomaly Detection for Your Data Pipelines
   Knowing Your Known Unknowns and Unknown Unknowns
   Building an Anomaly Detection Algorithm
       Monitoring for Freshness
       Understanding Distribution
   Building Monitors for Schema and Lineage
       Anomaly Detection for Schema Changes and Lineage
       Visualizing Lineage
       Investigating a Data Anomaly
   Scaling Anomaly Detection with Python and Machine Learning
       Improving Data Monitoring Alerting with Machine Learning
       Accounting for False Positives and False Negatives
       Improving Precision and Recall
       Detecting Freshness Incidents with Data Monitoring
       F-Scores
       Does Model Accuracy Matter?
   Beyond the Surface: Other Useful Anomaly Detection Approaches
   Designing Data Quality Monitors for Warehouses Versus Lakes
   Summary
5. Architecting for Data Reliability
   Measuring and Maintaining High Data Reliability at Ingestion
   Measuring and Maintaining Data Quality in the Pipeline
   Understanding Data Quality Downstream
   Building Your Data Platform
       Data Ingestion
       Data Storage and Processing
       Data Transformation and Modeling
       Business Intelligence and Analytics
       Data Discovery and Governance
   Developing Trust in Your Data
       Data Observability
       Measuring the ROI on Data Quality
       How to Set SLAs, SLOs, and SLIs for Your Data
   Case Study: Blinkist
   Summary
6. Fixing Data Quality Issues at Scale
   Fixing Quality Issues in Software Development
   Data Incident Management
       Incident Detection
       Response
       Root Cause Analysis
       Resolution
       Blameless Postmortem
   Incident Response and Mitigation
       Establishing a Routine of Incident Management
       Why Data Incident Commanders Matter
   Case Study: Data Incident Management at PagerDuty
       The DataOps Landscape at PagerDuty
       Data Challenges at PagerDuty
       Using DevOps Best Practices to Scale Data Incident Management
   Summary
7. Building End-to-End Lineage
   Building End-to-End Field-Level Lineage for Modern Data Systems
       Basic Lineage Requirements
       Data Lineage Design
       Parsing the Data
       Building the User Interface
   Case Study: Architecting for Data Reliability at Fox
       Exercise “Controlled Freedom” When Dealing with Stakeholders
       Invest in a Decentralized Data Team
       Avoid Shiny New Toys in Favor of Problem-Solving Tech
       To Make Analytics Self-Serve, Invest in Data Trust
   Summary
8. Democratizing Data Quality
   Treating Your “Data” Like a Product
   Perspectives on Treating Data Like a Product
       Convoy Case Study: Data as a Service or Output
       Uber Case Study: The Rise of the Data Product Manager
       Applying the Data-as-a-Product Approach
   Building Trust in Your Data Platform
       Align Your Product’s Goals with the Goals of the Business
       Gain Feedback and Buy-in from the Right Stakeholders
       Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains
       Sign Off on Baseline Metrics for Your Data and How You Measure Them
       Know When to Build Versus Buy
   Assigning Ownership for Data Quality
       Chief Data Officer
       Business Intelligence Analyst
       Analytics Engineer
       Data Scientist
       Data Governance Lead
       Data Engineer
       Data Product Manager
       Who Is Responsible for Data Reliability?
   Creating Accountability for Data Quality
   Balancing Data Accessibility with Trust
   Certifying Your Data
   Seven Steps to Implementing a Data Certification Program
   Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team
       In the Beginning: When a Small Team Struggles to Meet Data Demands
       Supporting Hypergrowth as a Decentralized Data Operation
       Regrouping, Recentralizing, and Refocusing on Data Trust
       Considerations When Scaling Your Data Team
   Increasing Data Literacy
   Prioritizing Data Governance and Compliance
       Prioritizing a Data Catalog
       Beyond Catalogs: Enforcing Data Governance
   Building a Data Quality Strategy
       Make Leadership Accountable for Data Quality
       Set Data Quality KPIs
       Spearhead a Data Governance Program
       Automate Your Lineage and Data Governance Tooling
       Create a Communications Plan
   Summary
9. Data Quality in the Real World: Conversations and Case Studies
   Building a Data Mesh for Greater Data Quality
       Domain-Oriented Data Owners and Pipelines
       Self-Serve Functionality
       Interoperability and Standardization of Communications
   Why Implement a Data Mesh?
       To Mesh or Not to Mesh? That Is the Question
       Calculating Your Data Mesh Score
   A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh
       Can You Build a Data Mesh from a Single Solution?
       Is Data Mesh Another Word for Data Virtualization?
       Does Each Data Product Team Manage Their Own Separate Data Stores?
       Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?
       Is the Data Mesh Right for All Data Teams?
       Does One Person on Your Team “Own” the Data Mesh?
       Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?
   Case Study: Kolibri Games’ Data Stack Journey
       First Data Needs
       Pursuing Performance Marketing
       2018: Professionalize and Centralize
       Getting Data-Oriented
       Getting Data-Driven
       Building a Data Mesh
       Five Key Takeaways from a Five-Year Data Evolution
   Making Metadata Work for the Business
   Unlocking the Value of Metadata with Data Discovery
       Data Warehouse and Lake Considerations
       Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh
       Moving from Traditional Data Catalogs to Modern Data Discovery
   Deciding When to Get Started with Data Quality at Your Company
       You’ve Recently Migrated to the Cloud
       Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity
       Your Data Team Is Growing
       Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues
       Your Team Has More Data Consumers Than They Did One Year Ago
       Your Company Is Moving to a Self-Service Analytics Model
       Data Is a Key Part of the Customer Value Proposition
       Data Quality Starts with Trust
   Summary
10. Pioneering the Future of Reliable Data Systems
   Be Proactive, Not Reactive
   Predictions for the Future of Data Quality and Reliability
       Data Warehouses and Lakes Will Merge
       Emergence of New Roles on the Data Team
       Rise of Automation
       More Distributed Environments and the Rise of Data Domains
   So Where Do We Go from Here?

1. Why Data Quality Deserves Attention

https://www.oreilly.com/library/view/data-quality-fundamentals/9781098112035/ch01.html

What Is Data Quality?

Data quality as a concept is not novel—“data quality” has been around as long as humans have been collecting data!

Over the past few decades, however, the definition of data quality has started to crystallize as a function of measuring the reliability, completeness, and accuracy of data as it relates to the state of what is being reported on. As they say, you can’t manage what you don’t measure, and high data quality is the first stage of any robust analytics program. Data quality is also an extremely powerful way to understand whether your data fits the needs of your business.
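
To make these dimensions concrete, here is a minimal sketch (ours, not the book's) of how completeness and accuracy might be scored over a batch of records; the field names and the validity rule are hypothetical.

    # Hypothetical records; in practice these would come from your pipeline.
    records = [
        {"user_id": 1, "email": "a@example.com", "age": 34},
        {"user_id": 2, "email": None, "age": 29},             # missing field
        {"user_id": 3, "email": "c@example.com", "age": -5},  # invalid value
    ]

    # Completeness: share of records with no missing fields.
    completeness = sum(
        all(v is not None for v in r.values()) for r in records
    ) / len(records)

    # Accuracy (as a proxy): share of records satisfying a domain rule,
    # here the hypothetical rule that age must fall between 0 and 120.
    accuracy = sum(
        r["age"] is not None and 0 <= r["age"] <= 120 for r in records
    ) / len(records)

    print(f"completeness={completeness:.0%}, accuracy={accuracy:.0%}")  # 67%, 67%

Reliability, the third dimension named above, is harder to score from a single batch; it generally requires tracking metrics like these over time.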

For the purpose of this book, we define data quality as the health of data at any stage in its life cycle. Data quality can be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

In our opinion, data quality frequently gets a bad rap. Data teams know they need to prioritize it, but it doesn't roll off the tongue the same way "machine learning," "data science," or even "analytics" does, and many teams don't have the bandwidth or resources to bring on someone full time to manage it. Instead, resource-strapped companies rely on the data analysts and engineers themselves to manage it, diverting them from projects that are perceived to be more interesting or innovative.

But if you can’t trust the data and the data products it powers, then how can data users trust your team to deliver value? The phrase “no data is better than bad data” gets thrown around a lot by professionals in the space, and while it certainly holds merit, in practice going without data is rarely an option.

Data quality issues (or “data downtime”) are practically unavoidable given the rate of growth and data consumption at most companies. But by understanding how we define data quality, we can more easily measure it and prevent data downtime from causing issues downstream.
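
One back-of-the-envelope way to put a number on data downtime (a common formulation in the data observability space, not a definition quoted from this excerpt) is:

    data downtime ≈ number of incidents × (time to detection + time to resolution)

For example, a team hit by 10 incidents a month that each take 5 hours to detect and 3 hours to resolve absorbs roughly 10 × (5 + 3) = 80 hours of data downtime per month, which makes the cost of reactive firefighting easy to communicate to leadership.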

Framing the Current Moment

Understanding the "Rise of Data Downtime"
Other Industry Trends Contributing to the Current Moment

Summary

The rise of the cloud, distributed data architectures and teams, and the move toward data productization have put the onus on data leaders to help their companies drive toward more trustworthy data (leading to more trustworthy analytics). Achieving reliable data is a marathon, not a sprint, and involves many stages of your data pipeline. Further, committing to improving data quality is much more than a technical challenge; it's very much organizational and cultural, too. In the next chapter, we'll discuss some technologies your team can use to prevent broken pipelines and build repeatable, iterative processes and frameworks with which to better communicate, address, and even prevent data downtime.

...

References

Barr Moses, Lior Gavish, and Molly Vorwerck. (2022). “Data Quality Fundamentals.” O’Reilly Media. https://www.oreilly.com/library/view/data-quality-fundamentals/9781098112035/