2021 DataPipelinesPocketReference

From GM-RKB

Subject Headings: Data Pipeline.

Notes

Cited By

Quotes

Book Overview

Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack.

You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:

  • What a data pipeline is and how it works
  • How data is moved and processed on modern data infrastructure, including cloud platforms
  • Common tools and products used by data engineers to build pipelines
  • How pipelines support analytics and reporting needs
  • Considerations for pipeline maintenance, testing, and alerting

Table of Contents

   Preface
       Who This Book Is For
       Conventions Used in This Book
       Using Code Examples
       O’Reilly Online Learning
       How to Contact Us
       Acknowledgments
   1. Introduction to Data Pipelines
       What Are Data Pipelines?
       Who Builds Data Pipelines?
           SQL and Data Warehousing Fundamentals
           Python and/or Java
           Distributed Computing
           Basic System Administration
           A Goal-Oriented Mentality
       Why Build Data Pipelines?
       How Are Pipelines Built?
   2. A Modern Data Infrastructure
       Diversity of Data Sources
           Source System Ownership
           Ingestion Interface and Data Structure
           Data Volume
           Data Cleanliness and Validity
           Latency and Bandwidth of the Source System
       Cloud Data Warehouses and Data Lakes
       Data Ingestion Tools
       Data Transformation and Modeling Tools
       Workflow Orchestration Platforms
           Directed Acyclic Graphs
       Customizing Your Data Infrastructure
   3. Common Data Pipeline Patterns
       ETL and ELT
       The Emergence of ELT over ETL
       EtLT Subpattern
       ELT for Data Analysis
       ELT for Data Science
       ELT for Data Products and Machine Learning
           Steps in a Machine Learning Pipeline
           Incorporate Feedback in the Pipeline
           Further Reading on ML Pipelines
   4. Data Ingestion: Extracting Data
       Setting Up Your Python Environment
       Setting Up Cloud File Storage
       Extracting Data from a MySQL Database
           Full or Incremental MySQL Table Extraction
           Binary Log Replication of MySQL Data
       Extracting Data from a PostgreSQL Database
           Full or Incremental Postgres Table Extraction
           Replicating Data Using the Write-Ahead Log
       Extracting Data from MongoDB
       Extracting Data from a REST API
       Streaming Data Ingestions with Kafka and Debezium
   5. Data Ingestion: Loading Data
       Configuring an Amazon Redshift Warehouse as a Destination
       Loading Data into a Redshift Warehouse
           Incremental Versus Full Loads
           Loading Data Extracted from a CDC Log
       Configuring a Snowflake Warehouse as a Destination
       Loading Data into a Snowflake Data Warehouse
       Using Your File Storage as a Data Lake
       Open Source Frameworks
       Commercial Alternatives
   6. Transforming Data
       Noncontextual Transformations
           Deduplicating Records in a Table
           Parsing URLs
       When to Transform? During or After Ingestion?
       Data Modeling Foundations
           Key Data Modeling Terms
           Modeling Fully Refreshed Data
           Slowly Changing Dimensions for Fully Refreshed Data
           Modeling Incrementally Ingested Data
           Modeling Append-Only Data
           Modeling Change Capture Data
   7. Orchestrating Pipelines
       Directed Acyclic Graphs
       Apache Airflow Setup and Overview
           Installing and Configuring
           Airflow Database
           Web Server and UI
           Scheduler
           Executors
           Operators
       Building Airflow DAGs
           A Simple DAG
           An ELT Pipeline DAG
       Additional Pipeline Tasks
           Alerts and Notifications
           Data Validation Checks
       Advanced Orchestration Configurations
           Coupled Versus Uncoupled Pipeline Tasks
           When to Split Up DAGs
           Coordinating Multiple DAGs with Sensors
       Managed Airflow Options
       Other Orchestration Frameworks
   8. Data Validation in Pipelines
       Validate Early, Validate Often
           Source System Data Quality
           Data Ingestion Risks
           Enabling Data Analyst Validation
       A Simple Validation Framework
           Validator Framework Code
           Structure of a Validation Test
           Running a Validation Test
           Usage in an Airflow DAG
           When to Halt a Pipeline, When to Warn and Continue
           Extending the Framework
       Validation Test Examples
           Duplicate Records After Ingestion
           Unexpected Change in Row Count After Ingestion
           Metric Value Fluctuations
       Commercial and Open Source Data Validation Frameworks
   9. Best Practices for Maintaining Pipelines
       Handling Changes in Source Systems
           Introduce Abstraction
           Maintain Data Contracts
           Limits of Schema-on-Read
       Scaling Complexity
           Standardizing Data Ingestion
           Reuse of Data Model Logic
           Ensuring Dependency Integrity
   10. Measuring and Monitoring Pipeline Performance
       Key Pipeline Metrics
       Prepping the Data Warehouse
           A Data Infrastructure Schema
       Logging and Ingesting Performance Data
           Ingesting DAG Run History from Airflow
           Adding Logging to the Data Validator
       Transforming Performance Data
           DAG Success Rate
           DAG Runtime Change Over Time
           Validation Test Volume and Success Rate
       Orchestrating a Performance Pipeline
           The Performance DAG
       Performance Transparency
   Index


References

  • James Densmore. (2021). Data Pipelines Pocket Reference.