AWS Glue Service
An AWS Glue Service is a fully managed, serverless ETL service offered by AWS.
- Context:
- It can be used as a Metadata Repository.
- It can be used to create AWS Glue ETL Pipelines (written in Python or Scala and executed on a managed Apache Spark environment, typically via PySpark).
- Example(s):
- Counter-Example(s):
- See: Apache Beam.
References
2019
- https://aws.amazon.com/glue/faqs/#AWS_Glue_Data_Catalog
- QUOTE: ... Q. What are the main components of AWS Glue?
AWS Glue consists of a Data Catalog which is a central metadata repository, an ETL engine that can automatically generate Scala or Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. Together, these automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data. …
2017a
- https://aws.amazon.com/glue/
- QUOTE: AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
AWS Glue generates Python code that is customizable, reusable, and portable. Once your ETL job is ready, you can schedule it to run on AWS Glue's fully managed, scale-out Apache Spark environment. AWS Glue provides a flexible scheduler with dependency resolution, job monitoring, and alerting.
AWS Glue is serverless, so there is no infrastructure to buy, set up, or manage. It automatically provisions the environment needed to complete the job, and customers pay only for the compute resources consumed while running ETL jobs. With AWS Glue, data can be available for analytics in minutes.
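The generated scripts described above are ordinary PySpark programs. A minimal hand-written sketch of such a job follows; the database, table, bucket, and field names are placeholders, and the `awsglue` imports are only available inside the Glue runtime, so they are deferred into `main()` while the record-level mapping stays a pure function:

```python
# Sketch of an AWS Glue ETL job script (assumed names throughout).
# The awsglue modules exist only on Glue workers, so importing them is
# deferred so that map_record can be exercised outside the Glue runtime.

def map_record(rec):
    # Copy a source field to a target name; a generated ApplyMapping
    # step would additionally drop fields not listed in the mapping.
    rec["user_id"] = rec["userId"]
    return rec

def main():
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_ctx = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_ctx)
    job.init(args["JOB_NAME"], args)

    # Read from the Data Catalog, transform, write Parquet to S3.
    dyf = glue_ctx.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table")
    mapped = dyf.map(f=map_record)
    glue_ctx.write_dynamic_frame.from_options(
        frame=mapped, connection_type="s3",
        connection_options={"path": "s3://example-bucket/out/"},
        format="parquet")
    job.commit()

if __name__ == "__main__":
    main()
```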
2017b
- http://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
- QUOTE: You can use AWS Glue to build a data warehouse to organize, cleanse, validate, and format data. You can transform and move AWS Cloud data into your data store. You can also load data from disparate sources into your data warehouse for regular reporting and analysis. By storing it in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.
AWS Glue simplifies many tasks when you are building a data warehouse:
- Discovers and catalogs metadata about your data stores into a central catalog. You can process semi-structured data, such as clickstream or process logs.
- Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer the schema, format, and data types of your data. This metadata is stored as tables in the AWS Glue Data Catalog and used in the authoring process of your ETL jobs.
- Generates ETL scripts to transform, flatten, and enrich your data from source to target.
- Detects schema changes and adapts based on your preferences.
- Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your data into your data warehouse. Triggers can be used to create a dependency flow between jobs.
- Gathers runtime metrics to monitor the activities of your data warehouse.
- Handles errors and retries automatically.
- Scales resources, as needed, to run your jobs.
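The crawler step above maps onto the Glue API's `create_crawler` call (available via boto3's `glue` client). A small sketch of building that request, with every name, role ARN, and path a placeholder:

```python
# Sketch of a create_crawler request for a single S3 target.
# All values below are placeholders, not real resources.

def crawler_request(name, role_arn, database, s3_path):
    """Build parameters for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,                    # IAM role Glue assumes
        "DatabaseName": database,            # catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",     # nightly at 02:00 UTC
    }

params = crawler_request(
    "example-crawler",
    "arn:aws:iam::123456789012:role/example-glue-role",
    "example_db", "s3://example-bucket/raw/")
# With credentials configured, the actual call would be:
#   boto3.client("glue").create_crawler(**params)
```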
2017c
- https://console.aws.amazon.com/glue/home?region=us-east-1#get-started:
- Build your AWS Glue Data Catalog: AWS Glue automatically stores metadata in a central data catalog. It can create table definitions for many common data stores, including S3 buckets, web logs, and AWS databases. AWS Glue recognizes, infers, organizes, and classifies your data.
- Generate and edit transformations: PySpark transformation scripts are auto-generated from source and target metadata. You can store customized versions to transform your data to meet your business needs. AWS Glue provides an environment to modify your jobs.
- Schedule and run your jobs: AWS Glue runs your ETL jobs in a serverless environment. You don't need to set up infrastructure; you use Amazon's and pay only for the resources you consume. You can define triggers to run jobs based on a schedule or event. AWS Glue enables you to monitor your jobs.
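Schedule- and event-based runs correspond to Glue triggers (`create_trigger` in the boto3 `glue` client): a SCHEDULED trigger fires on a cron expression, while a CONDITIONAL trigger expresses a dependency edge between jobs. A parameter-building sketch, with all trigger and job names hypothetical:

```python
# Sketches of create_trigger requests. Job and trigger names are placeholders.

def scheduled_trigger(name, job_name, cron):
    """Parameters for glue.create_trigger(): run job_name on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,                  # e.g. "cron(0 6 * * ? *)"
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

def conditional_trigger(name, upstream_job, downstream_job):
    """Run downstream_job only after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "JobName": upstream_job,
            "LogicalOperator": "EQUALS",
            "State": "SUCCEEDED",
        }]},
        "Actions": [{"JobName": downstream_job}],
    }
```

Chaining conditional triggers is how a dependency flow between jobs (mentioned above) is expressed.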
2016
- https://www.dremio.com/etl-tools-explained/
- QUOTE: … AWS Glue. This is a new fully-managed ETL service AWS announced in late 2016. Glue is targeted at developers. It is tightly integrated into other AWS services, including data sources such as S3, RDS, and Redshift, as well as other services, such as Lambda. Glue can connect to on-prem data sources to help customers move their data to the cloud. ETL pipelines are written in Python and executed using Apache Spark and PySpark. Like most services on AWS, Glue is designed for developers to write code to take advantage of the service, and is highly proprietary - pipelines written in Glue will only work on AWS. …