Deequ Library
A Deequ Library is a data unit-testing library built on Apache Spark that measures data quality in large tabular datasets that can be loaded into Spark dataframes.
References
2020
- https://github.com/awslabs/deequ
- QUOTE: ... Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. ...
... Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library. ...
Deequ works on tabular data, e.g., CSV files, database tables, logs, flattened json files, basically anything that you can fit into a Spark dataframe. ...
... Our library is built on Apache Spark and is designed to work with very large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse. ...
... Most applications that work with data have implicit assumptions about that data, e.g., that attributes have certain types, do not contain NULL values, and so on. If these assumptions are violated, your application might crash or produce wrong outputs. The idea behind deequ is to explicitly state these assumptions in the form of a "unit-test" for data, which can be verified on a piece of data at hand. If the data has errors, we can "quarantine" and fix it, before we feed it to an application.
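The idea quoted above (state assumptions about the data explicitly, verify them, and quarantine rows that violate them) can be illustrated with a minimal sketch. This is plain Python over a small in-memory list of rows, not the Deequ API and not Spark; the function names (is_complete, is_unique, is_non_negative, verify, quarantine) and the sample columns are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Illustrative stand-in for a tabular dataset: one dict per row.
# None plays the role of a NULL value. (Not the Deequ API.)
Row = Dict[str, Optional[object]]
Constraint = Callable[[List[Row]], bool]


def is_complete(column: str) -> Constraint:
    """Constraint: the column never holds None."""
    return lambda rows: all(row.get(column) is not None for row in rows)


def is_unique(column: str) -> Constraint:
    """Constraint: no value in the column appears more than once."""
    def check(rows: List[Row]) -> bool:
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return check


def is_non_negative(column: str) -> Constraint:
    """Constraint: every non-missing value in the column is >= 0."""
    return lambda rows: all(
        row[column] >= 0 for row in rows if row.get(column) is not None
    )


def verify(rows: List[Row], constraints: Dict[str, Constraint]) -> Dict[str, bool]:
    """Evaluate each named constraint against the whole dataset."""
    return {name: constraint(rows) for name, constraint in constraints.items()}


def quarantine(
    rows: List[Row], row_is_valid: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (clean, quarantined) using a per-row predicate."""
    clean: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        (clean if row_is_valid(row) else quarantined).append(row)
    return clean, quarantined


# Hypothetical sample data with two deliberate errors.
rows: List[Row] = [
    {"id": 1, "name": "Thingy A", "views": 0},
    {"id": 2, "name": None, "views": 5},         # missing name
    {"id": 3, "name": "Thingy C", "views": -1},  # negative view count
]

report = verify(
    rows,
    {
        "id is complete": is_complete("id"),
        "id is unique": is_unique("id"),
        "name is complete": is_complete("name"),
        "views is non-negative": is_non_negative("views"),
    },
)

clean, quarantined = quarantine(
    rows,
    lambda row: row["name"] is not None and row["views"] >= 0,
)

for name, passed in report.items():
    print(f"{name}: {'passed' if passed else 'FAILED'}")
```

In this sketch the dataset-level report says which assumptions were violated, while the row-level split keeps the offending rows out of downstream consumers until they are fixed. The actual library expresses such constraints against Spark dataframes and computes them in a distributed fashion; see the repository linked above for its real interface.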