Automated Testing For Undocumented Assumptions · 2020. 7. 15. · Automated Testing For Protecting...

Automated Testing For Protecting Data Pipelines from Undocumented Assumptions

Eugene Mandel

Head of Product, Superconductive

Agenda

I. What is Pipeline Debt?

II. How does Great Expectations beat pipeline debt?

III. How can I get started?

What is pipeline debt?

Technical debt in data pipelines, mainly as a result of missing tests and documentation.

Your data pipeline

wants to be a hairball

Undocumented

Untested

Unstable

What is pipeline debt?

Solution: automated testing,

BUT

code testing ≠ data testing

How does Great Expectations beat pipeline debt?

Always know what to expect from your data

▪ Public launch in 2018

▪ Full-time, active development started June 2019

▪ Most popular OSS library for data pipeline testing

▪ Growing community on Slack and GitHub

An expectation is a declarative statement that describes a property of a dataset

“Values in this column should be between 55 and 90, at least 95% of the time.”

Describe expected behavior

{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "temp_f",
    "max_value": 90,
    "min_value": 55,
    "mostly": 0.97
  },
  "meta": {
    "notes": {
      "format": "markdown",
      "content": [
        "this column contains indoor temp readings - CA, spring and summer"
      ]
    }
  }
}
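To make the meaning of "mostly" concrete, here is a minimal pure-Python sketch of how a declaration like the one above could be evaluated. It is not the Great Expectations implementation: the function name check_values_between, the result keys, and the sample readings are all hypothetical, and null handling is ignored.

```python
def check_values_between(values, min_value, max_value, mostly=1.0):
    """Succeed when at least `mostly` of the values fall in [min_value, max_value]."""
    if not values:
        # Assumption of this sketch: an empty column trivially passes.
        return {"success": True, "unexpected_count": 0, "fraction_in_range": 1.0}
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    fraction = (len(values) - len(unexpected)) / len(values)
    return {
        "success": fraction >= mostly,
        "unexpected_count": len(unexpected),
        "fraction_in_range": fraction,
    }


# The declarative expectation from the slide, as a Python dict.
expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "temp_f", "min_value": 55, "max_value": 90, "mostly": 0.97},
}

# Hypothetical readings: nine plausible values and one out-of-range value.
temp_f = [68, 70, 72, 71, 69, 75, 73, 66, 64, 120]

# Everything except the column name is passed straight to the checker.
kwargs = {k: v for k, v in expectation["kwargs"].items() if k != "column"}
result = check_values_between(temp_f, **kwargs)
print(result)
```

With one bad reading out of ten, only 90% of the values are in range, so the expectation fails at a threshold of 0.97; with one bad reading out of a hundred it would pass.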

Declarative language

expectation: expect_column_values_to_be_between (the JSON declaration shown above)

class PandasDataset ...
    def expect_column_values_to_be_between( ...

class SparkDFDataset ...
    def expect_column_values_to_be_between( ...

class SqlAlchemyDataset ...
    def expect_column_values_to_be_between( ...

Validate: take the compute to the data
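The idea of one declaration with several backend implementations can be sketched with the standard library alone. The classes ListDataset and SqliteDataset and the helper validate are hypothetical stand-ins for the Pandas, Spark, and SQLAlchemy classes named on the slide, not the library's code. The SQL variant pushes the check down as a query, so only two counts leave the database.

```python
import sqlite3


class ListDataset:
    """In-memory backend: rows are a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def expect_column_values_to_be_between(self, column, min_value, max_value, mostly=1.0):
        values = [row[column] for row in self.rows]
        in_range = sum(min_value <= v <= max_value for v in values)
        return {"success": in_range >= mostly * len(values)}


class SqliteDataset:
    """SQL backend: the check runs inside the database as a single query."""

    def __init__(self, connection, table):
        self.connection = connection
        self.table = table  # assumed to be a trusted identifier, not user input

    def expect_column_values_to_be_between(self, column, min_value, max_value, mostly=1.0):
        query = (
            "SELECT COUNT(*), "
            f"COALESCE(SUM(CASE WHEN {column} BETWEEN ? AND ? THEN 1 ELSE 0 END), 0) "
            f"FROM {self.table}"
        )
        total, in_range = self.connection.execute(query, (min_value, max_value)).fetchone()
        return {"success": in_range >= mostly * total}


def validate(dataset, expectation):
    """Dispatch a declarative expectation to whichever backend holds the data."""
    method = getattr(dataset, expectation["expectation_type"])
    return method(**expectation["kwargs"])


expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "temp_f", "min_value": 55, "max_value": 90, "mostly": 0.97},
}

# Hypothetical readings loaded into both backends.
rows = [{"temp_f": t} for t in (68, 70, 72, 71, 69, 75, 73, 66, 64, 120)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (temp_f REAL)")
conn.executemany("INSERT INTO readings VALUES (?)", [(r["temp_f"],) for r in rows])

list_result = validate(ListDataset(rows), expectation)
sql_result = validate(SqliteDataset(conn, "readings"), expectation)
print(list_result, sql_result)
```

Because the expectation is plain data, the same dict is handed to both backends unchanged; only the method body differs.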

expect_column_to_exist

expect_table_row_count_to_be_between

expect_column_values_to_be_unique

expect_column_values_to_not_be_null

expect_column_values_to_be_between

expect_column_values_to_match_regex

expect_column_values_to_match_strftime_format

expect_column_mean_to_be_between

expect_column_kl_divergence_to_be_less_than

etc. etc. etc.

great_expectations

Expressive and extensible
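One way to picture "extensible" is a registry to which new expectations are added alongside the built-in ones. The sketch below is an illustration only: the EXPECTATIONS registry, the expectation decorator, run_suite, and the custom expect_column_values_to_be_even are hypothetical and do not mirror how Great Expectations registers custom expectations. Skipping nulls in the regex check is also an assumption of this sketch.

```python
import re

EXPECTATIONS = {}  # hypothetical registry: expectation name -> checker function


def expectation(func):
    """Register a checker under its function name."""
    EXPECTATIONS[func.__name__] = func
    return func


@expectation
def expect_column_values_to_not_be_null(values):
    return all(v is not None for v in values)


@expectation
def expect_column_values_to_be_unique(values):
    return len(set(values)) == len(values)


@expectation
def expect_column_values_to_match_regex(values, regex):
    pattern = re.compile(regex)
    # Assumption: nulls are skipped here and left to the not-null expectation.
    return all(pattern.search(v) for v in values if v is not None)


# A custom, domain-specific expectation added the same way (hypothetical name).
@expectation
def expect_column_values_to_be_even(values):
    return all(v % 2 == 0 for v in values)


def run_suite(table, suite):
    """Evaluate each (name, column, kwargs) entry; return {(name, column): bool}."""
    return {
        (name, column): bool(EXPECTATIONS[name](table[column], **kwargs))
        for name, column, kwargs in suite
    }


# Hypothetical table stored column-wise.
table = {
    "user_id": [2, 4, 6, 8],
    "email": ["a@example.com", "b@example.com", None, "not-an-email"],
}
suite = [
    ("expect_column_values_to_be_unique", "user_id", {}),
    ("expect_column_values_to_be_even", "user_id", {}),
    ("expect_column_values_to_not_be_null", "email", {}),
    ("expect_column_values_to_match_regex", "email", {"regex": r"^[^@\s]+@[^@\s]+$"}),
]
report = run_suite(table, suite)
for key, passed in report.items():
    print(key, "PASS" if passed else "FAIL")
```

The suite itself stays declarative: each entry is a name, a column, and keyword arguments, so adding a new check means registering one function and appending one line.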

Your tests are your docs

Your docs are your tests
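Because an expectation is declarative, the same object that drives validation can also be rendered as human-readable documentation. The following is a minimal sketch of that idea; render_expectation is a hypothetical helper covering a single expectation type, not the library's documentation renderer.

```python
def render_expectation(expectation):
    """Turn one declarative expectation into a plain-English sentence plus notes."""
    if expectation["expectation_type"] != "expect_column_values_to_be_between":
        raise ValueError("this sketch only renders expect_column_values_to_be_between")
    kwargs = expectation["kwargs"]
    sentence = (
        f"Values in column '{kwargs['column']}' should be between "
        f"{kwargs['min_value']} and {kwargs['max_value']}"
    )
    mostly = kwargs.get("mostly")
    if mostly is not None:
        # Format the threshold as a whole-number percentage.
        sentence += f", at least {mostly:.0%} of the time"
    sentence += "."
    # Author notes stored in "meta" travel with the test and show up in the docs.
    notes = expectation.get("meta", {}).get("notes", {}).get("content", [])
    return "\n".join([sentence] + [f"Note: {note}" for note in notes])


expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "temp_f", "max_value": 90, "min_value": 55, "mostly": 0.97},
    "meta": {
        "notes": {
            "format": "markdown",
            "content": ["this column contains indoor temp readings - CA, spring and summer"],
        }
    },
}

doc = render_expectation(expectation)
print(doc)
```

The rendered text cannot drift from the test, since both are generated from the same declaration.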

Setup and Configuration

Drift

Outliers

Outage

How can I get started?

▪ Check out GitHub
  ▪ https://github.com/great-expectations/great_expectations

▪ Read the docs
  ▪ https://docs.greatexpectations.io/en/latest/

▪ Say hi and ask questions on Slack
  ▪ https://greatexpectations.io/slack

▪ pip install great_expectations

Thank you!
