Design Principles for a Modern Data Warehouse

CASE STUDIES AT DE BIJENKORF AND TRAVELBIRD

Old Challenges, New Considerations

Data warehouses still must deliver:

◦ Data integration of multiple systems

◦ Accuracy, completeness, and auditability

◦ Reporting for assorted stakeholders and business needs

◦ Clean data

◦ A “single version of the truth”

But the problem space now contains:

◦ Unstructured/semi-structured data

◦ Real-time data

◦ Shorter time to access / self-service BI

◦ SO MUCH DATA (terabytes/hour to load)

◦ More systems to integrate (everything has an API)

New technologies are changing the landscape

What is best practice today?

A modern, best-in-class data warehouse:

◦ Is designed for scalability, ideally using cloud architecture
◦ Uses a bus-based, lambda architecture
◦ Has a federated data model for structured and unstructured data
◦ Leverages MPP databases
◦ Uses an agile data model like Data Vault
◦ Is built using code automation
◦ Processes data using ELT, not ETL

All the buzzwords! But what does it look like and why do these things help?

Architectural overview at de Bijenkorf

Tools

AWS
◦ S3
◦ Kinesis
◦ ElastiCache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB

Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI

Other
◦ HP Vertica
◦ Tableau
◦ GitHub
◦ RStudio Server

DWH internal architecture, Travelbird and Bijenkorf

• Traditional three-tier DWH
• ODS generated automatically from staging
• Allow regeneration of vault without replaying logs
• Ops mart reflects data in original source form
• Helps offload queries from source systems
• Business marts materialized exclusively from vault

Why use the cloud?

Cost Management
• Services billed by the hour, pay for what you use
• For small deployments (<50 machines), cloud hosting can be significantly cheaper
• Ex. a 3 node Vertica cluster in AWS with 25TB data: $2.2k/mo

Off the Shelf Services
• Minimize administration by using pre-built services like message buses (Kinesis), databases (RDS), Key/Value stores (ElastiCache), simplifying technology stack
• Increase speed of delivery of new functionality by eliminating most deployment tasks
• Full stack in a day? No problem!

Scalability
• Services can automatically be scaled up/down based on time, load, or other triggers
• Adding additional services can be done within minutes
• Services can scale (near) infinitely

Lambda architecture: Right Now and Right Later

Designed to solve both primary data needs:
◦ Damn close, right now
◦ Correct, tomorrow

Data is processed twice per stream

As implemented at BYK and TB:
◦ Real time flow from Kinesis to DWH
◦ Simultaneous process to S3
◦ Reprocessing as needed from S3 (batch)
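The two passes over each stream can be sketched in a few lines. This is a minimal in-memory illustration, not the Bijenkorf or Travelbird implementation: the `LambdaPipeline` class, the `clean` rule, and the event fields `id` and `user_id` are all hypothetical. A Python list stands in for the S3 archive and a dict for the DWH table.

```python
import json


def clean(event):
    """Stricter validation used only by the batch path (hypothetical rule)."""
    return event if event.get("user_id") is not None else None


class LambdaPipeline:
    def __init__(self):
        self.archive = []    # stands in for raw files archived on S3
        self.warehouse = {}  # stands in for a DWH table keyed by event id

    def ingest(self, raw_line):
        # Archive path: keep the raw payload untouched so it can be replayed.
        self.archive.append(raw_line)
        # Real-time path: load immediately with minimal checks ("damn close, right now").
        event = json.loads(raw_line)
        self.warehouse[event["id"]] = event

    def reprocess(self):
        # Batch path: rebuild the table from the archive with full cleaning
        # ("correct, tomorrow").
        rebuilt = {}
        for raw_line in self.archive:
            event = clean(json.loads(raw_line))
            if event is not None:
                rebuilt[event["id"]] = event
        self.warehouse = rebuilt


pipeline = LambdaPipeline()
pipeline.ingest('{"id": 1, "user_id": "u1"}')
pipeline.ingest('{"id": 2, "user_id": null}')  # dirty event, accepted in real time
pipeline.reprocess()                           # dirty event dropped by the batch pass
```

Because the archive keeps every raw line, the batch pass can be rerun whenever the cleaning rules change, without touching the source systems.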

Hadoop in the DWH

What is Hadoop?
◦ A distributed, fault-tolerant file system
◦ A set of tools for file/data stream processing

Where does it fit into the DWH stack?
◦ Data Lake: Save all raw data for cheap; don’t force schemas on unstructured data
◦ ETL: Distributed batch processing, aggregation, and loading

Hadoop at Bijenkorf
◦ We had it but threw it out; the use cases didn’t fit
    ◦ Very little data is unstructured and the DWH supports JSON
    ◦ Data volumes are limited and growing slowly
◦ How did we solve the use cases?
    ◦ Data lake: S3 file storage + semi-structured data in Vertica
    ◦ Data processing: Stream processing (stable event volumes + clean events)

Hadoop at Travelbird
◦ Dirty, fast growing event data, so…
◦ Hadoop in the typical role
    ◦ Raw data in AWS Elastic MapReduce via S3
    ◦ Data cleaned and processed in Hadoop, then loaded into Redshift

The Role of Column Store Databases in the DWH

• C-Stores persist each column independently and allow column compression

• Queries retrieve data only from needed columns

Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table

Query: SELECT A, SUM(D) FROM table WHERE C >= X GROUP BY A;

Row Store: 1.6 TB of data scanned

Column Store (50% compression): <100 GB of data scanned
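The scan volumes in the example can be checked with a short calculation. This sketch assumes the figures are in binary units (TiB and GiB) and that the query reads only the three columns it names (A, C, and D); the function names are illustrative.

```python
ROWS = 7_000_000_000
COLUMNS = 25
BYTES_PER_VALUE = 10
TIB = 2 ** 40
GIB = 2 ** 30


def row_store_scan_bytes(rows, columns, bytes_per_value):
    # A row store reads every column of every row, even for a 3-column query.
    return rows * columns * bytes_per_value


def column_store_scan_bytes(rows, columns_read, bytes_per_value, compression_ratio):
    # A column store reads only the referenced columns, in compressed form.
    return rows * columns_read * bytes_per_value * compression_ratio


row_bytes = row_store_scan_bytes(ROWS, COLUMNS, BYTES_PER_VALUE)
col_bytes = column_store_scan_bytes(ROWS, 3, BYTES_PER_VALUE, 0.5)

print(f"row store scan:    {row_bytes / TIB:.2f} TiB")
print(f"column store scan: {col_bytes / GIB:.1f} GiB")
```

Under these assumptions the row store scans roughly 1.6 TiB and the column store just under 100 GiB, which matches the figures on the slide.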

Performance Comparison

[Chart: Query Performance Results (seconds), C-store vs. Postgres, for four queries: Count Distinct, Count, Top 20 (One Month), and Top 20. Extracted data labels: 187, 230, 600, 600 and 0.63, 2.1, 23, 62.]

Loads fast too! Facebook loads 35TB/hour into Vertica

But are there tradeoffs to a C-Store?

Weaknesses
◦ No PK/FK integrity enforced on write
◦ Slow on DELETE, UPDATE
◦ REALLY slow on single record INSERT and SELECT
◦ Optimized for limited concurrency but big queries; only a few users can use at a time

Solutions
◦ Design to use calculated keys (ex. hashes)
◦ Build ETLs around COPY, TRUNCATE
◦ Individual transactions should use OLTP or Key/Value systems
◦ Optimize data structures for common queries and leverage big, slow disks to create denormalized tables

Data Vault 1.618 at Bijenkorf

3rd Normal Form vs. Data Vault

So many tables! WHY?!?!?!

What we gained

◦ Speed of integration of new entities

◦ Fast primary keys without lookups by using hash keys

◦ Data matches business processes, not systems

◦ Easy parallelization of table loading (24 concurrent tables? OK!)
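A hash key can be computed from the business key alone, which is why no lookup against the target table is needed and why tables can load in parallel. The sketch below is one common way to do it, not necessarily what Bijenkorf used: the choice of MD5, the normalization rules (trim, upper-case), the `|` delimiter, and the function name `hub_hash_key` are all assumptions.

```python
import hashlib


def hub_hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic key from one or more business key parts.

    Normalizing before hashing means ' abc ' and 'ABC' map to the same key,
    so every loader computes the same value independently.
    """
    normalized = delimiter.join(str(part).strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


customer_key = hub_hash_key("C-1001")
# A link key can be derived the same way from the keys it connects.
order_customer_key = hub_hash_key("O-9", "C-1001")
```

Since hub, link, and satellite loaders can each derive the key from the staging row, none of them has to wait for another table to assign a surrogate key first.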

ELT, not ETL

Advantages of ELT
◦ Performance: Bijenkorf benchmark showed ELT was >50x faster than ETL
    ◦ Plus horizontal scalability is Web scale, big data, <insert buzzword here>
◦ Data Availability: You want an exact replica of your source data in the DWH anyways
◦ Simpler Architecture: Fewer systems, fewer interdependencies (decouple STG and DV), can build multiple transformations from STG simultaneously

Myths of ELT
◦ Source and Target DB must match: Intelligently coded ELT jobs leverage platform-agnostic code (or a library for each source DB type) for loading to STG
    ◦ Bijenkorf runs MySQL and Oracle ELT into Vertica
    ◦ Travelbird runs MySQL and Postgres ELT into Redshift
◦ Limited tool availability: DV 2.0 lends itself to code generators / managers, which are best built internally anyways
    ◦ Talend is free (like speech and hugs) and offers ELT for many systems
◦ ELT takes longer to deploy: Because data is perfectly replicated from source, getting records in is faster; transformations can be iterated quicker since they are independent of source->stg loading
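The ELT pattern itself is small: load source rows into staging unchanged, then transform with set-based SQL inside the database. The sketch below uses Python's built-in SQLite purely as a stand-in for the warehouse; the table names `stg_orders` and `mart_customer_revenue` and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: copy source rows into staging exactly as they arrive,
# including untrimmed names and amounts stored as text.
conn.execute("CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount TEXT)")
source_rows = [("1", " alice ", "10.50"), ("2", "BOB", "4.00"), ("3", "alice", "1.25")]
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

# Transform: cleaning and aggregation run as SQL inside the database,
# so the work scales with the database rather than with an external ETL server.
conn.execute(
    """
    CREATE TABLE mart_customer_revenue AS
    SELECT lower(trim(customer)) AS customer,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM stg_orders
    GROUP BY lower(trim(customer))
    """
)

result = conn.execute(
    "SELECT customer, revenue FROM mart_customer_revenue ORDER BY customer"
).fetchall()
```

Because staging is an untouched replica, a second transformation can be built from `stg_orders` at any time without going back to the source system.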

Targeted benefits of DWH automation at Bijenkorf

Objective: Achievements at Bijenkorf

Speed of development:
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step

Simplicity:
• Five jobs handle all ETL processes across DWH

Traceability:
• Every record/source file is traced in the database and every row automatically identified by source file in ODS

Code simplification:
• Replaced most common key definitions with dynamic variable replacement

File management:
• Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
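Archiving by source, table, and date makes replay a matter of selecting a key prefix. The sketch below shows one possible key layout; the exact layout used at Bijenkorf is not given in the slides, so the path scheme and the helper names `archive_key` and `keys_to_replay` are hypothetical.

```python
from datetime import date
import posixpath


def archive_key(source, table, load_date, filename):
    """Build an object key of the form source/table/YYYY/MM/DD/filename."""
    return posixpath.join(source, table, load_date.strftime("%Y/%m/%d"), filename)


def keys_to_replay(all_keys, source, table=None):
    """Select archived files for a whole source system, or for one of its tables."""
    prefix = source + "/" if table is None else posixpath.join(source, table) + "/"
    return sorted(key for key in all_keys if key.startswith(prefix))


keys = [
    archive_key("webshop", "orders", date(2015, 3, 9), "orders_001.csv.gz"),
    archive_key("webshop", "customers", date(2015, 3, 9), "customers_001.csv.gz"),
    archive_key("erp", "stock", date(2015, 3, 8), "stock_001.csv.gz"),
]
webshop_files = keys_to_replay(keys, "webshop")
```

Putting the date after the source and table keeps each table's files contiguous, so a single period can be replayed by extending the prefix with a year or month.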

Data Vault loading automation at BYK

All Staging Tables Checked for Changes
• New sources automatically added
• Last change epoch based on load stamps, advanced each time all dependencies execute successfully

List of Dependent Vault Loads Identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized

Loads Planned in Hub, Link, Sat Order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order

Loads Executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages

◦ Loader is fully metadata driven with focus on horizontal scalability and management simplicity

◦ To support speed of development and performance, variable-driven SQL templates used throughout
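The planning steps above can be sketched as a small metadata-driven planner. Everything here is illustrative rather than the actual BYK loader: the `VAULT_LOADS` metadata, the table names, the `HUB_TEMPLATE` SQL, and the functions `plan_loads` and `render` are assumptions chosen to show the idea of declared dependencies, hub/link/sat ordering, and variable replacement in SQL templates.

```python
from string import Template

LOAD_ORDER = {"hub": 0, "link": 1, "sat": 2}

# Hypothetical metadata: each vault load declares its type and the
# staging tables it depends on at job creation time.
VAULT_LOADS = [
    {"target": "sat_customer", "type": "sat", "depends_on": {"stg_customer"}},
    {"target": "link_order_customer", "type": "link", "depends_on": {"stg_order"}},
    {"target": "hub_customer", "type": "hub", "depends_on": {"stg_customer", "stg_order"}},
    {"target": "hub_product", "type": "hub", "depends_on": {"stg_product"}},
]

# Hypothetical variable-driven SQL template for a hub load.
HUB_TEMPLATE = Template(
    "INSERT INTO $target ($key_column, load_ts) "
    "SELECT DISTINCT $key_column, $load_ts FROM $source"
)


def plan_loads(changed_staging_tables, vault_loads=VAULT_LOADS):
    """Return the loads affected by changed staging tables, hubs first, then links, then sats."""
    changed = set(changed_staging_tables)
    affected = [load for load in vault_loads if load["depends_on"] & changed]
    return sorted(affected, key=lambda load: (LOAD_ORDER[load["type"]], load["target"]))


def render(template, **variables):
    """Replace template variables; raises KeyError if one is missing."""
    return template.substitute(**variables)


plan = [load["target"] for load in plan_loads({"stg_customer"})]
sql = render(
    HUB_TEMPLATE,
    target="hub_customer",
    key_column="customer_hk",
    load_ts="1700000000",
    source="stg_customer",
)
```

Adding a new vault dependency in this sketch means adding one entry to the metadata list, which mirrors the "one step" claim for the real loader.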

Bringing it back: Best practice, in practice

Code Automation

Cloud Based

Bus Architecture

MPP

Data Vault

Unstructured Data Stores

ELT controlled by scheduler


Rob [email protected]
