Designing a Data Warehouse - what would a BI solution recommend?

Segah Meer, Sr. Data Consultant, Professional Services

Connect. Describe. Explore.

4 Rules of Thumb

▪ Transparent E(T)L process

▪ Single copy of data

▪ Performance

▪ Shortest path

Transparent E(T)L process

Perform transformations that improve performance and shorten the join path, but avoid baking broad assumptions about the final use case into the data. Example: how revenue is calculated.

You see:

account  profit
1        1000

Actual data:

account  value
1        {revenue: 2000, expenses: 500, account_payable: 500, is_current: true}
2        {revenue: 2000, expenses: 100, is_current: false}
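As an illustration only (not from the original slides), the minimal Python sketch below shows why a pre-computed `profit` column can hide an assumption. The record mirrors account 1 in the actual data; the two function names and both profit definitions are hypothetical.

```python
# Hypothetical illustration: the same raw record yields different "profit"
# values depending on which definition the ETL step silently picks.

raw_account = {
    "revenue": 2000,
    "expenses": 500,
    "account_payable": 500,
    "is_current": True,
}


def profit_ignoring_payables(record):
    """Profit defined as revenue minus expenses only (assumed definition)."""
    return record["revenue"] - record["expenses"]


def profit_including_payables(record):
    """Profit that also subtracts accounts payable (assumed definition)."""
    return (
        record["revenue"]
        - record["expenses"]
        - record.get("account_payable", 0)
    )


# Storing only one of these numbers forces that definition on every report;
# storing the components keeps the choice visible to the analyst.
print(profit_ignoring_payables(raw_account))
print(profit_including_payables(raw_account))
```

Keeping `revenue`, `expenses`, and `account_payable` as separate columns lets the modeling layer state the calculation explicitly instead of inheriting it from the ETL job.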

Single Copy of Data

If data can change, store it in a single row. Avoid redundant tables. Ex: customer information

Duplicate rows:

name        phone_number
Segah Meer  650-575-5410
...         ...
Segah Meer  650-575-5411

OR

Redundant tables:

account  profit
1        1000

account  revenue  cost
1        1500     500
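The redundant-tables problem can be sketched with Python's built-in `sqlite3` module. This is a minimal, hypothetical example: the table names `account_profit` and `account_revenue_cost` and the updated cost of 700 are assumptions made for illustration, while the starting values come from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_profit (
        account INTEGER PRIMARY KEY, profit INTEGER
    );
    CREATE TABLE account_revenue_cost (
        account INTEGER PRIMARY KEY, revenue INTEGER, cost INTEGER
    );
    INSERT INTO account_profit VALUES (1, 1000);
    INSERT INTO account_revenue_cost VALUES (1, 1500, 500);
""")

# The cost changes, but only one of the two redundant tables is updated.
conn.execute("UPDATE account_revenue_cost SET cost = 700 WHERE account = 1")

stored_profit = conn.execute(
    "SELECT profit FROM account_profit WHERE account = 1"
).fetchone()[0]
derived_profit = conn.execute(
    "SELECT revenue - cost FROM account_revenue_cost WHERE account = 1"
).fetchone()[0]

# The two copies now disagree about the same fact.
print(stored_profit, derived_profit)
```

With a single copy of the data there is only one place to update, so the two answers cannot drift apart.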

Performance

▪ databases focused on large data volume reads behave differently from those focused on frequent and “easy” inserts

▪ slow queries are a function of 1) the LookML, where LookML = f(model), 2) database resources, and 3) how the data is stored

Use flatter (wider) tables and don’t be afraid of redundant date columns

Shortest Path

There is very little analytical value derived from modeling “long path” designs with Looker

Each extra join adds modeling complexity.

Imagine a ride-sharing app

App Events:

id      created_at  attribute_id
100001  2016-01-01  1
100002  2016-01-01  2

Attributes (example values):

id  value
1   {json...}

One Intuitive Solution

- explore: events
  joins:
    - join: attributes
      sql_on: ${events.attribute_id} = ${attributes.id}

- explore: users
  joins:
    - join: attributes
      relationship: one_to_many
      sql_on: ${users.id} = ${attributes.user_id}
    - join: events
      relationship: one_to_many
      sql_on: ${attributes.id} = ${events.attribute_id}

- view: attributes
  fields:
    - dimension: user_id
      sql: JSON_EXTRACT(${value}, 'user_id')
    - dimension: service_charge
      sql: JSON_EXTRACT(${value}, 'service_charge')
    - dimension: amount
      sql: ${service_charge} + ${wait_charge} + ${tax}

Let’s see how we did

Bad Bad Bad... Sure O.K.

Shortest Path ✗

Performance ✗

Single Source of Truth

Transparency ✓

Can we do better?

Data Warehouse:

id      created_at  event_type    amount  location
100001  2016-01-01  transaction   14.3
100002  2016-01-01  ride_started          37.7833° N, 122.4167° W

[Diagram: Production → ETL → Data Warehouse]

Pre-flattening the table (ETL)

SELECT
    events.id,
    events.created_at,
    JSON_EXTRACT(attributes.value, 'type') AS event_type,
    JSON_EXTRACT(attributes.value, 'service_charge')
      + JSON_EXTRACT(attributes.value, 'wait_charge')
      + JSON_EXTRACT(attributes.value, 'tax') AS amount,
    JSON_EXTRACT(attributes.value, 'location') AS location
FROM events
LEFT JOIN attributes
    ON events.attribute_id = attributes.id
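For readers who want to trace the pre-flattening step outside a database, here is a minimal Python sketch of the same left join and JSON extraction. The function name `flatten_events` and the in-memory list-of-dicts layout are assumptions for illustration; the field names and sample values follow the slides.

```python
import json


def flatten_events(events, attributes):
    """Left-join events to attributes and lift JSON fields into columns."""
    attrs_by_id = {a["id"]: json.loads(a["value"]) for a in attributes}
    rows = []
    for event in events:
        # Left join: an event with no matching attribute keeps empty fields.
        value = attrs_by_id.get(event["attribute_id"], {})
        charges = [
            value.get(key) for key in ("service_charge", "wait_charge", "tax")
        ]
        # Mirror SQL semantics: a missing operand makes the whole sum NULL.
        amount = None if any(c is None for c in charges) else sum(charges)
        rows.append({
            "id": event["id"],
            "created_at": event["created_at"],
            "event_type": value.get("type"),
            "amount": amount,
            "location": value.get("location"),
        })
    return rows


events = [
    {"id": 100001, "created_at": "2016-01-01", "attribute_id": 1},
    {"id": 100002, "created_at": "2016-01-01", "attribute_id": 2},
]
attributes = [
    {"id": 1, "value": json.dumps({
        "type": "transaction",
        "service_charge": 10, "wait_charge": 3, "tax": 1.3,
    })},
    {"id": 2, "value": json.dumps({
        "type": "ride_started",
        "location": "37.7833° N, 122.4167° W",
    })},
]

flat = flatten_events(events, attributes)
for row in flat:
    print(row)
```

Note that the individual charges are collapsed into a single `amount` here, which is exactly the kind of baked-in assumption the transparency rule warns about.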

Let’s see how we did #2

Bad Sure O.K.

Shortest Path ✓

Performance ✓

Single Source of Truth ✗

Transparency ✗

- explore: users
  joins:
    - join: event_attributes
      relationship: one_to_many
      sql_on: ${users.id} = ${event_attributes.user_id}

- view: event_attributes
  fields:
    - dimension: user_id
      sql: ${TABLE}.user_id
    - dimension: amount
      sql: ${TABLE}.amount
    ...

Let’s try another improvement

Data Warehouse

Transaction Events:

id      created_at  user_id  service_charge  wait_charge  tax
100001  2016-01-01  1        10              3            1.3

Ride_started Events:

id      created_at  user_id  location
100002  2016-01-01  1        37.7833° N, 122.4167° W

Let’s try another improvement: Model

- explore: events
  joins:
    - join: transaction_events
      view_label: 'Events'
      relationship: one_to_one
      sql_on: ${events.id} = ${transaction_events.id}

- explore: users
  joins:
    - join: events
      relationship: one_to_many
      sql_on: ${users.id} = ${events.user_id}

- view: events
  derived_table:
    sql: |
      SELECT id, created_at, user_id FROM transaction_events
      UNION ALL
      SELECT id, created_at, user_id FROM ride_started_events

- view: transaction_events
  ...

- view: ride_started_events
  ...
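The per-event-type layout can be tried out with Python's built-in `sqlite3` module. This is a rough sketch under stated assumptions: the column types are guesses, and a plain SQL view stands in for the `events` derived table; the table names, columns, and sample rows follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transaction_events (
        id INTEGER PRIMARY KEY, created_at TEXT, user_id INTEGER,
        service_charge REAL, wait_charge REAL, tax REAL
    );
    CREATE TABLE ride_started_events (
        id INTEGER PRIMARY KEY, created_at TEXT, user_id INTEGER,
        location TEXT
    );
    INSERT INTO transaction_events
        VALUES (100001, '2016-01-01', 1, 10, 3, 1.3);
    INSERT INTO ride_started_events
        VALUES (100002, '2016-01-01', 1, '37.7833° N, 122.4167° W');

    -- Stand-in for the `events` derived table in the model.
    CREATE VIEW events AS
        SELECT id, created_at, user_id FROM transaction_events
        UNION ALL
        SELECT id, created_at, user_id FROM ride_started_events;
""")

# One-to-one join from the unified event list to the transaction details;
# the amount is computed at query time from the stored components.
rows = conn.execute("""
    SELECT events.id,
           transaction_events.service_charge
             + transaction_events.wait_charge
             + transaction_events.tax AS amount
    FROM events
    LEFT JOIN transaction_events ON events.id = transaction_events.id
    ORDER BY events.id
""").fetchall()
print(rows)
```

Each charge stays in its own typed column, so the `amount` calculation remains visible in the query rather than hidden in the ETL, and non-transaction events simply return no amount.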

Let’s see how we did #3

Bad Sure O.K.

Shortest Path ✓

Performance ✓

Single Source of Truth ✓

Transparency ✓

Recommended