Segah Meer, Sr. Data Consultant, Professional Services
Connect. Describe. Explore.
Designing a Data Warehouse: what would a BI solution recommend?
4 Rules of Thumb
▪ Transparent E(T)L process
▪ Single copy of data
▪ Performance
▪ Shortest path
Transparent E(T)L process
Perform transformations that optimize for performance and shortest path, but avoid making broad assumptions about the final use case. Example: how revenue is calculated.
You see:
account  profit
1        1000

Actual data:
account  value
1        {revenue: 2000, expenses: 500, account_payable: 500, is_current: true}
2        {revenue: 2000, expenses: 100, is_current: false}
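A minimal Python sketch of why the precomputed profit column hides an assumption. The raw values come from account 1 above; the two function names and the two profit definitions are hypothetical, chosen only to show that the same raw row supports more than one answer.

```python
# Raw row as stored in the source system (values from account 1 on the slide).
raw_account = {
    "revenue": 2000,
    "expenses": 500,
    "account_payable": 500,
    "is_current": True,
}


def profit_net_of_payables(row):
    """Hypothetical definition: subtract expenses and outstanding payables."""
    return row["revenue"] - row["expenses"] - row.get("account_payable", 0)


def profit_before_payables(row):
    """Hypothetical definition: subtract expenses only."""
    return row["revenue"] - row["expenses"]


# The ETL output of 1000 matches only the first definition; an analyst who
# needs the second cannot recover it from the transformed table.
print(profit_net_of_payables(raw_account))
print(profit_before_payables(raw_account))
```

Keeping the raw fields available, or at least documenting which definition the ETL applies, lets the modeling layer choose.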
Single Copy of Data
If data can change, store it in a single row. Avoid redundant tables. Ex: customer information
Duplicate rows:
name        phone_number
Segah Meer  650-575-5410
...         ...
Segah Meer  650-575-5411

OR

Redundant tables:
account  profit
1        1000

account  revenue  cost
1        1500     500
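One way to avoid the redundant profit table is to store revenue and cost once and derive profit on read. The sketch below uses Python's built-in sqlite3 module purely for illustration; the table and view names (accounts, account_profit) and the updated cost of 700 are assumptions, not part of the original deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Single copy of the changeable data: one row per account.
conn.execute(
    "CREATE TABLE accounts (account INTEGER PRIMARY KEY, revenue INTEGER, cost INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 1500, 500)")

# Profit is derived, not stored, so it cannot drift out of sync.
conn.execute(
    "CREATE VIEW account_profit AS "
    "SELECT account, revenue - cost AS profit FROM accounts"
)


def profit_for(account_id):
    """Read the derived profit for one account."""
    row = conn.execute(
        "SELECT profit FROM account_profit WHERE account = ?", (account_id,)
    ).fetchone()
    return row[0]


before = profit_for(1)
conn.execute("UPDATE accounts SET cost = 700 WHERE account = 1")  # one place to update
after = profit_for(1)
print(before, after)
```

With a separately stored profit table, the same update would have required a second write to keep the two tables consistent.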
Performance
▪ databases focused on large data volume reads behave differently from those focused on frequent and “easy” inserts
▪ slow queries are a function of 1) the LookML model, 2) database resources, and 3) how the data is stored
Use flatter (wider) tables and don’t be afraid of redundant date columns
Shortest Path
There is very little analytical value in modeling “long path” designs with Looker.
Each extra join adds modeling complexity.
Imagine a ride-sharing app
App Events:
id      created_at  attribute_id
100001  2016-01-01  1
100002  2016-01-01  2

Attributes (example values):
id  value
1   {json...}
One Intuitive Solution
    - explore: events
      joins:
        - join: attributes
          sql_on: ${events.attribute_id} = ${attributes.id}

    - explore: users
      joins:
        - join: attributes
          relationship: one_to_many
          sql_on: ${users.id} = ${attributes.user_id}
        - join: events
          relationship: one_to_many
          sql_on: ${attributes.id} = ${events.attribute_id}

    - view: attributes
      fields:
        - dimension: user_id
          sql: JSON_EXTRACT(${value}, 'user_id')
        - dimension: service_charge
          sql: JSON_EXTRACT(${value}, 'service_charge')
        - dimension: amount
          sql: ${service_charge} + ${wait_charge} + ${tax}
Let’s see how we did
Bad Bad Bad... Sure O.K.
Shortest Path ✗
Performance ✗
Single Source of Truth ✓
Transparency ✓
Can we do better?
id      created_at  event_type    amount  location
100001  2016-01-01  transaction   14.3
100002  2016-01-01  ride_started          37.7833° N, 122.4167° W
[Diagram: Production → ETL → Data Warehouse]
Pre-flattening the table
    SELECT
        id
      , created_at
      , JSON_EXTRACT(attributes.value, 'type') AS event_type
      , JSON_EXTRACT(attributes.value, 'service_charge')
          + JSON_EXTRACT(attributes.value, 'wait_charge')
          + JSON_EXTRACT(attributes.value, 'tax') AS amount
      , JSON_EXTRACT(attributes.value, 'location') AS location
    FROM events
    LEFT JOIN attributes
      ON events.attribute_id = attributes.id
ETL
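The pre-flattening step can be sketched in plain Python to show what the ETL produces. The JSON attribute payloads below are hypothetical (the deck elides them as {json...}); they are chosen so the result lines up with the flattened table shown earlier. The helper name flatten is also an assumption.

```python
import json

# Source tables. Attribute payloads are made-up examples, not real data.
events = [
    {"id": 100001, "created_at": "2016-01-01", "attribute_id": 1},
    {"id": 100002, "created_at": "2016-01-01", "attribute_id": 2},
]
attributes = {
    1: json.dumps({"type": "transaction", "service_charge": 10,
                   "wait_charge": 3, "tax": 1.3}),
    2: json.dumps({"type": "ride_started",
                   "location": "37.7833° N, 122.4167° W"}),
}

CHARGE_KEYS = ("service_charge", "wait_charge", "tax")


def flatten(event):
    """Join one event to its attribute row and pull the JSON keys into columns."""
    raw = attributes.get(event["attribute_id"])  # LEFT JOIN: may be missing
    value = json.loads(raw) if raw is not None else {}
    # Like SQL NULL arithmetic: amount is None unless every charge is present.
    has_all_charges = all(key in value for key in CHARGE_KEYS)
    amount = round(sum(value[key] for key in CHARGE_KEYS), 2) if has_all_charges else None
    return {
        "id": event["id"],
        "created_at": event["created_at"],
        "event_type": value.get("type"),
        "amount": amount,
        "location": value.get("location"),
    }


flat_rows = [flatten(event) for event in events]
for row in flat_rows:
    print(row)
```

Note that amount is computed inside the ETL here, so the individual charges are no longer visible downstream.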
Let’s see how we did #2
Bad Sure O.K.
Shortest Path ✓
Performance ✓
Single Source of Truth ✗
Transparency ✗
    - explore: users
      joins:
        - join: event_attributes
          relationship: one_to_many
          sql_on: ${users.id} = ${event_attributes.user_id}

    - view: event_attributes
      fields:
        - dimension: user_id
          sql: ${TABLE}.user_id
        - dimension: amount
          sql: ${TABLE}.amount
        ...
Let’s try another improvement
Data Warehouse

Transaction Events:
id      created_at  user_id  service_charge  wait_charge  tax
100001  2016-01-01  1        10              3            1.3

Ride_started Events:
id      created_at  user_id  location
100002  2016-01-01  1        37.7833° N, 122.4167° W
Model

    - explore: events
      joins:
        - join: transaction_events
          view_label: 'Events'
          relationship: one_to_one
          sql_on: ${events.id} = ${transaction_events.id}

    - explore: users
      joins:
        - join: events
          relationship: one_to_many
          sql_on: ${users.id} = ${events.user_id}

    - view: events
      derived_table:
        sql: |
          SELECT id, created_at, user_id FROM transaction_events
          UNION ALL
          SELECT id, created_at, user_id FROM ride_started_events

    - view: transaction_events
      ...

    - view: ride_started_events
      ...
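To see the SQL this model implies, here is a rough sketch using Python's built-in sqlite3 module. The table contents mirror the two warehouse tables above; the column types and the final query, which computes amount at query time, are assumptions about how a transaction dimension might be defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transaction_events (
    id INTEGER PRIMARY KEY, created_at TEXT, user_id INTEGER,
    service_charge REAL, wait_charge REAL, tax REAL
);
CREATE TABLE ride_started_events (
    id INTEGER PRIMARY KEY, created_at TEXT, user_id INTEGER, location TEXT
);
INSERT INTO transaction_events VALUES (100001, '2016-01-01', 1, 10, 3, 1.3);
INSERT INTO ride_started_events VALUES (100002, '2016-01-01', 1, '37.7833° N, 122.4167° W');
""")

# The derived "events" view: only the columns shared by every event type.
EVENTS_SQL = """
SELECT id, created_at, user_id FROM transaction_events
UNION ALL
SELECT id, created_at, user_id FROM ride_started_events
"""

# One join reaches the typed columns; amount stays a transparent, query-time sum.
rows = conn.execute(f"""
SELECT e.id, e.user_id,
       t.service_charge + t.wait_charge + t.tax AS amount
FROM ({EVENTS_SQL}) AS e
LEFT JOIN transaction_events AS t ON e.id = t.id
ORDER BY e.id
""").fetchall()

for row in rows:
    print(row)
```

Each charge is stored once in a typed column, and the definition of amount lives in the model rather than in the ETL.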
Let’s see how we did #3
Bad Sure O.K.
Shortest Path ✓
Performance ✓
Single Source of Truth ✓
Transparency ✓
Recommended