
Redshift at Lightspeed: How to continuously optimize and modify Redshift schemas, by panoply.io - Pop-up Loft Tel Aviv


Redshift at Lightspeed

How to continuously optimize and modify Redshift schemas

Panoply.io

The Missing Part: Continuous Data Warehousing

Core Idea → Product

Continuous Integration
Puppet, Chef, New Relic
Unit Tests
AWS, Heroku, Docker
Server Frameworks
Github, Bitbucket
Client Frameworks
SCRUM, Kanban, Extreme

speed /spēd/ (noun): the rate at which someone or something is able to operate or change state

"First, make the change easy. Then, make the easy change."
- Kent Beck

Continuous Data Integration

1. Data: Columns, Tables, Data Types, Compression, Constraints / code changes

2. Queries: Transformations, Sortkeys, Distkeys / life, business, environment


#1 Data & Metadata Changes

Groups
  int g_id
  string name

Session Events
  int s_id
  int u_id
  datetime time
  datetime start_time
  datetime end_time

Users
  int u_id
  string gender
  string name
  string first_name
  string last_name
  int g_id

Messages
  int m_id
  int from
  int to
  string text
  int to_u
  int to_g

Users support (commit #4acd617 by alice)
Adding messages (commit #0ca9e87 by bob)
Track user sessions (commit #709ff49 by alice)
Breakdown the name (commit #791079b by alice)
Add Groups (commit #44ff83b by bob)
Session time-range (commit #df7a369 by alice)

automate with build scripts

Commit Log

  Users support (commit #4acd617 by alice)
  Breakdown the name (commit #791079b by alice)
  Add Groups (commit #44ff83b by bob)

schema.yaml

  users:
    - column: u_id
      type: int
    - column: first_name
      type: varchar
    - column: last_name
      type: varchar
    - column: g_id
      type: int
      references:
        - groups.g_id
  groups:
    - column: g_id
      type: int
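One way the build-script idea could look in practice: a minimal sketch that renders a schema like the one above as create table statements. The dictionary stands in for schema.yaml after parsing, and `render_create_table` is a hypothetical helper, not something shown in the talk. Note that Redshift treats foreign key references as informational only.

```python
# Hypothetical in-memory form of schema.yaml (parsing the file is left out).
SCHEMA = {
    "users": [
        {"column": "u_id", "type": "int"},
        {"column": "first_name", "type": "varchar"},
        {"column": "last_name", "type": "varchar"},
        {"column": "g_id", "type": "int", "references": ["groups.g_id"]},
    ],
    "groups": [
        {"column": "g_id", "type": "int"},
    ],
}


def render_create_table(table, columns):
    """Render one create table statement from a list of column specs."""
    parts = []
    for col in columns:
        line = f"{col['column']} {col['type']}"
        # A reference such as "groups.g_id" becomes: references groups (g_id)
        for ref in col.get("references", []):
            ref_table, ref_column = ref.split(".")
            line += f" references {ref_table} ({ref_column})"
        parts.append(line)
    return f"create table {table} ({', '.join(parts)});"


statements = [render_create_table(t, cols) for t, cols in SCHEMA.items()]
```

Keeping the schema as data is what makes each commit reviewable: the build script, not a person, turns the file into DDL.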

schema.yaml

  users:
    - column: id
      type: integer
    - column: address
      type: varchar
  groups:
    - column: admin
      type: integer
      references:
        - users.id

Users
  integer id
  varchar address

Groups
  integer admin

remodel

  create table ... ( ... )
  alter table ... add column ...
  alter table ... drop column ...
  alter table ... rename column ... to ...

Reject on error
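The remodel step can be pictured as a diff between two versions of the schema. The sketch below is an assumption about how such a diff might work, not the tool from the talk: `diff_table` is a hypothetical function that compares column maps and emits the alter table statements listed above. Renames have to be declared explicitly, because a rename cannot be told apart from a drop plus an add.

```python
def diff_table(table, old_columns, new_columns, renames=None):
    """Return the alter table statements that turn old_columns into new_columns.

    old_columns / new_columns map column name -> type.
    renames maps old name -> new name.
    """
    renames = renames or {}
    statements = []
    for old, new in renames.items():
        statements.append(f"alter table {table} rename column {old} to {new};")
    # Apply the renames to the old view of the table before comparing.
    current = {renames.get(name, name): typ for name, typ in old_columns.items()}
    for name, typ in new_columns.items():
        if name not in current:
            statements.append(f"alter table {table} add column {name} {typ};")
    for name in current:
        if name not in new_columns:
            statements.append(f"alter table {table} drop column {name};")
    return statements


old = {"u_id": "int", "name": "varchar"}
new = {"u_id": "int", "first_name": "varchar", "last_name": "varchar"}
plan = diff_table("users", old, new, renames={"name": "first_name"})
```

Type changes are deliberately not handled here; they need the multi-step procedure for altering column types.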

Concurrency & Locks

Start → commit 1 → commit 2 → commit 3 → Done (rollback on failure)

Users
  integer id
  varchar address

Queries

Alter Table locks
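A sketch of the commit-or-rollback behaviour, assuming the warehouse runs DDL inside transactions. Python's built-in sqlite3 is used here purely as a stand-in for a Redshift connection so that the example runs anywhere; `apply_commit` is a hypothetical name.

```python
import sqlite3


def apply_commit(conn, statements):
    """Run all statements in one transaction; roll back everything on error."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        for statement in statements:
            cur.execute(statement)
    except sqlite3.Error:
        cur.execute("ROLLBACK")  # reject the whole commit
        return False
    cur.execute("COMMIT")
    return True


# isolation_level=None lets the code manage BEGIN / COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER)")

ok = apply_commit(conn, ["ALTER TABLE users ADD COLUMN address TEXT"])
rejected = apply_commit(conn, [
    "ALTER TABLE users ADD COLUMN phone TEXT",
    "ALTER TABLE missing_table ADD COLUMN x TEXT",  # fails, so phone is undone
])
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

Short transactions matter because alter table holds a lock that blocks queries on the table until the transaction ends.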

Altering Column Types

1. Add temporary column
   alter table messages add column created_tmp timestamp;

   Messages
     date created
     ts created_tmp

2. Copy data to new column
   update messages set created_tmp = created;

3. Rename columns
   alter table messages rename column created to created_old;
   alter table messages rename column created_tmp to created;

   Messages
     date created_old
     ts created

4. Drop old column
   alter table messages drop column created_old;

   Messages
     ts created

Reject on error
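The add, copy, rename, drop sequence for changing a column's type is mechanical enough to generate. A minimal sketch; `alter_column_type` and the `_tmp` / `_old` suffixes are illustrative choices, not an API from the talk.

```python
def alter_column_type(table, column, new_type):
    """Return the statements that change a column's type via a temp column."""
    tmp, old = f"{column}_tmp", f"{column}_old"
    return [
        # 1. Add temporary column with the target type.
        f"alter table {table} add column {tmp} {new_type};",
        # 2. Copy data to the new column (the cast happens here).
        f"update {table} set {tmp} = {column};",
        # 3. Rename columns so the new one takes over the original name.
        f"alter table {table} rename column {column} to {old};",
        f"alter table {table} rename column {tmp} to {column};",
        # 4. Drop the old column.
        f"alter table {table} drop column {old};",
    ]


steps = alter_column_type("messages", "created", "timestamp")
```

If the copy in step 2 fails, for example because a value cannot be cast, the whole change should be rejected before any rename happens.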

Rebuilding Views & Constraints

View: Group Admins
  string group_name
  string admin_name

Users
  string name

Groups
  integer admin

DAG: Directed Acyclic Graph

Approach #1: Drop all, and reconstruct until reaching stability

Approach #2: pg_depend

On error - reject
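If the dependencies are known, for instance read from pg_depend, the rebuild order is a topological sort of the DAG. A sketch using Python's standard graphlib (3.9+); the dependency map is hypothetical and hand-written here rather than queried from the catalog.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each object lists the objects it depends on.
DEPENDS_ON = {
    "users": [],
    "groups": ["users"],                  # groups.admin references users
    "group_admins": ["users", "groups"],  # the view reads both tables
}


def rebuild_order(depends_on):
    """Return objects in an order where dependencies come first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # A cycle means the graph is not a DAG; reject the change.
        raise ValueError(f"cyclic dependency: {err.args[1]}") from err


order = rebuild_order(DEPENDS_ON)
```

Dropping happens in the reverse of this order, recreating in this order.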

#2 Query Changes

Transformations, Sortkeys, Distkeys

ETL Is Yesterday's Problem

ETL: extract, transform, load → Data Available

ELT: extract, load → Data Available, transform

Rigid, Inflating, Dev-dependent

Lost

Raw Data
  users
  groups

Transformation Views

  View: users-per-group
    int g_id
    varchar name
    int count_users

  View: avg-turn-around
    string type
    float turn_around
    int uniques

Immediate Availability

Selective Materialization: avg-turn-around …
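A small runnable illustration of transformation views over raw data. It uses sqlite3 as a stand-in for Redshift, and the view is named users_per_group (with an underscore) so it is a valid identifier; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups" (g_id INTEGER, name TEXT);
    CREATE TABLE users (u_id INTEGER, g_id INTEGER);
    INSERT INTO "groups" VALUES (1, 'Group 1'), (2, 'Group 2');
    INSERT INTO users VALUES (10, 1), (11, 1), (12, 2);

    -- The transformation lives in a view over the raw tables.
    CREATE VIEW users_per_group AS
    SELECT g.g_id, g.name, COUNT(u.u_id) AS count_users
    FROM "groups" g LEFT JOIN users u ON u.g_id = g.g_id
    GROUP BY g.g_id, g.name;
""")

# New raw rows are visible through the view immediately, with no reload.
conn.execute("INSERT INTO users VALUES (13, 2)")
rows = conn.execute(
    "SELECT name, count_users FROM users_per_group ORDER BY g_id"
).fetchall()

# Selective materialization: snapshot only the views that are worth it.
conn.execute("CREATE TABLE users_per_group_mat AS SELECT * FROM users_per_group")
materialized = conn.execute("SELECT COUNT(*) FROM users_per_group_mat").fetchone()[0]
```

Changing a transformation then means replacing a view definition, not re-running a load.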

Sortkeys: Recap

Unsorted gender: 1mb blocks, Female and Male mixed within each block

Sorted gender: 1mb blocks, Female blocks followed by Male blocks
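Why sorting helps can be shown with a toy model. The sketch assumes the engine keeps a min and max value per block and skips blocks whose range cannot contain the filter value; blocks here hold 2 rows instead of 1mb, and `blocks_to_scan` is a hypothetical helper.

```python
def blocks_to_scan(values, block_size, wanted):
    """Count blocks whose min/max range could contain the wanted value."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return sum(1 for block in blocks if min(block) <= wanted <= max(block))


genders = ["female", "male"] * 4  # unsorted: every block mixes both values
unsorted_scans = blocks_to_scan(genders, 2, "female")
sorted_scans = blocks_to_scan(sorted(genders), 2, "female")
```

With mixed blocks all four blocks must be read; once sorted, only the two blocks holding 'female' qualify.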

SELECT COUNT(1) FROM users WHERE gender = 'female'

  count
  2498644

SELECT * FROM STL_EXPLAIN WHERE ...

  plannode            cost                                         info
  Aggregate           cost=68718.76..68718.76 rows=1 width=0
  Seq Scan on users   cost=0.00..62500.00 rows=2487501 width=0     Filters: gender = 'female'

SELECT DATEDIFF('ms', starttime, endtime), * FROM STL_SCAN

  slice   datediff   rows   pre_filter   is_rrscan
  1       250        4071   7813         t
  2       309        3846   7813         t

rows / pre_filter: 52%, 49%
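The two percentages match the share of rows that survive the filter on each slice, that is rows divided by pre_filter. A quick check of that reading, using the STL_SCAN figures from this slide:

```python
# (slice, rows, pre_filter) taken from the STL_SCAN output on this slide.
SCANS = [(1, 4071, 7813), (2, 3846, 7813)]


def kept_percent(rows, pre_filter):
    """Share of scanned rows that pass the filter, as a whole percentage."""
    return round(100 * rows / pre_filter)


percents = {slice_id: kept_percent(rows, pre) for slice_id, rows, pre in SCANS}
```

Roughly half of every scanned range is thrown away, which is the signal that gender is a sortkey candidate.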

Diststyle & Distkeys: Recap

Even
All
Key (gender): Female, Male
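The three distribution styles can be simulated in a few lines. This is a toy model, not Redshift's implementation: crc32 is a stand-in for whatever hash the engine uses, and `distribute` is a hypothetical function.

```python
from zlib import crc32


def distribute(rows, nodes, style, key=None):
    """Return one list of rows per node for the given distribution style."""
    placement = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        if style == "ALL":
            for node in placement:
                node.append(row)              # full copy on every node
        elif style == "EVEN":
            placement[i % nodes].append(row)  # round robin
        elif style == "KEY":
            # Rows with the same key value always land on the same node.
            node = crc32(str(row[key]).encode()) % nodes
            placement[node].append(row)
        else:
            raise ValueError(f"unknown diststyle: {style}")
    return placement


users = [{"u_id": i, "gender": g} for i, g in enumerate(["female", "male"] * 3)]
even = distribute(users, 2, "EVEN")
every = distribute(users, 2, "ALL")
by_key = distribute(users, 2, "KEY", key="gender")
```

A low-cardinality key such as gender co-locates equal values but can leave nodes unevenly loaded, which is why the choice of distkey matters.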

SELECT groups.name, COUNT(DISTINCT u_id)
FROM groups FULL JOIN users ON groups.g_id = users.g_id
GROUP BY groups.name;

  name      count
  Group 1   6
  Group 2   2

  plan                           cost          info
  HashAggregate                  61000331250
  Subquery Scan                  61000306250
  HashAggregate                  61000256250
  Hash Full Join DS_DIST_BOTH    61000231250   users.g_id = groups.g_id
  Seq Scan on users              50000         (outer)
  Hash                           15000
  Seq Scan on groups             15000         (inner)

Good: DS_DIST_NONE, DS_DIST_ALL_NONE

OK: DS_DIST_INNER

Bad: DS_BCAST_INNER, DS_DIST_ALL_INNER, DS_DIST_BOTH

[Diagram: node 1, node 2, node 3]
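A sketch of how plan text could be scanned for these attributes. The Good / OK / Bad assignment in `RATING` is an assumption based on how much data each step moves, and DS_DIST_OUTER is included although this slide does not list it; `rate_joins` is a hypothetical helper.

```python
import re

# Assumed rating per redistribution attribute (not taken verbatim from the talk).
RATING = {
    "DS_DIST_NONE": "good",
    "DS_DIST_ALL_NONE": "good",
    "DS_DIST_INNER": "ok",
    "DS_DIST_OUTER": "ok",
    "DS_BCAST_INNER": "bad",
    "DS_DIST_ALL_INNER": "bad",
    "DS_DIST_BOTH": "bad",
}


def rate_joins(plan_lines):
    """Return (attribute, rating) for every DS_* attribute found in the plan."""
    found = []
    for line in plan_lines:
        for attr in re.findall(r"\bDS_[A-Z_]+\b", line):
            found.append((attr, RATING.get(attr, "unknown")))
    return found


plan = [
    "Hash Full Join DS_DIST_BOTH",
    "Seq Scan on users",
    "Hash Full Join DS_DIST_NONE",
]
ratings = rate_joins(plan)
```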

users DISTSTYLE KEY DISTKEY (g_id)
groups DISTSTYLE KEY DISTKEY (g_id)

  plan                           cost     info
  HashAggregate                  331250
  Subquery Scan                  306250
  HashAggregate                  256250
  Hash Full Join DS_DIST_NONE    231250   users.g_id = groups.g_id
  Seq Scan on users              50000
  Hash                           15000
  Seq Scan on groups             15000

Real Life Example

SELECT "type", AVG(time_to_session) avg_time_to_session
FROM (
  SELECT u_id, "type",
         DATEDIFF(seconds, MS.time, first_session_after_message) time_to_session
  FROM (
    SELECT MM.u_id, MM.time,
           CASE WHEN to_u IS NOT NULL THEN 'Private'
                WHEN to_g IS NOT NULL THEN 'Group' END "type",
           MIN(S.start_time) first_session_after_message
    FROM (
      SELECT M.time, to_u, to_g, nvl(A.u_id, B.u_id) u_id
      FROM messages M
      LEFT JOIN (
        SELECT DISTINCT G.id AS g_id, U.u_id
        FROM groups G
        RIGHT JOIN users U ON G.id = U.g_id
      ) A ON M.to_g = A.g_id
      LEFT JOIN users B ON M.to_u = B.u_id
    ) MM
    LEFT JOIN (
      SELECT u_id, start_time, end_time FROM sessions
    ) S ON MM.u_id = S.u_Id
       AND MM.time < S.start_time
       AND MM.time < S.end_time
    WHERE DATEDIFF(seconds, MM.time, S.start_time) < 3600
    GROUP BY MM.u_id, MM.time,
             CASE WHEN to_u IS NOT NULL THEN 'Private'
                  WHEN to_g IS NOT NULL THEN 'Group' END
  ) MS
) MS1
GROUP BY "type";

SELECT ...

  type      time_to_session
  Private   62.398
  Group     102.873

EXPLAIN SELECT ...

      plan                             cost            info
  1.  Hash Join DS_BCAST_INNER         5872656227563   users.u_id = sessions.u_id
  2.  Hash Left Join DS_BCAST_INNER    1949786557708   messages.to_u = users.u_id
  3.  Hash Left Join DS_DIST_INNER     1349121397704   messages.to_g = users.g_id
  4.  Hash Left Join DS_DIST_BOTH      49000231250     users.g_id = groups.g_id

users DISTKEY (u_id)
sessions DISTKEY (u_id)
messages DISTKEY (to_u)
groups DISTSTYLE ALL

Cost: 2x faster
Actual: 8x faster (12 mins to 1.5 mins)

Previously

      hash join         cost            info
  1.  DS_BCAST_INNER    5872656227563   users.u_id = sessions.u_id
  2.  DS_BCAST_INNER    1949786557708   messages.to_u = users.u_id
  3.  DS_DIST_INNER     1349121397704   messages.to_g = users.g_id
  4.  DS_DIST_BOTH      49000231250     users.g_id = groups.g_id

With the new distkeys

      hash join         cost
  1.  DS_DIST_OUTER     2900030128376
  2.  DS_DIST_NONE      1300007879765
  3.  DS_BCAST_INNER    1300001568828
  4.  DS_DIST_NONE      231250

Change per join (1 to 4): 50%, 32%, 3%, 121,000%
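The headline figures can be reproduced with simple arithmetic, assuming the "2x" compares the cost of the outermost join before and after the change. Planner cost is a relative estimate, so it is not expected to match the measured speedup.

```python
# Outermost join cost before and after the new distkeys (join 1 in both plans).
cost_before = 5872656227563
cost_after = 2900030128376
estimated_speedup = cost_before / cost_after  # about 2x

# Measured runtime in minutes, as reported on the slide.
actual_speedup = 12 / 1.5  # 8x
```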

Automated Optimization

analyze: STL_EXPLAIN, STL_SCAN

parse: explain_analysis
  int q_id
  timestamp time
  varchar table
  varchar column
  float filter_cost
  float dist_cost

rebuild: N-weeks optimum

1 distkey per table: duplicate tables or use ALL

current configuration

  users:
    - diststyle: KEY
    - distkey: g_id
    - sortkeys:
        - gender
  groups:
    - diststyle: KEY
    - distkey: g_id
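A sketch of the "optimum" step under a simple assumption: per table, the column that accumulated the most dist_cost becomes the distkey, and the column with the most filter_cost becomes the sortkey. The rows below are illustrative values shaped like explain_analysis, the N-week time window is left out, and `recommend` is a hypothetical function.

```python
from collections import defaultdict

# Illustrative rows: (table, column, filter_cost, dist_cost)
ROWS = [
    ("users", "gender", 62500.0, 0.0),
    ("users", "g_id", 0.0, 61000231250.0),
    ("users", "u_id", 0.0, 5872656227563.0),
    ("groups", "g_id", 0.0, 61000231250.0),
]


def recommend(rows):
    """Pick, per table, the column with the highest total dist / filter cost."""
    dist = defaultdict(lambda: defaultdict(float))
    filt = defaultdict(lambda: defaultdict(float))
    for table, column, filter_cost, dist_cost in rows:
        dist[table][column] += dist_cost
        filt[table][column] += filter_cost
    config = {}
    for table in dist:
        best_dist = max(dist[table], key=dist[table].get)
        best_filter = max(filt[table], key=filt[table].get)
        config[table] = {"distkey": best_dist}
        # Only suggest a sortkey if some filter cost was actually observed.
        if filt[table][best_filter] > 0:
            config[table]["sortkeys"] = [best_filter]
    return config


config = recommend(ROWS)
```

Because a table gets only one distkey, a table pulled in two directions would need a duplicate or DISTSTYLE ALL, as noted above.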

Rebuilding Tables

users → users-dist-uid

1. Clone table schema with new Sortkey and Distkey
2. … unload / copy data
3. … chase by update_time
4. Empiric test: replay queries / explains
5. instant swap
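The clone, copy, chase and swap steps can be sketched as generated SQL. This is a simplification under stated assumptions: the copy uses insert ... select instead of unload / copy, the chase assumes rows are only appended (updated rows would need a delete first), and `rebuild_statements` with its `_new` / `_old` suffixes is hypothetical.

```python
def rebuild_statements(table, columns, distkey, sortkey, new_suffix="_new"):
    """Return the statements for clone, copy, chase and swap, in order.

    columns maps column name -> type.
    """
    clone = f"{table}{new_suffix}"
    cols = ", ".join(f"{name} {typ}" for name, typ in columns.items())
    return [
        # Clone the schema with the new distkey and sortkey.
        f"create table {clone} ({cols}) distkey ({distkey}) sortkey ({sortkey});",
        # Bulk copy of the existing data.
        f"insert into {clone} select * from {table};",
        # Chase: pick up rows that arrived while the bulk copy was running.
        f"insert into {clone} select * from {table} "
        f"where update_time > (select max(update_time) from {clone});",
        # Swap: two renames, ideally inside one transaction.
        f"alter table {table} rename to {table}_old;",
        f"alter table {clone} rename to {table};",
    ]


stmts = rebuild_statements(
    "users",
    {"u_id": "int", "gender": "varchar", "update_time": "timestamp"},
    distkey="u_id",
    sortkey="gender",
)
```

The empiric test, replaying queries or explains against the clone, would run between the chase and the swap.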

Summary: Continuous Data Warehousing

Future: Tip of the Iceberg, Frameworks & Platforms

?

Panoply.io: Automated Data Management Platform over Redshift