Best Practices for Supercharging Cloud Analytics on Amazon Redshift


Tina Adams, Amazon Redshift
Brandon Davis, Cervello
Maneesh Joshi, SnapLogic

May 2014


Featured Speakers


Agenda

• Amazon Redshift Feature and Market Update

• SnapLogic Case Studies with Amazon Redshift

• Demo: SnapLogic Free Trial for Amazon Redshift and RDS

• Cervello: Implementation Best Practices


Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift


Amazon Redshift Architecture

• Leader Node
  – SQL endpoint
  – Stores metadata
  – Coordinates query execution
• Compute Nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
  – Optimized for data processing
  – DW1: HDD; scale from 2TB to 1.6PB
  – DW2: SSD; scale from 160GB to 256TB

[Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC. The leader node and the compute nodes (each shown with 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC). Ingestion, backup, and restore run against Amazon S3 / DynamoDB / SSH.]

Amazon Redshift is priced to let you analyze all your data

• Number of nodes x cost per hr

• No charge for leader node

• No upfront costs

• Pay as you go

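 DW1 (HDD)          | Price per hour for DW1.XL single node | Effective annual price per TB
--------------------+---------------------------------------+-------------------------------
 On-Demand          | $0.850                                | $3,723
 1 Year Reservation | $0.500                                | $2,190
 3 Year Reservation | $0.228                                | $999

 DW2 (SSD)          | Price per hour for DW2.L single node  | Effective annual price per TB
--------------------+---------------------------------------+-------------------------------
 On-Demand          | $0.250                                | $13,688
 1 Year Reservation | $0.161                                | $8,794
 3 Year Reservation | $0.100                                | $5,498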
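The effective annual price per TB appears to follow from the hourly rate and the storage held by one node. The sketch below is an illustration rather than an official pricing formula. It assumes 8,760 hours per year and single-node storage of 2TB for DW1.XL and 160GB for DW2.L, the smallest cluster sizes quoted on the architecture slide. The function name is made up for this example. The DW2 reservation rows differ slightly from this calculation, presumably because the hourly rates shown are rounded.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def effective_annual_price_per_tb(price_per_hour, node_storage_gb):
    """Annual cost of one node divided by its storage in TB (1 TB = 1,000 GB)."""
    annual_cost = price_per_hour * HOURS_PER_YEAR
    return annual_cost * 1000 / node_storage_gb


# Assumed single-node storage: DW1.XL = 2 TB, DW2.L = 160 GB.
dw1_on_demand = effective_annual_price_per_tb(0.850, 2000)
dw1_three_year = effective_annual_price_per_tb(0.228, 2000)
dw2_on_demand = effective_annual_price_per_tb(0.250, 160)

print(f"DW1.XL on-demand: ${dw1_on_demand:,.0f} per TB per year")
print(f"DW1.XL 3-year:    ${dw1_three_year:,.0f} per TB per year")
print(f"DW2.L on-demand:  ${dw2_on_demand:,.0f} per TB per year")
```

The leader node is left out of the calculation because, as noted above, it is not charged.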


Amazon Redshift Feature Delivery


Improved Concurrency

• Before: 15 concurrent queries

• After: 50 concurrent queries


COPY from JSON

JSONPaths file (s3://mybucket/venue_jsonpaths.json):

{
  "jsonpaths": [
    "$['id']",
    "$['name']",
    "$['location'][0]",
    "$['location'][1]",
    "$['seats']"
  ]
}

COPY command:

COPY venue
FROM 's3://mybucket/venue.json'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON AS 's3://mybucket/venue_jsonpaths.json';
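Each JSONPaths expression selects one value per JSON object, and the values land in the target columns in the order the expressions are listed. The Python sketch below is a simplified model of that mapping, not Redshift's implementation. It handles only the bracket notation used on the slide, and the sample record and function names are made up for this example.

```python
import json
import re

# The JSONPaths file from the slide: one expression per target column, in column order.
JSONPATHS = json.loads("""
{ "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] }
""")["jsonpaths"]

# Matches one bracket step: either ['key'] or [index].
STEP = re.compile(r"\['([^']*)'\]|\[(\d+)\]")


def resolve(path, record):
    """Walk a bracket-notation path such as $['location'][0] through a parsed JSON value."""
    value = record
    for key, index in STEP.findall(path[1:]):  # skip the leading '$'
        value = value[int(index)] if index else value[key]
    return value


def to_row(record):
    """Return one value per JSONPaths expression, in path order."""
    return [resolve(path, record) for path in JSONPATHS]


# Hypothetical input object, made up for this example.
sample = json.loads(
    '{"id": 7, "name": "Example Hall", "location": ["Seattle", "WA"], "seats": 1200}'
)
print(to_row(sample))
```

Note how the two-element `location` array is flattened into two separate columns by listing `[0]` and `[1]` as separate paths.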


COPY from Amazon Elastic MapReduce

COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

Amazon EMR → Amazon Redshift


REGEXP_SUBSTR()

select email, regexp_substr(email,'@[^.]*') from users limit 5;

 email              | regexp_substr
--------------------+---------------------
 [email protected] | @nonnisiAenean
 [email protected] | @lacusUtnec
 [email protected] | @semperpretiumneque
 [email protected] | @tristiquealiquet
 [email protected] | @sodalesat
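The pattern `@[^.]*` matches the `@` sign and everything after it up to, but not including, the first dot. The sketch below applies the same pattern with Python's `re` module to show the behavior. The addresses are made up for this example, and returning an empty string when nothing matches is an assumption meant to mirror the SQL function.

```python
import re

# Same pattern as the slide: '@' followed by any run of characters that are not '.'.
PATTERN = re.compile(r"@[^.]*")


def regexp_substr(text):
    """Return the first match of PATTERN, or an empty string when nothing matches."""
    match = PATTERN.search(text)
    return match.group(0) if match else ""


# Hypothetical addresses, made up for this example.
for email in ["someone@example.com", "first.last@mail.example.org", "no-at-sign"]:
    print(email, "|", regexp_substr(email))
```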


Resize Progress

• Progress indicator in console

• New API call


ECDHE cipher suites for perfect forward secrecy over SSL

ECDHE-RSA & ECDHE-ECDSA cipher suites supported


Amazon Redshift integrates with multiple data sources

• Amazon S3
• Amazon EMR
• DynamoDB
• Amazon RDS
• Corporate Datacenter


The SnapLogic Platform for Elastic Integration: Powering Analytics, Apps and APIs

Data, Applications, APIs


Why SnapLogic?

Multi-Point Orchestration
• SnapStore: 160+ Prebuilt Snaps
• Orchestration & Workflow

Modern Platform
• Elastic, Scale-out Architecture
• Hybrid: Cloud to Cloud and Cloud to Ground Use Cases

Faster Integration
• Easily Design, Monitor, Manage
• Deploy in Days not Months


Multi-Point: Comprehensive Connectivity

Snap your Apps: 160+ pre-built integrations


Software-defined Integration

[Diagram: Metadata and Data paths]

• Streams: No data is stored/cached

• Secure: 100% standards-based

• Elastic: Scales out & handles data, app, API integration use cases

Hybrid Scale-out Architecture Respects Data Gravity


International Hotel Chain: Reservation Data Management

PAST
• 126 TB of hotel reservation data
• Prohibitive cost-per-query for analytics
• Unacceptable performance

PRESENT
• Shipped 126 TB of data via FedEx to load into Amazon Redshift
• Now run a daily sync of data changes (100-150GB) between on-premises and cloud with SnapLogic
• Enrich analytics with Twitter and Travelocity data
• Improved cost-per-query and performance


Mid-sized Pharma Creates Cloud Data Mart

[Diagram: Cloud to On-prem Snaplex and Cloud to Cloud Snaplex, with REST, Metadata, and Data flows]

• Consolidate DBs (Customer, Address, and Order) and SFDC (Contact and Account) into Redshift

• MicroStrategy is the visualization layer


DEMO


Cervello Helps Clients Win With Data

Advise, Implement, Support
• Enterprise Performance Management (Finance)
• Customer Relationship Management (Sales & Marketing)
• Business Intelligence & Analytics (IT)
• Data Management
• Custom Development

• We have offices in Boston, New York, Dallas and the UK
• Offshore development and support teams in Russia and India
• We partner with the leading on-premise and cloud technology companies


Implementation Case Study

• Hospitality industry analytics
  – Detailed transactional data
  – Weekly / monthly / yearly trend analysis
  – Began with single-node cluster, adding nodes as data volumes grow

Source Data → ETL → Redshift → Analytics


Best Practice #1: Choose The Right Pattern

Requirements
• Collect external data loads before merging with existing data
• Maintain history of cleansed and standardized source data
• Use data structures optimized for analytics
  – Dimension and fact tables for analytics
  – Aggregate tables

Design
• Staging tables
• History tables
• Star schema data warehouse


Best Practice #2: Select the Right Node Type

Early Stages
• Performance was good with initial volumes and small data sets on single node
• Evaluated dense storage (dw1) and dense compute (dw2) nodes
• More opportunity to optimize design as volumes grew

Mature Stage
• Increased nodes to handle larger volumes
  – Solution leverages dense storage (dw1) nodes
  – Expected to stabilize between 10-20TB
• Have also seen smaller volumes that work really well in dense compute (dw2) nodes


Best Practice #3: Leverage MPP

Goals
• Spread data evenly across nodes while also optimizing join performance
• Distribution key and sort keys are primary considerations

[Diagram: Leader Node above Compute Nodes 1, 2, 3, ... n]

Approach
• Initial fact table distribution key caused skewed data
• Changed to dimension foreign key with better distribution for 40%+ improvement in query times
• Surrogate keys on dimension tables
  – Primary key
  – Sort key and distribution key OR distribute to all nodes
  – Sort on foreign keys in fact tables
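To see why the choice of distribution key matters, the sketch below hashes key values onto a small set of nodes and compares how evenly the rows spread. It is an illustration only: MD5 stands in for whatever hash Redshift uses internally, and the node count, key values, and function names are made up for this example.

```python
import hashlib
from collections import Counter


def node_for(key, node_count):
    """Assign a key to a node by hashing it (MD5 is a stand-in, not Redshift's hash)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def skew_ratio(keys, node_count):
    """Rows on the fullest node divided by the mean rows per node (1.0 is perfectly even)."""
    rows_per_node = Counter(node_for(key, node_count) for key in keys)
    mean = len(keys) / node_count
    return max(rows_per_node.values()) / mean


NODES = 4
# Hypothetical fact table of 10,000 rows.
# Low-cardinality key: 90% of rows share one value, so they all land on one node.
skewed_keys = ["US"] * 9000 + ["CA"] * 500 + ["MX"] * 500
# Foreign key with many distinct values: rows spread across all nodes.
even_keys = list(range(10000))

print(f"skewed key: {skew_ratio(skewed_keys, NODES):.2f}")
print(f"even key:   {skew_ratio(even_keys, NODES):.2f}")
```

A ratio well above 1.0 means one node holds far more rows than the others, so parallel queries wait on that node. This is the kind of skew the change of distribution key was meant to remove.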


Best Practice #4: Use Columnar Compression

Goals
• Reduce I/O workload by minimizing size of data stored on disk

Approach
• Started with compression settings based on general data types
  – VARCHAR to TEXT255, INTEGER to MOSTLY16, etc.
  – Iterate using ANALYZE COMPRESSION
• Redshift applies automatic compression during COPY
  – Staging tables
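As a rough illustration of why an encoding such as MOSTLY16 reduces I/O, the sketch below estimates the bytes needed for an INTEGER column when values that fit in 16 bits take 2 bytes and the rest stay at their raw 4 bytes. This is a simplified model that ignores block layout and metadata. The column values and function name are made up for this example, and ANALYZE COMPRESSION remains the way to get recommendations for real data.

```python
RAW_INTEGER_BYTES = 4      # a Redshift INTEGER is 4 bytes uncompressed
MOSTLY16_BYTES = 2         # values that fit in 16 bits are stored in 2 bytes
INT16_MIN, INT16_MAX = -32768, 32767


def mostly16_bytes(values):
    """Estimate storage for an INTEGER column under MOSTLY16, ignoring block overhead."""
    total = 0
    for value in values:
        fits = INT16_MIN <= value <= INT16_MAX
        total += MOSTLY16_BYTES if fits else RAW_INTEGER_BYTES
    return total


# Hypothetical column: mostly small quantities with a few large outliers.
quantities = [3, 12, 250, 7, 1_000_000, 42, 9, 18, 70_000, 5]
raw = RAW_INTEGER_BYTES * len(quantities)
compressed = mostly16_bytes(quantities)
print(f"raw: {raw} bytes, MOSTLY16 estimate: {compressed} bytes")
```

The saving shrinks as the share of out-of-range values grows, which is one reason to iterate on the initial type-based settings.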


Best Practice #5: Load and Manage Data

• ETL and ELT
  – ETL: First set of processes prepares data for analytics: business logic, standardization, validation
  – ELT: Second set of processes loads data into Redshift and transforms it into analytical structures
• Data management
  – Enforce constraints within ETL processes
  – Analyze after loads to update statistics
  – Vacuum after large loads to existing tables, updates and deletes
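The load-then-transform step and the follow-up maintenance can be sketched as a staging-table merge. The example below is a minimal illustration, not the presenters' implementation. It uses Python's built-in SQLite as a stand-in for Redshift so that it runs anywhere, and the table and column names are made up for this example.

```python
import sqlite3

# SQLite stands in for Redshift here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reservations (reservation_id INTEGER, status TEXT);
    CREATE TABLE reservations_staging (reservation_id INTEGER, status TEXT);
    INSERT INTO reservations VALUES (1, 'booked'), (2, 'booked');
    -- In Redshift this staging load would be a COPY from Amazon S3.
    INSERT INTO reservations_staging VALUES (2, 'cancelled'), (3, 'booked');
""")

# Redshift does not enforce primary keys, so the load itself removes the rows
# that are about to be replaced, then inserts the new versions.
with conn:  # one transaction: commit on success, roll back on error
    conn.execute("""
        DELETE FROM reservations
        WHERE reservation_id IN (SELECT reservation_id FROM reservations_staging)
    """)
    conn.execute("INSERT INTO reservations SELECT * FROM reservations_staging")
    conn.execute("DELETE FROM reservations_staging")

# Refresh planner statistics after the load. On Redshift, a VACUUM would follow
# here as well to reclaim deleted rows and restore sort order.
conn.execute("ANALYZE")

rows = conn.execute(
    "SELECT reservation_id, status FROM reservations ORDER BY reservation_id"
).fetchall()
print(rows)
```

Running the delete and insert in one transaction keeps readers from seeing the table with the old rows removed but the new rows not yet loaded.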


Bringing it All Together

• Analytic queries
  – Minimize number of query columns to improve performance
  – Most queries use SUM or COUNT
  – Leveraging aggregate tables for monthly dashboards
• Explain long running queries to help optimize design
  – Sorting / merging within nodes and merging at leader node
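The aggregate-table idea can be sketched as follows: compute the SUM and COUNT per month once after each load, then point the dashboard at the small result table. The example is an illustration only. It uses Python's built-in SQLite as a stand-in for Redshift, and the schema and values are made up for this example.

```python
import sqlite3

# SQLite stands in for Redshift; the schema and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_date TEXT, property_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES
        ('2014-03-30', 1, 120.0),
        ('2014-03-31', 2, 80.0),
        ('2014-04-01', 1, 200.0),
        ('2014-04-02', 1, 50.0);

    -- Pre-aggregate once after each load instead of scanning the fact table
    -- every time the dashboard refreshes.
    CREATE TABLE agg_sales_monthly AS
    SELECT substr(sale_date, 1, 7) AS sale_month,
           SUM(amount)             AS total_amount,
           COUNT(*)                AS sale_count
    FROM fact_sales
    GROUP BY substr(sale_date, 1, 7);
""")

# The dashboard query names only the columns it needs and reads the small table.
monthly = conn.execute(
    "SELECT sale_month, total_amount, sale_count "
    "FROM agg_sales_monthly ORDER BY sale_month"
).fetchall()
print(monthly)
```

Naming columns explicitly instead of using `SELECT *` matters more in a columnar store, because each extra column is a separate set of blocks to read.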


Learn more…

1. Try out the SnapLogic Free Trial for Amazon Redshift: http://snaplogic.com/redshift-trial

2. Learn more about Amazon Redshift at: http://aws.amazon.com/redshift

3. Learn more about Cervello at: http://mycervello.com/
