In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, your methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and share tips and tricks on designing the right information architecture, data models and other tactical optimizations. To learn more, visit: http://www.snaplogic.com/redshift-trial
1
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Tina Adams, Amazon Redshift
Brandon Davis, Cervello
Maneesh Joshi, SnapLogic
May 2014
2
Featured Speakers
3
Agenda
• Amazon Redshift Feature and Market Update
• SnapLogic Case Studies with Amazon Redshift
• Demo: SnapLogic Free Trial for Amazon Redshift and RDS
• Cervello: Implementation Best Practices
4
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
5
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Architecture diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC; the leader node and compute nodes communicate over 10 GigE (HPC); ingestion, backup, and restore run through Amazon S3 / DynamoDB / SSH. Each node is labeled 128GB RAM, 16TB disk, 16 cores.]
6
Amazon Redshift is priced to let you analyze all your data
• Number of nodes x cost per hr
• No charge for leader node
• No upfront costs
• Pay as you go
DW1 (HDD)          | Price per hour, DW1.XL single node | Effective annual price per TB
On-Demand          | $0.850                             | $3,723
1 Year Reservation | $0.500                             | $2,190
3 Year Reservation | $0.228                             | $999

DW2 (SSD)          | Price per hour, DW2.L single node  | Effective annual price per TB
On-Demand          | $0.250                             | $13,688
1 Year Reservation | $0.161                             | $8,794
3 Year Reservation | $0.100                             | $5,498
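As a rough check on the on-demand rows, the effective annual price per TB can be reproduced as hourly price times hours per year, divided by storage per node. The sketch below is illustrative only: it assumes 8,760 hours per year and takes per-node storage from the minimum cluster sizes quoted on the architecture slide (2TB for DW1.XL, 160GB for DW2.L), and the function name is made up for this example. The DW2 reservation rows appear to be based on unrounded effective hourly rates, so they reproduce only approximately.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 8,760 hours, no leap-year adjustment


def effective_annual_price_per_tb(price_per_hour, node_storage_gb):
    """Annual cost of one node divided by its storage, expressed per TB."""
    annual_node_cost = price_per_hour * HOURS_PER_YEAR
    node_storage_tb = node_storage_gb / 1000  # decimal TB, so 160GB = 0.16TB
    return annual_node_cost / node_storage_tb


# On-demand hourly prices from the slide; storage sizes are assumed from
# the minimum cluster sizes quoted earlier (2TB and 160GB).
dw1_xl = effective_annual_price_per_tb(0.850, 2000)
dw2_l = effective_annual_price_per_tb(0.250, 160)

print(f"DW1.XL on-demand: ${dw1_xl:,.0f} per TB per year")
print(f"DW2.L on-demand:  ${dw2_l:,.0f} per TB per year")
```

The same function applied to the 3-year DW1 rate of $0.228 lands just under $1,000, which is where the "less than $1,000/TB/Year" headline comes from.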
7
Amazon Redshift Feature Delivery
-60
40
-30
8
Improved Concurrency
Before: 15
After: 50
9
COPY from JSON
{
  "jsonpaths": [
    "$['id']",
    "$['name']",
    "$['location'][0]",
    "$['location'][1]",
    "$['seats']"
  ]
}

COPY venue
FROM 's3://mybucket/venue.json'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON AS 's3://mybucket/venue_jsonpaths.json';
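To make the role of the JSONPaths file concrete, here is a minimal Python sketch of how each path expression could pick one column value out of a JSON record, in the order the array lists them. It is an illustration of the mapping only, not how Amazon Redshift implements COPY; the `resolve` and `to_row` helpers and the sample venue record are invented for this example.

```python
import re

# The JSONPaths array from the slide, one expression per target column.
JSONPATHS = ["$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']"]

# Matches one bracket step: ['key'] or [index].
_STEP = re.compile(r"\['([^']*)'\]|\[(\d+)\]")


def resolve(path, record):
    """Walk a bracket-notation path such as $['location'][0] through a record."""
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path!r}")
    value = record
    for key, index in _STEP.findall(path[1:]):
        value = value[int(index)] if index else value[key]
    return value


def to_row(record, paths=JSONPATHS):
    """Return column values in the order the JSONPaths array lists them."""
    return [resolve(p, record) for p in paths]


# Hypothetical venue record, invented for illustration.
venue = {"id": 1, "name": "Example Arena", "location": ["Boston", "MA"], "seats": 5000}
row = to_row(venue)
print(row)
```

Note how the nested `location` array is flattened into two separate columns, which is the main reason to supply a JSONPaths file instead of relying on key names alone.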
10
COPY from Amazon Elastic MapReduce
COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

Amazon EMR → Amazon Redshift
11
REGEXP_SUBSTR()
select email, regexp_substr(email,'@[^.]*') from users limit 5;
email            | regexp_substr
-----------------+---------------------
[email protected] | @nonnisiAenean
[email protected] | @lacusUtnec
[email protected] | @semperpretiumneque
[email protected] | @tristiquealiquet
[email protected] | @sodalesat
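For readers who want to try the pattern outside the database, the following Python sketch applies the same regular expression, `@[^.]*`, which matches the "@" sign and everything after it up to the first period. The helper name and the sample address are hypothetical, and returning `None` when there is no match is a choice made in this sketch rather than a description of Redshift's behavior.

```python
import re


def domain_prefix(email):
    """Return '@' plus everything up to the first '.', or None if no '@'."""
    match = re.search(r"@[^.]*", email)
    return match.group(0) if match else None


# Hypothetical address, not taken from the slide's data.
print(domain_prefix("jane.doe@example.com"))
```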
12
Resize Progress
• Progress indicator in console
• New API call
13
ECDHE cipher suites for perfect forward secrecy over SSL
ECDHE-RSA & ECDHE-ECDSA cipher suites supported
14
Amazon Redshift integrates with multiple data sources
Amazon S3 Amazon EMR
Amazon Redshift
DynamoDB
Amazon RDS
Corporate Datacenter
16
The SnapLogic Platform for Elastic Integration Powering Analytics, Apps and APIs
Data Applications APIs
17
Why SnapLogic?
Multi-Point Orchestration
• SnapStore: 160+ Prebuilt Snaps
• Orchestration & Workflow
Modern Platform
• Elastic, Scale-out Architecture
• Hybrid: Cloud to Cloud and Cloud to Ground Use Cases
Faster Integration
• Easily Design, Monitor, Manage
• Deploy in Days not Months
18
Multi-Point: Comprehensive Connectivity
Snap your Apps: 160+ pre-built integrations
19
Software-defined Integration
Metadata
Data
• Streams: No data is stored/cached
• Secure: 100% standards-based
• Elastic: Scales out & handles data, app, API integration use cases
Hybrid Scale-out Architecture Respects Data Gravity
20
International Hotel Chain Reservation Data Mgmt.
PAST
• 126 TB of hotel reservation data
• Prohibitive cost-per-query for analytics
• Unacceptable performance
PRESENT
• FedEx'ed 126 TB of data to load into AWS Redshift
• Now run a daily sync of data changes (100-150GB) between on-premises and cloud with SnapLogic
• Enrich analytics with Twitter and Travelocity data
• Improved cost-per-query and performance
21
Mid-sized Pharma Creates Cloud Data Mart
Cloud to On-prem Snaplex
REST
Cloud to Cloud Snaplex
Metadata
Data
• Consolidate DBs (Customer, Address, and Order) and SFDC (Contact and Account) into Redshift
• MicroStrategy is the visualization layer
23
DEMO
25
Cervello Helps Clients Win With Data
Advise, Implement, Support
Enterprise Performance Management (Finance)
Customer Relationship Management (Sales & Marketing)
Data Management
Custom Development
Business Intelligence & Analytics (IT)
• We have offices in Boston, New York, Dallas and the UK
• Offshore development and support teams in Russia and India
• We partner with the leading on-premise and cloud technology companies
26
Implementation Case Study
• Hospitality industry analytics
– Detailed transactional data
– Weekly / monthly / yearly trend analysis
– Began with single-node cluster, adding nodes as data volumes grow
Source Data → ETL → Redshift → Analytics
27
Best Practice #1: Choose The Right Pattern
Requirements
• Collect external data loads before merging with existing data
• Maintain history of cleansed and standardized source data
• Use data structures optimized for analytics
– Dimension and fact tables for analytics
– Aggregate tables
Design
• Staging tables
• History tables
• Star schema data warehouse
28
Best Practice #2: Select the Right Node Type
Early Stages
• Performance was good with initial volumes and small data sets on single node
• Evaluated dense storage (dw1) and dense compute (dw2) nodes
• More opportunity to optimize design as volumes grew
Mature Stage
• Increased nodes to handle larger volumes
– Solution leverages dense storage (dw1) nodes
– Expected to stabilize between 10-20TB
• Have also seen smaller volumes that work really well in dense compute (dw2) nodes
29
Best Practice #3: Leverage MPP
Goals
• Spread data evenly across nodes while also optimizing join performance
• Distribution key and sort keys are primary considerations
[Diagram: Leader Node above Compute Node 1, Compute Node 2, Compute Node 3, ... Compute Node n]
Approach
• Initial fact table distribution key caused skewed data
• Changed to dimension foreign key with better distribution for 40%+ improvement in query times
• Surrogate keys on dimension tables
– Primary key
– Sort key and distribution key OR distribute to all nodes
– Sort on foreign keys in fact tables
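The effect of a poorly chosen distribution key can be illustrated with a small simulation. The sketch below places rows on slices by hashing a key value and reports how much larger the busiest slice is than the average. Everything in it is an assumption made for illustration: the SHA-256 hash stands in for whatever hash Redshift uses internally, and the slice count, key values, and row counts are invented.

```python
import hashlib
from collections import Counter


def slice_counts(keys, num_slices):
    """Count rows per slice when each row is placed by a hash of its key."""
    counts = Counter()
    for key in keys:
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        counts[int.from_bytes(digest[:8], "big") % num_slices] += 1
    return [counts[s] for s in range(num_slices)]


def skew_ratio(counts):
    """Largest slice divided by the average slice; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


NUM_SLICES = 4  # hypothetical cluster size

# Hypothetical fact table of 10,000 rows with two candidate distribution keys:
# one where 90% of rows share a single value, one with a distinct value per row.
skewed_key = ["UNKNOWN"] * 9000 + [f"cust-{i}" for i in range(1000)]
even_key = [f"dim-{i}" for i in range(10000)]

skewed = skew_ratio(slice_counts(skewed_key, NUM_SLICES))
even = skew_ratio(slice_counts(even_key, NUM_SLICES))
print(f"skewed key ratio: {skewed:.2f}")
print(f"even key ratio:   {even:.2f}")
```

Because a parallel query finishes only when its slowest slice finishes, a ratio well above 1.0 means most of the cluster sits idle while one slice does the work, which is consistent with the improvement reported after switching keys.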
30
Best Practice #4: Use Columnar Compression
Goals
• Reduce I/O workload by minimizing size of data stored on disk
Approach
• Started with compression settings based on general data types
– VARCHAR to TEXT255, INTEGER to MOSTLY16, etc.
– Iterate using ANALYZE COMPRESSION
• Redshift applies automatic compression during COPY
– Staging tables
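One way to apply type-based starting encodings consistently is to generate the column definitions from a small lookup. The sketch below does that in Python; the mapping holds only the two pairs named on the slide, while the helper functions, table name, and column names are hypothetical. Treat the output as a starting point to refine with ANALYZE COMPRESSION, not as a recommended final design.

```python
# Starting-point encodings by general data type, as named on the slide.
DEFAULT_ENCODINGS = {"VARCHAR": "TEXT255", "INTEGER": "MOSTLY16"}


def column_ddl(name, sql_type):
    """Render one column definition, adding ENCODE when a default is known."""
    base_type = sql_type.split("(")[0].strip().upper()
    encoding = DEFAULT_ENCODINGS.get(base_type)
    suffix = f" ENCODE {encoding}" if encoding else ""
    return f"{name} {sql_type}{suffix}"


def create_table_ddl(table, columns):
    """Render a CREATE TABLE statement from (name, type) pairs."""
    body = ",\n".join("  " + column_ddl(n, t) for n, t in columns)
    return f"CREATE TABLE {table} (\n{body}\n);"


# Hypothetical staging table; column names are made up for this example.
ddl = create_table_ddl(
    "stg_reservations",
    [("hotel_code", "VARCHAR(10)"), ("nights", "INTEGER"), ("booked_on", "DATE")],
)
print(ddl)
```

Columns whose type has no entry in the lookup, such as the DATE column here, are emitted without an ENCODE clause.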
31
Best Practice #5: Load and Manage Data
• ETL and ELT
– ETL: First set of processes prepares data for analytics: business logic, standardization, validation
– ELT: Second set of processes loads data into Redshift and transforms it into analytical structures
• Data management
– Enforce constraints within ETL processes
– Analyze after loads to update statistics
– Vacuum after large loads to existing tables, updates and deletes
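The data management steps can be folded into the load process so they are not forgotten. The following Python sketch returns the maintenance statements to run after a load. The function and table names are hypothetical, and running VACUUM before ANALYZE is an assumption of this sketch, not an ordering stated on the slide.

```python
def post_load_statements(table, large_load_or_changes):
    """Return maintenance SQL to run after loading data into a table."""
    statements = []
    if large_load_or_changes:
        # Re-sort rows and reclaim space left by updates and deletes.
        statements.append(f"VACUUM {table};")
    # Refresh the statistics used by the query planner.
    statements.append(f"ANALYZE {table};")
    return statements


# Hypothetical table name.
for sql in post_load_statements("fact_reservations", large_load_or_changes=True):
    print(sql)
```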
32
Bringing it All Together
• Analytic queries
– Minimize number of query columns to improve performance
– Most queries use SUM or COUNT
– Leveraging aggregate tables for monthly dashboards
• Explain long-running queries to help optimize design
– Sorting / merging within nodes and merging at leader node
33
Learn more…
1. Try out the SnapLogic Free Trial for Amazon Redshift: http://snaplogic.com/redshift-trial
2. Learn more about Amazon Redshift at: http://aws.amazon.com/redshift
3. Learn more about Cervello at: http://mycervello.com/