
AWS Summit Berlin 2013 - Amazon Redshift


Description

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. In this session we'll give an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more.


Page 1: AWS Summit Berlin 2013 - Amazon Redshift

Steffen Krause

Amazon Redshift

Technology Evangelist

@[email protected]

Page 2: AWS Summit Berlin 2013 - Amazon Redshift

Data warehousing done the AWS way

• No upfront costs, pay as you go

• Really fast performance at a really low price

• Open and flexible with support for popular tools

• Easy to provision and scale up massively

Page 3: AWS Summit Berlin 2013 - Amazon Redshift

We set out to build…

A fast and powerful, petabyte-scale data warehouse that is:

Delivered as a managed service

A Lot Faster

A Lot Cheaper

A Lot Simpler

Amazon Redshift

Page 4: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID   Age  State
123  20   CA
345  25   WA
678  40   FL

[Diagram: scan direction in row storage vs. column storage]

Page 5: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift uses SQL

• Industry-standard SQL

• ODBC and JDBC drivers to access data (using PostgreSQL 8.x drivers)

– Most PostgreSQL features supported
– See documentation for differences

• INSERT/UPDATE/DELETE are supported

– But loading data using COPY from S3 or DynamoDB is significantly faster
– Use VACUUM after a significant number of deletes or updates

Page 6: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift loads data from S3 or DynamoDB

• Direct load from S3 or DynamoDB is supported

copy customer from 's3://mybucket/customer.txt'
credentials 'aws_access_key_id=<your-access-key-id>;aws_secret_access_key=<your-secret-access-key>'
gzip delimiter '|';

• Load data in parallel

– Split data into multiple files for parallel load
– Name files with a common prefix

• customer.txt.1, customer.txt.2, …

– Compress large datasets with gzip

• Load data in sort key order if possible
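The split-and-compress preparation above can be sketched in Python (the round-robin split and the gzip module usage are our illustrative assumptions, not a Redshift-provided tool):

```python
# Sketch: split rows into N gzip-compressed parts that share a common
# prefix (customer.txt.1.gz, customer.txt.2.gz, ...) so one COPY with
# that prefix can load them in parallel across slices.
import gzip


def split_for_parallel_load(lines, prefix, parts):
    """Round-robin `lines` into `parts` gzip files named prefix.1.gz .. prefix.N.gz."""
    names = [f"{prefix}.{i + 1}.gz" for i in range(parts)]
    buckets = [[] for _ in range(parts)]
    for i, line in enumerate(lines):
        buckets[i % parts].append(line)
    for name, bucket in zip(names, buckets):
        with gzip.open(name, "wt") as f:
            f.writelines(l + "\n" for l in bucket)
    return names
```

A COPY pointing at the prefix ('s3://mybucket/customer.txt') would then pick up all parts.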

Page 7: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift automatically compresses your data

• Compression saves space and reduces disk I/O

• COPY automatically analyzes and compresses your data

– Samples data; selects the best compression encoding
– Supports: byte dictionary, delta, mostly n, run length, text

• Customers see 4-8x space savings with real data

– 20x and higher possible depending on the data set

• Use ANALYZE COMPRESSION to see details

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
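As a toy illustration of why an encoding like delta shrinks columns of slowly changing values (a simplified sketch, not the actual Redshift on-disk format): store the first value plus small differences instead of full values.

```python
# Sketch: delta encoding stores the first value and then only the
# differences between consecutive values, which are small (and thus
# cheap to store) for sorted or slowly changing columns like listid.
def delta_encode(values):
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    out, total = [], 0
    for v in encoded:
        total += v
        out.append(total)
    return out
```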

Page 8: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift architecture

• Leader node

– SQL endpoint
– Stores metadata
– Coordinates query execution

• Compute nodes

– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB

• Single-node version available

[Diagram: leader node coordinating compute nodes (128 GB RAM, 16 TB disk, 16 cores each) over a 10 GigE HPC network; ingestion, backup, and restore via Amazon S3; SQL clients/BI tools connect via JDBC/ODBC]

jdbc:postgresql://mycluster.c7lp0qs37f41.us-east-1.redshift.amazonaws.com:8192/mydb
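Since the endpoint is an ordinary PostgreSQL-style URL, its parts can be pulled out for use with other PostgreSQL clients (psql, ODBC DSNs). A small sketch using only the standard library (the helper name is ours):

```python
# Sketch: extract host, port, and database from a Redshift JDBC URL of
# the jdbc:postgresql://host:port/db form shown above.
from urllib.parse import urlparse


def parse_jdbc_url(jdbc_url):
    parsed = urlparse(jdbc_url[len("jdbc:"):])  # drop the jdbc: prefix
    return parsed.hostname, parsed.port, parsed.path.lstrip("/")
```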

Page 9: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift runs on optimized hardware

HS1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed customer storage

HS1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed user storage, 2 GB/sec scan rate

• Optimized for I/O-intensive workloads

• High disk density

• Runs in HPC - fast network

• HS1.8XL available on Amazon EC2

Page 10: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup

• Restore

• Resize

[Diagram: cluster architecture as on Page 8 - leader node, compute nodes, Amazon S3, JDBC/ODBC clients]

Page 11: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift lets you start small and grow big

Extra Large Node (HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores

• Single node (2 TB)
• Cluster of 2-32 nodes (4 TB - 64 TB)

Eight Extra Large Node (HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE

• Cluster of 2-100 nodes (32 TB - 1.6 PB)

Note: Nodes not to scale

Page 12: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift is priced to let you analyze all your data

                   | Price per hour (HS1.XL single node) | Effective hourly price per TB | Effective annual price per TB
On-Demand          | $0.850                              | $0.425                        | $3,723
1-Year Reservation | $0.500                              | $0.250                        | $2,190
3-Year Reservation | $0.228                              | $0.114                        | $999

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go
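The effective-price columns follow directly from the hourly node price: divide by the 2 TB of compressed storage per HS1.XL node, then multiply by the 8,760 hours in a year. A quick sketch to reproduce them (rounding to whole dollars, as the table does):

```python
# Sketch: derive the per-TB columns of the pricing table from the
# hourly price of a single HS1.XL node (2 TB compressed storage).
TB_PER_NODE = 2
HOURS_PER_YEAR = 24 * 365  # 8,760


def effective_prices(price_per_node_hour):
    per_tb_hour = price_per_node_hour / TB_PER_NODE
    per_tb_year = round(per_tb_hour * HOURS_PER_YEAR)
    return per_tb_hour, per_tb_year
```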

Page 13: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Page 14: AWS Summit Berlin 2013 - Amazon Redshift

Provision a data warehouse in minutes

Page 15: AWS Summit Berlin 2013 - Amazon Redshift

Provision a data warehouse in minutes

Page 16: AWS Summit Berlin 2013 - Amazon Redshift

Provision a data warehouse in minutes

Page 17: AWS Summit Berlin 2013 - Amazon Redshift

Provision a data warehouse in minutes

Page 18: AWS Summit Berlin 2013 - Amazon Redshift

Monitor query performance

Page 19: AWS Summit Berlin 2013 - Amazon Redshift

Monitor query performance

Page 20: AWS Summit Berlin 2013 - Amazon Redshift

Point and click resize

Page 21: AWS Summit Berlin 2013 - Amazon Redshift

Resize your cluster while remaining online

• New target cluster provisioned in the background
• Only charged for the source cluster during resize

[Diagram: SQL clients/BI tools connected to the source cluster while a larger target cluster is provisioned; each node 128 GB RAM, 48 TB disk, 16 cores]

Page 22: AWS Summit Berlin 2013 - Amazon Redshift

Resize your cluster while remaining online

• Fully automated

– Data automatically redistributed

• Read only mode during resize

• Parallel node-to-node data copy

• Automatic DNS-based endpoint cutover

• Only charged for one cluster

[Diagram: resized cluster - leader node plus 4 compute nodes, each 128 GB RAM, 48 TB disk, 16 cores]

Page 23: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift has security built-in

• SSL to secure data in transit

• Encryption to secure data at rest

– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted

• No direct access to compute nodes

• Amazon VPC support

[Diagram: SQL clients/BI tools connect via JDBC/ODBC to the leader node in the customer VPC; compute nodes (128 GB RAM, 16 TB disk, 16 cores each) run in an internal VPC on a 10 GigE HPC network; ingestion, backup, and restore via Amazon S3]

Page 24: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

Page 25: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift integrates with multiple data sources

Amazon DynamoDB

Amazon Elastic MapReduce

Amazon Simple Storage Service (S3)

Amazon Elastic Compute Cloud (EC2)

AWS Storage Gateway Service

Corporate Data Center

Amazon Relational Database Service (RDS)

Amazon Redshift

More coming soon…

Page 26: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift provides multiple data loading options

• Upload to Amazon S3

• AWS Import/Export

• AWS Direct Connect

• Work with a partner

[Partner logos: Data Integration, Systems Integrators]

More coming soon…

Page 27: AWS Summit Berlin 2013 - Amazon Redshift

Amazon Redshift works with your existing analysis tools

JDBC/ODBC

Amazon Redshift

Connect using drivers from PostgreSQL.org

More coming soon…

Page 28: AWS Summit Berlin 2013 - Amazon Redshift
Page 29: AWS Summit Berlin 2013 - Amazon Redshift

SkillPages

Customer Use Case

Everyone Needs Skilled People

At Home, At Work, In Life

Repeatedly

Page 30: AWS Summit Berlin 2013 - Amazon Redshift

Who they are

What they can do

Your real life connections to them

Examples of what they can do

Page 31: AWS Summit Berlin 2013 - Amazon Redshift

Data Architecture

User actions (join via Facebook, add a Skill Page, invite friends) generate trace events on the web servers, which land as raw data in Amazon S3.

Hive scripts on Amazon EMR process the content:

• Process log files with regular expressions to parse out the info we need
• Process cookies into useful searchable data such as Session, UserId, and API security token
• Filter surplus info like internal Varnish logging

Aggregated data flows from Amazon S3 into Amazon Redshift, where data analysts query it via an internal web app, Excel, and Tableau.
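The Hive processing steps described above amount to regex parsing plus cookie extraction. A minimal Python sketch of the same idea (the log format and cookie names here are hypothetical, not SkillPages' actual schema):

```python
# Sketch: parse a raw web log line with a regular expression and pull
# session and user identifiers out of the cookie string; lines that do
# not match the pattern are filtered out, like the surplus Varnish logs.
import re

LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ "(?P<request>[^"]*)" (?P<cookies>.*)$')


def parse_event(line):
    m = LOG_RE.match(line)
    if not m:
        return None  # filter unparseable/surplus lines
    cookies = dict(
        pair.split("=", 1) for pair in m.group("cookies").split("; ") if "=" in pair
    )
    return {
        "ip": m.group("ip"),
        "request": m.group("request"),
        "session": cookies.get("session"),
        "user_id": cookies.get("uid"),
    }
```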

Page 32: AWS Summit Berlin 2013 - Amazon Redshift

Resources & Questions

• Steffen Krause | [email protected] | @AWS_Aktuell

• http://aws.amazon.com/redshift
• Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html
• Setting Up SQL Workbench/J: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
• SQL Reference: http://docs.aws.amazon.com/redshift/latest/dg/cm_chap_SQLCommandRef.html
• Client Tools:
  • https://aws.amazon.com/marketplace/redshift/
  • https://www.jaspersoft.com/webinar-AWS-Agile-Reporting-and-Analytics-in-the-Cloud

Page 33: AWS Summit Berlin 2013 - Amazon Redshift

Thank you!