Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. In this session we'll give an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more.
Data warehousing done the AWS way
• No upfront costs, pay as you go
• Really fast performance at a really low price
• Open and flexible with support for popular tools
• Easy to provision and scale up massively
We set out to build…
A fast and powerful, petabyte-scale data warehouse that is:
Delivered as a managed service
A Lot Faster
A Lot Cheaper
A Lot Simpler
Amazon Redshift
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
Example table:

ID  | Age | State
----+-----+------
123 | 20  | CA
345 | 25  | WA
678 | 40  | FL

[Diagram: row storage scans across whole rows; column storage scans down a single column]
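A minimal illustration (assumes a customer table with the ID, Age, and State columns from the example above): a query that touches only one column lets column storage skip the blocks of every other column instead of reading whole rows.

-- only the blocks holding the state column are scanned from disk;
-- the id and age columns are never read
select state, count(*)
from customer
group by state;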
Amazon Redshift uses SQL
• Industry-standard SQL
• ODBC and JDBC drivers to access data (using PostgreSQL 8.x drivers)
– Most PostgreSQL features supported
– See the documentation for differences
• INSERT/UPDATE/DELETE are supported
– But loading data with COPY from S3 or DynamoDB is significantly faster
– Run VACUUM after a significant number of deletes or updates
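A minimal sketch (the customer table and the predicate are hypothetical): after a large batch of deletes or updates, VACUUM reclaims space and re-sorts rows, and ANALYZE refreshes the planner statistics.

-- hypothetical bulk delete followed by maintenance
delete from customer where state = 'FL';
vacuum customer;   -- reclaim space and re-sort
analyze customer;  -- update table statistics for the query planner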
Amazon Redshift loads data from S3 or DynamoDB
• Direct load from S3 or DynamoDB is supported:

copy customer from 's3://mybucket/customer.txt'
credentials 'aws_access_key_id=<your-access-key-id>;aws_secret_access_key=<your-secret-access-key>'
gzip delimiter '|';

• Load data in parallel
– Split data into multiple files for parallel load
– Name files with a common prefix (customer.txt.1, customer.txt.2, …)
– Compress large datasets with gzip
• Load data in sort key order if possible
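A sketch of the parallel-load pattern (bucket and file names are hypothetical): once the data is split into customer.txt.1.gz, customer.txt.2.gz, … under a common prefix, a single COPY against that prefix loads all matching files in parallel across the compute nodes.

copy customer from 's3://mybucket/customer.txt.'
credentials 'aws_access_key_id=<your-access-key-id>;aws_secret_access_key=<your-secret-access-key>'
gzip delimiter '|';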
Amazon Redshift automatically compresses your data
• Compression saves space and reduces disk I/O
• COPY automatically analyzes and compresses your data
– Samples data; selects the best compression encoding
– Supports: byte dictionary, delta, mostly n, run length, text
• Customers see 4-8x space savings with real data
– 20x and higher is possible depending on the data set
• ANALYZE COMPRESSION to see details
analyze compression listing;
Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
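A hypothetical sketch (the column data types are assumptions; the encodings are taken from the output above): the encodings suggested by ANALYZE COMPRESSION can also be declared explicitly when creating a table.

create table listing (
    listid         integer      encode delta,
    sellerid       integer      encode delta32k,
    eventid        integer      encode delta32k,
    dateid         smallint     encode bytedict,
    numtickets     smallint     encode bytedict,
    priceperticket decimal(8,2) encode delta32k,
    totalprice     decimal(8,2) encode mostly32,
    listtime       timestamp    encode raw
);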
Amazon Redshift architecture
• Leader node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB
• Single node version available
[Architecture diagram: SQL clients/BI tools connect via JDBC/ODBC to the leader node; compute nodes (128 GB RAM, 16 TB disk, 16 cores each) are linked over 10 GigE (HPC networking) and handle ingestion, backup, and restore via Amazon S3]
jdbc:postgresql://mycluster.c7lp0qs37f41.us-east-1.redshift.amazonaws.com:8192/mydb
Amazon Redshift runs on optimized hardware
• HS1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed customer storage
• HS1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed user storage, 2 GB/sec scan rate
• Optimized for I/O-intensive workloads
• High disk density
• Runs in HPC (fast network)
• HS1.8XL available on Amazon EC2
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup
• Restore
• Resize
Amazon Redshift lets you start small and grow big
• Extra Large node (HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores
– Single node (2 TB)
– Cluster of 2-32 nodes (4 TB – 64 TB)
• Eight Extra Large node (HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
– Cluster of 2-100 nodes (32 TB – 1.6 PB)

[Diagram: example cluster layouts for XL and 8XL nodes; nodes not to scale]
Amazon Redshift is priced to let you analyze all your data
                     Price per hour for an    Effective hourly    Effective annual
                     HS1.XL single node       price per TB        price per TB
On-Demand            $0.850                   $0.425              $3,723
1-Year Reservation   $0.500                   $0.250              $2,190
3-Year Reservation   $0.228                   $0.114              $999
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go
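As a worked example from the table above: a 10-node HS1.XL cluster on demand costs 10 x $0.850 = $8.50 per hour for roughly 20 TB of compressed storage, which matches the effective hourly price of $0.425 per TB.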
Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Provision a data warehouse in minutes
Monitor query performance
Point and click resize
Resize your cluster while remaining online
• Fully automated – data is automatically redistributed
• New target cluster is provisioned in the background
• Source cluster is in read-only mode during the resize
• Parallel node-to-node data copy
• Automatic DNS-based endpoint cutover
• Only charged for the source cluster
Amazon Redshift has security built-in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 are encrypted
• No direct access to compute nodes
• Amazon VPC support
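As an example of securing data in transit (a sketch; exact driver parameters may vary), SSL can be requested from the client side by appending ssl=true to the JDBC URL shown earlier:

jdbc:postgresql://mycluster.c7lp0qs37f41.us-east-1.redshift.amazonaws.com:8192/mydb?ssl=true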
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node in the customer VPC; the compute nodes run in an internal VPC, connected over 10 GigE (HPC), with ingestion, backup, and restore via Amazon S3]
Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
Amazon Redshift integrates with multiple data sources
Amazon DynamoDB
Amazon Elastic MapReduce
Amazon Simple Storage Service (S3)
Amazon Elastic Compute Cloud (EC2)
AWS Storage Gateway Service
Corporate Data Center
Amazon Relational Database Service (RDS)
Amazon Redshift
More coming soon…
Amazon Redshift provides multiple data loading options
• Upload to Amazon S3
• AWS Import/Export
• AWS Direct Connect
• Work with a partner
Data integration and systems integrator partners
More coming soon…
Amazon Redshift works with your existing analysis tools
JDBC/ODBC
Amazon Redshift
Connect using drivers from PostgreSQL.org
More coming soon…
SkillPages
Customer Use Case
Everyone needs skilled people
At home, at work, in life
Repeatedly
Who they are
What they can do
Your real life connections to them
Examples of what they can do
Data Architecture
Data Analyst
Raw Data
Get Data
Join via Facebook
Add a Skill Page
Invite Friends
[Pipeline diagram: web servers send user action trace events to Amazon S3; EMR Hive scripts process the content]
• Process log files with regular expressions to parse out the info we need
• Process cookies into useful, searchable data such as session, user ID, and API security token
• Filter out surplus info such as internal Varnish logging
[Pipeline diagram: raw events and aggregated data are stored in Amazon S3 and loaded into Amazon Redshift, which is queried from an internal web app, Excel, and Tableau]
Resources & Questions
• Steffen Krause | [email protected] | @AWS_Aktuell
• http://aws.amazon.com/redshift
• Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html
• Setting Up SQL Workbench/J: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
• SQL Reference: http://docs.aws.amazon.com/redshift/latest/dg/cm_chap_SQLCommandRef.html
• Client Tools:
– https://aws.amazon.com/marketplace/redshift/
– https://www.jaspersoft.com/webinar-AWS-Agile-Reporting-and-Analytics-in-the-Cloud
Thank you!