View
4
Download
0
Category
Preview:
Citation preview
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Best practices in migrating from on-premises data warehouses to Amazon Redshift
A N T 3 2 3 - R
Matt Scaer
Principal DW Specialist Solutions Architect
Amazon Web Services
Thiyagarajan Arumugam
DW Specialist Solutions Architect
Amazon Web Services
Agenda
• Architecture and concepts
• Scaling options
• Migration techniques
• Migration best practices
• Additional resources
• Open Q&A
Related breakouts
ANT237-R - Zero to hero: Quick starts with Amazon Redshift
ANT335-R - Build data analytics stacks with Amazon Redshift, featuring Warner Bros.
ANT338-R - Best practices for using popular BI tools with Amazon Redshift
ANT404-R - Build a single query to analyze data across Amazon Redshift & Amazon S3
DAT363-R - Migrating your data warehouses to Amazon Redshift
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Redshift benefits10s of thousands of customers use Amazon Redshift & process exabytes of data per day
3x faster than the most popular alternative
Up to 75% less than other cloud data warehouses & predictable costs
AWS Lake Formation catalog & security,exabyte querying (Redshift Spectrum & federated), AWS integrated (AWS DMS,
Amazon CloudWatch)
AWS grade security, (e.g., Amazon VPC, encryption with AWS KMS, AWS
CloudTrail), certifications such as SOC, PCI, DSS, ISO, FedRAMP, HIPAA
Easy to provision & manage, automated backups, AWS support, 99.9% SLAs
Virtually unlimited elastic linear scaling
Amazon
Redshift is the most popular
cloud data warehouse
Moving to Amazon Redshift enables you run a 24x7 data warehouse cost effectively
Amazon Redshift is the most cost effective
Fastest Most cost effective
up to 75%
Use RI to reduce price up to 75% with Reserved Instances (RIs)
You can start small for just $0.25 per hour and scale to $1000 per terabyte per year,
less than one-tenth the cost of other solutions
Your Amazon Redshift data warehouse and Amazon S3 data lake enable all your workloads
Amazon Redshift integrates with your data lake
Fastest Most cost effective Integrates with your data lake
Exabyte scale
Store and analyze relational and nonrelational data
Purpose-built analytics tools
Cost effective
• Store at 2.3 cents per GB-month in Amazon S3
• Query with Redshift Spectrum at $5/TB scanned
• DW with Amazon Redshift for $1000/TB/year
Give access to everyone
• Amazon QuickSight: $0.30 for 30 minutes of use
Amazon Redshift architectureMassively parallel, shared nothing columnar
architecture
Leader node
SQL endpoint
Stores metadata
Coordinates parallel SQL processing
Compute nodes
Local, columnar storage
Executes queries in parallel
Load, unload, backup, restore
Amazon Redshift Spectrum nodes
Execute queries directly against
Amazon Simple Storage Service (Amazon S3)
Compute
nodes
10 GigE (HPC)
JDBC/ODBC
SQL clients/
BI tools
Leader node
Am
azo
n R
ed
shif
t clu
ster
Amazon S3Exabyte-scale object storage
Efficient data loads
Streaming backup/restore
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Concurrency scaling
Backup
Amazon Redshift automatically adds transient clusters,
in seconds, to serve sudden spike in concurrent requests with
consistently fast performance. No hydration required.
Caching Layer
How it works:
All queries go to the leader node, user only sees less wait for queries.
When queries in designated WLM queue begin queuing, Amazon Redshift automatically routes them to the new clusters, enabling Concurrency Scaling automatically.
Amazon Redshift automatically spins up a new cluster, processes waiting queries and automatically shuts down the Concurrency Scaling cluster.
1
2
3
For every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling. This means that Concurrency Scaling is free for > 97% of customers.
Concurrency Scaling is a powerful new feature of
Amazon Redshift. We were really impressed by
the performance of the feature, especially with
its ability to instantly add transient capacity with
nothing to manage on our end. As Redshift
administrators at Yelp, we think that
Concurrency Scaling will keep our many users
happy, even under peak load. We’re excited that
Concurrency Scaling provides the flexibility to
handle significant variance in our workloads over
the course of a day.
-Shahid Chohan,
Software Engineer, Yelp
Amazon Redshift Elastic Resize
Amazon Redshift Cluster
Co
mp
ute
no
des
Amazon Redshift Managed Amazon S3
JDBC/ODBC
1
2
3
Leader Node
Backup
How it works:
Amazon Redshift updates the snapshot on Amazon S3 with the most recent data.
New nodes are added (for scaling up) or removed (for scaling down) during this period.
The cluster is fully available for read and write queries. Queries that were being held are queued for execution automatically.
1
2
3
Scale your Amazon Redshift clusters up and down in minutes to get the optimal performance.
Amazon Redshift Spectrum
Amazon Redshift Spectrum
query engine
Query across Amazon Redshift and Amazon S3
Amazon Redshift data
Amazon S3data lake
Extend the data warehouse to exabytes of data in Amazon S3 data lake
No data loading required
Scale compute and storage separately
Directly query data stored in Amazon S3
Parquet, ORC, Avro, JSON, and CSV data formats
Spectrum Request Accelerator
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
MPP migration timeline
Power Off
Go Live
Stakeholder
Signature
Cutovers
Amazon S3
CDC Sync
Audit per Application
Current System
Future System
AWS Data Migrations: Broadest toolkit
AWS provides the broadest range of tools for easy, fast, and secure data movement to and from the AWS Cloud
AWS Direct
Connect
AWS
Snowball
AWS Database
Migration
Service
AWS Storage
Gateway
Amazon Kinesis
Data Firehose
E a s y , P o w e r f u l T o o l s f o r M i g r a t i n g D a t a
Petabyte-scale data transport
solution
Migration is a two-step process:Step 1: Convert or Copy your Schema
Source DB or DW
AWS SCT
Native Tool
Destination DB or DW
Step 2: Move your data
Source DB or DW Destination DB or DW
AWS DMS
AWS SCT
When to use AWS SCT?
Modernize
Modernize your Database tier
Modernize and Migrate your Data
Warehouse to Amazon Redshift
AWS SCT helps with converting tables, views, and code
• Sequences
• User-defined types
• Synonyms
• Packages
• Stored procedures
• Functions
• Triggers
• Schemas
• Tables
• Indexes
• Views
• Sort and distribution
keys
Database migration assessment
Connect AWS SCT to
source and target
databases
Run assessment report
Read executive summary
Follow detailed
instructions
“AWS Database Migration Service is
the most impressive migration service
we’ve seen.”
2. Relational Databases
1. Non-Relational Databases
3. Other sources
Amazon S3
A c c e l e r a t e M i g r a t i o n s t o A m a z o n R e d s h i f t
Amazon
Redshift
AWS Database
Migration Service
AWS Database Migration Service (AWS DMS)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Leverage AWS Ecosystem
Categories Amazon Redshift
Schema Conversion AWS Schema Conversion Tool for schema/stored proc conversion
Data Movement AWS Schema Conversion Tool Data Extractors
Near Real Time / Streaming Kinesis Data Firehose and/or Amazon Redshift SpectrumAmazon Redshift as a data source for machine learningAmazon Redshift as a destination for Amazon DynamoDB
For Data Lake Integration + column level security
AWS Lake Formation
Data Integration AWS Glue + Amazon EMRData Visualization Amazon QuickSightMonitoring and alerting Amazon CloudWatch Launching Resources AWS CloudFormationSecurity AWS Identity and Access Management (IAM)
Best practices: Legacy data warehouse migrations
• Amazon Redshift is a fully ACID and ANSI SQL Compliant, Columnar store
Data Warehouse
• No indices – instead, fast query performance is achieved through
parallelism and efficient data storage to minimize I/O
• Amazon Redshift is relational – organizes data as tables comprised of
columns and rows in named schemas (namespaces)
• Amazon Redshift has Schemas, Users, and Groups like most other
databases
• Amazon Redshift is designed to working with structured data, and support
semi-structured data in the form of Parquet, JSON or Avro
Best practices: Legacy data warehouse migrations
When moving from legacy row-based data warehouses
• Amazon Redshift is efficient with wide tables because of columnar storage and compression
• Denormalize tables where it makes sense (predicate columns in the fact table)
When moving from SMP legacy data warehouses
• Accelerated joins can be achieved (when frequently queried together) by Colocation of tables
• Leverage DIST STYLE ALL/AUTO or KEY
• Amazon Redshift is designed for big data (> 100GB to PB scale)
Leverage Amazon Redshift Stored Procedures for faster migrations
• PL/pgSQL stored procedures were added to make porting legacy procedures easier
Extend Data warehouse to Data lake using Amazon Redshift Spectrum
• Unlock the data lake in Amazon S3 for end users with industry-standard SQL, with limitless scaling
• Archive Historical Data into Amazon S3
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWSLabs on GitHub — Amazon Redshift
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin scripts
Collection of utilities for running diagnostics on your cluster
Admin views
Collection of utilities for managing your cluster, generating schema DDL, and so on
Analyze Vacuum utility
Utility that can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster
Column Encoding utility
Utility that will apply optimal column encoding to an established schema with data already loaded
AWS Big Data Blog – Amazon Redshift
Amazon Redshift Engineering’s Advanced Table Design Playbook
https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/
Top 10 Performance Tuning Techniques for Amazon Redshift
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
Twelve Best Practices for Amazon Redshift Spectrum
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift
https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime
https://aws.amazon.com/blogs/big-data/how-to-migrate-from-ibm-netezza-to-amazon-redshift-with-no-downtime/
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Learn big data with AWS Training and Certification
Visit aws.amazon.com/training/paths-specialty/
New free digital course, Data Analytics Fundamentals, introduces
Amazon S3, Amazon Kinesis, Amazon EMR,
AWS Glue, and Amazon Redshift
Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam
Resources created by the experts at AWS to help you build and validate data analytics skills
Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs
Thank you!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Recommended