AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

November 30, 2016

Migrating Your Data Warehouse to Amazon Redshift

DAT202

Pavan Pothukuchi, Sr. Manager PM, Amazon Redshift

Ali Khan, Director of BI and Analytics, Scholastic

Laxmikanth Malladi, Principal Architect, Northbay Solutions

“It’s our biggest driver of growth in our biggest markets, and is a feature of the company” …on Data Mining in Redshift – Chris Lambert, Lyft CTO

“The doors were blown wide open to create custom dashboards for anyone to instantly go in and see and assess what is going on in our ad delivery landscape, something we have never been able to do until now.” – Bryan Blair, Vevo’s VP of Ad Operations

“Analytical queries are 10 times faster in Amazon Redshift than they were with our previous data warehouse.” – Yuki Moritani, NTT Docomo Innovation Manager

“We have several petabytes of data and use a massive Redshift cluster. Our data science team can get to the data faster and then analyze that data to find new ways to reduce costs, market products, and enable new business.” – Yuki Moritani, NTT Docomo Innovation Manager

“We saw a 2x performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement.” – Naeem Ali, Director of Software Development, Data Science at Cablevision (Optimum)

“Over the last few years, we’ve tried all kinds of databases in search of more speed, including $15k of custom hardware. Of everything we’ve tried, Amazon Redshift won out each time.” – Periscope Data, Analyst’s Guide to Redshift

“We took Amazon Redshift for a test run the moment it was released. It’s fast. It’s easy. Did I mention it’s ridiculously fast? We’re using it to provide our analysts an alternative to Hadoop.” – Justin Yan, Data Scientist at Yelp

“The move to Redshift also significantly improved dashboard query performance… Redshift performed ~200% faster than the traditional SQL Server we had been using in the past.” – Dean Donovan, Product Development at DiamondStream

“…[Redshift] performance has blown away everyone here (we generally see 50-100x speedup over Hive)” – Jie Li, Data Infrastructure at Pinterest

“450,000 online queries 98 percent faster than previous traditional data center, while reducing infrastructure costs by 80 percent.” – John O’Donovan, CTO, Financial Times

“We needed to load six months’ worth of data, about 10 TB of data, for a campaign. That type of load would have taken about 20 days with our previous solution. By using Amazon Redshift, it only took six hours to load the data.” – Zhong Hong, VP of Infrastructure, Vivaki (Publicis Groupe)

“We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily.” – Bob Harris, CTO, Channel 4

“On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift” – Kishore Raja, Director of Strategic Programs and R&D, Boingo Wireless

“Most competing data warehousing solutions would have cost us up to $1 million a year. By contrast, Amazon Redshift costs us just $100,000 all-in, representing a total cost savings of around 90%” – Joel Cumming, Head of Data, Kik Interactive

“Annual costs of Redshift are equivalent to just the annual maintenance of some of the cheaper on-premises options for data warehouses.” – Kevin Diamond, CTO, HauteLook (Nordstrom)

“Our data volume keeps growing, and we can support that growth because Amazon Redshift scales so well. We wouldn’t have that capability using the supporting on-premises hardware in our previous solution.” – Ajit Zadgaonkar, Director of Ops. and Infrastructure, Edmunds

“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like - from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts had in different areas” – Jon Hoffman, Head of Engineering, Foursquare

Today’s agenda

• Amazon Redshift Overview

• Use cases and benefits

• Migration options

• Scholastic’s use case

• Architecture details

• Technical overview

• Key project learnings

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift: a lot faster, a lot simpler, a lot cheaper

Forrester Wave™ Enterprise Data Warehouse, Q4 ’15

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

Selected Amazon Redshift customers

Why migrate to Amazon Redshift?

• vs. transactional databases: 100x faster; scales from GBs to PBs; analyze data without storage constraints
• vs. MPP databases: 10x cheaper; easy to provision and operate; higher productivity
• vs. Hadoop: 10x faster; no programming; standard interfaces and integration to leverage BI tools, machine learning, and streaming

Migration from Oracle @ Boingo Wireless

• 2,000+ commercial Wi-Fi locations
• 1 million+ hotspots
• 90M+ ad engagements
• 100+ countries
• Legacy DW: Oracle 11g based DW

Before migration: rapid data growth slowed analytics; mediocre IOPS, limited memory, vertical scaling; admin overhead; expensive (license, h/w, support)

After migration: 180x performance improvement; 7x cost savings

Migration from Oracle @ Boingo Wireless: cost and performance comparison (chart)

• Annual cost: Oracle Exadata $400,000; SAP HANA $300,000; Amazon Redshift $55,000 (7x cheaper than Oracle Exadata)
• Latency in seconds, existing system vs. Redshift: query on 1 year of data, 7,200 vs. 15; data load of 1 million records, 2,700 vs. 15 (180x faster than the Oracle database)

Migration from Greenplum @ NTT Docomo

• 68 million customers
• 10s of TBs per day of data across the mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, etc.
• Legacy DW: Greenplum on-premises

After migration:
• 125-node DS2.8XL cluster
• 4,500 vCPUs, 30 TB RAM
• 6 PB uncompressed
• 10x faster analytic queries
• 50% reduction in time for new BI app deployment
• Significantly less ops overhead

Migration from SQL on Hadoop @ Yahoo

• Analytics for website/mobile events across multiple Yahoo properties
• On an average day: 2B events, 25M devices
• Before migration: Hive; found it to be slow, hard to use, share, and repeat

After migration:
• 21-node DC1.8XL (SSD) cluster
• 50 TB compressed data
• 100x performance improvement
• Real-time insights
• Easier deployment and maintenance

Migration from SQL on Hadoop @ Yahoo: query latency in seconds, log scale (chart), comparing Amazon Redshift and Impala on four workloads: count distinct devices, count all events, filter clauses, and joins.

Business Value and Productivity

Business Productivity Benefits

Analyze more data

Faster time to market

Get better insights

Match capacity with demand

ENGINE X → Amazon Redshift: ETL scripts, SQL in reports, ad hoc queries

How to Migrate?

1. Schema conversion: map data types; choose compression encoding, sort keys, and distribution keys; generate and apply DDL
2. Data migration: bulk load; capture updates
3. Schema and data transformation: convert SQL code; transformations
4. Assess gaps: stored procedures; functions

AWS Schema Conversion Tool (AWS SCT)

• Convert schema in a few clicks
• Sources include Oracle, Teradata, Greenplum, and Netezza
• Automatic schema optimization
• Converts application SQL code
• Detailed assessment report
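To make the conversion concrete, here is a minimal, hypothetical sketch of the kind of Redshift DDL such a conversion produces for an Oracle source table; the table, columns, encodings, and key choices are illustrative and are not output copied from SCT.

    -- Hypothetical result of converting an Oracle table to Amazon Redshift:
    -- NUMBER/VARCHAR2/DATE mapped to Redshift types, column compression
    -- encodings chosen, and distribution/sort keys declared in the DDL.
    CREATE TABLE sales_fact (
        sale_id      BIGINT        NOT NULL,       -- Oracle NUMBER(18)
        customer_id  INTEGER       NOT NULL,       -- Oracle NUMBER(10)
        store_code   VARCHAR(16)   ENCODE lzo,     -- Oracle VARCHAR2(16)
        sale_date    DATE          ENCODE raw,     -- Oracle DATE; leading sort key left uncompressed
        amount       DECIMAL(12,2) ENCODE lzo      -- Oracle NUMBER(12,2)
    )
    DISTKEY (customer_id)    -- co-locate rows that join on customer_id
    SORTKEY (sale_date);     -- prune blocks on date-range filters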

AWS Database Migration Service (AWS DMS)

• Start your first migration in a few minutes
• Sources include Aurora, Oracle, SQL Server, MySQL, and PostgreSQL
• Bulk load and continuous replication
• Migrate a TB for $3
• Fault tolerant

AWS DMS: Change data capture (diagram). A replication instance sits between source and target: transactions (t1, t2, …) committed on the source during the bulk load are captured, and the changes are applied to the target after the bulk load completes.

Data integration partners for Amazon Redshift: data integration vendors and systems integrators

Beyond Amazon Redshift…

Scholastic, Established 1920

Where were we?

Platform
• 13+ years old. IBM AS/400 DB2 and Microsoft SQL Server are the primary data warehouse platforms. BI platform is primarily Microsoft (SSRS, SSAS, Excel, SharePoint)
• 500+ direct users across every LOB and business function
• 20+ TB. 5,500+ DB2 workloads, 350+ SQL Server workloads, 15 SSAS cubes, 150+ SSRS reports

Challenges
• Inflexible, multi-layered architecture – slow time to market
• Inability to meet internal SLAs due to performance of daily ETL processes
• Scalability limitations with SQL Server Analysis Services (SSAS) for reports
• Limited ability to perform self-service business intelligence

Moving forward: Key decision factors

• Improved performance, scalability, availability, logging, security
• Enablement of self-service business intelligence
• Leverage the skill set of the current team (relational DB & SQL)
• Integration with existing technology stack
• Alignment with the tech strategy (DevOps model, Cloud First)
• Ability to support Big Data initiatives
• Team up with an experienced consulting partner

Why we chose AWS and Amazon Redshift

AWS was chosen for its agility, scalability, elasticity, and security

Redshift
• Scalable, fast
• Managed service, cost-optimization models, elastic
• SQL/relational matched the skill set of the team

S3 was chosen as the location for the ingestion process

NorthBay was chosen as the implementation partner for their expertise in Big Data and Redshift migrations

How the project unfolded

Goals
• 3-month pilot to migrate a functional area in a key LOB
• Demonstrate immediate business value
• Use the AWS stack and open source for data movement from DB2 (no CDC/ETL tool)

Outcomes
• Core framework for migration
• ELT architecture and validation
• Visualization/self-service capability through Tableau

Scholastic data cloud: Technical architecture (diagram)

Components: source DBs (AS/400 DB2, SQL Server EDW, SSAS cubes, SSRS reports); Data Pipeline orchestrating an EMR cluster running the Sqoop script; an S3 output bucket; an EC2 instance running the COPY command into Redshift (staging); Redshift as the enterprise data repository; SNS topics for pipeline status and failure with SNS email notification; Lambda to save pipeline stats; an RDS MySQL instance for pipeline configurations; DynamoDB; Tableau as the reporting tool.

Core Framework

• Jobs and Job Groups are defined as metadata in DynamoDB
• Control-M scheduler, custom application, and Data Pipeline for orchestration
• ELT process with EMR/Sqoop for extraction; load and transform the data through Redshift SQL scripts
• Core framework enables:
  • Restart capability from the point of failure
  • Capturing of operational statistics (# of rows updated, etc.)
  • Audit capability (which feed caused the Fact to change, etc.)
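As one concrete illustration of the operational-statistics point, the sketch below shows a way a Redshift SQL step can record how many rows the most recent COPY loaded; the audit table, job name, and batch id are hypothetical and not part of Scholastic's actual framework.

    -- Hypothetical audit insert run right after a COPY in the same session:
    -- pg_last_copy_id() and pg_last_copy_count() report on the most recent COPY.
    INSERT INTO etl_audit (job_name, batch_id, copy_query_id, rows_loaded, logged_at)
    SELECT 'load_orders_staging', 20161130, pg_last_copy_id(), pg_last_copy_count(), GETDATE();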

Extract

• Pre-create EMR resources at the start of the batch
• Achieve parallelism in Sqoop with mappers and fair scheduling
• Sqoop query adds additional fields like Batch_id, Updated_date, etc. (an illustrative query appears after the diagram summary below)
• Data extracts are split and compressed for optimized loading into Redshift

Extract flow (diagram): Data Pipeline orchestrates an EMR cluster running Sqoop, which reads job metadata and KMS-protected credentials, extracts from AS/400 DB2, and writes the extract files to S3 (control flow and data flow shown as numbered steps 1–6).
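The "additional fields" bullet above can be illustrated with a free-form query of the kind passed to Sqoop's --query option; the schema, table, and batch id are placeholders, and $CONDITIONS is the token Sqoop substitutes so the --split-by column can divide the extract across mappers.

    -- Illustrative extract query: every row is stamped with a batch id and an
    -- extract timestamp before it lands in S3. $CONDITIONS is Sqoop's split token.
    SELECT t.*,
           12345             AS batch_id,      -- supplied by the job metadata
           CURRENT_TIMESTAMP AS updated_date
    FROM   ordlib.order_detail t
    WHERE  $CONDITIONS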

Load

• Truncate and load through Data Pipeline for staging tables
• Dynamic workload management (WLM) queues set up to allow maximum resources during loading/transformation
• Check and terminate any locks on tables to allow truncation
• Capture metrics related to number of rows loaded, time taken, etc. (a sketch of this truncate-and-load pattern follows the diagram summary below)

Load flow (diagram): Data Pipeline launches an EC2 instance that uses KMS-protected credentials to COPY the files from S3 into the Redshift staging tables (control flow and data flow shown as numbered steps 1–4).
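A minimal sketch of the truncate-and-load step, assuming the extract files sit under an illustrative S3 prefix and the cluster has an attached IAM role; the bucket, role ARN, table name, and slot count are placeholders, not values from the Scholastic pipeline.

    -- Claim extra WLM slots for this session so the load gets maximum resources.
    SET wlm_query_slot_count TO 4;

    -- Locks that would block the truncate can be found in STV_LOCKS
    -- (lock_owner_pid) and cleared with pg_terminate_backend(<pid>).

    TRUNCATE stg_orders;

    COPY stg_orders
    FROM 's3://example-bucket/extracts/orders/batch_12345/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    DELIMITER '|'
    DATEFORMAT 'auto'
    MAXERROR 0;

    -- Row count of the COPY just issued, for the load metrics.
    SELECT pg_last_copy_count();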

Transform

• Custom application for building dimensions and facts
• SQL scripts are stored in S3 and executed by the ELT process (a hedged example of such a script follows the diagram summary below)
• SQL scripts refactored from SQL Server and AS/400 scripts
• Non-functional requirements are achieved through the custom app

Transform flow (diagram): the custom app reads job metadata and the SQL scripts from S3, then runs them against Redshift to build dimensions and facts from the staging tables (control flow and data flow shown as numbered steps 1–7a/7b).
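As a sketch of what one of those refactored SQL scripts can look like, the statements below refresh a fact from staging using the delete-then-insert pattern commonly used on Redshift at the time (which had no MERGE statement); the tables and columns are illustrative only and are not Scholastic's model.

    BEGIN;

    -- Remove fact rows that are about to be reloaded, matching on the natural key.
    DELETE FROM fact_orders
    USING stg_orders s
    WHERE fact_orders.order_id = s.order_id;

    -- Insert the refreshed rows from staging, carrying the batch id for auditing.
    INSERT INTO fact_orders (order_id, customer_id, product_code, order_date, quantity, amount, batch_id)
    SELECT order_id, customer_id, product_code, order_date, quantity, amount, batch_id
    FROM stg_orders;

    COMMIT;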

Schema Design

• Modified star schema
• Natural keys instead of generating unique identifiers
• Commonly used columns from dimensions are copied over to facts
• Surrogate keys are eliminated except for a few cases
• Compression
• Define appropriate distribution and sort keys
• Define primary keys and foreign keys (an illustrative DDL sketch follows)
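The sketch below is hypothetical DDL reflecting those choices: natural keys, a commonly used dimension column copied onto the fact, explicit distribution and sort keys, and declared (informational, unenforced) primary and foreign keys; none of the names come from Scholastic's actual model.

    -- Small dimension, replicated to every node.
    CREATE TABLE dim_customer (
        customer_id   BIGINT NOT NULL PRIMARY KEY,
        customer_name VARCHAR(128),
        region        VARCHAR(64)
    )
    DISTSTYLE ALL;

    CREATE TABLE fact_orders (
        order_id      BIGINT       NOT NULL,    -- natural key from the source, no surrogate
        customer_id   BIGINT       NOT NULL,
        product_code  VARCHAR(32)  NOT NULL,
        product_name  VARCHAR(128) ENCODE lzo,  -- copied from the product dimension for common queries
        order_date    DATE         NOT NULL,
        quantity      INTEGER,
        amount        DECIMAL(12,2),
        batch_id      BIGINT,
        PRIMARY KEY (order_id),
        FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)
    )
    DISTKEY (customer_id)   -- distribute on the most common join column
    SORTKEY (order_date);   -- sort on the most common filter column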

Security

• AWS Key Management Service (KMS) is used for encrypting access credentials to source and target databases
• A Jenkins job allows database administrators to encrypt credentials using KMS directly
• Amazon EMR and Jenkins resources are given KMS decrypt permissions so they can connect to sources and targets during the ELT process
• Standard security in transit and at rest throughout the process
• IAM federation through enterprise Active Directory

Reporting

• Business users access facts/dimensions through Tableau
• Power users access staging tables through Tableau
• Data analysts access files in S3 using Hive/Presto (a hedged external-table example follows the diagram summary below)
• Self-service capability across business users

Reporting access (diagram): business analysts use Tableau against the facts/dimensions, power users query the staging tables, and data analysts query the files in S3 through EMR with Presto/Hive.
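For the Hive/Presto access path, a rough sketch of how the S3 extract files could be exposed as an external Hive table on EMR (then queryable from Presto as well); the location, delimiter, and columns are illustrative only.

    -- Hypothetical external table over the pipe-delimited extract files in S3.
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        product_code STRING,
        order_date   STRING,
        amount       DECIMAL(12,2)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    LOCATION 's3://example-bucket/extracts/orders/';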

Workstream Effort

• Define Jobs and Job Groups specific to each workstream
• Create Redshift tables (staging, facts, dimensions) based on mapping from AS/400 and best practices learned
• Create new SQL scripts (based on the logic from AS/400 and SQL Server code) for transformation
• Develop, test, and deploy in 2-week Agile sprints

Key Lessons – Technical

• Isolate the core framework, with project-specific code in separate repositories
• Consolidating a logging solution across Amazon S3, Amazon Redshift, Amazon DynamoDB, etc., was a challenge
• Make appropriate schema changes when migrating to a new platform
• Build a custom framework for gathering operational stats (e.g., # of rows loaded)
• Start with test automation tools and Acceptance Test Driven Development (ATDD) earlier in the project

Project timeline revisited

After the successful pilot:
• Executive leadership accelerated the timeline:
  • Reduce the project timeline by 50% (to 12 months) to deliver value faster to LOBs
  • Realize cost savings by eliminating the DB2 and SQL Server platforms earlier
• Users wanted to be on the new platform!
• Scholastic & NorthBay partnered to create a training curriculum to ensure a supply of skilled staff would be available to our teams

Scaling up: 7 workstreams

• Developed a model for estimating effort and cost (AWS costs & labor per LOB migration)
• Running agile teams in parallel – employed Agile coaches
• Enhanced the core framework to ensure it would scale effectively when in use by multiple teams simultaneously
• Building a code repository for use by all teams
• Building CI/CD frameworks

Where are we now?

• 4 of 7 LOBs migrated – the framework enables complete migration of a functional area within days/weeks as opposed to months. On track to migrate and decommission the entire legacy environment within the next 6 months
• 10 weeks to migrate from an external vendor hosting data and providing reports for one LOB
• Cost of the data ingestion framework is under $40/day (EC2, EMR, Data Pipeline)
• First “Big Data” initiative in production captures and processes an average of 1.5 million e-reading events daily (peak: 7 million)
• Profile: LOB #1
  • Loading ~5-6 million rows/day (6-7 GB/day)
  • Processing over 1.5 billion rows within Redshift daily
  • Complete ETL/ELT batch cycle performance improved by over 170%

Key lessons – project execution

• Essential to monitor and optimize AWS costs
• A “Data Champion” / “Data Guide” partnership is absolutely critical for successful adoption of new platforms
• Importance of strong Agile coaches while scaling out Agile teams
• Criticality of choosing consulting partners (AWS & NorthBay) who can ramp up and supply key resources fast and cycle off the project when finished
• Creating new data platforms and migrating data into them is easy, especially with AWS. Decommissioning existing data platforms is hard!

Thank you!

Remember to complete your evaluations!

Related Sessions

Hear from other customers discussing their Amazon Redshift use cases:

• BDM402 — Best Practices for Data Warehousing with Amazon Redshift (King.com)
• BDA304 — What’s New with Amazon Redshift
• SVR308 — Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year
• GAM301 — How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights
• BDA207 — Fanatics: Deploying Scalable, Self-Service Business Intelligence on AWS
• BDM306 — Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem
• BDA203 — Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift (GE Power and Water)
• BDM206 — Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS (Hello)
• STG307 — Case Study: How Prezi Built and Scales a Cost-Effective, Multipetabyte Data Platform and Storage Infrastructure on Amazon S3
