Cloud migration Why? How? What happened? · 2018-09-17

Cloud migration Why? How? What happened?

Roy Lou 17media

About me

Sr. Director, Backend / SRE in 17media

Past: HTC, Google, NVIDIA

3-year-old monster’s dad

Swimming, jogging, snowboarding

AWS Oregon Region

GCP Oregon Region

2017/4/20 Migration plan first conceived

2017/10/25 Staging in GCP

2018/5/3 Production in GCP

Why migrating the cloud?

How to prepare?

What happened on that day?

Agenda

About 17media

App weekly release

Server daily release (30 commits / day)

Code review: Phabricator, CircleCI (auto lint + unit test + e2e test)

Master branch: CircleCI tests & generates docker image

Deploy: Slack -> Jenkins (generate task definition & update service) -> ECS
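
The "generate task definition" step of the deploy line above can be sketched as follows. This is only an illustration: the slides do not show 17media's Jenkins job, so the function name, the service name, the registry URL, the port, and the memory size are all hypothetical. The dictionary keys are standard fields of an ECS task definition.

```python
import json


def build_task_definition(family, image, container_port=8080, memory_mib=512):
    """Build a minimal ECS task definition for one container.

    All argument values are hypothetical placeholders, not taken from the talk.
    """
    return {
        "family": family,
        "containerDefinitions": [
            {
                "name": family,
                "image": image,  # docker image produced by the CI build
                "memory": memory_mib,
                "essential": True,
                "portMappings": [{"containerPort": container_port}],
            }
        ],
    }


# Hypothetical service name and image tag (for example, a commit hash).
task_definition = build_task_definition(
    "api-server", "registry.example.com/api-server:abc1234"
)
print(json.dumps(task_definition, indent=2))
```

A deploy job would then register this definition and update the service to use the new revision. That part is omitted because it needs AWS credentials.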

P50: 15ms

P95: 90ms
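
For readers unfamiliar with the P50 / P95 notation, here is a minimal sketch of how such latency percentiles can be computed with the nearest-rank method. The sample latencies are made up for illustration, and the talk does not say which percentile method the 17media monitoring uses.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 14, 15, 15, 16, 18, 20, 35, 60, 90]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50: {p50}ms, P95: {p95}ms")
```

P50 is the median request, while P95 describes the slow tail that one request in twenty exceeds.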

Better docker support

Cost saving

Analytics

Geographic

Why Migrating the Cloud?

120ms 150ms

200ms 200ms

AWS Oregon Region

GCP Taiwan Region

How did you prepare?

How did you prepare to reduce risks?

Risk of the new cloud

Risk of migration process

Risk of the new cloud

Logic issue

Performance issue

Data issue

Reduce Risk of New Cloud

Logic issue

Unit test

End to end test

At least 1 test for each service

Performance issue - Stress test

Stress test databases

Synthetic traffic - extend from e2e test

Real traffic - GoReplay
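
The idea behind replaying real traffic can be sketched as below. This is a simplified stand-in for what a tool like GoReplay does, not its API: `replay`, the `(offset_seconds, request)` record format, and the `send` callback are all assumptions made for this example.

```python
import time


def replay(recorded, send, speed=1.0, sleep=time.sleep):
    """Replay recorded requests against a new backend.

    recorded: iterable of (offset_seconds, request) pairs captured in production.
    send:     callable that delivers one request to the system under test.
    speed:    2.0 replays twice as fast, which doubles the load.
    Returns a list of (request, response, elapsed_ms) tuples.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    results = []
    previous = 0.0
    for offset, request in sorted(recorded, key=lambda item: item[0]):
        gap = (offset - previous) / speed  # keep relative timing, scaled
        if gap > 0:
            sleep(gap)
        previous = offset
        started = time.perf_counter()
        response = send(request)
        elapsed_ms = (time.perf_counter() - started) * 1000
        results.append((request, response, elapsed_ms))
    return results


# Hypothetical capture, replayed at 2x speed against a fake backend.
capture = [(0.0, "GET /a"), (0.5, "GET /b"), (2.0, "GET /c")]
results = replay(capture, send=lambda request: "200 OK", speed=2.0)
print([request for request, _, _ in results])
```

Raising `speed` above 1.0 is one way to push the new environment past production load before the real cut-over.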

Data issue

Use open-source dump/restore/sync libraries

Seek help from a consultant
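
The slides do not say how the restored data was verified. One common check after a dump and restore, shown here purely as an assumption, is to compare an order-independent digest of the source and destination collections. `collection_digest` and the sample documents are hypothetical.

```python
import hashlib
import json


def collection_digest(documents, key="_id"):
    """Return a SHA-256 digest of a collection that does not depend on
    the order in which the documents were read."""
    digest = hashlib.sha256()
    for doc in sorted(documents, key=lambda d: str(d[key])):
        # sort_keys makes the serialization independent of field order.
        encoded = json.dumps(doc, sort_keys=True, default=str)
        digest.update(encoded.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical documents read from the source and the restored database.
source = [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}]
restored = [{"name": "b", "_id": 2}, {"_id": 1, "name": "a"}]
print(collection_digest(source) == collection_digest(restored))
```

For large collections the same comparison would be done per chunk rather than over the whole collection in memory.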

Risk of the new cloud

Risk of migration process

Reduce Risk of Migration Process

Complicated migration == High risk

Offline migration => Simple but high risk

Accident during official migration

Plan A: 1-time downtime migration

Step 1. Cut off traffic in AWS

Step 2. Dump & restore databases

Step 3. Deploy containers / VM in GCP

Step 4. Enable traffic in GCP

Requires a 4-hour downtime for the dump and restore

If it fails, rolling back requires another 4-hour downtime

To shorten the downtime, the databases need to be synced

Prod write: 180~420 op/s

Mongo connector for AWS Oregon -> GCP Taiwan: 252 op/s

Mongo connector for AWS Oregon -> GCP Oregon: 320 op/s
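
One reading of the numbers above, sketched as arithmetic: whenever production writes arrive faster than the connector can copy them, the destination falls behind by the difference. The sketch assumes the measured connector rates are sustained upper bounds, which the slides do not state explicitly.

```python
def backlog_growth(write_rate, sync_rate):
    """Operations per second by which the destination falls behind.

    Returns 0 when the sync keeps up with the write rate.
    """
    return max(0, write_rate - sync_rate)


# Figures from the slide, in operations per second.
PROD_WRITE_MIN, PROD_WRITE_MAX = 180, 420
SYNC_RATE = {
    "AWS Oregon -> GCP Taiwan": 252,
    "AWS Oregon -> GCP Oregon": 320,
}

for route, rate in SYNC_RATE.items():
    at_peak = backlog_growth(PROD_WRITE_MAX, rate)
    off_peak = backlog_growth(PROD_WRITE_MIN, rate)
    print(f"{route}: +{at_peak} op/s at peak, +{off_peak} op/s off-peak")
```

Under this reading both routes keep up off-peak and neither keeps up at peak, but the same-region route falls behind more slowly (100 op/s instead of 168 op/s).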

AWS Oregon Region

GCP Oregon Region

GCP Taiwan Region

step 1

step 2

Plan B: Many online migrations

Database migration

VM + Redis + Container migration

Application Server

Sync

Source DB

Destination DB

Online Migration
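
The cut-over at the end of an online migration can be sketched as below: while the sync runs, the application keeps using the source database, and the switch happens only once the destination has caught up. This is a minimal sketch under assumptions. The callbacks `get_lag`, `pause_writes`, `resume_writes`, and `switch_to_destination` are hypothetical hooks, not functions from any tool named in the talk.

```python
import time


def cut_over(get_lag, pause_writes, resume_writes, switch_to_destination,
             max_polls=30, poll_interval=1.0, sleep=time.sleep):
    """Switch the application to the destination DB once the sync has caught up.

    Returns True after a successful switch. If the lag never reaches zero
    within max_polls checks, writes are resumed on the source and False
    is returned, so the attempt can be retried later.
    """
    pause_writes()  # stop new writes so the lag can drain
    for _ in range(max_polls):
        if get_lag() == 0:
            # Destination is identical to source: point the app at it.
            switch_to_destination()
            return True
        sleep(poll_interval)
    resume_writes()  # give up for now, keep serving from the source
    return False
```

The write pause lasts only as long as the remaining lag takes to drain, which is why this kind of migration needs far less downtime than a full dump and restore.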

Runbook

Dryrun

Pilot
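
A runbook that is rehearsed repeatedly lends itself to being written as an ordered list of steps, each paired with its rollback. The sketch below is an assumption about how that could look, not 17media's actual runbook. `Step` and `execute` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    rollback: Callable[[], None]


def execute(runbook: List[Step], log=print) -> bool:
    """Run the steps in order. If one fails, roll back the steps that
    already completed, in reverse order, and return False."""
    done: List[Step] = []
    for step in runbook:
        try:
            log(f"running: {step.name}")
            step.run()
            done.append(step)
        except Exception as error:
            log(f"failed: {step.name}: {error}")
            # The failed step itself is not rolled back, only completed ones.
            for finished in reversed(done):
                log(f"rolling back: {finished.name}")
                finished.rollback()
            return False
    return True
```

Running the same list in every rehearsal and again on the official day means the rollback path gets practiced as often as the forward path.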

Runbook

Dryrun #1

Dryrun #3

Last dry run

Milestones
• Prepare Runbook
• dry-run

Deliverables
• 3/30 1st dry-run in Redis/Mongo
• 4/10 1st dry-run in MySQL
• 4/13 2nd dry-run in Redis/Mongo/MySQL/VM/K8S/DNS
• 4/17 3rd dry-run
• 4/24 4th & 4.1th & 4.2th dry-run
• 4/27 5th & 6th dry-run
• 4/28 7th & 8th dry-run

Sep Oct Dec Jan Feb Mar Apr May

Phase I Inception

Phase II Elaboration

Phase III Construction

Phase IV Transition

Sep 13, 2017

Oct

Start to Plan
• Initial Project Scope

Jan, 2018

Deliverables
• STA has moved to GCP
• Stress Test

Milestones
• Poster Day

Deliverables
• 5/1 official in Mongo
• 5/2 official in MySQL/VM
• 5/3 official in Redis/GKE/DNS

Apr 27, 2018

What happened on that day?

1: Migration

4/30 The day before migration

5/1 Confirmed everything. Ready to go.

2: MongoDB Migration

5/2: 4:00, 5:00 MongoDB migration

5/3: 3:00, 3:45 MySQL migration

4:20 VM migration

4:30 Redis + container migration

5/3 4:15 am

3:

5/3 8:09 pm

Still in Oregon To be continued…

https://github.com/17media/jobs/issues
