Cloud migration Why? How? What happened? Roy Lou 17media

Cloud migration Why? How? What happened? · 2018-09-17 · • Poster Day Deliverables • 5/1 official in Mongo • 5/2 official in MySQL/VM • 5/3 official in Redis/ GKE/DNS Apr


Cloud migration Why? How? What happened?

Roy Lou 17media


About me

Sr. Director, Backend / SRE at 17media

Past: HTC, Google, NVIDIA

3-year-old monster’s dad

Swimming, jogging, snowboarding


AWS Oregon Region

GCP Oregon Region


2017/4/20 Migration plan first conceived


2017/10/25 Staging in GCP


2018/5/3 Production in GCP


Agenda

Why migrate the cloud?

How to prepare?

What happened on that day?


About 17media

App weekly release

Server daily release (30 commits / day)

Code review: Phabricator, CircleCI (auto lint + unit test + e2e test)

Master branch: CircleCI tests & generates docker image

Deploy: Slack -> Jenkins (generate task definition & update service) -> ECS


P50: 15ms

P95: 90ms
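
P50 and P95 are latency percentiles: half of requests finish within the P50 value, and 95% within the P95 value. A minimal sketch of how such figures can be derived from raw request timings, assuming the nearest-rank method (the slides do not say which method the monitoring stack uses). The `percentile` helper and the sample latencies are hypothetical, not data from the talk.

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds (illustrative only).
latencies = [12, 15, 9, 14, 90, 16, 15, 13, 120, 15]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"P50: {p50}ms, P95: {p95}ms")
```

With only ten samples the P95 is simply the slowest request, which is why percentile dashboards are usually computed over much larger windows.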


Why Migrate the Cloud?

Better Docker support

Cost saving

Analytics

Geographic


[Figure labels: 120ms, 150ms, 200ms, 200ms]


AWS Oregon Region

GCP Taiwan Region


How did you prepare?


How did you prepare to reduce risks?


Risk of the new cloud

Risk of migration process


Risk of the new cloud

Logic issue

Performance issue

Data issue


Reduce Risk of New Cloud

Logic issue

Unit test

End to end test

At least 1 test for each service

Performance issue

Data issue


Reduce Risk of New Cloud

Logic issue

Performance issue - Stress test

Stress test databases

Synthetic traffic - extend from e2e test

Real traffic - GoReplay

Data issue
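
The idea of extending an e2e test into synthetic traffic can be sketched as follows. This is only an illustration under stated assumptions: `e2e_login_flow` is a hypothetical stand-in for an existing end-to-end test case, and `stress` is a hypothetical helper that runs it concurrently and records per-call latency. The slides do not describe the actual tooling used.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def e2e_login_flow():
    """Hypothetical stand-in for an existing end-to-end test case.
    A real one would call the service under test over the network."""
    time.sleep(0.001)
    return True

def stress(test_case, total_requests, concurrency):
    """Run test_case total_requests times on a thread pool and return
    (number of failures, sorted latencies in milliseconds)."""
    def timed_call(_):
        start = time.perf_counter()
        ok = test_case()
        return ok, (time.perf_counter() - start) * 1000

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    failures = sum(1 for ok, _ in results if not ok)
    latencies_ms = sorted(ms for _, ms in results)
    return failures, latencies_ms

failures, latencies_ms = stress(e2e_login_flow, total_requests=50, concurrency=10)
print(f"{failures} failures, slowest call {latencies_ms[-1]:.1f}ms")
```

Reusing e2e cases this way keeps the synthetic load close to real user flows, while replayed production traffic covers the request mix that hand-written cases miss.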


Reduce Risk of New Cloud

Logic issue

Performance issue

Data issue

Use open-source dump/restore/sync libraries

Seek help from consultants


Risk of the new cloud

Risk of migration process


Reduce Risk of Migration Process

Complicated migration == High risk

Offline migration => Simple but high risk

Accident during official migration


Plan A: 1-time downtime migration

Step 1. Cut-off traffic in AWS

Step 2. Dump & restore databases

Step 3. Deploy containers / VM in GCP

Step 4. Enable traffic in GCP


Need a 4-hour downtime due to dump-restore

If failed, need another 4-hour downtime to rollback

To shorten downtime, need to sync between databases

Prod write: 180~420 op/s

Mongo connector for AWS Oregon -> GCP Taiwan: 252 op/s

Mongo connector for AWS Oregon -> GCP Oregon: 320 op/s
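
The throughput figures above can be compared directly. A small worked sketch, using the op/s values from the slide, of how fast replication would fall behind if writes stayed at the low or high end of the production range. The helper name `backlog_growth` and the steady-rate model are assumptions made for illustration.

```python
def backlog_growth(write_ops, sync_ops):
    """Ops per second by which the destination falls behind the source
    (0 when the sync rate is at least the write rate)."""
    return max(0, write_ops - sync_ops)

# Values taken from the slide, in operations per second.
PROD_WRITE_MIN, PROD_WRITE_MAX = 180, 420
CONNECTOR_RATE = {"GCP Taiwan": 252, "GCP Oregon": 320}

report = {
    dest: {
        "off_peak": backlog_growth(PROD_WRITE_MIN, rate),
        "peak": backlog_growth(PROD_WRITE_MAX, rate),
    }
    for dest, rate in CONNECTOR_RATE.items()
}

for dest, lag in report.items():
    print(f"{dest}: off-peak +{lag['off_peak']} op/s, peak +{lag['peak']} op/s")
```

Under this simple model the connector keeps up off-peak on both paths but falls behind at peak, by 168 op/s to GCP Taiwan and 100 op/s to GCP Oregon. This is one reading of the slide's numbers, not a record of how the team reasoned.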


AWS Oregon Region

GCP Oregon Region

GCP Taiwan Region

step 1

step 2


Plan B: Many online migrations


Database migration


VM + Redis + Container migration


Reduce Risk of Migration Process

Complicated migration == High risk

Offline migration => Simple but high risk

Accident during official migration


Online Migration

Application Server

Sync

Source DB

Destination DB
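
A minimal in-memory sketch of the online-migration idea: the application keeps writing to the source DB while a sync process replays those writes to the destination, and cutover happens once the destination has caught up. The `SourceDB` and `Syncer` classes and the list-based operation log are hypothetical simplifications, not the actual tooling from the talk.

```python
class SourceDB:
    """Toy source database that records every write in an ordered log."""
    def __init__(self):
        self.data = {}
        self.oplog = []  # ordered (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))

class Syncer:
    """Replays the source's operation log into the destination dict."""
    def __init__(self, source, destination):
        self.source = source
        self.destination = destination
        self.position = 0  # index of the next log entry to apply

    def lag(self):
        """Number of writes not yet applied to the destination."""
        return len(self.source.oplog) - self.position

    def sync(self, max_ops):
        """Apply up to max_ops pending writes; return how many were applied."""
        batch = self.source.oplog[self.position:self.position + max_ops]
        for key, value in batch:
            self.destination[key] = value
        self.position += len(batch)
        return len(batch)

source, dest = SourceDB(), {}
syncer = Syncer(source, dest)

for i in range(5):                 # application traffic continues
    source.write(f"user:{i}", i)
syncer.sync(max_ops=3)             # sync runs in the background
source.write("user:0", 100)        # more writes arrive during the sync

lag_before_cutover = syncer.lag()
while syncer.lag() > 0:            # writes paused briefly: drain the backlog
    syncer.sync(max_ops=3)
print(lag_before_cutover, dest == source.data)
```

The only window in which writes must stop is the final drain, which is what makes the online approach shorter in downtime than a full dump and restore.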


Reduce Risk of Migration Process

Complicated migration == High risk

Offline migration => Simple but high risk

Accident during official migration

Runbook

Dryrun

Pilot
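
One way to make a runbook rehearsable is to encode each step with its rollback, so the same list can be walked in dry-run mode or executed for real. The sketch below is an assumption about form only: the slides do not show the runbook's format, and `Step`, `run_runbook`, and the example steps are hypothetical. Here dry-run merely lists the steps, whereas a rehearsal like the ones in the talk would exercise them against real systems.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], None]
    rollback: Callable[[], None]

def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Execute steps in order. On failure, roll back completed steps
    in reverse order. In dry-run mode, only list the steps."""
    log, done = [], []
    for step in steps:
        if dry_run:
            log.append(f"DRY-RUN {step.name}")
            continue
        try:
            step.action()
            done.append(step)
            log.append(f"OK {step.name}")
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            for prev in reversed(done):
                prev.rollback()
                log.append(f"ROLLBACK {prev.name}")
            break
    return log

state = []

def failing_migration():
    raise RuntimeError("simulated failure")

steps = [
    Step("cut off traffic",
         lambda: state.append("traffic_off"),
         lambda: state.remove("traffic_off")),
    Step("migrate database", failing_migration, lambda: None),
]

dry_log = run_runbook(steps, dry_run=True)
real_log = run_runbook(steps, dry_run=False)
print("\n".join(dry_log + real_log))
```

Pairing every action with a rollback forces the "what if this step fails" question to be answered while writing the runbook rather than during the migration window.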


Runbook


Dryrun #1


Dryrun #3


Last dry run


Timeline: Sep, Oct, Dec, Jan, Feb, Mar, Apr, May

Phases
• Phase I: Inception
• Phase II: Elaboration
• Phase III: Construction
• Phase IV: Transition

Dates marked: Sep 13, 2017; Oct; Jan, 2018; Apr 27, 2018

Start to Plan
• Initial Project Scope

Deliverables
• STA has moved to GCP
• Stress Test

Milestones
• Prepare Runbook
• dry-run

Deliverables
• 3/30 1st dry-run in Redis/Mongo
• 4/10 1st dry-run in MySQL
• 4/13 2nd dry-run in Redis/Mongo/MySQL/VM/K8S/DNS
• 4/17 3rd dry-run
• 4/24 4th & 4.1th & 4.2th dry-run
• 4/27 5th & 6th dry-run
• 4/28 7th & 8th dry-run

Milestones
• Poster Day

Deliverables
• 5/1 official in Mongo
• 5/2 official in MySQL/VM
• 5/3 official in Redis/GKE/DNS


What happened on that day?


1: Migration

4/30 The day before migration


5/1 Confirmed everything. Ready to go.


2: MongoDB Migration

5/2: 4:00, 5:00 MongoDB migration


5/3:
• 3:00, 3:45 MySQL migration
• 4:20 VM migration
• 4:30 Redis + container migration


5/3 4:15 am


3:

5/3 8:09 pm


Still in Oregon. To be continued…


https://github.com/17media/jobs/issues