Cloud Migration: Why? How? What Happened?
Roy Lou, 17media
About me
Sr. Director, Backend / SRE at 17media
Past: HTC, Google, NVIDIA
3-year-old monster’s dad
Swimming, jogging, snowboarding
AWS Oregon Region
GCP Oregon Region
2017/4/20 Migration plan conceived
2017/10/25 Staging in GCP
2018/5/3 Production in GCP
Agenda
Why migrate to another cloud?
How to prepare?
What happened on that day?
About 17media
App weekly release
Server daily release (30 commits / day)
Code review: Phabricator, CircleCI (auto lint + unit test + e2e test)
Master branch: CircleCI runs the tests and builds a Docker image
Deploy: Slack -> Jenkins (generate task definition & update service) -> ECS
P50: 15ms
P95: 90ms
Why Migrate to Another Cloud?
Better Docker support
Cost saving
Analytics
Geographic
[Figure: latency comparison between AWS Oregon Region and GCP Taiwan Region; labels read 120 ms, 150 ms, 200 ms, 200 ms]
How did you prepare?
How did you prepare to reduce risks?
Risk of the new cloud
Risk of migration process
Risk of the new cloud
Logic issue
Performance issue
Data issue
Reduce Risk of New Cloud
Logic issue
Unit test
End to end test
At least 1 test for each service
Performance issue - Stress test
Stress test databases
Synthetic traffic - extended from the e2e tests
Real traffic - GoReplay
Data issue
Use open-source dump/restore/sync libraries
Seek help from consultants
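As a rough illustration of the real-traffic stress test above, the sketch below replays a recorded request log against a target at a configurable speed-up. It is not how GoReplay works internally; `RecordedRequest`, `replay_schedule`, `replay`, and the `send` callback are hypothetical names, and a real test would issue HTTP requests to the new cloud rather than call a Python function.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RecordedRequest:
    offset_s: float  # seconds since the start of the capture
    method: str
    path: str


def replay_schedule(requests: List[RecordedRequest], speedup: float) -> List[float]:
    """Return, in time order, the delay from replay start for each request."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ordered = sorted(requests, key=lambda r: r.offset_s)
    return [r.offset_s / speedup for r in ordered]


def replay(
    requests: List[RecordedRequest],
    speedup: float,
    send: Callable[[RecordedRequest], None],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Send each request at its scaled offset and return per-request latencies."""
    ordered = sorted(requests, key=lambda r: r.offset_s)
    start = clock()
    latencies = []
    for req, due in zip(ordered, replay_schedule(ordered, speedup)):
        wait = due - (clock() - start)
        if wait > 0:
            sleep(wait)  # hold the request until its scheduled time
        t0 = clock()
        send(req)  # assumption: send() blocks until the response arrives
        latencies.append(clock() - t0)
    return latencies
```

Raising `speedup` above 1.0 compresses the recorded timeline, which is one simple way to push more than production load through the new environment.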
Reduce Risk of Migration Process
Complicated migration == High risk
Offline migration => Simple but high risk
Accident during official migration
Plan A: One-time migration with downtime
Step 1. Cut off traffic in AWS
Step 2. Dump & restore databases
Step 3. Deploy containers / VMs in GCP
Step 4. Enable traffic in GCP
Needs a 4-hour downtime because of the dump and restore
If it fails, rolling back needs another 4-hour downtime
To shorten the downtime, the databases need to be kept in sync
Production writes: 180~420 op/s
Mongo connector for AWS Oregon -> GCP Taiwan: 252 op/s
Mongo connector for AWS Oregon -> GCP Oregon: 320 op/s
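The numbers above can be compared directly. The following is a minimal sketch that assumes the quoted figures are sustained rates; `backlog_growth` is a hypothetical helper, not part of Mongo connector.

```python
def backlog_growth(write_ops: float, sync_ops: float) -> float:
    """Ops per second by which the unsynced backlog grows (0 if sync keeps up)."""
    return max(0.0, write_ops - sync_ops)


# Figures taken from the slide; treated here as sustained rates (an assumption).
PROD_WRITE_MIN, PROD_WRITE_MAX = 180, 420
CONNECTORS = {
    "AWS Oregon -> GCP Taiwan": 252,
    "AWS Oregon -> GCP Oregon": 320,
}

for route, sync_ops in CONNECTORS.items():
    at_min = backlog_growth(PROD_WRITE_MIN, sync_ops)
    at_max = backlog_growth(PROD_WRITE_MAX, sync_ops)
    print(f"{route}: backlog grows {at_min:.0f} op/s at low load, "
          f"{at_max:.0f} op/s at peak")
```

Under that assumption both routes keep up at the low end of the write range but fall behind at peak, and the cross-region route to Taiwan falls behind faster than the route within Oregon.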
[Figure: AWS Oregon Region, GCP Oregon Region, GCP Taiwan Region; arrows labeled step 1 and step 2]
Plan B: Many online migrations
Database migration
VM + Redis + Container migration
Online Migration
[Figure: Application Server; Source DB -> Sync -> Destination DB]
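One way to picture the online migration in the diagram is a snapshot copy followed by replaying the writes that arrived after the snapshot. The sketch below uses in-memory dictionaries in place of real databases; `SourceDB`, `Syncer`, and the change-log format are illustrative assumptions, not the tooling used in the talk.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[str]]  # (key, new value); None means delete


class SourceDB:
    """Toy source database that records every write in an ordered change log."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[Change] = []

    def write(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self.data.pop(key, None)
        else:
            self.data[key] = value
        self.log.append((key, value))


class Syncer:
    """Copies a snapshot, then replays log entries written after the snapshot."""

    def __init__(self, source: SourceDB) -> None:
        self.source = source
        self.dest: Dict[str, str] = {}
        self.position = 0  # index of the next log entry to apply

    def initial_copy(self) -> None:
        self.position = len(self.source.log)  # log position of the snapshot
        self.dest = dict(self.source.data)

    def catch_up(self) -> int:
        """Apply pending changes to the destination; return how many were applied."""
        pending = self.source.log[self.position:]
        for key, value in pending:
            if value is None:
                self.dest.pop(key, None)
            else:
                self.dest[key] = value
        self.position += len(pending)
        return len(pending)

    def lag(self) -> int:
        """Number of source writes not yet applied to the destination."""
        return len(self.source.log) - self.position
```

In this model the application server keeps writing to the source while `catch_up` runs repeatedly; traffic would only be switched to the destination once writes are paused briefly and `lag()` returns 0.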
Runbook
Dry run
Pilot
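A runbook that is rehearsed in dry runs can be modeled as an ordered list of steps executed against a named environment. The sketch below is one possible shape under that assumption; `Step`, `execute`, and the environment names are hypothetical, and the step names in the usage note are borrowed from Plan A purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[str], None]  # receives the target environment name


def execute(runbook: List[Step], environment: str) -> List[Tuple[str, str]]:
    """Run steps in order against one environment; stop at the first failure."""
    results = []
    for step in runbook:
        try:
            step.action(environment)
        except Exception as exc:  # a rehearsal should surface every kind of failure
            results.append((step.name, f"failed: {exc}"))
            break
        results.append((step.name, "ok"))
    return results
```

For example, `execute([Step("Cut off traffic", noop), Step("Dump & restore databases", noop)], "rehearsal")`, with `noop = lambda env: None`, returns one `(name, "ok")` pair per step. Running the same list against a rehearsal environment several times before the official run is the point of the dry runs.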
Runbook
Dry run #1
Dry run #3
Last dry run
Milestones
• Prepare runbook
• Dry runs
Deliverables
• 3/30 1st dry run: Redis/Mongo
• 4/10 1st dry run: MySQL
• 4/13 2nd dry run: Redis/Mongo/MySQL/VM/K8S/DNS
• 4/17 3rd dry run
• 4/24 4th, 4.1, and 4.2 dry runs
• 4/27 5th & 6th dry runs
• 4/28 7th & 8th dry runs
[Timeline axis: Sep 2017 to May 2018]
Phase I Inception
Phase II Elaboration
Phase III Construction
Phase IV Transition
Sep 13, 2017
Start to Plan
• Initial project scope
Jan, 2018
Deliverables
• STA has moved to GCP
• Stress test
Milestones
• Poster Day
Deliverables
• 5/1 official migration: Mongo
• 5/2 official migration: MySQL/VM
• 5/3 official migration: Redis/GKE/DNS
Apr 27, 2018
What happened on that day?
1: Migration
4/30 The day before migration
5/1 Confirmed everything. Ready to go.
2: MongoDB Migration
5/2: 4:00, 5:00 MongoDB migration
5/3: 3:00, 3:45 MySQL migration; 4:20 VM migration; 4:30 Redis + container migration
5/3 4:15 am
3:
5/3 8:09 pm
Still in Oregon. To be continued…
https://github.com/17media/jobs/issues