Scaling MySQL at Venmo
Dong Wang (PayPal Inc), Van Pham (Venmo), Heidi Wang (PayPal Inc)
Percona Live 2019, Austin TX
© 2019 PayPal Inc. Confidential and proprietary.
Agenda
Introduction
Venmo History
Application Ecosystem and MySQL Architecture
Scalability Challenges
Short Term Tactical Improvements
Long Term Strategic Improvements
Wrap up: Q & A
The History of Venmo
The History of Venmo
• Venmo was founded by Andrew Kortina and Iqram Magdon-Ismail, freshman roommates at the University of Pennsylvania
• The original prototype sent money through text messages, and eventually transitioned to a smartphone app
• In 2012, the company was acquired by Braintree for $26.2 million
• In December 2013, PayPal acquired Braintree and, with it, Venmo
How Venmo Started
The History of Venmo
As of Q1 2019
• Venmo has 40 million users
• Accounts for 14.5% of PayPal's total user base
• Venmo posted $21 billion in volume
• Annual volume growth of 73%
How Venmo Works
Pay with Venmo
Just Venmo Me
Challenges
• Growth in partnership with vendors
• Exponential growth in the user base
• Exponential growth in payment volume
• Exponential growth in data volume
Business Requirements
• Add new features and initiatives
• Increase fraud detection and security
• Scale the payment volume
• Increase user satisfaction
• Keep what makes Venmo unique
Venmo Growing Pains
Venmo Payment Volume 2 Year Trend
Application Ecosystem and Database Architecture
Full Stack in AWS
Diagram labels: Web (Shabu), Android, iOS, Amazon CDN (CloudFront), Web, Mobile, Routing (Amazon Route 53), Task Workers (risk/fraud/comp, etc.), Misc (cron, etc.), DB, Admin (Scope/VU), Core, Analytics, Auth, OFAC, celery brokers, Locks, Orch, Developer API, REST, Nginx, Envoy, Feed, Social Graph, Pub/Friend/User Feeds, Queries, Login Events
Venmo Application Ecosystem
Diagram labels: Model, View, Template, ORM-generated SQL, DB
Venmo Application Framework
MVT Framework for both web and web services
Database Architecture
Amazon Aurora
Pros
• Managed services
• Low latency read replicas
• Stable and better performance
• Easy provisioning and scaling
• Faster backup and cloning
• Point-in-time recovery
• Custom endpoints
• Monitoring tools
Cons
• Less visibility into the system and storage layers
• Limited vertical and horizontal scalability
• Maximum cluster volume of 64 TB
• Writer restart causes all readers to reboot
Amazon Aurora - Pros and Cons
Scalability Challenges at Venmo
Area: Scalability
Symptoms
• Limited horizontal scalability with more read-only nodes for MongoDB and MySQL
• Uneven CPU/connection distribution
• Read traffic not using read replicas effectively
Impact
• Can’t handle increase in payment volume
• Bad user experience
• Higher call volume to customer support
• Low user ratings in the app stores
Area: Platform
Symptoms
• Old versions of MySQL, MongoDB and Cassandra
Impact
• Slow performance
• Inconsistent data
Area: DR Readiness (in progress)
Symptoms
• Limited distribution of DB nodes in US-East AZs
• Lack of DB regional parity in US-West
Impact
• Degraded performance in a regional failure scenario
2018 Infrastructure Challenges
Area: Query Performance
Symptoms
• Bad performance from queries generated by ORM
• Top 10 slow queries > 75% of slow query time
Impact
• High latency during peak time
• High CPU usage
Area: Changes in Access Pattern
Symptoms
• Too many indexes not used by the application
• Can’t add covering index to large tables
Impact
• Slow queries
• Slower updates & payments per second
Area: Data Model
Symptoms
• No data retention policy
• Heavily skewed data distribution in MySQL
• Unoptimized keyspace in Cassandra and MongoDB
Impact
• Payment failure or low payments per second
• A high rate of timeouts
• Maintenance challenge
Area: Transaction Model
Symptoms
• Multiple DBs involved
• High number of deadlocks
• High number of blocking reads (Select for Update)
Impact
• Payment failure
• Reduced payments per second
2018 Architecture and Performance Challenges
Area: Monitoring
Symptoms
• Many monitoring tools, including legacy and new: Grafana, New Relic, DataDog, Sumo Logic, PMM, MongoDB Cloud Manager
• Only monitor basic metrics for all datastores
Impact
• Low confidence in metrics validity
• Lack of notification for critical metrics
• Harder to troubleshoot problems
Area: Release Process
Symptoms
• No dedicated release engineering org
• Limited QA review of releases
• Lack of sufficient DBA review
• Lack of sufficient testing before production release
Impact
• Frequent incidents
• Bad user experience
• Higher call volume to customer support
• Low user ratings in the app stores
2018 Operation Challenges
Short Term Tactical Improvements
Principles
• Aim for peak traffic during Super Bowl
• Target availability improvement
• Minimize code change
• No big surgery on data models
• Align with strategic moves
• Provide foundational benefits for both short/long term
Short Term Tactical Improvements
2.5x pps
• MySQL upgrade and row-based replication
• Vertical scale of writer node
• Vertical and horizontal scale of reader nodes
• Read/write traffic separation
• Domain isolation
Infrastructure Scaling
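The slides do not say how the read/write traffic separation was implemented. As one possible illustration, a Django-style database router can send reads to replicas and writes to the writer node. This is a minimal sketch: the alias names `writer`, `reader1` and `reader2` are hypothetical, and the class is plain Python that only follows the method names Django expects from a router.

```python
import random


class ReadWriteRouter:
    """Route reads to replicas and writes to the writer node (sketch)."""

    # Hypothetical aliases; in Django these would be keys of settings.DATABASES.
    writer_alias = "writer"
    reader_aliases = ["reader1", "reader2"]

    def db_for_read(self, model, **hints):
        # Spread read queries across the read replicas.
        return random.choice(self.reader_aliases)

    def db_for_write(self, model, **hints):
        # All writes go to the single writer node.
        return self.writer_alias

    def allow_relation(self, obj1, obj2, **hints):
        # Writer and readers hold the same data, so relations are allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Apply schema migrations only through the writer.
        return db == self.writer_alias
```

In a Django project a router like this would be listed in the `DATABASE_ROUTERS` setting. Code that must read its own writes immediately would still need to target the writer explicitly, since replicas can lag.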
Charts: Improved DML latency; Reduced blocked transactions
Infrastructure Scaling
Charts: Improvement in CPU usage; Improvement in RAM available
Infrastructure Scaling
Task: Django code optimization on single-table select
Outcome
• Avoid querying millions of rows and then throwing them away
• 25% CPU reduction across the board
Task: Tuning of 7-table joining queries
Outcome
• Work around the plan instability by avoiding order by PK column
• Query execution time reduced from >60 seconds to milliseconds
Task: Django code optimization to reduce DB round trips
Outcome
• Get the one-row result set directly without counting the number of rows first
• Queries to DB reduced by 50% for a GET call
Task: Fixing slow queries of critical jobs due to plan instability
Outcome
• Root cause analysis using advanced techniques
• Avoid cascading slow queries
• No more critical job failures
Application Optimization/Query Tuning
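The round-trip reduction above (fetching the one-row result directly instead of counting first) can be sketched outside Django with the standard `sqlite3` module. The `users` table, its columns and both function names are hypothetical. The trace callback only counts the statements sent to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE)")
conn.execute("INSERT INTO users (username) VALUES ('alice')")
conn.commit()

# Record every SQL statement executed from here on.
statements = []
conn.set_trace_callback(statements.append)


def get_user_count_first(username):
    # Two round trips: count the matches, then fetch the row.
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE username = ?", (username,)
    ).fetchone()
    if n == 0:
        return None
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchone()


def get_user_direct(username):
    # One round trip: fetch the row and treat "no row" as not found.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ? LIMIT 1", (username,)
    ).fetchone()
```

For an existing user the first function issues two SELECT statements and the second issues one, which is the kind of halving the slide reports. In Django terms this is roughly the difference between calling `count()` on a queryset before indexing it and calling `first()` directly.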
Optimization on Single Table Select
Avoid running unnecessary queries and then throwing away the fetched data
Time                   CPU %   Payments/Sec (PPS)   CPU % for 100 PPS
12/26/2018 19:00 UTC   32.9    45                   32.9 / 45 * 100 = 73.1
12/28/2018 19:00 UTC   30.1    55                   30.1 / 55 * 100 = 54.7
Net reduction: (73.1 - 54.7) / 73.1 = 25%
    def _function(cls, users):
        """
        :param users: A list of `User` objects.
        :return: A tuple of `User` objects which meet the condition
        """

        if not users or len(users) == 0:
            return []
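The "CPU % for 100 PPS" figures in the table above normalize CPU usage by throughput so that two samples taken at different payment rates can be compared. The arithmetic can be reproduced with a short sketch. The function name is hypothetical, and the normalization assumes CPU scales roughly linearly with payments per second.

```python
def cpu_per_100_pps(cpu_percent, pps):
    """CPU % that would be consumed per 100 payments per second."""
    return cpu_percent / pps * 100


before = cpu_per_100_pps(32.9, 45)  # 12/26/2018 19:00 UTC sample
after = cpu_per_100_pps(30.1, 55)   # 12/28/2018 19:00 UTC sample
reduction = (before - after) / before

print(f"before={before:.1f} after={after:.1f} reduction={reduction:.0%}")
# prints: before=73.1 after=54.7 reduction=25%
```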
Workaround Plan Instability of a Slow 7-Table Joining Query
Avoid Order By PK Column
Root Cause of Ever-Growing Cost Estimates When Using Secondary Indexes
Avoid blocking of index page merges
Task: Elasticity of Dev API pool
Outcome
• Proactive expansion of application server pools during peak
Task: Improve capacity planning
Outcome
• Simulate peak handling in load test environment
• Realistic projections of system capacity for peak traffic
Task: Improved monitoring
Outcome
• Enable Performance Insights and Performance Schema
• Single-pane-of-glass view of all databases
Task: Knowing all levers and knobs for peak handling
Outcome
• Optimize cron job timing
• Turn features off/on
Task: PayPal risk calls capacity preparation
Outcome
• Vastly improved the risk integration with PayPal and reduced losses
Task: Playbooks, event coordination/communication
Outcome
• Planned execution of preemptive steps
• Anticipation of issue handling with playbooks
• War room, Slack channels, dedicated bridge, points of contact
• Creation of the Performance & Scalability Team
Operational Enhancements
Task: Improve release process
Outcome
• Fewer release rollbacks
Task: Improve incident management
Outcome
• Reduction in the number of incidents
Task: Improve root cause analysis process
Outcome
• Analysis results in actionable tickets
Task: Improve QA & load test
Outcome
• Drastic reduction in Serv1 & Serv2
Task: Mandatory DBA review of new data models and queries
Outcome
• Major improvement in query performance
• Drastic reduction in the number of slow queries
Release Process Enhancements
Core MySQL DB Performance
Long Term Strategic Improvements
Principles
• Business outcome prioritized
• Evolution rather than revolution
• Horizontal scalability
• Data lifecycle management (DLM) enabled
• Tight transactional integrity
• Domain isolation
• Foundation for linear scalability
Long Term Strategic Improvements
100x pps
Questions?
Thank You