#vmworld
Billion Lyft Rides, Half Million IoT Users
How to Scale SaaS with Analytics Insights
Rob Fisher, SRE, Centrica Hive
Yash Kumaraswamy, Sr. Software Engineer, Lyft
Stela Udovicic, Wavefront Product Marketing, VMware
MGT1402BE
#MGT1402BE
VMworld 2018 Content: Not for publication or distribution
Disclaimer
©2018 VMware, Inc.
This presentation may contain product features or functionality that are currently under development.
This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined.
Agenda
Introduction
What is New with Wavefront
Centrica Hive
Lyft
SaaS is Growing
$60 billion: 2017 SaaS revenue worldwide
$117 billion: 2021 SaaS revenue forecast worldwide
*Source: Gartner press release, April 12, 2018
But Scaling Cloud Applications is Not Easy
Visibility Issues
• Containers rapid churn
• Serverless speed
• Cloud-scale
Security risks of public cloud applications
Wavefront Cloud-Native Analytics and Monitoring Platform
UI and API Backend
Trend & Alert on Anomalies
Troubleshoot Issues
Visualize Metrics at Scale
Self-Service Metrics Analytics for All
Advanced Analytics Engine
Metrics Collection and Storage
Cloud-Native Analytics and Monitoring Platform
Wavefront by VMware: Reliably Scale Your Digital Business
Massive Container Scalability
Serverless Application Instrumentation and Monitoring
Enriched AWS Dashboards: New UI for Faster Troubleshooting
Enhanced Security Access Control
NEW!
Wavefront's Massive Container Scalability Helps Easily Grow Digital Business – No Blind Spots
Ingesting, analyzing, and visualizing metrics from 100,000 concurrently running containers
Deep Wavefront PKS Integration for Holistic Kubernetes Monitoring
Kubernetes Health Monitoring
Resource Consumption
Programmatic Alerting
Deliver Serverless Code Faster Using Wavefront Serverless Instrumentation and Monitoring
Wavefront AWS Lambda Functions SDK (Python, Go, Node.js)
– Faster: send metrics from functions directly, bypass AWS CloudWatch Lambda
– Better granularity: 1 sec compared to 5 min
Wavefront AWS Functions Dashboards
– At-a-glance serverless health monitoring
– Easy customization
– Native correlation with custom metrics
Wavefront Delta Counters
– More accurate reporting prevents metric loss
– Aggregates metric counters from various sources
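The delta-counter idea above can be pictured with a toy aggregator. This is a sketch of the concept only, not Wavefront's implementation: several short-lived sources each report an increment, and the receiving side sums the increments per metric so that no report overwrites another. The class name and metric name are hypothetical.

```python
from collections import defaultdict


class DeltaCounterStore:
    """Toy receiver that sums increments (deltas) per metric name."""

    def __init__(self):
        self._totals = defaultdict(float)

    def report_delta(self, metric, delta):
        # Each report adds to the running total, so concurrent
        # function invocations never overwrite one another.
        self._totals[metric] += delta

    def total(self, metric):
        return self._totals[metric]


store = DeltaCounterStore()
# Three hypothetical function invocations report independently.
for delta in (1, 1, 3):
    store.report_delta("orders.processed", delta)
print(store.total("orders.processed"))
```

Summing on the receiving side is what lets many function instances share one counter without coordinating with each other.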
Visualize Health of Serverless Environment with New Dashboards
At-a-Glance Health with Aggregated Visibility
Detailed Per-Function Dashboard
Troubleshoot Cloud Environments Faster with Enriched Wavefront Dashboards for AWS
Holistic AWS Monitoring
Map and view the status of your global AWS cloud estate to see where problems are emerging system-wide
Host Maps with Easy Drill-Downs
Click and link to cloud assets to seamlessly drill down across regions, zones, and instances
New Dashboards & Widgets
Use prebuilt dashboards with new metric widgets to accelerate incident resolution
Intuitive Security Access Controls for Protecting Digital Insights
User Groups Management
Create user groups and assign permissions to manage dashboard access easily
Programmatic Controls
Enhanced usability with easy manipulation of user groups helps manage user growth
ACL on Entities
Set ACL for dashboards to isolate access by user groups to avoid malicious data integrity attacks
Some of Wavefront's Customers
Monitoring Cloud-Native Applications and Infrastructure
Centrica Hive
About Centrica Hive
Largest IoT platform in the United Kingdom
Over 500,000 customers
Entirely cloud-native
Part of Centrica group, but grew like a startup
Taking Control
The Estate
Configuration Management
The Platform
Security and Compliance
Alerting (and sleeping all night)
Cost
Wavefront at Centrica Hive: Future Plans
Moving to Serverless and Kubernetes
Integrating all our devices
Staying in control
Improving Lives with the World’s Best Transportation
About Me
HOBBIES: Guitar, golf, skateboarding, cooking/baking, and automobiles
FORMER: Tech Lead of Lyft Observability
CURRENTLY: Working closely with the Express Drive team, Lyft’s vehicle rental program for drivers
OVER: 9 years in tech
2010-2014: Zynga
EARLY: Lyft Infrastructure (DevOps) Engineer
One Billion Rides
2018
About Lyft
• Transportation as a service
• “Your friend with a car,” redefines personal transportation
• Founded in San Francisco in 2012
• Currently serving the US and Canada
• Available in 300+ cities, with 1500 drivers at any minute
Lyft – More Fun Facts
• 250,000 Lyft community members gave up their cars at the beginning of 2017
• The Lyft community will take 1 million cars off the road by the end of 2019
• Autonomous vehicle fleets will become widespread and will account for the majority of Lyft rides within five years
• By 2025, private car ownership will all but end in major US cities
• Lyft rides are carbon-neutral
• Lyft Bikes and Scooters will be our solution to the last-mile commute
Lyft Stats – in 2017
Annual Rides: MM
New Year's Eve Rides: MM
Employees: 2K+
Halloween Drop-Offs/sec: K+
Microservices: 200+
EC2 instances: 10,000+
Lots* of logs and metrics
Observability Team at Lyft
Founded in early 2016, a small and cohesive team of 5 engineers
Team collectively owns
• Client and Server logging infrastructure
• Metric ingest pipeline and real-time aggregation
• Distributed Tracing
• PagerDuty interactions and integrations
• The real-time business metric framework
• Dashboards and user experience with monitoring and alarming setup
• Logging and metric-based alerting
• Baseline monitoring systems for all microservices
• Core libraries
Metrics at Lyft: The Before Times
Before Wavefront by VMware
Challenges with Open Source Tooling
Reliability
• Hard to scale
• Sharding handled externally
Performance
• Query performance issues
• Ingest performance issues
Maintainability
• Manual maintenance
• Resource-hungry drives cost
Observability Challenges Early in 2015
• Lyft used Graphite (and whisper files) located on i2 instances
• Hard to scale, we handled sharding externally
• Relays provided poor control for fan out of data to alternate destinations
• We computed top-level aggregates from the one already existing
• This stack processed local minutely aggregated samples
Observability Challenges Early in 2016
• Replace the poorly scaling Python-based intermediaries with more efficient components
• Reduce end-to-end latency for site from >3m to <2m
• Produce improved and accurate top-level aggregates - p95/99/999/9999
Early 2016 – Enter Wavefront
• Node.js-based StatsD replaced by a C implementation of the StatsD server – lower overhead, better data quality
• Added fan-out for StatsD traffic to other clusters or receivers, e.g., Wavefront
• Wrote cluster-wide aggregated metrics to the existing Graphite cluster under a new namespace to allow comparisons of latency and accuracy
• Aggregated StatsD packets over time in several dimensions, including per-host and per-cluster
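Aggregating StatsD packets in more than one dimension can be sketched as below. This is a minimal illustration, not Lyft's pipeline: it assumes the standard StatsD line shape `name:value|type`, handles only counters, and uses hypothetical host and metric names.

```python
from collections import defaultdict


def parse_statsd(line):
    """Parse one StatsD line such as 'api.requests:1|c' into (name, value, type)."""
    name, rest = line.split(":", 1)
    value, metric_type = rest.split("|")[:2]
    return name, float(value), metric_type


def aggregate_counters(packets):
    """Sum counters per (host, metric) and per metric across the cluster.

    `packets` is an iterable of (host, statsd_line) pairs.
    """
    per_host = defaultdict(float)
    per_cluster = defaultdict(float)
    for host, line in packets:
        name, value, metric_type = parse_statsd(line)
        if metric_type != "c":
            continue  # this sketch only handles counters
        per_host[(host, name)] += value
        per_cluster[name] += value
    return dict(per_host), dict(per_cluster)


packets = [
    ("host-a", "api.requests:1|c"),
    ("host-a", "api.requests:2|c"),
    ("host-b", "api.requests:4|c"),
    ("host-b", "api.latency:120|ms"),
]
per_host, per_cluster = aggregate_counters(packets)
print(per_host)
print(per_cluster)
```

Keeping both views means a per-host chart and a cluster-wide chart can be served from the same stream of packets.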
Wavefront started serving 20% of read traffic in March 2016
• Time series ingestion
• Integrated alarms
• Wavefront salt module for alert, dashboard and user management
• Grafana integration
So Many Metrics!
System metrics
• Collected
• Custom scripts
• Bash functions
Application metrics
Core libraries instrumentation
Scraper scripts - pull metrics
• CloudWatch metrics
• Google Cloud Platform metrics
• Mongo telemetry
Container-generated parameters (future Kubernetes)
So Many Metrics!
Billions per second, even with aggregation and sampling
Graphite meltdown!
Opt-in mechanism for per-host and per-second data
Only ~300K metrics per second, thanks to rollups
Per-instance cardinality limits
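A per-instance cardinality limit can be pictured as a cap on how many distinct metric names one instance may emit. The limiter below is a hypothetical sketch under that assumption. The deck does not describe how Lyft enforces its limits, and the class and metric names are invented for illustration.

```python
class CardinalityLimiter:
    """Accept at most `limit` distinct metric names from one instance."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = set()
        self.dropped = 0

    def allow(self, metric_name):
        if metric_name in self.seen:
            return True  # already-known names always pass
        if len(self.seen) < self.limit:
            self.seen.add(metric_name)
            return True
        self.dropped += 1  # a new name beyond the limit is rejected
        return False


limiter = CardinalityLimiter(limit=2)
results = [limiter.allow(n) for n in ("a.count", "b.count", "a.count", "c.count")]
print(results, limiter.dropped)
```

Counting the rejected names, rather than dropping them silently, makes it possible to alert when an instance starts emitting unexpected metrics.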
Wavefront by VMware at Lyft Today
• System Monitoring
• Application monitoring
• > 500,000 metrics/second - peaked at 800,000
• 1,000+ engineers using Wavefront
• 1,000+ Wavefront dashboards
• 18,000+ Wavefront alerts
Today Lyft Relies on Wavefront for Time Series and Alarming
Python and Golang
• Common base libraries for each language
• Hundreds of microservices, one monorepo (that is getting decomposed)
• Frequent deploys
• Common “base” deploy, Salt (masterless), AWS public cloud
• The role of DevOps (the Infrastructure team) is to enable others, not to operate
• Teams are responsible for their service
• No SRE
How Does the Metrics Aggregation Pipeline at Lyft Work?
Cascaded Approach
github.com/lyft/statsrelay.git
github.com/lyft/statsite.git
Data Aggregation
Per-host aggregates locally
Service-level aggregates centrally - correct histograms
Default metrics aggregated at 60s interval
The 1-second interval is possible with a whitelist
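Why service-level percentiles are computed centrally can be shown with a small worked example. The latencies below are made up and the nearest-rank percentile is one common definition, chosen here for simplicity. Averaging per-host p95 values gives a different answer from the p95 of the merged samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies (ms) observed on two hosts.
host_a = [10] * 99 + [500]  # 100 requests, one slow
host_b = [400] * 10         # 10 requests, all slow

# Per-host p95 values averaged together: what a naive roll-up would report.
naive = (percentile(host_a, 95) + percentile(host_b, 95)) / 2

# Service-level p95 computed centrally from the merged samples.
central = percentile(host_a + host_b, 95)

print(naive, central)
```

Here the naive roll-up reports 205.0 ms while the merged samples give 400 ms, which is why percentiles cannot simply be averaged across hosts.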
Transitioning from Graphite to Wavefront Format Is Easy
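One way to picture the format change, assuming Graphite's plaintext line (`path value timestamp`) and the Wavefront data format (`metricName value [timestamp] source=... [pointTags]`). The function name, host, and tag are hypothetical, and this is a sketch rather than the converter Lyft used.

```python
def graphite_to_wavefront(line, source, tags=None):
    """Convert 'path value timestamp' into a Wavefront-format line.

    The source and optional point tags are appended after the timestamp.
    """
    path, value, timestamp = line.split()
    parts = [path, value, timestamp, f"source={source}"]
    for key, val in sorted((tags or {}).items()):
        parts.append(f'{key}="{val}"')  # point tag values are quoted
    return " ".join(parts)


print(graphite_to_wavefront("api.requests.count 42 1533531013", "host-a", {"env": "prod"}))
```

The metric path, value, and timestamp carry over unchanged. Only the source and any point tags are appended.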
Lyft Business Metrics in Wavefront
Passenger metrics
• New user signups / installs / activations
• Current passengers with the app open
Driver metrics
• New driver applications / activations
• Current drivers with the app open
Ride metrics
• Rides requested / accepted / dropped off / canceled / lapsed
• Lyft Line rides dropped off
• Paid vs. Couponed rides dropped off
Marketplace metrics
• Drivers available
• Drivers en route
• Driver utilization %
Passenger - PAX Client Metrics - Wavefront Integration with Grafana
Techniques Used at Lyft to Avoid Production Incidents with Hundreds of Microservices
Easy Application Metrics Collection - Python Metrics Library
from lyft_stats import stats

# Get a stats handler for the 'test_prefix' prefix.
handler = stats.get_stats('test_prefix')
lookup = {'foo': 'bar'}
try:
    # Time the enclosed block as 'sample.timer'.
    with handler.timer('sample.timer'):
        # do other things
        print(lookup['test'])
except KeyError:
    # Count the failed lookup.
    handler.incr('illegal.access')
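The slide shows only the calling side of the library. To make the pattern concrete, here is a hypothetical stand-in with the same two calls (`timer` and `incr`) that records StatsD-style lines in memory. It is an assumption about how such a handler could work, not the `lyft_stats` implementation.

```python
import time
from contextlib import contextmanager


class StatsHandler:
    """Hypothetical stand-in that records StatsD-formatted lines in memory."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.lines = []  # a real client would send these over the network

    def incr(self, name, count=1):
        self.lines.append(f"{self.prefix}.{name}:{count}|c")

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the timed block raises.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.lines.append(f"{self.prefix}.{name}:{elapsed_ms:.3f}|ms")


handler = StatsHandler("test_prefix")
lookup = {"foo": "bar"}
try:
    with handler.timer("sample.timer"):
        print(lookup["test"])  # raises KeyError
except KeyError:
    handler.incr("illegal.access")

print(handler.lines)
```

Recording the timer in a `finally` clause means the duration is still reported when the timed block fails, so the error count and the timing arrive together.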
Easy Metrics Collection - Go Metrics Library
https://github.com/lyft/gostats
Observability in the Age of Microservice Mesh
Envoy Primer
• Envoy Proxy: a modern, high-performance, small-footprint edge and service proxy designed for cloud-native applications
• Out-of-process architecture (sidecar)
• C++11 code base
• Service discovery and active/passive health checking
• Advanced load balancing
• Edge and service proxy
• HTTP L7 filter architecture
• Best in class Observability (tracing, logging, and stats)
Measure Everything!
Managed Dashboards and Alarms Hub
• Monolithic repository for managing dashboards
• Close integration with our Salt infrastructure
• Grafana and Wavefront modules for dashboard/alert management
• Dashboards/alerts defined as Salt states (jinja2+yaml)
• Rigorous code review process
• Consistent look and feel
• Distributed ownership
Consistent Look and Feel Across All Our Microservices
Envoy Global Health Dashboard
Wavefront Integration with Grafana
Metrics-Based Alerting Using Wavefront
Enrichment
Finding a Needle in a Haystack
Help Us Arrive at Root Cause Quickly
Tight Coupling
Benefits of Wavefront for Lyft
• Multiple-system syndrome
- Fewer tools for triage, better and faster resolution
- Context switching is expensive
- Wavefront puts metrics and data from numerous sources up front and makes them available in a single click
• Real-time visibility into the performance of our key services
• Highly efficient Alert Engine
- Relies on Wavefront to create smart alerts that dynamically filter noise and capture genuine anomalies
• Powerful metrics explorer and chart view
Big Wins with Wavefront
Ability to monitor releases to help engineers make accurate decisions
Predict the future
Empirical data to guide decision making
Robust alerting - for when you’re not watching
A first-class citizen for answering questions such as “Is Lyft up?” or “How many rides did we complete?”
Intuitive yet powerful query language
DON’T FORGET TO FILL OUT YOUR SURVEY.
#vmworld #MGT1402BE
THANK YOU!