Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Phillip Liu

@SignalFx - signalfx.com

Agenda

• My background

• Microservices, a review

• Analytics approach to monitoring

• Code push side effects, an example

• Summary

My Background

Experience

[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics

[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics

[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk

[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool

[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software

[ … ]

Microservices, a Review

A Microservices Definition

Loosely coupled service oriented architecture with bounded context.

Adrian Cockcroft

SignalFx’s Microservices

More than 15 internal services, spanning hundreds of instances across 3 AZs.

Have dependencies on tens of external services.

Monitoring Challenges

• High iteration rate leads to shortened test cycles

• Integration test combinations are intractable

• Catch problems during rolling deployments

• Identify upstream/downstream side effects, e.g. backpressure

• Identify brownouts before the customer

• etc.

Analytics Approach to Monitoring

Measure

Store

Analyze

Detect

Examples

Monitoring at SignalFx

• We use SignalFx to monitor SignalFx

• CollectD for OS and Docker metrics on all VMs

• Yammer metrics for all Java app servers

• Custom logger to count exception types

• All metrics are sent to an analytics service

• Each service deploys at its own cadence

• Push to lab, then to a canary in prod, then to the rest of the tier
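
The "custom logger to count exception types" bullet could look roughly like the sketch below. This is an illustration only, written in Python rather than the Java used by the app servers in the talk; the class name `ExceptionCountingHandler` and the logger name are hypothetical, and the real logger is not shown in the slides.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Hypothetical handler that counts logged exceptions by class name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when logger.exception() (or exc_info=True) is used.
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


handler = ExceptionCountingHandler()
logger = logging.getLogger("exception_count_example")  # assumed name
logger.addHandler(handler)
logger.propagate = False  # keep the example quiet on stderr

# Simulate a few failures: two ValueErrors and one TypeError.
for raw in ("x", "y", None):
    try:
        int(raw)
    except Exception:
        logger.exception("parse failed")

print(dict(handler.counts))
```

Each counter would then be reported as a metric alongside the OS and app-server metrics, so exception rates can be charted and compared per type and per host.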

Code Push Side Effects

Pushed a canary instance, and the Metadata API dashboard showed a healthy tier.

However, the upstream UI dashboard showed an unusual number of timeouts.

In search of the root cause. It is always safe to start by looking at exception counts, but not much can be derived from all the noise.

Sum the number of exceptions to create a single signal.
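
As a rough sketch of this step, the noisy per-type exception series can be collapsed into one signal by summing them point by point. The function name `sum_signals` and all sample numbers are made up for illustration; they are not data from the talk.

```python
def sum_signals(series_by_source):
    """Pointwise sum of equally sampled time series (one list per source)."""
    lengths = {len(points) for points in series_by_source.values()}
    if len(lengths) > 1:
        raise ValueError("all series must have the same number of samples")
    return [sum(points) for points in zip(*series_by_source.values())]


# Hypothetical exception counts per interval, keyed by exception type.
exceptions_by_type = {
    "TimeoutException": [2, 3, 2, 9],
    "InvalidObjectException": [0, 0, 0, 14],
    "IOException": [1, 0, 1, 1],
}

total = sum_signals(exceptions_by_type)
print(total)  # [3, 3, 3, 24]
```

A single summed series is much easier to compare against history than dozens of overlapping lines.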

Compare sum with time-shifted sum from a day ago.
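
One way to sketch the time-shift comparison: divide each sample by the sample taken one day earlier, so a ratio near 1 means "same as yesterday". The helper name `timeshift_ratio`, the hourly sampling, the `floor` guard, and the numbers are all assumptions made for the example.

```python
def timeshift_ratio(signal, shift, floor=1.0):
    """Ratio of each sample to the sample `shift` positions earlier.

    `floor` avoids dividing by zero (or by tiny baselines) on quiet periods.
    """
    return [
        signal[i] / max(signal[i - shift], floor)
        for i in range(shift, len(signal))
    ]


# Hypothetical hourly exception sums: a flat day, then a day that
# jumps to 4x in its last four hours.
yesterday = [10] * 24
today = [10] * 20 + [40] * 4
ratios = timeshift_ratio(yesterday + today, shift=24)

print(ratios[:3], ratios[-3:])  # [1.0, 1.0, 1.0] [4.0, 4.0, 4.0]
```

Comparing against the same time a day ago factors out normal daily patterns, so a change caused by the push stands out.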

Look at an outlier host - an Analytics service host.
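
The slides do not say how the outlier host was picked out (it may simply have been visible on the chart). A minimal sketch of one possible approach, assuming per-host exception counts and a median-absolute-deviation score, is shown below; host names, counts, and the threshold are hypothetical.

```python
from statistics import median


def outlier_hosts(count_by_host, threshold=3.0):
    """Hosts whose count deviates from the median by more than
    `threshold` median absolute deviations (MAD)."""
    values = list(count_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = mad if mad > 0 else 1.0  # fall back when most hosts are identical
    return [
        host
        for host, value in count_by_host.items()
        if abs(value - med) / scale > threshold
    ]


# Hypothetical exception counts per host over the same window.
counts = {
    "ui-1": 4,
    "ui-2": 5,
    "metadata-1": 6,
    "analytics-1": 5,
    "analytics-2": 90,
}

print(outlier_hosts(counts))  # ['analytics-2']
```

A median-based score is used here because a single misbehaving host would distort a mean and standard deviation computed over the same data.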

java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …

Looking at the Analytics service's logs revealed the source of the problem.

• Analytics across multiple microservices reduced the time to identify the problem. From push to resolution took ~15 min

• Service instrumentation helped narrow down the root cause

• The discovery allowed us to create a detector using analytics to notify us of similar problems in the future
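
A detector built from this investigation might be sketched as follows: fire when the summed exception signal stays well above its day-ago value for several consecutive samples. This is not SignalFx's detector API; the function `first_alert_index`, its parameters, and the sample data are assumptions for illustration.

```python
def first_alert_index(signal, shift, factor=2.0, min_consecutive=3, floor=1.0):
    """Index at which the detector first fires, or None if it never does.

    Fires when `signal[i]` exceeds `factor` times the value `shift` samples
    earlier for `min_consecutive` samples in a row.
    """
    streak = 0
    for i in range(shift, len(signal)):
        baseline = max(signal[i - shift], floor)
        if signal[i] > factor * baseline:
            streak += 1
            if streak >= min_consecutive:
                return i
        else:
            streak = 0
    return None


# Hypothetical hourly sums: a sustained jump fires, a one-sample blip does not.
sustained = [10] * 24 + [10] * 20 + [40] * 4
blip = [10] * 24 + [10] * 10 + [50] + [10] * 13

print(first_alert_index(sustained, shift=24))  # 46
print(first_alert_index(blip, shift=24))       # None
```

Requiring several consecutive samples trades a little detection delay for fewer alerts on momentary spikes.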

Other Examples

• A customer started dropping data because they reverted to an unsupported API

• Compare tsdb write throughput of two different write strategies

• Create per-service capacity reports

• Identify memory usage patterns across our Analytics service

• Create a detector for every previously uncaught error condition - postmortem output

Summary

• Measure and store as many metrics and events as possible

• Use data analytics techniques to

  • Identify problems

  • Chase down root cause

  • Create analytics-based detectors to notify you of recurrence

Thank You!

Phillip Liu

WE’RE HIRING

@SignalFx - signalfx.com