Diffy: Automatic Testing of Microservices @Twitter
Puneet Khanduri, Arun Kejariwal(@pzdk, @arun_kejariwal)
1
Oct 8, 2014
Twitter, Inc. Down 2% Due To Broken Signup
2
Oct 8, 2014
Twitter, Inc. NOT Down 2% Due To NOT Broken Signup
3
“I just refactored a critical part of my service. How do I know I didn’t break anything?”
- Every Service Developer @ Twitter
4
“They just refactored a critical part of their service. How do I know they didn’t break anything?”
- Every Site Reliability Engineer @ Twitter
5
Tier #0 - Unit Tests
Cost: Writing good tests takes ~1.5x of development time
Limited scope: Testing classes/methods in isolation
High coverage % per test: e.g. a method has 5 independent code paths => 1 unit test yields 20% coverage
6
Tier #1 - Component Tests
Testing a service in isolation with a fully mocked environment
Cost of a single test: Same as unit tests
Low coverage % per test: Cyclomatic complexity is O(k^n) - impractical to target 100%
Handpicked test cases: e.g. a request path has 6 methods with 5 paths per method => 1 test = 0.03% coverage
7
Tier #2 - Integration Tests
Testing a service and its downstream dependencies in a real (staging) environment
Cost: Same as unit tests + amortized cost of a staging environment
Negligible coverage per test: Much less than component tests; e.g. a request path has 4 services, 6 methods/service, 5 paths/method
9
... emerging pattern ...
Super-exponential cost of coverage
10
Diffy Approach: Higher coverage for free
11
Diffy Approach
Free test inputs: Sample production traffic or whatever traffic source you prefer
Free assertions: Use "known good" versions of your code to generate assertions
12
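The idea above can be sketched in a few lines of Scala (an illustrative sketch, not Diffy's actual API: the Response type and diff function here are hypothetical). A "known good" primary response serves as the free assertion, and any field where the candidate disagrees is a raw difference.

```scala
// Illustrative sketch (not Diffy's API): a "known good" primary response acts
// as the assertion; any field where the candidate disagrees is a raw difference.
case class Response(status: Int, body: Map[String, String])

def diff(a: Response, b: Response): Set[String] = {
  val statusDiff = if (a.status != b.status) Set("status") else Set.empty[String]
  val fieldDiffs = (a.body.keySet ++ b.body.keySet).filter(k => a.body.get(k) != b.body.get(k))
  statusDiff ++ fieldDiffs
}

val primary   = Response(200, Map("user" -> "alice", "ts" -> "1001"))
val candidate = Response(200, Map("user" -> "alice", "ts" -> "1002"))
println(diff(primary, candidate)) // only the "ts" field differs
```

No tests are written by hand here: the inputs come from sampled traffic, and the expected values come from the known-good version.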
What about the noise?
Server generated timestamps
Random number generators
Downstream non-determinism
Race conditions
13
Diffy Topology
[Diagram: Diffy multicasts sampled production traffic to candidate, primary, and secondary instances; raw differences, after non-deterministic noise is removed, yield filtered differences]
14
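The filtering step in the topology can be sketched as set subtraction (an illustrative sketch; the function and field names are hypothetical): primary and secondary run the same "known good" code, so any field that differs between them is non-deterministic noise and is excluded from the candidate-vs-primary differences.

```scala
// Illustrative sketch of noise filtering: fields that differ between two
// instances of the SAME code (primary vs. secondary) are noise; drop them
// from the candidate-vs-primary raw differences.
def filterNoise(candidateVsPrimary: Set[String], secondaryVsPrimary: Set[String]): Set[String] =
  candidateVsPrimary -- secondaryVsPrimary

val raw   = Set("ts", "user.name") // candidate vs. primary (raw differences)
val noise = Set("ts")              // secondary vs. primary (same code, still differs)
println(filterNoise(raw, noise))   // only "user.name" survives filtering
```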
15
Automation
Compare latest in master against last deploy to production
Automatically deploy master as candidate
Automatically deploy prod tag as primary and secondary
16
Automation (contd.)
Reporting
Diffy e-mails a report with highlighted critical endpoints and fields
Sample requests and responses are available for further analysis
17
18
Performance Regression
Why is it challenging?
Software: New release
Hardware performance: Uncontrolled parameter
Large variability across nodes makes robust analysis challenging
19
Performance Regression: Diffy Approach
Observation: All target service instances see identical load
Key Idea:
Discover all performance metrics (thousands of time series)
Compare reference instances to test instances
Report metrics with significant deviations
20
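The comparison step can be sketched per metric (an illustrative sketch: the metric values, the 20% threshold, and the deviates function are hypothetical, not Diffy's implementation): since reference and test instances see identical load, a large relative gap between their means is a candidate regression.

```scala
// Illustrative sketch: flag a metric when the test instances' mean deviates
// from the reference instances' mean by more than a relative threshold.
// The 20% threshold is an arbitrary choice for this example.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

def deviates(reference: Seq[Double], test: Seq[Double], threshold: Double = 0.2): Boolean =
  math.abs(mean(test) - mean(reference)) > threshold * math.abs(mean(reference))

val referenceLatency = Seq(100.0, 102.0, 98.0)  // ms, from reference instances
val testLatency      = Seq(131.0, 128.0, 135.0) // ms, from test instances
println(deviates(referenceLatency, testLatency)) // ~31% slower: flagged
```

In practice this naive rule is noisy, which is what the classifiers later in the deck address.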
Performance Regression (contd.)
Visual analysis: Error prone (false negatives)
21
Common Statistical Methods
Welch's t-Test: Two-sample test
H0: Means of the two populations are equal
22
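The statistic itself is straightforward to compute (a minimal sketch): t = (mean(a) - mean(b)) / sqrt(s_a^2/n_a + s_b^2/n_b), using unbiased sample variances, with no equal-variance assumption.

```scala
// Welch's two-sample t statistic: compares means without assuming equal variances.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

def sampleVariance(xs: Seq[Double]): Double = { // unbiased (n - 1 in the denominator)
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

def welchT(a: Seq[Double], b: Seq[Double]): Double =
  (mean(a) - mean(b)) / math.sqrt(sampleVariance(a) / a.size + sampleVariance(b) / b.size)

// Identical samples: the means are equal, so t is exactly 0.
println(welchT(Seq(1.0, 2.0, 3.0), Seq(1.0, 2.0, 3.0)))
```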
Common Statistical Methods (contd.)
F-Test: H0: Means of a set of populations are equal
Two groups: F = t^2, where t is Student's t statistic
Assumptions: Normally distributed populations [1], equal variance (homoscedastic), independent samples
[1] "Power Function of the F-Test Under Non-Normal Situations", by M. L. Tiku. In Journal of the American Statistical Association, Vol. 66, No. 336 (Dec., 1971), pp. 913-916.
23
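The two-group identity F = t^2 can be checked numerically (an illustrative sketch using the classical pooled-variance Student's t and the one-way ANOVA F; these are textbook formulas, not code from the deck):

```scala
// Numerical check of "two groups => F = t^2" for the pooled-variance
// Student's t statistic and the one-way ANOVA F statistic.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

def pooledT(a: Seq[Double], b: Seq[Double]): Double = {
  val (ma, mb) = (mean(a), mean(b))
  val ssw = a.map(x => (x - ma) * (x - ma)).sum + b.map(x => (x - mb) * (x - mb)).sum
  val sp2 = ssw / (a.size + b.size - 2) // pooled variance
  (ma - mb) / math.sqrt(sp2 * (1.0 / a.size + 1.0 / b.size))
}

def anovaF(groups: Seq[Seq[Double]]): Double = {
  val all = groups.flatten
  val grand = mean(all)
  val ssb = groups.map(g => g.size * math.pow(mean(g) - grand, 2)).sum // between-group
  val ssw = groups.map { g => val m = mean(g); g.map(x => (x - m) * (x - m)).sum }.sum // within-group
  (ssb / (groups.size - 1)) / (ssw / (all.size - groups.size))
}

val a = Seq(1.0, 2.0, 3.0)
val b = Seq(2.0, 4.0, 6.0)
val t = pooledT(a, b)
println(math.abs(anovaF(Seq(a, b)) - t * t)) // ~0: F equals t squared
```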
Other Previous Work
Similarity based: Match count, longest subsequence based
Clustering: k-Means, phased k-Means, EM, dynamic clustering, k-Medoids, single linkage clustering, PCA, SVM
24
Diffy-Performance Topology
[Diagram: Diffy sends sampled production traffic to a reference cluster and a test cluster; a classifier labels each metric PASSED, IGNORED, or FAILED]
25
Classifiers
Sample count: Minimum number of samples
Relative threshold: Variance within reference vs. distance between reference and test
Absolute threshold: Distance between reference and test vs. median of reference
26
Classifiers (contd.)
MAD: Median Absolute Deviation
A robust statistic
27
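MAD is simple to compute (a minimal sketch of the standard definition: the median of absolute deviations from the median). Its robustness is what makes it useful here: a single pathological node barely moves it, whereas it would inflate a mean or variance.

```scala
// Median Absolute Deviation: median(|x_i - median(x)|), a robust dispersion measure.
def median(xs: Seq[Double]): Double = {
  val s = xs.sorted
  val n = s.size
  if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
}

def mad(xs: Seq[Double]): Double = {
  val m = median(xs)
  median(xs.map(x => math.abs(x - m)))
}

// An extreme outlier barely moves the MAD, while it would blow up the variance.
println(mad(Seq(1.0, 2.0, 3.0, 4.0, 5.0)))   // 1.0
println(mad(Seq(1.0, 2.0, 3.0, 4.0, 500.0))) // still 1.0
```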
Classifiers (contd.)
Ensemble of Composable Classifiers
val classifier = {
  SampleCountClassifier(40) and (
    RelativeThresholdClassifier(50, 0.1) or
    AbsoluteThresholdClassifier(50, 0.1) or
    MadClassifier
  )
}
28
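One way to model such a composable ensemble (an illustrative sketch: the Classifier trait, its combinators, and the two concrete classifiers below are hypothetical simplifications, not Diffy's actual classes): a classifier is a predicate over reference and test samples, and `and` / `or` build compound verdicts.

```scala
// Illustrative model of composable classifiers (not Diffy's actual trait):
// a classifier judges (reference, test) samples; `and` / `or` compose verdicts.
trait Classifier { self =>
  def passes(ref: Seq[Double], test: Seq[Double]): Boolean
  def and(other: Classifier): Classifier = new Classifier {
    def passes(r: Seq[Double], t: Seq[Double]): Boolean = self.passes(r, t) && other.passes(r, t)
  }
  def or(other: Classifier): Classifier = new Classifier {
    def passes(r: Seq[Double], t: Seq[Double]): Boolean = self.passes(r, t) || other.passes(r, t)
  }
}

// Pass only if both sides have enough samples to judge at all.
case class SampleCountClassifier(min: Int) extends Classifier {
  def passes(ref: Seq[Double], test: Seq[Double]): Boolean = ref.size >= min && test.size >= min
}

// Pass if the test mean stays within `frac` of the reference mean (hypothetical).
case class RelativeMeanClassifier(frac: Double) extends Classifier {
  def passes(ref: Seq[Double], test: Seq[Double]): Boolean = {
    def mean(xs: Seq[Double]) = xs.sum / xs.size
    math.abs(mean(test) - mean(ref)) <= frac * math.abs(mean(ref))
  }
}

val classifier = SampleCountClassifier(3) and RelativeMeanClassifier(0.1)
println(classifier.passes(Seq(10.0, 10.0, 10.0), Seq(10.5, 10.2, 10.1))) // true
println(classifier.passes(Seq(10.0, 10.0, 10.0), Seq(20.0, 21.0, 19.0))) // false
```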
DEMO
29
Open Source (@diffyproject)
Github
https://github.com/twitter/diffy
Blog
https://blog.twitter.com/2015/diffy-testing-services-without-writing-tests
30
31