Microservices Tracing with Spring Cloud and Zipkin

Marcin Grzejszczak

Marcin Grzejszczak @mgrzejszczak, 11-13 May 2016

About me

Developer at Pivotal

Part of Spring Cloud Team

Working with OSS:

● Accurest - Consumer Driven Contracts verifier for Java

● JSON Assert - fluent JSON assertions

● Spock Subjects Collaborators Extension

● Gradle Test Profiler

● Up To Date Gradle Plugin

TWITTER: @MGrzejszczak

BLOG: http://TOOMUCHCODING.COM

Marcin Grzejszczak @mgrzejszczak, Kraków, 11-13 May 2016

Agenda

What is distributed tracing?

How to correlate logs with Spring Cloud Sleuth?

How to visualize latency with Spring Cloud Sleuth and Zipkin?

An ordinary system...

UI calls backend

UI -> BACKEND

Everything is awesome

CLICK 200

Until it’s not

CLICK 500

Time to debug

https://tonysbologna.files.wordpress.com/2015/09/mario-and-luigi.jpg?w=468&h=578&crop=1

It doesn’t look like this

More like this

On which server / instance was the exception thrown?

SSH and grep for ERROR to find it?

Distributed tracing - terminology

Span

Trace

Logs (annotations)

Tags (binary annotations)

Span

The basic unit of work (e.g. sending an RPC)

● Spans are started and stopped

● They keep track of their timing information

● Once you create a span, you must stop it at some point in the future

● Has a parent and can have multiple children
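
The span lifecycle described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `SimpleSpan` class (it is not the Sleuth or Zipkin span type): a span records its start time, must be stopped exactly once, and knows its parent.

```java
import java.util.Objects;

/** Hypothetical, minimal span used only for illustration. */
final class SimpleSpan {
    private final String traceId;
    private final String spanId;
    private final String parentId;   // null for the root span of a trace
    private final long startNanos;
    private long stopNanos;
    private boolean stopped;

    private SimpleSpan(String traceId, String spanId, String parentId) {
        this.traceId = Objects.requireNonNull(traceId);
        this.spanId = Objects.requireNonNull(spanId);
        this.parentId = parentId;
        this.startNanos = System.nanoTime();   // timing starts when the span starts
    }

    /** Starts a root span, which begins a new trace. */
    static SimpleSpan startRoot(String traceId, String spanId) {
        return new SimpleSpan(traceId, spanId, null);
    }

    /** Starts a child span in the same trace; one span may have many children. */
    SimpleSpan startChild(String childSpanId) {
        return new SimpleSpan(traceId, childSpanId, spanId);
    }

    /** Stops the span; a started span has to be stopped exactly once. */
    void stop() {
        if (stopped) {
            throw new IllegalStateException("span " + spanId + " is already stopped");
        }
        stopNanos = System.nanoTime();
        stopped = true;
    }

    boolean isRunning() { return !stopped; }

    /** Elapsed time between start and stop, in nanoseconds. */
    long durationNanos() {
        if (!stopped) {
            throw new IllegalStateException("span " + spanId + " is still running");
        }
        return stopNanos - startNanos;
    }

    String traceId() { return traceId; }
    String spanId() { return spanId; }
    String parentId() { return parentId; }
}
```

Usage: `SimpleSpan.startRoot("X", "A")` starts a trace, `root.startChild("B")` opens a child span, and each of them is closed with `stop()`.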

Trace

A set of spans forming a tree-like structure.

● For example, if you are running a book store then

○ A trace could be retrieving a list of available books

○ Assuming that to retrieve the books you have to send 3 requests to 3 services, you could have at least 3 spans (1 for each hop) forming 1 trace

Diagram: trace and span IDs as a request passes through four services (Trace Id = X on every span)

● The REQUEST reaching SERVICE 1 has no Trace Id and no Span Id; SERVICE 1 handles it in Span Id = A and sends the RESPONSE

● SERVICE 1 -> SERVICE 2: Span Id = B (Client Sent, Server Received, Server Sent, Client Received); SERVICE 2 does its own work in Span Id = C

● SERVICE 2 -> SERVICE 3: Span Id = D (Client Sent, Server Received, Server Sent, Client Received); SERVICE 3 does its own work in Span Id = E

● SERVICE 2 -> SERVICE 4: Span Id = F (Client Sent, Server Received, Server Sent, Client Received); SERVICE 4 does its own work in Span Id = G

Parent of each span:

● Span Id = A, Parent Id = null

● Span Id = B, Parent Id = A

● Span Id = C, Parent Id = B

● Span Id = D, Parent Id = C

● Span Id = E, Parent Id = D

● Span Id = F, Parent Id = C

● Span Id = G, Parent Id = F
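
The span/parent pairs above are all that is needed to rebuild the tree of a trace. A minimal sketch, assuming a hypothetical `TraceTree` helper (not part of Sleuth or Zipkin) that groups spans by parent and prints them depth-first:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that rebuilds a trace tree from span id -> parent id pairs. */
final class TraceTree {
    private TraceTree() {}

    /** The span/parent pairs from the slide; the root span A has no parent. */
    static Map<String, String> slideExample() {
        Map<String, String> parentOf = new LinkedHashMap<>();
        parentOf.put("A", null);
        parentOf.put("B", "A");
        parentOf.put("C", "B");
        parentOf.put("D", "C");
        parentOf.put("E", "D");
        parentOf.put("F", "C");
        parentOf.put("G", "F");
        return parentOf;
    }

    /** Maps each parent span id to its direct children, keeping input order. */
    static Map<String, List<String>> childrenByParent(Map<String, String> parentOf) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : parentOf.entrySet()) {
            if (entry.getValue() != null) {
                children.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                        .add(entry.getKey());
            }
        }
        return children;
    }

    /** Renders the subtree under spanId, indenting each level by two spaces. */
    static String render(String spanId, Map<String, List<String>> children, int depth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(spanId).append('\n');
        List<String> kids = children.getOrDefault(spanId, Collections.<String>emptyList());
        for (String kid : kids) {
            out.append(render(kid, children, depth + 1));
        }
        return out.toString();
    }
}
```

Calling `TraceTree.render("A", TraceTree.childrenByParent(TraceTree.slideExample()), 0)` shows that span C has two children (D and F), which is why a trace is a tree and not a simple chain.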

Is it that simple?

How do you pass tracing information (incl. Trace ID) between:

● different libraries?

● thread pools?

● asynchronous communication?

● …?
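
Thread pools are a good example of why this is not trivial: a value kept in a `ThreadLocal` is invisible to the worker thread that runs a submitted task. A minimal sketch of one common technique, assuming a hypothetical `TraceContext` holder (this is not Sleuth's API): capture the trace ID when the task is created and restore it around the task on the worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical holder for the current trace id. */
final class TraceContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private TraceContext() {}

    static String currentTraceId() { return CURRENT.get(); }
    static void setTraceId(String traceId) { CURRENT.set(traceId); }
    static void clear() { CURRENT.remove(); }

    /** Captures the caller's trace id now and restores it around the task later. */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = CURRENT.get();
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                return task.call();
            } finally {
                // Put back what the pooled thread had, so ids do not leak between tasks.
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }
}

final class TraceContextDemo {
    private TraceContextDemo() {}

    /** Returns the trace id seen by an unwrapped task and by a wrapped task. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TraceContext.setTraceId("X");
            Callable<String> readId = TraceContext::currentTraceId;
            String unwrapped = pool.submit(readId).get();
            String wrapped = pool.submit(TraceContext.wrap(readId)).get();
            return new String[] {unwrapped, wrapped};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            TraceContext.clear();
            pool.shutdown();
        }
    }
}
```

In `TraceContextDemo.run()` the unwrapped task sees no trace ID on the worker thread, while the wrapped one sees `X`. The same capture-and-restore idea has to be repeated for every library and async boundary, which is the tedious part a tracing library can take over.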

Log correlation with Spring Cloud Sleuth

We take care of passing tracing information between threads / libraries / contexts for:

● Hystrix

● RxJava

● RestTemplate

● Feign

● Messaging with Spring Integration

● Zuul

● ...

If you don’t do anything unexpected there’s nothing you need to do to make Sleuth work. Check the docs for more info.

Now let’s aggregate the logs!

Instead of SSHing to the machines, aggregate the logs!

● With Cloud Foundry’s (CF) Loggregator the logs from different instances are streamed into a single place

● You can harvest your logs with Logstash Forwarder / FileBeat

● You can use ELK stack to stream and visualize the logs
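
Once the logs from all instances sit in one place, the lines that belong to a single request can be picked out by trace ID. A minimal sketch, assuming each log line carries a bracketed `[appName,traceId,spanId,exportFlag]` group (an assumed layout for illustration; check the Sleuth documentation for the actual pattern) and a hypothetical `LogCorrelation` helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that selects the log lines of one trace. */
final class LogCorrelation {
    // Assumed layout of the tracing part of a line: [appName,traceId,spanId,exportFlag]
    private static final Pattern CONTEXT =
            Pattern.compile("\\[([^,\\]]*),([^,\\]]*),([^,\\]]*),([^,\\]]*)\\]");

    private LogCorrelation() {}

    /** Returns, in input order, every line whose trace id equals the given one. */
    static List<String> linesForTrace(List<String> lines, String traceId) {
        List<String> matches = new ArrayList<>();
        for (String line : lines) {
            Matcher m = CONTEXT.matcher(line);
            if (m.find() && m.group(2).equals(traceId)) {
                matches.add(line);
            }
        }
        return matches;
    }
}
```

In practice this filtering is a query in the log aggregation tool (for example a search on the trace ID field), not hand-written code; the sketch only shows what the correlation amounts to.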

Spring Cloud Sleuth with Maven

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-dependencies</artifactId>
            <version>Brixton.RELEASE</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- the version is managed by the BOM imported above -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>
</dependencies>

Spring Cloud Sleuth with Gradle

dependencies {
    compile "org.springframework.cloud:spring-cloud-starter-sleuth"
}

dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-dependencies:Brixton.RELEASE"
    }
}

Log correlation with Spring Cloud Sleuth

DEMO

Great! We’ve found the exception!

But meanwhile...

The system is slow...

CLICK 200

One of the services is slow?

Which one? How to measure that?

Let’s log events!

● Client Sent (CS) - The client has made a request

● Server Received (SR) - The server side got the request and will start processing it

● Server Sent (SS) - Annotated upon completion of request processing

● Client Received (CR) - The client has successfully received the response from the server side

Timeline: CS 0 ms, SR 100 ms, SS 200 ms, CR 300 ms

Conclusions

● The request started at T=0 ms

● It took 300 ms for the client to receive a response

● Server side received the request at T=100 ms

● The request got processed on the server side in 100 ms

● Why is there a delay between sending and receiving messages?

https://blogs.oracle.com/jag/resource/Fallacies.html

Logs

Represents an event in time associated with a span

● Every span has zero or more logs

● Each log is a timestamped event name

● Event should be the stable name of some notable moment in the lifetime of a span

● For instance, a span representing a browser page load might add an event for each of the Performance.timing moments (check https://developer.mozilla.org/en-US/docs/Web/API/PerformanceTiming)

Main logs

● Client Sent (CS)

○ The client has made a request - the span was started

● Server Received (SR)

○ The server side got the request and will start processing it

○ SR timestamp - CS timestamp = NETWORK LATENCY

● Server Sent (SS)

○ Annotated upon completion of request processing

○ SS timestamp - SR timestamp = SERVER SIDE PROCESSING TIME

● Client Received (CR)

○ The client has successfully received the response from the server side

○ CR timestamp - CS timestamp = TIME NEEDED TO RECEIVE RESPONSE

○ CR timestamp - SS timestamp = NETWORK LATENCY
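
The four timestamps turn into latencies with plain subtraction. A minimal sketch, assuming a hypothetical `SpanTimings` class (not part of Sleuth or Zipkin) and the 0 / 100 / 200 / 300 ms timeline from the earlier slide; the return-path latency is taken as CR minus SS so that it comes out non-negative:

```java
/** Hypothetical holder for the four main log timestamps of one span, in milliseconds. */
final class SpanTimings {
    private final long clientSent;
    private final long serverReceived;
    private final long serverSent;
    private final long clientReceived;

    SpanTimings(long clientSent, long serverReceived, long serverSent, long clientReceived) {
        this.clientSent = clientSent;
        this.serverReceived = serverReceived;
        this.serverSent = serverSent;
        this.clientReceived = clientReceived;
    }

    /** SR - CS: time the request spent on the network. */
    long requestNetworkLatency() { return serverReceived - clientSent; }

    /** SS - SR: time the server spent processing the request. */
    long serverProcessingTime() { return serverSent - serverReceived; }

    /** CR - SS: time the response spent on the network. */
    long responseNetworkLatency() { return clientReceived - serverSent; }

    /** CR - CS: total time the client waited for the response. */
    long totalClientTime() { return clientReceived - clientSent; }

    /** The timeline from the slides: CS 0 ms, SR 100 ms, SS 200 ms, CR 300 ms. */
    static SpanTimings slideExample() {
        return new SpanTimings(0, 100, 200, 300);
    }
}
```

For `SpanTimings.slideExample()` the client waits 300 ms in total, of which only 100 ms is server-side processing; the remaining 200 ms is spent on the network. Note that CS/CR and SR/SS are recorded on different machines, so the two network figures are only as accurate as the clock synchronization between the hosts.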

Tag

Key-value pair

● Every span may also have zero or more key/value Tags

● They do not have timestamps and simply annotate the spans.

● Examples of default tags in Sleuth

○ message/payload-size

○ http.method

○ commandKey for Hystrix

How to visualise latency in a distributed system?

The answer is: Zipkin

● Zipkin is a distributed tracing system

● It runs as a separate process (you can run it as a Spring Boot application)

● It helps gather timing data needed to troubleshoot latency problems in microservice architectures

● The front end is a "waterfall" style graph of service calls showing call durations as horizontal bars

How does Zipkin work?

Diagram: each APP sends its spans to the collectors, the collectors store them in a DB, and the UI queries for trace info via the API
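
The data flow in the diagram can be mimicked in memory to make the roles concrete. A minimal sketch, assuming a hypothetical `InMemoryTracePipeline` class in which a map stands in for the database; none of these names come from Zipkin itself:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the collector -> storage -> query flow. */
final class InMemoryTracePipeline {

    /** A finished span as an application would report it. */
    static final class ReportedSpan {
        final String traceId;
        final String spanId;
        final String serviceName;
        final long durationMs;

        ReportedSpan(String traceId, String spanId, String serviceName, long durationMs) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.serviceName = serviceName;
            this.durationMs = durationMs;
        }
    }

    // Stands in for the database: spans grouped by trace id.
    private final Map<String, List<ReportedSpan>> storage = new HashMap<>();

    /** Collector side: applications send finished spans here and they get stored. */
    void collect(ReportedSpan span) {
        storage.computeIfAbsent(span.traceId, k -> new ArrayList<>()).add(span);
    }

    /** Query side: the UI asks for all spans of one trace. */
    List<ReportedSpan> findTrace(String traceId) {
        List<ReportedSpan> spans =
                storage.getOrDefault(traceId, Collections.<ReportedSpan>emptyList());
        return Collections.unmodifiableList(spans);
    }
}
```

The point of the split is that applications only ever write finished spans, while the UI only ever reads whole traces; the collector and the storage sit between the two.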

Spring Cloud Sleuth and Zipkin integration

● We take care of passing tracing information between threads / libraries / contexts

● Upon closing of a Span we will send it to Zipkin

○ either via HTTP (spring-cloud-sleuth-zipkin)

○ or via Spring Cloud Stream (spring-cloud-sleuth-stream)

● You can run the Zipkin Spring Cloud Stream Collector as a Spring Boot app (spring-cloud-sleuth-zipkin-stream)

○ you can add the Zipkin UI dependency to it as well!

Spring Cloud Sleuth Zipkin with Maven

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-dependencies</artifactId>
            <version>Brixton.RELEASE</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- the version is managed by the BOM imported above -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-zipkin</artifactId>
    </dependency>
</dependencies>

Spring Cloud Sleuth Zipkin with Gradle

dependencies {
    compile "org.springframework.cloud:spring-cloud-starter-zipkin"
}

dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-dependencies:Brixton.RELEASE"
    }
}

Diagram: trace and span IDs across the four demo services and their endpoints (Trace Id = X on every span)

● The REQUEST reaching SERVICE 1 (/start) has no Trace Id and no Span Id; SERVICE 1 handles it in Span Id = A and sends the RESPONSE

● SERVICE 1 -> SERVICE 2 (/foo): Span Id = B (Client Sent, Server Received, Server Sent, Client Received); SERVICE 2 does its own work in Span Id = C

● SERVICE 2 -> SERVICE 3 (/bar): Span Id = D (Client Sent, Server Received, Server Sent, Client Received); SERVICE 3 does its own work in Span Id = E

● SERVICE 2 -> SERVICE 4 (/baz): Span Id = F (Client Sent, Server Received, Server Sent, Client Received); SERVICE 4 does its own work in Span Id = G

DEMO

Zipkin for Brewery

● A test app for Spring Cloud end to end tests

● Source code: https://github.com/spring-cloud-samples/brewery

● Around 10 applications involved

Summary

● Log correlation allows you to match logs for a given trace

● Distributed tracing allows you to quickly see latency issues in your system

● Zipkin is a great tool to visualize the latency graph and system dependencies

● Spring Cloud Sleuth integrates with Zipkin and grants you log correlation

THANK YOU

● https://github.com/marcingrzejszczak/vagrant-elk-box/tree/presentation - code for this presentation (clone and run getReadyForConference.sh - NOTE: you need Vagrant!)

● https://github.com/spring-cloud/spring-cloud-sleuth - Spring Cloud Sleuth repository

● http://cloud.spring.io/spring-cloud-sleuth/spring-cloud-sleuth.html - Sleuth’s documentation

● http://toomuchcoding.com/blog/2016/03/25/spring-cloud-sleuth-rc1-deployed/ - article about RC1 release

● https://github.com/openzipkin/zipkin-java - Repo with Spring Boot Zipkin server

● http://docssleuth-service1.cfapps.io/start - The service1 app from this presentation deployed to Pivotal Cloud Foundry - point of entry to the app

● http://docssleuth-zipkin-server.cfapps.io/ - Zipkin deployed to Pivotal Cloud Foundry

● http://brewery-zipkin-web.cfapps.io - Zipkin deployed to PCF for Brewery Sample app
