Microservices at Mercari

Microservices at Mercari: Current status and challenges

Taichi Nakashima (@deeeet/@tcnksm)

SRE at Mercari, automation obsessed, gopher

SRE mission at Mercari

● Ensure a reliable service that is enjoyable to use at any time
● Take care of all engineering apart from new service development
○ Performance improvement, automation, security, etc.

Current Mercari architecture

[Diagram: nginx, HTTP, API ×3, MySQL ×2, solr ×3, Cache]

Simple 3-tier + α architecture

Single code base

Current Mercari architecture

Same architecture in 3 regions: JP, US, UK

Positive

● A central ops team (SRE) can handle operations efficiently

Challenges

[Diagram repeated: the simple 3-tier + α architecture with a single code base]

Monolith?

● Code is too large and complex to understand
● Team is too large to work efficiently on a shared code base
● Communication overhead is too high
● Velocity (development cycle) has stalled...

Microservices

Microservices?

● Architectural and organizational approach to software development
○ Speed up deployment cycles
○ Foster innovation and ownership
○ Improve maintainability and scalability

Microservices?

$ cat inside.txt | cut -f 1 -d ' ' | sort | uniq -c | sort -nr

Microservices

● Do one thing well
○ Unix philosophy
○ One function in one service, not multiple functions in one service
● Decentralized governance
○ Each team has ownership of its service
● Independent
○ Each service can be changed, upgraded, or replaced independently
● Polyglot
○ Right framework and tool for each domain

Goal

● Software Engineer
○ Keep velocity from stalling and make feature-improvement iterations faster
○ -> Provide great features to customers faster
● SRE
○ Provide an automated platform for microservices
○ Hand over some responsibility (e.g., deployment, debugging) to software engineers
○ -> Focus more on SRE-related software engineering tasks

Team

@deeeet @spensnova @babarot

State of microservices in US

Microservices architecture in US

[Diagram, built up across several slides: the Mercari API behind HTTP; then a Gateway API in front of the Mercari API; then the offer, search, and personalization services added one at a time. Connections are labeled HTTP and gRPC.]

Technical stacks

● Docker
● Kubernetes (Google Container Engine)
● gRPC

Container

● Resource isolation
● Resource limitation
● Fast boot (vs. VM)

Docker

● Easy to build container images
● Easy to distribute via a registry

Why Docker?

● Software engineers get more control
○ They can include what they want (e.g., runtime, libraries)
● Environmental parity
○ What works in local development (or the QA env) is exactly the same as what runs in production (easy to debug)
○ No more “it works on my environment but not in production!”
● Easy to deploy
○ Docker image ≒ single statically linked binary
○ You already know the benefit if you use Go

Kubernetes (GKE)

● Container orchestration
● Derives from Google’s internal systems, Borg and Omega
● Inspired and informed by Google’s experiences and internal systems

Why kubernetes?

● Best way to maximize the benefit of containers
○ Resource isolation/limitation lets us improve compute resource utilization. But how?
■ K8s schedules containers onto appropriate instances
○ How do dynamically scheduled containers communicate with each other?
■ K8s provides service discovery
● Reduce operation costs
○ Self-healing & auto scaling
● Infrastructure of infrastructure
○ Industry standard: https://githubengineering.com/kubernetes-at-github
○ More tools/software will be built on top of k8s in the future (I guess)

gRPC

● gRPC Remote Procedure Call
● High-performance, general-purpose, open source, standards-based RPC framework
● Open source version of Stubby, the RPC system used inside Google

gRPC

● Simple service definition
○ By default, gRPC uses protocol buffers as the Interface Definition Language (IDL) for describing both the service interface and the structure of the payload messages.
● Works across languages and platforms
○ Write a Go server and a Python client
○ Makes polyglot microservices practical

Why not REST?

● Who can implement REST correctly?
○ High design cost (Paths? Parameters? Huh?)
○ Eventually it’s just HTTP endpoints
● No more hand-written HTTP client implementations...

Challenges

● Deployment
● Observability

Deployment

● Deployment is key in a microservices platform
○ “Keep velocity from stalling and make iteration faster”
● We need an easy and safe automated deployment system
○ We started with chatbot-style deployment, but it did not scale

Spinnaker

● Continuous Delivery platform
● Developed at Netflix
○ Worked with Google and open sourced in 2015
● Supports multiple clouds
○ Kubernetes!, GCE, AWS

Spinnaker GUI

Spinnaker pipeline

Why Spinnaker?

● Kubernetes support
● Built-in deployment best practices from Netflix and Google
○ Immutable infrastructure
○ Blue/Green deployment, Canary deployment
○ Manual judgement (by manager) phase
○ Run integration tests
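To make the canary idea concrete, here is a minimal sketch of a weighted traffic split: a small share of requests goes to the new version while the rest stays on the stable one. It only illustrates the concept and is not how Spinnaker or Kubernetes implement it; `pick_backend`, `simulate`, and the 10% weight in the usage note are hypothetical names and values chosen for the example.

```python
import random


def pick_backend(canary_weight: float, rng: random.Random) -> str:
    """Return "canary" for roughly `canary_weight` of the calls, else "stable"."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    return "canary" if rng.random() < canary_weight else "stable"


def simulate(requests: int, canary_weight: float, seed: int = 0) -> dict:
    """Count how many of `requests` simulated requests land on each version."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    counts = {"stable": 0, "canary": 0}
    for _ in range(requests):
        counts[pick_backend(canary_weight, rng)] += 1
    return counts
```

Calling `simulate(10000, 0.1)` sends roughly one request in ten to the canary; if the canary misbehaves, only that share of traffic is affected before a rollback.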

Spinnaker in Mercari

● Currently used only for container deployment to Kubernetes
● Each team uses Spinnaker to deploy its own services
● One Spinnaker handles all microservices in all regions

Example pipeline of API gateway deployment (Canary)

One spinnaker cluster manages Mercari global deployment

JP, US, UK

Future of spinnaker

● Pipeline as code
○ https://github.com/spinnaker/dcd-spec
● Automated canary analysis

Automated canary analysis

https://blog.spinnaker.io/can-i-push-that-building-safer-low-risk-deployments-with-spinnaker-a27290847ac4

Observability

Observability (logging, metrics & tracing) is important

● Each team needs to debug its services by itself, without SSH
● It’s harder and more complex than with a monolith

Stackdriver logging

Request ID in log

● Which service caused problem in one request?

[Diagram: the US architecture again (Gateway API, Mercari API, search, personalization, offer; connections labeled HTTP and gRPC)]

① Generate a unique ID
② Annotate logs with the ID throughout the same request

The ID is carried in the HTTP header / gRPC metadata
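The two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Mercari's implementation: the header name `X-Request-Id`, the functions `gateway` and `offer_service`, and the use of a plain dict for headers are all hypothetical. Over gRPC, the same value would travel as a metadata entry instead of an HTTP header.

```python
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-Id"  # hypothetical header name


def ensure_request_id(headers: dict) -> dict:
    """Return a copy of `headers` carrying a request ID, generating one if absent."""
    out = dict(headers)
    if not out.get(REQUEST_ID_HEADER):
        out[REQUEST_ID_HEADER] = uuid.uuid4().hex  # step 1: generate unique ID
    return out


def request_logger(service: str, headers: dict) -> logging.LoggerAdapter:
    """Logger whose records carry a `request_id` attribute (step 2)."""
    logger = logging.getLogger(service)
    return logging.LoggerAdapter(logger, {"request_id": headers[REQUEST_ID_HEADER]})


def offer_service(headers: dict) -> dict:
    """Downstream service: reuses the ID it received instead of creating a new one."""
    headers = ensure_request_id(headers)
    request_logger("offer", headers).info("handling offer")
    return headers


def gateway(incoming_headers: dict) -> dict:
    """Entry point: assigns the ID once and forwards it with the request."""
    headers = ensure_request_id(incoming_headers)
    request_logger("gateway", headers).info("received request")
    return offer_service(headers)
```

With a log format such as `"%(name)s %(request_id)s %(message)s"`, every line written for one request shares the same ID, so a single search returns the gateway's log and each service's log together.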

[Screenshot: searching by request ID shows the log from the gateway and the log from service X]

Distributed tracing

● Which service makes the request slow?

Stackdriver tracing
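As a rough illustration of what tracing answers, the sketch below records one timed span per service call and reports the slowest one. It is a toy model, not the Stackdriver Trace API: the `Trace` class, its `span` and `slowest` methods, and the injectable `clock` are hypothetical, and real tracers also record parent/child relationships and propagate a trace ID across services.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (service, duration) pairs for a single request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the example can be tested deterministically
        self.spans = []      # list of (service name, duration in seconds)

    @contextmanager
    def span(self, service: str):
        """Time the body of the `with` block and record it under `service`."""
        start = self._clock()
        try:
            yield
        finally:
            self.spans.append((service, self._clock() - start))

    def slowest(self) -> str:
        """Name of the service whose span took the longest."""
        return max(self.spans, key=lambda s: s[1])[0]
```

Wrapping each downstream call in `with trace.span("search"): ...` and then calling `trace.slowest()` points at the service that dominates the request latency.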

Metrics

Selection of a metrics service/software is still under discussion and trial

● First-class support for containers and Kubernetes
● Integration with the Kubernetes ecosystem
○ Spinnaker, Istio and so on
● Service dependency visualization

Prometheus + grafana

Datadog

Instana

State of microservices in JP

JP has just started

● Some services (machine learning products) have started to be containerized and deployed on GKE

● On-going discussion about the best architecture

Conclusion

● Why we started microservices
● Current state of US microservices and challenges

We’re hiring

● Who loves automation
● Technical keywords
○ Docker
○ Kubernetes
○ gRPC
○ Golang
○ Container monitoring

Spinnaker is deployed on GKE

Testing

Testing in microservice is hard?

● Focus on unit tests as usual
○ Because each service is supposed to be independent
○ Each microservice must measure test coverage
● Integration tests?
○ Use mocks instead of working hard to prepare a local env
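A minimal sketch of the "use mocks" advice, using Python's standard `unittest.mock`. `SearchClient` and `top_item_names` are hypothetical stand-ins for a client of a remote service and the logic that depends on it; the point is that the test runs without any other service being up.

```python
from unittest import mock


class SearchClient:
    """Stand-in for a client of a remote search service (hypothetical)."""

    def search(self, query):
        raise RuntimeError("would call the real service over the network")


def top_item_names(client, query, limit=3):
    """Logic under test: keep the names of the first `limit` results."""
    results = client.search(query)
    return [item["name"] for item in results[:limit]]


def test_top_item_names_uses_mocked_search():
    # The autospec'd mock has the same method signatures as SearchClient,
    # so a call that the real client would reject also fails here.
    client = mock.create_autospec(SearchClient, instance=True)
    client.search.return_value = [{"name": "bag"}, {"name": "shoes"}]

    assert top_item_names(client, "fashion") == ["bag", "shoes"]
    client.search.assert_called_once_with("fashion")
```

The trade-off is that a mock only encodes what the caller believes the other service returns, so the shared interface definition still has to be kept honest by other means.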

Testing pyramid

[Figure: testing pyramid from the Google Testing Blog post “Just Say No to More End-to-End Tests”, annotated with “Do this a lot!” and “Do mock”]

QA environment

How to test a feature under development from a QA device?

● Pull request (PR) based pod creation

PR based pod creation

[Diagram: a proxy in front of API gateway (master), API gateway (PR 313), and API gateway (PR 314). Labels: “Set PR number”, “Proxy by PR number”, “Container is deployed via CI”.]

[Screenshot: PR based Docker container (QA env). Easy to switch.]

PR based pod creation

[Diagram: a proxy in front of API gateway (master), API gateway (PR 313), API gateway (PR 314), Service A (master), and Service A (PR 21). Labels: “Set PR number”, “Proxy by PR number”, “Container is deployed via CI”.]
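The routing decision made by the proxy can be sketched as a small lookup. This is an assumption-laden illustration, not the actual proxy: the function `resolve_backend`, the backend naming scheme, and the fallback to master when no build exists for the requested PR are all hypothetical choices made for the example.

```python
from typing import Optional

MASTER = "master"


def resolve_backend(service: str, pr_number: Optional[str], deployed: dict) -> str:
    """Pick the backend for `service`, preferring the PR build when one is deployed.

    `deployed` maps a service name to the set of PR numbers that CI has deployed.
    Falling back to master for unknown PR numbers is an assumption of this sketch.
    """
    builds = deployed.get(service, set())
    if pr_number and pr_number in builds:
        return f"{service}-pr-{pr_number}"
    return f"{service}-{MASTER}"
```

For example, with `deployed = {"api-gateway": {"313", "314"}, "service-a": {"21"}}`, a QA device that sets PR number 313 reaches `api-gateway-pr-313`, while `service-a`, which has no build for PR 313, is served from `service-a-master`.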

Future works

Service mesh

Don’t trust each other!

● Traffic management
○ API rate limit, circuit breaker
● Policy enforcement
○ Ensure access policies (which service can access which service?)

We should achieve the above without modifying client/server code!
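For readers unfamiliar with the circuit breaker pattern, here is a minimal in-process sketch of its behavior. In a service mesh this logic lives in the proxy next to each service rather than in application code, which is exactly the point of the slide; the class below only shows what the pattern does. `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            # One more failure re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast instead of piling more requests onto a dependency that is already struggling.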

Service mesh (Istio)

https://istio.io/

Chaos engineering

● The real world is hard …
○ Machines crash, networks are unstable (especially in distributed systems)
● A dependent service can fail at any time

Chaos engineering

● Services must be fault tolerant whenever something goes wrong
● Emulate real-world problems
○ We need to identify weaknesses
■ Improper fallback settings when a service is unavailable
○ Software engineers should be aware
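A minimal sketch of the idea: inject failures into a dependency on purpose and check that the caller's fallback behaves sensibly. This is not how Chaos Monkey works (it terminates instances); it is an in-process illustration, and `flaky`, `recommendations`, `recommendations_with_fallback`, and the empty-list fallback are hypothetical.

```python
import random


class ServiceUnavailable(Exception):
    """Signals that a dependent service could not be reached."""


def flaky(func, failure_rate: float, rng: random.Random):
    """Wrap `func` so that it fails with probability `failure_rate` (fault injection)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ServiceUnavailable("injected failure")
        return func(*args, **kwargs)
    return wrapper


def recommendations(user_id: str) -> list:
    """Stand-in for a call to a dependent service."""
    return [f"item-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id: str) -> list:
    """Caller under test: degrade to an empty list when the dependency is down."""
    try:
        return fetch(user_id)
    except ServiceUnavailable:
        return []
```

Running the caller against `flaky(recommendations, 1.0, random.Random(0))` exercises the fallback path every time, which is where improper fallback settings tend to surface.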

Chaos engineering (Chaos monkey)

https://github.com/Netflix/chaosmonkey
