Mercari’s Never Ending Improvements @siroken3 / KENICHI Sasaki SRE Team @ Mercari, Inc. 2016/02

dots. Conference Spring 2016: Technologies Supporting Large-Scale Web Services (mercari)



Mercari - Your Friendly Mobile Marketplace

https://www.mercari.com/

Self Introduction

• Joined Mercari in July 2014

• SRE (Site Reliability Engineer)

• Role

• Development Productivity

What is SRE?

• Site Reliability Engineer

• A role introduced at Google: “Software Engineers responsible for ensuring that all of Google’s services are super reliable and super fast, all of the time.”

• Mercari SRE team members are responsible for

• Availability

• Performance

• Building and operating the log analytics platform

• Server provisioning, deployment

• Security

• Building the development environment

JP Growth

https://pixabay.com/photo-918965/

Download Numbers

(Chart, July 2014 to Feb 2016, axis 0M to 30M: 4M -> 24M, +500%)

Req/Sec (HTTPS: Peak)

(Chart, July 2014 to Feb 2016, axis 0 to 20000: 3K -> 20K, +560%)
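The growth figures on the two charts above follow from the start and end values. A minimal sketch of that arithmetic, assuming the rounded chart labels (4M, 24M, 3K, 20K) as inputs; the function name is hypothetical. With these rounded labels the peak request rate works out to roughly +567%, so the slide's +560% presumably comes from unrounded figures.

```python
def growth_percent(start: float, end: float) -> float:
    """Return the percentage increase from start to end."""
    return (end - start) / start * 100.0


# Rounded chart labels from the slides (assumed inputs, not exact figures).
downloads = growth_percent(4_000_000, 24_000_000)
peak_rps = growth_percent(3_000, 20_000)

print(f"downloads: +{downloads:.0f}%")
print(f"peak req/sec: +{peak_rps:.0f}%")
```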

Servers (APP)

(Chart, July 2014 to Feb 2016: +50%)

Servers (DB)

(Chart, July 2014 to Feb 2016: +9)

SRE Members

(Chart, July 2014 to Feb 2016)

Infrastructure Overview

• JP

• SAKURA Internet Ishikari DC dedicated server + cloud

• US

• AWS Oregon

• Shared

• Akamai

• Amazon Route53, S3, CloudFront

• Google BigQuery

Infrastructure 2014

(Diagram components: internet, app, mail, nat, DB, Redis, batch, Q4M, Worker, search; network segments: global, private)

Infrastructure 2016

(Diagram components: internet, lb, lb_pascal, lb_push, lb_search, lb_general, nat, app, push, search, DB, memcached, batch, Q4M, Worker, log, deploybase, monitor, dns, logview, cep; network segments: global, private)

Software (2016)

• nginx

• PHP 5.6

• Apache + mod_php

• Go

• Node

• MySQL

• Q4M

• memcached

• Solr

• Gaurun

• fluentd

• Norikra

• Kibana

• Zabbix

• kurado

• etc.

Improve? or Crisis!

• Continuous Increase in Access

• Continuous Increase in Data Volume

• Growth of Specifications

• Unstable Deployment

Continuous Increase in Access

https://pixabay.com/en/traffic-rush-hour-rush-hour-urban-843309/

Continuous Increase in Access: Problem

• Lack of CPU Resources

• Slower response time

• Lack of network bandwidth

• Network congestion

Improvement: Introduce dedicated servers

• BEFORE

• SAKURA Cloud

• AFTER (example)

• CPU: Xeon 6-core x 2

• Mem: 32GB

• DISK: 240GB SSD

Improvement: Introduce lb based on nginx

• BEFORE

• All httpd servers faced the internet directly

• DNS Round Robin

• AFTER

• nginx!

• Reverse Proxy, TLS, SPDY Terminator

(Diagram: internet, DNS RR, lb x 4)


Improvement: Continuous application tuning

• MySQL index tuning

• (Ex.) Large 2-dimensional array -> convert the 2nd tier to text data and parse it

Of course, there is no silver bullet.
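The array example above can be sketched as follows. This is an illustration only, written in Python rather than the PHP the service uses: the keys, the JSON text format, and the parse-on-first-access cache are all assumptions, since the slide only says that the second tier is converted to text data and parsed.

```python
import json
from typing import Any, Dict

# Hypothetical master data: the outer level stays a map, while each
# inner (2nd tier) record is kept as text instead of a nested structure.
RAW_MASTER: Dict[str, str] = {
    "category:1": '{"name": "Fashion", "parent": 0}',
    "category:2": '{"name": "Books", "parent": 0}',
}


class LazyMasterData:
    """Parse an inner record only when it is first requested."""

    def __init__(self, raw: Dict[str, str]) -> None:
        self._raw = raw
        self._parsed: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        if key not in self._parsed:
            # Pay the parsing cost for this one record only.
            self._parsed[key] = json.loads(self._raw[key])
        return self._parsed[key]


master = LazyMasterData(RAW_MASTER)
record = master.get("category:1")
print(record["name"])
```

The idea is that loading a flat map of strings is cheaper than building every nested record up front, at the price of a small parse cost on access.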

Improvement: Continuous application tuning

require_once('master_data.php'); was slow!!

Large

http://www.slideshare.net/kazeburo/big-master-data-php-blt-1


Continuous Increase in Data Volume

https://flic.kr/p/miwdvy

Problem: Increasing DB historical table records

• Increasing DB historical table records

• Shortage of disk capacity

• Slower item search throughput

• Increasing access logs

• Customer service turnaround time became too slow

Improvement: Server partitioning (MySQL)

• DB tables are partitioned across multiple servers

• Slave servers are only in the main cluster

• Using DNS RR

(Diagram: MySQL clusters Main, todolists, l2-db, cs-tool, and anon-db, built from Master, Slave, Backup, and AnonDB nodes)

Improvement:Server partitioning (Solr)

• solr

• Master - Slave

• latest and all clusters

• nginx

• load balancer

• Lua controls cluster access

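(Diagram components: lb_search, app, Worker, and two Solr clusters, each with one SolrMaster and two SolrSlaves)

• double write: updates go to both clusters

• latest cluster: the most recent N days

• all index cluster: all items

• Search latest first; if there are not enough hits, search all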
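The lookup order on this slide (latest cluster first, all index cluster only when there are too few hits) can be sketched like this. It is a sketch under stated assumptions: in the deck the routing is done by Lua inside nginx, while here the two search functions, their signatures, and the sample items are hypothetical stand-ins written in Python.

```python
from typing import Callable, List

SearchFn = Callable[[str, int], List[str]]


def search_latest_then_all(
    query: str, limit: int, search_latest: SearchFn, search_all: SearchFn
) -> List[str]:
    """Query the small latest cluster first, then fall back to the all cluster."""
    hits = search_latest(query, limit)
    if len(hits) >= limit:
        return hits
    # The all index cluster holds every item, so it can answer alone.
    return search_all(query, limit)


# Hypothetical in-memory stand-ins for the two clusters.
LATEST = ["red bag (today)"]
ALL = ["red bag (today)", "red bag (last year)", "red hat (2014)"]


def fake_latest(query: str, limit: int) -> List[str]:
    return [item for item in LATEST if query in item][:limit]


def fake_all(query: str, limit: int) -> List[str]:
    return [item for item in ALL if query in item][:limit]


one = search_latest_then_all("red", 1, fake_latest, fake_all)
three = search_latest_then_all("red", 3, fake_latest, fake_all)
print(one, three)
```

Most requests ask for recent items, so they are served by the smaller cluster and the full index is queried only on fallback.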

Improvement

(Diagram components: app, Worker, Batch; logs: access_log, application_log, app_error_log, error_log, php_log, ...; log)


BigQuery

nat

logview

kibana: Log Viewer

cep


Mackerel


Slack

Norikra: Stream Processing

Growth of Specifications

https://flic.kr/p/7RrWCg

Problem: In deployment…

• Large-scale deployment of multiple features

• Unplanned, rushed deployment

Improvements:

• Deploy many times per day, instead of once a week

• Deployment driven by both Google Calendar and chat

Improvement: Scheduled, automated deploy

http://tech.mercari.com/entry/2015/10/15/183000

Unstable Deployment

http://popsych.org/wp-content/uploads/2015/05/jenga-tower.jpg

Problem: Every deploy produced 50x responses

• Cause

• Inconsistency in the PHP OPcache

• Result

• Negative customer feedback

Improvement: ngx_dynamic_upstream + rsync

(Diagram components: deploybase, lb x 3, App x 7, Worker x 2, Batch; callout: YES!!!)

• ngx_dynamic_upstream

• Dynamically attach and detach app servers to and from the lb

• Using --rsync-path

• detach from lb

• rsync

• attach to lb
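The three steps above (detach, rsync, attach) can be sketched as a rolling loop over the app servers. This is a sketch under stated assumptions, not the actual deploy tooling: the function and server names are hypothetical, and the three callables stand in for the ngx_dynamic_upstream calls and the rsync run, which the slide triggers through --rsync-path.

```python
from typing import Callable, Iterable, List

Step = Callable[[str], None]


def rolling_deploy(
    servers: Iterable[str], detach: Step, sync: Step, attach: Step
) -> List[str]:
    """Deploy to one server at a time; return the servers that were updated."""
    deployed: List[str] = []
    for server in servers:
        detach(server)  # stop routing traffic to this server
        sync(server)    # copy the new release while no requests arrive
        attach(server)  # resume traffic only after the copy has finished
        deployed.append(server)
    # If any step raises, the loop stops and that server stays detached.
    return deployed


# Hypothetical stand-ins that only record what would happen.
events: List[str] = []
result = rolling_deploy(
    ["app1", "app2"],
    detach=lambda s: events.append(f"detach {s}"),
    sync=lambda s: events.append(f"rsync {s}"),
    attach=lambda s: events.append(f"attach {s}"),
)
print(events)
```

Because a server never serves requests while its files are changing, no request sees a mix of old and new code, which is the OPcache inconsistency named as the cause of the 50x responses.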

Conclusion

http://s0.geograph.org.uk/geophotos/02/95/15/2951585_5b854214.jpg

We have improved continuously

• Rome was not built in a day

• We will keep making improvements

References

• Big “Master” Data (http://www.slideshare.net/kazeburo/big-master-data-php-blt-1)

• ngx_dynamic_upstream (https://github.com/cubicdaiya/ngx_dynamic_upstream)

• 大人のスタートアップは大人のリリースができる。そう、ChatOpsならね。 [A grown-up startup can do grown-up releases. Yes, with ChatOps.] (http://tech.mercari.com/entry/2015/10/15/183000)