Treasure Data & AWS: The light and dark side of the Cloud
Masahiro Nakagawa, Senior Software Engineer, Treasure Data, Inc.

Treasure Data and AWS - Developers.io 2015


Who am I

> Masahiro Nakagawa
  > github: repeatedly
> Treasure Data, Inc.
  > Senior Software Engineer
  > Fluentd / td-agent developer
> Living at OSS :)
  > D language - Phobos, a.k.a. standard library, committer
  > Fluentd - Main maintainer
  > MessagePack / RPC - D and Python (only RPC)
  > The organizer of several meetups (Presto, DTM, etc…)
  > etc…

TD Service Architecture

[Diagram: Time to Value]
Acquire
  Sources: Web Log, App Log
  Streaming Collector: Treasure Agent (Server), SDK (JS, Android, iOS, Unity)
  Bulk Uploader: Embulk, TD Toolbelt
Store
  Plazma DB: Flexible, Scalable, Columnar Storage
Analyze
  SQL-based query: Batch / Reliability, Ad-hoc / Low latency
Result Push (send query result)
  KPI Dashboard: Metric Insights
  BI Tools: Tableau, Motion Board, etc.
  Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
Economy & Flexibility, Simple & Supported

Treasure Data System Overview

[Diagram labels: Frontend, Job Queue]




Applications push metrics to Fluentd (via local Fluentd)

Datadog for realtime monitoring

Treasure Data for historical analysis

Fluentd sums up the data every minute (partial aggregation)
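As a rough illustration of the partial aggregation step, the sketch below sums metric values into one-minute buckets before they are forwarded. The function name `aggregate_per_minute` and the `(timestamp, name, value)` tuple layout are assumptions made for this example, not Fluentd's plugin API.

```python
from collections import defaultdict

def aggregate_per_minute(events):
    """Sum metric values into (minute_start, metric_name) buckets.

    events: iterable of (unix_timestamp, metric_name, value) tuples.
    """
    buckets = defaultdict(float)
    for ts, name, value in events:
        minute_start = int(ts) - int(ts) % 60  # truncate to the minute
        buckets[(minute_start, name)] += value
    return dict(buckets)

# Hypothetical metric events: four in one minute, one in the next.
events = [
    (1426047906, "queries", 1),
    (1426047912, "queries", 1),
    (1426047939, "queries", 1),
    (1426047951, "queries", 1),
    (1426047965, "queries", 1),
]
partial = aggregate_per_minute(events)
```

Forwarding one summed value per minute and metric, instead of every raw event, keeps the number of points sent to the monitoring backend small.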

Plazma - Treasure Data’s distributed analytical database

Plazma by the numbers

> Data import
  > 500,000 records / sec
  > 43 billion records / day
> Hive Query
  > 2 trillion records / day
  > 2,828 TB / day
> Presto Query
  > 10,000+ queries / day

Used AWS components

> EC2
  > Hadoop / Presto Clusters
  > API Servers
> S3
  > MessagePack Columnar Storage
> RDS
  > MySQL for service information
  > PostgreSQL for Plazma metadata
  > Distributed Job Queue / Scheduler

> CloudWatch
  > Monitor AWS service metrics
> ELB
  > Endpoint for APIs
  > Endpoint for Heroku drains
> ElastiCache
  > Store TD monitoring data
  > Event de-duplication for mobile SDKs

Why not use HDFS for storage?

> To separate machine resources and storage
  > Easy to add or replace workers
  > Import load doesn’t affect queries
> Don’t want to maintain HDFS…
  > HDFS crashes
  > Upgrading an HDFS cluster is hard
> The drawbacks of S3-based storage
  > Eventual consistency
  > Network access

Data Importing

[Diagram: td-agent / fluentd -> API Server -> Import Queue -> Import Worker]

td-agent / fluentd:
✓ Buffering for 5 minutes
✓ Retrying (at-least once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk

Chunks are uploaded as MessagePack: “It’s like JSON, but fast and small.”

unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…}
{"time":1426047912,"uid":9,…}
{"time":1426047939,"uid":3,…}
{"time":1426047951,"uid":2,…}
…

Import Queue: MySQL (PerfectQueue)

unique_id         time
375828ce5510cadb  2015-12-01 10:47
2024cffb9510cadc  2015-12-01 11:09
1b8d6a600510cadd  2015-12-01 11:21
1f06c0aa510caddb  2015-12-01 11:38

The unique_id column is UNIQUE (at-most once).
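The combination of at-least-once retrying and a UNIQUE chunk ID can be sketched as follows. This is a minimal illustration that uses SQLite from the Python standard library as a stand-in for MySQL; the table name `import_queue` and the helper names are hypothetical and do not reflect PerfectQueue's actual schema.

```python
import sqlite3

def make_queue():
    """Create an in-memory queue table whose chunk ID must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE import_queue ("
        " unique_id TEXT NOT NULL UNIQUE,"
        " time TEXT NOT NULL)"
    )
    return conn

def enqueue_chunk(conn, unique_id, time):
    """Insert a chunk; return False if this unique_id was already accepted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO import_queue (unique_id, time) VALUES (?, ?)",
                (unique_id, time),
            )
        return True
    except sqlite3.IntegrityError:
        # A retried upload of the same chunk: acknowledge it and drop it.
        return False

conn = make_queue()
first = enqueue_chunk(conn, "375828ce5510cadb", "2015-12-01 10:47")
retry = enqueue_chunk(conn, "375828ce5510cadb", "2015-12-01 10:47")
count = conn.execute("SELECT COUNT(*) FROM import_queue").fetchone()[0]
```

The client may resend a chunk any number of times, and the constraint ensures the queue keeps only one copy.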

[Diagram: Import Queue -> Import Workers (✓ HA, ✓ Load balancing) -> Realtime Storage]

Realtime Storage and Archive Storage: files on Amazon S3 / Basho Riak CS

Metadata of the records in a file (stored on PostgreSQL):

uploaded time     file  index range                                 records
2015-03-08 10:47        [2015-12-01 10:47:11, 2015-12-01 10:48:13]  3
2015-03-08 11:09        [2015-12-01 11:09:32, 2015-12-01 11:10:35]  25
2015-03-08 11:38        [2015-12-01 11:38:43, 2015-12-01 11:40:49]  14
…                 …     …                                           …


Merge Worker (MapReduce)

Merge every 1 hour: files in Realtime Storage are merged into Archive Storage.
Retrying + Unique (at-least-once + at-most-once)

Archive Storage metadata after the merge:

file  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …


GiST (R-tree) index on the “time” column of the files

Read from Archive Storage if merged. Otherwise, read from Realtime Storage.
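The read rule above can be sketched in plain Python. This is only an illustration of the range-overlap test that a GiST index on a range column accelerates in PostgreSQL (the `&&` operator). The file paths and the `merged` flag are hypothetical, since the slides do not show how merged files are tracked.

```python
from datetime import datetime

def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def overlaps(a, b):
    """True if two closed (start, end) ranges share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]

def files_to_read(archive_files, realtime_files, query_range):
    """Archive files that overlap, plus realtime files not yet merged."""
    chosen = [f["path"] for f in archive_files
              if overlaps(f["range"], query_range)]
    chosen += [f["path"] for f in realtime_files
               if not f["merged"] and overlaps(f["range"], query_range)]
    return chosen

archive_files = [
    {"path": "archive/0",
     "range": (ts("2015-12-01 10:00:00"), ts("2015-12-01 11:00:00"))},
]
realtime_files = [
    {"path": "realtime/0", "merged": True,
     "range": (ts("2015-12-01 10:47:11"), ts("2015-12-01 10:48:13"))},
    {"path": "realtime/1", "merged": False,
     "range": (ts("2015-12-01 11:09:32"), ts("2015-12-01 11:10:35"))},
    {"path": "realtime/2", "merged": False,
     "range": (ts("2015-12-01 11:38:43"), ts("2015-12-01 11:40:49"))},
]
query = (ts("2015-12-01 10:30:00"), ts("2015-12-01 11:15:00"))
selected = files_to_read(archive_files, realtime_files, query)
```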

Why not use LIST API?

> LIST API is slow
  > It causes slow queries on large datasets
  > Riak CS’s LIST is also toooo slow!
> LIST API has a critical problem… ;(
  > LIST skips some objects in a high-load environment
  > It doesn’t return an error…
> Using PostgreSQL improves the performance
  > Easy to check time range
  > Operation cost is cheaper than S3 calls

Why not MySQL? - benchmark






[Bar chart: MySQL vs. PostgreSQL on three workloads: INSERT 50,000 rows; SELECT sum(id); SELECT sum(file_size) WHERE index range. Annotations: Index-only scan; GiST index + range type]

Data Importing

> Scalable & Reliable importing
  > Fluentd buffers data on disk
  > Import queue deduplicates uploaded chunks
  > Workers take the chunks and put them into Realtime Storage
> Instant visibility
  > Imported data is immediately visible to query engines.
  > Background workers merge the files every 1 hour.
> Metadata
  > Index is built on PostgreSQL using RANGE type and GiST index

Data processing

Archive Storage: files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL

MessagePack Columnar File Format

time                 code   method
2015-12-01 10:02:36  200    GET
2015-12-01 10:22:09  404    GET
2015-12-01 10:36:45  200    GET
2015-12-01 10:49:21  200    POST
…                    …      …

time                 code   method
2015-12-01 11:10:09  200    GET
2015-12-01 11:21:45  200    GET
2015-12-01 11:38:59  200    GET
2015-12-01 11:43:37  200    GET
2015-12-01 11:54:52  “200”  GET
…                    …      …

path  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …

The data is split two ways:

column-based partitioning (each column is stored separately within a file)

time-based partitioning (one file per index range)

Example query (only the “code” column of the files whose index range matches the time condition has to be read):

SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code
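How the two kinds of partitioning cut down the work for this query can be sketched with an in-memory stand-in for the columnar files. The data layout and the function name `count_by_code` are assumptions made for illustration, and the index ranges are treated as half-open (start inclusive, end exclusive), which the slides do not state.

```python
from collections import Counter

# Hypothetical stand-in for columnar files: one entry per time partition,
# each storing its columns separately.
partitions = [
    {
        "range": ("2015-12-01 10:00:00", "2015-12-01 11:00:00"),
        "columns": {
            "time": ["2015-12-01 10:02:36", "2015-12-01 10:22:09",
                     "2015-12-01 10:36:45", "2015-12-01 10:49:21"],
            "code": [200, 404, 200, 200],
            "method": ["GET", "GET", "GET", "POST"],
        },
    },
    {
        "range": ("2015-12-01 11:00:00", "2015-12-01 12:00:00"),
        "columns": {
            "time": ["2015-12-01 11:10:09", "2015-12-01 11:21:45",
                     "2015-12-01 11:38:59", "2015-12-01 11:43:37",
                     "2015-12-01 11:54:52"],
            "code": [200, 200, 200, 200, "200"],
            "method": ["GET"] * 5,
        },
    },
]

def count_by_code(partitions, time_from):
    """SELECT code, COUNT(1) ... WHERE time >= time_from GROUP BY code."""
    counts = Counter()
    columns_read = 0
    for part in partitions:
        _, range_end = part["range"]
        if range_end <= time_from:
            continue  # time-based pruning: the whole partition is too old
        cols = part["columns"]
        columns_read += 2  # only "time" and "code"; "method" is never read
        for t, code in zip(cols["time"], cols["code"]):
            # Fixed-width timestamps compare correctly as strings.
            if t >= time_from:
                counts[int(code)] += 1  # cast, since stored types may vary
    return dict(counts), columns_read

result = count_by_code(partitions, "2015-12-01 11:00:00")
```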

Handling Eventual Consistency

1. Write data / metadata first
   > At this point, data is not visible

2. Check whether the S3 data is available
   > GET, GET, GET…

3. S3 data becomes visible
   > Query includes imported data!

Ex. Netflix case
   > https://github.com/Netflix/s3mper
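The three steps can be sketched with a fake object store. `FakeStore`, `import_file` and the `visible` flag in the metadata are hypothetical names for this example; the slides do not show how visibility is actually recorded.

```python
import time

class FakeStore:
    """Stand-in for S3: an object becomes readable only after a few checks."""
    def __init__(self, visible_after):
        self.visible_after = visible_after
        self.checks = 0
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body

    def exists(self, key):
        self.checks += 1
        return key in self.objects and self.checks > self.visible_after

def import_file(store, metadata, key, body, attempts=10, delay=0.0):
    # Step 1: write data and metadata first; not yet visible to queries.
    store.put(key, body)
    metadata[key] = {"visible": False}
    # Step 2: GET, GET, GET... until the object can be read back.
    for _ in range(attempts):
        if store.exists(key):
            # Step 3: expose the file to query engines.
            metadata[key]["visible"] = True
            return True
        time.sleep(delay)
    return False

store = FakeStore(visible_after=2)
metadata = {}
ok = import_file(store, metadata, "chunk-375828ce5510cadb", b"...")
```

Query engines would consult only metadata entries marked visible, so a file that cannot be read back yet is never scanned.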

Hide network cost

> Open a lot of connections to S3
  > Using range feature with columnar offset
  > Improve scan performance for partitioned data
> Detect recoverable errors
  > We have error lists for fault tolerance
> Stall checker
  > Watch the progress of reading data
  > If processing time reaches the threshold, re-connect to S3 and re-read data


Optimizing Scan Performance

•  Fully utilize the network bandwidth from S3
•  TD Presto becomes CPU bottleneck

[Diagram: scan pipeline from S3 / Riak CS to the query engine]
•  Inputs: S3 file list, table schema
•  Request Queue: priority queue, max connections limit
•  Header request: callback to HeaderParser
•  HeaderParser: parses the MPC file header (column block offsets, column names) and issues column block requests
•  S3 reads (several in parallel): decompression, msgpack-java v07
•  Buffers: buffer size limit, reuse allocated buffers via release(Buffer)
•  Pull records
•  MPC1 file layout: Header, Column Block 0 (column names), Column Block 1, …, Column Block i, …, Column Block m
•  Retry GET request on 500 (internal error), 503 (slow down), 404 (not found, due to eventual consistency)
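Two parts of this pipeline, reading a single column block with a byte range and retrying the GET on the listed status codes, can be sketched as follows. The `do_get` callable, the block offsets, and the file name are hypothetical; a real client would issue an HTTP GET with a `Range` header against S3 or Riak CS.

```python
RETRYABLE_STATUS = {500, 503, 404}  # internal error, slow down, not found

def range_header(offset, length):
    """HTTP Range header value for `length` bytes starting at `offset`."""
    return f"bytes={offset}-{offset + length - 1}"

def get_with_retry(do_get, key, byte_range, max_attempts=5,
                   sleep=lambda seconds: None):
    """Call do_get(key, byte_range) -> (status, body), retrying when allowed."""
    for attempt in range(max_attempts):
        status, body = do_get(key, byte_range)
        if status in (200, 206):
            return body
        if status not in RETRYABLE_STATUS:
            raise RuntimeError(f"non-retryable status {status}")
        sleep(0.1 * 2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError(f"gave up after {max_attempts} attempts")

# Fake transport: two retryable failures, then the requested bytes.
responses = iter([(404, b""), (503, b""), (206, b"column-block")])
calls = []

def fake_get(key, byte_range):
    calls.append(byte_range)
    return next(responses)

# Hypothetical offset and length, as a header parser might report them.
column_block = {"offset": 128, "length": 64}
body = get_with_retry(
    fake_get, "file.mpc1",
    range_header(column_block["offset"], column_block["length"]),
)
```

Treating 404 as retryable only makes sense because the metadata already says the object exists; the object store has simply not caught up yet.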

Recoverable errors

> Error types
  > User error
    > Syntax error, Semantic error
  > Insufficient resource
    > Exceeded task memory size
  > Internal failure
    > I/O error of S3 / Riak CS
    > worker failure
    > etc

We can retry the internal failure patterns

Presto retry on Internal Errors

> Queries succeed eventually

[Chart, log scale]
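A query-level retry that distinguishes the error types could look like the sketch below. `ErrorType`, `QueryError` and `run_with_retry` are names invented for this example and are not Presto APIs. Only internal failures are retried, because a syntax error or a memory limit would fail again in the same way.

```python
from enum import Enum

class ErrorType(Enum):
    USER_ERROR = "user error"
    INSUFFICIENT_RESOURCE = "insufficient resource"
    INTERNAL_FAILURE = "internal failure"

class QueryError(Exception):
    def __init__(self, error_type, message):
        super().__init__(message)
        self.error_type = error_type

def run_with_retry(run_query, max_attempts=3):
    """Run a query, retrying only on internal failures.

    Returns (result, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(), attempt
        except QueryError as err:
            retryable = err.error_type is ErrorType.INTERNAL_FAILURE
            if not retryable or attempt == max_attempts:
                raise

# Fake query: fails twice with an internal failure, then succeeds.
state = {"calls": 0}

def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise QueryError(ErrorType.INTERNAL_FAILURE, "I/O error of S3 / Riak CS")
    return "ok"

result, attempts = run_with_retry(flaky_query)
```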

time                 code   method
2015-12-01 10:02:36  200    GET
2015-12-01 10:22:09  404    GET
2015-12-01 10:36:45  200    GET
2015-12-01 10:49:21  200    POST
…                    …      …

user  time                 code   method
391   2015-12-01 11:10:09  200    GET
482   2015-12-01 11:21:45  200    GET
573   2015-12-01 11:38:59  200    GET
664   2015-12-01 11:43:37  200    GET
755   2015-12-01 11:54:52  “200”  GET
…     …                    …      …

MessagePack Columnar File Format is schema-less
✓ Instant schema change

SQL is schema-full
✓ SQL doesn’t work without schema


[Diagram: Realtime Storage and Archive Storage -> Query Engine (Hive, Pig, Presto)]

Stored record:

{"user":54, "name":"plazma", "value":"120", "host":"local"}

Table schema:

CREATE TABLE events (
  user INT, name STRING, value INT, host INT
);

Record as seen by the query engine:

user  | 54
name  | "plazma"
value | 120
host  | NULL
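The mapping from the schema-less record to the typed row can be sketched as a cast applied at read time: a value that can be converted to the column type is converted ("120" becomes 120), and one that cannot becomes NULL ("local" as INT). The helper names `cast_value` and `apply_schema` are assumptions for illustration, and only the two types used in the example are handled.

```python
def cast_value(value, sql_type):
    """Cast a schema-less value to a column type; None (NULL) on failure."""
    if value is None:
        return None
    try:
        if sql_type == "INT":
            return int(value)
        if sql_type == "STRING":
            return str(value)
    except (TypeError, ValueError):
        return None
    raise ValueError(f"unsupported type: {sql_type}")

def apply_schema(record, schema):
    """Project a record onto (column, type) pairs; missing keys become None."""
    return {col: cast_value(record.get(col), t) for col, t in schema}

schema = [("user", "INT"), ("name", "STRING"),
          ("value", "INT"), ("host", "INT")]
record = {"user": 54, "name": "plazma", "value": "120", "host": "local"}
row = apply_schema(record, schema)
```

Because the cast happens when the data is read, changing a column type in the schema takes effect immediately and needs no rewrite of the stored files.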





Datadog based monitoring

> dd-agent for system metrics
> Send application metrics using Fluentd
  > Hadoop / Presto usage
  > Service metrics
  > PostgreSQL status
> Check AWS events
  > EC2, CloudTrail and more
> Event based alert

[Screenshot: CloudTrail example]

[Screenshot: Presto example]

Pitfall of PostgreSQL on RDS

> PostgreSQL on RDS has a TCP proxy
  > The “DB connections” metric shows TCP connections, not running PostgreSQL processes
  > PostgreSQL spawns a process for each TCP connection
  > The problem is that the process is sometimes still running even after the TCP connection is closed.
  > As a result, “DB connections” decreases but PostgreSQL can’t accept new requests ;(
  > We collect actual metrics from PostgreSQL tables.
> Can’t use some extensions


> Build a scalable data analytics platform on the Cloud
  > Separate resource and storage
  > Loosely-coupled components
> AWS has some pitfalls but we can avoid them
  > There are many trade-offs
  > Use an existing component or create a new component?
  > Stick to the basics!

Check: treasuredata.com, treasure-data.hateblo.jp/ (Japanese blog)

Cloud service for the entire data pipeline