DataEngConf SF16 - Collecting and Moving Data at Scale

COLLECTING AND MOVING DATA AT SCALE

Sada Furuhashi, Chief Architect

Invented Fluentd, MessagePack

BACKGROUND

HIGH LEVEL ANALYTICS ARCHITECTURE

Collect -> Store -> Process -> Visualize

THE CHALLENGE

Collect -> Store -> Process -> Visualize

How do we shorten the collection process?

Easier & Shorter Time

Excel, Tableau

THE PROBLEM

TYPICAL ARCHITECTURE BEFORE FLUENTD

[Diagram: three App Servers, each running an Application that writes Files; the files are collected on a Log Server]

High latency: must wait for a day

Hard to analyze: complex text parsers

THE FALSE SOLUTION

MULTIPLY CONNECTIONS / COMBINATION EXPLOSION

[Diagram: every source wired to every destination by its own script]

• LOG file
• script to parse data (x2)
• cron job for loading
• filtering script
• syslog script
• Tweet-fetching script
• aggregation script (x2)
• rsync server

THE SOLUTION

CENTRALIZED CONNECTIONS

LOGFILE

FLUENTD INTERNAL ARCHITECTURE

INTERNAL ARCHITECTURE (SIMPLIFIED)

Input -> Filter -> Buffer -> Output

Each stage is a plugin.

Time: 2012-02-04 01:33:51

Tag: myapp.buylog

Record: {"user": "me", "path": "/buyItem", "price": 150, "referer": "/landing"}
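The time / tag / record triple above can be modeled in a few lines. This is only a sketch: the `Event` class and `make_event` helper are hypothetical names chosen for illustration, not Fluentd's API, and the timestamp is assumed to be UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One log event: when it happened, how to route it, and its payload."""
    time: int     # Unix seconds (an assumption for this sketch)
    tag: str      # dot-separated label, later used for routing
    record: dict  # structured payload, JSON-like


def make_event(timestamp: str, tag: str, record: dict) -> Event:
    """Build an event from a 'YYYY-MM-DD HH:MM:SS' string, treated as UTC."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return Event(int(parsed.timestamp()), tag, record)


event = make_event(
    "2012-02-04 01:33:51",
    "myapp.buylog",
    {"user": "me", "path": "/buyItem", "price": 150, "referer": "/landing"},
)
```

Keeping the tag separate from the record is what lets later stages route an event without inspecting its payload.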

ARCHITECTURE: INPUT PLUGINS

Receive logs

Or pull logs from data sources

In non-blocking manner

Examples: HTTP+JSON (in_http), File tail (in_tail), Syslog (in_syslog), …

ARCHITECTURE: FILTER PLUGINS

Transform logs

Filter out unnecessary logs

Enrich logs

Examples: encrypt personal data, convert IP to countries, parse User-Agent, …
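The three filter roles listed above (transform, filter out, enrich) can be sketched as plain functions applied in order. Everything here is illustrative: the function names, the `COUNTRY_BY_IP` table, and the `/health` path are assumptions, and SHA-256 hashing stands in for real encryption.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Optional, Sequence

Record = dict
Filter = Callable[[Record], Optional[Record]]

# Hypothetical lookup table; a real deployment would use a GeoIP database.
COUNTRY_BY_IP = {"203.0.113.7": "JP"}


def drop_health_checks(record: Record) -> Optional[Record]:
    """Filter out unnecessary logs: returning None drops the record."""
    return None if record.get("path") == "/health" else record


def mask_user(record: Record) -> Record:
    """Transform: replace the user name with a digest (stand-in for encryption)."""
    masked = dict(record)
    if "user" in masked:
        masked["user"] = hashlib.sha256(masked["user"].encode("utf-8")).hexdigest()
    return masked


def add_country(record: Record) -> Record:
    """Enrich: attach a country code looked up from the client IP."""
    enriched = dict(record)
    enriched["country"] = COUNTRY_BY_IP.get(record.get("ip"), "unknown")
    return enriched


def apply_filters(records: Iterable[Record], filters: Sequence[Filter]) -> Iterator[Record]:
    """Run every record through the filters in order; stop early if one drops it."""
    for record in records:
        current: Optional[Record] = record
        for step in filters:
            current = step(current)
            if current is None:
                break
        if current is not None:
            yield current
```

Each filter returns a new dict rather than mutating its input, so a record shared with another pipeline is never changed behind its back.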

ARCHITECTURE: BUFFER PLUGINS

Improve performance

Provide reliability

Provide thread-safety

Types: Memory (buf_memory), File (buf_file)

ARCHITECTURE: OUTPUT PLUGINS

Write or send event logs

Examples: File (out_file), Amazon S3 (out_s3), MongoDB (out_mongo), …

ARCHITECTURE: BUFFER PLUGINS

[Diagram: Input -> Buffer (chunks) -> Output. Events arrive as a stream and leave as batches; a chunk that hits an error is retried.]

DIVIDE & CONQUER & RETRY
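A rough sketch of the divide, conquer, and retry idea: events are grouped into fixed-size chunks, each chunk is sent as one batch, and a failed chunk is re-queued without blocking the others. `ChunkBuffer`, its parameters, and its retry policy are assumptions made for illustration; this is not how Fluentd's buffer is implemented.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Chunk = List[dict]


class ChunkBuffer:
    def __init__(self, chunk_size: int, max_retries: int = 3) -> None:
        self.chunk_size = chunk_size
        self.max_retries = max_retries
        self._open: Chunk = []                        # chunk still receiving events
        self._queue: Deque[Tuple[Chunk, int]] = deque()  # (chunk, failed attempts)
        self.failed: List[Chunk] = []                 # chunks that exhausted retries

    def append(self, event: dict) -> None:
        """Divide: add an event and seal the chunk once it is full."""
        self._open.append(event)
        if len(self._open) >= self.chunk_size:
            self._queue.append((self._open, 0))
            self._open = []

    def flush(self, send: Callable[[Chunk], None]) -> int:
        """Conquer and retry: try each queued chunk once, re-queue failures.

        Returns the number of chunks sent successfully in this pass.
        """
        sent = 0
        for _ in range(len(self._queue)):
            chunk, attempts = self._queue.popleft()
            try:
                send(chunk)
                sent += 1
            except Exception:
                if attempts + 1 >= self.max_retries:
                    self.failed.append(chunk)  # give up on this chunk only
                else:
                    self._queue.append((chunk, attempts + 1))
        return sent
```

Because the unit of failure is a chunk rather than the whole stream, one bad batch costs at most `chunk_size` events of rework.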

EXAMPLE USE CASES

STREAMING FROM APACHE TO MONGODB PT I

in_tail: /var/log/access.log

buf_file: /var/log/fluentd/buffer

ERROR HANDLING

in_tail: /var/log/access.log

buf_file: /var/log/fluentd/buffer

Buffering for any outputs

Retrying automatically

With exponential wait and persistence on a disk
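To make "exponential wait" concrete, here is a minimal sketch of a retry schedule whose wait grows by a constant factor up to a cap. The function names and parameters (`initial`, `factor`, `max_wait`, `max_retries`) are hypothetical, and the disk persistence mentioned above is not modeled.

```python
import time
from typing import Callable, Iterable, Iterator


def retry_waits(initial: float, factor: float, max_wait: float, max_retries: int) -> Iterator[float]:
    """Yield the wait before each retry: initial, initial*factor, ..., capped at max_wait."""
    wait = initial
    for _ in range(max_retries):
        yield min(wait, max_wait)
        wait *= factor


def send_with_retry(
    send: Callable[[], None],
    waits: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, then sleep and retry per the schedule. True on success."""
    try:
        send()
        return True
    except Exception:
        pass
    for wait in waits:
        sleep(wait)
        try:
            send()
            return True
        except Exception:
            continue
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; the cap prevents the wait from growing without bound during a long outage.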

TAILING FILE INPUT

Read a log file

Custom regexp

Custom parser in Ruby

Supported formats:

• apache • apache_error • apache2 • nginx

• json • csv • tsv • ltsv

• syslog • multiline • none

access.log is read with its position tracked in a pos file
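The role of the pos file can be sketched as follows: remember the byte offset already consumed, so a restart resumes where it left off instead of re-reading the log. The `read_new_lines` function and its single-integer pos file format are assumptions for illustration, not in_tail's actual on-disk format.

```python
import os
from typing import List


def read_new_lines(log_path: str, pos_path: str) -> List[str]:
    """Return complete lines appended since the last call.

    The byte offset of the last consumed line is stored in pos_path.
    """
    offset = 0
    if os.path.exists(pos_path):
        with open(pos_path, "r", encoding="utf-8") as pos_file:
            offset = int(pos_file.read().strip() or 0)

    # If the log shrank (for example it was truncated), start over.
    if os.path.getsize(log_path) < offset:
        offset = 0

    lines: List[str] = []
    with open(log_path, "rb") as log_file:
        log_file.seek(offset)
        for raw in log_file:
            if not raw.endswith(b"\n"):
                break  # partial line: wait until the writer finishes it
            lines.append(raw.decode("utf-8").rstrip("\n"))
            offset += len(raw)

    with open(pos_path, "w", encoding="utf-8") as pos_file:
        pos_file.write(str(offset))
    return lines
```

Calling it repeatedly (for instance from a timer) yields each complete line exactly once, even across process restarts, as long as the pos file survives.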

OUT TO MULTIPLE LOCATIONS

Routing based on tags

Copy to multiple storages

access.log -> in_tail -> buffer
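Tag-based routing with copying can be sketched as an ordered list of rules, where the first rule whose pattern matches the tag delivers the record to all of its outputs. The matcher follows the wildcard convention used in Fluentd match patterns (`*` for exactly one tag part, `**` for zero or more parts); the `Router` class itself is a hypothetical illustration, not Fluentd code.

```python
from typing import Callable, List, Sequence, Tuple

Output = Callable[[str, dict], None]


def _match(pattern_parts: Sequence[str], tag_parts: Sequence[str]) -> bool:
    if not pattern_parts:
        return not tag_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # "**" may consume zero or more tag parts.
        return any(_match(rest, tag_parts[i:]) for i in range(len(tag_parts) + 1))
    if not tag_parts:
        return False
    if head == "*" or head == tag_parts[0]:
        return _match(rest, tag_parts[1:])
    return False


def tag_matches(pattern: str, tag: str) -> bool:
    """True if the dot-separated tag matches the pattern."""
    return _match(pattern.split("."), tag.split("."))


class Router:
    def __init__(self) -> None:
        self._rules: List[Tuple[str, Tuple[Output, ...]]] = []

    def add_rule(self, pattern: str, *outputs: Output) -> None:
        """Register outputs for a pattern; several outputs means 'copy'."""
        self._rules.append((pattern, outputs))

    def emit(self, tag: str, record: dict) -> int:
        """Deliver to the first matching rule; return how many outputs received it."""
        for pattern, outputs in self._rules:
            if tag_matches(pattern, tag):
                for output in outputs:
                    output(tag, dict(record))  # each output gets its own copy
                return len(outputs)
        return 0
```

Rule order matters in this sketch: a catch-all `**` rule should be registered last, otherwise it shadows the more specific patterns.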

H.A. CONFIGURATION (HIGH AVAILABILITY)

Retry automatically

Exponential retry wait

Persistent on a disk

Automatic fail-over

Load balancing

access.log -> in_tail -> buffer
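Load balancing with automatic fail-over can be sketched as round-robin over primary servers, skipping any that refuse the chunk and falling back to standbys last. The `Forwarder` class and the (name, send function) server representation are assumptions for illustration; they do not mirror any specific Fluentd plugin's interface.

```python
from typing import Callable, List, Sequence, Tuple

Server = Tuple[str, Callable[[list], None]]  # (name, function that sends a chunk)


class Forwarder:
    def __init__(self, primaries: Sequence[Server], standbys: Sequence[Server] = ()) -> None:
        if not primaries:
            raise ValueError("at least one primary server is required")
        self._primaries: List[Server] = list(primaries)
        self._standbys: List[Server] = list(standbys)
        self._next = 0  # index of the primary that gets the next chunk first

    def send(self, chunk: list) -> str:
        """Send the chunk to one server and return the name of the one that accepted it."""
        n = len(self._primaries)
        # Load balancing: rotate the starting primary on every call.
        order = [self._primaries[(self._next + i) % n] for i in range(n)]
        self._next = (self._next + 1) % n
        # Fail-over: move on to the next candidate when a server is unreachable.
        for name, send in order + self._standbys:
            try:
                send(chunk)
                return name
            except ConnectionError:
                continue
        raise ConnectionError("no server accepted the chunk")
```

When every server is down, the raised error is the caller's cue to keep the chunk in the buffer and retry later.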

FOR HADOOP USERS


access.log -> in_tail -> buffer

Custom text formatter

Slice files based on time:

2016-01-01/01/access.log.gz
2016-01-01/02/access.log.gz
2016-01-01/03/access.log.gz
…
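Slicing files by time amounts to deriving the output path from each event's timestamp. A minimal sketch, assuming hourly slices and the `YYYY-MM-DD/HH/access.log.gz` layout shown above; `time_slice_path` and `group_by_slice` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def time_slice_path(event_time: datetime, filename: str = "access.log.gz") -> str:
    """Map an event time to its hourly slice, e.g. 2016-01-01/01/access.log.gz."""
    return event_time.strftime("%Y-%m-%d/%H/") + filename


def group_by_slice(events: Iterable[Tuple[datetime, str]]) -> Dict[str, List[str]]:
    """Group (time, line) pairs by the file they belong to."""
    slices: Dict[str, List[str]] = defaultdict(list)
    for event_time, line in events:
        slices[time_slice_path(event_time)].append(line)
    return dict(slices)
```

Hour-aligned paths let a downstream batch job pick up exactly one directory per run without scanning the rest.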

HADOOP INTEGRATION INTO S3


access.log -> in_tail -> buffer

Slice files based on time:

2016-01-01/01/access.log.gz
2016-01-01/02/access.log.gz
2016-01-01/03/access.log.gz
…

3RD PARTY INPUT PLUGINS

• dstat • df • AMQP • munin • jvmwatcher • SQL

3RD PARTY OUTPUT PLUGINS

• AMQP • Graphite

REAL WORLD USE CASES

HIGH-VOLUME FORWARDING

TREASURE DATA

- At-most-once / At-least-once
- HA (failover)
- Load-balancing

NEAR REALTIME AND BATCH COMBO

Hot data

All data

EXAMPLE CONFIGURATION FOR REAL TIME BATCH COMBO

CEP FOR STREAM PROCESSING

Norikra is a SQL-based CEP engine: http://norikra.github.io/

CONTAINER LOGGING

TREASURE DATA

FLUENTD IN PRODUCTION

MICROSOFT

Operations Management Suite uses Fluentd: "The core of the agent uses an existing open source data aggregator called Fluentd. Fluentd has hundreds of existing plugins, which will make it really easy for you to add new data sources."

[Diagram: OMS agent on a Linux computer]

Linux Computer data sources: Syslog, Operating System, Apache, MySQL, Containers

Agent components: omsconfig (DSC), PS DSC Providers, OMI Server (CIM Server), omsagent

Through Firewall / proxy to the OMS Service: Upload Data (HTTPS), Pull configuration (HTTPS)

https://www.microsoft.com/en-us/server-cloud/operations-management-suite/overview.aspx

http://blogs.technet.com/b/momteam/archive/2015/11/04/oms-agent-for-linux-now-available.aspx

ATLASSIAN

"At Atlassian, we've been impressed by Fluentd and have chosen to use it in Atlassian Cloud's logging and analytics pipeline."

[Diagram labels: Ingestion service, Kinesis, Elasticsearch cluster]

AMAZON WEB SERVICES

The architecture of Fluentd (Sponsored by Treasure Data) is very similar to Apache Flume or Facebook’s Scribe. Fluentd is easier to install and maintain and has better documentation and support than Flume and Scribe.

Types of Data

Transactional: • Database reads & writes (OLTP) • Cache

Search: • Logs • Streams

File: • Log files (/var/log) • Log collectors & frameworks

Stream: • Log records • Sensors & IoT data

Collect: Web Apps, Mobile Apps, IoT, Applications, Logging

Store: Database, Search, File Storage, Stream Storage

THANK YOU!
