APACHE IGNITE AS A DATA PROCESSING HUB
ROMAN SHTYKH
CYBERAGENT, INC.

See all the presentations from the In-Memory Computing Summit at http://imcsummit.org

IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub



INTRODUCTION


ABOUT ME

Roman Shtykh
• R&D Engineer at CyberAgent, Inc.
• Areas of focus: data streaming and NLP
• Committer on the Apache Ignite and MyBatis projects
• Judoka
• @rshtykh


CYBERAGENT, INC.

Internet ads, Games, Media, Investing

[Pie chart: segments Games, Media, Internet ads, Investing, Other; values shown: 25%, 13%, 52%, 3%, 7%]

* As of Sep 2015


AMEBA SERVICES

・ Monthly visitors (DUB total): 6 billion*
・ Number of member users: about 39 million*

* As of Dec 2014

[Figure: CyberAgent, Inc. Ameba Services]

• Games
• Community services
• Content curation
• Other


AMEBA SERVICES

Ameba Pigg


CONTENTS

• Apache Ignite
• Feed your data
• Log Aggregation with Apache Flume
• Integration with Apache Ignite
• Streaming Data with Apache Kafka
• Data Pipeline with Kafka and Ignite: Example


APACHE IGNITE

“High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.”

• High performance, unlimited scalability and resiliency
• High-performance transactions and fast analytics
• Hadoop Acceleration, Apache Spark
• Apache project

https://ignite.apache.org/


MAKING APACHE IGNITE A DATA PROCESSING HUB

Question: How to feed data?
A simple solution: Create a client node


• Is it reliable?
• Does it scale?
• Ignite-only solution?
• Does it keep your operational costs low?



LOG AGGREGATION WITH APACHE FLUME


Flume
“Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”
• Scalable
• Flexible
• Robust and fault tolerant
• Declarative configuration
• Apache project


DATA FLOW IN FLUME

[Diagram: incoming data → Source → Channel → Sink, all inside one Agent; the Sink forwards to another Agent or to the destination]


DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)

[Diagram: incoming data → Source → Channel Selector → two Channels, each drained by its own Sink, all inside one Agent]
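
The two selector behaviors named in this slide's title can be illustrated with a small Java sketch. It is not Flume code: `SelectorSketch`, the `type` header, and the channel names `c1` and `c2` are hypothetical stand-ins. Replication copies each event to every channel, while multiplexing picks channels from a header value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Flume channel selector; not Flume's API.
class SelectorSketch {
    // Replicating: every event goes to every configured channel.
    static List<String> replicate(List<String> channels) {
        return new ArrayList<>(channels);
    }

    // Multiplexing: choose channels by the value of one event header,
    // falling back to a default channel when no mapping matches.
    static List<String> multiplex(Map<String, String> headers,
                                  String headerName,
                                  Map<String, List<String>> mapping,
                                  String defaultChannel) {
        List<String> target = mapping.get(headers.get(headerName));
        return target != null ? target : Collections.singletonList(defaultChannel);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("click", Arrays.asList("c1"));
        mapping.put("error", Arrays.asList("c1", "c2"));

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "error");

        System.out.println(replicate(Arrays.asList("c1", "c2")));
        System.out.println(multiplex(headers, "type", mapping, "c2"));
    }
}
```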


DATA FLOW IN FLUME (RELIABILITY)

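No data is lost (configurable)

[Diagram: incoming data → Source → Channel → Sink inside one Agent; the Source writes to the Channel in a source transaction (Source tx) and the Sink reads from it in a sink transaction (Sink tx)]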


LOG TRANSFER AT AMEBA

[Diagram of the log transfer pipeline. Components shown: three Ameba Service nodes, three Aggregators, Monitoring, Recommender System, Elastic Search, Hadoop batch processing, HBase, Stream Processing (Onix), Stream Processing (HBaseSink)]


LOG TRANSFER AT AMEBA

Web hosts: more than 1600
Size: 5.0 TB/day (raw)
Traffic at peak: 160 Mbps (compressed)


IGNITE SINK

• Reads Flume events from a channel
• Converts them into cacheable entries with a user-implemented pluggable transformer
• Adding it requires no modification to the existing architecture


FLUME ⇒ IGNITE (1)

[Diagram: incoming data → Source → Channel → Ignite Sink inside one Agent; the Ignite Sink opens a new connection]


FLUME ⇒ IGNITE (2)

[Diagram: incoming data → Source → Channel → Ignite Sink inside one Agent; the Ignite Sink starts a sink transaction (Sink tx: start tx)]


FLUME ⇒ IGNITE (3)

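[Diagram: incoming data → Source → Channel → Ignite Sink inside one Agent; within the sink transaction (Sink tx) the Ignite Sink takes events from the Channel (take event) and sends them on (send events)]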
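
The sequence in slides (1) to (3), start a transaction, take events, send events, can be sketched in plain Java. This is a minimal model under stated assumptions, not the IgniteSink source: `ChannelSketch` and `drainBatch` are hypothetical names, and the channel is a simple in-memory queue with a single open transaction. The point it illustrates is that a failed send rolls the batch back into the channel, so no events are lost.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a Flume channel with one open transaction.
class ChannelSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> taken = new ArrayList<>(); // events taken in the open tx

    void put(String event) { queue.addLast(event); }

    // Take one event inside the transaction; returns null when the channel is empty.
    String take() {
        String e = queue.pollFirst();
        if (e != null) taken.add(e);
        return e;
    }

    // Commit: the taken events leave the channel for good.
    void commit() { taken.clear(); }

    // Rollback: put the taken events back at the head, preserving their order.
    void rollback() {
        for (int i = taken.size() - 1; i >= 0; i--) queue.addFirst(taken.get(i));
        taken.clear();
    }

    int size() { return queue.size(); }

    // One sink cycle: take up to batchSize events, send them, then commit.
    // If sending throws, the batch is rolled back and stays in the channel.
    static boolean drainBatch(ChannelSketch ch, int batchSize, Consumer<List<String>> sender) {
        List<String> batch = new ArrayList<>();
        try {
            String e;
            while (batch.size() < batchSize && (e = ch.take()) != null) batch.add(e);
            if (!batch.isEmpty()) sender.accept(batch);
            ch.commit();
            return true;
        } catch (RuntimeException ex) {
            ch.rollback();
            return false;
        }
    }

    public static void main(String[] args) {
        ChannelSketch ch = new ChannelSketch();
        ch.put("e1"); ch.put("e2"); ch.put("e3");
        // A failing send leaves every event in the channel.
        drainBatch(ch, 2, batch -> { throw new RuntimeException("target unreachable"); });
        System.out.println(ch.size());
        // A successful send removes the batch from the channel.
        drainBatch(ch, 2, batch -> System.out.println(batch));
        System.out.println(ch.size());
    }
}
```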


ENABLING FLUME SINK

Steps
1. Implement EventTransformer to convert Flume events into cacheable entries (java.util.Map<K, V>)
2. Put the transformer’s jar in ${FLUME_HOME}/plugins.d/ignite/lib
3. Put the IgniteSink and Ignite core jar files in ${FLUME_HOME}/plugins.d/ignite/libext
4. Set up a Flume agent

Sink setup
a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
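
Step 1 might look roughly like the sketch below. To keep it compilable without the Flume and Ignite jars, it uses local stand-ins: `FlumeEvent` and the `EventTransformer` interface here are assumptions modeled on the description above (a batch of events in, a `java.util.Map<K, V>` out), and the tab-separated body format is hypothetical. Check the interface shipped with the Ignite Flume module before adapting it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for a Flume event: only the body is modeled here.
class FlumeEvent {
    final byte[] body;
    FlumeEvent(String text) { this.body = text.getBytes(StandardCharsets.UTF_8); }
}

// Assumed shape of the transformer contract: a batch of events in, cache entries out.
interface EventTransformer<E, K, V> {
    Map<K, V> transform(List<E> events);
}

// Hypothetical transformer for bodies of the form "key<TAB>value".
class MyEventTransformer implements EventTransformer<FlumeEvent, String, String> {
    @Override
    public Map<String, String> transform(List<FlumeEvent> events) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (FlumeEvent event : events) {
            String line = new String(event.body, StandardCharsets.UTF_8);
            int tab = line.indexOf('\t');
            if (tab <= 0) continue; // skip malformed lines instead of failing the batch
            entries.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<FlumeEvent> batch = Arrays.asList(
            new FlumeEvent("user42\tclick"), new FlumeEvent("garbage"));
        System.out.println(new MyEventTransformer().transform(batch));
    }
}
```

Skipping malformed lines is one possible policy; a real transformer could equally route them to an error cache or fail the batch so the channel transaction rolls back.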


FLUME SINKS

HDFS THRIFT AVRO HBASE ElasticSearch IRC IGNITE


APACHE FLUME & APACHE IGNITE

If you do data aggregation with Flume
• Adding an Ignite cluster is as simple as writing a simple data transformer and deploying a new Flume agent

If you store your data (and do computations) in Ignite
• Improving data injection becomes easy with Flume sink

Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing)
• Scalable
• Reliable
• Highly performant


STREAMING DATA WITH APACHE KAFKA


APACHE KAFKA

“Publish-subscribe messaging rethought as a distributed commit log”
• Low latency
• High Throughput
• Partitioned and Replicated

Kafka is an essential component of any data pipeline today

http://kafka.apache.org/


APACHE KAFKA

• Messages are grouped in topics
• Each partition is a log
• Each partition is managed by a broker (when replicated, one broker is the partition leader)
• Producers & consumers (consumer groups)

Used for
• Log aggregation
• Activity tracking
• Monitoring
• Stream processing

http://kafka.apache.org/documentation.html
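
The "each partition is a log" idea can be made concrete with a toy model. The sketch below is not Kafka's implementation: `TopicSketch` is a hypothetical in-memory class, and the key-to-partition mapping uses `String.hashCode()` as a simplification (Kafka's default partitioner uses a different hash). It shows that appending returns an offset and that a consumer reads sequentially from an offset it tracks itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a topic; not Kafka's implementation.
class TopicSketch {
    private final List<List<String>> partitions = new ArrayList<>();

    TopicSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) partitions.add(new ArrayList<>());
    }

    // Simplified key-to-partition mapping; the same key always lands in the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Append to the end of the partition's log; the returned offset is the record's position.
    long append(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    // Read everything from a given offset to the end of one partition.
    List<String> read(int partition, long fromOffset) {
        List<String> log = partitions.get(partition);
        int start = (int) Math.min(fromOffset, log.size());
        return new ArrayList<>(log.subList(start, log.size()));
    }

    public static void main(String[] args) {
        TopicSketch topic = new TopicSketch(3);
        topic.append("user42", "click");
        long offset = topic.append("user42", "impression");
        int partition = topic.partitionFor("user42");
        System.out.println("offset " + offset + ": " + topic.read(partition, offset));
    }
}
```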


KAFKA CONNECT

• Designed for large scale stream data integration using Kafka
• Provides an abstraction from communication with your Kafka cluster
  • Offset management
  • Delivery semantics
  • Fault tolerance
  • Monitoring, etc.
• Worker (scalability & fault tolerance)
• Connector (task config)
• Task (thread)
• Standalone & Distributed execution models

http://www.confluent.io/blog/apache-kafka-0.9-is-released


INGESTING DATA STREAMS

Two ways
• Kafka Streamer
• Sink Connector

[Diagram: ETL path into Ignite through Kafka Connect; on the Ignite side: SQL queries, distributed closures, transactions]


STREAMING VIA SINK CONNECTOR

• Configure your connector
• Configure Kafka Connect worker
• Start your connector

# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml

$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties


STREAMING VIA SINK CONNECTOR

• Easy data pipeline
• Records from Kafka are written to Ignite grid via high-performance IgniteDataStreamer
• At-least-once delivery guarantee
• As of 1.6, start a new connector to write to a different cache

[Diagram: Kafka records a, b, c, d, e at Kafka offsets 0, 1, 2, …; each is written as a key-value entry (a.key, a.val), (b.key, b.val), …; also shown: a2, b2, c2, d2, e2]
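
What "at-least-once" means for a sink can be shown with a small simulation. This is a hypothetical model, not the connector's code: `AtLeastOnceSketch` stands in for a sink task, a `Map` stands in for the Ignite cache, and the crash is simulated with a flag. Offsets are committed only after records are written, so a failure between the write and the commit causes redelivery rather than loss; because entries are keyed, the redelivered records leave the cache in the same state.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an at-least-once sink; not the connector's code.
class AtLeastOnceSketch {
    final Map<String, String> cache = new LinkedHashMap<>(); // stand-in for the Ignite cache
    long committedOffset = 0; // next offset to read after a restart
    int writes = 0;           // counts every delivery, duplicates included

    // Deliver records starting at the committed offset; commit only after they are written.
    // crashBeforeCommit simulates a failure between the write and the offset commit.
    void poll(List<String[]> log, int max, boolean crashBeforeCommit) {
        long end = Math.min(log.size(), committedOffset + max);
        for (long o = committedOffset; o < end; o++) {
            String[] kv = log.get((int) o);
            cache.put(kv[0], kv[1]); // a redelivered record rewrites the same entry
            writes++;
        }
        if (!crashBeforeCommit) committedOffset = end;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[] {"a.key", "a.val"},
            new String[] {"b.key", "b.val"},
            new String[] {"c.key", "c.val"});
        AtLeastOnceSketch sink = new AtLeastOnceSketch();
        sink.poll(log, 2, true);  // a and b are written, but the offset is not committed
        sink.poll(log, 3, false); // a and b are delivered again, followed by c
        System.out.println(sink.writes + " writes, " + sink.cache.size() + " entries");
    }
}
```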


INGESTING DATA STREAMS

Bi-directional streaming

[Diagram: a Kafka Connect Sink feeds Ignite (SQL queries, distributed closures, transactions); a Kafka Connect Source streams Ignite events and continuous queries back]


STREAMING BACK TO KAFKA

• Listening to cache events: PUT, READ, REMOVED, EXPIRED, etc.
• Remote filtering can be enabled
• Kafka Connect offsets are ignored
• Currently, no delivery guarantees

[Diagram: cache events evt1, evt2, evt3 sent as records]


ENABLING SOURCE CONNECTOR

• Configure your connector
• Define a remote filter if needed: cacheFilterCls=MyCacheEventFilter
• Make sure that event listening is enabled on the server nodes
• Configure Kafka Connect worker
• Start your connector

#connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2

#topics, events
topicNames=test
cacheEvts=put,removed

#cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
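
How the event-type list (`cacheEvts=put,removed`) and a remote filter combine can be sketched as follows. The types here are stand-ins so the example compiles on its own: `CacheEventSketch` is a simplified event, the filter is expressed as a `java.util.function.Predicate` rather than the interface the connector actually expects, and the key-prefix rule and record format are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Local stand-in for a cache event; the real event type carries more fields.
class CacheEventSketch {
    final String type; // e.g. "put", "read", "removed"
    final String key;
    final String value;
    CacheEventSketch(String type, String key, String value) {
        this.type = type; this.key = key; this.value = value;
    }
}

// Hypothetical filter: pass only events whose key starts with a given prefix.
class MyCacheEventFilter implements Predicate<CacheEventSketch> {
    private final String prefix;
    MyCacheEventFilter(String prefix) { this.prefix = prefix; }
    @Override public boolean test(CacheEventSketch evt) { return evt.key.startsWith(prefix); }
}

class SourceSketch {
    // Keep events of the subscribed types that also pass the filter; emit them as records.
    static List<String> toRecords(List<CacheEventSketch> events, List<String> types,
                                  Predicate<CacheEventSketch> filter) {
        List<String> records = new ArrayList<>();
        for (CacheEventSketch e : events) {
            if (types.contains(e.type) && filter.test(e)) {
                records.add(e.type + ":" + e.key + "=" + e.value);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<CacheEventSketch> events = Arrays.asList(
            new CacheEventSketch("put", "user:1", "a"),
            new CacheEventSketch("read", "user:1", "a"),
            new CacheEventSketch("put", "ad:9", "x"),
            new CacheEventSketch("removed", "user:2", "b"));
        System.out.println(toRecords(events, Arrays.asList("put", "removed"),
            new MyCacheEventFilter("user:")));
    }
}
```

Filtering on the server side keeps unwanted events off the network, which is the usual reason to enable a remote filter.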


APACHE KAFKA & APACHE IGNITE

If you do data streaming with Kafka
• Adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need it for source)

If you store your data (and do computations) in Ignite
• Improving data injection and listening for events on data becomes easy with Kafka Connectors

Combining Apache Kafka and Ignite makes/keeps your data pipeline
• Scalable
• Reliable
• Highly performant

Covers a wide range of ETL contexts


DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE


DATA PIPELINE WITH KAFKA AND IGNITE

Requirements
• instant processing and analysis
• scalable and resilient to failures
• low latency
• high throughput
• flexibility


DATA PIPELINE WITH KAFKA AND IGNITE

Filter and aggregate events

[Diagram: data flowing through Flume with a filter/transform step; annotations: "slow down on heavy loads", "more channels/layers"]


DATA PIPELINE WITH KAFKA AND IGNITE

[Diagram: data and other sources → filter, transform, etc. → derived streams]

• Parsimonious resource use
• Replay enabled
• More operations on streams
• Flexibility


DATA PIPELINE WITH KAFKA AND IGNITE

• Filter and aggregate events
• Store events
• Notify about updates on aggregates

[Diagram: data → filter, transform, etc. → derived streams, with Connectors attached]



DATA PIPELINE WITH KAFKA AND IGNITE

Improving ads delivery

[Diagram components: clicks and impressions; ads; Ads delivery; Ads recommender; storage/computation; Image storage; annotation: "data & computation in one place"]


DATA PIPELINE WITH KAFKA AND IGNITE

• Improving ads delivery
• Better network utilization and reliability

[Diagram components: clicks and impressions; ads; Ads delivery; Ads recommender; storage/computation; Image storage; Anomaly detection]


OTHER INTEGRATIONS


OTHER COMPLETED INTEGRATIONS

CAMEL MQTT STORM FLINK SINK TWITTER


THE END