IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub

Apache Ignite as a Data Processing Hub
Roman Shtykh, CyberAgent, Inc.
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org

Introduction

About Me
Roman Shtykh, R&D Engineer at CyberAgent, Inc.
Areas of focus: data streaming and NLP
Committer on the Apache Ignite and MyBatis projects
Judoka
@rshtykh

CyberAgent, Inc.
Internet ads, games, media, investing
* As of Sep 2015

Ameba Services
Monthly visitors (DUB total): 6 billion*
Number of member users: about 39 million*
* As of Dec 2014

Games, community services, content curation, and other services

Ameba Services: Ameba Pigg

Contents
Apache Ignite
Feed your data
Log aggregation with Apache Flume: integration with Apache Ignite
Streaming data with Apache Kafka
Data pipeline with Kafka and Ignite: example

Apache Ignite
A high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash-based technologies.

High performance, unlimited scalability and resiliency
High-performance transactions and fast analytics
Hadoop acceleration, Apache Spark integration
Apache project: https://ignite.apache.org/

Making Apache Ignite a Data Processing Hub
Question: how to feed data?
A simple solution: create a client node (a minimal sketch follows)
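Such a feeder can be written in a few lines; a minimal sketch, assuming a Spring XML config at an illustrative path and the testCache cache used later in the sink setup:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class ClientFeeder {
    public static void main(String[] args) {
        // Join the cluster as a client node: no data is stored on this JVM.
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("/some-path/ignite.xml")) {
            IgniteCache<String, String> cache = ignite.getOrCreateCache("testCache");

            // Feed data into the grid one entry at a time.
            cache.put("key1", "value1");
        }
    }
}
```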

Is it reliable?
Does it scale?
Ignite-only solution?
Does it keep your operational costs low?

Log Aggregation with Apache Flume

Log Aggregation with Apache Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Scalable
Flexible
Robust and fault tolerant
Declarative configuration (a sample agent definition follows)
Apache project
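To illustrate the declarative configuration, a minimal single-agent definition might look like the following; a sketch in which the netcat source and logger sink are placeholders, not components from the talk:

```properties
# Name the components of agent a1.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines on localhost:44444 (illustrative).
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: log events (stand-in for a real destination).
a1.sinks.k1.type = logger

# Wire the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```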

Data Flow in Flume
[Diagram: incoming data enters an Agent through a Source, is buffered in a Channel, and is drained by a Sink that passes it to another Agent or to its destination.]

Data Flow in Flume (Replication/Multiplexing)
[Diagram: a Channel Selector behind the Source replicates or multiplexes incoming data across several Channels, each drained by its own Sink.]

Data Flow in Flume (Reliability)
No data is lost (configurable)
[Diagram: the Source writes events to the Channel inside a source transaction, and the Sink removes them inside a sink transaction.]

Log Transfer at Ameba
[Diagram: Ameba services send logs to a tier of Aggregators, which fan out to Monitoring, a Recommender System, Elasticsearch, Hadoop (batch processing), HBase, and stream processing (Onix, HBaseSink).]

Log Transfer at Ameba
Web hosts: more than 1,600
Size: 5.0 TB/day (raw)
Traffic at peak: 160 Mbps (compressed)

Ignite Sink
Reads Flume events from a channel
Converts them into cacheable entries with a user-implemented, pluggable transformer
Adding it requires no modification to the existing architecture

Flume to Ignite
[Diagram, step 1: the Ignite Sink opens a new connection to the Ignite cluster.]
[Diagram, step 2: the Sink starts a sink transaction on the Channel.]
[Diagram, step 3: within the transaction, the Sink takes events from the Channel and sends them to Ignite.]

Enabling Flume Sink
Steps:
Implement EventTransformer to convert Flume events into cacheable entries (java.util.Map); a sketch follows the sink setup
Put the transformer's jar into ${FLUME_HOME}/plugins.d/ignite/lib
Put the IgniteSink and Ignite core jar files into ${FLUME_HOME}/plugins.d/ignite/libext
Set up a Flume agent

Sink setup:

a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
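The transformer named above could be implemented along these lines; a minimal sketch, assuming events carry UTF-8 string bodies and keying entries by a "timestamp" header (the keying scheme is illustrative):

```java
package my.company;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.ignite.stream.flume.EventTransformer;

public class MyEventTransformer implements EventTransformer<Event, String, String> {
    @Override
    public Map<String, String> transform(List<Event> events) {
        Map<String, String> entries = new HashMap<>();

        for (Event event : events) {
            // Key each entry by the event's "timestamp" header (illustrative).
            String key = event.getHeaders().get("timestamp");

            if (key != null)
                entries.put(key, new String(event.getBody(), StandardCharsets.UTF_8));
        }

        return entries;
    }
}
```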

Flume Sinks: HDFS, Thrift, Avro, HBase, ElasticSearch, IRC, Ignite

Apache Flume & Apache Ignite
If you do data aggregation with Flume: adding an Ignite cluster is as simple as writing a simple data transformer and deploying a new Flume agent
If you store your data (and do computations) in Ignite: improving data injection becomes easy with the Flume sink

Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing) scalable, reliable, and highly performant

Streaming Data with Apache Kafka

Apache Kafka
Publish-subscribe messaging rethought as a distributed commit log
Low latency
High throughput
Partitioned and replicated
Kafka is an essential component of any data pipeline today
http://kafka.apache.org/

Apache Kafka
Messages are grouped in topics
Each partition is a log
Each partition is managed by a broker (when replicated, one broker is the partition leader)
Producers & consumers (consumer groups); a minimal producer sketch follows
Used for: log aggregation, activity tracking, monitoring, stream processing
http://kafka.apache.org/documentation.html
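For concreteness, a producer sketch (the broker address, topic, and record are illustrative; the record key determines the partition the record lands on):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key go to the same partition (the same log).
            producer.send(new ProducerRecord<>("someTopic1", "user-42", "page-view"));
        }
    }
}
```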

Kafka Connect
Designed for large-scale stream data integration using Kafka
Provides an abstraction from communication with your Kafka cluster: offset management, delivery semantics, fault tolerance, monitoring, etc.
Worker (scalability & fault tolerance), Connector (task config), Task (thread); a minimal worker setup sketch follows
Standalone & distributed execution models
http://www.confluent.io/blog/apache-kafka-0.9-is-released
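The worker that hosts connectors is configured separately from the connectors themselves; a minimal standalone worker properties sketch (file paths and converter choices are illustrative):

```properties
# connect-standalone.properties (illustrative)
bootstrap.servers=localhost:9092

# How record keys/values are (de)serialized as they pass through Connect.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter

# Where the standalone worker stores source offsets, and how often to flush.
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
```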

Ingesting Data Streams
Two ways: Kafka Streamer, Sink Connector
[Diagram: Kafka Connect moves records from Kafka into Ignite (ETL); inside Ignite the data is available to SQL queries, distributed closures, and transactions.]

Streaming via Sink Connector
Configure your connector
Configure the Kafka Connect worker
Start your connector

# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml

$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties

Streaming via Sink Connector
Easy data pipeline
Records from Kafka are written to the Ignite grid via the high-performance IgniteDataStreamer (a direct-use sketch follows)
At-least-once delivery guarantee
As of 1.6, start a new connector to write to a different cache
[Diagram: Kafka offsets 0, 1, 2, ... over records a, b, c, d, e; each record's key and value (a.key, a.val; b.key, b.val; ...) are streamed into the Ignite cache.]
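The IgniteDataStreamer the sink relies on can also be driven directly; a minimal sketch, with the cache name mirroring the connector config and an illustrative config path:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class DirectStreaming {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("/some-path/ignite.xml");
             IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("myCache")) {
            // Same effect as the connector's cacheAllowOverwrite=true.
            streamer.allowOverwrite(true);

            // Entries are buffered and flushed to the grid in batches.
            streamer.addData("a.key", "a.val");
            streamer.addData("b.key", "b.val");
        } // close() flushes any remaining buffered entries.
    }
}
```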

Ingesting Data Streams
Bi-directional streaming
[Diagram: a sink connector streams records from Kafka into Ignite, while a source connector streams cache events from Ignite (via continuous queries) back to Kafka; inside Ignite the data is available to SQL queries, distributed closures, and transactions.]

Streaming Back to Kafka
Listening to cache events: PUT, READ, REMOVED, EXPIRED, etc.
Remote filtering can be enabled
Kafka Connect offsets are ignored
Currently, no delivery guarantees
[Diagram: cache events evt1, evt2, evt3 flow to Kafka as records.]

Enabling Source Connector
Configure your connector
Define a remote filter if needed: cacheFilterCls=MyCacheEventFilter (a sketch follows the config)
Make sure that event listening is enabled on the server nodes
Configure the Kafka Connect worker
Start your connector

# connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
# topics, events
topicNames=test
cacheEvts=put,removed
# cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
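The remote filter referenced by cacheFilterCls could be shaped like this; a sketch under the assumption that the connector loads the filter as an IgnitePredicate over cache events (the key prefix rule is illustrative). Since the filter runs remotely, its class must be deployed on the server nodes.

```java
import org.apache.ignite.events.CacheEvent;
import org.apache.ignite.lang.IgnitePredicate;

public class MyCacheEventFilter implements IgnitePredicate<CacheEvent> {
    @Override
    public boolean apply(CacheEvent evt) {
        // Forward only events whose keys start with "ads-" (illustrative rule).
        Object key = evt.key();
        return key instanceof String && ((String)key).startsWith("ads-");
    }
}
```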

Apache Kafka & Apache Ignite
If you do data streaming with Kafka: adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need it for the source)
If you store your data (and do computations) in Ignite: improving data injection and listening for events on data becomes easy with the Kafka connectors

Combining Apache Kafka and Ignite makes/keeps your data pipeline scalable, reliable, and highly performant, and covers a wide range of ETL contexts

Data Pipeline with Kafka and Ignite: Example
Requirements:
Instant processing and analysis
Scalable and resilient to failures
Low latency
High throughput
Flexibility

Data Pipeline with Kafka and Ignite
Filter and aggregate events
[Diagram: a Flume-only pipeline in which data is filtered/transformed across chained agents; it slows down on heavy loads and requires more channels/layers.]

Data Pipeline with Kafka and Ignite
[Diagram: Kafka replaces the intermediate layers; data is filtered, transformed, etc. on topics, with derived streams and other sources feeding in.]
Parsimonious resource use
Replay enabled
More operations on streams
Flexibility: derived streams, other sources

Data Pipeline with Kafka and Ignite
Filter and aggregate events
Store events
Notify about updates on aggregates
[Diagram: data is filtered/transformed in Kafka; connectors move records and derived streams between Kafka and Ignite.]

Data Pipeline with Kafka and Ignite
Improving ads delivery
[Diagram: clicks and impressions flow through Kafka into an Ignite storage/computation cluster shared by ads delivery, an ads recommender, and image storage, keeping data & computation in one place.]

Data Pipeline with Kafka and Ignite
Improving ads delivery
Better network utilization and reliability
[Diagram: the same pipeline with anomaly detection added alongside ads delivery and the ads recommender.]

Other Integrations

Other completed integrations: Camel, MQTT, Storm, Flink sink, Twitter

THE END
