Real-time Analytics with Spark
Maciej Dabrowski, Chief Data Scientist, Altocloud
Galway Data Meetup, 2015-02-03
2
MEETS A SMALL STARTUP
source: https://media.licdn.com/mpr/mpr/p/1/005/0a0/167/2f98d60.jpg
‣ We build predictive communications software that uses analytics to improve customer interactions and experience
Altocloud
3
Monitoring live users
4
5
‣ Real-time for us means under 1-5 seconds
‣ Q: How many customers are currently online?
‣ Q: How many chats/calls are taking place at the moment?
‣ Q: What is the utilisation of my customer support agents?
Use Case 1: Real-time analytics
7
‣ Q: How many calls were offered in the last week?
‣ Q: What is the acceptance rate of my chat offers?
Use Case 2: Reporting
8
‣ Q: Which customers currently on my site should I engage?
Use Case 3: Predictive Analytics
9
‣ Scalability
‣ Limited resources
‣ Various analytics use cases
Technical challenges
10
11
Real-time analytics with Hadoop
source: http://barbarashdwallpapers.com/funny-elephant-wallpapers/
APIs
QUERYING LAYER
STORAGE LAYER
PROCESSING LAYER
Altocloud Platform
12
MESSAGE QUEUES
FRONT-END APIs KAFKA
SPARK
RABBIT MQ
CASSANDRA
SPARK STREAMING
HDFS
BACK-END APIS
APPS
BACK-END APIs
MONGODB
DATA SOURCES
QUERYING LAYER
STORAGE LAYER
PROCESSING LAYER
Altocloud Data Platform
13
MESSAGE QUEUES
FRONT-END APIs KAFKA
MONGODB OPLOG
SPARK
RABBIT MQ
CASSANDRA
SPARK STREAMING
HDFS
FRONT-END APIS
APPS
MONGODB
‣ One code base for streaming and batch processing
‣ Rich API in Scala/Python/Java
‣ Fast for iterative algorithms (important for ML)
‣ Growing community
‣ The concept of a micro-batch
‣ Nicely integrates with Kafka and Cassandra
‣ Fairly easy setup
Why Spark
14
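The micro-batch concept mentioned above is what lets one code base serve both streaming and batch: the continuous stream is chopped into small fixed-length batches, and each batch is processed with ordinary batch operations. A minimal plain-Python sketch of the idea (not the Spark API; the function name and event shape are illustrative):

```python
from collections import defaultdict

def micro_batches(events, batch_seconds=1):
    """Group (timestamp, payload) events into fixed-size time buckets,
    the way Spark Streaming slices a stream into micro-batches."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[int(ts // batch_seconds)].append(payload)
    # each bucket can now be handed to ordinary batch-processing code
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
# → three batches: ["a", "b"], ["c"], ["d"]
```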
Spark components
15
‣ Hadoop
‣ Spark
Word count in Spark
16
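The Spark word count contrasted on this slide boils down to flatMap → map → reduceByKey. Since the Spark calls themselves need a cluster, this sketch isolates the pure counting logic, with the assumed PySpark equivalent shown in comments (paths are placeholders):

```python
import re
from collections import Counter

def count_words(lines):
    """Tokenise lines and count word occurrences."""
    words = (w for line in lines for w in re.findall(r"\w+", line.lower()))
    return Counter(words)

# Assumed PySpark equivalent (needs a SparkContext `sc`):
# counts = (sc.textFile("hdfs:///input")
#             .flatMap(lambda line: line.lower().split())
#             .map(lambda w: (w, 1))
#             .reduceByKey(lambda a, b: a + b))

# count_words(["to be or not to be"])["to"] == 2
```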
‣ Example: user event aggregation stored in Cassandra
‣ Still much better than Hadoop!
What about something more useful?
17
‣ User activity is the input (e.g. a page view)
‣ Users from multiple businesses online
‣ Scale: 100s to 100,000s of activities per second
‣ Response time under 5s
‣ A perfect use case for Spark Streaming
Counting users currently online
18
‣ Pub-sub message broker
‣ Fast: 100s of MB/s on a single broker
‣ Scalable: partitioned data streams
‣ Durable: messages persisted and replicated
‣ Distributed: strong durability and fault tolerance
‣ Downside: requires ZooKeeper
see https://kafka.apache.org
Data source: Kafka
19
‣ Kafka with Spark: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
Spark and Kafka
20
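In this setup Spark Streaming receives Kafka messages as (key, value) pairs; the part worth unit-testing is the event decoding, sketched here with invented field names (`org_id`, `user_id`, `type` are illustrative, not Altocloud's schema). The Spark/Kafka wiring itself is shown as assumed usage in comments:

```python
import json

def parse_event(message_value):
    """Decode a JSON-encoded user event from a Kafka message value.
    Returns (org_id, user_id, event_type), or None for malformed input."""
    try:
        event = json.loads(message_value)
        return (event["org_id"], event["user_id"], event["type"])
    except (ValueError, KeyError, TypeError):
        return None

# Assumed wiring via Spark's Kafka receiver (needs a StreamingContext `ssc`,
# ZooKeeper, and a running broker; names are illustrative):
# from pyspark.streaming.kafka import KafkaUtils
# stream = KafkaUtils.createStream(ssc, "zk:2181", "analytics-group", {"events": 1})
# events = stream.map(lambda kv: parse_event(kv[1])).filter(lambda e: e is not None)
```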
‣ Simple count of unique events
‣ Count visit events for unique users
Count users online
21
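Per micro-batch, "users currently online" is just the number of distinct user ids seen in the window, per business. A plain-Python sketch of the exact-set version (the next slide's HyperLogLog replaces these sets when memory becomes an issue; names like `org_id` are illustrative):

```python
from collections import defaultdict

def users_online(batch):
    """batch: iterable of (org_id, user_id) visit events.
    Returns {org_id: number of unique users seen in this batch}."""
    seen = defaultdict(set)
    for org_id, user_id in batch:
        seen[org_id].add(user_id)          # duplicates collapse in the set
    return {org: len(users) for org, users in seen.items()}

batch = [("acme", "u1"), ("acme", "u2"), ("acme", "u1"), ("globex", "u9")]
# → {"acme": 2, "globex": 1}
```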
‣ Twitter Algebird to the rescue!
‣ HyperLogLog - a probabilistic data structure saving a lot of memory!
‣ https://github.com/twitter/algebird
Sets can take a lot of memory!
22
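The idea behind HyperLogLog (as implemented in Algebird) is: hash each item, use the first p bits to pick one of 2^p small registers, and keep the maximum "position of the first 1-bit" seen per register; the registers then yield a cardinality estimate in a fixed, tiny amount of memory. A simplified pure-Python sketch of the algorithm, not the Algebird API:

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                    # number of registers (1024 for p=10)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)           # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:  # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for user_id in range(10000):
    hll.add(user_id)
# hll.count() lands within a few percent of 10000, while the 1024 registers
# replace a set holding all 10,000 raw ids
```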
‣ Easy to set up
‣ High availability: no master
‣ Great performance
‣ CQL: SQL-like querying
‣ Great support and bug-free drivers from DataStax
‣ Key: design your schema around queries
see https://cassandra.apache.org
Storing your results
23
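One hedged illustration of "design your schema around queries": if the query is "users online for business X, most recent intervals first", make the business the partition key and the interval a clustering column, so the answer is a single-partition slice read. All names here are invented for illustration:

```sql
-- Illustrative CQL; keyspace/table/column names are made up.
CREATE TABLE analytics.users_online (
    org_id   text,
    interval timestamp,
    count    int,
    PRIMARY KEY (org_id, interval)
) WITH CLUSTERING ORDER BY (interval DESC);

-- "Users online right now" is then a cheap single-partition query:
-- SELECT count FROM analytics.users_online WHERE org_id = 'acme' LIMIT 1;
```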
‣ DataStax driver is very easy to use
‣ Save our results to Cassandra
Store data in Cassandra
24
25
source: http://top1walls.com
‣ A Spark Streaming job performs two major tasks:
• data receiving
• data processing
‣ Receiver always takes one core
‣ Technically, you need 2N cores to run N streaming jobs
‣ Not a big deal in production, but what about testing?
Spark streaming
26
‣ Containerise your app including all its dependencies
‣ Distribute your app in this standard container
‣ Run it on any machine with docker
‣ Very lightweight
Docker
27
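A hedged sketch of what such a container might look like for a Spark driver or executor; the base image, Spark version, and paths are illustrative, not Altocloud's actual setup:

```dockerfile
# Illustrative Dockerfile: package the app with its Spark runtime so the
# same container runs on a 4-core box or a 2-core box alike.
FROM java:7
COPY spark-1.2.0-bin-hadoop2.4 /opt/spark
COPY app.jar /opt/app/app.jar
ENV SPARK_HOME=/opt/spark
CMD ["/opt/spark/bin/spark-submit", "--class", "Main", "/opt/app/app.jar"]
```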
c3.xlarge: 4 cores
‣ AWS example
Spark
SPARK EXECUTOR
c3.large: 2 cores
SPARK DRIVER
SPARK EXECUTOR
CORE 1 CORE 2 CORE 3 CORE 4
c3.xlarge: 4 cores
‣ AWS example
Spark on Docker
c3.large: 2 cores
SPARK DRIVER
CORE 1 CORE 2 CORE 3 CORE 4
docker-1: 4 “cores”
SPARK EXECUTOR
C1 C2 C3 C4
docker-2: 4 “cores”
SPARK EXECUTOR
C1 C2 C3 C4
SPARK EXECUTOR
‣ Spark Streaming is fast to deploy but tuning is VERY important
‣ The lower the number of tasks, the better (in general)
‣ When reading from Kafka, make sure you configure spark.streaming.blockInterval
‣ Optimize your jobs when possible: similar jobs can sometimes be merged
‣ Persist your data from the workers, NOT the driver
Spark Streaming
30
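As one concrete form the tuning points above can take, a hedged spark-defaults.conf fragment; the property names are real Spark settings, but the values are examples, not recommendations:

```
spark.streaming.blockInterval      200ms    # larger blocks -> fewer tasks per batch
spark.streaming.receiver.maxRate   10000    # records/s per receiver, protects downstream
spark.default.parallelism          8        # match total executor cores
```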
‣ OLAP-type queries using Spark SQL
‣ More advanced performance testing
‣ Detailed unit testing
‣ More batch jobs
Where do we go from here?
31
‣ Spark Documentation
‣ Reference application: http://github.com/killrweather/killrweather
‣ Productionalizing Spark Streaming
‣ Spark and Kafka
‣ Docker
‣ Free Hadoop Training from MapR
‣ Free edX course on Spark
Resources
32