Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming and MLlib

The stack for distributed, massively scalable, (near) real-time data processing and machine learning

Presented by

Phaneendra Chiruvella

http://twitter.com/pcx66

Hyderabad Spark Group & Zemoso Technologies

Agenda

● Brief intro to Spark Core

● Introduction to Spark Streaming

● What is the world talking about?: A demo of Spark Streaming with Twitter

● Introduction to Spark MLlib

● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine

Spark: Lightning-fast cluster computing

● Data processing engine

● Distributed

● Massively scalable: the largest known cluster has 8,000 machines, with petabytes of data processed

● Programmable in Scala, Java, Python and R

● Interactive shell

● Both batch and stream processing

● Stable and robust: used in production at many companies

● Known to work well with other “Big data” tools such as Kafka, Cassandra, HDFS and HBase

Image source: http://spark.apache.org/docs/latest/cluster-overview.html

Spark: How does it work?

● Every application has its own SparkContext

● Available cluster managers: Spark Standalone, YARN, Mesos

Image source: http://spark.apache.org/docs/latest/cluster-overview.html

Spark: Resilient Distributed Datasets

RDD is the fundamental abstraction of Spark, providing a rich, fault-tolerant layer over a cluster of machines

[Diagram labels: SparkContext, Executors, RDD]

Spark Core: Demo

● Creating RDDs

● Transformations

● Actions

● Cache
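The four demo topics above can be illustrated without a cluster. The following is a toy, single-machine sketch in plain Python, not Spark's API: `ToyRDD` and its `log` argument are hypothetical names invented here. It only mimics the idea that transformations are lazy, actions trigger the work, and caching avoids recomputation.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """A tiny stand-in for an RDD (hypothetical; real RDDs are distributed)."""

    def __init__(self, compute: Callable[[], Iterable], log: List[str]):
        self._compute = compute          # deferred recipe that produces the elements
        self._log = log                  # shared log of actual source reads
        self._cache_requested = False
        self._cached: Optional[list] = None

    @staticmethod
    def parallelize(data: list, log: List[str]) -> "ToyRDD":
        def read():
            log.append("read source")    # recorded only when data is really read
            return iter(data)
        return ToyRDD(read, log)

    # Transformations: build a new ToyRDD, do no work yet.
    def map(self, fn) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()), self._log)

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)), self._log)

    # Cache: remember the elements the first time they are computed.
    def cache(self) -> "ToyRDD":
        self._cache_requested = True
        return self

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_requested:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    # Actions: trigger the computation and return a result.
    def collect(self) -> list:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


log: List[str] = []
numbers = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], log)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
reads_before_action = len(log)       # still 0: transformations are lazy
result = evens_squared.collect()     # first action reads the source
total = evens_squared.count()        # second action is served from the cache
```

In real Spark the same pipeline would be built from a SparkContext and run on executors; the point here is only the lazy-until-action behaviour and the effect of caching.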

Spark Streaming: batch processing is not enough!

● Extension to the Core API

● Micro-batches processed in real time

● Latency reduced to the order of seconds

Spark Streaming: How does it work?

● DStreams: just a chain of RDDs

● Batch Interval, Input DStreams and Receivers

● Some Input Sources: Sockets, File systems, Kafka, Twitter

Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
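To make the batch-interval idea concrete, here is a minimal plain-Python sketch of how a stream of timestamped events could be cut into micro-batches. It is an illustration only: `to_micro_batches` is a hypothetical helper, timestamps are assumed to be non-negative seconds since the stream started, and each inner list stands in for one RDD in the DStream.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def to_micro_batches(events: List[Tuple[float, str]],
                     batch_interval: float) -> List[List[str]]:
    """Group (timestamp_seconds, payload) events into fixed-length batches."""
    if not events:
        return []
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # The batch index is how many whole intervals have elapsed.
        buckets[int(timestamp // batch_interval)].append(payload)
    last = max(buckets)
    # Emit every interval in order, including empty ones.
    return [buckets.get(i, []) for i in range(last + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = to_micro_batches(events, batch_interval=1.0)
# Any per-batch computation is then just ordinary batch code run repeatedly.
batch_sizes = [len(batch) for batch in batches]
```

With a one-second interval the third batch is empty because no event arrived between seconds 2 and 3; an interval with no data still yields a batch.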

Spark Streaming: How does it work?

● Windowed operations

● DStream Transformations are translated to RDD Transformations

● Direct access to RDDs underneath

Image source: spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming: Demo

● What is the world talking about?: A Twitter stream analysis
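The demo code itself is not reproduced in these slides. As a rough idea of what a windowed hashtag count involves, here is a plain-Python sketch under stated assumptions: each micro-batch is a list of tweet texts, the window length and slide interval are measured in batches, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def extract_hashtags(text: str) -> List[str]:
    """Return lower-cased hashtags from a whitespace-separated tweet text."""
    return [w.lower() for w in text.split() if w.startswith("#") and len(w) > 1]


def windowed_hashtag_counts(batches: List[List[str]],
                            window_length: int,
                            slide_interval: int) -> List[Counter]:
    """Count hashtags over a sliding window of micro-batches."""
    results = []
    # `end` is exclusive; each window covers the last `window_length` batches.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        counts = Counter(
            tag for batch in window for text in batch
            for tag in extract_hashtags(text)
        )
        results.append(counts)
    return results


tweet_batches = [
    ["#spark rocks", "learning #Spark"],
    ["#scala and #spark"],
    ["#mllib demo"],
    ["#spark #mllib"],
]
windows = windowed_hashtag_counts(tweet_batches, window_length=2, slide_interval=1)
trending = [w.most_common(1)[0] for w in windows]
```

The top hashtag changes as old batches slide out of the window, which is the behaviour a "what is trending right now" view relies on.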

Spark MLlib: Just analytics is not enough!

● Practical, scalable ML library with implementations of several common algorithms, with more being added

● An alternative high-level API, spark.ml, is based on spark.sql.DataFrame. It is out of scope for this talk.

Spark MLlib: Demo

● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine
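The slides do not say which algorithm the demo uses. MLlib's collaborative filtering is built on alternating least squares (ALS), so the sketch below assumes that approach and shows it in its simplest form: a rank-1 factorization in plain Python on a tiny made-up ratings set. `als_rank1` and the sample data are hypothetical, and real MLlib ALS uses higher ranks and runs distributed.

```python
from typing import List, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating)


def als_rank1(ratings: List[Rating], num_users: int, num_movies: int,
              reg: float = 0.01, iterations: int = 30
              ) -> Tuple[List[float], List[float]]:
    """Rank-1 ALS: approximate rating(u, m) by user_f[u] * movie_f[m]."""
    user_f = [1.0] * num_users
    movie_f = [1.0] * num_movies
    for _step in range(iterations):
        # Fix movie factors; each user's factor has a closed-form
        # regularised least-squares solution.
        for u in range(num_users):
            num = sum(r * movie_f[m] for (uu, m, r) in ratings if uu == u)
            den = reg + sum(movie_f[m] ** 2 for (uu, m, _r) in ratings if uu == u)
            user_f[u] = num / den
        # Fix user factors; solve for each movie's factor the same way.
        for m in range(num_movies):
            num = sum(r * user_f[u] for (u, mm, r) in ratings if mm == m)
            den = reg + sum(user_f[u] ** 2 for (u, mm, _r) in ratings if mm == m)
            movie_f[m] = num / den
    return user_f, movie_f


# Made-up ratings; user 2 has not rated movie 2.
sample_ratings: List[Rating] = [
    (0, 0, 1.0), (0, 1, 2.0), (0, 2, 2.5),
    (1, 0, 2.0), (1, 1, 4.0), (1, 2, 5.0),
    (2, 0, 1.0), (2, 1, 2.0),
]
user_factors, movie_factors = als_rank1(sample_ratings, num_users=3, num_movies=3)
predicted = user_factors[2] * movie_factors[2]  # estimate for the unseen pair
```

User 2 rates the first two movies exactly as user 0 does, so the model predicts that user 2 would rate the third movie close to user 0's 2.5.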

Spark: Streaming and MLlib, a match made in heaven!

● MLlib provides algorithms that can learn from streaming data and simultaneously be applied to that same stream!

● Also, a large set of algorithms that can learn offline and then be applied to streaming data
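MLlib's streaming algorithms include, for example, StreamingLinearRegressionWithSGD and StreamingKMeans. The plain-Python sketch below is not that API; it only illustrates the "learn and apply on the same stream" pattern with a hypothetical `OnlineLinearModel`, synthetic data generated from y = 2x + 1, and one gradient step per micro-batch.

```python
from typing import List, Tuple


class OnlineLinearModel:
    """y ~ weight * x + bias, updated with one gradient step per batch."""

    def __init__(self, learning_rate: float = 0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

    def train_on_batch(self, batch: List[Tuple[float, float]]) -> None:
        if not batch:
            return  # an empty micro-batch leaves the model unchanged
        n = len(batch)
        # Gradients of half the mean squared error over the batch.
        grad_w = sum((self.predict(x) - y) * x for x, y in batch) / n
        grad_b = sum((self.predict(x) - y) for x, y in batch) / n
        self.weight -= self.learning_rate * grad_w
        self.bias -= self.learning_rate * grad_b


# Synthetic stream: every micro-batch holds the same four labelled points.
stream = [[(float(x), 2.0 * x + 1.0) for x in range(4)] for _batch in range(300)]

model = OnlineLinearModel()
mean_abs_errors = []
for batch in stream:
    # Apply the current model to the incoming batch first...
    errors = [abs(model.predict(x) - y) for x, y in batch]
    mean_abs_errors.append(sum(errors) / len(errors))
    # ...then learn from that same batch.
    model.train_on_batch(batch)
```

Prediction error is large on the first batches and shrinks as more batches arrive, without ever stopping the stream to retrain.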

Spark: What next?

● Spark SQL - A SQL-like layer over RDDs

● spark.ml

● Spark GraphX - A graph-processing abstraction over RDDs

● Apache Storm and Apache Flink - Modern streaming-first systems

Q&A

Thank you!

Slide deck will be made available at: http://blog.zemosolabs.com/

Spark Docs are a great place to get started: http://spark.apache.org/docs/latest/programming-guide.html

Acknowledgements:

● Code demos are from Databricks Training

● Memes generated from ImgFlip.com
