Spark
Brian O’Neill (@boneill42), Monetate
Agenda
● History / Context
  ○ Hadoop
  ○ Lambda
● Spark Basics
  ○ RDDs, DataFrames, SQL, Streaming
● Play along / Demo
We work at Monetate...
[Architecture diagram: a consumer interacts with the Client (e.g. a Retailer); its Decision Engine sends Data to an Analytics Engine; a marketer works from a Dashboard backed by a Warehouse of Meta(data) and Observations.]
We call it a... Personalization Platform
Not so hard until...
● m’s → B’s (sessions / month)
● 100ms’s → 10ms’s (response times)
● days → minutes (analytics lag)
HISTORY
history - hadoop
map / reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
word count: The Code
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject { |sum, x| sum + x }
  emit(key, sum)
end

The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"
map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)
reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
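The run above can be sketched in plain Java, no Hadoop required; the `WordCount` class and its single-process loop are illustrative only, collapsing the map and reduce phases into one pass:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Single-process sketch of the map/reduce word count (not actual Hadoop).
public class WordCount {
    public static Map<String, Integer> count(String... docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String doc : docs) {
            // "map" phase: emit (word, 1) for each word...
            for (String word : doc.split("\\s+")) {
                // ..."reduce" phase: sum the 1s per key as they arrive
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("boy meets girl", "girl likes boy"));
        // {boy=2, meets=1, girl=2, likes=1}
    }
}
```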
Jobs on top of jobs...
Real-time? Different hammer.
Let’s invent some terminology...
Traditional lambda...
Can we collapse the lambda?
Spark: FTW!
Lambda on Spark (e.g.)
[Diagram: S3, Kafka, and MySQL feed RDDs; the RDDs feed a DataFrame; results are served from Druid.]
SPARK BASICS
Concept : RDDs“Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.”
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
Concept : Transformations & Actions
Transformation: RDD(s) → RDD
  e.g. map, filter, groupBy, etc.
Action: RDD → value
  e.g. reduce, count, etc.
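As a loose single-machine analogy (java.util.stream, not the Spark API), intermediate Stream operations play the role of transformations and terminal operations play the role of actions:

```java
import java.util.stream.Stream;

// Analogy only: java.util.stream runs on one JVM, Spark RDDs are distributed.
public class TransformVsAction {
    public static long demo() {
        Stream<String> words = Stream.of("boy", "meets", "girl");
        // "Transformations": describe a derived dataset; nothing runs yet
        Stream<String> longWords = words.map(String::toUpperCase)
                                        .filter(w -> w.length() > 3);
        // "Action": forces evaluation and returns a value to the caller
        return longWords.count(); // 2 ("MEETS", "GIRL")
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```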
Code: RDDs
JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
    .cassandraTable("java_api", "products", productReader)
    .keyBy(new Function<Product, Integer>() {
        @Override
        public Integer call(Product product) throws Exception {
            return product.getId();
        }
    });
DAGs
Lazily evaluated!
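The same laziness can be demonstrated with a plain-Java Streams sketch (an analogy, not Spark itself): the mapping function does not run until a terminal operation fires, just as Spark defers executing the DAG until an action is called:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Sketch of lazy evaluation: the counter proves map() has not run
// until the terminal operation ("action") is invoked.
public class LazyDemo {
    public static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = Stream.of(1, 2, 3)
            .map(x -> { calls.incrementAndGet(); return x * 10; }); // "transformation"
        int before = calls.get();                // still 0: nothing has executed
        int total = pipeline.mapToInt(Integer::intValue).sum();     // "action"
        return new int[]{before, calls.get(), total};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 0 3 60
    }
}
```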
Concept : DataFrames
DataFrames = RDD + Schema
“A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.”
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
Concept : Spark SQL
SELECT min(event_time) AS start_time,
       max(event_time) AS end_time,
       account_id
FROM events
GROUP BY account_id
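What that query computes can be sketched in plain Java over hypothetical event records (no Spark or SQL engine involved): the per-account min and max of event_time, with times modeled as longs:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Per-account {start_time, end_time}, mirroring the GROUP BY query above.
public class SessionBounds {
    static final class Event {
        final String accountId;
        final long eventTime;
        Event(String accountId, long eventTime) {
            this.accountId = accountId;
            this.eventTime = eventTime;
        }
    }

    public static Map<String, long[]> bounds(List<Event> events) {
        Map<String, long[]> out = new TreeMap<>();
        for (Event e : events) {
            // merge each event into the running {min, max} for its account
            out.merge(e.accountId, new long[]{e.eventTime, e.eventTime},
                (a, b) -> new long[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a1", 100), new Event("a1", 250), new Event("a2", 90));
        for (Map.Entry<String, long[]> e : bounds(events).entrySet()) {
            System.out.println(e.getKey() + " start=" + e.getValue()[0]
                               + " end=" + e.getValue()[1]);
        }
    }
}
```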
Code: SQL + Dataframes
StructType schema = configuration.getSchemaForProduct();
DataFrame dataFrame = sqlContext.createDataFrame(productsRDD, schema);
sqlContext.registerDataFrameAsTable(dataFrame, "products");
And remember Uncle Ben…
“With great power, comes great responsibility.”
Concept : Streaming
.foreachRDD
Code: Streaming
JavaStreamingContext streamingContext = new JavaStreamingContext(getSparkConf(),
    SessionizerState.getConfig().getSparkStreamingBatchDuration());

JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(...);

kinesisStream.foreachRDD(new VoidFunction<JavaRDD<byte[]>>() {
    @Override
    public void call(JavaRDD<byte[]> rdd) throws Exception {
        JavaRDD<String> lines = rdd.map(new Function<byte[], String>() {
            public String call(byte[] bytes) throws IOException {
                return new String(bytes, Charset.forName("UTF-8"));
            }
        });
        processRdd(lines);
    }
});
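The shape of that code — decode every record of each micro-batch, then hand the batch to a processing function — can be sketched in plain Java (hypothetical in-memory batches; no Kinesis or Spark):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Each inner list stands in for one micro-batch RDD of raw bytes.
public class MicroBatch {
    public static List<String> process(List<List<byte[]>> batches) {
        List<String> lines = new ArrayList<>();
        for (List<byte[]> rdd : batches) {      // plays the role of foreachRDD
            for (byte[] record : rdd) {         // plays the role of rdd.map(...)
                lines.add(new String(record, StandardCharsets.UTF_8));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<List<byte[]>> batches = Arrays.asList(
            Arrays.asList("a".getBytes(StandardCharsets.UTF_8)),
            Arrays.asList("b".getBytes(StandardCharsets.UTF_8)));
        System.out.println(process(batches)); // [a, b]
    }
}
```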
DEPLOYMENT
Basic Architecture
http://spark.apache.org/docs/latest/cluster-overview.html
YARN!
Kinesis / Streaming Architecture
Amazon’s EMR
Play along demo.
Get stuff...
Get Spark... http://spark.apache.org/downloads.html
Get Cassandra... http://cassandra.apache.org/download/
Get Code... https://github.com/boneill42/spark-on-cassandra-quickstart
Configure stuff...
$spark/conf> cp spark-env.sh.template spark-env.sh
$spark/conf> echo "SPARK_MASTER_IP=127.0.0.1" >> spark-env.sh
Start stuff...
# Start Master
$spark> sbin/start-master.sh
$spark> tail -f logs/*

# Start Worker
$spark> bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077
Build and launch stuff...
# Build
$code> mvn clean install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

# Launch
$code> spark-submit --class com.github.boneill42.JavaDemo \
    --master spark://127.0.0.1:7077 \
    target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://127.0.0.1:7077 127.0.0.1
A message from our sponsor
Advertisements...https://github.com/monetate/koupler
https://github.com/monetate/ectou-metadata
https://github.com/monetate/ectou-export