Spark
Brian O’Neill (@boneill42), Monetate
Agenda
● History / Context
  ○ Hadoop
  ○ Lambda
● Spark Basics
  ○ RDDs, DataFrames, SQL, Streaming
● Play along / Demo
We work at Monetate...
[Architecture diagram: a consumer interacts with the Client (e.g. a Retailer); its Decision Engine sends Data to an Analytics Engine; a marketer works from a Dashboard backed by a Warehouse of Meta(data) and Observations.]
We call it a... Personalization Platform
Not so hard until...
● m’s → B’s (sessions / month)
● 100ms’s → 10ms’s (response times)
● days → minutes (analytics lag)
HISTORY
history - hadoop
map / reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
word count: The Code
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject { |sum, x| sum + x }
  emit(key, sum)
end

The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"
map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)
reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
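The run above can be sketched in plain Java, no Hadoop required; the `WordCount` class and its single-process loop are illustrative only, collapsing the map and reduce phases into one pass:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Single-process sketch of the map/reduce word count (not actual Hadoop).
public class WordCount {
    public static Map<String, Integer> count(String... docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String doc : docs) {
            // "map" phase: emit (word, 1) for each word...
            for (String word : doc.split("\\s+")) {
                // ..."reduce" phase: sum the 1s per key as they arrive
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("boy meets girl", "girl likes boy"));
        // {boy=2, meets=1, girl=2, likes=1}
    }
}
```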
Jobs on top of jobs...
Real-time? Different hammer.
Let’s invent some terminology...
Traditional lambda...
Can we collapse the lambda?
Spark: FTW!
Lambda on Spark (e.g.)
[Diagram: S3, Kafka, and MySQL feed RDDs; the RDDs feed a DataFrame; results are served from Druid.]
SPARK BASICS
Concept : RDDs“Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.”
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
Concept : Transformations & Actions
Transformation: RDD(s) → RDD
  e.g. map, filter, groupBy, etc.
Action: RDD → value
  e.g. reduce, count, etc.
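As a loose single-machine analogy (java.util.stream, not the Spark API), intermediate Stream operations play the role of transformations and terminal operations play the role of actions:

```java
import java.util.stream.Stream;

// Analogy only: java.util.stream runs on one JVM, Spark RDDs are distributed.
public class TransformVsAction {
    public static long demo() {
        Stream<String> words = Stream.of("boy", "meets", "girl");
        // "Transformations": describe a derived dataset; nothing runs yet
        Stream<String> longWords = words.map(String::toUpperCase)
                                        .filter(w -> w.length() > 3);
        // "Action": forces evaluation and returns a value to the caller
        return longWords.count(); // 2 ("MEETS", "GIRL")
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```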
Code: RDDs
JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
    .cassandraTable("java_api", "products", productReader)
    .keyBy(new Function<Product, Integer>() {
        @Override
        public Integer call(Product product) throws Exception {
            return product.getId();
        }
    });
DAGs
Lazily evaluated!
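The same laziness can be demonstrated with a plain-Java Streams sketch (an analogy, not Spark itself): the mapping function does not run until a terminal operation fires, just as Spark defers executing the DAG until an action is called:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Sketch of lazy evaluation: the counter proves map() has not run
// until the terminal operation ("action") is invoked.
public class LazyDemo {
    public static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = Stream.of(1, 2, 3)
            .map(x -> { calls.incrementAndGet(); return x * 10; }); // "transformation"
        int before = calls.get();                // still 0: nothing has executed
        int total = pipeline.mapToInt(Integer::intValue).sum();     // "action"
        return new int[]{before, calls.get(), total};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 0 3 60
    }
}
```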
Concept : DataFrames
DataFrames = RDD + Schema
“A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.”
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
Concept : Spark SQL
SELECT min(event_time) AS start_time,
       max(event_time) AS end_time,
       account_id
FROM events
GROUP BY account_id
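What that query computes can be sketched in plain Java over hypothetical event records (no Spark or SQL engine involved): the per-account min and max of event_time, with times modeled as longs:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Per-account {start_time, end_time}, mirroring the GROUP BY query above.
public class SessionBounds {
    static final class Event {
        final String accountId;
        final long eventTime;
        Event(String accountId, long eventTime) {
            this.accountId = accountId;
            this.eventTime = eventTime;
        }
    }

    public static Map<String, long[]> bounds(List<Event> events) {
        Map<String, long[]> out = new TreeMap<>();
        for (Event e : events) {
            // merge each event into the running {min, max} for its account
            out.merge(e.accountId, new long[]{e.eventTime, e.eventTime},
                (a, b) -> new long[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a1", 100), new Event("a1", 250), new Event("a2", 90));
        for (Map.Entry<String, long[]> e : bounds(events).entrySet()) {
            System.out.println(e.getKey() + " start=" + e.getValue()[0]
                               + " end=" + e.getValue()[1]);
        }
    }
}
```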
Code: SQL + Dataframes
StructType schema = configuration.getSchemaForProduct();
DataFrame dataFrame = sqlContext.createDataFrame(productsRDD, schema);
sqlContext.registerDataFrameAsTable(dataFrame, "products");
And remember Uncle Ben…
“With great power, comes great responsibility.”
Concept : Streaming
.foreachRDD
Code: Streaming
JavaStreamingContext streamingContext = new JavaStreamingContext(getSparkConf(),
    SessionizerState.getConfig().getSparkStreamingBatchDuration());

JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(...);

kinesisStream.foreachRDD(new VoidFunction<JavaRDD<byte[]>>() {
    @Override
    public void call(JavaRDD<byte[]> rdd) throws Exception {
        JavaRDD<String> lines = rdd.map(new Function<byte[], String>() {
            public String call(byte[] bytes) throws IOException {
                return new String(bytes, Charset.forName("UTF-8"));
            }
        });
        processRdd(lines);
    }
});
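The shape of that code — decode every record of each micro-batch, then hand the batch to a processing function — can be sketched in plain Java (hypothetical in-memory batches; no Kinesis or Spark):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Each inner list stands in for one micro-batch RDD of raw bytes.
public class MicroBatch {
    public static List<String> process(List<List<byte[]>> batches) {
        List<String> lines = new ArrayList<>();
        for (List<byte[]> rdd : batches) {      // plays the role of foreachRDD
            for (byte[] record : rdd) {         // plays the role of rdd.map(...)
                lines.add(new String(record, StandardCharsets.UTF_8));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<List<byte[]>> batches = Arrays.asList(
            Arrays.asList("a".getBytes(StandardCharsets.UTF_8)),
            Arrays.asList("b".getBytes(StandardCharsets.UTF_8)));
        System.out.println(process(batches)); // [a, b]
    }
}
```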
DEPLOYMENT
Basic Architecture
http://spark.apache.org/docs/latest/cluster-overview.html
YARN!
Kinesis / Streaming Architecture
Amazon’s EMR
Play along demo.
Get stuff...
Get Spark... http://spark.apache.org/downloads.html
Get Cassandra... http://cassandra.apache.org/download/
Get Code... https://github.com/boneill42/spark-on-cassandra-quickstart
Configure stuff...
$spark/conf> cp spark-env.sh.template spark-env.sh
$spark/conf> echo "SPARK_MASTER_IP=127.0.0.1" >> spark-env.sh
Start stuff...
# Start Master
$spark> sbin/start-master.sh
$spark> tail -f logs/*

# Start Worker
$spark> bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077
Build and launch stuff...
# Build
$code> mvn clean install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

# Launch
$code> spark-submit --class com.github.boneill42.JavaDemo \
    --master spark://127.0.0.1:7077 \
    target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://127.0.0.1:7077 127.0.0.1
A message from our sponsor
Advertisements...https://github.com/monetate/koupler
https://github.com/monetate/ectou-metadata
https://github.com/monetate/ectou-export