@PatrickMcFadin
Patrick McFadin
Chief Evangelist for Apache Cassandra
Apache Cassandra and Spark
You got the lighter, let's spark the fire
6 years. How’s it going?
Cassandra 3.0 & 3.1
Spring and Fall
Cassandra is…
• Shared nothing
• Masterless peer-to-peer
• Great scaling story
• Resilient to failure
Cassandra for Applications
A Data Ocean, Lake, or Pond
An In-Memory Database
A Key-Value Store
A magical database unicorn that farts rainbows
Apache Spark
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative, and streaming analysis
• In-Memory and On-Disk Storage
• Integrates with Most File and Storage Options
Up to 100× faster (2-10× on disk)
2-5× less code
Spark Components
• Spark Core
• Spark SQL (structured data)
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph)
org.apache.spark.rdd.RDD
Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, …) or other RDDs
• Immutable
• Partitioned
• Reusable
RDD Operations
• Transformations: similar to the Scala collections API
  • Produce new RDDs
  • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
  • Require materialization of the records to generate a value
  • collect: Array[T], count, fold, reduce…
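A minimal sketch of the lazy/eager split (plain Spark; the data here is illustrative):

// Transformations are lazy: they only describe new RDDs.
val words  = sc.parallelize(Seq("spark", "cassandra", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // nothing has executed yet

// Actions materialize the records and return a value to the driver.
counts.count()    // Long = 2
counts.collect()  // Array[(String, Int)], e.g. Array((spark,2), (cassandra,1))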
RDD Operations
(Diagram: a pipeline of transformations on RDDs ending in an action.)
Cassandra and Spark
Cassandra & Spark: A Great Combo
DataStax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
• Both are Easy to Use
• Spark Can Help You Bridge Your Hadoop and Cassandra Systems
• Use Spark Libraries and Caching on Top of Cassandra-stored Data
• Combine Spark Streaming with Cassandra Storage
Spark On Cassandra
• Server-side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
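A sketch of the server-side filter using the connector's where: the predicate is appended to the generated CQL and evaluated by Cassandra, so only matching rows cross the network. The keyspace ks is a placeholder, and the predicate must be one Cassandra can serve (clustering or indexed columns):

import com.datastax.spark.connector._

val dec2005 = sc.cassandraTable("ks", "temperature")
  .where("year = ? AND month = ?", 2005, 12)  // assumes year, month are clustering columns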
Apache Spark and Cassandra Open Source Stack
Cassandra
Spark Cassandra Connector
Spark Cassandra Connector
• Cassandra tables exposed as Spark RDDs
• Read from and write to Cassandra
• Mapping of C* tables and rows to Scala objects
• All Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual Nodes support
• Use with Scala or Java
• Compatible with Spark 1.1.0 and Cassandra 2.0 & 2.1
Type Mapping

CQL Type         Scala Type
ascii            String
bigint           Long
boolean          Boolean
counter          Long
decimal          BigDecimal, java.math.BigDecimal
double           Double
float            Float
inet             java.net.InetAddress
int              Int
list             Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map              Map, TreeMap, java.util.HashMap
set              Set, TreeSet, java.util.HashSet
text, varchar    String
timestamp        Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid         java.util.UUID
uuid             java.util.UUID
varint           BigInt, java.math.BigInteger
nullable values  Option
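Values come off a CassandraRow with typed getters matching the table above, and nullable columns map to Option; a sketch (the get[Option[...]] form follows the connector's typed-get API):

val row = sc.cassandraTable("test", "words").first

row.getString("word")          // String
row.getInt("count")            // Int (throws if the column is null)
row.get[Option[Int]]("count")  // Option[Int] (None if the column is null)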
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)
Accessing Data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Accessing the table above as an RDD:

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames  // Stream(word, count)
rdd.size         // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count")  // Int = 30
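The same table can also be mapped straight to Scala objects; a sketch with a case class whose fields line up with the columns:

case class WordCount(word: String, count: Int)

val typedRdd = sc.cassandraTable[WordCount]("test", "words")
typedRdd.first  // WordCount(bar,30)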
Saving Data

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", SomeColumns("word", "count"))

The RDD above saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)
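Writing works the same way in reverse; a sketch saving case class instances (column names are derived from the field names):

case class WordCount(word: String, count: Int)

sc.parallelize(Seq(WordCount("owl", 60), WordCount("elk", 70)))
  .saveToCassandra("test", "words")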
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
Cassandra (Keyspace, Table) → Spark (RDD[CassandraRow], RDD[Tuples])
Bundled and Supported with DSE 4.5!
Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*
(Diagram: each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver. The full token range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, …), and RDDs are read as different splits based on those sets.)
Spark Cassandra Connector
Co-locate Spark and C* for Best Performance
(Diagram: a Spark Master plus Spark Workers running alongside each of four C* nodes.)
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
Analytics Workload Isolation
(Diagram: one mixed-load Cassandra cluster with a Cassandra-only DC serving the online app and a Cassandra + Spark DC serving the analytical app.)
Data Locality
Example 1: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence

Data Model
• Weather Station Id and Time are unique
• Store as many as needed
CREATE TABLE temperature (
   weather_station text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   PRIMARY KEY (weather_station, year, month, day, hour)
);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,8,-5.1);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,9,-4.9);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,10,-5.3);
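Reading the sequence back is then a single-partition query returned in clustering order; a sketch in cqlsh, matching the inserts above:

SELECT hour, temperature
FROM temperature
WHERE weather_station = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1;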
Primary key relationship
PRIMARY KEY (weather_station, year, month, day, hour)
• Partition Key: weather_station
• Clustering Columns: year, month, day, hour

Inside partition 10010:99999, rows are stored in clustering order:
2005:12:1:7 → -5.6
2005:12:1:8 → -5.1
2005:12:1:9 → -4.9
2005:12:1:10 → -5.3
Partition keys
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
→ '10010:99999' → Murmur3 Hash → Token = 7224631062609997448

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('722266:13850',2005,12,1,7,-5.6);
→ '722266:13850' → Murmur3 Hash → Token = -6804302034103043898

Consistent hash: a 64-bit token between -2^63 and 2^63 - 1.
Partition keys
'10010:99999' → Murmur3 Hash → Token = 15
'722266:13850' → Murmur3 Hash → Token = 77

For this example, let's make the tokens reasonable numbers.
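In cqlsh you can see the real token for any partition key with the token() function; a sketch against the table above:

SELECT weather_station, token(weather_station)
FROM temperature
WHERE weather_station = '10010:99999';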
Writes & WAN replication

DC1: RF=3
Node       Primary  Replica  Replica
10.0.0.1   00-25    76-100   51-75
10.0.0.2   26-50    00-25    76-100
10.0.0.3   51-75    26-50    00-25
10.0.0.4   76-100   51-75    26-50

DC2: RF=3
Node       Primary  Replica  Replica
10.10.0.1  00-25    76-100   51-75
10.10.0.2  26-50    00-25    76-100
10.10.0.3  51-75    26-50    00-25
10.10.0.4  76-100   51-75    26-50

Client inserts data with Partition Key = 15. The write lands on the DC1 node owning range 00-25, is replicated asynchronously to the local replicas, and is replicated asynchronously over the WAN to DC2.
Locality
Same token layout as above (DC1 and DC2, RF=3). A client reading with Partition Key = 15 can be served by a replica of range 00-25 in its local DC, whether that DC is DC1 or DC2.
Data Locality
weather_station = '10010:99999' ?
1000 Node Cluster
You are here!
Spark Reads on Cassandra
Awesome animation by DataStax's own Russell Spitzer
Spark RDDs Represent a Large Amount of Data Partitioned into Chunks
(Animation: an RDD's chunks 1-9 are distributed across Node 1 through Node 4.)
Cassandra Data is Distributed By Token Range
(Animation: the token ring 0-999 is divided across Node 1 through Node 4. Without vnodes, each node owns one contiguous range; with vnodes, each node owns many small ranges scattered around the ring.)
The Connector Uses Information on the Node to Make Spark Partitions
spark.cassandra.input.split.size 50
Reported density is 0.5
(Animation: Node 1 owns token ranges 120-220, 300-500, 780-830, and 0-50. At density 0.5, range 120-220 holds ~50 Cassandra partitions and becomes split 1; range 300-500 holds ~100, so it is cut into 300-400 (split 2) and 400-500 (split 3); the small ranges 0-50 and 780-830 (~25 each) are grouped together into split 4.)
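A quick sketch of the arithmetic behind the animation (density is the node's reported ratio of CQL partitions to tokens; the numbers are the slide's):

val splitSize = 50    // spark.cassandra.input.split.size
val density   = 0.5   // reported by the node

// Estimated Cassandra partitions in a token range.
def partitionsIn(from: Long, to: Long): Double = (to - from) * density

partitionsIn(120, 220)                        // ~50 -> split 1, kept as-is
partitionsIn(300, 500)                        // ~100 -> cut into 300-400 and 400-500
partitionsIn(0, 50) + partitionsIn(780, 830)  // ~50 -> small ranges grouped into split 4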
Data is Retrieved Using the DataStax Java Driver
spark.cassandra.input.page.row.size 50
(Animation: for the Spark partition covering token ranges 0-50 and 780-830 on Node 1, the connector issues

SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50

and the driver pages the results back 50 CQL rows at a time until each range is exhausted.)
Co-locate Spark and C* for Best Performance (as shown earlier): running Spark Workers on the same nodes as your C* cluster saves network hops when reading and writing.
Weather Station Analysis
• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new tables

Windsor, California - July 1, 2014
High: 73.4
Low: 51.4
Roll-up Table (SparkSQL example)
CREATE TABLE daily_high_low (
   weatherstation text,
   date text,
   high_temp double,
   low_temp double,
   PRIMARY KEY ((weatherstation, date))
);
• Weather Station Id and Date are unique
• High and low temp for each day
SparkSQL> INSERT INTO TABLE daily_high_low
        > SELECT weather_station,
        >        to_date(year, month, day) date,  -- function
        >        max(temperature) high_temp,      -- aggregation
        >        min(temperature) low_temp        -- aggregation
        > FROM temperature
        > GROUP BY weather_station, year, month, day;
OK
Time taken: 2.345 seconds
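The same roll-up can be written with the connector's RDD API instead of SparkSQL; a sketch, assuming the two tables above live in a placeholder keyspace ks (the aggregation wiring is illustrative, not the deck's code):

import com.datastax.spark.connector._

val rollup = sc.cassandraTable("ks", "temperature")
  .map { row =>
    // Key by (station, day); carry the temperature as a (high, low) seed.
    val date = f"${row.getInt("year")}%04d-${row.getInt("month")}%02d-${row.getInt("day")}%02d"
    ((row.getString("weather_station"), date),
     (row.getDouble("temperature"), row.getDouble("temperature")))
  }
  .reduceByKey { case ((hi1, lo1), (hi2, lo2)) =>
    (math.max(hi1, hi2), math.min(lo1, lo2))
  }
  .map { case ((station, date), (hi, lo)) => (station, date, hi, lo) }

rollup.saveToCassandra("ks", "daily_high_low",
  SomeColumns("weatherstation", "date", "high_temp", "low_temp"))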
What just happened
• Data is read from the temperature table
• Transformed
• Inserted into the daily_high_low table

(Diagram: Table temperature → read data from table → transform → insert data into table → Table daily_high_low)
Spark Streaming
zillions of bytes, gigabytes per second
Spark Versus Spark Streaming
Spark Streaming
(Diagram: streaming sources such as Kinesis and S3 feeding Spark Streaming.)
DStream - Micro Batches
μBatch (ordinary RDD) → μBatch (ordinary RDD) → μBatch (ordinary RDD)
Processing of DStream = Processing of μBatches, RDDs
DStream
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals
Spark Streaming Example
// Initialization
val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(conf)

// CassandraRDD
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")

// Stream initialization
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Transformations and action
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")

ssc.start()
ssc.awaitTermination()
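saveToCassandra needs a table shaped like the (word, count) pairs the stream produces; a plausible schema sketch for demo.wordcount (the column names are assumptions, not from the deck):

CREATE TABLE demo.wordcount (
   word text PRIMARY KEY,
   count bigint
);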
Now what?
(Diagram: a Cassandra-only DC serving the online app replicates to a Cassandra + Spark DC running Spark jobs and Spark Streaming.)
You can do this at home!
https://github.com/killrweather/killrweather
On your USB!
Thank you!
Bring the questions
Follow me on Twitter: @PatrickMcFadin