@PatrickMcFadin
Patrick McFadin
Chief Evangelist for Apache Cassandra
Apache Cassandra and Spark
You got the lighter, let's spark the fire
6 years. How’s it going?
Cassandra 3.0 & 3.1
Spring and Fall
Cassandra is…
• Shared nothing
• Masterless peer-to-peer
• Great scaling story
• Resilient to failure
Cassandra for Applications
A Data Ocean, Lake, or Pond
An In-Memory Database
A Key-Value Store
A magical database unicorn that farts rainbows
Apache Spark
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative, and streaming analysis
• In-Memory and On-Disk Storage
• Integrates with Most File and Storage Options
Up to 100× faster (2-10× on disk)
2-5× less code
Spark Components
• Spark Core
• Spark SQL (structured data)
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph)
org.apache.spark.rdd.RDD
Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, …) or other RDDs
• Immutable
• Partitioned
• Reusable
RDD Operations
• Transformations: similar to the Scala collections API
  • Produce new RDDs
  • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
  • Require materialization of the records to generate a value
  • collect: Array[T], count, fold, reduce…
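A minimal sketch of the lazy/eager split (plain Spark; the data here is illustrative):

// Transformations are lazy: they only describe new RDDs.
val words  = sc.parallelize(Seq("spark", "cassandra", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // nothing has executed yet

// Actions materialize the records and return a value to the driver.
counts.count()    // Long = 2
counts.collect()  // Array[(String, Int)], e.g. Array((spark,2), (cassandra,1))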
RDD Operations
(Diagram: a pipeline of transformations on RDDs ending in an action.)
Cassandra and Spark
Cassandra & Spark: A Great Combo
DataStax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
• Both are Easy to Use
• Spark Can Help You Bridge Your Hadoop and Cassandra Systems
• Use Spark Libraries and Caching on Top of Cassandra-stored Data
• Combine Spark Streaming with Cassandra Storage
Spark On Cassandra
• Server-side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
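A sketch of the server-side filter using the connector's where: the predicate is appended to the generated CQL and evaluated by Cassandra, so only matching rows cross the network. The keyspace ks is a placeholder, and the predicate must be one Cassandra can serve (clustering or indexed columns):

import com.datastax.spark.connector._

val dec2005 = sc.cassandraTable("ks", "temperature")
  .where("year = ? AND month = ?", 2005, 12)  // assumes year, month are clustering columns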
Apache Spark and Cassandra Open Source Stack
Cassandra
Spark Cassandra Connector
Spark Cassandra Connector
• Cassandra tables exposed as Spark RDDs
• Read from and write to Cassandra
• Mapping of C* tables and rows to Scala objects
• All Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual Nodes support
• Use with Scala or Java
• Compatible with Spark 1.1.0 and Cassandra 2.0 & 2.1
Type Mapping

CQL Type         Scala Type
ascii            String
bigint           Long
boolean          Boolean
counter          Long
decimal          BigDecimal, java.math.BigDecimal
double           Double
float            Float
inet             java.net.InetAddress
int              Int
list             Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map              Map, TreeMap, java.util.HashMap
set              Set, TreeSet, java.util.HashSet
text, varchar    String
timestamp        Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid         java.util.UUID
uuid             java.util.UUID
varint           BigInt, java.math.BigInteger
nullable values  Option
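Values come off a CassandraRow with typed getters matching the table above, and nullable columns map to Option; a sketch (the get[Option[...]] form follows the connector's typed-get API):

val row = sc.cassandraTable("test", "words").first

row.getString("word")          // String
row.getInt("count")            // Int (throws if the column is null)
row.get[Option[Int]]("count")  // Option[Int] (None if the column is null)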
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)
Accessing Data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Accessing the table above as an RDD:

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames  // Stream(word, count)
rdd.size         // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count")  // Int = 30
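The same table can also be mapped straight to Scala objects; a sketch with a case class whose fields line up with the columns:

case class WordCount(word: String, count: Int)

val typedRdd = sc.cassandraTable[WordCount]("test", "words")
typedRdd.first  // WordCount(bar,30)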
Saving Data

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", SomeColumns("word", "count"))

The RDD above saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)
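Writing works the same way in reverse; a sketch saving case class instances (column names are derived from the field names):

case class WordCount(word: String, count: Int)

sc.parallelize(Seq(WordCount("owl", 60), WordCount("elk", 70)))
  .saveToCassandra("test", "words")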
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
Cassandra (Keyspace, Table) → Spark (RDD[CassandraRow], RDD[Tuples])
Bundled and Supported with DSE 4.5!
Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*
(Diagram: each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver. The full token range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, …), and RDDs are read as different splits based on those sets.)
Spark Cassandra Connector
Co-locate Spark and C* for Best Performance
(Diagram: a Spark Master plus Spark Workers running alongside each of four C* nodes.)
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
Analytics Workload Isolation
(Diagram: one mixed-load Cassandra cluster with a Cassandra-only DC serving the online app and a Cassandra + Spark DC serving the analytical app.)
Data Locality
Example 1: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence

Data Model
• Weather Station Id and Time are unique
• Store as many as needed
CREATE TABLE temperature (
   weather_station text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   PRIMARY KEY (weather_station, year, month, day, hour)
);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,8,-5.1);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,9,-4.9);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,10,-5.3);
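Reading the sequence back is then a single-partition query returned in clustering order; a sketch in cqlsh, matching the inserts above:

SELECT hour, temperature
FROM temperature
WHERE weather_station = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1;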
Primary key relationship
PRIMARY KEY (weather_station, year, month, day, hour)
• Partition Key: weather_station
• Clustering Columns: year, month, day, hour

Inside partition 10010:99999, rows are stored in clustering order:
2005:12:1:7 → -5.6
2005:12:1:8 → -5.1
2005:12:1:9 → -4.9
2005:12:1:10 → -5.3
Partition keys
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
→ '10010:99999' → Murmur3 Hash → Token = 7224631062609997448

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('722266:13850',2005,12,1,7,-5.6);
→ '722266:13850' → Murmur3 Hash → Token = -6804302034103043898

Consistent hash: a 64-bit token between -2^63 and 2^63 - 1.
Partition keys
'10010:99999' → Murmur3 Hash → Token = 15
'722266:13850' → Murmur3 Hash → Token = 77

For this example, let's make the tokens reasonable numbers.
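In cqlsh you can see the real token for any partition key with the token() function; a sketch against the table above:

SELECT weather_station, token(weather_station)
FROM temperature
WHERE weather_station = '10010:99999';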
Writes & WAN replication

DC1: RF=3
Node       Primary  Replica  Replica
10.0.0.1   00-25    76-100   51-75
10.0.0.2   26-50    00-25    76-100
10.0.0.3   51-75    26-50    00-25
10.0.0.4   76-100   51-75    26-50

DC2: RF=3
Node       Primary  Replica  Replica
10.10.0.1  00-25    76-100   51-75
10.10.0.2  26-50    00-25    76-100
10.10.0.3  51-75    26-50    00-25
10.10.0.4  76-100   51-75    26-50

Client inserts data with Partition Key = 15. The write lands on the DC1 node owning range 00-25, is replicated asynchronously to the local replicas, and is replicated asynchronously over the WAN to DC2.
Locality
Same token layout as above (DC1 and DC2, RF=3). A client reading with Partition Key = 15 can be served by a replica of range 00-25 in its local DC, whether that DC is DC1 or DC2.
Data Locality
weather_station = '10010:99999' ?
1000 Node Cluster
You are here!
Spark Reads on Cassandra
Awesome animation by DataStax's own Russell Spitzer
Spark RDDs Represent a Large Amount of Data Partitioned into Chunks
(Animation: an RDD's chunks 1-9 are distributed across Node 1 through Node 4.)
Cassandra Data is Distributed By Token Range
(Animation: the token ring 0-999 is divided across Node 1 through Node 4. Without vnodes, each node owns one contiguous range; with vnodes, each node owns many small ranges scattered around the ring.)
The Connector Uses Information on the Node to Make Spark Partitions
spark.cassandra.input.split.size 50
Reported density is 0.5
(Animation: Node 1 owns token ranges 120-220, 300-500, 780-830, and 0-50. At density 0.5, range 120-220 holds ~50 Cassandra partitions and becomes split 1; range 300-500 holds ~100, so it is cut into 300-400 (split 2) and 400-500 (split 3); the small ranges 0-50 and 780-830 (~25 each) are grouped together into split 4.)
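A quick sketch of the arithmetic behind the animation (density is the node's reported ratio of CQL partitions to tokens; the numbers are the slide's):

val splitSize = 50    // spark.cassandra.input.split.size
val density   = 0.5   // reported by the node

// Estimated Cassandra partitions in a token range.
def partitionsIn(from: Long, to: Long): Double = (to - from) * density

partitionsIn(120, 220)                        // ~50 -> split 1, kept as-is
partitionsIn(300, 500)                        // ~100 -> cut into 300-400 and 400-500
partitionsIn(0, 50) + partitionsIn(780, 830)  // ~50 -> small ranges grouped into split 4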
Data is Retrieved Using the DataStax Java Driver
spark.cassandra.input.page.row.size 50
(Animation: for the Spark partition covering token ranges 0-50 and 780-830 on Node 1, the connector issues

SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50

and the driver pages the results back 50 CQL rows at a time until each range is exhausted.)
Co-locate Spark and C* for Best Performance (as shown earlier): running Spark Workers on the same nodes as your C* cluster saves network hops when reading and writing.
Weather Station Analysis
• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new tables

Windsor, California - July 1, 2014
High: 73.4
Low: 51.4
Roll-up Table (SparkSQL example)
CREATE TABLE daily_high_low (
   weatherstation text,
   date text,
   high_temp double,
   low_temp double,
   PRIMARY KEY ((weatherstation, date))
);
• Weather Station Id and Date are unique
• High and low temp for each day
SparkSQL> INSERT INTO TABLE daily_high_low
        > SELECT weather_station,
        >        to_date(year, month, day) date,  -- function
        >        max(temperature) high_temp,      -- aggregation
        >        min(temperature) low_temp        -- aggregation
        > FROM temperature
        > GROUP BY weather_station, year, month, day;
OK
Time taken: 2.345 seconds
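The same roll-up can be written with the connector's RDD API instead of SparkSQL; a sketch, assuming the two tables above live in a placeholder keyspace ks (the aggregation wiring is illustrative, not the deck's code):

import com.datastax.spark.connector._

val rollup = sc.cassandraTable("ks", "temperature")
  .map { row =>
    // Key by (station, day); carry the temperature as a (high, low) seed.
    val date = f"${row.getInt("year")}%04d-${row.getInt("month")}%02d-${row.getInt("day")}%02d"
    ((row.getString("weather_station"), date),
     (row.getDouble("temperature"), row.getDouble("temperature")))
  }
  .reduceByKey { case ((hi1, lo1), (hi2, lo2)) =>
    (math.max(hi1, hi2), math.min(lo1, lo2))
  }
  .map { case ((station, date), (hi, lo)) => (station, date, hi, lo) }

rollup.saveToCassandra("ks", "daily_high_low",
  SomeColumns("weatherstation", "date", "high_temp", "low_temp"))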
What just happened
• Data is read from the temperature table
• Transformed
• Inserted into the daily_high_low table

(Diagram: Table temperature → read data from table → transform → insert data into table → Table daily_high_low)
Spark Streaming
zillions of bytes, gigabytes per second
Spark Versus Spark Streaming
Spark Streaming
(Diagram: streaming sources such as Kinesis and S3 feeding Spark Streaming.)
DStream - Micro Batches
μBatch (ordinary RDD) → μBatch (ordinary RDD) → μBatch (ordinary RDD)
Processing of DStream = Processing of μBatches, RDDs
DStream
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals
Spark Streaming Example
// Initialization
val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(conf)

// CassandraRDD
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")

// Stream initialization
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Transformations and action
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")

ssc.start()
ssc.awaitTermination()
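saveToCassandra needs a table shaped like the (word, count) pairs the stream produces; a plausible schema sketch for demo.wordcount (the column names are assumptions, not from the deck):

CREATE TABLE demo.wordcount (
   word text PRIMARY KEY,
   count bigint
);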
Now what?
(Diagram: a Cassandra-only DC serving the online app replicates to a Cassandra + Spark DC running Spark jobs and Spark Streaming.)
You can do this at home!
https://github.com/killrweather/killrweather
On your USB!
Thank you!
Bring the questions
Follow me on Twitter: @PatrickMcFadin