
Lessons Learned with Cassandra & Spark
Stephan Kepser, Matthias Niehoff | Cassandra Summit 2016

@matthiasniehoff @codecentric


Our Use Cases

(Diagram: our use case pipelines read from Cassandra, join the data in Spark, and write the results back to Cassandra.)

Lessons Learned with Cassandra

! The primary key defines the access path to a table.
! It cannot be changed once the table is filled with data.
! Any change to the primary key, such as adding a column, requires a data migration (a migration sketch follows below).

Data modeling: Primary key
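The deck does not show what such a migration looks like; a minimal sketch with the DataStax Java driver follows. The keyspace, table, and column names (my_keyspace, events, events_v2, bucket) are illustrative assumptions, and the new table simply adds a bucket column to the partition key.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PrimaryKeyMigration {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // A changed primary key means a new table: here the partition key gains a bucket column.
            session.execute("CREATE TABLE IF NOT EXISTS events_v2 ("
                    + " id text, bucket int, ts timestamp, payload text,"
                    + " PRIMARY KEY ((id, bucket), ts))");

            PreparedStatement insert = session.prepare(
                    "INSERT INTO events_v2 (id, bucket, ts, payload) VALUES (?, ?, ?, ?)");

            // Copy every row from the old table and derive the value of the new key column.
            for (Row row : session.execute("SELECT id, ts, payload FROM events")) {
                int bucket = Math.abs(row.getString("id").hashCode()) % 10;
                session.execute(insert.bind(
                        row.getString("id"), bucket, row.getTimestamp("ts"), row.getString("payload")));
            }
        }
    }
}

For large tables the copy loop would of course use the asynchronous pattern shown later in this deck instead of one synchronous execute per row.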

! Well known: if you delete a column or a whole row, the data is not actually removed. Instead, a tombstone is created to mark the deletion.
! Only much later are tombstones removed, during compaction (and only after gc_grace_seconds has passed).

Data modeling: Deletions

! Updates don't create tombstones; the data is overwritten in place.
! Exception: non-frozen maps, lists, and sets. Overwriting a whole collection first writes a tombstone that deletes the old contents.
! Use frozen<list> (or frozen<set>, frozen<map>) instead if you always replace the collection as a whole (see the sketch below).

Unexpected Tombstones: Built-in Maps, Lists, Sets
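A small sketch of the difference, with hypothetical table names (tags_plain, tags_frozen): fully overwriting a non-frozen list implicitly deletes the old collection first, i.e. writes a tombstone, whereas a frozen list is stored as a single cell and is simply replaced.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import java.util.Arrays;

public class FrozenListExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // Non-frozen list: every full overwrite first writes a tombstone for the old contents.
            session.execute("CREATE TABLE IF NOT EXISTS tags_plain "
                    + "(id text PRIMARY KEY, tags list<text>)");

            // Frozen list: the whole collection is serialized into one cell, no tombstone on overwrite.
            session.execute("CREATE TABLE IF NOT EXISTS tags_frozen "
                    + "(id text PRIMARY KEY, tags frozen<list<text>>)");

            session.execute("INSERT INTO tags_plain  (id, tags) VALUES (?, ?)", "a", Arrays.asList("x", "y"));
            session.execute("INSERT INTO tags_frozen (id, tags) VALUES (?, ?)", "a", Arrays.asList("x", "y"));
        }
    }
}

The trade-off: a frozen collection can only be replaced as a whole, never updated element by element.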

! sstable2json shows the contents of SSTable files in JSON format.
! Usage: go to /var/lib/cassandra/data/keyspace/table
! > sstable2json *-Data.db
! You can actually see what's stored in the individual rows of the data files.
! Its successor sstabledump is available as of Cassandra 3.6.

Debug tool: sstable2json

Example:

CREATE TABLE customer_cache.tenant (
    name text PRIMARY KEY,
    status text
);

select * from tenant;

 name | status
------+--------
   ru | ACTIVE
   es | ACTIVE
   jp | ACTIVE
   vn | ACTIVE
   pl | ACTIVE
   cz | ACTIVE

Example (sstable2json output):

{"key": "ru", "cells": [["status","ACTIVE",1464344127007511]]},
{"key": "it", "cells": [["status","ACTIVE",1464344146457930, T]]},   <- deletion marker
{"key": "de", "cells": [["status","ACTIVE",1464343910541463]]},
{"key": "ro", "cells": [["status","ACTIVE",1464344151160601]]},
{"key": "fr", "cells": [["status","ACTIVE",1464344072061135]]},
{"key": "cn", "cells": [["status","ACTIVE",1464344083085247]]},
{"key": "kz", "cells": [["status","ACTIVE",1467190714345185]]}

! A strategy to reduce partition size.
! Actually a general strategy in databases to distribute a table.
! In Cassandra, each partition of a table can be distributed over a set of buckets.
! Example: a train departure time table is bucketed over the hours of the day (see the schema sketch below).
! The bucket becomes part of the partition key.
! It must be easy to calculate.

Bucketing
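A schema sketch for the departure-time example; keyspace, table, and column names (transport, departures, hour_bucket) are illustrative.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.Date;

public class BucketingExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("transport")) {

            // The hour-of-day bucket is part of the partition key, so one station's departures
            // are spread over up to 24 partitions instead of a single ever-growing one.
            session.execute("CREATE TABLE IF NOT EXISTS departures ("
                    + " station text, hour_bucket int, departure_time timestamp, train text,"
                    + " PRIMARY KEY ((station, hour_bucket), departure_time))");

            LocalDateTime dep = LocalDateTime.of(2016, 9, 7, 14, 35);
            int bucket = dep.getHour(); // easily calculable from the data itself

            session.execute(
                    "INSERT INTO departures (station, hour_bucket, departure_time, train) VALUES (?, ?, ?, ?)",
                    "Karlsruhe Hbf", bucket,
                    Date.from(dep.atZone(ZoneId.systemDefault()).toInstant()), "ICE 104");
        }
    }
}

Readers then have to query all relevant buckets for a station, which is where the asynchronous bulk reads on the next slides come in.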

! Bulk read or write operations should not be performed synchronously.

! It makes no sense to wait for each read in the bulk to complete individually.

! Instead:
! send the requests asynchronously to Cassandra, and
! collect the answers.

Bulk Reads or Writes: Asynchronously

Example:

Session session = cc.openSession();
PreparedStatement getEntries = session.prepare(
    "SELECT * FROM " + keyspace + "." + table +
    " WHERE tenant = ? AND core_system = ? AND change_type = ? AND seconds = ? AND bucket = ?");

private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {
    List<ResultSetFuture> futures = Lists.newArrayListWithExpectedSize(partitionKeys.size());
    for (Object[] keys : partitionKeys) {
        futures.add(session.executeAsync(getEntries.bind(keys)));
    }
    return futures;
}

Example

private void processAsyncResults(List<ResultSetFuture> futures) {
    for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
        ResultSet rs = future.get();
        if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {
            // do your program logic here
        }
    }
}

Cassandra as Streaming Source

(Diagram: an event source table feeds a Spark job that writes a table of consolidated entries.)

! The event log fills up over time.
! It requires fine-grained, time-based partitions to avoid huge partition sizes.
! Spark jobs read the latest events.
! Access path: time stamp.
! Jobs have to read many partitions.

Cassandra as Streaming Source

! The pull approach requires frequent reads of empty partitions, because no new data has arrived.
! Re-reading the whole table is impossible: there are by far too many partitions to read (in our case 900,000 per day, i.e. roughly 30 million for a month). This makes it unsuitable for a λ-architecture.
! Eventual consistency: when can you be sure that you have read all events? Never.
! This might improve with CDC (change data capture).

Cassandra is unsuitable as a time-based event source.

! One keyspace per tenant?
! One (set of) table(s) per tenant?
! Our choice: table(s) per tenant.
! Feasible only for a limited number of tenants (~1,000).

Separating Data of Different Tenants

! Switch on monitoring: ELK, OpsCenter, self-built, ...
! Avoid log level DEBUG for C* messages:
! you drown in irrelevant messages
! and it causes a substantial performance drawback
! Log level INFO for development and pre-production.
! Log level ERROR is sufficient in production.

Monitoring

! When writing data, Cassandra never checks whether there is enough space left on disk.
! It keeps writing data until the disk is full.
! This can bring the OS to a halt.
! Cassandra's error messages are confusing at this point.
! Monitoring disk space is therefore mandatory.

Monitoring: Disk Space

! Cassandra requires a lot of disk space for compaction.
! For SizeTieredCompaction, reserve as much free disk space as Cassandra needs to store the data itself.
! Set up monitoring on disk space.
! Get an alert when the data-carrying disk partition fills up to 50%.
! Then add nodes to the cluster and rebalance.

Monitoring: Disk Space

Lessons Learned with Cassandra & Spark

repartitionByCassandraReplica

(Diagram: four nodes, each owning one of the token ranges 1-25, 26-50, 51-75, and 76-0.)

repartitionByCassandraReplica

(Same four-node diagram; some tasks took ~3s longer.)

! Watch the Spark locality level:
! aim for process-local or node-local,
! avoid ANY.

Spark locality

! the Spark job does not run on every C* node
! # Spark nodes < # Cassandra nodes
! # job cores < # Cassandra nodes
! all of the job's cores are on one node
! the time needed for repartitioning exceeds the time saved through locality

Do not use repartitionByCassandraReplica when ...
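Where it does apply, a call sketch against the connector's Java API. The helpers (CassandraJavaUtil.javaFunctions, someColumns, mapToRow) exist in the 1.x connector, but the exact parameter order of repartitionByCassandraReplica is recalled from memory and should be verified against your version; keyspace, table, and bean names are illustrative.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;

public class RepartitionExample {

    // Bean holding the partition key columns of the hypothetical table my_keyspace.events.
    public static class EventKey implements Serializable {
        private String tenant;
        private Integer bucket;
        public String getTenant() { return tenant; }
        public void setTenant(String tenant) { this.tenant = tenant; }
        public Integer getBucket() { return bucket; }
        public void setBucket(Integer bucket) { this.bucket = bucket; }
    }

    // Move each key to a Spark partition on a node that holds a replica of that key's data.
    public static JavaRDD<EventKey> colocate(JavaRDD<EventKey> keys) {
        return javaFunctions(keys).repartitionByCassandraReplica(
                "my_keyspace", "events",          // keyspace and table the keys belong to
                10,                               // Spark partitions per Cassandra host
                someColumns("tenant", "bucket"),  // columns that form the partition key
                mapToRow(EventKey.class));        // how to write the bean as a Cassandra row
    }
}

A subsequent joinWithCassandraTable on the repartitioned RDD can then run node-local.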

! one query per partition key
! one query at a time per executor

joinWithCassandraTable

(Diagram: Spark sends one query at a time to Cassandra, at t, t+1, ..., t+5.)

! parallel async queries

joinWithCassandraTable

(Diagram: with asynchronous queries, Spark sends the queries for t through t+5 to Cassandra in parallel.)

! built a custom async implementation

joinWithCassandraTable

someDStream.transformToPair(rdd -> {
    return rdd.mapPartitionsToPair(iterator -> {
        ...
        try (Session session = cc.openSession()) {
            while (iterator.hasNext()) {
                ...
                session.executeAsync(..)
            }
            [collect futures]
        }
        return List<Tuple2<Left, Optional<Right>>>;
    });
});

! solved with SPARKC-233 (1.6.0 / 1.5.1 / 1.4.3)

! 5-6 times faster than sync implementation!

joinWithCassandraTable
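For reference, the plain connector call that the custom async code above worked around; since SPARKC-233 it queries Cassandra asynchronously itself. This is again a sketch against the 1.x Java API, and the exact signature (selected columns, join columns, row reader and writer factories) should be double-checked; table and bean names are the same illustrative ones as above.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import com.datastax.spark.connector.japi.CassandraJavaPairRDD;
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;

public class JoinExample {

    // Bean mapped from the selected columns of my_keyspace.events.
    public static class Event implements Serializable {
        private String tenant;
        private Integer bucket;
        private String payload;
        public String getTenant() { return tenant; }
        public void setTenant(String tenant) { this.tenant = tenant; }
        public Integer getBucket() { return bucket; }
        public void setBucket(Integer bucket) { this.bucket = bucket; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    // Fetch the matching rows of my_keyspace.events for every key in the RDD.
    public static CassandraJavaPairRDD<RepartitionExample.EventKey, Event> lookup(
            JavaRDD<RepartitionExample.EventKey> keys) {
        return javaFunctions(keys).joinWithCassandraTable(
                "my_keyspace", "events",
                someColumns("tenant", "bucket", "payload"),   // columns to read
                someColumns("tenant", "bucket"),              // join on the partition key columns
                mapRowTo(Event.class),                        // mapping for the rows read back
                mapToRow(RepartitionExample.EventKey.class)); // mapping for the keys sent to C*
    }
}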

! joinWithCassandraTable is a full inner join

Left join with Cassandra

(Diagram: the RDD and the C* table; an inner join keeps only the keys present in both.)

! Might include a shuffle --> quite expensive

Left join with Cassandra

(Diagram: emulating the left join: RDD join C* = matched rows; RDD subtract matched = unmatched rows; matched + unmatched = left join result. A sketch of this emulation follows below.)

Left join with Cassandra
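A sketch of that emulation with plain Spark pair-RDD operations; the generic key and value types stand in for whatever the Cassandra join returned, and the subtractByKey/union steps are what can trigger the shuffle mentioned above.

import com.google.common.base.Optional;
import org.apache.spark.api.java.JavaPairRDD;

public class LeftJoinEmulation {

    // Emulate a left outer join against Cassandra:
    //   matched = result of joinWithCassandraTable (keys that exist in C*)
    //   left    = all keys we started from
    //   result  = matched rows plus the remaining keys with Optional.absent()
    public static <K, V, W> JavaPairRDD<K, Optional<V>> leftJoin(
            JavaPairRDD<K, W> left, JavaPairRDD<K, V> matched) {
        JavaPairRDD<K, Optional<V>> present = matched.mapValues(v -> Optional.of(v));
        JavaPairRDD<K, Optional<V>> missing =
                left.subtractByKey(matched).mapValues(w -> Optional.<V>absent());
        return present.union(missing); // subtractByKey may shuffle --> quite expensive
    }
}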

! built a custom async implementation

someDStream.transformToPair(rdd -> {
    return rdd.mapPartitionsToPair(iterator -> {
        ...
        try (Session session = cc.openSession()) {
            while (iterator.hasNext()) {
                ...
                session.executeAsync(..)
                ...
            }
            [collect futures]
        }
        return List<Tuple2<Left, Optional<Right>>>;
    });
});

! solved with SPARKC-181 (2.0.0)

! basically uses async joinWithC* implementation

Left join with Cassandra

! spark.cassandra.connection.keep_alive_ms, default: 5s
! If the streaming batch interval is larger than 5s, a new connection is opened for every batch.
! The keep-alive should be several times the streaming interval (see the sketch below).

Connection keep alive
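A configuration sketch; the 30-second batch interval and the 120-second keep-alive are illustrative values, the point being that the keep-alive should span several batches so connections are reused.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class KeepAliveExample {
    public static void main(String[] args) {
        // With the 5s default, a 30s batch interval means new Cassandra connections for every batch.
        SparkConf conf = new SparkConf()
                .setAppName("keep-alive-example")
                .set("spark.cassandra.connection.host", "127.0.0.1")
                .set("spark.cassandra.connection.keep_alive_ms", "120000"); // several batch intervals

        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));
        // ... define and start the streaming job here ...
    }
}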

! cache() improves performance by preventing recalculation
! it also helps you with correctness!

Cache! Not only for performance

val changedStream = someDStream.map(e -> someMethod(e)).cache()
changedStream.saveToCassandra("keyspace", "table1")
changedStream.saveToCassandra("keyspace", "table1")

ChangedEntry someMethod(Entry e) {
    // creates a fresh timestamp on every call: without cache(), each save
    // re-runs the map and the two writes get different timestamps
    return new ChangedEntry(new Date(), ...);
}

• Know the most important internals
• Know your tools
• Monitor your cluster
• Use existing knowledge resources
• Use the mailing lists
• Participate in the community

Summary

Questions?

Matthias Niehoff Stephan Kepser