Lessons Learned with Cassandra & Spark. Stephan Kepser, Matthias Niehoff. Cassandra Summit 2016
@matthiasniehoff @codecentric
! The primary key defines the access path to a table.
! It cannot be changed after the table is filled with data.
! Any change to the primary key, e.g. adding a column, requires a data migration.
Data modeling: Primary key
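As a sketch of such a migration in CQL (the table and column names here are made up for illustration): since the partition key cannot be altered, you create a new table with the desired key and copy the data over.

```sql
-- Hypothetical original table, partitioned by user only:
CREATE TABLE events_by_user (
    user_id    uuid,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (user_id, event_time)
);

-- Adding "day" to the partition key is impossible via ALTER TABLE.
-- Create a new table with the desired key instead:
CREATE TABLE events_by_user_day (
    user_id    uuid,
    day        date,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((user_id, day), event_time)
);

-- ... then migrate the data, e.g. with a Spark job or cqlsh COPY.
```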
! Well known: if you delete a column or a whole row, the data is not really deleted. Rather, a tombstone is created to mark the deletion.
! Tombstones are only removed much later, during compactions.
Data modeling: Deletions
! Updates don’t create tombstones; data is simply overwritten in place.
! Exception: non-frozen maps, lists and sets. Overwriting a whole collection first deletes the old value with a tombstone.
! Use frozen<list> (likewise frozen<map>, frozen<set>) instead if you only ever replace the collection as a whole.
Unexpected Tombstones: Built-in Maps, Lists, Sets
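A minimal CQL sketch of the difference (the table names are made up):

```sql
-- Non-frozen list: overwriting the whole list first writes a tombstone
-- for the old contents.
CREATE TABLE tags_plain (
    id   uuid PRIMARY KEY,
    tags list<text>
);

-- Frozen list: serialized as a single value and overwritten in place,
-- without a tombstone. Trade-off: elements can no longer be updated
-- individually; the list can only be replaced as a whole.
CREATE TABLE tags_frozen (
    id   uuid PRIMARY KEY,
    tags frozen<list<text>>
);
```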
! sstable2json shows the details of SSTable files in JSON format.
! Usage: go to /var/lib/cassandra/data/keyspace/table
! > sstable2json *-Data.db
! You can actually see what’s in the individual rows of the data files.
! sstabledump in Cassandra 3.6.
Debug tool: sstable2json
Example:

CREATE TABLE customer_cache.tenant (
    name text PRIMARY KEY,
    status text
)
select * from tenant;

 name | status
------+--------
   ru | ACTIVE
   es | ACTIVE
   jp | ACTIVE
   vn | ACTIVE
   pl | ACTIVE
   cz | ACTIVE
Example:

{"key": "ru", "cells": [["status","ACTIVE",1464344127007511]]},
{"key": "it", "cells": [["status","ACTIVE",1464344146457930, T]]},  <- T: deletion marker (tombstone)
{"key": "de", "cells": [["status","ACTIVE",1464343910541463]]},
{"key": "ro", "cells": [["status","ACTIVE",1464344151160601]]},
{"key": "fr", "cells": [["status","ACTIVE",1464344072061135]]},
{"key": "cn", "cells": [["status","ACTIVE",1464344083085247]]},
{"key": "kz", "cells": [["status","ACTIVE",1467190714345185]]}
! Strategy to reduce partition size.
! Actually a general strategy in databases to distribute a table.
! In Cassandra, each partition of a table can be distributed over a set of buckets.
! Example: a train departure time table is bucketed over the hours of the day.
! The bucket becomes part of the partition key.
! It must be easily calculable.
Bucketing
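As a sketch of the departure-time example above (the class and method names are made up; the bucket is simply the hour of the day):

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class DepartureBucket {

    // The bucket must be easily calculable from the data itself.
    // Here it is the hour of the day (0-23) of the departure time;
    // it would become part of the partition key, e.g.
    // PRIMARY KEY ((station, day, bucket), departure_time).
    public static int bucketFor(Instant departureTime) {
        return departureTime.atZone(ZoneOffset.UTC).getHour();
    }

    public static void main(String[] args) {
        // prints 13: the 13:42 departure lands in the 13-o'clock bucket
        System.out.println(bucketFor(Instant.parse("2016-09-07T13:42:00Z")));
    }
}
```

A query for "departures between 13:00 and 15:00" then reads the three partitions for buckets 13, 14 and 15 instead of one huge day partition.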
! Bulk read or write operations should not be performed synchronously.
! It makes no sense to wait for each read in the bulk to complete individually.
! Instead:
! send the requests asynchronously to Cassandra, and
! collect the answers.
Bulk Reads or Writes: Asynchronously
Example:

Session session = cc.openSession();
PreparedStatement getEntries = session.prepare(
    "SELECT * FROM " + keyspace + "." + table +
    " WHERE tenant = ? AND core_system = ? AND change_type = ? AND seconds = ? AND bucket = ?");

private List<ResultSetFuture> sendQueries(Collection<Object[]> partitionKeys) {
    List<ResultSetFuture> futures = Lists.newArrayListWithExpectedSize(partitionKeys.size());
    for (Object[] keys : partitionKeys) {
        futures.add(session.executeAsync(getEntries.bind(keys)));
    }
    return futures;
}
Example
private void processAsyncResults(List<ResultSetFuture> futures)
        throws InterruptedException, ExecutionException {
    for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
        ResultSet rs = future.get();
        if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {
            // do your program logic here
        }
    }
}
! The event log fills over time.
! Requires fine-grained, time-based partitions to avoid huge partition sizes.
! Spark jobs read the latest events.
! Access path: time stamp.
! Jobs have to read many partitions.
Cassandra as Streaming Source
! The pull method requires frequent reads of empty partitions, because no new data has arrived yet.
! Impossible to re-read the whole table: far too many partitions to read (our case: 900,000 per day -> 30 million for a month). Unsuitable for a λ-architecture.
! Eventual consistency: when can you be sure that you have read all events? Never.
! Might improve with CDC (change data capture).
Cassandra is unsuitable as a time-based event source.
! One keyspace per tenant? One (set of) table(s) per tenant?
! Our option: one table per tenant.
! Feasible only for a limited number of tenants (~1,000).
Separating Data of Different Tenants
! Switch on monitoring: ELK, OpsCenter, self-built, ...
! Avoid log level DEBUG for C* messages: you drown in irrelevant messages, and it carries a substantial performance penalty.
! Log level INFO for development and pre-production.
! Log level ERROR is sufficient in production.
Monitoring
! When writing data, Cassandra never checks whether there is enough space left on disk.
! It keeps writing data until the disk is full.
! This can bring the OS to a halt.
! Cassandra's error messages are confusing at this point.
! Thus, monitoring disk space is mandatory.
Monitoring: Disk Space
! Cassandra requires a lot of disk space for compactions.
! For SizeTieredCompactionStrategy, reserve as much free disk space as Cassandra needs to store the data.
! Set up monitoring on disk space.
! Receive an alert when the data-carrying disk partition fills up to 50%.
! Then add nodes to the cluster and rebalance.
Monitoring: Disk Space
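Such an alert can be as simple as a cron-driven shell check. The following is a sketch only; the 50% threshold matches the slide, but the path "/" stands in for the data-carrying partition (use /var/lib/cassandra or wherever your data directory lives):

```shell
#!/bin/sh
# Read the usage percentage of the filesystem holding the data directory.
# "/" is used here only as an example path.
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge 50 ]; then
    echo "ALERT: data disk ${USAGE}% full"
else
    echo "OK: data disk ${USAGE}% full"
fi
```

In practice the echo would be replaced by a call into whatever alerting channel the monitoring stack (ELK, OpsCenter, ...) provides.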
repartitionByCassandraReplica
[Diagram: four Cassandra nodes (Node 1-4) owning token ranges 1-25, 26-50, 51-75 and 76-0.]

Observation: some tasks took ~3s longer..
! the Spark job does not run on every C* node
! # Spark nodes < # Cassandra nodes
! # job cores < # Cassandra nodes
! all Spark job cores end up on one node
! time for the repartition > time saved through locality
Do not use repartitionByCassandraReplica when ...
! one query per partition key
! one query at a time per executor
joinWithCassandraTable
[Diagram: Spark sends its queries to Cassandra one after another, at t, t+1, t+2, t+3, t+4, t+5.]
! built a custom async implementation
joinWithCassandraTable
someDStream.transformToPair(rdd -> {
    return rdd.mapPartitionsToPair(iterator -> {
        ...
        Session session = cc.openSession();
        while (iterator.hasNext()) {
            ...
            session.executeAsync(..);
        }
        [collect futures]
        return List<Tuple2<Left, Optional<Right>>>;
    });
});
! solved with SPARKC-233 (1.6.0 / 1.5.1 / 1.4.3)
! 5-6 times faster than the synchronous implementation!
joinWithCassandraTable
! Might include a shuffle -> quite expensive
Left join with Cassandra
[Diagram: the left join is composed manually: join the RDD with C*, subtract the joined part from the original RDD, and union both results into the final RDD.]
Left join with Cassandra
! built a custom async implementation

someDStream.transformToPair(rdd -> {
    return rdd.mapPartitionsToPair(iterator -> {
        ...
        Session session = cc.openSession();
        while (iterator.hasNext()) {
            ...
            session.executeAsync(..);
            ...
        }
        [collect futures]
        return List<Tuple2<Left, Optional<Right>>>;
    });
});
! solved with SPARKC-181 (2.0.0)
! basically uses the async joinWithC* implementation
Left join with Cassandra
! spark.cassandra.connection.keep_alive_ms
! Default: 5s
! If the streaming batch interval is > 5s, a new connection is opened for every batch.
! Should be set to a multiple of the streaming interval!
Connection keep alive
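For example, with a 10 s streaming interval the keep-alive could be raised when submitting the job. The 60000 ms value is an arbitrary multiple of that interval chosen for illustration, not a recommendation from the talk:

```
spark-submit \
  --conf spark.cassandra.connection.keep_alive_ms=60000 \
  ...
```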
! cache() saves time by preventing recalculation.
! It also helps with correctness: without it, the mapping is re-executed for every action, here producing a different Date for each save.
Cache! Not only for performance
val changedStream = someDStream.map(e -> someMethod(e)).cache()
changedStream.saveToCassandra("keyspace", "table1")
changedStream.saveToCassandra("keyspace", "table1")

ChangedEntry someMethod(Entry e) {
    return new ChangedEntry(new Date(), ...);
}
• Know the most important internals
• Know your tools
• Monitor your cluster
• Use existing knowledge resources
• Use the mailing lists
• Participate in the community
Summary