Martin Zapletal @zapletal_martin
Cake Solutions @cakesolutions
#CassandraSummit
Presented by Anirvan Chakraborty @anirvan_c
● Introduction
● Event sourcing and CQRS
● An emerging technology stack to handle data
● A reference application and its architecture
● A few use cases of the reference application
● Conclusion
● Increasing importance of data analytics
● Current state
  ○ Destructive updates
  ○ Analytics tools with poor scalability and integration
  ○ Manual processes
  ○ Slow iterations
  ○ Not suitable for large amounts of data
● Whole lifecycle of data
● Data processing
● Data stores
● Integration and messaging
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning
● Spark, Mesos, Akka, Cassandra, Kafka (SMACK, Infinity)
● Create, Read, Update, Delete
● Exposes mutable internal state
● Many read methods on repositories
● Mapping of data model and objects (impedance mismatch)
● No auditing
● No separation of concerns (read / write, command / event)
● Strongly consistent
● Difficult optimizations of reads / writes
● Difficult to scale
● Intent, behaviour and history are lost
[Figure: an "Update Account" operation on an Account; Balance = 5 is overwritten by Balance = 10]
[1]
[Figure: CQRS. Labels: Client, Command, Query, DB, DB, Denormalise/Precompute]
[Figure: Kappa architecture and batch pipeline. Labels: All your data, Kafka, Stream processor, Views, NoSQL, SQL, Spark, Client; Flume, Sqoop, Hive, Impala, Oozie, HDFS]
[Figure: Lambda architecture. Labels: All your data, Batch layer, Stream layer (fast), Serving layer, Serving DB, Query]
● Append only data store
● No updates or deletes (rewriting history)
● Immutable data model
● Decouples data model of the application and storage
● Current state not persisted, but derived from the sequence of updates that led to it
● History, state known at any point in time
● Replayable
● Source of truth
● Optimisations possible
● Works well in distributed environment - easy partitioning, conflicts
● Helps avoiding transactions
● Works well with DDD
Event journal

userId  date        change  event
1       10/10/2015  +300    balanceChanged
1       11/10/2015  -100    balanceChanged
1       23/10/2015  -200    balanceChanged
1       24/10/2015  +100    balanceChanged
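As a minimal sketch of how current state can be derived rather than stored, the journal shown above can be replayed with a fold. The names `EventJournalSketch` and `BalanceChanged` are hypothetical and chosen only for this illustration; they are not taken from the reference application.

```scala
// Illustrative sketch only: names are hypothetical, not from the reference application.
object EventJournalSketch {
  // An immutable event: a record of something that has already happened.
  final case class BalanceChanged(userId: Int, date: String, change: Int)

  // Append-only journal holding the four events from the journal above.
  val journal: Vector[BalanceChanged] = Vector(
    BalanceChanged(1, "10/10/2015", 300),
    BalanceChanged(1, "11/10/2015", -100),
    BalanceChanged(1, "23/10/2015", -200),
    BalanceChanged(1, "24/10/2015", 100)
  )

  // Current state is not stored; it is derived by replaying the events in order.
  def balance(userId: Int, events: Seq[BalanceChanged]): Int =
    events.filter(_.userId == userId).foldLeft(0)((acc, e) => acc + e.change)

  // State at an earlier point in time is a replay of a prefix of the journal.
  def balanceAfter(userId: Int, n: Int, events: Seq[BalanceChanged]): Int =
    balance(userId, events.take(n))
}
```

Replaying the whole journal for user 1 gives the current balance, while `balanceAfter` shows that any historical state stays available because nothing is overwritten.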
● Command Query Responsibility Segregation
● Read and write logically and physically separated
● Reasoning about the application
● Clear separation of concerns (business logic)
● Often different technology, scalability
● Often lower consistency - eventual, causal
Command
● Write side
● Messages, requests to mutate state
● Behaviour, serialized method call essentially
● Don’t expose state
● Validated and may be rejected or emit one or more events (e.g. submitting a form)
Event
● Write side
● Immutable
● Indicating something that has happened
● Atomic record of state change
● Audit log
Query
● Read side
● Precomputed
Write
Command (userId = 1, updateBalance(+100)) → Command handler → Event (balanceChanged) → Event journal

userId  date        change  event
1       10/10/2015  +300    balanceChanged
1       11/10/2015  -100    balanceChanged
1       23/10/2015  -200    balanceChanged
1       24/10/2015  +100    balanceChanged

Read
Query (userId = 1) → userId = 1, balance = 100

userId  balance
1       100
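The write and read paths above can be sketched in a few lines of plain Scala. This is an illustration under stated assumptions, not the reference application's code: the names `UpdateBalance`, `BalanceChanged`, `handle` and `project` are hypothetical, and the "no negative balance" validation rule is invented purely to show a command being rejected.

```scala
// Illustrative sketch only: names and the validation rule are assumptions.
object CqrsSketch {
  // Write side: a command is a request to mutate state and may be rejected.
  final case class UpdateBalance(userId: Int, change: Int)

  // Write side: an event is an immutable record of a state change that happened.
  final case class BalanceChanged(userId: Int, change: Int)

  // Command handler: validates the command against the current state and
  // either rejects it or emits one or more events.
  def handle(currentBalance: Int, cmd: UpdateBalance): Either[String, List[BalanceChanged]] =
    if (currentBalance + cmd.change < 0) Left("insufficient funds") // assumed rule
    else Right(List(BalanceChanged(cmd.userId, cmd.change)))

  // Read side: a precomputed view (userId -> balance) updated from each event.
  def project(view: Map[Int, Int], event: BalanceChanged): Map[Int, Int] =
    view.updated(event.userId, view.getOrElse(event.userId, 0) + event.change)
}
```

Queries then read only from the projected view, so the read model can use a different data store and a weaker consistency level than the write side.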
● Partial order of events for each entity
● Operation semantics, CRDTs
[Figure: events UserNameUpdated(A) and UserNameUpdated(B), each shown twice]
● Localization
● Conflicting concurrent histories
  ○ Resubmission
  ○ Deduplication
  ○ Replication
● Identifier
● Version
● Timestamp
● Vector clock
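To make the vector clock item concrete, here is a small, self-contained sketch of how two clocks can be compared to detect conflicting concurrent histories. It is a generic textbook formulation, not Akka's own `VectorClock` class, and all names in it (`VectorClockSketch`, `Relation`, `tick`, `merge`, `compare`) are hypothetical.

```scala
// Illustrative sketch only: a generic vector clock, not Akka's implementation.
object VectorClockSketch {
  // One logical counter per node; a missing entry counts as 0.
  type VectorClock = Map[String, Long]

  sealed trait Relation
  case object Before extends Relation
  case object After extends Relation
  case object Same extends Relation
  case object Concurrent extends Relation

  // A node increments its own counter when it records an event.
  def tick(clock: VectorClock, node: String): VectorClock =
    clock.updated(node, clock.getOrElse(node, 0L) + 1L)

  // Merging takes the per-node maximum of both clocks.
  def merge(a: VectorClock, b: VectorClock): VectorClock =
    (a.keySet ++ b.keySet)
      .map(n => n -> math.max(a.getOrElse(n, 0L), b.getOrElse(n, 0L)))
      .toMap

  // a is Before b if every counter in a is <= the one in b and the clocks differ;
  // if neither clock dominates the other, the histories are Concurrent.
  def compare(a: VectorClock, b: VectorClock): Relation = {
    val nodes = a.keySet ++ b.keySet
    val aLeB = nodes.forall(n => a.getOrElse(n, 0L) <= b.getOrElse(n, 0L))
    val bLeA = nodes.forall(n => b.getOrElse(n, 0L) <= a.getOrElse(n, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Same
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }
}
```

Two updates made independently on nodes "A" and "B" compare as `Concurrent`, which is the signal that a conflict resolution strategy is needed.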
● Actor framework for truly concurrent and distributed systems
● Thread safe mutable state - consistency boundary
● Domain modelling, distributed state
● Simple programming model - asynchronously send messages, create new actors, change behaviour
● Supports CQRS/ES
● Fully distributed - asynchronous, delivery guarantees, failures, time and order, consistency, availability, communication patterns, data locality, persistence, durability, concurrent updates, conflicts, divergence, invariants, ...
● Actor backed by data store
● Immutable event sourced journal
● Supports CQRS (write and read side)
● Persistence, replay on failure, rebalance, at least once delivery
import akka.actor.ActorRef
import akka.cluster.ddata.DistributedData
import akka.persistence.PersistentActor
import scalaz.\/-

class UserActor extends PersistentActor {
  override def persistenceId: String = UserPersistenceId(self.path.name).persistenceId

  // Start in the notRegistered behaviour, passing in the Distributed Data replicator.
  override def receiveCommand: Receive = notRegistered(DistributedData(context.system).replicator)

  def notRegistered(distributedData: ActorRef): Receive = {
    case cmd: AccountCommand =>
      // Persist the event first; only then change behaviour and acknowledge the sender.
      persist(AccountEvent(cmd.account)) { _ =>
        context.become(registered(cmd.account))
        sender() ! \/-(())
      }
  }

  def registered(account: Account): Receive = {
    case eres @ EntireResistanceExerciseSession(id, session, sets, examples, deviations) =>
      persist(eres)(data => sender() ! \/-(id))
  }

  override def receiveRecover: Receive = { ... }
}
● Akka Persistence Cassandra journal
  ○ Globally distributed journal
  ○ Scalable, resilient, highly available
  ○ Performant, operational database
● Community plugins
akka {
  persistence {
    journal.plugin = "cassandra-journal"
    snapshot-store.plugin = "cassandra-snapshot-store"
  }
}
● Partition-size
● Events in each cluster partition ordered (persistenceId - partition pair)

CREATE TABLE IF NOT EXISTS ${tableName} (
  processor_id text,
  partition_nr bigint,
  sequence_nr bigint,
  marker text,
  message blob,
  PRIMARY KEY ((processor_id, partition_nr), sequence_nr, marker))
  WITH COMPACT STORAGE
  AND gc_grace_seconds = ${config.gc_grace_seconds}

processor_id  partition_nr  sequence_nr  marker  message
user-1        0             0            H       0x0a6643b334...
user-1        0             1            A       0x0ab2020801...
user-1        0             2            A       0x0a98020801...
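The composite partition key `(processor_id, partition_nr)` bounds how many events land in one Cassandra partition. The sketch below shows one way a partition number could be derived from a sequence number and a configured partition size. The integer-division formula and the names `JournalPartitionSketch`, `PartitionKey` and `partitionKey` are assumptions made for illustration; the plugin's actual calculation may differ.

```scala
// Illustrative sketch only: the formula is an assumption, not the plugin's code.
object JournalPartitionSketch {
  // Mirrors the partition key columns (processor_id, partition_nr) of the table.
  final case class PartitionKey(processorId: String, partitionNr: Long)

  // Assumed rule: events are grouped into fixed-size ranges of sequence numbers,
  // so each (processorId, partitionNr) pair holds at most partitionSize events.
  def partitionKey(processorId: String, sequenceNr: Long, partitionSize: Long): PartitionKey = {
    require(partitionSize > 0, "partitionSize must be positive")
    PartitionKey(processorId, sequenceNr / partitionSize)
  }
}
```

Within one partition, `sequence_nr` is the clustering column, so events for a given key are stored and read back in sequence order.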
● Internal state, moment in time
● Read optimization

CREATE TABLE IF NOT EXISTS ${tableName} (
  processor_id text,
  sequence_nr bigint,
  timestamp bigint,
  snapshot blob,
  PRIMARY KEY (processor_id, sequence_nr))
  WITH CLUSTERING ORDER BY (sequence_nr DESC)

processor_id  sequence_nr  snapshot         timestamp
user-1        16           0x0400000001...  1441696908210
user-1        20           0x0400000001...  1441697587765
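A snapshot is a read optimization: recovery can start from the most recent snapshot and replay only the events recorded after it. The following plain Scala sketch illustrates that idea on an in-memory journal. It does not use the Akka Persistence API, and the names `SnapshotRecoverySketch`, `Snapshot`, `BalanceChanged` and `recover` are hypothetical.

```scala
// Illustrative sketch only: in-memory stand-ins for the snapshot and journal tables.
object SnapshotRecoverySketch {
  // State captured at a given sequence number.
  final case class Snapshot(sequenceNr: Long, balance: Int)

  // A journalled event with its sequence number.
  final case class BalanceChanged(sequenceNr: Long, change: Int)

  def recover(snapshots: Seq[Snapshot], journal: Seq[BalanceChanged]): Int = {
    // Pick the snapshot with the highest sequence number, if there is one.
    val latest: Option[Snapshot] =
      if (snapshots.isEmpty) None else Some(snapshots.maxBy(_.sequenceNr))
    val startState = latest.map(_.balance).getOrElse(0)

    // Replay only the events that are newer than the snapshot, in sequence order.
    journal
      .filter(e => latest.forall(s => e.sequenceNr > s.sequenceNr))
      .sortBy(_.sequenceNr)
      .foldLeft(startState)((acc, e) => acc + e.change)
  }
}
```

Recovering with or without a snapshot yields the same state; the snapshot only shortens the replay.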
● Uses Akka serialization
[Figure: serialization of a journal message 0x0a6643b334…; labels: PersistentRepr, Akka.Serialization, Payload: T, Protobuf]
actor {
  serialization-bindings {
    "io.muvr.exercise.ExercisePlanDeviation" = kryo,
    "io.muvr.exercise.ResistanceExercise" = kryo,
  }
  serializers {
    java = "akka.serialization.JavaSerializer"
    kryo = "com.twitter.chill.akka.AkkaSerializer"
  }
}
import java.util.concurrent.TimeUnit
import akka.persistence.PersistentView
import scala.concurrent.duration.FiniteDuration

class UserActorView(userId: String) extends PersistentView {
  // The view follows the journal of the persistent actor with this persistenceId.
  override def persistenceId: String = UserPersistenceId(userId).persistenceId
  override def viewId: String = UserPersistenceId(userId).persistentViewId
  override def autoUpdateInterval: FiniteDuration = FiniteDuration(100, TimeUnit.MILLISECONDS)

  def receive: Receive = viewState(List.empty)

  def viewState(processedDeviations: List[ExercisePlanProcessedDeviation]): Receive = {
    // Journalled events update the view state.
    case EntireResistanceExerciseSession(_, _, _, _, deviations) if isPersistent =>
      context.become(viewState(deviations.filter(condition).map(process) ::: processedDeviations))
    // Queries are answered from the precomputed state.
    case GetProcessedDeviations => sender() ! processedDeviations
  }
}
● Akka 2.4
● Potentially infinite stream of data
● Ordered, replayable, resumable
● Aggregation, transformation, moving data
● EventsByPersistenceId
● AllPersistenceIds
● EventsByTag
val readJournal =
  PersistenceQuery(system).readJournalFor(CassandraJournal.Identifier)

val source = readJournal.query(
    EventsByPersistenceId(UserPersistenceId(name).persistenceId, 0, Long.MaxValue), NoRefresh)
  .map(_.event)
  .collect { case s: EntireResistanceExerciseSession => s }
  .mapConcat(_.deviations)
  .filter(condition)
  .map(process)

implicit val mat = ActorMaterializer()
val result = source.runFold(List.empty[ExercisePlanDeviation])((x, y) => y :: x)
● Potentially infinite stream of events
Source[Any].map(process).filter(condition)
[Figure: Publisher → process → condition → Subscriber, with backpressure]
● In Akka we have the read and write sides separated, in Cassandra we don’t
● Different data model
● Avoid using operational datastore
● Eventual consistency
● Streaming transformations to different format
● Unify journalled and other data
● Computations and analytics queries on the data
● Often iterative, complex, expensive computations
● Prepared and interactive queries
● Data from multiple sources, joins and transformations
● Often directly on a stream of data
● Whole history of events
● Historical behaviour
● Works retrospectively, can answer questions in the future that we don’t know exist yet
● Various data types from various sources
● Large amounts of fast data
● Automated analytics
● Cassandra 3.0 - user defined functions, functional indexes, aggregation functions, materialized views
● Server side denormalization
● Eventual consistency
● Copy of data with different partitioning
[Figure labels: userId, performance]
● In memory dataflow distributed data processing framework, streaming and batch
● Distributes computation using a higher level API
● Load balancing
● Moves computation to data
● Fault tolerant
● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations
  ○ Lazy, form the DAG
  ○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
  ○ Execute DAG, retrieve result
  ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators
● Broadcast Variables
● Integration
● Streaming
● Machine Learning
● Graph Processing
[Figure: DAG textFile → map → map → reduceByKey → collect]
sc.textFile("counts")
  .map(line => line.split("\t"))
  .map(word => (word(0), word(1).toInt))
  .reduceByKey(_ + _)
  .collect()
[4]
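To show what the pipeline above computes without needing a Spark cluster, the same steps can be mimicked on a local Scala collection. This is only an analogy: `groupBy` followed by a per-key sum stands in for Spark's `reduceByKey`, it runs eagerly on one machine, and the names `ReduceByKeySketch` and `countsByKey` are hypothetical.

```scala
// Illustrative sketch only: a local-collections analogue of the Spark pipeline.
object ReduceByKeySketch {
  // Each input line is expected to look like "key<TAB>count".
  def countsByKey(lines: Seq[String]): Map[String, Int] =
    lines
      .map(line => line.split("\t"))                  // split into fields
      .map(fields => (fields(0), fields(1).toInt))    // (key, count) pairs
      .groupBy(_._1)                                  // group pairs by key
      .map { case (key, pairs) => key -> pairs.map(_._2).sum } // sum per key
}
```

In Spark the transformations are lazy and only build the DAG; nothing runs until an action such as `collect()` is called, whereas the local version evaluates each step immediately.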
● Cassandra can store
● Spark can process
● Gathering large amounts of heterogeneous data
● Queries
● Transformations
● Complex computations
● Machine learning, data mining, analytics
● Now possible
● Prepared and interactive queries
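import com.datastax.spark.connector._ // adds cassandraTable and saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

lazy val sparkConf: SparkConf =
  new SparkConf()
    .setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)

// Read a Cassandra table as an RDD, transform it, and write the result back.
val data = sc.cassandraTable[T]("keyspace", "table").select("columns")
val processedData = data.flatMap(...)...
processedData.saveToCassandra("keyspace", "table")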
● Akka Analytics project
● Handles custom Akka serialization

case class JournalKey(persistenceId: String, partition: Long, sequenceNr: Long)

lazy val sparkConf: SparkConf =
  new SparkConf()
    .setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)

val events: RDD[(JournalKey, Any)] = sc.eventTable()
events.sortByKey().map(...).filter(...).collect().foreach(println)
● Spark streaming
● Precomputing using Spark or replication, often aiming for a different data model
[Figure: Operational cluster and Analytics cluster. Labels: Precomputation / replication, Integration with other data sources]
// Deviations extracted from all journalled exercise sessions (filterClass is a project-specific helper).
val exerciseDeviations = sc.eventTable().cache().filterClass[EntireResistanceExerciseSession].flatMap(_.deviations)
val deviationsFrequency = sqlContext.sql(
"""SELECT planned.exercise, hour(time), COUNT(1)
FROM exerciseDeviations
WHERE planned.exercise = 'bench press'
GROUP BY planned.exercise, hour(time)""")
val deviationsFrequency2 = exerciseDeviationsDF
  .where(exerciseDeviationsDF("planned.exercise") === "bench press")
  .groupBy(
    exerciseDeviationsDF("planned.exercise"),
    exerciseDeviationsDF("time"))
  .count()
val deviationsFrequency3 = exerciseDeviations
.filter(_.planned.exercise == "bench press")
.groupBy(d => (d.planned.exercise, d.time.getHours))
.map(d => (d._1, d._2.size))
def toVector(user: User): mllib.linalg.Vector =
  Vectors.dense(
    user.frequency, user.performanceIndex, user.improvementIndex)

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val users: RDD[User] = events.filterClass[User]

val kmeans = new KMeans()
  .setK(5)
  .set...

// Convert each user to a feature vector before clustering.
val clusters = kmeans.run(users.map(toVector))
val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()

// One Rating per (session, exercise) pair; the rating is how often the pair occurs.
val ratings = events
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(session =>
    session.sets.flatMap(set =>
      set.sets.map(exercise => (session.id.id, exercise.exercise))))
  .groupBy(e => e)
  .map(g =>
    Rating(normalize(g._1._1), normalize(g._1._2),
      normalize(g._2.size)))

val model = new ALS().run(ratings)
val predictions = model.predict(recommend)
[Figure: user × exercise rating matrix; columns: bench press, bicep curl, dead lift]
user 1 5 2
user 2 4 3
user 3 5 2
user 4 3 1
val events = sc.eventTable().cache().toDF()

val lr = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(new UserFilter(), new ZScoreNormalizer(),
  new IntensityFeatureExtractor(), lr))

// build() turns the grid definition into the Array[ParamMap] the estimator expects.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept, Array(true, false))
  .build()

getEligibleUsers(events, sessionEndedBefore)
  .map { user =>
    val trainValidationSplit = new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)

    // Fit one model per user, passing the user id as an extra parameter.
    val model = trainValidationSplit.fit(
      events,
      ParamMap(ParamPair(userIdParam, user)))
    val testData = // Prepare test data.
    val predictions = model.transform(testData)
    submitResult(user, predictions, config)
  }
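import org.apache.spark.graphx._

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val connections = events.filterClass[Connections]

// Each user is a vertex; each connection is a directed edge with weight 1.
val vertices: RDD[(VertexId, Long)] =
  connections.map(c => (c.id, 1L))
val edges: RDD[Edge[Long]] = connections
  .flatMap(c => c.connections
    .map(Edge(c.id, _, 1L)))

val graph = Graph(vertices, edges)
// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices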
● Exercise domain as an example
● Analytics of both batch (offline) and streaming (online) data
● Analytics important in other areas (banking, stock market, network, cluster monitoring, business intelligence, commerce, internet of things, ...)
● Enabling value of data
● Event sourcing
● CQRS
● Technologies to handle the data
  ○ Spark
  ○ Mesos
  ○ Akka
  ○ Cassandra
  ○ Kafka
● Handling data
● Insights and analytics enable value in data
● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Martin Zapletal @zapletal_martin
● Anirvan Chakraborty @anirvan_c
[1] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
[2] http://malteschwarzkopf.de/research/assets/google-stack.pdf
[3] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
[4] http://www.slideshare.net/LisaHua/spark-overview-37479609