@PatrickMcFadin
Patrick McFadin, Chief Evangelist for Apache Cassandra
Laying down the SMACK on your data pipelines
The problem
Your Magical App
Sad solutions
SMACK
•Spark •Mesos •Akka •Cassandra •Kafka

Organize: Kafka
Process: Akka and Spark
Store: Cassandra
All running on Mesos, scaled out across many Kafka, Akka, Spark, and Cassandra instances.
Managing Weather Data
Windsor, California: 67.3 F, Rainfall total: 1.2cm
Today: High: 73.4F, Low: 51.4F
Yesterday: High: 75.2F, Low: 52.3F
Our Magical App
Reactive and immediate
Batch
KillrWeather
https://github.com/killrweather/killrweather
Kafka
Kafka decouples data pipelines
The problem
Customers talk to the kitchen directly: "Hamburger please." "Meat disk on bread please."
Kitchen: "You mean a Hamburger?" "Uh, yeah. That."
The fix: put an Order Queue between the customers and the kitchen, so orders arrive in a consistent form and the kitchen works at its own pace.
Order from chaos
Producer → Topic = Food → Consumer
The producer appends Order 1 through Order 5 to the topic; the consumer reads each order off the topic in the same sequence.
Scale
Instead of one generic Food topic, the producer writes to separate topics:
Producer → Topic = Hamburgers (Order 1-5) → Consumer
Producer → Topic = Pizza (Order 1-5) → Consumer
Kafka
Producer (Collection API) → Broker → Consumers
Topic = Temperature: Temp 1-5 → Temperature Processor
Topic = Precipitation: Precip 1-5 → Precipitation Processor
Partitions
Each topic lives in one or more partitions on the broker. Temperature and Precipitation each start with a single Partition 0 holding messages Temp 1-5 and Precip 1-5.
Adding Partition 1 to the Temperature topic lets a second Temperature Processor consume in parallel.
Replication
Topic Temperature: Replication Factor = 2
Topic Precipitation: Replication Factor = 2
Each partition is replicated on a second broker, so a broker failure does not lose committed messages.
Guarantees
Order
•Messages are ordered as they are sent by the producer
•Consumers see messages in the order they were inserted by the producer
Durability
•Messages are delivered at least once
•With a Replication Factor of N, up to N-1 server failures can be tolerated without losing committed messages
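The ordering and at-least-once guarantees can be sketched with a toy in-memory log in plain Scala. This is an illustration of the idea, not Kafka's actual client API: the producer appends, the consumer tracks a committed offset, and re-reading after a crash replays from that offset.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of a single topic partition: an append-only log.
// Illustrates the ordering/at-least-once guarantees, NOT Kafka's API.
class TopicPartition[A] {
  private val log = ArrayBuffer.empty[A]

  // Producer side: messages are appended in send order; returns the offset.
  def append(msg: A): Long = { log += msg; log.size - 1L }

  // Consumer side: read everything from a given offset, in insertion order.
  def readFrom(offset: Long): List[A] = log.drop(offset.toInt).toList
}

val partition = new TopicPartition[String]
Seq("Order 1", "Order 2", "Order 3", "Order 4", "Order 5").foreach(partition.append)

// The consumer committed offset 3, then crashed before committing further.
// On restart it replays from the committed offset: at-least-once delivery
// means "Order 4" and "Order 5" may be seen twice, but never skipped.
println(partition.readFrom(3L))  // List(Order 4, Order 5)
```

Because consumers replay from the last committed offset rather than the broker tracking per-message acks, duplicates are possible but message loss is not, which is exactly the "delivered at least once" guarantee above.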
Akka
Akka in a nutshell
•Highly concurrent
•Reactive
•Fully distributed
•Completely elastic and resilient
Each actor has its own mailbox: messages sent to the actor queue up in the mailbox and are processed one at a time.
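The actor/mailbox pairing can be illustrated with a minimal plain-Scala sketch. This is a deliberately simplified, single-threaded stand-in for Akka's real asynchronous implementation, just to show the shape: one mailbox per actor, messages handled strictly in order.

```scala
import scala.collection.mutable

// A minimal, single-threaded stand-in for the actor/mailbox idea.
// Real Akka actors run asynchronously; this sketch only shows the shape:
// one mailbox per actor, messages handled one at a time, in FIFO order.
class ToyActor(handler: String => Unit) {
  private val mailbox = mutable.Queue.empty[String]

  // Fire-and-forget send: the message just lands in the mailbox.
  def !(msg: String): Unit = mailbox.enqueue(msg)

  // Drain the mailbox, invoking the handler on one message at a time.
  def processMailbox(): Unit =
    while (mailbox.nonEmpty) handler(mailbox.dequeue())
}

val seen = mutable.ArrayBuffer.empty[String]
val actor = new ToyActor(msg => seen += msg)

actor ! "temp=67.3"
actor ! "temp=68.1"
actor.processMailbox()

println(seen.toList)  // List(temp=67.3, temp=68.1)
```

Because each actor owns its mailbox and processes one message at a time, the handler never needs locks around its own state, which is what makes the model "highly concurrent" when many actors run side by side.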
KafkaStreamingActor
•Pulls from the Kafka queue
•Immediately saves to a Cassandra counter
kafkaStream.map { weather =>
  (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
}.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
Temperature High/Low Stream
Weather Stations → Receive API (Producer) → Kafka → Consumer → TemperatureActor instances
class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }
TemperatureActor

/** Computes and sends the daily aggregation to the `requester` actor.
  * We aggregate this data on-demand versus in the stream.
  *
  * For the given day of the year, aggregates 0 - 23 temp values to statistics:
  * high, low, mean, std, etc., and persists to the Cassandra daily temperature
  * table by weather station, automatically sorted by most recent - due to our
  * Cassandra schema - you don't need to do a sort in Spark.
  *
  * Because the gov. data is not by interval (window/slide) but by specific
  * date/time, we look for historic data for hours 0-23 that may or may not
  * exist yet, and create stats on what does exist at the time of request.
  */
def daily(day: Day, requester: ActorRef): Unit =
  (for {
    aggregate <- sc.cassandraTable[Double](keyspace, rawtable)
      .select("temperature")
      .where("wsid = ? AND year = ? AND month = ? AND day = ?",
             day.wsid, day.year, day.month, day.day)
      .collectAsync()
  } yield forDay(day, aggregate)) pipeTo requester
TemperatureActor

/** Only ever handles the 0-23 hourly values for a day, or fewer. */
private def forDay(key: Day, temps: Seq[Double]): WeatherAggregate =
  if (temps.nonEmpty) {
    val stats = StatCounter(temps)
    val data = DailyTemperature(
      key.wsid, key.year, key.month, key.day,
      high = stats.max, low = stats.min, mean = stats.mean,
      variance = stats.variance, stdev = stats.stdev)

    self ! data
    data
  } else NoDataAvailable(key.wsid, key.year, classOf[DailyTemperature])
TemperatureActor
/** Stores the daily temperature aggregates asynchronously, triggered by
  * on-demand requests during the `forDay` function's `self ! data`,
  * to the daily temperature aggregation table.
  */
private def store(e: DailyTemperature): Unit =
  sc.parallelize(Seq(e)).saveToCassandra(keyspace, dailytable)
Cassandra
Node
Token
•Consistent hash between -2^63 and 2^63-1
•Each node owns a range of those values
•The token is the beginning of that range to the next node's token value
•Virtual Nodes break these down further
Cluster
One server:
Token  Range
0      0-100

Two servers:
Token  Range
0      0-50
51     51-100

Four servers:
Token  Range
0      0-25
26     26-50
51     51-75
76     76-100
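Token ownership on the ring can be computed directly. A sketch using the simplified 0-100 token space from these slides (not Cassandra's real -2^63..2^63-1 Murmur3 range); the node names are illustrative:

```scala
// Simplified token ring using the slides' 0-100 token space.
// Each node's token is the START of the range it owns, which runs up to
// (but not including) the next node's token, wrapping around the ring.
def ownerOf(token: Int, ring: Seq[(Int, String)]): String = {
  val sorted = ring.sortBy(_._1)
  // The owner is the node with the largest start token <= the value;
  // values before the first token wrap around to the last node.
  sorted.reverse.find(_._1 <= token).getOrElse(sorted.last)._2
}

val ring = Seq(0 -> "node-0", 26 -> "node-26", 51 -> "node-51", 76 -> "node-76")

println(ownerOf(30, ring))  // node-26 owns 26-50
println(ownerOf(99, ring))  // node-76 owns 76-100
```

Adding a node just inserts another token into the ring, which is why each new server takes over part of an existing range rather than forcing a full reshuffle.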
Table

CREATE TABLE weather_station (
    id text,
    name text,
    country_code text,
    state_code text,
    call_sign text,
    lat double,
    long double,
    elevation double,
    PRIMARY KEY(id)
);

(The slide annotations call out the table name, column names, column CQL types, the primary key designation, and the partition key.)

Queries supported

CREATE TABLE raw_weather_data (
    wsid text,
    year int,
    month int,
    day int,
    hour int,
    temperature double,
    dewpoint double,
    pressure double,
    wind_direction int,
    wind_speed double,
    sky_condition int,
    sky_condition_text text,
    one_hour_precip double,
    six_hour_precip double,
    PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Get weather data given:
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
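Those three access patterns map directly onto the partition key plus the clustering columns. Hypothetical queries against the schema above (the station id and dates are illustrative values, not from the deck):

```sql
-- By weather station ID: the partition key alone.
SELECT * FROM raw_weather_data WHERE wsid = 'station-1';

-- By station and a specific time: partition key plus all clustering columns.
SELECT * FROM raw_weather_data
 WHERE wsid = 'station-1' AND year = 2014 AND month = 6 AND day = 1 AND hour = 12;

-- By station and a range of time: equality on the leading clustering
-- columns, then an inequality on the next one.
SELECT * FROM raw_weather_data
 WHERE wsid = 'station-1' AND year = 2014 AND month = 6
   AND day >= 1 AND day <= 7;
```

Because the clustering order is DESC on year, month, day, and hour, each of these comes back most-recent-first with no sort step needed.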
Replication
DC1: RF=1
Node      Primary
10.0.0.1  00-25
10.0.0.2  26-50
10.0.0.3  51-75
10.0.0.4  76-100
Replication
DC1: RF=2
Node      Primary  Replica
10.0.0.1  00-25    76-100
10.0.0.2  26-50    00-25
10.0.0.3  51-75    26-50
10.0.0.4  76-100   51-75
Replication
DC1: RF=3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50
Consistency
DC1: RF=3, with the same replica layout as above.
Client: Write to partition 15.
The consistency level sets how many replicas must acknowledge:

Consistency Level  Number of Nodes Acknowledged
One                One - Read repair triggered
Local One          One - Read repair in local DC
Quorum             51%
Local Quorum       51% in local DC
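The "51%" rows are just majority arithmetic: a quorum is floor(RF / 2) + 1 replicas. A quick sketch:

```scala
// Quorum: the smallest strict majority of replicas, floor(RF / 2) + 1.
def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

// With RF=3, a write at CL=QUORUM needs 2 acks and tolerates 1 replica down;
// with RF=5, it needs 3 acks and tolerates 2 replicas down.
println(quorum(3))  // 2
println(quorum(5))  // 3
```

This is why RF=3 is such a common choice: quorum reads and writes keep working with one replica unavailable, while still guaranteeing that any quorum read overlaps any quorum write.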
With CL=One, the write to partition 15 succeeds as soon as any single replica acknowledges it. With CL=Quorum and RF=3, a majority (two of the three replicas) must acknowledge before the write succeeds.
Multi-datacenter
Client: Write to partition 15 is replicated to both datacenters.
DC1: RF=3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50
DC2: RF=3
Node      Primary  Replica  Replica
10.1.0.1  00-25    76-100   51-75
10.1.0.2  26-50    00-25    76-100
10.1.0.3  51-75    26-50    00-25
10.1.0.4  76-100   51-75    26-50
Spark
Great combo
Store a ton of data. Analyze a ton of data.
Spark Streaming - Near Real-time
SparkSQL - Structured Data
MLLib - Machine Learning
GraphX - Graph Analysis
Spark Connector
A Spark Master coordinates Workers; each Worker runs Executors colocated with a Cassandra server.
Token Ranges 0-100 are split across the four workers: 0-24, 25-49, 50-74, 75-99. Each worker handles only its own ranges: "I will only analyze 25% of the data."
Analytics | Transactional
An Executor assigned the 75-99 range runs:

SELECT * FROM keyspace.table
WHERE token(pk) > 75 AND token(pk) <= 99

Through the Spark Connector, the results become a Spark RDD made up of Spark Partitions.
Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()
Each Executor runs

SELECT * FROM isd_weather_data.raw_weather_data

and the rows come back through the Spark Connector as the Spark Partitions of a Spark RDD.
Saving back the weather data

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
    SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6 AND temperature != 0.0
    GROUP BY wsid, year, month, day;
  """)
  .map { row =>
    (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
     row.getDouble(4), row.getDouble(5))
  }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")
Spark Streaming - Micro Batching
© 2015. All Rights Reserved.
DStream
Sliding Windows
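Micro-batching with sliding windows can be sketched without Spark at all: group a stream of (timestamp, value) readings into overlapping windows of a given length, advancing by a slide interval. A plain-Scala illustration of the concept (not the DStream API):

```scala
// Toy sliding windows over (secondsSinceStart, temperature) readings.
// windowLength and slideInterval are in the same units as the timestamps.
def slidingWindows(readings: Seq[(Int, Double)],
                   windowLength: Int,
                   slideInterval: Int): Seq[Seq[Double]] = {
  val end = readings.map(_._1).max
  // One window starts at every slide boundary from time 0.
  (0 to end by slideInterval).map { start =>
    readings.collect {
      case (t, v) if t >= start && t < start + windowLength => v
    }
  }
}

val readings = Seq(0 -> 60.0, 5 -> 61.0, 10 -> 63.0, 15 -> 62.0)
// 10-unit windows advancing by 5: windows overlap, so each reading may
// appear in more than one window -- the essence of a sliding window.
println(slidingWindows(readings, windowLength = 10, slideInterval = 5))
```

When windowLength equals slideInterval the windows stop overlapping, which is plain micro-batching; making windowLength larger than the slide is what turns batches into sliding windows.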
Mesos
Mesos runs many instances of Kafka, Akka, Spark, and Cassandra across one cluster.
Frameworks: "I need CPU!!" "I need memory!!"
Mesos: "Got you covered."
Mesos places the Kafka, Akka, and Spark workloads wherever resources are available.
Kafka on Mesos example
Scheduler
•Provides the operational automation for a Kafka cluster
•Manages the changes to the brokers' configuration
•Exposes a REST API for the CLI, or any other client, to use
•Runs on Marathon for high availability
Executor
•Interacts with the Kafka broker as an intermediary to the scheduler
Go get your SMACK on
Thank you!
Follow me on twitter: @PatrickMcFadin