Spark/Cassandra/Zeppelin for particle accelerator metrics storage and aggregation
DuyHai DOAN
Apache Cassandra Evangelist
@doanduyhai
Who Am I?
DuyHai DOAN, Apache Cassandra Evangelist
• talks, meetups, confs …
• open-source projects (Achilles, Apache Zeppelin ...)
• OSS Cassandra point of contact
☞ [email protected] ☞ @doanduyhai
The HDB++ project
• What is a Synchrotron?
• HDB++ project presentation
• Why Spark, Cassandra and Zeppelin?
What is a Synchrotron?
• particle accelerator (electrons)
• electron beams used for crystallography analysis of:
  • materials
  • molecular biology
  • …
The HDB++ project
• Sub-project of TANGO, a software toolkit to
  • connect
  • control/monitor
  • integrate sensor devices
• HDB++ = the new TANGO event-driven archiving system
  • historically used MySQL
  • now stores data into Cassandra
The HDB++ project (as of September 2015)
The HDB++ GUI
The HDB++ hardware specs
Q & A
The HDB++ Cassandra data model
Metrics table
CREATE TABLE hdb.att_scalar_devshort_ro (
    att_conf_id timeuuid,
    period text,
    data_time timestamp,
    data_time_us int,
    error_desc text,
    insert_time timestamp,
    insert_time_us int,
    quality int,
    recv_time timestamp,
    recv_time_us int,
    value_r int,
    PRIMARY KEY ((att_conf_id, period), data_time, data_time_us)
);
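As a hedged illustration of how this model is meant to be read (the attribute id and the period/date values below are placeholders, following the xxxx convention used later in the deck): the composite partition key ((att_conf_id, period)) buckets one attribute's points per period, and the (data_time, data_time_us) clustering columns allow time-range scans inside a single bucket:

```sql
-- Fetch one attribute's raw points inside a single period bucket
-- (single-partition read, sliced on the clustering column)
SELECT data_time, data_time_us, value_r, quality
FROM hdb.att_scalar_devshort_ro
WHERE att_conf_id = xxxx
  AND period = '2016-06-28'
  AND data_time >= '2016-06-28 08:00:00'
  AND data_time < '2016-06-28 12:00:00';
```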
Statistics table
CREATE TABLE hdb.stat_scalar_devshort_ro (
    att_conf_id text,
    type_period text,   // HOUR, DAY, MONTH, YEAR
    period text,        // yyyy-MM-dd:HH, yyyy-MM-dd, yyyy-MM, yyyy
    count_distinct_error bigint,
    count_error bigint,
    count_point bigint,
    value_r_max int,
    value_r_min int,
    value_r_mean double,
    value_r_sd double,
    PRIMARY KEY ((att_conf_id, type_period), period)
);
INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'DAY', '2016-06-28', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'HOUR', '2016-06-28:01', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'MONTH', '2016-06', 123.456);

// Request by period of time
SELECT * FROM hdbtest.stat_scalar_devshort_ro
WHERE att_conf_id = xxx
  AND type_period = 'DAY'
  AND period > '2016-06-20' AND period < '2016-06-28';
Q & A
The Spark jobs
Source code
val devShortRoTable = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "att_scalar_devshort_ro", "keyspace" -> "hdbtest"))
  .load()

devShortRoTable.registerTempTable("att_scalar_devshort_ro")
val devShortRo = sqlContext.sql(s"""
  SELECT "DAY" AS type_period,
         att_conf_id,
         period,
         count(att_conf_id) AS count_point,
         count(error_desc) AS count_error,
         count(DISTINCT error_desc) AS count_distinct_error,
         min(value_r) AS value_r_min,
         max(value_r) AS value_r_max,
         avg(value_r) AS value_r_mean,
         stddev(value_r) AS value_r_sd
  FROM att_scalar_devshort_ro
  WHERE period="${day}"
  GROUP BY att_conf_id, period""")
devShortRo.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stat_scalar_devshort_ro", "keyspace" -> "hdbtest"))
  .mode(SaveMode.Append)
  .save()
Demo Zeppelin
Zeppelin visualisation (export as iframe)
Q & A
Spark/Cassandra/Zeppelin tricks and traps
• Zeppelin/Spark/Cassandra
• Spark/Cassandra
Zeppelin/Spark/Cassandra
• Legend
  • 💣 = trap
  • 💡 = trick
• Zeppelin build mode
  • standard
  • with the Spark-Cassandra connector (Maven profile -Pcassandra-spark-1.x)
• Spark run mode
  • local
  • with a stand-alone Spark cluster co-located with Cassandra
• Zeppelin build mode standard, Spark run mode local
  • need to add the Spark-Cassandra connector as a dependency of the Spark interpreter 💡
• Zeppelin build mode standard, Spark run mode local
  • on Spark interpreter init, all declared dependencies are fetched from the declared repositories (default = Maven Central + the local Maven repo)
  • beware of corporate firewalls!!! 💣
  • where are the downloaded dependencies (jars) stored? 💡
• Zeppelin build mode standard, Spark run mode cluster
  • Zeppelin uses the spark-submit command
  • the Spark interpreter is run by bin/interpreter.sh 💡
• Zeppelin build mode standard, Spark run mode cluster
  • run at least ONCE in local mode so that Zeppelin can download the dependencies into its local repo (zeppelin.interpreter.localRepo)!!! 💣
• Zeppelin build mode with connector, Spark run mode local or cluster
  • runs smoothly because all Spark-Cassandra connector dependencies are merged into the interpreter/spark/dep/zeppelin-spark-dependencies-x.y.z.jar fat jar during the build process 💡
• OSS Spark
  • needs the Spark-Cassandra connector dependencies added in conf/spark-env.sh, otherwise:

    ...
    Caused by: java.lang.NoClassDefFoundError: com/datastax/driver/core/ConsistencyLevel
• OSS Spark
  • needs to provide all transitive dependencies of the Spark-Cassandra connector!!! 💣
  • either in conf/spark-env.sh
  • or with the spark-submit --packages groupId:artifactId:version option
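A hedged sketch of both options (the jar path, connector coordinates, versions, job class and jar name are illustrative; pick coordinates matching your Spark and Scala versions):

```shell
# Option 1: conf/spark-env.sh — put the connector (and its transitive
# dependencies) on the classpath; the assembly jar path is illustrative
export SPARK_CLASSPATH=/opt/jars/spark-cassandra-connector-assembly-1.6.0.jar

# Option 2: let spark-submit resolve the connector and its transitive
# dependencies from Maven Central
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
  --class com.acme.StatJob \
  my-job.jar
```

Option 2 avoids maintaining the dependency list by hand, but it downloads from Maven Central, so the corporate-firewall trap above applies here too.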
• DSE Spark
  • runs smoothly because the Spark-Cassandra connector dependencies are already embedded in the package ($DSE_HOME/resources/spark/lib)
Spark/Cassandra
• Spark deploy mode (spark-submit --deploy-mode)
  • client
  • cluster
• Zeppelin deploys using client mode by default 💡
• Spark client deploy mode
  • the default
  • needs to ship all driver program dependencies to the workers (network intensive)
  • suitable for REPLs (Spark shell, Zeppelin)
  • suitable for one-shot jobs/testing
• Spark cluster deploy mode
  • the driver program runs on a worker node
  • all driver program dependencies must be reachable by every worker
  • dependencies are usually stored in HDFS, but can live on the local FS of all workers
  • suitable for recurrent jobs
  • needs a consistent build & deploy process for your jobs
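For the recurrent-job case, a cluster-mode submission could look like the following sketch (the master URL, job class and HDFS path are illustrative):

```shell
# Cluster deploy mode: the driver is started on a worker node, so the
# job jar must be reachable from every worker — hence the HDFS path
spark-submit \
  --deploy-mode cluster \
  --master spark://spark-master:7077 \
  --class com.acme.StatJob \
  hdfs:///jobs/my-job.jar
```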
• The job fails when using spark-submit
  • but succeeded with Zeppelin …
  • error: value stddev not found
• Indeed, Zeppelin uses a Hive context by default …
• a job launched with plain spark-submit gets a plain SQLContext, which (in Spark 1.x) does not provide stddev 💣
• Fix: create a Hive context explicitly in the job
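A minimal sketch of that fix, assuming Spark 1.x (the app name is illustrative): build a HiveContext instead of a plain SQLContext, so the Hive UDAFs such as stddev are available under spark-submit too:

```scala
import org.apache.spark.{SparkConf, SparkContext}
// HiveContext lives in the spark-hive module (Spark 1.x)
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hdb-stats"))
// HiveContext registers the Hive UDAFs (stddev, ...) that the
// spark-submit job was missing with a plain SQLContext
val sqlContext = new HiveContext(sc)
```

With this sqlContext, the aggregation query shown earlier runs unchanged under spark-submit.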
Q & A
Cassandra Summit 2016, September 7-9, San Jose, CA
Get 15% Off with Code: DoanDuy15
Cassandrasummit.org