Spark/Cassandra/Zeppelin for particle accelerator metrics storage and aggregation
DuyHai DOAN
Apache Cassandra Evangelist
@doanduyhai
Who Am I?
DuyHai DOAN, Apache Cassandra Evangelist
• talks, meetups, confs …
• open-source projects (Achilles, Apache Zeppelin ...)
• OSS Cassandra point of contact
☞ [email protected] ☞ @doanduyhai
The HDB++ project
• What is a Synchrotron?
• HDB++ project presentation
• Why Spark, Cassandra and Zeppelin?
What is a Synchrotron?
• particle accelerator (electrons)
• electron beams used for crystallography analysis of:
  • materials
  • molecular biology
  • …
The HDB++ project
• Sub-project of TANGO, a software toolkit to
  • connect
  • control/monitor
  • integrate sensor devices
• HDB++ = the new TANGO event-driven archiving system
  • historically used MySQL
  • now stores data into Cassandra
The HDB++ project (as of September 2015)
The HDB++ GUI
The HDB++ hardware specs
Q & A
The HDB++ Cassandra data model
Metrics table
CREATE TABLE hdb.att_scalar_devshort_ro (
    att_conf_id timeuuid,
    period text,
    data_time timestamp,
    data_time_us int,
    error_desc text,
    insert_time timestamp,
    insert_time_us int,
    quality int,
    recv_time timestamp,
    recv_time_us int,
    value_r int,
    PRIMARY KEY ((att_conf_id, period), data_time, data_time_us)
);
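As a hedged illustration of how this model is meant to be read (the attribute id and the period/date values below are placeholders, following the xxxx convention used later in the deck): the composite partition key ((att_conf_id, period)) buckets one attribute's points per period, and the (data_time, data_time_us) clustering columns allow time-range scans inside a single bucket:

```sql
-- Fetch one attribute's raw points inside a single period bucket
-- (single-partition read, sliced on the clustering column)
SELECT data_time, data_time_us, value_r, quality
FROM hdb.att_scalar_devshort_ro
WHERE att_conf_id = xxxx
  AND period = '2016-06-28'
  AND data_time >= '2016-06-28 08:00:00'
  AND data_time < '2016-06-28 12:00:00';
```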
Statistics table
CREATE TABLE hdb.stat_scalar_devshort_ro (
    att_conf_id text,
    type_period text,   // HOUR, DAY, MONTH, YEAR
    period text,        // yyyy-MM-dd:HH, yyyy-MM-dd, yyyy-MM, yyyy
    count_distinct_error bigint,
    count_error bigint,
    count_point bigint,
    value_r_max int,
    value_r_min int,
    value_r_mean double,
    value_r_sd double,
    PRIMARY KEY ((att_conf_id, type_period), period)
);
INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'DAY', '2016-06-28', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'HOUR', '2016-06-28:01', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'MONTH', '2016-06', 123.456);

// Request by period of time
SELECT * FROM hdbtest.stat_scalar_devshort_ro
WHERE att_conf_id = xxx
  AND type_period = 'DAY'
  AND period > '2016-06-20' AND period < '2016-06-28';
Q & A
The Spark jobs
Source code
val devShortRoTable = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "att_scalar_devshort_ro", "keyspace" -> "hdbtest"))
  .load()

devShortRoTable.registerTempTable("att_scalar_devshort_ro")
val devShortRo = sqlContext.sql(s"""
  SELECT "DAY" AS type_period,
         att_conf_id,
         period,
         count(att_conf_id) AS count_point,
         count(error_desc) AS count_error,
         count(DISTINCT error_desc) AS count_distinct_error,
         min(value_r) AS value_r_min,
         max(value_r) AS value_r_max,
         avg(value_r) AS value_r_mean,
         stddev(value_r) AS value_r_sd
  FROM att_scalar_devshort_ro
  WHERE period="${day}"
  GROUP BY att_conf_id, period""")
devShortRo.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stat_scalar_devshort_ro", "keyspace" -> "hdbtest"))
  .mode(SaveMode.Append)
  .save()
Demo Zeppelin
Zeppelin visualisation (export as iframe)
Q & A
Spark/Cassandra/Zeppelin tricks and traps
• Zeppelin/Spark/Cassandra
• Spark/Cassandra
Zeppelin/Spark/Cassandra
• Legend
  • 💣 = trap
  • 💡 = trick
• Zeppelin build mode
  • standard
  • with the Spark-Cassandra connector (Maven profile -Pcassandra-spark-1.x)
• Spark run mode
  • local
  • with a stand-alone Spark cluster co-located with Cassandra
• Zeppelin build mode standard, Spark run mode local
  • need to add the Spark-Cassandra connector as a dependency of the Spark interpreter 💡
• Zeppelin build mode standard, Spark run mode local
  • on Spark interpreter init, all declared dependencies are fetched from the declared repositories (default = Maven Central + the local Maven repo)
  • beware of corporate firewalls!!! 💣
  • where are the downloaded dependencies (jars) stored? 💡
• Zeppelin build mode standard, Spark run mode cluster
  • Zeppelin uses the spark-submit command
  • the Spark interpreter is run by bin/interpreter.sh 💡
• Zeppelin build mode standard, Spark run mode cluster
  • run at least ONCE in local mode so that Zeppelin can download the dependencies into its local repo (zeppelin.interpreter.localRepo)!!! 💣
• Zeppelin build mode with connector, Spark run mode local or cluster
  • runs smoothly because all Spark-Cassandra connector dependencies are merged into the interpreter/spark/dep/zeppelin-spark-dependencies-x.y.z.jar fat jar during the build process 💡
• OSS Spark
  • needs the Spark-Cassandra connector dependencies added in conf/spark-env.sh, otherwise:

    ...
    Caused by: java.lang.NoClassDefFoundError: com/datastax/driver/core/ConsistencyLevel
• OSS Spark
  • needs to provide all transitive dependencies of the Spark-Cassandra connector!!! 💣
  • either in conf/spark-env.sh
  • or with the spark-submit --packages groupId:artifactId:version option
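A hedged sketch of both options (the jar path, connector coordinates, versions, job class and jar name are illustrative; pick coordinates matching your Spark and Scala versions):

```shell
# Option 1: conf/spark-env.sh — put the connector (and its transitive
# dependencies) on the classpath; the assembly jar path is illustrative
export SPARK_CLASSPATH=/opt/jars/spark-cassandra-connector-assembly-1.6.0.jar

# Option 2: let spark-submit resolve the connector and its transitive
# dependencies from Maven Central
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
  --class com.acme.StatJob \
  my-job.jar
```

Option 2 avoids maintaining the dependency list by hand, but it downloads from Maven Central, so the corporate-firewall trap above applies here too.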
• DSE Spark
  • runs smoothly because the Spark-Cassandra connector dependencies are already embedded in the package ($DSE_HOME/resources/spark/lib)
Spark/Cassandra
• Spark deploy mode (spark-submit --deploy-mode)
  • client
  • cluster
• Zeppelin deploys using client mode by default 💡
• Spark client deploy mode
  • the default
  • needs to ship all driver program dependencies to the workers (network intensive)
  • suitable for REPLs (Spark shell, Zeppelin)
  • suitable for one-shot jobs/testing
• Spark cluster deploy mode
  • the driver program runs on a worker node
  • all driver program dependencies must be reachable by every worker
  • dependencies are usually stored in HDFS, but can live on the local FS of all workers
  • suitable for recurrent jobs
  • needs a consistent build & deploy process for your jobs
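For the recurrent-job case, a cluster-mode submission could look like the following sketch (the master URL, job class and HDFS path are illustrative):

```shell
# Cluster deploy mode: the driver is started on a worker node, so the
# job jar must be reachable from every worker — hence the HDFS path
spark-submit \
  --deploy-mode cluster \
  --master spark://spark-master:7077 \
  --class com.acme.StatJob \
  hdfs:///jobs/my-job.jar
```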
• The job fails when using spark-submit
  • but succeeded with Zeppelin …
  • error: value stddev not found
• Indeed, Zeppelin uses a Hive context by default …
• a job launched with plain spark-submit gets a plain SQLContext, which (in Spark 1.x) does not provide stddev 💣
• Fix: create a Hive context explicitly in the job
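A minimal sketch of that fix, assuming Spark 1.x (the app name is illustrative): build a HiveContext instead of a plain SQLContext, so the Hive UDAFs such as stddev are available under spark-submit too:

```scala
import org.apache.spark.{SparkConf, SparkContext}
// HiveContext lives in the spark-hive module (Spark 1.x)
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hdb-stats"))
// HiveContext registers the Hive UDAFs (stddev, ...) that the
// spark-submit job was missing with a plain SQLContext
val sqlContext = new HiveContext(sc)
```

With this sqlContext, the aggregation query shown earlier runs unchanged under spark-submit.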
Q & A
Cassandra Summit 2016, September 7-9, San Jose, CA
Get 15% Off with Code: DoanDuy15
Cassandrasummit.org