Interactive Data Science From Scratch with Apache Zeppelin and Apache Spark

APACHE CASSANDRA, DC/OS, AWS, SMACK, SMACKZ

INTERACTIVE DATA SCIENCE FROM SCRATCH WITH APACHE ZEPPELIN AND APACHE SPARK

FELIX CHEUNG

APACHECON BIG DATA 2016 - MAY

TODAY

• SET UP A VIRTUAL MACHINE ON YOUR LAPTOP – THIS COULD TAKE 25 MIN TO 1 HR
• VM NEEDS 8GB RAM

• IN THE VIRTUAL MACHINE, YOU WILL BE RUNNING SPARK, ZEPPELIN AND OTHERS

• AND YOU WILL RUN SOME DATA PROCESSING AND MACHINE LEARNING USE CASES

VAGRANT

• CREATE AND CONFIGURE LIGHTWEIGHT, REPRODUCIBLE AND PORTABLE DEVELOPMENT ENVIRONMENTS
• $ vagrant up
• MACHINES ARE PROVISIONED ON TOP OF VIRTUALBOX, VMWARE, AWS, OR ANY OTHER PROVIDER. INDUSTRY-STANDARD PROVISIONING TOOLS SUCH AS SHELL SCRIPTS, CHEF, OR PUPPET CAN BE USED TO AUTOMATICALLY INSTALL AND CONFIGURE SOFTWARE ON THE MACHINE
• https://www.vagrantup.com/downloads.html

VIRTUALBOX

• X86 AND AMD64/INTEL64 VIRTUALIZATION
• OPEN SOURCE SOFTWARE, ORACLE
• RUNS ON WINDOWS, LINUX, MACINTOSH, AND SOLARIS HOSTS AND SUPPORTS A LARGE NUMBER OF GUEST OPERATING SYSTEMS INCLUDING BUT NOT LIMITED TO WINDOWS (NT 4.0, 2000, XP, SERVER 2003, VISTA, WINDOWS 7, WINDOWS 8, WINDOWS 10), DOS/WINDOWS 3.X, LINUX (2.4, 2.6, 3.X AND 4.X), SOLARIS AND OPENSOLARIS, OS/2, AND OPENBSD
• https://www.virtualbox.org/wiki/Downloads

LET’S START!

• VAGRANT -> VIRTUALBOX
• https://github.com/felixcheung/vagrant-projects
• SPARK-CASSANDRA-ZEPPELIN
• DOWNLOAD AND PUT THEM IN A DIRECTORY
• $ vagrant up

• SPARK
• SPARK SQL + DATAFRAME + DATA SOURCE
• SPARK STREAMING
• MLLIB
• GRAPHX

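// Word count with the RDD API (textFile is assumed to be an RDD[String] of lines)
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Row count per age with the DataFrame API (df is assumed to have an "age" column)
val countsByAge = df.groupBy("age").count()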
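To see what the two Spark snippets compute without a cluster, here is a minimal plain-Python sketch of the same steps. The sample lines and the people records are made up for illustration, and nothing here uses the Spark API.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample input; in the Spark version this would be an RDD of lines.
lines = ["to be or not to be", "to do is to be"]

# flatMap: split every line into words and flatten into one sequence.
words = chain.from_iterable(line.split(" ") for line in lines)

# map + reduceByKey: pair each word with 1, then sum the 1s per word.
counts = {}
for word, one in ((w, 1) for w in words):
    counts[word] = counts.get(word, 0) + one

# groupBy("age").count(): number of rows per distinct age value.
# The records below are hypothetical stand-ins for DataFrame rows.
people = [
    {"name": "a", "age": 30},
    {"name": "b", "age": 25},
    {"name": "c", "age": 30},
]
counts_by_age = Counter(row["age"] for row in people)

print(counts)
print(dict(counts_by_age))
```

In Spark the same logic is split across partitions and the per-key sums are merged across the cluster; the sketch only mirrors the result on one machine.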

ZEPPELIN

APACHE ZEPPELIN (INCUBATING)

• INTERACTIVE DATA ANALYTICS ENVIRONMENT FOR DISTRIBUTED DATA PROCESSING SYSTEMS. IT PROVIDES A BEAUTIFUL INTERACTIVE WEB-BASED INTERFACE, DATA VISUALIZATION, A COLLABORATIVE WORK ENVIRONMENT AND MANY OTHER NICE FEATURES TO MAKE YOUR DATA ANALYTICS MORE FUN AND ENJOYABLE.

• ZEPPELIN HAS BEEN INCUBATING SINCE DEC 2014.
• https://zeppelin.incubator.apache.org/

• REALTIME COLLABORATION - ENABLED BY WEBSOCKET COMMUNICATIONS

• FRONTEND: ANGULARJS
• BACKEND SERVER: JAVA
• INTERPRETERS: JAVA
• VISUALIZATION: NVD3

INTERPRETERS

• ALLUXIO (WAS TACHYON)
• CASSANDRA
• ELASTICSEARCH
• FLINK
• GEODE
• HBASE
• HDFS
• HIVE
• IGNITE
• JDBC/PHOENIX/POSTGRESQL/HAWQ
• LENS
• MARKDOWN
• R
• SCALDING
• SHELL
• SPARK
• TAJO

LET’S CHECK THIS OUT

SPARK INTERPRETER

• CLUSTER MODE – MASTER
• spark.executor.memory
• http://spark.apache.org/docs/latest/configuration.html
• “SHARED” INTERPRETER
• CREATE ADDITIONAL INTERPRETER INSTANCES

MACHINE LEARNING WITH SPARK

K-MEANS

• K-MEANS CLUSTERING AIMS TO PARTITION N OBSERVATIONS INTO K CLUSTERS IN WHICH EACH OBSERVATION BELONGS TO THE CLUSTER WITH THE NEAREST MEAN, SERVING AS A PROTOTYPE OF THE CLUSTER.

• SPARK – K-MEANS||, A PARALLELIZED VARIANT OF THE K-MEANS++ METHOD: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

• SPARK – STREAMING K-MEANS
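The "nearest mean" definition above can be sketched with the classic Lloyd iteration in plain Python. This is an illustration only: it is not Spark's implementation, it skips the k-means|| initialization, and the sample points and starting centroids are made up.

```python
import math


def kmeans(points, centroids, max_iter=20):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The starting centroids are fixed here to keep the run deterministic; k-means++ and k-means|| exist precisely to choose better starting points than an arbitrary guess.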

GRAPH

• GRAPH-PARALLEL COMPUTATION
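One way to picture graph-parallel computation is a vertex-centric loop: in every superstep each vertex updates its own state using only its neighbors' state from the previous superstep. The plain-Python sketch below uses that pattern to label connected components. It runs sequentially and is not the GraphX API; the sample graph is made up.

```python
def connected_components(vertices, edges):
    """Vertex-centric label propagation: each vertex keeps the smallest id seen."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # treat edges as undirected

    label = {v: v for v in vertices}
    changed = True
    while changed:  # one loop iteration = one superstep
        # Every vertex reads only the previous superstep's labels, so the
        # per-vertex updates are independent and could run in parallel.
        new_label = {
            v: min([label[v]] + [label[n] for n in neighbors[v]])
            for v in vertices
        }
        changed = new_label != label
        label = new_label
    return label


# Hypothetical graph: a chain 1-2-3, a pair 4-5, and an isolated vertex 6.
labels = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(labels)
```

Vertices that end up with the same label belong to the same component.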

GRAPHFRAMES

• POWER OF GRAPHX + DATAFRAME
• MOTIF FINDING (a)-[e]->(b); (b)-[e2]->(a)
• BREADTH-FIRST SEARCH (BFS)
• CONNECTED COMPONENTS
• PAGERANK
• SHORTEST PATHS
• TRIANGLE COUNT
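To make three of the operations listed above concrete, here is a small plain-Python sketch of what the reciprocal-edge motif, breadth-first search, and triangle count return on a tiny made-up edge list. It does not use the GraphFrames API, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical directed edge list (src, dst).
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "b")]

# Motif (a)-[e]->(b); (b)-[e2]->(a): vertex pairs linked in both directions.
# Each pair is reported once here (s < d); a motif query would bind both orders.
edge_set = set(edges)
reciprocal = sorted((s, d) for s, d in edge_set if (d, s) in edge_set and s < d)


def bfs_path(edges, start, goal):
    """Breadth-first search; returns a path with the fewest edges, or None."""
    out = {}
    for s, d in edges:
        out.setdefault(s, []).append(d)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in out.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def triangle_count(edges):
    """Number of triangles in the graph, ignoring edge direction."""
    nbrs = {}
    for s, d in edges:
        if s != d:
            nbrs.setdefault(s, set()).add(d)
            nbrs.setdefault(d, set()).add(s)
    pairs = {(min(s, d), max(s, d)) for s, d in edges if s != d}
    # Every triangle is found once from each of its three edges.
    return sum(len(nbrs[u] & nbrs[v]) for u, v in pairs) // 3


print(reciprocal)
print(bfs_path(edges, "a", "d"))
print(triangle_count(edges))
```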

NOTEBOOK – K-MEANS – EXPLORATORY ANALYSIS

NOTEBOOK – GRAPHFRAMES

“BIG DATA”

HBASE

• OPEN-SOURCE, DISTRIBUTED, VERSIONED, NON-RELATIONAL DATABASE
• MODELED AFTER GOOGLE'S BIGTABLE: A DISTRIBUTED STORAGE SYSTEM FOR STRUCTURED DATA
• AUTOMATIC FAILOVER SUPPORT BETWEEN REGIONSERVERS
• BLOCK CACHE AND BLOOM FILTERS FOR REAL-TIME QUERIES
• QUERY PREDICATE PUSH DOWN VIA SERVER SIDE FILTERS
• THRIFT GATEWAY AND A REST-FUL WEB SERVICE THAT SUPPORTS XML, PROTOBUF, AND BINARY DATA
• EXTENSIBLE JRUBY-BASED (JIRB) SHELL
• https://hbase.apache.org/book.html#quickstart

HBASE SCRIPTING

• create 'test', 'cf'
• list 'test'
• put 'test', 'row1', 'cf:a', 'value1'
• put 'test', 'row2', 'cf:b', 'value2'
• scan 'test'
• get 'test', 'row1'
• disable 'test'
• enable 'test'
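The shell commands above can be read against a toy model of the data layout: a table maps a row key to cells addressed as family:qualifier. The plain-Python class below is a hypothetical in-memory stand-in, not the HBase client API, and it leaves out cell versions and timestamps.

```python
class ToyTable:
    """In-memory stand-in for one table: row key -> {"family:qualifier": value}."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = {}

    def put(self, row, column, value):
        # A cell can only be written into a column family declared at creation.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[column] = value

    def get(self, row):
        return dict(self.rows.get(row, {}))

    def scan(self):
        # Rows come back ordered by row key.
        return [(row, dict(self.rows[row])) for row in sorted(self.rows)]


test = ToyTable("cf")               # create 'test', 'cf'
test.put("row1", "cf:a", "value1")  # put 'test', 'row1', 'cf:a', 'value1'
test.put("row2", "cf:b", "value2")  # put 'test', 'row2', 'cf:b', 'value2'
print(test.scan())                  # scan 'test'
print(test.get("row1"))             # get 'test', 'row1'
```

Note that the two rows hold different columns under the same family; qualifiers do not have to be declared up front, only the family does.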

CASSANDRA

• BORN AT FACEBOOK AND BUILT ON AMAZON’S DYNAMO AND GOOGLE’S BIGTABLE

• DISTRIBUTED DATABASE
• STRUCTURED DATA
• HIGHLY AVAILABLE SERVICE AND NO SINGLE POINT OF FAILURE
• MASTERLESS “RING” DESIGN
• USE CASES: PB’S OF DATA IN CLUSTERS OF OVER 75,000 NODES

• http://wiki.apache.org/cassandra/GettingStarted

CASSANDRA QUERY LANGUAGE (CQL)

• CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

• CREATE TABLE users ( user_id int PRIMARY KEY, fname text, lname text );

• INSERT INTO users (user_id, fname, lname) VALUES (1745, 'john', 'smith');

• SELECT MAX(lname), lname, COUNT(*) FROM users;

NOTEBOOK – HBASE

NOTEBOOK – CASSANDRA, SPARK-CASSANDRA

TIPS

• AFTER vagrant halt, YOU NEED TO RESTART ZEPPELIN
• $ vagrant up
• $ vagrant ssh
• $ sudo service zeppelin start

• TO RESTART HBASE OR CASSANDRA, SEE THE COMMAND HERE
• TO START CLEANLY, CONSIDER REBUILDING THE VM FROM SCRATCH: $ vagrant destroy
• /opt/apache-cassandra-3.5/bin/cqlsh
• “Connected to Test Cluster at 127.0.0.1:9042.”
• https://www.digitalocean.com/community/tutorials/how-to-install-cassandra-and-run-a-single-node-cluster-on-a-ubuntu-vps

• CASSANDRA 3.0 AND LATER REQUIRE JAVA 8U40 OR LATER.

SCALING UP – CLOUD-BASED ARCHITECTURE

DC/OS

• OPEN SOURCE DATACENTER-SCALE OPERATING SYSTEM FOR BUILDING AND RUNNING MODERN APPS WITH EASE.

• EASILY DEPLOY AND RUN STATEFUL OR STATELESS DISTRIBUTED WORKLOADS INCLUDING DOCKER CONTAINERS, BIG DATA, AND TRADITIONAL APPS.

• RUNNING ON APACHE MESOS
• GUI/CLI
• “UNIVERSE” APP STORE
• SERVICE DISCOVERY AND MONITORING
• ENTERPRISE DC/OS

AMAZON WEB SERVICES (AWS)

• “CLOUD”
• COMPUTE
• STORAGE, CONTENT
• DATABASE
• NETWORKING
• HADOOP CLUSTER

DC/OS ON AWS

• REQUIRES m3.xlarge
• PRIVATE NODES
• PUBLIC NODE

SMACKZ = SMACK + Z

• SPARK – SCALABLE DATA PROCESSING AND ANALYTICS
• MESOS – CLUSTER MANAGER, RESOURCE SHARING, SCHEDULING
• AKKA – ACTOR-BASED CONCURRENT, DISTRIBUTED APPLICATION FRAMEWORK
• CASSANDRA – DISTRIBUTED DATABASE
• KAFKA – HIGH-THROUGHPUT, LOW-LATENCY, PUB-SUB MESSAGING
• ZEPPELIN – DATA MANIPULATION AND VISUALIZATION INTERFACE
• SINGLE INTERFACE, HIGHLY SCALABLE CLUSTER

SMACKZ – SMACK STACK + ZEPPELIN

http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html

SMACK

• https://docs.mesosphere.com/administration/installing/cloud/aws/
• https://dcos.io/docs/1.7/administration/installing/cloud/aws/
• $ dcos package install spark
$ dcos package install cassandra
$ dcos package install kafka

ZEPPELIN

• $ cat options.json
{ "zeppelin": { "role": "slave_public" } }
• $ dcos package install --options=options.json zeppelin

ZEPPELIN

• RUNS ON PUBLIC NODE
• RUNS FROM MARATHON
$ dcos marathon task list
$ dcos marathon task show zeppelin….

• https://dcos.io/docs/1.7/usage/tutorials/spark/

CONTACT ME

• GITHUB: https://github.com/felixcheung