Data Science Lifecycle with Apache Zeppelin and Spark
2015 Spark Summit Amsterdam
Moon [email protected] | NFLabs | www.nflabs.com

Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee



Data science lifecycle

Data Science: process

https://en.wikipedia.org/wiki/Data_analysis

Data Science: tools

MLlib

Data Science: people (Engineer, Data Scientist, DevOps, Business)

http://aarondavis.design/

Hadoop Landscape

Cloudera-ML, ML-base, MRQL, Shark

?

Project Timeline: started getting adoption 08.2014; ASF Incubation 12.2014. http://zeppelin.incubator.apache.org

12.2012: commercial product for data analysis; 10.2013: open-sourced a single feature

Commercial Product 12.2012

Zeppelin 10.2013

Zeppelin 08.2014

Third-party Products 10.2014

Apache Incubation Proposal 11.2014

Acceptance by Incubator 23.12.2014

Current Status: 1 release; 71 contributors worldwide; 766 stars on GitHub; 300/900 emails at users/dev @ incubator.apache.org

Interactive Notebooks

Interactive Visualization

Multiple Backends

Interpreter

http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html

Writing an Interpreter:

    public abstract void open();
    public abstract void close();
    public abstract InterpreterResult interpret(String st, InterpreterContext context);
    public abstract void cancel(InterpreterContext context);
    public abstract int getProgress(InterpreterContext context);
    public abstract List<String> completion(String buf, int cursor);
    public abstract FormType getFormType();
    public Scheduler getScheduler();

Methods range from must-have through good-to-have to advanced.
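The contract above can be sketched as a minimal "echo" interpreter. The Zeppelin types (InterpreterContext, InterpreterResult, FormType) are stubbed with hypothetical stand-ins here so the sketch is self-contained; a real interpreter would instead extend the abstract Interpreter class shipped in the Zeppelin interpreter artifact:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical minimal stand-ins for the Zeppelin types named on the slide,
// for illustration only; the real ones come from the Zeppelin artifact.
class InterpreterContext {}
class InterpreterResult {
  final String message;
  InterpreterResult(String message) { this.message = message; }
}
enum FormType { NATIVE, SIMPLE, NONE }

// The abstract contract from the slide (getScheduler omitted for brevity).
abstract class Interpreter {
  public abstract void open();
  public abstract void close();
  public abstract InterpreterResult interpret(String st, InterpreterContext context);
  public abstract void cancel(InterpreterContext context);
  public abstract int getProgress(InterpreterContext context);
  public abstract List<String> completion(String buf, int cursor);
  public abstract FormType getFormType();
}

// A trivial interpreter that echoes the paragraph text back.
class EchoInterpreter extends Interpreter {
  @Override public void open() { /* acquire connections, start engines, ... */ }
  @Override public void close() { /* release whatever open() acquired */ }
  @Override public InterpreterResult interpret(String st, InterpreterContext context) {
    return new InterpreterResult(st);      // echo the paragraph text back
  }
  @Override public void cancel(InterpreterContext context) { /* best-effort stop */ }
  @Override public int getProgress(InterpreterContext context) { return 0; }
  @Override public List<String> completion(String buf, int cursor) {
    return Collections.emptyList();        // no autocomplete in this sketch
  }
  @Override public FormType getFormType() { return FormType.NATIVE; }
}
```

interpret() is where the real work happens; the remaining methods let Zeppelin manage the interpreter's lifecycle, progress, and autocomplete.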

Display System: the Zeppelin webapp talks to the Zeppelin Server over Websocket and REST; the server talks to the Spark Interpreter and other interpreters. Output can be rendered as Text, Html, Table, or Angular.

Display System: select a display system through the output.
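Concretely, the renderer is chosen by a prefix on the output itself: text beginning with %html, %table, or %angular is routed to the matching renderer, and anything else is shown as plain text. A sketch of building a %table payload (the helper name is hypothetical; columns are tab-separated, rows newline-separated):

```java
// Build an output string that Zeppelin's display system renders as a table:
// the %table prefix selects the renderer.
class TableOutput {
  static String table(String[] header, String[][] rows) {
    StringBuilder sb = new StringBuilder("%table ");
    sb.append(String.join("\t", header)).append('\n');
    for (String[] row : rows) {
      sb.append(String.join("\t", row)).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(table(new String[]{"name", "size"},
                           new String[][]{{"a.txt", "10"}, {"b.txt", "42"}}));
  }
}
```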

Built-in scheduler

The built-in scheduler runs your notebook on a cron expression.
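A couple of illustrative cron expressions, assuming the Quartz convention of a leading seconds field (check your Zeppelin version's documentation for the exact syntax it accepts):

```
0 0/5 * * * ?    run the notebook every 5 minutes
0 0 1 * * ?      run the notebook every day at 01:00
```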

Flexible layout


DEMO

Zeppelin & Friends

Z-Manager

ZeppelinHub

Collaboration/Sharing; Packaging & Deployment; Zeppelin + full stack on a cloud

Packages; Backend Integration

Z-Manager installer

Deployment: https://github.com/hortonworks-gallery/ambari-zeppelin-service

Deployment

As a Service

AWS EMR

https://aws.amazon.com/blogs/aws/amazon-emr-release-4-1-0-spark-1-5-0-hue-3-7-1-hdfs-encryption-presto-oozie-zeppelin-improved-resizing/

Online Viewer

Zeppelin for organizations

An Engineer (illustration by http://aarondavis.design/)

A Team

An Organization

That's too many!

What is the problem? Too much to install, too much to configure, too many cluster resources.

Solution? We have containers + a reverse proxy.

Z-Manager: http://github.com/NFLabs/z-manager

Z-Manager: Apache 2.0 licence; containerized deployment per user; reverse proxy; single binary; simple web application. SGA to ASF coming.*

* following the destiny of Zeppelin: PoC, internal adoption, OSS, ASF

Z Manager

Auto-update


Linux box

Z-Manager process: Go + React :)

Z Manager

ZeppelinHub

https://www.zeppelinhub.com: sharing notebooks with access control

Zeppelin


Shares Notebook

Provides multi-tenant environment

z-manager + ZeppelinHub

Data Science: people (Engineer, Data Scientist, DevOps, Business)


Before

Cloudera-ML, ML-base, MRQL, Shark

?

After

Cloudera-ML, ML-base, MRQL, Shark

Project roadmap

Helium

People do similar work with different data.

New visualization; Model & Algorithm; Data process pipeline


Package and distribute work

New visualization; Model & Algorithm; Data process pipeline

Pkg Repo


Helium (https://s.apache.org/helium): a platform for data analytics applications on top of Apache Zeppelin.

Helium Application = View + Algorithm + Zeppelin-provided Resources

Resources

Any Java object can be a resource.
Data: result of the last execution; provided by a user-created interpreter; provided by a user-created Helium application.
Computing: JDBC connection (from the JDBC Interpreter)*; SparkContext (from the Spark Interpreter); Flink environment (from the Flink Interpreter)*.

Application Examples:
Data: get git commit log data (https://github.com/Leemoonsoo/zeppelin-gitcommitdata)
Computing: run CPU usage monitoring code across a Spark cluster, using SparkContext (https://github.com/Leemoonsoo/zeppelin-sparkmon)
Visualization: display result data as a word cloud (https://github.com/Leemoonsoo/zeppelin-wordcloud)

How it works: the web browser renders the View, served by the Zeppelin Server; the Algorithm runs in the Interpreter process, next to its resource pool.

Resource pool

Resource pools are connected

Algorithm runs where resource exists

API:

    class YourApplication extends org.apache.zeppelin.helium.Application {

      @Override
      public void run(ApplicationArgument arg, InterpreterContext context) {
        ...
      }
    }

Easy API: just extend helium.Application.

Application Spec:

    {
      mavenArtifact: "groupId:artifactId:version",
      className: "your.helium.application.Class",
      icon: "fa fa-cloud",
      name: "My app name",
      description: "some description",
      consume: [ "org.apache.spark.SparkContext" ]
    }

Simple: writing a spec file lets Zeppelin load the application.

Deploy

Public Repository / Private Repository

Handy: packaged to a jar and distributed through Maven (private or public repository); downloaded on the fly and run when the user selects it.

Thank you. Q & A
Moon [email protected]
NFLabs www.nflabs.com
http://zeppelin.incubator.apache.org/