Data Science Lifecycle with Apache Zeppelin and Spark
2015 Spark Summit Amsterdam
Moon [email protected] | NFLabs | www.nflabs.com

Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee



Data science lifecycle

Data Science: process

https://en.wikipedia.org/wiki/Data_analysis

Data Science: tools

MLlib

Data Science: people (Engineer, Data Scientist, DevOps, Business)

http://aarondavis.design/

Hadoop Landscape

Cloudera-ML, ML-base, MRQL, Shark

?

Project Timeline: started getting adoption 08.2014; ASF Incubation 12.2014. http://zeppelin.incubator.apache.org

12.2012: commercial product for data analysis; 10.2013: open-sourced a single feature

Commercial Product 12.2012

Zeppelin 10.2013

Zeppelin 08.2014

Third-party Products 10.2014

Apache Incubation Proposal 11.2014

Acceptance by Incubator 23.12.2014

Current Status: 1 release; 71 contributors worldwide; 766 stars on GitHub; 300/900 emails at users/dev @ incubator.apache.org

Interactive Notebooks

Interactive Visualization

Multiple Backends

Interpreter

http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html

Writing an Interpreter:

    public abstract void open();
    public abstract void close();
    public abstract InterpreterResult interpret(String st, InterpreterContext context);
    public abstract void cancel(InterpreterContext context);
    public abstract int getProgress(InterpreterContext context);
    public abstract List<String> completion(String buf, int cursor);
    public abstract FormType getFormType();
    public Scheduler getScheduler();

Methods range from must-have through good-to-have to advanced.
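The contract above can be sketched as a minimal "echo" interpreter. The Zeppelin types (InterpreterContext, InterpreterResult, FormType) are stubbed with hypothetical stand-ins here so the sketch is self-contained; a real interpreter would instead extend the abstract Interpreter class shipped in the Zeppelin interpreter artifact:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical minimal stand-ins for the Zeppelin types named on the slide,
// for illustration only; the real ones come from the Zeppelin artifact.
class InterpreterContext {}
class InterpreterResult {
  final String message;
  InterpreterResult(String message) { this.message = message; }
}
enum FormType { NATIVE, SIMPLE, NONE }

// The abstract contract from the slide (getScheduler omitted for brevity).
abstract class Interpreter {
  public abstract void open();
  public abstract void close();
  public abstract InterpreterResult interpret(String st, InterpreterContext context);
  public abstract void cancel(InterpreterContext context);
  public abstract int getProgress(InterpreterContext context);
  public abstract List<String> completion(String buf, int cursor);
  public abstract FormType getFormType();
}

// A trivial interpreter that echoes the paragraph text back.
class EchoInterpreter extends Interpreter {
  @Override public void open() { /* acquire connections, start engines, ... */ }
  @Override public void close() { /* release whatever open() acquired */ }
  @Override public InterpreterResult interpret(String st, InterpreterContext context) {
    return new InterpreterResult(st);      // echo the paragraph text back
  }
  @Override public void cancel(InterpreterContext context) { /* best-effort stop */ }
  @Override public int getProgress(InterpreterContext context) { return 0; }
  @Override public List<String> completion(String buf, int cursor) {
    return Collections.emptyList();        // no autocomplete in this sketch
  }
  @Override public FormType getFormType() { return FormType.NATIVE; }
}
```

interpret() is where the real work happens; the remaining methods let Zeppelin manage the interpreter's lifecycle, progress, and autocomplete.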

Display System: the Zeppelin webapp talks to the Zeppelin Server over Websocket and REST; the server talks to the Spark Interpreter and other interpreters. Output can be rendered as Text, Html, Table, or Angular.

Display System: select a display system through the output.
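Concretely, the renderer is chosen by a prefix on the output itself: text beginning with %html, %table, or %angular is routed to the matching renderer, and anything else is shown as plain text. A sketch of building a %table payload (the helper name is hypothetical; columns are tab-separated, rows newline-separated):

```java
// Build an output string that Zeppelin's display system renders as a table:
// the %table prefix selects the renderer.
class TableOutput {
  static String table(String[] header, String[][] rows) {
    StringBuilder sb = new StringBuilder("%table ");
    sb.append(String.join("\t", header)).append('\n');
    for (String[] row : rows) {
      sb.append(String.join("\t", row)).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(table(new String[]{"name", "size"},
                           new String[][]{{"a.txt", "10"}, {"b.txt", "42"}}));
  }
}
```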

Built-in scheduler

The built-in scheduler runs your notebook on a cron expression.
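A couple of illustrative cron expressions, assuming the Quartz convention of a leading seconds field (check your Zeppelin version's documentation for the exact syntax it accepts):

```
0 0/5 * * * ?    run the notebook every 5 minutes
0 0 1 * * ?      run the notebook every day at 01:00
```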

Flexible layout


DEMO

Zeppelin & Friends

Z-Manager

ZeppelinHub

Collaboration/Sharing; Packaging & Deployment; Zeppelin + full stack on a cloud

Packages; Backend Integration

Z-Manager installer

Deployment: https://github.com/hortonworks-gallery/ambari-zeppelin-service

Deployment

As a Service

AWS EMR

https://aws.amazon.com/blogs/aws/amazon-emr-release-4-1-0-spark-1-5-0-hue-3-7-1-hdfs-encryption-presto-oozie-zeppelin-improved-resizing/

Online Viewer

Zeppelin for organizations

An Engineer (illustration by http://aarondavis.design/)

A Team

An Organization

That's too many!

What is the problem? Too much to install, too much to configure, too many cluster resources.

Solution? We have containers + a reverse proxy.

Z-Manager: http://github.com/NFLabs/z-manager

Z-Manager: Apache 2.0 licence; containerized deployment per user; reverse proxy; single binary; simple web application. SGA to ASF coming.*

* following the destiny of Zeppelin: PoC, internal adoption, OSS, ASF

Z Manager

Auto-update


Linux box

Z-Manager process: Go + React :)

Z Manager

ZeppelinHub

https://www.zeppelinhub.com: sharing notebooks with access control

Zeppelin


Shares Notebook

Provides multi-tenant environment

z-manager + ZeppelinHub

Data Science: people (Engineer, Data Scientist, DevOps, Business)


Before

Cloudera-ML, ML-base, MRQL, Shark

?

After

Cloudera-ML, ML-base, MRQL, Shark

Project roadmap

Helium

People do similar work with different data.

New visualization; Model & Algorithm; Data process pipeline


Package and distribute work

New visualization; Model & Algorithm; Data process pipeline

Pkg Repo


Helium (https://s.apache.org/helium): a platform for data analytics applications on top of Apache Zeppelin.

Helium Application = View + Algorithm + Zeppelin-provided Resources

Resources

Any Java object can be a resource.
Data: result of the last execution; provided by a user-created interpreter; provided by a user-created Helium application.
Computing: JDBC connection (from the JDBC Interpreter)*; SparkContext (from the Spark Interpreter); Flink environment (from the Flink Interpreter)*.

Application Examples:
Data: get git commit log data (https://github.com/Leemoonsoo/zeppelin-gitcommitdata)
Computing: run CPU usage monitoring code across a Spark cluster, using SparkContext (https://github.com/Leemoonsoo/zeppelin-sparkmon)
Visualization: display result data as a word cloud (https://github.com/Leemoonsoo/zeppelin-wordcloud)

How it works: the web browser renders the View, served by the Zeppelin Server; the Algorithm runs in the Interpreter process, next to its resource pool.

Resource pool

Resource pools are connected

Algorithm runs where resource exists

API:

    class YourApplication extends org.apache.zeppelin.helium.Application {

      @Override
      public void run(ApplicationArgument arg, InterpreterContext context) {
        ...
      }
    }

Easy API: just extend helium.Application.

Application Spec:

    {
      mavenArtifact: "groupId:artifactId:version",
      className: "your.helium.application.Class",
      icon: "fa fa-cloud",
      name: "My app name",
      description: "some description",
      consume: [ "org.apache.spark.SparkContext" ]
    }

Simple: writing a spec file lets Zeppelin load the application.

Deploy

Public Repository / Private Repository

Handy: packaged to a jar and distributed through Maven (private or public repository); downloaded on the fly and run when the user selects it.

Thank you. Q & A
Moon [email protected]
NFLabs www.nflabs.com
http://zeppelin.incubator.apache.org/