21
Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 [email protected] Karim Chine

Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 [email protected] Karim

Embed Size (px)

Citation preview

Page 1: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

Leveraging scriptable infrastructures, Towards a paradigm shift in software for data scienceCloud Era Ltd 14 June 2013 [email protected]

Karim Chine

Page 2: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

22

Outline

Introduction Fragmentation and friction in the Data Science arena The understated potential of Cloud scriptability Elastic-R: Towards a universal platform for Data Science Elastic-R: Design and Technologies Overview Elastic-R: The scriptability toolset Demo Conclusion

Page 3: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

33

Introduction

“The next information revolution is well underway. But it is not happening where information scientists, information executives, and the information industry in general are looking for it. It is not a revolution in technology, machinery, techniques, software, or speed. It is a revolution in concepts.”

Peter Drucker. Forbes ASAP, August 24, 1998

Page 4: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

44

Introduction

Science and the 4th Paradigm

The e-Research Dancing Bears

2

22.

3

4

a

cG

a

a

Experimental Science Theoretical Science Computational Science e-Science / Data-intensive Science

The townspeople gather to see the wondrous sight as the massive, lumbering beast shambles and shuffles from paw to paw. The bear is really a terrible dancer, and the wonder isn't that the bear dances well but that the bear dances at all.

Page 5: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

5

Fragmentation and friction in the Data Science Arena

www.scilab.org

http://root.cern.ch

www.sagemath.org

www.sas.com

office.microsoft.com

www.mathworks.com

www.scipy.org

www.spss.com

www.wolfram.com

www.python.org

Open-source (GPL) software environment for statistical computing and graphics

Lingua franca of data analysis. Repositories of contributed R

packages related to a variety of problem domains in life sciences, social sciences, finance, econometrics, chemo metrics, etc. are growing at an exponential rate.

R is Super Glue

Page 6: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

6

The understated potential of Cloud scriptability

3D printers are becoming a common place: Creating three-dimensional

solid object of virtually any shape from a digital model.

Scripting the physical world Sharing physical reality on

Facebook is now as easy as sharing your holiday pictures

Page 7: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

7

Elastic-R: Towards a universal platform for Data Science

Computational Components R packages, Wrapped C,C++,Fortran code, Python modules, Matlab Toolkits…

Open source or commercial

Computational Resources

Clusters, grids, private or public clouds

Free or pay-per-use

Computational GUIsHTML5 and Desktop Workbench

Built-in views /Plugins /Collaborative views

Open source or commercial

Computational Scripts R / Python / Matlab / Groovy

Computational APIs Java / SOAP / REST, Stateless and stateful

Computational StorageLocal, NFS, FTP, Amazon S3, EBS

Generated Computational Web ServicesStateful or stateless, mapping of R objects/functions

Elastic-R

Page 8: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

8

Elastic-R: Towards a universal platform for Data Science

Page 9: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

9

Elastic-R: Towards a universal platform for Data Science

Public Clouds

Private Cloud

Page 10: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

10

Elastic-R: Design and Technologies Overview

You are here

Your data is here

Page 11: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

11

Elastic-R: Design and Technologies Overview

Remote Java/R Processes Events-driven Remote

Objects/Engines R, Python, Mathematica,

Matlab, Scilab, ... Collaborative Spreadsheets Collaborative Scientific

Graphics Canvas Collaborative Dashboard with

collaborative widgets

Page 12: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

12

Elastic-R: Design and Technologies Overview

Elastic-R AMI 1R 2.10

BioC 2.5

Elastic-R AMI 2R 2.9

BioC 2.3

Elastic-R AMI 3R 2.8

BioC 2.0

Elastic-R Amazon Machine Images

Elastic-R EBS 1

Data Set XXX

Elastic-R EBS 2

Data Set YYY

Elastic-R EBS 3

Data Set ZZZ

Elastic-R EBS 4

Data Set VVV

Elastic-R AMI 2

R 2.9BioC 2.3

Elastic-R EBS 4

Data Set VVV

Amazon Elastic Block Stores

Eastic-R AMI 2R 2.9

BioC 2.3

Elastic-R.org

Elastic-R EBS 4

Data Set VVV

Page 13: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1313

Modes of access Individuals with AWS accounts

Standard AMIs (Amazon Machine Images): paid-per-use to Amazon. For Academic use and trial purposes

Paid AMI: Paid-per-use software model. For business users

Individuals without AWS accounts Trial tokens, purchased tokens, tokens granted by other users. Resources (data science engines) shared by other users Individual subscribtions

Companies/Educational & Research Institutions Dedicated Platform and AMIs on an Amazon VPC (Virtual Private

Cloud). Paid via subscribtion

Page 14: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1414

Elastic-R: The scriptability toolset

Command Line

Web Console

SDK

API

Page 15: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1515

Elastic-R: The scriptability toolset

Command Line

Web Console

SDK

API

Page 16: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1616

Elastic-R: The scriptability toolset

WS

generator

rws.war+ mapping.jar

+ pooling framework

+ R Java Bridge

+ JAX-WS

- Servlets

- Generated artifacts

Eclipse Web Service Client Generator

public static void main(String[] args) throws Exception { RGlobalEnvFunctionWeb g=new RGlobalEnvFunctionWebServiceLocator().getrGlobalEnvFunctionWebPort(); RNumeric x=new RNumeric(); x.setValue(new Double[]{6.0}); System.out.println(g.square(x).getValue()[0]);}

square function(x) {return(x^2) }typeInfo(square) SimultaneousTypeSpecification(TypedSignature(x = "numeric"), returnType = "numeric")

Script / globals.r

Script / rjmap.xml

<rj> <publish> <functions> <function name="square" forWeb="true"/> </functions> </publish> <scripts> <initScript name="globals.r" embed="true"/> </scripts></rj>

Deploy

Page 17: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1717

Elastic-R: The scriptability toolset API: Soap and Restful Web Services

Data Analysis Engines control Data Analysis Engines Management (Life cycle, etc.) Virtual appliances/artifacts management Platform administartion

SDKs: Java, R, Microsoft Office (Vba) Command line: elasticR package Html5 Workbench

Elastic-R artifacts management interface Engines control interface

Page 18: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1818

Demo Register to Elastic-R academic and trial portal (

www.elastic-r.org) Create Data Science Engines using trial tokens Work with R, Python and Scientific Spreadsheets in the

browser Share Data Science Engine and Collaborate Connect to the remote Data Science Engine from within

a local R session, push and pull data, execute commands and show impact on spreadsheets and dashboards

Page 19: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

1919

Conclusion Elastic -R unlocks the potential of the cloud for Data

Scientists and analytical applications developers and improves dramatically their productivity

The Data Science IT Factory chain, from resources acquisition to services and applications design and publication becomes: Fully scriptable in R Under the direct control of the Data Scientist and the developer Fully traceable and reproducible

Elastic-R can seamlessly extend any existing application or service and can be used as a collaboration backend

Openness, security and reliability are among the key capabilities delivered

Page 20: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

2020

What to do Next Register to Elastic-R and try the HTML 5 Workbench and

the collaboration capabilities Download the R package elasticR and use it to access

the cloud from local R sessions Download the Java SDK and create your first analytical

application using AWS and advanced tools for programming with data.

Get in touch with me to explore potential partnerships and collaborations

Page 21: Leveraging scriptable infrastructures, Towards a paradigm shift in software for data science Cloud Era Ltd 14 June 2013 karim.chine@cloudera.co.uk Karim

2121

Contact details Karim Chine [email protected]