Python Notebooks for Collaborative Data Science

Peter Wang, CTO, Co-Founder

Anaconda Open Data Science Platform

© 2016 Continuum Analytics - Confidential & Proprietary


Page 1: Python Notebooks for Collaborative Data Science


Python Notebooks for Collaborative Data Science

Peter Wang

CTO, Co-Founder

Anaconda Open Data Science Platform

Page 2: Python Notebooks for Collaborative Data Science


Open Data Science Platform

– 730+ Popular Python & R packages

– Compiled for Windows, Mac, and Linux

– Extensible via Conda Package Manager

– Easily sandbox and deploy packages & analytical computing environments

– Free and Open Source Core

– Foundation of our Enterprise Platform

Accelerate, Connect & Empower

Page 3: Python Notebooks for Collaborative Data Science


Anaconda… is Trusted by Industry Leaders

Financial Services: Risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting

Government: Fraud detection, data crawling, web & cyber data analytics, statistical modeling

Healthcare & Life Sciences: Genomics data processing, cancer research, natural language processing for health data science

High Tech: Customer behavior, recommendations, ad bidding, retargeting, social media analytics

Retail & CPG: Engineering simulation, supply chain modeling, scientific analysis

Oil & Gas: Pipeline monitoring, noise logging, seismic data processing, geophysics

Page 4: Python Notebooks for Collaborative Data Science


Conda: Package and Environment Management

conda (Windows, Mac OS X, Linux)

– Install packages

– Update packages

– Create sandboxes: Conda environments

– Conda environments: Critical for reproducibility, collaboration & scale

Figure: three conda environments. Env 1: Python 2.7. Env 2: Python 3.4, Pandas v0.18, Jupyter. Env 3: R, R Essentials. Further package labels in the diagram: NumPy v1.11, NumPy v1.10, Pandas v0.16.
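As a rough illustration of the sandboxing idea on this slide, the sketch below builds the `conda create` command line for each environment. The helper name `conda_create_args`, the environment names, and the pairing of NumPy and Pandas versions with particular environments are assumptions for illustration; the slide diagram does not survive well enough to confirm which versions sit in which environment. The commands are only assembled and printed, not executed.

```python
# Hypothetical helper (not part of conda): build the argv for `conda create`.
def conda_create_args(env_name, packages):
    """Return the argument list that creates a conda environment with pinned packages."""
    args = ["conda", "create", "--yes", "--name", env_name]
    # conda accepts "name=version" specs; a bare name means "any compatible version".
    for name, version in packages:
        args.append(f"{name}={version}" if version else name)
    return args


# Illustrative environments; the version-to-environment pairing is an assumption.
environments = {
    "env1": [("python", "2.7"), ("numpy", "1.10"), ("pandas", "0.16")],
    "env2": [("python", "3.4"), ("numpy", "1.11"), ("pandas", "0.18"), ("jupyter", None)],
    "env3": [("r-essentials", None)],
}

commands = {name: conda_create_args(name, pkgs) for name, pkgs in environments.items()}

for argv in commands.values():
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(argv))
```

Because each environment pins its own versions, two projects that need different NumPy or Pandas releases can coexist on one machine, which is the reproducibility point the slide makes.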

Page 5: Python Notebooks for Collaborative Data Science


Anaconda Enterprise Components

Anaconda

• High performance Python & R

• 720+ data science packages

• Cross-platform package, dependency & environments

• Community driven package repository collaboration

Anaconda Navigator

• Desktop Portal & Installer

OPEN DATA SCIENCE

DATA SCIENCE GOVERNANCE

DATA SCIENCE COLLABORATION

Anaconda Repository

• Storage & sharing of packages, environments, notebooks

• On-premise governance

• Enterprise authentication

Anaconda

• Deep Learning: Theano, TensorFlow, Caffe, Keras, Neon, Lasagne

• Natural Language Processing: NLTK, spaCy

• Machine Learning: Scikit-learn

• GPU enablement

Anaconda Enterprise Notebooks

• Collaborative project based workflows for Python & R

• Enterprise authentication & permissioning

• Notebook sharing, versioning, search, differencing

Anaconda

• Interactive browser based dashboards & visualizations with Bokeh

• Bokeh apps using Python, R, Scala

DATA SCIENCE FOR BIG DATA

Anaconda Scale

• Hadoop & Spark integration

• Scalable distributed processing framework

• Integration with resource management & data stores

• Distributed package, dependency & environments

Anaconda Fusion

• Integration of Open Data Science with Microsoft Excel®

• Big Data querying & transformations

Page 6: Python Notebooks for Collaborative Data Science


Anaconda Enterprise

Open Source Without Anxiety: Governance and Scalability

On-premises package repository

– Governance for your analytics environment

– Empower your data scientists within the structure of enterprise IT

Enterprise notebook collaboration

– Easily replicate and share analysts’ environments

– Centrally store proprietary libraries and manage versioning

Scalable analytics computations

– Scale up: leverage GPU and parallel-optimized libraries

– Scale out: easily manage Anaconda across your Hadoop/Spark cluster

– Scale up and out with Python and R

Enterprise data science deployment

– Encapsulate and deploy data science projects

– Deploy live notebooks, dashboards, interactive applications, and models with REST APIs

Page 7: Python Notebooks for Collaborative Data Science


Continuum Sponsored Open-Source Projects

• Bokeh – Interactive web visualizations and applications

• Dask – Painless distributed and parallel computations in Python

• Numba – JIT for Python applications

• Jupyter, Spyder – Notebooks and IDE for data science

• Pandas, Datashader, Blaze, …

Page 8: Python Notebooks for Collaborative Data Science


Anaconda, Jupyter, and Notebooks

• In 2008, we helped kick off some of the initial efforts on IPython, which led to the separation of the kernel from the front end. (Previously it was just a "nicer" command-line REPL.)

• We also helped fund the initial Notebook interface (based on Qt, not HTML) in 2008/2009

• Web Notebooks really started taking off around 2011

• Our Notebook cloud service launched in 2012

• Anaconda Enterprise Notebooks launched in 2014

Page 9: Python Notebooks for Collaborative Data Science


Data Science Notebooks

Page 10: Python Notebooks for Collaborative Data Science


Jupyter Notebook

• Interleave code, text, graphics

• Multiple languages: Python, R, SQL, SAS, Spark, Julia, etc.

• Runs in the browser

• Open Source
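To make "interleave code, text, graphics" concrete, here is a minimal sketch of what a notebook file looks like on disk: a JSON document holding an ordered list of cells. The field names follow the nbformat 4 layout as I understand it; the helper functions and the cell contents are illustrative assumptions, not anything shown in the talk.

```python
import json


def markdown_cell(text):
    """A prose cell: rendered as formatted text in the browser."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """A code cell: outputs (including plots) are filled in when the kernel runs it."""
    return {
        "cell_type": "code",
        "execution_count": None,  # not yet executed
        "metadata": {},
        "outputs": [],            # graphics and results would be stored here after a run
        "source": source,
    }


notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {
        # The kernelspec names the language backend; other kernels (R, Julia, ...) plug in here.
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        markdown_cell("# Sales overview\nA short explanation of the analysis."),
        code_cell("total = sum([120, 80, 45])\ntotal"),
    ],
}

# Serialize to the text that would be saved as an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:60])
```

Because the file is plain JSON, it can be stored, shared, and compared like any other text document.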

Page 11: Python Notebooks for Collaborative Data Science


Page 12: Python Notebooks for Collaborative Data Science


Notebook Demos

Page 13: Python Notebooks for Collaborative Data Science


Anaconda Enterprise Notebook Computing

Components:

– Web Interface: Anaconda Enterprise Notebook Server

– Computation: Gateway & Project Nodes, running IPython kernels

– Package Control: Internal Anaconda Repository

– Authentication: Active Directory/LDAP (optional)

– Users: User 1, User 2, User 3

Workflow:

– The analyst logs into the Enterprise notebook server, authenticating against LDAP/AD

– Based on the project they select, the analyst is redirected to the appropriate project node

– All notebooks/Python code runs on project nodes; any needed packages are pulled down from your local repository
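The login-then-redirect workflow on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not the product's implementation: the in-memory `DIRECTORY` dictionary stands in for an LDAP/AD lookup, and the user, project, and node names are all hypothetical.

```python
# Hypothetical stand-ins: a real deployment would query LDAP/AD and a project registry.
DIRECTORY = {"analyst1": "s3cret"}
PROJECT_NODES = {"risk-models": "node-a", "fraud": "node-b"}


class AuthenticationError(Exception):
    """Raised when the supplied credentials do not match the directory."""


def authenticate(user, password):
    """Stand-in for a directory bind; returns the user name on success."""
    if DIRECTORY.get(user) != password:
        raise AuthenticationError(user)
    return user


def route_to_project_node(user, password, project):
    """Log in first, then return the node that hosts the selected project."""
    authenticate(user, password)
    if project not in PROJECT_NODES:
        raise LookupError(f"no node hosts project {project!r}")
    return PROJECT_NODES[project]


node = route_to_project_node("analyst1", "s3cret", "fraud")
print(node)
```

Keeping authentication separate from routing mirrors the slide's split between the notebook server, which handles login, and the project nodes, which run the code.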

Page 14: Python Notebooks for Collaborative Data Science


Deploying Data Science Projects - Notebooks

Page 15: Python Notebooks for Collaborative Data Science


Deploying Data Science Projects - Dashboards

Page 16: Python Notebooks for Collaborative Data Science


JupyterLab

• Notebooks, plots, data tables, code editors

• Dashboard authoring

• Interactive distributed computing

• Collaboration between Continuum, Bloomberg, and others

Page 17: Python Notebooks for Collaborative Data Science


Anaconda Fusion

Page 18: Python Notebooks for Collaborative Data Science


Anaconda Fusion brings Open Data Science to Microsoft Excel


• BRING interactive visualizations, machine learning and ETL to Excel

• BRIDGE Excel Data to Python & R through notebooks

• ACCESS all the power of Python and Big Data, natively embedded inside Excel

Page 19: Python Notebooks for Collaborative Data Science


Empowering Business Analysts & Data-driven Employees

• Anaconda Fusion is a Microsoft Excel® Add-in that enables a unique and simple link between Excel and Python without writing code

• Anaconda Fusion is targeted to Business Analysts who want “No Code” Data Science

Page 20: Python Notebooks for Collaborative Data Science


Analysts and Data Scientists can keep using their preferred tools


Page 21: Python Notebooks for Collaborative Data Science


“No Code” Data Science – Data Loading Example

1. Select Anaconda Fusion Notebook and click “Upload”

2. Select function you wish to run

3. Click “Run”

4. Data is loaded into spreadsheet

Page 22: Python Notebooks for Collaborative Data Science


Just change one line of code in your notebook

Page 23: Python Notebooks for Collaborative Data Science


Anaconda Fusion Use Cases

• Extract data – pull data directly into Excel to perform analysis

• Machine Learning – use trained models created by Data Scientists and plug them into your spreadsheet data

• Interactive Visualizations – create custom advanced interactive graphs, charts and plots from Excel data

• Big Data – analyze, transform, model and query data stored in Hadoop and Spark

Figure: Anaconda Fusion on Mac