PLOTCON NYC: Interactive Visual Statistics on Massive Datasets

© 2016 Continuum Analytics - Confidential & Proprietary© 2016 Continuum Analytics - Confidential & Proprietary

Interactive Visual Statistics on Massive Datasets

Peter Wang CTO, Co-founder Continuum Analytics @pwang

© 2016 Continuum Analytics - Confidential & Proprietary 2

• Introductions

• Company Overview

• Goals of Analytics and IT teams

• Why Python for Data Science

• Anaconda - Making Python Better for Data Science

• Package Management

• Cluster Environment Management

• Notebook Computing

• Demonstrations

• Q&A / Next steps

AGENDA


What’s the Problem?

© 2016 Continuum Analytics - Confidential & Proprietary

Big data magnifies small problems

4

• Of course, big data presents storage and computation problems • More importantly, standard plotting tools have problems that are

magnified by big data: • Overdrawing/Overplotting • Saturation • Undersaturation • Binning issues

• We’ll first explain these problems, and then present a new technique called datashading to address them head-on.


Overdrawing

5

• For a scatterplot, the order in which points are drawn is very important

• The same distribution can look entirely different depending on plotting order

• Last data plotted overplots


Overdrawing

6

• Underlying issue is just occlusion

• Same problem happens with one category, but less obvious

• Can prevent occlusion using transparency


Saturation

7

• E.g. for alpha = 0.1, up to 10 points can overlap before saturating the available brightness

• Now the order of plotting matters less

• After 10 points, first-plotted data still lost

• For one category, 10, 20, or 2000 points overlapping will look identical


Saturation

8

• Same alpha value, more points:

• Now is highly misleading

• alpha value depends on size, overlap of dataset

• Difficult-to-set parameter, hard to know when data is misrepresented


Saturation

9

• Can try to reduce point size to reduce overplotting and saturation

• Now points are hard to see, with no guarantee of avoiding problems

• Another difficult-to-set parameter

• For really big data, scatterplots start to become very inefficient, because there are many datapoints per pixel — may as well be binning by pixel


Binning issues

10

• Can use heatmap instead of scatter

• Avoids saturation by auto-ranging on bins

• Result independent of data size

• Here two merged normal distributions look very different at different binning

• Another difficult-to-set parameter


Plotting big data

11

• When exploring really big data, the visualization is all you have — there’s no way to look at each of the individual data points

• Common plotting problems can lead to completely incorrect conclusions based on misleading visualizations

• Slow processing makes trial and error approach ineffective

When data is large, you don’t know when the viz is lying.


Datashading


Datashading

13

• Flexible, configurable pipeline for automatic plotting • Provides flexible plugins for viz stages, like in graphics shaders • Completely prevents overplotting, saturation, and undersaturation • Mitigates binning issues by providing fully interactive exploration in web

browsers, even of very large datasets on ordinary machines • Statistical transformations of data are a first-class aspect of the

visualization • Allows rapid iteration of visual styles & configs, interactive selections and

filtering, to support data exploration


Datashading Pipeline: Projection

14

Data

Project / Synthesize

Scene

• Stage 1: select variables (columns) to project onto the screen

• Data often filtered at this stage


Datashading Pipeline: Aggregation

15

Data


Scene Aggregates

Sample / Raster

• Stage 2: Aggregate data into a fixed set of bins

• Each bin yields one or more scalars (total count, mean, stddev, etc.)


Datashading Pipeline: Transfer

16

Data


Scene Aggregates

Sample / Raster Transfer

Image

• Stage 3: Transform data using one or more transfer functions, culminating in a function that yields a visible image

• Each stage can be replaced and configured separately


Demos


New Developments


Flexible Statistics

19

Normalized Vegetation Difference Index


Flexible Statistics

20

Slope & Aspect Ratio from pure Elevation


Anaconda


• Simplify setup for non-engineers

• Enable easy development on and deployment to multiple platforms.

• Enable data scientists to experiment and iterate even more rapidly

• Eliminate the pains associated with package and dependency management

Why Did We Create Anaconda?To Enhance Python and Enable Data Scientist to Quickly Engage with Their Data


Anaconda

Modern, Open-Source Analytics Platform powered by Python

Quickly Engage w/ Your Data

• 500+ Popular Python Packages

• Optimized & Compiled

• Free for Everyone

• Extensible via Conda Package Manager

• Sandbox Packages & Libraries

• Cross-Platform – Windows, Linux, Mac

• Not just Python - over 230 R packages

• Foundation of our Enterprise Products


On-premises package repository and sharing platform • Governance for your analytics environment - maintain

control of the packages used by your analysts

• Easily replicate and share analysts’ environments

• Centrally store proprietary libraries and manage versioning

Cluster environment management • Manages Python, R, Java, Scala packages

across the cluster

• Easily replicate analysts’ environments for different jobs/users/groups

• Strong support for Hadoop & Spark

Anaconda Enterprise


Anaconda EnterpriseScalable Computing and Collaboration

• Multi-user notebook deployments

• Scalable notebook deployment model • Project-based management • Notebook versioning and locking

• Extended support for Hadoop Stack (Storm, Spark Streaming, Kafka)

• Single sign-on support(PKI, Kerberos etc.) • Burst Compute support


Consulting

Customers include: • JPL • DARPA • Sandia National Labs • AMD • Bank of America • Bloomberg

We Will Help Design, Architect, and Build the Right Analytics For You

Leverage our Open-Source Projects

Data & Analytics

PLOTCON NYC: Interactive Visual Statistics on Massive Datasets