How to interactively visualise and explore a billion objects (wit vaex)

How to interactively visualise and explore a billion objects (with vaex)

Maarten Breddels&

Amina Helmi

PyData Paris 2016

Outline

• Motivation

• Technical

• Vaex

• Conclusions

Copyright ESA/ATG medialab; background: ESO/S. Brunier

• Gaia satellite

• launched by ESA in december 2013

• determines positions, velocities and astrophysical parameters of >109 stars of the Milky Way

• First catalogue in September 2016

• Final catalogue ~2020

Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech

image:Devin Powell

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network








• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking








• Problem




• Interactive?

• Zoom, pan etc

• Explore










• Problem




• Interactive?

• Zoom, pan etc

• Explore










• Problem




• Interactive?

• Zoom, pan etc

• Explore










• Problem




• Interactive?

• Zoom, pan etc

• Explore










• Problem




• Interactive?

• Zoom, pan etc

• Explore


• Subspace ranking• Python

• rich set of libraries, becoming the default in science

Situation

• TOPCAT comes close, not fast enough, works with individual rows/particles, no exploratory tools, written in Java.

• Your own IDL/Python code: a lot to consider to do it optimal (multicore, efficient storage, efficient algorithms, interactive becomes complex)

• DataShader?

• We want something to visualize 109 rows/objects in ~1 second

• Do we need to resort to Big Data solutions?

Big data?• Big data

• definitions vary

• Practical question to ask

• Do you need Big Data solutions?

• Many computers / grid

• Invest in software

• Or

• Can we do it on a single computer ‘old style’

Outline

• Motivation

• Technical

• Vaex

• Conclusions

Interactive? How?

• Interactive?

• 109

* 2 * 8 bytes = 15 GiB (double is 8 bytes)

• Memory bandwidth: 10-20 GiB/s: ~1 second

• CPU: 3 Ghz (but multicore, say 4): 12 cycles/second

• Few cycles per row/object, simple algorithm

• Histograms/Density grids

• Yes, but

• If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds)

• proper storage and reading of data

• simple and fast algorithm for binning

•Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo•6. *108 particles• < 1 second

Interactive? How?

• Interactive?

• 109

* 2 * 8 bytes = 15 GiB (double is 8 bytes)

• Memory bandwidth: 10-20 GiB/s: ~1 second

• CPU: 3 Ghz (but multicore, say 4): 12 cycles/second

• Few cycles per row/object, simple algorithm

• Histograms/Density grids

• Yes, but

• If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds)

• proper storage and reading of data

• simple and fast algorithm for binning

•Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo•6. *108 particles• < 1 second

• Storage: native, column based (hdf5, fits)

• Normal (POSIX read) method:

• Allocate memory

• read from disk to memory

• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)

• Wastes memory (actually disk cache)

• 15 GB data, requires 30 GB is you want to use the file system cache

cache

How to store and read the data



• Allocate memory





cache


• Memory mapping:

• get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages)

• avoid memory copies, more cache available

• In previous example:

• copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s

• Can be 2-3x slower (cpu cache helps a bit)



• Allocate memory





cache


• Memory mapping:

• get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages)

• avoid memory copies, more cache available

• In previous example:

• copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s

• Can be 2-3x slower (cpu cache helps a bit)

Translate to something visual

• 1d: histogram

• 2d:

• histogram

• use colormap to map to a color

• 3d: volume rendering


• 1d: histogram

• 2d:

• histogram




• 1d: histogram

• 2d:

• histogram



How to get good performance

• Pure python

• slow, GIL

• Numpy

• numpy.histogram slow

• C extension

• fast, no GIL, slower development

• less boilerplate code: cffi

• Numba @jit(nopython=True, nogil=True)

• Python code, c performance

• (nogil didn’t exist)


• Pure python

• slow, GIL

• Numpy


• C extension







• Pure python

• slow, GIL

• Numpy


• C extension







• Pure python

• slow, GIL

• Numpy


• C extension






600 times faster

Great… 2d histograms…1 32 4 .. 9 11x4 7 41 .. 91 61y

+1

Great… 2d histograms…• There is more

• Weighted histogram

• Don’t sums 1’s

• Sum values

1 32 4 .. 9 11x4 7 41 .. 91 61y

+1




• Sum values

1 32 4 .. 9 11x4 7 41 .. 91 61y

+12 1 4 .. 8 3v

+v, or v2




• Sum values

1 32 4 .. 9 11x4 7 41 .. 91 61y

+12 1 4 .. 8 3v

+v, or v2• Possibilities

• Total: flux, mass

• Mean: velocity, metallicity

• Dispersions: velocity…

• Basically statistics on a grid

Examples

Examples

Examples

Exploration

• Selections

• expressions

• visual

• linked views


Exploration

• Selections

• expressions

• visual

• linked views


Exploration

• Selections

• expressions

• visual

• linked views


Exploration

• Selections

• expressions

• visual

• linked views


Outline

• Motivation

• Technical

• Vaex

• Conclusions

Vaex: Visualization And EXploration

• A library

• python package

• ‘import vaex’

• reading of data

• multithreading

• binning (1,2,3, Nd)


• server/client

• integrates with IPython notebook

• A GUI program

• Gives interactive navigation, zoom, pan

• interactive selection (lasso, rectangle)

• client

• undo/redo

Example data / vaex.example()• Helmi and de Zeeuw 2000

• build up of a MW (stellar) halo from 33 satellites

• 3.3 * 106 particles

• Almost smooth in configuration space (x,y,z)

• Structure visible in E-Lz-L space

Exploration: Automated

• High dimensionality → many subspaces

• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz

• Can we automate this / at least help?

• Ranking subspaces using Mutual Information

Mutual information• Measure the information loss between

p(x,y) and p(x)p(y)

• information loss measured using the KL divergence:

• In short:

• How much does breaking up correlation (not just linear) change density distribution?

Ranking by Mutual informationp(x,y) p(x)p(y) I(X;Y)

0.003

0.837

x,y

Lz,L

Ranking by Mutual informationp(x,y) p(x)p(y) I(X;Y)

0.003

0.837

x,y

Lz,L

Expressions/Virtual columns

• Not just visualize existing columns

• mathematical operations

• sqrt(x**2+y**2)

• log(r)

• Virtual columns

• r=sqrt(x**2+y**2)

• Suggest common ones

• equatorial to galactic coordinates

How to get the data

• Gaia: ~1-2 TB

• download

• torrent

• Do you need to?

• server/client?

Client/Server

• 2d histogram example

• raw data:15 GB

• binned 256x256 grid

• 500KB, or 10-100kb compressed

• Proof of concept

• over http / websocket

• vaex webserver [file …]

Random subset / shuffled storage





…

dask?wendelin?

Get vaex• standalone binary (OS X , Linux) (just download and start)

• www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’)

• www.github.com/maartenbreddels/vaex

• In Python

• Vanilla Python

• pip install —user —pre vaex

• Anaconda

• conda install -c maartenbreddels vaex

http://www.astro.rug.nl/~breddels/vaex

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)



objects






• Or as a server






objects






• Or as a server






objects






• Or as a server






objects






• Or as a server






objects






• Or as a server




Technology

How to interactively visualise and explore a billion objects (wit vaex)