Upload
ali-ziane-myriam
View
155
Download
2
Embed Size (px)
Citation preview
How to interactively visualise and explore a billion objects (with vaex)
Maarten Breddels&
Amina Helmi
PyData Paris 2016
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in december 2013
• determines positions, velocities and astrophysical parameters of >109 stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech
image:Devin Powell
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation• We will soon have the Gaia catalogue
• > 109 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral network
• Problem
• Scatter plots do not work well for 109
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and 3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking• Python
• rich set of libraries, becoming the default in science
Situation
• TOPCAT comes close, not fast enough, works with individual rows/particles, no exploratory tools, written in Java.
• Your own IDL/Python code: a lot to consider to do it optimal (multicore, efficient storage, efficient algorithms, interactive becomes complex)
• DataShader?
• We want something to visualize 109 rows/objects in ~1 second
• Do we need to resort to Big Data solutions?
Big data?• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer ‘old style’
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Interactive? How?
• Interactive?
• 109
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo•6. *108 particles• < 1 second
Interactive? How?
• Interactive?
• 109
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo•6. *108 particles• < 1 second
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the file system cache
cache
How to store and read the data
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
Translate to something visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map to a color
• 3d: volume rendering
Translate to something visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map to a color
• 3d: volume rendering
Translate to something visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map to a color
• 3d: volume rendering
How to get good performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
600 times faster
Great… 2d histograms…1 32 4 .. 9 11x4 7 41 .. 91 61y
+1
Great… 2d histograms…• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x4 7 41 .. 91 61y
+1
Great… 2d histograms…• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x4 7 41 .. 91 61y
+12 1 4 .. 8 3v
+v, or v2
Great… 2d histograms…• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x4 7 41 .. 91 61y
+12 1 4 .. 8 3v
+v, or v2• Possibilities
• Total: flux, mass
• Mean: velocity, metallicity
• Dispersions: velocity…
• Basically statistics on a grid
Examples
Examples
Examples
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Vaex: Visualization And EXploration
• A library
• python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom, pan
• interactive selection (lasso, rectangle)
• client
• undo/redo
Example data / vaex.example()• Helmi and de Zeeuw 2000
• build up of a MW (stellar) halo from 33 satellites
• 3.3 * 106 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
Exploration: Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
Mutual information• Measure the information loss between
p(x,y) and p(x)p(y)
• information loss measured using the KL divergence:
• In short:
• How much does breaking up correlation (not just linear) change density distribution?
Ranking by Mutual informationp(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
Ranking by Mutual informationp(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic coordinates
How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
Client/Server
• 2d histogram example
• raw data:15 GB
• binned 256x256 grid
• 500KB, or 10-100kb compressed
• Proof of concept
• over http / websocket
• vaex webserver [file …]
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
…
dask?wendelin?
Get vaex• standalone binary (OS X , Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install —user —pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
Summary• vaex: visualisation and exploration
• of large datasets 106-9
objects
• fast: > 109 objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary• vaex: visualisation and exploration
• of large datasets 106-9
objects
• fast: > 109 objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary• vaex: visualisation and exploration
• of large datasets 106-9
objects
• fast: > 109 objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary• vaex: visualisation and exploration
• of large datasets 106-9
objects
• fast: > 109 objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary• vaex: visualisation and exploration
• of large datasets 106-9
objects
• fast: > 109 objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary• vaex: visualisation and exploration
• of large datasets 106-9
objects
• fast: > 109 objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)