2013.10.24 big data visualization

DESCRIPTION

When the number of data elements gets large - thousands to billions or more data points - standard visual representations and interaction techniques break down. In this talk, I will survey methods for scaling interactive visualizations to data sets too large to process or explore using traditional means. I will compare data reduction techniques such as sampling, aggregation and model fitting, as well as interesting hybrid approaches, and discuss their trade-offs. I will also describe methods to enable real-time interactive exploration within standards-compliant web browsers. Attendees will learn effective visualization techniques and interaction methods that are applicable to billion+ element databases.

Visualizing “Big” Data

Sean Kandel & Jeffrey Heer

Trifacta Inc. @trifacta

How can we visualize and interact with billion+ record databases in real-time?

Two Challenges:
1. Effective visual encoding
2. Real-time interaction

Perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records.

Perception

Data Sampling

Modeling

Binning

Google Fusion Tables (Sampling)

imMens (Binned Aggregation)

Bin > Aggregate (> Smooth) > Plot

1. Bin Divide data domain into discrete “buckets”
- Categories: Already discrete (but check cardinality)
- Numbers: Choose bin intervals (uniform, quantile, ...)
- Time: Choose time unit: Hour, Day, Month, etc.
- Geo: Bin x, y coordinates after cartographic projection
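The numeric and time binning choices above can be sketched in a few lines of Python. This is a minimal illustration rather than the talk's implementation; the helper names (`uniform_bin`, `quantile_edges`, `quantile_bin`, `hour_bin`) and the sample values are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

def uniform_bin(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value, clamped to [0, n_bins - 1]."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def quantile_edges(values, n_bins):
    """Interior bin edges chosen so each bin holds roughly the same number of points."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def quantile_bin(value, edges):
    """Index of the quantile bin containing value."""
    return bisect_right(edges, value)

def hour_bin(ts):
    """Time binning: truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Skewed sample data: uniform bins would put almost everything in bin 0.
values = [1, 2, 2, 3, 50, 60, 70, 1000]
edges = quantile_edges(values, 4)
bins = [quantile_bin(v, edges) for v in values]
```

With skewed data like this sample, quantile edges spread the points across bins, while uniform intervals would leave most bins empty.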

Number of Bins?

100,000 Data Points: Rectangular Bins, Hexagonal Bins

Hexagonal or Rectangular Bins?

Hex bins better estimate density for 2D plots, but the improvement is marginal [Scott 92], while rectangles support reuse and query processing.

2. Aggregate Count, Sum, Average, Min, Max, ...
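A minimal sketch of the bin and aggregate steps together, assuming records of the form (x, y, value) and a uniform rectangular grid. The function name `bin_and_aggregate` and the sample records are hypothetical.

```python
from collections import defaultdict

def bin_and_aggregate(points, x_range, y_range, n_bins):
    """Group (x, y, value) records into an n_bins x n_bins grid and
    accumulate count, sum, min, and max for each occupied cell."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    cells = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                 "min": float("inf"), "max": float("-inf")})
    for x, y, v in points:
        # Uniform bin index per axis; the upper edge falls into the last bin.
        i = min(int((x - x_lo) / (x_hi - x_lo) * n_bins), n_bins - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * n_bins), n_bins - 1)
        c = cells[(i, j)]
        c["count"] += 1
        c["sum"] += v
        c["min"] = min(c["min"], v)
        c["max"] = max(c["max"], v)
    return dict(cells)

points = [(0.1, 0.1, 5.0), (0.2, 0.15, 7.0), (0.9, 0.9, 1.0)]
summary = bin_and_aggregate(points, (0.0, 1.0), (0.0, 1.0), n_bins=4)
# Average is derived from the stored sum and count.
avg = {cell: s["sum"] / s["count"] for cell, s in summary.items()}
```

The size of `summary` is bounded by the number of bins, not the number of records, which is the property the talk relies on.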

(3. Smooth Optional: smooth aggregates [Wickham ’13])

[1] Wickham 2013
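As a rough illustration of the optional smoothing step, the sketch below applies a moving average (box kernel) to 1-D bin counts. This is an illustrative stand-in, not necessarily the smoother described in [Wickham ’13]; the function name and the `radius` parameter are hypothetical.

```python
def smooth_counts(counts, radius=1):
    """Replace each bin with the mean of itself and its neighbors within radius.
    Windows are truncated at the edges rather than padded."""
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

counts = [0, 0, 9, 0, 0, 3]
smoothed = smooth_counts(counts)
```

Smoothing operates on the aggregates, so its cost depends on the number of bins rather than the number of records.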

4. Plot Visualize the aggregate summary values

Plot: Visual Encoding

Choose Most Effective Encoding [Cleveland & McGill ’84]

1D Plot -> Position or Length Encoding
Histograms, line charts, etc.

2D Plot -> Area or Color Encoding
Spatial dimensions (x, y) already allocated.
While less effective than area for magnitude estimation, color can be used at the per-pixel level and provides an overall “gestalt”.

Standard Color Ramp
Counts near zero are white.

-> Outliers are missed

Add Discontinuity after Zero
Counts near zero remain visible.

-> Outliers can be seen

Linear Alpha Interpolation is not perceptually linear.

Cube-Root Alpha Interpolation approximates perceptual linearity.

Color Encoding

Luminance (in range 0-1)

Min. Non-Zero Intensity (α=0.15) [1]

Perceptual Scaling (γ=1/3) [2]

User-Adjustable Min/Max Values [3]

[1] Keep small non-zero values visible (outliers!)
[2] Match color ramp to perceptual distances
[3] Enable exploration across value ranges
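One plausible way to combine the three annotations is a clamp, a power law, and an offset. The slide does not give the exact formula, so the composition below is an assumption, and the name `bin_alpha` and its parameters are hypothetical.

```python
def bin_alpha(value, vmin, vmax, alpha_min=0.15, gamma=1/3):
    """Map a bin's aggregate value to an opacity in [0, 1]."""
    if value <= 0:
        return 0.0  # empty bins stay fully transparent
    # [3] Clamp to the user-adjustable [vmin, vmax] range and normalize to [0, 1].
    t = (min(max(value, vmin), vmax) - vmin) / (vmax - vmin)
    # [2] Cube-root scaling, then [1] lift every non-empty bin to at least alpha_min.
    return alpha_min + (1.0 - alpha_min) * t ** gamma

low = bin_alpha(1, 0, 1000)      # a single record in a bin is still visible
high = bin_alpha(1000, 0, 1000)  # the maximum maps to full opacity
```

The jump from 0 to `alpha_min` is the discontinuity after zero: a bin holding one record renders at roughly 0.235 opacity here instead of fading into the background.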

Design Space of Binned Plots

Interaction

Interaction Techniques?
1. Select Detail-on-Demand
2. Navigate Pan & Zoom
3. Query Brush & Link

5-D Data Cube

Month, Day, Hour, X, Y

[Figure: data cube with axes Month (0 … 11), Day (0 … 30), Hour (0 … 23), X, and Y]

12 x 31 x 24 x 512 x 512 = ~2.3 billion cells

Brushing January

Month, Day, Hour, X, Y

31 x 24 x 512 x 512 = ~195 million cells
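The two cell counts follow directly from the bin counts per dimension. A quick check (the dictionary keys are just labels chosen for this sketch):

```python
from math import prod

dims = {"month": 12, "day": 31, "hour": 24, "x": 512, "y": 512}

full_cube = prod(dims.values())       # 2,340,421,632 cells (~2.3 billion)
january = full_cube // dims["month"]  # 195,035,136 cells (~195 million)
print(f"{full_cube:,} {january:,}")
```

Brushing a single month still leaves a slice of roughly 195 million cells to sum over.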

Multivariate Data Tiles
1. Send data, not pixels
2. Embed multi-dim data

Full 5-D Cube

For any pair of 1D or 2D binned plots, the maximum number of dimensions needed to support brushing & linking is four.

X: 512 bins

Y: 512 bins

~2.3B bins

~17.6M bins (in 352KB!)
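One way to arrive at a figure near 17.6M is to store, for each pair of linked plots, only the dimensions those two plots show. The sketch below assumes the display is one 2-D map (X, Y) plus three 1-D histograms (Month, Day, Hour); that plot layout is an assumption and is not stated on the slide.

```python
from itertools import combinations
from math import prod

dims = {"month": 12, "day": 31, "hour": 24, "x": 512, "y": 512}
plots = [("x", "y"), ("month",), ("day",), ("hour",)]  # assumed plot layout

def tile_bins(plot_a, plot_b):
    """Bins in the tile linking two plots: product over their combined dimensions."""
    return prod(dims[d] for d in set(plot_a) | set(plot_b))

tiles = {(a, b): tile_bins(a, b) for a, b in combinations(plots, 2)}
total = sum(tiles.values())
max_dims = max(len(set(a) | set(b)) for a, b in tiles)
```

Under this assumption `total` is 17,565,052, in line with the ~17.6M figure, and no tile needs more than three dimensions, which is within the four-dimension bound stated above.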

Full 5-D Cube -> Σ -> 13 3-D Data Tiles

Query & Render on GPU via WebGL

Pack data tiles as PNG image files,bind to WebGL as image textures.
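A hedged sketch of the packing idea: each bin count is spread across the four 8-bit channels (R, G, B, A) of one pixel, so a tile becomes a byte buffer that an image encoder could write out as a PNG. The channel layout is an assumption for illustration; the slide does not specify how imMens lays out values, and the PNG encoding step itself is omitted.

```python
import struct

def pack_counts_rgba(counts):
    """Pack unsigned 32-bit counts into RGBA bytes, one pixel per count (big-endian)."""
    return struct.pack(f">{len(counts)}I", *counts)

def unpack_counts_rgba(pixels):
    """Inverse of pack_counts_rgba: recover counts from an RGBA byte buffer."""
    return list(struct.unpack(f">{len(pixels) // 4}I", pixels))

counts = [0, 1, 255, 256, 70000]
pixels = pack_counts_rgba(counts)     # 4 bytes per bin
restored = unpack_counts_rgba(pixels)
```

A lossless image format matters here because the pixels carry data values rather than colors; any lossy compression would corrupt the counts.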

Invoke program for each output bin.Executes in parallel on GPU.
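The per-bin program can be pictured as a small function run once per output bin. The Python below emulates it sequentially on the CPU for a hypothetical 3-D tile indexed as `tile[x][y][month]`; on the GPU each call would correspond to one parallel shader invocation. The names and tile layout are illustrative assumptions.

```python
def query_bin(tile, x, y, brush):
    """Value of output bin (x, y): sum the tile over the brushed month range."""
    lo, hi = brush  # inclusive range of brushed month bins
    return sum(tile[x][y][lo:hi + 1])

def render(tile, brush):
    """Run query_bin for every output bin (sequential stand-in for the GPU)."""
    return [[query_bin(tile, x, y, brush) for y in range(len(tile[0]))]
            for x in range(len(tile))]

# 2 x 2 spatial bins, 3 month bins each.
tile = [[[1, 2, 3], [0, 0, 4]],
        [[5, 0, 0], [1, 1, 1]]]
heatmap = render(tile, brush=(0, 1))  # brush months 0-1
```

Each output bin depends only on its own slice of the tile, so the calls are independent, which is what allows them to run in parallel.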

Performance Benchmarks

Simulate interaction: brushing & linking across binned plots.

- imMens vs. Profiler
- 4x4 and 5x5 plots
- 10 to 50 bins

Measure time from selection to render.

Test setup:
- 2.3 GHz MacBook Pro (4-core)
- NVIDIA GeForce GT 650M
- Google Chrome v.23.0

~50fps querying of visual summaries of 1B data points.

[Chart: In-Memory Data Cube vs. imMens, plotted against Number of Data Points; 5 dimensions x 50 bins/dim x 25 plots]

NanoCubes

[1] Lins et al., InfoVis 2013

[2] Sismanis et al., SIGMOD 2002

Resources
- imMens: vis.stanford.edu/projects/immens
- Tableau Public: tableausoftware.com/public
- BigVis (R): github.com/hadley/bigvis
- Nanocubes: nanocubes.net
- BlinkDB: blinkdb.org
- MapD: geops.csail.mit.edu/docs/

Acknowledgments
- Zhicheng “Leo” Liu
- Biye Jiang
