Visualization of Big Data DANYEL FISHER, MICROSOFT RESEARCH
Contents
Big Data & Visualization Overview
Background, and How We Know What We Know
Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming Adds New Challenges
The Three Vs Volume Velocity Variety
… and I’ll add one more: “Visitation.” This is what we used to call “Exploratory Data Analysis”, but I want to keep up with the “V” thing.
Defining “Big” Volume
“…200,000 magnetic tape reels which represent over 900 billion characters of data”
1975
Exploration is not presentation
EXPLORATION:
Learn about the dataset
Explore multiple hypotheses
Manipulate data freely
May be discarded after completion
Examples: Some of Tableau, PowerView, GGPLOT, etc
PRESENTATION:
Communicate a specific view
Constrain interaction
Visual style important
Examples: visual dashboards, data storytelling
Goals Responsive, exploratory visualization
We’re NOT interested in ◦ Pre-cooked datasets and visualizations ◦ Knowing precisely what you plan to look at / do
“the size of the dataset is part of the problem”
Problem Space On one PC, you ◦ Run out of screen to draw each data point [10^6] ◦ Take a long time to look at every data point [10^9] ◦ May not be able to store all the data points [10^12]
Rendering Problem
Scatterplot (at least one pixel per point; axes x and y)
Network Diagram, Parallel Coordinates (individual lines)
II: Hotmap, A Personal Story
One of the most popular spots in the world.
Based on a table with a few billion rows
South Dakota: zoom on the center of the map
How We Know What We Know Building data-based systems for a long time
Interview study with data analysts, published “Interactions with Big Data Analytics”
Building “Big Sky” system (SIGMOD demo as “Stat”) ◦ Integrated Visualizations ◦ Streaming Data Streams
Outline Data Processing Constraints
Data Communication
Data Aggregation
Processing DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA
Solution Space ◦ Work Offline ◦ Index ◦ (OLAP, InMems, Nanocubes) ◦ Restrict Data ◦ Sample (or Stream) ◦ Divide & Conquer
ONE-PASS ALGORITHMS Touch each data point once
In a histogram, where does it go?
◦ Categorization is easy. (“Bucket A”)
◦ But … what about other bucketing algorithms?
Database Sketches: one-pass approximations
◦ Standard deviation, mean are fairly easy
◦ “Is the highest value” is very hard
◦ “Falls in the top 10%” isn’t bad
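As an illustration of why mean and standard deviation count as "fairly easy" in one pass, here is a minimal sketch using Welford's update, a standard technique that the slides do not name. The class name `RunningStats` and the sample values are assumptions made for this example.

```python
import math


class RunningStats:
    """One-pass mean and standard deviation (Welford's update).

    Hypothetical helper for illustration; not taken from the slides.
    """

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def add(self, x):
        # Each data point is touched exactly once and never stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; defined as 0.0 for fewer than two values.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(value)
```

The state is three numbers regardless of how many rows pass through, which is what makes this kind of statistic friendly to one-pass processing.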
Two strategies for exploration DIVIDE AND CONQUER
ONLINE QUERY PROCESSING
[Figure: fraction of the result completed (up to 100%) over time, Online vs. Traditional]
The Progressive Pitch
“Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster”
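The progressive idea can be sketched as an aggregate that reports an estimate plus an uncertainty band after every chunk of rows, instead of a single answer at the end. This is a rough sketch, not the system described in the paper: `progressive_mean`, the chunk size, and the 1.96 normal-approximation multiplier are all assumptions, and the confidence band is only meaningful if rows arrive in random order.

```python
import math
import random


def progressive_mean(values, chunk_size=1000, z=1.96):
    """Yield (fraction_seen, estimate, half_width) after each chunk.

    Hypothetical helper: half_width is z times the standard error of the
    mean, which assumes the rows seen so far are a random sample.
    """
    n = 0
    mean = 0.0
    m2 = 0.0
    total = len(values)
    for start in range(0, total, chunk_size):
        for x in values[start:start + chunk_size]:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        std_err = math.sqrt(m2 / (n - 1) / n) if n > 1 else float("inf")
        yield n / total, mean, z * std_err


rng = random.Random(0)
data = [rng.gauss(50.0, 10.0) for _ in range(10_000)]  # synthetic data
updates = list(progressive_mean(data))
for fraction, estimate, half_width in updates[:3]:
    print(f"{fraction:.0%} seen: {estimate:.2f} +/- {half_width:.2f}")
```

An analyst can stop as soon as the band is narrow enough for the decision at hand, rather than waiting for 100% of the data.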
Design Constraints for Visualizing Big Data There’s Too Much to Process You’ll Never See It All The Rules Change Streaming is Hard
Fallback: Reservoir Sampling Streaming sample Keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir
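A minimal sketch of reservoir sampling as described above, using the classic "Algorithm R" formulation. The function name `reservoir_sample` and its parameters are chosen for this example only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    After i items have been seen (i >= k), each of them is in the
    reservoir with probability k / i.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives only if it lands in [0, k).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
```

The stream is read once and only k items are held in memory, so the same code works whether the source is a file, a query cursor, or a live feed.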
Raw data -> Relevant dimensions -> Apply buckets on dimensions -> Filter data -> Aggregate data -> Create shapes -> Assign scales to shapes -> Render to screen
DISPLAY DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA
Solution Space AGGREGATE One visual point represents multiple data points SAMPLE Show only some of the dataset
THINK AGGREGATION Bar Chart -> Histogram
Points on a Map, Scatterplot -> 2D Histogram, Heatmap
Line Chart -> Approx Line Chart
Parallel Coordinates -> Area Para Coordinates
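To make the scatterplot-to-2D-histogram substitution concrete, here is a small sketch that bins points into a fixed grid so that one cell stands for many data points. The function `bin_2d`, the 20 x 20 grid, and the synthetic points are assumptions for illustration.

```python
import random
from collections import Counter


def bin_2d(points, x_range, y_range, bins_x, bins_y):
    """Count points per grid cell; returns a Counter keyed by (col, row)."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted area
        # Clamp so that a value exactly on the upper bound falls in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * bins_x), bins_x - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins_y), bins_y - 1)
        counts[(col, row)] += 1
    return counts


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(100_000)]  # synthetic data
cells = bin_2d(points, (0.0, 1.0), (0.0, 1.0), bins_x=20, bins_y=20)
```

However many points arrive, at most 400 cells need to be drawn or sent to the client, and each cell's count can be mapped to color for a heatmap.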
What about network diagram?
SAMPLING: You’ll never know it all
TASKS
Find extreme
Compare bars
Bar to constant
Bar to range
Order (top-K)
SAMPLING: Probabilistic Views “Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”
Design Goals
Easy to interpret
Consistency across tasks
Spatial Stability
Minimize Visual Noise (overhead)
“Is Bar A > Bar B”
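One way to answer this question from samples is to report a probability rather than a yes or no. The sketch below uses a normal approximation on the difference of two sample means; this is an assumption made for illustration and is not necessarily the method used in the cited paper. The function name `prob_a_greater` and the synthetic samples are hypothetical.

```python
import math
import random
import statistics


def prob_a_greater(sample_a, sample_b):
    """Approximate P(true mean of A > true mean of B) from two samples.

    Assumes each sample has at least two values and that the sample means
    are roughly normally distributed.
    """
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # Variance of each sample mean (squared standard error).
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    std_diff = math.sqrt(var_a + var_b)
    if std_diff == 0:
        # No spread at all: the comparison is deterministic.
        return 1.0 if mean_a > mean_b else 0.0 if mean_a < mean_b else 0.5
    z = (mean_a - mean_b) / std_diff
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


rng = random.Random(7)
bar_a = [rng.gauss(10.5, 3.0) for _ in range(200)]  # synthetic sample for bar A
bar_b = [rng.gauss(10.0, 3.0) for _ in range(200)]  # synthetic sample for bar B
p = prob_a_greater(bar_a, bar_b)
```

A view built on this number can say "A is probably larger than B" with an explicit confidence, which matches the task-driven framing of the slide.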
Other Tasks
Find extreme Compare to value Compare to Range
Design Problems DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA
1 IS VERY DIFFERENT FROM 0 When the Y axis goes all the way to very high values, it’s still very interesting to know which values are possible
STREAMING MEANS YOUR WORLD CAN CHANGE
Categorical -> too many categories! Numerical -> changing bounds Any color map or scale can change
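The "changing bounds" problem for numerical data can be sketched with a scale whose domain grows as values stream in. The class `StreamingScale` is a hypothetical helper written for this example; it only shows when a new value forces every existing mark to be rescaled.

```python
class StreamingScale:
    """Linear scale whose domain expands as new values arrive (illustrative only)."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def update(self, value):
        """Absorb one value; return True if the domain changed."""
        if self.lo is None:
            self.lo = self.hi = value
            return True
        changed = False
        if value < self.lo:
            self.lo, changed = value, True
        if value > self.hi:
            self.hi, changed = value, True
        return changed

    def to_unit(self, value):
        """Map a value into [0, 1] under the current domain."""
        if self.lo is None or self.hi == self.lo:
            return 0.5  # degenerate domain: place everything in the middle
        return (value - self.lo) / (self.hi - self.lo)


scale = StreamingScale()
rescales = [scale.update(v) for v in [5, 7, 6, 12, 3, 8]]
print(rescales)  # True marks the arrivals that moved the axis bounds
```

Every `True` means positions or colors already on screen are no longer valid under the new domain, so a streaming design has to decide whether to redraw, animate, or pad the bounds in advance.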
STREAMING, STORING, SENDING
Implications for interaction, for updates Care a lot about changes that are server-side only vs client-only. Change color, change height scale … vs change bucket size. [Research opportunity: what are the tradeoffs of different models?]
Disk -> Data -> Aggregate -> Shapes -> Render -> Screen
Network? (D3)
Network? (Tableau Public)
Network? (SVG)
Hard to Do Research This isn’t the way SQL works today
You don’t want to stand up a Hadoop cluster yourself—and it’s a whole other skillset.
You can approximate: ◦ repeat medium-sized data over and over? ◦ generate data based on a model?
Conclusion Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming Adds New Challenges