Visualization of Big Data DANYEL FISHER, MICROSOFT RESEARCH



  • Visualization of Big Data DANYEL FISHER, MICROSOFT RESEARCH

  • Contents

    Big Data & Visualization Overview

    Background, and How We Know What We Know

    Design Constraints for Visualizing Big Data
      ◦ There’s Too Much to Process
      ◦ You’ll Never See It All
      ◦ The Rules Change
      ◦ Streaming Adds New Challenges

  • The Three Vs

    Volume, Velocity, Variety

    … and I’ll add one more: “Visitation.” This is what we used to call “Exploratory Data Analysis”, but I want to keep up with the “V” thing.

  • Defining “Big” Volume

    “…200,000 magnetic tape reels which represent over 900 billion characters of data”

    1975

  • Exploration is not presentation

    EXPLORATION:

    Learn about the dataset

    Explore multiple hypotheses

    Manipulate data freely

    May be discarded after completion

    Examples: parts of Tableau, Power View, ggplot, etc.

    PRESENTATION:

    Communicate a specific view

    Constrain interaction

    Visual style important

    Examples: visual dashboards, data storytelling

  • Goals

    Responsive, exploratory visualization

    We’re NOT interested in
      ◦ Pre-cooked datasets and visualizations
      ◦ Knowing precisely what you plan to look at / do

  • “the size of the dataset is part of the problem”

  • Problem Space

    On one PC, you
      ◦ Run out of screen to draw each data point [10^6]
      ◦ Take a long time to look at every data point [10^9]
      ◦ May not be able to store all the data points [10^12]

  • Rendering Problem

    Scatterplot, x against y (at least one pixel per point)

    Network Diagram, Parallel Coordinates (individual lines)

  • II: Hotmap, A Personal Story

  • One of the most popular spots in the world.

    Based on a table with a few billion rows

  • South Dakota: zoom on the center of the map

  • How We Know What We Know

    Building data-based systems for a long time

    Interview study with data analysts, published as “Interactions with Big Data Analytics”

    Building “Big Sky” system (SIGMOD demo as “Stat”)
      ◦ Integrated Visualizations
      ◦ Streaming Data Streams

  • Outline

    Data Processing Constraints

    Data Communication

    Data Aggregation

  • Processing

    DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA

  • Solution Space
      ◦ Work Offline
      ◦ Index (OLAP, InMems, Nanocubes)
      ◦ Restrict Data
      ◦ Sample (or Stream)
      ◦ Divide & Conquer

  • ONE-PASS ALGORITHMS

    Touch each data point once

    In a histogram: where does it go?
      ◦ Categorization is easy. (“Bucket A”)
      ◦ But … what about other bucketing algorithms?

    Database Sketches: one-pass approximations
      ◦ Standard deviation, mean are fairly easy
      ◦ “Is the highest value” is very hard
      ◦ “Falls in the top 10%” isn’t bad
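The claim that mean and standard deviation are fairly easy in one pass can be illustrated with a minimal sketch. It uses Welford's running update, which is one well-known way to do this and not necessarily what the speaker's systems use; the class name `RunningStats` and the sample values are made up for illustration.

```python
import math


class RunningStats:
    """One-pass mean and standard deviation (Welford's update)."""

    def __init__(self):
        self.n = 0          # number of points seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def add(self, x):
        # Each data point is touched exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; undefined for fewer than two points.
        if self.n < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.n - 1))


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative data
    stats.add(value)

print(stats.n, stats.mean, stats.stddev())
```

Only three numbers are kept in memory regardless of how many points stream past, which is what makes this kind of statistic a good fit for the "touch each data point once" constraint.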

  • Two strategies for exploration

    DIVIDE AND CONQUER

    ONLINE QUERY PROCESSING

    [Figure: axis labels “Time” and “100%”; series “Online” and “Traditional”]

  • The Progressive Pitch

    “Trust Me, I'm Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster”

    Design Constraints for Visualizing Big Data
      ◦ There’s Too Much to Process
      ◦ You’ll Never See It All
      ◦ The Rules Change
      ◦ Streaming is Hard

  • Fallback: Reservoir Sampling

    Streaming sample

    Keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir, where size is the number of elements seen so far
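The reservoir idea above can be sketched as follows. This is the classic single-item replacement scheme (often called Algorithm R), shown under the assumption that the stream is any Python iterable; the function name `reservoir_sample` and the `rng` parameter are illustrative, not taken from the talk.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # pick a slot in [0, i] and replace only if it lands inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Illustrative usage: sample 10 values from a stream of 100,000.
sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
print(sample)
```

The stream is read once and only k items are ever held, so the sample stays valid at every moment even if the stream never ends.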

  • Raw data -> Relevant dimensions -> Apply buckets on dimensions -> Filter data -> Aggregate data -> Create shapes -> Assign scales to shapes -> Render to screen

  • DISPLAY

    DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA

  • Solution Space

    AGGREGATE: One visual point represents multiple data points

    SAMPLE: Show only some of the dataset

  • THINK AGGREGATION

    Bar Chart -> Histogram

    Points on a Map, Scatterplot -> 2D Histogram, Heatmap

    Line Chart -> Approx Line Chart

    Parallel Coordinates -> Area Parallel Coordinates

    What about network diagram?
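As a rough illustration of replacing a scatterplot with a 2D histogram, the sketch below bins points into a fixed grid in a single pass, so each rendered cell stands for many data points. The function name `bin_2d`, the grid size, and the sample points are assumptions made for this example.

```python
from collections import Counter


def bin_2d(points, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid covering x_range and y_range."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        # Points outside the plotted area are skipped.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Map each coordinate to a cell index; the upper edge goes in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        row = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        counts[(col, row)] += 1
    return counts


# Illustrative data: five points, one of which lies outside the plotted area.
points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (1.0, 1.0), (2.0, 0.5)]
grid = bin_2d(points, (0.0, 1.0), (0.0, 1.0), nx=4, ny=4)
print(dict(grid))
```

The output size depends only on the grid resolution, not on the number of input points, which is why a heatmap can be drawn from a table with billions of rows.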

  • SAMPLING: You’ll never know it all

    TASKS

    Find extreme

    Compare bars

    Bar to constant

    Bar to range

    Order (top-K)

  • SAMPLING: Probabilistic Views

    “Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”

    Design Goals

    Easy to interpret

    Consistency across tasks

    Spatial Stability

    Minimize Visual Noise (overhead)

  • “Is Bar A > Bar B”
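When each bar is a mean estimated from a sample, the question "Is Bar A > Bar B" has a probabilistic answer rather than a yes or no. The sketch below is one simple way to estimate that probability, assuming the difference of the two sample means is approximately normal; it is an illustration and not necessarily the method used in the cited paper. The function name `prob_a_greater_than_b` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def prob_a_greater_than_b(sample_a, sample_b):
    """Approximate probability that the true mean of A exceeds that of B.

    Assumes each sample has at least two values and that the difference
    of sample means is roughly normally distributed.
    """
    diff = mean(sample_a) - mean(sample_b)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    if se == 0:
        # No variation in either sample: the comparison is deterministic.
        return 1.0 if diff > 0 else (0.0 if diff < 0 else 0.5)
    return NormalDist().cdf(diff / se)


# Illustrative samples behind two bars.
bar_a = [10.0, 11.0, 12.0, 13.0, 14.0]
bar_b = [1.0, 2.0, 3.0, 4.0, 5.0]
print(prob_a_greater_than_b(bar_a, bar_b))
```

As more of the data streams in, the standard error shrinks and the probability moves toward 0 or 1, which matches the idea of views that become more confident over time.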

  • Other Tasks

    Find extreme Compare to value Compare to Range

  • Design Problems

    DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA

  • 1 IS VERY DIFFERENT FROM 0

    When the Y axis goes all the way to very high values, it’s still very interesting to know which values are possible

  • STREAMING MEANS YOUR WORLD CAN CHANGE

    Categorical -> too many categories!

    Numerical -> changing bounds

    Any color map or scale can change

  • STREAMING, STORING, SENDING

    Implications for interaction, for updates

    Care a lot about changes that are server-side only vs client-only: change color, change height scale … vs change bucket size.

    [Research opportunity: what are the tradeoffs of different models?]

    Disk Data Aggregate Shapes Render Screen

    Network? (D3)

    Network? (Tableau Public)

    Network? (SVG)

  • Hard to Do Research

    This isn’t the way SQL works today

    You don’t want to stand up a Hadoop cluster yourself, and it’s a whole other skillset.

    You can approximate:
      ◦ repeat medium-sized data over and over?
      ◦ generate data based on a model?

  • Conclusion

    Design Constraints for Visualizing Big Data
      ◦ There’s Too Much to Process
      ◦ You’ll Never See It All
      ◦ The Rules Change
      ◦ Streaming Adds New Challenges
