Infographics and Visualisation
(or: Beyond the Pie Chart)
LSS: ITNPBD4, 6 and 10 Oct 2015
ITNPD4: Applications of Big Data 2
Overview
• Why infographics and visualisation
– What’s the problem we’re trying to solve?
– What makes for good infographics and visualisations?
– Where are we now in this area?
The problem
• Data analysis may tell you something about the structure of a problem
• Or may predict how to optimise something
– Profit, energy usage etc.
• BUT:
– In general you will have to convince someone
– And they may not be convinced by the numbers on their own
• They expect some sort of graphic that they can show to the Board/CEO to convince them
– A visualisation, perhaps an infographic.
Visualisation and infographics
• Visualisation is the generic name for displaying data
– May be a single image
– Or a movie, for example.
– “Visualizations help people see things that were not obvious to them before” (SAS website)
• There is also sonification, where data is sounded out: this works because our ears are very good at picking up patterns.
– E.g. Geiger counter, reversing systems in modern cars.
• Infographics are generally single images
– Providing a visualisation of a specific set of data.
Infographics
• An infographic is a picture that displays information in an accessible and/or informative way.
• Can be quite simple
• …or quite complex
…not a new idea (Minard, 1869)!
The standard text in this area is E. R. Tufte, “The visual display of quantitative information”
The infographic shows the troops and troop movements on the Eastern Front in World War 2.
Visualisation of low-dimensional datasets
• Low-dimensional datasets are often visualised as simple X/Y graphs: but even here there are issues
– For both X and Y axes:
• Offset (is the origin at 0?)
• Scale
• Linear or logarithmic?
• Continuous or broken axes.
– Graph lines:
• One or more than one?
• Line style: continuous, dashed, dotted…
• Line colour
• Symbols and/or lines?
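The linear-or-logarithmic choice can be made concrete with a small sketch. This is an illustration only, not taken from the slides: the function name `axis_position` and its parameters are hypothetical, and it simply maps a data value to a position along an axis of a given length.

```python
import math

def axis_position(value, lo, hi, length=100.0, log=False):
    """Map a data value to a position along an axis of the given length.

    `lo` and `hi` are the data values drawn at the two ends of the axis.
    All names here are illustrative assumptions, not a real plotting API.
    """
    if log:
        if value <= 0 or lo <= 0 or hi <= 0:
            raise ValueError("a logarithmic axis needs strictly positive values")
        value, lo, hi = math.log10(value), math.log10(lo), math.log10(hi)
    return length * (value - lo) / (hi - lo)

# Compare where the same values land on a linear and on a logarithmic axis.
for v in (1, 10, 100, 1000):
    linear = axis_position(v, 1, 1000)
    logarithmic = axis_position(v, 1, 1000, log=True)
    print(f"{v:5d}  linear: {linear:6.2f}  log: {logarithmic:6.2f}")
```

On the linear axis the values 1, 10 and 100 are crowded into the first tenth of the axis, while on the logarithmic axis each factor of ten takes the same distance. The same data can therefore look very different depending on this one choice.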
[Figure: four plots with different axis combinations: linear x and linear y; logarithmic x and linear y; linear x and logarithmic y; logarithmic x and logarithmic y. The x axis runs from 0 to 1000 (linear) or 10^0 to 10^3 (logarithmic); the y axis runs from -1 to 1 (linear) or 10^-4 to 10^1 (logarithmic).]
[Figure: two plots, each with the x axis from 0 to 1000 and the y axis from -1 to 1.]
Visualising high dimensional datasets
• This is harder: and can be where infographics come in
– Cannot do this directly.
• Can plot two or three dimensions directly, but not more
• Clever infographics can plot more dimensions, for example using geographical location, lines of varying thickness and colour, multiple symbols
– How can we show the structure of such datasets?
• When we can’t think of one-off target-domain clever tricks …
– Discuss earlier infographics
• Clearly depends on what we are trying to show!
– Geography as timeline, for example
– See also http://www.creativebloq.com/graphic-design-tips/great-infographic-design-tips-1232813
What can we do in general
• Let’s say that we don’t have any inspiration for designing a good infographic (!)
– Infographics often depend on specific factors
• E.g. dates, geographic distribution, …
• Can we find 2 or 3 (or even a few more) dimensions that …
– … in some sense
• …summarise (what we want to emphasise about) the dataset?
• Ways forward: projecting and clustering
Choosing dimensions and projecting data
• If the data is evenly spread throughout all the dimensions and has no structure?
– Give up. There’s nothing to be learned from it.
• Datasets that have something to tell us have some form of structure
• Maybe the data lie (largely) on a lower-dimensional subset of the high-dimensional space.
– As opposed to being spread evenly throughout the original space.
Example
• Say that we have 3-dimensional data, sampled over time
– Each point is (x,y,z,t): really 4-dimensional data
• and -1 <= x,y,z <= 1, 0 <= t <= 10 (the points (x,y,z) are inside a sphere, of radius 1, centered at the origin)
• Let’s also say that at each time t, sqrt(x^2 + y^2 + z^2) = t/10
– So that the points at time t are on the surface of a sphere of radius t/10
• Clearly, if we simply look at all the (x,y,z) points (ignoring t) they are spread throughout the sphere
– But not in an unstructured way
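A minimal sketch of this example, assuming the sample times are drawn uniformly from 0 to 10 and the direction of each point is uniformly random (the slides specify neither). The helper name `sample_point` and the seed are illustrative choices.

```python
import math
import random

def sample_point(t, rng):
    """Return (x, y, z, t) with (x, y, z) on the sphere of radius t/10."""
    r = t / 10.0
    # Normalising a 3-D Gaussian vector gives a uniformly random direction.
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0:
            break
    x, y, z = (r * c / norm for c in g)
    return x, y, z, t

rng = random.Random(0)  # fixed, arbitrary seed so that runs are repeatable
points = [sample_point(rng.uniform(0.0, 10.0), rng) for _ in range(1000)]

# Ignoring t, a single coordinate just looks scattered between -1 and 1 ...
xs = [p[0] for p in points]
print("x range:", min(xs), max(xs))

# ... but the distance from the origin recovers t/10: the hidden structure.
max_error = max(abs(math.sqrt(x * x + y * y + z * z) - t / 10.0)
                for x, y, z, t in points)
print("largest |radius - t/10|:", max_error)
```

The structure only appears once the right derived quantity (here the radius) is plotted against t. A scatter plot of x against y alone would hide it.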
Discovering structure in data
• There are many techniques for discovering structure
– Principal component analysis (PCA)
• Linearly projecting a high dimensional dataset on to a smaller number of dimensions
• In such a way that as much as possible of the variance in the data is contained in this smaller number of dimensions
• And the dimensions are orthogonal to each other
• Well-understood and commonly used technique for data dimension reduction
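As a rough illustration of the idea, the sketch below finds only the first principal component, using power iteration on the sample covariance matrix. This is one simple way to do it in plain Python, not necessarily how a statistics package computes PCA. The function name, the synthetic data and the seed are all assumptions made for the example.

```python
import math
import random

def first_principal_component(data, iterations=200):
    """Return (mean, direction, explained) for the first principal component.

    `explained` is the fraction of the total variance that lies along
    `direction`. Assumes the data are not all identical and that the
    starting vector is not orthogonal to the leading direction.
    """
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by cov converges to the
    # eigenvector with the largest eigenvalue (the direction of most variance).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    return mean, v, variance / total

# Synthetic 3-D data lying close to a single line through the origin.
rng = random.Random(1)
data = []
for _ in range(500):
    s = rng.gauss(0.0, 3.0)
    data.append([s + rng.gauss(0.0, 0.1),
                 2 * s + rng.gauss(0.0, 0.1),
                 -s + rng.gauss(0.0, 0.1)])

mean, direction, explained = first_principal_component(data)
# Project each 3-D point on to the single dimension found.
scores = [sum((row[j] - mean[j]) * direction[j] for j in range(3)) for row in data]
print("direction:", direction)
print("fraction of variance explained:", explained)
```

Here one dimension captures almost all of the variance, so the 3-dimensional dataset can be plotted as a single column of `scores` with very little loss.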
Independent components analysis
• Independent components analysis (ICA)
– “a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals” (Hyvärinen, U Helsinki)
• Essentially looking for dimensions that co-vary
• Finding ways of summarising points in the N-dimensional space using fewer than N values.
• Data is assumed to be a linear mixture of underlying latent variables
– These are assumed non-Gaussian, and mutually independent: independent components
• Related to PCA, but can find structure when PCA fails to do so
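The linear-mixture assumption is usually written as below. The symbols are not from the slides and are introduced here only for illustration: x is the observed data vector, s is the vector of latent independent components, A is the unknown mixing matrix, and W is the unmixing matrix that ICA estimates.

```latex
\mathbf{x} = A\,\mathbf{s},
\qquad
\hat{\mathbf{s}} = W\,\mathbf{x},
\qquad
W \approx A^{-1} \ \text{(up to the order and scale of the components)}
```

Only x is observed. Both A and s are unknown, which is why the extra assumptions (non-Gaussian, mutually independent components) are needed to make the problem solvable.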
Example: input
ICA output
Clustering data
• Often rather than projecting data on to other axes, it is better to look at how the data points are grouped
– The aim is to classify a large number of data vectors into a small number of manageable groups
• Does the data fall into clusters?
– How unevenly distributed is the data?
– Does it cluster in
• The original high-dimensional space
• In a lower-dimensional projected space?
How does clustering work?
• Techniques
– Partition or Hierarchical
Examples
Partition-based clustering
• Based on distance between vectors
– But which distance?
• Euclidean
• City-block?
• Weighted versions
• Chebyshev distance
• Forming clusters:
– Simple method:
• Start with each vector as a single-element cluster
• Identify two closest vectors and combine them into the same cluster.
• Keep doing this until the distance between the two closest vectors not in the same cluster is large.
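The distance measures and the simple merging method above can be sketched as follows. The slides leave “large” undefined, so the `threshold` parameter is an assumption, as are the function names and the particular weighting scheme. Merging the closest pair bottom-up in this way is often called single-linkage clustering.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def weighted_euclidean(a, b, weights):
    # One possible weighted version: a per-dimension weight on each squared term.
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def simple_clusters(vectors, threshold, distance=euclidean):
    """Merge the closest pair of vectors in different clusters until that
    closest distance exceeds `threshold` (a hypothetical stopping parameter)."""
    label = list(range(len(vectors)))  # each vector starts in its own cluster
    while True:
        best = None
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                if label[i] != label[j]:
                    d = distance(vectors[i], vectors[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
        if best is None or best[0] > threshold:
            break  # everything merged, or the closest pair is too far apart
        _, i, j = best
        old, new = label[j], label[i]
        label = [new if l == old else l for l in label]
    clusters = {}
    for index, l in enumerate(label):
        clusters.setdefault(l, []).append(vectors[index])
    return list(clusters.values())

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
for cluster in simple_clusters(points, threshold=3.0):
    print(cluster)
```

This version rescans every pair on each pass, so it is only suitable for small datasets. Passing `distance=city_block` or `distance=chebyshev` shows how the choice of measure can change the clusters found.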
Criticisms of clustering
• Clustering is descriptive, and not unique
– Actual clusters may depend on techniques used, as well as on the data
• Clustering techniques will always find clusters
– Even when there aren’t any!
– (This implies some measure for quality of clustering should be used)
• Clustering techniques depend strongly on the measures used
– There should ideally be some conceptual support of the measures used to calculate distances between vectors.
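The slides do not name a quality measure. One common choice is the silhouette coefficient, sketched here in plain Python with Euclidean distance as an illustration only. It assumes at least two non-empty clusters and that the points are not all identical.

```python
import math

def silhouette(clusters):
    """Mean silhouette coefficient for a list of clusters (lists of vectors).

    Values near 1 suggest tight, well-separated clusters. Values near 0 or
    below suggest that the grouping is not supported by the data.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for single-element clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(dist(p, q) for qi, q in enumerate(cluster)
                    if qi != pi) / (len(cluster) - 1)
            # b: mean distance to the members of the nearest other cluster
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for oi, other in enumerate(clusters) if oi != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

sensible = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
arbitrary = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
print("sensible grouping: ", silhouette(sensible))
print("arbitrary grouping:", silhouette(arbitrary))
```

The same four points score highly when grouped by proximity and negatively when grouped arbitrarily. This is the kind of check that helps to catch clusters that are an artefact of the technique and not of the data.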
Examples:
• Google News indexes
– Uses text to create topic clusters
• Title, article listings
• Used to discover multiple reports of same story
• Video clusters on YouTube
– Uses keywords, popularity, viewer engagement, user browsing history
– http://www.strutta.com/blog/six-degrees-of-youtube/
Infographics tools
• At its simplest, Excel has many facilities for creating infographics and visualisations.
– But it’s limited, and proprietary (though one can import comma separated values)
• Matlab? Not free! Good graphing tools
• Flot: jQuery and JavaScript based
• Google Chart API: free
– JavaScript based, browser output
• D3: JavaScript based, very powerful.