The School of Mathematics
Data Visualisation of Scottish Demographic Information
by
Graeme Taylor
Dissertation Presented for the Degree of
MSc in Operational Research
August 2014
Supervised by
Dr Belen Martin-Barragan and Dr Esther Roughsedge
Abstract
This project explores the use of interactive visualisations to augment the extensive data published by
the National Records of Scotland. Good visualisation can illustrate key trends in statistical data, increasing
impact and accessibility; great visualisation can go further, and enable us to identify and explore
unexpected connections. Data visualisations can therefore support operational research, but we
will see that producing them also entails solving problems of an OR flavour.
We survey the existing literature for principles of good design in presenting data visually; much of this
is aimed at hand-produced imagery for print, so we examine how it can be best used in the new context
of procedurally-generated, interactive visualisations for the web. In the first instance, we consider this
for chart types which have proven popular or successful for static visualisations, particularly if already
used by NRS.
This leads us to investigate more complicated data sets which can be interpreted as having a graph
theoretic structure. We will show how the constrained layout of networks of vertices with an associated
size can be posed as an optimisation problem, and develop a visualisation that operates under such
constraints. Further, we will consider the use of geographic clustering to represent migration flow,
describing and implementing a novel ‘re-wiring’ algorithm to generate tree structures that produce
better visualisations than standard agglomerative approaches.
Finally, we present a portfolio of visualisations created for NRS that follow the design principles identified
and make use of the software tools developed during the project.
Own Work Declaration
I declare that this thesis was composed by myself and that the work contained therein is my own,
except where explicitly stated otherwise in the text.
Edinburgh, August 18, 2014
Contents
1 Data Visualisation
1.1 Terminology
1.2 D3
1.3 Design principles
1.3.1 Graphical Integrity
1.3.2 Use of Colour
1.3.3 Transitions
2 Charts
2.1 Small Multiples
2.2 Tree Maps
2.3 Choropleth Maps
2.4 Frequency Plots
3 Graphs
3.1 Introduction
3.2 Graph Layout
3.3 Flow Maps
3.3.1 Flow graphs
3.3.2 Star graphs
3.3.3 Hierarchical clustering
3.3.4 Flow graph layout
3.3.5 Rewiring
3.4 Chord Diagrams
4 Portfolio
4.1 Migration Flow Map
4.2 The Cause of death explorer
4.3 Cause of Death Treemap
4.4 Distributions with cohort effects: Fertility Data
5 Conclusions
Bibliography
Appendices
A Guide to Electronic Appendices
A.1 Flow Map Construction
A.2 Cause of Death Explorer
A.3 Cause of Death Zoomable Treemap
A.4 Fertility Data (cohort effects)
A.5 Popular Names
A.6 Life Expectancy
A.7 Gender distribution by age (Frequency plot)
A.8 Migration within Scotland (Chord Diagram)
List of Figures
1.1 The 2011 State of the Union Address
1.2 If Bush Tax Cuts Expire
1.3 A misleading rainbow.
2.1 Air pollution in Southern California.
2.2 Slice-and-dice treemap of Scottish population data.
2.3 Squarified treemap of Scottish population data.
2.4 Example of tile placement in the squarified algorithm.
2.5 The Singing Mondrian.
2.6 Choropleth map of migration to Scotland.
2.7 Number of males and females per 100 centenarians, Scotland 2012.
2.8 Possible outcomes of breast cancer screening.
3.1 Four views of the Petersen Graph.
3.2 Non-planar and planar diagrams for the same graph.
3.3 Radius versus bounding box packing of circles.
3.4 Minard’s map of exports of French wine in 1864.
3.5 Minard’s 1861 visualisation of Napoleon’s Russian campaign of 1812.
3.6 Selecting branch point location to minimise chart ink.
3.7 Computer generated flow maps of migration from California 1995-2000
3.8 Star-graph flow map of migration to Scotland.
3.9 Bounding-box dendrogram for sources of migration to Scotland.
3.10 Algorithmically-generated flow map of migration to Scotland
3.11 User-adjusted flow map of migration to Scotland.
3.12 Re-assigning root for sources of migration to Scotland.
3.13 Visualizing information flow in science.
3.14 Chord diagram of migration within Scotland 2011-2012.
3.15 Chord diagram of migration between Scottish councils and the rest of the UK 2011-2012.
3.16 Chord diagrams with internal flow.
4.1 User-adjusted rewired flow map of migration to Scotland.
4.2 Cause of death data
4.3 The Cause of Death Explorer.
4.4 Treemaps of cause of death data.
4.5 Live births per 1,000 women, by age, selected years.
4.6 Interactive visualisation of fertility data.
Chapter 1
Data Visualisation
We live in an era of big data; in a typical minute, a hundred hours of footage are added to YouTube [44],
whilst over 200 million emails are exchanged [37]. But data is not the same as information, and only
after processing to make it meaningful to an intended audience can it qualify as the latter. Even then,
much of it simply passes us by. The World Bank makes tens of thousands of its reports - often produced
with the goal of influencing government policy or public debate - available online, but estimates that
nearly a third have never been downloaded [10].
McCandless, in the introduction to Information is Beautiful [23], describes being “swamped by information
[and] searching for a better way to see it all and understand it”. The key phrase, perhaps, being
‘to see’; as he did, we will consider ways in which visualisation of data can both direct our attention to
interesting content, and help us understand it better.
We will focus on data produced and maintained by the National Records of Scotland. Formed out of
the merger of the General Register Office and the National Archives, the NRS “plays a central role in
the cultural, social and economic life of Scotland, supporting several of the Scottish Government’s key
National Outcomes and measuring its Population Purpose Target” [28]. In particular, it performs “the
registration and statistical functions of the Registrar General of Scotland, including responsibility for
demographic statistics and census”; it is this demographic data from the GRO that we will draw upon.
This report is structured as follows. In the remainder of this chapter we fix terminology for some key
concepts; introduce and justify use of the Javascript library D3 which will be used for creating visualisations;
and identify a series of guiding principles for good design. In Chapter 2 we consider a variety
of popular ‘static’ visualisation techniques, and how they can be translated to an interactive context.
This will already require us to consider problems of mathematical optimisation to satisfy our design
goals. In Chapter 3, this interplay between design and mathematics is pushed further, as we examine
how data sets can be interpreted as graphs. We will pose graph layout as an optimisation problem,
and extend this to create ‘flow maps’ of tree structures. These are created from a clustering and ‘re-wiring’
process we developed and implemented for this visualisation task. Combining the ideas of
earlier chapters, a selection of visualisations developed from NRS data sets is presented in Chapter 4,
with further examples in the electronic appendices. We conclude with a summary of our methodology
and the project outcomes in Chapter 5.
1.1 Terminology
Data visualisation is multi-disciplinary in nature, drawing upon (amongst others) mathematics, statistics,
computer science, art and design. But this can lead to competing notation and terminology, so
for convenience we will fix a few definitions here. In particular, by graph we will mean a collection of
vertices linked by edges, rather than plots such as time series; the latter being considered an example
of a chart. Similarly, we will use graphical to describe methods related to the use of graphs in understanding
data, rather than visual depictions in general, and avoid reference to computer graphics
entirely.
By a data visualisation we will mean a representation of a data set as an image, generated automatically
by algorithm. This is in contrast to images produced ‘by hand’, and thus requiring manual intervention
by the designer to update for new data values (sometimes described as infographics). We will describe
a data visualisation as static if the end result is an image, and as interactive if the end result is software,
with features that allow the user to manipulate the image and explore the data further.
For a fixed data set, an interactive visualisation may effectively consist of a great many possible static
visualisations of lower-dimensional subsets. The user selects from these by choosing certain control
parameters; we will describe this as taking a slice of the data set through the corresponding point in
parameter space. We may also animate motion along a control dimension by iteratively presenting
slices; in effect, time becomes one of the display dimensions of our data visualisation, and we can think
of the individual static visualisations as frames of a movie. This is a powerful advantage of interactive
visualisations over static ones; however, care must be taken to establish whether there might be a better
static visualisation capable of capturing ‘the big picture’ all at once.
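As a minimal sketch of this slicing idea, suppose the data set is held as an array of records. The record layout, the sample values and the names `sliceAt`, `controlValues` and `frames` are all illustrative assumptions rather than part of the project code.

```javascript
// Hypothetical records: one row per (year, region) pair; the values are invented.
const records = [
  { year: 2011, region: "North", population: 120 },
  { year: 2011, region: "South", population: 340 },
  { year: 2012, region: "North", population: 125 },
  { year: 2012, region: "South", population: 338 },
];

// Take the slice of the data set through one point of a control dimension.
function sliceAt(data, key, value) {
  return data.filter((d) => d[key] === value);
}

// List the distinct values of a numeric control dimension in ascending order.
function controlValues(data, key) {
  return [...new Set(data.map((d) => d[key]))].sort((a, b) => a - b);
}

// Presenting the slices in sequence gives the "frames of a movie".
const frames = controlValues(records, "year").map((year) => ({
  year,
  rows: sliceAt(records, "year", year),
}));
```

Each entry of `frames` is a lower-dimensional data set that could be drawn as one static visualisation; slicing on `region` instead would fix the other control parameter.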
1.2 D3
Data-Driven Documents, introduced in [4] and almost always referred to simply as D3, is a Javascript
library for creating data visualisations (static or interactive) that can be accessed with a standard web
browser. Moreover, it does so by building upon the existing frameworks for modern web content. A
well-designed web page draws a distinction between content (tagged using HTML) and style (deter-
mined by CSS); interactivity is typically enabled through use of Javascript. These technologies and
others can be interwoven thanks to a shared representation of the resulting page, the Document Object
Model (DOM). So, for instance, an HTML fragment may specify that there should be a particular
line of ‘header’ text; the CSS will specify what a header looks like; and Javascript can then monitor
this object to trigger an action when the user clicks on it. D3 extends this by allowing existing objects
to be bound to data values drawn from elsewhere, with new objects created or old ones destroyed as
the data set changes. Object styling can then be determined by the associated values, which can be
updated based on user actions.
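The bookkeeping behind this binding can be sketched in plain Javascript. This is an illustration of the idea only, not D3's implementation or API: the function `joinByKey` and the record fields are hypothetical. D3 refers to the three resulting groups as the enter, update and exit selections.

```javascript
// Compare the objects currently on the page with a new data set, matching by key.
// The result says which objects must be created, restyled or destroyed.
function joinByKey(existing, incoming, key) {
  const oldKeys = new Set(existing.map((d) => d[key]));
  const newKeys = new Set(incoming.map((d) => d[key]));
  return {
    enter: incoming.filter((d) => !oldKeys.has(d[key])), // new objects to create
    update: incoming.filter((d) => oldKeys.has(d[key])), // existing objects to restyle
    exit: existing.filter((d) => !newKeys.has(d[key])), // old objects to destroy
  };
}

// Invented example: item "a" disappears, "b" changes value, "c" is new.
const onPage = [{ id: "a", value: 1 }, { id: "b", value: 2 }];
const latest = [{ id: "b", value: 5 }, { id: "c", value: 3 }];
const changes = joinByKey(onPage, latest, "id");
```

Matching on a key rather than on array position means an object keeps its identity when the data set is reordered, which matters once changes are animated.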
By building D3 visualisations we ensure access to the existing audience for online content currently
produced by NRS. Visualisations in earlier web-based systems such as Prefuse (2005) or Flare (2007)
were developed in other languages, and required plug-ins to be rendered on the page (Java and Flash
respectively). This requirement is detrimental to accessibility, particularly in the long term; plug-in
support can erode due to security concerns (by default, a modern Java installation will block access to
applets that would have been treated as safe just a few years ago), platform incompatibility (Flash is
not supported by Apple iOS devices, for instance) or simply obsolescence (NRS data sets may remain of
interest for decades, so data should not be locked up in proprietary formats).
A potential concern is that working in D3 is more difficult than systems such as Flare, which can offer
a higher level of abstraction. We will not attempt to give a tutorial on D3 here, as the best approach
will likely depend on background. With programming experience but only passing familiarity with
HTML/CSS and no experience of Javascript, the guides [22] and [26] were a useful introduction. From
there, it was easy to modify or adapt to new data the wealth of examples available online¹ before finally
constructing new work from scratch. We may hope for some of our own visualisations to be similarly
instructive, and thus provide a useful basis for future work at NRS. There are also extensions to D3
- most notably C3 [36] - that provide common charts, thus avoiding the need to reinvent the wheel
entirely for standard visualisation tasks.
¹ Particularly those by D3 author Mike Bostock at http://bl.ocks.org/mbostock; or, for specific requests, the community at http://stackoverflow.com.
Moreover, if a D3-based page is designed appropriately then any one of the data, other content, styling,
interactive components or visualisation behaviour can be updated or modified by NRS staff even if
they are not familiar with the technologies involved in the others. For instance, a designer can restyle
the page without needing to learn D3; whereas a plug-in based visualisation (which is a single opaque
object to the DOM) must be rewritten in the appropriate language. Further, a statistician can update
to the latest figures without needing to know about any aspect of web development, by providing an
appropriately-formatted new data file. In this way some outputs of the project may remain of use
beyond its completion, by providing templates for visualising certain types of data rather than one-off
instances for the 2014 figures.
1.3 Design principles
Any definition of ‘good’ visualisation will inevitably include some component of personal taste, as well
as being dependent on the context it is presented in and the audience it is intended for. Nonetheless,
we can attempt to outline some general design principles - and we can often identify features that
make a visualisation ‘bad’!
The canonical references on this matter are Tufte’s two works [41] and [40]. Disdainful of the influence
of designers in what he sees as a statistical field, Tufte takes an almost entirely practical view,
with visual appeal a potential side-effect but never the goal: “Occasionally artfulness of design makes
a graphic worthy of the Museum of Modern Art, but essentially statistical graphics are instruments to
help people reason about quantitative information” [41]. In his foreword to Lima’s work [21], Manovich
takes a broader view: “The space defined by the disciplines of science, design, or art [...] contain lots of
possibilities. A given visualization project can be situated anywhere in this space, depending on what it
privileges” and goes on to argue that the best examples “manage to combine all three” aspects of this
space. [21] pulls its examples from the online collection Visual Complexity, and this name is telling in
comparison to Tufte’s titles; many of the works presented are striking representations of almost overwhelming
complexity, but as such can only offer high level insights rather than serving as tools for
detailed quantitative analysis.
A point of agreement, however, is that aesthetic appeal should be derived from the data itself, not
decoration. Tufte distinguishes between “data ink” - that which would reduce information content if
deleted - and “chartjunk”, added for artistic reasons and superfluous to understanding the data. He
argues that “data graphics [...] stand or fall on their content, gracefully displayed. Graphics do not
become attractive and interesting through the addition of ornamental hatching and false perspective to
a few bars. Chartjunk can turn bores into disasters, but it can never rescue a thin data set. The best
designs [...] are intriguing and curiosity-provoking.” [41]. Similarly, McCandless describes his projects
in [23] as “A series of experiments in making information approachable and beautiful”, motivated by
the question “can a book with the minimum of text, crammed with diagrams, maps and charts, still be
exciting and readable?”
In line with these goals, a visualisation should make the viewer’s task in interpreting the data easier,
not harder: Tufte warns against the creation of “puzzle graphics” that have to be decoded through a
verbal train of thought rather than being visually self-evident. A sure sign of a bad visualisation is that
it is harder to work with than the original tables of data!
1.3.1 Graphical Integrity
Worse, though, is a visualisation that is not difficult to interpret, but easy to misinterpret - that is,
misrepresents the underlying data. In translating to visual form, the medical principle of first, do no
harm should be followed. Tufte devotes the second chapter of [41] to this issue of graphical integrity,
identifying various general principles; we highlight two in particular.
• The representation of numbers, as physically measured on the surface of the graphic itself,
should be directly proportional to the numerical quantities represented.
• Clear, detailed and thorough labeling should be used to defeat graphical distortion and ambiguity.
Write out explanations of the data on the graphic itself. Label important events in the
data.
In many visualisations, a set of data items may be represented by simple two dimensional shapes or
icons; when each item has an associated size, it is natural to scale these representations accordingly.
However, a common mistake is to multiply both dimensions by the desired scale factor s (rather than
a single one) in order to preserve the aspect ratio, with the end result that the area is rescaled by a factor of s².
Deliberately or not, this simple error can be made at even the highest levels with straightforward data
sets, as Figure 1.1 shows.
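The arithmetic can be checked directly. In the sketch below (function names and the reference radius are our own illustrative choices), scaling the radius of a circle by the square root of the scale factor keeps the area proportional to the data value, whereas scaling the radius by the factor itself inflates the area quadratically.

```javascript
// Radius of a circle whose AREA is proportional to the data value.
// A value equal to referenceValue is drawn with radius referenceRadius.
function radiusForValue(value, referenceValue, referenceRadius) {
  return referenceRadius * Math.sqrt(value / referenceValue);
}

// The mistaken version: scaling the radius directly by the ratio of values.
function naiveRadius(value, referenceValue, referenceRadius) {
  return referenceRadius * (value / referenceValue);
}

const area = (r) => Math.PI * r * r;

const s = 4; // a data item four times the size of the reference item
const correctRatio = area(radiusForValue(s, 1, 10)) / area(10); // close to s
const naiveRatio = area(naiveRadius(s, 1, 10)) / area(10); // close to s squared
```

With s = 4, the naive circle covers sixteen times the reference area, so the visual impression overstates the data by a further factor of four.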
This issue is further complicated by variations in perception: Tufte notes that “the perceived area of a
circle probably grows somewhat more slowly than the actual (physical, measured) area [...]; perceptions
change with experience; and perceptions are context-dependent.” The second principle above seeks to address
this, arguing that visual depictions should be supplemented with numerical values where possible. We
note that for web-based visualisations there is almost limitless potential for this through the use of tooltips:
additional content that appears only when the mouse hovers over an element. These can be
incorporated into otherwise static visualisations, and allow access to full data values without needing
to make a trade-off against ease of reading due to clutter as in visualisations for print. However, labeling
alone cannot rescue poor design, if the visual perception overwhelms the message of the numerical
values (as illustrated in Figure 1.2).
Figure 1.1: The 2011 State of the Union Address
(Capture from video [43]; the official annotation in blue erroneously scales the radius of each circle in
proportion to GDP. The green circles, which correctly scale the area, have been added by the author
and tell a rather different story).
Figure 1.2: If Bush Tax Cuts Expire
(Left: Fox Business, Cavuto via Media Matters [24], presents the change in top tax rate - from 35% to
39.6% - with accurate labelling but misleading relative areas due to a truncated axis. Right: a more
conventional presentation by Media Matters, with axis starting at zero.)
1.3.2 Use of Colour
Tufte offers considerable guidance on the use of colour in Chapter 5 of [40], as usual contrasting its
power when used well against the calamities that can occur when inexpertly deployed. He notes a
number of “fundamental uses of color in information design: to label (color as noun), to measure (color
as quantity), to represent or imitate reality (color as representation), and to enliven or decorate (color as
beauty)”.
Labeling by colour is a useful way to indicate a categorical data dimension, avoiding the clutter or
confusion of textual or numerical labels and providing a quick way to visually group related items.
This can work particularly well if the colours chosen are also representative - Tufte gives examples from
cartography, where there are natural interpretations of greens and blues - but this is also a hazard, if
such associations would be misleading, or simply clash with the data key.
A further complication arises in interactive visualisations, as we may not know which elements are currently
on-screen, or (when designing general templates) how many categories (and thus colours) there
may even be. The approach in D3 is to once again divorce style from data; a selection of carefully curated
palettes (of ten to twenty colours) is available as functions which map integer values to pleasant
colour choices. Additional palettes can be supplied as simple lists² or functions of data values. Thus
abstract categories can be specified at the data level, then styled as appropriate given the state of the
visualisation (such as the depth of a hierarchy explored). For indicating that various data items are
related but not identical, [12] suggests colouring each with a small perturbation of a base colour (such
as that given by the palette); we implement Javascript code for doing so in the visualisation described
in Section 4.3.
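To illustrate the two ideas in this paragraph, the sketch below maps category indices to a palette supplied as a list, and derives related colours by nudging a base colour. It is neither the scheme of [12] nor the code used in Section 4.3: the palette values, the function names and the choice of a uniform shift in RGB space are all assumptions made for illustration.

```javascript
// A palette supplied as a simple list of colours (arbitrary example values).
const palette = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a"];

// Map a category index to a colour, wrapping around when there are
// more categories than colours.
function colourFor(index) {
  return palette[index % palette.length];
}

// Parse "#rrggbb" into an array [r, g, b] of integers in 0..255.
function parseHex(hex) {
  return [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16));
}

// Format [r, g, b] as "#rrggbb", rounding and clamping each channel to 0..255.
function toHex(rgb) {
  const channel = (c) =>
    Math.min(255, Math.max(0, Math.round(c))).toString(16).padStart(2, "0");
  return "#" + rgb.map(channel).join("");
}

// Colour a related item with a small perturbation of a shared base colour.
// offset lies in [-1, 1]; strength is the largest shift applied per channel.
function perturb(baseHex, offset, strength = 24) {
  return toHex(parseHex(baseHex).map((c) => c + offset * strength));
}
```

For example, the children of one category could be coloured with `perturb(colourFor(2), offset)` for a few offsets spread across [-1, 1], so that siblings are distinguishable but visibly related.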
Care is also required when colour is used to measure ordinal data, particularly when the underlying
quantity is continuous. Figure 1.3 demonstrates the risks: it seems to imply a sharp divide between the
eastern and western halves of the US, with a boundary line running between the green and yellow regions.
But closer inspection of the legend shows that these colours correspond to adjacent bands; the
data values could vary smoothly from east to west, but the rainbow colour scheme does not. Such
a scheme remains lamentably popular despite the many limitations outlined in [3]: “The rainbow
color map confuses viewers through its lack of perceptual ordering, obscures data through its uncontrolled
luminance variation, and actively misleads interpretation through the introduction of non-data-dependent
gradients.” This last remark explains the failure of Figure 1.3; the abrupt changes between
hues are far more noticeable than the variations within each band, and misleadingly suggest a discontinuity
in the data, too. The authors of [3] note the gray scale as being perceptually ordered: “Increasing luminance
from black to white is a strong perceptual cue that indicates values mapped to darker shades of gray are
lower in value than values mapped to lighter shades of gray. This mapping is natural and intuitive.”
We can combine this useful feature of the gray scale with a chosen colour (for aesthetic reasons or to
indicate quantity simultaneously with category) in D3 by adjusting the alpha (or transparency) value.
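A minimal sketch of such an alpha mapping follows; the function names, the linear mapping and the minimum opacity of 0.15 are our own assumptions. The category fixes the hue, and the data value fixes the opacity, written as a standard CSS `rgba()` colour string.

```javascript
// Map a data value in [min, max] linearly to an opacity in [minAlpha, 1].
// Values outside the range are clamped to the nearest end.
function alphaFor(value, min, max, minAlpha = 0.15) {
  if (max === min) return 1; // degenerate range: draw fully opaque
  const t = Math.min(1, Math.max(0, (value - min) / (max - min)));
  return minAlpha + t * (1 - minAlpha);
}

// Build a CSS colour string combining category (the rgb triple) with quantity (alpha).
function shade(rgb, value, min, max) {
  const [r, g, b] = rgb;
  return `rgba(${r}, ${g}, ${b}, ${alphaFor(value, min, max).toFixed(2)})`;
}
```

On a light background, lower values then appear paler and higher values more saturated; keeping the minimum opacity above zero ensures that the smallest items remain visible.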
² We note the usefulness of [5], particularly for generating palettes that are accessible for colourblind users.
Figure 1.3: A misleading rainbow.
(Figure 13, Estimated fraction of precipitation lost to evapotranspiration 1971-2000, of [34]).
1.3.3 Transitions
To a large extent, good design of interactive visualisations can be inferred from the principles for the
static case. However, there is a further aspect to be considered, which is the transitions between each
state. Although not specific to visualisation, user interface guides such as [16] provide useful guidance.
If we wish to associate the size or colour of an object with the magnitude of the data item it represents,
then this association can be strengthened or weakened depending on how it is portrayed
when in motion. Larger, “heavier” objects should be less prone to disturbances, move slower, and both
accelerate and decelerate at a lower rate than smaller ones. Even without such an association, objects
should still move in physically authentic ways: from [16] we note that “animation with abrupt starts
and stops or rapid changes in direction appears unnatural and can be an unexpected and unpleasant
disruption for the user [...] transitioning between two visual states should be smooth, appear effortless,
and above all, provide clarity to the user, not confusion”.
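One common way to avoid abrupt starts and stops is an easing function, which converts the elapsed fraction of a transition into the fraction of the distance covered. The sketch below uses the standard cubic ease-in-out curve; the rule in `durationFor` that larger objects take longer, and its square-root form, are purely illustrative assumptions and are not drawn from [16].

```javascript
// Cubic ease-in-out: maps elapsed fraction t in [0, 1] to progress in [0, 1],
// starting and finishing with zero velocity.
function easeInOutCubic(t) {
  return t < 0.5 ? 4 * t * t * t : 1 - Math.pow(-2 * t + 2, 3) / 2;
}

// Position of an object moving from `from` to `to`, `elapsed` ms into a
// transition lasting `duration` ms.
function positionAt(from, to, elapsed, duration) {
  const t = Math.min(1, Math.max(0, elapsed / duration));
  return from + (to - from) * easeInOutCubic(t);
}

// Hypothetical rule giving "heavier" objects longer, slower transitions.
function durationFor(size, referenceSize, baseDuration) {
  return baseDuration * Math.sqrt(size / referenceSize);
}
```

A quarter of the way through a transition the eased object has covered only one sixteenth of the distance, so the motion builds up gradually rather than beginning at full speed.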
Ideally, we wish to minimize the disruption to the user’s “mental map” by ensuring that visual changes
only occur to the extent required by changes in the underlying data. In particular, sudden appearance
or disappearance is not physically realistic; [16] suggests “When an object enters the frame, ensure it’s
moving at its peak velocity. [...] Similarly, when an object exits the frame, have it maintain its velocity,
rather than slowing down as it exits the frame. Easing in when entering and slowing down when exiting
draw the user’s attention to that motion.”
Much of the heavy lifting is handled in D3 by the inclusion of transitions: when the state of an object
is to be updated, the timescale for this change to occur can be specified and properties such as size,
colour and location will be smoothly interpolated from the start to end state. By automatically triggering
such updates just as the previous transition ends, we can thus create a continuous animation along
a control parameter dimension merely by specifying a discrete set of data slices.
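The principle can be sketched independently of D3, which performs this interpolation itself. In the illustration below (the names `lerp` and `stateAt` and the sample slices are assumptions, and only numeric properties are handled), a continuous position u along the control dimension is turned into a displayed state by linear interpolation between the two neighbouring slices.

```javascript
// Linear interpolation between two numbers.
const lerp = (a, b, t) => a + (b - a) * t;

// slices: one object of numeric properties per value of the control parameter.
// u: continuous position along the control dimension, where 0 is the first
// slice and slices.length - 1 is the last.
function stateAt(slices, u) {
  if (slices.length === 1) return { ...slices[0] };
  const clamped = Math.min(slices.length - 1, Math.max(0, u));
  const i = Math.min(slices.length - 2, Math.floor(clamped)); // slice we start from
  const t = clamped - i; // fraction of the way towards slice i + 1
  const state = {};
  for (const key of Object.keys(slices[i])) {
    state[key] = lerp(slices[i][key], slices[i + 1][key], t);
  }
  return state;
}

// Invented slices: the position and radius of one circle at three parameter values.
const circleSlices = [
  { x: 0, r: 10 },
  { x: 100, r: 20 },
  { x: 50, r: 20 },
];
const halfway = stateAt(circleSlices, 0.5);
```

Advancing u steadily from 0 to the last index reproduces the effect of chaining one transition onto the next.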
Chapter 2
Charts
In this chapter, we will examine a variety of popular visualisation techniques. We do so with a view to
identifying best practice for their use in line with the principles set out in the previous chapter; noting
any particular strengths or weaknesses of each. We also consider how they might be adapted to the
context of interactive visualisation on the web.
2.1 Small Multiples
Figure 2.1: Air pollution in Southern California.
From [41], work of G. McRae, California Institute of Technology via Los Angeles Times July 22, 1979.
As discussed in Section 1.1, we can ‘slice’ a high dimensional data set along a control dimension such
as time, giving a lower dimensional set at each t value that is hopefully easier to visualise. By varying t
a series of slices is obtained, which, like the frames of a movie, we can present sequentially to recover
some insight into how the data evolves through time. Such a sequence of static images is a simple
example of a “small multiple” - a series of structurally similar charts allowing for easy comparison along
the changing parameter. There is no reason why this parameter has to be time, as we can slice through
any dimension of the data set; moreover, we can take a two-dimensional slice by fixing the values of
two parameters and presenting the resulting visualisation as one cell of a grid. Figure 2.1 gives an
example of this grid structure, with different columns corresponding to time values, and different rows
to a categorical variable, the pollutant type. Each pair of time value and pollutant type specifies a data
set for a particular cell, but the method of visualisation (including features such as the range of axes) is
consistent across all cells.
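The grid structure described here can be sketched as follows. The record layout, the sample values and the function name `smallMultiples` are invented for illustration. Each cell receives the records for one (row, column) pair, while a single value range is computed once and shared so that every cell can be drawn against the same axis.

```javascript
// Invented readings: a categorical variable for rows, a time value for columns.
const readings = [
  { category: "A", time: 1, level: 3 },
  { category: "A", time: 2, level: 9 },
  { category: "B", time: 1, level: 1 },
  { category: "B", time: 2, level: 4 },
  { category: "B", time: 2, level: 6 },
];

// Arrange records into a grid of cells and compute one shared value range.
function smallMultiples(data, rowKey, colKey, valueKey) {
  const rows = [...new Set(data.map((d) => d[rowKey]))];
  const cols = [...new Set(data.map((d) => d[colKey]))];
  const values = data.map((d) => d[valueKey]);
  const extent = [Math.min(...values), Math.max(...values)]; // same axis range for every cell
  const cells = rows.map((r) =>
    cols.map((c) => data.filter((d) => d[rowKey] === r && d[colKey] === c))
  );
  return { rows, cols, extent, cells };
}

const grid = smallMultiples(readings, "category", "time", "level");
```

Computing `extent` over the whole data set, rather than per cell, is what allows the eye to compare magnitudes across the grid.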
Tufte strongly endorses the small multiple, devoting entire chapters of both [41] and [40] to them;
from the latter, he argues that “at the heart of quantitative reasoning is a single question: Compared
to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing
comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range
of problems in data presentation, small multiples are the best design solution.”
It is important to stress the difference between a small multiple and a series of full-scale versions of the
individual cells, arranged over multiple pages. Presentation in a grid ensures that “Information slices
are positioned within the eyespan, so that viewers make comparisons at a glance - uninterrupted visual
reasoning” [40]. As well as enabling this simultaneous view of the full data set we can, by restricting
our attention to the panels of a single row or column, assess how the data varies in that direction;
or, by comparing rows (or columns) to each other, how that variation itself varies with changes in the
other control parameter. Any sequential presentation naturally suppresses comparison along one of
the directions, and these higher level comparisons become impossible. Moreover, our ability to make
comparisons even along the favoured dimension is impaired by the lack of permanence - if we were
to present the twelve panels of Figure 2.1 over several pages, we might be able to compare consecutive
panels more effectively - but it is unlikely that we could recall the specifics of a panel from several pages
back.
These may seem like concerns that only apply in print - and an obvious criticism of small multiples is
that they are, well, small, even if we restrict our control parameters to taking just a handful of values.
Prizing data density, Tufte sees no problem with “illustrations of postage-stamp size” [40], but with
the luxury of limitless ‘pages’ in building web visualisations, it is tempting to scale up. We can treat
the control parameters in a more literal sense, and present a full-screen view of the current slice and
update it as the user alters the parameters, implicitly navigating around the larger grid. However, care
must be taken that we do not lose the benefits of comparison at a glance; animation can help with this.
2.2 Tree Maps
In a tree map, we seek to tile a fixed area so that the size of each tile is proportional to the data item
it represents. The original application by Shneiderman was to visualising hard disk usage across a
file system [18], with the name being chosen to convey “the notion of turning a tree into a planar
space-filling map” [35]. Hierarchical structures with cumulative sizes such as directory structures are
particularly suited to tree mapping due to the natural recursion - once a tile has been assigned to a
directory, it can itself be sub-tiled to analyse its contents. The appropriate depth to be explored in this
way depends largely on user experience; Shneiderman notes that whilst “we were impressed to examine
thousands of nodes at 5-7 levels at once on the screen, [...] novices did better seeing 20-50 nodes at 1-3
levels.” [35].
As for other visualisations we will consider, there are competing measures, both mathematical and
aesthetic, of the quality of a treemap. For datasets where the categories have a natural order, we may
wish to preserve this through adjacency of tiles. An easy way to achieve this is illustrated in Figure 2.2,
which shows estimated Scottish population counts in mid-2013 by age groups five years wide, older
groups being denoted by greater opacity (in line with our observations in Section 1.3.2).
This (extremely simple to produce) treemap also scores highly for stability: in an interactive visualisa-
tion where each treemap corresponds to a slice, we may wish to resize tiles as the control parameter
varies, but without substantial changes to their position (which would disrupt the user’s mental map).
For this presentation, we can smoothly resize each column whilst preserving its place in the ordering.
Figure 2.2: Slice-and-dice treemap of Scottish population data.
However, it is a poor choice when considering the aspect ratio of the tiles - that is,
$\max\left(\frac{\text{height}}{\text{width}}, \frac{\text{width}}{\text{height}}\right)$
- as the fixed height necessarily forces tiles for small data items to be very thin. Indeed, many would
not even recognise it as a treemap! This is not purely an aesthetic consideration, with various practi-
cal advantages of nearly-square tiles being identified in [6]. For instance, since squares minimize the
perimeter for a given area, display space is more efficiently used when tile borders are required; thin
rectangles can cause aliasing errors in print and are hard to select / label in interactive visualisations;
and comparison between rectangles is easier when they have similar aspect ratios. Conversely, such
‘squarified’ treemaps tend to have low stability, as well as disrupting ordered data sets - see Figure
2.3, which presents the same data set as Figure 2.2. Finding rectangular tessellations with aspect ratio
as close to 1 as possible is also an NP-hard problem, although [6] presents an approximate algorithm
which (empirically) behaves well in practice. This is available in D3, and was used to produce Figure
2.3. We will describe this algorithm by walking through their example (reproduced as Figure 2.4) in
greater detail.
Figure 2.3: Squarified treemap of Scottish population data.
Example 2.2.1. Suppose we wish to pack tiles of area 6,6,4,3,2,2,1 in a rectangle R of width 6 and height
4. We place the tiles with largest area first.
As we progress, we will finalise the placement of some tiles which we consider to be ‘locked’; the unlocked
remainder of R is our working space R′; this will always be a rectangle, and initially is all of R. Within
this, we will have chosen an ‘active side’, and will be forming a stack S of unlocked tiles along that side.
For instance, if the active side is vertical, then we will be stacking tiles into rows, and as more tiles are
added, the width of the stack must necessarily grow. This is the case in steps 1, 2 and 3 - we have chosen
the left vertical as active side, and the first tile placed (area 6) thus gives us a stack of width 1.5. Adding
a second area 6 tile to the stack gives it width 3; and the third tile, of area 4, takes it to width 4.
However, when adding each tile to the stack, we note the aspect ratio that arises as a result; by placing
successively smaller tiles, the worst in the stack is always that of the newly-placed tile. For instance, adding
the area 4 tile in step 3 gives it an aspect ratio of 4/1. Instead of placing it in the current stack, we could
start a new one in R′\S with this tile, as shown in step 4. This has a preferable aspect ratio of 9/4,
so the stack with two area 6 tiles is locked; R′\S becomes our new R′, and we start a new stack S with the
area 4 tile. The choice of tiling direction is determined by the aspect ratio of R′: we pick a vertical active
side if R′ is wider than it is tall.
Thus in steps 4 to 6, with a new working space R′ of width 3 and height 4, we place tiles along the
horizontal edge to form a stack of varying height. Introducing the area 3 tile gives it an acceptable aspect
ratio, so we do so (step 5), but continuing in this way with the area 2 tile in step 6 is rejected. Instead we
lock the area 4 and 3 tiles, shrink our working space, and start a new stack with the area 2 tile (step 7).
This we initially assign a vertical active side, but on placing the next tile (step 8) we find that the aspect
ratio is worse than locking the first area 2 tile and starting a new stack with the second one (step 9; note
that locking a single tile stack is effectively the same as swapping the orientation of the active side). Now
there is no choice in placing the final tile, and we are done (step 10).
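The greedy rule walked through above can be sketched in a few lines of Python. This is only an illustrative reading of the example, not the implementation in [6] or in D3: the function names (`worst_ratio`, `lock`, `squarify`) and the tuple layout are our own, and the sketch assumes that the areas sum exactly to the area of the enclosing rectangle.

```python
def worst_ratio(stack, side):
    """Worst aspect ratio if the areas in `stack` share an active side of length `side`."""
    thickness = sum(stack) / side          # depth of the stack away from the active side
    ratios = []
    for area in stack:
        length = area / thickness          # extent of this tile along the active side
        ratios.append(max(length / thickness, thickness / length))
    return max(ratios)


def lock(stack, x, y, w, h, tiles):
    """Fix the tiles of `stack` in place and return the shrunken working space."""
    total = sum(stack)
    if w >= h:                             # vertical active side: stack grows in width
        thickness = total / h
        offset = y
        for area in stack:
            tiles.append((area, x, offset, thickness, area / thickness))
            offset += area / thickness
        return x + thickness, y, w - thickness, h
    thickness = total / w                  # horizontal active side: stack grows in height
    offset = x
    for area in stack:
        tiles.append((area, offset, y, area / thickness, thickness))
        offset += area / thickness
    return x, y + thickness, w, h - thickness


def squarify(areas, width, height):
    """Return a list of (area, x, y, tile_width, tile_height), largest areas first."""
    remaining = sorted(areas, reverse=True)
    x, y, w, h = 0.0, 0.0, float(width), float(height)
    tiles, stack = [], []
    while remaining:
        side = min(w, h)                   # active side is the shorter side of R'
        candidate = stack + [remaining[0]]
        if not stack or worst_ratio(candidate, side) <= worst_ratio(stack, side):
            stack = candidate              # adding the tile does not worsen the stack
            remaining.pop(0)
        else:
            x, y, w, h = lock(stack, x, y, w, h, tiles)
            stack = []                     # start a new stack in the reduced space
    if stack:
        lock(stack, x, y, w, h, tiles)
    return tiles


tiles = squarify([6, 6, 4, 3, 2, 2, 1], 6, 4)
for tile in tiles:
    print(tile)
```

For the areas of Example 2.2.1, the first two tiles come out as a locked stack of width 3 (each tile 3 by 2), and the area 4 and area 3 tiles then share a stack of height 7/3 in the remaining 3 by 4 working space, in line with steps 1 to 6.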
Figure 2.4: Example of tile placement in the squarified algorithm. Originally Figure 4 of [6].
For static representations of unordered data sets (so neither stability nor ordering are a concern) squar-
ified tree maps are an effective way to communicate hierarchical data sets. Aesthetically, as Figure 2.5
shows, they can be potentially indistinguishable from art! They also offer practical advantages over
rival presentations, as discussed in [6] - with listings “it is hard to form a mental image of the overall
structure”, whilst tree diagrams (see Section 3.2) “use the display space inefficiently [as] most of the pix-
els are used as background”.
But from the very start [18] treemaps were intended to be used interactively, and this allows some of
the usability issues to be addressed. For instance, they can allow users to select a global tree depth they
are comfortable with. By offering the ability to zoom in to tiles, it becomes possible to apply a greater
level of detail to just the categories of interest, allowing for user-driven exploration of the data set. Due
to the stability issues mentioned, care may be required when using them as slices, and although flat (as
opposed to hierarchical) categorical data can be presented, as with simpler area-based representations
such as pie charts, negative quantities cannot be effectively handled. Further, whilst “treemaps are very
effective when size is the most important feature to be displayed” [6], they are less useful when it is the
tree structure we are interested in, rather than the values assigned to its leaves. We identify further
issues with tree maps in our investigations in Section 4.3.
Figure 2.5: The Singing Mondrian. From [20], a treemap visualisation of popular artists on the music-tracking website Last.fm, styled
after Piet Mondrian’s compositions (colour-coding denotes genre).
2.3 Choropleth Maps
A choropleth map presents the values of an ordinal statistic associated with geographic regions by
means of a shaded or colour-coded map. We have already seen an example - albeit with a poorly
chosen colour scheme - in Figure 1.3. An advantage of conveying magnitude through colour (rather
than area, as with tree maps and other representations we shall consider) is that with an appropriate
scale we can indicate negative quantities.
In partial analogy with small multiples¹, a very high density of local data can be presented whilst still
allowing easy at-a-glance comparisons within larger contexts; familiar geographic placement allowing
for easier navigation of the data than a grid or tabular representation. However, this familiarity being
based on existing geographic or political boundaries also presents a risk, in that the areas of regions
are naturally fixed. Whilst the colour scheme is meant to be the only data channel, it is inevitable
that size will influence our perception somewhat. Figure 2.6 gives an example; although appropriately
coloured, it is easy to interpret the migration from Canada as being more significant than that from
Germany, when in fact there are more than twice as many migrants from the latter.
Figure 2.6: Choropleth map of migration to Scotland.
One solution to this problem, as discussed in [40], is to use a mesh map, partitioning with an equally
spaced grid so that “arbitrary but statistically wise boundaries now cradle the micro-data”. However, if
the original statistical reporting has been aggregated to more conventional divisions (countries, coun-
cil regions, health wards etc.) then such a partition may not be possible to produce. We will consider an
alternative to area-based representation for migration flows from given regions in considerable detail
in Chapter 3.
¹ The term choropleth is derived from the Greek words for ‘area/region’ and ‘multitude’.
2.4 Frequency Plots
One opportunity presented by visualisation is to move away from numerical values entirely, and in-
stead represent quantities of interest pictorially. Further, demographic statistics are often concerned
with populations of a size that is hard to comprehend; whilst the arrival of around 32,000 migrants to
Scotland in 2012 might sound large, it is only 0.6% of the overall population of around 5.31 million.
This is made yet easier when phrased as a frequency - that if a thousand Scots were selected at ran-
dom that year, only six would be new migrants. There is even anthropological evidence that humans
are only capable of keeping track of social groups of moderate size - known as “Dunbar’s number”²,
this cognitive limit is typically taken to be 150. Thus we may benefit from presenting percentage val-
ues literally as a visual proportion of 100, such as in Figure 2.7, which illustrates the gender divide in
Scotland’s centenarian population (of which 85% are women).
Figure 2.7: Number of males and females per 100 centenarians, Scotland 2012.
Originally Figure 5 of [14], see Appendix A.7 for an interactive version.
² Or, in popular discussion, “the monkeysphere”.
We also note the particular success of natural frequency diagrams in understanding conditional prob-
ability when drawing from joint distributions (that is, overlapping populations) as in Figure 2.8. This
illustrates the outcomes of breast cancer screening for a population of 1000 women, given the follow-
ing:
• The probability of a woman having breast cancer is 1%
• If a woman has breast cancer, the probability of testing positive is 85% (sensitivity)
• If a woman does not have breast cancer, the probability of testing negative is 90% (specificity).
From these, a direct calculation via Bayes theorem and the law of total probability gives us the proba-
bility of actually having breast cancer, given a positive test result, as around 8%:
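$$
\begin{aligned}
P(\text{Cancer} \mid +\text{ve}) &= \frac{P(\text{Cancer} \cap +\text{ve})}{P(+\text{ve})} \\
&= \frac{P(+\text{ve} \mid \text{Cancer})\,P(\text{Cancer})}{P(+\text{ve} \mid \text{Cancer})\,P(\text{Cancer}) + P(+\text{ve} \mid \text{No cancer})\,P(\text{No cancer})} \\
&= \frac{0.01 \times 0.85}{0.01 \times 0.85 + 0.1 \times 0.99} \\
&\approx 0.08
\end{aligned}
$$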
However, in a test where medical practitioners were given similar data and asked to determine this
probability, almost half erroneously assessed the cancer risk as being the sensitivity (in this example,
85%) [8]. This was despite the specificity being rephrased as the “false alarm rate” (10% of women
without breast cancer nonetheless testing positive); and multiple choice options being offered. Given
the controversial history of Bayes’ theorem, it is perhaps not surprising that such calculations prove
challenging even for specialists; the natural frequency presentation in Figure 2.8 makes the figures
much easier to grasp, which may lend support to the use of simple frequency charts such as Figure 2.7
too.
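The natural frequency reading of the same figures can be reproduced by counting people rather than multiplying probabilities. The following is a minimal sketch using only the three rates quoted above and a population of 1000; the function name `screening_counts` is our own.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Split a population into true and false positive test results."""
    with_cancer = population * prevalence
    without_cancer = population - with_cancer
    true_positives = with_cancer * sensitivity
    false_positives = without_cancer * (1 - specificity)
    return true_positives, false_positives


tp, fp = screening_counts(1000, 0.01, 0.85, 0.90)
ppv = tp / (tp + fp)  # probability of cancer given a positive test

print(f"{tp:.1f} true positives, {fp:.1f} false positives")
print(f"P(cancer | positive) = {ppv:.3f}")
```

Of 1000 women, about 8.5 test positive and have cancer whilst 99 test positive without it, so a positive result corresponds to cancer in roughly 8.5 of 107.5 cases, or about 8%.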
Figure 2.8: Possible outcomes of breast cancer screening. Capture from [42].
Chapter 3
Graphs
In this chapter, we turn our attention to the use of graph-theoretic structures - particularly trees - to
design visualisations, with the aim of overcoming some of the limitations identified in the previous
chapter. Of particular interest will be graph layout: firstly for given structures, such as hierarchical
data sets that we previously considered in the context of tree maps; then of structures computed from
the data. For the latter we will use clustering to generate trees from geographical data, enabling a
presentation as a flow map.
3.1 Introduction
To fix notation, by a graph we will mean a collection of objects - the vertex set $V$ - with a linking structure
given by the edge set $E \subseteq V \times V$. At times, it may be convenient to identify an edge $e \in E$ by its end points
$v_i, v_j \in V$; we will denote this by $e = i \leftrightarrow j$, and say that $v_i, v_j$ are adjacent. Often we will think of the
vertex set $V$ as simply a set of integers (so the $i$th vertex $v_i$ is identified simply as $i$).
The vertices and edges of a graph lend themselves to a natural presentation as a diagram of points and
lines, and we often think of this diagram as being the graph. But it is important to realise that seemingly
different diagrams may just be different representations of the same abstract structure. For instance,
all of the diagrams in Figure 3.1 are - considering only adjacency - ‘the same graph’ (specifically, an
object known as the Petersen graph). It should be immediately apparent that there is no structural
difference between graphs (i) and (ii), since all that has changed are visual properties: colour, shape,
and labelling language (letters instead of numbers). Without the numbering of vertices, it might be
hard at first glance to verify that (i) and (iii) are the same, but by checking the neighbours of each, it
can be seen that the ten vertices have simply been repositioned in space. A similar analysis shows that
(iv) is just a repositioning (and fresh colour makeover) of the lettered vertices in (ii) - which was itself
equivalent to (i).
On the one hand, this level of abstraction makes graphs a powerful tool for visualisation. Many data
sets which do not have an obvious ‘vertices and edges’ nature may nonetheless have a graphical inter-
pretation which can give us insight into structure within the data. We can then present this with an
appropriate diagram, and we are free to use tools like colour and vertex shape/size to indicate further
aspects of the data. But this freedom also results in a number of challenges in producing - or even
defining - good diagrams for a given graph; some of these issues will be considered in Section 3.2.
Figure 3.1: Four views of the Petersen Graph, labelled (i) to (iv); vertices are numbered 1 to 10 in (i) and (iii), and lettered A to J in (ii) and (iv).
As mentioned, our vertices may correspond to items in our data set with further attributes. But we
may also be interested in assigning attributes to the edges between them. Generally we treat edges
as unordered pairs $\{i, j\}$ (so two vertices are either adjacent or not); for a directed graph we consider
ordered pairs, so we may have an edge $e = i \to j \in E$ without the corresponding $j \to i$ being in $E$. We
will call $i \to j$ an out-edge of $i$ and an in-edge of $j$.
For both undirected and directed graphs, we may also be interested in assigning a weight $w(e)$ to each
edge $e \in E$, usually to indicate strength of ties between vertices beyond a binary classification of adjacent
or not. In this general setting (or indeed any of the simpler ones), we can associate an $n$-vertex
graph with an $n \times n$ adjacency matrix, where the entry $A_{ij}$ is the weight of the edge $i \to j$ (zero if no
such edge is in $E$, or if $i = j$; conventionally 1 for edges in an unweighted graph). For certain tables of
data we can therefore immediately construct a corresponding graph by interpreting the table as an
adjacency matrix.
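As a minimal sketch of this interpretation, the following reads a square table as a weighted adjacency matrix and lists the directed edges it implies. The function name `graph_from_table`, the region labels and the counts are all hypothetical, made up purely for illustration.

```python
def graph_from_table(labels, table):
    """Read a square table as an adjacency matrix; return {(source, target): weight}."""
    n = len(labels)
    edges = {}
    for i in range(n):
        for j in range(n):
            # Diagonal entries and zero entries correspond to 'no edge'.
            if i != j and table[i][j] != 0:
                edges[(labels[i], labels[j])] = table[i][j]
    return edges


# Hypothetical flows between three regions: row = origin, column = destination.
labels = ["A", "B", "C"]
table = [
    [0, 5, 0],
    [2, 0, 7],
    [0, 1, 0],
]

edges = graph_from_table(labels, table)
for (source, target), weight in edges.items():
    print(f"{source} -> {target} (weight {weight})")
```

Because the table need not be symmetric, the result is in general a directed, weighted graph; a symmetric table can equally be read as an undirected one.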
3.2 Graph Layout
Determining appropriate vertex placement (and possibly edge routing, if not simply straight lines) can
be formulated as an optimisation problem. To do so, an objective cost must be assigned to possible
vertex locations that captures the quality of the corresponding diagram. Care is required, however,
as innocent-looking criteria can give rise to intractable problems. For instance, we describe a graph
as planar if there is an embedding into the plane with no edge crossings. A graph with n edges can
be tested for planarity in O(n) time, and if a suitable embedding exists then this can be recovered with
the same complexity (see [7]). However, if a graph fails to be planar, then the question of how many
crossings are necessary to draw it is already NP-hard [13].
Worse, the concept of a ‘good’ diagram can depend on aesthetic judgments; with respect to crossing
number, both diagrams (i) and (iv) of Figure 3.1 are equally good; see also Figure 3.2, where the version
with crossings is likely preferable to the planar embedding. Many properties such as symmetry that are
useful for laying out examples in pure graph theory may not hold for graphs constructed from messier
real world data sets. Moreover, it is unlikely that the preferences of the designer will perfectly match
those of the viewers of a diagram or users of a visualisation incorporating one; in the latter case, user
requirements may vary as they explore a data set.
Figure 3.2: Non-planar and planar diagrams for the same graph.
For simple criteria, though, algorithmic optimisation of layout can work well, with the family of force-directed
placement algorithms - driven by an underlying ‘physical’ process - being successful examples.
In [19], vertices are associated with particles linked by springs. By setting a desired (fixed) edge
length of $L$, the ideal geometric distance between two positions $p_i, p_j$ is taken to be $l_{ij} = L \times d_{ij}$ for
$d_{ij}$ the graph theoretic distance - that is, length of shortest path - between vertices $i$ and $j$. If $p_i, p_j$ are
placed too far apart or too close together, energy is required to either expand or compress the spring joining
them. Introducing a strength $k_{ij}$ for each spring - a value proportional to $1/d_{ij}^2$ is suggested - a total
energy of

$$E = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{1}{2} k_{ij} \left( |p_i - p_j| - l_{ij} \right)^2 \tag{3.2.1}$$

can be assigned to each choice of positions $p_1, \ldots, p_n$. This varies continuously with the $p_i$’s, and
minimizing $E$ corresponds to reducing the discrepancy between ideal and actual spacing.
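To make the energy (3.2.1) concrete, the sketch below evaluates it for a small graph. It is an illustration rather than the method of [19]: it assumes a connected, unweighted graph with vertices numbered from 0, computes the graph distances by breadth-first search, and takes the spring strength to be exactly `K / d**2` for an assumed constant `K`. The function names are our own.

```python
from collections import deque
from math import hypot


def graph_distances(n, edges):
    """Shortest path lengths between all pairs of vertices, by breadth-first search."""
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    distances = []
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        distances.append(dist)
    return distances


def spring_energy(positions, edges, L=1.0, K=1.0):
    """Total spring energy of a layout; assumes the graph is connected."""
    n = len(positions)
    d = graph_distances(n, edges)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            ideal = L * d[i][j]                      # l_ij
            strength = K / d[i][j] ** 2              # k_ij
            actual = hypot(positions[i][0] - positions[j][0],
                           positions[i][1] - positions[j][1])
            energy += 0.5 * strength * (actual - ideal) ** 2
    return energy


path = [(0, 1), (1, 2)]                              # the path graph 0 - 1 - 2
straight = spring_energy([(0, 0), (1, 0), (2, 0)], path)
bent = spring_energy([(0, 0), (1, 0), (1, 1)], path)
print(straight, bent)
```

Laying the three-vertex path out along a straight line with unit spacing gives zero energy, whereas bending it into a right angle leaves the end vertices closer than their ideal distance of 2 and so gives a small positive energy.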
As an analytic solution to the n-dimensional nonlinear equations which arise is not possible, the au-
thors of [19] propose a heuristic approach based on iterative refinement of the best solution found so
far. At each step, the particle with position $p_m = (x_m, y_m)$ of greatest discrepancy $\Delta_m$ is identified, and
all others fixed. A local minimum with respect to this point (that is, the partial derivatives with respect to $x_m$
and $y_m$ both being zero) can be attained to any desired precision, by iteratively solving a sequence of
2-dimensional linear sub-problems and relocating to some $(x_m + \delta x, y_m + \delta y)$ at each iteration. Another
step for some other m can then be taken if E is not yet sufficiently low. For interactive visualisations,
this threshold can be set by the user, either explicitly, or implicitly by taking further steps whenever
they relocate a particle (such as to resolve a visually-obvious local minimum). We also note that par-
ticles can be effectively fixed in position (again, by design or user selection) by excluding them from
consideration when identifying the particle of greatest discrepancy.
A limitation of this and similar approaches is that they treat the particles as infinitesimal points in
space, and only graph-distance plays a role in positioning. For visualisation purposes, it is likely that
further constraints will be desired. In [9] various examples are given, such as preventing overlap of
vertices, arranging groups of vertices in bands or clusters, or ensuring that directed edges have a consistent
orientation. The same paper describes how to extend force-directed placement to allow for separation
constraints in each dimension, of the form $u + d \le v$ where $u$ and $v$ are variables representing
horizontal or vertical position and $d$ is a desired constant minimum separation. Although seemingly
limited, these separation constraints suffice for the examples discussed and several others, and the
linearity is convenient: by modifying the force-directed approach to respect the separation constraints
during the minimization of energy at each relocation step, the subproblem becomes a quadratic pro-
gram.
However, we note that this does not suffice to handle constraints in terms of Euclidean distance. For
instance, with circular vertices of radii $r_i$ located at $p_i = (x_i, y_i)$, a componentwise enforcement of
non-overlapping requires constraints of the form

$$x_i + r_i \le x_j - r_j \quad \text{for } 1 \le i \ne j \le n,$$
$$y_i + r_i \le y_j - r_j \quad \text{for } 1 \le i \ne j \le n.$$

This effectively separates not just the circles, but their bounding boxes; Figure 3.3 demonstrates the
limitations of this, in comparison with a genuine radius based separation with constraints of the
(non-linear) form

$$|p_i - p_j| \ge r_i + r_j \quad \text{for } 1 \le i \ne j \le n.$$

Therefore in [11] one of the authors of [9] remedies the two major limitations of that work: the separate
treatment of the horizontal and vertical axes in specifying constraints, which is resolved by describing
how to handle constraints of the general form

$$|p_i - p_j| \;(=, \le, \ge)\; d;$$

and difficulties in scaling to large graphs due to the quadratic time complexity of the constrained
optimization algorithm.
Figure 3.3: Radius versus bounding box packing of circles.
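The gap between the two kinds of constraint can be seen with a pair of circles. The sketch below is illustrative only: it takes the weakest reading of the componentwise constraints (a pair passes if it is separated along at least one axis, in either order), and the function names and coordinates are our own.

```python
from math import hypot


def circles_disjoint(p, q, rp, rq):
    """Radius based separation: centres are at least rp + rq apart."""
    return hypot(p[0] - q[0], p[1] - q[1]) >= rp + rq


def separated_on_axis(a, b, ra, rb):
    """Linear separation u + d <= v along one axis, in either order."""
    return a + ra <= b - rb or b + rb <= a - ra


def boxes_disjoint(p, q, rp, rq):
    """Bounding boxes are disjoint if they are separated along at least one axis."""
    return (separated_on_axis(p[0], q[0], rp, rq)
            or separated_on_axis(p[1], q[1], rp, rq))


# Two unit circles placed diagonally: the centres are about 2.12 apart.
p, q = (0.0, 0.0), (1.5, 1.5)
print(circles_disjoint(p, q, 1.0, 1.0))  # the circles themselves do not overlap
print(boxes_disjoint(p, q, 1.0, 1.0))    # but their bounding boxes do
```

The placement is acceptable under the Euclidean constraint yet rejected by any componentwise one, so bounding box separation forces a looser packing than is actually needed.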
In tackling the second limitation, the author notes that a preoccupation with mathematical rigour is
partially to blame for the time complexity: the methods of the earlier paper can be proven to converge
to stable local minima, but “it is not clear that such rigor is necessary simply to obtain an aesthetically
appealing layout” [11]. Instead, inspiration is drawn from the field of computer game graphics and
animation, where ad-hoc methods are used without rigorous justification, yet “by a miracle routinely
attributed to either Jacobi or Gauss-Seidel the method usually converges to a stable state in very few
iterations” [11]. The author goes on to explain that the appeal to either Jacobi or Gauss-Seidel methods
is misplaced due to the lack of formal proofs of convergence or even correctness once constraints
have been introduced. Nonetheless, this deficiency of the academic literature is not the same as a
proof of incorrectness or nonconvergence, and the practical results seem satisfactory to the animation
community. For the graph layout problem, the approach taken is to alternate between unconstrained
optimization steps in line with the chosen force-directed approach, and constraint-satisfaction steps
based on a “simple, naïve, and yet effective heuristic” for skeleton-based animation of computer game
characters. The result is that incremental layout of n-node, m-edge graphs subject to c Euclidean
constraints can be performed with a time complexity of O(n log n + m + c), a vast improvement on [9]
which makes handling large graphs in real time feasible.
A further advantage of supporting Euclidean constraints is that this allows more sophisticated physical laws
to be applied to the interaction of particles, where the force is a function of distance - such as the
effects of gravity, or the attraction / repulsion of electrical charges. The force layout features of D3 offer
exactly this: a global gravity parameter causes attraction to the center of the visualisation, whilst each
particle carries a charge (which can be constant or a function of the data associated with the vertex)
and edges have a target length; friction can also be applied to dampen movement. We note however
that the forces are applied to single-pixel points; additional code is required to implement features
such as constrained display region or non-overlapping vertices, although this is entirely possible. We
note here a general formulation as an optimization problem (which, viewed abstractly, has much in
common with standard problems such as facility location); in Section 4.2 we present a visualisation
which applies D3’s force layout under such constraints.
Definition 3.2.1 (Graph Layout Constraints). For a set of $n$ vertices with associated radii $r_i$, and graph
structure given by adjacency matrix $A$, layout in a region of width $w$ and height $h$ with a target edge
length of $L$ requires the selection of positions $p_i = (x_i, y_i)$ such that
1. $|p_i - p_j| \ge r_i + r_j$ for $1 \le i \ne j \le n$ (non-overlap of vertices);
2. $r_i \le x_i \le w - r_i$ for $1 \le i \le n$ (vertices wholly contained in width of region);
3. $r_i \le y_i \le h - r_i$ for $1 \le i \le n$ (vertices wholly contained in height of region);
4. $|p_i - p_j| = L$ for $1 \le i \ne j \le n$ (edge length).
To turn this into an optimisation problem, we require an objective. Instead of the graph-distance based
formula 3.2.1, which considers all pairs of vertices, we can instead make use of the adjacency structure
to consider just the adjacent pairs, by relaxing condition 4 from Definition 3.2.1. In this way, we are
allowing for some discrepancy between the actual (Euclidean) distances and the target L; we have various
options in how we score this, two of which we present below.
Definition 3.2.2 (Graph Layout optimization, average discrepancy version). For vertices $1, \ldots, n$ with
adjacency structure given by the matrix $A$ and radii given by $r = (r_1, \ldots, r_n)$, plus a target edge length $L$
and region of dimensions $w \times h$, the average discrepancy version of the graph layout problem is to

$$\text{minimize} \sum_{1 \le i < j \le n} A_{ij} \, \bigl|\, |p_i - p_j| - L \,\bigr|$$

for decision variables $p_i = (x_i, y_i)$, $1 \le i \le n$, subject to
1. $|p_i - p_j| \ge r_i + r_j$ for $1 \le i \ne j \le n$;
2. $r_i \le x_i \le w - r_i$ for $1 \le i \le n$;
3. $r_i \le y_i \le h - r_i$ for $1 \le i \le n$.
Definition 3.2.3 (Graph Layout optimization, greatest discrepancy version). For vertices $1, \ldots, n$ with
adjacency structure given by the matrix $A$ and radii given by $r = (r_1, \ldots, r_n)$, plus a target edge length $L$
and region of dimensions $w \times h$, the greatest discrepancy version of the graph layout problem is to

$$\text{minimize} \max_{1 \le i < j \le n} A_{ij} \, \bigl|\, |p_i - p_j| - L \,\bigr|$$

for decision variables $p_i = (x_i, y_i)$, $1 \le i \le n$, subject to
1. $|p_i - p_j| \ge r_i + r_j$ for $1 \le i \ne j \le n$;
2. $r_i \le x_i \le w - r_i$ for $1 \le i \le n$;
3. $r_i \le y_i \le h - r_i$ for $1 \le i \le n$.
We note that in either Definition 3.2.2 or 3.2.3, the edge length between circular vertices $i, j$ is equated
with the distance between their centres $p_i, p_j$. In either case, if we wish to treat the edge as only starting
from the boundary of the circles, we can simply replace $L$ in each term of the objective with $L + r_i + r_j$.
Further, in a particular instance of either problem we may wish to specify fixed locations $p'_i$ for a subset
of the particles $I' \subset \{1, \ldots, n\}$; these can simply be added as constraints of the form $x_i = x'_i$, $y_i = y'_i$ for
all $i \in I'$, which allows us to preserve the objective function and adjacency matrix.
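Both objectives, and the shared constraints, are straightforward to evaluate for a candidate layout. The sketch below does only that (it scores a layout and does not search for an optimal one); the function names and the example instance are our own, with vertices indexed from 0 and a symmetric 0/1 adjacency matrix assumed.

```python
from math import hypot


def distance(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])


def feasible(positions, radii, w, h):
    """Check the non-overlap and containment constraints for a layout."""
    n = len(positions)
    for i in range(n):
        x, y = positions[i]
        r = radii[i]
        if not (r <= x <= w - r and r <= y <= h - r):
            return False                       # vertex i leaves the region
        for j in range(i + 1, n):
            if distance(positions[i], positions[j]) < r + radii[j]:
                return False                   # vertices i and j overlap
    return True


def discrepancies(positions, A, L):
    """One term A_ij * | |p_i - p_j| - L | for each pair i < j."""
    n = len(positions)
    return [A[i][j] * abs(distance(positions[i], positions[j]) - L)
            for i in range(n) for j in range(i + 1, n)]


def average_objective(positions, A, L):
    return sum(discrepancies(positions, A, L))


def greatest_objective(positions, A, L):
    return max(discrepancies(positions, A, L), default=0.0)


# Hypothetical instance: a triangle of unit-radius vertices in a 10 x 10 region.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
positions = [(2, 2), (5, 2), (2, 6)]
radii = [1, 1, 1]

print(feasible(positions, radii, 10, 10))
print(average_objective(positions, A, 3), greatest_objective(positions, A, 3))
```

Here the three edges have lengths 3, 4 and 5 against a target of 3, so the layout is feasible with a summed discrepancy of 3 and a greatest discrepancy of 2. The two objectives can rank layouts differently, since the second is indifferent to all but the worst edge.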
3.3 Flow Maps
In our discussion (Section 2.3) of choropleth maps for migration data we noted an inherent limitation
arising from the fixed dimensions of each country or region being considered. After all, it is not the
size of the location we are interested in, but of the flow of people to or from it. Cartographers have long
known how to resolve this issue, through the use of flow maps; see for example the work of Minard in
the 19th century visualising exports of French wine [25] (Figure 3.4; for a larger version see page 25 of
[41]) or Napoleon’s Russian campaign of 1812 (Figure 3.5, or Ibid. p.41).
Figure 3.4: Minard’s map of exports of French wine in 1864.
These early representations of flow were necessarily drawn by hand (although this allows for precise
geographic accuracy to yield to clarity, as in the repositioning of the UK in Figure 3.4); the paper of Phan
et al. [33] describes their process for generating such maps algorithmically from a list of locations and
the flow to each from a specified root. To do so they consider the geographic locations as vertices
of a graph, and use clustering to construct a tree from the desired root with the remaining vertices
as leaves. This abstract partitioning introduces further vertices as branch points; unlike the original
vertices, these can be placed arbitrarily on the map, but the quality of the visualisation will depend
strongly on the choices made for these.
Figure 3.5: Minard’s 1861 visualisation of Napoleon’s Russian campaign of 1812. Conveying six variables including both geographic position and size of the army in a readily
understood way, Tufte remarked that “it may well be the best statistical graphic ever drawn” [41].
3.3.1 Flow graphs
Suppose we are given a source location 0 and a set of $n$ target locations each with a required flow $f_i$,
$i = 1, \ldots, n$; and that for each location we also have fixed position coordinates $(x_i, y_i)$.
Definition 3.3.1. We call $T$ a directed flow tree if it contains: a nonempty set of leaf vertices corresponding
to each target $i$, with no out-edges and a single in-edge of weight equal to $f_i$; a single root vertex 0
with no in-edges and total weight over all out-edges equal to $\sum_{i=1}^{n} f_i$; and a (possibly empty) set of branch
vertices which satisfy the balance condition that total flow in equals total flow out.
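The weight conditions of this definition can be checked mechanically. The sketch below is a partial check only: it verifies the conditions on the root, leaves and branch vertices but does not confirm that the edges actually form a tree. The function name, the dictionary representation and the example flows are our own.

```python
def satisfies_flow_conditions(root, targets, edges):
    """targets: {leaf: required flow}; edges: {(u, v): weight} for each edge u -> v."""
    if not targets:
        return False                                  # leaf set must be nonempty
    flow_in, flow_out, in_count = {}, {}, {}
    for (u, v), weight in edges.items():
        flow_out[u] = flow_out.get(u, 0) + weight
        flow_in[v] = flow_in.get(v, 0) + weight
        in_count[v] = in_count.get(v, 0) + 1
    # Root: no in-edges, and total out-flow equal to the sum of required flows.
    if root in flow_in or flow_out.get(root, 0) != sum(targets.values()):
        return False
    # Leaves: no out-edges and a single in-edge carrying the required flow.
    for leaf, required in targets.items():
        if leaf in flow_out or in_count.get(leaf, 0) != 1 or flow_in[leaf] != required:
            return False
    # Branch vertices: total flow in equals total flow out.
    branches = (set(flow_in) | set(flow_out)) - {root} - set(targets)
    return all(flow_in.get(v, 0) == flow_out.get(v, 0) for v in branches)


# Hypothetical example: root 0 feeds a branch vertex "v", which splits to targets 1 and 2.
targets = {1: 2, 2: 1}
balanced = {(0, "v"): 3, ("v", 1): 2, ("v", 2): 1}
unbalanced = {(0, "v"): 2, ("v", 1): 2, ("v", 2): 1}

print(satisfies_flow_conditions(0, targets, balanced))
print(satisfies_flow_conditions(0, targets, unbalanced))
```

Integer flows are used in the example so that the equality tests are exact; with fractional flows a small tolerance would be needed.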
From such a T we can infer an undirected flow graph F . This suffices to produce an image such as
Figure 3.4, as once we have determined suitable locations for the branch vertices we can render the
graph with lines of thickness proportional to edge weight.
Following [33] we will use clustering techniques to introduce the branch vertices. This will result in
an agglomeration tree H with leaves 0, . . . ,n; we will show that a flow graph F can be generated from
this anyway, but may have undesirable properties with respect to layout. We will see how to resolve
this somewhat by re-wiring certain edges and thus eliminating branch vertices (based on the break-
down and reclustering of H employed in [33], but in a simpler-to-implement manner that works on
the undirected graph F instead).
3.3.2 Star graphs
We note that we can immediately produce a flow graph for any source and set of $n$ targets: use the
star graph $S_n$ (that is, the complete bipartite graph $K_{1,n}$ with $n + 1$ vertices) where each of the $n$ leaves
is attached to the root 0 by an edge of weight $f_i$. These are entirely straightforward to render, and turn
out to have two properties that would seem desirable for any flow map. Clearly, there will be no edges
that cross each other. Moreover, we minimize the amount of chart ink:
Definition 3.3.2. For a flow graph $F$ with edge set $E$, the chart ink for a given rendering is the quantity
$$\sum_{e \in E} l_e w_e,$$
where $l_e$ is the length of edge $e$ in the rendering and $w_e$ is its weight. (Note that for $e$ joining vertices
$i, j$ we have $l_e = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$, i.e., it is determined by the chosen locations of each vertex in the
rendering, not just the branch structure of $F$.)
Proposition 3.3.1. For a given directed flow tree $T$, the chart ink is minimized by placing all branch
points at the same location as the root, so the corresponding flow map is a star graph.
Proof. If there are no branch vertices then we are done. Otherwise, let $a$, $b$ be leaf vertices that share a
parent $v$, itself with a parent $u$ (since we are considering the directed tree, this hierarchical interpretation
of the vertices is possible). Then there are flows $f_a$, $f_b$ along edges $v \to a$, $v \to b$ respectively, and
for balance a flow $f$ of at least $f_a + f_b$ along $u \to v$ (see footnote 1).
The total chart ink is therefore
$$\begin{aligned}
f\,d(u,v) + f_a\,d(v,a) + f_b\,d(v,b) &\ge (f_a + f_b)\,d(u,v) + f_a\,d(v,a) + f_b\,d(v,b) \\
&= f_a\bigl(d(u,v) + d(v,a)\bigr) + f_b\bigl(d(u,v) + d(v,b)\bigr) \\
&\ge f_a\,d(u,a) + f_b\,d(u,b) \quad \text{by the triangle inequality.}
\end{aligned}$$
Footnote 1: There could be further children of $v$, introducing additional chart ink, so we give the general case; however, for the clustering methods we will employ there will only ever be two children.
Figure 3.6: Selecting branch point location to minimise chart ink.
But we can achieve this lower bound if v is both on the line u → a (triangle inequality applied to u, v, a,
as shown in Figure 3.6) and on the line u → b (for ∆u, v,b); this requires that v be precisely at u.
So the parent of every leaf should be located at its own parent. Iterating this process by pruning the
leaves and thus treating their parents as the leaves of a new smaller tree, we find that u (and hence v)
should be located precisely at its parent, and so on up to the root.
Reducing this quantity is beneficial in avoiding clutter and thus increasing legibility. However, we lose
the agglomerative nature of a flow map that features suitable branching and it can be hard to distin-
guish rays; consider Figure 3.8 or the comparison in Figure 3.7. Proposition 3.3.1 therefore implies
that we should not use minimization of chart ink as a metric for optimising the placement of branch
vertices if we want to present such a structure.
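The chart ink of Definition 3.3.2, and the effect described in Proposition 3.3.1, can be illustrated numerically. The sketch below is only an illustration under assumed inputs: the function name, the vertex labels and the coordinates are hypothetical, chosen so that the distances from the root to each leaf are exactly 5.

```python
import math


def chart_ink(positions, edges):
    """Sum of (edge length x edge weight) over all edges of a rendering.

    positions: dict vertex -> (x, y); edges: iterable of (u, v, weight).
    """
    return sum(weight * math.dist(positions[u], positions[v])
               for u, v, weight in edges)


# Root u, one branch vertex v, leaves a and b with flows 2 and 1.
edges = [("u", "v", 3), ("v", "a", 2), ("v", "b", 1)]
fixed = {"u": (0, 0), "a": (4, 3), "b": (4, -3)}

ink_branched = chart_ink({**fixed, "v": (2, 0)}, edges)   # v part-way out
ink_collapsed = chart_ink({**fixed, "v": (0, 0)}, edges)  # v placed at the root
```

With the branch vertex placed at the root the ink is 2·5 + 1·5 = 15, the star-graph value; moving it to (2, 0) gives 6 + 3√13, which is larger, in line with the proposition.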
Figure 3.7: Computer generated flow maps of migration from California 1995-2000: Tobler’s approach [38, 39] (left) vs Phan et al.’s from [33] (right).
Figure 3.8: Star-graph flow map of migration to Scotland.
3.3.3 Hierarchical clustering
Clustering is a fundamental topic in machine learning, and we will only pick out a few relevant details
here (a comprehensive overview can be found in [17]). Given a set of elements of some space with a
measure of ‘distance’ between them, the goal is to group ‘similar’ elements together into subsets (the
clusters). This is generally an example of unsupervised learning, in that we are attempting to infer
appropriate divisions from the data rather than with respect to some existing grouping.
An algorithm which starts with each element as its own cluster then successively merges clusters to-
gether is described as agglomerative; if instead it starts from a single cluster of all elements then pro-
ceeds by splitting clusters, then it is instead divisive. In either approach there is generally a stopping
criterion, at which point further merges or splits are deemed detrimental to the quality of the cluster-
ing. Determining a suitable criterion is not a simple task - and likely will need to vary with the data set
- but we can sidestep this issue by running the process to completion: a single cluster for agglomerative,
or singleton clusters for divisive. The result in either case is a dendrogram: a tree structure with
the single cluster as root, and singleton clusters as leaves. We can then select a clustering of varying
coarseness by picking a height in the tree and partitioning with respect to the branches at that level.
Nearer the root we have fewer clusters of less similar elements; towards the leaves the elements of
clusters will be more similar to each other, but there may be (many) more clusters.
As an example, consider the 15 most common non-UK nationalities in Scotland. Assigning an $x, y$ location
on a 2D map of the world to each country, we can cluster by distance in the geographic sense (see footnote 2).
Working agglomeratively, the comparison of singleton clusters $\{a\}$, $\{b\}$ is straightforward: we use the Euclidean
distance $d(a,b)$. But once a cluster contains more than one country, there are again various options
for defining distance. By keeping track of all elements in every cluster, we can consider the single-link
distance between clusters $A$ and $B$, defined as $\min_{a \in A,\, b \in B} d(a,b)$; the complete-link distance instead takes the
maximum over these possible pairs of elements with one from each cluster. Both of these concepts can
be expanded further by considering all points within each country as part of the cluster, rather than a
single representative.
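The two inter-cluster distances just described can be written down directly. This is a minimal sketch assuming each cluster is simply a list of (x, y) points; the function names and the sample coordinates are illustrative only.

```python
import math


def single_link(A, B):
    """Smallest Euclidean distance over pairs with one point from each cluster."""
    return min(math.dist(a, b) for a in A for b in B)


def complete_link(A, B):
    """Largest Euclidean distance over pairs with one point from each cluster."""
    return max(math.dist(a, b) for a in A for b in B)


# Hypothetical clusters of points on a line.
A = [(0, 0), (1, 0)]
B = [(3, 0), (5, 0)]
```

For these points the single-link distance is 2 (from (1, 0) to (3, 0)) and the complete-link distance is 5 (from (0, 0) to (5, 0)).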
However, simpler data structures arise if we just synthesise an x, y value for each new cluster based on
those of the clusters being merged. Since in our application to migration data each location carries a
weight, we can consider a centre of mass; or even more simply just average the x and y coordinates
each time. In [33] another approach is taken - for each cluster we take its position to be the centre
of the bounding box of all its members. Following this approach, we find that clustering proceeds as
follows:
Merging Scotland with Republic of Ireland to create cluster 1
Merging Germany with France to create cluster 2
Merging Italy with cluster 2 to create cluster 3
Merging Poland with cluster 3 to create cluster 4
Merging Spain with cluster 4 to create cluster 5
Merging cluster 1 with cluster 5 to create cluster 6
Merging India with Pakistan to create cluster 7
Merging China with Hong Kong to create cluster 8
Merging USA with Canada to create cluster 9
Merging cluster 7 with cluster 8 to create cluster 10
Merging Nigeria with cluster 6 to create cluster 11
Merging South Africa with cluster 11 to create cluster 12
Merging Australia with cluster 10 to create cluster 13
Merging cluster 12 with cluster 13 to create cluster 14
Merging cluster 9 with cluster 14 to create cluster 15
The corresponding dendrogram is shown in Figure 3.9. From this we can deduce various clusterings;
for instance, two levels down from the root we get three clusters, comprising North America, Europe &
Africa, and Asia & Australasia.
Footnote 2: It is important to realise that we can cluster on any notion of proximity or similarity.
For our purposes in constructing a flow graph we are nearly done - the dendrogram implies a set of
branch points, and we merely need to assign appropriate weight to the edges before turning to the
problem of layout. We note that each merger reduces the number of active clusters by one, and we are
done when there is only a single cluster; so in all n = 15 new clusters are created, each introducing a
branch vertex. Our flow graph therefore has 2n +1 vertices in total; since 2 edges are created during
each of n merges, we have 2n in total. So |E | = |V |−1; as the graph is connected (we can trace a path
from any vertex to the final branch vertex), it is therefore a tree as desired. The known vertex and edge
counts lead to convenient data structures; Algorithm 1 outlines a pseudocode description of how to
generate the flow graph during agglomerative clustering based on bounding box centre distances. A
Java implementation is given in Appendix A.1.
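For readers who prefer executable code, the construction of Algorithm 1 can be sketched in Python as follows. This is not the Java implementation of Appendix A.1: the function name, the input format and the tuple representation of edges are assumptions, and ties between equally close pairs are broken by index order, which the pseudocode leaves unspecified.

```python
from itertools import combinations


def build_flow_tree(source, targets):
    """Sketch of Algorithm 1 (bounding-box-centre agglomerative clustering).

    source:  (x0, y0) for location 0.
    targets: list of (x_i, y_i, f_i) for locations 1..n.
    Returns (edges, W): edges are (tail, head, weight) triples and W holds
    the signed weight of each of the 2n + 1 clusters.
    """
    n = len(targets)
    size = 2 * n + 1
    X, Y = [0.0] * size, [0.0] * size
    min_x, max_x = [0.0] * size, [0.0] * size
    min_y, max_y = [0.0] * size, [0.0] * size
    W = [0.0] * size
    active = [False] * size  # plays the role of inC in the pseudocode

    # Location 0 carries minus the total flow; targets carry their own flow.
    points = [(source[0], source[1], -sum(f for _, _, f in targets))]
    points += list(targets)
    for i, (x, y, w) in enumerate(points):
        X[i] = min_x[i] = max_x[i] = x
        Y[i] = min_y[i] = max_y[i] = y
        W[i] = w
        active[i] = True

    edges = []
    for k in range(n + 1, 2 * n + 1):
        # Closest pair of active clusters by squared distance between centres.
        live = [c for c in range(k) if active[c]]
        i, j = min(
            combinations(live, 2),
            key=lambda p: (X[p[0]] - X[p[1]]) ** 2 + (Y[p[0]] - Y[p[1]]) ** 2,
        )
        # Bounding box, centre and weight of the merged cluster k.
        min_x[k], max_x[k] = min(min_x[i], min_x[j]), max(max_x[i], max_x[j])
        min_y[k], max_y[k] = min(min_y[i], min_y[j]), max(max_y[i], max_y[j])
        X[k] = (min_x[k] + max_x[k]) / 2
        Y[k] = (min_y[k] + max_y[k]) / 2
        W[k] = W[i] + W[j]
        # Negative weight: the child contains the source, so flow runs into k;
        # otherwise flow runs from k down to the child.
        for c in (i, j):
            if W[c] < 0:
                edges.append((c, k, -W[c]))
            else:
                edges.append((k, c, W[c]))
        active[i] = active[j] = False
        active[k] = True
    return edges, W


# Hypothetical input: a source at the origin and three targets on a line.
edges, W = build_flow_tree((0, 0), [(1, 0, 5), (10, 0, 3), (10.5, 0, 2)])
```

On this small input the two far targets merge first, then the source merges with its near neighbour, and the final merge joins the two resulting clusters; the result has 2n = 6 edges and the last cluster has weight zero.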
Figure 3.9: Bounding-box dendrogram for sources of migration to Scotland. (Leaves, left to right: ZA, NG, Scot., IE, DE, FR, IT, PL, ES, IN, PK, CN, HK, AU, US, CA; vertical axis: distance.)
Algorithm 1: Flow Graph construction
Input: Source location 0 with coordinates $(x_0, y_0)$.
Input: Target locations $i = 1, \dots, n$ with coordinates $(x_i, y_i)$ and required flow $f_i$.
Output: A directed flow tree $T = (V, E)$ (with appropriately weighted edges).
Initialise:
    Construct vectors X, Y, maxX, minX, maxY, minY, W, inC of length $2n+1$ and all entries zero
    Set $V = \{0, \dots, 2n\}$ and $E = \emptyset$
    Set X[0] = maxX[0] = minX[0] = $x_0$
    Set Y[0] = maxY[0] = minY[0] = $y_0$
    Set W[0] = $-\sum_{i=1}^{n} f_i$
    Set inC[0] = 1
    for $i = 1, \dots, n$ do
        Set X[i] = maxX[i] = minX[i] = $x_i$
        Set Y[i] = maxY[i] = minY[i] = $y_i$
        Set W[i] = $f_i$
        Set inC[i] = 1
Set $k = n+1$
while $k \le 2n$ do
    Get closest two active clusters to merge; this will give our $k$th cluster / $(k-n)$th branch vertex:
        Find $0 \le i < j \le 2n$ such that inC[i] = inC[j] = 1 and
        $(X[i] - X[j])^2 + (Y[i] - Y[j])^2$ is minimal over such $i, j$.
    Find bounding box, weight and centre of merged cluster:
        Set maxX[k] = max(maxX[i], maxX[j]), minX[k] = min(minX[i], minX[j])
        Set maxY[k] = max(maxY[i], maxY[j]), minY[k] = min(minY[i], minY[j])
        Set X[k] = $\frac{1}{2}$(minX[k] + maxX[k]), Y[k] = $\frac{1}{2}$(minY[k] + maxY[k])
        Set W[k] = W[i] + W[j]
    Update flow graph:
        if W[i] < 0 then
            Add edge $i \to k$ of weight |W[i]| to E
        else
            Add edge $k \to i$ of weight W[i] to E
        if W[j] < 0 then
            Add edge $j \to k$ of weight |W[j]| to E
        else
            Add edge $k \to j$ of weight W[j] to E
    Update active clusters:
        Set inC[i] = 0, inC[j] = 0, inC[k] = 1
    Update next cluster index:
        Set $k = k+1$
return $T = (V, E)$

Proposition 3.3.2. The output of Algorithm 1 satisfies the conditions of Definition 3.3.1, i.e., $T$ is a directed flow tree.
Proof. We have already seen that we obtain a $(2n+1)$-vertex tree structure, namely the dendrogram
$H$ arising from the agglomerative clustering. However, the leaves of $H$ are singleton clusters corresponding
to each of the locations, including the source; its root is the $(2n+1)$st cluster (containing all
locations). So the tree structure of $T$ cannot be that of $H$.
Consider first the leaves of $H$, the singleton clusters corresponding to locations $i = 0, \dots, n$. For each of
these we have a vertex $i$ in $T$; there will only be a single edge in $T$ incident at this vertex, with the other
end being some branch vertex $k$. For the target locations $i = 1, \dots, n$, we know $W[i] = f_i > 0$, so this edge
is directed $k \to i$. Thus the target locations have no out-edges and a single in-edge of weight $f_i$; they are
therefore the leaves of $T$.
Conversely, at the source $i = 0$ we have $W[0] = -\sum_{i=1}^{n} f_i < 0$, so its incident edge is directed $i \to k$. So it is
the root of $T$, with no in-edges and total weight over all out-edges (just the one!) equal to $|W[0]| = \sum_{i=1}^{n} f_i$
as required.
The remaining vertices $k = n+1, \dots, 2n$ are branch vertices. Note that any cluster $k$ other than the final one will have negative
weight $W[k]$ if and only if the source 0 is an element of the cluster (the final cluster $k = 2n$ contains every location, so $W[2n] = W[0] + \sum_{i=1}^{n} f_i = 0$). Thus the merging of two clusters
$i, j \in \{0, \dots, 2n\}$ to give some cluster $k$ will always give a balanced flow at branch vertex $k$. To see this,
suppose $W[k] = W[i] + W[j] < 0$. Then as 0 can only be in one of the clusters $i, j$, wlog $i$, we have
$W[i] < 0$, $W[j] \ge 0$, and thus an in-edge $i \to k$ of weight $|W[i]| = -W[i]$ and an out-edge $k \to j$ of weight
$W[j]$. But when $k$ is itself merged into some cluster $k'$ the edge created will be $k \to k'$ of weight $|W[k]| = -W[k] = -W[i] - W[j]$. So the total flow out is $W[j] + (-W[i] - W[j]) = -W[i]$ and total flow in is
$-W[i]$, so flow is balanced at $k$. The analysis for $W[k] > 0$ is much simpler; in this case neither cluster contains the source, so both $W[i]$ and
$W[j]$ are positive and we have out-edges $k \to i$, $k \to j$ with weights $W[i]$, $W[j]$ respectively, for a total
flow out of $W[k]$; when $k$ comes to be merged into some $k'$ the edge created will run $k' \to k$ for an in-flow
of $W[k]$, ensuring balance. Finally, the last cluster $k = 2n$ has an in-edge $i \to k$ of weight $-W[i]$ and an out-edge $k \to j$ of weight $W[j] = -W[i]$, and is never merged further, so it too is balanced. So all the conditions of Definition 3.3.1 are satisfied by $T$.
Remark 3.3.1. In practice (that is, the code given in Appendix A.1) the directionality of edges created is
always set from child cluster to parent; in effect, a weighted version of the dendrogram H. Suppressing
this directionality entirely still gives us a flow graph suitable for rendering, and tracking the hierarchy of
H is convenient for the re-wiring process described in section 3.3.5.
We may now turn our attention to issues of layout.
3.3.4 Flow graph layout
For each of the 2n +1 clusters considered during Algorithm 1 we assigned a bounding box and central
coordinates. However, it is only for the original n+1 locations that coordinates are fixed; we may place
the n branch vertices wherever we wish. As noted earlier in Proposition 3.3.1 the obvious ‘optimal’
choice just collapses to a star graph; ‘good’ positioning is therefore largely an aesthetic judgment.
In light of this, we make use of a D3 force layout, fixing the nodes corresponding to location vertices and
allowing the branch vertices to settle on positions in accordance with the physics processes described
in Section 3.2. A typical algorithmically-generated positioning is given in Figure 3.10. We build upon
this automated process by adding the ability for a user to manipulate the branch vertices interactively,
thus fine-tuning a layout before capturing a static version for print. An example of this is given in Figure
3.11, in which edge crossings have been eliminated - at the cost of much more chart ink!
Figure 3.10: Algorithmically-generated flow map of migration to Scotland. Parameters: gravity 0, friction 0.5, charge −20.
Figure 3.11: User-adjusted flow map of migration to Scotland.
However, we note from these examples an inherent limitation of the flow map generated from the hierarchy
given in Figure 3.9. By driving the clustering to completion we force the merger of distant
clusters: specifically the USA & Canada pair with the rest of the locations, which results in a branch
point that is hard to place without crossings. This problem was noted in [33]; it is a side effect of the
hierarchy root not being our desired source location. Thus in the repositioning shown in Figure 3.11 it
is the root of H - now located just off the coast of South Africa - that draws our attention, rather than
the massed flow at the source location in Scotland.
3.3.5 Rewiring
Modifying Algorithm 1 as per Remark 3.3.1, we have a flow graph $F$ corresponding to the dendrogram
$H$ from the hierarchical clustering. We wish to adapt this to one with the source vertex 0 as root; inspired
by [33] we do this by identifying the path from 0 to the root of $H$ and the clusters that attach to it. We
differ from their approach by collapsing the path into a single vertex, thus merging 0 with the original
root and giving it multiple child clusters, rather than performing a second clustering subject to the new
merging rule they detail. The pseudo-code description is given in Algorithm 2.
Algorithm 2: Flow Graph rewiring
Input: $V$ = {Source vertex 0, target vertices $1, \dots, n$, branch vertices $n+1, \dots, 2n$}.
Input: Weight function $w$ and edge set $E$ such that $H = (V, E)$ is a weighted directed dendrogram: $(i \to j \in E) \Rightarrow j > i$.
Output: A flow graph $F = (V', E')$ rooted at the source vertex.
Initialise:
    Construct vector X of length $2n+1$ and all entries zero
    Set $V' = E' = \emptyset$
    Set $i = 0$
Identify vertices on Source→root path for exclusion:
while $i < 2n$ do
    Find $j$ such that $i \to j$ is in $E$
    Set X[j] = 1
    Set $i = j$
Form new vertex set:
for $i = 0, \dots, 2n$ do
    if X[i] = 0 then
        Add $i$ to $V'$
Form new edge set:
for $e = i \to j$ in $E$ do
    if X[i] = 0 then
        if X[j] = 1 then
            Rewire cluster $i$ to source:
            Add edge $i \to 0$ with weight $w(e)$ to $E'$
        else
            Edge within a surviving cluster:
            Add edge $i \to j$ with weight $w(e)$ to $E'$
return $F = (V', E')$
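The rewiring step can also be sketched in Python. The function name and the (child, parent, weight) edge format are assumptions for illustration, following the child-to-parent convention of Remark 3.3.1. One further assumption is labelled in the code: the source's own edge to its parent is dropped rather than rewired, since read literally the pseudocode would turn it into a self-loop 0 → 0.

```python
def rewire(n, edges):
    """Sketch of Algorithm 2: collapse the source-to-root path onto vertex 0.

    n:     number of targets; vertices are 0..2n and the root of H is 2n.
    edges: list of (child, parent, weight) with parent > child.
    Returns (vertices, new_edges) for the rewired flow graph.
    """
    parent = {child: par for child, par, _ in edges}

    # Mark the branch vertices on the path from the source up to the root.
    on_path = set()
    i = 0
    while i < 2 * n:
        i = parent[i]
        on_path.add(i)

    vertices = [v for v in range(2 * n + 1) if v not in on_path]

    new_edges = []
    for child, par, weight in edges:
        if child in on_path:
            continue  # edge leaving a collapsed vertex: dropped
        if child == 0:
            continue  # ASSUMPTION: skip the source's own edge (would be 0 -> 0)
        if par in on_path:
            new_edges.append((child, 0, weight))    # rewire cluster to source
        else:
            new_edges.append((child, par, weight))  # edge inside a surviving cluster
    return vertices, new_edges


# Hypothetical dendrogram on n = 3 targets, child -> parent edges.
H_edges = [(2, 4, 3), (3, 4, 2), (0, 5, 10), (1, 5, 5), (4, 6, 5), (5, 6, 5)]
vertices, F_edges = rewire(3, H_edges)
```

In this example the path 0 → 5 → 6 is collapsed, so target 1 and the cluster rooted at vertex 4 attach directly to the source, while the internal branching of cluster 4 is kept.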
Figure 3.12: Re-assigning root for sources of migration to Scotland.
The red path from Scotland to the dendrogram root will be collapsed into a single vertex, with child
clusters {DE,FR,IT,PL,ES}, {IN,PK,CN,HK,AU}, {US,CA}, {ZA}, {NG} and {IE}, within which the
dendrogram branching will be maintained.
Applying this process to the Scottish migration data, we see (from Figure 3.12) that six child clusters are
created, ranging from single locations to larger groupings with recognisable continental structure:
North American, Asian & Australasian, and European clusters. By anchoring each of these to Scotland, we
avoid the need for intercontinental edges to difficult-to-place branch points as encountered in Figures
3.10 and 3.11. Effectively, we are interpolating between the star graph (guaranteed to have no crossings,
but with no aggregation of flow from branches) and the dendrogram from our initial clustering. Figure
4.1 in Chapter 4 illustrates the result.
3.4 Chord Diagrams
In the previous Section we were concerned with presenting the flows between a single source location
and multiple targets. However, we may be interested in the flows between all possible pairs of loca-
tions; that is, visualising general weighted graphs. The layout processes described in Section 3.2 are
one approach, but for dense graphs (those with many edges) we may wish to prescribe a particular
positioning of the vertices. Despite maximising crossings, arranging the vertices (or at least a majority
of them) on a circle is a popular option that can produce striking images such as Figure 3.13; many
more examples can be found throughout [21], particularly in the sections on communication/social
networks from Chapter 4, or the ‘radial convergence’ sections of Chapter 5.
Figure 3.13: Visualizing information flow in science. (Adapted from [27], or see p. 103 of [21].)
A further complication arises when the graph is directed as well as weighted, since now in principle we
may have two edges between each pair. However, through the use of a chord diagram the flow in each
direction can be accounted for in a single stroke. Let G be an n-vertex weighted directed graph with
non-negative adjacency matrix $A$, such that the diagonal entries $A_{ii}$ are all zero (no flow from a vertex
to itself). Then we can construct a chord diagram by first partitioning a radius-$r$ circle into $n$ segments,
with the length of the $i$th segment proportional to the total flow out of vertex $i$; that is, segment $i$ has
length
$$2\pi r \, \frac{\sum_{j=1}^{n} A_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}}.$$
The chord between $i$ and $j$ is then drawn with width $2\pi r A_{ij}/S$ at segment $i$ and width $2\pi r A_{ji}/S$ at segment
$j$, where $S = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}$ is the total flow, so that the chord ends at each segment exactly fill it. Thus the chord will taper in accordance with the net flow between the two. As implemented in D3
(such as the example given in Figure 3.14, lightly adapted from stock examples) this dominant direction
is also indicated by colouring the chord to match that of the end segment with greater out-flow.
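The segment and chord sizes can be computed directly from the adjacency matrix. The sketch below is a plain Python illustration and not the D3 chord layout itself; the function name and the sample matrix are hypothetical, and each width is normalised by the total flow so that the chord ends at a segment sum to that segment's length.

```python
import math


def chord_layout(A, r=1.0):
    """Arc length of each segment and width of each chord end.

    A: square matrix (list of lists) of non-negative flows, zero diagonal.
    Returns (segments, widths): segments[i] is the arc length of segment i and
    widths[i][j] is the width at segment i of the chord joining i and j.
    """
    circumference = 2 * math.pi * r
    total = sum(sum(row) for row in A)
    segments = [circumference * sum(row) / total for row in A]
    widths = [[circumference * a / total for a in row] for row in A]
    return segments, widths


# Hypothetical flows between three locations.
A = [[0, 3, 1],
     [1, 0, 2],
     [2, 1, 0]]
segments, widths = chord_layout(A)
```

Here the chord between locations 0 and 1 is three times as wide at segment 0 as at segment 1, reflecting a net flow from 0 to 1.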
Figure 3.14: Chord diagram of migration within Scotland 2011-2012.
We note from Figure 3.14 that in print form chord diagrams are more useful for capturing high level
trends than conveying specific detail - even with an understanding of the colour and sizing conven-
tions and the addition of a scale, it would be difficult to read off precise values. The situation is im-
proved as an interactive visualisation: tool-tips can convey the value associated with a single segment
or chord; whilst a single location can be investigated more easily by suppressing the chords from other
segments on user selection of a particular one. Both of these properties are illustrated in Figure 3.15
(or see the interactive version detailed in Appendix A.8).
Figure 3.15: Chord diagram of migration between Scottish councils and the rest of the UK 2011-2012.
Treating each chord diagram as a data slice for a particular parameter (such as time), change along
this extra dimension can be easily indicated in a visualisation by dynamic resizing of the segments
(colour and order should be preserved for clarity). The recent work [1] gives an effective example of
this. Their approach, described in [2], also shows how to tackle data sets with internal flow (that is,
non-zero diagonal entries in the adjacency matrix, or a loop in the graph) by offsetting the start
point of a chord, as in their Figure 3.16. This also shows how to summarise both in- and out-flows for
each region; the example is based on hypothetical data, however, and we note that this introduces
extra clutter whilst placing further interpretative burden on the user. Indeed, for their real data set as
visualised in [1] they suppress this feature.
Figure 3.16: Chord diagrams with internal flow.
Chapter 4
Portfolio
In this Chapter, we present a selection of the visualisations (not all successful!) developed from NRS
data sets. Since the visualisations presented are generally interactive, we encourage experimentation
with the ‘live’ versions; see the electronic appendices (Appendix A), which also contain further examples either of
mathematically simpler content or visualisations already discussed in earlier chapters.
4.1 Migration Flow Map
Figure 4.1 shows the final visualisation developed using the clustering and rewiring techniques dis-
cussed in Section 3.3.
4.2 The Cause of death explorer
As part of the vital events reference tables [15], the General Register Office publishes annual statistics
on causes of death. This visualisation is based upon Table 6.1, which tabulates the total male and female
deaths from particular causes each year 2002-2012; Figure 4.2 shows a typical portion of the data
file. From this, we see that rather than a flat list of every possible cause, the reporting of deaths col-
lects them into increasingly broad categories, forming a hierarchy from individual causes of particular
interest up to all deaths. Thus one slice through the data is to fix a year and consider the distribution
of deaths for that year across the categories. The other direction is to fix a category, and see how the
number of deaths from those causes has varied through time.
Figure 4.1: User-adjusted rewired flow map of migration to Scotland.
Figure 4.2: Cause of death data.
The visualisation produced enables both of these slices to be explored simultaneously, using linked
charts; an example of it in action is given in Figure 4.3. The bulk of the area is devoted to a representation
of the category counts for a given year; due to the hierarchical data structure, this is a tree which
we can arrange using the graph layout techniques discussed in Section 3.2. Each circular data vertex
has an area proportional to the deaths from the causes it represents, for males, females or both depending
on user selection. However, if a data vertex represents a category for which sub-categories are
reported on in Table 6.1 of [15], then it can be expanded to reveal child vertices for each sub-category.
The original vertex is then re-sized and re-coloured to indicate its non-data status; the area of the original
data vertex is distributed appropriately across its children. This improves graphical integrity, as
otherwise the total area presented varies with tree depth explored, and a deep branch of the tree would
appear disproportionately important. As a further aid to comparison, tooltips give the precise count
for a category on hovering above its corresponding vertex.
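Because it is the area of a vertex, not its radius, that is proportional to the count, the radius must scale with the square root of the count. The sketch below illustrates this and the way a parent's area is shared among its children; the function name, the `area_per_death` scaling parameter and the counts are hypothetical and are not taken from the visualisation's code or from Table 6.1.

```python
import math


def radius_for_count(count, area_per_death=1.0):
    """Radius of a circle whose area is area_per_death * count."""
    return math.sqrt(area_per_death * count / math.pi)


# Hypothetical category with 100 deaths split over three sub-categories.
parent_count = 100
child_counts = [60, 30, 10]

parent_area = math.pi * radius_for_count(parent_count) ** 2
child_areas = [math.pi * radius_for_count(c) ** 2 for c in child_counts]
```

Since the child counts sum to the parent count, the child areas sum to the parent's area, so expanding a vertex leaves the total data area on screen unchanged.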
Initially, only a single vertex - representing all deaths for the selected year - is presented. The user
can then explore the tree by expanding or contracting nodes, allowing the data displayed to be driven
by their level of interest; the fully expanded tree is a potentially overwhelming start point and makes it
difficult to locate particular causes. The basic layout algorithms in D3 have been built upon to be aware
of the vertex sizes (preventing overlap) and the bounding box; the user may also reposition and lock
vertices in place, with other unlocked vertices adjusting. For a consistent mental map, vertex positions
are preserved as much as possible when updating their size on selection of a different year or gender
category; successive expansion - contraction - expansion should also place child vertices in consistent
positions.
Vertex selection within the graph also drives the other two charts included, presenting the by-cause
slice through the data. The time series for deaths from the chosen cause for each gender across all
years in the data set is given by the line graph that occupies most of the bottom third. With both a year
and a cause specified, we can also assess the gender ratio, which is illustrated by the pie chart to the
left (and which updates with changes to the year).
To reduce clutter, shortened category names are used in the graph layout; these can be swapped for the
ICD10 category codes for users familiar with the data set, or turned off entirely. We therefore devote
the remaining portion of the visualisation to a text field, which gives the precise category description
as per Table 6.1 of [15]. In Figure 4.2, we see that some entries have footnotes which are important
to integrity. For instance, the selection shown in Figure 4.3 is for poisonings, which appear to show a
sudden increase in 2011; Table 6.1 of [15] notes that this is due to a change in the category definition,
and thus so should the visualisation. An additional text field for per-category commentary is therefore
included for this purpose.
The data structure upon which the visualisation is built recreates Table 6.1 of [15], but the visualisation
is not limited to precisely that data set. In particular, values for additional years can be added to the
data lines and the visualisation will adjust to include them; a different selection of categories, including
a deeper or shallower tree structure, can also be specified purely at the data level rather than being
hardcoded into the graph structure. The figures for both genders are also computed automatically
from the entries for males and females.
4.3 Cause of Death Treemap
An earlier attempt at visualising the cause of death data from [15] is the zoomable tree map illustrated
in Figure 4.4. This uses the squarified algorithm described in Section 2.2; due to the stability issues
discussed, only a single slice-by-year is presented (the most recent figures, from 2012, although the
visualisation will adapt to any suitably formatted data file). By treating the counts for males and fe-
males as the leaf nodes for each category, we can convey both the aggregate totals and the values by
gender. Interaction allows for user-driven exploration, making maximal use of the display space for the
currently chosen category. Continuity between levels is assisted through the colour scheme - each of
the top level categories is assigned a colour, and further down the tree categories are assigned a small
random variation on the colour of their parent (as the second image in Figure 4.4 shows).
However, this visualisation reveals two limitations of area-based representations of size. Firstly, there
are obvious difficulties with labelling - not helped by the more obscure causes of death having longer
names but claiming less area to fit the label in. Tooltips offer some assistance - and can be used to give
the precise count - but it is possible for a child category to have such a low count that it is assigned
essentially no area. For instance, influenza accounts for less than 0.3% of deaths related to diseases
of the respiratory system, and whilst with the cause of death explorer a minimal vertex size can be
enforced, here the data necessarily takes priority.
Figure 4.3: The Cause of Death Explorer.
Figure 4.4: Treemaps of cause of death data. Two different levels of the hierarchy are shown; selecting a portion of the tree map ‘zooms’ to a treemap of its children.
The fixed representation area also introduces issues of data integrity, as the meaning (in number of
deaths) of a given area will vary depending on the total count for all causes currently on screen. Thus
a 50 : 50 ratio of two causes will appear the same regardless of whether they both claimed 10 or 1000
lives.
For these reasons, plus some technical challenges in implementing the D3 treemap on certain browsers,
we recommend the collapsible graph layout used in the cause of death explorer over the zoomable tree
map when working with hierarchical data sets.
4.4 Distributions with cohort effects: Fertility Data
As part of the vital events reference tables [15], the General Register Office publishes annual statistics
on fertility rates. This visualisation is based upon Table 3.6, which tabulates the total number of live
births per 1000 female population from 1973 to 2012 (plus 1951 and 1964), by age of mother in years.
Figure 4.5 shows a typical static visualisation of this data from the Annual Review [29]. Although 42
years of data are available, only five are shown, and this is already causing clutter and requires reference
to a key. Moreover, this is only one slice through the data set; there is also a ‘cohort effect’, in that we
may wish to track the changing fertility through time of women who share a birth year (i.e., as they
age).
Figure 4.5: Live births per 1,000 women, by age, selected years. (Originally Figure 2.4 of [29].)
Our interactive visualisation, illustrated in Figure 4.6, resolves both of these concerns. By allowing user
selection of the year, we need only show a single distribution. However, comparison between years
is made possible by animated transition from one distribution to another on selection of a new year.
‘Movie’ playback of the distributions in time order (from any starting year) chains these transitions
together to create a single smooth animation. Moreover, the transitions take account of the cohort
effect; it is not strictly accurate to say that 25 year old women in 2011 had a higher fertility rate than
they did in 2010. Rather, it is the case that women born in 1985 had fewer children at age 25 (in 2010)
than those born in 1986 did (in 2011). Thus the transition is shown as progression of a birth year cohort,
with a horizontal component, rather than just a rise or fall within an age category. This cohort effect is
also indicated by the colour coding - each cohort has a consistent colour which can be easily tracked
as the year varies, with no two visible cohorts sharing a colour. Their birth year is indicated by the
tool-tip, which also gives the precise birth rate figures for the active year.
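The cohort bookkeeping behind these transitions amounts to simple arithmetic on year and age. The following sketch assumes the rates are held as a nested mapping from year to age to rate, and approximates birth year as year minus age; the function names and the rate values are hypothetical and are not taken from Table 3.6.

```python
def birth_year(year, age):
    """Approximate birth year of women of a given age in a given year."""
    return year - age


def cohort_series(rates, cohort):
    """Follow one birth-year cohort through time.

    rates:  dict year -> dict age -> fertility rate (assumed layout).
    cohort: birth year to follow.
    Returns a list of (year, age, rate) in year order.
    """
    return [(year, year - cohort, by_age[year - cohort])
            for year, by_age in sorted(rates.items())
            if (year - cohort) in by_age]


# Hypothetical rates for two years and a few ages.
rates = {
    2010: {24: 70.0, 25: 80.0},
    2011: {25: 82.0, 26: 90.0},
}
series_1985 = cohort_series(rates, 1985)
```

Women aged 25 in 2010 belong to the 1985 cohort and those aged 25 in 2011 to the 1986 cohort, so following the 1985 cohort from 2010 to 2011 moves one step along the age axis, from age 25 to age 26.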
Table 3.6 of [15] also reports the average age of mother at childbirth for each year, which we indicate
by a moving line; this also updates dynamically, with the animation between years again serving as a
visual shorthand for significant changes (or the lack thereof).
The visualisation supports arbitrary sets of consecutive years, so can be easily modified to include
future fertility data.
Figure 4.6: Interactive visualisation of fertility data.
Chapter 5
Conclusions
Throughout this project, we have sought to identify best practice in visualisation of data, implemen-
tation of which has raised a number of mathematical challenges. These have been met through the
creation of a variety of interactive visualisations based on NRS data, which will add to their existing
online resources. To conclude, we will summarise the key points identified and results of the project.
We began with an examination of principles for effective visualisation of data, and considered how
they should best be applied in interactive rather than static visualisations. In particular, we noted
• the use of data slices to reduce dimensionality
• the importance of being able to generate visualisations algorithmically, based on changing data;
and for design, functionality and data to be separated so as to enable rapid re-purposing of the
developed visualisations for new data sets
• potential risks to graphical integrity - avoiding conveying the wrong meaning through mislead-
ing or distorted visual cues such as area-based representation of quantity, inappropriate scales,
lack of labeling, or inconsistency across representations
• relatedly, the appropriate use of colour for either categorical or ordinal data
• the opportunity - and risk - for animation of transitions between data sets to emphasise change.
From this foundation, we turned our attention to several popular (static) methods of visualising data,
and considered how useful properties could be preserved or enhanced in moving to an interactive
setting. The small multiple added to our understanding of dimension reduction. Treemaps gave our
first example of how desirable properties from a design standpoint lead to mathematical optimisation
problems - and how different objectives can radically alter the nature of the images produced. We
identified that choropleth maps were at odds with some of our principles of graphical integrity when
presenting migration data. Finally, we considered the psychology of frequency plots as illustration of
the potential for visualisation to aid reasoning about mathematical concepts.
These investigations prepared us for more sophisticated visualisation techniques, and correspondingly
more difficult mathematical problems. The key to these was identifying that data sets could often be
associated with a graph-theoretic structure. Given a graph with data values associated with its vertices,
we demonstrated how the positioning of appropriately-sized vertices in a representation of the graph
structure could be posed as an optimisation task, and produced visualisations that dynamically solve
this problem. From the question of graph layout, we turned to the issue of ‘which graph?’. To remedy
deficiencies in the choropleth map of migration data, we developed a clustering algorithm for the
construction of rooted trees from weighted geographic data to allow automated generation of visually-
pleasing flow maps. Finally, we illustrated how general adjacency matrices of directed graphs could be
visualised by chord diagrams, with application to bi-directional migration flows.
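To make the adjacency-matrix view of migration flows concrete, the following is a minimal sketch in JavaScript of how a list of directed flows might be assembled into the square matrix that a chord diagram is drawn from. The record shape `{from, to, count}`, the function name `buildFlowMatrix` and the figures used are assumptions for illustration only; they are not taken from the project's data files or code.

```javascript
// Illustrative only: the record shape {from, to, count} and the sample
// figures below are assumptions, not NRS data or the project's own format.
function buildFlowMatrix(flows) {
  // Collect area names in first-seen order so rows and columns share one index.
  const names = [];
  const index = new Map();
  for (const { from, to } of flows) {
    for (const name of [from, to]) {
      if (!index.has(name)) {
        index.set(name, names.length);
        names.push(name);
      }
    }
  }
  // matrix[i][j] accumulates the total flow from area i to area j.
  const matrix = names.map(() => names.map(() => 0));
  for (const { from, to, count } of flows) {
    matrix[index.get(from)][index.get(to)] += count;
  }
  return { names, matrix };
}

const example = buildFlowMatrix([
  { from: "A", to: "B", count: 120 },
  { from: "B", to: "A", count: 80 },
  { from: "A", to: "C", count: 30 },
]);
```

Because the matrix is not required to be symmetric, the two directions of flow between a pair of areas are kept separately, which is what allows a chord diagram to show bi-directional migration.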
Combining these two themes - design principles, and mathematical algorithms that enable them to be
achieved - a variety of interactive, data-driven visualisations were built for the NRS. These are in the
process of being made available on their website; in line with our design goals, they are constructed
to remain useful after the end of this project, since each can be updated easily. A number of the
more mathematically sophisticated visualisations - and the design issues they raise - were illustrated
in the portfolio, and further examples (with applications to NRS data) are given in the appendices.
Bibliography
[1] G. Abel, R. Bauer and N. Sander The Global Flow of People http://www.global-migration.info, 2014.
[2] G. Abel, R. Bauer, N. Sander and J. Schmidt Visualising Migration Flow Data with Circular Plots Vienna Institute of Demography Working Papers, 02/2014.
[3] D. Borland and R. Taylor Rainbow Color Map (Still) Considered Harmful IEEE Computer Graphics and Applications Vol. 27 Iss. 2 p14-17 2007.
[4] M. Bostock, J. Heer and V. Ogievetsky D3: Data-Driven Documents IEEE Transactions on Visualization and Computer Graphics, IEEE Press, October 2011.
[5] C. Brewer and M. Harrower ColorBrewer 2.0 http://colorbrewer2.org.
[6] M. Bruls, K. Huizing and J. van Wijk Squarified treemaps Joint Eurographics and IEEE TCVG Symposium on Visualization, IEEE Computer Society, 33-42, 2000.
[7] N. Chiba, T. Nishizeki, S. Abe and T. Ozawa A Linear Algorithm for Embedding Planar Graphs Using PQ-trees, Journal of Computer and System Sciences 30(1): 54-76, 1985.
[8] W. Kremer Do doctors understand test results? http://www.bbc.co.uk/news/magazine-28166019 July 2014.
[9] T. Dwyer, Y. Koren and K. Marriott IPSep-CoLa: an Incremental Procedure for Separation Constraint Layout of Graphs, IEEE Transactions on Visualization and Computer Graphics 12, 5:821-828, 2006.
[10] D. Doemeland and J. Trevino Which World Bank reports are widely read? Policy Research Working Paper WPS6851, 2014.
[11] T. Dwyer Scalable, Versatile and Simple Constrained Graph Layout, Computer Graphics Forum 28:991-998, 2009.
[12] B. Fry Visualizing Data O’Reilly Media, Inc, Sebastopol, California, First Edition 2007.
[13] M. Garey and D. Johnson, Crossing Number is NP-Complete, SIAM J. on Algebraic and Discrete Methods, 4(3):312-316, 1983.
[14] General Register Office for Scotland Centenarians in Scotland, 2002 to 2012 including mid-year population estimates for those aged 90 and over, 21 March 2014.
[15] General Register Office for Scotland Vital Events Reference Tables http://www.gro-scotland.gov.uk/statistics/theme/vital-events/general/ref-tables/index.html Latest edition (2012).
[16] Google Material Design http://www.google.com/design/spec/material-design.
[17] A. Jain, M. Murty and P. Flynn Data Clustering: A Review, ACM Comput. Surv., ACM Press, 31:264-323, 1999.
[18] B. Johnson and B. Shneiderman Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures Proc. of ACM CHI’86, Conference on Human Factors in computing systems 16-23, 1986.
[19] T. Kamada and S. Kawai, An Algorithm for Drawing General Undirected Graphs, Information Processing Letters 31:7-15, 1989.
[20] M. Kazi and B. Shneiderman The Treemap Art Project, http://treemapart.wordpress.com/ 2013.
[21] M. Lima Visual Complexity: Mapping Patterns of Information, Princeton Architectural Press, New York, 1st ed. 2011.
[22] M. Maclean D3 Tips and Tricks https://leanpub.com/D3-Tips-and-Tricks.
[23] D. McCandless Information is Beautiful William Collins (Edition of 2012).
[24] Media Matters for America A History Of Dishonest Fox Charts http://mediamatters.org/research/2012/10/01/a-history-of-dishonest-fox-charts/190225 October 2012.
[25] C. J. Minard, Tableaux Graphiques et Cartes Figuratives de M. Minard, 1845-1869, a portfolio of his work held by the Bibliothèque de l’École Nationale des Ponts et Chaussées, Paris.
[26] S. Murray Interactive Data Visualization for the Web O’Reilly Media, March 2013.
[27] M. Stefaner Visualizing Information Flow in Science http://well-formed.eigenfactor.org, 2009.
[28] National Records of Scotland About Us http://www.nrscotland.gov.uk/about-us (retrieved August 2014).
[29] National Records of Scotland Scotland’s Population 2012: The Registrar General’s Annual Review of Demographic Trends 158th Edition SG/2013/208, 17 October 2013.
[30] National Records of Scotland Scottish life expectancy at its highest ever level http://www.nrscotland.gov.uk/news/2014/scottish-life-expectancy-at-its-highest-ever-level April 2014.
[31] National Records of Scotland Wide variation in life expectancy between areas in Scotland http://www.nrscotland.gov.uk/news/2014/wide-variation-in-life-expectancy-between-areas-in-scotland April 2014.
[32] Ohio State University Department of Political Science D3: Zoomable Treemap Explained https://secure.polisci.ohio-state.edu/faq/d3/zoomabletreemap_code.php (retrieved August 2014).
[33] D. Phan, L. Xiao, R. Yeh, P. Hanrahan, T. Winograd Flow Map Layout, IEEE Information Visualization (InfoVis), 219-224, 2005.
[34] W. Sanford and D. Selnick Estimation of Evapotranspiration Across the Conterminous United States Using a Regression With Climate and Land-Cover Data JAWRA Journal of the American Water Resources Association Vol. 49 Iss. 1 pages 217-230 2013.
[35] B. Shneiderman Treemaps for space-constrained visualization of hierarchies, http://cs.umd.edu/hcil/treemap-history/.
[36] M. Tanaka C3.js | D3-based reusable chart library c3js.org.
[37] K. Temple (Intel) What Happens in an Internet Minute? http://scoop.intel.com/what-happens-in-an-internet-minute/, March 2012.
[38] W. Tobler Experiments in Migration Mapping by Computer, American Cartographer, 1987.
[39] W. Tobler Movement Mapping. http://csiss.ncgia.ucsb.edu/clearinghouse/FlowMapper/ 2004.
[40] E. Tufte Envisioning Information Graphics Press, Cheshire, Connecticut, January 1990.
[41] E. Tufte The Visual Display of Quantitative Information, Graphics Press, Cheshire, Connecticut Sixteenth printing, January 1998.
[42] Understanding Uncertainty Screening tests (Breast screening) http://understandinguncertainty.org/files/animations/BayesTheorem1/BayesTheorem.html.
[43] The White House The 2011 State of the Union Address: Enhanced Version http://youtu.be/kl2g40GoRxg.
[44] Youtube Press: Statistics https://www.youtube.com/yt/press/en-GB/statistics.html (retrieved August 2014).
Appendix A
Guide to Electronic Appendices
These appendices describe the files supplied on the attached CD; the file index.html should be opened
in a web browser to navigate to any of the collections described below. The files are also mirrored online
at http://maths.straylight.co.uk/mscapp/; depending on system configuration (in particular,
for browsers other than Firefox), use of the latter may be required for access to the visualisations,
and is therefore recommended. Alternatively the appropriate source files can be copied to a webserver
(for instance, to access over an intranet); this also allows for modification to data files to observe the
effects.
For all files, the link to ‘view’ is the original file, and ‘source’ is a typeset version that can be viewed in the browser. In this way, the source of any filetype can be viewed regardless of underlying system support. Note that for html files, following the ‘view’ link will load the corresponding webpage (typically the visualisation); if the working source rather than typeset version is required, then the original should be accessed using ‘save link as..’ on the ‘view’ link.
A.1 Flow Map Construction
The file buildWeightedTree.java provides an implementation in Java of Algorithm 1, with the modification described in Remark 3.3.1. This relies on the helper class Graph.java, which implements Algorithm 2. Geographic data should be supplied by the text file graph.in; output is written in JSON format to graph.json. The latter can then be used in conjunction with the D3 visualisation flowmap.html.
A.2 Cause of Death Explorer
The visualisation described in Section 4.2.
The D3 visualisation is given in CODE.html, drawing on data files initial.json and scotland-other.json.
A.3 Cause of Death Zoomable Treemap
The visualisation described in Section 4.3.
The D3 visualisation is given in codzoom.html, drawing on data file codzoom.json. The treemap code is from [32], with the addition of code for random perturbation of colours specified as hex triples, plus tool-tips.
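As an illustration of what random perturbation of a hex-triple colour could look like, the sketch below shifts each of the red, green and blue channels by a bounded random amount and clamps the result to the valid range. It is not the code in codzoom.html: the function name `perturbHex`, the `#rrggbb` input format, the `maxShift` parameter and the injectable `random` argument are all assumptions made for this example.

```javascript
// Hypothetical helper, not the project's code. Assumes colours are written
// as "#rrggbb"; maxShift bounds the change applied to each channel.
function perturbHex(hex, maxShift, random = Math.random) {
  // Split "#rrggbb" into three integer channels in the range 0..255.
  const channels = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16));
  const shifted = channels.map((c) => {
    // Draw a shift in [-maxShift, maxShift] and clamp the result to 0..255.
    const delta = Math.round((random() * 2 - 1) * maxShift);
    return Math.min(255, Math.max(0, c + delta));
  });
  // Re-encode as a six-digit hex triple.
  return "#" + shifted.map((c) => c.toString(16).padStart(2, "0")).join("");
}

// Sibling cells that share a base colour receive slightly different shades.
const shades = [0, 1, 2].map(() => perturbHex("#3366cc", 12));
```

Passing the random source as a parameter is a design choice for this sketch: it keeps the default behaviour random while allowing a fixed function to be supplied when reproducible colours are wanted.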
A.4 Fertility Data (cohort effects)
The visualisation described in Section 4.4.
The D3 visualisation is given in fertility.html, drawing on data file fertility.json.
A.5 Popular Names
A visualisation of popular baby names; this also demonstrates the potential for data-driven visualisations. Two data files - for names of boys (boys_top20_noclash.json) and girls (girls_top20_noclash.json) - are given, and the two visualisations - bnames.html and gnames.html - differ in only a single line, where the data file to be used is specified.
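The idea of two pages differing in a single line can be sketched as follows. This is a hedged illustration rather than the contents of bnames.html: the constant `DATA_FILE`, the function `rankNames` and the record shape `{name, count}` are assumptions, since the layout of the JSON files is not described here. Only the file names come from the text above.

```javascript
// The single setting that would differ between the two pages in this sketch.
const DATA_FILE = "boys_top20_noclash.json"; // or "girls_top20_noclash.json"

// Hypothetical record shape {name, count}; the real file layout may differ.
// Everything below is independent of which data file is chosen.
function rankNames(records, limit) {
  return records
    .slice() // copy, so the caller's array is not reordered
    .sort((a, b) => b.count - a.count) // most frequent name first
    .slice(0, limit)
    .map((r, i) => ({ rank: i + 1, name: r.name, count: r.count }));
}

const ranked = rankNames(
  [
    { name: "x", count: 1 },
    { name: "y", count: 5 },
    { name: "z", count: 3 },
  ],
  2
);
```

In a browser, the records would be loaded from `DATA_FILE` (for instance with D3's JSON loader) before being passed to the drawing code, so that swapping the data set requires no other change.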
A.6 Life Expectancy
An interactive version of an existing NRS visualisation [31], allowing any two council areas to be selected for comparison, and animating changes. The Scottish average is also provided throughout.
The D3 visualisation is given in lifeexp.html, drawing on data file lifeexp.json.
A.7 Gender distribution by age (Frequency plot)
An interactive version of an existing NRS visualisation [30], allowing any age category to be selected, and animating changes. A template for a two-variable version (illustrating how higher dimensional slices can be taken) has also been produced.
The D3 visualisation is given in frequency.html, drawing on data file frequency.csv. The template for two variables is frequency2.html, which uses the (synthetic) data in frequency2.json.
A.8 Migration within Scotland (Chord Diagram)
The implementation illustrated in Figures 3.14, 3.15.
The D3 visualisation is given in chord.html, drawing on data files immig13.json and immig13-regions.csv.