PROTEUS Scalable online machine learning for predictive analytics and real-time
interactive visualization
687691
D5.1 Visualization Requirements for Massive Online Machine Learning Strategies
Lead Author: Abdelhamid Bouchachia
With contributions from: Rubén Casado
Reviewer: Marco Laucelli
Deliverable nature: Report (R)
Dissemination level (Confidentiality): Public (PU)
Contractual delivery date: M06 (05/2016)
Actual delivery date: M06 (05/2016)
Version: 4.0
Total number of pages: 35
Keywords: Visualisation, interactivity, online machine learning, requirements, architecture
PROTEUS Deliverable D5.1
687691 Page 2 of 35
Abstract
The present report discusses the visualisation of big data, which is characterised by volume, velocity and variety. Despite its importance, visual analytics is still in its infancy, especially for big data, and is scattered among various disciplines. Visualisation and visual analytics are highly relevant for both experts (data scientists) and end-users who wish to understand the behaviour of machine learning algorithms and how the latter make decisions. Specifically, visualising the details, tuning the parameters, changing the training data and tracking back the results (cause-effect) are aspects that appeal to every user of data. This is particularly desired in order to ensure transparency and thus white-box machine learning.
In this report, we reflect on these requirements by discussing the value of visualisation and the existing visualisation techniques and tools described in the literature for selected computational models used in online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. These requirements mostly address the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.
Executive summary
Visualization is relevant to many broad areas of science, such as imaging, graphics, and man-machine interfaces for various scientific disciplines including machine learning and data mining. It is about representing information in an interpretable graphical form, taking into account the available plotting space. There are basic graphical elements that each representation uses, such as points, lines, shapes, images, text and areas, and there are attributes associated with these elements such as colour, intensity, size, position, shape and motion [1]. The importance of visualisation stems not only from its relevance but also from its versatility in manipulating various types of information such as raw and processed data, processes, relationships, concepts, bodies, etc. Interestingly, visualisation can be animated and interactive, allowing the user to control and manipulate the representations (including the elements and their attributes), often following the mantra "overview first, zoom and filter, then details-on-demand"1. This requires interaction, a basic modality in any analysis or illustration, as it allows the user to manipulate the representations at will.
As specified in [2], among the top 10 challenges in extreme-scale visual analytics, large data visualisation, scalable algorithms and data movement are the most prominent. Thus, from a machine learning perspective, visualization allows both the expert (data scientist) and the end-user to understand the data, the algorithm (model) being used and its individual components, its evolution, and its output upon presentation of input, thereby allowing a better insight into the decision-making process. The overall aim of visualisation through interaction is therefore knowledge discovery, process explanation, and decision making [3]. These should allow for transparency and interpretability, which have been regarded as important research topics in statistics, machine learning and artificial intelligence for many years. In a nutshell, for the data scientist, the key questions that can be addressed through visualisation mostly relate to the verification and validation of theoretical models (often based on statistical assumptions, e.g., Gaussianity) on real-world data, to how the model gets updated upon seeing new data, and to why and how the results are generated by the model.
While building transparent machine learning tools that address those key questions is clearly relevant to static data of moderate size, it is even more vital for big data because of its volume, velocity and variety. The dynamic and streaming aspect of big data is perhaps the one that requires the most attention compared to existing techniques, since both the expert and the user need to understand the behaviour in (pseudo) real-time as data streams flow into the model.
In this report, we reflect on these requirements by discussing the value of interactive visualisation and the existing visualisation techniques and tools described in the literature for selected computational models used in online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. The focus is placed on online analytics techniques such as sketches, clustering, classification and regression.
These requirements mostly address the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.
1 https://www.recordedfuture.com/information-seeking-mantra/
Document information
IST Project: Number 687691, Acronym PROTEUS
Full Title: Scalable online machine learning for predictive analytics and real-time interactive visualization
Project URL: http://www.proteus-bigdata.com/
EU Project Officer: Martina EYDNER
Deliverable: Number D5.1, Title: Visualization requirements for massive online machine learning strategies
Work Package: Number WP5, Title: Real-time interactive visualization
Date of Delivery: Contractual M06, Actual M01
Status: version 4.0, final
Nature: report
Dissemination level: public
Responsible Author: Abdelhamid Bouchachia, E-mail: [email protected]
Partner: BU, Phone: 01202962401
Abstract (for dissemination): The present report discusses the visualisation of big data, which is characterised by volume, velocity and variety. Despite its importance, visual analytics is still in its infancy, especially for big data, and is scattered among various disciplines. Visualisation and visual analytics are highly relevant for both experts (data scientists) and end-users who wish to understand the behaviour of machine learning algorithms and how the latter make decisions. Specifically, visualising the details, tuning the parameters, changing the training data and tracking back the results (cause-effect) are aspects that appeal to every user of data. This is particularly desired in order to ensure transparency and thus white-box machine learning.
In this report, we reflect on these requirements by discussing the value of visualisation and the existing visualisation techniques and tools described in the literature for selected computational models used in online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. These requirements mostly address the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.
Keywords: Visualisation, interactivity, online machine learning, requirements, architecture
Version Log
01/04/2016  V1.0  Abdelhamid Bouchachia  Table of contents
12/04/2016  V1.5  Abdelhamid Bouchachia  Visualisation of classifiers
19/04/2016  V2    Abdelhamid Bouchachia  Visualisation of clustering and regression
22/04/2016  V2.5  Rubén Casado           Architectural requirements
25/04/2016  V3    Abdelhamid Bouchachia  Visualisation of sketches
27/04/2016  V3.5  Rubén Casado           Challenges of big data visualisation
17/05/2016  V4    Abdelhamid Bouchachia  Introduction, conclusion and front matter
Table of contents
Executive summary ...... 3
Document information ...... 4
Table of contents ...... 6
List of figures ...... 7
1 Introduction ...... 8
2 Challenges of visual analytics for data streams ...... 10
2.1 Data presentation ...... 10
2.2 Data analysis ...... 11
2.3 Perceptual scalability ...... 11
3 Visualisation requirements for online machine learning ...... 12
3.1 Sketches ...... 12
3.1.1 Moments ...... 12
3.1.2 Sampling ...... 12
3.1.3 Change detection ...... 12
3.1.4 Feature selection ...... 13
3.2 Online clustering ...... 14
3.2.1 Partitional clustering ...... 15
3.2.2 Hierarchical clustering ...... 16
3.3 Online classification ...... 16
3.3.1 Classification trees ...... 17
3.3.2 Neural networks ...... 17
3.3.3 Probabilistic classifiers ...... 21
3.3.4 Ensemble learning ...... 22
3.4 Online regression ...... 24
4 Architectural requirements for scalable visual analytics ...... 25
4.1 Data collector ...... 25
4.2 Incremental analytics engine ...... 26
4.3 Visualization layer ...... 28
4.3.1 Websocket connector ...... 28
4.3.2 Graph library ...... 28
5 Conclusions ...... 29
References ...... 30
List of figures
Figure 1: Visualisation of streams as streamgraph and horizon lines [4] ...... 8
Figure 2: INFUSE visualisation system ...... 13
Figure 3: Design patterns for feature analysis ...... 14
Figure 4: Examples of output by the different clustering algorithms ...... 15
Figure 5: Visualisation of SOM using U-matrix [51] ...... 15
Figure 6: Visualisation of mixture-based models ...... 16
Figure 7: Hierarchical structures ...... 17
Figure 8: Visualisation of trees [64] ...... 18
Figure 9: A two-layer NN with its corresponding diagram [66] ...... 18
Figure 10: Bond diagram for a network consisting of 6 input units, 2 hidden units and one output unit ...... 19
Figure 11: Hyperplane diagram for the hidden nodes of the network shown in Figure 9 [66] ...... 19
Figure 12: Response-function plots for the network shown in Figure 9 [66]. Leftmost and middle plots represent the hidden units, while the rightmost plot represents the output unit ...... 20
Figure 13: Visualisation of convolution networks ...... 20
Figure 14: Visualisation of class probabilities and decision boundary ...... 21
Figure 15: EnsembleMatrix visualisation ...... 23
Figure 16: Details of EnsembleMatrix ...... 23
Figure 17: Output of visreg for a non-linear regression model ...... 24
Figure 18: PROTEUS's Architecture ...... 25
Figure 19: Data collector ...... 25
Figure 20: Traditional AVG computation ...... 26
Figure 21: Concept of incremental AVG computation ...... 26
Figure 22: Flink generic workflow for incremental operations ...... 27
Figure 23: Implementation of the apply method for IncrementalAverageOperation ...... 27
Figure 24: Real-time and incremental communication ...... 27
1 Introduction
Visualisation is central to many areas and is usually used to illustrate the architecture of the system under observation, its evolving behaviour and the final outcome of the processing. Applying visualisation techniques to machine learning can be extremely useful not only for the expert (developer or data scientist in general), but also for the users. While the former is interested in checking how the machine learning model performs, the latter seeks a better insight into the decision-making process and interaction with the model, to change the parameters, to closely examine a particular aspect of the algorithm, or to understand the results.
Advanced interactive visualization of data (termed visual analytics) has become one of the trends in machine learning. The underlying observation is that machine learning (and data analytics) tools are seen by non-experts as black boxes: their internals are hard to understand, they offer little interaction, and they mostly lack any explanation module or facility. The transparency and interpretability of the behaviour and the (intermediate) results have also been an important research topic for statistics, machine learning and artificial intelligence researchers. Key questions that can be addressed are mostly related to the verification and validation of theoretical models, often based on statistical assumptions (e.g., Gaussianity), on real-world data, to how the model gets updated as new data arrives, and to why and how the results are generated by the model.
These questions are mainly embedded into the type of data analysis that is used. Confirmatory analysis is about making assumptions about the data, developing models and establishing whether those assumptions are true or false; it is clearly more relevant to the developer of the model than to the user. Exploratory analysis, on the other hand, is about investigating and discovering, through the model and the data, and it is relevant to both the expert and the user. In this latter case, interaction is a key element. Thus, interactive machine learning should facilitate the exploration of the data as well as of the machine learning models.
While building transparent machine learning tools that address those key questions is clearly relevant to static data of moderate size, it is even more vital for big data because of its volume, velocity and variety. The dynamic and streaming aspect of big data is perhaps the one that requires the most attention, since both the expert and the user need to understand the behaviour in (pseudo) real-time as data streams into the model.
The visualisation techniques used for static data, such as response-function plots, scatterplots, parallel coordinates, heatmaps, parallel sets, and linear and non-linear projections, as well as those used for data streams, such as time-series graphs, temporal mosaics, streamgraphs (see Figure 1), line charts and horizon charts, are standard techniques and can be used for evolving data as well. However, they should be adapted to the dynamic nature of data streams and to the interactivity required by today's big data applications. In fact, the visualization of data streams is strongly related to their temporal nature, but in many cases also to other important aspects such as data source, space, relevance, etc. What big data requires are rich and dynamic user interfaces, adapted for interacting with complex and possibly linked data, in order to derive analytic insights through visualisation of the data and the models developed.
(a) Streamgraph (b) Multiple streams using horizon lines
Figure 1: Visualisation of streams as streamgraph and horizon lines [4]
In order to reflect on current practices in visual analytics and to inspire PROTEUS, the rest of the document highlights some representative visualisation studies. We focus mostly on visualisation techniques used for online data analytics and consider in particular the following analytics techniques: sketches, clustering, classification and regression. We then specify the visualisation requirements that PROTEUS will fulfil for each of those techniques, to meet the challenges of massive and/or streaming data in terms of presenting information as well as exploring the data and the machine learning models. The document also highlights the architectural design of the visualisation tool that will be developed within PROTEUS.
2 Challenges of visual analytics for data streams
Advanced visualization of data analytics in real-time, user experience and usability are still open issues in the context of Big Data. How does Big Data change the nature of visual interaction? The interactivity requirement creates special challenges when it comes to Big Data. Interaction is a necessary condition for data analysis tasks, especially when using exploratory visual tools. However, most state-of-the-art tools and techniques do not properly accommodate Big Data.
Specifically, a key challenge of visual analytics is to support real-time interaction while coping with the volume, velocity and variety of Big Data. Despite emerging advances in achieving low latency for ad-hoc queries, it is still necessary to rethink efficient software architecture styles to enable real-time interaction. On the other hand, the visualization of data streams is strongly related to their temporal context. Although the data generated and delivered in streams has a strong temporal component, in many cases the temporal component is not the only one the analysts are interested in: there are other data dimensions (e.g., source, space, relevance) that are equally important, and time might be just an additional aspect they care about. Finally, the use of visualisation paradigms dedicated to machine learning and data analytics methods would help inspect the data as well as explain the behaviour of the algorithms.
2.1 Data presentation
The main objective of data visualization [5][6] is to represent knowledge more intuitively and effectively using different graphs. To convey easily the knowledge hidden in complex and large-scale data sets, both aesthetic form and functionality are necessary. Information that has been abstracted into some schematic form, including attributes or variables for the units of information, is also valuable for data analysis, and such representations are much more intuitive [5] than sophisticated approaches. For Big Data applications, it is particularly difficult to conduct data visualization because of the large size and high dimensionality of the data. Current Big Data visualization tools mostly perform poorly in terms of functionality, scalability and response time, so it is necessary to rethink the way we visualize Big Data rather than relying on the approaches adopted before. For example, the history mechanisms for information visualization [7] are also data-intensive and need more efficient approaches. Uncertainty can arise at any stage of a visual analytics process and poses a great challenge to effective uncertainty-aware visualization [8]. New frameworks for modelling uncertainty and for characterizing how uncertainty information evolves throughout the analytical process are highly necessary [9].
High-volume datasets are ubiquitous in many domains, such as finance, discrete manufacturing, and sports analytics [10]. It is not uncommon for millions of readings from high-frequency sensors to be stored in relational database management systems (RDBMS), to be later accessed using visual data analysis tools. Modern data analysis tools must support a fluent and flexible use of visualizations [11] and still be able to squeeze a billion records into a million pixels [12]. In this regard, one open issue for the scientific community is the development of compact data structures that support algorithms for rapid data filtering, aggregation, and display rendering. Unfortunately, these issues are as yet unsolved for existing RDBMS-based visual data analysis tools, such as Tableau Desktop [13], SAP Lumira [14], QlikView [15], Tibco Spotfire [16] and Datawatch Desktop [18].
While these tools provide flexible and direct access to relational data sources, they do not perform automatic, visualization-related data filtering or aggregation and are not able to quickly and easily visualize high-volume historical data of one million records or more. For example, they redundantly store copies of the raw data as tool-internal objects, requiring significant amounts of system memory per record. This causes long waiting times for the users, leaving them with unresponsive tools or even impairing the user's operating system when the system memory is exhausted. Apart from commercial solutions, a number of open-source visual toolkits exist, each covering a specific set of functionalities for visualization, analysis and interaction; examples include the InfoVis Toolkit [19], Prefuse [20], Improvise [21] and D3 [22]. Using existing toolkits for the required functionality instead of implementing it from scratch is much more efficient when developing new visual analytics solutions, although the level of maintenance, development and user community support of open-source toolkits can vary drastically.
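To make the idea of visualization-related aggregation concrete, the following is a minimal, illustrative sketch (our own example, not code from any of the cited tools): raw records are binned into a grid sized to the plot area, so that rendering cost depends on the number of pixels rather than on the number of records.

```python
# Illustrative pixel-level aggregation: bin raw (x, y) records into a grid
# the size of the plot area instead of handing every record to the renderer.
def bin_to_grid(points, width, height, x_range, y_range):
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Ignore records outside the visible range.
        if x_min <= x < x_max and y_min <= y < y_max:
            col = int((x - x_min) / (x_max - x_min) * width)
            row = int((y - y_min) / (y_max - y_min) * height)
            grid[row][col] += 1
    return grid  # each cell holds the count of records behind that pixel
```

The resulting grid can be rendered as a density heatmap; a billion records collapse into at most width x height counters, independently of the record count.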
2.2 Data analysis
A common gap in both commercial and open-source solutions is that all existing tools focus on batch data (data at rest), not on data streams (data in motion). There are some domain-specific tools that address this gap. ELVIS [23] is a highly interactive system for analysing system log data, but it cannot be applied to real-time streams. SnortView [24] focuses on the specific analysis of intrusion detection alerts. The focus of Event Visualizer [25] is to provide real-time visualizations of event data streams for real-time monitoring, with the possibility of smoothly switching to exploration mode. In contrast to this event-based approach, the authors in [26] propose another real-time system to enhance situational awareness through the analysis of network traffic, based on LiveRAC [27]. The analysed and aggregated time series are displayed in a zoomable tabular interface that provides the analyst with an interactive exploration interface for time-series data. Another tool that focuses on monitoring time-series data is VizTree [28], which provides visual real-time anomaly detection for time series. Its general approach is to transform the time-series data into a representation of symbols.
2.3 Perceptual scalability
From the perception point of view, we can identify two main issues [1]:
Human perception: Human eyes have difficulty extracting meaningful information when the data becomes extremely large. Few existing visualization systems are designed to scale well enough to present meaningful, high-quality information to human perception.
Limited screen: Data is simply becoming larger and larger, and it is challenging for a visualization to display many data items or features on a limited screen, especially for a dataset with a billion entries. With too much data to present on a limited screen, the resulting visualization is too dense to be useful to the users. The limitation of screen resolution forces us to explore novel ways to display and visualize information using various abstraction techniques.
3 Visualisation requirements for online machine learning
In the following, we present the main online machine learning techniques that will be investigated in PROTEUS. We show some of the existing visualisation aspects of such techniques for the sake of illustration, and we then highlight the main visualisation requirements that an online machine learning library should consider.
3.1 Sketches
A sketch is a compact representation of data, usually designed to cope efficiently with high-speed data streams. It can serve various purposes by capturing some key statistics about the stream. The best-known sketches include the count-min sketch, lossy counting and Bloom filters.
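As a concrete illustration, the following is a minimal count-min sketch (the hash construction and parameters are our own illustrative choices, not prescribed by PROTEUS): each item increments one counter per row, and the estimate is the minimum over rows, which never under-estimates the true count and over-estimates it by a small bounded error.

```python
import hashlib

class CountMinSketch:
    """Illustrative count-min sketch: depth hash rows of width counters each."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._hashes(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the row minimum is the
        # tightest upper bound on the true count.
        return min(self.table[row][col]
                   for row, col in enumerate(self._hashes(item)))
```

The memory footprint is fixed (width x depth counters) regardless of how many distinct items the stream contains, which is what makes such sketches suitable for high-speed streams.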
3.1.1 Moments
Being typical sketches, moments are usually used to summarise particular statistical characteristics of the data stream, which are in turn used to capture trends, anomalies, etc. Such sketches can relate either to the frequency of the stream items or to the items themselves. They are quite useful in many monitoring applications such as network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring [29]. Since moments are numbers, their visualisation has not been an issue in the relevant literature. It is, however, interesting to visualise such quantities as a stream to highlight their evolution over time, and illustrating different moments on the same screen can be extremely handy for human decision makers.
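As a minimal illustration of how such moments can be maintained incrementally over a stream (this is the standard Welford-style update, not a PROTEUS-specific design), one can keep a running count, mean and variance and read them off after every element, ready to be plotted as an evolving time series:

```python
class StreamingMoments:
    """Incrementally maintains count, mean and variance of a stream
    (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the updated mean, which keeps the update numerically stable.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the elements seen so far.
        return self.m2 / self.n if self.n > 1 else 0.0
```

Each update is O(1) in time and memory, so the moments can be refreshed and redrawn at stream speed, and several such trackers (e.g., one per data source) can be displayed on the same screen.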
3.1.2 Sampling
In contrast to sketches, which are computed from the whole data stream, sampling is another way of summarizing a data stream: it computes a representative set of stream elements that enables efficient processing. In general, the sample is continuously maintained in order to accommodate anytime approximate query answering, selectivity estimation, query planning, or any other mining task. Since the sample fits in RAM, various standard offline algorithms can be applied to it. The challenge is to develop sampling techniques that provide unbiased estimates of the underlying stream with provable error guarantees.
In one-pass stream sampling, the main challenge is to ensure that the sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight by keeping the space and time costs low for each stream [30].
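A classic single-pass technique that meets this uniformity requirement is reservoir sampling (Algorithm R); the sketch below is a generic illustration, not PROTEUS code:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of k items from a stream of unknown
    length in one pass (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k/n, evicting a random slot;
            # this makes every item seen so far equally likely to be retained.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The reservoir uses O(k) memory and O(1) work per element, so the sample and its characteristics (moments, distribution) can be re-visualised continuously as the stream evolves.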
Depending on the approach taken to do sampling, the visualisation system should fulfil the following
requirements:
- Visual presentation of the selection criteria
- Details about the effect of adding new data points to the sample (e.g., accuracy)
- Visualisation of the characteristics of the sample (moments, distribution, density over time, etc.)
It is worth mentioning that little has been done in this context.
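One classical one-pass technique that meets these constraints is reservoir sampling, which maintains a uniform sample of fixed size k over the stream seen so far. The sketch below is illustrative (the function name and fixed seed are ours):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of size k from a stream, one pass, O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # item i is kept with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 10)
```

The sample characteristics listed above (moments, distribution, density over time) can then be recomputed and re-visualised whenever the reservoir changes.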
3.1.3 Change detection
The relevance of change detection in streaming applications is quite straightforward, as it aims at identifying
any change that occurs. In general, change covers several concepts such as concept drift [31], novelty
detection [32] and anomaly detection [33], each of which has received great attention from different research
communities. Generally speaking, a change corresponds to a change in the underlying probability distribution
of the data. The goal is therefore to identify the deviation of the model at hand by monitoring its behaviour in
Deliverable D5.1 PROTEUS
687691 Page 13 of 35
real-time. Thus, to deal with changes in the context of streaming, mechanisms of detection and verification
are needed that allow distinguishing noise from real change.
Visualisation of changes in data streams is a very interesting avenue as it helps to explain the evolution of the
data over time and what factual characteristics/events emerge over time. Therefore the set of requirements
associated with change visualisation can be summarised in the following elements:
- Real-time tracking of the sketches used for change detection (illustrating the change)
- Identification of the change by indicating the location in the stream with the associated sketches
- Illustration of the change effect on the learning algorithm.
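As one concrete example of such a detection mechanism, the sketch below implements a simple Page-Hinkley test (parameter values are illustrative, and the class is ours); the cumulative statistic it maintains is precisely the kind of sketch whose real-time tracking is required above:

```python
class PageHinkley:
    """Page-Hinkley test for detecting an upward shift in the stream mean."""

    def __init__(self, delta=0.05, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # running minimum M_t

    def update(self, x):
        """Feed one stream value; return True if a change is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold

detector = PageHinkley()
stream = [0.0] * 50 + [3.0] * 20   # mean jumps from 0 to 3 at t = 50
alarms = [t for t, x in enumerate(stream) if detector.update(x)]
# the first alarm fires shortly after t = 50, locating the change in the stream
```

Visualising `cum - cum_min` over time, together with the alarm positions, satisfies the first two requirements directly.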
3.1.4 Feature selection
Streaming and online feature selection has been the centre of a number of investigations. There are two
variants: (i) horizontal, where the set of data points is fixed while new features become available over time, and (ii)
vertical, where the features are fixed while new data arrives over time. A third variant (iii) would be the
combination of (i) and (ii), where both new features and new data become available over time. So far, variant
(i) has attracted some attention [34][35][36], and variant (ii) has also been investigated in a few works
[37][38][39]. To the best of our knowledge, variant (iii) has not yet been discussed.
In terms of visualisation, there has not been much focus on feature selection. INFUSE [40] is an interesting
visualisation tool that provides many functionalities (see Figure 2), but works offline. It is interactive and
investigative into the process of feature selection for static data.
Figure 2: INFUSE visualisation system. (a) Overview of INFUSE; (b) the glyph representation of features,
which are ranked by 4 selection algorithms
A visualisation system proposed in [41] allows visualising, selecting and measuring the correlation between
features in the context of space analysis. It is based on an interesting series of visual designs, as shown in
Figure 3.
Mostly, two aspects have been presented in the studies:
- Effect of adding new (batches of) data points
- Effect of the selected features on the accuracy of the predictive algorithms
Figure 3: Design patterns for feature analysis
For online feature selection, in addition to these aspects, the visualisation system should also fulfil the
following:
- Interactive (manual) as well as automatic selection of features
- Visualisation of the selection criteria
- Visualisation of the ranking of the features in case of automatic selection
- Visualisation of the data in the space of selected feature (after mapping)
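A minimal sketch of how an automatic online ranking of features could be maintained and fed to such a visualisation (the class is hypothetical, based on running Pearson correlations between each feature and the target):

```python
import math

class OnlineFeatureRanker:
    """Ranks features by |running Pearson correlation| with the target."""

    def __init__(self, n_features):
        self.n = 0
        self.sx = [0.0] * n_features   # per-feature sum of x
        self.sxx = [0.0] * n_features  # per-feature sum of x^2
        self.sxy = [0.0] * n_features  # per-feature sum of x*y
        self.sy = 0.0
        self.syy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sy += y
        self.syy += y * y
        for j, xj in enumerate(x):
            self.sx[j] += xj
            self.sxx[j] += xj * xj
            self.sxy[j] += xj * y

    def scores(self):
        n, out = self.n, []
        vy = n * self.syy - self.sy ** 2
        for j in range(len(self.sx)):
            vx = n * self.sxx[j] - self.sx[j] ** 2
            cov = n * self.sxy[j] - self.sx[j] * self.sy
            out.append(abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0)
        return out

    def ranking(self):
        s = self.scores()
        return sorted(range(len(s)), key=lambda j: -s[j])

ranker = OnlineFeatureRanker(2)
for t in range(100):
    # feature 0 tracks the target exactly; feature 1 is uninformative
    ranker.update([float(t), (-1.0) ** t], float(t))
# ranker.ranking() places feature 0 first
```

Re-drawing the ranking (and the per-feature scores as the selection criteria) after each update covers the second and third requirements above.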
3.2 Online clustering
The goal of clustering is to uncover the hidden structure of data. In other terms, it aims at partitioning
unlabelled data objects into groups, called clusters, in a way that similar data points are assigned to the same
group and distinct ones are assigned to different groups. Ideally, a good clustering algorithm should observe
two main criteria: (i) minimization of the intra-cluster distance, and (ii) maximization of the inter-cluster
distance. A sheer mass of offline clustering algorithms based (or partly based) on these two criteria have
been proposed in the literature. We can distinguish two types of clustering algorithms: (i) hard partitioning,
where data objects belong to exactly one cluster, and (ii) soft partitioning, where objects belong to all clusters to a
certain degree. In addition, there are two classes of algorithms: (i) partitional, where data are split into a
predefined number of clusters according to some criteria, and (ii) hierarchical, where the output takes the form
of a tree (dendrogram) and can be obtained either in a divisive (top-down) or agglomerative (bottom-up)
manner [42].
In terms of visualisation, users are usually interested in the boundaries of clusters (low density regions
between clusters) and the clusters themselves (high density regions) populated by the data points. The cluster
centres (prototypes) as well as different statistics (e.g., clustering quality) and the membership of the data
points to their respective clusters can also be of interest.
3.2.1 Partitional clustering
There are a number of models for online partitional clustering; among these we mention just a few:
Neural networks, such as ART networks [43], Generalized fuzzy min-max neural networks
(GFMMN) [44], MaxNet [45], evolving self-organizing maps (ESOM) [46][47][48], growing neural
gas [49] and many others such as minimal resource allocation networks. All of these methods are based on
the concepts of competitive learning and vector quantization. Visualisation of clustering depends
mainly on the computational model used and the model’s output, as shown in Figure 4.
Figure 4: Examples of output by the different clustering algorithms (Neural Gas; ART and GFMMN; (E)SOM)
One of the neural networks known for its visualisation capabilities is the self-organising map
(SOM) [50]. SOMs are popular because they allow showing clusters visually, making the estimation of
clusters intuitive. There has been a lot of work on the visualisation of SOMs. The most widely
used technique is the U-matrix, which represents the distance between each neuron (codebook
vector) and its neighbouring neurons. The U-matrix is visualised as a 2-D image (see Figure 5) with
different colourings between the neighbouring neurons. A dark colouring indicates a large distance
(between clusters), while a light colouring indicates that the neurons are similar (forming a cluster).
It is quite interesting to note that the U-matrix can be used even if the input data is high-dimensional.
Another technique used to visualise SOMs is the P-matrix. Instead of distances, the P-matrix makes use
of density values of the data space at the neurons. A combination of both the U-matrix and the P-matrix was
proposed in [51].
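To make the U-matrix idea concrete, the following sketch (our own illustration, not tied to any particular SOM library) computes a U-matrix for a codebook laid out on a rows-by-cols grid; each cell holds the average distance of a neuron's codebook vector to its 4-connected neighbours:

```python
import math

def u_matrix(codebook, rows, cols):
    """codebook[r][c] is the codebook vector of the neuron at grid position (r, c)."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    um = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neigh = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            ds = [dist(codebook[r][c], codebook[nr][nc])
                  for nr, nc in neigh if 0 <= nr < rows and 0 <= nc < cols]
            um[r][c] = sum(ds) / len(ds)
    return um

# Two tight groups of codebook vectors separated along the middle columns:
codebook = [[[0.0], [0.0], [10.0], [10.0]] for _ in range(2)]
um = u_matrix(codebook, 2, 4)
# the middle columns receive high values: the dark "ridge" between clusters
```

Rendering `um` with a dark-to-light colour map reproduces the display described above, where the ridge of large distances separates the clusters.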
Figure 5: Visualisation of SOM using U-matrix [51]
While Sammon’s mapping [52] is a non-linear mapping from an input space onto an
output space and can be considered a dimension reduction technique, it has also been used to
visualise SOMs by mapping the codebook vectors onto a plane2. Sammon’s mapping and other
transformation techniques such as multidimensional scaling and curvilinear component analysis are
2 http://users.ics.aalto.fi/jhollmen/dippa/node1.html
general, but were combined with SOM to enhance the visualisation of data [53]. Clearly, these
techniques can be applied to the online version of SOM.
Mixture-based models: they can be either parametric (e.g., Gaussian mixture models) or non-
parametric (e.g., DBSCAN and Dirichlet process-based clustering). Representative algorithms are
the Growing Gaussian Mixture Model (2G2M) [54], the Incremental Gaussian Mixture Model
(IGMM) [55] and LSEARCH [56]. The outcome of such algorithms is shown in Figure 6. The
visualisation associated with these models would require showing how the clusters are generated
and how the clusters themselves evolve as new data become available. Often an optimisation
process is applied to control the complexity of clustering (number of clusters).
Objective function-based models rely on the optimisation of an objective function, often
introducing some simplifications and assumptions to avoid iterating over the data. A set of algorithms
based on K-means and Fuzzy C-means have been proposed, as in [57][58], along with other general stream
algorithms described in [59]. In general, the partition matrix, the prototypes or other statistics are
recursively updated as new data become available. In terms of visualisation, there are no specific
requirements for this class of online clustering algorithms, so they follow the previous classes.
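The recursive prototype update mentioned here can be sketched as a minimal sequential K-means (an illustrative version; real stream algorithms add decay, splitting, merging, etc.): each arriving point moves its nearest prototype by a step of 1/n_k, where n_k counts the points assigned to that prototype so far.

```python
def sequential_kmeans(stream, prototypes):
    """Update prototypes in place as each stream point arrives."""
    counts = [0] * len(prototypes)
    for x in stream:
        # assign the point to its nearest prototype (squared Euclidean distance)
        k = min(range(len(prototypes)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(prototypes[i], x)))
        counts[k] += 1
        eta = 1.0 / counts[k]   # decreasing step size = running mean of the cluster
        prototypes[k] = [p + eta * (a - p) for p, a in zip(prototypes[k], x)]
    return prototypes

protos = sequential_kmeans(
    [(0.1, 0.0), (9.9, 10.0), (0.0, 0.2), (10.0, 9.8)],
    [[0.0, 0.0], [10.0, 10.0]],
)
```

Re-plotting the prototypes (and the newly arrived points) after each update is exactly the evolving-cluster view required of the visualisation.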
Figure 6: Visualisation of mixture-based models
3.2.2 Hierarchical clustering
Traditionally, hierarchical clustering is visualised using a dendrogram, but other representations such as the
sunburst3,4 have been used [60] (see Figure 7).
In a sunburst diagram, the root of the tree is represented as the centre of the diagram, while each concentric
ring represents a subtree (sub-cluster). Online hierarchical clustering [61] requires adapted visualisation
techniques. The dendrogram and sunburst representations still work for the case of streams, but additionally
the visualisation system should reflect the evolution and the size of the hierarchical structure of the
clustering. It should show the updated leaves continuously or on demand, while accommodating zoom-in and
zoom-out to observe the overall evolution.
3.3 Online classification
Motivated by the requirement of transparency and of understanding the behaviour as well as the decision
making process of classifiers, visualisation is an important tool for gaining insight for both experts and
non-expert users, as explained in the following.
3 http://www.cc.gatech.edu/gvu/ii/sunburst/ 4 http://vcg.informatik.uni-rostock.de/~hs162/treeposter/poster.html#Chen2015
Figure 6 panels: (a) Growing GMM, (b) LSEARCH, (c) Incremental GMM
Figure 7: Hierarchical structures. (a) Dendrogram representation; (b) sunburst representation of hierarchical
structures
3.3.1 Classification trees
Decision trees are some of the very early algorithms adapted for classifying data streams. In particular,
Hoeffding Tree or Very Fast Decision Tree (VFDT) [62][63] is the standard decision tree algorithm for data
stream classification. The Hoeffding tree induction algorithm induces a decision tree from a data stream
incrementally, inspecting each example in the stream only once, without the need for storing examples after
they have been used to update the tree. The only information needed in memory is the tree itself, which
stores sufficient information in its leaves in order to grow and can be used to form predictions at any time.
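The statistical machinery behind this is the Hoeffding bound: with probability 1 - delta, the true mean of a variable of range R is within eps of the mean of n observations, which tells the algorithm when a split attribute is statistically safe to commit to. A small sketch of the threshold computation (parameter values are illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """eps = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# For information gain with 2 classes the range is R = log2(2) = 1:
eps = hoeffding_bound(1.0, delta=1e-7, n=1000)
# split when the gain difference between the two best attributes exceeds eps
```

Because eps shrinks as n grows, the tree splits only once enough stream examples have been seen, which is what makes the single-pass induction possible.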
Visualising trees during the continuous learning process is generally straightforward from a conceptual point
of view. A tree is represented as a set of nodes, the top node (level 0) is the root and the nodes at level i are
children of nodes i-1, the lowest level includes the class nodes. Since the decision tree changes (or grows if
new attributes become available) continually in the context of streaming, an efficient visualiser should
accommodate the following functionalities:
- During the evolution, show the updated path; this corresponds to a focused examination
- Enable expansion and contraction of the tree’s parts around node(s) of interest
- Enable the visualisation of the partial results from different workers
While the visualisation of a tree looks quite intuitive, as shown in Figure 8 [64], the space required for
visualising the whole tree at once becomes challenging because of its increasing size over time. Therefore, new
and different visual formats are required. Some available libraries allow a compact
visualisation, such as the SunBurst5 style, where the tree looks like a pie chart. However, other formats can also be
used. There exist interesting libraries that can be of high relevance for tree visualisation6,7,8.
3.3.2 Neural networks
Neural networks (NN) are not very popular for learning from data streams, but a number of online NN
algorithms were proposed many years ago, such as adaptive resonance theory networks, growing neural gas,
incremental learning based on function decomposition, min-max neural networks, and incremental radial basis
5 http://www.cc.gatech.edu/gvu/ii/sunburst/ 6 http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html 7 http://bl.ocks.org/ 8 https://github.com/mbostock/d3
function networks such as minimal resource allocation networks. Neural networks have recently gained a lot
of interest due to the advent of deep learning architectures which show great potential.
Figure 8: Visualisation of trees [64]
The visualisation of neural networks is often used to track their behaviour and performance. Because NNs are
considered black-box tools (once trained, knowledge is encoded as a set of numerical weights), the
visualisation becomes even more relevant. Early work on NN visualisation goes back to the study in [65],
where a diagram, called the Hinton diagram, was proposed to explain NNs (see Figure 9).
Figure 9: A two-layer NN with its corresponding diagram [66]
Hinton diagrams are used to visualise a 2D array that represents a weight matrix. White and black
squares indicate positive and negative values respectively, while the size of each square depicts the
magnitude of the corresponding weight.
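As a toy illustration of this encoding (sign by symbol, magnitude by size), here is a hypothetical text-mode rendering of a Hinton diagram; the function and its conventions are ours, mirroring the white/black squares of different sizes:

```python
def hinton_text(weights, max_width=5):
    """Render a weight matrix: '#' for positive, 'o' for negative,
    with the symbol count proportional to the weight magnitude."""
    peak = max(abs(w) for row in weights for w in row) or 1.0
    lines = []
    for row in weights:
        cells = []
        for w in row:
            size = max(1, round(max_width * abs(w) / peak))
            cells.append(("#" if w >= 0 else "o") * size)
        lines.append(" ".join(c.ljust(max_width) for c in cells))
    return "\n".join(lines)

print(hinton_text([[0.9, -0.2], [-1.0, 0.4]]))
```

A graphical version would simply swap the symbol runs for filled squares, but the mapping from weights to marks is the same.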
A more intuitive visualisation technique, called bond diagrams, was proposed in [67]; it depicts not only the weights
but also the topology of the network, as shown in Figure 10. The weights are depicted as bonds, where
positive and negative weights are given different colours or shapes (e.g., striped bonds represent negative
weights and plain ones represent positive weights). The thickness indicates the magnitude of the weights,
while the size of the units represents the magnitude of the unit’s bias.
These diagrams do not refer to the training data and how the decision boundary (i.e., the hyperplane) is
determined. To overcome this problem, [68] proposed a visualisation tool that illustrates how the hyperplane
changes during the learning process. Variations of this representation have been used by different tools.
A hyperplane diagram is used to depict how hidden units make decisions with the help of the input units but,
more often, it is applied to illustrate how output units make decisions using the hidden units’ outputs, as shown
in Figure 11. The hyperplanes show how the hidden units partition their input by approximating their transfer
functions with a threshold function. The movement of the hyperplanes can be animated along with the change
of the weights during training to exhibit the network’s behaviour.
Figure 10: Bond diagram for a network consisting of 6 input units, 2 hidden units and one output unit.
Figure 11: Hyperplane diagram for the hidden nodes of the network shown in Figure 9 [66].
An alternative technique for visualising the decision boundary is to use the response-function plots which
show the decision surfaces formed by the individual hidden and output units. Figure 12 shows an example of
response-function plots. Axes are the input for each unit and the lighter shades indicate higher activation.
Figure 12: Response-function plots for the network shown in Figure 9 [66]. Leftmost and middle plots
represent the hidden units, while the rightmost plot represents the output unit.
However, visualising data could be challenging if it is high-dimensional (>3). Hence, a good visualisation in
such a case should be equipped with either an interactive interface that allows the user to dynamically and
repeatedly decide which features (dimensions) to plot, or an option for the tool to choose the most
influential features using feature selection algorithms.
The authors in [66] developed a tool, called Lascaux, which offers the possibility to display: (a) the architecture of the
network, and (b) the importance of the weights (solid lines indicate positive weights, dashed lines indicate negative
weights; the thicker the lines, the more important the weights).
With the advent of deep learning, the visualisation of neural networks has gained more importance, and several recent
visualisation studies have emerged [69][70]. The challenge of visualising deep architectures is well known [70],
because (a) the complexity and the number of layers of a deep architecture are often not well
understood, (b) each component of a network may have dozens of hyper-parameters, and (c) the complexity
of neural networks has protected them from the rigorous formalism of other fields of machine learning, so
practitioners can only rely on anecdotal results to guide design.
An interesting visualisation tool, called deepViz, was presented in [70]. It enables the users to view bitmap
representations of filter banks, weight matrices, the output for a corresponding input image, the confusion matrix,
images from various classes, etc. [71] suggested the use of a deconvolution method to visualise convolutional
networks by finding which neurons are activated by which parts of the image. The deconvolution consists of
projecting the prediction from the output layer back to the input layer.
Similarly, the authors in [72] introduced a tool (see Figure 13) that provides an interactive visualisation of
neurons in a trained convolutional network when presented with an image or video. The tool allows visualising
the forward activations of units, the top images for each unit from the training set, and deconvolution to understand
the reaction of units to images, as proposed in [71].
Figure 13: Visualisation of convolution networks
Other studies on the visualisation of deep networks [73][74] applied sensitivity analysis, resulting in heatmaps that
make use of partial derivatives. Interestingly enough, this approach is general and can be applied to any
classifier, irrespective of it being linear or non-linear. It measures the relative importance of the input
features to the classifier [75].
This section has focused mostly on offline neural networks, but the aim was to show the visualisation
techniques that have been proposed for neural networks; all of these techniques fit the case
of online neural networks for data streams. To summarise, the requirements for real-time informative
visualisation when using neural networks are as follows:
- Insight into the behaviour of selected network neurons upon presentation of a new input or a batch
of inputs, possibly at regular intervals or on demand
- Insight into the evolution of the decision boundary
- For networks with dynamic structure that evolves over time, insight into the evolution of the
architecture
- Insight into the evolution of the actual accuracy of the network
3.3.3 Probabilistic classifiers
Probabilistic classifiers (e.g., Bayesian networks) often seek to quantify the likelihood of a data point
belonging to each of the classes. Such membership probabilities explain well the behaviour as well as the
decision making of the classifier. Thus, it is quite appealing to visualise the probabilistic landscape
associated with the classifier’s model and output.
For instance, the authors in [76] proposed a tool to visualise class probabilities. They used projection
techniques based on SOM to visualise the data, and then considered various quantities such as the class probability
at each point in the data space, the decision boundary indicating the classes, misclassification
information, misclassification types (e.g., false positives and false negatives for binary classes) and the meta-
attributes of the model (e.g., the distribution and density of the training data used to build the model as well
as the confidence assigned to each estimate). Figure 14 shows some of their heatmap-based visualisation output.
The authors in [77] apply Bayesian decision theory to compute the risks, as a function of the class probability associated
with misclassification, and to visualise the class boundaries. They further developed an interactive
optimisation system, called ManiMatrix, to guide the classification towards a target confusion matrix
configuration set by the user.
Figure 14: Visualisation of class probabilities and decision boundary
The authors in [78] describe some visual methods for analysing the classification results of a probabilistic
classifier. In particular, class probabilities are represented as coloured histograms, while features are ranked
and visualised based on their discrimination power. Different interaction mechanisms are used to explore the
classification outcome as well as the data.
The work in [79] proposed parametric embedding (PE), as a projection method, to visualise the posteriors
estimated over a mixture model. The PE method maps the class posterior of data points onto an embedding
space by minimizing a sum of Kullback-Leibler divergences considering that the data is generated by a
Gaussian mixture with equal covariances in the embedding space.
The authors in [80] proposed a visualisation system that shows the classification results at three levels: the
classifier level, the class level, and the test item level. Classes are represented on the perimeter of a circle,
which is filled with the data. Any data point can be interactively activated so that its class probabilities are
highlighted using lines whose thickness indicates the value of the class probability.
From these examples, it can easily be concluded that the visualisation of probabilistic classification makes use
of the class distribution of data points to explain both the classification (class boundaries) and the
membership of data points to classes. Therefore, online probabilistic classifiers for data streams, such as the
online random naïve Bayes classifier [81] and Bayesian online classification [82], should follow the same
visualisation concept.
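Although the details differ per method, the kind of per-class probability such classifiers expose can be sketched with an incrementally updated Gaussian naive Bayes (the class and method names below are ours, not those of [81][82]); its `predict_proba` output is precisely the class distribution that the visualisations above colour and contour:

```python
import math
from collections import defaultdict

class OnlineGaussianNB:
    """Gaussian naive Bayes whose per-class feature statistics are updated
    one stream example at a time (Welford updates per class and feature)."""

    def __init__(self):
        self.stats = defaultdict(list)  # class -> per-feature [n, mean, m2]

    def update(self, x, y):
        if not self.stats[y]:
            self.stats[y] = [[0, 0.0, 0.0] for _ in x]
        for s, xi in zip(self.stats[y], x):
            s[0] += 1
            d = xi - s[1]
            s[1] += d / s[0]
            s[2] += d * (xi - s[1])

    def predict_proba(self, x):
        total_n = sum(feats[0][0] for feats in self.stats.values())
        logp = {}
        for y, feats in self.stats.items():
            lp = math.log(feats[0][0] / total_n)  # class prior
            for (n, mean, m2), xi in zip(feats, x):
                var = max(m2 / n if n > 1 else 1.0, 1e-9)
                lp += -0.5 * (math.log(2 * math.pi * var) + (xi - mean) ** 2 / var)
            logp[y] = lp
        m = max(logp.values())
        exp = {y: math.exp(v - m) for y, v in logp.items()}
        z = sum(exp.values())
        return {y: v / z for y, v in exp.items()}

model = OnlineGaussianNB()
for v in (-0.5, 0.0, 0.5):
    model.update([v], 0)
for v in (4.5, 5.0, 5.5):
    model.update([v], 1)
probs = model.predict_proba([0.1])   # heavily favours class 0
```

Evaluating `predict_proba` on a grid of points yields the class-probability heatmap and decision boundary of Figure 14, refreshed after every stream update.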
3.3.4 Ensemble learning
The idea of combining different algorithms, known as ensemble learning, has attracted a lot of attention. The
main motivation is that even if the performance of one predictor (learner, expert), called a base learner,
or of a few of them, is not satisfactory, the ensemble of algorithms can still perform well. Usually,
when the task is relatively hard, multiple predictors are used following the divide-and-conquer principle
[83][84]. Ensemble learning offers the advantage of limiting the effect of the parameters of each base
learner.
We can distinguish two combination schemes. In the first, the base learners (based on the same model)
are trained on different randomly generated data sets (re-sampled from a larger training set) before they
are combined; this includes stacking, bagging and boosting. The second scheme assumes that the ensemble
contains several base learners trained on the same data but based on different models (neural networks,
decision trees, etc.) with different parameters and trained using different initial conditions (e.g., weight
initialization in neural networks). Both schemes seek to ensure high diversity of the ensemble.
There have been very few studies on visualisation in the context of ensemble learning. Probably the most
prominent one is the work presented in [85]. This study proposed an interactive visualisation system, called
EnsembleMatrix, that graphically visualises the confusion matrices of the individual learners (classifiers) in
order to understand the behaviour and performance of each learner. EnsembleMatrix enables the users to
interact with the confusion matrices to decide which combination they want to inspect.
Figure 15 shows EnsembleMatrix. The left side shows the confusion matrix of the
current ensemble classifier built by the user, while the bottom right side shows the confusion matrices of the
individual base classifiers. Here the confusion matrix is encoded by colour (the darker, the higher). The user
can select any part of the ensemble confusion matrix to examine how the base classifiers perform on that
part. The top left side (polygon) is used by the user to fix the weight of each classifier in the desired linear
combination.
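The user-weighted linear combination at the heart of this interaction can be sketched as follows (an illustrative toy, not EnsembleMatrix's actual code): each base classifier contributes its class scores scaled by a user-set weight, and the class with the largest combined score wins.

```python
def combine(class_scores, weights):
    """class_scores: one dict of {class: score} per base classifier;
    weights: one user-set weight per base classifier."""
    total = {}
    for scores, w in zip(class_scores, weights):
        for cls, s in scores.items():
            total[cls] = total.get(cls, 0.0) + w * s
    return max(total, key=total.get)

scores = [
    {"cat": 0.6, "dog": 0.4},   # classifier A
    {"cat": 0.2, "dog": 0.8},   # classifier B
]
assert combine(scores, [0.5, 0.5]) == "dog"   # equal weights: dog wins
assert combine(scores, [0.9, 0.1]) == "cat"   # trusting A more: cat wins
```

Dragging the polygon weights in the tool amounts to re-running this combination and re-colouring the ensemble confusion matrix.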
Figure 15: EnsembleMatrix visualisation.
EnsembleMatrix was patented9 with further details, especially of the steps involved, as shown in Figure 16.
Figure 16: Details of EnsembleMatrix.
The authors in [86] used different visualisations to analyse tree forests generated using bagging. Such
visualisations are however static and are just for displaying statistics about the trees and the variables
involved therein. Similar work was presented in [64], where a visualisation tool, called VISE, was presented
for analysing small ensembles consisting of three decision trees obtained using bagging.
PROTEUS focuses on data streams and online learning, hence the relevance of online ensemble learning.
There have been a number of studies reporting on online ensemble learning [87][88][84][89][90], but none of
these studies looked at the visualisation aspect, except for performance curves.
It is however clear that the visualisation techniques for online ensemble methods should reflect the real-
time updates, which may be of a different nature [84]:
- Visualisation of the evolution of the weights associated with the ensemble when using a dynamic
combination of the learners to illustrate the importance of each base learner.
- Visualisation of new data, classification boundary and confusion matrix of both the ensemble as well
as the base learners.
- Visualisation of the ensemble structure, possibly along with the evolution of the performance, as in
Figure 15. In case a dynamic structure is adopted, that is, if new learners are added or removed
9 http://www.google.co.uk/patents/US8306940
dynamically, the change should be tracked, along with the evolution of the ensemble as well as that of the base
learners.
While a learner here refers to a classifier, the visualisation requirements apply to any type of ensemble
(e.g., clustering, mapping, regression).
3.4 Online regression
Regression is about investigating the relationship between one or several predictors (known as
independent variables, input variables or explanatory variables) and the response (known as the dependent variable or
output). There exist many models: linear, nonlinear, multiple linear and nonlinear, etc. Overall, there are two
types of models: parametric and non-parametric regression. There also exist a number of studies that discuss
incremental and online regression [91][92][93], often relying on mechanisms like recursive least squares and
online support vector machines [94][95].
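The recursive least squares (RLS) mechanism mentioned here can be sketched as follows (an illustrative minimal version with forgetting factor lam = 1; variable names are ours): the weight vector and the inverse correlation matrix P are updated per sample, so the evolving fit can be streamed to a visualisation layer after every point.

```python
def rls_update(w, P, x, y, lam=1.0):
    """One RLS step: x, w are lists; P is the inverse correlation matrix."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    k = [v / denom for v in Px]                       # gain vector
    err = y - sum(w[i] * x[i] for i in range(n))      # a priori error
    w = [w[i] + k[i] * err for i in range(n)]
    P = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(n)] for i in range(n)]
    return w, P

# Fit y = 2x + 1 online; the input vector is [x, 1] to include the intercept.
w, P = [0.0, 0.0], [[1000.0, 0.0], [0.0, 1000.0]]
for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    w, P = rls_update(w, P, [x, 1.0], 2.0 * x + 1.0)
# w approaches [2.0, 1.0] as the noiseless samples arrive
```

Plotting the line implied by `w` after each update gives exactly the "evolution of the fit" required below.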
However, in terms of visualisation, authors often show just the outcome of the model fitting and compute
different quantities to show the performance of the model. Estimating the relationship between the response
and the predictors through visual inspection is worthwhile.
The authors in [96] described a visualisation tool, called visreg, for different regression models. Visreg mainly
allows illustrating the regression curve along with the corresponding region (see Figure 17). The work in
[97] also discussed in detail the visualisation techniques for various regression models using Stata. The study
presented in [98] suggested an interactive approach to developing regression models. Two approaches are proposed: the
first is based on multiple experts, while the second performs local regression relying on guidance by the user.
Figure 17: Output of visreg for a non-linear regression model
Visualisation of online regression models should allow for the following:
- The accommodation of the model in 2D and 3D if the data is multi-variate.
- The evolution of the regression model, for instance how the fit changes as new data becomes available.
- The evolution of the model’s parameters when new data arrives.
- The quantification of the fitness/shape of the model to the data.
- The indication of the regions where the fit is not good.
These requirements apply uniformly to all types of regression models (generalised linear models,
nonparametric models, decision trees, etc.) and to all computational models (neural networks, probabilistic
models, statistical learning, etc.) used to compute the regression.
It is worth mentioning that regression models have also been used for the efficient visualisation of multi-
dimensional data. A number of studies relied on regression models, such as [99][91]. For instance, the work
in [99] used an ensemble of regression models (i.e., neural networks) to reduce the dimensionality by
mapping the features in the input space into a two-dimensional latent space. The authors in [91] developed a tool
for multivariate data visualisation and exploration based on the integrated use of regression analysis and
advanced parallel coordinates visualisation. Because of the difficulty of using parallel coordinates for
presenting multivariate data on a 2D screen, the authors used a LASSO-based regression model to select,
order, and group dimensions.
4 Architectural requirements for scalable visual analytics
In order to deal with the visualisation issues posed by massive datasets and continuous unbounded data
streams, PROTEUS proposes a novel architecture using incremental methods. This approach will allow end users
to explore both data-at-rest and data-in-motion efficiently in order to make well-informed decisions in real time.
The architecture we propose consists of three main layers (see Figure 18): Data Collector, Incremental Analytics Engine and Visualization Layer. The Data Collector is in
charge of continuously getting new data from data sources and sending them to the next processing layer.
The Incremental Analytics Engine processes data using the online incremental algorithms and outputs up-to-
date results, which are then visualized by the third layer. The visualization of the results at various time points
allows the users to track and interact with those results in real time.
Figure 18: PROTEUS’s Architecture
4.1 Data collector
The data collector is in charge of continuously collecting new stream data points from the data sources. As soon
as a data chunk (window) becomes available, it is sent to the next layer in order to cope with the high velocity of the
stream. The process of data collection is done incrementally, in line with the requirements of stream
processing. Figure 19 depicts how the data collector retrieves and sequentially sends data chunks to the next
layer:
Figure 19: Data collector
With this approach, the next layer does not need to wait until the data collection process ends, since it is
continuously receiving chunks of data from the data collector.
4.2 Incremental analytics engine
The incremental analytics engine processes data incrementally, mostly using the concepts of recursivity and
approximation. The algorithmic processing of each batch leads to results that are communicated to the
next layer for visualization. Online incremental analytics algorithms can range from simple statistical
moments (e.g., average, median, sum, max, min, etc.) to advanced machine learning and data mining
algorithms such as classifiers and clustering algorithms. For instance, Figure 20 depicts how the average is
calculated with a traditional approach.
Figure 20: Traditional AVG computation
In order to send the final result to the visualization layer, the offline AVG computation needs to wait until all
the data points are processed. Instead, Figure 21 summarizes a naïve online version of the AVG
computation.
Figure 21: Concept of incremental AVG computation
This version depicts a poor implementation of the incremental average, since a new result is sent for each new element processed, leading to communication overhead. In addition, it assumes that only one partial result is generated (e.g., there is no separate average per key) and that the whole computation occurs on a single node (no distributed architecture). As explained in the Data Collection section, input data are received partially and then processed chunk-wise.
To realize this incrementality in an efficient way, we make use of mechanisms offered by big data streaming engines. These engines introduce concepts such as windowing, which splits data streams into finite sequences of data points. By using windows, it is possible to execute aggregations on unbounded data streams. We process data in windows of size N (denoted as WINDOW_SIZE in Figure 22) to generate a partial result. When a window is filled, it automatically calls the apply(…) method that initiates the computation for that window. Every incremental operation, executed over the window, needs the result of the previous one to successfully calculate the new average. Normally it is necessary to save not only the new average, but also the number of windows used for that partial result. That information is called state.
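The state idea can be illustrated with a small stand-alone sketch (plain Python, not the actual Flink implementation): the state carries the running average together with the number of points seen so far, so folding in a new window only needs the previous result, never the full history.

```python
def incremental_avg(state, window):
    """Fold one window into the running average.
    state = (avg_so_far, points_seen); returns the updated state."""
    avg, n = state
    total = avg * n + sum(window)   # reconstruct the running sum
    n += len(window)
    return (total / n, n)

state = (0.0, 0)                        # initial state: no data yet
for window in [[1, 2, 3], [4, 5, 6]]:   # two windows of size 3
    state = incremental_avg(state, window)
    # each partial result would be sent to the visualization layer here

# state == (3.5, 6): the exact average of 1..6, computed window by window
```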
The code in Figure 22 summarizes the normal flow for stream processing in Apache Flink, using the window concept. The IncOperation() class provides the logic necessary to calculate approximated results, window by window. As an example of an incremental operation, Figure 23 shows the schema of the apply(…) method of the IncrementalAverageOperation implementation.
// Figure 20 (traditional AVG):
for value in values:
    avg += value
avg = avg / values.length
send(avg)

// Figure 21 (naïve online AVG):
for value in values:
    avg += value
    send(avg / values.length)
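The difference between the two variants is easy to quantify: the offline version sends a single result only after seeing all the data, while the naïve online version sends one result per element. A small runnable sketch (illustrative only, with a pluggable send callback):

```python
def offline_avg(values, send):
    """Traditional AVG: one result, emitted only at the very end."""
    total = 0
    for v in values:
        total += v
    send(total / len(values))

def naive_online_avg(values, send):
    """Naive online AVG: one result per element -> communication overhead."""
    total = 0
    for i, v in enumerate(values, start=1):
        total += v
        send(total / i)

sent = []
naive_online_avg([2, 4, 6], sent.append)
# sent == [2.0, 3.0, 4.0] -- three messages for three points
```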
Figure 22: Flink generic workflow for incremental operations
Figure 23: Implementation of the apply method for IncrementalAverageOperation
As incremental operations continuously send results to the visualization layer, a continuous, full-duplex communication channel between the incremental analytics engine and the visualization layer is required. For this reason, we have decided to use WebSockets. WebSocket is a protocol providing full-duplex communication channels over a single TCP connection. It is an independent TCP-based protocol designed to be implemented in web browsers and web servers, but it can be used by any client or server application. We use the WebSocket protocol instead of HTTP or other application protocols because these do not provide bidirectional communication (in a normal web scenario, architectures are based on request-response protocols like HTTP).
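To make the protocol concrete: after the initial HTTP upgrade handshake, WebSocket data travels in lightweight binary frames rather than request-response pairs. A sketch of the unmasked, server-to-client text frame defined by RFC 6455, simplified to small payloads only:

```python
def encode_text_frame(message: str) -> bytes:
    """Encode a short text message as a server-to-client WebSocket frame
    (RFC 6455): one byte for FIN=1 + text opcode 0x1, one byte for the
    payload length, then the payload itself.
    Simplification: payloads up to 125 bytes, no masking, no fragmentation."""
    payload = message.encode("utf-8")
    if len(payload) > 125:
        raise ValueError("extended payload lengths not handled in this sketch")
    header = bytes([0x81, len(payload)])  # 0x81 = FIN bit set + text opcode
    return header + payload

# A partial result pushed to the browser would be framed like this:
frame = encode_text_frame('{"avg": 3.5}')
# frame[0] == 0x81, frame[1] == payload length, rest is the JSON payload
```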
When a window computation ends, the DataStream class automatically calls the invoke() method of the WebsocketSink class, which is in charge of sending results to the visualization layer. Figure 24 depicts how the incremental analytics engine and the visualization layer are connected. Once a partial result is computed, it is sent to the websocket server and then to the visualization layer.
Figure 24: Real-time and incremental communication
// Figure 22 (Flink workflow):
stream
    .keyBy("key")
    .countWindow(WINDOW_SIZE)
    .apply(new IncOperation())
    .addSink(new WebsocketSink());
// Figure 23 (apply method of IncrementalAverageOperation):
ValueState state = getRuntimeContext().getState();
double[] values = getWindowValues();
// calculate new avg using the new elements and the previous result
double actualAVG = calculate(state, values);
// update window state with new values
updateWindowState(state, actualAVG);
// send partial result to WebsocketSink
collector.collect(actualAVG);
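The whole keyBy / countWindow / apply / addSink flow can be mimicked in a few lines of plain Python (a toy stand-in for Flink, with hypothetical names): records are grouped by key, buffered into count windows, folded with per-key state, and every partial result is pushed to a sink as soon as a window fills.

```python
from collections import defaultdict

def run_pipeline(stream, window_size, sink):
    """Toy stand-in for keyBy -> countWindow -> apply -> addSink.
    stream yields (key, value) pairs; sink receives (key, avg) results."""
    buffers = defaultdict(list)            # per-key count window
    state = defaultdict(lambda: (0.0, 0))  # per-key (avg, count) state
    for key, value in stream:
        buffers[key].append(value)
        if len(buffers[key]) == window_size:   # window full: apply()
            avg, n = state[key]
            total = avg * n + sum(buffers[key])
            n += window_size
            state[key] = (total / n, n)
            sink((key, total / n))             # invoke() on the sink
            buffers[key] = []

results = []
run_pipeline([("a", 1), ("b", 10), ("a", 3), ("b", 20)], 2, results.append)
# results == [("a", 2.0), ("b", 15.0)] -- one partial result per full window
```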
4.3 Visualization layer
This layer is a web-based library that allows users to graphically visualize, in real time, the results of the incremental operations carried out by the incremental analytics engine. Visualizations are built from the set of minimalist graphs that the layer provides; line, bar, pie and stream graphs are some of the basic visualization elements available in this library. All of the components of the visualization layer have been developed using JavaScript, since it is the de facto programming language for creating interactive applications on the web.
JavaScript implementations follow the ECMAScript language specification. ECMAScript 6 (commonly known as ES6) is the latest specification, and the one we use for developing the visualization layer. ES6 is not yet fully supported by all modern browsers, but they tend to implement most of its features, such as arrow functions, classes, modules, etc.
This layer continuously receives partial results from the incremental analytics engine. To implement this functionality, we need two core components: (i) a websocket connector that receives the results from the incremental analytics engine and (ii) a graph library that contains the basic graphical elements used to visualize data.
4.3.1 Websocket connector
WebSocket is a technology, based on the WebSocket protocol, that makes it possible to establish a continuous full-duplex connection between a client and a server. Although the ws protocol is platform independent, clients are typically web browsers. The visualization layer provides a JavaScript websocket connector that enables bidirectional communication between web browsers and the WebsocketSink component developed in the analytics engine layer. This connector is in charge of receiving and sending data, and acts as a proxy between the visualization and analytics layers.
4.3.2 Graph library
Our graph library is a visualization tool that allows users to visualize data in real time. The library is built using Scalable Vector Graphics (SVG) as the format for presenting data to users. SVG is a language for describing two-dimensional graphics in XML. It allows for three types of graphics objects: vector graphic shapes, images and text. Graphical objects can be grouped, transformed and composited into previously rendered objects. SVG also supports nested transformations, clipping paths, alpha masks, filter effects and template objects.
After analyzing other graphic technologies such as Canvas and WebGL, we opted for SVG due to its simplicity and easy user-interaction API. To deal with the SVG API and facilitate graph creation, we have started building the library on top of D3.js. D3.js is a JavaScript library for manipulating documents based on data. It exploits the full capabilities of modern browsers and follows a data-driven approach to DOM manipulation.
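As a taste of why SVG suits this kind of chart, a line graph reduces to a single polyline element whose points attribute is derived directly from the data. The sketch below builds that markup by plain string construction (in Python, purely to show the mapping; D3.js automates the same data-to-attribute binding in the browser):

```python
def polyline_svg(data, width=300, height=100):
    """Build a minimal SVG line graph: x is the sample index, y is the
    value scaled to the drawing height (SVG's origin is top-left, so
    larger values map to smaller y coordinates)."""
    top = max(data)
    step = width / (len(data) - 1)          # horizontal spacing per sample
    points = " ".join(
        f"{round(i * step)},{round(height - v / top * height)}"
        for i, v in enumerate(data)
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">'
            f'<polyline fill="none" stroke="black" points="{points}"/></svg>')

svg = polyline_svg([1, 3, 2, 4])
# yields an <svg> element whose polyline points are "0,75 100,25 200,50 300,0"
```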
5 Conclusions
Interactive visualisation of big data is extremely valuable for both the expert (data scientist) and the end-user for understanding the behaviour of machine learning algorithms and their decision-making process. Tuning the parameters, changing the training data, and tracking back the results (cause-effect) are some of the requirements that white-box machine learning should meet in order to provide transparency.
In this report, we reflect on these requirements by discussing the value of interactive visualisation and the existing visualisation techniques and tools described in the literature, related to selected computational models used for online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. These requirements reflect mostly the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.