9
Quality-Based Visualization Matrices Georgia Albuquerque 1 , Martin Eisemann 1 , Dirk J. Lehmann 2 , Holger Theisel 2 and Marcus Magnor 1 1 Computer Graphics Lab, TU Braunschweig, Germany 2 Visual Computing, University of Magdeburg, Germany Email: 1 {georgia, eisemann, magnor}@cg.tu-bs.de, 2 {dirk, theisel}@tu-bs.de Abstract Parallel coordinates and scatterplot matrices are widely used to visualize multi-dimensional data sets. But these visualization techniques are in- sufficient when the number of dimensions grows. To solve this problem, different approaches to pre- select the best views or dimensions have been pro- posed in the last years. However, there are still sev- eral shortcomings to these methods. In this paper we present three new methods to explore multivari- ate data sets: a parallel coordinates matrix, in anal- ogy to the well-known scatterplot matrix, a class- based scatterplot matrix that aims at finding good projections for each class pair, and an importance aware algorithm to sort the dimensions of scatter- plot and parallel coordinates matrices. 1 Introduction With the exponentially increasing amount of ac- quired multivariate data, several multi-dimensional visualization techniques have been proposed dur- ing the last decades [10]. Based on the fact that human perception cannot deal well with more than three continuous dimensions simultaneously, such techniques usually project the data in low- dimensional embeddings and combine these repre- sentations in a single plot or present them to the user in an interactive way. Some well-known exam- ples of multi-dimensional visualization techniques are glyph techniques [17], parallel coordinates [9], scatterplot matrices [7] and pixel level visualiza- tions [11]. But even these techniques do not scale well to high-dimensional data sets. In this work we focus on parallel coordinates plots (PCP) and scat- terplot matrices (SPLOM), and propose extensions to these well-known visualization techniques. Scatterplots are one of the oldest and widely used visualization methods.We can define them as graphs where the values of two variables for a sample in a data set are used to plot a point in 2-dimensional space, resulting in a scattering of points. Scatter- plots are very useful for visually determining the correlation between two variables. A SPLOM is a symmetric matrix of adjacent scatterplots and al- lows the user to analyze the diverse dimensions at once. If there are n variables, the SPLOM has di- mension n × n and the element at the i-th row and j -th column is a scatterplot of the i-th and j -th vari- able. Related to this kind of visualization, we pro- pose two extensions: a class based scatterplot ma- trix and an importance oriented reordering of the dimensions of the matrix. The proposal of the class based SPLOM (C-SPLOM) is to support the visual analysis of labeled (classified) data sets. In such data sets, the analyst often searches for projections where distinct clusters can be observed. Previous approaches aim at finding good views of a data set considering all classes at once. The problem with such approaches is that this global optimization may ignore views that separate two classes well, be- cause of the distribution of the remaining classes. To deal with this, our C-SPLOM presents the best projection for each class pair, based on a ranking in- dex. This class based visualization method is useful to analyze labeled data sets with a large number of variables that cannot be well visualized using tradi- tional SPLOMs. Another popular visualization technique are par- allel coordinates plots (PCPs) [9]. In such plots, each sample of an N -dimensional data set is rep- resented by a polyline that intersects N vertical axes (dimensions). The intersection point repre- sents its value in the respective dimension. Similar to the scatterplot matrices the parallel coordinates plots, do not scale well when the number of dimen- sions grows, as important dimensional relationships might not be visualized. Addressing this shortcom- ing, we propose an importance oriented parallel co- VMV 2009 M. Magnor, B. Rosenhahn, H. Theisel (Editors)

Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

Quality-Based Visualization Matrices

Georgia Albuquerque1, Martin Eisemann1, Dirk J. Lehmann2, Holger Theisel2 and Marcus Magnor1

1Computer Graphics Lab, TU Braunschweig, Germany2Visual Computing, University of Magdeburg, Germany

Email: 1{georgia, eisemann, magnor}@cg.tu-bs.de, 2{dirk, theisel}@tu-bs.de

Abstract

Parallel coordinates and scatterplot matrices arewidely used to visualize multi-dimensional datasets. But these visualization techniques are in-sufficient when the number of dimensions grows.To solve this problem, different approaches to pre-select the best views or dimensions have been pro-posed in the last years. However, there are still sev-eral shortcomings to these methods. In this paperwe present three new methods to explore multivari-ate data sets: a parallel coordinates matrix, in anal-ogy to the well-known scatterplot matrix, a class-based scatterplot matrix that aims at finding goodprojections for each class pair, and an importanceaware algorithm to sort the dimensions of scatter-plot and parallel coordinates matrices.

1 Introduction

With the exponentially increasing amount of ac-quired multivariate data, several multi-dimensionalvisualization techniques have been proposed dur-ing the last decades [10]. Based on the factthat human perception cannot deal well with morethan three continuous dimensions simultaneously,such techniques usually project the data in low-dimensional embeddings and combine these repre-sentations in a single plot or present them to theuser in an interactive way. Some well-known exam-ples of multi-dimensional visualization techniquesare glyph techniques [17], parallel coordinates [9],scatterplot matrices [7] and pixel level visualiza-tions [11]. But even these techniques do not scalewell to high-dimensional data sets. In this work wefocus on parallel coordinates plots (PCP) and scat-terplot matrices (SPLOM), and propose extensionsto these well-known visualization techniques.

Scatterplots are one of the oldest and widely usedvisualization methods.We can define them as graphs

where the values of two variables for a sample in adata set are used to plot a point in 2-dimensionalspace, resulting in a scattering of points. Scatter-plots are very useful for visually determining thecorrelation between two variables. A SPLOM isa symmetric matrix of adjacent scatterplots and al-lows the user to analyze the diverse dimensions atonce. If there aren variables, the SPLOM has di-mensionn × n and the element at thei-th row andj-th column is a scatterplot of thei-th andj-th vari-able. Related to this kind of visualization, we pro-pose two extensions: a class based scatterplot ma-trix and an importance oriented reordering of thedimensions of the matrix. The proposal of theclassbased SPLOM(C-SPLOM) is to support the visualanalysis of labeled (classified) data sets. In suchdata sets, the analyst often searches for projectionswhere distinct clusters can be observed. Previousapproaches aim at finding good views of a data setconsidering all classes at once. The problem withsuch approaches is that this global optimizationmay ignore views that separate two classes well, be-cause of the distribution of the remaining classes.To deal with this, our C-SPLOM presents the bestprojection for each class pair, based on a ranking in-dex. This class based visualization method is usefulto analyze labeled data sets with a large number ofvariables that cannot be well visualized using tradi-tional SPLOMs.

Another popular visualization technique are par-allel coordinates plots (PCPs) [9]. In such plots,each sample of anN -dimensional data set is rep-resented by a polyline that intersectsN verticalaxes (dimensions). The intersection point repre-sents its value in the respective dimension. Similarto the scatterplot matrices the parallel coordinatesplots, do not scale well when the number of dimen-sions grows, as important dimensional relationshipsmight not be visualized. Addressing this shortcom-ing, we propose an importance orientedparallel co-

VMV 2009 M. Magnor, B. Rosenhahn, H. Theisel (Editors)

Page 2: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

ordinates matrix(PCM). Unlike the SPLOM thePCM is not symmetric, each rowi of the matrixrepresents the relation of one dimensiond to theothers of the data set, ordered by the inherent in-formation value. Additionally, we propose a qual-ity aware dimension reordering framework for visu-alization matrices, like SPLOMs, C-SPLOMs andPCMs, to improve the visual analysis task of high-dimensional data sets.

2 Related Work

SPLOM and PCP are two of the most popular multi-dimensional visualization techniques and are imple-mented in diverse popular visualization tools as forexample in the XmdvTool [17] and GGobi [14].

2.1 Scatterplot Matrix

The SPLOM was first published by John Hartigan[7] and later explored and extended in diverse visualexploration tools. As aforementioned, SPLOMslose their effectiveness when the number of vari-ables is large; to deal with this problem differentapproaches have been proposed: The grand tour[3] is a dynamic tool that presents a continuous se-quence of lower dimensional (e.g. 2-dimensional)point scatters. However, an exhaustive explorationof a high-dimensional data set requires prohibitivetime. Projection pursuit [6, 8] was proposed as analternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (oneor two-dimensional) projections that expose inter-esting structures of the high dimensional data set.Later on, different projection pursuit indices [5, 8]and a combination of the grand tour and projec-tion pursuit [4] as a visual exploration system havebeen proposed. In a similar direction, the Scagnos-tics method [16, 18] was proposed.In this tech-nique, different scagnostics indices (e.g. Convex-ity, Skinny, etc.) are computed and presented asa scatterplot matrix of the indices themselves (thescagnostics SPLOM). Such scagnostic indices canbe used to reveal structures of the data set in theform of trends, hypersurfaces, clusters, or anoma-lies in the data set.

In our method we make use of projection pur-suit like measures in a twofold way, to selectinformation-bearing projections for the C-SPLOMand PCM, and to perform dimension reordering.

Considering classified datasets, a class consistencyvisualization algorithm has been proposed by [12].Similar to our class based matrix, the class consis-tency method proposes measures to rank lower di-mension representations. The method proposed in[12] filters the best scatterplots based on their rank-ing values and present them in an ordinary scatter-plot matrix. One problem of this method is that theSPLOM does not scale well for high-dimensionaldata sets and even if a zoom option is available,the overall visualization of the SPLOM is preju-diced. Another problem happens when all classesare analyzed together to rank the projections, inthis case, projections that separate two classes verygood might receive a bad ranking because of thedistribution of the remaining classes. Our methodreduces the matrix size to the number of classes ofthe data set and presents to the user the best projec-tions for each class pair individually.

2.2 Parallel Coordinates

Another very popular multivariate visualizationtechnique are parallel coordinates [9]. In a paral-lel coordinate plot each dimension appears just onceand the relation with other dimensions may be diffi-cult to pinpoint depending on the distance betweenthem in the plot. Diverse linking and brushing al-gorithms [17, 14] together with transparency lev-els have been proposed to help visualizing these re-lations. However, they do not solve the problemwhen one dimension shares important correlationswith more than its two neighboring dimensions inthe visualization. Opposingly, we propose a paral-lel coordinate matrix, where there is the possibilityto plot all possible 3-dimensional combinations foreach dimension. In this matrix we have for each di-mensiond up to(n− 1)/2 3D parallel plots, whered is the central dimension, theoretically revealingall important relations for this dimension. An im-portant issue for parallel coordinates is how to or-der the dimensions in the plot. Different proposalsto solve this problem focus on ordering the dimen-sions by similarity [2, 19], and a recent work [15]proposes a sorting of the dimensions based on thequality of the plots. In this second case, a rank func-tion evaluates each 2D dimensional parallel plot andthe result is used to determine the order of the di-mensions in the final plot. Our PCM capitalizeson this second approach to order the 3D individ-ual plots. For each dimension we sort its respective

Page 3: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

3D plots using a ranking function; the plots with ahigher ranking are presented first and the ones witha lesser amount of useful information are presentedlast.

3 Visualization Matrices

In the following subsections we describe our infor-mation bearing visualization matrices in more de-tail and define the measures we use to rank thelow-dimensional projections. We then discuss re-ordering of scatterplot matrices using such qualitymeasures and how it can help to visualize high-dimensional data sets.

3.1 Parallel Coordinates Matrix

Parallel coordinates plots [9] are one of the tech-niques which allow to visualize an arbitrary num-ber of dimensions of a data set within the sameplot. This makes them very attractive for high-dimensional data sets but comes at a cost. Theamount of information bearing content is very sen-sitive to the ordering of dimensions [2]. In addition,every dimension can be paired with only two otherdimensions. Therefore important relations to a thirdor fourth dimension might be missed.

Our approach aims at presenting all those rela-tions in a single matrix, where each entry showsthe relationship between only two dimensions. Thisway alln2 possible combinations of dimensions arerepresented and no information is lost. The problemwith such an approach is the overstraining of theuser as he would have to check every single visual-ization for possible information content. It is there-fore important to sort the visualizations inter- aswell as intradimensionally, so that important visu-alizations are spatially close together and at knownpositions in the matrix. We found that such a matrixis most legible if three constraints are fulfilled:

1. Every row should contain one main dimen-sion, which appears in every visualization inthis row. A label is assigned at the left of therow for faster indexing.

2. The visualizations in each row should besorted in descending order according to theirinherent information value. The best should bepositioned on the left, the worst on the right.

3. The dimensions itself, i.e. the rows of thematrix, should be rearranged so that the most

Figure 1: Structural overview of the PCM. Eachrow has one main dimension appearing in the mid-dle of each 3D plot and as a label on the left for abetter overview. The rows are ordered according tothe overall importance of the main dimension in as-cending order. The visualizations in each row areagain odered according to their relative importance.

valuable rows are on top, while dimensionswith lesser information value are closer to thebottom of the matrix.

This way, only looking at then×p submatrix, start-ing at index(0, 0) reveals the most valuable rela-tionships, i.e. visualizations to the user. An exampleof this concept is given in Figure 1.

In a first step, alln2 2D visualizations are cre-ated. A quality measurement is applied to charac-terize the possible information value of each visu-alization. Most approaches in the literature aim atfinding the best ordering of alln dimensions glob-ally, or choosing a subset of them.

Only recently an approach has been presented,which rates every PCP consisting of only two di-mensions and combines them in a second step tothe complete visualization [15].

We exemplarily make use of theiroverlap mea-sure for our test data, which measures the simi-larity between the different classes of the data setin Hough space. Visualizations with distinctiveclasses therefore receive a high quality value andvisualizations with very similar classes receive alow value. Other measurements, class and non-classbased, would be possible as well and can be easilyincluded in our framework, like [2].

We initialize the matrix so that each row of thematrix has one main dimension, e.g. each visual-ization in row 1 contains the dimension 1, each inrow 2 the dimension 2 and so on. We then sort thevisualizations intradimensionally, i.e. per row. Aseach visualization is associated with a quality value,

Page 4: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

we can easily apply a simple standard sorting algo-rithm. We always combine two 2D visualizationsto a 3D visualization, as both share the same maindimension, which is then positioned in the middle.

In a last step we reorder the dimensions itself,i.e. the rows of the matrix. We tested different cri-teria, like summation of all quality values in eachrow or linear and Gaussian falloffs, increasing theimportance of the first visualizations in each row,while decreasing the importance of the lesser valuedones, and found that the linear falloff gives good re-sults for the PCM. More details and a more generaldescription for dimension reordering visualizationmatrices is given in Section 3.3. Therefore the qual-ity value of thej-th dimension is computed by

Dj =

nX

i

(n − i)

nQ(p(j,i)) , (1)

wheren is the number of dimensions andQ(p(j,i))is the quality value for thei-th visualization in thej-th row of the matrix.

3.1.1 Evaluation and Results

We used theWisconsin Diagnostic Breast Cancer(WDBC) as well as others from the UCI data base totest the usefulness of our PCM. The WDBC data setconsists of 569 samples with 32 real-valued dimen-sions each [13]. The task is to find the best dimen-sions separating the malign and benign cells in thedata set. We created our PCM for this data set us-ing theoverlap measurefrom [15]. Other measure-ments could be used as well, depending on the task.Figure 2 shows the complete PCM with the bestand worst ranked visualizations enlarged. Visual-izations with higher information value are found inthe top left of the matrix, as desired, while the visu-alizations on the bottom right are hardly of any use.Looking only at the best parallel coordinates plots,like other approaches did [19, 15], one might missimportant information. E.g. dimension 22 (radius(worst)) in combination with dimension 9 (concavepoints (mean)), 29 (concave points(worst)), 25 (area(worst)) and 5 (area (mean)) all separate the ma-lign and benign cells comparingly well, but in usualparallel coordinate visualizations only two combi-nations could be displayed in one visualization.

One could argue that SPLOMs fulfill a similartask as PCMs, but there are major differences be-tween these two approaches. First, SPLOMS are

Figure 2: Results of the PCM for the WDBC dataset. Malign nuclei are colored black while healthynuclei are red. Visualizations with only few overlapare preferred, so the difference between malign andbenign cells becomes more clear, and can be foundin the top left of the matrix. The worse visualiza-tions in the bottom right hardly convey any usefulinformation.

not sorted. This limits their usefulness for data setsof up to a dozen dimensions only, otherwise ex-haustively investigating each plot is overstrainingfor a user. Even when sorting the dimensions be-forehand, as proposed in 3.3, additional informa-tion as color encoding or ranking values are neededto guide the visual search. Using PCMs, looking atthen × p submatrix, starting at index(0, 0) alwaysreveals the most valuable relationships, i.e. visual-izations to the user, no matter how high the dimen-sionality of the data set is. Of course the choice ofParallel Coordinates could also be exchanged withScatterplots, which one is more beneficial dependson the preference of the user.

3.2 Class-Based Scatterplot Matrix

A common task in visual analytics is to searchfor projections of high-dimensional data that showswell defined clusters. The same occurs whenclass information is available; finding the projec-

Page 5: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

tions or dimensions that can well separate the dis-tinct classes is a desired outcome. To serve thispurpose, we introduce a new visualization matrixcalledclass-based scatterplot matrix(C-SPLOM).We assume that each point in the high-dimensionalspace has a class labelc. Diverse data sets have aclear definition of classes, but this assumption doesnot limit the use of this technique to these data sets,as class labels can be assigned through an automaticclustering algorithm.

Similar to the well known SPLOM, the class-based version is also a matrix of pairwise scatter-plots s(a, b), with data dimensionsa and b. Thedifference is that the classes are listed on the rowsand columns instead of the original dimensions. Ifthere arem classes in a data set, the C-SPLOM hasdimensionsm × m and the element at thei-th rowandj-th column is the scatterplot of thek-th andh-th variable. The projection axesk andh are chosenin a way to maximize the information content forthe pairwise relation of thei-th andj-th classes.

An important issue of the C-SPLOM is to choosean appropriated analysis algorithm to compute thequality indexQ(s(a, b)) of the scatterplots. Dif-ferent algorithms can be used to this end [12, 15]as long as they consider the pairwise relationshipsbetween classes. The problem in considering allclasses at once, as proposed in [12, 15], is thatthe global optimization may ignore views that sep-arate two classes well, because of the distributionof the remaining classes.The Figure 3 shows exam-ples of scatterplots generated from theOlivesdataset. (Section3.2.1). The first scatterplots(4, 5)has the highest rankQ(s(4, 5)) = 1 consideringall classes. However the scatterplots(2, 8) withrank Q(s(2, 8)) = 0.44 presents a better separa-tion of3-th and4-th classes presented in the data set(theSouth-Apuliaregion in red andSicily region ingreen, respectively), as can be seen in the third plot.This outcome is only possible if the adopted mea-sure analyzes the pairwise relationships betweenclasses instead of a global measure. The resultingquality indexQ is then used to rank the scatterplots,and the best scatterplot is selected for the respectiveclass pair.

We tested our C-SPLOM with two similar algo-rithms to measure the quality of scatterplots withclass information. The first algorithm is theclassdensitymethod proposed in [15]. It assigns highvalues to plots with few overlap between the classes

Figure 3: The first scatterplot is the one with thehighest rankQ = 1, when considering all classes,however the second one with rankQ = 0.44presents a better separation of the3-th and 4-thclasses (red and green), as can be seen in the thirdplot.

and dense clusters. We adopted their algorithm withthe difference that only one class pair is consideredper time. To rank projections considering a specificclass pair, the algorithm is applied only to the dataof the respective classes and the best ranked scatter-plot will represent this class pair in the C-SPLOM.The second measure, as the first one, presents highvalues for plots with well separate clusters and in-stead of dense clusters, this measure has a bias to-wards larger distances between the clusters. Thedistance at a pixelp is defined asr, wherer isthe radius of the enclosing sphere of thek−nearestneighbors ofp:

r = maxi∈Np ||x − xi||, (2)

as defined in [15]. Both measures then computethe sum of the mutual differences of these images.To decide which algorithm is the best one dependsstrongly on the user task. Figure 4 shows an exam-ple of the differences between the C-SPLOMs fortheWinedata set (Section3.2.1), using these two ap-proaches.

Note that for the1-st and2-nd class (in blackand red respectively) the class density presents ascatterplot with more dense clusters as best result,while the class distance measure presents a scatter-plot where the distance between the center of theclusters is larger. The same happens for the1-st and3-rd class (in black and green respectively), and forthe2-nd and3-rd the same scatterplot is chosen.

3.2.1 Evaluation and Results

To evaluate our C-SPLOM, we tested it on diversereal data sets from the UCI repository [1] with la-beled information. The first presented data set is

Page 6: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

Figure 4: The resulting C-SPLOM for the classdensity (left) and class distance quality measures(right).

theWinedata set, a classified data set with 178 in-stances and 13 attributes describing chemical prop-erties of Italian wines derived from three differentcultivars. The user task here is to find the projec-tions (dimensions) that separate these classes well.Figure 5 shows the comparison of the C-SPLOM(upper-right) and its counterpart SPLOM (bottom-left). The C-SPLOM was computed by means ofthe distance measuredescribed previously. An-other data set we used to evaluate the C-SPLOMis theOlives[20] data set. With 572 olive oil sam-ples from nine different regions of Italy; for eachsample the normalized concentrations of eight fattyacids are given. Figure 3 show two scatterplotsof this data set, the first one with the4-th and5-th dimensions (concentrations of the oleic andlinoleic acids), and the second with the2-nd and8-th dimensions (concentrations of the palmitoleicand eicosenoic acids).

3.3 Dimension Reordering

Often, n-dimensional datasets are represent as aseries of 2D scatterplots. Such scatterplots arecommonly arranged in a SPLOM and usually, thedimensions are arranged as provided by the dataset. Dimension reordering methods for SPLOMbased on the similarity between the projections havebeen proposed [18]. But noquality-aware sort-ing methods have been presented.. This motivatedus to adopt aquality-aware sortingconcept and tostart investigating the advantages of such an ap-proach. Note that the concept of dimension reorder-ing can be applied to any matrix-based visualiza-tion, e.g. in Section 3.1 we also apply a dimen-sion reordering for our PCM, but for the ease ofunderstanding we will use SPLOMs in this chapter.

Figure 5: Results of the C-SPLOM for the Winedata set. Visualizations with only few overlap arepreferred, so the difference between the wine cul-tivars becomes clearer. The best visualizations foreach class pair are shown in the C-SPLOM.

The pipeline of our framework for quality-aware re-ordered SPLOMs (D-SPLOM) is shown in Figure 6and explained in the following. As a pre-process forthe reordering, we initially apply a quality-measureQ(s(a,b)) to each scatterplots(a,b). This quality-measure ought to be a scalar one, so that it ratesthe scatterplot unambiguously with a single num-ber. Apart from that, it could be any useful measure[18, 12, 15]. Furthermore, we need this quality-measure to estimate the quality of each dimensionitself. Once we haven − 1 scatterplots for eachdimension in an-dimensional dataset, we considern − 1 quality measures (one per plot) to computethe dimension overall quality-measure.

For each dimensiond, we compute a dimension-measure as the base for reordering. A dimension-measureQd is a scalar functionQd : R

n−1 → R

over all quality-measuresQ(s(a,b)) of a dimen-sion d: Dd = Q(s(d,i)), wherei 6= d and i ∈[1, . . . , n]. It appraises the quality-aware impactover all scatterplots, which contains the dimensiond. There are exactlyn dimension-measures for an-dimensional dataset. Different functions couldbe used for computingDd, as long as it is guar-anteed that the measure-values are comparable toeach other, as themean, aPCAor thevarianceover

Page 7: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

pre-computed quality-measures. We decided to usethe sum over all quality-measures as dimension-measure for this paper, as a proof of concept:

Dd =

nX

i=1, i6=d

Q(s(d,i)). (3)

This measure produces the same partial order be-tween the dimension-measures as the mean, withthe advantage that it is easier and faster to compute.In this last step, we make use of the computed in-formation to reorder an×n SPLOM. Because sucha SPLOM is symmetric, we use the upper triangu-lar matrix for display. First, we allocate to each di-mension its quality-measure valueDd. We sort allquality-measure/dimension pairs(Dd, d) by meansof a simple partial order (≥) with respect to thequality-measureDd, which gives us a dimension-rankingr = (sort{(Dd, d)};≥). The dimensionr[0].d from rankingr describes thebestdimensionandr[n].d theworstone.

We map a dimensiond to its position in therankingr and, depending on that mapping, we re-order the scatterplots in the SPLOM and get thedimension-based reordered SPLOM (D-SPLOM),as can be seen in Figure 6.

Figure 6: Overview of the dimension reorderingProcess

3.3.1 Evaluation and Results

To evaluate our concept, we tested it on realclass-based and non-class-based multi-dimensionaldatasets. For classified data, we applied theClassDensity Measure(CDM) [15] as a quality-measureto theOlivesdataset [20], see Section 3.2.1 for a de-scription. The CDM assigns higher values to scat-terplots with a better separation between the classes.The result of the reordering is shown in Figure7. For non-classified data we applied theRotatingVariance Measure(RVM) [15] as quality-measureto the Parkinson-dataset. This set has 13 dimen-sions, no classes and 197 items. The RVM rates the

linear and non-linear correlation within the scatter-plots with respect to its two dimensions. The resultis shown in Figure 8.

Figure 7: Evaluation on class-basedOliveoil dataset using the CDM: The ordinary SPLOM (top), theresulting D-SPLOM (bottom)

In Figure 7 and 8, relevant scatterplots are col-ored more red than non-relevant. It is easy to seethat both types of scatterplots are distributed overthe whole SPLOM before the reordering. Afterthe reordering, relevant and non-relevant scatter-plots in the D-SPLOM are mostly separated fromeach other. Therefore, we can observe that thequality-aware reordering reduces the region of in-terest, speeding up the visual search. I.e., a quality-aware reordering has practical advantages and en-hances the visual quality of SPLOMs. Depend-ing on the data set, some dimensions might containoutliers. This may happen, when the used quality-measure assigns a low value to most visualizationsof one dimension, but a high value to only a few, asthe dimension4 in the SPLOM, shown in Figure 8.Our applied color coding allows for easy recogni-tion of such plots. In the future, we should inves-

Page 8: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

Figure 8: Evaluation on non-class-basedParkinsondata set: The ordinary SPLOM (top), the resultingD-SPLOM (bottom).

tigate how far the quality-aware framework is ap-propriate and stable to fading out non-relevant scat-terplot from SPLOMs, speeding up even more thevisual search task.

4 Conclusion

In this paper we, presented two new visualizationmatrices to support the visual analysis of high di-mensional data sets: A class based scatterplot ma-trix for data sets with label information that sup-ports the analysis of pairwise relationship betweenclasses, and a parallel coordinates matrix that al-lows examining correlations between all possibledimensions using parallel coordinates plots. Ad-

ditionally, we proposed an information-bearing re-ordering framework that can improve the visualanalysis task for any matrix-based visualizationmethod. We have shown that our quality-based vi-sualization matrices together with the presented re-ordering framework successfully reduces the regionof interest of the visualization matrices. In the fu-ture, we intent to evaluate our methods more thor-oughly with an adequate user study and to allowuser interaction while forming and reordering thematrices. Furthermore, we would like to test moresophisticated ranking functions for dimension re-ordering.

Acknowledgements

The authors gratefully acknowledge funding by theGerman Science Foundation from projects DFGMA2555/6-1 and DFG TH692/6-1.

References

[1] D. N. A. Asuncion. UCI machine learningrepository, 2007.

[2] M. Ankerst, S. Berchtold, and D. A. Keim.Similarity clustering of dimensions for anenhanced visualization of multidimensionaldata.Information Visualization, IEEE Sympo-sium on, 0, 1998.

[3] D. Asimov. The grand tour: a tool for view-ing multidimensional data.Jurnal on Scien-tific and Statistical Computing, 6(1):128–143,1985.

[4] D. Cook, A. Buja, J. Cabreta, and C. Hur-ley. Grand tour and projection pursuit.Jurnalof Computational and Statistical Computing,4(3):155–172, 1995.

[5] M. A. Fisherkeller, J. H. Friedman, andJ. W. Tukey. Prim-9: An interactive multi-dimensional data display and analysis system,volume In W. S. Sleveland, editor. Chapmanand Hall, 1987.

[6] J. Friedman and J. Tukey. A projection pursuitalgorithm for exploratory data analysis.Com-puters, IEEE Transactions on, C-23(9):881–890, Sept. 1974.

[7] J. A. Hartigan. Printer graphics for clustering.Journal of Statistical Computation and Simu-lation, 4(3):187–273, 1975.

Page 9: Quality-Based Visualization Matrices - TU Braunschweig...alternative to an exhaustive visual search, a statis-tical technique to search for low dimensional (one or two-dimensional)

[8] P. J. Huber. Projection pursuit.The Annals ofStatistics, 13(2):435–475, 1985.

[9] A. Inselberg. The plane with parallel coordi-nates.The Visual Computer, 1(4):69–91, De-cember 1985.

[10] D. A. Keim. Information visualization and vi-sual data mining.IEEE Transactions on Vi-sualization and Computer Graphics, 8(1):1–8,2002.

[11] D. A. Keim, M. Ankerst, and H.-P. Kriegel.Recursive pattern: A technique for visualiz-ing very large amounts of data. InVIS ’95:Proceedings of the 6th conference on Visual-ization ’95, pages 279–286, Washington, DC,USA, 1995. IEEE Computer Society.

[12] M. Sips, B. Neubert, J. P. Lewis, andP. Hanrahan. Selecting good views of high-dimensional data using class consistency.Computer Graphics Forum (Proc. EuroVis2009), 28(3):831–838, 2009.

[13] W. Street, W. Wolberg, and O. Mangasarian.Nuclear feature extraction for breast tumordiagnosis. IS&T / SPIE International Sym-posium on Electronic Imaging: Science andTechnology, 1905:861–870, 1993.

[14] D. F. Swayne, D. Temple Lang, A. Buja,and D. Cook. GGobi: evolving from XGobiinto an extensible frameworkfor interactivedata visualization.Computational Statistics &Data Analysis, 43:423–444, 2003.

[15] A. Tatu, G. Albuquerque, M. Eisemann,J. Schneidewind, H. Theisel, M. Magnor, andD. Keim. Combining automated analysis andvisualization techniques for effective explo-ration of high dimensional data.IEEE Sym-posium on Visual Analytics Science and Tech-nology, page to appear, 2009.

[16] J. Tukey and P. Tukey. Computing graphicsand exploratory data analysis: An introduc-tion. In Proceedings of the Sixth Annual Con-ference and Exposition: Computer Graphics85. Nat. Computer Graphics Assoc., 1985.

[17] M. O. Ward. Xmdvtool: Integrating multiplemethods for visualizing multivariate data. InProceedings of the IEEE Symposium on Infor-mation Visualization, pages 326–333, 1994.

[18] L. Wilkinson, A. Anand, and R. Grossman.Graph-theoretic scagnostics. InProceedingsof the IEEE Symposium on Information Visu-alization, pages 157–164, 2005.

[19] J. Yang, M. Ward, E. Rundensteiner, andS. Huang. Visual hierarchical dimension re-duction for exploration of high dimensionaldatasets, 2003.

[20] J. Zupan, M. Novic, X. Li, and J. Gasteiger.Classification of multicomponent analyticaldata of olive oils using different neural net-works. In Analytica Chimica Acta, volume292, pages 219–234, 1994.