
A User-Centric Taxonomy for Multidimensional Data Projection Tasks

Ronak Etemadpour1, Lars Linsen2, Christopher Crick1 and Angus Forbes3

1Computer Science Department, Oklahoma State University, Stillwater, OK, USA. 2School of Engineering and Science, Jacobs Bremen University, Bremen, Germany. 3Department of Computer Science, University of Illinois at Chicago, Chicago, USA.

Keywords: Multidimensional data analysis, task taxonomy, multidimensional data projection, user-centric evaluation.

Abstract: When investigating multidimensional data sets with very large numbers of objects and/or a very large number of dimensions, a variety of visualization methods can be employed in order to represent the data effectively and to enable the user to explore the data at different levels of detail. A common strategy for encoding multidimensional data for visual analysis is to use dimensionality reduction techniques that project data from higher dimensions onto a lower-dimensional space. In this paper, we focus on projection techniques that output 2D or 3D scatterplots which can then be used for a range of data analysis tasks. Existing taxonomies for multidimensional data projections focus primarily on tasks in order to evaluate the human perception of class or cluster separation and/or preservation. However, real-world data analysis of complex data sets often includes other tasks besides cluster separation, such as: cluster identification, similarity seeking, cluster ranking, comparisons, counting objects, etc. A contribution of this paper is the identification of subtasks grouped into four main categories of data analysis tasks. We believe that this user-centric task categorization can be used to guide the organization of multidimensional data projection layouts. Moreover, this taxonomy can be used as a guideline for visualization designers when faced with complex data sets requiring dimensionality reduction. Our taxonomy aims to help designers evaluate the effectiveness of a visualization system by providing an expanded range of relevant tasks. These tasks are gathered from an extensive study of visual analytics projects across real-world application domains, all of which involve multidimensional projection. In addition to our survey of tasks and the creation of the task taxonomy, we also explore in more detail specific examples of how to represent data sets effectively for particular tasks. These case studies, while not exhaustive, provide a framework for how specifically to reason about tasks and to decide on visualization methods. That is, we believe that this taxonomy will help visualization designers to determine which visualization methods are appropriate for specific multidimensional data projection tasks.

1 INTRODUCTION

Visualization is a crucial step in the process of data analysis. Often, when analyzing multidimensional data, the results of dimensionality reduction (DR) techniques are displayed in the form of 2D or 3D scatterplots that project the multidimensional points onto a lower-dimensional visual space. Methods using different algorithms to generate scatterplots with particular point placements are the most common visual encoding (VE) techniques for the resulting lower-dimensional data. DR techniques, coupled with appropriate VEs, enable an understanding of the relations that exist within the higher-dimensional data by displaying them in a way that makes it easier for users to discover meaningful patterns (Samet, 2005).

Data analysis tasks are primarily concerned with the detection of structures such as patterns, groups, and outliers. Within a multidimensional data set, data points can be grouped manually into classes or automatically into clusters. For example, classes may be defined through manually labeling a collection of documents so that each document belongs to one topic within a set of topics, or by splitting an image collection into ten classes by assigning each image a particular theme from a set of ten themes. Clusters, on the other hand, are generated automatically using a clustering algorithm that may, for instance, identify groupings of similar points, or partition the data into dissimilar groups where each cluster contains similar items (Muller et al., 2009). However, it may be difficult to see these clusters or classes when projected onto a lower-dimensional space. To make sense of this multidimensional data, it can be useful to know how the clusters or classes are defined and structured in the original multidimensional attribute space. However, multidimensional projection mappings are especially prone to distortion because projection methods may not necessarily preserve the spatial relations of the data. Thus, it is important to know how effective the scatterplots are at preserving segregation of the data (Sips et al., 2009).

Several studies evaluate the quality of projections with respect to preserving certain properties, thus guiding a user to select the most appropriate projection method for their task. Various numerical and visual methods have been introduced to quantify the accuracy of projection methods with respect to such properties (Sips et al., 2009; Tatu et al., 2009). Recent studies (Sedlmair et al., 2012b) have shown that the quality of cluster separation reported by these measures was highly discrepant with user assessments of cluster separation within the same data sets. Lewis et al. (Lewis and Ackerman, 2012) believe that accurate evaluation of clustering quality is essential for data analysts, and they showed that such clustering evaluation skills are present in the general population.

On the other hand, other studies have attempted to find a perception-based quality measure for scatterplots. They either evaluated users’ performance on layouts generated by different projection techniques (Etemadpour et al., 2014c) or allowed users to assess a series of scatterplots (Albuquerque et al., 2011). Etemadpour et al. (Etemadpour et al., 2014c) used eye-tracking in a user study, asking users to perform typical analysis tasks for projected multidimensional data. Other studies have investigated the perception of correlation in scatterplots from a psychological perspective; however, these studies did not consider real-world data sets (Rensink and Baldridge, 2010; Etemadpour et al., 2014a).

Because of the absence of a standard approach for evaluating multidimensional data projection, the results of these studies, and others like them, are difficult to compare. We present a taxonomy of visual analysis tasks for multidimensional data projection that we believe could be a useful means for evaluation. The idea of creating a task taxonomy has been recently explored by Brehmer and Munzner (Brehmer and Munzner, 2013). They contribute a multi-level typology of visualization tasks that augments existing taxonomies by filling a gap between low-level and high-level tasks. Specifically, they distinguish what the task inputs and outputs are, as well as why and how a visualization task is performed. In doing so, they more thoroughly organize the motivations for and methods of specific tasks for particular data analysis situations. Their task taxonomy is more general, and does not address multidimensional data projection in any detail. In this paper, we provide a taxonomy of visual analysis tasks related to multidimensional data projection. Our task taxonomy enables evaluation designers to investigate visualization performance effectively on both synthetic and real-world data sets. The main contributions of the paper are:

• We provide a systematic user-centric taxonomy of visual tasks related to projected multidimensional data.

• We divide the projection-related tasks into different categories based on their impact on the analysis of multidimensional data. The categories we identify are relation-seeking, behavior comparison, membership disambiguation, and pattern identification tasks.

• We enable, via our task taxonomy, visualization designers to improve visualization tasks related to the analysis of multidimensional data.

• We present our taxonomy as a guideline for researchers in choosing visualization techniques for these tasks, and provide explicit examples.

• We adapt Brehmer and Munzner’s multi-level typology of abstract visualizations to multidimensional data projection tasks (Brehmer and Munzner, 2013).

In the next section, we provide a brief review of existing task taxonomies for DR and VE techniques. In Section 3, we introduce our task taxonomy for multidimensional data projection by describing new sets of tasks related to typical analysis tasks, including pattern identification, such as detecting clusters; behavior comparison, such as comparing characteristics of subsets; membership disambiguation, such as counting the number of objects in a cluster; and relation seeking, such as correlating subsets to each other. We discuss the effects of our proposed tasks on the evaluation of scatterplots by providing some examples of how different tasks support decision making with respect to human perception of multidimensional data projections. We also characterize our proposed tasks using the multi-level typology of abstract visualization tasks (Brehmer and Munzner, 2013). We apply Brehmer and Munzner’s multi-level typology to describe two tasks as guidelines; the three questions (WHY, WHAT, HOW) can be used to structure the description of all tasks.

2 Related Work

Many projection methods exist to generate 2D similarity-based layouts from a higher-dimensional space. The design goals include maintaining pairwise distances between points as implemented in multidimensional scaling (MDS) (Borg and Groenen, 2010), maintaining distances within a cluster, or maintaining distances between clusters (Tenembaum et al., 2000). Principal component analysis (PCA) generates similarity layouts by reducing data to lower-dimensional visual spaces (Jolliffe, 1986). Some projection methods, such as isometric feature mapping (Isomap), favor maintaining distances between clusters instead. Isomap is an MDS approach that has been introduced as an alternative to classical scaling capable of handling non-linear data sets. It replaces the original distances by geodesic distances computed on a graph to obtain a globally optimal solution to the distance preservation problem (Tenembaum et al., 2000). Least-Square Projection (LSP) computes an approximation of the coordinates of a set of projected points based on the coordinates of some samples as control points. This subset of points is representative of the data distribution in the input space. LSP projects them to the target space with a precise MDS force-placement technique. It then builds a linear system from information given by the projected points and their neighborhoods (Paulovich et al., 2008).
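To make the geodesic-distance idea behind Isomap more concrete, the following is a minimal sketch rather than the published algorithm: it assumes a symmetric k-nearest-neighbor graph and uses Floyd-Warshall shortest paths, and the function names, the choice k = 2, and the half-circle sample data are illustrative assumptions. It only covers the distance replacement step, not the subsequent MDS embedding.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def geodesic_distances(points, k):
    """Approximate geodesic distances as shortest paths on a k-NN graph."""
    n = len(points)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # Connect i to its k nearest neighbors; edges are stored symmetrically.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )[:k]
        for j in neighbors:
            d = euclidean(points[i], points[j])
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)
    # Floyd-Warshall: relax every pair through every intermediate vertex.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


# Hypothetical data: 9 points sampled along a half circle of radius 1.
points = [(math.cos(math.pi * t / 8), math.sin(math.pi * t / 8)) for t in range(9)]
geo = geodesic_distances(points, k=2)
straight = euclidean(points[0], points[-1])
```

For the two end points of the half circle, the graph distance `geo[0][8]` follows the curve and is therefore larger than the straight-line distance `straight`, which is the kind of non-linear structure that the geodesic replacement is meant to capture.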

The correlations of data points or clusters are not always known after they have been mapped from a higher-dimensional data space to 2D or 3D display space. Thus, several approaches evaluate the best views of multidimensional data sets. Sips et al. (Sips et al., 2009) provide measures for ranking scatterplots with classified and unclassified data. They propose two additional quantitative measures on class consistency: one based on the distance to the cluster centroids, and another based on the entropies of the spatial distributions of classes. They propose class consistency as a measure for choosing good views of a class structure in high-dimensional space. Tan et al. (Tan et al., 2005), Paulovich et al. (Paulovich et al., 2008), and Geng et al. (Geng et al., 2005) also evaluate the quality of layouts numerically. By ranking the perceptual complexity of the scatterplots, other studies investigate user perception by conducting user studies on scatterplots, finding that certain arrangements were more pleasing to most users (Tatu et al., 2010; Albuquerque et al., 2011). However, these operational measures were not necessarily equivalent to the measures of user preference based on their qualitative perceptions.
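As an illustration of a centroid-based class consistency measure, the sketch below computes the fraction of projected points that lie closer to their own class centroid than to any other class centroid. This is one plausible reading of the idea, not necessarily the exact definition used by Sips et al.; the function names and the two toy layouts are assumptions made for the example.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def centroid_consistency(points, labels):
    """Fraction of points whose nearest class centroid is their own class."""
    classes = sorted(set(labels))
    centroids = {
        c: centroid([p for p, l in zip(points, labels) if l == c]) for c in classes
    }
    hits = 0
    for p, label in zip(points, labels):
        nearest = min(classes, key=lambda c: math.dist(p, centroids[c]))
        if nearest == label:
            hits += 1
    return hits / len(points)


labels = ["a", "a", "a", "b", "b", "b"]
# Hypothetical layout in which the two classes are well separated.
separated = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
# Hypothetical layout in which one point of each class sits inside the other class.
mixed = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]

score_separated = centroid_consistency(separated, labels)
score_mixed = centroid_consistency(mixed, labels)
```

A view with a higher score keeps the classes more compact around their centroids, so scores of this kind can be used to rank candidate scatterplots of the same labeled data.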

Sedlmair et al. (Sedlmair et al., 2012a) have discussed the influence of factors such as scale, point distance, shape, and position within and between clusters in qualitative evaluation of DR techniques. They examined over 800 plots in order to create a detailed taxonomy of factors to guide the design and the evaluation of cluster separation measures. They focused only on using scatterplot visualizations for cluster finding and verification. DimStiller (Ingram et al., 2010) is a system to provide global guidance for navigating a data-table space through the process of choosing DR and VE techniques. This analysis tool captures useful analysis patterns for analysts who must deal with messy data sets.

Rensink and Baldridge (Rensink and Baldridge, 2010) explore the use of simple properties such as brightness to generate a set of scatterplots in order to test whether observers could discriminate pairs using these properties. They found that perception of correlations in a scatterplot is rapid, and that in order to limit visual attention to specific information it is more effective to group features together. Etemadpour et al. (Etemadpour et al., 2014c) postulate that cluster properties such as density, shape, orientation, and size influence perception when interpreting distances in scatterplots, and specifically, observe that the density of clusters is more influential than their size.

In general, little attention has been paid to providing details about low-level tasks that guide users to choose DR and VE techniques. However, both high-level goals and much more specific low-level tasks are important aspects of analytic activities. Amar et al. (Amar et al., 2005) presented a set of ten low-level analysis tasks that they found to be representative of questions that are needed to effectively facilitate analytic activity. Andrienko and Andrienko distinguish elementary tasks that address specific elements of a set and synoptic tasks that address entire sets or subsets, according to the level of analysis (Andrienko et al., 2011).

Brehmer and Munzner (Brehmer and Munzner, 2013) emphasize three main questions: why the tasks are performed, how they are performed, and what their inputs and outputs are. These questions encompass their concept of multi-level typology. They believe that “low-level characterization does not describe the user’s context or motivation; nor does it take into account prior experience and background knowledge.” Their typology relies on a more abstract categorization based on concepts, rather than a taxonomy of pre-existing objects or tasks. In contrast, we attempt to specify tasks at the lowest level that can provide details about multidimensional data projection. However, the general approach of Brehmer and Munzner can be easily adopted as a tool to put these low-level tasks in context, facilitating the evaluation of user experiences by evaluation designers. This approach provides essential information, such as motivation and user expertise, for field studies that examine visualization usage. Therefore, we show how our defined tasks can be described according to a typology of abstract tasks relating intents and techniques (how) to modes of goals and tasks (why).

We 1) categorize possible tasks performed when analyzing a specific multidimensional data visualization, and 2) formulate guidelines for analysts to assist in selecting appropriate projection techniques for performing specific visualization tasks on data sets.

3 Task Taxonomy for Multidimensional Data Projection

We define a list of tasks from studies of different projection techniques and their 2D layouts such as PCA (Jolliffe, 1986), Isomap (Tenembaum et al., 2000), LSP (Paulovich et al., 2008), Glimmer (Ingram et al., 2009), and NJ tree (Paiva et al., 2011), as well as the applications behind the data (e.g., document and image data). We explain some of these tasks in detail and provide examples of effective data representations for relevant visual analysis tasks. As explained in Section 2, how well groups of points can be distinguished by users in scatterplots defines visual class separability. Our cluster-level tasks also focus on how easily a grouping of related points in multidimensional space (e.g., clusters) can be detected by users when projected into lower-dimensional space. However, rather than only looking at visual class separability, we consider how effectively users perform meaningful tasks related to the perceived clusters.

Although other researchers have explored some of these tasks, we systematically list the full range of analytic tasks for multidimensional projection techniques appropriate for large data sets. Additionally, our organization of these tasks takes into consideration user perception.

We divided the tasks into four categories according to the typical visualizations required to support them:

Pattern identification tasks: We examine trends, which are more obvious for lower-dimensional data than for projected higher-dimensional ones. Relevant issues include cluster/class preservation and separation.

Relation-seeking tasks: Relationships and similarities between different reference sets are considered.

Behavior comparison tasks: To compare characteristics of subsets (or clusters), we consider capturing different data behaviors (like asking the subjects to compare the point densities within clusters, where density is defined as the number of points per area).

Membership disambiguation tasks: Positional and distributional relationships within classes/clusters are particularly considered where objects occlude each other. Clutter and noise obscure the structure present in the data and make it hard for users to find patterns and relationships. Peng et al. (Peng et al., 2004) state that clutter reduction is a visualization-dependent task. Therefore, the DR and VE need to minimize the amount of confusing clutter. We believe that clutter can be measured by users using a wide variety of visualization techniques.
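The density notion used in the behavior comparison category (number of points per area) can be sketched as follows. The text does not say how the area of a cluster is measured, so the axis-aligned bounding box used here is purely an assumption for illustration, as are the function name and the two toy clusters; a convex hull or an enclosing surface would be equally valid choices.

```python
def bounding_box_density(points):
    """Number of 2D points divided by the area of their bounding box.

    The bounding box is an assumed stand-in for the cluster area.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if area == 0:
        raise ValueError("degenerate cluster: bounding box has zero area")
    return len(points) / area


# Hypothetical clusters: the same five-point pattern at two different scales.
compact = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
spread = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]

density_compact = bounding_box_density(compact)
density_spread = bounding_box_density(spread)
```

A ground-truth value of this kind gives an evaluator something to compare against when subjects are asked which of two clusters is denser.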

We now clarify these taxonomic categories by looking at common tasks found in the literature.

3.1 Pattern identification tasks

Multidimensional data sets may include hundreds or thousands of objects described by dozens or hundreds of attributes. Data characteristics regarding the distribution within multidimensional feature spaces vary for different application domains. For example, consider document data versus image data: text usually produces sparse spaces while imagery produces dense spaces. As Song et al. (Song et al., 2006) state, traditional document representations like bag-of-words lead to sparse feature spaces with high dimensionality. This makes it difficult to achieve high classification accuracies. Figure 1 shows histograms of the distribution of the pairwise distances between data objects in four data sets after normalization to the interval [0, 1]. The document data sets are referred to as CBR and KDViz¹. The image data sets are referred to as Corel² and Medical³. The histograms reveal different characteristics for document data sets and image data sets. Both image data sets exhibit lower mean distance values and much wider variance (representative of a denser feature space) than the document data sets.

¹ CBR comprises 680 documents, which include title, authors, abstract, and references from scientific papers in four different subjects, leading to a data set with 680 objects and 1,423 dimensions. KDViz data has been generated from an Internet repository on the topics bibliographic coupling, co-citation analysis, milgrams, and information visualization, leading to 1,624 objects, 520 dimensions, and four highly unbalanced labels (http://vicg.icmc.usp.br/infovis2/data sets).

² 1,000 photographs on ten different themes. Each image is represented by a 150-dimensional vector of SIFT descriptors (UCI KDD Archive, http://kdd.ics.uci.edu).

Figure 1: Histograms of document data (top) and image data (bottom) exhibit characteristic distance distributions: (a) CBR. (b) KDViz. (c) Corel. (d) Medical.
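A histogram of normalized pairwise distances, of the kind shown in Figure 1, can be computed along the following lines. This is a sketch under stated assumptions: the text does not specify the normalization, so distances are divided by the largest pairwise distance; Euclidean distance, the bin count, the function names, and the four sample vectors are likewise illustrative and are not taken from the data sets above.

```python
import math
from itertools import combinations


def normalized_pairwise_distances(points):
    """Euclidean distances of all pairs, divided by the largest distance."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    largest = max(dists)
    return [d / largest for d in dists]


def histogram(values, bins=10):
    """Count values from [0, 1] in equally wide bins; 1.0 goes to the last bin."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical feature vectors standing in for a real data set.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
normalized = normalized_pairwise_distances(points)
counts = histogram(normalized, bins=5)
```

Comparing the mean and spread of `normalized` across data sets gives the same kind of summary that the paragraph above reads off the histograms.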

Identifying patterns in high-dimensional spaces and representing them using dimensionality reduction techniques, in order to reveal trends, is a challenge in many scientific and commercial applications. To identify outliers, trends, and interesting patterns in data, one of the many objectives of data exploration is to find correlations in the data, thus uncovering hidden relationships in the data distribution and providing additional insights about the high-dimensional data (Zhang et al., 2008). Therefore, we suggest a list of tasks that can reveal the user’s perspective on local and global correlations with respect to features, for instance, those subsets of data which form relevant patterns (e.g., subsets of data within dense feature groups):

• Estimate the number of outliers in the given layout.

• Estimate the number of observed clusters.

• Find the number of clusters in a selected region.

• Find the number of subclusters in a given cluster.

• Find a cluster with a specific characteristic (e.g., longish).

• Find the specific characteristics (e.g., sparsity) of a cluster.

• Determine the number of outliers in a given cluster.

³ Each image is represented by 28 features, including Fourier descriptors and energies derived from histograms, as well as mean intensity and standard deviation computed from the images themselves. Hence, the data set contains 540 objects and 28 dimensions.

If researchers aim to measure the user’s performance on class segregation, it is important to draw the user’s attention to global projection views. Thus, we suggest asking users to “Estimate the number of clusters in the given layout” to identify the informative aspects of the data.

Pattern identification tasks often favor clear segregation by class, which means that techniques which incorporate cluster enclosing surfaces can be helpful. In some situations, the labeled classes in each data set can be considered as ground truth. For such cases, Poco et al. (Poco et al., 2011) developed a 3D projection method by generalizing the LSP technique from a 2D to a 3D scheme. A non-convex hull (of each cluster) that is computed from a 3D Voronoi diagram of the cluster points is illustrated in Figure 2(a). This representation, when it is possible to construct, is both accurate and satisfying to users, compared to other techniques.

While this projection works well when the data’s pre-assigned class structure accurately models the data’s inherent organization, this is often not feasible. In many situations, analysts want to leverage human perception to identify “visual groupings” of points, and in this case a point cloud representation produces favorable results. For example, when grouping information is not available, a point-based visualization as shown in Figure 2(b) is still applicable. Also, Glimmer (Ingram et al., 2009), as a technique representative of force-directed placement MDS, does not favor class segregation when employed on the KDViz data set⁴. Thus, color coding to separate nodes of different classes can be useful as shown in Figure 2(c). Therefore, if we have accurate class labels and good class separation, we suggest enclosing surfaces like non-convex hulls. According to the eye-tracking study on Glimmer projection, the visual attention pattern is scattered and it is hard to identify any meaningful areas of interest (AOIs) for Glimmer (Etemadpour et al., 2014c). Hence, it is useful to differentiate classes when the projection does not reflect the class distribution at all.

⁴ KDViz contains documents collected from an Internet repository related to four different topics with 1,624 unique documents, 520 different dimensions, and 4 highly unbalanced labels, http://vicg.icmc.usp.br/infovis2/data sets

Figure 2: Estimate the number of observed clusters: (a) Non-convex hulls computed from enclosing surfaces isodistant to cluster using LSP projection; (b) Point-based visualization using PCA projection taken from (Schreck et al., 2010); (c) The layout obtained with Glimmer projection on the KDViz data set. Circle color indicates instance class label.

3.2 Relation-seeking tasks

Relation-seeking tasks investigate the similarities and differences between subgroups which represent clusters or individual objects. Similarity layouts employ projection techniques to reduce data to lower-dimensional visual spaces, but in a different manner from that used in pattern identification. In this application, an analyst is interested in investigating whether a point (or object) is more similar to one cluster or to another, or whether a whole cluster is more similar to a second cluster or a third. We believe that relationship-seeking is a search task, Andrienko’s visual task taxonomy model notwithstanding (in which search tasks are limited to lookup and comparison) (Andrienko et al., 2000). In contrast, Zhang et al. (Zhang et al., 2009) consider comparison and relationship-seeking to be compound tasks, containing at least two relationships, one being the data function and the other being relationships between values (or value sets) of a variable. Under this definition, we believe that finding similarities in projected high-dimensional data can be considered a relation-seeking task. Users perform comparison tasks with respect to a given reference set, which can be a cluster or an individual object, and can undertake a similarity search by identifying a given cluster’s neighbors. In such a search, the specified relationship is defined by a distance search within a high-dimensional data projection.

A list of potential tasks within the relation-seeking task category can be considered for multidimensional data visualization:

• Identify the closest cluster to a given cluster.

• Identify the most similar cluster to a given cluster.

• Identify the closest cluster to a reference point.

• Identify the most similar cluster to a given object.

• Find k closest (most similar) clusters to the given cluster.

• Find k closest (most similar) objects to the given cluster.

• Find k closest (most similar) objects to the reference object.

• Find the closest (most similar) cluster to a cluster with a specific characteristic (e.g., Find the closest cluster to the longish cluster).

• Identify the cluster to which the reference set/sets belong.

• Find the closest (most similar) cluster to the set of points with specific characteristics (e.g., points that have identical movement).

• Find k closest (most similar) points to the set of points with specific characteristics.

• Find the clusters that have hierarchical relations.

• Find k similar objects within a cluster.

• Find a cluster that is the parent of two reference sets.

Etemadpour et al. (Etemadpour et al., 2014b) investigated how domain-specific issues affect the outcome of the projection techniques. They used a number of similarity interpretation tasks to assess the layouts generated by projection techniques as perceived by their users. To show that projection performance is task-dependent, they generated layouts of high-dimensional data with five techniques representative of different projection approaches. To find a perception-based quality measure, they asked individuals to identify the closest cluster to a given cluster and object. Users also ranked the k nearest objects to a given object. As shown in Figure 3, the target cluster/object was shown in one color (red) and two other clusters in other colors (green and blue), from which the one closer to the target cluster/object should be identified.

Figure 3: Task: determine whether green or blue cluster is closer to red object in order to investigate PCA projection performance.

Node-link diagrams have been studied in detail in many graph drawing topics or graph visualization approaches, where a node represents an entity that is connected to other nodes through lines (i.e., links). Although the node-link diagram is an intuitive way to visually represent relationships between entities for relatively small data sets (Henry and Fekete, 2006), there may be too many lines crossing each other that obscure relationships among entities when dealing with larger data sets. In order to represent spatial distance visually in cases like these, a technique like the Force-Directed Placement approach (Eades, 1984) can be used to reveal connections and similarity magnitude between entities. This technique relies on iterative algorithms that model the data points as a system of particles attached to each other by springs. The length of the spring connecting two particles is given by the distance between their corresponding data points as shown in Figure 4. A spatial embedding is obtained with an iterative simulation of the spring forces acting on this hypothetical physical system, until it reaches an equilibrium state.

Figure 4: The spring embedder model (Eades et al., 2010).
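The spring model can be sketched in a few lines. This is a simplified illustration rather than the algorithm of Eades or of any specific force-directed MDS technique: it assumes linear springs between all pairs, a fixed step size, and a fixed iteration count, and the function names, step size, target distances, and starting positions are all assumptions made for the example.

```python
import math


def spring_layout(target, positions, iterations=500, step=0.05):
    """Relax a 2D spring system whose rest lengths are the target distances."""
    n = len(positions)
    pos = [list(p) for p in positions]
    for _ in range(iterations):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # A stretched spring pulls i toward j, a compressed one pushes it away.
                stretch = d - target[i][j]
                forces[i][0] += stretch * dx / d
                forces[i][1] += stretch * dy / d
        for i in range(n):
            pos[i][0] += step * forces[i][0]
            pos[i][1] += step * forces[i][1]
    return pos


def stress(target, pos):
    """Sum of squared differences between layout distances and target distances."""
    n = len(pos)
    return sum(
        (math.dist(pos[i], pos[j]) - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Hypothetical target distances between three data points (a 3-4-5 triangle).
target = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

initial_stress = stress(target, start)
final = spring_layout(target, start)
final_stress = stress(target, final)
```

In this small example the springs can all reach their rest lengths, so the stress drops toward zero; for real high-dimensional data an exact 2D embedding usually does not exist and the equilibrium state is a compromise between competing springs.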

For the task Find k closest objects to the reference object, a combination of hull-based and point-based visualizations can be used if the cluster structure is known and the performance of a projection in terms of maintaining distances within a cluster is under investigation. Schreck et al. (Schreck et al., 2010) implemented an interactive system that combined these two visual presentations, letting users choose the best visual representation of the projected data. They believed that such combined representations introduce visual redundancy; however, they can improve the user’s perception of the projection precision information depending on the application. Poco et al. (Poco et al., 2011) improved the performance of their 3D point representation when they combined standard point clouds with this user-guided process. Figure 5 demonstrates finding the 3 closest objects to the red object within a cluster when the convex hull of the points is used.

Brehmer and Munzner’s typology is intended to facilitate understanding of users’ individual analytical strategies. We employ their multi-level code, used to label user behaviour, to enhance the evaluation of high-dimensional data projection. By utilizing the Brehmer and Munzner multi-level typology, we provide a systematic way of justifying the choice of a particular task through asking three main questions: Why, What and How. This multi-level typology of abstract visualization tasks fills the gap between low-level and high-level classification to describe user tasks in a useful way. This approach to analyzing visualization usage supports making precise comparisons of tasks between different visualization tools and across application domains (Brehmer and Munzner, 2013). For an effective design and evaluation of multidimensional data visualization tools, one should consider why and how our defined tasks should be conducted, and what their potential inputs and outputs are. Meanwhile, sequences of tasks can be linked, so that the output of one task may serve as input to a subsequent task. We focus on Find k closest clusters to the given cluster in the relation-seeking category. We do not consider any specific projection technique because the technique can be changed based on the evaluator’s motivation.

Find k closest clusters to the given cluster:

WHY: The goal is to Discover k groups that are closest to a given cluster. A known target (given cluster) and the whole projection visualization are provided. If the location of a given cluster is known (or given by the examiner), then participants perform a Lookup. If the characteristic of the given cluster is given, the user can Locate the given cluster with specific characteristics (e.g., searching for a given cluster in which the elements are colored red). Then individuals search for k clusters that are in the neighborhood of the given cluster and list these groups.

WHAT: The input for this task is a given cluster; this can be shown by the examiner or can be indicated by a particular characteristic like the color red. All other clusters in the entire visualization are also visible to the participants. The output is a list of k groups that are closest to the given cluster.

HOW: Participants identify the k closest clusters to the given cluster. For example, they determine whether the green or blue cluster is closer to the red cluster. They provide a list of clusters in ascending order of distance, so that the first cluster in the list is the one with the shortest distance to the given cluster. Select refers to differentiating selected elements from the unselected remainder.
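The expected output of this task, a list of clusters in ascending order of distance to the given cluster, can be generated as a reference answer. The sketch below assumes that the distance between two clusters is the Euclidean distance between their centroids; the text does not fix a cluster distance, and other choices (for instance the smallest distance between any two member points) may rank clusters differently. Cluster names and coordinates are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_closest_clusters(clusters, given, k):
    """Rank the other clusters by centroid distance to `given`, closest first."""
    ref = centroid(clusters[given])
    ranked = sorted(
        (name for name in clusters if name != given),
        key=lambda name: math.dist(centroid(clusters[name]), ref),
    )
    return ranked[:k]


# Hypothetical projected clusters, keyed by the color used in the display.
clusters = {
    "red": [(0, 0), (1, 1)],
    "green": [(4, 0), (6, 0)],
    "blue": [(0, 9), (2, 11)],
    "gray": [(20, 20), (22, 22)],
}
answer = k_closest_clusters(clusters, "red", 2)
```

A participant's ordered list can then be scored against `answer`, for example by counting positions that agree.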

Figure 5: Find 3 closest objects to the red object: Convex-hull of the point clusters.

Trees are a natural form for depicting hierarchical relations and can be used to Find the clusters that have hierarchical relations. A distinct category of 2D mapping employs tree layouts to convey similarity levels contained in a distance matrix. The algorithms to generate similarity layouts (Cuadros et al., 2007; Paiva et al., 2011) are inspired by the well-known Neighbor-Joining (NJ) heuristic originally proposed to reconstruct phylogenetic trees. Similar points among members of the same subsets are placed at the ends of branches. The points nearer the root of the tree are less similar when compared with the points at the ends of branches.
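To make the NJ heuristic more concrete, the sketch below shows only its pair-selection step in the standard textbook form: for n items with distance matrix d, the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) is joined first. The similarity-tree algorithms cited above are described only as inspired by NJ, so their actual procedures may differ; the function name and the example matrix are illustrative assumptions.

```python
def nj_pair_to_join(d):
    """Return the index pair minimizing the Neighbor-Joining Q criterion,
    together with its Q value. `d` is a symmetric distance matrix."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Hypothetical distance matrix over five objects.
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, q = nj_pair_to_join(distances)
print(pair, q)
```

In this example the pair with the smallest raw distance is (3, 4), yet the Q criterion selects (0, 1), because it also accounts for how far each candidate is from all remaining items.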

Similarity trees generate a hierarchy, creating a tree structure where interpretation is subject to organization of the branches; for example, mappings of a data set with the NJ and LSP projections are compared in Figure 6. In this example, the INFOVIS04 data set is composed of documents published in a conference on information visualization, and its content is homogeneous. Using NJ, documents with a high degree of similarity are placed along the same branch. The branches circled in the figure are examples of long branches without too many ramifications, and probably represent specific sub-topics inside the collection. LSP, on the other hand, has a tendency to create clusters in round clumps. This representation performs well for certain tasks, but is less useful for finding the closest clusters to selected objects (Etemadpour et al., 2014b).

Figure 6: Comparison of INFOVIS04 document data set maps using Neighbor Joining and LSP projections: Four different topics of information visualization are identified by coloring points. Figure is taken from (Cuadros et al., 2007).

Collins et al. (Collins et al., 2009) introduced BubbleSets as a visualization technique for data that makes explicit use of grouping and clustering information. Members of the same set are enclosed in a continuous and concave isocontour, while a primary semantic data relation is maintained with spatial organization. These delineated contours do not disrupt the primary layout, so they avoid layout adjustment techniques. This visualization technique is designed to facilitate depicting more than one data relationship in data sets that contain multiple relationships. Using this concept, we suggest contours around nodes belonging to the same set to Find k similar objects within a cluster in a projection technique. Figure 7 shows an example that uses the BubbleSets concept for an NJ heuristic projection. The points that share the same contour are members of the same set. These boundaries are used to indicate the grouping clearly.

Figure 7: NJ projection: geometric relationships, hierarchy and cluster perimeter are all clearly defined using the BubbleSets concept.

3.3 Behavior comparison tasks

A third way in which high-dimensional data projections display data items in lower-dimensional subspaces can provide insight into important data dimensions and details. Our taxonomy distinguishes the subsets of tasks used for behavior comparison:

• Find the cluster with the largest (smallest) occupied visual area.

• Find the cluster with the most (least) number of points or size.

• Find densest (sparsest) cluster.

• Given specific number of clusters (e.g. 5 clusters is given).

• Rank the clusters by density.

• Rank the clusters by their occupied visual area.

• Rank the clusters by their size.

• Compare density of two given clusters with different or similar characteristics (e.g., density of a longish cluster vs. a roundish cluster).

• Compare the size of two given clusters with different or similar characteristics.

• Compare the visual area of two given clusters with different or similar characteristics.

Density is an important metric because it indicates stronger relationships between points within a cluster. Moreover, many studies have indicated that representations of density can play an important role in visualization (Ahuja and Tuceryan, 1989; Sears, 1995; Tullis, 1988). Further, studies in psychophysics have shown that visual search can be affected by the variance in the number of objects within groups (Duncan and Humphreys, 1989; Rosenholtz et al., 2009; Treisman, 1982). Sedlmair et al. (Sedlmair et al., 2012b) named density as one of the Within-Cluster factors, namely, the ratio between count and size. This can range from sparse, with few data points and a large spread, to dense, with many points and a small spread. If the task is to Compare density of two given clusters with different or similar characteristics (i.e., different shapes), we suggest a point-based visualization. This allows users to easily see the point distribution within a cluster and the occupied visual space. Moreover, as investigated in (Etemadpour et al., 2014c), according to the Gestalt principle (Koffka, 1935), the shape and orientation of a cluster should also influence decisions during visual analysis. For example, when two stretched clusters are aligned, they may be perceived as a continuation of one cluster; in other words, characteristics of the clusters influence the visual analysis from a perceptual view. Following these ideas, continuity and closure create the perception of a whole cluster. Figure 8 illustrates the density of a longish cluster versus a cluster that looks more roundish. In this example, cluster shape (e.g., whether a cluster appears to be round or elongated) has been examined, while density and size of the clusters were the same. In addition, 2D scatter plots are manually generated using synthetic clusters (Etemadpour et al., 2014c). Cluster shape (in projected space) influences users’ performance on various inference tasks.
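The quantities that the behavior comparison tasks ask users to judge can also be computed as ground truth. The sketch below is one possible operationalization, not the one used in the cited studies: it assumes that the occupied visual area is the area of the cluster's convex hull, that size is the point count, and that density is the count divided by the hull area. All names and coordinates are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula for a simple polygon given as a vertex list."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2.0


def cluster_stats(points):
    """Size (count), occupied area (hull area) and density (count / area)."""
    area = polygon_area(convex_hull(points))
    count = len(points)
    density = count / area if area > 0 else float("inf")
    return {"size": count, "area": area, "density": density}


# Hypothetical clusters: same size and hull area, different shape, plus a sparse one.
clusters = {
    "roundish": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)],
    "longish": [(0.0, 0.0), (8.0, 0.0), (8.0, 0.5), (0.0, 0.5), (4.0, 0.25)],
    "sparse": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
}

stats = {name: cluster_stats(pts) for name, pts in clusters.items()}
by_density = sorted(stats, key=lambda name: stats[name]["density"], reverse=True)
print(stats)
print(by_density)
```

With these toy clusters the roundish and longish clusters have identical size, area and density, which mirrors the kind of stimulus in which only shape varies.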

Figure 8: Task: Compare the density of the longish cluster versus the roundish cluster. Scatter plots were generated with varying shapes, while holding density and size constant, in order to investigate the effect of cluster shape (in projected space) on a user’s inferences and perceptions of the data.

Again by utilizing the Brehmer and Munzner multi-level typology, we provide an example that shows how our defined tasks can be fitted to this multi-level typology of abstract visualization tasks, in order to concisely describe our pre-defined tasks. Find the cluster with the highest number of sub-clusters in the behavior comparison category has been considered. Additionally, we did not consider any specific projection technique because it can be changed based on the evaluator’s motivation.

Find the cluster with the highest number of sub-clusters: WHY: The purpose is to Discover a cluster with the highest number of sub-clusters. The cluster characteristic is not provided; therefore, the search target is unknown and Explore entails searching for the cluster with the highest number of sub-groups. Once the search process is done, Identify returns the desired reference. WHAT: The input for this task is the entire visualization, including all clusters and their sub-groups. The output is the identity of a cluster with the largest number of sub-clusters. HOW: Individuals need to estimate the number of sub-clusters of each cluster. This involves counting sub-groups within successive clusters until the largest number is found. Therefore, they must Derive new data elements, then Select the desired cluster.
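The Derive and Select steps of this task can be mirrored in a few lines when cluster and sub-cluster labels are available. The sketch below assumes each point carries a (cluster, sub-cluster) label pair; this data layout and all label names are hypothetical.

```python
from collections import defaultdict


def cluster_with_most_subclusters(labels):
    """`labels` is an iterable of (cluster, subcluster) pairs, one per point.
    Derive: count distinct sub-clusters per cluster. Select: pick the maximum."""
    subclusters = defaultdict(set)
    for cluster, sub in labels:
        subclusters[cluster].add(sub)
    counts = {cluster: len(subs) for cluster, subs in subclusters.items()}
    best = max(counts, key=counts.get)
    return best, counts


# Hypothetical per-point labels.
labels = [
    ("A", "a1"), ("A", "a1"), ("A", "a2"),
    ("B", "b1"), ("B", "b2"), ("B", "b3"), ("B", "b3"),
    ("C", "c1"),
]

best, counts = cluster_with_most_subclusters(labels)
print(best, counts)
```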

3.4 Membership disambiguation

It is desirable for the visual representation to avoid clutter, resolve ambiguity and handle noise. At times, “identifying overlaps” may indicate that the classes are not clearly separable, which suggests that the overriding task is one of pattern identification. However, too much data on too small an area of the display, such as a dense region of entangled clusters, diminishes the potential usefulness of the projections even if the projection consists of some clearly separated clusters. This is especially true when the user is exploring the data to:

• Estimate the number of objects in a selection.

• Find an object with a specific characteristic (e.g., a labeled point) within a cluster.

• Count the number of objects in a given cluster.

• Identify the objects that overlap in a selected area.

When Finding an object with a specific characteristic within a cluster, a visualization can favor good performance in preserving distances and relationships, but only at the expense of producing visual clutter. As an example, the PCA scatterplot of KDViz is too cluttered and distinguishing a specific object within a cluster is not an easy task (Figure 9).

To Estimate the number of objects in a selection, a target cluster/selection can be highlighted with a different color as shown in Figure 10.
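For the counting and overlap tasks, a reference answer can be derived from the projected coordinates. The sketch below assumes an axis-aligned rectangular selection and circular point markers of a fixed radius, so that two markers overlap when their centers are closer than twice that radius. These assumptions, along with the function names and data, are illustrative only.

```python
from math import dist


def in_selection(point, rect):
    """rect is (xmin, ymin, xmax, ymax) in projected coordinates."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= point[0] <= xmax and ymin <= point[1] <= ymax


def count_in_selection(points, rect):
    """Exact number of objects inside the selection."""
    return sum(in_selection(p, rect) for p in points)


def overlapping_pairs(points, rect, radius):
    """Index pairs of selected points whose circular markers overlap."""
    selected = [i for i, p in enumerate(points) if in_selection(p, rect)]
    return [
        (i, j)
        for pos, i in enumerate(selected)
        for j in selected[pos + 1:]
        if dist(points[i], points[j]) < 2 * radius
    ]


# Hypothetical projected points and selection.
points = [(1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (9.0, 9.0)]
selection = (0.0, 0.0, 5.0, 5.0)

n_selected = count_in_selection(points, selection)
overlaps = overlapping_pairs(points, selection, radius=0.2)
print(n_selected, overlaps)
```

Comparing a participant's estimate against `n_selected` gives a simple error measure for the estimation task.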

Figure 9: Find a purple object within the green cluster. Using a PCA projection employed on the KDViz data set, it is hard to distinguish the purple point.

Figure 10: Estimate the number of objects in a selection in LSP projection.

3.5 Meta-projection

The tasks that are explained above can be used as given, or can be combined into multi-step macro tasks. We note that the tasks that we have provided may not cover all possible tasks of a given type, but they can be used as exemplars when defining new tasks. Sub-clusters of a given cluster or group of points can be considered as a meta-object. Meta-objects can create a meta-projection, and new tasks can be executed on this projection based on this process. In Figure 11(a), the task is: “Find the closest cluster to the given cluster”. For instance, it is apparent that “Linear Square” is the closest sub-cluster to the “Information Visualization” sub-cluster and “Tree” is the closest sub-cluster to “Graph Drawing”. Likewise, as shown in Figure 11(b), we can analyze the meta-projection to see that “Time Varying Filtering” is the closest cluster to the “Visualization” cluster and “Visualization” is the closest cluster to “Data Mining”. Using this meta-projection, we can get more insight into our data.
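One way to build such a meta-projection is to collapse each group of sub-clusters into a single representative position. The sketch below assumes that a meta-object is placed at the unweighted mean of its sub-cluster centroids; the paper does not specify how meta-objects are positioned, so this rule, the `parent_of` mapping, and the names used are hypothetical.

```python
def centroid(points):
    """Mean position of a list of (x, y) tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def meta_objects(subclusters, parent_of):
    """Collapse sub-clusters into one meta-object per parent cluster.
    Each meta-object sits at the mean of its sub-cluster centroids (assumed rule)."""
    grouped = {}
    for name, pts in subclusters.items():
        grouped.setdefault(parent_of[name], []).append(centroid(pts))
    return {parent: centroid(cs) for parent, cs in grouped.items()}


# Hypothetical sub-clusters and their parent clusters.
subclusters = {
    "A1": [(0.0, 0.0), (2.0, 0.0)],
    "A2": [(1.0, 2.0), (3.0, 2.0)],
    "B1": [(10.0, 10.0), (12.0, 10.0)],
}
parent_of = {"A1": "A", "A2": "A", "B1": "B"}

meta = meta_objects(subclusters, parent_of)
print(meta)
```

Any of the tasks in the taxonomy can then be posed on the resulting meta-objects instead of on the original points.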

Thus, in Section 3, we saw examples of how appropriate visualization methods could be determined for specific tasks.

Figure 11: A meta-projection: (a) sub-clusters; (b) clusters (meta-objects).

4 Conclusion

Our taxonomy supports precise comparisons across different multidimensional data projection techniques. However, it can be extended by considering more application domains like volumetric data sets, which may introduce new VEs like continuous scatterplots. We argue that projection methods are distinct in their characteristics in terms of sparseness and distance distribution, and that the nature of the task (in taxonomic terms) should guide the visualization design. We believe that our taxonomy can be used for examining projection layouts and scatterplots to see how users perceive multidimensional data. We also incorporate findings about perception rules (e.g., Gestalt laws) and cognitive processes like visual attention as a valuable source of information for such analyses. Our taxonomy can help in categorizing possible tasks when analyzing a multidimensional data visualization. These tasks can be used as guidelines for assessing other visualization techniques as well, such as Star Coordinates (Van Long and Linsen, 2011).

Our tasks are not projection-specific or data-set-specific. We list a number of example tasks within each taxonomic task classification; these are not intended to be exhaustive. Other tasks can be placed within our taxonomic categories, and our visualization recommendations applied appropriately. We may extend our study to look into whether certain tasks are specific for certain applications.

REFERENCES

Ahuja, N. and Tuceryan, M. (1989). Extraction of early perceptual structure in dot patterns: Integrating region, boundary, and component gestalt. Comput. Vision Graph. Image Process., 48(3):304–356.

Albuquerque, G., Eisemann, M., and Magnor, M. (2011). Perception-based visual quality measures. In Proc. IEEE Symposium on Visual Analytics Science and Technology (VAST) 2011, pages 13–20.

Amar, R., Eagan, J., and Stasko, J. (2005). Low-level components of analytic activity in information visualization. In Proceedings of the 2005 IEEE Symposium on Information Visualization, INFOVIS ’05, pages 15–, Washington, DC, USA. IEEE Computer Society.

Andrienko, G., Andrienko, N., Bak, P., Keim, D., Kisilevich, S., and Wrobel, S. (2011). A conceptual framework and taxonomy of techniques for analyzing movement. J. Vis. Lang. Comput., 22(3):213–232.

Andrienko, N. V., Andrienko, G. L., and Gatalsky, P. (2000). Visualization of spatio-temporal information in the internet. In 11th International Workshop on Database and Expert Systems Applications (DEXA’00), 6-8 September 2000, Greenwich, London, UK, pages 577–585.

Borg, I. and Groenen, P. J. F. (2010). Modern Multidimensional Scaling Theory and Applications. Springer Series in Statistics. Springer, 2nd edition.

Brehmer, M. and Munzner, T. (2013). A multi-level typology of abstract visualization tasks. IEEE Trans. Visualization and Computer Graphics (TVCG) (Proc. InfoVis), 19(12):2376–2385.

Collins, C., Penn, G., and Carpendale, S. (2009). Bubble sets: Revealing set relations with isocontours over existing visualizations. IEEE Transactions on Visualization and Computer Graphics, 15(6):1009–1016.

Cuadros, A. M., Paulovich, F. V., Minghim, R., and Telles, G. P. (2007). Point placement by phylogenetic trees and its application to visual analysis of document collections. In Proceedings of the 2007 IEEE Symposium on Visual Analytics Science and Technology, pages 99–106. IEEE Computer Society.

Duncan, J. and Humphreys, G. (1989). Visual search and stimulus similarity. Psychological Review, 96:433–458.

Eades, P., Huang, W., and Hong, S. (2010). A force-directed method for large crossing angle graph drawing. CoRR, abs/1012.4559.

Eades, P. A. (1984). A heuristic for graph drawing. In Congressus Numerantium, volume 42, pages 149–160.

Etemadpour, R., Carlos da Motta, R., Paiva, J. G. d. S., Minghim, R., Ferreira, M. C., and Linsen, L. (2014a). Role of human perception in cluster-based visual analysis of multidimensional data projections. In 5th International Conference on Information Visualization Theory and Applications (IVAPP), pages 107–113, Lisbon, Portugal.

Etemadpour, R., Motta, R., de Souza Paiva, J. G., Minghim, R., de Oliveira, M. C. F., and Linsen, L. (2014b). Perception-based evaluation of projection methods for multidimensional data visualization. IEEE Transactions on Visualization and Computer Graphics, 99(PrePrints):1.

Etemadpour, R., Olk, B., and Linsen, L. (2014c). Eye-tracking investigation during visual analysis of projected multidimensional data with 2d scatterplots. In 5th International Conference on Information Visualization Theory and Applications (IVAPP), pages 233–246, Lisbon, Portugal.

Geng, X., Zhan, D.-C., and Zhou, Z.-H. (2005). Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(6):1098–1107.

Henry, N. and Fekete, J. (2006). Matrixexplorer: a dual-representation system to explore social networks. IEEE Transactions on Visualization and Computer Graphics, 12:677–684.

Ingram, S., Munzner, T., Irvine, V., Tory, M., Bergner, S., and Möller, T. (2010). Dimstiller: Workflows for dimensional analysis and reduction. In IEEE VAST, pages 3–10. IEEE.

Ingram, S., Munzner, T., and Olano, M. (2009). Glimmer: Multilevel mds on the gpu. IEEE Transactions on Visualization and Computer Graphics, 15(2):249–261.

Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag.

Koffka, K. (1935). Principles of Gestalt Psychology. Lund Humphries, London.

Lewis, J. M. and Ackerman, M. (2012). Human cluster evaluation and formal quality measures: A comparative study. pages 1870–1875. 34th Annual Conference of the Cognitive Science Society.

Müller, E., Günnemann, S., Assent, I., and Seidl, T. (2009). Evaluating clustering in subspace projections of high dimensional data. PVLDB, 2(1):1270–1281.

Paiva, J. G. S., C., L. F., Pedrini, H., Telles, G. P., and Minghim, R. (2011). Improved similarity trees and their application to visual data classification. IEEE Transactions on Visualization and Computer Graphics, 17(12):2459–2468.

Paulovich, F. V., Nonato, L. G., Minghim, R., and Levkowitz, H. (2008). Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping. IEEE Transactions on Visualization and Computer Graphics, 14(3):564–575.

Peng, W., Ward, M. O., and Rundensteiner, E. A. (2004). Clutter reduction in multi-dimensional data visualization using dimension reordering. In Ward, M. O. and Munzner, T., editors, INFOVIS, pages 89–96. IEEE Computer Society.

Poco, J., Etemadpour, R., Paulovich, F. V., Long, T. V., Rosenthal, P., de Oliveira, M. C. F., Linsen, L., and Minghim, R. (2011). A framework for exploring multidimensional data with 3d projections. Comput. Graph. Forum, 30(3):1111–1120.

Rensink, R. A. and Baldridge, G. (2010). The perception of correlation in scatterplots. Comput. Graph. Forum, 29(3):1203–1210.

Rosenholtz, R., Twarog, N. R., Schinkel-Bielefeld, N., and Wattenberg, M. (2009). An intuitive model of perceptual grouping for hci design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, pages 1331–1340, New York, NY, USA. ACM.

Samet, H. (2005). Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Schreck, T., von Landesberger, T., and Bremm, S. (2010). Techniques for precision-based visual analysis of projected data. In Park, J., Hao, M. C., Wong, P. C., and Chen, C., editors, VDA, volume 7530 of SPIE Proceedings, page 75300. SPIE.

Sears, A. (1995). Aide: A step toward metric-based interface development tools. In Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, UIST ’95, pages 101–110, New York, NY, USA. ACM.

Sedlmair, M., Brehmer, M., Ingram, S., and Munzner, T. (2012a). Dimensionality reduction in the wild: Gaps and guidance - ubc computer science technical report tr-2012-03. Technical report, The University of British Columbia.

Sedlmair, M., Tatu, A., Munzner, T., and Tory, M. (2012b). A taxonomy of visual cluster separation factors. Comp. Graph. Forum, 31(3pt4):1335–1344.

Sips, M., Neubert, B., Lewis, J. P., and Hanrahan, P. (2009). Selecting good views of high-dimensional data using class consistency. Computer Graphics Forum (Proc. EuroVis 2009), 28(3):831–838.

Song, Y., Zhou, D., Huang, J., Councill, I. G., Zha, H., and Giles, C. L. (2006). Boosting the feature space: Text categorization for unstructured data on the web. In the Sixth IEEE International Conference on Data Mining (ICDM 2006). IEEE.

Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley Longman, Boston, MA, USA.

Tatu, A., Bak, P., Bertini, E., Keim, D. A., and Schneidewind, J. (2010). Visual quality metrics and human perception: an initial study on 2D projections of large multidimensional data. In Proceedings of the Working Conference on Advanced Visual Interfaces (AVI ’10), pages 49–56.

Tatu, A., Theisel, H., Magnor, M., Eisemann, M., Keim, D., Schneidewind, J., et al. (2009). Combining automated analysis and visualization techniques for effective exploration of high-dimensional data.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323.

Treisman, A. (1982). Perceptual grouping and attention in visual search for features and for objects. Journal of Experimental Psychology: Human Perception and Performance, 8(2):194–214.

Tullis, T. S. (1988). A system for evaluating screen formats: Research and application. In Hartson, H. R. and Hix, D., editors, Advances in Human-Computer Interaction, 2:214–286.

Van Long, T. and Linsen, L. (2011). Visualizing high density clusters in multidimensional data using optimized star coordinates. Comput. Stat., 26(4):655–678.

Zhang, X., Pan, F., and Wang, W. (2008). Care: Finding local linear correlations in high dimensional data. 2014 IEEE 30th International Conference on Data Engineering, 0:130–139.

Zhang, Y., Passmore, P. J., and Bayford, R. H. (2009). Visualization of multidimensional and multimodal tomographic medical imaging data, a case study. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1900):3121–3148.
