Many existing and emergent applications collect and reference data by geospatial location. Credit card transactions, for example, include addresses of both the place of purchase and the purchaser; telephone records include addresses and sometimes cell phone zones and geocoordinates; and population census tables contain addresses and other location information. These data sets are sources of potentially valuable information that can give their holders a competitive advantage. Government agencies also publish a wealth of statistical information that data analysts can apply to key problems in public health and safety or combine with proprietary data. The difficulty lies in finding the details that reveal the fine structures hidden in this data.
Many approaches to analyzing such data exist, for example, statistical models, clustering, and association rules. Effective spatial data mining, however, must focus on finding location-related patterns and relationships. Interactive visual data exploration is important to spatial data mining [1, 2]. The wide area layout data observer (Waldo) involves the analyst in data exploration, thus complementing human perceptual skills, imagination, and flexibility with the capacity of current computer systems to process large volumes of data and generate sophisticated displays. In this setting, the analyst directly interacts with the data, solving problems by applying domain expertise and general background knowledge to form and validate new hypotheses.
In recent decades, visual data-mining techniques have proven valuable in exploratory data analysis, and they have strong potential in the exploration of large databases [3]. Visual data exploration is particularly useful when little is known about the data and when goals are indistinct. Because users directly guide the exploration process, they can easily shift or adjust the goals as needed. However, analyzing the torrent of spatial details available in these large (terabytes and beyond) databases to extract interesting knowledge or general characteristics is almost impossible for users. Thus we need new, more scalable visual techniques.
Visualizing geospatial data
Geospatial data describe real-world objects or phenomena with specific locations and associated statistical values or attributes. By considering just one statistical attribute at a time, we can interpret geospatial data sets as points in a 3D data space, that is, two geographical dimensions and a statistical dimension. Because real-world data set distributions are often nonuniform, the data points form readily identifiable 3D point clouds. Figure 1a shows a household income distribution in a 3D data space spanned by longitude, latitude, and median household income. Figure 1b shows an xy-plot of the 3D point clouds.
Visualizing large geospatial data sets involves mapping the two geographical dimensions to screen coordinates and encoding the statistical value by color. (Keim et al. give a good overview of visual data-mining techniques for geospatial data sets [4].) The difficulty is finding a useful mapping function f.
When using a simple dot plot mapping function f, developers encounter two important visualization challenges:
• Overplotting obscures data points in densely populated areas, while sparsely populated areas waste space and convey scant detailed information.
• Small clusters are difficult to find. In general, they aren’t noticeable enough in conventional maps and are often occluded by large clusters.
These difficulties lead to three important visual exploration goals for geospatial data, which we express as mapping constraints: no overlap, position preservation, and clustering. (The “Visual Exploration Goals” sidebar describes these goals.)
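To make the overplotting problem concrete, the following minimal Python sketch implements a naive dot plot: it maps longitude and latitude linearly to pixel coordinates and lets later points overwrite earlier ones. The names `to_pixel` and `dot_plot`, the bounding-box parameter, and the record layout are assumptions made for this illustration; they are not part of Waldo.

```python
from typing import Dict, Iterable, Tuple

# (longitude, latitude, statistical value); this record layout is an assumption
Point = Tuple[float, float, float]
Pixel = Tuple[int, int]


def to_pixel(lon: float, lat: float,
             bounds: Tuple[float, float, float, float],
             x_max: int, y_max: int) -> Pixel:
    """Linearly map a position to a pixel in {0..x_max-1} x {0..y_max-1}."""
    lon_min, lat_min, lon_max, lat_max = bounds
    fx = (lon - lon_min) / (lon_max - lon_min)
    fy = (lat - lat_min) / (lat_max - lat_min)
    # Clamp so that points on the upper boundary land on the last pixel.
    return min(int(fx * x_max), x_max - 1), min(int(fy * y_max), y_max - 1)


def dot_plot(points: Iterable[Point],
             bounds: Tuple[float, float, float, float],
             x_max: int, y_max: int) -> Tuple[Dict[Pixel, float], int]:
    """Naive dot plot: a later point silently overwrites an earlier one.

    Returns the pixel -> value map and the number of overplotted points.
    """
    screen: Dict[Pixel, float] = {}
    overplotted = 0
    for lon, lat, value in points:
        pixel = to_pixel(lon, lat, bounds, x_max, y_max)
        if pixel in screen:
            overplotted += 1  # an earlier point at this pixel becomes invisible
        screen[pixel] = value
    return screen, overplotted
```

With three nearby records in a dense area and one isolated record, only two pixels end up colored: the dense area collapses to a single pixel while most of the screen stays empty.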
We bring visualization to data analysts’ desktops to
Visual Analytics
Visual Data Mining in Large Geospatial Point Sets
Daniel A. Keim, Christian Panse, and Mike Sips, University of Constance, Germany
Stephen C. North, AT&T Labs
The Wide Area Layout Data Observer (Waldo) complements uniquely human abilities with current computing technologies to find location-related patterns in large geospatial data sets.
IEEE Computer Graphics and Applications, September/October 2004. Published by the IEEE Computer Society. 0272-1716/04/$20.00 © 2004 IEEE
Visual Exploration Goals
We define the visualization of georeferenced data as a mapping of input data points, with their associated positions and statistical attributes, to unique positions on an output map. The mapping function must satisfy three main constraints: no overlap, position preservation, and clustering. We formally define these constraints as follows.
Let $A$ be the set of input points $A = \{a_0, \ldots, a_{N-1}\}$, where $a_i = (a_i^x, a_i^y)$ is each point’s original position and $S_1(a_i), \ldots, S_k(a_i)$ are its associated statistical parameters. We assume $A$ is large, so we will likely have many data points $i$ and $j$ for which the original positions are very close or even identical, that is, $a_i \approx a_j$. We define the data display space (screen or window space) $DS \subset \mathbb{Z}^2$ as $DS = \{0, \ldots, x_{max} - 1\} \times \{0, \ldots, y_{max} - 1\}$, where $x_{max}$ and $y_{max}$ are the display bounds. The algorithm attempts to determine a mapping function $f$ of the original data set to a solution set $B = \{b_0, \ldots, b_{N-1}\}$, with $0 \le b_i^x \le x_{max} - 1$ and $0 \le b_i^y \le y_{max} - 1$, such that $f : A \to B$, $f(a_i) = b_i \;\forall i \in \{0, \ldots, N-1\}$; that is, $f$ determines the new position $b_i$ of $a_i$.
Figure A shows graphical representations of the mapping constraints. Visual exploration techniques aim to balance the position preservation and clustering constraints under the condition that the no-overlap constraint is always satisfied.
No overlap. All data points are individually visible, with each assigned a unique pixel position (Figure A1). We express this formally as $i \ne j \Rightarrow b_i \ne b_j \;\forall i, j \in \{0, \ldots, N-1\}$.
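The no-overlap constraint can be checked mechanically. The short Python sketch below tests whether a candidate layout gives every point its own pixel inside the display space; `satisfies_no_overlap` is a hypothetical helper name chosen for this example.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int]


def satisfies_no_overlap(b: Sequence[Pixel], x_max: int, y_max: int) -> bool:
    """True if every b_i lies inside DS and no two points share a pixel."""
    inside = all(0 <= x < x_max and 0 <= y < y_max for x, y in b)
    unique = len(set(b)) == len(b)  # i != j implies b_i != b_j
    return inside and unique
```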
Position preservation. New positions should be as close as possible to the original positions. We measure this constraint using the points’ absolute distance from their original positions (Figure A2) or their relative distance from each other (Figure A3). This gives us the following optimization goals:
Figure 1. Plotting of geospatial data points: (a) on longitude, latitude, and a statistical attribute (income in this example), 3D point clouds arise even in small real-world data sets (1 percent of the data); and (b) an xy-plot of the 3D point clouds. The goal is to display all 3D point clouds in a single continuous display without overlap. (The example shows a small sample, 1 percent, from the US Census 1999 New York Household Income data set; see http://www.census.gov.)
Figure A. Problem definition constraints: (1) no overlap, (2) absolute position preservation, (3) relative position preservation, and (4) clustering. Visual exploration aims to find a good tradeoff between 2, 3, and 4 such that 1 is always satisfied.
encourage more intuitive and productive exploration of large geospatial data resources. High-resolution pixelated displays are increasingly available in both wall-sized and desktop units. Although extra display pixels let us show more data, this technology alone doesn’t eliminate overplotting. Figure 2 shows the visualization that results from zooming in. Figure 3 shows how the degree of overlap, that is, the degree to which data points share a pixel position, varies with screen resolution. Although the number of points assigned to the same position decreases as resolution increases, even large, high-resolution displays, such as the display in Figure 4a, can’t achieve zero overlap and could lose potentially interesting patterns.
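The degree of overlap can be estimated directly from the data. The Python sketch below assumes positions already normalized to the unit square and counts the fraction of points that land on a pixel another point has already taken; the function name `degree_of_overlap` and the square-resolution parameter are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple


def degree_of_overlap(points: Iterable[Tuple[float, float]], resolution: int) -> float:
    """Fraction of points that cannot get a pixel of their own.

    points: (x, y) positions in the unit square [0, 1] x [0, 1].
    resolution: number of pixels along each screen axis.
    """
    occupied: Set[Tuple[int, int]] = set()
    n = 0
    for x, y in points:
        n += 1
        px = min(int(x * resolution), resolution - 1)
        py = min(int(y * resolution), resolution - 1)
        occupied.add((px, py))
    # Every point beyond the first on a pixel would overwrite an occupied pixel.
    return (n - len(occupied)) / n if n else 0.0
```

Evaluating this function for a series of growing resolutions yields a curve of the kind described for Figure 3: the value falls as resolution grows but stays above zero as long as any two points share a position.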
Another common solution is to aggregate all data points in each region and only show a summary. With this approach, the visualization reflects all the data points but doesn’t show all available information in the dense regions. (See the “Related Work” sidebar for a discussion of other approaches.)
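A minimal Python sketch of such an aggregation, assuming square grid cells as the regions and the count and median as the summary (both are choices made for this example, not taken from the article):

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def aggregate_by_cell(points: Iterable[Tuple[float, float, float]],
                      cell_size: float) -> Dict[Tuple[int, int], Tuple[int, float]]:
    """Group (x, y, value) points into square cells; keep count and median only."""
    cells: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for x, y, value in points:
        cells[(int(x // cell_size), int(y // cell_size))].append(value)
    return {cell: (len(values), median(values)) for cell, values in cells.items()}
```

Every point contributes to its cell's summary, but the individual values inside a dense cell are no longer visible.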
Wide area layout data observer
In addition to a basic visualization technique, successful data exploration often involves adjusting the
• absolute position preservation,
$$\sum_{i=0}^{N-1} d(a_i, b_i) \to \min$$
• relative position preservation,
$$\sum_{i=0}^{N-1} \sum_{j=0,\, j \ne i}^{N-1} d(b_i, b_j) - d(a_i, a_j) \to \min$$
The application determines the weighting between absolute and relative position preservation to be used.
We define the distance function $d$ by an $L_m$ norm ($m$ = 1 or 2):
$$d(b_i, b_j) = \sqrt[m]{\lvert b_i^x - b_j^x \rvert^m + \lvert b_i^y - b_j^y \rvert^m}$$
Clustering. The clustering constraint involves repositioning the data points so those with high similarity in a statistical attribute $S_i$, $i \in \{1, \ldots, k\}$, are near each other (Figure A4). (We assume clustering depends on the statistical attribute $S \in \{S_1, \ldots, S_k\}$.) In other words, other points in the neighborhood of any given data point should have similar values, yielding pixel coherence. To formalize this constraint, we define the neighborhood $NH$ of a data point $b_i$ and a distance function $d_S$ on the statistical attribute $S$:
$$\sum_{i=0}^{N-1} \sum_{b_j \in NH(b_i)} d_S\bigl(S(b_i), S(b_j)\bigr) \to \min$$
This objective sums up all the differences in $S$ between each point and its neighboring points. We define the neighborhood as $NH(b_i) = \{b_j \mid d(b_i, b_j) < \varepsilon\}$.
Because $S_i$ can have a highly nonuniform distribution, applying nonlinear scaling to $S$ before computing distances $d_S$ might also be necessary. In addition, in some situations many similar points might be in some regions of the map, while only a few are in others. In this case, varying $\varepsilon$ in the region under consideration might be helpful.
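The three optimization goals of this sidebar can be evaluated for any candidate layout. The Python sketch below is one possible reading, under stated assumptions: the relative term uses the absolute value of each pairwise difference so that stretching and shrinking both count as deviations, and the statistical distance d_S is taken to be the absolute difference of attribute values. All function names are hypothetical.

```python
from typing import Sequence, Tuple

Pos = Tuple[float, float]


def dist(p: Pos, q: Pos, m: int = 2) -> float:
    """L_m distance between two 2D positions (m = 1 or 2)."""
    return (abs(p[0] - q[0]) ** m + abs(p[1] - q[1]) ** m) ** (1.0 / m)


def absolute_preservation(a: Sequence[Pos], b: Sequence[Pos], m: int = 2) -> float:
    """Sum of distances between original positions a_i and new positions b_i."""
    return sum(dist(ai, bi, m) for ai, bi in zip(a, b))


def relative_preservation(a: Sequence[Pos], b: Sequence[Pos], m: int = 2) -> float:
    """Sum over ordered pairs i != j of |d(b_i, b_j) - d(a_i, a_j)|.

    Taking the absolute value is an assumption of this sketch.
    """
    n = len(a)
    return sum(abs(dist(b[i], b[j], m) - dist(a[i], a[j], m))
               for i in range(n) for j in range(n) if i != j)


def clustering_cost(b: Sequence[Pos], s: Sequence[float], eps: float, m: int = 2) -> float:
    """Sum of |S(b_i) - S(b_j)| over all neighbors b_j with d(b_i, b_j) < eps.

    Using the absolute difference as d_S is an assumption of this sketch.
    """
    n = len(b)
    return sum(abs(s[i] - s[j])
               for i in range(n) for j in range(n)
               if i != j and dist(b[i], b[j], m) < eps)
```

A layout algorithm would try to keep all three values small while never violating the no-overlap constraint; how the terms are weighted is left to the application.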
Figure 2. Zooming solves neither the overplotting nor the pixel coherence problems. (a) Overplotting on a conventional map with interactive zooming, showing only a small sample of the data. (b) PixelMap showing 100 percent of the data without overplotting. (c) Household income histogram. Green represents low income and red represents high income.
visual representation of data to suit the task at hand. Waldo is a pixel-based visual exploration system combining several relevant interaction techniques. Such techniques let data analysts directly interact with the geospatial data visualizations, dynamically changing them according to the exploration objectives. Waldo also lets analysts relate and combine multiple visualizations.
Waldo is more effective than stand-alone automatic data-mining techniques in that it
• yields results quickly, with a high degree of user satisfaction and confidence in findings;
• lets analysts guide the search and shift or adjust goals on the fly;
• deals with nonhomogeneous and noisy data;
• requires less understanding of complex mathematical or statistical algorithms or parameters; and
Figure 3. Varying degree of pixel overlap depending on screen resolution (x-axis: resolution in pixels, from 0 × 0 to 2,000 × 2,000; y-axis: degree of overlap, from 0.0 to 1.0). Even with a screen resolution of 1,600 × 1,200, the degree of overlap is about 0.3: 30 percent of our data points (about 12,000 points) from the US Year 2000 Census Household Income data set can’t be directly placed without overwriting occupied pixels. Annotations in the plot: “Map-like visualization no longer useful,” “Screen resolution: 30% of all data points can’t be directly placed,” and “Powerwall.”
Figure 4. Large displays solve neither the overplotting nor the pixel coherence problems. Only alternative pixel-based visualization techniques can solve these problems. (a) In a conventional map, overplotting obscures data points even on high-resolution displays. (b) In Waldo, we avoid overplotting on a regular LCD display.
Related Work
Rather than aggregate data, Gridfit [1] avoids overlap in the 2D display by repositioning pixels locally. In areas with high overlap, however, the repositioning depends on the ordering of the points in the database, which might be arbitrary. Gridfit places the first data item found in the database at its correct position, and moves subsequent overlapping data points to nearby free positions, making their placement quasirandom.
Cartograms are another common technique dealing with advanced map distortion [2]. Cartogram techniques let data analysts trade shape against area and preserve the map’s topology to improve map visualization by scaling polygonal elements according to an external parameter. Thus, in cartogram techniques, the rescaling of map regions is independent of a local distribution of the data points. A cartogram-based map distortion provides much better results, but solves neither the overlap nor the pixel coherence problems. Even if the cartogram provides a perfect map distortion (in many cases, achieving a perfect distortion is impossible), many data points might be at the same location, and there might be little pixel coherence. Therefore, cartogram-based distortion is primarily a preprocessing step.
References
1. D.A. Keim and A. Herrmann, “The Gridfit Algorithm: An Efficient and Effective Approach to Visualizing Large Amounts of Spatial Data,” Proc. IEEE Visualization Conf., IEEE CS Press, 1998, pp. 181-188.
2. D.A. Keim, S.C. North, and C. Panse, “Cartodraw: A Fast Algorithm for Generating Contiguous Cartograms,” IEEE Trans. Visualization and Computer Graphics (TVCG), vol. 10, no. 1, 2004, pp. 95-110.
• provides a qualitative overview of data, letting analysts isolate unexpected phenomena for further quantitative analysis.
Basic visualization technique
We use PixelMaps [5] as our basic visualization technique. PixelMaps rescales map subregions to better fit dense, nonuniformly distributed points to unique output positions. The technique is novel in at least two ways:
• It provides meaningful and intuitive graphical representations of large data sets.
• It combines well-founded clustering algorithms with pixel-oriented visualization, thus exploiting a computer’s data processing and graphics power and the flexibility, creativity, and domain knowledge of human data analysts.
PixelMaps aims to represent dense areas while preserving some of the key structures of the original geographical space $(a_i^x, a_i^y)$, and to allocate all data points to unique display pixels, even in dense regions.
To provide nonoverlapping pixel displays, PixelMaps follows a four-step process.
Density-based map distortion. PixelMaps uses recursive partitioning to approximate equal density in the two geographical dimensions $a_i^x$, $a_i^y$. Splitting the data set at low-density positions (less than 10 percent of (l + r)/2 of the data points) achieves efficient partitioning (gridfile-like operations). Applying every split to two areas with an equal number of points but different input screen space determines the map’s distortion. In the first split, for example, PixelMaps considers two areas that each have about 50 percent of the data points but unequal screen space. It then applies distortion to give each half of the data equal area in the output map.
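The idea of giving each half of the data equal output area can be sketched in a few lines of Python. This is a simplified illustration, not the PixelMaps implementation: it splits exactly at the median and alternates between the two axes, whereas the text above describes splitting at low-density positions; the names `distort`, `max_depth`, and `leaf_size` are assumptions.

```python
from typing import List, Tuple

Pt = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _split(r: Rect, axis: int, at: float) -> Tuple[Rect, Rect]:
    """Cut rectangle r at coordinate `at` along axis 0 (x) or 1 (y)."""
    lo, hi = list(r), list(r)
    lo[axis + 2] = at
    hi[axis] = at
    return (lo[0], lo[1], lo[2], lo[3]), (hi[0], hi[1], hi[2], hi[3])


def _rescale(p: Pt, src: Rect, dst: Rect) -> Pt:
    """Map p linearly from the input rectangle src to the output rectangle dst."""
    out = []
    for axis in (0, 1):
        s0, s1 = src[axis], src[axis + 2]
        d0, d1 = dst[axis], dst[axis + 2]
        t = 0.5 if s1 == s0 else (p[axis] - s0) / (s1 - s0)
        out.append(d0 + t * (d1 - d0))
    return out[0], out[1]


def distort(points: List[Pt], src: Rect, dst: Rect, depth: int = 0,
            max_depth: int = 6, leaf_size: int = 4) -> List[Tuple[Pt, Pt]]:
    """Return (original, distorted) position pairs.

    Each split halves the points at their median along one axis; both halves
    then receive equal area in the output, so dense input regions expand.
    """
    if depth >= max_depth or len(points) <= leaf_size:
        return [(p, _rescale(p, src, dst)) for p in points]
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    half = len(ordered) // 2
    cut = (ordered[half - 1][axis] + ordered[half][axis]) / 2.0
    lo_src, hi_src = _split(src, axis, cut)
    lo_dst, hi_dst = _split(dst, axis, (dst[axis] + dst[axis + 2]) / 2.0)
    return (distort(ordered[:half], lo_src, lo_dst, depth + 1, max_depth, leaf_size)
            + distort(ordered[half:], hi_src, hi_dst, depth + 1, max_depth, leaf_size))
```

For a data set in which most points crowd into a narrow strip on the left, the strip is stretched until it fills half of the output, while the sparse remainder is compressed into the other half.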
Allocation and scaling. For efficient rescaling, we perform quadtree split operations on the extent of the 2D screen space, causing empty areas to shrink and dense areas to expand.
We propose a new data structure to simultaneously manage allocation and scaling of both data and screen space. It combines gridfiles (to manage input point partitioning) and quadtrees (to manage new screen space positions). The computed rescaling reduces the size of virtually empty regions, reallocating the unused pixels to dense regions. Figure 5 illustrates the rescaling of certain map regions.
Array-based clustering. PixelMaps next computes an array-based clustering of each partition. It divides the third (statistical) dimension into intervals, from minimal to maximal value. The number of intervals depends on the application scenario and can be user specified. The PixelMaps data structure stores each interval’s end points in an array. Each interval corresponds to a class (income class, for example), and the class of each statistical value can be quickly determined using a binary search. PixelMaps then colors pixels according to cluster class indices.
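The interval lookup described above maps naturally onto Python's standard `bisect` module. The sketch below assumes equal-width intervals, which the text does not prescribe; `interval_end_points` and `class_index` are hypothetical names.

```python
from bisect import bisect_right
from typing import List


def interval_end_points(v_min: float, v_max: float, n_classes: int) -> List[float]:
    """Upper end points of n_classes equal-width intervals over [v_min, v_max]."""
    width = (v_max - v_min) / n_classes
    return [v_min + width * (i + 1) for i in range(n_classes)]


def class_index(value: float, end_points: List[float]) -> int:
    """Binary search for the index of the interval that contains value."""
    # bisect_right counts the end points <= value; the maximum value itself
    # is clamped into the last class.
    return min(bisect_right(end_points, value), len(end_points) - 1)
```

The returned index can then select a color from a palette with one entry per class, so each lookup costs a number of comparisons logarithmic in the number of classes.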
Figure 5. Rescaling reduces the size of virtually empty regions, reallocating the unused pixels to dense regions. We created this series by moving Waldo’s distortion slider.
Cluster positioning heuristic. Finally, after rescaling and clustering, PixelMaps assigns data points to pixels, starting with the densest regions and choosing the smallest cluster in each region first. Figure 6 shows our cluster-positioning heuristic. To determine the placement sequence, we sort all final partitions (leaves of the PixelMaps data structure) by the number of data points contained.
The pixel placement step provides visualizations that trade off position, distance, and cluster preservation.
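A heavily simplified Python sketch of the placement step follows. It handles a single region, ignores the separate treatment of noise points, and searches for free pixels in square rings around each cluster centroid; these simplifications and the names `nearest_free_pixel` and `place_clusters` are assumptions of this example rather than details of PixelMaps.

```python
from typing import Dict, List, Optional, Tuple

Pixel = Tuple[int, int]


def nearest_free_pixel(target: Pixel, occupied: Dict[Pixel, int],
                       width: int, height: int) -> Optional[Pixel]:
    """Search square rings of growing radius around target for a free pixel.

    Rings are ordered by Chebyshev distance; within a ring the pixel with the
    smallest Euclidean distance wins.
    """
    tx, ty = target
    for r in range(max(width, height)):
        best = None
        for x in range(tx - r, tx + r + 1):
            for y in range(ty - r, ty + r + 1):
                if max(abs(x - tx), abs(y - ty)) != r:
                    continue  # only inspect the ring at distance r
                if 0 <= x < width and 0 <= y < height and (x, y) not in occupied:
                    d = (x - tx) ** 2 + (y - ty) ** 2
                    if best is None or d < best[0]:
                        best = (d, (x, y))
        if best is not None:
            return best[1]
    return None  # the display space is full


def place_clusters(clusters: List[List[Pixel]], width: int,
                   height: int) -> Dict[Pixel, int]:
    """Place every point near its cluster centroid, smallest cluster first.

    Returns a map from pixel to the index of the cluster drawn there.
    """
    occupied: Dict[Pixel, int] = {}
    order = sorted(range(len(clusters)), key=lambda i: len(clusters[i]))
    for cid in order:
        pts = clusters[cid]
        if not pts:
            continue
        centroid = (round(sum(x for x, _ in pts) / len(pts)),
                    round(sum(y for _, y in pts) / len(pts)))
        for _ in pts:
            free = nearest_free_pixel(centroid, occupied, width, height)
            if free is None:
                raise ValueError("display space is full")
            occupied[free] = cid
    return occupied
```

Placing the smallest cluster first means that it claims the pixels closest to the shared centroid before a larger cluster can crowd it out, which keeps small clusters visible.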
Exploratory data analysis
Visual data exploration involves three steps in a process so common that researchers have called it the information-seeking mantra [6]:
• Overview: an analyst examines a summary of the data;
• Zoom and filter: the examination might reveal interesting patterns or data subsets meriting further investigation; and
• Details on demand: the analyst focuses on the patterns identified in the previous step, inspecting details to form or validate hypotheses.
A PixelMaps overview of geospatial data reveals subsets with interesting structures by allocating larger display areas to dense regions with many potentially interesting subsets and smaller areas to less interesting items. PixelMaps provides the basic visualization technology in Waldo and bridges the gap between the three visual exploration steps.
Visual exploration using Waldo resembles a hypothesis-generation process: PixelMaps lets analysts gain insight into data and thereby develop and confirm new hypotheses. To complement visualization, we can use automatic techniques from statistics, pattern recognition, or machine learning to verify the hypotheses.
Interaction with PixelMaps
Waldo uses several relevant interaction techniques to adjust the visual representation of data to suit the task at hand.
First, relate and combine lets analysts display data from several maps in multiple linked views, often with identical coordinate systems. Secondary statistical parameters typically appear on alternative maps, with data points at the same positions but colored by other parameters. This makes it easy to compare parameters and detect local correlations, dependencies, and similar patterns.
Next, interactive distortion sliders let analysts adjust the level of detail to change the distortion level. Figure 5 shows the effect of changing spatial distortion.
A selection mechanism lets analysts isolate a subset of the displayed data for further processing, such as highlighting, filtering, or quantitative analysis. Analysts can select data on the visualization itself (direct manipulation) or through dialog boxes and other queries (indirect manipulation).
Finally, linking and brushing lets analysts relate selected items to their representations in other views. For example, an analyst might compare points in PixelMaps to traditional displays such as 2.5D aggregated plots and bar maps.
Application examples
An important issue in visual data mining is determining the effectiveness of the proposed visualizations. Our evaluation compares PixelMaps displays with traditional approaches and provides examples using census records and a telephone call volume data set.
Figure 7 shows a zoomed view of New York using a traditional map and a PixelMap made by Waldo. The degree of overlap for a 1,200 × 1,200 screen resolution is 0.82 for the region. We based both visualizations on US Census Interest and Dividend Household Income data.
Data: P: set of partitions Pi, each holding the data points that belong to the same partition; DS: display space
Result: PixelMap
for Pi ∈ P do
    if |Pi| < min ∧ Var(Pi, Cntrd(Pi)) > √|Pi| then
        CNoise ← CNoise ∪ Pi
    else
        C ← C ∪ Pi
C ← sort C according to |Pi| with Pi ∈ C
for Ci ∈ C do
    if |Ci| pixels are free around Cntrd(Ci) in DS then
        DS ← SetPixels(Ci, Cntrd(Ci), DS)
    else
        /* Find closest free pixels */
        fp ← FndClsstFrPxls(Ci, Cntrd(Ci), DS)
        DS ← SetPixels(Ci, fp, DS)
for p ∈ CNoise do
    if DS[pos(p)] == 0 then
        DS ← SetPixel(p, pos(p), DS)
    else
        /* Find closest free pixel */
        fp ← FndClsstFrPxl(p, pos(p), DS)
        DS ← SetPixel(p, fp, DS)
Figure 6. Cluster-positioning heuristic.
Figure 7. Traditional map versus PixelMaps displays using New York state interest and dividend income data for 2000. The color legend runs from low to high.
The traditional map provides random results in areas with a high degree of overlap (Manhattan, for example) but leaves sparsely populated areas virtually empty. PixelMaps increases space allotted to the densest regions so all data points can be close together. We ran PixelMaps on the most detailed data we have, at the census block level. To demonstrate its scalability, we created individual data points for each household, initially placing them at the block centers. As the figure shows, clusters of households with very high investment income are in Manhattan and Queens, and households with low investment income are in the Bronx and Brooklyn. A salient cluster of wealthy households is on the east side of Central Park.
Census demographic analysis
We performed a census demographics analysis using data sets from the US Census Bureau. For the analysis, we extracted household income, investment income, and the asking price of vacant homes for every state in the US.
The average number of data points assigned to the same position in each state’s input data set heavily influences PixelMaps performance, as Figure 8 shows.
California, Texas, New York, and Florida had the most points assigned to the same position and were therefore the most interesting states for PixelMaps. For these four states, we ran PixelMaps in a time suitable for efficient data exploration (less than 20 seconds); for all other states we ran PixelMaps in real time. We ran the experiments on a 2.4-GHz Intel Xeon computer with 4 Gbytes of main memory.
As Figure 9 shows, household income is strongly correlated with investment income. The figure also shows that California has only a few vacant homes with low or medium asking prices (blue areas indicate nonvacant homes) and that New York has a few vacant homes in less desirable neighborhoods with lower asking prices. Florida has relatively more vacant homes, and the price asked for these houses is strongly correlated with household income in these areas. Median household income and investment income are strongly correlated in each state; in particular, wealthy households are noticeable on the east side of Central Park, on Florida’s Gold Coast, and on the California coast.
A detailed analysis of PixelMaps efficiency and effectiveness with respect to the defined visual exploration goals is available elsewhere [5, 7].
Call volume analysis
Marketing analysts and network engineers look for interesting patterns in network usage data to help them recognize and respond to changing conditions quickly. One of Waldo’s key motivations is the need to analyze extremely large customer service data sets. The example visualization in Figure 10 shows the call volume of a telephone service during a 24-hour period.
The traditional map (Figure 10a) gives random results in areas with a high degree of overlap while leaving sparsely populated areas virtually empty. The Waldo visualizations (Figures 10b-d) show the advantages of the PixelMaps algorithm. The maps show that New York City and Los Angeles County are the population areas with the highest call volume in the US. The PixelMaps displays can show the local distribution of call volumes in these regions.
Figure 8. Computation time based on the number of points assigned to the same xy-position using a standard screen resolution (x-axis: number of data points assigned to the same position, from 0 to 250,000; y-axis: PixelMaps computation time in seconds; points are labeled by state). Cumulated time: 83.411 seconds; total data points assigned to the same position: 1,472,687. Most PixelMaps can be computed in less than 5 seconds. In case of high overplotting, the computation time is suitable for efficient data exploration.
Conclusion
Detecting interesting local patterns in large data sets is a key research challenge. Particularly challenging today is finding and deploying efficient and scalable visualization strategies for exploring large geospatial data sets. One way is to share ideas from the statistics and machine-learning disciplines with ideas and methods from the information and geovisualization disciplines. PixelMaps in the Waldo system demonstrates how data mining can be successfully integrated with interactive visualization. The increasing scale and complexity of data analysis problems will require tighter integration of interactive geospatial data visualization with statistical data-mining algorithms.
Further information on visual analysis of massive geospatial data sets, as well as an implementation of the PixelMaps algorithm and Waldo, is available at the PixelMaps Project Web site at http://dbvis.inf.uni-konstanz.de/~sips/pixel_based_dm/.
Figure 9. PixelMaps results for US Census demographics analysis showing interest and dividend income, median household income, and price asked for (a) California, (b) Texas, (c) New York, and (d) Florida. The color legend runs from low to high.
Figure 10. Call volume analysis using (a) traditional maps and (b-d) PixelMaps with increasing screen resolution: (b) 800 × 347 pixels, (c) 1,024 × 445 pixels, and (d) 1,600 × 695 pixels. The color legend runs from low to high.
Acknowledgments
The Information Society Technologies Programme of the European Commission, Future and Emerging Technologies, partially funded this work under the IST-2001-33058 PANDA project (2001-2004). We thank Waldo Tobler for his very helpful comments, Dave Belanger and Mike Wish for encouraging this investigation, and Eleftherios Koutsofios for providing data and other assistance.
References
1. A.S. Fotheringham and P. Rogerson, Spatial Analysis and GIS, Taylor and Francis, 1994.
2. K. Koperski, J. Adhikary, and J. Han, “Spatial Data Mining: Progress and Challenges,” Research Issues on Data Mining and Knowledge Discovery, ACM Press, 1996.
3. D.A. Keim et al., “Pushing the Limit in Visual Data Exploration: Techniques and Applications,” Proc. Advances in Artificial Intelligence, 26th Ann. German Conf. AI, LNAI 2821, Springer-Verlag, 2003, pp. 37-51.
4. D.A. Keim, C. Panse, and M. Sips, “Information Visualization: Scope, Techniques, and Opportunities for Geovisualization,” to be published in Exploring Geovisualization, J. Dykes, A. MacEachren, and M.-J. Kraak, eds., Elsevier, 2004, pp. 15-44.
5. D.A. Keim et al., “PixelMaps: A New Visual Data Mining Approach for Analyzing Large Spatial Data Sets,” Proc. 3rd IEEE Int’l Conf. Data Mining (ICDM 03), IEEE CS Press, 2003, pp. 565-568.
6. B. Shneiderman, “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations,” Proc. IEEE Visual Languages Conf., IEEE CS Press, 1996, pp. 336-343.
7. D.A. Keim et al., “Pixel Based Visual Mining of Geospatial Data,” Computers and Graphics (CAG), vol. 28, no. 3, June 2004, pp. 327-344.
Daniel A. Keim is a professor of computer science at the University of Constance, Germany. His research interests include information visualization and data mining. Keim has a PhD in computer science from the University of Munich. He is an editor of the IEEE Transactions on Visualization and Computer Graphics, the IEEE Transactions on Knowledge and Data Engineering, and the Palgrave Information Visualization Journal.
Christian Panse is pursuing a PhD in the Data Mining and Visualization Group at the University of Constance. His research interests include visual data mining on large spatial data and cartogram drawing. Panse has an MS in computer science from the Martin-Luther-University Halle-Wittenberg, Germany. He is a member of the IEEE Computer Society.
Mike Sips is completing his PhD studies in the Data Mining and Visualization Group at the University of Constance. His research interests include visual data mining on large spatial data, spatial data transformation, information visualization, and advanced visual interfaces. Sips has an MS in computer science from the Martin-Luther-University Halle-Wittenberg. He is a member of the IEEE Computer Society, the ACM, and the German Society for Informatics.
Stephen C. North is head of Information Visualization Research at AT&T Labs. His research interests include software visualization, applied computational geometry, reusable software design, dynamic and large-scale graph layout, and spatial data transformation. North has a PhD in computer science from Princeton University. He is a senior member of the IEEE and a member of the ACM.
Readers may contact Daniel A. Keim at Dept. of Computer and Information Science, Univ. of Konstanz, Universitatsstr. 10, Box D78, D78457 Konstanz, Germany; [email protected].