
Generative Modeling for Maximizing Precision and Recall in Information Visualization

Jaakko Peltonen (1)    Samuel Kaski (1,2)

(1) Aalto University, Department of Information and Computer Science, Helsinki Institute for Information Technology HIIT, P.O. Box 15400, FI-00076 Aalto, Finland

(2) University of Helsinki, Department of Computer Science, Helsinki Institute for Information Technology HIIT, P.O. Box 68, 00014 University of Helsinki, Finland

Abstract

Information visualization has recently been formulated as an information retrieval problem, where the goal is to find similar data points based on the visualized nonlinear projection, and the visualization is optimized to maximize a compromise between (smoothed) precision and recall. We turn the visualization into a generative modeling task where a simple user model parameterized by the data coordinates is optimized, neighborhood relations are the observed data, and straightforward maximum likelihood estimation corresponds to Stochastic Neighbor Embedding (SNE). While SNE maximizes pure recall, adding a mixture component that “explains away” misses allows our generative model to focus on maximizing precision as well. The resulting model is a generative solution to maximizing tradeoffs between precision and recall. The model outperforms earlier models in terms of precision and recall and in external validation by unsupervised classification.

1 INTRODUCTION

The importance of information visualization as a central part of data analysis has been evident in exploratory branches of statistics, called for instance exploratory data analysis, and the importance of visualization is being emphasized in the current strong visual analytics movement. Machine learning seems to have an obvious contribution to the field through nonlinear dimensionality reduction.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

Many nonlinear dimensionality reduction methods developed during the past ten years have been designed for manifold learning, including Isomap (Tenenbaum et al., 2000), Locally Linear Embedding (Roweis and Saul, 2000), Stochastic Neighbor Embedding (Hinton and Roweis, 2002), Maximum Variance Unfolding (Weinberger and Saul, 2006), Laplacian Eigenmap (Belkin and Niyogi, 2002), and their more recent variants (see for example Zhang and Wang, 2007; Choi and Choi, 2007; van der Maaten and Hinton, 2008; Song et al., 2008). See van der Maaten et al. (2009) for a recent comparison of methods. At first sight it might then seem attractive to simply use manifold learning methods for visualization. However, the manifold learning methods have not been designed or optimized for visualization and hence may not work well for visualization if the inherent dimensionality of the data manifold is larger than the display dimension (Venna et al., 2010). While there now are several more or less rigorous formulations for the manifold learning problem, there are not many for the visualization problem.

Visualization has recently been formulated as a visual information retrieval task (Venna et al., 2010), with the goal being to organize points on the display such that if similar points are retrieved based on the display, the accuracy of retrieving truly similar data is maximized. As in all information retrieval, the result necessarily is a compromise between precision and recall, of minimizing false positives and misses. Stochastic Neighbor Embedding (SNE) corresponds to maximizing recall.

We also take SNE as the starting point because it works well and has a nice interpretation explicated below: The cost function is (mean) maximum likelihood of a simple user model, where the user is assumed to pick neighbors according to a kernel probability distribution on the display. The data are the actual neighbor relationships, in practice often given by specifying a kernel as well.


Now the question we asked is: If maximizing recall is a generative modeling task, could a generative model be made to focus on precision as well, or in fact on any tradeoff between the two?

We formulate information visualization as a generative modeling task, which reduces to SNE when maximizing pure recall, and precision is maximized by a mixture model. When a mixture component is added to explain away the misses, the rest of the model will focus more on minimizing false positives. This turns the whole visualization task into a generative modeling task which makes it more understandable for modelers, easier to extend and, as it turns out, makes the visualizations better. Our cost function, in contrast to Venna et al. (2010), is directly a likelihood of observed neighborhoods. This makes visualization a rigorous statistical modeling task, with all tools of generative modeling available.

2 GENERATIVE MODELING FOR VISUALIZATION

Consider visualization as a model learning task, where observed similarity relationships are the data and the coordinates of points on the display are the parameters. We construct a generative model which will generate neighbor relationships for query points, and can naturally generate a distribution over query points too, although we do not consider that straightforward extension in this paper. The model can be considered as a user model, that is, a model that specifies which other points the user would inspect given the query point. When the visualization is optimized for the specific user model (neighborhood kernel), it will naturally be optimal for a user behaving according to that model.

If the data consists of observed neighborhood relationships, for instance as counts of citations in a paper or counts of social interactions of a person, we can use them directly or, assuming large sample size, normalize them into distributions. Let pij denote the “ground truth” probability that j would be chosen as a neighbor of i without any constraints coming from the visualization, and Σ_{j≠i} pij = 1 for all i. In practice the analysis often starts with a kernel, or distance measure and functional form. In that case, we denote the density after appropriate normalization by p.

Let now probabilities rij denote the neighborhood relationships of the model; rij is the probability of choosing point j as a neighbor of point i, with Σ_{j≠i} rij = 1 for all i. The rij are interpretable as a “user model” as follows: rij is the probability with which the model believes a user will inspect point j when the query point is i, given the visualization. The user model is parameterized by the coordinates yj of each point j on the visualization display. Many definitions of rij can be used depending on the needs of the analyst; we will use the simple Gaussian falloff around the query point i:

$$ r_{ij} = \frac{\exp(-\|y_i - y_j\|^2 / \sigma_i^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2 / \sigma_i^2)} \qquad (1) $$

where σi is a neighborhood radius around point i. Another recent possibility is a t-distributed falloff (van der Maaten and Hinton, 2008) which can be easily included.
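
To make the user model concrete, the following NumPy sketch computes the neighborhood distribution rij of equation (1) from given display coordinates; the function and variable names are ours, and the unit neighborhood radii in the example are an illustrative assumption rather than part of the method.

import numpy as np

def user_model(Y, sigma):
    # Equation (1): Gaussian falloff around each query point i on the display,
    # normalized over all j != i. Y is an (N, 2) array of display coordinates,
    # sigma an (N,) array of neighborhood radii sigma_i.
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    logits = -sq_dist / (sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)              # a point is never its own neighbor
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    R = np.exp(logits)
    return R / R.sum(axis=1, keepdims=True)

# Illustration: five random points with unit neighborhood radii.
Y = np.random.randn(5, 2)
R = user_model(Y, sigma=np.ones(5))
assert np.allclose(R.sum(axis=1), 1.0)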

Simple generative modeling to maximize recall.

Now consider simply maximizing the log-likelihood of the observed neighborhoods, that is, maximizing

$$ \sum_i \sum_{j \neq i} p_{ij} \log r_{ij} . \qquad (2) $$

This corresponds to minimizing Σ_i D_KL(p_i·, r_i·), the sum of Kullback-Leibler divergences from the observed neighborhoods to the user model, which is the cost function of Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2002). This is a straightforward re-interpretation of SNE.
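
As a sketch of this equivalence, assuming p and r are stored as full N x N arrays with zero diagonals and using our own function names, objective (2) and the KL form differ only by the entropy of p, which does not depend on the display coordinates:

import numpy as np

def objective_recall(P, R, tiny=1e-12):
    # Equation (2): sum_i sum_{j != i} p_ij log r_ij.
    mask = ~np.eye(P.shape[0], dtype=bool)
    return np.sum(P[mask] * np.log(R[mask] + tiny))

def kl_sum(P, R, tiny=1e-12):
    # sum_i KL(p_i. || r_i.); equals -objective_recall(P, R) plus a term
    # that depends only on P, so maximizing (2) minimizes this sum.
    mask = ~np.eye(P.shape[0], dtype=bool)
    return np.sum(P[mask] * (np.log(P[mask] + tiny) - np.log(R[mask] + tiny)))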

We then consider a simple user model in order to build a connection to recall, extending the work of Venna et al. (2010). Assume that the user (or retrieval model) retrieves a set Ri of points as neighbors of query point i, and places a uniform distribution rij = (1 − ε)/|Ri| across the retrieved points with a very small probability ε/(N − 1 − |Ri|) for others, where ε is a very small number and N − 1 is the total number of points other than i. Similarly, let the set of actually relevant neighbors be Pi, with a uniform distribution pij = (1 − ε)/|Pi| across the relevant neighbors and very small probabilities for the rest. Then the objective function for a single query point i becomes

$$ \sum_{j \neq i} p_{ij} \log r_{ij} \approx \sum_{j \in P_i \cap R_i} \frac{1-\epsilon}{|P_i|} \log \left( \frac{1-\epsilon}{|R_i|} \right) + \sum_{j \in P_i \cap R_i^c} \frac{1-\epsilon}{|P_i|} \log \left( \frac{\epsilon}{N - 1 - |R_i|} \right) \approx \frac{N_{TP,i}}{|P_i|} \log \left( \frac{1}{|R_i|} \right) + \frac{N_{MISS,i}}{|P_i|} \log \epsilon \qquad (3) $$

where Ri^c and Pi^c denote complements of Ri and Pi, NTP,i = |Pi ∩ Ri| = |Ri| − NFP,i is the number of true positives (retrieved relevant points), NFP,i = |Ri ∩ Pi^c| is the number of false positives (retrieved non-relevant points), and NMISS,i = |Pi ∩ Ri^c| is the number of misses (relevant non-retrieved points).


With small ε the rightmost term in (3) dominates, and maximizing the objective function (3) with respect to the retrieval distribution defined by the rij is equivalent to minimizing the number of misses, that is, maximizing recall = NTP,i/|Pi| = 1 − NMISS,i/|Pi|. Therefore SNE, which maximizes (3), can be seen as a generative model of neighborhood relationships which maximizes recall.
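
The following small numerical check, with hypothetical sets Pi and Ri chosen by us, illustrates the approximation in equation (3): both the exact sum and the approximation are dominated by the miss term, so maximizing either amounts to maximizing recall.

import numpy as np

N, eps = 10, 1e-6                       # assumed toy sizes, query point i = 0
P_set = {1, 2, 3, 4}                    # relevant neighbors P_i
R_set = {2, 3, 5}                       # retrieved neighbors R_i
others = range(1, N)

# Uniform distributions over P_i and R_i, tiny probabilities elsewhere.
p = {j: (1 - eps) / len(P_set) if j in P_set else eps / (N - 1 - len(P_set)) for j in others}
r = {j: (1 - eps) / len(R_set) if j in R_set else eps / (N - 1 - len(R_set)) for j in others}

exact = sum(p[j] * np.log(r[j]) for j in others)
n_tp, n_miss = len(P_set & R_set), len(P_set - R_set)
approx = n_tp / len(P_set) * np.log(1 / len(R_set)) + n_miss / len(P_set) * np.log(eps)
print(exact, approx)                    # both dominated by the term from the two misses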

2.1 Extending the generative model for flexible visualization goals

We showed above that maximizing the likelihood for the simple retrieval model corresponds to maximizing recall, and it also corresponds to the objective of SNE.

However, maximizing recall is only one possible goal of successful visualization: it corresponds to minimizing misses (missed true neighbors), but it ignores the other type of visualization error, false positives. Minimizing false positives would be equivalent to maximizing precision of retrieving neighbors from the visualization. Both precision and recall, or any tradeoff between them, are useful optimization goals for visualization. We next show that we can change the retrieval model to optimize a tradeoff between precision and recall, while keeping the same rigorous generative modeling approach which we introduced above.

Notice that the simple analysis above already gives a hint on how to proceed; equation (3) does involve the number of true positives NTP,i (or equivalently the number of false positives NFP,i) in the first term on the right-hand side. However, this term does not have much influence on optimization because the cost function is in practice dominated by the second term involving misses. For small ε the second term is always much larger than the first, therefore misses are likely to dominate. If we could somehow change the model so that the cost of misses becomes less dominant, the model would be able to focus also on false positives.

More flexible generative modeling to maximize a tradeoff between precision and recall. Let us design a more flexible retrieval model qij which extends rij. We will define qij as a mixture of two retrieval mechanisms: the user model rij which depends on the visualization coordinates of points, and an additional model which need not depend on the visualization coordinates; the goal of the additional model is to explain away those neighbors that the user model rij misses. We give the precise definitions soon.

Intuitively, if we can create an additional retrieval mechanism which retrieves all those neighbors that the user model rij misses, then when we fit the combined model to maximize the likelihood of observed neighborhoods, the user model rij (which is part of the functional form of qij) can minimize the remaining kind of error, the number of retrieved false positives.

A simple solution is to define the retrieval distribution qij as a mixture of the plain user model rij and an “explaining away” model:

$$ q_{ij} \propto r_{ij} + \gamma p_{ij} \qquad (4) $$

where γ ≥ 0 is a multiplier which controls the amount of explaining away. The model is again fitted to the observed neighborhoods by maximizing the log-likelihood

$$ L = \sum_i \sum_{j \neq i} p_{ij} \log q_{ij} \qquad (5) $$

with respect to the output coordinates yi of all data points, which affect the qij through the plain user model rij.
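
A minimal sketch of the mixture retrieval distribution (4) and the log-likelihood (5), again assuming full N x N arrays with zero diagonals and using our own function names (the user_model sketch of equation (1) above would supply R):

import numpy as np

def mixture_model(R, P, gamma):
    # Equation (4): q_ij proportional to r_ij + gamma * p_ij, renormalized over j != i.
    Q = R + gamma * P
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def log_likelihood(P, Q, tiny=1e-12):
    # Equation (5): L = sum_i sum_{j != i} p_ij log q_ij.
    mask = ~np.eye(P.shape[0], dtype=bool)
    return np.sum(P[mask] * np.log(Q[mask] + tiny))

# With gamma = 0 this reduces to the SNE objective (2).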

It is easy to see that the explaining-away has no effect in the perfect retrieval case where rij already equals pij (then qij = rij); instead, the explaining-away affects how severely errors in rij affect the likelihood. In the log-likelihood (5) the explaining-away has the largest effect on the terms corresponding to misses, where rij is small but pij is large; for such terms qij is also large and the cost of misses thus no longer dominates the likelihood. Therefore, optimizing qij with respect to the visualization coordinates is now able to better take into account the false positives, and hence the visualization will be better arranged to avoid false positives.

An analysis of the mixture model likelihood.

In the simple case that we discussed above, where the observed neighborhoods pij and plain user models rij are uniform over some subsets of points Pi and Ri respectively, and near-zero elsewhere, it can be shown that the log-likelihood of the mixture model for a single query point becomes

$$ \sum_{j \neq i} p_{ij} \log q_{ij} \approx \text{const.} + \text{recall} \cdot \log \frac{\left( \frac{\text{precision}}{\text{recall}} + \gamma \right) \left( a - \frac{\text{recall}}{\text{precision}} \right)}{\gamma \left( a - \frac{\text{recall}}{\text{precision}} \right) + \frac{\epsilon}{1-\epsilon}} + (1-\epsilon) \log \left( \frac{\gamma (1-\epsilon)}{|P_i|} + \frac{\epsilon}{N - |R_i| - 1} \right) \qquad (6) $$

where a = (N − 1)/|Pi|, and the information retrieval criteria are recall = NTP,i/|Pi| and precision = NTP,i/|Ri| as usual. With no explaining-away (γ = 0), maximizing (6) reduces to maximizing equation (3), that is, maximizing the mixture model likelihood without explaining-away is the same as maximizing recall · const. and ignoring precision.


However, with a sufficient amount of explaining away such that γ ≫ ε > 0 the above reduces to the more appealing form

$$ \sum_{j \neq i} p_{ij} \log q_{ij} \approx \text{const.} + \text{recall} \cdot \log \left( 1 + \frac{1}{\gamma} \cdot \frac{\text{precision}}{\text{recall}} \right) \qquad (7) $$

where we can see that, because of the explaining-away, the objective function is affected both by precision (false positives) and recall (misses). The influence of precision is strongest when γ is small but still clearly larger than ε.

In more detail, the log-likelihood is dominated by a sum over misses and a sum over true positives. If γ is much smaller than ε, the misses dominate the cost function, which reduces to maximization of recall. On the other hand, as γ grows, it begins to affect the log-likelihood of true positives: rij for true positives depends on the number of retrieved points |Ri| and hence depends on precision; however, for asymptotically large γ, qij ∝ rij + γpij for true positives becomes nearly constant with respect to the visualization, which then reduces the influence of precision on optimization. For more details, see the full derivation of equation (7) in the supplement.

To summarize, in order to make precision influence the cost function as much as possible, γ should be sufficiently larger than zero so that it can explain away the misses. Otherwise, γ should be kept small: then the log-likelihood of true positives depends on the retrieval distribution rij rather than being explained away. In our experiments we use γ = 0.9 which yielded very good results.
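
The precision/recall-dependent term of equation (7) can also be inspected directly; the toy precision and recall values below are ours and only illustrate that, with explaining-away, the objective rewards improvements in either criterion:

import numpy as np

def objective7_term(precision, recall, gamma=0.9):
    # The non-constant term of equation (7).
    return recall * np.log(1.0 + precision / (gamma * recall))

print(objective7_term(0.2, 0.8), objective7_term(0.8, 0.8))  # grows with precision
print(objective7_term(0.8, 0.2), objective7_term(0.8, 0.8))  # grows with recall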

The objective function can be maximized with respect to the output coordinates yi by gradient methods; here we use conjugate gradient. The computational complexity per iteration is O(N²) for N data points, which is the same complexity as for SNE. To help avoid local minima, we first run the method with no explaining-away (γ = 0) and use the resulting coordinates yi as initialization for the final run with the desired amount of explaining-away (desired γ value).
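
A simplified optimization sketch following this recipe is given below; it reuses the user_model and mixture_model sketches above, lets SciPy's conjugate-gradient routine estimate gradients numerically (an analytic gradient would be used in a serious implementation), and all names and defaults are ours.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(y_flat, P, sigma, gamma, tiny=1e-12):
    # Negative of objective (5) as a function of the flattened 2D display coordinates.
    Y = y_flat.reshape(P.shape[0], 2)
    Q = mixture_model(user_model(Y, sigma), P, gamma)   # equations (1) and (4)
    mask = ~np.eye(P.shape[0], dtype=bool)
    return -np.sum(P[mask] * np.log(Q[mask] + tiny))

def optimize_display(P, sigma, gamma=0.9, seed=0):
    # Two-stage optimization: first gamma = 0 (plain SNE) to obtain an
    # initialization, then the desired amount of explaining-away.
    rng = np.random.default_rng(seed)
    y = 1e-2 * rng.standard_normal(P.shape[0] * 2)
    for g in (0.0, gamma):
        y = minimize(neg_log_likelihood, y, args=(P, sigma, g), method="CG").x
    return y.reshape(-1, 2)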

2.2 Comparison to regularization

The functional form of our retrieval distribution qij is superficially similar to regularization: the user model rij is mixed with another distribution which keeps crucial retrieval probabilities nonzero. Regularized variants of stochastic neighbor embedding have been proposed earlier: in particular, UNI-SNE (Cook et al., 2007) is a variant of SNE where the retrieval distribution is regularized by mixing it with a uniform distribution, which is equivalent to qij ∝ rij + const.

The problem with such regularization is that it distorts the retrieval model and hence cannot achieve the optimal embedding result. Because the regularization always mixes a constant into all retrieval probabilities, the user model is forced to compensate for this regularization, which distorts the embedding.

It can be shown that even if a perfect embedding (where rij = pij) is possible, for example when the original data lives on a low-dimensional subspace, the UNI-SNE optimum does not correspond to that perfect embedding. (This can be seen by taking the gradient of the UNI-SNE cost function with respect to rij, enforcing nonnegativity and sum-to-one constraints by reparameterization, and showing that rij = pij is not a zero-point of that gradient.)

In contrast, our method mixes the user model with the “perfect retrieval” distribution pij, which is data-dependent and non-uniform. This is a true “explaining away” model which does not distort the embedding: it is easy to show that if a perfect embedding (where rij = pij) is possible, it corresponds to the optimum of our method, as desired. To show this, simply note that if rij = pij then also qij = pij, which yields the maximum value of the log-likelihood Σ_{i, j≠i} pij log qij, or equivalently the minimal value of Σ_i D_KL(pi., qi.), where the D_KL are Kullback-Leibler divergences between the relevance probabilities pij and the qij. Therefore, if rij = pij can be achieved, it corresponds to the optimum of our method.

In summary, the new method can be seen as a rigorous approach to the same problem that has been previously addressed by regularization approaches like UNI-SNE. Our new method also has a novel interpretation and an analysis in terms of precision and recall; and it corrects a problem present in UNI-SNE, so that the new method is able to find the optimal embedding.

3 EXPERIMENTS

We compare our new method to several previous methods, first in terms of retrieval performance and then in terms of unsupervised classification performance; lastly, we plot visualizations produced by our method on several data sets.

3.1 Comparison of retrieval performance

We evaluate the performance of the new method against a comprehensive set of alternatives on two data sets, in the task of visualizing the sets as scatterplots in 2D. The Faces data set (http://www.cs.toronto.edu/~roweis/data.html) contains 400 face images, from 40 people with 10 images each, with different facial expressions and lighting; each image is 64×64 pixels with 256 grey levels.


The Seawater temperature time series (Liitiainen and Lendasse, 2007) contains weekly measurements of seawater temperature over several years. Each data point is a 52-week window of the temperature time series, and for the next data point the window is shifted one week forward; this yields 823 data points with 52 dimensions.

We compare our method with thirteen others: Principal Component Analysis (PCA; Hotelling, 1933), Metric Multidimensional Scaling (MDS; see Borg and Groenen, 1997), Locally Linear Embedding (LLE; see Roweis and Saul, 2000), Laplacian Eigenmap (LE; Belkin and Niyogi, 2002), Hessian-based Locally Linear Embedding (HLLE; Donoho and Grimes, 2003), Isomap (Tenenbaum et al., 2000), Curvilinear Component Analysis (CCA; Demartines and Herault, 1997), Curvilinear Distance Analysis (CDA; Lee et al., 2004), Maximum Variance Unfolding (MVU; Weinberger and Saul, 2006), Landmark Maximum Variance Unfolding (LMVU; Weinberger et al., 2005), Local MDS (LMDS; Venna and Kaski, 2006), Neighbor Retrieval Visualizer (NeRV; Venna et al., 2010), and UNI-SNE (Cook et al., 2007).

We use the same test setup as Venna et al. (2010). In brief, each method was run with several parameter values, and non-convex methods were run from five random initializations. For each method, the best result was chosen in the sense of maximizing the (unsupervised) F-measure computed as 2(P · R)/(P + R), where P and R are rank-based smoothed precision and recall measures; see Venna et al. (2010) for details. The NeRV and LocalMDS methods which allow a tradeoff between precision and recall were run with several values of their tradeoff parameter λ; for clarity we show results for a single λ value chosen by the F-measure. We ran our method with two settings: the baseline case γ = 0 (no explaining-away; corresponds to Stochastic Neighbor Embedding) and γ = 0.9 (strong explaining-away during training). We ran the corresponding setting for UNI-SNE, with λ = 0.47 which corresponds to γ = 0.9 in our method. Note that both the explaining-away in our method and the regularization in UNI-SNE are only used during training, and only the resulting visualization (data point locations) is used to evaluate the quality of the method.

The quality of the visualizations is evaluated by how well the real neighborhoods are visible in the visualization or, equivalently, how well the real neighbors (set to be the 20 nearest neighbors of points in the original space) can be retrieved based on the visualization. We use traditional precision–recall curves to measure this.
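
A sketch of this evaluation is shown below (the helper names are ours): the ground-truth neighbors are the 20 nearest neighbors in the original space, and precision and recall are averaged over query points for each number of retrieved display neighbors. The rank-based smoothed measures used above for model selection are not implemented here.

import numpy as np

def knn_indices(X, k):
    # Indices of the k nearest neighbors of each row of X, excluding the point itself.
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def precision_recall_curve(X_orig, Y_display, k_true=20, k_max=100):
    true_nn = knn_indices(X_orig, k_true)
    display_nn = knn_indices(Y_display, k_max)
    curve = []
    for k in range(1, k_max + 1):
        tp = np.array([len(set(true_nn[i]) & set(display_nn[i, :k]))
                       for i in range(len(X_orig))])
        curve.append((tp.mean() / k, tp.mean() / k_true))   # (mean precision, mean recall)
    return np.array(curve)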

Based on Figure 1 our new method (denoted “NM” in the figures) performs very well: it attains clearly the best precision. In terms of recall it is roughly as good as NeRV or, equivalently, SNE. The simple regularization approach UNI-SNE also performs fairly well, but our more rigorous approach achieves better results.

3.2 Unsupervised classification

We additionally compare the methods using external validation, computing unsupervised 2D displays and then measuring how well known but so far unused classes are separated on the display. Class separation is measured by the classification accuracy of a k-nearest neighbor classifier (k = 5) operating on the display coordinates; each point is classified according to a majority vote of its k nearest neighbors excluding itself.
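
A leave-one-out version of this classifier on the display coordinates can be sketched as follows (the helper name is ours):

import numpy as np
from collections import Counter

def knn_error_rate(Y, labels, k=5):
    # Each point is classified by a majority vote of its k nearest display-space
    # neighbors, excluding the point itself; returns the classification error rate.
    d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    errors = sum(Counter(labels[j] for j in nn[i]).most_common(1)[0][0] != labels[i]
                 for i in range(len(Y)))
    return errors / len(Y)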

Four data sets are used. The Letter recognition data set is from the UCI machine learning repository (Blake and Merz, 1998) and contains 4 × 4 images of capital letters, based on distorting letter shapes in different fonts; the data set has 16 dimensions and 26 classes. The Phoneme data set is from LVQ-PAK (Kohonen et al., 1996) and contains spoken phoneme samples; the data are 20-dimensional and there are 13 classes (different phonemes). The Landsat satellite data set is from the UCI machine learning repository; it contains satellite images, each of which is 3 × 3 pixels and measured in four spectral bands, yielding 36 dimensions per image. Each image is classified into one of 6 classes which denote different soil types. The TIMIT data set is from the DARPA TIMIT speech database (TIMIT, 1998); it contains phoneme samples, each of which is 12-dimensional, and there are 41 classes.

Our new method (“NM” in Table 1) with strong explaining-away during training (γ = 0.9) yields the best results on two data sets (Landsat and TIMIT), second-best on one (Letter), and third-best on one (Phoneme). The use of explaining-away during training clearly improves results on all data sets compared to the no explaining-away case (γ = 0, corresponding to SNE). UNI-SNE performs almost as well: it is best on two data sets (Letter and Phoneme), third-best on one (TIMIT) and fourth-best on one (Landsat). Other methods that perform well are LocalMDS and NeRV.

Although for brevity we report the results of our method only with two choices of γ, the results are very good for all the nonzero γ values that we tried (between 0.1 and 0.9). On TIMIT our method is best with any such γ value; on Letter, Phoneme and Landsat, our method is always in the top two, top three, and top four respectively.


[Figure 1: two precision–recall plots, mean precision on the vertical axes and mean recall on the horizontal axes, one panel for the Faces data set and one for the Seawater temperature time series, with curves for NeRV, LMDS, MDS, Isomap, CDA, MVU, CCA, UNI-SNE, and NM.]

Figure 1: Retrieval quality measures for neighbor retrieval based on the visualizations, for two data sets: Faces and Seawater temperature time series. For clarity, only a few of the best-performing methods are shown for each data set. Performance is measured by standard precision–recall curves. For NeRV and LocalMDS, for clarity performance is shown for only a single λ chosen by the F-measure. For our method (denoted “NM”) we report performance for γ = 0 (no explaining away; corresponds to Stochastic Neighbor Embedding) and γ = 0.9 (strong explaining away used during training). For UNI-SNE we report results for λ = 0.47 which corresponds to our setting γ = 0.9; UNI-SNE at λ = 0 is essentially equivalent to our method at γ = 0. Our new method attains the highest precision for both data sets and is comparable to NeRV/SNE in terms of recall.

Table 1: (In)separability of known classes on unsupervised displays for four data sets. The cost measure is classification error rate, based on the visualizations, with a k-nearest neighbor classifier, k = 5. Our method is the best on two of the four data sets, second-best on one data set (Letter), and third-best on one data set (Phoneme). On Landsat data our method and LocalMDS yield the same accuracy. The best method in each column is marked with an asterisk.

Method             Letter    Phon.     Land.     TIMIT
Eigenmap           0.914     0.121     0.168     0.674
LLE                n/a       0.118     0.212     0.722
Isomap             0.847     0.134     0.156     0.721
MVU                0.763     0.155     0.153     0.699
LMVU               0.819     0.208     0.151     0.787
MDS                0.823     0.189     0.151     0.705
CDA                0.336     0.118     0.141     0.643
CCA                0.422     0.098     0.143     0.633
NeRV               0.532     0.079     0.139     0.626
LocalMDS           0.499     0.118     0.128*    0.637
UNI-SNE, λ = 0.47  0.299*    0.072*    0.136     0.628
NM, γ = 0          0.590     0.088     0.133     0.657
NM, γ = 0.9        0.326     0.080     0.128*    0.594*

3.3 Demonstrations on toy data, face images, and fMRI data

We demonstrate the visualizations on three data sets. First, we replicate the simple demonstration of the precision–recall tradeoff shown in Venna et al. (2010). Data points are distributed on the surface of a three-dimensional sphere (Figure 2A). We create two-dimensional visualizations with our new method, with two settings: no explaining-away (γ = 0; corresponds to SNE) which concentrates on minimizing misses, and strong explaining-away during training (γ = 0.9) which concentrates on minimizing false positives. The result trained without explaining-away (Figure 2B) minimizes misses by squashing the sphere flat, which leads to numerous false neighbors when points originally on opposing sides of the sphere are placed near each other. With strong explaining-away (Figure 2C) false neighbors are minimized by opening up the sphere, at the expense of missing some neighbors across the tear. Both solutions are useful visualizations of the sphere, but for different purposes.

Secondly, we visualize the face images data set which was already used in the previous section. In the plot with γ = 0.9, Figure 2D, faces of the same person become mapped close to each other in the visualization.



Figure 2: Demonstrations of our method. A-C demonstrate the tradeoff between misses and false positives. Points on a three-dimensional sphere (A) are mapped to a two-dimensional display by the new method. In B, the visualization is optimized without explaining-away (γ = 0; corresponds to Stochastic Neighbor Embedding) which minimizes misses by squashing the sphere flat. In C, the visualization is optimized with strong explaining-away (γ = 0.9) which minimizes false positives by opening up the sphere. D: Face images (γ = 0.9); faces of the same person occur close to each other. E: Visualization of fMRI whole-head volumes from an experiment with several people experiencing multiple stimuli (γ = 0.9). The four stimuli types (red: tactile, yellow: auditory tone, green: auditory voice, blue: visual) have become separated in the visualization; the two auditory stimuli types are arranged close by, as is intuitively reasonable. An axial slice is shown for each whole-head volume, chosen so that the shown slice contains the highest-activity voxel.


Thirdly, we visualize a set of functional magnetic resonance imaging (fMRI) measurements. The data set (Malinen et al., 2007) includes measurements of six healthy young adults in two measurement sessions where they received temporally non-overlapping stimuli: auditory (binaural tones or a male voice), visual (shown video clips), and tactile (touch pulses delivered to fingers). Using an MRI scanner, 161 whole-head volumes (time points) were obtained for each person in each test. Preprocessing of the volumes included realignment, normalization, smoothing, and extraction of 40 components by independent component analysis; see Ylipaavalniemi et al. (2009) for details.

For our purposes we took every fourth time point (whole-head volume) from the first half of each session as a data item to be visualized, yielding 6 × 2 × 19 = 228 data items with 40 dimensions. We visualize this data set in two dimensions using our new method with explaining-away (γ = 0.9) during training. Figure 2E shows the result. The different stimuli types are separated in the visualization. This kind of a display is useful for interactive analysis of the experiment, where browsing for evidence of common patterns is interleaved with interactive slicing through the 3D brain volumes to more accurately view the sets of 3D active regions.

4 CONCLUSIONS AND DISCUSSION

We have introduced a novel way to perform nonlinear dimensionality reduction by bringing in the generative modeling framework and a way of controlling the precision and recall of the visualization. The method includes Stochastic Neighbor Embedding (SNE) as a special case, and thus gives a generative interpretation for it, but where SNE minimizes only one kind of error (misses), we allow a flexible amount of explaining-away during training to let the model concentrate on reducing the other kind of error, false positives. Our model simply mixes the retrieval “user model” linearly with an explaining-away distribution during training; this remarkably simple model suffices to yield a flexible tradeoff between minimizing misses and minimizing false positives, and in experiments it gives visualizations that outperform alternative methods according to several measures.

Compared to the earlier regularization-based approach UNI-SNE (Cook et al., 2007), our method performs slightly better. Furthermore, it has a novel interpretation and an analysis in terms of precision and recall, and it corrects a problem present in UNI-SNE. In contrast to UNI-SNE, the new method is able to find the optimal embedding.

Compared to a previous approach (Venna et al., 2010) which also minimized a tradeoff between misses and false positives, our novelty is the rigorous generative framework; our cost function is directly a likelihood of observed neighborhoods and we control precision and recall by using a generative model. This makes it easier to analyze the performance and extend the model. In particular, it should now be possible to start to rigorously learn the user model too, on-line or off-line, to adapt to real user behavior and needs.

We have now brought information visualization into the domain of rigorous probabilistic generative modeling. The specific modeling choices were made to show that this is possible; we did not yet make any claims about optimality, in particular about maximization of precision. However, even the proof-of-concept model outperformed existing models in empirical tests, giving strong support to this line of research.

A simple extension is to use alternative distributional assumptions. Instead of the Gaussian falloffs which gave very good results here, there is evidence that t-distributed neighborhoods could work even better for visualizations (van der Maaten and Hinton, 2008).

In this paper the goal is information visualization, where it is natural to have 2-3 output dimensions. Controlling precision and recall with generative modeling may also be useful more generally in dimensionality reduction with higher output dimensionalities.

Acknowledgments

The authors belong to the Adaptive Informatics Research Centre, a national CoE of the Academy of Finland. JP was supported by the Academy of Finland, decision number 123983. This work was also supported in part by the PASCAL2 Network of Excellence, ICT 216886. We thank Jarkko Ylipaavalniemi, from the Department of Information and Computer Science, and Riitta Hari, from the Advanced Magnetic Imaging Centre (AMI Centre) and the Brain Research Unit in the Low Temperature Laboratory, for providing and helping with the fMRI data.

References

Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 585–591. MIT Press, Cambridge, MA.

Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling. Springer, New York.


Choi, H. and Choi, S. (2007). Robust kernel isomap. Pattern Recognition, 40:853–862.

Cook, J., Sutskever, I., Mnih, A., and Hinton, G. (2007). Visualizing similarity data with a mixture of maps. In Meila, M. and Shen, X., editors, Proceedings of AISTATS*07, the 11th International Conference on Artificial Intelligence and Statistics (JMLR Workshop and Conference Proceedings Volume 2), pages 67–74.

Demartines, P. and Herault, J. (1997). Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148–154.

Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596.

Hinton, G. and Roweis, S. T. (2002). Stochastic neighbor embedding. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 833–840. MIT Press, Cambridge, MA.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 498–520.

Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., and Torkkola, K. (1996). LVQ PAK: The learning vector quantization program package. Technical Report A30 (26 pages), Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02150 Espoo, Finland.

Lee, J. A., Lendasse, A., and Verleysen, M. (2004). Nonlinear projection with curvilinear distances: Isomap versus curvilinear distance analysis. Neurocomputing, 57:49–76.

Liitiainen, E. and Lendasse, A. (2007). Variable scaling for time series prediction: Application to the ESTSP'07 and the NN3 forecasting competitions. In Proceedings of IJCNN 2007, International Joint Conference on Neural Networks, pages 2812–2816. IEEE, Piscataway, NJ.

Malinen, S., Hlushchuk, Y., and Hari, R. (2007). Towards natural stimulation in fMRI – issues of data analysis. NeuroImage, 35(1):131–139.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326.

Song, L., Smola, A., Borgwardt, K., and Gretton, A. (2008). Colored maximum variance unfolding. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 1385–1392. MIT Press, Cambridge, MA.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323.

TIMIT (1998). TIMIT. CD-ROM prototype version of the DARPA TIMIT acoustic-phonetic speech database.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

van der Maaten, L., Postma, E., and van der Herik, J. (2009). Dimensionality reduction: A comparative review. Technical report TiCC-TR 2009-005 (35 pages), Tilburg centre for Creative Computing, Tilburg University.

Venna, J. and Kaski, S. (2006). Local multidimensional scaling. Neural Networks, 19:889–899.

Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451–490.

Weinberger, K., Packer, B., and Saul, L. (2005). Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Cowell, R. G. and Ghahramani, Z., editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pages 381–388. Society for Artificial Intelligence and Statistics. (Available electronically at http://www.gatsby.ucl.ac.uk/aistats/).

Weinberger, K. Q. and Saul, L. K. (2006). Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70:77–90.

Ylipaavalniemi, J., Savia, E., Malinen, S., Hari, R., Vigario, R., and Kaski, S. (2009). Dependencies between stimuli and spatially independent fMRI sources: Towards brain correlates of natural stimuli. NeuroImage, 48:176–185.

Zhang, Z. and Wang, J. (2007). MLLE: Modified locally linear embedding using multiple weights. In Scholkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 1593–1600. MIT Press, Cambridge, MA.