
Automatic Generation of Co-Embeddings from Relational Data with Adaptive Shaping

Tingting Mu, Member, IEEE, and John Yannis Goulermas, Senior Member, IEEE

Abstract—In this paper, we study the co-embedding problem of how to map different types of patterns into one common low-dimensional space, given only the associations (relation values) between samples. We conduct a generic analysis to discover the commonalities between existing co-embedding algorithms and indirectly related approaches and investigate possible factors controlling the shapes and distributions of the co-embeddings. The primary contribution of this work is a novel method for computing co-embeddings, termed the automatic co-embedding with adaptive shaping (ACAS) algorithm, based on an efficient transformation of the co-embedding problem. Its advantages include flexible model adaptation to the given data, an economical set of model variables leading to a parametric co-embedding formulation, and a robust model fitting criterion for model optimization based on a quantization procedure. The secondary contribution of this work is the introduction of a set of generic schemes for the qualitative analysis and quantitative assessment of the output of co-embedding algorithms, using existing labeled benchmark datasets. Experiments with synthetic and real-world datasets show that the proposed algorithm is very competitive compared to existing ones.

Index Terms—Relational data, data co-embedding, heterogeneous embedding, data visualization, structural matching


1 INTRODUCTION

CO-EMBEDDING, also known as joint or heterogeneous embedding, is a recent concept in machine learning and pattern recognition with applications in biology, information retrieval, data mining, data visualization, text analysis, sentiment analysis, and so on [1], [2], [3], [4]. Contrary to the standard embedding problem that embeds homogeneous data objects of one type into a low-dimensional space given their high-dimensional feature representation [5], co-embedding simultaneously handles two or more heterogeneous types of data, such as genes and symptoms, images and words, documents and topics, and so on, and maps them onto a single common space. The input to a co-embedding algorithm is a set of (dis)similarities, associations, or entries of contingency tables. Examples include a table between histological melanoma types and sites of the tumor [6], co-occurrence rates between documents and words [7], conditional probabilities of documents and semantic topics [2], semantic connections between images and text [3], or functional relations between genes and protein production [3]. It should be mentioned that these associations are derived using domain expertise and dataset analysis, without necessarily relying on the heterogeneous objects having explicit, known, or comparable feature representations.

Although co-embedding methods have been primarily proposed in some recent works [1], [2], [3], [8], [9], [10], [11], co-embedding has also been indirectly studied in earlier research on visualization, co-clustering, and exploratory data analysis [6], [12], [13], [14]. Perhaps the earliest related work is correspondence analysis (CA), dating back to 1933 [6], [15], whose goal was to display the rows and columns of a contingency table in a low-dimensional space. Co-clustering is another quite popular methodology that also processes heterogeneous objects, but its goal is very different in that it aims at the automatic grouping of heterogeneous samples into different clusters [4], [7], [16], [17], [18], [19], [20], [21]. Latent semantic indexing (LSI) [22] is a popular embedding method for document analysis in information retrieval which, because it can be used to compare between documents and words [14], can be treated as a co-embedding method.

Differently from the above approaches that are based on matrix factorizations, another group of co-embedding algorithms is based on statistical models; these algorithms are perhaps the only ones explicitly developed for the computation of co-embeddings [2], [3], [8], [9], [10], [11], [23]. They treat the relations between heterogeneous samples as empirical joint or conditional distributions and maximize the log likelihood of the underlying models to find the optimal co-embeddings. The efficiency of this type of approach, however, is greatly dependent on the number of samples and the dimensionality of the resulting co-embeddings.

A notable problem in this area is the evaluation and comparison of the computed co-embeddings between the proposed algorithms. Because co-embedding is a comparatively new research topic, there is no methodological study in the existing literature on how to evaluate the quality of the obtained co-embeddings, apart from certain domain-related evaluations designed for specific applications [2], [3].

In this work, we attempt to systematically study the co-embedding problem and the common elements between existing co-embedding algorithms and indirectly related methods.


We propose a novel algorithm, referred to as automatic co-embedding with adaptive shaping (ACAS), which is based on modeling the entire similarity matrix through a general formulation of co-embedding. The algorithm is capable of capturing the global structure of the problem with a much reduced number of model variables, and its optimizing criterion utilizes quantized relation information. Additionally, we propose two generic schemes for the qualitative analysis and quantitative assessment of the resulting co-embeddings. We conduct experiments using multiple datasets and a real-world application to document-topic visualization for demonstration, evaluation, and algorithmic comparison. Results show that ACAS performs better than other existing methods.

The organization of this paper is as follows: Section 2 gives a succinct but thorough review of the state-of-the-art co-embedding techniques. For completeness, it also examines techniques indirectly relevant to co-embedding. Sections 3 and 4 analyze the principal contributions of this work as previously described. Section 5 reports the experimental results and comparative analyses, while Section 6 concludes the work and outlines future directions.

2 RELATED APPROACHES

In this work, we focus on the most common co-embedding based on two types of objects. Assume the only available information is an $n \times m$ relation (similarity) matrix between two groups $\mathcal{X}$ and $\mathcal{Y}$ of heterogeneous samples $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{m}$, respectively, denoted by $R = [r_{ij}]$ with nonnegative entries. The goal of co-embedding is to compute $k$-dimensional embeddings for all samples in both groups so that the similarities between the embedded points indicate the between-group associations between samples from $\mathcal{X}$ and $\mathcal{Y}$, and also reflect the implicit within-group information that may be recoverable. The output is one $n \times k$ embedded feature matrix $Z_x = [z_{ij}^{(x)}]$ for the samples in group $\mathcal{X}$ and one $m \times k$ matrix $Z_y = [z_{ij}^{(y)}]$ for the samples in $\mathcal{Y}$. In the resulting common space, we will use $z_i^{(x)} = [z_{i1}^{(x)}, z_{i2}^{(x)}, \ldots, z_{ik}^{(x)}]^T$ to denote the embedded feature vector for $x_i$ and $z_j^{(y)} = [z_{j1}^{(y)}, z_{j2}^{(y)}, \ldots, z_{jk}^{(y)}]^T$ the embedding for $y_j$. In the following paragraphs, we summarize various existing techniques which are either ones that have been explicitly proposed for co-embedding (Section 2.1) or ones that are indirectly related as they process heterogeneous objects (Section 2.2).

2.1 Existing Co-Embedding Methods

2.1.1 Statistical Model-Based Co-Embedding

This is a group of similar co-embedding algorithms [2], [3], [8], [9], [10], [11], [23] based on a statistical model, which interprets the $ij$th element of the relation matrix as the empirical joint or conditional distribution of the sample pair $(x_i, y_j)$ and assumes it is proportional to the exponential of the squared euclidean distance between the sought co-embeddings of $x_i$ and $y_j$. For instance, the joint distribution can be modeled by

$$p(x_i, y_j) = \frac{1}{N_{ij}(R, Z_x, Z_y)} \exp\left(-\big\|z_i^{(x)} - z_j^{(y)}\big\|_2^2\right), \qquad (1)$$

where different versions of $N_{ij}(R, Z_x, Z_y)$ for normalization can be found in [2], [3]. Each component of the co-embeddings becomes a model variable and is trained by maximizing the log likelihood of the observed sample pairs.

By incorporating (1) into the log likelihood computed by $\sum_{ij} r_{ij} \log p(x_i, y_j)$, the corresponding optimization problem becomes

$$\max_{\substack{Z_x \in \mathbb{R}^{n \times k} \\ Z_y \in \mathbb{R}^{m \times k}}} \; -\sum_{i=1}^{n}\sum_{j=1}^{m} r_{ij}\left(\big\|z_i^{(x)} - z_j^{(y)}\big\|_2^2 + \log N_{ij}(R, Z_x, Z_y)\right), \qquad (2)$$

where a total of $(n+m) \times k$ variables $z_{ij}^{(x)}$ and $z_{ij}^{(y)}$ are to be optimized.

One version of the statistical co-embedding, the co-occurrence data embedding (CODE) [3], [8], [23], computes the normalization term $N_{ij}(R, Z_x, Z_y)$ by

$$N_{ij}(R, Z_x, Z_y) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\sum_{t=1}^{m} r_{it}\right)\left(\sum_{t=1}^{n} r_{tj}\right)\exp\left(-\big\|z_i^{(x)} - z_j^{(y)}\big\|_2^2\right)}{\left(\sum_{t=1}^{m} r_{it}\right)\left(\sum_{t=1}^{n} r_{tj}\right)}. \qquad (3)$$
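To make the shape of (2) and (3) concrete, the following is a minimal numpy sketch (not the code released by the CODE authors) that evaluates the objective for a candidate pair of co-embedding matrices; the function name is ours, and strictly positive row and column marginals of $R$ are assumed.

```python
import numpy as np

def code_objective(R, Zx, Zy):
    """Evaluate the CODE log-likelihood objective (2) with the normalization (3).
    R is n x m, Zx is n x k, Zy is m x k."""
    # squared euclidean distances between every (x_i, y_j) co-embedding pair
    D2 = ((Zx[:, None, :] - Zy[None, :, :]) ** 2).sum(axis=2)     # n x m
    row = R.sum(axis=1)                                           # row marginals
    col = R.sum(axis=0)                                           # column marginals
    # the numerator of (3) is shared by every (i, j); only the denominator varies
    const = (np.outer(row, col) * np.exp(-D2)).sum()
    log_N = np.log(const) - np.log(np.outer(row, col))
    return -(R * (D2 + log_N)).sum()                              # quantity to be maximized
```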

Another variation, the parametric embedding [2], [9], uses a different normalization and adds the two regularization terms $\lambda_x \sum_{i=1}^{n} \|z_i^{(x)}\|_2^2$ and $\lambda_y \sum_{j=1}^{m} \|z_j^{(y)}\|_2^2$ to (2). These terms keep the Hessian of the objective function positive definite, so that the employed iterative training procedure can reach in each iteration a global optimum for the co-embeddings of one group when those of the other group remain fixed.

The above methods were initially proposed for analyzing co-occurrence rates [3], [8], [10], [11], [23] or class conditional probabilities [2], [9]. This is related to each element of $R$ being interpreted as the empirical distribution of a sample pair, which requires the input relation matrix to always satisfy either the condition $\sum_{i=1}^{n}\sum_{j=1}^{m} r_{ij} = 1$ or $\sum_{j=1}^{m} r_{ij} = 1$. Thus, given an arbitrary relation matrix with nonnegative entries, to apply these algorithms one needs to normalize the input matrix by either the total sum of its elements or the sum of each row.

2.1.2 Correspondence Analysis

CA [6], [12], [13] is a statistical visualization method, which displays the associations between the row and column objects of a contingency table (equivalent to a relation matrix with nonnegative entries) in a low-dimensional space. CA is perhaps the earliest approach for computing co-embeddings. In the embedding space, the euclidean distances between the row (column) objects are equal to the $\chi^2$-distances between the row (column) vectors of the table. To achieve this, given an input $R$, CA first scales the matrix by the total sum of all its elements, so that $\sum_i \sum_j r_{ij} = 1$, then computes the two embedded feature matrices as

$$Z_x = D_x^{-\frac{1}{2}} U_k \Lambda_k, \qquad (4)$$

$$Z_y = D_y^{-\frac{1}{2}} V_k \Lambda_k, \qquad (5)$$


where $D_x$ denotes the $n \times n$ diagonal matrix formed by the vector of row sums of $R$ and $D_y$ the $m \times m$ diagonal matrix formed by the column sums. $U_k$, $V_k$, and $\Lambda_k$ are correspondingly the matrices with the left and right singular vectors and the singular values corresponding to the 2nd to $(k+1)$th largest singular values of the matrix $D_x^{-\frac{1}{2}}\left(R - R\,1_{m \times n}\,R\right)D_y^{-\frac{1}{2}}$ [13], [24], [25] or of $D_x^{-\frac{1}{2}} R D_y^{-\frac{1}{2}}$ [6]. Both matrix forms lead mathematically to equivalent output from their singular value decomposition (SVD). Other variations of CA have been proposed for computing the SVD of different versions of matrices constructed from $R$ [26].

2.2 Indirectly Relevant Methods

Although the following methods were not proposed for the co-embedding problem, they are closely related, and it is possible to use them to calculate co-embeddings. In this section, we briefly describe such algorithms and reveal their connections with co-embedding.

2.2.1 Latent Semantic Indexing

LSI is widely used for analyzing the co-occurrence counts between a set of documents and a set of terms (e.g., words) [22]. LSI translates the documents into a set of document concept vectors and the terms into a set of term concept vectors, by computing the SVD of the co-occurrence matrix between documents and terms. Similarities between documents and similarities between terms can be evaluated by comparing the corresponding concept vectors within their own space. It was suggested in [14] that it is also possible to compare the similarities between document and term concept vectors directly; this is referred to as latent semantic mapping (LSM). Thus, it is possible to generally use LSI/LSM for computing co-embeddings. Given the relation matrix $R$ between samples from two groups, LSI computes the two co-embedding matrices as

$$Z_x = U_k, \qquad (6)$$

$$Z_y = V_k, \qquad (7)$$

where $U_k$ and $V_k$ are the top $k$ left and right singular vector matrices of $R$ corresponding to the largest $k$ singular values.
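A direct numpy sketch of (6) and (7) (function name ours) is:

```python
import numpy as np

def lsi_coembed(R, k=2):
    """LSI/LSM co-embeddings of (6) and (7): top-k singular vectors of R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k], Vt[:k].T
```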

2.2.2 Co-Clustering Techniques

Co-clustering algorithms have recently gained popularity for processing heterogeneous data. Some of these algorithms can be used or extended to solve the co-embedding problem. In the following, we provide two such examples.

Bipartite graph partitioning (BGP). Bipartite graph partitioning [7], [16], [17], [20], [21] has been proposed to simultaneously cluster two groups of heterogeneous samples into different clusters, based only on the $n \times m$ relation matrix $R$ between the samples. The key idea is to imply an undirected bipartite graph for samples from both groups and construct its $(n+m) \times (n+m)$ adjacency matrix $W = [w_{ij}]$ with the structure

$$W = \begin{bmatrix} 0 & R \\ R^T & 0 \end{bmatrix}. \qquad (8)$$

Then, BGP computes an $(n+m)$-dimensional partition vector from $W$, using a relaxation of the normalized cut [7], [16]. Letting $u_2$ and $v_2$ denote the second left and right singular vectors of $D_x^{-\frac{1}{2}} R D_y^{-\frac{1}{2}}$, the $(n+m) \times 1$ partition vector is $[u_2^T D_x^{-\frac{1}{2}}, v_2^T D_y^{-\frac{1}{2}}]^T$. By treating this vector as the new features, a $k$-means clustering algorithm is then used to group all the $n+m$ samples into different clusters [7]. This is equivalent to applying spectral clustering analysis [27] to a bipartite graph using a one-dimensional embedding. However, we can also view this vector as the one-dimensional co-embedding for the two different groups. That is, by letting $z_x$ and $z_y$ denote the two co-embedding vectors for samples from groups $\mathcal{X}$ and $\mathcal{Y}$, respectively, we can assume

$$z_x = D_x^{-\frac{1}{2}} u_2, \qquad (9)$$

$$z_y = D_y^{-\frac{1}{2}} v_2. \qquad (10)$$

Of course, it is possible to extend the above to $k$-dimensional co-embeddings by using

$$Z_x = D_x^{-\frac{1}{2}} U_k, \qquad (11)$$

$$Z_y = D_y^{-\frac{1}{2}} V_k, \qquad (12)$$

where $U_k$ and $V_k$ are the left and right singular vector matrices of $D_x^{-\frac{1}{2}} R D_y^{-\frac{1}{2}}$, corresponding to the 2nd to $(k+1)$th largest singular values.
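The $k$-dimensional extension of (11) and (12) can be sketched as follows (an illustrative implementation, not the one used in [7]; zero row or column sums are assumed absent):

```python
import numpy as np

def bgp_coembed(R, k=2):
    """k-dimensional BGP co-embeddings of (11) and (12)."""
    dx, dy = R.sum(axis=1), R.sum(axis=0)
    S = R / np.sqrt(np.outer(dx, dy))             # D_x^{-1/2} R D_y^{-1/2}
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    U, V = U[:, 1:k+1], Vt[1:k+1].T               # drop the trivial leading pair
    Zx = U / np.sqrt(dx)[:, None]                 # D_x^{-1/2} U_k
    Zy = V / np.sqrt(dy)[:, None]                 # D_y^{-1/2} V_k
    return Zx, Zy
```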

Co-clustering by collective factorization on related matrices (CFRM). Another co-clustering technique, referred to as collective factorization on related matrices [18], [19], can also be used for computing co-embeddings. It defines two confidence matrices to enable the optimal reconstruction of the relation matrix $R$ in terms of an error based on the Frobenius norm:

$$\min_{\substack{U \in \mathbb{R}^{n \times k},\; U^T U = I_{k \times k} \\ V \in \mathbb{R}^{m \times k},\; V^T V = I_{k \times k} \\ \lambda_{ij} = 0 \text{ for } i \neq j}} \|R - U \Lambda V^T\|_F^2. \qquad (13)$$

Elements of the two orthogonal matrices $U$ and $V$ indicate the confidence degrees of samples from groups $\mathcal{X}$ and $\mathcal{Y}$ belonging to $k$ different clusters, respectively, and the diagonal matrix $\Lambda$ represents a compact version of $R$ [18], [19]. The optimization of (13) can be obtained by computing the SVD of $R$. The confidence values stored in the two orthogonal matrices can be viewed as the new features characterizing each sample, which in a way embed samples from the two groups into a common space of dimensionality $k$. The embedded feature matrices generated by CFRM are the same as the ones obtained by (6) and (7).

3 PROPOSED METHODS

3.1 Generic Co-Embedding Model Formulations

In this section, we formulate a generic model for obtaining the co-embeddings between two groups $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and $\mathcal{Y} = \{y_j\}_{j=1}^{m}$, given their $n \times m$ similarity matrix $R$. For notational convenience, we let $Z = [z_{ij}]$ denote the $(n+m) \times k$ embedding matrix containing all samples from both groups, with $Z^T = [Z_x^T, Z_y^T]$. $z_{ij}$ denotes the $j$th dimension of the $i$th $k$-length co-embedding vector $z_i = [z_{i1}, z_{i2}, \ldots, z_{ik}]^T$.


The primary motivation underlying all models in this work is to treat all $n+m$ samples from both groups equally, construct an $(n+m) \times (n+m)$ composite similarity matrix $W$ from the partial similarity information in $R$, and then obtain the co-embeddings $Z$ from $W$ using standard embedding methods for homogeneous data. Since $R$ contains only between-group similarities, it is possible to assume implicit within-group information by considering the relation values $r_{ij}$ between the sample from one group and all the samples from the other group as its features. That is, we can use $R$ as the feature matrix of the $n$ samples from group $\mathcal{X}$, and simultaneously, $R^T$ as the feature matrix of the $m$ samples from $\mathcal{Y}$. These features can then be passed through some similarity function to obtain approximations of the $n \times n$ and $m \times m$ within-group similarity matrices for $\mathcal{X}$ and $\mathcal{Y}$, respectively. By combining the provided between-group and the estimated within-group similarities, we can compose the following $(n+m) \times (n+m)$ similarity matrix between all $n+m$ samples:

$$W = \begin{bmatrix} F(R) & R \\ R^T & G(R^T) \end{bmatrix}. \qquad (14)$$

$F(R)$ is the $n \times n$ within-group similarity matrix for the $n$ samples in $\mathcal{X}$ based on a similarity function $F$ and the input features $R$, while $G(R^T)$ is the $m \times m$ within-group similarity matrix for the $m$ samples in $\mathcal{Y}$ based on a given similarity function $G$ (this can differ from $F$) and the features $R^T$.

The incorporation of the within-group similarities $F(R)$ and $G(R^T)$ is necessary to form the composite similarity matrix $W$, which will enable the use of the classic (homogeneous) embedding generation methods discussed later. The idea of using standard embedding methods conforms with the core idea of obtaining co-embeddings, that is, to circumscribe both $\mathcal{X}$ and $\mathcal{Y}$ type objects within one common space where similarities are preserved. This assumes that objects are treated equally, that is, they are commonly described by a single information source such as $W$. The need to define the underlying similarities $F$ and $G$ is not a restrictive burden, but gives more flexibility and expressive ability to the model, because different $F$ and $G$ will lead to very different distributions of the resulting co-embeddings. It has to be noted that assuming in the current model that $F(R) = 0_{n \times n}$ and $G(R^T) = 0_{m \times m}$ does not simplify the model by avoiding the definition of within-group similarities. Zero values affect (detrimentally in all experiments we performed) the generated co-embeddings. This is because such values cannot balance the relative positioning between within-group and between-group objects well and may conflict with the information implied in $R$. The mixing of reasonably well approximated within-group information and the available heterogeneous relation values conforms much better with the underlying embedding formulation we adopt. With the proposed model, the composite $W$ can be directly subjected to the computation of the co-embeddings $Z$, using the two simple ways we describe below.

The first method is based on standard homogeneous embedding techniques [27], [28] and attempts to preserve the proximity structure represented by $W$ in the embedding space. The classical way to proceed is to minimize the sum of penalized pairwise distances between the embedded points, with the penalizing weights being the entries of $W$. Here, we equivalently maximize similarities and solve the following optimization problem:

$$\max_{\{z_i \in \mathbb{R}^k\}_{i=1}^{n+m}} \sum_{i,j=1}^{n+m} w_{ij}\, z_i^T z_j \;\Leftrightarrow\; \max_{Z \in \mathbb{R}^{(n+m) \times k}} \operatorname{tr}\left[Z^T W Z\right], \qquad (15)$$

where the $k$ different columns of $Z$ can be obtained by enforcing an orthogonality constraint such as $Z^T Z = I_{k \times k}$. The optimal solution of (15) is the $k$ eigenvectors of $W$ corresponding to the largest $k$ eigenvalues.

A second method is to use a template similar to the one adopted by multidimensional scaling (MDS) [29]. This minimizes a reconstruction error based on the Frobenius norm:

$$\min_{\substack{Z \in \mathbb{R}^{(n+m) \times k} \\ \operatorname{rank}(Z Z^T) = k}} \|W - Z Z^T\|_F^2, \qquad (16)$$

which drives the dot products between the embeddings closer to the entries of $W$. Based on the Eckart-Young theorem for low-rank matrix approximation [30], the optimal solution of (16) is $Z = P_k \Lambda_k^{\frac{1}{2}}$, where $\Lambda_k$ and $P_k$ are the eigenvalue and eigenvector matrices of $W$ corresponding to the largest $k$ eigenvalues.

The similarity measures $F$ and $G$ used to estimate the two within-group matrices $F(R)$ and $G(R^T)$ for (14) are generic and can be set according to likely data properties and problem assumptions where available (e.g., if $R$ is a term-document matrix, a co-occurrence-based similarity, such as the dot product or cosine, would be suitable). General possible options for $F$ and $G$ include the cosine similarity, the Gaussian or polynomial kernels, Pearson's or Kendall's correlation coefficient, the Minkowski distance, various divergence measures, and so on, as well as functions that can model aggregate pairwise proximity information based on local neighborhood graphs [28]. Regardless of their choice, however, both suggested methods for computing $Z$ from $W$ given by (15) and (16) involve the eigendecomposition of the $(n+m) \times (n+m)$ matrix $W$. This makes this model computationally impractical for real-world situations with large object numbers $n$ and $m$. We deal with this issue in the next section.
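For moderate $n$ and $m$, the generic model of (14)-(16) can nevertheless be prototyped directly. The sketch below (function and parameter names are our choices) builds the composite $W$ for user-supplied $F$ and $G$, defaulting to the dot products of (17) and (18), and returns either the proximity-preserving or the reconstruction-based embeddings.

```python
import numpy as np

def generic_coembed(R, k=2, F=None, G=None, template="reconstruction"):
    """Co-embeddings from the composite similarity matrix W of (14),
    using either the template of (15) or that of (16)."""
    n, m = R.shape
    Fr = R @ R.T if F is None else F(R)            # within-group similarities for X
    Gr = R.T @ R if G is None else G(R.T)          # within-group similarities for Y
    W = np.block([[Fr, R], [R.T, Gr]])
    vals, vecs = np.linalg.eigh(W)                 # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]               # indices of the largest k
    P, lam = vecs[:, idx], vals[idx]
    if template == "proximity":                    # solution of (15)
        Z = P
    else:                                          # solution of (16), Z = P_k Lambda_k^{1/2}
        Z = P * np.sqrt(np.clip(lam, 0.0, None))   # negative eigenvalues clipped
    return Z[:n], Z[n:]
```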

3.2 Simplified Models with Accelerated Training

For reasons of computational efficiency and algebraic convenience, we can specialize the previous model to the use of the dot product in the estimation of the composite similarity matrix $W$, according to

$$F(R) = R R^T, \qquad (17)$$

$$G(R^T) = R^T R. \qquad (18)$$

The use of the dot product can reduce the eigendecomposition of $W$ to the much more efficient SVD of the $n \times m$ matrix $R$.

First, we notice that $W$ in (14) can be written as the sum of two matrices $\bar{W}$ and $\tilde{W}$ according to

$$W = \bar{W} + \tilde{W} = \begin{bmatrix} 0 & R \\ R^T & 0 \end{bmatrix} + \begin{bmatrix} R R^T & 0 \\ 0 & R^T R \end{bmatrix}. \qquad (19)$$


Then, by letting $P_k$ and $\bar{\Lambda}_k$ denote the eigenvector and eigenvalue matrices of $\bar{W}$, we split the eigenvector matrix into one $n \times k$ matrix $P_x$ and one $m \times k$ matrix $P_y$, as $P_k = [P_x^T, P_y^T]^T$. Then, the eigenequation of $\bar{W}$ becomes

$$\bar{W} P_k = \begin{bmatrix} 0 & R \\ R^T & 0 \end{bmatrix}\begin{bmatrix} P_x \\ P_y \end{bmatrix} = \begin{bmatrix} P_x \\ P_y \end{bmatrix}\bar{\Lambda}_k, \qquad (20)$$

which leads to the equations

$$R P_y = P_x \bar{\Lambda}_k, \qquad (21)$$

$$R^T P_x = P_y \bar{\Lambda}_k. \qquad (22)$$

These are precisely the equations that define the $k$-rank SVD approximation

$$R = U_k \Sigma_k V_k^T, \qquad (23)$$

where $U_k$, $V_k$, and $\Sigma_k$ contain the left and right singular vectors, and the singular values, of $R$. In terms of (23), we have $P_x = U_k$, $P_y = V_k$, and $\bar{\Lambda}_k = \Sigma_k$.

Subsequently, by letting $\tilde{P}_k = [\tilde{P}_x^T, \tilde{P}_y^T]^T$ and $\tilde{\Lambda}_k$ denote the eigenvector and eigenvalue matrices of $\tilde{W}$, the eigenproblem

$$\tilde{W}\tilde{P}_k = \begin{bmatrix} R R^T & 0 \\ 0 & R^T R \end{bmatrix}\begin{bmatrix} \tilde{P}_x \\ \tilde{P}_y \end{bmatrix} = \begin{bmatrix} \tilde{P}_x \\ \tilde{P}_y \end{bmatrix}\tilde{\Lambda}_k \qquad (24)$$

gives rise to the following two equations:

$$R R^T \tilde{P}_x = \tilde{P}_x \tilde{\Lambda}_k, \qquad (25)$$

$$R^T R \tilde{P}_y = \tilde{P}_y \tilde{\Lambda}_k. \qquad (26)$$

However, these can be seen to be the eigendecompositions of $R R^T$ and $R^T R$, which are known to be simultaneously computable from the SVD of $R$, with the solutions $\tilde{P}_x$ and $\tilde{P}_y$ being, respectively, the left and right singular vectors of the SVD. Specifically, in terms of (23), we have $\tilde{P}_x = U_k$, $\tilde{P}_y = V_k$, and $\tilde{\Lambda}_k = \Sigma_k^2$.

Therefore, we can now add (20) and (24) together to obtain

$$\begin{bmatrix} R R^T & R \\ R^T & R^T R \end{bmatrix}\begin{bmatrix} U_k \\ V_k \end{bmatrix} = \begin{bmatrix} U_k \\ V_k \end{bmatrix}\left(\Sigma_k + \Sigma_k^2\right), \qquad (27)$$

which is exactly the eigenequation of $W$ in (19), using the eigenvector and eigenvalue matrices $[U_k^T, V_k^T]^T$ and $\Sigma_k + \Sigma_k^2$, respectively.

Until this point, we have shown that the eigenvectors and eigenvalues of $W$ can be simply computed from the SVD of $R$ through (23) and (27). To eventually generate the co-embeddings for the two methods described in Section 3.1, we do the following: For the homogeneous embedding method, we use directly $Z_x = U_k$ and $Z_y = V_k$. This is numerically identical to (6) and (7) of the LSI/LSM. However, in this case, we provide a detailed justification of why these equations are valid for the computation of co-embeddings, while LSI does not. For the MDS-based reconstruction, the embeddings require an extra scaling using the eigenvalues of $W$ according to

$$Z_x = U_k\left(\Sigma_k + \Sigma_k^2\right)^{\frac{1}{2}}, \qquad (28)$$

$$Z_y = V_k\left(\Sigma_k + \Sigma_k^2\right)^{\frac{1}{2}}. \qquad (29)$$

Henceforth, we will refer to the first of these basic methods as co-embeddings based on the proximity preserving template (CPPT), and the second one as co-embeddings based on the optimal reconstruction template (CORT).
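Both templates reduce to one SVD of $R$; a compact sketch (function name ours) is:

```python
import numpy as np

def cppt_cort(R, k=2):
    """CPPT and CORT co-embeddings from the SVD of R, following (23) and (27)-(29)."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    U, s, V = U[:, :k], s[:k], Vt[:k].T
    cppt = (U, V)                                  # CPPT: eigenvectors of W, as in LSI/LSM
    scale = np.sqrt(s + s ** 2)                    # (Sigma_k + Sigma_k^2)^{1/2}
    cort = (U * scale, V * scale)                  # CORT: (28) and (29)
    return cppt, cort
```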

It has to be noted that both CPPT and CORT rely on restricting the basic generic model of (14) to use dot products for both within-group similarity functions $F$ and $G$, as in (17) and (18), to provide drastically accelerated model training through SVD operations on smaller matrices. However, the dot product cannot be the best within-group similarity measure for all datasets, and this may result in cases of suboptimal co-embeddings. Therefore, to compensate for this inflexibility, but also to create a model that is capable of robustly recovering a diverse range of co-embeddings in a parametric manner, we propose several enhancements and the final working model in the next section.

3.3 The ACAS Model

This section introduces the proposed ACAS algorithm. It is based on directly extending the previous model, first by enhancing it with two scaling mechanisms to control the distribution of the resulting co-embeddings, and second by introducing a method for the robust training of the model.

3.3.1 Model Construction and Scaling Mechanisms

The model is extended by incorporating the following two R-scaling and UV-scaling enhancements:

- R-scaling: Combining (14), (17), and (18) leads to the simple dot product-based composite matrix $W$ created directly from the input relation matrix $R$ as

$$W = \begin{bmatrix} R R^T & R \\ R^T & R^T R \end{bmatrix}. \qquad (30)$$

Instead, we propose the use of the scaled relation matrix

$$\tilde{R} = S_x^{-\frac{1}{2}}\, R\, S_y^{-\frac{1}{2}} \qquad (31)$$

to obtain the composite similarity matrix

$$W = \begin{bmatrix} \tilde{R}\tilde{R}^T & \tilde{R} \\ \tilde{R}^T & \tilde{R}^T\tilde{R} \end{bmatrix}. \qquad (32)$$

The scaling matrices $S_x$ and $S_y$ are $n \times n$ and $m \times m$ diagonal ones containing the aggregated relation information. The $i$th diagonal element $s_i^{(x)}$ of $S_x$ and the $j$th diagonal element $s_j^{(y)}$ of $S_y$ are controlled by the model variable $p$ according to

$$s_i^{(x)} = \begin{cases} 1, & \text{if } p = 0, \\ \left(\sum_{j=1}^{m} r_{ij}^p\right)^{\frac{1}{p}}, & \text{if } p \geq 1, \\ \max(r_{i1}, r_{i2}, \ldots, r_{im}), & \text{if } p = \infty, \end{cases} \qquad (33)$$


and

$$s_j^{(y)} = \begin{cases} 1, & \text{if } p = 0, \\ \left(\sum_{i=1}^{n} r_{ij}^p\right)^{\frac{1}{p}}, & \text{if } p \geq 1, \\ \max(r_{1j}, r_{2j}, \ldots, r_{nj}), & \text{if } p = \infty. \end{cases} \qquad (34)$$

- UV-scaling: This scaling mechanism incorporates the effect of $S_x$ and $S_y$ and the singular values $\Sigma_k$ from the SVD of (23) into the final computation of the co-embeddings $Z_x$ and $Z_y$. This is done through the scaling of the singular vectors $U_k$ and $V_k$ according to

$$Z_x = S_x^{-\alpha}\, U_k\, \Sigma_k^{\beta}, \qquad (35)$$

$$Z_y = S_y^{-\alpha}\, V_k\, \Sigma_k^{\beta}. \qquad (36)$$

The real-valued model variables $\alpha$ and $\beta$ parameterize the induced scaling by exponentiating each element of the diagonal matrices $S_x$, $S_y$, and $\Sigma_k$. This scheme can also be slightly modified to use multiple variables to model other methods. For example, instead of using a single $\beta$, we can use $k$ multiple ones $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_k]$ and define

$$Z_x = S_x^{-\alpha}\, U_k\, \mathrm{diag}\left(\sigma_1^{\beta_1}, \sigma_2^{\beta_2}, \ldots, \sigma_k^{\beta_k}\right), \qquad (37)$$

$$Z_y = S_y^{-\alpha}\, V_k\, \mathrm{diag}\left(\sigma_1^{\beta_1}, \sigma_2^{\beta_2}, \ldots, \sigma_k^{\beta_k}\right). \qquad (38)$$

In this case, CORT can be explicitly modeled, because it requires a different scaling variable $\beta_i$ for each diagonal element. This can be seen by setting $(\sigma_i + \sigma_i^2)^{\frac{1}{2}} = \sigma_i^{\beta_i}$, which corresponds to $\beta_i = \frac{\log(\sigma_i + \sigma_i^2)}{2\log(\sigma_i)}$.
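Putting the two mechanisms together, a minimal sketch of the ACAS embedding map for given model variables is shown below (function and argument names are ours; rows and columns of $R$ are assumed to have nonzero aggregated relations, and beta may be a scalar or a length-$k$ vector as in (37) and (38)).

```python
import numpy as np

def acas_coembed(R, p, alpha, beta, k=2):
    """ACAS co-embeddings combining the R-scaling of (31)-(34)
    with the UV-scaling of (35)-(38)."""
    # aggregated relation values per row and per column, eq. (33) and (34)
    if p == 0:
        sx, sy = np.ones(R.shape[0]), np.ones(R.shape[1])
    elif np.isinf(p):
        sx, sy = R.max(axis=1), R.max(axis=0)
    else:
        sx = (R ** p).sum(axis=1) ** (1.0 / p)
        sy = (R ** p).sum(axis=0) ** (1.0 / p)
    Rt = R / (np.sqrt(sx)[:, None] * np.sqrt(sy)[None, :])     # S_x^{-1/2} R S_y^{-1/2}
    U, sig, Vt = np.linalg.svd(Rt, full_matrices=False)
    U, sig, V = U[:, :k], sig[:k], Vt[:k].T
    # UV-scaling with exponentiated diagonal matrices, eq. (35)-(38)
    Zx = (sx ** -alpha)[:, None] * U * sig ** beta
    Zy = (sy ** -alpha)[:, None] * V * sig ** beta
    return Zx, Zy
```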

The above scaling mechanisms are inspired by the different commonalities of existing methods and endow ACAS with the parametric ability to generate a diverse range of co-embeddings. For the case of R-scaling, (33) and (34) are partly based on the $p$-norm because of its flexibility to measure the aggregated relations in different ways. For instance, we can obtain the Manhattan norm for $p = 1$, the euclidean for $p = 2$, or the maximum norm of the row or column vectors of $R$ for $p = \infty$. Different special cases for (31) include no scaling, i.e., $\tilde{R} = R$ when $p = 0$, or one type of CA for $p = 1$. Scaling $R$ is needed because the user-provided between-group relations are often subjected to noise, experimental inaccuracies, or subjectivity. The proximity between two samples may also be dependent on the marginal totals of the rows (or columns) of $R$, which can be viewed as the aggregated relations between $x_i$ and $\{y_j\}_{j=1}^{m}$ (or $y_j$ and $\{x_i\}_{i=1}^{n}$). This effect has been considered by CA [6], with the use of a scaled relation matrix $\tilde{R} = D_x^{-\frac{1}{2}} R D_y^{-\frac{1}{2}}$. The design justification for UV-scaling is based on the need for more flexibility in the shapes of the distributions of the embedded points with the least number of variables. Specifically, the multiplication with $S_x$ or $S_y$ scales the lengths of the embeddings to balance them in accordance with the aggregate relations, while the multiplication with $\Sigma_k$ scales each of the $k$ axes of the common embedding space with the singular values of the input relations. The exponents $\alpha$ and $\beta$ intensify or attenuate the values of the diagonal matrix elements. Also, both scalings are needed because of the fact that $W$ combines heterogeneous, incommensurate pieces of information, such as the provided between-group similarities and also the implied and latent within-group information. The training procedure allows the adjustment of the variables that control the model and combine the constituent pieces of $W$ better. The scaling mechanisms can also be viewed from the distance metric learning point of view. Such approaches, for example, [31], use transformations of the feature space to achieve measurements more suitable to the problem at hand.

In summary, the proposed ACAS is based on the model variables $p$, $\alpha$, and $\beta$ (or $\boldsymbol{\beta}$) to control the shape of the co-embeddings computed from the input relation matrix $R$. The incorporation of R-scaling and UV-scaling increases the model versatility and robustness, and compensates for the inflexibility caused by fixing $W$ to the use of the dot product for estimating the within-group similarities in (17) and (18). We recommend the user start with the simpler model of three variables ($p$, $\alpha$, and $\beta$), and expand it to $k+2$ ($p$, $\alpha$, and $\boldsymbol{\beta}$) or more, if necessary. However, the three-variable formulation was adequate for all of our experiments. The two scaling mechanisms provide some intuitive links, in terms of their underlying decompositions, between existing methods. This is because some existing co-embedding methods can be represented by the ACAS model by altering its model variables. For instance, $p = 0$, $\alpha = 0$, and $\beta = 0$ correspond to LSI and CFRM; $p = 1$, $\alpha = 0.5$, and $\beta = 0$ to BGP; $p = 1$, $\alpha = 0.5$, and $\beta = 1$ to CA; while the case where $p = 0$, $\alpha = 0$, and $\boldsymbol{\beta}$ as above represents CORT. We present a summary of different co-embedding algorithms, in terms of the matrix on which the SVD is performed and the co-embedding computations, in Table 1.

TABLE 1. A summary of different co-embedding methods, showing the SVD matrix and the calculation of co-embeddings. The methods proposed in this work are marked by *.

3.3.2 Quantized Model Training

Given the previous parametric model formulation of the co-embeddings of ACAS, the model variables have to be learned from the only input information, which is the $n \times m$ relation matrix $R$. The most direct procedure for training ACAS is to optimize its model variables by minimizing the dissimilarity (or maximizing the similarity) between the ground-truth input $R$ and an approximate relation matrix $R_z = [r_{ij}^{(z)}]$ reconstructed from the calculated co-embeddings $Z_x$ and $Z_y$. One simple way for reapproximating the relation values using the co-embeddings is the Gaussian kernel:


$$r_{ij}^{(z)} = \exp\left(-\frac{\big\|z_i^{(x)} - z_j^{(y)}\big\|_2^2}{\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\big\|z_i^{(x)} - z_j^{(y)}\big\|_2^2}\right), \qquad (39)$$

which is based on the euclidean distances of the between-group co-embeddings normalized by their total mean.
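A one-to-one numpy transcription of (39) (function name ours) is:

```python
import numpy as np

def reconstruct_relations(Zx, Zy):
    """Reapproximate the relation matrix from the co-embeddings with the
    mean-normalized Gaussian kernel of (39)."""
    D2 = ((Zx[:, None, :] - Zy[None, :, :]) ** 2).sum(axis=2)   # all between-group distances
    return np.exp(-D2 / D2.mean())                              # normalize by the total mean
```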

To evaluate the (dis)similarity between $R$ and $R_z$, there are many existing measures, including the Frobenius norm

$$D(R, R_z) = \|R - R_z\|_F, \qquad (40)$$

the alignment measure between two matrices [32]

$$A(R, R_z) = \frac{\operatorname{tr}\left(R R_z^T\right)}{\sqrt{\operatorname{tr}\left(R R^T\right)\operatorname{tr}\left(R_z R_z^T\right)}}, \qquad (41)$$

and the log likelihood [2], [3], of which one version is

$$L(R, R_z) = \sum_{ij}\left[\frac{r_{ij}}{\sum_{ij} r_{ij}} \log\left(\frac{r_{ij}^{(z)}}{\sum_{ij} r_{ij}^{(z)}}\right)\right]. \qquad (42)$$

Nevertheless, all these measures consider how well the exact values of each pair of corresponding elements from $R$ and $R_z$ match each other. Optimizing the model variables based on such measures often does not lead to a satisfactory match of the global structures between $R$ and $R_z$. The reason is that they are biased toward considering the differences of specific matrix elements, and this may have individual errors from mismatched elements contributing too much to the overall matrix dissimilarity. This is important because $R_z$ is only an approximate version of $R$ through the generated co-embeddings $Z_x$ and $Z_y$, which may not be of good quality when training starts. A robust way of reliably matching the two matrices to drive the training procedure stably is to smooth out these individual local errors and measure their dissimilarity on a more global scale, which does not take into account the direct element values but rather their relative magnitudes. To achieve this, we propose a new criterion based on a robust quantization procedure, which is capable of focusing at the global level of information by ignoring magnitude details, while it remains adequately sensitive to matrix discrepancies. The procedure could be envisaged as a matching between noisy images, where scale-space blurring or the use of rank statistics could increase reliability at different stages of the matching process.

The quantization employed relies on the estimation of the $q$-quantiles of all the values within $R$, denoted by $\mathbf{p} = [p_1, p_2, \ldots, p_{q-1}]^T$. These use a set of uniformly spaced cumulative probability values to divide the data into $q$ sets of equal size. Subsequently, the $q-1$ quantile values that mark the boundaries between the $q$ adjacent sets are used to quantize each $ij$th element of $R$ according to

$$Q(r_{ij}) = \begin{cases} 1, & \text{if } r_{ij} \leq p_1, \\ t, & \text{if } p_{t-1} < r_{ij} \leq p_t,\ t = 2, \ldots, q-1, \\ q, & \text{if } p_{q-1} < r_{ij}, \end{cases} \qquad (43)$$

or

$$Q(r_{ij}) = \begin{cases} \dfrac{\min_{ij}\{r_{ij}\} + p_1}{2}, & \text{if } r_{ij} \leq p_1, \\[2mm] \dfrac{p_{t-1} + p_t}{2}, & \text{if } p_{t-1} < r_{ij} \leq p_t,\ t = 2, \ldots, q-1, \\[2mm] \dfrac{\max_{ij}\{r_{ij}\} + p_{q-1}}{2}, & \text{if } p_{q-1} < r_{ij}. \end{cases} \qquad (44)$$

With (43), the original matrix values are ignored and replaced with the index of the quantile they belong to. Equation (44) is a soft version of (43), where the quantized values are closer to the original values. Both quantization functions are designed to produce and preserve a rough and global representation of the between-group relations, without relying on detailed error contributions that can be detrimental to the model fitting procedure.
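A sketch of both quantization rules is given below; the function name is ours, and the boundary vector is assumed to hold the $q-1$ interior quantiles of the matrix being quantized.

```python
import numpy as np

def quantize(M, boundaries, soft=False):
    """Quantize every entry of M with the q-1 boundary values in `boundaries`,
    using the hard rule (43) or the soft rule (44)."""
    q = len(boundaries) + 1
    idx = np.searchsorted(boundaries, M)          # bin index 0..q-1 per entry
    if not soft:
        return idx + 1                            # (43): quantile index 1..q
    reps = np.empty(q)                            # (44): representative value per bin
    reps[0] = (M.min() + boundaries[0]) / 2.0
    reps[-1] = (M.max() + boundaries[-1]) / 2.0
    for t in range(1, q - 1):
        reps[t] = (boundaries[t - 1] + boundaries[t]) / 2.0
    return reps[idx]

# e.g., for q = 10: boundaries = np.quantile(R, np.linspace(0, 1, 11)[1:-1])
```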

Finally, to train ACAS, we use the quantized versions of $R$ and $R_z$, denoted by $Q(R)$ and $Q(R_z)$, respectively. This is done by minimizing the difference between these two quantities via

$$\min_{p,\, \alpha,\, \beta\ (\text{or } \boldsymbol{\beta})}\ D_Q(R, R_z) = \|Q(R) - Q(R_z)\|_F. \qquad (45)$$

Training is based on finding the optimal variables $p$, $\alpha$, and $\beta$ (or $\boldsymbol{\beta}$) that result from solving the optimization in (45). The optimal co-embeddings $Z_x$ and $Z_y$ are computed directly from these variables and the input $R$, using (31) and (35), (36) (or (37), (38)). Of course, (45) can be used to directly recover the $(n+m) \times k$ elements of the co-embedding matrix $Z^T = [Z_x^T, Z_y^T]$. In this case, however, the large number of model variables to be optimized would lead to a more cumbersome optimization task.

Because of the simplicity of the ACAS model in terms of the small number of its variables, a very simple optimization procedure, such as grid search, a genetic algorithm, or simulated annealing, can be used. Therefore, despite the discrete character of the quantization-based objective function in (45), model training (as will be demonstrated in Section 5) is efficient, effective, and stable. It has to be noted that when the dimensionality $k$ of the co-embeddings cannot be decided by the user in advance, one can also view $k$ as another model variable to be optimized together with $p$, $\alpha$, and $\beta$ (or $\boldsymbol{\beta}$) using the same dissimilarity measure $D_Q$, leading to four model variables in total. Also, although the most time consuming operation of ACAS is the SVD, only $p$ modifies $R$ and therefore needs a new SVD (see R-scaling and (31)), while $\alpha$ and $\beta$ (or $\boldsymbol{\beta}$) are only applied to the output of the SVD (see UV-scaling and (35), (36)). This makes training very economical, since when a new model is tested by the training procedure, the SVD is only needed if $p$ has changed from previously tested models. Compared to CODE and its variations, ACAS is very efficient as it needs the optimization of only a very small set of ($\leq k+2$) shape controlling variables, instead of the $(m+n) \times k$ variables that include all the elements of the co-embedding matrix. This, and the fact that SVDs can be applied economically during training, makes ACAS more suitable for real-world situations where the total number ($n+m$) of heterogeneous objects and the number $k$ of required embedding components may be very high. In addition, having only a few variables makes the training process easier to handle than CODE and allows a more optimal solution to be obtained. The main stages of the proposed ACAS are outlined in Table 2.

TABLE 2. Description of the proposed ACAS, using an iterative search procedure for computing co-embeddings.
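As an illustration of the overall fitting loop of (45), the following sketch performs a coarse grid search; it reuses the acas_coembed, reconstruct_relations, and quantize sketches above, quantizes each matrix with its own quantiles (our reading of the criterion), and uses a much smaller grid than the one reported in Section 5.

```python
import numpy as np
from itertools import product

def train_acas(R, k=2, q=10):
    """Grid-search sketch of the quantized model fitting of (45)."""
    probs = np.linspace(0, 1, q + 1)[1:-1]
    QR = quantize(R, np.quantile(R, probs))
    best, best_score = None, np.inf
    for p in (0, 1, 2, np.inf):
        # only a change of p alters the SVD input; in practice one decomposition
        # per value of p can be reused for every (alpha, beta) pair
        for alpha, beta in product(np.arange(0.0, 2.01, 0.5), repeat=2):
            Zx, Zy = acas_coembed(R, p, alpha, beta, k)
            Rz = reconstruct_relations(Zx, Zy)
            score = np.linalg.norm(QR - quantize(Rz, np.quantile(Rz, probs)))
            if score < best_score:
                best, best_score = (p, alpha, beta), score
    return best, best_score
```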


4 PROPOSED CO-EMBEDDING EVALUATION SCHEMES

The quality assessment of the co-embeddings generated from a given algorithm is still an open question due to the nature of the problem. Although such an assessment can be tailored to the specific application at hand, as in [2], [3], the performance of a co-embedding algorithm based on domain-independent evaluation is also important, as it allows a more generic and comparative evaluation. To support this, we propose below various co-embedding quality assessment schemes, based on a set of benchmark datasets of labeled samples for multiclass single-label classification. Since there are many such publicly available datasets [33], and because it is not currently easy to obtain a lot of domain-specific relational data, the proposed schemes can be used by practitioners in the field to facilitate the design and evaluation of algorithms.

Given a dataset of labeled $d$-dimensional samples, we divide them into two disjoint groups, $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and $\mathcal{Y} = \{y_i\}_{i=1}^{m}$. We use $X = [x_{ij}]$ to denote the $n \times d$ feature matrix of the $n$ samples from group $\mathcal{X}$, and $Y = [y_{ij}]$ the $m \times d$ feature matrix of the $m$ samples from $\mathcal{Y}$. For convenience of description, we differentiate the classes the samples in $\mathcal{X}$ and $\mathcal{Y}$ belong to. Some classes have samples entirely in one group; we term these unique classes. For example, a sample $s$ of a unique class must satisfy $l(s) \notin \{l(y) : y \in \mathcal{Y}\}$ when $s \in \mathcal{X}$, and $l(s) \notin \{l(x) : x \in \mathcal{X}\}$ when $s \in \mathcal{Y}$ (by $l(s)$ here, we denote the label of sample $s$). In addition to the unique classes, there may be co-classes, with samples in both the $\mathcal{X}$ and $\mathcal{Y}$ groups. For example, if $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $l(x) = l(y)$, then $l(x)$ is a co-class. Let $c$ denote the total number of co-classes, $a$ the number of unique classes in $\mathcal{X}$, and $b$ the number of unique classes in $\mathcal{Y}$. Then, the samples in groups $\mathcal{X}$ and $\mathcal{Y}$ belong to a total number of $a+c$ and $b+c$ different classes, respectively. To finally test the algorithm, we generate an $n \times m$ input similarity matrix $R$ using the rows of $X$ and $Y$ and a similarity score based on the between-group distances. In the following sections, we propose different qualitative and quantitative schemes to analyze and evaluate the output of the co-embedding algorithms.

It has to be noted that the two groups $\mathcal{X}$ and $\mathcal{Y}$ are selected from the same classification dataset and therefore are of the same dimensionality and likely similar characteristics. Although this is not ideal, in the absence of multiple relational datasets with ground truth, the employed sets can produce $R$ matrices with diverse structures reflecting the various characteristics real relational sets may have. The quantitative scores proposed in Section 4.2 are designed to take advantage of the available multiclass samples and labels, examine different aspects of the data potentially existing in relational sets from different domains, and assess how the local, global, within-, and between-group properties and characteristics of the original data are preserved in the generated co-embedding space.

4.1 Qualitative Model Assessment

A way of visually exploring the final co-embeddings $Z_x$ and $Z_y$ is to observe the matching between the relations in the original space (using the given $R$) and the relations between the computed $Z_x$ and $Z_y$ in the common embedding space (using the reapproximated $R_z$, as described in Section 3.3.2). A direct visual comparison of these matrices in their raw forms is, however, very cumbersome because of the irregular visual appearance of relation matrices in general. Instead, if both $R$ and $R_z$ are permuted to group together similar relation values, they will form patterns containing similarly textured structures (such as blocks, curved, or other gradually varying formations, etc.) in areas with stronger between-group similarities and make visual inspection more obvious. Such matrix permutation problems are based on combinatorial optimization and are referred to as data seriation or sequencing [34], [35]. The advantage of using the permuted versions of $R$ and $R_z$ for comparison is linked to the fact that seriation is an exploratory analysis tool that emphasizes different properties of the relational data through various regularities and structures of its reordered relation matrix. In our case, when these properties are emphasized, it becomes straightforward to visually inspect not only the global overall matching between $R$ and $R_z$, but also the local discrepancies of the between-group matrix values.

In our case, the sought reordering can be done using the coVAT algorithm [36], which first expands an $n \times m$ matrix to a square matrix of side $n+m$ by approximating the distances between within-group samples. Subsequently, it applies a minimal spanning tree (MST) algorithm (as in single linkage clustering) and records the order in which row and column indices are obtained. In this work, we derive an efficient modification of this algorithm, presented in Table 3, which works directly on an $n \times m$ distance matrix $T$ (if $T$ is a similarity matrix, its values can simply be replaced by $\max_{ij}(t_{ij}) - t_{ij}$). The algorithm treats the matrix values as arc weights of a bipartite graph whose nodes represent the two groups of samples $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{m}$. Then, an MST-style search is initiated, but with the two groups treated separately.


In each step, the located minimum length arc is one that links visited nodes of one group to nonvisited nodes at the other side. Then, the single nonvisited node from one of the two groups is inserted in the corresponding set of flagged nodes $F_x$ or $F_y$. The output of this algorithm is two permutation vectors $\pi_x$ and $\pi_y$ for the row- and column-permutation, respectively.

TABLE 3. Pseudocode for the matrix permutation algorithm used to aid visualization.
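A direct (unoptimized) sketch of this reordering, under the assumptions above and with names of our choosing, is:

```python
import numpy as np

def bipartite_reorder(T):
    """MST-style reordering of an n x m distance matrix T, returning
    row and column permutation vectors."""
    n, m = T.shape
    i0, j0 = np.unravel_index(np.argmin(T), T.shape)     # globally shortest arc
    pi_x, pi_y = [i0], [j0]
    Fx, Fy = {i0}, {j0}
    while len(Fx) < n or len(Fy) < m:
        best, grow_row, node = np.inf, True, -1
        for i in Fx:                                     # visited row -> unvisited column
            for j in range(m):
                if j not in Fy and T[i, j] < best:
                    best, grow_row, node = T[i, j], False, j
        for j in Fy:                                     # visited column -> unvisited row
            for i in range(n):
                if i not in Fx and T[i, j] < best:
                    best, grow_row, node = T[i, j], True, i
        if grow_row:
            Fx.add(node); pi_x.append(node)
        else:
            Fy.add(node); pi_y.append(node)
    return np.array(pi_x), np.array(pi_y)
```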

To compare the two relation matrices $R$ and $R_z$, we first apply the above algorithm on the quantized version of $R_z$. Then, we use the resulting permutation vectors $\pi_x$ and $\pi_y$ to reorder both the quantized $R$ and $R_z$. The better the patterns and structures between the two matrices match, the better the resulting co-embeddings are. For example, the pair in Fig. 1t shows a much better match than the pair in Fig. 1d because of the more accordant co-cluster blocks in the former pair.

Fig. 1. Comparison of the structure similarity between the reordered versions of the quantized $R_z$ (recovered) and $R$ (original), on the left and right side of each subfigure, respectively, for all five algorithms and the synthetic datasets SD1, SD2, SD3, and SD4. The matrix differences $D_Q(R, R_z)$, evaluated with (45), are also shown parenthesized, with lower values indicating better matching.

4.2 Quantitative Model Assessment

In this section, we make use of the known feature matrices $X$ and $Y$ of the partitioned labeled datasets, as well as their class information, to quantitatively analyze the calculated co-embeddings $Z_x$ and $Z_y$. Obviously, the original features $X$ and $Y$ cannot be identical to $Z_x$ and $Z_y$, because the original feature information is not recoverable from the relation matrix $R$. Instead, by taking advantage of the known class label information, we can confirm that the co-embeddings preserve the main distribution characteristics, such as the class separabilities and relative positions between samples, also possessed by the original features. This observation enables us to propose a composite structural matching score (or sms) for testing a given co-embedding algorithm. This score combines three individual scores (described below), which examine different aspects of the structure preservation by the generated co-embeddings, and it is defined as

$$\text{sms} = \frac{1}{3}\left(\frac{S(Z_x, Z_y)}{S(X, Y)} + \frac{W(Z_x, Z_y)}{W(X, Y)} + \frac{B(Z_x, Z_y)}{B(X, Y)}\right). \qquad (46)$$

4.2.1 Overall Separability Score

The most intuitive way for assessing the distribution similarity between the co-embeddings and the original features is to compare the $k$-fold cross-validation classification performance between the $n+m$ samples in $[Z_x^T, Z_y^T]^T$ and the $n+m$ samples in $[X^T, Y^T]^T$, defined as $S(Z_x, Z_y)$ and $S(X, Y)$, respectively. We use a simple 1-nearest neighbor classifier to implement the classification. Higher values of their ratio, used in (46) and defined as the overall separability score (or S-score), indicate that the co-embeddings are able to better preserve the overall separability between the $a+b+c$ classes. However, this score does not capture detailed information on the within-group and between-group distribution of the data, which we explore below.
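A sketch of this separability measure using scikit-learn (the number of folds here is a placeholder choice) is:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def separability(A, B, labels_a, labels_b, folds=5):
    """Cross-validated 1-NN accuracy over the pooled samples of the two groups;
    used for both S(Zx, Zy) and S(X, Y) in the S-score ratio."""
    Z = np.vstack([A, B])
    y = np.concatenate([labels_a, labels_b])
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, Z, y, cv=folds).mean()

# S-score of (46): separability(Zx, Zy, lx, ly) / separability(X, Y, lx, ly)
```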

4.2.2 Within-Group Distribution Score

With this score, we wish to examine whether the class distributions and corresponding sample proximities in the original space are preserved in the embedding space, separately for the $a+c$ classes within group $\mathcal{X}$ and the $b+c$ classes within $\mathcal{Y}$. One way of achieving this is to assess how much the within-group fractions of friendly (i.e., from the same class) and proximal samples change.

We use $\nu NN(x, A)$ to denote the subset of the sample set $A$ with the $\nu$ nearest neighbors to $x$ (excluding self, when $x \in A$), and also $F(x, A) = \{s : s \in A \wedge l(s) = l(x)\}$ to denote the subset of $A$ with all friendly instances. Then, we can calculate the fraction

$$\varphi_\nu(x, A) = \frac{\big|\nu NN(x, A) \cap F(x, A)\big|}{\nu} \qquad (47)$$

of instances friendly to $x$ within the $\nu NN(x, A)$. We also average the effect of these fractions, using multiple neighborhood sizes varying from $\nu = 1$ up to $|F(x, A)|$, according to

$$\varphi(x, A) = \frac{1}{|F(x, A)|}\sum_{\nu=1}^{|F(x, A)|}\varphi_\nu(x, A). \qquad (48)$$

Finally, we use $\varphi$ to measure the combined class distributions within each group $\mathcal{X}$ and $\mathcal{Y}$ as

$$W(X, Y) = \frac{\sum_{s \in \mathcal{X}}\varphi(s, X) + \sum_{s \in \mathcal{Y}}\varphi(s, Y)}{n+m}. \qquad (49)$$

The higher the value $W(X, Y)$ is, the better the class separability for both groups in the original feature space. If the different classes are perfectly separated within both groups, $W(X, Y)$ reaches its upper bound of 1. It has to be noted that a high value of $S$ usually indicates a high value of $W$, but the reverse is not always true; also, a low $S$ does not necessarily indicate a low value for $W$. Equation (49) can be similarly applied to the co-embeddings to calculate $W(Z_x, Z_y)$. The ratio of these two quantities is incorporated in (46) and corresponds to the within-group distribution score (or W-score).
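The friendly-neighbor fractions of (47)-(49) translate directly into numpy; the sketch below (function names ours, euclidean distances assumed) computes $\varphi$ for within-group sets and the resulting $W(X, Y)$.

```python
import numpy as np

def phi(i, A, labels):
    """phi(x, A) of (48): average over nu = 1..|F(x, A)| of the fractions (47)."""
    d = np.linalg.norm(A - A[i], axis=1)
    d[i] = np.inf                                  # exclude the sample itself
    friends = labels == labels[i]
    friends[i] = False
    f = int(friends.sum())
    if f == 0:
        return 0.0
    hits = np.cumsum(friends[np.argsort(d)][:f])   # friendly counts in growing neighborhoods
    return float(np.mean(hits / np.arange(1, f + 1)))

def within_group_score(X, Y, labels_x, labels_y):
    """W(X, Y) of (49): mean phi over both groups, each taken separately."""
    wx = [phi(i, X, labels_x) for i in range(len(X))]
    wy = [phi(j, Y, labels_y) for j in range(len(Y))]
    return (np.sum(wx) + np.sum(wy)) / (len(X) + len(Y))
```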

4.2.3 Co-Class Distribution Score

With this score, we take into account another important characteristic for the comparison of co-embeddings with the original features. This is how well the co-class distributions and the between-group separability of patterns are preserved. We measure this by letting $X_i$ and $Y_i$ denote the sets of samples from groups $\mathcal{X}$ and $\mathcal{Y}$ belonging to the $i$th co-class (for $i = 1, \ldots, c$), respectively. We then use the $\varphi$ measure of the local friendly sample fractions to calculate


$$B_1(X, Y) = \frac{\sum_{i=1}^{c}\left[\sum_{s \in X_i}\varphi(s, Y) + \sum_{s \in Y_i}\varphi(s, X)\right]}{\sum_{i=1}^{c}|X_i \cup Y_i|}. \qquad (50)$$

As with $W$, the closer $B_1$ is to 1, the stronger the co-class separability that is indicated.

Another important characteristic of the co-classes is that inside each co-class the samples from groups $\mathcal{X}$ and $\mathcal{Y}$ are mixed with each other, and the underlying mixing structure should be preserved. To measure this, we calculate

$$B_2(X, Y) = \frac{\sum_{i=1}^{c}\sum_{s \in X_i \cup Y_i}\varphi(s, X_i \cup Y_i)}{\sum_{i=1}^{c}|X_i \cup Y_i|}. \qquad (51)$$

Only for (51), in the calculation of $\varphi$, do we temporarily change the class label for the samples of $X_i \cup Y_i$ from the same group as $s$, so that $F$ does not count them as friends, in order to measure its mixing profile with the friends from the other group only.


The more similarly the two groups of samples in each co-class are mixing, the higher the value $B_2$ obtains. $B_2$ measures a completely different property of the data structure from $B_1$, $W$, and $S$.

Finally, we combine both measures of co-class distribution via $B(X, Y) = \frac{1}{2}(B_1(X, Y) + B_2(X, Y))$. When there are no co-classes ($c = 0$), we simply set $B = 1$. For the co-embeddings, we calculate the analogous measure $B(Z_x, Z_y)$ in the same way as $B(X, Y)$, and their ratio, referred to as the co-class distribution score (or B-score), contributes to the overall sms score of (46).
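A corresponding sketch for the co-class measures of (50) and (51) is given below, reusing the phi helper from the previous sketch. The co-class label arrays and the choice of -1 to mark samples of unique classes are our own conventions introduced only for this illustration.

```python
import numpy as np  # phi(...) is assumed available from the previous sketch

def B_scores(X, cx, Y, cy, c):
    """Co-class distribution measures B1 of (50), B2 of (51), and B = (B1 + B2) / 2.
    cx / cy: co-class index per sample (0..c-1), -1 for samples of unique classes."""
    if c == 0:
        return 1.0, 1.0, 1.0                    # no co-classes: B is set to 1
    num1 = num2 = den = 0.0
    for i in range(c):
        Xi, Yi = X[cx == i], Y[cy == i]
        den += len(Xi) + len(Yi)                # |X_i union Y_i|
        # B1: friendly fractions measured across the two groups
        num1 += sum(phi(x, i, Y, cy) for x in Xi)
        num1 += sum(phi(y, i, X, cx) for y in Yi)
        # B2: mixing profile inside the pooled co-class; friends are restricted to
        # the *other* group, mimicking the temporary relabeling described above
        pool = np.vstack([Xi, Yi])
        grp = np.array([0] * len(Xi) + [1] * len(Yi))
        for k in range(len(pool)):
            num2 += phi(pool[k], 1 - grp[k], pool, grp, self_idx=k)
    B1, B2 = num1 / den, num2 / den
    return B1, B2, 0.5 * (B1 + B2)
```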

To summarize, the S-score assesses the overall structure compliance between the original data and the co-embeddings. It uses the known data labels to measure how the separability in both groups is disrupted by the co-embedding generation procedure. This is done by simply applying a 1-nearest neighbor classifier separately on both the original and co-embedding data, and contrasting the two resulting cross-validated classification accuracies. The S-score only measures the preservation of the global spatial structure. The W-score focuses on the preservation of the local proximities between original and co-embedding data. Specifically, it assesses how well the local neighborhood profiles, in terms of the nearest friends, persist from the original to the co-embedding space. This is measured for all patterns, but for each group separately. The B-score is similar, but it examines how the preservation of local nearest friend profiles is managed across groups. It achieves that by measuring how, for each pattern, its local neighborhood of co-class patterns from the other group is preserved in the co-embedding space. These three scores evaluate different aspects of the co-embedding generation process, and therefore, the sms provides an aggregate view of the quantitative assessment of an algorithm under testing.
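Putting the three scores together, the following sketch illustrates the aggregate sms, again reusing the helpers defined in the earlier sketches. Here, leave-one-out 1-NN accuracy pooled over the two groups is our stand-in for the cross-validated accuracy mentioned above, and the averaging of the three ratios follows the description of Table 5 in Section 5.1.3; the exact form of (46) is not reproduced here.

```python
import numpy as np  # W_score and B_scores are assumed available from the sketches above

def S_measure(X, lx, Y, ly):
    """Leave-one-out 1-NN accuracy computed per group and pooled
    (a stand-in for the cross-validated accuracy used by the S-score)."""
    def loo_1nn(A, labels):
        hits = 0
        for i in range(len(A)):
            d = np.linalg.norm(A - A[i], axis=1)
            d[i] = np.inf                              # leave the query out
            hits += int(labels[np.argmin(d)] == labels[i])
        return hits
    return (loo_1nn(X, lx) + loo_1nn(Y, ly)) / (len(X) + len(Y))

def sms(X, lx, cx, Y, ly, cy, Zx, Zy, c):
    """Average of the B-, W-, and S-score ratios: co-embeddings vs. original features."""
    b = B_scores(Zx, cx, Zy, cy, c)[2] / B_scores(X, cx, Y, cy, c)[2]
    w = W_score(Zx, lx, Zy, ly) / W_score(X, lx, Y, ly)
    s = S_measure(Zx, lx, Zy, ly) / S_measure(X, lx, Y, ly)
    return (b + w + s) / 3.0
```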

5 EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we compare ACAS (code available at http://pcwww.liv.ac.uk/~goulerma/software/acas.zip) with the four existing co-embedding algorithms, LSI, BGP, CA, and CODE, using relational information extracted from nine synthetic and real-world classification datasets and the qualitative and quantitative evaluation schemes of Section 4. We also test these five algorithms with a real-world application problem, namely a document-topic visualization task. The online code developed by the author of [3], [23] (http://ai.stanford.edu/~gal/code.html) is used to run CODE.

For ACAS, a grid search is used to optimize its three model variables, for which the search range is set by $p \in \{0, 1, 2, 3, \infty\}$, $\alpha \in [0, 5]$, and $\beta \in [0, 2]$, with steps of 0.1 for $\alpha$ and $\beta$. This relatively sparse grid was adequate for all experiments. Although it contains 5,355 models to be evaluated during training, only 5 of them require the time-consuming SVD operation, as only $p$ changes the SVD input. To achieve quantization for ACAS, (43) is used for the nine datasets, while (44) is used for topic visualization. This is because the latter problem requires a more fine-grained reconstruction of the between-group (document-topic) relation values, while for the nine relational datasets rougher approximations are adequate, as only class-structure preservation is sought. The value of the quantile parameter $q = 10$ is used to obtain the $q$-quantiles for all experiments. For the document-topic task and the four two-dimensional synthetic datasets (SDs), we fix the number of co-embedding dimensions $k$ to 2 to visualize the output. For the remaining five datasets, ACAS includes $k \in \{2, 3, \ldots, \mathrm{rank}(\tilde{R})\}$ in the model optimization procedure together with $p$, $\alpha$, and $\beta$. For LSI, BGP, and CA, $k$ is automatically adjusted to the value retaining 99 percent of the singular values. For CODE, the co-embedding dimensionality is selected from a set of integers $k \in \{2, 3, \ldots, \mathrm{rank}(R)\}$ based on the output likelihood.
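To make the search procedure concrete, the following is a schematic sketch of the grid search described above. The functions acas_embed and quantized_mismatch are hypothetical placeholders for the ACAS co-embedding generation of Section 3 and the objective of (45), neither of which is reproduced here, and the names alpha/beta follow the symbol reconstruction used in this section.

```python
import numpy as np
from itertools import product

def grid_search_acas(R, acas_embed, quantized_mismatch, k=2):
    """Exhaustive search over the ACAS shape variables p, alpha, beta.
    acas_embed(R, p, alpha, beta, k) -> (Zx, Zy, Rz) and quantized_mismatch(R, Rz)
    are assumed to be supplied by the caller."""
    p_grid = [0, 1, 2, 3, np.inf]
    alpha_grid = np.round(np.arange(0.0, 5.0 + 1e-9, 0.1), 1)   # step 0.1 in [0, 5]
    beta_grid = np.round(np.arange(0.0, 2.0 + 1e-9, 0.1), 1)    # step 0.1 in [0, 2]
    best_err, best_model = np.inf, None
    # 5 x 51 x 21 = 5,355 candidate models; only changing p alters the SVD input,
    # so in practice the decomposition can be cached per value of p
    for p, a, b in product(p_grid, alpha_grid, beta_grid):
        Zx, Zy, Rz = acas_embed(R, p, a, b, k)
        err = quantized_mismatch(R, Rz)        # criterion of (45)
        if err < best_err:
            best_err, best_model = err, (p, a, b, Zx, Zy)
    return best_model
```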

5.1 Evaluation with Classification Data

5.1.1 Dataset Setup

For each of the nine classification datasets, we extract a relation matrix (based on the procedure described in Section 4) to be used as the input to the co-embedding algorithms. For each SD, the X and Y groups of samples are generated from different bivariate normal distributions with different means and covariances to create distinct unique classes and co-classes. These datasets SD1, SD2, SD3, and SD4 are shown in Figs. 2a, 2b, 2c, and 2d, respectively. We also use the binary classification datasets Thyroid, Banana, Heart, and Diabetes from [37], and the multiclass dataset Wine from [33]. For each binary classification set, we divide its samples into X and Y, with both groups containing samples from both classes (i.e., $c = 2$ and $a = b = 0$). For the Wine dataset, X contains samples from all three classes of the set, while Y contains samples from two of the classes (i.e., $a = 1$, $b = 0$, and $c = 2$). Table 4 summarizes the relational information, as well as the values of $B(X, Y)$, $W(X, Y)$, and $S(X, Y)$ for each dataset.

5.1.2 Qualitative Comparison Using Synthetic Data

Fig. 1 examines the effectiveness of ACAS and compares it qualitatively with the other co-embedding algorithms using the four relation matrices computed from SD1, SD2, SD3, and SD4. For each dataset and each algorithm, the reordered quantized original matrix $R$ and the reapproximated matrix $R_z$ are shown (see Section 4.1). It can be seen that for all four datasets, the matrix pairs of ACAS possess the most similar structures and the most clearly defined co-cluster blocks. The co-embeddings generated from ACAS (in Figs. 1q, 1r, 1s, and 1t) indicate better matching between the relations in the original space and the relations in the common embedding space. The figure also includes the numerical comparison between all matrix elements, which is shown to be the smallest for ACAS. It can also be observed from Fig. 1 that CA and CODE achieve better matching than LSI and BGP.

Fig. 2 shows the distributions of the original two-dimensional features for the synthetic datasets, as well as the resulting co-embeddings for all algorithms. Because the distributions in Figs. 2u and 2w are very dense and it is difficult to observe the proximities in detail, we present enlarged portions of them in Fig. 3. From Figs. 2 and 3, it can be noticed that the co-embeddings obtained by ACAS in all four datasets possess the most similar distribution to the original features. In the embedding space, most co-embedding algorithms seem capable of preserving the class separabilities within group X and within group Y individually. However, for most of the datasets, ACAS can further preserve the between-group distributions. This means that points within co-class structures (co-clusters of mixed points from X and Y) in the original space stay together in the embedding space.


Fig. 2. Comparison of the five co-embedding algorithms using the four two-dimensional synthetic datasets SD1, SD2, SD3, and SD4 (plotted in (a), (b), (c), and (d)) generated by different Gaussian distributions. The samples correspond to two groups X (circles) and Y (triangles) and different classes (shown in different colors). The optimal values of the model variables ($p$, $\alpha$, and $\beta$) are also shown parenthesized for ACAS. Each co-embedding axis was scaled within $[0, 1]$.


Moreover, co-clusters in the embedding space do not mix with other clusters or co-clusters that were far apart in the original space. For example, the two co-clusters in Fig. 2d break down in Fig. 2t by CODE, but both co-clusters are shown preserved in Fig. 2x. Also, the single co-cluster of Fig. 2a is preserved by CA in Fig. 2m, but it spreads too much and mixes with a unique class cluster. In the output of ACAS shown in Figs. 2u and 3a, this mixing does not occur. CA seems to better preserve the between-group distributions compared to the previous three algorithms.

Fig. 4 includes results from ACAS but with its quantization-based model training objective $D_Q(R, R_z)$ of (45) replaced by the nonquantized alternatives (40), (41), and (42), and also by the KL divergence used in [2]. We use the SD3 dataset, but very similar behavior was observed with other datasets. The resulting co-embeddings from these four experiments, as well as their corresponding optimal model variables found from the training, are displayed in Fig. 4. Equation (42) and the KL divergence result in identical model selections. By comparing these co-embeddings with Figs. 2w and 3b, it can be seen that the former do not preserve the unique class and co-class proximities and separations as clearly as the co-embeddings computed with the proposed criterion. As discussed in Section 3.3.2, the quantization-based matching is more suitable as it focuses on the global structure properties of the relation matrices.


Fig. 3. Enlarged portions of Figs. 2u and 2w showing the axis ranges.

TABLE 4. Characteristics of the Extracted Relational Dataset Information

a, b, and c are the numbers of unique classes in X, unique classes in Y, and co-classes, respectively. B, W, and S denote the values of $B(X, Y)$, $W(X, Y)$, and $S(X, Y)$, respectively.

Fig. 4. Co-embeddings generated by ACAS using four alternate model fitting criteria without quantization, instead of $D_Q(R, R_z)$. The synthetic dataset SD3 is used in all cases. The optimal values of the model variables $p$, $\alpha$, and $\beta$ are shown together with their sms scores.

TABLE 5. Quantitative Evaluation of the Five Co-Embedding Algorithms and the Nine Datasets

The values of $B(Z_x, Z_y)$, $W(Z_x, Z_y)$, and $S(Z_x, Z_y)$ are shown in square brackets preceding the final sms score. Boldfaced values denote the highest score.


5.1.3 Quantitative Evaluation with All Datasets

We also conduct comparisons using all nine synthetic and real-world datasets, with the quantitative assessment schemes of Section 4.2. Table 5 presents the sms scores calculated with (46) and also includes the individual estimations of $B(Z_x, Z_y)$, $W(Z_x, Z_y)$, and $S(Z_x, Z_y)$ (the three denominators in (46) can be found in Table 4). It can be seen that ACAS possesses the highest sms scores for nearly all datasets. CA and CODE have higher sms scores than LSI and BGP for most datasets.

Fig. 5 contains the standard deviation bars of the performance difference between ACAS and each of the competing LSI, BGP, CA, and CODE, using the sms scores of the nine benchmark datasets. In addition to the mean and standard deviation ranges marked by the horizontal segments, the actual sms score differences in percentages are also marked with circles for each method pair. The performance differences of ACAS against all methods and for the majority of the datasets are significantly above the 0 percent line, apart from CA, which has an overall mean sms difference of around 3 percent and performance comparable to ACAS for two of the datasets.

Because the sms scores in Table 5 are the averages of the three individual scores proposed in Section 4.2, we also present plots of the individual score values in Fig. 6. Specifically, the figure plots all pairs between the co-class distribution B-score $\frac{B(Z_x, Z_y)}{B(X, Y)}$, the within-group distribution W-score $\frac{W(Z_x, Z_y)}{W(X, Y)}$, and the overall separability S-score $\frac{S(Z_x, Z_y)}{S(X, Y)}$. Higher values for each score indicate co-embeddings complying better with the set quality criteria. The scores are plotted for each run of the five algorithms and the nine datasets. It can be seen that ACAS appears mostly on the top-right corner of the graphs. The co-embeddings obtained by CA and CODE do not possess high values of all three scores at the same time and for most datasets, while LSI and BGP appear further down toward the bottom-left side.

We also show in Fig. 7 how the performance of ACAS improves with each iteration throughout its training for the SD3 dataset. Fig. 7a displays the changes of the ACAS objective function of (45), while Fig. 7b displays the changes of the three B-, W-, and S-score ratios and the composite sms, for each iteration during the optimization of the model variables. In Fig. 7a, the objective value decreases during the training procedure as it minimizes the difference between the real input relation matrix and the approximated one in the co-embedding space. Fig. 7b shows that in the early iterations, the model generates co-embeddings with quite high S-score and W-score values, but with a very low B-score. Later on, all three scores increase, leading to the highest sms value. The overall increase of sms indicates the effectiveness and suitability of the ACAS model. As expected, the values of the individual scores do not increase monotonically, since they are not directly linked to the objective function being optimized, which is plotted in Fig. 7a.

To demonstrate how the shape controlling variables $p$, $\alpha$, and $\beta$ can determine the quality of the co-embeddings, we use Fig. 8 to display variations of the sms score versus different choices of $p$, $\alpha$, and $\beta$ for the SD4 dataset. For ease of visualization, we obtain each plot by fixing one variable at its optimal value and varying the other two within their corresponding search ranges, as discussed at the beginning of Section 5. The optimal model for this dataset as computed by ACAS corresponds to $p = 3$, $\alpha = 1.1$, and $\beta = 2$. It can be observed from Fig. 8a that the resulting co-embeddings are not too sensitive to variations in $\beta$ when $\beta > 0$, but $\beta = 0$ leads to a steep drop in the sms score. This drop is directly reflected in the co-embeddings computed with $p = 3$, $\alpha = 1.1$, and $\beta = 0$ displayed in Fig. 9a, which are far worse than the optimal ones in Fig. 2x. When the value of $\alpha$ in Fig. 8a varies along the line $\beta = 2$, the sms score ranges from 0.96 down to 0.86. In Figs. 9b and 9c, we illustrate the co-embeddings obtained with these highest and lowest sms scores, respectively.


Fig. 6. Pairwise plots for the co-class distribution score $\frac{B(Z_x, Z_y)}{B(X, Y)}$, the within-group distribution score $\frac{W(Z_x, Z_y)}{W(X, Y)}$, and the overall separability score $\frac{S(Z_x, Z_y)}{S(X, Y)}$, denoted in the axes by score B, W, and S, respectively. Each of the 45 marks in each plot corresponds to a specific algorithm executed for a specific dataset. Score values on the top-right corner of each graph indicate better score compliance of the computed co-embeddings $Z_x$ and $Z_y$.

Fig. 5. Standard deviation bars of the sms score differences between ACAS and the competing methods for the nine benchmark datasets of Table 5.


Similar observations regarding how $\beta$ affects the output of ACAS can also be obtained from Fig. 8b. Different values of $p$ in Fig. 8b lead to co-embeddings with varying sms scores. Fig. 9d illustrates the ones with the lowest sms score along $\beta = 2$. Fig. 8c also shows reasonable sms variations with slightly higher sensitivity, but without abrupt fluctuations. Experimenting with different datasets, we observed various sensitivity patterns that are particular to each individual dataset. Overall, Figs. 8 and 9 demonstrate that different choices of $p$, $\alpha$, and $\beta$ can lead to varying sms scores and a diverse range of co-embedding arrangements. This is needed to equip the three shape controlling variables with adequate expressive power to generate a wide range of co-embedding possibilities. At the same time, there is no strong sensitivity in these parameters to cause model instability and the need for a very dense grid search. In all cases, the quantization-based objective function of (45) is designed to sustain a stable and robust training and recovery of the optimal model variables.

Finally, we compare the computational time requirements of ACAS, CODE, LSI, BGP, and CA with datasets containing different sample sizes $n$ and $m$, and for computing different numbers of co-embeddings $k$. Fig. 10 displays the execution times from different experimental trials, along with the corresponding values of $n$, $m$, and $k$ in each trial. Since LSI, BGP, and CA possess very similar computational needs, one averaged time point is marked. The plots show that these methods are faster than ACAS; this is because they only involve a single SVD operation on the input relation matrix, while ACAS needs to perform multiple such decompositions during its training procedure to search for the optimal co-embedding model.


Fig. 7. Performance measures for the different iterations of the ACAS training procedure using the SD3 dataset.

Fig. 9. Comparison of the co-embeddings obtained by several nonoptimal ACAS models selected from Fig. 8.

Fig. 8. Variations of the sms score for different values of $p$, $\alpha$, and $\beta$ evaluated using the SD4 dataset. The vectors corresponding to the optimal model variables computed by ACAS are marked in the three graphs by dots.


However, as was mentioned in Section 3.3.2 and at the beginning of Section 5, the number of SVDs only depends on the number of grid points sampling $p$, and not on $\alpha$, $\beta$, or the remaining model variables. Regarding the comparison between ACAS and CODE, in the first six trials with comparatively small $n$ and $m$, and/or $k$, CODE and ACAS possess comparable computational needs. However, as the values of $n$, $m$, and $k$ increase, CODE becomes dramatically slower than ACAS; this is due to the large number $(n + m) \times k$ of model variables it has to optimize.

5.2 Document-Topic Visualization Task

In this section, we evaluate the co-embedding methods with a text mining application of document-topic visualization. The input is the document-topic probability matrix, whose $ij$th entry represents the probability that the $i$th document is drawn from the $j$th topic. The objective is to display both documents and topics in a common embedding space to study the structure of the document database. This is a similar application to the classifier visualization task demonstrated in [2] and the document-word visualization task of [3].

We make use of $n = 593$ sample documents from the “Reuters-21578 Text Categorization Test Collection” belonging to $m = 6$ different topics, and a bag-of-words feature vector is extracted for each document. To derive the probabilities, we perform six sets of binary classification using a linear support vector machine with a one-against-all scheme. In this way, we obtain six separating functions whose output values are passed to a Parzen windows classifier to estimate the probabilities. The final output is a $593 \times 6$ probability matrix, with each element representing the estimated posterior probability of a document belonging to a topic; this is used as the input to LSI, BGP, CODE, CA, and ACAS. The precision measure in [2] ([2, (5.4)], for the quantitative assessment of classifier visualization) is used here to evaluate the resulting co-embeddings. This measures the degree of matching between the input probabilities and the estimated proximity values between the co-embeddings.
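The construction of such a probability matrix can be sketched as follows, assuming scikit-learn is available. Platt-style calibration is used here merely as a convenient stand-in for the Parzen windows probability estimation described above, and the function and variable names are ours, so the resulting probabilities would not necessarily match those used in the experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier

def document_topic_matrix(docs, topic_labels):
    """Builds an n x 6 document-topic probability matrix from raw documents:
    bag-of-words features, six one-against-all linear SVMs, and calibrated
    posterior estimates (Platt scaling instead of Parzen windows)."""
    X = CountVectorizer().fit_transform(docs)                     # bag-of-words features
    ovr = OneVsRestClassifier(CalibratedClassifierCV(LinearSVC(), cv=3))
    ovr.fit(X, topic_labels)                                      # six binary classifiers
    P = ovr.predict_proba(X)                                      # posterior per topic
    return P / P.sum(axis=1, keepdims=True)                       # normalize rows to 1
```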

Model optimization using the quantization procedure dictated by (44) produces the optimal model variables of $p = 1$, $\alpha = 0.5$, and $\beta > 0$ (performance is insensitive to nonzero values of $\beta$ for this dataset). The performance of the five algorithms is shown in Fig. 11a for an increasing number of samples used to compute the precision measure. It can be seen that ACAS and CA achieve the same good performance, with an average precision of 81.8 percent over different sample sizes. This is better than CODE (75.1 percent) and much better than BGP (60.3 percent) and LSI (28.7 percent).

Fig. 11b plots the documents together with their six corresponding topics in the common embedding space as obtained by ACAS. It can be seen that the six classes of documents are nearly separated and the six topics are roughly located at the centers of each class. Also, the distribution of the topics can indicate some of the between-topic relationships. For example, topics 4 and 5, which are close to each other, correspond to “money-foreign-exchange” and “money-supply,” respectively. On the other hand, topics 1, 2, and 6, which are distributed far from each other, correspond to “earnings forecast,” “merchandise trade,” and “interest rates,” respectively; this indicates that they are less related to each other than topics 4 and 5.

6 CONCLUSION

This work has studied the co-embedding (also joint or heterogeneous embedding) problem. We have conducted a generic analysis to discover the commonalities between existing co-embedding algorithms and indirectly related approaches, and investigated possible factors controlling the distributions of the co-embeddings. A novel co-embedding method, the ACAS algorithm, is proposed which uses a flexible and efficient formulation of the co-embedding problem structure. Its underlying model is equipped with flexible parametric forms which allow it to assume different special cases, such as the embedding proximity preservation and the optimal reconstruction templates. The co-embedding generation stage of ACAS is based on an efficient decomposition of the $n \times m$ relation matrix and avoids the eigendecomposition of the expanded $(n + m) \times (n + m)$ one. The model employs an economical set of model variables which renders a complex optimization procedure unnecessary, and its model fitting objective function is based on a robust and simple quantization preprocessing scheme.

In addition, we have proposed a procedure to qualitatively analyze the generated co-embeddings, and three new scores (the overall separability, the within-group, and the co-class distribution scores) that can be used by field practitioners to quantitatively evaluate a given co-embedding algorithm with existing labeled datasets. We compared ACAS with four other co-embedding algorithms (LSI, BGP, CA, and CODE) using relational information extracted from synthetic and real-world datasets, as well as a document-topic visualization task. Experimental results show that it is very competitive compared to existing co-embedding algorithms.


Fig. 11. Results from experimenting with document-topic visualization.

Fig. 10. Comparison of the computational times of different methods with datasets containing different sample sizes $n$ and $m$, and co-embedding dimensionality $k$.


Future work will examine more elaborate reconstruction and quantization procedures that could take into account different aspects of the reconstructed data. Also, more sophisticated similarity measures for approximating the within-group information from the user-defined relation values could be promising. Finally, the algorithm could benefit from the incorporation of domain-specific assumptions into the model, with possible extensions to objects of more than two types.

REFERENCES

[1] H. Zhong, J. Shi, and M. Visontai, “Detecting Unusual Activity in Video,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 819-826, 2004.
[2] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T.L. Griffiths, and J.B. Tenenbaum, “Parametric Embedding for Class Visualization,” Neural Computation, vol. 19, no. 9, pp. 2536-2556, 2007.
[3] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, “Euclidean Embedding of Co-Occurrence Data,” J. Machine Learning Research, vol. 8, pp. 2265-2295, 2007.
[4] V. Sindhwani and P. Melville, “Document-Word Co-Regularization for Semi-Supervised Sentiment Analysis,” Proc. Eighth IEEE Int’l Conf. Data Mining, pp. 1025-1030, 2008.
[5] L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik, “Dimensionality Reduction: A Comparative Review,” Technical Report TiCC-TR 2009-005, Tilburg Univ., 2009.
[6] T.F. Cox and M.A.A. Cox, Multidimensional Scaling. Chapman and Hall, 2000.
[7] I.S. Dhillon, “Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning,” Proc. Seventh ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 269-274, 2001.
[8] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, “Embedding Heterogeneous Data Using Statistical Models,” Proc. 21st Nat’l Conf. Artificial Intelligence, 2006.
[9] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T.L. Griffiths, and J.B. Tenenbaum, “Parametric Embedding for Class Visualization,” Proc. Advances in Neural Information Processing Systems, 2005.
[10] P. Sarkar, S.M. Siddiqi, and G.J. Gordon, “Approximate Kalman Filters for Embedding Author-Word Co-Occurrence Data over Time,” Proc. 11th Int’l Conf. Statistical Network Analysis, pp. 126-139, 2006.
[11] P. Sarkar, S.M. Siddiqi, and G.J. Gordon, “A Latent Space Approach to Dynamic Embedding of Co-Occurrence Data,” Proc. 11th Int’l Conf. Artificial Intelligence and Statistics, 2007.
[12] J.P. Benzecri, “L’Analyse des Donnees,” L’Analyse des Correspondances, vol. 2, 1973.
[13] M. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, 1983.
[14] J.R. Bellegarda, “Latent Semantic Mapping,” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 70-80, Sept. 2005.
[15] M. Richardson and G.F. Kuder, “Making a Rating Scale That Measures,” Personnel J., vol. 12, pp. 36-40, 1933.
[16] H. Zha, X. He, C.H.Q. Ding, M. Gu, and H.D. Simon, “Bipartite Graph Partitioning and Data Clustering,” Proc. 10th Int’l Conf. Information and Knowledge Management, pp. 25-32, 2001.
[17] B. Gao, T. Liu, X. Zheng, Q. Cheng, and W. Ma, “Consistent Bipartite Graph Co-Partitioning for Star-Structured High-Order Heterogeneous Data Co-Clustering,” Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 41-50, 2005.
[18] B. Long, Z. Zhang, and P.S. Yu, “Co-Clustering by Block Value Decomposition,” Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery in Data Mining, pp. 635-640, 2005.
[19] B. Long, Z. Zhang, X. Wu, and P.S. Yu, “Spectral Clustering for Multi-Type Relational Data,” Proc. 23rd Int’l Conf. Machine Learning, pp. 585-592, 2006.
[20] M. Rege, M. Dong, and F. Fotouhi, “Bipartite Isoperimetric Graph Partitioning for Data Co-Clustering,” Data Mining and Knowledge Discovery, vol. 16, no. 3, pp. 276-312, 2008.
[21] C.E. Bichot, “Co-Clustering Documents and Words by Minimizing the Normalized Cut Objective Function,” J. Math. Modelling and Algorithms, vol. 9, no. 2, pp. 131-147, 2010.
[22] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, “Indexing by Latent Semantic Analysis,” J. Am. Soc. for Information Science, vol. 41, pp. 391-407, 1990.
[23] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, “Euclidean Embedding of Co-Occurrence Data,” Proc. Advances in Neural Information Processing Systems, 2004.
[24] F.W. Young, ViSta: The Visual Statistics System. Wiley, 1996.
[25] P.M. Yelland, “An Introduction to Correspondence Analysis,” Math. J., vol. 12, 2010.
[26] M. Greenacre, “Power Transformations in Correspondence Analysis,” Computational Statistics and Data Analysis, vol. 53, no. 8, pp. 3108-3116, 2009.
[27] U. Luxburg, “A Tutorial on Spectral Clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[28] T. Mu, J.Y. Goulermas, J. Tsujii, and S. Ananiadou, “Proximity-Based Frameworks for Generating Embeddings from Multi-Output Data,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2216-2232, Nov. 2012.
[29] W.S. Torgerson, “Multidimensional Scaling: I. Theory and Method,” Psychometrika, vol. 17, no. 4, pp. 401-419, 1952.
[30] C. Eckart and G. Young, “The Approximation of One Matrix by Another of Lower Rank,” Psychometrika, vol. 1, no. 3, pp. 211-218, 1936.
[31] B. Xie, M. Wang, and D. Tao, “Toward the Optimization of Normalized Graph Laplacian,” IEEE Trans. Neural Networks, vol. 22, no. 4, pp. 660-666, Apr. 2011.
[32] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, “On Optimizing Kernel Alignment,” Technical Report NC-TR-01-087, Royal Holloway Univ. of London, 2001.
[33] UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html, 1992.
[34] M. Hahsler, K. Hornik, and C. Buchta, “Getting Things in Order: An Introduction to the R Package Seriation,” J. Statistical Software, vol. 25, no. 3, pp. 1-34, 2008.
[35] H. Wu, Y. Tien, and C. Chen, “GAP: A Graphical Environment for Matrix Visualization and Cluster Analysis,” Computational Statistics and Data Analysis, vol. 54, no. 3, pp. 767-778, 2010.
[36] J.C. Bezdek, R.J. Hathaway, and J.M. Huband, “Visual Assessment of Clustering Tendency for Rectangular Dissimilarity Matrices,” IEEE Trans. Fuzzy Systems, vol. 15, no. 5, pp. 890-903, Oct. 2007.
[37] UCI, DELVE, and STATLOG Benchmark Repository, http://ida.first.fhg.de/projects/bench/benchmarks.htm, 2013.

Tingting Mu received the BEng degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 2004, and the PhD degree in electrical engineering and electronics from the University of Liverpool, United Kingdom, in 2008. She is currently a lecturer in the School of Electrical Engineering, Electronics and Computer Science at the University of Liverpool. Her current research interests include machine learning, data analysis, and mathematical modeling, with applications to information retrieval, text mining, and bioinformatics. She is a member of the IEEE.

John Yannis Goulermas received the BSc degree (first class) in computation from the University of Manchester Institute of Science and Technology (UMIST), United Kingdom, in 1994, and the MSc and PhD degrees from the Control Systems Center, UMIST, in 1996 and 2000, respectively. He is currently a reader in the School of Electrical Engineering, Electronics and Computer Science at the University of Liverpool, United Kingdom. His current research interests include machine learning, combinatorial data analysis, data visualization, and mathematical modeling. He is a senior member of the IEEE.

