
J. Vis. Commun. Image R. 43 (2017) 198–211

Contents lists available at ScienceDirect

J. Vis. Commun. Image R.

journal homepage: www.elsevier.com/locate/jvci

Spectral shape classification: A deep learning approach

http://dx.doi.org/10.1016/j.jvcir.2017.01.001
1047-3203/© 2017 Elsevier Inc. All rights reserved.

This paper has been recommended for acceptance by Zicheng Liu.
Corresponding author.

E-mail address: [email protected] (A. Ben Hamza).

Majid Masoumi, A. Ben Hamza
Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada


Article history:
Received 26 April 2016
Revised 9 August 2016
Accepted 1 January 2017
Available online 5 January 2017

Keywords:
Deep learning
Spectral graph wavelet
Bag-of-features
Classification

In this paper, we propose a deep learning approach to 3D shape classification using spectral graph wavelets and the bag-of-features paradigm. In order to capture both the local and global geometry of a 3D shape, we present a three-step feature description strategy. Local descriptors are first extracted via the spectral graph wavelet transform having the Mexican hat wavelet as a generating kernel. Then, mid-level features are obtained by embedding local descriptors into the visual vocabulary space using the soft-assignment coding step of the bag-of-features model. A global descriptor is subsequently constructed by aggregating mid-level features weighted by a geodesic exponential kernel, resulting in a matrix representation that describes the frequency of appearance of nearby codewords in the vocabulary. Experimental results on two standard 3D shape benchmarks demonstrate the superior performance of the proposed approach in comparison with state-of-the-art methods.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

With improved computational power and an overwhelming availability of 3D shape data, the burgeoning area of deep learning has dramatically transformed how shape classification methods are studied, due in large part to recent theoretical developments in deep learning representations that are essential not only to improving classification accuracy but also to producing state-of-the-art results.

The recent surge of interest in the spectral analysis of the Laplace-Beltrami operator (LBO) has resulted in a considerable number of spectral shape signatures that have been successfully applied to a broad range of areas, including manifold learning [1], object recognition and deformable shape analysis [2–7], medical imaging [8], multimedia protection [9], and shape classification [10]. The diversified nature of these applications is a powerful testimony to the practical usage of spectral shape signatures, which are usually defined as feature vectors representing local and/or global characteristics of a shape and may be broadly classified into two main categories: local and global descriptors. Local descriptors (also called point signatures) are defined at each point of the shape and often represent the local structure of the shape around that point, while global descriptors are usually defined on the entire shape.

Most point signatures may easily be aggregated to form global descriptors by integrating over the entire shape. Rustamov [4] proposed a local feature descriptor referred to as the global point signature (GPS), which is a vector whose components are scaled eigenfunctions of the LBO evaluated at each surface point. The GPS signature is invariant under isometric deformations of the shape, but it suffers from the problem of eigenfunction switching whenever the associated eigenvalues are close to each other. This problem was later addressed by the heat kernel signature (HKS) [11], which is a temporal descriptor defined as an exponentially-weighted combination of the LBO eigenfunctions. HKS is a local shape descriptor that has a number of desirable properties, including robustness to small perturbations of the shape, efficiency, and invariance to isometric transformations. The idea of HKS was also independently proposed by Gębal et al. [12] for 3D shape skeletonization and segmentation under the name of auto diffusion function. From the graph Fourier perspective, it can be seen that the HKS is highly dominated by information from low frequencies, which correspond to macroscopic properties of a shape. To achieve substantially more accurate matching than HKS, the wave kernel signature (WKS) [13] was proposed as an alternative, in an effort to allow access to high-frequency information. Using the Fourier transform's magnitude, Kokkinos et al. [14] introduced the scale-invariant heat kernel signature (SIHKS), which is constructed based on a logarithmically sampled scale-space.

One of the simplest spectral shape signatures is Shape-DNA [3], which is an isometry-invariant global descriptor defined as a truncated sequence of the LBO eigenvalues arranged in increasing order



of magnitude. Gao et al. [10] developed a variant of Shape-DNA, referred to as compact Shape-DNA (cShape-DNA), which is an isometry-invariant signature resulting from applying the discrete Fourier transform to the area-normalized eigenvalues of the LBO. Chaudhari et al. [8] presented a slightly modified version of the GPS signature by setting the LBO eigenfunctions to unity. This signature, called GPS embedding, is defined as a truncated sequence of inverse square roots of the area-normalized eigenvalues of the LBO. A comprehensive list of spectral descriptors can be found in [15,16].

On the other hand, wavelet analysis has some major advantages over the Fourier transform, which makes it an interesting alternative for many applications. In particular, unlike the Fourier transform, wavelet analysis is able to perform local analysis and also makes multiresolution analysis possible. Classical wavelets are constructed by translating and scaling a mother wavelet, which generates a set of functions through the scaling and translation operations. The wavelet transform coefficients are then obtained by taking the inner product of the input function with the translated and scaled waveforms. The application of wavelets to graphs (or triangle meshes in geometry processing) is, however, problematic and not straightforward, due in part to the fact that it is unclear how to apply the scaling operation to a signal (or function) defined on the mesh vertices. To tackle this problem, Coifman et al. [17] introduced diffusion wavelets, which generalize classical wavelets by allowing for multiscale analysis on graphs. The construction of diffusion wavelets interacts with the underlying graph through repeated applications of a diffusion operator, which induces a scaling process. Hammond et al. [18] showed that the wavelet transform can be performed in the graph Fourier domain, and proposed a spectral graph wavelet transform that is defined in terms of the eigensystem of the graph Laplacian matrix. More recently, a spectral graph wavelet signature (SGWS) was introduced in [6,19], and it has shown superior performance over HKS and WKS in 3D shape retrieval. SGWS is a multiresolution local descriptor that is not only isometry invariant, but also compact, easy to compute, and combines the advantages of both band-pass and low-pass filters.

A popular approach for transforming local descriptors into global representations that can be used for 3D shape recognition and classification is the bag-of-features (BoF) model [5]. The task in the shape classification problem is to assign a shape to a class chosen from a predefined set of classes. The BoF model represents each shape in the training dataset as a collection of unordered feature descriptors extracted from local areas of the shape, just as words are local features of a document. A baseline BoF approach quantizes each local descriptor to its nearest cluster center using K-means clustering and then encodes each shape as a histogram over cluster centers by counting the number of assignments per cluster. These cluster centers form a visual vocabulary or codebook whose elements are often referred to as visual words or codewords. Although the BoF paradigm has been shown to provide significant levels of performance, it does not, however, take into consideration the spatial relations between features, which may have an adverse effect not only on its descriptive ability but also on its discriminative power. To account for the spatial relations between features, Bronstein et al. introduced a generalization of the bag of features, called spatially sensitive bags of features (SS-BoF) [5]. The SS-BoF is a global descriptor defined in terms of mid-level features and the heat kernel, and can be represented by a square matrix whose elements represent the frequency of appearance of nearby codewords in the vocabulary. In the same spirit, Bu et al. [20] recently proposed the geodesic-aware bag of features (GA-BoF) for 3D shape classification by replacing the heat kernel in SS-BoF with a geodesic exponential kernel.
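The baseline BoF pipeline described above (a K-means vocabulary followed by hard-assignment histograms) can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the paper's method: the descriptor dimension, vocabulary size, and random data below are hypothetical placeholders.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical local descriptors: 100 points x 8 dims for each of 5 shapes.
rng = np.random.default_rng(0)
descriptors = [rng.normal(size=(100, 8)) for _ in range(5)]

# Build the vocabulary offline by pooling all descriptors and clustering.
k = 16
codebook, _ = kmeans2(np.vstack(descriptors), k, minit='++', seed=0)

def bof_histogram(desc, codebook):
    """Hard-assignment BoF: count nearest-codeword hits, then normalize."""
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

h = bof_histogram(descriptors[0], codebook)
print(h.shape)  # (16,)
```

The histogram discards all spatial layout, which is exactly the limitation that SS-BoF and GA-BoF address with kernel-weighted codeword co-occurrence.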

In this paper, we introduce a deep learning framework, which we call deep spectral graph wavelet (DeepSGW), for 3D shape classification. Deep learning is a machine learning paradigm that mimics, to varying degrees, the way the human brain works. The popularity of deep learning is largely attributed not only to its huge success in a wide range of tasks such as handwritten character recognition, image and video recognition, text analysis and speech recognition, but also to tech industry giants such as Google, Apple, IBM, Microsoft, Facebook, Twitter, PayPal and Baidu that have acquired most of the dominant players in this field to improve their product offerings and services. Inspired by the actual structure of the brain, deep learning refers to a powerful class of machine learning techniques that learn multi-level representations of data in deep hierarchical architectures composed of multiple layers, where each higher layer corresponds to a more abstract (i.e. higher-level) representation of information [21]. The process of deep learning is hierarchical in the sense that it takes low-level features at the bottom layer and then constructs higher- and higher-level features through the composition of layers.

The proposed DeepSGW approach performs 3D shape classification on spectral graph wavelet codes that are obtained from spectral graph wavelet signatures (i.e. local descriptors) via the soft-assignment coding step of the BoF model, in conjunction with a geodesic exponential kernel for capturing the spatial relations between features. While we adopt a similar strategy as [20] in the sense that we employ a geodesic kernel, our approach differs in the way our shape descriptor is computed. More precisely, we use multiresolution local descriptors that combine the advantages of both band-pass and low-pass filters, resulting in improved descriptive ability and discriminative power. Broadly speaking, shape classification is the process of organizing a dataset of shapes into a known number of classes, and the task is to assign new shapes to one of these classes. It is common practice in classification to randomly split the available data into training and test sets. Classification aims to learn a classifier (also called a predictor or classification model) from labeled training data. The training data consist of a set of training examples or instances that are labeled with predefined classes. The resulting trained model is subsequently applied to the test data to classify future (unseen) data instances into these classes. The test data, which consist of data instances with unknown class labels, are used to evaluate the performance of the classification model and determine its accuracy in terms of the number of test instances correctly or incorrectly predicted by the model. A good classifier should result in high accuracy, or equivalently, in few misclassifications. While this work focuses primarily on classification, our proposed framework is fairly general and can be used to address a variety of shape analysis problems, including retrieval and clustering.

In addition to taking into consideration the spatial relations between features via a geodesic exponential kernel, the proposed approach performs classification on spectral graph wavelet codes, thereby seamlessly capturing the similarity between these mid-level features. We not only show that our formulation allows us to take into account the spatial layout of features, but we also demonstrate that the proposed framework yields better classification accuracy compared to state-of-the-art methods, while remaining computationally attractive. Unlike existing shape descriptors, DeepSGW is not only compact and efficient, but also captures valuable information about both macroscopic and microscopic structures of 3D objects. While the vast majority of shape retrieval methods use low-level or mid-level features for describing 3D shapes, DeepSGW employs high-level (learned) features as shape descriptors. Our main contributions are summarized as follows:


Fig. 1. An RBM with visible units $\mathbf{v} = (v_i)$ and hidden units $\mathbf{h} = (h_j)$.

1. We introduce low-level shape descriptors using area-weighted spectral graph wavelets.

2. We design mid-level features using the bag-of-features model, in which we employ soft-assignment coding as a feature coding scheme, and we measure the spatial relationship between the spectral graph wavelet codes using the geodesic distance.

3. We use deep belief networks to learn high-level features for 3D shape classification.

The rest of this paper is organized as follows. In Section 2, we briefly overview the Laplace-Beltrami operator and deep learning. In Section 3, we introduce a deep learning approach for 3D shape classification, and we discuss in detail its main algorithmic steps. Experimental results are presented in Section 4. Finally, we conclude in Section 5 and point out some future work directions.

2. Background

A 3D shape is usually modeled as a triangle mesh $\mathbb{M}$ whose vertices are sampled from a Riemannian manifold. A triangle mesh $\mathbb{M}$ may be defined as a graph $G = (V, E)$ or $G = (V, T)$, where $V = \{v_1, \dots, v_m\}$ is the set of vertices, $E = \{e_{ij}\}$ is the set of edges, and $T = \{t_1, \dots, t_g\}$ is the set of triangles. Each edge $e_{ij} = [v_i, v_j]$ connects a pair of vertices $\{v_i, v_j\}$. Two distinct vertices $v_i, v_j \in V$ are adjacent (denoted by $v_i \sim v_j$, or simply $i \sim j$) if they are connected by an edge, i.e. $e_{ij} \in E$.
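To make the graph view concrete, the edge set $E$ and the vertex adjacency relation can be recovered directly from the triangle list $T$. The tetrahedron below is a toy example of my own, not a mesh from the paper.

```python
# Derive the edge set E and vertex adjacency from a triangle list T.
# Toy mesh: a tetrahedron over vertices {0, 1, 2, 3}.
triangles = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

edges = set()
for (i, j, k) in triangles:
    # Each triangle contributes its three (undirected) edges.
    for a, b in ((i, j), (j, k), (k, i)):
        edges.add((min(a, b), max(a, b)))

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

print(len(edges))            # 6 edges in a tetrahedron
print(sorted(adjacency[0]))  # [1, 2, 3]
```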

2.1. Laplace-Beltrami operator

Given a compact Riemannian manifold $\mathbb{M}$, the space $L^2(\mathbb{M})$ of all smooth, square-integrable functions on $\mathbb{M}$ is a Hilbert space endowed with inner product $\langle f_1, f_2 \rangle = \int_{\mathbb{M}} f_1(x) f_2(x)\, da(x)$, for all $f_1, f_2 \in L^2(\mathbb{M})$, where $da(x)$ (or simply $dx$) denotes the measure from the area element of a Riemannian metric on $\mathbb{M}$. Given a twice-differentiable, real-valued function $f: \mathbb{M} \to \mathbb{R}$, the Laplace-Beltrami operator (LBO) is defined as $\Delta_{\mathbb{M}} f = -\operatorname{div}(\nabla_{\mathbb{M}} f)$, where $\nabla_{\mathbb{M}} f$ is the intrinsic gradient vector field and $\operatorname{div}$ is the divergence operator [22,23]. The LBO is a linear, positive semi-definite operator acting on the space of real-valued functions defined on $\mathbb{M}$, and it is a generalization of the Laplace operator to non-Euclidean spaces.

Discretization: A real-valued function $f: V \to \mathbb{R}$ defined on the mesh vertex set may be represented as an $m$-dimensional vector $\mathbf{f} = (f(i)) \in \mathbb{R}^m$, where the $i$th component $f(i)$ denotes the function value at the $i$th vertex in $V$. Using a mixed finite element/finite volume method on triangle meshes [24], the value of $\Delta_{\mathbb{M}} f$ at a vertex $v_i$ (or simply $i$) can be approximated using the cotangent weight scheme as follows:

$$\Delta_{\mathbb{M}} f(i) \approx \frac{1}{a_i} \sum_{j \sim i} \frac{\cot \alpha_{ij} + \cot \beta_{ij}}{2} \left( f(i) - f(j) \right), \qquad (1)$$

where $\alpha_{ij}$ and $\beta_{ij}$ are the angles $\angle(v_i v_{k_1} v_j)$ and $\angle(v_i v_{k_2} v_j)$ of the two faces $t_a = \{v_i, v_j, v_{k_1}\}$ and $t_b = \{v_i, v_j, v_{k_2}\}$ that are adjacent to the edge $[i, j]$, and $a_i$ is the area of the Voronoi cell at vertex $i$. It should be noted that the cotangent weight scheme is numerically consistent and preserves several important properties of the continuous LBO, including symmetry and positive semi-definiteness [25].

Spectral analysis: The $m \times m$ matrix associated to the discrete approximation of the LBO is given by $L = A^{-1} W$, where $A = \operatorname{diag}(a_i)$ is a positive definite diagonal matrix (mass matrix), and $W = \operatorname{diag}\bigl( \sum_{k \ne i} c_{ik} \bigr) - (c_{ij})$ is a sparse symmetric matrix (stiffness matrix). Each diagonal element $a_i$ is the area of the Voronoi cell at vertex $i$, and the weights $c_{ij}$ are given by

$$c_{ij} = \begin{cases} \dfrac{\cot \alpha_{ij} + \cot \beta_{ij}}{2} & \text{if } i \sim j \\[4pt] 0 & \text{otherwise,} \end{cases} \qquad (2)$$

where $\alpha_{ij}$ and $\beta_{ij}$ are the opposite angles of the two triangles that are adjacent to the edge $[i, j]$.

The eigenvalues and eigenvectors of $L$ can be found by solving the generalized eigenvalue problem $W \varphi_\ell = \lambda_\ell A \varphi_\ell$ using, for instance, the Arnoldi method of ARPACK, where $\lambda_\ell$ are the eigenvalues and $\varphi_\ell$ are the unknown associated eigenfunctions (i.e. eigenvectors, which can be thought of as functions on the mesh vertices). We may sort the eigenvalues in ascending order as $0 = \lambda_1 < \lambda_2 \le \cdots \le \lambda_m$ with associated orthonormal eigenfunctions $\varphi_1, \varphi_2, \dots, \varphi_m$, where the orthogonality of the eigenfunctions is defined in terms of the $A$-inner product, i.e.

$$\langle \varphi_k, \varphi_\ell \rangle_A = \sum_{i=1}^{m} a_i\, \varphi_k(i)\, \varphi_\ell(i) = \delta_{k\ell}, \quad \text{for all } k, \ell = 1, \dots, m. \qquad (3)$$

We may rewrite the generalized eigenvalue problem in matrix form as $W \Phi = A \Phi \Lambda$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_m)$ is an $m \times m$ diagonal matrix with the $\lambda_\ell$ on the diagonal, and $\Phi$ is an $m \times m$ orthogonal matrix whose $\ell$th column is the unit-norm eigenvector $\varphi_\ell$.
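The discretization and eigendecomposition above can be sketched with SciPy's ARPACK wrapper `eigsh`, which solves the generalized problem $W\varphi = \lambda A\varphi$ directly. This is an illustrative sketch under two stated simplifications: barycentric (one-third triangle area) vertex areas stand in for the Voronoi cells of Eq. (1), and the flat grid mesh is a toy of my own, not a shape from the paper's benchmarks.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

def cotangent_laplacian(verts, tris):
    """Cotangent stiffness W and lumped mass A (Eqs. (1)-(2)).
    Barycentric vertex areas approximate the Voronoi cells."""
    m = len(verts)
    I, J, V = [], [], []
    area = np.zeros(m)
    for (i, j, k) in tris:
        # The angle at each corner contributes a cotangent weight
        # to the opposite edge (symmetric entries c_ab = c_ba).
        for (a, b, c) in ((i, j, k), (j, k, i), (k, i, j)):
            u, v = verts[a] - verts[c], verts[b] - verts[c]
            cot = u.dot(v) / np.linalg.norm(np.cross(u, v))
            I += [a, b]; J += [b, a]; V += [cot / 2, cot / 2]
        t_area = 0.5 * np.linalg.norm(np.cross(verts[j] - verts[i],
                                               verts[k] - verts[i]))
        area[[i, j, k]] += t_area / 3.0
    C = sparse.coo_matrix((V, (I, J)), shape=(m, m)).tocsc()
    W = sparse.diags(np.asarray(C.sum(axis=1)).ravel()) - C
    return W.tocsc(), sparse.diags(area).tocsc()

# Toy mesh: a 3x3 vertex grid on the unit square, split into triangles.
xs, ys = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
verts = np.c_[xs.ravel(), ys.ravel(), np.zeros(9)]
tris = []
for r in range(2):
    for c in range(2):
        v = 3 * r + c
        tris += [(v, v + 1, v + 3), (v + 1, v + 4, v + 3)]

W, A = cotangent_laplacian(verts, tris)
# Generalized eigenproblem W phi = lambda A phi; shift-invert with
# sigma slightly below 0 because lambda_1 = 0 (constant eigenvector).
vals, vecs = eigsh(W, k=4, M=A, sigma=-1e-8)
print(np.round(vals, 4))  # smallest eigenvalue is ~0
```

The first eigenvalue is numerically zero, matching $0 = \lambda_1 < \lambda_2$ in the text.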

2.2. Deep learning

Deep learning has its roots in neural networks, and focuses on learning deep feature hierarchies, with each layer learning new features from the output of its preceding layer [26–29,21]. One of the most widely used deep architectures is the so-called Deep Belief Network (DBN), which is a generative graphical model composed of a layer of visible units and multiple layers of hidden units, with unsupervised Restricted Boltzmann Machines (RBMs) as its building blocks [27]. Each layer of a DBN encodes correlations in the units of the previous layer, and the network parameters obtained from the unsupervised learning phase are subsequently fine-tuned using backpropagation or other discriminative algorithms. The visible units correspond to the attributes of the input data vector (training example), and the hidden layers act as feature detectors.

2.2.1. Restricted Boltzmann Machines (RBMs)

An RBM is a two-layer, undirected graphical model that consists of a visible (input) layer of stochastic binary visible units $\mathbf{v} = (v_i)$ of dimension $I$ and a hidden layer of stochastic binary hidden units $\mathbf{h} = (h_j)$ of dimension $J$, where $v_i$ is the state of visible unit $i$ and $h_j$ is the state of hidden unit $j$. Each visible unit is connected to each hidden unit, but there are no intra-visible or intra-hidden connections, as shown in Fig. 1. The symmetric connections between the two layers of an RBM are represented by an $I \times J$ weight matrix $W = (w_{ij})$, where $w_{ij}$ is a real-valued weight characterizing the relative strength of the undirected edge between visible unit $i$ and hidden unit $j$.

In a standard RBM, the visible and hidden units are assumed to be binary, meaning that they can only be in one of two states $\{0, 1\}$, where 1 indicates the unit is "on" and 0 indicates the unit is "off" (i.e. activated or deactivated, respectively).

The energy of the joint configuration of the visible and hidden units $(\mathbf{v}, \mathbf{h})$ is given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{I} \sum_{j=1}^{J} v_i w_{ij} h_j - \sum_{j=1}^{J} b_j h_j - \sum_{i=1}^{I} c_i v_i = -\mathbf{v}^{\top} W \mathbf{h} - \mathbf{b}^{\top} \mathbf{h} - \mathbf{c}^{\top} \mathbf{v}, \qquad (4)$$

where $b_j$ is the real-valued bias of hidden unit $j$ and $c_i$ is the real-valued bias of visible unit $i$. This energy defines a joint probability distribution for configuration $(\mathbf{v}, \mathbf{h})$ as follows

$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp(-E(\mathbf{v}, \mathbf{h})), \qquad (5)$$

where $Z = \sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))$ is a normalization constant, obtained by summing up the energies of all possible $(\mathbf{v}, \mathbf{h})$ configurations [27,29]. Therefore, configurations with high energy are assigned low probability, while configurations with low energy are assigned high probability.

Because there are no intra-visible or intra-hidden connections in an RBM, the visible units are conditionally independent of one another given the hidden layer, and vice versa. For a simple RBM with Bernoulli distribution for both the visible and hidden layers (i.e. a Bernoulli-Bernoulli RBM), the probability that $h_j$ is activated, given visible unit vector $\mathbf{v}$, is

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\!\left( b_j + \sum_{i=1}^{I} w_{ij} v_i \right), \qquad (6)$$

and the probability that $v_i$ is activated, given hidden unit vector $\mathbf{h}$, is

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\!\left( c_i + \sum_{j=1}^{J} w_{ij} h_j \right), \qquad (7)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid activation function, whose output values lie in the interval $(0, 1)$. In other words, the probability that a hidden unit is activated is independent of the states of the other hidden units, given the states of the visible units. Similarly, the probability that a visible unit is activated is independent of the states of the other visible units, given the states of the hidden units. This property of RBMs makes Gibbs sampling from (6) and (7) highly efficient, as one can sample all the hidden units simultaneously and then all the visible units simultaneously.
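The conditional independence in (6) and (7) is what makes block Gibbs sampling a single vectorized operation per layer. A minimal sketch, with a toy RBM whose sizes and random weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy Bernoulli-Bernoulli RBM: I = 6 visible, J = 4 hidden units.
I_dim, J_dim = 6, 4
W = rng.normal(scale=0.1, size=(I_dim, J_dim))
b = np.zeros(J_dim)  # hidden biases
c = np.zeros(I_dim)  # visible biases

def sample_h_given_v(v):
    """Eq. (6): all hidden units are conditionally independent given v,
    so the whole layer is sampled in one vectorized step."""
    p = sigmoid(b + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h):
    """Eq. (7): the symmetric step for the visible layer."""
    p = sigmoid(c + h @ W.T)
    return (rng.random(p.shape) < p).astype(float), p

v0 = rng.integers(0, 2, size=I_dim).astype(float)
h, ph = sample_h_given_v(v0)
v1, pv = sample_v_given_h(h)
print(h.shape, v1.shape)  # (4,) (6,)
```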

Training RBMs: Given a training dataset $\mathcal{V}$ of visible vectors, RBMs are trained to maximize the average log probability (or equivalently minimize the energy) of $\mathcal{V}$ over the RBM's parameters $\theta = \{W, \mathbf{b}, \mathbf{c}\}$, i.e.

$$\arg\max_{\theta} \sum_{\mathbf{v} \in \mathcal{V}} \log p(\mathbf{v}), \qquad (8)$$

where $p(\mathbf{v})$ is the marginal probability (over the visible vector $\mathbf{v}$) given by

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h})) = \frac{1}{Z} \exp(\mathbf{c}^{\top} \mathbf{v}) \prod_{j=1}^{J} \left( 1 + \exp\!\left( b_j + \sum_{i=1}^{I} w_{ij} v_i \right) \right). \qquad (9)$$

Taking the derivative of the log probability with respect to $w_{ij}$ yields the following learning rule, which performs stochastic gradient ascent in the log probability of the training data:

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right), \qquad (10)$$

where $\epsilon$ is a learning rate, and $\langle \cdot \rangle_{\text{data}}$ and $\langle \cdot \rangle_{\text{model}}$ are the expectations under the distributions defined by the data and the model, respectively. Since $\langle \cdot \rangle_{\text{model}}$ is prohibitively expensive to compute, the single-step version (CD1) of the contrastive divergence (CD) algorithm [27] is often used to optimize the model parameters (i.e. weights and biases), and it works well in practice. The new update rule becomes

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right), \qquad (11)$$

where $\langle \cdot \rangle_{\text{recon}}$ is the expectation with respect to the distribution of samples from running the Gibbs sampler initialized at the data for one full step. The intuition behind the weight update rule is that the reconstructed data should be as close as possible to the input data. Similar update rules are applied to the biases (i.e. bias vectors $\mathbf{b}$ and $\mathbf{c}$).

The CD algorithm starts by setting the states of the visible units to a training vector. Given a randomly selected training example $\mathbf{v}$, a binary vector of hidden units is obtained by sampling the conditional probability distribution (6) and is then propagated back through (7), resulting in a reconstruction of the original input data. After RBM training, the hidden units can be considered to act as feature detectors, as they form a high-level representation of the input data.
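A minimal CD1 update following Eq. (11) can be written for a mini-batch in a few lines. This is a sketch, not the paper's training code: the sizes, learning rate, and random binary "data" are placeholders, and the reconstruction uses mean-field probabilities rather than sampled binaries, a common practical simplification.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy RBM and a batch of binary training vectors (hypothetical data).
I_dim, J_dim, batch = 6, 4, 32
W = rng.normal(scale=0.1, size=(I_dim, J_dim))
b = np.zeros(J_dim); c = np.zeros(I_dim)
V = (rng.random((batch, I_dim)) < 0.5).astype(float)

def cd1_step(V, W, b, c, eps=0.1):
    """One CD1 update: positive phase on the data, one Gibbs step for
    the reconstruction, then the gradient estimate of Eq. (11)."""
    ph_data = sigmoid(b + V @ W)                  # p(h=1|v) on the data
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    V_recon = sigmoid(c + h @ W.T)                # mean-field reconstruction
    ph_recon = sigmoid(b + V_recon @ W)
    dW = (V.T @ ph_data - V_recon.T @ ph_recon) / len(V)
    db = (ph_data - ph_recon).mean(axis=0)
    dc = (V - V_recon).mean(axis=0)
    return W + eps * dW, b + eps * db, c + eps * dc

W, b, c = cd1_step(V, W, b, c)
print(W.shape)  # (6, 4)
```

Stacking such RBMs and fine-tuning with backpropagation is what yields the DBN used later in the paper.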

Gaussian-Bernoulli RBMs: If the visible units are real-valued, then exponential family distributions such as the Gaussian distribution are more suitable for modeling real-valued and count data (e.g., grayscale images and speech signals). Hence, for a Gaussian-Bernoulli RBM with Gaussian distribution for the visible layer and Bernoulli distribution for the hidden layer (i.e. $\mathbf{v} \in \mathbb{R}^I$ and $\mathbf{h} \in \{0, 1\}^J$), the energy of the joint configuration $(\mathbf{v}, \mathbf{h})$ is defined as

$$E(\mathbf{v}, \mathbf{h}) = \sum_{i=1}^{I} \frac{(v_i - c_i)^2}{2\sigma_i^2} - \sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} h_j \frac{v_i}{\sigma_i} - \sum_{j=1}^{J} b_j h_j, \qquad (12)$$

where $\sigma_i$ is the standard deviation associated with the Gaussian visible unit $v_i$, and the conditional probabilities are given by

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\!\left( b_j + \sum_{i=1}^{I} w_{ij} \frac{v_i}{\sigma_i} \right), \qquad (13)$$

and

$$p(v_i = x \mid \mathbf{h}) = \mathcal{N}\!\left( c_i + \sigma_i \sum_{j=1}^{J} w_{ij} h_j,\; \sigma_i^2 \right), \qquad (14)$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. In other words, each visible unit is modeled with a Gaussian distribution given the hidden layer. In practice, it is a good idea, prior to fitting a DBN to input data, to standardize each input variable to have zero mean and unit standard deviation. The energy of the joint configuration $(\mathbf{v}, \mathbf{h})$ then becomes

$$E(\mathbf{v}, \mathbf{h}) = \frac{1}{2} \| \mathbf{v} - \mathbf{c} \|_2^2 - \mathbf{v}^{\top} W \mathbf{h} - \mathbf{b}^{\top} \mathbf{h}. \qquad (15)$$

3. Proposed framework

In this section, we provide a detailed description of our DeepSGW classification method, which utilizes spectral graph wavelets in conjunction with the BoF paradigm. Each 3D shape in the dataset is first represented by local descriptors, which are arranged into a spectral graph wavelet signature matrix. Then, we perform soft-assignment coding by embedding local descriptors into the visual vocabulary space, resulting in mid-level features which we refer to as spectral graph wavelet codes (SGWC). It is important to point out that the vocabulary is computed offline by concatenating all the spectral graph wavelet signature matrices into a data matrix, followed by applying the K-means algorithm to find the data cluster centers.

In an effort to capture the spatial relations between features, we compute a global descriptor of each shape in terms of a geodesic exponential kernel and mid-level features, resulting in a SGWC-BoF matrix, which is then transformed into a SGWC-BoF vector by stacking its columns one underneath the other. The last stage of the proposed approach is to perform classification on the SGWC-BoF vectors using deep belief networks (DBNs). The flowchart of the proposed framework is depicted in Fig. 2. DBNs are highly effective supervised learning methods for classification. Broadly speaking, supervised learning algorithms consist of two main steps: a training step and a test step. In the training step, a classification model (classifier) is learned from the training data by a DBN learning algorithm. In the test step, the learned model is evaluated on a set of test data to predict the class labels and hence assess the classification accuracy.
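The two mid-level steps above (soft-assignment coding, then kernel-weighted aggregation and column stacking) can be sketched as follows. Everything here is a hypothetical stand-in: random "descriptors" in place of SGWS rows, a random vocabulary in place of the K-means centers, and a symmetric random matrix in place of true geodesic distances; the Gaussian soft-assignment weights and kernel bandwidths are illustrative choices, not the paper's parameter values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical inputs: n local descriptors per shape, a k-word
# vocabulary, and pairwise geodesic distances D between mesh points.
n, d, k = 50, 8, 10
S = rng.normal(size=(n, d))          # local descriptors (e.g. SGWS rows)
vocab = rng.normal(size=(k, d))      # cluster centers from K-means
D = np.abs(rng.normal(size=(n, n))); D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)             # stand-in for geodesic distances

def soft_assign(S, vocab, alpha=1.0):
    """Mid-level codes: each descriptor is embedded into the vocabulary
    space with Gaussian (soft-assignment) weights that sum to one."""
    d2 = ((S[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    U = np.exp(-alpha * d2)
    return U / U.sum(axis=1, keepdims=True)   # n x k codes

def global_descriptor(U, D, rho=1.0):
    """k x k matrix aggregating codes weighted by a geodesic
    exponential kernel, then stacked column-wise into one vector."""
    K = np.exp(-rho * D)
    F = U.T @ K @ U
    return F.ravel(order='F')  # column stacking

U = soft_assign(S, vocab)
f = global_descriptor(U, D)
print(U.shape, f.shape)  # (50, 10) (100,)
```

The resulting $k^2$-dimensional vector is what would be fed, per shape, to the DBN classifier.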

3.1. Spectral shape signatures

In recent years, several local descriptors based on the eigensystem of the LBO have been proposed in the 3D shape analysis literature, including the heat kernel signature (HKS) and wave kernel signature (WKS) [11,13]. Both HKS and WKS have an elegant physical interpretation: the HKS describes the amount of heat remaining at a mesh vertex j ∈ V after a certain time, whereas the WKS is the probability of measuring a quantum particle with the initial energy distribution at j. The HKS at a vertex j is defined as

s_{t_k}(j) = \sum_{\ell=1}^{m} e^{-\lambda_\ell t_k}\, \varphi_\ell^2(j),   (16)

where λ_ℓ and φ_ℓ are the eigenvalues and eigenfunctions of the LBO.

The HKS contains information mainly from low frequencies, which correspond to macroscopic features of the shape, and thus exhibits a major discrimination ability in shape retrieval tasks. With multiple scaling factors t_k, a collection of low-pass filters is established: the larger t_k is, the more high frequencies are suppressed. However, different frequencies are always mixed in the HKS, and high-precision localization tasks may fail due in part to the suppression of the high-frequency information, which corresponds to microscopic features.

Fig. 2. Flowchart of the proposed deep learning approach.

To circumvent these disadvantages, Aubry et al. [13] introduced the WKS, which is defined at a vertex j as follows:

s_{t_k}(j) = \sum_{\ell=1}^{m} C_{t_k} \exp\!\left( -\frac{(\log t_k - \log \lambda_\ell)^2}{2\sigma^2} \right) \varphi_\ell^2(j),   (17)

where C_{t_k} is a normalization constant. The WKS explicitly separates the influences of different frequencies, treating all frequencies equally. Thus, different spatial scales are naturally separated, making high-precision feature localization possible.

Given a range of discrete scales t_k, a bank of filters is constructed for each signature, and thus a vertex j on the mesh surface can be described by a p-dimensional point signature vector given by

s_j = \{ s_{t_k}(j) \mid k = 1, \dots, p \}, \quad \text{for } j = 1, \dots, m.   (18)
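As a concrete illustration, both signatures can be evaluated directly from an eigendecomposition of the LBO. The sketch below is a minimal NumPy version (not the authors' MATLAB implementation); it assumes uniform area weights and orthonormal eigenfunctions, and the energy grid `energies` and width `sigma` of the WKS are free parameters.

```python
import numpy as np

def hks(evals, evecs, times):
    """Heat kernel signature, Eq. (16): s_t(j) = sum_l exp(-lambda_l * t) phi_l(j)^2.

    evals: (m,) LBO eigenvalues; evecs: (m, m) eigenfunctions as columns;
    times: (p,) diffusion times t_k.  Returns a (p, m) signature matrix."""
    # (evecs**2).T has entry (l, j) = phi_l(j)^2; weight each row by exp(-lambda_l * t)
    return np.exp(-np.outer(times, evals)) @ (evecs ** 2).T

def wks(evals, evecs, energies, sigma):
    """Wave kernel signature, Eq. (17): log-normal band-pass filters of width sigma."""
    log_lam = np.log(np.maximum(evals, 1e-12))
    band = np.exp(-((energies[:, None] - log_lam[None, :]) ** 2) / (2 * sigma ** 2))
    band /= band.sum(axis=1, keepdims=True)  # normalization constants C
    return band @ (evecs ** 2).T
```

With a full orthonormal eigenbasis, the HKS at t = 0 reduces to Σ_ℓ φ_ℓ(j)² = 1 at every vertex, which makes a quick sanity check.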

3.2. Spectral graph wavelet transform

For any graph signal f : V → ℝ, the forward and inverse graph Fourier transforms (also called manifold harmonic and inverse manifold harmonic transforms) are defined as

\hat{f}(\ell) = \langle f, \varphi_\ell \rangle = \sum_{i=1}^{m} a_i f(i) \varphi_\ell(i), \quad \ell = 1, \dots, m   (19)

and

f(i) = \sum_{\ell=1}^{m} \hat{f}(\ell) \varphi_\ell(i) = \sum_{\ell=1}^{m} \langle f, \varphi_\ell \rangle \varphi_\ell(i), \quad i \in V,   (20)

respectively, where f̂(ℓ) is the value of f̂ at eigenvalue λ_ℓ (i.e. f̂(ℓ) = f̂(λ_ℓ)). In particular, the graph Fourier transform of a delta function δ_j centered at vertex j is given by

\hat{\delta}_j(\ell) = \sum_{i=1}^{m} a_i \delta_j(i) \varphi_\ell(i) = \sum_{i=1}^{m} a_i \delta_{ij} \varphi_\ell(i) = a_j \varphi_\ell(j).   (21)

The forward and inverse graph Fourier transforms may be expressed as matrix-vector multiplications as follows:

\hat{\mathbf{f}} = \Phi^\top A \mathbf{f} \quad \text{and} \quad \mathbf{f} = \Phi \hat{\mathbf{f}},   (22)

where f = (f(i)) and f̂ = (f̂(ℓ)) are m-dimensional vectors whose elements are given by (19) and (20), respectively.
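A sketch of (22) in NumPy, assuming Phi holds the eigenfunctions as columns and A is the diagonal matrix of vertex area weights a_i, with the A-orthonormality Φᵀ A Φ = I (for uniform weights, A = I and Φ is orthogonal):

```python
import numpy as np

def gft(f, Phi, A):
    """Forward graph Fourier transform, Eq. (22): f_hat = Phi^T A f."""
    return Phi.T @ (A @ f)

def igft(f_hat, Phi):
    """Inverse graph Fourier transform, Eq. (22): f = Phi f_hat."""
    return Phi @ f_hat
```

Under Φᵀ A Φ = I the two maps invert each other, i.e. `igft(gft(f, Phi, A), Phi)` recovers f.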



Wavelet function. The spectral graph wavelet transform is determined by the choice of a spectral graph wavelet generating kernel g : ℝ⁺ → ℝ⁺, which is analogous to the Fourier domain wavelet. To act as a band-pass filter, the kernel g should satisfy g(0) = 0 and lim_{x→∞} g(x) = 0.

Let g be a given kernel function and denote by T_g^t the wavelet operator at scale t. Similar to the Fourier domain, the graph Fourier transform of T_g^t f is given by

\widehat{T_g^t f}(\ell) = g(t\lambda_\ell)\, \hat{f}(\ell),   (23)

where g acts as a scaled band-pass filter. Thus, the inverse graph Fourier transform is given by

(T_g^t f)(i) = \sum_{\ell=1}^{m} \widehat{T_g^t f}(\ell)\, \varphi_\ell(i) = \sum_{\ell=1}^{m} g(t\lambda_\ell)\, \hat{f}(\ell)\, \varphi_\ell(i).   (24)

Applying the wavelet operator T_g^t to a delta function centered at vertex j (i.e. f(i) = δ_j(i) = δ_{ij}), the spectral graph wavelet ψ_{t,j} localized at vertex j and scale t is then given by

\psi_{t,j}(i) = (T_g^t \delta_j)(i) = \sum_{\ell=1}^{m} g(t\lambda_\ell)\, \hat{\delta}_j(\ell)\, \varphi_\ell(i) = \sum_{\ell=1}^{m} a_j\, g(t\lambda_\ell)\, \varphi_\ell(j)\, \varphi_\ell(i).   (25)

This indicates that shifting the wavelet to vertex j corresponds to a multiplication by φ_ℓ(j). It should be noted that g(tλ_ℓ) is able to modulate the spectral wavelets ψ_{t,j} only for λ_ℓ within the domain of the spectrum of the LBO. Thus, an upper bound on the largest eigenvalue λ_max is required to provide knowledge of the spectrum in practical applications.

Hence, the spectral graph wavelet coefficients of a given function f can be generated from its inner product with the spectral graph wavelets:

W_f(t, j) = \langle f, \psi_{t,j} \rangle = \sum_{\ell=1}^{m} a_j\, g(t\lambda_\ell)\, \hat{f}(\ell)\, \varphi_\ell(j).   (26)

Scaling function. Similar to the low-pass scaling functions in classical wavelet analysis, a second class of waveforms h : ℝ⁺ → ℝ is used as low-pass filters to better encode the low-frequency content of a function f defined on the mesh vertices. To act as a low-pass filter, the function h should satisfy h(0) > 0 and h(x) → 0 as x → ∞. Similar to the wavelet kernels, the scaling functions are given by

\phi_j(i) = (T_h \delta_j)(i) = \sum_{\ell=1}^{m} h(\lambda_\ell)\, \hat{\delta}_j(\ell)\, \varphi_\ell(i) = \sum_{\ell=1}^{m} a_j\, h(\lambda_\ell)\, \varphi_\ell(j)\, \varphi_\ell(i),   (27)

and their spectral coefficients are

S_f(j) = \langle f, \phi_j \rangle = \sum_{\ell=1}^{m} a_j\, h(\lambda_\ell)\, \hat{f}(\ell)\, \varphi_\ell(j).   (28)

A major advantage of using the scaling function is to ensure that the original signal f can be stably recovered when sampling the scale parameter t with a discrete number of values t_k. As demonstrated in [18], given a set of scales \{t_k\}_{k=1}^{K}, the set

F = \{\phi_j\}_{j=1}^{m} \cup \{\psi_{t_k, j}\}_{k=1, j=1}^{K,\, m}

forms a spectral graph wavelet frame with bounds

A = \min_{\lambda \in [0, \lambda_{\max}]} G(\lambda) \quad \text{and} \quad B = \max_{\lambda \in [0, \lambda_{\max}]} G(\lambda),   (29)

where

G(\lambda) = h(\lambda)^2 + \sum_{k} g(t_k \lambda)^2.   (30)

The stable recovery of f is ensured when A and B are bounded away from zero. Additionally, the crux of the scaling function is to smoothly represent the low-frequency content of the signal on the mesh. Thus, the design of the scaling function h is decoupled from the choice of the wavelet generating kernel g.
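The bounds in (29)-(30) are easy to evaluate numerically on a grid over [0, λ_max]. The sketch below assumes the common SGWT Mexican hat form g(x) = x e^{-x} and a simple low-pass h; both are illustrative choices, not necessarily the exact kernels used in the paper:

```python
import numpy as np

def g_mexican_hat(x):
    """Band-pass generating kernel: g(0) = 0 and g(x) -> 0 as x -> infinity."""
    return x * np.exp(-x)

def h_lowpass(x):
    """Low-pass scaling kernel: h(0) > 0 and h(x) -> 0 as x -> infinity."""
    return np.exp(-x ** 4)

def frame_bounds(g, h, scales, lam_max, n_grid=2000):
    """Frame bounds of Eq. (29): A = min G, B = max G on [0, lam_max],
    with G(lam) = h(lam)^2 + sum_k g(t_k * lam)^2 as in Eq. (30)."""
    lam = np.linspace(0.0, lam_max, n_grid)
    G = h(lam) ** 2 + sum(g(t * lam) ** 2 for t in scales)
    return G.min(), G.max()
```

A strictly positive A certifies stable recovery of f from its wavelet and scaling-function coefficients.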

3.3. Local descriptors

Wavelets are useful in describing functions at different levels of resolution. To characterize the localized context around a mesh vertex j ∈ V, we assume that the signal on the mesh is a unit impulse function, that is f(i) = δ_j(i) at each mesh vertex i ∈ V. Thus, it follows from (24) that the spectral graph wavelet coefficients are

W_{\delta_j}(t, j) = \langle \delta_j, \psi_{t,j} \rangle = \sum_{\ell=1}^{m} a_j^2\, g(t\lambda_\ell)\, \varphi_\ell^2(j),   (31)

and that the coefficients of the scaling function are

S_{\delta_j}(j) = \sum_{\ell=1}^{m} a_j^2\, h(\lambda_\ell)\, \varphi_\ell^2(j).   (32)

Following the multiresolution analysis, the spectral graph wavelet and scaling function coefficients are collected to form the spectral graph wavelet signature at vertex j as follows:

s_j = \{ s_L(j) \mid L = 1, \dots, R \},   (33)

where R is the resolution parameter, and s_L(j) is the shape signature at resolution level L given by

s_L(j) = \{ W_{\delta_j}(t_k, j) \mid k = 1, \dots, L \} \cup \{ S_{\delta_j}(j) \}.   (34)

The wavelet scales t_k (t_k > t_{k+1}) are selected to be logarithmically equispaced between the maximum and minimum scales t_1 and t_L, respectively. Thus, the resolution level L determines the resolution of scales used to modulate the spectrum. At resolution R = 1, the spectral graph wavelet signature s_j is a 2-dimensional vector consisting of two elements: one element, W_{δ_j}(t_1, j), of spectral graph wavelet function coefficients and another element, S_{δ_j}(j), of scaling function coefficients. At resolution R = 2, the spectral graph wavelet signature s_j is a 5-dimensional vector consisting of five elements (four elements of spectral graph wavelet function coefficients and one element of scaling function coefficients). In general, the dimension of a spectral graph wavelet signature s_j at vertex j can be expressed in terms of the resolution R as follows:

p = \frac{(R+1)(R+2)}{2} - 1.   (35)

Hence, for a p-dimensional signature s_j, we define a p × m spectral graph wavelet signature matrix as S = (s_1, …, s_m), where s_j is the signature at vertex j and m is the number of mesh vertices. In our implementation, we used the Mexican hat wavelet as the kernel generating function g. In addition, we used the scaling function h given by

h(x) = c \exp\!\left( -\left( \frac{x}{0.6\, \lambda_{\min}} \right)^{4} \right),   (36)

where λ_min = λ_max/20 and c is set such that h(0) has the same value as the maximum value of g. The maximum and minimum scales are set to t_1 = 2/λ_min and t_L = 2/λ_max.
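Putting (31)-(36) together, the signature construction can be sketched as follows. This is a minimal NumPy version under stated assumptions: uniform area weights a_j = 1, and g(x) = x e^{-x} as an illustrative stand-in for the paper's Mexican hat kernel:

```python
import numpy as np

def sgw_signature(evals, evecs, R, g, h):
    """p x m spectral graph wavelet signature matrix, Eqs. (31)-(35).

    For each resolution level L = 1..R: L wavelet rows W_dj(t_k, j) plus one
    scaling-function row S_dj(j), so p = (R + 1)(R + 2)/2 - 1 rows in total."""
    lam_max = evals.max()
    lam_min = lam_max / 20.0
    phi2 = evecs ** 2                      # entry (j, l) = phi_l(j)^2
    rows = []
    for L in range(1, R + 1):
        # t_k logarithmically equispaced between t_1 = 2/lam_min and t_L = 2/lam_max
        scales = np.geomspace(2.0 / lam_min, 2.0 / lam_max, L)
        for t in scales:
            rows.append(phi2 @ g(t * evals))   # Eq. (31) with a_j = 1
        rows.append(phi2 @ h(evals))           # Eq. (32) with a_j = 1
    S = np.vstack(rows)
    assert S.shape[0] == (R + 1) * (R + 2) // 2 - 1   # dimension check, Eq. (35)
    return S
```

For R = 2 this yields the 5 × m signature matrix used in the experiments below.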

The geometry captured at each resolution R of the spectral graph wavelet signature can be viewed as the area under the curve G shown in Fig. 3. For a given resolution R, we can interpret the information from a specific range of the spectrum as its associated areas under G. As the resolution R increases, the partition of the spectrum becomes tighter, and thus a larger portion of the spectrum is highly weighted.


3.4. Mid-level features

Broadly speaking, the BoF model aggregates local descriptors of a shape in an effort to provide a simple representation that may be used to facilitate comparison between shapes. We model each 3D shape as a triangle mesh M with m vertices. As illustrated in Fig. 4, the BoF model consists of four main steps: feature extraction and description, codebook design, feature coding and feature pooling.

Feature extraction and description. In the BoF paradigm, a 3D shape M is represented as a collection of m local descriptors of the same dimension p, where the order of the different feature vectors is of no importance. Local descriptors may be classified into two main categories: dense and sparse. Dense descriptors are computed at each point (vertex) of the shape, while sparse descriptors are computed by identifying a set of salient points using a feature detection algorithm. In our proposed framework, we represent M by a p × m matrix S = (s_1, …, s_m) of spectral graph wavelet signatures, where each p-dimensional feature vector s_i is a dense, local descriptor that encodes the local structure around the ith vertex of M.

Fig. 3. Spectrum modulation using different kernel functions at various resolutions: (a) heat kernel, (b) wave kernel, (c)-(h) Mexican hat kernel for R = 1, …, 6. The dark line is the squared sum function G, while the dash-dotted and the dotted lines are upper and lower bounds (B and A) of G, respectively.

Codebook design. A codebook (or visual vocabulary) is constructed via clustering by quantizing the m local descriptors (i.e. spectral graph wavelet signatures) into a certain number of codewords. These codewords are usually defined as the centers v_1, …, v_k of k clusters obtained by applying an unsupervised learning algorithm (e.g., vector quantization via K-means clustering) to the signature matrix S. The codebook is the set



Fig. 4. Flow of the bag-of-features model.


V = {v_1, …, v_k} of size k, which may be represented by a p × k vocabulary matrix V = (v_1, …, v_k).

Feature coding. The goal of feature coding is to embed local descriptors in the vocabulary space. Each spectral graph wavelet signature s_i is mapped to a codeword v_r in the codebook via the k × m cluster soft-assignment matrix U = (u_{ri}) = (u_1, …, u_m), whose elements are given by

u_{ri} = \frac{\exp(-\alpha \|s_i - v_r\|_2^2)}{\sum_{\ell=1}^{k} \exp(-\alpha \|s_i - v_\ell\|_2^2)},   (37)

where α is a smoothing parameter that controls the softness of the assignment. Unlike hard-assignment coding, in which a local descriptor is assigned to the nearest cluster, soft-assignment coding assigns descriptors to every cluster center with different probabilities in an effort to improve the quantization properties of the coding step. We refer to the coefficient vector u_i as the spectral graph wavelet code (SGWC) of the descriptor s_i, with u_{ri} being the coefficient with respect to the codeword v_r.

Histogram representation (feature pooling). Each spectral graph wavelet signature is mapped to a certain codeword through the clustering process, and the shape is then represented by the histogram h of the codewords, which is a k-dimensional vector given by

h = U \mathbf{1}_m = (h_r)_{r=1,\dots,k},   (38)

where h_r = \sum_{i=1}^{m} u_{ri}. That is, each entry of the histogram is the sum of the corresponding row of the cluster assignment matrix U, aggregating the assignment coefficients of a codeword over all vertices. Other feature pooling methods include average- and max-pooling. In general, a feature vector is given by h = P(U), where P is a predefined pooling function that aggregates the information of different codewords into a single feature vector.

3.5. Global descriptors

A major drawback of the BoF model is that it only considers the distribution of the codewords and disregards all information about the spatial relations between features; hence, the descriptive ability and discriminative power of the BoF paradigm may be negatively impacted. To circumvent this limitation, various solutions have recently been proposed, including the spatially sensitive bags of features (SS-BoF) [5] and geodesic-aware bags of features (GA-BoF) [20]. The SS-BoF, which is defined in terms of mid-level features and the heat kernel, can be represented by a square matrix whose elements represent the frequency of appearance of nearby codewords in the vocabulary. Similarly, the GA-BoF matrix is obtained by replacing the heat kernel in the SS-BoF with a geodesic exponential kernel. Unlike the heat kernel, which is time-dependent, the geodesic exponential kernel avoids the possible effect of time scale and shape size [20]. In the same vein, we define a global descriptor of a shape as a k × k SGWC-BoF matrix F defined in terms of spectral graph wavelet codes and a geodesic exponential kernel as follows:

F = U K U^\top,   (39)

where U is the k × m matrix of spectral graph wavelet codes (i.e. mid-level features), and K = (κ_{ij}) is an m × m geodesic exponential kernel matrix whose elements are given by

\kappa_{ij} = \exp\!\left( -\frac{d_{ij}^2}{\epsilon} \right),   (40)

with d_{ij} denoting the geodesic distance between any pair of mesh vertices v_i and v_j, and ε is a positive, carefully chosen parameter that determines the width of the kernel. Intuitively, the parameter ε controls the linearity of the kernel function, i.e. the larger the width, the more linear the function. It is worth pointing out that the proposed DeepSGW is similar in spirit to GA-BoF, albeit in our work we use multiresolution local descriptors that may be regarded as generalized signatures for those in [5,20]. In addition, our spectral graph wavelet signature combines the advantages of both band-pass and low-pass filters.
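Given the code matrix U and a matrix D of pairwise geodesic distances, the global descriptor of (39)-(40) is two matrix products followed by column stacking. A sketch (the exact placement of ε inside the exponential is an assumption):

```python
import numpy as np

def sgwc_bof(U, D, eps):
    """SGWC-BoF global descriptor, Eqs. (39)-(40).

    U: k x m matrix of spectral graph wavelet codes; D: m x m matrix of
    pairwise geodesic distances d_ij; eps: kernel width (its placement in
    the exponent is an assumption).  Returns a k^2-dimensional vector."""
    K = np.exp(-(D ** 2) / eps)      # geodesic exponential kernel, Eq. (40)
    F = U @ K @ U.T                  # k x k nearby-codeword frequencies, Eq. (39)
    return F.flatten(order="F")      # stack the columns one underneath the other
```

Since K is symmetric, F = U K Uᵀ is symmetric as well, which is a convenient sanity check on an implementation.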

3.6. Deep belief networks

A DBN is a probabilistic, generative model consisting of multiple layers of RBMs stacked on top of each other, starting with the visible (input) layer and the first hidden layer, which together form the first RBM. It is made up of a visible layer v and S hidden layers h_s, s = 1, …, S, with the number of RBMs also equal to S, which can be determined empirically to obtain the best model performance. Each RBM is trained in a greedy layer-wise manner, with the hidden layer of the sth RBM acting as a visible layer for the (s+1)th RBM, as shown in Fig. 5.

A DBN consists of two main learning phases: pre-training and fine-tuning. Pre-training is an unsupervised phase that learns the weights (and biases) between layers from the bottom up, i.e. one layer at a time using an RBM on each layer. Pre-training treats the current layer as the hidden units of an RBM and the previous layer as the visible units of the same RBM. The pre-training starts by training the first RBM to obtain features in the first hidden layer from the training (input) data. In subsequent layers, the hidden activations of the previous layer are used as input data, i.e. the learned feature activations of one RBM are used as the input data for training the next RBM in the stack. Features at different layers contain different information about the data, with higher-level features constructed from lower-level features. This greedy, layer-wise training is iteratively performed until reaching the top hidden layer. To speed up the pre-training, it is common practice to subdivide the input data into mini-batches and to update the weights after each mini-batch. The fine-tuning, on the other hand, is a supervised, discriminative phase that fine-tunes the model parameters (weights and biases) at the top layer by backpropagating error derivatives.
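The greedy layer-wise phase can be sketched as a loop in which each Bernoulli RBM is trained with one step of contrastive divergence (CD-1) and its hidden activations become the next layer's visible data. This is a didactic simplification of DBN pre-training; the learning rate, epoch count and CD-1 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one RBM with CD-1; return its parameters and hidden activations."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)     # hidden biases
    c = np.zeros(n_visible)    # visible biases
    for _ in range(epochs):
        h0 = sigmoid(data @ W + b)                            # positive phase
        h_sample = (h0 > rng.random(h0.shape)).astype(float)  # sample hidden units
        v1 = sigmoid(h_sample @ W.T + c)                      # reconstruction
        h1 = sigmoid(v1 @ W + b)                              # negative phase
        W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
        b += lr * (h0.mean(axis=0) - h1.mean(axis=0))
        c += lr * (data.mean(axis=0) - v1.mean(axis=0))
    return (W, b), sigmoid(data @ W + b)

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM's hidden activations
    serve as the visible data for the next RBM in the stack."""
    params, x = [], data
    for n_hidden in layer_sizes:
        p, x = train_rbm(x, n_hidden)
        params.append(p)
    return params
```

The returned parameters would then initialize the layers before supervised fine-tuning by backpropagation.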

For classification tasks, an output layer y = (y_k) of K classes (units) is added on top of the stacked RBMs learned in the first phase to construct a discriminative model, where each output node of the softmax layer corresponds to a single unique class.

Fig. 5. DBN architecture with three RBMs stacked on top of each other.

The output (softmax) layer acts as a classifier, and is trained using labeled data. Each output node is represented by the output probability of each class label, and the probabilities all sum up to 1. The node with the largest probability is usually used to predict the class of an instance (example) in the test set, and hence to compute the classification error/accuracy. More precisely, each output node y_k is represented by a probability p_k given by the softmax activation function

p_k = \frac{e^{a_k}}{\sum_{k'=1}^{K} e^{a_{k'}}}, \quad \text{with} \quad a_k = \sum_{j} w_{jk} h_j,   (41)

where h = (h_j) is the top hidden layer and W = (w_{jk}) is a weight matrix of symmetric connections between the top hidden layer

and the softmax layer. The predicted class k̂ is then given by

\hat{k} = \arg\max_{k} p_k = \arg\max_{k} a_k.   (42)

It should be noted that the softmax activation function is a generalization of the logistic function (it reduces to the logistic function when there are only two classes). The purpose of the softmax function is to provide an estimate of the posterior probability of each class, i.e. the probability that an instance belongs to a particular class, given the data.
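Numerically, (41)-(42) are a few lines; the max-shift below is a standard stability trick, and since exp is monotone, the argmax of p_k coincides with that of a_k:

```python
import numpy as np

def softmax_predict(h, W):
    """Eqs. (41)-(42): p_k = exp(a_k) / sum_k' exp(a_k') with a = h^T W;
    the predicted class maximizes p_k (equivalently a_k)."""
    a = h @ W
    p = np.exp(a - a.max())   # subtract the max for numerical stability
    p /= p.sum()
    return p, int(np.argmax(p))
```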

3.7. Proposed algorithm

Shape classification is a supervised learning task that assigns shapes in a dataset to target classes. The objective of 3D shape classification is to accurately predict the target class of each 3D shape in the dataset. Our proposed 3D shape classification algorithm consists of four main steps. The first step is to represent each 3D shape in the dataset by a spectral graph wavelet signature matrix, which is a feature matrix consisting of local descriptors. More specifically, let D be a dataset of n shapes modeled by triangle meshes M_1, …, M_n. We represent each 3D shape in the dataset D by a p × m spectral graph wavelet signature matrix S = (s_1, …, s_m) ∈ ℝ^{p×m}, where s_i is the p-dimensional local descriptor at vertex i and m is the number of mesh vertices.

In the second step, the spectral graph wavelet signatures s_i are mapped to higher-dimensional mid-level feature vectors using the soft-assignment coding step of the BoF model, resulting in a k × m matrix U = (u_1, …, u_m) whose columns are the k-dimensional mid-level feature codes (i.e. SGWC). In the third step, the k × k SGWC-BoF matrix F is computed using the mid-level feature codes matrix and a geodesic exponential kernel, followed by reshaping F into a k²-dimensional global descriptor x_i. In the fourth step, the SGWC-BoF vectors x_i of all n shapes in the dataset are arranged into a k² × n data matrix X = (x_1, …, x_n). Finally, a DBN classifier is trained on the data matrix X to separate the data points of each class from those of the other classes.

The task in multiclass classification is to assign a class label to each input example. More precisely, given training data of the form X_train = {(x_i, y_i)}, where x_i ∈ ℝ^{k²} is the ith example (i.e. SGWC-BoF vector) and y_i ∈ {1, …, K} is its class label, we aim at finding a learning model that contains the optimized parameters from the DBN classification algorithm. Then, the trained DBN model is applied to test data X_test, resulting in predicted labels ŷ_i for the new data. These predicted labels are subsequently compared to the labels of the test data to evaluate the classification accuracy of the model.

To assess the performance of the proposed DeepSGW framework, we employed two commonly used evaluation criteria, the confusion matrix and the accuracy, which will be discussed in more detail in the next section. The main algorithmic steps of our approach are summarized in Algorithm 1.

Algorithm 1. DeepSGW Algorithm

Input: Dataset D = {M_1, …, M_n} of n 3D shapes
Output: n-dimensional vector y containing predicted class labels for each 3D shape
1: for i = 1 to n do
2:   Compute the p × m spectral graph wavelet signature matrix S_i of each shape M_i
3:   Apply soft-assignment coding to find the k × m mid-level feature matrix U_i, where k > p
4:   Compute the k × k SGWC-BoF matrix F_i, and reshape it into a k²-dimensional vector x_i
5: end for
6: Arrange all the n SGWC-BoF vectors into a k² × n data matrix X = (x_1, …, x_n)
7: Perform DBN classification on X to find the n-dimensional vector y of predicted class labels.

Remark. It is important to point out that in our implementation the vocabulary is computed offline by applying the K-means algorithm to the p × mn matrix obtained by concatenating all SGWS matrices of all n meshes in the dataset. As a result, the vocabulary is a matrix V of size p × k, where k > p.
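The offline codebook step in the Remark can be sketched with a plain K-means loop (Lloyd's algorithm written out for self-containedness; any K-means implementation would do):

```python
import numpy as np

def build_vocabulary(signature_matrices, k, iters=25, seed=0):
    """Offline vocabulary: K-means on the p x mn concatenation of all
    SGWS matrices; returns the p x k matrix V of cluster centers."""
    X = np.hstack(signature_matrices).T            # mn points of dimension p
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each signature to its nearest center
        labels = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1).argmin(1)
        for r in range(k):
            if np.any(labels == r):                # leave empty clusters unchanged
                V[r] = X[labels == r].mean(axis=0)
    return V.T                                     # p x k vocabulary matrix
```

The resulting V then feeds the soft-assignment coding of Eq. (37) for every shape in the dataset.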


Fig. 6. Sample shapes from SHREC 2010 (left) and SHREC 2011 (right).


Fig. 7. Confusion matrix for SHREC 2010 using the proposed DeepSGW approach.

Fig. 8. Spectral graph wavelet codes of two shapes (gorilla and hand) from two different classes of the SHREC-2010 dataset.

Table 1
Classification accuracy results on the SHREC-2010 dataset. Boldface number indicates the best classification performance.

Method          Average accuracy (%)
Shape-DNA       90.96
cShape-DNA      92.90
GPS-embedding   88.87
F1-features     86.49
F2-features     84.11
F3-features     87.72
GA-BoF          86.02
DeepSGW         96.00


4. Experimental results

In this section, we conduct extensive experiments on the 3D shape classification problem to evaluate the proposed DeepSGW approach. The effectiveness of our method is validated by performing a comprehensive comparison with several state-of-the-art methods.

Datasets. The performance of the proposed DeepSGW framework is evaluated on two standard and publicly available 3D shape benchmarks: SHREC 2010 and SHREC 2011. Sample shapes from these two benchmarks are shown in Fig. 6.

Performance evaluation measures. In practice, the available labeled data X for classification is usually split into two disjoint subsets: the training set X_train for learning, and the test set X_test for testing. The training and test sets are usually selected by randomly sampling a set of training instances from X for learning and using the rest of the instances for testing. The performance of a classifier is then assessed by applying it to test data with known target values and comparing the predicted values with the known values. One important way of evaluating the performance of a classifier is to compute its confusion matrix (also called contingency table), which is a K × K matrix that displays the number of correct and incorrect predictions made by the classifier compared with the actual classifications in the test set, where K is the number of classes.

Another intuitively appealing measure is the classification accuracy, which is a summary statistic that can easily be computed from the confusion matrix as the total number of correctly classified instances (i.e. the diagonal elements of the confusion matrix) divided by the total number of test instances. Alternatively, the accuracy of a classification model on a test set may be defined as follows:

\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test cases}} = \frac{|\{x : x \in X_{\text{test}} \wedge \hat{y}(x) = y(x)\}|}{|\{x : x \in X_{\text{test}}\}|},   (43)

where y(x) is the actual (true) label of x, and ŷ(x) is the label predicted by the classification algorithm. A correct classification means

Fig. 9. Confusion matrix for SHREC 2011 using the proposed DeepSGW approach.

that the learned model predicts the same class as the original class of the test case. The error rate is equal to one minus the accuracy.
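Both evaluation measures reduce to a few lines; the sketch below assumes integer class labels 0, …, K-1:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, K):
    """K x K contingency table: rows are actual classes, columns predictions."""
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def accuracy(C):
    """Eq. (43): correctly classified instances (the diagonal) over all test cases."""
    return np.trace(C) / C.sum()
```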

Baseline methods. For each of the 3D shape benchmarks used for experimentation, we report the comparison results of our method against various state-of-the-art methods, including Shape-DNA [3], compact Shape-DNA [10], GPS embedding [8], GA-BoF [20], and F1-, F2-, and F3-features [30]. The latter features, which are defined in terms of the Laplacian matrix eigenvalues, were shown to have good inter-class discrimination capabilities in 2D shape recognition [10], but they can easily be extended to 3D shape analysis using the eigenvalues of the LBO.

Implementation details. The experiments were conducted on a desktop computer with an Intel Core i5 processor running at 3.10 GHz and 8 GB RAM, and all the algorithms were implemented in MATLAB. The appropriate dimension (i.e. length or number of

Table 2
Classification accuracy results on the SHREC-2011 dataset. Boldface numbers indicate the best classification performance.

Method          Average accuracy (%)
Shape-DNA       92.89
cShape-DNA      94.41
GPS-embedding   88.40
F1-features     91.90
F2-features     89.47
F3-features     92.48
GA-BoF          93.20
DeepSGW         99.70

features) of a shape signature is problem-dependent and usually determined experimentally. For fair comparison, we used the same parameters that were employed in the baseline methods, and in particular the dimensions of the shape signatures. In our setup, a total of 201 eigenvalues and associated eigenfunctions of the LBO were computed. For the proposed approach, we set the resolution parameter to R = 2 (i.e. the spectral graph wavelet signature matrix is of size 5 × m, where m is the number of mesh vertices) and the kernel width of the geodesic exponential kernel to ε = 0.1. We used a DBN architecture consisting of two hidden layers. The first hidden layer has 400 units, while the second hidden layer contains 800 units. Each layer of hidden units learns to represent features that capture higher-order correlations in the original input data. Moreover, the parameter of the soft-assignment coding is computed as α = 1/(8l²), where l is the median size of the clusters in the vocabulary [5]. For Shape-DNA, GPS embedding, and F1-, F2-, and F3-features, the selected number of retained eigenvalues equals 10. As suggested in [10], the dimension of the compact Shape-DNA signature was set to 33.

4.1. SHREC-2010 dataset

SHREC 2010 is a dataset of 3D shapes consisting of 200 watertight mesh models from 10 classes [31]. These models are selected from the McGill Articulated Shape Benchmark dataset. Each class contains 20 objects with distinct postures. Moreover, each shape in the dataset has approximately m = 1002 vertices.

Performance evaluation. We randomly selected 50% of the shapes in the SHREC-2010 dataset to hold out as the test set, and the


Fig. 10. Training on the SHREC-2011 dataset. Learned weights on DBN first layer (left). Learned weights on DBN second layer (right).

Fig. 11. First 64 training examples computed by DBN on the SHREC-2011 dataset.


remaining shapes for training. That is, the test data consists of 100 shapes. A DBN with two hidden layers is first trained on the training data to learn the model (i.e. the classifier), which is subsequently used on the test data with known target values in order to predict the class labels. Fig. 7 displays the confusion matrix for SHREC 2010 on the test data. This 10 × 10 confusion matrix shows how the predictions are made by the model. Its rows correspond to the actual (true) class of the data (i.e. the labels in the data), while its columns correspond to the predicted class (i.e. the predictions made by the model). The value of each element in the confusion matrix is the number of predictions made with the class corresponding to the column for instances with the correct value represented by the row. Thus, the diagonal elements show the number of correct classifications made for each class, and the off-diagonal elements show the errors made. As shown in Fig. 7, the proposed DeepSGW framework was able to classify all shapes of the test data with high accuracy, except the hand and human models, which were misclassified only once as crab and hand, respectively. In addition, the octopus model was misclassified once as a human and also once as a crab. Such a good performance strongly suggests that our method captures well the discriminative features of the shapes.

Results. In our DeepSGW approach, each 3D shape in the SHREC-2010 dataset is represented by a 5 × 1002 matrix of spectral graph wavelet signatures. Setting the number of codewords to k = 48, we computed offline a 5 × 48 vocabulary matrix V via K-means clustering. The soft-assignment coding of the BoF model yields a 48 × 1002 matrix U of spectral graph wavelet codes, resulting in a SGWC-BoF data matrix X of size 48² × 200. Fig. 8 shows the spectral graph wavelet code matrices of two shapes (gorilla and hand) from two different classes of SHREC 2010. As can be seen, the global descriptors are quite different, and hence they may be used efficiently to discriminate between shapes in classification tasks.

210 M. Masoumi, A. Ben Hamza / J. Vis. Commun. Image R. 43 (2017) 198–211

We compared the DeepSGW method to Shape-DNA, compact Shape-DNA (cShape-DNA), GPS embedding, GA-BoF, and the F1-, F2-, and F3-features. To obtain reliable results, we repeated the experimental process 10 times with different randomly selected training and test sets, recorded the accuracy of each run, and then selected the best result of each method. The classification accuracy results are summarized in Table 1, which shows the results of the baseline methods and the proposed framework. As can be seen, our method performs better than Shape-DNA, compact Shape-DNA, GPS embedding, GA-BoF, and the F1-, F2-, and F3-features: the DeepSGW approach yields the highest classification accuracy of 96.00%, with performance improvements of 3.10% and 5.04% over the best baseline methods, cShape-DNA and Shape-DNA, respectively. To speed up the experiments, all shape signatures were computed offline, although their computation is quite inexpensive, due in large part to the fact that only a relatively small number of eigenvalues of the LBO need to be calculated.
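The evaluation protocol just described (repeated random splits, best run reported) can be sketched as follows; the nearest-centroid "classifier" and the helper repeated_split_accuracy are hypothetical stand-ins for the DBN pipeline:

```python
import numpy as np

def repeated_split_accuracy(X, y, train_frac, n_runs, fit, predict, seed=0):
    """Evaluate a classifier over n_runs random train/test splits and
    return the accuracy of every run."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        n_train = int(train_frac * len(y))
        tr, te = idx[:n_train], idx[n_train:]
        model = fit(X[tr], y[tr])
        accs.append(np.mean(predict(model, X[te]) == y[te]))
    return accs

# Toy demo: a nearest-centroid classifier on trivially separable data.
X = np.vstack([np.zeros((50, 2)), np.ones((50, 2))])
y = np.array([0] * 50 + [1] * 50)
fit = lambda Xtr, ytr: np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
predict = lambda C, Xte: np.argmin(((Xte[:, None, :] - C) ** 2).sum(-1), axis=1)
accs = repeated_split_accuracy(X, y, train_frac=0.5, n_runs=10, fit=fit, predict=predict)
print(max(accs))  # best run, analogous to the figures reported in Table 1
```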

4.2. SHREC-2011 dataset

SHREC 2011 is a dataset of 3D shapes consisting of 600 watertight mesh models, which are obtained by transforming 30 original models [32]. Each shape in the dataset has approximately m = 1502 vertices.

Performance evaluation. We randomly selected 50% of the shapes in the SHREC-2011 dataset to hold out as the test set, and the remaining shapes for training. That is, the test data consists of 300 shapes. First, we train a DBN with two hidden layers on the training data to learn the classification model. Then, we use the resulting trained model on the test data to predict the class labels. With the exception of the cat model, which was misclassified once as a hand, and the bird2 model, which was misclassified twice as bird1, the proposed DeepSGW approach was able to accurately classify all shapes in the test data, as shown in Fig. 9.
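The DBN is built by stacking restricted Boltzmann machines trained greedily, layer by layer [27]. The following minimal sketch shows a single contrastive-divergence (CD-1) weight update for one Bernoulli RBM layer; bias terms and the supervised fine-tuning stage are omitted, and this is a stand-in rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer sizes; the text reports 400 and 800 hidden units.
n_visible, n_hidden, lr = 64, 32, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))

def cd1_step(v0, W):
    """One CD-1 update: contrast the data-driven statistics against the
    statistics of a one-step Gibbs reconstruction."""
    ph0 = sigmoid(v0 @ W)                     # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample hidden states
    pv1 = sigmoid(h0 @ W.T)                   # reconstruction P(v = 1 | h0)
    ph1 = sigmoid(pv1 @ W)                    # hidden probabilities on reconstruction
    return W + lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)

batch = (rng.random((16, n_visible)) < 0.5) * 1.0  # toy binary minibatch
W = cd1_step(batch, W)
print(W.shape)  # (64, 32)
```

Once the first RBM is trained, its hidden activations become the visible data for the next RBM, and the stacked weights initialize the classifier that is fine-tuned on the labeled training shapes.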

Results. Similar to the previous experiment, each 3D shape in the SHREC-2011 dataset is represented by a 5 × 1502 spectral graph wavelet signature matrix. We pre-computed offline a vocabulary of size k = 48. The soft-assignment coding yields a 48 × 1502 matrix U of mid-level features; hence, the SGWC-BoF data matrix X for SHREC 2011 is of size 48² × 600. We repeated the experimental process 10 times with different randomly selected training and test data in a bid to obtain reliable results, and recorded the accuracy of each run. The average accuracy results are reported in Table 2. As can be seen, the DeepSGW approach outperforms all seven baseline methods used for comparison. The highest classification accuracy of 99.79% corresponds to our method, with performance improvements of 6.50% and 5.29% over the best-performing baseline methods, GA-BoF and cShape-DNA, respectively. It is worth pointing out that our algorithm performs on par with the 3D-DL method [20], while performing 6.5% better than GA-BoF on the SHREC-2011 dataset. A major advantage of our approach over 3D-DL is its ability to capture both microscopic and macroscopic features via spectral graph wavelets, resulting in higher classification accuracy.

Fig. 10 shows the weights learned by the first and second layers of the DBN for the SHREC-2011 dataset. The first layer consists of 400 units, while the second layer contains 800 units. Each square displays the incoming weights from all the visible units into one hidden unit; white encodes a positive weight and black encodes a negative weight. Fig. 11 shows the first 64 training examples computed by the DBN on the SHREC-2011 dataset.

Parameter sensitivity. The proposed approach depends on two key parameters that affect its overall performance. The first is the width ε of the geodesic exponential kernel; the second is the vocabulary size k, which plays an important role in the SGWC-BoF matrix. As shown in Fig. 12, the best classification accuracy on SHREC 2011 is achieved using ε = 0.1 and k = 48. In addition, the classification performance of the proposed method remains satisfactory over a wide range of parameter values, indicating the robustness of the DeepSGW framework to the choice of these parameters. We also tested the effect of the resolution parameter R on the classification accuracy of the proposed approach. As can be seen in Fig. 12, the best classification accuracy on SHREC 2011 is achieved when R = 2.

Fig. 12. Effects of the parameters on the classification accuracy for SHREC 2011.
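A parameter sweep of this kind can be sketched as follows; evaluate is a synthetic placeholder peaked at the values reported above, standing in for a full train-and-test cycle at each grid point:

```python
import numpy as np
from itertools import product

# Hypothetical grids mirroring the sweep of Fig. 12.
eps_grid = [0.01, 0.05, 0.1, 0.5, 1.0]   # geodesic kernel width
k_grid = [16, 32, 48, 64, 128, 256]      # vocabulary size

def evaluate(eps, k):
    """Placeholder score: in practice this would rebuild the SGWC-BoF
    matrix with (eps, k) and run the DBN classifier."""
    return -((np.log10(eps) + 1) ** 2) - ((k - 48) / 100) ** 2

best = max(product(eps_grid, k_grid), key=lambda p: evaluate(*p))
print(best)  # (0.1, 48)
```

Since the vocabulary only needs to be re-clustered when k changes, the sweep over ε can reuse each vocabulary, which keeps the grid search inexpensive.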

5. Conclusions

In this paper, we presented a deep learning approach to 3D shape classification using spectral graph wavelets and the bag-of-features paradigm. This approach not only captures the similarity between feature descriptors via a geodesic exponential kernel, but also substantially outperforms state-of-the-art methods in both classification accuracy and scalability. For future work, we plan to apply the proposed DeepSGW approach to other 3D shape analysis problems, in particular segmentation and clustering.

References

[1] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Machine Learn. Res. 7 (2006) 2399–2434.

[2] B. Lévy, Laplace-Beltrami eigenfunctions: towards an algorithm that "understands" geometry, in: Proc. IEEE Int. Conf. Shape Modeling and Applications, 2006, p. 13.

[3] M. Reuter, F. Wolter, N. Peinecke, Laplace-Beltrami spectra as 'Shape-DNA' of surfaces and solids, Comput.-Aid. Des. 38 (4) (2006) 342–366.

[4] R. Rustamov, Laplace-Beltrami eigenfunctions for deformation invariant shape representation, in: Proc. Symp. Geometry Processing, 2007, pp. 225–233.

[5] A. Bronstein, M. Bronstein, L. Guibas, M. Ovsjanikov, Shape Google: geometric words and expressions for invariant shape retrieval, ACM Trans. Graphics 30 (1) (2011).

[6] C. Li, A. Ben Hamza, A multiresolution descriptor for deformable 3D shape retrieval, Visual Comput. 29 (2013) 513–524.

[7] D. Pickup, X. Sun, P. Rosin, R. Martin, Z. Cheng, Z. Lian, M. Aono, A. Ben Hamza, A. Bronstein, M. Bronstein, S. Bu, U. Castellani, S. Cheng, V. Garro, A. Giachetti, A. Godil, J. Han, H. Johan, L. Lai, B. Li, C. Li, H. Li, R. Litman, X. Liu, Z. Liu, Y. Lu, A. Tatsuma, J. Ye, SHREC'14 track: shape retrieval of non-rigid 3D human models, in: Proc. Eurographics Workshop on 3D Object Retrieval, 2014, pp. 1–10.

[8] A. Chaudhari, R. Leahy, B. Wise, N. Lane, R. Badawi, A. Joshi, Global point signature for shape analysis of carpal bones, Phys. Med. Biol. 59 (2014) 961–973.

[9] K. Tarmissi, A. Ben Hamza, Information-theoretic hashing of 3D objects using spectral graph theory, Expert Syst. Appl. 36 (5) (2009) 9409–9414.

[10] Z. Gao, Z. Yu, X. Pang, A compact shape descriptor for triangular surface meshes, Comput.-Aid. Des. 53 (2014) 62–69.

[11] J. Sun, M. Ovsjanikov, L. Guibas, A concise and provably informative multi-scale signature based on heat diffusion, Comput. Graphics Forum 28 (5) (2009) 1383–1392.

[12] K. Gębal, J.A. Bærentzen, H. Aanæs, R. Larsen, Shape analysis using the auto diffusion function, Comput. Graphics Forum 28 (5) (2009) 1405–1413.

[13] M. Aubry, U. Schlickewei, D. Cremers, The wave kernel signature: a quantum mechanical approach to shape analysis, in: Proc. Computational Methods for the Innovative Design of Electrical Devices, 2011, pp. 1626–1633.

[14] M. Bronstein, I. Kokkinos, Scale-invariant heat kernel signatures for non-rigid shape recognition, in: Proc. Computer Vision and Pattern Recognition, 2010, pp. 1704–1711.

[15] Z. Lian, A. Godil, B. Bustos, M. Daoudi, J. Hermans, S. Kawamura, Y. Kurita, G. Lavoué, H.V. Nguyen, R. Ohbuchi, Y. Ohkita, Y. Ohishi, F. Porikli, M. Reuter, I. Sipiran, D. Smeets, P. Suetens, H. Tabia, D. Vandermeulen, A comparison of methods for non-rigid 3D shape retrieval, Pattern Recogn. 46 (1) (2013) 449–461.

[16] C. Li, A. Ben Hamza, Spatially aggregating spectral descriptors for nonrigid 3D shape retrieval: a comparative survey, Multimedia Syst. 20 (3) (2014) 253–281.

[17] R. Coifman, S. Lafon, Diffusion maps, Appl. Comput. Harmonic Anal. 21 (1) (2006) 5–30.

[18] D. Hammond, P. Vandergheynst, R. Gribonval, Wavelets on graphs via spectral graph theory, Appl. Comput. Harmonic Anal. 30 (2) (2011) 129–150.

[19] C. Li, A. Ben Hamza, Intrinsic spatial pyramid matching for deformable 3D shape retrieval, Int. J. Multimedia Inform. Retrieval 2 (4) (2013) 261–271.

[20] S. Bu, Z. Liu, J. Han, J. Wu, R. Ji, Learning high-level feature by deep belief networks for 3-D model retrieval and recognition, IEEE Trans. Multimedia 24 (16) (2014) 2154–2167.

[21] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.

[22] S. Rosenberg, The Laplacian on a Riemannian Manifold, Cambridge University Press, 1997.

[23] H. Krim, A. Ben Hamza, Geometric Methods in Signal and Image Analysis, Cambridge University Press, 2015.

[24] M. Meyer, M. Desbrun, P. Schröder, A. Barr, Discrete differential-geometry operators for triangulated 2-manifolds, Visual. Math. III 3 (7) (2003) 35–57.

[25] M. Wardetzky, S. Mathur, F. Kälberer, E. Grinspun, Discrete Laplace operators: no free lunch, in: Proc. Eurographics Symp. Geometry Processing, 2007, pp. 33–37.

[26] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.

[27] G. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.

[28] Y. Bengio, Learning deep architectures for AI, Found. Trends Machine Learn. 2 (1) (2009) 1–127.

[29] H. Lee, R. Grosse, R. Ranganath, A. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proc. Int. Conf. Machine Learning, 2009, pp. 609–616.

[30] M. Khabou, L. Hermi, M. Rhouma, Shape recognition using eigenvalues of the Dirichlet Laplacian, Pattern Recogn. 40 (2007) 141–153.

[31] Z. Lian, A. Godil, T. Fabry, T. Furuya, J. Hermans, R. Ohbuchi, C. Shu, D. Smeets, P. Suetens, D. Vandermeulen, S. Wuhrer, SHREC'10 track: non-rigid 3D shape retrieval, in: Proc. Eurographics/ACM SIGGRAPH Symp. 3D Object Retrieval, 2010, pp. 101–108.

[32] Z. Lian, A. Godil, B. Bustos, M. Daoudi, J. Hermans, S. Kawamura, Y. Kurita, G. Lavoue, H. Nguyen, R. Ohbuchi, Y. Ohkita, Y. Ohishi, F. Porikli, M. Reuter, I. Sipiran, D. Smeets, P. Suetens, H. Tabia, D. Vandermeulen, SHREC'11 track: shape retrieval on non-rigid 3D watertight meshes, in: Proc. Eurographics/ACM SIGGRAPH Symp. 3D Object Retrieval, 2011, pp. 79–88.