

JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, XX 2015 1

Data-Dependent Label Distribution Learning for Age Estimation

Zhouzhou He, Xi Li* (corresponding author), Zhongfei Zhang, Fei Wu, Xin Geng, Yaqing Zhang, Ming-Hsuan Yang, and Yueting Zhuang

Abstract—As an important and challenging problem in computer vision, face age estimation is typically cast as a classification or regression problem over a set of face samples with respect to several ordinal age labels, which have intrinsic cross-age correlations across adjacent age dimensions. As a result, such correlations usually lead to age label ambiguities of the face samples. Namely, each face sample is associated with a latent label distribution that encodes the cross-age correlation information on label ambiguities. Motivated by this observation, we propose a totally data-driven label distribution learning approach to adaptively learn the latent label distributions. The proposed approach is capable of effectively discovering the intrinsic age distribution patterns for cross-age correlation analysis on the basis of the local context structures of face samples. Without any prior assumptions on the forms of the label distributions, our approach is able to flexibly model the sample-specific, context-aware label distribution properties by solving a multi-task problem, which jointly optimizes the tasks of age-label distribution learning and age prediction for individuals. Experimental results demonstrate the effectiveness of our approach.

Index Terms—Age estimation, Subspace learning, Label distribution learning.

I. INTRODUCTION

As an important and challenging problem, face age estimation has recently attracted considerable attention [1]–[7], as it has a wide range of applications such as face identification [8] and human-computer interaction [9]. Typical approaches to age estimation focus on the following three issues: I) face feature representation; II) face context structure construction; III) age prediction modeling. For I), the face appearance is usually represented by various visual features, such as face texture features (LBP, Gabor) [10] [11], biologically inspired features (BIF) [12], and deep learning features [8] [13] [14]. For II), the face context structure is often modeled by constructing a face affinity graph for subspace analysis,

Z. He is with the College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

X. Li* (corresponding author), F. Wu, and Y. Zhuang are with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected], wufei,[email protected]).

Z. Zhang is with the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China, and also with the Computer Science Department, Watson School, The State University of New York Binghamton University, Binghamton, NY 13902 USA (e-mail: [email protected]).

X. Geng is with the School of Computer Science and Engineering, Southeast University, Nanjing 211189, China (e-mail: [email protected]).

Y. Zhang was with the College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310027, China.

M.-H. Yang is with the School of Engineering, University of California, Merced, CA 95344 USA (e-mail: [email protected]).

[Figure 1 appears here: panel (a) shows a face context structure built over face samples with age labels, with a legend distinguishing a training sample, her neighboring samples (discovered by subspace structure learning), and other samples; panel (b) shows how a data-dependent label distribution with "description degrees" is formed as a weighted linear combination over neighboring samples' age labels.]

Fig. 1. Illustration of our D2LDL algorithm. (a) The face context structure is represented by the face affinity graph. In particular, we use subspace structure learning on the face image features to build the context relationships among the face samples. (b) An example of constructing a label distribution for a female sample. Her label distribution is constructed by learning the cross-age correlations among her context-neighboring face samples.

which aims to capture the intrinsic interactions among face samples in the face-related image feature or attribute space (e.g., gender and race [15] [16]). For III), the key problem of age prediction modeling is how to effectively learn the mapping function (e.g., a non-linear and hierarchical function) from low-level image features to high-level age labels.

More recent efforts pay attention to modeling the cross-age correlations by introducing the concept of a "label distribution" [17]–[19], which essentially characterizes the ambiguity properties of age labels. Because it takes a fixed-form label distribution, such a label distribution model is usually inflexible in adapting to complex face data domains with diverse cross-age correlations. Motivated by these observations, we focus on designing a data-dependent label distribution model, which is capable of adaptively learning the cross-age correlations from context-structure-preserving face data. Fig. 1 illustrates an example of constructing a label distribution for a female sample. We first discover her context affinity structure through subspace learning, and then incorporate the context structure into the process of constructing a data-dependent label distribution model for modeling the cross-age correlations. From Fig. 1, it is obvious that our label distribution model is sample-specific; that is, different face samples have different label distributions determined by the face samples themselves as well as by their associated context structures.

Specifically, we propose a novel age estimation approach called "Age Estimation with Data-Dependent Label Distribution Learning" (D2LDL). Compared with existing efforts, our data-dependent label distributions are automatically learned from the real face data samples, and thus preserve the underlying manifold structure information with respect to different but correlated face samples. The main idea of our approach is to design a data-driven strategy for label distribution learning, which discovers the underlying age distribution patterns for cross-age correlation analysis on the basis of the local context structures of face samples. Since learning the local structure is helpful in understanding the relationships among the face samples, the label distribution of a face sample considers both its own aging degree and the external influence of its neighboring face samples. Typically, several existing approaches to label distribution learning enforce particular prior assumptions on the distribution forms (e.g., Gaussian), which often restrict the flexibility of model learning for capturing the age ambiguity. In contrast, our learning scheme is capable of adaptively capturing the sample-specific, context-aware label distribution properties by jointly optimizing the tasks of age-label distribution learning and age prediction for individuals.

In summary, the main contributions of this work are described as follows.

• We propose a multi-task learning model that performs the task of human age prediction in conjunction with the task of age-label distribution learning, taking a totally data-driven strategy to adaptively capture the cross-age correlation patterns in data-dependent, flexible forms. As a result, our learning model makes face age estimation more robust to label noise.

• We introduce graph-based manifold structure analysis into the process of label distribution learning by constructing a joint optimization problem, which enables face age estimation to better encode the underlying context information among face samples.

II. RELATED WORK

A. Age Estimation

In general, face age estimation [20] is considered a label prediction problem. In what follows, we briefly review the literature on face age estimation in the following three aspects: 1) face feature representation; 2) face context structure construction; and 3) age prediction modeling.

Face feature representation Since the face region has regular texture information, earlier approaches build texture features (e.g., LBP [10], Gabor [11], and AAM [21]) to represent the face appearance. Compared with these efforts, which directly use the face region image to extract features, recent studies consider the correlations among the face organs. For example, Guo et al. [12] propose the biologically inspired features: they first segment the face image into many local regions and then extract the face features with the strategy of a "spatial pyramid model". In the related field of face identification, many efforts design various deep neural networks to extract face features. The main advantage of these efforts [8] [13] [14] is that the extracted deep learning features capture much discriminative visual information.

Face context structure construction These studies consider that human age differences are influenced by the face context structure. In order to discover this structure, Guo et al. [22] use OLPP [23] to embed the face samples into a low-dimensional manifold structure, which preserves the original neighborhoods among the face samples. Xiao et al. [24] learn a distance metric to preserve the contextual correlations among neighboring face samples. Chao et al. [25] propose lsRCA and lsLPP to extract face features that preserve both the feature similarity and the label similarity between neighboring face samples. Ni et al. [26] learn a mapping function, considering all the samples to be related and propagating their labels in the mapped space. Other studies [15] [27] consider that face-related attributes (e.g., gender and race) also play an important role in describing the face context relationships. They predict the human age by reclassifying the face samples with the face-related attributes, and their experiments show the difference in aging patterns between males and females. Furthermore, Guo et al. [16] propose a "cross-population" learning strategy, which embeds different aging patterns into a common space and enforces face samples with semantically close face-related attributes to be correlated.

Age prediction modeling The existing efforts [27]–[30] focus on designing various age label predictors through classification or regression learning. Motivated by these studies, Guo et al. [22] propose a mixture approach combining the advantages of both classification and regression approaches. Recently, Geng et al. [17] observe that the human age can be represented by a set of adjacent age labels. Therefore, they propose the "label distribution" to replace the original age label, which improves the typical objective function of the age estimation problem. Specifically, they explicitly enforce a fixed-form prior assumption on the label distribution (i.e., "Gaussian" or "Triangle"), resulting in inflexibility in adapting to complicated face data in practice. Furthermore, Geng et al. [31] propose an adaptive label distribution learning approach, which considers that the label distribution varies with temporal changes.


B. Subspace Clustering

For the purpose of adaptive context structure discovery, subspace learning acts as an important and powerful tool in the recent literature on graph-based data clustering.

Sparse subspace clustering (SSC) SSC [32] is a typical approach to clustering high-dimensional data (e.g., images and videos). Its basic idea is that each sample can be reconstructed through a linear combination of a few other samples. Therefore, all the original samples are embedded into many local manifold subspaces, and the SSC approach considers that context relationships exist among the samples in the same subspace. Many researchers improve the clustering performance by adding various constraints to the subspace learning, such as the low-rank constraint [33], the trace Lasso constraint [34], and the mixed Gaussian noise constraint [35]. Recently, Li et al. [36] solve the SSC problem by combining the context structure discovery and the data clustering into a unified framework.

Applications In practice, SSC techniques are widely used to solve a variety of image and video processing problems. For example, Tierney et al. [37] assume that each video frame can be reconstructed from its temporally neighboring frames; they therefore segment a video by applying SSC to all the video frames while adding a temporal smoothing constraint for the temporally neighboring frames. Recently, Cao et al. [38] extend the SSC problem to multi-view image clustering, where they solve SSC for every view iteratively and use "Hilbert-Schmidt norms" to constrain the correlations among the different views. Although the manifold learning methods [24] [25] [39] [40] and the subspace learning in our approach are all capable of capturing the context structure, their roles in age estimation are different: the existing efforts use manifold learning to extract the image features, whereas our approach uses subspace learning to build the prediction objectives (the label distributions).

III. FRAMEWORK

In this section, we first give a detailed description of our data-dependent label distribution learning framework, and then describe its associated optimization procedure. For presentation convenience, a list of notations and symbols is defined below.

Notations: We use uppercase boldface letters to represent matrices and lowercase boldface letters to represent vectors. For any vector x and matrix X, we denote x(i) as the i-th element of x and X(i, j) as the element of X at the i-th row and j-th column. We also denote X^⊤, X^†, ‖X‖_F, ‖X‖_1, and ‖X‖_{2,1} as the transpose, the inverse, the Frobenius norm, the ℓ1 norm, and the ℓ2,1 norm, respectively.

A. Problem Formulation

We are given a set of face samples {(x_n, y_n), n = 1, 2, ..., N}, where x_n ∈ R^{D×1} and y_n denote the image feature and the age label of the n-th face sample, respectively. We set the age range from 1 to T for a convenient representation. Following the recent work [17], we use "label distributions" to replace the original age labels for improving the age representation. Specifically, we define p_n ∈ R^{T×1} as the label distribution of the n-th face sample, where p_n(t) corresponds to the "description degree" of this face sample having the t-th age label.

TABLE I
THE DETAILED DESCRIPTION OF THE VARIABLES

N: the number of face samples
T: the number of age labels
D: the dimension of the image features
x_n ∈ R^{D×1}: the image feature of the n-th sample
y_n: the age label of the n-th sample
p_n ∈ R^{T×1}: the label distribution of the n-th sample
X ∈ R^{D×N}: the set of image features
Y ∈ R^{T×N}: the set of age labels
P ∈ R^{T×N}: the set of label distributions
A ∈ R^{N×N}: the face affinity graph
C ∈ R^{N×N}: the subspace representation
W ∈ R^{D×T}: the regression matrix
W_t ∈ R^{D×ε}: a segment (submatrix) of the regression matrix

We denote X ∈ R^{D×N} as the set of image features with the n-th column being x_n, and Y ∈ R^{T×N} as the multi-label matrix representation of the face samples, whose entries are defined as:

\[
\mathbf{Y}(t, n) =
\begin{cases}
1 & \text{for } y_n = t \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

Based on Y, we let P ∈ R^{T×N} denote the associated multi-label distribution matrix representation with the n-th column being p_n. For reading convenience, we summarize the notations used hereinafter in Table I. In order to achieve the goal of face age estimation, we need to accomplish the following three tasks:
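As a small illustration of Eq. (1) (our own sketch, not the authors' code; the toy age labels and the value of T are made up), the one-hot matrix Y can be built as follows:

```python
import numpy as np

def one_hot_labels(ages, T):
    """Build Y in R^{T x N} with Y[t, n] = 1 iff sample n has age label t+1 (Eq. (1))."""
    N = len(ages)
    Y = np.zeros((T, N))
    Y[np.asarray(ages) - 1, np.arange(N)] = 1.0  # age labels are 1-based
    return Y

ages = [3, 1, 2]              # toy age labels in {1, ..., T}
Y = one_hot_labels(ages, T=4)
# each column holds a single 1 at row (age - 1)
```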

I) Use the label distribution to represent the human age. Typical age estimation approaches use the "one-hot" model for age estimation (in the "one-hot" model, the human age of each face sample is tagged with only a single age label). However, we observe that face samples with similar image features usually share a set of mutually close age labels. As a result, the age estimation task for a face sample is determined not only by its own one-hot label annotation, but also by its associated face context (consisting of visually similar face samples). Such a semantic correlation between a face sample and its context often leads to visual age ambiguity, which implicitly smooths the one-hot age label annotations. Namely, if the one-hot age label of a face sample is correlated with but different from those of its visual face context, a visual semantic ambiguity takes place, which results in age estimation inconsistency for visually similar face samples that carry different age labels. To address this visual age label ambiguity problem, we introduce a data-dependent label distribution into the process of face age estimation. The corresponding label distribution of each face sample is simultaneously derived from both the one-hot age label annotation and its interactions with its visual face context, which leads to age label smoothing in terms of sample-specific contextual interactions. Consequently, such a label distribution is capable of encoding the age label ambiguity of face samples from the visual appearance viewpoint.

Specifically, the following shows the process of building the data-dependent label distribution p_n based on the original one-hot age label y_n:

\[
p_n(t) = \mu\,\mathbf{Y}(t,n) + (1-\mu)\sum_{m \neq n} a_{m,n}\,\mathbf{Y}(t,m),
\quad n, m = 1, 2, \ldots, N \ \text{and} \ t = 1, 2, \ldots, T
\tag{2}
\]

where μ ∈ [0, 1] is a trade-off factor. The second term describes the synthetic label distribution of the n-th sample, which is built by label propagation from the other samples to the n-th sample. We denote a_{m,n} as the propagation factor representing the context relationship between the m-th and the n-th samples (the definition of a_{m,n} is described below). Clearly, Eq. (2) shows that the label distribution of each face sample is constructed through a weighted linear combination of the age labels of itself and its context-neighboring samples into one discrete distribution. Compared with the existing efforts [17] [31], which enforce ∑_{t=1}^{T} p_n(t) = 1, we only attach great importance to the label distribution improving the age representation and thus relax this constraint. Furthermore, we give the matrix representation of Eq. (2) as follows:

\[
\mathbf{P} = \mu\,\mathbf{Y} + (1-\mu)\,\mathbf{Y}\mathbf{A}
\tag{3}
\]

where we denote A ∈ R^{N×N} as the face affinity graph with A(m, n) being a_{m,n}.
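The construction in Eq. (3) is a single matrix product; a minimal numpy sketch follows (the toy Y, A, and μ are our own illustration, not values from the paper):

```python
import numpy as np

def label_distributions(Y, A, mu):
    """Data-dependent label distributions P = mu*Y + (1-mu)*Y*A (Eq. (3)).
    Column n mixes the one-hot label of sample n with the labels
    propagated from its context neighbors, weighted by A[:, n]."""
    return mu * Y + (1.0 - mu) * Y @ A

# toy example: 3 samples with ages 2, 2, 3 over T = 4 labels
Y = np.array([[0, 0, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
A = np.array([[0.0, 0.5, 0.5],   # symmetric, zero-diagonal affinity (as built by Eq. (5))
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
P = label_distributions(Y, A, mu=0.6)
# sample 1 now also carries mass on age 3, contributed by its neighbor
```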

II) Discover the face context structure to build the label distribution. Typically, the problem of discovering the context-neighboring relationships among the face samples is converted into that of subspace structure learning [32]–[36].

Specifically, we first construct a graph matrix C ∈ R^{N×N} to capture the face context structure. To compute this graph matrix, we use the property of "sample self-expressiveness" [32] to build the optimization objective, which is described as:

\[
\min_{\mathbf{C}} \|\mathbf{X} - \mathbf{X}\mathbf{C}\|_F^2 + \beta_1 \|\mathbf{C}\|_1 + \beta_2 \left\|\mathbf{C}^{\top}\mathbf{1} - \mathbf{1}\right\|_F^2
\tag{4}
\]
\[
\text{s.t. } \mathbf{C}(n,m) = 0 \ \text{if} \ n = m \ \text{or} \ |y_n - y_m| > \Delta y; \quad n, m = 1, 2, \ldots, N
\]

where β1 is a factor controlling the sparsity of C and β2 is a factor regularizing each column of C. We denote Δy (predefined as Δy = 5 in our approach) as a threshold factor determining the sparsity degree of C. Clearly, Eq. (4) shows that each face sample is linearly reconstructed by its context-neighboring face samples.

Next, given the graph matrix C, we use a construction strategy similar to that of SSC [32] to build the face affinity graph A:

\[
\mathbf{A} \triangleq \frac{1}{2}\left(|\mathbf{C}| + |\mathbf{C}^{\top}|\right)
\tag{5}
\]

The above definition ensures that the context relationships are symmetric and nonnegative.
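The hard constraints of Eq. (4) and the symmetrization of Eq. (5) can be sketched as follows. Solving Eq. (4) itself requires a sparse-coding solver, so here the learned C is stood in for by masked random values; `constraint_mask`, the toy ages, and Δy = 5 follow the text, while everything else is our own illustration:

```python
import numpy as np

def constraint_mask(ages, delta_y=5):
    """Allowed entries of C under Eq. (4): C(n, m) = 0 when n == m
    or when the age gap |y_n - y_m| exceeds delta_y."""
    y = np.asarray(ages)
    mask = np.abs(y[:, None] - y[None, :]) <= delta_y
    np.fill_diagonal(mask, False)
    return mask

def affinity_graph(C):
    """A = (|C| + |C^T|) / 2 (Eq. (5)): symmetric and nonnegative."""
    return 0.5 * (np.abs(C) + np.abs(C.T))

ages = [20, 23, 40]
mask = constraint_mask(ages)
rng = np.random.default_rng(0)
C = rng.normal(size=(3, 3)) * mask   # toy stand-in for the learned C
A = affinity_graph(C)
# samples 0 and 2 (age gap 20 > 5) stay disconnected; A is symmetric
```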

III) Build the age prediction model. After obtaining the label distributions, we solve a regression optimization problem to learn the mapping function between the image features x_n and the label distributions p_n, n = 1, 2, ..., N:

\[
\min_{f} \sum_{n=1}^{N} \left\|f(\mathbf{x}_n) - \mathbf{p}_n\right\|_F^2
\tag{6}
\]

where f(·) denotes the mapping function. Observing that face aging is a smooth process [41] [42], we consider that the variation of the age predictor should be small when predicting mutually close age labels. Therefore, we add a temporal smoothing constraint to our regression model.

Specifically, the face age prediction model is formulated as a linear regression:

\[
\begin{aligned}
\min_{\mathbf{W}} \; & \frac{1}{2}\sum_{n=1}^{N}\left\|\mathbf{W}^{\top}\mathbf{x}_n - \mathbf{p}_n\right\|_F^2 + \gamma \sum_{t=1}^{T'}\left\|\mathbf{W}_t\right\|_{2,1} \\
= \; & \frac{1}{2}\left\|\mathbf{W}^{\top}\mathbf{X} - \mathbf{P}\right\|_F^2 + \gamma \sum_{t=1}^{T'}\left\|\mathbf{W}_t\right\|_{2,1}
\end{aligned}
\tag{7}
\]

where γ is a trade-off factor. We denote W ∈ R^{D×T} as the regression matrix representing the mapping function f(·), where the t-th column w_t predicts the t-th age label. The columns (regression vectors) [w_1, w_2, ..., w_T] of W are sorted in chronological order. The second term is a mixed ℓ2,1 norm [43] for smoothing the regression vectors. We denote W_t as a submatrix of W composed of certain adjacent columns of W:

\[
\mathbf{W}_t = [\mathbf{w}_t, \mathbf{w}_{t+1}, \ldots, \mathbf{w}_{t+\varepsilon-1}],
\quad t = 1, 2, \ldots, T' \ \text{and} \ T' = T - \varepsilon + 1
\tag{8}
\]

where ε is a factor controlling the width of the temporal window of W_t. Compared with Cai et al. [43], who predefine the correlation between two regression vectors that have semantically related labels (e.g., sky and plane), we investigate the temporal similarity among ε adjacent regression vectors (e.g., [w_t, w_{t+1}, ..., w_{t+ε−1}]) to design our temporal smoothing constraint. Fig. 2 gives a detailed description of the regression matrix W and the submatrix W_t.

[Figure 2 appears here.]
Fig. 2. The regression matrix W and the submatrix W_t. Clearly, W_t is composed of certain adjacent columns (regression vectors) of W, and the width of its temporal window is controlled by ε. Besides, we consider that temporal similarity exists among the columns of W_t.

As a result, we have the complete form of the joint optimization problem, formulated as follows:

\[
\begin{aligned}
\min_{(\mathbf{C}, \mathbf{W})} \; J = \; & \frac{1}{2}\left\|\mathbf{W}^{\top}\mathbf{X} - \mathbf{P}\right\|_F^2 + \gamma_1 \sum_{t=1}^{T'}\left\|\mathbf{W}_t\right\|_{2,1}
+ \frac{\lambda}{2}\left\|\mathbf{X} - \mathbf{X}\mathbf{C}\right\|_F^2
+ \frac{\rho}{2}\left\|\mathbf{C}^{\top}\mathbf{1} - \mathbf{1}\right\|_F^2
+ \gamma_2\left\|\mathbf{C}\right\|_1 \\
\text{s.t.} \; & \mathbf{P} = \mu\mathbf{Y} + \frac{1-\mu}{2}\,\mathbf{Y}\left(|\mathbf{C}| + |\mathbf{C}^{\top}|\right); \\
& \mathbf{C}(n,m) = 0 \ \text{if} \ n = m \ \text{or} \ |y_n - y_m| > \Delta y
\end{aligned}
\tag{9}
\]

where λ, ρ, γ1, and γ2 are trade-off factors.
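To make the structure of the regression objective concrete, the following sketch (our own, with toy sizes) evaluates Eq. (7): the data term plus the windowed ℓ2,1 penalty over the submatrices W_t of Eq. (8):

```python
import numpy as np

def l21_norm(M):
    """Mixed l_{2,1} norm: sum of the l2 norms of the rows of M."""
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def regression_objective(W, X, P, gamma, eps):
    """Objective of Eq. (7): 0.5*||W^T X - P||_F^2 + gamma * sum_t ||W_t||_{2,1},
    where W_t = W[:, t:t+eps] slides over T' = T - eps + 1 windows (Eq. (8))."""
    data_term = 0.5 * np.linalg.norm(W.T @ X - P, 'fro') ** 2
    T = W.shape[1]
    smooth_term = sum(l21_norm(W[:, t:t + eps]) for t in range(T - eps + 1))
    return data_term + gamma * smooth_term

# toy sizes: D = 4 features, N = 5 samples, T = 6 age labels, window eps = 3
rng = np.random.default_rng(0)
X, P, W = rng.normal(size=(4, 5)), rng.normal(size=(6, 5)), rng.normal(size=(4, 6))
J = regression_objective(W, X, P, gamma=0.1, eps=3)
```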

B. Optimization

The initializations of C and W are implemented by SSC and linear regression, respectively. Then, the specific optimization of D2LDL is implemented by alternating the following two steps iteratively.

1) Updating C, given W

Inspired by the optimization strategy of non-negative matrix factorization (NMF) [44], we develop an iterative multiplicative updating rule to obtain the optimal C. We first use C = C1 − C2 to simplify the objective function, where C1 and C2 are two non-negative matrices. In particular, we define C1(n,m) = (|C(n,m)| + C(n,m))/2 and C2(n,m) = (|C(n,m)| − C(n,m))/2. Now, given W, Eq. (9) can be rewritten as:

\[
\begin{aligned}
\min_{\mathbf{C}_1, \mathbf{C}_2} \; & \frac{1}{2}\left\|\frac{1-\mu}{2}\mathbf{Y}\mathbf{C}^{*} + \mu\mathbf{Y} - \mathbf{W}^{\top}\mathbf{X}\right\|_F^2
+ \frac{\lambda}{2}\left\|\mathbf{X}\mathbf{C}_1 - \mathbf{X}\mathbf{C}_2 - \mathbf{X}\right\|_F^2 \\
& + \frac{\rho}{2}\left\|\mathbf{C}_1^{\top}\mathbf{1} - \mathbf{C}_2^{\top}\mathbf{1} - \mathbf{1}\right\|_F^2
+ \gamma_2\,\mathbf{1}^{\top}(\mathbf{C}_1 + \mathbf{C}_2)\mathbf{1} \\
\text{s.t.} \; & \mathbf{C}_1 \geq 0, \; \mathbf{C}_2 \geq 0
\end{aligned}
\tag{10}
\]

where we denote C* as (C1 + C2 + C1^⊤ + C2^⊤). Next, we introduce the Lagrange multipliers Ψ1 and Ψ2 for the constraints C1 ≥ 0 and C2 ≥ 0, respectively. Then we obtain the Lagrange function L as follows:

\[
\begin{aligned}
\mathcal{L} = \; & \frac{1}{2}\left\|\frac{1-\mu}{2}\mathbf{Y}\mathbf{C}^{*} + \mu\mathbf{Y} - \mathbf{W}^{\top}\mathbf{X}\right\|_F^2
+ \frac{\lambda}{2}\left\|\mathbf{X}\mathbf{C}_1 - \mathbf{X}\mathbf{C}_2 - \mathbf{X}\right\|_F^2 \\
& + \frac{\rho}{2}\left\|\mathbf{C}_1^{\top}\mathbf{1} - \mathbf{C}_2^{\top}\mathbf{1} - \mathbf{1}\right\|_F^2
+ \gamma_2\,\mathbf{1}^{\top}(\mathbf{C}_1 + \mathbf{C}_2)\mathbf{1}
+ \mathrm{tr}\left(\boldsymbol{\Psi}_1^{\top}\mathbf{C}_1\right) + \mathrm{tr}\left(\boldsymbol{\Psi}_2^{\top}\mathbf{C}_2\right)
\end{aligned}
\tag{11}
\]

Taking the derivative of L with respect to C1 and C2, we obtain:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{C}_1} = \mathbf{U}_1^{+} - \mathbf{U}_1^{-} + \boldsymbol{\Psi}_1,
\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{C}_2} = \mathbf{U}_2^{+} - \mathbf{U}_2^{-} + \boldsymbol{\Psi}_2
\tag{12}
\]

where U1^+, U1^−, U2^+, and U2^−¹ are defined as follows:

\[
\begin{aligned}
\mathbf{U}_1^{+} &= \frac{(1-\mu)^2}{4}\mathbf{C}^{*}\mathbf{Y}^{\top}\mathbf{Y} + \frac{(1-\mu)^2}{4}\mathbf{Y}^{\top}\mathbf{Y}\mathbf{C}^{*}
+ \lambda\mathbf{X}^{\top}\mathbf{X}\mathbf{C}_1 + \rho\,\mathbf{1}\mathbf{1}^{\top}\mathbf{C}_1 + \gamma_2\,\mathbf{1}\mathbf{1}^{\top} \\
\mathbf{U}_1^{-} &= \frac{(1-\mu)\mu}{2}\left(\mathbf{X}^{\top}\mathbf{W}\mathbf{Y} + \mathbf{Y}^{\top}\mathbf{W}^{\top}\mathbf{X}\right)
+ \lambda\mathbf{X}^{\top}\mathbf{X}(\mathbf{C}_2 + \mathbf{I}) + \rho\,\mathbf{1}\mathbf{1}^{\top}(\mathbf{C}_2 + \mathbf{I}) \\
\mathbf{U}_2^{+} &= \frac{(1-\mu)^2}{4}\mathbf{C}^{*}\mathbf{Y}^{\top}\mathbf{Y} + \frac{(1-\mu)^2}{4}\mathbf{Y}^{\top}\mathbf{Y}\mathbf{C}^{*}
+ \lambda\mathbf{X}^{\top}\mathbf{X}(\mathbf{C}_2 + \mathbf{I}) + \rho\,\mathbf{1}\mathbf{1}^{\top}(\mathbf{C}_2 + \mathbf{I}) + \gamma_2\,\mathbf{1}\mathbf{1}^{\top} \\
\mathbf{U}_2^{-} &= \frac{(1-\mu)\mu}{2}\left(\mathbf{X}^{\top}\mathbf{W}\mathbf{Y} + \mathbf{Y}^{\top}\mathbf{W}^{\top}\mathbf{X}\right)
+ \lambda\mathbf{X}^{\top}\mathbf{X}\mathbf{C}_1 + \rho\,\mathbf{1}\mathbf{1}^{\top}\mathbf{C}_1
\end{aligned}
\tag{13}
\]

By employing the KKT conditions Ψ1(n,m)C1(n,m) = 0 and Ψ2(n,m)C2(n,m) = 0, we get the following equations for C1 and C2:

\[
\begin{aligned}
\left(\mathbf{U}_1^{-}(n,m) - \mathbf{U}_1^{+}(n,m)\right)\mathbf{C}_1(n,m) &= 0 \\
\left(\mathbf{U}_2^{-}(n,m) - \mathbf{U}_2^{+}(n,m)\right)\mathbf{C}_2(n,m) &= 0
\end{aligned}
\tag{14}
\]

Then, we obtain the updating rules for C1 and C2:

\[
\mathbf{C}_1(n,m) \leftarrow \mathbf{C}_1(n,m)\sqrt{\frac{\mathbf{U}_1^{-}(n,m)}{\mathbf{U}_1^{+}(n,m)}},
\qquad
\mathbf{C}_2(n,m) \leftarrow \mathbf{C}_2(n,m)\sqrt{\frac{\mathbf{U}_2^{-}(n,m)}{\mathbf{U}_2^{+}(n,m)}}
\tag{15}
\]

After C1 and C2 converge, we obtain the solution for C by the following step:

\[
\mathbf{C} = \mathbf{C}_1 - \mathbf{C}_2
\tag{16}
\]
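A minimal numpy sketch of one multiplicative update under our reading of Eqs. (13)-(16) follows; it is an illustration, not the authors' code. The clipping of the U^− terms at zero is our own safeguard, in the spirit of the footnote that keeps X^⊤WY nonnegative; the toy sizes and parameter values are likewise assumptions.

```python
import numpy as np

def split_pos_neg(C):
    """Decompose C = C1 - C2 into nonnegative positive/negative parts."""
    return np.maximum(C, 0.0), np.maximum(-C, 0.0)

def update_C(C1, C2, X, Y, W, mu, lam, rho, gamma2, eps=1e-12):
    """One multiplicative update of C1, C2 following Eqs. (13)-(15);
    a sketch of our reading of the derivation, not the authors' code."""
    N = C1.shape[0]
    I = np.eye(N)
    ones = np.ones((N, N))                 # the matrix 1 1^T
    Cs = C1 + C2 + C1.T + C2.T             # C^*
    G = Y.T @ Y                            # N x N label Gram matrix
    q = (1.0 - mu) ** 2 / 4.0
    cross = (1.0 - mu) * mu / 2.0 * (X.T @ W @ Y + Y.T @ W.T @ X)
    XtX = X.T @ X

    U1p = q * (Cs @ G + G @ Cs) + lam * XtX @ C1 + rho * ones @ C1 + gamma2 * ones
    U1m = cross + lam * XtX @ (C2 + I) + rho * ones @ (C2 + I)
    U2p = q * (Cs @ G + G @ Cs) + lam * XtX @ (C2 + I) + rho * ones @ (C2 + I) + gamma2 * ones
    U2m = cross + lam * XtX @ C1 + rho * ones @ C1

    # clip negative entries so the multiplicative rule preserves nonnegativity
    C1 = C1 * np.sqrt(np.maximum(U1m, 0.0) / (U1p + eps))
    C2 = C2 * np.sqrt(np.maximum(U2m, 0.0) / (U2p + eps))
    return C1, C2

# toy run with nonnegative inputs (as the footnote assumes)
rng = np.random.default_rng(0)
D, N, T = 3, 4, 5
X = np.abs(rng.normal(size=(D, N)))
Y = np.abs(rng.normal(size=(T, N)))
W = np.abs(rng.normal(size=(D, T)))
C1, C2 = split_pos_neg(rng.normal(size=(N, N)))
C1, C2 = update_C(C1, C2, X, Y, W, mu=0.6, lam=1.0, rho=1.0, gamma2=0.1)
```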

2) Updating W, given C

Given the graph matrix C, we first construct the label distributions P by Eq. (3) and then transform Eq. (9) into the following objective function:

\[
\begin{aligned}
\min_{\mathbf{W}} \; & \frac{1}{2}\left\|\mathbf{W}^{\top}\mathbf{X} - \mathbf{P}\right\|_F^2 + \gamma_1 \sum_{t=1}^{T'} \mathrm{tr}\left(\mathbf{W}_t^{\top}\mathbf{U}_t\mathbf{W}_t\right) \\
= \; & \frac{1}{2}\sum_{t=1}^{T}\left\|\mathbf{w}_t^{\top}\mathbf{X} - \mathbf{P}(t,:)\right\|_2^2 + \gamma_1 \sum_{t=1}^{T'} \sum_{l=t}^{t+\varepsilon-1} \mathbf{w}_l^{\top}\mathbf{U}_t\mathbf{w}_l
\end{aligned}
\tag{17}
\]

where U_t is a diagonal matrix with its k-th diagonal element being 1/(2‖W_t(k,:)‖_2). Based on the iterative alternating optimization strategy for solving the ℓ2,1-norm regression problem [45] [46], we take the derivative of Eq. (17) with respect to w_t and set it to zero:

\[
\left(\mathbf{X}\mathbf{X}^{\top} + \gamma_1\mathbf{M}_t\right)\mathbf{w}_t - \mathbf{X}\mathbf{P}(t,:)^{\top} = \mathbf{0}
\tag{18}
\]

¹In our experiment, the face image features X are non-negative, and we modify the original age labels as Y ← Y + constant, which ensures that X^⊤WY is non-negative and that the updating step of C is reasonable.


[Figure 3: a convergence curve plotting the D2LDL objective value against the iteration number (0 to 50).]

Fig. 3. Convergence analysis on the D2LDL algorithm. It is clear that D2LDL is able to reach the convergence point.

where the auxiliary matrix M_t is defined as follows:

\[
\mathbf{M}_t = \sum_{l} \mathbf{U}_l, \quad \max(t-\varepsilon+1, 1) \leq l \leq \min(T', t)
\tag{19}
\]

As a result, we have the following relation:

\[
\mathbf{w}_t = \left(\mathbf{X}\mathbf{X}^{\top} + \gamma_1\mathbf{M}_t\right)^{\dagger}\mathbf{X}\mathbf{P}(t,:)^{\top}
\tag{20}
\]

Therefore, the optimal regression matrix W is obtained by computing Eq. (19) and Eq. (20) iteratively.
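The inner W update of Eqs. (17)-(20) alternates between reweighting the diagonal matrices U_t and solving the per-column linear systems; the following is a minimal numpy sketch of one such pass (our own, with 0-based indices and toy sizes, not the authors' code):

```python
import numpy as np

def update_W(W, X, P, gamma1, eps_w):
    """One pass of the W update (Eqs. (17)-(20)): build the diagonal
    U_t from the current W, accumulate M_t per column, and solve the
    resulting linear systems."""
    D, T = W.shape
    Tp = T - eps_w + 1                    # T' = T - eps + 1 windows
    # U_t: diagonal with k-th entry 1 / (2 ||W_t(k, :)||_2)
    U = [np.diag(1.0 / (2.0 * np.linalg.norm(W[:, t:t + eps_w], axis=1) + 1e-12))
         for t in range(Tp)]
    XXt = X @ X.T
    W_new = np.empty_like(W)
    for t in range(T):
        # M_t sums U_l over the windows covering column t (Eq. (19), 0-based)
        lo, hi = max(t - eps_w + 1, 0), min(Tp - 1, t)
        Mt = sum(U[l] for l in range(lo, hi + 1))
        W_new[:, t] = np.linalg.solve(XXt + gamma1 * Mt, X @ P[t, :])  # Eq. (20)
    return W_new

# toy sizes: D = 3, N = 6, T = 5, window eps = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 6))
P = rng.normal(size=(5, 6))
W = update_W(rng.normal(size=(3, 5)), X, P, gamma1=0.1, eps_w=2)
```

With gamma1 = 0 the update reduces to ordinary per-label least squares, which is a quick sanity check on the linear system.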

The stopping condition of the optimization is |J_k − J_{k−1}|/|J_k| ≤ 10^{−4}, where the subscript k denotes the k-th iteration. To analyze the quantitative behavior of the algorithm's convergence, we report the variation of the D2LDL objective value. Fig. 3 shows that the objective value decreases during the learning process.

C. Age Prediction in the Testing Dataset

Based on the learned regression matrix W, we can predict the label distribution of a specific face sample in the testing dataset. Then, given the predicted label distribution, we compute the corresponding age expectation for age prediction. Specifically, the predicted label distribution and the predicted age are defined as follows:

\[
\mathbf{p}_q = \mathbf{W}^{\top}\mathbf{x}_q, \quad q \in \text{testing samples};
\qquad
\hat{y}_q = \frac{\sum_{t=1}^{T} t \times \mathbf{p}_q(t)}{\sum_{t=1}^{T} \mathbf{p}_q(t)}
\tag{21}
\]

Algorithm 1 summarizes the specific procedure of our age estimation framework.
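Eq. (21) amounts to a matrix-vector product followed by a normalized expectation over the age labels; a minimal sketch (the toy W and feature vector are illustrative only):

```python
import numpy as np

def predict_age(W, x_q):
    """Predict a test sample's age (Eq. (21)): regress its label
    distribution, then take the expectation over ages 1..T."""
    p_q = W.T @ x_q                           # predicted label distribution
    ages = np.arange(1, p_q.shape[0] + 1)
    return (ages * p_q).sum() / p_q.sum()

# toy check: a distribution peaked at label 3 predicts an age near 3
W = np.eye(4)                                 # hypothetical D = T = 4 regression matrix
x = np.array([0.05, 0.2, 0.6, 0.15])
age = predict_age(W, x)                       # -> 2.85
```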

IV. EXPERIMENTS

A. Dataset

We implement the age estimation in two common bench-mark face datasets. The first is FG-NET [47] that has 1002face images Its human face images are collected from 82people with the age ranges from 0 to 69 years. The secondis MORPH [20] that has about 55000 images. Its human faceimages are collected from about 13000 people with differentraces. For instance, the dataset consists of about 77% Africanfaces, 19% European faces, and the remaining 4% people

Algorithm 1: Age estimation with data-dependent label distribution

Input: face image features X and age labels Y.

D2LDL Model Learning
1: Initialization
   - Initialize the subspace representation C by SSC.
   - Initialize the regression matrix W by linear regression.
2: Optimization
   while the convergence conditions are not met do
     - Update the subspace representation C via NMF according to Eq. (15) and (16);
     - Compute the affinity graph A according to Eq. (5):
         A = (1/2)(|C| + |C^⊤|)
     - Update the label distribution P according to Eq. (3):
         P = µY + (1 − µ)YA
     - Update the regression matrix W:
       # solve the linear regression with the mixed ℓ2,1 norm.
       while the values of W are not convergent do
         for each w_t (the t-th column of W) do
           w_t = (XX^⊤ + γ1 M_t)^{−1} X P(t,:)^⊤
         - Update the auxiliary matrices M_t, t = 1, ..., T′, according to Eq. (19);

Age Prediction in the Testing Dataset
1: Age estimation
   - Obtain the predicted age ŷ_q, q ∈ testing samples, according to Eq. (21).
2: Evaluation
   - Evaluate the MAE and the CS values based on the absolute errors of the testing face samples |y_q − ŷ_q|, q ∈ testing samples.
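The affinity-graph and label-distribution updates inside the loop of Algorithm 1 (Eqs. (5) and (3)) reduce to two matrix expressions; a minimal NumPy sketch, assuming C is the learned subspace representation and Y the T×N label matrix:

```python
import numpy as np

def d2ldl_p_update(C, Y, mu):
    """One step of Algorithm 1: build the affinity graph from the
    subspace representation C (Eq. 5) and update the label
    distributions P (Eq. 3)."""
    A = 0.5 * (np.abs(C) + np.abs(C.T))  # symmetric affinity graph
    P = mu * Y + (1 - mu) * Y @ A        # blend labels with their context
    return A, P
```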

belong to Hispanic, Asian, Indian, and other races. On average, every individual has about 6-15 face images, and these images vary slightly in illumination, background, and resolution. Each face image is labeled with the age value of the corresponding person. The range of human ages is from 16 to 77, and the age distribution is imbalanced. In particular, the young people (age<30), the middle-aged people (30<=age<50), and the aged people (age>=50) account for 41%, 53%, and 6% of the dataset respectively. Following the recent work [16], we use the same strategy to collect 20000 face images from MORPH. In our experiment, feature extraction of the face samples is performed by AlexNet [48] in the Caffe toolbox [49], and we use the output of AlexNet's 'fc7' layer as the face image feature. In detail, we use random initialization to preset the network parameters and then apply the standard back-propagation method to optimize these parameters.


Fig. 4. Label distributions of IIS-LLD and D2LDL (15 panels; x-axis: age after the centralization process; y-axis: "description degree"). The numbers in the D2LDL panels indicate the cluster ids. While IIS-LLD predefines the distribution form as "Gaussian" or "Triangle", D2LDL uses the data-driven strategy to obtain a variety of distribution forms.

TABLE II
THE TOP-5 CLUSTERS OF THE YOUNG PEOPLE, THE MIDDLE-AGED PEOPLE, AND THE AGED PEOPLE.

People                  | Top-5 clusters' ids
The young people        | 9, 13, 14, 5, 12
The middle-aged people  | 9, 14, 3, 10, 5
The aged people         | 9, 2, 3, 10, 1

B. Parameter setting

We set λ = 0.5 and γ1 = 0.1. The parameters ρ and γ2 are set as in SSC [32]. Unless otherwise specified, we set ε = 4 and µ = 0.5. We have tried several different parameter configurations and found that our approach achieves stable performance within a relatively wide range of parameter values. Note that we keep the above fixed parameter configuration throughout all the experiments.

C. Competing models

Four competing algorithms are involved for comparison, namely AAS [28], LARR [22], IIS-LLD [17], and IIS-ALDL [31].

• AAS (Appearance and Age Specific): It combines a global classifier and multiple local classifiers to implement age estimation.

• LARR (Locally Adjusted Robust Regressor): It combines OLPP and a multiclass SVM classifier for age estimation.

• IIS-LLD: It uses prior label distributions as the prediction objectives, and age estimation is achieved through the maximum entropy model.

• IIS-ALDL: It uses adaptive label distributions as the prediction objectives, and age estimation is achieved through the maximum entropy model.

We evaluate our D2LDL and these competing approaches based on the same face image features.

Fig. 5. Case study of two female face samples. The first sample is an African woman belonging to the 8-th cluster; the second is a European woman belonging to the 9-th cluster. Their top-10 context-neighboring face samples are sorted in the chronological order of their age labels. The blue numbers denote the age labels of the face samples, and the color of an image frame denotes the race attribute. The red arrows indicate that the "racial composition" is an important factor influencing the variation of the distribution form.

D. Evaluation measures

We utilize two commonly used evaluation metrics for quantitative performance evaluation.

• MAE (Mean Absolute Error) [28]: It is defined as the average of the absolute errors between the predicted ages and the ground-truth ages.

• CS (Cumulative Score) [7]: It is defined as CS(l) = (M_l / M) × 100%, where M_l denotes the number of testing images on which the predicted ages make an absolute error no higher than l years (error level), and M denotes the total number of testing images.
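Both metrics are straightforward to compute from the predicted and ground-truth ages; a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error over the testing samples."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def cs(y_true, y_pred, level):
    """Cumulative Score: the fraction of testing images whose
    absolute error is no higher than `level` years."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.mean(err <= level)
```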

E. Data-dependent label distribution

I) Visualization of the learned data-dependent label distributions. For a better visualization of the label distribution patterns, all the sample-specific label distributions are first centralized and then grouped by K-means into several clusters, each of which is represented by its cluster mean (we set the cluster number to 15 and obtain similar clustering results when the cluster number varies from 20 to 50). As shown in Fig. 4, we observe that the learned label distribution patterns have a variety of forms, such as "Triangle-like", "Trapezium-like", and "Gaussian-like", which are tagged with green, pink, and red respectively. Table II shows the top-5 clusters (these clusters have higher sample numbers than the other clusters) of the young people, the middle-aged people, and the aged people respectively. Clearly, the major distribution forms of the face samples are "Gaussian-like" or "Triangle-like", which indicates that our label distributions capture the basic assumption of the typical approach [17]. Besides, for the distributions other than the "Gaussian-like" one, we observe that the relationship between the label expectation and the age label varies with temporal changes. For example, the majority of the "Triangle-like" distributions (the 12-th, 13-th, and 14-th clusters' distributions) in the young


Fig. 6. The RCIs of the 15 different face clusters, shown together with the 15 clusters' label distributions. We discover that the 8-th cluster has the highest RCI among all the clusters. The red arrow indicates that a face sample with a high RCI (a balanced "racial composition") is more likely to have the "bimodal" distribution form.

people have relatively smaller expectations than those (the 1-st, 2-nd, and 3-rd clusters' distributions) of the aged people. The above result also supports the motivation of the recent work [31]: the age label distribution varies with temporal changes. Therefore, our label distribution learning approach is capable of adaptively reflecting the intrinsically diverse cross-age correlations for different face samples with different attributes.

Furthermore, we observe that the distribution form of the 8-th cluster is "bimodal" while those of the other clusters are "unimodal". To explain this, we first give a case study to investigate the difference. Fig. 5 shows two female face samples and their context-neighboring face samples, which are collected from the 8-th and the 9-th clusters respectively. Clearly, the "racial compositions" of these two face samples' neighboring samples are different. Specifically, we discover that neither the African nor the European female samples play a decisive role in constructing the label distribution of the first sample, while we observe the opposite situation for the second sample. This indicates that the label distribution of the first female sample is actually composed of two "unimodal" distributions, which are constructed by its context-neighboring African and European female samples respectively. Therefore, we infer that the "racial composition" is an important factor influencing the variation of the distribution form.

In the following, we provide a detailed investigation of the influence of the "racial composition" on the distribution form. To quantitatively analyze this influence, we first define a measure, the "racial composition index" (RCI), to describe the "racial composition". The specific definition of the RCI for the k-th cluster is as follows,

RCI_k = (1/N_k) Σ_{n=1}^{N_k} ( 2 × min( #African_n / #All_n, #European_n / #All_n ) )          (22)

where #African_n, #European_n, and #All_n denote the numbers of African, European, and all people in the n-th face sample's context-neighboring samples respectively. Here N_k denotes the sample number in the k-th cluster. Eq. (22) shows that if the "racial composition" is balanced (the numbers of

Fig. 7. Two face samples' label distributions constructed by the typical approach (IIS-LLD) and by our approach, for two face samples with the same age label (50 years old). We have the following three observations: (1) In the typical approach, all the samples have the same distribution form (e.g., Gaussian). (2) Since the majority of sample A's context-neighboring samples are older than sample A, the distribution form of sample A deviates from the typical "Gaussian" form. (3) Since the context information of samples A and B differs, their data-dependent label distributions also differ.

the African and European face samples are close) in the k-th cluster, this cluster will have a high RCI. Fig. 6 reports the RCIs of all the clusters. Clearly, the 8-th cluster has the highest RCI, and the RCI difference between the 8-th cluster and the other clusters is large.
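Eq. (22) can be computed directly from the neighbors' race labels; a minimal sketch with an illustrative input format (a list of race-label lists, one per sample in the cluster):

```python
def rci(neighbor_races):
    """Racial Composition Index (Eq. 22) for one cluster.

    neighbor_races: list of race-label lists, one per face sample,
    describing that sample's context-neighboring samples."""
    total = 0.0
    for races in neighbor_races:
        n_all = len(races)
        f_afr = races.count("African") / n_all
        f_eur = races.count("European") / n_all
        total += 2 * min(f_afr, f_eur)  # 1.0 when perfectly balanced
    return total / len(neighbor_races)
```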

The above results demonstrate that the majority of the face samples in the 8-th cluster have both many African and many European context-neighboring face samples. For example, for the first face sample in Fig. 5, collected from the 8-th cluster, we observe that half of its context-neighboring samples are European female samples, although she herself is an African female sample. This indicates that in the 8-th cluster, two face samples with similar image features can belong to different races. Therefore, for one face sample, we infer that the "bimodal" distribution form should meet both of the following two conditions: (I) among its context-neighboring face samples, the numbers of African and European face samples are close; (II) the average ages of the above African and European face samples differ by a certain degree.
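A hypothetical checker for these two conditions might look as follows; the thresholds `ratio_tol` and `age_gap` are illustrative assumptions of ours, not values from the paper:

```python
def is_bimodal_candidate(neighbor_races, neighbor_ages,
                         ratio_tol=0.2, age_gap=3.0):
    """Heuristic check of the two conditions inferred in the text:
    (I) balanced racial composition among the context neighbors,
    (II) a gap between the two races' mean neighbor ages."""
    afr = [a for r, a in zip(neighbor_races, neighbor_ages) if r == "African"]
    eur = [a for r, a in zip(neighbor_races, neighbor_ages) if r == "European"]
    if not afr or not eur:
        return False
    balance = abs(len(afr) - len(eur)) / len(neighbor_races)  # condition (I)
    mean_gap = abs(sum(afr) / len(afr) - sum(eur) / len(eur))  # condition (II)
    return balance <= ratio_tol and mean_gap >= age_gap
```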

II) The contributions of the data-dependent label distribution. The contributions of the data-dependent label distribu-


TABLE III
MEAN ABSOLUTE ERROR OF THE AGE ESTIMATION COMPARISON STUDY ON THE FG-NET DATASET. CLEARLY, D2LDL ACHIEVES THE BEST PREDICTION PERFORMANCE (THE LOWEST MAE).

Approach \ TrainingSet Ratio | 10%    | 20%    | 30%    | 40%    | 50%    | 60%
AAS                          | 5.7250 | 5.5609 | 5.4775 | 5.4079 | 5.1630 | 5.0627
LARR                         | 5.5108 | 5.4829 | 5.4277 | 5.2180 | 5.0832 | 5.0124
IIS-LLD                      | 5.2433 | 5.2052 | 5.1881 | 4.9970 | 4.8565 | 4.7745
IIS-ALDL                     | 5.1763 | 5.1330 | 5.1194 | 4.9323 | 4.8061 | 4.7330
D2LDL                        | 4.7298 | 4.6822 | 4.6642 | 4.6412 | 4.6332 | 4.5775

TABLE IV
MEAN ABSOLUTE ERROR OF THE AGE ESTIMATION COMPARISON STUDY ON THE MORPH DATASET. CLEARLY, D2LDL ACHIEVES THE BEST PREDICTION PERFORMANCE (THE LOWEST MAE).

Approach \ TrainingSet Ratio | 10%    | 20%    | 30%    | 40%    | 50%    | 60%
AAS                          | 4.9081 | 4.7616 | 4.6507 | 4.5553 | 4.4690 | 4.4061
LARR                         | 4.7501 | 4.6112 | 4.5131 | 4.4273 | 4.3500 | 4.2949
IIS-LLD                      | 4.3466 | 4.2850 | 4.2520 | 4.2083 | 4.1904 | 4.1683
IIS-ALDL                     | 4.1791 | 4.1683 | 4.1228 | 4.1107 | 4.1024 | 4.0902
D2LDL                        | 4.1080 | 3.9857 | 3.9204 | 3.8712 | 3.8560 | 3.8385

tion are described by the following two points. First, the data-dependent label distribution captures the age

ambiguity by exploiting the visual information of the face samples. Fig. 7 shows the label distributions of a typical existing approach (IIS-LLD) [17] and of our approach. Here, we collect two face samples that have the same age label (50 years old). Clearly, Fig. 7(b) shows that samples A and B have the same distribution form (the "Gaussian" form) in the typical approach. In contrast, Fig. 7(d) shows that the label distributions of these two samples deviate from the typical "Gaussian" form in our approach. These deviations are caused by the fact that our label distributions are constructed from the samples' context information. For instance, the majority of sample A's context-neighboring samples are older than sample A, which implies that a relatively older age (e.g., 51-52) has a higher "description degree" in the label distribution of sample A. Since the age labels of the context-neighboring samples usually do not follow a fixed-form distribution (see Fig. 7(a) and Fig. 7(c)), our label distribution, built by the data-driven strategy, can be considered a flexible age representation (prediction objective). This indicates that our data-dependent label distribution is capable of modeling the age smoothing for various samples by learning the face context structure.

Second, in contrast to typical multi-label classification approaches, which usually learn the label correlation by capturing the semantic relationships among the labels (e.g., [17] and [43] predefine the labels' correlation weights, and [50] computes the label covariance matrix), our approach captures the cross-age correlation by learning the data-dependent label distributions of all the face samples, where each label distribution represents a local age smoothing pattern. For example, in Fig. 7(d), the label distributions of samples A and B show different cross-age correlations of the age labels.

Fig. 8. Mean Absolute Error of the age estimation comparison study on the four small datasets of MORPH (Female & European, Female & African, Male & European, Male & African; x-axis: training set ratio; y-axis: MAE). Compared with the competing approaches, D2LDL has the best prediction performance (lowest MAE).

F. Age estimation study

a) The original datasets. First, we implement face age estimation on the FG-NET and MORPH datasets to obtain the overall experimental evaluations. In this experiment, we evaluate D2LDL and the four competing approaches at six different training set ratios (from 10% to 60%). Tables III and IV report the comparison study on the FG-NET and the MORPH datasets respectively. In detail, Table III shows that D2LDL outperforms all the competing approaches on the FG-NET dataset, where the MAE of D2LDL is about 6.51%, 7.63%, 11.90%, and 13.68% lower than those of IIS-ALDL, IIS-LLD, LARR, and AAS, respectively. Table IV shows that D2LDL obtains the lowest MAE on the MORPH dataset, where the MAE of D2LDL is about 5.10%, 7.35%, 12.49%, and 15.03% lower than those of IIS-ALDL, IIS-LLD, LARR, and AAS, respectively. Furthermore, we observe that the MAEs of D2LDL, IIS-ALDL, and IIS-LLD are all lower than those of LARR and AAS. This indicates that the label distribution is a better age prediction objective than a single label. In addition, the MAE differences among D2LDL, IIS-ALDL, and IIS-LLD indicate that our label distribution has a better ability to learn the age ambiguity than the other two. Besides, we note that the MAE difference between the lowest and the highest training set ratios is smaller for D2LDL than for the other competing approaches. This indicates that our approach is capable of implementing age estimation when the number of training samples is small (about 2000 face images). The above observations demonstrate that our label distribution is a feasible age representation for predicting the face age. However, when the training sample set is very

Fig. 9. Cumulative Score of the age estimation comparison study on the four small datasets of MORPH (x-axis: error level in years; y-axis: CS). Clearly, D2LDL has a better CS than the competing approaches at all error levels. This indicates that the majority of predictions made by D2LDL do not deviate much from the original age labels.

inadequate, the age prediction performance of our approach will also deteriorate. For example, at a training set ratio of 5%, the MAE of D2LDL is 5.30 on the FG-NET dataset and 4.81 on the MORPH dataset. The reasons for this result are that, at a very small training set ratio, 1) it is difficult to build an effective mapping from the image features to the age labels, and 2) it is difficult to discover the neighboring samples that have both close age labels and similar image features.

b) The datasets based on the gender and race attributes. To obtain a more detailed view of model performance, we divide the MORPH dataset into four small datasets based on the gender and race attributes and then implement age estimation on each of these datasets. Fig. 8 shows the MAE comparison study among the five approaches for these four small datasets (the datasets with the Female & European, Female & African, Male & European, and Male & African attributes). Clearly, D2LDL obtains the best prediction performance (the lowest MAE) at all the training set ratios. We discover that, for D2LDL, IIS-ALDL, and IIS-LLD, the MAE decreases more slowly for the female samples than for the male samples when the training set ratio varies from 10% to 20%. This indicates that building the age prediction model for the female samples needs more training images than for the male samples.

Fig. 9 shows the CS comparison study among the five approaches for the above four small datasets. Clearly, D2LDL has a better CS than the competing approaches at all error levels. Besides, the CS differences among these approaches vary with the error level. For example, the CS difference is smallest at the 1-year and 12-year error levels and largest at the 5-year error level. The CS curve of our approach shows that the majority of our predictions are close to the original age labels. Considering that we use the visual image features as the prediction features, the results in Fig. 8

Fig. 10. Mean Absolute Error of the age estimation comparison study on the four small datasets of MORPH at each age (x-axis: age indices in years; y-axis: MAE). Clearly, D2LDL has a better prediction performance (lower MAE) than the competing approaches for the old male face samples.

Fig. 11. Mean Absolute Error of D2LDL under variation of the factors ε and µ. (a) Adding the flexible temporal smoothing constraint is able to capture the temporal similarity among the adjacent regression vectors. (b) Applying the data-dependent label distribution is helpful in improving the prediction performance.

and Fig. 9 indicate that our label distributions, built by learning the samples' visual context correlations, can be considered flexible prediction objectives.

Fig. 10 shows the MAEs of these five approaches at each age on the above four small datasets. Clearly, the MAEs of the competing approaches are slightly lower than those of D2LDL at the ages of 30-38, but D2LDL obtains the best prediction performance (lowest MAE) at the other age labels. As we know, at these other ages (e.g., young or old ages), the individual visual difference between face samples at the same age level is often large. Unlike the competing approaches, which give such samples the same prediction objective, our approach builds their label distributions based on not only their own age labels but also the age labels of their context-neighboring samples. This indicates that our approach is capable of distinguishing them based on the difference in their context information.


c) Experiments on the variation of the factors. Fig. 11(a) shows the MAE of D2LDL under variation of the factor ε. Clearly, the MAE decreases as ε increases from 1 to 4. This indicates that adding the flexible temporal smoothing constraint is able to capture the temporal similarity among the adjacent regression vectors. Besides, the MAE increases as ε increases from 4 to 8, which shows that our constraint captures the local temporal similarity of the regression vectors. Fig. 11(b) shows the MAE of D2LDL under variation of the factor µ. In principle, the MAE curve for µ from 0.9 to 0.5 describes the variation of the prediction performance from the "one-hot" model to D2LDL. Clearly, the MAE decreases as µ decreases from 0.9 to 0.5. Besides, the variation of the MAE is slow as µ decreases from 0.5 to 0.1, which shows that our label distribution is stable in this range of µ. The above observations indicate that applying the data-dependent label distribution is helpful in improving the prediction performance.

V. CONCLUSION

In this paper, we have proposed a novel approach, D2LDL, to address age prediction. The proposed approach explicitly analyzes the cross-age correlation by learning the context relationships among the face samples, where we construct the face context graph through subspace structure learning. Our age estimation model is implemented by jointly optimizing the tasks of age-label distribution learning and age prediction. As a result, the learned data-dependent label distributions are capable of improving the age representation while effectively accommodating the age ambiguity. Experimental results show that our label distributions have a variety of forms, and we give a detailed analysis of the influence of the "racial composition" on the distribution form. In addition, the comparison study has demonstrated the effectiveness of our approach on FG-NET and MORPH.

REFERENCES

[1] C. Zhang and G. Guo, "Exploiting unlabeled ages for aging pattern analysis on a large database," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, June 2013, pp. 458–464.

[2] X. Geng, Z.-H. Zhou, and K. Smith-Miles, "Automatic age estimation based on facial aging patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 12, pp. 2234–2240, Dec 2007.

[3] S. Kohail, "Using artificial neural network for human age estimation based on facial images," in Innovations in Information Technology (IIT), 2012 International Conference on, March 2012, pp. 215–219.

[4] Y. Zhang and D.-Y. Yeung, "Multi-task warped gaussian process for personalized age estimation," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, June 2010, pp. 2622–2629.

[5] B. Ni, Z. Song, and S. Yan, "Web image and video mining towards universal and robust age estimator," Multimedia, IEEE Transactions on, vol. 13, no. 6, pp. 1217–1229, Dec 2011.

[6] K.-Y. Chang, C.-S. Chen, and Y.-P. Hung, "Ordinal hyperplanes ranker with cost sensitivities for age estimation," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 585–592.

[7] X. Geng, Z.-H. Zhou, Y. Zhang, G. Li, and H. Dai, "Learning from facial aging patterns for automatic age estimation," in Proceedings of the 14th ACM International Conference on Multimedia, ser. MM '06, 2006, pp. 307–316.

[8] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1891–1898.

[9] A. Garg and R. Bajaj, "Facial expression recognition & classification using hybridization of ICA, GA, and neural network for human-computer interaction," Journal of Network Communications and Emerging Technologies (JNCET), vol. 2, no. 1, 2015.

[10] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 7, pp. 971–987, Jul 2002.

[11] C. Liu and H. Wechsler, "A gabor feature classifier for face recognition," in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, 2001, pp. 270–275.

[12] G. Guo, G. Mu, Y. Fu, and T. Huang, "Human age estimation using bio-inspired features," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, June 2009, pp. 112–119.

[13] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment," in Computer Vision – ECCV 2014, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer International Publishing, 2014, vol. 8690, pp. 1–16.

[14] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 1701–1708.

[15] G. Guo and G. Mu, "Human age estimation: What is the influence across race and gender?" in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 71–78.

[16] G. Guo and C. Zhang, "A study on cross-population age estimation," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 4257–4263.

[17] X. Geng, C. Yin, and Z.-H. Zhou, "Facial age estimation by learning from label distributions," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 10, pp. 2401–2412, Oct 2013.

[18] X. Geng and Y. Xia, "Head pose estimation based on multivariate label distribution," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 1837–1842.

[19] Y.-K. Li, M.-L. Zhang, and X. Geng, "Leveraging implicit relative labeling-importance information for effective multi-label learning," in Data Mining (ICDM), 2015 IEEE International Conference on, Nov 2015, pp. 251–260.

[20] K. Ricanek and T. Tesafaye, "Morph: a longitudinal image database of normal adult age-progression," in Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, April 2006, pp. 341–345.

[21] T. F. Cootes, G. J. Edwards, C. J. Taylor et al., "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681–685, 2001.

[22] G. Guo, Y. Fu, C. Dyer, and T. Huang, "Image-based human age estimation by manifold learning and locally adjusted robust regression," Image Processing, IEEE Transactions on, vol. 17, no. 7, pp. 1178–1188, July 2008.

[23] D. Cai, X. He, J. Han, and H.-J. Zhang, "Orthogonal laplacianfaces for face recognition," Image Processing, IEEE Transactions on, vol. 15, no. 11, pp. 3608–3614, Nov 2006.

[24] B. Xiao, X. Yang, Y. Xu, and H. Zha, "Learning distance metric for regression by semidefinite programming with application to human age estimation," in Proceedings of the 17th ACM International Conference on Multimedia, ser. MM '09, 2009, pp. 451–460.

[25] W.-L. Chao, J.-Z. Liu, and J.-J. Ding, "Facial age estimation based on label-sensitive learning and age-oriented regression," Pattern Recognition, vol. 46, no. 3, pp. 628–641, 2013.

[26] B. Ni, S. Yan, and A. Kassim, "Learning a propagable graph for semi-supervised learning: classification and regression," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 114–126, 2012.

[27] Y. Fu, Y. Xu, and T. Huang, "Estimating human age by manifold analysis of face pictures and regression on aging features," in Multimedia and Expo, 2007 IEEE International Conference on, July 2007, pp. 1383–1386.

[28] A. Lanitis, C. Draganova, and C. Christodoulou, "Comparing different classifiers for automatic age estimation," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 34, no. 1, pp. 621–628, Feb 2004.

[29] K. Ueki, T. Hayashida, and T. Kobayashi, "Subspace-based age-group classification using facial images under various lighting conditions," in Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, April 2006.


[30] S. Yan, H. Wang, X. Tang, and T. Huang, “Learning auto-structuredregressor from uncertain nonnegative labels,” in Computer Vision, 2007.ICCV 2007. IEEE 11th International Conference on, Oct 2007, pp. 1–8.

[31] X. Geng, Q. Wang, and Y. Xia, “Facial age estimation by adaptivelabel distribution learning,” in Pattern Recognition (ICPR), 2014 22ndInternational Conference on, Aug 2014, pp. 4465–4470.

[32] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm,theory, and applications,” Pattern Analysis and Machine Intelligence,IEEE Transactions on, vol. 35, no. 11, pp. 2765–2781, Nov 2013.

[33] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery ofsubspace structures by low-rank representation,” Pattern Analysis andMachine Intelligence, IEEE Transactions on, vol. 35, no. 1, pp. 171–184, Jan 2013.

[34] C. Lu, J. Feng, Z. Lin, and S. Yan, "Correlation adaptive subspace segmentation by trace lasso," in Computer Vision (ICCV), 2013 IEEE International Conference on, Dec 2013, pp. 1345–1352.

[35] B. Li, Y. Zhang, Z. Lin, and H. Lu, "Subspace clustering by mixture of Gaussian regression," June 2015.

[36] C.-G. Li and R. Vidal, "Structured sparse subspace clustering: A unified optimization framework," in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, June 2015, pp. 277–286.

[37] S. Tierney, J. Gao, and Y. Guo, "Subspace clustering for sequential data," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 1019–1026.

[38] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, "Diversity-induced multi-view subspace clustering," in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, June 2015, pp. 586–594.

[39] S. Yan, H. Wang, Y. Fu, J. Yan, X. Tang, and T. S. Huang, "Synchronized submanifold embedding for person-independent pose estimation and beyond," IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 202–210, Jan 2009.

[40] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, June 2010, pp. 3360–3367.

[41] N. Ramanathan and R. Chellappa, "Modeling age progression in young faces," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1, June 2006, pp. 387–394.

[42] X. Geng, K. Smith-Miles, and Z.-H. Zhou, "Facial age estimation by nonlinear aging pattern subspace," in Proceedings of the 16th ACM International Conference on Multimedia, ser. MM '08. New York, NY, USA: ACM, 2008, pp. 721–724.

[43] X. Cai, F. Nie, W. Cai, and H. Huang, "New graph structured sparsity model for multi-label image annotations," in The IEEE International Conference on Computer Vision (ICCV), December 2013.

[44] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.

[45] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Advances in Neural Information Processing Systems 19, B. Scholkopf, J. Platt, and T. Hoffman, Eds. MIT Press, 2007, pp. 41–48.

[46] F. Nie, H. Huang, X. Cai, and C. H. Ding, "Efficient and robust feature selection via joint ℓ2,1-norms minimization," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds. Curran Associates, Inc., 2010, pp. 1813–1821.

[47] A. Lanitis, C. J. Taylor, and T. F. Cootes, "Toward automatic simulation of aging effects on face images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 442–455, Apr 2002.

[48] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 675–678.

[50] L. Xu, Z. Wang, Z. Shen, Y. Wang, and E. Chen, "Learning low-rank label correlations for multi-label classification with missing labels," in Data Mining (ICDM), 2014 IEEE International Conference on, Dec 2014, pp. 1067–1072.