

652 IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 10, NO. 4, JULY 2013

Automatic Annotation of Satellite Images via Multifeature Joint Sparse Coding With Spatial Relation Constraint

Xinwei Zheng, Xian Sun, Kun Fu, and Hongqi Wang

Abstract—In this letter, we propose a novel framework for large-satellite-image annotation using multifeature joint sparse coding (MFJSC) with spatial relation constraint. The MFJSC model imposes an l1,2-mixed-norm regularization on the encoded coefficients of the features. The regularization encourages the coefficients to share a common sparsity pattern, which preserves the cross-feature information and eliminates the constraint that the features must have identical coefficients. Spatial dependences between patches of large images are useful for the annotation task but are usually ignored or insufficiently exploited in other methods. In this letter, we design a spatial-relation-constrained classifier that uses the output of MFJSC and the spatial dependences to annotate images more precisely. Experiments on a data set of 21 land-use classes and on QuickBird images show the discriminative power of MFJSC and the effectiveness of our annotation framework.

Index Terms—Large-image annotation, multifeature joint sparse coding (MFJSC), spatial information.

I. INTRODUCTION

HUGE quantities of high-resolution satellite images are being acquired every day. Due to their size, automatically annotating large satellite images effectively and efficiently has become a pressing need for the development of intelligent databases.

Much work has been done on automatically adding a semantic label to an image according to its content, such as satellite image classification based on support vector machines (SVMs) [2]–[4] or two-layer sparse coding (TSC) [6], and semantic annotation using latent Dirichlet allocation (LDA) [9]. Although the SVM- and LDA-based methods have both achieved quite promising results, they require a sophisticated learning model for each category, which limits their use given the complicated content of satellite images. The TSC method, which is based on sparse representation-based classification (SRC) [15], was proposed to handle classification without a learning phase. However, SRC simply concatenates the features into a long vector, which means that different features must have identical coefficients. Due to the nonlinearity of feature extraction, different features extracted from the same image may have different coefficients, so SRC weakens the effectiveness of multiple features.

Manuscript received March 13, 2012; revised May 21, 2012, July 19, 2012, and August 19, 2012; accepted August 26, 2012. Date of publication October 15, 2012; date of current version November 30, 2012. This work was supported in part by the National Natural Science Foundation of China under Grant 41001285.

The authors are with the Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LGRS.2012.2216499

In addition, spatial information can also be used to aid annotation. It is reasonable to assume that adjacent patches in a large satellite image belong to the same category with high probability. Bruzzone et al. [3] proposed a novel SVM-based classifier that exploits the spatial information in the training phase and obtained higher classification accuracy. However, this work is not applicable to our method, which has no learning phase. Liénou et al. [9] used overlapping patches and a majority vote over the common parts to annotate an image more precisely. However, the spatial information is still insufficiently exploited, since no constraint is imposed between patches.

To improve the annotation result, we propose a new framework based on multifeature joint sparse coding (MFJSC) with spatial relation constraint. MFJSC encourages the coefficients to share a common sparsity pattern in order to preserve the cross-feature information and eliminate the limitation of SRC mentioned earlier. Joint sparsity is achieved by imposing an l1,2-mixed-norm penalty on the coefficients [11]. In addition, we design a spatial-relation-constrained classifier (SRCC) that uses the output of MFJSC and the spatial dependences to annotate images more precisely. To be incorporated into the new classifier, the spatial dependences are formulated as penalties on the dissimilarity of adjacent patches. Unlike other methods, which label the test patches one by one, SRCC labels all the patches simultaneously by minimizing an objective function composed of all of the reconstruction errors and the penalties on the dissimilarity of adjacent patches.

The remainder of this letter is organized as follows. In the next section, we discuss our new annotation framework in detail. The experiment settings and results are presented in Section III. Finally, we conclude this letter in Section IV.

II. METHODOLOGY

Fig. 1 shows a typical scenario for annotating large satellite images. The details of each step are discussed as follows.

A. Multiple-Feature Extraction

Since satellite images usually have complex content and a cluttered background, a single feature is insufficient for the classification task. Therefore, K features (K = 4 in our experiments) are used in our annotation framework. The parameters of the features are part of the experiment settings, so we leave the detailed discussion of the features to Section III.

1545-598X/$31.00 © 2012 IEEE


Fig. 1. Framework of the annotation system.

B. Dictionary Generation

Let I_i be the ith patch in the training data set, where i ∈ {1, 2, . . . , N} and N is the number of training patches. Let y_i^k be the kth feature of I_i, where k ∈ {1, 2, . . . , K}; then, D^k = [y_1^k  y_2^k  · · ·  y_N^k] is the dictionary for feature k. The label of y_i^k is preserved and will be used to calculate the reconstruction errors.

In our framework, dictionary generation is the only phase that needs to be prepared in advance. It is quite simple and easy to extend. This advantage is very useful in the following situation: we want to enlarge the trained model when a confidently labeled patch is encountered in the testing phase. Many other methods (e.g., LDA and SVM) must retrain the model, while ours only has to add the new features to the dictionaries.
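The dictionary-generation step above can be sketched as simple column stacking. This is a minimal NumPy illustration under our own assumptions; `feature_fns` stands in for the paper's K feature extractors and is purely hypothetical.

```python
import numpy as np

def build_dictionaries(train_patches, feature_fns):
    """Stack per-feature training vectors column-wise: D^k = [y_1^k ... y_N^k].

    `feature_fns` is a hypothetical list of K feature extractors, each mapping
    a patch to a 1-D descriptor. Column order matches the training-label order,
    so the per-class reconstruction errors of (8) can be computed later.
    """
    dictionaries = []
    for extract in feature_fns:
        # One column per training patch, in the same order as the labels.
        Dk = np.column_stack([extract(p) for p in train_patches])
        dictionaries.append(Dk)
    return dictionaries

# Toy usage with two dummy "features" (assumptions, not the paper's descriptors):
patches = [np.full((4, 4), float(i)) for i in range(3)]   # N = 3 training patches
feats = [lambda p: p.ravel(), lambda p: p.mean(axis=0)]   # K = 2 extractors
D = build_dictionaries(patches, feats)
# D[0] is 16 x 3, D[1] is 4 x 3
```

Extending the model then amounts to appending one column per new patch to each D^k, which is why no retraining is needed.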

C. MFJSC

Denote y^k as the kth feature extracted from the test image, and let Y = [y^1  y^2  · · ·  y^K] be the matrix of features. If N is large enough, y^k can be well approximated by a linear superposition of the elements of D^k. Let ω^k ∈ R^N be the reconstruction coefficients of y^k; then, W = [ω^1  ω^2  · · ·  ω^K] is the matrix of coefficients. Denote ω_i as the ith row of W. We now define a regularization term, the l1,2 mixed norm of W, i.e., the l1 sum of the l2 norms of the rows ω_i, which encourages similar sparsity patterns across all features [11]. Our MFJSC model is then formulated as the solution to the following minimization problem with l1,2-mixed-norm regularization:

min_W (1/2) Σ_{k=1}^{K} ‖y^k − D^k ω^k‖_2^2 + λ‖W‖_{1,2}    (1)

where λ is a tuning parameter controlling the sparsity of W. To solve problem (1), we use the accelerated proximal gradient method proposed in [5] and [14]. Denote f(W) = (1/2) Σ_{k=1}^{K} ‖y^k − D^k ω^k‖_2^2 as the fidelity term. Equation (1) can be substituted by a proximal function Q_L(W, W_t)

Q_L(W, W_t) = f(W_t) + ⟨W − W_t, ∇f(W_t)⟩ + (L/2)‖W − W_t‖_F^2 + λ‖W‖_{1,2}    (2)

where ⟨A, B⟩ = tr(A^T B) denotes the inner product of A and B, ∇f(W) = [∂f/∂ω^1  ∂f/∂ω^2  · · ·  ∂f/∂ω^K] is the gradient of f(W) with respect to W, ‖ · ‖_F is the Frobenius norm, and L is the Lipschitz constant of ∇f(W)

L = max_{k=1,...,K} λ_max((D^k)^T D^k)    (3)

where λ_max(·) is the largest eigenvalue of its argument. In the iterative procedure, W is updated with

W_{t+1} = argmin_W Q_L(W, V_t)    (4)

where V_t is a linear combination of the previous iterates of W, i.e., {W_{t−1}, W_t}. Rewriting (2), we obtain

W_{t+1} = argmin_W (1/2)‖W − (V_t − (1/L)∇f(V_t))‖_F^2 + (λ/L)‖W‖_{1,2}.    (5)

Then, (5) can be decomposed into separate subproblems, row by row

ω_{i,t+1} = argmin_w (1/2)‖w − z_i‖_2^2 + (λ/L)‖w‖_2    (6)

where z_i is the ith row of Z = V_t − (1/L)∇f(V_t). Equation (6) can be easily solved using a shrinkage operator

w* = [1 − λ/(L‖z_i‖_2)]_+ z_i    (7)

where [·]_+ = max(·, 0). The optimization procedure for solving problem (1) is summarized as Algorithm I.
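The row-wise shrinkage of (7) is the only non-trivial operation in each iteration. A minimal NumPy sketch (function name is ours):

```python
import numpy as np

def row_shrink(z, lam_over_L):
    """Shrinkage operator of (7): w* = [1 - (lam/L)/||z||_2]_+ * z."""
    norm = np.linalg.norm(z)
    scale = max(1.0 - lam_over_L / norm, 0.0) if norm > 0 else 0.0
    return scale * z

# A row with l2 norm 5 shrunk by lam/L = 1 keeps 4/5 of its magnitude;
# a row with norm below lam/L is zeroed, which creates the row sparsity.
print(row_shrink(np.array([3.0, 4.0]), 1.0))   # -> [2.4 3.2]
print(row_shrink(np.array([0.1, 0.0]), 1.0))   # -> [0. 0.]
```

Zeroing whole rows of W at once is what forces the K coefficient vectors to share a common sparsity pattern.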

Algorithm I (MFJSC)
Input: the feature and dictionary pairs {y^k, D^k}_{k=1}^{K}, tuning parameter λ.
Initialization: W_0 ∈ R^{N×K}, V_0 = W_0, a_0 = 1; compute L using (3) and set t = 0.
While not converged Do
    Step 1: Find the W_{t+1} that minimizes Q_L(W, V_t):
        Z = V_t − (1/L)∇f(V_t)
        For each row i of W_{t+1}:
            ω_{i,t+1} = [1 − λ/(L‖z_i‖_2)]_+ z_i
        End For
    Step 2: Update V_{t+1}:
        a_{t+1} = 2/(t + 3)
        V_{t+1} = W_{t+1} + ((1 − a_t)/a_t) a_{t+1} (W_{t+1} − W_t)
        t = t + 1
End While
Output: coefficient matrix W
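Algorithm I can be sketched end to end in a few lines of NumPy. This is our own illustrative implementation under the paper's update rules (3)–(7), not the authors' code; the fixed iteration count stands in for a proper convergence test.

```python
import numpy as np

def mfjsc(ys, Ds, lam, n_iter=200):
    """Sketch of Algorithm I: joint sparse coding with l1,2 regularization.

    ys: list of K feature vectors y^k; Ds: list of K dictionaries D^k sharing
    the same number of columns N. Returns the N x K coefficient matrix W.
    """
    K, N = len(Ds), Ds[0].shape[1]
    # Lipschitz constant (3): max over k of the largest eigenvalue of (D^k)^T D^k.
    L = max(np.linalg.eigvalsh(D.T @ D).max() for D in Ds)
    W = np.zeros((N, K)); V = W.copy(); a = 1.0
    for t in range(n_iter):
        # Gradient of the fidelity term, one column per feature.
        grad = np.column_stack(
            [Ds[k].T @ (Ds[k] @ V[:, k] - ys[k]) for k in range(K)])
        Z = V - grad / L
        # Row-wise shrinkage (7) applied to every row of Z at once.
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        W_new = np.maximum(1.0 - (lam / L) / np.maximum(norms, 1e-12), 0.0) * Z
        # Momentum update of V (Step 2).
        a_new = 2.0 / (t + 3.0)
        V = W_new + ((1.0 - a) / a) * a_new * (W_new - W)
        W, a = W_new, a_new
    return W
```

With λ small, the iterates approach the per-feature least-squares solutions while the shared row support is preserved.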

D. SRCC

The reconstruction error of the test patch represented by a linear combination of the elements chosen from the jth category is

r(j) = (1/K) Σ_{k=1}^{K} ‖y^k − D_j^k ω_j^k‖_2    (8)


where D_j^k and ω_j^k are the parts of the dictionary and the coefficients associated with the jth category, respectively. Unlike SRC and TSC, which choose the label that achieves the minimal reconstruction error, we design a new classifier in terms of the dependences between adjacent patches that usually occur in satellite images. The classifier is an optimization procedure that minimizes the sum of the reconstruction errors and a function that measures the dissimilarity of adjacent patches. Denote X = [x_1, x_2, . . . , x_M] as the matrix of reconstruction errors of the test patches, where x_i = [r(1)  r(2)  · · ·  r(J)]^T is the vector of reconstruction errors corresponding to the ith patch, J is the number of categories to be annotated, and M is the number of test patches. Denote Z = {z_1, z_2, . . . , z_M} as the label sequence of the test patches. We model the spatial relations of the patches as an undirected graph G = {V, E}, where V = {v_1, v_2, . . . , v_M} is the set of vertices, which denote the patches, and E is the set of edges; only adjacent patches have a connection. Inspired by the conditional random field model [7], we formulate the classification rule as follows:

Z* = argmin_Z Σ_{i∈V} x_i(z_i) + γ Σ_{i∈V} Σ_{j∈N_i} δ(z_i, z_j) x_i(z_i) x_j(z_j)    (9)

where N_i is the neighborhood of the ith patch, γ is a tuning parameter, and

δ(z_i, z_j) = 1 if z_i ≠ z_j, and 0 otherwise.    (10)

The first term in (9) minimizes the total reconstruction error, while the second decreases the dissimilarity of adjacent patches. It is designed following this principle: the smaller the reconstruction error of a patch, the greater the influence that it imposes on its neighbors. In addition, γ controls the tradeoff between the total reconstruction error and the spatial similarity. If it is set to zero, our classification rule degenerates to that of SRC and TSC.

Algorithm II (Procedure of automatic annotation)
Input: feature dictionaries {D^k}_{k=1}^{K}, image to be annotated I, tuning parameters λ, γ.
1. Cut image I into small patches {I_i}_{i=1}^{M}.
2. Calculate the reconstruction errors {x_i}_{i=1}^{M} with MFJSC:
    2.1 Extract multiple features {y^k}_{k=1}^{K} from I_i.
    2.2 Apply Algorithm I to obtain the coefficient matrix W.
    2.3 Use (8) to calculate the reconstruction error x_i.
3. Jointly label all patches with SRCC:
    3.1 Initialize Z with Z = argmin_Z Σ_{i∈V} x_i(z_i).
    3.2 Fix the other labels and update {z_i}_{i=1}^{M} by
        z_i = argmin_{z∈{1,2,...,J}} x_i(z) + γ Σ_{j∈N_i} δ(z, z_j) x_i(z) x_j(z_j).
    3.3 If Z changed, repeat 3.2.
Output: label sequence Z of image I.

Fig. 2. Samples of the data set of 21 land-use classes.

TABLE I
CLASSIFICATION ACCURACIES OF DIFFERENT METHODS

The global solution of (9) is hard to obtain, so we solve it with a coordinate descent algorithm to acquire a local solution. Fortunately, a good initial estimate can always be obtained by first using the classification rule of SRC to get a rough label sequence, which helps the algorithm achieve a promising result.
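The coordinate-wise solver for (9), corresponding to step 3 of Algorithm II, can be sketched as follows. This is our own minimal NumPy version under the reading of (10) in which δ is 1 when the labels differ, so that the second term penalizes dissimilar adjacent patches; all names are ours.

```python
import numpy as np

def srcc_labels(X, neighbors, gamma, max_sweeps=50):
    """Sketch of the SRCC rule (9)-(10) solved by coordinate descent.

    X: J x M matrix of reconstruction errors (column i = x_i);
    neighbors: list of M neighbor-index lists (the edge set E);
    gamma: tradeoff between reconstruction error and spatial smoothness.
    """
    J, M = X.shape
    Z = X.argmin(axis=0)              # 3.1: initialize with the SRC rule.
    for _ in range(max_sweeps):
        changed = False
        for i in range(M):            # 3.2: update each label, others fixed.
            cost = X[:, i].copy()
            for j in neighbors[i]:
                # The penalty is added for every candidate z that
                # DIFFERS from the neighbor's current label z_j.
                mismatch = np.ones(J); mismatch[Z[j]] = 0.0
                cost += gamma * mismatch * X[:, i] * X[Z[j], j]
            z_new = int(cost.argmin())
            if z_new != Z[i]:
                Z[i], changed = z_new, True
        if not changed:               # 3.3: stop once Z is stable.
            break
    return Z
```

On a toy chain of three patches where the middle patch weakly prefers a different class, γ = 0 reproduces the plain SRC labels, while γ = 1 flips the middle patch to agree with its neighbors, which is exactly the smoothing effect described for Fig. 4(e) and (f).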

The procedure of automatic annotation of satellite imagery is summarized as Algorithm II.

III. EXPERIMENTS

A. Feature Extraction

Four types of features, namely, DAISY [13], geometric blur [1], scale-invariant feature transform (SIFT) [10], and self-similarity [12], are used in our experiments. All of these features are first computed on a regular grid, which is set so that the number of local descriptors is around 100. The descriptors are then vector quantized into a vocabulary of visual words using the K-means clustering algorithm.
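The vector-quantization step above turns a patch's local descriptors into a bag-of-visual-words histogram. A minimal sketch, assuming the K-means vocabulary has already been learned from training descriptors (names and the toy data are ours):

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual-word vocabulary and
    return a normalized word histogram (nearest word by Euclidean distance)."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy vocabulary of 3 words in 2-D; 4 descriptors, two of them near word 0.
vocab = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.0, 0.1], [0.9, 0.1], [0.1, 0.9]])
print(bovw_histogram(desc, vocab))   # -> [0.5 0.25 0.25]
```

Each of the K feature types gets its own vocabulary, so a patch yields K such histograms, one per dictionary.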


Fig. 3. Per-class classification accuracies of LDA, SRC, SVM, and MFJSC.

Fig. 4. Annotation of large satellite images into four classes: (red) RA, (yellow) PA, (blue) WA, and (green) GA. (a) Images to be annotated. (b) Ground truth made with reference to Google Maps. (c)–(f) Annotation results of MFJSC, SVM, MFJSC–SRCC, and SVM–SRCC. In (e) and (f), the annotations look much smoother and achieve higher accuracies.


B. Classification Performance

First, we evaluate the classification performance of MFJSC without SRCC against three other methods, namely, SVM, LDA, and SRC, on a data set of 21 land-use classes [16]. The 21 classes are agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Two samples of each class are shown in Fig. 2.

The images in each class are randomly divided into two parts: 50% for training and the remaining 50% for testing. We use the same settings as those in [8] for LDA and search for "optimal" parameters for SRC and for SVM with radial-basis-function kernels. We ran the test program ten times and averaged the results. The classification accuracies yielded by the different methods are listed in Table I, and the details for each category are shown in Fig. 3. Compared with the nonlearning classifier SRC, MFJSC achieves a much better result; compared with the learning-based classifiers LDA and SVM, MFJSC is better than LDA and quite comparable with SVM. Although it achieves slightly lower accuracy than SVM, MFJSC has advantages that SVM does not share: 1) it avoids overfitting of parameters; 2) it requires no learning phase; and 3) it extends easily to more categories. When the training sets or categories are dynamic, SVM needs to retrain a model for every category, which takes a long time, while MFJSC only requires rearranging the dictionaries, which is nearly instantaneous.

C. Annotation

The annotations are performed on panchromatic QuickBird images of Washington, Sacramento, Houston, Philadelphia, and Manhattan, all with 0.6-m resolution [see Fig. 4(a)]. The images are of size 6000 × 6000 pixels and are cut into small patches of size 100 × 100 pixels. All of the patches are annotated with one of four labels: residential area (RA), public area (PA), water area (WA), and green area (GA). For each label, we collect 200 patches from other satellite images as the training set. To evaluate the accuracy of the annotation, we manually label the images as ground truth, referring to Google Maps [see Fig. 4(b)].
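The patch-cutting step (step 1 of Algorithm II) is straightforward; a minimal sketch under our own assumptions (non-overlapping tiles, edge remainders dropped):

```python
import numpy as np

def cut_patches(image, size=100):
    """Cut a 2-D image into non-overlapping size x size patches, row-major."""
    H, W = image.shape
    rows, cols = H // size, W // size
    return [image[r * size:(r + 1) * size, c * size:(c + 1) * size]
            for r in range(rows) for c in range(cols)]

# A 6000 x 6000 scene yields 60 * 60 = 3600 patches of 100 x 100 pixels.
```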

In our annotation system, MFJSC is first applied to obtain the reconstruction errors for each patch independently, and then SRCC is used to generate the annotation of the whole image. To further evaluate the performance of SRCC, we also combine SVM with SRCC, using the decision values of SVM as the input of SRCC. The visual annotations of the images and the accuracies generated by MFJSC, SVM, MFJSC–SRCC, and SVM–SRCC are shown in Fig. 4. As we can see, the annotations using MFJSC and SVM are both correct in most areas but leave many isolated incorrect labels that look like salt-and-pepper noise. In contrast, when MFJSC or SVM is combined with our new classifier, most of the "noise" disappears. Compared with MFJSC/SVM, MFJSC–SRCC/SVM–SRCC yields smoother annotations and higher accuracies. That is, when it is ambiguous which category a patch should be assigned to, the spatial relationship helps to decide and improves the annotation result.

IV. CONCLUSION

In this letter, we have developed a model based on MFJSC and designed an SRCC for the annotation of large satellite images. MFJSC jointly represents each feature as a superposition of a small set of dictionary elements, using a common sparsity pattern, which means that the positions of the nonzero coefficients tend to be the same. The reconstruction errors obtained by MFJSC are then used to annotate the large satellite image with our new classifier, which exploits spatial information via an optimization procedure. The classification and annotation results on real data sets demonstrate the effectiveness of MFJSC and show that our annotation system obtains quite promising results.

REFERENCES

[1] A. C. Berg and J. Malik, "Geometric blur for template matching," in Proc. IEEE CVPR, 2001, pp. I-607–I-614.

[2] L. Bruzzone and L. Carlin, "A multilevel context-based system for classification of very high spatial resolution images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 9, pp. 2587–2600, Sep. 2006.

[3] L. Bruzzone, M. Marconcini, and C. Persello, "Fusion of spectral and spatial information by a novel SVM classification technique," in Proc. IEEE IGARSS, Jul. 2007, pp. 4838–4841.

[4] G. Camps-Valls and A. Rodrigo-Gonzalez, "Classification of satellite images with regularized AdaBoosting of RBF neural networks," in Speech, Audio, Image and Biomedical Signal Processing Using Neural Networks, vol. 83. Berlin, Germany: Springer-Verlag, 2008, pp. 307–326.

[5] X. Chen, W. Pan, J. T. Kwok, and J. G. Carbonell, "Accelerated gradient method for multi-task sparse learning problem," in Proc. 9th IEEE Int. Conf. Data Mining, 2009, pp. 746–751.

[6] D. Dai and W. Yang, "Satellite image classification via two-layer sparse coding with biased image representation," IEEE Geosci. Remote Sens. Lett., vol. 8, no. 1, pp. 173–176, Jan. 2011.

[7] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. ICML, 2001, pp. 282–289.

[8] F. F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE CVPR, 2005, pp. 524–531.

[9] M. Liénou, H. Maître, and M. Datcu, "Semantic annotation of satellite images using latent Dirichlet allocation," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 1, pp. 28–32, Jan. 2010.

[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.

[11] G. Obozinski, B. Taskar, and M. I. Jordan, "Joint covariate selection and joint subspace selection for multiple classification problems," Stat. Comput., vol. 20, no. 2, pp. 231–252, Apr. 2010.

[12] E. Shechtman and M. Irani, "Matching local self-similarities across images and videos," in Proc. IEEE CVPR, 2007, pp. 1–8.

[13] E. Tola, V. Lepetit, and P. Fua, "A fast local descriptor for dense matching," in Proc. IEEE CVPR, 2008, pp. 1–8.

[14] P. Tseng, "On accelerated proximal gradient methods for convex–concave optimization," SIAM J. Optim., 2008.

[15] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[16] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proc. ACM SIGSPATIAL GIS, 2010, pp. 270–279.