
Knowledge Adaptation with Partially Shared Features for Event Detection Using Few Exemplars

Zhigang Ma, Yi Yang, Nicu Sebe, and Alexander G. Hauptmann

Abstract—Multimedia event detection (MED) is an emerging area of research. Previous work mainly focuses on simple event detection in sports and news videos, or abnormality detection in surveillance videos. In contrast, we focus on detecting more complicated and generic events that attract greater user interest, and we explore an effective solution for MED. Moreover, our solution uses only a few positive examples, since precisely labeled multimedia content is scarce in the real world. As the information from these few positive examples is limited, we propose using knowledge adaptation to facilitate event detection. Different from the state of the art, our algorithm is able to adapt knowledge from another source for MED even if the features of the source and the target are partially different, but overlapping. Avoiding the requirement that the two domains be consistent in feature types is desirable because data collection platforms change or augment their capabilities, and we should be able to respond to this with little or no effort. We perform extensive experiments on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other state-of-the-art detection algorithms.

Index Terms—Multimedia Event Detection (MED), Knowledge Adaptation, Heterogeneous Features, Heterogeneous Features based Structural Adaptive Regression (HF-SAR)


1 INTRODUCTION

With ever expanding multimedia collections, multimedia content analysis is becoming a fundamental research issue for many applications such as indexing and retrieval. Multimedia content analysis aims to learn the semantics of multimedia data. To do so, it has to bridge the semantic gap between the low-level features and the high-level semantic content description [17][41]. Different approaches have been proposed in the literature to bridge the semantic gap, either at the concept level or at the event level.

We first highlight the difference between a concept and an event. A “concept” means an abstract or general idea inferred from specific instances of objects, scenes and actions, such as fish, outdoor and boxing. Concepts are lower-level descriptions of multimedia data which can usually be inferred from a single image or a few video frames. In multimedia research, a major thrust of multimedia content analysis is to learn the semantic concepts of the multimedia data and to use these concepts for multimedia indexing and retrieval. Multimedia concept analysis has been widely studied for images and videos [24][32][31].

• Z. Ma and A. Hauptmann are with the School of Computer Science, Carnegie Mellon University. E-mail: kevinma, [email protected]

• Y. Yang is with the School of Information Technology and Electrical Engineering, The University of Queensland. E-mail: [email protected]

• N. Sebe is with the Department of Information Engineering and Computer Science, University of Trento. E-mail: [email protected]

However, as shared personal video collections, news videos and documentary videos have proliferated explosively in recent years, video event analysis is gradually attracting more research interest. An “event” refers to an observable occurrence that interests users, e.g., celebrating the New Year. Compared with concepts, events are higher-level descriptions of multimedia data. A meaningful event builds upon many concepts and is unlikely to be inferred from a single image or a few video frames. For example, the event making a cake consists of a combination of several concepts such as cake, people and kitchen, together with the action making, within a longer video sequence.

Annotation and detection are two different topics in both concept and event analysis. Multimedia annotation, also known as recognition, aims to associate a datum with one or multiple semantic labels (tags). Many approaches have been proposed to improve the annotation accuracy for both images and videos [24][33]. A typical annotation approach first pre-trains a series of classifiers, one for each class, and then applies the pre-trained classifiers to predict the class label of each testing datum. In contrast to annotation, detection identifies the occurrence of a class of interest. One main difference between annotation and detection is that in annotation each testing datum is guaranteed to be a positive sample of one of the predefined classes, while the negative examples in detection can come from an unbounded set of classes. In other words, both the training and testing data in annotation tasks come from a fixed number of classes, but the training and testing data in detection tasks can come from an unlimited number of classes. We have no knowledge of all the concepts or events these negative examples may contain. This provides very limited training information for obtaining a robust detector, thus making detection a challenging problem.

The TREC Video Retrieval Evaluation (TRECVID) community [3] has notably contributed to the research on video concept and event detection by providing a common testbed for evaluating different detection approaches [27]. In the field of multimedia, many other works have also focused on concept detection, e.g., [32][40][22]. However, the research on video event detection is still in its infancy. Before 2011, most existing research on event detection was limited to events in sports [29][39][31] and news video archives [38], or to those with repetitive patterns like running [37] or unusual events in surveillance videos [4][5][6]. In 2010, the TRECVID community launched the task of “Event detection in Internet multimedia (MED)” which aims to encourage new technologies for detecting more generic and complicated events, e.g., landing a fish. For such events, there are huge intra-class variations. Besides, they can only be characterized by long video sequences, which necessitates the exploration of the entire sequence for analysis. Figure 1 shows some frames from two videos of the same event landing a fish. At first glance, we may consider Video 1 to be skiing, as it contains the concept “outdoor with snow” which is not a typical scene for landing a fish. The scene of Video 2 is more typical, in contrast, though it could also be a scene for sailing. The comparison of these two videos demonstrates the huge intra-class variation of complex events. On the other hand, the information from only a few frames is patchy, as shown in Figure 1. Thus, the entire video is needed for analysis.

Figure 1. Sample frames from two videos (Video 1 and Video 2) of the event landing a fish.

SVM has been used in several systems designed for the MED task and has proved to be highly effective [10][11][19]. These systems commonly use a sufficient number of positive examples (about 100) for reliable performance. Recently, NIST has posed the problem of how to attain respectable detection accuracy when there are very few positive examples, since precisely labeled multimedia content is scarce in the real world. In this paper, we focus on developing an effective method for MED with few exemplars. Though SVM is effective in current systems, its performance would likely be less robust when there are only a few positive examples for training. Humans often adapt knowledge obtained from previous experiences to improve learning of new tasks. Therefore, in the same manner, it is advantageous to leverage and adapt knowledge from other related domains or tasks to address the problem of an insufficient number of labeled examples. In the multimedia community, there are available video archives with annotated concept labels, which can be leveraged to facilitate MED with few exemplars. Inspired by [40][18][12], we propose to adapt knowledge from the concept level to assist in our task. Specifically, we use the available video corpora with annotated concepts as our auxiliary resource, and MED is performed on the target videos. The concepts are supposed to be relevant to the event to be detected.

Currently, most knowledge adaptation algorithms require that the features extracted from the raw data in the source domain and the target domain be of exactly the same type. In many applications, such a requirement may be too restrictive, as data collection platforms change or augment their capabilities. In practice, the data in MED and those in the available concept-based video archives usually share only part of their features. For example, many video archives are key-frame based, so they cannot be represented by audio features such as MFCC. These kinds of features are commonly used for MED and provide additional information for event detection. Hence, we propose to study how to effectively adapt knowledge from one domain to another when the available feature sets are partially different but overlapping, for example when one domain has additional or better instrumentation for its observations.

This paper is an extension of our previous work [13]. We summarize the main merits of this paper as follows:

• We perform an exploration of MED with few exemplars by proposing a novel approach built atop knowledge adaptation.

• Unlike many knowledge adaptation methods, our approach does not require that the auxiliary videos have the same events as the target videos. We exploit videos with several semantic concepts to facilitate the event detection on the target videos; the event differs from the concepts and the video collections are different from each other.

• Another merit is that our method is able to adapt knowledge from other sources to the target videos when only parts of the feature space are shared by the two domains. This is an intrinsic difference from most state-of-the-art knowledge adaptation algorithms.

2 RELATED WORK

In this section, we briefly review the related work on video event detection and knowledge adaptation.

2.1 Video Event Detection

Event detection is a challenging problem that has not yet been sufficiently studied. Based on its difficulty, event detection can be roughly categorized into simple event detection, predefined MED and Ad Hoc MED.


2.1.1 Simple Event Detection

Much effort has been dedicated to the detection of sports events, news events, unusual surveillance events or those with repetitive patterns. For example, Xu et al. propose using webcasting text and broadcast video to detect events from live sports games [39]. In [38], a model based on a multi-resolution, multi-source and multi-modal bootstrapping framework has been developed for event detection in news videos. News videos are more constrained, as they are recorded and edited by professionals. Hence, they are usually well structured and easier to analyze compared to Internet MED videos. Adam et al. present an algorithm using multiple local monitors which collect low-level statistics to detect certain types of unusual events in surveillance videos [4]. Wang et al. have proposed a new motion feature using motion relativity and visual relatedness for event detection [37]. Their approach primarily applies to events that have repetitive motion attributes and are usually describable by a single shot, e.g., walking and dancing. The aforementioned events are usually simple, well defined and describable by a short video sequence.

2.1.2 Multimedia Event Detection

In 2010, “Event detection in Internet multimedia (MED)” was initiated in the TRECVID competition by NIST for detecting more complicated events. Compared to the simple events mentioned above, the events in MED usually contain many people and/or objects, various human actions and multiple scenes, and have significant intra-class variations. Additionally, these events take place in much longer and more complex video clips. For instance, making a cake includes objects such as water and bowl; can happen either in the kitchen or outdoors; and is accompanied by specific motions such as getting the flour, adding water and baking within a longer video sequence. Though MED is an arduous problem, researchers have been making steady progress on it [10][11][19][20][21].

NIST introduced the predefined MED competition as follows: each team is given the event kits about 5 months before the submission of the detection system. Hence, there is enough time for the system to be tailored to a specific event. SVM is widely used and shows good performance for predefined MED. We may also use some recent state-of-the-art classifiers for MED. For example, a new family of boosting algorithms is proposed in [28] and demonstrates prominent performance on a variety of applications. In predefined MED, we can identify some event-specific rules or templates to facilitate detection of the particular event.

To address the generalizability of the MED system, NIST introduced the Ad Hoc MED competition^1 in 2012. Ad Hoc MED differs from predefined MED in the sense that we should not tailor the system to a specific event. For this purpose, NIST releases the event kits to each team only about 12 days before the submission of the detection system. In this case, we know the testing events when we build the system, but the short time period does not allow special tuning for a specific event.

For both predefined MED and Ad Hoc MED, NIST has introduced an even more challenging problem, i.e., using few labeled positive exemplars to build a detection system, to deal with the scarcity of labeled multimedia content. Our work focuses on this problem by adapting knowledge from auxiliary concept-based data. As we do not select auxiliary concepts for a particular event, our work is different from predefined MED. Moreover, the time needed to build our system satisfies the time constraint regulated by NIST. Consequently, our work gets as close as possible to Ad Hoc MED in the intended understanding of NIST.

1. http://www.nist.gov/itl/iad/mig/med12.cfm

2.2 Knowledge Adaptation for Multimedia Analysis

Knowledge adaptation, also known as transfer learning, aims to propagate knowledge from an auxiliary domain to a target domain [40][18][12]. Many existing algorithms require that the features extracted from the raw data in the source domain and the target domain come from exactly the same raw sensor output. However, MED deals with very complicated events that come from an unlimited semantic space. Furthermore, the requirement of feature consistency may be too restrictive, as data collection platforms change or augment their capabilities. Hence, most existing methods are not capable of adapting knowledge for MED when the source and the target have heterogeneous feature types. For example, Yang et al. have proposed to use Adaptive SVMs for cross-domain video concept detection [40]. The method obtained encouraging results but has some shortcomings. The proposed approach requires that the auxiliary videos and the target videos share the same video concepts. However, in MED the events are complicated, and collecting many auxiliary videos with the same event description as the target videos within a limited time is impractical. Jiang et al. [18] have used the image context of Flickr to select concept detectors. These pre-selected detectors are then refined by the semantic context learnt from the target domain. In this way, more precise concept detectors are obtained for video search. The proposed method is interesting, but the selected concept detectors cannot be readily used for event detection without other sophisticated algorithms. Besides, as our problem provides only very few positive examples, using these examples to refine the concept detectors is not reliable. Another algorithm, proposed by Duan et al. [12], realizes event recognition in consumer videos by leveraging web videos. Their method does not require that the auxiliary domain and the target domain have the same events. However, the computation of multiple kernels is time- and space-consuming, especially when the feature dimension is high. Luo et al. have presented an object classification method that casts prior features learned from auxiliary images into a multiple kernel learning framework and obtained advantageous performance [14]. Yet this approach works in a two-step fashion, i.e., training prior features using auxiliary data and then incorporating them into the following step. In contrast, our method works in a unified framework which jointly optimizes the knowledge from the auxiliary domain and the target domain. Besides the limitations mentioned above, existing knowledge adaptation algorithms mostly require that the features in the source domain and the target domain be of exactly the same type. However, in practice, this requirement may be too restrictive, as MED videos can be represented by different types of features in contrast with the auxiliary video archives.

Our previous work in [13] has some advantages over existing knowledge adaptation algorithms, such as not requiring the auxiliary domain and the target domain to share the same classes, and its efficiency. But it still ignores the reality that the auxiliary domain and the target domain may have heterogeneous feature types.

To progress beyond the aforementioned works, we propose a new knowledge adaptation method for MED with few exemplars from heterogeneous features. During the training phase, the partially shared features of the source domain and the target domain are exploited to establish a correspondence between the two domains. Meanwhile, the information obtained from the particular MED features is incorporated into our framework. The two kinds of knowledge are then integrated to refine the detector of the target videos.

Figure 2. The illustration of our framework. We first perform a nonlinear mapping of the homogeneous features of the auxiliary and target videos, i.e., Modality A. The video concept classifier and the video event detector obtained from the homogeneous features presumably have common components which contain irrelevance and noise. We propose to remove such negative information by optimizing the concept classifier and the event detector jointly. Meanwhile, another event detector of the MED videos is trained based on both Modality A and Modality B. Then we integrate the two event detectors for optimization, after which the decision values from both are fused for the final prediction.

3 FRAMEWORK OVERVIEW

Figure 2 illustrates our framework for MED with few exemplars. The video archive on which MED is to be conducted is our target domain. A nonlinear mapping is applied to the homogeneous features of the auxiliary and target videos, denoted by Modality A. Based on the resulting representations, the shared knowledge between them is to be explored. Specifically, we perform KPCA [30] to complete the mapping. The video concept classifier and the video event detector obtained from the homogeneous features presumably have common components which contain irrelevance and noise. We propose to remove such components by optimizing the concept classifier and the event detector jointly, thereby bringing discriminating knowledge to the event detector. On the other hand, we have the heterogeneous features, Modality B, for the MED videos, and they are combined with the homogeneous features as indicated in Figure 2. Another event detector of the MED videos is subsequently trained based on the resulting representations from the nonlinear mapping of Modality A and Modality B. Then we integrate the two event detectors for optimization, after which the decision values from both are fused for the final prediction.
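To make the mapping step concrete, the following is a minimal sketch of how a shared nonlinear representation could be produced, assuming scikit-learn's KernelPCA with a precomputed χ2 kernel on Bag-of-Words histograms. The kernel choice, number of components and fitting set are illustrative assumptions, not a specification of our exact pipeline (the features actually used are described in Section 6.2).

```python
# Sketch (illustrative only): nonlinear mapping of Modality-A histograms of
# the auxiliary and target videos into a shared space via kernel PCA.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
aux_bow = rng.random((200, 4096))      # auxiliary videos, Modality A histograms
tgt_bow = rng.random((50, 4096))       # target videos, Modality A histograms

# Fit KPCA on the pooled Modality-A data so both domains share one mapping.
pooled = np.vstack([aux_bow, tgt_bow])
K = chi2_kernel(pooled, pooled)                         # exponential chi2 kernel
kpca = KernelPCA(n_components=128, kernel="precomputed").fit(K)

X_a = kpca.transform(chi2_kernel(aux_bow, pooled)).T    # d_h x n_a, auxiliary
X_t = kpca.transform(chi2_kernel(tgt_bow, pooled)).T    # d_h x n_t, target
print(X_a.shape, X_t.shape)
```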

4 CONCEPTS ADAPTATION ASSISTED EVENT DETECTION

Next, we explain how we adapt knowledge for MED with few exemplars when the two domains have heterogeneous features. Our approach is grounded on two components: one is the knowledge from the available target training examples, and the other is the knowledge propagated from the auxiliary concept-based videos.

We first demonstrate how to exploit the knowledge from the target training examples. Denote the representations of the target training videos using both Modality A and Modality B after the nonlinear mapping as $Z_t = [z_t^1, z_t^2, \ldots, z_t^{n_t}] \in \mathbb{R}^{d_z \times n_t}$, where $t$ stands for the target, $d_z$ is the feature dimension and $n_t$ is the number of training data. $y_t = [y_t^1, y_t^2, \ldots, y_t^{n_t}]^T \in \{0, 1\}^{n_t \times 1}$ are the labels for the target training videos; $y_t^i = 1$ if the $i$th video is a positive example, and $y_t^i = 0$ otherwise. To begin with, we associate the low-level representations and the high-level semantics of videos by a decision function $f$ which, for an input video sequence $z$, predicts an output $y$. In this paper, we define $f_t$ as:

$$f_t(Z_t) = Z_t^T P_t + \mathbf{1}_t b_t, \qquad (1)$$

where $P_t \in \mathbb{R}^{d_z \times 1}$ is an event detector which correlates $Z_t$ with the labels $y_t$, $b_t \in \mathbb{R}$ is a bias term and $\mathbf{1}_t \in \mathbb{R}^{n_t \times 1}$ denotes a column vector of all ones. $f_t$ is obtained by minimizing the following objective over the training examples $Z_t$ and their labels $y_t$:

$$\min_{f_t} \; loss\big(f_t(Z_t), y_t\big). \qquad (2)$$

$loss(\cdot, \cdot)$ is a loss function. Different loss functions, such as the hinge loss and the least squares loss, can be used. In this paper, we use the $\ell_{2,1}$-norm based loss function because it is robust to outliers [25]. Thus, Eq. (2) is reformulated as:

$$\min_{P_t, b_t} \; \big\| Z_t^T P_t + \mathbf{1}_t b_t - y_t \big\|_{2,1}. \qquad (3)$$
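As a quick numerical illustration of the $\ell_{2,1}$-norm loss in Eq. (3), the toy sketch below (synthetic values, not our implementation) shows that each training video contributes the $\ell_2$-norm of its residual row, which is what tempers the influence of outlying examples.

```python
# Toy illustration of the l2,1-norm loss in Eq. (3): sum of the l2-norms of
# the rows of the residual Z_t^T P_t + 1_t b_t - y_t (values are synthetic).
import numpy as np

rng = np.random.default_rng(0)
d_z, n_t = 64, 30
Z_t = rng.standard_normal((d_z, n_t))          # target training videos, d_z x n_t
y_t = (rng.random(n_t) < 0.1).astype(float)    # few positive exemplars
P_t = rng.standard_normal((d_z, 1))            # event detector (to be learned)
b_t = 0.0                                      # bias term

residual = Z_t.T @ P_t + b_t - y_t[:, None]    # n_t x 1
l21_loss = np.sum(np.linalg.norm(residual, axis=1))
print(l21_loss)
```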

Now we show how to adapt the knowledge from auxiliary videos, which are associated with different concepts and are represented only by the homogeneous features, i.e., Modality A, to assist MED with few exemplars. Denote the representations of the auxiliary videos after the nonlinear mapping as $X_a = [x_a^1, x_a^2, \ldots, x_a^{n_a}] \in \mathbb{R}^{d_h \times n_a}$, where $a$ stands for the auxiliary domain, $d_h$ is the feature dimension and $n_a$ is the number of auxiliary videos. $Y_a = [y_a^1, y_a^2, \ldots, y_a^{n_a}]^T \in \{0, 1\}^{n_a \times c_a}$ is their label matrix, where $c_a$ is the number of different concepts. $Y_a^{ij}$ denotes the $j$th entry of $y_a^i$; $Y_a^{ij} = 1$ if $x_a^i$ belongs to the $j$th concept, and $Y_a^{ij} = 0$ otherwise. The fundamental step is to mine the correlation between the low-level representations and the high-level semantics of the auxiliary concept-based videos. Similarly to Eq. (3), we realize this by the following objective function:

$$\min_{W_a, b_a} \; \big\| X_a^T W_a + \mathbf{1}_a b_a - Y_a \big\|_{2,1}, \qquad (4)$$

where the concept classifier $W_a \in \mathbb{R}^{d_h \times c_a}$ correlates $X_a$ with the labels $Y_a$, $b_a \in \mathbb{R}^{1 \times c_a}$ is a bias term and $\mathbf{1}_a \in \mathbb{R}^{n_a \times 1}$ is a column vector of all ones.

Next, we illustrate how to adapt knowledge from the auxiliary concept-based videos for a more discriminating event detector. To begin with, we also use Modality A for the target videos, in accordance with the auxiliary videos. Denote the resulting representations after the nonlinear mapping as $X_t = [x_t^1, x_t^2, \ldots, x_t^{n_t}] \in \mathbb{R}^{d_h \times n_t}$. We can similarly find an event detector $W_t \in \mathbb{R}^{d_h \times 1}$ based on $X_t$, which correlates $X_t$ with the labels $y_t$.

Considering each domain separately, it is reasonable to assume that for classification purposes some noisy and irrelevant features will not be used, which in turn makes the corresponding rows of the projection matrix $W_a$ or $W_t$ identically zero. Considering the two domains together, the auxiliary concept videos and the event videos can be correlated at the semantic level, e.g., the concepts fish, water and people are basic elements of the event landing a fish. Previous work on multi-task learning has suggested that this kind of correlation usually results in common components at the feature level shared across related tasks [7][8][9]. In our scenario, the semantically related auxiliary videos and event videos can be treated as related tasks because the events build upon the related concepts. When we represent videos from both domains with the same type of feature, such as SIFT Bag-of-Words using the same centroids, they will have some shared components. For example, assuming that the event video landing a fish has SIFT Bag-of-Words of fish, we may find similar SIFT Bag-of-Words in an image of fish. Hence, some shared components in the features between them need to be uncovered. Note that the event detector is actually a mapping function from features to event labels. Intuitively, not all Bag-of-Words are related to semantic labels. Given certain Bag-of-Words, if they are irrelevant to all the concepts, it is very likely that they are also irrelevant to the events, because the events build on top of the concepts. Recalling that the corresponding rows of $W_a$ or $W_t$ are identically zero for the irrelevant or noisy features, we should be able to find similar patterns in the distribution of these rows by learning $W_a$ and $W_t$ jointly. Thus, we exploit the concept classifier $W_a$ to help remove the noise in $W_t$ for a more discriminative event detector.

Denote $W_a = [w_a^1, \ldots, w_a^{d_h}]^T$ and $W_t = [w_t^1, \ldots, w_t^{d_h}]^T$. Then we combine them and define a joint analyzer $W = [w^1, \ldots, w^{d_h}]^T$, where $w^i$ is the concatenation of $w_a^i$ and $w_t^i$, i.e., $w^i = [w_a^i; w_t^i]$. In this sense, $w^i$ reflects the joint information from the auxiliary videos and the target training videos. Through proper optimization of $w^i$, we can remove the shared irrelevant or noisy components. Previous work has shown that sparse models are useful for feature selection by eliminating redundancy and noise [7][26][25]. Sparse models make some of the feature coefficients shrink to zero to achieve feature selection. This "shrinking to zero" idea can be applied to uncover the common distribution of the identically zero rows of $W_a$ and $W_t$ discussed above. In this way, we can remove the shared irrelevance and noise, thus obtaining a more discriminative $W_t$.

Now we introduce the technical details of our joint sparsity model. Specifically, we propose to exploit

$$\|W\|_{2,p} = \left( \sum_{i=1}^{d_h} \left( \sum_{j=1}^{c_a+1} |W_{ij}|^2 \right)^{\frac{p}{2}} \right)^{\frac{1}{p}}$$

to achieve that goal. $\|\cdot\|_{2,p}$ denotes the $\ell_{2,p}$-norm $(0 < p < 2)$. By minimizing $\|W\|_{2,p}^p$, we can reduce the negative impact of the irrelevant or noisy $w^i$'s. Our model has the flexibility of characterizing different degrees of relevance between concepts and events: $p$ controls the degree of shared structure. The lower $p$ is, the more semantically correlated the auxiliary concepts and the target event are assumed to be. By contrast, when the auxiliary concepts and the target event are less relevant, we can use a larger $p$. When we increase $p$ to 2, we do not impose sharing on the two domains. Furthermore, it is expected that the predicted labels of $W_t$ on $X_t$ be consistent with those of $P_t$ on $Z_t$, thus resulting in more accurate $P_t$ and $W_t$. In this way, $P_t$ from the heterogeneous features of the target and $W_t$ from the knowledge adaptation jointly augment the observations for MED. We achieve this by minimizing $\big\| X_t^T W_t - Z_t^T P_t \big\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix.

To this end, we propose the following objective function for MED with few exemplars:

$$\min_{P_t, W_t, W_a, b_t, b_a} \; \big\| Z_t^T P_t + \mathbf{1}_t b_t - y_t \big\|_{2,1} + \big\| X_t^T W_t - Z_t^T P_t \big\|_F^2 + \big\| X_a^T W_a + \mathbf{1}_a b_a - Y_a \big\|_{2,1} + \alpha \|W\|_{2,p}^p + \beta \big( \|W_t\|_F^2 + \|W_a\|_F^2 \big), \qquad (5)$$

where the term $\beta(\|W_a\|_F^2 + \|W_t\|_F^2)$ is added to avoid over-fitting, and $\alpha$ and $\beta$ are regularization parameters.
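For readers who prefer code, the sketch below evaluates the objective of Eq. (5) term by term on synthetic matrices. It is an illustration of the formula only; the actual optimization procedure is given in Section 5.

```python
# Toy evaluation of the objective in Eq. (5) with synthetic data, making
# each term explicit (illustrative sketch, not the solver).
import numpy as np

def l21(M):                       # sum of row-wise l2 norms
    return np.sum(np.linalg.norm(M, axis=1))

def l2p_p(M, p):                  # ||M||_{2,p}^p = sum_i ||m^i||_2^p
    return np.sum(np.linalg.norm(M, axis=1) ** p)

rng = np.random.default_rng(0)
d_z, d_h, n_t, n_a, c_a = 80, 40, 30, 100, 5
alpha, beta, p = 10.0, 0.1, 1.0

Z_t, X_t = rng.standard_normal((d_z, n_t)), rng.standard_normal((d_h, n_t))
X_a = rng.standard_normal((d_h, n_a))
y_t = (rng.random((n_t, 1)) < 0.1).astype(float)
Y_a = rng.integers(0, 2, (n_a, c_a)).astype(float)
P_t = rng.standard_normal((d_z, 1))
W_t, W_a = rng.standard_normal((d_h, 1)), rng.standard_normal((d_h, c_a))
b_t, b_a = 0.0, np.zeros((1, c_a))
ones_t, ones_a = np.ones((n_t, 1)), np.ones((n_a, 1))

W = np.hstack([W_a, W_t])         # joint analyzer, rows w^i = [w_a^i ; w_t^i]
objective = (l21(Z_t.T @ P_t + ones_t * b_t - y_t)
             + np.linalg.norm(X_t.T @ W_t - Z_t.T @ P_t, "fro") ** 2
             + l21(X_a.T @ W_a + ones_a @ b_a - Y_a)
             + alpha * l2p_p(W, p)
             + beta * (np.linalg.norm(W_t, "fro") ** 2
                       + np.linalg.norm(W_a, "fro") ** 2))
print(objective)
```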

Once $P_t$ and $W_t$ are obtained, we apply them to the testing videos for event detection. Their decision values are normalized, and their weighted sum, with weights based on the numbers of features, gives the final decision values for the testing videos. Our method builds upon 1) knowledge adaptation from concept-based videos to event-based videos by leveraging the shared structures between them; and 2) the augmented observations from the particular features that are only owned by the MED videos. We therefore name our method Heterogeneous Features based Structural Adaptive Regression (HF-SAR).
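The late fusion of the two detectors' decision values can be sketched as follows. The min-max normalization and the particular feature-count weights are assumptions made for this example, since the text above only states that the scores are normalized and weighted by the numbers of features; the dimensionalities plugged in come from the setup in Section 6.2.

```python
# Illustrative sketch of fusing the decision values of the two detectors
# (P_t on Modality A+B, W_t on Modality A). Min-max normalization and the
# exact feature-count weights are assumptions for this example.
import numpy as np

def minmax(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_scores(scores_AB, scores_A, dim_AB, dim_A):
    """Weighted sum of normalized decision values; weights follow the
    numbers of raw features behind each detector."""
    w_AB = dim_AB / (dim_AB + dim_A)
    w_A = dim_A / (dim_AB + dim_A)
    return w_AB * minmax(scores_AB) + w_A * minmax(scores_A)

rng = np.random.default_rng(0)
scores_AB = rng.standard_normal(100)      # Z_test^T P_t + b_t (Modality A+B)
scores_A = rng.standard_normal(100)       # X_test^T W_t (Modality A)
final = fuse_scores(scores_AB, scores_A,
                    dim_AB=32768 * 2 + 4096, dim_A=32768 * 2)
print(final[:5])
```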

5 OPTIMIZING THE EVENT DETECTOR

In this section, we present our solution for obtaining the target event detector. The problem in Eq. (5) involves the $\ell_{2,1}$-norm and the $\ell_{2,p}$-norm, which are both non-smooth, so it cannot be solved in closed form. We propose to solve it as follows.

Denote $Z_t^T P_t - y_t = [u^1, \ldots, u^{n_t}]^T$ and $X_a^T W_a - Y_a = [v^1, \ldots, v^{n_a}]^T$. Next, we define three diagonal matrices $D_t$, $D_a$ and $D$ with diagonal elements $D_t^{ii} = \frac{1}{2\|u^i\|_2}$, $D_a^{ii} = \frac{1}{2\|v^i\|_2}$ and $D^{ii} = \frac{p}{2\|w^i\|_2^{2-p}}$, respectively. The objective function in Eq. (5) is then equivalent to:

$$\min_{P_t, W_t, W_a, b_t, b_a} \mathrm{Tr}\big((Z_t^T P_t + \mathbf{1}_t b_t - y_t)^T D_t (Z_t^T P_t + \mathbf{1}_t b_t - y_t)\big) + \big\| X_t^T W_t - Z_t^T P_t \big\|_F^2 + \mathrm{Tr}\big((X_a^T W_a + \mathbf{1}_a b_a - Y_a)^T D_a (X_a^T W_a + \mathbf{1}_a b_a - Y_a)\big) + \alpha\, \mathrm{Tr}\big(W^T D W\big) + \beta \big(\|W_a\|_F^2 + \|W_t\|_F^2\big), \qquad (6)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace operator. Setting the derivative of Eq. (6) w.r.t. $b_a$ to zero, we get:

$$b_a = \frac{1}{n_a}\mathbf{1}_a^T Y_a - \frac{1}{n_a}\mathbf{1}_a^T X_a^T W_a. \qquad (7)$$

Similarly, we obtain $b_t$ as:

$$b_t = \frac{1}{n_t}\mathbf{1}_t^T y_t - \frac{1}{n_t}\mathbf{1}_t^T Z_t^T P_t. \qquad (8)$$

Substituting Eq. (7) and Eq. (8) into Eq. (6), it becomes:

$$\min_{P_t, W_t, W_a} \mathrm{Tr}\big((H_t Z_t^T P_t - H_t y_t)^T D_t (H_t Z_t^T P_t - H_t y_t)\big) + \big\| X_t^T W_t - Z_t^T P_t \big\|_F^2 + \mathrm{Tr}\big((H_a X_a^T W_a - H_a Y_a)^T D_a (H_a X_a^T W_a - H_a Y_a)\big) + \alpha\, \mathrm{Tr}\big(W^T D W\big) + \beta \big(\|W_a\|_F^2 + \|W_t\|_F^2\big), \qquad (9)$$

where $H_t = I_t - \frac{1}{n_t}\mathbf{1}_t\mathbf{1}_t^T$, $H_a = I_a - \frac{1}{n_a}\mathbf{1}_a\mathbf{1}_a^T$, and $I_t \in \mathbb{R}^{n_t \times n_t}$, $I_a \in \mathbb{R}^{n_a \times n_a}$ are identity matrices. Setting the derivative of Eq. (9) w.r.t. $W_a$ to zero, we get:

$$W_a = (X_a H_a D_a H_a X_a^T + \alpha D + \beta I_d)^{-1} X_a H_a D_a H_a Y_a, \qquad (10)$$

where $I_d \in \mathbb{R}^{d_h \times d_h}$ is an identity matrix. Note that $D$ is treated as a constant in this step, as we adopt an alternating optimization approach. In the same manner, we obtain the event detector $W_t$ as:

$$W_t = A^{-1} X_t Z_t^T P_t, \qquad (11)$$

where $A = \alpha D + \beta I_d + X_t X_t^T$. To optimize $P_t$, the problem is equivalent to:

$$\min_{P_t} \mathrm{Tr}\big(P_t^T Z_t H_t D_t H_t Z_t^T P_t - 2 P_t^T Z_t H_t D_t H_t y_t\big) + \big\| X_t^T W_t - Z_t^T P_t \big\|_F^2 + \alpha\, \mathrm{Tr}(W_t^T D W_t) + \beta\, \mathrm{Tr}(W_t^T W_t). \qquad (12)$$

Substituting Eq. (11) into Eq. (12) and defining

$$J = Z_t H_t D_t H_t Z_t^T - Z_t X_t^T A^{-1} X_t Z_t^T + Z_t Z_t^T, \qquad (13)$$

$$K = 2 Z_t H_t D_t H_t y_t, \qquad (14)$$

the problem becomes:

$$\min_{P_t} \mathrm{Tr}\big(P_t^T J P_t - P_t^T K\big). \qquad (15)$$

Setting the derivative of the above function w.r.t. $P_t$ to zero, we get:

$$P_t = \frac{1}{2} J^{-1} K. \qquad (16)$$
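As a compact illustration of the alternating updates derived above (and summarized in Algorithm 1 below), the following sketch runs the closed-form updates of Eqs. (7), (8), (10), (11) and (16) on synthetic data. It is an assumption-laden illustration rather than our production code; small constants are added to the reweighting matrices purely for numerical safety, and the toy sizes keep $J$ well conditioned.

```python
# Sketch of the alternating optimization (Eqs. (7)-(16)) on synthetic data.
import numpy as np

def solve_hf_sar(Z_t, X_t, y_t, X_a, Y_a, alpha, beta, p, n_iter=30, eps=1e-8):
    d_z, n_t = Z_t.shape
    d_h, n_a = X_a.shape
    c_a = Y_a.shape[1]
    H_t = np.eye(n_t) - np.ones((n_t, n_t)) / n_t        # centering matrices
    H_a = np.eye(n_a) - np.ones((n_a, n_a)) / n_a
    I_d = np.eye(d_h)
    rng = np.random.default_rng(0)
    P_t = rng.standard_normal((d_z, 1))
    W_t = rng.standard_normal((d_h, 1))
    W_a = rng.standard_normal((d_h, c_a))
    for _ in range(n_iter):
        U = Z_t.T @ P_t - y_t                            # residuals for D_t
        V = X_a.T @ W_a - Y_a                            # residuals for D_a
        W = np.hstack([W_a, W_t])                        # joint analyzer
        D_t = np.diag(1.0 / (2.0 * np.linalg.norm(U, axis=1) + eps))
        D_a = np.diag(1.0 / (2.0 * np.linalg.norm(V, axis=1) + eps))
        D = np.diag(p / (2.0 * np.linalg.norm(W, axis=1) ** (2.0 - p) + eps))
        # Eq. (10): closed-form update of the concept classifier W_a
        W_a = np.linalg.solve(X_a @ H_a @ D_a @ H_a @ X_a.T + alpha * D + beta * I_d,
                              X_a @ H_a @ D_a @ H_a @ Y_a)
        # Eqs. (13), (14), (16): closed-form update of P_t
        A = alpha * D + beta * I_d + X_t @ X_t.T
        A_inv = np.linalg.inv(A)
        J = (Z_t @ H_t @ D_t @ H_t @ Z_t.T
             - Z_t @ X_t.T @ A_inv @ X_t @ Z_t.T + Z_t @ Z_t.T)
        K = 2.0 * Z_t @ H_t @ D_t @ H_t @ y_t
        P_t = 0.5 * np.linalg.solve(J, K)
        # Eq. (11): update of the event detector W_t
        W_t = A_inv @ X_t @ Z_t.T @ P_t
    b_t = float(np.mean(y_t) - np.mean(Z_t.T @ P_t))     # Eq. (8)
    return P_t, W_t, b_t

# Tiny synthetic example (d_z > d_h and d_z <= n_t to keep J invertible).
rng = np.random.default_rng(1)
P_t, W_t, b_t = solve_hf_sar(rng.standard_normal((12, 15)),
                             rng.standard_normal((8, 15)),
                             (rng.random((15, 1)) < 0.2).astype(float),
                             rng.standard_normal((8, 40)),
                             rng.integers(0, 2, (40, 3)).astype(float),
                             alpha=1.0, beta=0.1, p=1.0)
print(P_t.shape, W_t.shape, b_t)
```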

Next, we propose Algorithm 1 to solve the objective function in Eq. (5). The computational complexity of Algorithm 1 is as follows. For training, it is $O(d_z^3)$, as $d_z > d_h$. Note that $d_z$ is small because there are few training examples in our problem, so the training process is not very computationally expensive. During testing, computing the kernels between the testing data and the training data is the most expensive step. Suppose there are $n_{te}$ testing videos; we need to compute $n_t n_{te}$ kernel values. Each datum is $d_z$-dimensional, so the complexity is $O(d_z n_t n_{te})$.

Algorithm 1: Optimizing the event detector.

Input: the target training data $Z_t \in \mathbb{R}^{d_z \times n_t}$, $X_t \in \mathbb{R}^{d_h \times n_t}$, $y_t \in \mathbb{R}^{n_t \times 1}$; the auxiliary data $X_a \in \mathbb{R}^{d_h \times n_a}$, $Y_a \in \mathbb{R}^{n_a \times c_a}$; parameters $\alpha$, $\beta$ and $p$.
Output: optimized $P_t \in \mathbb{R}^{d_z \times 1}$, $W_t \in \mathbb{R}^{d_h \times 1}$ and $b_t \in \mathbb{R}$.
1: Set $t = 0$; initialize $P_t \in \mathbb{R}^{d_z \times 1}$, $W_t \in \mathbb{R}^{d_h \times 1}$ and $W_a \in \mathbb{R}^{d_h \times c_a}$ randomly;
2: repeat
   Compute $Z_t^T P_t - y_t = [u^1, \ldots, u^{n_t}]^T$, $X_a^T W_a - Y_a = [v^1, \ldots, v^{n_a}]^T$ and $W = [w^1, \ldots, w^{d_h}]^T$;
   Compute the diagonal matrices $D_t^t$, $D_a^t$ and $D^t$ according to $D_t^{ii} = \frac{1}{2\|u^i\|_2}$, $D_a^{ii} = \frac{1}{2\|v^i\|_2}$ and $D^{ii} = \frac{p}{2\|w^i\|_2^{2-p}}$, respectively;
   Update $W_a^{t+1} = (X_a H_a D_a^t H_a X_a^T + \alpha D^t + \beta I_d)^{-1} X_a H_a D_a^t H_a Y_a$;
   Update $b_a^{t+1} = \frac{1}{n_a}\mathbf{1}_a^T Y_a - \frac{1}{n_a}\mathbf{1}_a^T X_a^T W_a^{t+1}$;
   Update $P_t^{t+1}$ according to Eq. (13), Eq. (14) and Eq. (16);
   Update $W_t^{t+1} = A^{-1} X_t Z_t^T P_t^{t+1}$;
   Update $b_t^{t+1} = \frac{1}{n_t}\mathbf{1}_t^T y_t - \frac{1}{n_t}\mathbf{1}_t^T Z_t^T P_t^{t+1}$;
   $t = t + 1$;
   until convergence;
3: Return $P_t$, $W_t$ and $b_t$.

It can be proved that the objective function value of Eq. (5) monotonically decreases in each iteration of Algorithm 1 until it converges to a local optimum.

6 EXPERIMENTS

In this section, we present the experiments which evaluate the performance of our Heterogeneous Features based Structural Adaptive Regression (HF-SAR) for MED with few exemplars.

6.1 Datasets

NIST has so far provided the largest video corpora for MED. Our experiments on MED with few exemplars are conducted on the TRECVID MED 2010 (MED10) and TRECVID MED 2011 (MED11) development sets. MED10^2 includes 3 events defined by NIST: Making a cake, Batting a run, and Assembling a shelter. MED11^3 includes 15 events: Attempting a board trick, Feeding an animal, Landing a fish, Wedding ceremony, Working on a woodworking project, Birthday party, Changing a vehicle tire, Flash mob gathering, Getting a vehicle unstuck, Grooming an animal, Making a sandwich, Parade, Parkour, Repairing an appliance and Working on a sewing project. The two datasets are combined (MED10-11 for short) in our experiments, so we have a dataset of 9746 video clips.

2. http://nist.gov/itl/iad/mig/med10.cfm
3. http://www.nist.gov/itl/iad/mig/med11.cfm

We first use the development set of the TRECVID 2012 semantic indexing task (SIN12) as the auxiliary videos. SIN12 covers 346 concepts, but some of them have few positive examples. Additionally, “events” usually refer to “semantically meaningful human activities, taking place within a selected environment and containing a number of necessary objects” [15]. Hence, we removed the concepts with few positive examples and selected 65 concepts that are related to humans, environments and objects. We thus use a subset with 3244 video frames. On the other hand, multimedia events are usually accompanied by human actions, which suggests that we may find similar motion features between event videos and basic human action videos. Hence, we additionally use the UCF50 dataset [16] to test whether it is able to facilitate multimedia event detection.

6.2 Setup

We ran our program on the Carnegie Mellon University Parallel Data Lab cluster, which contains 300 cores, to extract features and perform the Bag-of-Words mapping for all the videos. When utilizing the SIN12 dataset, we extract SIFT [23] and CSIFT [34] features for the videos in MED10-11 and SIN12. Then we use 1x1, 2x2 and 3x1 spatial grids to generate the spatial BoW representation [35]. For each grid cell, we use a standard BoW representation with 4,096 dimensions, thus resulting in a 32,768-dimensional spatial BoW feature for SIFT/CSIFT to represent each video. When utilizing the UCF50 dataset, we extract the STIP [36] feature for the videos in MED10-11 and UCF50, since STIP has proved to be robust for analyzing action videos. A similar procedure is followed to generate the spatial BoW representation. Hence, the feature representation in this work is different from that in our previous work [13], in which we just used the plain BoW representation. Apart from visual features, other features, which provide different yet complementary information, can also be used to represent videos. For example, the auditory feature based on Mel-frequency Cepstral Coefficients (MFCC) has also been frequently used [19]. We additionally use this feature for the MED videos, with a dimension of 4096. Thus, when using the SIN12 dataset, the two domains have SIFT and CSIFT as shared feature types while MFCC works as the heterogeneous feature for MED videos; when using the UCF50 dataset, the two domains have STIP as the shared feature type while MFCC is the heterogeneous feature for MED videos.
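The spatial Bag-of-Words construction described above can be sketched as follows. The sketch assumes quantized local descriptors with normalized (x, y) locations and a 4,096-word codebook already available; the cell layout follows the stated 1x1, 2x2 and 3x1 grids, giving 8 cells and a 32,768-dimensional vector per feature type.

```python
# Illustrative sketch of the spatial BoW encoding: one 4,096-bin histogram
# per cell of the 1x1, 2x2 and 3x1 grids, concatenated (codebook assignment
# of the local descriptors is assumed to be given).
import numpy as np

def spatial_bow(words, xy, vocab_size=4096):
    """words: codeword index per local descriptor; xy: normalized (x, y) in [0, 1)."""
    cells = []
    for rows, cols in [(1, 1), (2, 2), (3, 1)]:          # 1 + 4 + 3 = 8 cells
        r = np.minimum((xy[:, 1] * rows).astype(int), rows - 1)
        c = np.minimum((xy[:, 0] * cols).astype(int), cols - 1)
        for i in range(rows):
            for j in range(cols):
                mask = (r == i) & (c == j)
                cells.append(np.bincount(words[mask], minlength=vocab_size))
    return np.concatenate(cells).astype(float)            # 8 * 4096 = 32768 dims

rng = np.random.default_rng(0)
words = rng.integers(0, 4096, 5000)        # quantized SIFT/CSIFT descriptors
xy = rng.random((5000, 2))                 # descriptor locations in the frame
print(spatial_bow(words, xy).shape)        # (32768,)
```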

According to the MED task definition from NIST, each event is detected independently. Therefore, there are 18 individual detection tasks. NIST has defined the number of positive training examples to be 10 for MED with few exemplars [2]. However, for MED10-11 there is no standard training and testing partition provided by NIST. Hence, we randomly split the MED10-11 dataset into two subsets, one as the training set and the other as the testing set. We follow the definition given by NIST and randomly select 10 positive examples for each event. Another 1000 negative examples are selected and combined with the positive examples as the training data. The remaining 8736 videos are our testing data. The experiments are independently repeated 5 times with randomly selected positive and negative examples, and the average results are reported.

We use three evaluation metrics. The first one, Minimum NDC (MinNDC), is officially used by NIST in the TRECVID MED 2011 evaluation [1]. Lower MinNDC indicates better detection performance. The second one is the Probability of Miss-Detection at the Detection Threshold 12.5, used by NIST in TRECVID MED 2012 [2] to evaluate MED performance; we denote it as Pmd@TER=12.5 for short. Likewise, lower Pmd@TER=12.5 indicates better performance. For more details about these two evaluation metrics, please see the TRECVID 2011 and 2012 evaluation plans [1][2]. The third one is Average Precision (AP). Higher AP indicates better performance.
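MinNDC and Pmd@TER=12.5 follow the NIST evaluation plans [1][2] and are not repeated here. Average Precision, the third metric, can be computed as in the sketch below, a standard non-interpolated AP shown with synthetic scores for completeness.

```python
# Standard (non-interpolated) Average Precision over a ranked list of test
# videos; synthetic labels/scores are used for illustration.
import numpy as np

def average_precision(labels, scores):
    order = np.argsort(-scores)                  # rank by decreasing score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return float(np.sum(precision_at_k * labels) / max(labels.sum(), 1))

rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.01).astype(int)   # sparse positives, as in MED
scores = rng.random(1000) + 0.3 * labels         # synthetic decision values
print(average_precision(labels, scores))
```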

6.3 Comparison Algorithms

In this section, we show the MED results using Heterogeneous Features based Structural Adaptive Regression (HF-SAR) and other state-of-the-art algorithms. A brief introduction of the comparison algorithms is as follows:

Page 8: Knowledge Adaptation with Partially Shared Features for ...yiyang/HFSAR_dc.pdfevents, news events, unusual surveillance events or those with repetitive patterns. For example, Xu et

8

• HF-SAR: the proposed new method, which is designed for knowledge adaptation based on heterogeneous features. The χ2 kernel is used for its advantageous performance on video analysis (a sketch of the χ2 kernel is given after this list).

• Structural Adaptive Regression (SAR) [13]: our previous algorithm on knowledge adaptation for MED with few exemplars. Similarly, the χ2 kernel is used.

• Adaptive Multiple Kernel Learning (A-MKL) [12]: a recent knowledge adaptation algorithm built upon SVM.

• Multiple Kernel Transfer Learning (MKTL) [14]: a recent multi-class transfer learning algorithm built within a multiple kernel learning framework. The original algorithm in [14] used an RBF kernel; for a fair comparison, we implement it with the χ2 kernel.

• SAR&SVM: we use SAR based on SIFT+CSIFT features between the auxiliary domain and the target domain, and in addition an SVM based on the MFCC feature in the target domain. Then we fuse the decision values obtained by both. In this way, we can evaluate the performance of combining homogeneous transfer learning with a classifier on the heterogeneous feature.

• SVM: the most widely used and robust event detector for MED [19][10][17][37]. Similarly, we use the χ2 kernel for it.

• TaylorBoost [28]: a state-of-the-art classifier extended from AdaBoost.
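Since several of the compared methods rely on the χ2 kernel, a small sketch is given below. The exponential form with the mean χ2 distance as bandwidth is a common choice in video analysis and is an assumption here, not necessarily the exact variant used by every baseline implementation.

```python
# Sketch of the exponential chi-square kernel commonly used on Bag-of-Words
# histograms; the bandwidth heuristic (mean chi2 distance) is an assumption.
import numpy as np

def chi2_kernel(X, Y, gamma=None, eps=1e-10):
    """X: (n, d), Y: (m, d) nonnegative histograms -> (n, m) kernel matrix."""
    d2 = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        num = (x - Y) ** 2
        den = x + Y + eps
        d2[i] = 0.5 * np.sum(num / den, axis=1)          # chi2 distance
    if gamma is None:
        gamma = 1.0 / (np.mean(d2) + eps)                # bandwidth heuristic
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
train = rng.random((20, 4096))
test = rng.random((5, 4096))
K_test = chi2_kernel(test, train)      # e.g., for a precomputed-kernel SVM
print(K_test.shape)
```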

For SVM, we use LIBSVM, and for A-MKL, MKTL and TaylorBoost we use the code shared by the authors. During training and prediction, we combine the SIFT, CSIFT and MFCC features of the MED10-11 dataset for SVM and TaylorBoost. SAR, A-MKL and MKTL are knowledge adaptation based algorithms which utilize the SIN12 dataset as auxiliary data. However, they require that the target domain and the auxiliary domain have homogeneous feature representations, so only SIFT and CSIFT are used for them. HF-SAR leverages SIN12 for MED with few exemplars on MED10-11, and it is capable of using SIFT, CSIFT and MFCC together.

All the regularization parameters are tuned over {0.001, 0.1, 10, 1000}, and the parameter p of HF-SAR and SAR is tuned over {0.5, 1, 1.5}. We report the best results for each algorithm.

6.4 MED Results

The detection performance of the different algorithms is displayed in Figure 3 and Table 1, where all the knowledge adaptation methods exploit the SIN12 dataset. Note that LOWER MinNDC and Pmd@TER=12.5 indicate BETTER performance, while HIGHER AP indicates BETTER performance. The proposed method HF-SAR is consistently competitive. Zooming into details, we have the following observations: 1) when using MinNDC as the metric, HF-SAR achieves the best performance on 17 events; 2) when using Pmd@TER=12.5 as the metric, HF-SAR achieves the best performance on 15 events; 3) when using AP as the metric, HF-SAR is the best method on 14 events; 4) HF-SAR obtains the top performance for the average accuracy over all 18 events; 5) SAR&SVM is generally the second most competitive algorithm. This indicates that incorporating the additional information contained in the heterogeneous feature into a robust knowledge adaptation algorithm based on homogeneous features is beneficial. However, it is unclear which algorithms should be combined for the best performance, as they may rely on different mechanisms; 6) SAR, A-MKL and SVM have varying degrees of success on some events, but they are generally worse than HF-SAR and SAR&SVM. This means that knowledge adaptation based on homogeneous features loses useful information from the heterogeneous feature, while SVM utilizes all the features but cannot leverage knowledge from other sources. In contrast, the newly proposed method HF-SAR transfers knowledge between homogeneous features while simultaneously exploiting the heterogeneous feature, thereby obtaining boosted performance.

Next we show the detection results obtained by exploiting the UCF50 dataset. As HF-SAR has already shown its advantage over other knowledge adaptation algorithms, and this experiment aims to show that we can even adapt useful action knowledge for MED with few exemplars, we only compare HF-SAR to the best baseline classifier, SVM. This time, we combine the STIP and MFCC features of the MED10-11 dataset for SVM. The detailed results are given in Table 2. As can be seen, HF-SAR beats SVM on 17, 17 and 15 events, and on the average performance over all 18 events, in terms of MinNDC, Pmd@TER=12.5 and AP respectively. Moreover, for those events on which HF-SAR is better, we observe noticeable performance improvements. It is also worth mentioning that the performance of SVM is lower than in the previous experiment because different and fewer features are used.

6.5 Influence of Knowledge Adaptation

It is interesting to understand how the knowledge adaptation from the auxiliary concept-based videos impacts MED with few exemplars. We base our study on two scenarios: first, we set α in Eq. (5) to 0 so that there is no knowledge adaptation; second, since in our objective function the term $\alpha\|W\|_{2,p}^p$ controls the effect of the knowledge adaptation, we investigate its influence by varying the parameters α and p after fixing β at its optimal value.

For the first scenario, we show the performance comparison between using the auxiliary data and not using it in Figure 4. MinNDC is used as the metric and the results on the first 10 events are displayed due to space limits. It can be seen that using auxiliary data has a clear advantage over not using it, which demonstrates that, through proper design, the auxiliary knowledge contributes notably to MED with few exemplars.

For the second scenario, we similarly use MinNDC as the metric to show the performance variation. Due to space limits, we only show the results on the first 6 events in Figure 5. We observe from Figure 5 that the best results are generally obtained when p = 0.5 or p = 1. For the other parameter α there is no obvious rule, which is presumably data-dependent. Lower p indicates that the model is sparser, thereby eliminating more redundancy and noise.


Figure 3. Performance comparison on MED with few exemplars. Each panel plots MinNDC (Minimum NDC; the LOWER, the BETTER), Pmd (Pmd@TER=12.5; the LOWER, the BETTER) and AP (Average Precision; the HIGHER, the BETTER) for HF-SAR, SAR, A-MKL, MKTL, SAR&SVM, SVM and TaylorBoost. Panels: (a) Notations; (b) Attempting a board trick; (c) Feeding an animal; (d) Landing a fish; (e) Wedding ceremony; (f) Working on a woodworking project; (g) Birthday party; (h) Changing a vehicle tire; (i) Flash mob gathering; (j) Getting a vehicle unstuck; (k) Grooming an animal; (l) Making a sandwich; (m) Parade; (n) Parkour; (o) Repairing an appliance; (p) Working on a sewing project; (q) Making a cake; (r) Batting a run; (s) Assembling a shelter; (t) Average.


Table 1
Average detection accuracy of different methods (best values, shown in bold in the original, are in the HF-SAR column).

Evaluation Metric | SAR   | A-MKL | MKTL  | SAR&SVM | SVM   | TaylorBoost | HF-SAR
MinNDC            | 0.860 | 0.881 | 0.873 | 0.841   | 0.850 | 0.902       | 0.817
Pmd@TER=12.5      | 0.601 | 0.617 | 0.610 | 0.572   | 0.575 | 0.677       | 0.549
AP                | 0.162 | 0.144 | 0.153 | 0.183   | 0.181 | 0.080       | 0.201

Table 2
Detection results obtained by exploiting the UCF50 dataset, in comparison with SVM. Each cell lists MinNDC / Pmd@TER=12.5 / AP.

Event   | SVM                   | HF-SAR                | Relative Improvement
E01     | 0.884 / 0.546 / 0.247 | 0.922 / 0.569 / 0.206 | N/A / N/A / N/A
E02     | 1.000 / 0.938 / 0.037 | 0.999 / 0.877 / 0.046 | 0.1% / 7.0% / 24.3%
E03     | 1.000 / 0.928 / 0.035 | 0.990 / 0.815 / 0.061 | 1.0% / 13.9% / 74.3%
E04     | 0.936 / 0.870 / 0.044 | 0.912 / 0.770 / 0.132 | 2.6% / 13.0% / 200%
E05     | 0.975 / 0.914 / 0.061 | 0.946 / 0.817 / 0.097 | 3.1% / 11.9% / 59.0%
E06     | 0.992 / 0.917 / 0.049 | 0.973 / 0.797 / 0.077 | 2.0% / 15.1% / 57.1%
E07     | 1.000 / 0.944 / 0.033 | 0.992 / 0.881 / 0.032 | 0.8% / 7.2% / N/A
E08     | 0.945 / 0.862 / 0.094 | 0.833 / 0.676 / 0.173 | 13.4% / 27.5% / 84.0%
E09     | 0.970 / 0.804 / 0.072 | 0.928 / 0.703 / 0.093 | 4.5% / 14.4% / 29.2%
E10     | 0.997 / 0.933 / 0.035 | 0.991 / 0.862 / 0.043 | 0.6% / 8.2% / 22.9%
E11     | 0.995 / 0.904 / 0.037 | 0.982 / 0.835 / 0.041 | 1.3% / 8.3% / 10.8%
E12     | 0.975 / 0.889 / 0.052 | 0.940 / 0.770 / 0.077 | 9.4% / 4.5% / 13.7%
E13     | 0.970 / 0.711 / 0.094 | 0.957 / 0.689 / 0.078 | 3.7% / 3.2% / N/A
E14     | 0.919 / 0.840 / 0.083 | 0.819 / 0.687 / 0.191 | 12.2% / 22.3% / 130.1%
E15     | 0.964 / 0.880 / 0.059 | 0.945 / 0.794 / 0.066 | 2.0% / 10.8% / 11.9%
E16     | 0.975 / 0.864 / 0.045 | 0.937 / 0.796 / 0.053 | 4.1% / 8.5% / 17.8%
E17     | 0.893 / 0.766 / 0.125 | 0.736 / 0.585 / 0.253 | 21.3% / 30.9% / 102.4%
E18     | 0.982 / 0.922 / 0.036 | 0.967 / 0.836 / 0.041 | 1.6% / 10.3% / 13.9%
Average | 0.965 / 0.857 / 0.069 | 0.932 / 0.764 / 0.098 | 3.5% / 12.2% / 42.0%

6.6 Using Fewer Concepts

In this experiment, we test the performance variation of the proposed algorithm when varying the number of auxiliary concepts over 5, 10, 20, 30, 50 and 65. Figure 6 displays the corresponding results on the first 10 events in terms of Minimum NDC. We have the following observations: 1) generally, the performance of using only 5 auxiliary concepts is noticeably worse than using all 65 auxiliary concepts; 2) from using 5 auxiliary concepts to using 30 auxiliary concepts, the performance gradually improves for most events; 3) from using 30 auxiliary concepts to using 65 auxiliary concepts, the performance does not vary much, which suggests that the performance saturates around 30 auxiliary concepts. Our observations indicate that when the number of auxiliary concepts is very small, which also means few auxiliary videos, the performance gain is limited. To get a larger performance boost, we may want to incorporate more auxiliary videos with more concepts. However, how to decide the optimal number of auxiliary concepts is still an open problem and is out of the scope of this paper.

6.7 Do Negative Examples Help?

We further conduct an experiment to evaluate whether negative examples contribute much to the detection accuracy, by reducing the number of negative examples to 500 and 100. Figure 7 shows the performance comparison between using 100, 500 and 1000 negative examples on the first 10 events. Again, Minimum NDC is chosen as the evaluation metric.

From Figure 7 we have the following observations: 1) using 1000 or 500 negative examples is better than using only 100 negative examples; 2) the performance difference between using 1000 and using 500 negative examples is quite small. This experiment indicates that negative examples help improve the detection accuracy to some degree. For example, when 500 or 1000 negative examples are used, the performance is generally better than using only 100 negative examples. However, as the number of negative examples increases, the performance gain does not increase accordingly, e.g., from using 500 negative examples to using 1000 negative examples. How many negative examples would bring the most performance gain is still an open problem, which is out of the scope of this paper. However, since negative examples are quite easy to obtain in the real world, it is natural to leverage such cheap resources for boosted detection accuracy.

6.8 Parameter Sensitivity Study

There are two regularization parameters, denoted as α and β in Eq. (5). To learn how they affect the performance, we conduct an experiment on parameter sensitivity. Due to space limits, we still only show the results on the first 6 events in Figure 8.


[Figure 4: bar chart of Minimum NDC (0 to 1) per event, with one bar for "With Auxiliary Data" and one for "Without Auxiliary Data".]

Figure 4. Performance comparison between using auxiliary knowledge and not using auxiliary knowledge.

[Figure 5: surface plots of Minimum NDC versus p (0.5 to 1.5) and α (0.001 to 1000) for six events: (a) Attempting a board trick, (b) Feeding an animal, (c) Landing a fish, (d) Wedding ceremony, (e) Working on a woodworking project, (f) Birthday party.]

Figure 5. The detection performance variance w.r.t. α and p.

[Figure 6: bar chart of Minimum NDC per event when using 5, 10, 20, 30, 50 and 65 auxiliary concepts.]

Figure 6. Performance comparison between using 5, 10, 20, 30, 50 and 65 auxiliary concepts.

[Figure 7: bar chart of Minimum NDC per event when using 100, 500 and 1000 negative examples.]

Figure 7. Performance comparison between using 100, 500 and 1000 negative examples.


[Figure 8: surface plots of Minimum NDC versus α and β (each from 0.001 to 1000) for six events: (a) Attempting a board trick, (b) Feeding an animal, (c) Landing a fish, (d) Wedding ceremony, (e) Working on a woodworking project, (f) Birthday party.]

Figure 8. The detection performance variance w.r.t. α and β.

From Figure 8 we notice that for some events, e.g., Birthday party, the performance is sensitive to the two parameters, while for others, such as Feeding an animal, it does not change much. However, we can generally obtain good performance when α and β are comparable. For example, good performance is obtained with α = β = 0.001 for Attempting a board trick, Feeding an animal, Landing a fish and Wedding ceremony, and with α = β = 10 for Birthday party. A similar pattern is observed for the other events as well.
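For concreteness, a minimal Python sketch of the kind of grid sweep behind this study follows. Here train_detector and minimum_ndc are placeholder stubs that merely stand in for the actual HF-SAR training routine and the Minimum NDC scorer, neither of which is shown in the paper.

import itertools
import numpy as np

# Placeholder stand-ins for the real pipeline; they only return reproducible
# fake numbers so the sweep loop itself can run end to end.
def train_detector(alpha, beta):
    """Pretend to train a detector with regularization weights (alpha, beta)."""
    rng = np.random.default_rng(hash((alpha, beta)) % 2**32)
    return rng  # stands in for a fitted model

def minimum_ndc(model):
    """Pretend to score the model; returns a fake Minimum NDC in [0.6, 1.0]."""
    return float(model.uniform(0.6, 1.0))

grid = [0.001, 0.1, 10, 1000]            # the values swept in Figure 8
results = {}
for alpha, beta in itertools.product(grid, grid):
    results[(alpha, beta)] = minimum_ndc(train_detector(alpha, beta))

best = min(results, key=results.get)
print("best (alpha, beta):", best, "MinNDC:", round(results[best], 3))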

6.9 Convergence Study

We solve our objective function using an alternating approach. In practice, how fast the algorithm converges is crucial for the overall computational efficiency. Hence, we conduct an experiment to show the convergence curve of our algorithm. As the results are similar on all 18 events, we only present the convergence curve for the first event. All the parameters involved are fixed at 1.

[Figure 9: objective function value (roughly 3215 to 3240) versus iteration number (1 to 10) for HF-SAR.]

Figure 9. Convergence curve of the objective function value in Eq. (5) using Algorithm 1 for the event Attempting a board trick. The figure shows that the objective function value monotonically decreases until convergence under the proposed algorithm.

Figure 9 shows the convergence curve. It can be seen that the objective function value converges within 10 iterations. This demonstrates the efficiency of our alternating algorithm.
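The stopping logic used to produce such a curve can be summarized with the generic alternating-minimization template below (Python). The toy low-rank factorization objective only stands in for Eq. (5), and the tolerance value is an assumption; this is a sketch of the monitoring procedure, not the paper's implementation.

import numpy as np

# Alternating minimization of ||X - U V||_F^2: track the objective each
# iteration and stop once its relative decrease becomes negligible.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
k = 5
U = rng.normal(size=(100, k))
V = rng.normal(size=(k, 40))

def objective(U, V):
    return float(np.linalg.norm(X - U @ V, "fro") ** 2)

history = [objective(U, V)]
for it in range(1, 51):
    # Fix V, solve for U in closed form; then fix U, solve for V.
    U = np.linalg.lstsq(V.T, X.T, rcond=None)[0].T
    V = np.linalg.lstsq(U, X, rcond=None)[0]
    history.append(objective(U, V))
    if abs(history[-2] - history[-1]) / max(history[-2], 1e-12) < 1e-6:
        break

print(f"converged after {it} iterations; objective {history[0]:.1f} -> {history[-1]:.1f}")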

7 COMPLEMENTARY EXPERIMENT ON MULTI-CLASS CLASSIFICATION

Our proposed algorithm can be easily extended to other applications such as multi-class classification. In this section, we conduct a complementary experiment on image annotation to show its effectiveness for multi-class classification.

We use the Animals with Attributes (AwA) dataset [42] for evaluation. The reason is that the dataset provides both animal categories and their associated attributes. Similarly to our assumption, different animal categories may share common attributes. Thus, we use the 10 animal categories specified in [42] as our target annotation categories and the rest as our auxiliary data. Note that for the auxiliary data we use their attribute labels, since these attributes are the components shared with the target animal categories. The 10 target categories are persian cat, hippopotamus, leopard, humpback whale, seal, chimpanzee, rat, giant panda, pig and raccoon. For the 10 classes to be annotated, we randomly select 10 samples per category to form the training set and use the remaining data of these categories as testing data. We use the SIFT feature as the homogeneous feature and the Locality Similarity Histogram (LSH) feature as the heterogeneous feature for image representation. In other words, the images of the 10 target categories are represented by SIFT and LSH while those of the auxiliary categories are represented by SIFT only.
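A minimal Python sketch of this data split is given below, assuming made-up feature dimensions and 60 images per class purely for illustration; only the target categories are handled here, while the auxiliary categories would contribute a SIFT-only matrix with attribute labels.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions; the real SIFT / LSH codebook sizes differ.
d_sift, d_lsh, per_class = 1000, 500, 60
targets = ["persian cat", "hippopotamus", "leopard", "humpback whale", "seal",
           "chimpanzee", "rat", "giant panda", "pig", "raccoon"]

# Fake target-class data: each image has a shared SIFT part and an LSH part.
data = {c: (rng.normal(size=(per_class, d_sift)),   # homogeneous feature
            rng.normal(size=(per_class, d_lsh)))    # heterogeneous feature
        for c in targets}

train_X, train_y, test_X, test_y = [], [], [], []
for label, (sift, lsh) in data.items():
    idx = rng.permutation(per_class)
    tr, te = idx[:10], idx[10:]                     # 10 training samples per class
    feats = np.hstack([sift, lsh])                  # target images use SIFT + LSH
    train_X.append(feats[tr]); train_y += [label] * len(tr)
    test_X.append(feats[te]);  test_y += [label] * len(te)

train_X, test_X = np.vstack(train_X), np.vstack(test_X)
print(train_X.shape, test_X.shape)   # (100, 1500) (500, 1500)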

The annotation comparison between the different algorithms is displayed in Table 3. We can see that HF-SAR is much better than the comparison algorithms, with SVM being the second best. In particular, the other transfer learning algorithms perform worse because they exploit only one feature.

The accuracy reported in [42] is 40.5%. We point out, however, that [42] uses six features whereas we use only two. We did not use all the features of [42] out of concern for computational efficiency, e.g., the comparison algorithms A-MKL and TaylorBoost are computationally expensive. Moreover, to be consistent with our previous experiment on MED with few exemplars, we select only 10 samples from each target category to form the training set, which also makes our training set different from that in [42]. Thus, the annotation accuracy of our method cannot be directly compared with that of [42].

This complementary experiment demonstrates that our method also has potential for other applications.

8 CONCLUSION

In this paper, we have introduced the research problem of MED with few exemplars. This is an important research issue


Table 3. Performance comparison of different methods on image annotation. The best result is highlighted in bold.

Evaluation Metric   SAR     A-MKL   MKTL    SAR&SVM   SVM     TaylorBoost   HF-SAR
Accuracy            0.257   0.248   0.232   0.265     0.310   0.264         0.373

as it focuses on more generic, complicated and meaningful events that reflect our daily activities. In addition, the situation we face in the real world requires that only a few positive examples be used. To achieve good performance, we have proposed to borrow strength from available concept-based videos for MED with few exemplars. A notable difference between our algorithm and most existing knowledge adaptation algorithms is that it is built upon heterogeneous features, i.e., the features of the source and the target are partially different, but overlapping. Specifically, we first mine the shared irrelevance and noise between the auxiliary videos and the target videos based on the homogeneous features, and then alleviate their negative impact when optimizing the event detector. Meanwhile, another event detector for the MED videos is trained on the heterogeneous feature. The two event detectors are then optimized jointly, after which their decision values are fused for the final prediction. Extensive experiments on real-world multimedia archives were conducted with promising results. The results validate that: 1) it is beneficial to leverage auxiliary knowledge for MED when we do not have sufficient positive examples; and 2) knowledge adaptation based on heterogeneous features is realistic and advantageous.

9 ACKNOWLEDGMENTS

This paper was partially supported by the European Commission under contract FP7-611346 xLiMe, the U.S. Department of Defense, U.S. Army Research Office (W911NF-13-1-0277), and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, ARO, or the U.S. Government.

REFERENCES

[1] http://www.nist.gov/itl/iad/mig/upload/med11-evalplan-v03-20110801a.pdf.
[2] http://www.nist.gov/itl/iad/mig/upload/med12-evalplan-v01.pdf.
[3] TREC Video Retrieval Evaluation. National Institute of Standards and Technology. http://www-nlpir.nist.gov/projects/trecvid/.
[4] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Trans. Pattern Anal. Mach. Intell., 30(3): 555–560, 2008.
[5] G. Zen and E. Ricci. Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In CVPR, pages 3225–3232, 2011.
[6] E. Ricci, G. Zen, N. Sebe, and S. Messelodi. A prototype learning framework using EMD: Application to complex scenes analysis. IEEE Trans. Pattern Anal. Mach. Intell., 35(3): 513–526, 2013.
[7] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73: 243–272, 2008.
[8] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20: 231–252, 2010.
[9] Y. Yang, Z. Ma, A. G. Hauptmann, and N. Sebe. Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Transactions on Multimedia, 15(3): 661–669, 2013.
[10] L. Bao, L. Zhang, S.-I. Yu, Z.-Z. Lan, L. Jiang, A. Overwijk, Q. Jin, S. Takahashi, B. Langner, Y. Li, M. Garbus, S. Burger, F. Metze, and A. G. Hauptmann. Informedia @ TRECVID 2011. In NIST TRECVID Workshop, 2011.
[11] L. Cao, S.-F. Chang, N. Codella, C. Cotton, D. Ellis, L. Gong, M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, A. Natsev, and J. R. Smith. IBM Research and Columbia University TRECVID-2011 multimedia event detection (MED) system. In NIST TRECVID Workshop, 2011.
[12] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. IEEE Trans. Pattern Anal. Mach. Intell., 34(9): 1667–1680, 2012.
[13] Z. Ma, Y. Yang, Y. Cai, N. Sebe, and A. G. Hauptmann. Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In ACM MM, pages 469–478, 2012.
[14] J. Luo, T. Tommasi, and B. Caputo. Multiclass transfer learning from unconstrained priors. In ICCV, pages 1863–1870, 2011.
[15] L. Fei-Fei and L.-J. Li. What, where and who? Telling the story of an image by activity classification, scene recognition and object categorization. Studies in Computational Intelligence - Computer Vision, 285: 157–171, 2010.
[16] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications Journal, 24(5): 971–981, 2013.
[17] A. G. Hauptmann, R. Yan, W.-H. Lin, M. G. Christel, and H. D. Wactlar. Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Transactions on Multimedia, 9(5): 958–966, 2007.
[18] Y.-G. Jiang, C.-W. Ngo, and S.-F. Chang. Semantic context transfer across heterogeneous sources for domain adaptive video search. In ACM MM, pages 155–164, 2009.
[19] Z.-Z. Lan, L. Bao, S.-I. Yu, W. Liu, and A. G. Hauptmann. Multimedia classification and event detection using double fusion. Journal of Multimedia Tools and Applications, 2013.
[20] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann. Complex event detection via multi-source video attributes. In CVPR, pages 2627–2633, 2013.
[21] Y. Yang, Z. Ma, Z. Xu, S. Yan, and A. G. Hauptmann. How related exemplars help complex event detection in web videos. In ICCV, 2013.
[22] K.-H. Liu, M.-F. Weng, C.-Y. Tseng, Y.-Y. Chuang, and M.-S. Chen. Association and temporal rule mining for post-filtering of semantic concept detection in video. IEEE Transactions on Multimedia, 10(2): 240–251, 2008.
[23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91–110, 2004.
[24] J. Luo, J. Yu, D. Joshi, and W. Hao. Event recognition: viewing the world with a third eye. In ACM MM, pages 1071–1080, 2008.
[25] Z. Ma, F. Nie, Y. Yang, J. R. R. Uijlings, and N. Sebe. Web image annotation via subspace-sparsity collaborated feature selection. IEEE Transactions on Multimedia, 14(4): 1021–1030, 2012.
[26] Z. Ma, Y. Yang, F. Nie, J. R. R. Uijlings, and N. Sebe. Exploiting the entire feature space with sparsity for automatic image annotation. In ACM MM, pages 283–292, 2011.
[27] M. R. Naphade and J. R. Smith. On the detection of semantic concepts at TRECVID. In ACM MM, pages 660–667, 2004.
[28] M. J. Saberian, H. Masnadi-Shirazi, and N. Vasconcelos. TaylorBoost: First and second-order boosting algorithms with explicit margin control. In CVPR, pages 2929–2934, 2011.
[29] D. A. Sadlier and N. E. O'Connor. Event detection in field sports video using audio-visual features and a support vector machine. IEEE Trans. Circuits Syst. Video Techn., 15(10): 1225–1233, 2005.


[30] B. Scholkopf, A. J. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5): 1299–1319, 1998.
[31] M.-L. Shyu, Z. Xie, M. Chen, and S.-C. Chen. Video semantic event/concept detection using a subspace-based multimedia data mining framework. IEEE Transactions on Multimedia, 10(2): 252–259, 2008.
[32] C. G. M. Snoek, M. Worring, J. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM MM, pages 421–430, 2006.
[33] V. S. Tseng, J.-H. Su, J.-H. Huang, and C.-J. Chen. Integrated mining of visual features, speech features, and frequent patterns for semantic video annotation. IEEE Transactions on Multimedia, 10(2): 260–267, 2008.
[34] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 32(9): 1582–1596, 2010.
[35] M. Marszalek, C. Schmid, H. Harzallah, and J. van de Weijer. Learning object representations for visual object class recognition. In Visual Recognition Challenge Workshop, 2007.
[36] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, pages 432–439, 2003.
[37] F. Wang, Y.-G. Jiang, and C.-W. Ngo. Video event detection using motion relativity and visual relatedness. In ACM MM, pages 239–248, 2008.
[38] G. Wang, T.-S. Chua, and M. Zhao. Exploring knowledge of sub-domain in a multi-resolution bootstrapping framework for concept detection in news video. In ACM MM, pages 249–258, 2008.
[39] C. Xu, J. Wang, K. Wan, Y. Li, and L. Duan. Live sports event detection based on broadcast video and web-casting text. In ACM MM, pages 221–230, 2006.
[40] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, pages 188–197, 2007.
[41] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans. Pattern Anal. Mach. Intell., 34(4): 723–742, 2012.
[42] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958, 2009.

Zhigang Ma received the Ph.D. in computer science from the University of Trento, Trento, Italy, in 2013.

He is now a Postdoctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. His research interest is mainly in machine learning and its applications to multimedia analysis and computer vision.

Yi Yang received the Ph.D. degree in Computer Science from Zhejiang University, Hangzhou, China, in 2010.

He is now a DECRA fellow with the University of Queensland, Brisbane, Australia. Prior to that, he was a Postdoctoral Research Fellow at the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. His research interests include machine learning and its applications to multimedia content analysis and computer vision, e.g., multimedia indexing and retrieval, surveillance video analysis, video semantics understanding, etc.

Nicu Sebe (M’01-SM’11) received the Ph.D. in computer science from Leiden University, Leiden, The Netherlands, in 2001.

Currently, he is with the Department of Information Engineering and Computer Science, University of Trento, Italy, where he is leading the research in the areas of multimedia information retrieval and human-computer interaction in computer vision applications. He was a General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008) and ACM Multimedia 2013, and a program chair of the ACM International Conference on Image and Video Retrieval (CIVR) 2007 and 2010 and ACM Multimedia 2007 and 2011. He is a program chair of ECCV 2016 and ICCV 2017. He is a senior member of IEEE and of ACM and a fellow of IAPR.

Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universitat Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1991.

He is currently with the faculty of the Department of Computer Science and the Language Technologies Institute, CMU. His research interests span several areas: man-machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning. From 1984 to 1994, he worked on speech and machine translation, before joining the Informedia project for digital video analysis and retrieval, where he led the development and evaluation of news-on-demand applications.