
Hindawi Publishing Corporation
Advances in Multimedia
Volume 2013, Article ID 175064, 22 pages
http://dx.doi.org/10.1155/2013/175064

Research Article
Multiple Feature Fusion Based on Co-Training Approach and Time Regularization for Place Classification in Wearable Video

Vladislavs Dovgalecs, Rémi Mégret, and Yannick Berthoumieu

IMS Laboratory, University of Bordeaux, UMR5218 CNRS, Bâtiment A4, 351 cours de la Libération, 33405 Talence, France

Correspondence should be addressed to Vladislavs Dovgalecs; vladislavsdovgalecs@gmail.com

Received 27 April 2012; Revised 29 September 2012; Accepted 5 December 2012

Academic Editor: Anastasios Doulamis

Copyright © 2013 Vladislavs Dovgalecs et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The analysis of video acquired with a wearable camera is a challenge that the multimedia community is facing with the proliferation of such sensors in various applications. In this paper, we focus on the problem of automatic visual place recognition in a weakly constrained environment, targeting the indexing of video streams by topological place recognition. We propose to combine several machine learning approaches in a time-regularized framework for image-based place recognition indoors. The framework combines the power of multiple visual cues and integrates the temporal continuity information of video. We extend it with a computationally efficient semisupervised method leveraging unlabeled video sequences for improved indexing performance. The proposed approach was applied to challenging video corpora. Experiments on a public database and a real-world video sequence database show the gain brought by the different stages of the method.

1. Introduction

Due to recent achievements in the miniaturization of cameras and their embedding in smart devices, the number of video sequences captured using such wearable cameras has increased substantially. This opens new application fields and renews problematics posed earlier to the multimedia research community. For instance, visual lifelogs can record the daily activities of a person and constitute a rich source of information for the task of monitoring persons in their daily life [1-4]. Recordings captured using a wearable camera depict an inside-out view, close to the subjective view of the camera wearer. It is a unique source of information, with applications such as a memory refresh aid or as an additional source of information for the analysis of various activities and behavior-related events in a healthcare context. This often comes at the price of content with very high variability, rapid camera displacement, and poorly constrained environments in which the person moves. Search for specific events in such multimedia streams is therefore particularly challenging. As was shown in [5, 6], multiple aspects of the video content and its context can be taken into account to provide a complete view of activity-related events: location, presence of objects

or persons, hand movements, and external information such as Global Positioning System (GPS), Radio Frequency Identification (RFID), or motion sensor data. Amongst these, location is important contextual information that restricts the possible number of ongoing activities. Obtaining this information directly from the video stream is an interesting application in multimedia processing, since no additional equipment such as GPS or RFID is needed. In some applications, this may even be a constraint, since access to such modalities is limited in practice by the available devices, and the installation of invasive equipment in the environment (such as a home) may not be welcome.

Considering the high cost of labeling data for training when dealing with lifelogs, and therefore the low amount of such labeling, inferring place recognition information from such content is a particularly great challenge. For instance, in the framework presented in [7], video lifelog recordings are made in an unknown environment, and ground truth location information is limited to small parts of the recording. In such a setup, the information sources are short manual annotations and large unlabeled recording parts. The use of unlabeled data to improve recognition performance was up to now reserved for more generic problems and was not evaluated within


the context of wearable video indexing. Efficient usage of this information for place recognition in wearable video indexing therefore defines the problem of the present work.

In this paper, we propose a novel strategy to incorporate and take advantage of unlabeled data for place recognition. It takes into account unlabeled data, multiple features, and time information. We present a complete processing pipeline, from low-level visual data extraction up to visual recognition. The principal contribution of this work is a novel system for robust place recognition in weakly annotated videos. We propose a combination of the Co-Training algorithm with classifier fusion to obtain a single classification estimate that exploits both multiple features and unlabeled data. In this context, we also study a range of confidence computation techniques found in the literature and introduce our own confidence measure, designed to reduce the impact of uncertain classification results. The proposed system is designed such that each component is evaluated separately and its presence is justified. It will be shown that each component yields an increase in classification performance, both separately and in a combined configuration, as demonstrated on public and our challenging in-house datasets.

The system we propose in this paper is motivated by the need to develop a robust image-based place recognition system as part of the high-level activity analysis system developed within the IMMED project [7]. As part of this project, a wearable video recording prototype (see Figure 1), video annotation software, and activity recognition algorithms were developed as well, but they are left out of the scope of this paper. More detail on the latter can be found in [7, 8].

The paper is organized as follows. In Section 2, we review related work from the literature with respect to visual recognition, multiple feature fusion, and semisupervised learning. In Section 3, we present the proposed approach and algorithms. In Section 4, we report the experimental evaluations done on two databases in real-life conditions and show the respective gains from the use of (a) multiple features, (b) unsupervised data, and (c) temporal information within our combined framework.

2. Literature Review

2.1. Activity Monitoring Context

2.1.1. Motivation. With this subsection, we aim to put our work in the context of activity detection and recognition in video. Several setups have been used for that matter: ambient and wearable sensors.

2.1.2. Monitoring Using Ambient Sensors. Activity recognition systems have emerged quickly due to recent advances in large-scale video recording and in the deployment of systems with high computational power. For this application field, most of the proposed methods originated from scene classification, where static image information is captured and categorized.

The authors in [9] use the SVM classifier to classify local events such as "walking" and "running" in a database consisting of very clean and unoccluded video sequences. Perhaps

Figure 1: Wearable camera recording prototype used in the IMMED project.

in a more challenging setup, human behavior recognition is performed in [10] by proposing specially crafted sparse spatiotemporal features adapted for temporal visual data description. Conceptually, a similar approach is proposed in [11], where each event in a soccer game is modeled as a temporal sequence of Bag of Visual Words (BOVW) features used in an SVM classifier; these sequences, termed strings, are then compared using the string kernel.

Detection and recognition of events in real-world industrial workflows is a challenging problem because of great intraclass variability (complex classifiers required), unknown event start/end moments, and the requirement to remember the whole event history, which violates Markovian assumptions of conditional independence (e.g., HMM-based algorithms). The problem is alleviated in [12], where the authors propose an online worker behavior classification system that integrates a particle filter and an HMM.

2.1.3. Monitoring Using Wearable Sensors. Alternatively, activity information can also be obtained from simple on-body sensors (e.g., acceleration) and wearable video.

The authors in [13] investigated two methods for activity context awareness in weakly annotated videos using 3D accelerometer data. The first one is based on multi-instance learning, grouping sensor data into bags of activities (instead of labeling every frame). The second one uses a graph structure for feature and time similarity representation and label information transfer in those structures. Results favor label propagation based methods in multiple feature graphs.

Visual lifelog indexing by human actions [2, 3, 14, 15] has been proposed recently in healthcare with the expansion of Alzheimer's disease. Early attempts to answer the challenge were made in [1, 4] as part of the SenseCam and IMMED projects, proposing lightweight devices and event segmentation algorithms. A motion-based temporal video segmentation algorithm with an HMM at its core [8] identified a strong correlation between activities and localization. This study reveals the complexity of the issue, which consists in learning a generative model from little training data, extending to larger scale, and recognizing short and infrequent activities. These


and related issues were addressed in [16] with a Hierarchical HMM, which simultaneously fuses complementary low-level and midlevel (visual motion, location, sound, and speech) features, together with the contribution of an automatic audiovisual stream segmentation algorithm. Results validate the choice of two-level modeling of activities using the Hierarchical HMM and reveal an improvement in recognition performance when working with temporal segments. Optimal feature fusion strategies using the Hierarchical HMM are studied in [17]. The contributed intermediate-level fusion at the observation level, where all features are treated separately, compares positively to the more classic early and late fusion approaches.

This work is part of an effort to detect and recognize a person's activities from wearable videos in the context of healthcare, within the IMMED (http://immed.labri.fr) project [7] and continued within the Dem@Care (http://www.demcare.eu) project. Localization information is one of multiple possible cues to detect and recognize activities solely from the egocentric point of view of the recording camera. Amongst these, location estimation is an important cue, which we now discuss in more detail.

2.2. Visual Location Recognition

2.2.1. Motivation. Classifying the current location using visual content only is a challenging problem. It relates to several problematics that have already been addressed in various contexts, such as image retrieval from large databases, semantic video retrieval, image-based place recognition in robotics, and scene categorization. A survey [18] on image and video retrieval methods covers paradigms such as semantic video retrieval, interactive retrieval, relevance feedback strategies, and intelligent summary creation. Another comprehensive and systematic study in [19], which evaluates multimodal models using visual and audio information for video classification, reveals the importance of global and local features and the role of various fusion methods such as ensembles, context fusion, and joint boosting. We will hereafter focus on location recognition and classification.

To deal with location recognition from image content only, we can consider two families of approaches: (a) from a retrieval point of view, where a place is recognized as similar to an existing labeled reference from which we can infer the estimated location; (b) from a classification point of view, where the place corresponds to a class that can be discriminated from other classes.

2.2.2. Image Retrieval for Place Recognition. Image retrieval systems work on the principle that the visual content presented in a query image is visually similar or related to a portion of the images to be retrieved from the database.

Pairwise image matching is a relatively simple and attractive approach. It implies comparing the query image to all annotated images in the database. Top-ranked images are selected as candidates and, after some optional validation procedures, the retrieved images are presented as the result of the query. In [20], SIFT feature matching followed by voting and further improvement with spatial information is

performed to localize indoors over 18 locations, each of them presented with 4 views. The voting scheme determines the locations whose keypoints were most frequently classified as the nearest neighbors. Additionally, spatial information is modeled using an HMM, bringing in neighbor location relationships. A study [21] on place recognition in lifelog images found that the best matching technique is to use bidirectional matching, which nevertheless adds computational complexity. This problem is resolved by using robust and rapid-to-extract SURF features, which are then hierarchically clustered using the k-means algorithm into a vocabulary tree. The vocabulary tree allows the rapid comparison of query image descriptors to those of the database, where the tree leaf node descriptor votes for the database image.

The success of matching for place recognition depends greatly on the database, which should contain a large amount of annotated images. In many applications, this is a rather strong assumption about the environment. In the absence of prior knowledge brought by a completely annotated image database covering the environment, topological place recognition discretizes the otherwise continuous space. A typical approach following this idea is presented in [22]. The authors propose gradient orientation histograms of the edge map as an image feature, with the property that visually similar scenes are described by similar histograms. The Learning Vector Quantization method is then used to retain only the most characteristic descriptors for each topological location. An unsupervised approach for indoor robot place recognition is adopted in [23]. The method partitions the space into convex subspaces representing room concepts by using an approximated graph-cut algorithm, with the possibility for the user to inject "can group" and "cannot group" constraints. The adopted similarity measure relies on the 8-point algorithm constrained to planar camera motion, followed by robust RANSAC to remove false matches. Despite high computational cost, the results show good clustering capabilities if the graph nodes representing individual locations are well selected and the graph is properly built. The authors recognize that at larger scale, and with more similarly looking locations, more falsely matching images may appear.

2.2.3. Image Classification for Place Recognition. Training a visual appearance model and using it to classify unseen images constitutes another family of approaches. Image information is usually encoded using global or local patch features.

In [24], an image is modeled as a collection of patches, each of which is assigned a codeword using a prebuilt codebook, yielding a bag of codewords. The generic Bag of Words [25] approach has been quite successful as a global feature. One of its main advantages is the ability to represent possibly very complex visual contents and to address the scene clutter problem. It is flexible enough to accommodate both discrete features [25] and dense features [26], while leaving the possibility to include weak spatial information by spatial binning, as in [27]. The authors in [6] argue that indoor scene recognition requires location-specific global features and propose a system recognizing locations by the objects that are present in them. An interesting result suggests that the


final recognition performance can be boosted even further as more object information is used in each image. A context-based system for place and object recognition is presented in [5]. The main idea is to use the context (scene gist) as a prior to infer what objects can be present in a scene. The HMM-based place recognition system requires a considerable amount of training data, possible transition probabilities, and so forth, but it naturally integrates temporal information and a confidence measure to detect navigation in unknown locations. Probabilistic Latent Semantic Analysis (pLSA) was used in [28] to discover higher level topics (e.g., grass, forest, water) from low-level visual features, building a novel low-dimensional representation used afterwards in a k-Nearest Neighbor classifier. The study shows superior classification performance when passing from low-level visual features to high-level topics that could be loosely attributed to the context of the scene.

2.2.4. Place Recognition in Video. Place recognition from recorded videos brings both novel opportunities and information, but it also poses additional challenges and constraints. Much more image data can be extracted from video, while in practice only a small portion of it can be labeled manually. Additional information that is often leveraged in the literature is the temporal continuity of the video stream.

A matching-based approach has been used in [29] to retrieve objects in video. Results show that simple matching produces a large number of false positive matches, but the usage of a stop list to remove the most frequent and most specific visual words, followed by a spatial consistency check, significantly improves the quality of retrieval results. In [30], belief functions in the Bayesian filtering context are used to determine the confidence of a particular location at any time moment. The modeling involves sensor and motion models, which have to be trained offline with a sufficiently large annotated database. Indeed, the model has to learn the allowed transitions between places, which requires the annotated data to represent all possible transitions to be found in the test data.

An important group of methods performing simultaneous localization and mapping (SLAM) is widely used in robotics [31, 32]. The main idea in these methods is to simultaneously build and update a map of an unknown environment while tracking in real time the current position of the camera. In our work, the construction of such a map is not necessary and may prove to be very challenging, since the environment can be very complex and constantly changing.

2.3. Multiple Feature Learning

2.3.1. Motivation. Different visual features capture different aspects of a scene, and the correct choice depends on the task to solve [33]. To this end, even humans perform poorly when using only one source of perception [34]. Therefore, instead of designing a specific and adapted descriptor for each particular case, several visual descriptors can be combined in a more complex system, yielding increased discrimination power in a wider range of applications. Following the survey [35], two main approaches can be

identified for the fusion of multiple features, depending on whether the fusion is done in the feature space (early fusion) or in the decision space (late fusion).

2.3.2. Early Fusion. Early fusion strategies focus on the combination of input features before using them in a classifier. In the case of kernel classifiers, the features can be seen as defining a new kernel that takes into account several features at once. This can be done by concatenating the features into a new, larger feature vector. A more general approach, Multiple Kernel Learning (MKL), also tries to estimate the optimal parameters of the kernel combination in addition to the classifier model. In our work, we evaluated the SimpleMKL [36] algorithm as a representative of the MKL family. The algorithm is based on gradient descent and learns a weighted linear combination of kernels. This approach has notably been applied in the context of object detection and classification [37-39] and image classification [40, 41].

2.3.3. Late Fusion. In the late fusion strategy, several base classifiers are trained independently and their outputs are fed to a special decision layer. This fusion strategy is commonly referred to as a stacking method and is discussed in depth in the multiple classifier systems literature [42-46]. This type of fusion allows the use of multiple visual features, leaving their exploitation to an algorithm which performs automatic feature selection or weighting according to the utility of each feature.

It is clear that nothing prevents using an SVM as the base classifier. Following the work of [47], it was shown that SVM outputs in the form of decision values can be combined linearly using the Discriminative Accumulation Scheme (DAS) [48] for confidence-based place recognition indoors. Subsequent work evolved by relaxing the constraint of linearity of the combination, using a kernel function on the outputs of the individual single-feature classifiers, giving rise to the Generalized DAS [49]. Results show a clear performance gain when using different visual features or completely different modalities. Other works follow a similar reasoning but use different combination rules (max, product, etc.), as discussed in [50]. A comprehensive comparison of different fusion methods in the context of object classification is given in [51].

2.4. Learning with Unlabeled Data

2.4.1. Motivation. Standard supervised learning with single or multiple features is successful if enough labeled training samples are presented to the learning algorithm. In many practical applications, the amount of training data is limited, while a wealth of unlabeled data is often available and largely unused. It is well known that classifiers learned using only training data may suffer from overfitting or the incapability to generalize to the unlabeled data. In contrast, unsupervised methods do not use label information. They may detect a structure in the data; however, prior knowledge and correct assumptions about the data are necessary to characterize a structure that is relevant for the task.

Semisupervised learning addresses this issue by leveraging labeled as well as unlabeled data [52, 53].


2.4.2. Graph-Based Learning. Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and an unlabeled set $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \{-1, +1\}$, the goal is to estimate class labels for the latter. The usual hypothesis is that the two sets are sampled i.i.d. according to the same joint distribution $p(\mathbf{x}, y)$. There is no intention to provide estimations on data outside the sets $L$ and $U$. A deeper discussion of this issue can be found in [54] and in references therein.

In graph-based learning, a graph composed of labeled and unlabeled nodes (in our case representing the images), interconnected by edges encoding similarities, is built. Application-specific knowledge is used to construct the graph in such a way that the labels of nodes connected by a high-weight link are similar and that no, or only a few weak, links are present between nodes of different classes. This graph therefore encodes information on the smoothness of a learned function $f$ on the graph, which corresponds to a measure of compatibility with the graph connectivity. The graph Laplacian [55, 56] can then be used directly as connectivity information to propagate information from labeled nodes to unlabeled nodes [57], or as a regularization term that penalizes nonsmooth labelings within a classifier, such as the Lap-SVM [58, 59].

From a practical point of view, the algorithm requires the construction of the full affinity matrix $W$, where all image pairs in the sequences are compared, and the computation of the associated Laplacian matrix $L$, which requires $O(n^2)$ memory. While theoretically attractive, the direct method scales poorly with the number of graph nodes, which seriously restricts its usage in a wide range of practical applications.
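As a small illustration of the memory issue mentioned above (a generic sketch with a Gaussian affinity, not tied to the specific similarity used in the cited works), both the dense affinity matrix $W$ and the unnormalized Laplacian $L = D - W$ for $n$ images are $n \times n$:

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Build a dense Gaussian affinity matrix W and the unnormalized
    Laplacian L = D - W; both are n x n, hence O(n^2) memory."""
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T  # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                                   # no self-loops
    D = np.diag(W.sum(axis=1))
    return D - W
```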

2.4.3. Co-Training from Multiple Features. Co-Training [60] is a wrapper algorithm that learns two discriminant classifiers in a joint manner. The method trains the two classifiers iteratively, such that in each iteration the highest-confidence estimates on unlabeled data are fed into the training set of the other classifier. Classically, two views on the data, or two single-feature splits of a dataset, are used. The main idea is that the solution or hypothesis space is significantly reduced if both trained classifiers agree on the data, which reduces the risk of overfitting, since each classifier also fits the initial labeled training set. More theoretical background and analysis of the method is given in Section 3.3.2.

The Co-Training algorithm was proposed in [60] as a solution to classify Web pages using both link and word information. The same method was applied to the problem of Web image annotation in [61, 62] and automatic video annotation in [63]. The generalization capacity of Co-Training on different initial labeled training sets was studied in [64]. More analysis of the theoretical properties of the Co-Training method, such as rough estimates of the maximal number of iterations, can be found in [65]. A review of different variants of the Co-Training algorithm, together with their comparative analysis, is given in [66].

2.4.4. Link between Graph and Co-Training Approaches. It is interesting to note the link [67, 68] between the Co-Training

method and label propagation in a graph, since adding the most confident estimations in each Co-Training iteration can be seen as label propagation from labeled nodes to unlabeled nodes in a graph. This view of the method is further discussed and practically evaluated in [69] as a label propagation method on a combined graph built from two individual views.

Graph-based methods are limited by the fact that graph edges encode low-level similarities that are computed directly from the input features. The Co-Training algorithm uses a discriminative model that can adapt to the data with each iteration and therefore achieve better generalization on unseen unlabeled data. In the next section, we build a framework based on the Co-Training algorithm to propose our solution for image-based place recognition.

In this work, we attempt to leverage all available information from image data that could help to provide cues for camera place recognition. Manual annotation of recorded video sequences requires a lot of human labor. The aim of this work is to evaluate the utility of unlabeled data within the Co-Training framework for image-based place recognition.

3. Proposed Approach

In this section, we present the architecture of the proposed method, which is based on the Co-Training algorithm, and then discuss each component of the system. The standard Co-Training algorithm (see Figure 2) allows benefiting from the information in the unlabeled part of the corpus by using a feedback loop to augment the training set, thus producing classifiers with augmented performance. In the standard algorithm formulation, the two classifiers remain separate, which does not leverage their complementarity to its maximum. The proposed method addresses this issue by providing a single output using late classifier fusion and time filtering for temporal constraint enforcement.

We will present the different elements of the system in order of increasing abstraction. Single feature extraction, preparation, and classification using SVM will be presented in Section 3.1. Multiple feature late fusion and a proposed extension to take into account time information will be introduced in Section 3.2. The complete algorithm combining those elements with the Co-Training algorithm will be developed in Section 3.3.

3.1. Single Feature Recognition Module. Each image is represented by a global signature vector. In the following sections, the visual features $\mathbf{x}_i^{(j)} \in \mathcal{X}^{(j)}$ correspond to numerical representations of the visual content of the images, where the superscript $(j)$ denotes the type of visual features.

3.1.1. SVM Classifier. In our work, we rely on Support Vector Machine (SVM) classifiers to carry out decision operations. The SVM aims at finding the best class separation instead of modeling potentially complex within-class probability densities, as in generative models such as Naive Bayes [70]. The maximal margin separating hyperplane is motivated from the


statistical learning theory viewpoint, by linking the margin width to the classifier's generalization capability.

Figure 2: Workflow of the Co-Training algorithm. Training and testing data in two views $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ feed two classifiers, SVM (1) and SVM (2); test scores are turned into confidence scores, the top confidence test patterns of each view are selected, and they are exchanged between the two classifiers (CO).

Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x} \in \mathbb{R}^{d}$ and $y \in \{-1, +1\}$, a linear maximal margin classifier $f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$ can be found by solving

$$\min_{\mathbf{w}, b, \xi} \; \sum_{i=1}^{l} \xi_i + \lambda \|\mathbf{w}\|^{2} \quad \text{s.t.} \quad y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0, \quad \forall i = 1, \ldots, l, \qquad (1)$$

for the hyperplane $\mathbf{w} \in \mathbb{R}^{d}$ and its offset $b \in \mathbb{R}$. In the regularization framework, the loss function, called the Hinge loss, is

$$\ell(\mathbf{x}, y, f(\mathbf{x})) = \max(1 - y_i f(\mathbf{x}_i), 0), \quad \forall i = 1, \ldots, l, \qquad (2)$$

and the regularizer is

$$\Omega_{\mathrm{SVM}}(f) = \|\mathbf{w}\|^{2}. \qquad (3)$$

As will be seen from the discussion, the regularizer plays an important role in the design of learning methods. In the case of an SVM classifier, the regularizer in (3) reflects the objective to be maximized: maximum margin separation on the training data.
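For illustration, the following minimal sketch (not the authors' implementation; it assumes dense NumPy feature matrices and labels in $\{-1, +1\}$) minimizes the regularized hinge loss of (1)-(3) by subgradient descent and evaluates the resulting linear decision function.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Primal linear SVM sketch: minimize sum_i hinge(x_i, y_i) + lam * ||w||^2
    by subgradient descent. X: (l, d) features, y: (l,) labels in {-1, +1}."""
    l, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
        viol = margins < 1                        # samples with nonzero hinge loss
        grad_w = -(y[viol][:, None] * X[viol]).sum(axis=0) + 2 * lam * w
        grad_b = -y[viol].sum()
        w -= lr / l * grad_w
        b -= lr / l * grad_b
    return w, b

def decision(X, w, b):
    """f(x) = w^T x + b for every row of X."""
    return X @ w + b
```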

3.1.2. Processing of Nonlinear Kernels. The power of the SVM classifier owes much to its easy extension to the nonlinear case [71]. The highly nonlinear nature of the data can be taken into account seamlessly by using the kernel trick, such that the hyperplane is found in a feature space induced by an adapted kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$ in a Reproducing Kernel Hilbert Space (RKHS). The implicit mapping $\mathbf{x} \mapsto \Phi(\mathbf{x})$ means that we can no longer find an explicit hyperplane $(\mathbf{w}, b)$, since the mapping function is not known and may be of very large dimensionality. Fortunately, the decision function can be formulated in the so-called dual representation [71], and then the solution minimizing the regularized risk, according to the Representer theorem, is

$$f_k(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b, \quad k = 1, \ldots, c, \qquad (4)$$

where $l$ is the number of labeled samples.

Bag of Words descriptors have been used intensively for efficient and discriminant image description. The linear kernel does not provide the best results with such representations, which have been more successful with kernels such as the Hellinger kernel, the $\chi^2$-kernel, or the intersection kernel [6, 27, 33]. Unfortunately, training with such kernels using the standard SVM tools is much less computationally efficient than using the linear inner product kernel, for which efficient SVM implementations exist [72]. In this work, we have therefore chosen to adapt the input features to the linear context using two different techniques. For the BOVW (Bag of Visual Words) [25] and SPH (Spatial Pyramid Histogram) [27] features, a Hellinger kernel was used. This kernel admits an explicit mapping function using a square root transformation $\phi([x_1 \cdots x_d]^{T}) = [\sqrt{x_1} \cdots \sqrt{x_d}]^{T}$. In this particular case, a linear embedding $\mathbf{x}' = \phi(\mathbf{x})$ can be computed explicitly and has the same dimensionality as the input feature. For the CRFH (Composed Receptive Field Histogram) features [26], the feature vector has a very large number of dimensions but is also extremely sparse, with between 500 and 4000 nonzero coefficients out of many millions of features in total. These features were transformed into a linear embedding using Kernel Principal Component Analysis [73], in order to reduce them to a 500-dimension linear embedding vector.
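As an illustration of the square-root transformation used for the BOVW and SPH features, the following sketch (an assumption-laden example, not the exact preprocessing of the paper) computes the explicit Hellinger embedding of nonnegative histogram features, so that the linear inner product between embedded vectors equals the Hellinger (Bhattacharyya) kernel between the L1-normalized histograms.

```python
import numpy as np

def hellinger_embedding(H):
    """H: (n, d) array of nonnegative histogram features (e.g., BOVW counts).
    Returns phi(x) = sqrt(x / ||x||_1), so that <phi(a), phi(b)> equals the
    Hellinger kernel between the normalized histograms."""
    H = np.asarray(H, dtype=float)
    norms = H.sum(axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty histograms
    return np.sqrt(H / norms)
```

A linear SVM trained on this embedding behaves like a Hellinger-kernel SVM on the original histograms while keeping the linear training cost.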


In the following, we will therefore consider that all features are processed into a linear embedding $\mathbf{x}_i$ that is suitable for an efficient linear SVM. The utility of this processing will be evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations of two visual feature classifiers. Other forms of efficient embedding proposed in [38] could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting directly by $\mathbf{x}_i$ the linearized descriptors, without further indication, in the rest of this document.

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to $c > 2$ classes is considered in a one-versus-all setup. Therefore, $c$ independent classifiers are trained on the labeled data, each of which learns the separation between one class and the other classes. We will denote by $f_k$ the decision function associated to class $k \in \{1, \ldots, c\}$. The outcome of the classifier bank for a sample $\mathbf{x}$ can be represented as a score vector $\mathbf{s}(\mathbf{x})$ by concatenating the individual decision scores:

$$\mathbf{s}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_c(\mathbf{x})). \qquad (5)$$

In that case, the estimated class of a testing sample $\mathbf{x}_i$ is obtained from the largest positive score:

$$\hat{y}_i = \arg\max_{k=1,\ldots,c} f_k(\mathbf{x}_i). \qquad (6)$$
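A minimal one-versus-all sketch of (5) and (6), assuming scikit-learn's LinearSVC as the binary classifier (the paper does not prescribe a specific implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, C=1.0):
    """Train c independent binary SVMs (class k vs. rest)."""
    classes = np.unique(y)
    clfs = [LinearSVC(C=C).fit(X, np.where(y == k, 1, -1)) for k in classes]
    return classes, clfs

def score_vector(clfs, X):
    """s(x) = (f_1(x), ..., f_c(x)) for every row of X, as in (5)."""
    return np.column_stack([clf.decision_function(X) for clf in clfs])

def predict(classes, clfs, X):
    """Estimated class = argmax_k f_k(x), as in (6)."""
    return classes[np.argmax(score_vector(clfs, X), axis=1)]
```

Note that LinearSVC already performs one-versus-rest internally; the explicit loop here merely exposes the score vector needed by the fusion stages below.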

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work, we follow a late classifier fusion paradigm, with several classifiers being trained independently on different visual cues and their outputs fused into a single final decision. We motivate this choice, compared to the early fusion paradigm, as it allows easier integration at the decision level of the augmented classifiers obtained by the Co-Training algorithm, as well as providing a natural extension to inject the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and the unlabeled set of patterns by $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and the outcome of classification is a binary output $y \in \{-1, +1\}$.

The visual data may have $p$ multiple cues describing the same image $I_i$. Suppose that $p$ cues have been extracted from an image $I_i$:

$$\mathbf{x}_i \longrightarrow (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \ldots, \mathbf{x}_i^{(p)}), \qquad (7)$$

where each cue $\mathbf{x}_i^{(j)}$ belongs to an associated descriptor space $\mathcal{X}^{(j)}$.

Denote also the $p$ decision functions $f^{(1)}, f^{(2)}, \ldots, f^{(p)}$, where $f^{(j)} \in \mathcal{F}^{(j)}$ are trained on the respective visual cues and provide estimations $\hat{y}_k^{(j)}$ on the pattern $\mathbf{x}_k^{(j)}$. Then, for a visual cue $t$ and $c$-class classification in the one-versus-all setup, a score vector can be constructed:

$$\mathbf{s}^{t} = (f_1^{t}(\mathbf{x}), \ldots, f_c^{t}(\mathbf{x})). \qquad (8)$$

In our work, we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVM-DAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to combine linearly the scores returned by the same class decision function across multiple visual cues $t = 1, \ldots, p$. The new combined decision function for a class $j$ is then a linear combination

$$f_j^{\mathrm{DAS}}(\mathbf{x}) = \sum_{t=1}^{p} \beta_t f_j^{t}(\mathbf{x}), \qquad (9)$$

where the weight $\beta_t$ is attributed to each cue according to its importance in the learning phase. The new scores can then be used in the decision process, for example, using the max score criterion.

The DAS scheme is an example of a parallel classifier combination architecture [44] and implies a competition between the individual classifiers. The weights $\beta_t$ can be found using a cross-validation procedure with the normalization constraint

$$\sum_{t=1}^{p} \beta_t = 1. \qquad (10)$$
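A minimal sketch of DAS fusion (9)-(10) over per-cue score matrices; the cue weights are assumed to have been selected by cross-validation beforehand:

```python
import numpy as np

def das_fuse(score_list, betas):
    """DAS late fusion.
    score_list: list of p arrays of shape (n, c), one score matrix per visual cue.
    betas: p cue weights, assumed nonnegative and summing to 1."""
    betas = np.asarray(betas, dtype=float)
    assert np.isclose(betas.sum(), 1.0)
    fused = sum(b * s for b, s in zip(betas, score_list))  # f_j^DAS = sum_t beta_t f_j^t
    return fused, fused.argmax(axis=1)                     # fused scores and class estimates
```

For two cues ($p = 2$), this reduces to the $(1-\beta)\,\mathbf{s}^{(1)} + \beta\,\mathbf{s}^{(2)}$ combination used in Algorithm 1 below.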

3.2.3. SVM Discriminant Accumulation Scheme. The SVM-DAS can be seen as a generalization of the DAS, building a stacked architecture of multiple classifiers [44] in which individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue $t$ and produces a score vector as in (8). Then the single-feature score vectors $\mathbf{s}_i^{t}$ corresponding to one particular pattern $\mathbf{x}_i$ are concatenated into a new multifeature score vector $\mathbf{z}_i = [\mathbf{s}_i^{1}, \ldots, \mathbf{s}_i^{p}]$. A final top-level classifier can be trained on these new features:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j. \qquad (11)$$

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between the base classifier outputs. If a linear kernel function is used,

$$k_{\mathrm{SVMDAS}}(\mathbf{z}_i, \mathbf{z}_j) = \langle \mathbf{z}_i, \mathbf{z}_j \rangle = \sum_{t=1}^{p} \langle \mathbf{s}_i^{t}, \mathbf{s}_j^{t} \rangle, \qquad (12)$$

then the decision function in (11) can be rewritten by exchanging the sums:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j = \sum_{t=1}^{p} \sum_{i=1}^{l} \alpha_{ij} y_i \langle \mathbf{s}_i^{t}, \mathbf{s}^{t} \rangle + b_j, \qquad (13)$$

where $\mathbf{s}^{t}$ denotes the score vector of the test pattern for cue $t$.


Denoting $\mathbf{w}_j^{t} = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^{t}$, we can rewrite the decision function using the input patterns and the learned weights:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p} \sum_{k=1}^{c} w_{jk}^{t} f_k^{t}(\mathbf{x}). \qquad (14)$$

This new representation reveals that using a linear kernel in the SVM-DAS framework yields a classifier in which a weight is learned for every possible linear combination of the base classifier outputs. The DAS can be seen as a special case in this context, but with significantly fewer parameters. Usage of a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.

The disadvantage of such a configuration is that the final stage classifier needs to be trained as well and its parameters tuned.
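A minimal SVM-DAS sketch under the same assumptions as before (scikit-learn LinearSVC as the top-level classifier; the base classifier scores are assumed to be computed on a held-out split to avoid optimistic stacking):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svmdas_train(score_list_train, y_train, C=1.0):
    """SVM-DAS: concatenate per-cue score vectors z_i = [s_i^1, ..., s_i^p]
    and train a top-level (here linear) SVM on them."""
    Z = np.hstack(score_list_train)        # (n, p*c) stacked base-classifier scores
    return LinearSVC(C=C).fit(Z, y_train)

def svmdas_predict(top_clf, score_list_test):
    Z = np.hstack(score_list_test)
    return top_clf.predict(Z)
```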

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature, such that the visual content does not usually change much over a short period of time. In the case of topological place recognition indoors, this constraint may be useful, as place changes are encountered relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes are discouraged within a relatively short period of time. This lowers the proliferation of occasional, temporally localized misclassifications.

Let $s_i^{t} = f^{(t)}(\mathbf{x}_i)$ be the score of a binary classifier for visual cue $t$, and $h$ a temporal window of size $2\tau + 1$. Then temporal accumulation can be written as

$$s_{i,\mathrm{TA}}^{t} = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^{t}, \qquad (15)$$

and it can easily be generalized to multiple feature classification by applying it separately to the output of the classifiers associated to each feature $\mathbf{s}^{t}$, where $t = 1, \ldots, p$ is the visual feature type. We use an averaging filter of size $\tau$ defined as

$$h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \ldots, \tau. \qquad (16)$$

Therefore, the input of the TA is the SVM scores obtained after classification, and the output is again the processed SVM scores, with the temporal constraint enforced.
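A minimal sketch of the temporal accumulation (15)-(16) applied to a temporally ordered score matrix; the truncation of the averaging window at sequence borders is an implementation assumption not specified in the text:

```python
import numpy as np

def temporal_accumulation(scores, tau):
    """Average SVM scores over a sliding window of 2*tau+1 frames.
    scores: (n_frames, c) score matrix for one visual cue, in temporal order."""
    n, _ = scores.shape
    smoothed = np.empty_like(scores, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - tau), min(n, i + tau + 1)  # truncate window at borders
        smoothed[i] = scores[lo:hi].mean(axis=0)
    return smoothed
```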

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section, we explain how to additionally learn from labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training [60] is an algorithm that iteratively trains two classifiers on two-view data $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ by feeding the highest confidence ($z_i$) estimates from the testing set into the other view's classifier. In this semisupervised approach,

the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest confidence estimates. The pseudocode is presented in Algorithm 1; it could also be extended to multiple views, as in [53].

The power of the method lies in its capability to learn from small training sets, eventually growing its discriminative properties on the large unlabeled data set as more confident estimations are added into the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;

(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;

(3) the confident estimations on unlabeled data are helpful to predict the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or $N$ iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small difference from iteration $t-1$ to $t$. The parameter-less version of Co-Training works until the complete exhaustion of the pool of unlabeled samples, but requires a threshold on the confidence measure, which is used to separate high and low confidence estimates. In our work, we use this variant of the Co-Training algorithm.
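The following sketch illustrates such a Co-Training loop for the binary case, with the absolute SVM score used as a stand-in confidence (the actual confidence measures used in this work are discussed in Section 3.3.4); it is an assumption-laden illustration, not the authors' code:

```python
import numpy as np
from sklearn.svm import LinearSVC

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, top_frac=0.05):
    """Co-Training sketch for two views and binary labels in {-1, +1}.
    Each round, every view's classifier labels the unlabeled pool and its most
    confident estimates augment the training set of the other view."""
    L1, L2, y1, y2 = X1_l.copy(), X2_l.copy(), y_l.copy(), y_l.copy()
    U1, U2 = X1_u.copy(), X2_u.copy()
    while True:
        f1 = LinearSVC().fit(L1, y1)
        f2 = LinearSVC().fit(L2, y2)
        if len(U1) == 0:
            break
        s1, s2 = f1.decision_function(U1), f2.decision_function(U2)
        conf1, conf2 = np.abs(s1), np.abs(s2)      # |score| as a simple confidence proxy
        k = max(1, int(top_frac * len(U1)))
        top1 = np.argsort(-conf1)[:k]              # most confident under view 1
        top2 = np.argsort(-conf2)[:k]              # most confident under view 2
        lab1 = np.where(s1[top1] >= 0, 1, -1)      # labels estimated by view 1
        lab2 = np.where(s2[top2] >= 0, 1, -1)      # labels estimated by view 2
        # cross-feed: view 1's confident estimates augment view 2's training set, and vice versa
        L2, y2 = np.vstack([L2, U2[top1]]), np.concatenate([y2, lab1])
        L1, y1 = np.vstack([L1, U1[top2]]), np.concatenate([y1, lab2])
        keep = np.setdiff1d(np.arange(len(U1)), np.union1d(top1, top2))
        U1, U2 = U1[keep], U2[keep]
    return f1, f2
```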

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It remains unclear why, and which mechanisms make such learning useful. This can be justified from the learning theory point of view: there are fewer possible solutions, that is, classifiers from the hypothesis space, that agree on the unlabeled data in two views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier is additionally restricted by the other classifier. The two trained classifiers coupled in this system effectively reduce the possible solution space. Each of those two classifiers is less likely to overfit, since each has been initially trained on its own training set while taking into account the training process of the other classifier, carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework.


INPUT:
  Training set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$
  Testing set $U = \{\mathbf{x}_i\}_{i=1}^{u}$
OUTPUT:
  $\hat{y}_i$ — class estimations for the testing set $U$; $f^{(1)}, f^{(2)}$ — trained classifiers
PROCEDURE:
(1) Compute visual features $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ for every image $I_i$ in the dataset.
(2) Initialize $L_1 = \{(\mathbf{x}_i^{(1)}, y_i)\}_{i=1}^{l}$ and $L_2 = \{(\mathbf{x}_i^{(2)}, y_i)\}_{i=1}^{l}$.
(3) Initialize $U_1 = \{\mathbf{x}_i^{(1)}\}_{i=1}^{u}$ and $U_2 = \{\mathbf{x}_i^{(2)}\}_{i=1}^{u}$.
(4) Create two work sets $\hat{U}_1 = U_1$ and $\hat{U}_2 = U_2$.
(5) Repeat until the sets $\hat{U}_1$ and $\hat{U}_2$ are empty (CO):
    (a) Train classifiers $f^{(1)}, f^{(2)}$ using the sets $L_1, L_2$, respectively.
    (b) Classify the patterns in the sets $\hat{U}_1$ and $\hat{U}_2$ using the classifiers $f^{(1)}$ and $f^{(2)}$, respectively:
        (i) compute scores $s^{(1)}_{\text{test}}$ and confidences $z^{(1)}$ on the set $\hat{U}_1$;
        (ii) compute scores $s^{(2)}_{\text{test}}$ and confidences $z^{(2)}$ on the set $\hat{U}_2$.
    (c) Add the $k$ top confidence estimations $\hat{L}_1 \subset \hat{U}_1$, $\hat{L}_2 \subset \hat{U}_2$ to the training set of the other view:
        (i) $L_1 = L_1 \cup \hat{L}_2$;
        (ii) $L_2 = L_2 \cup \hat{L}_1$.
    (d) Remove the $k$ top confidence patterns from the working sets:
        (i) $\hat{U}_1 = \hat{U}_1 \setminus \hat{L}_1$;
        (ii) $\hat{U}_2 = \hat{U}_2 \setminus \hat{L}_2$.
    (e) Go to step (5).
(6) Optionally, perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
    (a) compute fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}} = (1 - \beta)\, \mathbf{s}^{(1)}_{\text{test}} + \beta\, \mathbf{s}^{(2)}_{\text{test}}$;
    (b) output class estimations $\hat{y}_i$ from the fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}}$.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.

Let us introduce the Hinge loss function $\ell(\mathbf{x}, y, f(\mathbf{x}))$, commonly used in classification, and the empirical risk of a candidate function $f \in \mathcal{F}$:

$$\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i)), \qquad (17)$$

which measures how well the classifier fits the training data. It is well known that, by minimizing only the training error, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

$$f^{\mathrm{RRM}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda \Omega(f), \qquad (18)$$

where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value or penalty for very complicated functions (typically the functions that fit perfectly to the data). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and better generalization capability on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer $\Omega_{\mathrm{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^{2}$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning, we can select a regularizer such that it is sufficiently smooth on the unlabeled data as well. Keeping all the previous discussion in mind, a function that fits the training data and respects the unlabeled data will probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as

$$(f^{(1)}, f^{(2)}) = \arg\min_{f^{(1)}, f^{(2)}} \; \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell\left(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)\right) + \lambda_1 \Omega_{\mathrm{SVM}}\left(f^{(t)}\right) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell\left(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i)\right), \qquad (19)$$

where $\lambda_2 > 0$ controls the balance between an agreed fit on the training data and agreement on the test data. The first part of (19) states that each individual classifier should fit the


given training data but should not overfit, which is prevented with the SVM regularizer $\Omega_{\mathrm{SVM}}(f)$. The second part is a regularizer $\Omega_{\mathrm{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits the unlabeled data, and works with two visual cues.

Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b). In (a), the two views $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ feed the Co-Training stage (CO), whose outputs are fused by DAS (CO-DAS). In (b), a temporal accumulation (TA) step is inserted between CO and DAS (CO-TA-DAS).

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, algorithms following this spirit implement multiple view learning. Refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best, and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work, we use the DAS method for late fusion, although it is possible to use the more general SVM-DAS method as well. Experimental evaluation will show that very competitive performance can be obtained using the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces the temporal continuity information. Experimental evaluation will reveal the relative performance of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS method (or CO-TA-DAS, if temporal accumulation is enabled) is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold $k$ for the top confidence sample selection, the temporal accumulation window width $\tau$, and the late fusion parameter $\beta$. We express the threshold $k$ as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion on the selection of the parameter $\beta$ is given in Section 4.1.2.

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with $c$ classes, the scores $\{f^{k}(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as

$$P(y = k \mid \mathbf{x}) = \frac{\exp(f^{k}(\mathbf{x}))}{\sum_{i=1}^{c} \exp(f^{i}(\mathbf{x}))}, \qquad (20)$$

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows interpreting the classifier output as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Moreover, forcing the scores to normalize to one may not transfer all the dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76], a parameter-less method was proposed, which assigns the score value

$$z = \begin{cases} p_{+}, & f(\mathbf{x}) > 1, \\ \dfrac{1 + f(\mathbf{x})}{2}, & -1 \le f(\mathbf{x}) \le 1, \\ p_{-}, & f(\mathbf{x}) < -1, \end{cases} \qquad (21)$$

where $p_{+}$ and $p_{-}$ are the fractions of positive and negative score values, respectively. The authors argue that the interesting dynamics relevant to confidence estimation happen in the region of the margin, while the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem,


but it does not cover the multiclass case, as required by our application.

Score Difference (Tommasi) A method that does not requireadditional preprocessing for confidence estimation was pro-posed in [77] and thresholded to obtain a decision corre-sponding to ldquono actionrdquo ldquorejectrdquo or ldquodo not knowrdquo situationfor medical image annotation The idea is to use the contrastbetween the two top uncalibrated score valuesThemaximumscore estimation should be more confident if other score val-ues are relatively smaller This leads to a confidence measureusing the contrast between the two maximum scores

$$
z = f_{k^{*}}(\mathbf{x}) - \max_{k=1,\ldots,c;\; k \ne k^{*}} f_k(\mathbf{x}).
\tag{22}
$$

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, the measure is problematic if all scores are negative.
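A minimal sketch of the score-contrast confidence of (22), again assuming a (n_samples, n_classes) matrix of one-versus-all scores:

```python
# Contrast between the best and runner-up one-vs-all scores (Tommasi measure).
import numpy as np

def tommasi_confidence(scores):
    top2 = np.sort(scores, axis=1)[:, -2:]   # the two largest scores per sample
    return top2[:, 1] - top2[:, 0]
```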

Class-Overlap-Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account in any of the above confidence computation procedures. The one-versus-all setup for multiclass classification may yield ambiguous decisions: for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to several degrees and also treats two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (a larger positive score value). Cases with all positive or all negative scores are considered degenerate and receive $z_i \leftarrow 0$.

The computation is divided into two steps. First, we compute the standard Tommasi confidence measure

$$
z^{0}_{i} = f_{j^{*}}(\mathbf{x}_i) - \max_{k=1,\ldots,c;\; k \ne j^{*}} f_k(\mathbf{x}_i);
\tag{23}
$$

then the measure $z^{0}_{i}$ is modified to account for class overlap,

$$
z_i = z^{0}_{i}\,\max\!\left(0,\; 1 - \frac{p_i - 1}{C}\right),
\tag{24}
$$

where $p_i = \mathrm{Card}(\{k = 1,\ldots,c \mid f_k(\mathbf{x}_i) > 0\})$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In the case $\forall k\, f_k(\mathbf{x}_i) > 0$ or $\forall k\, f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows them to be excluded rather than assigned doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, confidence is lowered if more than one binary classifier returns a positive score.
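A minimal sketch of the class-overlap-aware measure of (23)-(24), assuming a (n_samples, n_classes) score matrix; `C` is the overlap normalization constant appearing in (24).

```python
# Class-overlap-aware confidence: Tommasi contrast with an overlap penalty
# and zero confidence for the two degenerate cases.
import numpy as np

def overlap_aware_confidence(scores, C):
    top2 = np.sort(scores, axis=1)[:, -2:]
    z0 = top2[:, 1] - top2[:, 0]                       # Tommasi contrast, eq. (23)
    p = (scores > 0).sum(axis=1)                        # number of positive scores
    z = z0 * np.maximum(0.0, 1.0 - (p - 1) / C)         # overlap penalty, eq. (24)
    degenerate = (p == 0) | (p == scores.shape[1])      # all negative or all positive
    z[degenerate] = 0.0
    return z
```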

4. Experimental Evaluation

In this section we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2 in Section 4.1 and (2) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) training images randomly sampled across the whole corpus and (b) a more realistic video-versus-video setup. The first setup allows for a gradual increase of supervision, which gives insight into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database we extensively assess the place recognition performance of each independent part of the proposed system. In particular, we validate the utility of multiple features, the effect of temporal smoothing, the use of unlabeled data, and the different confidence measures.

The IMMED database is used for validation purposes; on it we evaluate all methods and summarize their performance.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems for mobile robots in indoor environments.

The IMMED database represents a collection of video sequences recorded using a camera positioned on the shoulder of volunteers and capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section we will use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search of the tree. This allows visual words to be introduced ranging from more general (higher-level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
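As a minimal sketch, one common tf-idf variant applied to visual-word histograms could look as follows; the exact weighting of [25] may differ, and `hist` is assumed to be a (n_images, n_words) matrix of raw visual-word counts.

```python
# One common tf-idf weighting of visual-word histograms (illustrative variant).
import numpy as np

def tfidf(hist):
    tf = hist / np.maximum(hist.sum(axis=1, keepdims=True), 1)  # term frequency
    df = (hist > 0).sum(axis=0)                                   # document frequency
    idf = np.log(hist.shape[0] / np.maximum(df, 1))               # inverse document frequency
    return tf * idf
```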

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness when it comes to the spatial structure of the image.



Figure 4: IDOL2 dataset sample images. (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

This is done by constructing a pyramid, where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram from local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3 with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that the maximum of visual information is captured while reducing the computational burden.
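A minimal sketch of the concatenation step, assuming a hypothetical helper `bovw_histogram(region, vocab)` that returns a 200-bin visual-word histogram for one image region:

```python
# Spatial-pyramid concatenation of per-region BOVW histograms (sketch).
import numpy as np

def spatial_pyramid(image, vocab, levels=3):
    hists = []
    h, w = image.shape[:2]
    for level in range(levels):
        n = 2 ** level                          # n x n grid at this pyramid level
        for i in range(n):
            for j in range(n):
                region = image[i*h//n:(i+1)*h//n, j*w//n:(j+1)*w//n]
                hists.append(bovw_histogram(region, vocab))  # hypothetical helper
    return np.concatenate(hists)                # 1 + 4 + 16 = 21 regions x 200 = 4200 dims
```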

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses from one specific filter. Due to its multidimensional nature and the size of an image, such a descriptor often results in a very high dimensional vector. In our experimental evaluations we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The resulting global descriptors are very sparse vectors with up to 400 million dimensions. They were reduced to 500-dimensional descriptor vectors using KPCA with a χ² kernel [73].
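A minimal sketch of this reduction using scikit-learn, assuming the CRFH histograms have already been brought into a manageable dense, nonnegative form `X`; the exponential χ² kernel used here is one possible choice, and a production implementation would need sparse-aware kernel computation.

```python
# Chi-squared kernel PCA reduction of CRFH histograms (sketch).
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

def reduce_crfh(X, n_components=500, gamma=1.0):
    K = chi2_kernel(X, gamma=gamma)                               # chi2 kernel matrix
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    return kpca.fit_transform(K)                                   # low-dimensional embedding
```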

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above the ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves, used respectively for validation and testing. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is reported as the average accuracy.

It is expected that global place recognition performance rises from mediocre values at low supervision to its maximum at the highest supervision level.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.



Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) if α = 0 and CRFH if α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: the "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or recorded with a large time span between them. The "ALL" set contains all 132 video pairs, to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging and thus lower place recognition performance is expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of the performance of baseline single-feature methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages the outputs of two visual feature classifiers and provides a weighted score sum on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediate fusion values, which suggests feature complementarity. The fusion parameter α can be safely set to an intermediate value such as 0.5, and the


Figure 6: Comparison of single-feature (BOVW, CRFH, and SPH) and multiple-feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

final performance would exceed that of every single-feature classifier alone at all supervision levels.
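A minimal sketch of the DAS fusion rule, following the convention of Figure 5 (α = 0 uses only the SPH scores, α = 1 only the CRFH scores); `scores_sph` and `scores_crfh` are assumed to be (n_samples, n_classes) score matrices from the two classifiers.

```python
# DAS-style late fusion: weighted sum of two classifiers' score matrices.
import numpy as np

def das_fuse(scores_sph, scores_crfh, alpha=0.5):
    fused = (1 - alpha) * scores_sph + alpha * scores_crfh
    return fused.argmax(axis=1)   # class decision per sample
```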

The SVMDAS Method. In Figure 6, the effect of the supervision level on classification performance is shown for single-feature and multiple-feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performance of the 3 single-feature approaches, with SPH providing the best results. Both SVMDAS (a late fusion approach) and SimpleMKL (an early fusion approach) operate over the 3 single features considered. They outperform the single-feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performance. Comparing the maximum performance (at the best α for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for subsequent use in the final system is motivated by this result and by its simplified fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single-feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results



Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible solution. Under the assumption that only a minority of temporal neighbors are classified incorrectly, temporal continuity is a strong cue for our application and should be integrated in the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Given the frame rate of the camera and the relatively slow change of rooms, the filter size can be set empirically, for example, to the number of frames captured in one second.
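A minimal sketch of temporal accumulation as a moving average of the per-class score sequence over a window of `tau` frames, applied before the class decision; the window length is an assumption matching the τ parameter discussed above.

```python
# Temporal accumulation (TA): moving average of scores over tau frames.
import numpy as np

def temporal_accumulation(scores, tau=50):
    # scores: (n_frames, n_classes) array ordered in time
    kernel = np.ones(tau) / tau
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 0, scores)
    return smoothed.argmax(axis=1)
```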

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single-feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we are interested in observing the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single-feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The lower performance at low supervision, compared to the standard supervised classifier, can be explained by improper parameter settings.


Figure 8: Comparison of a standard single-feature SVM with a semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

Practical application of this method is limited, since the full kernel matrix has to be computed and stored in memory, which scales as O(n²) with the number of patterns. Computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear-kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. Figures 9(b) and 9(c) illustrate, respectively, the best choice of the amount of selected high-confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top-confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Additional insights into the Co-Training algorithm can be gained if we perform more than one iteration (see Figure 10). The figures show the evolution of the performance of each single-feature classifier as it is iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high-confidence estimates is added after each iteration. The plots show an interesting increase in performance with every iteration for both classifiers, following the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Secondly, we notice a certain saturation after 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.



Figure 9: Effect of supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single-feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

[Figure 10 panels: global accuracy versus number of iterations for the BOF and CRFH co-training classifiers on six sequence pairs: minnie cloudy1 / minnie cloudy3 (far time), minnie cloudy1 / minnie night1 (close time), minnie cloudy1 / minnie sunny1 (close time), minnie night1 / minnie night3 (far time), minnie sunny1 / minnie sunny3 (far time), and minnie night1 / minnie sunny1 (close time).]

Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, (bottom) different lighting conditions.

Conclusion. The experiments carried out thus far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using the semi-supervised Laplacian SVM under the assumption of low-density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. These issues are avoided by using the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which training patterns are scattered across the database. Randomly labeled images may simulate a situation in which small portions of video are annotated in a frame-by-frame manner. In the extreme case, only a few images from every class may be labeled manually.

In this context, we are interested in the performance of the single-feature methods, the early and late fusion methods,



Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performance of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single-feature methods, which achieve approximately 50% and 60% correct classification for the BOVW- and CRFH-based SVM classifiers. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top-confidence estimates in one retraining iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top-confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of the high-confidence estimates are provided by the BOVW classifier, which may be a sign that an increasing amount of misclassifications is being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single-feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been

captured. This comes as no surprise at 50% of video annotation in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single-feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative retraining procedure. Interestingly, the diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm. This explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.


Figure 12: Comparison of the global performances for single-feature (BOVW-SVM, CRFH-SVM), multiple-feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used for the Co-Training feedback loop. The baseline single-feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, a good illustration concerning the amount of high-confidence feedback is shown in Figure 12. It is clear that only a portion of the test set data can be useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high-confidence estimates is around 20-50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single-feature baselines are outperformed by standard fusion and by simple

[Figure 13 curves: CO-DAS and CO-TA-DAS using the proposed, logistic, and Tommasi confidence measures; average accuracy versus the percentage of added test samples (video-versus-video, ALL pairs).]

Figure 13: Comparison of the performances of the types of confidence measures for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

Co-Training methods. The proposed methods, CO-DAS and CO-TA-DAS, perform the best by successfully leveraging two visual features and the temporal continuity of the video while working in a semi-supervised framework.

4.1.7. Effect of the Type of Confidence Measures. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performance for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than that of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce similar behavior with respect to the amount of feedback: first an increase of performance when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also included. When coupled with temporal accumulation, the proposed confidence measure has slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from the increased visual variability changing from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings



Figure 14: IMMED sample images. (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

exceeds 10 hours. All recordings were made using a portable GoPro video camera at a frame rate of 30 frames per second with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Most locations are represented by one short bootstrap sequence, briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; therefore, a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary

Table 1: IMMED dataset, average accuracy of the single-feature approaches.

Feature/approach   SVM    SVM-TA
BOVW               0.49   0.52
CRFH               0.48   0.53
SPH                0.47   0.49

of the results for single and multiple feature methods is provided in Tables 1 and 2, respectively.

Baseline: Single-Feature Classifier Performance. As shown in Table 1, single-feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation of this phenomenon is that, due to the low amount of supervision, a classifier trained on high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single-feature methods. We observe the same order of improvement in the multiple-feature methods, with MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single-feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion speaks in favor


Table 2: IMMED dataset, average accuracy of the multiple-feature approaches.

Feature/approach    MKL    MKL-TA   DAS    DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH            0.48   0.50     0.51   0.56     0.50     0.53
BOVW-CRFH           0.50   0.54     0.51   0.56     0.54     0.58
SPH-CRFH            0.48   0.51     0.50   0.54     0.54     0.57
BOVW-SPH-CRFH       0.48   0.51     0.51   0.56     —        —

of the late fusion method when compared to single-feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the space and thus a greater risk of overfitting. The late fusion strategy is more advantageous than the respective single-feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important given the low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single-feature methods and improves over DAS by up to 4% on all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged visual feature complementarity and the single-feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity prior and boosts performance by another 3-4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal continuity of the video, and the use of unlabeled data.

5. Conclusion

In this work we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on the public video sequence database IDOL2 and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the use of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured. The use of semi-supervised

learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all the acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system to challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20-23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582-601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177-193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273-280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413-420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299-1302, ACM, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113-4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32-36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170-178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422-434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521-2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165-174, ACM, September 2009.
[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1-4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1-7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255-264, ACM, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27-38, 2005.
[21] C. O Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3-II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411-418, 2007.
[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524-531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161-2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1-6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169-2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470-1477, October 2003.
[30] J. Knopp, Image Based Localization [PhD thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229-241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657-1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489-1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345-379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606-613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436-443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902-909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550-553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5-12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950-959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [PhD thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578-II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394-2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522-529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221-228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[56] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321-328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149-1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92-100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1-4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485-1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454-465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196-201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189-196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135-1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298-320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference, and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83-85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696-703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458-1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.


the context of wearable video indexing. Efficient usage of this information for place recognition in wearable video indexing therefore defines the problem of the present work.

In this paper we propose a novel strategy to incorporate and take advantage of unlabeled data for place recognition. It takes into account unlabeled data, multiple features, and time information. We present a complete processing pipeline, from low-level visual data extraction up to visual recognition. The principal contribution of this work is a novel system for robust place recognition in weakly annotated videos. We propose a combination of the Co-Training algorithm with classifier fusion to obtain a single classification estimate that exploits both multiple features and unlabeled data. In this context we also study a range of confidence computation techniques found in the literature and introduce our own confidence measure, designed to reduce the impact of uncertain classification results. The proposed system is designed such that each component is evaluated separately and its presence is justified. It will be shown that each component yields an increase in classification performance, both separately and in a combined configuration, as demonstrated on public and our challenging in-house datasets.

The system we propose in this paper is motivated by the need to develop a robust image-based place recognition system as part of the high-level activity analysis system developed within the IMMED project [7]. As part of this project, a wearable video recording prototype (see Figure 1), video annotation software, and activity recognition algorithms were also developed, but these are left out of the scope of this paper. More detail on the latter can be found in [7, 8].

The paper is organized as follows. In Section 2 we review related work from the literature with respect to visual recognition, multiple feature fusion, and semisupervised learning. In Section 3 we present the proposed approach and algorithms. In Section 4 we report the experimental evaluations done on two databases in real-life conditions and show the respective gains from the use of (a) multiple features, (b) unsupervised data, and (c) temporal information within our combined framework.

2. Literature Review

2.1. Activity Monitoring Context

2.1.1. Motivation. With this subsection we aim to put our work in the context of activity detection and recognition in video. Several setups have been used for that matter: ambient and wearable sensors.

2.1.2. Monitoring Using Ambient Sensors. Activity recognition systems have emerged quickly due to recent advances in large-scale video recording and in the deployment of high computational power systems. In this application field, most of the proposed methods originated from scene classification, where static image information is captured and categorized.

The authors in [9] use an SVM classifier to classify local events such as "walking" and "running" in a database consisting of very clean and unoccluded video sequences.

Figure 1: Wearable camera recording prototype used in the IMMED project.

Perhaps in a more challenging setup, human behavior recognition is performed in [10] by proposing specially crafted sparse spatiotemporal features adapted for the description of temporal visual data. Conceptually, a similar approach is proposed in [11], where each event in a soccer game is modeled as a temporal sequence of Bag of Visual Words (BOVW) features used in an SVM classifier; these sequences, termed strings, are then compared using a string kernel.

Detection and recognition of events in real-world industrial workflows is a challenging problem because of great intraclass variability (complex classifiers are required), unknown event start/end moments, and the requirement to remember the whole event history, which violates the Markovian assumptions of conditional independence (e.g., in HMM-based algorithms). The problem is alleviated in [12], where the authors propose an online worker behavior classification system that integrates a particle filter and an HMM.

2.1.3. Monitoring Using Wearable Sensors. Alternatively, activity information can also be obtained from simple on-body sensors (e.g., acceleration) and wearable video.

The authors in [13] investigated two methods for activity context awareness in weakly annotated videos using 3D accelerometer data. The first one is based on multi-instance learning, grouping sensor data into bags of activities (instead of labeling every frame). The second one uses a graph structure for feature and time similarity representation and transfers label information within those structures. Results favor label propagation based methods on multiple feature graphs.

Visual lifelog indexing by human actions [2, 3, 14, 15] has recently been proposed in healthcare, motivated by the expansion of Alzheimer's disease. Early attempts to answer this challenge were made in [1, 4] as a part of the SenseCam and IMMED projects, proposing lightweight devices and event segmentation algorithms. A motion-based temporal video segmentation algorithm with an HMM at its core [8] identified a strong correlation between activities and localization. This study reveals the complexity of the issue, which lies in learning a generative model from few training data, in the extension to larger scale, and in the difficulty of recognizing short and infrequent activities. These


and related issues were addressed in [16] with a Hierarchical HMM, which simultaneously fuses complementary low-level and midlevel (visual motion, location, sound, and speech) features, together with the contribution of an automatic audio-visual stream segmentation algorithm. Results validate the choice of two-level modeling of activities using the Hierarchical HMM and reveal an improvement in recognition performance when working with temporal segments. Optimal feature fusion strategies using the Hierarchical HMM are studied in [17]. The contributed intermediate-level fusion at the observation level, where all features are treated separately, compares positively to the more classic early and late fusion approaches.

This work is a part of an effort to detect and recognize a person's activities from wearable videos in the context of healthcare, within the IMMED (http://immed.labri.fr) project [7] and continued within the Dem@Care (http://www.demcare.eu) project. Localization information is one of multiple possible cues to detect and recognize activities solely from the egocentric point of view of the recording camera. Amongst these, location estimation is an important cue, which we now discuss in more detail.

2.2. Visual Location Recognition

2.2.1. Motivation. Classifying the current location using visual content only is a challenging problem. It relates to several problematics that have already been addressed in various contexts, such as image retrieval from large databases, semantic video retrieval, image-based place recognition in robotics, and scene categorization. A survey [18] on image and video retrieval methods covers paradigms such as semantic video retrieval, interactive retrieval, relevance feedback strategies, and intelligent summary creation. Another comprehensive and systematic study in [19], which evaluates multimodal models using visual and audio information for video classification, reveals the importance of global and local features and the role of various fusion methods such as ensembles, context fusion, and joint boosting. We will hereafter focus on location recognition and classification.

To deal with location recognition from image content only, we can consider two families of approaches: (a) from a retrieval point of view, where a place is recognized as similar to an existing labeled reference from which we can infer the estimated location; (b) from a classification point of view, where the place corresponds to a class that can be discriminated from other classes.

2.2.2. Image Retrieval for Place Recognition. Image retrieval systems work on the principle that the visual content presented in a query image is visually similar or related to a portion of the images to be retrieved from the database.

Pairwise image matching is a relatively simple and attractive approach. It implies comparing the query image to all annotated images in the database. Top-ranked images are selected as candidates, and after some optional validation procedures, the retrieved images are presented as a result to the query. In [20], SIFT feature matching, followed by voting and further improvement with spatial information, is

performed to localize indoors among 18 locations, each of them represented by 4 views. The voting scheme determines the locations whose keypoints were most frequently classified as the nearest neighbors. Additionally, spatial information is modeled using an HMM, bringing in neighbor location relationships. A study [21] on place recognition in lifelog images found that the best matching technique is bi-directional matching, which nevertheless adds computational complexity. This problem is resolved by using robust and rapid-to-extract SURF features, which are then hierarchically clustered using the k-means algorithm into a vocabulary tree. The vocabulary tree allows the rapid comparison of query image descriptors to those of the database, where the tree leaf node descriptors vote for the database image.

The success of matching for place recognition depends greatly on the database, which should contain a large amount of annotated images. In many applications, this is a rather strong assumption about the environment. In the absence of prior knowledge brought by a completely annotated image database covering the environment, topological place recognition discretizes the otherwise continuous space. A typical approach following this idea is presented in [22]. The authors propose gradient orientation histograms of the edge map as image features, with the property that visually similar scenes are described by similar histograms. The Learning Vector Quantization method is then used to retain only the most characteristic descriptors for each topological location. An unsupervised approach for indoor robot place recognition is adapted in [23]. The method partitions the space into convex subspaces representing room concepts by using an approximated graph-cut algorithm, with the possibility for the user to inject "can group" and "cannot group" constraints. The adopted similarity measure relies on the 8-point algorithm, constrained to planar camera motion and followed by robust RANSAC to remove false matches. Besides the high computational cost, the results show good clustering capabilities if the graph nodes representing individual locations are well selected and the graph is properly built. The authors recognize that at larger scale, and with more similarly looking locations, more falsely matching images may appear.

2.2.3. Image Classification for Place Recognition. Training a visual appearance model and using it to classify unseen images constitutes another family of approaches. Image information is usually encoded using global or local patch features.

In [24], an image is modeled as a collection of patches, each of which is assigned a codeword using a prebuilt codebook, yielding a bag of codewords. The generic Bag of Words [25] approach has been quite successful as a global feature. One of its main advantages is the ability to represent possibly very complex visual contents and to address the scene clutter problem. It is flexible enough to accommodate both discrete features [25] and dense features [26], while leaving the possibility to include weak spatial information by spatial binning, as in [27]. The authors in [6] argue that indoor scene recognition requires location-specific global features and propose a system recognizing locations by the objects that are present in them. An interesting result suggests that the


final recognition performance can be boosted even further as more object information is used in each image. A context-based system for place and object recognition is presented in [5]. The main idea is to use the context (scene gist) as a prior to infer what objects can be present in a scene. The HMM-based place recognition system requires a considerable amount of training data, possible transition probabilities, and so forth, but it naturally integrates temporal information and a confidence measure to detect navigation in unknown locations. Probabilistic Latent Semantic Analysis (pLSA) was used in [28] to discover higher-level topics (e.g., grass, forest, water) from low-level visual features and to build a novel low-dimensional representation used afterwards in a k-Nearest Neighbor classifier. The study shows superior classification performance when passing from low-level visual features to high-level topics that could be loosely attributed to the context of the scene.

2.2.4. Place Recognition in Video. Place recognition from recorded videos brings both novel opportunities and information but also poses additional challenges and constraints. Much more image data can be extracted from video, while in practice only a small portion of it can be labeled manually. An additional piece of information that is often leveraged in the literature is the temporal continuity of the video stream.

A matching-based approach has been used in [29] to retrieve objects in video. Results show that simple matching produces a large number of false positive matches, but the usage of a stop list to remove the most frequent and most specific visual words, followed by a spatial consistency check, significantly improves the quality of the retrieval results. In [30], belief functions in the Bayesian filtering context are used to determine the confidence of a particular location at any time moment. The modeling involves sensor and motion models, which have to be trained offline with a sufficiently large annotated database. Indeed, the model has to learn the allowed transitions between places, which requires the annotated data to represent all possible transitions to be found in the test data.

An important group of methods performing simultaneous localization and mapping (SLAM) is widely used in robotics [31, 32]. The main idea in these methods is to simultaneously build and update a map of an unknown environment and track in real time the current position of the camera. In our work, the construction of such a map is not necessary and may prove to be very challenging, since the environment can be very complex and constantly changing.

2.3. Multiple Feature Learning

2.3.1. Motivation. Different visual features capture different aspects of a scene, and the correct choice depends on the task to solve [33]. Indeed, even humans perform poorly when using only one source of perception [34]. Therefore, instead of designing a specific and adapted descriptor for each particular case, several visual descriptors can be combined in a more complex system, yielding increased discrimination power in a wider range of applications. Following the survey [35], two main approaches can be

identified for the fusion of multiple features, depending on whether the fusion is done in the feature space (early fusion) or in the decision space (late fusion).

2.3.2. Early Fusion. Early fusion strategies focus on the combination of input features before using them in a classifier. In the case of kernel classifiers, the features can be seen as defining a new kernel that takes into account several features at once. This can be done by concatenating the features into a new, larger feature vector. A more general approach, Multiple Kernel Learning (MKL), also tries to estimate the optimal parameters for the kernel combination in addition to the classifier model. In our work, we evaluated the SimpleMKL [36] algorithm as a representative algorithm of the MKL family. The algorithm is based on gradient descent and learns a weighted linear combination of kernels. This approach has notably been applied in the context of object detection and classification [37-39] and image classification [40, 41].
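
To make the early fusion idea concrete, the sketch below (our own illustration, not the SimpleMKL algorithm itself) shows the two simplest kernel-level combinations discussed above: feature concatenation, which with a linear kernel amounts to summing the per-cue kernels, and a fixed weighted combination of precomputed Gram matrices, where MKL would instead learn the weights; all names and the toy data are hypothetical.

    import numpy as np

    def linear_kernel(X, Y):
        # Gram matrix of linear kernels <x, y> between the rows of X and Y
        return X @ Y.T

    def early_fusion_concat(feature_list):
        # Early fusion by concatenation: one long feature vector per image
        return np.hstack(feature_list)

    def combine_kernels(kernel_list, betas):
        # Fixed weighted kernel combination; MKL (e.g., SimpleMKL) would learn the betas
        K = np.zeros_like(kernel_list[0])
        for K_t, beta_t in zip(kernel_list, betas):
            K = K + beta_t * K_t
        return K

    # Toy usage: two hypothetical cues describing the same 4 images
    X1, X2 = np.random.rand(4, 10), np.random.rand(4, 20)
    K = combine_kernels([linear_kernel(X1, X1), linear_kernel(X2, X2)], [0.5, 0.5])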

2.3.3. Late Fusion. In the late fusion strategy, several base classifiers are trained independently and their outputs are fed to a special decision layer. This fusion strategy is commonly referred to as a stacking method and is discussed in depth in the multiple classifier systems literature [42-46]. This type of fusion allows the use of multiple visual features, leaving their exploitation to an algorithm that performs automatic feature selection or weighting according to the utility of each feature.

Clearly, nothing prevents using an SVM as the base classifier. Following the work of [47], it was shown that SVM outputs in the form of decision values can be combined linearly using the Discriminative Accumulation Scheme (DAS) [48] for confidence-based place recognition indoors. Subsequent work evolved by relaxing the constraint of linearity of the combination, using a kernel function on the outputs of the individual single-feature classifiers, giving rise to the Generalized DAS [49]. Results show a clear performance gain when using different visual features or completely different modalities. Other works follow a similar reasoning but use different combination rules (max, product, etc.), as discussed in [50]. A comprehensive comparison of different fusion methods in the context of object classification is given in [51].

2.4. Learning with Unlabeled Data

2.4.1. Motivation. Standard supervised learning with single or multiple features is successful if enough labeled training samples are presented to the learning algorithm. In many practical applications, the amount of training data is limited, while a wealth of unlabeled data is often available and largely unused. It is well known that classifiers learned using only training data may suffer from overfitting or an incapability to generalize to the unlabeled data. In contrast, unsupervised methods do not use label information. They may detect a structure in the data; however, prior knowledge and correct assumptions about the data are necessary to be able to characterize a structure that is relevant for the task.

Semisupervised learning addresses this issue by leveraging labeled as well as unlabeled data [52, 53].


2.4.2. Graph-Based Learning. Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and an unlabeled set $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \{-1, +1\}$, the goal is to estimate class labels for the latter. The usual hypothesis is that the two sets are sampled i.i.d. according to the same joint distribution $p(\mathbf{x}, y)$. There is no intention to provide estimations on data outside the sets $L$ and $U$. A deeper discussion of this issue can be found in [54] and in the references therein.

In graph-based learning, a graph composed of labeled or unlabeled nodes (in our case representing the images), interconnected by edges encoding the similarities, is built. Application-specific knowledge is used to construct the graph in such a way that the labels of nodes connected by a high-weight link are similar, and that no, or only a few, weak links are present between nodes of different classes. This graph therefore encodes information on the smoothness of a learned function $f$ on the graph, which corresponds to a measure of compatibility with the graph connectivity. The graph Laplacian [55, 56] can then be used directly as connectivity information to propagate information from labeled nodes to unlabeled nodes [57], or as a regularization term that penalizes nonsmooth labelings within a classifier, such as the Lap-SVM [58, 59].

From a practical point of view, the algorithm requires the construction of the full affinity matrix $W$, where all image pairs in the sequences are compared, and the computation of the associated Laplacian matrix $L$, which requires $O(n^2)$ memory. While theoretically attractive, the direct method scales poorly with the number of graph nodes, which seriously restricts its usage in a wide range of practical applications.
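
To make the memory argument concrete, here is a minimal sketch (our own illustration, not the Lap-SVM of [58, 59]) that builds a dense RBF affinity matrix W and the unnormalized graph Laplacian L = D - W; both matrices are n x n, which is exactly the O(n^2) storage cost mentioned above. The bandwidth sigma is a hypothetical parameter.

    import numpy as np

    def graph_laplacian(X, sigma=1.0):
        # Dense pairwise squared distances -> RBF affinities W (n x n)
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        W = np.exp(-d2 / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)           # no self-loops
        D = np.diag(W.sum(axis=1))         # degree matrix
        return D - W                       # unnormalized Laplacian, O(n^2) memory

    L = graph_laplacian(np.random.rand(100, 16))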

2.4.3. Co-Training from Multiple Features. Co-Training [60] is a wrapper algorithm that learns two discriminative classifiers in a joint manner. The method trains the two classifiers iteratively such that, in each iteration, the highest confidence estimates on unlabeled data are fed into the training set of the other classifier. Classically, two views on the data, or two single-feature splits of a dataset, are used. The main idea is that the solution, or hypothesis, space is significantly reduced if both trained classifiers must agree on the data, which reduces the risk of overfitting since each classifier also fits the initial labeled training set. More theoretical background and analysis of the method is given in Section 3.3.2.

The Co-Training algorithm was proposed in [60] as a solution to classify Web pages using both link and word information. The same method was applied to the problem of Web image annotation in [61, 62] and to automatic video annotation in [63]. The generalization capacity of Co-Training for different initial labeled training sets was studied in [64]. More analysis of the theoretical properties of the Co-Training method can be found in [65], such as rough estimates of the maximal number of iterations. A review of different variants of the Co-Training algorithm is given in [66], together with their comparative analysis.

2.4.4. Link between Graph and Co-Training Approaches. It is interesting to note the link [67, 68] between the Co-Training

method and label propagation in a graph, since adding the most confident estimations in each Co-Training iteration can be seen as label propagation from labeled nodes to unlabeled nodes in a graph. This view of the method is further discussed and practically evaluated in [69] as a label propagation method on a combined graph built from two individual views.

Graph-based methods are limited by the fact that graph edges encode low-level similarities that are computed directly from the input features. The Co-Training algorithm uses a discriminative model that can adapt to the data with each iteration and therefore achieve better generalization on unseen unlabeled data. In the next section, we build a framework based on the Co-Training algorithm to propose our solution for image-based place recognition.

In this work, we attempt to leverage all available information from image data that could help to provide cues for camera place recognition. Manual annotation of recorded video sequences requires a lot of human labor. The aim of this work is to evaluate the utility of unlabeled data within the Co-Training framework for image-based place recognition.

3. Proposed Approach

In this section, we present the architecture of the proposed method, which is based on the Co-Training algorithm, and then discuss each component of the system. The standard Co-Training algorithm (see Figure 2) allows benefiting from the information in the unlabeled part of the corpus by using a feedback loop to augment the training set, thus producing classifiers with improved performance. In the standard algorithm formulation, the two classifiers remain separate, which does not leverage their complementarity to its maximum. The proposed method addresses this issue by providing a single output using late classifier fusion and time filtering for temporal constraint enforcement.

We present the different elements of the system in order of increasing abstraction. Single feature extraction, preparation, and classification using SVM are presented in Section 3.1. Multiple feature late fusion and a proposed extension to take into account the time information are introduced in Section 3.2. The complete algorithm combining those elements with the Co-Training algorithm is developed in Section 3.3.

3.1. Single Feature Recognition Module. Each image is represented by a global signature vector. In the following sections, the visual features $\mathbf{x}_i^{(j)} \in \mathcal{X}^{(j)}$ correspond to numerical representations of the visual content of the images, where the superscript $(j)$ denotes the type of visual features.

3.1.1. SVM Classifier. In our work, we rely on Support Vector Machine (SVM) classifiers to carry out decision operations. The SVM aims at finding the best class separation instead of modeling potentially complex within-class probability densities, as is done in generative models such as Naive Bayes [70].

Figure 2: Workflow of the Co-Training algorithm.

The maximal margin separating hyperplane is motivated from the statistical learning theory viewpoint by linking the margin width to the classifier's generalization capability.

Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x} \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, a linear maximal margin classifier $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ can be found by solving

$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \sum_{i=1}^{l} \xi_i + \lambda \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l$ (1)

for the hyperplane $\mathbf{w} \in \mathbb{R}^d$ and its offset $b \in \mathbb{R}$. In the regularization framework, the loss function, called the Hinge loss, is

$\ell(\mathbf{x}, y, f(\mathbf{x})) = \max(1 - y_i f(\mathbf{x}_i), 0), \quad i = 1, \ldots, l$ (2)

and the regularizer is

$\Omega_{\text{SVM}}(f) = \|\mathbf{w}\|^2.$ (3)

As will be seen in the following discussion, the regularizer plays an important role in the design of learning methods. In the case of an SVM classifier, the regularizer in (3) reflects the objective to be maximized: maximum margin separation on the training data.

3.1.2. Processing of Nonlinear Kernels. The power of the SVM classifier owes much to its easy extension to the nonlinear case [71]. The highly nonlinear nature of the data can be taken into account seamlessly by using the kernel trick, such that the hyperplane is found in a feature space induced by an adapted kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$ in a Reproducing Kernel Hilbert Space (RKHS). The implicit mapping $\mathbf{x} \mapsto \Phi(\mathbf{x})$ means that we can no longer find an explicit hyperplane $(\mathbf{w}, b)$, since the mapping function is not known and may be of very large dimensionality. Fortunately, the decision function can be formulated in the so-called dual representation [71], and the solution minimizing the regularized risk, according to the Representer theorem, is

$f_k(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b, \quad k = 1, \ldots, c$ (4)

where $l$ is the number of labeled samples.

Bag of Words descriptors have been used intensively for efficient and discriminant image description. The linear kernel does not provide the best results with such representations, which have been more successful with kernels such as the Hellinger kernel, the $\chi^2$-kernel, or the intersection kernel [6, 27, 33]. Unfortunately, training with such kernels using the standard SVM tools is much less computationally efficient than using the linear inner product kernel, for which efficient SVM implementations exist [72]. In this work, we have therefore chosen to adapt the input features to the linear context using two different techniques. For the BOVW (Bag of Visual Words) [25] and SPH (Spatial Pyramid Histogram) [27] features, a Hellinger kernel was used. This kernel admits an explicit mapping function using a square root transformation $\phi([x_1 \cdots x_d]^T) = [\sqrt{x_1} \cdots \sqrt{x_d}]^T$. In this particular case, a linear embedding $\mathbf{x}' = \phi(\mathbf{x})$ can be computed explicitly and has the same dimensionality as the input feature. For the CRFH (Composed Receptive Field Histogram) features [26], the feature vector has a very large number of dimensions but is also extremely sparse, with between 500 and 4000 nonzero coefficients out of many millions of features in total. These features were transformed into a linear embedding using Kernel Principal Component Analysis [73], reducing them to a 500-dimensional linear embedding vector.


In the following, we therefore consider that all features are processed into a linear embedding $\mathbf{x}_i$ that is suitable for an efficient linear SVM. The utility of this processing becomes evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations for the two visual feature classifiers. Other forms of efficient embedding, proposed in [38], could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting the linearized descriptors directly by $\mathbf{x}_i$, without further indication, in the rest of this document.
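
As an illustration, here is a minimal sketch of the explicit Hellinger embedding used for the BOVW and SPH histograms: each histogram is L1-normalized and mapped element-wise by the square root, after which the ordinary inner product between embedded vectors equals the Hellinger kernel and a fast linear SVM can be used. The function name and the epsilon guard are our own choices.

    import numpy as np

    def hellinger_embedding(H, eps=1e-12):
        # H: (n_images, n_bins) matrix of raw histogram counts
        H = H / (H.sum(axis=1, keepdims=True) + eps)   # L1 normalization
        return np.sqrt(H)                              # sqrt map: <phi(h), phi(g)> equals the Hellinger kernel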

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to $c > 2$ classes is considered in a one-versus-all setup. Therefore, $c$ independent classifiers are trained on the labeled data, each of which learns the separation between one class and the other classes. We denote by $f_k$ the decision function associated to class $k \in \{1, \ldots, c\}$. The outcome of the classifier bank for a sample $\mathbf{x}$ can be represented as a score vector $\mathbf{s}(\mathbf{x})$ by concatenating the individual decision scores:

$\mathbf{s}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_c(\mathbf{x})).$ (5)

In that case, the estimated class of a testing sample $\mathbf{x}_i$ is obtained from the largest positive score:

$\hat{y}_i = \arg\max_{k=1,\ldots,c} f_k(\mathbf{x}_i).$ (6)
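
A small sketch of the one-versus-all decision rule of (5)-(6), using scikit-learn's LinearSVC (which implements the one-versus-all scheme internally) as a stand-in for the bank of linear classifiers trained on the embedded features; this only illustrates the decision rule, not the exact training setup of the paper.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_ovr(X, y, C=1.0):
        # One linear SVM per class (one-versus-all), trained on the linear embeddings
        return LinearSVC(C=C).fit(X, y)

    def predict_max_score(clf, X):
        S = clf.decision_function(X)                # score vectors s(x) of eq. (5), one column per class
        return clf.classes_[np.argmax(S, axis=1)]   # eq. (6): class with the largest score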

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work, we follow a late classifier fusion paradigm, with several classifiers being trained independently on different visual cues and their outputs fused into a single final decision. We motivate this choice, compared to the early fusion paradigm, by the fact that it allows easier integration, at the decision level, of the augmented classifiers obtained by the Co-Training algorithm, as well as providing a natural extension to inject the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and the unlabeled set of patterns by $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and the outcome of classification is a binary output $y \in \{-1, +1\}$.

The visual data may have $p$ multiple cues describing the same image $I_i$. Suppose that $p$ cues have been extracted from an image $I_i$:

$\mathbf{x}_i \rightarrow (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \ldots, \mathbf{x}_i^{(p)})$ (7)

where each cue $\mathbf{x}_i^{(j)}$ belongs to an associated descriptor space $\mathcal{X}^{(j)}$.

Denote also the $p$ decision functions $f^{(1)}, f^{(2)}, \ldots, f^{(p)}$, where $f^{(j)} \in \mathcal{F}^{(j)}$ is trained on the respective visual cue and provides an estimation $\hat{y}_k^{(j)}$ on the pattern $\mathbf{x}_k^{(j)}$. Then, for a visual cue $t$ and $c$-class classification in a one-versus-all setup, a score vector can be constructed:

$\mathbf{s}^t = (f_1^t(\mathbf{x}), \ldots, f_c^t(\mathbf{x})).$ (8)

In our work, we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVMDAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to linearly combine the scores returned by the same-class decision functions across the multiple visual cues $t = 1, \ldots, p$. The combined decision function for a class $j$ is then a linear combination

$f_j^{\text{DAS}}(\mathbf{x}) = \sum_{t=1}^{p} \beta_t f_j^t(\mathbf{x})$ (9)

where the weight $\beta_t$ is attributed to each cue according to its importance in the learning phase. The new scores can then be used in the decision process, for example, using the max-score criterion.

The DAS scheme is an example of a parallel classifier combination architecture [44] and implies a competition between the individual classifiers. The weights $\beta_t$ can be found using a cross-validation procedure with the normalization constraint

$\sum_{t=1}^{p} \beta_t = 1.$ (10)
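
A minimal sketch of the DAS fusion of (9)-(10) for p per-cue score matrices; the weights are assumed to have been chosen beforehand (e.g., by cross-validation) and to sum to one.

    import numpy as np

    def das_fusion(score_list, betas):
        # score_list: list of (n_samples, c) score matrices, one per visual cue
        # betas: cue weights satisfying sum(betas) == 1, eq. (10)
        fused = sum(b * S for b, S in zip(betas, score_list))   # eq. (9)
        return fused, np.argmax(fused, axis=1)                   # fused scores and max-score decision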

3.2.3. SVM Discriminant Accumulation Scheme. The SVMDAS can be seen as a generalization of the DAS, building a stacked architecture of multiple classifiers [44] where the individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue $t$ and produces a score vector as in (8). Then, the single-feature score vectors $\mathbf{s}_i^t$ corresponding to one particular pattern $\mathbf{x}_i$ are concatenated into a new multifeature score vector $\mathbf{z}_i = [\mathbf{s}_i^1, \ldots, \mathbf{s}_i^p]$. A final top-level classifier can be trained on those new features:

$f_j^{\text{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j.$ (11)

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between the base classifier outputs. If a linear kernel function is used,

$k_{\text{SVMDAS}}(\mathbf{z}_i, \mathbf{z}_j) = \langle \mathbf{z}_i, \mathbf{z}_j \rangle = \sum_{t=1}^{p} \langle \mathbf{s}_i^t, \mathbf{s}_j^t \rangle$ (12)

then the decision function in (11) can be rewritten by exchanging the sums:

$f_j^{\text{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j = \sum_{t=1}^{p} \sum_{i=1}^{l} \alpha_{ij} y_i \langle \mathbf{s}^t, \mathbf{s}_i^t \rangle + b_j.$ (13)


Denoting $\mathbf{w}_j^t = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^t$, we can rewrite the decision function using the input patterns and the learned weights:

$f_j^{\text{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p} \sum_{k=1}^{c} w_{jk}^{t} f_k^{t}(\mathbf{x}) + b_j.$ (14)

This new representation reveals that using a linear kernel in the SVMDAS framework yields a classifier in which a weight is learned for every possible linear combination of base classifier outputs. The DAS can be seen as a special case in this context, but with significantly fewer parameters. Using a kernel such as the RBF or polynomial kernels can result in an even richer class of classifiers.

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well and its parameters tuned.
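
A sketch of the SVMDAS stacking idea under the linear kernel of (12): the base score vectors are concatenated into z_i and a final classifier is trained on them. scikit-learn's LinearSVC is used here as a generic stand-in for the top-level SVM; the paper does not prescribe this particular implementation.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_svmdas(score_list, y, C=1.0):
        # score_list: list of (n_samples, c) score matrices from the base classifiers
        Z = np.hstack(score_list)            # z_i = [s_i^1, ..., s_i^p]
        return LinearSVC(C=C).fit(Z, y)      # final top-level classifier

    def predict_svmdas(top_clf, score_list):
        return top_clf.predict(np.hstack(score_list))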

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature, such that the visual content does not usually change much over a short period of time. In the case of topological place recognition indoors, this constraint may be useful, as place changes are encountered relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes are discouraged within a relatively short period of time. This lowers the proliferation of occasional, temporally localized misclassifications.

Let $s_i^t = f^{(t)}(\mathbf{x}_i)$ be the scores of a binary classifier for visual cue $t$, and $h$ a temporal window of size $2\tau + 1$. Then temporal accumulation can be written as

$s_{i,\text{TA}}^{t} = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^{t}$ (15)

and can be easily generalized to multiple feature classification by applying it separately to the output of the classifiers associated to each feature $\mathbf{s}^t$, where $t = 1, \ldots, p$ is the visual feature type. We use an averaging filter of size $\tau$ defined as

$h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \ldots, \tau.$ (16)

Therefore, the input of the TA module consists of the SVM scores obtained after classification, and the output consists of the processed SVM scores with the temporal constraint enforced.
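
A sketch of the temporal accumulation of (15)-(16): every class-score column is smoothed in time with a centered moving average of width 2*tau + 1. The reflection padding at the sequence borders is our own choice; the paper does not specify the border handling.

    import numpy as np

    def temporal_accumulation(scores, tau):
        # scores: (n_frames, c) SVM scores ordered in time
        h = np.ones(2 * tau + 1) / (2 * tau + 1)              # averaging filter, eq. (16)
        padded = np.pad(scores, ((tau, tau), (0, 0)), mode="reflect")
        out = np.empty_like(scores, dtype=float)
        for k in range(scores.shape[1]):
            out[:, k] = np.convolve(padded[:, k], h, mode="valid")  # eq. (15)
        return out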

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section, we explain how to additionally learn from both labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training [60] is an algorithm that iteratively trains two classifiers on two-view data $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ by feeding the highest confidence score $z_i$ estimates from the testing set into the other view's classifier. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest confidence estimates. The pseudocode is presented in Algorithm 1; it could also be extended to multiple views, as in [53].

The power of the method lies in its capability of learning from small training sets, eventually growing its discriminative properties on the large unlabeled data set as more confident estimations are added into the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;
(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;
(3) the confident estimations on unlabeled data are helpful to predict the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or $N$ iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small difference from iteration $t-1$ to $t$. The parameter-less version of Co-Training runs until the complete exhaustion of the pool of unlabeled samples but requires a threshold on the confidence measure, which is used to separate high and low confidence estimates. In our work, we use this variant of the Co-Training algorithm.

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that, after a sufficient number of rounds, both classifiers will agree on most of the unlabeled patterns. It remains unclear why, and through which mechanisms, such learning is useful. It can be justified from the learning theory point of view: there are fewer possible solutions, or classifiers, in the hypothesis space that agree on the unlabeled data in the two views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier is, in addition, restricted by the other classifier. The two trained classifiers that are coupled in this system effectively reduce the possible solution space. Each of these two classifiers is less likely to overfit, since each of them has been initially trained on its own training set while taking into account the training process of the other classifier, which is carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework.


INPUT:
Training set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$
Testing set $U = \{\mathbf{x}_i\}_{i=1}^{u}$
OUTPUT:
$\hat{y}_i$ -- class estimations for the testing set $U$; $f^{(1)}, f^{(2)}$ -- trained classifiers
PROCEDURE:
(1) Compute visual features $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ for every image $I_i$ in the dataset.
(2) Initialize $L_1 = \{(\mathbf{x}_i^{(1)}, y_i)\}_{i=1}^{l}$ and $L_2 = \{(\mathbf{x}_i^{(2)}, y_i)\}_{i=1}^{l}$.
(3) Initialize $U_1 = \{\mathbf{x}_i^{(1)}\}_{i=1}^{u}$ and $U_2 = \{\mathbf{x}_i^{(2)}\}_{i=1}^{u}$.
(4) Create two work sets $\hat{U}_1 = U_1$ and $\hat{U}_2 = U_2$.
(5) Repeat until the sets $\hat{U}_1$ and $\hat{U}_2$ are empty (CO):
(a) Train classifiers $f^{(1)}, f^{(2)}$ using the sets $L_1, L_2$, respectively.
(b) Classify the patterns in the sets $\hat{U}_1$ and $\hat{U}_2$ using the classifiers $f^{(1)}$ and $f^{(2)}$, respectively:
(i) compute scores $s^{(1)}_{\text{test}}$ and confidences $z^{(1)}$ on the set $\hat{U}_1$;
(ii) compute scores $s^{(2)}_{\text{test}}$ and confidences $z^{(2)}$ on the set $\hat{U}_2$.
(c) Select the $k$ top confidence estimations $\hat{L}_1 \subset \hat{U}_1$, $\hat{L}_2 \subset \hat{U}_2$ and add them to the training set of the other view's classifier:
(i) $L_1 = L_1 \cup \hat{L}_2$;
(ii) $L_2 = L_2 \cup \hat{L}_1$.
(d) Remove the $k$ top confidence patterns from the working sets:
(i) $\hat{U}_1 = \hat{U}_1 \setminus \hat{L}_1$;
(ii) $\hat{U}_2 = \hat{U}_2 \setminus \hat{L}_2$.
(e) Go to step (5).
(6) Optionally, perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
(a) compute the fused scores $\mathbf{s}^{\text{DAS}}_{\text{test}} = (1 - \beta)\, \mathbf{s}^{(1)}_{\text{test}} + \beta\, \mathbf{s}^{(2)}_{\text{test}}$;
(b) output the class estimations $\hat{y}_i$ from the fused scores $\mathbf{s}^{\text{DAS}}_{\text{test}}$.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.
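
For illustration, here is a compact sketch of the CO-DAS loop of Algorithm 1, assuming numpy arrays for the two views, scikit-learn-style classifiers exposing fit/decision_function, class labels encoded as 0..c-1, and a confidence() function implementing one of the measures of Section 3.3.4; the cross-feeding of confident estimates follows the textual description of Co-Training, and all names are illustrative rather than the exact implementation used in the paper.

    import numpy as np

    def co_das(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, confidence, k=50, beta=0.5):
        L1_X, L1_y = list(X1_l), list(y_l)      # per-view training sets (steps 2-3)
        L2_X, L2_y = list(X2_l), list(y_l)
        pool = list(range(len(X1_u)))           # working set of unlabeled indices (step 4)
        while pool:                             # step 5
            clf1.fit(np.array(L1_X), np.array(L1_y))
            clf2.fit(np.array(L2_X), np.array(L2_y))
            s1 = clf1.decision_function(X1_u[pool])
            s2 = clf2.decision_function(X2_u[pool])
            top1 = np.argsort(confidence(s1))[-k:]   # most confident estimates of view 1
            top2 = np.argsort(confidence(s2))[-k:]   # most confident estimates of view 2
            for j in top1:                           # feed the other view's training set (step 5c)
                L2_X.append(X2_u[pool[j]]); L2_y.append(int(np.argmax(s1[j])))
            for j in top2:
                L1_X.append(X1_u[pool[j]]); L1_y.append(int(np.argmax(s2[j])))
            for j in sorted(set(top1) | set(top2), reverse=True):
                pool.pop(j)                          # remove the used patterns (step 5d)
        # step 7: DAS fusion of the final classifiers' scores on the whole unlabeled set
        s_das = (1 - beta) * clf1.decision_function(X1_u) + beta * clf2.decision_function(X2_u)
        return np.argmax(s_das, axis=1)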

Let us introduce the Hinge loss function $\ell(\mathbf{x}, y, f(\mathbf{x}))$ commonly used in classification. Let us also introduce the empirical risk of a candidate function $f \in \mathcal{F}$:

$\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i))$ (17)

which measures how well the classifier fits the training data. It is well known that, when only the training error is minimized, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

$f^{\text{RRM}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda \Omega(f)$ (18)

where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value, or penalty, for very complicated functions (typically the functions that fit the data perfectly). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and a better generalization capability on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer $\Omega_{\text{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^2$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning, we can select a regularizer such that the learned function is sufficiently smooth on the unlabeled data as well. Keeping all the previous discussion in mind, a function that fits the training data and respects the unlabeled data will indeed probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as

$(\hat{f}^{(1)}, \hat{f}^{(2)}) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)) + \lambda_1 \Omega_{\text{SVM}}(f^{(t)}) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i))$ (19)

where $\lambda_2 > 0$ controls the balance between the fit on the training data and the agreement on the test data.

Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b).

The first part of (19) states that each individual classifier should fit the given training data but should not overfit, which is prevented by the SVM regularizer $\Omega_{\text{SVM}}(f)$. The second part is a regularizer $\Omega_{\text{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, the algorithms following this spirit implement multiple view learning. Refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work, we use the DAS method for late fusion, while it is also possible to use the more general SVMDAS method. The experimental evaluation will show that very competitive performance can be obtained with the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. The experimental evaluation will reveal the relative performance of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS (or CO-TA-DAS, if temporal accumulation is enabled) method is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold $k$ for the top confidence sample selection, the temporal accumulation window width $\tau$, and the late fusion parameter $\beta$. We express the threshold $k$ as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion on the selection of the parameter $\beta$ is given in Section 4.1.2.

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with $c$ classes, the scores $\{f_k(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as

$P(y = k \mid \mathbf{x}) = \frac{\exp(f_k(\mathbf{x}))}{\sum_{i=1}^{c} \exp(f_i(\mathbf{x}))}$ (20)

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows interpreting the classifier output as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Moreover, forcing the scores to be normalized to sum to one may not preserve all of their dynamics (e.g., very small or very large score values).
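
A one-line sketch of the logistic (softmax) confidence of (20), computed in a numerically stable way by shifting the scores; the score matrix layout follows (5).

    import numpy as np

    def logistic_confidence(scores):
        # scores: (n_samples, c) one-versus-all SVM scores; returns P(y = k | x), eq. (20)
        e = np.exp(scores - scores.max(axis=1, keepdims=True))  # shift for numerical stability
        return e / e.sum(axis=1, keepdims=True)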

Modeling Posterior Class Probabilities (Ruping). In [76], a parameter-less method was proposed which assigns the score value

$z = \begin{cases} p_+ & f(\mathbf{x}) > 1 \\ \frac{1 + f(\mathbf{x})}{2} & -1 \le f(\mathbf{x}) \le 1 \\ p_- & f(\mathbf{x}) < -1 \end{cases}$ (21)

where $p_+$ and $p_-$ are the fractions of positive and negative score values, respectively. The authors argue that the interesting dynamics relevant to confidence estimation happen in the region of the margin, and that the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem,


but it does not cover the multiclass case required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum score estimation should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:

$z = f_{k^*}(\mathbf{x}) - \max_{k=1,\ldots,c,\; k \ne k^*} f_k(\mathbf{x}).$ (22)

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, there is an issue with the measure if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account in any of the above confidence measure computation procedures. The one-versus-all setup for multiclass classification may yield ambiguous decisions. For instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to several degrees and also treats the two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (a larger positive value of a score). Cases with all positive or all negative scores may be considered as degenerate: $z_i \leftarrow 0$. The computation is divided into two steps. First, we compute the standard Tommasi confidence measure

$z_i^0 = f_{j^*}(\mathbf{x}_i) - \max_{k=1,\ldots,c,\; k \ne j^*} f_k(\mathbf{x}_i);$ (23)

then the measure $z_i^0$ is modified to account for class overlap:

$z_i = z_i^0 \max\left(0, 1 - \frac{p_i - 1}{C}\right)$ (24)

where $p_i = \mathrm{Card}(\{k = 1, \ldots, c \mid f_k(\mathbf{x}_i) > 0\})$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In the case of $\forall k: f_k(\mathbf{x}_i) > 0$ or $\forall k: f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, and more severely if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows excluding them instead of assigning them doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, the confidence is lowered if more than one binary classifier returns a positive score.
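
A sketch of the proposed class-overlap-aware confidence of (23)-(24). We take the normalization constant C to be the number of classes c, which is our reading of (24); the degenerate all-positive and all-negative cases are set to zero as described.

    import numpy as np

    def overlap_aware_confidence(scores):
        # scores: (n_samples, c) one-versus-all SVM scores
        n, c = scores.shape
        top_two = np.sort(scores, axis=1)[:, -2:]
        z0 = top_two[:, 1] - top_two[:, 0]               # Tommasi contrast, eq. (23)
        p = (scores > 0).sum(axis=1)                     # number of positive scores (class overlap)
        z = z0 * np.maximum(0.0, 1.0 - (p - 1) / c)      # overlap penalty, eq. (24), assuming C = c
        z[(p == 0) | (p == c)] = 0.0                     # degenerate cases: all negative / all positive
        return z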

4. Experimental Evaluation

In this section, we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (a) on the public database IDOL2 in Section 4.1 and (b) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter database is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) training images randomly sampled across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insights into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database, we extensively assess the place recognition performance for each independent part of the proposed system. For instance, we validate the utility of multiple features and the effects of temporal smoothing, unlabeled data, and different confidence measures.

The IMMED database is used for validation purposes; on it, we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess the place recognition systems of mobile robots in an indoor environment.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section, we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work, we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search in the tree. This allows introducing visual words ranging from more general (higher-level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
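
As an aside, here is a minimal sketch of a standard tf-idf reweighting of the BOVW count histograms; this is the common text-retrieval formulation, and the exact variant used in [25] may differ in its details.

    import numpy as np

    def tfidf(H, eps=1e-12):
        # H: (n_images, n_words) raw visual-word counts
        tf = H / (H.sum(axis=1, keepdims=True) + eps)     # term frequency per image
        df = (H > 0).sum(axis=0)                          # document frequency per visual word
        idf = np.log(H.shape[0] / (df + 1.0))             # inverse document frequency
        return tf * idf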

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness with respect to the spatial structure of the image.


Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

This is done by constructing a pyramid where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram with local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3, with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after some filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses returned from each specific filter. Due to the multidimensional nature and the size of an image, such a descriptor often results in a very high dimensionality vector. In our experimental evaluations, we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The total size of the global descriptor resulted in very sparse vectors of up to 400 million dimensions. It was reduced to a 500-dimensional linear descriptor vector using KPCA with a $\chi^2$ kernel [73].

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable to evaluate the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 x 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, which is defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels, ranging from 1% to 50%. The remaining images are split randomly into two halves and used respectively for validation and testing purposes. In order to account for the effects of random sampling, 10-fold sampling is made at each supervision level, and the final result is returned as the average accuracy measure.

It is expected that the global place recognition performance rises from a mediocre level at low supervision to its maximum at the highest supervision level.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.


Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter $\alpha$ that balances the fusion between SPH features (3 levels) if $\alpha = 0$ and CRFH if $\alpha = 1$ (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating all possible pairs amounts to $132 = 12 \times 11$ pairs of video sequences. We differentiate three sets of pairs: the "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the light conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all 132 video pairs, to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of the performances of the baseline single feature methods compared to the early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum as output, on which the class decision can be made. In Figure 5, the performance of DAS using the SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter $\alpha$ at different supervision levels. Interesting dynamics can be noticed for intermediary fusion values, which suggest feature complementarity.

Figure 6: Comparison of single (BOVW, CRFH, and SPH) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The fusion parameter $\alpha$ can safely be set to an intermediary value such as 0.5, and the final performance then exceeds that of every single feature classifier alone at all supervision levels.

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single feature and multiple feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performance. Both SVMDAS (a late fusion approach) and SimpleMKL (an early fusion approach) operate fusion over the 3 single features considered. They outperform the single feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance of DAS (at the best α for each supervision level, Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for subsequent use in the final system is motivated by this result and by its simpler fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames.


Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This indicates that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible remedy. Assuming that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application; it should therefore be integrated into the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera, and assuming that room changes are relatively slow, the filter size can be set empirically, for example, to the number of frames captured in one second.
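A minimal sketch of such a temporal accumulation filter is given below (Python/NumPy; the function name and the assumption that per-frame classifier scores are available as an array are ours):

```python
import numpy as np

def temporal_accumulation(scores, h):
    """Average per-class decision scores over a sliding window of h frames.

    scores: (n_frames, n_classes) array of per-frame classifier outputs.
    Returns the smoothed scores and the per-frame class decisions.
    """
    kernel = np.ones(h) / h
    # Box filter applied independently to each class score sequence.
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), axis=0, arr=scores)
    return smoothed, smoothed.argmax(axis=1)

# Example: at 5 frames per second, a 1-second window corresponds to h = 5.
# scores = ...  # decision values produced by the single feature SVM
# _, labels = temporal_accumulation(scores, h=5)
```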

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semisupervised SVM based on the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we observe how the performance evolves when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database, comparing single feature and multifeature baselines to the proposed semisupervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semisupervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semisupervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The lower performance at low supervision compared to the standard supervised classifier can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single feature SVM with a semisupervised Laplacian SVM with RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

Practical application of this method is limited since the full kernel matrix must be computed and stored in memory, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. The left and right panels illustrate, respectively, the best choice of the amount of selected high confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top confidence patterns is fed for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is the richer visual descriptor.
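For concreteness, a single Co-Training round over two feature views can be sketched as follows (Python with scikit-learn; the function, the crude max-score confidence, and the feedback ratio are simplifying assumptions rather than the exact procedure used in the experiments):

```python
import numpy as np
from sklearn.svm import LinearSVC

def cotrain_one_iteration(Xl_a, Xl_b, y, Xu_a, Xu_b, feedback=0.1):
    """One Co-Training round: each view labels its most confident unlabeled
    samples, which are then added to the other view's training set.

    Xl_* / Xu_*: labeled / unlabeled features for views a and b (multiclass).
    """
    clf_a = LinearSVC().fit(Xl_a, y)
    clf_b = LinearSVC().fit(Xl_b, y)

    def top_confident(clf, Xu):
        scores = clf.decision_function(Xu)        # (n_unlabeled, n_classes)
        conf = scores.max(axis=1)                 # crude per-sample confidence
        idx = np.argsort(-conf)[: int(feedback * len(Xu))]
        return idx, clf.classes_[scores[idx].argmax(axis=1)]

    idx_a, pseudo_a = top_confident(clf_a, Xu_a)
    idx_b, pseudo_b = top_confident(clf_b, Xu_b)

    # Retrain each classifier with the other view's confident estimates.
    clf_a = LinearSVC().fit(np.vstack([Xl_a, Xu_a[idx_b]]),
                            np.concatenate([y, pseudo_b]))
    clf_b = LinearSVC().fit(np.vstack([Xl_b, Xu_b[idx_a]]),
                            np.concatenate([y, pseudo_a]))
    return clf_a, clf_b
```

A DAS fusion of the two retrained classifiers, as sketched earlier, would then produce the final CO-DAS decision.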

Co-Training with More Iterations. Interesting additional insights on the Co-Training algorithm can be gained if more than one iteration is performed (see Figure 10). The figures show the evolution of the performance of each single feature classifier as it is iteratively retrained, from the standard baseline up to 10 iterations, with a constant portion of high confidence estimates added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, following the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.


Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module (BOF and CRFH) as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions (far time: minnie cloudy1/cloudy3, sunny1/sunny3, night1/night3); (bottom) different lighting conditions (close time: minnie cloudy1/night1, cloudy1/sunny1, night1/sunny1).

Conclusion. The experiments carried out so far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using the semisupervised Laplacian SVM under the assumption of low density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. These issues are avoided by the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which training patterns are scattered across the database. Randomly labeled images may simulate a situation where small portions of video are annotated in a frame by frame manner. In the extreme case, only a few labeled images from every class may be annotated manually.

In this context, we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semisupervised CO-DAS and CO-TA-DAS methods.


Figure 11: Comparison of the performance of single feature (BOVW, CRFH), early fusion (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOVW, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% correct classification for the BOVW and CRFH based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top confidence estimates in one re-training iteration, yielding a 10% and 8% increase for the BOVW and CRFH classifiers, respectively. This indicates that the top confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise given that 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next re-training rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative re-training procedure. The diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm, which explains why, at higher supervision levels, the performance increase of a re-trained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study a more realistic setup, in which one video is used for training and the place recognition method is evaluated on a different video.

Figure 12: Comparison of the global performances for single feature (BOVW-SVM, CRFH-SVM) and multiple feature late fusion (DAS) approaches and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used for the Co-Training feedback loop. The baseline single feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%, which confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, Figure 12 also gives a good illustration concerning the amount of high confidence estimates. It is clear that only a portion of the test set data can be useful for classifier re-training. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by the standard fusion and the simple Co-Training methods.

Figure 13: Comparison of the performances of the types of confidence measures for the Co-Training feedback loop (proposed, logistic, and Tommasi variants of CO-DAS and CO-TA-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

The proposed CO-DAS and CO-TA-DAS methods perform the best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semisupervised framework.

4.1.7. Effect of the Type of Confidence Measures. Figure 13 presents the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performances of the Ruping approach are not reported, as they were much lower than those of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.
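As an illustration of one of the compared families of measures, a logistic (Platt-style) mapping turns raw SVM decision values into confidences in (0, 1); the following sketch is a simplified version with hypothetical parameters, not the exact calibration used in the paper:

```python
import numpy as np

def logistic_confidence(decision_values, a=1.0, b=0.0):
    """Map per-class SVM decision values (n_samples, n_classes) to (0, 1)
    scores with a logistic function; a and b would normally be fitted on
    held-out data rather than fixed as here."""
    probs = 1.0 / (1.0 + np.exp(-a * decision_values + b))
    # Confidence of a sample = score of its highest-ranked class.
    return probs.max(axis=1)
```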

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours.


Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second and a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images of the 6 topological locations are shown in Figure 14.

Most locations are represented by one short bootstrap sequence that briefly depicts the available topological locations and for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6,400 images), while the unlabeled evaluation videos are 20 minutes long (36,000 images) on average. A few locations are not given a labeled bootstrap video; a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn the optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature/approach    SVM     SVM-TA
BOVW                0.49    0.52
CRFH                0.48    0.53
SPH                 0.47    0.49


Baseline: Single Feature Classifier Performance. As shown in Table 1, the single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation of this phenomenon is that, given the low amount of supervision, a classifier trained on the high dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single feature methods.


Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature/approach     MKL     MKL-TA   DAS     DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH             0.48    0.50     0.51    0.56     0.50     0.53
BOVW-CRFH            0.50    0.54     0.51    0.56     0.54     0.58
SPH-CRFH             0.48    0.51     0.50    0.54     0.54     0.57
BOVW-SPH-CRFH        0.48    0.51     0.51    0.56     —        —

We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important given the low amount of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semisupervised setup. It clearly outperforms all single feature methods and improves over DAS by up to 4% on all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged visual feature complementarity and the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, additionally incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of the video, and the use of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then on the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that are useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semisupervised learning algorithms, exploiting labeled as well as unlabeled data, helped to address this problem. The proposed system integrates all of this acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data thanks to a semisupervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20-23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582-601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177-193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273-280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413-420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299-1302, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113-4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32-36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170-178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422-434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521-2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165-174, September 2009.
[15] M. A. Pérez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1-4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1-7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255-264, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27-38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3-II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411-418, 2007.
[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524-531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161-2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1-6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169-2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470-1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229-241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657-1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489-1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345-379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606-613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436-443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902-909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550-553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5-12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950-959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578-II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394-2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522-529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221-228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[56] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321-328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149-1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92-100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1-4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485-1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454-465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196-201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189-196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135-1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298-320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83-85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696-703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458-1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.


Page 3: Research Article Multiple Feature Fusion Based on Co ...downloads.hindawi.com/journals/am/2013/175064.pdf · and wearable sensors.... Monitoring Using Ambient Sensors. Activity recogni-tion

Advances in Multimedia 3

and related issues were addressed in [16] with HierarchicalHMM which simultaneously fusing complementary low-level and midlevel (visual motion location sound andspeech) features and the contribution of an automatic audio-visual stream segmentation algorithm Results validate thechoice of two-level modeling of activities using HierarchicalHMM and reveal improvement in recognition performancewhen working with temporal segments Optimal featurefusion strategies using the Hierarchical HMM are studiedin [17] The contributed intermediate level fusion at theobservation level where all features are treated separatelycompares positively to more classic early and late fusionapproaches

Thiswork is a part of an effort to detect and recognize per-sonrsquos activities from wearable videos in the context of health-care within the IMMED (httpimmedlabrifr) project [7]and continued within the DemCare (httpwwwdemcareeu) project Localization information is one of multiplepossible cues to detect and recognize activities solely from theegocentric point of view of the recording camera Amongstthese location estimation is an important cue which we nowdiscuss in more detail

22 Visual Location Recognition

221 Motivation Classifying the current location usingvisual content only is a challenging problem It relates toseveral problematics that have been already addressed invarious contexts such as image retrieval from large databasessemantic video retrieval image-based place recognition inrobotics and scene categorization A survey [18] on imageand video retrieval methods covers the paradigms such assemantic video retrieval interactive retrieval relevance feed-back strategies and intelligent summary creation Anothercomprehensive and systematic study in [19] that evaluatesmultimodal models using visual and audio information forvideo classification reveals the importance of global andlocal features and the role of various fusion methods suchas ensembles context fusion and joint boosting We willhereafter focus on location recognition and classification

To deal with location recognition from image contentonly we can consider two families of approaches (a) fromretrieval point of view where a place is recognized as similarto an existing labeled reference from which we can inferthe estimated location (b) from a classification point ofview where the place corresponds to a class that can bediscriminated from other classes

222 Image Retrieval for Place Recognition Image retrievalsystems work by a principle that visual content presented ina query image is visually similar or related to a portion ofimages to be retrieved from database

Pairwise image matching is a relatively simple and attrac-tive approach It implies query image comparison to allannotated images in the database Top ranked images areselected as candidates and after some optional validationprocedures the retrieved images are presented as a resultto the query In [20] SIFT feature matching followed byvoting and further improvement with spatial information is

performed to localize indoors with 18 locations with each ofthem presented with 4 views The voting scheme determineslocations whose keypoints were most frequently classifiedas the nearest neighbors Additionally spatial informationis modeled using HMM bringing in neighbor locationrelationships A study [21] for place recognition in lifelogsimages found that the best matching technique is to use bi-directional matching which nevertheless adds computationalcomplexity This problem is resolved by using robust andrapid to extract SURF features which are then hierarchicallyclustered using the 119896-means algorithm in a vocabulary treeThe vocabulary tree allows the rapid descriptor comparisonof query image descriptors to those of the database andwherethe tree leaf note descriptor votes for the database image

The success of matching for place recognition dependsgreatly on the database which should contain a large amountof annotated images In many applications this is a ratherstrong assumption about the environment In the absenceof prior knowledge brought by completely annotated imagedatabase covering the environment topological place recog-nition discretizes otherwise continuous space A typicalapproach following this idea is presented in [22] Authorspropose the gradient orientation histograms of the edge mapas image feature with a property that visually similar scenesare described by a similar histogram The Learning VectorQuantization method is then used to retain only the mostcharacteristic descriptors for each topological location Anunsupervised approach for robot place recognition indoorsis adapted in [23] The method partitions the space intoconvex subspaces representing room concepts by usingapproximated graph-cut algorithm with the possibility foruser to inject can group and cannot group constraints Theadapted similarity measure relies on the 8-point algorithmconstrained to planar camera motion and followed by robustRANSAC to remove falsematches Besides high computationcost the results show good clustering capabilities if graphnodes representing individual locations are well selectedand the graph is properly built Authors recognize that atlarger scale and more similarly looking locations more falsematching images may appear

223 Image Classification for Place Recognition Trainingvisual appearance model and using it to classify unseenimages constitutes another family of approaches Imageinformation is usually encoded using global or local patchfeatures

In [24] an image is modeled as a collection of patcheseach of which is assigned a codeword using a prebuiltcodebook yielding a bag of codewords The generic Bagof Word [25] approach has been quite successful as globalfeatures One of its main advantages is the ability to representpossibly very complex visual contents and address sceneclutter problem It is flexible enough to accommodate bothdiscrete features [25] and dense features [26] while lettingthe possibility to include also weak spatial information byspatial binning as in [27] Authors in [6] argue that indoorscenes recognition require location-specific global featuresand propose a system recognizing locations by objects thatare present in them An interesting result suggests that the

4 Advances in Multimedia

final recognition performance can be boosted even further asmore object information is used in each image A context-based system for place and object recognition is presentedin [5] The main idea is to use context (scene gist) as aprior and then use it as a prior infer what objects can bepresent in a sceneTheHMM-based place recognition systemrequires a considerable amount of training data possibletransition probabilities and so forth but integrates naturallytemporal information and confidence measure to detect thefact of navigating in unknown locations Probabilistic LatentSemantic Analysis (pLSA) was used in [28] to discover higherlevel topics (eg grass forest water) from low-level visualfeatures and building novel low dimensional representationused afterwards in 119896-Nearest Neighbor classifier The studyshows superior classification performance by passing fromlow-level visual features to high-level topics that could beloosely attributed to the context of the scene

224 Place Recognition in Video Place recognition fromrecorded videos brings both novel opportunities and infor-mation but also poses additional challenges and constraintsMuch more image data can be extracted from video while inpractice some small portion of it can be labeled manually Anadditional information that is often leveraged in the literatureis the temporal continuity of the video stream

Matching-based approach has been used in [29] toretrieve objects in video Results show that simple matchingproduces a large number of false positive matches but theusage of stop list to remove most frequent and most specificvisual words followed by spatial consistency check signifi-cantly improves retrieval result quality In [30] belief func-tions in the Bayesian filtering context are used to determinethe confidence of a particular location at any time momentThe modeling involves sensor and motion models whichhave to be trained offline with sufficiently large annotateddatabase Indeed themodel has to learn themodel of allowedtransitions between places which require the annotated datato represent all possible transitions to be found in the testdata

An important group of methods performing simultane-ous place recognition and mapping (SLAM) is widely usedin robotics [31 32] The main idea in these methods is tosimultaneously build and update a map in an unknownenvironment and track in real time the current position ofthe camera In our work the construction of such map isnot necessary and may prove to be very challenging since theenvironment can be very complex and constantly changing

23 Multiple Feature Learning

231 Motivation Different visual features capture differentaspects of a scene and correct choice depends on the taskto solve [33] To this end even humans perform poorlywhen using only one information source of perception[34] Therefore instead of designing a specific and adapteddescriptor for each specific case several visual descriptorscan be combined in a more complex system while yieldingincreased discrimination power in a wider range of applica-tions Following the survey [35] twomain approaches can be

identified for the fusion of multiple features depending onwhether the fusion is done in the feature space (early fusion)or in the decision space (late fusion)

232 Early Fusion Early fusion strategies focus on the com-bination of input features before using them in a classifierIn the case of kernel classifiers the features can be seen asdefining a new kernel that takes into account several featuresat once This can be done by concatenating the featuresinto a new larger feature vector A more general approachMultiple Kernel Learning (MKL) also tries to estimate theoptimal parameters for kernel combination in addition to theclassifier model In our work we evaluated the SimpleMKL[36] algorithm as a representative algorithm of the MKLfamilyThe algorithm is based on gradient descent and learnsa weighted linear combination of kernels This approach hasnotably been applied in the context of object detection andclassification [37ndash39] and image classification [40 41]

233 Late Fusion In the late fusion strategy several baseclassifiers are trained independently and their outputs are fedto a special decision layer This fusion strategy is commonlyreferred to as a stacking method and is discussed in depthin the multiple classifiers systems literature [42ndash46] Thistype of fusion allows to use multiple visual features leavingtheir exploitation to an algorithm which performs automaticfeature selection or weighting respective to the utility of eachfeature

It is clear that nothing prevents using an SVM as baseclassifier Following the work of [47] it was shown that SVMoutputs in the form of decision values can be combined thelinearly using Discriminative Accumulation Scheme (DAS)[48] for confidence-based place recognition indoors Thefollowing work evolved by relaxing the constraint of linearityof combination using a kernel function on the outputs ofindividual single feature outputs giving rise to GeneralizedDAS [49] Results show a clear gain of performance increasewhen using different visual features or completely differentmodalities Other works follow a similar reasoning but usedifferent combination rules (max product etc) as discussedin [50] A comprehensive comparison of different fusionmethods in the context of object classification is given in [51]

24 Learning with Unlabeled Data

241 Motivation Standard supervised learning with singleor multiple features is successful if enough labeled trainingsamples are presented to the learning algorithm In manypractical applications the amount of training data is limitedwhile a wealth of unlabeled data is often available and islargely unused It is well known that the classifiers learnedusing only training data may suffer from overfitting orincapability to generalize on the unlabeled data In contrastunsupervised methods do not use label information Theymay detect a structure of the data however a prior knowledgeand correct assumptions about the data is necessary to be ableto characterize a structure that is relevant for the task

Semisupervised learning addresses this issue by leverag-ing labeled as well as unlabeled data [52 53]

Advances in Multimedia 5

242 Graph-Based Learning Given a labeled set 119871 =

(x119894 119910119894)119897

119894=1and an unlabeled set 119880 = x

119895119897+119906

119895=119897+1 where x isin X

and 119910 isin minus1 +1 the goal is to estimate class labels for thelatter The usual hypothesis is that the two sets are samplediid according to the same joint distribution 119901(x 119910) There isno intention to provide estimations on the data outside thesets 119871 and119880 Deeper discussion on this issue can be found in[54] and in references therein

In graph-based learning a graph composed of labeled orunlabeled nodes (in our case representing the images) andinterconnected by edges encoding the similarities is builtApplication specific knowledge is used to construct suchgraph in such a way that the labels of nodes connected witha high weight link are similar and that no or a few weaklinks are present between nodes of different classes Thisgraph therefore encodes information on the smoothness ofa learned function 119891 on the graph which corresponds to ameasure of compatibility with the graph connectivity Theuse of the graph Laplacian [55 56] can then be used directlyas a connectivity information to propagate information fromlabeled nodes to unlabeled nodes [57] or as a regularizationterm that penalizes nonsmooth labelings within a classifiersuch as the Lap-SVM [58 59]

From a practical point of view the algorithm requires theconstruction of the full affinity matrix 119882 where all imagepairs in the sequences are compared and the computationof the associated Laplacian matrix 119871 which requires largeamounts of memory in 119874(1198992) While theoretically attractivethe direct method scales poorly with the size of the graphnodes which seriously restricts its usage on a wide range ofpractical applications working

243 Co-Training from Multiple Features The Co-Training[60] is a wrapper algorithm that learns two discriminantclassifiers in a jointmannerThemethod trains iteratively twoclassifiers such that in each iteration the highest confidenceestimates on unlabeled data are fed into the training set ofanother classifier Classically two views on the data or twosingle feature splits of a dataset are used The main idea isthat the solution or hypothesis space is significantly reduced ifboth trained classifiers agree on the data and reduce the riskof overfitting since each classifier also fits the initial labeledtraining set More theoretical background and analysis of themethod is given in Section 332

The Co-Training algorithm was proposed in [60] as a solution for classifying Web pages using both link and word information. The same method was applied to the problem of Web image annotation in [61, 62] and to automatic video annotation in [63]. The generalization capacity of Co-Training on different initial labeled training sets was studied in [64]. More analysis of the theoretical properties of the Co-Training method, such as rough estimates of the maximal number of iterations, can be found in [65]. A review of different variants of the Co-Training algorithm, together with their comparative analysis, is given in [66].

2.4.4. Link between Graph and Co-Training Approaches. It is interesting to note the link [67, 68] between the Co-Training method and label propagation in a graph, since adding the most confident estimations at each Co-Training iteration can be seen as label propagation from labeled nodes to unlabeled nodes in a graph. This view of the method is further discussed and practically evaluated in [69] as a label propagation method on a combined graph built from two individual views.

Graph-based methods are limited by the fact that graph edges encode low-level similarities that are computed directly from the input features. The Co-Training algorithm uses a discriminative model that adapts to the data at each iteration and can therefore achieve better generalization on unseen unlabeled data. In the next section we build a framework based on the Co-Training algorithm to propose our solution for image-based place recognition.

In this work we attempt to leverage all available information from image data that could provide cues for camera place recognition. Manual annotation of recorded video sequences requires substantial human labor. The aim of this work is to evaluate the utility of unlabeled data within the Co-Training framework for image-based place recognition.

3. Proposed Approach

In this section we present the architecture of the proposed method, which is based on the Co-Training algorithm, and then discuss each component of the system. The standard Co-Training algorithm (see Figure 2) makes it possible to benefit from the information in the unlabeled part of the corpus by using a feedback loop to augment the training set, thus producing classifiers with improved performance. In the standard formulation of the algorithm the two classifiers remain separate, which does not leverage their complementarity to its maximum. The proposed method addresses this issue by providing a single output using late classifier fusion and time filtering for temporal constraint enforcement.

We will present the different elements of the system in order of increasing abstraction. Single feature extraction, preparation, and classification using SVM will be presented in Section 3.1. Multiple feature late fusion and a proposed extension that takes time information into account will be introduced in Section 3.2. The complete algorithm combining those elements with the Co-Training algorithm will be developed in Section 3.3.

3.1. Single Feature Recognition Module. Each image is represented by a global signature vector. In the following sections the visual features $\mathbf{x}_i^{(j)} \in \mathcal{X}^{(j)}$ correspond to numerical representations of the visual content of the images, where the superscript $(j)$ denotes the type of visual features.

3.1.1. SVM Classifier. In our work we rely on Support Vector Machine (SVM) classifiers to carry out decision operations. The SVM aims at finding the best class separation instead of modeling potentially complex within-class probability densities, as in generative models such as Naive Bayes [70]. The maximal margin separating hyperplane is motivated from the statistical learning theory viewpoint by linking the margin width to the classifier's generalization capability.


Figure 2: Workflow of the Co-Training algorithm.


Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x} \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, a linear maximal margin classifier $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ can be found by solving
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \sum_{i=1}^{l} \xi_i + \lambda\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ \forall i,\ i = 1, \dots, l, \tag{1}$$

for the hyperplane $\mathbf{w} \in \mathbb{R}^d$ and its offset $b \in \mathbb{R}$. In the regularization framework, the loss function, called the Hinge loss, is
$$\ell(\mathbf{x}, y, f(\mathbf{x})) = \max(1 - y_i f(\mathbf{x}_i), 0), \quad \forall i,\ i = 1, \dots, l, \tag{2}$$
and the regularizer is
$$\Omega_{\mathrm{SVM}}(f) = \|\mathbf{w}\|^2. \tag{3}$$

As will be seen from the discussion, the regularizer plays an important role in the design of learning methods. In the case of SVM classifiers, the regularizer in (3) reflects the objective to be maximized: maximum margin separation on the training data.
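For illustration, the following minimal sketch trains a linear SVM and evaluates the Hinge loss of (2) on synthetic data. It is only a sketch under assumed conventions: scikit-learn's LinearSVC is used as the solver (not necessarily the implementation referenced as [72]), and its parameter C plays, roughly, the role of the inverse of the regularization weight $\lambda$ in (1).

```python
import numpy as np
from sklearn.svm import LinearSVC

def hinge_loss(scores, y):
    """Average Hinge loss of Equation (2): max(1 - y_i * f(x_i), 0)."""
    return np.maximum(1.0 - y * scores, 0.0).mean()

# toy binary problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1, -1)

# C is (roughly) the inverse of lambda in Equation (1)
clf = LinearSVC(C=1.0).fit(X, y)
print(hinge_loss(clf.decision_function(X), y))
```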

3.1.2. Processing of Nonlinear Kernels. The power of the SVM classifier owes much to its easy extension to the nonlinear case [71]. The highly nonlinear nature of the data can be taken into account seamlessly by using the kernel trick, such that the hyperplane is found in a feature space induced by an adapted kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \langle\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle$ in a Reproducing Kernel Hilbert Space (RKHS). The implicit mapping $\mathbf{x} \mapsto \Phi(\mathbf{x})$ means that we can no longer find an explicit hyperplane $\mathbf{w}, b$,

since the mapping function is not known and may be of very large dimensionality. Fortunately, the decision function can be formulated in the so-called dual representation [71], and then the solution minimizing the regularized risk, according to the Representer theorem, is
$$f_k(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b, \quad k = 1, \dots, c, \tag{4}$$

where $l$ is the number of labeled samples.

Bag of Words descriptors have been used intensively for efficient and discriminant image description. The linear kernel does not provide the best results with such representations, which have been more successful with kernels such as the Hellinger kernel, the $\chi^2$ kernel, or the intersection kernel [6, 27, 33]. Unfortunately, training with such kernels using the standard SVM tools is much less computationally efficient than using the linear inner product kernel, for which efficient SVM implementations exist [72]. In this work we have therefore chosen to adapt the input features to the linear context using two different techniques. For the BOVW (Bag of Visual Words) [25] and SPH (Spatial Pyramid Histogram) [27] features, a Hellinger kernel was used. This kernel admits an explicit mapping function using a square root transformation $\phi([x_1 \cdots x_d]^T) = [\sqrt{x_1} \cdots \sqrt{x_d}]^T$. In this particular case, a linear embedding $\mathbf{x}' = \phi(\mathbf{x})$ can be computed explicitly and has the same dimensionality as the input feature. For the CRFH (Composed Receptive Field Histogram) features [26], the feature vector has a very large number of dimensions but is also extremely sparse, with between 500 and 4000 nonzero coefficients out of many millions of features in total. These features can be transformed into a linear embedding using Kernel Principal Component Analysis [73], in order to reduce them to a 500-dimensional linear embedding vector.


In the following we will therefore consider that all features are processed into a linear embedding $\mathbf{x}_i$ that is suitable for an efficient linear SVM. The utility of this processing will become evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations of two visual feature classifiers. Other forms of efficient embedding proposed in [38] could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting the linearized descriptors directly by $\mathbf{x}_i$, without further indication, in the rest of this document.
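As an illustration of the explicit Hellinger embedding mentioned above, the short sketch below applies the square-root transformation to a toy BOVW histogram; the L1 normalization step is an assumption added for numerical convenience and is not prescribed by the text.

```python
import numpy as np

def hellinger_embedding(hist):
    """Explicit Hellinger map: normalize the histogram, then take square roots.

    With this embedding, a plain inner product <phi(x), phi(x')> equals the
    Hellinger kernel between the original histograms, so a fast linear SVM
    can be used downstream.
    """
    hist = np.asarray(hist, dtype=float)
    hist = hist / max(hist.sum(), 1e-12)  # guard against empty histograms (assumption)
    return np.sqrt(hist)

bovw = np.array([3, 0, 7, 1, 9], dtype=float)  # toy 5-bin BOVW histogram
x_linear = hellinger_embedding(bovw)
```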

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to $c > 2$ classes is considered in a one-versus-all setup. Therefore, $c$ independent classifiers are trained on the labeled data, each of which learns the separation between one class and the other classes. We denote by $f_k$ the decision function associated to class $k \in [1, c]$. The outcome of the classifier bank for a sample $\mathbf{x}$ can be represented as a scores vector $\mathbf{s}(\mathbf{x})$ by concatenating the individual decision scores:
$$\mathbf{s}(\mathbf{x}) = (f_1(\mathbf{x}), \dots, f_c(\mathbf{x})). \tag{5}$$
In that case, the estimated class of a testing sample $\mathbf{x}_i$ is obtained from the largest positive score:
$$\hat{y}_i = \arg\max_{k=1,\dots,c} f_k(\mathbf{x}_i). \tag{6}$$
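The one-versus-all construction of (5) and (6) can be sketched as follows; the helper names and the use of scikit-learn's LinearSVC are illustrative assumptions (an equivalent result can be obtained with ready-made one-versus-rest wrappers).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, n_classes, C=1.0):
    """Train c independent binary SVMs, one per class."""
    return [LinearSVC(C=C).fit(X, np.where(y == k, 1, -1)) for k in range(n_classes)]

def score_vector(classifiers, X):
    """Score vectors s(x) of Equation (5), one row per sample."""
    return np.column_stack([clf.decision_function(X) for clf in classifiers])

def predict(classifiers, X):
    """Decision rule of Equation (6): class with the largest score."""
    return score_vector(classifiers, X).argmax(axis=1)

# toy 3-class problem with 2-dimensional features
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2)) + np.repeat(4 * np.eye(3, 2), 50, axis=0)
y = np.repeat(np.arange(3), 50)
print(predict(train_one_vs_all(X, y, 3), X[:5]))
```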

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work we follow a late classifier fusion paradigm, with several classifiers being trained independently on different visual cues and their outputs fused into a single final decision. We motivate this choice, compared to the early fusion paradigm, by the fact that it allows easier integration, at the decision level, of the augmented classifiers obtained by the Co-Training algorithm, and it provides a natural extension for injecting the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and the unlabeled set of patterns by $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and the outcome of classification is a binary output $y \in \{-1, +1\}$.

The visual data may have $p$ multiple cues describing the same image $I_i$. Suppose that $p$ cues have been extracted from an image $I_i$:
$$\mathbf{x}_i \rightarrow (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \dots, \mathbf{x}_i^{(p)}), \tag{7}$$

where each cue $\mathbf{x}_i^{(j)}$ belongs to an associated descriptor space $\mathcal{X}^{(j)}$. Denote also the $p$ decision functions $f^{(1)}, f^{(2)}, \dots, f^{(p)}$, where $f^{(j)} \in \mathcal{F}^{(j)}$ are trained on the respective visual cues and provide an estimation $\hat{y}_k^{(j)}$ for the pattern $\mathbf{x}_k^{(j)}$. Then, for a visual cue $t$ and a $c$-class classification in a one-versus-all setup, a score vector can be constructed:
$$\mathbf{s}^t = (f_1^t(\mathbf{x}), \dots, f_c^t(\mathbf{x})). \tag{8}$$

In our work we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVM-DAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to combine linearly the scores returned by the same class decision function across the multiple visual cues $t = 1, \dots, p$. The novel combined decision function for a class $j$ is then a linear combination:
$$f_j^{\mathrm{DAS}}(\mathbf{x}) = \sum_{t=1}^{p} \beta_t f_j^t(\mathbf{x}), \tag{9}$$

where the weight $\beta_t$ is attributed to each cue according to its importance in the learning phase. The novel scores can then be used in the decision process, for example, using the max-score criterion.

The DAS scheme is an example of a parallel classifier combination architecture [44] and implies a competition between the individual classifiers. The weights $\beta_t$ can be found using a cross-validation procedure with the normalization constraint
$$\sum_{t=1}^{p} \beta_t = 1. \tag{10}$$
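A minimal sketch of the DAS combination of (9) under the normalization constraint (10) is given below; the score arrays are assumed to be one-versus-all outputs as in (8), and the example weights are arbitrary.

```python
import numpy as np

def das_fusion(score_list, betas):
    """Discriminative Accumulation Scheme (Equation (9)).

    score_list : list of p arrays of shape (n_samples, n_classes), one per cue.
    betas      : p cue weights, assumed to sum to one (Equation (10)).
    """
    betas = np.asarray(betas, dtype=float)
    assert np.isclose(betas.sum(), 1.0), "cue weights should sum to one"
    return sum(b * s for b, s in zip(betas, score_list))

# toy example: two cues, 4 samples, 3 classes
s1, s2 = np.random.rand(4, 3), np.random.rand(4, 3)
fused = das_fusion([s1, s2], betas=[0.4, 0.6])
labels = fused.argmax(axis=1)  # max-score criterion
```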

3.2.3. SVM Discriminant Accumulation Scheme. The SVM-DAS can be seen as a generalization of the DAS obtained by building a stacked architecture of multiple classifiers [44], where individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue $t$ and produces a score vector as in (8). Then, the single-feature score vectors $\mathbf{s}_i^t$ corresponding to one particular pattern $\mathbf{x}_i$ are concatenated into a novel multifeature scores vector $\mathbf{z}_i = [\mathbf{s}_i^1, \dots, \mathbf{s}_i^p]$. A final top-level classifier can be trained on those novel features:
$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j. \tag{11}$$

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between base classifier outputs. If a linear kernel function is used,
$$k_{\mathrm{SVMDAS}}(\mathbf{z}_i, \mathbf{z}_j) = \langle\mathbf{z}_i, \mathbf{z}_j\rangle = \sum_{t=1}^{p} \langle\mathbf{s}_i^t, \mathbf{s}_j^t\rangle, \tag{12}$$
then the decision function in (11) can be rewritten by exchanging sums:
$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j = \sum_{t=1}^{p}\sum_{i=1}^{l} \alpha_{ij} y_i \langle\mathbf{s}^t, \mathbf{s}_i^t\rangle + b_j. \tag{13}$$


Denoting $\mathbf{w}_j^t = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^t$, we can rewrite the decision function using the input patterns and the learned weights:
$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p}\sum_{k=1}^{c} w_{jk}^t f_k^t(\mathbf{x}). \tag{14}$$

This novel representation reveals that using a linear kernel in the SVM-DAS framework yields a classifier in which a weight is learned for every possible linear combination of base classifier outputs. The DAS can be seen as a special case in this context, but with significantly fewer parameters. Using a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well, and its parameters tuned.
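A sketch of the stacked SVM-DAS architecture is shown below: base-classifier score vectors are concatenated into $\mathbf{z}_i$ and a top-level SVM is trained on them. The use of scikit-learn's SVC (whose built-in multiclass strategy differs from the one-versus-all setup of Section 3.1.3) and the synthetic score generation are assumptions of this illustration.

```python
import numpy as np
from sklearn.svm import SVC

def train_svmdas(score_list_train, y_train, kernel="linear", C=1.0):
    """Train the top-level SVM-DAS classifier on concatenated score vectors z_i.

    score_list_train : list of p arrays (n_samples, n_classes) of base outputs (Eq. (8)).
    A nonlinear kernel (e.g. 'rbf') yields the richer combinations discussed above.
    """
    Z = np.hstack(score_list_train)  # z_i = [s_i^1, ..., s_i^p]
    return SVC(kernel=kernel, C=C).fit(Z, y_train)

def predict_svmdas(model, score_list_test):
    return model.predict(np.hstack(score_list_test))

# toy example: two cues, 3 classes, weakly informative synthetic scores
rng = np.random.default_rng(2)
y = rng.integers(0, 3, size=60)
s1 = rng.normal(size=(60, 3)) + np.eye(3)[y]
s2 = rng.normal(size=(60, 3)) + 0.5 * np.eye(3)[y]
model = train_svmdas([s1, s2], y)
pred = predict_svmdas(model, [s1[:5], s2[:5]])
```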

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature, such that the visual content does not usually change much over a short period of time. In the case of topological place recognition indoors this constraint may be useful, as place changes are encountered relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes are discouraged within a relatively short period of time. This reduces the proliferation of occasional, temporally localized misclassifications.

Let $s_i^t = f^{(t)}(\mathbf{x}_i)$ be the scores of a binary classifier for visual cue $t$, and $h$ a temporal window of size $2\tau + 1$. Then temporal accumulation can be written as
$$s_{i,\mathrm{TA}}^t = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^t \tag{15}$$
and can be easily generalized to multiple feature classification by applying it separately to the output of the classifiers associated to each feature $\mathbf{s}^t$, where $t = 1, \dots, p$ is the visual feature type. We use an averaging filter of size $\tau$ defined as
$$h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \dots, \tau. \tag{16}$$

Therefore, the input of the TA is the set of SVM scores obtained after classification, and the output is again a set of SVM scores, with the temporal constraint enforced.
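The temporal accumulation of (15)-(16) amounts to a moving average over the score sequence of each class; a minimal sketch follows. The handling of the sequence borders (averaging over the part of the window that exists) is an assumption, since the text does not specify it.

```python
import numpy as np

def temporal_accumulation(scores, tau):
    """Average SVM scores over a sliding window of size 2*tau + 1 (Eqs. (15)-(16)).

    scores : array (n_frames, n_classes); frames are assumed to be in temporal order.
    """
    h = np.ones(2 * tau + 1)
    counts = np.convolve(np.ones(len(scores)), h, mode="same")  # border normalization
    smoothed = np.empty_like(scores, dtype=float)
    for k in range(scores.shape[1]):
        smoothed[:, k] = np.convolve(scores[:, k], h, mode="same") / counts
    return smoothed

# toy example: 10 frames, 3 classes, tau = 2 (5-frame window)
scores_ta = temporal_accumulation(np.random.rand(10, 3), tau=2)
```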

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section we explain how to additionally learn from labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training [60] is an algorithm that iteratively trains two classifiers on two-view data $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ by feeding the highest-confidence (score $z_i$) estimates from the testing set into the training set of the other view's classifier. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest-confidence estimates. The pseudocode is presented in Algorithm 1, which could also be extended to multiple views as in [53].

The power of the method lies in its capability to learn from small training sets and to eventually grow its discriminative properties on the large unlabeled dataset as more confident estimations are added into the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;
(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;
(3) the confident estimations on unlabeled data are helpful for predicting the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or $N$ iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been relatively little change from iteration $t-1$ to $t$. The parameter-less version of Co-Training runs until the pool of unlabeled samples is completely exhausted, but requires a threshold on the confidence measure, which is used to separate high- and low-confidence estimates. In our work we use this variant of the Co-Training algorithm.
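For illustration, the sketch below implements a pool-exhausting Co-Training loop of this kind, restricted to the binary case for brevity. Several choices are assumptions of the sketch rather than the paper's implementation: the margin |f(x)| is used as a confidence proxy instead of the measure of Section 3.3.4, pseudo-labeled samples are added with both of their views to a shared augmented training set, and scikit-learn's LinearSVC is used as the base classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def co_training_binary(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, pool_fraction=0.1):
    """Pool-exhausting Co-Training on two views, binary labels in {-1, +1}."""
    L1, L2, y = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    remaining = np.arange(len(X1_unlab))
    k = max(1, int(pool_fraction * len(remaining)))
    clf1 = LinearSVC().fit(L1, y)
    clf2 = LinearSVC().fit(L2, y)
    while len(remaining) > 0:
        conf1 = np.abs(clf1.decision_function(X1_unlab[remaining]))
        conf2 = np.abs(clf2.decision_function(X2_unlab[remaining]))
        # each view proposes its k most confident unlabeled samples
        top = np.unique(np.concatenate([np.argsort(-conf1)[:k], np.argsort(-conf2)[:k]]))
        idx = remaining[top]
        # pseudo-label each selected sample with the more confident of the two views
        pseudo = np.where(conf1[top] >= conf2[top],
                          clf1.predict(X1_unlab[idx]),
                          clf2.predict(X2_unlab[idx]))
        L1, L2 = np.vstack([L1, X1_unlab[idx]]), np.vstack([L2, X2_unlab[idx]])
        y = np.concatenate([y, pseudo])
        remaining = np.delete(remaining, top)
        clf1, clf2 = LinearSVC().fit(L1, y), LinearSVC().fit(L2, y)
    return clf1, clf2
```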

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It remains unclear why, and which mechanisms make, such learning useful. It can be justified from the learning theory point of view: there are fewer possible solutions, or classifiers from the hypothesis space, that agree on the unlabeled data in two views. Recall that every classifier should individually fit its training data. In the context of the Co-Training algorithm, each classifier should be somehow restricted by the other classifier. The two trained classifiers coupled in this system effectively reduce the possible solution space. Each of these two classifiers is less likely to overfit, since each of them has been initially trained on its own training set while taking into account the training process of the other classifier, which is carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework. Let us introduce the Hinge loss


INPUT:
Training set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$
Testing set $U = \{\mathbf{x}_i\}_{i=1}^{u}$
OUTPUT:
$\hat{y}_i$: class estimations for the testing set $U$; $f^{(1)}, f^{(2)}$: trained classifiers
PROCEDURE:
(1) Compute visual features $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ for every image $I_i$ in the dataset.
(2) Initialize $L_1 = \{(\mathbf{x}_i^{(1)}, y_i)\}_{i=1}^{l}$ and $L_2 = \{(\mathbf{x}_i^{(2)}, y_i)\}_{i=1}^{l}$.
(3) Initialize $U_1 = \{\mathbf{x}_i^{(1)}\}_{i=1}^{u}$ and $U_2 = \{\mathbf{x}_i^{(2)}\}_{i=1}^{u}$.
(4) Create two work sets $\tilde{U}_1 = U_1$ and $\tilde{U}_2 = U_2$.
(5) Repeat until the sets $\tilde{U}_1$ and $\tilde{U}_2$ are empty (CO):
  (a) Train classifiers $f^{(1)}, f^{(2)}$ using the sets $L_1, L_2$, respectively.
  (b) Classify the patterns in the sets $\tilde{U}_1$ and $\tilde{U}_2$ using the classifiers $f^{(1)}$ and $f^{(2)}$, respectively:
    (i) compute scores $\mathbf{s}^{(1)}_{\text{test}}$ and confidences $z^{(1)}$ on the set $\tilde{U}_1$;
    (ii) compute scores $\mathbf{s}^{(2)}_{\text{test}}$ and confidences $z^{(2)}$ on the set $\tilde{U}_2$.
  (c) Select the $k$ top-confidence estimations $\tilde{L}_1 \subset \tilde{U}_1$ (labeled by $f^{(1)}$) and $\tilde{L}_2 \subset \tilde{U}_2$ (labeled by $f^{(2)}$), and add each to the training set of the other view:
    (i) $L_1 = L_1 \cup \tilde{L}_2$;
    (ii) $L_2 = L_2 \cup \tilde{L}_1$.
  (d) Remove the $k$ top-confidence patterns from the working sets:
    (i) $\tilde{U}_1 = \tilde{U}_1 \setminus \tilde{L}_1$;
    (ii) $\tilde{U}_2 = \tilde{U}_2 \setminus \tilde{L}_2$.
  (e) Go to step (5).
(6) Optionally perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
  (a) compute fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}} = (1-\beta)\,\mathbf{s}^{(1)}_{\text{test}} + \beta\,\mathbf{s}^{(2)}_{\text{test}}$;
  (b) output class estimations $\hat{y}_i$ from the fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}}$.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.

function $\ell(\mathbf{x}, y, f(\mathbf{x}))$ commonly used in classification. Let us also introduce the empirical risk of a candidate function $f \in \mathcal{F}$:
$$\hat{R}(f) = \frac{1}{l}\sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i)), \tag{17}$$
which measures how well the classifier fits the training data. It is well known that, by minimizing only the training error, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:
$$f^{\mathrm{RRM}} = \arg\min_{f\in\mathcal{F}} \hat{R}(f) + \lambda\,\Omega(f), \tag{18}$$
where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value, or penalty, for very complicated functions (typically the functions that fit the data perfectly). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and a better generalization capability on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer $\Omega_{\mathrm{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^2$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning we can select a regularizer such that it is sufficiently smooth on the unlabeled data as well. Keeping all of the previous discussion in mind, a function that fits the training data while respecting the unlabeled data will indeed probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as
$$(\hat{f}^{(1)}, \hat{f}^{(2)}) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2}\left(\frac{1}{l}\sum_{i=1}^{l}\ell\bigl(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)\bigr) + \lambda_1\,\Omega_{\mathrm{SVM}}(f^{(t)})\right) + \lambda_2\sum_{i=1}^{l+u}\ell\bigl(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i)\bigr), \tag{19}$$
where $\lambda_2 > 0$ controls the balance between an agreed fit on the training data and agreement on the test data.


Figure 3: (a) Co-Training with late fusion; (b) Co-Training with temporal accumulation.

The first part of (19) states that each individual classifier should fit the given training data but should not overfit, which is prevented by the SVM regularizer $\Omega_{\mathrm{SVM}}(f)$. The second part is a regularizer $\Omega_{\mathrm{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

It should be noted that the framework can easily be extended to more than two classifiers. In the literature, algorithms following this spirit implement multiple-view learning; refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether their complementarity has been leveraged to its maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work we use the DAS method for late fusion, although it is possible to use the more general SVM-DAS method as well; the experimental evaluation will show that very competitive performances can be obtained using the former, much simpler, method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. The experimental evaluation will reveal the relative performance of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS method (or CO-TA-DAS, if temporal accumulation is enabled) is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold $k$ for the top-confidence sample selection, the temporal accumulation window width $\tau$, and the late fusion parameter $\beta$. We express the threshold $k$ as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion of the selection of the parameter $\beta$ is given in Section 4.1.2.

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with $c$ classes, the scores $\{f_k(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as
$$P(y = k \mid \mathbf{x}) = \frac{\exp(f_k(\mathbf{x}))}{\sum_{i=1}^{c}\exp(f_i(\mathbf{x}))}, \tag{20}$$

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows the classifier output to be interpreted as a probability. There are at least two drawbacks with this measure. It does not take into account the cases where all classifiers in the one-versus-all setup reject the pattern (all negative score values) or all accept it (all positive scores). Moreover, forcing the scores to sum to one may not preserve all of their dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76], a parameter-less method was proposed, which assigns the score value
$$z = \begin{cases} p_{+}, & f(\mathbf{x}) > 1, \\ \dfrac{1 + f(\mathbf{x})}{2}, & -1 \le f(\mathbf{x}) \le 1, \\ p_{-}, & f(\mathbf{x}) < -1, \end{cases} \tag{21}$$
where $p_{+}$ and $p_{-}$ are the fractions of positive and negative score values, respectively. The authors argue that the dynamics relevant to confidence estimation happen in the margin region, and that patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem,


but it does not cover the multiclass case required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum-score estimation should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:
$$z = f_{k^*}(\mathbf{x}) - \max_{k=1,\dots,c,\; k \ne k^*} f_k(\mathbf{x}). \tag{22}$$

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, the measure is problematic if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account by any of the above confidence measures. The one-versus-all setup for multiple class classification may yield ambiguous decisions; for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to varying degrees and also treats the two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (larger positive value of a score). Cases with all positive or all negative scores are considered degenerate: $z_i \leftarrow 0$. The computation is divided into two steps. First, we compute the standard Tommasi confidence measure
$$z_i^0 = f_{j^*}(\mathbf{x}_i) - \max_{k=1,\dots,c,\; k \ne j^*} f_k(\mathbf{x}_i); \tag{23}$$
then the measure $z_i^0$ is modified to account for class overlap:
$$z_i = z_i^0 \max\left(0,\, 1 - \frac{p_i - 1}{C}\right), \tag{24}$$
where $p_i = \mathrm{Card}(\{k = 1, \dots, c \mid f_k(\mathbf{x}_i) > 0\})$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In the case of $\forall k,\ f_k(\mathbf{x}_i) > 0$ or $\forall k,\ f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, more severely if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows them to be excluded instead of being assigned doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, confidence is lowered if more than one binary classifier returns a positive score.
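A compact sketch of the proposed class-overlap-aware measure of (23)-(24), together with the Tommasi contrast it builds on, is given below. The value of the constant C is not fixed by the text; it is set to the number of classes here purely for illustration.

```python
import numpy as np

def tommasi_confidence(scores):
    """Contrast between the two largest one-vs-all scores (Eqs. (22)-(23))."""
    top2 = np.sort(scores, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def overlap_aware_confidence(scores, C):
    """Proposed class-overlap-aware measure (Equation (24)).

    scores : array (n_samples, n_classes) of one-vs-all SVM decision values.
    C      : overlap penalty constant (assumed here to be the number of classes).
    """
    z0 = tommasi_confidence(scores)
    positives = (scores > 0).sum(axis=1)                 # class overlap p_i
    z = z0 * np.maximum(0.0, 1.0 - (positives - 1) / C)
    degenerate = (positives == 0) | (positives == scores.shape[1])
    z[degenerate] = 0.0                                  # all-negative or all-positive scores
    return z

# toy example: 3 samples, 4 classes
s = np.array([[ 1.2, -0.3, -0.8, -1.0],   # one clear positive -> high confidence
              [ 0.9,  0.7, -0.2, -0.5],   # two positives      -> penalized
              [-0.4, -0.6, -0.9, -1.1]])  # all negative       -> zero confidence
print(overlap_aware_confidence(s, C=4))
```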

4. Experimental Evaluation

In this section we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2 in Section 4.1 and (2) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter database was recorded in a challenging environment and is a subject of study in the IMMED project.

For each database, two experimental setups are created: (a) randomly sampled training images across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insights into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database we extensively assess the place recognition performance of each independent part of the proposed system. For instance, we validate the utility of multiple features, the effect of temporal smoothing, the use of unlabeled data, and the different confidence measures.

The IMMED database is used for validation purposes; on it we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems of mobile robots in indoor environments.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search of the tree. This allows the introduction of visual words ranging from more general (higher-level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness regarding the spatial structure of the image.


Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

This is done by constructing a pyramid in which each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram with local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3, with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after some filtering operation on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses returned by each specific filter. Due to the multidimensional nature and the size of an image, this descriptor often results in a very high dimensional vector. In our experimental evaluations we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The resulting global descriptors are very sparse vectors of up to 400 million dimensions. They were reduced to 500-dimensional linear descriptor vectors using KPCA with a $\chi^2$ kernel [73].
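A minimal sketch of such a χ²-kernel KPCA reduction is given below, using a precomputed kernel matrix. The dense toy data, the γ value, and the scikit-learn-based pipeline are assumptions of the illustration; the original CRFH vectors are extremely sparse and would require a sparse χ² computation in practice.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

def fit_chi2_kpca(H_train, n_components=500, gamma=1.0):
    """Fit KPCA with a chi-square kernel on training histograms.

    n_components should not exceed the number of training samples.
    """
    K_train = chi2_kernel(H_train, gamma=gamma)
    kpca = KernelPCA(n_components=min(n_components, len(H_train)), kernel="precomputed")
    kpca.fit(K_train)
    return kpca

def transform_chi2_kpca(kpca, H_train, H_new, gamma=1.0):
    """Embed new histograms using the kernel between new and training samples."""
    return kpca.transform(chi2_kernel(H_new, H_train, gamma=gamma))

# toy example: 40 training histograms with 100 bins, reduced to 10 dimensions
H = np.random.rand(40, 100)
kpca = fit_chi2_kpca(H, n_components=10)
X_lin = transform_chi2_kpca(kpca, H, H[:5])
```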

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves and used, respectively, for validation and testing purposes. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is returned as the average accuracy.

It is expected that the global place recognition performance rises from a mediocre level at low supervision to its maximum at high supervision.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.


Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter $\alpha$ that balances the fusion between SPH features (3 levels) if $\alpha = 0$ and CRFH if $\alpha = 1$ (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all 132 video pairs, to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for baseline single feature methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages the outputs of two visual feature classifiers and provides a weighted score sum as output, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter $\alpha$ at different supervision levels. Interesting dynamics can be noticed for intermediate fusion values, which suggest feature complementarity. The fusion parameter $\alpha$ can be safely set to an intermediate value such as 0.5, and the final performance then exceeds that of every single feature classifier alone at all supervision levels.

Figure 6: Comparison of single (BOVW, CRFH, and SPH) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).


The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single feature and multiple feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performance. Both SVMDAS (late fusion approach) and SimpleMKL (early fusion approach) operate fusion over the 3 single features considered. They outperform the single feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods DAS and SVMDAS deliver very comparable performances. Comparing the maximum performance (at the best $\alpha$ for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for use in the final system is motivated by this result and by its simpler fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content, which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion of the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames.


Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible solution. The assumption that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and that room changes are relatively slow, the filter size can be set empirically, for example, to the number of frames captured in one second.

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semisupervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we observe the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semisupervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semisupervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semisupervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The low performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single feature SVM with a semisupervised Laplacian SVM with RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

The practical application of this method is limited, since the full kernel matrix must be computed and stored in memory, which scales as $O(n^2)$ with the number of patterns. The computational time scales as $O(n^3)$, which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. The left and right panels illustrate, respectively, the best choice of the amount of selected high-confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top-confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights into the Co-Training algorithm can be gained by performing more than one iteration (see Figure 10). The figures show the evolution of the performance of a single feature classifier iteratively retrained from the standard baseline for up to 10 iterations, where a constant portion of high-confidence estimates is added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.


Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters: (a) accuracy for CO-DAS and single feature approaches, (b) optimal amount of selected samples for the Co-Training feedback loop, and (c) selected DAS $\alpha$ parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers (BOF and CRFH) of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, (bottom) different lighting conditions.

Conclusion. The experiments carried out so far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using a semisupervised Laplacian SVM under the assumption of low-density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. These issues are avoided by using the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which training patterns are scattered across the database. Randomly labeled images may simulate the situation where small portions of video are annotated in a frame-by-frame manner; in the extreme, a few labeled images from every class may be labeled manually.

In this context we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semisupervised CO-DAS and CO-TA-DAS methods.


Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% correct classification for the BOVW- and CRFH-based SVM classifiers. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top-confidence estimates in one retraining iteration, achieving, respectively, a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top-confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high-confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative retraining procedure. Curiously, the diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm. This explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video-versus-Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.

Figure 12: Comparison of the global performances for single feature (BOVW-SVM, CRFH-SVM), multiple feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion of the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used for the Co-Training feedback loop. The baseline single feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size $\tau = 50$ was used).

Confidence Measure. On this dataset, Figure 12 also gives a good illustration concerning the amount of high-confidence estimates. It is clear that only a portion of the test set can be useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high-confidence estimates is around 20-50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by standard fusion and by simple Co-Training methods.

Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, Tommasi) for the Co-Training feedback loop with the CO-DAS and CO-TA-DAS methods. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

The proposed methods CO-DAS and CO-TA-DAS perform best by successfully leveraging two visual features and the temporal continuity of the video while working in a semisupervised framework.

4.1.7. Effect of the Type of Confidence Measure. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than that of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performance when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also included. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low-quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions.


Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

The total amount of recordings exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second, with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Most locations are represented with one short bootstrap sequence, briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; therefore, a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been equalized such that every frame could carry one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn the optimal parameter values for the employed methods, we used the standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset: average accuracy of the single feature approaches.

Feature \ approach    SVM     SVM-TA
BOVW                  0.49    0.52
CRFH                  0.48    0.53
SPH                   0.47    0.49

Baseline: Single Feature Classifier Performance. As shown in Table 1, single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant descriptor SPH is less performant than its simpler BOVW variant. A possible explanation of this phenomenon may be that, due to the low amount of supervision, a classifier trained on high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement in multiple feature methods, with MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is constant for the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an interest in favor of the late fusion method when compared to single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the space and thus a higher risk of overfitting. The late fusion strategy is more advantageous compared to the respective single feature methods in this low supervision setup, bringing up to 4% with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Table 2: IMMED dataset: average accuracy of the multiple feature approaches.

Feature \ approach    MKL     MKL-TA   DAS     DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH              0.48    0.50     0.51    0.56     0.50     0.53
BOVW-CRFH             0.50    0.54     0.51    0.56     0.54     0.58
SPH-CRFH              0.48    0.51     0.50    0.54     0.54     0.57
BOVW-SPH-CRFH         0.48    0.51     0.51    0.56     —        —

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when it comes to low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves on all but the BOVW-SPH feature pair compared to DAS, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity a priori and boosts performances by another 3-4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal video continuity, and taking advantage of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured. The usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable, yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done in selecting more discriminant visual features for specific applications and in a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system to challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.
[15] M. A. Perez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martínez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.


final recognition performance can be boosted even further as more object information is used in each image. A context-based system for place and object recognition is presented in [5]. The main idea is to use context (the scene gist) as a prior to infer what objects can be present in a scene. The HMM-based place recognition system requires a considerable amount of training data, possible transition probabilities, and so forth, but it integrates naturally temporal information and a confidence measure to detect the fact of navigating in unknown locations. Probabilistic Latent Semantic Analysis (pLSA) was used in [28] to discover higher level topics (e.g., grass, forest, water) from low-level visual features and to build a novel low-dimensional representation used afterwards in a k-Nearest-Neighbor classifier. The study shows superior classification performance when passing from low-level visual features to high-level topics that could be loosely attributed to the context of the scene.

2.2.4. Place Recognition in Video. Place recognition from recorded videos brings both novel opportunities and information, but it also poses additional challenges and constraints. Much more image data can be extracted from video, while in practice only a small portion of it can be labeled manually. An additional piece of information that is often leveraged in the literature is the temporal continuity of the video stream.

A matching-based approach has been used in [29] to retrieve objects in video. Results show that simple matching produces a large number of false positive matches, but the usage of a stop list to remove the most frequent and most specific visual words, followed by a spatial consistency check, significantly improves the quality of the retrieval results. In [30], belief functions in the Bayesian filtering context are used to determine the confidence of a particular location at any time moment. The modeling involves sensor and motion models, which have to be trained offline with a sufficiently large annotated database. Indeed, the model has to learn the allowed transitions between places, which requires the annotated data to represent all possible transitions to be found in the test data.

An important group of methods performing simultaneous localization and mapping (SLAM) is widely used in robotics [31, 32]. The main idea in these methods is to simultaneously build and update a map of an unknown environment and to track in real time the current position of the camera. In our work, the construction of such a map is not necessary and may prove to be very challenging, since the environment can be very complex and constantly changing.

2.3. Multiple Feature Learning

2.3.1. Motivation. Different visual features capture different aspects of a scene, and the correct choice depends on the task to be solved [33]. To this end, even humans perform poorly when using only one source of perception [34]. Therefore, instead of designing a specific and adapted descriptor for each particular case, several visual descriptors can be combined in a more complex system, yielding increased discrimination power in a wider range of applications. Following the survey [35], two main approaches can be identified for the fusion of multiple features, depending on whether the fusion is done in the feature space (early fusion) or in the decision space (late fusion).

2.3.2. Early Fusion. Early fusion strategies focus on the combination of input features before using them in a classifier. In the case of kernel classifiers, the features can be seen as defining a new kernel that takes into account several features at once. This can be done by concatenating the features into a new, larger feature vector. A more general approach, Multiple Kernel Learning (MKL), also tries to estimate the optimal parameters for the kernel combination in addition to the classifier model. In our work, we evaluated the SimpleMKL [36] algorithm as a representative algorithm of the MKL family. The algorithm is based on gradient descent and learns a weighted linear combination of kernels. This approach has notably been applied in the context of object detection and classification [37–39] and image classification [40, 41].

2.3.3. Late Fusion. In the late fusion strategy, several base classifiers are trained independently and their outputs are fed to a special decision layer. This fusion strategy is commonly referred to as a stacking method and is discussed in depth in the multiple classifier systems literature [42–46]. This type of fusion allows the use of multiple visual features, leaving their exploitation to an algorithm which performs automatic feature selection or weighting according to the utility of each feature.

It is clear that nothing prevents using an SVM as the base classifier. Following the work of [47], it was shown that SVM outputs in the form of decision values can be combined linearly using the Discriminative Accumulation Scheme (DAS) [48] for confidence-based place recognition indoors. The following work evolved by relaxing the constraint of linearity of the combination, using a kernel function on the outputs of the individual single feature classifiers, giving rise to the Generalized DAS [49]. Results show a clear gain in performance when using different visual features or completely different modalities. Other works follow a similar reasoning but use different combination rules (max, product, etc.), as discussed in [50]. A comprehensive comparison of different fusion methods in the context of object classification is given in [51].

2.4. Learning with Unlabeled Data

2.4.1. Motivation. Standard supervised learning with single or multiple features is successful if enough labeled training samples are presented to the learning algorithm. In many practical applications, the amount of training data is limited, while a wealth of unlabeled data is often available and largely unused. It is well known that classifiers learned using only training data may suffer from overfitting or an incapability to generalize to the unlabeled data. In contrast, unsupervised methods do not use label information. They may detect a structure in the data; however, prior knowledge and correct assumptions about the data are necessary to be able to characterize a structure that is relevant for the task.

Semisupervised learning addresses this issue by leveraging labeled as well as unlabeled data [52, 53].


2.4.2. Graph-Based Learning. Given a labeled set L = {(x_i, y_i)}_{i=1}^{l} and an unlabeled set U = {x_j}_{j=l+1}^{l+u}, where x ∈ X and y ∈ {−1, +1}, the goal is to estimate class labels for the latter. The usual hypothesis is that the two sets are sampled i.i.d. according to the same joint distribution p(x, y). There is no intention to provide estimations on data outside the sets L and U. A deeper discussion of this issue can be found in [54] and in the references therein.

In graph-based learning, a graph composed of labeled or unlabeled nodes (in our case representing the images), interconnected by edges encoding the similarities, is built. Application-specific knowledge is used to construct the graph in such a way that the labels of nodes connected with a high-weight link are similar and that no or only a few weak links are present between nodes of different classes. This graph therefore encodes information on the smoothness of a learned function f on the graph, which corresponds to a measure of compatibility with the graph connectivity. The graph Laplacian [55, 56] can then be used directly as connectivity information to propagate information from labeled nodes to unlabeled nodes [57], or as a regularization term that penalizes nonsmooth labelings within a classifier, such as the Lap-SVM [58, 59].

From a practical point of view, the algorithm requires the construction of the full affinity matrix W, where all image pairs in the sequences are compared, and the computation of the associated Laplacian matrix L, which requires large amounts of memory, in O(n²). While theoretically attractive, the direct method scales poorly with the number of graph nodes, which seriously restricts its usage in a wide range of practical applications.

2.4.3. Co-Training from Multiple Features. Co-Training [60] is a wrapper algorithm that learns two discriminant classifiers in a joint manner. The method trains the two classifiers iteratively, such that in each iteration the highest confidence estimates on unlabeled data are fed into the training set of the other classifier. Classically, two views of the data, or two single feature splits of a dataset, are used. The main idea is that the solution or hypothesis space is significantly reduced if both trained classifiers agree on the data, which reduces the risk of overfitting, since each classifier also fits the initial labeled training set. More theoretical background and analysis of the method are given in Section 3.3.2.

The Co-Training algorithm was proposed in [60] as a solution to classify Web pages using both link and word information. The same method was applied to the problem of Web image annotation in [61, 62] and to automatic video annotation in [63]. The generalization capacity of Co-Training for different initial labeled training sets was studied in [64]. More analysis of the theoretical properties of the Co-Training method can be found in [65], such as rough estimates of the maximal number of iterations. A review of different variants of the Co-Training algorithm is given in [66], together with their comparative analysis.

2.4.4. Link between Graph and Co-Training Approaches. It is interesting to note the link [67, 68] between the Co-Training method and label propagation in a graph, since adding the most confident estimations in each Co-Training iteration can be seen as label propagation from labeled nodes to unlabeled nodes in a graph. This view of the method is further discussed and practically evaluated in [69] as a label propagation method on a combined graph built from the two individual views.

Graph-based methods are limited by the fact that graph edges encode low-level similarities that are computed directly from the input features. The Co-Training algorithm uses a discriminative model that can adapt to the data with each iteration and therefore achieve better generalization on unseen unlabeled data. In the next section, we will build a framework based on the Co-Training algorithm to propose our solution for image-based place recognition.

In this work, we attempt to leverage all available information from the image data that could help to provide cues for camera place recognition. Manual annotation of recorded video sequences requires a lot of human labor. The aim of this work is to evaluate the utility of unlabeled data within the Co-Training framework for image-based place recognition.

3. Proposed Approach

In this section, we present the architecture of the proposed method, which is based on the Co-Training algorithm, and then discuss each component of the system. The standard Co-Training algorithm (see Figure 2) allows benefiting from the information in the unlabeled part of the corpus by using a feedback loop to augment the training set, thus producing classifiers with augmented performance. In the standard algorithm formulation, the two classifiers are still separate, which does not leverage their complementarity to its maximum. The proposed method addresses this issue by providing a single output, using late classifier fusion and time filtering for temporal constraint enforcement.

We will present the different elements of the system in order of increasing abstraction. Single feature extraction, preparation, and classification using SVM will be presented in Section 3.1. Multiple feature late fusion and a proposed extension to take into account the time information will be introduced in Section 3.2. The complete algorithm combining those elements with the Co-Training algorithm will be developed in Section 3.3.

3.1. Single Feature Recognition Module. Each image is represented by a global signature vector. In the following sections, the visual features x_i^{(j)} ∈ X^{(j)} correspond to numerical representations of the visual content of the images, where the superscript (j) denotes the type of visual features.

3.1.1. SVM Classifier. In our work, we rely on Support Vector Machine (SVM) classifiers to carry out the decision operations. An SVM aims at finding the best class separation instead of modeling potentially complex within-class probability densities, as in generative models such as Naive Bayes [70]. The maximal margin separating hyperplane is motivated from the statistical learning theory viewpoint by linking the margin width to the classifier's generalization capability.

Figure 2: Workflow of the Co-Training algorithm.

Given a labeled set L = {(x_i, y_i)}_{i=1}^{l}, where x ∈ R^d and y ∈ {−1, +1}, a linear maximal margin classifier f(x) = w^T x + b can be found by solving

\min_{\mathbf{w}, b, \xi} \; \sum_{i=1}^{l} \xi_i + \lambda \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, l, \qquad (1)

for the hyperplane w ∈ R^d and its offset b ∈ R. In the regularization framework, the loss function, called the Hinge loss, is

\ell(\mathbf{x}, y, f(\mathbf{x})) = \max\left(1 - y_i f(\mathbf{x}_i),\, 0\right), \quad i = 1, \ldots, l, \qquad (2)

and the regularizer is

\Omega_{\mathrm{SVM}}(f) = \|\mathbf{w}\|^2. \qquad (3)

As will be seen from the discussion, the regularizer plays an important role in the design of learning methods. In the case of an SVM classifier, the regularizer in (3) reflects the objective to be maximized: maximum margin separation on the training data.
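For illustration, the following is a minimal numpy sketch of the quantities in (1)–(3); the function names, variable names, and synthetic data are ours and not part of the original method description.

import numpy as np

def hinge_loss(X, y, w, b):
    # Eq. (2): max(1 - y_i * f(x_i), 0), with f(x) = w^T x + b
    margins = y * (X @ w + b)
    return np.maximum(1.0 - margins, 0.0)

def regularized_objective(X, y, w, b, lam):
    # Eq. (1)/(3): sum of the slacks plus lambda * ||w||^2
    return hinge_loss(X, y, w, b).sum() + lam * np.dot(w, w)

# Synthetic check: 5 samples in R^3 with labels in {-1, +1}
X = np.random.randn(5, 3)
y = np.array([1, -1, 1, 1, -1])
w, b = np.zeros(3), 0.0
print(regularized_objective(X, y, w, b, lam=0.1))  # 5.0 for the zero classifier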

3.1.2. Processing of Nonlinear Kernels. The power of the SVM classifier owes much to its easy extension to the nonlinear case [71]. The highly nonlinear nature of the data can be taken into account seamlessly by using the kernel trick, such that the hyperplane is found in a feature space induced by an adapted kernel function k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ in a Reproducing Kernel Hilbert Space (RKHS). The implicit mapping x ↦ Φ(x) means that we can no longer find an explicit hyperplane (w, b),

since the mapping function is not known and may be of very large dimensionality. Fortunately, the decision function can be formulated in the so-called dual representation [71], and the solution minimizing the regularized risk, according to the Representer theorem, is then

f_k(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b, \quad k = 1, \ldots, c, \qquad (4)

where l is the number of labeled samples.

Bag of Words descriptors have been used intensively for efficient and discriminant image description. The linear kernel does not provide the best results with such representations, which have been more successful with kernels such as the Hellinger kernel, the χ²-kernel, or the intersection kernel [6, 27, 33]. Unfortunately, training with such kernels using the standard SVM tools is much less computationally efficient than using the linear inner product kernel, for which efficient SVM implementations exist [72]. In this work, we have therefore chosen to adapt the input features to the linear context using two different techniques. For the BOVW (Bag of Visual Words) [25] and SPH (Spatial Pyramid Histogram) [27] features, a Hellinger kernel was used. This kernel admits an explicit mapping function using a square root transformation φ([x_1 ⋯ x_d]^T) = [√x_1 ⋯ √x_d]^T. In this particular case, a linear embedding x′ = φ(x) can be computed explicitly and has the same dimensionality as the input feature. For the CRFH (Composed Receptive Field Histogram) features [26], the feature vector has a very large number of dimensions but is also extremely sparse, with between 500 and 4000 nonzero coefficients out of many millions of features in total. These features can be transformed into a linear embedding using Kernel Principal Component Analysis [73], in order to reduce them to a 500-dimension linear embedding vector.


In the following, we will therefore consider that all features are processed into a linear embedding x_i that is suitable for an efficient linear SVM. The utility of this processing will be evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations of the two visual feature classifiers. Other forms of efficient embedding, proposed in [38], could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting the linearized descriptors directly by x_i, without further indication, in the rest of this document.
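As an illustration of this preprocessing, the sketch below (our own minimal implementation) computes the explicit Hellinger embedding of a histogram feature such as BOVW or SPH; the L1 normalization step is an assumption on the input histograms rather than a detail stated in the text.

import numpy as np

def hellinger_embedding(hist):
    # Explicit map of the Hellinger kernel: element-wise square root,
    # so that a plain inner product on the output reproduces the kernel.
    h = np.asarray(hist, dtype=float)
    h = h / max(h.sum(), 1e-12)   # L1 normalization (assumed input convention)
    return np.sqrt(h)

bovw = np.array([3.0, 0.0, 5.0, 2.0])   # toy visual-word counts
x_lin = hellinger_embedding(bovw)        # ready for a linear SVM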

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to c > 2 classes is considered in a one-versus-all setup. Therefore, c independent classifiers are trained on the labeled data, each of which learns the separation between one class and the other classes. We will denote by f_k the decision function associated to class k ∈ {1, …, c}. The outcome of the classifier bank for a sample x can be represented as a score vector s(x) by concatenating the individual decision scores:

\mathbf{s}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_c(\mathbf{x})). \qquad (5)

In that case, the estimated class of a testing sample x_i is obtained from the largest positive score:

\hat{y}_i = \arg\max_{k=1,\ldots,c} f_k(\mathbf{x}_i). \qquad (6)
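A compact sketch of this one-versus-all decision rule, using the scikit-learn linear SVM on synthetic placeholder descriptors (all data and names below are hypothetical, not the paper's actual pipeline):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.random((60, 16))           # linearized descriptors (placeholder)
y_train = rng.integers(0, 3, size=60)    # 3 toy place labels
X_test = rng.random((10, 16))

clf = LinearSVC(C=1.0).fit(X_train, y_train)      # trains c one-vs-rest classifiers
scores = clf.decision_function(X_test)            # s(x) = (f_1(x), ..., f_c(x)), Eq. (5)
y_hat = clf.classes_[np.argmax(scores, axis=1)]   # Eq. (6): class of the largest score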

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work, we follow a late classifier fusion paradigm, with several classifiers being trained independently on different visual cues and their outputs fused into a single final decision. We motivate this choice, compared to the early fusion paradigm, by the fact that it allows an easier integration, at the decision level, of the augmented classifiers obtained by the Co-Training algorithm, as well as providing a natural extension to inject the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by L = {(x_i, y_i)}_{i=1}^{l} and the unlabeled set of patterns by U = {x_j}_{j=l+1}^{l+u}, where x ∈ X and the outcome of classification is a binary output y ∈ {−1, +1}.

The visual data may have p multiple cues describing the same image I_i. Suppose that p cues have been extracted from an image I_i:

\mathbf{x}_i \mapsto \left(\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \ldots, \mathbf{x}_i^{(p)}\right), \qquad (7)

where each cue x_i^{(j)} belongs to an associated descriptor space X^{(j)}.

Denote also by f^{(1)}, f^{(2)}, …, f^{(p)} the p decision functions, where f^{(j)} ∈ F^{(j)} is trained on the respective visual cue and provides an estimation ŷ_k^{(j)} for the pattern x_k^{(j)}. Then, for a visual cue t and a c-class classification in a one-versus-all setup, a score vector can be constructed:

\mathbf{s}^t = \left(f_1^t(\mathbf{x}), \ldots, f_c^t(\mathbf{x})\right). \qquad (8)

In our work, we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVM-DAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to combine linearly the scores returned by the same class decision function across the multiple visual cues t = 1, …, p. The novel combined decision function for a class j is then a linear combination

f_j^{\mathrm{DAS}}(\mathbf{x}) = \sum_{t=1}^{p} \beta_t f_j^t(\mathbf{x}), \qquad (9)

where the weight β_t is attributed to each cue according to its importance in the learning phase. The novel scores can then be used in the decision process, for example using the max score criterion.

The DAS scheme is an example of parallel classifier combination architectures [44] and implies a competition between the individual classifiers. The weights β_t can be found using a cross-validation procedure with the normalization constraint

\sum_{t=1}^{p} \beta_t = 1. \qquad (10)
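A minimal sketch of the DAS combination of (9)-(10) on per-cue score matrices (rows are frames, columns are classes); the arrays below are synthetic placeholders of our own.

import numpy as np

def das_fusion(score_list, betas):
    # Eq. (9): weighted sum of the per-cue score vectors.
    # Eq. (10): the weights are normalized to sum to one.
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()
    return sum(b * s for b, s in zip(betas, score_list))

s_bovw = np.random.randn(4, 6)   # scores of the BOVW classifier (4 frames, 6 classes)
s_crfh = np.random.randn(4, 6)   # scores of the CRFH classifier
fused = das_fusion([s_bovw, s_crfh], betas=[0.4, 0.6])
labels = fused.argmax(axis=1)    # max-score decision on the fused scores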

3.2.3. SVM Discriminant Accumulation Scheme. The SVM-DAS can be seen as a generalization of DAS, building a stacked architecture of multiple classifiers [44], where the individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue t and produces a score vector as in (8). Then, the single feature score vectors s_i^t corresponding to one particular pattern x_i are concatenated into a novel multifeature score vector z_i = [s_i^1, …, s_i^p]. A final top-level classifier can be trained on those novel features:

f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j. \qquad (11)

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between the base classifier outputs. If a linear kernel function is used,

k_{\mathrm{SVMDAS}}(\mathbf{z}_i, \mathbf{z}_j) = \langle \mathbf{z}_i, \mathbf{z}_j \rangle = \sum_{t=1}^{p} \langle \mathbf{s}_i^t, \mathbf{s}_j^t \rangle, \qquad (12)

then the decision function in (11) can be rewritten by exchanging the sums:

f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j = \sum_{t=1}^{p} \sum_{i=1}^{l} \alpha_{ij} y_i \langle \mathbf{s}^t, \mathbf{s}_i^t \rangle + b_j. \qquad (13)


Denoting w_j^t = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^t, we can rewrite the decision function using the input patterns and the learned weights:

f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p} \sum_{k=1}^{c} w_{jk}^t f_k^t(\mathbf{x}). \qquad (14)

This novel representation reveals that using a linear kernel in the SVM-DAS framework yields a classifier whose weights are learned for every possible linear combination of base classifier outputs. DAS can be seen as a special case in this context, but with significantly fewer parameters. The usage of a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well, and its parameters tuned.
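The stacking idea of SVM-DAS can be sketched as follows; this is our own minimal rendering with a linear top-level SVM and synthetic score vectors, not the paper's exact implementation.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
s1_train = rng.standard_normal((80, 6))   # scores of cue 1 on training data (placeholder)
s2_train = rng.standard_normal((80, 6))   # scores of cue 2
y_train = rng.integers(0, 6, size=80)
s1_test = rng.standard_normal((20, 6))
s2_test = rng.standard_normal((20, 6))

z_train = np.hstack([s1_train, s2_train])   # z = [s^1, s^2], cf. Section 3.2.3
z_test = np.hstack([s1_test, s2_test])

stacker = LinearSVC(C=1.0).fit(z_train, y_train)   # top-level classifier, Eq. (11)
y_hat = stacker.predict(z_test)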

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature, such that the visual content does not usually change much in a short period of time. In the case of topological place recognition indoors, this constraint may be useful, as place changes are encountered relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes within a relatively short period of time are discouraged. This lowers the proliferation of occasional, temporally localized misclassifications.

Let s_i^t = f^{(t)}(x_i) be the scores of a binary classifier for visual cue t, and let h be a temporal window of size 2τ + 1. Then, temporal accumulation can be written as

s_{i,\mathrm{TA}}^t = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^t, \qquad (15)

and it can be easily generalized to multiple feature classification by applying it separately to the output of the classifiers associated to each feature s^t, where t = 1, …, p is the visual feature type. We use an averaging filter of size τ defined as

h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \ldots, \tau. \qquad (16)

Therefore, the input of the TA stage is the set of SVM scores obtained after classification, and the output is again a set of SVM scores, with the temporal constraint enforced.
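A simple sketch of this temporal accumulation as a moving average over a score matrix (rows are frames, columns are classes); the zero-padding behavior at the sequence borders is an implementation choice of ours, not specified in the text.

import numpy as np

def temporal_accumulation(scores, tau):
    # Eq. (15)-(16): average each class score over 2*tau + 1 consecutive frames.
    kernel = np.ones(2 * tau + 1) / (2 * tau + 1)
    smoothed = np.empty_like(scores, dtype=float)
    for c in range(scores.shape[1]):
        smoothed[:, c] = np.convolve(scores[:, c], kernel, mode="same")
    return smoothed

scores = np.random.randn(1000, 6)                      # e.g., fused DAS scores
labels = temporal_accumulation(scores, tau=50).argmax(axis=1)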

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section, we will explain how to additionally learn from both labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training [60] is an algorithm that iteratively trains two classifiers on two-view data x_i = (x_i^{(1)}, x_i^{(2)}) by feeding the highest confidence estimates (with confidence scores z_i) from the testing set into the training set of the other view's classifier. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest confidence estimates. The pseudocode is presented in Algorithm 1, which could also be extended to multiple views as in [53].

The power of the method lies in its capability of learning from small training sets while eventually growing its discriminative properties on the large unlabeled data set, as more confident estimations are added into the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;

(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;

(3) the confident estimations on unlabeled data are helpful to predict the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met, unless N iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small difference from iteration t − 1 to t. The parameter-less version of Co-Training works until the complete exhaustion of the pool of unlabeled samples but requires a threshold on the confidence measure, which is used to separate high and low confidence estimates. In our work, we use this variant of the Co-Training algorithm.

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It remains to clarify why, and which mechanisms make such learning useful. It can be justified from the learning theory point of view: there are fewer possible solutions, or classifiers from the hypothesis space, that agree on the unlabeled data in the two views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier should be somehow restricted by the other classifier. The two trained classifiers that are coupled in this system effectively reduce the possible solution space. Each of these two classifiers is less likely to overfit, since each of them has been trained on its initial training set while taking into account the training process of the other classifier, carried out in parallel. We follow the discussion from [53] to give more insights about this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework. Let us introduce the Hinge loss function ℓ(x, y, f(x)), commonly used in classification.


INPUT:
Training set L = {(x_i, y_i)}_{i=1}^{l}
Testing set U = {x_i}_{i=1}^{u}

OUTPUT:
ŷ_i: class estimations for the testing set U
f^{(1)}, f^{(2)}: trained classifiers

PROCEDURE:
(1) Compute the visual features x_i = (x_i^{(1)}, x_i^{(2)}) for every image I_i in the dataset.
(2) Initialize L_1 = {(x_i^{(1)}, y_i)}_{i=1}^{l} and L_2 = {(x_i^{(2)}, y_i)}_{i=1}^{l}.
(3) Initialize U_1 = {x_i^{(1)}}_{i=1}^{u} and U_2 = {x_i^{(2)}}_{i=1}^{u}.
(4) Create two work sets Û_1 = U_1 and Û_2 = U_2.
(5) Repeat until the sets Û_1 and Û_2 are empty (CO):
    (a) Train the classifiers f^{(1)}, f^{(2)} using the sets L_1, L_2, respectively.
    (b) Classify the patterns in the sets Û_1 and Û_2 using the classifiers f^{(1)} and f^{(2)}, respectively:
        (i) compute the scores s^(1)_test and confidences z^(1) on the set Û_1;
        (ii) compute the scores s^(2)_test and confidences z^(2) on the set Û_2.
    (c) Select the k top confidence estimations L̂_1 ⊂ Û_1, L̂_2 ⊂ Û_2 and add them, with their estimated labels, to the training set of the other view:
        (i) L_1 = L_1 ∪ L̂_1, where L̂_1 contains the view-1 features of the patterns most confidently labeled by f^(2);
        (ii) L_2 = L_2 ∪ L̂_2, where L̂_2 contains the view-2 features of the patterns most confidently labeled by f^(1).
    (d) Remove the k top confidence patterns from the working sets:
        (i) Û_1 = Û_1 \ L̂_1;
        (ii) Û_2 = Û_2 \ L̂_2.
    (e) Go to step (5).
(6) Optionally perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
    (a) compute the fused scores s^DAS_test = (1 − β) s^(1)_test + β s^(2)_test;
    (b) output the class estimations ŷ_i from the fused scores s^DAS_test.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.

Let us also introduce the empirical risk of a candidate function f ∈ F:

\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell\left(\mathbf{x}_i, y_i, f(\mathbf{x}_i)\right), \qquad (17)

which measures how well the classifier fits the training data. It is well known that, by minimizing only the training error, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

f^{\mathrm{RRM}} = \arg\min_{f \in \mathcal{F}} \; \hat{R}(f) + \lambda \Omega(f), \qquad (18)

where Ω(f) is a nonnegative functional, or regularizer, that returns a large value or penalty for very complicated functions (typically the functions that fit the data perfectly). The parameter λ > 0 controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and a better generalization capability on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer Ω_SVM(f) = (1/2)‖w‖² selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning, we can select a regularizer such that it is sufficiently smooth on the unlabeled data as well. Keeping all of the previous discussion in mind, a function that fits the training data while respecting the unlabeled data will probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions f^{(1)}, f^{(2)} ∈ F that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as

\left(\hat{f}^{(1)}, \hat{f}^{(2)}\right) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell\left(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)\right) + \lambda_1 \Omega_{\mathrm{SVM}}\left(f^{(t)}\right) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell\left(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i)\right), \qquad (19)

where λ_2 > 0 controls the balance between an agreed fit on the training data and agreement on the test data.


Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b).

The first part of (19) states that each individual classifier should fit the given training data but should not overfit, which is prevented by the SVM regularizer Ω_SVM(f). The second part is a regularizer Ω_CO(f^{(1)}, f^{(2)}) for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits the unlabeled data, and works with two visual cues.

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, the algorithms following this spirit implement multiple view learning. Refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work, we will use the DAS method for late fusion, while it is also possible to use the more general SVM-DAS method. The experimental evaluation will show that very competitive performances can be obtained using the former, much simpler, method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. The experimental evaluation will reveal the relative performances of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS method (or CO-TA-DAS, if temporal accumulation is enabled) is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold k for the top-confidence sample selection, the temporal accumulation window width τ, and the late fusion parameter β. We express the threshold k as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, the selection of the parameter β is discussed in Section 4.1.2.
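For illustration only, the following sketch shows how one feedback iteration of CO-DAS/CO-TA-DAS could be glued together around scikit-learn-style classifiers. It is not the paper's Algorithm 1; the helper names, the use of predict for pseudo-labels, and the moving-average smoothing are assumptions.

import numpy as np

def co_das_iteration(clf1, clf2, X1_lab, X2_lab, y_lab,
                     X1_unlab, X2_unlab, confidence,
                     k=0.1, tau=0, beta=0.5):
    # Schematic sketch (not the paper's Algorithm 1). Assumes multiclass
    # classifiers whose decision_function returns an (n, c) score array.
    clf1.fit(X1_lab, y_lab)
    clf2.fit(X2_lab, y_lab)

    # Confidence of each classifier on the unlabeled data (before feedback)
    conf1 = confidence(clf1.decision_function(X1_unlab))
    conf2 = confidence(clf2.decision_function(X2_unlab))
    n_top = max(1, int(k * X1_unlab.shape[0]))
    top1 = np.argsort(-conf1)[:n_top]      # most confident for view 1
    top2 = np.argsort(-conf2)[:n_top]      # most confident for view 2

    # Co-Training feedback: each classifier receives the other's pseudo-labels
    pseudo1 = clf1.predict(X1_unlab[top1])
    pseudo2 = clf2.predict(X2_unlab[top2])
    clf1.fit(np.vstack([X1_lab, X1_unlab[top2]]), np.concatenate([y_lab, pseudo2]))
    clf2.fit(np.vstack([X2_lab, X2_unlab[top1]]), np.concatenate([y_lab, pseudo1]))

    s1 = clf1.decision_function(X1_unlab)
    s2 = clf2.decision_function(X2_unlab)

    # Temporal accumulation (TA): moving average over tau consecutive frames
    if tau > 1:
        kern = np.ones(tau) / tau
        smooth = lambda s: np.column_stack(
            [np.convolve(s[:, c], kern, mode="same") for c in range(s.shape[1])])
        s1, s2 = smooth(s1), smooth(s2)

    # DAS late fusion of the two score streams, weighted by beta
    fused = (1.0 - beta) * s1 + beta * s2
    return clf1.classes_[fused.argmax(axis=1)]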

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from SVM outputs. We review several of them and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with c classes, the scores {f_k(x)}_{k=1}^{c} are given. Then the probability, or classification confidence, is computed as

\[
P(y = k \mid \mathbf{x}) = \frac{\exp\left(f_{k}(\mathbf{x})\right)}{\sum_{i=1}^{c} \exp\left(f_{i}(\mathbf{x})\right)} \tag{20}
\]

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows the classifier output to be interpreted as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Moreover, forcing the scores to sum to one may not transfer all of their dynamics (e.g., very small or very large score values).
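A minimal sketch of the logistic (softmax) confidence of (20), assuming the one-versus-all scores are collected in an (n_samples, n_classes) array; the max-subtraction is only a numerical-stability detail and does not change the result.

import numpy as np

def logistic_confidence(scores):
    # scores: (n_samples, n_classes) array of one-versus-all SVM outputs f_k(x)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stability shift
    proba = e / e.sum(axis=1, keepdims=True)                 # eq. (20)
    return proba.max(axis=1)                                 # winning-class confidence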

Modeling Posterior Class Probabilities (Ruping). In [76] a parameter-less method was proposed which assigns the score value

\[
z = \begin{cases} p_{+} & \text{if } f(\mathbf{x}) > 1 \\ \dfrac{1 + f(\mathbf{x})}{2} & \text{if } -1 \le f(\mathbf{x}) \le 1 \\ p_{-} & \text{if } f(\mathbf{x}) < -1 \end{cases} \tag{21}
\]

where p_+ and p_- are the fractions of positive and negative score values, respectively. The authors argue that the interesting dynamics relevant to confidence estimation happen in the region of the margin, while patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem,


but it does not cover the multiclass case as required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain decisions corresponding to the "no action", "reject", or "do not know" situations in medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum-score estimate should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:

\[
z = f_{k^{*}}(\mathbf{x}) - \max_{k = 1, \dots, c;\; k \neq k^{*}} f_{k}(\mathbf{x}) \tag{22}
\]

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, the measure is problematic if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account in any of the above confidence measures. The one-versus-all setup for multiclass classification may yield ambiguous decisions: for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to several degrees and also treats the two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (larger positive score value). Cases with all positive or all negative scores are considered degenerate and are assigned z_i ← 0.

The computation is divided into two steps. First, we compute the standard Tommasi confidence measure

\[
z_{i}^{0} = f_{j^{*}}(\mathbf{x}_i) - \max_{k = 1, \dots, c;\; k \neq j^{*}} f_{k}(\mathbf{x}_i) \tag{23}
\]

then the measure z_i^0 is modified to account for class overlap:

\[
z_{i} = z_{i}^{0} \, \max\left(0,\; 1 - \frac{p_i - 1}{C}\right) \tag{24}
\]

where p_i = Card({k = 1, ..., c | f_k(x_i) > 0}) is the number of classes for which x_i has positive scores (class overlap). In the case where all scores are positive (∀k, f_k(x_i) > 0) or all negative (∀k, f_k(x_i) < 0), we set z_i ← 0.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows us to exclude them rather than assign them doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, confidence is lowered if more than one binary classifier returns a positive score.
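The proposed measure of (23)-(24) can be sketched as follows; the value of the constant C and the array layout of the scores are assumptions made for the example.

import numpy as np

def overlap_aware_confidence(scores, C=1.0):
    # scores: (n_samples, n_classes) one-versus-all outputs; C is an assumed constant
    order = np.sort(scores, axis=1)
    z0 = order[:, -1] - order[:, -2]                 # Tommasi contrast, eq. (23)
    p = (scores > 0).sum(axis=1)                     # number of positive scores
    z = z0 * np.maximum(0.0, 1.0 - (p - 1) / C)      # overlap penalty, eq. (24)
    z[(p == 0) | (p == scores.shape[1])] = 0.0       # degenerate cases: z_i <- 0
    return z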

4. Experimental Evaluation

In this section we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (a) on the public database IDOL2 in Section 4.1 and (b) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) randomly sampled training images across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insight into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database we extensively assess the place recognition performance of each independent part of the proposed system. For instance, we validate the utility of multiple features, the effect of temporal smoothing, the use of unlabeled data, and the different confidence measures.

The IMMED database is used for validation purposes; on it we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems for mobile robots in indoor environments.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from a computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search of the tree. This allows the introduction of visual words ranging from more general (higher-level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
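As an illustration of the tf-idf reweighting step, one possible vectorized implementation is sketched below; the exact weighting variant used in [25] may differ, so this is only one common formulation.

import numpy as np

def tfidf_bovw(histograms):
    # histograms: (n_images, n_visual_words) raw BOVW counts
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1)
    df = (histograms > 0).sum(axis=0)                        # word document frequency
    idf = np.log(histograms.shape[0] / np.maximum(df, 1))    # inverse document frequency
    return tf * idf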

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness when it comes to the spatial structure of the image. This is done by


Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

constructing a pyramid where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram from local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3 with a dictionary size of 200 visual words, which yields 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.
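The construction of such a descriptor can be sketched as follows, assuming the dense SIFT points have already been quantized into visual word indices; the grid layout (1 + 4 + 16 cells over 3 levels with a 200-word vocabulary) reproduces the 4200-dimensional size mentioned above, while the function and parameter names are hypothetical.

import numpy as np

def spatial_pyramid_histogram(word_ids, xy, img_size, levels=3, vocab=200):
    # word_ids: visual word index of each densely sampled SIFT point
    # xy: (n_points, 2) integer pixel coordinates of those points
    w, h = img_size
    parts = []
    for level in range(levels):
        g = 2 ** level                                   # g x g grid at this level
        col = np.minimum(xy[:, 0] * g // w, g - 1)
        row = np.minimum(xy[:, 1] * g // h, g - 1)
        cell = row * g + col
        for c in range(g * g):
            parts.append(np.bincount(word_ids[cell == c], minlength=vocab))
    return np.concatenate(parts)    # 21 cells x 200 words = 4200 dimensions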

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses from each specific filter. Due to its multidimensional nature and the size of an image, such a descriptor often results in a very high-dimensional vector. In our experimental evaluations we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The resulting global descriptors are very sparse vectors of up to 400 million dimensions. They were reduced to 500-dimensional descriptor vectors using KPCA with a χ² kernel [73].
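A possible realization of this reduction step with scikit-learn is sketched below; the kernel width gamma and the assumption that the descriptors fit in a dense nonnegative array are illustrative simplifications.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

def reduce_crfh(X_train, X_test, n_components=500, gamma=1.0):
    # Precomputed chi-square kernel between (nonnegative) CRFH histograms,
    # followed by kernel PCA projection to n_components dimensions.
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    Z_train = kpca.fit_transform(chi2_kernel(X_train, X_train, gamma=gamma))
    Z_test = kpca.transform(chi2_kernel(X_test, X_train, gamma=gamma))
    return Z_train, Z_test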

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two setups: random and video-versus-video. In both setups three image sets are considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves used, respectively, for validation and testing. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is reported as the average accuracy.
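A sketch of this protocol, with hypothetical helper names, is given below for clarity.

import numpy as np

def random_split(n_images, supervision, rng):
    # supervision: fraction of the corpus used as labeled training data
    idx = rng.permutation(n_images)
    n_train = int(supervision * n_images)
    n_val = (n_images - n_train) // 2
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 10-fold resampling at one supervision level; evaluate_fold is a hypothetical
# function returning the test accuracy for one split.
# accs = [evaluate_fold(*random_split(N, 0.05, np.random.default_rng(f)))
#         for f in range(10)]
# mean_accuracy = np.mean(accs)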

It is expected that the global place recognition performance rises from mediocre at low supervision to its maximum at high supervision levels.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.


Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels (1% to 50%). Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) for α = 0 and CRFH for α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL". The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or recorded with a large time span. The "ALL" set contains all 132 video pairs to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for the baseline single feature methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method combines two visual feature classifier outputs and provides a weighted score sum as output, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediate fusion values, which suggests feature complementarity. The fusion parameter α can be safely set to an intermediate value such as 0.5, and the

Figure 6: Comparison of single feature (BOVW, CRFH, and SPH Level 3) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

final performance would exceed that of every single feature classifier alone at all supervision levels.

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single feature and multiple feature approaches. It is clear that all methods perform better when more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performance. Both SVMDAS (late fusion) and SimpleMKL (early fusion) operate over the 3 single features considered and outperform the single feature baselines. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance (at the best α for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. The choice of the DAS method for the final system is therefore motivated by this result and by its simplified fusion parameter selection.
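For reference, the DAS fusion used in Figure 5 amounts to a weighted sum of the two per-class score arrays; a minimal sketch (with assumed array shapes) is given below.

import numpy as np

def das_fusion(scores_sph, scores_crfh, alpha=0.5):
    # Weighted sum of two (n_frames, n_classes) score arrays:
    # alpha = 0 keeps only SPH, alpha = 1 keeps only CRFH.
    return (1.0 - alpha) * scores_sph + alpha * scores_crfh

# predicted_class = das_fusion(s_sph, s_crfh, alpha=0.5).argmax(axis=1)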

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content which has not been leveraged so far in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results


Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size, for supervision levels of 1%, 5%, 10%, and 50% (IDOL2 dataset, SPH Level 3 features).

show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible solution. The assumption that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and the relatively slow room changes, the filter size can be set empirically, for example, to the number of frames captured in one second.
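A minimal sketch of such a temporal accumulation filter is given below; it simply averages the per-class SVM scores over a sliding window, with the window length as the only parameter.

import numpy as np

def temporal_accumulation(scores, window=5):
    # scores: (n_frames, n_classes) per-frame classifier scores; the filter
    # averages each class score over `window` consecutive frames.
    kern = np.ones(window) / window
    return np.column_stack([np.convolve(scores[:, c], kern, mode="same")
                            for c in range(scores.shape[1])])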

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we observe the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The low performance at low supervision compared to the standard supervised classifier can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single feature SVM with a semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The practical application of this method is limited, since the full kernel matrix must be computed and stored in memory, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. The left and right panels illustrate, respectively, the best choice of the amount of selected high-confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top-confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates; this may hint at overfitting, since the SPH descriptor is the richer visual descriptor.

Co-Training with More Iterations. Additional insights on the Co-Training algorithm can be gained when more than one iteration is performed (see Figure 10). The figures show the evolution of the performance of each single feature classifier as it is iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high-confidence estimates is added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.


Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers (BOF and CRFH co-training) of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, (bottom) different lighting conditions. The pairs include far-time pairs with the same lighting (minnie cloudy1/cloudy3, night1/night3, sunny1/sunny3) and close-time pairs with different lighting (minnie cloudy1/night1, cloudy1/sunny1, night1/sunny1).

Conclusion. The experiments carried out so far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using a semi-supervised Laplacian SVM with the assumption of low-density class separation. This performance comes at a high computational cost, large memory requirements, and the need for careful parameter tuning. This issue is solved by using the more efficient Co-Training algorithm, which is used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which training patterns are scattered across the database. Randomly labeled images may simulate the situation where small portions of video are annotated in a frame-by-frame manner; in the extreme, a few images from every class may be labeled manually.

In this context, we are interested in the performance of the single feature methods, the early and late fusion methods,


Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL) and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and of the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See the text for a more detailed explanation.

and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% correct classification for the BOVW- and CRFH-based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top-confidence estimates in one retraining iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top-confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on the less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of the high-confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and compensates for the performance drop of the BOVW classifier by weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)) the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise at 50% of video annotation in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative retraining procedure. The initial diversity of the classifiers determines what performance gain can be obtained using the Co-Training algorithm; this explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.


Figure 12: Comparison of the global performances for single feature (BOVW-SVM, CRFH-SVM), multiple feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback; the approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared by showing the influence of the amount of samples used for the Co-Training feedback loop. The baseline single feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%, which confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. This dataset gives a good illustration of the useful amount of high-confidence feedback, as shown in Figure 12. It is clear that only a portion of the test data is useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high-confidence estimates is around 20-50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by the standard fusion and simple

Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, and Tommasi, each with CO-DAS and CO-TA-DAS) for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

Co-Training methods. The proposed CO-DAS and CO-TA-DAS methods perform best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semi-supervised framework.

4.1.7. Effect of the Type of Confidence Measure. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performances for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce similar behavior with respect to the amount of feedback: first an increase of performance when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also included. When coupled with temporal accumulation, the proposed confidence measure has slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from the increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings


Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Most locations are represented by one short bootstrap sequence, briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations have no labeled bootstrap video; for these, a small, randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", or "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn the optimal parameter values of the employed methods, we used the standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature / approach    SVM    SVM-TA
BOVW                  0.49   0.52
CRFH                  0.48   0.53
SPH                   0.47   0.49


Baseline: Single Feature Classifier Performance. As shown in Table 1, the single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor performs worse than its simpler BOVW variant. A possible explanation for this phenomenon is that, due to the low amount of supervision, a classifier trained on the high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, with MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.



Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature / approach    MKL    MKL-TA   DAS    DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH              0.48   0.50     0.51   0.56     0.50     0.53
BOVW-CRFH             0.50   0.54     0.51   0.56     0.54     0.58
SPH-CRFH              0.48   0.51     0.50   0.54     0.54     0.57
BOVW-SPH-CRFH         0.48   0.51     0.51   0.56     —        —

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to the single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a greater risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Multiple feature information is therefore best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important when facing low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves on DAS for all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of video, and the exploitation of unlabeled data.

5. Conclusion

In this work we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on the public video sequence database IDOL2, then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that are useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all of this acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data thanks to a semi-supervised strategy. Additional work could be done in selecting more discriminant visual features for specific applications and in a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20-23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582-601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177-193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273-280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413-420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299-1302, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113-4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32-36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170-178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422-434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521-2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165-174, September 2009.
[15] M. A. Perez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1-4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1-7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255-264, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27-38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3-II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411-418, 2007.
[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524-531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161-2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1-6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169-2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470-1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229-241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657-1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489-1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345-379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606-613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436-443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902-909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550-553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5-12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950-959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578-II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394-2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522-529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221-228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321-328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149-1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92-100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1-4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485-1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454-465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196-201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189-196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135-1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[74] A. Pronobis, O. Martínez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298-320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83-85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696-703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458-1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.


2.4.2. Graph-Based Learning. Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and an unlabeled set $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \{-1, +1\}$, the goal is to estimate class labels for the latter. The usual hypothesis is that the two sets are sampled i.i.d. according to the same joint distribution $p(\mathbf{x}, y)$. There is no intention to provide estimations on data outside the sets $L$ and $U$; a deeper discussion of this issue can be found in [54] and in the references therein.

In graph-based learning, a graph composed of labeled and unlabeled nodes (in our case representing the images), interconnected by edges encoding pairwise similarities, is built. Application-specific knowledge is used to construct the graph in such a way that nodes connected by a high-weight link carry similar labels and that no, or only a few, weak links are present between nodes of different classes. This graph therefore encodes information on the smoothness of a learned function $f$ on the graph, which corresponds to a measure of compatibility with the graph connectivity. The graph Laplacian [55, 56] can then be used either directly as connectivity information to propagate labels from labeled nodes to unlabeled nodes [57], or as a regularization term that penalizes nonsmooth labelings within a classifier, such as the Lap-SVM [58, 59].

From a practical point of view, the algorithm requires the construction of the full affinity matrix $W$, in which all image pairs in the sequences are compared, and the computation of the associated Laplacian matrix $L$, which requires $O(n^2)$ memory. While theoretically attractive, the direct method scales poorly with the number of graph nodes, which seriously restricts its usage in a wide range of practical applications.

2.4.3. Co-Training from Multiple Features. Co-Training [60] is a wrapper algorithm that learns two discriminant classifiers in a joint manner. The method trains the two classifiers iteratively such that, in each iteration, the highest-confidence estimates on unlabeled data are fed into the training set of the other classifier. Classically, two views on the data, or two single-feature splits of a dataset, are used. The main idea is that the solution (or hypothesis) space is significantly reduced if both trained classifiers must agree on the data, which reduces the risk of overfitting, since each classifier also fits the initial labeled training set. More theoretical background and analysis of the method is given in Section 3.3.2.

The Co-Training algorithm was proposed in [60] as a solution to classify Web pages using both link and word information. The same method was applied to the problem of Web image annotation in [61, 62] and to automatic video annotation in [63]. The generalization capacity of Co-Training on different initial labeled training sets was studied in [64]. More analysis of the theoretical properties of the Co-Training method, such as rough estimates of the maximal number of iterations, can be found in [65]. A review of different variants of the Co-Training algorithm, together with their comparative analysis, is given in [66].

2.4.4. Link between Graph and Co-Training Approaches. It is interesting to note the link [67, 68] between the Co-Training method and label propagation in a graph, since adding the most confident estimations in each Co-Training iteration can be seen as label propagation from labeled nodes to unlabeled nodes in a graph. This view of the method is further discussed and practically evaluated in [69] as a label propagation method on a combined graph built from two individual views.

Graph-based methods are limited by the fact that graph edges encode low-level similarities that are computed directly from the input features. The Co-Training algorithm uses a discriminative model that can adapt to the data with each iteration and can therefore achieve better generalization on unseen unlabeled data. In the next section, we build a framework based on the Co-Training algorithm to propose our solution for image-based place recognition.

In this work, we attempt to leverage all available information from image data that could provide cues for camera place recognition. Manual annotation of recorded video sequences requires a lot of human labor. The aim of this work is to evaluate the utility of unlabeled data within the Co-Training framework for image-based place recognition.

3. Proposed Approach

In this section, we present the architecture of the proposed method, which is based on the Co-Training algorithm, and then discuss each component of the system. The standard Co-Training algorithm (see Figure 2) benefits from the information in the unlabeled part of the corpus by using a feedback loop to augment the training set, thus producing classifiers with improved performance. In the standard formulation, however, the two classifiers remain separate, which does not leverage their complementarity to its maximum. The proposed method addresses this issue by providing a single output, using late classifier fusion and time filtering for temporal constraint enforcement.

We present the different elements of the system in order of increasing abstraction. Single feature extraction, preparation, and classification using SVM are presented in Section 3.1. Multiple feature late fusion and a proposed extension taking time information into account are introduced in Section 3.2. The complete algorithm combining those elements with the Co-Training algorithm is developed in Section 3.3.

3.1. Single Feature Recognition Module. Each image is represented by a global signature vector. In the following sections, the visual features $\mathbf{x}_i^{(j)} \in \mathcal{X}^{(j)}$ correspond to numerical representations of the visual content of the images, where the superscript $(j)$ denotes the type of visual features.

3.1.1. SVM Classifier. In our work, we rely on Support Vector Machine (SVM) classifiers to carry out decision operations. The SVM aims at finding the best class separation instead of modeling potentially complex within-class probability densities, as is done in generative models such as Naive Bayes [70]. The maximal margin separating hyperplane is motivated from the



Figure 2: Workflow of the Co-Training algorithm.

statistical learning theory viewpoint, by linking the margin width to the classifier's generalization capability.

Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x} \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, a linear maximal margin classifier $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ can be found by solving

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \sum_{i=1}^{l} \xi_i + \lambda \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1 - \xi_i, \; \xi_i \geq 0, \; \forall i, \; i = 1, \ldots, l \quad (1)$$

for the hyperplane $\mathbf{w} \in \mathbb{R}^d$ and its offset $b \in \mathbb{R}$. In the regularization framework, the loss function, called the Hinge loss, is

$$\ell\left(\mathbf{x}, y, f(\mathbf{x})\right) = \max\left(1 - y_i f(\mathbf{x}_i),\, 0\right), \quad \forall i, \; i = 1, \ldots, l \quad (2)$$

and the regularizer is

$$\Omega_{\mathrm{SVM}}(f) = \|\mathbf{w}\|^2. \quad (3)$$

As will be seen from the discussion, the regularizer plays an important role in the design of learning methods. In the case of an SVM classifier, the regularizer in (3) reflects the objective to be maximized: maximum margin separation on the training data.
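As an illustration of this single-feature classification stage, the following minimal sketch trains a linear SVM (hinge loss plus $\|\mathbf{w}\|^2$ regularization) on precomputed image descriptors. It relies on scikit-learn's LinearSVC, which internally uses LIBLINEAR [72]; the toy data and variable names are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-ins for global image descriptors and binary room labels (assumption).
rng = np.random.default_rng(0)
X_train = rng.random((200, 64))          # 200 images, 64-D descriptors
y_train = rng.integers(0, 2, 200)        # binary labels {0, 1}

# Linear SVM: hinge loss + L2 regularization, as in (1)-(3).
# C plays the role of the inverse of the regularization weight lambda.
clf = LinearSVC(loss="hinge", C=1.0)
clf.fit(X_train, y_train)

# Signed decision scores f(x) = w^T x + b for new images.
X_test = rng.random((10, 64))
print(clf.decision_function(X_test))
```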

3.1.2. Processing of Nonlinear Kernels. The power of the SVM classifier owes much to its easy extension to the nonlinear case [71]. The highly nonlinear nature of the data can be taken into account seamlessly using the kernel trick, such that the hyperplane is found in a feature space induced by an adapted kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \langle\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle$ in a Reproducing Kernel Hilbert Space (RKHS). The implicit mapping $\mathbf{x} \mapsto \Phi(\mathbf{x})$ means that we can no longer find an explicit hyperplane $(\mathbf{w}, b)$, since the mapping function is not known and may be of very large dimensionality. Fortunately, the decision function can be formulated in the so-called dual representation [71], and the solution minimizing the regularized risk, according to the Representer theorem, is

$$f_k(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b, \quad k = 1, \ldots, c \quad (4)$$

where $l$ is the number of labeled samples.

Bag of Words descriptors have been used intensively for efficient and discriminant image description. The linear kernel does not provide the best results with such representations, which have been more successful with kernels such as the Hellinger kernel, the $\chi^2$ kernel, or the intersection kernel [6, 27, 33]. Unfortunately, training with such kernels using standard SVM tools is much less computationally efficient than using the linear inner product kernel, for which efficient SVM implementations exist [72]. In this work, we have therefore chosen to adapt the input features to the linear context using two different techniques. For the BOVW (Bag of Visual Words) [25] and SPH (Spatial Pyramid Histogram) [27] features, a Hellinger kernel was used. This kernel admits an explicit mapping function using a square root transformation, $\phi([x_1 \cdots x_d]^T) = [\sqrt{x_1} \cdots \sqrt{x_d}]^T$. In this particular case, a linear embedding $\mathbf{x}' = \phi(\mathbf{x})$ can be computed explicitly and has the same dimensionality as the input feature. For the CRFH (Composed Receptive Field Histogram) features [26], the feature vector has a very large number of dimensions but is also extremely sparse, with between 500 and 4000 nonzero coefficients out of many millions of potential features. These features can be transformed into a linear embedding using Kernel Principal Component Analysis [73], in order to reduce them to a 500-dimensional linear embedding vector.
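A minimal sketch of this explicit Hellinger embedding is shown below: the histogram is L1-normalized and mapped element-wise through a square root, so that a linear kernel on the embedded vectors equals the Hellinger kernel on the original histograms. The helper name and the toy histograms are illustrative assumptions.

```python
import numpy as np

def hellinger_embedding(hist: np.ndarray) -> np.ndarray:
    """Map a BOVW/SPH histogram so that <phi(h1), phi(h2)> is the Hellinger kernel."""
    h = hist / max(hist.sum(), 1e-12)   # L1 normalization
    return np.sqrt(h)                   # element-wise square root

# Toy example: two 5-bin visual word histograms (assumption).
h1 = np.array([3.0, 0.0, 1.0, 5.0, 1.0])
h2 = np.array([2.0, 1.0, 0.0, 6.0, 1.0])
e1, e2 = hellinger_embedding(h1), hellinger_embedding(h2)

# Linear kernel on embeddings == Hellinger kernel on the normalized histograms.
print(np.dot(e1, e2))
```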


In the following, we therefore consider that all features are processed into a linear embedding $\mathbf{x}_i$ that is suitable for an efficient linear SVM. The utility of this processing will become evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations for the two visual feature classifiers. Other forms of efficient embedding proposed in [38] could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting the linearized descriptors directly by $\mathbf{x}_i$, without further indication, in the rest of this document.

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to $c > 2$ classes is considered in a one-versus-all setup. Therefore, $c$ independent classifiers are trained on the labeled data, each of which learns the separation between one class and all the other classes. We denote by $f_k$ the decision function associated to class $k \in [\![1, c]\!]$. The outcome of the classifier bank for a sample $\mathbf{x}$ can be represented as a score vector $\mathbf{s}(\mathbf{x})$ by concatenating the individual decision scores:

$$\mathbf{s}(\mathbf{x}) = \left(f_1(\mathbf{x}), \ldots, f_c(\mathbf{x})\right). \quad (5)$$

In that case, the estimated class of a testing sample $\mathbf{x}_i$ is obtained from the largest positive score:

$$\hat{y}_i = \arg\max_{k=1,\ldots,c} f_k(\mathbf{x}_i). \quad (6)$$
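The following minimal sketch illustrates this one-versus-all decision rule: one linear SVM per class produces a score, the scores are concatenated into the vector of (5), and the class is chosen by the argmax of (6). The class count, descriptors, and labels are toy assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n_classes, dim = 5, 64                       # e.g., 5 rooms (assumption)
X = rng.random((300, dim))
y = rng.integers(0, n_classes, 300)

# One binary SVM per class (one-versus-all).
classifiers = []
for k in range(n_classes):
    clf = LinearSVC(loss="hinge", C=1.0)
    clf.fit(X, (y == k).astype(int))
    classifiers.append(clf)

def score_vector(x):
    """s(x) = (f_1(x), ..., f_c(x)) as in (5)."""
    return np.array([clf.decision_function(x[None, :])[0] for clf in classifiers])

x_test = rng.random(dim)
s = score_vector(x_test)
predicted_class = int(np.argmax(s))          # decision rule (6)
print(s, predicted_class)
```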

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work, we follow a late classifier fusion paradigm, with several classifiers trained independently on different visual cues and their outputs fused into a single final decision. We motivate this choice, compared to the early fusion paradigm, by the fact that it allows easier integration, at the decision level, of the augmented classifiers obtained by the Co-Training algorithm, as well as providing a natural extension to inject the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and the unlabeled set of patterns by $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and the outcome of classification is a binary output $y \in \{-1, +1\}$.

The visual data may have $p$ multiple cues describing the same image $I_i$. Suppose that $p$ cues have been extracted from an image $I_i$:

$$\mathbf{x}_i \longrightarrow \left(\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \ldots, \mathbf{x}_i^{(p)}\right), \quad (7)$$

where each cue $\mathbf{x}_i^{(j)}$ belongs to an associated descriptor space $\mathcal{X}^{(j)}$.

Denote also by $f^{(1)}, f^{(2)}, \ldots, f^{(p)}$ the $p$ decision functions, where $f^{(j)} \in \mathcal{F}^{(j)}$ is trained on the respective visual cue and provides the estimation $\hat{y}_k^{(j)}$ on the pattern $\mathbf{x}_k^{(j)}$. Then, for a visual cue $t$ and a $c$-class classification in a one-versus-all setup, a score vector can be constructed:

$$\mathbf{s}^t = \left(f_1^t(\mathbf{x}), \ldots, f_c^t(\mathbf{x})\right). \quad (8)$$

In our work, we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVM-DAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to combine linearly the scores returned by the same class decision function across the multiple visual cues $t = 1, \ldots, p$. The combined decision function for a class $j$ is then a linear combination

$$f_j^{\mathrm{DAS}}(\mathbf{x}) = \sum_{t=1}^{p} \beta_t f_j^t(\mathbf{x}), \quad (9)$$

where the weight $\beta_t$ is attributed to each cue according to its importance in the learning phase. The new scores can then be used in the decision process, for example using the max-score criterion.

The DAS scheme is an example of a parallel classifier combination architecture [44] and implies a competition between the individual classifiers. The weights $\beta_t$ can be found using a cross-validation procedure with the normalization constraint

$$\sum_{t=1}^{p} \beta_t = 1. \quad (10)$$
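A minimal sketch of DAS late fusion under these definitions: per-cue score vectors are combined with weights that sum to one, and the class decision is the argmax of the fused scores. The two-cue weights and toy scores below are illustrative assumptions.

```python
import numpy as np

def das_fusion(score_vectors, betas):
    """Fuse per-cue one-vs-all score vectors with weights summing to 1 (eqs. (9), (10))."""
    betas = np.asarray(betas, dtype=float)
    assert np.isclose(betas.sum(), 1.0), "DAS weights must sum to 1"
    return sum(b * s for b, s in zip(betas, score_vectors))

# Toy scores for one image from two cues (e.g., SPH and CRFH) over 5 classes (assumption).
s_sph  = np.array([-0.2, 1.3, 0.1, -0.7, 0.4])
s_crfh = np.array([ 0.1, 0.8, 0.9, -0.5, 0.2])

fused = das_fusion([s_sph, s_crfh], betas=[0.5, 0.5])
print(fused, int(np.argmax(fused)))
```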

3.2.3. SVM Discriminant Accumulation Scheme. SVM-DAS can be seen as a generalization of DAS, building a stacked architecture of multiple classifiers [44] in which the individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue $t$ and produces a score vector as in (8). Then the single-feature score vectors $\mathbf{s}_i^t$ corresponding to one particular pattern $\mathbf{x}_i$ are concatenated into a new multifeature score vector $\mathbf{z}_i = [\mathbf{s}_i^1, \ldots, \mathbf{s}_i^p]$. A final top-level classifier can be trained on those new features:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j. \quad (11)$$

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between the base classifier outputs. If a linear kernel function is used,

$$k_{\mathrm{SVMDAS}}(\mathbf{z}_i, \mathbf{z}_j) = \langle\mathbf{z}_i, \mathbf{z}_j\rangle = \sum_{t=1}^{p} \langle\mathbf{s}_i^t, \mathbf{s}_j^t\rangle, \quad (12)$$

then the decision function in (11) can be rewritten by exchanging the sums:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j = \sum_{t=1}^{p} \sum_{i=1}^{l} \alpha_{ij} y_i \langle\mathbf{s}^t, \mathbf{s}_i^t\rangle + b_j. \quad (13)$$


Denoting $\mathbf{w}_j^t = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^t$, we can rewrite the decision function using the input patterns and the learned weights:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p} \sum_{k=1}^{c} w_{jk}^t f_k^t(\mathbf{x}). \quad (14)$$

This representation reveals that using a linear kernel in the SVM-DAS framework yields a classifier whose weights are learned for every possible linear combination of the base classifier outputs. DAS can be seen as a special case in this context, but with significantly fewer parameters. Using a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.
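A minimal sketch of this stacked (SVM-DAS-style) fusion under the above definitions: per-cue score vectors are concatenated into $\mathbf{z}_i$ and a final linear SVM is trained on them. The score matrices below are toy assumptions standing in for the outputs of already-trained per-cue classifiers.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n_samples, n_classes = 300, 5
y = rng.integers(0, n_classes, n_samples)

# Toy per-cue score vectors s^1_i, s^2_i (would come from base SVMs in practice).
s_cue1 = rng.normal(size=(n_samples, n_classes)) + np.eye(n_classes)[y]
s_cue2 = rng.normal(size=(n_samples, n_classes)) + np.eye(n_classes)[y]

# Concatenate into z_i = [s^1_i, s^2_i] and train the top-level classifier of (11).
Z = np.hstack([s_cue1, s_cue2])
top = LinearSVC(loss="hinge", C=1.0)
top.fit(Z, y)
print(top.score(Z, y))      # training accuracy of the stacked classifier
```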

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well, and its parameters tuned.

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature, such that the visual content does not usually change much over a short period of time. In the case of topological place recognition indoors, this constraint is useful because place changes occur relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes within a relatively short period of time are discouraged. This limits the proliferation of occasional, temporally localized misclassifications.

Let $s_i^t = f^{(t)}(\mathbf{x}_i)$ be the score of a binary classifier for visual cue $t$, and $h$ a temporal window of size $2\tau + 1$. Then temporal accumulation can be written as

$$s_{i,\mathrm{TA}}^t = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^t \quad (15)$$

and can easily be generalized to multiple-feature classification by applying it separately to the output of the classifiers associated to each feature $\mathbf{s}^t$, where $t = 1, \ldots, p$ is the visual feature type. We use an averaging filter of size $\tau$ defined as

$$h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \ldots, \tau. \quad (16)$$

Therefore, the input of the TA module consists of the SVM scores obtained after classification, and the output is again a set of SVM scores, with the temporal constraint enforced.
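A minimal sketch of this temporal accumulation, assuming the per-class SVM scores of a video are stored as a (frames × classes) array: each class score sequence is smoothed with the averaging window of (15)-(16). The array shape and the value of $\tau$ are illustrative assumptions.

```python
import numpy as np

def temporal_accumulation(scores: np.ndarray, tau: int) -> np.ndarray:
    """Average each class score over a sliding window of size 2*tau + 1 (eqs. (15), (16))."""
    h = np.ones(2 * tau + 1) / (2 * tau + 1)          # uniform averaging filter
    smoothed = np.empty_like(scores, dtype=float)
    for k in range(scores.shape[1]):                  # one score sequence per class
        smoothed[:, k] = np.convolve(scores[:, k], h, mode="same")
    return smoothed

# Toy score matrix: 1000 frames, 5 classes (assumption).
rng = np.random.default_rng(3)
raw_scores = rng.normal(size=(1000, 5))
ta_scores = temporal_accumulation(raw_scores, tau=50)
labels = ta_scores.argmax(axis=1)                     # per-frame decision after smoothing
print(labels[:10])
```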

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section, we explain how to additionally learn from both labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training algorithm [60] iteratively trains two classifiers on two-view data $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ by feeding the highest-confidence (score $z_i$) estimates from the testing set into the training set of the other view's classifier. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest confidence estimates. The pseudocode is presented in Algorithm 1; it could also be extended to multiple views, as in [53].

The power of the method lies in its capability to learn from small training sets and to eventually grow its discriminative power on the large unlabeled data set, as more confident estimations are added to the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;

(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;

(3) the confident estimations on unlabeled data are helpful to predict the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or $N$ iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small change from iteration $t-1$ to $t$. The parameter-less version of Co-Training runs until the pool of unlabeled samples is completely exhausted, but it requires a threshold on the confidence measure, which is used to separate high- and low-confidence estimates. In our work, we use this variant of the Co-Training algorithm.

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It is less obvious why, and through which mechanisms, such learning is useful. It can be justified from the learning theory point of view: there are fewer possible solutions (classifiers) in the hypothesis space that agree on the unlabeled data in both views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier is additionally restricted by the other classifier. The two trained classifiers coupled in this system effectively reduce the possible solution space. Each of the two classifiers is less likely to overfit, since each has been trained on its own training set while taking into account the training process of the other classifier, carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework.


INPUT:
  Training set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$
  Testing set $U = \{\mathbf{x}_i\}_{i=1}^{u}$
OUTPUT:
  $\hat{y}_i$: class estimations for the testing set $U$; $f^{(1)}, f^{(2)}$: trained classifiers
PROCEDURE:
(1) Compute visual features $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ for every image $I_i$ in the dataset.
(2) Initialize $L_1 = \{(\mathbf{x}_i^{(1)}, y_i)\}_{i=1}^{l}$ and $L_2 = \{(\mathbf{x}_i^{(2)}, y_i)\}_{i=1}^{l}$.
(3) Initialize $U_1 = \{\mathbf{x}_i^{(1)}\}_{i=1}^{u}$ and $U_2 = \{\mathbf{x}_i^{(2)}\}_{i=1}^{u}$.
(4) Create two work sets $\hat{U}_1 = U_1$ and $\hat{U}_2 = U_2$.
(5) Repeat until the sets $\hat{U}_1$ and $\hat{U}_2$ are empty (CO):
  (a) Train classifiers $f^{(1)}, f^{(2)}$ using the sets $L_1, L_2$, respectively.
  (b) Classify the patterns in the sets $\hat{U}_1$ and $\hat{U}_2$ using the classifiers $f^{(1)}$ and $f^{(2)}$, respectively:
    (i) compute scores $s^{(1)}_{\text{test}}$ and confidences $z^{(1)}$ on the set $\hat{U}_1$;
    (ii) compute scores $s^{(2)}_{\text{test}}$ and confidences $z^{(2)}$ on the set $\hat{U}_2$.
  (c) Add the $k$ top-confidence estimations $\hat{L}_1 \subset \hat{U}_1$, $\hat{L}_2 \subset \hat{U}_2$ to the training set of the other view:
    (i) $L_1 = L_1 \cup \hat{L}_2$;
    (ii) $L_2 = L_2 \cup \hat{L}_1$.
  (d) Remove the $k$ top-confidence patterns from the working sets:
    (i) $\hat{U}_1 = \hat{U}_1 \setminus \hat{L}_1$;
    (ii) $\hat{U}_2 = \hat{U}_2 \setminus \hat{L}_2$.
  (e) Go to step (5).
(6) Optionally, perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
  (a) compute fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}} = (1 - \beta)\, \mathbf{s}^{(1)}_{\text{test}} + \beta\, \mathbf{s}^{(2)}_{\text{test}}$;
  (b) output class estimations $\hat{y}_i$ from the fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}}$.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.
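The following minimal sketch illustrates one way to implement this co-training loop with two linear SVMs, assuming two precomputed feature views and a margin-based confidence. It is a simplified illustration of Algorithm 1 (single-label top-k feedback, no DAS/TA stage); the names, toy data, and confidence rule are assumptions, not the exact implementation used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def top_confidence(clf, X):
    """Indices sorted by decreasing confidence (largest one-vs-all margin)."""
    scores = clf.decision_function(X)
    if scores.ndim == 1:                          # binary case
        conf = np.abs(scores)
    else:                                         # multiclass: contrast of two best scores
        part = np.sort(scores, axis=1)
        conf = part[:, -1] - part[:, -2]
    return np.argsort(-conf)

def co_training(X1, X2, y, labeled_idx, k=50, max_rounds=10):
    """Two-view co-training with cross feedback of the k most confident estimates."""
    L1, L2 = list(labeled_idx), list(labeled_idx)
    y_work = y.copy().astype(float)
    unlabeled = [i for i in range(len(y)) if i not in labeled_idx]
    clf1, clf2 = LinearSVC(loss="hinge"), LinearSVC(loss="hinge")
    for _ in range(max_rounds):
        if not unlabeled:
            break
        clf1.fit(X1[L1], y_work[L1])
        clf2.fit(X2[L2], y_work[L2])
        U = np.array(unlabeled)
        top1 = U[top_confidence(clf1, X1[U])[:k]]   # view-1 confident estimates
        top2 = U[top_confidence(clf2, X2[U])[:k]]   # view-2 confident estimates
        y_work[top1] = clf1.predict(X1[top1])
        y_work[top2] = clf2.predict(X2[top2])
        L1 += list(top2)                            # feed view-2 estimates to view 1
        L2 += list(top1)                            # and vice versa
        unlabeled = [i for i in unlabeled if i not in set(top1) | set(top2)]
    return clf1, clf2

# Toy two-view data (assumption): 500 images, 5 classes.
rng = np.random.default_rng(4)
y = rng.integers(0, 5, 500)
X1 = rng.normal(size=(500, 64)) + np.eye(5)[y] @ rng.normal(size=(5, 64))
X2 = rng.normal(size=(500, 32)) + np.eye(5)[y] @ rng.normal(size=(5, 32))
clf1, clf2 = co_training(X1, X2, y, labeled_idx=list(range(25)))
```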

Let us introduce the Hinge loss function $\ell(\mathbf{x}, y, f(\mathbf{x}))$ commonly used in classification, and the empirical risk of a candidate function $f \in \mathcal{F}$:

$$\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell\left(\mathbf{x}_i, y_i, f(\mathbf{x}_i)\right), \quad (17)$$

which measures how well the classifier fits the training data. It is well known that, by minimizing the training error only, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

$$f^{\mathrm{RRM}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda\, \Omega(f), \quad (18)$$

where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value (penalty) for very complicated functions (typically the functions that fit the data perfectly). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and better generalization on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer $\Omega_{\mathrm{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^2$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning, we can select a regularizer such that it is sufficiently smooth on the unlabeled data as well. Keeping the previous discussion in mind, a function that fits the training data while respecting the unlabeled data will probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and, at the same time, agree on the unlabeled data. The restriction on the hypothesis space is that each function should not only reduce its own regularized risk but also agree with the other function. We can then write a two-view regularized risk minimization problem as

$$\left(\hat{f}^{(1)}, \hat{f}^{(2)}\right) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell\left(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)\right) + \lambda_1 \Omega_{\mathrm{SVM}}\left(f^{(t)}\right) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell\left(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i)\right), \quad (19)$$

where $\lambda_2 > 0$ controls the balance between an agreed fit on the training data and agreement on the test data. The first part of (19) states that each individual classifier should fit the



Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b).

given training data but should not overfit, which is prevented by the SVM regularizer $\Omega_{\mathrm{SVM}}(f)$. The second part is a regularizer $\Omega_{\mathrm{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. An algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, algorithms following this spirit implement multiple-view learning; refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first is that it is not known in advance which of the two classifiers performs best, nor whether their complementarity has been leveraged to its maximum. The second is that no time information is used, unless the visual features are constructed to capture it.

In this work, we use the DAS method for late fusion, although the more general SVM-DAS method could be used as well; experimental evaluation will show that very competitive performance can be obtained with the former, much simpler, method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. Experimental evaluation will reveal the relative performance of each method with respect to the baselines and with respect to each other.

The full algorithm of the CO-DAS (or CO-TA-DAS, if temporal accumulation is enabled) method is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold $k$ for the top-confidence sample selection, the temporal accumulation window width $\tau$, and the late fusion parameter $\beta$. We express the threshold $k$ as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, the selection of the parameter $\beta$ is discussed in Section 4.1.2.

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several of them and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with $c$ classes, the scores $\{f_k(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as

$$P(y = k \mid \mathbf{x}) = \frac{\exp\left(f_k(\mathbf{x})\right)}{\sum_{i=1}^{c} \exp\left(f_i(\mathbf{x})\right)}, \quad (20)$$

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows the classifier output to be interpreted as a probability. There are at least two drawbacks with this measure. It does not take into account the cases where all classifiers in the one-versus-all setup reject the pattern (all negative scores) or all accept it (all positive scores). Moreover, forcing the scores to sum to one may not preserve all of their dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76], a parameter-less method was proposed, which assigns the score value

$$z = \begin{cases} p_{+} & f(\mathbf{x}) > 1 \\ \dfrac{1 + f(\mathbf{x})}{2} & -1 \leq f(\mathbf{x}) \leq 1 \\ p_{-} & f(\mathbf{x}) < -1 \end{cases} \quad (21)$$

where $p_{+}$ and $p_{-}$ are the fractions of positive and negative score values, respectively. The authors argue that the dynamics relevant to confidence estimation happen in the margin region, and that patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem,


but it does not cover the multiclass case, as required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum-score estimate should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:

$$z = f_{k^*}(\mathbf{x}) - \max_{k=1,\ldots,c,\; k \neq k^*} f_k(\mathbf{x}). \quad (22)$$

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, there is an issue with the measure if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account by any of the above confidence measure computation procedures. The one-versus-all setup for multiclass classification may yield ambiguous decisions; for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to several degrees and also treats the two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (a larger positive score value). Cases with all positive or all negative scores are considered degenerate, and we set $z_i \leftarrow 0$. The computation is divided in two steps. First, we compute the standard Tommasi confidence measure

$$z_i^0 = f_{k^*}(\mathbf{x}_i) - \max_{k=1,\ldots,c,\; k \neq k^*} f_k(\mathbf{x}_i); \quad (23)$$

then the measure $z_i^0$ is modified to account for class overlap:

$$z_i = z_i^0 \max\left(0,\, 1 - \frac{p_i - 1}{C}\right), \quad (24)$$

where $p_i = \mathrm{Card}\left(\{k = 1, \ldots, c \mid f_k(\mathbf{x}_i) > 0\}\right)$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In the case $\forall k\, f_k(\mathbf{x}_i) > 0$ or $\forall k\, f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows them to be excluded instead of being assigned doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, confidence is lowered if more than one binary classifier returns a positive score.
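A minimal sketch of this class-overlap-aware confidence, assuming a vector of one-versus-all SVM scores per sample; the function name and the interpretation of $C$ as the number of classes are assumptions made for illustration.

```python
import numpy as np

def overlap_aware_confidence(scores: np.ndarray) -> float:
    """Confidence of eqs. (23)-(24): Tommasi contrast, penalized by class overlap."""
    n_classes = scores.shape[0]
    positives = int(np.sum(scores > 0))
    # Degenerate cases: all classifiers accept or all reject -> zero confidence.
    if positives == 0 or positives == n_classes:
        return 0.0
    ordered = np.sort(scores)
    z0 = ordered[-1] - ordered[-2]                      # contrast of the two top scores (23)
    C = n_classes                                       # assumption: C is the class count
    return z0 * max(0.0, 1.0 - (positives - 1) / C)     # overlap penalty (24)

# Toy one-vs-all score vectors (assumption).
print(overlap_aware_confidence(np.array([-1.2, 0.8, -0.3, -0.9])))   # one positive score
print(overlap_aware_confidence(np.array([ 0.6, 0.8, -0.3, -0.9])))   # two positive -> penalized
```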

4. Experimental Evaluation

In this section, we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2 in Section 4.1, and (2) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter was recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experiment setups are created: (a) randomly sampled training images across the whole corpus, and (b) a more realistic video-versus-video setup. The first experiment allows a gradual increase of supervision, which gives insight into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database, we extensively assess the place recognition performance of each independent part of the proposed system. In particular, we validate the utility of multiple features, the effect of temporal smoothing, of unlabeled data, and of the different confidence measures.

The IMMED database is used for validation purposes; on it, we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess the place recognition systems of mobile robots in indoor environments.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences are visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms, due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section, we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work, we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search in the tree. This allows visual words ranging from more general (higher-level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
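As an illustration of this weighting step, the sketch below applies standard tf-idf to a small set of BOVW histograms; the histogram values and the exact idf variant (log of the inverse document frequency) are assumptions for illustration, not necessarily the variant used in [25].

```python
import numpy as np

def tfidf_weighting(histograms: np.ndarray) -> np.ndarray:
    """Apply tf-idf to BOVW histograms (rows = images, columns = visual words)."""
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1e-12)
    df = np.count_nonzero(histograms > 0, axis=0)             # images containing each word
    idf = np.log(histograms.shape[0] / np.maximum(df, 1))     # rare words weigh more
    return tf * idf

# Toy corpus: 4 images, 6 visual words (assumption).
H = np.array([[5, 0, 1, 0, 2, 0],
              [4, 1, 0, 0, 3, 0],
              [0, 0, 2, 6, 1, 0],
              [1, 0, 0, 5, 2, 1]], dtype=float)
print(tfidf_weighting(H))
```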

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor while addressing its weakness regarding the spatial structure of the image. This is done by



Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

constructing a pyramid, where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram from local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3, with a dictionary size of 200 visual words, which yields 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned by filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses from each specific filter. Due to the multidimensional nature and the size of an image, this descriptor often results in a very high-dimensional vector. In our experimental evaluations, we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The resulting global descriptor is very sparse, with up to 400 million dimensions. It was reduced to a 500-dimensional linear descriptor vector using KPCA with a $\chi^2$ kernel [73].

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (camera 98 cm above the ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two setups: random and video-versus-video. In both setups, three image sets are considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels, ranging from 1% to 50%. The remaining images are split randomly into two halves, used respectively for validation and testing purposes. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level, and the final result is returned as the average accuracy.

It is expected that global place recognition performance rises from a mediocre level at low supervision to its maximum at high supervision levels.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs: the first video is completely annotated, while the second is used for evaluation purposes.



Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter $\alpha$ that balances the fusion between SPH features (3 levels) if $\alpha = 0$ and CRFH if $\alpha = 1$ (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We distinguish three sets of pairs: "EASY", "HARD", and "ALL". The "EASY" set contains only the video sequence pairs for which the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions, or video sequences recorded with a large time span between them. The "ALL" set contains all 132 video pairs, to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for the single-feature baseline methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum as output, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter $\alpha$ at different supervision levels. Interesting dynamics can be noticed for intermediate fusion values, which suggest feature complementarity. The fusion parameter $\alpha$ can safely be set to an intermediate value such as 0.5, and the final performance then exceeds that of every single-feature classifier alone at all supervision levels.

Figure 6: Comparison of single-feature (BOVW, CRFH, and SPH) and multiple-feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on classification performance is shown for single-feature and multiple-feature approaches. Clearly, all methods perform better when more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single-feature approaches, with SPH providing the best results. Both SVMDAS (a late fusion approach) and SimpleMKL (an early fusion approach) operate fusion over the 3 single features considered. They outperform the single-feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance (at the best $\alpha$ for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. The choice of the DAS method for the final system is therefore motivated by this result and by its simpler fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content, which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single-feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results



Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This indicates that the minority of temporally close images that receive an erroneous label, while being very likely to carry the same class label as their neighbors, can be corrected, and that temporal accumulation is a possible solution. The assumption that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated into the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and that room changes are relatively slow, the filter size can be set empirically, for example to the number of frames captured in one second.

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semisupervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single-feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we observe the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database, comparing single-feature and multifeature baselines to the proposed semisupervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semisupervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semisupervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The lower performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.


Figure 8: Comparison of a standard single-feature SVM with a semisupervised Laplacian SVM with an RBF kernel, on SPH Level 3 visual features (IDOL2 dataset, random setup).

The practical applicability of this method is limited, since the full kernel matrix must be computed and stored in memory, which scales as $O(n^2)$ with the number of patterns. The computational time scales as $O(n^3)$, which is clearly prohibitive for medium and large datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets, thanks to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. Figures 9(b) and 9(c) illustrate, respectively, the best choice of the amount of selected high-confidence patterns for classifier retraining and the DAS fusion parameter selected by the cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top-confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Additional insights into the Co-Training algorithm can be gained by performing more than one iteration (see Figure 10). The figures show the evolution of the performance of each single-feature classifier as it is iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high-confidence estimates is added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, following the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.



Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single-feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS $\alpha$ parameter for late fusion (IDOL2 dataset, random setup).


Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, (bottom) different lighting conditions. (Panels: far in time, minnie cloudy1/cloudy3, minnie night1/night3, minnie sunny1/sunny3; close in time, minnie cloudy1/night1, minnie cloudy1/sunny1, minnie night1/sunny1.)

Conclusion. The experiments carried out this far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using the semisupervised Laplacian SVM under the assumption of low-density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. These issues are solved by the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which the training patterns are scattered across the database. Randomly labeled images may simulate the situation where some small portions of video are annotated frame by frame; in the extreme case, a few images from every class may be labeled manually.

In this context, we are interested in the performance of the single-feature methods, the early and late fusion methods,



Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL) and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

and the proposed semisupervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL, attributing equal weights to both visual features.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single-feature methods, which achieve approximately 50% and 60% correct classification for the BOVW and CRFH based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top-confidence estimates in one retraining iteration, yielding respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top-confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high-confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single-feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in this random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features, even when no Co-Training iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single-feature classifiers need a sufficient number of training samples to bootstrap the iterative retraining procedure. The diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm; this explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity, but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.


Figure 12: Comparison of the global performances for single-feature (BOVW-SVM, CRFH-SVM), multiple-feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared by showing the influence of the amount of samples used for the Co-Training feedback loop. The baseline single-feature methods perform approximately equally, delivering around 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size $\tau = 50$ was used).

Confidence Measure. On this dataset, a good illustration concerning the amount of high confidence estimates is shown in Figure 12. It is clear that only a portion of the test set data can be useful for classifier re-training. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single-feature baselines are outperformed by the standard fusion and the simple Co-Training methods.


Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, and Tommasi) for the Co-Training feedback loop, for the CO-DAS and CO-TA-DAS methods. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

The proposed CO-DAS and CO-TA-DAS methods perform the best by successfully leveraging the two visual features, the temporal continuity of the video, and the unlabeled data within a semi-supervised framework.

4.1.7. Effect of the Type of Confidence Measures. Figure 13 represents the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performances of the Ruping approach are not reported, as they were much lower than those of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours.



Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second and a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Most locations are represented with one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological, unconstrained environment.

Across the whole corpus, the bootstrap video is typically 35 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; a small, randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been equalized such that every frame could carry one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature \ approach    SVM     SVM-TA
BOVW                  0.49    0.52
CRFH                  0.48    0.53
SPH                   0.47    0.49


Baseline: Single Feature Classifier Performance. As shown in Table 1, single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation for this phenomenon is that, due to the low amount of supervision, a classifier trained on the high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single feature methods.


Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature \ approach    MKL     MKL-TA   DAS     DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH              0.48    0.50     0.51    0.56     0.50     0.53
BOVW-CRFH             0.50    0.54     0.51    0.56     0.54     0.58
SPH-CRFH              0.48    0.51     0.50    0.54     0.54     0.57
BOVW-SPH-CRFH         0.48    0.51     0.51    0.56     —        —

We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a greater risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% with no temporal accumulation and up to 5% with temporal accumulation. Multiple feature information is therefore best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important when it comes to the low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves on all but the BOVW-SPH feature pair compared to DAS, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity a priori and boosts performances by another 3-4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal continuity of video, and the exploitation of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.
[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.

[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.

[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.

[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of the SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep. CVAP/CAS, Kungliga Tekniska Hoegskolan, 2006.


[Figure 2 diagram: training and testing data in two views x(1), x(2) are fed to SVM (1) and SVM (2); the test scores yield class estimates and confidence scores, from which the top-confidence test patterns are selected and fed back in the CO loop.]

Figure 2: Workflow of the Co-Training algorithm.

statistical learning theory viewpoint, by linking the margin width to the classifier's generalization capability.

Given a labeled set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x} \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, a linear maximal margin classifier $f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$ can be found by solving

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \sum_{i=1}^{l} \xi_i + \lambda \|\mathbf{w}\|^2 \quad \text{s.t. } y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, l \tag{1}$$

for the hyperplane $\mathbf{w} \in \mathbb{R}^d$ and its offset $b \in \mathbb{R}$. In the regularization framework, the loss function, called the Hinge loss, is

$$\ell(\mathbf{x}, y, f(\mathbf{x})) = \max(1 - y_i f(\mathbf{x}_i), 0), \quad i = 1, \dots, l \tag{2}$$

and the regularizer is

$$\Omega_{\mathrm{SVM}}(f) = \|\mathbf{w}\|^2. \tag{3}$$

As will be seen from the discussion, the regularizer plays an important role in the design of learning methods. In the case of an SVM classifier, the regularizer in (3) reflects the objective to be maximized: maximum margin separation on the training data.
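As a rough illustration of this regularized risk view, the following minimal sketch (Python with NumPy, not part of the original system; all data and parameter values are illustrative) evaluates the Hinge loss of (2) and combines it with the regularizer of (3) for a given linear hyperplane.

import numpy as np

def hinge_loss(X, y, w, b):
    # Empirical Hinge loss (Eq. (2)) over a labeled set, with y in {-1, +1}.
    margins = y * (X @ w + b)
    return np.maximum(1.0 - margins, 0.0).mean()

def regularized_risk(X, y, w, b, lam):
    # Data fit plus the SVM regularizer ||w||^2 (Eq. (3)), balanced by lam.
    return hinge_loss(X, y, w, b) + lam * np.dot(w, w)

# Illustrative check with random data and a random hyperplane.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = np.sign(rng.normal(size=50))
w, b = rng.normal(size=10), 0.0
print(regularized_risk(X, y, w, b, lam=0.1))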

3.1.2. Processing of Nonlinear Kernels. The power of the SVM classifier owes to its easy extension to the nonlinear case [71]. The highly nonlinear nature of the data can be taken into account seamlessly by using the kernel trick, such that the hyperplane is found in a feature space induced by an adapted kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$ in a Reproducing Kernel Hilbert Space (RKHS). The implicit mapping $\mathbf{x} \mapsto \Phi(\mathbf{x})$ means that we can no longer find an explicit hyperplane $(\mathbf{w}, b)$, since the mapping function is not known and may be of very large dimensionality. Fortunately, the decision function can be formulated in the so-called dual representation [71], and the solution minimizing the regularized risk according to the Representer theorem is then

$$f_k(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b, \quad k = 1, \dots, c \tag{4}$$

where $l$ is the number of labeled samples.

Bag of Words descriptors have been used intensively for efficient and discriminant image description. The linear kernel does not provide the best results with such representations, which have been more successful with kernels such as the Hellinger kernel, the $\chi^2$ kernel, or the intersection kernel [6, 27, 33]. Unfortunately, training with such kernels using the standard SVM tools is much less computationally efficient than using the linear inner product kernel, for which efficient SVM implementations exist [72]. In this work, we have therefore chosen to adapt the input features to the linear context using two different techniques. For the BOVW (Bag of Visual Words) [25] and SPH (Spatial Pyramid Histogram) [27] features, a Hellinger kernel was used. This kernel admits an explicit mapping function using a square root transformation $\phi([x_1 \cdots x_d]^T) = [\sqrt{x_1} \cdots \sqrt{x_d}]^T$. In this particular case, a linear embedding $\mathbf{x}' = \phi(\mathbf{x})$ can be computed explicitly and has the same dimensionality as the input feature. For the CRFH (Composed Receptive Field Histogram) features [26], the feature vector has a very large number of dimensions but is also extremely sparse, with between 500 and 4000 nonzero coefficients out of many millions of possible features. These features were transformed into a linear embedding using Kernel Principal Component Analysis [73] in order to reduce them to a 500-dimension linear embedding vector.


In the following, we will therefore consider that all features are processed into a linear embedding $\mathbf{x}_i$ that is suitable for an efficient linear SVM. The utility of this processing will be evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations of two visual feature classifiers. Other forms of efficient embedding proposed in [38] could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting directly by $\mathbf{x}_i$ the linearized descriptors, without further indication, in the rest of this document.
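As an illustration of this preprocessing, the following sketch (assuming scikit-learn; data, sizes, and parameters are illustrative rather than those of the actual system) applies the explicit Hellinger embedding to BOVW-style histograms and trains an efficient linear SVM on the embedded features.

import numpy as np
from sklearn.svm import LinearSVC

def hellinger_embedding(hist):
    # Explicit feature map of the Hellinger kernel: L1-normalize, then take square roots.
    hist = np.asarray(hist, dtype=float)
    hist = hist / max(hist.sum(), 1e-12)
    return np.sqrt(hist)

# Illustrative data: rows are BOVW histograms, y are place labels.
X_raw = np.random.randint(0, 10, size=(100, 1111)).astype(float)
y = np.random.randint(0, 5, size=100)
X = np.vstack([hellinger_embedding(h) for h in X_raw])

# A linear SVM on the embedded features approximates a Hellinger-kernel SVM
# while keeping training and prediction cheap enough for repeated retraining.
clf = LinearSVC(C=1.0).fit(X, y)
scores = clf.decision_function(X)   # one-versus-all score vector per image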

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to $c > 2$ classes is considered in a one-versus-all setup. Therefore, $c$ independent classifiers are trained on the labeled data, each of which learns the separation between one class and the other classes. We will denote by $f_k$ the decision function associated to class $k \in \{1, \dots, c\}$. The outcome of the classifier bank for a sample $\mathbf{x}$ can be represented as a score vector $\mathbf{s}(\mathbf{x})$ by concatenating the individual decision scores:

$$\mathbf{s}(\mathbf{x}) = (f_1(\mathbf{x}), \dots, f_c(\mathbf{x})). \tag{5}$$

In that case, the estimated class of a testing sample $\mathbf{x}_i$ is obtained from the largest positive score:

$$\hat{y}_i = \arg\max_{k=1,\dots,c} f_k(\mathbf{x}_i). \tag{6}$$

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work, we follow a late classifier fusion paradigm, with several classifiers being trained independently on different visual cues and their outputs fused for a single final decision. We motivate this choice, compared to the early fusion paradigm, by the fact that it allows an easier integration at the decision level of the augmented classifiers obtained by the Co-Training algorithm, as well as providing a natural extension to inject the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and the unlabeled set of patterns by $U = \{\mathbf{x}_j\}_{j=l+1}^{l+u}$, where $\mathbf{x} \in \mathcal{X}$ and the outcome of classification is a binary output $y \in \{-1, +1\}$.

The visual data may have $p$ multiple cues describing the same image $I_i$. Suppose that $p$ cues have been extracted from an image $I_i$:

$$\mathbf{x}_i \rightarrow (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \dots, \mathbf{x}_i^{(p)}) \tag{7}$$

where each cue $\mathbf{x}_i^{(j)}$ belongs to an associated descriptor space $\mathcal{X}^{(j)}$.

Denote also by $f^{(1)}, f^{(2)}, \dots, f^{(p)}$ the $p$ decision functions, where $f^{(j)} \in \mathcal{F}^{(j)}$ is trained on the respective visual cue and provides the estimation $\hat{y}_k^{(j)}$ on the pattern $\mathbf{x}_k^{(j)}$. Then, for a visual cue $t$ and a $c$-class classification in the one-versus-all setup, a score vector can be constructed:

$$\mathbf{s}^{t} = (f_1^{t}(\mathbf{x}), \dots, f_c^{t}(\mathbf{x})). \tag{8}$$

In our work, we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVM-DAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to combine linearly the scores returned by the same-class decision functions across the multiple visual cues $t = 1, \dots, p$. The combined decision function for a class $j$ is then a linear combination

$$f_j^{\mathrm{DAS}}(\mathbf{x}) = \sum_{t=1}^{p} \beta_t f_j^{t}(\mathbf{x}) \tag{9}$$

where the weight $\beta_t$ is attributed to each cue according to its importance in the learning phase. The new scores can then be used in the decision process, for example using the max-score criterion.

The DAS scheme is an example of a parallel classifier combination architecture [44] and implies a competition between the individual classifiers. The weights $\beta_t$ can be found using a cross-validation procedure, with the normalization constraint

$$\sum_{t=1}^{p} \beta_t = 1. \tag{10}$$
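A minimal sketch of the DAS combination rule, assuming the per-cue SVM score matrices are already available (NumPy only; shapes and weight values are illustrative, with the weights normalized as in (10)):

import numpy as np

def das_fusion(scores_per_cue, betas):
    # Discriminant Accumulation Scheme (Eq. (9)): linearly combine the per-class
    # score matrices of the different visual cues, then apply the max-score rule.
    betas = np.asarray(betas, dtype=float)
    assert np.isclose(betas.sum(), 1.0), "weights must sum to 1 (Eq. (10))"
    # scores_per_cue: list of arrays, each of shape (n_samples, n_classes)
    fused = sum(b * s for b, s in zip(betas, scores_per_cue))
    return fused, fused.argmax(axis=1)

# Example with two cues (e.g. BOVW and CRFH SVM scores); beta would in practice
# be selected by cross-validation.
s_bovw = np.random.randn(20, 5)
s_crfh = np.random.randn(20, 5)
fused_scores, labels = das_fusion([s_bovw, s_crfh], betas=[0.4, 0.6])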

3.2.3. SVM Discriminant Accumulation Scheme. The SVM-DAS can be seen as a generalization of DAS, building a stacked architecture of multiple classifiers [44] in which the individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue $t$ and produces a score vector as in (8). Then the single-feature score vectors $\mathbf{s}_i^{t}$ corresponding to one particular pattern $\mathbf{x}_i$ are concatenated into a new multifeature score vector $\mathbf{z}_i = [\mathbf{s}_i^{1}, \dots, \mathbf{s}_i^{p}]$. A final top-level classifier can be trained on these new features:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j. \tag{11}$$

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between the base classifier outputs. If a linear kernel function is used,

$$k_{\mathrm{SVMDAS}}(\mathbf{z}_i, \mathbf{z}_j) = \langle \mathbf{z}_i, \mathbf{z}_j \rangle = \sum_{t=1}^{p} \langle \mathbf{s}_i^{t}, \mathbf{s}_j^{t} \rangle \tag{12}$$

then the decision function in (11) can be rewritten by exchanging the sums:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{i=1}^{l} \alpha_{ij} y_i k(\mathbf{z}, \mathbf{z}_i) + b_j = \sum_{t=1}^{p} \sum_{i=1}^{l} \alpha_{ij} y_i \langle \mathbf{s}_i^{t}, \mathbf{s}^{t} \rangle + b_j. \tag{13}$$


Denoting $\mathbf{w}_j^{t} = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^{t}$, we can rewrite the decision function using the input patterns and the learned weights:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p} \sum_{k=1}^{c} w_{jk}^{t} f_k^{t}(\mathbf{x}). \tag{14}$$

The new representation reveals that using a linear kernel in the SVM-DAS framework yields a classifier in which weights are learned for every possible linear combination of base classifier outputs. DAS can be seen as a special case in this context, but with significantly fewer parameters. The use of a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well, and its parameters tuned.
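One possible realization of this stacked architecture with a linear final classifier, assuming scikit-learn (helper names, validation scores, and shapes are illustrative, not the original implementation):

import numpy as np
from sklearn.svm import LinearSVC

# Per-cue score vectors s^t_i (Eq. (8)) are stacked into z_i = [s^1_i, ..., s^p_i],
# and a final classifier is trained on these meta-features (Eq. (11)).
def svmdas_train(score_vectors, y, C=1.0):
    Z = np.hstack(score_vectors)          # shape (n_samples, p * n_classes)
    return LinearSVC(C=C).fit(Z, y)

def svmdas_predict(meta_clf, score_vectors):
    Z = np.hstack(score_vectors)
    return meta_clf.predict(Z)

# Usage with two cues on held-out validation scores (illustrative shapes).
s1, s2 = np.random.randn(50, 5), np.random.randn(50, 5)
y_val = np.random.randint(0, 5, size=50)
meta = svmdas_train([s1, s2], y_val)
y_hat = svmdas_predict(meta, [s1, s2])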

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature, such that the visual content does not usually change much over a short period of time. In the case of topological place recognition indoors, this constraint may be useful, as place changes are encountered relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes are discouraged within a relatively short period of time. This lowers the proliferation of occasional, temporally localized misclassifications.

Let $s_i^{t} = f^{(t)}(\mathbf{x}_i)$ be the score of a binary classifier for visual cue $t$, and let $h$ be a temporal window of size $2\tau + 1$. Then temporal accumulation can be written as

$$s_{i,\mathrm{TA}}^{t} = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^{t} \tag{15}$$

and can easily be generalized to multiple feature classification by applying it separately to the output of the classifiers associated to each feature $\mathbf{s}^{t}$, where $t = 1, \dots, p$ is the visual feature type. We use an averaging filter of size $\tau$, defined as

$$h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \dots, \tau. \tag{16}$$

The input of the TA is therefore the SVM scores obtained after classification, and its output is again SVM scores, with the temporal constraint enforced.
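The temporal accumulation of (15)-(16) amounts to a moving average of each per-class score track; a small sketch, assuming NumPy and zero-padding at the sequence boundaries (a simplification rather than the exact boundary handling of the original system):

import numpy as np

def temporal_accumulation(scores, tau):
    # Average the scores over a sliding window of size 2*tau + 1 (Eqs. (15)-(16)).
    # scores: array of shape (n_frames, n_classes), ordered in time.
    window = 2 * tau + 1
    kernel = np.ones(window) / window
    # 'same'-mode convolution applied independently to each class score track.
    return np.stack(
        [np.convolve(scores[:, k], kernel, mode="same") for k in range(scores.shape[1])],
        axis=1,
    )

# Example: smooth per-frame scores with tau = 50 frames before the max-score decision.
raw = np.random.randn(1000, 5)
smoothed = temporal_accumulation(raw, tau=50)
labels = smoothed.argmax(axis=1)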

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section, we explain how to additionally learn from both labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training algorithm [60] iteratively trains two classifiers on two-view data $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ by feeding the highest confidence score $z_i$ estimates from the testing set into the other view's classifier. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest confidence estimates. The pseudocode is presented in Algorithm 1; it could also be extended to multiple views, as in [53].

The power of the method lies in its capability of learning from small training sets, eventually growing its discriminative properties on the large unlabeled data set as more confident estimations are added into the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;

(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;

(3) the confident estimations on unlabeled data are helpful for predicting the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or N iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small difference from iteration t − 1 to t. The parameter-less version of Co-Training runs until the complete exhaustion of the pool of unlabeled samples, but it requires a threshold on the confidence measure, which is used to separate high and low confidence estimates. In our work, we use this variant of the Co-Training algorithm.

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It remains unclear why, and which mechanisms make such learning useful. It can be justified from the learning theory point of view: there are fewer possible solutions, or classifiers from the hypothesis space, that agree on the unlabeled data in two views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier should be somehow restricted by the other classifier. The two trained classifiers that are coupled in this system effectively reduce the possible solution space. Each of these two classifiers is less likely to overfit, since each of them has been initially trained on its own training set while taking into account the training process of the other classifier, carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework.


INPUT:
  Training set L = {(x_i, y_i)}, i = 1, ..., l
  Testing set U = {x_i}, i = 1, ..., u
OUTPUT:
  ŷ_i: class estimations for the testing set U; f^(1), f^(2): trained classifiers
PROCEDURE:
(1) Compute the visual features x_i = (x_i^(1), x_i^(2)) for every image I_i in the dataset.
(2) Initialize L_1 = {(x_i^(1), y_i)}, i = 1, ..., l and L_2 = {(x_i^(2), y_i)}, i = 1, ..., l.
(3) Initialize U_1 = {x_i^(1)}, i = 1, ..., u and U_2 = {x_i^(2)}, i = 1, ..., u.
(4) Create two work sets Û_1 = U_1 and Û_2 = U_2.
(5) Repeat until the sets Û_1 and Û_2 are empty (CO):
    (a) Train the classifiers f^(1), f^(2) using the sets L_1, L_2, respectively.
    (b) Classify the patterns in the sets Û_1 and Û_2 using the classifiers f^(1) and f^(2), respectively:
        (i) compute the scores s^(1)_test and confidences z^(1) on the set Û_1;
        (ii) compute the scores s^(2)_test and confidences z^(2) on the set Û_2.
    (c) Add the k top-confidence estimations L̂_1 ⊂ Û_1, L̂_2 ⊂ Û_2 to the training sets:
        (i) L_1 = L_1 ∪ L̂_2;
        (ii) L_2 = L_2 ∪ L̂_1.
    (d) Remove the k top-confidence patterns from the working sets:
        (i) Û_1 = Û_1 \ L̂_1;
        (ii) Û_2 = Û_2 \ L̂_2.
    (e) Go to step (5).
(6) Optionally perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
    (a) compute the fused scores s^DAS_test = (1 − β) s^(1)_test + β s^(2)_test;
    (b) output the class estimations ŷ_i from the fused scores s^DAS_test.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.
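The following Python sketch mirrors the Co-Training loop of Algorithm 1 (steps (1)-(5)) under the assumptions of linear SVMs, multiclass decision scores, and cross-feeding of the top-confidence estimates between the two views; the TA and DAS stages are omitted, and all helper names are illustrative rather than part of the original implementation.

import numpy as np
from sklearn.svm import LinearSVC

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, k, confidence):
    # X1_*, X2_*: the two visual-cue views of the same images; k: number of
    # top-confidence samples exchanged per round; confidence: function mapping a
    # score matrix of shape (n_samples, n_classes) to one confidence per sample.
    L1_X, L1_y = X1_l.copy(), y_l.copy()
    L2_X, L2_y = X2_l.copy(), y_l.copy()
    pool = np.arange(len(X1_u))                  # indices of still-unlabeled frames
    f1 = f2 = None
    while len(pool) > 0:
        f1 = LinearSVC().fit(L1_X, L1_y)
        f2 = LinearSVC().fit(L2_X, L2_y)
        s1 = f1.decision_function(X1_u[pool])
        s2 = f2.decision_function(X2_u[pool])
        top1 = np.argsort(-confidence(s1))[:k]   # most confident view-1 estimates
        top2 = np.argsort(-confidence(s2))[:k]   # most confident view-2 estimates
        # Cross-feeding: each view labels samples for the other view's training set.
        L2_X = np.vstack([L2_X, X2_u[pool[top1]]])
        L2_y = np.concatenate([L2_y, f1.classes_[s1[top1].argmax(axis=1)]])
        L1_X = np.vstack([L1_X, X1_u[pool[top2]]])
        L1_y = np.concatenate([L1_y, f2.classes_[s2[top2].argmax(axis=1)]])
        pool = np.delete(pool, np.union1d(top1, top2))
    return f1, f2

In practice, the confidence argument would be one of the measures discussed in Section 3.3.4, and the two returned classifiers would feed the DAS (and optionally TA) fusion stages.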

Let us introduce the Hinge loss function $\ell(\mathbf{x}, y, f(\mathbf{x}))$ commonly used in classification, and the empirical risk of a candidate function $f \in \mathcal{F}$:

$$\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i)) \tag{17}$$

which measures how well the classifier fits the training data. It is well known that, by minimizing only the training error, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

$$f^{\mathrm{RRM}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda \Omega(f) \tag{18}$$

where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value, or penalty, for very complicated functions (typically the functions that fit the data perfectly). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and a better generalization capability on novel data can be achieved. A good example is the SVM classifier, whose regularizer $\Omega_{\mathrm{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^2$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning, we can select a regularizer such that it is sufficiently smooth on the unlabeled data as well. Keeping all the previous discussion in mind, a function that fits the training data while respecting the unlabeled data will indeed probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as

$$(f^{(1)*}, f^{(2)*}) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)) + \lambda_1 \Omega_{\mathrm{SVM}}(f^{(t)}) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i)) \tag{19}$$

where $\lambda_2 > 0$ controls the balance between an agreed fit on the training data and agreement on the test data. The first part of (19) states that each individual classifier should fit the given training data but should not overfit, which is prevented by the SVM regularizer $\Omega_{\mathrm{SVM}}(f)$.



Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation and late fusion (b).

The second part is a regularizer $\Omega_{\mathrm{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits the unlabeled data, and works with two visual cues.

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, the algorithms following this spirit implement multiple view learning; refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs the best, nor whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work, we use the DAS method for late fusion, although it is possible to use the more general SVM-DAS method as well; the experimental evaluation will show that very competitive performances can be obtained with the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces the temporal continuity information. The experimental evaluation will reveal the relative performances of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS (or CO-TA-DAS, if temporal accumulation is enabled) method is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold k for the top-confidence sample selection, the temporal accumulation window width τ, and the late fusion parameter β. We express the threshold k as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion on the selection of the parameter β is given in Section 4.1.2.

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with $c$ classes, the scores $\{f^{k}(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as

$$P(y = k \mid \mathbf{x}) = \frac{\exp(f^{k}(\mathbf{x}))}{\sum_{i=1}^{c} \exp(f^{i}(\mathbf{x}))} \tag{20}$$

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows the classifier output to be interpreted as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or all accept it (all positive scores). Moreover, forcing the scores to be normalized to sum to one may not preserve all of their dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76], a parameter-less method was proposed, which assigns the score value

$$z = \begin{cases} p_{+} & f(\mathbf{x}) > 1 \\ \dfrac{1 + f(\mathbf{x})}{2} & -1 \le f(\mathbf{x}) \le 1 \\ p_{-} & f(\mathbf{x}) < -1 \end{cases} \tag{21}$$

where $p_{+}$ and $p_{-}$ are the fractions of positive and negative score values, respectively. The authors argue that the dynamics relevant to confidence estimation happen in the region of the margin, and that the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem, but it does not cover the multiclass case required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum score estimation should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:

$$z = f_{k^{*}}(\mathbf{x}) - \max_{k=1,\dots,c,\; k \ne k^{*}} f_k(\mathbf{x}). \tag{22}$$

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, there is an issue with this measure if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account by any of the above confidence measure computation procedures. The one-versus-all setup for multiple class classification may yield ambiguous decisions; for instance, it is possible to obtain several positive scores, all positive scores, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to several degrees and also treats two degenerate cases. By convention, the confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (a larger positive value of a score). Cases with all positive or all negative scores may be considered degenerate: $z_i \leftarrow 0$. The computation is divided in two steps. First, we compute the standard Tommasi confidence measure

$$z_i^{0} = f_{j^{*}}(\mathbf{x}_i) - \max_{k=1,\dots,c,\; k \ne j^{*}} f_k(\mathbf{x}_i) \tag{23}$$

then the measure $z_i^{0}$ is modified to account for class overlap:

$$z_i = z_i^{0} \max\left(0, 1 - \frac{p_i - 1}{C}\right) \tag{24}$$

where $p_i = \mathrm{Card}(\{k = 1, \dots, c \mid f_k(\mathbf{x}_i) > 0\})$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In case $\forall k\, f_k(\mathbf{x}_i) > 0$ or $\forall k\, f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows them to be excluded rather than assigned doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, the confidence is lowered if more than one binary classifier returns a positive score.
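For reference, the usable confidence measures can be sketched as follows (NumPy, operating on one-versus-all score matrices of shape (n_samples, n_classes)). The constant C of (24) is assumed here to default to the number of classes; that default is an assumption for illustration, not a statement of the original implementation.

import numpy as np

def logistic_confidence(scores):
    # Softmax over the one-versus-all scores (Eq. (20)); returns the maximal class probability.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).max(axis=1)

def tommasi_confidence(scores):
    # Contrast between the two largest scores (Eq. (22)).
    top2 = np.sort(scores, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def overlap_aware_confidence(scores, C=None):
    # Proposed measure (Eqs. (23)-(24)): Tommasi contrast penalized by the number of
    # positive one-versus-all scores; all-positive and all-negative cases get zero confidence.
    c = scores.shape[1]
    C = c if C is None else C                    # assumed default for the constant of Eq. (24)
    z0 = tommasi_confidence(scores)
    p = (scores > 0).sum(axis=1)                 # class overlap count
    z = z0 * np.maximum(0.0, 1.0 - (p - 1) / C)
    z[(p == 0) | (p == c)] = 0.0                 # degenerate cases
    return z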

4. Experimental Evaluation

In this section, we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2, in Section 4.1, and (2) on our in-house database IMMED, in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter database is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) randomly sampled training images across the whole corpus and (b) a more realistic video-versus-video setup. The first setup allows for a gradual increase of supervision, which gives insights into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database, we extensively assess the place recognition performance of each independent part of the proposed system. For instance, we validate the utility of multiple features and the effect of temporal smoothing, unlabeled data, and different confidence measures.

The IMMED database is used for validation purposes; on it, we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess the place recognition systems of mobile robots in an indoor environment.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers and capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms, due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section, we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work, we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search of the tree. This allows visual words to be introduced ranging from more general (higher level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness when it comes to the spatial structure of the image. This is done by constructing a pyramid where each level defines a coarse-to-fine sampling grid for histogram extraction.



Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

Each grid histogram is obtained by constructing a standard BOVW histogram from local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3, with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after some filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses returned from each specific filter. Due to its multidimensional nature and the size of an image, this descriptor often results in a very high dimensional vector. In our experimental evaluations, we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The total size of the global descriptor resulted in very sparse vectors of up to 400 million dimensions. It was reduced to a 500-dimensional linear descriptor vector using KPCA with a χ² kernel [73].
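A possible realization of this reduction, using kernel PCA with a precomputed χ² kernel, is sketched below with scikit-learn. The gamma value, the toy data, and the helper names are assumptions; only the 500-component target and the χ² kernel come from the text.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

def fit_chi2_kpca(train_descriptors, n_components=500, gamma=1.0):
    """Fit kernel PCA on a precomputed chi-squared kernel over non-negative descriptors."""
    K_train = chi2_kernel(train_descriptors, gamma=gamma)
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    embedded_train = kpca.fit_transform(K_train)
    return kpca, embedded_train

def transform_chi2_kpca(kpca, train_descriptors, new_descriptors, gamma=1.0):
    """Project new descriptors using the kernel between new and training samples."""
    K_new = chi2_kernel(new_descriptors, train_descriptors, gamma=gamma)
    return kpca.transform(K_new)

# toy usage with random sparse non-negative histograms; 500 components would
# require at least 500 training images, so 50 is used here
rng = np.random.default_rng(1)
train = rng.random((100, 2000)) * (rng.random((100, 2000)) > 0.95)
kpca, train_embedded = fit_chi2_kpca(train, n_components=50)
```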

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves, used respectively for validation and testing purposes. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level, and the final result is returned as the average accuracy.
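The random sampling protocol can be sketched as follows; the helper name and the corpus size are illustrative, while the 8 supervision levels and the 10-fold resampling follow the description above.

```python
import numpy as np

def random_supervision_split(n_images, supervision, rng):
    """Draw a labeled training set covering `supervision` (e.g. 0.01 ... 0.5) of the
    corpus and split the remaining images evenly into validation and test sets."""
    idx = rng.permutation(n_images)
    n_train = int(round(supervision * n_images))
    n_rest = n_images - n_train
    train = idx[:n_train]
    val = idx[n_train:n_train + n_rest // 2]
    test = idx[n_train + n_rest // 2:]
    return train, val, test

rng = np.random.default_rng(0)
for level in (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5):   # the 8 supervision levels
    for fold in range(10):                                    # 10-fold resampling
        tr, va, te = random_supervision_split(10000, level, rng)
```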

It is expected that the global place recognition performance rises from a mediocre level at low supervision to its maximum at high supervision levels.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.

Advances in Multimedia 13

Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels (curves for 1% to 50%). Plot of the global accuracy (%) as a function of the parameter α that balances the fusion between SPH features (3 levels) for α = 0 and CRFH for α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all 132 video pairs, to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for the baseline single-feature methods compared to the early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages the outputs of two visual feature classifiers and provides a weighted score sum, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediate fusion values, which suggest feature complementarity. The fusion parameter α can be safely set to an intermediate value, such as 0.5, and the final performance then exceeds that of every single-feature classifier alone at all supervision levels.
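The DAS fusion and the validation-based choice of α can be sketched as follows, assuming that the per-class SVM scores of the two cues are already available; the α grid and the function names are assumptions.

```python
import numpy as np

def das_fuse(scores_sph, scores_crfh, alpha):
    """DAS late fusion: weighted sum of the per-class SVM scores of two cues.
    alpha = 0 keeps only the SPH classifier, alpha = 1 only the CRFH classifier."""
    return (1.0 - alpha) * scores_sph + alpha * scores_crfh

def select_alpha(scores_sph_val, scores_crfh_val, y_val, grid=np.linspace(0, 1, 11)):
    """Pick the fusion weight on a validation set (a sketch; the grid is an assumption)."""
    accuracies = []
    for alpha in grid:
        pred = das_fuse(scores_sph_val, scores_crfh_val, alpha).argmax(axis=1)
        accuracies.append(np.mean(pred == y_val))
    return grid[int(np.argmax(accuracies))]

# toy usage with random score matrices (n_samples x n_classes)
rng = np.random.default_rng(2)
s1, s2 = rng.normal(size=(50, 5)), rng.normal(size=(50, 5))
y = rng.integers(0, 5, size=50)
best_alpha = select_alpha(s1, s2, y)
```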

Figure 6: Comparison of single-feature (BOVW, CRFH, and SPH Level 3) and multiple-feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the global accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single-feature and multiple-feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single-feature approaches, with SPH providing the best performance. Both SVMDAS (a late fusion approach) and SimpleMKL (an early fusion approach) operate fusion over the 3 single features considered. They outperform the single-feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance of DAS (at the best α for each supervision level, Figure 5) with that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for the final system is motivated by this result and by its simpler fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content which has not been leveraged so far in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single-feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames.

14 Advances in Multimedia

Figure 7: Effect of the filter size in temporal accumulation. Plot of the global accuracy (%) as a function of the TA filter size h, for supervision levels of 1%, 5%, 10%, and 50% (IDOL2 dataset, SPH Level 3 features).

The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This indicates that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible remedy. The assumption that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and that room changes are relatively slow, the filter size can be set empirically, for example to the number of frames captured in one second.

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single-feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we are interested in observing the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single-feature and multifeature baselines with the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The lower performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single-feature SVM with a semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features. Plot of the global accuracy (%) as a function of the supervision level (IDOL2 dataset, random setup).

Practical application of this method is limited, since the full kernel matrix must be computed and stored in memory, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. Figure 9 also illustrates, in panels (b) and (c), the optimal amount of selected high-confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure, respectively. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top-confidence patterns is fed to classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights into the Co-Training algorithm can be gained if we perform more than one iteration (see Figure 10). The figures show the evolution of the performance of each single-feature classifier as it is iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high-confidence estimates is added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.

Advances in Multimedia 15

Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single-feature (CRFH, SPH) approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers (BOF and CRFH co-training) of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs, (top) same lighting conditions, (bottom) different lighting conditions. Panel titles: close time (minnie cloudy1 / minnie night1, minnie cloudy1 / minnie sunny1, minnie night1 / minnie sunny1); far time (minnie cloudy1 / minnie cloudy3, minnie night1 / minnie night3, minnie sunny1 / minnie sunny3).

Conclusion. The experiments carried out so far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using a semi-supervised Laplacian SVM under the assumption of low-density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. These issues are solved by the more efficient Co-Training algorithm, which is used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions where the training patterns are scattered across the database. Randomly labeled images may simulate the situation where small portions of video are annotated in a frame-by-frame manner. In the extreme, a few labeled images from every class may be labeled manually.

In this context, we are interested in the performance of the single-feature methods, the early and late fusion methods, and the proposed semi-supervised CO-DAS and CO-TA-DAS methods.


Figure 11: Comparison of the performance of single features (BOVW, CRFH), early fusion (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOVW, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single-feature methods, which achieve approximately 50% and 60% correct classification for the BOVW and CRFH based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top-confidence estimates in one retraining iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top-confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high-confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and compensates the performance drop of the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single-feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single-feature classifiers need to be provided with a sufficient amount of training data to bootstrap the iterative retraining procedure. The diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm; this explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.

Figure 12: Comparison of the global performances for single-feature (BOVW-SVM, CRFH-SVM), multiple-feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used for the Co-Training feedback loop. The baseline single-feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, Figure 12 also gives a good illustration of the useful amount of high-confidence samples: it is clear that only a portion of the test set data is useful for classifier retraining. This is governed by two major factors, the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high-confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single-feature baselines are outperformed by standard fusion and by simple Co-Training methods. The proposed CO-DAS and CO-TA-DAS methods perform best, by successfully leveraging the two visual features and the temporal continuity of the video while working in a semi-supervised framework.

Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, and Tommasi) for the Co-Training feedback loop, for the CO-DAS and CO-TA-DAS methods. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

4.1.7. Effect of the Type of Confidence Measure. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than that of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also included. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.
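For illustration, a simple margin-based confidence measure and the selection of the top-confidence test samples could look like the sketch below. This is a generic top-two score margin; it is not necessarily the exact measure proposed in the paper, nor the logistic or Tommasi variants compared in Figure 13.

```python
import numpy as np

def top_two_margin_confidence(scores):
    """Confidence of a one-versus-all decision as the margin between the two
    largest class scores (scores is n_samples x n_classes)."""
    sorted_scores = np.sort(scores, axis=1)
    return sorted_scores[:, -1] - sorted_scores[:, -2]

def select_top_confidence(scores, feedback_ratio=0.2):
    """Indices of the `feedback_ratio` most confident test samples, to be fed
    back into the Co-Training retraining step."""
    conf = top_two_margin_confidence(scores)
    k = max(1, int(feedback_ratio * len(conf)))
    return np.argsort(conf)[::-1][:k]
```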

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from the increased visual variability changing from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low-quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours.


Figure 14: IMMED sample images. (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second, with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Most locations are represented by one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; in those cases, a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature / approach   SVM    SVM-TA
BOVW                 0.49   0.52
CRFH                 0.48   0.53
SPH                  0.47   0.49

The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Baseline: Single Feature Classifier Performance. As shown in Table 1, the single-feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation of this phenomenon is that, due to the low amount of supervision, a classifier trained on the high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single-feature methods. We observe the same order of improvement for the multiple feature methods, that is, MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single-feature baselines is constant for the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to the single-feature methods.


Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature / approach   MKL    MKL-TA   DAS    DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH             0.48   0.50     0.51   0.56     0.50     0.53
BOVW-CRFH            0.50   0.54     0.51   0.56     0.54     0.58
SPH-CRFH             0.48   0.51     0.50   0.54     0.54     0.57
BOVW-SPH-CRFH        0.48   0.51     0.51   0.56     —        —

We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single-feature methods in this low supervision setup, bringing up to 4% with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when dealing with low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single-feature methods and improves over DAS on all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and the single-feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal continuity of video, and the use of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms, exploiting labeled as well as unlabeled data, helped to address this problem. The proposed system integrates all the acquired knowledge in a framework which is computationally tractable, yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, September 2009.
[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.



In the following, we will therefore consider that all features are processed into a linear embedding x_i that is suitable for an efficient linear SVM. The utility of this processing will be evident in the context of the Co-Training algorithm, which requires multiple retraining and prediction operations of two visual feature classifiers. Other forms of efficient embedding proposed in [38] could also be used to reduce the learning time. This preprocessing is done only once, right after feature extraction from the image data. In order to simplify the explanations, we will slightly abuse notation by denoting directly by x_i the linearized descriptors, without further indication in the rest of this document.

3.1.3. Multiclass Classification. Visual place recognition is a truly multiclass classification problem. The extension of the binary SVM classifier to c > 2 classes is considered in a one-versus-all setup. Therefore, c independent classifiers are trained on the labeled data, each of which learns the separation between one class and the other classes. We denote by f_k the decision function associated to class k ∈ {1, ..., c}. The outcome of the classifier bank for a sample x can be represented as a score vector s(x) by concatenating the individual decision scores:

    s(x) = (f_1(x), ..., f_c(x)).    (5)

In that case, the estimated class of a testing sample x_i is obtained from the largest positive score:

    ŷ_i = arg max_{k=1,...,c} f_k(x_i).    (6)
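A minimal sketch of this one-versus-all decision rule with linear SVMs is given below; the toy data is random and scikit-learn's LinearSVC is used only as a convenient way of obtaining the c decision functions f_k.

```python
import numpy as np
from sklearn.svm import LinearSVC

# c independent linear decision functions produce a score vector s(x),
# and the class with the largest score wins (equations (5)-(6)).
rng = np.random.default_rng(3)
X_train, y_train = rng.normal(size=(300, 50)), rng.integers(0, 5, size=300)
X_test = rng.normal(size=(40, 50))

clf = LinearSVC(C=1.0).fit(X_train, y_train)    # one-vs-rest internally
scores = clf.decision_function(X_test)          # s(x_i) = (f_1(x_i), ..., f_c(x_i))
y_hat = clf.classes_[scores.argmax(axis=1)]     # equation (6)
```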

3.2. Multiple Feature Fusion Module and Its Extension to Time Information. In this work, we follow a late classifier fusion paradigm, with several classifiers being trained independently on different visual cues and their outputs fused into a single final decision. We motivate this choice, compared to the early fusion paradigm, by the fact that it allows an easier integration, at the decision level, of the augmented classifiers obtained by the Co-Training algorithm, as well as providing a natural extension to inject the temporal continuity information of video.

3.2.1. Objective Statement. We denote the training set by L = {(x_i, y_i)}_{i=1}^{l} and the unlabeled set of patterns by U = {x_j}_{j=l+1}^{l+u}, where x ∈ X and the outcome of classification is a binary output y ∈ {−1, +1}.

The visual data may have p multiple cues describing the same image I_i. Suppose that p cues have been extracted from an image I_i:

    x_i → (x_i^(1), x_i^(2), ..., x_i^(p)),    (7)

where each cue x_i^(j) belongs to an associated descriptor space X^(j).

Denote also the p decision functions f^(1), f^(2), ..., f^(p), where f^(j) ∈ F^(j), trained on the respective visual cues and providing estimations ŷ_k^(j) on the pattern x_k^(j). Then, for a visual cue t and a c-class classification problem in the one-versus-all setup, a score vector can be constructed:

    s^t = (f_1^t(x), ..., f_c^t(x)).    (8)

In our work, we adopt two late fusion techniques: the Discriminant Accumulation Scheme (DAS) [47, 48] and SVMDAS [49, 74].

3.2.2. Discriminant Accumulation Scheme (DAS). The idea of DAS is to linearly combine the scores returned by the same class decision function across the multiple visual cues t = 1, ..., p. The new combined decision function for a class j is then a linear combination:

    f_j^DAS(x) = Σ_{t=1}^{p} β_t f_j^t(x),    (9)

where the weight β_t is attributed to each cue according to its importance in the learning phase. The new scores can then be used in the decision process, for example using the max score criterion.

The DAS scheme is an example of a parallel classifier combination architecture [44] and implies a competition between the individual classifiers. The weights β_t can be found using a cross-validation procedure with the normalization constraint

    Σ_{t=1}^{p} β_t = 1.    (10)

3.2.3. SVM Discriminant Accumulation Scheme. The SVMDAS can be seen as a generalization of DAS, building a stacked architecture of multiple classifiers [44] where the individual classifier outputs are fed into a final classifier that provides a single decision. In this approach, every classifier is trained on its own visual cue t and produces a score vector as in (8). Then, the single-feature score vectors s_i^t corresponding to one particular pattern x_i are concatenated into a new multifeature score vector z_i = [s_i^1, ..., s_i^p]. A final top-level classifier can be trained on these new features:

    f_j^SVMDAS(z) = Σ_{i=1}^{l} α_{ij} y_i k(z, z_i) + b_j.    (11)

Notice that the use of a kernel function enables a richer class of classifiers, modeling possibly nonlinear relations between the base classifier outputs. If a linear kernel function is used,

    k_SVMDAS(z_i, z_j) = ⟨z_i, z_j⟩ = Σ_{t=1}^{p} ⟨s_i^t, s_j^t⟩,    (12)

then the decision function in (11) can be rewritten by exchanging the sums:

    f_j^SVMDAS(z) = Σ_{i=1}^{l} α_{ij} y_i k(z, z_i) + b_j = Σ_{t=1}^{p} Σ_{i=1}^{l} α_{ij} y_i ⟨s^t, s_i^t⟩ + b_j.    (13)


Denoting w_j^t = Σ_{i=1}^{l} α_{ij} y_i s_i^t, we can rewrite the decision function using the input patterns and the learned weights:

    f_j^SVMDAS(z) = Σ_{t=1}^{p} Σ_{k=1}^{c} w_{jk}^t f_k^t(x) + b_j.    (14)

This representation reveals that using a linear kernel in the SVMDAS framework yields a classifier whose weights are learned for every possible linear combination of the base classifier outputs. DAS can be seen as a special case in this context, but with significantly fewer parameters. The use of a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well and its parameters tuned.
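In the linear-kernel case, SVMDAS amounts to stacking: concatenate the per-cue score vectors and train a final linear classifier on them. The sketch below follows this reading of equations (11)-(14); the use of LinearSVC as the top-level classifier and the toy scores are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svmdas_fit(score_vectors_per_cue, y):
    """Train the top-level classifier on concatenated per-cue score vectors
    z_i = [s_i^1, ..., s_i^p] (a linear-kernel SVMDAS sketch)."""
    Z = np.hstack(score_vectors_per_cue)
    return LinearSVC(C=1.0).fit(Z, y)

def svmdas_predict(stacker, score_vectors_per_cue):
    Z = np.hstack(score_vectors_per_cue)
    return stacker.predict(Z)

# toy usage: two cues, 5 classes, random base-classifier scores
rng = np.random.default_rng(4)
s_cue1, s_cue2 = rng.normal(size=(200, 5)), rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)
stacker = svmdas_fit([s_cue1, s_cue2], y)
pred = svmdas_predict(stacker, [s_cue1, s_cue2])
```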

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature: the visual content does not usually change much in a short period of time. In the case of topological place recognition indoors, this constraint may be useful, as place changes occur relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes are discouraged within a relatively short period of time. This limits the proliferation of occasional, temporally localized misclassifications.

Let s_i^t = f^(t)(x_i) be the score of a binary classifier for visual cue t, and h a temporal window of size 2τ + 1. Then temporal accumulation can be written as

    s_{i,TA}^t = Σ_{k=−τ}^{τ} h(k) s_{i+k}^t,    (15)

and it can easily be generalized to multiple feature classification by applying it separately to the output of the classifiers associated with each feature s^t, where t = 1, ..., p is the visual feature type. We use an averaging filter of size τ defined as

    h(k) = 1 / (2τ + 1),  k = −τ, ..., τ.    (16)

The input of the TA is therefore the set of SVM scores obtained after classification, and the output is again a set of SVM scores, with the temporal constraint enforced.
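A direct implementation of equations (15)-(16) is a centered moving average over the per-class score sequence, as sketched below; the handling of the sequence borders is an assumption not specified in the text.

```python
import numpy as np

def temporal_accumulation(scores, tau):
    """Average each class score over a centered window of 2*tau + 1 frames.
    `scores` is (n_frames, n_classes); frames near the sequence borders are
    averaged over the available neighbours only (a boundary choice)."""
    n = scores.shape[0]
    smoothed = np.empty_like(scores, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - tau), min(n, i + tau + 1)
        smoothed[i] = scores[lo:hi].mean(axis=0)
    return smoothed

# e.g. tau = 50 frames, the window used in the video-versus-video experiments
rng = np.random.default_rng(5)
raw_scores = rng.normal(size=(1000, 5))
ta_scores = temporal_accumulation(raw_scores, tau=50)
labels = ta_scores.argmax(axis=1)
```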

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity information of video. In this section, we explain how to additionally learn from both labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training algorithm [60] iteratively trains two classifiers on two-view data x_i = (x_i^(1), x_i^(2)) by feeding the highest-confidence (score z_i) estimates from the testing set into the classifier of the other view. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest-confidence estimates. The pseudocode is presented in Algorithm 1; it could also be extended to multiple views, as in [53].

The power of the method lies in its capability of learning from small training sets while eventually growing its discriminative power on the large unlabeled data set, as more confident estimations are added into the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;

(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;

(3) the confident estimations on unlabeled data are helpful to predict the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or $N$ iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small difference from iteration $t-1$ to $t$. The parameter-less version of Co-Training runs until the complete exhaustion of the pool of unlabeled samples but requires a threshold on the confidence measure, which is used to separate high and low confidence estimates. In our work we use this variant of the Co-Training algorithm.

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It remains unclear, however, why and by what mechanisms such learning is useful. It can be justified from the learning theory point of view: there are fewer possible solutions, or classifiers, from the hypothesis space that agree on unlabeled data in two views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier should also be somehow restricted by the other classifier. The two trained classifiers that are coupled in this system effectively reduce the possible solution space. Each of the two classifiers is less likely to overfit, since each of them has been initially trained on its own training data while taking into account the training process of the other classifier, which is carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework.

INPUT:
  Training set $L = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$
  Testing set $U = \{\mathbf{x}_i\}_{i=1}^{u}$
OUTPUT:
  $\hat{y}_i$: class estimations for the testing set $U$; $f^{(1)}, f^{(2)}$: trained classifiers
PROCEDURE:
(1) Compute visual features $\mathbf{x}_i = (\mathbf{x}^{(1)}_i, \mathbf{x}^{(2)}_i)$ for every image $I_i$ in the dataset.
(2) Initialize $L_1 = \{(\mathbf{x}^{(1)}_i, y_i)\}_{i=1}^{l}$ and $L_2 = \{(\mathbf{x}^{(2)}_i, y_i)\}_{i=1}^{l}$.
(3) Initialize $U_1 = \{\mathbf{x}^{(1)}_i\}_{i=1}^{u}$ and $U_2 = \{\mathbf{x}^{(2)}_i\}_{i=1}^{u}$.
(4) Create two work sets $\hat{U}_1 = U_1$ and $\hat{U}_2 = U_2$.
(5) Repeat until the sets $\hat{U}_1$ and $\hat{U}_2$ are empty (CO):
    (a) Train classifiers $f^{(1)}, f^{(2)}$ using the sets $L_1, L_2$, respectively.
    (b) Classify the patterns in the sets $\hat{U}_1$ and $\hat{U}_2$ using the classifiers $f^{(1)}$ and $f^{(2)}$, respectively:
        (i) compute scores $s^{(1)}_{\text{test}}$ and confidences $z^{(1)}$ on the set $\hat{U}_1$;
        (ii) compute scores $s^{(2)}_{\text{test}}$ and confidences $z^{(2)}$ on the set $\hat{U}_2$.
    (c) Add the $k$ top confidence estimations $\hat{L}_1 \subset \hat{U}_1$, $\hat{L}_2 \subset \hat{U}_2$:
        (i) $L_1 = L_1 \cup \hat{L}_1$;
        (ii) $L_2 = L_2 \cup \hat{L}_2$.
    (d) Remove the $k$ top confidence patterns from the working sets:
        (i) $\hat{U}_1 = \hat{U}_1 \setminus \hat{L}_1$;
        (ii) $\hat{U}_2 = \hat{U}_2 \setminus \hat{L}_2$.
    (e) Go to step (5).
(6) Optionally perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
    (a) Compute fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}} = (1-\beta)\,\mathbf{s}^{(1)}_{\text{test}} + \beta\,\mathbf{s}^{(2)}_{\text{test}}$.
    (b) Output class estimations $\hat{y}_i$ from the fused scores $\mathbf{s}^{\mathrm{DAS}}_{\text{test}}$.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.
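For illustration, the following is a compact sketch of the CO-DAS loop of Algorithm 1, assuming two precomputed feature views, scikit-learn LinearSVC base classifiers, integer class labels 0..c-1, and the score-contrast confidence of (22); unlike Algorithm 1, it shares a single set of top-confidence patterns between both views, so it should be read as an approximation of the procedure rather than the exact implementation.

```python
# Sketch of the CO-DAS loop: iterative co-training on two feature views,
# followed by DAS late fusion of the two score sets (assumed simplification).
import numpy as np
from sklearn.svm import LinearSVC

def score_contrast(scores):
    """Confidence as the gap between the two largest one-vs-all scores, as in (22)."""
    top2 = np.sort(scores, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def co_das(X1_l, X2_l, y_l, X1_u, X2_u, k=0.1, beta=0.5, n_iter=1):
    """X1_*, X2_*: the two feature views; y_l: integer labels 0..c-1 (multiclass)."""
    L1_X, L2_X, L_y = X1_l.copy(), X2_l.copy(), y_l.copy()
    pool = np.arange(len(X1_u))                   # indices of still-unlabeled patterns
    for _ in range(n_iter):
        f1 = LinearSVC().fit(L1_X, L_y)           # view-1 classifier
        f2 = LinearSVC().fit(L2_X, L_y)           # view-2 classifier
        if len(pool) == 0:
            break
        s1 = f1.decision_function(X1_u[pool])
        s2 = f2.decision_function(X2_u[pool])
        conf = score_contrast(s1) + score_contrast(s2)
        n_add = max(1, int(k * len(pool)))        # k expressed here as a fraction
        top = np.argsort(conf)[-n_add:]           # top-confidence positions in the pool
        pseudo = ((s1 + s2) / 2).argmax(axis=1)[top]
        L1_X = np.vstack([L1_X, X1_u[pool[top]]])
        L2_X = np.vstack([L2_X, X2_u[pool[top]]])
        L_y = np.concatenate([L_y, pseudo])       # feed pseudo-labels back
        pool = np.delete(pool, top)
    # DAS late fusion on the whole unlabeled set.
    s_das = (1 - beta) * f1.decision_function(X1_u) + beta * f2.decision_function(X2_u)
    return s_das.argmax(axis=1)
```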

Let us introduce the hinge loss function $\ell(\mathbf{x}, y, f(\mathbf{x}))$, commonly used in classification, and the empirical risk of a candidate function $f \in \mathcal{F}$:

$\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i))$    (17)

which measures how well the classifier fits the training data. It is well known that by minimizing only the training error the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

$f^{\mathrm{RRM}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda\, \Omega(f)$    (18)

where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value (penalty) for very complicated functions (typically the functions that fit perfectly to the data). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and better generalization capability on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer $\Omega_{\mathrm{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^2$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning we can select a regularizer such that it is sufficiently smooth on unlabeled data as well. Keeping all the previous discussion in mind, a function that fits the training data and respects the unlabeled data will probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as

$(f^{(1)}, f^{(2)}) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)) + \lambda_1\, \Omega_{\mathrm{SVM}}(f^{(t)}) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i))$    (19)

where $\lambda_2 > 0$ controls the balance between the fit on the training data and the agreement on the test data. The first part of (19) states that each individual classifier should fit the given training data but should not overfit, which is prevented with the SVM regularizer $\Omega_{\mathrm{SVM}}(f)$. The second part is a regularizer $\Omega_{\mathrm{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

Figure 3: Co-Training with late fusion (a) and Co-Training with temporal accumulation (b).

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, algorithms following this spirit implement multiple view learning; refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work we use the DAS method for late fusion, although it is possible to use the more general SVMDAS method as well. Experimental evaluation will show that very competitive performances can be obtained using the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. Experimental evaluation will reveal the relative performances of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS (or CO-TA-DAS, if temporal accumulation is enabled) method is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold $k$ for the top confidence sample selection, the temporal accumulation window width $\tau$, and the late fusion parameter $\beta$. We express the threshold $k$ as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion on the selection of the parameter $\beta$ is given in Section 4.1.2.

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that in a one-versus-all setup with $c$ classes the scores $\{f_k(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as

$P(y = k \mid \mathbf{x}) = \frac{\exp(f_k(\mathbf{x}))}{\sum_{i=1}^{c} \exp(f_i(\mathbf{x}))}$    (20)

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows us to interpret the classifier output as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Finally, the forced normalization of scores to sum up to one may not transfer all dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76] a parameter-less method was proposed, which assigns the score value

$z = \begin{cases} p_{+}, & f(\mathbf{x}) > 1 \\ \frac{1 + f(\mathbf{x})}{2}, & -1 \le f(\mathbf{x}) \le 1 \\ p_{-}, & f(\mathbf{x}) < -1 \end{cases}$    (21)

where $p_{+}$ and $p_{-}$ are the fractions of positive and negative score values, respectively. The authors argue that the interesting dynamics relevant to confidence estimation happen in the region of the margin, and that the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem, but it does not cover the multiclass case as required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values. The maximum score estimation should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:

$z = f_{k^{*}}(\mathbf{x}) - \max_{k = 1, \dots, c;\; k \neq k^{*}} f_{k}(\mathbf{x})$    (22)

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, there is an issue with the measure if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account by any of the above confidence measure computation procedures. The one-versus-all setup for multiple class classification may yield ambiguous decisions; for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to several degrees and also treats two degenerate cases. By convention, the confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (larger positive value of a score). Cases with all positive or all negative scores may be considered degenerate: $z_i \leftarrow 0$. The computation is divided into two steps. First, we compute the standard Tommasi confidence measure

$z^{0}_{i} = f_{j^{*}}(\mathbf{x}_i) - \max_{k = 1, \dots, c;\; k \neq j^{*}} f_{k}(\mathbf{x}_i)$    (23)

then the measure $z^{0}_{i}$ is modified to account for class overlap:

$z_{i} = z^{0}_{i}\, \max\!\left(0,\; 1 - \frac{p_i - 1}{C}\right)$    (24)

where $p_i = \mathrm{Card}(\{k = 1, \dots, c \mid f_k(\mathbf{x}_i) > 0\})$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In the case $\forall k: f_k(\mathbf{x}_i) > 0$ or $\forall k: f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, with a penalty that is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows them to be excluded rather than being assigned doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, the confidence is lowered if more than one binary classifier returns a positive score.
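The three usable confidence measures above can be summarized in the following sketch operating on one-vs-all SVM score vectors; taking the normalization constant C in (24) to be the number of classes is an assumption made here for illustration.

```python
# Confidence measures for one-vs-all SVM scores: logistic softmax (20),
# Tommasi score contrast (22), and the proposed class-overlap-aware measure (23)-(24).
import numpy as np

def logistic_confidence(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))   # numerically stable softmax
    return (e / e.sum(axis=1, keepdims=True)).max(axis=1)

def tommasi_confidence(scores):
    top2 = np.sort(scores, axis=1)[:, -2:]                   # two largest scores
    return top2[:, 1] - top2[:, 0]

def overlap_aware_confidence(scores):
    z0 = tommasi_confidence(scores)
    C = scores.shape[1]                                      # assumed: C = number of classes
    p = (scores > 0).sum(axis=1)                             # count of positive one-vs-all scores
    z = z0 * np.maximum(0.0, 1.0 - (p - 1) / C)              # penalize class overlap
    z[(p == 0) | (p == C)] = 0.0                             # degenerate cases get zero confidence
    return z
```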

4. Experimental Evaluation

In this section we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2 in Section 4.1 and (2) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter database was recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) randomly sampled training images across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insight into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database we extensively assess the place recognition performance of each independent part of the proposed system. In particular, we validate the utility of multiple features, the effect of temporal smoothing, the use of unlabeled data, and the different confidence measures.

The IMMED database is used for validation purposes; on it, we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems of mobile robots in indoor environments.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25] with 3 levels and 10 sibling nodes to speed up the search of the tree. This allows the introduction of visual words ranging from more general (higher level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
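As a small sketch of the tf-idf normalization applied to raw visual word counts (the exact weighting variant of [25] may differ):

```python
# tf-idf re-weighting of BOVW histograms to damp overly frequent visual words.
import numpy as np

def tfidf_bovw(counts):
    """counts: (n_images, n_words) raw visual-word counts."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)  # term frequency per image
    df = (counts > 0).sum(axis=0)                                    # images containing each word
    idf = np.log(counts.shape[0] / np.maximum(df, 1))                # inverse document frequency
    return tf * idf
```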

Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness when it comes to the spatial structure of the image. This is done by constructing a pyramid where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram with local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3, with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.
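A sketch of this construction, assuming the dense local features have already been quantized into visual word indices with known image positions, is given below; with 3 levels and 200 words it produces the 4200-dimensional descriptor mentioned above (the per-level weighting of [27] is omitted for brevity).

```python
# Spatial pyramid histogram: concatenate per-cell BOVW histograms over 1x1, 2x2
# and 4x4 grids (3 pyramid levels).
import numpy as np

def spatial_pyramid_histogram(word_ids, xy, img_size, n_words=200, levels=3):
    """word_ids: (n,) visual word index per local feature; xy: (n, 2) positions."""
    w, h = img_size
    parts = []
    for level in range(levels):
        cells = 2 ** level
        cx = np.minimum((xy[:, 0] * cells // w).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells // h).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                mask = (cx == i) & (cy == j)
                parts.append(np.bincount(word_ids[mask], minlength=n_words))
    return np.concatenate(parts)          # 200 * (1 + 4 + 16) = 4200 dimensions
```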

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after a set of filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses returned by each specific filter. Due to its multidimensional nature and the size of an image, this descriptor often results in a very high dimensional vector. In our experimental evaluations we used second order derivative filters in three directions at two scales, with 28 bins per histogram. The total size of the global descriptor resulted in very sparse vectors of up to 400 million dimensions. It was reduced to a 500-dimensional descriptor vector using KPCA with a χ² kernel [73].
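A sketch of this reduction step, assuming scikit-learn's KernelPCA with a precomputed χ² kernel in place of the dedicated KPCA implementation of [73]:

```python
# Kernel PCA with a chi-square kernel to project sparse CRFH histograms down
# to a fixed number of dimensions (500 in the text).
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

def reduce_crfh(X_train, X_test, n_components=500, gamma=1.0):
    """X_*: nonnegative (n_samples, dim) CRFH histograms; needs n_samples >= n_components."""
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    Z_train = kpca.fit_transform(chi2_kernel(X_train, gamma=gamma))
    Z_test = kpca.transform(chi2_kernel(X_test, X_train, gamma=gamma))
    return Z_train, Z_test
```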

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (camera 98 cm above the ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, which is defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves and used for validation and testing purposes, respectively. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is returned as the average accuracy.

It is expected that the global place recognition performance rises from a mediocre level at low supervision to its maximum at a high supervision level.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes. The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all the 132 video pairs to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) if α = 0 and CRFH if α = 1 (IDOL2 dataset, random setup).

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for the baseline single feature methods compared to the early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum in the output, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediary fusion values, which suggest feature complementarity. The fusion parameter α can be safely set to an intermediary value such as 0.5, and the final performance would exceed that of every single feature classifier alone at all supervision levels.

Figure 6: Comparison of single (BOVW, CRFH, and SPH) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single feature and multiple feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performance. Both SVMDAS (late fusion approach) and SimpleMKL (early fusion approach) operate fusion over the 3 single features considered. They outperform the single feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance (at the best α for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for the final system is motivated by this result and by its simplified fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content, which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and temporal accumulation can be a possible solution. Assuming that only a minority of temporal neighbors are classified incorrectly makes the temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and that room changes are relatively slow, the filter size can be set empirically, for example, to the number of frames captured in one second.

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semisupervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare the standard single feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we are interested in observing the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semisupervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semisupervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The low performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.

Figure 8: Comparison of the standard single feature SVM with the semisupervised Laplacian SVM with an RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

The practical application of this method is limited, since the full kernel matrix must be computed and stored in memory, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown, where we used only one Co-Training iteration. The panels illustrate the best choice of the amount of selected high confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights into the Co-Training algorithm can be gained if we perform more than one iteration (see Figure 10). The figures show the evolution of the performance of a single feature classifier after it was iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high confidence estimates was added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Secondly, we notice a certain saturation after more than 6-7 iterations in most cases, which may hint that both classifiers have reached complete agreement.

Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters: (a) accuracy for CO-DAS and single feature approaches, (b) optimal amount of selected samples for the Co-Training feedback loop, and (c) selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module (BOF and CRFH co-training) as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions (far time: minnie cloudy1/minnie cloudy3, minnie night1/minnie night3, minnie sunny1/minnie sunny3); (bottom) different lighting conditions (close time: minnie cloudy1/minnie night1, minnie cloudy1/minnie sunny1, minnie night1/minnie sunny1).

Conclusion. The experiments carried out this far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging the unlabeled data can be learned using the semi-supervised Laplacian SVM with the assumption of low density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. This issue is solved by the more efficient Co-Training algorithm, which is used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which the training patterns are scattered across the database. Randomly labeled images may simulate a situation when some small portions of video are annotated in a frame-by-frame manner. In the extreme case, a few labeled images from every class may be labeled manually.

In this context, we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results depicting these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% correct classification for the BOVW and CRFH based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top confidence estimates in one re-training iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next re-training rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative re-training procedure. Curiously, the initial diversity of the classifiers determines what performance gain can be obtained using the Co-Training algorithm. This explains why, at higher supervision levels, the performance increase of a re-trained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case with one video being used for training and the place recognition method being evaluated on a different video.

Figure 12: Comparison of the global performances for single feature (BOVW-SVM, CRFH-SVM), multiple feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used for the Co-Training algorithm feedback loop. The baseline single feature methods perform about equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, a good illustration concerning the amount of high confidence estimates is shown in Figure 12. It is clear that only a portion of the test set data can be useful for classifier re-training. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20-50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by standard fusion and simple Co-Training methods. The proposed methods CO-DAS and CO-TA-DAS perform the best by successfully leveraging two visual features and the temporal continuity of the video while working in a semi-supervised framework.

Figure 13: Comparison of the performances of the types of confidence measures for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

4.1.7. Effect of the Type of Confidence Measures. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performances for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than that of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

Most locations are represented by one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36,000 images) on average. A few locations are not given a labeled bootstrap video; therefore, a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", or "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used the standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature/approach    SVM     SVM-TA
BOVW                0.49    0.52
CRFH                0.48    0.53
SPH                 0.47    0.49

Baseline: Single Feature Classifier Performance. As shown in Table 1, the single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation of this phenomenon is that, due to the low amount of supervision, a classifier trained on the high dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, with MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature/approach    MKL     MKL-TA    DAS     DAS-TA    CO-DAS    CO-TA-DAS
BOVW-SPH            0.48    0.50      0.51    0.56      0.50      0.53
BOVW-CRFH           0.50    0.54      0.51    0.56      0.54      0.58
SPH-CRFH            0.48    0.51      0.50    0.54      0.54      0.57
BOVW-SPH-CRFH       0.48    0.51      0.51    0.56      -         -

Leveraging the Unlabeled Data. The exploitation of unlabeled data in the learning process is important when dealing with low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves on DAS for all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and the improved single feature classifiers obtained via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal continuity of the video, and the exploitation of unlabeled data.

5. Conclusion

In this work we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on the public video sequence database IDOL2 and then on the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that are useful for successful recognition in such video corpuses. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all the acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done in selecting more discriminant visual features for specific applications and in a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under reference ANR-09-BLAN-0165-02 (IMMED project) and from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.

[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.

[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.

[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.

[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.

[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.

[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, October 2010.

[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.

[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.

[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.

[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.

[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.

[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.

[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, September 2009.

[15] M. A. Perez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.

[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.

[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.

[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.

[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, September 2007.

[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.

[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.

[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.

[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.

[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.

[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.

[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.

[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.

[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.

[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.

[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.

[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.

[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.

[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.

[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.

[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.

[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.

Advances in Multimedia 21

[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.

[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.

[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.

[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, Laboratoire Jean Kuntzmann, LEAR, INRIA Grenoble, June 2010.

[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.

[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.

[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.

[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.

[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.

[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.

[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II578–II585, July 2004.

[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.

[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.

[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.

[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.

[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.

[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.

[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.

[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.

[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.

[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.

[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.

[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.

[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.

[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.

[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.

[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.

[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.

[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.

[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.

[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.

[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.

[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.

[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.

[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.

[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.

[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.

[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep. CVAP/CAS, Kungliga Tekniska Hoegskolan, 2006.


Denoting $w_j^t = \sum_{i=1}^{l} \alpha_{ij} y_i \mathbf{s}_i^t$, we can rewrite the decision function using input patterns and the learned weights:

$$f_j^{\mathrm{SVMDAS}}(\mathbf{z}) = \sum_{t=1}^{p} \sum_{k=1}^{l} w_{jk}^{t}\, f_k^{t}(\mathbf{x}) \quad (14)$$

This representation reveals that using a linear kernel in the SVMDAS framework yields a classifier whose weights are learned for every possible linear combination of base classifiers. The DAS can be seen as a special case in this context, but with significantly fewer parameters. Using a kernel such as the RBF or polynomial kernel can result in an even richer class of classifiers.

The disadvantage of such a configuration is that a final-stage classifier needs to be trained as well and its parameters tuned.
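To make the stacking idea concrete, the following is a minimal sketch of such a second-stage fusion, assuming the per-feature base classifiers have already been trained and their decision scores precomputed; the array names and scikit-learn usage are illustrative and not the authors' implementation.

import numpy as np
from sklearn.svm import LinearSVC

def stack_scores(score_list):
    # score_list: one (n_samples, n_classes) decision-score matrix per visual cue;
    # the stacked vector z simply concatenates the base-classifier outputs.
    return np.hstack(score_list)

# hypothetical precomputed outputs of two base classifiers (e.g. BOVW and CRFH SVMs)
rng = np.random.default_rng(0)
scores_train = [rng.normal(size=(100, 5)), rng.normal(size=(100, 5))]
y_train = rng.integers(0, 5, size=100)
scores_test = [rng.normal(size=(40, 5)), rng.normal(size=(40, 5))]

# a linear second-stage SVM learns one weight per (cue, class) base output
fusion = LinearSVC(C=1.0).fit(stack_scores(scores_train), y_train)
y_pred = fusion.predict(stack_scores(scores_test))

Replacing the linear second-stage classifier with a kernel SVM (e.g., one with an RBF kernel) corresponds to the richer class of fused classifiers mentioned above.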

3.2.4. Extension to Temporal Accumulation (TA). Video content has a temporal nature: the visual content does not usually change much over a short period of time. In the case of topological place recognition indoors, this constraint is useful, since place changes are encountered relatively rarely with respect to the frame rate of the video.

We propose to modify the classifier output such that rapid class changes are discouraged within a relatively short period of time. This reduces the proliferation of occasional, temporally localized misclassifications.

Let $s_i^t = f^{(t)}(\mathbf{x}_i)$ be the scores of a binary classifier for visual cue $t$ and $h$ a temporal window of size $2\tau + 1$. Then temporal accumulation can be written as

$$s_{i,\mathrm{TA}}^{t} = \sum_{k=-\tau}^{\tau} h(k)\, s_{i+k}^{t} \quad (15)$$

and can be easily generalized to multiple feature classification by applying it separately to the output of the classifiers associated with each feature $\mathbf{s}^t$, where $t = 1, \dots, p$ is the visual feature type. We use an averaging filter of size $\tau$ defined as

$$h(k) = \frac{1}{2\tau + 1}, \quad k = -\tau, \dots, \tau \quad (16)$$

The input of the TA stage is therefore the SVM scores obtained after classification, and the output is again SVM scores, with the temporal constraint enforced.
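As an illustration, a minimal sketch of this temporal accumulation step is given below, assuming the per-frame decision scores of one classifier are stored in a NumPy array; the names and the zero-padded border handling are illustrative choices.

import numpy as np

def temporal_accumulation(scores, tau):
    # scores: (n_frames, n_classes) decision values of one base classifier.
    # Returns smoothed scores of the same shape, using the averaging filter
    # h(k) = 1 / (2*tau + 1) for k = -tau..tau (edges handled by zero padding).
    h = np.ones(2 * tau + 1) / (2 * tau + 1)
    return np.apply_along_axis(lambda s: np.convolve(s, h, mode="same"), 0, scores)

# example: smooth the scores of each visual cue independently before fusion
scores = np.random.rand(1000, 5)          # 1000 frames, 5 places (stand-in values)
smoothed = temporal_accumulation(scores, tau=50)
labels = smoothed.argmax(axis=1)          # class decision after smoothing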

3.3. Co-Training with Time Information and Late Fusion. We have already presented how to perform multiple feature fusion within the late fusion paradigm and how it can be extended to take into account the temporal continuity of video. In this section we explain how to additionally learn from labeled training data and unlabeled data.

3.3.1. The Co-Training Algorithm. The standard Co-Training algorithm [60] iteratively trains two classifiers on two-view data $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)})$ by feeding the highest-confidence estimates (according to the confidence score $z_i$) from the testing set to the classifier of the other view. In this semisupervised approach, the discriminatory power of each classifier is improved by the other classifier's complementary knowledge. The testing set is gradually labeled, round by round, using only the highest-confidence estimates. The pseudocode is presented in Algorithm 1, which could also be extended to multiple views as in [53].

The power of the method lies in its capability of learning from small training sets and eventually growing its discriminative power on the large unlabeled dataset as more confident estimations are added to the training set. The following assumptions are made:

(1) the two distinct visual cues bring complementary information;

(2) the initially labeled set for each individual classifier is sufficient to bootstrap the iterative learning process;

(3) the confident estimations on unlabeled data are helpful to predict the labels of the remaining unlabeled data.

Originally, the Co-Training algorithm runs until some stopping criterion is met or N iterations are exceeded. For instance, a stopping criterion could be a rule that stops the learning process when there are no confident estimations left to add, or when there has been a relatively small change from iteration t − 1 to t. The parameter-less version of Co-Training runs until the pool of unlabeled samples is completely exhausted, but requires a threshold on the confidence measure, which is used to separate high- and low-confidence estimates. In our work we use this variant of the Co-Training algorithm.

3.3.2. The Co-Training Algorithm in the Regularization Framework

Motivation. Intuitively, it is clear that after a sufficient number of rounds both classifiers will agree on most of the unlabeled patterns. It remains unclear why and what mechanisms make such learning useful. It can be justified from the learning theory point of view: there are fewer possible solutions, or classifiers from the hypothesis space, that agree on unlabeled data in two views. Recall that every classifier individually should fit its training data. In the context of the Co-Training algorithm, each classifier should be somehow restricted by the other classifier. The two trained classifiers that are coupled in this system effectively reduce the possible solution space. Each of these two classifiers is less likely to overfit, since each of them has been initially trained on its own training data while taking into account the training process of the other classifier, which is carried out in parallel. We follow the discussion from [53] to give more insight into this phenomenon.

Regularized Risk Minimization (RRM) Framework. A better understanding of the Co-Training algorithm can be gained from the RRM framework. Let us introduce the hinge loss


INPUT:
  Training set L = {(x_i, y_i)}, i = 1, ..., l
  Testing set U = {x_i}, i = 1, ..., u
OUTPUT:
  ŷ_i — class estimations for the testing set U; f(1), f(2) — trained classifiers
PROCEDURE:
(1) Compute visual features x_i = (x_i(1), x_i(2)) for every image I_i in the dataset.
(2) Initialize L1 = {(x_i(1), y_i)}, i = 1, ..., l, and L2 = {(x_i(2), y_i)}, i = 1, ..., l.
(3) Initialize U1 = {x_i(1)}, i = 1, ..., u, and U2 = {x_i(2)}, i = 1, ..., u.
(4) Create two work sets U'1 = U1 and U'2 = U2.
(5) Repeat until the sets U'1 and U'2 are empty (CO):
  (a) Train classifiers f(1), f(2) using the sets L1, L2, respectively.
  (b) Classify the patterns in the sets U'1 and U'2 using the classifiers f(1) and f(2), respectively:
    (i) compute scores s(1)_test and confidences z(1) on the set U'1;
    (ii) compute scores s(2)_test and confidences z(2) on the set U'2.
  (c) Add the k top-confidence estimations L'1 ⊂ U'1, L'2 ⊂ U'2 to the training sets:
    (i) L1 = L1 ∪ L'1;
    (ii) L2 = L2 ∪ L'2.
  (d) Remove the k top-confidence patterns from the working sets:
    (i) U'1 = U'1 \ L'1;
    (ii) U'2 = U'2 \ L'2.
  (e) Go to step (5).
(6) Optionally perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
  (a) compute fused scores s_test^DAS = (1 − β) s_test(1) + β s_test(2);
  (b) output class estimations ŷ_i from the fused scores s_test^DAS.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.

function $\ell(\mathbf{x}, y, f(\mathbf{x}))$ commonly used in classification. Let us also introduce the empirical risk of a candidate function $f \in \mathcal{F}$:

$$\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i)) \quad (17)$$

which measures how well the classifier fits the training data. It is well known that by minimizing only the training error, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead:

$$f^{\mathrm{RRM}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda \Omega(f) \quad (18)$$

where $\Omega(f)$ is a nonnegative functional, or regularizer, that returns a large value or penalty for very complicated functions (typically the functions that fit the data perfectly). The parameter $\lambda > 0$ controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and a better generalization capability on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer $\Omega_{\mathrm{SVM}}(f) = \frac{1}{2}\|\mathbf{w}\|^2$ selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning we can select a regularizer such that it is sufficiently smooth on unlabeled data as well. Keeping all the previous discussion in mind, a function that fits the training data and respects the unlabeled data will probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions $f^{(1)}, f^{(2)} \in \mathcal{F}$ that minimize the regularized risk and agree on the unlabeled data at the same time. The first restriction on the hypothesis space is that the first function should not only reduce its own regularized risk but also agree with the second function. We can then write a two-view regularized risk minimization problem as

$$(f^{(1)}, f^{(2)}) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell(\mathbf{x}_i, y_i, f^{(t)}(\mathbf{x}_i)) + \lambda_1 \Omega_{\mathrm{SVM}}(f^{(t)}) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell(\mathbf{x}_i, f^{(1)}(\mathbf{x}_i), f^{(2)}(\mathbf{x}_i)) \quad (19)$$

where $\lambda_2 > 0$ controls the balance between the fit on the training data and the agreement on the test data. The first part of (19) states that each individual classifier should fit the


Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b).

given training data but should not overfit, which is prevented with the SVM regularizer $\Omega_{\mathrm{SVM}}(f)$. The second part is a regularizer $\Omega_{\mathrm{CO}}(f^{(1)}, f^{(2)})$ for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by being required to agree with the other classifier. It is clear that an algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

It should be noted that the framework could easily be extended to more than two classifiers. In the literature, algorithms following this spirit implement multiple-view learning. Refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work we use the DAS method for late fusion, although it is possible to use the more general SVMDAS method as well. Experimental evaluation will show that very competitive performances can be obtained using the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. Experimental evaluation will reveal the relative performances of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS (or CO-TA-DAS, if temporal accumulation is enabled) method is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold k for the top-confidence sample selection, the temporal accumulation window width τ, and the late fusion parameter β. We express the threshold k as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion on the selection of the parameter β is given in Section 4.1.2.
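For concreteness, the following is a rough sketch of the CO-DAS loop of Algorithm 1, under simplifying assumptions: linear SVMs as base classifiers, the maximum decision value as a stand-in confidence, the same fused label estimate fed back to both views, and k expressed as a fraction of the remaining pool. It is illustrative only and not the authors' implementation.

import numpy as np
from sklearn.svm import LinearSVC

def co_das(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, k=0.1, beta=0.5):
    # two labeled views (L1, L2) and two unlabeled views of the same test frames
    L1_X, L1_y = X1_lab.copy(), y_lab.copy()
    L2_X, L2_y = X2_lab.copy(), y_lab.copy()
    pool = np.arange(len(X1_unlab))              # indices of still-unlabeled test patterns
    f1 = LinearSVC().fit(L1_X, L1_y)             # view-1 classifier (e.g. BOVW)
    f2 = LinearSVC().fit(L2_X, L2_y)             # view-2 classifier (e.g. CRFH)
    while len(pool) > 0:
        s1 = f1.decision_function(X1_unlab[pool])
        s2 = f2.decision_function(X2_unlab[pool])
        conf = np.maximum(s1.max(axis=1), s2.max(axis=1))   # crude stand-in confidence
        n_add = max(1, int(k * len(pool)))
        top = pool[np.argsort(-conf)[:n_add]]               # k top-confidence patterns
        fused = ((1 - beta) * f1.decision_function(X1_unlab[top])
                 + beta * f2.decision_function(X2_unlab[top]))
        y_hat = f1.classes_[fused.argmax(axis=1)]
        # feed the confident estimates back into both training sets
        L1_X = np.vstack([L1_X, X1_unlab[top]]); L1_y = np.concatenate([L1_y, y_hat])
        L2_X = np.vstack([L2_X, X2_unlab[top]]); L2_y = np.concatenate([L2_y, y_hat])
        pool = np.setdiff1d(pool, top)
        f1 = LinearSVC().fit(L1_X, L1_y)          # retrain on the enlarged sets
        f2 = LinearSVC().fit(L2_X, L2_y)
    # final DAS fusion of the two re-trained classifiers on the whole test set
    s_das = (1 - beta) * f1.decision_function(X1_unlab) + beta * f2.decision_function(X2_unlab)
    return f1.classes_[s_das.argmax(axis=1)]

Temporal accumulation (the CO-TA-DAS variant) would simply smooth s_das over time, as in (15), before taking the final decision.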

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that in a one-versus-all setup with $c$ classes the scores $\{f_k(\mathbf{x})\}_{k=1}^{c}$ are given. Then the probability, or classification confidence, is computed as

$$P(y = k \mid \mathbf{x}) = \frac{\exp(f_k(\mathbf{x}))}{\sum_{i=1}^{c} \exp(f_i(\mathbf{x}))} \quad (20)$$

which ensures that the probability is larger for larger positive score values and that the values sum to 1 over all scores. This property allows the classifier output to be interpreted as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Moreover, forcing the scores to be normalized to sum to one may not preserve all of their dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76] a parameter-less method was proposed, which assigns the score value

$$z = \begin{cases} p_{+}, & f(\mathbf{x}) > 1 \\ \dfrac{1 + f(\mathbf{x})}{2}, & -1 \le f(\mathbf{x}) \le 1 \\ p_{-}, & f(\mathbf{x}) < -1 \end{cases} \quad (21)$$

where $p_{+}$ and $p_{-}$ are the fractions of positive and negative score values, respectively. The authors argue that the interesting dynamics relevant to confidence estimation happen in the region of the margin, and the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem,


but it does not cover the multiclass case required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77] and thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum-score estimation should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximum scores:

$$z = f_{k^{*}}(\mathbf{x}) - \max_{k=1,\dots,c;\; k \neq k^{*}} f_k(\mathbf{x}) \quad (22)$$

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, there is an issue with the measure if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account in any of these confidence measure computation procedures. The one-versus-all setup for multiple class classification may yield ambiguous decisions: for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to varying degrees and also treats two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (larger positive value of a score). Cases with all positive or all negative scores may be considered degenerate: $z_i \leftarrow 0$. The computation is divided into two steps. First, we compute the standard Tommasi confidence measure:

$$z_i^{0} = f_{j^{*}}(\mathbf{x}_i) - \max_{k=1,\dots,c;\; k \neq j^{*}} f_k(\mathbf{x}_i) \quad (23)$$

then the measure $z_i^{0}$ is modified to account for class overlap:

$$z_i = z_i^{0} \max\left(0,\; 1 - \frac{p_i - 1}{C}\right) \quad (24)$$

where $p_i = \mathrm{Card}(\{k = 1,\dots,c \mid f_k(\mathbf{x}_i) > 0\})$ represents the number of classes for which $\mathbf{x}_i$ has positive scores (class overlap). In the case $\forall k,\; f_k(\mathbf{x}_i) > 0$ or $\forall k,\; f_k(\mathbf{x}_i) < 0$, we set $z_i \leftarrow 0$.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows them to be excluded rather than assigned doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, confidence is lowered if more than one binary classifier returns a positive score.
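The following sketch illustrates, for a single vector of one-versus-all scores, the logistic, Tommasi, and proposed class-overlap-aware confidence measures; the value of the penalty constant C is an illustrative assumption.

import numpy as np

def logistic_confidence(scores):
    # softmax of the scores, as in (20): P(y = k | x)
    e = np.exp(scores - scores.max())        # shift for numerical stability
    return e / e.sum()

def tommasi_confidence(scores):
    # contrast between the two largest scores, as in (22)
    top2 = np.sort(scores)[-2:]
    return top2[1] - top2[0]

def overlap_aware_confidence(scores, C=2.0):
    # proposed measure (23)-(24): Tommasi contrast, down-weighted when several
    # one-vs-all classifiers fire; degenerate cases (all positive / all negative)
    # get zero confidence; C is an illustrative penalty scale
    if np.all(scores > 0) or np.all(scores < 0):
        return 0.0
    p = np.sum(scores > 0)                    # number of positive (overlapping) classes
    return tommasi_confidence(scores) * max(0.0, 1.0 - (p - 1) / C)

scores = np.array([1.2, -0.3, 0.4, -1.0, -0.7])   # example one-vs-all outputs
print(logistic_confidence(scores), tommasi_confidence(scores), overlap_aware_confidence(scores))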

4. Experimental Evaluation

In this section we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (a) on the public database IDOL2 in Section 4.1 and (b) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter database is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) randomly sampled training images across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insight into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database we extensively assess the place recognition performance of each independent part of the proposed system. For instance, we validate the utility of multiple features, the effect of temporal smoothing, unlabeled data, and different confidence measures.

The IMMED database is used for validation purposes; on it we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems of mobile robots in indoor environments.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences are visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search of the tree. This allows the introduction of visual words ranging from more general (higher-level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
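As an illustration, a standard tf-idf re-weighting of BOVW histograms could be sketched as follows; this assumes the raw visual-word counts are already available and uses the textbook formulation, which may differ in detail from the exact variant used here.

import numpy as np

def tfidf(bovw):
    # bovw: (n_images, n_words) raw visual-word counts
    tf = bovw / np.maximum(bovw.sum(axis=1, keepdims=True), 1)   # term frequency per image
    df = np.maximum((bovw > 0).sum(axis=0), 1)                   # document frequency per word
    idf = np.log(bovw.shape[0] / df)                             # inverse document frequency
    return tf * idf

hist = np.random.randint(0, 5, size=(10, 1111))   # 10 images, 1111-word vocabulary (stand-in)
weighted = tfidf(hist)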

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness when it comes to the spatial structure of the image. This is done by


Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

constructing a pyramid where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram with local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3 with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that the maximum of visual information is captured while reducing the computational burden.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after some filtering operation on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses returned from each specific filter. Due to the multidimensional nature and the size of an image, such a descriptor often results in a very high dimensionality vector. In our experimental evaluations we used second-order derivative filters in three directions at two scales, with 28 bins per histogram. The total size of the global descriptor resulted in very sparse vectors of up to 400 million dimensions. It was reduced to a 500-dimensional descriptor vector using KPCA with a χ² kernel [73].
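A sketch of this reduction step using scikit-learn is given below, with a precomputed χ² kernel matrix; the data sizes are stand-ins and much smaller than the actual CRFH descriptors.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

# stand-in, non-negative histogram-like data (far smaller than real CRFH vectors)
X = np.random.rand(600, 2000)
X /= X.sum(axis=1, keepdims=True)

K = chi2_kernel(X)                                   # precomputed chi-square kernel matrix
kpca = KernelPCA(n_components=500, kernel="precomputed")
X_reduced = kpca.fit_transform(K)                    # 500-dimensional projected descriptors
# a new descriptor set X_new would be projected with kpca.transform(chi2_kernel(X_new, X))
print(X_reduced.shape)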

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, which is defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves and used, respectively, for validation and testing purposes. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is returned as the average accuracy measure.
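A possible sketch of one fold of this random-sampling protocol, assuming frame-level features X and labels y for the whole corpus, is given below; function and variable names are illustrative.

import numpy as np
from sklearn.model_selection import train_test_split

def random_split(X, y, supervision=0.05, seed=0):
    # supervision: fraction of labeled training frames (1% to 50% in the experiments)
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=supervision, stratify=y, random_state=seed)
    # the remaining frames are split into two halves: validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# example usage; accuracy would then be averaged over 10 such folds (different seeds)
X = np.random.rand(500, 10)
y = np.random.randint(0, 5, 500)
(train, val, test) = random_split(X, y, supervision=0.05, seed=0)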

It is expected that the global place recognition performance rises from a mediocre level at low supervision to its maximum at a high supervision level.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.


Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels (curves for 1, 2, 3, 5, 10, 20, 30, and 50% of supervision). Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) if α = 0 and CRFH if α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the light conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all 132 video pairs to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for baseline single-feature methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum as output, on which a class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediary fusion values, which suggest feature complementarity. The fusion parameter α can be safely set to an intermediary value such as 0.5, and the

final performance would exceed that of every single-feature classifier alone at all supervision levels.

Figure 6: Comparison of single-feature (BOVW, CRFH, and SPH Level 3) and multiple-feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single-feature and multiple-feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single-feature approaches, with SPH providing the best performance. Both SVMDAS (late fusion approach) and SimpleMKL (early fusion approach) operate fusion over the 3 single features considered. They outperform the single-feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance (at the best α for each supervision level) of the DAS (Figure 5) to that of the SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for subsequent usage in the final system is motivated by this result and by the simplified fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content, which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single-feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames.


Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size, with curves for supervision levels of 1, 5, 10, and 50% (IDOL2 dataset, SPH Level 3 features).

The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and temporal accumulation is a possible solution. The assumption that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and that room changes are relatively slow, the filter size can be set empirically, for example, to the number of frames captured in one second.

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semisupervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single-feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we are interested in observing the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single-feature and multifeature baselines to the proposed semisupervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semisupervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semisupervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The low performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single-feature SVM with a semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The practical application of this method is limited, since the full kernel matrix must be computed and stored in the memory of a computer, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown where only one Co-Training iteration is used. The left and right panels illustrate the best choice of the amount of selected high-confidence patterns for classifier retraining and the DAS fusion parameter selected by the cross-validation procedure, respectively. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top-confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at a low supervision rate. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights into the Co-Training algorithm can be gained if we perform more than one iteration (see Figure 10). The figures show the evolution of the performance of each single-feature classifier as it is iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high-confidence estimates is added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are probably sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement.


Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters: (a) accuracy for CO-DAS and single-feature approaches; (b) optimal amount of selected samples for the Co-Training feedback loop; (c) selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers (BOF and CRFH co-training) of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, (bottom) different lighting conditions. Panel pairs: minnie cloudy1 vs. minnie cloudy3 (far time), minnie cloudy1 vs. minnie night1 (close time), minnie cloudy1 vs. minnie sunny1 (close time), minnie night1 vs. minnie night3 (far time), minnie sunny1 vs. minnie sunny3 (far time), and minnie night1 vs. minnie sunny1 (close time).

Conclusion. The experiments carried out thus far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using a semisupervised Laplacian SVM under the assumption of low-density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. This issue is solved by using the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which training patterns are scattered across the database. Randomly labeled images may simulate the situation when some small portions of video are annotated in a frame-by-frame manner. In the extreme, a few labeled images from every class may be labeled manually.

In this context we are interested in the performance of the single-feature methods, the early and late fusion methods, and the proposed semisupervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Figure 11: Comparison of the performance of single-feature (BOVW, CRFH), early fusion (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOVW, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single-feature methods, which achieve approximately 50% and 60% correct classification for the BOVW- and CRFH-based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top-confidence estimates in one retraining iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top-confidence estimates are not only correct but also useful for each classifier, improving its discriminative power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of the high-confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of single-feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in this random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single-feature classifiers need to be provided with a sufficient amount of training data to bootstrap the iterative retraining procedure. Interestingly, the diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm. This explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case with one video being used for training and the place recognition method evaluated on a different video.

Figure 12: Comparison of the global performances for single-feature (BOVW-SVM, CRFH-SVM), multiple-feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used for the Co-Training algorithm feedback loop. The baseline single-feature methods perform about equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, a good illustration concerning the amount of high-confidence estimates is shown in Figure 12. It is clear that only a portion of the test set data can be useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high-confidence estimates is around 20-50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single-feature baselines are outperformed by the standard fusion and simple Co-Training methods. The proposed methods CO-DAS and CO-TA-DAS perform best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semisupervised framework.

Figure 13: Comparison of the performances of the types of confidence measures for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

4.1.7. Effect of the Type of Confidence Measures. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performances for different amounts of feedback in the Co-Training phase. The performances for the Ruping approach are not reported, as they were much lower than those of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings


Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Most locations are represented by one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; therefore, a small randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been unified such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations.

Table 1: IMMED dataset, average accuracy of the single-feature approaches.

Feature/approach   SVM    SVM-TA
BOVW               0.49   0.52
CRFH               0.48   0.53
SPH                0.47   0.49

The summary of the results for the single-feature and multiple-feature methods is provided in Tables 1 and 2, respectively.

Baseline: Single-Feature Classifier Performance. As shown in Table 1, single-feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor performs worse than its simpler BOVW variant. A possible explanation of this phenomenon is that, due to the low amount of supervision, a classifier trained on high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single-feature methods. We observe the same order of improvement for the multiple-feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single-feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single-feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single-feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Table 2: IMMED dataset, average accuracy of the multiple-feature approaches.

Feature/approach   MKL    MKL-TA   DAS    DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH           0.48   0.50     0.51   0.56     0.50     0.53
BOVW-CRFH          0.50   0.54     0.51   0.56     0.54     0.58
SPH-CRFH           0.48   0.51     0.50   0.54     0.54     0.57
BOVW-SPH-CRFH      0.48   0.51     0.51   0.56     —        —

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when it comes to low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semisupervised setup. It clearly outperforms all single-feature methods and improves over DAS on all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by successfully leveraged visual feature complementarity and improved single-feature classifiers via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines all benefits brought by the individual features, the temporal continuity of video, and the exploitation of unlabeled data.

5. Conclusion

In this work we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on the public video sequence database IDOL2 and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured. The usage of semi-supervised

learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all the acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data thanks to a semi-supervised strategy. Additional work could be done in selecting more discriminant visual features for specific applications and in a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.
[15] M. A. Perez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, Laboratoire Jean Kuntzmann, LEAR, INRIA Grenoble, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martínez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.




INPUT:
  Training set L = {(x_i, y_i)}_{i=1}^{l}
  Testing set U = {x_i}_{i=1}^{u}
OUTPUT:
  \hat{y}_i: class estimations for the testing set U
  f^{(1)}, f^{(2)}: trained classifiers
PROCEDURE:
(1) Compute the visual features x_i = (x_i^{(1)}, x_i^{(2)}) for every image I_i in the dataset.
(2) Initialize L_1 = {(x_i^{(1)}, y_i)}_{i=1}^{l} and L_2 = {(x_i^{(2)}, y_i)}_{i=1}^{l}.
(3) Initialize U_1 = {x_i^{(1)}}_{i=1}^{u} and U_2 = {x_i^{(2)}}_{i=1}^{u}.
(4) Create two work sets \tilde{U}_1 = U_1 and \tilde{U}_2 = U_2.
(5) Repeat until the sets \tilde{U}_1 and \tilde{U}_2 are empty (CO):
  (a) Train the classifiers f^{(1)}, f^{(2)} using the sets L_1, L_2, respectively.
  (b) Classify the patterns in the sets \tilde{U}_1 and \tilde{U}_2 using the classifiers f^{(1)} and f^{(2)}, respectively:
    (i) compute the scores s^{(1)}_{test} and confidences z^{(1)} on the set \tilde{U}_1;
    (ii) compute the scores s^{(2)}_{test} and confidences z^{(2)} on the set \tilde{U}_2.
  (c) Select the k top confidence estimations \tilde{L}_1 ⊂ \tilde{U}_1 and \tilde{L}_2 ⊂ \tilde{U}_2 and add them to the training sets:
    (i) L_1 = L_1 ∪ \tilde{L}_2;
    (ii) L_2 = L_2 ∪ \tilde{L}_1.
  (d) Remove the k top confidence patterns from the working sets:
    (i) \tilde{U}_1 = \tilde{U}_1 \ \tilde{L}_1;
    (ii) \tilde{U}_2 = \tilde{U}_2 \ \tilde{L}_2.
  (e) Go to step (5).
(6) Optionally perform Temporal Accumulation (TA) according to (15).
(7) Perform classifier output fusion (DAS):
  (a) compute the fused scores s^{DAS}_{test} = (1 - β) s^{(1)}_{test} + β s^{(2)}_{test};
  (b) output the class estimations \hat{y}_i from the fused scores s^{DAS}_{test}.

Algorithm 1: The CO-DAS and CO-TA-DAS algorithms.

function ℓ(x, y, f(x)) commonly used in classification. Let us also introduce the empirical risk of a candidate function f ∈ F,

\hat{R}(f) = \frac{1}{l} \sum_{i=1}^{l} \ell(x_i, y_i, f(x_i)), \quad (17)

which measures how well the classifier fits the training data. It is well known that, when minimizing only the training error, the resulting classifier is very likely to overfit. In practice, regularized risk minimization (RRM) is performed instead,

f_{RRM} = \arg\min_{f \in F} \hat{R}(f) + \lambda \Omega(f), \quad (18)

where Ω(f) is a nonnegative functional, or regularizer, that returns a large value, or penalty, for very complicated functions (typically the functions that fit perfectly to the data). The parameter λ > 0 controls the balance between the fit to the training data and the complexity of the classifier. By selecting a proper regularization parameter, overfitting can be avoided and better generalization on novel data can be achieved. A good example is the SVM classifier: the corresponding regularizer \Omega_{SVM}(f) = \frac{1}{2}\|w\|^2 selects the function that maximizes the margin.

The Co-Training in the RRM. In semisupervised learning we can select a regularizer such that it is sufficiently smooth on the unlabeled data as well. Keeping the previous discussion in mind, a function that fits the training data while respecting the unlabeled data will probably perform better on future data. In the case of the Co-Training algorithm, we are looking for two functions f^{(1)}, f^{(2)} ∈ F that minimize the regularized risk and, at the same time, agree on the unlabeled data. The restriction on the hypothesis space is that each function should not only reduce its own regularized risk but also agree with the other function. We can then write a two-view regularized risk minimization problem as

(f^{(1)}, f^{(2)}) = \arg\min_{f^{(1)}, f^{(2)}} \sum_{t=1}^{2} \left( \frac{1}{l} \sum_{i=1}^{l} \ell(x_i, y_i, f^{(t)}(x_i)) + \lambda_1 \Omega_{SVM}(f^{(t)}) \right) + \lambda_2 \sum_{i=1}^{l+u} \ell(x_i, f^{(1)}(x_i), f^{(2)}(x_i)), \quad (19)

where \lambda_2 > 0 controls the balance between the fit on the training data and the agreement on the test data.


Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b).

The first part of (19) states that each individual classifier should fit the given training data but should not overfit, which is prevented by the SVM regularizer \Omega_{SVM}(f). The second part is a regularizer \Omega_{CO}(f^{(1)}, f^{(2)}) for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and by the requirement to agree with the other classifier. An algorithm implemented in this framework thus elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

It should be noted that the framework can easily be extended to more than two classifiers. In the literature, algorithms following this spirit implement multiple view learning; refer to [53] for the extension of the framework to multiple views.

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work we use the DAS method for late fusion, although it is possible to use the more general SVMDAS method as well; the experimental evaluation will show that very competitive performances can be obtained with the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. The experimental evaluation will reveal the relative performances of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS (or CO-TA-DAS, if temporal accumulation is enabled) method is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold k for the top confidence sample selection, the temporal accumulation window width τ, and the late fusion parameter β. We express the threshold k as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, a discussion on the selection of the parameter β is given in Section 4.1.2.
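For illustration, a minimal Python sketch of the CO-DAS loop of Algorithm 1 follows. It assumes two precomputed feature matrices (one per view), a multiclass problem, scikit-learn's LinearSVC as the base classifier, and a Tommasi-style score contrast as a confidence placeholder; function and variable names are ours, and the cross-feeding of confident estimates between views follows the standard Co-Training scheme rather than the exact implementation used in the experiments.

import numpy as np
from sklearn.svm import LinearSVC

def score_contrast(scores):
    # Placeholder confidence: contrast between the two largest one-vs-all scores.
    top2 = np.sort(scores, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def co_das(X1_l, X2_l, y_l, X1_u, X2_u, k=0.1, beta=0.5, n_iter=1):
    # Co-Training of two view-specific linear SVMs followed by DAS score fusion.
    # Assumes a multiclass problem (decision_function returns one column per class).
    L1_X, L1_y = X1_l.copy(), np.asarray(y_l).copy()
    L2_X, L2_y = X2_l.copy(), np.asarray(y_l).copy()
    work = np.arange(len(X1_u))                  # indices of still-unlabeled test samples
    f1, f2 = LinearSVC(), LinearSVC()
    for _ in range(n_iter):
        if len(work) == 0:
            break
        f1.fit(L1_X, L1_y)
        f2.fit(L2_X, L2_y)
        s1 = f1.decision_function(X1_u[work])
        s2 = f2.decision_function(X2_u[work])
        n_sel = max(1, int(k * len(work)))
        sel1 = np.argsort(-score_contrast(s1))[:n_sel]   # most confident under view 1
        sel2 = np.argsort(-score_contrast(s2))[:n_sel]   # most confident under view 2
        # Cross-feed: each view's confident estimates augment the other view's training set.
        L2_X = np.vstack([L2_X, X2_u[work[sel1]]])
        L2_y = np.concatenate([L2_y, f1.classes_[s1[sel1].argmax(axis=1)]])
        L1_X = np.vstack([L1_X, X1_u[work[sel2]]])
        L1_y = np.concatenate([L1_y, f2.classes_[s2[sel2].argmax(axis=1)]])
        work = np.delete(work, np.union1d(sel1, sel2))   # shrink the working sets
    # DAS late fusion of the two retrained classifiers on the whole test set.
    s_das = (1 - beta) * f1.decision_function(X1_u) + beta * f2.decision_function(X2_u)
    return f1.classes_[s_das.argmax(axis=1)]

Increasing n_iter reproduces the multi-iteration behaviour studied later in the experiments, and the fused scores can additionally be smoothed over time before the final decision to obtain a CO-TA-DAS-like variant.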

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided out of the box by an SVM classifier. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several of them and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with c classes, the scores \{f_k(x)\}_{k=1}^{c} are given. Then the probability, or classification confidence, is computed as

P(y = k \mid x) = \frac{\exp(f_k(x))}{\sum_{i=1}^{c} \exp(f_i(x))}, \quad (20)

which ensures that the probability is larger for larger positive score values and that the probabilities sum to 1 over all scores. This property allows interpreting the classifier output as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Moreover, forcing the scores to sum to one may not preserve all of their dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76] a parameter-less method was proposed, which assigns the score value

z = \begin{cases} p_{+}, & f(x) > 1 \\ \frac{1 + f(x)}{2}, & -1 \le f(x) \le 1 \\ p_{-}, & f(x) < -1 \end{cases} \quad (21)

where p_{+} and p_{-} are the fractions of positive and negative score values, respectively. The authors argue that the dynamics relevant to confidence estimation happen in the region of the margin, while the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem, but it does not cover the multiclass case required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where the measure is thresholded to obtain a decision corresponding to the "no action", "reject", or "do not know" situations for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum score estimation should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximal scores,

z = f_{k^{*}}(x) - \max_{k=1,\dots,c;\; k \neq k^{*}} f_k(x). \quad (22)

This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, the measure has an issue if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account by any of the above confidence measure computation procedures. The one-versus-all setup for multiple class classification may yield ambiguous decisions: for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) at several degrees and also treats the two degenerate cases. By convention, confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (a larger positive score value). Cases with all positive or all negative scores are considered degenerate and receive z_i ← 0. The computation is divided in two steps. First, we compute the standard Tommasi confidence measure

z_i^{0} = f_{j^{*}}(x_i) - \max_{j=1,\dots,c;\; j \neq j^{*}} f_j(x_i); \quad (23)

then the measure z_i^{0} is modified to account for class overlap,

z_i = z_i^{0} \max\left(0,\; 1 - \frac{p_i - 1}{C}\right), \quad (24)

where p_i = \mathrm{Card}(\{k = 1,\dots,c \mid f_k(x_i) > 0\}) represents the number of classes for which x_i has positive scores (class overlap). In case \forall k\, f_k(x_i) > 0 or \forall k\, f_k(x_i) < 0, we set z_i ← 0.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows excluding them rather than assigning them doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, confidence is lowered if more than one binary classifier returns a positive score.
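As an illustration, the following hedged sketches implement the logistic (20), score contrast (22), and proposed class-overlap-aware (23)-(24) measures for a single vector of one-versus-all scores; the normalizing constant C of (24) is taken here to be the number of classes, and the function names are ours.

import numpy as np

def logistic_confidence(scores):
    # Softmax over the one-vs-all scores, Eq. (20); the winning class probability is returned.
    e = np.exp(scores - scores.max())
    return (e / e.sum()).max()

def tommasi_confidence(scores):
    # Contrast between the two largest scores, Eq. (22).
    s = np.sort(scores)
    return s[-1] - s[-2]

def overlap_aware_confidence(scores):
    # Proposed measure, Eqs. (23)-(24): penalize ambiguous decisions (several positive
    # one-vs-all scores); all-positive or all-negative score vectors yield zero confidence.
    n_pos = int((scores > 0).sum())
    if n_pos == 0 or n_pos == scores.size:
        return 0.0
    penalty = max(0.0, 1.0 - (n_pos - 1) / scores.size)   # C assumed equal to the class count
    return tommasi_confidence(scores) * penalty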

4. Experimental Evaluation

In this section we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2 in Section 4.1 and (2) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) randomly sampled training images across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insights into the place recognition performance of the algorithms under study. The second setup is more realistic and is aimed at validating every place recognition algorithm.

On the IDOL2 database we extensively assess the place recognition performance of each independent part of the proposed system. For instance, we validate the utility of multiple features, the effect of temporal smoothing, the use of unlabeled data, and the different confidence measures.

The IMMED database is used for validation purposes; on it we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems of mobile robots in indoor environments.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25] with 3 levels and 10 sibling nodes to speed up the search of the tree. This allows introducing visual words ranging from more general (higher level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification, sketched below.
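A minimal sketch of the tf-idf reweighting step, assuming the BOVW histograms are stacked row-wise in a matrix; this is a generic tf-idf formulation, not the authors' exact normalization.

import numpy as np

def tfidf_weight(bovw, eps=1e-10):
    # bovw: (n_images, n_visual_words) matrix of raw visual-word counts.
    tf = bovw / (bovw.sum(axis=1, keepdims=True) + eps)   # term frequency per image
    df = (bovw > 0).sum(axis=0)                           # images containing each word
    idf = np.log(bovw.shape[0] / (df + eps))              # inverse document frequency
    return tf * idf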

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness with respect to the spatial structure of the image. This is done by constructing a pyramid where each level defines a coarse to fine sampling grid for histogram extraction. Each grid histogram is a standard BOVW histogram built from densely sampled local SIFT features. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3, with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.

Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after some filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses from each specific filter. Due to its multidimensional nature and the size of an image, such a descriptor often results in a very high dimensional vector. In our experimental evaluations we used second order derivative filters in three directions at two scales, with 28 bins per histogram. The global descriptor resulted in very sparse vectors of up to 400 million dimensions. It was reduced to a 500-dimensional descriptor vector using KPCA with a χ² kernel [73].
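The χ² KPCA reduction can be sketched as follows with scikit-learn, assuming the histograms fit in memory as dense non-negative arrays; the number of components and the kernel bandwidth are illustrative rather than the values used by the authors.

from sklearn.metrics.pairwise import chi2_kernel
from sklearn.decomposition import KernelPCA

def reduce_crfh(train_hist, test_hist, n_components=500, gamma=1.0):
    # Project high-dimensional non-negative histograms to a compact space via chi-square KPCA.
    K_train = chi2_kernel(train_hist, gamma=gamma)            # (n_train, n_train) Gram matrix
    K_test = chi2_kernel(test_hist, train_hist, gamma=gamma)  # (n_test, n_train)
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    return kpca.fit_transform(K_train), kpca.transform(K_test)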

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable for evaluating the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves used respectively for validation and testing. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is reported as the average accuracy.
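A hedged sketch of this sampling protocol follows; the listed supervision levels correspond to those reported in the figures, and the averaging of accuracies over folds is left to the caller.

import numpy as np

def random_splits(n_frames, supervision, n_folds=10, seed=0):
    # Yield (train, validation, test) frame indices for one supervision level.
    rng = np.random.default_rng(seed)
    for _ in range(n_folds):
        perm = rng.permutation(n_frames)
        n_train = int(round(supervision * n_frames))
        rest = perm[n_train:]
        half = len(rest) // 2
        yield perm[:n_train], rest[:half], rest[half:]

# The eight supervision levels used in the random-setup experiments (1% to 50%).
levels = [0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.30, 0.50]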

It is expected that the global place recognition performance rises from mediocre at low supervision to its maximum at the highest supervision level.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.


Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) if α = 0 and CRFH if α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: the "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all 132 video pairs to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for baseline single feature methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum as output, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediary fusion values, which suggests feature complementarity. The fusion parameter α can be safely set to an intermediary value such as 0.5, and the final performance would exceed that of every single feature classifier alone at all supervision levels.

Figure 6: Comparison of single (BOVW, CRFH, and SPH) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performance is shown for single feature and multiple feature approaches. It is clear that all methods perform better when more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performance. Both SVMDAS (late fusion approach) and SimpleMKL (early fusion approach) operate fusion over the 3 single features considered. They outperform the single feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance (at the best α for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for the final system is motivated by this result and by its simpler fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content, which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results show that temporal accumulation with a window size up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible remedy. Assuming that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and the relatively slow room changes, the filter size can be set empirically, for example, to the number of frames captured in one second.
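A possible sketch of such a temporal accumulation step, applied to the per-frame classifier scores before taking the decision; a causal window is assumed here, whereas the exact form of the accumulation is given by (15) earlier in the paper.

import numpy as np

def temporal_accumulation(scores, window=100):
    # Average the (n_frames, n_classes) score matrix over a causal sliding window
    # before taking the per-frame decision.
    smoothed = np.empty_like(scores, dtype=float)
    for t in range(scores.shape[0]):
        start = max(0, t - window + 1)
        smoothed[t] = scores[start:t + 1].mean(axis=0)
    return smoothed.argmax(axis=1), smoothed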

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semisupervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we are interested in observing the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The lower performance at low supervision compared to the standard supervised classifier can be explained by an improper parameter setting.


Figure 8: Comparison of standard single feature SVM with semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

Practical application of this method is limited, since the full kernel matrix must be computed and stored in the memory of a computer, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is performed. Figures 9(b) and 9(c) illustrate, respectively, the best choice of the amount of selected high confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights on the Co-Training algorithm can be gained by performing more than one iteration (see Figure 10). The figures show the evolution of the performance of a single feature classifier that was iteratively retrained, from the standard baseline up to 10 iterations, with a constant portion of high confidence estimates added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may hint that both classifiers have reached complete agreement.



Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module (BOF and CRFH) as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions; (bottom) different lighting conditions. The pairs shown are minnie cloudy1 / minnie cloudy3, minnie night1 / minnie night3, and minnie sunny1 / minnie sunny3 (same lighting), and minnie cloudy1 / minnie night1, minnie cloudy1 / minnie sunny1, and minnie night1 / minnie sunny1 (different lighting).

Conclusion. The experiments carried out thus far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using a semi-supervised Laplacian SVM under the assumption of low density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. This issue is solved by the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions where the training patterns are scattered across the database. Randomly labeled images may simulate the situation when small portions of video are annotated in a frame-by-frame manner; in the extreme, a few labeled images from every class may be labeled manually.

In this context we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% of correct classification for the BOVW and CRFH based SVM classifiers. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top confidence estimates in one retraining iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop in the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient amount of training data to bootstrap the iterative retraining procedure. The diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm; this explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video-versus-Video Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.


Figure 12: Comparison of the global performances for single feature (BOVW-SVM, CRFH-SVM), multiple feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used in the Co-Training feedback loop. The baseline single feature methods perform about equally, delivering approximately 50% of correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, a good illustration concerning the amount of high confidence estimates is given in Figure 12. It is clear that only a portion of the test set can be useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20-50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by the standard fusion and simple Co-Training methods. The proposed methods CO-DAS and CO-TA-DAS perform the best by successfully leveraging the two visual features and the temporal continuity of the video while working in the semi-supervised framework.

Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, and Tommasi, each with CO-DAS and CO-TA-DAS) for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

4.1.7. Effect of the Type of Confidence Measure. Figure 13 shows the effect of the type of confidence measure used in Co-Training on the performances for different amounts of feedback in the Co-Training phase. The performances for the Ruping approach are not reported as they were much lower than those of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also included. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second and a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

Most locations are represented with one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological, unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6,400 images), while the unlabeled evaluation videos are 20 minutes long (36,000 images) on average. A few locations are not given a labeled bootstrap video; a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature \ approach   SVM    SVM-TA
BOVW                 0.49   0.52
CRFH                 0.48   0.53
SPH                  0.47   0.49

Baseline: Single Feature Classifier Performance. As shown in Table 1, single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation for this phenomenon is that, due to the low amount of supervision, a classifier trained on the high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature \ approach   MKL    MKL-TA   DAS    DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH             0.48   0.50     0.51   0.56     0.50     0.53
BOVW-CRFH            0.50   0.54     0.51   0.56     0.54     0.58
SPH-CRFH             0.48   0.51     0.50   0.54     0.54     0.57
BOVW-SPH-CRFH        0.48   0.51     0.51   0.56     —        —

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Multiple feature information is therefore best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when facing the low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves over DAS on all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity a priori and boosts performances by another 3-4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal continuity of the video, and the exploitation of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as the low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done in selecting more discriminant visual features for specific applications and in a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.
[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of the SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep. CVAP/CAS, Kungliga Tekniska Hoegskolan, 2006.




Figure 3: Co-Training with late fusion (a); Co-Training with temporal accumulation (b).

given training data but should not overfit, which is prevented by the SVM regularizer Ω_SVM(f). The second part is a regularizer Ω_CO(f^(1), f^(2)) for the Co-Training algorithm, which incurs a penalty if the two classifiers do not agree on the unlabeled data. This means that each classifier is constrained both by its standard regularization and is required to agree with the other classifier. An algorithm implemented in this framework elegantly bootstraps from each classifier's training data, exploits unlabeled data, and works with two visual cues.

It should be noted that the framework can easily be extended to more than two classifiers. In the literature, the algorithms following this spirit implement multiple view learning; refer to [53] for the extension of the framework to multiple views.
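The objective is only described verbally above. One common instantiation of such a co-regularization objective, written here as an illustrative assumption rather than the paper's exact formula, is

\min_{f^{(1)},\,f^{(2)}} \; \sum_{v=1}^{2} \Big[ \sum_{i \in L} \ell\big(y_i, f^{(v)}(x_i^{(v)})\big) + \lambda_v\, \Omega_{\mathrm{SVM}}\big(f^{(v)}\big) \Big] + \mu \sum_{u \in U} \big( f^{(1)}(x_u^{(1)}) - f^{(2)}(x_u^{(2)}) \big)^2,

where L and U denote the labeled and unlabeled sets, ℓ is the hinge loss, and the last term plays the role of Ω_CO(f^(1), f^(2)).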

3.3.3. Proposition: CO-DAS and CO-TA-DAS Methods. The Co-Training algorithm has two drawbacks in the context of our application. The first drawback is that it is not known in advance which of the two classifiers performs best and whether the complementarity properties have been leveraged to their maximum. The second drawback is that no time information is used, unless the visual features are constructed to capture this information.

In this work, we use the DAS method for late fusion, although it is possible to use the more general SVMDAS method as well. The experimental evaluation will show that very competitive performances can be obtained using the former, much simpler method. We propose the CO-DAS method (see Figure 3(a)), which addresses the first drawback by delivering a single output. In the same framework, we propose the CO-TA-DAS method (see Figure 3(b)), which additionally enforces temporal continuity information. The experimental evaluation will reveal the relative performances of each method with respect to the baseline and with respect to each other.

The full algorithm of the CO-DAS method (or CO-TA-DAS, if temporal accumulation is enabled) is presented in Algorithm 1.

Besides the base classifier parameters, one needs to set the threshold k for the top confidence sample selection, the temporal accumulation window width τ, and the late fusion parameter β. We express the threshold k as a percentage of the testing samples. The impact of this parameter is extensively studied in Sections 4.1.4 and 4.1.5. The selection of the temporal accumulation parameter is discussed in Section 4.1.3. Finally, the selection of the parameter β is discussed in Section 4.1.2.
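Algorithm 1 itself is not reproduced here; the following minimal Python sketch illustrates one Co-Training iteration followed by DAS fusion and optional temporal accumulation. The helper name, the use of scikit-learn's LinearSVC, and the confidence callback are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.svm import LinearSVC

def co_das_iteration(X1_lab, X2_lab, y_lab, X1_test, X2_test,
                     confidence, k=0.2, beta=0.5, tau=0):
    # Train one SVM per view (e.g., view 1 = BOVW, view 2 = CRFH);
    # a multiclass problem is assumed so that decision_function
    # returns one score per class.
    clf1 = LinearSVC().fit(X1_lab, y_lab)
    clf2 = LinearSVC().fit(X2_lab, y_lab)
    s1 = clf1.decision_function(X1_test)
    s2 = clf2.decision_function(X2_test)
    pseudo1 = clf1.predict(X1_test)
    pseudo2 = clf2.predict(X2_test)

    # Each view passes its top-k fraction of most confident estimates
    # (with their pseudo-labels) to retrain the other view.
    n_top = int(k * len(X1_test))
    top1 = np.argsort(-confidence(s1))[:n_top]
    top2 = np.argsort(-confidence(s2))[:n_top]
    clf1 = LinearSVC().fit(np.vstack([X1_lab, X1_test[top2]]),
                           np.concatenate([y_lab, pseudo2[top2]]))
    clf2 = LinearSVC().fit(np.vstack([X2_lab, X2_test[top1]]),
                           np.concatenate([y_lab, pseudo1[top1]]))

    # DAS late fusion: weighted sum of the per-class scores of both views.
    fused = beta * clf1.decision_function(X1_test) \
        + (1.0 - beta) * clf2.decision_function(X2_test)

    # Optional temporal accumulation over a sliding window of tau frames
    # (test frames are assumed to be in temporal order).
    if tau > 0:
        w = np.ones(tau) / tau
        fused = np.column_stack([np.convolve(fused[:, c], w, mode="same")
                                 for c in range(fused.shape[1])])
    return clf1.classes_[fused.argmax(axis=1)]

Repeating this block for several iterations reproduces the iterative feedback loop whose behavior is studied in Section 4.1.4.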

3.3.4. Confidence Measure. The Co-Training algorithm relies on a confidence measure, which is not provided by an SVM classifier out of the box. In the literature, several methods exist for computing a confidence measure from the SVM outputs. We review several methods of confidence computation and contribute a novel confidence measure that attempts to resolve an issue common to some of the existing measures.

Logistic Model (Logistic). Following [75], class probabilities can be computed using the logistic model, which generalizes naturally to the multiclass classification problem. Suppose that, in a one-versus-all setup with c classes, the scores f_k(x), k = 1, ..., c, are given. Then the probability, or classification confidence, is computed as

P(y = k | x) = exp(f_k(x)) / Σ_{i=1}^{c} exp(f_i(x)),    (20)

which ensures that the probability is larger for larger positive score values and sums to 1 over all scores. This property allows interpreting the classifier output as a probability. There are at least two drawbacks with this measure. It does not take into account the cases when all classifiers in the one-versus-all setup reject the pattern (all negative score values) or accept it (all positive scores). Moreover, forcing the scores to sum to one after normalization may not transfer all of the dynamics (e.g., very small or very large score values).

Modeling Posterior Class Probabilities (Ruping). In [76], a parameter-less method was proposed, which assigns the score value

z = p_+ if f(x) > 1;   z = (1 + f(x)) / 2 if -1 ≤ f(x) ≤ 1;   z = p_- if f(x) < -1,    (21)

where p_+ and p_- are the fractions of positive and negative score values, respectively. The authors argue that the interesting dynamics relevant to confidence estimation happen in the region of the margin, while the patterns classified outside the margin have a constant impact. This measure has a sound theoretical background in a two-class classification problem, but it does not cover the multiclass case required by our application.

Score Difference (Tommasi). A method that does not require additional preprocessing for confidence estimation was proposed in [77], where it was thresholded to obtain a decision corresponding to a "no action", "reject", or "do not know" situation for medical image annotation. The idea is to use the contrast between the two top uncalibrated score values: the maximum score estimate should be more confident if the other score values are relatively smaller. This leads to a confidence measure using the contrast between the two maximal scores,

z = f_{k*}(x) − max_{k=1,...,c; k≠k*} f_k(x),    (22)

where k* is the index of the maximal score. This measure has a clear interpretation in a two-class classification problem, where a larger difference between the two maximal scores hints at better class separability. As can be seen from the equation, there is an issue with the measure if all scores are negative.

Class Overlap Aware Confidence Measure. We noticed that class overlap and reject situations are not explicitly taken into account in any of the above confidence measure computation procedures. The one-versus-all setup for multiclass classification may yield ambiguous decisions; for instance, it is possible to obtain several positive scores, or all positive, or all negative scores.

We propose a confidence measure that penalizes class overlap (ambiguous decisions) to varying degrees and also treats the two degenerate cases. By convention, the confidence should be higher if a sample is classified with less class overlap (fewer positive score values) and further from the margin (a larger positive score value). Cases with all positive or all negative scores may be considered degenerate (z_i ← 0). The computation is divided into two steps. First, we compute the standard Tommasi confidence measure

z⁰_i = f_{j*}(x_i) − max_{k=1,...,c; k≠j*} f_k(x_i),    (23)

then the measure z⁰_i is modified to account for class overlap:

z_i = z⁰_i · max(0, 1 − (p_i − 1)/C),    (24)

where p_i = Card({k = 1, ..., c | f_k(x_i) > 0}) represents the number of classes for which x_i has positive scores (class overlap). In the case of ∀k f_k(x_i) > 0 or ∀k f_k(x_i) < 0, we set z_i ← 0.

Compared to the Tommasi measure, the proposed measure additionally penalizes class overlap, which is more severe if the test pattern receives several positive scores. Compared to the logistic measure, samples with no positive scores yield zero confidence, which allows excluding them rather than assigning doubtful probability values.

In constructing our measure, we assume that a confident estimate is obtained if only one of the binary classifiers returns a positive score. Following the same logic, the confidence is lowered if more than one binary classifier returns a positive score.
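For concreteness, the three usable measures can be sketched as follows on a one-versus-all score matrix of shape (n_samples, n_classes). Taking the per-sample logistic confidence as the maximal class probability, and setting the constant C of eq. (24) to the number of classes, are assumptions of this sketch.

import numpy as np

def logistic_confidence(scores):
    # Eq. (20): softmax over the one-versus-all scores; the confidence of
    # a sample is taken here as its maximal class probability.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).max(axis=1)

def tommasi_confidence(scores):
    # Eq. (22): contrast between the two largest scores of each sample.
    top2 = np.sort(scores, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def overlap_aware_confidence(scores, C=None):
    # Eqs. (23)-(24): Tommasi contrast penalized by class overlap;
    # degenerate samples (all scores positive or all negative) get zero.
    n_samples, n_classes = scores.shape
    C = n_classes if C is None else C
    z0 = tommasi_confidence(scores)
    p = (scores > 0).sum(axis=1)
    z = z0 * np.maximum(0.0, 1.0 - (p - 1) / C)
    z[(p == 0) | (p == n_classes)] = 0.0
    return z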

4. Experimental Evaluation

In this section, we evaluate the performance of the methods presented in the previous section on two datasets. The experimental evaluation is organized in two parts: (1) on the public database IDOL2 in Section 4.1 and (2) on our in-house database IMMED in Section 4.2. The former database is relatively simple and is expected to be annotated automatically with a small error rate, whereas the latter database is recorded in a challenging environment and is the subject of study in the IMMED project.

For each database, two experimental setups are created: (a) training images randomly sampled across the whole corpus and (b) a more realistic video-versus-video setup. The first experiment allows for a gradual increase of supervision, which gives insights into the place recognition performance of the algorithms under study. The second setup is more realistic and aims to validate every place recognition algorithm.

On the IDOL2 database, we extensively assess the place recognition performance for each independent part of the proposed system. For instance, we validate the utility of multiple features, the effect of temporal smoothing, the use of unlabeled data, and the different confidence measures.

The IMMED database is used for validation purposes; on it, we evaluate all methods and summarize their performances.

Datasets. The IDOL2 database is a publicly available corpus of video sequences designed to assess place recognition systems of mobile robots in an indoor environment.

The IMMED database is a collection of video sequences recorded using a camera positioned on the shoulder of volunteers, capturing their activities during observation sessions in their home environment. These sequences represent visual lifelogs for which indexing by activities is required. This database presents a real challenge for image-based place recognition algorithms due to the high variability of the visual content and the unconstrained environment.

The results and discussion related to these two datasets are presented in Sections 4.1 and 4.2, respectively.

Visual Features. In this experimental section, we use three types of visual features that have been used successfully in image recognition tasks: Bag of Visual Words (BOVW) [25], Composed Receptive Field Histograms (CRFH) [26], and Spatial Pyramid Histograms (SPH) [27].

In this work, we used 1111-dimensional BOVW histograms, which was shown to be sufficient for our application and feasible from the computational point of view. The visual vocabulary was built in a hierarchical manner [25], with 3 levels and 10 sibling nodes, to speed up the search of the tree. This allows introducing visual words ranging from more general (higher level nodes) to more specific (leaf nodes). The effect of overly frequent visual words is addressed with the common tf-idf normalization procedure [25] from text classification.
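As an illustration of this descriptor construction, the following sketch uses a flat k-means vocabulary instead of the 3-level, 10-branch hierarchical tree of [25]; the function names and the use of scikit-learn are assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(local_descriptors, n_words=1000, seed=0):
    # Flat k-means vocabulary; the paper uses a hierarchical tree with
    # 1111 nodes, which this sketch simplifies to a single level.
    return KMeans(n_clusters=n_words, random_state=seed).fit(local_descriptors)

def bovw_tfidf(per_image_descriptors, vocabulary):
    # Hard-assign local descriptors to visual words, count them per image,
    # and apply the standard tf-idf weighting.
    n_words = vocabulary.n_clusters
    counts = np.zeros((len(per_image_descriptors), n_words))
    for i, desc in enumerate(per_image_descriptors):
        words, freq = np.unique(vocabulary.predict(desc), return_counts=True)
        counts[i, words] = freq
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    idf = np.log(len(counts) / np.maximum((counts > 0).sum(axis=0), 1))
    return tf * idf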

The SPH [27, 78] descriptor harnesses the power of the BOVW descriptor but addresses its weakness when it comes to the spatial structure of the image. This is done by constructing a pyramid where each level defines a coarse-to-fine sampling grid for histogram extraction. Each grid histogram is obtained by constructing a standard BOVW histogram from local SIFT features sampled in a dense manner. The final global descriptor is composed of the concatenated individual region and level histograms. We empirically set the number of pyramid levels to 3 with a dictionary size of 200 visual words, which yielded 4200-dimensional vectors per image. Again, the number of dimensions was fixed such that a maximum of visual information is captured while reducing the computational burden.

Figure 4: IDOL2 dataset sample images: (a) Printer Area, (b) Corridor, (c) Two-Person Office, (d) One-Person Office, and (e) Kitchen.

The CRFH [26] descriptor describes a scene globally by measuring the responses returned after some filtering operations on the image. Every dimension of this descriptor effectively counts the number of pixels sharing similar responses returned from each specific filter. Due to its multidimensional nature and the size of an image, such a descriptor often results in a very high dimensional vector. In our experimental evaluations, we used second order derivative filters in three directions at two scales, with 28 bins per histogram. The total size of the global descriptor resulted in very sparse vectors of up to 400 million dimensions. It was reduced to a 500-dimensional descriptor vector using KPCA with a χ² kernel [73].
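One possible way to reproduce this reduction step with off-the-shelf tools is sketched below, assuming dense non-negative histogram vectors and a kernel width gamma that is not specified in the paper.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import chi2_kernel

def reduce_crfh(histograms, n_components=500, gamma=1.0):
    # Kernel PCA with a chi-squared kernel, in the spirit of [73];
    # the Gram matrix is precomputed on the (non-negative) histograms.
    K = chi2_kernel(histograms, gamma=gamma)
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    return kpca.fit_transform(K)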

4.1. Results on IDOL2. The public database KTH-IDOL2 [79] consists of video sequences captured by two different robot platforms. The database is suitable to evaluate the robustness of image-based place recognition algorithms in controlled real-world conditions.

4.1.1. Description of the Experimental Setup. The considered database consists of 12 video sequences recorded with the "minnie" robot (98 cm above ground) using a Canon VC-C4 camera at a frame rate of 5 fps. The effective resolution of the extracted images is 309 × 240 pixels.

All video sequences were recorded in the same premises and depict 5 distinct rooms: "One-Person Office", "Two-Person Office", "Corridor", "Kitchen", and "Printer Area". Sample images depicting these 5 topological locations are shown in Figure 4.

The annotation was performed using two annotation setups: random and video-versus-video. In both setups, three image sets were considered: a labeled training set, a validation set, and an unlabeled set. The unlabeled set is used as the test set for performance evaluation. The performance is evaluated using the accuracy metric, which is defined as the number of correctly classified test images divided by the total number of test images.

Random Sampling Setup. In the first setup, the database is divided into three sets by random sampling: training, validation, and testing. The percentage of training data with respect to the full corpus defines the supervision level. We consider 8 supervision levels ranging from 1% to 50%. The remaining images are split randomly into two halves and used respectively for validation and testing purposes. In order to account for the effects of random sampling, 10-fold sampling is performed at each supervision level and the final result is returned as the average accuracy.
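A minimal sketch of this splitting procedure (a hypothetical helper, assuming frames indexed 0..n_images-1) could look as follows; calling it with 10 different seeds reproduces the 10-fold sampling.

import numpy as np

def random_sampling_split(n_images, supervision, seed=0):
    # `supervision` is the labeled fraction (e.g., 0.01 to 0.5);
    # the remainder is halved into validation and test sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    n_train = int(supervision * n_images)
    n_val = (n_images - n_train) // 2
    return (idx[:n_train],                  # labeled training frames
            idx[n_train:n_train + n_val],   # validation frames
            idx[n_train + n_val:])          # unlabeled test frames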

It is expected that the global place recognition performance rises from a mediocre level at low supervision to its maximum at a high supervision level.

Video-versus-Video Setup. In the second setup, video sequences are processed in pairs. The first video is completely annotated, while the second is used for evaluation purposes.


Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) if α = 0 and CRFH if α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL". The "EASY" set contains only the video sequence pairs where the lighting conditions are similar and the recordings were made within a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all the 132 video pairs to provide an overall averaged performance.
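The pair enumeration itself is straightforward; for example, with hypothetical sequence identifiers:

from itertools import permutations

sequences = ["seq%02d" % i for i in range(1, 13)]   # 12 hypothetical names
pairs = list(permutations(sequences, 2))            # ordered (train, test) pairs
assert len(pairs) == 12 * 11 == 132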

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features to the task of image-based place recognition on the IDOL2 database. We present a complete summary of the performances of the baseline single feature methods compared to the early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediary fusion values, which suggest feature complementarity. The fusion parameter α can be safely set to an intermediary value such as 0.5, and the final performance would exceed that of every single feature classifier alone at all supervision levels.

Figure 6: Comparison of single (BOVW, CRFH, and SPH) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performances is shown for single feature and multiple feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is the expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performances. Both SVMDAS (a late fusion approach) and SimpleMKL (an early fusion approach) operate fusion over the 3 single features considered. They outperform the single feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods, DAS and SVMDAS, deliver very comparable performances. Comparing the maximum performance (at the best α for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for further usage in the final system is motivated by this result and by its simplified fusion parameter selection.
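Concretely, the DAS decision used in the remainder of the experiments reduces to a weighted sum of two per-class score matrices. A minimal sketch follows; the α convention matches Figure 5 (α = 0 uses SPH only, α = 1 uses CRFH only).

import numpy as np

def das_predict(scores_sph, scores_crfh, alpha=0.5):
    # Weighted sum of per-class SVM scores from two feature-specific
    # classifiers, followed by an argmax class decision.
    fused = (1.0 - alpha) * scores_sph + alpha * scores_crfh
    return fused.argmax(axis=1)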

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content, which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible remedy. Assuming that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and the relatively slow room changes, the filter size can be set empirically, for example, to the number of frames captured in one second.
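The temporal accumulation module can be sketched as a moving average of the per-class decision scores; frames are assumed to be in temporal order, and at 5 fps a window of 5 frames corresponds to the one-second rule of thumb above.

import numpy as np

def temporal_accumulation(scores, window=5):
    # Average each class score over a sliding window of `window` frames
    # before taking the final argmax decision.
    w = np.ones(window) / window
    smoothed = np.column_stack([np.convolve(scores[:, c], w, mode="same")
                                for c in range(scores.shape[1])])
    return smoothed.argmax(axis=1)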

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semi-supervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we observe the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The low performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single feature SVM with a semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

The practical application of this method is limited, since the full kernel matrix must be computed and stored in the memory of a computer, which scales as O(n²) with the number of patterns. The computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. The panels illustrate the best choice of the amount of selected high confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights into the Co-Training algorithm can be gained if we perform more than one iteration (see Figure 10). The figures show the evolution of the performance of a single feature classifier iteratively retrained from the standard baseline for up to 10 iterations, where a constant portion of high confidence estimates is added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a certain saturation after more than 6-7 iterations in most cases, which may hint that both classifiers have reached complete agreement.


Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module (BOF and CRFH co-training) as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions (minnie cloudy1/cloudy3, night1/night3, sunny1/sunny3), (bottom) different lighting conditions (minnie cloudy1/night1, cloudy1/sunny1, night1/sunny1).

Conclusion. The experiments carried out so far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using a semi-supervised Laplacian SVM under the assumption of low density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. This issue is solved by using the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which the training patterns are scattered across the database. Randomly labeled images may simulate the situation where small portions of a video are annotated in a frame-by-frame manner; in the extreme, a few labeled images from every class may be labeled manually.

In this context, we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL by attributing equal weights to both visual features.

Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% of correct classification for the BOVW and CRFH based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top confidence estimates in one re-training iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop of the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next re-training rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative re-training procedure. Curiously, the diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm. This explains why, at higher supervision levels, the performance increase of a re-trained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video-versus-Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case where one video is used for training and the place recognition method is evaluated on a different video.

Advances in Multimedia 17A

ccu

racy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

BOVW-SVM

CRFH-SVM

DAS

TA-DAS

BOVW-CO

CRFH-CO

CO-DAS

CO-TA-DAS

Effect of Co-Training feedback average train rate = 1

Added test samples ()

Figure 12 Comparison of the global performances for single feature(BOVW-SVM CRFH-SVM) multiple feature late fusion (DAS)and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS CO-TA-DAS) The evolution ofthe performances of the individual inner classifiers of the Co-Training module (BOVW-CO CRFH-CO) is also shown Plotof the average accuracy as a function of the amount of Co-Training feedback The approaches without Co-Training appear asthe limiting case with 0 of feedback (IDOL2 dataset video-versus-video setup ALL pairs)

Discussion on the Results The comparison of the methods invideo-versus-video setup is showed in Figure 12 The perfor-mances are compared showing the influence of the amountof samples used for the Co-Training algorithm feedback loopThe baseline single feature methods perform around equallyby delivering approximately 50 of correct classificationThestandard DAS fusion boosts the performance by additional10This confirms the complementarity of the selected visualfeatures in this test setup

The individual classifiers trained in one Co-Trainingiteration exceed the baseline and are comparable to per-formance delivered by standard DAS fusion method Theimprovement is due to the feedback of unlabeled patternsin the iterative learning procedure The CO-DAS methodsuccessfully leverages both improvements while the CO-TA-DAS additionally takes advantage of the temporal continuityof the video (a temporal window of size 120591 = 50 was used)

Confidence Measure. On this dataset, a good illustration of the amount of high confidence estimates is given in Figure 12. It is clear that only a portion of the test set data can be useful for classifier re-training. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by the standard fusion and the simple Co-Training methods. The proposed methods CO-DAS and CO-TA-DAS perform best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semi-supervised framework.


Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, and Tommasi) used for the Co-Training feedback loop, for the CO-DAS and CO-TA-DAS methods. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).


4.1.7. Effect of the Type of Confidence Measures. Figure 13 represents the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than that of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances, when mostly correct estimates are added to the training set, then a decrease, when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.
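As a concrete illustration of the three measures compared in Figure 13, the sketch below gives plausible implementations for a matrix of one-versus-all SVM scores (one row per image, one column per class). The logistic parameters a and b are placeholders that would normally be fitted on held-out data, and the class-overlap-aware variant follows the penalization rule introduced earlier in the paper; none of this is meant as the exact reference implementation.

    import numpy as np

    def logistic_confidence(scores, a=1.0, b=0.0):
        # Logistic (Platt-style) calibration of the winning one-vs-all score.
        return 1.0 / (1.0 + np.exp(-(a * scores.max(axis=1) + b)))

    def tommasi_confidence(scores):
        # Score difference: contrast between the two largest one-vs-all scores.
        top2 = np.sort(scores, axis=1)[:, -2:]
        return top2[:, 1] - top2[:, 0]

    def overlap_aware_confidence(scores):
        # Start from the score contrast, then penalize class overlap
        # (several positive scores); all-positive or all-negative score
        # vectors are treated as degenerate and get zero confidence.
        C = scores.shape[1]
        z0 = tommasi_confidence(scores)
        p = (scores > 0).sum(axis=1)            # number of positive scores
        z = z0 * np.maximum(0.0, 1.0 - (p - 1) / C)
        z[(p == 0) | (p == C)] = 0.0
        return z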

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from the increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours.



Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second, with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.
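The frame-rate reduction described above amounts to keeping one frame out of six. A minimal sketch of such subsampling, assuming OpenCV is used for decoding (the library choice and the file name are illustrative, not taken from the paper):

    import cv2

    def subsample_video(path, step=6):
        """Yield every step-th frame (30 fps -> 5 fps when step = 6)."""
        capture = cv2.VideoCapture(path)
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if index % step == 0:
                yield frame
            index += 1
        capture.release()

    # frames = list(subsample_video("bootstrap_video.mp4"))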

Most locations are represented by one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; a small, randomly annotated portion of the evaluation videos covering every topological location is provided instead.
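A quick arithmetic check of these durations, under the assumption that the reported image counts refer to the original 30 fps recordings:

    # 6400 frames  / 30 fps / 60 ≈ 3.6 minutes  (bootstrap video)
    # 36000 frames / 30 fps / 60 = 20.0 minutes (evaluation video)
    print(6400 / 30 / 60, 36000 / 30 / 60)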

The topological location names in all the videos have been equalized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn the optimal parameter values of the employed methods, we used a standard cross-validation procedure in all experiments.
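As an example of the kind of parameter selection mentioned above, the following sketch chooses the late fusion weight alpha by grid search on a labeled validation split. The candidate grid, the accuracy criterion, and the assumption that the validation labels are class indices aligned with the score columns are all illustrative choices.

    import numpy as np

    def select_alpha(scores_a_val, scores_b_val, y_val,
                     grid=np.linspace(0.0, 1.0, 11)):
        """Pick the DAS fusion weight maximizing validation accuracy."""
        best_alpha, best_acc = None, -1.0
        for alpha in grid:
            fused = alpha * scores_a_val + (1.0 - alpha) * scores_b_val
            acc = np.mean(fused.argmax(axis=1) == y_val)
            if acc > best_acc:
                best_alpha, best_acc = alpha, acc
        return best_alpha, best_acc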

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature / approach    SVM     SVM-TA
BOVW                  0.49    0.52
CRFH                  0.48    0.53
SPH                   0.47    0.49


Baseline: Single Feature Classifier Performance. As shown in Table 1, the single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation for this phenomenon may be that, due to the low amount of supervision, a classifier trained on the high dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an interest in favor of the late fusion method when compared to the single feature methods.


Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature / approach    MKL     MKL-TA    DAS     DAS-TA    CO-DAS    CO-TA-DAS
BOVW-SPH              0.48    0.50      0.51    0.56      0.50      0.53
BOVW-CRFH             0.50    0.54      0.51    0.56      0.54      0.58
SPH-CRFH              0.48    0.51      0.50    0.54      0.54      0.57
BOVW-SPH-CRFH         0.48    0.51      0.51    0.56      –         –

We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important when dealing with the low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves over DAS on all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and by the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, additionally incorporates the temporal continuity a priori and boosts performances by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of the video, and the exploitation of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as the low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all the acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.

[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.

[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.

[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.

[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.

[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.

[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.

[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.

[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.

[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.

[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.

[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.

[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.

[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.

[15] M. A. Perez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.

[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.

[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.

[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.

[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.

[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.

[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.

[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.

[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.

[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.

[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.

[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.

[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.

[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.

[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.

[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.

[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.

[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.

[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.

[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.

[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.

[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.

[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.

[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.

[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.

[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.

[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.

[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.

[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.

[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.

[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.

[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.

[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.

[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.

[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.

[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.

[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.

[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.

[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.

[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.

[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.

[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.

[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.

[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.

[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.

[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.

[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.

[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.

[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.

[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.

[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.

[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.

[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.

[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.

[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.

[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.

[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.

[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.

[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.

[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep. CVAP/CAS, Kungliga Tekniska Hoegskolan, 2006.



High Supervision Case At higher supervision levels(Figure 11(b)) the performance of single feature supervisedclassifiers is already relatively high reaching around 80 ofaccuracy for both classifiers which indicates that a significantamount of visual variability present in the scenes has been

captured This comes as no surprise since at 50 of videoannotation in random setup Nevertheless the Co-Trainingalgorithm improves the classification by additional 8-9An interesting observation for the CO-DAS method showsclearly the complementarity of the visual features even whenno Co-Training learning iterations are performed The highsupervision setup permits as much as 50 of the remainingtest data annotation for the next re-training rounds beforereaching saturation at approximately 94 of accuracy

Conclusion These experiments show an interest of using theCo-Training algorithm in low supervision conditions Theinitial supervised single feature classifiers need to be providedwith sufficient number of training data to bootstrap theiterative re-training procedure Curiously the initial diversityof initial classifiers determines what performance gain canbe obtained using the Co-Training algorithm This explainswhy at higher supervision levels the performance increaseof a re-trained classifier pair may not be significant Finallyboth early and late fusion methods succeed to leverage thevisual feature complementarity but failed to go beyond theCo-Training basedmethods which confirms the utility of theunlabeled data in this context

416 Video versus Video Comparison of Global Performance

Motivation Global performance of the methods may beoverly optimistic if annotation is performed only in a randomlabeling setup In practical applications a small bootstrapvideo or a short portion of a video can be annotated insteadWe study in a more realistic setup the case with one videobeing used as training and the place recognition methodevaluated on a different video

Advances in Multimedia 17A

ccu

racy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

BOVW-SVM

CRFH-SVM

DAS

TA-DAS

BOVW-CO

CRFH-CO

CO-DAS

CO-TA-DAS

Effect of Co-Training feedback average train rate = 1

Added test samples ()

Figure 12 Comparison of the global performances for single feature(BOVW-SVM CRFH-SVM) multiple feature late fusion (DAS)and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS CO-TA-DAS) The evolution ofthe performances of the individual inner classifiers of the Co-Training module (BOVW-CO CRFH-CO) is also shown Plotof the average accuracy as a function of the amount of Co-Training feedback The approaches without Co-Training appear asthe limiting case with 0 of feedback (IDOL2 dataset video-versus-video setup ALL pairs)

Discussion on the Results The comparison of the methods invideo-versus-video setup is showed in Figure 12 The perfor-mances are compared showing the influence of the amountof samples used for the Co-Training algorithm feedback loopThe baseline single feature methods perform around equallyby delivering approximately 50 of correct classificationThestandard DAS fusion boosts the performance by additional10This confirms the complementarity of the selected visualfeatures in this test setup

The individual classifiers trained in one Co-Trainingiteration exceed the baseline and are comparable to per-formance delivered by standard DAS fusion method Theimprovement is due to the feedback of unlabeled patternsin the iterative learning procedure The CO-DAS methodsuccessfully leverages both improvements while the CO-TA-DAS additionally takes advantage of the temporal continuityof the video (a temporal window of size 120591 = 50 was used)

Confidence Measure On this dataset a good illustrationconcerning the amount of high confidence is showed inFigure 12 It is clear that only a portion of the test set datacan be useful for classifier re-training This is governed bytwo major factorsmdashquality of the data and robustness of theconfidence measure For this dataset the best portion of highconfidence estimates is around 20ndash50 depending on themethod The best performing TA-CO-TA-DAS method canafford to annotate up to 50 of testing data for the nextlearning iteration

Conclusion The results show as well that all single featurebaselines are outperformed by standard fusion and simple

Acc

ura

cy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

Video-vs-video all

CO-DAS (proposed)

CO-TA-DAS (proposed)

CO-DAS (logistic)

CO-TA-DAS (logistic)

CO-DAS (tommasi)

CO-TA-DAS (tommasi)

Added test samples ()

Figure 13 Comparison of the performances of the types ofconfidence measures for the Co-Training feedback loop Plot ofthe average accuracy as a function of the amount of Co-Trainingfeedback (video-versus-video setup ALL pairs)

Co-Training methods The proposed methods CO-DAS andCO-TA-DAS perform the best by successfully leveraging twovisual features temporal continuity of the video andworkingin semi-supervised framework

417 Effect of the Type of Confidence Measures Figure 13represents the effect of the type of confidence measure usedin Co-Training on the performances for different amountsof feedback in the Co-Training phase The performances forthe Ruping approach is not reported as it was much lowerthan the other approaches A video-versus-video setup wasused with the results averaged over all sequence pairs Thethree approaches produce a similar behavior with respect tothe amount of feedback first an increase of the performanceswhen mostly correct estimates are added to the trainingset then a decrease when more incorrect estimates are alsoconsidered When coupled with temporal accumulation theproposed confidence measure has a slightly better accuracyfor moderate feedback It was therefore used for the rest ofexperiments

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low-quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

Most locations are represented with one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological, unconstrained environment.

Across the whole corpus, the bootstrap video is typically 35 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations are not given a labeled bootstrap video; for these, a small randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn the optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.
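For illustration, this parameter selection can be pictured as a small grid search on held-out validation data. The helper below is hypothetical (the exact grids and validation protocol are not detailed here); it only shows the kind of search involved for the fusion weight, the Co-Training feedback fraction, and the temporal window.

```python
from itertools import product

def select_parameters(evaluate, alphas, feedbacks, windows):
    """Pick the (alpha, feedback, window) triple with the best validation score.

    `evaluate` is assumed to train on the bootstrap annotation and return the
    accuracy measured on held-out validation frames.
    """
    return max(product(alphas, feedbacks, windows), key=lambda p: evaluate(*p))

# Example call with illustrative grids only:
# best = select_parameters(evaluate, alphas=[0.25, 0.5, 0.75],
#                          feedbacks=[0.1, 0.3, 0.5], windows=[10, 50, 100])
```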

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature \ approach    SVM     SVM-TA
BOVW                  0.49    0.52
CRFH                  0.48    0.53
SPH                   0.47    0.49

Baseline Single Feature Classifier Performance. As shown in Table 1, single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation for this phenomenon is that, due to the low amount of supervision, a classifier trained on the high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.
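The temporal accumulation step itself is simple to picture: per-class scores are averaged over a window of neighboring frames before the decision is taken. The sketch below is a minimal causal version with an assumed window length h; the actual filter shape and size used in the experiments were selected on validation data.

```python
import numpy as np

def temporal_accumulation(scores, h):
    """Average per-class decision scores over the last h frames, then decide.

    scores: array of shape (n_frames, n_classes) from any frame-level classifier.
    Returns per-frame class predictions after temporal smoothing.
    """
    smoothed = np.empty_like(scores, dtype=float)
    for t in range(len(scores)):
        smoothed[t] = scores[max(0, t - h + 1):t + 1].mean(axis=0)
    return smoothed.argmax(axis=1)
```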

Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature \ approach    MKL     MKL-TA   DAS     DAS-TA   CO-DAS   CO-TA-DAS
BOVW-SPH              0.48    0.50     0.51    0.56     0.50     0.53
BOVW-CRFH             0.50    0.54     0.51    0.56     0.54     0.58
SPH-CRFH              0.48    0.51     0.50    0.54     0.54     0.57
BOVW-SPH-CRFH         0.48    0.51     0.51    0.56     n/a      n/a

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.
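For reference, the late fusion rule compared here accumulates the per-feature SVM outputs with a weight chosen on validation data, whereas MKL combines the features inside a single kernel before training. In our notation (a sketch of the scheme, with $f_c^{(i)}(x)$ the score of class $c$ from the classifier trained on feature $i$ and $\alpha$ the balance parameter):

```latex
\[
\hat{y}(x) = \arg\max_{c}\; \Big[ \alpha\, f_c^{(1)}(x) + (1-\alpha)\, f_c^{(2)}(x) \Big].
\]
```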

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when facing low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves over DAS by up to 4% on all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged visual feature complementarity and by the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, additionally incorporates the temporal continuity prior and boosts performance by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of the video, and the use of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all of this acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.

[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.

[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.

[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.

[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.

[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.

[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.

[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.

[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.

[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.

[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.

[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.

[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.

[15] M. A. Pérez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.

[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.

[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.

[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.

[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.

[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.

[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.

[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.

[23] Z. Zivkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.

[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.

[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.

[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.

[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.

[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.

[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.

[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.

[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.

[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.

[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.

[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.

[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.

[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.

[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.

[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.

[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, Laboratoire Jean Kuntzmann, LEAR, INRIA Grenoble, June 2010.

[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.

[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.

[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.

[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.

[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.

[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.

[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.

[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.

[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.

[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.

[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.

[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.

[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.

[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[56] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.

[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.

[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.

[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.

[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.

[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.

[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.

[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.

[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.

[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.

[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.

[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.

[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.

[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.

[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.

[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.

[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.

[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.

[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.

[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.

[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Högskolan, CVAP/CAS, 2006.




[20] J Kosecka F Li and X Yang ldquoGlobal localization and relativepositioning based on scale-invariant keypointsrdquo Robotics andAutonomous Systems vol 52 no 1 pp 27ndash38 2005

[21] C O Conaire M Blighe and N OrsquoConnor ldquoSensecam imagelocalisation using hierarchical surf treesrdquo in Proceedings of the15th InternationalMultimediaModeling Conference (MMM rsquo09)p 15 Sophia-Antipolis France January 2009

[22] J Kosecka L Zhou P Barber and Z Duric ldquoQualitative imagebased localization in indoors environmentsrdquo in Proceedings ofIEEE Computer Society Conference on Computer Vision andPattern Recognition vol 2 pp II-3ndashII-8 June 2003

[23] Z Zovkovic O Booij and B Krose ldquoFrom images to roomsrdquoRobotics and Autonomous Systems vol 55 no 5 pp 411ndash4182007

[24] L Fei-Fei and P Perona ldquoA bayesian hierarchical model forlearning natural scene categoriesrdquo in Proceedings of IEEEComputer Society Conference on Computer Vision and PatternRecognition (CVPR rsquo05) pp 524ndash531 June 2005

[25] D Nister and H Stewenius ldquoScalable recognition with a vocab-ulary treerdquo in Proceedings of IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo06) vol 2pp 2161ndash2168 2006

[26] O Linde andT Lindeberg ldquoObject recognition using composedreceptive field histograms of higher dimensionalityrdquo in Proceed-ings of the 17th International Conference on Pattern Recognition(ICPR rsquo04) vol 2 pp 1ndash6 August 2004

[27] S Lazebnik C Schmid and J Ponce ldquoBeyond bags of featuresspatial pyramid matching for recognizing natural scene cate-goriesrdquo in Proceedings of IEEE Computer Society Conference onComputer Vision and Pattern Recognition (CVPR rsquo06) vol 2 pp2169ndash2178 2006

[28] A Bosch and A Zisserman ldquoScene classification via pLSArdquo inProceedings of the 9th European Conference on Computer Vision(ECCV rsquo06) May 2006

[29] J Sivic and A Zisserman ldquoVideo google a text retrievalapproach to object matching in videosrdquo in Proceedings of the 9thIEEE International Conference On Computer Vision pp 1470ndash1477 October 2003

[30] J Knopp Image Based Localization [PhD thesis] Chech Tech-nical University in Prague Faculty of Electrical EngineeringPrague Czech Republic 2009

[31] M W M G Dissanayake P Newman S Clark H F Durrant-Whyte and M Csorba ldquoA solution to the simultaneous local-ization and map building (SLAM) problemrdquo IEEE Transactionson Robotics and Automation vol 17 no 3 pp 229ndash241 2001

[32] L M Paz P Jensfelt J D Tardos and J Neira ldquoEKF SLAMupdates inO(n) with divide and conquer SLAMrdquo in Proceedingsof IEEE International Conference on Robotics and Automation(ICRA rsquo07) pp 1657ndash1663 April 2007

[33] J Wu and J M Rehg ldquoCENTRIST a visual descriptor forscene categorizationrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 33 no 8 pp 1489ndash1501 2011

[34] H Bulthoff and A Yuille ldquoBayesian models for seeing shapesand depthrdquo Tech Rep 90-11 Harvard Robotics Laboratory1990

[35] P K Atrey M Anwar Hossain A El Saddik and M SKankanhalli ldquoMultimodal fusion for multimedia analysis asurveyrdquoMultimedia Systems vol 16 no 6 pp 345ndash379 2010

[36] A Rakotomamonjy F R Bach S Canu and Y GrandvaletldquoSimpleMKLrdquo The Journal of Machine Learning Research vol9 pp 2491ndash2521 2008

Advances in Multimedia 21

[37] S Nakajima A Binder C Muller et al ldquoMultiple kernellearning for object classificationrdquo inWorkshop on Information-based Induction Sciences 2009

[38] A Vedaldi V Gulshan M Varma and A Zisserman ldquoMultiplekernels for object detectionrdquo in Proceedings of the 12th Interna-tional Conference on Computer Vision (ICCV rsquo09) pp 606ndash613October 2009

[39] J Yang Y Li Y Tian L Duan and W Gao ldquoGroup-sensitivemultiple kernel learning for object categorizationrdquo in Proceed-ings of the 12th International Conference on Computer Vision(ICCV rsquo09) pp 436ndash443 October 2009

[40] M Guillaumin J Verbeek and C Schmid ldquoMultimodal semi-supervised learning for image classificationrdquo in Proceedings ofIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo10) pp 902ndash909 Laboratoire Jean Kuntzmann LEARINRIA Grenoble June 2010

[41] J Yang Y Li Y Tian L Duan and W Gao ldquoMultiple kernelactive learning for image classificationrdquo in Proceedings of IEEEInternational Conference on Multimedia and Expo (ICME rsquo09)pp 550ndash553 July 2009

[42] A Abdullah R C Veltkamp and M A Wiering ldquoSpatialpyramids and two-layer stacking SVM classifiers for imagecategorization a comparative studyrdquo in Proceedings of theInternational Joint Conference on Neural Networks (IJCNN rsquo09)pp 5ndash12 June 2009

[43] J Kittler M Hatef R P W Duin and J Matas ldquoOn combiningclassifiersrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 20 no 3 pp 226ndash239 1998

[44] L Ilieva Kuncheva Combining Pattern Classifiers Methods andAlgorithms Wiley-Interscience 2004

[45] A Uhl and PWild ldquoParallel versus serial classifier combinationfor multibiometric hand-based identificationrdquo in Proceedings ofthe 3rd International Conference on Advances in Biometrics (ICBrsquo09) vol 5558 pp 950ndash959 2009

[46] W Nayer Feature based architecture for decision fusion [PhDthesis] 2003

[47] M-E Nilsback and B Caputo ldquoCue integration through dis-criminative accumulationrdquo in Proceedings of IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition(CVPR rsquo04) vol 2 pp II578ndashII585 July 2004

[48] A Pronobis and B Caputo ldquoConfidence-based cue integrationfor visual place recognitionrdquo inProceedings of IEEERSJ Interna-tional Conference on Intelligent Robots and Systems (IROS rsquo07)pp 2394ndash2401 October-November 2007

[49] A Pronobis O Martinez Mozos and B Caputo ldquoSVM-baseddiscriminative accumulation scheme for place recognitionrdquo inProceedings of IEEE International Conference on Robotics andAutomation (ICRA rsquo08) pp 522ndash529 May 2008

[50] F Lu X Yang W Lin R Zhang R Zhang and S YuldquoImage classification with multiple feature channelsrdquo OpticalEngineering vol 50 no 5 Article ID 057210 2011

[51] P Gehler and S Nowozin ldquoOn feature combination for mul-ticlass object classificationrdquo in Proceedings of the 12th Interna-tional Conference on Computer Vision pp 221ndash228 October2009

[52] X Zhu ldquoSemi-supervised learning literature surveyrdquo TechRep 1530 Department of Computer Sciences University ofWinsconsin Madison Wis USA 2008

[53] X Zhu and A B Goldberg Introduction to Semi-SupervisedLearning Morgan and Claypool Publishers 2009

[54] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing MIT Press Cambridge Mass USA 2006

[55] M Belkin P Niyogi and V Sindhwani ldquoManifold regular-ization a geometric framework for learning from labeled andunlabeled examplesrdquoThe Journal of Machine Learning Researchvol 7 pp 2399ndash2434 2006

[56] UVonLuxburg ldquoA tutorial on spectral clusteringrdquo Statistics andComputing vol 17 no 4 pp 395ndash416 2007

[57] D Zhou O Bousquet T Navin Lal JWeston and B ScholkopfldquoLearning with local and global consistencyrdquo Advances inNeural Information Processing Systems vol 16 pp 321ndash3282004

[58] S Melacci and M Belkin ldquoLaplacian support vector machinestrained in the primalrdquo The Journal of Machine LearningResearch vol 12 pp 1149ndash1184 2011

[59] B Nadler and N Srebro ldquoSemi-supervised learning with thegraph laplacian the limit of infinite unlabelled datardquo in Pro-ceedings of the 23rd Annual Conference on Neural InformationProcessing Systems (NIPS rsquo09) 2009

[60] A Blum and T Mitchell ldquoCombining labeled and unlabeleddata with co-trainingrdquo in Proceedings of the 11th Annual Confer-ence on Computational LearningTheory (COLTrsquo 98) pp 92ndash100October 1998

[61] D Zhang and W Sun Lee ldquoValidating co-training modelsfor web image classificationrdquo in Proceedings of SMA AnnualSymposium National University of Singapore 2005

[62] W Tong T Yang andR Jin ldquoCo-training For Large Scale ImageClassification AnOnline ApproachrdquoAnalysis and Evaluation ofLarge-Scale Multimedia Collections pp 1ndash4 2010

[63] M Wang X-S Hua L-R Dai and Y Song ldquoEnhancedsemi-supervised learning for automatic video annotationrdquo inProceedings of IEEE International Conference onMultimedia andExpo (ICME rsquo06) pp 1485ndash1488 July 2006

[64] V E van Beusekom I G Sprinkuizen-Kuyper and L GVuurpul ldquoEmpirically evaluating co-trainingrdquo Student Report2009

[65] W Wang and Z-H Zhou ldquoAnalyzing co-training style algo-rithmsrdquo in Proceedings of the 18th European Conference onMachine Learning (ECML rsquo07) pp 454ndash465 2007

[66] C Dong Y Yin X Guo G Yang and G Zhou ldquoOn co-training style algorithmsrdquo in Proceedings of the 4th InternationalConference on Natural Computation (ICNC rsquo08) vol 7 pp 196ndash201 October 2008

[67] S Abney Semisupervised Learning for Computational Linguis-tics Computer Science and Data Analysis Series Chapman ampHall University of Michigan Ann Arbor Mich USA 2008

[68] D Yarowsky ldquoUnsupervised word sense disambiguation rival-ing supervised methodsrdquo in Proceedings of the 33rd AnnualMeeting on Association for Computational Linguistics (ACL rsquo95)pp 189ndash196 University of Pennsylvania 1995

[69] W Wang and Z-H Zhou ldquoA new analysis of co-trainingrdquo inProceedings of the 27th International Conference on MachineLearning pp 1135ndash1142 May 2010

[70] C M Bishop Pattern Recognition and Machine LearningInformation Science and Statistics Springer Secaucus NJ USA2006

[71] B Scholkopf and A J Smola Learning with Kernels MIT PressCambridge Mass USA 2002

[72] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-JLin ldquoLIBLINEAR a library for large linear classificationrdquo TheJournal of Machine Learning Research vol 9 pp 1871ndash18742008

22 Advances in Multimedia

[73] A J Smola B Scholkopf and K-R Muller ldquoNonlinearcomponent analysis as a kernel eigenvalue problemrdquo NeuralComputation vol 10 no 5 pp 1299ndash1319 1998

[74] A Pronobis O Martınez Mozos B Caputo and P JensfeltldquoMulti-modal semantic place classificationrdquo The InternationalJournal of Robotics Research vol 29 no 2-3 pp 298ndash320 2010

[75] T Hastie R Tibshirani J Friedman and J Franklin ldquoTheelements of statistical learning data mining inference andpredictionvolumerdquo The Mathematical Intelligencer vol 27 no2 pp 83ndash85 2005

[76] S Ruping A Simple Method For Estimating Conditional Prob-abilities For SVMs American Society of Agricultural Engineers2004

[77] T Tommasi F Orabona and B Caputo ldquoAn SVM confidence-based approach to medical image annotationrdquo in Proceedingsof the 9th Cross-Language Evaluation Forum Conference onEvaluating Systems forMultilingual andMultimodal InformationAccess (CLEF rsquo08) pp 696ndash703 2009

[78] K Grauman and T Darrell ldquoThe pyramid match kerneldiscriminative classification with sets of image featuresrdquo in Pro-ceedings of the 10th IEEE International Conference on ComputerVision (ICCV rsquo05) vol 2 pp 1458ndash1465 October 2005

[79] J Luo A Pronobis B Caputo and P Jensfelt ldquoThe KTH-IDOL2 databaserdquo Tech Rep Kungliga Tekniska HoegskolanCVAPCAS 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 13: Research Article Multiple Feature Fusion Based on Co ...downloads.hindawi.com/journals/am/2013/175064.pdf · and wearable sensors.... Monitoring Using Ambient Sensors. Activity recogni-tion

Advances in Multimedia 13

Figure 5: Effect of the DAS late fusion approach on the final performance for various supervision levels. Plot of the accuracy as a function of the parameter α that balances the fusion between SPH features (3 levels) if α = 0 and CRFH if α = 1 (IDOL2 dataset, random setup).

The annotated video is split randomly into training and validation sets. With 12 video sequences under consideration, evaluating on all possible pairs amounts to 132 = 12 × 11 pairs of video sequences. We differentiate three sets of pairs: "EASY", "HARD", and "ALL" result cases. The "EASY" set contains only the video sequence pairs where the light conditions are similar and the recordings were made in a very short span of time. The "HARD" set contains pairs of video sequences with different lighting conditions or video sequences recorded with a large time span. The "ALL" set contains all 132 video pairs to provide an overall averaged performance.

Compared to the random sampling setup, the video-versus-video setup is considered more challenging, and thus lower place recognition performances are expected.

4.1.2. Utility of Multiple Features. We study the contribution of multiple features for the task of image-based place recognition on the IDOL2 database. We present a complete summary of performances for baseline single feature methods compared to early and late fusion methods. These experiments were carried out using the random labeling setup only.

The DAS Method. The DAS method leverages two visual feature classifier outputs and provides a weighted score sum in the output, on which the class decision can be made. In Figure 5, the performance of DAS using SPH Level 3 and CRFH feature embeddings is shown as a function of the fusion parameter α at different supervision levels. Interesting dynamics can be noticed for intermediary fusion values, which suggest feature complementarity. The fusion parameter α can be safely set to an intermediary value such as 0.5, and the final performance would then exceed that of every single feature classifier alone at all supervision levels.
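To make the fusion rule concrete, the following minimal sketch (in Python, with illustrative variable names; not the authors' implementation) computes the DAS-style weighted score sum of two per-feature classifiers and takes the class decision:

```python
import numpy as np

def das_fusion(scores_a, scores_b, alpha=0.5):
    """DAS-style late fusion: weighted sum of the per-class scores of two
    feature-specific classifiers (e.g., an SPH-based and a CRFH-based SVM).

    scores_a, scores_b: arrays of shape (n_frames, n_classes).
    alpha: fusion weight (alpha = 0 keeps only the first view, alpha = 1 only the second).
    """
    fused = (1.0 - alpha) * scores_a + alpha * scores_b
    return fused.argmax(axis=1)  # predicted class index for every frame
```

With alpha around 0.5 both views contribute to the decision, which matches the observation above that an intermediary value is a safe default.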

Figure 6: Comparison of single (BOVW, CRFH, and SPH) and multiple feature (SVMDAS, SimpleMKL) approaches for different supervision levels. Plot of the accuracy as a function of the supervision level (IDOL2 dataset, random setup).

The SVMDAS Method. In Figure 6, the effect of the supervision level on the classification performances is shown for single feature and multiple feature approaches. It is clear that all methods perform better if more labeled data is supplied, which is an expected behavior. We can notice differences in the performances of the 3 single feature approaches, with SPH providing the best performances. Both SVMDAS (late fusion approach) and SimpleMKL (early fusion approach) operate fusion over the 3 single features considered. They outperform the single feature baseline methods. There is practically no difference between the two fusion methods on this dataset.

Selection of the Late Fusion Method. Although not compared directly, the two late fusion methods DAS and SVMDAS deliver very comparable performances. Comparing the maximum performance (at the best α for each supervision level) of DAS (Figure 5) to that of SVMDAS (Figure 6) confirms this claim on this particular database. Therefore, the choice of the DAS method for further usage in the final system is motivated by this result and by the simplified fusion parameter selection.

4.1.3. Effect of Temporal Smoothing

Motivation. Temporal information is an implicit attribute of video content which has not been leveraged up to now in this work. The main idea is that temporally close images should carry the same label.

Discussion on the Results. To show the importance of the time information, we present the effect of the temporal accumulation (TA) module on the performance of single feature SVM classification. In Figure 7, the TA window size is varied from no temporal accumulation up to 300 frames. The results show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This result shows that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and temporal accumulation is a possible remedy. Assuming that only a minority of temporal neighbors are classified incorrectly makes the temporal continuity a strong cue for our application, and it should be integrated in the learning process, as will be shown next.

Figure 7: Effect of the filter size in temporal accumulation. Plot of the accuracy as a function of the TA filter size (IDOL2 dataset, SPH Level 3 features).

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Knowing the frame rate of the camera and the relatively slow room changes, the filter size can be set empirically, for example, to the number of frames captured in one second.
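As an illustration, the sketch below implements temporal accumulation as a simple moving average of per-frame class scores over a window of h frames; the exact filter used in the paper may differ, so the function and its parameters are assumptions for illustration:

```python
import numpy as np

def temporal_accumulation(scores, h=5):
    """Smooth per-frame class scores with a moving average over h frames.

    scores: array of shape (n_frames, n_classes) holding classifier scores.
    h: window size in frames (e.g., the number of frames captured in one second).
    """
    kernel = np.ones(h) / h
    smoothed = np.column_stack(
        [np.convolve(scores[:, c], kernel, mode="same") for c in range(scores.shape[1])]
    )
    return smoothed  # the smoothed label is then smoothed.argmax(axis=1)
```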

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the family of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semi-supervised SVM using the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we are interested in observing the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. The low performance at low supervision compared to the standard supervised classifier can be explained by an improper parameter setting.


Figure 8: Comparison of a standard single feature SVM with a semi-supervised Laplacian SVM with RBF kernel on SPH Level 3 visual features (IDOL2 dataset, random setup).

Practical application of this method is limited, since the total kernel matrix must be computed and stored in the memory of a computer, which scales as O(n²) with the number of patterns. Computational time scales as O(n³), which is clearly prohibitive for medium and large sized datasets.

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. In Figure 9, the performance of the CO-DAS method is shown when only one Co-Training iteration is used. The left and right panels illustrate, respectively, the best choice of the amount of selected high confidence patterns for classifier retraining and the selection of the DAS fusion parameter by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful if a relatively large amount of top confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at the low supervision rate. This may hint at overfitting, since the SPH descriptor is a richer visual descriptor.

Co-Training with More Iterations. Interesting additional insights on the Co-Training algorithm can be gained if we perform more than one iteration (see Figure 10). The figures show the evolution of the performance of a single feature classifier after it was iteratively retrained, from the standard baseline up to 10 iterations, where a constant portion of high confidence estimates was added after each iteration. The plots show an interesting increase of performance with every iteration for both classifiers, with the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are possibly conditionally independent, as required for the Co-Training algorithm to function properly. Secondly, we notice a certain saturation after more than 6-7 iterations in most cases, which may hint that both classifiers have reached complete agreement.
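For clarity, here is a minimal sketch of one Co-Training iteration over two feature views; the classifier choice (linear SVM), the margin-based confidence, and the feedback ratio are illustrative assumptions rather than the exact configuration used in the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

def co_training_step(clf_a, clf_b, X_lab_a, X_lab_b, y_lab,
                     X_unlab_a, X_unlab_b, feedback_ratio=0.2):
    """One Co-Training iteration over two feature views of the same frames
    (e.g., BOVW and CRFH), assuming a multi-class problem.

    Each view's classifier labels its most confident unlabeled frames and
    feeds them, with their pseudo-labels, to the other view's training set.
    """
    clf_a.fit(X_lab_a, y_lab)
    clf_b.fit(X_lab_b, y_lab)

    def most_confident(clf, X):
        scores = clf.decision_function(X)          # (n_frames, n_classes)
        top_two = np.sort(scores, axis=1)[:, -2:]
        margin = top_two[:, 1] - top_two[:, 0]     # best minus second-best score
        keep = np.argsort(margin)[-max(1, int(feedback_ratio * len(margin))):]
        return keep, clf.classes_[scores.argmax(axis=1)[keep]]

    keep_a, pseudo_a = most_confident(clf_a, X_unlab_a)
    keep_b, pseudo_b = most_confident(clf_b, X_unlab_b)

    # Retrain each view with the confident estimates coming from the other view.
    clf_a.fit(np.vstack([X_lab_a, X_unlab_a[keep_b]]), np.concatenate([y_lab, pseudo_b]))
    clf_b.fit(np.vstack([X_lab_b, X_unlab_b[keep_a]]), np.concatenate([y_lab, pseudo_a]))
    return clf_a, clf_b

# Example usage (hypothetical data): clf_a, clf_b = co_training_step(LinearSVC(), LinearSVC(), ...)
```

Running this step once roughly corresponds to the one-iteration setting discussed above, and repeating it reproduces the multi-iteration experiments; the fused decision is then obtained by applying a DAS-style fusion of the two retrained classifiers, as in the earlier sketch.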

Figure 9: Effect of supervision level on the CO-DAS performance and optimal parameters. (a) Accuracy for CO-DAS and single feature approaches. (b) Optimal amount of selected samples for the Co-Training feedback loop. (c) Selected DAS α parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the accuracy of the individual inner classifiers of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, (bottom) different lighting conditions.

Conclusion. The experiments carried out this far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that a better manifold leveraging unlabeled data can be learned using the semi-supervised Laplacian SVM under the assumption of low density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. This issue is solved by using the more efficient Co-Training algorithm, which will be used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which training patterns are scattered across the database. Randomly labeled images may simulate a situation where small portions of video are annotated in a frame-by-frame manner. In the extreme case, only a few labeled images from every class may be labeled manually.

In this context, we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results depicting these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using the MKL by attributing equal weights to both visual features.

Figure 11: Comparison of the performance of single features (BOF, CRFH), early (Eq-MKL) and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOF, CO-CRFH) and the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback. The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See text for a more detailed explanation.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, achieving approximately 50% and 60% of correct classification for the BOVW and CRFH based SVM classifiers. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top confidence estimates in one retraining iteration, achieving respectively a 10% and 8% increase for the BOVW and CRFH classifiers. This indicates that the top confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high confidence estimates are provided by the BOVW classifier, which may be a sign of an increasing amount of misclassifications being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop in the BOVW classifier, which is achieved by a weighting in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% of accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. An interesting observation for the CO-DAS method clearly shows the complementarity of the visual features even when no Co-Training learning iterations are performed. The high supervision setup permits as much as 50% of the remaining test data to be annotated for the next retraining rounds before reaching saturation at approximately 94% of accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative retraining procedure. Curiously, the diversity of the initial classifiers determines what performance gain can be obtained using the Co-Training algorithm. This explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study, in a more realistic setup, the case with one video being used for training and the place recognition method evaluated on a different video.

Figure 12: Comparison of the global performances for single feature (BOVW-SVM, CRFH-SVM), multiple feature late fusion (DAS), and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback. The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared showing the influence of the amount of samples used in the Co-Training algorithm feedback loop. The baseline single feature methods perform roughly equally, delivering approximately 50% of correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).

Confidence Measure. On this dataset, a good illustration concerning the amount of high confidence estimates is shown in Figure 12. It is clear that only a portion of the test set data can be useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by the standard fusion and simple Co-Training methods. The proposed methods CO-DAS and CO-TA-DAS perform the best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semi-supervised framework.

Figure 13: Comparison of the performances of the types of confidence measures for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback (video-versus-video setup, ALL pairs).

4.1.7. Effect of the Type of Confidence Measures. Figure 13 represents the effect of the type of confidence measure used in Co-Training on the performances for different amounts of feedback in the Co-Training phase. The performances for the Ruping approach are not reported, as they were much lower than those of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.
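The exact confidence measures compared here (the proposed one, the logistic and Tommasi variants, and the Ruping estimate) are defined elsewhere in the paper; the sketch below only illustrates two generic ways of turning per-class SVM decision values into a per-frame confidence score, and both formulas are assumptions for illustration, not the paper's definitions:

```python
import numpy as np

def margin_confidence(scores):
    """Margin-type confidence: difference between the two best class scores."""
    top_two = np.sort(scores, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def logistic_confidence(scores):
    """Logistic-type confidence: probability of the winning class after a
    softmax-like normalization of the raw decision values."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).max(axis=1)
```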

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

Most locations are represented by one short bootstrap sequence briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6,400 images), while the unlabeled evaluation videos are 20 minutes long (36,000 images) on average. A few locations are not given a labeled bootstrap video; therefore, a small randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been equalized such that every frame can carry one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in the realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used the standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature/approach    SVM     SVM-TA
BOVW                0.49    0.52
CRFH                0.48    0.53
SPH                 0.47    0.49

Baseline: Single Feature Classifier Performance. As shown in Table 1, single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation for this phenomenon is that, due to the low amount of supervision, a classifier trained on the high dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained if temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement in the multiple feature methods, with MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is constant for the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage for the late fusion method when compared to single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous compared to the respective single feature methods in this low supervision setup, bringing up to 4% with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature/approach    MKL     MKL-TA    DAS     DAS-TA    CO-DAS    CO-TA-DAS
BOVW-SPH            0.48    0.50      0.51    0.56      0.50      0.53
BOVW-CRFH           0.50    0.54      0.51    0.56      0.54      0.58
SPH-CRFH            0.48    0.51      0.50    0.54      0.54      0.57
BOVW-SPH-CRFH       0.48    0.51      0.51    0.56      —         —

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when it comes to low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves over DAS on all but the BOVW-SPH feature pair, by up to 4%. We explain this performance increase by the successfully leveraged visual feature complementarity and by the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity a priori and boosts performances by another 3–4% in global accuracy. This method effectively combines all the benefits brought by the individual features, the temporal continuity of the video, and the exploitation of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as the low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured. Usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done in selecting more discriminant visual features for specific applications and in a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.
[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.
[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "SenseCam image localisation using hierarchical SURF trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

22 Advances in Multimedia

[73] A J Smola B Scholkopf and K-R Muller ldquoNonlinearcomponent analysis as a kernel eigenvalue problemrdquo NeuralComputation vol 10 no 5 pp 1299ndash1319 1998

[74] A Pronobis O Martınez Mozos B Caputo and P JensfeltldquoMulti-modal semantic place classificationrdquo The InternationalJournal of Robotics Research vol 29 no 2-3 pp 298ndash320 2010

[75] T Hastie R Tibshirani J Friedman and J Franklin ldquoTheelements of statistical learning data mining inference andpredictionvolumerdquo The Mathematical Intelligencer vol 27 no2 pp 83ndash85 2005

[76] S Ruping A Simple Method For Estimating Conditional Prob-abilities For SVMs American Society of Agricultural Engineers2004

[77] T Tommasi F Orabona and B Caputo ldquoAn SVM confidence-based approach to medical image annotationrdquo in Proceedingsof the 9th Cross-Language Evaluation Forum Conference onEvaluating Systems forMultilingual andMultimodal InformationAccess (CLEF rsquo08) pp 696ndash703 2009

[78] K Grauman and T Darrell ldquoThe pyramid match kerneldiscriminative classification with sets of image featuresrdquo in Pro-ceedings of the 10th IEEE International Conference on ComputerVision (ICCV rsquo05) vol 2 pp 1458ndash1465 October 2005

[79] J Luo A Pronobis B Caputo and P Jensfelt ldquoThe KTH-IDOL2 databaserdquo Tech Rep Kungliga Tekniska HoegskolanCVAPCAS 2006


Figure 7: Effect of the filter size in temporal accumulation. Plot of the global accuracy (%) as a function of the TA averaging filter size h (IDOL2 dataset, SPH Level 3 features); one curve per legend value 1, 5, 10, and 50.

The results presented in Figure 7 show that temporal accumulation with a window size of up to 100 frames (corresponding to 20 seconds of video) increases the final classification performance. This indicates that a minority of temporally close images, which are very likely to carry the same class label, obtain an erroneous label, and that temporal accumulation is a possible remedy. Assuming that only a minority of temporal neighbors are classified incorrectly makes temporal continuity a strong cue for our application; it should therefore be integrated into the learning process, as will be shown next.

Practical Considerations. In practice, the best averaging window size cannot be known in advance. Given the frame rate of the camera and the relatively slow rate of room changes, the filter size can be set empirically, for example, to the number of frames captured in one second (a minimal sketch of this smoothing step is given below).
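The following Python sketch only illustrates the averaging of per-class scores before the decision; the uniform centered window and the function names are our assumptions, not the exact implementation used in the experiments.

import numpy as np

def temporal_accumulation(scores, h):
    # scores: (n_frames, n_classes) per-frame classifier outputs (e.g., SVM
    # decision values), assumed temporally ordered; h: window size in frames.
    n = len(scores)
    half = h // 2
    smoothed = np.empty_like(scores, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        smoothed[t] = scores[lo:hi].mean(axis=0)
    return smoothed

# The final label per frame is the argmax over the smoothed scores, e.g.
# labels = np.argmax(temporal_accumulation(svm_scores, h=5), axis=1)
# with h = 5 frames corresponding to one second of video at 5 frames per second.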

4.1.4. Utility of Unlabeled Data

Motivation. The Co-Training algorithm belongs to the group of semi-supervised learning algorithms. Our goal is to assess its capacity to leverage unlabeled data in practice. First, we compare a standard single feature SVM to a semi-supervised SVM relying on the graph smoothness assumption. Second, we study the proposed CO-DAS method. Third, we observe the evolution of performance when multiple Co-Training iterations are performed. Finally, we present a complete set of experiments on the IDOL2 database comparing single feature and multifeature baselines to the proposed semi-supervised CO-DAS and CO-TA-DAS methods.

Our primary interest is to show how a standard supervised SVM classifier compares to a state-of-the-art semi-supervised Laplacian SVM classifier. The performance of both classifiers is shown in Figure 8. The results show that the semi-supervised counterpart performs better if a sufficiently large initial labeled set of training patterns is given. Its lower performance at low supervision, compared to the standard supervised classifier, can be explained by an improper parameter setting.

Figure 8: Comparison of a standard single feature SVM with the semi-supervised Laplacian SVM (RBF kernel) on SPH Level 3 visual features (IDOL2 dataset, random setup). Plot of the global accuracy (%) as a function of the supervision level (%, log scale).

Practical application of this method is limited, since the full kernel matrix must be computed and stored in memory, which scales as O(n^2) with the number of patterns; the computational time scales as O(n^3), which is clearly prohibitive for medium and large sized datasets.
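For reference, this cost can be read off the standard manifold regularization objective of [55] in its hinge-loss (Laplacian SVM) form; the formula below follows that paper and is included here only to make the complexity argument explicit:

\[
f^{\ast} \;=\; \arg\min_{f \in \mathcal{H}_K}\;
\frac{1}{l}\sum_{i=1}^{l}\max\bigl(0,\,1 - y_i f(x_i)\bigr)
\;+\; \gamma_A \,\lVert f \rVert_K^{2}
\;+\; \frac{\gamma_I}{(l+u)^{2}}\,\mathbf{f}^{\top} L\,\mathbf{f},
\qquad
f(x) \;=\; \sum_{j=1}^{l+u} \alpha_j K(x, x_j),
\]

where l and u are the numbers of labeled and unlabeled patterns, L is the (l+u) x (l+u) graph Laplacian built on all patterns, and f = (f(x_1), ..., f(x_{l+u}))^T. Because the expansion runs over all l + u patterns, the full Gram matrix must be stored, hence the O(n^2) memory footprint, and solving the resulting dense system costs O(n^3) time with n = l + u.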

Co-Training with One Iteration. The CO-DAS method proposed in this work avoids these issues and scales to much larger datasets due to the use of a linear kernel SVM. Figure 9 shows the performance of the CO-DAS method when only one Co-Training iteration is used. Figures 9(b) and 9(c) illustrate, respectively, the best choice of the amount of selected high confidence patterns for classifier retraining and the DAS fusion parameter selected by a cross-validation procedure. The results show that the performance increase using only one iteration of Co-Training followed by DAS fusion is meaningful when a relatively large amount of top confidence patterns is fed back for classifier retraining at low supervision rates. Notice that the cross-validation procedure selected the CRFH visual feature at low supervision rates. This may hint at overfitting, since the SPH descriptor is the richer visual descriptor. A sketch of the DAS fusion step and of the cross-validated choice of its weight is given below.
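The Python sketch below only illustrates the DAS-style weighted accumulation of the two classifiers' per-class decision values and the selection of the weight on held-out labeled frames; the function names, the uniform grid over the weight, and the single validation split are our assumptions, not the exact procedure of the paper.

import numpy as np

def das_fuse(scores_a, scores_b, alpha):
    # Weighted accumulation of per-class decision values from two classifiers.
    return alpha * scores_a + (1.0 - alpha) * scores_b

def select_alpha(scores_a, scores_b, labels, grid=np.linspace(0.0, 1.0, 11)):
    # Pick the fusion weight that maximizes accuracy on held-out labeled frames.
    best_alpha, best_acc = grid[0], -1.0
    for alpha in grid:
        pred = np.argmax(das_fuse(scores_a, scores_b, alpha), axis=1)
        acc = float(np.mean(pred == labels))
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha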

Co-Training with More Iterations. Additional insights into the Co-Training algorithm can be gained by performing more than one iteration (see Figure 10). The figure shows the evolution of the performance of each single feature classifier as it is iteratively retrained from the standard baseline for up to 10 iterations, with a constant portion of high confidence estimates added after each iteration. The plots show an increase of performance with every iteration for both classifiers, following the same trend. First, this hints that both initial classifiers are sufficiently bootstrapped with the initial training data and that the two visual cues are plausibly conditionally independent, as required for the Co-Training algorithm to function properly. Second, we notice a saturation after 6-7 iterations in most cases, which may indicate that both classifiers have reached complete agreement. A schematic sketch of this feedback loop is given below.
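The following Python sketch (scikit-learn) is only schematic: the per-view linear SVMs, the top-2 margin confidence, and the fixed feedback ratio are illustrative stand-ins for the exact components described earlier in the paper, and a multiclass problem is assumed.

import numpy as np
from sklearn.svm import LinearSVC

def margin_confidence(scores):
    # Gap between the two largest per-class decision values (multiclass assumed).
    s = np.sort(scores, axis=1)
    return s[:, -1] - s[:, -2]

def co_training(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab,
                n_iters=10, feedback_ratio=0.1):
    # Two views of the same frames: Xa_* (e.g., BOVW) and Xb_* (e.g., CRFH).
    Xa_tr, ya = Xa_lab.copy(), y_lab.copy()
    Xb_tr, yb = Xb_lab.copy(), y_lab.copy()
    unlab = np.arange(len(Xa_unlab))          # indices of still-unlabeled frames
    clf_a, clf_b = LinearSVC(), LinearSVC()
    for _ in range(n_iters):
        clf_a.fit(Xa_tr, ya)
        clf_b.fit(Xb_tr, yb)
        if len(unlab) == 0:
            break
        sa = clf_a.decision_function(Xa_unlab[unlab])
        sb = clf_b.decision_function(Xb_unlab[unlab])
        k = max(1, int(feedback_ratio * len(unlab)))
        top_a = np.argsort(margin_confidence(sa))[-k:]   # view A's most confident frames
        top_b = np.argsort(margin_confidence(sb))[-k:]   # view B's most confident frames
        # Each view labels its confident frames for the other view's training set.
        Xb_tr = np.vstack([Xb_tr, Xb_unlab[unlab[top_a]]])
        yb = np.concatenate([yb, clf_a.classes_[np.argmax(sa[top_a], axis=1)]])
        Xa_tr = np.vstack([Xa_tr, Xa_unlab[unlab[top_b]]])
        ya = np.concatenate([ya, clf_b.classes_[np.argmax(sb[top_b], axis=1)]])
        unlab = np.delete(unlab, np.union1d(top_a, top_b))
    return clf_a, clf_b

In the CO-DAS variant, the decision values of the two retrained classifiers would then be combined with the DAS weighting sketched above.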


Figure 9: Effect of the supervision level on the CO-DAS performance and optimal parameters: (a) accuracy (%) for CO-DAS and the single feature approaches (CRFH, SPH) as a function of the supervision level (%); (b) optimal amount of selected samples for the Co-Training feedback loop; (c) selected DAS alpha parameter for late fusion (IDOL2 dataset, random setup).

Figure 10: Evolution of the global accuracy of the individual inner classifiers (BOF and CRFH co-training) of the Co-Training module as a function of the number of feedback loop iterations (IDOL2 dataset, video-versus-video setup). The plots are shown for six sequence pairs: (top) same lighting conditions, far time (minnie cloudy1/minnie cloudy3, minnie night1/minnie night3, minnie sunny1/minnie sunny3); (bottom) different lighting conditions, close time (minnie cloudy1/minnie night1, minnie cloudy1/minnie sunny1, minnie night1/minnie sunny1).

Conclusion. The experiments carried out so far show that unlabeled data is indeed useful for image-based place recognition. We demonstrated that, by exploiting the data manifold, a better classifier leveraging unlabeled data can be learned using the semi-supervised Laplacian SVM under the assumption of low density class separation. This performance comes at a high computational cost, requires large amounts of memory, and demands careful parameter tuning. These issues are addressed by the more efficient Co-Training algorithm, which is used in the proposed place recognition system.

4.1.5. Random Setup: Comparison of Global Performance

Motivation. The random labeling setup represents conditions in which the training patterns are scattered across the database. Randomly labeled images may simulate a situation where small portions of the video are annotated frame by frame; in the extreme case, only a few images from every class are labeled manually.

In this context, we are interested in the performance of the single feature methods, the early and late fusion methods, and the proposed semi-supervised CO-DAS and CO-TA-DAS methods. In order to simulate various supervision levels, the amount of labeled samples varies from a low (1%) to a relatively high (50%) proportion of the database. The results for these two setups are presented in Figures 11(a) and 11(b), respectively. The early fusion is performed using MKL with equal weights attributed to both visual features.

Figure 11: Comparison of the performance of the single features (BOVW, CRFH), early fusion (Eq-MKL), and late fusion (DAS) approaches, with details of the Co-Training performances of the individual inner classifiers (CO-BOVW, CO-CRFH) and of the final fusion (CO-DAS). Plot of the average accuracy as a function of the amount of Co-Training feedback (added test samples, %). The performances are plotted for 1% (a) and 50% (b) of labeled data (IDOL2 dataset, random labeling). See the text for a more detailed explanation.

Low Supervision Case. The low supervision configuration (Figure 11(a)) is clearly disadvantageous for the single feature methods, which achieve approximately 50% and 60% correct classification for the BOVW and CRFH based SVM classifiers, respectively. An interesting performance increase can be observed for the Co-Training algorithm leveraging 10% of the top confidence estimates in one retraining iteration, yielding increases of 10% and 8% for the BOVW and CRFH classifiers, respectively. This indicates that the top confidence estimates are not only correct but also useful for each classifier, improving its discriminatory power on less confident test patterns. Curiously, the performance of the CRFH classifier degrades if more than 10% of high confidence estimates are provided by the BOVW classifier, which may be a sign that an increasing amount of misclassifications is being injected. The CO-DAS method successfully performs the fusion of both classifiers and addresses the performance drop in the BOVW classifier, which is achieved by weighting the fusion in favor of the more powerful CRFH classifier.

High Supervision Case. At higher supervision levels (Figure 11(b)), the performance of the single feature supervised classifiers is already relatively high, reaching around 80% accuracy for both classifiers, which indicates that a significant amount of the visual variability present in the scenes has been captured. This comes as no surprise, since 50% of the video is annotated in the random setup. Nevertheless, the Co-Training algorithm improves the classification by an additional 8-9%. The behavior of the CO-DAS method clearly shows the complementarity of the visual features, even when no Co-Training iterations are performed. The high supervision setup permits annotating as much as 50% of the remaining test data for the next retraining rounds before reaching saturation at approximately 94% accuracy.

Conclusion. These experiments show the interest of using the Co-Training algorithm in low supervision conditions. The initial supervised single feature classifiers need to be provided with a sufficient number of training samples to bootstrap the iterative retraining procedure. Interestingly, the diversity of the initial classifiers determines what performance gain can be obtained with the Co-Training algorithm; this explains why, at higher supervision levels, the performance increase of a retrained classifier pair may not be significant. Finally, both early and late fusion methods succeed in leveraging the visual feature complementarity but fail to go beyond the Co-Training based methods, which confirms the utility of the unlabeled data in this context.

4.1.6. Video versus Video: Comparison of Global Performance

Motivation. The global performance of the methods may be overly optimistic if annotation is performed only in a random labeling setup. In practical applications, a small bootstrap video or a short portion of a video can be annotated instead. We therefore study a more realistic setup in which one video is used for training and the place recognition method is evaluated on a different video (a sketch of this evaluation protocol is given below).
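The Python sketch below only outlines this evaluation protocol, with averaging over all ordered sequence pairs as used for the "ALL pairs" curves of Figures 12 and 13; the helper names are ours, and train_and_predict stands for any of the compared methods.

import numpy as np
from itertools import permutations

def video_vs_video_accuracy(sequences, train_and_predict):
    # sequences: list of (features, labels) pairs, one per video sequence.
    # train_and_predict: callable (X_train, y_train, X_test) -> predicted labels.
    accuracies = []
    for (X_tr, y_tr), (X_te, y_te) in permutations(sequences, 2):
        pred = train_and_predict(X_tr, y_tr, X_te)
        accuracies.append(float(np.mean(pred == y_te)))
    return float(np.mean(accuracies))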

Figure 12: Comparison of the global performances for the single feature (BOVW-SVM, CRFH-SVM) and multiple feature late fusion (DAS) approaches and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS). The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. Plot of the average accuracy as a function of the amount of Co-Training feedback (added test samples, %). The approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, ALL pairs).

Discussion of the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared as a function of the amount of samples used in the Co-Training feedback loop. The baseline single feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%, which confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained with one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size tau = 50 was used).

Confidence Measure. For this dataset, Figure 12 also gives a good illustration of how much high confidence feedback is useful. It is clear that only a portion of the test set data is useful for classifier retraining; this is governed by two major factors, the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.

Conclusion. The results also show that all single feature baselines are outperformed by standard fusion and by simple Co-Training methods. The proposed methods, CO-DAS and CO-TA-DAS, perform best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semi-supervised framework.

Figure 13: Comparison of the performances of the types of confidence measures (proposed, logistic, and Tommasi) for the Co-Training feedback loop, for both the CO-DAS and CO-TA-DAS methods. Plot of the average accuracy as a function of the amount of Co-Training feedback (added test samples, %) (video-versus-video setup, ALL pairs).

4.1.7. Effect of the Type of Confidence Measures. Figure 13 presents the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performance of the Ruping approach is not reported, as it was much lower than that of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also included. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback; it was therefore used for the rest of the experiments.
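Since the exact confidence measures compared here are defined earlier in the paper, the Python snippet below only illustrates, under our own naming, two generic ways of scoring multiclass SVM outputs (a top-2 margin and a logistic mapping of the winning decision value) and how the most confident frames would be selected for the feedback loop.

import numpy as np

def top2_margin(scores):
    # Gap between the two largest per-class decision values.
    s = np.sort(scores, axis=1)
    return s[:, -1] - s[:, -2]

def logistic_confidence(scores, a=1.0, b=0.0):
    # Sigmoid mapping of the winning decision value (Platt-style scaling);
    # a and b would normally be fitted on held-out labeled data.
    return 1.0 / (1.0 + np.exp(-(a * scores.max(axis=1) + b)))

def select_feedback(confidence, ratio):
    # Indices of the most confident unlabeled frames to feed back for retraining.
    k = max(1, int(ratio * len(confidence)))
    return np.argsort(confidence)[-k:]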

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses new challenges. The difficulties arise from the increased visual variability that changes from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours. All recordings were performed with a portable GoPro video camera at a frame rate of 30 frames per second and a frame resolution of 1280 x 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

Most locations are represented by one short bootstrap sequence, briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in his or her ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6400 images), while the unlabeled evaluation videos are 20 minutes long (36000 images) on average. A few locations have no labeled bootstrap video; for these, a small randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been unified such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature/approach    SVM     SVM-TA
BOVW                0.49    0.52
CRFH                0.48    0.53
SPH                 0.47    0.49

Baseline: Single Feature Classifier Performance. As shown in Table 1, the single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor performs worse than its simpler BOVW variant. A possible explanation of this phenomenon is that, due to the low amount of supervision, a classifier trained on the high dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion speaks in favor of the late fusion method when compared to the single feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the joint feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.

Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature/approach    MKL     MKL-TA    DAS     DAS-TA    CO-DAS    CO-TA-DAS
BOVW-SPH            0.48    0.50      0.51    0.56      0.50      0.53
BOVW-CRFH           0.50    0.54      0.51    0.56      0.54      0.58
SPH-CRFH            0.48    0.51      0.50    0.54      0.54      0.57
BOVW-SPH-CRFH       0.48    0.51      0.51    0.56      --        --

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important given the low amount of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves over DAS by up to 4% for all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged visual feature complementarity and by the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, additionally incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of the video, and the use of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition combines several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were first verified on a public video sequence database, IDOL2, and then on the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that are useful for successful recognition in such video corpora. First, the use of multiple visual features was shown to improve the discriminative power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the use of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all of this acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition in complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data through a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information into the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system to challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20-23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582-601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177-193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273-280, October 2003.
[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413-420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299-1302, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113-4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32-36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170-178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422-434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521-2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165-174, September 2009.
[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1-4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1-7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255-264, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27-38, 2005.

[21] C. O. Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3-II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411-418, 2007.
[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524-531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161-2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1-6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169-2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470-1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229-241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657-1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489-1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345-379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008.
[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606-613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436-443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902-909, June 2010.

[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550-553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5-12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950-959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II578-II585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394-2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522-529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221-228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321-328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149-1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92-100, October 1998.

[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1-4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485-1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454-465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196-201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189-196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135-1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.
[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298-320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83-85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696-703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458-1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.


Advances in Multimedia 15

65

70

75

80

85

90

95

100

CRFH

SPH

CO-DAS

100 101

Supervision ()

Acc

ura

cy (

)

(a)

1 2 3 5 10 20 30 500

10

20

30

40

50

60

70

Supervision ()

To

p c

on

fid

ence

pat

tern

s u

sed

(

)

(b)

1 2 3 5 10 20 30 500

Supervision ()

01

02

03

04

05

06

07

08

09

1

DA

S

(c)

Figure 9 Effect of supervision level on the CO-DAS performance and optimal parameters (a) Accuracy for CO-DAS and single featureapproaches (b) Optimal amount of selected samples for the Co-Training feedback loop (c) Selected DAS 120572 parameter for late fusion (IDOL2dataset random setup)

0 2 4 6 8 10 12

Number of iterations

Glo

bal

acc

ura

cy

04

05

06

07

08

BOF co-training

CRFH co-training

Far time minnie cloudy1 minnie cloudy3

Close time minnie cloudy1 minnie night1

0 2 4 6 8 10 12

Number of iterations

Glo

bal

acc

ura

cy

04

05

06

07

08

0 2 4 6 8 10 12

Number of iterations

Glo

bal

acc

ura

cy

04

03

05

06

07

08

Close time minnie cloudy1 minnie sunny1

0 2 4 6 8 10 12

Number of iterations

Glo

bal

acc

ura

cy

04

05

06

07

08

BOF co-training

CRFH co-training

BOF co-training

CRFH co-training

0 2 4 6 8 10 12

Number of iterations

Glo

bal

acc

ura

cy

05

06

07

08

09

Far time minnie sunny1 minnie sunny3Far time minnie night1 minnie night3

2 4 6 8 10 12

Number of iterations

Glo

bal

acc

ura

cy04

05

06

07

Close time minnie night1 minnie sunny1

Figure 10 Evolution of the accuracy of individual inner classifiers of the Co-Training module as a function of the number of feedback loopiterations (IDOL2 dataset video-versus-video setup) The plots are shown for six sequence pairs (top) same lighting conditions (bottom)different lighting conditions

Conclusion Experiments carried out this far show that unla-beled data is indeed useful for image-based place recognitionWe demonstrated that a better manifold leveraging unlabeleddata can be learned using semi-supervised Laplacian SVMwith the assumption of low density class separationThis per-formance comes at high computational cost large amountsof required memory and demands careful parameter tuningThis issue is solved by using more efficient Co-Training algo-rithm which will be used in the proposed place recognitionsystem

415 Random Setup Comparison of Global Performance

Motivation Random labeling setup represents the conditionswith training patterns being scattered across the databaseRandomly labeled images may simulate situation when somesmall portions of video are annotated in a frame by framemanner In its extreme few labeled images from every classmay be labeled manually

In this context we are interested in the performance ofthe single feature methods early and late fusion methods

16 Advances in Multimedia

BOVW

CRFH

Eq-MKL

DAS

CO-BOVW

CO-CRFH

CO-DAS

Acc

ura

cy

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

Added test samples ()

Effect of Co-Training Random sampling (training rate = 001)

(a)

BOVW

CRFH

Eq-MKL

DAS

CO-BOVW

CO-CRFH

CO-DAS

Effect of Co-Training Random sampling (training rate = 05)

Acc

ura

cy

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

Added test samples ()

(b)

Figure 11 Comparison of the performance of single features (BOF CRFH) early (Eq-MKL) and late fusion (DAS) approaches with detailsof Co-Training performances of the individual inner classifiers (CO-BOF CO-CRFH) and the final fusion (CO-DAS) Plot of the averageaccuracy as a function of the amount of Co-Training feedback The performances are plotted for 1 (a) and 50 (b) of labeled data (IDOL2dataset random labeling) See text for more detailed explanation

and the proposed semi-supervised CO-DAS and CO-TA-DASmethods In order to simulate various supervision levelsthe amount of labeled samples varies from a low (1) torelatively high (50) proportion of the database The resultsdepicting these two setups are presented in Figures 11(a) and11(b) respectively The early fusion is performed using theMKL by attributing equal weights for both visual features

Low Supervision Case The low supervision configuration(Figure 11(a)) is clearly disadvantageous for the single featuremethods achieving approximately 50and60of the correctclassification for BOVW and CRFH based SVM classifiersAn interesting performance increase can be observed for theCo-Training algorithm leveraging 10 of the top confidenceestimates in one re-training iteration achieving respectively10 and 8 increase for the BOVW and CRFH classifiersThis indicates that the top confidence estimates are not onlycorrect but are also useful for each classifier by improvingits discriminatory power on less confident test patternsCuriously the performance of the CRFH classifier degradesif more than 10 of high confidence estimates are providedby the BOVW classifier which may be a sign of increasingthe amount ofmisclassifications being injectedTheCO-DASmethod successfully performs the fusion of both classifiersand addresses the performance drop in the BOVW classifierwhich is achieved by a weighting in favor of the morepowerful CRFH classifier

High Supervision Case At higher supervision levels(Figure 11(b)) the performance of single feature supervisedclassifiers is already relatively high reaching around 80 ofaccuracy for both classifiers which indicates that a significantamount of visual variability present in the scenes has been

captured This comes as no surprise since at 50 of videoannotation in random setup Nevertheless the Co-Trainingalgorithm improves the classification by additional 8-9An interesting observation for the CO-DAS method showsclearly the complementarity of the visual features even whenno Co-Training learning iterations are performed The highsupervision setup permits as much as 50 of the remainingtest data annotation for the next re-training rounds beforereaching saturation at approximately 94 of accuracy

Conclusion These experiments show an interest of using theCo-Training algorithm in low supervision conditions Theinitial supervised single feature classifiers need to be providedwith sufficient number of training data to bootstrap theiterative re-training procedure Curiously the initial diversityof initial classifiers determines what performance gain canbe obtained using the Co-Training algorithm This explainswhy at higher supervision levels the performance increaseof a re-trained classifier pair may not be significant Finallyboth early and late fusion methods succeed to leverage thevisual feature complementarity but failed to go beyond theCo-Training basedmethods which confirms the utility of theunlabeled data in this context

416 Video versus Video Comparison of Global Performance

Motivation Global performance of the methods may beoverly optimistic if annotation is performed only in a randomlabeling setup In practical applications a small bootstrapvideo or a short portion of a video can be annotated insteadWe study in a more realistic setup the case with one videobeing used as training and the place recognition methodevaluated on a different video

Advances in Multimedia 17A

ccu

racy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

BOVW-SVM

CRFH-SVM

DAS

TA-DAS

BOVW-CO

CRFH-CO

CO-DAS

CO-TA-DAS

Effect of Co-Training feedback average train rate = 1

Added test samples ()

Figure 12 Comparison of the global performances for single feature(BOVW-SVM CRFH-SVM) multiple feature late fusion (DAS)and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS CO-TA-DAS) The evolution ofthe performances of the individual inner classifiers of the Co-Training module (BOVW-CO CRFH-CO) is also shown Plotof the average accuracy as a function of the amount of Co-Training feedback The approaches without Co-Training appear asthe limiting case with 0 of feedback (IDOL2 dataset video-versus-video setup ALL pairs)

Discussion on the Results The comparison of the methods invideo-versus-video setup is showed in Figure 12 The perfor-mances are compared showing the influence of the amountof samples used for the Co-Training algorithm feedback loopThe baseline single feature methods perform around equallyby delivering approximately 50 of correct classificationThestandard DAS fusion boosts the performance by additional10This confirms the complementarity of the selected visualfeatures in this test setup

The individual classifiers trained in one Co-Trainingiteration exceed the baseline and are comparable to per-formance delivered by standard DAS fusion method Theimprovement is due to the feedback of unlabeled patternsin the iterative learning procedure The CO-DAS methodsuccessfully leverages both improvements while the CO-TA-DAS additionally takes advantage of the temporal continuityof the video (a temporal window of size 120591 = 50 was used)

Confidence Measure On this dataset a good illustrationconcerning the amount of high confidence is showed inFigure 12 It is clear that only a portion of the test set datacan be useful for classifier re-training This is governed bytwo major factorsmdashquality of the data and robustness of theconfidence measure For this dataset the best portion of highconfidence estimates is around 20ndash50 depending on themethod The best performing TA-CO-TA-DAS method canafford to annotate up to 50 of testing data for the nextlearning iteration

Conclusion The results show as well that all single featurebaselines are outperformed by standard fusion and simple

Acc

ura

cy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

Video-vs-video all

CO-DAS (proposed)

CO-TA-DAS (proposed)

CO-DAS (logistic)

CO-TA-DAS (logistic)

CO-DAS (tommasi)

CO-TA-DAS (tommasi)

Added test samples ()

Figure 13 Comparison of the performances of the types ofconfidence measures for the Co-Training feedback loop Plot ofthe average accuracy as a function of the amount of Co-Trainingfeedback (video-versus-video setup ALL pairs)

Co-Training methods The proposed methods CO-DAS andCO-TA-DAS perform the best by successfully leveraging twovisual features temporal continuity of the video andworkingin semi-supervised framework

417 Effect of the Type of Confidence Measures Figure 13represents the effect of the type of confidence measure usedin Co-Training on the performances for different amountsof feedback in the Co-Training phase The performances forthe Ruping approach is not reported as it was much lowerthan the other approaches A video-versus-video setup wasused with the results averaged over all sequence pairs Thethree approaches produce a similar behavior with respect tothe amount of feedback first an increase of the performanceswhen mostly correct estimates are added to the trainingset then a decrease when more incorrect estimates are alsoconsidered When coupled with temporal accumulation theproposed confidence measure has a slightly better accuracyfor moderate feedback It was therefore used for the rest ofexperiments

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from increased visual variability changing from location to location, class imbalance due to room visit irregularities, poor lighting conditions, missing or low quality training data, and the large amount of data to be processed.

4.2.1. Description of Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours.


Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second, with a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images of the 6 topological locations are depicted in Figure 14.

Most locations are represented with one short bootstrap sequence, for which manual annotation is provided, briefly depicting the available topological locations. One or two longer videos for the same location depict the displacements and activities of a person in an ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6,400 images), while the unlabeled evaluation videos are 20 minutes long (36,000 images) on average, the image counts referring to the native 30 fps rate. A few locations are not given a labeled bootstrap video; therefore, a small randomly annotated portion of the evaluation videos, covering every topological location, is provided instead.

The topological location names in all the videos have been harmonized such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.
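The cross-validation used to select hyperparameters can be sketched as below; the choice of a linear SVM, the grid over C, and the use of scikit-learn are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def select_svm_C(X_boot, y_boot, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """5-fold cross-validation over the SVM regularization parameter C.

    X_boot, y_boot: features and labels of the annotated bootstrap frames."""
    search = GridSearchCV(LinearSVC(), param_grid={"C": list(grid)}, cv=5)
    search.fit(X_boot, y_boot)
    return search.best_params_["C"]

# Example with random placeholder data (2 classes, 100 frames, 50-D features)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 50)), rng.integers(0, 2, size=100)
print(select_svm_C(X, y))
```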

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for the single and multiple feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset, average accuracy of the single feature approaches.

Feature / approach    SVM     SVM-TA
BOVW                  0.49    0.52
CRFH                  0.48    0.53
SPH                   0.47    0.49


Baseline: Single Feature Classifier Performance. As shown in Table 1, single feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor is less performant than its simpler BOVW variant. A possible explanation of this phenomenon is that, due to the low amount of supervision, a classifier trained on high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single feature methods. We observe the same order of improvement for the multiple feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single feature baselines is consistent across the whole corpus and all methods.

Multiple Feature Exploitation. Comparing the MKL and DAS methods for multiple feature fusion shows an advantage in favor of the late fusion method when compared to single feature methods.


Table 2: IMMED dataset, average accuracy of the multiple feature approaches.

Feature / approach    MKL     MKL-TA    DAS     DAS-TA    CO-DAS    CO-TA-DAS
BOVW-SPH              0.48    0.50      0.51    0.56      0.50      0.53
BOVW-CRFH             0.50    0.54      0.51    0.56      0.54      0.58
SPH-CRFH              0.48    0.51      0.50    0.54      0.54      0.57
BOVW-SPH-CRFH         0.48    0.51      0.51    0.56      —         —

We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single feature methods in this low supervision setup, bringing up to 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Therefore, multiple feature information is best leveraged in this context by selecting late classifier fusion.
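A minimal sketch of late classifier fusion in the spirit of DAS is given below: the per-class outputs of each single-feature classifier are accumulated with weights before taking the decision. The equal default weights are an assumption; in practice the weights would be tuned, for example by the cross-validation mentioned above.

```python
import numpy as np

def das_late_fusion(score_list, weights=None):
    """Discriminative-accumulation-style late fusion.

    score_list: list of arrays, each of shape (n_frames, n_classes), holding the
    per-class decision values of one single-feature classifier.
    Returns the fused label per frame.
    """
    if weights is None:
        weights = np.ones(len(score_list)) / len(score_list)   # equal weights
    fused = sum(w * s for w, s in zip(weights, score_list))    # weighted sum of scores
    return fused.argmax(axis=1)
```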

Leveraging the Unlabeled Data. Exploitation of unlabeled data in the learning process is important when facing the low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single feature methods and improves over DAS by up to 4% on all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged visual feature complementarity and by the single feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal video continuity, and the exploitation of unlabeled data.

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then applied to the more complex and larger scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured. The usage of semi-supervised

learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system on challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.

[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.

[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.

[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.

[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.

[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, ACM, October 2010.

[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.

[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.

[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.

[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.

[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.

[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.

[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, ACM, September 2009.

[15] M. A. Pérez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.

[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.

[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.

[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.

[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, ACM, September 2007.

[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.

[21] C. Ó Conaire, M. Blighe, and N. O'Connor, "Sensecam image localisation using hierarchical surf trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.

[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.

[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.

[24] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.

[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.

[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.

[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.

[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.

[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.

[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.

[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.

[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.

[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.

[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.

[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.

[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.

[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.

[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.

[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, Laboratoire Jean Kuntzmann, LEAR, INRIA Grenoble, June 2010.

[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.

[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.

[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.

[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.

[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.

[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.

[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II578–II585, July 2004.

[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.

[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.

[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.

[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.

[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.

[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.

[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[56] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.

[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.

[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.

[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.

[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.

[62] W. Tong, T. Yang, and R. Jin, "Co-training for Large Scale Image Classification: An Online Approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.

[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.

[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.

[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.

[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.

[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.

[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.

[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.

[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.

[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.

[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.

[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.

[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.

[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.

[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.

[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.



Advances in Multimedia 17A

ccu

racy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

BOVW-SVM

CRFH-SVM

DAS

TA-DAS

BOVW-CO

CRFH-CO

CO-DAS

CO-TA-DAS

Effect of Co-Training feedback average train rate = 1

Added test samples ()

Figure 12 Comparison of the global performances for single feature(BOVW-SVM CRFH-SVM) multiple feature late fusion (DAS)and the proposed extensions using temporal accumulation (TA-DAS) and Co-Training (CO-DAS CO-TA-DAS) The evolution ofthe performances of the individual inner classifiers of the Co-Training module (BOVW-CO CRFH-CO) is also shown Plotof the average accuracy as a function of the amount of Co-Training feedback The approaches without Co-Training appear asthe limiting case with 0 of feedback (IDOL2 dataset video-versus-video setup ALL pairs)

Discussion on the Results The comparison of the methods invideo-versus-video setup is showed in Figure 12 The perfor-mances are compared showing the influence of the amountof samples used for the Co-Training algorithm feedback loopThe baseline single feature methods perform around equallyby delivering approximately 50 of correct classificationThestandard DAS fusion boosts the performance by additional10This confirms the complementarity of the selected visualfeatures in this test setup

The individual classifiers trained in one Co-Trainingiteration exceed the baseline and are comparable to per-formance delivered by standard DAS fusion method Theimprovement is due to the feedback of unlabeled patternsin the iterative learning procedure The CO-DAS methodsuccessfully leverages both improvements while the CO-TA-DAS additionally takes advantage of the temporal continuityof the video (a temporal window of size 120591 = 50 was used)

Confidence Measure On this dataset a good illustrationconcerning the amount of high confidence is showed inFigure 12 It is clear that only a portion of the test set datacan be useful for classifier re-training This is governed bytwo major factorsmdashquality of the data and robustness of theconfidence measure For this dataset the best portion of highconfidence estimates is around 20ndash50 depending on themethod The best performing TA-CO-TA-DAS method canafford to annotate up to 50 of testing data for the nextlearning iteration

Conclusion The results show as well that all single featurebaselines are outperformed by standard fusion and simple

Acc

ura

cy

0

01

02

03

04

05

06

07

08

09

1

0 01 02 03 04 05 06 07 08 09 1

Video-vs-video all

CO-DAS (proposed)

CO-TA-DAS (proposed)

CO-DAS (logistic)

CO-TA-DAS (logistic)

CO-DAS (tommasi)

CO-TA-DAS (tommasi)

Added test samples ()

Figure 13 Comparison of the performances of the types ofconfidence measures for the Co-Training feedback loop Plot ofthe average accuracy as a function of the amount of Co-Trainingfeedback (video-versus-video setup ALL pairs)

Co-Training methods The proposed methods CO-DAS andCO-TA-DAS perform the best by successfully leveraging twovisual features temporal continuity of the video andworkingin semi-supervised framework

417 Effect of the Type of Confidence Measures Figure 13represents the effect of the type of confidence measure usedin Co-Training on the performances for different amountsof feedback in the Co-Training phase The performances forthe Ruping approach is not reported as it was much lowerthan the other approaches A video-versus-video setup wasused with the results averaged over all sequence pairs Thethree approaches produce a similar behavior with respect tothe amount of feedback first an increase of the performanceswhen mostly correct estimates are added to the trainingset then a decrease when more incorrect estimates are alsoconsidered When coupled with temporal accumulation theproposed confidence measure has a slightly better accuracyfor moderate feedback It was therefore used for the rest ofexperiments

42 Results on IMMED Compared to the IDOL2 databasethe IMMED database poses novel challenges The difficultiesarise from increased visual variability changing from locationto location class imbalance due to room visit irregularitiespoor lighting conditionsmissing or low quality training dataand the large amount of data to be processed

421 Description of Dataset The IMMED database consistsof 27 video sequences recorded in 14 different locationsin real-world conditions The total amount of recordings

18 Advances in Multimedia

(a) (b) (c)

(d) (e) (f)

Figure 14 IMMED sample images (a) bathroom (b) bedroom (c) kitchen (d) living room (e) outside and (f) other

exceeds 10 hours All recordings were performed using aportable GoPro video camera at a frame rate of 30 frames persecond with the frame of the resolution of 1280 times 960 pixelsFor practical reasons we downsampled the frame rate to 5frames per second Sample images depicting the 6 topologicallocations are depicted in Figure 14

Most locations are represented with one short bootstrapsequence depicting briefly the available topological locationsfor which manual annotation is provided One or twolonger videos for the same location depict displacements andactivities of a person in its ecological and unconstrainedenvironment

Across the whole corpus the bootstrap video is typi-cally 35 minutes long (6400 images) while the unlabeledevaluation videos are 20 minutes long (36000 images) inaverage A few locations are not given a labeled bootstrapvideo therefore a small randomly annotated portion ofthe evaluation videos covering every topological location isprovided instead

The topological location names in all the video havebeen equalized such that every frame could carry one of thefollowing labels ldquobathroomrdquo ldquobedroomrdquo ldquokitchenrdquo ldquolivingroomrdquo ldquooutsiderdquo and ldquootherrdquo

422 Comparison of Global Performances

Setup We performed automatic image-based place recogni-tion in realistic video-versus-video setup for each of the 14locations To learn optimal parameter values for employedmethods we used the standard cross-validation procedure inall experiments

Due to a large number of locations we report here theglobal performances averaged for all locationsThe summary

Table 1 IMMED dataset average accuracy of the single featureapproaches

Featureapproach SVM SVM-TABOVW 049 052CRFH 048 053SPH 047 049

of the results for single and multiple feature methods isprovided in Tables 1 and 2 respectively

Baseline-Single Feature Classifier Performance As show inTable 1 single feature methods provide relatively low placerecognition performance Surprisingly potentially the morediscriminant descriptor SPH is less performant than itsmore simple BOVW variant A possible explanation to thisphenomenamay be that due to the low amount of supervisionand a classifier trained on high dimensional SPH featuressimply overfits

Temporal Constraints An interesting gain in performance isobtained if temporal information is enforced On the wholecorpus this performance increase ranges from 2 to 4 inglobal classification accuracy for all single feature methodsWe observe the same order of improvement in multiplefeature methods as MKL-TA for early feature fusion andDAS-TA for late classifier fusion This performance increaseover the single feature baselines is constant for the wholecorpus and all methods

Multiple Feature Exploitation Comparing MKL and DASmethods for multiple feature fusion shows interest in favor

Advances in Multimedia 19

Table 2 IMMED dataset average accuracy of the multiple feature approaches

Featureapproach MKL MKL-TA DAS DAS-TA CO-DAS CO-TA-DASBOVW-SPH 048 050 051 056 050 053BOVW-CRFH 050 054 051 056 054 058SPH-CRFH 048 051 050 054 054 057BOVW-SPH-CRFH 048 051 051 056 mdash mdash

of the late fusion method when compared to single fea-ture methods We observe little performance improvementwhen using MKL which can be explained by increaseddimensionality space and thus more risk of overfitting Latefusion strategy is more advantageous compared to respectivesingle feature methods in this low supervision setup bybringing up to 4 with no temporal accumulation and up to5 with temporal accumulation Therefore multiple featureinformation is best leveraged in this context by selecting lateclassifier fusion

Leveraging the Unlabeled Data Exploitation of unlabeled datain the learning process is important when it comes to lowamounts of supervision and great visual variability encoun-tered in challenging video sequences The first proposedmethod termed CO-DAS aims to leverage two visual featureswhile operating in semi-supervised setup It clearly outper-forms all single feature methods and improves on all butBOVW-SPH feature pair compared to DAS by up to 4 Weexplain this performance increase by successfully leveragedvisual feature complementarity and improved single featureclassifiers via Co-Training procedure The second methodCO-TA-DAS incorporates temporal continuity a priori andboosts performances by another 3-4 in global accuracyThis method effectively combines all benefits brought byindividual features temporal video continuity and takingadvantage of unlabeled data

5 Conclusion

In this work we have addressed the challenging problemof indoor place recognition from wearable video record-ings Our proposition was designed by combining severalapproaches in order to deal with issues such as low super-vision and large visual variability encountered in videos fromamobile cameraTheir usefulness and complementarity wereverified initially on a public video sequence database IDOL2then applied to the more complex and larger scale corpus ofvideos collected for the IMMED project which contains real-world video lifelogs depicting actual activities of patients athome

The study revealed several elements that were usefulfor successful recognition in such video corpuses First theusage of multiple visual features was shown to improve thediscrimination power in this context Second the temporalcontinuity of a video is a strong additional cue whichimproved the overall quality of indexing process in mostcasesThird real-world video recordings are rarely annotatedmanually to an extent where most visual variability presentwithin a location is captured Usage of semi-supervised

learning algorithms exploiting labeled as well as unlabeleddata helped to address this problem The proposed systemintegrates all acquired knowledge in a framework which iscomputationally tractable yet takes into account the varioussources of information

We have addressed the fusion of multiple heterogeneoussources of information for place recognition from com-plex videos and demonstrated its utility on the challengingIMMED dataset recorded in real-world conditions Themain focus of this work was to leverage the unlabeleddata thanks to a semi-supervised strategy Additional workcould be done in selecting more discriminant visual featuresfor specific applications and more tight integration of thetemporal information in the learning process Neverthelessthe obtained results confirm the applicability of the proposedplace classification system on challenging visual data fromwearable videos

Acknowledgments

This research has received funding from Agence Nationalede la Recherche under Reference ANR-09-BLAN-0165-02(IMMED project) and the European Communityrsquos Sev-enth Framework Programme (FP72007ndash2013) under GrantAgreement 288199 (DemCare project)

References

[1] A Doherty and A F Smeaton ldquoAutomatically segmentinglifelog data into eventsrdquo in Proceedings of the 9th InternationalWorkshop on Image Analysis for Multimedia Interactive Services(WIAMIS rsquo08) pp 20ndash23 May 2008

[2] E Berry N Kapur L Williams et al ldquoThe use of a wearablecamera SenseCam as a pictorial diary to improve autobi-ographical memory in a patient with limbic encephalitis apreliminary reportrdquo Neuropsychological Rehabilitation vol 17no 4-5 pp 582ndash601 2007

[3] S Hodges L Williams E Berry et al ldquoSenseCam a retro-spective memory aidrdquo in Proceedings of the 8th InternationalConference on Ubiquitous Computing (Ubicomp rsquo06) pp 177ndash193 2006

[4] R Megret D Szolgay J Benois-Pineau et al ldquoIndexing ofwearable video IMMED and SenseCAMprojectsrdquo inWorkshopon Semantic Multimodal Analysis of Digital Media November2008

[5] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision vol 1 pp 273ndash280 October 2003

20 Advances in Multimedia

[6] A Quattoni and A Torralba ldquoRecognizing indoor scenesrdquo inProceedings of IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo09) pp 413ndash420 June 2009

[7] R Megret V Dovgalecs H Wannous et al ldquoThe IMMEDproject wearable video monitoring of people with age demen-tiardquo in Proceedings of the International Conference on Multi-media (MM rsquo10) pp 1299ndash1302 ACM Request PermissionssOctober 2010

[8] S Karaman J Benois-Pineau R Megret V Dovgalecs J-FDartigues and Y Gaestel ldquoHuman daily activities indexingin videos from wearable cameras for monitoring of patientswith dementia diseasesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 4113ndash4116August 2010

[9] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17thInternational Conference on Pattern Recognition (ICPR rsquo04) pp32ndash36 August 2004

[10] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehaviorrecognition via sparse spatio-temporal featuresrdquo in Proceedingsof the 2nd Joint IEEE International Workshop on Visual Surveil-lance and Performance Evaluation of Tracking and Surveillancepp 65ndash72 October 2005

[11] L Ballan M Bertini A del Bimbo and G Serra ldquoVideoevent classification using bag of words and string kernelsrdquoin Proceedings of the 15th International Conference on ImageAnalysis and Processing (ICIAP rsquo09) pp 170ndash178 2009

[12] D I Kosmopoulos N D Doulamis and A S VoulodimosldquoBayesian filter based behavior recognition in workflows allow-ing for user feedbackrdquo Computer Vision and Image Understand-ing vol 116 no 3 pp 422ndash434 2012

[13] M Stikic D Larlus S Ebert and B Schiele ldquoWeakly supervisedrecognition of daily life activities with wearable sensorsrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol33 no 12 pp 2521ndash2537 2011

[14] D H Nguyen G Marcu G R Hayes et al ldquoEncounteringSenseCam personal recording technologies in everyday liferdquo inProceedings of the 11th International Conference on UbiquitousComputing (Ubicomp rsquo09) pp 165ndash174 ACM Request Permis-sions September 2009

[15] M A Perez-QuiNones S Yang B Congleton G Luc and E AFox ldquoDemonstrating the use of a SenseCam in two domainsrdquo inProceedings of the 6th ACMIEEE-CS Joint Conference on DigitalLibraries (JCDL rsquo06) p 376 June 2006

[16] S Karaman J Benois-Pineau V Dovgalecs et al HierarchicalHidden Markov Model in Detecting Activities of Daily Livingin Wearable Videos for Studies of Dementia 2011

[17] J Pinquier S Karaman L Letoupin et al ldquoStrategies formultiple feature fusion with Hierarchical HMM applicationto activity recognition from wearable audiovisual sensorsrdquoin Proceedings of the 21 International Conference on PatternRecognition pp 1ndash4 July 2012

[18] N SebeM S Lew X Zhou T S Huang and EM Bakker ldquoThestate of the art in image and video retrievalrdquo in Proceedings of the2nd International Conference on Image and Video Retrieval pp1ndash7 May 2003

[19] S-F Chang D Ellis W Jiang et al ldquoLarge-scale multimodalsemantic concept detection for consumer videordquo in Proceed-ings of the International Workshop on Multimedia InformationRetrieva (MIR rsquo07) pp 255ndash264 ACM Request PermissionsSeptember 2007

[20] J Kosecka F Li and X Yang ldquoGlobal localization and relativepositioning based on scale-invariant keypointsrdquo Robotics andAutonomous Systems vol 52 no 1 pp 27ndash38 2005

[21] C O Conaire M Blighe and N OrsquoConnor ldquoSensecam imagelocalisation using hierarchical surf treesrdquo in Proceedings of the15th InternationalMultimediaModeling Conference (MMM rsquo09)p 15 Sophia-Antipolis France January 2009

[22] J Kosecka L Zhou P Barber and Z Duric ldquoQualitative imagebased localization in indoors environmentsrdquo in Proceedings ofIEEE Computer Society Conference on Computer Vision andPattern Recognition vol 2 pp II-3ndashII-8 June 2003

[23] Z Zovkovic O Booij and B Krose ldquoFrom images to roomsrdquoRobotics and Autonomous Systems vol 55 no 5 pp 411ndash4182007

[24] L Fei-Fei and P Perona ldquoA bayesian hierarchical model forlearning natural scene categoriesrdquo in Proceedings of IEEEComputer Society Conference on Computer Vision and PatternRecognition (CVPR rsquo05) pp 524ndash531 June 2005

[25] D Nister and H Stewenius ldquoScalable recognition with a vocab-ulary treerdquo in Proceedings of IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo06) vol 2pp 2161ndash2168 2006

[26] O Linde andT Lindeberg ldquoObject recognition using composedreceptive field histograms of higher dimensionalityrdquo in Proceed-ings of the 17th International Conference on Pattern Recognition(ICPR rsquo04) vol 2 pp 1ndash6 August 2004

[27] S Lazebnik C Schmid and J Ponce ldquoBeyond bags of featuresspatial pyramid matching for recognizing natural scene cate-goriesrdquo in Proceedings of IEEE Computer Society Conference onComputer Vision and Pattern Recognition (CVPR rsquo06) vol 2 pp2169ndash2178 2006

[28] A Bosch and A Zisserman ldquoScene classification via pLSArdquo inProceedings of the 9th European Conference on Computer Vision(ECCV rsquo06) May 2006

[29] J Sivic and A Zisserman ldquoVideo google a text retrievalapproach to object matching in videosrdquo in Proceedings of the 9thIEEE International Conference On Computer Vision pp 1470ndash1477 October 2003

[30] J Knopp Image Based Localization [PhD thesis] Chech Tech-nical University in Prague Faculty of Electrical EngineeringPrague Czech Republic 2009

[31] M W M G Dissanayake P Newman S Clark H F Durrant-Whyte and M Csorba ldquoA solution to the simultaneous local-ization and map building (SLAM) problemrdquo IEEE Transactionson Robotics and Automation vol 17 no 3 pp 229ndash241 2001

[32] L M Paz P Jensfelt J D Tardos and J Neira ldquoEKF SLAMupdates inO(n) with divide and conquer SLAMrdquo in Proceedingsof IEEE International Conference on Robotics and Automation(ICRA rsquo07) pp 1657ndash1663 April 2007

[33] J Wu and J M Rehg ldquoCENTRIST a visual descriptor forscene categorizationrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 33 no 8 pp 1489ndash1501 2011

[34] H Bulthoff and A Yuille ldquoBayesian models for seeing shapesand depthrdquo Tech Rep 90-11 Harvard Robotics Laboratory1990

[35] P K Atrey M Anwar Hossain A El Saddik and M SKankanhalli ldquoMultimodal fusion for multimedia analysis asurveyrdquoMultimedia Systems vol 16 no 6 pp 345ndash379 2010

[36] A Rakotomamonjy F R Bach S Canu and Y GrandvaletldquoSimpleMKLrdquo The Journal of Machine Learning Research vol9 pp 2491ndash2521 2008

Advances in Multimedia 21

[37] S Nakajima A Binder C Muller et al ldquoMultiple kernellearning for object classificationrdquo inWorkshop on Information-based Induction Sciences 2009

[38] A Vedaldi V Gulshan M Varma and A Zisserman ldquoMultiplekernels for object detectionrdquo in Proceedings of the 12th Interna-tional Conference on Computer Vision (ICCV rsquo09) pp 606ndash613October 2009

[39] J Yang Y Li Y Tian L Duan and W Gao ldquoGroup-sensitivemultiple kernel learning for object categorizationrdquo in Proceed-ings of the 12th International Conference on Computer Vision(ICCV rsquo09) pp 436ndash443 October 2009

[40] M Guillaumin J Verbeek and C Schmid ldquoMultimodal semi-supervised learning for image classificationrdquo in Proceedings ofIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo10) pp 902ndash909 Laboratoire Jean Kuntzmann LEARINRIA Grenoble June 2010

[41] J Yang Y Li Y Tian L Duan and W Gao ldquoMultiple kernelactive learning for image classificationrdquo in Proceedings of IEEEInternational Conference on Multimedia and Expo (ICME rsquo09)pp 550ndash553 July 2009

[42] A Abdullah R C Veltkamp and M A Wiering ldquoSpatialpyramids and two-layer stacking SVM classifiers for imagecategorization a comparative studyrdquo in Proceedings of theInternational Joint Conference on Neural Networks (IJCNN rsquo09)pp 5ndash12 June 2009

[43] J Kittler M Hatef R P W Duin and J Matas ldquoOn combiningclassifiersrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 20 no 3 pp 226ndash239 1998

[44] L Ilieva Kuncheva Combining Pattern Classifiers Methods andAlgorithms Wiley-Interscience 2004

[45] A Uhl and PWild ldquoParallel versus serial classifier combinationfor multibiometric hand-based identificationrdquo in Proceedings ofthe 3rd International Conference on Advances in Biometrics (ICBrsquo09) vol 5558 pp 950ndash959 2009

[46] W Nayer Feature based architecture for decision fusion [PhDthesis] 2003

[47] M-E Nilsback and B Caputo ldquoCue integration through dis-criminative accumulationrdquo in Proceedings of IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition(CVPR rsquo04) vol 2 pp II578ndashII585 July 2004

[48] A Pronobis and B Caputo ldquoConfidence-based cue integrationfor visual place recognitionrdquo inProceedings of IEEERSJ Interna-tional Conference on Intelligent Robots and Systems (IROS rsquo07)pp 2394ndash2401 October-November 2007

[49] A Pronobis O Martinez Mozos and B Caputo ldquoSVM-baseddiscriminative accumulation scheme for place recognitionrdquo inProceedings of IEEE International Conference on Robotics andAutomation (ICRA rsquo08) pp 522ndash529 May 2008

[50] F Lu X Yang W Lin R Zhang R Zhang and S YuldquoImage classification with multiple feature channelsrdquo OpticalEngineering vol 50 no 5 Article ID 057210 2011

[51] P Gehler and S Nowozin ldquoOn feature combination for mul-ticlass object classificationrdquo in Proceedings of the 12th Interna-tional Conference on Computer Vision pp 221ndash228 October2009

[52] X Zhu ldquoSemi-supervised learning literature surveyrdquo TechRep 1530 Department of Computer Sciences University ofWinsconsin Madison Wis USA 2008

[53] X Zhu and A B Goldberg Introduction to Semi-SupervisedLearning Morgan and Claypool Publishers 2009

[54] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing MIT Press Cambridge Mass USA 2006

[55] M Belkin P Niyogi and V Sindhwani ldquoManifold regular-ization a geometric framework for learning from labeled andunlabeled examplesrdquoThe Journal of Machine Learning Researchvol 7 pp 2399ndash2434 2006

[56] UVonLuxburg ldquoA tutorial on spectral clusteringrdquo Statistics andComputing vol 17 no 4 pp 395ndash416 2007

[57] D Zhou O Bousquet T Navin Lal JWeston and B ScholkopfldquoLearning with local and global consistencyrdquo Advances inNeural Information Processing Systems vol 16 pp 321ndash3282004

[58] S Melacci and M Belkin ldquoLaplacian support vector machinestrained in the primalrdquo The Journal of Machine LearningResearch vol 12 pp 1149ndash1184 2011

[59] B Nadler and N Srebro ldquoSemi-supervised learning with thegraph laplacian the limit of infinite unlabelled datardquo in Pro-ceedings of the 23rd Annual Conference on Neural InformationProcessing Systems (NIPS rsquo09) 2009

Figure 12: Effect of Co-Training feedback (average train rate = 1). Comparison of the global performances for the single-feature (BOVW-SVM, CRFH-SVM), multiple-feature late fusion (DAS), and proposed temporal accumulation (TA-DAS) and Co-Training (CO-DAS, CO-TA-DAS) approaches. The evolution of the performances of the individual inner classifiers of the Co-Training module (BOVW-CO, CRFH-CO) is also shown. The plot gives the average accuracy as a function of the amount of Co-Training feedback (percentage of added test samples); the approaches without Co-Training appear as the limiting case with 0% of feedback (IDOL2 dataset, video-versus-video setup, all pairs).

Discussion on the Results. The comparison of the methods in the video-versus-video setup is shown in Figure 12. The performances are compared while varying the amount of samples used for the Co-Training feedback loop. The baseline single-feature methods perform roughly equally, delivering approximately 50% correct classification. The standard DAS fusion boosts the performance by an additional 10%. This confirms the complementarity of the selected visual features in this test setup.

The individual classifiers trained in one Co-Training iteration exceed the baseline and are comparable to the performance delivered by the standard DAS fusion method. The improvement is due to the feedback of unlabeled patterns in the iterative learning procedure. The CO-DAS method successfully leverages both improvements, while CO-TA-DAS additionally takes advantage of the temporal continuity of the video (a temporal window of size τ = 50 was used).
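The temporal accumulation used in TA-DAS and CO-TA-DAS can be pictured as a causal averaging of the per-frame class scores over a window of τ frames before taking the decision. The following minimal Python sketch is given for illustration only; the function name, the uniform averaging, and the causal window are our assumptions and not necessarily the exact implementation used in this work.

import numpy as np

def temporal_accumulation(scores, tau=50):
    # scores: array of shape (n_frames, n_classes) holding per-frame class
    # scores (e.g., SVM outputs); each frame is replaced by the mean of the
    # scores over the last tau frames, and the label is the argmax of the row.
    scores = np.asarray(scores, dtype=float)
    accumulated = np.empty_like(scores)
    for t in range(scores.shape[0]):
        start = max(0, t - tau + 1)
        accumulated[t] = scores[start:t + 1].mean(axis=0)
    return accumulated

# Example: labels = temporal_accumulation(frame_scores, tau=50).argmax(axis=1)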

Confidence Measure. On this dataset, a good illustration of the amount of high-confidence estimates is given in Figure 12. It is clear that only a portion of the test set data can be useful for classifier retraining. This is governed by two major factors: the quality of the data and the robustness of the confidence measure. For this dataset, the best portion of high-confidence estimates is around 20–50%, depending on the method. The best performing CO-TA-DAS method can afford to annotate up to 50% of the testing data for the next learning iteration.
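In practice, the feedback loop discussed above amounts to ranking the unlabeled test frames by confidence and keeping only the most confident ones for retraining. The sketch below is a simplified illustration; the variable names and the fixed feedback_ratio are assumptions of ours, not the exact selection rule of the system.

import numpy as np

def select_feedback_samples(confidences, pseudo_labels, feedback_ratio=0.3):
    # confidences:   (n,) confidence of each prediction on the unlabeled set
    # pseudo_labels: (n,) labels predicted by the other view's classifier
    # Returns the indices and pseudo-labels of the top feedback_ratio fraction,
    # i.e., the samples that are fed back into the next training iteration.
    n_keep = int(np.ceil(feedback_ratio * len(confidences)))
    order = np.argsort(confidences)[::-1]  # most confident first
    keep = order[:n_keep]
    return keep, np.asarray(pseudo_labels)[keep]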

Conclusion. The results also show that all single-feature baselines are outperformed by standard fusion and by simple Co-Training methods. The proposed CO-DAS and CO-TA-DAS methods perform best by successfully leveraging the two visual features and the temporal continuity of the video while working in a semi-supervised framework.

Figure 13: Comparison of the performances of the different types of confidence measures (proposed, logistic, and Tommasi variants of CO-DAS and CO-TA-DAS) for the Co-Training feedback loop. Plot of the average accuracy as a function of the amount of Co-Training feedback, that is, the percentage of added test samples (video-versus-video setup, all pairs).

4.1.7. Effect of the Type of Confidence Measures. Figure 13 represents the effect of the type of confidence measure used in Co-Training on the performances, for different amounts of feedback in the Co-Training phase. The performances of the Ruping approach are not reported, as they were much lower than those of the other approaches. A video-versus-video setup was used, with the results averaged over all sequence pairs. The three approaches produce a similar behavior with respect to the amount of feedback: first an increase of the performances when mostly correct estimates are added to the training set, then a decrease when more incorrect estimates are also considered. When coupled with temporal accumulation, the proposed confidence measure has a slightly better accuracy for moderate feedback. It was therefore used for the rest of the experiments.
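As an informal illustration of what such confidence measures look like (these are generic formulations, not the exact expressions of the proposed measure or of the logistic and Tommasi [77] variants evaluated here), a per-frame confidence can be derived from SVM decision values either by a logistic squashing of the winning score or by the gap between the two highest class scores:

import numpy as np

def logistic_confidence(decision_values, a=1.0, b=0.0):
    # Logistic (Platt-style) mapping of the winning decision value for each
    # frame; the parameters a and b would normally be fitted on held-out data.
    top = np.max(decision_values, axis=1)
    return 1.0 / (1.0 + np.exp(-(a * top + b)))

def margin_confidence(decision_values):
    # Gap between the best and second-best class scores for each frame.
    s = np.sort(decision_values, axis=1)
    return s[:, -1] - s[:, -2]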

4.2. Results on IMMED. Compared to the IDOL2 database, the IMMED database poses novel challenges. The difficulties arise from the increased visual variability, which changes from location to location, class imbalance due to irregular room visits, poor lighting conditions, missing or low-quality training data, and the large amount of data to be processed.

4.2.1. Description of the Dataset. The IMMED database consists of 27 video sequences recorded in 14 different locations in real-world conditions. The total amount of recordings exceeds 10 hours. All recordings were performed using a portable GoPro video camera at a frame rate of 30 frames per second and a frame resolution of 1280 × 960 pixels. For practical reasons, we downsampled the frame rate to 5 frames per second. Sample images depicting the 6 topological locations are shown in Figure 14.

Figure 14: IMMED sample images: (a) bathroom, (b) bedroom, (c) kitchen, (d) living room, (e) outside, and (f) other.

Most locations are represented by one short bootstrap sequence, briefly depicting the available topological locations, for which manual annotation is provided. One or two longer videos for the same location depict the displacements and activities of a person in their ecological and unconstrained environment.

Across the whole corpus, the bootstrap video is typically 3.5 minutes long (6,400 images), while the unlabeled evaluation videos are 20 minutes long (36,000 images) on average. A few locations are not given a labeled bootstrap video; for these, a small randomly annotated portion of the evaluation videos covering every topological location is provided instead.

The topological location names in all the videos have been unified such that every frame carries one of the following labels: "bathroom", "bedroom", "kitchen", "living room", "outside", and "other".

4.2.2. Comparison of Global Performances

Setup. We performed automatic image-based place recognition in a realistic video-versus-video setup for each of the 14 locations. To learn optimal parameter values for the employed methods, we used a standard cross-validation procedure in all experiments.
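The cross-validation step is standard; purely as an illustration (scikit-learn is our choice here and the parameter grid is arbitrary, so this is not necessarily the toolbox or grid used in this work), the regularization parameter of a linear SVM could be selected as follows.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

def tune_linear_svm(features, labels):
    # 5-fold cross-validation over the regularization parameter C of a linear SVM.
    search = GridSearchCV(LinearSVC(),
                          param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                          cv=StratifiedKFold(n_splits=5))
    search.fit(features, labels)
    return search.best_estimator_, search.best_params_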

Due to the large number of locations, we report here the global performances averaged over all locations. The summary of the results for single- and multiple-feature methods is provided in Tables 1 and 2, respectively.

Table 1: IMMED dataset: average accuracy of the single-feature approaches.

Feature/approach    SVM     SVM-TA
BOVW                0.49    0.52
CRFH                0.48    0.53
SPH                 0.47    0.49

Baseline: Single-Feature Classifier Performance. As shown in Table 1, single-feature methods provide relatively low place recognition performance. Surprisingly, the potentially more discriminant SPH descriptor performs worse than its simpler BOVW variant. A possible explanation for this phenomenon is that, due to the low amount of supervision, a classifier trained on the high-dimensional SPH features simply overfits.

Temporal Constraints. An interesting gain in performance is obtained when temporal information is enforced. On the whole corpus, this performance increase ranges from 2% to 4% in global classification accuracy for all single-feature methods. We observe the same order of improvement for the multiple-feature methods, namely MKL-TA for early feature fusion and DAS-TA for late classifier fusion. This performance increase over the single-feature baselines is consistent across the whole corpus and all methods.

Multiple-Feature Exploitation. Comparing the MKL and DAS methods for multiple-feature fusion shows an advantage for the late fusion method over the single-feature methods. We observe little performance improvement when using MKL, which can be explained by the increased dimensionality of the feature space and thus a higher risk of overfitting. The late fusion strategy is more advantageous than the respective single-feature methods in this low-supervision setup, bringing up to a 4% improvement with no temporal accumulation and up to 5% with temporal accumulation. Multiple-feature information is therefore best leveraged in this context by late classifier fusion.

Table 2: IMMED dataset: average accuracy of the multiple-feature approaches.

Feature/approach    MKL     MKL-TA    DAS     DAS-TA    CO-DAS    CO-TA-DAS
BOVW-SPH            0.48    0.50      0.51    0.56      0.50      0.53
BOVW-CRFH           0.50    0.54      0.51    0.56      0.54      0.58
SPH-CRFH            0.48    0.51      0.50    0.54      0.54      0.57
BOVW-SPH-CRFH       0.48    0.51      0.51    0.56      –         –

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important given the low amount of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single-feature methods and improves over DAS by up to 4% on all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged visual feature complementarity and by the single-feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, additionally incorporates the temporal continuity prior and boosts performances by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of the video, and the use of unlabeled data.
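For completeness, the late fusion on which CO-DAS and CO-TA-DAS build can be sketched as a weighted sum of the per-class outputs of the two single-feature classifiers (see [47–49] for the discriminative accumulation scheme itself; the equal weights below are an illustrative assumption of ours).

import numpy as np

def das_fusion(scores_view1, scores_view2, weights=(0.5, 0.5)):
    # Each input has shape (n_frames, n_classes), e.g., the BOVW and CRFH
    # classifier outputs; the fused place label of a frame is the argmax of
    # the weighted sum of the two score rows.
    return (weights[0] * np.asarray(scores_view1)
            + weights[1] * np.asarray(scores_view2))

# fused = das_fusion(bovw_scores, crfh_scores)
# labels = temporal_accumulation(fused, tau=50).argmax(axis=1)  # CO-TA-DAS-style decision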

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then on the more complex and larger-scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable yet takes into account the various sources of information.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage the unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system to challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (DemCare project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.

[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, September 2009.
[15] M. A. Perez-Quinones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "SenseCam image localisation using hierarchical SURF trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.

[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-Based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martinez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, October 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of the SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," in Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, May 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 18: Research Article Multiple Feature Fusion Based on Co ...downloads.hindawi.com/journals/am/2013/175064.pdf · and wearable sensors.... Monitoring Using Ambient Sensors. Activity recogni-tion

18 Advances in Multimedia

(a) (b) (c)

(d) (e) (f)

Figure 14 IMMED sample images (a) bathroom (b) bedroom (c) kitchen (d) living room (e) outside and (f) other

exceeds 10 hours All recordings were performed using aportable GoPro video camera at a frame rate of 30 frames persecond with the frame of the resolution of 1280 times 960 pixelsFor practical reasons we downsampled the frame rate to 5frames per second Sample images depicting the 6 topologicallocations are depicted in Figure 14

Most locations are represented with one short bootstrapsequence depicting briefly the available topological locationsfor which manual annotation is provided One or twolonger videos for the same location depict displacements andactivities of a person in its ecological and unconstrainedenvironment

Across the whole corpus the bootstrap video is typi-cally 35 minutes long (6400 images) while the unlabeledevaluation videos are 20 minutes long (36000 images) inaverage A few locations are not given a labeled bootstrapvideo therefore a small randomly annotated portion ofthe evaluation videos covering every topological location isprovided instead

The topological location names in all the video havebeen equalized such that every frame could carry one of thefollowing labels ldquobathroomrdquo ldquobedroomrdquo ldquokitchenrdquo ldquolivingroomrdquo ldquooutsiderdquo and ldquootherrdquo

422 Comparison of Global Performances

Setup We performed automatic image-based place recogni-tion in realistic video-versus-video setup for each of the 14locations To learn optimal parameter values for employedmethods we used the standard cross-validation procedure inall experiments

Due to a large number of locations we report here theglobal performances averaged for all locationsThe summary

Table 1 IMMED dataset average accuracy of the single featureapproaches

Featureapproach SVM SVM-TABOVW 049 052CRFH 048 053SPH 047 049

of the results for single and multiple feature methods isprovided in Tables 1 and 2 respectively

Baseline-Single Feature Classifier Performance As show inTable 1 single feature methods provide relatively low placerecognition performance Surprisingly potentially the morediscriminant descriptor SPH is less performant than itsmore simple BOVW variant A possible explanation to thisphenomenamay be that due to the low amount of supervisionand a classifier trained on high dimensional SPH featuressimply overfits

Temporal Constraints An interesting gain in performance isobtained if temporal information is enforced On the wholecorpus this performance increase ranges from 2 to 4 inglobal classification accuracy for all single feature methodsWe observe the same order of improvement in multiplefeature methods as MKL-TA for early feature fusion andDAS-TA for late classifier fusion This performance increaseover the single feature baselines is constant for the wholecorpus and all methods

Multiple Feature Exploitation Comparing MKL and DASmethods for multiple feature fusion shows interest in favor

Advances in Multimedia 19

Table 2 IMMED dataset average accuracy of the multiple feature approaches

Featureapproach MKL MKL-TA DAS DAS-TA CO-DAS CO-TA-DASBOVW-SPH 048 050 051 056 050 053BOVW-CRFH 050 054 051 056 054 058SPH-CRFH 048 051 050 054 054 057BOVW-SPH-CRFH 048 051 051 056 mdash mdash

of the late fusion method when compared to single fea-ture methods We observe little performance improvementwhen using MKL which can be explained by increaseddimensionality space and thus more risk of overfitting Latefusion strategy is more advantageous compared to respectivesingle feature methods in this low supervision setup bybringing up to 4 with no temporal accumulation and up to5 with temporal accumulation Therefore multiple featureinformation is best leveraged in this context by selecting lateclassifier fusion

Leveraging the Unlabeled Data Exploitation of unlabeled datain the learning process is important when it comes to lowamounts of supervision and great visual variability encoun-tered in challenging video sequences The first proposedmethod termed CO-DAS aims to leverage two visual featureswhile operating in semi-supervised setup It clearly outper-forms all single feature methods and improves on all butBOVW-SPH feature pair compared to DAS by up to 4 Weexplain this performance increase by successfully leveragedvisual feature complementarity and improved single featureclassifiers via Co-Training procedure The second methodCO-TA-DAS incorporates temporal continuity a priori andboosts performances by another 3-4 in global accuracyThis method effectively combines all benefits brought byindividual features temporal video continuity and takingadvantage of unlabeled data

5 Conclusion

In this work we have addressed the challenging problemof indoor place recognition from wearable video record-ings Our proposition was designed by combining severalapproaches in order to deal with issues such as low super-vision and large visual variability encountered in videos fromamobile cameraTheir usefulness and complementarity wereverified initially on a public video sequence database IDOL2then applied to the more complex and larger scale corpus ofvideos collected for the IMMED project which contains real-world video lifelogs depicting actual activities of patients athome

The study revealed several elements that were usefulfor successful recognition in such video corpuses First theusage of multiple visual features was shown to improve thediscrimination power in this context Second the temporalcontinuity of a video is a strong additional cue whichimproved the overall quality of indexing process in mostcasesThird real-world video recordings are rarely annotatedmanually to an extent where most visual variability presentwithin a location is captured Usage of semi-supervised

learning algorithms exploiting labeled as well as unlabeleddata helped to address this problem The proposed systemintegrates all acquired knowledge in a framework which iscomputationally tractable yet takes into account the varioussources of information

We have addressed the fusion of multiple heterogeneoussources of information for place recognition from com-plex videos and demonstrated its utility on the challengingIMMED dataset recorded in real-world conditions Themain focus of this work was to leverage the unlabeleddata thanks to a semi-supervised strategy Additional workcould be done in selecting more discriminant visual featuresfor specific applications and more tight integration of thetemporal information in the learning process Neverthelessthe obtained results confirm the applicability of the proposedplace classification system on challenging visual data fromwearable videos

Acknowledgments

This research has received funding from Agence Nationalede la Recherche under Reference ANR-09-BLAN-0165-02(IMMED project) and the European Communityrsquos Sev-enth Framework Programme (FP72007ndash2013) under GrantAgreement 288199 (DemCare project)

References

[1] A Doherty and A F Smeaton ldquoAutomatically segmentinglifelog data into eventsrdquo in Proceedings of the 9th InternationalWorkshop on Image Analysis for Multimedia Interactive Services(WIAMIS rsquo08) pp 20ndash23 May 2008

[2] E Berry N Kapur L Williams et al ldquoThe use of a wearablecamera SenseCam as a pictorial diary to improve autobi-ographical memory in a patient with limbic encephalitis apreliminary reportrdquo Neuropsychological Rehabilitation vol 17no 4-5 pp 582ndash601 2007

[3] S Hodges L Williams E Berry et al ldquoSenseCam a retro-spective memory aidrdquo in Proceedings of the 8th InternationalConference on Ubiquitous Computing (Ubicomp rsquo06) pp 177ndash193 2006

[4] R Megret D Szolgay J Benois-Pineau et al ldquoIndexing ofwearable video IMMED and SenseCAMprojectsrdquo inWorkshopon Semantic Multimodal Analysis of Digital Media November2008

[5] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision vol 1 pp 273ndash280 October 2003

20 Advances in Multimedia

[6] A Quattoni and A Torralba ldquoRecognizing indoor scenesrdquo inProceedings of IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo09) pp 413ndash420 June 2009

[7] R Megret V Dovgalecs H Wannous et al ldquoThe IMMEDproject wearable video monitoring of people with age demen-tiardquo in Proceedings of the International Conference on Multi-media (MM rsquo10) pp 1299ndash1302 ACM Request PermissionssOctober 2010

[8] S Karaman J Benois-Pineau R Megret V Dovgalecs J-FDartigues and Y Gaestel ldquoHuman daily activities indexingin videos from wearable cameras for monitoring of patientswith dementia diseasesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 4113ndash4116August 2010

[9] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17thInternational Conference on Pattern Recognition (ICPR rsquo04) pp32ndash36 August 2004

[10] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehaviorrecognition via sparse spatio-temporal featuresrdquo in Proceedingsof the 2nd Joint IEEE International Workshop on Visual Surveil-lance and Performance Evaluation of Tracking and Surveillancepp 65ndash72 October 2005

[11] L Ballan M Bertini A del Bimbo and G Serra ldquoVideoevent classification using bag of words and string kernelsrdquoin Proceedings of the 15th International Conference on ImageAnalysis and Processing (ICIAP rsquo09) pp 170ndash178 2009

[12] D I Kosmopoulos N D Doulamis and A S VoulodimosldquoBayesian filter based behavior recognition in workflows allow-ing for user feedbackrdquo Computer Vision and Image Understand-ing vol 116 no 3 pp 422ndash434 2012

[13] M Stikic D Larlus S Ebert and B Schiele ldquoWeakly supervisedrecognition of daily life activities with wearable sensorsrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol33 no 12 pp 2521ndash2537 2011

[14] D H Nguyen G Marcu G R Hayes et al ldquoEncounteringSenseCam personal recording technologies in everyday liferdquo inProceedings of the 11th International Conference on UbiquitousComputing (Ubicomp rsquo09) pp 165ndash174 ACM Request Permis-sions September 2009

[15] M A Perez-QuiNones S Yang B Congleton G Luc and E AFox ldquoDemonstrating the use of a SenseCam in two domainsrdquo inProceedings of the 6th ACMIEEE-CS Joint Conference on DigitalLibraries (JCDL rsquo06) p 376 June 2006

[16] S Karaman J Benois-Pineau V Dovgalecs et al HierarchicalHidden Markov Model in Detecting Activities of Daily Livingin Wearable Videos for Studies of Dementia 2011

[17] J Pinquier S Karaman L Letoupin et al ldquoStrategies formultiple feature fusion with Hierarchical HMM applicationto activity recognition from wearable audiovisual sensorsrdquoin Proceedings of the 21 International Conference on PatternRecognition pp 1ndash4 July 2012

[18] N SebeM S Lew X Zhou T S Huang and EM Bakker ldquoThestate of the art in image and video retrievalrdquo in Proceedings of the2nd International Conference on Image and Video Retrieval pp1ndash7 May 2003

[19] S-F Chang D Ellis W Jiang et al ldquoLarge-scale multimodalsemantic concept detection for consumer videordquo in Proceed-ings of the International Workshop on Multimedia InformationRetrieva (MIR rsquo07) pp 255ndash264 ACM Request PermissionsSeptember 2007

[20] J Kosecka F Li and X Yang ldquoGlobal localization and relativepositioning based on scale-invariant keypointsrdquo Robotics andAutonomous Systems vol 52 no 1 pp 27ndash38 2005

[21] C O Conaire M Blighe and N OrsquoConnor ldquoSensecam imagelocalisation using hierarchical surf treesrdquo in Proceedings of the15th InternationalMultimediaModeling Conference (MMM rsquo09)p 15 Sophia-Antipolis France January 2009

[22] J Kosecka L Zhou P Barber and Z Duric ldquoQualitative imagebased localization in indoors environmentsrdquo in Proceedings ofIEEE Computer Society Conference on Computer Vision andPattern Recognition vol 2 pp II-3ndashII-8 June 2003

[23] Z Zovkovic O Booij and B Krose ldquoFrom images to roomsrdquoRobotics and Autonomous Systems vol 55 no 5 pp 411ndash4182007

[24] L Fei-Fei and P Perona ldquoA bayesian hierarchical model forlearning natural scene categoriesrdquo in Proceedings of IEEEComputer Society Conference on Computer Vision and PatternRecognition (CVPR rsquo05) pp 524ndash531 June 2005

[25] D Nister and H Stewenius ldquoScalable recognition with a vocab-ulary treerdquo in Proceedings of IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo06) vol 2pp 2161ndash2168 2006

[26] O Linde andT Lindeberg ldquoObject recognition using composedreceptive field histograms of higher dimensionalityrdquo in Proceed-ings of the 17th International Conference on Pattern Recognition(ICPR rsquo04) vol 2 pp 1ndash6 August 2004

[27] S Lazebnik C Schmid and J Ponce ldquoBeyond bags of featuresspatial pyramid matching for recognizing natural scene cate-goriesrdquo in Proceedings of IEEE Computer Society Conference onComputer Vision and Pattern Recognition (CVPR rsquo06) vol 2 pp2169ndash2178 2006

[28] A Bosch and A Zisserman ldquoScene classification via pLSArdquo inProceedings of the 9th European Conference on Computer Vision(ECCV rsquo06) May 2006

[29] J Sivic and A Zisserman ldquoVideo google a text retrievalapproach to object matching in videosrdquo in Proceedings of the 9thIEEE International Conference On Computer Vision pp 1470ndash1477 October 2003

[30] J Knopp Image Based Localization [PhD thesis] Chech Tech-nical University in Prague Faculty of Electrical EngineeringPrague Czech Republic 2009

[31] M W M G Dissanayake P Newman S Clark H F Durrant-Whyte and M Csorba ldquoA solution to the simultaneous local-ization and map building (SLAM) problemrdquo IEEE Transactionson Robotics and Automation vol 17 no 3 pp 229ndash241 2001

[32] L M Paz P Jensfelt J D Tardos and J Neira ldquoEKF SLAMupdates inO(n) with divide and conquer SLAMrdquo in Proceedingsof IEEE International Conference on Robotics and Automation(ICRA rsquo07) pp 1657ndash1663 April 2007

[33] J Wu and J M Rehg ldquoCENTRIST a visual descriptor forscene categorizationrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 33 no 8 pp 1489ndash1501 2011

[34] H Bulthoff and A Yuille ldquoBayesian models for seeing shapesand depthrdquo Tech Rep 90-11 Harvard Robotics Laboratory1990

[35] P K Atrey M Anwar Hossain A El Saddik and M SKankanhalli ldquoMultimodal fusion for multimedia analysis asurveyrdquoMultimedia Systems vol 16 no 6 pp 345ndash379 2010

[36] A Rakotomamonjy F R Bach S Canu and Y GrandvaletldquoSimpleMKLrdquo The Journal of Machine Learning Research vol9 pp 2491ndash2521 2008

Advances in Multimedia 21

[37] S Nakajima A Binder C Muller et al ldquoMultiple kernellearning for object classificationrdquo inWorkshop on Information-based Induction Sciences 2009

[38] A Vedaldi V Gulshan M Varma and A Zisserman ldquoMultiplekernels for object detectionrdquo in Proceedings of the 12th Interna-tional Conference on Computer Vision (ICCV rsquo09) pp 606ndash613October 2009

[39] J Yang Y Li Y Tian L Duan and W Gao ldquoGroup-sensitivemultiple kernel learning for object categorizationrdquo in Proceed-ings of the 12th International Conference on Computer Vision(ICCV rsquo09) pp 436ndash443 October 2009

[40] M Guillaumin J Verbeek and C Schmid ldquoMultimodal semi-supervised learning for image classificationrdquo in Proceedings ofIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo10) pp 902ndash909 Laboratoire Jean Kuntzmann LEARINRIA Grenoble June 2010

[41] J Yang Y Li Y Tian L Duan and W Gao ldquoMultiple kernelactive learning for image classificationrdquo in Proceedings of IEEEInternational Conference on Multimedia and Expo (ICME rsquo09)pp 550ndash553 July 2009

[42] A Abdullah R C Veltkamp and M A Wiering ldquoSpatialpyramids and two-layer stacking SVM classifiers for imagecategorization a comparative studyrdquo in Proceedings of theInternational Joint Conference on Neural Networks (IJCNN rsquo09)pp 5ndash12 June 2009

[43] J Kittler M Hatef R P W Duin and J Matas ldquoOn combiningclassifiersrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 20 no 3 pp 226ndash239 1998

[44] L Ilieva Kuncheva Combining Pattern Classifiers Methods andAlgorithms Wiley-Interscience 2004

[45] A Uhl and PWild ldquoParallel versus serial classifier combinationfor multibiometric hand-based identificationrdquo in Proceedings ofthe 3rd International Conference on Advances in Biometrics (ICBrsquo09) vol 5558 pp 950ndash959 2009

[46] W Nayer Feature based architecture for decision fusion [PhDthesis] 2003

[47] M-E Nilsback and B Caputo ldquoCue integration through dis-criminative accumulationrdquo in Proceedings of IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition(CVPR rsquo04) vol 2 pp II578ndashII585 July 2004

[48] A Pronobis and B Caputo ldquoConfidence-based cue integrationfor visual place recognitionrdquo inProceedings of IEEERSJ Interna-tional Conference on Intelligent Robots and Systems (IROS rsquo07)pp 2394ndash2401 October-November 2007

[49] A Pronobis O Martinez Mozos and B Caputo ldquoSVM-baseddiscriminative accumulation scheme for place recognitionrdquo inProceedings of IEEE International Conference on Robotics andAutomation (ICRA rsquo08) pp 522ndash529 May 2008

[50] F Lu X Yang W Lin R Zhang R Zhang and S YuldquoImage classification with multiple feature channelsrdquo OpticalEngineering vol 50 no 5 Article ID 057210 2011

[51] P Gehler and S Nowozin ldquoOn feature combination for mul-ticlass object classificationrdquo in Proceedings of the 12th Interna-tional Conference on Computer Vision pp 221ndash228 October2009

[52] X Zhu ldquoSemi-supervised learning literature surveyrdquo TechRep 1530 Department of Computer Sciences University ofWinsconsin Madison Wis USA 2008

[53] X Zhu and A B Goldberg Introduction to Semi-SupervisedLearning Morgan and Claypool Publishers 2009

[54] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing MIT Press Cambridge Mass USA 2006

[55] M Belkin P Niyogi and V Sindhwani ldquoManifold regular-ization a geometric framework for learning from labeled andunlabeled examplesrdquoThe Journal of Machine Learning Researchvol 7 pp 2399ndash2434 2006

[56] UVonLuxburg ldquoA tutorial on spectral clusteringrdquo Statistics andComputing vol 17 no 4 pp 395ndash416 2007

[57] D Zhou O Bousquet T Navin Lal JWeston and B ScholkopfldquoLearning with local and global consistencyrdquo Advances inNeural Information Processing Systems vol 16 pp 321ndash3282004

[58] S Melacci and M Belkin ldquoLaplacian support vector machinestrained in the primalrdquo The Journal of Machine LearningResearch vol 12 pp 1149ndash1184 2011

[59] B Nadler and N Srebro ldquoSemi-supervised learning with thegraph laplacian the limit of infinite unlabelled datardquo in Pro-ceedings of the 23rd Annual Conference on Neural InformationProcessing Systems (NIPS rsquo09) 2009

[60] A Blum and T Mitchell ldquoCombining labeled and unlabeleddata with co-trainingrdquo in Proceedings of the 11th Annual Confer-ence on Computational LearningTheory (COLTrsquo 98) pp 92ndash100October 1998

[61] D Zhang and W Sun Lee ldquoValidating co-training modelsfor web image classificationrdquo in Proceedings of SMA AnnualSymposium National University of Singapore 2005

[62] W Tong T Yang andR Jin ldquoCo-training For Large Scale ImageClassification AnOnline ApproachrdquoAnalysis and Evaluation ofLarge-Scale Multimedia Collections pp 1ndash4 2010

[63] M Wang X-S Hua L-R Dai and Y Song ldquoEnhancedsemi-supervised learning for automatic video annotationrdquo inProceedings of IEEE International Conference onMultimedia andExpo (ICME rsquo06) pp 1485ndash1488 July 2006

[64] V E van Beusekom I G Sprinkuizen-Kuyper and L GVuurpul ldquoEmpirically evaluating co-trainingrdquo Student Report2009

[65] W Wang and Z-H Zhou ldquoAnalyzing co-training style algo-rithmsrdquo in Proceedings of the 18th European Conference onMachine Learning (ECML rsquo07) pp 454ndash465 2007

[66] C Dong Y Yin X Guo G Yang and G Zhou ldquoOn co-training style algorithmsrdquo in Proceedings of the 4th InternationalConference on Natural Computation (ICNC rsquo08) vol 7 pp 196ndash201 October 2008

[67] S Abney Semisupervised Learning for Computational Linguis-tics Computer Science and Data Analysis Series Chapman ampHall University of Michigan Ann Arbor Mich USA 2008

[68] D Yarowsky ldquoUnsupervised word sense disambiguation rival-ing supervised methodsrdquo in Proceedings of the 33rd AnnualMeeting on Association for Computational Linguistics (ACL rsquo95)pp 189ndash196 University of Pennsylvania 1995

[69] W Wang and Z-H Zhou ldquoA new analysis of co-trainingrdquo inProceedings of the 27th International Conference on MachineLearning pp 1135ndash1142 May 2010

[70] C M Bishop Pattern Recognition and Machine LearningInformation Science and Statistics Springer Secaucus NJ USA2006

[71] B Scholkopf and A J Smola Learning with Kernels MIT PressCambridge Mass USA 2002

[72] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-JLin ldquoLIBLINEAR a library for large linear classificationrdquo TheJournal of Machine Learning Research vol 9 pp 1871ndash18742008

22 Advances in Multimedia

[73] A J Smola B Scholkopf and K-R Muller ldquoNonlinearcomponent analysis as a kernel eigenvalue problemrdquo NeuralComputation vol 10 no 5 pp 1299ndash1319 1998

[74] A Pronobis O Martınez Mozos B Caputo and P JensfeltldquoMulti-modal semantic place classificationrdquo The InternationalJournal of Robotics Research vol 29 no 2-3 pp 298ndash320 2010

[75] T Hastie R Tibshirani J Friedman and J Franklin ldquoTheelements of statistical learning data mining inference andpredictionvolumerdquo The Mathematical Intelligencer vol 27 no2 pp 83ndash85 2005

[76] S Ruping A Simple Method For Estimating Conditional Prob-abilities For SVMs American Society of Agricultural Engineers2004

[77] T Tommasi F Orabona and B Caputo ldquoAn SVM confidence-based approach to medical image annotationrdquo in Proceedingsof the 9th Cross-Language Evaluation Forum Conference onEvaluating Systems forMultilingual andMultimodal InformationAccess (CLEF rsquo08) pp 696ndash703 2009

[78] K Grauman and T Darrell ldquoThe pyramid match kerneldiscriminative classification with sets of image featuresrdquo in Pro-ceedings of the 10th IEEE International Conference on ComputerVision (ICCV rsquo05) vol 2 pp 1458ndash1465 October 2005

[79] J Luo A Pronobis B Caputo and P Jensfelt ldquoThe KTH-IDOL2 databaserdquo Tech Rep Kungliga Tekniska HoegskolanCVAPCAS 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 19: Research Article Multiple Feature Fusion Based on Co ...downloads.hindawi.com/journals/am/2013/175064.pdf · and wearable sensors.... Monitoring Using Ambient Sensors. Activity recogni-tion

Advances in Multimedia 19

Table 2 IMMED dataset average accuracy of the multiple feature approaches

Featureapproach MKL MKL-TA DAS DAS-TA CO-DAS CO-TA-DASBOVW-SPH 048 050 051 056 050 053BOVW-CRFH 050 054 051 056 054 058SPH-CRFH 048 051 050 054 054 057BOVW-SPH-CRFH 048 051 051 056 mdash mdash

of the late fusion method when compared to single fea-ture methods We observe little performance improvementwhen using MKL which can be explained by increaseddimensionality space and thus more risk of overfitting Latefusion strategy is more advantageous compared to respectivesingle feature methods in this low supervision setup bybringing up to 4 with no temporal accumulation and up to5 with temporal accumulation Therefore multiple featureinformation is best leveraged in this context by selecting lateclassifier fusion

Leveraging the Unlabeled Data. Exploiting unlabeled data in the learning process is important given the low amounts of supervision and the great visual variability encountered in challenging video sequences. The first proposed method, termed CO-DAS, aims to leverage two visual features while operating in a semi-supervised setup. It clearly outperforms all single-feature methods and improves on DAS by up to 4% for all but the BOVW-SPH feature pair. We explain this performance increase by the successfully leveraged complementarity of the visual features and by the single-feature classifiers improved via the Co-Training procedure. The second method, CO-TA-DAS, additionally incorporates the temporal continuity prior and boosts performance by another 3-4% in global accuracy. This method effectively combines the benefits brought by the individual features, the temporal continuity of video, and the exploitation of unlabeled data.
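The co-training step can be pictured with the following schematic sketch: two per-view classifiers (e.g., one per visual feature such as BOVW and CRFH) pseudo-label the unlabeled frames they are most confident about and feed them to each other. The number of rounds, the number of frames transferred per round, and the use of linear SVMs are illustrative assumptions rather than the exact CO-DAS procedure.

```python
import numpy as np
from sklearn.svm import LinearSVC

def co_training(X_lab, y_lab, X_unlab, rounds=5, k=20):
    # X_lab and X_unlab are dicts keyed by exactly two view names (e.g. "bovw",
    # "crfh"), each holding an (n_frames, dim) feature array for the same frames.
    X_lab = {v: np.asarray(X_lab[v]) for v in X_lab}
    X_unlab = {v: np.asarray(X_unlab[v]) for v in X_unlab}
    y = {v: np.asarray(y_lab) for v in X_lab}
    views = list(X_lab)
    clfs = {}
    for _ in range(rounds):
        clfs = {v: LinearSVC().fit(X_lab[v], y[v]) for v in views}
        for src, dst in ((views[0], views[1]), (views[1], views[0])):
            scores = clfs[src].decision_function(X_unlab[src])
            conf = scores.max(axis=1) if scores.ndim > 1 else np.abs(scores)
            idx = np.argsort(conf)[-k:]  # k most confident unlabeled frames
            pseudo = clfs[src].predict(X_unlab[src][idx])
            # Grow the other view's training set with the pseudo-labeled frames
            # (in this simplified sketch frames are not removed from the pool).
            X_lab[dst] = np.vstack([X_lab[dst], X_unlab[dst][idx]])
            y[dst] = np.concatenate([y[dst], pseudo])
    return clfs
```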

5. Conclusion

In this work, we have addressed the challenging problem of indoor place recognition from wearable video recordings. Our proposition was designed by combining several approaches in order to deal with issues such as low supervision and the large visual variability encountered in videos from a mobile camera. Their usefulness and complementarity were verified initially on a public video sequence database, IDOL2, and then on the more complex and larger-scale corpus of videos collected for the IMMED project, which contains real-world video lifelogs depicting actual activities of patients at home.

The study revealed several elements that were useful for successful recognition in such video corpora. First, the usage of multiple visual features was shown to improve the discrimination power in this context. Second, the temporal continuity of a video is a strong additional cue, which improved the overall quality of the indexing process in most cases. Third, real-world video recordings are rarely annotated manually to an extent where most of the visual variability present within a location is captured; the usage of semi-supervised learning algorithms exploiting labeled as well as unlabeled data helped to address this problem. The proposed system integrates all acquired knowledge in a framework which is computationally tractable yet takes the various sources of information into account.

We have addressed the fusion of multiple heterogeneous sources of information for place recognition from complex videos and demonstrated its utility on the challenging IMMED dataset recorded in real-world conditions. The main focus of this work was to leverage unlabeled data thanks to a semi-supervised strategy. Additional work could be done on selecting more discriminant visual features for specific applications and on a tighter integration of the temporal information in the learning process. Nevertheless, the obtained results confirm the applicability of the proposed place classification system to challenging visual data from wearable videos.

Acknowledgments

This research has received funding from the Agence Nationale de la Recherche under Reference ANR-09-BLAN-0165-02 (IMMED project) and from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 288199 (Dem@Care project).

References

[1] A. Doherty and A. F. Smeaton, "Automatically segmenting lifelog data into events," in Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), pp. 20–23, May 2008.
[2] E. Berry, N. Kapur, L. Williams et al., "The use of a wearable camera, SenseCam, as a pictorial diary to improve autobiographical memory in a patient with limbic encephalitis: a preliminary report," Neuropsychological Rehabilitation, vol. 17, no. 4-5, pp. 582–601, 2007.
[3] S. Hodges, L. Williams, E. Berry et al., "SenseCam: a retrospective memory aid," in Proceedings of the 8th International Conference on Ubiquitous Computing (Ubicomp '06), pp. 177–193, 2006.
[4] R. Megret, D. Szolgay, J. Benois-Pineau et al., "Indexing of wearable video: IMMED and SenseCAM projects," in Workshop on Semantic Multimodal Analysis of Digital Media, November 2008.
[5] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 273–280, October 2003.


[6] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 413–420, June 2009.
[7] R. Megret, V. Dovgalecs, H. Wannous et al., "The IMMED project: wearable video monitoring of people with age dementia," in Proceedings of the International Conference on Multimedia (MM '10), pp. 1299–1302, October 2010.
[8] S. Karaman, J. Benois-Pineau, R. Megret, V. Dovgalecs, J.-F. Dartigues, and Y. Gaestel, "Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4113–4116, August 2010.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, August 2004.
[10] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005.
[11] L. Ballan, M. Bertini, A. del Bimbo, and G. Serra, "Video event classification using bag of words and string kernels," in Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), pp. 170–178, 2009.
[12] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Computer Vision and Image Understanding, vol. 116, no. 3, pp. 422–434, 2012.
[13] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, "Weakly supervised recognition of daily life activities with wearable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2521–2537, 2011.
[14] D. H. Nguyen, G. Marcu, G. R. Hayes et al., "Encountering SenseCam: personal recording technologies in everyday life," in Proceedings of the 11th International Conference on Ubiquitous Computing (Ubicomp '09), pp. 165–174, September 2009.
[15] M. A. Pérez-Quiñones, S. Yang, B. Congleton, G. Luc, and E. A. Fox, "Demonstrating the use of a SenseCam in two domains," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), p. 376, June 2006.
[16] S. Karaman, J. Benois-Pineau, V. Dovgalecs et al., Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia, 2011.
[17] J. Pinquier, S. Karaman, L. Letoupin et al., "Strategies for multiple feature fusion with Hierarchical HMM: application to activity recognition from wearable audiovisual sensors," in Proceedings of the 21st International Conference on Pattern Recognition, pp. 1–4, July 2012.
[18] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, "The state of the art in image and video retrieval," in Proceedings of the 2nd International Conference on Image and Video Retrieval, pp. 1–7, May 2003.
[19] S.-F. Chang, D. Ellis, W. Jiang et al., "Large-scale multimodal semantic concept detection for consumer video," in Proceedings of the International Workshop on Multimedia Information Retrieval (MIR '07), pp. 255–264, September 2007.
[20] J. Kosecka, F. Li, and X. Yang, "Global localization and relative positioning based on scale-invariant keypoints," Robotics and Autonomous Systems, vol. 52, no. 1, pp. 27–38, 2005.
[21] C. O. Conaire, M. Blighe, and N. O'Connor, "SenseCam image localisation using hierarchical SURF trees," in Proceedings of the 15th International Multimedia Modeling Conference (MMM '09), p. 15, Sophia-Antipolis, France, January 2009.
[22] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-3–II-8, June 2003.
[23] Z. Zovkovic, O. Booij, and B. Krose, "From images to rooms," Robotics and Autonomous Systems, vol. 55, no. 5, pp. 411–418, 2007.
[24] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2161–2168, 2006.
[26] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 1–6, August 2004.
[27] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2169–2178, 2006.
[28] A. Bosch and A. Zisserman, "Scene classification via pLSA," in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006.
[29] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1470–1477, October 2003.
[30] J. Knopp, Image Based Localization [Ph.D. thesis], Czech Technical University in Prague, Faculty of Electrical Engineering, Prague, Czech Republic, 2009.
[31] M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[32] L. M. Paz, P. Jensfelt, J. D. Tardos, and J. Neira, "EKF SLAM updates in O(n) with divide and conquer SLAM," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '07), pp. 1657–1663, April 2007.
[33] J. Wu and J. M. Rehg, "CENTRIST: a visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[34] H. Bulthoff and A. Yuille, "Bayesian models for seeing shapes and depth," Tech. Rep. 90-11, Harvard Robotics Laboratory, 1990.
[35] P. K. Atrey, M. Anwar Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[36] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," The Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.


[37] S. Nakajima, A. Binder, C. Muller et al., "Multiple kernel learning for object classification," in Workshop on Information-based Induction Sciences, 2009.
[38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 606–613, October 2009.
[39] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Group-sensitive multiple kernel learning for object categorization," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 436–443, October 2009.
[40] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 902–909, June 2010.
[41] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, "Multiple kernel active learning for image classification," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '09), pp. 550–553, July 2009.
[42] A. Abdullah, R. C. Veltkamp, and M. A. Wiering, "Spatial pyramids and two-layer stacking SVM classifiers for image categorization: a comparative study," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 5–12, June 2009.
[43] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[44] L. Ilieva Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.
[45] A. Uhl and P. Wild, "Parallel versus serial classifier combination for multibiometric hand-based identification," in Proceedings of the 3rd International Conference on Advances in Biometrics (ICB '09), vol. 5558, pp. 950–959, 2009.
[46] W. Nayer, Feature based architecture for decision fusion [Ph.D. thesis], 2003.
[47] M.-E. Nilsback and B. Caputo, "Cue integration through discriminative accumulation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II-578–II-585, July 2004.
[48] A. Pronobis and B. Caputo, "Confidence-based cue integration for visual place recognition," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '07), pp. 2394–2401, October-November 2007.
[49] A. Pronobis, O. Martínez Mozos, and B. Caputo, "SVM-based discriminative accumulation scheme for place recognition," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA '08), pp. 522–529, May 2008.
[50] F. Lu, X. Yang, W. Lin, R. Zhang, R. Zhang, and S. Yu, "Image classification with multiple feature channels," Optical Engineering, vol. 50, no. 5, Article ID 057210, 2011.
[51] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proceedings of the 12th International Conference on Computer Vision, pp. 221–228, October 2009.
[52] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Department of Computer Sciences, University of Wisconsin, Madison, Wis, USA, 2008.
[53] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.
[54] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[55] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[56] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[57] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[58] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," The Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.
[59] B. Nadler and N. Srebro, "Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), 2009.
[60] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT '98), pp. 92–100, 1998.
[61] D. Zhang and W. Sun Lee, "Validating co-training models for web image classification," in Proceedings of SMA Annual Symposium, National University of Singapore, 2005.
[62] W. Tong, T. Yang, and R. Jin, "Co-training for large scale image classification: an online approach," Analysis and Evaluation of Large-Scale Multimedia Collections, pp. 1–4, 2010.
[63] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, "Enhanced semi-supervised learning for automatic video annotation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), pp. 1485–1488, July 2006.
[64] V. E. van Beusekom, I. G. Sprinkuizen-Kuyper, and L. G. Vuurpul, "Empirically evaluating co-training," Student Report, 2009.
[65] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 454–465, 2007.
[66] C. Dong, Y. Yin, X. Guo, G. Yang, and G. Zhou, "On co-training style algorithms," in Proceedings of the 4th International Conference on Natural Computation (ICNC '08), vol. 7, pp. 196–201, October 2008.
[67] S. Abney, Semisupervised Learning for Computational Linguistics, Computer Science and Data Analysis Series, Chapman & Hall, University of Michigan, Ann Arbor, Mich, USA, 2008.
[68] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL '95), pp. 189–196, University of Pennsylvania, 1995.
[69] W. Wang and Z.-H. Zhou, "A new analysis of co-training," in Proceedings of the 27th International Conference on Machine Learning, pp. 1135–1142, 2010.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Secaucus, NJ, USA, 2006.
[71] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
[72] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[73] A. J. Smola, B. Scholkopf, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[74] A. Pronobis, O. Martínez Mozos, B. Caputo, and P. Jensfelt, "Multi-modal semantic place classification," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 298–320, 2010.
[75] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.
[76] S. Ruping, A Simple Method for Estimating Conditional Probabilities for SVMs, American Society of Agricultural Engineers, 2004.
[77] T. Tommasi, F. Orabona, and B. Caputo, "An SVM confidence-based approach to medical image annotation," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access (CLEF '08), pp. 696–703, 2009.
[78] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1458–1465, October 2005.
[79] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt, "The KTH-IDOL2 database," Tech. Rep., Kungliga Tekniska Hoegskolan, CVAP/CAS, 2006.
