Active transfer learning for activity recognition...Active transfer learning for activity recognition Tom Diethe and Niall Twomey and Peter Flach Intelligent Systems Laboratory, University

Active transfer learning for activity recognition

Tom Diethe and Niall Twomey and Peter Flach

Intelligent Systems Laboratory, University of Bristol, UK

Abstract. We examine activity recognition from accelerometers, whichprovides at least two major challenges for machine learning. Firstly, thedeployment context is likely to differ from the learning context. Secondly,accurate labelling of training data is time-consuming and error-prone. Thiscalls for a combination of active and transfer learning. We derive a hierar-chical Bayesian model that is a natural fit to such problems, and provideempirical validation on synthetic and publicly available datasets. The re-sults show that by combining active and transfer learning, we can achievefaster learning with fewer labels on a target domain than by either alone.

1 Introduction

In this paper we are concerned with activity recognition, which is usually per-formed for the purposes of understanding the Activities of Daily Living (ADL)of a given individual. It is natural to consider the use of accelerometers for activ-ity recognition, since it is clear that certain activities will have clear movementpatterns for different parts of the body, whilst the sensors are relatively low-cost,low-power, and have wide user acceptance [1].

There are at least two major challenges for machine learning in this set-ting. Firstly, the deployment context will necessarily be very different to thethe context in which learning occurs, due to individual differences patterns ofmotion for a given activity. Secondly, accurate labelling of training data is anextremely time-consuming process, and the resulting labels are potentially noisyand error-prone.

Our contributions are: 1) We provide a hierarchical Bayesian model for thetransfer learning problem; 2) We combine this with myopic active learning; 3)Using this approach we show empirically that we can adapt to new domainsusing only a few labelled examples.

1.1 Related Work

Early work on active learning [2] demonstrated that it is possible to computethe statistically ‘optimal’ way to select training data, with the observation thatthe optimality criterion sharply decreases the number of training examples thelearner needs in order to achieve good performance. This differs from the manyheuristic methods for choosing training data, including choosing places wherewe don’t have data, where we perform poorly, where we have low confidence etc.

Within a Bayesian framework, active learning can be naturally conceivedsince uncertainty is directly modelled, and there has been much interest in thisarea, particularly with respect to nonparametric methods. A major assumptionin the majority of machine learning methods is that the training and deployment

429

ESANN 2016 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 27-29 April 2016, i6doc.com publ., ISBN 978-287587027-8. Available from http://www.i6doc.com/en/.

data are drawn from the same underlying distribution. For our application thisassumption clearly does not hold, and so, knowledge transfer, if done success-fully, would greatly improve the performance of learning by avoiding the costlyacquirement of labels [3]. It is well known that the hierarchical Bayesian frame-work can be adapted to sequential decision problems [4], and it has been shownmore recently that it provides a natural formalisation of transfer (reinforcement)learning [5].

Recently, [6] investigated active transfer learning for cross-system recommen-dation, since a newly launched system has a cold-start problem, where existingrating information is available. The authors construct entity correspondenceswith limited budget by using active learning to facilitate knowledge transferacross systems.

2 Hierarchical Bayesian Active Transfer Learning

We present here a multi-class extension of the Bayes Point Machine (BPM) [7],which is a Bayesian model for classification that makes the following assump-tions: 1. The feature values x are always fully observed. 2. The order of instancesdoes not matter. 3. The predictive distribution is a linear discriminant of theform p(yi|xi,w) = p(yi|si = w′xi) where w are the weights and si is the scorefor instance i. 4. The scores are subject to additive Gaussian noise. 5. Eachindividual has a separate set of weights, drawn from a communal prior. Forthe purposes of activity recognition, assumption 2 may be problematic, sincethe data is sequential in nature. The strength of the temporal dependence inthe sequence will determine how costly this approximation is, and this will inturn depend on how the data is preprocessed The factor graph for this modelis illustrated in fig. 1, where N denotes a Gaussian density for a given mean µand precision τ , and Γ denotes a Gamma density for given shape k and scaleθ. The factor indicated by is the ‘arg-max’ factor, which is like a probabilis-tic multi-class switch. The additive Gaussian noise from assumption 4 resultsin the variable s This is a hierarchical multi-class extension of the Bayes pointmachine [7], where we have a plate around the individuals that are present inthe training set (R), who form the “community”. Online learning is performedusing the standard assumed-density filtering method of [4].

To apply our learnt community weight posteriors to a new individual wecan use the same model configured for a single individual (i.e. R = 1) with thepriors over weight mean µw and weight precision τw replaced by the Gaussianand Gamma posteriors learnt from the individuals in the training set. This modelis able to make predictions even when we have not seen any data for the newindividual, and it is also possible to do online training as we receive labelled datafor the individual. By doing so, we can smoothly evolve from making genericpredictions that may apply to any individual to making personalised predictionsspecific to the new individual.

Given a set of potentially noisy training examples S = {(xi, yi)mi=1}, where

xi ∈ X and yi ∈ Y, we wish to learn a general mapping X → Y, and we can

430


iteratively select a new input x and request a label y.

y

s

s1

N

w x×

µw τw

µµ σµ kτ θτ

N

N Γ

N

D

A

R

Fig. 1: Hierarchical com-munity multi-class BPM.

We use two base methods in order to do per-sonalised active learning. The first is a simple un-certainty sampling method, where we select pointsusing the marginal predictive distributions of pointsin the pool closest to chance levels (e.g. 0.5 for a bi-nary classifier). As a sanity check, we also do “cer-tainty” sampling, where we choose the points thatthe classifier is most confident about. Secondly, weextend the method outlined by [8]. We make a my-opic assumption, where we only seek to label onedata point at a time from a pool of potential ex-amples and define a cost matrix C ∈ R|Y|×|Y|, Ci,j

denotes the risk associated with classifying a pointi as j, and Ci,i = 0, ∀i ∈ Y. The total cost on thecommunity training set is

JS =∑a∈Y

∑i:yi=a

Cai(1− pi) +∑

i:yi 6=a

Caipi

The cost for an unlabelled point is

Jxi =∑a∈Y

(Cai(1− pi)p∗i + Ciapi(1− p∗i )) ≈∑a∈Y

((Cai + Cia)(1− pi)pi) ,

where p∗i is the true conditional density of the class label given the data point,which is approximated by pi. The approximate misclassification cost is then

1m+1 (JS + Jxi

). In the method of [9], the cost L(xi) of acquiring a label for xi

is given a value in the same currency as the costs in C. The expected value-of-information (VOI) criterion is then defined as

V OI(xi) = JS + Jxi+ L(xi)− L(xi).

Given a set of unlabelled points U , our strategy is to select cases for labelling andlabelling method that have the highest VOI. Note that whenever V OI(xi) < 0,we have a condition where knowing a single label does not reduce the total cost,which can be employed as a stopping criterion. We also evaluate the effect ofconsidering the unlabelled example that yields the empirical greatest risk (VOI-)rather than the greatest relative risk (VOI+).

3 Experiments

Here we present experimental results that attempt to show that active learningcan be especially useful in transfer learning settings, and whether the costs of theVOI method are justified. We first present some results on a synthetic dataset,and subsequently show the methods working the transfer between two publiclyavailable activity recognition datasets from accelerometer data. Source code forall experiments is at https://github.com/IRC-SPHERE/ActiveTransfer.

431


https://github.com/IRC-SPHERE/ActiveTransfer

3.1 Synthetic Experiment

We sample from a community of 10 subjects using ancestral sampling as follows:1. Create a community prior N (4, 1), and sample a weight for each feature

and subject from the community prior2. For each subject and instance: a) Sample feature values uniformly in [0, 1],

plus a constant bias (-1); b) Compute scores S as the inner product be-tween features and weights; c) Compute the noisy score by sampling fromN (S, 1); and d) Threshold the noisy score at zero to compute the label.

The first 5 subjects are then used to train the hierarchical BPM shown in fig. 1.The remaining 5 are used in the online personalisation phase. For the weightpriors we usedN (0, 1) means and Γ(1, 1) precisions. Inference is performed usingExpectation Propagation (EP). To test the effectiveness of transfer learningusing the hierarchical model, we compare using the posterior community weightsas the priors for the personalisation phase with using the original N (0, 1) priors,which will call community and online respectively.

0 50 100 150 200# instances

0.0

0.2

0.4

0.6

0.8

1.0

Acc

urac

yCumulative Accuracy

OnlineCommunity

(a) Transfer (cumulative)

0 10 20 30 40 50# instances

0.0

0.2

0.4

0.6

0.8

1.0

Acc

urac

y

Hold-out Accuracy

RandomUSCSVOI+VOI-

(b) Active (hold-out)

Fig. 2: Classification accuracy.

In this synthetic dataset, every data-point is equally useful for learning, sincefeature values are uniformly distributed. Inreal-world settings this is often not the case- there may either be high levels of re-dundancy between neighbouring examples(i.e. they are not truly independently andidentically distributed), or some examplesmay simply be corrupted. In order to sim-ulate this, we randomly set the feature vec-tors of 90% of the examples to zero for thedataset provided to the active learner.

Firstly we analyse the performance ofthe transfer learning method.In fig. 2a, wecan see the cumulative online accuracy (andstandard deviation (SD)) of a standard on-line method, versus using the communityposteriors as the priors for the personalisa-tion phase. In this setting we cycle throughthe test set, first predicting the label of anexample, then observing its label an incor-porating it into the model using an AssumedDensity Filtering approximation [4].

Secondly, we analyse the performance ofthe active learning methods in a myopic setting (see fig. 2b). The metric we use isthe accuracy over the hold-out test set (200 examples), averaged over 5 subjects.Note that the VOI+ method learns fastest in this setting, whilst unsurprisinglythe Certainty Sampling (CS) method outperforms the Uncertainty Sampling(US) method, since the US method will always choose the corrupted examples,whereas the CS method does the opposite.

432


Table 1: Publicly available data-sets for activity recognition based on body-wornaccelerometers used in this study.

# Ref. Av. duration Subjects Type Sampling rate Labels

1 [10] 7 mins 30 Smartphone 50 Hz Video2 [11] 6 hours 14 MotionNode 100 Hz Observer

3.2 Activity Recognition from Accelerometers

0 5 10 15 20# instances

0.0

0.2

0.4

0.6

0.8

1.0

Acc

urac

y

Hold-out Accuracy

RandomUSCSVOI+VOI-

(a) Online estimation

0 5 10 15 20# instances

0.0

0.2

0.4

0.6

0.8

1.0

Acc

urac

y

Hold-out Accuracy

RandomUSCSVOI+VOI-

(b) Active transfer

Fig. 3: Cumulative accuracy.

In transfer learning problems, it is commonto refer to “source” and “target” data. Thetwo datasets that will be used as source andtarget data are described below. Note thatthese were collected by different researchers,using different sensor equipment, and po-tentially labelled in different ways. As such,this is a good representation the challengeby researchers in this area when trying toapply models to new scenarios.

The source dataset [10], involved 30 par-ticipants aged 19-48 years and six activi-ties were recorded. Each participant wore asmart-phone on the waist, with tri-axial lin-ear acceleration and tri-axial angular veloc-ity capture using its embedded accelerom-eter and gyroscope at a constant rate of50 Hz. Annotation was done using video-recordings. The target dataset [11], in-volved 14 participants aged 21-49 years and9 activities were recorded.

For both datasets, we considered the ac-celeration signals only. We extracted a totalof 48 features based on the ‘body’ acceleration signal (see [10]. Posterior param-eter distributions were first computed on the source dataset. These are thentransferred to the target domain. It would not be rational to insist that theposteriors on the source and target domains are equally uncertain, and so thetarget precision distributions were set to the initial prior values (Γ(1, 1)) whilekeeping the source’s posteriors over the weight means.

Fig. 3a shows the classification performance using online estimation withvarious instance sampling techniques (random, (un)certainty, VOI sampling).We can see that active parameter selection achieves greater classification per-formance and that it reaches its optimum at ≈ 5 samples. Furthermore, wecan observe that the classification performance begins at 0.5. Fig. 3b presentsthe results obtained using posterior distributions from the source dataset. Wefirst observe that the ‘cold start’ problem has been reduced as, without see-

433


https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

http://sipi.usc.edu/HAD/

ing examples from the new domain, active Bayesian transfer learning achievesapproximately 70% accuracy (with equal class distributions). Also of interestis the variance of predictions, and we see that active selection methods consis-tently yield lower variance than random sampling. Over all experiments, activeinstance selection out-performs baseline methods. However, we note it is difficultto give a clear guide as to which active selection method should be chosen.

4 Conclusions

As we have seen, the activity recognition in the smart-home setting provideschallenges in terms of the deployment context and accurate labelling of trainingdata, which leads to a combination of active learning and transfer learning.We have argued that hierarchical Bayesian methods are particularly well suitedto problems of this nature, and given a possible formulation of such a model.We have provided some experimental results on toy data do demonstrate theefficacy of the two components of this model. On real world data gatheredusing accelerometers, the more expensive VOI method failed to out-perform thesimpler uncertainty sampling method, although for our purposes computationalburden at the active learning stage is not a pressing issue. Our next stepswill be to deploy the various active labelling methods in the prototype smarthome, which will allow us to test the active learning framework, as well as theresident-to-resident transfer method. The house-to-house fusion and transferon multi-modal sensor network can only be tested when multiple homes areavailable which will be the focus of future work.

References

[1] J.H.M. Bergmann and A.H. McGregor. Body-worn sensor design: what do patients andclinicians want? Ann. of biomedical engineering, 39(9):2299–2312, 2011.

[2] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. JAIR,4:129–145, 1996.

[3] S Pan. Transfer learning. In Data Classification: Algorithms and Applications, pages537–570. 2014.

[4] M Opper. A Bayesian approach to on-line learning. pages 363–378. Cambridge UniversityPress, New York, NY, USA, 1998.

[5] A Wilson, A Fern, and P Tadepalli. Transfer learning in sequential decision problems: Ahierarchical Bayesian approach. In ICML, pages 217–227, 2012.

[6] Lili Zhao, Sinno Jialin Pan, Evan Wei Xiang, ErHeng Zhong, Zhongqi Lu, and QiangYang. Active transfer learning for cross-system recommendation. In Proceedings of the27th AAAI Conference on Artificial Intelligence.

[7] R Herbrich, T Graepel, and C Campbell. Bayes point machines. JMLR, 1:245–279, 2001.[8] A Kapoor, E Horvitz, and S Basu. Selective supervision: Guiding supervised learning

with decision-theoretic active learning. In IJCAI, pages 877–882, 2007.[9] P Rashidi and DJ Cook. Activity knowledge transfer in smart environments. Pervasive

Mob. Comput., 7(3):331–343, June 2011.[10] D Anguita, A Ghio, L Oneto, X Parra, and JL Reyes-Ortiz. A public domain dataset for

human activity recognition using smartphones. In ESANN, 2013.[11] M. Zhang and A.A. Sawchuk. USC-HAD: A daily activity dataset for ubiquitous activity

recognition using wearable sensors. In SAGAware, 2012.

434


Documents

Active transfer learning for activity recognition...Active transfer learning for activity recognition Tom Diethe and Niall Twomey and Peter Flach Intelligent Systems Laboratory, University