

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 7, APRIL 1, 2013 1595

Adaptive Universal Linear Filtering
Dan Garber and Elad Hazan

Abstract—We consider the problem of online estimation of an arbitrary real-valued signal corrupted by zero-mean noise using linear estimators. The estimator is required to iteratively predict the underlying signal based on the current and several last noisy observations, and its performance is measured by the mean-square-error. We design and analyze an algorithm for this task whose total square-error on any interval of the signal is equal to that of the best fixed filter in hindsight with respect to the interval, plus an additional term whose dependence on the total signal length is only logarithmic. This bound is asymptotically tight, and resolves the question of Moon and Weissman [“Universal FIR MMSE filtering,” IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1068–1083, 2009]. Furthermore, the algorithm runs in linear time in terms of the number of filter coefficients; previous constructions required at least quadratic time.

Index Terms—Filtering, FIR MMSE, logarithmic regret, online learning, regret minimization, universal filtering, unsupervised adaptive filtering.

I. INTRODUCTION

We consider the problem of filtering: designing algorithms for the causal estimation of a real-valued signal from noisy observations. The filtering algorithm observes at each iteration a noisy signal component, and is required to estimate the corresponding underlying signal component based on the current and past noisy observations alone.

We consider finite fixed-length linear filters that combine the current and several last noisy observations for prediction of the current underlying signal component. Performance is measured by the mean square error over the entire signal. Following the setting in [1], we assume that the underlying signal is an arbitrary bounded signal, possibly even adversarial, and that it is corrupted by an additive zero-mean, time-independent, bounded noise with known constant variance.

The approach taken in this paper is to construct a universal filter, i.e., an adaptive filter whose performance we compare to an optimal offline filter with full knowledge of the signal and noise. The metric of performance is thus regret: the difference between the total square error incurred by our adaptive filter and the total square error of an offline fixed filter which

Manuscript received February 23, 2012; revised June 25, 2012 and October 01, 2012; accepted November 25, 2012. Date of publication December 20, 2012; date of current version March 08, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Konstantinos Slavakis. This work was carried out and supported by the Technion-Microsoft Electronic Commerce Research Center.

The authors are with the Department of Industrial Engineering and Management, Technion—Israel Institute of Technology, Haifa 32000, Israel (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2012.2234742

is chosen with full information of the entire clean and noisy signals.

The question of competing with a fixed offline filter was successfully tackled in [1]. In this paper we consider a more challenging task: competing with the best offline changing filter, where restrictions are placed on how often this optimal offline filter is allowed to change. A more stringent metric of performance that fully captures this notion of competing with an adaptive offline benchmark is called adaptive regret: it is the maximum regret incurred by the algorithm on any subinterval of the signal. [1] asks whether there exists an algorithm that attains an adaptive regret bound whose dependence on the total length of the signal is only logarithmic. [2] gave a partial answer and described a filtering algorithm that attains an adaptive regret bound which scales quadratically in the logarithm of the signal length. However, the information-theoretic lower bound of logarithmic adaptive regret is as of yet unattained.

A. Our Results

We present and analyze simple and efficient algorithms that attain logarithmic adaptive regret. This bound is tight as shown in [3], and resolves a question posed by Moon and Weissman in [1]. Along the way, we introduce a simple universal algorithm for filtering, improving the previously known best running time from quadratic in the number of filter coefficients to linear. We also prove that in the setting under consideration in this work, knowing the variance of the noise is crucial, by showing a lower bound on the regret in case the variance is unknown.

B. Related Work

There has been much work over the years on the problem of estimating a real-valued signal from noisy observations with respect to the MMSE loss. Classical results assume a model in which the underlying signal is stochastic with some known parameters, i.e., the first and second moments, or require the signal to be stationary, such as the classical work of [4]. The special case of linear MMSE filters has received special attention due to its simplicity [5]. For more recent results on MMSE estimation see [6], [7], [8], [9].

In this work we follow the non-stochastic setting of [1]: no generating model is assumed for the underlying signal, and stochastic assumptions are made on the added noise (that it is zero-mean and time-independent with known fixed variance). In this setting, while considering finite linear filters, [1] presented an online algorithm that achieves logarithmic expected regret with respect to the entire signal. The computational complexity of their algorithm is proportional to a quadratic in the linear filter size. [2] extended the work in [1] and gave an adaptive algorithm which has a regret bound that is quadratic in the logarithm of the signal length, on any interval of the signal. Yet the question

1053-587X/$31.00 © 2012 IEEE


of whether logarithmic regret on each interval is attainable remained open.

[1] asked whether adaptive regret guarantees are possible, and suggested attacking this problem by taking “blocks” in time and treating each block as a separate single loss function. In contrast, we construct a different loss function: one which depends on the previous iterations of the algorithm itself, which is crucial for obtaining the optimal regret bound. Since our loss functions change according to the algorithm's predictions, we use the online convex optimization framework for analysis, which allows for fully adversarial loss functions.

Henceforth we build on recent results from the emerging online learning framework called online convex optimization [10], [11]. For our adaptive regret algorithm, we use tools from the framework presented in [12] to derive an algorithm that achieves logarithmic expected regret on any interval of the signal.

The rest of the paper is organized as follows. Section II is dedicated to formulating the problem setting and preliminaries. In Section III we present and analyze a simple filtering algorithm that achieves logarithmic expected regret with respect to the entire signal. In Section IV we present and analyze our main result: an adaptive algorithm that achieves logarithmic expected regret on any interval of the signal. In Section V we provide experimental evidence for our algorithms. In Section VI we prove an impossibility result for the filtering problem in case the variance of the noise is unknown, and in Section VII we give our conclusions.

II. PRELIMINARIES

A. Online Convex Optimization

In the setting of online convex optimization (OCO) with full information, an online algorithm $\mathcal{A}$ is iteratively required to make a prediction by choosing a point $x_t$ in some convex set $\mathcal{K} \subseteq \mathbb{R}^n$. The algorithm then incurs a loss $f_t(x_t)$, where $f_t : \mathcal{K} \rightarrow \mathbb{R}$ is a convex function. The emphasis in this model is that on iteration $t$, $\mathcal{A}$ has only knowledge of the loss functions $f_1, \ldots, f_{t-1}$ of previous iterations, and thus $f_t$ may be chosen arbitrarily and even adversarially according to the point $x_t$. The standard goal in this setting is to minimize the difference between the overall loss of $\mathcal{A}$ and that of the best fixed point in hindsight. This difference is called regret and it is formally given by

$$\mathrm{Regret}_T(\mathcal{A}) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).$$

A stronger measure of performance requires the algorithm to have small regret on any interval $I = [r, s] \subseteq [T]$ with respect to the best fixed point in hindsight in this interval. This measure is called adaptive regret and it is given by

$$\mathrm{AdaptiveRegret}_T(\mathcal{A}) = \sup_{I = [r,s] \subseteq [T]} \left( \sum_{t=r}^{s} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=r}^{s} f_t(x) \right).$$
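To make the regret definition concrete, here is a small illustrative instance (ours, not from the paper): online gradient descent on one-dimensional squared losses $f_t(x) = (x - z_t)^2$ over $\mathcal{K} = [-1, 1]$. Since these losses are 2-strongly-convex, step sizes $\eta_t = 1/(2t)$ are known to give regret that grows only logarithmically in $T$; all names and constants below are our own.

```python
import numpy as np

def ogd(zs):
    """Online gradient descent on f_t(x) = (x - z_t)^2 over K = [-1, 1]."""
    x, plays = 0.0, []
    for t, z in enumerate(zs, start=1):
        plays.append(x)
        x -= (1.0 / (2.0 * t)) * 2.0 * (x - z)   # eta_t = 1/(2t); grad f_t(x) = 2(x - z)
        x = min(max(x, -1.0), 1.0)               # project back onto K
    return plays

def regret(zs, plays):
    """Total loss of the plays minus the loss of the best fixed point in K."""
    zvec = np.asarray(zs)
    total = float(np.sum((np.asarray(plays) - zvec) ** 2))
    grid = np.linspace(-1.0, 1.0, 2001)          # grid search for the comparator
    best = min(float(np.sum((g - zvec) ** 2)) for g in grid)
    return total - best

rng = np.random.default_rng(0)
zs = rng.uniform(-1.0, 1.0, size=2000)
r = regret(zs, ogd(zs))   # grows like O(log T) rather than linearly in T
```

The grid-search comparator is only for illustration; in one dimension the best fixed point is simply the mean of the $z_t$.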

B. Problem Setting and Notations

Let $x_1, x_2, \ldots$ be a real-valued, possibly adversarial, signal bounded in the range $[-B_x, B_x]$. The signal is corrupted by an additive zero-mean time-independent noise $n_t$ bounded in the range $[-B_n, B_n]$ with known time-invariant variance $\sigma^2$. An estimator observes at time $t$ the noisy signal $y_t = x_t + n_t$, and is required to predict $x_t$ by taking a linear combination of the last $l$ observations, where $l$ is the order of the filter. That is, the estimator chooses at time $t$ a filter $w_t \in \mathbb{R}^l$ and predicts according to $\hat{x}_t = w_t^\top \mathbf{y}_t$, where $\mathbf{y}_t = (y_{t-l+1}, \ldots, y_t)^\top$ and $y_\tau = 0$ for $\tau \le 0$. The performance of the estimator after $T$ iterations is measured by the mean-square-error and is given by $\frac{1}{T}\sum_{t=1}^{T}(\hat{x}_t - x_t)^2$.

Our focus will be on designing online prediction algorithms that achieve regret bounds (both standard and adaptive regret) whose dependence on the signal length $T$ is only logarithmic, against any fixed filter $w$ with respect to the total square error $\sum_{t=1}^{T}(w^\top \mathbf{y}_t - x_t)^2$.

Throughout the paper we denote by $\mathbf{y}_t$ the $l$-dimensional vector $(y_{t-l+1}, \ldots, y_t)^\top$ for $t \ge 1$. Similarly we denote by $\mathbf{x}_t$ the vector $(x_{t-l+1}, \ldots, x_t)^\top$. We denote by $\mathbf{y}_{1:t}$ the vector of length $t$ whose $i$-th entry is $y_i$; that is, $\mathbf{y}_{1:t}$ is the history of the observed signal up to time $t$.
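A minimal sketch of the prediction rule just described (the function name and the zero-padding convention for times before the start of the signal are ours): the filter $w \in \mathbb{R}^l$ is applied to the $l$ most recent noisy observations.

```python
import numpy as np

def fir_predict(w, y_hist, t):
    """Predict x_t as w . (y_{t-l+1}, ..., y_t); entries before time 0 are zero."""
    l = len(w)
    window = np.zeros(l)
    lo = max(0, t - l + 1)                    # first index inside the window
    window[l - (t + 1 - lo):] = y_hist[lo:t + 1]
    return float(w @ window)
```

For example, the filter $w = (0, 1)$ simply copies the most recent noisy observation.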

C. The Importance of Knowing the Variance of the Noise

As described, throughout this work we assume that the variance of the noise is fixed on all iterations and known to the filtering algorithm. One may wonder whether this somewhat restrictive assumption is indeed necessary, or whether it may be alleviated somehow. We answer this by proving that if the underlying signal may be arbitrary, then even if an upper bound on the variance of the noise is known, any filtering algorithm whose choice of filter for iteration $t$ depends only on past iterations cannot guarantee sublinear expected regret. Formally, we have the following lemma.

Lemma 1: There exists a stochastic mechanism for the generation of the clean and noisy signals such that any filtering algorithm whose choice of filter $w_t$ for iteration $t$ depends only on $y_1, \ldots, y_{t-1}$ incurs linear expected regret.

The proof of the lemma is deferred to Section VI.

Although it may seem from Lemma 1 that the setting considered in this work may not be practical, since in practice the variance of the noise is not known in advance, we note that in applications it is possible to estimate the variance of the noise with high accuracy by sending a training sequence, known to the filtering algorithm, at the beginning of transmission. If the clean signal component $x_t$ of the training sequence at time $t$ is known, then clearly $(y_t - x_t)^2$ is an unbiased estimator of $\sigma^2$, since

$$\mathbb{E}\left[(y_t - x_t)^2\right] = \mathbb{E}\left[n_t^2\right] = \sigma^2.$$

Thus, by setting the length of the training sequence appropriately and averaging the values $(y_t - x_t)^2$ over the entire sequence, it is possible to estimate the variance of the noise up to an arbitrarily small error, using Hoeffding's concentration inequality.
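The training-sequence estimate described above can be sketched as follows (all names are ours, and the uniform noise is an arbitrary choice for the demonstration):

```python
import numpy as np

def estimate_noise_variance(x_train, y_train):
    """Unbiased estimate of sigma^2: average (y_t - x_t)^2 over a known training prefix."""
    r = np.asarray(y_train) - np.asarray(x_train)
    return float(np.mean(r ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 20000)            # training signal, known to the filter
noise = rng.uniform(-0.5, 0.5, 20000)        # bounded zero-mean noise, variance 1/12
sigma2_hat = estimate_noise_variance(x, x + noise)
```

Since the noise is bounded, Hoeffding's inequality makes the deviation of `sigma2_hat` from $\sigma^2 = 1/12$ shrink at rate $O(1/\sqrt{n})$ in the training length $n$.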


D. Learning With Unknown Losses

Online algorithms usually require knowledge of the loss function used on each iteration, or at least of its first-order derivatives. Thus, in case the clean signal $x_t$ is observable to the online algorithm, minimizing the regret and the adaptive regret is fairly easy using the framework of OCO with the loss functions $\ell_t(w) = (w^\top \mathbf{y}_t - x_t)^2$. However, in our case the algorithm only observes the noisy signal and does not have access to the exact loss or its derivatives.

The standard trick in this case is to use unbiased estimates of the loss function or its derivatives, which results in minimizing the expected regret instead of the actual regret. In this work we use the same estimator as in [1] for the loss $\ell_t$. Let $w_l$ denote the last coordinate of a filter $w$ (the coefficient multiplying $y_t$). We define the loss function

$$\hat{\ell}_t(w) = \left(w^\top \mathbf{y}_t - y_t\right)^2 + 2\sigma^2 w_l - \sigma^2.$$

Lemma 2: Let $w_1, \ldots, w_T$ and $u_1, \ldots, u_T$ be two sequences of filters such that for all $t$, $w_t$ and $u_t$ depend only on the observations $y_1, \ldots, y_{t-1}$. It holds that

$$\mathbb{E}\left[\sum_{t=1}^{T}\left(\hat{\ell}_t(w_t) - \hat{\ell}_t(u_t)\right)\right] = \mathbb{E}\left[\sum_{t=1}^{T}\left(\ell_t(w_t) - \ell_t(u_t)\right)\right],$$

where the expectation is taken over the noise. The proof is given in the Appendix.

Thus, by using the estimated loss functions $\hat{\ell}_t$, a simple gradient descent algorithm such as [10] immediately gives an $O(\sqrt{T})$ bound on the expected regret as well as on the adaptive expected regret with respect to the true losses $\ell_t$, as long as we limit the choice of the filter to a set of bounded size.
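The following Monte Carlo sketch checks a surrogate of this form numerically. The surrogate written here, $\hat{\ell}(w) = (w^\top \mathbf{y} - y_t)^2 + 2\sigma^2 w_l - \sigma^2$, is our reconstruction and may differ in form from the estimator of [1], but it has the same expectation as the true loss when $w$ does not depend on the current noise:

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.array([0.2, -0.1, 0.5])           # fixed filter, chosen before the noise
x_win = np.array([0.3, -0.4, 0.7])       # clean window; x_t is its last entry
sigma2 = 1.0 / 12.0                      # variance of U(-0.5, 0.5) noise

noise = rng.uniform(-0.5, 0.5, size=(200_000, 3))
y = x_win + noise                        # 200k sampled noisy windows
# true loss (uses the hidden clean x_t) vs. the observable surrogate
true_mean = float(np.mean((y @ w - x_win[-1]) ** 2))
est_mean = float(np.mean((y @ w - y[:, -1]) ** 2 + 2 * sigma2 * w[-1] - sigma2))
```

The two sample means agree up to Monte Carlo error, illustrating that the surrogate can be minimized using noisy observations alone.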

E. Strongly-Convex and Exp-Concave Losses

Given a function $f : \mathbb{R}^l \rightarrow \mathbb{R}$, we denote by $\nabla f(w)$ the gradient vector of $f$ at point $w$ and by $\nabla^2 f(w)$ the matrix of second derivatives, also known as the Hessian, of $f$ at point $w$.

$f$ is called $H$-strongly-convex, for some $H > 0$, if for all $w$ it holds that $\nabla^2 f(w) \succeq H \cdot I$, where $I$ is the identity matrix of proper dimension. That is, all the eigenvalues of $\nabla^2 f(w)$ are lower bounded by $H$ for all $w$.

$f$ is called $\alpha$-exp-concave, for some $\alpha > 0$, if the function $e^{-\alpha f(w)}$ is a concave function of $w$. It is easy to show that given a function $f$ that is $H$-strongly-convex and satisfies $\|\nabla f(w)\| \le G$ for all $w$, it holds that $f$ is $(H/G^2)$-exp-concave.

In case all loss functions are $H$-strongly-convex or $\alpha$-exp-concave for some constants $H, \alpha > 0$, there exist algorithms that are known to achieve logarithmic regret and logarithmic adaptive regret [11], [12].

In our case, the Hessian of the loss function $\hat{\ell}_t(w)$ is given by the random matrix $2\mathbf{y}_t\mathbf{y}_t^\top$, which is positive semidefinite, and it holds that

$$\mathbb{E}\left[\nabla^2 \hat{\ell}_t(w)\right] = 2\,\mathbb{E}\left[\mathbf{y}_t\mathbf{y}_t^\top\right] = 2\left(\mathbf{x}_t\mathbf{x}_t^\top + \sigma^2 I\right) \succeq 2\sigma^2 I. \quad (1)$$

Nevertheless, in the worst case $\hat{\ell}_t$ need not be strongly-convex or exp-concave, and thus algorithms such as [11], [12] cannot be directly applied in order to get logarithmic expected regret and logarithmic adaptive expected regret.

III. ONLINE GRADIENT DESCENT-BASED FILTER

In this section we describe how the problem of the loss functions $\hat{\ell}_t$ not necessarily being strongly-convex or exp-concave can be overcome, and introduce a simple online gradient descent (OGD) algorithm, based on [11], that achieves logarithmic expected regret with respect to the entire signal. Later on we use this algorithm as a building block in our adaptive algorithm.

Our technique builds on the simple idea of updating the filter coefficients every several iterations (we refer to this as a block) instead of after each single iteration, as is usually done in online algorithms. We show that by summing several consecutive single-iteration losses and adding an appropriate regularization term, we get new loss functions that are always strongly convex, and that optimizing the regret with respect to these new functions also optimizes the expected regret with respect to the true loss functions $\ell_t$.

Given a block length $k$ and a filter $u_\tau$ that depends only on the observations preceding block $\tau$, we define the following loss function for block $\tau$ and for any filter $w$:

$$g_\tau(w) = \sum_{t \in B_\tau} \hat{\ell}_t(w) + \lambda \left\|w - u_\tau\right\|^2, \quad (2)$$

where $B_\tau = \{(\tau - 1)k + 1, \ldots, \tau k\}$ denotes the set of iterations of block $\tau$ and $\lambda > 0$ is a regularization coefficient set in the analysis.

The vector $u_\tau$ will be chosen to be the filter that was used by the online algorithm itself for prediction on the entire block $\tau$. This choice may seem strange at first glance, since it actually cancels the effect of the regularization term $\lambda\|w - u_\tau\|^2$ on the decisions of the algorithm when we set $u_\tau = w_\tau$. Nevertheless, we show that this regularization term is useful for deriving generic regret bounds with little effort, and in fact it will be crucial in the analysis of our adaptive algorithm.

Our OGD-based filtering algorithm is given below. We have for it the following guarantee.

Algorithm 1: OGD-Filter

1: Input: block length $k$, radius $C$, step sizes $\{\eta_\tau\}$

2: Let $w_1$ be an arbitrary point in $\mathcal{K} = \{w \in \mathbb{R}^l : \|w\| \le C\}$

3: for $\tau = 1, 2, \ldots$ do

4: for $t = (\tau - 1)k + 1, \ldots, \tau k$ do

5: predict: $\hat{x}_t = w_\tau^\top \mathbf{y}_t$

6: end for

7: $w'_{\tau+1} = w_\tau - \eta_\tau \nabla g_\tau(w_\tau)$ (gradient update step)

8: $w_{\tau+1} = \arg\min_{w \in \mathcal{K}} \|w - w'_{\tau+1}\|$ (projection step)

9: end for
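A minimal Python sketch of the block-based scheme above. This is our own reconstruction, not the authors' reference implementation: the step-size schedule, the radius `C`, the surrogate-loss gradient, and all parameter defaults are assumptions.

```python
import numpy as np

def ogd_filter(y, l=4, k=8, sigma2=1.0 / 12.0, C=2.0):
    """Block-based OGD on surrogate losses; returns the predictions xhat_t.

    The filter is held fixed within each block of k iterations and updated
    once per block from the accumulated surrogate gradients, then projected
    back onto {w : ||w|| <= C}. The regularization term of the block loss is
    centered at the filter in use, so its gradient vanishes at the update
    point and is omitted here.
    """
    T = len(y)
    w = np.zeros(l)
    preds = np.empty(T)
    grad_sum = np.zeros(l)
    for t in range(T):
        win = np.zeros(l)
        m = min(l, t + 1)
        win[l - m:] = y[t + 1 - m:t + 1]
        preds[t] = w @ win
        # gradient of the surrogate (w.win - y_t)^2 + 2*sigma2*w[-1] - sigma2
        g = 2.0 * (w @ win - y[t]) * win
        g[-1] += 2.0 * sigma2
        grad_sum += g
        if (t + 1) % k == 0:                       # end of block tau
            tau = (t + 1) // k
            eta = 1.0 / (2.0 * sigma2 * k * tau)   # ~ 1/(strong convexity * tau)
            w = w - eta * grad_sum
            nrm = np.linalg.norm(w)
            if nrm > C:
                w *= C / nrm                       # projection step
            grad_sum[:] = 0.0
    return preds
```

Because the filter stays inside the ball of radius `C` and the observation windows are bounded, every prediction is bounded by Cauchy-Schwarz regardless of the signal.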

Theorem 1: Let $w_t$ be the filter used by Algorithm 1 for prediction at time $t$, with the block length $k$, the radius $C$, and the step sizes $\{\eta_\tau\}$ set as in the analysis below. Then Algorithm 1 runs in $O(l)$ time per iteration and achieves an expected regret bound of the form

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(w_t) - \min_{w \in \mathcal{K}} \sum_{t=1}^{T} \ell_t(w)\right] = O\!\left(\frac{\log T}{\sigma^2}\right),$$

where the hidden constant depends polynomially on $l$, $C$, and the bounds on the signal and the noise.

Note that the regret bound in Theorem 1 depends on the inverse of the noise variance $\sigma^2$. This is not surprising since, as we show, the strong convexity parameter of the loss functions $g_\tau$ scales with $\sigma^2$, and the regret of the online gradient descent algorithm which we apply is inversely dependent on this strong convexity parameter. Thus, in the case of large variance it is easier for our algorithm to compete with the best fixed filter in hindsight. In case the variance goes to zero, so does the strong convexity parameter of the losses $g_\tau$, and the regret bound tends to explode. This is also an issue with the algorithms in [1], [2]. However, since we assume that the variance is known, in such a case one can apply the online gradient descent algorithm in [10] instead of that in [11] and still guarantee a regret bound that scales with $O(\sqrt{T})$ instead of $O(\log T/\sigma^2)$.

Compared to the universal algorithm in [1], the dependence of the regret bound in Theorem 1 on constants is worse by a constant multiplicative factor, but as our experiments suggest, the performance of Algorithm 1 in practice may be much better than suggested by Theorem 1.

We now turn to the analysis of the above algorithm. We begin by showing that the loss functions $g_\tau$ are always strongly-convex.

Claim 1: For any block length $k$, the function $g_\tau$ is always $2\lambda$-strongly-convex, where $\lambda$ is the regularization coefficient in (2).

Proof: For any $w$,

$$\nabla^2 g_\tau(w) = \sum_{t \in B_\tau} 2\mathbf{y}_t\mathbf{y}_t^\top + 2\lambda I \succeq 2\lambda I. \quad (3)$$

The following lemma plays a key part in our analysis. It shows that for an algorithm that predicts on block $\tau$ according to the filter $u_\tau$, achieving low expected regret with respect to the losses $g_\tau$ implies achieving low expected regret with respect to the true losses $\ell_t$.

Lemma 3: Assume that $T = km$ for some integer $m$. Let $u_1, \ldots, u_m$ and $v_1, \ldots, v_m$ be two sequences of vectors in $\mathcal{K}$ such that $u_\tau$ depends only on the observations preceding block $\tau$, for all $\tau$. Denote by $g_\tau$ the block losses of (2) centered at $u_\tau$, for all $\tau$. Then the expected regret with respect to the true losses $\ell_t$ of playing $u_\tau$ on each block $\tau$ against playing $v_\tau$ is bounded by the corresponding expected regret with respect to the block losses $g_\tau$, up to the expected contribution of the regularization terms.

Proof: First we note that if $T = km + r$ for some $0 < r < k$, the regret on the $r$ additional iterations is a constant independent of $T$, since the loss on any single iteration is bounded by a constant, and we ignore it in our regret bounds.

By the definition (2), the difference $\sum_\tau g_\tau(u_\tau) - \sum_\tau g_\tau(v_\tau)$ equals the corresponding difference of the summed estimated losses $\hat{\ell}_t$ plus the difference of the regularization terms, which vanish at $w = u_\tau$. (4)

Since both $u_\tau$ and $v_\tau$ depend only on the random variables observed before block $\tau$, they are independent of the noise inside the block, and using (1) the expected regularization terms can be bounded in terms of the expected Hessians of the block losses, where for two matrices $A, B$ we define the product $A \bullet B$ to equal $\sum_{i,j} A_{ij}B_{ij}$. Overall, by taking expectation over (4), the lemma now follows from Lemma 2.

In order to derive regret bounds with precise dependence on the problem parameters, we need a bound on the gradients of the losses $g_\tau$ over $\mathcal{K}$. The gradient vector is given by

$$\nabla g_\tau(w) = \sum_{t \in B_\tau} \nabla \hat{\ell}_t(w) + 2\lambda\left(w - u_\tau\right).$$

It can be verified by simple algebra that, since we always use $u_\tau = w_\tau$, it holds that $\|\nabla g_\tau(w_\tau)\| \le G$ for a constant $G$ depending polynomially on $k$, $C$, and the bounds on the signal and the noise, where $C$ is a bound on the magnitude of the filter. That is, we consider only filters $w$ such that $\|w\| \le C$; $C$ needs to be bounded since the regret scales with $C$.

The minimizer of $g_\tau$ is obtained by setting $\nabla g_\tau(w) = 0$, a linear system in $w$, and one can verify that it lies within a bounded set, so that the projection step does not hurt the analysis.

We can now easily prove Theorem 1.

Proof: As stated before, we assume that $T = km$ for some integer $m$. Applying the OGD analysis from [11] with respect to the strongly-convex loss functions $g_\tau$ and the values of the parameters stated in the theorem, we have for all $w$ such that $\|w\| \le C$ that

$$\sum_{\tau=1}^{m} g_\tau(w_\tau) - \sum_{\tau=1}^{m} g_\tau(w) = O\!\left(\frac{G^2}{\lambda}\log m\right).$$

Since Algorithm 1 updates its filter every $k$ iterations and $w$ is fixed, we can apply Lemma 3 (with $v_\tau = w$ for all $\tau$) and get the corresponding bound on the expected regret with respect to the true losses $\ell_t$. The theorem now follows from plugging in the values of the parameters stated in the theorem and the observation that $m = T/k$.

We mention in passing that high-probability guarantees on the actual regret, similar to those in [1, Theorem 1(b)], could also be derived for Algorithm 1, but we omit the details.

IV. ADAPTIVE ALGORITHM

In this section we present an algorithm that is based on Algorithm 1 and the framework from [12], and achieves logarithmic expected regret on any interval $I = [r, s] \subseteq [T]$. Our algorithm is given below.

Algorithm 2: AdaptiveFilter

1: Input: block length $k$, exp-concavity parameter $\alpha$.

2: Let $m = \lceil T/k \rceil$.

3: Let $A_1, \ldots, A_m$ be online filtering algorithms.

4: Let $p^{(1)}_1 = 1$; expert $A_j$ is active from block $j$ onwards.

5: for $\tau = 1, \ldots, m$ do

6: $w^{(j)}_\tau \leftarrow$ the filter chosen by $A_j$ given the history, for all $j \le \tau$ (update the filter of the $j$'th algorithm).

7: $w_\tau = \sum_{j \le \tau} p^{(j)}_\tau w^{(j)}_\tau$.

8: for $t = (\tau - 1)k + 1, \ldots, \tau k$ do

9: predict: $\hat{x}_t = w_\tau^\top \mathbf{y}_t$.

10: end for

11: $\hat{p}^{(j)}_{\tau+1} = p^{(j)}_\tau e^{-\alpha g_\tau(w^{(j)}_\tau)} \big/ \sum_{i \le \tau} p^{(i)}_\tau e^{-\alpha g_\tau(w^{(i)}_\tau)}$ for $j \le \tau$,

12: $p^{(\tau+1)}_{\tau+1} = \frac{1}{\tau+1}$ (adding expert $A_{\tau+1}$).

13: $p^{(j)}_{\tau+1} = \left(1 - \frac{1}{\tau+1}\right)\hat{p}^{(j)}_{\tau+1}$ for $j \le \tau$.

14: end for

As in the previous section, the algorithm considers the iterations in disjoint blocks of length $k$ and updates its filter after each block. The algorithm considers a set of $m$ experts, denoted by $A_1, \ldots, A_m$, where each expert is a filtering algorithm that updates its filter every $k$ iterations. Expert $A_j$ starts making predictions from block number $j$ and onwards, and is assumed to achieve small expected regret on each interval starting at block $j$ with respect to the true square loss. The algorithm maintains a distribution over these experts which is used to combine their individual filters into a single filter via a weighted sum (line 7). The algorithm updates this distribution after each block by an exponential update rule with respect to the losses $g_\tau$, and also introduces a new expert that starts making predictions from that point on (lines 11–13). After each block, each active expert also updates its own filter according to the history (line 6).

We denote by $w^{(j)}_\tau$ the filter chosen by expert $A_j$ for prediction in block $\tau$, and by $p^{(j)}_\tau$ the weight assigned to expert $A_j$ in block $\tau$. $w_\tau$ denotes the weighted average of the filters of all experts, which is the filter used for prediction in line 9. The input parameter $\alpha$ denotes the exp-concavity parameter of the loss functions $g_\tau$.
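The weight-update logic of lines 7 and 11–13 can be sketched in isolation, in the style of the "follow the leading history" framework of [12]; the expert filters themselves are stubbed out here, and all names are ours.

```python
import numpy as np

def mix_step(weights, expert_filters, expert_losses, alpha):
    """One block of the mixture update.

    weights: current distribution over active experts (sums to 1).
    expert_filters: one filter vector per active expert.
    expert_losses: block loss of each expert's filter on the finished block.
    Returns (combined_filter_used, new_weights_including_new_expert).
    """
    w = sum(p * f for p, f in zip(weights, expert_filters))  # line 7: weighted sum
    # line 11: exponential update of the existing experts' weights
    upd = np.asarray(weights) * np.exp(-alpha * np.asarray(expert_losses))
    upd = upd / upd.sum()
    # lines 12-13: the new expert gets weight 1/(n), the rest are scaled down
    n = len(weights) + 1
    new_weights = np.append((1.0 - 1.0 / n) * upd, 1.0 / n)
    return w, new_weights
```

Starting from a single expert, one update yields the distribution $(1/2, 1/2)$ over the old and the newly added expert, regardless of the loss, since the exponential update of a singleton renormalizes to 1.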


Algorithm 2 has the following guarantee.

Theorem 2: Let the block length $k$ and the exp-concavity parameter $\alpha$ be set as in the analysis below, and assume that each expert $A_j$ is a filtering algorithm that predicts according to Algorithm 1 from block $j$ onwards. Let $w_t$ be the filter used by Algorithm 2 for prediction at time $t$. Then Algorithm 2 runs in time linear in $l$ per active expert per iteration, and for any interval $I = [r, s] \subseteq [T]$ it achieves an expected regret bound of the form

$$\mathbb{E}\left[\sum_{t=r}^{s} \ell_t(w_t) - \min_{w \in \mathcal{K}} \sum_{t=r}^{s} \ell_t(w)\right] = O\!\left(\frac{\log T}{\sigma^2}\right),$$

where the hidden constant depends polynomially on $l$, $C$, and the bounds on the signal and the noise.

The regret analysis is based on showing that for each interval $I = [r, s]$, denoting by $\tau_1$ the block containing $r$, Algorithm 2 achieves small expected regret with respect to expert $A_{\tau_1}$ on this interval; thus, since the algorithm played by expert $A_{\tau_1}$ achieves small expected regret on this interval, we have that Algorithm 2 itself achieves small expected regret on this interval.

For the proof of Theorem 2 we need the following claim from [12] (Claim 3.1).

Claim 2: Let $[\tau_1, \tau_2]$ be an interval of consecutive blocks of length $k$. Then it holds that

$$\sum_{\tau=\tau_1}^{\tau_2} g_\tau(w_\tau) - \sum_{\tau=\tau_1}^{\tau_2} g_\tau\!\left(w^{(\tau_1)}_\tau\right) = O\!\left(\frac{\log m}{\alpha}\right).$$

That is, Algorithm 2 incurs small regret on the block interval $[\tau_1, \tau_2]$ with respect to expert $A_{\tau_1}$ and the losses $g_\tau$.

We can now prove Theorem 2.

Proof: First we assume that $r$ and $s$ are aligned with block boundaries. If this is not the case, and for instance $s = \tau_2 k + r'$ for some $0 < r' < k$, then since $k$ is chosen to be a constant independent of $T$ and the loss on each iteration is bounded, we have that the additional loss on the extra iterations is a constant independent of $T$, and thus does not hurt our regret bound.

Note that from previous discussions, for the value of $\alpha$ stated in the theorem, the losses $g_\tau$ are indeed all $\alpha$-exp-concave. Furthermore, by simple algebra it can be verified that $w_\tau \in \mathcal{K}$ for all $\tau$, being a convex combination of filters in $\mathcal{K}$.

According to Claim 2, Algorithm 2 achieves small regret with respect to expert $A_{\tau_1}$ and the losses $g_\tau$ on the block interval. Thus, by Lemma 3, the same holds in expectation with respect to the true losses $\ell_t$, where $v_\tau = w^{(\tau_1)}_\tau$ for all $\tau$.

Fig. 1. Average MSEs for the nonlinear stochastic signal (5) over 50 experiments; panels (a)–(c) correspond to three settings of the filter length and the noise variance.

Since each of the experts $A_1, \ldots, A_m$ plays according to Algorithm 1, by Theorem 1 we have that expert $A_{\tau_1}$ achieves logarithmic expected regret on the interval. Combining the last two bounds yields the theorem.

As in the previous section, we mention that certain high-probability guarantees could also be derived for the adaptive regret, but we omit the details.

V. EXPERIMENTAL RESULTS

Non-Adaptive Setting: We compare the performance of our non-adaptive online gradient descent algorithm, Algorithm 1 (ALG1), to that of the universal algorithm from [1] (WM) and to the best fixed filter in hindsight of length $l$ (BF). We first consider the scenario in which the clean signal is the following nonlinear stochastic signal (also considered in [1]),

(5)

with driving-noise parameters as in [1]. We generated a clean signal of length 10000 and then generated the noisy signal $y_t = x_t + n_t$ with zero-mean Gaussian noise (note that although Gaussian variables may be unbounded, every realization of them is bounded, and thus we may assume the signals are bounded; for a more in-depth discussion see [1, Section VII]). We experimented with different values of the filter length parameter $l$ and the variance of the added noise $\sigma^2$, where for each choice of parameters we averaged the MSE of the three algorithms over 50 experiments, regenerating both the clean signal and the noisy signal in each experiment. The results appear in Fig. 1.
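The best-fixed-filter-in-hindsight baseline (BF) used in these comparisons can be computed offline by ordinary least squares over the windowed noisy observations; a sketch with our own naming and zero-padding convention:

```python
import numpy as np

def best_fixed_filter(x, y, l):
    """Offline baseline: argmin_w sum_t (w . y_win(t) - x_t)^2 via least squares."""
    T = len(y)
    Y = np.zeros((T, l))
    for t in range(T):                    # row t holds (y_{t-l+1}, ..., y_t)
        m = min(l, t + 1)
        Y[t, l - m:] = y[t + 1 - m:t + 1]
    w, *_ = np.linalg.lstsq(Y, np.asarray(x), rcond=None)
    return w
```

This filter is the comparator in the regret bounds: it is chosen with full knowledge of both the clean and the noisy signals.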


Fig. 2. Average MSEs for the Henon map signal (6) over 50 experiments; panels (a)–(c) correspond to three settings of the filter length and the noise variance.

It is observable that in all three experiments, the performance of Algorithm 1 approaches that of the best fixed filter in hindsight, with convergence speed that is notably faster compared to the universal algorithm from [1]. Moreover, the experiments suggest that the actual dependence of Algorithm 1 on the parameters may be better than that of [1] for certain underlying signals (these results also hold for intermediate values of the parameters). For instance, ALG1 exhibits better empirical dependency on the noise variance than WM, despite having a regret bound with worse theoretical dependency on it than the bound in [1].

In the second scenario, the clean signal we consider is a deterministic nonlinear signal known as the Henon map (also considered in [1]),

$$x_{t+1} = 1 - a x_t^2 + b x_{t-1}, \quad (6)$$

where $a = 1.4$ and $b = 0.3$ (the standard Henon map parameters). We repeated the same experiments as in the first scenario. The results appear in Fig. 2 and are similar to those of the previous scenario.

Overall, the above experiments provide evidence that our non-adaptive scheme successfully competes with the best fixed filter in hindsight and may improve over the current non-adaptive state-of-the-art scheme from [1], exhibiting faster convergence rates (although having the same theoretical dependency on $T$) and better dependency on the parameters for certain underlying signals. The experiments also suggest that the actual dependence of Algorithm 1 on the setting constants may be better than stated in Theorem 1. One explanation for these phenomena is that the regret bound of the OGD algorithm is too pessimistic in terms of its dependency on constants, considering the worst possible case, when for many signals the bound could be much better. Another possible explanation is that Lemma 3 (used in the proof of Theorem 1) suggests that the expected regret of Algorithm 1 could be strictly better (but not worse) than stated in Theorem 1. As shown, this gap between the expected regret

and the bound in theorem 1 depends on the specific underlying signal at hand.

Adaptive Setting: We now turn to evaluate the performance of our adaptive algorithm, algorithm 2 (ALG2). We compare its performance to that of its non-adaptive counterpart (ALG1), the universal filter from [1] (WM), the switching universal filter from [2] (AWM), which, similarly to ALG2, combines the filters of several algorithms, each of which is based on the universal algorithm WM, the best fixed filter in hindsight (BF), and the best switching filter in hindsight (BS), which is fully optimized for the clean and noisy signals and for the switching times.

Regarding the implementation of the adaptive algorithms,

since the adaptive algorithm considered in this work, as well as that of [2], must maintain a growing number of experts (the number of experts grows with the time index), which makes long simulations infeasible, we instead maintain only 100 experts for these algorithms at each time step, keeping on each iteration only the experts with the largest weights.

We also note that although both adaptive algorithms view the input signal in blocks, when the block-length parameter of the algorithm in [2] is set exactly as required for its theoretical regret guarantee to hold, the algorithm gives very poor results, and we therefore omitted this configuration from our experiments. Instead, we set its block-length to a value that has no theoretical guarantee but was shown in [2] to produce good results in simulations. For our adaptive algorithm we used the block-length stated in theorem 2, which does carry a theoretical guarantee.

In all of the considered scenarios we fixed the length of the

filter and passed all clean signals through the additive noise channel. We generated signals of length 10000 and averaged the MSE of all algorithms over 30 independently-generated experiments. We note that for the chosen signal length, the corresponding term in the regret bounds is only a small constant (in comparison to the theoretical dependency of the algorithms on the other parameters), and thus the better dependency of the regret bound of ALG2 will not necessarily be observable in the simulations. Larger signal lengths, for which this term is substantial, are infeasible to simulate.

In the first scenario we consider, the clean signal is the following linear model, also considered in [2], which switches every 2500 iterations between

and

The results appear in Fig. 3.

It is observable that both adaptive schemes (ALG2 and AWM) perform at least as well as the best fixed filter in hindsight (BF), and both improve over their non-adaptive counterparts, having a smaller MSE with respect to the entire signal. AWM achieves a slightly better MSE than ALG2 (a 3.4% improvement), but its convergence is slow: it outperforms its non-adaptive counterpart only during the last third of the signal. This slow convergence could be explained by the additional regret imposed by the experts scheme that is used to combine the different stationary filters into a single filter (this additional


Fig. 3. MSEs for switching linear signal averaged over 30 experiments.

Fig. 4. MSEs for switching non-linear signal averaged over 30 experiments.

regret need not affect ALG2 and AWM in the same way, since it depends on the experts). The non-adaptive algorithm WM keeps a lower MSE than that of ALG1 during the entire time (although the final MSE is about the same), which explains the overall better performance of AWM in comparison to ALG2.

In the second scenario we consider, the clean signal is generated by the following non-linear model, which switches every 2500 iterations between

and

The results appear in Fig. 4.

It is observable that both adaptive schemes (ALG2 and AWM) outperform the best fixed filter in hindsight. ALG2 notably outperforms all algorithms except the best switching filter in hindsight, including AWM. It is also notable that the non-adaptive algorithm ALG1 performs no worse than the adaptive algorithm AWM (keeping a lower MSE during the entire signal) and outperforms the best fixed filter (BF). The

Fig. 5. MSEs for switching mixed signal averaged over 30 experiments.

difference between the performance of ALG2 relative to AWM in this scenario and in the previous one could be due to ALG1 indeed being more suitable than WM for filtering non-linear signals, which is also supported by our non-adaptive experiments.

In the third scenario we consider, the clean signal is generated from a mixed model that switches every 2500 iterations between the following linear signal

and the following non-linear signal

The results appear in Fig. 5 and are similar to those of the non-linear case.

Overall, the above experiments provide evidence that our adaptive scheme improves over its non-adaptive counterpart and can approach the performance of an optimal switching benchmark while outperforming any fixed filter. It also manages, in certain scenarios, to outperform the current state-of-the-art adaptive scheme, which is a heuristic based on the algorithm in [2], even though our scheme's better theoretical dependency was not substantial at the relatively short signal lengths considered. We also note that our non-adaptive algorithm (ALG1) exhibits good performance with respect to switching signals as well, beating the best fixed filter (BF) in two out of three experiments.
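For concreteness, the test signals of this section can be generated as in the sketch below. The exact model constants are not reproduced here, so the Hénon parameters (the standard a = 1.4, b = 0.3), the AR coefficients of the switching model, the noise level, and all function names are illustrative assumptions; only the 2500-iteration switching period and the signal length of 10000 come from the text.

```python
import numpy as np

def henon_signal(T, a=1.4, b=0.3):
    """Deterministic non-linear Henon-map signal (standard parameters assumed)."""
    x = np.zeros(T)
    x[0], x[1] = 0.1, 0.1  # arbitrary initial conditions inside the attractor's basin
    for t in range(2, T):
        x[t] = 1.0 - a * x[t - 1] ** 2 + b * x[t - 2]
    return x

def switching_linear_signal(T, period=2500, rng=None):
    """AR(2) signal whose coefficients switch every `period` iterations.
    The two coefficient pairs are illustrative (both are stable); the originals are elided."""
    rng = np.random.default_rng(0) if rng is None else rng
    coeffs = [(0.5, -0.3), (-0.4, 0.2)]       # assumed stable AR(2) coefficient pairs
    x = np.zeros(T)
    w = 0.1 * rng.standard_normal(T)          # small driving noise
    for t in range(2, T):
        a1, a2 = coeffs[(t // period) % 2]
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + w[t]
    return x

def observe(x, sigma=0.5, rng=None):
    """Noisy observations: additive zero-mean noise (+/- sigma) with known variance sigma^2."""
    rng = np.random.default_rng(1) if rng is None else rng
    return x + sigma * (2 * rng.integers(0, 2, size=len(x)) - 1)
```

Each algorithm is then run on `observe(signal)` and its MSE against the clean `signal` is averaged over 30 independent noise draws, as described above.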

VI. IMPOSSIBILITY RESULT

In this section we prove lemma 1. Our argument is similar to the one given in [13, Theorem 9], but since the setting considered in [13] differs from the one considered in this work, we reformulate the argument and give a simplified proof suitable for our setting.

Our argument considers two possible scenarios. In the first, there is no noise at all and the underlying signal is stochastic. In the second, the underlying signal is fixed on all iterations and the noise is stochastic. By ensuring that in both scenarios the observed noisy signal is distributed identically,


we make it impossible for any algorithm to distinguish between the two scenarios. We then show that no algorithm can guarantee sublinear expected regret with respect to both scenarios simultaneously. Thus, a random policy that chooses the first scenario with probability 1/2 and the second scenario with probability 1/2 ensures that any algorithm incurs linear expected regret.

We now give the formal proof.

Proof: Throughout the proof we fix the filter length and assume that the signal and noise are bounded on all iterations by constants satisfying the assumptions of our setting.

We consider the following two scenarios. In the first, there is no noise at all (the noise component is identically zero) and the signal component is a discrete random variable sampled on each iteration uniformly from $\{-1,+1\}$ (thus it is zero-mean with variance 1). In the second scenario, the signal is fixed at zero on all iterations and the noise component on each iteration is a discrete random variable sampled uniformly from $\{-1,+1\}$. Notice that in both scenarios the observed signal is distributed identically, and thus the two scenarios are indistinguishable based on the noisy observations alone.

We begin by examining the expected loss of an optimal fixed

filter in both scenarios.

For the first scenario, it holds for any fixed filter that

Thus an appropriate fixed filter has zero expected loss with respect to the first scenario. Similarly, in the second scenario, assuming the filter is fixed, it holds that

Thus an appropriate fixed filter has zero expected loss with respect to the second scenario.

We have shown that in both scenarios the expected loss of the best fixed filter is zero, so the expected regret of the algorithm is just its own expected loss; it thus suffices to show that no algorithm can ensure that its expected loss grows sublinearly with the number of iterations in both scenarios.

Turning to the expected loss of the filtering algorithm, in the

first scenario it holds that

where in the last equality we used the independence of the corresponding random variables. It follows that

(7)

where the last inequality follows from the convexity of the square function, using a suitable shorthand notation.

In the second scenario, we have by similar arguments that

(8)

where again we use the same shorthand notation.

Thus, for a given filtering algorithm, either by (7) it incurs regret at least linear in the number of iterations in the first scenario, or otherwise by (8) it incurs regret at least linear in the number of iterations in the second scenario.

Thus, by choosing each of the two scenarios with equal probability, the expected regret of any algorithm is at least linear in the number of iterations.
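To make the construction concrete: a natural instantiation consistent with the proof takes the signal uniform on {-1, +1} with no noise in the first scenario, and the signal identically zero with noise uniform on {-1, +1} in the second; the specific sets and values here are assumptions for illustration. The sketch below verifies that the observations are identically distributed, that each scenario admits a zero-loss fixed filter, and yet that every single filter pays a combined average loss of at least 1/2.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 100_000

# Scenario 1: stochastic clean signal, zero noise.
x1 = rng.choice([-1.0, 1.0], size=T)   # zero-mean, variance 1
y1 = x1.copy()                         # observation = clean signal

# Scenario 2: clean signal identically zero, stochastic noise.
x2 = np.zeros(T)
y2 = rng.choice([-1.0, 1.0], size=T)   # observation = pure noise

# The observations are identically distributed in both scenarios,
# so no causal algorithm can tell them apart.
assert set(np.unique(y1)) == set(np.unique(y2)) == {-1.0, 1.0}

# Per scenario there exists a fixed filter with zero loss ...
assert np.mean((x1 - 1.0 * y1) ** 2) == 0.0   # filter "copy the observation"
assert np.mean((x2 - 0.0 * y2) ** 2) == 0.0   # filter "output zero"

# ... but no single filter h works in both scenarios: the combined loss
# (1 - h)^2 + h^2 is at least 1/2 for every h, i.e. linear total regret.
for h in np.linspace(-1.0, 1.0, 21):
    loss1 = np.mean((x1 - h * y1) ** 2)   # equals (1 - h)^2 exactly, since x1^2 = 1
    loss2 = np.mean((x2 - h * y2) ** 2)   # equals h^2 exactly, since y2^2 = 1
    assert loss1 + loss2 >= 0.5 - 1e-12
```

The 1/2 lower bound on the combined loss translates directly into the linear expected regret of the randomized choice between the two scenarios.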

VII. CONCLUSIONS

We have described an adaptive universal filtering algorithm which attains a tight adaptive regret bound, answering the questions of [1] and [2]. Furthermore, the new algorithm is more efficient by a leading-order term. Along the way, we have developed a new algorithm that achieves a logarithmic bound for the standard regret. Our theoretical findings are supported by experimental results showing that our algorithms achieve state-of-the-art performance.

Our upper bounds are complemented by an impossibility result, showing that for an arbitrary signal corrupted by zero-mean noise, knowledge of the second moment of the noise is not only sufficient but also necessary in order to obtain any non-trivial regret bound.

APPENDIX A
PROOF OF LEMMA 2

Proof:

(9)


Since the noise component is independent of the remaining terms and has zero mean, we have that

(10)

Plugging (10) into (9), we have

Thus we have that

The lemma now follows from applying the last equality with respect to both filters on all iterations.
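A standard identity of this kind in universal FIR filtering (cf. [1]) states that, because the noise is zero-mean with known variance sigma^2 and independent across time, the observable loss $(w^\top y_{\mathrm{window}} - y_t)^2$, corrected by the term $(2w_1 - 1)\sigma^2$, has the same expectation as the unobservable loss $(w^\top y_{\mathrm{window}} - x_t)^2$. The Monte-Carlo sketch below assumes this form of the correction term; the signal, filter, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
k, T, sigma = 4, 100_000, 0.5

x = np.sin(0.05 * np.arange(T + k))      # arbitrary bounded clean signal
w = np.array([0.6, -0.2, 0.1, 0.3])      # arbitrary fixed FIR filter (w[0] acts on y_t)

n = sigma * rng.choice([-1.0, 1.0], size=T + k)   # zero-mean noise, variance sigma^2
y = x + n                                         # noisy observations

true_loss = 0.0   # sum of (w^T window - x_t)^2 -- unobservable (needs the clean signal)
surr_loss = 0.0   # observable surrogate: (w^T window - y_t)^2 + (2*w[0] - 1)*sigma^2
for t in range(k - 1, T + k):
    window = y[t - k + 1 : t + 1][::-1]   # (y_t, y_{t-1}, ..., y_{t-k+1})
    p = float(w @ window)
    true_loss += (p - x[t]) ** 2
    surr_loss += (p - y[t]) ** 2 + (2.0 * w[0] - 1.0) * sigma ** 2

# The two cumulative losses agree up to O(sqrt(T)) fluctuations,
# so their per-step gap vanishes as T grows.
gap = abs(true_loss - surr_loss) / (T + 1)
print(gap)
```

This is what lets an online learner run gradient updates on the observable surrogate while still controlling the true square error in expectation.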

REFERENCES

[1] T. Moon and T. Weissman, “Universal FIR MMSE filtering,” IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1068–1083, 2009.

[2] T. Moon, “Universal switching FIR filtering,” IEEE Trans. Signal Process., vol. 60, no. 3, pp. 1460–1464, 2012.

[3] E. Hazan and S. Kale, “Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization,” J. Mach. Learn. Res.—Proc. Track, vol. 19, pp. 421–436, 2011.

[4] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications. New York, NY, USA: Wiley, 1949.

[5] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Upper Saddle River, NJ, USA: Prentice-Hall, 2000.

[6] H. V. Poor, “On robust Wiener filtering,” IEEE Trans. Autom. Control, vol. AC-25, no. 3, pp. 521–526, 1980.

[7] Y. C. Eldar and N. Merhav, “A competitive minimax approach to robust estimation of random parameters,” IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1931–1946, 2004.

[8] Y. C. Eldar, A. Ben-Tal, and A. Nemirovski, “Linear minimax regret estimation of deterministic parameters with bounded data uncertainties,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2177–2188, 2004.

[9] S. Haykin, Unsupervised Adaptive Filtering: Volume I, II. New York, NY, USA: Wiley, 2000.

[10] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proc. Int. Conf. Machine Learn. (ICML), 2003, pp. 928–936.

[11] E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Mach. Learn., vol. 69, no. 2–3, pp. 169–192, 2007.

[12] E. Hazan and C. Seshadhri, “Efficient learning algorithms for changing environments,” in Proc. Int. Conf. Machine Learn. (ICML), 2009, p. 50.

[13] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir, “Online learning of noisy data,” IEEE Trans. Inf. Theory, vol. 57, no. 12, pp. 7907–7931, 2011.

Dan Garber received the B.Sc. degree in computer engineering and the M.Sc. degree in computer science, both from the Technion—Israel Institute of Technology, Haifa, in 2010 and 2012, respectively. He is currently working toward the Ph.D. degree at the Industrial Engineering and Management faculty, Technion, under the supervision of Dr. E. Hazan. His research focuses on developing efficient algorithms with provable performance guarantees for problems in the fields of optimization, machine learning, and online decision-making.

Elad Hazan received the Ph.D. degree from Princeton University, Princeton, NJ, in 2006, under the supervision of S. Arora. From 2006 to 2010, he was a research staff member of the Theory Group at the IBM Almaden Research Center. Since 2010, he has been on the faculty at the Technion—Israel Institute of Technology, Haifa. His current research focuses on the design of efficient and practical algorithms for fundamental problems in machine learning and optimization.