
STT 461 Statistical Methods II
Tentative Syllabus 12-27-19

Professor Raoul LePage, C428 Wells Hall. Office hours Tu, Th 11:30 to 1:30 and by appointment. Email: [email protected]
R and RStudio: download to your computer or use in University labs.
Course site on d2l: consult it for course materials.

No textbook is assigned: accessible material is referenced via downloads from the web. Lecture slides/links/notes will be posted to d2l before each week.

General idea of topics. This is just to give you a general idea of the kinds of things we study. You likely have not yet seen what you will soon be fully acquainted with. Topics have been chosen that have immediate application to a broad range of Multiplicative Reinforcement Learning Algorithms (MRLA) for:
 - diversifying investments (e.g. stock portfolios, orders for goods and services);
 - controlling the time average of losses (e.g. daily errors of a predicted storm path);
 - classification (e.g. distinguishing those who will buy from those who will not);
 - adding randomization to MRLA in order to monitor results using probability.

An important result asserts that consecutive sums of possibly dependent random variables having finite variances satisfy the following:

   ( ∑_{i=1}^{n} X_i − ∑_{i=1}^{n} E(X_i | all X_j, j < i) ) / v(n)^{0.5+δ} ⟶ 0 as n ⟶ ∞, for any fixed δ > 0,

with probability one, on the event that v(n) = ∑_{i=1}^{n} Var(X_i | all X_j, j < i) ⟶ ∞. The idea is that the running average of the random variables X_i becomes close to the running average of their one-step-ahead conditional expectations, relative to little more than the square root of the average of the one-step-ahead conditional variances. We will apply that to create a randomized multiplicative weight algorithm which exhibits the behavior

   (1/n) ∑_{i=1}^{n} X_i  (red)   within   (1/n) ∑_{i=1}^{n} E(X_i | all X_j, j < i) ± v(n)^{0.55}/n  (green).

Just so you can see how it looks, observe how the red line moves inside the green boundaries as n increases. Adding randomization to MRLA allows us to see a red trajectory (1/n) ∑_{i=1}^{n} X_i of the randomized MRLA performing the way it should, moving into the green band. Looking at the randomization we have bounds and feedback to support the conjectured performance of the randomized MRLA.
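To see what the red and green picture means, here is a toy R sketch (my own illustration, not from the course materials), using iid Uniform(0,1) draws so that every one-step-ahead conditional expectation is 1/2 and every conditional variance is 1/12:

  # iid toy case: v(n) = n/12, so the green band is 0.5 ± v(n)^0.55 / n
  set.seed(1)
  N=5000
  X=runif(N)
  n=1:N
  red=cumsum(X)/n                        # running average of the X_i
  v=n/12                                 # cumulative one-step-ahead variances
  plot(n, red, type="l", col="red", ylab="running average")
  lines(n, 0.5 + v^0.55/n, col="green")
  lines(n, 0.5 - v^0.55/n, col="green")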

Importance Sampling approximates E h(X), where random X has distribution Q with density q, by exploiting

   ∫ h(x) q(x) dx = ∫ h(x) (q(x)/p(x)) p(x) dx,

applicable if P(q(X) > 0 and p(X) = 0) = 0. The idea is to draw p-samples X_i and apply a law of averages to

   { h(X_i) q(X_i)/p(X_i), i = 1, 2, ... }.

It is useful in many situations when Q samples are difficult to obtain, P samples are more easily obtained, and q/p is not too variable under P.
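A minimal sketch, under made-up choices: target Q = N(2, 1), sampling distribution P = N(0, 2) (chosen so that q/p is well behaved under P), and h(x) = x², for which E h(X) = 1 + 2² = 5:

  # importance sampling: p-samples reweighted by q/p estimate E h(X) under Q
  set.seed(2)
  h=function(x) x^2
  q=function(x) dnorm(x, mean=2, sd=1)   # density of Q
  p=function(x) dnorm(x, mean=0, sd=2)   # density of P
  X=rnorm(1e5, mean=0, sd=2)             # p-samples
  mean(h(X)*q(X)/p(X))                   # law of averages; close to 5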

Metropolis-Hastings Algorithm is an enhancement of Importance Sampling. It is an idea that propelled Bayesian statistical methods into prominence, bypassing the elaborate calculations needed to obtain the a posteriori density p(θ | data). Given a suitable parametric probability model P(data | θ), θ ∈ Θ, and a prior probability density p(θ) on θ ∈ Θ, M-H devises simulations { θ_i } for which (number of i ≤ n with θ_i ∈ A) / n ⟶ P(θ ∈ A | data) with probability one, for every set A ⊂ Θ with P(θ ∈ A) > 0.

A consequence is that, once an appropriate implementation has delivered the data file { θ_i }, tabulating it directly produces approximations of a posteriori (Bayesian) quantities such as the density, mean, mode, variance, and quantiles. This circumvents elaborate or impossible deterministic calculations.
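A hedged sketch of the idea, using a random-walk Metropolis chain (a special case of M-H) under an assumed model: observations N(θ, 1) with a N(0, 1) prior on θ, and invented data:

  # random-walk Metropolis: the file theta approximates draws from p(theta | data)
  set.seed(3)
  data=rnorm(20, mean=1.5, sd=1)        # hypothetical data
  log.post=function(th) sum(dnorm(data, th, 1, log=TRUE)) + dnorm(th, 0, 1, log=TRUE)
  nsim=10000
  theta=numeric(nsim)                   # chain starts at theta = 0
  for(i in 2:nsim){
    prop=theta[i-1] + rnorm(1, 0, 0.5)  # symmetric proposal, so proposal terms cancel
    # accept with probability min(1, posterior ratio)
    theta[i]=if(log(runif(1)) < log.post(prop) - log.post(theta[i-1])) prop else theta[i-1]
  }
  mean(theta); var(theta); quantile(theta, c(.025, .5, .975))   # tabulated summaries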

Bootstrap. The sampling error distribution of an estimator θ(data0) is the probability distribution of θ(data0) − θ(population). That is in turn the empirical distribution of θ(data1) − θ(population), θ(data2) − θ(population), etc., in which data1, data2, ..., ad infinitum, are independent replications of the initial sampling data0.

If the empirical distribution of data0 resembles the population distribution, it may seem reasonable to instead sample from the empirical distribution of data0, extracting many independent replications θ(data*_i) − θ(data0), i.e. θ(data*_1) − θ(data0), θ(data*_2) − θ(data0), etc., where data*_1, data*_2, etc. are obtained by drawing from the surrogate population data0 independently, many times over. In many situations it is proven that the file { θ(data*_i) − θ(data0) } has approximately the same quantiles, mean, and variance as the sampling distribution of θ(data0) − θ(population).

This is a short list of the goodly number of methods we will study and deploy in R.
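For instance, a minimal bootstrap sketch in R, with data0 invented for illustration and θ taken to be the median:

  # bootstrap: resample data0 to approximate the distribution of theta(data0)-theta(population)
  set.seed(4)
  data0=rexp(50)                        # the observed sample (made up here)
  theta=function(d) median(d)
  boot.err=replicate(2000, theta(sample(data0, replace=TRUE)) - theta(data0))
  quantile(boot.err, c(.025, .975))     # approximate sampling-error quantiles
  sd(boot.err)                          # approximate standard error of theta(data0)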

No final exam.

In-class exams (each exam grade counts equally): F 2-01-19, W 2-27-19, F 4-26-19.
HW due dates (each HW grade counts equally): W 1-23-19, W 2-20-19, W 4-17-19.

Course GRADE = 0.6 × (average of exam grades) + 0.4 × (average of HW grades), rounded up (e.g. 2.8 rounds to 3.0), except that 1.8 earns a 1.5 course grade.

Each exam and each HW is given its own grade scale.

Reading for the week of 1-7-19 to 1-11-19. The topic is a Multiplicative Reinforcement Learning Algorithm designed to retard the cumulative losses you sustain over a series of times t = 1, 2, ..., T. You will do this by harnessing the performance of each of n experts at times previous to t, then choosing one expert i(t) whose advice you accept for time t. You thereby sustain loss M(i(t), t) at time t and accumulate losses

   ∑_{s=1}^{t} M(i(s), s),   t ≤ T.

How to do this in such a way as to minimize the growth of losses?

Consult the reference http://rob.schapire.net/papers/FreundSc96b.pdf, particularly the proof of Theorem 1, given on pg. 8, whose purpose is to control accumulating losses over repeated plays of a fixed game against an opponent.

We will abandon the game context of the reference, replacing the game-oriented M(i, Q_t) by any M(i, t) ∈ [0, 1], then finding that the pg. 8 proof of Theorem 1 continues to apply. Elementary math is all we need. With this change the resulting Theorem 1 gives

   ∑_{t=1}^{T} M(i(t), t) ≤ a_β min_j ∑_{t=1}^{T} M(j, t) + c_β ln(n)

with a_β and c_β as defined in terms of the parameter β ∈ [0, 1) given in Theorem 1.

The above says the total loss using i(t) is nearly as low as that of the best expert with hindsight.
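For a sense of scale, a small sketch evaluating the bound for the β used in the MW code below. The forms a_β = ln(1/β)/(1 − β) and c_β = 1/(1 − β) are my reading of Theorem 1 in the reference; treat them as an assumption to verify against pg. 8:

  # constants of the Theorem 1 bound (forms assumed from the reference)
  n=5; T=1000
  beta=1/(1 + sqrt(log(n^2)/T))     # the beta chosen in MW() below
  a.beta=log(1/beta)/(1 - beta)     # multiplies the best expert's total loss
  c.beta=1/(1 - beta)               # multiplies ln(n)
  c(beta=beta, a.beta=a.beta, c.beta=c.beta)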

How does the algorithm choose i(t) for each t? Just select i(t) by random draw of an i using weights w(i, t) which are known immediately after period t-1:

P(i(t) = i | all w(j, s), j ≤ n, s < t) ∝ w(i, t), for i ≤ n, t ≤ T.   (∝ means "proportional to")

Notes. This algorithm is very flexible. For example:

1. In time series analysis you could use as experts many differently parameterized time series models, as an alternative to trying to estimate the best-fitting model. The MW algorithm can use i(t), produced for losses = error of one-step-ahead prediction or some other loss, with the objective of doing nearly as well over time as the best with hindsight of any of those time series models. This can accommodate changes in the time series, including departures from stationarity.

2. The original listing of experts can be replaced on the fly, with entry 1 for the next round t being the loss suffered by the expert who is best overall going into time t. Likewise entry 2 could be for the second best so far, etc. The MW method has no trouble with that dynamic real-time rearrangement of the losses table so long as the rearrangement does not look into the future in order to rearrange. Keep in mind, you only need to see the experts' losses after the close of each cycle to make the change to 'ranked'.

3. MW can be distracted if, for example, all of the experts enter periods of poor performance (e.g. losses of stock traders, loss being perhaps your one-period loss relative to what you had invested going into the period). You could identify groups of experts, running MW for each group, then run MW on the MW algorithms run for each group.

4. Maybe some of your experts are being lost or replaced. You are free to manage this in ways that seem practical to you.

MW R code implementation

MW=function(X){
  X=X[order(X[ ,1]), ]          # X is the loss matrix we call M; X[i, t] denotes M[i, t]
  P=X*0                         # weights w(i, t) will be kept in matrix P
  n=length(X[ ,1])
  choices=1:n                   # experts are 1:n
  T=length(X[1, ])
  ind=rep(1,T)                  # holds i(1), .., i(T)
  m=rep(0,T)
  M=rep(0,T)
  D=rep(0,T)
  loss.total=rep(0,T)
  beta=1/(1+sqrt(log(n^2)/T))   # this beta is from the Theorem 1 proof
  P[,1]=rep(1/n,n)              # weights are uniform going into round t = 1
  loss.total[1]=X[ind[1],1]     # chose particular start i(1) = 1 to get going
  m[1]=sum(P[,1]*X[,1])         # conditional expected X(i(1), 1) to get going
  M[1]=m[1]
  D[1]=sum(P[,1]*(X[,1]-m[1])^2)   # D is cumulative 1-step-ahead cond'l Var's
  for(t in 2:T){
    P[,t]=P[,t-1]*beta^X[,t-1]/sum(P[,t-1]*beta^X[,t-1])
    # random selection of i(t) according to probability weights P(i, t)
    ind[t]=sample(choices,1,replace=FALSE,P[,t])
    # update cumulative losses for all i
    loss.total[t]=loss.total[t-1]+X[ind[t],t]
    # calculate conditional expected loss
    m[t]=sum(P[,t]*X[,t])
    # update cumulative one-step-ahead conditional expected losses
    M[t]=M[t-1]+m[t]
    # update cumulative one-step-ahead conditional variances
    D[t]=D[t-1]+sum(P[,t]*(X[,t]-m[t])^2)
  }
  # plots of averages present better than cumulative sums
  # blue for experts i, red for losses sustained by the i(t)'s, green for bounds
  plot(1:T,ind,type="p")
  plot(1:T,loss.total/(1:T),type="l",col="red")
  # plot of one-step-ahead conditional expectations in black
  # overlays with red plot of actual average losses
  lines(1:T,M/(1:T))
  lines(1:T,(M+D^0.55)/(1:T),type="l",col="green")
  lines(1:T,(M-D^0.55)/(1:T),type="l",col="green")
  for(k in 1:n){ lines(1:T,cumsum(X[k,])/(1:T),type="l",col="blue") }
  lines(1:T,loss.total/(1:T),type="l",col="red")
  print(loss.total[T]/T)              # average loss of the randomized algorithm
  print(X%*%rep(1,T)/T)               # average loss of each expert
  print(sqrt(log(n^2)/T)+log(n)/T)    # Theorem 1 overhead per round
  print(sqrt(D[T])/T)                 # scale of the green band half width at T
}

# TEST
myX=matrix(c(exp(runif(1000)-1),abs(sin(runif(1000))),sqrt(exp(runif(1000)-1)),
             exp(runif(1000)-1)^2,abs(sin(runif(1000))^2)),
           nrow=5,ncol=1000,byrow=TRUE)   # each row is one expert's losses

MW(myX)

[1] 0.3001047     # not quite as small as best with hindsight
          [,1]
[1,] 0.2684112    # best with hindsight of the experts
[2,] 0.4254857
[3,] 0.6225501
[4,] 0.7905005
[5,] 0.4619181
[1] 0.05834458
[1] 0.003683015   # half width of green brackets at T = 1000

# next plot shows the choices i(t) made by the randomized algorithm for t in 1:1000

# next plot tracks all of the important features
# note that, where (blue) experts' average loss curves diverge, red tracks lower

Page 2: STT 461 Statistical Methods II Course site on d2l. No · STT 461 Statistical Methods II Tentative Syllabus 12-27-19 Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30

STT 461 Statistical Methods IITentative Syllabus 12-27-19

Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30 to 1:30 and by appt. email [email protected] and RStudio. Download to your computer or use in University labsCourse site on d2l. Consult for course materials.

No Textbook is Assigned: Accessible material is referenced to downloads from web.Lecture slides/links/notes will be posted to d2l before each week. General idea of topics. This is just to give you a general idea of the kinds of things we study. You likely have not yet seen what you will soon be fully acquainted with. Topics have been chosen that have immediate application to a broad range of Multiplicative Reinforcement Learning Algorithms (MRLA) for Diversifying investments (e.g. stock portfolios, orders for goods and services). Controlling time average of losses (e.g. daily errors of predicted storm path). Classification (e.g. distinguishing those who will buy from those who will not). Adding randomization to MRLA in order to monitor results using probability. An important result asserts that consecutive sums of possibly dependent random variables having finite variances satisfy the following

∑nXi -∑nE(Xi all Xj, j < i)

(v(n))0.5+δ ⟶ 0 as n⟶∞, for any fixed δ > 0,

with probability one, on the event v(n) = ∑n Var(Xi | all Xj, j < i) ⟶ ∞. The idea is that the running average of random variables Xi is becoming close to the running average of their one-step ahead conditional expectations relative to little more than the square root of the average of one-step-ahead conditional variances. We will apply that to create a randomized multiplicative weight algorithm which exhibits the behavior ∑n Xi n within ∑n E (Xi all Xj, j < i) n ± (v(n) .055/ n) red green Just so you can see how it looks, observe how the the red line moves inside the green boundaries as n increases. Adding randomization to MRLA allows us to see a red trajectory ∑n Xi n of the randomized MLRA performing the way it should, moving into the green band. Looking at the randomization we have bounds and feedback to support conjectured performance of the randomized MRLA.

Importance Sampling approximates E h(X), where random X has distribution Q , by exploiting ∫h(x)q(x) ⅆ x = ∫h(x) q(x)p(x)

p(x) ⅆ x, applicable if P(q(x) >0 and p(x) = 0) = 0.

The idea is to draw p-samples Xi and apply a law of averages to

{ h(Xi) q(Xi)p(Xi) , i = 1, 2, .. }.

It is useful in many situations when Q samples are difficult to obtain, P samples are more easily obtained, and q/p is not too variable under P.

Metropolis-Hastings Algorithm, is an enhancement of Importance Sampling. It is an idea that propelled Bayesian statistical method into prominence, bypassing elaborate calculations needed to obtain a-posteriori density p(θ | data). Given a suitable parametric probability model P(data | θ), θ ϵ Θ, and prior probability density p(θ) on θ ϵ Θ, M-H devises simulations { θi1 } for which (number of i ≤ n with θi1 ϵ A) / n ⟶ P(θ ϵ A | data) with probability one, for every set A ⊂ Θ with P(θ ϵ A) > 0.

A consequence is that, when appropriate implementations have delivered the data file { θi1 }, it can directly produce by tabulation approximations of a-posteriori (Bayesian) quantities such as density, mean, mode, variance, quantiles. This circumvents elaborate or impossible deterministic calculations.

Bootstrap. The sampling error distribution of and estimator θ(data0) is the probability distribution of θ(data0)-θ(population). That is in turn the empirical distribution of θ(data1)-θ(population), θ(data2)-θ(population), ... etc. in which data1, data2, .. , ad infinitum, are independent replications of the initial sampling data0.

If the empirical distribution of data0 resembles the population distribution it may seem reasonable to instead sample from the empirical distribution of data0, extracting many independent replications θ(data*i)-θ(data0), i.e. θ(data*1)-θ(data0), θ(data*2)-θ(data0), ... etc. where data*1, data*2, etc. are obtained by drawing from the surrogate population data0 independently, many times over. In many situations it is proven that the file { θ(data*i)-θ(data0) } has approximately the correct quantiles, mean, variance, as the sampling distribution of θ(data0)-θ(population). This is a short list of the goodly number of methods we will study and deploy in R.

no final exam

In-class exams. Each exam grade counts equally. F 2-01-19 W 2-27-19 F 4-26-19HW due dates Each HW grade counts equally. W 1-23-19 W 2-20-19 W 4-17-19

Course GRADE = 0.6 (average of exam grades) + 0.4 (average of HW grades) rounded up, e.g. 2.8 to 3.0, except 1.8 earns 1.5 course grade.

Each exam and each HW is given its own grade scale.

Reading for week of 1-7-19 to 1-11-19. The topic is a Multiplicative Reinforcement Learning Algorithm designed to retard cumulative losses you sustain over a series of times t = 1, 2, .., T. You will do this by harnessing the performance of each of n experts at times previous to t, then choosing one expert i(t) whose advice you accept for time t. You thereby sustain loss M(i(t), t) at time t and accumulate losses ∑ s = 1

s = t M(i(s),s), t ≤ T.How to do this in such a way as to minimize the growth of losses?

Consult reference http://rob.schapire.net/papers/FreundSc96b.pdfparticularly the proof of Theorem 1, given on pg. 8, whose purpose is to control accu-mulating losses over repeated plays of a fixed game against an opponent.

We will abandon the game context of the reference, replacing game oriented M(i, Qt) by any M(i, t) ϵ [0,1], then finding that the pg. 8 proof of Theorem 1 continues to apply. Elementary math is all we need. With this change resulting Theorem 1 gives ∑ t = 1

t = T M(i(t),t) ≤ aβ min i ∑ t = 1t = T M(j,t) + cβ ln(n)

with aβ and cβ as defined in terms of parameter η ϵ [0, 1) as given Theorem 1.

Above says total loss using i(t) is nearly as low as best with hindsight of the experts.

How does the algorithm choose i(t) for each t? Just select i(t) by random draw of an i using weights w(i, t) which are known immediately after period t-1:

P( i(t) = i | all w(j, s), j ≤ n, s < t) ∝ w(i, t), for i ≤ n, t ≤ T. ( ∝ means proportional to)

Notes. This algorithm is very flexible. For example, 1. In time series analysis you could use as experts many differently parameterized time series models as an alternative to trying to estimate the best fitting model. The MW algorithm can use i(t), produced for losses = error of one step ahead prediction or some other loss, with the objective of doing nearly as well over time as the best with hindsight of any of those time series models. This can accommodate changes in the time series, including departures from stationarity.2. The original listing of experts can be replaced on the fly with entry 1 for the next round t being the loss suffered by the expert who is best overall going into time t. Likewise entry 2 could be for second best so far, etc. MW method has no trouble with that dynamic real time rearrangement of the losses table so long as the rearrange-ment does not look into the future in order to rearrange. Keep in mind, you only need to see the experts’ losses after close of each cycle to make the change to ‘ranked’.3. MW can be distracted if, for example, all of the experts enter periods of poor perfor-mance (e.g. losses of stock traders, loss being perhaps your one period loss relative to what you had invested going into the period). You could identify groups of experts, running MW for each group, then run MW on the MW algorithms run for each group.4. Maybe some of your experts are being lost or replaced. You are free to manage this in ways that seem practical to you. MW R code implementation MW=function(X){ X=X[order(X[ ,1]), ] # X is the loss matrix we call M P=X*0 # weights w(i, t) will be kept in matrix P n=length(X[ ,1]) # X[i, t] denotes M[i, t] choices=1:n # experts are 1:n T=length(X[1, ]) # ind=rep(1,T) # holds i(1), .., i(T) m=rep(0,T) M=rep(0,T) D=rep(0,T) loss.total=rep(0,T) beta=1/(1+sqrt(log(n^2)/T)) # this beta is from Theorem 1 proof P[,1]=rep(1/n,n) # weights are uniform going into round t = 1 loss.total[1]=X[ind[1],1] # chose particular start i(1) = 1 to get going m[1]=P[,1]%*%X[,1] # conditional expected X(i(1), 1) to get going M[1]=m[1] D[1]=P[,1]%*%(X[,1]-P[,1]%*%X[,1])^2 # D is cumulative 1-step ahead cond’l Var’s for(t in 2:T){ P[,t]=P[,t-1]*beta^X[,t-1]/sum(P[,t-1]*beta^X[,t-1]) # random selection of i(t) according to probability weights P(i, t) ind[t]=sample(choices,1,replace=FALSE,P[,t]) # update cumulative losses for all i loss.total[t]=loss.total[t-1]+X[ind[t],t] # calculate conditional expected loss m[t]=P[,t]%*%X[,t] # update cumulative on step ahead conditional expected losses M[t]=M[t-1]+m[t] # update cumulative one step ahead conditional variances D[t]=D[t-1]+P[,t]%*%(X[,t]-P[,t]%*%X[,t])^2 } # plots of averages present better than rather than cumulative sums # blue for experts i, red for losses sustained by i(t)’s, green for bounds, plot(1:T,ind, type=”p”) plot(1:T,loss.total/(1:T),type=”l”,col=”red”) # plot of one-step ahead conditional expectations in black # overlays with red plot of actual average losses lines(1:T,M/(1:T)) lines(1:T,(M+D^0.55)/(1:T),type=”l”,col=”green”) lines(1:T,(M-D^0.55)/(1:T),type=”l”,col=”green”) for(k in 1:n){ lines(1:T,cumsum(X[k,])/(1:T),type=”l”,col=”blue”) } lines(1:T,loss.total/(1:T),type=”l”,col=”red”) print(loss.total[T]/T) print(X%*%rep(1,T)/T) print(sqrt(log(n^2)/T)+log(n)/T) print(sqrt(D[T])/T) } # TEST myX=matrix(c(exp(runif(1000)-1),abs(sin(runif(1000))),sqrt(exp(runif(1000)-1)), exp(runif(1000)-1)^2,abs(sin(runif(1000))^2)),nrow=5,ncol=1000,byrow)

MW(myX)

[1] 0.3001047 # not quite as small as best with hindsight [,1][1,] 0.2684112 # best with hindsight of the experts[2,] 0.4254857[3,] 0.6225501[4,] 0.7905005[5,] 0.4619181[1] 0.05834458[1] 0.003683015 # half width of green brackets at T = 1000

# next plot shows the choices i(t) made by the randomized algorithm for t in 1:1000

# next plot tracks all of the important features# note that, where (blue) experts’ average loss curves diverge, red tracks lower

2 Syllabus 461 Sp 19.nb

Page 3: STT 461 Statistical Methods II Course site on d2l. No · STT 461 Statistical Methods II Tentative Syllabus 12-27-19 Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30

STT 461 Statistical Methods IITentative Syllabus 12-27-19

Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30 to 1:30 and by appt. email [email protected] and RStudio. Download to your computer or use in University labsCourse site on d2l. Consult for course materials.

No Textbook is Assigned: Accessible material is referenced to downloads from web.Lecture slides/links/notes will be posted to d2l before each week. General idea of topics. This is just to give you a general idea of the kinds of things we study. You likely have not yet seen what you will soon be fully acquainted with. Topics have been chosen that have immediate application to a broad range of Multiplicative Reinforcement Learning Algorithms (MRLA) for Diversifying investments (e.g. stock portfolios, orders for goods and services). Controlling time average of losses (e.g. daily errors of predicted storm path). Classification (e.g. distinguishing those who will buy from those who will not). Adding randomization to MRLA in order to monitor results using probability. An important result asserts that consecutive sums of possibly dependent random variables having finite variances satisfy the following

∑nXi -∑nE(Xi all Xj, j < i)

(v(n))0.5+δ ⟶ 0 as n⟶∞, for any fixed δ > 0,

with probability one, on the event v(n) = ∑n Var(Xi | all Xj, j < i) ⟶ ∞. The idea is that the running average of random variables Xi is becoming close to the running average of their one-step ahead conditional expectations relative to little more than the square root of the average of one-step-ahead conditional variances. We will apply that to create a randomized multiplicative weight algorithm which exhibits the behavior ∑n Xi n within ∑n E (Xi all Xj, j < i) n ± (v(n) .055/ n) red green Just so you can see how it looks, observe how the the red line moves inside the green boundaries as n increases. Adding randomization to MRLA allows us to see a red trajectory ∑n Xi n of the randomized MLRA performing the way it should, moving into the green band. Looking at the randomization we have bounds and feedback to support conjectured performance of the randomized MRLA.

Importance Sampling approximates E h(X), where random X has distribution Q , by exploiting ∫h(x)q(x) ⅆ x = ∫h(x) q(x)p(x)

p(x) ⅆ x, applicable if P(q(x) >0 and p(x) = 0) = 0.

The idea is to draw p-samples Xi and apply a law of averages to

{ h(Xi) q(Xi)p(Xi) , i = 1, 2, .. }.

It is useful in many situations when Q samples are difficult to obtain, P samples are more easily obtained, and q/p is not too variable under P.

Metropolis-Hastings Algorithm, is an enhancement of Importance Sampling. It is an idea that propelled Bayesian statistical method into prominence, bypassing elaborate calculations needed to obtain a-posteriori density p(θ | data). Given a suitable parametric probability model P(data | θ), θ ϵ Θ, and prior probability density p(θ) on θ ϵ Θ, M-H devises simulations { θi1 } for which (number of i ≤ n with θi1 ϵ A) / n ⟶ P(θ ϵ A | data) with probability one, for every set A ⊂ Θ with P(θ ϵ A) > 0.

A consequence is that, when appropriate implementations have delivered the data file { θi1 }, it can directly produce by tabulation approximations of a-posteriori (Bayesian) quantities such as density, mean, mode, variance, quantiles. This circumvents elaborate or impossible deterministic calculations.

Bootstrap. The sampling error distribution of and estimator θ(data0) is the probability distribution of θ(data0)-θ(population). That is in turn the empirical distribution of θ(data1)-θ(population), θ(data2)-θ(population), ... etc. in which data1, data2, .. , ad infinitum, are independent replications of the initial sampling data0.

If the empirical distribution of data0 resembles the population distribution it may seem reasonable to instead sample from the empirical distribution of data0, extracting many independent replications θ(data*i)-θ(data0), i.e. θ(data*1)-θ(data0), θ(data*2)-θ(data0), ... etc. where data*1, data*2, etc. are obtained by drawing from the surrogate population data0 independently, many times over. In many situations it is proven that the file { θ(data*i)-θ(data0) } has approximately the correct quantiles, mean, variance, as the sampling distribution of θ(data0)-θ(population). This is a short list of the goodly number of methods we will study and deploy in R.

no final exam

In-class exams. Each exam grade counts equally. F 2-01-19 W 2-27-19 F 4-26-19HW due dates Each HW grade counts equally. W 1-23-19 W 2-20-19 W 4-17-19

Course GRADE = 0.6 (average of exam grades) + 0.4 (average of HW grades) rounded up, e.g. 2.8 to 3.0, except 1.8 earns 1.5 course grade.

Each exam and each HW is given its own grade scale.

Reading for week of 1-7-19 to 1-11-19. The topic is a Multiplicative Reinforcement Learning Algorithm designed to retard cumulative losses you sustain over a series of times t = 1, 2, .., T. You will do this by harnessing the performance of each of n experts at times previous to t, then choosing one expert i(t) whose advice you accept for time t. You thereby sustain loss M(i(t), t) at time t and accumulate losses ∑ s = 1

s = t M(i(s),s), t ≤ T.How to do this in such a way as to minimize the growth of losses?

Consult reference http://rob.schapire.net/papers/FreundSc96b.pdfparticularly the proof of Theorem 1, given on pg. 8, whose purpose is to control accu-mulating losses over repeated plays of a fixed game against an opponent.

We will abandon the game context of the reference, replacing game oriented M(i, Qt) by any M(i, t) ϵ [0,1], then finding that the pg. 8 proof of Theorem 1 continues to apply. Elementary math is all we need. With this change resulting Theorem 1 gives ∑ t = 1

t = T M(i(t),t) ≤ aβ min i ∑ t = 1t = T M(j,t) + cβ ln(n)

with aβ and cβ as defined in terms of parameter η ϵ [0, 1) as given Theorem 1.

Above says total loss using i(t) is nearly as low as best with hindsight of the experts.

How does the algorithm choose i(t) for each t? Just select i(t) by random draw of an i using weights w(i, t) which are known immediately after period t-1:

P( i(t) = i | all w(j, s), j ≤ n, s < t) ∝ w(i, t), for i ≤ n, t ≤ T. ( ∝ means proportional to)

Notes. This algorithm is very flexible. For example, 1. In time series analysis you could use as experts many differently parameterized time series models as an alternative to trying to estimate the best fitting model. The MW algorithm can use i(t), produced for losses = error of one step ahead prediction or some other loss, with the objective of doing nearly as well over time as the best with hindsight of any of those time series models. This can accommodate changes in the time series, including departures from stationarity.2. The original listing of experts can be replaced on the fly with entry 1 for the next round t being the loss suffered by the expert who is best overall going into time t. Likewise entry 2 could be for second best so far, etc. MW method has no trouble with that dynamic real time rearrangement of the losses table so long as the rearrange-ment does not look into the future in order to rearrange. Keep in mind, you only need to see the experts’ losses after close of each cycle to make the change to ‘ranked’.3. MW can be distracted if, for example, all of the experts enter periods of poor perfor-mance (e.g. losses of stock traders, loss being perhaps your one period loss relative to what you had invested going into the period). You could identify groups of experts, running MW for each group, then run MW on the MW algorithms run for each group.4. Maybe some of your experts are being lost or replaced. You are free to manage this in ways that seem practical to you. MW R code implementation MW=function(X){ X=X[order(X[ ,1]), ] # X is the loss matrix we call M P=X*0 # weights w(i, t) will be kept in matrix P n=length(X[ ,1]) # X[i, t] denotes M[i, t] choices=1:n # experts are 1:n T=length(X[1, ]) # ind=rep(1,T) # holds i(1), .., i(T) m=rep(0,T) M=rep(0,T) D=rep(0,T) loss.total=rep(0,T) beta=1/(1+sqrt(log(n^2)/T)) # this beta is from Theorem 1 proof P[,1]=rep(1/n,n) # weights are uniform going into round t = 1 loss.total[1]=X[ind[1],1] # chose particular start i(1) = 1 to get going m[1]=P[,1]%*%X[,1] # conditional expected X(i(1), 1) to get going M[1]=m[1] D[1]=P[,1]%*%(X[,1]-P[,1]%*%X[,1])^2 # D is cumulative 1-step ahead cond’l Var’s for(t in 2:T){ P[,t]=P[,t-1]*beta^X[,t-1]/sum(P[,t-1]*beta^X[,t-1]) # random selection of i(t) according to probability weights P(i, t) ind[t]=sample(choices,1,replace=FALSE,P[,t]) # update cumulative losses for all i loss.total[t]=loss.total[t-1]+X[ind[t],t] # calculate conditional expected loss m[t]=P[,t]%*%X[,t] # update cumulative on step ahead conditional expected losses M[t]=M[t-1]+m[t] # update cumulative one step ahead conditional variances D[t]=D[t-1]+P[,t]%*%(X[,t]-P[,t]%*%X[,t])^2 } # plots of averages present better than rather than cumulative sums # blue for experts i, red for losses sustained by i(t)’s, green for bounds, plot(1:T,ind, type=”p”) plot(1:T,loss.total/(1:T),type=”l”,col=”red”) # plot of one-step ahead conditional expectations in black # overlays with red plot of actual average losses lines(1:T,M/(1:T)) lines(1:T,(M+D^0.55)/(1:T),type=”l”,col=”green”) lines(1:T,(M-D^0.55)/(1:T),type=”l”,col=”green”) for(k in 1:n){ lines(1:T,cumsum(X[k,])/(1:T),type=”l”,col=”blue”) } lines(1:T,loss.total/(1:T),type=”l”,col=”red”) print(loss.total[T]/T) print(X%*%rep(1,T)/T) print(sqrt(log(n^2)/T)+log(n)/T) print(sqrt(D[T])/T) } # TEST myX=matrix(c(exp(runif(1000)-1),abs(sin(runif(1000))),sqrt(exp(runif(1000)-1)), exp(runif(1000)-1)^2,abs(sin(runif(1000))^2)),nrow=5,ncol=1000,byrow)

MW(myX)

[1] 0.3001047 # not quite as small as best with hindsight [,1][1,] 0.2684112 # best with hindsight of the experts[2,] 0.4254857[3,] 0.6225501[4,] 0.7905005[5,] 0.4619181[1] 0.05834458[1] 0.003683015 # half width of green brackets at T = 1000

# next plot shows the choices i(t) made by the randomized algorithm for t in 1:1000

# next plot tracks all of the important features# note that, where (blue) experts’ average loss curves diverge, red tracks lower

Syllabus 461 Sp 19.nb 3

Page 4: STT 461 Statistical Methods II Course site on d2l. No · STT 461 Statistical Methods II Tentative Syllabus 12-27-19 Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30

STT 461 Statistical Methods IITentative Syllabus 12-27-19

Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30 to 1:30 and by appt. email [email protected] and RStudio. Download to your computer or use in University labsCourse site on d2l. Consult for course materials.

No Textbook is Assigned: Accessible material is referenced to downloads from web.Lecture slides/links/notes will be posted to d2l before each week. General idea of topics. This is just to give you a general idea of the kinds of things we study. You likely have not yet seen what you will soon be fully acquainted with. Topics have been chosen that have immediate application to a broad range of Multiplicative Reinforcement Learning Algorithms (MRLA) for Diversifying investments (e.g. stock portfolios, orders for goods and services). Controlling time average of losses (e.g. daily errors of predicted storm path). Classification (e.g. distinguishing those who will buy from those who will not). Adding randomization to MRLA in order to monitor results using probability. An important result asserts that consecutive sums of possibly dependent random variables having finite variances satisfy the following

∑nXi -∑nE(Xi all Xj, j < i)

(v(n))0.5+δ ⟶ 0 as n⟶∞, for any fixed δ > 0,

with probability one, on the event v(n) = ∑n Var(Xi | all Xj, j < i) ⟶ ∞. The idea is that the running average of random variables Xi is becoming close to the running average of their one-step ahead conditional expectations relative to little more than the square root of the average of one-step-ahead conditional variances. We will apply that to create a randomized multiplicative weight algorithm which exhibits the behavior ∑n Xi n within ∑n E (Xi all Xj, j < i) n ± (v(n) .055/ n) red green Just so you can see how it looks, observe how the the red line moves inside the green boundaries as n increases. Adding randomization to MRLA allows us to see a red trajectory ∑n Xi n of the randomized MLRA performing the way it should, moving into the green band. Looking at the randomization we have bounds and feedback to support conjectured performance of the randomized MRLA.

Importance Sampling approximates E h(X), where random X has distribution Q , by exploiting ∫h(x)q(x) ⅆ x = ∫h(x) q(x)p(x)

p(x) ⅆ x, applicable if P(q(x) >0 and p(x) = 0) = 0.

The idea is to draw p-samples Xi and apply a law of averages to

{ h(Xi) q(Xi)p(Xi) , i = 1, 2, .. }.

It is useful in many situations when Q samples are difficult to obtain, P samples are more easily obtained, and q/p is not too variable under P.

Metropolis-Hastings Algorithm, is an enhancement of Importance Sampling. It is an idea that propelled Bayesian statistical method into prominence, bypassing elaborate calculations needed to obtain a-posteriori density p(θ | data). Given a suitable parametric probability model P(data | θ), θ ϵ Θ, and prior probability density p(θ) on θ ϵ Θ, M-H devises simulations { θi1 } for which (number of i ≤ n with θi1 ϵ A) / n ⟶ P(θ ϵ A | data) with probability one, for every set A ⊂ Θ with P(θ ϵ A) > 0.

A consequence is that, when appropriate implementations have delivered the data file { θi1 }, it can directly produce by tabulation approximations of a-posteriori (Bayesian) quantities such as density, mean, mode, variance, quantiles. This circumvents elaborate or impossible deterministic calculations.

Bootstrap. The sampling error distribution of and estimator θ(data0) is the probability distribution of θ(data0)-θ(population). That is in turn the empirical distribution of θ(data1)-θ(population), θ(data2)-θ(population), ... etc. in which data1, data2, .. , ad infinitum, are independent replications of the initial sampling data0.

If the empirical distribution of data0 resembles the population distribution it may seem reasonable to instead sample from the empirical distribution of data0, extracting many independent replications θ(data*i)-θ(data0), i.e. θ(data*1)-θ(data0), θ(data*2)-θ(data0), ... etc. where data*1, data*2, etc. are obtained by drawing from the surrogate population data0 independently, many times over. In many situations it is proven that the file { θ(data*i)-θ(data0) } has approximately the correct quantiles, mean, variance, as the sampling distribution of θ(data0)-θ(population). This is a short list of the goodly number of methods we will study and deploy in R.

no final exam

In-class exams. Each exam grade counts equally. F 2-01-19 W 2-27-19 F 4-26-19HW due dates Each HW grade counts equally. W 1-23-19 W 2-20-19 W 4-17-19

Course GRADE = 0.6 (average of exam grades) + 0.4 (average of HW grades) rounded up, e.g. 2.8 to 3.0, except 1.8 earns 1.5 course grade.

Each exam and each HW is given its own grade scale.

Reading for week of 1-7-19 to 1-11-19. The topic is a Multiplicative Reinforcement Learning Algorithm designed to retard cumulative losses you sustain over a series of times t = 1, 2, .., T. You will do this by harnessing the performance of each of n experts at times previous to t, then choosing one expert i(t) whose advice you accept for time t. You thereby sustain loss M(i(t), t) at time t and accumulate losses ∑ s = 1

s = t M(i(s),s), t ≤ T.How to do this in such a way as to minimize the growth of losses?

Consult reference http://rob.schapire.net/papers/FreundSc96b.pdfparticularly the proof of Theorem 1, given on pg. 8, whose purpose is to control accu-mulating losses over repeated plays of a fixed game against an opponent.

We will abandon the game context of the reference, replacing game oriented M(i, Qt) by any M(i, t) ϵ [0,1], then finding that the pg. 8 proof of Theorem 1 continues to apply. Elementary math is all we need. With this change resulting Theorem 1 gives ∑ t = 1

t = T M(i(t),t) ≤ aβ min i ∑ t = 1t = T M(j,t) + cβ ln(n)

with aβ and cβ as defined in terms of parameter η ϵ [0, 1) as given Theorem 1.

Above says total loss using i(t) is nearly as low as best with hindsight of the experts.

How does the algorithm choose i(t) for each t? Just select i(t) by random draw of an i using weights w(i, t) which are known immediately after period t-1:

P( i(t) = i | all w(j, s), j ≤ n, s < t) ∝ w(i, t), for i ≤ n, t ≤ T. ( ∝ means proportional to)

Notes. This algorithm is very flexible. For example, 1. In time series analysis you could use as experts many differently parameterized time series models as an alternative to trying to estimate the best fitting model. The MW algorithm can use i(t), produced for losses = error of one step ahead prediction or some other loss, with the objective of doing nearly as well over time as the best with hindsight of any of those time series models. This can accommodate changes in the time series, including departures from stationarity.2. The original listing of experts can be replaced on the fly with entry 1 for the next round t being the loss suffered by the expert who is best overall going into time t. Likewise entry 2 could be for second best so far, etc. MW method has no trouble with that dynamic real time rearrangement of the losses table so long as the rearrange-ment does not look into the future in order to rearrange. Keep in mind, you only need to see the experts’ losses after close of each cycle to make the change to ‘ranked’.3. MW can be distracted if, for example, all of the experts enter periods of poor perfor-mance (e.g. losses of stock traders, loss being perhaps your one period loss relative to what you had invested going into the period). You could identify groups of experts, running MW for each group, then run MW on the MW algorithms run for each group.4. Maybe some of your experts are being lost or replaced. You are free to manage this in ways that seem practical to you. MW R code implementation MW=function(X){ X=X[order(X[ ,1]), ] # X is the loss matrix we call M P=X*0 # weights w(i, t) will be kept in matrix P n=length(X[ ,1]) # X[i, t] denotes M[i, t] choices=1:n # experts are 1:n T=length(X[1, ]) # ind=rep(1,T) # holds i(1), .., i(T) m=rep(0,T) M=rep(0,T) D=rep(0,T) loss.total=rep(0,T) beta=1/(1+sqrt(log(n^2)/T)) # this beta is from Theorem 1 proof P[,1]=rep(1/n,n) # weights are uniform going into round t = 1 loss.total[1]=X[ind[1],1] # chose particular start i(1) = 1 to get going m[1]=P[,1]%*%X[,1] # conditional expected X(i(1), 1) to get going M[1]=m[1] D[1]=P[,1]%*%(X[,1]-P[,1]%*%X[,1])^2 # D is cumulative 1-step ahead cond’l Var’s for(t in 2:T){ P[,t]=P[,t-1]*beta^X[,t-1]/sum(P[,t-1]*beta^X[,t-1]) # random selection of i(t) according to probability weights P(i, t) ind[t]=sample(choices,1,replace=FALSE,P[,t]) # update cumulative losses for all i loss.total[t]=loss.total[t-1]+X[ind[t],t] # calculate conditional expected loss m[t]=P[,t]%*%X[,t] # update cumulative on step ahead conditional expected losses M[t]=M[t-1]+m[t] # update cumulative one step ahead conditional variances D[t]=D[t-1]+P[,t]%*%(X[,t]-P[,t]%*%X[,t])^2 } # plots of averages present better than rather than cumulative sums # blue for experts i, red for losses sustained by i(t)’s, green for bounds, plot(1:T,ind, type=”p”) plot(1:T,loss.total/(1:T),type=”l”,col=”red”) # plot of one-step ahead conditional expectations in black # overlays with red plot of actual average losses lines(1:T,M/(1:T)) lines(1:T,(M+D^0.55)/(1:T),type=”l”,col=”green”) lines(1:T,(M-D^0.55)/(1:T),type=”l”,col=”green”) for(k in 1:n){ lines(1:T,cumsum(X[k,])/(1:T),type=”l”,col=”blue”) } lines(1:T,loss.total/(1:T),type=”l”,col=”red”) print(loss.total[T]/T) print(X%*%rep(1,T)/T) print(sqrt(log(n^2)/T)+log(n)/T) print(sqrt(D[T])/T) } # TEST myX=matrix(c(exp(runif(1000)-1),abs(sin(runif(1000))),sqrt(exp(runif(1000)-1)), exp(runif(1000)-1)^2,abs(sin(runif(1000))^2)),nrow=5,ncol=1000,byrow)

MW(myX)

[1] 0.3001047 # not quite as small as best with hindsight [,1][1,] 0.2684112 # best with hindsight of the experts[2,] 0.4254857[3,] 0.6225501[4,] 0.7905005[5,] 0.4619181[1] 0.05834458[1] 0.003683015 # half width of green brackets at T = 1000

# next plot shows the choices i(t) made by the randomized algorithm for t in 1:1000

# next plot tracks all of the important features# note that, where (blue) experts’ average loss curves diverge, red tracks lower

4 Syllabus 461 Sp 19.nb

Page 5: STT 461 Statistical Methods II Course site on d2l. No · STT 461 Statistical Methods II Tentative Syllabus 12-27-19 Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30

STT 461 Statistical Methods IITentative Syllabus 12-27-19

Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30 to 1:30 and by appt. email [email protected] and RStudio. Download to your computer or use in University labsCourse site on d2l. Consult for course materials.

No Textbook is Assigned: Accessible material is referenced to downloads from web.Lecture slides/links/notes will be posted to d2l before each week. General idea of topics. This is just to give you a general idea of the kinds of things we study. You likely have not yet seen what you will soon be fully acquainted with. Topics have been chosen that have immediate application to a broad range of Multiplicative Reinforcement Learning Algorithms (MRLA) for Diversifying investments (e.g. stock portfolios, orders for goods and services). Controlling time average of losses (e.g. daily errors of predicted storm path). Classification (e.g. distinguishing those who will buy from those who will not). Adding randomization to MRLA in order to monitor results using probability. An important result asserts that consecutive sums of possibly dependent random variables having finite variances satisfy the following

∑nXi -∑nE(Xi all Xj, j < i)

(v(n))0.5+δ ⟶ 0 as n⟶∞, for any fixed δ > 0,

with probability one, on the event v(n) = ∑n Var(Xi | all Xj, j < i) ⟶ ∞. The idea is that the running average of random variables Xi is becoming close to the running average of their one-step ahead conditional expectations relative to little more than the square root of the average of one-step-ahead conditional variances. We will apply that to create a randomized multiplicative weight algorithm which exhibits the behavior ∑n Xi n within ∑n E (Xi all Xj, j < i) n ± (v(n) .055/ n) red green Just so you can see how it looks, observe how the the red line moves inside the green boundaries as n increases. Adding randomization to MRLA allows us to see a red trajectory ∑n Xi n of the randomized MLRA performing the way it should, moving into the green band. Looking at the randomization we have bounds and feedback to support conjectured performance of the randomized MRLA.

Importance Sampling approximates E h(X), where random X has distribution Q , by exploiting ∫h(x)q(x) ⅆ x = ∫h(x) q(x)p(x)

p(x) ⅆ x, applicable if P(q(x) >0 and p(x) = 0) = 0.

The idea is to draw p-samples Xi and apply a law of averages to

{ h(Xi) q(Xi)p(Xi) , i = 1, 2, .. }.

It is useful in many situations when Q samples are difficult to obtain, P samples are more easily obtained, and q/p is not too variable under P.

Metropolis-Hastings Algorithm, is an enhancement of Importance Sampling. It is an idea that propelled Bayesian statistical method into prominence, bypassing elaborate calculations needed to obtain a-posteriori density p(θ | data). Given a suitable parametric probability model P(data | θ), θ ϵ Θ, and prior probability density p(θ) on θ ϵ Θ, M-H devises simulations { θi1 } for which (number of i ≤ n with θi1 ϵ A) / n ⟶ P(θ ϵ A | data) with probability one, for every set A ⊂ Θ with P(θ ϵ A) > 0.

A consequence is that, when appropriate implementations have delivered the data file { θi1 }, it can directly produce by tabulation approximations of a-posteriori (Bayesian) quantities such as density, mean, mode, variance, quantiles. This circumvents elaborate or impossible deterministic calculations.

Bootstrap. The sampling error distribution of and estimator θ(data0) is the probability distribution of θ(data0)-θ(population). That is in turn the empirical distribution of θ(data1)-θ(population), θ(data2)-θ(population), ... etc. in which data1, data2, .. , ad infinitum, are independent replications of the initial sampling data0.

If the empirical distribution of data0 resembles the population distribution it may seem reasonable to instead sample from the empirical distribution of data0, extracting many independent replications θ(data*i)-θ(data0), i.e. θ(data*1)-θ(data0), θ(data*2)-θ(data0), ... etc. where data*1, data*2, etc. are obtained by drawing from the surrogate population data0 independently, many times over. In many situations it is proven that the file { θ(data*i)-θ(data0) } has approximately the correct quantiles, mean, variance, as the sampling distribution of θ(data0)-θ(population). This is a short list of the goodly number of methods we will study and deploy in R.

no final exam

In-class exams. Each exam grade counts equally. F 2-01-19 W 2-27-19 F 4-26-19HW due dates Each HW grade counts equally. W 1-23-19 W 2-20-19 W 4-17-19

Course GRADE = 0.6 (average of exam grades) + 0.4 (average of HW grades) rounded up, e.g. 2.8 to 3.0, except 1.8 earns 1.5 course grade.

Each exam and each HW is given its own grade scale.

Reading for week of 1-7-19 to 1-11-19. The topic is a Multiplicative Reinforcement Learning Algorithm designed to retard cumulative losses you sustain over a series of times t = 1, 2, .., T. You will do this by harnessing the performance of each of n experts at times previous to t, then choosing one expert i(t) whose advice you accept for time t. You thereby sustain loss M(i(t), t) at time t and accumulate losses ∑ s = 1

s = t M(i(s),s), t ≤ T.How to do this in such a way as to minimize the growth of losses?

Consult reference http://rob.schapire.net/papers/FreundSc96b.pdfparticularly the proof of Theorem 1, given on pg. 8, whose purpose is to control accu-mulating losses over repeated plays of a fixed game against an opponent.

We will abandon the game context of the reference, replacing game oriented M(i, Qt) by any M(i, t) ϵ [0,1], then finding that the pg. 8 proof of Theorem 1 continues to apply. Elementary math is all we need. With this change resulting Theorem 1 gives ∑ t = 1

t = T M(i(t),t) ≤ aβ min i ∑ t = 1t = T M(j,t) + cβ ln(n)

with aβ and cβ as defined in terms of parameter η ϵ [0, 1) as given Theorem 1.

Above says total loss using i(t) is nearly as low as best with hindsight of the experts.

How does the algorithm choose i(t) for each t? Just select i(t) by random draw of an i using weights w(i, t) which are known immediately after period t-1:

P( i(t) = i | all w(j, s), j ≤ n, s < t) ∝ w(i, t), for i ≤ n, t ≤ T. ( ∝ means proportional to)

Notes. This algorithm is very flexible. For example, 1. In time series analysis you could use as experts many differently parameterized time series models as an alternative to trying to estimate the best fitting model. The MW algorithm can use i(t), produced for losses = error of one step ahead prediction or some other loss, with the objective of doing nearly as well over time as the best with hindsight of any of those time series models. This can accommodate changes in the time series, including departures from stationarity.2. The original listing of experts can be replaced on the fly with entry 1 for the next round t being the loss suffered by the expert who is best overall going into time t. Likewise entry 2 could be for second best so far, etc. MW method has no trouble with that dynamic real time rearrangement of the losses table so long as the rearrange-ment does not look into the future in order to rearrange. Keep in mind, you only need to see the experts’ losses after close of each cycle to make the change to ‘ranked’.3. MW can be distracted if, for example, all of the experts enter periods of poor perfor-mance (e.g. losses of stock traders, loss being perhaps your one period loss relative to what you had invested going into the period). You could identify groups of experts, running MW for each group, then run MW on the MW algorithms run for each group.4. Maybe some of your experts are being lost or replaced. You are free to manage this in ways that seem practical to you. MW R code implementation MW=function(X){ X=X[order(X[ ,1]), ] # X is the loss matrix we call M P=X*0 # weights w(i, t) will be kept in matrix P n=length(X[ ,1]) # X[i, t] denotes M[i, t] choices=1:n # experts are 1:n T=length(X[1, ]) # ind=rep(1,T) # holds i(1), .., i(T) m=rep(0,T) M=rep(0,T) D=rep(0,T) loss.total=rep(0,T) beta=1/(1+sqrt(log(n^2)/T)) # this beta is from Theorem 1 proof P[,1]=rep(1/n,n) # weights are uniform going into round t = 1 loss.total[1]=X[ind[1],1] # chose particular start i(1) = 1 to get going m[1]=P[,1]%*%X[,1] # conditional expected X(i(1), 1) to get going M[1]=m[1] D[1]=P[,1]%*%(X[,1]-P[,1]%*%X[,1])^2 # D is cumulative 1-step ahead cond’l Var’s for(t in 2:T){ P[,t]=P[,t-1]*beta^X[,t-1]/sum(P[,t-1]*beta^X[,t-1]) # random selection of i(t) according to probability weights P(i, t) ind[t]=sample(choices,1,replace=FALSE,P[,t]) # update cumulative losses for all i loss.total[t]=loss.total[t-1]+X[ind[t],t] # calculate conditional expected loss m[t]=P[,t]%*%X[,t] # update cumulative on step ahead conditional expected losses M[t]=M[t-1]+m[t] # update cumulative one step ahead conditional variances D[t]=D[t-1]+P[,t]%*%(X[,t]-P[,t]%*%X[,t])^2 } # plots of averages present better than rather than cumulative sums # blue for experts i, red for losses sustained by i(t)’s, green for bounds, plot(1:T,ind, type=”p”) plot(1:T,loss.total/(1:T),type=”l”,col=”red”) # plot of one-step ahead conditional expectations in black # overlays with red plot of actual average losses lines(1:T,M/(1:T)) lines(1:T,(M+D^0.55)/(1:T),type=”l”,col=”green”) lines(1:T,(M-D^0.55)/(1:T),type=”l”,col=”green”) for(k in 1:n){ lines(1:T,cumsum(X[k,])/(1:T),type=”l”,col=”blue”) } lines(1:T,loss.total/(1:T),type=”l”,col=”red”) print(loss.total[T]/T) print(X%*%rep(1,T)/T) print(sqrt(log(n^2)/T)+log(n)/T) print(sqrt(D[T])/T) } # TEST myX=matrix(c(exp(runif(1000)-1),abs(sin(runif(1000))),sqrt(exp(runif(1000)-1)), exp(runif(1000)-1)^2,abs(sin(runif(1000))^2)),nrow=5,ncol=1000,byrow)

MW(myX)

[1] 0.3001047 # not quite as small as best with hindsight [,1][1,] 0.2684112 # best with hindsight of the experts[2,] 0.4254857[3,] 0.6225501[4,] 0.7905005[5,] 0.4619181[1] 0.05834458[1] 0.003683015 # half width of green brackets at T = 1000

# next plot shows the choices i(t) made by the randomized algorithm for t in 1:1000

# next plot tracks all of the important features# note that, where (blue) experts’ average loss curves diverge, red tracks lower

Syllabus 461 Sp 19.nb 5

Page 6: STT 461 Statistical Methods II Course site on d2l. No · STT 461 Statistical Methods II Tentative Syllabus 12-27-19 Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30

STT 461 Statistical Methods IITentative Syllabus 12-27-19

Professor Raoul LePage, C428 Wells Hall, office hours Tu, Th 11:30 to 1:30 and by appt. email [email protected] and RStudio. Download to your computer or use in University labsCourse site on d2l. Consult for course materials.

No Textbook is Assigned: Accessible material is referenced to downloads from web.Lecture slides/links/notes will be posted to d2l before each week. General idea of topics. This is just to give you a general idea of the kinds of things we study. You likely have not yet seen what you will soon be fully acquainted with. Topics have been chosen that have immediate application to a broad range of Multiplicative Reinforcement Learning Algorithms (MRLA) for Diversifying investments (e.g. stock portfolios, orders for goods and services). Controlling time average of losses (e.g. daily errors of predicted storm path). Classification (e.g. distinguishing those who will buy from those who will not). Adding randomization to MRLA in order to monitor results using probability. An important result asserts that consecutive sums of possibly dependent random variables having finite variances satisfy the following

∑nXi -∑nE(Xi all Xj, j < i)

(v(n))0.5+δ ⟶ 0 as n⟶∞, for any fixed δ > 0,

with probability one, on the event v(n) = ∑n Var(Xi | all Xj, j < i) ⟶ ∞. The idea is that the running average of random variables Xi is becoming close to the running average of their one-step ahead conditional expectations relative to little more than the square root of the average of one-step-ahead conditional variances. We will apply that to create a randomized multiplicative weight algorithm which exhibits the behavior ∑n Xi n within ∑n E (Xi all Xj, j < i) n ± (v(n) .055/ n) red green Just so you can see how it looks, observe how the the red line moves inside the green boundaries as n increases. Adding randomization to MRLA allows us to see a red trajectory ∑n Xi n of the randomized MLRA performing the way it should, moving into the green band. Looking at the randomization we have bounds and feedback to support conjectured performance of the randomized MRLA.

Importance Sampling approximates E h(X), where random X has distribution Q , by exploiting ∫h(x)q(x) ⅆ x = ∫h(x) q(x)p(x)

p(x) ⅆ x, applicable if P(q(x) >0 and p(x) = 0) = 0.

The idea is to draw p-samples Xi and apply a law of averages to

{ h(Xi) q(Xi)p(Xi) , i = 1, 2, .. }.

It is useful in many situations when Q samples are difficult to obtain, P samples are more easily obtained, and q/p is not too variable under P.

Metropolis-Hastings Algorithm, is an enhancement of Importance Sampling. It is an idea that propelled Bayesian statistical method into prominence, bypassing elaborate calculations needed to obtain a-posteriori density p(θ | data). Given a suitable parametric probability model P(data | θ), θ ϵ Θ, and prior probability density p(θ) on θ ϵ Θ, M-H devises simulations { θi1 } for which (number of i ≤ n with θi1 ϵ A) / n ⟶ P(θ ϵ A | data) with probability one, for every set A ⊂ Θ with P(θ ϵ A) > 0.

A consequence is that, once an appropriate implementation has delivered the data file { θi }, it can directly produce, by tabulation, approximations of a-posteriori (Bayesian) quantities such as the density, mean, mode, variance, and quantiles. This circumvents elaborate or impossible deterministic calculations.
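A minimal sketch of one common M-H variant, a random-walk Metropolis sampler, is given below; the setting is an assumption chosen for checkability (7 successes in 10 Bernoulli trials with a uniform prior, so the a-posteriori law is Beta(8, 4) and the tabulated answers can be verified).

set.seed(2)
x <- 7; m <- 10                                   # assumed toy data
log_post <- function(th)                          # log a-posteriori density, up to a constant
  if (th <= 0 || th >= 1) -Inf else x*log(th) + (m - x)*log(1 - th)
n_iter <- 50000
theta <- numeric(n_iter)
theta[1] <- 0.5                                   # arbitrary starting value
for (i in 2:n_iter) {
  prop <- theta[i - 1] + rnorm(1, 0, 0.1)         # symmetric random-walk proposal
  logA <- log_post(prop) - log_post(theta[i - 1]) # M-H acceptance log-ratio
  theta[i] <- if (log(runif(1)) < logA) prop else theta[i - 1]
}
mean(theta)                       # tabulated a-posteriori mean; Beta(8, 4) mean is 8/12 ≈ 0.667
quantile(theta, c(.025, .975))    # tabulated a-posteriori quantiles

Note that only the unnormalized a-posteriori density enters the acceptance ratio, which is what lets M-H bypass the normalizing-constant calculation.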

Bootstrap. The sampling error distribution of an estimator θ(data0) is the probability distribution of θ(data0)-θ(population). That is in turn the limiting empirical distribution of θ(data1)-θ(population), θ(data2)-θ(population), etc., in which data1, data2, .., ad infinitum, are independent replications of the initial sampling data0.

If the empirical distribution of data0 resembles the population distribution, it may seem reasonable to instead sample from the empirical distribution of data0, extracting many independent replications θ(data*i)-θ(data0), i.e. θ(data*1)-θ(data0), θ(data*2)-θ(data0), etc., where data*1, data*2, etc. are obtained by drawing from the surrogate population data0 independently, many times over. In many situations it is proven that the file { θ(data*i)-θ(data0) } has approximately the same quantiles, mean, and variance as the sampling distribution of θ(data0)-θ(population). This is a short list of the goodly number of methods we will study and deploy in R.
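A minimal bootstrap sketch in R, under assumptions chosen only for illustration (the estimator is the median and data0 is a placeholder exponential sample of size 50):

set.seed(3)
data0 <- rexp(50)             # placeholder for the initial sample data0
theta <- median               # the estimator theta(.)
B <- 5000                     # number of bootstrap resamples
errs <- replicate(B, theta(sample(data0, replace = TRUE)) - theta(data0))
quantile(errs, c(.025, .975)) # approximate quantiles of theta(data0) - theta(population)
sd(errs)                      # approximate standard error of theta(data0)

Each call to sample(data0, replace = TRUE) draws a data*i from the surrogate population data0, exactly the replication scheme described above.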

No final exam.

In-class exams. Each exam grade counts equally: F 2-01-19, W 2-27-19, F 4-26-19.
HW due dates. Each HW grade counts equally: W 1-23-19, W 2-20-19, W 4-17-19.

Course GRADE = 0.6 (average of exam grades) + 0.4 (average of HW grades), rounded up, e.g. 2.8 to 3.0, except that 1.8 earns a course grade of 1.5.

Each exam and each HW is given its own grade scale.

Reading for week of 1-7-19 to 1-11-19. The topic is a Multiplicative Reinforcement Learning Algorithm designed to retard the cumulative losses you sustain over a series of times t = 1, 2, .., T. You will do this by harnessing the performance of each of n experts at times previous to t, then choosing one expert i(t) whose advice you accept for time t. You thereby sustain loss M(i(t), t) at time t and accumulate losses ∑s=1..t M(i(s), s), t ≤ T. How can this be done in such a way as to minimize the growth of losses?

Consult the reference http://rob.schapire.net/papers/FreundSc96b.pdf, particularly the proof of Theorem 1, given on pg. 8, whose purpose is to control accumulating losses over repeated plays of a fixed game against an opponent.

We will abandon the game context of the reference, replacing the game-oriented M(i, Qt) by any M(i, t) ϵ [0,1], then finding that the pg. 8 proof of Theorem 1 continues to apply. Elementary math is all we need. With this change the resulting Theorem 1 gives

∑t=1..T M(i(t), t) ≤ aβ min j ∑t=1..T M(j, t) + cβ ln(n)

with aβ and cβ as defined in terms of the parameter β ϵ [0, 1) given in Theorem 1.

The above says the total loss using i(t) is nearly as low as that of the best expert with hindsight.
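As a numerical sanity check, the constants and the per-round slack can be evaluated in R for the n = 5, T = 1000 test run below; the closed forms aβ = ln(1/β)/(1-β) and cβ = 1/(1-β) are taken from the Freund-Schapire reference and are an assumption here, not something restated in this syllabus.

n <- 5; T <- 1000
beta <- 1/(1 + sqrt(log(n^2)/T))  # the beta used in the MW code below
a_beta <- log(1/beta)/(1 - beta)  # assumed Theorem 1 constant multiplying the best total loss
c_beta <- 1/(1 - beta)            # assumed Theorem 1 constant multiplying ln(n)
c(a_beta, c_beta)                 # a_beta is only slightly above 1 for large T
sqrt(log(n^2)/T) + log(n)/T       # per-round slack; matches the value MW prints (≈ 0.0583)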

How does the algorithm choose i(t) for each t? Just select i(t) by a random draw of an i using the weights w(i, t), which are known immediately after period t-1:

P( i(t) = i | all w(j, s), j ≤ n, s < t) ∝ w(i, t), for i ≤ n, t ≤ T. ( ∝ means proportional to)

Notes. This algorithm is very flexible. For example:

1. In time series analysis you could use as experts many differently parameterized time series models, as an alternative to trying to estimate the best-fitting model. The MW algorithm can use i(t), produced for losses = error of one-step-ahead prediction or some other loss, with the objective of doing nearly as well over time as the best with hindsight of any of those time series models. This can accommodate changes in the time series, including departures from stationarity.

2. The original listing of experts can be replaced on the fly, with entry 1 for the next round t being the loss suffered by the expert who is best overall going into time t. Likewise entry 2 could be for the second best so far, etc. The MW method has no trouble with that dynamic real-time rearrangement of the losses table so long as the rearrangement does not look into the future in order to rearrange. Keep in mind, you only need to see the experts' losses after the close of each cycle to make the change to 'ranked'.

3. MW can be distracted if, for example, all of the experts enter periods of poor performance (e.g. losses of stock traders, loss being perhaps your one-period loss relative to what you had invested going into the period). You could identify groups of experts, run MW for each group, then run MW on the MW algorithms run for each group.

4. Maybe some of your experts are being lost or replaced. You are free to manage this in ways that seem practical to you.

MW R code implementation

MW = function(X) {
  X = X[order(X[, 1]), ]             # X is the loss matrix we call M; rows sorted by round-1 loss
  P = X*0                            # weights w(i, t) will be kept in matrix P
  n = length(X[, 1])                 # X[i, t] denotes M[i, t]
  choices = 1:n                      # experts are 1:n
  T = length(X[1, ])                 # number of rounds
  ind = rep(1, T)                    # holds i(1), .., i(T)
  m = rep(0, T)
  M = rep(0, T)
  D = rep(0, T)
  loss.total = rep(0, T)
  beta = 1/(1 + sqrt(log(n^2)/T))    # this beta is from the Theorem 1 proof
  P[, 1] = rep(1/n, n)               # weights are uniform going into round t = 1
  loss.total[1] = X[ind[1], 1]       # chose particular start i(1) = 1 to get going
  m[1] = P[, 1] %*% X[, 1]           # conditional expected X(i(1), 1) to get going
  M[1] = m[1]
  D[1] = P[, 1] %*% (X[, 1] - P[, 1] %*% X[, 1])^2  # D is cumulative 1-step-ahead cond'l Var's
  for (t in 2:T) {
    P[, t] = P[, t-1]*beta^X[, t-1]/sum(P[, t-1]*beta^X[, t-1])
    # random selection of i(t) according to probability weights P(i, t)
    ind[t] = sample(choices, 1, replace = FALSE, prob = P[, t])
    # update cumulative losses
    loss.total[t] = loss.total[t-1] + X[ind[t], t]
    # calculate conditional expected loss
    m[t] = P[, t] %*% X[, t]
    # update cumulative one-step-ahead conditional expected losses
    M[t] = M[t-1] + m[t]
    # update cumulative one-step-ahead conditional variances
    D[t] = D[t-1] + P[, t] %*% (X[, t] - P[, t] %*% X[, t])^2
  }
  # plots of averages present better than cumulative sums
  # blue for experts i, red for losses sustained by the i(t)'s, green for bounds
  plot(1:T, ind, type = "p")
  plot(1:T, loss.total/(1:T), type = "l", col = "red")
  # plot of one-step-ahead conditional expectations in black
  # overlays with red plot of actual average losses
  lines(1:T, M/(1:T))
  lines(1:T, (M + D^0.55)/(1:T), type = "l", col = "green")
  lines(1:T, (M - D^0.55)/(1:T), type = "l", col = "green")
  for (k in 1:n) {
    lines(1:T, cumsum(X[k, ])/(1:T), type = "l", col = "blue")
  }
  lines(1:T, loss.total/(1:T), type = "l", col = "red")
  print(loss.total[T]/T)               # average loss of the randomized algorithm
  print(X %*% rep(1, T)/T)             # each expert's average loss with hindsight
  print(sqrt(log(n^2)/T) + log(n)/T)   # Theorem 1 per-round slack
  print(sqrt(D[T])/T)                  # roughly the half width of the green band at T
}

# TEST
myX = matrix(c(exp(runif(1000) - 1), abs(sin(runif(1000))), sqrt(exp(runif(1000) - 1)),
               exp(runif(1000) - 1)^2, abs(sin(runif(1000))^2)),
             nrow = 5, ncol = 1000, byrow = TRUE)  # byrow = TRUE so each row is one expert's losses

MW(myX)

[1] 0.3001047    # not quite as small as best with hindsight
          [,1]
[1,] 0.2684112   # best with hindsight of the experts
[2,] 0.4254857
[3,] 0.6225501
[4,] 0.7905005
[5,] 0.4619181
[1] 0.05834458   # Theorem 1 per-round slack sqrt(log(n^2)/T) + log(n)/T
[1] 0.003683015  # half width of green brackets at T = 1000

# next plot shows the choices i(t) made by the randomized algorithm for t in 1:1000

# next plot tracks all of the important features
# note that, where the (blue) experts' average loss curves diverge, red tracks lower
