Coupled Bisection for Root-Ordering - Cornell UniversityCoupled Bisection for Root-Ordering Stephen N. Pallone One-DimensionalRoot-Finding 0 1 f x Problem Statement Structural Results

Coupled Bisection for Root-Ordering

Stephen N. Pallone Peter I. Frazier Shane G. HendersonCornell University ORIE Cornell University ORIE Cornell University ORIE

[email protected] [email protected] [email protected]

July 17, 2015

Coupled Bisection for Root-Ordering Stephen N. Pallone

One-Dimensional Root-Finding

0 1

Problem Statement Structural Results Bisection Policy 2/28



0 1

f

x∗




0 1

f

x∗

x̄



Coupled Root-Ordering

0 1




0 1

f(s1, ·)f(s2, ·)f(s3, ·)

x∗(s1) x∗(s2) x∗(s3)




0 1

f(s1, ·)f(s2, ·)f(s3, ·)

x∗(s1) x∗(s2) x∗(s3)

x̄




• Let S denote a set of elements (large but finite).• Let f : S × [0, 1)→ R, monotonic in its second argument.• For every element s ∈ S, denote x∗(s) as the unique root off (s, ·).

• Suppose when we evaluate f for a given x̄ ∈ [0, 1), we findf (s, x̄) for every s ∈ S.

• Goal: find the complete ordering of elements s by theirroots x∗(s) with the least number of function evaluations aspossible.



Applications

Multi-armed Bandit Problems• Sorting states in state space according to their Gittens orWhittle indices.

• Ordering of states is all that is required to implementoptimal policy.

Remote Sensing• E.g., finding the boundary of a forest from an airbornelaser scanner.



The Naive Approach

Problems with Discretizing the Interval

• Non-adaptive in nature; unclear how to determinesubinterval length.

• May require more computational work than necessary.



The Naive Approach

Problems with Discretizing the Interval• Non-adaptive in nature; unclear how to determinesubinterval length.




The Naive Approach

Problems with Discretizing the Interval• Non-adaptive in nature; unclear how to determinesubinterval length.




The Clever Approach

How can we adaptively choose where to query f ?



The Clever Approach

4




The Clever Approach

2 2




The Clever Approach

1 12




The Clever Approach

1 12




Formal Definition

State Variables2 2

At time epoch j:

• Let Xj ={

[x(0), x(1)), . . . , [x(j), x(j+1))}denote the current

partition of the interval.• Let Nj = (n(0),n(1), . . . ,n(j)) denote the number of roots ineach subinterval.

• The pair (Xj ,Nj) denotes the computational state.



Formal Definition

State Variables

2 2

n(1)n(0)

x(0) x(1) x(2)

At time epoch j:

• Let Xj ={

[x(0), x(1)), . . . , [x(j), x(j+1))}denote the current





Formal Definition

State Variables

1 12

x(0) x(1) x(2) x(3)

n(0) n(1) n(2)

At time epoch j:• Let Xj =

{[x(0), x(1)), . . . , [x(j), x(j+1))

}denote the current





Formal Definition

State Variables

1 12

x(0) x(1) x(2) x(3)

n(0) n(1) n(2)

At time epoch j:• Let Xj =

{[x(0), x(1)), . . . , [x(j), x(j+1))

}denote the current





Formal DefinitionTransition• We assume i.i.d. priors on x∗(s) for every element s ∈ S.• Without loss of generality, x∗(s) ∼ Uniform[0, 1).

Nj+1 =(n(0), . . . ,n(`−1), N̄ (`),n(`) − N̄ (`),n(`+1), . . . ,n(j)

),

whereN̄ (`) ∼ Binomial

(n(`),

x̄ − x(`)

x(`+1) − x(`)

).




Nj+1 =(n(0), . . . ,n(`−1), N̄ (`),n(`) − N̄ (`),n(`+1), . . . ,n(j)

),


(n(`),

x̄ − x(`)

x(`+1) − x(`)

).




Nj+1 =(n(0), . . . ,n(`−1), N̄ (`),n(`) − N̄ (`),n(`+1), . . . ,n(j)

),


(n(`),

x̄ − x(`)

x(`+1) − x(`)

).



Formal Definition

Transition• We assume i.i.d. priors on x∗(s) for every element s ∈ S.• Without loss of generality, x∗(s) ∼ Uniform[0, 1).

2 2

Nj+1 =(n(0), . . . ,n(`−1), N̄ (`),n(`) − N̄ (`),n(`+1), . . . ,n(j)

),


(n(`),

x̄ − x(`)

x(`+1) − x(`)

).



Formal Definition

Transition• We assume i.i.d. priors on x∗(s) for every element s ∈ S.• Without loss of generality, x∗(s) ∼ Uniform[0, 1).

2 2

Nj+1 =(n(0), . . . ,n(`−1), N̄ (`),n(`) − N̄ (`),n(`+1), . . . ,n(j)

),


(n(`),

x̄ − x(`)

x(`+1) − x(`)

).



Formal Definition

Objective Function1 12

Let τ denote the number of function evaluations to sort allelements, i.e.,

τ = inf{j ∈ N : n(i)j ≤ 1 ∀i}.



Formal Definition

Objective Function

1 120


τ = inf{j ∈ N : n(i)j ≤ 1 ∀i}.



Formal Definition

Objective Function

1 10 11


τ = inf{j ∈ N : n(i)j ≤ 1 ∀i}.



Formal Definition

Objective Function

1 10 11


τ = inf{j ∈ N : n(i)j ≤ 1 ∀i}.



Formal Definition

Defining the Optimal Policy

• A policy π is a mapping from computational states (X ,N )to a refined partition X̄ that adds one additionalsubinterval.

• Let W π(X ,N ) denote the expected number of functionevaluations required to sort elements, where we start atstate (X ,N ), i.e.,

W π(X ,N ) = Eπ [τ | (X0,N0) = (X ,N )] .

• Let W (X ,N ) = infπW π(X ,N ) denote the expectednumber of evaluations required under the optimal policy.



Formal Definition




W π(X ,N ) = Eπ [τ | (X0,N0) = (X ,N )] .




Formal Definition




W π(X ,N ) = Eπ [τ | (X0,N0) = (X ,N )] .




Formal Definition




W π(X ,N ) = Eπ [τ | (X0,N0) = (X ,N )] .




Dynamic Programming

By the principle of optimality, we can iteratively take the bestpartition X̄ , giving us a recursion.

Theorem (Basic Recursion)For a given computational state (X ,N ), we have

W (X ,N ) = 1 + infX̄

E[W (X̄ , N̄ ) | (X0,N0) = (X ,N )

],

with stopping conditions W (X ,N ) = 0 if n(i) ≤ 1 for all i.



Dynamic Programming

By the principle of optimality, we can iteratively take the bestpartition X̄ , giving us a recursion.

Theorem (Basic Recursion)For a given computational state (X ,N ), we have

W (X ,N ) = 1 + infX̄

E[W (X̄ , N̄ ) | (X0,N0) = (X ,N )

],

with stopping conditions W (X ,N ) = 0 if n(i) ≤ 1 for all i.



Divide and Conqueor

Theorem (Decomposition and Interval Invariance)Under the optimal policy, the value function W has thefollowing two properties.

• Decomposition:

W (X ,N ) =∑|N |−1

i=0 W({

[x(i), x(i+1))},n(i)

)• Interval Invariance:W({

[x(i), x(i+1))},n(i)

)= W

({[0, 1)} ,n(i)

).



Decomposition

Ex. Expected remaining number of iterations required tofinish sorting 7 elements:

W({

[x(0), x(1)), [x(1), x(2)), [x(2), x(3))}, (2, 2, 3)

)

22 3



Decomposition


W({

[x(0), x(1)), [x(1), x(2)), [x(2), x(3))}, (2, 2, 3)

)

22 3

︸︷︷︸sort 3 elements





Decomposition


W({

[x(0), x(1)), [x(1), x(2)), [x(2), x(3))}, (2, 2, 3)

)

22 3




= W({[x(0), x(1))}, 2

)+ W

({[x(1), x(2))}, 2

)+ W

({[x(2), x(3))}, 3

)



Interval Invariance

a b

3



Interval Invariance

a b

a b

3

1 2



Interval Invariance

a b

a b

a b

3

1 2

1 0 2



Interval Invariance

a b

a b

a b

a b

3

1 2

1 0 2

1 0 1 1



Interval Invariance

a b

a b

a b

a b

3

1 2

1 0 2

1 0 1 1

Re-scaling⇐⇒



Interval Invariance

a b

a b

a b

a b

3

1 2

1 0 2

1 0 1 1

0 1

0 1

0 1

0 1

3

1 2

1 0 2

1 0 1 1

Re-scaling⇐⇒



A Simpler Recursion

0 1

n

Corollary (Simple Recursion)

W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,

where Nx ∼ Binomial (n, x), and W (0) = W (1) = 0.



A Simpler Recursion

Nx n−Nx

0 x x 1

0 1

n

⇓Nx ∼ Binomial(n, x)


W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,




A Simpler Recursion

Nx n−Nx

0 x x 1W ({[0, x)}, Nx) W ({[x, 1)}, n−Nx)+

0 1

n

W ({[0, 1)}, n)



W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,




A Simpler Recursion

Nx n−Nx

0 1 0 1W ({[0, 1)}, Nx) W ({[0, 1)}, n−Nx)+

0 1

n

W ({[0, 1)}, n)



W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,




A Simpler Recursion

Nx n−Nx

0 1 0 1W (Nx) W (n−Nx)+

0 1

n

W (n)



W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,




A Simpler Recursion

For notation purposes, let

W (n) := W ({[0, 1)},n) .


W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,




A Simpler Recursion

For notation purposes, let

W (n) := W ({[0, 1)},n) .


W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)] ,




Computing the Optimal Policy

W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)]

• Using the stopping conditions W (0) = W (1) = 0, we caniteratively compute W (·) for arbitrary n.

• Problem: Nx ∈ {0,n} with positive probability.• This is an implicit definition of W .




W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)]






W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)]


• Problem: Nx ∈ {0,n} with positive probability.

• This is an implicit definition of W .




W (n) = 1 + minx∈(0,1)

E [W (Nx) + W (n −Nx)]






Theorem (Computational Recursion)For all n ≥ 2, we have

W (n) = minx∈(0,1)

{ 11− xn − (1− x)n

+ E[W (Nx) + W (n −Nx)

∣∣∣∣ 1 ≤ Nx ≤ n − 1]}

,




Bisection Policy

• Suppose we queried the midpoint of the interval, regardlessof the number of elements to sort, i.e., choose x = 1/2 forall n.

• We call this the Bisection Policy.• Denote W β(n) the expected number of functionevaluations to sort n elements by their roots, under thebisection policy.



Bisection Policy

Theorem (Bisection Policy Recursion)For all n ≥ 2, we have

W β(n) = 1 + 2 E[W β(N1/2)

],

where N1/2 ∼ Binomial (n, 1/2), and W β(0) = W β(1) = 0.



Bisection vs. Optimal Policy

0 250 500 750 1000 1250 1500n

0

500

1000

1500

2000

2500

3000Ex

pect

ed F

unct

ion

Eval

uatio

ns

Wβ(n) (Bisection Policy)W(n) (Optimal Policy)



Is Bisection Optimal?

No. Consider the case n = 6.









The case n = 6

0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58x

7.65650

7.65655

7.65660

7.65665

7.65670

7.65675

7.65680

7.65685

7.65690

Expe

cted

Fun

ctio

n Ev

alua

tions





Linear Growth Rate

0 250 500 750 1000 1250 1500n

1.4424

1.4425

1.4426

1.4427

1.4428

1.4429

1.4430

Line

ar R

ate

∆Wβ(n) (Bisection Policy)∆W(n) (Optimal Policy)



Theorem (Lower Bound)For all n ≥ 0, we have

W (n) ≥ n − 1.

Proof : If we want to sort n elements, we must perform at leastn − 1 function evaluations of f to seperate them.

Can we find upper bounds on W β?




W (n) ≥ n − 1.






W (n) ≥ n − 1.





What’s the difference?

For notation purposes, define ∆h(n) = h(n + 1)− h(n).

Theorem (Recursion for ∆W β)For all n ≥ 2, we have

∆W β(n) = E[∆W β(N1/2)

],

where N1/2 ∼ Binomial (n, 1/2), and stopping conditions∆W β(0) = 0 and ∆W β(1) = 2.



Bounding the Bisection Policy

Goal: show for some m, γ > 0 that ∆W β(n) ≤ γ for all n ≥ m.





• Each subsequent term of ∆W β is a weighted average of theprevious terms in the sequence.

• Idea: Induction!





E[∆W β(N1/2)

]

E[∆W β(N1/2)

∣∣ N1/2 ≤ m− 1]

E[∆W β(N1/2)

∣∣ N1/2 ≥ m]

Bound above using

Induction Hypthosis

Bound above using

?



Computational Upper Bounds

0 1 2 3 4 50.0

0.5

1.0

1.5

2.0

Growth Rate of Expected Evaluations Required to Sort

∆W β(n) (Linear Growth Rate)

0 1 2 3 4 50.0

0.1

0.2

0.3

0.4

0.5PMF of Truncated Binonmial Distribution




0 1 2 3 4 50.0

0.5

1.0

1.5

2.0



0 1 2 3 4 50.0

0.1

0.2

0.3

0.4





0 1 2 3 4 50.0

0.5

1.0

1.5

2.0



0 1 2 3 4 50.0

0.1

0.2

0.3

0.4





0 1 2 3 4 50.0

0.5

1.0

1.5

2.0



0 1 2 3 4 50.0

0.1

0.2

0.3

0.4




A Useful Property of Binomial RVs













0 1 2 3 4 50.0

0.5

1.0

1.5

2.0



0 1 2 3 4 50.0

0.1

0.2

0.3

0.4





0 1 2 3 4 50.0

0.5

1.0

1.5

2.0


∆W β(n) (Linear Growth Rate)Monotone Decreasing Upper Bound

0 1 2 3 4 50.0

0.1

0.2

0.3

0.4





Theorem (Upper Bound on Bisection Policy)For some non-negative integer m, we definegm(`) = maxk∈[`,m−1] ∆W β(k), and define h as

hm(n) = E[gm(N1/2

) ∣∣∣ N1/2 ≤ m − 1],

where N1/2 ∼ Binomial (n, 1/2). Suppose we have γm > 0 sothat the following condition holds:

max{

∆W β(m), hm(m)}≤ γm .

Then for all n ≥ m, it must be that ∆W β(n) ≤ γm.



Computing Upper Bounds

• Use the computational recursion to compute ∆W β(k) fork = {1, 2, . . . ,m} for some positive integer m.

• Compute γm = max{∆W β(m), hm(m)}.

Corollary (Bounds for Bisection Policy)For n ≥ 15, we have that

n − 1 ≤W (n) ≤W β(n) ≤ γ15 · (n − 15) + W β(15),

where γ15 = 1.444 and W β(15) = 22.081.



Computing Upper Bounds

• Use the computational recursion to compute ∆W β(k) fork = {1, 2, . . . ,m} for some positive integer m.

• Compute γm = max{∆W β(m), hm(m)}.

Corollary (Bounds for Bisection Policy)For n ≥ 15, we have that

n − 1 ≤W (n) ≤W β(n) ≤ γ15 · (n − 15) + W β(15),

where γ15 = 1.444 and W β(15) = 22.081.



Further Questions

• If we were allowed to evaluate the function at multiplepoints in parallel, where would we query the function?

• Which subintervals do we refine at each time epoch?• How much computing power do we allocate to each

subinterval?

• What if we were only interested in sorting the rankings ofthe top k zeros?

• What would a bisection policy look like?• Can we effectively compute the optimal policy if the i.i.d.

assumption does not hold?



Further Questions



subinterval?






Further Questions



subinterval?





Documents

Coupled Bisection for Root-Ordering - Cornell UniversityCoupled Bisection for Root-Ordering Stephen N. Pallone One-DimensionalRoot-Finding 0 1 f x Problem Statement Structural Results