Session 2a, 10th June 2008 ICT-MobileSummit 2008 Copyright 2008 - E3 project, BUPT
Autonomic Joint Session Admission Control using Reinforcement Learning
Beijing University of Posts and Telecommunications
China
Motivation, problem area
• The interworking of radio sub-networks, and especially tight cooperation among them, is of great interest for operating RATs with higher system performance and spectrum efficiency as well as a better user experience.
• The emergence of end-to-end reconfigurability further facilitates joint radio resource management (JRRM).
• Joint session admission control (JOSAC) is a JRRM function that admits or denies a session to a particular RAT in order to achieve an optimal allocation of resources.
• For an operator running several access networks with different RATs and numerous base stations (BSs) or access points (APs) in an urban area, it is highly desirable that the joint control across those RATs be self-managed, adapting to varying traffic demand without costly human planning and maintenance.
Research Objectives
• To realize this self-management, we appeal to the autonomic learning mechanisms that play an important role in cognitive radio. With respect to JOSAC, such intelligence requires the agent to learn the optimal policy from its online operation, which falls squarely within the field of reinforcement learning (RL).
• In this paper, we formulate the JOSAC problem in a multi-radio environment as a distributed RL process. Our contribution is to realize the autonomy of JOSAC by applying the RL methodology directly in the joint admission control decision. Our objective is to achieve lower blocking and handover dropping probabilities via the autonomic “trial-and-error” learning process of the JOSAC agent. Proper service allocation and high network revenue are also obtained from this process.
Research approach, Methodology
• Distributed RL model
– Standard RL model
• State set: S = {s1, s2, …, sn}
• Action set: A = {a1, a2, …, am}
• Policy: π: S → A
• Reward: r(s, a)
• Target: V = E[ Σ_{t=0}^{∞} γ^t · r_t ]
– Iterative process: Observation → Calculation → Decision → Evaluation → Update

[Figure: the learning agent (π: S → A) observes the state s and reward r(s, a) from the radio environment and responds with an action a.]
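The agent–environment loop above can be sketched in a few lines. This is an illustrative toy (the environment, state names, and actions below are ours, not the paper's JOSAC simulator): the agent observes a state s, chooses an action a under its policy π, and receives a reward r(s, a) used to update the policy.

```python
import random

class Environment:
    STATES = ["low_load", "high_load"]

    def observe(self):                        # state s
        return random.choice(self.STATES)

    def step(self, state, action):            # reward r(s, a)
        return 1.0 if action == "accept" and state == "low_load" else 0.0

class Agent:
    def __init__(self, actions):
        self.actions = actions

    def decide(self, state):                  # policy pi: S -> A (random here)
        return random.choice(self.actions)

    def update(self, state, action, reward):  # learning step (placeholder)
        pass                                  # e.g. a Q-value iteration

env, agent = Environment(), Agent(["accept", "reject"])
rewards = []
# Observation -> Calculation -> Decision -> Evaluation -> Update
for t in range(100):
    s = env.observe()
    a = agent.decide(s)
    r = env.step(s, a)
    agent.update(s, a, r)
    rewards.append(r)
```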
Research approach, Methodology
• Distributed RL model
– Q-learning, a popular RL algorithm, can learn the optimal policy through simple Q-value iterations without knowing or modeling R(s, a) and P_{s,s'}(a).
– Iteration rule: Q_{t+1}(s, a) = (1 − α) · Q_t(s, a) + α · [ r + γ · max_{a'∈A} Q_t(s', a') ]
– Optimal policy: π*(s) = argmax_{a∈A} Q*(s, a)
[Figure: a toy six-state example of Q-learning with actions A and B and the learned values Q(0, A) = 12, Q(0, B) = −988, Q(1, A) = 2, Q(1, B) = 11, Q(2, A) = −990, Q(3, A) = 1, Q(4, A) = 10, together with the tabular layout of Q over states s1 … sn and actions a1 … am.]
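The iteration rule and optimal policy above can be sketched as a tabular Q-learning update. The single transition below is made up for illustration (it is not the figure's example); α and γ are set to the paper's simulation values of 0.5.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.5
ACTIONS = ["A", "B"]
Q = defaultdict(float)      # Q(s, a), initialized to 0

def q_update(s, a, r, s_next):
    # Q_{t+1}(s,a) = (1 - alpha)*Q_t(s,a) + alpha*(r + gamma*max_b Q_t(s',b))
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

def optimal_action(s):
    # pi*(s) = argmax_a Q*(s, a)
    return max(ACTIONS, key=lambda b: Q[(s, b)])

# One update from state 0, action "A", reward 10, next state 1:
q_update(0, "A", 10.0, 1)
print(Q[(0, "A")])   # (1-0.5)*0 + 0.5*(10 + 0.5*0) = 5.0
```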
Research approach, Methodology
• Distributed RL model
– Distributed RL architecture
• To collect information in real time and capture the user experience more precisely, we assign each terminal its own JOSAC agent, which handles the JOSAC problem by itself.
• We place Q-value depositories at the network side to store the learning results of all agents in the form of Q-values, for experience sharing and memory-space saving.
• The number of agents can change dynamically, with negligible influence on the convergence results.
Research approach, Methodology
• Problem Formulation
– Learning agent: the distributed JOSAC agents
– Environment: network status and arriving sessions
– State: s = (y, c, h, v, l), capturing whether the session was redirected, the coverage, new session or handover, service type, and load distribution
– Action: a ∈ A = {0, 1, 2, …, K}, i.e. the session is rejected or accepted by one RAT
– Reward: r(s, a) = η(v, k) · β(h) · Δt, where
• Δt is the session duration
• η(v, k) is the service revenue coefficient, embodying the suitability of RAT k for traffic type v
• β(h) is the reward gain coefficient for handovers, giving a higher reward to handover sessions so as to reduce the handover dropping proportion
• δ is the sharing coefficient between the original RAT and the target RAT for a redirected session:
r_src(s, a) = r(s, a) · δ
r_dst(s, a) = r(s, a) · (1 − δ)
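As an illustration, the reward definition can be coded directly. The η(v, k) and β(h) values below are taken from the simulation configuration later in the deck; the function names and RAT labels are our own shorthand.

```python
# r(s, a) = eta(v, k) * beta(h) * dt, split between the original (source)
# and target RAT by the sharing coefficient delta for a redirected session.
ETA = {("voice", "GSM"): 3, ("voice", "UMTS"): 5, ("voice", "WLAN"): 1,
       ("data", "GSM"): 3, ("data", "UMTS"): 1, ("data", "WLAN"): 5}
BETA = {"new": 1, "handover": 10}
DELTA = 0.2

def reward(v, k, h, dt):
    """r(s, a) for service type v, RAT k, session type h, duration dt."""
    return ETA[(v, k)] * BETA[h] * dt

def shared_reward(v, k, h, dt):
    """(source share, target share) of the reward for a redirected session."""
    r = reward(v, k, h, dt)
    return r * DELTA, r * (1 - DELTA)

# A redirected voice handover accepted by UMTS, lasting 2 time units:
src, dst = shared_reward("voice", "UMTS", "handover", 2.0)
# r = 5 * 10 * 2 = 100, so (src, dst) is approximately (20.0, 80.0)
```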
Research approach, Methodology
• Algorithm Implementation
– (1) Initialization; (2) Q-value acquisition and update; (3) action selection and execution; (4) reward calculation; (5) parameter update; (6) return to (2)
– Under state s, the agent chooses an action a with a probability defined by the Boltzmann distribution:
p(a|s) = e^{Q(s, a)/T} / Σ_{b∈A} e^{Q(s, b)/T}
• T is the “temperature” coefficient, which gradually decreases to 0 as the iteration proceeds.

[Flowchart: initialize Q(s, a) = 0, α_{s,a} = α_0, T_s = T_0. On each session arrival, construct the new state s = (y, c, h, v, l), compute p(a|s) by the Boltzmann distribution, select an action a, record (s, a), and count n_{s,a} and n_s. If JOSAC rejects the session, r(s, a) = 0; if it accepts, r(s, a) = η(v, k) β(h) Δt, with r_t = r(s, a)(1 − δ) if the session was redirected and r_t = r(s, a) otherwise. Then find Q_t(s, a ∈ A), update Q_{t+1}(s_old, a_old), and update α_{s,a} and T_s.]
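The Boltzmann action selection above can be sketched as a softmax over Q-values: a high temperature T gives near-uniform exploration, while T → 0 approaches greedy exploitation. The function names are ours.

```python
import math
import random

def boltzmann_probs(q_values, T):
    # p(a|s) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)
    exps = [math.exp(q / T) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(q_values, T):
    # Sample an action index according to the Boltzmann probabilities.
    probs = boltzmann_probs(q_values, T)
    return random.choices(range(len(q_values)), weights=probs)[0]

# High T: nearly uniform -> exploration (probabilities close to [0.5, 0.5]).
p_hot = boltzmann_probs([1.0, 2.0], T=100.0)
# Low T: sharply peaked on the best action -> exploitation.
p_cold = boltzmann_probs([1.0, 2.0], T=0.1)
```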
Major Outcomes/Results
• Performance Evaluation

[Figure: simulation scenario]

Simulation configuration:
                          GSM/GPRS   UMTS     WLAN
  Cell capacity (kbps)    200        800      2,000
  η(v, k): voice          3          5        1
  η(v, k): data           3          1        5
  Area distribution       A: 10%    B: 10%    C: 80%

  β(h): new session = 1, handover = 10
  δ = 0.2, γ = 0.5, T0 = 10, α0 = 0.5
  N = 2, K = 3, iteration times: 20,000
Major Outcomes/Results
• Performance Evaluation

[Figures: the blocking probability and handover dropping probability; the load difference between voice and data service in each RAT (λ/μ = 20); the average revenue per hour; the convergence of the iteration.]
Conclusion and outlook
• A novel JOSAC algorithm based on reinforcement learning has been presented and evaluated in a multi-radio environment.
• It solves the joint admission control problem in an autonomic fashion using the Q-learning method. Compared to the Non-JOSAC and LB-JOSAC algorithms, the proposed distributed RL-JOSAC provides optimized admission control policies that reduce the overall blocking probability while achieving a lower handover dropping probability and higher revenue.
• Its learning behavior provides the opportunity to improve online performance by exploiting past experience. This is a great advantage for handling the dynamic and complex situations of the B3G environment intelligently, with little human effort.