Session 2a, 10th June 2008 ICT-MobileSummit 2008 Copyright 2008 - E3 project, BUPT
Autonomic Joint Session Admission Control using Reinforcement Learning
Beijing University of Posts and Telecommunications
China
Motivation, problem area
• The interworking of radio sub-networks, and especially tight cooperation among them, is of great interest for operating RATs with higher system performance and spectrum efficiency as well as a better user experience.
• The emergence of end-to-end reconfigurability further facilitates joint radio resource management (JRRM).
• Joint session admission control (JOSAC) is a JRRM function that admits or denies a session to a particular RAT in order to achieve an optimal allocation of resources.
• For an operator running several access networks with different RATs and numerous base stations (BSs) or access points (APs) in an urban area, it is highly desirable that the joint control across those RATs be self-managed, adapting to varying traffic demand without costly human planning and maintenance.
Research Objectives
• To realize this self-management, we appeal to the autonomic learning mechanisms that play an important role in cognitive radio. With respect to JOSAC, such intelligence requires the agent to learn the optimal policy from its online operation, which falls squarely within the field of reinforcement learning (RL).
• In this paper, we formulate the JOSAC problem in a multi-radio environment as a distributed RL process. Our contribution is to realize the autonomy of JOSAC by applying the RL methodology directly in the joint admission control decision. Our objective is to achieve lower blocking and handover dropping probabilities via the autonomic “trial-and-error” learning process of the JOSAC agent. Proper service allocation and high network revenue are also obtained from this process.
Research approach, Methodology
• Distributed RL model
– Standard RL model
• State set: S = {s1, s2, …, sn}
• Action set: A = {a1, a2, …, am}
• Policy: π: S → A
• Reward: r(s, a)
• Target: V = E[ Σ_{t=0}^{∞} γ^t · r_t ]
– Iterative process: Observation → Calculation → Decision → Evaluation → Update

[Figure: the learning agent (π: S → A) observes the state s and reward r(s, a) from the radio environment and responds with an action a.]
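The agent–environment loop above can be sketched in a few lines. This is an illustrative toy (the environment, state names, and actions below are ours, not the paper's JOSAC simulator): the agent observes a state s, chooses an action a under its policy π, and receives a reward r(s, a) used to update the policy.

```python
import random

class Environment:
    STATES = ["low_load", "high_load"]

    def observe(self):                        # state s
        return random.choice(self.STATES)

    def step(self, state, action):            # reward r(s, a)
        return 1.0 if action == "accept" and state == "low_load" else 0.0

class Agent:
    def __init__(self, actions):
        self.actions = actions

    def decide(self, state):                  # policy pi: S -> A (random here)
        return random.choice(self.actions)

    def update(self, state, action, reward):  # learning step (placeholder)
        pass                                  # e.g. a Q-value iteration

env, agent = Environment(), Agent(["accept", "reject"])
rewards = []
# Observation -> Calculation -> Decision -> Evaluation -> Update
for t in range(100):
    s = env.observe()
    a = agent.decide(s)
    r = env.step(s, a)
    agent.update(s, a, r)
    rewards.append(r)
```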
Research approach, Methodology
• Distributed RL model
– Q-learning, a popular RL algorithm, can learn the optimal policy through simple Q-value iterations without knowing or modeling R(s, a) and P_{s,s'}(a).
– Iteration rule: Q_{t+1}(s, a) = (1 − α) · Q_t(s, a) + α · [ r + γ · max_{a'∈A} Q_t(s', a') ]
– Optimal policy: π*(s) = argmax_{a∈A} Q*(s, a)
[Figure: a toy six-state example of Q-learning with actions A and B and the learned values Q(0, A) = 12, Q(0, B) = −988, Q(1, A) = 2, Q(1, B) = 11, Q(2, A) = −990, Q(3, A) = 1, Q(4, A) = 10, together with the tabular layout of Q over states s1 … sn and actions a1 … am.]
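The iteration rule and optimal policy above can be sketched as a tabular Q-learning update. The single transition below is made up for illustration (it is not the figure's example); α and γ are set to the paper's simulation values of 0.5.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.5
ACTIONS = ["A", "B"]
Q = defaultdict(float)      # Q(s, a), initialized to 0

def q_update(s, a, r, s_next):
    # Q_{t+1}(s,a) = (1 - alpha)*Q_t(s,a) + alpha*(r + gamma*max_b Q_t(s',b))
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

def optimal_action(s):
    # pi*(s) = argmax_a Q*(s, a)
    return max(ACTIONS, key=lambda b: Q[(s, b)])

# One update from state 0, action "A", reward 10, next state 1:
q_update(0, "A", 10.0, 1)
print(Q[(0, "A")])   # (1-0.5)*0 + 0.5*(10 + 0.5*0) = 5.0
```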
Research approach, Methodology
• Distributed RL model
– Distributed RL architecture
• To collect information in real time and capture the user experience more precisely, we assign each terminal its own JOSAC agent, which handles the JOSAC problem by itself.
• We place Q-value depositories at the network side to store the learning results of all agents in the form of Q-values, for experience sharing and memory-space saving.
• The number of agents can change dynamically, with negligible influence on the convergence results.
Research approach, Methodology
• Problem Formulation
– Learning agent: the distributed JOSAC agents
– Environment: network status and arriving sessions
– State: s = (y, c, h, v, l), capturing whether the session was redirected, the coverage, new session or handover, service type, and load distribution
– Action: a ∈ A = {0, 1, 2, …, K}, i.e. the session is rejected or accepted by one RAT
– Reward: r(s, a) = η(v, k) · β(h) · Δt, where
• Δt is the session duration
• η(v, k) is the service revenue coefficient, embodying the suitability of RAT k for traffic type v
• β(h) is the reward gain coefficient for handovers, giving a higher reward to handover sessions so as to reduce the handover dropping proportion
• δ is the sharing coefficient between the original RAT and the target RAT for a redirected session:
r_src(s, a) = r(s, a) · δ
r_dst(s, a) = r(s, a) · (1 − δ)
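As an illustration, the reward definition can be coded directly. The η(v, k) and β(h) values below are taken from the simulation configuration later in the deck; the function names and RAT labels are our own shorthand.

```python
# r(s, a) = eta(v, k) * beta(h) * dt, split between the original (source)
# and target RAT by the sharing coefficient delta for a redirected session.
ETA = {("voice", "GSM"): 3, ("voice", "UMTS"): 5, ("voice", "WLAN"): 1,
       ("data", "GSM"): 3, ("data", "UMTS"): 1, ("data", "WLAN"): 5}
BETA = {"new": 1, "handover": 10}
DELTA = 0.2

def reward(v, k, h, dt):
    """r(s, a) for service type v, RAT k, session type h, duration dt."""
    return ETA[(v, k)] * BETA[h] * dt

def shared_reward(v, k, h, dt):
    """(source share, target share) of the reward for a redirected session."""
    r = reward(v, k, h, dt)
    return r * DELTA, r * (1 - DELTA)

# A redirected voice handover accepted by UMTS, lasting 2 time units:
src, dst = shared_reward("voice", "UMTS", "handover", 2.0)
# r = 5 * 10 * 2 = 100, so (src, dst) is approximately (20.0, 80.0)
```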
Research approach, Methodology
• Algorithm Implementation
– (1) Initialization; (2) Q-value acquisition and update; (3) action selection and execution; (4) reward calculation; (5) parameter update; (6) return to (2)
– Under state s, the agent chooses an action a with a probability defined by the Boltzmann distribution:
p(a|s) = e^{Q(s, a)/T} / Σ_{b∈A} e^{Q(s, b)/T}
• T is the “temperature” coefficient, which gradually decreases to 0 as the iteration proceeds.

[Flowchart: initialize Q(s, a) = 0, α_{s,a} = α_0, T_s = T_0. On each session arrival, construct the new state s = (y, c, h, v, l), compute p(a|s) by the Boltzmann distribution, select an action a, record (s, a), and count n_{s,a} and n_s. If JOSAC rejects the session, r(s, a) = 0; if it accepts, r(s, a) = η(v, k) β(h) Δt, with r_t = r(s, a)(1 − δ) if the session was redirected and r_t = r(s, a) otherwise. Then find Q_t(s, a ∈ A), update Q_{t+1}(s_old, a_old), and update α_{s,a} and T_s.]
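The Boltzmann action selection above can be sketched as a softmax over Q-values: a high temperature T gives near-uniform exploration, while T → 0 approaches greedy exploitation. The function names are ours.

```python
import math
import random

def boltzmann_probs(q_values, T):
    # p(a|s) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)
    exps = [math.exp(q / T) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(q_values, T):
    # Sample an action index according to the Boltzmann probabilities.
    probs = boltzmann_probs(q_values, T)
    return random.choices(range(len(q_values)), weights=probs)[0]

# High T: nearly uniform -> exploration (probabilities close to [0.5, 0.5]).
p_hot = boltzmann_probs([1.0, 2.0], T=100.0)
# Low T: sharply peaked on the best action -> exploitation.
p_cold = boltzmann_probs([1.0, 2.0], T=0.1)
```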
Major Outcomes/Results
• Performance Evaluation

[Figure: simulation scenario]

Simulation configuration:
                          GSM/GPRS   UMTS     WLAN
  Cell capacity (kbps)    200        800      2,000
  η(v, k): voice          3          5        1
  η(v, k): data           3          1        5
  Area distribution       A: 10%    B: 10%    C: 80%

  β(h): new session = 1, handover = 10
  δ = 0.2, γ = 0.5, T0 = 10, α0 = 0.5
  N = 2, K = 3, iteration times: 20,000
Major Outcomes/Results
• Performance Evaluation

[Figures: the blocking probability and handover dropping probability; the load difference between voice and data service in each RAT (λ/μ = 20); the average revenue per hour; the convergence of the iteration.]
Conclusion and outlook
• A novel JOSAC algorithm based on reinforcement learning has been presented and evaluated in a multi-radio environment.
• It solves the joint admission control problem in an autonomic fashion using the Q-learning method. Compared to the Non-JOSAC and LB-JOSAC algorithms, the proposed distributed RL-JOSAC provides optimized admission control policies that reduce the overall blocking probability while achieving a lower handover dropping probability and higher revenue.
• Its learning behavior provides the opportunity to improve online performance by exploiting past experience. This is a great advantage for handling the dynamic and complex situations of the B3G environment intelligently, with little human effort.