
Data Driven Networks

Sachin Katti

Is it possible to “learn” the control planes of networks and applications?

Operators specify what they want, and the system learns how to deliver it.

CAN WE LEARN THE CONTROL PLANE OF THE NETWORK?

Control & Orchestration Plane automatically synthesized by learning from telemetry data

[Architecture diagram: the control & orchestration plane sits between northbound application interfaces (applications with access to network control: virtual reality, 4K video, drone ATC) and southbound infrastructure interfaces (infrastructure control points: edge cloud, RAN), driven by real-time network telemetry and by operator policies, intent & SLA.]

Result: dynamic per-user/per-flow control, implemented to meet the policy and deliver on the SLA.

You specify what you want, and the network figures out how to deliver it!

DATA DRIVEN CONTROL LOOP

[Flowchart: real-time data (network telemetry) feeds a service KPI prediction. If the current controls are predicted to meet the KPI, recommend the current control settings; otherwise run what-if analysis, adjusting control knobs until the adjusted controls are predicted to meet the KPI, then recommend the adjusted control settings. Recommended controls are applied only if they meet the operator policies, intent & SLAs.]
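A minimal sketch of one iteration of this loop in Python; the predictor, knob adjuster, and policy check are hypothetical callables standing in for the real telemetry-driven components, not the actual system API:

```python
# Minimal sketch of the data-driven control loop described above. The predictor,
# knob adjuster, and policy check are passed in as callables; all are placeholders.
def control_loop(telemetry, controls, kpi_target, predict_kpi,
                 adjust_knobs, meets_policy, max_what_if_steps=10):
    # Will the current controls meet the KPI?
    if predict_kpi(telemetry, controls) < kpi_target:
        # What-if analysis: adjust control knobs until the predicted KPI is met.
        for _ in range(max_what_if_steps):
            controls = adjust_knobs(controls, telemetry)
            if predict_kpi(telemetry, controls) >= kpi_target:
                break
    # Apply only if the operator policies / SLAs are satisfied.
    return controls if meets_policy(controls) else None
```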

TWO LEARNING APPROACHES

Classical Machine Learning (prediction & what-if): requires extensive domain knowledge.

Deep Reinforcement Learning (neural network + reinforcement learning with a reward function): largely blind; truly approximates intent-based system design.

Classical Machine Learning: build a rules-based algorithm/model, evaluate results, tune/update the algorithm/model.

Deep Reinforcement Learning: a learning agent interacts with the environment through states (application and network state variables), actions (controllable actions), and rewards (reward values chosen to achieve the desired outcomes), running millions of training cycles per hour, then deploy.
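As a generic illustration of that agent/environment interaction loop in Python (the NetworkEnv environment, its state, actions, reward, and dynamics are toy assumptions, not the system from the talk):

```python
import random

class NetworkEnv:
    """Toy stand-in for a network environment: state = cell load, action = a control knob."""
    def reset(self):
        self.load = random.random()          # application/network state variable
        return self.load

    def step(self, action):
        # Hypothetical dynamics: the action nudges the load; reward encodes the desired outcome.
        self.load = min(1.0, max(0.0, self.load + 0.1 * (action - 1)))
        reward = 1.0 - self.load             # lower load -> higher reward
        return self.load, reward, False

env = NetworkEnv()
state = env.reset()
for _ in range(1000):                        # millions of such cycles per hour in practice
    action = random.choice([0, 1, 2])        # a real agent would sample from a learned policy
    state, reward, done = env.step(action)
```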

DEEP RL IS A GREAT FIT FOR NETWORK CONTROL

• LEARN DIRECTLY FROM EXPERIENCE

• NO MODELS OR BRITTLE ASSUMPTIONS

• POLICIES ADAPT TO ACTUAL ENVIRONMENT & WORKLOAD

• OPTIMIZE A VARIETY OF OBJECTIVES END-TO-END

• PROVIDE “RAW” OBSERVATIONS OF NETWORK CONDITIONS

• EXPRESS GOALS THROUGH REWARD

• SYSTEM DOES THE REST

• NETWORK CONTROL DECISIONS ARE OFTEN HIGHLY REPETITIVE

• LOTS OF TRAINING DATA → SAMPLE COMPLEXITY LESS OF A CONCERN

LEARNING & CONTROL PIPELINE

[Pipeline: feature extraction → inference (machine learning) → prediction (deep learning / neural networks) → control]

INFERENCE

• Understand how known network input variables affect a network KPI (e.g., throughput)

• Leverage traditional supervised machine learning and statistical analysis (random forest, etc.) to develop the function f below, whose output is used as input for the deep learning agent

X = average throughput per user, per cell

• Current and past collisions per cell (a)

• Current and past number of users per cell (b)

• Current and past signal strength per user, per cell (c)

X_t = f(a, b, c)
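A minimal sketch of this inference step with scikit-learn; the synthetic feature matrix stands in for the real telemetry feed of (a, b, c) histories, so the data, feature layout, and target are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative telemetry: columns = current/past collisions (a), users (b), signal strength (c)
rng = np.random.default_rng(0)
features = rng.random((1000, 6))                   # a_t, a_{t-1}, b_t, b_{t-1}, c_t, c_{t-1}
throughput = 10 * features[:, 4] - 3 * features[:, 2] + rng.normal(0, 0.5, 1000)

X_train, X_test, y_train, y_test = train_test_split(features, throughput, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# f(a, b, c): estimated average throughput per user, per cell
print("R^2 on held-out data:", model.score(X_test, y_test))
```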

PREDICTION

[Diagram: network state variables (a, b, c) → features extracted from training data (offline) and from the real-time data feed → prediction neural network (real-time) → X_{t+1}, the throughput KPI prediction]
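A minimal sketch of such a real-time throughput predictor in PyTorch; the architecture, feature count, and layer sizes are assumptions for illustration, not the network from the talk:

```python
import torch
import torch.nn as nn

# Small MLP mapping a window of network state features (a, b, c over recent steps)
# to the next-step throughput X_{t+1}. Dimensions and layer sizes are illustrative.
class ThroughputPredictor(nn.Module):
    def __init__(self, n_features=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                    # X_{t+1} prediction
        )

    def forward(self, x):
        return self.net(x)

model = ThroughputPredictor()
features = torch.randn(32, 6)                        # a batch of real-time feature vectors
predicted_throughput = model(features)               # shape: (32, 1)
```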

CONTROL

[Diagram: the real-time prediction neural network, together with network state variables, operator policies, client conditions, and app-specific inputs, feeds a deep learning stage (reinforcement learning) that outputs the control decision.]

USE CASE: VIDEO STREAMING

Network Assisted Mobile Video Streaming

[Diagram: CDN → video stream → control of the ABR for the next chunk]

Result: 50-75% higher video quality, 75-100% lower stall time

Optimization Objective: maximize video quality and minimize stall time in challenging network scenarios

Policy: maintain a minimum average throughput of 500 Kbps per user for non-video traffic

Data: real-time network state data feed; real-time mobile application state data

Control: next-chunk adaptive bit rate (ABR)

SYSTEM ARCHITECTURE

[Diagram: the video client streams ABR-encoded video from the CDN and sends an ABR guidance request to the network-assisted ABR service over an API; the service combines the request with real-time network conditions from the LTE RAN and returns a video ABR recommendation for the next chunk.]
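To make the guidance exchange concrete, here is a hedged Python sketch of what such an ABR guidance request might look like; the endpoint URL, field names, and response shape are hypothetical assumptions, not the actual API described in the talk:

```python
import json
import urllib.request

# Hypothetical ABR-guidance call: the endpoint and JSON schema are illustrative only.
request_body = {
    "session_id": "abc123",
    "buffer_seconds": 12.4,                         # current client buffer occupancy
    "available_bitrates_kbps": [300, 750, 1200, 1800],
}
req = urllib.request.Request(
    "https://abr-guidance.example.com/v1/next-chunk",
    data=json.dumps(request_body).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    recommendation = json.load(resp)                # e.g. {"bitrate_kbps": 1200}
```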

CELL DATA INGESTION IN TWO DEPLOYMENTS

Silicon Valley Corridor: ~250 eNodeBs, ~1300 cells

Melbourne Business District: ~90 eNodeBs, ~270 cells

INFER

Video QoE depends on user throughput. What does user throughput depend on?

User performance can be modeled fairly accurately using features extracted from real-time network data feeds: throughput (bits per time) = spectral efficiency (bits per resource element) × resource allocation rate (resource elements per time), where spectral efficiency is driven by link quality indicators (CQI, rank, MCS) and the resource allocation rate by cell bandwidth, cell contention, and user & total demand (number of active users, packet rate and size).

[Plots: modeled vs. actual throughput across ~100 user sessions, with 58%, 40%, and 12% prediction error for three successive models.]

PREDICT

Predicting the network state variables, and then throughput: test for seasonality and trend, learning an ARIMA-style model using a neural network.

[Plots: seasonal-trend decomposition (data, seasonal, trend, remainder) of the number of users on a cell, and a 00:00-05:00 actual vs. forecast trace with 4% error.]
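A minimal sketch of this seasonality/trend test and a classical ARIMA baseline with statsmodels; the synthetic per-cell user-count series, the seasonal period, and the ARIMA order are illustrative assumptions (the talk replaces the classical model with a neural network):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for "number of users on a cell", sampled every minute for 5 hours.
t = np.arange(300)
users = 25 + 5 * np.sin(2 * np.pi * t / 60) + np.random.default_rng(0).normal(0, 1, 300)
series = pd.Series(users, index=pd.date_range("2019-01-01", periods=300, freq="min"))

# Seasonal-trend decomposition: data = trend + seasonal + remainder.
decomposition = seasonal_decompose(series, period=60)

# Classical ARIMA forecast as a baseline for the learned forecaster.
model = ARIMA(series, order=(2, 0, 1)).fit()
forecast = model.forecast(steps=30)                 # next 30 minutes of per-cell user count
```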

Predict + Control

[Diagram: an LSTM throughput forecaster and an A3C ABR agent share a common set of inputs (video buffer states, past b_app and b_seg time series, RSRP/RSRQ time series, CELT time series); conv1d/relu hidden layers feed a softmax head that outputs the ABR policy, and relu/relu/linear layers feed a head that outputs the throughput prediction.]
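A hedged PyTorch sketch in the spirit of the diagram above: shared inputs feed an ABR policy head (softmax over bitrates, as in an A3C actor) and a throughput prediction head. All sizes, feature counts, and the bitrate action set are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AbrNet(nn.Module):
    """Two-head network: ABR policy (softmax) and throughput prediction (linear)."""
    def __init__(self, n_features=16, history=8, n_bitrates=6, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_features, 32, kernel_size=3), nn.ReLU())
        conv_out = 32 * (history - 2)
        self.policy_head = nn.Sequential(            # ABR policy: distribution over bitrates
            nn.Linear(conv_out, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bitrates), nn.Softmax(dim=-1),
        )
        self.thpt_head = nn.Sequential(              # throughput prediction
            nn.Linear(conv_out, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                            # x: (batch, n_features, history)
        z = self.conv(x).flatten(1)
        return self.policy_head(z), self.thpt_head(z)

net = AbrNet()
policy, thpt = net(torch.randn(4, 16, 8))            # policy: (4, 6), thpt: (4, 1)
```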

WHAT BEHAVIOR DOES THE RL AGENT FOR CONTROLLING VIDEO ABR LEARN?

[Plots: for VIDEO ON DEMAND (13:52-13:56 local time) and LIVE VIDEO (11:14-11:18 local time), the selected bit rate (kbps) and the playback buffer (seconds) over time.]

Network predictions become more valuable as the buffer limit grows smaller.

ON DEMAND VIDEO (buffer limit up to 4 minutes): build a safe buffer, then safely play the highest quality without stalling.

LIVE VIDEO (buffer limit less than 30 s): track throughput predictions to play at as high a quality as possible without stalling.
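An illustrative rule of thumb contrasting the two regimes above, in Python; the thresholds, bitrate ladder, and function name are assumptions, not the learned RL policy:

```python
# With a deep (on-demand) buffer, lean on the buffer; with a shallow (live) buffer,
# lean on the throughput prediction. Thresholds are hypothetical.
def pick_bitrate(bitrates_kbps, buffer_s, predicted_thpt_kbps, buffer_limit_s):
    affordable = [b for b in bitrates_kbps if b <= predicted_thpt_kbps] or [min(bitrates_kbps)]
    if buffer_limit_s > 120:                 # on-demand: buffer-building regime
        if buffer_s < 10:                    # buffer not yet "safe": be conservative
            return min(affordable)
        return max(affordable)               # safe buffer: highest sustainable quality
    return max(affordable)                   # live: track the throughput prediction closely

print(pick_bitrate([300, 750, 1200, 1800], buffer_s=25,
                   predicted_thpt_kbps=1400, buffer_limit_s=240))
```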

DEPLOYMENT RESULTS FROM A US MOBILE NETWORK

COMPARED TO CURRENT ABR SOLUTION

[Bar charts: change in bit rate, stalls, and startup time under four conditions (perfect conditions, cell edge / low signal quality, high cell congestion, high user mobility), with improvements ranging from -2% to 83%; operator-capped max bit rate of 1.8 Mbps.]

WHY FINE-GRAINED NETWORK PREDICTION MATTERS FOR QOE

[Plots, 17:39-17:45 local time: selected bit rate vs. throughput (Mbps), cumulative stall (seconds), and RAN events by bearer type (ERAB, handover, normal).]

1. No context awareness at the start: huge startup time!

2. Cell congestion can spike transiently (~10s of seconds): huge stalls!

[Plots, 13:11-13:16 local time: selected bit rate vs. throughput (Mbps), cumulative stall (seconds), and link quality (dB) of the serving vs. neighboring cell.]

3. Downlink interference can spike transiently (~10s of seconds): huge stalls!

Peaks correspond directly with red lights on Univ Ave, as users get handed off.

USE CASE: CONTENT PRE-POSITIONING

[Diagram: CDN → content pre-positioning; control decision: when and how much data to download.]

Result: more than double the data downloaded on existing infrastructure, with deterministic impact on other subscribers

Optimization Objective: maximize content downloads while maintaining operator-defined minimum throughput requirements

Policy: maintain a minimum average throughput of 500 Kbps per user

Data: real-time network state data feed

Control: when and how much data to download
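An illustrative scheduling rule for this use case in Python; the numbers and function name are assumptions, showing only the idea of downloading no more than the predicted spare capacity after reserving the 500 Kbps-per-user policy floor:

```python
# Download budget for one scheduling window, leaving the per-user policy floor intact.
def prepositioning_budget_bytes(predicted_cell_capacity_kbps, n_active_users,
                                window_s, floor_kbps_per_user=500):
    reserved_kbps = n_active_users * floor_kbps_per_user
    spare_kbps = max(0.0, predicted_cell_capacity_kbps - reserved_kbps)
    return int(spare_kbps * 1000 / 8 * window_s)      # kbps -> bytes over the window

# e.g. a 30 s window on a cell predicted to have 20 Mbps capacity and 10 active users
print(prepositioning_budget_bytes(20_000, 10, 30))
```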

USE CASE: MOBILE NETWORK LOAD BALANCING (RAN LOAD BALANCING)

[Diagram: a user within reach of two cells (A and B), each with its own signal and load; RAN load prediction feeds a RAN controller that issues RAN handoff control.]

Optimization Objective: in cases where signal strength is sufficient on two or more cells, manage connectivity to balance load across the available cells

Policy: maintain a minimum average throughput of 500 Kbps per user

Data: real-time network state data feed

Control: when, and which user, to hand off to a neighboring cell
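A simplified handoff rule illustrating this control decision in Python; the thresholds, parameter names, and the decision logic are assumptions, not the deployed controller:

```python
# Move a user to a neighboring cell only if its signal there is adequate and the
# predicted load gap justifies the handoff. Thresholds are hypothetical.
def should_handoff(serving_load, neighbor_load_pred, neighbor_rsrp_dbm,
                   min_rsrp_dbm=-110, load_gap=0.2):
    signal_ok = neighbor_rsrp_dbm >= min_rsrp_dbm
    less_loaded = (serving_load - neighbor_load_pred) >= load_gap
    return signal_ok and less_loaded

print(should_handoff(serving_load=0.9, neighbor_load_pred=0.5, neighbor_rsrp_dbm=-100))
```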

PLATFORM ARCHITECTURE – DEEP LEARNING ENGINE

[Diagram: a real-time, per-user, fine-grained CTR/CTUM monitoring feed gives the deep learning engine fine-grained visibility into dynamic mobile network conditions. The engine runs as a Uhana managed service (cloud infrastructure) hosting app-specific neural networks delivered over the air. It takes operator policies (constraints) and operator/app-specific intents (desired outcomes), varied by user, subscription level, device type, content type, etc., and drives application control points through app-control APIs.]

THE INFER-PREDICT-CONTROL PRIMITIVES

INFER (offline): what network metrics (e.g., link quality, cell congestion) affect app QoE metrics (e.g., average bitrate, stalls, switches)?

PREDICT (real time): how will network metrics (e.g., link throughput) look in the future? For example, the link throughput in the next 5 s.

CONTROL (real time): given app state (e.g., buffer, app throughput), what should the app control be for the best app QoE? For example, the ABR for the next video chunk.

PLATFORM FEEDBACK: PROCESSING LATENCY DICTATES PREDICTION REQUIREMENTS

[Diagram: the INFER / PREDICT / CONTROL primitives map onto three real-time platform stages: INGEST (real-time consumption of network data), PROCESS (real-time computation of network metrics), and FORECAST (real-time prediction of network metrics).]

The lower the processing latency, the lower the required prediction horizon, and the easier the prediction.
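A rough way to state this relationship in Python; the decomposition of the loop delay into ingest, processing, and actuation latencies is an assumption for illustration, not a formula from the talk:

```python
# The prediction horizon must at least cover the end-to-end loop delay, so shaving
# processing latency directly shrinks how far ahead the forecaster must see.
def required_horizon_s(ingest_latency_s, processing_latency_s, actuation_latency_s):
    return ingest_latency_s + processing_latency_s + actuation_latency_s

print(required_horizon_s(1.0, 3.0, 0.5))   # e.g. 4.5 s look-ahead needed
```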

CONCLUSION & LESSONS LEARNED

• Network control is a “Big Control” problem

• Learning techniques can lead to automated control and intent-based system design

• Infer + Predict + Control is a general framework; it applies to systems beyond mobile networks

• Streaming analytics pipeline latencies are hard to predict right now, with a direct implication on the hardness of prediction

• Analytics pipeline system performance requires careful manual tuning; it is hard to iterate and keep performance predictable

• Online learning is still an unsolved problem, both in learning techniques and in system design