Automated Cellular Root Cause Analysis

Sayandeep Sen Bell Labs India

Joint work with Sourjya Bhaumik & Rijin John

Cellular Base Station Monitoring

• Performance counters. Example: connected users, average signal strength, cell radius etc.
• KPI: Key Performance Indicator. Example: call drop rate, successful connection setup rate, throughput

[Figure: cell sites, performance counters and KPIs, and the Monitoring Centre, labelled "Every 15 minutes"]

Root cause analysis

[Figure: cell sites, KPIs and performance counters, and the Monitoring Centre]

Why did the KPI go below threshold? Determined manually.

Root Cause Analysis – Issues

Manual debugging is inefficient

• Too many variables
  • ~300 parameters
  • 1 engineer per O(100) cell sites
• Sporadic parameter dips
• Multiple parameter interaction

[Figure: time series of the KPI and of Parameter 1 through Parameter N]

Problem Statement

Carry out automated (fast) root cause analysis which accounts for sporadic dips and multiple parameter interactions while ensuring human-readable output.

Outline

• Motivation
• Problem statement
• Approach
• Insight, Mechanism, Customizations
• Results
• Ongoing work
• Other work

Key Intuition

KPI-parameter relationship is dependent on other parameter values

[Figure: Call Success plotted against Conn. Req. and Handoff rate, with a threshold marked. Annotated regions: Conn. Req. > X & H/o = y, and Conn. Req. > X’ & H/o = y’]

Determine the rules for various parameter combination values using regression trees

Regression trees

• Form clusters of points so as to minimize the sum of the distance metric over the sub-clusters (Δ’ + Δ”)
• Distance metric: sum of Euclidean distance of points in a sub-cluster
• Provide human readable rule for each cluster

[Figure: Call Success values forming one cluster with metric Δ, then split into two sub-clusters with metrics Δ’ and Δ”]
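One way to write the split criterion down. This is a sketch that assumes Δ is the squared Euclidean distance of each point's KPI value from its cluster mean, which is the usual regression-tree choice; the slides do not spell out the exact form. The symbols C, y_i and ȳ_C are notation introduced here, not taken from the slides.

```latex
\Delta(C) = \sum_{i \in C} \left( y_i - \bar{y}_C \right)^2,
\qquad
\bar{y}_C = \frac{1}{|C|} \sum_{i \in C} y_i

% A pivot p on a chosen axis splits C into C' (axis value < p) and C'' (axis value >= p).
% The chosen pivot minimises the summed metric of the two sub-clusters:
p^{*} = \arg\min_{p} \left[ \Delta(C') + \Delta(C'') \right],
\qquad
\Delta(C') + \Delta(C'') \le \Delta(C)
```

Under this assumed form the inequality on the right always holds, so a split never makes the summed metric worse; the useful splits are the ones that reduce it sharply.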

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick pivot to divide points in two clusters
4) Calculate Δ’ + Δ”
   Repeat for all pivots
   Repeat for all axes
5) Pick pivot with minimum Δ’ + Δ”
   Repeat for sub-clusters

[Figure: Call Success vs. Conn. Req., with candidate pivots X along the Conn. Req. axis; the chosen pivot gives the split Conn.Req < X / Conn.Req >= X]
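The split search in steps 1 to 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Δ is the sum of squared distances of KPI values from the cluster mean, and the data layout, parameter names (`conn_req`, `handoff_rate`) and KPI values are made up for the example.

```python
from statistics import fmean


def delta(kpi_values):
    """Assumed form of Δ: sum of squared distances of KPI values from their mean."""
    if not kpi_values:
        return 0.0
    mean = fmean(kpi_values)
    return sum((v - mean) ** 2 for v in kpi_values)


def best_split(samples, kpi):
    """Return (axis, pivot, cost) with the minimum Δ' + Δ'' over all axes and pivots.

    samples: list of dicts mapping parameter name -> value (hypothetical layout)
    kpi:     list of KPI values, one per sample
    """
    best = None
    for axis in samples[0]:                                # repeat for all axes
        for pivot in sorted({s[axis] for s in samples}):   # repeat for all pivots
            left = [k for s, k in zip(samples, kpi) if s[axis] < pivot]
            right = [k for s, k in zip(samples, kpi) if s[axis] >= pivot]
            if not left or not right:
                continue                                   # pivot does not divide the points
            cost = delta(left) + delta(right)              # Δ' + Δ''
            if best is None or cost < best[2]:
                best = (axis, pivot, cost)
    return best


# Made-up data: call success drops when connection requests are high.
samples = [
    {"conn_req": 10, "handoff_rate": 0.2},
    {"conn_req": 20, "handoff_rate": 0.8},
    {"conn_req": 80, "handoff_rate": 0.3},
    {"conn_req": 90, "handoff_rate": 0.7},
]
kpi = [99.5, 99.4, 97.0, 97.2]
axis, pivot, cost = best_split(samples, kpi)
print(axis, pivot)  # conn_req 80
```

Applying `best_split` again to each of the two resulting sub-clusters gives the "Repeat for sub-clusters" step.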

Regression trees

[Figure: Call Success vs. Conn. Req. and Handoff rate. The first split is at Conn. Req. = X (Conn.Req < X, Conn.Req >= X); a sub-cluster is split again at Handoff rate = Y (Handoff Rate >= Y, Handoff Rate < Y)]

Select rules corresponding to low KPI values
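Selecting the rules for low KPI values amounts to walking the finished tree and keeping the paths that end in a low-KPI leaf. The sketch below assumes a hypothetical nested-dict tree layout (`axis`, `pivot`, `lt`, `ge`, `mean_kpi`) and made-up pivots and leaf values; none of these come from the slides.

```python
def low_kpi_rules(node, threshold, path=()):
    """Collect root-to-leaf conditions for leaves whose mean KPI is below threshold."""
    if "mean_kpi" in node:                       # leaf node
        if node["mean_kpi"] < threshold:
            return [" AND ".join(path)]
        return []
    axis, pivot = node["axis"], node["pivot"]
    rules = low_kpi_rules(node["lt"], threshold, path + (f"{axis} < {pivot}",))
    rules += low_kpi_rules(node["ge"], threshold, path + (f"{axis} >= {pivot}",))
    return rules


# Hypothetical two-level tree shaped like the one in the figure.
tree = {
    "axis": "Conn.Req", "pivot": 80,
    "lt": {"mean_kpi": 99.4},
    "ge": {
        "axis": "Handoff Rate", "pivot": 0.5,
        "lt": {"mean_kpi": 98.9},
        "ge": {"mean_kpi": 97.1},
    },
}
rules = low_kpi_rules(tree, threshold=98.5)
print(rules)  # ['Conn.Req >= 80 AND Handoff Rate >= 0.5']
```

Each rule is a conjunction of threshold tests, which is what makes the output readable by an engineer.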

Regression trees

• Human readable
• Capture multiple variable interaction
• Capture sporadic events due to time agnostic clustering

Regression trees – Issues

• Distance metric oblivious of significance of KPI values
• Curse of dimensionality

Metric oblivious of KPI value significance

• Need big separation between good and bad values
• Distinction between good and bad is small
• Stratify KPI values

[Figure: Call Success vs. Conn. Req. and Handoff rate, with labels 98.5%, 98.6%, 98.7% and "Bad"]

Stratification of data

• Multiply KPI value with custom step function

[Figure: the same Call Success data after stratification]
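Stratification can be illustrated with a small sketch. The slides only say that the KPI value is multiplied by a custom step function; the single-step shape, the 98.5% threshold used as the step position, and the weights below are assumptions chosen for illustration.

```python
def stratify(kpi_values, threshold=98.5, bad_weight=0.5, good_weight=1.0):
    """Multiply each KPI value by a step function of the value itself.

    Values below the threshold are scaled by bad_weight, the rest by good_weight
    (hypothetical weights), which widens the gap between good and bad values.
    """
    return [v * (bad_weight if v < threshold else good_weight) for v in kpi_values]


raw = [98.7, 98.6, 98.4, 98.3]          # made-up call success rates, in percent
stratified = stratify(raw)

# Gap between the worst good value and the best bad value, before and after.
gap_before = min(v for v in raw if v >= 98.5) - max(v for v in raw if v < 98.5)
gap_after = min(stratified[:2]) - max(stratified[2:])
```

Before stratification the good and bad values differ by about 0.2 percentage points; afterwards they differ by roughly 49, so a distance-based split has a strong reason to separate them.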

Regression trees – Issues

• Distance metric oblivious of significance of KPI values
  • Stratify KPI values
• Curse of dimensionality
  • Dimensionality reduction

Curse of Dimensionality

[Figure: Call Success vs. Traffic Load and Interference]

• Traffic Load > X & Interference > Y
• Handoff rate < X & Conn. Req. < Y
• Cell Radius > X & Allotted Power < Y

~300 variables lead to 2^300 combinations; regression tree can be misled

Dimensionality reduction

• Preprocessing
  – Remove correlated, barely changing parameters etc.
• Domain knowledge based filtering
  – Remove unrelated parameters, apply weights
• Heuristics
  – Spike, Correlation, 3 more …
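The preprocessing step could look roughly like the sketch below: drop parameters that barely change, then drop parameters that are strongly correlated with one already kept. The variance and correlation cut-offs and the parameter names are hypothetical; the slides do not give the actual criteria.

```python
from statistics import fmean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def preprocess(columns, min_variance=1e-6, max_corr=0.95):
    """Keep parameters that vary and are not near-duplicates of a kept parameter."""
    kept = {}
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue                                   # barely changing parameter
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue                                   # correlated with a kept parameter
        kept[name] = values
    return kept


# Made-up counters: one near-duplicate and one constant parameter.
columns = {
    "conn_req": [10, 20, 80, 90],
    "conn_req_copy": [11, 21, 81, 91],
    "sw_version": [3, 3, 3, 3],
    "handoff_rate": [0.2, 0.8, 0.3, 0.7],
}
kept = preprocess(columns)
print(list(kept))  # ['conn_req', 'handoff_rate']
```

Which of two correlated parameters survives depends on iteration order here; a domain-knowledge weight could be used to pick the more meaningful one.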

Spike heuristic

[Figure: a parameter and Call Success plotted against time]

Values spike around same time
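A possible reading of the spike heuristic, as a sketch: mark the time steps where a series deviates strongly from its mean, and keep a parameter if one of its spikes falls close to a KPI spike. The definition of a spike (more than `k` standard deviations from the mean) and the `window` tolerance are assumptions, as are the series below.

```python
from statistics import fmean, pstdev


def spike_times(series, k=2.0):
    """Indices where the value is more than k standard deviations from the mean."""
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        return set()
    return {t for t, v in enumerate(series) if abs(v - mean) > k * std}


def spikes_together(param, kpi, k=2.0, window=1):
    """True if some parameter spike lies within `window` time steps of a KPI spike."""
    kpi_spikes = spike_times(kpi, k)
    param_spikes = spike_times(param, k)
    return any(abs(tp - tk) <= window for tk in kpi_spikes for tp in param_spikes)


# Made-up 15-minute series: the KPI dips at step 10.
kpi = [99.0] * 20
kpi[10] = 90.0
interference = [5.0] * 20
interference[11] = 50.0          # spikes one step after the KPI dip
cell_radius = [5.0] * 20         # never changes

print(spikes_together(interference, kpi))  # True
print(spikes_together(cell_radius, kpi))   # False
```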

Correlation heuristic

[Figure: two scatter plots of Call Success vs. Conn. Req., one for Call Success > 98.5 % and one for Call Success <= 98.5 %]

Correlation changes significantly
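The correlation heuristic can be sketched by computing the parameter-KPI correlation separately for the good regime (Call Success > 98.5%) and the bad regime (Call Success <= 98.5%) and comparing the two. Using Pearson correlation and the absolute difference as the measure of "changes significantly" is an assumption, and the data are made up.

```python
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation; 0.0 for fewer than two points or a constant series."""
    if len(xs) < 2:
        return 0.0
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_change(param, kpi, threshold=98.5):
    """Absolute change in parameter-KPI correlation between good and bad regimes."""
    good = [(p, k) for p, k in zip(param, kpi) if k > threshold]
    bad = [(p, k) for p, k in zip(param, kpi) if k <= threshold]
    r_good = pearson([p for p, _ in good], [k for _, k in good])
    r_bad = pearson([p for p, _ in bad], [k for _, k in bad])
    return abs(r_good - r_bad)


# Made-up data: no relationship while the KPI is good, strong one when it is bad.
conn_req = [10, 20, 30, 40, 60, 70, 80, 90]
call_success = [99.0, 99.2, 99.2, 99.0, 98.4, 98.0, 97.6, 97.2]
change = correlation_change(conn_req, call_success)
```

In this example the correlation is close to 0 in the good regime and close to -1 in the bad regime, so `change` is close to 1 and the parameter would be kept as a candidate.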

Rule generation

Data store → Apply filters → Stratify KPI data → Regression tree → Select rules → Rule store

Rule application

Rule store → Matching rules

Training & Verification Data

• Analyzed 28 days of data from 217 cell sites
• 2 countries, 2 OEMs
• 317 parameters @ 15 minute interval
• 80% data to train and 20% to validate
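For a sense of scale, the figures above translate into sample counts as follows. The sketch assumes no missing 15-minute intervals and a chronological 80/20 split; the slides do not say how the split was actually drawn.

```python
INTERVAL_MIN = 15
DAYS = 28

# Number of 15-minute samples per parameter per cell site over 28 days.
samples_per_site = DAYS * 24 * 60 // INTERVAL_MIN


def split(rows, train_fraction=0.8):
    """Chronological split (an assumption): first part trains, the rest validates."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


train, validate = split(list(range(samples_per_site)))
print(samples_per_site, len(train), len(validate))  # 2688 2150 538
```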

Find rules for all KPI dips

Cell sites with at least 4 KPIs with more than 100 bad instances selected

[Figure: two bar charts, Country #1 (18 cell sites) and Country #2 (60 cell sites). x-axis: KPI (1 to 4); y-axis: Instances; bars: Found Rule vs. Bad KPI]

Rule Verification

• Picked rules for 50 randomly selected KPI dips
• Show rules to 15 RF engineers (Ongoing)
• 80% of rules were actionable
• For every KPI dip, at least one rule in the rule set was actionable

Example rule set

KPI dip: Call success rate < 98.5%

1) Total users in 5 to 10 KM from base station > 63%
   Users concentrated at cell edge
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
   21% users with bad RSSI and high traffic load
3) Download Traffic < 500 Kbytes AND Total active users < 200
   Does not point to a meaningful cause?
   • Coarse timescale leading to multiple other failures
   • Don’t have access to relevant parameters
   • Specific problem rare event in current sector
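Applying a stored rule set to a new 15-minute sample can be sketched as below, using the three example rules. The rule encoding, the counter names and the sample values are hypothetical; only the thresholds come from the slide.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

# The example rule set, encoded as lists of (counter, comparison, threshold).
RULES = [
    [("users_5_to_10km_pct", ">", 63)],
    [("users_bad_rss_pct", ">", 21), ("uplink_load_mb", ">", 831)],
    [("download_traffic_kb", "<", 500), ("active_users", "<", 200)],
]


def matching_rules(sample, rules=RULES):
    """Return the 1-based numbers of the rules whose conditions all hold."""
    return [
        number
        for number, rule in enumerate(rules, start=1)
        if all(OPS[op](sample[name], value) for name, op, value in rule)
    ]


# Made-up sample taken while the call success rate was below 98.5%.
sample = {
    "users_5_to_10km_pct": 70,
    "users_bad_rss_pct": 25,
    "uplink_load_mb": 500,
    "download_traffic_kb": 9000,
    "active_users": 350,
}
print(matching_rules(sample))  # [1]
```

Here only rule 1 matches: rule 2 fails on uplink load and rule 3 on both of its conditions.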

Ongoing Work

Recommending solution for a problem

• Parameter list: remotely configurable parameters. Example: antenna tilt, min. signal strength to associate, allowable idle time etc.

When a KPI dips:
• Generate rules
• Find sectors where the rules do not lead to KPI dip
• Return the parameter list for those sectors

[Figure: cell sites, the Monitoring Centre and the parameter list]
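The recommendation steps might be sketched as follows. This is only one reading of the slide: a sector qualifies if the rule's conditions occur there and the KPI stays at or above threshold every time. The rule encoding (counter name mapped to a lower limit), the sector layout and all values are hypothetical.

```python
def rule_holds(rule, sample):
    """Simplified rule: every named counter exceeds its limit (an assumption)."""
    return all(sample[name] > limit for name, limit in rule.items())


def recommend(rule, sectors, kpi_threshold=98.5):
    """Return the parameter lists of sectors where the rule holds without a KPI dip."""
    recommendations = {}
    for sector_id, sector in sectors.items():
        matched = [s for s in sector["samples"] if rule_holds(rule, s)]
        if matched and all(s["call_success"] >= kpi_threshold for s in matched):
            recommendations[sector_id] = sector["config"]
    return recommendations


rule = {"users_bad_rss_pct": 21, "uplink_load_mb": 831}
sectors = {
    "A": {  # rule holds and the KPI dips: this is the problem sector
        "samples": [{"users_bad_rss_pct": 25, "uplink_load_mb": 900, "call_success": 97.9}],
        "config": {"antenna_tilt_deg": 2, "min_rss_dbm": -110},
    },
    "B": {  # rule holds but the KPI stays healthy
        "samples": [{"users_bad_rss_pct": 30, "uplink_load_mb": 950, "call_success": 99.1}],
        "config": {"antenna_tilt_deg": 6, "min_rss_dbm": -105},
    },
    "C": {  # rule conditions never occur
        "samples": [{"users_bad_rss_pct": 5, "uplink_load_mb": 100, "call_success": 99.5}],
        "config": {"antenna_tilt_deg": 4, "min_rss_dbm": -108},
    },
}
suggested = recommend(rule, sectors)
print(suggested)  # {'B': {'antenna_tilt_deg': 6, 'min_rss_dbm': -105}}
```

Comparing the returned parameter list with the problem sector's own configuration shows which remotely configurable settings differ.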

More customizations necessary …

Other work

All bits of a video application are not created equal

• MPEG4/ H.264 encoded video: I, P and B frames
• Nearer the deadline more valuable the packet

[Figure: packet value against deadline, with labels < 5 msec and < 105 msec]

Value aware networking

[Figure: protocol stack (Application, Transport, Network, MAC, PHY). I, P and B frames are distinguishable at the top (000101 011101 010101) and appear as a single bit stream lower down (0001011010101100101010101100001001). A value aware application layer exposes the I, P, B information through an API]

Can protocol decisions be taken in a value aware manner ?

• Order of sending data
• Times to retransmit
• MAC data rate

Yes. Almost no data overhead.

Questions?

Backup

Future work

• Online regression tree formation
• Fast emulation systems for what-if analysis

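Research overview

[Figure: projects grouped under Systems & Protocols, Cross-Layer design, and Measurement & Analysis, with a "PhD Dissertation" grouping. Labels in extraction order; venues that were not attached to a project name are listed on their own:]

• Scout
• [Submitted]
• [DySPAN 2012]
• Range-Write [OSDI 2008]
• Apex [Sigcomm 2010]
• Medusa [NSDI 2010]
• MOM [Submitted]
• RDP-TS
• DGP [MobiCom 2006]
• MCB-Mesh [IMC 2008]
• Fractel [INFOCOM 2008]
• WiScape [IMC 2011]
• [WWW 2008]
• Topo-cons
• WhiteCell
• Root-cause
• MultIfaceT [HotMobile’10]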

Multi-Interface systems

Benefits
• Higher bandwidth
  • Home repeater
  • Vehicular whitespace
• Reliability
  • Whitespace femto

Challenges
• API with higher layers
• Striping decision
• Channel selection
• Feedback gathering

[Figure: Tx and Rx nodes]
