1/30/2008 International SIP 2008 (Paris) Peer-to-Peer-based Automatic Fault Diagnosis in VoIP Henning Schulzrinne (Columbia U.) Kai X. Miao (Intel)



Overview

• The transition in IT cost metrics
• End-to-end application-visible reliability still poor (~ 99.5%)
  – even though network elements have gotten much more reliable
  – particular impact on interactive applications (e.g., VoIP)
  – transient problems
• Lots of voodoo network management
• Existing network management doesn’t work for VoIP and other modern applications
• Need user-centric rather than operator-centric management
• Proposal: peer-to-peer management
  – “Do You See What I See?”
• Using VoIP as running example -- most complex consumer application
  – but also applies to IPTV and other services
• Also use for reliability estimation and statistical fault characterization


Circle of blame

[Figure: circle of blame among OS, VSP, app vendor, and ISP; each party blames another and suggests a fix:]
  – must be a Windows registry problem: re-install Windows
  – probably packet loss in your Internet connection: reboot your DSL modem
  – must be your software: upgrade
  – probably a gateway fault: choose us as provider


Diagnostic undecidability

• symptom: “cannot reach server”
• more precise: send packet, but no response
• causes:
  – NAT problem (return packet dropped)?
  – firewall problem?
  – path to server broken?
  – outdated server information (moved)?
  – server dead?
• 5 causes, very different remedies
  – no good way for non-technical user to tell
• Whom do you call?


Traditional network management model

[Figure: network elements managed via SNMP, with one point marked “X”]

“management from the center”


Old assumptions, now wrong

• Single provider (enterprise, carrier)
  – has access to most path elements
  – professionally managed
• Problems are hard failures & elements operate correctly
  – element failures (“link dead”)
  – substantial packet loss
• Mostly L2 and L3 elements
  – switches, routers
  – rarely 802.11 APs
• Problems are specific to a protocol
  – “IP is not working”
• Indirect detection
  – MIB variable vs. actual protocol performance
• End systems don’t need management
  – DMI & SNMP never succeeded
  – each application does its own updates


Managing the protocol stack

[Figure: protocol stack with typical problems at each layer]
  – IP: no route, packet loss
  – UDP/TCP: TCP neg. failure, NAT time-out, firewall policy
  – RTP: protocol problem, playout errors
  – media: echo, gain problems, VAD action
  – SIP: protocol problem, authorization, asymmetric conn (NAT)


Types of failures

• Hard failures
  – connection attempt fails
  – no media connection
  – NAT time-out
• Soft failures (degradation)
  – packet loss (bursts)
    • access network? backbone? remote access?
  – delay (bursts)
    • OS? access networks?
  – acoustic problems (microphone gain, echo)
  – a software bug (poor voice quality)
    • protocol stack? Codec? Software framework?


Examples of additional problems

• ping and traceroute no longer work reliably
  – WinXP SP 2 turns off ICMP
  – some networks filter all ICMP messages
• Early NAT binding time-out
  – initial packet exchange succeeds, but then TCP binding is removed (“web-only Internet”)
• policy intent vs. failure
  – “broken by design”
  – “we don’t allow port 25” vs. “SMTP server temporarily unreachable”


Fault localization

• Fault classification – local vs. global
  – Does it affect only me or does it affect others also?
• Global failures
  – Server failure
    • e.g., SIP proxy, DNS failure, database failures
  – Network failures
• Local failures
  – Specific source failure
    • node A cannot make call to anyone
  – Specific destination or participant failure
    • no one can make call to node B
  – Locally observed, but global failures
    • DNS service failed, but only B observed it
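The local vs. global classification can be sketched as a function over call attempts pooled from several peers. This is a minimal illustration, not the authors' implementation: the function name `localize`, the `(caller, callee, succeeded)` tuple layout, and the scope labels are all assumptions made for this example.

```python
from collections import defaultdict


def localize(observations):
    """Classify call failures as source-specific, destination-specific, or global.

    observations: (caller, callee, succeeded) tuples pooled from peers
    (hypothetical format). Returns (scope, node); node is None unless a
    single node explains every observed failure.
    """
    observations = list(observations)
    failures = [(src, dst) for src, dst, ok in observations if not ok]
    if not failures:
        return ("none", None)
    if len(failures) == len(observations):
        return ("global", None)  # nothing works for anyone

    by_src = defaultdict(list)
    by_dst = defaultdict(list)
    for src, dst, ok in observations:
        by_src[src].append(ok)
        by_dst[dst].append(ok)

    # Specific source failure: one caller fails every attempt
    # and accounts for all observed failures.
    for src in sorted(by_src):
        if not any(by_src[src]) and all(f_src == src for f_src, _ in failures):
            return ("source", src)

    # Specific destination failure: nobody can reach one callee.
    for dst in sorted(by_dst):
        if not any(by_dst[dst]) and all(f_dst == dst for _, f_dst in failures):
            return ("destination", dst)

    return ("unclear", None)
```

With only a handful of observations the result can be ambiguous (a single failed call fits both a source and a destination failure), which is one reason a node would ask more peers before reporting a cause.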


Proposal: “Do You See What I See?” (DYSWIS)

• Each node has a set of active and passive measurement tools
• Use intercept (NDIS, pcap)
  – to detect problems automatically
    • e.g., no response to SIP, HTTP or DNS request
    • deviation from normal protocol exchange behavior
  – gather performance statistics (packet jitter)
  – capture RTCP and similar measurement packets
• Nodes can ask others for their view
  – possibly also dedicated “weather stations”
• Iterative process, leading to:
  – user indication of cause of failure
  – in some cases, work-around (application-layer routing): TURN server, use remote DNS servers
• Nodes collect statistical information on failures and their likely causes
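A rough sketch of two of the steps above: passively flagging requests that never received a response, and then comparing the result with what peers report. The event tuple layout, the function names, and the 5-second timeout are assumptions for illustration; a real sensor would build the event list from captured packets (e.g., via pcap), which is not shown here.

```python
def unanswered_requests(events, now, timeout=5.0):
    """Return (protocol, txn_id) pairs whose request got no response.

    events: (timestamp, kind, protocol, txn_id) tuples, where kind is
    "request" or "response" (hypothetical format). A request counts as
    unanswered if no response was seen and it is older than `timeout`.
    """
    pending = {}
    for ts, kind, proto, txn in sorted(events):
        key = (proto, txn)
        if kind == "request":
            pending.setdefault(key, ts)  # keep the first transmission time
        elif kind == "response":
            pending.pop(key, None)
    return sorted(key for key, ts in pending.items() if now - ts > timeout)


def compare_with_peers(my_failed_protocols, peer_reports):
    """For each protocol failing locally, list the peers that see it failing too.

    peer_reports: dict mapping peer name -> set of protocols failing there.
    An empty list suggests a local problem; a long one suggests a shared one.
    """
    return {
        proto: sorted(p for p, failing in peer_reports.items() if proto in failing)
        for proto in sorted(my_failed_protocols)
    }
```

A node might run `unanswered_requests` periodically, collect the failing protocols, and only then contact peers, so that peers are queried about a concrete symptom rather than polled continuously.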


Architecture

[Figure: sensor, probe, and diagnosis nodes placed around the monitored elements: SIP proxy, DNS server, SMTP server, firewall, other]

Three types of nodes – sensor, probe, and diagnosis


Architecture

[Figure: interaction between a sensor node and a diagnosis node. Labels:]
  – “not working” (notification)
  – inspect protocol requests (DNS, HTTP, RTCP, …)
  – “DNS failure for 15m”
  – request diagnostics
  – orchestrate tests, contact others
  – ping 127.0.0.1; can buddy reach our resolver?
  – notify admin (email, IM, SIP events, …)


Solution architecture

[Figure: peers P1–P8 connected peer-to-peer across Domain A, Service Provider 1, and Service Provider 2. Labels: DNS Server, SIP Server, DNS Test, SIP Test, PESQ Test, “Call Failed at P1”]

Nodes in different domains cooperating to determine cause of failure


Failure detection tools

• STUN server
  – what is your IP address?
• ping and traceroute
• Transport-level liveness and QoS
  – open TCP connection to port
  – send UDP ping to port
  – measure packet loss & jitter
• Need scriptable tools with dependency graph
  – using DROOLS for now
• TBD: remote diagnostic
  – fixed set (“do DNS lookup”) or
  – applets (only remote access)
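The transport-level checks listed above could look roughly like this. It is a sketch under stated assumptions: the function names are made up for this example, sequence-number wraparound is ignored, and the jitter estimator follows the running-average form defined for RTP in RFC 3550 (J += (|D| - J) / 16) rather than anything specified in these slides.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Transport-level liveness: can a TCP connection to host:port be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False


def loss_fraction(seqs):
    """Fraction of packets missing between the lowest and highest sequence
    number received. Duplicates are ignored; wraparound is not handled."""
    unique = set(seqs)
    if not unique:
        return 0.0
    expected = max(unique) - min(unique) + 1
    return 1.0 - len(unique) / expected


def interarrival_jitter(packets):
    """Running jitter estimate over (send_time, recv_time) pairs given in
    arrival order, both in the same time unit (RFC 3550-style estimator)."""
    jitter = 0.0
    prev_transit = None
    for sent, received in packets:
        transit = received - sent
        if prev_transit is not None:
            jitter += (abs(transit - prev_transit) - jitter) / 16.0
        prev_transit = transit
    return jitter
```

A `False` from `tcp_port_open` does not say which of the earlier causes applies (NAT, firewall, broken path, dead server); it only reproduces the symptom, which is why the result would be combined with the views of other peers.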

[Figure: protocol stack, top to bottom: media, RTP, UDP/TCP, IP]


Components and Operations

• Distributed p2p architecture with an iterative process involving all these functions:
  – Data gathering from multiple perspectives
  – Knowledge in existence or built over time (learning)
  – Tools (with intelligence built in) for active probing or observations
  – Inference, analysis, and decision making
• Peer nodes: detection nodes, diagnosis nodes, and probe nodes
• P2P protocol for fault diagnosis
• Operation rules used to generate tests – built or learned in real time
• Inference based on rules (inference modeling)


Components and Operations

[Figure: fault diagnosis architecture, components, and domain agents]
  – Blocks: Dependency Graphs; Passive Tests/Active Tests; Analysis/Inference/Diagnosis
  – Labels: dependency relationships/decision trees; normal network behavior; monitoring deviant behavior; active probes; adaptive probes; diagnostic tests; diagnostic analysis; statistical inference; learning & modeling; fault profiles; fault types: hard vs. soft


Dependency classification

• Functional dependency
  – At generic service level
    • e.g., SIP proxy depends on DB service, DNS service
• Structural dependency
  – Configuration time
    • e.g., Columbia CS SIP proxy is configured to use mysql database on host metro-north
• Operational dependency
  – Runtime dependencies or run time bindings
    • e.g., the call which failed was using failover SIP server obtained from DNS which was running on host a.b.c.d in IRT lab
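One way to picture the three dependency classes is as tagged edges in a small table. The class and field names below are assumptions chosen for this sketch; the example entries paraphrase the ones on the slide.

```python
from dataclasses import dataclass
from enum import Enum


class DependencyKind(Enum):
    FUNCTIONAL = "functional"    # generic service level
    STRUCTURAL = "structural"    # fixed at configuration time
    OPERATIONAL = "operational"  # bound at run time


@dataclass(frozen=True)
class Dependency:
    dependent: str
    depends_on: str
    kind: DependencyKind


# Example entries taken from the slide text (host names as given there).
DEPENDENCIES = [
    Dependency("SIP proxy", "DB service", DependencyKind.FUNCTIONAL),
    Dependency("SIP proxy", "DNS service", DependencyKind.FUNCTIONAL),
    Dependency("Columbia CS SIP proxy", "mysql database on host metro-north",
               DependencyKind.STRUCTURAL),
    Dependency("failed call", "failover SIP server on host a.b.c.d",
               DependencyKind.OPERATIONAL),
]


def depends_on(name, kind=None, table=DEPENDENCIES):
    """List what `name` depends on, optionally restricted to one kind."""
    return [d.depends_on for d in table
            if d.dependent == name and (kind is None or d.kind == kind)]
```

Keeping the kind on each edge lets a diagnosis node start from the operational bindings of the specific failed call and fall back to structural and functional dependencies when run-time information is missing.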


Dependency Graph


Dependency graph encoded as decision tree

[Figure: dependency graph in which A depends on B, C, and D, next to the corresponding decision tree]
  – A = SIP Call, B = DNS Server, C = SIP Proxy, D = Connectivity
  – A failed: use decision tree
  – Yes/No branches lead to: invoke decision tree for C; invoke decision tree for B; invoke decision tree for D
  – Otherwise: cause not known; report, add new dependency
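The decision-tree idea can be sketched as a recursive walk over the dependency graph. This is an illustrative stand-in written in Python, not the Drools rules the authors mention; the function name, the `is_working` callback, and the C, B, D test order are assumptions, and the graph is assumed to be acyclic.

```python
# Dependency graph from the slide: A (SIP call) depends on C, B, and D.
DEPENDS_ON = {
    "SIP call": ["SIP proxy", "DNS server", "connectivity"],
    "SIP proxy": [],
    "DNS server": [],
    "connectivity": [],
}


def diagnose(failed, is_working, depends_on=DEPENDS_ON):
    """Return the deepest failed components reachable from `failed`.

    is_working(component) -> bool runs a diagnostic test (hypothetical
    callback). For each dependency that fails its test, the decision tree
    for that dependency is invoked in turn. If every dependency passes,
    `failed` itself is returned: the cause is not known from the graph and
    would be reported so a new dependency can be added.
    """
    culprits = []
    for dep in depends_on.get(failed, []):
        if not is_working(dep):
            culprits.extend(diagnose(dep, is_working, depends_on))
    return culprits if culprits else [failed]
```

For example, with `status = {"SIP proxy": True, "DNS server": False, "connectivity": True}`, calling `diagnose("SIP call", status.get)` points at the DNS server; if all three tests pass, the result is the SIP call itself, i.e. the "cause not known" branch.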


Current work

• Building decision tree system
• Using JBoss Rules (Drools 3.0)



Future work

• Learning the dependency graph from failure events and diagnostic tests

• Learning using random or periodic testing to identify failures and determine relationships

• Self healing
• Predicting failures
• Protocols for labeling event failures --> enable automatically incorporating new devices/applications to the dependency system

• Decision tree (dependency graph) based event correlation


Conclusion

• Hypothesis: network reliability is the single largest open technical issue and prevents (some) new applications

• Existing management tools of limited use to most enterprises and end users

• Transition to “self-service” networks
  – support non-technical users, not just NOCs running HP OpenView or Tivoli

• Need better view of network reliability
