The Network is Reliable: An informal survey of real-world communications failures

BY PETER BAILIS AND KYLE KINGSBURY

CONTENTS

• Abstract

• Various survey reports of network reliability under different circumstances

• Conclusion

ABSTRACT

• “The network is reliable” is a fallacy of distributed computing.

• The degree of network reliability is critical for systems to function robustly.

• It is hard to determine the degree of network reliability.

VARIOUS SURVEY REPORTS OF NETWORK RELIABILITY UNDER DIFFERENT CIRCUMSTANCES

LARGE DEPLOYMENTS & ISSUES

• What are large deployments?

A large deployment is a globally operated distributed system whose infrastructure spans hundreds of thousands of servers.

• What is considered a serious issue in large deployments?

Partitions: a network partition occurs when a failure, for example of a network device, splits the network into groups of nodes that cannot communicate with each other.

LARGE DEPLOYMENTS & ISSUES (CONTD.): EXAMPLES

BEHAVIOR OF NETWORK FAILURE IN MICROSOFT DATACENTERS

• Average failure rate: 5.2 devices/day and 40.8 links/day

• Median loss of 59,000 packets per failure

• Median time to repair of approximately five minutes

• Redundancy improves median traffic by 43%

[Chart: failures per day, devices vs. links]

NETWORK FAILURES IN HP’S MANAGED NETWORKS

Analysis of support ticket data:

• Connectivity-related tickets accounted for 11.4% of support tickets

• 14% of those were of the highest priority level

• Median duration of 2 hours 45 minutes for the highest-priority tickets, and 4 hours 18 minutes for all tickets

[Chart: trouble tickets, connectivity-related vs. high priority]

FIRST YEAR FOR A NEW GOOGLE CLUSTER INVOLVES

• Five faulty racks (40–80 machines seeing 50% packet loss)

• Eight network maintenances (four might cause 30-minute random connectivity losses)

• Three router failures (traffic has to be pulled immediately for an hour)

LARGE DEPLOYMENTS & ISSUES (CONTD.)

How do these companies try to deal with network partitions?

Google (per Dean): “easy-to-use” abstractions

PNUTS: weaker consistency alternatives

DATACENTER NETWORK FAILURES

A Datacenter of Google

Main causes of failures:

1) Power failure

2) Misconfiguration

3) Firmware bugs

4) Topology changes

5) Cable damage

6) Malicious traffic

CLOUD NETWORKS

What are cloud networks?

Key issues:

1) Transient latency

2) Dropped packets

3) Full network partitions
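One common way for a client to tolerate transient latency and dropped packets is to bound each call with a timeout and retry with jittered exponential backoff. The following is only a minimal sketch: `call_with_retries`, its parameters, and the choice of exceptions are assumptions made for illustration and are not taken from the survey.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Run operation(); retry on timeout or connection errors.

    `operation` is a hypothetical zero-argument callable that performs one
    network request and raises TimeoutError or ConnectionError on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt == attempts - 1:
                break  # out of attempts: surface the last error
            # Exponential backoff, capped at max_delay, with full jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise last_error
```

Retries help with the first two issues only. Under a full network partition every attempt fails, so the caller still needs a plan for the final error.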

CLOUD NETWORKS (CONTD.)

What happens when two nodes are both connected to the Internet but unable to see each other?

What can we learn from this case?
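The case above shows that reachability is not transitive: two nodes can each reach a third party and still be cut off from each other. A minimal sketch of checking pairwise reachability is given below. The `can_reach` probe and the node names are hypothetical; a real probe would be something like a bounded-timeout connection attempt.

```python
from itertools import combinations


def partitioned_pairs(nodes, can_reach):
    """Return node pairs that cannot talk in at least one direction.

    `can_reach(src, dst)` is an assumed probe returning True or False.
    """
    broken = []
    for a, b in combinations(sorted(nodes), 2):
        if not (can_reach(a, b) and can_reach(b, a)):
            broken.append((a, b))
    return broken


# Illustrative data: both nodes reach "hub", but not each other.
links = {("a", "hub"), ("hub", "a"), ("b", "hub"), ("hub", "b")}
print(partitioned_pairs({"a", "b", "hub"}, lambda s, d: (s, d) in links))
```

Checking only "can I reach the Internet?" would report both nodes as healthy here, which is why pairwise checks between the nodes that actually need to talk are more informative.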

HOSTING PROVIDERS

Could hosting providers offer reliable networks?

E.g. Freistil IT: one data center saw 50%–100% packet loss, which drove a GlusterFS distributed file system into an undetected split-brain state.

What is the main issue?
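A general defence against split-brain is to accept writes only on the side of a partition that can see a strict majority of the cluster. The sketch below illustrates that idea only; it is not GlusterFS's actual mechanism, and `has_quorum` is a hypothetical helper.

```python
def has_quorum(reachable_peers, cluster_size):
    """True if this node plus the peers it can reach form a strict majority.

    Illustrative only: `reachable_peers` excludes the node itself.
    """
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("inconsistent cluster description")
    votes = reachable_peers + 1  # count this node's own vote
    return 2 * votes > cluster_size


# A 5-node cluster split 3/2: only the larger side keeps accepting writes.
print(has_quorum(reachable_peers=2, cluster_size=5))
print(has_quorum(reachable_peers=1, cluster_size=5))
```

With an even-sized cluster split exactly in half, neither side has a strict majority, so both stop accepting writes. That trades availability for avoiding divergent copies.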

WIDE AREA NETWORKS (WAN)

• Why are WAN failures particularly interesting?

• Example: CENIC, average partition duration over 5 years: software-related failures (SRF) 6 minutes; hardware-related failures (HRF) 8.2 hours

Conclusion: graceful degradation under partition or increased latency is especially important for WANs.
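One possible form of graceful degradation is to serve the last known value, marked with its age, when the remote side is unreachable. This is a minimal sketch under stated assumptions: `StaleOnErrorCache`, the `fetch` callable, and the `max_age` bound are all hypothetical names introduced for illustration.

```python
import time


class StaleOnErrorCache:
    """Serve the last good value when a remote fetch fails (illustrative)."""

    def __init__(self, fetch, max_age=None, clock=time.monotonic):
        self._fetch = fetch      # assumed callable: key -> value, may raise
        self._max_age = max_age  # optional staleness bound in seconds
        self._clock = clock
        self._entries = {}       # key -> (value, time stored)

    def get(self, key):
        """Return (value, age_seconds); age is 0.0 for a fresh value."""
        try:
            value = self._fetch(key)
        except (TimeoutError, ConnectionError):
            if key not in self._entries:
                raise  # nothing to fall back on
            value, stored_at = self._entries[key]
            age = self._clock() - stored_at
            if self._max_age is not None and age > self._max_age:
                raise  # too stale to serve safely
            return value, age
        self._entries[key] = (value, self._clock())
        return value, 0.0
```

Returning the age alongside the value lets the caller decide whether a stale answer is acceptable for a given request, rather than hiding the degradation.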

GLOBAL ROUTING FAILURES

• Can an Internet system with a high level of redundancy be safe?

1) Firewall configuration error: e.g. CloudFlare

2) Firmware bug: e.g. Juniper Networks

3) BGP misconfiguration: e.g. Pakistan Telecom

NICS AND DRIVERS

Firmware bug: NIC problem

e.g. BCM5709 (chip model)

Misconfiguration: driver problem

e.g. bnx2

APPLICATION-LEVEL FAILURES

What issues cause messages to be dropped or delayed?

1) Crashes

2) Program errors

3) Scheduler latency

4) Overloaded processes
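From the outside, all four causes above look the same as a network fault: expected messages stop arriving on time. A timeout-based heartbeat detector makes this concrete. The sketch below is illustrative only; `HeartbeatDetector` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatDetector:
    """Suspect any node whose last heartbeat is older than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return nodes that have been silent for longer than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=4.0)
print(detector.suspected(now=6.0))  # "b" is silent: crashed, paused, or cut off?
```

The detector cannot tell a crashed process from an overloaded or descheduled one. A node that was merely slow may send a heartbeat later and stop being suspected, so callers should treat suspicion as a hint rather than proof of failure.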

CONCLUSION

Where do communication failures occur?

• Processes

• Servers

• NICs, switches

• Local and wide area networks

• Etc.

CONCLUSION (CONTD.)

• Does a reliable network exist?

• It depends on:

1) Cautious engineering

2) Aggressive network advances

3) Lots of investment

CONCLUSION (CONTD.)

• What can we do? Consider the risk before a partition occurs.

QUESTIONS TIME ! LOL!


THANK YOU FOR YOUR PATIENCE
