The Network is Reliable: An informal survey of real-world communications failures
BY PETER BAILIS AND KYLE KINGSBURY
CONTENTS
• Abstract
• Various survey reports of network reliability under different circumstances
• Conclusion
ABSTRACT
• “The network is reliable” is a fallacy of distributed computing.
• The degree of network reliability is critical for systems to function robustly.
• It is hard to determine the degree of network reliability.
VARIOUS SURVEY REPORTS OF NETWORK RELIABILITY UNDER DIFFERENT CIRCUMSTANCES
LARGE DEPLOYMENTS & ISSUES
• What are large deployments?
Large deployments are globally run distributed systems whose infrastructure spans hundreds of thousands of servers.
• What is considered a serious issue in large deployments?
Partitions: A network partition is a failure, such as the loss of a network device, that splits a network into parts that cannot communicate with each other.
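The definition above can be made concrete with a small sketch. It models a network as a graph and counts the groups of nodes that can still reach each other after one device fails. The topology and the names `connected_components` and `is_partitioned` are illustrative assumptions, not taken from the survey.

```python
from collections import deque

def connected_components(nodes, links):
    """Group the surviving nodes into sets that can still reach each other."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        # A link only works if both of its endpoints are still alive.
        if a in neighbors and b in neighbors:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

def is_partitioned(nodes, links):
    """A network is partitioned when it has more than one component."""
    return len(connected_components(nodes, links)) > 1

# Hypothetical topology: servers a and b reach c and d only through switch s1.
links = [("a", "s1"), ("b", "s1"), ("s1", "c"), ("c", "d")]
healthy = {"a", "b", "c", "d", "s1"}
after_switch_failure = healthy - {"s1"}

print(is_partitioned(healthy, links))               # False
print(is_partitioned(after_switch_failure, links))  # True
```

In this made-up topology the single switch is the only path between the two sides, so its failure alone splits the network.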
LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLES
BEHAVIOR OF NETWORK FAILURE IN MICROSOFT DATACENTERS
• Average failure rate: 5.2 devices/day and 40.8 links/day.
• Median loss of about 59,000 packets per failure.
• Median time to repair is approximately five minutes.
• Redundancy improves median traffic by only 43%.
[Chart: failures per day, devices vs. links]
LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLES
NETWORK FAILURES IN HP’S MANAGED NETWORKS
Analysis of support ticket data:
• Connectivity-related tickets accounted for 11.4% of support tickets.
• 14% of those were of the highest priority level.
• Median incident duration was 2 hours 45 minutes for the highest-priority tickets and 4 hours 18 minutes for all tickets.
[Chart: trouble tickets, connectivity-related vs. high priority]
LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLES
THE FIRST YEAR OF A NEW GOOGLE CLUSTER INVOLVES
• Five faulty racks (40–80 machines seeing 50% packet loss)
• Eight network maintenance events (four of which might cause 30-minute random connectivity losses)
• Three router failures (traffic has to be pulled immediately for an hour)
LARGE DEPLOYMENTS & ISSUES (CONTD.)
How do these companies try to recover from network partitions?
• Google (per Dean): “easy-to-use” abstractions
• PNUTS: weaker consistency alternatives
DATACENTER NETWORK FAILURES
A Google datacenter
Main causes of failures:
1) Power failure
2) Misconfiguration
3) Firmware bugs
4) Topology changes
5) Cable damage
6) Malicious traffic
CLOUD NETWORKS
What are cloud networks?
Key issues:
1) Transient latency
2) Dropped packets
3) Full network partitions
CLOUD NETWORKS (CONTD.)
What happens when two nodes are connected to the Internet but unable to see each other?
What can we learn from this case?
HOSTING PROVIDERS
Could hosting providers offer reliable networks?
E.g. Freistil IT: a specific data center had 50%–100% packet loss, which caused the GlusterFS distributed file system to enter split-brain undetected.
Why?
What is the main issue?
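One common guard against split-brain is to accept writes only on the side of a partition that still holds a strict majority of the cluster. The sketch below illustrates that idea in general terms; it is not a description of how GlusterFS or Freistil IT handled the incident, and `has_quorum` and `accept_write` are hypothetical names.

```python
def has_quorum(reachable_peers, cluster_size):
    """True if this node plus the peers it can reach form a strict majority."""
    visible = reachable_peers + 1  # count this node itself
    return visible * 2 > cluster_size

def accept_write(reachable_peers, cluster_size):
    """Refuse writes on the minority side so two sides cannot both diverge."""
    if not has_quorum(reachable_peers, cluster_size):
        raise RuntimeError("no quorum: refusing write to avoid split-brain")
    return "accepted"

# Five-node cluster split 3 / 2: only the three-node side keeps accepting writes.
print(has_quorum(reachable_peers=2, cluster_size=5))  # True
print(has_quorum(reachable_peers=1, cluster_size=5))  # False
```

With an even split, such as 2 / 2 in a four-node cluster, neither side has a strict majority, so both refuse writes. That trades availability for consistency.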
WIDE AREA NETWORKS (WAN)
• Why are WAN failures particularly interesting?
• Example: CENIC, average partition duration over 5 years: software-related failures (SRF): 6 minutes; hardware-related failures (HRF): 8.2 hours
Conclusion: Graceful degradation under partition or increased latency is especially important for WANs.
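As a minimal sketch of graceful degradation, a reader can fall back to a recently cached value when the remote site times out or is unreachable, and tell the caller that the value may be stale. The class name `DegradingReader`, the `fetch_remote` callable and the staleness limit are assumptions made for illustration.

```python
import time

class DegradingReader:
    """Serve fresh data when the WAN works, bounded-stale data when it does not."""

    def __init__(self, fetch_remote, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch_remote          # hypothetical callable: key -> value
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}                    # key -> (value, time it was stored)

    def read(self, key):
        try:
            value = self._fetch(key)
        except (TimeoutError, ConnectionError):
            # Remote site unreachable or too slow: degrade instead of failing.
            if key in self._cache:
                cached, stored_at = self._cache[key]
                if self._clock() - stored_at <= self._max_stale:
                    return cached, "stale"
            raise  # nothing usable in the cache, surface the failure
        self._cache[key] = (value, self._clock())
        return value, "fresh"
```

Returning the "fresh" or "stale" label alongside the value lets the caller decide whether stale data is acceptable for a given operation.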
GLOBAL ROUTING FAILURES
• Can a highly redundant Internet system be safe?
1) Firewall configuration error: e.g. CloudFlare
2) Firmware bug: e.g. Juniper Networks
3) BGP misconfiguration: e.g. Pakistan Telecom
NICS AND DRIVERS
Firmware bug: NIC problem
e.g. BCM5709 (chip model)
Misconfiguration: driver problem
e.g. bnx2
APPLICATION-LEVEL FAILURES
What issues cause messages to be dropped or delayed?
1) Crashes
2) Program errors
3) Scheduler latency
4) Overloaded processes
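From the outside, all four causes above look the same as a network failure: an expected message does not arrive in time. The sketch below is a simple timeout-based heartbeat detector that shows this. The class `HeartbeatDetector` and the peer names are hypothetical.

```python
class HeartbeatDetector:
    """Suspect any peer whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # peer name -> time of last heartbeat

    def heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def suspected(self, now):
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

detector = HeartbeatDetector(timeout_seconds=10)
detector.heartbeat("web", now=0)
detector.heartbeat("db", now=0)

# "web" keeps sending heartbeats. "db" is alive but stalled (for example by an
# overloaded scheduler), so its next heartbeat is late.
detector.heartbeat("web", now=9)
print(detector.suspected(now=12))  # ['db']
```

The detector cannot tell whether "db" crashed, was partitioned away, or was merely slow, which is why application-level delays have to be handled like communication failures.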
CONCLUSION
Where do communication failures occur?
• Processes
• Servers
• NICs, switches
• Local and wide area networks
• Etc.
CONCLUSION (CONTD.)
• Does a reliable network exist?
• It depends on:
1) Cautious engineering
2) Aggressive network advances
3) Lots of investment
CONCLUSION (CONTD.)
• What can we do? Consider the risk before a partition occurs.
QUESTION TIME! LOL!
THANK YOU FOR YOUR PATIENCE