Introduction What is a Data Center?
• Data Center (DC): contains resources (computational, storage, network)
• Enterprise DC and Internet DC
• Core infrastructure for cloud-based services:
  – On-demand media
  – Cloud storage
  – Cloud computing
  – Social networking services
  – E-commerce
Introduction What is a Data Center?
• Microsoft has over a million servers in its data centers, Google even more [Ballmer 2013]
• Very large DC: > 100,000 servers

• Agile and reconfigurable
• High availability levels
• Low cost
• Energy efficiency
Introduction What is a Data Center Network (DCN)?
• Connect computational and storage resources
  – with each other
  – to the outside

• Scalability
• High cross-sectional bandwidth
• Fault tolerance
[Bilal 2012]
Dependability
Definitions [Avizienis 2004]:
Dependability:
• ability to avoid service failures that are more frequent and more severe than is acceptable
Availability:
• readiness for correct service
• E.g. A = 99% or A = 99.999%
Reliability:
• continuity of correct service
• E.g. R(60 min) = 99%
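To make the reliability example concrete, here is a minimal sketch of what R(60 min) = 99% implies. It assumes a constant failure rate (exponential lifetime model), which the slides do not state; the function names are illustrative only.

```python
import math


def failure_rate_from_reliability(r: float, t: float) -> float:
    """Constant failure rate lam such that exp(-lam * t) == r.

    Assumption: exponentially distributed time to failure.
    """
    return -math.log(r) / t


def reliability(lam: float, t: float) -> float:
    """Probability of uninterrupted correct service over an interval of length t."""
    return math.exp(-lam * t)


# R(60 min) = 99%, taken from the slide; time unit is minutes.
lam = failure_rate_from_reliability(0.99, 60.0)
mttf_hours = (1.0 / lam) / 60.0

print(f"failure rate: {lam:.6f} per minute")
print(f"mean time to failure: {mttf_hours:.1f} hours")
print(f"R(24 h) under the same model: {reliability(lam, 24 * 60):.3f}")
```

Under this model, a 99% chance of surviving one hour leaves a much lower chance of surviving a full day without interruption. This is why reliability is always quoted for a specific interval, whereas availability is a long-run fraction.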
Dependability Availability Levels Data Center
ANSI/TIA-942-A
• Network architecture
• Electrical design
• System redundancy
• Database management
• Protection against physical hazards (fire, flood, windstorm)
• Power management
• …
• Practical design considerations (cabling etc.) → A well-designed DC is easier to repair and maintain (availability, maintainability)
Uptime Institute
• Tier 1 (99.671%)
• Tier 2 (99.741%)
• Tier 3 (99.982%)
• Tier 4 (99.995%)
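The tier percentages are easier to compare as permitted downtime per year. The sketch below does only that conversion; it assumes a 365-day year, and the constant and function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (assumption)

# Availability figures as listed on the slide.
TIERS = {
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected downtime per year for a given availability in percent."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


for name, availability in TIERS.items():
    hours = annual_downtime_hours(availability)
    print(f"{name}: {availability}% -> {hours:.2f} h/year ({hours * 60:.0f} min)")
```

The gap between Tier 1 and Tier 4 is roughly a day of downtime per year versus under half an hour.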
Image: http://www.colocationamerica.com/
Dependability In Data Center Networks
Threats to Dependability
• Link and node failure
Countermeasures
• More reliable equipment
• Fault tolerance: the system continues to operate properly even when some of its components have failed.
  – Topology
  – Routing
Three-tier DCN Structure
• Topology with three layers (Core / Aggregation / Access)
• access routers (AccR) aggregate traffic from up to several thousand servers
• 1:1 redundancy in each layer (except for top-of-rack switches, ToRs)
Figures: [Gill 2011]
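A rough sketch of why 1:1 redundancy matters, and why the non-redundant ToR stands out. It assumes independent failures, perfect fail-over, and a hypothetical per-device availability of 99.9%. None of these numbers come from [Gill 2011].

```python
import math


def series(*availabilities: float) -> float:
    """All components are needed: availabilities multiply."""
    return math.prod(availabilities)


def parallel(availability: float, copies: int = 2) -> float:
    """At least one of `copies` identical, independent components is up."""
    return 1.0 - (1.0 - availability) ** copies


A_DEVICE = 0.999  # hypothetical availability of a single device

tor = A_DEVICE                        # single ToR, no redundancy
redundant_pair = parallel(A_DEVICE)   # 1:1 redundant layer

# Server-to-core path: ToR -> access pair -> aggregation pair -> core pair.
path = series(tor, redundant_pair, redundant_pair, redundant_pair)

print(f"single device:   {A_DEVICE:.6f}")
print(f"redundant pair:  {redundant_pair:.6f}")
print(f"end-to-end path: {path:.6f}")
```

Under these assumptions the path availability is dominated by the single ToR. The redundant layers contribute almost nothing to the unavailability.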
Three-tier DCN Dependability Analysis
• Data center networks are reliable
  – A = 99.99% for 80% of links and 60% of devices
• Low-cost, commodity switches (ToR) are highly reliable.
Figure axis: (# devices with failures) / (# devices)
Figures: [Gill 2011]
Three-tier DCN Root causes of failures
• Hardware problems take longer to mitigate
• Load balancers experience a high number of software faults.
• Link failures: dominated by HW and connection errors
Figures: [Gill 2011]
Fat-Tree Topology
• k pods
• Switch and server counts (figure labels): (k/2)^2 core switches, k(k/2) aggregation switches, k(k/2) edge switches, k(k/2)^2 servers
• Based on Clos network
✓ Use commodity network switches (all identical with k ports)
✓ Fault tolerance: higher redundancy
✓ High cross-sectional bandwidth
Limitations
• Scalability issues
• # pods ≤ # ports in each switch
Image: [Al-Fares 2008]
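The size of a fat-tree is fully determined by the switch port count k, which is the scalability limit named above. A small sketch of the standard k-ary fat-tree counts from [Al-Fares 2008]; the function and key names are illustrative.

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts of a k-ary fat-tree built from identical k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "core_switches": half * half,        # (k/2)^2
        "aggregation_switches": k * half,    # k/2 per pod
        "edge_switches": k * half,           # k/2 per pod
        "servers": k * half * half,          # k/2 per edge switch, k^3/4 in total
    }


for ports in (4, 24, 48):
    print(ports, fat_tree_sizes(ports))
```

With 48-port switches the topology tops out at 27,648 servers. Growing beyond that requires switches with more ports, not just more switches.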
DCell Topology
• Server centric (!)
• Servers with multiple network connections
• Recursive building algorithm
  – DCell0: n servers, 1 switch
  – DCell1: n+1 DCell0 cells
  – DCell2: n(n+1)+1 DCell1 cells
  – …
• Decentralized routing based on structure; fault-tolerant routing without using global states
Image: [Guo 2008]
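The recursive construction can be sketched as a recurrence: a level-k DCell consists of (servers in one level-(k-1) cell) + 1 sub-cells. The code below only counts cells and servers and does not build the links; the function name is illustrative.

```python
def dcell_sizes(n: int, levels: int) -> list:
    """Return [(sub_cells, servers)] for DCell0 .. DCell<levels>.

    sub_cells is the number of level-(k-1) cells in one level-k cell
    (1 for DCell0, which is n servers on a single switch).
    """
    if n < 1 or levels < 0:
        raise ValueError("need n >= 1 and levels >= 0")
    sizes = [(1, n)]  # DCell0: n servers, 1 switch
    for _ in range(levels):
        servers_below = sizes[-1][1]
        sub_cells = servers_below + 1        # n+1, then n(n+1)+1, ...
        sizes.append((sub_cells, sub_cells * servers_below))
    return sizes


for level, (cells, servers) in enumerate(dcell_sizes(4, 3)):
    print(f"DCell{level}: {cells} sub-cells, {servers} servers")
```

The server count grows doubly exponentially with the level. With n = 4, three levels already reach 176,820 servers.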
Robustness Dependability in networks
• How to assess dependability in a network?
• Network properties
Table: [Manzano 2013]
Robustness Dependability in networks
• How to assess dependability in a network?
• Network properties
• Robustness
  – A2TR(p): fraction of node pairs that are connected to each other after p failures [Neumayer 2010]
  – Fully connected: A2TR = 1
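A minimal sketch of how A2TR could be computed for a given set of failed nodes, using breadth-first search over the surviving graph. It assumes the fraction is taken over pairs of surviving nodes; the cited papers may normalize differently. The adjacency-dict representation and function name are illustrative.

```python
from collections import deque


def a2tr(adjacency: dict, failed=()) -> float:
    """Fraction of surviving node pairs that can still reach each other.

    adjacency: undirected graph as {node: iterable of neighbours}.
    failed:    nodes removed from the graph.
    Assumption: the denominator counts pairs of surviving nodes only.
    """
    failed = set(failed)
    alive = [v for v in adjacency if v not in failed]
    if len(alive) < 2:
        raise ValueError("need at least two surviving nodes")

    seen = set()
    connected_pairs = 0
    for start in alive:
        if start in seen:
            continue
        # Measure the size of the connected component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in failed and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        connected_pairs += size * (size - 1) // 2

    total_pairs = len(alive) * (len(alive) - 1) // 2
    return connected_pairs / total_pairs


# Path graph a - b - c - d - e
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(a2tr(path_graph))                # no failures
print(a2tr(path_graph, failed=["c"]))  # middle node fails, graph splits in two
```

Averaging this value over many random failure sets of size p gives an A2TR(p) curve for a topology.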
Robustness Connectivity
• Random networks
• 1000 simulation runs
• Remove stepwise from 0 to (n-2) nodes
Note:
• FatTree: better robustness metrics (⟨k⟩, …) than DCell
• But FatTree is worse in connectivity analysis
[Manzano 2013]
Effectiveness of Redundancy
• Considers only structure, rerouting not considered → gives best case
• Real systems: coverage not perfect
• Study: Three-tier network [Gill 2011]
  – Failures in fail-over mechanism
  – Configuration problems in backup (reroute traffic to failed component)
  – Protocol issues and timeouts
[Gill 2011]
Conclusion
• Network dependability needs different metrics
• “Optimal” is relative
• Many additional factors to consider
  – Scalability
  – Cross-section bandwidth
  – Cost effectiveness
  – Energy efficiency
Table: [Bilal 2013B]
References 1
• [Avizienis 2004] Avizienis et al., “Basic Concepts and Taxonomy of Dependable and Secure Computing”, IEEE Trans. Dependable and Secure Computing, 2004
• [Helvik 2007] B. E. Helvik, K. Sallhammar, and S. J. Knapskog, “Chapter 8: Integrated Dependability and Security Evaluation Using Game Theory and Markov Models”, in Information Assurance: Dependability and Security in Networked Systems, Elsevier, 2007
• [Gill 2011] P. Gill, N. Jain, and N. Nagappan, “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, SIGCOMM, 2011
• [Al-Fares 2008] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture”, SIGCOMM, 2008
• [Guo 2008] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: a scalable and fault tolerant network structure for data centers”, SIGCOMM, 2008
• [Guo 2009] C. Guo et al., “BCube: a high performance, server-centric network architecture for modular data centers”, SIGCOMM, 2009
References 2
• [Bilal 2012] Bilal et al., “Quantitative comparisons of the state-of-the-art data center architectures”, Concurrency Computat.: Pract. Exper., 2012
• [Manzano 2013] M. Manzano, K. Bilal, E. Calle, and S. U. Khan, “On the Connectivity of Data Center Networks”, IEEE Communications Letters, 2013
• [Bilal 2013A] K. Bilal, M. Manzano, S. U. Khan, E. Calle, K. Li, and A. Y. Zomaya, “On the Characterization of the Structural Robustness of Data Center Networks”, IEEE Trans. Cloud Computing, 2013
• [Bilal 2013B] Bilal et al., “A Taxonomy and Survey on Green Data Center Networks”, Future Generation Computer Systems, 2013
• [Neumayer 2010] S. Neumayer and E. Modiano, “Network Reliability With Geographically Correlated Failures”, Proc. 2010 Conference on Information Communications
• [Ballmer 2013] http://news.microsoft.com/2013/07/08/steve-ballmer-worldwide-partner-conference-2013-keynote/
• [Liu 2013] Y. Liu et al., “Data Center Networks: Topologies, Architectures and Fault-Tolerance Characteristics”, Springer, 2013
• [Gyarmati 2013] Gyarmati et al., “Free-Scaling Your Data Center”, Computer Networks, 2013