76
EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

Page 1: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.1

11/12/2004

EE898.02Architecture of Digital Systems

Lecture 4 Interconnection Networks and

Clusters

Prof. Seok-Bum Ko

Page 2: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.2

11/12/2004

Networks

• Goal: Communication between computers• Eventual Goal: treat collection of

computers as if one big computer, distributed resource sharing

• Theme: Different computers must agree on many things

– Overriding importance of standards and protocols– Error tolerance critical as well

• Warning: Terminology-rich environment

Page 3: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.3

11/12/2004

Networks

• Facets people talk a lot about:– direct (point-to-point) vs. indirect (multi-hop)– topology (e.g., bus, ring, DAG)– routing algorithms– switching (aka multiplexing)– wiring (e.g., choice of media, copper, coax, fiber)

• What really matters:– latency– bandwidth– cost– reliability

Page 4: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.4

11/12/2004

Interconnections (Networks)• Examples (Figure 8.1, page 788):

– Wide Area Network (ATM): 100-1000s nodes; ~ 5,000 kilometers– Local Area Networks (Ethernet): 10-1000 nodes; ~ 1-2 kilometers– System/Storage Area Networks (FC-AL): 10-100s nodes;

~ 0.025 to 0.1 kilometers per link

a.k.a.network,communication subnet

a.k.a.end systems,hosts

Interconnection Network

Page 5: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.5

11/12/2004

SAN: Storage vs. System

• Storage Area Network (SAN): A block I/O oriented network between application servers and storage

– Fibre Channel is an example

• Usually high bandwidth requirements, and less concerned about latency

– in 2001: 1 Gbit bandwidth and millisecond latency OK

• Commonly a dedicated network (that is, not connected to another network)

• May need to work gracefully when saturated

• Given larger block size, may have higher bit error rate (BER) requirement than LAN

Page 6: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.6

11/12/2004

SAN: Storage vs. System

• System Area Network (SAN): A network aimed at connecting computers

– Myrinet is an example

• Aimed at High Bandwidth AND Low Latency.

– in 2001: > 1 Gbit bandwidth and ~ 10 microsecond

• May offer in order delivery of packets• Given larger block size, may have

higher bit error rate (BER) requirement than LAN

Page 7: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.7

11/12/2004

More Network Background

• Connection of 2 or more networks: Internetworking

• 3 cultures for 3 classes of networks– WAN: telecommunications, Internet– LAN: PC, workstations, servers cost– SAN: Clusters, RAID boxes: latency (System

A.N.) or bandwidth (Storage A.N.)

• Try for single terminology• Motivate the interconnection

complexity incrementally

Page 8: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.8

11/12/2004

ABCs of Networks

• Starting Point: Send bits between 2 computers

• Queue (FIFO) on each end• Information sent called a “message”• Can send both ways (“Full Duplex”)• Rules for communication? “protocol”

– Inside a computer: » Loads/Stores: Request (Address) & Response

(Data)» Need Request & Response signaling

Page 9: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.9

11/12/2004

A Simple Example

• What is the format of message?– Fixed? Number bytes?

Request/Response

Address/Data

1 bit 32 bits0: Please send data from Address1: Packet contains data corresponding to request

• Header/Trailer: information to deliver a message

• Payload: data in message (1 word above)

Page 10: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.10

11/12/2004

Questions About Simple Example

• What if more than 2 computers want to communicate?

– Need computer “address field” (destination) in packet

• What if packet is garbled in transit?– Add “error detection field” in packet (e.g., Cyclic Redundancy Chk)

• What if packet is lost?– More “elaborate protocols” to detect loss

(e.g., NAK, ARQ, time outs)

• What if multiple processes/machine?– Queue per process to provide protection

• Simple questions such as these lead to more complex protocols and packet formats => complexity

Page 11: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.11

11/12/2004

A Simple Example Revised

• What is the format of packet?– Fixed? Number bytes?

Request/Response

Address/Data

2 bits 32 bits

00: Request—Please send data from Address01: Reply—Packet contains data corresponding to request10: Acknowledge request11: Acknowledge reply

4 bits

CRC

Page 12: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.12

11/12/2004

Software to Send and Receive

• SW Send steps1: Application copies data to OS buffer

2: OS calculates checksum, starts timer

3: OS sends data to network interface HW and says start

• SW Receive steps3: OS copies data from network interface HW to OS buffer

2: OS calculates checksum, if matches send ACK; if not, deletes message (sender resends when timer expires)

1: If OK, OS copies data to user address space and signals application to continue

• Sequence of steps for SW: protocol– Example similar to UDP/IP protocol in UNIX

Page 13: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.13

11/12/2004

Network Performance Measures

• Overhead: latency of interface vs. Latency: network

Page 14: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.14

11/12/2004

Universal Performance Metrics

Sender

Receiver

SenderOverhead

Transmission time(size ÷ bandwidth)

Transmission time(size ÷ bandwidth)

Time ofFlight

ReceiverOverhead

Transport Latency

Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead

Total Latency

(processorbusy)

(processorbusy)

Includes header/trailer in BW calculation?

Page 15: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.15

11/12/2004

Total Latency Example

• 1000 Mbit/sec., sending overhead of 80 µsec & receiving overhead of 100 µsec.

• a 10000 byte message (including the header), allows 10000 bytes in a single message

• 2 situations: distance 100 m vs. 1000 km• Speed of light ~ 300,000 km/sec

• Latency0.01km = 80 + 0.01km / (50% x 300,000) + 10000 x 8 / 1000 + 100 = 260 µsec

• Latency0.5km = 80 + 0.5km / (50% x 300,000) + 10000 x 8 / 1000 + 100 = 263 µsec

• Latency1000km = 80 + 1000 km / (50% x 300,000) + 10000 x 8 / 1000 + 100 = 6931

• Long time of flight => complex WAN protocol

Page 16: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.16

11/12/2004

Universal Metrics

• Apply recursively to all levels of system

• inside a chip, between chips on a board, between computers in a cluster, …

• Look at WAN v. LAN v. SAN

Page 17: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.17

11/12/2004

Simplified Latency Model

• Total Latency = Overhead + Message Size / BW

• Overhead = Sender Overhead + Time of Flight +

Receiver Overhead

• Effective BW = Message Size / Total Latency

Page 18: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.18

11/12/2004

Overhead, BW, Size

0

0

1

10

100

1,000

16

64

25

6

10

24

40

96

16

38

4

65

53

6

26

21

44

10

48

57

6

41

94

30

4

Message Size (bytes)

Eff

ect

ive

Ba

nd

wid

th (

Mb

it/s

ec)

o1,bw1000

o1,bw100

o1,bw10 o500,

bw10

o500,bw100

o500,bw1000

o25,bw100

o25,bw1000

o25,bw10

Msg Size

Delivered BW

•How big are real messages?

Page 19: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.19

11/12/2004

Measurement: Sizes of Message for NFS

• 95% Msgs, 30% bytes for packets ~ 200 bytes• > 50% data transferred in packets = 8KB

Packet size

Cu

mm

ula

tive %

0%10%20%30%40%50%60%70%80%90%

100%

0 1024 2048 3072 4096 5120 6144 7168 8192

Msgs

Bytes

Page 20: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.20

11/12/2004

Interconnect Issues

• Performance Measures• Network Media

Page 21: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.21

11/12/2004

Network Media

Copper, 1mm think, twisted to avoidattenna effect (telephone)"Cat 5" is 4 twisted pairs in bundle

Used by cable companies: high BW, good noise immunity

Light: 3 parts are cable, light source, light detector.Note fiber is unidirectional; need 2 for full duplex

Twisted Pair:

Coaxial Cable:

Copper coreInsulator

Braided outer conductor

Plastic Covering

Fiber Optics

Transmitter– L.E.D– Laser Diode

Receiver– Photodiode

lightsource Silica core

Total internalreflection

Cladding

Cladding

Buffer

Buffer

Page 22: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.22

11/12/2004

Fiber• Multimode fiber: ~ 62.5 micron diameter vs. the 1.3

micron wavelength of infrared light. Since wider it has more dispersion problems, limiting its length at 1000 Mbits/s for 0.1 km, and 1-3 km at 100 Mbits/s. Uses LED as light

• Single-mode fiber: "single wavelength" fiber (8-9 microns) uses laser diodes, 1-5 Gbits/s for 100s kms– Less reliable and more expensive, and restrictions on bending

– Cost, bandwidth, and distance of single-mode fiber affected by power of the light source, the sensitivity of the light detector, and the attenuation rate (loss of optical signal strength as light passes through the fiber) per kilometer of the fiber cable.

– Typically glass fiber, since has better characteristics than the less expensive plastic fiber

Page 23: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.23

11/12/2004

Wave Division Multiplexing Fiber

• Send N independent streams on single fiber!

• Just use different wavelengths to send and demultiplex at receiver

• WDM in 2000: 40 Gbit/s using 8 wavelengths

• Plan to go to 80 wavelengths => 400 Gbit/s!

Page 24: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.24

11/12/2004

Compare MediaAssume 40 2.5" disks, each 25 GB, Move 1 km

Compare Cat 5 (100 Mbit/s), Multimode fiber (1000 Mbit/s), single mode (2500 Mbit/s), and car

• Cat 5: (1000 x 1024 x 8 Mb) / 100 Mb/s = 23 hrs• MM: (1000 x 1024 x 8 Mb) / 1000 Mb/s = 2.3 hrs• SM: (1000 x 1024 x 8 Mb) / 2500 Mb/s = 0.9 hrs• Car: 5 min + 1 km / 50 kph + 10 min = 0.3

hrs• Car of disks = high BW media

Page 25: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.25

11/12/2004

Interconnect Issues

• Performance Measures• Network Media• Connecting Multiple Computers

Page 26: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.26

11/12/2004

Connecting Multiple Computers

• Shared Media vs. Switched: pairs communicate at same time: “point-to-point” connections

• Aggregate BW in switched network is many times shared

– point-to-point faster since no arbitration, simpler interface

• Arbitration in Shared network?– Central arbiter for LAN?– Listen to check if being used (“Carrier

Sensing”)– Listen to check if collision

(“Collision Detection”)– Random resend to avoid repeated collisions;

not fair arbitration; – OK if low utilization (A. K. A. data switching

interchanges, multistageinterconnection networks,interface message processors)

Page 27: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.27

11/12/2004

Connection-Based vs. Connectionless

• Telephone: operator sets up connection between the caller and the receiver

– Once the connection is established, conversation can continue for hours

• Share transmission lines over long distances by using switches to multiplex several conversations on the same lines

– “Time division multiplexing” divide B/W transmission line into a fixed number of slots, with each slot assigned to a conversation

• Problem: lines busy based on number of conversations, not amount of information sent

• Advantage: reserved bandwidth

Page 28: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.28

11/12/2004

Connection-Based vs. Connectionless

• Connectionless: every package of information must have an address => packets

– Each package is routed to its destination by looking at its address

– Analogy, the postal system (sending a letter)– also called “Statistical multiplexing”– Note: “Split phase buses” are sending packets

Page 29: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.29

11/12/2004

Routing Messages

• Shared Media– Broadcast to everyone

• Switched Media needs real routing. Options:

– Source-based routing: message specifies path to the destination (changes of direction)

– Virtual Circuit: circuit established from source to destination, message picks the circuit to follow

– Destination-based routing: message specifies destination, switch must pick the path

» deterministic: always follow same path» adaptive: pick different paths to avoid

congestion, failures» Randomized routing: pick between several

good paths to balance network load

Page 30: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.30

11/12/2004

• mesh: dimension-order routing

– (x1, y1) -> (x2, y2)

– first x = x2 - x1,

– then y = y2 - y1,

• hypercube: edge-cube routing

– X = xox1x2 . . .xn -> Y = yoy1y2 . . .yn

– R = X xor Y– Traverse dimensions of differing

address in order

• tree: common ancestor

Deterministic Routing Examples

001

000

101

100

010 110

111011

Page 31: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.31

11/12/2004

Store and Forward vs. Cut-Through

• Store-and-forward policy: each switch waits for the full packet to arrive in switch before sending to the next switch (good for WAN)

• Cut-through routing or worm hole routing: switch examines the header, decides where to send the message, and then starts forwarding it immediately

– In worm hole routing, when head of message is blocked, message stays strung out over the network, potentially blocking other messages (needs only buffer the piece of the packet that is sent between switches).

– Cut through routing lets the tail continue when head is blocked, compressing the strung-out message into a single switch. (Requires a buffer large enough to hold the largest packet).

Page 32: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.32

11/12/2004

Cut-Through vs. Store and Forward

• Advantage– Latency reduces from a function of:

# of intermediate switches ×

by the size of the packet to the time for 1st part of the packet to negotiate the switches +

transmission time (=the packet size ÷ interconnect BW)

Page 33: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.33

11/12/2004

Congestion Control• Packet switched networks do not reserve bandwidth;

this leads to contention (connection based limits input)

• Solution: prevent packets from entering until contention is reduced (e.g., freeway on-ramp metering lights)

• Options:– Packet discarding: If packet arrives at switch and no room in buffer,

packet is discarded (e.g., UDP)– Flow control: between pairs of receivers and senders;

use feedback to tell sender when allowed to send next packet» Back-pressure: separate wires to tell to stop» Window: give original sender right to send N packets before

getting permission to send more; overlaps latency of interconnection with overhead to send & receive packet (e.g., TCP), adjustable window

– Choke packets: aka “rate-based”; Each packet received by busy switch in warning state sent back to the source via choke packet. Source reduces traffic to that destination by a fixed % (e.g., ATM)

Page 34: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.34

11/12/2004

Protocols: HW/SW Interface

• Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently;

– Enabling technologies: SW standards that allow reliable communications without reliable networks

– Hierarchy of SW layers, giving each layer responsibility for portion of overall communications task, called protocol families or protocol suites

• Transmission Control Protocol/Internet Protocol (TCP/IP)

– This protocol family is the basis of the Internet– IP makes best effort to deliver; TCP guarantees delivery– TCP/IP used even when communicating locally: NFS uses

IP even though communicating across homogeneous LAN

Page 35: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.35

11/12/2004

Protocol Family Concept

Message TH

Message

TH

Message TH

Message

Message TH Message TH TH

Actual

Actual Actual

Actual

Physical

Logical

Logical

Page 36: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.36

11/12/2004

Protocol Family Concept

• Key to protocol families is that communication occurs logically at the same level of the protocol, called peer-to-peer,

• but is implemented via services at the next lower level• Encapsulation: carry higher level information within lower level “envelope”• Fragmentation: break packet into multiple smaller packets and reassemble• Danger is each level increases latency if implemented as hierarchy (e.g.,

multiple check sums)

Page 37: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.37

11/12/2004

Message

TCP/IP packet, Ethernet packet, protocols

• Application sends message

TCP data

TCP Header

IP Header

IP DataEH

Ethernet Hdr

Ethernet Hdr

• TCP breaks into 64KB segments, adds 20B header

• IP adds 20B header, sends to network

• If Ethernet, broken into 1500B packets with headers, trailers (24B)

• All Headers, trailers have length field, destination, ...

Page 38: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.38

11/12/2004

Example Networks

• Ethernet: shared media 10 Mbit/s proposed in 1978, carrier sensing with exponential backoff on collision detection

• 15 years with no improvement; higher BW?

• Multiple Ethernets with devices to allow Ehternets to operate in parallel!

• 10 Mbit Ethernet successors?– FDDI: shared media (too late)– ATM (too late?)– Switched Ethernet– 100 Mbit Ethernet (Fast Ethernet)– Gigabit Ethernet

Page 39: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.39

11/12/2004

Connecting Networks

• Bridges: connect LANs together, passing traffic from one side to another depending on the addresses in the packet.

– operate at the Ethernet protocol level– usually simpler and cheaper than routers

• Routers or Gateways: these devices connect LANs to WANs or WANs to WANs and resolve incompatible addressing.

– Generally slower than bridges, they operate at the internetworking protocol (IP) level

– Routers divide the interconnect into separate smaller subnets, which simplifies manageability and improves security

• Cisco is major supplier; basically special purpose computers

Page 40: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.40

11/12/2004

Comparing Networks

SAN LAN WANFC-AL Infini-

band10 MbEthernet

100 MbEthernet

1000 MbEthernet

ATM

Length(meters)

30/1000 17/100 500/2500 200 100

Datalines

2 1, 4, 12 1 1 4/1 1

Clock(MHz)

1000 2500 10 100 1000 155/622

Switch? Opt. Yes Optional Opt. Yes Yes

Nodes <=127 ~1000 <=254 <=254 <=254 ~10000

Material Copper/ fiber

Copper/fiber

Copper Copper Copper/fiber

Copper/fiber

Page 41: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.41

11/12/2004

Comparing Networks

SAN LAN WANFC-AL Infini-

band10 MbEthernet

100 MbEthernet

1000 MbEthernet

ATM

Switch? Opt. Yes Optional Opt. Yes Yes

BisectionBW(Mbits/sec)

800sharedor 800 xswitchports

(2000 -24000)xswitchports

10sharedor 10 xswitchports

100sharedor 100 xswitchports

1000 xswitchports

155 xswitchports

Peak linkBW(Mbits/sec)

800 2000,8000,24000

10 100 1000 155/622

Topology Ring orStar

Star Line orStar

Line orStar

Star Star

Page 42: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.42

11/12/2004

Comparing Networks

SAN LAN WANFC-AL Infiniband 10 Mb

Ethernet100 MbEthernet

1000 MbEthernet

ATM

Connec-tionless?

Yes Yes Yes Yes Yes No

Store &forward?

No No No No No Yes

Conges-tioncontrol

Credit-based

Back-pressure

Carriersense

Carriersense

Carriersense

Creditbased

Standard ANSITaskGroupX3T11

InfinibandTradeAssocia-tion

IEEE802.3

IEEE802.3

IEEE802.3ab-1999

ATMForum

Page 43: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.43

11/12/2004

Packet Formats

• See Fig 8.20 on page 826

Page 44: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.44

11/12/2004

Wireless Networks• Media can be air as well as glass or copper• Radio wave is electromagnetic wave propagated by an

antenna• Radio waves are modulated: sound signal superimposed

on stronger radio wave which carries sound signal, called carrier signal

• Radio waves have a wavelength or frequency: measure either length of wave or number of waves per second (MHz): long waves => low frequencies, short waves => high frequencies

• Tuning to different frequencies => radio receiver pick up a signal.

– FM radio stations transmit on band of 88 MHz to 108 MHz using frequency modulations (FM) to record the sound signal

Page 45: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.45

11/12/2004

Issues in Wireless

• Wireless often => mobile => network must rearrange itself dynamically

• Subject to jamming and eavesdropping– No physical tape– Cannot detect interception

• Power – devices tend to be battery powered

– antennas radiate power to communicate and little of it reaches the receiver

• As a result, raw bit error rates are typically a thousand to a million times higher than copper wire

Page 46: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.46

11/12/2004

Reliability of Wires Transmission

• bit error rate (BER) of wireless link determined by received signal power, noise due to interference caused by the receiver hardware, interference from other sources, and characteristics of the channel

– Path loss: power to overcome interference– Shadow fading: blocked by objects (walls,

buildings)– Multipath fading: interference between multiple

version of signals arriving different times– Interference: reuse of frequency or from

adjacent channels

Page 47: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.47

11/12/2004

2 Wireless Architectures

• Base-station architectures– Connected by land lines for longer distance

communication, and the mobile units communicate only with a single local base station

– More reliable since 1-hop from land lines– Example: cell phones

• Peer-to-peer architectures– Allow mobile units to communicate with each

other, and messages hop from one unit to the next until delivered to the desired unit

– More reconfigurable

Page 48: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.48

11/12/2004

Cellular Telephony• Exploit exponential path loss to reuse same frequency

at spatially separated locations, thereby greatly increasing customers served

• Divide region into nonoverlaping hexagonal cells (2-10 mi. diameter) which use different frequencies if nearby, reusing a frequency when cells far apart so that mutual interference OK

• Intersection of three hexagonal cells is a base station with transmitters and antennas

• Handset selects a cell based on signal strength and then picks an unused radio channel

• To properly bill for cellular calls, each cellular phone handset has an electronic serial number

Page 49: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.49

11/12/2004

Cellular Telephony II• Original analog design frequencies set for

each direction: pair called a channel– 869.04 to 893.97 MHz, called the forward path – 824.04 MHz to 848.97 MHz, called the reverse path– Cells might have had between 4 and 80 channels

• Several digital successors:– Code division multiple access (CDMA) uses a wider

radio frequency band– time division multiple access (TDMA) – global system for mobile communication (GSM)– International Mobile Telephony 2000 (IMT-2000) which

is based primarily on two competing versions of CDMA and one TDMA, called Third Generation (3G)

Page 50: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.50

11/12/2004

Practical Issues for Interconnection Networks

• Connectivity: max number of machines affects complexity of network and protocols since protocols must target largest size

• Connection Network Interface to computer

– Where in bus hierarchy? Memory bus? Fast I/O bus? Slow I/O bus? (Ethernet to Fast I/O bus, Infiniband to Memory bus since it is the Fast I/O bus)

– SW Interface: does software need to flush caches for consistency of sends or receives?

– Programmed I/O vs. DMA? Is NIC in uncacheable address space?

Page 51: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.51

11/12/2004

Practical Issues for Interconnection Networks

• Standardization advantages:– low cost (components used repeatedly)– stability (many suppliers to chose from)

• Standardization disadvantages:– Time for committees to agree– When to standardize?

» Before anything built? => Committee does design?

» Too early suppresses innovation

• Reliability (vs. availability) of interconnect

Page 52: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.52

11/12/2004

Practical Issues

Interconnection SAN LAN WAN

Example Infiniband Ethernet ATM

Standard Yes Yes Yes

Fault Tolerance? Yes Yes Yes

Hot Insert? Yes Yes Yes

• Standards: required for WAN, LAN, and likely SAN!

• Fault Tolerance: Can nodes fail and still deliver messages to other nodes?

• Hot Insert: If the interconnection can survive a failure, can it also continue operation while a new node is added to the interconnection?

Page 53: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.53

11/12/2004

Cross-Cutting Issues for Networking

• Efficient Interface to Memory Hierarchy vs. to Network– SPEC ratings => fast to memory hierarchy– Writes go via write buffer, reads via L1

and L2 caches• Example: 40 MHz SPARCStation(SS)-2 vs 50

MHz SS-20, no L2$ vs 50 MHz SS-20 with L2$ I/O bus latency; different generations

• SS-2: combined memory, I/O bus => 200 ns• SS-20, no L2$: 2 busses +300ns => 500ns• SS-20, w L2$: cache miss+500ns => 1000ns

Page 54: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.54

11/12/2004

Crosscutting: Smart Switch vs. Smart Network Interface Card

Less Intelligent More Intelligent

Switch

Small Ethernet

Myrinet

Infiniband

Large Ethernet

NIC

Ethernet

Infiniband Target Channel

Adapter

Myrinet

Infiniband Host Channel Adapter

•Inexpensive NIC => Ethernet standard in all computers•Inexpensive switch => Ethernet used in home networks

Page 55: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.55

11/12/2004

Cluster

• LAN switches => high network bandwidth and scaling was available from off the shelf components

• 2001 Cluster = collection of independent computers using switched network to provide a common service

• Many mainframe applications run more "loosely coupled" machines than shared memory machines (next lecture)

– databases, file servers, Web servers, simulations, and multiprogramming/batch processing

– Often need to be highly available, requiring error tolerance and repairability

– Often need to scale

Page 56: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.56

11/12/2004

Cluster Drawbacks• Cost of administering a cluster of N machines

~ administering N independent machines vs. cost of administering a shared address space N processors multiprocessor ~ administering 1 big machine

• Clusters usually connected using I/O bus, whereas multiprocessors usually connected on memory bus

• Cluster of N machines has N independent memories and N copies of OS, but a shared address multi-processor allows 1 program to use almost all memory

– DRAM prices has made memory costs so low that this multiprocessor advantage is much less important in 2001

Page 57: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.57

11/12/2004

Cluster Advantages• Error isolation: separate address space limits

contamination of error• Repair: Easier to replace a machine without bringing

down the system than in an shared memory multiprocessor

• Scale: easier to expand the system without bringing down the application that runs on top of the cluster

• Cost: Large scale machine has low volume => fewer machines to spread development costs vs. leverage high volume off-the-shelf switches and computers

• Amazon, AOL, Google, Hotmail, Inktomi, WebTV, and Yahoo rely on clusters of PCs to provide services used by millions of people every day

Page 58: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.58

11/12/2004

Addressing Cluster Weaknesses

• Network performance: SAN, especially Infiniband, may tie cluster closer to memory

• Maintenance: separate of long term storage and computation

• Computation maintenance:– Clones of identical PCs– 3 steps: reboot, reinstall OS, recycle– At $1000/PC, cheaper to discard than to figure out

what is wrong and repair it?

• Storage maintenance:– If separate storage servers or file servers, cluster is

no worse?

Page 59: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.59

11/12/2004

Putting it all together: Google

• Google: search engine that scales at growth Internet growth rates

• Search engines: 24x7 availability• Google 12/2000: 70M queries per day, or

AVERAGE of 800 queries/sec all day• Response time goal: < 1/2 sec for search• Google crawls WWW and puts up new index

every 4 weeks• Stores local copy of text of pages of WWW

(snippet as well as cached copy of page)• 3 collocation sites (2 CA + 1 Virginia)• 6000 Processors and 12000 disks

Page 60: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.60

11/12/2004

Hardware Infrastructure• VME rack 19 in. wide, 6 feet

tall, 30 inches deep• Per side: 40 1 Rack Unit (RU)

PCs +1 HP Ethernet switch (4 RU): Each blade can contain 8 100-Mbit/s EN or a single 1-Gbit Ethernet interface

• Front+back => 80 PCs + 2 EN switches/rack

• Each rack connects to 2 128 1-Gbit/s EN switches

• Dec 2000: 40 racks at most recent site

Page 61: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.61

11/12/2004

Google PCs

• 2 IDE drives, 256 MB of SDRAM, modest Intel microprocessor, a PC mother-board, 1 power supply and a few fans.

• Each PC runs the Linux operating system• Buy over time, so upgrade components:

populated between March and November 2000 – microprocessors: 533 MHz Celeron to an 800 MHz Pentium III, – disks: capacity between 40 and 80 GB, speed 5400 to 7200

RPM– bus speed is either 100 or 133 MH– Cost: ~ $1300 to $1700 per PC

• PC operates at about 55 Watts• Rack => 4500 Watts , 60 amps

Page 62: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.62

11/12/2004

Reliability• For 6000 PCs, 12000s, 200 EN switches• ~ 20 PCs will need to be rebooted/day

• ~ 2 PCs/day hardware failure, or 2%-3% / year– 5% due to problems with motherboard, power supply, and connectors

– 30% DRAM: bits change + errors in transmission (100 MHz)

– 30% Disks fail

– 30% Disks go very slow (10%-3% expected BW)

• 200 EN switches, 2-3 fail in 2 years

• 6 Foundry switches: none failed, but 2-3 of 96 blades of switches have failed (16 blades/switch)

• Collocation site reliability:– 1 power failure,1 network outage per year per site

– Bathtub for occupancy

Page 63: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.63

11/12/2004

Google Performance: Serving

• How big is a page returned by Google? ~16KB

• Average bandwidth to serve searches

70,000,000/day x 16,750 B x 8 bits/B

24 x 60 x 60

=9,378,880 Mbits/86,400 secs

= 108 Mbit/s

Page 64: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.64

11/12/2004

Google Performance: Crawling

• How big is a text of a WWW page? ~4000B

• 1 Billion pages searched • Assume 7 days to crawl• Average bandwidth to crawl

1,000,000,000/pages x 4000 B x 8 bits/B

24 x 60 x 60 x 7

=32,000,000 Mbits/604,800 secs

= 59 Mbit/s

Page 65: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.65

11/12/2004

Google Performance: Replicating Index

• How big is Google index? ~5 TB • Assume 7 days to replicate to 2 sites,

implies BW to send + BW to receive• Average bandwidth to replicate new

index

2 x 2 x 5,000,000 MB x 8 bits/B

24 x 60 x 60 x 7

=160,000,000 Mbits/604,800 secs

= 260 Mbit/s

Page 66: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.66

11/12/2004

Colocation Sites• Allow scalable space, power, cooling and network

bandwidth plus provide physical security• charge about $500 to $750 per Mbit/sec/month

– if your continuous use measures 1- 2 Gbits/second

to $1500 to $2000 per Mbit/sec/month – if your continuous use measures 1-10 Mbits/second

• Rack space: costs $800 -$1200/month, and drops by 20% if > 75 to 100 racks (1 20 amp circuit)

– Each additional 20 amp circuit per rack costs another $200 to $400 per month

• PG&E: 12 megawatts of power, 100,000 sq. ft./building, 10 sq. ft./rack => 1000 watts/rack

Page 67: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.67

11/12/2004

Google Performance: Total

• Serving pages: 108 Mbit/sec/month• Crawling: 59 Mbit/sec/week, 15 Mbit/s/month• Replicating: 260 Mbit/sec/week, 65

Mb/s/month• Total: roughly 200 Mbit/sec/month• Google’s Collocation sites have OC48

(2488 Mbit/sec) link to Internet• Bandwidth cost per month?

~$150,000 to $200,000• 1/2 BW grows at 20%/month

Page 68: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.68

11/12/2004

Google Costs• Collocation costs: 40 racks @ $1000 per

month + $500 per month for extra circuits= ~$60,000 per site, * 3 sites~$180,000 for space• Machine costs:• Rack = $2k + 80 * $1500/pc + 2 * $1500/EN

= ~$125k• 40 racks + 2 Foundry switches @$100,000

= ~$5M• 3 sites = $15M• Cost today is $10,000 to $15,000 per TB

Page 69: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.69

11/12/2004

Comparing Storage Costs: 1/2001

• Google site, including 3200 processors and 0.8 TB of DRAM, 500 TB (40 racks)

$10k - $15k/ TB • Compaq Cluster with 192 processors,

0.2 TB of DRAM, 45 TB of SCSI Disks (17+ racks) $115k/TB (TPC-C)

• HP 9000 Superdome: 48 processors, 0.25 TB DRAM, 19 TB of SCSI disk =(23+ racks) $360k/TB (TPC-C)

Page 70: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.70

11/12/2004

Putting It All Together: Cell Phones

• 1999 280M handsets sold; 2001 500M

• Radio steps/components: Receive/transmit

– Antenna– Amplifier– Mixer– Filter– Demodulator– Decoder

Page 71: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.71

11/12/2004

Putting It All Together: Cell Phones

• about 10 chips in 2000, which should shrink, but likely separate MPU and DSP

• Emphasis on energy efficiency

Page 72: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.72

11/12/2004

Cell phone steps (protocol)

1. Find a cell• Scans full BW to find stronger signal every 7 secs

2. Local switching office registers call• records phone number, cell phone serial number,

assigns channel• sends special tone to phone, which cell acks if

correct• Cell times out after 5 sec if doesn't get supervisory

tone

3. Communicate at 9600 b/s digitally (modem)• Old style: message repeated 5 times• AMPS had 2 power levels depending on distance

(0.6W and 3W)

Page 73: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.73

11/12/2004

Frequency Division Multiple Access (FDMA)

• FDMA separates the spectrum into distinct voice channels by splitting it into uniform chunks of bandwidth

• 1st generation analog

Page 74: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.74

11/12/2004

Time Division Multiple Access (TDMA)

• a narrow band that is 30 kHz wide and 6.7 ms long is split time-wise into 3 time slots.

• Each conversation gets the radio for 1/3 of time.

• Possible because voice data converted to digital information is compressed so

• Therefore, TDMA has 3 times capacity of analog

• GSM implements TDMA in a somewhat different and incompatible way from US (IS-136); also encrypts the call

Page 75: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.75

11/12/2004

Code Division Multiple Access (CDMA)

•CDMA, after digitizing data, spreads it out over the entire bandwidth it has available.

•Multiple calls are overlaid over each other on the channel, with each assigned a unique sequence code.

•CDMA is a form of spread spectrum; All the users transmit in the same wide-band chunk of spectrum.

•Each user's signal is spread over the entire bandwidth by a unique spreading code. same unique code is used to recover the signal.

•GPS for time stamp. Between 8 and 10 separate calls

space as 1 analog call

Page 76: EE898 Lec 4.1 11/12/2004 EE898.02 Architecture of Digital Systems Lecture 4 Interconnection Networks and Clusters Prof. Seok-Bum Ko

EE898Lec 4.76

11/12/2004

Cell Phone Towers