Parallel Rendering Workshop – IEEE Vis 2004
Designing Graphics Clusters
Mike Houston, Stanford University

Page 1

Parallel Rendering Workshop – IEEE Vis 2004

Designing Graphics Clusters

Mike Houston, Stanford University

Page 2

Why clusters?

Commodity parts
– Complete graphics pipeline on a single chip
– Extremely fast product cycle

Flexibility
– Configurable building blocks

Cost
– Driven by consumer demand
– Economies of scale

Availability
– You can build (small) systems yourself
– A trip to a local computer shop can bring a node back up

Upgradeability
– Graphics
– Network
– Processor

Scalability
– CPUs
– Graphics
– Memory
– Disk

Page 3

Why not clusters?

App rewrites?
Debugging
Shared memory requirements
Massive I/O requirements
Software solutions
Support
Maintenance
Who’s going to build it?

Page 4

Design constraints

Power
Cooling
Density
Performance
Cost

These all conflict!!!

Page 5

Power/Cooling/Density

Power
– Graphics + processor = huge power draw
• Intel Nocona = 103W (load)
• Nvidia 6800 Ultra = 110W

Cooling
– Graphics + processor = lots of heat
– Fans
• Noise
• Reliability
– Liquid
• Immense courage

Density
– How tall?
– How deep?
– Cooling?
– Power?
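The per-component draws above can be turned into a rough node and cluster power budget. A minimal sketch, using the slide's CPU and GPU figures; the "rest of node" draw and the PSU efficiency are assumed ballparks, not from the talk:

```python
# Rough power budget. CPU and GPU draws are the slide's figures;
# REST_W and PSU_EFFICIENCY are assumed ballparks, not from the talk.
CPU_W = 103            # Intel Nocona under load (slide figure)
GPU_W = 110            # Nvidia 6800 Ultra (slide figure)
REST_W = 80            # assumed: board, memory, disk, fans
PSU_EFFICIENCY = 0.75  # assumed typical PSU efficiency of the era

def node_wall_power(cpus, gpus):
    """Estimated wall-socket draw (watts) for one node."""
    return (cpus * CPU_W + gpus * GPU_W + REST_W) / PSU_EFFICIENCY

def cluster_power(nodes, cpus=2, gpus=1):
    """Total wall draw for a cluster of identical nodes."""
    return nodes * node_wall_power(cpus, gpus)
```

Under these assumptions a dual-CPU, single-GPU node lands near 530 W at the wall, so a 16-node rack is roughly 8.5 kW, which is exactly why power, cooling, and density conflict.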

Page 6

Performance/Cost

Primary cluster use
– Graphics
• Spend on graphics
– Graphics + compute
• Balance choices, but graphics easier to upgrade
– Compute
• Spend on processors and memory

Bigger/Better/Faster => expensive
Buy at the “knee”

Page 7

Component choices

Processor
– Intel
– AMD
– PowerPC

Interconnect
– GigE
– Quadrics
– Infiniband
– Myrinet

Chipset
– Consumer
– Workstation
– Server

Graphics
– Vendor
• 3DLabs
• ATI
• Nvidia
– Market segment
• Consumer
• Workstation

Chassis
– Desktop case
– Rackmount
• Height (1/2/3/4/5U)
• Depth (Half/Full)

Page 8

What NOT to do!!!

Assume a Graphics Cluster is like a Compute Cluster
– Different cooling, power, performance constraints
– Different usage scenarios
– Different bus loads

Use “riser” boards
– Signal quality
– Cooling
– I’ve seen more issues with this than anything else!!!

Page 9

What NOT to do!!! continued

Purchase untested/new chipsets
– Performance oddities
– Stability problems
– Many people bitten by Intel i840/i860 chipsets’ AGP and PCI performance problems

Buy cheap components
– Failures
– Stability
– Saving $5 on a fan might cost you thousands in hardware failures, downtime from instability, and man-hours spent tracking down the problem

Page 10

What TO do

Get help
– Talk to others who have built these in academia/industry
– Work with a company that has built these
– Buy parts/whole thing from a company that has built these

Testing, Testing, Testing
– Pound on a few options before you choose, or copy a known working solution
– Processor performance/heat/cooling/stability
– Graphics performance/heat/cooling/stability
– Bus performance/stability
– Network performance/stability
– Temperature monitoring

Page 11

What TO do continued

Maintenance
– Clean filters
– Clean/check fans
– Run memory/processor tests
• Memtest86 (http://www.memtest.org)
• CPU Burn-in (http://users.bigpond.net.au/cpuburn)
– Check disks
• fsck often

Monitor your cluster
– Temperature fluctuations
– Node stability
– Check out Ganglia (http://ganglia.sourceforge.net)

Page 12

Sources of Bottlenecks

Sort-First
– Pack/unpack speed (processor)
– Primitive distribution (network and bus)
– Rendering (processor and graphics chip)

Sort-Last
– Rendering (graphics chip)
– Composite (network, bus, and read/draw pixels)
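The sort-last composite step named above can be sketched in a few lines: each node renders its share of the data to a full-resolution color + depth image, and the composite keeps, per pixel, the color from the node with the smallest depth. A minimal single-process sketch (function and data layout are illustrative, not from the talk; a real pipeline does this exchange over the network):

```python
# Sketch of sort-last depth compositing: for each pixel, keep the color
# from whichever node rendered the nearest (smallest-depth) fragment.
def depth_composite(colors, depths):
    """colors[node][y][x] = (r, g, b); depths[node][y][x] = float.
    Returns the composited image as out[y][x] = (r, g, b)."""
    nodes = len(colors)
    h, w = len(depths[0]), len(depths[0][0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nearest = min(range(nodes), key=lambda n: depths[n][y][x])
            out[y][x] = colors[nearest][y][x]
    return out
```

Reading the color and depth buffers back from the GPU to feed this step is itself a bottleneck, which is why the deck benchmarks readback separately.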

Page 13

Rendering and the network

Sort-last
– Usage patterns
• All-to-all
• Pair-swapping, all nodes
– Switch requirements
• Provide backplane to support all nodes sending
• Non-blocking

Sort-first
– Usage patterns
• 1 to N
• M to N
– Switch requirements
• Multicast/broadcast support
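The "pair-swapping, all nodes" pattern above can be sketched as a schedule: with n a power of two, in round r node i exchanges with node (i XOR r), so after n−1 rounds every pair has exchanged exactly once. Each round is a perfect matching, which is why a non-blocking backplane matters: every node is sending at once. (The XOR formulation is a standard construction, not something the slides spell out.)

```python
# Pairwise-exchange schedule: round r pairs node i with node (i XOR r).
# Every round keeps all nodes busy; over n-1 rounds each pair meets once.
def pairwise_schedule(n):
    """Return, per round, the list of (lower, higher) exchange pairs."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return [[(i, i ^ r) for i in range(n) if i < (i ^ r)]
            for r in range(1, n)]
```

For example, `pairwise_schedule(4)` gives three rounds of two pairs each, covering all six node pairs.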

Page 14

Network interconnects

GigE
– Bandwidth: ~90MB/s (large MTU)
– Latency: 50-100 usec
– Cost per port: <$100
– Get chips with a TOE (TCP offload engine)

Bonded GigE
– Works great for up to ~4 ports
– Gets expensive fast, especially for fully connected networks
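Figures like the ~90MB/s GigE number come from point-to-point throughput probes. A minimal sketch of such a probe (run over loopback here; on a cluster you would bind the sink on one node and connect from another — the chunk size and message count are arbitrary assumptions of this sketch):

```python
import socket
import threading
import time

CHUNK = 1 << 20   # 1 MB per send (assumed)
COUNT = 64        # 64 MB total (assumed)

def _sink(server_sock):
    # Accept one connection and drain it until the sender closes.
    conn, _ = server_sock.accept()
    with conn:
        while conn.recv(1 << 16):
            pass

def measure_bandwidth_mb_s(host="127.0.0.1"):
    """Time a bulk TCP transfer and return throughput in MB/s."""
    server = socket.socket()
    server.bind((host, 0))
    server.listen(1)
    t = threading.Thread(target=_sink, args=(server,))
    t.start()
    start = time.perf_counter()
    with socket.create_connection(server.getsockname()) as c:
        for _ in range(COUNT):
            c.sendall(b"\0" * CHUNK)
    t.join()  # wait until the sink has drained everything
    elapsed = time.perf_counter() - start
    server.close()
    return COUNT * CHUNK / elapsed / 1e6
```

Loopback numbers will be far above wire speed; the point is the measurement shape, not the value.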

Page 15

High speed interconnects

Quadrics
– Bandwidth: 876MB/s (Elan4)
– Latency: 3usec
– Cost: $1,866 per port

Myrinet
– Bandwidth: 500MB/s
– Latency: 3.5usec
– Cost: $1,600 per port

Infiniband 4X
– Bandwidth: 1450MB/s (PCIe) / 850MB/s (PCI-X)
– Latency: 4usec / 8usec
– Cost: <$1,000 per port

Page 16

Graphics readback (AGP)

Readback Performance

[Chart: bandwidth (MB/s, 0–1000) vs. image size (pixels, 0–1,048,576); series: R360 RGBA, R360 BGR, R360 Depth, NV40 RGBA, NV40 BGR, NV40 Depth]

Page 17

PCIe

The promise
– Graphics readback performance
• Easier implementation
• ~2GB/s
– Network performance
– Unified standard for graphics, network, I/O

Problems
– Limited number of slots
• 1 x16 + 1 x8, or several x4
– Stability/Performance
• Early implementations have “problems”

Page 18

Case Study - Stanford’s SPIRE

16 node cluster
32 2.4GHz P4 Xeons
E7505 chipset (Supermicro)
16GB DDR
1.2TB IDE storage
Mellanox Cougar HCA
Mellanox 16 port InfiniScale switch
D-Link 24-port GigE switch
ATI 9800 Pro 256MB (AGP)
Linux – Fedora Core 2

http://spire.stanford.edu

Page 19

Inside a node

Page 20

Bottleneck Evaluation – SPIRE

[Chart: throughput (0–1200) by stage: Read/Draw, Network, Bus, Graphics, Processor]

Page 21

SPIRE vs. Chromium

[Chart: throughput (0–1200) by stage: Read/Draw, Network, Bus, Graphics, Processor]

Page 22

Compositing Performance (GigE)

RGBA (1024x1024 image across 16 nodes)
– Software: 9.5fps, 152 MPixel/sec
– Hardware: 17fps, 269 MPixel/sec

Depth (RGB + Z) (1024x1024 image across 16 nodes)
– Software: 3.8fps, 60 MPixel/sec
– Hardware: 7.2fps, 116 MPixel/sec

Page 23

Compositing Performance (Infiniband)

RGBA (1024x1024 image across 16 nodes)
– Software: 14fps, 224 MPixel/sec
– Hardware: 45fps, 720 MPixel/sec

Depth (RGB + Z) (1024x1024 image across 16 nodes)
– Software: 6fps, 96 MPixel/sec
– Hardware: 11fps, 176 MPixel/sec
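The aggregate rates on the two compositing slides are consistent with a simple relation: MPixel/sec ≈ fps × 16 nodes, counting each 1024x1024 image as one megapixel. This formula is an inference from the stated numbers, not something the slides spell out:

```python
# Inferred relation behind the compositing figures:
# aggregate rate = fps x 16 nodes x ~1 MPixel per 1024x1024 image.
NODES = 16
MPIXELS_PER_IMAGE = 1.0  # 1024x1024 counted as one megapixel

def aggregate_rate(fps):
    """Aggregate compositing rate in MPixel/sec for a given frame rate."""
    return fps * NODES * MPIXELS_PER_IMAGE
```

For example, 9.5 fps software RGBA over GigE gives 152 MPixel/sec, and 45 fps hardware RGBA over Infiniband gives 720 MPixel/sec, matching the slides (a few entries differ slightly due to rounding).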

Page 24

SPIRE

Performance
– 8.2 GVox/s (1024³ @ 8Hz)
– Quake3 @ 5120x4096: 90fps
– 790MB/s node to node
– 10.2GB/s cross sectional bandwidth

Hardware failures
– 3 Western Digital drives
– 4 fans (1 rack, 2 chassis, 1 GPU)
– 1 CPU

Page 25

Hardware Suggestions

Page 26

Low end systems

Intel based
– Intel P4
– Intel 925X
• PCIe
• GigE onboard
– 1GB DDR
– Nvidia 6800GT

AMD based
– Athlon 64
– Nforce3 250Gb
• AGP
• GigE onboard
– 1GB DDR
– Nvidia 6800GT

Characteristics: single processor, GigE
Cost: <$2,000

Page 27

Mid range systems

Intel based
– Intel Xeon
– Intel E7525
• PCIe
• Dual GigE onboard
– 4GB DDR
– Nvidia 6800 Ultra

AMD based
– Dual Opteron
– AMD 85XX
• AGP
• Dual GigE onboard
– 4GB DDR
– Nvidia 6800 Ultra

Characteristics: dual processor, GigE
Cost: <$5,000

Page 28

High end systems

Intel based
– Intel Xeon
– Intel E7525
• PCIe
• Dual GigE onboard
– 16GB DDR
– Nvidia 6800 Ultra / Quadro 4400/G / SLI
– Infiniband 4X / Quadrics
• Single or multirail

AMD based
– Dual Opteron
– AMD 85XX
• AGP
• Dual GigE onboard
– 16GB DDR
– Nvidia 6800 Ultra / Quadro 4400/G / SLI
– Infiniband 4X / Quadrics
• Single or multirail

Characteristics: dual processor, high speed interconnect
Cost: ~$10k => arm/leg/first born

Page 29

What would I build next?

Intel
– Dual Nocona
– E7525 (Tumwater)
– Infiniband 4X (PCIe)
– Nvidia NV4X

Good things
– Known solution
– Currently available

Bad things
– Heat
– PCIe issues

AMD
– Dual/Quad Opteron
– PCIe chipset?
• AMD
• Nvidia Nforce4
– Nvidia NV4X

Good things
– Heat
– Dual core chips for upgrade
– Multiple PCIe x16 slots

Bad things
– Unknown performance/stability
– Not available just yet

Page 30

Example E7525 prototype

Courtesy GraphStream

Page 31

List of graphics cluster companies

ABBA
GraphStream
HP
IBM
Orad

Page 32

Questions?

Page 33

Supplemental

Page 34

Chromium

The Chromium Cluster

Page 35

Chromium Cluster Configuration

Cluster: 32 graphics nodes + 1 server node
Computer: Compaq SP750
– 2 processors (800 MHz PIII Xeon, 133MHz FSB)
– i840 core logic
• Simultaneous “fast” graphics and networking
• Graphics: AGP-4x
– 256 MB memory
– 18GB SCSI 160 disk (+ 3*36GB on servers)

Graphics
– 16 NVIDIA GeForce3 w/ DVI (64 MB)
– 16 NVIDIA GeForce4 Ti4200 w/ DVI (128 MB)

Network
– Myrinet 64-bit, 66 MHz (LANai 7)

Page 36

Sort-First Performance

Configuration
– Application runs on the client
– Primitives distributed to servers

Tiled Display
– 4x3 @ 1024x768
– Total resolution: 4096x2304, 9 Megapixel

Quake 3
– 50 fps

Page 37

Sort-Last Performance

Configuration
– Parallel rendering on multiple nodes
– Composite to final display node

Volume Rendering on 16 nodes
– 1.57 GVox/s [Humphreys 02]
– 1.82 GVox/s (tuned) 9/02
– 256x256x1024 volume¹ rendered twice

¹Data courtesy of G.A. Johnson, G.P. Cofer, S.L. Gewalt, and L.W. Hedlund from the Duke Center for In Vivo Microscopy (an NIH/NCRR National Resource)
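The GVox/s figures above relate to frame rate directly: since the 256x256x1024 volume is rendered twice per frame, each frame touches 2 × 256 × 256 × 1024 voxels. A quick conversion (the arithmetic is an inference from the stated numbers, not on the slide):

```python
# Voxels touched per frame: the 256x256x1024 volume, rendered twice.
VOXELS_PER_FRAME = 2 * 256 * 256 * 1024  # 134,217,728

def fps_from_gvox(gvox_per_sec):
    """Convert an aggregate voxel rate (GVox/s) to frames per second."""
    return gvox_per_sec * 1e9 / VOXELS_PER_FRAME
```

So 1.57 GVox/s works out to about 11.7 frames per second, and the tuned 1.82 GVox/s to about 13.6.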

Page 38

Bottleneck Evaluation – Chromium

[Chart: throughput (0–1200) by stage: Read/Draw, Network, Bus, Graphics, Processor]

Page 39

Stanford’s SPIRE

Page 40

Cluster Configuration

16 node cluster, 3U half depth
2.4GHz P4 Xeon (dual)
Intel E7505 chipset
1GB DDR (up to 4GB)
ATI Radeon 9800 Pro 256MB
Infiniband + GigE
– Mellanox Cougar HCA
– Mellanox 6 chip 16-port switch
80 GB IDE
Built by GraphStream
– We already built one and knew better...
– Someone else to support hardware failures

Page 41

Inside a node

Page 42

Bleeding edge is painful

Infiniband
– 5 months to get IB working and MPI running
• Driver and firmware issues
– 3 months to get Chromium VAPI layer running
• No documentation or support
– 1 month to get SDP layer working
• Driver issues

Linux 2.6 kernel
– Much better I/O performance
– Graphics hardware driver issues
• 4K stack change
• Register argument passing (REGPARM)
– Preemptible kernels break many drivers

Page 43

Bottleneck Evaluation – SPIRE

[Chart: throughput (0–1200) by stage: Read/Draw, Network, Bus, Graphics, Processor]

Page 44

SPIRE vs. Chromium

[Chart: throughput (0–1200) by stage: Read/Draw, Network, Bus, Graphics, Processor]

Page 45

A deeper look into Infiniband

Point to Point
1 to N
All to All

Linux kernel 2.6.8
MVAPICH 0.9.2
OpenIB gen1 revision 279
Mellanox Cougar HCAs (PCI-X)
Mellanox 16 port InfiniScale switch (6-chip)

Page 46

Point to Point

Bandwidth

[Chart: bandwidth (MB/s, 0–800) vs. message size (bytes, 0–1,048,576)]

Page 47

Point to Point

Latency

[Chart: latency (usec, 0–1800) vs. message size (bytes, 1,000–1,000,000)]

Page 48

One to Many

[Chart: one-to-many bandwidth (MB/s, 740–765) vs. number of nodes (2–16)]

Page 49

All to All

[Chart: all-to-all bandwidth (MB/s, 640–720) vs. number of nodes (2–16)]

Page 50

Network Summary

Peak bandwidth (point to point)
– 755MB/s
– 256KB message size

Cross sectional bandwidth
– 10.8 GB/s

Latency
– 12usec for small messages (we’ve seen as good as 6usec)
– Linear scaling for larger messages

What’s causing the falloff?
– Driver issues
– Kernel issues
– Firmware issues

Page 51

OpenIB – The Good

Open source
Lots of users
Gen1 performance on par with Mellanox stack
Gen2 looking good
MPI works great with gen1! (OSU MVAPICH)
SDP
– Fast porting of TCP based code
Vendor “independent”
Aiming for Linux kernel inclusion

Page 52

OpenIB – The Bad

It’s yet to be generally stable/usable
OpenSM
– Too hard to use
– Make sure to get a switch with an SM
Connection management
CPU overhead
Aiming for Linux kernel inclusion
VAPI
– No documentation
– EVAPI vs. VAPI
– Multiple connection issues

Page 53

OpenIB – And The Ugly…

Vendor agendas
– Topspin vs. Voltaire vs. Infinicon
– Lots of bickering

Roadmap
– Too many tangents
– Doesn’t match up with distributions
• RHEL?
• FC3?

Why can’t we just get the basics working before we move on?!?!
– Gen2 will hopefully do this
– SDP missing…