Page 1: (PowerPoint format)

Copyright International Business Machines Corporation, 2003

Server I/O Networks: Past, Present, and Future

Renato Recio, Distinguished Engineer, Chief Architect, IBM eServer I/O

Page 2: (PowerPoint format)


Legal Notices

All statements regarding future direction and intent for IBM, the InfiniBand Trade Association, the RDMA Consortium, or any other standards organization mentioned are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM Branch Office or IBM Authorized Reseller for the full text of a specific Statement of General Direction.

IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.

The information contained in this presentation has not been submitted to any formal IBM test and is distributed as is. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, PowerPC, RS/6000, SP, S/390, AS/400, zSeries, iSeries, pSeries, xSeries, and Remote I/O.

UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Limited. Ethernet is a registered trademark of Xerox Corporation. TPC-C, TPC-D, and TPC-H are trademarks of the Transaction Processing Performance Council. InfiniBand is a trademark of the InfiniBand Trade Association. Other product or company names mentioned herein may be trademarks or registered trademarks of their respective companies or organizations.

Page 3: (PowerPoint format)


In Other Words… Regarding Industry Trends and Directions

IBM respects the copyrights and trademarks of other companies. These slides represent my views only; they do not imply the views or directions of IBM, nor of the InfiniBand Trade Association, the RDMA Consortium, the PCI-SIG, or any other standards group.

Page 4: (PowerPoint format)


Agenda

- Server I/O network types: requirements and contenders
- Server I/O: I/O attachment and I/O networks (PCI family and InfiniBand)
- Network stack offload: hardware, OS, and application considerations
- Local Area Networks
- Cluster Area Networks: InfiniBand and Ethernet
- Storage Area Networks: FC and Ethernet
- Summary

Page 5: (PowerPoint format)


Purpose of Server I/O Networks

[Figure: two servers, each with microprocessors and caches behind a memory controller. Each memory controller feeds an I/O expansion network of bridges, switches, and virtual adapters; the virtual adapters attach to a Storage Area Network, a Cluster Network, a Local Area Network, and local I/O attachments.]

Server I/O networks are used to connect devices and other servers.

Page 6: (PowerPoint format)


Server I/O Network Requirements

In the past, servers have placed the following requirements on I/O networks:
- Standardization, so many different vendors' products can be connected;
- Performance (scalable throughput and bandwidth, and low latency/overhead);
- High availability, so connectivity is maintained despite failures;
- Continuous operations, so changes can occur without disrupting availability;
- Connectivity, so many units can be connected;
- Distance, both to support scaling and to enable disaster recovery; and
- Low total cost, which interacts strongly with standardization through volumes and also depends on the amount of infrastructure build-up required.

More recently, servers have added the following requirements:
- Virtualization of hosts, fabrics, and devices;
- Service differentiation (including QoS), to manage fabric utilization peaks; and
- Adequate security, particularly in multi-host (farm or cluster) situations.

Page 7: (PowerPoint format)


Server I/O Network History

Historically, no single technology satisfied all the above requirements, so many types of fabrics proliferated:
- Local Area Networks
- Cluster Networks (a.k.a. HPCN, CAN)
- Storage Area Networks
- I/O Expansion Networks, etc.

…and many link solutions proliferated:
- Standard: FC for SAN, Ethernet for LAN
- Proprietary: a handful of IOENs (IBM's RIO, HP's remote I/O, SGI's XIO, etc.) and a handful of CANs (IBM's Colony, Myricom's Myrinet, Quadrics, etc.)

Consolidation solutions are now emerging, but the winner is uncertain:
- PCI family: IOA and IOEN
- Ethernet: LAN, SAN, CAN
- InfiniBand: CAN, IOEN, and possibly higher-end IOA/SAN

Page 8: (PowerPoint format)


Recent Server I/O Network Evolution Timeline

Proprietary fabrics (e.g., IBM channels, IBM RIO, IBM STI, IBM Colony, SGI XIO, Tandem/Compaq/HP ServerNet) predate and run alongside the standards below.

InfiniBand track:
- Rattner pitch (2/98)
- NGIO goes public (11/98); FIO goes public (2/99)
- NGIO spec available (7/99); FIO spec available (9/99)
- FIO and NGIO merge into IB (9/99)
- InfiniBand spec releases: 1.0 (10/00), 1.0.a (6/01), 1.1 (11/02)

PCI track:
- PCI-X 1.0 spec available (9/99)
- 3GIO described at IDF (11/01)
- PCI-X 2.0 announced (2/02)
- PCI-Express 1.0 spec (7/02); PCI-X 2.0 DDR/QDR spec (7/02)
- Advanced Switching (AS) 1.0 spec (2003)

RDMA over IP track:
- RDMA over IP work begins (6/00)
- 53rd IETF: ROI BOF calls for an IETF ROI WG (12/01)
- 54th IETF: RDDP WG chartered (3/02)
- RDMAC announced (5/02)
- RDMA, DDP, MPA 1.0 specs (10/02)
- Verbs, SDP, iSER, … 1.0 specs (4/03)
- Extensions (12/03)

Page 9: (PowerPoint format)


PCI

The PCI standard's strategy is to add evolutionary technology enhancements to the standard that maintain the existing PCI eco-system.

Within the standard, two contenders are vying for I/O attachment (IOA) market share:
- PCI-X: 1.0 is shipping now; 2.0 is next and targets the 10 Gb networking generation.
- PCI-Express: maintains the existing PCI software/firmware programming model, and adds new protocol layers, a new physical layer, and associated connectors.

PCI-Express can also be used as an IOEN, but it does not satisfy all enterprise-class requirements:
- Enterprise-class RAS is optional (e.g., multipathing);
- Fabric virtualization is missing;
- A more efficient I/O communication model is missing; …

PCI-Express will likely be extended to support:
- Faster link speeds;
- Mandatory enterprise-class RAS.

Page 10: (PowerPoint format)


I/O Attachment Comparison: PCI-X (1.0, 2.0) vs. PCI-Express (1.0, 2.0)

PCI-X (1.0, 2.0):
- Effective link widths: parallel, 32 and 64 bit
- Effective link frequencies: 33, 66, 100, 133, 266, 533 MHz
- Bandwidth range: 132 MB/s to 4.17 GB/s
- Connectivity and distance: multi-drop bus or point-to-point; chip-to-chip and card-to-card connectors
- Unscheduled outage protection: interface checks, parity, ECC; no redundant paths
- Scheduled outage protection: hot-plug and dynamic discovery
- Service level agreement: N/A
- Virtualization: host virtualization performed by the host; no network virtualization; no standard I/O virtualization mechanism
- Cost: infrastructure build-up is a delta to existing PCI chips; no fabric consolidation potential

PCI-Express (1.0, 2.0):
- Effective link widths: serial, 1x, 4x, 8x, 16x
- Effective link frequency: 2.5 GHz, moving to 5 or 6.25 GHz
- Bandwidth range: 250 MB/s to 4 GB/s
- Connectivity and distance: memory-mapped switched fabric; chip-to-chip, card-to-card connectors, and cable
- Unscheduled outage protection: interface checks, CRC; no redundant paths
- Scheduled outage protection: hot-plug and dynamic discovery
- Service level agreement: traffic classes, virtual channels
- Virtualization: host virtualization performed by the host; no network virtualization; no standard I/O virtualization mechanism
- Cost: new chip core (macro); fabric consolidation potential covers IOEN and I/O attachment
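As a sanity check on the serial bandwidth figures above (and on the IB figures in the later comparison), the per-lane numbers follow directly from 2.5 GHz signaling with 8b/10b encoding. The short C sketch below is illustrative only and is not part of the original deck.

    #include <stdio.h>

    /* Effective bandwidth of an 8b/10b-encoded serial lane:
     * signal rate * 0.8 payload bits per signaled bit / 8 bits per byte. */
    static double lane_MBps(double signal_rate_gbps)
    {
        return signal_rate_gbps * 1e3 * 0.8 / 8.0;   /* MB/s */
    }

    int main(void)
    {
        printf("PCI-Express x1 : %4.0f MB/s\n", lane_MBps(2.5) * 1);   /* 250 MB/s  */
        printf("PCI-Express x16: %4.0f MB/s\n", lane_MBps(2.5) * 16);  /* 4000 MB/s */
        printf("InfiniBand 1x  : %4.0f MB/s\n", lane_MBps(2.5) * 1);   /* 250 MB/s  */
        printf("InfiniBand 12x : %4.0f MB/s\n", lane_MBps(2.5) * 12);  /* 3000 MB/s */
        return 0;
    }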

Page 11: (PowerPoint format)


InfiniBand

IB's strategy: provide a new, very efficient I/O communication model that satisfies enterprise server requirements and can be used for I/O, cluster, and storage.

IB's model enables middleware to communicate across a low-latency, high-bandwidth fabric through message queues that can be accessed directly from user space. But it required a completely new infrastructure (management, software, endpoint hardware, fabric switches, and links).

The I/O adapter industry viewed IB's model as too complex. So I/O adapter vendors are staying on PCI, and IB may be used to attach high-end I/O to enterprise-class servers.

Given the current I/O attachment reality, enterprise-class vendors will likely:
- Continue extending their proprietary fabric(s), or
- Tunnel PCI traffic through IB and provide IB-to-PCI bridges.

Page 12: (PowerPoint format)


I/O Expansion Network Comparison: PCI-Express vs. IB

PCI-Express:
- Link widths: serial, 1x, 4x, 8x, 16x
- Link frequency: 2.5 GHz
- Bandwidth range: 250 MB/s to 4 GB/s
- Latency model: PIO-based synchronous operations (network traversal for PIO reads)
- Connectivity: memory-mapped switched fabric
- Distance: chip-to-chip, card-to-card connector, cable
- Topology: single host, rooted tree
- Unscheduled outage protection: interface checks, CRC; no native memory access controls; no redundant paths
- Scheduled outage protection: hot-plug and dynamic discovery
- Service level agreement: traffic classes, virtual channels

IB:
- Link widths: serial, 1x, 4x, 12x
- Link frequency: 2.5 GHz
- Bandwidth range: 250 MB/s to 3 GB/s
- Latency model: native, message-based asynchronous operations (Send and RDMA); tunneled PCI uses PIO-based synchronous operations
- Connectivity: identifier-based switched fabric
- Distance: chip-to-chip, card-to-card connector, cable
- Topology: multi-host, general
- Unscheduled outage protection: interface checks, CRC; memory access controls; redundant paths
- Scheduled outage protection: hot-plug and dynamic discovery
- Service level agreement: service levels, virtual lanes

Page 13: (PowerPoint format)


I/O Expansion Network Comparison (continued): PCI-Express vs. IB

PCI-Express:
- Infrastructure build-up: new chip core (macro)
- Fabric consolidation potential: IOEN and I/O attachment
- Host virtualization: performed by the host
- Network virtualization: none
- I/O virtualization: no standard mechanism
- Next steps: higher-frequency links (5 or 6.25 GHz, work in process); mandatory interface checks and CRC

IB:
- Infrastructure build-up: new infrastructure
- Fabric consolidation potential: IOEN, CAN, and high-end I/O attachment
- Host virtualization: standard mechanisms available
- Network virtualization: end-point partitioning
- I/O virtualization: standard mechanisms available
- Next steps: higher-frequency links (5 or 6.25 GHz, work in process); verb enhancements

Page 14: (PowerPoint format)


Server Scale-up Topology Options

[Figure: three SMP sub-nodes, each with processors, caches, memory, and a memory controller. One sub-node attaches its adapters through a PCI-Express IOEN (a switch feeding PCI-X and PCI-Express bridges); the others attach adapters through an IB or proprietary IOEN with PCI tunneling. PCI-Express spans a single SMP only; for large SMPs, a memory fabric must be used to access I/O that is not local to an SMP sub-node.]

Key PCI-Express IOEN value proposition:
- Bandwidth scaling
- Short-distance remote I/O
- Proprietary-based virtualization
- QoS (8 traffic classes, virtual channels)
- Low infrastructure build-up
- Evolutionary compatibility with PCI

Key IB IOEN value proposition:
- Bandwidth scaling
- Long-distance remote I/O
- Native, standards-based virtualization
- Multipathing for performance and HA
- QoS (16 service levels, virtual lanes)
- CAN and IOEN convergence

Page 15: (PowerPoint format)


Server IOA Outlook

Server I/O Attachment
- Next steps in the PCI family roadmap: 2003-05, PCI-X 2.0 DDR and QDR; 2005, PCI-Express.
- Key drivers for PCI-Express: AGP replacement on clients (16x); CPU chipset attach on clients and servers (8x or 16x).
- IB as an IOA: complexity and eco-system issues will limit IB to a small portion of high-end I/O attachment.

Server I/O Expansion
- Options for scale-up servers: migrate to IB and tunnel PCI I/O through it; continue upgrading proprietary IOENs; or migrate to PCI-Express.
- SHV servers will likely pursue PCI-Express, which satisfies low-end requirements but not all enterprise-class requirements.

[Charts: I/O attachment and I/O expansion network bandwidth (GB/s, log scale) from 1994 to 2009. I/O attachment progression: MCA, PCI/PCI-X, PCI-Express. I/O expansion networks: PCI-Express (8x/16x), IB (12x), ServerNet, SGI XIO, IBM RIO/STI, HP.]

Page 16: (PowerPoint format)


Problems with Sockets over TCP/IP

Network-intensive applications consume a large percentage of the CPU cycles:
- Small 1 KB transfers spend 40% of the time in TCP/IP and 18% in copy/buffer management.
- Large 64 KB transfers spend 25% of the time in TCP/IP and 49% in copy/buffer management.

Network stack processing consumes a significant amount of the available server memory bandwidth (3x the link rate on receives).
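The 3x figure follows from a simple accounting of memory traffic on the receive path of a conventional NIC; the sketch below is my reading of that accounting, offered as an illustration rather than a calculation taken from the deck.

    #include <stdio.h>

    /* For every byte received on the wire by a standard (non-offload) NIC:
     *   1. the NIC DMAs the byte into a kernel buffer        -> 1 memory write
     *   2. the protocol-stack copy reads the kernel buffer   -> 1 memory read
     *   3. the copy writes the byte into the user buffer     -> 1 memory write
     * so receive-side memory bandwidth is roughly 3x the link rate. */
    static double rx_memory_bw_MBps(double link_MBps)
    {
        return 3.0 * link_MBps;
    }

    int main(void)
    {
        printf("1 Gb/s Ethernet (~125 MB/s) needs ~%.0f MB/s of memory bandwidth\n",
               rx_memory_bw_MBps(125.0));
        return 0;
    }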

[Charts: CPU utilization breakdown for the no-offload 1 KB and 64 KB cases (copy/data management, TCP/IP, NIC interrupt processing, socket library, and the remainder available for the application), and the receive-side ratio of server memory bandwidth to link bandwidth (roughly 3x) for a standard NIC.]

* Note: the 1 KB and 64 KB figures are based on Erich Nahum's Tuxedo-on-Linux runs (1 KB and 64 KB files, 512 clients), with 0.5 instructions per byte added for the copy.

Page 17: (PowerPoint format)


Network Offload – Basic Mechanisms

Successful network stack offload requires five basic mechanisms (a code sketch follows the list):
1. Direct user-space access to a send/receive Queue Pair (QP) on the offload adapter, which allows middleware to send and receive data directly through the adapter.
2. Registration of virtual-to-physical address translations with the offload adapter, which allows the adapter hardware to access user-space memory directly.
3. Access controls between registered memory resources and work queues, which allow privileged code to associate adapter resources (memory registrations, QPs, and Shared Receive Queues) with a combination of OS image, process, and, if desired, thread.
4. Remote direct data placement (a.k.a. Remote Direct Memory Access, RDMA), which allows the adapter to place incoming data directly into a user-space buffer.
5. Efficient implementation of the offloaded network stack; otherwise offload may not yield the desired performance benefits.
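As a concrete (if anachronistic) illustration of mechanisms 1 through 4, the sketch below uses the modern OpenFabrics libibverbs API, which exposes the same queue-pair, memory-registration, and protection-domain concepts. It is not the API discussed in the deck, and error handling is omitted.

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Minimal setup of the offload resources described above (sketch only). */
    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx  = ibv_open_device(devs[0]);

        /* Mechanism 3: a protection domain scopes which QPs may use which
         * memory registrations. */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Mechanism 2: register a user-space buffer so the adapter can
         * translate and access it directly; REMOTE_WRITE also enables
         * mechanism 4 (incoming RDMA writes land straight in this buffer). */
        void *buf = malloc(64 * 1024);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 64 * 1024,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

        /* Mechanism 1: a queue pair (send/receive work queues) that user-space
         * code can post to directly, plus a completion queue for results. */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
        struct ibv_qp_init_attr qpa = {
            .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

        /* ... connect the QP and exchange mr->rkey with the peer ... */

        ibv_destroy_qp(qp);    ibv_destroy_cq(cq);
        ibv_dereg_mr(mr);      ibv_dealloc_pd(pd);
        ibv_close_device(ctx); ibv_free_device_list(devs);
        free(buf);
        return 0;
    }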

Page 18: (PowerPoint format)


Network Stack Offload – InfiniBand Host Channel Adapter Overview

Verb consumer: software that uses the HCA to communicate with other nodes. Communication is through verbs, which:
- Manage connection state.
- Manage memory and queue access.
- Submit work to the HCA.
- Retrieve work completions and events from the HCA.

The Channel Interface (CI) performs work on behalf of the consumer. The CI consists of:
- Driver: performs privileged functions.
- Library: performs user-space functions.
- HCA: the hardware adapter.

SQ – Send Queue, RQ – Receive Queue, SRQ – Shared Receive Queue, QP – Queue Pair (SQ + RQ), CQ – Completion Queue.

[Figure: the verb consumer sits above the HCA driver/library (the CI) and issues verbs; the HCA contains a data engine layer with SQ, RQ, SRQ, CQ, and asynchronous event (AE) handling, QP context (QPC), a memory translation and protection table (TPT), and the IB transport and network layers.]
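To make "submit work" and "retrieve completions" concrete, here is a minimal post-receive and completion-polling routine, again written against the modern libibverbs API rather than the IBTA verbs of 2003; the qp, cq, mr, and buf arguments are assumed to have been set up as in the earlier sketch.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: post one receive work request, then poll the completion queue. */
    void post_and_poll(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
        };
        struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad = NULL;

        ibv_post_recv(qp, &wr, &bad);      /* submit work to the adapter */

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                              /* spin until a completion arrives */

        if (wc.status == IBV_WC_SUCCESS)
            printf("received %u bytes into the registered buffer\n", wc.byte_len);
    }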

Page 19: (PowerPoint format)


Network Stack Offload – iONICs

iONIC: an internet Offload Network Interface Controller. It supports one or more internet protocol suite offload services.

RNIC: an RDMA-enabled NIC, i.e., an iONIC that supports the RDMA service.

IP suite offload services include, but are not limited to:
- TCP/IP Offload Engine (TOE) service
- Remote Direct Memory Access (RDMA) service
- iSCSI service
- iSCSI Extensions for RDMA (iSER) service
- IPsec service

[Figure: a host sockets application reaching the network three ways: sockets over the plain Ethernet link service (host TCP/IP and NIC driver), sockets over the TOE service (TOE driver and library, with TCP/IP offloaded to the iONIC), and sockets over the RDMA service (RNIC driver and library, with RDMA/DDP/MPA and TCP/IP offloaded to the iONIC). Only the Ethernet link, TOE, and RDMA services are shown.]
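One point worth making explicit: the appeal of the TOE service is that the sockets API itself does not change; code like the fragment below runs unmodified whether segmentation, checksums, and retransmission happen in the host stack or in the iONIC. This is generic POSIX sockets code, not an API from the deck, and the address and port are placeholders.

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Ordinary blocking TCP client; a TOE service offloads the transport work
     * underneath this unchanged API. */
    int send_block(const char *ip, uint16_t port, const void *data, size_t len)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            close(fd);
            return -1;
        }

        ssize_t n = send(fd, data, len, 0);   /* copies/DMA handled below the API */
        close(fd);
        return n < 0 ? -1 : 0;
    }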

Page 20: (PowerPoint format)


Network Stack Offload – iONIC RDMA Service Overview

Verb consumer: software that uses the RDMA service to communicate with other nodes. Communication is through verbs, which:
- Manage connection state.
- Manage memory and queue access.
- Submit work to the iONIC.
- Retrieve work completions and events from the iONIC.

The RDMA Service Interface (RI) performs work on behalf of the consumer. The RI consists of:
- Driver: performs privileged functions.
- Library: performs user-space functions.
- RNIC: the hardware adapter.

SQ – Send Queue, RQ – Receive Queue, SRQ – Shared Receive Queue, QP – Queue Pair (SQ + RQ), CQ – Completion Queue.

[Figure: the verb consumer sits above the RNIC driver/library (the RI) and issues verbs; the iONIC RDMA service contains a data engine layer with SQ, RQ, SRQ, CQ, and asynchronous event handling, QP context (QPC), a memory translation and protection table (TPT), and the RDMA/DDP/MPA/TCP/IP layers.]
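The defining RNIC operation is the RDMA Write, which places data directly into a remote buffer identified by an address and steering tag (rkey/STag) that the peer advertised at connection setup. The sketch below again uses modern libibverbs, which today covers both InfiniBand HCAs and iWARP RNICs; it is illustrative, not the RDMAC verbs API itself.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Sketch: post one signaled RDMA Write of a locally registered buffer into
     * a remote buffer the peer described by (remote_addr, rkey). */
    int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len,
                   uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 42,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,       /* generate a completion */
        };
        wr.wr.rdma.remote_addr = remote_addr;      /* where to place the data  */
        wr.wr.rdma.rkey        = rkey;             /* remote access key (STag) */

        struct ibv_send_wr *bad = NULL;
        return ibv_post_send(qp, &wr, &bad);       /* 0 on success */
    }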

Page 21: (PowerPoint format)


Network I/O Transaction Efficiency

The graph shows CPU instructions per byte for a complete transaction (a send and receive pair): a send and receive for TOE, and a combined send plus RDMA Write and receive for RDMA.

[Chart: CPU instructions per byte (log scale) versus transfer size from 1 byte to 100,000 bytes, for the Ethernet link service (ELS), two TOE implementations (TOE 1, TOE 2), SDP, and RDMA.]

Page 22: (PowerPoint format)


Network Offload Benefits – Middleware View

Multi-tier server environment: the benefit of network stack offload (IB or iONIC) depends on the ratio of application/middleware (App) instructions to network stack instructions.

[Figure: a multi-tier environment with a client tier (browser user), a presentation server (web server, presentation data), a web application server (DB client and replication, application data), and a business function server (OLTP and BI databases, HPC, business data). All tiers are logical; they can potentially run on the same server OS instance(s). The business function tier traditionally uses a cluster network.]

- NC is not useful at present where XML and Java overheads dominate.
- Sockets-level NC support is beneficial (5 to 6% performance gain for communication between the App tier and the business function tier; 0 to 90% gain for communication between the browser and the web server).
- Low-level (uDAPL, ICSC) support is most beneficial (4 to 50% performance gain for the business function tier).
- iSCSI and DAFS support is beneficial (5 to 50% gain for NFS/RDMA compared to NFS performance).
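A simple way to reason about the tier-by-tier gains quoted above is an Amdahl-style model: if a fraction f of a tier's CPU work is network-stack processing and offload eliminates a share s of that work, the tier speeds up by 1/(1 - f*s). The numbers below are illustrative assumptions, not measurements from the deck.

    #include <stdio.h>

    /* Amdahl-style estimate of the benefit of network stack offload. */
    static double offload_speedup(double stack_fraction, double share_removed)
    {
        return 1.0 / (1.0 - stack_fraction * share_removed);
    }

    int main(void)
    {
        /* e.g. ~40% of cycles in TCP/IP (small transfers), 75% of that removed */
        printf("speedup: %.2fx\n", offload_speedup(0.40, 0.75));  /* ~1.43x */
        /* a tier dominated by application/middleware work sees little gain */
        printf("speedup: %.2fx\n", offload_speedup(0.05, 0.75));  /* ~1.04x */
        return 0;
    }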

Page 23: (PowerPoint format)


TCP/IP/Ethernet Are King of LANs

Ethernet is standard and widely deployed as a LAN:
- Long-distance links (from card-to-card up to 40 km).
- High availability through session-, adapter-, or port-level switchover.
- Dynamic congestion management when combined with IP transports.
- Scalable security levels.
- Sufficient performance for the LAN, good-enough performance for many clusters, and high performance when combined with TCP offload.

The strategy is to extend Ethernet's role in wide area, cluster, and storage networks through a combination of:
- Faster link speed (10 Gb/s) at competitive cost: no additional cost for copper (XAUI); $150 to $2,000 transceiver cost for fiber.
- internet Offload Network Interface Controllers (iONICs) offering multiple services.
- Lower-latency switches: sub-microsecond latencies for data-center (cluster and storage) networks.

Page 24: (PowerPoint format)


Market Value of Ethernet

250 million Ethernet ports have been installed to date.

[Charts: cumulative shipments of switched Ethernet ports (10, 100, 1000, and 10000 Mb/s), in thousands, from 1993 to 2004; and server Ethernet NIC prices (log scale, Jan-96 to Jan-05) for 10 Mb/s, 100 Mb/s, 1 Gb/s, 1 Gb/s iONIC, IB 4x, 10 Gb/s copper, 10 Gb/s fiber, and 10 Gb/s copper and fiber iONICs.]

Page 25: (PowerPoint format)


LAN Switch Trends

The traditional LAN switch IHV business model has been to pursue higher-level protocols and functions, with less attention to latency.

iONICs and 10 GigE are expected to increase the role Ethernet plays in cluster and storage networks. They also provide an additional business model for switch vendors, one focused on satisfying the performance needs of cluster and storage networks. Some Ethernet switch vendors (e.g., Nishan) are pursuing this new model.

Switch latencies:
- 1997: general-purpose switch, 20-100 us range
- 2002: general-purpose switch, 10-20 us range; data-center (e.g., iSCSI) focused switch, under 2 us
- 2006: general-purpose switch, 3-5 us range; data-center focused switch, under 1 us

Nishan switch: IBM 0.18 um ASIC, 25 million transistors, 15 mm x 15 mm die, 928 signal pins, less than 2 us latency.

Page 26: (PowerPoint format)


LAN Outlook

Ethernet vendors are gearing up for storage and higher-performance cluster networks:
- 10 GigE to provide higher bandwidth;
- iONICs to solve the CPU and memory overhead problems; and
- lower-latency switches to satisfy end-to-end process latency requirements.

Given that the above comes to pass, how well will Ethernet play in the cluster market? In the storage market?

[Chart: local area network bandwidth (GB/s, log scale) from 1974 to 2004 for Ethernet, Token Ring, ATM, and FDDI; bandwidth has doubled roughly every 16 months over the past 10 years.]

Page 27: (PowerPoint format)


Cluster Network Contenders

Proprietary networks
- Strategy: provide an advantage over standard networks through lower latency/overhead, higher link performance, and advanced functions, with the eco-system completely supplied by one vendor.
- Two approaches: multi-server (can be used on servers from more than one vendor) and single-server (only available on servers from one vendor).

Standard networks
- IB: the strategy is to provide almost all the advantages available on proprietary networks, thereby eventually displacing proprietary fabrics.
- 10 Gb/s Ethernet with internet Offload NICs (iONICs): the strategy is to increase Ethernet's share of the cluster network pie by providing lower CPU/memory overhead and advanced functions, though at lower bandwidth than IB and proprietary networks.

Note: PCI-Express doesn't play here, because it is missing many functions.

Page 28: (PowerPoint format)


Cluster Network Usage in HPC

[Chart: cluster interconnect technology share (standards-based, multi-platform, single-platform) for the Top 500 supercomputers, broken out by Top 100, Next 100, and Last 100, for June 2000, June 2002, and November 2002.*]

- Use of proprietary cluster networks for high-end clusters will continue to decline.
- Multi-platform cluster networks have already begun to gain significant share.
- Standards-based cluster networks will become the dominant form.

* Source: Top 500 study by Tom Heller.

Page 29: (PowerPoint format)


Reduction in Process-to-Process Latencies (256 B and 8 KB blocks)

[Charts: LAN process-to-process latencies for GbE, 10 GbE, optimized 10 GbE, and IB 12x, decomposed into link time (256 B or 8 KB), five switch hops, network stack plus adapter, and driver/library. Normalized to GbE, latencies are about 1.2x, 4.6x, and 8.4x lower for 10 GbE, optimized 10 GbE, and IB 12x on the 256 B charts, and about 2.5x, 3.0x, and 3.9x lower on the 8 KB charts. For reference, the deck notes: 1 GigE, 100 MFLOP = 19 us; 10 GigE, 100 MFLOP = 6 us; IB, 100 MFLOP = 6 us.]
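The latency decomposition in these charts can be reproduced with a simple budget: serialization time on the link, plus per-hop switch latency, plus fixed network-stack/adapter and driver/library overheads. The function below illustrates the arithmetic; the figures in main() are placeholder assumptions, not the measurements behind the charts.

    #include <stdio.h>

    /* Process-to-process latency budget, in microseconds. link_GBps is the
     * effective link bandwidth in GB/s (1 GB/s == 1000 bytes/us). */
    static double p2p_latency_us(double bytes, double link_GBps, int hops,
                                 double switch_us, double stack_adapter_us,
                                 double driver_lib_us)
    {
        double serialization_us = bytes / (link_GBps * 1000.0);
        return serialization_us + hops * switch_us + stack_adapter_us + driver_lib_us;
    }

    int main(void)
    {
        /* Placeholder example: 8 KB over a 1 Gb/s link (0.125 GB/s), 5 hops at
         * 10 us each, 20 us of stack+adapter, 2 us of driver/library. */
        printf("8 KB, GbE-class   : %.1f us\n",
               p2p_latency_us(8192, 0.125, 5, 10.0, 20.0, 2.0));
        /* Same transfer on a 3 GB/s IB 12x link with 0.5 us switches and an
         * offloaded stack. */
        printf("8 KB, IB-12x-class: %.1f us\n",
               p2p_latency_us(8192, 3.0, 5, 0.5, 2.0, 1.0));
        return 0;
    }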

Page 30: (PowerPoint format)


HPC Cluster Network Outlook

Proprietary fabrics will be displaced by Ethernet and IB:
- Servers with the most stringent performance requirements will use IB.
- Cluster networks will continue to be predominantly Ethernet; iONICs and low-latency switches will increase Ethernet's participation in the cluster network market.

[Chart: high-performance standard link (IB/Ethernet) bandwidth (GB/s, log scale) from 1985 to 2009 for Ethernet, Token Ring, ATM, FDDI, FCS, IB, HiPPI, ServerNet, Memory Channel, SGI GIGAchannel, IBM SP/RIO/STI, and Synfinity.]

Page 31: (PowerPoint format)


Current SAN and NAS Positioning Overview

Current SAN differentiators:
- Block-level I/O access
- High-performance I/O (low latency, high bandwidth)
- Vendor-unique fabric management protocols (a learning curve for IT folks)
- Homogeneous access to I/O

Current NAS differentiators:
- File-level I/O access
- LAN-level performance (high latency, lower bandwidth)
- Standard fabric management protocols (low/zero learning curve for IT folks)
- Heterogeneous platform access to files

Commonalities:
- Robust remote recovery and storage management require special tools for both.
- Each can optimize disk access, though a SAN requires virtualization to do it.

Contenders: SAN: FC and Ethernet. NAS: Ethernet.

[Figure: a host reaching NAS file servers (NFS, HTTP, etc.) over the LAN and block storage (LUN/LBA) over the SAN.]

Page 32: (PowerPoint format)


Storage Models for IP

Parallel SCSI and FC have a very efficient path through the OS; the existing driver-to-hardware interface has been tuned for many years. An efficient driver-to-hardware interface model has been a key iSCSI adoption issue.

Next steps in iSCSI development:
- Offload TCP/IP processing to the host bus adapter;
- Provide switches that satisfy SAN latency requirements;
- Improve read and write processing overhead at the initiator and target.

[Figure: three storage models. Parallel SCSI or FC: the application and file system API run over FS/LVM and a storage driver to a storage adapter. iSCSI service in the host: the iSCSI and TCP/IP layers run on the CPU over a NIC (with partial offload) and Ethernet. iSCSI service in an iONIC: the storage driver talks to an adapter driver for an iSCSI HBA that offloads iSCSI, TCP/IP, and Ethernet processing.]

[Chart: CPU instructions per byte (log scale) versus transfer size from 1 byte to 100,000 bytes for parallel SCSI, the iSCSI service in the host, and the iSCSI service in an iONIC.]

Page 33: (PowerPoint format)


Storage Models for IP

RDMA will significantly improve NAS server performance. Host network stack processing will be offloaded to an iONIC, which removes TCP/IP processing from the host path and allows zero-copy transfers.

NAS protocols (NFS with RDMA extensions) will exploit RDMA. RDMA allows a file-level access device to approach block-level access device performance, creating a performance discontinuity for storage.

[Figure: two NAS models. NFS over an Ethernet link service (ELS) NIC: NFS and TCP/IP run on the CPU over a NIC with partial offload. NFS extensions for RDMA over the RDMA service in an iONIC: NFS runs over an RNIC that offloads RDMA/DDP/MPA/TCP/IP/Ethernet processing.]

[Chart: CPU instructions per byte (log scale) versus transfer size from 1 byte to 100,000 bytes for NFS over an ELS NIC, NFS over an RNIC, and parallel SCSI.]

Page 34: (PowerPoint format)


Storage I/O Network Outlook

Storage network outlook:
- Link bandwidth trends will continue, paced by optics technology enhancements.
- Adapter throughput trends will continue, paced by higher-frequency circuits, higher-performance microprocessors, and larger fast-write and read cache memory.
- SANs will gradually transition from FC to IP/Ethernet networks, motivated by TCO/complexity reduction and paced by the availability of iSCSI with efficient TOE (possibly RNIC) and lower-latency switches.
- NAS will become more competitive against SAN, paced by RNIC availability.

[Charts: SAN link bandwidth (GB/s, log scale) from 1990 to 2005 for SCSI, FC, disk head, and iSCSI/Ethernet; and single adapter/controller throughput (thousands of IOPS, log scale) from 1994 to 2008.*]

* Sources: product literature from 14 companies. These typically use a workload of 100% reads of 512-byte data; that is not a good measure of overall sustained performance, but it is a good measure of adapter/controller front-end throughput capability.

Page 35: (PowerPoint format)


Summary

- Server I/O adapters will likely attach through the PCI family, because of PCI's low cost and simplicity of implementation.
- I/O expansion networks will likely use proprietary or IB links (with PCI tunneling) where enterprise requirements must be satisfied, and PCI-Express on standard high-volume, low-end servers.
- Cluster networks will likely use Ethernet for the high-volume portion of the market, and InfiniBand where performance (latency, bandwidth, throughput) is required.
- Storage area networks will likely continue using Fibre Channel, but gradually migrate to iSCSI over Ethernet.
- For LANs, Ethernet is king.
- Several network stack offload design approaches will be attempted, from all-firmware on slow microprocessors, to heavy state-machine usage, to all points in between. After the weed designs are rooted out of the market, iONICs will eventually become a prevalent feature on low-end to high-end servers.