
The Case for Using 10 Gigabit Ethernet for Low Latency Network Applications

April 28, 2012


Introduction

For most traditional enterprise client/server applications, the primary performance metric has been

user response time, with something on the order of 100 milliseconds (ms) generally being

considered acceptable. Over the last few years a new generation of server-to-server distributed

applications has emerged where performance is largely determined by the end-to-end latency

(a.k.a. application latency) between servers. These applications include the migration of

virtual machines between physical servers; High Performance Computing (HPC) with clusters of

compute nodes; high frequency, automated security trading; clustered databases; storage networking; and Hadoop/MapReduce clusters for performing analytics on unstructured big data.

Applications such as these require far more bandwidth than client/server applications and perform

best with end-to-end latencies that can be as low as a few microseconds.

These requirements have given rise to specialty switching technologies, such as InfiniBand and Fibre Channel, that offer better bandwidth/latency characteristics than Gigabit Ethernet. However,

recent developments in 10 Gigabit Ethernet NIC hardware and low latency 10 GbE switching are

positioning 10 Gigabit Ethernet to offer bandwidth and latency performance that is on a par with,

or surpasses, that of the more specialized interconnects. These developments will allow network

managers to minimize the complexity and reduce the cost of the data center by using Ethernet as

the converged switching technology that can meet the highest performance requirements of each

type of data center traffic.

One goal of this white paper is to provide a review of performance and latency test results that

address the suitability of 10 GbE as the network interconnect for a range of low latency server-to-

server applications. Another goal is to briefly discuss some of the factors that should be included in a TCO comparison between a solution based on low latency 10 GbE and one based on a specialty interconnect.

Network Latency and Switch Latency

Figure 1 illustrates the differences in network latency between store-and-forward and cut-through

switches. A store-and-forward switch has to wait for the full packet serialization by the sending NIC before it can begin packet processing. The switch latency for a store-and-forward switch is the delay between when the last bit of the packet arrives at the switch and when the first bit of the packet is sent out of the switch (LIFO). After packet processing is complete, the switch has to

re-serialize the packet to deliver it to its destination. Therefore, neglecting the small propagation

delay over short data center cabling (~3-5.5 ns/meter depending on the media), the network

latency for a one hop store-and-forward (SAF) switched network is:

SAF Network Latency = 2 x (Serialization Delay) + LIFO Switch Latency

Equation 1: Network Latency for a Store-and-Forward Switch


Figure 1: Network Latency for a Store-and-Forward vs. a Cut-Through Switch

In the case of the cut-through switch that is depicted at the bottom of Figure 1, the switch can

begin forwarding the packet to the destination system as soon as the destination address, plus

enough header fields to support VLANs, QoS, and security features, is mapped to the appropriate

output port. This means that the cut-through switch can overlap the serialization of the outgoing

packet from the switch to the destination end system with the serialization of the incoming packet.

The switch latency is measured as the delay between the first bit in and the first bit out (FIFO) of

the switch. Therefore, the corresponding network latency through a one hop cut-through (CT)

switched network is:

CT Network Latency = Serialization Delay + FIFO Switch Latency

Equation 2: Network Latency for a Cut-Through Switch

Typical LIFO switch latency for a 10 GbE store-and-forward (SAF) switch is in the range of 2-35

microseconds, while the FIFO switch latency for a 10 GbE cut-through (CT) switch is typically

only 300-1,000 nanoseconds.

As the diameter of the interconnect network increases, the advantage of CT switching becomes

more significant. For example, for an n-hop network, the network latencies for the two types of

switches are:

SAF Network Latency = (n+1) x (Serialization Delay) + n x (LIFO Switch Latency)

CT Network Latency = Serialization Delay + n x (FIFO Switch Latency)


Equation 3: Network Latency for Multi-hop Networks
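To make Equations 1 through 3 concrete, the following Python sketch evaluates them for illustrative values: a 1,500-byte frame serialized at 10 Gbps, with representative switch latencies taken from the ranges quoted above. The specific numbers are assumptions chosen only to show how the SAF/CT gap widens with hop count, not measured results.

    # Illustrative comparison of store-and-forward (SAF) vs. cut-through (CT)
    # network latency, per Equations 1-3. Packet size, link speed, and switch
    # latencies are assumed example values.
    PACKET_BITS = 1500 * 8                      # 1,500-byte Ethernet frame
    LINK_BPS = 10e9                             # 10 GbE link speed
    SERIALIZATION_S = PACKET_BITS / LINK_BPS    # ~1.2 microseconds

    SAF_SWITCH_S = 10e-6     # assumed LIFO latency for a SAF switch
    CT_SWITCH_S = 500e-9     # assumed FIFO latency for a CT switch

    def saf_latency(hops: int) -> float:
        """SAF: (n+1) serializations plus n LIFO switch latencies."""
        return (hops + 1) * SERIALIZATION_S + hops * SAF_SWITCH_S

    def ct_latency(hops: int) -> float:
        """CT: one serialization plus n FIFO switch latencies."""
        return SERIALIZATION_S + hops * CT_SWITCH_S

    for n in (1, 2, 3):
        print(f"{n} hop(s): SAF {saf_latency(n) * 1e6:.1f} us, "
              f"CT {ct_latency(n) * 1e6:.1f} us")

With these assumed values, the gap grows from roughly 12 μsec vs. 2 μsec at one hop to roughly 35 μsec vs. 3 μsec at three hops.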

Switch latency (aka port-to-port latency) is measured using a test device that clocks the departure

and subsequent return of test packets to measure LIFO or FIFO, as shown in Figure 2.

Figure 2: Switch Latency Testing Setup

Switch latency measurements should state whether the switch was in CT or SAF mode, indicate the benchmark used for the tests, and specify the amount of load applied during testing. Complete test

results would include latency for a range of packet sizes and statistical metrics, such as mean

latency, min/max latency, and standard deviation. Additional tests might also be run with a

mixture of packet sizes to simulate different types of application traffic flowing through the

switch.

Application Latency or End-to-end Latency

In order to determine the effect of delay on an application it is necessary to consider all the components of end-to-end latency as depicted in Figure 3. The best way to measure end-to-end latency is to start the clock when an application sends a request to another application on the network and to stop the clock when the response is fully received. The end-to-end latency is then defined as one half of the delay between send and receive. In most cases, end-to-end latency is measured with the request and the response each contained in a single packet.

The end-to-end round trip latency includes:

- Request packet protocol processing at the originating node, possibly involving the OS as well as the drivers and NICs
- Network latency, including serialization delay, switch latency, and propagation delay
- Request packet incoming processing at the target node
- Remote node processing of the request (this can vary significantly depending on the benchmark or application workload)
- Response packet outgoing processing at the target node
- Network latency, including serialization delay, switch latency, and propagation delay
- Response packet processing at the originating node
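As a rough illustration of this measurement procedure, and of the statistical reporting described in the previous section, the Python sketch below times a single-packet request/response exchange over TCP and reports half the round trip time as the end-to-end latency. The loopback address, port, message size, and iteration count are placeholder assumptions; a real benchmark would run between two hosts, pin processes and interrupts, and use far more samples.

    import socket
    import statistics
    import threading
    import time

    HOST, PORT = "127.0.0.1", 50007   # placeholder endpoint; a real test uses two hosts
    MSG = b"x" * 64                   # small single-packet request/response message
    ITERATIONS = 1000

    def echo_server() -> None:
        """Target node: receive a request and immediately send the response."""
        with socket.create_server((HOST, PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                for _ in range(ITERATIONS):
                    data = conn.recv(len(MSG))
                    if not data:
                        break
                    conn.sendall(data)

    threading.Thread(target=echo_server, daemon=True).start()
    time.sleep(0.2)                   # crude wait for the server to start listening

    samples = []
    with socket.create_connection((HOST, PORT)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # avoid Nagle delays
        for _ in range(ITERATIONS):
            start = time.perf_counter()
            sock.sendall(MSG)
            sock.recv(len(MSG))       # response fully received
            samples.append((time.perf_counter() - start) / 2)  # end-to-end = RTT / 2

    print(f"mean {statistics.mean(samples) * 1e6:.1f} us, "
          f"min {min(samples) * 1e6:.1f} us, max {max(samples) * 1e6:.1f} us, "
          f"stdev {statistics.stdev(samples) * 1e6:.1f} us")

Because the target in this sketch performs no application work, the numbers it produces approximate only the protocol stack and network components of end-to-end latency.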


End-to-end latency is typically dominated by application processing at the target host and by delays in protocol processing in the OS stack, network drivers, and network adapter I/O, rather than by network latency.

Figure 3: The Components of End-to-End Latency

Source: Intel Developers Forum 2010

Much of the early work to reduce end-to-end latency has focused on eliminating the delays and

high CPU utilization that can occur when the OS kernel becomes involved in processing TCP/IP

protocols. The approach used with Fibre Channel was to avoid TCP/IP by developing a new

Layer 2 protocol with a reliable delivery mechanism in conjunction with offloading some of the

protocol processing to an intelligent adapter. InfiniBand took a similar approach and added OS bypass through techniques such as Remote Direct Memory Access (RDMA), whereby the adapter places data directly into user memory space without involving the OS. For Ethernet, a

number of 10 GbE adapters have been developed that offload TCP/IP, FCoE, and/or iSCSI

protocol processing to an intelligent adapter. In some cases, intelligent adapters that support

TCP/IP can also bypass the OS with RDMA based on the IETF iWARP standard.

With the advent of multi-core processors, host processing power has increased to the point where the host can easily saturate two 10 GbE ports on a network adapter at low overall CPU utilization. In fact, the host now has enough processing power to accommodate a large number of VMs and still exceed the processing power of most intelligent NICs. For example, an Intel Xeon 5500 (Nehalem) processor issues four instructions per clock cycle and operates at a clock speed of 3 GHz. As a result, each Nehalem processor has an execution rate many times that of the generic processor engine in today's offload adapters. This difference in processing power can potentially overload the adapter's offload processor, making it a 'bottleneck' to application performance. An alternative to an intelligent adapter is an adapter designed to take advantage of the multi-core processing power of the host. One example of this is Solarflare's OpenOnload functionality. With OpenOnload the adapter doesn't implement TOE or RDMA but instead relies on TCP/IP processing in user space rather than in the OS kernel. The OS is effectively bypassed, allowing the OpenOnload adapter to deliver host/NIC latency that is highly competitive with the best intelligent Ethernet adapters and InfiniBand adapters.
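Note that OpenOnload is designed to accelerate unmodified sockets-based applications by handling standard socket calls in user space, so kernel bypass of this kind typically requires no code changes. For completeness, the hedged sketch below shows the kind of host-side socket tuning a latency-sensitive application might apply even without a bypass stack: disabling Nagle coalescing and, on Linux, optionally requesting busy-poll receive. The SO_BUSY_POLL fallback constant is an assumption that should be verified against the target kernel headers.

    import socket

    def make_low_latency_socket() -> socket.socket:
        """Create a TCP socket with common latency-oriented options applied."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

        # Send small messages immediately instead of coalescing them (Nagle off).
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

        # Optionally ask the Linux kernel to busy-poll the receive queue for up to
        # 50 microseconds instead of sleeping; trades CPU cycles for lower wakeup
        # latency. SO_BUSY_POLL may not be exposed by the Python socket module, so
        # fall back to the assumed Linux constant (46) and ignore failures on
        # platforms that do not support it.
        so_busy_poll = getattr(socket, "SO_BUSY_POLL", 46)
        try:
            sock.setsockopt(socket.SOL_SOCKET, so_busy_poll, 50)
        except OSError:
            pass
        return sock

    if __name__ == "__main__":
        s = make_low_latency_socket()
        print(s)
        s.close()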

Latency Sensitive Applications

HPC

The vast majority of high performance computing (HPC) systems are based on parallel processing across a cluster of microprocessor-based compute nodes. Most parallel programs written for execution on clusters are based on the Message Passing Interface (MPI), a de facto standard for communication among parallel processes on distributed systems and clusters. MPI provides synchronization and data exchange between cluster nodes via messages sent over a cluster interconnect network. In general, application performance is strongly influenced by the bandwidth and latency characteristics of the cluster interconnect. However, most applications based on MPI are dominated by relatively small messages, typically on the order of 128 bytes or less.

Latency benchmarks (such as the OSU and Intel benchmarks) for MPI are based on measuring end-to-end latency for a message between two cluster nodes. The ping-pong (send/receive) round trip delay is measured, and the latency is reported as the round trip time divided by 2. The most commonly cited MPI latency is for a message size of 0 or 1 byte, which minimizes the host processing required on the target node. The MPI benchmarks are sometimes referred to as micro benchmarks due to the lack of processing delay at the responding node. MPI latency is also typically measured for a range of message sizes. For larger messages the latency increases with the serialization time, as well as with the time required to move the message in and out of memory on the target machine. For GbE, the lowest reported MPI send/receive latencies (1 byte message) are on the order of 20 μsec, compared to <5 μsec for 10 GbE with a Solarflare OpenOnload adapter [1], <4 μsec for 10 GbE with a Chelsio T420 intelligent RDMA adapter [2], and <2 μsec for InfiniBand DDR or QDR with a Mellanox ConnectX adapter [3]. DDR and QDR stand for double and quad data rates (16 Gbps and 32 Gbps respectively; note that IB marketing literature tends to focus on the signaling rates of 20 Gbps and 40 Gbps rather than the actual data rates). The 10 GbE test results were based on low latency CT switches.
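The structure of such a ping-pong micro benchmark is straightforward. The sketch below, written with the mpi4py bindings rather than the C MPI interface used by the OSU and Intel suites, measures the average one-way latency for a 1-byte message between ranks 0 and 1 when launched with two processes under an MPI launcher such as mpirun. It is a minimal illustration under those assumptions, not a replacement for the standard benchmarks.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    WARMUP, ITERATIONS = 100, 10000
    msg = bytearray(1)            # 1-byte message, as in commonly cited MPI latency figures
    buf = bytearray(1)

    for i in range(WARMUP + ITERATIONS):
        if i == WARMUP:
            comm.Barrier()        # synchronize all ranks before timing starts
            start = MPI.Wtime()
        if rank == 0:
            comm.Send(msg, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(msg, dest=0, tag=0)

    if rank == 0:
        elapsed = MPI.Wtime() - start
        # One-way latency is reported as the round trip time divided by 2.
        print(f"average one-way latency: {elapsed / ITERATIONS / 2 * 1e6:.2f} microseconds")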

The Top500 is a semi-annually updated list of the world's highest performing supercomputers

based on the Linpack performance benchmark. Virtually all the systems on the list are designed

using some form of cluster interconnect. While some proprietary interconnects are represented on

the list, the majority of systems are based on Gigabit Ethernet, 10 Gigabit Ethernet, or InfiniBand.

On the most recent list, a total of 224 systems are based on GbE or 10 GbE. These

supercomputers are ranked from #42 down to #500, with most of these still being based on GbE.

[1] http://www.highfrequencytraders.com/news-wire/1013/solarflare-establishes-breakthrough-mpi-latency-performance-platform-mpi-81
[2] http://www.chelsio.com/assetlibrary/whitepapers/CHL_11Q2_IWARP_GA_TB_02_0405%20%283%29.pdf
[3] http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2010/masvdc10-hdfs-ib.pdf


There are 209 systems based on InfiniBand interconnect, ranging from #4 down to #483. The highest performing 10 GbE system (#42, an Amazon EC2 cluster) has 17,024 cores and performance of 0.24 PetaFlops. The highest performing InfiniBand system (#4, the National Supercomputing Centre in Shenzhen, China) has 120,640 cores and performance of 1.27 PetaFlops. Obviously, performance is based on a number of factors besides bandwidth and latency, including the number of cores, the processing power per core, and whether or not general purpose CPUs are complemented by Graphics Processing Units (GPUs).

As indicated by the Top500 list, if the goal is to build a Top10 performing supercomputer in the

world, the most popular choices are currently a proprietary interconnect or InfiniBand (in the

current top 10 positions on the list, #1, #2, #3, #6, and #8 use proprietary interconnect and the

rest use InfiniBand). 10 GbE's greatest impact on the Top500 and HPC in general is expected to come over the next 3 to 4 years as 10 GbE costs ride the high-volume learning curve. One can only

speculate what the performance of a 10 GbE cluster would be with 120,000 cores. For HPC at

both research lab and enterprise performance levels, 10 GbE (and GbE) are certainly very viable

alternatives, especially with the lowest latency NICs and switches that are available.

Note that the Linpack benchmark is only one performance indicator for HPC since sensitivity to

latency and bandwidth can vary significantly across a range of HPC applications. According to

some tests performed by IBM [4] with a much broader range of HPC benchmarks, 10 GbE actually outperforms InfiniBand for a wide range of workloads.

Security Trading

Automated and algorithmic trading (aka high frequency trading) has become widely adopted in

recent years. An automated system can take market data feeds and distribute them via messaging

to a number of trading stations and analytical programs. Analytical programs can then trigger

trades that are executed on various exchanges. In some markets, financial firms can derive a profit

from being less than one millisecond faster to act than competing firms, which drives them to

search for sub-millisecond optimizations in their trading solutions. Algorithmic trading

applications are sensitive to the predictability and consistency of latency as well as to how low the

mean latency is. One comment made at an AMD user group meeting was that "10 milliseconds of (end-to-end) delay could cost a security trading firm 10% of its profits."

Securities Technology Analysis Center (STAC) has developed a number of benchmark tests

designed to measure the effectiveness of solutions for disseminating market data, analyzing the

data, and executing orders based on that analysis. The STAC-M2 test measures the ability of a

solution to handle real-time market data in a variety of configurations. Leading trading firms on

the STAC Benchmark™ Council approved the STAC-M2™ Benchmark specifications as the

meaningful, transparent way for vendors to demonstrate the performance of a high-speed

messaging stack.

[4] Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet


For 10 GbE, the STAC web site [5] has audited results for the STAC-M2 Benchmarks executed with IBM's WebSphere MQ Low Latency Messaging (LLM) running on IBM Xeon servers with Solarflare 10 GbE adapters and a cut-through 10 GbE switch with latency of less than 1 μsec. These results compare very favorably with STAC-M2 Benchmark results for QDR InfiniBand, also based on IBM WebSphere MQ Low Latency Messaging (LLM) running on IBM Xeon servers, with Mellanox ConnectX IB adapters, as shown in Table 1. These results show that 10 GbE delivered very low latency with a superior degree of predictability, as indicated by a standard deviation of virtually zero and a lower maximum latency.

STAC-M2 metric                   10 GbE      InfiniBand QDR
Maximum Message Rate Tested      1,500,000   1,000,000
Mean Latency (μsec)              9           8
99th Percentile Latency (μsec)   12          11
Max Latency (μsec)               23          47
Standard Deviation (μsec)        0           1

Table 1: Comparison of 10 GbE and QDR InfiniBand for STAC-M2

Clustered Databases

Databases that are partitioned across a cluster of commodity servers have become an attractive

alternative to databases running on mainframes or high-end symmetric multiprocessing (SMP)

servers. Clustered databases offer significant economic advantages as well as providing a high

degree of reliability, availability and scalability.

Having a low latency cluster interconnect that is dedicated to inter-nodal communications allows

each node in the cluster to access data from all of the in-memory caches in the database cluster

before having to resort to reading data from SAN-based disk arrays. A dedicated cluster

interconnect network is required in order to avoid contention for bandwidth with other types of

traffic and hence ensure consistent latency and response times.

Oracle Real Application Clusters (RAC) employ a technology called Cache Fusion to unify the database caches resident in each of the nodes in the system. The Global Cache Service (GCS) [6] is a process that manages the status and the transfer of data blocks across the buffer caches of the cluster nodes to satisfy application requests. With IBM DB2 pureScale clustered databases [7], a cluster caching facility (CF) that is based on dedicated cluster members is used for synchronizing the locking and caching information across the cluster. The CF uses the global lock manager (GLM) to keep pages consistent across all members and to coordinate data access to a group buffer pool (GBP) that emulates cluster-wide shared memory.

Interconnect end-to-end latency directly affects the time it takes to access blocks in the remote

caches of clustered databases, and thus interconnect latency directly affects application scalability and performance. Generally, there are three network alternatives for implementing a cluster interconnect:

- GbE with UDP
- InfiniBand
- 10 GbE, possibly also with iWARP or RoCE (RDMA over Converged Ethernet)

The Oracle 11g Reference Architecture document [8] provides some relative performance data for these interconnects using the iGEN-OLTP benchmark. This document shows that the end-to-end application latency of 1-2 milliseconds for InfiniBand is lower than that of GbE/UDP by 50-66%. It also shows that the end-to-end latency for 10 GbE is about 50% lower than the end-to-end latency for GbE/UDP. The round trip latency includes the following components (recall that end-to-end latency is defined as one half the round trip time):

- Request packet protocol processing at the originating node
- Network latency of the request (200 B)
- Request packet incoming processing at the target node
- Remote node processing of the request (this can vary significantly depending on the benchmark or application workload)
- Response packet outgoing processing at the target node
- Network latency of the response (typically an 8K block)
- Response packet processing at the originating node

A good first-level approximation is that database application throughput is inversely proportional to end-to-end application latency. That means that both 10 GbE and InfiniBand can be expected to have more than twice the throughput of GbE. This assertion is supported by an Oracle RAC presentation [9].
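As a back-of-the-envelope illustration of that first-level approximation, the short sketch below converts relative end-to-end latency into relative throughput. The latency ratios are placeholder values derived from the 50% and 50-66% figures cited above, not benchmark results.

    # First-level approximation: database throughput scales as 1 / end-to-end latency.
    # Relative latencies are placeholders with GbE/UDP normalized to 1.0.
    relative_latency = {
        "GbE/UDP": 1.0,
        "10 GbE": 0.5,        # ~50% lower than GbE/UDP
        "InfiniBand": 0.4,    # 50-66% lower than GbE/UDP (midpoint assumption)
    }

    baseline = relative_latency["GbE/UDP"]
    for interconnect, latency in relative_latency.items():
        print(f"{interconnect}: ~{baseline / latency:.1f}x the throughput of GbE/UDP")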

With InfiniBand, the combination of switch latency, packet serialization latencies and host/NIC

packet processing latencies can be kept under 20-30 microseconds, which means that almost all of

the end-to-end application latency in the benchmark example cited in the architecture paper is due

to application processing on the server or servers that provide the remote cache hit.

With low latency switching and intelligent NICs, very similar results can be achieved using 10

GbE as the cluster interconnect. The relatively long application layer delay contribution to end-to-

end latency means that a small advantage in network latency for InfiniBand cannot in itself result

in significantly better performance for IBM DB2 pureScale or Oracle RAC. IBM DB2 pureScale

supports 10 GbE and InfiniBand interconnects on a more or less equal basis. Oracle promotes an

InfiniBand cluster interconnect solution for RAC 11g, although it also supports GbE and 10 GbE.

While this latter factor is worth considering, it should be kept in mind that Oracle owns a 10%

stake in Mellanox, one of the two remaining InfiniBand vendors.

[5] stacresearch.com
[6] http://docs.oracle.com/cd/B10501_01/rac.920/a96597/pslkgdtl.htm
[7] http://www-01.ibm.com/software/data/db2/linux-unix-windows/editions-features-purescale.html
[8] Sun Reference Architecture for Oracle 11g Grid, April 2010
[9] Oracle presentation: Oracle's Next-Generation Interconnect Protocol: Reliable Datagram Sockets (RDS) and InfiniBand


Storage Networking

Many of the benefits of server virtualization stem from hypervisor-based storage virtualization

that allows VMs to access their virtual disks via a logical name rather than a physical location.

This allows the vDisks to be repositioned in a networked storage system without requiring

reconfiguration of the VM or disruption of its operations. The popularity of server virtualization

has thus resulted in IT departments having a higher level of interest in both storage virtualization

and storage networking.

With storage networking, the key performance metrics have generally been I/O operations per second (IOPS) and I/O latency, which is another example of an end-to-end latency. Sometimes, throughput in MBps is quoted as a separate metric, but MBps is simply IOPS times the block size in megabytes.
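The conversion is a single multiplication, as the minimal sketch below shows for an assumed 8 KB block size at the 350,000 IOPS figure cited later in this section.

    def throughput_mbps(iops: float, block_size_bytes: int) -> float:
        """Throughput in MBps: IOPS times the block size expressed in megabytes."""
        return iops * block_size_bytes / 1e6

    # Assumed example: 350,000 IOPS with an 8 KB block size -> roughly 2,900 MBps.
    print(f"{throughput_mbps(350_000, 8 * 1024):.0f} MBps")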

VMware and NetApp recently issued a technical report [10] summarizing the results of testing conducted to compare the performance of Fibre Channel, software-initiated iSCSI, and NFS networked storage in an ESX 3.5 and vSphere environment using NetApp storage. The results compare the performance of the three storage protocols with a goal of aiding customer decisions among shared storage alternatives. The results also demonstrate the performance enhancements made in going from ESX 3.5 to vSphere. The performance tests sought to simulate a "real-world" environment. The test and validation environment is composed of components and architectures commonly found in a typical VMware implementation, including the use of the FC, iSCSI, and NFS protocols in a multiple virtual machine (VM), multiple ESX 3.5 and/or vSphere host environment accessing multiple data stores. The performance tests used realistic I/O patterns, I/O block sizes, read/write mixes, and I/O loads common to various operating systems and business applications.

The Fibre Channel results were based on 4 Gbps FC and an FC switch with a cut-through latency of <1 μsec. The intelligent FC adapter communicates directly with the operating system kernel via a device driver, allowing for direct memory access (DMA) from the SCSI layers to the host computer's memory subsystem. The 10 GbE results were based on a Chelsio T320 intelligent adapter and a 10 GbE switch with a cut-through latency of <1 μsec.

Figures 4, 5, and 6 summarize the comparative test results for I/O latency, IOPS, and CPU utilization. In these results, VMware vSphere demonstrated the ability to support 350,000 I/O operations per second with just three virtual machines running on a single host, with I/O latency under 2 ms for both FC and 10 GbE. In Figure 4, FC shows a slightly lower latency compared to 10 GbE iSCSI, and as a result a slightly higher IOPS in Figure 5. The greatest difference between these two technologies comes in the lower CPU utilization for FC, as shown in Figure 6. As noted earlier, CPU utilization is becoming less of a concern as the number of cores per processor chip increases and it becomes possible to devote an entire core to protocol processing.

[10] VMware vSphere and ESX 3.5 Multiprotocol Performance Comparison Using FC, iSCSI, and NFS


Figure 4: Comparative I/O Latencies

Figure 5: Comparative I/Os per Second


Figure 6: Comparative CPU Utilization

Big Data with MapReduce/Hadoop

The open-source Hadoop framework has given rise to the broad application of the MapReduce

paradigm for searching and analyzing massive amounts of unstructured data. The traditional

relational technologies have simply not been able to keep up with the explosive growth of new

types of data.

Hadoop was designed based on a new approach to storing and processing complex data. Instead of storing data on a SAN for shared accessibility and reliability, each node in a Hadoop cluster both processes data and stores data on direct-attached storage (DAS). Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be done on the nodes storing the data. Throughput for I/O bound workloads can be improved via the use of distributed caching of data in memory and by using fast disks, in particular SSDs rather than HDDs.

The DFSIO, Sort, and Random Write performance benchmarks are part of the Hadoop distribution. DFSIO measures the read and write performance of the cluster. OSU [11] used the test to measure the sequential access throughput of Hadoop using Map tasks that were writing files with sizes between 1 and 10 GB. The tests were run with GbE, 10 GbE, and InfiniBand DDR cluster interconnects, with the cluster nodes first using Hard Disk Drives (HDD) and then using Solid State Drives (SSD). The results showed that with HDD, throughput was improved by about 30% by using a low latency cluster interconnect rather than GbE. Switching a GbE cluster to SSD did not result in a significant improvement in throughput. However, with low latency 10 GbE or InfiniBand in conjunction with SSD, performance improved by approximately 6X compared to GbE/HDD. In all the benchmark tests that were run, the low latency 10 GbE interconnect in conjunction with Chelsio T320 adapters produced performance results that were better than or equivalent to those for InfiniBand DDR.

[11] Can High-Performance Interconnects Benefit Hadoop Distributed File System?

As big data technology evolves, there is growing interest in being able to assimilate, analyze and

respond to data events in near real-time, which requires millisecond-scale access and processing

speeds. Examples of real-time applications of big data include analyzing high-volume web session and user data, reacting to high-speed financial market feeds, aggregating distributed sensor grid events, processing social network messages and connections, and providing real-time intelligence and entity classification.

Cassandra is an open source distributed database management system that can be integrated with

Hadoop together with other open source big data utilities, such as Hive, HBase, and Solr.

Cassandra provides improved management of real time data that complements the batch analytic

capabilities of Hadoop. Cassandra enhances the linear scalability of the cluster, provides for data

replication within the cluster or over the wide area, and provides distributed in-memory caching of

data that improves read and write performance. As nodes are added to the Hadoop/Cassandra

cluster, the size of the distributed cache increases allowing as much of the data as required to

reside in memory rather than on disk.

The distributed in-memory caching of Cassandra should result in a further improvement in

performance when combined with SSD storage and low latency interconnect. However, no test

results have been published yet for this combination of technologies.

Factors to Consider in TCO Comparisons of Low Latency Network Solutions

When competing network solutions offer equivalent performance, the decision of which to choose

often hinges on a TCO analysis. A comprehensive TCO analysis goes well beyond price per

switch port and power consumption per port, especially when two rather divergent technologies

are being considered. In a comprehensive TCO analysis comparing a low latency 10 GbE solution to one based on InfiniBand or Fibre Channel, some of the things to be considered include:

Network Convergence: With 10 GbE, it is possible to avoid the complexity of multiple adapter technologies on a single server and of additional switching technologies for the cluster interconnect and networked storage. Network convergence also has OPEX benefits by simplifying network management and reducing the duplication and fragmentation of administrative staff along technology boundaries.

Seamless Upgrades: 10 GbE is plug-and-play compatible with 1 GbE and with 40 GbE and 100 GbE. 10 GbE fully preserves software and hardware investments in sockets-based applications and applications written for TCP/UDP/IP. 10 GbE supports the structured Cat5E cabling already installed in the data center.


Technology Evolution: 10 GbE leverages Ethernet's declining cost curve, which is driven by high-volume production. The next 3 to 4 years will be the major growth period for 10 GbE, supporting the expectation that 10 GbE adapter and switch prices will follow a cost reduction curve similar to that of GbE. More specialized fabric technologies will never be able to achieve the same cost/volume advantages.

Technology Continuity: Ethernet is a true industry standard with a long history of innovations coming from a broad range of vendors. This has resulted in a rich set of solutions supported by an extensive ecosystem of vendors. InfiniBand, on the other hand, can be considered a pseudo industry standard in view of the fact that the "industry" has shrunk to two vendors, one of which is the sole source of IB switch chips and holds 90% market share.

Summary

The performance of latency sensitive network applications is largely determined by end-to-end

latency, rather than switch latency or network latency. Maximizing performance for these

applications means minimizing every component of end-to-end latency, with special attention paid

to the network adapter and host protocol stack contributions to latency.

For all the latency sensitive applications examined in this document, low latency 10 GbE offers

end-to-end latencies and levels of performance that are highly competitive with those of the

specialty interconnects.

Low latency 10 GbE therefore makes it highly feasible for IT departments to optimize the

simplicity and homogeneity of their data center LANs by pursuing network technology

convergence, and even fabric convergence, as they increasingly leverage applications that depend

on low, consistent latency together with high application throughput. By taking maximum

advantage of existing expertise with Ethernet technology and management tools, IT departments

can future-proof their IT investments while minimizing a number of the components of OPEX that

contribute significantly to TCO.