The Case for Using 10 Gigabit Ethernet for Low Latency Network Applications

April 28, 2012
Introduction
For most traditional enterprise client/server applications, the primary performance metric has been
user response time, with something on the order of 100 milliseconds (ms) generally being
considered acceptable. Over the last few years a new generation of server-to-server distributed
applications has emerged where performance is largely determined by the end-to-end latency
(a.k.a. application latency) between servers. These applications include the migration of virtual machines between physical servers; High Performance Computing (HPC) with clusters of compute nodes; high-frequency automated security trading; clustered databases; storage networking; and Hadoop/MapReduce clusters for performing analytics on unstructured big data.
Applications such as these require far more bandwidth than client/server applications and perform
best with end-to-end latencies that can be as low as a few microseconds.
These requirements have given rise to specialty switching technologies, such as InfiniBand and Fibre Channel, that offer better bandwidth/latency characteristics than Gigabit Ethernet. However,
recent developments in 10 Gigabit Ethernet NIC hardware and low latency 10 GbE switching are
positioning 10 Gigabit Ethernet to offer bandwidth and latency performance that is on a par with,
or surpasses, that of the more specialized interconnects. These developments will allow network
managers to minimize the complexity and reduce the cost of the data center by using Ethernet as
the converged switching technology that can meet the highest performance requirements of each
type of data center traffic.
One goal of this white paper is to provide a review of performance and latency test results that
address the suitability of 10 GbE as the network interconnect for a range of low latency server-to-
server applications. Another goal of this white paper is to provide a brief discussion of some of
the factors that should be included in a TCO comparison of a solution that is based on low
latency 10 GbE with another solution that is based on a specialty interconnect.
Network Latency and Switch Latency
Figure 1 illustrates the differences in network latency between store-and-forward and cut-through
switches. A store-and-forward switch has to wait for the full packet serialization by the sending NIC before it can begin packet processing. The switch latency for a store-and-forward switch is
the delay between when the last bit of the packet arrives in the switch and when the first bit of the
packet is sent out of the switch (LIFO). After packet processing is complete, the switch has to
re-serialize the packet to deliver it to its destination. Therefore, neglecting the small propagation
delay over short data center cabling (~3-5.5 ns/meter depending on the media), the network
latency for a one hop store-and-forward (SAF) switched network is:
SAF Network Latency = 2 x (Serialization Delay) + LIFO Switch Latency
Equation 1: Network Latency for a Store-and-Forward Switch
Figure 1: Network Latency for a Store-and-Forward vs. a Cut-Through Switch
In the case of the cut-through switch that is depicted at the bottom of Figure 1, the switch can
begin forwarding the packet to the destination system as soon as the destination address, plus
enough header fields to support VLANs, QoS, and security features, is mapped to the appropriate
output port. This means that the cut-through switch can overlap the serialization of the outgoing
packet from the switch to the destination end system with the serialization of the incoming packet.
The switch latency is measured as the delay between the first bit in and the first bit out (FIFO) of
the switch. Therefore, the corresponding network latency through a one hop cut-through (CT)
switched network is:
CT Network Latency = Serialization Delay + FIFO Switch Latency
Equation 2: Network Latency for a Cut-Through Switch
Typical LIFO switch latency for a 10 GbE store-and-forward (SAF) switch is in the range of 2-35
microseconds, while the FIFO switch latency for a 10 GbE cut-through (CT) switch is typically
only 300-1,000 nanoseconds.
As the diameter of the interconnect network increases, the advantage of CT switching becomes
more significant. For example, for an n-hop network, the network latencies for the two types of
switches are:
SAF Network Latency = (n+1) x (Serialization Delay) + n x (LIFO Switch Latency)
CT Network Latency = Serialization Delay + n x (FIFO Switch Latency)
Equation 3: Network Latency for Multi-hop Networks
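To make the serialization and switch latency terms concrete, the following short Python sketch applies Equations 1-3. The packet size, hop count, and per-switch latencies in the example are assumed illustrative values (chosen to fall within the typical ranges quoted above), not measured results.

    # Illustrative sketch of Equations 1-3 (example values only, not measurements).
    def serialization_delay_us(packet_bytes, link_gbps=10):
        """Time to clock a packet onto a link, in microseconds."""
        return (packet_bytes * 8) / (link_gbps * 1000.0)

    def saf_network_latency_us(packet_bytes, hops, lifo_switch_us):
        """Store-and-forward: (n+1) serializations plus n LIFO switch latencies."""
        return (hops + 1) * serialization_delay_us(packet_bytes) + hops * lifo_switch_us

    def ct_network_latency_us(packet_bytes, hops, fifo_switch_us):
        """Cut-through: a single serialization plus n FIFO switch latencies."""
        return serialization_delay_us(packet_bytes) + hops * fifo_switch_us

    # Example: a 1,500-byte packet (1.2 us to serialize at 10 Gbps) crossing 3 hops,
    # assuming a 5 us store-and-forward switch and a 0.4 us cut-through switch.
    print(saf_network_latency_us(1500, 3, 5.0))   # ~19.8 us
    print(ct_network_latency_us(1500, 3, 0.4))    # ~2.4 us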
Switch latency (aka port-to-port latency) is measured using a test device that clocks the departure
and subsequent return of test packets to measure LIFO or FIFO, as shown in Figure 2.
Figure 2: Switch Latency Testing Setup
Switch latency measurements should state whether the switch was in CT or SAF mode, indicate the benchmark used for the tests, and specify the amount of load applied during testing. Complete test
results would include latency for a range of packet sizes and statistical metrics, such as mean
latency, min/max latency, and standard deviation. Additional tests might also be run with a
mixture of packet sizes to simulate different types of application traffic flowing through the
switch.
Application Latency or End-to-end Latency
In order to determine the effect of delay on an application it is necessary to consider all the
components of end-to-end latency as depicted in Figure 3. The best way to measure end-to-end
latency is to start the clock when an application sends a request to another application on the
network and to stop the clock when a response is fully received. The end-to-end latency is then
defined as one half of the delay between send and receive. In most cases, end-to-end latency is
measured where the request and the response are contained in single packets.
The end-to-end round trip latency includes the following components (illustrated by the sketch after this list):

- Request packet protocol processing at the originating node, possibly involving the OS as well as the drivers and NICs
- Network latency, including serialization delay, switch latency, and propagation delay
- Request packet incoming processing at the target node
- Remote node processing of the request; this can vary significantly depending on the benchmark or application workload
- Response packet outgoing processing at the target node
- Network latency of the response, including serialization delay, switch latency, and propagation delay
- Response packet processing at the originating node
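As a rough illustration of how these components combine, the following sketch sums a single round trip and reports end-to-end latency as half of it. All of the component values are hypothetical placeholders chosen only to show the arithmetic; they are not benchmark results.

    # Hypothetical per-component delays (microseconds) for one request/response round trip.
    round_trip_components_us = {
        "request protocol processing (originating node)": 3.0,
        "network latency (request)": 2.0,
        "request incoming processing (target node)": 3.0,
        "remote node application processing": 20.0,
        "response outgoing processing (target node)": 3.0,
        "network latency (response)": 2.0,
        "response processing (originating node)": 3.0,
    }

    round_trip_us = sum(round_trip_components_us.values())
    end_to_end_us = round_trip_us / 2  # end-to-end latency is defined as half the round trip
    print(f"round trip: {round_trip_us:.1f} us, end-to-end: {end_to_end_us:.1f} us")

Even with these made-up numbers, the pattern matches the point made below: the remote node's application processing, not the network, dominates the total.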
End-to-end latency is typically dominated by application processing at the target host, as well as
delays in protocol processing encountered in the OS stack, network drivers, and network adapter
I/O rather than by network latency.
Figure 3: The Components of End-to-End Latency
Source: Intel Developer Forum 2010
Much of the early work to reduce end-to-end latency has focused on eliminating the delays and
high CPU utilization that can occur when the OS kernel becomes involved in processing TCP/IP
protocols. The approach used with Fibre Channel was to avoid TCP/IP by developing a new
Layer 2 protocol with a reliable delivery mechanism in conjunction with offloading some of the
protocol processing to an intelligent adapter. InfiniBand used a similar approach in conjunction
with OS bypass by using techniques such as Remote Direct Memory Access (RDMA) whereby
the adapter can bypass the OS by placing data directly into user memory space. For Ethernet, a
number of 10 GbE adapters have been developed that offload TCP/IP, FCoE, and/or iSCSI
protocol processing to an intelligent adapter. In some cases, intelligent adapters that support
TCP/IP can also bypass the OS with RDMA based on the IETF iWARP standard.
With the advent of multi-core processors, processing power has increased to the point where the
host can easily saturate two 10 GbE ports on a network adapter at low overall CPU utilization. In
fact, the host now has plenty of processing power to accommodate a large number of VMs with
enough processing power left over to exceed the processing power of most intelligent NICs. For
example, an Intel Xeon 5500 (Nehalem) processor issues four instructions per clock cycle and operates at a clock speed of 3 GHz. As a result, each Nehalem processor has an execution rate many times that of a generic processor engine in today's offload adapters. This difference in processing power can potentially result in overloading the adapter's offload processor, making it a
'bottleneck' to application performance. An alternative to an intelligent adapter is an adapter designed to take advantage of the multi-core processing power of the host. One example of this is Solarflare's OpenOnload functionality. With OpenOnload the adapter does not implement TOE or RDMA, but relies on TCP/IP processing in user space rather than involving the OS kernel. The OS is effectively bypassed, allowing the OpenOnload adapter to deliver host/NIC latency that is highly competitive with the best intelligent Ethernet adapters and InfiniBand adapters.
Latency Sensitive Applications
HPC
The vast majority of high performance computing (HPC) systems are based on parallel processing across a cluster of microprocessor-based compute nodes. Most parallel programs written for execution on clusters are based on the Message Passing Interface (MPI), a de facto standard for communication among parallel processes on distributed systems and clusters. MPI provides synchronization and data exchange between cluster nodes via messages sent over a cluster interconnect network. In general, application performance is strongly influenced by the bandwidth and latency characteristics of the cluster interconnect. However, most applications based on MPI are dominated by relatively small messages, typically on the order of 128 bytes or less.
Latency benchmarks (such as the OSU and Intel benchmarks) for MPI are based on measuring
end-to-end latency for a message between two cluster nodes. The ping-pong or send/receive
round-trip delay is measured, and the latency is reported as the round-trip time divided by 2. The most
commonly cited MPI latency is for a message size of 0 or 1 byte, which minimizes the host
processing required on the target node. The MPI benchmarks are sometimes referred to as micro
benchmarks due to the lack of processing delay by the responding node. MPI latency is also
typically measured for a range of message sizes. For larger messages the latency increases with the serialization time, as well as the time required to move the message into and out of memory on the target machine. For GbE, the lowest reported MPI send/receive latencies (1-byte message) are on the order of 20 μsec, compared to <5 μsec for 10 GbE with a Solarflare OpenOnload adapter [1], <4 μsec for 10 GbE with a Chelsio T420 intelligent RDMA adapter [2], and <2 μsec for InfiniBand DDR or QDR [3] with a Mellanox ConnectX adapter. DDR and QDR stand for double and quad data rates (16 Gbps and 32 Gbps, respectively; note that IB marketing literature tends to focus on the signaling rates of 20 Gbps and 40 Gbps rather than the actual data rates). The 10 GbE test results were based on low latency CT switches.
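For readers who want to reproduce this type of measurement, the following is a minimal ping-pong latency sketch using mpi4py. The 1-byte message size and iteration count are assumptions for illustration; published numbers such as those cited above come from the OSU or Intel MPI benchmarks rather than a Python script, which adds its own overhead.

    # Minimal MPI ping-pong latency sketch; run with: mpirun -np 2 python pingpong.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    iterations = 10000
    msg = bytearray(1)   # 1-byte message, as in the commonly cited MPI latency figures
    buf = bytearray(1)

    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iterations):
        if rank == 0:
            comm.Send(msg, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(msg, dest=0, tag=0)
    elapsed = MPI.Wtime() - start

    if rank == 0:
        # MPI latency is reported as half the average round-trip time.
        print("one-way latency: %.2f usec" % (elapsed / iterations / 2 * 1e6))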
The Top500 is a semi-annually updated list of the world's highest performing supercomputers
based on the Linpack performance benchmark. Virtually all the systems on the list are designed
using some form of cluster interconnect. While some proprietary interconnects are represented on
the list, the majority of systems are based on Gigabit Ethernet, 10 Gigabit Ethernet, or InfiniBand.
On the most recent list, a total of 224 systems are based on GbE or 10 GbE. These
supercomputers are ranked from #42 down to #500, with most of these still being based on GbE.
[1] http://www.highfrequencytraders.com/news-wire/1013/solarflare-establishes-breakthrough-mpi-latency-performance-platform-mpi-81
[2] http://www.chelsio.com/assetlibrary/whitepapers/CHL_11Q2_IWARP_GA_TB_02_0405%20%283%29.pdf
[3] http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2010/masvdc10-hdfs-ib.pdf
There are 209 systems based on an InfiniBand interconnect, ranging from #4 down to #483. The highest performing 10 GbE system (#42, an Amazon EC2 cluster) has 17,024 cores and performance of 0.24 PetaFlops. The highest performing InfiniBand system (#4, the National Supercomputing Centre in Shenzhen, China) has 120,640 cores and performance of 1.27 PetaFlops. Obviously, performance is based on a number of factors besides bandwidth and latency, including the number of cores, the processing power per core, and whether or not general-purpose CPUs are complemented by Graphics Processing Units (GPUs).
As indicated by the Top500 list, if the goal is to build one of the Top 10 performing supercomputers in the world, the most popular choices are currently a proprietary interconnect or InfiniBand (in the current top 10 positions on the list, #1, #2, #3, #6, and #8 use a proprietary interconnect and the rest use InfiniBand). 10 GbE's greatest impact on the Top500 and HPC in general is expected to come over the next 3-4 years as 10 GbE costs ride the learning curve of high volume. One can only speculate what the performance of a 10 GbE cluster would be with 120,000 cores. For HPC at both research lab and enterprise performance levels, 10 GbE (and GbE) are certainly very viable alternatives, especially with the lowest latency NICs and switches that are available.
Note that the Linpack benchmark is only one performance indicator for HPC, since sensitivity to latency and bandwidth can vary significantly across a range of HPC applications. According to some tests performed by IBM [4] with a much broader range of HPC benchmarks, 10 GbE actually outperforms InfiniBand for a wide range of workloads.
Security Trading
Automated and algorithmic trading (a.k.a. high-frequency trading) has become widely adopted in
recent years. An automated system can take market data feeds and distribute them via messaging
to a number of trading stations and analytical programs. Analytical programs can then trigger
trades that are executed on various exchanges. In some markets, financial firms can derive a profit
from being less than one millisecond faster to act than competing firms, which drives them to
search for sub-millisecond optimizations in their trading solutions. Algorithmic trading
applications are sensitive to the predictability and consistency of latency as well as to how low the
mean latency is. One comment made at an AMD user group meeting was that "10 milliseconds of (end-to-end) delay could cost a security trading firm 10% of its profits."
Securities Technology Analysis Center (STAC) has developed a number of benchmark tests
designed to measure the effectiveness of solutions for disseminating market data, analyzing the
data, and executing orders based on that analysis. The STAC-M2 test measures the ability of a
solution to handle real-time market data in a variety of configurations. Leading trading firms on
the STAC Benchmark™ Council approved the STAC-M2™ Benchmark specifications as a meaningful, transparent way for vendors to demonstrate the performance of a high-speed messaging stack.
[4] Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet
For 10 GbE, the STAC web site [5] has audited results for the STAC-M2 Benchmarks executed with IBM's WebSphere MQ Low Latency Messaging (LLM) running on IBM Xeon servers with Solarflare's 10 GbE adapters and a cut-through 10 GbE switch with latency of less than 1 μsec. These results compare very favorably with STAC-M2 Benchmark results for QDR InfiniBand, also based on IBM WebSphere MQ Low Latency Messaging (LLM) running on IBM Xeon servers, with Mellanox ConnectX IB adapters, as shown in Table 1. These results show that 10 GbE delivered very low latency and demonstrated a superior degree of predictability, as indicated by a standard deviation of virtually zero and a lower maximum latency.
STAC-M2 metric                 | 10 GbE    | InfiniBand QDR
Maximum Message Rate Tested    | 1,500,000 | 1,000,000
Mean Latency (μsec)            | 9         | 8
99th Percentile Latency (μsec) | 12        | 11
Max Latency (μsec)             | 23        | 47
Standard Deviation (μsec)      | 0         | 1
Table 1: Comparison of 10 GbE and QDR InfiniBand for STAC-M2
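The statistics reported in Table 1 can be derived from raw per-message latency samples in a straightforward way. The sketch below shows the typical calculation; the sample values are made up for illustration and are unrelated to the audited STAC results.

    # Illustrative calculation of the latency statistics reported in benchmarks like STAC-M2.
    import statistics

    def latency_report(samples_us):
        """samples_us: per-message latencies in microseconds."""
        ordered = sorted(samples_us)
        p99_index = min(len(ordered) - 1, int(round(0.99 * len(ordered))))
        return {
            "mean": statistics.mean(ordered),
            "99th percentile": ordered[p99_index],
            "max": ordered[-1],
            "std dev": statistics.pstdev(ordered),
        }

    # Made-up sample data for illustration only.
    print(latency_report([9, 9, 10, 9, 11, 12, 9, 23, 9, 10]))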
Clustered Databases
Databases that are partitioned across a cluster of commodity servers have become an attractive
alternative to databases running on mainframes or high-end symmetric multiprocessing (SMP)
servers. Clustered databases offer significant economic advantages as well as providing a high
degree of reliability, availability and scalability.
Having a low latency cluster interconnect that is dedicated to inter-nodal communications allows
each node in the cluster to access data from all of the in-memory caches in the database cluster
before having to resort to reading data from SAN-based disk arrays. A dedicated cluster
interconnect network is required in order to avoid contention for bandwidth with other types of
traffic and hence ensure consistent latency and response times.
Oracle Real Application Clusters (RAC) employ a technology called Cache Fusion to unify the database caches resident in each of the nodes in the system. The Global Cache Service [6] (GCS) is a process that manages the status and the transfer of data blocks across the buffer caches of the cluster nodes to satisfy application requests. With IBM DB2 pureScale clustered databases [7], a cluster caching facility (CF) that is based on dedicated cluster members is used for synchronizing
the locking and caching information across the cluster. CF uses the global lock manager (GLM)
to keep pages consistent across all members and coordinate data access to a group buffer pool
(GBP) that emulates cluster-wide shared memory.
Interconnect end-to-end latency directly affects the time it takes to access blocks in the remote caches of clustered databases, and thus interconnect latency directly affects application scalability and performance.

[5] stacresearch.com
[6] http://docs.oracle.com/cd/B10501_01/rac.920/a96597/pslkgdtl.htm
[7] http://www-01.ibm.com/software/data/db2/linux-unix-windows/editions-features-purescale.html

Generally, there are three network alternatives for implementing a cluster
interconnect:
- GbE with UDP
- InfiniBand
- 10 GbE, possibly also with iWARP or RoCE (RDMA over Converged Ethernet)
The Oracle 11g Reference Architecture document [8] provides some relative performance data for these interconnects using the iGEN-OLTP benchmark. This document shows that the end-to-end application latency of 1-2 milliseconds for InfiniBand is lower than that of GbE/UDP by 50-66%. It also shows that the end-to-end latency for 10 GbE is about 50% lower than the end-to-end latency for GbE/UDP. The round trip latency includes the following components (recall that end-to-end latency is defined as half the round trip time):
- Request packet protocol processing at the originating node
- Network latency of the request (200 B)
- Request packet incoming processing at the target node
- Remote node processing of the request; this can vary significantly depending on the benchmark or application workload
- Response packet outgoing processing at the target node
- Network latency of the response (typically an 8K block)
- Response packet processing at the originating node
A good first-level approximation is that database application throughput is inversely proportional to end-to-end application latency. That means that both 10 GbE and InfiniBand can be expected to have more than twice the throughput of GbE. This assertion is supported by an Oracle RAC presentation [9].
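A minimal sketch of that first-level approximation is shown below. The latency figures are assumed round numbers in the ranges discussed above, used only to show how relative throughput would be estimated.

    # First-level approximation: relative throughput ~ 1 / end-to-end application latency.
    # Latencies are illustrative round numbers, not benchmark results.
    latencies_ms = {"GbE/UDP": 4.0, "InfiniBand": 1.5, "10 GbE": 2.0}

    baseline = latencies_ms["GbE/UDP"]
    for interconnect, latency_ms in latencies_ms.items():
        relative_throughput = baseline / latency_ms   # normalized so GbE/UDP = 1.0
        print(f"{interconnect}: ~{relative_throughput:.1f}x the throughput of GbE/UDP")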
With InfiniBand, the combination of switch latency, packet serialization latencies and host/NIC
packet processing latencies can be kept under 20-30 microseconds, which means that almost all of
the end-to-end application latency in the benchmark example cited in the architecture paper is due
to application processing on the server or servers that provide the remote cache hit.
With low latency switching and intelligent NICs, very similar results can be achieved using 10
GbE as the cluster interconnect. The relatively long application layer delay contribution to end-to-
end latency means that a small advantage in network latency for InfiniBand cannot in itself result
in significantly better performance for IBM DB2 pureScale or Oracle RAC. IBM DB2 pureScale
supports 10 GbE and InfiniBand interconnects on a more or less equal basis. Oracle promotes an
InfiniBand cluster interconnect solution for RAC 11g, although it also supports GbE and 10 GbE.
While this latter factor is worth considering, it should be kept in mind that Oracle owns a 10%
stake in Mellanox, one of the two remaining InfiniBand vendors.
[8] Sun Reference Architecture for Oracle 11g Grid, April 2010
[9] Oracle presentation: Oracle's Next-Generation Interconnect Protocol: Reliable Datagram Sockets (RDS) and InfiniBand
Storage Networking
Many of the benefits of server virtualization stem from hypervisor-based storage virtualization
that allows VMs to access their virtual disks via a logical name rather than a physical location.
This allows the vDisks to be repositioned in a networked storage system without requiring
reconfiguration of the VM or disruption of its operations. The popularity of server virtualization
has thus resulted in IT departments having a higher level of interest in both storage virtualization
and storage networking.
With storage networking, the key performance metrics have generally been I/O operations per second (IOPS) and I/O latency, which is another example of an end-to-end latency. Sometimes throughput in MBps is quoted as a separate metric, but MBps is simply IOPS times the block size in megabytes.
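As a quick illustration of that relationship, the short sketch below converts IOPS and block size into MBps; the input numbers are assumed examples only.

    # Throughput (MBps) is simply IOPS multiplied by the block size in megabytes.
    def throughput_mbps(iops, block_size_kb):
        return iops * (block_size_kb / 1024.0)

    # Assumed example: 350,000 IOPS at an 8 KB block size.
    print(throughput_mbps(350000, 8))   # ~2734 MBps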
VMware and NetApp recently issued a technical report [10] summarizing the results of testing
conducted to compare the performance of Fibre Channel, software-initiated iSCSI, and NFS
networked storage in an ESX 3.5 and vSphere environment using NetApp storage. The results
compare the performance of the three storage protocols with a goal of aiding customer decisions
among shared storage alternatives. The results also demonstrate the performance enhancements
made in going from ESX 3.5 to vSphere. The performance tests sought to simulate a "real-world" environment. The test and validation environment was composed of components and architectures commonly found in a typical VMware implementation, using the FC, iSCSI, and NFS protocols in an environment with multiple virtual machines (VMs) and multiple ESX 3.5 and/or vSphere hosts accessing multiple data stores. The performance tests used realistic I/O patterns, I/O block sizes,
read/write mixes, and I/O loads common to various operating systems and business applications.
The Fibre Channel results were based on 4 Gbps FC and an FC switch with a cut-through latency of <1 μsec. The intelligent FC adapter communicates directly with the operating system kernel via a device driver, allowing direct memory accesses (DMA) from the SCSI layers to the host computer's memory subsystem. The 10 GbE results were based on a Chelsio T320 intelligent adapter and a 10 GbE switch with a cut-through latency of <1 μsec.
Figures 4, 5, and 6 summarize the comparative test results for I/O latency, IOPS, and CPU utilization. In these results, VMware vSphere demonstrated the ability to support 350,000 I/O operations per second with just three virtual machines running on a single host, with I/O latency under 2 ms for both FC and 10 GbE. In Figure 4, FC shows slightly lower latency than 10 GbE iSCSI and, as a result, slightly higher IOPS in Figure 5. The greatest difference between these two technologies is the lower CPU utilization for FC, as shown in Figure 6. As noted earlier, CPU utilization is becoming less of a concern as the number of cores per processor chip increases and it becomes possible to devote an entire core to protocol processing.
[10] VMware vSphere and ESX 3.5 Multiprotocol Performance Comparison Using FC, iSCSI, and NFS
Figure 4: Comparative I/O Latencies

Figure 5: Comparative I/Os per Second

Figure 6: Comparative CPU Utilization
Big Data with MapReduce/Hadoop
The open-source Hadoop framework has given rise to the broad application of the MapReduce
paradigm for searching and analyzing massive amounts of unstructured data. The traditional
relational technologies have simply not been able to keep up with the explosive growth of new
types of data.
Hadoop was designed based on a new approach to storing and processing complex data. Instead
of storing data on a SAN for shared accessibility and reliability, each node in a Hadoop cluster both processes data and stores data on direct-attached storage (DAS). Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be done on the nodes storing the data. Throughput for I/O-bound workloads can be improved via the use of distributed in-memory caching of data and by using fast disks, in particular SSDs rather than HDDs.
The DFSIO, Sort, and Random Write performance benchmarks are part of the Hadoop distribution. DFSIO measures the read and write performance of the cluster. OSU [11] used the test to measure the sequential access throughput of Hadoop using Map tasks that wrote files with sizes between 1 and 10 GB. The tests were run with GbE, 10 GbE, and InfiniBand DDR cluster interconnects, with the cluster nodes first using Hard Disk Drives (HDD) and then Solid State Drives (SSD). The results showed that with HDD, throughput improved by about 30% by using a low latency cluster interconnect rather than GbE. Switching a GbE cluster to SSD did not result in a significant improvement in throughput. However, with low latency 10 GbE or InfiniBand in conjunction with SSD, performance improved by approximately 6X compared to GbE/HDD. In all of the benchmark tests that were run, the low latency 10 GbE interconnect, in conjunction with Chelsio T-320 adapters, produced performance results that were better than or equivalent to those for InfiniBand DDR.

[11] Can High-Performance Interconnects Benefit Hadoop Distributed File System?
As big data technology evolves, there is growing interest in being able to assimilate, analyze and
respond to data events in near real-time, which requires millisecond-scale access and processing
speeds. Examples of real time applications of big data include analysis of high-volume web session
and user data, reacting to high-speed financial market feeds, aggregating distributed sensor grid
events, processing social network messages and connections, or providing real-time intelligence
and entity classification.
Cassandra is an open source distributed database management system that can be integrated with
Hadoop together with other open source big data utilities, such as Hive, Hbase, and Solr.
Cassandra provides improved management of real time data that complements the batch analytic
capabilities of Hadoop. Cassandra enhances the linear scalability of the cluster, provides for data
replication within the cluster or over the wide area, and provides distributed in-memory caching of
data that improves read and write performance. As nodes are added to the Hadoop/Cassandra
cluster, the size of the distributed cache increases, allowing as much of the data as required to reside in memory rather than on disk.
The distributed in-memory caching of Cassandra should result in a further improvement in
performance when combined with SSD storage and low latency interconnect. However, no test
results have been published yet for this combination of technologies.
Factors to Consider in TCO Comparisons of Low Latency Network Solutions
When competing network solutions offer equivalent performance, the decision of which to choose
often hinges on a TCO analysis. A comprehensive TCO analysis goes well beyond price per
switch port and power consumption per port, especially when two rather divergent technologies
are being considered. In a comprehensive TCO analysis comparing a low latency 10GbE solution
to one based on InfiniBand or Fibre Channel some of the things to be considered include:
Network Convergence: With 10 GbE, it is possible to avoid the complexity of multiple adapter technologies on a single server and additional switching technologies for cluster interconnect and networked storage. Network convergence also has OPEX benefits by simplifying network management and allowing less duplication and fragmentation of administrative staff along technology boundaries.
Seamless Upgrades: 10 GbE is plug-and-play compatible with 1 GbE and with 40 GbE and 100 GbE. 10 GbE fully preserves software and hardware investments in sockets-based applications and applications written for TCP/UDP/IP. 10 GbE supports the structured Cat5E cabling already installed in the data center.
Technology Evolution: 10 GbE leverages Ethernet's declining cost curve, which is driven by high volume production. The next 3 to 4 years will be the major growth period for 10 GbE, supporting the expectation that 10 GbE adapter and switch prices will follow a cost reduction curve similar to that of GbE. More specialized fabric technologies will never be able to achieve the same cost/volume advantages.
Technology Continuity: Ethernet is a true industry standard with a long history of innovations coming from a broad range of vendors. This has resulted in a rich set of solutions supported by an extensive ecosystem of vendors. InfiniBand, on the other hand, can be considered a pseudo industry standard in view of the fact that the "industry" has shrunk to two vendors, one of which is the sole source of IB switch chips and holds 90% of the market.
Summary
The performance of latency sensitive network applications is largely determined by end-to-end
latency, rather than switch latency or network latency. Maximizing performance for these
applications means minimizing every component of end-to-end latency, with special attention paid
to the network adapter and host protocol stack contributions to latency.
For all the latency sensitive applications examined in this document low latency 10 GbE offers
end-to-end latencies and levels of performance that are highly competitive with those of the
specialty interconnects.
Low latency 10 GbE therefore makes it highly feasible for IT departments to optimize the
simplicity and homogeneity of their data center LANs by pursuing network technology
convergence, and even fabric convergence, as they increasingly leverage applications that depend
on low, consistent latency together with high application throughput. By taking maximum
advantage of existing expertise with Ethernet technology and management tools, IT departments
can future-proof their IT investments while minimizing a number of the components of OPEX that
contribute significantly to TCO.