10GbE WAN Data Transfers for Science
High Energy/Nuclear Physics (HENP) SIG, Fall 2004 Internet2 Member Meeting
Yang Xia, HEP, Caltech
[email protected]
September 28, 2004, 8:00 AM – 10:00 AM



Page 2: Agenda

Introduction
10GbE NIC comparisons & contrasts
Overview of LHCnet
High TCP performance over wide area networks:
  Problem statement
  Benchmarks
  Network architecture and tuning
  Networking enhancements in Linux 2.6 kernels
Light paths: UltraLight
FAST TCP protocol development

Page 3: Introduction

The High Energy Physics LHC computing model shows that data at the experiments will be stored at rates of 100 – 1500 MBytes/sec throughout the year.

Many Petabytes per year of stored and processed binary data will be accessed and processed repeatedly by the worldwide collaborations.

Network backbone capacities are advancing rapidly into the 10 Gbps range, with seamless integration into SONET.

Proliferating GbE adapters on commodity desktops create a bottleneck at GbE switch I/O ports.

More commercial 10GbE adapter products are entering the market, e.g. Intel, S2io, IBM, Chelsio.

Page 4: IEEE 802.3ae Port Types

Port Type      Wavelength / Fiber Type     WAN/LAN   Maximum Reach
10GBase-SR     850nm / MMF                 LAN       300m
10GBase-LR     1310nm / SMF (LAN-PHY)      LAN       10km
10GBase-ER     1550nm / SMF                LAN       40km
10GBase-SW     850nm / MMF                 WAN       300m
10GBase-LW     1310nm / SMF (WAN-PHY)      WAN       10km
10GBase-EW     1550nm / SMF                WAN       40km
10GBase-CX4    InfiniBand 4x Twinax cables LAN       15m
10GBase-T      Twisted-pair                LAN       100m

Page 5: 10GbE NICs Comparison (Intel vs S2io)

Standard support: 802.3ae standard, full duplex only; 64-bit/133MHz PCI-X bus; 1310nm SMF / 850nm MMF; jumbo frame support.

Major differences in performance features:

  PCI-X bus DMA split transaction capacity:  S2io 32, Intel 2
  Rx frame buffer capacity:                  S2io 64MB, Intel 256KB
  MTU:                                       S2io 9600 Byte, Intel 16114 Byte
  IPv4 TCP Large Send Offload:               S2io max offload size 80k; Intel partial, max offload size 32k

Page 6: LHCnet Network Setup

10 Gbps transatlantic link extended to Caltech via Abilene and CENIC; the NLR wave local loop is a work in progress.
High-performance end stations (Intel Xeon & Itanium, AMD Opteron) running both Linux and Windows.
We have added a 64x64 non-SONET all-optical switch from Calient to provision a dynamic path via MonALISA, in the context of UltraLight.

Page 7: LHCnet Topology: August 2004

[Diagram: LHCnet topology. Caltech/DoE PoP (Chicago) and CERN (Geneva), each with Juniper T320, Juniper M10, Cisco 7609, Procket 8801 and Alcatel 7770 routers plus a Linux farm (20 P4 CPUs, 6 TBytes)]

Services: IPv4 & IPv6; Layer 2 VPN; QoS; scavenger; large MTU (9k); MPLS; link aggregation; monitoring (MonALISA)
Clean separation of production and R&D traffic based on CCC
Unique multi-platform / multi-technology optical transatlantic test-bed
Powerful Linux farms equipped with 10 GbE adapters (Intel; S2io)
Equipment loans and donations; exceptional discounts
NEW: Photonic switch (Glimmerglass T300) evaluation
Circuit ("pure" light path) provisioning

[Diagram: LHCnet testbed between StarLight and CERN: OC-192 (production and R&D) plus multiple 10GE links connecting the internal network, European partners, American partners and the Glimmerglass switch]

Page 8: LHCnet Topology: August 2004 (cont'd)

[Figure: Optical switch matrix, Calient photonic cross-connect switch]

GMPLS-controlled PXCs and IP/MPLS routers can provide dynamic shortest-path setup, as well as path setup based on link priority.

Page 9: Problem Statement

1. To get the most bang for the buck on a 10GbE WAN, packet loss is the #1 enemy, because TCP responds slowly under the AIMD algorithm (a worked example follows below):

   No loss: cwnd := cwnd + 1/cwnd (per ACK)
   Loss:    cwnd := cwnd / 2

2. Fairness: TCP Reno MTU & RTT bias

Different MTUs and delays lead to very poor sharing of the bandwidth.
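To see why a single loss is so costly at these speeds, a back-of-the-envelope check (using the responsiveness formula quoted later in this deck, p = C*RTT^2/(2*MSS), and consistent with the responsiveness table on the TSO slides) for the Geneva-Chicago path with a 1500-byte MTU:

  p = \frac{C \cdot \mathrm{RTT}^2}{2 \cdot \mathrm{MSS}}
    = \frac{1.25\times10^{9}\ \mathrm{B/s} \times (0.12\ \mathrm{s})^2}{2 \times 1460\ \mathrm{B}}
    \approx 6.2\times10^{3}\ \mathrm{s} \approx 103\ \text{minutes}

In other words, a single lost packet costs well over an hour of reduced throughput on this path unless a larger MTU, TSO, or a different congestion control algorithm is used.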

Page 10: Internet2 Land Speed Record (LSR)

IPv6 record: 4.0 Gbps between Geneva and Phoenix (SC2003)
IPv4 multi-stream record with Windows: 7.09 Gbps between Caltech and CERN (11k km)
Single stream: 6.6 Gbps x 16.5k km with Linux
We have exceeded 100 Petabit-m/sec with both Linux and Windows
Testing on different WAN distances does not seem to change the TCP rate:

  7k km (Geneva - Chicago)
  11k km (Normal Abilene Path)
  12.5k km (Petit Abilene's Tour)
  16.5k km (Grande Abilene's Tour)


[Chart: monitoring of the Abilene traffic in LA]

Page 11: Internet2 Land Speed Record (cont'd)

[Chart: LSR history, single-stream IPv4 category]

Page 12: Primary Workstation Summary

Sending stations:
  Newisys 4300: 4 x AMD Opteron 248 2.2GHz, 4GB PC3200 per processor; up to 5 x 1GB/s 133MHz/64-bit PCI-X slots; no FSB bottleneck; HyperTransport connects the CPUs (up to 19.2GB/s peak bandwidth per processor); 24-disk SATA RAID system at 1.2GB/s read/write.
  Opteron white box with Tyan S2882 motherboard: 2 x Opteron 2.4GHz, 2GB DDR; AMD8131 chipset, PCI-X bus speed ~940MB/s.

Receiving station:
  HP rx4640: 4 x 1.5GHz Itanium 2, zx1 chipset, 8GB memory; SATA disk RAID system.

Page 13: Linux Tuning Parameters

1. PCI-X bus parameters (via the setpci command; see the sketch after this list):
   Maximum Memory Read Byte Count (MMRBC) controls PCI-X transmit burst lengths on the bus. Available values are 512 Byte (default), 1024, 2048 and 4096 Byte.
   max_split_trans controls the number of outstanding splits. Available values are 1, 2, 3, 4.
   Set latency_timer to 248.

2. Interrupt coalescence and affinity: coalescing batches packet interrupts to reduce per-packet CPU overhead, and the CPU affinity of the interrupts in a system can be changed to pin the NIC's interrupts to a chosen CPU.

3. Large window size = BW x delay (the bandwidth-delay product, BDP). Too large a window size will negatively impact throughput.

4. 9000-byte MTU and 64KB TSO.
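A rough sketch of how items 1 and 4 are typically applied on the sending host; the device ID and interface name below are illustrative, and the e6.b register offset is specific to the Intel PRO/10GbE adapter family as circulated in its driver tuning notes, so verify against your own hardware (e.g. with lspci -vv) before using:

  # PCI-X MMRBC: value 0x2e requests 4096-Byte read bursts (per the adapter's tuning notes)
  setpci -d 8086:1a48 e6.b=2e
  # PCI latency timer to 248 (0xf8)
  setpci -d 8086:1a48 latency_timer=f8
  # 9000-byte jumbo frames on the 10GbE interface
  ifconfig eth2 mtu 9000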

Page 14: Linux Tuning Parameters (cont'd)

5. Use the sysctl command to modify /proc parameters and increase the TCP memory values (a sketch follows).
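For a 10 Gbps path with an RTT of ~180 ms, the bandwidth-delay product is roughly 1.25 GB/s x 0.18 s, about 225 MB, so the kernel's TCP buffer ceilings must be raised far beyond their defaults. A minimal sketch of the kind of settings used; the exact values are illustrative, not the values from the record runs:

  # raise the socket buffer ceilings to 256 MB
  sysctl -w net.core.rmem_max=268435456
  sysctl -w net.core.wmem_max=268435456
  # min / default / max TCP receive and send buffer sizes (bytes)
  sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
  # allow a deeper per-interface packet backlog
  sysctl -w net.core.netdev_max_backlog=30000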

Page 15: 10GbE Network Testing Tools

In Linux:

Iperf: version 1.7.0 does not work by default on the Itanium 2 machine. Workarounds: 1) compile using Red Hat's gcc 2.96, or 2) make it single-threaded. The UDP send rate is limited to 2Gbps because of a 32-bit data type. (See the usage sketch after this list.)

Nttcp: measures the time required to send a preset chunk of data.

Netperf (v2.1): sends as much data as it can in an interval and collects the results at the end of the test. Great for end-to-end latency tests.

Tcpdump: a challenging task on a 10GbE link.

In Windows:

NTttcp: uses native Windows APIs
Microsoft Network Monitoring Tool
Ethereal
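As a usage sketch for the memory-to-memory tests described above (host name, window size and duration are illustrative; the large -w only takes effect if the buffer limits from the tuning slides have been raised):

  # receiver
  iperf -s -w 256M
  # sender: single TCP stream, 256 MB window, 60-second test, report every 5 s
  iperf -c <receiver-host> -w 256M -t 60 -i 5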

Page 16: Networking Enhancements in Linux 2.6

The 2.6.x Linux kernel has made many improvements to overall system performance, scalability and hardware drivers.

Improved POSIX threading support (NGPT and NPTL)
Support for AMD 64-bit (x86-64) and improved NUMA support
TCP Segmentation Offload (TSO)
Network interrupt mitigation: improved handling of high network loads
Zero-copy networking and NFS: one system call with sendfile(sd, fd, &offset, nbytes)
NFS version 4

Page 17: TCP Segmentation Offload

Must have hardware support in the NIC. It is a sender-only option. It allows the TCP layer to send a larger-than-normal segment of data, e.g. 64KB, to the driver and then to the NIC. The NIC then fragments the large packet into smaller (<= MTU) packets.

TSO is disabled in multiple places in the TCP functions: it is disabled when SACKs are received (in tcp_sacktag_write_queue) and when a packet is retransmitted (in tcp_retransmit_skb). However, TSO is never re-enabled in the current 2.6.8 kernel when the TCP state changes back to normal (TCP_CA_Open), so the kernel needs to be patched to re-enable TSO.
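TSO can be inspected and toggled from user space, assuming the driver exposes the offload setting; a small sketch with an illustrative interface name:

  # show current offload settings, including TCP segmentation offload
  ethtool -k eth2
  # enable TSO
  ethtool -K eth2 tso on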

Benefits: TSO can reduce CPU overhead by 10-15% and increases TCP responsiveness.

p = (C * RTT^2) / (2 * MSS)

p:   time to recover to full rate
C:   capacity of the link
RTT: round-trip time
MSS: maximum segment size
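Checking the formula against the table on the next slide, the Geneva-LA normal path at 10Gbps with a 9000-byte MTU (MSS of roughly 8960 Byte) gives:

  p = \frac{C \cdot \mathrm{RTT}^2}{2 \cdot \mathrm{MSS}}
    = \frac{1.25\times10^{9}\ \mathrm{B/s} \times (0.18\ \mathrm{s})^2}{2 \times 8960\ \mathrm{B}}
    \approx 2.3\times10^{3}\ \mathrm{s} \approx 38\ \text{minutes}

With 64KB TSO the effective segment handed to the NIC grows to about 64KB, which is why the long-path entry in the table drops from 74 minutes to roughly 10 minutes.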

Page 18: Responsiveness with and without TSO

Path                               BW       RTT (s)  MTU    Responsiveness (min)  With delayed ACK (min)
Geneva-LA (normal path)            10Gbps   0.18     9000   38                    75
Geneva-LA (long path)              10Gbps   0.252    9000   74                    148
Geneva-LA (long path w/ 64KB TSO)  10Gbps   0.252    9000   10                    20
LAN                                10Gbps   0.001    1500   428ms                 856ms
Geneva-Chicago                     10Gbps   0.12     1500   103                   205
Geneva-LA (normal path)            1Gbps    0.18     1500   23                    46
Geneva-LA (long path)              1Gbps    0.252    1500   45                    91
Geneva-LA (long path w/ 64KB TSO)  1Gbps    0.252    1500   1                     2

Page 19: The Transfer over 10GbE WAN

With a 9000-byte MTU and a stock Linux 2.6.7 kernel:

  LAN: 7.5Gb/s / WAN: 7.4Gb/s (receiver is CPU bound)

We have reached the PCI-X bus limit with a single NIC. Using bonding (802.3ad) of multiple interfaces we could bypass the PCI-X bus limitation, in the multiple-streams case only (a configuration sketch follows):

  LAN: 11.1Gb/s / WAN: ??? (a.k.a. doomsday for Abilene)
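A minimal sketch of the 802.3ad bonding setup mentioned above; interface names and addressing are illustrative, and the switch ports must be configured for LACP:

  # load the bonding driver in 802.3ad (LACP) mode
  modprobe bonding mode=802.3ad miimon=100
  # bring up the bond and enslave the two 10GbE interfaces
  ifconfig bond0 192.0.2.10 netmask 255.255.255.0 up
  ifenslave bond0 eth2 eth3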

Page 20: UltraLight: Developing Advanced Network Services for Data Intensive HEP Applications

UltraLight (funded by NSF ITR): a next-generation hybrid packet- and circuit-switched network infrastructure.
  Packet-switched: cost-effective solution; requires ultrascale protocols to share 10G efficiently and fairly.
  Circuit-switched: scheduled or sudden "overflow" demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, Atlantic, Canada, ...
Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component.
Using MonALISA to monitor and manage global systems.
Partners: Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack; CERN, Internet2, NLR, CENIC; Translight, UKLight, Netherlight; UvA, UCL, KEK, Taiwan.
Strong support from Cisco and Level(3).

Page 21: "Ultrascale" Protocol Development: FAST TCP

FAST TCP:
  Based on TCP Vegas
  Uses end-to-end delay and loss to dynamically adjust the congestion window (see the window update sketched below)
  Achieves any desired fairness, expressed by a utility function
  Very high utilization (99% in theory)
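For reference, the periodic window update published in the FAST TCP literature (not spelled out on this slide) has the form below, where baseRTT is the minimum observed round-trip time, gamma lies in (0, 1], and alpha sets how many packets each flow keeps queued in the path:

  w \leftarrow \min\left\{\, 2w,\ (1-\gamma)\,w + \gamma\left(\frac{\mathrm{baseRTT}}{\mathrm{RTT}}\,w + \alpha\right) \right\}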

Comparison to other TCP variants, e.g. BIC, Westwood+:

[Chart: single-flow throughput comparison of Linux TCP, Linux Westwood+, Linux BIC TCP and FAST TCP over an OC-192 path (capacity 9.5Gbps, 264 ms round-trip latency, 1 flow); measured bandwidth use across the variants: 30%, 40%, 50% and 79%]

Page 22: Summary and Future Approaches

A full TCP offload engine will be available for 10GbE in the near future. There is a trade-off between maximizing CPU utilization and ensuring data integrity.

Develop and provide the cost-effective transatlantic network infrastructure and services required to meet the HEP community's needs:
  a highly reliable, high-performance production network, with rapidly increasing capacity and a diverse workload;
  an advanced research backbone for network and Grid developments, including operations and management assisted by agent-based software (MonALISA).

Concentrate on reliable Terabyte-scale file transfers, to drive development of an effective Grid-based Computing Model for LHC data analysis.