PCI Express switch over Ethernet or Distributed IO Systems for Ubiquitous Computing and IoT Solutions
02, March, 2017
Deepak Pathania, NEC Corporation
2 © NEC Corporation 2017
Evolution of Data for Processing, Storage and Analytics
(Figure: Traditional Data versus Big Data)
“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days”, Google, 2010
Challenge faced in Real-Time Data Analytics
Big data of varying characteristics, such as live feeds, graphics, video and text, flows into cloud computers
This data has to be processed and analyzed in real time (real-time analytics, deep learning, etc.)
To accelerate such processing, a large number of accelerators such as GPUs and FPGAs, along with high-speed storage, are required
However, instead of building servers with such accelerators, cloud vendors still prefer building homogeneous servers due to TCO and efficiency considerations
(Figure labels: Actionable Information, Real-Time Feedback; accelerators shown: Xeon Phi, GPU, FPGA)
What is ExpEther?
A technology that extends the PCI Express bus beyond the confines of a computer chassis via Ethernet, without any modification of existing hardware or software
(Figure: a server (CPU, memory, PCI Express) with an ExpEther NIC containing an ExpEther Engine connects over standard Ethernet through an L2 switch to an IO expansion unit, where a second ExpEther Engine attaches the PCIe cards (IO devices) over PCI Express.)
Broad-Scale Single Computer
A PCI Express switch is equivalent to an Ethernet fabric.
ExpEther can build a new type of computing environment without physical constraints
(Figure: CPUs connect through ExpEther Engines and Ethernet switches to IO devices located in the same rack, in the next rack, on another floor and in another building. A PCIe switch with directly attached IO devices is also shown.)
Just Like a standard PCIe
The ExpEther Engine is seen as a PCIe switch from the CPU
The Ethernet region is invisible from the CPU
ExpEther is one example of an implementation of a PCIe switch
(Figure: two equivalent topologies. PCIe switch: CPU, upstream port (PCI bridge), internal PCI bus, downstream ports (PCI bridges), IO devices. ExpEther: CPU, ExpEther Engine (PCI bridge), Ethernet switch forming an invisible Ethernet fabric, ExpEther Engines (PCI bridges), IO devices. The links to the CPU and to the IO devices are PCI Express.)
ExpEther Architecture
▌Achieve the “System on Network”
Merge PCI Express technology into Ethernet technology
▌Connect logically in the MAC layer
No impact on the layers above or below in the PCIe and Ethernet standards, which leaves room for future expansion
(Figure: ExpEther and Ethernet protocol stacks side by side)
            ExpEther               Ethernet
Software    Application            Application
            OS                     OS
            PCI Driver             NDIS Driver
            EFI/PCI BIOS
Hardware    ExpEther Logic         Ethernet Logic
            MAC                    MAC
            PHY (40G, 10G, 1G)     PHY (10M, 100M, 1G, 10G, 40G)
Upper compatible: no modification for future expansion of ExpEther or Ethernet
Features of ExpEther
▐ ExpEther Engine is compliant with the PCIe and Ethernet standards
PCI-SIG PCI Express certified
Can use an off-the-shelf L2 Ethernet switch
1 Equivalent to direct connection (Ethernet is invisible from CPU/IO)
2 Low latency (L2 Ether without a software stack)
3 No packet loss (adding reliability to Ethernet)
4 I/O dynamic reconfiguration (hot-plug scheme)
(Figure: a CPU connects over PCI Express to an ExpEther Engine, then through an Ethernet switch (Ethernet fabric) to further ExpEther Engines, each attached over PCI Express to an I/O device. An Ether frame is shown carrying an EE field and a PCI Express TLP.)
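The idea of carrying a PCI Express TLP inside an Ethernet frame can be sketched as follows. This is an illustration only: the slides do not give the real frame layout, so the 4-byte sequence header, the function names and the EtherType 0x88B5 (one of the IEEE local experimental values) are assumptions rather than ExpEther's actual format.

```python
import struct
from typing import Tuple

# Assumption: an IEEE "local experimental" EtherType, not ExpEther's real value.
ASSUMED_ETHERTYPE = 0x88B5


def encapsulate(dst_mac: bytes, src_mac: bytes, seq: int, tlp: bytes) -> bytes:
    """Wrap a raw TLP in an Ethernet header plus a hypothetical sequence header."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    ether_header = dst_mac + src_mac + struct.pack("!H", ASSUMED_ETHERTYPE)
    ee_header = struct.pack("!I", seq)  # hypothetical header: sequence number only
    return ether_header + ee_header + tlp


def decapsulate(frame: bytes) -> Tuple[int, bytes]:
    """Return (sequence number, TLP bytes) from a frame built by encapsulate()."""
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ASSUMED_ETHERTYPE:
        raise ValueError("not an encapsulated TLP frame")
    (seq,) = struct.unpack("!I", frame[14:18])
    return seq, frame[18:]
```

Because the wrapping happens below the PCIe driver, neither the CPU nor the I/O device needs to know that the TLP crossed an Ethernet link.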
PCIe Timeout Spec and ExpEther
(Figure: two topologies. PCIe switch: CPU (RC) connected over PCI Express to an upstream port (PCI bridge), an internal PCI bus and downstream ports (PCI bridges), each with an I/O device. ExpEther: CPU (RC) connected to an ExpEther Host Chip (PCI bridge), an Ethernet switch and ExpEther IO Chips (PCI bridges), each with an I/O device.)
▐ It is difficult to extend PCIe over long distances by cable because of the DLLP timeout rule
The DLLP timeout is less than 200 usec (depending on the chipset)
The TLP timeout is 50 msec, but can be extended to 64 seconds by configuration
▐ ExpEther is unaffected by the DLLP timeout
IO devices can therefore be placed at long reach
(Figure annotations: DLLP timeout specified on the order of usec; TLP timeout specified on the order of msec; no timeout rule)
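A back-of-the-envelope calculation shows why a microsecond-scale timeout limits cable reach while a millisecond-scale one does not. The sketch below assumes a propagation speed of about 2e8 m/s in optical fiber (a typical figure, not taken from the slides) and ignores switch and serialization delays; the two timeout constants are the values quoted above.

```python
# Assumption: signals travel at roughly 2e8 m/s in fiber (about 5 ns per metre).
PROPAGATION_M_PER_S = 2.0e8

DLLP_TIMEOUT_USEC = 200.0    # upper bound quoted on the slide (chipset dependent)
TLP_TIMEOUT_USEC = 50_000.0  # 50 msec default quoted on the slide


def round_trip_usec(distance_m: float) -> float:
    """Round-trip propagation delay in microseconds, ignoring all other delays."""
    return 2.0 * distance_m * 1e6 / PROPAGATION_M_PER_S


def exceeds(distance_m: float, timeout_usec: float) -> bool:
    """True if propagation alone already uses up the whole timeout budget."""
    return round_trip_usec(distance_m) >= timeout_usec


if __name__ == "__main__":
    for distance in (100, 1_000, 20_000):
        print(distance, "m:",
              round_trip_usec(distance), "usec round trip,",
              "DLLP exceeded" if exceeds(distance, DLLP_TIMEOUT_USEC) else "DLLP ok,",
              "TLP exceeded" if exceeds(distance, TLP_TIMEOUT_USEC) else "TLP ok")
```

Under these assumptions a 20 km link consumes the entire 200 usec budget on propagation alone, whereas the 50 msec TLP timeout leaves ample headroom. A chipset with a shorter DLLP timeout would hit the limit at a proportionally shorter distance.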
Dual Path for Throughput and Reliability
▌Two Ethernet connections are established between the Host Chip and I/O Chip
Load balancing for performance
Path redundancy for failure recovery
Load balancing: round-robin data packet transmission between the two redundant connections
Failure recovery: quickly detects path failures and switches paths
(Figure: a CPU with a dual-port 10G ExpEther NIC (ExpEther Host Chip) connects through Ethernet Fabric-I and Ethernet Fabric-II to ExpEther IO Chips, each with an I/O device.)
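Round-robin transmission over two connections with failover to the surviving path can be sketched as below. The class name, path labels and methods are hypothetical and only illustrate the selection logic; the real mechanism is implemented in the ExpEther chips, and its failure detection is not described in the slides.

```python
class DualPathSender:
    """Round-robin path selection over two connections, skipping failed paths."""

    def __init__(self, paths=("fabric-1", "fabric-2")):
        self.paths = list(paths)
        self.alive = {path: True for path in self.paths}
        self._next = 0  # index of the path to try first for the next frame

    def mark_failed(self, path):
        self.alive[path] = False

    def mark_recovered(self, path):
        self.alive[path] = True

    def pick_path(self):
        """Return the path for the next frame, or raise if every path is down."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if self.alive[path]:
                return path
        raise RuntimeError("no Ethernet path available")


sender = DualPathSender()
normal = [sender.pick_path() for _ in range(4)]    # alternates between both fabrics
sender.mark_failed("fabric-2")
degraded = [sender.pick_path() for _ in range(3)]  # only the surviving fabric is used
```

With both paths up, frames alternate and the two links share the load; once one path is marked failed, every frame goes to the other path until it is marked recovered.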
Frame Rate Control
TCP/IP : Rate control is triggered by packet loss (TCP Reno)
(Figure: network bandwidth over time, with a slow-start phase followed by repeated congestion-avoidance phases)
Packet loss causes significant performance degradation because of retransmission.
ExpEther : Rate control is always done by measuring network latency
(Figure: network bandwidth over time, with a probing phase followed by congestion avoidance)
In ExpEther, packet loss basically does not occur.
The ExpEther engine continuously measures frame arrival times on the receive side and finely adjusts the frame rate to avoid packet loss.
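The difference between loss-triggered and latency-triggered rate control can be illustrated with a generic delay-based controller. The slides do not describe ExpEther's actual algorithm, so the function below, including its threshold, step and backoff parameters, is an assumed sketch in the spirit of delay-based congestion control, not NEC's implementation.

```python
def adjust_rate(rate, base_delay_us, measured_delay_us,
                threshold=1.2, step=50.0, backoff=0.8,
                min_rate=1.0, max_rate=10_000.0):
    """Return the next frame rate given the latest delay measurement.

    If the measured delay rises noticeably above the uncongested baseline,
    queues are building up, so the rate is reduced before any frame is dropped.
    Otherwise the rate is increased slowly to probe for spare bandwidth.
    """
    if measured_delay_us > threshold * base_delay_us:
        rate = rate * backoff   # back off early, before loss occurs
    else:
        rate = rate + step      # probe for more bandwidth
    return max(min_rate, min(max_rate, rate))


rate = 1000.0
rate = adjust_rate(rate, base_delay_us=10.0, measured_delay_us=10.0)  # no queueing: probe up
rate = adjust_rate(rate, base_delay_us=10.0, measured_delay_us=20.0)  # delay doubled: back off
```

The design point is that rising delay is an earlier congestion signal than loss, so the sender can slow down without ever paying for a retransmission.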
Loss-less ExpEther Frame
▌Ethernet may lose packets, but PCIe does not allow losing any TLP.
▌ExpEther ensures that packets reliably reach the far end by using an ACK/NACK scheme over Ethernet.
(Figure: sender/receiver sequence diagram with steps ① to ④. The sender transmits Seq 1 to Seq 5, with a timer reset shown at each frame. The receiver sets an ACK timer; when the ACK timer expires, an ACK is returned and the sender releases Seq 1 to Seq 5 from its buffer. The same steps repeat for Seq 6 to Seq 8. When the sender's timer expires but there is no frame in the buffer, retransmission is not started.)
(Figure: two bridges across the Ether network exchanging ExpEther packets 1 to 5 and an ACK packet, with retransmission of packet 2.)
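The sender side of such an acknowledgement scheme can be sketched as a retransmission buffer that is released by cumulative ACKs. The class and method names are hypothetical, and the sketch leaves out NACK handling and real timers; it only shows why a timer expiry with an empty buffer triggers no retransmission.

```python
class ReliableSender:
    """Keeps every sent frame until it is acknowledged (hypothetical sketch)."""

    def __init__(self):
        self.next_seq = 1
        self.buffer = {}  # sequence number -> payload, for unacknowledged frames

    def send(self, payload):
        """Assign the next sequence number and keep a copy for retransmission."""
        seq = self.next_seq
        self.buffer[seq] = payload
        self.next_seq += 1
        return seq, payload

    def on_ack(self, acked_up_to):
        """Cumulative ACK: release every buffered frame up to acked_up_to."""
        for seq in [s for s in self.buffer if s <= acked_up_to]:
            del self.buffer[seq]

    def on_timer_expire(self):
        """Return the frames to retransmit; empty if everything was acknowledged."""
        return sorted(self.buffer.items())


sender = ReliableSender()
for i in range(5):
    sender.send(f"tlp-{i + 1}")
sender.on_ack(5)                           # buffer released for Seq 1 to Seq 5
nothing_to_resend = sender.on_timer_expire()
for i in range(5, 8):
    sender.send(f"tlp-{i + 1}")
sender.on_ack(6)                           # only Seq 6 acknowledged so far
to_resend = sender.on_timer_expire()       # Seq 7 and Seq 8 are still buffered
```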
ExpEther Reliability ~ Multi-Path
▌Multi-Path IO (MPIO)
MPIO is one technique for achieving high reliability. If the target IO device supports MPIO, MPIO also works under ExpEther.
▌Multi-Path Ethernet
ExpEther supports high-speed network path failover.
(Figure: two equivalent configurations. Direct attach: a host with SAS HBA#0 and SAS HBA#1 connected active-active (MPIO) to a SAS JBOD. ExpEther: a host with EE NIC#0 connected through two Ether switches to two EEs carrying SAS HBA#0 and SAS HBA#1, which connect by MPIO to a SAS JBOD; the Ethernet section provides high-speed network failover.)
Sequence of Network Path Failover (1/2)
▌Both network paths are used as ACT-ACT
(Figure: the EE NIC on the Tx side (retransfer buffer, arbiter) sends ExpEther packets over two paths of Ether switches to the EE on the Rx side (two receive buffers, ordering buffer). Numbered packets are distributed across both paths.)
▐ If a path fails, ExpEther resends the lost packets. The failover time is about 10 RTT (several microseconds).
(Figure: the same setup with a lost packet on one path. The Rx side performs a sequence number check, the Tx side resends, and the packets are received again after several microseconds.)
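The receive-side sequence number check and ordering buffer can be sketched as follows. This is an assumed, simplified model: the class name and methods are hypothetical, and the real buffers are hardware structures whose sizes and behavior are not given in the slides.

```python
class OrderingBuffer:
    """Delivers frames in sequence order even when two paths reorder or drop them."""

    def __init__(self, first_seq=1):
        self.expected = first_seq  # next sequence number to deliver
        self.pending = {}          # out-of-order frames waiting for a gap to fill

    def receive(self, seq, payload):
        """Accept one frame and return the payloads that are now deliverable in order."""
        if seq < self.expected or seq in self.pending:
            return []              # duplicate, e.g. a resent frame that had already arrived
        self.pending[seq] = payload
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

    def missing(self):
        """Sequence numbers below the highest buffered frame that have not arrived."""
        if not self.pending:
            return []
        return [s for s in range(self.expected, max(self.pending))
                if s not in self.pending]


buf = OrderingBuffer()
first = buf.receive(1, "a")    # in order: delivered at once
held = buf.receive(3, "c")     # frame 2 was lost on the failed path: frame 3 is held
gap = buf.missing()            # the sequence check reports frame 2 as missing
after = buf.receive(2, "b")    # the resent frame fills the gap and releases frame 3
```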
Sequence of Network Path Failover (2/2)
▌The network path is recovered by an Ethernet recovery scheme, such as P-Flow, linked with the EE Manager.
(Figure: the EE NIC on the Tx side (retransfer buffer, arbiter) and the EE on the Rx side (receive buffers, ordering buffer) continue exchanging numbered ExpEther packets through the Ether switches; a DEVINFO packet is shown.)
▐ When an ExpEther device receives a management packet indicating that the path has recovered, it starts using both network paths again.
(Figure: the same setup after recovery. The new path is enabled and numbered packets are again distributed across both paths; a lost packet is marked.)
System Configuration by Grouping
Each ExpEther device has a Grouping ID that logically connects a host and its IO devices
The ID is assigned by a rotary switch or by the Manager software
The ID can be set from 1 to 4,095 and is used as the VLAN tag
(Figure: physical view with hosts 1 to 4 and IO devices A to J, each labelled with a group ID, attached to one Ethernet fabric and managed by the ExpEther Manager.)
Logical View (each group appears as a host with its own PCIe switch)
Group#1: Host with IO devices B, D, G, I
Group#2: Host with IO devices A, J
Group#3: Host with IO devices C, E, H
Group#4: Host with IO device F
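How a flat set of devices on one Ethernet fabric turns into separate logical systems can be sketched by grouping on the ID. The device-to-group mapping below is read from the slide's logical view; the host names and the assumption that host n carries group ID n are illustrative only.

```python
from collections import defaultdict

# Group IDs of IO devices A to J, as shown in the slide's logical view.
DEVICE_GID = {"A": 2, "B": 1, "C": 3, "D": 1, "E": 3,
              "F": 4, "G": 1, "H": 3, "I": 1, "J": 2}
# Assumption: host n is configured with group ID n.
HOST_GID = {"Host1": 1, "Host2": 2, "Host3": 3, "Host4": 4}


def logical_view(host_gid, device_gid):
    """Map each host to the IO devices that share its group ID."""
    for gid in list(host_gid.values()) + list(device_gid.values()):
        if not 1 <= gid <= 4095:  # range stated on the slide
            raise ValueError(f"group ID out of range: {gid}")
    by_gid = defaultdict(list)
    for device, gid in sorted(device_gid.items()):
        by_gid[gid].append(device)
    return {host: by_gid.get(gid, []) for host, gid in host_gid.items()}


view = logical_view(HOST_GID, DEVICE_GID)
```

Moving a device to another host then amounts to changing one entry in the mapping, which is what makes the physical cabling irrelevant to the logical configuration.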
ExpEther Management Scheme Overview
▌Group ID (GID : 1~4,095)
GIDs from 1 to 15 can be set with the physical DIP switch on the card.
Setting the GID to 0 allows the Management Software to program a soft GID.
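The precedence between the hardware switch and the software-programmed GID can be written down as a small rule. The function name and the behavior when no soft GID has been programmed yet are assumptions; the slides only state the two ranges and the role of the value 0.

```python
from typing import Optional


def effective_gid(dip_switch: int, soft_gid: Optional[int] = None) -> Optional[int]:
    """Return the group ID a device would use, or None if it has none yet.

    dip_switch: value of the physical switch (0 to 15).
    soft_gid:   value programmed by the management software, if any (1 to 4095).
    """
    if not 0 <= dip_switch <= 15:
        raise ValueError("DIP switch value must be between 0 and 15")
    if dip_switch != 0:
        return dip_switch        # the hardware setting is used directly
    if soft_gid is None:
        return None              # assumption: the device stays ungrouped until programmed
    if not 1 <= soft_gid <= 4095:
        raise ValueError("soft GID must be between 1 and 4095")
    return soft_gid
```

For example, `effective_gid(5)` yields 5, while `effective_gid(0, 300)` yields the software-programmed value 300.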
(Figure: a management server and four hosts (EE 1 to EE 4) share an Ethernet fabric with sixteen EE-attached IO devices, each labelled with its group ID. The management server exchanges management frames with the EEs for group ID configuration and for collecting various information.)
- ExpEther Manager -
Configuration
• Group ID configuration
Monitoring
• ExpEther network status
• PCIe device status
• New ExpEther detection
• Failure detection
- Mng. Frame -
Special Ether frame
• ExpEther hard-wired logic directly receives and sends the frames for configuration and management
ExpEther Manager Library and SDK
▌Library Structure
REST API
• CLI
• Web Interface
• OpenStack Combination
Java API
• Java GUI
C/C++ API
• Customer Original Application
ExpEther Manager C/C++ Library
ExpEther Manager Java Module
Java Servlet
EEM Library / SDK
CLI example:
> EEM list
IO#0 Intel
IO#1 Broadcom
IO#2 Mellanox
(Figure: four client types built on the EEM Library / SDK: CLI, Java GUI application, original application and web browser. The screenshots show the ExpEther management software, whose GUI is in Japanese:
a menu bar with File, Tools and Help;
an E2SV information panel with editable name and description fields, Group ID 0016 and MAC 11:22:33:44:55:66;
an E2IO list (E2IO 1 to E2IO 6) with name, description, MAC address (11:22:33:44:55:10 to 11:22:33:44:55:15), power state, IO type (IO-BOX, E2Z, NDAS) and error status (fatal, redundant);
buttons for Connect, Disconnect, Power ON, Power OFF, Reset, Linked power ON/OFF, UID-LED ON/OFF, Delete, Refresh and Expand;
lists of connected hosts, unconnected hosts and unconnected IO devices.
A note in the GUI reads: "* : Editable. Press Enter after input.")
▌TLP Encryption (TWINE)
40G ExpEther supports TLP encryption
The encryption key is configured by the Management Software
▌TWINE is developed by NEC
High-speed and very small hardware implementation
(Figure: the Management Software on a server registers and deletes keys in the key registers (Key Reg.) of the ExpEther chips, one on the host side (CPU, PCIe) and one on the IO side (PCIe device, PCIe). Non-volatile memory is also shown.)
ExpEther Technology Architectural Possibilities
▐ Std-EE : Standard PCIe-over-Ethernet
Foundation of ExpEther
▐ MR-EE : I/O sharing
Multiple hosts can share an IO device by using an SR-IOV compliant device
▐ P2P-EE: I/O direct connection
Support for the Peer-to-Peer data transfer between I/O devices.
▐ NTB-EE : Remote direct memory access by NTB
High-speed data transfer between hosts
(Figure: four configurations along a timeline from "Here Today" to "Future".
PCIe-over-Ether: Std-EE hosts and Std-EE I/O devices connected through an Ethernet switch, with partitioning.
Resource sharing: several Std-EE hosts share SR-IOV devices behind MR-EE through an Ethernet switch.
Peer-to-peer: I/O devices behind P2P-EE exchange data directly through the Ethernet switch, in contrast to the current path through the Std-EE host.
NTB: hosts with NTB-EE connected to each other through an Ethernet switch, alongside Std-EE I/O devices.)
ExpEther Advantages
Dynamic Resource Reconfiguration & Sharing
Because ExpEther supports the GID/VLAN concept, IO resources can be dynamically allocated to different hosts according to application or workload needs.
No Change in OS or Driver
The NVMoE specification requires changes to drivers and the OS, whereas ExpEther requires no such change. NVMe devices are accessible with simple plug-and-play using ExpEther.
No space/length constraint
With ExpEther, the Ethernet fabric can span from a few meters to several kilometers, so servers and IO devices can sit in different places, which is especially useful for IoT.
Reduced Costs
When expanding a system by adding tens or hundreds of IO devices, there is no need to purchase expensive PCIe switches; ExpEther works on standard off-the-shelf Ethernet switches.
Service Acceleration Platform with ExpEther
IO devices can be dynamically allocated to the appropriate host according to workload
(Figure: CPU/chipset hosts with ExpEther HBAs and an EE client (USB/VGA, KVM, ExpEther Engine) connect over Ethernet through Ether switches to a remote IO accelerator resource pool: GPGPUs, an FPGA accelerator, NVMe SSDs, a USB controller and sensors, each attached through ExpEther Engines.)
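Workload-driven allocation from a shared pool can be sketched as simple bookkeeping over device ownership. The class, device names and device kinds below are hypothetical; in an ExpEther system the equivalent step would presumably be carried out by the management software reassigning group IDs, which this sketch does not model.

```python
class ResourcePool:
    """Tracks which host currently owns each pooled IO device (hypothetical sketch)."""

    def __init__(self, devices):
        self.kind = dict(devices)                    # device name -> kind, e.g. "gpu"
        self.owner = {name: None for name in devices}

    def allocate(self, host, kind, count=1):
        """Give `count` free devices of the requested kind to `host`."""
        free = [d for d in self.kind
                if self.kind[d] == kind and self.owner[d] is None]
        if len(free) < count:
            raise LookupError(f"not enough free {kind} devices")
        granted = free[:count]
        for device in granted:
            self.owner[device] = host
        return granted

    def release(self, host):
        """Return every device owned by `host` to the pool."""
        freed = [d for d, h in self.owner.items() if h == host]
        for device in freed:
            self.owner[device] = None
        return freed


pool = ResourcePool({"gpu0": "gpu", "gpu1": "gpu", "ssd0": "nvme"})
granted = pool.allocate("host-a", "gpu", 2)  # a GPU-heavy job takes both GPUs
freed = pool.release("host-a")               # the job ends and the GPUs return to the pool
regranted = pool.allocate("host-b", "gpu")   # another host can now use one of them
```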
Case : Resource Pool System for HPC (Osaka University)
▌64 servers and 70 IO devices for research at Osaka University
The IO devices include GPUs, flash storage and VDI accelerators
The IO devices are dynamically connected to the servers through 10G ExpEther in accordance with each server’s workload
The server system is configured according to user requirements
(Figure: racks of servers, each with a TOR switch and pooled IO such as GPUs, SAS controllers with SAS JBODs, PCoIP and K2 GRID cards, NICs and PCIe flash. Software provisioning composes a server from CPU, GPU, HDD and flash resources.)
Case : Easy Extension of Measurement Equipment (PXI)
Current PXI products are typically extended by PCIe cable, so the measurement system is fixed and the installation location is very limited.
If an ExpEther engine is implemented in the PXI chassis, the system can hold a large number of PXI modules and be configured dynamically.
ExpEther Manager Software assigns an ID to each ExpEther module
PXI (PCI eXtensions for Instrumentation) is one of several modular electronic instrumentation platforms based on PCIe.
(Figure: PXI modules extended by PCIe cable, alongside PXI modules reached through an Ethernet switch over optical cable (more than 1 mile), e.g. in a different room.)
Living at the Edge for going Real-Time with ExpEther
IoT Layers
L5 Cloud ~ Analytics (Cloud Computing)
Rack-scale design or resource pooling with dynamic reconfiguration enables low-cost, low-power, high-performance data centers at the cloud level.
L3 Edge ~ Abstraction/Real-Time Proc. (Edge Computing)
ExpEther helps bring analytics to the edge. In combination with low-power, high-performance hardware such as FPGAs, one can achieve the ideal abstraction required for real-time processing.
L1 Device/Sensor ~ Smart Device (Device Computing)
ExpEther can connect devices directly to the edge and to servers using a simple everything-in-hardware approach, with no complex software protocol stack for communication, which makes it high-speed and low-power and makes devices smarter.
(Figure labels: Local Network between devices and edge, Wide-Area Network between edge and cloud; Data Collection, Abstraction, Analytics; Actionable Information, Real-Time Feedback)
ExpEther as a back-plane interconnect for Ubiquitous Computing and IoT solutions for Real-Time Analytics
More detailed technical and product information is available on the ExpEther web site.
http://www.expether.org/