1/45 Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing Yuhao Zhu*, Yangdong Deng , Yubei Chen Presenters: Abraham Addisie, Vaibhav Gogte *Electrical and Computer Engineering University of Texas at Austin Institute of Microelectronics Tsinghua University


Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡

Presenters: Abraham Addisie, Vaibhav Gogte

*Electrical and Computer Engineering, University of Texas at Austin

‡Institute of Microelectronics, Tsinghua University


Outline

• Introduction
• Motivation
• Related work
• GPU Overview
• Hermes Architecture
• Adaptive warp scheduling
• Hardware Implementation
• Experimental Analysis
• Conclusion


Introduction

IP Packet Processing

Receive an IP packet

Processing of an IP packet at a router
1. Checking IP header
2. Packet classification
3. Routing table lookup
4. Decrementing Time to Live (TTL) value
5. IP fragmentation (if > Maximum Transmission Unit)

New processing requirements are being added to the list
• Deep packet inspection

Packet as received:
MAC header: source MAC = mx, destination MAC = my
IP header: source IP = x, destination IP = y
Data

Packet as forwarded:
MAC header: source MAC = new, destination MAC = new
IP header: source IP = x, destination IP = y
Data
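Two of the steps listed on this slide, checking the IP header and decrementing the TTL, can be sketched in a few lines of Python. This is only an illustration, not the code evaluated in the talk: the function names are hypothetical, the input is assumed to be exactly the IPv4 header bytes, and the checksum is the standard 16-bit one's-complement sum used by IPv4.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """16-bit one's-complement checksum over the header bytes."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def header_is_valid(header: bytes) -> bool:
    """Check version, header length, and checksum (hypothetical helper)."""
    if len(header) < 20:
        return False
    version, ihl = header[0] >> 4, header[0] & 0x0F
    if version != 4 or ihl < 5 or len(header) != ihl * 4:
        return False
    # A header that includes a correct checksum field sums to zero.
    return ipv4_checksum(header) == 0


def decrement_ttl(header: bytes):
    """Return the header with TTL - 1 and a fresh checksum, or None to drop."""
    if not header_is_valid(header):
        return None
    ttl = header[8]
    if ttl <= 1:
        return None  # TTL expired: the packet is not forwarded
    hdr = bytearray(header)
    hdr[8] = ttl - 1
    hdr[10:12] = b"\x00\x00"  # zero the checksum field before recomputing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

For example, passing a valid 20-byte header with TTL 64 to `decrement_ttl` returns a header with TTL 63 that still passes `header_is_valid`.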


Motivation

Internet traffic is increasing exponentially
• Multimedia applications, social networks, Internet of Things
Need: high-throughput router

Network protocols are being added and modified
• Transition from IPv4 (32-bit) to IPv6 (128-bit)
Need: highly programmable router

New processing-intensive tasks are being added
• Deep packet inspection


Related Work

ASIC-based router:
• Long design turnaround
• High non-recurring engineering cost

Network processor (NP) based router:
• No effective programming model
• Intel discontinued its NP router business

GPP (software) based router:
• Low performance

GPU-based router:
• High performance + high programmability


Related Work – CPU vs GPU Throughput

• GPP (software) based router: low-throughput processor
• GPU-based software router: high-throughput processor

PacketShader: Han et al. [2010]


Exploiting High-Throughput GPU for IP Routing

Processing of each packet is independent of the others
• Data-level parallelism = packet-level parallelism

A GPU-based router is shown to outperform a software-based router by 30x in throughput (PacketShader: Han et al. [2010])

[Figure: Packet Queue, Batching, Parallel Processing by GPU]
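The queue, batch, and process-in-parallel flow on this slide can be illustrated with a small Python sketch. It is not PacketShader or Hermes code: the route table, port names, and function names are hypothetical, and the list comprehension stands in for what would be one GPU thread per packet. The point is only that each lookup depends on nothing but its own packet.

```python
import ipaddress

# Hypothetical routing table: prefix -> output port.
ROUTES = {
    "10.0.0.0/8": "port1",
    "10.1.0.0/16": "port2",
    "0.0.0.0/0": "port0",
}


def build_table(routes):
    """Sort prefixes longest-first so the first match is the longest match."""
    table = [(ipaddress.ip_network(p), hop) for p, hop in routes.items()]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def lookup(table, dst):
    """Longest-prefix match for one destination address."""
    addr = ipaddress.ip_address(dst)
    for net, hop in table:
        if addr in net:
            return hop
    return None


def batches(queue, size):
    """Split the packet queue into batches of at most `size` packets."""
    for i in range(0, len(queue), size):
        yield queue[i:i + size]


def route_batch(table, batch):
    """Every lookup is independent, so a GPU could run one per thread."""
    return [lookup(table, dst) for dst in batch]
```

A linear scan over sorted prefixes is chosen here for readability; real routers use trie-based or hardware-assisted lookup structures.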


Limitation of Existing GPU-Based Routers

Memory mapping from the CPU's main memory to the GPU's device memory goes through the PCIe bus, with a peak bandwidth of 8 GB/s
• GPU throughput = 30x the CPU's, without memory mapping
• Reduced to 5x the CPU's, with memory-mapping overhead

Cannot guarantee minimum latency for an individual packet

Solution: Hermes

[Figure: Architecture of NVIDIA GTX480]
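The drop from 30x to 5x can be read through a simple serial model in which transfer time is added to GPU compute time with no overlap. That model is an assumption made here for illustration and is not stated in the slides. Under it, the effective speedup is the compute speedup divided by (1 + r), where r is transfer time expressed as a multiple of GPU compute time.

```python
def effective_speedup(compute_speedup: float, transfer_ratio: float) -> float:
    """Speedup over the CPU when transfer time = transfer_ratio * GPU compute time.

    Assumes transfer and compute do not overlap (illustrative model only).
    """
    return compute_speedup / (1.0 + transfer_ratio)


def implied_transfer_ratio(compute_speedup: float, observed_speedup: float) -> float:
    """Transfer-to-compute time ratio that explains the observed speedup."""
    return compute_speedup / observed_speedup - 1.0
```

With the slide's figures, `implied_transfer_ratio(30, 5)` gives 5.0: under this model the data transfer would take about five times as long as the GPU computation itself.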


Hermes: Integrated CPU/GPU IP Routing

Lower packet-transfer overhead
• Shared memory

Lower per-packet latency
• Adaptive warp scheduling

[Figure: Shared Memory Hierarchy]


Adaptive Warp Issue

Number of packets to be processed depends on:
• Arrival pattern of packets
• Available resources in GPU

Tradeoff in updating the FIFO:
• Too large: average packet delay increases
• Too small: complicated GPU fetch scheduling

Minimum fetch granularity: 1 warp

[Figure: the CPU monitors the packets and updates the Task FIFO; packet data is transferred through shared memory; warps are issued to the shader multiprocessors (SMPs)]
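A rough software model of the task FIFO and warp-granularity fetching is sketched below. It is not the hardware design from the talk: the class and method names are hypothetical, and the policy of issuing only full warps unless a `flush` is requested is an assumption chosen to show the delay tradeoff. The warp size of 32 threads is the usual value on NVIDIA GPUs.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


class TaskFifo:
    """CPU side pushes packet counts; GPU side fetches whole warps."""

    def __init__(self):
        self.entries = deque()  # each entry: number of newly arrived packets

    def push(self, n_packets: int) -> None:
        if n_packets > 0:
            self.entries.append(n_packets)

    def pending(self) -> int:
        return sum(self.entries)

    def fetch_warps(self, free_warp_slots: int, flush: bool = False):
        """Return the packet count of each warp issued in this fetch.

        Full warps are issued while slots are free. A partial warp is issued
        only when `flush` is set (assumed policy, e.g. on a delay timeout).
        """
        pending = self.pending()
        full, rest = divmod(pending, WARP_SIZE)
        sizes = [WARP_SIZE] * min(full, free_warp_slots)
        if flush and rest and len(sizes) == full and len(sizes) < free_warp_slots:
            sizes.append(rest)
        remaining = pending - sum(sizes)
        self.entries.clear()
        if remaining:
            self.entries.append(remaining)
        return sizes
```

Pushing 70 packets and fetching with four free warp slots issues two full warps and leaves six packets waiting. Those six either wait for more arrivals, which raises their delay, or go out as a partial warp on a flush, which wastes lanes.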


In-Order Commit

UDP protocol users expect packets to arrive in order

• Delay Commit Queue (DCQ): records warp IDs in flight
• DCQ-Warp Lookup Table (LUT): maps DCQ entry ID to warp ID
• Warps are committed in order

[Figure: shader core with Warp Allocator, Warp Scheduler, Write Back Stage, DCQ, and LUT (DCQ entry ID, warp ID)]
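The in-order commit idea can be modelled in software as a queue of in-flight warp IDs whose head must finish before anything behind it commits. The sketch below is a behavioural analogy, not the hardware DCQ and LUT from the slide; the class and method names are hypothetical.

```python
from collections import deque


class DelayCommitQueue:
    """Warps may finish out of order but are committed in issue order."""

    def __init__(self):
        self.in_flight = deque()  # warp ids in the order they were issued
        self.finished = set()     # warps that are done but not yet committed

    def allocate(self, warp_id: int) -> None:
        """Record a newly issued warp at the tail of the queue."""
        self.in_flight.append(warp_id)

    def complete(self, warp_id: int):
        """Mark a warp as finished; return the warps that commit now, in order."""
        self.finished.add(warp_id)
        committed = []
        while self.in_flight and self.in_flight[0] in self.finished:
            head = self.in_flight.popleft()
            self.finished.remove(head)
            committed.append(head)
        return committed
```

If warps 0, 1, and 2 are issued and warp 2 finishes first, nothing commits until warp 0 is done; once warp 1 finishes as well, warps 1 and 2 commit together in that order.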


Hardware and Area Overhead

Task FIFO
• 32 bit – 1028 entries
• Area = 0.053 mm²

Delay Commit Queue
• Size depends on maximally allowed concurrent warps (MCWs) and shader cores
• 8 bit – 1028 entries
• Area = 0.013 mm²

DCQ-Warp LUT
• Size depends on number of MCWs
• 16 bit – 32 entries
• Area = 0.006 mm²

Hardware overhead negligible!
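As a quick check on the figures above, the storage of each structure is its entry width times its entry count, and the areas can be summed. The widths, entry counts, and areas below are copied from this slide; the dictionary layout and function names are only illustrative.

```python
# (entry width in bits, number of entries, area in mm^2), as listed on the slide.
STRUCTURES = {
    "Task FIFO": (32, 1028, 0.053),
    "Delay Commit Queue": (8, 1028, 0.013),
    "DCQ-Warp LUT": (16, 32, 0.006),
}


def storage_bits(width_bits: int, entries: int) -> int:
    """Raw storage of one structure in bits."""
    return width_bits * entries


def total_bits(structures) -> int:
    return sum(storage_bits(w, n) for w, n, _ in structures.values())


def total_area_mm2(structures) -> float:
    return sum(area for _, _, area in structures.values())
```

With these numbers the three structures hold 41,632 bits (about 5 KB) in total and occupy roughly 0.072 mm² combined.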


Experimental Setup

Cycle-accurate GPGPU-Sim used to evaluate performance

Benchmarks
• Checking IP header, packet classification, routing table lookup, decrementing TTL, IP fragmentation, and deep packet inspection
• Both burst and sparse traffic patterns

QoS parameters: throughput, delay, delay variance
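The three QoS parameters can be computed from per-packet arrival and departure timestamps. The sketch below shows one plausible set of definitions; the slides do not say how the metrics were measured in GPGPU-Sim, so the function name, the use of population variance, and the choice of time window for throughput are assumptions.

```python
from statistics import mean, pvariance


def qos_metrics(arrivals, departures):
    """Throughput, mean delay, and delay variance for matched timestamp lists.

    Throughput is packets per unit time over the window from the first
    arrival to the last departure (assumed definition).
    """
    delays = [d - a for a, d in zip(arrivals, departures)]
    window = max(departures) - min(arrivals)
    return {
        "throughput": len(delays) / window,
        "mean_delay": mean(delays),
        "delay_variance": pvariance(delays),
    }
```

For arrivals at times 0, 1, 2, 3 and departures at 2, 3, 5, 4, the delays are 2, 2, 3, 1, giving a mean delay of 2, a delay variance of 0.5, and a throughput of 0.8 packets per time unit.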


Throughput evaluation

[Figure panels: Burst traffic without DCQ; Sparse traffic without DCQ; Computing rates of benchmark applications]

• No packet queueing
• CPU/GPU still unable to deliver at input rate
• Outperforms CPU/GPU by a factor of 5
• Better resource utilization with increasing MCW


Delay analysis

[Figure panels: Burst traffic without DCQ; Delay with DCQ vs. without DCQ]

Simple processing in GPU, overlap of CPU-side waiting with GPU processing

Packet delay reduction by 81.2%!

Divergent branches take longer to process, starving the packets


Conclusion

• Lack of QoS and CPU-GPU communication overhead are the major bottlenecks

• Hermes: a closely coupled CPU-GPU solution

• Meets stringent delay requirements

• Enables QoS through optimized configuration

• Minimal hardware extension

• Novel high-quality packet processing engine for future software routers


Discussion points

• Are GPUs really easy to program for processing packets?
• How do the performance and area overhead compare with ASIC-based routers?
• Is router programmability really a crucial concern?