High Performance Embedded Computing with Massively Parallel Processors Yangdong Steve Deng 邓仰东 [email protected] Tsinghua University


Outline

Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion

High Performance Embedded Computing

Future IT infrastructure demands even higher computing power
• Core Internet router throughput: up to 90Tbps
• 4G wireless base station: 1Gbit/s data rate per customer and up to 200 subscribers in service area
• CMU driverless car: 270GFLOPS (giga floating-point operations per second)
• …

~$1M

Fast Increasing IC Costs

Fabrication cost
• Moore’s Second Law: the cost of doubling circuit density increases in line with Moore's First Law.

Design cost
• Now $20-50M per product
• Will reach $75-120M at the 32nm node

The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M.

Implications of the Prohibitive Cost

ASICs would be unaffordable for many applications!

Scott MacGregor, CEO of Broadcom:
• “Broadcom is not intending a move to 45nm in the next year or so as it will be too expensive.”

David Turek, VP of IBM:
• “IBM will be pulling out of Cell development, with PowerXCell 8i to be the company’s last entrance in the technology.”

Multicore Machines Are Really Powerful!

Manufacturer  Processor Type  Model           Model Number  # Cores   FP64 GFLOPS  FP32 GFLOPS
AMD           GPGPU           FireStream      9270          160/800   240          1200
AMD           GPU             Radeon HD       5870          320/1600  544          2720
AMD           GPU             Radeon HD       5970          640/3200  928          4640
AMD           CPU             Magny-Cours                   12        362.11       362.11
Fujitsu       CPU             SPARC64         VII           4         128          128
Intel         CPU             Core 2 Extreme  QX9775        4         51.2         51.2
nVidia        GPU             Fermi           480           512       780          1560
nVidia        GPGPU           Tesla           C1060         240       77.76        933.12
nVidia        GPGPU           Tesla           C2050         448       515.2        1288
Tilera        CPU             TilePro                       64        166          166

[Figure: AMD 12-core CPU, Tilera Tile Gx100 CPU, NVidia Fermi GPU]

GPU: Graphics Processing Unit
GPGPU: General Purpose GPU

Implications

An increasing number of applications would be implemented with multi-core devices
• Huawei: multi-core base stations
• Intel: cluster based Internet routers
• IBM: signal processing and radar applications on Cell processor
• …

Also meets the strong demands for customizability and extendibility


Software Routing with GPU

Background and motivation
GPU based routing processing
• Routing table lookup
• Packet classification
• Deep packet inspection
GPU microarchitecture enhancement
• CPU and GPU integration
• QoS-aware scheduling

Ever-Increasing Internet Traffic

Fast Changing Network Protocols/Services

New services are rapidly appearing
• Data-center, Ethernet forwarding, virtual LAN, …
Personal customization is often essential for QoS
However, today’s Internet heavily depends on 2 protocols
• Ethernet and IPv4, both developed in the 1970s!

Internet Router

Backbone network device
Packet forwarding and path finding
Connect multiple subnets
Key requirements
• High throughput: 40Gbps-90Tbps
• High flexibility

Example: Cisco GSR 12416
• Size: 6ft x 19” x 2ft
• Capacity: 160Gb/s
• Power: 4.2kW

[Figure: packets entering and leaving a router]

Current Router Solutions

Hardware routers
• Fast
• Long design time
• Expensive
• And hard to maintain

Network processor based router
• Network processor: data parallel packet processor
• No good programming models

Software routers
• Extremely flexible
• Low cost
• But slow


Critical Path of Routing Processing

[Figure: routing critical path. Labels: Header Processing (IP Address Lookup, Update Header), Routing Table (IP Addr, Next Hop), Packet Classification (Rule Set: Hdr Fields, Flow), Deep Packet Inspection, Buffer Memory (Data, Hdr), Queue Packet, Switch Fabric]

GPU Based Software Router

[Figure: system diagram. Labels: CPU0-CPU3, Front Side Bus (FSB), North Bridge (memory controller), Main Memory, Memory Bus, two NICs on 4-lane PCIe links, Graphics Card (GPU and GPU Memory) on a 16-lane PCIe link, Internet]

Data level parallelism = packet level parallelism

Routing Table Lookup

Routing table contains network topology information
Find the output port according to destination IP address
Potentially large routing table (~1M entries)
• Can be updated dynamically

An exemplar routing table:

Destination Address Prefix  Next-Hop        Output Port
24.30.32/20                 192.41.177.148  2
24.30.32.160/28             192.41.177.3    6
208.12.32/20                192.41.177.196  1
208.12.32.111/32            192.41.177.195  5

Routing Table Lookup

Longest prefix match
Memory bound
Usually based on a trie data structure
• Trie: a prefix tree with strings as keys
• A node’s position directly reflects its key
• Pointer operations
• Widely divergent branches!

[Figure: the exemplar routing table (prefixes 24.30.32/20, 24.30.32.160/28, 208.12.32/20, 208.12.32.111/32) next to its search trie with nodes 0-4]
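The longest-prefix-match rule can be sketched with a small binary trie. This is an illustrative Python sketch, not the implementation behind the slides: the dictionary node layout and the names `insert` and `lookup` are assumptions, and the abbreviated prefixes from the table (e.g. 24.30.32/20) are written out in full dotted form.

```python
import ipaddress

# Routes from the exemplar table: (prefix, output port).
ROUTES = [
    ("24.30.32.0/20", 2),
    ("24.30.32.160/28", 6),
    ("208.12.32.0/20", 1),
    ("208.12.32.111/32", 5),
]

def make_node():
    # One trie node: two child pointers (bit 0 / bit 1) and an optional port.
    return {"child": [None, None], "port": None}

def insert(root, prefix, port):
    net = ipaddress.ip_network(prefix)
    bits = int(net.network_address)
    node = root
    for i in range(net.prefixlen):
        b = (bits >> (31 - i)) & 1          # walk the prefix bits, MSB first
        if node["child"][b] is None:
            node["child"][b] = make_node()
        node = node["child"][b]
    node["port"] = port

def lookup(root, addr):
    bits = int(ipaddress.ip_address(addr))
    node, best = root, None
    for i in range(32):
        if node["port"] is not None:
            best = node["port"]             # remember the longest match so far
        node = node["child"][(bits >> (31 - i)) & 1]
        if node is None:
            return best
    return node["port"] if node["port"] is not None else best

root = make_node()
for prefix, port in ROUTES:
    insert(root, prefix, port)
```

An address such as 24.30.32.165 falls inside both the /20 and the /28 entries, and the walk keeps the deeper one. The data-dependent early exit in `lookup` is the kind of divergent branch the slide warns about.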

GPU Based Routing Table Lookup

Organize the search trie into an array
• Pointers converted to offsets with regard to the array head
6X speedup even with frequent routing table updates
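Storing the trie in arrays can be sketched as follows. The three parallel lists, the use of offset 0 as the "no child" marker (the root is never a child), and -1 as "no port" are assumptions of this sketch, and plain Python stands in for the GPU kernel. The point is only that child links become integer offsets from the array head, so the whole structure can be copied to device memory as flat buffers.

```python
import ipaddress

ROUTES = [
    ("24.30.32.0/20", 2),
    ("24.30.32.160/28", 6),
    ("208.12.32.0/20", 1),
    ("208.12.32.111/32", 5),
]

def build_flat_trie(routes):
    # Node i is described by left[i], right[i], port[i]; node 0 is the root.
    left, right, port = [0], [0], [-1]
    for prefix, out_port in routes:
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = 0
        for i in range(net.prefixlen):
            child = right if (bits >> (31 - i)) & 1 else left
            if child[node] == 0:            # allocate a new node at the array tail
                left.append(0)
                right.append(0)
                port.append(-1)
                child[node] = len(port) - 1
            node = child[node]
        port[node] = out_port
    return left, right, port

def lookup_flat(trie, addr):
    left, right, port = trie
    bits = int(ipaddress.ip_address(addr))
    node, best = 0, -1
    for i in range(32):
        if port[node] != -1:
            best = port[node]
        child = right if (bits >> (31 - i)) & 1 else left
        node = child[node]                  # follow an offset, not a pointer
        if node == 0:
            return best
    return port[node] if port[node] != -1 else best

TRIE = build_flat_trie(ROUTES)
# On a GPU each thread would run lookup_flat for one packet of the batch.
PORTS = [lookup_flat(TRIE, a) for a in ("24.30.32.165", "208.12.32.112")]
```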

Packet Classification

Match header fields with predefined rules
Size of rule-sets can be huge (e.g., over 5000 rules)

Rule                  Example
Priority              Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Packet filtering      Deny all traffic from ISP3 destined to 166.111.66.77
Traffic rate limit    Ensure ISP2 does not inject more than 10Mbps email traffic on interface 2
Accounting & billing  Treat video traffic to 166.111.X.X as highest priority and perform accounting

Packet Classification

Hardware solution
• Usually with Ternary CAM (TCAM)
  • Expensive and power hungry

Software solutions
• Linear search
• Hash based
• Tuple space search
  • Convert the rules into a set of exact matches

GPU Based Packet Classification

A linear search approach
• Scale to rule sets with 20,000 rules
Meta-programming
• Compile rules into CUDA code with PyCUDA

Example rule: treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority

if ((DA >= 166.111.66.70) && (DA <= 166.111.66.77))
    priority = 0;
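The rule-to-code step can be illustrated with a few lines of Python that emit CUDA-like source text. This is a sketch under assumptions: the rule format (first address, last address, priority), the generated function name `classify`, and the first-match-wins order are all hypothetical, and nothing is compiled here. In a real flow the string could be handed to a runtime compiler such as PyCUDA's `SourceModule`.

```python
import ipaddress

RULE_TEMPLATE = "    if (da >= {lo}u && da <= {hi}u) return {prio};"

def compile_rules(rules):
    """Turn (first_ip, last_ip, priority) rules into CUDA-like source text."""
    lines = ["__device__ int classify(unsigned int da) {"]
    for first, last, prio in rules:
        # Dotted addresses become 32-bit integers so the kernel compares numbers.
        lo = int(ipaddress.ip_address(first))
        hi = int(ipaddress.ip_address(last))
        lines.append(RULE_TEMPLATE.format(lo=lo, hi=hi, prio=prio))
    lines.append("    return -1;  // no rule matched")
    lines.append("}")
    return "\n".join(lines)

# The priority rule from the slide, compiled into one comparison.
SOURCE = compile_rules([("166.111.66.70", "166.111.66.77", 0)])
```

Because every rule becomes straight-line comparisons, the linear search carries no rule-table loads at run time; the rules live in the instruction stream.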

GPU Based Packet Classification

~60X speedup

Deep Packet Inspection (DPI)

Core component for network intrusion detection
• Against viruses, spam, software vulnerabilities, …

[Figure: Snort data flow. Packet stream, Sniffing, Packet Decoder, Preprocessor (Plug-ins), Detection Engine (Plug-ins), Output Stage (Plug-ins), Alerts/Logs. Additional labels: Fixed String Matching, Regular Expression Matching]

Example rule:
alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags: A+; content: "|0d0a5b52504c5d3030320d0a|";

GPU Based Deep Packet Inspection (DPI)

Fixed string match
• Each rule is just a string that is disallowed
• Bloom-filter based search
• One warp for a packet and one thread for a string
• Throughput: 19.2Gbps (30X speed-up over SNORT)

[Figure: Bloom filter operation. The initial Bloom filter is an all-zero bit vector; after pre-processing the rules r1, r2, …, the bits selected by Hash 1, Hash 2, and Hash 3 are set; packet content strings s1, s2, … are hashed the same way and checked against the Bloom vector]
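A Bloom-filter string check along these lines might look like the sketch below. It is not the GPU kernel from the talk: the filter size, the three SHA-256-derived hash functions, and the names `BloomFilter` and `find_banned` are assumptions chosen for a short CPU example. Candidate hits are confirmed exactly, because a Bloom filter can report false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)     # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def find_banned(payload, rules):
    """Return the banned strings (a set of bytes) that occur in payload."""
    bloom = BloomFilter()
    for rule in rules:                      # pre-processing: set bits per rule
        bloom.add(rule)
    hits = set()
    for length in {len(r) for r in rules}:
        for start in range(len(payload) - length + 1):
            window = payload[start:start + length]
            # Cheap filter test first, exact comparison only for candidates.
            if bloom.might_contain(window) and window in rules:
                hits.add(window)
    return hits

HITS = find_banned(b"GET /cmd.exe HTTP/1.1", {b"cmd.exe", b"/etc/passwd"})
```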

GPU Based Deep Packet Inspection (DPI)

Regular expression matching
• Each rule is a regular expression
  • e.g., a|b* = {ε, a, b, bb, bbb, ...}
• Aho-Corasick Algorithm
  • Converts patterns into a finite state machine
  • Matching is done by state traversal
• Memory bound
  • Virtually no computation
• Compress the state table
  • Merging don’t-care entries
• Throughput: 9.3Gbps (15X speed-up over SNORT)

Example: P={he, she, his, hers}
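For the pattern set P={he, she, his, hers}, the Aho-Corasick construction and traversal can be sketched in Python as below. The dictionary-based transition table is an assumption made for brevity; a GPU version would more likely use a dense, compressed state table as the slide suggests. The names `build_automaton` and `search` are illustrative.

```python
from collections import deque

def build_automaton(patterns):
    goto = [{}]        # goto[state][char] -> next state
    fail = [0]         # failure link of each state
    out = [set()]      # patterns recognized on entering each state
    for p in patterns:                       # phase 1: build the keyword trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())          # phase 2: failure links, BFS order
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches ending here
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            matches.append((i - len(p) + 1, p))   # (start index, pattern)
    return matches

AUTOMATON = build_automaton(["he", "she", "his", "hers"])
```

Each input character costs a few table lookups and no arithmetic, which matches the "memory bound, virtually no computation" remark.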


Limitation of GPU-Based Packet Processing

CPU-GPU communication overhead
No QoS guarantee

[Figure: the CPU/GPU software router system diagram (CPU0-CPU3, FSB, North Bridge, NICs on 4-lane PCIe, graphics card on 16-lane PCIe, main memory, Internet) with a packet queue highlighted]

Microarchitectural Enhancements

CPU-GPU integration with a shared memory
• Maintain current CUDA interface
• Implemented on GPGPU-Sim*

[Figure: proposed architecture. Labels: Internet, NIC, CPU, NPGPU, CPU/GPU Shared Memory, Task FIFO, Delayed Commit Queue, GPU]

*A. Bakhoda, et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS, 2009.

Microarchitectural Enhancements

Uniformly one thread for one packet
• No thread block necessary
• Directly schedule and issue warps

GPU fetches packet IDs from the task queue when
• Either a sufficient number of packets are already collected
• Or a given interval passes after the last fetch

[Figure: CPU-maintained task queue and delayed commit queue connected to six GPU cores]
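The two fetch conditions can be modeled in a few lines. The class below is a software sketch of the policy only, not the proposed hardware: the name `TaskQueue`, the default batch size of 32 (one warp), the wait threshold, and the explicit `now` timestamp argument are all assumptions made so the behavior is easy to follow.

```python
class TaskQueue:
    """CPU-side queue of packet IDs with a batch-or-timeout fetch policy."""

    def __init__(self, batch_size=32, max_wait=1e-4, start=0.0):
        self.batch_size = batch_size    # "sufficient number of packets"
        self.max_wait = max_wait        # "given interval" since the last fetch
        self.last_fetch = start
        self.pending = []

    def push(self, packet_id):
        self.pending.append(packet_id)

    def try_fetch(self, now):
        enough = len(self.pending) >= self.batch_size
        timed_out = bool(self.pending) and now - self.last_fetch >= self.max_wait
        if not (enough or timed_out):
            return []                   # GPU keeps waiting
        batch = self.pending[:self.batch_size]
        del self.pending[:self.batch_size]
        self.last_fetch = now
        return batch
```

The timeout bounds how long a packet can sit in the queue under light load, while the batch threshold keeps warps full under heavy load.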

Results: Throughput

[Figure: bar chart of throughput for Deep Packet Inspection, Packet Classification, Routing Table Lookup, and Decrease TTL; series: Line-card Rate, CPU/GPU, New Architecture; vertical axis from 0 to 350]

Results: Packet Latency

[Figure: bar chart of packet latency for Deep Packet Inspection, Packet Classification, Routing Table Lookup, and Decrease TTL; series: CPU/GPU, New Architecture; vertical axis from 0 to 250]


High Performance Radar DSP Processor

Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor

Research Objectives

High performance DSP processor
• For high-performance applications
  • Radar, sonar, cellular baseband, …

Performance requirements
• Throughput ≥ 800GFLOPS
• Power efficiency ≥ 100GFLOPS/W
• Memory bandwidth ≥ 400Gbit/s
• Scale to multi-chip solutions

Current DSP Platforms

Processor                   Frequency  # Cores           Throughput   Memory Bandwidth    Power  Power Efficiency (GFLOPS/W)
TI TMS320C6472-700          500MHz     6                 33.6GMac/s   NA                  3.8W   17.7
FreeScale MSC8156           1GHz       6                 48GMac/s     1GB/s               10W    9.6
ADI TigerSHARC ADSP-TS201S  600MHz     1                 4.8GMac/s    38.4GB/s (on-chip)  2.18W  4.4
PicoChip PC205              260MHz     1GPP+248DSPs      31GMac/s     NA                  <5W    12.4
Intel Core i7 980XE         3.3GHz     6                 107.5GFLOPS  31.8GB/s            130W   0.8
Tilera Tile64               866MHz     64 CPUs           221GFLOPS    6.25GB/s            22W    10.0
NVidia Fermi GPU            1GHz       512 scalar cores  1536GFLOPS   230GB/s *           200W   7.7

*GDDR5: Peak Bandwidth 28.2GB/s
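The power-efficiency figures for current DSP platforms appear to be throughput divided by power, with each multiply-accumulate counted as two floating-point operations for the parts rated in GMac/s. That conversion factor is an inference from the numbers, not something the slides state. Likewise, treating PicoChip's "<5W" as exactly 5 W is an assumption. A short check in Python:

```python
FLOPS_PER_MAC = 2   # assumption: one multiply plus one add per MAC

# (processor, throughput, unit, power in watts) as listed for each platform.
PLATFORMS = [
    ("TI TMS320C6472-700", 33.6, "GMac/s", 3.8),
    ("FreeScale MSC8156", 48.0, "GMac/s", 10.0),
    ("ADI TigerSHARC ADSP-TS201S", 4.8, "GMac/s", 2.18),
    ("PicoChip PC205", 31.0, "GMac/s", 5.0),
    ("Intel Core i7 980XE", 107.5, "GFLOPS", 130.0),
    ("Tilera Tile64", 221.0, "GFLOPS", 22.0),
    ("NVidia Fermi GPU", 1536.0, "GFLOPS", 200.0),
]

def power_efficiency(throughput, unit, watts):
    """GFLOPS per watt, rounded to one decimal place."""
    gflops = throughput * FLOPS_PER_MAC if unit == "GMac/s" else throughput
    return round(gflops / watts, 1)

EFFICIENCY = [power_efficiency(t, u, w) for _, t, u, w in PLATFORMS]
```

Under these assumptions the computed values reproduce the listed column, and the best existing part stays well below the 100GFLOPS/W objective.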

HPEC Challenge - Radar Benchmarks

Benchmark  Description
TDFIR      Time-domain finite impulse response filtering
FDFIR      Frequency-domain finite impulse response filtering
CT         Corner turn or matrix transpose to place radar data into a contiguous row for efficient FFT
QR         QR factorization: prevalent in target recognition algorithms
SVD        Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference
CFAR       Constant false-alarm rate detection: find target in an environment with varying background noise
GA         Graph optimization via genetic algorithm: removing uncorrelated data relations
PM         Pattern Matching: identify stored tracks that match a target
DB         Database operations to store and query target tracks

GPU Implementation

Benchmark  Description
TDFIR      Loops of multiplication and accumulation (MAC)
FDFIR      FFT followed by MAC loops
CT         GPU based matrix transpose, extremely efficient
QR         Pipeline of CPU + GPU, Fast Givens algorithm
SVD        Based on QR factorization and fast matrix multiplication
CFAR       Accumulation of neighboring vector elements
GA         Parallel random number generator and inter-thread communication
PM         Vector level parallelism
DB         Binary tree operation, hard for GPU implementation
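As an illustration of the "loops of multiplication and accumulation" entry, a time-domain FIR filter can be written as a pair of nested loops. This is a plain Python sketch, not the benchmarked CUDA kernel; the function name `tdfir` and the full-convolution output length are assumptions.

```python
def tdfir(x, h):
    """Time-domain FIR: y[n] = sum over k of h[k] * x[n - k]."""
    n_out = len(x) + len(h) - 1
    y = [0.0] * n_out
    for n in range(n_out):              # on a GPU: one thread per output sample
        acc = 0.0
        for k in range(len(h)):         # the MAC loop
            if 0 <= n - k < len(x):
                acc += h[k] * x[n - k]
        y[n] = acc
    return y

Y = tdfir([1, 2, 3], [1, 1])
```

Every output sample is independent of the others, which is why the outer loop maps directly onto parallel threads.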

Performance Results

Kernel  Data Set  CPU Throughput (GFLOPS)*  GPU Throughput (GFLOPS)*  Speedup
TDFIR   Set 1     3.382                     97.506                    28.8
        Set 2     3.326                     23.130                    6.9
FDFIR   Set 1     0.541                     61.681                    114.1
        Set 2     0.542                     11.955                    22.1
CT      Set 1     1.194                     17.177                    14.3
        Set 2     0.501                     35.545                    70.9
PM      Set 1     0.871                     7.761                     8.9
        Set 2     0.281                     21.241                    75.6
CFAR    Set 1     1.154                     2.234                     1.9
        Set 2     1.314                     17.319                    13.1
        Set 3     1.313                     13.962                    10.6
        Set 4     1.261                     8.301                     6.6
GA      Set 1     0.562                     1.177                     2.1
        Set 2     0.683                     8.571                     12.5
        Set 3     0.441                     0.589                     1.4
        Set 4     0.373                     2.249                     6.0
QR      Set 1     1.704                     54.309                    31.8
        Set 2     0.901                     5.679                     6.3
        Set 3     0.904                     6.686                     7.4
SVD     Set 1     0.747                     4.175                     5.6
        Set 2     0.791                     2.684                     3.4
DB      Set 1     112.3                     126.8                     1.13
        Set 2     5.794                     8.459                     1.46

*The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively.

Performance Comparison

GPU: NVIDIA Fermi, CPU: Intel Core 2 Duo (3.33GHz), DSP: ADI TigerSHARC 101

Instruction Profiling

Thread Profiling

Warp occupancy: number of active threads in an issued warp
• 32 threads per warp
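Warp occupancy as defined here can be computed from per-warp active masks. The sketch below assumes a hypothetical trace format, one 32-bit integer per issued warp with bit i set when thread i is active; the profiler's actual output format is not given in the slides.

```python
WARP_SIZE = 32

def warp_occupancy(active_masks):
    """Return per-warp active thread counts and their average as a fraction."""
    counts = [bin(mask & 0xFFFFFFFF).count("1") for mask in active_masks]
    average = sum(counts) / (WARP_SIZE * len(counts))
    return counts, average

# A full warp, a half-diverged warp, and a warp with a single active thread.
COUNTS, AVERAGE = warp_occupancy([0xFFFFFFFF, 0x0000FFFF, 0x00000001])
```

Divergent code, such as the trie walks in routing lookup, shows up as low counts because inactive lanes still occupy issue slots.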

Off-Chip Memory Profiling

DRAM efficiency: the percentage of time spent on sending data across the pins of DRAM over the whole time of memory service.

Limitation

GPU suffers from a low power-efficiency (MFLOPS/W)

Key Idea - Hardware Architecture

Borrow the GPU microarchitecture
• Using a DSP core as the basic execution unit
• Multiprocessors organized in programmable pipelines
• Neighboring multiprocessors can be merged as wider datapaths

Key Idea - Parallel Code Generation

Meta-programming based parallel code generation
Foundation technologies
• GPU meta-programming frameworks
  • Copperhead (UC Berkeley) and PyCUDA (New York University)
• DSP code generation framework
  • Spiral (Carnegie Mellon University)

[Figure: code generation flow. Labels: DSP code generation, Source optimization, Compile, runtime]

Key Idea – Internal Representation as KPN

Kahn Process Network (KPN)
• A generic model for concurrent computation
• Solid theoretic foundation
  • Process algebra
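A Kahn process network can be sketched with threads as processes and unbounded FIFO queues as channels, where every read blocks until a token arrives. The small network below (two sources, two gain stages, one adder) is a made-up example, and the `END` marker used to shut the network down is a convenience of this sketch rather than part of the KPN model.

```python
import queue
import threading

END = None   # end-of-stream token (an assumption of this sketch)

def source(out_ch, values):
    for v in values:
        out_ch.put(v)
    out_ch.put(END)

def scale(in_ch, out_ch, gain):
    while True:
        tok = in_ch.get()               # blocking read: the core KPN rule
        if tok is END:
            break
        out_ch.put(tok * gain)
    out_ch.put(END)

def add(in_a, in_b, out_ch):
    while True:
        a, b = in_a.get(), in_b.get()   # wait for one token on each input
        if a is END or b is END:
            break
        out_ch.put(a + b)
    out_ch.put(END)

def run_network(n):
    a, b, a2, b2, out = (queue.Queue() for _ in range(5))
    procs = [
        threading.Thread(target=source, args=(a, range(n))),
        threading.Thread(target=source, args=(b, range(n))),
        threading.Thread(target=scale, args=(a, a2, 2)),
        threading.Thread(target=scale, args=(b, b2, 10)),
        threading.Thread(target=add, args=(a2, b2, out)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    result = []
    while True:
        tok = out.get()
        if tok is END:
            break
        result.append(tok)
    return result
```

Because processes communicate only through blocking FIFO reads, the output does not depend on how the threads are scheduled, which is what makes the model attractive as an internal representation for automatic mapping.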

Scheduling and Optimization on KPN

Automatic task and thread scheduling and mapping
• Extract data parallelism through process splitting
• Latency and throughput aware scheduling
• Performance estimation based on analytical models

[Figure labels: Ttotal, T1, T2, Ti]

Key Idea - Low Power Techniques

GPU-like processors are power hungry!
Potential low power techniques
• Aggressive memory coalescing
• Enable task-pipeline to avoid synchronization via global memory
• Operation chaining to avoid extra memory accesses
• ???

[Figure: DRAM chip and DRAM line with used and unused portions, comparing current coalescing with our coalescing solution]
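Why coalescing matters for power can be seen by counting how many DRAM lines one warp touches. The sketch below models conventional coalescing only, not the improved scheme the slide proposes, and the 128-byte line, 4-byte word, and 32-thread warp are assumed values.

```python
LINE_BYTES = 128   # assumed DRAM line (burst) size
WORD_BYTES = 4     # assumed access width per thread
WARP_SIZE = 32

def bus_utilization(stride_words):
    """Lines fetched by one warp, and the fraction of fetched bytes used."""
    addrs = [t * stride_words * WORD_BYTES for t in range(WARP_SIZE)]
    lines = {a // LINE_BYTES for a in addrs}    # distinct DRAM lines touched
    useful = WARP_SIZE * WORD_BYTES
    fetched = len(lines) * LINE_BYTES
    return len(lines), useful / fetched

UNIT_STRIDE = bus_utilization(1)     # neighboring threads, neighboring words
WIDE_STRIDE = bus_utilization(32)    # each thread lands in its own line
```

With unit stride the warp is served by a single line and every fetched byte is used; with a 32-word stride it needs 32 lines and uses 1/32 of the data moved, so most of the DRAM energy is spent on unused bytes.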


Conclusion

A new market of high performance embedded computing is emerging
Multi-core engines would be the work-horses
Need both HW and SW research
Case study 1: GPU based Internet routing
Case study 2: Massively parallel DSP processor
Significant performance improvements
More work ahead
• Low power, scheduling, parallel programming model, legacy code, …
