
High Performance Embedded Computing with Massively Parallel Processors

Yangdong Steve Deng (邓仰东)
dengyd@tsinghua.edu.cn
Tsinghua University

Outline

• Motivation and background
• Morphing GPU into a network processor
• High performance radar DSP processor
• Conclusion

High Performance Embedded Computing

• Future IT infrastructure demands even higher computing power
  • Core Internet router throughput: up to 90 Tbps
  • 4G wireless base station: 1 Gbit/s data rate per customer and up to 200 subscribers in the service area
  • CMU driverless car: 270 GFLOPS (giga floating-point operations per second)
  • …

Fast Increasing IC Costs

• Fabrication cost
  • Moore's Second Law: the cost of doubling circuit density increases in line with Moore's First Law.
• Design cost
  • Now $20-50M per product
  • Will reach $75-120M at the 32nm node
• The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M.

[Figure annotation: ~$1M]

Implications of the Prohibitive Cost

• ASICs would be unaffordable for many applications!
• Scott MacGregor, CEO of Broadcom:
  • “Broadcom is not intending a move to 45nm in the next year or so as it will be too expensive.”
• David Turek, VP of IBM:
  • “IBM will be pulling out of Cell development, with PowerXCell 8i to be the company’s last entrance in the technology.”

Multicore Machines Are Really Powerful!

Manufacturer  Processor Type  Model           Model Number  # Cores   GFLOPS FP64  GFLOPS FP32
AMD           GPGPU           FireStream      9270          160/800   240          1200
AMD           GPU             Radeon HD       5870          320/1600  544          2720
AMD           GPU             Radeon HD       5970          640/3200  928          4640
AMD           CPU             Magny-Cours     -             12        362.11       362.11
Fujitsu       CPU             SPARC64         VII           4         128          128
Intel         CPU             Core 2 Extreme  QX9775        4         51.2         51.2
nVidia        GPU             Fermi           480           512       780          1560
nVidia        GPGPU           Tesla           C1060         240       77.76        933.12
nVidia        GPGPU           Tesla           C2050         448       515.2        1288
Tilera        CPU             TilePro         -             64        166          166

[Figures: AMD 12-core CPU; Tilera Tile Gx100 CPU; NVidia Fermi GPU]

GPU: Graphics Processing Unit
GPGPU: General Purpose GPU

Implications

• An increasing number of applications will be implemented with multi-core devices
  • Huawei: multi-core base stations
  • Intel: cluster-based Internet routers
  • IBM: signal processing and radar applications on the Cell processor
  • …
• Multi-core devices also meet the strong demand for customizability and extensibility

Outline

• Motivation and background
• Morphing GPU into a network processor
• High performance radar DSP processor
• Conclusion

Software Routing with GPU

• Background and motivation
• GPU based routing processing
  • Routing table lookup
  • Packet classification
  • Deep packet inspection
• GPU microarchitecture enhancement
  • CPU and GPU integration
  • QoS-aware scheduling


Ever-Increasing Internet Traffic

Fast Changing Network Protocols/Services

• New services are rapidly appearing
  • Data-center, Ethernet forwarding, virtual LAN, …
• Personal customization is often essential for QoS
• However, today’s Internet depends heavily on two protocols
  • Ethernet and IPv4, both developed in the 1970s!


Internet Router

Internet Router

• Backbone network device
• Packet forwarding and path finding
• Connect multiple subnets
• Key requirements
  • High throughput: 40 Gbps to 90 Tbps
  • High flexibility

[Figure: Cisco GSR 12416, dimensions labeled 6ft, 19” and 2ft. Capacity: 160Gb/s. Power: 4.2kW]

[Figure: Packets → Router → Packets]

Current Router Solutions

• Hardware routers
  • Fast
  • Long design time
  • Expensive
  • And hard to maintain
• Network processor based routers
  • Network processor: data parallel packet processor
  • No good programming models
• Software routers
  • Extremely flexible
  • Low cost
  • But slow

Outline

• Background and motivation
• GPU based routing processing
  • Routing table lookup
  • Packet classification
  • Deep packet inspection
• GPU microarchitecture enhancement
  • CPU and GPU integration
  • QoS-aware scheduling

Critical Path of Routing Processing

[Figure: critical path of routing processing. Blocks: Header Processing (IP Address Lookup, Update Header) backed by the Routing Table (IP Addr → Next Hop); Packet Classification backed by the Rule Set (Hdr Fields → Flow); Deep Packet Inspection; Buffer Memory holding packet data and headers; Queue Packet; Switch Fabric]

GPU Based Software Router

[Figure: GPU-based software router. CPU0-CPU3 on the Front Side Bus (FSB); North Bridge (memory controller) attached to Main Memory over the Memory Bus; two NICs on PCIe 4-lane links facing the Internet; Graphics Card (GPU and GPU memory) on a PCIe 16-lane link]

• Data level parallelism = packet level parallelism

Routing Table Lookup

• Routing table contains network topology information
  • Find the output port according to the destination IP address
  • Potentially large routing table (~1M entries)
    • Can be updated dynamically

Destination Address Prefix  Next-Hop        Output Port
24.30.32/20                 192.41.177.148  2
24.30.32.160/28             192.41.177.3    6
208.12.32/20                192.41.177.196  1
208.12.32.111/32            192.41.177.195  5

An exemplar routing table

Routing Table Lookup

• Longest prefix match
• Memory bound
• Usually based on a trie data structure
  • Trie: a prefix tree with strings as keys
  • A node’s position directly reflects its key
  • Pointer operations
  • Widely divergent branches!

Destination Address Prefix  Next-Hop        Output Port
24.30.32/20                 192.41.177.148  2
24.30.32.160/28             192.41.177.3    6
208.12.32/20                192.41.177.196  1
208.12.32.111/32            192.41.177.195  5

[Figure: search trie for the four prefixes 24.30.32/20, 24.30.32.160/28, 208.12.32/20 and 208.12.32.111/32, with nodes labeled 0-4]

GPU Based Routing Table Lookup

• Organize the search trie into an array
  • Pointers are converted to offsets relative to the array head
• 6X speedup even with frequent routing table updates
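The array layout can be illustrated with a minimal Python sketch. It is not the authors' GPU kernel: the three-slot node layout, the `NONE` sentinel and the function names are assumptions made for illustration. A binary trie is stored in one flat list, child pointers are offsets from the array head, and lookup keeps the last port seen along the path (longest prefix match).

```python
import ipaddress

NONE = -1  # sentinel for "no child" / "no port" (illustrative choice)

def build_flat_trie(routes):
    """Store a binary trie in one flat list.

    Each node occupies three slots: [offset of 0-child, offset of 1-child, port].
    Offsets are relative to the array head; the root sits at offset 0.
    """
    trie = [NONE, NONE, NONE]
    for prefix, port in routes:
        net = ipaddress.ip_network(prefix)
        addr = int(net.network_address)
        cur = 0
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1
            if trie[cur + bit] == NONE:
                trie[cur + bit] = len(trie)      # offset of the node about to be appended
                trie.extend([NONE, NONE, NONE])
            cur = trie[cur + bit]
        trie[cur + 2] = port
    return trie

def lookup(trie, ip):
    """Longest prefix match: walk the address bits, remember the last port seen."""
    addr = int(ipaddress.ip_address(ip))
    cur, best = 0, trie[2]
    for i in range(32):
        cur = trie[cur + ((addr >> (31 - i)) & 1)]
        if cur == NONE:
            break
        if trie[cur + 2] != NONE:
            best = trie[cur + 2]
    return best

# The exemplar routing table from the slides, written with full dotted quads.
ROUTES = [
    ("24.30.32.0/20", 2),
    ("24.30.32.160/28", 6),
    ("208.12.32.0/20", 1),
    ("208.12.32.111/32", 5),
]
TRIE = build_flat_trie(ROUTES)
```

Because the structure is a single array of integers, it can be copied to device memory in one transfer, and each packet's lookup is an independent walk over offsets.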

Packet Classification

• Match header fields with predefined rules
  • Size of rule-sets can be huge (e.g., over 5000 rules)

Rule                  Example
Priority              Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Packet filtering      Deny all traffic from ISP3 destined to 166.111.66.77
Traffic rate limit    Ensure ISP2 does not inject more than 10Mbps email traffic on interface 2
Accounting & billing  Treat video traffic to 166.111.X.X as highest priority and perform accounting

Packet Classification

• Hardware solution
  • Usually with Ternary CAM (TCAM)
    • Expensive and power hungry
• Software solutions
  • Linear search
  • Hash based
  • Tuple space search
    • Convert the rules into a set of exact matches

GPU Based Packet Classification

• A linear search approach
  • Scales to rule sets with 20,000 rules
• Meta-programming
  • Compile rules into CUDA code with PyCUDA

Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority

    /* DA: destination address as a 32-bit unsigned integer */
    if (DA >= 0xA66F4246u && DA <= 0xA66F424Du)   /* 166.111.66.70 - 166.111.66.77 */
        priority = 0;
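The rule-to-code step can be sketched in Python as below. This is an illustration under assumptions, not the authors' generator: the rule tuple format `(low, high, priority)`, the function names and the shape of the emitted `classify` device function are hypothetical. The generated string is the kind of source one would hand to PyCUDA's `pycuda.compiler.SourceModule`; the sketch stops before compilation so that it runs without a GPU.

```python
import ipaddress

def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.ip_address(dotted))

def rule_to_c(lo, hi, priority):
    """Emit one C-style range test on the destination address DA."""
    return (f"if (DA >= {ip_to_int(lo):#010x}u && DA <= {ip_to_int(hi):#010x}u) "
            f"return {priority};  /* {lo} - {hi} */")

def generate_classifier(rules):
    """Unroll the rule set into a linear chain of tests (hypothetical kernel shape)."""
    body = "\n    ".join(rule_to_c(*rule) for rule in rules)
    return ("__device__ int classify(unsigned int DA) {\n    "
            + body
            + "\n    return -1;  /* no rule matched */\n}")

def classify(rules, dst):
    """Pure-Python reference for the same linear search, handy for checking."""
    da = ip_to_int(dst)
    for lo, hi, priority in rules:
        if ip_to_int(lo) <= da <= ip_to_int(hi):
            return priority
    return -1

RULES = [("166.111.66.70", "166.111.66.77", 0)]
SOURCE = generate_classifier(RULES)
print(SOURCE)
```

Compiling rules into straight-line comparisons removes the rule table from memory: every thread runs the same chain of tests on its own packet header.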


GPU Based Packet Classification

~60X speedup

Deep Packet Inspection (DPI)

• Core component for network intrusion detection
  • Against viruses, spam, software vulnerabilities, …
• Fixed string matching
• Regular expression matching

[Figure: Snort data flow. Packet stream → Sniffing → Packet Decoder → Preprocessor (plug-ins) → Detection Engine (plug-ins) → Output Stage (plug-ins) → Alerts/Logs]

Example rule:
alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags: A+; content: "|0d0a5b52504c5d3030320d0a|";

GPU Based Deep Packet Inspection (DPI)

• Fixed string match
  • Each rule is just a string that is disallowed
  • Bloom-filter based search
  • One warp for a packet and one thread for a string
  • Throughput: 19.2Gbps (30X speed-up over SNORT)

[Figure: Bloom filter operation with three hash functions (Hash 1, Hash 2, Hash 3)
  Initial Bloom filter:        0 0 0 0 0 0 0 0 0 0 0 0
  After pre-processing rules:  0 1 0 0 1 0 1 0 1 0 0 1   (rules r1, r2, …)
  Checking packet content:     0 1 0 0 1 0 1 0 1 0 0 1   (strings s1, s2, … tested against the Bloom vector)]
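A Bloom filter of this kind can be sketched in a few lines of Python. The filter size `M`, the use of salted SHA-256 as the three hash functions, and the helper names are assumptions for illustration; the slides do not say which hashes or sizes were used. A hit only means "possibly disallowed" and would still need an exact comparison, since Bloom filters can report false positives but never false negatives.

```python
import hashlib

M = 1024  # number of bits in the filter (illustrative size)
K = 3     # number of hash functions, as in the figure

def _positions(s):
    """Map a byte string to K bit positions using salted SHA-256 (an assumption)."""
    for seed in range(K):
        digest = hashlib.sha256(bytes([seed]) + s).digest()
        yield int.from_bytes(digest[:4], "big") % M

def build_filter(patterns):
    """Pre-process the rules: set the K bits of every disallowed string."""
    bits = [0] * M
    for p in patterns:
        for pos in _positions(p):
            bits[pos] = 1
    return bits

def maybe_contains(bits, s):
    """True if all K bits are set: s is possibly in the rule set."""
    return all(bits[pos] for pos in _positions(s))

def scan(bits, payload, length):
    """Return offsets of payload windows of the given length that may match."""
    return [i for i in range(len(payload) - length + 1)
            if maybe_contains(bits, payload[i:i + length])]

FILTER = build_filter([b"attack", b"exploit"])
CANDIDATES = scan(FILTER, b"xxx attack yyy", len(b"attack"))
```

On a GPU each window test is independent, which fits the one-thread-per-string mapping listed above.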

GPU Based Deep Packet Inspection (DPI)

• Regular expression matching
  • Each rule is a regular expression
    • e.g., a|b* = {ε, a, b, bb, bbb, ...}
  • Aho-Corasick algorithm
    • Converts patterns into a finite state machine
    • Matching is done by state traversal
  • Memory bound
    • Virtually no computation
  • Compress the state table
    • Merging don’t-care entries
  • Throughput: 9.3Gbps (15X speed-up over SNORT)

Example: P={he, she, his, hers}
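For the example pattern set P, the Aho-Corasick automaton can be built and traversed as in the sketch below. It uses plain Python dictionaries for the state table, which is a readability choice and not the compressed table described above; the function names are illustrative.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                       # phase 1: insert patterns into a trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())          # phase 2: failure links, breadth first
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] = out[s] + out[fail[s]]   # inherit matches that end at the suffix state
    return goto, fail, out

def search(text, automaton):
    """Traverse states over the text; return (start offset, pattern) pairs."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

AUTOMATON = build_automaton(["he", "she", "his", "hers"])
HITS = search("ushers", AUTOMATON)
```

Each input character costs only a few table lookups, which is why the workload is memory bound with virtually no computation.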

Outline

• Background and motivation
• GPU based routing processing
  • Routing table lookup
  • Packet classification
  • Deep packet inspection
• GPU microarchitecture enhancement
  • CPU and GPU integration
  • QoS-aware scheduling

Limitation of GPU-Based Packet Processing

• CPU-GPU communication overhead
• No QoS guarantee

[Figure: GPU-based software router with a packet queue. CPU0-CPU3 on the Front Side Bus (FSB); North Bridge (memory controller) attached to Main Memory over the Memory Bus; two NICs on PCIe 4-lane links facing the Internet; Graphics Card (GPU and GPU memory) on a PCIe 16-lane link]

Microarchitectural Enhancements

• CPU-GPU integration with a shared memory
  • Maintain the current CUDA interface
  • Implemented on GPGPU-Sim*

[Figure: NPGPU architecture. Internet, NIC, CPU and GPU, with CPU/GPU Shared Memory, Task FIFO and Delayed Commit Queue]

*A. Bakhoda, et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS, 2009.

Microarchitectural Enhancements

• Uniformly one thread for one packet
  • No thread block necessary
  • Directly schedule and issue warps
• GPU fetches packet IDs from the task queue when
  • Either a sufficient number of packets are already collected
  • Or a given interval passes after the last fetch

[Figure: CPU-maintained task queue and Delayed Commit Queue connected to six GPU cores]
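The fetch condition above (batch full, or timeout since the last fetch) can be modeled in software as follows. This is only a behavioral sketch: the class name, the `batch_size` and `max_wait` parameters and the explicit `now` argument are assumptions, and the real mechanism is a hardware queue, not Python.

```python
from collections import deque

class TaskQueue:
    """Behavioral model of the CPU-maintained task queue (illustrative only)."""

    def __init__(self, batch_size, max_wait):
        self.batch_size = batch_size   # "sufficient number of packets"
        self.max_wait = max_wait       # "given interval" after the last fetch
        self.pending = deque()
        self.last_fetch = 0.0

    def push(self, packet_id):
        """CPU side: enqueue the ID of a newly arrived packet."""
        self.pending.append(packet_id)

    def try_fetch(self, now):
        """GPU side: return a batch of packet IDs, or [] if neither condition holds."""
        enough = len(self.pending) >= self.batch_size
        timed_out = bool(self.pending) and (now - self.last_fetch) >= self.max_wait
        if not (enough or timed_out):
            return []
        count = min(len(self.pending), self.batch_size)
        batch = [self.pending.popleft() for _ in range(count)]
        self.last_fetch = now
        return batch
```

The timeout bounds how long a packet can wait under light load, while the batch threshold keeps warps full under heavy load.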

Results: Throughput

[Figure: bar chart of throughput (axis 0 to 350) for Deep Packet Inspection, Packet Classification, Routing Table Lookup and Decrease TTL. Series: Line-card Rate, CPU/GPU, New Architecture]

Results: Packet Latency

[Figure: bar chart of packet latency (axis 0 to 250) for Deep Packet Inspection, Packet Classification, Routing Table Lookup and Decrease TTL. Series: CPU/GPU, New Architecture]

Outline

• Motivation and background
• Morphing GPU into a network processor
• High performance radar DSP processor
• Conclusion

High Performance Radar DSP Processor

• Motivation
• Feasibility of GPU for DSP processing
• Designing a massively parallel DSP processor

Research Objectives

• High performance DSP processor
  • For high-performance applications
    • Radar, sonar, cellular baseband, …
• Performance requirements
  • Throughput ≥ 800 GFLOPS
  • Power efficiency ≥ 100 GFLOPS/W
  • Memory bandwidth ≥ 400 Gbit/s
  • Scale to multi-chip solutions

Current DSP Platforms

Processor                   Frequency  # Cores           Throughput   Memory Bandwidth    Power  Power Efficiency (GFLOPS/W)
TI TMS320C6472-700          500MHz     6                 33.6GMac/s   NA                  3.8W   17.7
FreeScale MSC8156           1GHz       6                 48GMac/s     1GB/s               10W    9.6
ADI TigerSHARC ADSP-TS201S  600MHz     1                 4.8GMac/s    38.4GB/s (on-chip)  2.18W  4.4
PicoChip PC205              260MHz     1GPP+248DSPs      31GMac/s     NA                  <5W    12.4
Intel Core i7 980XE         3.3GHz     6                 107.5GFLOPS  31.8GB/s            130W   0.8
Tilera Tile64               866MHz     64 CPUs           221GFLOPS    6.25GB/s            22W    10.0
NVidia Fermi GPU            1GHz       512 scalar cores  1536GFLOPS   230GB/s *           200W   7.7

*GDDR5: Peak Bandwidth 28.2GB/s
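The power-efficiency column can be checked with a short calculation. One assumption is needed that the slides do not state: a multiply-accumulate is counted as two floating-point operations. With that convention the listed values are reproduced; the function and variable names are illustrative.

```python
FLOPS_PER_MAC = 2  # assumption: one multiply-accumulate = one multiply + one add

def power_efficiency(throughput, unit, watts):
    """Return GFLOPS per watt; GMac/s figures are converted using FLOPS_PER_MAC."""
    gflops = throughput * (FLOPS_PER_MAC if unit == "GMac/s" else 1)
    return gflops / watts

# (throughput, unit, power in W) copied from the table above
ROWS = {
    "TI TMS320C6472-700": (33.6, "GMac/s", 3.8),
    "FreeScale MSC8156": (48, "GMac/s", 10),
    "Intel Core i7 980XE": (107.5, "GFLOPS", 130),
    "NVidia Fermi GPU": (1536, "GFLOPS", 200),
}
EFFICIENCY = {name: round(power_efficiency(*row), 1) for name, row in ROWS.items()}
```

Even the most efficient platform in the table (17.7 GFLOPS/W) is well below the 100 GFLOPS/W objective stated earlier.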

High Performance Radar DSP Processor

• Motivation
• Feasibility of GPU for DSP processing
• Designing a massively parallel DSP processor

HPEC Challenge - Radar Benchmarks

Benchmark  Description
TDFIR      Time-domain finite impulse response filtering
FDFIR      Frequency-domain finite impulse response filtering
CT         Corner turn or matrix transpose to place radar data into a contiguous row for efficient FFT
QR         QR factorization: prevalent in target recognition algorithms
SVD        Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference
CFAR       Constant false-alarm rate detection: find target in an environment with varying background noise
GA         Graph optimization via genetic algorithm: removing uncorrelated data relations
PM         Pattern Matching: identify stored tracks that match a target
DB         Database operations to store and query target tracks

GPU Implementation

Benchmark  Description
TDFIR      Loops of multiplication and accumulation (MAC)
FDFIR      FFT followed by MAC loops
CT         GPU based matrix transpose, extremely efficient
QR         Pipeline of CPU + GPU, Fast Givens algorithm
SVD        Based on QR factorization and fast matrix multiplication
CFAR       Accumulation of neighboring vector elements
GA         Parallel random number generator and inter-thread communication
PM         Vector level parallelism
DB         Binary tree operation, hard for GPU implementation
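Two of these kernels are simple enough to sketch as sequential Python reference code. The sketch makes simplifying assumptions: real-valued samples (the benchmark data may be complex), a cell-averaging form of CFAR with guard cells, and illustrative parameter names. On a GPU, each iteration of the outer loops would map naturally to one thread.

```python
def tdfir(x, h):
    """Time-domain FIR: y[n] = sum over k of h[k] * x[n - k] (full convolution)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):                # one output sample per iteration
        acc = 0.0
        for k in range(len(h)):
            if 0 <= n - k < len(x):
                acc += h[k] * x[n - k]     # multiply-accumulate (MAC)
        y[n] = acc
    return y

def cfar(power, window, guard, scale):
    """Cell-averaging CFAR: flag cells exceeding scale * mean of neighboring cells.

    Neighbors are the cells within guard + window of the cell under test,
    excluding the cell itself and `guard` cells on each side.
    """
    hits = []
    for i in range(len(power)):
        total, count = 0.0, 0
        for j in range(i - guard - window, i + guard + window + 1):
            if 0 <= j < len(power) and abs(j - i) > guard:
                total += power[j]          # accumulation of neighboring elements
                count += 1
        if count and power[i] > scale * total / count:
            hits.append(i)
    return hits

FILTERED = tdfir([1, 2, 3], [1, 1])
DETECTIONS = cfar([1, 1, 1, 1, 10, 1, 1, 1, 1], window=2, guard=1, scale=3)
```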

Performance Results

Kernel  Data Set  CPU Throughput (GFLOPS)*  GPU Throughput (GFLOPS)*  Speedup
TDFIR   Set 1     3.382                     97.506                    28.8
TDFIR   Set 2     3.326                     23.130                    6.9
FDFIR   Set 1     0.541                     61.681                    114.1
FDFIR   Set 2     0.542                     11.955                    22.1
CT      Set 1     1.194                     17.177                    14.3
CT      Set 2     0.501                     35.545                    70.9
PM      Set 1     0.871                     7.761                     8.9
PM      Set 2     0.281                     21.241                    75.6
CFAR    Set 1     1.154                     2.234                     1.9
CFAR    Set 2     1.314                     17.319                    13.1
CFAR    Set 3     1.313                     13.962                    10.6
CFAR    Set 4     1.261                     8.301                     6.6
GA      Set 1     0.562                     1.177                     2.1
GA      Set 2     0.683                     8.571                     12.5
GA      Set 3     0.441                     0.589                     1.4
GA      Set 4     0.373                     2.249                     6.0
QR      Set 1     1.704                     54.309                    31.8
QR      Set 2     0.901                     5.679                     6.3
QR      Set 3     0.904                     6.686                     7.4
SVD     Set 1     0.747                     4.175                     5.6
SVD     Set 2     0.791                     2.684                     3.4
DB      Set 1     112.3                     126.8                     1.13
DB      Set 2     5.794                     8.459                     1.46

*The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively.
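The speedup column is the ratio of GPU throughput to CPU throughput. A small check on three rows, with values copied from the table and illustrative names:

```python
def speedup(cpu_throughput, gpu_throughput):
    """Speedup of the GPU over the CPU for the same kernel and data set."""
    return gpu_throughput / cpu_throughput

# (kernel, data set) -> (CPU throughput, GPU throughput), from the table above
SAMPLES = {
    ("TDFIR", "Set 1"): (3.382, 97.506),
    ("CT", "Set 2"): (0.501, 35.545),
    ("DB", "Set 1"): (112.3, 126.8),
}
SPEEDUPS = {key: speedup(cpu, gpu) for key, (cpu, gpu) in SAMPLES.items()}
```

Since the ratio is unit-free, it also applies to CT and DB, whose throughputs are not in GFLOPS.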

Performance Comparison

• GPU: NVIDIA Fermi
• CPU: Intel Core 2 Duo (3.33GHz)
• DSP: ADI TigerSHARC 101


Instruction Profiling

Thread Profiling

• Warp occupancy: number of active threads in an issued warp
  • 32 threads per warp
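As a small illustration of the metric, warp occupancy can be computed from a 32-bit active mask in which bit t is set when thread t is active. The mask representation and function names are assumptions; the slides do not say how the profiler records activity.

```python
WARP_SIZE = 32  # threads per warp, as stated above

def warp_occupancy(active_mask):
    """Number of active threads in one issued warp, given its 32-bit active mask."""
    return bin(active_mask & 0xFFFFFFFF).count("1")

def mean_occupancy(masks):
    """Average occupancy over a sequence of issued warps."""
    return sum(warp_occupancy(m) for m in masks) / len(masks)

FULL = warp_occupancy(0xFFFFFFFF)   # no divergence: all threads active
HALF = warp_occupancy(0x0000FFFF)   # half of the threads masked off by a branch
```

Low average occupancy indicates branch divergence: issue slots are spent on warps in which many lanes do no useful work.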


Off-Chip Memory Profiling

DRAM efficiency: the percentage of time spent on sending data across the pins of DRAM over the whole time of memory service.

Limitation

• GPU suffers from low power efficiency (MFLOPS/W)

High Performance Radar DSP Processor

• Motivation
• Feasibility of GPU for DSP processing
• Designing a massively parallel DSP processor

Key Idea - Hardware Architecture

• Borrow the GPU microarchitecture
  • Using a DSP core as the basic execution unit
  • Multiprocessors organized in programmable pipelines
  • Neighboring multiprocessors can be merged as wider datapaths

Key Idea – Parallel Code Generation

• Meta-programming based parallel code generation
• Foundation technologies
  • GPU meta-programming frameworks
    • Copperhead (UC Berkeley) and PyCUDA (New York University)
  • DSP code generation framework
    • Spiral (Carnegie Mellon University)

[Figure: code generation flow with stages labeled DSP code generation, Source optimization, Compile and runtime]

Key Idea – Internal Representation as KPN

• Kahn Process Network (KPN)
  • A generic model for concurrent computation
  • Solid theoretical foundation
    • Process algebra
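A Kahn process network consists of sequential processes that communicate only through unbounded FIFO channels with blocking reads, which makes the result independent of scheduling. The sketch below models this with Python threads and `queue.Queue`; the three-stage pipeline, the `None` end-of-stream marker and all names are illustrative and say nothing about the authors' internal representation.

```python
import queue
import threading

def source(out, values):
    """Process 1: emit the input samples, then an end-of-stream marker."""
    for v in values:
        out.put(v)
    out.put(None)

def scale(inp, out, gain):
    """Process 2: blocking read from one channel, write to the next."""
    for v in iter(inp.get, None):
        out.put(gain * v)
    out.put(None)

def accumulate(inp, result):
    """Process 3: running sum of everything received."""
    total = 0
    for v in iter(inp.get, None):
        total += v
        result.append(total)

def run_network(values, gain):
    """Wire source -> scale -> accumulate with unbounded FIFO channels."""
    a, b = queue.Queue(), queue.Queue()
    result = []
    procs = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, gain)),
        threading.Thread(target=accumulate, args=(b, result)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return result

OUTPUT = run_network([1, 2, 3, 4], gain=10)
```

However the threads interleave, the output is the same, and that determinism is what lets a scheduler remap processes freely.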

Scheduling and Optimization on KPN

• Automatic task and thread scheduling and mapping
  • Extract data parallelism through process splitting
  • Latency and throughput aware scheduling
  • Performance estimation based on analytical models

[Figure: timing diagram with stage times T1, T2, …, Ti and total time Ttotal]

Key Idea - Low Power Techniques

• GPU-like processors are power hungry!
• Potential low power techniques
  • Aggressive memory coalescing
  • Enable task-pipeline to avoid synchronization via global memory
  • Operation chaining to avoid extra memory accesses
  • ???

[Figure: used and unused portions of a DRAM line on a DRAM chip, comparing current coalescing with our coalescing solution]
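Why coalescing matters for power can be seen by counting how many DRAM lines a warp touches and how much of each fetched line is actually used. The sketch models generic coalescing only, not the proposed scheme, which the slides do not detail; the 128-byte line size, 4-byte accesses and all names are assumptions.

```python
LINE_BYTES = 128  # assumed size of one DRAM line / memory transaction

def lines_touched(addresses, line_bytes=LINE_BYTES):
    """Number of distinct DRAM lines fetched for one warp's byte addresses."""
    return len({a // line_bytes for a in addresses})

def utilization(addresses, access_bytes=4, line_bytes=LINE_BYTES):
    """Fraction of the fetched bytes that the warp actually uses."""
    used = len(set(addresses)) * access_bytes
    return used / (lines_touched(addresses, line_bytes) * line_bytes)

COALESCED = [4 * t for t in range(32)]    # thread t reads consecutive 4-byte words
STRIDED = [128 * t for t in range(32)]    # thread t reads one word per 128-byte line
```

In the strided case 32 lines are fetched to deliver the same 128 useful bytes, so most of the energy spent moving data across the DRAM pins is wasted.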

Outline

• Motivation and background
• Morphing GPU into a network processor
• High performance radar DSP processor
• Conclusion

Conclusion

• A new market of high performance embedded computing is emerging
• Multi-core engines would be the work-horses
  • Need both HW and SW research
• Case study 1: GPU based Internet routing
• Case study 2: Massively parallel DSP processor
• Significant performance improvements
• More work ahead
  • Low power, scheduling, parallel programming model, legacy code, …