77
©Vincent J. Mooney III, 2007 FPGAworld FPGAworld 13 September 2007 13 September 2007 Research Trends in Hardware/ Software Codesign of Embedded Operating Systems for FPGAs Vincent J. Mooney III Vincent J. Mooney III http:// http:// codesign.ece.gatech.edu codesign.ece.gatech.edu http:// http:// www.crest.gatech.edu www.crest.gatech.edu Associate Professor, School of Electrical and Computer Engineeri Associate Professor, School of Electrical and Computer Engineeri ng ng Adjunct Associate Professor, College of Computing Adjunct Associate Professor, College of Computing Co Co - - Director, Center for Research on Embedded Systems and Technology Director, Center for Research on Embedded Systems and Technology Georgia Institute of Technology Georgia Institute of Technology Atlanta, Georgia, USA Atlanta, Georgia, USA

Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

©Vincent J. Mooney III, 2007FPGAworldFPGAworld 13 September 200713 September 2007

Research Trends in Hardware/ Software Codesign of Embedded

Operating Systems for FPGAs

Vincent J. Mooney IIIVincent J. Mooney IIIhttp://http://codesign.ece.gatech.educodesign.ece.gatech.edu

http://http://www.crest.gatech.eduwww.crest.gatech.edu

Associate Professor, School of Electrical and Computer EngineeriAssociate Professor, School of Electrical and Computer EngineeringngAdjunct Associate Professor, College of ComputingAdjunct Associate Professor, College of Computing

CoCo--Director, Center for Research on Embedded Systems and TechnologyDirector, Center for Research on Embedded Systems and TechnologyGeorgia Institute of TechnologyGeorgia Institute of Technology

Atlanta, Georgia, USAAtlanta, Georgia, USA

Page 2: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

2FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

OutlineOutline

• Trends• Custom RTOS Hardware IP

• System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU)

• The he δδ Hardware/Software RTOS Generation FrameworkHardware/Software RTOS Generation Framework

•• Compilation for Reverse Execution & Compilation for Reverse Execution & DebuggingDebugging

•• ConclusionConclusion

Page 3: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

3FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Moore’s Prediction (“Law”)Si atom ~ 2Å.09u = 90nm = 900Å July 2004

⇒ ~450 atoms, 902 = 8100

.065u = 65nm = 650Å Jan 2006⇒ ~325 atoms, 652 = 4225

.045u = 45nm = 450Å July 2007⇒ ~225 atoms, 452 = 2025

.032u = 32nm = 320Å Jan 2009⇒ ~160 atoms, 322 = 1024

.022u = 22nm = 220Å July 2010⇒ ~110 atoms, 222 = 488

.016u = 16nm = 160Å Jan 2012⇒ ~80 atoms, 162 = 256

.011u = 11nm = 110Å July 2013⇒ ~55 atoms, 112 = 121

.008u = 8nm = 80Å Jan 2015⇒ ~40 atoms, 82 = 64

.006u = 6nm = 60Å July 2016⇒ ~30 atoms, 62 = 36

.004u = 4nm = 40Å Jan 2018⇒ ~20 atoms, 42 = 16

.003u = 3nm = 30Å July 2019⇒ ~15 atoms, 32 = 9

.002u = 2nm = 20Å Jan 2020⇒ ~10 atoms, 22 = 4

n+ n+

DrainSource

Gate

P-substrate

NFET:

Page 4: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

4FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Power Consumption

Power consumption of VLSI is a fundamental problem of mobile devices as well high-performance computers

Limited operation (battery life)HeatOperation cost

Power = dynamic + static Dynamic power more than 90% of total power (0.18u tech. and above)Static nearly equal to dynamic power for latest processes (.65u)

Dynamic power reduction: Technology scalingFrequency scalingVoltage scaling

IBM PowerPC 970*

*N. Rohrer et al., “PowerPC 970 in 130nm and 90nm Technologies," IEEE International Solid-State Circuits Conference, vol 1, pp. 68-69, February 2004.

Page 5: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

5FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Cost of latest Silicon Processes

Mask sets for 90nm: over $1 million (U.S. $)NRE costs $10 to $100 million U.S.

Verification costs can exceed design costsEmbedded software costs may exceed hardware costs

Page 6: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

6FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Increasing Cost of Customization*

Example: Design with80 M transistors in 100 nm technology

Estimated Cost -$85 M -$90 M

C P produ

ction

verifi

catio

n

desig

npro

totyp

e

verifi

catio

n

imple

mentat

ion

verifi

catio

n

12 – 18 months

Cost and Risk rising to unacceptable levels Top cost drivers

Verification (40%)Architecture Design (26%)Embedded Software Design

1400 man months (SW)1150 man months (HW)

HW/SW integration

*Handel H. Jones, ”How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com

Page 7: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

7FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Yearly Chip Design StartsRoughly 8,000 in 1996Peaked at approx. 12,000 in 2000Recent years approx. 3,000 per yearHowever, comparing, say, 2005 to 2000, number of EDA tool licenses purchased roughly the same

Larger team per chip design start

System level chip design 200533.5% U.S.26.1% Japan9.9% Taiwan, 5.8% Germany, 5.4% China, 5.3% Korea

Page 8: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

8FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Digital Silicon CMOS VLSI TrendsYesterday

(1980s)Today Tomorrow

memory

gate arrays

ASICs

processors

memory

gate arrays

ASICs

processors

reconfigurable

SoC

memory

ASICs

processors

reconfigurable

Platform SoC

Custom SoC ≤ 9products

≥ 10products

gate arrays /struct. ASIC

Page 9: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

9FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Embedded Systems Market Drivers

1950s-1980s: Business Applications1990s-Today: Consumer Electronics

Some observationsprofit margins tight in consumervolumes and risk high in consumersupply chain as or more important than raw technology

Page 10: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

10FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Chip Software Business Models

Electronics Design AutomationSynopsys, Cadence, Mentor, SynplicityYearly/quarterly subscriptions

Embedded Operating System CompaniesWindriver, Montavista, Timesys, Integrity, etc.Upgrades, debuggers, Integrated Development Environment (IDE)Per-seat costs 1/10 of EDA

$10K per seat (as compared to $100K for EDA)

Cooley’s observationFPGA software sells for far less than ASIC softwareMany FGPA companies give away software (lock-in…)

Page 11: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

11FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Failed Reconfigurable Attempts

STPhillipsTexas InstrumentsPilkingtonAtmelSamsungMotorolaIBMVantisLattice

QuicksilverConcurrent LogicToshibaChameleonPlesseyAdaptive SiliconDynachipCrosspointLucentNational Semiconductor

Page 12: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

12FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

A Few Successes

TriscendSynplicity

Note1:Xilinx, Altera: each probably have, as a lower bound, approx. 12 million lines of code for FPGA programming tools

Note2:Key reconfigurable patents from 1970s have expired

Page 13: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

13FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Comments (probably incorrect) on Trends!

Reconfigurable like to increase superlinearly(increasing % of market)

Tools critical to adoptionChips for consumer electronics to drive technology

E.g., Cell processorEmbedded software and operating systems more important and less supported than everBreakthrough platform SoC design flows to determine chip success

E.g., will not be Cell because too difficult to program…

Page 14: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

14FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

OutlineOutline

• Trends• Custom Operating System Hardware IP

• System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU)

• The he δδ Hardware/Software RTOS Generation FrameworkHardware/Software RTOS Generation Framework

•• Compilation for Reverse Execution & Compilation for Reverse Execution & DebuggingDebugging

•• ConclusionConclusion

Page 15: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

15FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

System-on-a-Chip (SoC)

This architecture is suitable for embedded multimedia applications, which require great processing power and large volume data management

RISC 2

DSP 2

Analog Interface Network Interface

DSP 1L1 Cache

L1 Cache

RISC 1

L1 Cache

Global Memory(DRAM /SRAM)

Custom Logic

SoCDMMU

ReconfigurableLogic

Page 16: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

16FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

SoCThe existence of global on-chip memory, arises the need for an efficient way to dynamically allocate it among the PEs

Page 17: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

17FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Problem

How to deal with the allocation of the large global on-chip memory between the PEs in a dynamic yet deterministic way?

Page 18: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

18FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Solution 1

Custom Memory Configuration (Static)Hardware/Software co-synthesis with memory hierarchies [Wayne Wolf]Matisse [IMEC]Memory synthesis for telecom applications [WUYTACK et Al.], [YKMAN et al.]

Page 19: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

19FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Custom Memory ConfigurationPros:

EasyDeterministic

Cons:Inefficient memory utilizationSystem modification after implementation is very difficult if not impossible

Page 20: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

20FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Solution 2

Shared memory multiprocessor (Dynamic)

Using conventional software memory Allocation/Deallocation techniques (e.g., Sequential Fits, Buddy Systems, etc.)Sharing one heap (using locks)Multiple heaps (one per processor)

Page 21: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

21FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Shared memory multiprocessor

ProsFlexibleEfficient memory utilization

ConsWorst case execution time is very high and usually not deterministic

Page 22: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

22FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Our Solution

We introduce a new memory management hierarchy, Two-Level Memory Management, for a multiprocessor SoCTwo-Level Memory Management combines the best of dynamic memory management techniques (flexibility and efficiency) with the best of static memory allocation techniques (determinism).

Page 23: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

23FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Our Solution (2)

In Two-Level Memory Management, large on-chip memory is managed between the on-chip processors (Level Two)Memory assigned to any processor is managed by the operating system running on that particular processor (Level One)To manage Level Two, we present the System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU)

Page 24: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

24FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Dynamic Memory Management

AutomaticAutomatically recycles memory that a program will not use againEither as a part of the language or as an extension

ManualThe programmer has direct control over when memory is allocated and when memory may be de-allocated (e.g., by using malloc() & free())

Page 25: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

25FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory AllocationSoftware Techniques

Sequential Fits

First Fit,Next Fit,Best Fit or Worst Fit

Page 26: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

26FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory AllocationSoftware Techniques

Segregated Free Lists

Simple Segregated Storage Segregated Fit

Page 27: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

27FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory AllocationSoftware Techniques

Buddy System

Bitmapped Fits

Page 28: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

28FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory AllocationHardware Techniques

Knowlton*

Binary buddy allocator that can allocate memory blocks whose sizes are a power of 2

Puttkamer *

Hardware buddy allocator (using Shift Register)Chang and Gehringer *

Modified hardware-based binary buddy system that suffers from the blind spot problem

Cam et al. *Hardware buddy allocator that eliminates the blind spotproblem in Chang’s allocator

* References are available in the thesis by M. Shalan

Page 29: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

29FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory AllocationHardware Techniques

Request size is 3

It searches for 4

[3 rounded to the nearest power of 2]

Page 30: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

30FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory Model AssumptionsThe global memory is divided into a fixed number of equally sized blocks ( e.g., 16KB)The global memory allocation done by the SoCDMMU will be referred to as G_allocationThe global memory de-allocation done by the SoCDMMU will be referred to as G_deallocationThe PE can G_allocate one or more than one block.Different PEs can issue the G_allocation/ G_de-allocation commands simultaneously

Page 31: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

31FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory Model Assumptions (continued)

Each memory block has one physical address and one or more virtual addresses. The block virtual address may differ from PE to anotherThe block virtual address will be referred to as PE-address

Page 32: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

32FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Two-Level Memory Management

The SoCDMMU manages the memory between the PEsThe OS (or custom software) on each PE manages the memory between the processes that run on that PEThe process requests the memory allocation from the OS or custom software. If there in not enough memory, the OS requests memory allocation from the SoCDMMU

Page 33: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

33FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Types of Memory Allocation

Exclusive• Only the the owner can access it. No other PE can

access it

Read/Write• The owner can read/write to it. Other PE’s can

read from it if it G_allocated it as read only

Read Only• The PE G_allocates the memory for read only.

Other PE G_allocated it as Read/Write

Page 34: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

34FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

PE-SoCDMMU Interface

Page 35: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

35FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

SoCDMMU Commands

Page 36: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

36FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU Hardware

Address Converter

Page 37: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

37FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU HardwareThe Basic SoCDMMU

Basic SoCDMMU

Page 38: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

38FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

SW ID

1

012345

0010000. . . . .0

255 . . . . .

PEMODE

0

1

2

3

255

.

.

.

.

25PE 1Ex

11PE 2Ex

Allocation Vector

Allocation Table

SW ID

1

012345

1110000. . . . .0

255 . . . . .

PEMODE

0

1

2

3

255

.

.

.

.

25PE 1Ex

11PE 2Ex

Allocation Vector

Allocation Table

The SoCDMMU HardwareThe Basic SoCDMMU

Basic SoCDMMU

Page 39: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

39FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Basic SoCDMMU

The SoCDMMU HardwareThe Basic SoCDMMU

Page 40: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

40FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Basic SoCDMMU

The SoCDMMU HardwareThe Basic SoCDMMU

Page 41: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

41FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Basic SoCDMMU

The SoCDMMU HardwareThe Basic SoCDMMU

Page 42: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

42FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU HardwareThe Allocation Unit

1 allocate(size,in[0:n-1]) {2 for (i:=0 to n-1) {3 if (in[i]==0 and size>0) {4 out[i]:=1;5 size:=size-1;6 } else out[i]:=0;7 }8 if (size>0) return NOT_ENOUGH_MEMORY;9 else return out;10 }

Page 43: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

43FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU HardwareThe Allocation Unit

0 0 0 0

1 1 1 1

0 0 1 1

21

1 0 0

0

Page 44: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

44FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU HardwareThe Allocation Unit

Page 45: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

45FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU HardwareThe Allocation Unit

8.5X3.3XComparison

17.5 MHz56.3 ns17930Un-optimized Alocator

150 MHz6.6 ns5364Optimized Allocator

Max. Clock Speed (MHz)

Worst Delay (ns)

Area (NAND gates)

256 G_blocks. Synthesized using Synopsys Design CompilerTM and a TSMC 0.25u library from LEDA Systems.

Page 46: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

46FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The SoCDMMU Hardware Execution Times/Synthesis

Synthesized using the TSMC 0.25u .Clock Speed: 300MHzSize:

~7500 gates (not including the Allocation Table and Address Converter)Allocation Table: The size of 0.66KB 6T-SRAMAddress Converter: The size of 1.22 KB 6T-SRAM

Page 47: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

47FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Microcontroller Implementation

Stores the allocation StatusExecutes the allocation commandsExecutes the de-allocation commands

Microcontroller Roles:

Custom HW: 16 Cycles WCET

uC: 231 Cycles BCET

Page 48: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

48FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Automatic Generation: Motivation

Page 49: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

49FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

SoCDMMU IP GenerationTo overcome the productivity gap, Intellectual Property (IP) cores should be used in SoC designsAlso, tools should be used to automatically customize/configure the IPs

Processor Generators: Tensilica, ARC Core, etc.Memory Compilers: Artisan, Virage, LEDA, etc.

The SoCDMMU as an IP core should be customized before being used in a system different than the one for which it was designed

Page 50: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

50FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

DX-Gt Overview

DX-Gt

H/W DB VPP

Page 51: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

51FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

RTOS SupportIntroduction

Conventional memory allocation algorithms (e.g., Buddy-heap) are not suitable for Real-Time systems because they are not deterministic and/or the WCET is highThis is mainly because of memory fragmentation and compaction. Also, most allocation algorithms usually use linked lists that have constant search time.An RTOS uses a different approach to make the allocation deterministic

Page 52: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

52FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

RTOS SupportIntroduction

An RTOS (e.g., uCOS-II, eCOS, VRTXsa, etc., ) usually divides the memory into pools each of which is divided into fixed-sized allocation units and any task can allocate only one unit at a time

Pool 1 Pool 2 Pool 3

Page 53: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

53FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Atalanta Memory ManagementOverview

Atalanta is an open source RTOS developed at GaTechAtalanta allows tasks to obtain fixed-sized memory blocks from partitions made of a contiguous memory areaAllocation and de-allocation of these memory blocks are done in a constant timeNo partition can be created at the run-time

Partition

BlockSizePartition

Size

StartAddress

.

.

.

Page 54: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

54FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Atalanta Memory ManagementAPI Functions

asc_partition_gainGet free memory block from a partition (non-blocking)

asc_partition_seekGet free memory block from a partition (blocking)

asc_partition_freeFree a memory block

asc_partition_referenceGet partition information

Page 55: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

55FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Atalanta Support for the SocDMMUObjectives

Add Dynamic Memory Management to AtalantaUse the same Memory Management API FunctionsKeep the Memory Management Deterministic

Page 56: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

56FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Atalanta Support for the SocDMMUFacts

The SoCDMMU needs to know where the allocated physical memory will be placed in the PE address spaceThe PE address space is much larger than the physical address space (64 MB* vs. 4GB)The PE-Address Space Fragmentation can be overcome by:

Using the SoCDMUU G_move Command (pointers problems)Replicate the physical address space

* A typical global on-chip memory size for billion transistor multiprocessor SoC

Page 57: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

57FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Atalanta Support for the SocDMMUNew API New Functions

Find a place in the PE address space to which to map the allocated memory.

asc_memory_find

Delete a partition and de-allocate memory block if required.

asc_partition_delete

Create a partition by requesting memory allocation from the SoCDMMU if necessary.

asc_partition_createDescriptionFunction Name

Page 58: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

58FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Comparison to a Fully Shared-Memory Multiprocessor System

BusArbiter

SoCDMMU

ARM9

L1 $

ARM9

L1 $

ARM9

L1 $

ARM9

L1 $

Global Memory

Simulation Setup

Simulation was carried out using Mentor Graphics Co-Verification Environment (CVE) , the cycle-accurate XRAY sotwaresimulator/debugger and Synopsys VCS Verilog simulatorARM SDT was used for software development

Page 59: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

59FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 1

Global memory of 16MB; Data L1 $ is 64 KB, Instruction L1 $ is 64 KBThe ARM runs at 150 MHz.Accessing the Global Memory costs 5 cycles for the first access.A handheld device that utilizes this SoC can be used for OFDM communication as well as other applications (MPEG2 video player).Initially the device runs an MPEG2 video player. When the device detects an incoming signal it switches to the OFDM receiver. The switching time (which includes the time for memory management) should be short or the device might lose the incoming message.

Page 60: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

60FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 1

32 Kbytes8 Kbytes0.5 Kbytes32 Kbytes1.5 Kbytes1.5 Kbytes1500 Kbytes1 Kbytes5 Kbytes32 Kbytes500 Kbytes34 Kbytes2 Kbytes

OFDM ReceiverMPEG-2 Player

• Sequence of Memory Allocations Required

Page 61: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

61FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 1Speedup of a single malloc()

8.21X7.92XSpeed up over uClibc malloc()

2.8X3.78XSpeed up over SDT malloc()

199 cycles28 cyclesSoCDMMU allocation

1646 cycles222 cyclesuClib malloc()

559 cycles106 cyclesSDT2.5 embedded malloc()

Execution Time (Worst Case)

Execution Time (Average Case)

Page 62: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

62FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 1Speedup of a single free()

28.42X14.8XSpeed up over uClibc free()

6.64X5.9XSpeed up over SDT free()

28 cycles14 cyclesSocDMMU deallocation

796 cycles208 cyclesuClib free()

186 cycles83 cyclesSDT2.5 embedded free()

Execution Time (Worst Case)

Execution Time (Average Case)

Page 63: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

63FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 1Speedup in transition time

3.9X4851 cycles1244 cyclesWorst Case

4.4X1240 cycles280 cycles Average Case

SpeedupUsing SDT malloc() and free()Using the SOCDMMU

12.46X15502 cycles1244 cyclesWorst Case

9.26X2593 cycles280 cycles Average Case

SpeedupUsing uClibc malloc() and free()Using the SOCDMMU

Page 64: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

64FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 2Speedup in Execution Time

Same setup used for Experiment 1GCC and Glibc were used for development3 kernels from the SPLASH-2 application suite are used

Complex 1D FFT (FFT)Integer RADIX sort (RADIX)Blocked LU decomposition (LU)

They were modified to replace all the static memory allocations by dynamic ones

Page 65: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

65FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experiment 2Speedup in Execution Time

20.38%141491694333RADIX

27.13%101998375988FFT

9.90%31512318307LU

% of E. T. used to Memory Management

Memory Management E. T. (Cycles)

E.T. (Cycles)Benchmark

19.59%96.10%0.99%5505558347RADIX

26.34%97.10%1.07%2951276941FFT

9.44%95.31%0.51%1476288271LU

% Reduction in Benchmark E. T.

% Reduction in Time used to Manage Memory

% of E. T. used to Memory Management

Memory Management E. T. (Cycles)

E.T. (Cycles)

Benchmark

Glibc malloc() & free()

Using the SoCDMMU

Page 66: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

66FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Area Estimation of The SoC

* Using dual-port 6T SRAM Cells..

0.186%SoCDMMU to SoC (%)

0.0186%SoCDMMU w/o memory elements to SoC

160.965M TransistorsSoC (total)

300K TransistorsSoCDMMU (total)

4 x 60K = 240K TransistorsSoCDMMU Address Converters (4)

30K TransistorsSoCDMMU Allocation Table

30K TransistorsSoCDMMU (w/o memory elements)

134.217M TransistorsGlobal On-Chip Memory (16MB)

4 x 6.5M = 26M Transistors*4 L1 Caches (64KB+64KB)

4 x 112K = 448K Transistors4 ARM9TDMI Cores

Number of TransistorsElement

Page 67: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

67FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Area Estimation of The SoC

For this 161 Million transistor chip, the SoCDMMU consumes 300K transistors (0.186% of 161M) and yields a 4-10X speedup in memory allocation/de-allocation

Page 68: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

68FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

SoCDMMU ConclusionWe introduced The Two-Level memory management hierarchy for multiprocessor SoCWe showed how Level 1 in the hierarchy can be implemented using the SoCDMMUWe gave a sample hardware implementation of the SoCDMMUWe introduced DX-Gt to automatically configure/customize the SoCDMU hardwareWe showed how to add the SoCDMMU support to a real-time OSOur Experiments show that using the SoCDMMUspeeds up the application transition time as well as the application execution time

Page 69: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

69FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

OutlineOutline

• Trends• Custom Operating System Hardware IP

• System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU)

• The he δδ Hardware/Software RTOS Generation FrameworkHardware/Software RTOS Generation Framework

•• Compilation for Reverse Execution & Compilation for Reverse Execution & DebuggingDebugging

•• ConclusionConclusion

Page 70: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

70FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

δ Hardware/Software RTOS Generation Frameworkand current simulation platform

HardwareRTOS

Library

Makefile

User.hSW RTOS

Top.vHW RTOS

Base Architecture

Library

GUI Tool

SoftwareRTOS

Library

SWCompile

HWCompile

UserInput

Resultand

Feedback

Application

Compiled Hardware

Description

ExecutableHW

Simulation in

Seamless CVE

ExecutableSW

XRAY

Modelsimor VCS

Page 71: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

71FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

The The δδ Hardware/Software RTOS Hardware/Software RTOS Generation FrameworkGeneration Framework

RTOS1

Hardware RTOS library

Software RTOS library

GUI tool

SW RTOS w/ dead-

lock avoid.

SW RTOS +

SoCDDU

SW RTOS + SoCLC

+ SoCDDU

Compile Stage for each systemApplication

Executable HW file for each

Executable SW file for each

Simulation in Seamless CVE

Base Architecture

library

VCS XRAY

RTOS2 RTOS3 RTOS4 RTOS5

RTU

User Input

SW RTOS w/ sem

SW RTOS +

SoCLC

RTOS6

Page 72: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

72FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

OutlineOutline

• Trends• Custom Operating System Hardware IP

• System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU)

• The he δδ Hardware/Software RTOS Generation FrameworkHardware/Software RTOS Generation Framework

•• Compilation for Reverse Execution & Compilation for Reverse Execution & DebuggingDebugging

•• ConclusionConclusion

Page 73: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

73FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Generation of a Reverse Program

* See thesis by T. Akgul available from http://codesign.ece.gatech.edu

Page 74: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

74FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Experimentation Platform

Background Debug Mode (BDM) Interface

PCWindows 2000

MBX860MPC860 processor

4MB DRAM, 2MB Flash RTC, four 16-bit timers, watchdog

Page 75: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

75FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Memory Usage for State Saving

354X589X1331471140784336LZW (16KB input data)

112X185X3513916364970LZW (4KB input data)

35X57X98.434255630LZW (1KB input data)

2X2.5X246447676175ADPCM (128KB input data)

2X2.5X123223843088ADPCM (64KB input data)

2X2.5X61611921544ADPCM (32KB input data)

1404X2206X125017550062756883Matrix multiply (400x400)

143X224X12.618012820Matrix multiply (40x40)

14X21X0.172.353.6Matrix multiply (4x4)

55X82X7237397913593389Selection sort (10000 inputs)

27X40X15140656032Selection sort (1000 inputs)

6.3X9X7.546.968.2Selection sort (100 inputs)

ISSDI /

RCG

ISS /RCG

RCG (KB)

ISSDI (KB)

ISS (KB)

Page 76: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

76FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Program Re-execute Approach vs. RCG

0

100

200

300

400

500

0 100 200 300 400

Outermost loop iteration count

Tim

e (s

econ

ds)

Forward execution Reverse execution via RCG

400x400 matrix multiplystarting point for reverse execution

starting point for forward execution

Page 77: Research Trends in Hardware/ Software Codesign of Embedded ...mooney.gatech.edu/codesign/publications/mooney/presentation/fpgaw07.pdf · Yearly Chip Design Starts Roughly 8,000 in

77FPGAworldFPGAworld 13 September 200713 September 2007 ©Vincent J. Mooney III, 2007

Conclusion

TrendsSoCDMMUReverse Program GenerationHave fun at FPGAworld 2007!