Architectural Exploration for Programmable Embedded Systems

Nikil D. Dutt
Architectures and Compilers for Embedded Systems (ACES) Laboratory
Center for Embedded Computer Systems
University of California, Irvine
[email protected]
http://www.cecs.uci.edu/~dutt

With contributions from the EXPRESSION team: Peter Grun, Ashok Halambi, Nick Savoiu, Radu Cornea, Prabhat Mishra, Aviral Shrivastava, Partha Biswas, Srikanth Srinivasan, Ilya Issenin, Marcio Buss, Dr. Hiroyuki Tomiyama, and Prof. Alex Nicolau

Work partially supported by NSF, ONR, and DARPA



Page 2: Final version is available

Copyright © 2002 Nikil Dutt ACES Laboratory www.cecs.uci.edu/~aces IITD ASIP Wkshp 2

Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

Page 3: Final version is available


Traditional Processor-Centric Designs

Performance-driven designs

Limitations:
- from application: only limited by available parallelism
- from architecture: widening processor-memory gap => memory bottleneck

Solution:
- maximally expose the available parallelism in the application (compiler)
- devise a memory hierarchy that exploits this parallelism effectively

Can increase performance by:
- explicit exploitation of available parallelism
- implicit exploitation of parallelism to mask operation and memory latencies

Match processor architecture w/ memory configuration for application suite(s)

Page 4: Final version is available


Embedded System-on-Chip (SOC) Designs

One or few dedicated applications
- Opportunity to customize design

Diverse requirements
- (Real-time) performance, power, data/code density, testability, ...

Approach: aggressively exploit application behavior
- Use coarse-grain and fine-grain compiler techniques
- Evaluate different architectures and memory organizations

Need for exploration capability without loss of efficiency
- Rapid software toolkit generation (compiler, simulator, debugger, ...)

Page 5: Final version is available


Embedded S-O-C Design Issues

Technology Trends
- 1G transistor chips by ~2010 (SIA Roadmap)
- Faster processors => Migration of functionality from HW to SW
- Reconfigurable logic => SW
- DRAM merged with logic (plus analog, RF, etc.)

Market Trends
- Shrinking time-to-market
- Design Reuse
  - Componentization, decreasing time between design starts
- Product “versioning”
  - New standards, but unique implementations (e.g., Bluetooth, G3)

Result: Intense pressure to rapidly innovate, explore, and differentiate, while meeting complex design constraints

Page 6: Final version is available


What to do with all these transistors?

New processor architectures
- E.g., Ultra Large Instruction Word Machines (i.e., VLIW-like)
- Aggressive use of compiler technology (speculation, sophisticated disambiguation)

Multiprocessors on a chip
- Heterogeneous processors tuned for specific tasks/functions
- Enhanced compiler technology for better communication/synchronization
- Integration of OS/Multithreading

Novel memory organizations and hierarchies
- Different types of on-chip memories: multiple cache hierarchies, frame buffers, stream buffers, etc.
- Need “memory-aware” compiler, and processor-memory coexploration

RESULT: Software issues WILL dominate, requiring rapid generation of software toolkits to support design

Page 7: Final version is available


Programmable Embedded Systems: Boards to SOCs

Past
- Board-level ICs

Present
- System-on-a-chip (SOC) and IP “cores”
- Core types
  - Hard: layout
  - Firm: structural HDL
  - Soft: RT-synthesizable HDL

[Figure: a board carrying Processor, Memory, and Peripheral ICs, next to an SOC built from IP cores (Processor, Mem, Peripheral) and a core library (PeripheralA, PeripheralB, ProcessorX). Source: F. Vahid]

Page 8: Final version is available


Networked Embedded System

[Figure: block diagram of a networked embedded system. Blocks: Power Supply (Battery, DC-DC Converter); Communication (Radio Modem, RF Transceiver); Processing (Programmable µPs & DSPs for apps, protocols, etc.; Memory; ASICs); Peripherals (Disk, Display). Annotations: signaling protocols, choice of modulation, TX/RX architecture, RF/IF circuits; Baseband DSP. Courtesy: R. Gupta]

Page 9: Final version is available


Programmable SOC Platforms

Domain-specific Parameterized Cores

Sample Parameters:
- Voltage scale
- Size, line, associativity
- Bus width, encoding (gray, invert)
- UART tx/rx buffer size
- DCT resol.

Configurations impact power/performance

[Figure: System-on-a-Chip (SOC) platform. Blocks: MIPS, I-Cache, D-Cache, Memory, DMA, Bridge, Peripheral Bus, UART, DCT CODEC. Source: T. Givargis]

Page 10: Final version is available


Why Explore Architectures?

Example: JPEG implemented on prog. SOC platform

[Figure: plot for JPEG of Power (uW, axis 0 to 1400) versus Execution time (usec, axis 0 to 1800) across platform configurations. Source: T. Givargis]

Variations: 5.10x exe., 7.51x power, 2.73x energy

Tremendous Variation in Power/Performance!
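A minimal sketch of why such exploration matters: enumerate platform configurations, estimate execution time and power for each, and keep only the configurations that no other configuration beats on both metrics. The parameter ranges and the `evaluate` cost model below are invented for illustration; in a real flow the numbers would come from a simulator or estimator, not from these formulas.

```python
from itertools import product

# Hypothetical parameter ranges, loosely modeled on the sample parameters
# of a parameterized SOC platform (cache size, bus width).
CACHE_SIZES_KB = [1, 4, 16]
BUS_WIDTHS = [8, 16, 32]


def evaluate(cache_kb, bus_width):
    """Toy stand-in for a simulator/estimator: returns (time_us, power_uw)."""
    time_us = 2000.0 / (cache_kb ** 0.5) / (bus_width / 8) ** 0.5
    power_uw = 100.0 + 40.0 * cache_kb + 10.0 * bus_width
    return time_us, power_uw


def pareto_front(points):
    """Keep configurations that no other configuration dominates."""
    front = []
    for cfg, (t, p) in points.items():
        dominated = any(
            (t2 <= t and p2 <= p) and (t2 < t or p2 < p)
            for other, (t2, p2) in points.items()
            if other != cfg
        )
        if not dominated:
            front.append(cfg)
    return sorted(front)


points = {cfg: evaluate(*cfg) for cfg in product(CACHE_SIZES_KB, BUS_WIDTHS)}
front = pareto_front(points)
```

The lowest-power and the fastest configurations always survive; the interesting design points are the trade-offs in between.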

Page 11: Final version is available


Philips Velocity SoC Platform

Page 12: Final version is available


Configurable Processor Platform : Tensilica Xtensa

[Figure: configurable processor blocks: MMU, ALU, Pipe, Cache, I/O, Timer, Register File, Controller]

Page 13: Final version is available


Fixed Programmable SOC Template

Page 14: Final version is available


Programmable Architectural Trends

Recent advances in System-On-Chip Technology
- customizable processor cores, coprocessors, multiple processors on SOC
- novel on-chip/off-chip memory hierarchies, heterogeneous memory organizations
- mixed memory/logic fabrication (on-chip DRAM)

Customization of SOC architectures for specific embedded applications/tasks.

Software content of SOCs increasing rapidly

Tune SOC for diverse goals: power, code size, area, ...

Shrinking time-to-market + short product lifetimes

Need: rapidly evaluate SOC architectures
- Design Space Exploration (DSE)

Page 15: Final version is available


Architecture-Compiler Coupling

Parameters:
- no., size of units
- no., size, ports of reg files
- caches
- memory hierarchy

Instruction Set Definition:
- basic instructions
- sub-word parallelism
- application-specific instructions
- cache control instructions
- ...

Page 16: Final version is available


Compiler-Architecture-CAD Coupling

Parameters:
- no., size of units
- no., size, ports of reg files
- caches
- memory hierarchy

Instruction Set Definition:
- basic instructions
- sub-word parallelism
- application-specific instructions
- cache control instructions
- ...

Tasks:
- estimate global memory
- identify bottlenecks
- reduce memory traffic
- ...
- partition and organize memories

Hardware/Software Partitioning

Memory-related Optimizations

Page 17: Final version is available


Programmable Arch’s: Traditional Design Flow

[Figure: traditional design flow. Blocks: Design Specification; Estimators; HW/SW Partitioning; HW (VHDL, Verilog); SW (C); Synthesis; Compiler; Cosimulation. Target: Processor Core, On-Chip Memory, Off-Chip Memory, Synthesized HW, Interface.]

- Application-to-architecture mapping

- Early HW/SW partitioning

- Ensuing tasks of synthesis, SW compilation

Page 18: Final version is available


Programmable Arch’s: Traditional Design Flow

[Figure: traditional design flow. Blocks: Design Specification; Estimators; HW/SW Partitioning; HW (VHDL, Verilog); SW (C); Synthesis; Compiler; Cosimulation. Target: Processor Core, On-Chip Memory, Off-Chip Memory, Synthesized HW, Interface.]

Issues:

-- Multiple specifications: Functional, IS, RT (synthesis)

-- Software after Hardware

-- Limited Exploration Space: need compiler/simulator in-the-loop

-- Consistency and Validation

-- Verification and Testing

Page 19: Final version is available


Traditional Design Flow

[Figure: traditional design flow. Blocks: Design Specification; Estimators; HW/SW Partitioning; HW (VHDL, Verilog); SW (C); Synthesis; Compiler; Cosimulation. Target: Processor Core, On-Chip Memory, Off-Chip Memory, Synthesized HW, Interface.]

Predefined Architectural Model

Page 20: Final version is available


IP-Centric Design Flow

Increasing use of IP blocks
- COTS => IP, Soft/Hard IP blocks
- Processor Core Families
  - RISC, DSP, VLIW, ASIPs: many attributes parametrizable
- Custom Memory Configurations
- Special-purpose HW blocks (video/audio compression/decompression engines, encryption engines, etc.)

Design Reuse
- Leveraged through predesigned, preverified blocks
- Customization, adaptation

Reduce time-to-market
- Key Bottleneck: lack of software tools to support use of IP
- Again, urgent need to rapidly generate optimized software toolkits

Page 21: Final version is available


Main Bottleneck

SOC Customization with IP Blocks
- COTS: SW tools available (already developed)
- IP Blocks: no support tools, huge time lag until SW tools are generated/modified

Need rapid generation of SW toolkit for Embedded SOC (compilers, simulators, debuggers, etc.)

Language-Based Design Methodology for Embedded SOC
- Application => Specification Language
- Architecture => ADL (drives SW tools generation)

Page 22: Final version is available


ADL-Driven Design Flow

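[Figure: ADL-driven design flow. The traditional flow (Design Specification; Estimators; HW/SW Partitioning; HW in VHDL, Verilog; SW in C; Synthesis; Compiler; Cosimulation; Processor Core, On-Chip Memory, Off-Chip Memory, Synthesized HW, Interface) with an added ADL Specification, an IP Library (P1, P2, M1), and Verification.]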

Rapid design space exploration

Quality tool-kit generation

Design reuse

Page 23: Final version is available


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

Page 24: Final version is available


Architecture Description Languages

Specify architecture templates of SOCs
- Blocks/components which reside on the SOC
- How they are connected or interact
- Functionality of each component

Support Automated SW toolkit generation
- ILP compilers
- Simulators (instruction-set-, cycle-, phase-accurate)
- Debuggers
- Real-time OSs

Verification / Validation

Page 25: Final version is available


ADL-Based SOC Codesign Flow

[Figure: ADL-based SOC codesign flow. Blocks: Application; ADL Specification; IP Library; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; System on Chip (Processors, ASICs, Memories, IFs, Interconnection). Arrow labels: Specify, Synthesize, Generate, Verify/Validate, Estimate, Modify, Reuse.]

Page 26: Final version is available


Survey of ADLs

Classification Based on Type of Information Captured
- Behavior-centric ADLs
- Structure-centric ADLs
- Mixed-level ADLs

Classification Based on Their Main Objective
- Synthesis-Oriented ADLs
- Compiler-Oriented ADLs
- Simulation-Oriented ADLs
- Validation-Oriented ADLs

Page 27: Final version is available


Behavior Centric ADLs

Primarily capture Instruction Set (IS)
Provide programmer’s view
Organized in a hierarchical manner for conciseness

Advantages:
- Capture easily available information
- Good for regular architectures

Disadvantages:
- Tedious for irregular architectures
- Hard to specify pipelining
- Contain an implicit architecture model

[Figure: outline of an instruction-set description with sections for Arithmetic Operations (Addition, ...), Memory Operations (...), and Constraints (...)]

Examples: nML, ISDL, Valen-C, CSDL

Page 28: Final version is available


Structure Centric ADLs

Provide net-list view of the architecture

Advantages:
- Common specification for both software toolkit generation and hardware synthesis
- Can capture detailed pipelining information

Disadvantages:
- Hard to extract IS view

[Figure: outline of an instruction-set description with sections for Arithmetic Operations (Addition, ...), Memory Operations (...), and Constraints (...)]

Examples: MIMOLA, COACH

Page 29: Final version is available


Mixed-Level ADLs

Capture Instruction Set view
Capture high-level architecture view

Combine benefits of both

Advantages:
- Common specification for both software toolkit generation and hardware synthesis
- Can validate/verify structure versus behavior (and vice-versa)

Disadvantages:
- May require specification of redundant information

[Figure: outline of an instruction-set description with sections for Arithmetic Operations (Addition, ...), Memory Operations (...), and Constraints (...)]

Examples: MDes, LISA/RADL, EXPRESSION
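One way to picture "validate structure versus behavior" is a cross-check between the two views of a mixed-level description. The sketch below is purely illustrative: the dictionaries, operation names, and unit classes are hypothetical and do not follow the syntax of MDes, LISA/RADL, or EXPRESSION.

```python
# Hypothetical mixed-level description (illustrative names only).
BEHAVIOR = {            # instruction-set view: operation -> required unit class
    "add": "ALU",
    "mul": "MUL",
    "load": "MEM",
}
STRUCTURE = {           # high-level structural view: unit instance -> unit class
    "alu0": "ALU",
    "alu1": "ALU",
    "lsu0": "MEM",
}


def check_consistency(behavior, structure):
    """Return (operations no unit can execute, units no operation uses)."""
    available = set(structure.values())
    required = set(behavior.values())
    unmapped_ops = sorted(op for op, cls in behavior.items() if cls not in available)
    unused_units = sorted(u for u, cls in structure.items() if cls not in required)
    return unmapped_ops, unused_units


unmapped, unused = check_consistency(BEHAVIOR, STRUCTURE)
```

Here the behavior declares a `mul` operation for which the structure provides no multiplier, which is the kind of mismatch a mixed-level ADL makes checkable, at the cost of stating some information twice.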

Page 30: Final version is available


Survey of ADLs

Classification Based on Type of Information Captured
- Behavior-centric ADLs
- Structure-centric ADLs
- Mixed-level ADLs

Classification Based on Their Main Objective
- Synthesis-Oriented ADLs
- Compiler-Oriented ADLs
- Simulation-Oriented ADLs
- Validation-Oriented ADLs

Page 31: Final version is available


Synthesis-Oriented ADLs

[Figure: ADL-based SOC codesign flow. Blocks: Application; ADL Specification; IP Library; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; System on Chip (Processors, ASICs, Memories, IFs, Interconnection). Arrow labels: Synthesize, Generate, Verify/Validate, Estimate, Modify, Reuse.]

Enable early synthesis of architectures

Page 32: Final version is available


Synthesis-Oriented ADLs

MIMOLA (Univ. of Dortmund, Germany)
- Synthesizable HDL
- Mainly targeted to DSPs with tightly constrained datapaths
- Used in the MSSQ and RECORD compiler systems
- Capture the structure (RT-level netlist) of the target processor
- Behavior (instruction set) is automatically extracted
- ILP constraints are automatically detected

COACH (Kyushu Univ., Japan)
- CAD system for ASIPs
- Mainly targeted to simple RISC processors without ILP
- Use the UDL/I HDL for processor description
- Capture the structure
- Behavior is automatically extracted
- Generate compilers and instruction-set simulators

Page 33: Final version is available


Synthesis-Oriented ADLs

Summary
- Synthesis and simulation tools available
- Capture only the structural aspect (RT-level netlist) of the processors
- Low abstraction level => not suited to early and rapid DSE of SOCs
- Behavior extraction and compiler generation are successful for a limited class of processor architectures

Page 34: Final version is available


Compiler-Oriented ADLs

[Figure: ADL-based SOC codesign flow. Blocks: Application; ADL Specification; IP Library; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; System on Chip (Processors, ASICs, Memories, IFs, Interconnection). Arrow labels: Synthesize, Generate, Verify/Validate, Estimate, Modify, Reuse.]

Support automatic generation of compilers

Page 35: Final version is available


Compiler-Oriented ADLs

nML (TU Berlin, Germany)
- Mainly targeted to DSPs and ASIPs
- Generate compilers, instruction-set simulators, and assemblers at TU Berlin, IMEC, Cadence, etc.
- Capture the behavior (instruction set) of the processors as an attribute grammar
- ILP constraints are described as a set of legal combinations of operations

ISDL (MIT, USA)
- Mainly targeted to VLIW processors
- Generate compilers, assemblers, and cycle-accurate simulators
- Capture the behavior
- ILP constraints are described as a set of Boolean rules, all of which must be satisfied
- Can be translated to synthesizable Verilog code
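The two styles of ILP constraint description mentioned above can be contrasted in a few lines. This is only a sketch: the operation names and the rules are made up, and the code is plain Python, not nML or ISDL syntax.

```python
# Illustrative only: operations and rules are invented for this sketch.

# Style 1 (as described for nML): enumerate the legal combinations of operations.
LEGAL_COMBINATIONS = {
    frozenset({"add"}),
    frozenset({"mul"}),
    frozenset({"load"}),
    frozenset({"add", "load"}),
    frozenset({"mul", "load"}),
}


def legal_by_enumeration(ops):
    """An instruction word is legal if its operation set is listed."""
    return frozenset(ops) in LEGAL_COMBINATIONS


# Style 2 (as described for ISDL): Boolean rules, all of which must hold.
RULES = [
    lambda ops: not ({"add", "mul"} <= ops),  # add and mul share a datapath here
    lambda ops: len(ops) <= 2,                # at most two issue slots
    lambda ops: len(ops) >= 1,                # empty word not allowed in this toy
]


def legal_by_rules(ops):
    """An instruction word is legal if every rule evaluates to True."""
    ops = set(ops)
    return all(rule(ops) for rule in RULES)
```

For this toy machine both descriptions accept exactly the same instruction words; enumeration grows with the number of legal combinations, while rules grow with the number of restrictions.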

Page 36: Final version is available


Compiler-Oriented ADLs

MDES (HP Labs & UIUC, USA)
- Used for design space exploration of high-performance processors in the Trimaran system
- Generate compilers and cycle-accurate simulators
- Retargetability of cycle-accurate simulators is limited to the HPL-PD processor family
- Mainly captures the behavior (instruction set)
- ILP constraints are described in the form of reservation tables

EXPRESSION (UC Irvine, USA)
- Targeted to a wide range of architectures (e.g., RISC, VLIW, SS, DSP)
- Generate compilers and cycle-accurate simulators
- Capture both the behavior and the structure (high-level netlist)
- Models complex memory organizations/hierarchies
- ILP constraints are automatically detected through reservation tables
- Graphical front-end for specification and analysis
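To make the reservation-table idea concrete, the sketch below records which resources each operation occupies in each cycle after issue and derives issue constraints from overlaps. The resource names, latencies, and table layout are assumptions for illustration; they are not the MDES or EXPRESSION formats.

```python
# Illustrative reservation tables: cycle offset after issue -> resources in use.
RESERVATION_TABLES = {
    "add":  {0: {"fetch"}, 1: {"alu"}, 2: {"wb_port"}},
    "mul":  {0: {"fetch"}, 1: {"mul"}, 2: {"mul"}, 3: {"wb_port"}},
    "load": {0: {"fetch"}, 1: {"alu"}, 2: {"mem"}, 3: {"wb_port"}},
}


def conflicts(op_a, op_b, distance):
    """True if op_b, issued `distance` cycles after op_a, needs a resource
    in the same absolute cycle in which op_a holds it."""
    table_a = RESERVATION_TABLES[op_a]
    table_b = RESERVATION_TABLES[op_b]
    for cycle_b, res_b in table_b.items():
        res_a = table_a.get(cycle_b + distance, set())
        if res_a & res_b:
            return True
    return False


def min_issue_distance(op_a, op_b):
    """Smallest issue distance at which op_b can follow op_a without conflict."""
    distance = 0
    while conflicts(op_a, op_b, distance):
        distance += 1
    return distance
```

In this toy machine an `add` issued one cycle after a `mul` would collide on the write-back port, so a scheduler that reads the tables must separate them by two cycles; no hand-written constraint is needed.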

Page 37: Final version is available


Compiler-Oriented ADLs

Other Compiler-Oriented ADLs
- The FlexWare CAD system supporting compiler and simulator generation for DSPs and ASIPs (TIMA, France)
- The Valen-C compiler system supporting bit-width optimization of RISC-like ASIPs (Kyushu Univ., Japan)
- The Zephyr compiler system supporting development of custom compilers (Univ. of Virginia, USA)

Summary
- Most compiler-oriented ADLs mainly capture the behavior of the target processor. In addition, a manual description of ILP constraints is needed for ILP scheduling.
- EXPRESSION captures both the behavior and the structure, enabling automatic detection of ILP constraints

Page 38: Final version is available


Simulator-Oriented ADLs

[Figure: ADL-based SOC codesign flow. Blocks: Application; ADL Specification; IP Library; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; System on Chip (Processors, ASICs, Memories, IFs, Interconnection). Arrow labels: Synthesize, Generate, Verify/Validate, Estimate, Modify, Reuse.]

Support automatic generation of simulators

Page 39: Final version is available


Simulator-Oriented ADLs

LISA (RWTH Aachen, Germany)
- Mainly targeted to DSPs
- Generate bit-true cycle-accurate compiled simulators
- Explicit support for modeling pipeline behaviors such as interlocking, bypassing, stalls, flushes, etc.
- No support for compiler generation

RADL (Rockwell Semiconductor, USA)
- Extension of the LISA approach
- Mainly targeted to DSPs
- Generate phase-accurate simulators
- Explicit support for modeling delay slots, interrupts, zero-overhead loops, hazards and multi-pipelines in addition to features of LISA
- No support for compiler generation

Page 40: Final version is available


Simulator-Oriented ADLs

Summary
- Capture both the structural and architectural aspects of the processors
- Explicit support for modeling pipeline behaviors such as stalls and flushes
- No explicit support for ILP compiler generation

Page 41: Final version is available


Validation-Oriented ADLs

[Figure: ADL-based SOC codesign flow. Blocks: Application; ADL Specification; IP Library; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; System on Chip (Processors, ASICs, Memories, IFs, Interconnection). Arrow labels: Synthesize, Generate, Verify/Validate, Estimate, Modify, Reuse.]

Enable early verification/validation of architectures

Page 42: Final version is available


Validation-Oriented ADLs

AIDL (Univ. of Tsukuba, Japan)
- Targeted to high-performance superscalar processors
- Describe timing behavior of pipelines (e.g., data-forwarding, out-of-order completion, etc.) using temporal logic
- The timing behavior is validated/verified through simulation
- No support for SW toolkit generation
- Can be translated to synthesizable VHDL code

Summary
- Limited previous work
- Few properties can be validated
- No support for SW toolkit generation

Page 43: Final version is available


Future Directions for ADLs

Formal Verification
- Detection of pipeline conflicts (resource, data, and control conflicts)
- Consistency checking between the behavior and the structure

SOC Architecture Synthesis from ADL Specifications

Automatic Generation of Real-Time OSs
- Optimization of task scheduling, interrupt handling, memory management, etc.

IP Libraries
- Standard mechanisms to specify SOC architectures
- Standard mechanisms to encapsulate design attributes (such as performance, power consumption, feature size, etc.)

Support for Future SOC Architectures
- Heterogeneous multi-processors with multi-threaded architectures
- On-chip memory hierarchies with various memory types (e.g., DRAM, flash memories, etc.)
- On-chip reconfigurable devices

Page 44: Final version is available


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

Page 45: Final version is available


Software Toolkits for Processor Cores

SOC designers using processor cores
- Major bottleneck: lack of supporting software tools (compiler, simulator, ...)
- Traditionally: toolkit built at later stages of system design
- Design Space Exploration meaningless w/o toolkit support

Solution: Generate Toolkit from a Target machine specification
- Architecture Description Language (ADL) used to define architectural template
- ADL is used to drive generation of compiler, simulators, validation/verification, and synthesis
- Approach allows compiler-in-the-loop architectural exploration

Page 46: Final version is available


Architecture Description Languages (ADLs)

Objectives:
- Support automated SW toolkit generation
  - exploration (through parametrization & generality)
  - production quality SW tools (cycle-accurate simulator, memory-aware compiler, ...)
- Specify from a variety of architecture classes (VLIWs, DSP, RISC, ASIPs, ...)
- Specify novel memory organizations
- Specify pipelining and resource constraints

[Figure: ADL tool flow. Blocks: Architecture Description File; ADL Compiler; Architecture Model; Compiler; Simulator; Synthesis; Formal Verification.]

Page 47: Final version is available


Software Tools

Estimators
- Code Size, Memory Requirements, Performance, Power etc.

Compilers
- Coarse-grain (task-level) and ILP (microarchitecture-level)

Assembler, Linker, Loader

Profiler, Debugger, Code Development Environment

Simulators
- Bus-functional, instruction-, cycle-, and phase-accurate, structural

Real Time Operating Systems (RTOS)

Validation/Verification


Page 49: Final version is available


Compiler Issues for Embedded SOC

Traditional ES Software
- Handcoded in assembly
  - Poor code quality from compilers
  - Idiosyncratic architectural features (specialized IS, register banks, etc.)

Embedded SOC
- Widely heterogeneous, customized processors
- Multiple levels of parallelism
- Complex, non-traditional memory organization/hierarchy
- Complex constraints (hard RT, code size, power, cost, ...)

Embedded SOC Software
- Cannot do handcoding
- Need powerful retargetable compiler technology
- Must fully exploit unique/non-traditional IS or architecture features
- Compiler is CRITICAL for Embedded SOC

Compiler Issues for Embedded SOC

Language-driven Software Toolkit Generation

Architectural Exploration of Embedded SOC

Page 50: Final version is available


Compiler as an Exploration Tool

Analysis Phase of Compiler: Estimation
- Memory size
- parallelism
- resources

“Fast” Compiler Algorithms to Evaluate Tradeoffs
- on-chip parallelism vs. memory
- effect on speed, power, code size

“Fast” Simulator to evaluate architectural modifications/enhancements
- Customized instructions
- customized units
- data path size (bitwidth)
- customized memory organization/hierarchy

Compiler Critical for Embedded SOC exploration
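A small sketch of what using the compiler as an exploration tool can look like: the same scheduler is rerun against several machine descriptions, and the resulting schedule lengths are compared. Everything here is a toy assumption (the dependence graph, unit classes, single-cycle latencies, and the greedy list scheduler); it stands in for a retargetable compiler driven by an ADL description.

```python
# Toy application: operation -> (unit class it needs, operations it depends on).
OPS = {
    "a": ("ALU", []),
    "b": ("ALU", []),
    "c": ("ALU", []),
    "d": ("MUL", ["a", "b"]),
    "e": ("ALU", ["c"]),
    "f": ("ALU", ["d", "e"]),
}


def schedule_length(units):
    """Greedy list scheduling with unit-latency operations.

    `units` maps a unit class to the number of such units (the explored
    parameter). Returns the number of cycles the schedule needs.
    """
    done, cycle = set(), 0
    remaining = dict(OPS)
    while remaining:
        free = dict(units)
        issued = []
        for name, (cls, preds) in remaining.items():
            if all(p in done for p in preds) and free.get(cls, 0) > 0:
                free[cls] -= 1
                issued.append(name)
        if not issued:
            raise ValueError("machine cannot execute some operation")
        for name in issued:
            del remaining[name]
        done.update(issued)  # results become visible in the next cycle
        cycle += 1
    return cycle


# Explore: how does the number of ALUs affect the schedule length?
results = {n_alu: schedule_length({"ALU": n_alu, "MUL": 1}) for n_alu in (1, 2, 3)}
```

With this graph, going from one ALU to three shortens the schedule from five cycles to three; a fourth ALU would not help, which is exactly the kind of answer exploration is meant to give before hardware is committed.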

Page 51: Final version is available


Retargetable Compilers

[Figure: ADL-based SOC codesign flow. Blocks: Application; ADL Specification; IP Library; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; System on Chip (Processors, ASICs, Memories, IFs, Interconnection). Arrow labels: Synthesize, Generate, Verify/Validate, Estimate, Modify, Reuse.]

Automatic generation of compilers from ADLs

Page 52: Final version is available


Retargetable Compilers

Issues:

Produce efficient code for a wide variety of processor architectures
- DSP, VLIW, RISC, Superscalar
- Multi-processor/Multi-threaded architectures

Need efficient code optimization techniques
- ILP, Predicated Execution
- Techniques for novel instruction-sets, architectures
  - Multimedia instructions, cache control instructions
  - Specialized addressing modes, specialized functional units

Need dynamic phase ordering capability

Produce code that satisfies varied constraints
- Instruction Memory size, Data Memory size
- Power, Performance

Page 53: Final version is available


Compiler Flow (Front-End)

[Figure: compiler front-end flow from Program to High-level IR, with ADL inputs.]
- Lexical Analysis, Semantic Analysis
- Analysis: Data dependence; Array, Pointer; Loop
- Memory/Power: Estimation; Loop/Array optimizations
- Parallelization: Task-level, Loop-level
- ADL inputs: Multi-processor/Multi-threading Info.; Memory Subsystem/Power Info.

Page 54: Final version is available


Compiler Flow (Back-End)

[Figure: compiler back-end flow from High-level IR through Medium-level IR to Low-level IR, with ADL inputs.]
- Lowering: Complex Expressions, Array Subscripts
- Pre-scheduling optimizations: Dead code removal, Induction Variable Elimination, Partial Redundancy Elimination, ...
- Memory/Power: Initial memory assignment; Data-Cache Optimizations; Loop blocking, skewing, etc.
- Transformations: Software Pipelining
- Instruction Selection, Register Allocation, Scheduling (ILP)
- Optimizations: Tree Height Reduction, Strength Reduction, Spill code optimization
- ADL inputs: Memory Subsystem/Power Info.; Memory Subsystem/Resource Info.; Operation Behavior; Register File Info.; Pipeline Conflicts/Constraints Info.
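As one concrete instance of the back-end optimizations listed above, tree height reduction rebalances a chain of associative operations so that independent partial results can be computed in parallel on an ILP machine. The sketch below is a simplified illustration on symbolic sums; it assumes a non-empty operand list and an operator that is safe to reassociate (integer addition, for example; floating-point results may differ after reassociation).

```python
def chain_depth(n_operands):
    """Depth of a left-to-right chain ((a + b) + c) + ...: one add per level."""
    return n_operands - 1


def balanced_tree(operands):
    """Rebuild an associative sum as a balanced tree; returns (expr, depth)."""
    level = [(name, 0) for name in operands]
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; each pair becomes one node one level higher.
        for i in range(0, len(level) - 1, 2):
            (left, d_left), (right, d_right) = level[i], level[i + 1]
            nxt.append((f"({left} + {right})", max(d_left, d_right) + 1))
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next level
        level = nxt
    return level[0]


expr, depth = balanced_tree(["a", "b", "c", "d", "e", "f", "g", "h"])
```

For eight operands the dependence chain shrinks from seven sequential additions to a tree of depth three, provided the target has enough adders to issue the independent additions together.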

Page 55: Final version is available


Compiler Flow (Back-End)

[Figure: final back-end stages from Low-level IR through Code Generation to Object Code, with ADL inputs.]
- Post-scheduling optimizations: Peephole Optimizations, Machine Specific Optimizations
- Memory/Power: Block reordering; Instruction-Cache Optimizations; Final memory assignment
- InterProcedural: Register Allocation; Call convention implementation; Global references aggregation
- Code Generation: Low-level IR to Object Code
- ADL inputs: Memory Subsystem/Power Info.; Operation Format/Image Info.; Call Convention/Register Info.

Page 56: Final version is available


Retargetable Compilers Survey (1)

CHESS (using nML ADL)
- Mainly targeted to fixed-point DSPs and ASIPs
- Performs instruction selection, register allocation, and scheduling
- Fixed phase ordering
- ILP constraints described as a set of legal combinations of operations

AVIV (using ISDL ADL)
- Mainly targeted to VLIW processors
- Optimizes for minimal code size
- Branch-and-bound techniques for concurrent scheduling, resource allocation
- ILP constraints described as a set of Boolean rules which must be satisfied

Page 57: Final version is available


Retargetable Compilers Survey (2)

ELCOR (using MDES ADL)
- Mainly targeted to VLIW architectures with speculative execution
- Used for design space exploration of high-performance processors in the Trimaran system
- ILP constraints are explicitly described as reservation tables

EXPRESS (using EXPRESSION ADL)
- Targeted to a wide range of processor architectures such as RISC, VLIW, Superscalar, and DSP
- Mutation-Scheduling based dynamic phase ordering capability
- ILP constraints are automatically detected using reservation tables

Page 58: Final version is available


Retargetable Compilers Survey (3)

Other Retargetable Compilers
- The FlexWare CAD system
  - Supports compiler generation for DSPs and ASIPs (TIMA, France)
- The Valen-C compiler system
  - Supports bit-width optimization of RISC-like ASIPs (Kyushu Univ., Japan)
- The Zephyr compiler system
  - Supports development of custom compilers (Univ. of Virginia, USA)
- SUIF Compiler Infrastructure
  - Open compiler infrastructure (Stanford Univ., USA)

Other Efforts discussed at this workshop
- Dortmund, EPFL, IITB, IITD, ...


Software Tools

Estimators
  - Code size, memory requirements, performance, power, etc.

Compilers
  - Coarse-grain (task-level) and ILP (microarchitecture-level)

Assembler, Linker, Loader

Profiler, Debugger, Code Development Environment

Simulators
  - Bus-functional, instruction-, cycle-, and phase-accurate, structural

Real Time Operating Systems (RTOS)

Validation/Verification


Simulators/Simulator Generators

[Figure: design-flow diagram. Labels: Application; ADL Specification; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; IP Library; System on Chip (Processors, ASICs, Memories, IFs, Interconnection); actions: Generate, Estimate, Modify, Synthesize, Verify/Validate, Reuse]

Support automatic generation of simulators


Simulators/Simulator Generators

Issues:

Level of abstraction (from faster, less detail to slower, more detail)
  - Functional (no timing information)
  - Cycle-accurate (cycle-level timing information)
  - Bit-, phase-accurate (detailed timing information)

Simulation model
  - Interpretation based (easy to generate, flexible, but slower)
  - Compilation based (fast but not very flexible)
      - Static compiled simulation
      - Dynamic compiled simulation

Interoperability (the ability to integrate with other tools)

Ability to simulate a wide variety of architectures
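The speed gap between the two simulation models comes largely from where instruction decoding happens. The sketch below contrasts an interpretive loop, which decodes on every execution, with a variant that decodes the program once up front, a first step toward compiled simulation. The 16-bit toy ISA, its encoding, and all function names are hypothetical; this is a functional (untimed) model of straight-line code only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA, one instruction per 16-bit word:
   bits 15..12 opcode, 11..8 destination, 7..4 source A, 3..0 source B or immediate. */
enum { OP_ADD = 0, OP_SUB = 1, OP_LDI = 2 };

typedef struct { uint8_t op, rd, ra, rb; } Decoded;

static Decoded decode(uint16_t word) {
    Decoded d = { (uint8_t)(word >> 12), (uint8_t)((word >> 8) & 0xF),
                  (uint8_t)((word >> 4) & 0xF), (uint8_t)(word & 0xF) };
    return d;
}

static void execute(Decoded d, int32_t regs[16]) {
    switch (d.op) {
    case OP_ADD: regs[d.rd] = regs[d.ra] + regs[d.rb]; break;
    case OP_SUB: regs[d.rd] = regs[d.ra] - regs[d.rb]; break;
    case OP_LDI: regs[d.rd] = d.rb; break;
    default: assert(!"unknown opcode");
    }
}

/* Interpretation based: every instruction is decoded each time it executes. */
void run_interpreted(const uint16_t *prog, size_t n, int32_t regs[16]) {
    for (size_t pc = 0; pc < n; pc++)
        execute(decode(prog[pc]), regs);
}

/* Decode the whole program once, before simulation starts. */
void predecode(const uint16_t *prog, size_t n, Decoded *out) {
    for (size_t pc = 0; pc < n; pc++)
        out[pc] = decode(prog[pc]);
}

/* The simulation loop then only dispatches on pre-decoded instructions. */
void run_predecoded(const Decoded *dec, size_t n, int32_t regs[16]) {
    for (size_t pc = 0; pc < n; pc++)
        execute(dec[pc], regs);
}
```

Both paths must leave the register file in the same state; the pre-decoded path simply trades flexibility (the program must be known in advance) for less work per simulated instruction.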


Simulators/Simulator Generators Survey (1)

GENSIM/XSIM (using ISDL ADL)
  - Mainly targeted to VLIW architectures
  - Generates cycle-accurate, bit-true instruction-level simulators
  - Interpretation based, but performs disassembly off-line to improve speed
  - Used for architecture evaluation

SIMPRESS (using EXPRESSION ADL)
  - Targeted to a wide range of processor architectures such as RISC, VLIW, Superscalar, and DSP
  - Generates cycle-accurate, structural simulators
  - Interpretation based
  - Used for design space exploration and architecture evaluation


Simulators/Simulator Generators Survey (2)

LISA/S (using LISA ADL)
  - Mainly targeted to DSPs
  - Generates bit-true, cycle-accurate, static compiled simulators
  - Explicit support for modeling pipeline behaviors such as interlocking, bypassing, stalls, flushes, etc.

RADL (Rockwell Semiconductor, USA)
  - Extension of the LISA approach
  - Mainly targeted to DSPs
  - Generates phase-accurate simulators
  - Explicit support for modeling delay slots, interrupts, zero-overhead loops, hazards, and multi-pipelines in addition to the features of LISA


Simulators/Simulator Generators Survey (3)

Other Retargetable Simulators/Simulator Generators:
  - HPL-PD simulator (using the MDES ADL): limited retargetability in the form of parameters such as the number of FUs, etc.
  - MIMOLA ADL: converts the processor description into a simulatable HDL model
  - Insulin: uses a VHDL model of a generic parameterizable machine
  - Several commercial offerings: Axys, Lisa, Vast, ...


Software Tools

Estimators
  - Code size, memory requirements, performance, power, etc.

Compilers
  - Coarse-grain (task-level) and ILP (microarchitecture-level)

Assembler, Linker, Loader

Profiler, Debugger, Code Development Environment

Simulators
  - Bus-functional, instruction-, cycle-, and phase-accurate, structural

Real Time Operating Systems (RTOS)

Validation/Verification


ADL-driven Validation/Verification

[Figure: design-flow diagram. Labels: Application; ADL Specification; HW/SW Partitioning (HW, SW); Synthesis; Compiler; Cosimulation; Estimator; IP Library; System on Chip (Processors, ASICs, Memories, IFs, Interconnection); actions: Generate, Estimate, Modify, Synthesize, Verify/Validate, Reuse]

Support validation/verification of architecture spec and implementation


Bottom-up Validation Approach

[Figure: bottom-up validation flow. Labels: Specification (English Document); RTL; Reverse Engineering; High Level Description; Manual Verification; Property Checking (two instances)]


ADL-driven Validation

[Figure: ADL-driven validation flow. Labels: Specification (English Document); ADL Description in EXPRESSION; High Level Description (two instances); RTL (two instances); Reverse Engineering; Manual Verification; Property Checking (four instances); Equivalence Checking (two instances)]

Ref: papers from EXPRESSION group at HLDVT99-01, VLSI02, DATE02 (Mishra et al.)


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

EXPRESSION ADL Toolkit/Framework


EXPRESSION: Our ADL Approach

[Figure: EXPRESSION exploration framework. Labels: Processor Libraries (VLIW, DSP, ASIP); Memory Libraries (Cache, SRAM, Prefetch Buffer, Frame Buffer, EDO, On-chip RD RAM, SD RAM); EXPRESSION ADL; Toolkit Generator; EXPRESS; SIMPRESS; Profiler; Application; Exploration Phase; Generation Phase; Verification; Feedback]

EXPRESSION, EXPRESS, and SIMPRESS comprise the toolkit to aid the System Designer.

Compiler-in-the-loop architectural exploration


System-Level Exploration

[Figure: system-level exploration flow. Labels: Alg. spec; C implementation; Proced. code; Coarse-grain & alg. transformations; Cost estimation (mem, ...); Perf. estimation; H/S Partitioning (HW, SW); HLS; EXPRESS Compiler; Target Code; RTOS Kernel; ROM; Proc; ASIC; Controller; Datapath; On-chip Memory; Main Memory]


MEMOREX: Memory Exploration Environment

[Figure: MEMOREX flow. Labels: System spec in C; Parser, FG Generator w/ Semantics Retention; Memory Disambiguation, Multi-dim DF analysis; Control/DF Graph; User Interface; MEMOREX; Memory Estimation; Transformations; Memory Optimizations; Virtual Memory Mapping; Physical Memory Mapping; Memory Library; CDFG with real memories; HW/SW Codesign; Hw/Sw Partitioning (SpecSyn); SW Synthesis (EXPRESS); HW Synthesis (ISE, Synopsys)]


Software Toolkit for the System Designer

EXPRESS - An Extensible, Retargetable, Instruction-Level Parallelizing (ILP) Compiler
  - State-of-the-art ILP techniques: Resource Directed Loop Pipelining (RDLP), Trailblazing Percolation Scheduling (TiPS)
  - Mutation Scheduling: framework for dynamically exploring tradeoffs between transformations
  - Detailed architecture model (for enhanced retargetability and optimizing capability)
  - Automatic generation of operation conflict information (as Reservation Tables) from EXPRESSION
  - Very general speculation/predication

SIMPRESS - A Retargetable, Cycle-accurate Simulator
  - Runs on EXPRESS IR (compiler designers can use the simulator as a debugging tool)
  - Structural simulation (provides the System Designer with detailed statistics)
  - Highly retargetable (can be used to simulate VLIWs, DSPs, etc.)

V-SAT - A Visual Architecture Specification and Analysis Tool
  - Visual tool for easy specification of structural and instruction-set information
  - Interfaces with SIMPRESS to collect detailed statistical information about the architecture
  - Visual display of the statistics in an intuitive manner to aid architecture evaluation


EXPRESS: Compiler Environment for Embedded Processors

[Figure: EXPRESS compiler flow. Labels: GCC + Semantics Retention; Analysis; Mutating Transformations; Memory Hierarchy Transformations; Retargetable Back End; Simulation, Visualization, Interaction (SIMPRESS & VSAT); Control; EXPRESSION (ADL); Proc 1, Proc 2, ..., Proc n]


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

Experiments:
  - Pipelining
  - Memory-aware compilation
  - Memory arch exploration


The DLX Example Architecture


Design Space Exploration

Designer targets various goals (power, area, performance)
  - Often conflicting

DSE allows trade-offs between these goals. Explore changes to:
  - Processor/memory system architecture
      - changing the pipeline structure
      - changing the data path structure
      - increasing parallelism
      - changing the memory components
  - Instruction set
      - adding new operations (e.g., MAC)

DLX simulation
  - Pipeline stalled 53% of the time, due to RAW data hazards
  - INT and FP Adder units are the most utilized

Explored several forwarding path placements
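Why a forwarding path removes RAW stalls can be seen with a very small model: a consumer stalls until the producer's result is usable, and forwarding shortens that wait. The sketch below is a simplified illustration, not the DLX model used in the experiments; the function names and the latencies in the usage note are hypothetical.

```c
#include <assert.h>

/* Stall cycles seen by a consumer issued `distance` cycles after its producer,
   when the producer's result becomes usable `result_ready` cycles after the
   producer was issued. Both values are hypothetical model parameters. */
int raw_stall_cycles(int result_ready, int distance) {
    assert(result_ready >= 0 && distance >= 1);
    int stall = result_ready - distance;
    return stall > 0 ? stall : 0;
}

/* Total stall cycles over n producer/consumer pairs with the given issue distances. */
int total_stalls(const int *distances, int n, int result_ready) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += raw_stall_cycles(result_ready, distances[i]);
    return total;
}
```

For dependent pairs at issue distances 1, 1, 2, and 4, assuming a result is usable 3 cycles after issue without forwarding and 1 cycle after issue with it, the model gives 5 stall cycles without forwarding and none with it.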


Example Design Space Exploration: Pipelining

1. Forwarding path from All_mem_Latch to A1
2. Forwarding path from Mem_WB_Latch to INT
3. Both (1) and (2)
4. Forwarding path from All_mem_latch to INT and (1)
5. Forwarding path from Mem_WB_Latch to A1 and (1)

Configuration 1 exploits (mpy,fp_add) sequences


Example Design Space Exploration: Pipelining

1. Forwarding path from All_mem_Latch to A1
2. Forwarding path from Mem_WB_Latch to INT
3. Both (1) and (2)
4. Forwarding path from All_mem_latch to INT and (1)
5. Forwarding path from Mem_WB_Latch to A1 and (1)

Configuration 2 exploits (ld,int_add) sequences


Example Design Space Exploration: Pipelining

1. Forwarding path from All_mem_Latch to A1
2. Forwarding path from Mem_WB_Latch to INT
3. Both (1) and (2)
4. Forwarding path from All_mem_latch to INT and (1)
5. Forwarding path from Mem_WB_Latch to A1 and (1)

Configuration 3 exploits (mpy,fp_add) and (ld,int_add) sequences


Example Design Space Exploration: Pipelining

1. Forwarding path from All_mem_Latch to A1
2. Forwarding path from Mem_WB_Latch to INT
3. Both (1) and (2)
4. Forwarding path from All_mem_latch to INT and (1)
5. Forwarding path from Mem_WB_Latch to A1 and (1)

Configuration 4 exploits (mpy,fp_add) and (mpy,int_add) sequences


Example Design Space Exploration: Pipelining

1. Forwarding path from All_mem_Latch to A1
2. Forwarding path from Mem_WB_Latch to INT
3. Both (1) and (2)
4. Forwarding path from All_mem_latch to INT and (1)
5. Forwarding path from Mem_WB_Latch to A1 and (1)

Configuration 5 exploits (mpy,fp_add) and (ld,fp_add) sequences


DLX Pipeline DSE Results

[Figure: results chart. Benchmarks: Innerp, Linear_eq, State_eq, Integrate, 1D_particle, GLR]


DLX Pipelining Experiments Summary

Forwarding paths added
  - Average performance improvement: 15%

Reduced the number of pipeline stages
  - Multiply from 7 to 5 stages
  - FP Adder from 4 to 3 stages
  - Average performance improvement: 6%

Forwarding paths + reduced number of pipeline stages
  - Average performance improvement: 25.9%

Multi-issue version of DLX
  - 4 instructions issued every cycle
  - Average performance improvement: 11.7%


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

Experiments:
  - Pipelining
  - Memory-aware compilation
  - Memory arch exploration


Memory-Aware Compilation

Traditionally, the memory system was transparent to the compiler:
  - All loads/stores were scheduled assuming uniform behavior

However, memory operations are intrinsically non-uniform:
  - Modern DRAMs: page-mode and burst-mode accesses, banking, pipelining
  - Caches: cache hits and misses have very different timing

Our approach: TIMGEN
  - Provide accurate memory timing information to the compiler
  - Allow the compiler to globally hide the latencies of lengthy memory operations
  - Generate significant performance improvements
  - Two instances:
      - DRAM efficient access modes (page-mode, burst-mode accesses)
      - In the presence of caches: cache miss traffic management


Exploiting DRAM Access Modes in Memory-Aware Compiler

Allow the compiler to exploit page-mode and burst-mode accesses

DRAM access:
  - Row-decode, Column-decode, Precharge

Page-mode access:
  - Consecutive accesses to the same row
  - Row-decode and precharge can be omitted

Burst-mode access:
  - Starting from an initial address, a number of words are clocked out on consecutive cycles

Normal DRAM access: 5 cycles

Page-mode DRAM access: 8 cycles


Example Exploiting DRAM Access Modes in Memory-Aware Compiler

...
for(i=0;i<9;i++){
  a = a + x[i] + y[i];
  b = b + z[i] + u[i];
}
...

No efficient access modes: 180 cyc

Access mode optimization: 84 cyc (114% gain)

Memory-aware compiler: 60 cyc (40% further gain)

[Figure: schedules of the three cases along a time axis]
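The first two cycle counts in this example can be reproduced with a simple cost model. The sketch below assumes a 2/1/2-cycle split for row-decode, column-decode, and precharge (only the 5-cycle total for a normal access appears in the slides) and assumes the loop is unrolled so that each array is read in page-mode groups of three; both assumptions are hypothetical and chosen because they match the 180 and 84 cycle figures. Non-memory operations and any overlap are ignored.

```c
#include <assert.h>

/* Assumed per-phase costs in cycles; only their 5-cycle sum is given in the slides. */
enum { ROW_DECODE = 2, COL_DECODE = 1, PRECHARGE = 2 };

/* Normal mode: every access pays row-decode, column-decode, and precharge. */
int normal_cycles(int accesses) {
    assert(accesses >= 0);
    return accesses * (ROW_DECODE + COL_DECODE + PRECHARGE);
}

/* Page mode: accesses are issued in groups of `group` consecutive accesses to
   the same row, so row-decode and precharge are paid once per group. */
int page_mode_cycles(int accesses, int group) {
    assert(accesses >= 0 && group >= 1);
    int full = accesses / group;
    int rest = accesses % group;
    int cycles = full * (ROW_DECODE + group * COL_DECODE + PRECHARGE);
    if (rest > 0)
        cycles += ROW_DECODE + rest * COL_DECODE + PRECHARGE;
    return cycles;
}
```

The loop performs 9 x 4 = 36 loads, so `normal_cycles(36)` gives 180 and `page_mode_cycles(36, 3)` gives 84. The further reduction to 60 cycles depends on how the compiler overlaps memory operations with the rest of the schedule, which this model does not capture.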


Experiments exploiting DRAM access modes

Dynamic cycle counts exploiting page-mode and burst-mode accesses in the compiler.

Presented at Design Automation Conference (DAC) 2000.


MIST: Cache miss traffic management

Cache misses: the most time-consuming operations

Traditionally, the compiler assumed all memory accesses to be cache hits, relying on the memory controller to account for the cache misses. However, hiding the latency of cache misses is crucial.

Our approach: MIST
  - Allow the compiler to perform global optimizations and hide the latency of the cache misses

Cache miss (20 cyc), Cache hit (2 cyc), Add (1 cyc)


Cache miss traffic management Example

Cache line size: 4

Original loop: 120 cyc

...
for(i=0;i<16;i++){
  s=s+a[i];
}
...

Isolate cache misses (cache dependences): 108 cyc (11% gain)

...
for(i=0;i<16;i+=4){
  s=s+a[i];     <== MISS
  s=s+a[i+1];   <== HIT
  s=s+a[i+2];   <== HIT
  s=s+a[i+3];   <== HIT
}
...

Shift cache miss to previous iteration: 87 cyc (37% gain)

...
for(i=0;i<12;i+=4){
  s=s+temp;
  s=s+a[i+1];   <== HIT
  s=s+a[i+2];   <== HIT
  s=s+a[i+3];   <== HIT
  temp=a[i+4];  <== MISS
}
...
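The 120-cycle baseline follows directly from the stated latencies (4 misses, 12 hits, 16 adds, fully serialized), and the shifted loop can be checked for functional equivalence. In the sketch below, the prologue and epilogue around the shifted loop are assumptions, since the slide elides them with "..."; the function names are hypothetical, and the array is assumed to be line-aligned and initially uncached.

```c
#include <assert.h>

enum { MISS_CYC = 20, HIT_CYC = 2, ADD_CYC = 1, LINE_WORDS = 4 };

/* Fully serialized cost of summing n words: one miss per cache line,
   hits for the remaining words, and one add per word. */
int serialized_cycles(int n) {
    assert(n >= 0);
    int misses = (n + LINE_WORDS - 1) / LINE_WORDS;
    int hits = n - misses;
    return misses * MISS_CYC + hits * HIT_CYC + n * ADD_CYC;
}

/* Original loop. */
int sum_original(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Loop with each line's first (missing) access shifted to the previous
   iteration, so its latency can overlap the hits and adds that precede it. */
int sum_shifted(const int *a, int n) {
    assert(n >= LINE_WORDS && n % LINE_WORDS == 0);
    int s = 0;
    int temp = a[0];               /* assumed prologue: first miss before the loop */
    int i;
    for (i = 0; i < n - LINE_WORDS; i += LINE_WORDS) {
        s += temp;
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
        temp = a[i + LINE_WORDS];  /* miss for the next line */
    }
    s += temp;                     /* assumed epilogue: last line, nothing left to fetch */
    s += a[i + 1];
    s += a[i + 2];
    s += a[i + 3];
    return s;
}
```

For n = 16 the loop bound becomes `i < 12`, as in the example, and `serialized_cycles(16)` evaluates to 120. The 108 and 87 cycle figures depend on how much of each miss the scheduler manages to overlap, which this sketch does not model.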


Miss Traffic Management Experiments

Dynamic cycle counts for MIST: Memory Miss Traffic Management Algorithm.

Proc. International Conference on Computer Aided Design (ICCAD) 2000


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions

Experiments:
  - Pipelining
  - Memory-aware compilation
  - Memory arch exploration


Embedded memories: the programmer’s viewpoint

Register files
  - Explicit usage in instruction set

Caches, TLBs
  - Fully implicit

RAM buffers
  - Explicitly controlled through special LD, ST instructions

Reconfigurable memories
  - Explicitly controlled through control instructions

For embedded systems
  - Expose memory architecture to the compiler


Memory Organizations and Architectures

Traditional memory hierarchies
  - Caching: spatial and temporal locality

Embedded memories
  - Architectural and circuit techniques

Custom memory architectures

Other storage optimization examples
  - Spatial locality (multiple banks)
  - Parsimony (compression)
  - Scratch-pad memories, register files, ...


Custom Memory Architectures

Disk file systems
  - Parsons et al., Patterson et al.: use file access patterns to improve the file system

High-level synthesis
  - Catthoor et al.: memory allocation, packing data structures into memories
  - Panda et al.: scratch-pad on-chip SRAM together with cache
  - Bakshi et al.: memory exploration combining different port configurations

Computer architectures
  - Jouppi, Kessler et al.: hardware stream buffers to enhance memory performance
  - Graphics processors: frame buffers, FIFOs, etc.


APEX: Access Pattern based Memory Exploration

Motivation:
  - Majority of memory accesses generated by a few instructions
      - e.g., Vocoder, 15k LOC: only 15 instructions generate 62% of the memory accesses
  - Customize memory architecture for these accesses

APEX approach (Grun et al., ISSS 2001)
  - Extract, analyze, and cluster the most active access patterns in the application
  - Use heuristics to prune the design space
      - many possible mappings with different power/perf/costs
  - Avoid simulation of the entire design space

[Grun ISSS2001]
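The observation that a few instructions generate most memory accesses suggests a first extraction step: rank instructions by profiled access count and keep the smallest set that reaches a coverage target. The sketch below illustrates only that ranking step under stated assumptions; the function name and the per-instruction count array are hypothetical, and the clustering and heuristics of APEX itself are not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: larger access counts first. */
static int by_count_desc(const void *p, const void *q) {
    unsigned long a = *(const unsigned long *)p;
    unsigned long b = *(const unsigned long *)q;
    return (a < b) - (a > b);
}

/* Returns how many of the most active instructions are needed to cover at
   least `percent` of all profiled memory accesses. Sorts `counts` in place
   (descending). Assumes the totals are small enough not to overflow. */
size_t hot_instruction_count(unsigned long *counts, size_t n, unsigned percent) {
    assert(percent <= 100);
    unsigned long total = 0, covered = 0;
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    qsort(counts, n, sizeof counts[0], by_count_desc);
    size_t k = 0;
    while (k < n && covered * 100 < total * percent)
        covered += counts[k++];
    return k;
}
```

With profiled counts of 5, 40, 10, 30, and 15 accesses, two instructions already cover 62% of the traffic; those would be the candidates for dedicated memory modules.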


Customizing Memory Architectures

Opportunity for a wide range of power, cost, performance

Analyze application behavior (compile-time)

Map memory accesses to structures supporting access patterns

[Figure: two memory architectures. One with CPU, Cache, DRAM; one with CPU, Cache, DRAM, Stream buffer, Linked-list buffer, SRAM]


Motivating Example

Illustrative example: 2 cases

1. Traditional Cache-only Memory Architecture
   - All data structures handled by the cache

2. APEX: Access Pattern-based Memory Customization
   - Access patterns go to stream buffers, SRAMs, linked-list, and self-indirect memory modules

for(i=0;i<1000;i++){
  … = a[i] + …;
}
…
for(i=0;i<1000;i++){
  code = codetab[code];
}
…
while(…){
  …
  p = p->next;
}
…
for(I=0;I<1000;I++){
  for(j=0;j<10;j++){
    … = coeff[j] + …;
  }
}


1. Traditional Cache-only Memory Arch.

for(i=0;i<1000;i++){
  … = a[i] + …;
}
…
for(i=0;i<1000;i++){
  code = codetab[code];
}
…
while(…){
  …
  p = p->next;
}
…
for(I=0;I<1000;I++){
  for(j=0;j<10;j++){
    … = coeff[j] + …;
  }
}

All data structures handled by the cache

[Figure: CPU, Cache, DRAM; a[], codetab[], Heap, and coeff[] are all served through the cache]


2. APEX: Access Pattern-based Memory Customization

for(i=0;i<1000;i++){
  … = a[i] + …;
}
…
for(i=0;i<1000;i++){
  code = codetab[code];
}
…
while(…){
  …
  p = p->next;
}
…
for(I=0;I<1000;I++){
  for(j=0;j<10;j++){
    … = coeff[j] + …;
  }
}

Mapping data structures to memories supporting their access modes: stream buffer, linked-list buffer, SRAM, and cache

[Figure: CPU with Cache, DRAM, Stream buffer, Linked-list buffer, and SRAM; data structures shown: a[], codetab[], Heap, coeff[] (coeff[] next to the SRAM)]

[Grun ISSS2001]


Cost/Perf Exploration: Compress


Memory Exploration: Compress (Perf. Paretos)


Perf/Power Exploration: Compress


Memory Exploration: Compress (Power Paretos)
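The Pareto plots for Compress keep only those memory configurations that no other configuration beats on every axis. A minimal sketch of such a filter for two minimized metrics is given below; the `DesignPoint` type, its fields, and the sample values in the usage note are hypothetical and are not data from the experiments.

```c
#include <assert.h>
#include <stddef.h>

/* A candidate memory architecture; both metrics are to be minimized. */
typedef struct {
    double cost;    /* e.g., area or gate count */
    double cycles;  /* e.g., execution time; power could be used instead */
} DesignPoint;

/* a dominates b if it is no worse on both metrics and strictly better on one. */
static int dominates(DesignPoint a, DesignPoint b) {
    return a.cost <= b.cost && a.cycles <= b.cycles &&
           (a.cost < b.cost || a.cycles < b.cycles);
}

/* Copies the non-dominated points of pts[0..n) into out (capacity n),
   preserving input order, and returns how many were kept. */
size_t pareto_front(const DesignPoint *pts, size_t n, DesignPoint *out) {
    assert(pts != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int dominated = 0;
        for (size_t j = 0; j < n && !dominated; j++)
            if (j != i && dominates(pts[j], pts[i]))
                dominated = 1;
        if (!dominated)
            out[kept++] = pts[i];
    }
    return kept;
}
```

Given the points (10, 100), (20, 80), (20, 120), (30, 80), and (40, 50), the filter keeps (10, 100), (20, 80), and (40, 50); the other two are dominated.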


Memory Organizations and Architectures

Traditional memory hierarchies
  - Caching: spatial and temporal locality

Embedded memories
  - Architectural and circuit techniques

Custom memory architectures

Other storage optimization examples
  - Spatial locality (multiple banks)
  - Parsimony (compression)
  - Scratch-pad memories, registers, ...


Outline

Methodology for Architectural Exploration

Survey of Architectural Description Languages (ADLs)

Software Toolkit Generation

Architectural Exploration

Summary and Conclusions


Summary

Today we reviewed ADL-driven architectural exploration of programmable embedded systems
  - Methodology, ADL survey, toolkit generation, sample experiments

Tremendous opportunity for architectural exploration
  - Application-specific customization
  - Performance, power, size variations
  - Processor, coprocessor, memory co-exploration

Key technologies required
  - ADL as an executable specification of the architecture
      - toolkit generation, validation/verification, ...
  - Highly tunable/retargetable compiler technology
  - Compiler-in-the-loop architectural evaluation
  - Application-architecture co-evolution


Outlook

Current Focus:
  - Language-driven SW toolkit generation (ADL => compiler, simulator, ...)
  - Memory issues for embedded systems-on-chip: organization, exploration
      - performance, power, size
  - Flexible, powerful compilation environment for processor-core based designs
      - compiler as an exploration tool, and as a software synthesis tool
  - Data and instruction cache sizing for embedded applications
  - Estimators, tight bounds on WCET for real-time applications using caches

Future Directions
  - Memory/S-O-C architectures for embedded DRAM/embedded logic
  - Simulation/compilation environment for multiprocessors and novel memory hierarchies on chip
  - Customized OS support
  - Tight coupling between arch, compiler, CAD, PP and OS