
Just-in-Time Compilation for FPGA Processor Cores

Andrew Becker (1), Scott Sirowy (2), Frank Vahid

Department of Computer Science and Engineering
University of California, Riverside
{abecker | ssirowy | vahid}@cs.ucr.edu

1. Now at EPFL 2. Now at ESRI

This work was supported in part by the National Science Foundation (CNS1016792) and by the Semiconductor Research Corporation (GRC 2143.001)

Andrew Becker 2 of 20

Motivation

• SystemC useful capture language
  • Concurrency, structure, timing
• Simulation typical, but in-system I/O often useful
  • Switches/LEDs
  • Cameras/displays
• Design/synthesis to FPGA may take hours/days and require advanced tools

[Figure: simulation compared with in-system I/O]

Background

• Want rapid design iteration with in-system I/O
  • Compile design description; avoid design/synthesis
• Previously: Hybrid approach: SystemC bytecode

SystemC Code

class CLK_GEN : public sc_module {
  sc_in<bool> clock;
  …
  CLK_GEN(){ …

Bytecode

process(clock)
READ $1 dataRdy
BGT $1 $0 Start
J Done
Start: ADDI $2 $2 1
ADDI $3 $0 7
…

[Figure: SystemC code passes through a compiler to produce bytecode. Alternatives shown: simulator (no in-system I/O); design/synthesis (time-consuming)]

Portable SystemC-on-a-chip – Sirowy [CODES+ISSS ’09]

Background

• Emulate bytecode in engine on FPGA
  • Fast compilation
  • Bytecode also portable (FPGA-device independent)

[Figure: SystemC code passes through a compiler to produce bytecode, which runs on an emulation engine on the FPGA with in-system I/O]

Portable SystemC-on-a-chip – Sirowy [CODES+ISSS ’09]

Emulation Engine

• Discrete event simulator
  • C code on a processor (currently MicroBlaze soft-core; could be hard-core)
  • Support circuits for architectural features, peripheral I/O

[Figure: emulation engine block diagram. Blocks: processor core, event kernel, instruction memory, read signal memory, write signal memory, peripheral bus, UART, LEDs, buttons, frame buffer]
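As a rough picture of what a discrete event kernel does, the sketch below queues signal writes and wakes the processes sensitive to a signal whenever its value changes. The class EventKernel, its methods, and the demo function are hypothetical names chosen for illustration; the slides do not describe the real kernel's data structures, so this should be read as one plausible shape under that assumption.

```python
from collections import deque


class EventKernel:
    """Toy discrete event kernel: a queue of pending signal updates."""

    def __init__(self):
        self.signals = {}       # current value of each signal
        self.sensitivity = {}   # signal name -> processes to wake
        self.queue = deque()    # pending (signal, value) updates

    def add_process(self, proc, sensitive_to):
        for sig in sensitive_to:
            self.sensitivity.setdefault(sig, []).append(proc)

    def write(self, sig, value):
        # Writes are deferred: they take effect when dequeued in run().
        self.queue.append((sig, value))

    def run(self, max_updates=1000):
        handled = 0
        while self.queue and handled < max_updates:
            sig, value = self.queue.popleft()
            handled += 1
            if self.signals.get(sig) == value:
                continue        # value unchanged, so no event fires
            self.signals[sig] = value
            for proc in self.sensitivity.get(sig, []):
                proc(self)
        return handled


def demo():
    """Count rising clock edges with a process sensitive to 'clock'."""
    kernel = EventKernel()

    def counter(k):
        if k.signals["clock"] == 1:
            k.write("count", k.signals.get("count", 0) + 1)

    kernel.add_process(counter, ["clock"])
    for level in (1, 0, 1, 0):
        kernel.write("clock", level)
        kernel.run()            # settle all updates before the next edge
    return kernel.signals["count"]
```

In this toy version the time goes to the same two places the deck discusses: running process bodies and maintaining the queue of signal updates.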

Caveat Emptor

• Emulation is slow
  • On soft-core, is even slower than PC simulation
  • Won't meet many real-time constraints

This work – Speed up emulator

• First analyzed emulator performance

[Figure: pie chart of emulator execution time: process emulation 69%, signal queue maintenance 23%, waiting for I/O 8%]

Low-Hanging Fruit

• 69% of time spent emulating bytecode
• Two strategies to reduce
  • Reduce each instruction’s emulation time
  • Reduce instruction memory latency

[Figure: pie chart of emulator execution time: process emulation 69%, signal queue maintenance 23%, waiting for I/O 8%]
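A back-of-envelope way to read the profile is Amdahl's law: if only the 69% spent on process emulation gets faster, the remaining 31% limits the overall gain. The helper name overall_speedup is illustrative and not from the original work. The sketch assumes the percentages are shares of total run time and that the rest of the work is unaffected.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: overall speedup when `fraction` of the run time
    becomes `factor` times faster and the rest stays unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


PROCESS_EMULATION = 0.69  # share of time spent emulating bytecode

# Upper bound if bytecode emulation cost nothing at all.
BOUND = overall_speedup(PROCESS_EMULATION, float("inf"))
```

Under these assumptions the bound is about 1/0.31, or roughly 3.2x, so gains beyond that have to come from the other parts of the profile as well.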

First Step

• Reduce instruction emulation time
  • Optimize event kernel?
  • Just-in-time (JIT) compile bytecode to native processor code, done transparently by event kernel

[Figure: emulation engine block diagram. Blocks: processor core, event kernel, instruction memory, read signal memory, write signal memory, peripheral bus, UART, LEDs, buttons, frame buffer]

Just-in-Time Compilation of Bytecode

• Implemented SystemC-bytecode to MicroBlaze JIT compiler
  • 3x speedup; still portable
  • Tunable delay/jitter
• Still want more speed

Bytecode

process(clock)
READ $1 dataRdy
BGT $1 $0 Start
J Done
Start:
ADDI $2 $2 1
ADDI $3 $0 7
…

Machine Code

IMM 0xDEAD
LWI $11 $0 0xBEEF
BGTI $11 Start
BRAI Done
Start:
…

[Figure: the JIT in the event kernel turns bytecode into machine code inside the emulation engine]
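The idea of translating once and then running the translated form can be sketched without generating real MicroBlaze code. The example below swaps in a simpler technique, translating each bytecode instruction into a Python closure on first use and caching the result, so later activations skip decoding. The tuple encoding, opcode semantics, label positions, and the names translate and JitKernel are all assumptions for illustration, not the authors' compiler.

```python
def translate(program, labels):
    """Translate each bytecode tuple once into a closure that performs
    the instruction and returns the index of the next one."""
    def emit(index, instr):
        op, *args = instr
        nxt = index + 1
        if op == "READ":
            rd, sig = args
            def step(regs, signals):
                regs[rd] = signals[sig]
                return nxt
        elif op == "ADDI":
            rd, rs, imm = args
            def step(regs, signals):
                regs[rd] = regs.get(rs, 0) + imm
                return nxt
        elif op == "BGT":
            ra, rb, label = args
            target = labels[label]   # branch target resolved at translation
            def step(regs, signals):
                return target if regs.get(ra, 0) > regs.get(rb, 0) else nxt
        elif op == "J":
            target = labels[args[0]]
            def step(regs, signals):
                return target
        else:
            raise ValueError(f"unknown opcode {op!r}")
        return step
    return [emit(i, ins) for i, ins in enumerate(program)]


class JitKernel:
    """Caches translated code per process name (hypothetical kernel)."""

    def __init__(self):
        self.cache = {}
        self.translations = 0

    def run(self, name, program, labels, signals):
        code = self.cache.get(name)
        if code is None:             # first activation: translate and cache
            code = self.cache[name] = translate(program, labels)
            self.translations += 1
        regs, pc = {"$0": 0}, 0      # assumption: $0 reads as zero
        while pc < len(code):
            pc = code[pc](regs, signals)
        return regs


# Visible part of the slide's bytecode; the elided tail ("…") is left out.
PROGRAM = [
    ("READ", "$1", "dataRdy"),
    ("BGT", "$1", "$0", "Start"),
    ("J", "Done"),
    ("ADDI", "$2", "$2", 1),   # Start:
    ("ADDI", "$3", "$0", 7),
]
LABELS = {"Start": 3, "Done": 5}  # assumption: Done falls after the listing
```

The caller only ever hands bytecode to the kernel, which mirrors the "done transparently by event kernel" point: translation is an internal detail, and the bytecode itself stays portable.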

Further Improvement

• Reduce instruction memory latency
  • Add dedicated small, fast memory for JIT code on a fast, local bus
  • Unique JIT possibility due to FPGA configurability

Architecture Changes

[Figure: emulation engine block diagram with a JIT memory added on a local memory bus. Blocks: processor core, instruction memory, JIT memory, local memory bus, read signal memory, write signal memory, peripheral bus, UART, LEDs, buttons, frame buffer]

Even Further Improvement

• 23% of time spent maintaining signal queue
• What can be done?
  • Optimize signal queue maintenance code?

[Figure: pie chart of emulator execution time: process emulation 69%, signal queue maintenance 23%, waiting for I/O 8%]

Common Denominator

• FPGA offers configurability
  • Engine designer can make tradeoffs
  • Trade hardware resources for speed
• Add another soft-core?

[Figure: an FPGA holding only the emulation engine, next to an FPGA holding the emulation engine plus extra resources]

Even Further Improvement

• 23% of time spent maintaining signal queue
• What can be done?
  • Optimize signal queue maintenance code?
  • Offload job to coprocessor
  • Again, unique JIT option due to FPGA configurability

[Figure: pie chart of emulator execution time: process emulation 69%, signal queue maintenance 23%, waiting for I/O 8%]

Architecture Changes

[Figure: emulation engine block diagram with a signal queue and an emulation memory controller added. Blocks: processor core, instruction memory, JIT memory, local memory bus, signal queue, emulation memory controller, read signal memory, write signal memory, peripheral bus, UART, LEDs, buttons, frame buffer]

Experimental Results

[Figure: bar chart of speedup (y-axis, 0 to 18) per benchmark (x-axis): Edge Detection, Matrix Multiplication, A5/1 Cipher, Sequencer, Digital Timer, Geometric Mean. Series: Base Emulation; Regular JIT; JIT, JIT Mem.; JIT, JIT Mem., SQ, EMC; Native C, JIT Mem. Bar labels in extraction order, not matched to bars: 3.1, 2.2, 2.3, 1.3, 12.1, 3.0, 3.6, 2.3, 2.5, 1.3, 5.1, 13.6, 7.6, 9.4, 5.8, 15.7, 8.8]

Conclusions

• Approach: rapid design iteration with in-system I/O
• Uses
  • Education (typically loose timing constraints)
  • System prototypes that can tolerate real-time slowdown (e.g., slow frame rate)
• Portable and flexible
• Engine design sets speed, not compiler or CAD flow
• This work: 15x speedup from normal JIT (3x) combined with FPGA-specific JIT (5x)
• But, still orders of magnitude slower than design/synthesis
• Future work: Bytecode accelerators, JIT synthesis