Just-in-Time Compilation for FPGA Processor Cores
This work was supported in part by the National Science Foundation (CNS1016792) and by the Semiconductor Research Corporation (GRC 2143.001)
Andrew Becker¹, Scott Sirowy², Frank Vahid
Department of Computer Science and Engineering
University of California, Riverside
{abecker | ssirowy | vahid}@cs.ucr.edu
1. Now at EPFL
2. Now at ESRI
Andrew Becker 2 of 20
Motivation
• SystemC useful capture language
• Concurrency, structure, timing
• Simulation typical, but in-system I/O often useful
• Design/synthesis to FPGA may take hours/days and require advanced tools
[Figure: Simulation compared with In-system I/O (Switches/LEDs, Cameras/displays)]
Background
• Want rapid design iteration with in-system I/O
• Compile design description; avoid design/synthesis
• Previously: hybrid approach (SystemC bytecode)
[Figure: SystemC Code -> Compiler -> Bytecode; other labels shown: Simulator (no in-system I/O), Design/synthesis (time-consuming), …]
SystemC Code:
    class CLK_GEN : public sc_module {
      sc_in<bool> clock;
      …
      CLK_GEN(){ …
Bytecode:
    process(clock)
    READ $1 dataRdy
    BGT $1 $0 Start
    J Done
    Start: ADDI $2 $2 1
    ADDI $3 $0 7
    …
Portable SystemC-on-a-chip – Sirowy [CODES+ISSS ’09]
Background
• Emulate bytecode in engine on FPGA
• Fast compilation
• Bytecode also portable (FPGA-device independent)
[Figure: SystemC code (class CLK_GEN …) -> Compiler -> Bytecode (process(clock) READ $1 dataRdy …) -> Emulation Engine on FPGA, with In-system I/O]
Portable SystemC-on-a-chip – Sirowy [CODES+ISSS ’09]
Emulation Engine
• Discrete event simulator
• C code on a processor (currently MicroBlaze soft-core; could be hard-core)
• Support circuits for architectural features, peripheral I/O
[Figure: emulation engine block diagram. Blocks: Processor Core, Event Kernel, Instruction Mem., Read Signal Memory, Write Signal Memory, Peripheral Bus, UART, LEDs, Buttons, Frame Buffer]
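To make "discrete event simulator" concrete, here is a minimal sketch of an event kernel loop. It is not the engine's actual C code: the names `Event`, `Kernel`, `on_change`, `schedule`, `run`, and `count_rising_edges` are hypothetical, and a single software priority queue stands in for the engine's signal memories and signal queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical event record: "at `time`, drive `signal` to `value`".
struct Event {
    std::uint64_t time;
    std::size_t signal;
    int value;
};

// Orders the priority queue so the earliest event is on top.
struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

class Kernel {
public:
    using Process = std::function<void(Kernel&)>;

    explicit Kernel(std::size_t num_signals)
        : values_(num_signals, 0), sensitive_(num_signals) {}

    // Register a process that runs whenever `signal` changes value.
    // Call before run(); processes should only read() and schedule().
    void on_change(std::size_t signal, Process p) {
        assert(signal < sensitive_.size());
        sensitive_[signal].push_back(std::move(p));
    }

    // Queue a write to `signal` that takes effect `delay` time units from now.
    void schedule(std::uint64_t delay, std::size_t signal, int value) {
        assert(signal < values_.size());
        queue_.push(Event{now_ + delay, signal, value});
    }

    int read(std::size_t signal) const { return values_.at(signal); }

    // Pop events in time order until the queue is empty or `until` is passed.
    void run(std::uint64_t until) {
        while (!queue_.empty() && queue_.top().time <= until) {
            const Event e = queue_.top();
            queue_.pop();
            now_ = e.time;
            if (values_[e.signal] == e.value) continue;  // no change: nobody wakes
            values_[e.signal] = e.value;
            for (const Process& p : sensitive_[e.signal]) p(*this);
        }
    }

private:
    std::uint64_t now_ = 0;
    std::vector<int> values_;
    std::vector<std::vector<Process>> sensitive_;
    std::priority_queue<Event, std::vector<Event>, Later> queue_;
};

constexpr std::size_t kClock = 0;  // hypothetical signal id

// Example: a free-running clock with half-period 5 and a rising-edge counter.
int count_rising_edges(std::uint64_t until) {
    Kernel k(1);
    int edges = 0;
    k.on_change(kClock, [&edges](Kernel& kk) { if (kk.read(kClock) == 1) ++edges; });
    k.on_change(kClock, [](Kernel& kk) { kk.schedule(5, kClock, 1 - kk.read(kClock)); });
    k.schedule(5, kClock, 1);  // first rising edge at t = 5
    k.run(until);
    return edges;
}
```

In this sketch every signal write goes through the queue, so queue upkeep sits on the critical path of each simulated event. That is the kind of work the deck later labels "signal queue maintenance".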
Caveat Emptor
• Emulation is slow
• On a soft-core, it is even slower than PC simulation
• Won't meet many real-time constraints
This work – Speed up emulator
• First analyzed emulator performance
[Figure: pie chart of emulator run time. Process Emulation: 69%; Signal Queue Maintenance: 23%; Waiting for I/O: 8%]
Low-Hanging Fruit
• 69% of time spent emulating bytecode
• Two strategies to reduce it
  • Reduce each instruction’s emulation time
  • Reduce instruction memory latency
[Figure: pie chart of emulator run time. Process Emulation: 69%; Signal Queue Maintenance: 23%; Waiting for I/O: 8%]
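One way to read the 69% figure is through Amdahl's law, which is not named on the slides and is applied here as an assumption: if only the process-emulation share of run time is accelerated, the untouched remainder bounds the overall gain. The function names below are hypothetical.

```cpp
#include <cassert>

// Overall speedup when a fraction `f` of run time is accelerated by a factor
// `s` and the remaining (1 - f) is left unchanged (Amdahl's law).
double overall_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Limit of overall_speedup as `s` grows without bound: only the untouched
// fraction of run time remains.
double speedup_ceiling(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With f = 0.69, `speedup_ceiling` gives 1 / 0.31, or roughly 3.2x, if nothing else changes. The profile was taken on the base emulator and shifts after each optimization, so this estimate is best applied one step at a time.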
First Step
• Reduce instruction emulation time
  • Optimize event kernel?
[Figure: emulation engine block diagram, same blocks as on the Emulation Engine slide]
First Step
• Reduce instruction emulation time
  • Optimize event kernel?
  • Just-in-time (JIT) compile bytecode to native processor code, done transparently by event kernel
[Figure: emulation engine block diagram, same blocks as on the Emulation Engine slide]
Just-in-Time Compilation of Bytecode
• Implemented SystemC-bytecode to MicroBlaze JIT compiler
• 3x speedup; still portable
• Tunable delay/jitter
• Still want more speed
[Figure: Bytecode enters the Emulation Engine; the JIT in the Event Kernel turns it into Machine Code]
Bytecode:
    process(clock)
    READ $1 dataRdy
    BGT $1 $0 Start
    J Done
    Start:
    ADDI $2 $2 1
    ADDI $3 $0 7
    …
Machine Code:
    IMM 0xDEAD
    LWI $11 $0 0xBEEF
    BGTI $11 Start
    BRAI Done
    Start:
    …
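The difference between emulating bytecode and translating it once can be sketched in portable C++. This is only a stand-in: the real JIT emits native MicroBlaze instructions, whereas the sketch pre-decodes each bytecode instruction into a closure so the per-instruction decode and dispatch is paid once. The opcode semantics, the `HALT` opcode, the `Done` position, and the mapping of `dataRdy` to signal 0 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical opcodes named after the mnemonics on the slide; HALT is added
// here so the sample program has a defined end.
enum class Op { READ, BGT, J, ADDI, HALT };

// READ: a = dst reg, b = signal id.   BGT: a, b = regs, c = branch target.
// J:    a = jump target.              ADDI: a = dst reg, b = src reg, c = immediate.
struct Instr { Op op; int a; int b; int c; };

struct Machine {
    std::int32_t reg[8] = {};        // $0 is assumed to hold 0; nothing enforces it
    std::vector<std::int32_t> sig;   // current signal values, indexed by signal id
};

// Baseline emulation: decode and dispatch every instruction each time it runs.
void interpret(const std::vector<Instr>& code, Machine& m) {
    const int n = static_cast<int>(code.size());
    int pc = 0;
    while (pc >= 0 && pc < n) {
        const Instr& i = code[pc];
        switch (i.op) {
            case Op::READ: m.reg[i.a] = m.sig.at(i.b); ++pc; break;
            case Op::BGT:  pc = (m.reg[i.a] > m.reg[i.b]) ? i.c : pc + 1; break;
            case Op::J:    pc = i.a; break;
            case Op::ADDI: m.reg[i.a] = m.reg[i.b] + i.c; ++pc; break;
            case Op::HALT: return;
        }
    }
}

using Step = std::function<int(Machine&)>;  // runs one instruction, returns next pc

// One-time translation: each instruction becomes a closure with its operands
// already bound, so the switch above is not repeated at run time.
std::vector<Step> translate(const std::vector<Instr>& code) {
    const int n = static_cast<int>(code.size());
    std::vector<Step> out;
    for (int pc = 0; pc < n; ++pc) {
        const Instr i = code[pc];
        const int next = pc + 1;
        switch (i.op) {
            case Op::READ:
                out.push_back([i, next](Machine& m) { m.reg[i.a] = m.sig.at(i.b); return next; });
                break;
            case Op::BGT:
                out.push_back([i, next](Machine& m) { return m.reg[i.a] > m.reg[i.b] ? i.c : next; });
                break;
            case Op::J:
                out.push_back([i](Machine&) { return i.a; });
                break;
            case Op::ADDI:
                out.push_back([i, next](Machine& m) { m.reg[i.a] = m.reg[i.b] + i.c; return next; });
                break;
            case Op::HALT:
                out.push_back([n](Machine&) { return n; });  // jump past the end
                break;
        }
    }
    assert(out.size() == code.size());
    return out;
}

void run_translated(const std::vector<Step>& steps, Machine& m) {
    const int n = static_cast<int>(steps.size());
    int pc = 0;
    while (pc >= 0 && pc < n) pc = steps[pc](m);
}

// The slide's fragment, with the elided tail replaced by an assumed HALT.
std::vector<Instr> sample_program() {
    return {
        {Op::READ, 1, 0, 0},  // 0: READ $1 dataRdy (dataRdy assumed to be signal 0)
        {Op::BGT,  1, 0, 3},  // 1: BGT $1 $0 Start
        {Op::J,    5, 0, 0},  // 2: J Done
        {Op::ADDI, 2, 2, 1},  // 3: Start: ADDI $2 $2 1
        {Op::ADDI, 3, 0, 7},  // 4: ADDI $3 $0 7
        {Op::HALT, 0, 0, 0},  // 5: Done (assumed position)
    };
}
```

Both paths should leave a `Machine` in the same state for the same signal values. A real JIT additionally has to manage a code buffer and target-specific encodings, which this sketch leaves out.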
Further Improvement
• Reduce instruction memory latency
  • Add dedicated small, fast memory for JIT code on a fast, local bus
  • Unique JIT possibility due to FPGA configurability
Architecture Changes
[Figure: emulation engine block diagram. Blocks: Processor Core, Emulation Engine, Instr. Mem., Read Signal Memory, Write Signal Memory, Peripheral Bus, UART, LEDs, Buttons, Frame Buffer; added: Local Memory Bus, JIT Mem.]
Even Further Improvement
• 23% of time spent maintaining signal queue
• What can be done?
  • Optimize signal queue maintenance code?
[Figure: pie chart of emulator run time. Process Emulation: 69%; Signal Queue Maintenance: 23%; Waiting for I/O: 8%]
Common Denominator
• FPGA offers configurability
• Engine designer can make tradeoffs
• Trade hardware resources for speed
[Figure: FPGA containing the Emulation Engine, beside FPGA containing the Emulation Engine and Extra Resources]
Common Denominator
• FPGA offers configurability
• Engine designer can make tradeoffs
• Trade hardware resources for speed
• Add another soft-core?
[Figure: FPGA containing the Emulation Engine, beside FPGA containing the Emulation Engine and Extra Resources]
Even Further Improvement
• 23% of time spent maintaining signal queue
• What can be done?
  • Optimize signal queue maintenance code?
  • Offload job to coprocessor
  • Again, unique JIT option due to FPGA configurability
[Figure: pie chart of emulator run time. Process Emulation: 69%; Signal Queue Maintenance: 23%; Waiting for I/O: 8%]
Architecture Changes
[Figure: emulation engine block diagram. Blocks: Processor Core, Emulation Engine, Instr. Mem., Read Signal Memory, Write Signal Memory, Peripheral Bus, UART, LEDs, Buttons, Frame Buffer, Local Memory Bus, JIT Mem.; added: Signal Queue, Emulation Memory Controller]
Experimental Results
[Figure: bar chart. x-axis: Benchmark (Edge Detection, Matrix Multiplication, A5/1 Cipher, Sequencer, Digital Timer, Geometric Mean); y-axis: Speedup, 0 to 18. Series: Base Emulation; Regular JIT; JIT, JIT Mem.; JIT, JIT Mem., SQ, EMC; Native C, JIT Mem.]
Bar labels in extraction order (assignment to bars not recoverable): 3.1, 2.2, 2.3, 1.3, 12.1, 3.0, 3.6, 2.3, 2.5, 1.3, 5.1, 13.6, 7.6, 9.4, 5.8, 15.7, 8.8
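The chart summarizes the benchmarks with a geometric mean, the usual choice for averaging ratios such as speedups. A minimal sketch of that calculation follows. The function name is hypothetical, and no values from the chart are used because their assignment to bars is not recoverable here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark speedups, computed in log space so that a
// long product cannot overflow or underflow.
double geometric_mean(const std::vector<double>& speedups) {
    assert(!speedups.empty());
    double log_sum = 0.0;
    for (double s : speedups) {
        assert(s > 0.0);  // speedups are ratios and must be positive
        log_sum += std::log(s);
    }
    return std::exp(log_sum / static_cast<double>(speedups.size()));
}
```

For illustrative inputs of 2x and 8x the geometric mean is 4x, whereas the arithmetic mean would report 5x and overweight the larger ratio.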
Conclusions
• Approach: rapid design iteration with in-system I/O
• Uses
  • Education (typically loose timing constraints)
  • System prototypes that can tolerate real-time slowdown (e.g., slow frame rate)
• Portable and flexible
  • Engine design sets speed, not compiler or CAD flow
• This work: 15x speedup via normal JIT (3x) + FPGA-specific JIT (5x)
• But still orders of magnitude slower than design/synthesis
• Future work: bytecode accelerators, JIT synthesis