Trace-based Compilation
Compilation
• Dynamic compilation
  - Compilation of methods at runtime, at any time, and repeatedly
    - Table-based compilation precludes the need for code patching
      - Adds a level of indirection (a memory access) for the table lookup
    - Can be performed in the background
• Enables multi-mode (mixed-mode) compilation
  - Interpret first, compile when hot (= executed repeatedly)
  - Fast-compile (no opts) first (Just-In-Time compilation)
    - Optimize (perhaps repeatedly) over time
• Enables adaptive compilation
  - (Re-)compile hot methods and optimize based on their execution profile and behavior
  - Profiling/sampling required to identify when/how to do so
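The interpret-first, compile-when-hot policy can be sketched with a toy counter-driven dispatcher. All names and the threshold here are illustrative assumptions, not from any real VM:

```python
# Minimal sketch of counter-based multi-mode execution: interpret a
# method until its invocation counter crosses a threshold, then switch
# to "compiled" code (here, simply a cached callable).

HOT_THRESHOLD = 3  # assumption: real VMs use much larger thresholds

class TieredVM:
    def __init__(self):
        self.counters = {}   # method name -> invocation count
        self.compiled = {}   # method name -> stand-in for native code

    def call(self, name, body):
        if name in self.compiled:
            return self.compiled[name]()          # fast compiled path
        self.counters[name] = self.counters.get(name, 0) + 1
        if self.counters[name] >= HOT_THRESHOLD:
            # "compile": here we just cache the callable
            self.compiled[name] = body
        return body()                             # interpret

vm = TieredVM()
for _ in range(5):
    vm.call("f", lambda: 42)
print("f" in vm.compiled)  # True: f became hot and was "compiled"
```

A real tier-up also recompiles with heavier optimization as profile data accumulates; this sketch only shows the hotness trigger.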
Compilation
• Trace-based compilation
  - Same thing, only for a series of instructions (e.g. a path through a method), not an entire method
    - Store native-code traces in a code cache
  - Link traces together when possible to avoid going back and forth between executing code and the runtime
  - Used in
    - Binary translators/instrumentors: Pin, virtualization systems
    - Binary optimizers: Dynamo, Sun Studio, SOLAR, ADORE
    - VMs for high-level languages
      - JavaScript: Mozilla's SpiderMonkey, SPUR
      - Java: Yeti, HotpathVM, HotSpot TraceJIT, Testarossa, Dalvik
      - Python: PyPy
      - Lua: LuaJIT
Translated Code Blocks
• Emulator translates one block at a time into a CFG
  - What are static BBs? They require >1 pass over the code to construct the control flow graph (CFG)
    - Note that each instruction is in ONLY ONE static basic block

  BB1:         add    . . .
               load   . . .
               store  . . .
  BB2:  loop:  load   . . .
               add    . . .
               store  . . .
               brcond skip
  BB3:         load   . . .
               sub    . . .
  BB4:  skip:  add    . . .
               store  . . .
               brcond loop
  BB5:         add    . . .
               store  . . .
               jump   indirect
               . . .

Find the leaders:
1) First instruction in the method
2) Instruction after a branch/jump
3) Target of a branch/jump
...then draw the blocks and edges.
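The three leader rules can be sketched directly. The tuple encoding of the slide's example program is an illustrative assumption:

```python
# Sketch of leader-finding for static basic blocks on a toy instruction
# list. Each instruction is (label, op, target); 'brcond'/'jump' transfer
# control. A leader starts a new basic block.

def find_leaders(code):
    labels = {ins[0]: i for i, ins in enumerate(code) if ins[0]}
    leaders = {0}                                  # rule 1: first instruction
    for i, (_, op, target) in enumerate(code):
        if op in ("brcond", "jump"):
            if i + 1 < len(code):
                leaders.add(i + 1)                 # rule 2: after a branch/jump
            if target in labels:
                leaders.add(labels[target])        # rule 3: branch target
    return sorted(leaders)

def basic_blocks(code):
    ls = find_leaders(code)
    return [code[a:b] for a, b in zip(ls, ls[1:] + [len(code)])]

# toy version of the slide's example program
code = [
    (None,   "add",    None),
    (None,   "load",   None),
    (None,   "store",  None),
    ("loop", "load",   None),
    (None,   "add",    None),
    (None,   "store",  None),
    (None,   "brcond", "skip"),
    (None,   "load",   None),
    (None,   "sub",    None),
    ("skip", "add",    None),
    (None,   "store",  None),
    (None,   "brcond", "loop"),
    (None,   "add",    None),
    (None,   "store",  None),
    (None,   "jump",   None),   # jump indirect: target unknown statically
]
print(find_leaders(code))  # [0, 3, 7, 9, 12] -> the five static BBs above
```

Note the second pass implied here: labels must be collected before targets can be resolved, which is exactly why static BB construction needs more than one pass.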
Tracing JIT: Translated Code Blocks
• Emulator translates one block at a time:
  - Dynamic basic blocks, not static basic blocks
• Dynamic BBs are determined by the actual flow of the executing program
  - Execute the program
  - Form basic blocks and connect them as you discover the edges
  - You CANNOT go back (one pass only!)
    - The cost/trade-off is that some instructions end up in multiple basic blocks
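One-pass dynamic basic-block formation can be sketched as below, using a toy op-only encoding of the same example; indices 3-6 land in two different dynamic blocks, illustrating the duplication trade-off:

```python
# Sketch: a dynamic basic block runs from the address execution actually
# reaches up to the next control transfer. Labels in the middle do NOT
# end the block, so the same instruction can appear in several dynamic BBs.

def dynamic_block(code, pc):
    """Collect instruction indices from pc up to the next branch/jump."""
    block = []
    while pc < len(code):
        block.append(pc)
        if code[pc][0] in ("brcond", "jump"):
            break
        pc += 1
    return block

# toy encoding of the slide's example (op, target)
code = [
    ("add", None), ("load", None), ("store", None),      # 0-2
    ("load", None), ("add", None), ("store", None),      # 3-5  loop:
    ("brcond", "skip"),                                  # 6
    ("load", None), ("sub", None),                       # 7-8
    ("add", None), ("store", None), ("brcond", "loop"),  # 9-11 skip:
]

b1 = dynamic_block(code, 0)   # entry run: crosses the 'loop' label
b2 = dynamic_block(code, 3)   # entered again via the taken back-edge
print(sorted(set(b1) & set(b2)))   # [3, 4, 5, 6]: shared instructions
```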
Static Basic Blocks versus Dynamic Basic Blocks

Static BBs:
  BB1:         add    . . .
               load   . . .
               store  . . .
  BB2:  loop:  load   . . .
               add    . . .
               store  . . .
               brcond skip
  BB3:         load   . . .
               sub    . . .
  BB4:  skip:  add    . . .
               store  . . .
               brcond loop
  BB5:         add    . . .
               store  . . .
               jump   indirect
               . . .

Dynamic BBs (for this run: brcond skip is not taken, then taken; brcond loop is taken; jumps are always taken):
  BB1:         add    . . .
               load   . . .
               store  . . .
        loop:  load   . . .
               add    . . .
               store  . . .
               brcond skip
  BB2:         load   . . .
               sub    . . .
      [skip:]  add    . . .
               store  . . .
               brcond loop
  BB3:  loop:  load   . . .
               add    . . .
               store  . . .
               brcond skip
  BB4:  skip:  add    . . .
               store  . . .
               brcond loop
               . . .
Translated Code Blocks
• Emulator translates a block at a time: dynamic basic blocks
• Dynamic BBs are determined by the actual flow of the executing program
  - Traces are series of dynamic basic blocks
    - Single entry, multiple exits
  - Note that traces can go across function call boundaries
    - Essentially inlining on demand
[Figure: a trace through blocks A → B → C → E; side exits ("Oops!") leave the trace, and block D lies off-trace]
Code Cache
• Translated code blocks (traces) are stored in a code cache (in native code) for reuse
  - Old, unused blocks are replaced when the cache fills
• As the translator finds code blocks that are related in terms of control flow, it links them together in some cases, as they are discovered:
  - A block is the fall-through of one in the cache
  - A block is the jump target of one in the cache
  - A block jumps or falls through to a block already in the cache
  - Done when it can avoid jumping back into the interpreter
    - For frequently taken branches, or sequential BBs
  - Called "chaining" (or "stitching" in our paper)
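Chaining can be sketched as patching block exits whenever the successor shows up in the cache. The class names and the "dispatcher" sentinel are illustrative assumptions:

```python
# Sketch of trace chaining: each cached block records where its exits go.
# Initially every exit returns to the dispatcher; once the successor is
# translated, the exit is patched ("chained") to jump straight to it.

class Block:
    def __init__(self, spc):
        self.spc = spc
        self.exits = {}      # successor SPC -> "dispatcher" or Block

class CodeCache:
    def __init__(self):
        self.blocks = {}     # SPC -> Block

    def translate(self, spc, successors):
        blk = Block(spc)
        for s in successors:
            # chain forward if the successor is already cached
            blk.exits[s] = self.blocks.get(s, "dispatcher")
        self.blocks[spc] = blk
        # chain backward: patch existing blocks that exit to this SPC
        for other in self.blocks.values():
            if other.exits.get(spc) == "dispatcher":
                other.exits[spc] = blk
        return blk

cache = CodeCache()
a = cache.translate(0x100, successors=[0x200])
print(a.exits[0x200])          # "dispatcher": 0x200 not translated yet
b = cache.translate(0x200, successors=[0x100])
print(a.exits[0x200] is b)     # True: exit patched once 0x200 arrived
print(b.exits[0x100] is a)     # True: forward chain found in the cache
```

This is why unlinking on eviction is painful: every patched exit pointing into a removed block must be reset to the dispatcher, motivating the backpointers discussed below.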
Incremental Predecoding and Tracing
• Translate the binary while the program is executing (dynamically)
  - Predecode or translate new instructions when they are reached by the program
    - Incrementally
    - SPC: source (virtual) program counter
    - TPC: target program counter
[Figure: the Emulation Manager looks each SPC up in an SPC-to-TPC map table. A hit dispatches control to the binary traces in the code cache at the TPC; a miss sends the byte/source code to the Interpreter/Translator, which produces a new trace and updates the map. Arrows distinguish control flow from data access]
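The hit/miss dispatch through the SPC-to-TPC map table might look like the following toy sketch, where a Python callable stands in for translated native code:

```python
# Sketch of the emulation manager's dispatch loop: look the SPC up in the
# SPC->TPC map table; on a hit, run the cached translation; on a miss,
# invoke the translator, install the result, then run it.

def make_manager(translate):
    map_table = {}            # SPC -> TPC (here: a Python callable)

    def run(spc):
        if spc not in map_table:            # miss: translate incrementally
            map_table[spc] = translate(spc)
        return map_table[spc]()             # hit: execute translated code

    return run, map_table

calls = []
def translate(spc):
    calls.append(spc)                        # count translator invocations
    return lambda: spc * 2                   # stand-in for native code

run, table = make_manager(translate)
print(run(5), run(5))   # 10 10 -- the second call hits the map table
print(calls)            # [5]   -- translated only once
```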
Code Cache Management
• Differences from a memory/hardware cache:
  - Traces do not have a fixed size
  - Presence and locations of the blocks are dependent on one another because of chaining
  - There is no copy of the cached blocks in a "backing store"
• Two key code cache operations:
  - SPC-to-TPC mapping (find a block in the code cache given an SPC)
  - Given a TPC in the code cache, find the corresponding SPC
    - This is needed to find the source when an exception occurs
• Practical implementations limit the size of the code cache
  - We need a replacement policy for when/if the cache fills
Code Cache Replacement Policies
• Least Recently Used (LRU)
  - Good idea, but hard to implement
    - Requires an extra memory access per block to keep track
    - Unlinking gets complicated
      - All pointers into a removed block must be updated to point to the interpreter
      - Can use back pointers
    - Can lead to fragmentation (holes where nothing fits)
• Flush when full
• Preemptive flush
  - Flush when program behavior changes
• Fine-grained FIFO
  - Manage the code cache as a circular buffer
  - Must still maintain backpointers to keep track of chaining
Code Cache Replacement Policies
• Least Recently Used (LRU)
• Flush when full
• Preemptive flush
• Fine-grained FIFO
• Coarse-grained FIFO
  - Partition the cache into large blocks, each managed via FIFO
  - Simplifies the backpointer problem because only cross-block pointers must be maintained
    - Backpointers point to blocks; no backpointers within blocks
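A minimal sketch of a coarse-grained FIFO cache, with illustrative region counts and capacities: traces fill a current region, sealed regions queue up FIFO, and a whole region is flushed at once when the cache fills.

```python
# Sketch of a coarse-grained FIFO code cache: the cache is split into a
# few large regions evicted in FIFO order. A region's blocks vanish
# together, so only chains that cross regions need backpointer tracking.

from collections import deque

class CoarseFIFOCache:
    def __init__(self, num_regions, region_capacity):
        self.capacity = region_capacity
        self.num_regions = num_regions
        self.regions = deque()               # FIFO of sealed regions
        self.current = []                    # region being filled

    def insert(self, spc):
        """Insert a trace (by SPC); return any SPCs evicted wholesale."""
        evicted = []
        if len(self.current) == self.capacity:
            self.regions.append(self.current)    # seal the full region
            self.current = []
            if len(self.regions) == self.num_regions:
                evicted = self.regions.popleft() # flush oldest region
        self.current.append(spc)
        return evicted

cache = CoarseFIFOCache(num_regions=2, region_capacity=2)
evicted = []
for spc in range(7):
    evicted += cache.insert(spc)
print(evicted)  # [0, 1, 2, 3]: the two oldest regions, flushed whole
```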
JavaScript Implementation
• Interpreter/runtime in the browser (Node.js outside the browser)
• Lookups are done by name at runtime because of dynamic typing -- variables have no types!
  - Variable access: lookup at runtime -- high overhead
  - Method calls: dynamic dispatch -- high overhead
• Complications
  - Inheritance: look the name up in a table, then walk the hierarchy doing the same
    - Local hash tables and global hash tables
  - Fields/methods can be added dynamically via use or call
    - Must always do the lookup to make sure you don't miss something
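The cost of name-based lookup can be sketched with a toy object model (hash-table slots plus a prototype pointer; all names are illustrative):

```python
# Sketch of why every JS property access is a runtime lookup: objects are
# hash tables, a miss walks the prototype chain, and the chain's contents
# can change at any moment -- so the walk cannot be skipped.

class JSObject:
    def __init__(self, proto=None):
        self.slots = {}          # name -> value, filled dynamically
        self.proto = proto

    def lookup(self, name):
        obj = self
        while obj is not None:
            if name in obj.slots:
                return obj.slots[name]
            obj = obj.proto      # walk up the prototype chain
        raise KeyError(name)

proto = JSObject()
proto.slots["greet"] = lambda: "hi"
obj = JSObject(proto)
print(obj.lookup("greet")())           # hi -- found one level up
obj.slots["greet"] = lambda: "hello"   # added dynamically, shadows proto
print(obj.lookup("greet")())           # hello -- must re-do the lookup
```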
JavaScript Implementation
• Prototypes vs. classes; prototype inheritance vs. classical
  - See the related readings for more on this and on automatic type inference for JS
• Unboxing: decoding an object to determine its type/layout
  - High overhead
• Fat/high-level bytecodes (much work is done in the runtime for each)
  - Like superinstructions -- only at the bytecode level
  - Unlike Java/GForth, similar to Python/Ruby: low dispatch rate
  - So traditional interpreter optimizations that target the dispatch loop have little impact
Today's Paper
• Trace-based compilation (why? how?)
  - Mixed mode (interpret + compile traces)
  - Two assumptions
    - Most time is spent in hot loops
    - Hot loops are mostly type-stable
  - Traces are extended BBs -- they have multiple exits
    - Single entry (at the top)
      - No joins within/into the trace (single entry)
      - Entry only through the top
      - Makes optimizing much simpler
    - Functions are inlined (a bit complex, because we still have to manage stack frames)
      - Trace across function boundaries
[Figure: a trace through blocks A → B → C → E; side exits ("Oops!") leave the trace]
Today's Paper
• Trace-based compilation (why? how?)
  - Mixed mode
• Contributions
  - A real implementation
  - Focus on tracing loops only
    - Nested trace trees
    - Why? How does an outer trace get to an inner trace?
    - Option 1: trace the inner loop (it gets hot first), then continue tracing through the outer loop back to the inner one (duplicates the outer code for every side exit and type combination)
      - Will overflow the code cache (the fixed-size memory area for traces)
    - Option 2: give up and trace only inner loops
    - New option: keep inner and outer traces separate, and have the outer trace "call" the inner one
TraceMonkey State Machine
[Figure: Sieve example in JavaScript, with the TraceMonkey LIR for line 5 -- the tests/side exits are guards. The type ID is stored in the object header (internal representation/type). Assumptions checked by the guards: primes is an Array type, and the store to primes[k] completes without error]
[Figure: x86 code for line 5 of the Sieve example -- the guards are the *'d instructions]
Compiler Optimizations and Blacklisting
• Optimizations
  - Floating-point-to-integer emulation on non-FP architectures
  - CSE (common subexpression elimination)
  - Simplification, constant folding, strength reduction
  - Source-language optimizations (Double -> int)
  - Dead stack stores, call stack stores, dead code elimination
• Blacklisting
  - Some traces won't finish -- they won't reach the loop header (e.g. exceptions)
    - So recording them is a waste
    - Set a counter so that you don't record them for a while
    - Try again; if still untraceable, mark it in the bytecode (bad loop header)
    - Trace lookup in a blacklist table is time-consuming, hence the bytecode mark
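The counter-plus-blacklist idea can be sketched as follows. The threshold and the recorder interface are illustrative assumptions, not TraceMonkey's actual API:

```python
# Sketch of trace blacklisting: count failed recording attempts per loop
# header; after too many failures, mark the header as untraceable so the
# recorder never triggers there again.

MAX_ATTEMPTS = 2   # assumption: illustrative threshold

class Recorder:
    def __init__(self):
        self.failures = {}       # loop-header PC -> failed attempts
        self.blacklist = set()   # stand-in for the mark in the bytecode

    def try_record(self, header, record_fn):
        if header in self.blacklist:
            return None                        # interpret only, no recording
        trace = record_fn()                    # None models an aborted trace
        if trace is None:
            self.failures[header] = self.failures.get(header, 0) + 1
            if self.failures[header] >= MAX_ATTEMPTS:
                self.blacklist.add(header)     # give up on this loop header
        return trace

r = Recorder()
r.try_record(0x40, lambda: None)   # abort #1
r.try_record(0x40, lambda: None)   # abort #2 -> blacklisted
print(0x40 in r.blacklist)                   # True
print(r.try_record(0x40, lambda: "trace"))   # None: never even attempted
```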
Static Single Assignment Form
• LIR of a function (intermediate representation)
  - Simplifies program analysis and optimization
  - The Static Single Assignment (SSA) form of a program makes information about variable definitions and uses explicit
• A program is in SSA form if it satisfies:
  - each definition (def) has a distinct name; and
  - each use refers to a single definition.
• To make this work, the compiler inserts special operations, called φ-functions, at points where control-flow paths join (block E in the figure)
[Figure: a CFG in which paths through blocks C and D join at block E]
SSA Form: φ-Functions and Joins/Join-Points
• A φ-function behaves as follows:

      x1 = . . .        x2 = . . .
            \           /
           x3 = φ(x1, x2)

  This assigns to x3 the value of x1 if control comes from the left, and that of x2 if control comes from the right.
  - It's simply a marker for the compiler (not a real function)
    - Lets the compiler optimize each block without having to worry about the different versions -- the φ-function selects among them
• Significantly complicates compilation
• On entry to a basic block, all the φ-functions in the block execute (conceptually) in parallel.
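The parallel-execution semantics can be modeled as a read phase followed by a write phase; in the swap-like case below, running the φ assignments sequentially would give a different (wrong) answer. The encoding of φs as tuples is an illustrative assumption:

```python
# Sketch of "all phis execute in parallel": every phi reads its incoming
# value before any phi writes its destination, like a simultaneous
# assignment.

def run_phis(phis, env, pred):
    """phis: list of (dest, {incoming_edge: source_var});
    pred: the edge control arrived on."""
    values = [env[srcs[pred]] for _, srcs in phis]   # read phase: all reads first
    for (dest, _), v in zip(phis, values):
        env[dest] = v                                # write phase
    return env

# loop-carried swap along a back edge: each phi's source is the other's
# destination, so sequential execution would clobber one of the reads
env = {"x": 1, "y": 2}
phis = [("x", {"back": "y"}), ("y", {"back": "x"})]
run_phis(phis, env, "back")
print(env["x"], env["y"])   # 2 1 -- sequential execution would give 2 2
```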
Static Single Assignment Form
• The compiler inserts special operations, called φ-functions, at points where control-flow paths join
  - Although this is a key part of SSA, traces contain NO joins
    - In TraceMonkey
[Figure: a trace through blocks A → B → C → E; side exits ("Oops!") leave the trace, and no paths join within it]
Register Allocation
• Linear scan over the instructions in a trace
  - From last to first, greedily
    (e.g. v3 = add v1, v2)
• Spilling
  - Heuristic: consider all values held in registers at this point (for the instructions after this one), and find the oldest register-carried value among them:
    - Consider the values in registers for the instruction after this one
    - Look at the instructions before this instruction
      - Find the last instruction that refers to (uses) each value
      - Spill the value whose last use has the minimum (smallest) instruction index
  - Copy a value to/from memory (a runtime-stack location) when needed
    - Insert restore code just after this instruction
    - Insert spill code just after the last use (a previous instruction)
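The last-to-first greedy scan and the spill heuristic can be sketched as below. This simplified version only identifies spill victims and omits inserting the actual spill/restore code; the toy IR encoding and two-register limit are assumptions for illustration:

```python
# Sketch of reverse (last-to-first) greedy linear-scan allocation on a
# toy IR of (dest, sources) tuples. Walking backwards, a use claims a
# register; reaching the definition frees it. When no register is free,
# spill the live value whose previous use is farthest back.

NUM_REGS = 2   # assumption: tiny register file, to force a spill

def reverse_linear_scan(instrs):
    uses = {}                                 # value -> all use indices
    for i, (_, srcs) in enumerate(instrs):
        for s in srcs:
            uses.setdefault(s, []).append(i)

    loc = {}                                  # value -> register holding it
    free = ["r%d" % k for k in range(NUM_REGS)]
    spills = []

    for i in range(len(instrs) - 1, -1, -1):
        dest, srcs = instrs[i]
        if dest in loc:                       # reached the def: reg free above
            free.append(loc.pop(dest))
        for s in srcs:
            if s in loc:
                continue
            if not free:
                # spill heuristic: among register-carried values not used
                # by this instruction, evict the one whose previous use
                # has the minimum (farthest-back) instruction index
                victim = min(
                    (v for v in loc if v not in srcs),
                    key=lambda v: max((u for u in uses[v] if u < i), default=-1),
                )
                spills.append(victim)
                free.append(loc.pop(victim))
            loc[s] = free.pop()
    return spills

# toy trace: 'b' is defined early and reused late, so it gets spilled
trace = [
    ("a", []),            # 0: a = ...
    ("b", []),            # 1: b = ...
    ("c", ["a", "b"]),    # 2: c = a op b
    ("d", ["c", "a"]),    # 3: d = c op a
    (None, ["d", "b"]),   # 4: use d and b
]
print(reverse_linear_scan(trace))  # ['b']
```

Scanning backwards pairs naturally with traces: a value's live range ends (going up) at its definition, so registers can be freed greedily without a separate liveness pass.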
Other Issues and Results
• Calling external functions
  - C functions: Foreign Function Interface (FFI)
• Access to global variables, exceptions, or the call stack
  - Aborts the trace, but rarely occurs (note this when you code!)
• Findings
  - 9 of 26 benchmarks are better than V8 and SFX (all are better than the SpiderMonkey interpreter-only baseline)
    - Both V8 (hot-method compilation) and SFX (optimized interpretation)
    - Footprint was not measured
Results