FlexiTaint: A Programmable Accelerator for Dynamic Taint Propagation

G. Venkataramani, I. Doudalis, Y. Solihin, M. Prvulovic HPCA ’08

Reading Group Presentation 02/14/2008

Tainting schemes are extremely useful for security and debugging purposes
◦ E.g. TaintCheck, PointerCheck

Implemented in software
◦ Usually some kind of DBI (dynamic binary instrumentation)
◦ Extremely versatile
◦ Really slow
◦ Problems with multithreaded apps, JIT compilation, and self-modifying code

So, make hardware for it
◦ Multiple examples: Raksha, Minos, etc.
◦ Fast
◦ Can deal with the unusual code that troubles S/W
◦ Extensive modifications required in the OoO core, caches, buses, and memories
◦ Limits the state that can be manipulated, usually to a few bits that are easily managed by H/W
◦ So, who is going to implement it?

Solution: FlexiTaint
◦ Use H/W to accelerate what the S/W is doing
◦ Common-case propagation and metadata manipulation

RISC ISA

Taint state: 1 to 16 bits per word
1-level table in the application address space
◦ Protected from the application
◦ No need to widen buses, caches, etc.
◦ L1-T cache for taint bits: 4 KB for 2-bit states
◦ No changes to L1-D, no port contention
◦ Taint state shares L2

2 registers for that
◦ MTBR (Memory Taint Base Register): start of the table
◦ FTCR (FlexiTaint Configuration Register): bits per word
◦ Both must be saved on a context switch by the O/S
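The table lookup implied by MTBR and FTCR can be sketched as below. This is only an illustration: the 4-byte word size, the dense bit packing, and the function names are assumptions, not details taken from the slides.

```python
WORD_BYTES = 4  # assumption: one taint state per 32-bit data word


def taint_location(data_addr, mtbr, bits_per_word):
    """Return (byte address, bit offset) of the taint state for data_addr.

    mtbr models the Memory Taint Base Register (start of the table) and
    bits_per_word models the FTCR setting (1 to 16 bits per word).
    """
    if not 1 <= bits_per_word <= 16:
        raise ValueError("taint state must be 1 to 16 bits per word")
    word_index = data_addr // WORD_BYTES
    bit_index = word_index * bits_per_word  # assumed dense packing
    return mtbr + bit_index // 8, bit_index % 8


def l1t_coverage_bytes(l1t_bytes, bits_per_word):
    """Bytes of application data whose taint state fits in the L1-T."""
    return (l1t_bytes * 8 // bits_per_word) * WORD_BYTES


# Hypothetical table base; with 2-bit states, word 1024 starts 256 bytes in.
print(taint_location(0x1000, 0x80000000, 2))
print(l1t_coverage_bytes(4096, 2))
```

Under these assumptions a 4 KB L1-T with 2-bit states covers 64 KB of data, which is one way to see why such a small cache can be enough.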

All loads/stores prefetch taint state to L1-T
State 0..0 is assumed to be a safe one
State can be manipulated directly by special instructions
◦ Must be added somehow after special events: reading a file, malloc, input purging, etc.

Takes place after the OoO core
◦ Can be turned off and completely bypassed if unnecessary
The normal Commit becomes Pre-CoMmiT
A software handler receives 4 arguments:
◦ Opcode, Reg1 state, Reg2 state, Mem state
And returns the output state and whether an exception should be raised
Handler address stored in TPCHR
◦ Restricted-access register
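A minimal sketch of what such a handler could look like for a 1-bit taint scheme. The opcode names and the specific policy (OR the source states, raise on a register-indirect jump through a tainted register) are illustrative assumptions, not the rules from the paper.

```python
def taintcheck_handler(opcode, reg1_state, reg2_state, mem_state):
    """Return (output_state, raise_exception) for one instruction.

    Hypothetical policy: the result is tainted if any source is tainted,
    and jumping through a tainted register raises an exception.
    """
    output_state = reg1_state | reg2_state | mem_state
    raise_exception = opcode == "JMP_REG" and reg1_state != 0
    return output_state, raise_exception


print(taintcheck_handler("ADD", 0, 1, 0))      # tainted source taints the result
print(taintcheck_handler("JMP_REG", 1, 0, 0))  # tainted jump target
```

The important property is that the handler is a pure function of its four arguments, which is what makes its answers cacheable.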

The answer of the S/W handler for the same inputs will be the same
◦ Cache it

128-entry direct-mapped response cache
Indexed by opcode, Reg1 state, Reg2 state, Mem state (folded into 7 bits)
Stores the output state and exception bit
Cleared every time the TPCHR (software handler address register) is changed
◦ Usually on context switch

After the OoO core has ended. Size of the Architectural Register File, NOT the physical one

State of Reg0 hardwired to 0

Reserved for instructions that touch memory

Example: For instructions that do not touch memory
◦ Remember RISC ISA

ALARM!

128-entry direct mapped; cleared when TPCHR changes

Suppresses silent stores

Example: Stores

Still, TPCache lookups take 1 cycle
If dependent instructions are retired in the same cycle, the in-order taint propagation will stall
◦ Puts pressure on the physical register file and ROB

Well, usually 0..00 is good, and when zeroes are combined, the result is 0..00

Also, if only one source state is non-zero, then you usually have unary propagation

Create a table to store that

Stores a 2-bit value for each opcode (256)
◦ 512 bits total, must be saved on context switch

Really fast lookups, allows for same-cycle propagation
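A rough model of the filter follows. The encoding used here (01 = all-zero sources give a zero result, 10 = that rule plus unary propagation) is an assumption inferred from the lifeguard examples later in these slides; the class and constant names are hypothetical.

```python
NO_FILTER = 0b00
ZERO_ONLY = 0b01       # assumed: all-zero sources produce a zero result
ZERO_AND_UNARY = 0b10  # assumed: zero rule plus unary propagation


class FilterTPT:
    """Toy model of the per-opcode filter: 256 opcodes, 2 bits each."""

    def __init__(self):
        self.modes = [NO_FILTER] * 256

    def size_bytes(self):
        return len(self.modes) * 2 // 8

    def try_fast_path(self, opcode, source_states):
        """Return the output state, or None if the TPCache must be consulted."""
        mode = self.modes[opcode]
        nonzero = [s for s in source_states if s != 0]
        if mode in (ZERO_ONLY, ZERO_AND_UNARY) and not nonzero:
            return 0
        if mode == ZERO_AND_UNARY and len(nonzero) == 1:
            return nonzero[0]  # the single non-zero state passes through
        return None


filt = FilterTPT()
filt.modes[0x20] = ZERO_AND_UNARY      # hypothetical opcode number
print(filt.size_bytes())               # 256 opcodes * 2 bits
print(filt.try_fast_path(0x20, [0, 1]))
print(filt.try_fast_path(0x20, [1, 1]))
```

Only the cases the filter cannot resolve, such as two non-zero sources, fall through to the 1-cycle TPCache lookup.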

4-stage in-order pipeline
◦ Receives non-speculative instructions
First 2 stages: look up
◦ Filter TPT
◦ L1-T
3rd stage: taint propagation
◦ TPC lookup
◦ Or trivial propagation through Filter TPT
4th stage: commit

Summary of what the O/S needs to save on context switches
◦ TPCHR (handler address)
◦ FTCR (state size)
◦ MTBR (shadow state address)
◦ Filter TPT content (64 bytes)

The TPCache can simply be discarded
All state is in the address space of the application
◦ So swapping, virtualization, etc. work normally

Data and metadata are accessed in 2 different cycles
◦ Potential consistency issues
Solution for loads:
◦ Prefetch state when the data address is resolved
◦ If the state does not hit in the L1-T a few cycles later, replay the load
Solution for stores:
◦ Prefetch state (same as for loads)
◦ Write only when data and metadata both hit in the L1
The L1-T access is usually a hit thanks to the prefetch

1st: TaintCheck, 1-bit state per word
◦ Allows for maximum optimization: 10 in the Filter TPT (unary propagation and zero optimization)
◦ TPCache and S/W will consider XOR R1,R1,R1 cases
2nd: 1-bit PointerCheck
◦ Stores which words are valid heap pointers
◦ Good for leak detection
◦ And something that Raksha cannot handle
◦ Filter TPT: 01 (non-pointers produce non-pointers)
3rd: A combination with 2-bit states
◦ Filter TPT: 01 (untainted non-pointers produce untainted non-pointers)

Tables: TaintCheck rules and 1-bit Heap PointerCheck rules

SESC simulator
8-core system
4-issue OoO superscalar cores @ 2.93 GHz
L1-D: 32 KB, 8-way set associative, dual-ported, 64-byte blocks
L2: 4 MB, 16-way set associative, single-ported, 64-byte blocks
◦ Small for an 8-core system
L1-T: 4 KB, 4-way set associative, dual-ported, 64-byte blocks
Bus: 64 bits wide @ 1333 MHz

~1% overhead for SPEC 2K and 4% for Splash2
Splash2 is worse due to false sharing of metadata

Smaller Cache line → Less false sharing of Metadata
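A quick back-of-the-envelope calculation shows why the line size matters here. It assumes 4-byte data words and densely packed taint states, neither of which is stated on the slides.

```python
def data_bytes_per_taint_line(line_bytes, bits_per_word, word_bytes=4):
    """Bytes of application data whose taint state shares one metadata line."""
    words_per_line = line_bytes * 8 // bits_per_word
    return words_per_line * word_bytes


for line_bytes in (64, 32):
    for bits in (1, 2):
        covered = data_bytes_per_taint_line(line_bytes, bits)
        print(f"{line_bytes}-byte line, {bits}-bit state: {covered} data bytes")
```

With 64-byte lines and 1-bit states, a single metadata line would describe 2048 bytes of data, so threads writing to different data lines within that range can still contend for the same metadata line. Halving the line size halves that range.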

4 KB: ~1% overhead for SPEC 2K
8 KB: minimal gains
2 KB: 2.8% overhead

Conclusion: 4 KB is fine for 1 and 2 bit states

Use FlexiTaint to simulate previously proposed hardware

And implement the lifeguard that they couldn’t handle (1-bit Heap PointerCheck)

Obviously FlexiTaint proves better

Versatile scheme to handle most lifeguards with low overhead

Nice idea to cache the answer of the software handler

In general, a good idea
◦ With its limitations though (LockSet)

Questions?
