University of Michigan Electrical Engineering and Computer Science 1 Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping Nathan Clark,

University of Michigan Electrical Engineering and Computer Science

Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping

Nathan Clark, Amir Hormati, Scott Mahlke,

Sami Yehia*, Krisztián Flautner*

University of Michigan *ARM Ltd.


Computational Efficiency

• Low power envelope

• More useful work/transistors

• Hardware accelerators

• Niagara II encryption engine

Source: AMD Analyst Day 12/14/06


How Are Accelerators Used?

Control statically placed in binary

[Figure: Program with a CPU and an accelerator (Accel.)]


Problem With Static Control

Not forward/backward compatible

[Figure: one Program, three CPU blocks, two Accel. blocks]


Solution: Virtualization

• Statically identify accelerated computation
• Abstract accelerator features
• Dynamically retarget binary

[Figure: Engineer/Compiler, Program, three Proc. blocks, three Trans. (translator) blocks, two Accel. blocks]


Liquid SIMD

• Virtualize SIMD accelerators

• Why virtualize SIMD?
– Intel MMX to SSE2
– ARM v6 to Neon
– Wide vectors useful [Lin 06]


SIMD Accelerator Assumptions

• Same instruction stream
• Separate pipeline – memory interface

[Figure: pipeline with Fetch, Decode, Scalar Exec / SIMD Exec, Retire]


How to Virtualize

• Use scalar ISA to represent SIMD operations
– Compatibility, low overhead

• Key: easy to translate

[Figure: Program, Branch]


Virtualization Architecture

[Figure: pipeline stages Fetch, Decode, Execute, Retire, with Accel., uCode Cache, and Trans. (translator) blocks]


1. Data Parallel Operations

for(i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 + r2;
    r4 = r3 & constant;
    C[i] = r4;
}

[Figure: elements of A and B pass through + and then &, lane by lane, into C]
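As a rough illustration of why the scalar loop above can stand in for SIMD code, the sketch below runs the same add-and-mask computation once element by element and once in fixed-width groups, then checks that both agree. The names `scalar_loop`, `simd_loop`, and `WIDTH`, and the mask value `0x0F`, are assumptions made for this example; the slide only says `constant`.

```python
WIDTH = 4        # assumed SIMD width in elements (hypothetical)
CONSTANT = 0x0F  # assumed mask value; the slide only says "constant"

def scalar_loop(A, B):
    """Scalar representation: one element per iteration."""
    C = [0] * len(A)
    for i in range(len(A)):
        r1 = A[i]
        r2 = B[i]
        r3 = r1 + r2
        r4 = r3 & CONSTANT
        C[i] = r4
    return C

def simd_loop(A, B, width=WIDTH):
    """Same computation, `width` lanes per iteration.

    Python lists model vector registers here.
    """
    C = [0] * len(A)
    for i in range(0, len(A), width):
        v1 = A[i:i + width]
        v2 = B[i:i + width]
        v3 = [x + y for x, y in zip(v1, v2)]
        v4 = [x & CONSTANT for x in v3]
        C[i:i + width] = v4
    return C

A = list(range(8))
B = [10 * x for x in range(8)]
assert scalar_loop(A, B) == simd_loop(A, B)
```

Because no iteration reads a value written by another, the grouping into lanes does not change the result.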


1a. What If There’s No Scalar Equivalent?

for(i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 + r2;
    cmp r3, #FF;
    r3 = movgt #FF;
    ...
}

[Figure: A and B feeding a single SADD operation]

Idioms can always be constructed
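To make the idiom idea concrete, here is a minimal sketch of the add / compare / conditional-move sequence from the slide next to a lane-wise saturating add. The function names and the list-based model of a vector register are assumptions for illustration only, not the actual SADD definition.

```python
SAT_MAX = 0xFF  # saturation bound, taken from the slide's "#FF"

def saturating_add_idiom(r1, r2):
    """Scalar idiom: add, compare, conditionally clamp."""
    r3 = r1 + r2
    if r3 > SAT_MAX:   # cmp r3, #FF
        r3 = SAT_MAX   # r3 = movgt #FF
    return r3

def sadd(v1, v2):
    """Hypothetical model of a lane-wise saturating add."""
    return [min(x + y, SAT_MAX) for x, y in zip(v1, v2)]

A = [10, 200, 255, 0]
B = [20, 100, 1, 0]
scalar_result = [saturating_add_idiom(a, b) for a, b in zip(A, B)]
assert scalar_result == sadd(A, B)
```

A translator that recognizes the three-instruction pattern could emit the single saturating operation in its place.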


2. Scalarizing Permutations

for(i = 0; i < 8; i++) {
    …
    r1 = r2 + r3;
    tmp[i] = r1;
}

for(i = 0; i < 8; i++) {
    r1 = offset[i];
    r2 = tmp[r1 + i];
    r3 = r2 & const;
    …
}

offset = {4, 4, 4, 4, -4, -4, -4, -4}

[Figure: + results permuted before being passed to &]
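The offset array on this slide encodes a specific permutation. The sketch below, which leaves out the `+` and `&` steps to focus on the indexing, shows that reading `tmp[offset[i] + i]` with these offsets exchanges the two halves of the 8-element array. The helper names are assumptions for this example.

```python
OFFSET = [4, 4, 4, 4, -4, -4, -4, -4]  # offset array from the slide

def scalar_permute(tmp, offset=OFFSET):
    """Scalarized permutation: each element is read through the offset table."""
    out = [0] * len(tmp)
    for i in range(len(tmp)):
        r1 = offset[i]
        r2 = tmp[r1 + i]
        out[i] = r2
    return out

def swap_halves(v):
    """The vector permutation this offset pattern describes."""
    half = len(v) // 2
    return v[half:] + v[:half]

tmp = [0, 1, 2, 3, 4, 5, 6, 7]
assert scalar_permute(tmp) == swap_halves(tmp)
```

Since the offsets are constants, the pattern is fully known before the loop runs, which is what lets it be matched to a fixed permutation.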


3. Scalarizing Reductions

for(i = 0; i < 8; i++) {
    …
    r1 = A[i];
    r2 = r2 + r1;
    …
}

[Figure: + reduction]
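One way to picture the SIMD counterpart of this loop is to keep one partial sum per lane and combine the lanes at the end. The sketch below assumes a 4-wide vector and an input length that is a multiple of the width; both the width and the function names are hypothetical.

```python
def scalar_reduce(A):
    """Scalar representation: a loop-carried accumulator."""
    r2 = 0
    for i in range(len(A)):
        r1 = A[i]
        r2 = r2 + r1
    return r2

def simd_reduce(A, width=4):
    """Lane-wise partial sums followed by a final add across lanes."""
    if len(A) % width != 0:
        raise ValueError("this sketch assumes len(A) is a multiple of width")
    acc = [0] * width  # one partial sum per lane
    for i in range(0, len(A), width):
        v1 = A[i:i + width]
        acc = [a + x for a, x in zip(acc, v1)]
    return sum(acc)  # combine the per-lane partial sums

A = [1, 2, 3, 4, 5, 6, 7, 8]
assert scalar_reduce(A) == simd_reduce(A)
```

The two versions add the elements in a different order. That is harmless for integers, but floating-point results could differ slightly.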


Applied to ARM Neon

• All instructions supported except…

• VTBL – indirect indexing
v1 = vtbl v2, v3

• Interleaved memory accesses

• Not needed in evaluated benchmarks

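[Figure: VTBL example with registers v1, v2, v3 and index values 1 0 1 3; interleaved memory access example with v1 and Mem]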
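For readers unfamiliar with table lookups, the sketch below models the indirect indexing that VTBL performs: each output element is selected by an index held in another register. The table contents are made up for illustration, the index values reuse the `1 0 1 3` shown in the figure, and the list-based model is an assumption, not a full description of the instruction.

```python
def vtbl(table, indices):
    """Model of a table lookup: result[i] = table[indices[i]].

    Out-of-range indices produce 0, as Neon VTBL does.
    """
    return [table[j] if 0 <= j < len(table) else 0 for j in indices]

table = [10, 20, 30, 40]   # hypothetical table contents
indices = [1, 0, 1, 3]     # index values shown in the figure
v1 = vtbl(table, indices)
```

The indices are run-time data in a register, so the access pattern cannot be written down ahead of time as a constant offset array the way the permutations above are.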


Translation to SIMD

• Update induction variable
• Use inverse of defined translation rules

for(i = 0; i < 8; i++){
    r1 = A[i];
    r2 = B[i];
    r3 = r1 + r2;
    r4 = offset[i];
    C[i + r4] = r3;
}

for(i = 0; i < 8; i += 4){
    v1 = A[i];
    v2 = B[i];
    v3 = v1 + v2;
    v3 = shuffle v3;
    C[i] = v3;
}
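The two translation steps named on this slide can be sketched as follows: the induction variable advances by the vector width, and the offset-based store is turned back into a shuffle. This is a toy model, not the hardware translator. The offset values, the 4-wide vector, and every function name here are hypothetical, and the sketch assumes the offset pattern repeats every `WIDTH` elements and stays inside one vector.

```python
WIDTH = 4                                # assumed vector width (hypothetical)
OFFSET = [1, -1, 1, -1, 1, -1, 1, -1]    # hypothetical offsets: swap adjacent pairs

def scalar_loop(A, B, offset):
    """Scalar representation with an offset-indexed store."""
    C = [0] * len(A)
    for i in range(len(A)):
        r3 = A[i] + B[i]
        r4 = offset[i]
        C[i + r4] = r3
    return C

def shuffle_pattern(offset, width):
    """Invert the store offsets into a shuffle: pattern[dst] = source lane."""
    if any(offset[i] != offset[i % width] for i in range(len(offset))):
        raise ValueError("offset pattern must repeat every `width` elements")
    pattern = [None] * width
    for src in range(width):
        dst = src + offset[src]
        if not 0 <= dst < width or pattern[dst] is not None:
            raise ValueError("offsets do not describe an in-vector permutation")
        pattern[dst] = src
    return pattern

def simd_loop(A, B, offset, width=WIDTH):
    """Translated form: step by `width`, shuffle instead of indexed stores."""
    if len(A) % width != 0:
        raise ValueError("this sketch assumes len(A) is a multiple of width")
    pattern = shuffle_pattern(offset, width)
    C = [0] * len(A)
    for i in range(0, len(A), width):    # updated induction variable
        v1 = A[i:i + width]
        v2 = B[i:i + width]
        v3 = [x + y for x, y in zip(v1, v2)]
        v3 = [v3[src] for src in pattern]  # shuffle
        C[i:i + width] = v3
    return C

A = list(range(8))
B = [0] * 8
assert scalar_loop(A, B, OFFSET) == simd_loop(A, B, OFFSET)
```

Rejecting offset patterns that are not permutations keeps the sketch from silently producing a wrong shuffle; what a real translator does in that case is not stated on the slides.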


Translator Design

Translator: efficiency, speed, flexibility

[Figure: Engineer/Compiler, Program, three Proc. blocks, three Trans. (translator) blocks, two Accel. blocks]


Evaluation

• Trimaran ARM

• Hand SIMDized loops

• SimpleScalar model ARM926 w/ Neon SIMD

• VHDL translator, 130nm std. cell


Liquid SIMD Issues

• Code bloat
– <1% overhead beyond baseline

• Register pressure
– Not a problem

• Translator cost
– 0.2 mm² + 2KB cache

• Translation overhead


Translation Overhead

[Figure: translation overhead by benchmark group: SPECfp, MediaBench, Kernels]


Summary

• Accelerators are more common and evolving
– Costly binary migration

• SIMD virtualization using scalar ISA
– One binary: forward/backward compatibility
– Negligible overhead


Questions?
