Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping

Nathan Clark, Amir Hormati, Scott Mahlke, Sami Yehia*, Krisztián Flautner*

University of Michigan, Electrical Engineering and Computer Science
*ARM Ltd.



Computational Efficiency

• Low power envelope

• More useful work per transistor

• Hardware accelerators

• Niagara II encryption engine

Source: AMD Analyst Day 12/14/06


How Are Accelerators Used?

Control statically placed in binary

[Figure: a Program whose binary targets a CPU and an accelerator (Accel.)]


Problem With Static Control

Not forward/backward compatible

[Figure: one Program and three CPUs, two of them with an accelerator (Accel.)]


Solution: Virtualization

• Statically identify accelerated computation
• Abstract accelerator features
• Dynamically retarget binary

[Figure: an Engineer/Compiler produces one Program; a translator (Trans.) sits in front of each of three processors (Proc.), two of which have an accelerator (Accel.)]


Liquid SIMD

• Virtualize SIMD accelerators

• Why virtualize SIMD?
  – Intel MMX to SSE2
  – ARM v6 to Neon
  – Wide vectors useful [Lin 06]


SIMD Accelerator Assumptions

• Same instruction stream
• Separate pipeline – memory interface

[Figure: pipeline with Fetch, Decode, Scalar Exec, SIMD Exec, and Retire stages]

How to Virtualize

• Use scalar ISA to represent SIMD operations
  – Compatibility, low overhead
• Key: easy to translate

[Figure: Program, Branch]


Virtualization Architecture

[Figure: pipeline (Fetch, Decode, Execute, Retire) with an accelerator (Accel.), a microcode cache (uCode Cache), and a translator (Trans.)]


1. Data Parallel Operations

for(i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  r4 = r3 & constant;
  C[i] = r4;
}

[Figure: dataflow graph repeated per element: A and B feed an add (+), then an AND (&), producing C]
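As a rough illustration of why this kind of loop maps cleanly onto SIMD hardware, the sketch below writes the slide's loop in plain C next to a version that handles four elements per step. The names `N`, `LANES`, `MASK`, and all three function names are assumptions made for this example, and the inner lane loop only stands in for a single 4-wide vector instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 8        /* trip count of the slide's loop */
#define LANES 4    /* assumed SIMD width, in elements */
#define MASK 0x0F  /* stands in for the slide's "constant" */

/* Scalar form: one element per iteration, no dependence between iterations. */
void add_mask_scalar(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i++) {
        uint8_t r1 = A[i];
        uint8_t r2 = B[i];
        uint8_t r3 = (uint8_t)(r1 + r2);
        uint8_t r4 = r3 & MASK;
        C[i] = r4;
    }
}

/* Vector-shaped form: each outer iteration models one LANES-wide operation. */
void add_mask_lanes(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            C[i + lane] = (uint8_t)(A[i + lane] + B[i + lane]) & MASK;
        }
    }
}

/* Checks that both forms produce the same output on sample data. */
void check_add_mask(void) {
    const uint8_t A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    const uint8_t B[N] = {10, 20, 30, 40, 50, 60, 70, 80};
    uint8_t c_scalar[N], c_lanes[N];
    add_mask_scalar(A, B, c_scalar);
    add_mask_lanes(A, B, c_lanes);
    assert(memcmp(c_scalar, c_lanes, N) == 0);
}
```

Because no iteration reads a value written by another, the two forms agree for any choice of `LANES` that divides `N`.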


1a. What If There’s No Scalar Equivalent?

for(i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  cmp r3, #FF;
  r3 = movgt #FF;
  ...
}

[Figure: inputs A and B feeding a single SADD operation]

Idioms can always be constructed
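The add, compare, conditional-move sequence on this slide can be read as a scalar idiom for a saturating add. A minimal C sketch of that idiom follows, assuming unsigned 8-bit elements that clamp at 0xFF; the function names and the sample data are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 8  /* trip count of the slide's loop */

/* Scalar idiom: add in a wider type, compare against the limit, clamp. */
void saturating_add_idiom(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i++) {
        unsigned r1 = A[i];
        unsigned r2 = B[i];
        unsigned r3 = r1 + r2;  /* r3 = r1 + r2  */
        if (r3 > 0xFF) {        /* cmp r3, #FF   */
            r3 = 0xFF;          /* movgt r3, #FF */
        }
        C[i] = (uint8_t)r3;
    }
}

/* Checks the idiom against hand-computed saturated sums. */
void check_saturating_add(void) {
    const uint8_t A[N] = {0, 100, 200, 255, 1, 128, 127, 250};
    const uint8_t B[N] = {0, 100, 100, 255, 254, 128, 128, 5};
    const uint8_t expected[N] = {0, 200, 255, 255, 255, 255, 255, 255};
    uint8_t C[N];
    saturating_add_idiom(A, B, C);
    assert(memcmp(C, expected, N) == 0);
}
```

A translator that recognizes this three-operation pattern could, in principle, emit one saturating vector add in its place.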


2. Scalarizing Permutations

for(i = 0; i < 8; i++) {
  …
  r1 = r2 + r3;
  tmp[i] = r1;
}

for(i = 0; i < 8; i++) {
  r1 = offset[i];
  r2 = tmp[r1 + i];
  r3 = r2 & const;
  …
}

offset = {4, 4, 4, 4, -4, -4, -4, -4}

[Figure: an add (+) feeding an AND (&) through a permutation]
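To make the offset-array encoding concrete, here is a small C sketch of the two loops with the elided parts filled in by assumption: the first loop adds two hypothetical input arrays `X` and `Y`, and the second writes its result to a hypothetical `out` array. `MASK` stands in for the slide's `const`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 8
#define MASK 0x0F  /* stands in for the slide's "const" */

/* Offsets from the slide: swap the two halves of an 8-element vector. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

void add_permute_mask(const uint8_t *X, const uint8_t *Y, uint8_t *out) {
    uint8_t tmp[N];

    /* Loop 1: data-parallel add, result spilled to a temporary array. */
    for (int i = 0; i < N; i++) {
        tmp[i] = (uint8_t)(X[i] + Y[i]);
    }

    /* Loop 2: the permutation is encoded as a read at index i + offset[i]. */
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];
        uint8_t r2 = tmp[r1 + i];
        out[i] = r2 & MASK;
    }
}

/* Checks the permuted, masked result against hand-computed values. */
void check_add_permute_mask(void) {
    const uint8_t X[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    const uint8_t Y[N] = {10, 10, 10, 10, 10, 10, 10, 10};
    const uint8_t expected[N] = {14, 15, 0, 1, 10, 11, 12, 13};
    uint8_t out[N];
    add_permute_mask(X, Y, out);
    assert(memcmp(out, expected, N) == 0);
}
```

Since `offset` is a compile-time constant array, the shuffle pattern it describes is fixed and can be read directly from the binary.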


3. Scalarizing Reductions

for(i = 0; i < 8; i++) {
  …
  r1 = A[i];
  r2 = r2 + r1;
  …
}

[Figure: add (+) reduction]
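A reduction differs from the earlier loops in that the accumulator carries a value from one iteration to the next. The sketch below, with hypothetical function names and an assumed width of four lanes, shows the scalar accumulator next to a vector-shaped version that keeps one partial sum per lane and combines them at the end.

```c
#include <assert.h>
#include <stdint.h>

#define N 8
#define LANES 4  /* assumed SIMD width, in elements */

/* Scalar form from the slide: a loop-carried accumulator r2. */
uint32_t sum_scalar(const uint16_t *A) {
    uint32_t r2 = 0;
    for (int i = 0; i < N; i++) {
        uint32_t r1 = A[i];
        r2 = r2 + r1;
    }
    return r2;
}

/* Vector-shaped form: LANES partial sums, combined after the loop. */
uint32_t sum_lanes(const uint16_t *A) {
    uint32_t partial[LANES] = {0};
    for (int i = 0; i < N; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            partial[lane] += A[i + lane];
        }
    }
    uint32_t total = 0;
    for (int lane = 0; lane < LANES; lane++) {
        total += partial[lane];
    }
    return total;
}

/* Checks that both forms agree on sample data. */
void check_sum(void) {
    const uint16_t A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    assert(sum_scalar(A) == sum_lanes(A));
}
```

The two forms agree here because unsigned integer addition is associative; a floating-point reduction reordered this way could round differently.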


Applied to ARM Neon

• All instructions supported except…

• VTBL – indirect indexing
    v1 = vtbl v2, v3

• Interleaved memory accesses

• Not needed in evaluated benchmarks

[Figures: VTBL lookup involving v1, v2, v3 with the values 1 0 1 3; interleaved access between v1 and Mem]
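One way to see why a table lookup resists the constant-offset encoding is to model it in scalar C. The sketch below assumes 8 byte-wide lanes, treats `v2` as the table and `v3` as the index vector, and returns 0 for out-of-range indices as Neon's VTBL does; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8  /* assumed: eight byte-wide lanes */

/* Scalar model of v1 = vtbl v2, v3: each lane picks a table byte by a
 * run-time index; out-of-range indices produce 0. */
void table_lookup(const uint8_t *table, const uint8_t *index, uint8_t *out) {
    for (int i = 0; i < LANES; i++) {
        out[i] = (index[i] < LANES) ? table[index[i]] : 0;
    }
}

/* Checks the model against hand-computed lookups. */
void check_table_lookup(void) {
    const uint8_t v2[LANES] = {10, 11, 12, 13, 14, 15, 16, 17};
    const uint8_t v3[LANES] = {1, 0, 1, 3, 7, 8, 255, 2};
    const uint8_t expected[LANES] = {11, 10, 11, 13, 17, 0, 0, 12};
    uint8_t v1[LANES];
    table_lookup(v2, v3, v1);
    assert(memcmp(v1, expected, LANES) == 0);
}
```

Here the shuffle pattern lives in a register whose contents are known only at run time, so no fixed `offset` array in the binary can describe it.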


Translation to SIMD

• Update induction variable
• Use inverse of defined translation rules

for(i = 0; i < 8; i++){
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  r4 = offset[i];
  C[i + r4] = r3;
}

for(i = 0; i < 8; i += 4){
  v1 = A[i];
  v2 = B[i];
  v3 = v1 + v2;
  v3 = shuffle v3;
  C[i] = v3;
}
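The two rules can be sketched in C by placing the scalar representation next to a hand-translated, vector-shaped loop. The slide does not give the offsets for this example, so the `offset` values below (swap the halves of each 4-element group) are an assumption, as are `LANES` and the function names; the lane loops stand in for single vector instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 8
#define LANES 4  /* assumed SIMD width, in elements */

/* Assumed offsets: swap the two halves of each 4-element group. */
static const int offset[N] = {2, 2, -2, -2, 2, 2, -2, -2};

/* Scalar representation: the permutation appears as an offset store. */
void add_permute_scalar(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i++) {
        uint8_t r3 = (uint8_t)(A[i] + B[i]);
        int r4 = offset[i];
        C[i + r4] = r3;
    }
}

/* Translated form: the induction variable steps by LANES, and the offset
 * store becomes a lane shuffle followed by a contiguous store. */
void add_permute_lanes(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i += LANES) {
        uint8_t v3[LANES], shuffled[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            v3[lane] = (uint8_t)(A[i + lane] + B[i + lane]);
        }
        for (int lane = 0; lane < LANES; lane++) {
            shuffled[lane + offset[i + lane]] = v3[lane]; /* v3 = shuffle v3 */
        }
        memcpy(&C[i], shuffled, LANES);                   /* C[i] = v3 */
    }
}

/* Checks that the translated loop matches the scalar representation. */
void check_add_permute(void) {
    const uint8_t A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    const uint8_t B[N] = {10, 20, 30, 40, 50, 60, 70, 80};
    uint8_t c_scalar[N], c_lanes[N];
    add_permute_scalar(A, B, c_scalar);
    add_permute_lanes(A, B, c_lanes);
    assert(memcmp(c_scalar, c_lanes, N) == 0);
}
```

This sketch only works when every offset keeps its element inside the same `LANES`-wide group and the offsets form a permutation of that group, which is the condition under which an offset store corresponds to a single in-register shuffle.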


Translator Design

Translator: efficiency, speed, flexibility

[Figure: an Engineer/Compiler produces one Program; a translator (Trans.) sits in front of each of three processors (Proc.), two of which have an accelerator (Accel.)]


Evaluation

• Trimaran ARM

• Hand SIMDized loops

• SimpleScalar model ARM926 w/ Neon SIMD

• VHDL translator, 130nm std. cell


Liquid SIMD Issues

• Code bloat
  – <1% overhead beyond baseline

• Register pressure
  – Not a problem

• Translator cost
  – 0.2 mm² + 2KB cache

• Translation overhead


Translation Overhead

[Figure: translation overhead chart, grouped by SPECfp, MediaBench, and Kernels]


Summary

• Accelerators are becoming more common and are evolving
  – Costly binary migration

• SIMD virtualization using scalar ISA
  – One binary: forward/backward compatibility
  – Negligible overhead


Questions?