WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Page 1: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS

ODES-9

Page 2: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES

INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)

64-bit datapath
64-bit addressing and high precision computing

[Figure: one 64-bit adder, with two labels reading "64bit"]

Page 3: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES

INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)

64-bit datapath
64-bit addressing and high precision computing

[Figure: four 16-bit adders, with two labels reading "64bit"]

Page 4: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES

INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)

16-bit integer datapath
64-bit addressing and high precision computing

40% of computations need only a 16-bit datapath

Caveat: 64-bit computation becomes 8 * 16-bit computations (DBT)

[Figure: four 16-bit adders]

Page 5: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES

INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)

What does non-productive mean?

  0x0000 0000 0000 0001
+ 0x0000 0000 0000 0025
= 0x0000 0000 0000 0026
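The addition above can be replayed on a 16-bit datapath to see where the wasted work sits. The following is a minimal sketch, not the authors' method: it assumes a 64-bit value is split into four 16-bit chunks and treats a chunk addition as non-productive when both inputs and the carry-in are zero. The function names and that definition are assumptions made for illustration, and the sketch does not try to reproduce the count of eight 16-bit computations mentioned on the earlier slide.

```python
MASK16 = 0xFFFF

def split16(value):
    """Split a 64-bit unsigned value into four 16-bit chunks, least significant first."""
    return [(value >> (16 * i)) & MASK16 for i in range(4)]

def add64_in_chunks(a, b):
    """Add two 64-bit values using four 16-bit additions with a rippling carry.

    Returns (result, non_productive). non_productive counts the chunk
    additions whose two inputs and carry-in are all zero (an assumed
    definition, used here only for illustration).
    """
    result, carry, non_productive = 0, 0, 0
    for i, (x, y) in enumerate(zip(split16(a), split16(b))):
        if x == 0 and y == 0 and carry == 0:
            non_productive += 1
        total = x + y + carry
        result |= (total & MASK16) << (16 * i)
        carry = total >> 16  # carry into the next chunk; dropped after chunk 3
    return result, non_productive

result, wasted = add64_in_chunks(0x1, 0x25)
print(hex(result), wasted)  # 0x26 3
```

For the operands on the slide, only the lowest chunk carries information; under the assumed definition the three upper chunk additions contribute nothing to the result.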

Page 8: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES

INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)

Contributions and conclusions

1. Narrow ISA offers more opportunities to remove non-productive memory operations

2. 50% of dynamic narrow operations are non-productive

3. Memory Productiveness Pruning: profile-guided, dynamic optimization

Page 9: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

ENERGY EFFICIENT CODE GENERATION FOR PROCESSORS WITH EXPOSED DATAPATH

DONGRUI SHE, YIFAN HE, BART MESMAN, HENK CORPORAAL (TUE)

Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)

Register file access reduction

Page 10: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

REGISTER REUSE SCHEDULING

GERGÖ BARANY

Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation

Motivation: Spill code generated by the compiler has a crucial effect on program performance

Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)

Results: 8.9% less spilling, 3.4% smaller static spill costs
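The idea of steering the schedule through extra DDG arcs can be sketched as follows. This is an illustrative sketch only, not the algorithm from the talk: it assumes the DDG is a dictionary mapping each instruction to the set of its successors, and that value v2 may reuse the register of v1 once every user of v1 is ordered before the instruction that defines v2. The names `reaches` and `add_reuse_arcs` are hypothetical.

```python
def reaches(ddg, src, dst):
    """Return True if dst is reachable from src along DDG arcs."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(ddg.get(node, ()))
    return False

def add_reuse_arcs(ddg, users_of_v1, def_of_v2):
    """Order every user of v1 before def_of_v2 by adding arcs user -> def_of_v2.

    Returns False and leaves the graph unchanged if any new arc would
    close a cycle, which would make the graph unschedulable.
    """
    users = [u for u in users_of_v1 if u != def_of_v2]
    if any(reaches(ddg, def_of_v2, u) for u in users):
        return False
    for u in users:
        ddg.setdefault(u, set()).add(def_of_v2)
    return True

# Hypothetical basic block: i1 defines v1, i2 uses v1,
# i3 defines v2, i4 uses v2 and the result of i2.
ddg = {"i1": {"i2"}, "i2": {"i4"}, "i3": {"i4"}, "i4": set()}
print(add_reuse_arcs(ddg, ["i2"], "i3"))  # True: i2 is now forced before i3
print(add_reuse_arcs(ddg, ["i4"], "i1"))  # False: i1 already precedes i4
```

Any list scheduler that respects the augmented graph then produces a schedule in which the two live ranges cannot overlap, which is what makes the scheduling decision implicit.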

Page 11: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Register Allocation and spilling

REGISTER REUSE SCHEDULING

[Figure labels: virtual registers, physical registers, memory]

Page 12: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Register Allocation with reuse candidates

REGISTER REUSE SCHEDULING

[Figure: a basic block with its data dependence graph and its interference graph; the interference graph legend distinguishes "definitely overlap", "definitely NO overlap" and "possible overlap"]

Page 14: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

MOUNIRA BACHIR, SID-AHMED-ALI TOUATI, ALBERT COHEN

Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)

Motivation: Code size affects memory requirements and I-cache performance

Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones

Results: “Good” if there are enough functional units to perform the additional move operations; acceptable execution time

Page 15: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Periodic Register Allocation

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

• Rotating Register File

Page 16: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Periodic Register Allocation

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

• Rotating Register File
• Move operations

d-1 MOVs/iteration

d: iteration span of variables

Page 17: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Periodic Register Allocation

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

• Rotating Register File
• Move operations
• Loop unrolling

3 * code size

Page 18: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Periodic Register Allocation

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

• Rotating Register File
• Move operations
• Loop unrolling
• Modulo Variable Expansion

a[i] b[i] c[i] a[i+1] b[i+1] c[i+1] a[i+2] b[i+2] c[i+2]

using 9 registers instead of 8

MAXLIVE = 8

Page 19: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Periodic Register Allocation

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

• Rotating Register File
• Move operations
• Loop unrolling
• Modulo Variable Expansion
• Meeting Graph

lifetime in cycles

lifetime interval of c ends when interval of b begins
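The register and unrolling arithmetic behind these slides can be sketched in a few lines. This is a hedged illustration: it assumes the usual meeting graph formulation, in which a circuit whose lifetimes sum to k * II needs k registers and the kernel is unrolled by the least common multiple of the k values of all circuits, and it assumes modulo variable expansion gives each variable ceil(lifetime / II) registers. The lifetimes 5, 5 and 6 with II = 2 are hypothetical numbers chosen only because they reproduce the 9 versus 8 registers shown on the slides.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def mve_registers(lifetimes, ii):
    """Registers when each variable gets ceil(lifetime / II) of its own."""
    return sum(-(-lt // ii) for lt in lifetimes)

def circuit_registers(circuit, ii):
    """Registers needed by one meeting graph circuit (sum of lifetimes / II)."""
    total = sum(circuit)
    if total % ii:
        raise ValueError("circuit weight must be a multiple of II")
    return total // ii

def unroll_factor(circuits, ii):
    """Kernel unrolling factor: lcm of the register counts of all circuits."""
    return reduce(lcm, (circuit_registers(c, ii) for c in circuits), 1)

II = 2
lifetimes = [5, 5, 6]                    # hypothetical lifetimes of a, b, c
print(mve_registers(lifetimes, II))      # 9
print(circuit_registers(lifetimes, II))  # 8
print(unroll_factor([lifetimes], II))    # 8
print(unroll_factor([[4, 4], [8]], II))  # 4
```

The last line uses two hypothetical circuits of four registers each: the total stays at eight registers, but the unrolling factor drops from 8 to 4, which is the effect that splitting a circuit into smaller ones aims for.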

Page 20: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Meeting Graph

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

a[i] b[i] c[i] a[i+1] b[i+1] c[i+1] a[i+2] b[i+2] c[i+2] a[i+3] b[i+3] c[i+3] a[i+4] b[i+4] c[i+4] a[i+5] b[i+5] c[i+5] a[i+6] b[i+6] c[i+6] a[i+7] b[i+7] c[i+7]

Page 21: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Circuit Decomposition

DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

Page 22: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION

Main Conference

Page 23: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER

ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)

Micro-architectural: not always documented
Proprietary compilers at advantage!

[Figure: a loop from SPEC2000 int, shown twice, once with an added NOP]

+1 NOP instruction

-7% execution time

Page 24: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER

ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)

Micro-architectural: not always documented
Example: instruction decoding in Core 2 in chunks of 16 bytes

[Figure: the SPEC2000 int loop, shown twice, once with an added NOP, each drawn against a 16-byte alignment boundary]
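Why one extra NOP can help is easy to see with a little address arithmetic. The sketch below only counts how many 16-byte chunks a loop body touches and how many single-byte NOPs would move its start to the next boundary. The addresses and the loop size are hypothetical, and the actual performance effect depends on micro-architectural details that this sketch does not model.

```python
CHUNK = 16  # decode chunk size in bytes, as stated on the slide

def chunks_spanned(start, size, chunk=CHUNK):
    """Number of chunk-sized, chunk-aligned blocks touched by [start, start + size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1

def nops_to_align(start, chunk=CHUNK):
    """Single-byte NOPs needed to push start to the next chunk boundary."""
    return (-start) % chunk

loop_start, loop_size = 0x1F, 14              # hypothetical 14-byte loop body
print(chunks_spanned(loop_start, loop_size))  # 2
pad = nops_to_align(loop_start)
print(pad)                                    # 1
print(chunks_spanned(loop_start + pad, loop_size))  # 1
```

In this made-up case a single NOP moves the loop from two decode chunks into one, so every iteration has one chunk less to fetch and decode.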

Page 25: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER

ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)

Contributions and conclusions

1. Extensible assembly-to-assembly optimizer

2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level

3. Discover micro-architectural details semi-automatically through generation of micro-benchmarks

Page 26: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

DYNAMIC REGISTER PROMOTION OF STACK VARIABLES

JIANJUN LI, CHENGGANG WU, WEI-CHUNG HSU

Use DBT to let x86 binaries use the extra registers on x86-64
• recompiling is not always an option (legacy binaries)
• compute-intensive applications gain speed when using 64-bit

Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)

Page 27: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

LANGUAGE AND COMPILER SUPPORT FOR AUTO-TUNING VARIABLE-ACCURACY ALGORITHMS

JASON ANSEL, YEE LOK WONG, CY CHAN, MAREK OLSZEWSKI, ALAN EDELMAN, SAMAN AMARASINGHE (MIT)

PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler

1. New programming language, toolchain and run-time environment
2. Technique for mapping variable accuracy code to enable efficient auto-tuning

Page 28: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

PRACTICAL MEMORY CHECKING WITH DR. MEMORY

DEREK BRUENING (GOOGLE), QIN ZHAO (MIT)

x86

Existing memory checking tools (e.g. Valgrind)
• slow
• many false positives

Page 29: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER

HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)

Extend the compilation scope from methods to traces
• Traces span multiple method invocations
• More powerful than method inlining

Page 30: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER

HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)

Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting

Page 31: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS

TYLER SONDAG AND HRIDESH RAJAN

Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores

Motivation: Trend towards performance asymmetry among cores of a single chip

Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics exhibited by representative sections are used to map the whole cluster

Results: 36% average process speedup with negligible overheads

Page 32: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Phase-based tuning

PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS

Page 33: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

DORIT NUZMAN, SERGEI DYSHEL, ERVEN ROHOU, IRA ROZEN, ALBERT COHEN, AYAL ZAKS

Objective: Design a split vectorization framework and study how it compares to a monolithic one

Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse

Method: Mix-and-match existing open compilation tools, namely GCC and Mono

Results: Comparable to specialized monolithic offline compilers

Page 34: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Vectorizing for different platforms

VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

Page 35: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Split vectorization scheme

VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

Page 36: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

Interoperable compilation flows

VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

Page 38: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

This is not a bullet slide.