Xtensa C and C++ Compiler
Ding-Kai Chen
Tensilica, Inc.
[email protected]
Copyright © 2010, Tensilica, Inc.
Presentation Outline
XCC history
XCC target -- Xtensa configurable processor
XCC details with examples
– User defined C types
– Operator overloading
– VLIW scheduling
– Auto-SIMD vectorization
– Operation fusion
– SWP Changes
XCC History
Got the first version of SGI Pro64 in May 2000
First customer release, August 2001
Release with IPA, August 2002
Release with SWP, Feedback, VLIW, September 2004
Release with GCC 4.2 Front End, October 2009
Supports C and C++ applications
– Other languages are not as important for embedded applications
Xtensa Processor
32-bit RISC processor targeting embedded dataplane applications
16 32-bit general registers (AR)
24-bit base instructions
Configurable at design-time (not at run-time)
[Figure: Xtensa Core Architecture]
Xtensa Configuration Options
Many pre-defined options to choose from
– Endianness
– Windowed vs non-windowed register file
– Narrow (16-bit) instructions
– Multipliers
– Coprocessors (HiFi, Vectra, BBE, FP)
– Specialized (e.g., MAX) instructions, etc.
[Figure: Xtensa Core Architecture with Configuration Options]
Targeting XCC to Base Xtensa and Tensilica Configurations
As part of retargeting to Xtensa, we used/added
– Code-generator generator tool Olive for WHIRL to CGIR translation
• Handles a lot of configuration specific code
– Support for Xtensa zero-overhead loop instructions
– CG code-size optimization that commonizes instructions from control-flow predecessors
– Feedback-directed speed vs code-size tradeoff
– Support for flexible VLIW formats
• Formats of different bit width and different number of issue slots
Tensilica Instruction Extension (TIE)
TIE is a language to describe new custom:
– Register files up to 512 bits wide
– Instructions up to 128 bits
– VLIW formats up to 15 slots
– C types mapped to custom register files
– Vectorization rules
– Fusion patterns
– Operator overloading
[Figure: Xtensa Architecture with Custom TIE and Configuration Options]
XCC Challenges
Custom extensions in TIE are written at customer site and cannot be configured at XCC build time
Design goals:
– Separation of config-independent code and config-dependent libraries
– Re-targeting in minutes after TIE is designed or modified by processor architect at customer site
– Programming new HW extensions as native C types/operations
Xtensa - Full Development Automation
[Figure: Xtensa Processor Generator* flow]
Processor Configuration
1. Select from menu
2. Explicit instruction description (TIE)
Processor Extensions
Complete Hardware Design
– Source pre-verified RTL, EDA scripts, test suite
Customized Software Tools
– C/C++ compiler
– Debuggers, Simulators, RTOSes
Use standard ASIC/COT design techniques and libraries for any IC fabrication process
* US Patent: 6,477,697
TIE register file and operation
// new register file for int32x4
// vectorization
regfile v 128 16

// a new C type based on <v> regfile
// and has 128-bit size and
// 128-bit alignment
ctype int32x4 128 128 v

operation add_v { out v vout, in v va, in v vb } {} {
  assign vout = {
    va[127:96] + vb[127:96],
    va[95:64] + vb[95:64],
    va[63:32] + vb[63:32],
    va[31:0] + vb[31:0] };
}

in C:
void vsum() {
  int i;
  int32x4* va = (int32x4*)a;
  int32x4* vb = (int32x4*)b;
  int32x4* vc = (int32x4*)c;

  for (i=0; i<VSIZE; i++) {
    // C intrinsic call
    vc[i] = add_v(va[i], vb[i]);
  }
}
add_v is an intrinsic call in C
In WHIRL, it is an intrinsic_op, which is optimizer friendly
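For readers without an Xtensa toolchain, the lane-wise behavior of add_v can be modeled in portable C. This is only a sketch: ref_int32x4, ref_add_v and ref_vsum are hypothetical names invented here, not the compiler-provided type or intrinsic, and unsigned lanes are an assumption chosen so that overflow wraps the way a 32-bit hardware adder would.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the TIE int32x4 ctype: four 32-bit lanes. */
typedef struct {
    uint32_t lane[4];
} ref_int32x4;

/* Reference model of add_v: four independent 32-bit additions.
   No carry crosses a lane boundary, and each lane wraps on overflow. */
static ref_int32x4 ref_add_v(ref_int32x4 va, ref_int32x4 vb)
{
    ref_int32x4 vout;
    for (int k = 0; k < 4; k++) {
        vout.lane[k] = va.lane[k] + vb.lane[k];
    }
    return vout;
}

/* Same shape as the vsum() loop on the slide, over n vectors. */
void ref_vsum(const ref_int32x4 *va, const ref_int32x4 *vb,
              ref_int32x4 *vc, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        vc[i] = ref_add_v(va[i], vb[i]);
    }
}
```

A model like this can serve as a golden reference when checking results produced by the real intrinsic.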
TIE C type support
Each TIE C type maps to a new WHIRL mtype
Each TIE regfile maps to an ISA_REGCLASS
GCC FE declares new C types and new intrinsics (added new TIE_TYPE tree code)
WGEN translates TIE C type references to WHIRL loads/stores
Olive tool adds dynamic rules to handle new types and WHIRL opcodes
Added TN_mtype() for register spills/reloads
Made BE optimizations (CSE, ebo, etc.) work
TIE example – generated code
#<loop> Loop body line 28, nesting depth: 1, iterations: 8
#<loop> unrolled 4 times
  load_v  v0,a2,0      # [0*II+0] id:20 b+0x0
  load_v  v1,a3,0      # [0*II+1] id:19 a+0x0
  load_v  v2,a2,16     # [0*II+2] id:20 b+0x0
  load_v  v3,a3,16     # [0*II+3] id:19 a+0x0
  load_v  v4,a2,32     # [0*II+4] id:20 b+0x0
  load_v  v5,a3,32     # [0*II+5] id:19 a+0x0
  load_v  v6,a2,48     # [0*II+6] id:20 b+0x0
  load_v  v7,a3,48     # [0*II+7] id:19 a+0x0
  addi    a2,a2,64     # [0*II+8]
  addi    a3,a3,64     # [0*II+9]
  addi    a4,a4,64     # [0*II+10]
  add_v   v0,v1,v0     # [0*II+11]
  add_v   v1,v3,v2     # [0*II+12]
  add_v   v2,v5,v4     # [0*II+13]
  add_v   v3,v7,v6     # [0*II+14]
  store_v v0,a4,-64    # [0*II+15] id:21 c+0x0
  store_v v1,a4,-48    # [0*II+16] id:21 c+0x0
  store_v v2,a4,-32    # [0*II+17] id:21 c+0x0
  store_v v3,a4,-16    # [0*II+18] id:21 c+0x0
Total 19/4 = 4.75 cycles per iteration
TIE updating ld/st
// pre-increment load/store
operation load_vu { out v vout, inout AR base, in simm8 offset }
                  { out VAddr, in MemDataIn128 } {
  assign VAddr = base + offset;
  assign vout = MemDataIn128;
  assign base = base + offset;
}
operation store_vu { in v vin, inout AR base, in simm8 offset }
                   { out VAddr, out MemDataOut128 } {
  assign VAddr = base + offset;
  assign MemDataOut128 = vin;
  assign base = base + offset;
}
proto int32x4_loadiu { out int32x4 vout, inout int32x4* base, in immediate offset } {} {
  load_vu vout, base, offset;
}
proto int32x4_storeiu { in int32x4 vin, inout int32x4* base, in immediate offset } {} {
  store_vu vin, base, offset;
}
TIE updating ld/st
#<loop> Loop body line 28, nesting depth: 1, iterations: 32
  load_vu  v0,a2,16    # [0*II+0] id:20 b+0x0
  load_vu  v1,a3,16    # [0*II+1] id:19 a+0x0
  store_vu v2,a4,16    # [1*II+2] id:21 c+0x0
  add_v    v2,v1,v0    # [0*II+3]
total 4 cycles per iteration

XCC identifies updating ld/st operations
Pre-bias ld/st bases to work with pre-increment
Combine ld/st with addi in CG
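The pre-bias step is easier to see in a small C model. The sketch below assumes the address register can be modeled as a signed element index into an array of vectors (the real operations use byte offsets such as 16), and all ref_* names are hypothetical. Because the updating operations add the offset before accessing memory, each base starts one element before the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t lane[4];
} ref_int32x4;

/* Model of load_vu: update the base first, then load from the new address. */
static ref_int32x4 ref_load_vu(const ref_int32x4 *mem, ptrdiff_t *base,
                               ptrdiff_t offset)
{
    *base += offset;
    return mem[*base];
}

/* Model of store_vu: update the base first, then store to the new address. */
static void ref_store_vu(ref_int32x4 vin, ref_int32x4 *mem, ptrdiff_t *base,
                         ptrdiff_t offset)
{
    *base += offset;
    mem[*base] = vin;
}

/* vsum written with updating accesses; no separate index increment remains. */
void ref_vsum_updating(const ref_int32x4 *a, const ref_int32x4 *b,
                       ref_int32x4 *c, ptrdiff_t n)
{
    assert(n >= 0);
    /* Pre-bias: the first pre-increment by 1 lands on element 0. */
    ptrdiff_t ia = -1, ib = -1, ic = -1;
    for (ptrdiff_t i = 0; i < n; i++) {
        ref_int32x4 x = ref_load_vu(a, &ia, 1);
        ref_int32x4 y = ref_load_vu(b, &ib, 1);
        ref_int32x4 s;
        for (int k = 0; k < 4; k++) {
            s.lane[k] = x.lane[k] + y.lane[k];
        }
        ref_store_vu(s, c, &ic, 1);
    }
}
```

The three addi instructions of the non-updating loop have no counterpart here, which mirrors how combining ld/st with addi shortens the loop body.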
TIE operator overloading
// map "+" operator to add_v for
// type int32x4
operator "+" add_v

in C:
void vsum_op() {
  int i;
  int32x4* va = (int32x4*)a;
  int32x4* vb = (int32x4*)b;
  int32x4* vc = (int32x4*)c;

  for (i=0; i<VSIZE; i++) {
    // more natural using C "+" syntax
    vc[i] = va[i] + vb[i];
  }
}

Check for TIE type operands and operator overloading in build_binary_op in c-typeck.c of GCC
Build proper call to mapped TIE intrinsic
TIE VLIW scheduling
format flix0 64 {slot0,slot1} // add 2-slots 64-bit VLIW format
slot_opcodes slot0 { load_v, store_v, load_vu, store_vu, add_v }
slot_opcodes slot1 { load_v, store_v, load_vu, store_vu, add_v }

---------------------------------- .s output ----------------------------------
#<loop> unrolled 2 times
{ # format flix0
  load_vu  v3,a2,32    # [0*II+0] id:20 b+0x0
  add_v    v5,v4,v3    # [1*II+0]
}
{ # format flix0
  load_v   v0,a2,-16   # [0*II+1] id:20 b+0x0
  add_v    v2,v1,v0    # [1*II+1]
}
{ # format flix0
  load_v   v1,a3,16    # [0*II+2] id:19 a+0x0
  load_vu  v4,a3,32    # [0*II+2] id:19 a+0x0
}
{ # format flix0
  store_v  v2,a4,16    # [1*II+3] id:21 c+0x0
  store_vu v5,a4,32    # [1*II+3] id:21 c+0x0
}
total 4/2=2 cycles per iteration
TIE VLIW scheduling
XCC initialization includes analysis of TIE VLIW formats
Create resources that model bundling constraints
– Consider a simpler case: 1 slot is allowed for each opcode
– Each VLIW slot in a format is viewed as a resource
• Different formats are treated separately
– Each opcode consumes the resource of the slot it is allowed in
– A group of operations can be scheduled in the same cycle if its total resource usage is within the limit
– Gets complicated when multiple slots are allowed for an opcode
Resource reservation modeling allows de-coupling of scheduling and slot assignment in CG
Extended resource reservation word type SI_RRW to arbitrary length bit-vectors
TI_RES_RES_Resources_Available() also checks for compatible formats
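The bundling question above can be illustrated with a small C sketch. It is not XCC's SI_RRW implementation: can_bundle and assign_slots are hypothetical names, and a plain backtracking search over slot bitmasks is assumed here only to show why the multi-slot case is harder than the one-slot-per-opcode case. Each operation carries a bitmask of the slots it is allowed in within one format, and the group fits if every operation can be given its own slot.

```c
#include <assert.h>
#include <stdint.h>

/* Try to give each of allowed[idx..n-1] a slot that is not yet in `used`. */
static int assign_slots(const uint32_t *allowed, int n, int idx, uint32_t used)
{
    if (idx == n) {
        return 1;
    }
    uint32_t candidates = allowed[idx] & ~used;
    while (candidates != 0) {
        uint32_t slot = candidates & (~candidates + 1u); /* lowest set bit */
        if (assign_slots(allowed, n, idx + 1, used | slot)) {
            return 1;
        }
        candidates &= ~slot; /* that choice failed, back-track */
    }
    return 0;
}

/* Returns 1 if the n_ops operations can share one bundle of a format with
   n_slots slots. allowed[i] has bit s set if operation i may go in slot s. */
int can_bundle(const uint32_t *allowed, int n_ops, int n_slots)
{
    assert(n_ops >= 0);
    assert(n_slots > 0 && n_slots <= 32);
    if (n_ops > n_slots) {
        return 0;
    }
    uint32_t format_mask =
        (n_slots == 32) ? 0xFFFFFFFFu : ((1u << n_slots) - 1u);
    /* Slots outside the format are marked as already used. */
    return assign_slots(allowed, n_ops, 0, ~format_mask);
}
```

For a two-slot format such as flix0, two operations that are each allowed in both slots have masks {3, 3} and fit, while two operations restricted to slot0 have masks {1, 1} and do not. With masks {3, 1}, the first choice (slot0 for the first operation) fails and the search back-tracks.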
TIE auto-SIMD vectorization
property vector_ctype {int32x4, int32, 4}
property vector_proto {add_v, xt_add, 4}

in C:
for (i=0; i<SIZE; i++) {
  c[i] = a[i] + b[i];
}

with -O3 -LNO:simd -clist, in .w2c:

int32x4 V_00;
int32x4 V_;
int32x4 V_0;
int32x4 V_4;
_INT32 i;

for(i = 0; i <= 127; i = i + 4)
{
  V_00 = *(int32x4 *)(&a[i]);
  V_ = *(int32x4 *)(&b[i]);
  V_0 = add_v(V_00, V_);
  V_4 = V_0;
  * (int32x4 *)(&c[i]) = V_4;
}
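The transformed loop can be mimicked in portable C to check that it computes the same result as the scalar loop. This is a sketch under assumptions: ref_int32x4 and ref_simd_add are hypothetical names, unsigned elements are used so overflow is well defined, and the scalar remainder loop is an addition for trip counts that are not a multiple of 4 (the example on the slide has exactly 128 iterations and needs none).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t lane[4];
} ref_int32x4;

/* c[i] = a[i] + b[i], processed four elements at a time. */
void ref_simd_add(const uint32_t *a, const uint32_t *b, uint32_t *c, int n)
{
    assert(n >= 0);
    int i = 0;
    /* Vector loop: one 4-lane load per input, one 4-lane add, one store. */
    for (; i + 4 <= n; i += 4) {
        ref_int32x4 va, vb, vc;
        memcpy(va.lane, &a[i], sizeof va.lane);
        memcpy(vb.lane, &b[i], sizeof vb.lane);
        for (int k = 0; k < 4; k++) {
            vc.lane[k] = va.lane[k] + vb.lane[k];
        }
        memcpy(&c[i], vc.lane, sizeof vc.lane);
    }
    /* Scalar remainder (assumption, not shown in the .w2c output). */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```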
TIE auto-SIMD vectorization
Developed independently of (and before) the Open64 vectorizer
Integrated into Phase 2 of LNO
Scan all loops in a nest
Check for presence of vectorized versions of each op in the loop
Check for stride-1 or invariant memory references
Support for loads and stores with addresses not aligned as vector type
– Pre-load once before the vector loop
– Subsequent loads in the vector loop combine with the prior loads
Support for spatial reuse within a vector using select instruction
– E.g. a[i] + a[i+1] in the scalar loop
• Pre-load once before the vector loop
• Only a single load is needed now for each iteration
• Select instructions shuffle data from loads of consecutive iterations
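The a[i] + a[i+1] case can be sketched in C as follows. The names ref_select_shift1 and ref_adjacent_sum are hypothetical, the select is modeled as a fixed lane shuffle, and two simplifying assumptions are made: n is a multiple of 4, and the input array has at least n + 4 readable elements so that the look-ahead load in the last iteration stays in bounds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t lane[4];
} ref_int32x4;

/* Model of a select: lanes 1..3 of `lo` followed by lane 0 of `hi`,
   i.e. the vector that starts one element later in memory. */
static ref_int32x4 ref_select_shift1(ref_int32x4 lo, ref_int32x4 hi)
{
    ref_int32x4 r = {{ lo.lane[1], lo.lane[2], lo.lane[3], hi.lane[0] }};
    return r;
}

/* c[i] = a[i] + a[i+1] for i in [0, n). */
void ref_adjacent_sum(const uint32_t *a, uint32_t *c, int n)
{
    assert(n >= 0 && n % 4 == 0);
    ref_int32x4 prev;
    /* Pre-load once before the vector loop. */
    memcpy(prev.lane, &a[0], sizeof prev.lane);
    for (int i = 0; i < n; i += 4) {
        ref_int32x4 cur, shifted, sum;
        /* Only one new load per iteration. */
        memcpy(cur.lane, &a[i + 4], sizeof cur.lane);
        /* The select builds a[i+1..i+4] from the two loaded vectors. */
        shifted = ref_select_shift1(prev, cur);
        for (int k = 0; k < 4; k++) {
            sum.lane[k] = prev.lane[k] + shifted.lane[k];
        }
        memcpy(&c[i], sum.lane, sizeof sum.lane);
        prev = cur; /* reuse this load in the next iteration */
    }
}
```

Without the reuse, each iteration would need two loads (one of them unaligned); carrying prev across iterations replaces the second load with a register shuffle.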
TIE operation fusion
Combine multiple operations into one
E.g., combines an add followed by a shift into one add_shift operation
Performed in CG
Build dataflow graphs from input patterns
Repeatedly search for matches in BBs
Peephole optimization with custom patterns

imap add_shift_v { out v vout, in v va, in v vb, in immediate amount }
{
  {} {
    // the output pattern
    add_shift_v vout, va, vb, amount;
  }
}
{
  { v v_temp } {
    // the input pattern
    add_v v_temp, va, vb;
    shift_v vout, v_temp, amount;
  }
}
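A much simplified C sketch of matching the add_shift_v pattern in a basic block is given below. It is not the CG implementation: instead of dataflow graphs it only looks at adjacent operations, the ref_op structure and all function names are hypothetical, and the temporary is assumed not to be live out of the block.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_ADD_V, OP_SHIFT_V, OP_ADD_SHIFT_V, OP_OTHER } ref_opcode;

typedef struct {
    ref_opcode opc;
    int dst;  /* destination register number */
    int src0; /* first source register, -1 if unused */
    int src1; /* second source register, -1 if unused */
    int imm;  /* immediate operand, 0 if unused */
} ref_op;

/* Returns 1 if `reg` is read in ops[from..n-1] before being redefined. */
static int read_later(const ref_op *ops, size_t n, size_t from, int reg)
{
    for (size_t j = from; j < n; j++) {
        if (ops[j].src0 == reg || ops[j].src1 == reg) {
            return 1;
        }
        if (ops[j].dst == reg) {
            return 0;
        }
    }
    return 0; /* assumption: the temporary is not live out of the block */
}

/* Rewrites "add_v t,a,b ; shift_v out,t,amt" into "add_shift_v out,a,b,amt"
   in place when t has no other use. Returns the new operation count. */
size_t ref_fuse_add_shift(ref_op *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n &&
            ops[i].opc == OP_ADD_V &&
            ops[i + 1].opc == OP_SHIFT_V &&
            ops[i + 1].src0 == ops[i].dst &&
            (ops[i + 1].dst == ops[i].dst ||
             !read_later(ops, n, i + 2, ops[i].dst))) {
            ref_op fused = { OP_ADD_SHIFT_V, ops[i + 1].dst,
                             ops[i].src0, ops[i].src1, ops[i + 1].imm };
            ops[out++] = fused;
            i += 2;
        } else {
            ops[out++] = ops[i];
            i++;
        }
    }
    return out;
}
```

The check on later reads of the temporary corresponds to v_temp being local to the input pattern: if another operation still needs the add result, the pair is left alone.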
TIE operation fusion
Example C code:
for (i=0; i<VSIZE; i++) {
  vc[i] = (va[i] + vb[i]) << 2;
}
Original schedule is 5 cycles / 2 iter = 2.5 cycles per iteration
New schedule with operation fusion is 4 cycles / 2 iter = 2 cycles per iteration
XCC SWP scheduler
Xtensa has no rotating registers, so 2 register allocators were added: simple and coloring. The simple allocator is used first to get a tighter bound, then coloring is tried.
Performance is critical: added back-tracking for the following
– Unrolling (hard to guess best unrolling)
– Different priority heuristics for choosing candidates
– Different initial op orderings
– Register allocation failures
Runs slightly longer but complements the original IA-64 based SWP algorithm well
Conclusion
Open64 is versatile in providing optimized performance for embedded applications.
XCC experience shows that many of the optimizations can be adapted to retarget for ISA extensions quickly.
Sample Performance Data:
– EEMBC Consumer benchmark gained 6x speedup with automatic vectorization + VLIW scheduling + operation fusion
XCC solution is not final. It is still evolving with new HW features offered by Tensilica.
Want to explore new ways in TIE to describe HW that supports optimizations.
Tensilica is looking for new talent to join the compiler team.
http://[email protected]