Multi-platform Automatic Parallelization and Power ... · (SMP servers) Server Code Generation OpenMP Compiler Shred memory servers Heterogeneous Multicores from Vendor B Hitachi,

Multi-platform Automatic Parallelization and Power Reduction by OSCAR Compiler

Hironori KasaharaProfessor, Dept. of Computer Science & Engineering

Director, Advanced Multicore Processor Research InstituteWaseda University, Tokyo, Japan

IEEE Computer Society Board of GovernorsIEEE Computer Society Multicore STC Chair

URL: http://www.kasahara.cs.waseda.ac.jp/

To improve effective performance, cost-performance and software productivity and reduce power

OSCAR Parallelizing Compiler

Multigrain Parallelizationcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism

Data LocalizationAutomatic data management fordistributed shared memory, cacheand local memory

Data Transfer OverlappingData transfer overlapping using DataTransfer Controllers (DMAs)

Power ReductionReduction of consumed power bycompiler control DVFS and Powergating with hardware supports.

1

23 45

6 7 8910 1112

1314 15 16

1718 19 2021 22

2324 25 26

2728 29 3031 32

33Data Localization Group

dlg0dlg3dlg1 dlg2

Low Power Heterogeneous Multicore Code

GenerationAPI

Analyzer(Available

from Waseda)

Existing sequential compiler

Multicore Program Development Using OSCAR API V2.0Sequential Application

Program in Fortran or C(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

Low Power Homogeneous Multicore Code

GenerationAPI

AnalyzerExisting

sequential compiler

Proc0

Thread 0

Code with directives

Waseda OSCARParallelizing Compiler

Coarse grain task parallelization

Data Localization DMAC data transfer Power reduction using

DVFS, Clock/ Power gating

Proc1

Thread 1

Code with directives

Parallelized API F or C program

OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycoresDirectives for thread generation, memory,

data transfer using DMA, power managements

Generation of parallel machine

codes using sequential compilers

Exe

cuta

ble

on v

ario

us m

ultic

ores

OSCAR: Optimally Scheduled Advanced MultiprocessorAPI： Application Program Interface

HomegeneousMulticore s

from Vendor A(SMP servers)

Server Code GenerationOpenMP Compiler

Shred memory servers

HeterogeneousMulticores

from Vendor B

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

Accelerator 1Code

Accelerator 2Code

Hom

ogen

eous

Accelerator Compiler/ User Add “hint” directives

before a loop or a function to specify it is executable by

the accelerator with how many clocks

Het

ero

Manual parallelization / power reduction

Model Base Designed Engine Control on V850 Multicore with Denso

Though so far parallel processing of the engine control on multicore has been very difficult, Denso and Waseda succeeded 1.95 times speedup on 2core V850 multicore processor.

1 core 2 cores

Hard real-time automobile engine

control by multicore

Parallelizing Handwritten Engine Control Programs on Multi‐core processors

• Current automotive crankshaft program– Developed by TOYOTA Motor Corp– About 300,000 Lines– Difficulty of parallel processing

• Too fine granularity• Many conditional branches and small basic blocks,

but no parallelizable loops– Minimizing run‐time overhead and improvement of parallelism are necessary Current product compilers can not parallelize Current accelerators are not applicable

Automatic parallelization of a crankshaft program using multi‐grain parallelization in OSCAR Compiler Performance improvement and efficient multi‐threaded

programming development 2013/04/19 Cool Chips XVI 5

Analysis of Coarse Grain Parallelism by OSCAR Compiler

2013/04/19 Cool Chips XVI

6

1

2 3

4

5

6

7

8

910 11

12

13

14Macro-Flow GraphMacro-Task Graph

Earliest Executable Condition

1

2 3

4

5

6

7

8

9 10

11

12

13

14

Decomposes a program into coarse grain tasks, or macro tasks(MTs)

1. BB (Basic Block)2. RB (Repetition Block, or loop)3. SB (Subroutine Block, or function)

Generate MFG(Macro Flow Graph) Control flow and data dependencies

Generates MTG(Macro Task Graph) Coarse grain parallelism

: Data Dependency: Control Flow: Conditional Branch

Data Dependency

Control Flow

Conditional Branch

Coarse Grain Task Parallelizationof Hand-written Engine Control Program

2013/04/19 Cool Chips XVI 7

MTG of crankshaft programs

Loop parallelizationNo parallelizable loopsin engine control codes

Fine grain parallelizationEach BBs are very low cost

-less than 100 clock cyclesBranches prevent compilers

Coarse grain parallelizationUtilize parallelism between SBs and BBs

Static Task Scheduling

2013/04/19 Cool Chips XVI 8MFG of sample program

Dynamic task scheduling Prevent from traceability Add run-time overhead

Static task scheduling Guarantee Real-time constraints Ensure traceability Minimize run-time overhead

Cannot assign BBs having braches statically Static task scheduling can be applied if the MTG has only

data dependency The compiler cannot see if the branch

is taken or not at compile time. Fuse tasks by hiding conditional branches in MFG

to avoid dynamic task scheduling Macro Task Fusion

Analysis of A Crankshaft Program Using Macro Task Fusion


There is not enough parallelism

Can not schedule MTs at compile time

MTG of crankshaft program before macro task fusion MTG of crankshaft program after macro task fusion

sb4 and block5 account for over 90% of whole execution time.

MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements


Successfully increased coarse grain parallelism

Critical Path(CP)

CP accounts for over 99% of whole

execution time.

MTG of crankshaft program before restructuring

Critical Path(CP)

CP accounts for about 60% of whole

execution time.

MTG of crankshaft program after restructuring

Succeed to reduce CP 99% -> 60%

Evaluation Environment :Embedded Multi-core Processor RPX

• SH-4A 648MHz * 8– As a first step, we use just two SH-4A cores because target dual-core

processors are currently under design for next-generation automobiles


.t-

Evaluation of Crankshaft Program with Multi-core Processors

• Attain 1.54 times speedup on RPX– There are no loops, but only many conditional branches and small basic blocks and

difficult to parallelize this program

• This result shows possibility of multi-core processor for engine control programs


1.00

1.54 0.57

0.37

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.000.200.400.600.801.001.201.401.601.80

1 core 2 core

executiontim

e(us)speedu

pratio

Performance of OSCAR Compileron Intel Core i7 Notebook PC

• OSCAR Compiler accelerate Intel Compiler about 2.0 times on average

1.00 1.18

2.70

1.00 1.00

1.70

2.24

4.12

2.91

2.53

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

4.50

SPEC95 su2cor SPEC95hydro2d

SPEC95 mgrid SPEC95 turb3d AAC Encoder

Speesup Ra

tio

Intel Compiler Ver.14.0OSCAR

CPU: Intel Core i7 3720QM (Quad‐core)MEM: 32GB DDR3‐SODIMM PC3‐12800OS: Ubuntu 12.04 LTS

Parallel Processing of JPEG XR Encoder on TILEPro64

Multimedia Applications:Sequential C Source Code

Parallelized C Program with OSCAR API

OSCAR Compiler

Parallelized Executable Binary for TILEPro64

API Analyzer +Sequential Compiler

Cache Allocation Setting

1x

28x

55x

0.00

10.00

20.00

30.00

40.00

50.00

60.00

1 64Speedu

p

# Cores

Speedup (JPEG XR Encoder)

Default Cache AllocationOur Cache Allocation

(1)OSCAR Parallelization

(2)Cache Allocation Setting

Local cache optimization:Parallel Data Structure (tile) on heapallocating to local cache

55x speedup on 64 cores AAC EncoderJPEG XR Encoder Optical Flow Calc.

0,01,02,03,04,05,06,07,0

0,11,12,13,14,15,16,17,1

0,21,22,23,24,25,26,27,2

0,31,32,33,34,35,36,37,3

0,41,42,43,44,45,46,47,4

0,51,52,53,54,55,56,57,5

0,61,62,63,64,65,66,67,6

0,71,72,73,74,75,76,77,7

I/O

I/OI/O

Memory Controller 0

Memory Controller 1

Memory Controller 2

Memory Controller 3

Ds

t0X4)

rt 1

al)

nal)

X4)

{

Parallel Processing of Face Detection on Manycore, Highendand PC Server

15

• OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 highendserver.

1.00 1.72

3.01

5.74

9.30

1.00 1.93

3.57

6.46

11.55

1.00 1.93

3.67

6.46

10.92

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

1 2 4 8 16

速度

向上

率

コア数

速度向上率tilepro64 gcc

SR16k(Power7 8core*4cpu*4node) xlc

rs440(Intel Xeon 8core*4cpu) icc

Automatic Parallelization of Face Detection

92 Times Speedup against the Sequential Processing for GMS Earthquake Wave

Propagation Simulation on Hitachi SR16000（Power7 Based 128 Core Linux SMP）

0

10

20

30

40

50

60

70

80

90

100

1pe 2pe 4pe 8pe 16pe 32pe 64pe 128pe

Speedup

agai

nst s

eque

ntia

l pro

cess

ing

oscar

Profile-Based Automatic Parallelization and Sequential Program Tuning for Android 2D Rendering on Nexus7

WASEDA UNIVERSITY

OSCAR Compiler

Profile-Based Automatic Parallelization

OSCAR Parallelization Compiler

AnalyzeResult

Profile ResultSequentialBinary File

Hotspot Analyzer

Original Source files

ARMCompil

er(GCC)

ParalleizedBinary File

Parallelized Source Fileswith OSCAR API

ARMCompile

r(GCC)

ＡＢ

Rewite toParallelizable C

OSCAR API

Analyzer Parallelized Source Files

Hotspot Source File

Green Computing Systems Research and Development Center Waseda University

Path Generation

RasterizationShading

BitBlit

paths

mask

destination Image

source image

modifieddestination Image

Transform figure to some paths

Transform path to Bitmap(Mask)

Calculate and transfer display image from source image, mask and destination image.

Make a color data from a Command

Standard library which performs 2D rendering on Android

Skia 2D Rendering Pipeline

Android 2D Rendering “Skia” Execution flow is different on each benchmarHigh load on the BitBlit process.

SkRGB16_Blitter

::blitH

81.84%

memset_128_loo

p

2.70%

sk_fill_path

2.21%

memset32_loop1

28

1.42%

SkString::~SkStri

ng()

0.61%

Others

11.22%

DrawArc

others

2.57%

SkRGB16_Blitter::blitRect97.43%

DrawRect

ＡＢ

void SkRGB16_Blitter_blitRect_oscar(int width, int height, uint16_t* device, unsigned deviceRB, SkPMColor src32) {

int i;uint16_t* deviceTMP;

for (i = height; i > 0; i--){deviceTMP = (uint16_t*)((char*)device + (deviceRB * (height - i)));blend32_16_row(src32, deviceTMP, width);

}}

void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {SkASSERT(x + width

0

15

30

45

60

通常の1コア実⾏並列化3コア実⾏

DrawImage : FPS

Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7

0

15

30

45

60

通常の1コア実⾏並列化3コア実⾏

DrawRect :FPS

22.82

43.57

27.16

×1.91 ×1.95

for DrawRect 1.91 speedupfor DrawImage 1.95 speedup

On Nexus7, 3 core parallelization gave us

52.88

18

1 Core 3 cores 1 Core 3 cores

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

Low-Power Optimization with OSCAR API

MT1

VC0

MT2

MT4MT3

Sleep

VC1

Scheduled Resultby OSCAR Compiler void

main_VC0() {

MT1

voidmain_VC1() {

MT2

#pragma oscar fvcontrol ¥(1,(OSCAR_CPU(),100))

#pragma oscar fvcontrol ¥((OSCAR_CPU(),0))

Sleep

MT4MT3

} }

Generate Code Image by OSCAR Compiler

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2

by OSCAR Parallelizing Compiler

Avg. Power5.73 [W]

Avg. Power1.52 [W]

73.5% Power Reduction

MPEG2 Decoding with 8 CPU cores

0

1

2

3

4

5

6

7

0

1

2

3

4

5

6

7

Without Power Control（Voltage：1.4V)

With Power Control （Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V)

33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

12.29 3.09

5.4

18.85

26.71

32.65

0

5

10

15

20

25

30

35

1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE

Speedu

ps against a single SH

processor

3.4[fps]

111[fps]

Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

Without Power Reduction With Power Reductionby OSCAR Compiler

Average:1.76[W] Average:0.54[W]

1cycle : 33[ms]→30[fps]

70% of power reduction

Automatic Power Reduction for MPEG2 Decode on Android Multicore

ODROID X2 ARM Cortex-A9４cores

23

• On 3 cores, Automatic Power Reduction control successfully reduced power to 1/7 against without Power Reduction control.

• 3 cores with the compiler power reduction control reduced power to 1/3 against ordinary 1 core execution.

0.97

1.88

2.79

0.63 0.46 0.37

0.00

0.50

1.00

1.50

2.00

2.50

3.00

1 2 3

Power Con

sumption[W

]

Number of Processor Cores

電力制御なし電力制御ありWithout Power Reduction

With Power Reduction

2/3(‐35.0%)

1/4（‐75.5%）

1/7(‐86.7%)

1/3(‐61.9%)

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

Automatic Power Reduction on 4

core Intel Haswell

• Haswell Processor– OS Ubuntu 13.10– Intel CPU Core i7 4770K

• 4 cores• L1 Cache: Load 64Bytes/cycle, Store 32Bytes/cycle• L2 Cache 64Bytes/cycle• L3 Cache 8 MB• Frequency 3.5GHz~0.8MHz

– Memory 16GB (8GB×2)

24

Power Reduction on Intel Haswellfor Real-time Optical Flow

25

Power was reduced to 1/4 by the compiler power optimization on the same 3 cores.The power with 3 core was reduced to 1/3 against 1 core.

29.29

36.5941.58

28.40

13.22 10.49

0.00

10.00

20.00

30.00

40.00

50.00

1 2 3

Average Po

wer [W

]

No. of Processor Cores

電力制御なし電力制御あり

Power was reduced to 1/4 by compiler on 3 cores

Power was reduced to 1/3 compared with one core ordinal execution

For HD 720p(1280x720) moving pictures15fps (Deadline66.6[ms/frame])Without

Power Control

With Power Control

Power Waves for 1 Core to 3 Cores without the Compiler Power Control on Intel Haswell

for Real-time Optical Flow

2014/6/17 DEMO 26

電圧(V)電流(A)電力(W)

29.29W36.59W

41.58W

2 Cores1 Core 3 Cores

PowerPower Power

2014/6/17 DEMO 27

電圧(V)電流(A)電力(W)

28.40W13.22W

10.49W

PowerPowerPower

3 Cores2 Cores1 Core

Power Waves for 1 Core to 3 Cores with the Compiler Power Control on Intel Haswell

for Real-time Optical Flow

Power for 1 & 3Cores without Controlvs. for 3 Cores with Control on HaswellWithout Power Control With Power Control

2014/6/17 DEMO 282014/6/17

29.29W

1 Core

Power

28

41.58W

3 Cores

Power

Without Power Control

28

10.49W

Power

3 Cores

Future Multicore ProductsNext Generation Automobiles‐ Safer, more comfortable, energy efficient, environment friendly‐ Cameras, radar, car2car communication, internet information integrated brake, steering, engine, moter control

Solar powered with more than 100 times power efficient : FLOPS/W• Regional Disaster Simulators

saving lives from tornadoes, localized heavy rain, fires with earth quakes

‐From everyday recharging to less than once a week‐ Solar powered operation inemergency condition‐ Keep health

Smart phones

Cancer treatment, Drinkable inner camera• Emergency solar powered• No cooling fun, No dust ,

clean usable inside OP room

Advanced medical systems Personal / Regional Supercomputers

Documents

Multi-platform Automatic Parallelization and Power ... · (SMP servers) Server Code Generation OpenMP Compiler Shred memory servers Heterogeneous Multicores from Vendor B Hitachi,