OSCAR Automatic Parallelization and Power Reduction Compiler for Homogeneous and Heterogeneous Multicores
Hironori Kasahara, Professor, Dept. of Computer Science & Engineering; Director, Advanced Multicore Processor Research Institute, Waseda University, Tokyo, Japan; IEEE Computer Society Multicore STC Chair; IEEE Computer Society 2015 Election President Candidate. URL: http://www.kasahara.cs.waseda.ac.jp/
<R&D Target> Hardware, software, and applications for super low-power manycore processors: more than 64 cores with natural air cooling (no fan); cool, compact, clear, and quiet; operational by solar panel.
<Industry, Government, Academia> Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, OSCAR Tech.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products: consumer electronics, automobiles, servers.
Green Computing Systems R&D Center, Waseda University, supported by METI (completed Mar. 2011), beside the subway Waseda Station near the Waseda Univ. main campus.
Target systems and applications (industry-government-academia collaboration in R&D, targeting practical applications):
- High-end SMP servers: Hitachi SR16000 (Power7, 128-core SMP), Fujitsu M9000 (SPARC64 VII, 256-core SMP)
- Solar-powered smartphones, cameras, robots
- Cool desktop servers; fanless, cool, quiet servers; green cloud servers; stock trading
- On-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination)
- Consumer electronics (Internet TV/DVD)
- Operation and recharging by solar cells
- OSCAR many-core chip; green supercomputers
Waseda University: R&D of many-core system technologies with ultra-low power consumption, including super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading) and, with the National Institute of Radiological Sciences, medical applications (heavy-particle radiation treatment planning, cerebral infarction). Goals: protect lives and protect the environment, for smart life.
Industry: compilers and the API for camcorders, intelligent home appliances, supercomputers and servers, capsule endoscopic cameras, and medical servers.
To improve effective performance, cost-performance, and software productivity, and to reduce power, the OSCAR Parallelizing Compiler provides:
- Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism
- Data localization: automatic data management for distributed shared memory, cache, and local memory
- Data transfer overlapping: data transfer overlapping using data transfer controllers (DMACs)
- Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support
[Figure: a macro task graph of 33 coarse-grain tasks partitioned into data localization groups dlg0, dlg1, dlg2, and dlg3]
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)
[Figure: a macro flow graph of macro-tasks 1-15 (BPA and RB nodes) ending at an END node. Legend: solid edges show data dependency; dashed edges show control flow; small circles show conditional branches. BPA: Block of Pseudo Assignment Statements; RB: Repetition Block.]
[Figure: the corresponding macro task graph for tasks 1-15. Legend: solid edges show data dependency; dashed edges show extended control dependency; small circles show conditional branches; arcs across incoming edges show AND/OR conditions. The figure contrasts the original control flow, the macro flow graph, and the macro task graph.]
Example: MTG of Su2cor LOOPS-DO400. [Figure legend: DOALL, sequential loop, BB, SB.] Coarse-grain parallelism PARA_ALD = 4.3.
Data Localization: Loop-Aligned Decomposition
- Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependence.
- Most data in an LR can be passed through local memory (LM).
- LR: Localizable Region; CAR: Commonly Accessed Region

Example loops:

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition (columns: LR / CAR / LR / CAR / LR):
  RB1: I=1,33 | I=34,35 | I=36,66 | I=67,68 | I=69,101
  RB2: I=1,33 | I=34,34 | I=35,66 | I=67,67 | I=68,100
  RB3: I=2,34 |           I=35,67 |           I=68,100
Multicore Program Development Using OSCAR API V2.0

A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) is fed to the Waseda OSCAR parallelizing compiler, which performs coarse-grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating. The compiler outputs a parallelized Fortran or C program annotated with OSCAR API directives (one code-with-directives per thread, e.g., Proc0/Thread 0 and Proc1/Thread 1). Manual parallelization and power reduction with the same directives is also possible.

The OSCAR API, for homogeneous and/or heterogeneous multicores and manycores, provides directives for thread generation, memory placement, data transfer using DMA, and power management. Parallel machine codes are then generated with existing sequential compilers:
- Low-power homogeneous multicore code generation: the API analyzer (available from Waseda) plus an existing sequential compiler produce code executable on various multicores.
- Low-power heterogeneous multicore code generation: the API analyzer plus existing sequential and accelerator compilers; the accelerator compiler or the user adds "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks, yielding accelerator 1 and accelerator 2 code.
- Server code generation: an OpenMP compiler for shared-memory servers.

Targets include homogeneous multicores from Vendor A (SMP servers) and heterogeneous multicores from Vendor B. Partners: Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, and 3 universities.

OSCAR: Optimally Scheduled Advanced Multiprocessor. API: Application Program Interface.
Kasahara-Kimura Lab., Waseda University
#pragma omp parallel sections
{
#pragma omp section
    main_vpc0();
#pragma omp section
    main_vpc1();
#pragma omp section
    main_vpc2();
#pragma omp section
    main_vpc3();
}
Thread Execution Model: main_vpc0(), main_vpc1(), main_vpc2(), and main_vpc3() execute in parallel; a thread and a core are bound one-to-one. VPC: Virtual Processor Core.
Memory Mapping
- Placing variables on an on-chip centralized shared memory (on-chip CSM):
  #pragma oscar onchipshared (C), !$oscar onchipshared (Fortran)
- Placing variables on a local data memory (LDM):
  #pragma omp threadprivate (C), !$omp threadprivate (Fortran); this directive is used as an extension to OpenMP
- Placing variables on a distributed shared memory (DSM):
  #pragma oscar distributedshared (C), !$oscar distributedshared (Fortran)
Data Transfer
- Specifying data transfer lists:
  #pragma oscar dma_transfer (C), !$oscar dma_transfer (Fortran); contains the following parameter directives
- Specifying a contiguous data transfer:
  #pragma oscar dma_contiguous_parameter (C), !$oscar dma_contiguous_parameter (Fortran)
- Specifying a stride data transfer:
  #pragma oscar dma_stride_parameter, !$oscar dma_stride_parameter; this can be used for scatter/gather data transfer
- Data transfer synchronization:
  #pragma oscar dma_flag_check, !$oscar dma_flag_check

Hierarchical Barrier Synchronization
- Specifying a hierarchical group barrier:
  #pragma oscar group_barrier (C), !$oscar group_barrier (Fortran)
Hierarchical processor grouping example with 8 cores: 1st layer, all 8 cores; 2nd layer, PG0 (4 cores) and PG1 (4 cores); 3rd layer, PG0-0, PG0-1, PG0-2, PG0-3 (1 core each) and PG1-0 (2 cores), PG1-1 (2 cores).
Representative speedups:
- 8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000)
- 55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000) for the National Institute of Radiological Sciences (NIRS) cancer-treatment carbon-ion radiotherapy code (the previous best was a 2.5 times speedup on 16 processors with hand optimization)
- 110 times speedup against sequential processing for the GMS earthquake wave propagation simulation on a Hitachi SR16000 (Power7-based 128-core Linux SMP)
9.6 Times Speedup on a 12-core Power8 against Sequential Processing for the GMS Earthquake Wave Propagation Simulation
- Evaluation using medium-size input data (12 GB)
- Speedups against 1 PE on the S812L (POWER8): 1.0 (1 PE), 4.9 (6 PE), 9.6 (12 PE)
- Platform: IBM S812L; CPU: POWER8, 12 cores, 3.026 GHz clock frequency; memory: 60 GB; OS: Red Hat Linux 7.1; backend Fortran compiler: IBM XLF 15.1.1
Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) based 32-core SMP Server
OpenMP codes generated by the OSCAR compiler accelerate IBM XL Fortran for AIX Ver. 12.1 about 3.3 times on average. [Chart: speedup vs. number of processors.]
Compile options: (*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto. (*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto. (Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto.
Performance of the OSCAR Compiler for Fortran Programs on an IBM p550q 8-core Deskside Server: 2.7 times speedup against the loop-parallelizing compiler on 8 cores. [Bar chart: speedup ratio (0-9) for loop parallelization vs. multigrain parallelization on tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, and wave5 (SPEC95) and swim, mgrid, applu, and apsi (SPEC2000).]
Performance for C Programs on an IBM p5 550Q: 5.8 times speedup against one processor on average. [Bar chart: speedup ratio on 1, 2, 4, and 8 cores; 2-core speedups around 1.90-2.41, 4-core around 3.09-4.10, 8-core around 4.96-7.05.]
Performance of the OSCAR Compiler on an IBM pSeries 690 (Power4) Regatta-H 16-processor High-end Server (backend: IBM XL Fortran for AIX Version 8.1)
[Bar chart: speedup ratio (0.0-12.0) of XL Fortran (max) vs. APC (max) for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, wupwise, swim2k, mgrid2k, applu2k, sixtrack, and apsi2k, with execution times in seconds annotated on each bar.] 3.5 times speedup on average.
Parallel Processing of Face Detection on a Manycore, a High-end Server, and a PC Server
The OSCAR compiler gives an 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.
Automatic parallelization of face detection, speedup ratio vs. number of cores (1, 2, 4, 8, 16):
- TILEPro64 (gcc): 1.00, 1.72, 3.01, 5.74, 9.30
- SR16K (Power7, 8 cores x 4 CPUs x 4 nodes, xlc): 1.00, 1.93, 3.57, 6.46, 11.55
- rs440 (Intel Xeon, 8 cores x 4 CPUs, icc): 1.00, 1.93, 3.67, 6.46, 10.92
Speedup with 2 Cores for an Engine Crankshaft Handwritten Program on the Renesas RP-X Multicore Processor
The macrotask graph contains many conditional branches; since the grain is too fine (microseconds) for dynamic scheduling, branches are fused into macrotasks for static scheduling. [Figures: macrotask graph with many conditional branches; macrotask graph after task fusion.]
Result: 1.60 times speedup with 2 cores against 1 core.
OSCAR Compile Flow for Simulink Applications
A Simulink model is first turned into C code using Embedded Coder. The OSCAR compiler then (1) generates the MTG, exposing the parallelism; (2) generates a Gantt chart, showing the schedule on a multicore; and (3) generates parallelized C code using the OSCAR API for multiplatform execution (Intel, ARM, SH, etc.).
Speedups of MATLAB/Simulink image processing were measured on various 4-core multicores (Intel Xeon, ARM Cortex-A15, and Renesas SH4A).
Benchmarks: Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples; Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706‐buoy‐detection‐using‐simulink; Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114‐fast‐edges‐of‐a‐color‐image‐‐actual‐color‐‐not‐converting‐to‐grayscale‐/; Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990‐retinal‐blood‐vessel‐extraction/
OSCAR Heterogeneous Multicore
- DTU: Data Transfer Unit
- LPM: Local Program Memory
- LDM: Local Data Memory
- DSM: Distributed Shared Memory
- CSM: Centralized Shared Memory
- FVR: Frequency/Voltage Control Register
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control [Gantt chart over time]
33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a Hand-tuned Library)
Speedups against a single SH processor: 1 (1SH, 3.4 fps), 2.29 (2SH), 3.09 (4SH), 5.4 (8SH), 18.85 (2SH+1FE), 26.71 (4SH+2FE), 32.65 (8SH+4FE, 111 fps).
Power Reduction in Real-time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a Hand-tuned Library)
One cycle: 33 ms, i.e., 30 fps. Without power reduction: average 1.76 W. With power reduction by the OSCAR compiler: average 0.54 W, a 70% power reduction.
Low-Power Optimization with the OSCAR API
Scheduled result by the OSCAR compiler: VC0 executes MT1 and then MT3; VC1 executes MT2, sleeps while waiting, and then executes MT4. Generated code image by the OSCAR compiler:

void main_VC0() {
  MT1
  MT3
}

void main_VC1() {
  MT2
  #pragma oscar fvcontrol \
    ((OSCAR_CPU(),0))        /* sleep */
  #pragma oscar fvcontrol \
    (1,(OSCAR_CPU(),100))    /* resume at full speed */
  MT4
}
Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler
MPEG2 decoding with 8 CPU cores: without power control (voltage 1.4 V), average power 5.73 W; with power control (frequency control and resume standby, i.e., power shutdown and voltage lowering from 1.4 V to 1.0 V), average power 1.52 W, a 73.5% power reduction. [Power waveforms for cores 0-7 in both modes.]
Low Power High Performance Multicore Computer with Solar Panel: clean-energy autonomous servers operational in deserts.
Power on a 4-core ARM Cortex-A9 with Android (http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ): H.264 decoder and optical flow using 3 cores on an ODROID-X2 (Samsung Exynos 4412 Prime, ARM Cortex-A9 quad core, 1.7 GHz to 0.2 GHz, the SoC used in Samsung's Galaxy S3).
Average power consumption [W], without power control vs. with power control:
- H.264: 1 core 1.07 vs. 0.79; 2 cores 1.69 vs. 0.57; 3 cores 2.45 vs. 0.51 (-79.2%, 1/5)
- Optical flow: 1 core 0.95 vs. 0.72; 2 cores 1.50 vs. 0.36; 3 cores 2.23 vs. 0.30 (-86.5%, 1/7)
On the same 3 cores, the power control reduced the power to 1/5-1/7 against no power control. It also reduced the power to 1/2-1/3 (H.264 -52.3%, optical flow -68.4%) compared with ordinary sequential execution on 1 core without power control.
Power Reduction on Intel Haswell 3 Cores (H81M-A, Intel Core i7 4770K quad core, 3.5 GHz to 0.8 GHz), H.264 decoder and optical flow.
Average power consumption [W], without power control vs. with power control:
- H.264: 1 core 29.67 vs. 17.37; 2 cores 37.11 vs. 16.15; 3 cores 41.81 vs. 12.50 (-70.1%, 1/3)
- Optical flow: 1 core 29.29 vs. 24.17; 2 cores 36.59 vs. 12.21; 3 cores 41.58 vs. 9.60 (-76.9%, 1/4)
The power consumption was reduced to 1/3-1/4 by OSCAR power control against no power control on the same 3 processor cores, and to 2/5-1/3 (H.264 -57.9%, optical flow -67.2%) by the compiler power control against 1-core execution without power control.
Automatic Power Reduction by the OSCAR Compiler on an Intel Haswell 4-Core Multicore
Power consumption for real-time face detection was reduced to 2/5.
Parallel processing of the face detection program on Intel Haswell 4 cores (CPU: Intel Core i7 4770K, 4 cores, 3.5 GHz to 0.8 GHz; motherboard: ASUS H81M-A): with automatic power reduction applied, average power consumption was reduced to 3/5 (-43.68%) in one configuration and to 2/5 (-60.37%) in another, with a 2.44 times speedup against sequential execution at the fastest execution mode. [Charts: average power consumption [W] and speedup ratio vs. number of cores, without and with power control.]
Power was measured on the Intel Haswell board by inserting a power measurement circuit between the PMIC and the CPU.
Parallelization flow of the OpenCV face detection program: camera input; a loop searching over changing window sizes; a searching loop along the x and y directions; face detection processing (this part is automatically parallelized by the OSCAR compiler); drawing to the display; then the next frame.
Target: solar-powered operation with compiler power reduction; fully automatic parallelization and vectorization, including local memory management and data transfer.
OSCAR Vector Multicore with Power Processor Core and Compiler, from Embedded Systems to Servers, with OSCAR Technology
[Architecture: a multicore chip (x4 chips) whose cores each contain a CPU (Power), a vector unit, a data transfer unit, a local memory, a distributed shared memory, and a power control unit; a compiler co-designed connection network links the cores to on-chip shared memory, and a compiler co-designed interconnection network links the chips to centralized shared memory.]
Summary
- The OSCAR automatic parallelizing and power reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications (including medical applications and natural disaster simulation), real-time applications (such as automobile engine control and MATLAB/Simulink), and media applications (codecs and face detection) on various homogeneous and heterogeneous multicores.
- Automatic parallelization on Power architectures: the OSCAR compiler has been developed on Power architectures such as Power 4, 5, 5+, 6, 7, and 8. For the earthquake wave propagation simulation, it achieved a 110 times speedup on 128 cores of IBM Power7 and a 9.6 times speedup on 12 cores of IBM Power8 against 1 core; for cancer treatment using carbon ions, a 55 times speedup on 64 cores of IBM Power7 against 1 core.
- In automatic power reduction, consumed power was reduced to 1/2 or 1/3 using 3 cores on ARM Cortex-A9 and Intel Haswell.
- We plan to realize automatic power reduction using an on-chip regulator on Power processors with accelerators.