OSCAR Automatic Parallelization and Power Reduction Compiler for Homogeneous and Heterogeneous Multicores
Hironori Kasahara, Professor, Dept. of Computer Science & Engineering; Director, Advanced Multicore Processor Research Institute, Waseda University, Tokyo, Japan; IEEE Computer Society Multicore STC Chair; IEEE Computer Society 2015 Election President Candidate. URL: http://www.kasahara.cs.waseda.ac.jp/
<R&D Target> Hardware, software, and applications for super low-power manycore processors: more than 64 cores with natural air cooling (no fan); cool, compact, clear, and quiet; operational by solar panel.
<Industry, Government, Academia> Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, OSCAR Tech.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products: consumer electronics, automobiles, servers.
Green Computing Systems R&D Center, Waseda University, supported by METI (completed Mar. 2011), beside the subway Waseda Station near the Waseda Univ. main campus.
Target systems and applications (industry-government-academia collaboration in R&D, targeting practical applications):
- High-end SMP servers: Hitachi SR16000 (Power7, 128-core SMP), Fujitsu M9000 (SPARC64 VII, 256-core SMP)
- Solar-powered smartphones, cameras, robots
- Cool desktop servers; fanless, cool, quiet servers; green cloud servers; stock trading
- On-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination)
- Consumer electronics (Internet TV/DVD)
- Operation and recharging by solar cells
- OSCAR many-core chip; green supercomputers
Waseda University: R&D of many-core system technologies with ultra-low power consumption, including super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading) and, with the National Institute of Radiological Sciences, medical applications (heavy-particle radiation treatment planning, cerebral infarction). Goals: protect lives and protect the environment, for smart life.
Industry: compilers and the API for camcorders, intelligent home appliances, supercomputers and servers, capsule endoscopic cameras, and medical servers.
To improve effective performance, cost-performance, and software productivity, and to reduce power, the OSCAR Parallelizing Compiler provides:
- Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism
- Data localization: automatic data management for distributed shared memory, cache, and local memory
- Data transfer overlapping: data transfer overlapping using data transfer controllers (DMACs)
- Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support
[Figure: a macro task graph of 33 coarse-grain tasks partitioned into data localization groups dlg0, dlg1, dlg2, and dlg3]
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)
[Figure: a macro flow graph of macro-tasks 1-15 (BPA and RB nodes) ending at an END node. Legend: solid edges show data dependency; dashed edges show control flow; small circles show conditional branches. BPA: Block of Pseudo Assignment Statements; RB: Repetition Block.]
[Figure: the corresponding macro task graph for tasks 1-15. Legend: solid edges show data dependency; dashed edges show extended control dependency; small circles show conditional branches; arcs across incoming edges show AND/OR conditions. The figure contrasts the original control flow, the macro flow graph, and the macro task graph.]
Example: MTG of Su2cor LOOPS-DO400. [Figure legend: DOALL, sequential loop, BB, SB.] Coarse-grain parallelism PARA_ALD = 4.3.
Data Localization: Loop-Aligned Decomposition
- Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependence.
- Most data in an LR can be passed through local memory (LM).
- LR: Localizable Region; CAR: Commonly Accessed Region

Example loops:

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition (columns: LR / CAR / LR / CAR / LR):
  RB1: I=1,33 | I=34,35 | I=36,66 | I=67,68 | I=69,101
  RB2: I=1,33 | I=34,34 | I=35,66 | I=67,67 | I=68,100
  RB3: I=2,34 |           I=35,67 |           I=68,100
Multicore Program Development Using OSCAR API V2.0

A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) is fed to the Waseda OSCAR parallelizing compiler, which performs coarse-grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating. The compiler outputs a parallelized Fortran or C program annotated with OSCAR API directives (one code-with-directives per thread, e.g., Proc0/Thread 0 and Proc1/Thread 1). Manual parallelization and power reduction with the same directives is also possible.

The OSCAR API, for homogeneous and/or heterogeneous multicores and manycores, provides directives for thread generation, memory placement, data transfer using DMA, and power management. Parallel machine codes are then generated with existing sequential compilers:
- Low-power homogeneous multicore code generation: the API analyzer (available from Waseda) plus an existing sequential compiler produce code executable on various multicores.
- Low-power heterogeneous multicore code generation: the API analyzer plus existing sequential and accelerator compilers; the accelerator compiler or the user adds "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks, yielding accelerator 1 and accelerator 2 code.
- Server code generation: an OpenMP compiler for shared-memory servers.

Targets include homogeneous multicores from Vendor A (SMP servers) and heterogeneous multicores from Vendor B. Partners: Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, and 3 universities.

OSCAR: Optimally Scheduled Advanced Multiprocessor. API: Application Program Interface.
Kasahara-Kimura Lab., Waseda University
#pragma omp parallel sections
{
#pragma omp section
    main_vpc0();
#pragma omp section
    main_vpc1();
#pragma omp section
    main_vpc2();
#pragma omp section
    main_vpc3();
}
Thread Execution Model: main_vpc0(), main_vpc1(), main_vpc2(), and main_vpc3() execute in parallel; a thread and a core are bound one-to-one. VPC: Virtual Processor Core.
Memory Mapping
- Placing variables on an on-chip centralized shared memory (on-chip CSM):
  #pragma oscar onchipshared (C), !$oscar onchipshared (Fortran)
- Placing variables on a local data memory (LDM):
  #pragma omp threadprivate (C), !$omp threadprivate (Fortran); this directive is used as an extension to OpenMP
- Placing variables on a distributed shared memory (DSM):
  #pragma oscar distributedshared (C), !$oscar distributedshared (Fortran)
Data Transfer
- Specifying data transfer lists:
  #pragma oscar dma_transfer (C), !$oscar dma_transfer (Fortran); contains the following parameter directives
- Specifying a contiguous data transfer:
  #pragma oscar dma_contiguous_parameter (C), !$oscar dma_contiguous_parameter (Fortran)
- Specifying a stride data transfer:
  #pragma oscar dma_stride_parameter, !$oscar dma_stride_parameter; this can be used for scatter/gather data transfer
- Data transfer synchronization:
  #pragma oscar dma_flag_check, !$oscar dma_flag_check

Hierarchical Barrier Synchronization
- Specifying a hierarchical group barrier:
  #pragma oscar group_barrier (C), !$oscar group_barrier (Fortran)
Hierarchical processor grouping example with 8 cores: 1st layer, all 8 cores; 2nd layer, PG0 (4 cores) and PG1 (4 cores); 3rd layer, PG0-0, PG0-1, PG0-2, PG0-3 (1 core each) and PG1-0 (2 cores), PG1-1 (2 cores).
Representative speedups:
- 8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000)
- 55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000) for the National Institute of Radiological Sciences (NIRS) cancer-treatment carbon-ion radiotherapy code (the previous best was a 2.5 times speedup on 16 processors with hand optimization)
- 110 times speedup against sequential processing for the GMS earthquake wave propagation simulation on a Hitachi SR16000 (Power7-based 128-core Linux SMP)
9.6 Times Speedup on a 12-core Power8 against Sequential Processing for the GMS Earthquake Wave Propagation Simulation
- Evaluation using medium-size input data (12 GB)
- Speedups against 1 PE on the S812L (POWER8): 1.0 (1 PE), 4.9 (6 PE), 9.6 (12 PE)
- Platform: IBM S812L; CPU: POWER8, 12 cores, 3.026 GHz clock frequency; memory: 60 GB; OS: Red Hat Linux 7.1; backend Fortran compiler: IBM XLF 15.1.1
Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) based 32-core SMP Server
OpenMP codes generated by the OSCAR compiler accelerate IBM XL Fortran for AIX Ver. 12.1 about 3.3 times on average. [Chart: speedup vs. number of processors.]
Compile options: (*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto. (*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto. (Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto.
Performance of the OSCAR Compiler for Fortran Programs on an IBM p550q 8-core Deskside Server: 2.7 times speedup against the loop-parallelizing compiler on 8 cores. [Bar chart: speedup ratio (0-9) for loop parallelization vs. multigrain parallelization on tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, and wave5 (SPEC95) and swim, mgrid, applu, and apsi (SPEC2000).]
Performance for C Programs on an IBM p5 550Q: 5.8 times speedup against one processor on average. [Bar chart: speedup ratio on 1, 2, 4, and 8 cores; 2-core speedups around 1.90-2.41, 4-core around 3.09-4.10, 8-core around 4.96-7.05.]
Performance of the OSCAR Compiler on an IBM pSeries 690 (Power4) Regatta-H 16-processor High-end Server (backend: IBM XL Fortran for AIX Version 8.1)
[Bar chart: speedup ratio (0.0-12.0) of XL Fortran (max) vs. APC (max) for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, wupwise, swim2k, mgrid2k, applu2k, sixtrack, and apsi2k, with execution times in seconds annotated on each bar.] 3.5 times speedup on average.
Parallel Processing of Face Detection on a Manycore, a High-end Server, and a PC Server
The OSCAR compiler gives an 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.
Automatic parallelization of face detection, speedup ratio vs. number of cores (1, 2, 4, 8, 16):
- TILEPro64 (gcc): 1.00, 1.72, 3.01, 5.74, 9.30
- SR16K (Power7, 8 cores x 4 CPUs x 4 nodes, xlc): 1.00, 1.93, 3.57, 6.46, 11.55
- rs440 (Intel Xeon, 8 cores x 4 CPUs, icc): 1.00, 1.93, 3.67, 6.46, 10.92
Speedup with 2 Cores for an Engine Crankshaft Handwritten Program on the Renesas RP-X Multicore Processor
The macrotask graph contains many conditional branches; since the grain is too fine (microseconds) for dynamic scheduling, branches are fused into macrotasks for static scheduling. [Figures: macrotask graph with many conditional branches; macrotask graph after task fusion.]
Result: 1.60 times speedup with 2 cores against 1 core.
OSCAR Compile Flow for Simulink Applications
A Simulink model is first turned into C code using Embedded Coder. The OSCAR compiler then (1) generates the MTG, exposing the parallelism; (2) generates a Gantt chart, showing the schedule on a multicore; and (3) generates parallelized C code using the OSCAR API for multiplatform execution (Intel, ARM, SH, etc.).
Speedups of MATLAB/Simulink image processing were measured on various 4-core multicores (Intel Xeon, ARM Cortex-A15, and Renesas SH4A).
Benchmarks: Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples; Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706‐buoy‐detection‐using‐simulink; Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114‐fast‐edges‐of‐a‐color‐image‐‐actual‐color‐‐not‐converting‐to‐grayscale‐/; Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990‐retinal‐blood‐vessel‐extraction/
OSCAR Heterogeneous Multicore
- DTU: Data Transfer Unit
- LPM: Local Program Memory
- LDM: Local Data Memory
- DSM: Distributed Shared Memory
- CSM: Centralized Shared Memory
- FVR: Frequency/Voltage Control Register
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control [Gantt chart over time]
33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a Hand-tuned Library)
Speedups against a single SH processor: 1 (1SH, 3.4 fps), 2.29 (2SH), 3.09 (4SH), 5.4 (8SH), 18.85 (2SH+1FE), 26.71 (4SH+2FE), 32.65 (8SH+4FE, 111 fps).
Power Reduction in Real-time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a Hand-tuned Library)
One cycle: 33 ms, i.e., 30 fps. Without power reduction: average 1.76 W. With power reduction by the OSCAR compiler: average 0.54 W, a 70% power reduction.
Low-Power Optimization with the OSCAR API
Scheduled result by the OSCAR compiler: VC0 executes MT1 and then MT3; VC1 executes MT2, sleeps while waiting, and then executes MT4. Generated code image by the OSCAR compiler:

void main_VC0() {
  MT1
  MT3
}

void main_VC1() {
  MT2
  #pragma oscar fvcontrol \
    ((OSCAR_CPU(),0))        /* sleep */
  #pragma oscar fvcontrol \
    (1,(OSCAR_CPU(),100))    /* resume at full speed */
  MT4
}
Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler
MPEG2 decoding with 8 CPU cores: without power control (voltage 1.4 V), average power 5.73 W; with power control (frequency control and resume standby, i.e., power shutdown and voltage lowering from 1.4 V to 1.0 V), average power 1.52 W, a 73.5% power reduction. [Power waveforms for cores 0-7 in both modes.]
Low Power High Performance Multicore Computer with Solar Panel: clean-energy autonomous servers operational in deserts.
Power on a 4-core ARM Cortex-A9 with Android (http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ): H.264 decoder and optical flow using 3 cores on an ODROID-X2 (Samsung Exynos 4412 Prime, ARM Cortex-A9 quad core, 1.7 GHz to 0.2 GHz, the SoC used in Samsung's Galaxy S3).
Average power consumption [W], without power control vs. with power control:
- H.264: 1 core 1.07 vs. 0.79; 2 cores 1.69 vs. 0.57; 3 cores 2.45 vs. 0.51 (-79.2%, 1/5)
- Optical flow: 1 core 0.95 vs. 0.72; 2 cores 1.50 vs. 0.36; 3 cores 2.23 vs. 0.30 (-86.5%, 1/7)
On the same 3 cores, the power control reduced the power to 1/5-1/7 against no power control. It also reduced the power to 1/2-1/3 (H.264 -52.3%, optical flow -68.4%) compared with ordinary sequential execution on 1 core without power control.
Power Reduction on Intel Haswell 3 Cores (H81M-A, Intel Core i7 4770K quad core, 3.5 GHz to 0.8 GHz), H.264 decoder and optical flow.
Average power consumption [W], without power control vs. with power control:
- H.264: 1 core 29.67 vs. 17.37; 2 cores 37.11 vs. 16.15; 3 cores 41.81 vs. 12.50 (-70.1%, 1/3)
- Optical flow: 1 core 29.29 vs. 24.17; 2 cores 36.59 vs. 12.21; 3 cores 41.58 vs. 9.60 (-76.9%, 1/4)
The power consumption was reduced to 1/3-1/4 by OSCAR power control against no power control on the same 3 processor cores, and to 2/5-1/3 (H.264 -57.9%, optical flow -67.2%) by the compiler power control against 1-core execution without power control.
Automatic Power Reduction by the OSCAR Compiler on an Intel Haswell 4-Core Multicore
Power consumption for real-time face detection was reduced to 2/5.
Parallel processing of the face detection program on Intel Haswell 4 cores (CPU: Intel Core i7 4770K, 4 cores, 3.5 GHz to 0.8 GHz; motherboard: ASUS H81M-A): with automatic power reduction applied, average power consumption was reduced to 3/5 (-43.68%) in one configuration and to 2/5 (-60.37%) in another, with a 2.44 times speedup against sequential execution at the fastest execution mode. [Charts: average power consumption [W] and speedup ratio vs. number of cores, without and with power control.]
Power was measured on the Intel Haswell board by inserting a power measurement circuit between the PMIC and the CPU.
Parallelization flow of the OpenCV face detection program: camera input; a loop searching over changing window sizes; a searching loop along the x and y directions; face detection processing (this part is automatically parallelized by the OSCAR compiler); drawing to the display; then the next frame.
Target: solar-powered operation with compiler power reduction; fully automatic parallelization and vectorization, including local memory management and data transfer.
OSCAR Vector Multicore with Power Processor Core and Compiler, from Embedded Systems to Servers, with OSCAR Technology
[Architecture: a multicore chip (x4 chips) whose cores each contain a CPU (Power), a vector unit, a data transfer unit, a local memory, a distributed shared memory, and a power control unit; a compiler co-designed connection network links the cores to on-chip shared memory, and a compiler co-designed interconnection network links the chips to centralized shared memory.]
Summary
- The OSCAR automatic parallelizing and power reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications (including medical applications and natural disaster simulation), real-time applications (such as automobile engine control and MATLAB/Simulink), and media applications (codecs and face detection) on various homogeneous and heterogeneous multicores.
- Automatic parallelization on Power architectures: the OSCAR compiler has been developed on Power architectures such as Power 4, 5, 5+, 6, 7, and 8. For the earthquake wave propagation simulation, it achieved a 110 times speedup on 128 cores of IBM Power7 and a 9.6 times speedup on 12 cores of IBM Power8 against 1 core; for cancer treatment using carbon ions, a 55 times speedup on 64 cores of IBM Power7 against 1 core.
- In automatic power reduction, consumed power was reduced to 1/2 or 1/3 using 3 cores on ARM Cortex-A9 and Intel Haswell.
- We plan to realize automatic power reduction using an on-chip regulator on Power processors with accelerators.