Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Multi-platform Automatic Parallelization and Power Reduction by OSCAR Compiler
Hironori KasaharaProfessor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research InstituteWaseda University, Tokyo, Japan
IEEE Computer Society Board of GovernorsIEEE Computer Society Multicore STC Chair
URL: http://www.kasahara.cs.waseda.ac.jp/
To improve effective performance, cost-performance and software productivity and reduce power
OSCAR Parallelizing Compiler
Multigrain Parallelizationcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism
Data LocalizationAutomatic data management fordistributed shared memory, cacheand local memory
Data Transfer OverlappingData transfer overlapping using DataTransfer Controllers (DMAs)
Power ReductionReduction of consumed power bycompiler control DVFS and Powergating with hardware supports.
1
23 45
6 7 8910 1112
1314 15 16
1718 19 2021 22
2324 25 26
2728 29 3031 32
33Data Localization Group
dlg0dlg3dlg1 dlg2
Low Power Heterogeneous Multicore Code
GenerationAPI
Analyzer(Available
from Waseda)
Existing sequential compiler
Multicore Program Development Using OSCAR API V2.0Sequential Application
Program in Fortran or C(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)
Low Power Homogeneous Multicore Code
GenerationAPI
AnalyzerExisting
sequential compiler
Proc0
Thread 0
Code with directives
Waseda OSCARParallelizing Compiler
Coarse grain task parallelization
Data Localization DMAC data transfer Power reduction using
DVFS, Clock/ Power gating
Proc1
Thread 1
Code with directives
Parallelized API F or C program
OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycoresDirectives for thread generation, memory,
data transfer using DMA, power managements
Generation of parallel machine
codes using sequential compilers
Exe
cuta
ble
on v
ario
us m
ultic
ores
OSCAR: Optimally Scheduled Advanced MultiprocessorAPI: Application Program Interface
HomegeneousMulticore s
from Vendor A(SMP servers)
Server Code GenerationOpenMP Compiler
Shred memory servers
HeterogeneousMulticores
from Vendor B
Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.
Accelerator 1Code
Accelerator 2Code
Hom
ogen
eous
Accelerator Compiler/ User Add “hint” directives
before a loop or a function to specify it is executable by
the accelerator with how many clocks
Het
ero
Manual parallelization / power reduction
Model Base Designed Engine Control on V850 Multicore with Denso
Though so far parallel processing of the engine control on multicore has been very difficult, Denso and Waseda succeeded 1.95 times speedup on 2core V850 multicore processor.
1 core 2 cores
Hard real-time automobile engine
control by multicore
Parallelizing Handwritten Engine Control Programs on Multi‐core processors
• Current automotive crankshaft program– Developed by TOYOTA Motor Corp– About 300,000 Lines– Difficulty of parallel processing
• Too fine granularity• Many conditional branches and small basic blocks,
but no parallelizable loops– Minimizing run‐time overhead and improvement of parallelism are necessary Current product compilers can not parallelize Current accelerators are not applicable
Automatic parallelization of a crankshaft program using multi‐grain parallelization in OSCAR Compiler Performance improvement and efficient multi‐threaded
programming development 2013/04/19 Cool Chips XVI 5
Analysis of Coarse Grain Parallelism by OSCAR Compiler
2013/04/19 Cool Chips XVI
6
1
2 3
4
5
6
7
8
910 11
12
13
14Macro-Flow GraphMacro-Task Graph
Earliest Executable Condition
1
2 3
4
5
6
7
8
9 10
11
12
13
14
Decomposes a program into coarse grain tasks, or macro tasks(MTs)
1. BB (Basic Block)2. RB (Repetition Block, or loop)3. SB (Subroutine Block, or function)
Generate MFG(Macro Flow Graph) Control flow and data dependencies
Generates MTG(Macro Task Graph) Coarse grain parallelism
: Data Dependency: Control Flow: Conditional Branch
Data Dependency
Control Flow
Conditional Branch
Coarse Grain Task Parallelizationof Hand-written Engine Control Program
2013/04/19 Cool Chips XVI 7
MTG of crankshaft programs
Loop parallelizationNo parallelizable loopsin engine control codes
Fine grain parallelizationEach BBs are very low cost
-less than 100 clock cyclesBranches prevent compilers
Coarse grain parallelizationUtilize parallelism between SBs and BBs
Static Task Scheduling
2013/04/19 Cool Chips XVI 8MFG of sample program
Dynamic task scheduling Prevent from traceability Add run-time overhead
Static task scheduling Guarantee Real-time constraints Ensure traceability Minimize run-time overhead
Cannot assign BBs having braches statically Static task scheduling can be applied if the MTG has only
data dependency The compiler cannot see if the branch
is taken or not at compile time. Fuse tasks by hiding conditional branches in MFG
to avoid dynamic task scheduling Macro Task Fusion
Analysis of A Crankshaft Program Using Macro Task Fusion
2013/04/19 Cool Chips XVI 9
There is not enough parallelism
Can not schedule MTs at compile time
MTG of crankshaft program before macro task fusion MTG of crankshaft program after macro task fusion
sb4 and block5 account for over 90% of whole execution time.
MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements
2013/04/19 Cool Chips XVI 10
Successfully increased coarse grain parallelism
Critical Path(CP)
CP accounts for over 99% of whole
execution time.
MTG of crankshaft program before restructuring
Critical Path(CP)
CP accounts for about 60% of whole
execution time.
MTG of crankshaft program after restructuring
Succeed to reduce CP 99% -> 60%
Evaluation Environment :Embedded Multi-core Processor RPX
• SH-4A 648MHz * 8– As a first step, we use just two SH-4A cores because target dual-core
processors are currently under design for next-generation automobiles
2013/04/19 Cool Chips XVI 11
.t-
Evaluation of Crankshaft Program with Multi-core Processors
• Attain 1.54 times speedup on RPX– There are no loops, but only many conditional branches and small basic blocks and
difficult to parallelize this program
• This result shows possibility of multi-core processor for engine control programs
2013/04/19 Cool Chips XVI 12
1.00
1.54 0.57
0.37
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.000.200.400.600.801.001.201.401.601.80
1 core 2 core
executiontim
e(us)speedu
pratio
Performance of OSCAR Compileron Intel Core i7 Notebook PC
• OSCAR Compiler accelerate Intel Compiler about 2.0 times on average
1.00 1.18
2.70
1.00 1.00
1.70
2.24
4.12
2.91
2.53
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
4.50
SPEC95 su2cor SPEC95hydro2d
SPEC95 mgrid SPEC95 turb3d AAC Encoder
Speesup Ra
tio
Intel Compiler Ver.14.0OSCAR
CPU: Intel Core i7 3720QM (Quad‐core)MEM: 32GB DDR3‐SODIMM PC3‐12800OS: Ubuntu 12.04 LTS
Parallel Processing of JPEG XR Encoder on TILEPro64
Multimedia Applications:Sequential C Source Code
Parallelized C Program with OSCAR API
OSCAR Compiler
Parallelized Executable Binary for TILEPro64
API Analyzer +Sequential Compiler
Cache Allocation Setting
1x
28x
55x
0.00
10.00
20.00
30.00
40.00
50.00
60.00
1 64Speedu
p
# Cores
Speedup (JPEG XR Encoder)
Default Cache AllocationOur Cache Allocation
(1)OSCAR Parallelization
(2)Cache Allocation Setting
Local cache optimization:Parallel Data Structure (tile) on heapallocating to local cache
55x speedup on 64 cores AAC EncoderJPEG XR Encoder Optical Flow Calc.
0,01,02,03,04,05,06,07,0
0,11,12,13,14,15,16,17,1
0,21,22,23,24,25,26,27,2
0,31,32,33,34,35,36,37,3
0,41,42,43,44,45,46,47,4
0,51,52,53,54,55,56,57,5
0,61,62,63,64,65,66,67,6
0,71,72,73,74,75,76,77,7
I/O
I/OI/O
Memory Controller 0
Memory Controller 1
Memory Controller 2
Memory Controller 3
Ds
t0X4)
rt 1
al)
nal)
X4)
{
Parallel Processing of Face Detection on Manycore, Highendand PC Server
15
• OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 highendserver.
1.00 1.72
3.01
5.74
9.30
1.00 1.93
3.57
6.46
11.55
1.00 1.93
3.67
6.46
10.92
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
1 2 4 8 16
速度
向上
率
コア数
速度向上率tilepro64 gcc
SR16k(Power7 8core*4cpu*4node) xlc
rs440(Intel Xeon 8core*4cpu) icc
Automatic Parallelization of Face Detection
92 Times Speedup against the Sequential Processing for GMS Earthquake Wave
Propagation Simulation on Hitachi SR16000(Power7 Based 128 Core Linux SMP)
0
10
20
30
40
50
60
70
80
90
100
1pe 2pe 4pe 8pe 16pe 32pe 64pe 128pe
Speedup
agai
nst s
eque
ntia
l pro
cess
ing
oscar
Profile-Based Automatic Parallelization and Sequential Program Tuning for Android 2D Rendering on Nexus7
WASEDA UNIVERSITY
OSCAR Compiler
Profile-Based Automatic Parallelization
OSCAR Parallelization Compiler
AnalyzeResult
Profile ResultSequentialBinary File
Hotspot Analyzer
Original Source files
ARMCompil
er(GCC)
ParalleizedBinary File
Parallelized Source Fileswith OSCAR API
ARMCompile
r(GCC)
A B
Rewite toParallelizable C
OSCAR API
Analyzer Parallelized Source Files
Hotspot Source File
Green Computing Systems Research and Development Center Waseda University
Path Generation
RasterizationShading
BitBlit
paths
mask
destination Image
source image
modifieddestination Image
Transform figure to some paths
Transform path to Bitmap(Mask)
Calculate and transfer display image from source image, mask and destination image.
Make a color data from a Command
Standard library which performs 2D rendering on Android
Skia 2D Rendering Pipeline
Android 2D Rendering “Skia” Execution flow is different on each benchmarHigh load on the BitBlit process.
SkRGB16_Blitter
::blitH
81.84%
memset_128_loo
p
2.70%
sk_fill_path
2.21%
memset32_loop1
28
1.42%
SkString::~SkStri
ng()
0.61%
Others
11.22%
DrawArc
others
2.57%
SkRGB16_Blitter::blitRect97.43%
DrawRect
A B
void SkRGB16_Blitter_blitRect_oscar(int width, int height, uint16_t* device, unsigned deviceRB, SkPMColor src32) {
int i;uint16_t* deviceTMP;
for (i = height; i > 0; i--){deviceTMP = (uint16_t*)((char*)device + (deviceRB * (height - i)));blend32_16_row(src32, deviceTMP, width);
}}
void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {SkASSERT(x + width
0
15
30
45
60
通常の1コア実⾏ 並列化3コア実⾏
DrawImage : FPS
Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7
0
15
30
45
60
通常の1コア実⾏ 並列化3コア実⾏
DrawRect :FPS
22.82
43.57
27.16
×1.91 ×1.95
for DrawRect 1.91 speedupfor DrawImage 1.95 speedup
On Nexus7, 3 core parallelization gave us
52.88
18
1 Core 3 cores 1 Core 3 cores
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
Low-Power Optimization with OSCAR API
MT1
VC0
MT2
MT4MT3
Sleep
VC1
Scheduled Resultby OSCAR Compiler void
main_VC0() {
MT1
voidmain_VC1() {
MT2
#pragma oscar fvcontrol ¥(1,(OSCAR_CPU(),100))
#pragma oscar fvcontrol ¥((OSCAR_CPU(),0))
Sleep
MT4MT3
} }
Generate Code Image by OSCAR Compiler
Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2
by OSCAR Parallelizing Compiler
Avg. Power5.73 [W]
Avg. Power1.52 [W]
73.5% Power Reduction
MPEG2 Decoding with 8 CPU cores
0
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
Without Power Control(Voltage:1.4V)
With Power Control (Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V)
33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)
12.29 3.09
5.4
18.85
26.71
32.65
0
5
10
15
20
25
30
35
1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE
Speedu
ps against a single SH
processor
3.4[fps]
111[fps]
Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)
Without Power Reduction With Power Reductionby OSCAR Compiler
Average:1.76[W] Average:0.54[W]
1cycle : 33[ms]→30[fps]
70% of power reduction
Automatic Power Reduction for MPEG2 Decode on Android Multicore
ODROID X2 ARM Cortex-A94cores
23
• On 3 cores, Automatic Power Reduction control successfully reduced power to 1/7 against without Power Reduction control.
• 3 cores with the compiler power reduction control reduced power to 1/3 against ordinary 1 core execution.
0.97
1.88
2.79
0.63 0.46 0.37
0.00
0.50
1.00
1.50
2.00
2.50
3.00
1 2 3
Power Con
sumption[W
]
Number of Processor Cores
電力制御なし 電力制御ありWithout Power Reduction
With Power Reduction
2/3(‐35.0%)
1/4(‐75.5%)
1/7(‐86.7%)
1/3(‐61.9%)
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
Automatic Power Reduction on 4
core Intel Haswell
• Haswell Processor– OS Ubuntu 13.10– Intel CPU Core i7 4770K
• 4 cores• L1 Cache: Load 64Bytes/cycle, Store 32Bytes/cycle• L2 Cache 64Bytes/cycle• L3 Cache 8 MB• Frequency 3.5GHz~0.8MHz
– Memory 16GB (8GB×2)
24
Power Reduction on Intel Haswellfor Real-time Optical Flow
25
Power was reduced to 1/4 by the compiler power optimization on the same 3 cores.The power with 3 core was reduced to 1/3 against 1 core.
29.29
36.5941.58
28.40
13.22 10.49
0.00
10.00
20.00
30.00
40.00
50.00
1 2 3
Average Po
wer [W
]
No. of Processor Cores
電力制御なし 電力制御あり
Power was reduced to 1/4 by compiler on 3 cores
Power was reduced to 1/3 compared with one core ordinal execution
For HD 720p(1280x720) moving pictures15fps (Deadline66.6[ms/frame])Without
Power Control
With Power Control
Power Waves for 1 Core to 3 Cores without the Compiler Power Control on Intel Haswell
for Real-time Optical Flow
2014/6/17 DEMO 26
電圧(V)電流(A)電力(W)
29.29W36.59W
41.58W
2 Cores1 Core 3 Cores
PowerPower Power
2014/6/17 DEMO 27
電圧(V)電流(A)電力(W)
28.40W13.22W
10.49W
PowerPowerPower
3 Cores2 Cores1 Core
Power Waves for 1 Core to 3 Cores with the Compiler Power Control on Intel Haswell
for Real-time Optical Flow
Power for 1 & 3Cores without Controlvs. for 3 Cores with Control on HaswellWithout Power Control With Power Control
2014/6/17 DEMO 282014/6/17
29.29W
1 Core
Power
28
41.58W
3 Cores
Power
Without Power Control
28
10.49W
Power
3 Cores
Future Multicore ProductsNext Generation Automobiles‐ Safer, more comfortable, energy efficient, environment friendly‐ Cameras, radar, car2car communication, internet information integrated brake, steering, engine, moter control
Solar powered with more than 100 times power efficient : FLOPS/W• Regional Disaster Simulators
saving lives from tornadoes, localized heavy rain, fires with earth quakes
‐From everyday recharging to less than once a week‐ Solar powered operation inemergency condition‐ Keep health
Smart phones
Cancer treatment, Drinkable inner camera• Emergency solar powered• No cooling fun, No dust ,
clean usable inside OP room
Advanced medical systems Personal / Regional Supercomputers