OSCAR Compiler Controlled Multicore Power Reduction on Android Platform
Hideo Yamamoto¹, Tomohiro Hirano¹, Kohei Muto¹, Hiroki Mikami¹, Takashi Goto¹, Dominic Hillenbrand¹,
Moriyuki Takamura², Keiji Kimura¹ and Hironori Kasahara¹
¹Green Computing Systems Research and Development Center, Waseda University ²FUJITSU LABORATORIES LTD.
LCPC2013
Presentation Outline
• Background
  – Power consumption in multicore
  – Power control mechanism of the OSCAR Compiler
  – Power control on the Android™ platform
• Experimental
  – Evaluation target, power rail and measurement device
  – Precise power measurement method using GPIO
  – Bind mode
  – Clock gating method using WFI instruction
• Highlight event in data
  – Power consumption of MPEG2 decoder
• Conclusion
BACKGROUND
A Plethora of Smart Devices
[Figure: processor configurations of Linux-based smart devices, timeline labels 2007, 2011, 2013, 2014, toward high performance devices]
• Linux: ARM11 / Cortex-A8
• Linux, 2-core SMP: 2 x Cortex-A9
• Linux, 4-core SMP: 4 x Cortex-A9
• Linux, 5-core 4+1 vSMP: 5 x Cortex-A9
• Linux, 8-core big.LITTLE: 4 x Cortex-A15 + 4 x Cortex-A7
• Linux, 8-core HMP: 4 x Cortex-A15 + 4 x Cortex-A7
Cumulative smart device shipments: iOS 700,000,000; Android 1,000,000,000
Power Consumption in Multicore
• Uni core: P = f*c*v^2 ・・・・・ Eq.1
• Multi core: P = n*f*c*v^2 ・・・・・ Eq.2
In the quad-core case, 'f' can be reduced to 1/4 while keeping the same performance. If 'v' can then be lowered to 0.6 x 'v' at 1/4 'f', power consumption is reduced to 0.36 of the uni-core value.
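The 0.36 figure follows from dividing Eq.2 by Eq.1. As a worked check, assuming the capacitance c is unchanged and the quad-core run uses n = 4 cores at f/4 and 0.6v:

```latex
\frac{P_{\mathrm{quad}}}{P_{\mathrm{uni}}}
  = \frac{4 \cdot \frac{f}{4} \cdot c \cdot (0.6\,v)^2}{f \cdot c \cdot v^2}
  = 0.6^2
  = 0.36
```

The core count and the frequency reduction cancel, so the whole saving in this idealized model comes from the lower voltage.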
OSCAR Compiler
Multigrain Parallel Processing
• Hierarchical and global parallelization
• Coarse grain task parallel
• Loop iteration parallel
• Statement level parallel
Data Locality Optimization
• Task (or loop) decomposition considering cache size or local memory size
• Task scheduling considering data affinity
Low Power Optimization
• Power scheduling with DVFS, clock gating and power gating by software
[Figure labels: Doall loop; Seq. loop; Task level or statement level parallelization]
Power Control Mechanism of the OSCAR Compiler
• Estimate the execution time of each MT and find the critical path
• Determine the execution time that satisfies the given deadline
• Decide the optimal frequency and voltage of each MT
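The last step can be illustrated with a minimal sketch. It is not the OSCAR compiler's actual algorithm: the function name `pick_frequency_mhz` is hypothetical, and it assumes execution time scales as 1/f, which ignores memory-bound behavior. Given the estimated time of an MT at the highest frequency, it picks the lowest frequency in a table that still meets the deadline.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the OSCAR API.
 * Returns the lowest frequency (MHz) from freqs_mhz that still lets a task
 * finish within deadline_ms, or 0 if no frequency in the table is fast enough.
 * Assumption: execution time scales inversely with frequency. */
unsigned pick_frequency_mhz(double time_at_max_ms, double deadline_ms,
                            const unsigned *freqs_mhz, size_t n_freqs)
{
    unsigned max_mhz = 0;
    unsigned best = 0;

    /* The estimate is given at the highest frequency in the table. */
    for (size_t i = 0; i < n_freqs; i++)
        if (freqs_mhz[i] > max_mhz)
            max_mhz = freqs_mhz[i];

    for (size_t i = 0; i < n_freqs; i++) {
        if (freqs_mhz[i] == 0)
            continue;
        double scaled_ms = time_at_max_ms * (double)max_mhz / (double)freqs_mhz[i];
        if (scaled_ms <= deadline_ms && (best == 0 || freqs_mhz[i] < best))
            best = freqs_mhz[i];
    }
    return best;
}
```

With a table of 200, 400 and 1700 MHz, a task estimated at 10 ms at 1700 MHz and a 50 ms deadline would be assigned 400 MHz, since 200 MHz would stretch it to 85 ms.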
[Figure: static scheduled MTG of MT1 to MT9 on Core0, Core1 and Core2 along a time axis, before and after power scheduling with DVFS, clock gating and power gating by software. After scheduling, MT3 and MT5 run at low frequency. Labels: Given Dead Line, Margin, Clock gating, Power gating, Time management.]
Power Control on Android
• CPUFreq
  – Frequency and voltage scaling of a target CPU
• CPUIdle
  – Manages the idle level of each CPU core
• HotPlug (> 10ms)
  – Extended function of CPUFreq and CPUIdle
  – Adds another core to distribute the load under high utilization
  – Shuts down excess cores under low utilization
  – Decides which cores go online or offline heuristically
Problems of Linux power control and parallel processing
• Hotplug cannot bring a core online and bind threads to it swiftly
  – In the worst case it needs several hundred milliseconds
• Non real-time
  – Linux cannot control time at a resolution finer than 5-10ms
[Figure: startup time 440.6ms]
Background
• Motivation
  – Parallelization is effective for low power execution with DVFS, power gating and clock gating
  – The OSCAR compiler can generate power control API calls automatically
• Obstacle
  – Linux needs a long startup time to distribute load to multiple cores
  – Lack of fine-resolution time control
• Challenge
  – Low power execution on the Android platform by parallelization
EXPERIMENTAL
Evaluation board: ODROID-X2
• Samsung Exynos4412 Prime
  – ARM Cortex-A9 quad core
  – Maximum clock frequency 1.7GHz
  – Used by Samsung's Galaxy S3
• DVFS cannot be applied to each core independently
• The Android Open Source version is in place
• Circuit schematic is available on request
Power Rail for Exynos4412
• Exynos4412 is powered by four PMIC (Power Management IC) voltage rails
  – VDD_ARM: CORE
  – VDD_INT: interrupt controller and L2
  – VDD_G3D: GPU
  – VDD_MIF: DDR memory
• Power consumption of VDD_ARM (CORE) has been measured
[Figure: SoC Exynos4412 block diagram. The PMIC supplies VDD_ARM to four Cortex-A9 cores (each with 32KB I/D caches and NEON), VDD_INT to the interrupt controller + L2, VDD_G3D to the GPU, and VDD_MIF to the DDR.]
Modified Circuit Diagram of ODROID-X2
[Figure: modified circuit diagram with current and voltage measurement points]
Voltage (V) x Current (A) = Power (W)
How to measure CORE power on ODROID-X2
• Adding a 40 mΩ shunt resistor to VDD_ARM
[Figure: the shunt sits on the VDD_ARM rail between the PMIC and the SoC; an instrumentation amp picks up the voltage drop across it]
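The measurement chain can be written out as two small conversions: the amplified voltage drop across the shunt gives the current, and current times rail voltage gives power. This is a sketch under assumptions: the slides give the 40 mΩ shunt value but not the amplifier gain, so `amp_gain` is an assumed parameter, and both function names are hypothetical.

```c
/* Current through the shunt, recovered from the instrumentation amp output.
 * amp_gain is an assumption; the slides do not state the gain used. */
double shunt_current_a(double amp_out_v, double amp_gain, double shunt_ohm)
{
    return amp_out_v / (amp_gain * shunt_ohm);
}

/* Power on the rail: Voltage (V) x Current (A) = Power (W). */
double rail_power_w(double rail_v, double current_a)
{
    return rail_v * current_a;
}
```

For example, with a 0.040 Ω shunt and an assumed gain of 20, an amplifier output of 0.4 V corresponds to 0.5 A, which at 1.4 V on VDD_ARM is 0.7 W.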
Synchronization between program and waveforms using GPIO
“bind” mode
• The core assignment logic of Android Linux hotplug is heuristic
• A new core assignment mode called “bind” mode is developed for efficient parallel execution
• “bind” mode is integrated in Android Linux as OSCAR runtime and API
• Specification of the OSCAR API for “bind” mode
  – Core 0 is reserved for the Android system and for programs other than OSCAR parallel programs
  – An application can disable hotplug and take control of core online/offline
  – An application can bind Cores 1, 2 and 3 to an OSCAR parallel program
[Figure: startup time 7.2ms]
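The core reservation rule in this specification can be modeled as a bitmask. The sketch below is not the OSCAR API, whose signatures the slides do not show; `bind_mode_core_mask` and `core_allowed` are hypothetical names. It only encodes the rule that core 0 stays with the Android system while the remaining cores are available for binding.

```c
/* Hypothetical model of the "bind" mode core reservation.
 * Bit i set means core i may be bound to the parallel program.
 * Core 0 is never set because the slides reserve it for the Android system. */
unsigned bind_mode_core_mask(unsigned n_cores)
{
    unsigned mask = 0;
    for (unsigned core = 1; core < n_cores && core < 32; core++)
        mask |= 1u << core;
    return mask;
}

/* Returns 1 if the given core is available to the parallel program. */
int core_allowed(unsigned mask, unsigned core)
{
    return core < 32 && ((mask >> core) & 1u);
}
```

On the quad-core board this yields the mask 0xE, that is, cores 1, 2 and 3.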
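Clock gating
• WFI instruction
  – The WFI instruction suspends execution of the processor core and stops the clock until a timer event
• Clock gating driver using WFI instruction
  – The WFI instruction is a privileged instruction
  – The API allows a user program to execute the WFI instruction within a Linux driver

    while (1) {
        gpio_value(1);      /* mark the start on the GPIO trace */
        call_wfi_api(1);    /* clock gating through the WFI driver, 1 slot */
        gpio_value(0);      /* mark the wake up on the GPIO trace */
    }

[Figure: current waveform with axis ticks at 250mA and 500mA]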
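Fine timing control by WFI driver
• Time slot is 500 us
• Clock gating time T for call_wfi_api(N): (N - 1) x 500us < T < N x 500us
  – N = 1: 0us < T < 500us
  – N = 4: 1500us < T < 2000us

    while (1) {
        gpio_value(1);
        call_wfi_api(4);    /* clock gating for 4 slots */
        gpio_value(0);
    }

[Figure: GPIO and current waveforms (axis ticks at 250mA and 500mA) showing clock gating and wake up, with 1500us (3 slot) and 2000us (4 slot) marked]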
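The slot rule can be expressed as a small helper that returns the window in which the core is expected to wake up. The names `wfi_sleep_bounds` and `struct sleep_bounds` are hypothetical. The 500 us slot length is taken from the slide, and the lower bound presumably reflects that the call can arrive at any point inside the current slot.

```c
#define SLOT_US 500u  /* time slot length stated on the slide */

struct sleep_bounds {
    unsigned min_us;  /* exclusive lower bound on the clock gating time */
    unsigned max_us;  /* exclusive upper bound on the clock gating time */
};

/* Bounds on the clock gating time T for a request of n_slots slots:
 * (n_slots - 1) * 500us < T < n_slots * 500us.
 * A request of 0 slots is treated as no clock gating. */
struct sleep_bounds wfi_sleep_bounds(unsigned n_slots)
{
    struct sleep_bounds b = {0u, 0u};
    if (n_slots == 0)
        return b;
    b.min_us = (n_slots - 1u) * SLOT_US;
    b.max_us = n_slots * SLOT_US;
    return b;
}
```

For N = 1 this gives 0 to 500 us and for N = 4 it gives 1500 to 2000 us, matching the two cases on the slide.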
Current waveform of busy wait without clock gating
[Figure: current waveform (axis ticks from 500mA to 2000mA) of busy wait in ordinary execution on 1 core, 2 cores, 3 cores and 4 cores]
Current waveform of busy wait with clock gating
[Figure: current waveform (axis ticks from 500mA to 2000mA) of busy wait with clock gating on 1 core, 2 cores, 3 cores and 4 cores; annotations: wake up all cores, clock gating all cores]
Comparison of current waveforms
[Figure: the two waveforms side by side, busy wait in ordinary execution against busy wait with clock gating, each on 1 core, 2 cores, 3 cores and 4 cores (axis ticks from 500mA to 2000mA); annotations: wake up all cores, clock gating all cores]
MPEG2 DECODER
Highlight data
Power Consumption of MPEG2 Decoder on ODROID-X2
[Figure: bar chart of power consumption against number of cores, with and without power reduction control; annotations 1/7 (13.3%) and 1/3 (38.1%)]
demo
Power Waveform of MPEG2 Decoder for 1 Core
[Figure: (a) Without power reduction control, 1.7GHz, 1.4V: MPEG2 decode execution in high clock and voltage, and busy wait execution. (b) With power reduction control, 1.7GHz, 1.4V: the busy wait is replaced by clock gating by WFI, and the power is reduced by WFI. Legend: consumed, reduced.]
Power Waveform of MPEG2 Decoder for 3 Core
[Figure: (a) Without power reduction control: MPEG2 decode execution in high clock and voltage, and busy wait execution. (b) With power reduction control: MPEG2 decode execution in low clock and voltage by DVFS (P = n*f*c*V^2), clock gating by WFI, power reduced by WFI. Operating points shown: 1.7GHz, 1.4V; 400MHz, 1.05V; 200MHz, 0.92V. Legend: consumed, reduced.]
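To see how much the lower operating points can save in principle, the multicore equation P = n*f*c*V^2 can be evaluated as a ratio so that the capacitance c cancels. This is a sketch of the dynamic-power model only; it ignores static power, and the function name `relative_dynamic_power` is hypothetical. The slides do not say which core count runs at which operating point, so the example values below are per-core comparisons against 1.7GHz, 1.4V.

```c
/* Dynamic power of n cores at (f_mhz, v) relative to one core at
 * (f_ref_mhz, v_ref), from P = n*f*c*V^2 with c cancelling out. */
double relative_dynamic_power(unsigned n, double f_mhz, double v,
                              double f_ref_mhz, double v_ref)
{
    return ((double)n * f_mhz * v * v) / (f_ref_mhz * v_ref * v_ref);
}
```

Under this model one core at 400MHz, 1.05V draws about 0.13 of the reference power and one core at 200MHz, 0.92V about 0.05. Measured savings differ because the model leaves out static power and the time spent outside decode execution.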
Power Consumption of MPEG2 Decoder on ODROID-X2
[Figure: bar chart of consumed and reduced power against number of cores. Values shown: 2.79, 0.97, 0.63, 0.37. Annotations: WFI, DVFS, 1/3 (38.1%).]
Conclusions
• The ODROID-X2 circuit is modified such that
  1. precise power waveforms at the output of the PMIC are observed, and
  2. the power waveforms and parallel program events are correlated in timing for OSCAR compiler optimization.
• An efficient parallel program execution platform on Android is established by
  1. “bind” mode, and
  2. use of the WFI instruction by the OSCAR compiler.
• The newly developed OSCAR compiler power control mechanism has decreased the power to about one third, from 0.97 W with 1 core to 0.37 W with 3 cores, when running the MPEG2 decoder on the Android platform.
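The percentages quoted on the power consumption charts can be checked against the values shown there. Pairing 2.79 with the 1/7 (13.3%) annotation is an assumption based on the arithmetic, since the chart residue does not label the bars:

```latex
\frac{0.37}{0.97} \approx 0.381 \quad (38.1\%,\ \text{about } 1/3),
\qquad
\frac{0.37}{2.79} \approx 0.133 \quad (13.3\%,\ \text{about } 1/7)
```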
BACKUP SLIDE
OPTICAL FLOW
Highlight data
Power Consumption of Optical Flow on ODROID-X2
[Figure: bar chart of power consumption; annotations 13.4% and 31.5%]
Power Waveform of Optical Flow for 1 core
[Figure: Optical Flow execution and busy wait execution; clock gating by WFI reduces the power of wasted CPU cycles]
Power Waveform of Optical Flow for 3 cores
[Figure: Optical Flow execution in high clock and voltage with busy wait execution, against Optical Flow execution in low clock and voltage (P = n*f*c*V^2) with clock gating by WFI]
Low-power code with OSCAR API
• Input: sequential programs (C/Fortran)
• OSCAR Compiler
  – Multigrain parallelization
  – Memory optimization
  – Data transfer optimization
  – DVFS, clock gating
• Output: low-power parallel C/Fortran programs including OSCAR API
• Backend compiler
  – API decoder: translates OSCAR API into library calls
  – Native compiler: produces the exec. object
• OSCAR API directives shown
    #pragma oscar fvcontrol(pe, (id, state))
    #pragma oscar get_fvstatus(pe, id, state)
    #pragma oscar get_current_time(current, timer_no)
[Figure: scheduled tasks per processor. Proc0: T1, off. Proc1: T2, T4. Proc2: T3, T6 (slow).]
Schematic Layout
[Figure: VDD_ARM rail schematic. ODROID original: PMIC, L, C, GND, VDD_ARM. ODROID after rework: PMIC, L, R, C, GND, VDD_ARM, with labels "Single 5 Pin", "Drop Voltage" and "Voltage".]
How hotplug works
[Figure: timing diagram with rows labelled L, 0, 1 and 2 and states up, down, idle and disable; transitions are spaced by up2g0_delay, up2gn_delay and down_delay]
Auto hotplug governor
tegra_cpu_set_speed_cap (excerpt; lines missing from the slide are marked with /* ... */)

    int tegra_cpu_set_speed_cap(unsigned int *speed_cap)
    {
        /* ... */
        unsigned int new_speed = tegra_cpu_highest_speed();
        /* ... */
        new_speed = tegra_throttle_governor_speed(new_speed);
        new_speed = edp_governor_speed(new_speed);
        new_speed = user_cap_speed(new_speed);
        /* ... */
        ret = tegra_update_cpu_speed(new_speed);
        /* ... */
        tegra_auto_hotplug_governor(new_speed, false);
        /* ... */
    }
tegra_auto_hotplug_governor

    Parameters      LP-mode          GP-mode
    up_delay        up2g0_delay      up2gn_delay
    down_delay      down_delay       down_delay
    top_freq        idle_top_freq    idle_bottom_freq
    bottom_freq     0                idle_bottom_freq

    Current state   Compare with requested freq   New state   Delay to take effect
    IDLE            > top_freq                    UP          up_delay
    IDLE            <= bottom_freq                DOWN        down_delay
    DOWN            > top_freq                    UP          up_delay
    DOWN            > bottom_freq                 IDLE        NA
    UP              < bottom_freq                 DOWN        down_delay
    UP              <= top_freq                   IDLE        NA
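The state table can be read as a small transition function. The sketch below follows the comparisons exactly as printed on the slide and leaves out the delays; it is not the kernel source, and the enum and function names are hypothetical.

```c
enum hp_state { HP_IDLE, HP_DOWN, HP_UP };

/* Next governor state for a requested frequency, following the slide's table.
 * The up_delay / down_delay timing is not modeled. If no row matches, the
 * state is left unchanged. */
enum hp_state hp_next_state(enum hp_state cur, unsigned freq,
                            unsigned top_freq, unsigned bottom_freq)
{
    switch (cur) {
    case HP_IDLE:
        if (freq > top_freq)
            return HP_UP;
        if (freq <= bottom_freq)
            return HP_DOWN;
        break;
    case HP_DOWN:
        if (freq > top_freq)
            return HP_UP;
        if (freq > bottom_freq)
            return HP_IDLE;
        break;
    case HP_UP:
        if (freq < bottom_freq)
            return HP_DOWN;
        if (freq <= top_freq)
            return HP_IDLE;
        break;
    }
    return cur;
}
```

With top_freq = 1000 and bottom_freq = 500, a request of 1200 moves IDLE to UP, a request of 400 moves IDLE to DOWN, and a request of 700 leaves IDLE unchanged.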
[Figure labels: Throttle_table, throttle_index; update from user, thermal_cooling_device; Edp_Thermal, Auto Hot plug, Suspend, CpuFreq]