OSCAR Compiler Controlled Multicore Power Reduction on Android Platform

Hideo Yamamoto¹, Tomohiro Hirano¹, Kohei Muto¹, Hiroki Mikami¹, Takashi Goto¹, Dominic Hillenbrand¹, Moriyuki Takamura², Keiji Kimura¹ and Hironori Kasahara¹

¹Green Computing Systems Research and Development Center, Waseda University
²FUJITSU LABORATORIES LTD.

LCPC2013

Oscar compiler for power reduction


DESCRIPTION

ARM quad power reduction OSCAR android LCPC mobile



Page 2: Oscar compiler for power reduction

Presentation Outline

•  Background
   –  Power consumption in multicore
   –  Power control mechanism of the OSCAR Compiler
   –  Power control on the Android™ platform
•  Experimental
   –  Evaluation target, power rail and measurement device
   –  Precise power measurement method using GPIO
   –  Bind mode
   –  Clock gating method using the WFI instruction
•  Highlight event in data
   –  Power consumption of MPEG2 decoder
•  Conclusion

Page 3: Oscar compiler for power reduction

BACKGROUND    


Page 4: Oscar compiler for power reduction

A Plethora of Smart Devices

[Figure: timeline of smart-device processor configurations trending toward high-performance devices; time axis labels: 2007, 2011, 2013, 2014]
•  Linux: ARM11 / Cortex-A8
•  Linux, 2-core SMP: 2 × Cortex-A9
•  Linux, 4-core SMP: 4 × Cortex-A9
•  Linux, 8-core HMP: 4 × Cortex-A15 + 4 × Cortex-A7
•  Linux, 8-core big.LITTLE: 4 × Cortex-A15 + 4 × Cortex-A7
•  Linux, 5-core 4+1 vSMP: 5 × Cortex-A9

Cumulative smart device shipments: iOS 700,000,000; Android 1,000,000,000

Page 5: Oscar compiler for power reduction

Power Consumption in Multicore

•  Uni core:   P = f * c * v^2        ... Eq. 1
•  Multicore:  P = n * f * c * v^2    ... Eq. 2

In the quad-core case, 'f' can be reduced to 1/4 while keeping the same performance. If 'v' can then be lowered to 0.6 of its original value at 1/4 'f', power consumption is reduced to 0.36 of the single-core figure (4 × 1/4 × 0.6^2 = 0.36).
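The quad-core arithmetic on this slide can be checked with a small helper. This is only a sketch: it assumes that the 0.6 on the slide is a ratio to the baseline voltage, that f is the clock frequency, c the switched capacitance, v the supply voltage and n the number of cores, and the function names are illustrative rather than part of any OSCAR API.

```c
#include <assert.h>
#include <stdio.h>

/* Relative dynamic power of an n-core configuration: Eq. 2
 * (P = n * f * c * v^2) divided by Eq. 1 at the baseline operating point.
 * f_scale and v_scale are ratios to the single-core frequency and voltage;
 * the capacitance c cancels out. */
double relative_power(int n_cores, double f_scale, double v_scale)
{
    assert(n_cores > 0 && f_scale > 0.0 && v_scale > 0.0);
    return n_cores * f_scale * v_scale * v_scale;
}

/* Prints the quad-core case from the slide: 4 cores, f/4, 0.6 v. */
void print_quad_core_example(void)
{
    printf("relative power: %.2f\n", relative_power(4, 0.25, 0.6));
}
```

With n = 4 and f scaled to 1/4, the n and f factors cancel, so the whole saving comes from the squared voltage term.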

Page 6: Oscar compiler for power reduction

OSCAR Compiler

Multigrain Parallel Processing
•  Hierarchical and global parallelization
•  Coarse-grain task parallel
•  Loop iteration parallel
•  Statement-level parallel

Data Locality Optimization
•  Task (or loop) decomposition considering cache size or local memory size
•  Task scheduling considering data affinity

Low-Power Optimization
•  Power scheduling with DVFS, clock gating and power gating by software

[Figure labels: Doall loop; Seq. loop; task-level or statement-level parallelization]

Page 7: Oscar compiler for power reduction

Power Control Mechanism of the OSCAR Compiler

•  Estimate the execution time of each MT (macro task) and find the critical path
•  Determine the execution time that satisfies the given deadline
•  Decide the optimal frequency and voltage of each MT

[Figure: a statically scheduled MTG (macro task graph) with MT1 to MT9 assigned to Core0 to Core2, shown on a time axis before and after power scheduling with DVFS, clock gating and power gating by software. Labels in the scheduled version: MT3 (low freq.), MT5 (low freq.), margin, clock gating, power gating, given deadline, time management.]
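The third bullet can be illustrated for a single task in isolation. The sketch below is not the OSCAR compiler's scheduling algorithm, which works over the whole macro task graph and its critical path. It only assumes a hypothetical operating-point table sorted by ascending frequency and picks the slowest point that still meets a deadline.

```c
#include <assert.h>
#include <stddef.h>

/* One DVFS operating point. The table contents are supplied by the caller
 * and are hypothetical; nothing here is read from real hardware. */
struct op_point {
    double freq_hz;
    double volt;
};

/* Returns the index of the slowest operating point at which a task of
 * `cycles` clock cycles still finishes within `deadline_s` seconds,
 * or -1 if even the fastest point misses the deadline.
 * Assumes `table` is sorted by ascending frequency. */
int pick_operating_point(const struct op_point *table, size_t n,
                         double cycles, double deadline_s)
{
    assert(table != NULL && deadline_s > 0.0);
    for (size_t i = 0; i < n; i++) {
        double exec_time_s = cycles / table[i].freq_hz;
        if (exec_time_s <= deadline_s)
            return (int)i;
    }
    return -1;
}
```

For example, with a three-entry table of 200 MHz, 400 MHz and 1.7 GHz, a task of 3 million cycles with a 10 ms deadline would select the 400 MHz entry, since 200 MHz would take 15 ms.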

Page 8: Oscar compiler for power reduction

Power Control on Android

•  CPUFreq
   –  Frequency and voltage scaling of a target CPU
•  CPUIdle
   –  Manages the idle level of each core of the CPU
•  HotPlug (> 10 ms)
   –  Extended function of CPUFreq and CPUIdle
   –  Brings another core online to distribute the load under high utilization
   –  Shuts down excess cores under low utilization
   –  Decides which cores go online or offline using a heuristic
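For orientation, mainline Linux exposes per-core hotplug through the sysfs file /sys/devices/system/cpu/cpuN/online. The sketch below writes to that file. Whether the Android kernel used in this work relies on the same path is an assumption, and the function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the Linux sysfs path that controls whether a CPU is online.
 * Returns 0 on success, -1 if the buffer is too small. */
int cpu_online_path(int cpu, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Writes "1" (online) or "0" (offline) to the sysfs file.
 * Requires root privileges. Returns 0 on success, -1 on failure. */
int set_cpu_online(int cpu, int online)
{
    char path[64];
    if (cpu_online_path(cpu, path, sizeof path) != 0)
        return -1;
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    int ok = fputs(online ? "1" : "0", fp) >= 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A hotplug governor makes the same kind of on/off decision inside the kernel, driven by utilization heuristics rather than by an explicit request from the application.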

Page 9: Oscar compiler for power reduction

Problems of Linux Power Control and Parallel Processing

•  Hotplug cannot bring a core online and bind threads to it swiftly
   –  In the worst case it needs several hundred milliseconds (measured startup time: 440.6 ms)
•  Non real-time
   –  Linux cannot control time at a resolution finer than 5-10 ms

Page 10: Oscar compiler for power reduction

Background

•  Motivation
   –  Parallelization is effective for low-power execution with DVFS, power gating and clock gating
   –  The OSCAR compiler can generate power control API calls automatically
•  Obstacle
   –  Linux needs a long startup time to distribute load to multiple cores
   –  Lack of fine-resolution time control
•  Challenge
   –  Low-power execution on the Android platform by parallelization

Page 11: Oscar compiler for power reduction

EXPERIMENTAL  


Page 12: Oscar compiler for power reduction

Evaluation Board: ODROID-X2

•  Samsung Exynos4412 Prime
   –  ARM Cortex-A9 quad core
   –  Maximum clock frequency 1.7 GHz
   –  Used in Samsung's Galaxy S3
•  DVFS cannot be applied to each core independently
•  The Android Open Source version is in place
•  Circuit schematic is available on request

Page 13: Oscar compiler for power reduction

Power Rail for Exynos4412

•  The Exynos4412 is powered by four PMIC (Power Management IC) voltage rails
   –  VDD_ARM: CORE
   –  VDD_INT: interrupt controller and L2
   –  VDD_G3D: GPU
   –  VDD_MIF: DDR memory
•  The power consumption of VDD_ARM (CORE) has been measured

[Figure: Exynos4412 SoC block diagram. The PMIC supplies VDD_ARM to four Cortex-A9 cores (each labeled 32 KB I/D, NEON), VDD_INT to the interrupt controller + L2, VDD_G3D to the GPU, and VDD_MIF to the DDR.]

Page 14: Oscar compiler for power reduction

Modified Circuit Diagram of ODROID-X2

[Figure: modified circuit diagram with the voltage and current measurement points marked]

Voltage (V) × Current (A) = Power (W)

Page 15: Oscar compiler for power reduction

How to Measure CORE Power on ODROID-X2

•  A 40 mΩ shunt resistor is added to VDD_ARM

[Figure: shunt resistor between the PMIC and the SoC; the voltage drop across the shunt is fed to an instrumentation amp]
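The measurement chain reduces to Ohm's law followed by the power product. The sketch below uses the 40 mΩ shunt value from this slide; the instrumentation-amp gain is not given in the deck, so it is left as a caller-supplied, hypothetical parameter, and the function names are illustrative.

```c
#include <assert.h>

#define SHUNT_OHMS 0.040 /* 40 mOhm shunt on VDD_ARM, from the slide */

/* Converts the instrumentation-amp output voltage back to rail current.
 * amp_gain is hypothetical; the deck does not state the actual gain. */
double rail_current_a(double amp_out_v, double amp_gain)
{
    assert(amp_gain > 0.0);
    double drop_v = amp_out_v / amp_gain; /* voltage drop across the shunt */
    return drop_v / SHUNT_OHMS;           /* I = V / R */
}

/* Power (W) = Voltage (V) x Current (A). */
double rail_power_w(double rail_v, double amp_out_v, double amp_gain)
{
    return rail_v * rail_current_a(amp_out_v, amp_gain);
}
```

As an illustration, a 20 mV drop across the shunt corresponds to 0.5 A, which at a 1.4 V rail would be 0.7 W.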

Page 16: Oscar compiler for power reduction

Synchronization between Program and Waveforms Using GPIO

Page 17: Oscar compiler for power reduction

"bind" Mode

•  The core assignment logic of Android Linux hotplug is heuristic
•  A new core assignment mode called "bind" mode is developed for efficient parallel execution
•  "bind" mode is integrated in Android Linux as the OSCAR runtime and API
•  Specification of the OSCAR API for "bind" mode
   –  Core 0 is reserved for the Android system and non-OSCAR parallel programs
   –  An application can disable hotplug and control whether cores are online or offline
   –  An application can bind Cores 1, 2 and 3 to an OSCAR parallel program

Startup time: 7.2 ms
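The deck does not show how the OSCAR runtime implements "bind" mode. As a rough analogue only, the standard Linux call sched_setaffinity can pin a thread to a chosen core. The sketch below builds the Core 1 to 3 set described on this slide; the function names are illustrative and not part of the OSCAR API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Fills `set` with cores 1..3, leaving core 0 out as on the slide. */
void oscar_worker_cpus(cpu_set_t *set)
{
    assert(set != NULL);
    CPU_ZERO(set);
    for (int cpu = 1; cpu <= 3; cpu++)
        CPU_SET(cpu, set);
}

/* Pins the calling thread to one core using the Linux affinity call.
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_self_to_core(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

Affinity alone does not bring an offline core online, which is why the slide also lists disabling hotplug and explicit core on/off control as part of the API.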

Page 18: Oscar compiler for power reduction

Clock Gating

•  WFI instruction
   –  The WFI instruction suspends execution of the processor core and stops the clock until a timer event
•  Clock gating driver using the WFI instruction
   –  The WFI instruction is a privileged instruction
   –  The API allows a user program to execute the WFI instruction within a Linux driver

Page 19: Oscar compiler for power reduction

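Fine Timing Control by WFI Driver

The time slot is 500 us. After call_wfi_api(N), the core is clock gated and wakes up after a time T, where

    (N - 1) x 500 us < T < N x 500 us

One slot: 0 us < T < 500 us

    while (1) {
        gpio_value(1);
        call_wfi_api(1);
        gpio_value(0);
    }

Four slots: 1500 us < T < 2000 us, that is, between 1500 us (3 slots) and 2000 us (4 slots)

    while (1) {
        gpio_value(1);
        call_wfi_api(4);
        gpio_value(0);
    }

[Figure: GPIO and current waveforms for both loops, marking the clock-gating interval and the wake-up; current axis ticks at 250 mA and 500 mA]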
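The slot bound on this slide, (N - 1) x 500 us < T < N x 500 us, can be turned into two small helpers. This is a sketch with illustrative names; it models only the arithmetic and does not call the WFI driver.

```c
#include <assert.h>

#define WFI_SLOT_US 500u /* time slot length from the slide */

/* Exclusive bounds on the wake-up time T for a given slot count. */
struct wake_window {
    unsigned min_us;
    unsigned max_us;
};

/* Wake-up window for N slots: (N - 1) * 500 us < T < N * 500 us. */
struct wake_window wfi_wake_window(unsigned n_slots)
{
    assert(n_slots >= 1);
    struct wake_window w = {
        (n_slots - 1) * WFI_SLOT_US,
        n_slots * WFI_SLOT_US
    };
    return w;
}

/* Largest slot count whose upper bound does not exceed idle_us, so the
 * core is awake again before the idle period ends. Returns 0 when the
 * idle period is shorter than one slot. */
unsigned wfi_slots_for(unsigned idle_us)
{
    return idle_us / WFI_SLOT_US;
}
```

For an idle period of 1800 us the helper returns 3 slots, giving a wake-up somewhere between 1000 us and 1500 us.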

Page 20: Oscar compiler for power reduction

Current Waveform of Busy Wait without Clock Gating

[Figure: current waveform of busy wait in ordinary execution for 1, 2, 3 and 4 cores; current axis ticks at 500, 1000, 1500 and 2000 mA]

Page 21: Oscar compiler for power reduction

Current Waveform of Busy Wait with Clock Gating

[Figure: current waveform of busy wait with clock gating for 1, 2, 3 and 4 cores; current axis ticks at 500, 1000, 1500 and 2000 mA; annotations: "Wake up all cores", "Clock gating all cores"]

Page 22: Oscar compiler for power reduction

Comparison of Current Waveforms

[Figure: two current waveforms for 1, 2, 3 and 4 cores, both with axis ticks at 500, 1000, 1500 and 2000 mA: busy wait in ordinary execution, and busy wait with clock gating (annotations: "Wake up all cores", "Clock gating all cores")]

Page 23: Oscar compiler for power reduction

MPEG2  DECODER

 Highlight  data


Page 24: Oscar compiler for power reduction

Power Consumption of MPEG2 Decoder on ODROID-X2

[Figure: bar chart of power consumption against number of cores, with and without power reduction control; annotated reductions: 1/7 (13.3%) and 1/3 (38.1%)]

Page 25: Oscar compiler for power reduction

 demo


Page 26: Oscar compiler for power reduction

Power Waveform of MPEG2 Decoder for 1 Core

[Figure: power waveforms (a) without power reduction control and (b) with power reduction control, both labeled 1.7 GHz, 1.4 V. Labels: MPEG2 decode execution in high clock and voltage; busy wait execution; clock gating by WFI; reduced by WFI. Legend: consumed, reduced.]

Page 27: Oscar compiler for power reduction

Power Waveform of MPEG2 Decoder for 3 Cores

[Figure: power waveforms (a) without power reduction control and (b) with power reduction control. Labels: MPEG2 decode execution in high clock and voltage; MPEG2 decode execution in low clock and voltage; busy wait execution; clock gating by WFI; reduced by WFI; DVFS, P = n*f*c*V^2. Operating-point labels: 1.7 GHz, 1.4 V; 400 MHz, 1.05 V; 200 MHz, 0.92 V. Legend: consumed, reduced.]

Page 28: Oscar compiler for power reduction

Power Consumption of MPEG2 Decoder on ODROID-X2

[Figure: bar chart of consumed and reduced power against number of cores. Data labels: 2.79, 0.97, 0.63, 0.37; reductions attributed to WFI and DVFS; annotation: 1/3 (38.1%)]
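The percentages quoted in the deck can be reproduced from the data labels. The Conclusions pair 0.37 W (3 cores) with 0.97 W (1 core), which gives 38.1%. Pairing the 13.3% figure with the 2.79 label is an inference from the arithmetic, not something the deck states. The helper name below is illustrative.

```c
#include <assert.h>

/* Expresses reduced_w as a percentage of baseline_w. */
double percent_of(double reduced_w, double baseline_w)
{
    assert(baseline_w > 0.0);
    return 100.0 * reduced_w / baseline_w;
}
```

Here percent_of(0.37, 0.97) evaluates to about 38.1 and percent_of(0.37, 2.79) to about 13.3, matching the "1/3" and "1/7" annotations.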

Page 29: Oscar compiler for power reduction

Conclusions

•  The ODROID-X2 circuit is modified such that
   1.  Precise power waveforms at the output of the PMIC are observed, and
   2.  The power waveforms and parallel program events are correlated in timing for OSCAR compiler optimization.
•  An efficient parallel program execution platform on Android is established by
   1.  "bind" mode, and
   2.  The WFI instruction, used by the OSCAR compiler.
•  The newly developed OSCAR compiler power control mechanism has decreased the power to roughly one third, from 0.97 W on 1 core to 0.37 W on 3 cores, when running the MPEG2 decoder on the Android platform.

Page 30: Oscar compiler for power reduction

BACKUP  SLIDE  


Page 31: Oscar compiler for power reduction

OPTICAL  FLOW

Highlight  data


Page 32: Oscar compiler for power reduction

Power Consumption of Optical Flow on ODROID-X2

[Figure: bar chart of power consumption; annotated values: 13.4% and 31.5%]

Page 33: Oscar compiler for power reduction

Power Waveform of Optical Flow for 1 Core

[Figure: power waveform. Labels: Optical Flow execution; busy wait execution; clock gating by WFI; reduced power of wasted CPU cycles]

Page 34: Oscar compiler for power reduction

Power Waveform of Optical Flow for 3 Cores

[Figure: power waveform. Labels: Optical Flow execution in high clock and voltage; Optical Flow execution in low clock and voltage; busy wait execution; clock gating by WFI; P = n*f*c*V^2]

Page 35: Oscar compiler for power reduction

Low-Power Code with OSCAR API

Sequential programs (C/Fortran)
  → OSCAR Compiler
      •  Multigrain parallelization
      •  Memory optimization
      •  Data transfer optimization
      •  DVFS, clock gating
  → Low-power parallel C/Fortran programs including OSCAR API
  → Backend compiler: API decoder (translates the OSCAR API into library calls), then native compiler
  → Executable object

OSCAR API directives shown on the slide:

    #pragma oscar fvcontrol(pe, (id, state))
    #pragma oscar get_fvstatus(pe, id, state)
    #pragma oscar get_current_time(current, timer_no)

[Figure: scheduled tasks per processor. Proc0: T1, off; Proc1: T2, T4; Proc2: T3, T6 (slow)]

Page 36: Oscar compiler for power reduction

ODROID Original

[Figure: original schematic and board layout of the VDD_ARM supply. Labels: PMIC, L, C, GND, VDD_ARM]

Page 37: Oscar compiler for power reduction

ODROID After Rework

[Figure: reworked VDD_ARM supply with the shunt resistor inserted. Labels: PMIC, L, R, C, GND, VDD_ARM, single 5-pin connector, drop voltage, voltage]

Page 38: Oscar compiler for power reduction

How Hotplug Works

[Figure: timing diagram of hotplug state per core (rows labeled L, 0, 1, 2), showing up, down and idle transitions separated by up2g0_delay, up2gn_delay and down_delay. Legend: up, down, idle, disable]

Page 39: Oscar compiler for power reduction

Auto Hotplug Governor

tegra_cpu_set_speed_cap (excerpt; lines omitted on the slide are marked /* ... */)

    int tegra_cpu_set_speed_cap(unsigned int *speed_cap)
    {
        /* ... */
        unsigned int new_speed = tegra_cpu_highest_speed();
        /* ... */
        new_speed = tegra_throttle_governor_speed(new_speed);
        new_speed = edp_governor_speed(new_speed);
        new_speed = user_cap_speed(new_speed);
        /* ... */
        ret = tegra_update_cpu_speed(new_speed);
        /* ... */
        tegra_auto_hotplug_governor(new_speed, false);
        /* ... */
    }

tegra_auto_hotplug_governor

Parameter     LP mode          GP mode
up_delay      up2g0_delay      up2gn_delay
down_delay    down_delay       down_delay
top_freq      idle_top_freq    idle_bottom_freq
bottom_freq   0                idle_bottom_freq

Current state   Requested freq compared with   New state   Delay to take effect
IDLE            >  top_freq                    UP          up_delay
IDLE            <= bottom_freq                 DOWN        down_delay
DOWN            >  top_freq                    UP          up_delay
DOWN            >  bottom_freq                 IDLE        NA
UP              <  bottom_freq                 DOWN        down_delay
UP              <= top_freq                    IDLE        NA
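The transition table on this slide maps directly onto a small state function. The sketch below follows the table rows as printed and leaves out the up_delay and down_delay timers. It is an illustration with hypothetical names, not the Tegra kernel source.

```c
#include <assert.h>

/* Hotplug governor states from the table. */
enum hp_state { HP_IDLE, HP_DOWN, HP_UP };

/* Returns the next state for a requested frequency, checking the table
 * rows in order. When no row matches, the current state is kept. */
enum hp_state hp_next_state(enum hp_state cur, unsigned freq,
                            unsigned top_freq, unsigned bottom_freq)
{
    switch (cur) {
    case HP_IDLE:
        if (freq > top_freq)
            return HP_UP;
        if (freq <= bottom_freq)
            return HP_DOWN;
        break;
    case HP_DOWN:
        if (freq > top_freq)
            return HP_UP;
        if (freq > bottom_freq)
            return HP_IDLE;
        break;
    case HP_UP:
        if (freq < bottom_freq)
            return HP_DOWN;
        if (freq <= top_freq)
            return HP_IDLE;
        break;
    }
    return cur;
}
```

In the real governor each transition into UP or DOWN only takes effect after the corresponding delay, which is the source of the startup latency discussed earlier in the deck.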

[Figure labels: Throttle_table, throttle_index; update from user; thermal_cooling_device; Edp_Thermal; Auto Hot plug; Suspend; CpuFreq]