Performance of Go on Multicore Systems. Huang Yipeng. 19th November 2012 FYP Presentation.


NUS Presentation 2012


Page 1: Performance of Go on Multicore Systems

Performance of Go on Multicore Systems

Huang Yipeng

19th November 2012 FYP Presentation

1

Page 2: Performance of Go on Multicore Systems

Motivation

• Multicore systems have become common

• But “dual, quad-cores are not useful all the time, they waste batteries...” - Stephen Elop, Nokia CEO

2  

Page 3: Performance of Go on Multicore Systems

Motivation

• Multicore systems have become common

• But “dual, quad-cores are not useful all the time, they waste batteries...” - Stephen Elop, Nokia CEO

• Because most programs are explicitly parallel
  – #Threads
  – #Cores

3  

Page 4: Performance of Go on Multicore Systems

Motivation: Why Go?

4  

Page 5: Performance of Go on Multicore Systems

Objective

• To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance, respectively)

5  

Page 6: Performance of Go on Multicore Systems

Related Work

• Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011)

• A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011)

6  

[Figure: Parallelism of a shared-memory program, split into Useful Work, Data Dependency, and Memory Contention]

Page 7: Performance of Go on Multicore Systems

Related Work: Differences

7

(related work → this project)

• Shared Memory Programs: Explicit Parallelism (e.g. C & OpenMP) → Implicit Parallelism (e.g. Go)

• Processor Architecture: Multicore platforms (e.g. Intel, AMD) → Emerging platforms (e.g. ARM)

• Parallelism Performance Analytical Models: High Memory Contention → Low Memory Contention

Page 8: Performance of Go on Multicore Systems

Contributions

1. Insights about the parallelism performance of Go
2. Extend our analytical parallelism model for programs with lower memory contention
3. Automate performance prediction and model validation with scripts

8  

Page 9: Performance of Go on Multicore Systems

Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion

9  

Page 10: Performance of Go on Multicore Systems

Process Methodology

10  

[Figure: Process methodology. Baseline executions of a Go program produce parallelism traces from (1) hardware counters (perf stat 3.0) and (2) the run queue (proc reader); the traces feed the analytical models, which output a parallelism prediction.]
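The slides do not show the proc reader itself; a minimal sketch of the idea, assuming it samples the kernel's `procs_running` count (the run-queue length reported in Linux's `/proc/stat`), might look like this:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseProcsRunning extracts the "procs_running" value from the
// contents of /proc/stat, i.e. the number of runnable tasks on the
// system at the moment of the read.
func parseProcsRunning(stat string) (int, error) {
	for _, line := range strings.Split(stat, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "procs_running" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, fmt.Errorf("procs_running not found")
}

func main() {
	// A real sampler would read /proc/stat periodically to build the
	// parallelism trace; here we parse a captured snapshot.
	snapshot := "cpu  2255 34 2290 22625563\nprocs_running 10\nprocs_blocked 0\n"
	n, err := parseProcsRunning(snapshot)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 10 for this snapshot
}
```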

Page 11: Performance of Go on Multicore Systems

Analytical Parallelism Model

Parallelism of a shared-memory program: m threads, n cores

[Figure: Program activity split into Useful Work, Data Dependency, and Memory Contention; number of threads m, exploited parallelism π′, contention M(n)]

11  
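The slide names the model's quantities but not its formula. As an illustrative reading only (the combination below is an assumption, not the thesis's stated equation), speedup on n cores can be sketched as exploited parallelism discounted by memory contention:

```latex
% Illustrative sketch: \pi' (exploited parallelism) and M(n) (contention)
% are the slide's symbols; how they combine here is an assumption.
S(n) \;\approx\; \frac{\min(\pi',\, n)}{1 + M(n)}, \qquad m \text{ threads},\ n \text{ cores}
```

With M(n) = 0 this reduces to linear speedup up to the exploited parallelism; a growing contention term inflates runtime and caps the gain from extra cores.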

Page 12: Performance of Go on Multicore Systems

Experimental Setup: Workloads

12  

Page 13: Performance of Go on Multicore Systems

Experimental Setup: Machine

Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux kernel 3.0

13  

Page 14: Performance of Go on Multicore Systems

Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion

14  

Page 15: Performance of Go on Multicore Systems

The Memory Contention Model

[Chart: SP (Class C); peak contention 9.7]

15

Page 16: Performance of Go on Multicore Systems

Validation of the Memory Contention Model

[Charts: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)]

Definition: Low contention problems have a contention ≤ 1.2

Observation: Low contention problems exhibit a W-like pattern not captured by the model.

Why does this occur?

16  

Page 17: Performance of Go on Multicore Systems

Modification of the Memory Contention Model

17

[Charts: Original Model: Matrix Mul vs Revised Model: Matrix Mul]

Model revalidated...
1. For Matrix Multiplication (down from 50% error to 7%)
2. For other low contention programs
3. In Go and C
4. On Intel and ARM multicores

Page 18: Performance of Go on Multicore Systems

Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion

18  

Page 19: Performance of Go on Multicore Systems

Performance Analysis: Go vs C

1. How much poorer is Go compared to C? Why?
   – Runtime, speedup vs #Cores
2. Could Go outperform C?
   – Runtime vs Problem size
   – Runtime vs #Threads
3. Predictability of actual performance
   – Modeled vs Measured
   – Contention vs #Cores
   – Prob. size vs Exp. Parallelism / Data Dep. / Contention

19  

Page 20: Performance of Go on Multicore Systems

Points of Comparison

20

[Quadrant: Unoptimized / Optimized × Compiler Optimization / Programmer Optimization]

Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24
→ Go is comparable with C
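The benchmark source is not included in the slides; a row-partitioned goroutine implementation along these general lines (function name and partitioning scheme are assumptions) would be typical:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// matMul computes C = A * B for n x n matrices stored row-major,
// splitting the rows of C across the given number of goroutines.
func matMul(a, b []float64, n, workers int) []float64 {
	c := make([]float64, n*n)
	var wg sync.WaitGroup
	chunk := (n + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > n {
			hi = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				for j := 0; j < n; j++ {
					var sum float64
					for k := 0; k < n; k++ {
						sum += a[i*n+k] * b[k*n+j]
					}
					c[i*n+j] = sum
				}
			}
		}(lo, hi)
	}
	wg.Wait()
	return c
}

func main() {
	// Go 1.0-era runtimes defaulted to one OS thread, so the 2012
	// experiments would have had to set GOMAXPROCS explicitly.
	runtime.GOMAXPROCS(runtime.NumCPU())
	a := []float64{1, 2, 3, 4} // 2x2 demo inputs
	b := []float64{5, 6, 7, 8}
	fmt.Println(matMul(a, b, 2, 2)) // [19 22 43 50]
}
```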

Page 21: Performance of Go on Multicore Systems

Points of Comparison

21

[Quadrant: Unoptimized / Optimized × Compiler Optimization / Programmer Optimization]

Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24
→ Go is comparable with C

Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
→ Go is marginally worse than C

Page 22: Performance of Go on Multicore Systems

Points of Comparison

22

[Quadrant: Unoptimized / Optimized × Compiler Optimization / Programmer Optimization]

Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24
→ Go is comparable with C

Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
→ Go is marginally slower than C

Experiment 3: Transposed Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
→ Go is much worse than C
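Experiment 3's transposed variant is likewise not shown; the usual programmer optimization (an assumed reconstruction, not the thesis's exact code) is to transpose B first so the inner loop walks both operands with stride 1:

```go
package main

import "fmt"

// matMulT computes C = A * B for n x n row-major matrices by first
// transposing B, so the innermost loop reads both a and bt
// sequentially in memory (cache-friendly).
func matMulT(a, b []float64, n int) []float64 {
	bt := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			bt[j*n+i] = b[i*n+j]
		}
	}
	c := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			var sum float64
			for k := 0; k < n; k++ {
				sum += a[i*n+k] * bt[j*n+k] // stride-1 on both operands
			}
			c[i*n+j] = sum
		}
	}
	return c
}

func main() {
	fmt.Println(matMulT([]float64{1, 2, 3, 4}, []float64{5, 6, 7, 8}, 2)) // [19 22 43 50]
}
```

The O(n²) transpose is amortized against the O(n³) multiply, which is why this pays off at sizes like 4992×4992.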

Page 23: Performance of Go on Multicore Systems

No Optimization: Runtime vs #Cores

23

[Charts: MatrixMul (#threads = 24, P size = 5K); effect of #cores on runtime and on the times ratio]

Observations:
• Sequential: Go is 16% slower
• Parallel: Go is up to 5% faster

Page 24: Performance of Go on Multicore Systems

Reasons

Observations (in Go):
1. Instructions executed: 12% less
2. #Cycles: sequential (16% higher), parallel (5% less)
3. Cache misses: sequential (27x worse), parallel (similar)

24

Conclusions:
• Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.

Page 25: Performance of Go on Multicore Systems

No Optimization: Parallelism (Speedup) vs #Cores

25

[Charts: MatrixMul (#threads = 24, P size = 5K); effect of #cores on speedup and on normalized speedup (against best sequential execution time)]

Observations:
• Go makes up for poor sequential performance with a higher speedup.
• Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x)

Page 26: Performance of Go on Multicore Systems

Both Optimizations: Runtime vs #Cores

26

[Charts: MatrixMul –O3 (#threads = 24, P size = 5K); effect of #cores on runtime and on the times difference]

Observations:
• Sequential: Go is 400% slower
• Parallel: Go is 180-340% slower

Page 27: Performance of Go on Multicore Systems

Reasons

27

Observations (in Go):
1. Instructions executed: 5.2x as many
2. #Cycles: sequential (400% higher), parallel (180% higher)
3. Cache misses: sequential (64% less), parallel (56% less)

Conclusions:
• Go's optimization is not as mature as C's: sequential instructions reduced 1.3x vs 8x, cycles reduced 4x vs 18x
• Go has better cache management

Page 28: Performance of Go on Multicore Systems

Both Optimizations: Parallelism vs #Cores

28

[Charts: MatrixMul –O3 (#threads = 24, P size = 5K); effect of #cores on speedup and on normalized speedup]

Observations:
• Go speedup is higher than C's on its own base, but significantly worse when normalized.
• Secondary objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?

Page 29: Performance of Go on Multicore Systems

Compiler Optimization: Varying Problem Size

29

[Charts: MatrixMul –O3 (#threads = 24); effect of #cores on the times difference at P size = 5K and at P size = 10K]

Observation:
• Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1

Conclusion:
• In general, Go is increasingly competitive as the problem size increases.

Page 30: Performance of Go on Multicore Systems

Both Optimizations: Varying Problem Size

30

[Chart: MatrixMul –O3 (#threads = 24); effect of problem size and #cores on the times difference]

Observation:
• The times ratio decreases as the problem size increases on 1-20 cores.

Conclusion:
• There is a valley of performance on intermediate core numbers.

Page 31: Performance of Go on Multicore Systems

Both Optimizations: Runtime vs #Threads

31

[Chart: MatrixMul (#cores = 24, problem size = 5K); effect of #threads on runtime]

Observation:
• Go's relative performance improves as the #threads increases.

Conclusions:
• The cost of goroutines in Go is extremely low.
• Go's performance may improve on problems with high data dependency.
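The low cost of goroutines claimed above can be seen in miniature (a sketch, not the thesis's benchmark): splitting a fixed amount of work across thousands of goroutines is cheap, where thousands of OS threads would be prohibitive, and the answer is independent of the goroutine count.

```go
package main

import (
	"fmt"
	"sync"
)

// sumWith splits summing the integers [0, total) across g goroutines.
// The result does not depend on g; only the scheduling overhead does.
func sumWith(total, g int) int64 {
	results := make([]int64, g) // one slot per goroutine, no locking needed
	var wg sync.WaitGroup
	chunk := (total + g - 1) / g
	for w := 0; w < g; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > total {
			hi = total
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			var s int64
			for i := lo; i < hi; i++ {
				s += int64(i)
			}
			results[w] = s
		}(w, lo, hi)
	}
	wg.Wait()
	var sum int64
	for _, s := range results {
		sum += s
	}
	return sum
}

func main() {
	// 10,000 goroutines over the same fixed work.
	fmt.Println(sumWith(1000000, 10000)) // 499999500000
}
```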

Page 32: Performance of Go on Multicore Systems

Predictability of Actual Performance

• Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, problem size

• Observations (in Go):
  – Model exhibits better accuracy
  – Memory contention does not fluctuate as #cores changes
  – Measurements consistent with assumptions as problem size changes

• Result: Go exhibits properties useful for prediction that C does not.

32  

Page 33: Performance of Go on Multicore Systems

Predictability of Performance: Modeled vs Measured

33

[Chart: MatrixMul –O3 (#threads = 24, P=17K); effect of #cores on contention factor]

Observations:
• Contention error – C (avg: 15%, max: 55%), Go (avg: 3%, max: 14%)
• Parallelism error – C (avg: 18%, max: 44%), Go (avg: 6%, max: 15%)
• Runtime error – C (avg: 16%, max: 47%), Go (avg: 5%, max: 13%)

Conclusion:
• Go has better predictability than C

Page 34: Performance of Go on Multicore Systems

Predictability of Performance: Contention vs #Cores

34

[Chart: MatrixMul –O3 (#threads = 24, P=17K); effect of #cores on contention factor]

Observations:
• In C, contention fluctuates (0-5.6)
• Not so much in Go (0-1)

Conclusion:
• Garbage Collection, Channel Utilization
• A contention factor can be easily bounded in Go to guarantee performance of some other program.

Page 35: Performance of Go on Multicore Systems

Predictability of Performance: Modeling Across Problem Sizes

• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?

35  

Page 36: Performance of Go on Multicore Systems

Predictability of Performance: Problem Size vs Exploited Parallelism

36

[Charts: Go and C MatrixMul (#threads = 24, P=17K); effect of problem size on exploited parallelism]

Observations (in Go):
• Exploited parallelism only decreases slightly as problem size increases

Page 37: Performance of Go on Multicore Systems

Predictability of Performance: Problem Size vs Data Dependency

37

[Charts: Go and C MatrixMul (#threads = 24, P=17K); effect of problem size on data dependency]

Observations (in Go):
• Data dependency decreases as expected as problem size increases

Page 38: Performance of Go on Multicore Systems

Predictability of Performance: Problem Size vs Contention

38

[Charts: Go and C MatrixMul (#threads = 24, P=17K); effect of problem size on memory contention]

Observations (in Go):
• Memory contention only increases slightly as problem size increases

Conclusion:
• Measurement inputs on small problems are more accurate in Go than in C

Page 39: Performance of Go on Multicore Systems

Conclusion

1. How does Go compare to C in a multicore environment?

Go's actual performance:
– Comparable performance before, inferior performance after programmer optimization
– Consequence of different levels of optimization
– Performance margin decreases as the problem size increases on intermediate core numbers
– Cost of goroutines much lower than threads

Go's predicted performance:
– Model exhibits better accuracy
– Memory contention does not fluctuate as #cores changes
– Measurements consistent with assumptions as problem size changes

39  

Page 40: Performance of Go on Multicore Systems

Conclusion

2. Is the model extensible beyond C, traditional multicores, and high contention?
– Modified / validated for low contention problems
– Validated for the Go language
– Validated for ARM devices

3. Can we make the model easier to use?
– Formally defined validation criteria
– Wrote script to perform model validation
– Wrote script to perform performance prediction
– *Future Work* Front end for prediction

40  

Page 41: Performance of Go on Multicore Systems

Compiler Optimization: Runtime vs #Cores

41

[Charts: MatrixMul –O3 (#threads = 24, P size = 5K); effect of #cores on runtime and on the times difference]

Observations:
• Sequential: Go is 31% slower
• Parallel: Go is 0-28% slower
• On UMA, the times ratio decreases as #cores increases

Page 42: Performance of Go on Multicore Systems

Reasons  

42  

Observations (in Go):
1. Instructions executed: 4.5x as many
2. #Cycles: sequential (30% higher), parallel (similar)
3. Cache misses: sequential (10% higher), parallel (46% less)

Page 43: Performance of Go on Multicore Systems

Compiler Optimization: Parallelism vs #Cores

43

[Charts: MatrixMul –O3 (#threads = 24, P size = 5K); effect of #cores on exploited parallelism and on normalized speedup]

Observations:
• Go speedup is higher than C's on its own base, but lower when normalized.
• Secondary objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?

Page 44: Performance of Go on Multicore Systems

Sequential Optimization

44

[Chart: sequential runtime under no optimization, compiler optimization, and compiler + programmer optimization]

Page 45: Performance of Go on Multicore Systems

Predictability of Performance: Modeling Across Problem Sizes

• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?

• Observation: The performance profiles in Go are consistent with expectations as problem size changes

• Result: Measurement inputs on small problems are more accurate in Go than in C

45