MEASURING AND OPTIMIZING PERFORMANCE OF CLUSTER AND PRIVATE CLOUD APPLICATIONS BY USING PPA

PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Applications Using PPA, by Hui Huang, Zhaoqiang Zheng and Lihua Zhang

DESCRIPTION

Presentation PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Applications Using PPA, by Hui Huang, Zhaoqiang Zheng and Lihua Zhang, at the AMD Developer Summit (APU13), November 11-13, 2013


MulticoreWare Inc.
Lihua Zhang, Hui Huang, Andy Zheng


3   |      PRESENTATION  TITLE      |      DECEMBER  4,  2013      |      CONFIDENTIAL  

Introduction to MCW PPA™ for Cluster

A tracing tool that targets distributed systems.

- Collects instrumented data and hardware measurements in a distributed fashion within a tracing infrastructure.
- Provides visualizations with intuitive graphs and Gantt charts, and generates statistical reports intended for identifying critical paths.
- Performs offline analysis that aids in understanding the target system's behavior and reasoning about performance issues.
- PPA product series: PPA for Cluster, PPA Workstation Edition, PPA for Android


Main Features

- Low overhead
  - Negligible performance impact on running applications, achieved by relying on the PPA runtime library. This is very useful for highly optimized, performance-sensitive cases.
- Instrumentation at the application level
  - The PPA runtime library provides APIs to measure code. The hardware measurement part is transparent to developers, and the PPA instrumentation can easily be cleaned up by turning on a disable option.
  - Auto-instrumentation of binaries available soon.
- Scalability
  - The tool can be extended to profile clusters of various scales (currently up to 4000 nodes) and services (e.g. Hadoop). This benefits from PPA's distributed data repositories, big-data processing, and buffered views of visualizations.
  - The PPA Profiler can be extended to support hardware-vendor-specific features.
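The idea of application-level instrumentation that costs almost nothing when capture is off can be sketched as follows. This is only an illustration: `set_capture`, `measure`, and `recorded_events` are hypothetical names and are not the PPA runtime library API, which the slides do not show.

```python
import time
from contextlib import contextmanager

# Hypothetical names throughout; this is NOT the PPA runtime library API.
_enabled = False   # capture is off by default
_events = []       # recorded (name, start_ns, end_ns) tuples


def set_capture(enabled):
    """Turn event capture on or off (stand-in for a 'disable option')."""
    global _enabled
    _enabled = enabled


@contextmanager
def measure(name):
    """Time the enclosed block; do no timing work when capture is off."""
    if not _enabled:
        yield
        return
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        _events.append((name, start, time.perf_counter_ns()))


def recorded_events():
    """Return a copy of the events captured so far."""
    return list(_events)
```

With capture disabled, `with measure("sort"): ...` runs the block and records nothing, so the instrumented code can stay in place.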


The Highlights

- Profiler and performance analyzer
  - Low overhead (almost no cost if no profiling capture is enabled)
  - CPU and GPU activity traces
  - Hardware utilization measurement
  - Hardware-vendor-specific support
  - Time-based views and statistical analysis/reports
  - Multi-core profiling at the process/thread and source-code level
  - Good data organization in intuitive colour schemes
- Big data support
  - Storage
  - Smooth visualization
- System-wide critical path identification
  - Correlates hardware utilization and CPU events on the same timeline
  - Cluster-wide global clock synchronization
  - Multiple views for sessions from different nodes on the same timeline
  - Runtime monitors
- Customizable for specific applications, e.g. Hadoop
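Placing traces from different nodes on one timeline requires mapping each node's local timestamps onto a common clock. The slides do not say how PPA does this; the sketch below uses Cristian's algorithm as a stand-in, and all function names are assumptions made for illustration.

```python
def estimate_offset(t_send, t_server, t_recv):
    """Offset to add to the local clock to approximate the server clock.

    t_send and t_recv are local timestamps around one request/reply;
    t_server is the server's timestamp carried in the reply.
    Assumes the network delay is roughly symmetric.
    """
    return t_server - (t_send + t_recv) / 2.0


def best_offset(samples):
    """Use the sample with the smallest round trip, as it is least distorted."""
    t_send, t_server, t_recv = min(samples, key=lambda s: s[2] - s[0])
    return estimate_offset(t_send, t_server, t_recv)


def to_global(local_ts, offset):
    """Convert a node-local timestamp to the shared timeline."""
    return local_ts + offset
```

For example, a node could exchange a few timestamp probes with the master node, call `best_offset` once, and then pass every local event timestamp through `to_global` before the traces are merged.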


Developer Library Overview

- C/C++ SDK
  - Already used in numerous OpenCL™ applications
  - Provides a friendly interface (ppaAPI.h) for C/C++ developers
- Java support
  - Java bindings for OpenCL™ applications
  - Provides a friendly interface (JPPA.jar) for Java developers
- Thread-safe
- Low overhead if no capture
- Transparent OpenCL instrumentation
  - Timing of OpenCL APIs
  - Timing of kernels and data transfers: start/submit/queue/complete
  - Visualizes the construction of the dependence graph between kernels and data transfers
  - Exclusive sub-kernel support for AMD GFX cards
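The four timing points listed for kernels and data transfers match the timestamps that OpenCL event profiling reports for a command (queued, submitted, started, completed). A minimal sketch of how such timestamps break down into phases, assuming they are already available in nanoseconds; the class and function names are hypothetical and not part of ppaAPI.h or JPPA.jar:

```python
from dataclasses import dataclass


@dataclass
class CommandTimestamps:
    """Timestamps of one kernel or data transfer, in nanoseconds."""
    queued: int
    submitted: int
    started: int
    completed: int


def phase_durations(ts):
    """Split a command's lifetime into the phases between its timestamps."""
    return {
        "queue_wait_ns": ts.submitted - ts.queued,       # waiting in the host queue
        "launch_latency_ns": ts.started - ts.submitted,  # submitted, not yet running
        "execution_ns": ts.completed - ts.started,       # running on the device
        "total_ns": ts.completed - ts.queued,
    }
```

A long `queue_wait_ns` relative to `execution_ns` suggests the command was held back by earlier work or dependencies rather than by its own cost.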


System Overview

- Distributed repositories for trace data
- Distributed post-processing to minimize overhead
- Powerful visualization engine
- Scalability to any scale of cluster system

[Figure: layered architecture diagram. Layers: Presentation layer, UI Logic layer, Network layer, Data layer, Profiler Logic layer. Component labels: Graphics Rendering; Communication Framework (data transfer, fault tolerance, synchronization and heartbeat, etc.); Raw Data Repository; Raw Data Post-Processing; Processed Data Repository; data serialization for presentation; Profiler Control (start/stop, etc.); other profiler logic; data collection by the PPA Profiler.]


Getting Started

- Install PPA Clients and the PPA Server on the target platforms
  - Deploy PPA Clients by scripts
  - CLI support for capture
  - The PPA Server generally runs on the master node
- Set up capture options
  - Node IP, communication port…
  - Optionally select nodes to profile
  - Optionally enable CPU event filters
  - Optionally enable CPU event merge
  - Hardware measurement is enabled by default
- Collect data and analysis reports
- Operate views
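The slides do not define what the CPU event filter and merge options do. One plausible reading, sketched here purely as an assumption, is that filtering drops very short events and merging fuses same-named events separated by a small gap, both of which reduce trace volume. Events are modelled as hypothetical `(name, start_ns, end_ns)` tuples.

```python
def filter_events(events, min_duration_ns):
    """Keep only events that last at least min_duration_ns."""
    return [e for e in events if e[2] - e[1] >= min_duration_ns]


def merge_events(events, max_gap_ns):
    """Fuse same-named events whose gap is at most max_gap_ns.

    The result is ordered by name, then by start time.
    """
    merged = []
    for name, start, end in sorted(events, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == name and start - merged[-1][2] <= max_gap_ns:
            prev = merged[-1]
            merged[-1] = (name, prev[1], max(prev[2], end))
        else:
            merged.append((name, start, end))
    return merged
```

Both helpers return new lists, so the raw capture stays available for later analysis.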


Summary View

- Helps find problematic nodes or unbalanced loads.
- Shows the differences between runs.

[Figure: Summary View screenshot with a multistage table and bar charts]
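Spotting an unbalanced load from per-node totals can be sketched as a simple deviation check. The function name, the input shape (node name mapped to a load figure such as busy time), and the 25% tolerance are all assumptions for illustration; the slides do not describe how the Summary View computes this.

```python
from statistics import mean


def flag_unbalanced(node_load, tolerance=0.25):
    """Return nodes whose load deviates from the cluster mean by more than
    `tolerance` (as a fraction of the mean), sorted by name."""
    avg = mean(node_load.values())
    return sorted(
        node for node, load in node_load.items()
        if abs(load - avg) > tolerance * avg
    )
```

For example, with loads of 100, 105, 95, and 180 on four nodes, the mean is 120 and only the node at 180 lies more than 30 away from it.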


The Sharp Utility: Timeline View

- Correlates CPU events with hardware performance during analysis

[Figure: Timeline View screenshot. Callouts: monitoring the application's behaviour; monitoring hardware behaviour; zoom in/out from hour to ns resolution; session and its node list]
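Correlating a CPU event with hardware measurements on the same timeline amounts to looking up the samples that fall inside the event's interval. A minimal sketch, assuming samples are stored as two parallel lists sorted by time and events are hypothetical `(name, start, end)` tuples:

```python
from bisect import bisect_left, bisect_right


def mean_usage_during(event, sample_times, sample_values):
    """Average the hardware samples taken within the event's [start, end].

    sample_times must be sorted ascending; returns None if no sample
    falls inside the interval.
    """
    _name, start, end = event
    lo = bisect_left(sample_times, start)
    hi = bisect_right(sample_times, end)
    window = sample_values[lo:hi]
    return sum(window) / len(window) if window else None
```

Binary search keeps each lookup logarithmic in the number of samples, which matters when a trace holds many samples per node.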


Profiling Data

- CPU events level
  - Thread
  - Name
  - Core mitigation
  - Timing
- OpenCL traces
- Hardware counters
  - % CPU usage
  - Memory usage
  - Bytes read/written on disk
  - Bytes in/out on the network
  - Cache hit/miss
- Statistics
  - Processes/threads involved
  - # of total CPU events
  - # of the same CPU events
  - Min/max/average for each
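The statistics listed above (event counts and min/max/average per event) can be computed in one pass over the captured events. This is a sketch under the assumption that events are `(name, start_ns, end_ns)` tuples; the function name is hypothetical.

```python
from collections import defaultdict


def event_statistics(events):
    """Group events by name and report count, min, max, and average duration."""
    durations = defaultdict(list)
    for name, start, end in events:
        durations[name].append(end - start)
    return {
        name: {
            "count": len(d),
            "min": min(d),
            "max": max(d),
            "avg": sum(d) / len(d),
        }
        for name, d in durations.items()
    }
```

The total number of CPU events is then the sum of the per-name counts.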


Timeline View for CPU Events

- Process-thread-event data
  - Identify the problematic process/thread/event
  - Show dependencies
  - Show parent and child relationships
  - Frame analyzer for frame-based programs

[Figure: Timeline View screenshot. Callouts: expand process; expand thread]


Timeline View for HW Measurement

- Aggregate performance data
- Per-core data

[Figure: Timeline View screenshot. Callouts: When does disk throughput become critical? Is the network load abnormal? When is CPU usage very low or very high?]


Where Multi-Views Help Optimization

- Identify a node's abnormal behavior
- Differences/relations between nodes
- Job scheduler matters


Hadoop with PPA on AWS as Demo

- Overview of the tracing infrastructure


Set Up AWS EC2 Instances

- 16 Hadoop nodes (dual-core nodes with 7.5 GB memory)
- 4 GB Hadoop TeraSort workload
- > 1.2 GB PPA trace per node


Run Hadoop Jobs

- Start the capture
- Jobs are done by map & reduce


Remote Control by VNC Viewer

- Intended for multiple users on AWS
- Experience and operate PPA from different connection points


CONTACT  US:    [email protected]    [email protected]    [email protected]