Presentation PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Applications Using PPA, by Hui Huang, Zhaoqiang Zheng and Lihua Zhang at the AMD Developer Summit (APU13), November 11-13, 2013
MEASURING AND OPTIMIZING PERFORMANCE OF CLUSTER AND PRIVATE CLOUD APPLICATIONS
BY USING PPA
MULTICOREWARE INC LIHUA.ZHANG HUI.HUANG
ANDY.ZHENG
3 | PRESENTATION TITLE | DECEMBER 4, 2013 | CONFIDENTIAL
Introduction to MCW PPA™ For Cluster
A tracing tool that targets distributed systems.
• Collects instrumented data and hardware measurements in a distributed fashion within a tracing infrastructure.
• Provides visualizations with intuitive graphs/Gantt charts and generates statistical reports intended for identifying critical paths.
• Performs offline analysis that aids in understanding the target system's behavior and reasoning about performance issues.
• PPA product series
‒ PPA For Cluster
‒ PPA Workstation Edition
‒ PPA For Android
Main Features
• Low overhead
‒ Has negligible performance impact on running applications by relying on the PPA runtime library. This is very useful for highly optimized, performance-sensitive cases.
• Instrumentation at the application level
‒ The PPA runtime library provides APIs to instrument code. The hardware measurement part is transparent to developers, and the PPA instrumentation can easily be removed by turning on a disable option.
‒ Auto-instrumentation of binaries available soon.
• Scalability
‒ The tool can be extended to profile clusters of various scales (currently up to 4000 nodes) and services (e.g. Hadoop). This benefits from PPA's distributed data repositories, big-data processing, buffered views of visualizations, etc.
‒ PPA Profiler can be extended to support HW vendor specific features
The Highlights
• Profiler and performance analyzer
‒ Low overhead (almost no cost if no profiling capture is enabled)
‒ CPU & GPU activity traces
‒ Hardware utilization measurement
‒ HW vendor specific support
‒ Time-based views and statistical analysis / reports
‒ Multi-core profiling at the process/thread and source code level
‒ Good data organization in intuitive colour schemes
• Big data support
‒ Storage
‒ Smooth visualization
• System-wide critical path identification
‒ Correlate hardware utilization and CPU events in the same timeline
‒ Cluster-wide global clock synchronization
‒ Multi-views for sessions from different nodes in the same timeline
‒ Runtime monitors
• Customizable for specific applications, e.g. Hadoop
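Placing events from different nodes on one timeline requires mapping each node's clock onto a common reference. The slides do not say how PPA synchronizes clocks; the sketch below uses the standard NTP-style offset estimate as a stand-in, and the function names are hypothetical.

```python
def estimate_offset(t0, t1, t2, t3):
    """NTP-style estimate of (offset, round_trip_delay).

    t0: node clock when the request is sent
    t1: master clock when the request arrives
    t2: master clock when the reply is sent
    t3: node clock when the reply arrives
    The offset is master clock minus node clock, assuming the network
    latency is the same in both directions.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def to_master_time(node_timestamp, offset):
    """Map a node-local timestamp onto the master's timeline."""
    return node_timestamp + offset


# Node clock is 50 units behind the master, one-way latency is 10 units.
offset, delay = estimate_offset(1000, 1060, 1062, 1022)
aligned = to_master_time(1022, offset)
```

In this example the estimate recovers the 50-unit offset and a 20-unit round trip, so the node timestamp 1022 lands at 1072 on the master's timeline.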
Developer Library Overview
• C/C++ SDK
‒ Already used in numerous OpenCL™ applications
• Java support
‒ Java bindings for OpenCL™ applications
• Thread-safe
• Low overhead if no capture
• Transparent for OpenCL instrumentation
‒ Timing OpenCL APIs
‒ Timing kernels & data transfers: start/submit/queue/complete
‒ Visualize construction of the dependence graph between kernels & data transfers
‒ Exclusive sub-kernel support for AMD GFX cards
Java: provides a friendly interface (JPPA.jar) for the Java developer.
C/C++: provides a friendly interface (ppaAPI.h) for the C/C++ developer.
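OpenCL event profiling reports four timestamps per command (queued, submitted, started, completed), which matches the queue/submit/start/complete timing listed above. How PPA records and displays them is not shown in the slides; the following sketch only illustrates how the phases can be derived from such timestamps, and the class and key names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandTimestamps:
    """Four profiling timestamps of one kernel or transfer, in nanoseconds."""
    queued: int
    submitted: int
    started: int
    completed: int


def phase_durations(ts):
    """Split a command's lifetime into host queueing, device wait and execution."""
    if not (ts.queued <= ts.submitted <= ts.started <= ts.completed):
        raise ValueError("timestamps are not in queue/submit/start/complete order")
    return {
        "host_queue_ns": ts.submitted - ts.queued,    # waiting in the host queue
        "device_wait_ns": ts.started - ts.submitted,  # submitted, not yet running
        "execution_ns": ts.completed - ts.started,    # running on the device
        "total_ns": ts.completed - ts.queued,
    }


phases = phase_durations(CommandTimestamps(100, 150, 400, 900))
```

A long `device_wait_ns` relative to `execution_ns` suggests the command was held back by earlier work or dependencies rather than by its own run time.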
System Overview
• Distributed repositories for trace data
• Distributed post-processing to minimize overhead
• Powerful visualization engine
• Scalable to cluster systems of any scale
[Architecture diagram. Layers: Presentation layer, UI Logic layer, Network layer, Profiler Logic layer, Data layer. Components shown: Communication Framework; Data Transfer; Fault-tolerant; Synchronization and heartbeat etc.; Data collecting by PPA Profiler; Raw Data Repository; Raw Data Post Processing; Processed Data Repository; Data serialize for Presentation; Profiler Control (Start/Stop etc.); Other profiler logic; Graphics Rendering.]
Getting Started
• Install PPA Clients and PPA Server on the target platforms
‒ Deploy PPA Clients by scripts
‒ CLI supported for capture
‒ Generally the PPA Server runs on the master node
• Set up capture options
‒ Node IP, communication port…
‒ Optionally select nodes to profile
‒ Optionally enable CPU Event filters
‒ Optionally enable CPU Event merge
‒ Hardware measurement is enabled by default
• Collect data and analysis reports
• Operate views
Summary View
• Helps find problematic nodes or unbalanced loads
• Shows the differences between runs
Views shown: multistage table, bar charts
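One simple way to spot an unbalanced load from per-node summary numbers is to compare each node with the cluster median. The sketch below is an illustration under that assumption, not PPA's method; the function name, the 25% tolerance and the sample figures are made up.

```python
from statistics import median


def find_outlier_nodes(busy_seconds, tolerance=0.25):
    """Return the nodes whose busy time deviates from the median by more
    than `tolerance` (as a fraction of the median), sorted by name."""
    mid = median(busy_seconds.values())
    if mid == 0:
        return []
    return sorted(
        node for node, value in busy_seconds.items()
        if abs(value - mid) / mid > tolerance
    )


# Made-up per-node totals for one run.
run = {"node01": 120.0, "node02": 118.0, "node03": 240.0, "node04": 125.0}
outliers = find_outlier_nodes(run)
```

Here `node03` is flagged because it was busy for roughly twice as long as the median node.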
The Sharp Utility: Timeline View
• Correlate CPU Events with HW performance during analysis
[Screenshot callouts: monitoring the application's behaviour; monitoring hardware behaviour; zoom in/out from hour to ns resolution; session and its node list]
Profiling Data
• CPU Events level
‒ Thread
‒ Name
‒ Core mitigation
‒ Timing
• OpenCL traces
• Hardware counters
‒ % CPU usage
‒ Memory usage
‒ Disk bytes read/written
‒ Network bytes in/out
‒ Cache hit/miss
• Statistics
‒ Processes/threads involved
‒ # of total CPU Events
‒ # of CPU Events of the same kind
‒ Min/Max/Average for each
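The per-event statistics listed above (count, min, max, average) can be computed from raw event records in a few lines. This is a minimal sketch that assumes each record is a `(name, start_ns, end_ns)` tuple; that layout is an assumption, not the PPA trace format.

```python
from collections import defaultdict


def summarize(events):
    """Group events by name and report count, min, max and average duration."""
    durations = defaultdict(list)
    for name, start_ns, end_ns in events:
        durations[name].append(end_ns - start_ns)
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for name, values in durations.items()
    }


# Made-up events: two "map" spans and one "reduce" span.
stats = summarize([("map", 0, 30), ("map", 40, 50), ("reduce", 10, 110)])
```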
Timeline View for CPU Events
[Screenshot callouts: expand process; expand thread]
• Process-thread-event data
‒ Identify the problematic process/thread/event
‒ Show dependencies
‒ Show parent & child relationships
‒ Frame analyzer for frame-based programs
Timeline View for HW measurement
When does disk throughput become critical?
Is there abnormal load on the network?
When is CPU usage very low or very high?
• Aggregate performance data
• Per-core data
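Questions such as "when is CPU usage very low or very high?" amount to finding time spans in a sampled series where a condition holds. The sketch below assumes samples are `(timestamp, value)` pairs in time order; the function name, thresholds and data are hypothetical.

```python
def find_spans(samples, predicate):
    """Return (first_timestamp, last_timestamp) for each run of consecutive
    samples whose value satisfies `predicate`."""
    spans = []
    start = prev = None
    for timestamp, value in samples:
        if predicate(value):
            if start is None:
                start = timestamp      # a new run begins here
            prev = timestamp
        elif start is not None:
            spans.append((start, prev))  # the run ended at the previous sample
            start = None
    if start is not None:
        spans.append((start, prev))      # the series ended inside a run
    return spans


# Made-up % CPU usage samples, one per second.
cpu = [(0, 12.0), (1, 95.0), (2, 97.0), (3, 40.0), (4, 3.0), (5, 2.0)]
high = find_spans(cpu, lambda v: v >= 90.0)
low = find_spans(cpu, lambda v: v <= 5.0)
```

The same helper applies to disk or network byte counters by changing the predicate.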
Where Multi-views Help Optimization
• Identify a node's abnormal behavior
• Differences/relations between nodes
• Job scheduler matters
Hadoop with PPA on AWS as Demo
• Overview of the tracing infrastructure
Set up AWS EC2 instances
• 16 Hadoop nodes (dual-core nodes with 7.5 GB memory)
• 4 GB Hadoop TeraSort workload
• > 1.2 GB PPA trace per node
Run Hadoop jobs
• Start the capture
• Jobs are done by map & reduce
Remote control by VNC viewer
• Intended for multiple users on AWS
• Experience and operate PPA from different connection points
CONTACT US: [email protected] [email protected] [email protected]