www.bsc.es
Extreme-Scale Performance Tools - SC12 workshop
Judit Gimenez ([email protected])
BSC Tools
Detecting and analyzing application structure with clustering
Clustering
Identification of computation structure
– CPU burst = region between consecutive runtime calls
• Described with performance hardware counters
• Associated with call stack data
Using DBSCAN density-cluster algorithm
– Data not necessarily Gaussian
– Two parameters: Epsilon (search radius), MinPoints
Automatic Detection of Parallel Applications Computation Phases. (IPDPS 2009)
[Figure labels: Noise point, Cluster points]
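The DBSCAN step described on this slide can be sketched as follows. This is a minimal pure-Python illustration, not the BSC tools' implementation: the function names, the `NOISE` label, and the sample burst values (two normalized metrics per CPU burst) are assumptions made for the example. Only the two parameters, Epsilon and MinPoints, come from the slide.

```python
import math

NOISE = -1  # label chosen here for noise points (an assumption of this sketch)

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels

# Hypothetical CPU bursts, each described by two normalized metrics.
bursts = [
    (0.10, 0.20), (0.12, 0.22), (0.11, 0.19), (0.13, 0.21),  # dense group A
    (0.80, 0.90), (0.82, 0.88), (0.81, 0.91), (0.79, 0.89),  # dense group B
    (0.50, 0.05),                                            # isolated burst
]
labels = dbscan(bursts, eps=0.05, min_points=3)
print(labels)
```

Because the algorithm only looks at local density, it needs no assumption about the shape of the data, which matches the "not necessarily Gaussian" point above. The isolated burst ends up labelled as noise.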
Outputs
– Scatter Plot of Clustering Metrics
– Clusters Distribution Along Time
– Cluster Statistics
– Code Linking
All counters in a single run
Using clusters to understand apps behavior (GROMACS)
– Instructions imbalance
– IPC imbalance
[Figure panels: 64 procs, 256 procs]
Identifying main code regions
WRF
HydroC
CG-POP
[Figure: instr. vs. cluster. Axis ticks: 0.0 to 1.2 and 1 to 11. Labels: Cluster 1_L, Cluster 1, Cluster 1_R.]
[Figure: Counter Value (log scale, 1E+00 to 1E+10) for Cluster 1 and Cluster 2. Counters: PM_CYC, PM_GRP_CMPL, PM_GCT_EMPTY_CYC, PM_GCT_EMPTY_IC_MISS, PM_GCT_EMPTY_BR_MPRED, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_ERAT_MISS, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_DIV, PM_CMPLU_STALL_FPU, PM_CMPLU_STALL_FDIV, PM_CMPLU_STALL_OTHER, PM_INST_CMPL, PM_INST_DISP, PM_LD_REF_L1, PM_ST_REF_L1, PM_LD_MISS_L1, PM_ST_MISS_L1, PM_LD_MISS_L1_LSU0, PM_LD_MISS_L1_LSU1, PM_L1_WRITE_CYC, PM_DATA_FROM_MEM, PM_DATA_FROM_L2, PM_DTLB_MISS, PM_ITLB_MISS, PM_LSU0_BUSY, PM_LSU1_BUSY, PM_LSU_FLUSH, PM_LSU_LDF, PM_FPU_FIN, PM_FPU_FSQRT, PM_FPU_FDIV, PM_FPU_FMA, PM_FPU0_FIN, PM_FPU1_FIN, PM_FPU_STF, PM_LSU_LMQ_SRQ_EMPTY_CYC, PM_HV_CYC, PM_1PLUS_PPC_CMPL, PM_TB_BIT_TRANS, PM_FLUSH_BR, PM_BR_MPRED_TA, PM_GCT_EMPTY_SRQ_FULL, PM_FXU_FIN, PM_FXU_BUSY, PM_IOPS_CMPL, PM_GCT_FULL_CYC.]
[Figure: CPI stack - Detailed View. Axis ticks: 0 to 1.2. Bars: Cluster1_L, Cluster1, Cluster1_R, Cluster2_L, Cluster2, Cluster2_R. Components: Other Stalls; Stall by FPU Basic Latency; Stall by FDIV/FSQRT; Stall by FXU Basic Latency; Stall by Div/MTSPR/MFSPR; Stall by LSU Basic Latency; Stall by D-cache Miss; Other Reject; Stall by Xlate; Other GCT Stalls (flush, etc.); Branch Mispredict Penalty; I-cache Miss Penalty; Total Group Complete Cycles.]
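As a rough illustration of how a CPI stack like the one in the detailed view could be derived for one cluster, the sketch below divides groups of cycle counters by the completed-instruction count. The counter names and component labels are taken from the slide; the subtraction scheme (for example, LSU basic latency as LSU stalls minus rejects and D-cache misses), the residual definition of "Other Stalls", and all counter values are assumptions of this example, not something the slides state.

```python
def cpi_stack(c):
    """Split cycles per instruction into stall components for one cluster.

    `c` maps counter names to values. The nesting of counters used here is
    an assumed breakdown; "Other Stalls" is computed as the residual rather
    than read from PM_CMPLU_STALL_OTHER.
    """
    inst = c["PM_INST_CMPL"]
    ic_miss = c["PM_GCT_EMPTY_IC_MISS"]
    br_mpred = c["PM_GCT_EMPTY_BR_MPRED"]
    lsu = c["PM_CMPLU_STALL_LSU"]
    reject = c["PM_CMPLU_STALL_REJECT"]
    erat = c["PM_CMPLU_STALL_ERAT_MISS"]
    dcache = c["PM_CMPLU_STALL_DCACHE_MISS"]
    fxu, div = c["PM_CMPLU_STALL_FXU"], c["PM_CMPLU_STALL_DIV"]
    fpu, fdiv = c["PM_CMPLU_STALL_FPU"], c["PM_CMPLU_STALL_FDIV"]

    cycles = {
        "Total Group Complete Cycles": c["PM_GRP_CMPL"],
        "I-cache Miss Penalty": ic_miss,
        "Branch Mispredict Penalty": br_mpred,
        "Other GCT Stalls (flush, etc.)": c["PM_GCT_EMPTY_CYC"] - ic_miss - br_mpred,
        "Stall by Xlate": erat,
        "Other Reject": reject - erat,
        "Stall by D-cache Miss": dcache,
        "Stall by LSU Basic Latency": lsu - reject - dcache,
        "Stall by Div/MTSPR/MFSPR": div,
        "Stall by FXU Basic Latency": fxu - div,
        "Stall by FDIV/FSQRT": fdiv,
        "Stall by FPU Basic Latency": fpu - fdiv,
    }
    # Whatever is left of the total cycle count is attributed to other stalls.
    cycles["Other Stalls"] = c["PM_CYC"] - sum(cycles.values())
    return {name: value / inst for name, value in cycles.items()}

# Hypothetical counter values for a single cluster.
counters = {
    "PM_CYC": 1200, "PM_INST_CMPL": 1000, "PM_GRP_CMPL": 400,
    "PM_GCT_EMPTY_CYC": 100, "PM_GCT_EMPTY_IC_MISS": 30,
    "PM_GCT_EMPTY_BR_MPRED": 50,
    "PM_CMPLU_STALL_LSU": 400, "PM_CMPLU_STALL_REJECT": 100,
    "PM_CMPLU_STALL_ERAT_MISS": 40, "PM_CMPLU_STALL_DCACHE_MISS": 200,
    "PM_CMPLU_STALL_FXU": 100, "PM_CMPLU_STALL_DIV": 20,
    "PM_CMPLU_STALL_FPU": 150, "PM_CMPLU_STALL_FDIV": 50,
}
stack = cpi_stack(counters)
for name, cpi in stack.items():
    print(f"{name:32s} {cpi:.3f}")
```

By construction the components add up to PM_CYC / PM_INST_CMPL, so each bar in the chart has the cluster's total CPI as its height.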
Hardware counters projection
Performance Data Extrapolation in Parallel Codes. (ICPADS 2010)
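One way to picture the projection idea is sketched below, under the assumption that each CPU burst is measured with only a subset of the counters and that bursts in the same cluster behave alike, so per-cluster averages can fill in a full counter profile from a single run. The function name, data layout, and values are hypothetical, and the plain averaging used here stands in for whatever weighting the cited paper applies.

```python
from collections import defaultdict

def project_counters(bursts):
    """Average every counter over the bursts of each cluster that measured it.

    `bursts` is a list of (cluster_id, {counter_name: value}) pairs. Each
    burst may carry a different counter subset; the result maps each cluster
    to the mean of every counter seen in at least one of its bursts.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for cluster, counters in bursts:
        for name, value in counters.items():
            sums[cluster][name] += value
            counts[cluster][name] += 1
    return {
        cluster: {name: sums[cluster][name] / counts[cluster][name]
                  for name in sums[cluster]}
        for cluster in sums
    }

# Hypothetical bursts: PM_CYC and PM_INST_CMPL are read in every burst,
# while the third counter rotates between bursts.
bursts = [
    (1, {"PM_CYC": 1000, "PM_INST_CMPL": 800, "PM_LD_MISS_L1": 20}),
    (1, {"PM_CYC": 1100, "PM_INST_CMPL": 820, "PM_FPU_FIN": 300}),
    (1, {"PM_CYC": 900, "PM_INST_CMPL": 780, "PM_LD_MISS_L1": 30}),
    (2, {"PM_CYC": 5000, "PM_INST_CMPL": 2000, "PM_FPU_FIN": 100}),
]
profile = project_counters(bursts)
print(profile[1])
```

In this toy run, cluster 1 ends up with values for all four counters even though no single burst measured them together.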
Quality driven clustering algorithms
Hierarchy of cluster granularities
Automatic SPMD quality criteria
Automatic Refinement of Parallel Application Structure Detection (LSPP 2012)
Conclusions
Performance analytics: Data analytics applied to raw performance data
– From data to insight, huge room for research
Clustering enables focusing the analysis and opens many different uses
– Correlation of sampled data to generate instantaneous metric evolution
– Separate speed factors per cluster on predictive simulations with Dimemas
– Track application evolution using the clustering scatter plots
www.bsc.es/paraver