
CounterMiner: Mining Big Performance Data from Hardware Counters

Yirong Lv, HICAS, Shenzhen Institute of Advanced Technology, CAS, Shenzhen, China, [email protected]
Bin Sun, Information Engineering College, Capital Normal University, Beijing, China, brad [email protected]
Qinyi Luo, Department of Computer Science, University of Southern California, Los Angeles, USA, [email protected]
Jing Wang, Information Engineering College, Capital Normal University, Beijing, China, [email protected]
Zhibin Yu, HICAS, Shenzhen Institute of Advanced Technology, CAS, Shenzhen, China, [email protected]
Xuehai Qian, Department of Computer Science, University of Southern California, Los Angeles, USA, [email protected]

Abstract—Modern processors typically provide a small number of hardware performance counters to capture a large number of microarchitecture events 1. These counters can easily generate a huge amount (e.g., GB or TB per day) of data, which we call big performance data, in cloud computing platforms with more than thousands of servers and millions of complex workloads running in a "24/7/365" manner. The big performance data provides a precious foundation for root cause analysis of performance bottlenecks, architecture and compiler optimization, and many more. However, it is challenging to extract value from the big performance data due to: 1) the many unperceivable errors (e.g., outliers and missing values); and 2) the difficulty of obtaining insights, e.g., relating events to performance.

In this paper, we propose CounterMiner, a rigorous methodology that enables the measurement and understanding of big performance data by using data mining and machine learning techniques. It includes three novel components: 1) using data cleaning to improve data quality by replacing outliers and filling in missing values; 2) iteratively quantifying, ranking, and pruning events based on their importance with respect to performance; 3) quantifying interaction intensity between two events by residual variance. We use sixteen benchmarks (eight from CloudSuite and eight from the Spark 2 version of HiBench) to evaluate CounterMiner. The experimental results show that CounterMiner reduces the average error from 28.3% to 7.7% when multiplexing 10 events on 4 hardware counters. We also conduct a real-world case study, showing that identifying important configuration parameters of Spark programs by event importance is much faster than directly ranking the importance of these parameters.

Index Terms—performance, big data, computer architecture, performance counters, data mining

I. INTRODUCTION

Modern processors typically provide 4-8 hardware counters to measure hundreds of crucial events such as cache and TLB misses [1]–[4]. These events can generally reveal root causes and key insights about the performance of computer systems.

1 We use events to represent microarchitecture events throughout the paper.
2 We use Spark to represent Apache Spark throughout the paper.

Therefore, performance counter based analysis is applied in a wide range of applications, including task scheduling [5], [6], workload characterization [7]–[14], performance optimization of applications [15]–[18], compiler optimization [19]–[21], architecture optimization [22], and many more. A number of programmable performance measurement tools have therefore been developed, including PAPI [23], VTune [24], Perfmon [25], Oprofile [26], and many others [27], [28].

However, there is a fundamental tension between accuracy and efficiency when using a small number of hardware counters to measure a large number of events. On the one side, the one counter one event (OCOE) approach can achieve high accuracy because a number of events are measured by the same number of hardware counters at a time. Obviously, OCOE becomes inefficient when there are more than one hundred and up to fourteen hundred measurable events [29].

This motivates the multiplexing (MLPX) approach, which improves the measurement efficiency by scheduling events from a fraction of execution to be counted on hardware counters and extrapolating the full behavior of each event from its samples. However, MLPX incurs large measurement errors due to time-sharing and sampling [4], [30]–[32].

In the cloud computing era, this problem is exaggerated for two reasons. First, resolving the measurement errors of MLPX is a requirement. A modern cloud computing platform usually consists of more than a thousand servers and millions of complex workloads running services with diverse characteristics in a "24/7/365" manner. The major companies have good incentives to understand the performance behavior because a small performance improvement (e.g., 1%) can result in millions of dollars of savings [7]. In this context, randomly selecting and measuring a small number of events with OCOE is not sufficient because the cloud services' performance may be affected by the unmeasured events. To measure a large number of events, using MLPX and handling its measurement errors are mandatory. Prior work shows that the errors cannot


be effectively avoided during the sampling [33], [34].

Second, it becomes more difficult to extract insights of performance behavior. The events of server processors in a cloud computing platform can generate a huge amount of data, leading to big performance data. For example, GWP (Google-Wide Profiler) [7] could generate several GBs of performance data per day. If we treat the data equally, the high event dimensionality (typically > 100) incurs extremely high cost.

In this paper, we propose CounterMiner, a rigorous methodology that enables the measurement and understanding of the big performance data with data mining and machine learning techniques. It includes three components: 1) data cleaner, which improves the counter data quality by replacing outliers and filling in missing values after the sampling of MLPX, which is complementary to [33], [34]; 2) importance ranker, which iteratively quantifies, ranks, and prunes events based on their importance with respect to performance; 3) interaction ranker, which quantifies interaction intensity between two events by residual variance.

We use sixteen benchmarks, eight from CloudSuite [35] and eight from the Spark version of HiBench [36], to evaluate CounterMiner. The experimental results show that CounterMiner reduces the average error from 28.3% to 7.7% when multiplexing 10 events on 4 hardware counters. We also conduct a real-world case study, showing that identifying important configuration parameters of Spark programs by event importance is much faster than directly ranking the importance of these parameters.

Moreover, CounterMiner reveals a number of interesting findings: 1) the event of stall cycles due to instruction queue full is the most important event for most cloud programs; 2) the branch related events interact with other events the most strongly; 3) there is a 'one-three significantly more important' law ('one-three SMI' for short), which states that generally one to three events of a benchmark are significantly more important than the others with respect to performance; 4) a number of noisy events of a modern processor can be safely removed; 5) there are common important events related to branches, TLBs (instruction, data, and second-level TLBs), and remote memory and remote cache operations; 6) the eight Spark benchmarks from HiBench surprisingly show more diversity than those from CloudSuite when we look at the top ten important events. These findings are valuable to guide cross-layer performance optimizations of architecture and applications of cloud systems.

The rest of this paper is organized as follows. Section II discusses the background and motivation. Section III presents our CounterMiner framework. Section IV describes the experimental methodology. Section V provides the results and analysis. Section VI discusses the related work, and Section VII concludes the paper.

II. BACKGROUND AND MOTIVATION

A. Hardware Counters

Every modern processor has a logical unit called the Performance Monitoring Unit (PMU), which consists of a set of hardware counters. A counter counts how many times a certain event occurs during a time interval of a program's execution. The number of counters may vary across microarchitectures or even processor modes within the same microarchitecture. For example, the ARM Cortex-A53 MPCore processor has six counter registers [2] while recent Intel processors have three fixed counters (which can only measure specific events such as clock cycles and retired instructions) and eight programmable counters per core (four per SMT thread if it is enabled) [1].

The events that can be measured by hardware counters are predefined by processor vendors. The number of events may also be significantly different for different generations of processors within the same microarchitecture, let alone different microarchitectures. For instance, the basic Ivy-Bridge model defines only 338 events whereas the high-end IvyTown chips support 1423 events [29]. In summary, the number of events greatly outnumbers that of hardware counters (more than one hundred vs. less than ten).

There are two ways to program hardware counters to capture events. In the one counter one event (OCOE) approach, one hardware counter is programmed to count only one event during the whole profiling period of a program. OCOE is accurate because one counter is dedicated to one event at a time. However, OCOE is inefficient because one usually needs to measure a large number of events to identify the root causes of performance bottlenecks for an unknown system [37]. The number of events that can be simultaneously measured in OCOE is equal to or less than the number of available hardware counters of a processor.

Multiplexing (MLPX) is developed to improve the measurement efficiency by letting multiple events timeshare a single hardware counter [4], [38]. In MLPX, events are scheduled to be sampled during a fraction of execution. Based on the samples, the full behavior of each event is extrapolated [30].

However, large measurement errors occur with MLPX because information may be lost when an event does not happen during a sampled interval [30], [33], [34] but happens during an un-sampled interval. Mathur et al. [38] reported that errors higher than 50% were observed when the SPEC CPU 2000 benchmarks were profiled with MLPX.

B. Motivation

1) Measurement Errors: While some prior works such as [14], [15] employ OCOE to measure performance, MLPX is mandatory when a large number of events need to be sampled, e.g., for emerging workloads in cloud computing. Moreover, quantifying measurement errors is a fundamental challenge due to the sampling nature of MLPX.

We conduct the following experiments to observe the errors caused by MLPX. We first run a set of cloud computing programs and measure their events by OCOE. Since different runs of the same program in cloud computing platforms may take different times, which is caused by the non-deterministic nature of modern operating systems, the different time series for the same event may have different lengths. Traditional approaches such as calculating the Euclidean or Manhattan



Fig. 1. The measurement errors caused by MLPX. WDC-wordcount, PGR-pagerank, AGG-aggregation, JON-join, SCN-scan, SOT-sort, BAY-Bayes, KME-kmeans, DAA-DataAnalytics, DAC-DataCaching, DAS-DataServing, GPA-GraphAnalytics, IMA-In-memoryAnalytics, MES-MediaStreaming, WSH-WebSearch, WSG-WebServing, AVG-average error.

distance between two vectors cannot capture the performance behavior differences [39] because they require the two vectors to have the same length. In order to compute the distance (difference) between two time series, we must "warp" the time axis of one (or both) sequences to achieve a better alignment. Dynamic time warping (DTW) [40] is a technique for efficiently achieving this warping. It employs a dynamic programming approach to align one time series to another so that the distance measurement is minimized. We therefore employ DTW to calculate the distance between two event time series as follows:

$dist = DTW(S_1, S_2)$  (1)

where $S_1$ and $S_2$ are the first and second time series for the same event, respectively. Note that the length of $S_1$ is not necessarily equal to that of $S_2$.

First, we calculate the DTW between two time series collected by OCOE, denoted by $dist_{ref}$:

$dist_{ref} = DTW(S_{ocoe1}, S_{ocoe2})$  (2)

where $S_{ocoe1}$ and $S_{ocoe2}$ are the two time series for a certain event of the same program collected by OCOE. Due to the accuracy of OCOE and the non-deterministic nature of the OS, $dist_{ref}$ is a nonzero value but is theoretically close to zero.

Second, we compute the DTW between one time series collected by MLPX and one by OCOE, represented by $dist_{mea}$:

$dist_{mea} = DTW(S_{mlpx}, S_{ocoe})$  (3)

where $S_{mlpx}$ and $S_{ocoe}$ are the time series collected by MLPX and OCOE, respectively. $S_{mlpx}$ and $S_{ocoe}$ are for the same program with the same event. Theoretically, $dist_{mea}$ is larger than $dist_{ref}$ due to MLPX.

Finally, we define the error caused by MLPX as follows:

$error = \left|1 - \frac{dist_{ref}}{dist_{mea}}\right| \times 100\%$  (4)

This error definition roughly quantifies how close the time series generated by MLPX is to the one obtained by OCOE. Since $S_{ocoe1}$ and $S_{ocoe2}$ are both collected by OCOE for the same program with the same event, $dist_{ref}$ is a very small value that is close to zero. In contrast, $dist_{mea}$ is larger than $dist_{ref}$. Thus, error reflects the error caused by MLPX.
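To make the error metric concrete, the following is a minimal Python sketch of Equations (1)-(4): a textbook dynamic-programming DTW plus the resulting MLPX error. It is an illustration under our own naming, not the authors' implementation.

```python
import numpy as np

def dtw_distance(s1, s2):
    """Classic dynamic-programming DTW between two 1-D time series
    of possibly different lengths (Equation 1)."""
    n, m = len(s1), len(s2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s1[i - 1] - s2[j - 1])
            # extend the cheapest of the three admissible alignments
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def mlpx_error(s_ocoe1, s_ocoe2, s_mlpx):
    """MLPX error of Equation (4): compare the MLPX series against one OCOE
    reference, normalized by the OCOE-vs-OCOE baseline distance."""
    dist_ref = dtw_distance(s_ocoe1, s_ocoe2)   # Equation (2)
    dist_mea = dtw_distance(s_mlpx, s_ocoe1)    # Equation (3)
    return abs(1.0 - dist_ref / dist_mea) * 100.0
```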

Figure 1 shows the errors caused by MLPX for the event ICACHE.MISSES for 16 programs (the program descriptions are presented in Section IV-B). We can see that the minimum and maximum errors are 8.8% and 43.3%, respectively.


Fig. 2. The outlier and missing value examples in benchmark WordCount. (a) The outliers in the time series of event IDQ.DSB UOPS (the # of uops delivered to the Instruction Decode Queue from the Decode Stream Buffer path). (b) The missing values in the time series of event ICACHE.MISSES (the # of instruction cache misses per 1K instructions).

The average error is 28.3%. For other events, we observe similar or even higher errors.

By carefully analyzing the errors, we find two root causes: outliers and missing values. Figure 2 (a) shows an example of outliers. At the end of the time series of event IDQ.DSB UOPS collected by MLPX, the number of uops is 4.2× the normal numbers collected by OCOE. Such outliers will significantly "improve" the overall results if they are taken into account. On the other hand, Figure 2 (b) illustrates an example of missing values. By using OCOE, we observe a large number of instruction cache misses at the beginning of the time series of event ICACHE.MISSES. This is reasonable because at the beginning of a program execution, the instruction cache is empty (a.k.a. "cold cache") and a large number of misses must happen. However, these instruction cache misses are not observed by MLPX. These examples indicate that MLPX may indeed cause large errors. Moreover, simultaneously measuring more events in MLPX generally exacerbates the problem, as shown in Figure 3. These errors, however, are difficult to remove by event scheduling [33], [34] and estimation algorithms [38] during the sampling procedure. This motivates us to reduce them with data cleaning techniques after the sampling procedure of MLPX, which is very important for mining the CPU big performance data in an off-line manner.

2) Are all events equally important?: Due to the large number of measurable events compared to the available hardware counters (e.g., 207× for HaswellX [29]), MLPX is mandatory, but it comes at a high cost. Figure 3 shows that the MLPX errors increase with more events measured simultaneously on the limited hardware counters (the red line indicates the trend). To lower the errors, the number of events measured by the same counter during a program's execution needs to be limited. However, this means the same program needs to run many times to measure all the events of the processor being used.

On the other side, measuring all events may not be necessary. First, even if we could measure all the events accurately at the same time (an ideal but infeasible scenario), most values would be zeros. Second, the optimization overhead would be extremely high when considering all the events of a processor. In fact, to improve performance, an OS can only leverage a small number of events to provide feedback to a runtime system for optimization [29].

Therefore, it is crucial to identify a subset of events that are



Fig. 3. The error variation with the number of events (represented by the numbers along the X axis) measured simultaneously.

strongly relevant for a given architecture or workload [29]. It can reduce both measurement and optimization overhead. The key to achieving this goal is to quantify the importance of events.

III. COUNTERMINER METHODOLOGY

CounterMiner is a methodology designed to mine the big performance data collected from hardware counters. It reduces the measurement errors of MLPX by leveraging data cleaning techniques rather than traditional event scheduling [33], [34] and estimation algorithms [38]. CounterMiner is complementary to [34], [33], [38] because CounterMiner does the cleaning after (not during) the sampling. Moreover, it quantifies the importance of events and the interactions between two events with respect to performance. Figure 4 shows the workflow of CounterMiner. It consists of four components: data collector, data cleaner, importance ranker, and interaction ranker.

A. Data Collector

The data collector is in charge of sampling the values of events when a program is running. It supports two modes: OCOE and MLPX. The data collector can be any available counter profiling tool such as Perf [41] and Perfmon2 [25]. In this paper, we use the Linux Perf [41] because it is available in all Linux distributions.
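As a concrete illustration of the collection step (not the exact command used in the paper), Linux perf can be driven in interval mode so that each event yields one counter value per interval; when more events are requested than there are programmable counters, perf multiplexes them automatically. The helper function and event names below are generic placeholders.

```python
import subprocess

# Generic perf event aliases used purely for illustration; the paper measures
# raw Intel events such as ICACHE.MISSES and IDQ.DSB_UOPS instead.
EVENTS_MLPX = ",".join([
    "instructions", "cycles", "branch-instructions", "branch-misses",
    "cache-references", "cache-misses", "L1-icache-load-misses",
    "dTLB-load-misses", "iTLB-load-misses", "LLC-load-misses",
])  # 10 events time-sharing ~4 programmable counters, so perf multiplexes

def collect(cmd, events, out_csv, interval_ms=1000):
    """Run `cmd` under perf stat, emitting one CSV line per event per interval."""
    subprocess.run(
        ["perf", "stat", "-e", events, "-I", str(interval_ms),
         "-x", ",", "-o", out_csv, "--"] + cmd,
        check=True)

# Example (hypothetical binary name):
# collect(["./wordcount"], EVENTS_MLPX, "mlpx.csv")
```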

We take the sampled values of an event as a time series since it is important to observe the time-varying behaviors of the event [30]. We formally describe the time series as follows:

$TS_{e_i} = \{V_{i1}, V_{i2}, ..., V_{ij}, ..., V_{in}\}$  (5)

where $TS_{e_i}$ is the time series for the ith event of a program, $V_{ij}$ is the jth value of the ith event, and n is the total number of sampled values in $TS_{e_i}$. The important feature, as well as challenge, of these time series is that their lengths may be significantly different even for the same event of the same program because of the non-deterministic behavior of a modern OS. This property makes storing and analyzing these time series challenging.

We store the collected time series in a database management system (DBMS), which can be any popular DBMS such as MySQL or SQL Server. In CounterMiner, we employ SQLite because it seamlessly integrates with Python, which is the programming language we used for data analysis. In the database, we design a two-level table organization. The first-level tables store information including the name of a program, the names of the measured events, the execution times of the program, and the names of the second-level tables. The second-level tables store the time series of the measured events for a program at each run. Note that these two-level tables may need to be re-initialized when CounterMiner is applied on a different microarchitecture.
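The following sketch shows one possible realization of this two-level organization in SQLite; the table and column names are illustrative, not taken from CounterMiner.

```python
import sqlite3

conn = sqlite3.connect("counterminer.db")
cur = conn.cursor()

# First-level table: one row per (program, run), pointing at the table
# that holds the per-event time series of that run.
cur.execute("""CREATE TABLE IF NOT EXISTS runs (
                   run_id INTEGER PRIMARY KEY,
                   program TEXT,
                   events TEXT,            -- comma-separated event names
                   exec_time_s REAL,
                   series_table TEXT)""")

# Second-level table (one per run): the raw time series, one row per sample.
cur.execute("""CREATE TABLE IF NOT EXISTS run1_series (
                   event TEXT,
                   sample_idx INTEGER,
                   value REAL)""")

cur.execute("INSERT INTO runs (program, events, exec_time_s, series_table) "
            "VALUES (?, ?, ?, ?)",
            ("wordcount", "ICACHE.MISSES,IDQ.DSB_UOPS", 123.4, "run1_series"))
conn.commit()
```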


Fig. 4. Block Diagram of CounterMiner.

B. Data Cleaner

The outliers and missing values of events are the main error sources of the data collected with MLPX. The data cleaner replaces the outliers with normal values and fills in the missing values. To evaluate the accuracy of the data cleaner, we take the values of events collected by OCOE as golden references.

1) Replacing Outliers: To determine outliers, we first perform a rigorous statistical test (see Section IV-C for the testing tool) on the value distributions of all 229 events. We find that the values of only 100 events follow a Gaussian distribution. We further investigate the value distributions of the other 129 events and find that they all show long-tail distributions, but with different levels. In order to characterize these long-tail distributions, we perform the statistical test on them using several different distribution functions such as the logistic and Gumbel (general extreme value - GEV) distributions. We find that the GEV distribution best fits the long-tail distributions, which has been confirmed by [15].

After knowing the value distributions of the events, we design the criterion to replace the outliers. We employ a general technique from data mining to determine outliers by the following equation:

$threshold = mean + n \times std$  (6)

where threshold is the criterion for replacing outliers, mean and std are the mean value and the standard deviation of a series of values of an event, respectively; n is a control variable which needs to be determined according to the distribution or user requirements such as the percentage of data within the threshold. According to [42], if a data series obeys a Gaussian distribution, n equals 3.

Since the values of a large number of events do not obey the Gaussian distribution but instead a long-tail distribution, we need to determine n by controlling the percentage of data within the threshold. In this study, we specify that 99% of the collected data for the events of a program must be within the threshold because, based on our observation, there are not many outliers. Table I shows the percentage of data within the threshold with different n values. We see that when n is 5, the percentages of data within the threshold for all programs exceed 99%; we therefore set n to 5.

When an outlier is found, we use the median value of the interval where the outlier is located to replace the outlier. The interval length is calculated as follows:

$L = \frac{Max(TS_{e_i}) - Min(TS_{e_i})}{Roundup(Sqrt(Count(TS_{e_i})), 0)}$  (7)


Benchmarks | n=3 | n=5
wordcount | 98.8% | 99.3%
pagerank | 98.6% | 99.4%
aggregation | 99.3% | 99.8%
join | 98.9% | 99.8%
scan | 99.3% | 99.8%
sort | 98.9% | 99.5%
bayes | 98.6% | 99.8%
kmeans | 99.1% | 99.8%
DataAnalytics | 99.3% | 99.8%
DataCaching | 99.1% | 99.8%
DataServing | 99.4% | 99.8%
GraphAnalytics | 99.2% | 99.8%
In-mAnalytics | 99.2% | 99.8%
MediaStreaming | 99.3% | 99.8%
WebSearch | 98.8% | 99.7%
WebServing | 99.1% | 99.5%

TABLE I
THE COVERAGE RATE WITH DIFFERENT VALUES OF n.

with $Max(TS_{e_i})$ and $Min(TS_{e_i})$ the maximum and minimum values in $TS_{e_i}$, respectively, $Count(TS_{e_i})$ the number of values in $TS_{e_i}$, Sqrt the square root, and Roundup the round-up operation.
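A minimal sketch of the outlier replacement is shown below. The detection threshold follows Equation (6) with n = 5; for simplicity the sketch replaces an outlier with the median of a fixed-size local neighborhood, whereas the paper derives the interval length from Equation (7), so the window size here is purely an assumption.

```python
import numpy as np

def replace_outliers(ts, n=5, window=11):
    """Detect outliers with the threshold of Equation (6) and replace each one
    with the median of its local neighborhood. NOTE: the paper computes the
    neighborhood ("interval") length via Equation (7); a fixed-size window is
    used here only as a simplification."""
    ts = np.asarray(ts, dtype=float)
    threshold = ts.mean() + n * ts.std()     # Equation (6), n = 5 in the paper
    cleaned = ts.copy()
    half = window // 2
    for i in np.where(ts > threshold)[0]:
        lo, hi = max(0, i - half), min(len(ts), i + half + 1)
        neighbors = ts[lo:hi]
        neighbors = neighbors[neighbors <= threshold]   # ignore nearby outliers
        cleaned[i] = np.median(neighbors) if neighbors.size else np.median(ts)
    return cleaned
```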

2) Filling in Missing Values: To fill in missing values, we first classify the event values into two categories: zero values and non-zero values. Based on our observation, zero values are highly likely to be missing values. However, it is still possible that the values of a certain event at some points are indeed zeros. It is therefore difficult to distinguish them from the missing values. To address this issue, we check the maximum and minimum values of the event in the past. If the minimum value is zero and the maximum value is less than 0.01, we consider that the zero value for the event is not a missing value. The rationale is that if the maximum value of an event is only 0.01, which is very close to zero, then even if the actual value is not zero, the error would not be high.

For the non-zero value category, we employ the KNN (K-Nearest Neighbor) algorithm [43] to fill in missing values. For example, a series of data for ICACHE.MISSES is as follows:

$\{X_1, X_2, X_3, X_4, X_5, 0, X_7, ..., X_i, ..., X_m\}$  (8)

where $X_i$ is the ith value of ICACHE.MISSES, m is the total number of values, and 0 is the missing value. We use KNN regression to calculate the missing value, which is represented by the average value of the k nearest neighbors. The determination of k is important because it affects the accuracy. In this study, we tried several values from 3 to 8 based on [42] and found that k = 5 is accurate enough to represent the missing value.
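The sketch below illustrates this step, assuming that the "nearest neighbors" are the k closest non-zero samples along the time axis; the exact neighbor definition in CounterMiner may differ.

```python
import numpy as np

def fill_missing(ts, k=5, true_zero_max=0.01):
    """Fill suspected missing values (zeros) with the mean of their k nearest
    non-missing neighbors along the time axis (simple KNN regression). Zeros
    are kept as genuine values when the series never exceeds `true_zero_max`,
    mirroring the heuristic described above."""
    ts = np.asarray(ts, dtype=float)
    if ts.min() == 0.0 and ts.max() < true_zero_max:
        return ts.copy()                        # zeros are believed to be real
    filled = ts.copy()
    known = np.where(ts != 0.0)[0]              # indices of trusted samples
    if known.size == 0:
        return filled
    for i in np.where(ts == 0.0)[0]:
        nearest = known[np.argsort(np.abs(known - i))[:k]]
        filled[i] = ts[nearest].mean()
    return filled
```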

C. Importance Ranker

In order to quantify the importance of events with respect to IPC (Instructions Per Cycle), an accurate performance model needs to be constructed. The inputs of the model are event values and the output is IPC, which can be represented by:

$IPC = perf(e_1, e_2, ..., e_i, ..., e_n)$  (9)

where $e_i$ is the value of the ith event, and n is the total number of events. As discussed before, n is typically larger than 100 and up to 1423 for modern processors. For the processors used in this paper, n is 229. It is extremely challenging to build an accurate performance model with such a large number of input parameters. Analytical modeling techniques are not suitable for this case because they allow only a few parameters (e.g., less than 5). Moreover, analytical models are not accurate when dealing with complex cases. Statistical modeling techniques are also not accurate with high-dimensional inputs because they make the unrealistic assumption that the relationships between the input parameters are linear. However, complex nonlinear relationships typically exist between events of modern processors.

To solve this problem, machine learning techniques are applied to construct accurate models with high-dimensional inputs. To mitigate over-fitting, we use an ensemble learning algorithm, Stochastic Gradient Boosted Regression Tree (SGBRT) [44], to construct the model. The key insight is that SGBRT combines a number of tree models in a stagewise manner, where each one reflects a part of the performance. The final model is called the ensemble model. With the performance model constructed, we can leverage the model to quantify the importance of events and their interactions. Clearly, a more accurate performance model results in more precise event importance quantification.
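Since the paper builds its models with scikit-learn (Section IV-C), an SGBRT model can be sketched with GradientBoostingRegressor, where subsample < 1.0 provides the stochastic component. The data shapes and hyperparameters below are illustrative only, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# X: one row per profiled run, one column per event value; y: measured IPC.
# Random data stands in for the real counter database here.
X = np.random.rand(400, 229)
y = np.random.rand(400) + 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# subsample < 1.0 makes the boosting "stochastic" (SGBRT).
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  max_depth=3, subsample=0.8)
model.fit(X_train, y_train)

pred = model.predict(X_test)
err = np.mean(np.abs(y_test - pred) / y_test) * 100   # relative IPC error
print(f"average relative IPC error: {err:.1f}%")
```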

For a single tree T in an ensemble model, one can use $I_j^2(T)$ as a measure of importance for each event $e_j$, which is based on the number of times $e_j$ is selected for splitting the tree, weighted by the squared improvement to the model as a result of each of those splits [45]. This measure of importance is calculated as follows:

$I_j^2(T) = n_t \cdot \sum_{k=1}^{n_t} P^2(k)$,  (10)

where $n_t$ is the number of times $e_j$ is used to split tree T, and $P^2(k)$ is the squared performance improvement to the tree model by the kth split. In particular, P(k) is defined as the relative IPC error, $(IPC_k - IPC_{k-1})/IPC_{k-1}$, after the kth split. If $e_j$ is used as a splitter in R trees in the ensemble model, the importance of $e_j$ to the model equals:

$I_j^2 = \frac{1}{R} \sum_{m=1}^{R} I_j^2(T_m)$.  (11)

To make the results intuitive, the importance of an event is normalized so that the sum across all events adds up to 100%. A higher percentage indicates a stronger influence of the event on performance.

After the importance of each event with respect to performance is obtained, we rank the events in descending order. Then, we remove the 10 least important events and take the remainder as input parameters to construct a performance model by using the SGBRT algorithm again. We conduct the same procedure on the new model to quantify the importance of the remaining events and rank them. This procedure may iteratively repeat several times until we obtain the Most Accurate Performance Model (MAPM) for a program; we call this procedure Event Importance Refinement (EIR). The event importance obtained by MAPM is the most accurate [45].
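A compact sketch of the EIR loop is given below. It uses scikit-learn's feature_importances_ as a stand-in for Equations (10)-(11) and keeps the event subset whose model yields the lowest test error (MAPM); the function name, hyperparameters, and error metric are ours, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def event_importance_refinement(X, y, event_names, drop_per_round=10):
    """Iteratively fit an SGBRT model, rank events by importance, drop the 10
    least important ones, and keep the event set whose model has the lowest
    test error (MAPM)."""
    keep = list(range(X.shape[1]))
    best_err, best_events = float("inf"), None
    while len(keep) > drop_per_round:
        Xk = X[:, keep]
        X_tr, X_te, y_tr, y_te = train_test_split(Xk, y, test_size=0.25)
        m = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                      max_depth=3, subsample=0.8).fit(X_tr, y_tr)
        err = np.mean(np.abs(y_te - m.predict(X_te)) / y_te)
        if err < best_err:
            best_err = err
            best_events = [event_names[i] for i in keep]
        order = np.argsort(m.feature_importances_)        # ascending importance
        keep = [keep[i] for i in order[drop_per_round:]]  # drop the 10 weakest
    return best_events, best_err
```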

D. Interaction Ranker

After we obtain a set of important events, we construct a linear regression model per pair of the events and consider


the residual variance of the model as an indication of interaction intensity. The intuition is that if two microarchitecture events are orthogonal (i.e., they do not interact), the residual variance will be small because the linear model will be able to accurately predict the combined effect of both events. If, on the other hand, the microarchitecture events interact substantially, this will be reflected in the residual variance being significantly larger than zero, because the linear model is unable to accurately capture the combined effect of the event pair. The linear regression model is trained for each pair of important events by setting the values of all other events to their respective means. This process is repeated for each possible event pair. The residual variance or interaction intensity for a particular event pair is computed as:

$v = \sum_{i=1}^{n} (p_i - p)^2$,  (12)

where $p_i$ is the performance (IPC) predicted by the linear regression model, p is the observed performance, and n is the number of predictions. Zero indicates that there is no interaction between the two microarchitecture events, and a higher value indicates a stronger interaction.

To indicate the importance of interactions among all possible pairs, we normalize the interaction intensity against the other pairs as follows:

$I_i = \frac{v_i}{\sum_{j=1}^{n} v_j} \times 100\%$,  (13)

where $I_i$ is the importance of the ith event-pair interaction and $v_i$ is the ith event-pair interaction intensity. After the normalization, we can tell how much more or less important an event pair is compared to another event pair, reflecting the relative interaction intensity of event pairs.
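A plausible realization of the interaction ranker is sketched below: for each pair of important events, a two-input linear model predicts IPC, the residual sum of squares serves as v in Equation (12), and the values are normalized per Equation (13). Function and variable names are ours.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def interaction_intensities(X, y, top_events):
    """For every pair of important event columns in X, fit a linear model of
    IPC from that pair alone (all other events implicitly held at their means,
    i.e., ignored) and use the residual sum of squares as the interaction
    intensity v (Equation 12). Returns percentages per Equation (13)."""
    raw = {}
    for i, j in itertools.combinations(top_events, 2):
        pair = X[:, [i, j]]
        model = LinearRegression().fit(pair, y)
        pred = model.predict(pair)
        raw[(i, j)] = float(np.sum((pred - y) ** 2))   # residual variance v
    total = sum(raw.values())
    return {pair: 100.0 * v / total for pair, v in raw.items()}
```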

IV. EXPERIMENTAL METHODOLOGY

A. Experimental Cluster

Our experimental cluster consists of four Dell servers; one serves as the master node and the other three serve as slave nodes. Each server is equipped with 12 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz eight-core processors with the Haswell-E microarchitecture and 64GB PC3 memory. The OS of each node is SUSE Linux Enterprise Server 12. The OS of the whole cluster is Mesos 1.0. Although our experimental cluster is small compared to a cloud platform, we believe that evaluating CounterMiner in this environment is still sufficient, because it is essentially a performance analysis methodology. In a real cloud platform, CounterMiner can easily work with the Google-Wide Profiler (GWP) to provide more meaningful results. In addition, it can be integrated with cluster management tools such as Quasar [46] and other cloud computing research such as [47]–[49].

Benchmarks | Framework | Input data
wordcount | Spark 2.0 | generated by RandomTextWriter
pagerank | Spark 2.0 | hyperlinks with Zipfian
aggregation | Spark 2.0 | hyperlinks with Zipfian
join | Spark 2.0 | hyperlinks with Zipfian
scan | Spark 2.0 | hyperlinks with Zipfian
sort | Spark 2.0 | generated by RandomTextWriter
bayes | Spark 2.0 | words with Zipfian
kmeans | Spark 2.0 | numbers with Gaussian
DataAnalytics | Hadoop | a Wikimedia dataset
DataCaching | Memcached | a Twitter dataset
DataServing | Cassandra | a YCSB dataset
GraphAnalytics | GraphX | a Twitter dataset
In-mAnalytics | SparkMLlib | a movie-rating dataset
MediaStreaming | Nginx | a synthetic video dataset
WebSearch | Apache Solr | a set of crawled websites
WebServing | Elgg | generated by Faban

TABLE II
THE EXPERIMENTED BENCHMARKS.


Fig. 5. The event cleaning examples. (a) The replaced outliers in the time series of IDQ.DSB UOPS. (b) The filled-in values in the time series of ICACHE.MISSES. MLPX-CLN represents the cleaned time series.


Fig. 6. The measurement error comparison before and after our data cleaning approach is employed.

B. Benchmarks

We employ CloudSuite 3.0 [35] and select eight programs from HiBench with Spark 2.0 [36] (a.k.a. "SparkBench") to evaluate CounterMiner in this study. CloudSuite 3.0 is a benchmark suite for cloud services and it consists of eight applications selected based on their popularity in today's datacenters. HiBench with Spark 2.0 consists of a broad set of MapReduce-like programs implemented with Spark 2.0. The benchmarks are listed in Table II. The benchmarks in CloudSuite use different frameworks while the ones in HiBench employ the same framework but represent four categories of applications including websearch, SQL, machine learning, and microbenchmarks.

C. Modeling Tools

We use Python, a freely available high-level programming language, to perform our SGBRT modeling and KNN modeling. The Python version is 2.7 and we use scikit-learn 0.19.0, a machine learning algorithm library implemented in Python, to construct our performance models. To perform rigorous statistical testing of the value distributions of events, we employ SciPy [50], an open-source software library for mathematics, science, and engineering.


Abbr. | Event | Description
ISF | ILD STALL.IQ FULL | stall cycles due to IQ is full.
ISL | ILD STALL LCP | counts cycles where the decoder is stalled on an inst with a length changing prefix (LCP).
BRE | BR INST EXEC.ALL BRANCHES | counts all near executed branches (not necessarily retired).
IPD | INST RETIRED.PREC DIST | precise inst retired event with HW to reduce effect of PEBS shadow in IP distribution.
BRB | BR INST RETIRED.ALL BRANCHES | counts the number of retired branch instructions.
BMP | BR MISP RETIRED.ALL BRANCHES | mispredicted macro branch instructions retired.
PI3 | PAGE WALKER LOADS.ITLB L3 | counts the number of extended page table walks from ITLB hit in the L3.
ITM | ITLB MISSES.WALK DURATION | counts the number of cycles while PMH (page miss handler) is busy with the page walk.
DSP | DTLB STORE MIS.PDE CACHE MIS | DTLB store misses with low part of linear-to-physical address translation missed.
DSH | DTLB STORE MIS.STLB HIT | store ops that miss the first TLB level but hit the second and do not cause page walks.
MMR | MEM LD UOPS L3 MIS.R M | retired load uops whose data source was remote DRAM.
MUL | MEM UOPS RETIRED.ALL LOADS | load uops retired to architected path with filter on bits 0 and 1 applied.
URA | UOPS RETIRED.ALL | counts the number of micro-ops retired.
URS | UOPS RETIRED.RETIRE SLOTS | counts the number of retirement slots used in each cycle.
MCO | MACHINE CLEARS.MEMORY ORDERING | counts the number of machine clears due to memory order conflicts.
MSL | MEM UOPS RETIRED.STLB MISLD | load uops with true STLB (second level TLB) miss retired to architected path.
MLL | MEM UOPS RETIRED LOCK LOAD | load uops with locked access retired to architected path.
PDM | PAGE WALKER LOADS.DTLB MEMORY | number of DTLB page walker hits in memory.
TFA | TLB FLUSH.STLB ANY | count number of STLB (second TLB) flush attempts.
LAA | LD BLKS PARTIAL.ADDR ALIAS | false dependencies in MOB (memory order buffer) due to partial compare on address.
LSF | LD BLKS.STORE FORWARD | loads blocked by overlapping with store buffer that cannot be forwarded.
BRC | BR INST RETIRED.CONDITIONAL | counts the number of conditional branch instructions retired.
BNT | BR MISP RETIRED.NEAR TAKEN | number of near branch instructions retired that were mispredicted and taken.
LMH | MEM LD UOPS L3 MIRE.R H | retired load uops whose data sources were remote HitM responses.
UEP | UOPS EXECUTED PORT.PORT 0 | cycles during which uops are dispatched from the Reservation Station (RS) to port 0 (on the per-thread basis).
MLH | MEM LOAD UOPS RETIRED.L1 HIT | retired load uops with L1 cache hits as data sources.
MST | MEM UOPS RETIRED.STLB M ST | store uops with true STLB miss retired to architected path.
IM4 | ITLB MISSES.WALK COMPLD 4K | code miss in all TLB levels causes a page walk that completes (4K).
IMC | ITLB MISSES.WALK COMPLD | misses in all ITLB levels that cause completed page walks.
LHN | LD UOPS L3 H R.X N | retired load uops whose data sources were hits in L3 without snoops required.
CAC | CYC ACT.CYC LDM PEND | cycles with pending memory loads.
C2P | CYC ACT.STAL L2 PEND | number of missed L2.
LUO | LSD.UOPS | number of uops delivered by the LSD.
PLM | PAGE WALKER LD.DTLB MEM | number of DTLB page walker loads from memory.
OTS | OTHER ASSISTS.AVX TO SSE | number of transitions from AVX-256 to legacy SSE when penalty applicable.
I4U | IDQ.ALL MITE CYCLES 4 UOPS | counts cycles MITE is delivered four uops. Set Cmask = 4.
ORA | OFFCORE REQUESTS.ALL DATA RD | data read requests sent to uncore (demand and prefetch).
BAA | BACLEARS.ANY | number of front end re-steers due to BPU (Branch Prediction Unit) misprediction.
LRC | L2 RQSTS.CODE RD MISS | number of instruction fetches that missed the L2 cache.
MIE | MOVE ELIMINATION.INT ELIMINATED | number of integer move elimination candidate uops that were eliminated.
ORO | OFCORE REQ OUSTAND.AL D RD | offcore outstanding cacheable data read transactions in SQ (Super Queue) to uncore.
IDU | IDQ.DSB UOPS | the number of uops delivered to IDQ (instruction dispatch queue) from DSB (decode stream buffer) path.
LRA | L2 RQSTS.ALL RFO | counts all L2 store RFO (read for ownership) requests.

TABLE III
EVENT NAMES AND DESCRIPTIONS FOR THE EVENTS THAT APPEAR IN THE 10 MOST IMPORTANT EVENT LIST OF EACH BENCHMARK.


Fig. 7. The measurement error comparison before and after our data cleaning approach is employed when different numbers of events are measured simultaneously by MLPX.

We use scipy.stats.anderson to perform the Anderson-Darling test for data coming from a particular distribution.
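For reference, a minimal usage sketch of the distribution test is shown below. scipy.stats.anderson only supports a fixed set of reference distributions (e.g., 'norm', 'logistic', 'gumbel'); fitting the full GEV family mentioned in Section III-B would rely on scipy.stats.genextreme, which is our assumption here. The synthetic data stands in for an event's sampled values.

```python
import numpy as np
from scipy import stats

# Synthetic long-tailed samples standing in for one event's time series.
values = stats.genextreme.rvs(c=-0.1, size=5000)

# Anderson-Darling test against a normal distribution; 'logistic' and
# 'gumbel' are also supported reference distributions.
result = stats.anderson(values, dist='norm')
print("A-D statistic:", result.statistic)
print("critical values:", result.critical_values,
      "at significance levels (%):", result.significance_level)

# For a full GEV fit, estimate shape/location/scale by maximum likelihood.
shape, loc, scale = stats.genextreme.fit(values)
print("GEV fit:", shape, loc, scale)
```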

V. RESULTS AND ANALYSIS

A. Error Reduction

Figure 5 shows the cleaning results for the outlier and missing value examples shown in Figure 2. MLPX-CLN represents the cleaned time series for the corresponding events. The benchmark is wordcount in these two examples. Figure 5 (a) illustrates that the outliers are correctly replaced and Figure 5 (b) shows that most missing values are filled in.

Figure 6 compares the measurement errors before (blue bars) and after (red bars) applying our data cleaning techniques on the time series of ICACHE.MISSES for the sixteen benchmarks. We see that our data cleaner significantly reduces the errors caused by MLPX. In particular, the average error is reduced from 28.3% to 7.7% thanks to data cleaning.

Figure 7 illustrates the behavior of the data cleaner when we increase the number of events measured by MLPX at a time. We make several interesting observations. 1) The data cleaner significantly reduces the errors caused by MLPX in each case. With 10 events, the error is reduced to only 5.3%. 2) The data cleaner accurately follows the error trend when the number of simultaneously measured events increases. 3) Although no error is larger than 30% in all cases, some errors are high, such as 23.6% in the case of 24 events. This indicates that we cannot measure too many (e.g., 24) events at the same time even with the data cleaner. As a general recommendation, the number should not be larger than 20.

B. Important Events

As mentioned in Section III-C, the event importance obtained by MAPM (the most accurate performance model) is the most accurate [45]. We therefore perform the EIR (event importance refinement) procedure, aiming to obtain MAPM, before



Fig. 8. The error variation of the models when we reduce the number of events used as their inputs for the HiBench benchmarks. The X axis represents the numbers of events.

we show the results for important microarchitecture events. During EIR, we employ a number of training examples (m) to train the model and use one-quarter of m unseen test examples to evaluate the model accuracy. The error of the models is defined as follows:

$err = \frac{|IPC_{meas} - IPC_{pred}|}{IPC_{meas}} \times 100\%$  (14)

where $IPC_{meas}$ is the measured IPC of a program and $IPC_{pred}$ is the IPC predicted by the performance model.

Figure 8 shows the error variation when we perform the EIR for the HiBench benchmarks. We see that considering more events may not necessarily result in higher performance model accuracy. For the experimented processors and benchmarks, the average error is 14% when we take all 229 events as the model inputs, while the lowest average error is only 6.3% when around 150 events are taken as the model input parameters. This indicates that the exhaustive list of events of modern processors may contain a large number of noisy events. Processor vendors could leverage the proposed approach to systematically select proper events for their processors.

However, the accuracy of the performance models decreases when we further reduce the number of input events after the models achieve their highest accuracy (e.g., with 150 events in this study), as shown in Figure 8. When the number of events decreases to 99, the average performance model error increases to 9.6%, which is still very low. When we decrease the number to 59, the average error further increases to 14%, which is the same as that of the models with all 229 events. This implies that using only 59 events can achieve the same results of performance analysis as using 229 events. The benchmarks from CloudSuite show similar results; we therefore do not show the figure due to limited space.

Figures 9 and 10 show the importance ranking of events obtained by MAPM for the benchmarks from HiBench and from CloudSuite, respectively. The Y axis represents event importance and the X axis denotes the abbreviations of events, which are shown in Table III. Due to space limits, we show the 10 most important events for each benchmark. We highlight four key findings as follows.

First, the importance of one to three events of a benchmark is significantly higher than that of the other events of the same benchmark, and this is true for all benchmarks, both HiBench and CloudSuite. For example, the three most important events of the benchmark wordcount are ISF, BRE, and ORA. Their importance exceeds 5% while that of the other events is less than 2.2%. We call this phenomenon the one-three significantly more important law (one-three SMI law). Note that this is different from the Pareto principle because the accumulated importance of 20% of the events is not around 80%. The one-three SMI law indicates that there is always an opportunity to optimize cloud programs significantly more efficiently by first tuning the parameters related to the top one to three events rather than by tuning other ones. We will demonstrate this in a case study in Section V-D.

Second, the importance rank of events may vary across benchmarks. For instance, the most important event for wordcount is ISF while that for pagerank is BRE. This indicates that different benchmarks have different characteristics at the microarchitecture level.

Third, the common most important events for all the experimented cloud benchmarks are related to the instruction queue, branches, TLBs (instruction TLB, data TLB, and second-level TLB), memory loads, and remote memory or cache accesses. The second insight indicates that application-specific tuning is needed at the application level whereas the third one indicates that common optimization approaches for different applications are also effective at lower levels, e.g., the microarchitecture or compiler. For example, enlarging the instruction queue of processors used in clouds or enhancing the performance of the memory sub-system of cloud servers may improve the performance of most cloud services significantly, because our results show that ISF (stall cycles due to the instruction queue being full) is the most important event for most experimented benchmarks.

Fourth, we surprisingly find that the eight benchmarks from HiBench show more diversity than those from CloudSuite based on the 10 most important events of each benchmark. Only four important events (MUL, MLL, DSP, and DSH) of the CloudSuite benchmarks are not included in those of the eight HiBench benchmarks, while thirteen important events (ORA, URA, URS, BRC, BAA, LRC, IMC, IM4, CAC, ORO, IDU, LRA, and OTS) of the HiBench programs are not in those of the eight CloudSuite benchmarks. The common belief is that the benchmarks from CloudSuite should be more diverse than the ones from HiBench because the CloudSuite benchmarks use different frameworks such as Hadoop, Spark, and Memcached while the HiBench benchmarks only use Apache Spark. Our counter-intuitive results indicate that, to achieve application diversity, different frameworks may not be more important than the algorithms and code of the benchmarks. Moreover, our results indicate that more diverse benchmarks need to be included in CloudSuite 3.0.

C. Important Event Interactions

Figures 11 and 12 show the interaction intensity ranks for the eight Spark benchmarks from HiBench and all the benchmarks from CloudSuite, respectively. Again, we only show the 10 most important interactions of event pairs. The Y axis represents the importance of the interaction intensity of event pairs and the X axis denotes the event pairs.


Fig. 9. The importance rank of events for the eight Spark benchmarks from HiBench when we employ the events which can construct the most accurate performance models. The Y axis represents the importance of events and the X axis denotes the abbreviations of the events.


Fig. 10. The importance rank of events for the eight benchmarks from CloudSuite when we employ the events which can construct the most accurate performance models. The Y axis represents the importance of events and the X axis denotes the abbreviations of the events.


Fig. 11. The interaction rank of the important event pairs for the eight Spark benchmarks from HiBench. The Y axis represents the importance of interaction intensity of event pairs. The X axis represents abbreviations of event pairs; XXX-YYY denotes an event pair.


Fig. 12. The interaction rank of the important event pairs for the benchmarks from CloudSuite. The Y axis represents the importance of interaction intensity of event pairs. The X axis represents abbreviations of event pairs; XXX-YYY denotes an event pair.


Fig. 13. The interaction rank of Spark configuration parameter and event pairs. The Y axis represents the importance of the interaction of configuration parameter and event pairs, and the X axis denotes the abbreviations of the parameter and event pairs. In XXX-YYY, XXX represents an event and YYY represents a configuration parameter.

First, we see that all benchmarks have one or two dominant pairs of events which interact with each other more strongly than the other event pairs. This indicates that we can focus on analyzing the dominant interaction pairs with a limited time budget. Second, the branch-related events interact strongly with other events. In the 160 most important interaction pairs for the 16 benchmarks (10 most important interaction pairs for each), one branch-related event is involved in 98 interaction pairs and two branch-related events are involved in 36 interaction pairs. This means that 83.4% of the 160 most important interaction pairs contain branch-related events. These results imply that branch-related events are critical in cloud computing environments. Moreover, the pair BRB-BMP (see Table III) appears in 12 of the 16 experimented benchmarks and is ranked as the most important interaction pair in 10 benchmarks (wordcount, pagerank, join, kmeans, DataCaching, DataServing, In-memoryAnalytics, MediaStreaming, WebSearch, and WebServing). This indicates that the number of successfully retired branch instructions (BRB) and that of mispredicted but finally retired branch instructions (BMP) interact strongly in most benchmarks. This is because a small BRB surely results in a small BMP, and a large BMP is definitely caused by a large BRB.
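As a small illustration of how such counts can be obtained, the sketch below tallies how many top interaction pairs contain one or two branch-related events; the pair list and the set of branch-related abbreviations are placeholders, not the paper's full data.

```python
# Hypothetical tally: how many top interaction pairs involve branch-related
# events; both the branch-event set and the pair list are illustrative.
branch_events = {"BRE", "BRB", "BMP", "BRC"}
top_pairs = [("BRB", "BMP"), ("ISF", "BRB"), ("ORA", "BRB"),
             ("MSL", "MST"), ("ISF", "PI3")]
one_branch = sum(1 for a, b in top_pairs if (a in branch_events) ^ (b in branch_events))
two_branch = sum(1 for a, b in top_pairs if a in branch_events and b in branch_events)
share = (one_branch + two_branch) / len(top_pairs)
print(one_branch, two_branch, f"{share:.1%}")
```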

Another interesting phenomenon is that the events in the dominant interaction pairs of the benchmarks from CloudSuite interact much more strongly with each other than those in the dominant pairs of the benchmarks from HiBench; see Figure 11 and Figure 12. This indicates that a benchmark containing more software tiers results in stronger interactions between events than a benchmark that only implements algorithms. For example, WebServing has four tiers: the web server, the database server, the memcached server, and the clients. The interaction intensity of its dominant interaction pair reaches 64%. In contrast, GraphAnalytics only implements the pagerank algorithm on the Spark library GraphX, and the interaction intensity of its dominant interaction pair is only 19%.

Knowing the importance of interaction pairs is important for performance analysis. First, it can explain why one event value changes significantly when the other event value in the same interaction pair is changed. Second, it can explain why the performance variation is larger when the two event values change at the same time than when only one of them changes.
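To make this notion concrete, the following sketch estimates the interaction intensity of an event pair as the share of performance variance explained by a pairwise interaction term beyond an additive model, measured through the drop in residual variance. It is a simplified illustration of the residual-variance idea, not the paper's exact estimator, and the function name and data are ours.

```python
import numpy as np

def interaction_intensity(a, b, perf):
    """Rough illustration (not the paper's exact estimator): the fraction of
    performance variance explained by an a*b interaction term beyond an
    additive model, measured via the drop in residual variance."""
    X_add = np.column_stack([np.ones_like(a), a, b])         # additive model
    X_int = np.column_stack([np.ones_like(a), a, b, a * b])  # plus interaction term
    def resid_var(X):
        coef, *_ = np.linalg.lstsq(X, perf, rcond=None)
        return np.var(perf - X @ coef)
    return max(resid_var(X_add) - resid_var(X_int), 0.0) / np.var(perf)

# Toy usage with synthetic samples of two events and a performance metric.
rng = np.random.default_rng(1)
a, b = rng.random(500), rng.random(500)
perf = a + b + 2 * a * b + rng.normal(0, 0.05, 500)
print(f"{interaction_intensity(a, b, perf):.1%}")
```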

D. Case Study

This case study shows a usage example of CounterMiner. After we know the important events, we first leverage our interaction intensity quantification approach to determine which configuration parameters of the Spark framework strongly interact with the important events. We then tune two configuration parameters, say A and B, which tightly correlate with a more important and a less important event, respectively. Finally, we observe the performance variation when we tune the two parameters. Note that the default values of Spark configuration parameters can be found at [51].

Figure 13 shows the importance of interactions between a Spark configuration parameter and an event. The Spark configuration parameter names and their abbreviations are shown in Table IV. For each benchmark, we see that there exist one or two pairs of a Spark configuration parameter and an event whose interaction intensities are much stronger than those of the other pairs. This implies that we should tune the configuration parameter in the strongest interaction pair first, which more likely leads to more performance gain. Second, the most important interaction pair between a Spark configuration parameter and an event varies across benchmarks. This is because of the different characteristics of the benchmarks, and it indicates that, to optimize performance efficiently, different configuration parameters should be tuned first for different programs.

TABLE IV
THE NAMES AND ABBREVIATIONS OF THE SPARK CONFIGURATION PARAMETERS THAT INTERACT STRONGLY WITH THE IMPORTANT EVENTS.

Abbr.  Configuration Parameter
bbs    spark.broadcast.blockSize
dpl    spark.default.parallelism
dmm    spark.driver.memory
exc    spark.executor.cores
exm    spark.executor.memory
ics    spark.io.compression.snappy.blockSize
kbf    spark.kryoserializer.buffer
kbm    spark.kryoserializer.buffer.max
mmf    spark.memory.fraction
nwt    spark.network.timeout
rdm    spark.reducer.maxSizeInFlight
sfb    spark.shuffle.file.buffer
ssb    spark.shuffle.sort.bypassMergeThreshold


Fig. 14. Execution time (in seconds) optimization for sort by tuning bbs and nwt. bbs tightly correlates with the most important event (ORO) of sort, while nwt tightly correlates with the less important event I4U. See Table IV for bbs and nwt, and Table III for ORO and I4U.


Fig. 15. An illustration of profiling times. ci denotes the value of the ith configuration parameter; t represents the execution time of a benchmark; vm denotes the mth value of a certain event.

We study an example to demonstrate how to optimize the performance of Spark programs by using our event importance quantification. As shown in Figure 13, the most important interaction pair of the benchmark sort is ORO-bbs, and Figure 9 shows that ORO is the most important event of sort. Looking at Table IV, we know that bbs corresponds to spark.broadcast.blockSize. We then choose an event not in the list of the 10 most important events and find the Spark configuration parameter that is tightly correlated with it. In this example, we choose I4U and the corresponding configuration parameter is nwt (spark.network.timeout). We tune spark.broadcast.blockSize and spark.network.timeout separately and compare the performance variation. Figure 14 shows the results. We see that the execution time reduction is significantly larger when tuning bbs than when tuning nwt. Specifically, the average execution time variation is 111.3% when tuning bbs while it is only 29.4% when tuning nwt. These results confirm that CounterMiner can indeed provide the “handle” for users to optimize performance more quickly and efficiently.
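The tuning step itself can be as simple as sweeping the identified parameter and timing each run. The sketch below shows a hypothetical sweep of spark.broadcast.blockSize for a sort job; the driver script name and the value grid are placeholders and are not taken from the paper.

```python
import subprocess
import time

# Hypothetical sweep of spark.broadcast.blockSize (bbs) for a sort job.
# "sort_job.py" is a placeholder driver; the block-size grid is illustrative.
for bbs in ["2m", "4m", "8m", "16m", "32m"]:
    start = time.time()
    subprocess.run(["spark-submit",
                    "--conf", f"spark.broadcast.blockSize={bbs}",
                    "sort_job.py"],
                   check=True)
    print(f"bbs={bbs}: {time.time() - start:.1f} s")
```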

Nevertheless, one may think that it is unnecessary to identify the important parameters of Spark programs by using our event importance quantification (method A), and that one can instead quantify the importance of the parameters directly by using our importance ranker (method B). We argue that this is not the case, because method B takes much longer than method A. We use Figure 15 to illustrate the two methods. To quantify the importance of either configuration parameters or events, we need to collect a number of training examples.



Fig. 16. The importance rank of events for the co-located workloads ’DataCaching + DataCaching’ and ’DataCaching + GraphAnalytics’.

For method B, we have to run a benchmark k times with k different configurations to collect k training examples, because we can collect the execution time of a program only after it completes its execution. In contrast, for method A, we can collect m training examples for an event during one execution of a benchmark, because we sample a number of values for the event during one run of the benchmark with a certain configuration. Method A therefore needs a much smaller number of benchmark runs than method B.
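For example, one way to obtain many training examples per event from a single run is interval-based counting with a profiler such as perf, as sketched below; the event list and the benchmark command are placeholders rather than the exact setup used in the paper.

```python
import subprocess

# Illustrative only: one benchmark run under multiplexed counting, with perf
# emitting counts every second; each interval yields one training example per
# event (method A). Event names and the command are placeholders.
events = "branch-misses,cache-misses,dTLB-load-misses,iTLB-load-misses"
subprocess.run(["perf", "stat",
                "-e", events,
                "-I", "1000",      # print counter deltas every 1000 ms
                "-x", ",",         # CSV-style output
                "-o", "samples.csv",
                "--", "./run_benchmark.sh"],
               check=True)
```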

Taking pagerank as an example, we have to run it 6000 times to collect the 6000 training examples needed to build a performance model with around 90% accuracy by method B, and then identify the important configuration parameters. For method A, we only run the benchmark 60 times to build a performance model as a function of events with 90% accuracy. To find the tightly coupled configuration parameter and event pairs, we need to additionally run the benchmark 1520 times. Therefore, we need to run pagerank 1580 times in total to identify its important configuration parameters by method A, which is only about a quarter of the runs needed by method B.

E. Co-located Workloads

We now demonstrate how to use CounterMiner with co-located benchmarks running on a shared cluster, which is a typical scenario in cloud computing environments. We consider two cases: 1) an application is submitted to the cluster while the same application is already running; 2) an application is submitted to the cluster while another application is running. These are two typical ways people use cloud platforms. In this study, we use ’DataCaching + DataCaching’ and ’DataCaching + GraphAnalytics’ to demonstrate the first and the second case, respectively.

Figure 16 shows the importance ranking of events for the two cases. Note that CounterMiner cannot show the importance ranking for individual benchmarks in this context because hardware counters and events are resources shared among the co-located benchmarks. The left part shows the importance ranking of events for ’DataCaching + DataCaching’. Compared with Figure 10, we see that the most important event is still ISF with a similar importance of 3.7%. Moreover, the order and identity of the other top 9 important events of ’DataCaching + DataCaching’ are only slightly different from those of ’DataCaching’. This indicates that the two ’DataCaching’ programs do not interfere with each other severely.

In contrast, we see that the importance order and the identity of the important events of ’DataCaching + GraphAnalytics’ are significantly different from those of both ’DataCaching’ and ’GraphAnalytics’. This indicates that ’GraphAnalytics’ churns the execution of ’DataCaching’ severely. More interestingly, 6 L2 cache related events are ranked in the top 10 important events for ’DataCaching + GraphAnalytics’, whereas no L2 cache related events appear in the 10 most important events of either ’DataCaching’ or ’GraphAnalytics’ alone. This indicates that the mixed running of these two benchmarks causes a lot of L1 cache misses for both the instruction and data caches, which should be avoided.

As observed above, we see that CounterMiner can capture not only significant churns but also small ones in the context of co-located workloads running on cloud platforms.

VI. RELATED WORK

A. Counter Data Management and Analysis

Google recently developed a profiling infrastructure, named Google-Wide Profiling (GWP), to provide performance insights for cloud applications [7]. GWP employs a two-level sampling technique (sampling machines, and sampling time intervals within a machine for profiling) to collect counter data in Google data centers. Huck et al. first developed a framework to manage performance data [28] and then proposed to leverage statistical techniques such as clustering to analyze the performance data of parallel machines [52]. Ahn et al. proposed to use statistical techniques such as PCA (Principal Component Analysis) to extract important features from performance counters [53]. CounterMiner differs from these studies in two ways: 1) CounterMiner proposes data cleaning techniques to clean the counter data collected by MLPX; 2) CounterMiner not only extracts important events but also directly quantifies the importance of an event with respect to performance. The related studies that use PCA or random linear projection can implicitly reveal the important events in the form of principal components or projected metrics, but they cannot explicitly quantify how important an event is. This hinders directly leveraging the important events to optimize application performance.

B. Error Reduction

The measurement errors caused by MLPX have been observed for nearly two decades [29]–[31], [33], [34], [38], [54]–[56], and several approaches have been proposed to reduce them. In MLPX, the values of the unsampled time intervals of an event are usually estimated by linearly interpolating between the predecessor and successor intervals. Mathur et al. tried to develop a fine-grained estimation algorithm which divides the time interval into several sub-intervals. They found that performing a linear interpolation estimation for each sub-interval results in better accuracy [38]. Weaver et al. found that the experimental setup significantly affects the accuracy of measurements by hardware counters, and they therefore provided corresponding suggestions to reduce the errors [31].
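To make the linear-interpolation estimate mentioned above concrete, the following sketch fills in the intervals during which an event was not scheduled on a counter using its nearest sampled neighbors; the interval indices and counts are made-up data.

```python
import numpy as np

# Made-up data: an event was counted only in intervals 0, 3, 6, and 9; the
# remaining intervals are estimated by linear interpolation between the
# nearest sampled neighbors, as commonly done in MLPX.
intervals  = np.arange(10)
sampled_at = np.array([0, 3, 6, 9])
counts     = np.array([120.0, 180.0, 150.0, 200.0])
estimated  = np.interp(intervals, sampled_at, counts)
print(estimated)
```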

Recently, Lim et al. proposed a scheduling algorithm that schedules n events on m counters (n > m), as opposed to the traditional round-robin algorithm, to improve the measurement accuracy [34]. The key idea is to monitor the most recent three values of an event to determine whether another event should be scheduled on the counter. If the values of an event are not significantly different, another event will be scheduled, and vice versa. Dimakopoulou et al. found that the measurement error of MLPX increases when Intel hyper-threading is enabled, and they proposed a dynamic event scheduling algorithm based on graph matching to reduce the errors [33]. These studies try to reduce errors before or during the performance measurement. In contrast, our approach reduces the errors after the performance measurement has been completed.

C. Counter Applications

1) Workload Characterization: Recently, Kanev et al. leveraged hardware counters to profile a warehouse-scale computer [14] and reported a number of interesting observations. For example, the instruction locality of emerging cloud computing workloads is getting weaker and therefore the instruction cache needs to be redesigned. Ferdman et al. employed hardware counters to characterize a group of scale-out workloads and released CloudSuite [9]. Later on, Yasin et al. performed a deep characterization of CloudSuite by using hardware counters [11], [12]. Jia et al. used performance counters to characterize data analysis workloads in datacenters [8]. Wang et al. characterized big data workloads for internet services [10]. Xiong et al. employed performance counters to characterize big data analysis in the city transportation industry and proposed a transportation big data benchmark suite [13].

2) Architecture and Compiler Optimization: Kozyrakis et al. employed hardware counters and other tools to analyze how large-scale online services use resources in data centers and then provided several insights for server architecture design in data centers [22]. Chen et al. leveraged hardware-event sampling to generate edge profiles for feedback-directed optimization of application runtime performance [19]. Moseley et al. used hardware counters to optimize compilers and showed speedups between 32.5% and 893% on selected regions of SPEC CPU 2006 benchmarks [20].

3) Application Optimization: Chen et al. used hardware counters to observe cache behavior and then proposed a task-stealing algorithm for multi-socket multi-core architectures [5]. Blagodurov et al. developed a user-level scheduling algorithm for NUMA multicore systems under Linux by analyzing information from hardware performance counters [6]. By observing the CPI (Cycles Per Instruction) collected from hardware counters, Zhang et al. proposed a CPU performance isolation strategy for shared compute clusters [15]. Tam et al. first carefully analyzed the L2 cache miss rate behavior of commodity systems and subsequently proposed an algorithm to approximate the L2 miss rate curves, which can be used for online optimizations [16]. He et al. proposed to leverage fractals to approximate the L2 miss rate curves with much lower overhead [17]. Based on hardware counters, Blagodurov et al. proposed an algorithm to manage the contention of NUMA multicore systems [18].

Although CounterMiner does not focus on workload characterization or on optimization for architectures, compilers, and applications, it provides an important stepping stone toward these goals. After CounterMiner cleans the performance counter data, these approaches can achieve better results. After CounterMiner quantifies the importance of microarchitecture events, these approaches can be more efficient.

VII. CONCLUSIONS

This paper proposes CounterMiner, a methodology that enables the measurement and understanding of big performance data with three novel techniques: 1) using data cleaning to improve data quality by replacing outliers and filling in missing values; 2) iteratively quantifying, ranking, and pruning events based on their importance with respect to performance; 3) quantifying interaction intensity between two events by residual variance. Across various applications, experimental results show that CounterMiner reduces the average error from 28.3% to 7.7% when multiplexing 10 events on 4 hardware counters. The real-world case study shows that identifying the important parameters of Spark programs by event importance is much faster than directly ranking the importance of the parameters.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Key Research and Development Program under Grant No. 2016YFB1000204, the Shenzhen Technology Research Project (Grant No. JSGG20160510154636747), the Outstanding Technical Talent Program of CAS, and NSFC under Grants No. 61672511 and 61702495. The research is also partially supported by the National Science Foundation grants NSF-CCF-1657333, NSF-CCF-1717754, NSF-CNS-1717984, and NSF-CCF-1750656. Zhibin Yu is the corresponding author. Contact: [email protected].

REFERENCES

[1] Intel Corporation, “Intel 64 and IA-32 architectures software developer’s manual,” 2017. [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-manual-325462.html

[2] ARM Corporation, “ARM Cortex-A53 MPCore processor technical reference manual,” 2014. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500d/DDI0500D cortex a53 r0p2 trm.pdf

[3] Advanced Micro Devices, Inc., “BIOS and kernel developer’s guide for AMD family 10h processors,” 2013. [Online]. Available: http://support.amd.com/TechDocs/31116.pdf

[4] J. M. May, “MPX: Software for multiplexing hardware performance counters in multithreaded programs,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, 2001.

[5] Q. Chen, M. Guo, and Z. Huang, “CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures,” in Proceedings of the International Conference on Supercomputing, 2012.

[6] S. Blagodurov and A. Fedorova, “User-level scheduling on NUMA multicore systems under Linux,” in Proceedings of the Linux Symposium, 2011.

[7] G. Ren, E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt, “Google-wide profiling: A continuous profiling infrastructure for data centers,” IEEE Micro, vol. 30, no. 4, pp. 65–78, 2010.

[8] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, “Characterizing data analysis workloads in data centers,” in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2013.

[9] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.


[10] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “BigDataBench: A big data benchmark suite from internet services,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2014.

[11] A. Yasin, Y. Ben-Asher, and A. Mendelson, “Deep-dive analysis of the data analytics workload in CloudSuite,” in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2014.

[12] A. Yasin, “A top-down method for performance analysis and counters architecture,” in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014.

[13] W. Xiong, Z. Yu, L. Eeckhout, Z. Bei, F. Zhang, and C. Xu, “Shenzhen transportation system (SZTS): A novel big data benchmark suite,” Journal of Supercomputing, vol. 72, pp. 4337–4364, 2016.

[14] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a warehouse-scale computer,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2015.

[15] X. Zhang, E. Tune, R. Hagmann, R. Jnagal, V. Gokhale, and J. Wilkes, “CPI2: CPU performance isolation for shared compute clusters,” in Proceedings of the European Conference on Computer Systems (EuroSys), 2013.

[16] D. K. Tam, R. Azimi, L. B. Soares, and M. Stumm, “RapidMRC: Approximating L2 miss rate curves on commodity systems for online optimizations,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009.

[17] L. He, Z. Yu, and H. Jin, “FractalMRC: Online cache miss rate curve prediction on commodity systems,” in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012.

[18] S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova, “A case for NUMA-aware contention management on multicore systems,” in Proceedings of the USENIX Annual Technical Conference (ATC), 2011.

[19] D. Chen, N. Vachharajani, R. Hundt, S.-W. Liao, V. Ramasamy, P. Yuan, W. Chen, and W. Zheng, “Taming hardware event samples for FDO compilation,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2010.

[20] T. Moseley, D. Grunwald, and R. Peri, “OptiScope: Performance accountability for optimizing compilers,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2009.

[21] Z. Wang and M. F. O’Boyle, “Mapping parallelism to multi-cores: A machine learning based approach,” in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2009.

[22] C. Kozyrakis, A. Kansal, S. Sankar, and K. Vaid, “Server engineering insights for large-scale online services,” IEEE Micro, vol. 30, no. 1, pp. 8–19, 2010.

[23] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci, “A portable programming interface for performance evaluation on modern processors,” The International Journal of High Performance Computing Applications, vol. 14, pp. 189–204, 2000.

[24] Intel, “Intel VTune Amplifier,” 2017. [Online]. Available: https://software.intel.com/en-us/intel-vtune-amplifier-xe/

[25] Perfmon2 Team, “Perfmon2: The hardware-based performance monitoring interface for Linux,” 2017. [Online]. Available: http://perfmon2.sourceforge.net/docs v4.html

[26] OProfile Team, “OProfile,” 2017. [Online]. Available: http://oprofile.sourceforge.net/news/

[27] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, “HPCToolkit: Tools for performance analysis of optimized parallel programs,” Concurrency and Computation: Practice and Experience, pp. 1–7, 2008.

[28] K. A. Huck, A. D. Malony, R. Bell, and A. Morris, “Design and implementation of a parallel performance data management framework,” in Proceedings of the International Conference on Parallel Processing (ICPP), 2005.

[29] G. Zellweger, D. Lin, and T. Roscoe, “So many performance events, so little time,” in Proceedings of the ACM Asia-Pacific Workshop on Systems (APSys), 2016.

[30] T. Mytkowicz, P. F. Sweeney, M. Hauswirth, and A. Diwan, “Time interpolation: So many metrics, so few registers,” in Proceedings of the IEEE International Symposium on Microarchitecture, 2007.

[31] V. M. Weaver and S. A. McKee, “Can hardware performance counters be trusted,” in Proceedings of the IEEE International Symposium on Workload Characterization, 2008.

[32] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney, “Producing wrong data without doing anything obviously wrong,” in Proceedings of the 14th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, 2009.

[33] M. Dimakopoulou, S. Eranian, N. Koziris, and N. Bambos, “Reliable and efficient performance monitoring in Linux,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2016.

[34] R. V. Lim, D. Carrillo-Cisneros, W. Y. Alkowaileet, and I. D. Scherson, “Computationally efficient multiplexing of events on hardware counters,” in Proceedings of the Ottawa Linux Symposium, 2014.

[35] D. Ustiugov, Z. Tian, M. Sutherland, A. Pourhabibi, H. Kassir, S. Gupta, M. Drumond, A. Daglis, M. Ferdman, and B. Falsafi, “CloudSuite 3.0: A benchmark suite for cloud services,” 2017. [Online]. Available: http://cloudsuite.ch//pages/download/

[36] Intel, “SparkBench: The big data micro benchmark suite for Spark 2.0,” 2016. [Online]. Available: https://github.com/intel-hadoop/HiBench/blob/master/docs/run-sparkbench.md

[37] T. Moseley, N. Vachharajani, and W. Jalby, “Hardware performance monitoring for the rest of us: A position and survey,” in Proceedings of the IFIP International Conference on Network and Parallel Computing (NPC), 2011.

[38] W. Mathur and J. Cook, “Toward accurate performance evaluation using hardware counters,” in Proceedings of the ITEA Modeling and Simulation Workshop, 2003.

[39] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.

[40] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD), 1994.

[41] Linux community, “perf: Linux profiling with performance counters,” 2015. [Online]. Available: https://perf.wiki.kernel.org/index.php/MainPage#perf: Linux profiling with performance counters

[42] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. New York: Morgan Kaufmann, 2012.

[43] N. Altman, “An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.

[44] J. H. Friedman, “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis, vol. 38, no. 4, pp. 367–378, October 2002.

[45] J. H. Friedman and J. J. Meulman, “Multiple Additive Regression Trees with Application in Epidemiology,” Statistics in Medicine, vol. 22, no. 9, pp. 1365–1381, May 2003.

[46] C. Delimitrou and C. Kozyrakis, “Quasar: Resource-efficient and QoS-aware cluster management,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 1–17.

[47] ——, “Paragon: QoS-aware scheduling for heterogeneous datacenters,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013, pp. 1–12.

[48] ——, “HCloud: Resource-efficient provisioning in shared cloud systems,” in Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016, pp. 1–15.

[49] ——, “Bolt: I know what you did last summer... in the cloud,” in Proceedings of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017, pp. 1–15.

[50] SciPy Team, “SciPy: An open-source software for mathematics, science, and engineering,” 2018. [Online]. Available: https://docs.scipy.org/doc/scipy/reference/index.html

[51] Apache Spark Team, “Spark configuration,” 2018. [Online]. Available: https://spark.apache.org/docs/latest/configuration.html

[52] K. A. Huck and A. D. Malony, “PerfExplorer: A performance data mining framework for large-scale parallel computing,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2005.

[53] D. H. Ahn and J. S. Vetter, “Scalable analysis techniques for microprocessor performance counter metrics,” in Proceedings of the International Conference on Supercomputing (ICS), 2002.


[54] J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra, H. You, and M. Zhou, “Experiences and lessons learned with a portable interface to hardware performance counters,” in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2003.

[55] G. Ammons, T. Ball, and J. Larus, “Exploiting hardware performance counters with flow and context sensitive profiling,” ACM SIGPLAN Notices, vol. 32, 1999.

[56] V. M. Weaver, D. Terpstra, and S. Moore, “Non-determinism and overcount on modern hardware performance counter implementations,” in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2013.