Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Hikmet Aras 2006720612


Main Papers

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design, 2004

Throughput-Oriented Scheduling on Chip Multithreading Systems, 2004

Authors:
– Alexandra Fedorova, Harvard University and Sun Microsystems
– Margo Seltzer, Harvard University
– Christopher Small, Sun Microsystems
– Daniel Nussbaum, Sun Microsystems


Agenda

– Introduction
– Performance Impacts of Shared Resources on CMTs
– Implementation
– Related Work
– Conclusion


Part I - Introduction


What industry needs...

Modern applications (application servers, OLTP systems) cannot fully utilize the processor pipeline.

– Multiple threads of control, executing short integer operations.
– Frequent dynamic branches.
– Poor cache locality and branch prediction accuracy.

SPEC CPU reported the utilization could be as low as 19% for some configurations.

CMPs and MTs are designed to solve this problem.


CMP and MT

CMP: Chip Multiprocessing
– Multiple processor cores on a single chip, allowing more than one thread to be active at a time and improving utilization of the chip.

MT: Hardware Multithreading
– Has multiple sets of registers and interleaves the execution of threads, either by switching between them on each cycle or by executing multiple threads simultaneously (using different functional units).


Examples

CMPs
– IBM’s Power4
– Sun’s UltraSPARC IV

MTs
– Intel’s hyper-threaded Pentium IV
– IBM’s RS64 IV


What is CMT?

CMT (Multithreaded Chip Multiprocessor) is a new generation of processors that exploit thread-level parallelism to mask memory latency in modern workloads.

CMT = CMP + MT

Studies have demonstrated the performance benefits of CMTs, and vendors are planning to ship their CMTs in 2005.

So it is important to understand how to best take advantage of CMTs.


CMTs share resources...

A CMT may be equipped with many simultaneously active thread contexts. So, competition for shared resources is intense.

It is important to understand the conditions leading to performance bottlenecks on shared resources, and avoid performance degradation.


CMT Simulation

CMT systems do not exist yet, so we work with a CMT system simulator toolkit (similar to Simics).

The simulated CPU core has a simple RISC pipeline, with one set of functional units.

Each core has a TLB, L1 data and instruction caches.

L2 cache is shared by all CPU cores in the chip.


CMT Simulation

A schematic view of the simulated CMT Processor


CMT Simulation

Accurately simulates:
– Pipeline contention
– L1 and L2 caches
– Bandwidth limits on crossbar connections between L1 and L2 caches
– Bandwidth limits on the path between L2 cache and memory

1 to 4 cores, each including 4 hardware contexts, 8KB-16KB L1 data and instruction caches, and L2 cache.


Part II - Performance Impacts of Shared Resources on CMTs


Shared Resources on CMTs

We will analyze the potential performance bottlenecks on:
– Processor Pipeline
– L1 Data Cache
– L2 Cache


Processor Pipeline

Threads differ in how they use the pipeline:
– Compute-intensive vs. memory-intensive threads.


Processor Pipeline

Co-scheduling compute-intensive threads.

Each thread is able to issue an instruction on every cycle, but it cannot do so because the processor has to switch among all the threads.


Processor Pipeline

Co-scheduling memory-intensive threads

The pipeline is under-utilized.


Processor Pipeline

Co-scheduling compute-intensive and memory-intensive threads

The scheduler can classify a thread by measuring its single-threaded CPI (cycles per instruction):

– Low CPI -> high pipeline utilization -> compute-intensive
– High CPI -> low pipeline utilization -> memory-intensive


Processor Pipeline

A simple example of the performance gain from such scheduling:

– A machine with 4 processor cores, each with 4 thread contexts.
– Schedule 16 threads with CPIs of 1, 6, 11 and 16, four threads in each CPI group.
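The benefit of mixing thread types can be illustrated with a deliberately idealized model. The sketch below assumes (this is not taken from the simulator) that a thread with single-threaded CPI c demands 1/c issue slots per cycle and that a core with one set of functional units retires at most one instruction per cycle. The names core_ipc and total_ipc are hypothetical.

```python
# Toy throughput model for the 4-core, 4-context example.
# Assumed model, not the simulator used in the study.

def core_ipc(cpis):
    """Idealized IPC of one single-issue core running threads with the given CPIs."""
    # Each thread demands 1/CPI issue slots per cycle; the core offers at most 1.
    return min(1.0, sum(1.0 / c for c in cpis))

def total_ipc(schedule):
    """Sum of per-core IPCs for a schedule given as a list of per-core CPI lists."""
    return sum(core_ipc(core) for core in schedule)

# Schedule A: threads with equal CPI share a core.
grouped = [[1] * 4, [6] * 4, [11] * 4, [16] * 4]
# Schedule B: every core gets one thread from each CPI group.
mixed = [[1, 6, 11, 16] for _ in range(4)]

if __name__ == "__main__":
    print(f"grouped: {total_ipc(grouped):.2f} IPC")
    print(f"mixed:   {total_ipc(mixed):.2f} IPC")
```

Under these assumptions the grouped schedule saturates only the core running the CPI-1 threads, while the mixed schedule keeps all four pipelines busy.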


Processor Pipeline

• Throughput of four different schedules

• Throughput can be improved with a smart scheduling algorithm.


Processor Pipeline

CPI-based scheduling works only if the threads have widely varying single-threaded CPIs, which is not the case in real environments.


Processor Pipeline

Performance gains from scheduling based on pipeline usage are negligible in this system.

Such scheduling may be more advantageous on SMT processors with multiple functional units, running threads with variable CPIs.


L1 Data Cache

The L1 data cache is small. What happens to hit rates in an MT configuration?

L1 data cache hit rates for the baseline single-threaded and 4-way multithreaded configurations. Data cache size is 8KB.

What happens in different cache sizes?


L1 Data Cache

Pipeline utilization and L1 cache miss rates as a function of Cache Size.


L1 Data Cache

Pipeline utilization as a function of Cache Size for different applications.


L1 Data Cache

Increasing the size of the L1 data cache does not improve performance on MTs. Even though it decreases cache misses, the extra cost is not justified.

The L1 instruction cache is also small, but its hit rates are always high (above 97%), so it does not need to be considered.


L2 Cache

L2 cache is more likely to become a bottleneck when its hit rate decreases.

– Latency (on hyper-threaded Pentium IV)
  A trip from L1 to L2 takes 18 cycles.
  A trip from L2 to memory takes 360 cycles.

– Bandwidth
  The bandwidth between L1 and L2 is typically greater than the bandwidth between L2 and main memory.
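To see why the L2 miss ratio matters so much, the two latencies above can be plugged into a back-of-the-envelope estimate. The sketch assumes a simple additive model in which every L1 miss pays the L1-to-L2 trip and every L2 miss additionally pays the L2-to-memory trip. The function name and the sample miss ratios are illustrative, not measurements.

```python
L1_TO_L2_CYCLES = 18     # latency quoted on the slide
L2_TO_MEM_CYCLES = 360   # latency quoted on the slide

def avg_l1_miss_penalty(l2_miss_ratio):
    """Average cycles paid per L1 miss under an additive latency model (assumption)."""
    return L1_TO_L2_CYCLES + l2_miss_ratio * L2_TO_MEM_CYCLES

if __name__ == "__main__":
    for ratio in (0.0, 0.05, 0.25):   # illustrative values only
        penalty = avg_l1_miss_penalty(ratio)
        print(f"L2 miss ratio {ratio:.2f}: {penalty:.0f} cycles per L1 miss")
```

In this model an L2 miss ratio of only 5% already doubles the average penalty, since 18 + 0.05 × 360 = 36 cycles.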


L2 Cache

A simple experiment:
– Single-core MT machine with 4 contexts, 8KB L1 cache and varying L2 cache size.
– A workload of 8 applications having different cache behaviours.
– On a simulated Solaris operating system.


L2 Cache

Performance is strongly correlated with the L2 cache miss ratio, so our scheduling algorithm should aim to decrease cache misses in L2.


Part III - Implementation


A new Scheduling Algorithm

Balance set scheduling (Denning, 1968)

Basic idea:

– To avoid thrashing, schedule a group of threads whose working set fits in the cache.

– Working set is the amount of data that a thread touches in its lifetime.


Problem with Working-Set Model

Denning’s assumption:
– Working set is accessed uniformly.
– Small working sets have good locality, large ones have poor locality.

Does it apply to modern workloads?


Problem with Working-Set Model

A simple experiment to test it:
– Footprint for gzip: 200K
– Footprint for crafty: 40K
– According to Denning’s assumption, crafty should have better cache hit rates than gzip.
– The experiment showed that this assumption does not hold for modern workloads, so we cannot use the working-set model.


Problem with Working-Set Model

L1 data cache hit rates for crafty and gzip.


A better metric of Locality

Hypothesis: gzip has a larger working set but better cache hit rates than crafty, so it must be accessing its data in smaller chunks.

Reuse distance: the amount of time that passes between consecutive references to the same memory location.
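The definition above can be made concrete with a short sketch that builds a reuse distance histogram from an address trace. It assumes that "time" is measured in memory references, so a distance of 2 means the previous reference to the same location was two references earlier. The function name and the trace are made up for illustration.

```python
from collections import Counter

def reuse_distance_histogram(trace):
    """Return (histogram, cold) for a sequence of memory addresses.

    histogram maps a reuse distance to the number of references with that
    distance; cold counts first-time references, which have no distance.
    """
    last_seen = {}          # address -> index of its most recent reference
    histogram = Counter()
    cold = 0
    for now, addr in enumerate(trace):
        if addr in last_seen:
            histogram[now - last_seen[addr]] += 1
        else:
            cold += 1
        last_seen[addr] = now
    return histogram, cold

if __name__ == "__main__":
    hist, cold = reuse_distance_histogram(["A", "B", "A", "C", "B", "A"])
    print(dict(hist), cold)
```

A program that keeps returning to recently used data concentrates its references at small distances, regardless of how large its total footprint is.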


A better metric of Locality

Reuse distance distributions for crafty and gzip.

The degree of locality is better represented by the reuse distance model than by the working-set model.


Reuse Distance Model

A cache model based on reuse distance distributions. The smaller the reuse distance, the greater the probability of cache hit.

The model needs a reuse distance histogram as input. (At what cost?)

There are two methods for adapting the model to multithreaded environments.


Method #1 : COMB

Combines the reuse distance histograms of several threads:
– Sum the number of references in each bucket.
– Multiply the value of each reuse distance by N, the number of threads.

Accurate, but may be too expensive (consider a machine with 32 contexts and 100 threads).
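The two COMB steps can be sketched as follows, assuming that N is the number of threads whose histograms are being combined and that each histogram is a mapping from reuse distance to reference count. The function name comb is hypothetical.

```python
from collections import Counter

def comb(histograms):
    """Combine per-thread reuse distance histograms (dict: distance -> count)."""
    n = len(histograms)
    combined = Counter()
    for hist in histograms:
        for distance, count in hist.items():
            # With n interleaved threads, each thread's reuse distances are
            # multiplied by n; references landing in the same bucket are summed.
            combined[distance * n] += count
    return combined

if __name__ == "__main__":
    print(dict(comb([{1: 10, 4: 2}, {1: 5, 2: 3}])))
```

The expense comes from having to repeat this combination for every candidate group of threads, and the number of groups grows quickly with the number of contexts and runnable threads.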


Method #2 : AVG

Predict the miss rate of each individual thread as if it ran with a fraction of the total cache:
– Cache = TotalCache / #Threads

Take the average of the miss rates.

Preferable method: less expensive, same accuracy.
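A minimal sketch of AVG follows. The per-thread predictor used here, threshold_miss_rate, is a crude stand-in invented for illustration (a reference misses when its reuse distance is at least the number of cache lines available). It is not the reuse distance model from the papers, and any other predictor can be passed in through the predict parameter.

```python
def threshold_miss_rate(hist, cache_lines):
    """Stand-in predictor (assumption): a reference misses if its reuse
    distance is at least the number of cache lines available to the thread."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    misses = sum(count for distance, count in hist.items() if distance >= cache_lines)
    return misses / total

def avg_miss_rate(histograms, total_cache_lines, predict=threshold_miss_rate):
    """AVG: give each thread TotalCache / #Threads, then average the predictions."""
    share = total_cache_lines / len(histograms)
    rates = [predict(hist, share) for hist in histograms]
    return sum(rates) / len(rates)

if __name__ == "__main__":
    threads = [{2: 8, 100: 2}, {10: 5, 60: 5}]   # made-up histograms
    print(avg_miss_rate(threads, total_cache_lines=64))
```

Each thread's prediction depends only on its own histogram and the cache share, so per-thread results can be reused across candidate groups of the same size, which is why AVG is cheaper than COMB.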


Reuse Distance Model

Actual vs. Predicted miss rates


Balance-Set Scheduling

We adapted the balance set principle to work with the reuse distance model instead of the working set.

When a scheduling decision is to be made:
– Predict the miss rates of all possible groups of threads.
– Schedule the groups whose predicted miss rates are below a threshold.
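The decision step can be sketched as an enumeration over thread groups. The sketch assumes a group size equal to the number of hardware contexts and takes the miss rate predictor as a caller-supplied function. Both candidate_groups and the predict_group_miss_rate hook are hypothetical names.

```python
from itertools import combinations

def candidate_groups(threads, group_size, predict_group_miss_rate, threshold):
    """Return (group, predicted miss rate) pairs whose prediction is below threshold.

    threads: iterable of thread identifiers
    predict_group_miss_rate: callable taking a tuple of threads (hypothetical
        hook, for example backed by the COMB or AVG method)
    """
    result = []
    for group in combinations(threads, group_size):
        rate = predict_group_miss_rate(group)
        if rate < threshold:
            result.append((group, rate))
    # Lowest predicted miss rate first.
    result.sort(key=lambda pair: pair[1])
    return result

if __name__ == "__main__":
    per_thread = {"a": 10, "b": 50, "c": 20}      # made-up miss rates in percent
    mean = lambda group: sum(per_thread[t] for t in group) / len(group)
    print(candidate_groups(per_thread, 2, mean, threshold=25))
```

Enumerating every combination is only practical for small thread counts, so a real scheduler would likely need to prune or sample the groups.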


Balance-Set Scheduling

What is the threshold to be used?


Balance-Set Scheduling

Two policies for selecting the group to schedule:
– PERF: Select the groups with the lowest miss rate (making sure no workload is starved).
– FAIR: Each workload receives an equal share of the processor.
  – Keep track of how many times each workload is selected.
  – In each selection, favor groups that contain the least frequently selected workloads.
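The difference between the two policies can be sketched as two selection functions over a list of (group, predicted miss rate) pairs. This is only an illustration: the starvation guard of PERF is omitted, and the FAIR tie-break by miss rate is an assumption rather than something stated in the slides.

```python
from collections import Counter

def pick_perf(candidates):
    """PERF: choose the candidate group with the lowest predicted miss rate."""
    return min(candidates, key=lambda pair: pair[1])[0]

def pick_fair(candidates, times_selected):
    """FAIR: favor the group whose workloads were selected least often so far,
    breaking ties by predicted miss rate (the tie-break is an assumption)."""
    def key(pair):
        group, rate = pair
        return (sum(times_selected[w] for w in group), rate)
    group = min(candidates, key=key)[0]
    for w in group:
        times_selected[w] += 1   # remember the selection for the next decision
    return group

if __name__ == "__main__":
    candidates = [(("a", "b"), 0.1), (("c", "d"), 0.2)]   # made-up predictions
    counts = Counter()
    print(pick_perf(candidates))
    print(pick_fair(candidates, counts), pick_fair(candidates, counts))
```

With the same candidates offered repeatedly, PERF keeps returning the best group, while FAIR alternates so that the other workloads also get processor time.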


Balance-Set Scheduling

IPC achieved with default, PERF and FAIR schedulers

- Lowest performance gain: 16%, FAIR when L2 = 384KB
- Highest performance gain: 32%, PERF when L2 = 48KB


Balance-Set Scheduling

• L2 cache miss rates are reduced by 20-40%

• Minimum gain is 16%, with the FAIR scheduler at L2 = 384KB; the same gain could otherwise be achieved only by using a 4 times larger L2 cache.


Fairness

By optimizing for cache hit rate, we favor the well-behaved workloads. Is that fair?

All workloads should get an equal share of CPU in a fair environment.

A metric for fairness:
– Standard deviation from the average CPU share.
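The fairness metric can be computed directly from the per-workload CPU shares. The sketch below uses the population standard deviation from Python's statistics module. Whether the original evaluation used the population or the sample form is not stated, so that choice is an assumption, as are the sample shares.

```python
from statistics import pstdev

def fairness_stddev(cpu_shares):
    """Population standard deviation of per-workload CPU shares around their mean.

    0 means every workload received exactly the same share; larger is less fair.
    """
    return pstdev(cpu_shares)

if __name__ == "__main__":
    print(fairness_stddev([0.25, 0.25, 0.25, 0.25]))   # perfectly fair
    print(fairness_stddev([0.40, 0.10, 0.40, 0.10]))   # skewed shares
```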


Fairness

Processor share of 8 workloads in PERF and FAIR
• FAIR is better, but the standard deviation is still high.
• Each thread is represented in the diagram, so no one is starved.


Implementation Cost

We talked about the potential benefits, but a useful approach must be practical to implement.

The cost of predicting miss rates from reuse distance histograms was discussed previously.

– To adapt the model to an MT environment, we must combine the histogram information of several threads.

– This costs little with the AVG method and is more expensive with the COMB method.


Implementation Cost

Cost of collecting the data required for building reuse distance histograms:
– Monitor memory locations and record their reuse distances.
– A user-level watching tool was implemented, with 20% overhead.
– Overhead can be reduced by using multiple watchpoints (possible on UltraSPARC) and by monitoring at kernel level instead of user level.


Implementation Cost

Data to be stored for each thread is small.

The size of the reuse distance histogram can be fixed.

Reuse distance histograms can be compressed:
– Reuse distances are aggregated into buckets.
– Results stayed accurate even with only a few buckets.
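Bucket aggregation can be sketched as below. The slides do not say how bucket boundaries were chosen, so the power-of-two edges, the default of 8 buckets and the function name compress are all assumptions made for illustration.

```python
def compress(hist, num_buckets=8):
    """Aggregate a reuse distance histogram into a fixed number of buckets.

    Bucket i holds distances in [2**i, 2**(i+1)); the last bucket also absorbs
    everything larger. Power-of-two bucket edges are an assumption.
    """
    buckets = [0] * num_buckets
    for distance, count in hist.items():
        # bit_length() - 1 is floor(log2(distance)) for positive integers.
        index = min(max(distance, 1).bit_length() - 1, num_buckets - 1)
        buckets[index] += count
    return buckets

if __name__ == "__main__":
    print(compress({1: 2, 3: 4, 5: 1, 1000: 7}, num_buckets=4))
```

The compressed form has the same size for every thread, which keeps the per-thread state stored by the scheduler small and fixed.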


Part IV – Related Work


SMT Scheduling [3][4][5]

Samples the space of possible schedules, identifies best grouping of threads, attempts to co-schedule them.

Easy to implement, with trivial hardware support and no overhead.

Improves response time by 17%.

Not preferred when the sample space becomes very large.


Cohort Scheduling [6]

A scheduling infrastructure for server applications.

Similar operations from different server requests are executed together, which improves data locality.

Reduces L2 misses by 50%, improves IPC by 30%.

Limited applicability, requires application code change.


Capriccio Thread Package [7]

Resource-aware scheduling by monitoring each thread’s behaviour.

Makes better scheduling decisions and optimizes resource utilization, but requires very detailed monitoring of the program state.


Part V – Conclusion


Conclusion

We investigated the performance effects of shared resources in a CMT system, and found L2 cache has the greatest effect on performance.

Using Balance-Set Scheduling, we reduced L2 cache miss rates by 20-40% and improved performance by 16-32% (the same improvement as quadrupling the L2 cache size).


Conclusion

We adapted the reuse-distance cache miss rate model to work with MT and CMP processors.

The next step is to repeat the experiments on larger machine configurations under commercial workloads, and to evaluate implementation costs on real systems.


References

[1] A. Fedorova, M. Seltzer, C. Small, D. Nussbaum, “Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design”, 2004.

[2] A. Fedorova, M. Seltzer, C. Small, D. Nussbaum, “Throughput-Oriented Scheduling on Chip Multithreading Systems”, 2004.

[3] A. Snavely, D. Tullsen, “Symbiotic Job Scheduling for a Simultaneous Multithreading Machine”, 2000.

[4] A. Snavely, D. Tullsen, G. Voelker, “Symbiotic Job Scheduling with Priorities for a Simultaneous Multithreading Processor”, 2002.

[5] S. Parekh, S. Eggers, H. Levy, J. Lo, “Thread-Sensitive Scheduling for SMT Processors”.

[6] J. Larus, M. Parkes, “Using Cohort Scheduling to Enhance Server Performance”, 2002.

[7] R. von Behren, J. Condit, F. Zhou, G. Necula, E. Brewer, “Capriccio: Scalable Threads for Internet Services”, 2003.


THANKS...

QUESTIONS ?