22
 Analysis of Multithreading Capabilities of Current High-Performance Processors Kamil K ędzierski, Francisco J. Cazorla, Mateo Valero [email protected], [email protected], [email protected]

2006 Kedzierski XVII JdC

Embed Size (px)

Citation preview

Page 1: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 1/22

 Analysis of Multithreading Capabilities

of Current High-Performance

Processors

Kamil K ędzierski, Francisco J. Cazorla, Mateo [email protected], [email protected], [email protected]

Page 2: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 2/22

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

Introduction

Implementation

Performance

Conclusions

Page 3: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 3/22

Do we need multithreading?

• Problem: 

resources not used efficiently

in a super-scalar processor 

• Goal: 

increase efficiency – do morecomputation with similar 

resource amount

• One possible solution: 

SMT – SimultaneousMultithreading

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

Introduction

Implementation

Performance

Conclusions

Motivation

History

Page 4: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 4/22

How did we get to SMT?

• Multithreading first appeared in

the 1950s, and Simultaneous

Multithreading was investigated

by IBM in 1968

• Super-scalar architectures: only

one thread is running at a given

time

• Multiprocessor: each thread usesa different set of resources

[ http://www.cs.clemson.edu/~mark/multithreading.html ]

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

Introduction

Implementation

Performance

Conclusions

Motivation

History

Page 5: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 5/22

How did we get to SMT?

• Coarse-grain MT: thread switch on a long-latency operations – a.k.a. "concurrent multithreading (CMT)“, "switch-on-event multithreading (SOEMT)“,

"dynamic blocked multithreading (BMT)" 

• Fine-grain MT: thread switch

not only on a long-latency ops – a.k.a. "vertical multithreading“, 

"interleaved multithreading (IMT)“,

"time-sharing multithreading“,

"hardware multiprogramming" (1958)

• Simultaneous MT:

ops issue from different threads

at the same clock cycle

[ http://www.cs.clemson.edu/~mark/multithreading.html ]

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

Introduction

Implementation

Performance

Conclusions

Motivation

History

Page 6: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 6/22

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

Introduction

Implementations

Performance

Conclusions

Page 7: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 7/22

Studied implementations

• IBM’s POWER5 

• Intel’s Xeon (Pentium 4) 

POWER5 die photo

[B. Sinharoy et al,“POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Page 8: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 8/22

SMT implementation in POWER5

• SMT added to super-scalar architecture

 – Second PC

 – Return stack in branch prediction logic duplicated

 – GPR/FPR rename mapper expanded for second set of registers

 – Completion logic duplicated

 – Thread bit added to most address/tag busses

[Ron Kalla et al,”Simultaneous Multi-threading Implementation in POWER5”, HotChips, 2003 ]

[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Page 9: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 9/22

POWER5 architecture

[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Page 10: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 10/22

SMT implementation in Xeon

• SMT added to super-scalar architecture

 – Second PC

 – ITLB duplicated

 – Microcode queue duplicated

 – Return stack buffer duplicated

 – Rename logic duplicated

 – Thread bit added to most address/tag busses

[Deborah T. Marr et al “Hyper -Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Page 11: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 11/22

Xeon architecture

 Additional stages in case of 

cache miss – used to fill the

Trace Cache

[Deborah T. Marr et al “Hyper -Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

[David Burns “Pre-Silicon Validation of Hyper-Threading Technology”, Intel Technology Journal, Volume 6, Issue 1, 2002 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Page 12: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 12/22

Simplified pipeline’s views 

• IBM’s POWER5 

• Intel’s Xeon (Pentium 4) 

[ Kamil Kedzierski et al, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”, ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems 2006 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

I t d ti IBM’ POWER5

Page 13: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 13/22

SMT implementations comparison

[ Kamil Kedzierski et al, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”, ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems 2006 ]

Resource  POWER5  Intel Xeon PC  dupl icated   dupl icated  

Instruction Cache  shared   shared  

ITLB  shared   dupl icated  

DTLB  shared   shared  

BHT  shared   shared  

Return Stack Buffer   dupl icated   dupl icated  Decode  shared   shared  

Instruction Buffer/

uCode Queue dupl icated   dupl icated  

Group Formation  shared   -  

Mapping/Rename  shared   dupl icated  

IQ  shared   part i t ioned  

Scheduler   -   shared  Register Read  shared   shared  

FUs  shared   shared  

Store Queues  dupl icated   part i t ioned  

Register Write  shared   shared  

GCT/Commit  dupl icated   part i t ioned  

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

I t d ti

Page 14: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 14/22

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

Introduction

Implementations

Performance

Conclusions

I t d ti IBM’ POWER5

Page 15: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 15/22

POWER5 performance

0,97 + 0,97 = 1,94

[ R. Kalla et al ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO 2004 ] 

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Page 16: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 16/22

Introduction IBM’s POWER5

Page 17: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 17/22

POWER5 performance

0,85 + 1 = 1,85

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

[ R. Kalla et al ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO 2004 ] 

Shows only similar 

trends!Pictures for two different

architectures

Introduction

Implementations

Performance

Conclusions

IBM’s POWER5 

Intel’s Xeon 

Comparison

Introduction IBM’s POWER5

Page 18: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 18/22

Xeon performance

Varying processors in a system Varying server workloads

[Deborah T. Marr et al “Hyper -Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

Introduction

Implementations

Performance

Conclusions

IBM s POWER5 

Intel’s Xeon 

Comparison

Introduction IBM’s POWER5

Page 19: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 19/22

 Average performance comparison

• POWER5’s performance is reported to increase up to 41% 

• Xeon’s performance is reported to increase up to 28% 

• BUT POWER5 is dual-core architecture  

[Deborah T. Marr et al “Hyper -Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

Introduction

Implementations

Performance

Conclusions

IBM s POWER5 

Intel’s Xeon 

Comparison

Introduction

Page 20: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 20/22

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

Introduction

Implementations

Performance

Conclusions

Introduction Conclusions

Page 21: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 21/22

Conclusions

• Two architectures have been studied (POWER5 and Xeon)

•  Adding second thread usually increases performance

 – Heavy depends on a given workload

• Most of the resources are shared

• Only the crucial ones are duplicated/partitioned:

 – Return Stack Buffer, Instruction Buffer/uCode Queue, Store Queues

and GCT/Commit logic

• In addition Xeon more often duplicates/partitiones than shares:

 – ITLB, Rename and IQ

• The thread level parallelism is a good answer on how to increase

performance without drastically increasing the resources

Introduction

Implementations

Performance

Conclusions

Conclusions

References

Introduction Conclusions

Page 22: 2006 Kedzierski XVII JdC

7/27/2019 2006 Kedzierski XVII JdC

http://slidepdf.com/reader/full/2006-kedzierski-xvii-jdc 22/22

References

1. R. Kalla, B. Sinharoy, J. M. Tendler, ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEEMICRO 2004, pages 40-47

2. B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner “POWER5 systemmicroarchitecture”, IBM Journal of Research and Development 2005, pages 505-521

3. H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel “Characterization of simultaneous multithreading (SMT) efficiency in POWER5”, IBM Journal of Research and Development 2005,pages 555-564

4. Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, Michael Upton“Hyper -Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Volume 6, Issue1, 2002

5. David Burns “Pre-Silicon Validation of Hyper-Threading Technology”, Intel Technology Journal, Volume 6,Issue 1, 2002

6. Ron Kalla, BalaramSinharoy, Joel Tendler, ”Simultaneous Multi-threading Implementation in POWER5”, ASymposium on High Performance Chips (HotChips), 19th August 2003,

7. G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. “The microarchitecture of thePentium 4 processor”, Intel Technology Journal, Volume 5, Issue 1, 2001 

8. Joel Emer, “EV8:The Post-Ultimate Alpha”, International Conference on Parallel Architectures and CompilationTechniques, page 155, 2001

9. Shubu Mukherjee, “The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead BeyondY2K”, 1999, http://www.cs.wisc.edu/~shubu/talks/alpha.ppt 

10. Rick Merritt, “Designers cut fresh paths to parallelism”, EETimes http://www.eetimes.com/story/OEG19991008S0014

11. Francisco J. Cazorla Almeida, “Quality of Service for Simultaneous Multithreading Processors (QoS for SMTProcessors)”, PhD Thesis, DAC, UPC, 2005 

12. Kamil Kędzierski, Francisco J. Cazorla and Mateo Valero, “Analysis of Simultaneous MultithreadingImplementations in Current High-Performance Processors”, ACACES Second International Summer School on

 Advanced Computer Architecture and Compilation for Embedded Systems, Poster Session, July 26th, 2006pages 113-116

13. http://www.cs.clemson.edu/~mark/multithreading.html

Introduction

Implementations

Performance

Conclusions

Conclusions

References