The Case for a Single-Chip Multiprocessor
By Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang
Presented by Dheeraj Kumar Kaveti
Trend: wide-issue superscalar processors
Limitations: more logic circuitry
Performance comparison: a 6-issue dynamically scheduled superscalar processor versus a 4 x 2-issue multiprocessor.
Introduction
Outline
The Limits of the Superscalar Approach
The Case for a Single-Chip Multiprocessor
Floor plans for a six-issue superscalar microarchitecture and a 4 x 2-way superscalar multiprocessor
Comparison of results for both processors
Out-of-order execution uses dynamic scheduling.
Hardware is needed to track register dependencies between instructions.
The three phases in a superscalar processor are fetch, issue, and execute.
The Limits of the Superscalar Approach
Three factors constrain instruction fetch: mispredicted branches, instruction misalignment, and cache misses.
Even with good branch prediction and alignment, a significant cache miss rate will limit performance.
Fortunately, it is possible to hide some of the instruction cache miss latency.
The Limits of the Superscalar Approach in the Fetch Stage
There are two ways to implement renaming.
1. Use an explicit table that maps architectural registers to physical registers.
2. Use a combined reorder buffer/instruction queue.
The advantage of the mapping table is that no comparisons are required for register renaming.
The disadvantage of the mapping table is the number of access ports it requires.
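The mapping-table approach can be sketched as follows. This is a minimal illustration, not the design of any particular processor: the class name `RenameMapTable`, the register counts, and the free-list policy are all assumptions made for the example, and freeing of physical registers is not modeled. It shows that source operands are resolved by indexed table reads rather than comparisons, and that each three-operand instruction makes three table accesses, which is where the port requirement comes from.

```python
from collections import deque


class RenameMapTable:
    """Toy map-table renamer (hypothetical): architectural -> physical register."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Initially architectural register i lives in physical register i.
        self.table = list(range(num_arch_regs))
        # The remaining physical registers are free for new destinations.
        self.free_list = deque(range(num_arch_regs, num_phys_regs))

    def rename(self, dest, src1, src2):
        # Source lookups are plain indexed reads: no comparators are needed.
        p_src1 = self.table[src1]
        p_src2 = self.table[src2]
        # Allocate a fresh physical register for the destination and record it.
        p_dest = self.free_list.popleft()
        self.table[dest] = p_dest
        return p_dest, p_src1, p_src2


renamer = RenameMapTable(num_arch_regs=4, num_phys_regs=8)
first = renamer.rename(dest=1, src1=2, src2=3)   # r1 = r2 op r3
second = renamer.rename(dest=2, src1=1, src2=1)  # r2 = r1 op r1, reads the renamed r1
print(first, second)
```

Renaming several instructions in the same cycle would need all of these reads and writes to happen at once, so the table needs correspondingly more ports.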
The Limits of the Superscalar Approach in the Issue Stage
For example, a machine with 8 wide issue, 3 operand instructions, a 64-entry instruction queue, and 6-bit comparisons requires 9,216 1-bit comparators.
The comparators therefore take a large area to implement.
This accounts for the long delays.
The instruction queue will therefore limit performance.
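The comparator figure quoted above is consistent with a simple product of the four quantities given. The sketch below assumes that relationship (issue width x operands per instruction x queue entries x bits per comparison); the function name and parameter names are hypothetical.

```python
def one_bit_comparators(issue_width, operands_per_instr, queue_entries, tag_bits):
    """Assumed model: every operand of every issued instruction is compared,
    bit by bit, against every entry of the instruction queue."""
    return issue_width * operands_per_instr * queue_entries * tag_bits


# The example machine: 8-wide issue, 3 operands, 64-entry queue, 6-bit comparisons.
count = one_bit_comparators(issue_width=8, operands_per_instr=3,
                            queue_entries=64, tag_bits=6)
print(count)  # 9216
```

Under this model the count grows linearly with each factor, so widening issue and enlarging the queue together compounds the cost.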
The Limits of the Superscalar Approach in the Issue Stage
Wider instruction issue requires more register renaming.
The number of ports required to satisfy the full instruction issue bandwidth also grows with issue width.
The better way to add ports to the data cache is to build a banked cache.
However, banking increases the access time of the cache.
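As a rough illustration of how a banked cache provides multiple ports, the sketch below assigns each address to a bank by interleaving cache lines across banks. The bank count, line size, and function names are assumptions for the example only. It models bank selection and same-cycle bank conflicts; it does not model the added access time mentioned above.

```python
from collections import Counter


def bank_of(address, num_banks=4, line_bytes=32):
    """Assumed mapping: consecutive cache lines go to consecutive banks."""
    return (address // line_bytes) % num_banks


def cycles_needed(addresses, num_banks=4, line_bytes=32):
    """Cycles to serve accesses issued together if each bank serves one per cycle."""
    if not addresses:
        return 0
    per_bank = Counter(bank_of(a, num_banks, line_bytes) for a in addresses)
    # The most heavily loaded bank determines how long the group takes.
    return max(per_bank.values())


no_conflict = cycles_needed([0, 32, 64, 96])   # one access per bank
conflict = cycles_needed([0, 128, 256, 32])    # three accesses map to bank 0
print(no_conflict, conflict)
```

When the accesses spread across banks, the banked cache behaves like a multiported one; when they collide, some accesses must wait.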
The Limits of the Superscalar Approach in the Execute Stage
To increase throughput.
The increasing spread of multimedia and visualization applications.
To execute in parallel the multiple threads that come from a single application.
To accelerate the execution of sequential applications without manual intervention.
The Case for a Single-Chip Multiprocessor
Two microarchitectures
6-way superscalar architecture
The number of ports in each instruction buffer increases by 50%, so the area of each buffer increases by 30-40%.
To handle out-of-order execution, the instruction issue logic should occupy 30% of the die, but it has only 18%.
The branch target buffer and call-return stack are also enlarged to 2048 and 32 entries respectively, which increases branch prediction accuracy.
4x2-way superscalar multiprocessor architecture
It has 4 processors arranged in a grid.
Each processor is less than one fourth the size of the 6-way superscalar processor.
The I-cache, D-cache, and L2 cache are shared by the four processors.
The cache hit time is 5 cycles, compared with 4 cycles for the 6-way superscalar processor.
Applications
Performance comparison
IPC breakdown
Performance of 4x2 issue processor
Comparison of Both processors
The superscalar architecture suffers from long delays.
For applications with fine-grained thread-level parallelism, the multiprocessor can exploit this parallelism so that the superscalar microarchitecture is at most 10% better, even at the same clock rate.
For applications with large-grained thread-level parallelism and multiprogramming workloads, the multiprocessor performs 50-100% better than the wide superscalar microarchitecture.
Conclusion
Questions
Thank you