
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 15, 303-304 (1992)

Special Issue on Memory System Architectures for Scalable Multiprocessors

Guest Editors’ Introduction

MICHEL DUBOIS

University of Southern California, Los Angeles, California 90089-2562

AND

SHREEKANT THAKKAR

Sequent Computer Systems, Beaverton, Oregon 97006

One of the most active topics of research in computer architecture today is the design of scalable shared-memory multiprocessor systems. An architecture scales well if processor efficiency remains good as the number and/or the speed of processors is increased. One impediment to better scalability is the widening speed gap between processor technology and memory/interconnect technology. Whereas the speed of processors doubles every 2 to 3 years, the speed of DRAMs (Dynamic Random Access Memories) improves at a much slower pace.

Latencies are drastically higher in multiprocessors than in uniprocessors because processors must access memories which are remote and shared. These accesses involve arbitration protocols for shared communication links and transfers on wires. As the number of processors increases, the amount of communication traffic and the latency of each remote access also increase. Therefore, as more processors are added to a multiprocessor system, the efficiency of each processor drops; at some point, the cost-effectiveness of adding more processors becomes questionable.

Techniques must be invented at the system level in order to hide the impact of latencies on performance and scalability. One common and useful technique consists in keeping multiple copies of data in each processor node in a cache or in a local memory. However, multiple copies of the same block must be kept consistent. Consistency is maintained in hardware or by explicit software control.

Typically, hardware solutions are reserved for cache-based systems. They hide most of the complexity of coherence from the software (user and operating system) and rely on complex protocols called cache coherence protocols. When a processor writes into a cache block with copies in other caches, it must either invalidate these copies (write-invalidate protocols) or broadcast the new value to update the other copies (write-broadcast protocols). Invalidation signals reduce the hit rate of each cache because, when a block has been invalidated, it must be reloaded at the next access to it. Write broadcasts are a form of write-through on shared blocks and cause high write traffic. More importantly, invalidations and broadcasts block the processors sending them. Whether a processor may continue processing while one or several of its invalidations or broadcasts are pending depends on the restrictions on the global ordering of memory accesses imposed by the architecture specification. The ordering of memory accesses in turn determines the memory consistency model, which is the logical model offered by the memory system to the programmer. Several memory consistency models have been introduced and defined, such as sequential consistency, weak ordering, and release consistency.
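The two protocol families can be contrasted with a toy model (our illustration, not taken from any paper in this issue): each cache is a dictionary mapping block identifiers to values, and a write either deletes or updates the remote copies. All names here are invented.

```python
# Toy contrast of write-invalidate vs. write-broadcast coherence.
# Caches are modeled as per-processor dicts {block_id: value}.

def write_invalidate(caches, writer, block, value):
    """Writer updates its own copy; every other copy is discarded,
    so the next remote access to the block misses and reloads it."""
    caches[writer][block] = value
    for p, cache in enumerate(caches):
        if p != writer and block in cache:
            del cache[block]

def write_broadcast(caches, writer, block, value):
    """Writer pushes the new value into every cache holding a copy
    (a form of write-through on shared blocks)."""
    for cache in caches:
        if block in cache or cache is caches[writer]:
            cache[block] = value

caches = [{"X": 0}, {"X": 0}, {}]
write_invalidate(caches, 0, "X", 1)
print(caches)   # [{'X': 1}, {}, {}] -- P1 lost its copy

caches = [{"X": 0}, {"X": 0}, {}]
write_broadcast(caches, 0, "X", 1)
print(caches)   # [{'X': 1}, {'X': 1}, {}] -- P1's copy updated in place
```

The trade-off discussed above is visible even in this sketch: invalidation discards remote copies (future misses), while broadcasting touches every sharer on every write (traffic).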

Usually, the more relaxed the restrictions on access ordering, the more efficient and scalable the multiprocessor. Therefore, it is critical to understand how relaxed these restrictions can be. In this special issue, Philip Bitar overviews prior research work on memory ordering schemes, proposes a new framework to reason about the correctness of memory access orders, and introduces “The Weakest Memory-Access Order.”
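What is at stake in access ordering can be made concrete with the classic two-thread litmus test. The sketch below is ours, not Bitar's formalism: it enumerates every interleaving that preserves each thread's program order and shows that the outcome r1 = r2 = 0 never occurs under sequential consistency, whereas relaxed models may permit exactly that outcome.

```python
# Litmus test: thread A does  x = 1; r1 = y,  thread B does
# y = 1; r2 = x  (x and y start at 0). Under sequential consistency,
# some interleaving of the four operations explains every outcome,
# and (r1, r2) == (0, 0) is impossible.
from itertools import combinations

A = [("store", "x", None), ("load", "y", "r1")]
B = [("store", "y", None), ("load", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list in order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia = ib = 0
        seq = []
        for i in range(n):
            if i in slots:
                seq.append(a[ia]); ia += 1
            else:
                seq.append(b[ib]); ib += 1
        yield seq

def run(seq):
    """Execute one interleaving against a shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var, reg in seq:
        if op == "store":
            mem[var] = 1
        else:
            regs[reg] = mem[var]
    return regs["r1"], regs["r2"]

outcomes = {run(s) for s in interleavings(A, B)}
print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```

A relaxed model that lets each thread's load bypass its pending store admits (0, 0) as well, which is why weaker models require explicit synchronization to rule such outcomes out.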

A drawback of memory access ordering schemes is that they require a significant effort from the programmer because they are defined in terms of memory accesses and synchronization instructions. The contribution of the paper by Kourosh Gharachorloo et al., “Programming for Different Memory Consistency Models,” is to introduce a new model which lends itself to the hardware optimizations of other models, but at the same time eases the task of the programmer.

Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.

Interconnection networks for scalable shared-memory multiprocessors must efficiently support interprocessor synchronization as well as cache coherence. Traditionally, processor-to-memory interconnects are efficient either at point-to-point communication (e.g., crossbars) or at broadcast (e.g., buses). By contrast, the network designs proposed in “Notification and Multicast Networks for Synchronization and Coherence” by John B. Andrews, Carl J. Beckmann, and David K. Poulsen support unrestricted multicast. Multicast is the sending of messages to arbitrary sets of recipient processors and is particularly useful for synchronization and cache coherence in large-scale multiprocessors with Multistage Interconnection Networks (MIN).

In their paper “Compile-Time Optimization of Near-Neighbor Communication for Scalable Shared-Memory Multiprocessors,” David E. Hudak and Santosh G. Abraham take the software approach to maintaining consistency among caches in a multiprocessor. Their compile-time techniques are applicable to parallelized FORTRAN DO-loops with near-neighbor communications.

The last three papers in this issue are devoted to Shared Virtual Memory Systems, built on top of multiprocessors with distributed memory. These systems allow the presence of a given page in different memories and employ coherence algorithms similar to those of multicache systems. The major differences are that consistency is managed by the virtual memory management software and that the unit of sharing is the page rather than the cache block. There are three advantages of such systems over multicache systems. The hardware is simpler because coherence is managed in software. Moreover, more sophisticated consistency algorithms, taking into account high-level knowledge about access patterns to specific data structures, can be designed to optimize performance for each different data structure. Finally, the transfer of a page in its entirety is faster than its transfer block by block, which favors pages over blocks for nonshared data. One problem is that software control is usually slower than hardware control. Another problem is the large size of the unit of sharing (the page). Often a whole page is brought in on demand, just to access a word in the page; and false sharing (the read-write sharing of pages without actual sharing of data) cannot be neglected. The paper by Yuval Tamir and G. Janakiraman, “Hierarchical Coherency Management for Shared Virtual Memory Multicomputers,” proposes to solve this problem through an adaptive scheme involving two granularities of sharing: the page and the block.
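The cost of page-granularity false sharing is easy to see in a toy single-writer model (our sketch, not Tamir and Janakiraman's scheme): two processors that repeatedly write disjoint words of the same page force the page to migrate on every write, even though no word is ever actually shared. All names below are invented.

```python
# Toy single-writer page coherence: a page must migrate to a
# processor before that processor may write it.
PAGE_SIZE = 4096

class PageSVM:
    def __init__(self):
        self.owner = {}       # page number -> current writable owner
        self.transfers = 0    # whole-page migrations

    def write(self, proc, addr):
        page = addr // PAGE_SIZE
        if self.owner.get(page) != proc:
            self.transfers += 1       # page ping-pongs to the writer
            self.owner[page] = proc

# False sharing: disjoint words, same page.
svm = PageSVM()
for _ in range(100):
    svm.write(0, 0)    # P0 writes word 0 of page 0
    svm.write(1, 8)    # P1 writes word 8 of the SAME page
print(svm.transfers)   # 200 -- one page migration per write

# Same access pattern, but on different pages: only two initial faults.
svm2 = PageSVM()
for _ in range(100):
    svm2.write(0, 0)
    svm2.write(1, PAGE_SIZE)
print(svm2.transfers)  # 2
```

An adaptive scheme of the kind the paper proposes would detect the ping-ponging page and fall back to a finer (block) granularity for it, collapsing the migration traffic.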

Andrew W. Wilson and Richard P. LaRowe present performance evaluations of the Galactica Net, a distributed shared-memory system under development at the Center for High Performance Computing. The paper is titled “Hiding Shared Memory Reference Latency on the Galactica Net Distributed Shared Memory Architecture” and provides a rare comparison between three popular memory consistency models on a real architecture.

Evaluating and comparing the memory systems of various multiprocessors is a complex task. It is often difficult to understand the effectiveness of particular architecture choices in isolation from the software policies guiding the hardware. The paper “Evaluation of Multiprocessor Memory Systems Using Off-Line Optimal Behavior” by William J. Bolosky and Michael L. Scott presents a technique for separating the two.

We thank all the authors who submitted manuscripts to this special issue. Many of the submitted papers were outstanding, and selecting the seven contributions included in the issue was a very difficult task indeed. These decisions were shaped by all the referees who offered their expertise and their precious time. They deserve most of the credit for the quality of the issue. Finally, we are grateful to Professor Kai Hwang for providing us with the opportunity to host a special issue on our favorite topic of research, and for his guidance throughout the reviewing process.

MICHEL DUBOIS is an Associate Professor in the Department of Electrical Engineering of the University of Southern California. Before joining U.S.C. in 1984, he was a research engineer at the Central Research Laboratory of Thomson-CSF in Orsay, France. His main interests are computer architecture and parallel processing, with a focus on multiprocessor architecture, performance, and algorithms. He has edited two books, one on multiprocessor caches and one on scalable shared memory multiprocessors. Dubois holds a Ph.D. from Purdue University, an M.S. from the University of Minnesota, and an engineering degree from the Faculté Polytechnique de Mons in Belgium, all in electrical engineering. He is a member of the ACM and a senior member of the Computer Society of the IEEE.

SHREEKANT THAKKAR is a staff systems engineer at Sequent Computer Systems. His research interests include highly parallel systems, computer architecture and performance, parallel programming and algorithms, and VLSI design. Thakkar received a B.S. with honors in statistics and computer science from the University of London and an M.S. and Ph.D. in computer science from the University of Manchester. He is a chartered engineer and a member of the IEEE, ACM, BCS, and IEE.