
Characteristics of an On-Chip Cache on NEC SX Vector Architecture

Akihiro MUSA 1,3, Yoshiei SATO 1, Ryusuke EGAWA 1,2, Hiroyuki TAKIZAWA 1,2, Koki OKABE 2, and Hiroaki KOBAYASHI 1,2

1 Graduate School of Information Sciences, Tohoku University, Sendai 980-8578, Japan
2 Cyberscience Center, Tohoku University, Sendai 980-8578, Japan
3 NEC Corporation, Tokyo 108-8001, Japan

Received May 6, 2008; final version accepted December 10, 2008

Thanks to their highly effective memory bandwidth, vector systems can achieve high computation efficiency for computation-intensive scientific applications. However, they have been encountering the memory wall problem, and the effective memory bandwidth rate has decreased, with the bytes per flop rates of recent vector systems dropping from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting worse as more function units and/or cores are brought onto a single chip, because the pin bandwidth is limited and does not scale. To solve the problem, we propose an on-chip cache, called the vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism between the main memory and the register files under software control. We evaluate the performance of the vector cache on the NEC SX vector processor architecture at bytes per flop rates of 2 B/FLOP and 1 B/FLOP, to clarify the basic characteristics of the vector cache. For the evaluation, we use the NEC SX-7 simulator extended with the vector cache mechanism. The benchmark programs for the performance evaluation are two DAXPY-like loops and five leading scientific applications. The results indicate that the vector cache boosts the computational efficiencies of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. In particular, when cache hit rates exceed 50%, the 2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system, because the vector cache with the bypass mechanism can provide data from the main memory and the cache simultaneously. In addition, from the viewpoint of cache design, we investigate the impact of cache associativity on the cache hit rate, and the relationship between cache latency and performance. The results suggest that the associativity hardly affects the cache hit rate, and that the effect of the cache latency depends on the vector loop length of applications. A shorter cache latency contributes to the performance improvement of applications with shorter loop lengths, even in the case of the 4 B/FLOP system. For longer loop lengths of 256 or more, the latency can effectively be hidden, and the performance is not sensitive to it. Finally, we discuss the effects of selective caching using the bypass mechanism, and of loop unrolling, on the vector cache performance for the scientific applications. Selective caching is effective for efficient use of the limited cache capacity. Loop unrolling is also effective for improving performance, producing a synergistic effect with caching. However, there are exceptional cases: loop unrolling can worsen the cache hit rate by enlarging the working set of the unrolled loops beyond the cache, in which case the increase in cache misses cancels the gain obtained by unrolling.

KEYWORDS: Vector processing, Vector cache, Performance characterization, Memory system, Scientific application

1. Introduction

Vector supercomputers have high computation efficiency for computation-intensive scientific applications. In our previous work (Kobayashi (2006), Musa et al. (2006)), we have shown that NEC SX vector supercomputers can achieve high sustained performance in several leading applications. This is due to their high bytes per flop rates compared with scalar systems. However, as recent advances in VLSI technology have also been accelerating processor speeds, vector supercomputers have been encountering the memory wall problem. As a result, it is getting harder for vector supercomputers to keep a high memory bandwidth balanced with the improvement of their flop/s performance. On the NEC SX systems, the bytes per flop rate has decreased from 8 B/FLOP to 2 B/FLOP during the last twenty years.

E-mail: [email protected]

Interdisciplinary Information Sciences Vol. 15, No. 1 (2009) 51–66
Graduate School of Information Sciences, Tohoku University
ISSN 1340-9050 print / 1347-6157 online
DOI 10.4036/iis.2009.51


We have proposed an on-chip cache, called the vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers (Kobayashi et al. (2007)) and provided an early evaluation of the vector cache using the NEC SX simulator (Musa et al. (2007)). The vector cache employs a bypass mechanism between a vector register and the main memory under software control, and load/store data are selectively cached. The bypass mechanism and the cache work complementarily to provide data to the register file, resulting in a high effective memory bandwidth rate. However, we did not investigate the fundamental configuration of the vector cache, i.e., cache associativity and cache latency. Moreover, our previous work did not discuss the relationship between loop unrolling and caching. The main contribution of this paper is to clarify the characteristics of the on-chip vector cache on vector supercomputers. We investigate the relationship among cache associativity, cache latency and performance using five practical applications selected from the areas of advanced computational sciences. In addition, we discuss in detail selective caching and loop unrolling under the vector cache to improve the performance.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the vector architecture with an on-chip vector cache discussed in this paper. Section 4 provides our experimental methodology and benchmark programs for performance evaluation. Section 5 presents experimental results when executing the kernel loops and five real applications on our architecture. Through the experimental results, we clarify the characteristics of the vector cache. In Section 6, we examine the effect of selective caching on the performance, in which only the data with high locality of reference are cached. In addition, we discuss the relationship between loop unrolling and the vector cache. Finally, Section 7 summarizes the paper with some future work.

2. Related Work

Vector caches were studied by a number of researchers in the 1990s using trace-driven simulators of conventional vector architectures. Gee et al. evaluated the cache performance of two vector architectures, the Cray X-MP and the Ardent Titan (Gee and Smith (1992), Gee and Smith (1994)). Their caches employ a fully associative cache and an n-way set associative cache. The line sizes range from 16 to 128 bytes, and the maximum cache size is 4 MB. Their benchmark programs are selected from the Livermore Loops, the NAS kernels, the Los Alamos benchmark set and real applications in chemistry, fluid dynamics and linear analysis. They showed that vector references contain somewhat less temporal locality, but large amounts of spatial locality, compared with instruction and scalar references, and that the cache improved the computational performance of the vector processors. Kontothanassis et al. evaluated the cache performance of the Cray C90 architecture using the NAS parallel benchmarks (Kontothanassis et al. (1994)). Their cache is a direct-mapped cache with a 128-byte line size, and the maximum cache size is 8 MB. They showed that the cache reduced traffic to off-chip memory, and that a DRAM memory system with the cache was competitive with an SRAM memory system.

The modern vector supercomputer Cray X1 has a 2 MB vector cache, called the Ecache, organized as a 2-way set associative write-back cache with a 32-byte line size and the least recently used (LRU) replacement policy (Abts et al. (2003)). The performance of the Cray X1 has been evaluated using real scientific applications by several researchers (Dunigan Jr. et al. (2005), Oliker et al. (2004), Shan and Strohmaier (2004)). These studies compared the performance of the Cray X1 with that of other platforms: IBM Power, SGI Altix and NEC SX. However, they did not quantitatively discuss the effect of the cache on its performance.

The aim of our research is to quantitatively investigate the effect of the vector cache on the modern vector supercomputer NEC SX architecture. We clarify the basic characteristics of the vector cache and discuss its effective usage.

3. On-Chip Cache Memory for Vector Architecture

We design an on-chip vector cache for a vector processor architecture as shown in Figure 1. Our vector processor has a superscalar unit, a vector unit and an address control unit. The vector unit contains five types of vector arithmetic pipes (Mask, Logical, Multiply, Add/Shift and Divide), vector registers and a vector cache.

[Block diagram: a scalar unit and a vector unit containing mask registers, vector registers, the Mask, Logical, Multiply, Add/Shift and Divide pipes (x4 each) and a vector cache of 32 sub-caches; an address control unit; and the main memory unit. The cache-register path provides 4 B/FLOP, while the memory-cache path provides 1 to 4 B/FLOP.]
Figure 1. Vector architecture with vector cache and memory system block diagram.


Four parallel vector pipes are provided for each type; in total, 20 pipes are included in the vector unit. The address control unit queues and issues vector operation instructions to the vector unit. The vector operation instructions comprise vector memory instructions and vector computational instructions. Each instruction concurrently deals with 256 double-precision floating-point data elements.

The main memory unit employs an interleaved memory system. The main memory is divided into 32 parts, each of which operates independently and supplies data to the vector processor. Thus, the main memory and the vector processor are interconnected through 32 memory ports. Moreover, we implement the vector cache in each memory part; the vector cache consists of 32 sub-caches, and each sub-cache is a non-blocking cache with a bypass mechanism between a vector register file and the main memory. The bypass mechanism is software-controlled, and vector load/store instructions have a flag for cache control, i.e., cache on/off. When the flag indicates ‘‘cache on,’’ the data supplied by the vector load/store instruction are cached. Meanwhile, when the flag is ‘‘cache off,’’ the vector load/store instruction provides data via the bypass mechanism: the data are not cached. Each sub-cache is a set-associative write-through cache with the LRU replacement policy. The line size is 8 bytes, which is the unit for memory accesses.

4. Experimental Environment

4.1 Methodology

We use a trace-driven simulator that can simulate the behavior of the proposed vector architecture at the register transfer level. It is based on the NEC SX simulator, which accurately models a single processor of the SX architecture: the vector unit, the scalar unit and the memory system. The simulator takes a system parameter file and a trace file as input, and its output contains the instruction cycle counts of a benchmark program and cache hit information. The system parameter file holds the configuration and setting parameters of a vector architecture, i.e., the cache size, the associativity, the cache latency and the memory bandwidth.

The trace file contains an instruction sequence of a benchmark program and directives for cache control: cache on/off. These directives are used in selective caching and set the cache control flags in the vector load/store instructions. A benchmark program is compiled by the NEC FORTRAN compiler, FORTRAN90/SX, which supports ANSI/ISO Fortran95 in addition to automatic vectorization and automatic parallelization. The executable program runs on the SX trace generator to produce the trace file. In this work, we use two kernel loops and the original source codes of five applications, and compile them with the highest optimization option (-C hopt) and subroutine inlining.

To evaluate on-chip vector caching for future vector processors with a higher flop/s rate but a relatively lower off-chip memory bandwidth, we examine the effect of the cache while limiting the memory bandwidth per flop/s rate of the vector processor from 4 B/FLOP down to 1 B/FLOP. Here, we adopt the NEC SX-7 architecture (Kitagawa et al. (2003)) for our vector processor. The setting parameters are shown in Table 1 (see footnote 1).

4.2 Benchmark Programs

To clarify the basic characteristics and validity of the vector cache, we select basic kernel loops and five leading applications. In this section, we describe our benchmark programs in detail.

The following kernel loops are used to evaluate the performance of the vector processor with the cache.

A(i) = X(i) + Y(i)            (1)
A(i) = X(i) * Y(i) + Z(i)     (2)
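In source form, these correspond to the following element-wise loops (a minimal, self-contained Fortran sketch; the array length of 25600 and the double-precision data follow Sections 3 and 5.2, while the initial values are placeholders chosen here only to make the program runnable):

      PROGRAM KERNELS
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 25600
      DOUBLE PRECISION :: A(N), X(N), Y(N), Z(N)
      INTEGER :: I
      X = 1.0D0
      Y = 2.0D0
      Z = 3.0D0
!     Kernel (1): two loads, one add, one store per element
      DO I = 1, N
         A(I) = X(I) + Y(I)
      END DO
!     Kernel (2): three loads, one multiply-add, one store per element
      DO I = 1, N
         A(I) = X(I) * Y(I) + Z(I)
      END DO
      PRINT *, A(1), A(N)
      END PROGRAM KERNELS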

Table 1. Summary of setting parameters.

  Base System Architecture:             NEC SX-7
  Main Memory:                          DDR-SDRAM
  Vector Cache:                         SRAM
  Total Size (Sub-cache Size):          256 KB–8 MB (8 KB–256 KB)
  Cache Policy:                         LRU, write-through
  Associativity:                        2-way, 4-way, 8-way
  Cache Latency:                        15%, 50%, 85%, 100% (rate of main memory latency)
  Cache Bank Cycle:                     5% of memory cycle
  Line Size:                            8 B
  Memory–Cache bandwidth per flop/s:    1 B/FLOP, 2 B/FLOP, 4 B/FLOP
  Cache–Register bandwidth per flop/s:  4 B/FLOP

1 The absolute values of memory related parameters such as latency and bank cycle cannot be disclosed in this paper because of their confidentiality.


As leading applications, we have selected five applications from three scientific computing areas. The applications have been developed by researchers at Tohoku University for the SX-7 system and are representative of their respective research areas. Table 2 summarizes the evaluated applications, whose methods are standard in their individual areas (Kobayashi (2006)).

The GPR simulation code evaluates the performance of SAR-GPR (Synthetic Aperture Radar–Ground Penetrating Radar) in the detection of buried anti-personnel mines under conditions of inhomogeneous subsurface media (Kobayashi et al. (2003)). The simulation method is based on the three-dimensional FDTD (Finite Difference Time Domain) method (Kunz and Luebbers (1993)). The vector operation ratio is 99.7% and the vector loop length is over 500.

The APFA simulation is applied to the design of a high-gain antenna, the Anti-Podal Fermi Antenna (APFA) (Takagi et al. (2004)). The simulation calculates the directional characteristics of the APFA using the FDTD method and the Fourier transform. The vector operation ratio is 99.9% and the vector loop length is 255.

The PRF simulation provides numerical simulations of two-dimensional Premixed Reactive Flow (PRF) in combustion for the design of aircraft engines (Tsuboi and Masuya (2003)). The simulation uses the sixth-order compact finite difference scheme and the third-order Runge–Kutta method for time advancement. The vector operation ratio is 99.3% and the vector loop length is 513.

The SFHT simulation performs direct numerical simulations of three-dimensional laminar Separated Flow and Heat Transfer (SFHT) on surfaces of a plane (Nakajima et al. (2005)). The finite-difference forms are the fifth-order upwind difference scheme for the space derivatives and the Crank–Nicolson method for the time derivative. The resulting finite-difference equations are solved using the SMAC method. The vector operation ratio is 99.4% and the vector loop length is 349.

The PBM simulation uses three-dimensional numerical Plate Boundary Models (PBM) to explain an observed variation in the propagation speed of postseismic slip (Ariyoshi et al. (2007)). This is a quasi-static simulation in an elastic half-space including a rate- and state-dependent friction law. The vector operation ratio is 99.5% and the vector loop length is 32400.

5. Performance Evaluation of Vector Cache

We evaluate the performance of the proposed vector processor architecture by using the benchmark programs and clarify the basic characteristics of the vector cache.

5.1 Impact of Memory Bandwidth on Effective Performance

To evaluate the impact of the memory bandwidth on the performance of the vector processor without the vector cache, we vary the memory bandwidth per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Figure 2 shows the computation efficiency, i.e., the ratio of the sustained performance to the peak performance, in the execution of the benchmark programs. Here, the 4 B/FLOP case equals the memory bandwidth per flop/s rate of the SX-8 and earlier systems, 2 B/FLOP is the same as the Cray X1 and BlackWidow (Abts et al. (2007)), and 1 B/FLOP is at the level of commodity-based high-performance scalar systems such as the NEC TX7 series (Senta et al. (2003)) and the SGI Altix series (SGI (2007)). Figure 2 indicates that the computational performance is seriously affected by the B/FLOP rate. In particular, the performances of GPR and PBM, which are memory-intensive programs, are halved at 2 B/FLOP and quartered at 1 B/FLOP relative to 4 B/FLOP. Even though APFA is computation-intensive, its performance on the 1 B/FLOP system decreases to 69% of that on the 4 B/FLOP system. Therefore, vector supercomputers need a memory bandwidth per flop/s rate of 4 B/FLOP.

Table 2. Summary of benchmark programs.

  Area                      Name and Description                              Method        Subdivision       Memory Size
  Electromagnetic Analysis  GPR simulation: Simulation of Array Antenna       FDTD          50 x 750 x 750    8.9 GB
                            Ground Penetrating Radar
                            APFA simulation: Simulation of Anti-Podal         FDTD          612 x 105 x 505   12 GB
                            Fermi Antenna
  CFD/Heat Analysis         PRF simulation: Simulation of Premixed            DNS           513 x 513         1.4 GB
                            Reactive Flow in Combustion
                            SFHT simulation: Simulation of Separated          SMAC          711 x 91 x 221    6.6 GB
                            Flow and Heat Transfer
  Seismology                PBM simulation: Simulation of Plate Boundary      Friction Law  32400 x 32400     8 GB
                            Model on Seismic Slow Slip


5.2 Relationship between Efficiency and Cache Hit Rate on Kernel Loops

We simulate the execution of the kernel loop (1) on the vector processor with the vector cache. Since this loop has no temporal locality, we store a part of the X(i) and Y(i) data in the vector cache in advance of the loop execution. We select the data for caching in the following two ways to vary the cache hit rate, and examine the effect on performance. One way, named Case 1, is to store both X(i) and Y(i) with the same index i in the cache, with the load instructions of both X(i) and Y(i) set to ‘‘cache on.’’ The other is to cache only X(i); the load instructions of X(i) and Y(i) are set to ‘‘cache on’’ and ‘‘cache off,’’ respectively. This is Case 2. Here, the range of index i is 1 ≤ i ≤ 25600.
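In our experiments these choices are expressed as cache on/off directives in the trace file (Section 4.1), not in the source code. Purely as an illustration of the intent of Case 2, the selection can be sketched at source level as follows; the CACHE_ON/CACHE_OFF directive names are invented here for exposition and are not FORTRAN90/SX syntax (being comments, they do not affect compilation):

      SUBROUTINE KERNEL1_CASE2(N, A, X, Y)
      IMPLICIT NONE
      INTEGER :: N, I
      DOUBLE PRECISION :: A(N), X(N), Y(N)
!     Case 2: only X(i) is cached; Y(i) is loaded through the bypass,
!     so it neither pollutes the cache nor evicts X.
!HYP$ CACHE_ON(X)      ! hypothetical directive, not FORTRAN90/SX
!HYP$ CACHE_OFF(Y)     ! hypothetical directive, not FORTRAN90/SX
      DO I = 1, N
         A(I) = X(I) + Y(I)
      END DO
      END SUBROUTINE KERNEL1_CASE2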

Figure 3 shows the relationship between the cache hit rate and the relative memory bandwidth, obtained by normalizing the effective memory bandwidth of the systems with the vector cache by that of the 4 B/FLOP system without the cache. Figure 4 shows the relationship between the cache hit rate and the computation efficiency. Figure 3 indicates that the relative memory bandwidth of each case increases as the cache hit rate increases. Therefore, the vector cache is a promising way to cover a lack of memory bandwidth. Similarly to Figure 3, the computation efficiency of each case in Figure 4 improves as the cache hit rate increases.

Figures 3 and 4 show the correlation between the relative memory bandwidth and the computation efficiency. On both the 2 B/FLOP and 1 B/FLOP systems, the performance of Case 2 is greater than that of Case 1. In particular, Case 2 with a 50% cache hit rate on the 2 B/FLOP system achieves the same performance as the 4 B/FLOP system. On the 4 B/FLOP system, each of X(i) and Y(i) is provided at a 2 B/FLOP bandwidth rate on average. When the cache hit rate is 50% in Case 2, all data of X(i) are supplied directly from the vector cache, and all data of Y(i) are supplied from the memory at the 2 B/FLOP bandwidth rate through the bypass mechanism. Consequently, each of X(i) and Y(i) is provided at the 2 B/FLOP bandwidth rate to the vector registers, and the total amount of data provided to the vector registers per unit time is equal to that of the 4 B/FLOP system. However, the 1 B/FLOP system with a 50% hit rate does not achieve the performance of the 4 B/FLOP system: because Y(i) is provided at only a 1 B/FLOP bandwidth rate, the total bandwidth rate at the vector registers does not reach 4 B/FLOP. On the other hand, in Case 1, the data of X(i) and Y(i) with the same index i are supplied from either the vector cache or the memory. While supplying data from the memory, X(i) and Y(i) have to share the memory bandwidth.

Figure 3. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (1). [Line plot: Relative Memory Bandwidth vs. Cache Hit Rate (%); series: 4B/F, 2B/F Case 1, 2B/F Case 2, 1B/F Case 1, 1B/F Case 2.]

Figure 2. Effect of memory bandwidth per flop/s rate. [Bar chart: Efficiency (%) of X+Y, X*Y+Z, GPR, APFA, PRF, SFHT and PBM; series: 4B/FLOP, 2B/FLOP, 1B/FLOP.]


Therefore, in Case 1 the 1 B/FLOP and 2 B/FLOP systems need a cache hit rate of 100% to achieve a performance comparable to that of the 4 B/FLOP system.

Similarly, we examine the performance of Kernel (2) on the 2 B/FLOP system in the following three cases. In Case 1, all of X(i), Y(i) and Z(i) with the same index i are cached in advance. In Case 2, both X(i) and Y(i) are provided from the vector cache, and therefore the maximum cache hit rate is 66%. In Case 3, only X(i) is in the cache, and hence the maximum cache hit rate is 33%. Figure 5 shows the relationship between the cache hit rate and the relative memory bandwidth for Kernel (2). On the 4 B/FLOP system, each of X(i), Y(i) and Z(i) is provided at a 4/3 B/FLOP bandwidth rate on average. The performance of Case 2 is comparable to that of the 4 B/FLOP system when the cache hit rate is 66%, because the bandwidth per flop/s rate of each array is then 4/3 B/FLOP on average. However, in Case 3, both Y(i) and Z(i) are provided from the memory and have to share the 2 B/FLOP bandwidth rate to the vector registers. As a result, the relative memory bandwidth never reaches that of the 4 B/FLOP system.

These results indicate that the vector cache has the potential to cover the shortage of memory bandwidth in future SX systems. In addition, the discussions of Figures 3 and 5 show that not all data need to be cached: the key data that determine the performance should be cached to make good use of the limited on-chip capacity.

5.3 Relationship between Efficiency and Cache Hit Rate on Five Scientific Applications

We simulate the execution of the five scientific applications with the vector cache at memory bandwidth per flop/s rates of 1 B/FLOP and 2 B/FLOP. Here, all vector load/store instructions are set to ‘‘cache on.’’ Figure 6 shows the relationship between the cache hit rate of the 8 MB vector cache and the relative memory bandwidth. Figure 6 indicates that the vector cache can improve the effective memory bandwidth, depending on the cache hit rate of the applications. Cache hit rates vary from 13% in SFHT to 96% in APFA, because they depend on the locality of reference in the individual applications.

In APFA, the relative memory bandwidth is 0.96 on the 2 B/FLOP system with an almost 100% hit rate in the 8 MB cache, resulting in an effective memory bandwidth almost equal to that of the 4 B/FLOP system. In this case, most data are provided from the cache.

Figure 4. Relationship between cache hit rate and computation efficiency on Kernel loop (1). [Line plot: Efficiency (%) vs. Cache Hit Rate (%); series: 4B/F, 2B/F Case 1, 2B/F Case 2, 1B/F Case 1, 1B/F Case 2.]

Figure 5. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (2). [Line plot: Relative Memory Bandwidth vs. Cache Hit Rate (%); series: 4B/F, 2B/F Case 1, 2B/F Case 2, 2B/F Case 3.]


PBM on the 2 B/FLOP system reaches the effective memory bandwidth of the 4 B/FLOP system at a cache hit rate of 50%. Figure 7 shows the main routine of PBM. The inner loop, over index j, is vectorized, and the arrays gd_dip and wary are stored in the vector cache by vector load instructions. However, gd_dip spills from the cache because it needs 7.6 GB, whereas wary can be held in the cache because it needs only 250 KB and is defined in the preceding loop. Therefore, the cache hit rate of this loop is 50%. All data of wary are then supplied directly from the vector cache at the 2 B/FLOP rate, all data of gd_dip are supplied from the memory at the 2 B/FLOP rate through the cache, and the total amount of data provided to the vector registers per unit time is equal to that of the 4 B/FLOP system.
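These sizes are consistent with the 32400 x 32400 problem dimension of Table 2 and 8-byte double-precision words (our arithmetic, not stated in the paper):

  size(wary)   = 32400 * 8 B           ≈ 0.25 MB
  size(gd_dip) = 32400 * 32400 * 8 B   ≈ 7.8 GB

which matches the quoted 250 KB, and roughly the quoted 7.6 GB, the small difference depending on rounding and unit convention.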

The relative memory bandwidths of the other applications, SFHT, GPR and PRF, are over 0.6 on the 2 B/FLOP system and 0.3 on the 1 B/FLOP system. Without the cache, the relative memory bandwidth is 0.5 on the 2 B/FLOP system and 0.25 on the 1 B/FLOP system. Thus, the 8 MB cache improves the relative memory bandwidth by over 20%.

Figure 8 shows the computation efficiency of the five scientific applications on the 4 B/FLOP system without the cache and on the 2 B/FLOP and 1 B/FLOP systems with and without the 8 MB cache. Figure 8 indicates that the cache increases the efficiency of the 2 B/FLOP and 1 B/FLOP systems. In particular, the 2 B/FLOP system with the cache achieves the same efficiency as the 4 B/FLOP system in APFA and PBM. Figure 9 shows the relationship between the cache hit rate of the 8 MB cache and the recovery rate of performance, i.e., how far the cache recovers the performance of the 2 B/FLOP and 1 B/FLOP systems toward that of the 4 B/FLOP system. The recovery rate goes up as the cache hit rate increases. The vector cache increases the recovery rate by 18 to 99% on the 2 B/FLOP system and by 9 to 96% on the 1 B/FLOP system, depending on the data access locality of each application. The results indicate that the vector cache has the potential to cover the shortage of memory bandwidth of future SX systems on real scientific applications.

5.4 Relationship between Associativity and Cache Hit Rate

In this section, we examine the effect of the associativity, ranging from 2 to 8 ways, on the performance. Figure 10 shows the cache hit rate for each set-associative cache, covering cache sizes from 256 KB to 8 MB. Four applications, APFA, PRF, SFHT and PBM, have approximately constant cache hit rates across the three associativity cases. In GPR, the cache hit rate varies slightly with the associativity.

The cache hit rate depends on the placement of the arrays at memory addresses, the memory access patterns of the programs, and the associativity. The basic computational structure of GPR consists of triply nested loops. The loops access many arrays at a stride of 584 bytes, and the cached data are frequently replaced.

Figure 6. Relationship between cache hit rate and relative memory bandwidth on five applications. [Scatter plot: Relative Memory Bandwidth vs. Cache Hit Rate (%); series: 2B/F 8MB, 1B/F 8MB; points labeled PBM, SFHT, GPR, PRF, APFA.]

01 do i=1,ncells

02 do j=1,ncells

03 wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j)

04 end do

05 end do

Figure 7. Main routine of PBM.


Therefore, the cache hit rate of GPR varies with the cache capacity and the associativity. In SFHT, the memory access patterns have 8-byte stride accesses. In the 2 MB case, cached data that would be reused in the 2-way set associative cache are evicted by other data; this is caused by the placement of the arrays at memory addresses. The other programs have constant cache hit rates across the three associativity cases. Their memory access patterns are 8-byte or 16-byte stride accesses, and the placement of the arrays in memory does not cause conflicts among reused data.

These results show that the associativity hardly affects the cache hit rate, and that the cache capacity is the main influence on the cache hit rate across the five applications.

5.5 Effects of Vector Cache Latency on Performance

In general, the latency of an on-chip cache is considerably shorter than that of the off-chip memory. We investigate the effects of the cache access latency on the performance of the five applications. Figure 11 shows the relationship between the computation efficiency of the applications and four cache latencies: 15, 50, 85 and 100% of the memory access latency on the 2 B/FLOP system with the 2 MB cache. The five applications have almost the same efficiency under all of these cache latencies, because the vector architecture can hide memory access times by pipelined vector operations when the vector operation ratio and the vector loop length of an application are large enough. In addition, the vector cache consists of 32 sub-caches between a vector register file and the main memory, one per memory port. The sub-caches connected to the 32 memory ports provide multiple words at a time, and the cache latency of the second and later memory accesses is hidden in the sub-cache system.
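This behavior can be captured by a simple first-order model (ours, not from the paper): if a vector operation of length n is processed by a pipeline of throughput W elements per cycle after an access latency of L cycles, the execution time is roughly

  T(n) ≈ L + n/W,

so the latency accounts for a fraction L / (L + n/W) of the total time. For long loops (n of 256 or more) this fraction is small and the cache latency is effectively hidden, while for short loops (n = 64) it dominates, which matches the sensitivity observed below.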

For comparison, we examine shorter loop lengths using Kernel loops (1) and (2) on the 2 B/FLOP system while changing the cache access latency. Figure 12 shows the relationship between the computation efficiency of the kernel loops and the four cache latency cases. Here, we examine three loop lengths with a 100% cache hit rate: 1 ≤ i ≤ 64, 1 ≤ i ≤ 128 and 1 ≤ i ≤ 256. Figure 12 indicates that for the shorter loop lengths the vector cache latency is an important factor in the performance. In particular, at loop length 64, the 15% latency case has twice the efficiency of the 100% case in Kernel (1), and 12% higher efficiency in Kernel (2).

Figure 9. Recovery rate of performance on 8 MB cache. [Scatter plot: Performance Recovery Rate (%) vs. Cache Hit Rate (%); series: 2B/F 8MB, 1B/F 8MB; points labeled PBM, SFHT, GPR, PRF, APFA.]

Figure 8. Computation efficiency of five applications with/without 8 MB vector cache. [Bar chart: Efficiency (%) of GPR, APFA, PRF, SFHT and PBM; series: 4B/F, 2B/F non-cache, 2B/F 8MB, 1B/F non-cache, 1B/F 8MB.]


However, the longer the loop length, the smaller the effect of the cache latency. Figure 13 shows the computation efficiency of the kernel loops on the 4 B/FLOP and 2 B/FLOP systems with and without the vector cache. For both loop lengths 64 and 128 of Kernel (1), the efficiency of the 4 B/FLOP system with the cache is higher than that of the 4 B/FLOP system without the cache, and the loop length 128 case of the 4 B/FLOP system with the cache achieves the highest efficiency. Meanwhile, in Kernel (2), the loop length 64 case of the 4 B/FLOP system with the cache has the highest efficiency.

Figure 10. Cache hit rate vs. associativity on five applications. [Five panels (GPR, APFA, PRF, SFHT, PBM): Cache Hit Rate (%) vs. associativity (2WAY, 4WAY, 8WAY); series: cache sizes 0.25 MB, 0.5 MB, 1 MB, 2 MB, 4 MB, 8 MB.]

Figure 11. Computation efficiency of five applications on four cache latency cases. [Line plot: Efficiency (%) vs. cache latency case (15%, 50%, 85%, 100%); series: GPR, APFA, PRF, SFHT, PBM.]


This is because the ratio of memory access time to processing time becomes smallest thanks to the decrease in memory latency provided by the cache; in these kernel loops, the memory access time of the shorter loop lengths cannot be entirely hidden by pipelined vector operations alone. Therefore, the performance at shorter loop lengths is sensitive to the vector cache latency, and the vector cache can boost the performance of applications with short vector lengths even on the 4 B/FLOP system.

Moreover, the cache latency is not hidden when bank conflicts occur. In this work, bank conflicts hardly occur in the five applications; a quantitative discussion of their influence is left as future work.

6. Optimizations for Vector Caching: Selective Caching and Loop Unrolling

In this section, we discuss some optimization techniques for the vector cache to reduce cache miss rates in scientific applications. In particular, we show the relationship between loop unrolling and caching, as many compilers automatically apply loop unrolling, the most basic loop optimization.

6.1 Effects of Selective Caching

The vector cache has a bypass mechanism, and data are selectively cached: selective caching. The simulation methods of four of the scientific applications, GPR, APFA, PRF and SFHT, are based on difference schemes, namely the Finite Difference Time Domain (FDTD) method and Direct Numerical Simulation of flow and heat (DNS), in which a subset of the arrays has high locality in many cases. As the on-chip cache size is limited, selective caching might be an effective approach for efficient use of the on-chip cache. We evaluate selective caching for two applications, GPR and PRF, which are typical examples of FDTD and DNS, respectively.

Figure 14 shows one of the kernel loops of GPR; the loop is a main routine of the FDTD method, and many routines of GPR are similar to this kernel loop. The innermost loop is indexed with j. Usually, FORTRAN compilers would interchange loops j and i; however, the FORTRAN90/SX compiler does not interchange them here, because it decides that the computational performance cannot be improved due to the short length of loop i. Loop i has a length of 50, and loop j has a length of 750.

As the size of each array used in GPR is 330 MB, the cache cannot store the arrays entirely. However, a difference scheme such as that in Figure 14 generally has temporal locality in accessing the arrays H_x, H_y and H_z, so we select these arrays for selective caching. Figure 15 shows the effect of selective caching on the 2 B/FLOP system with the vector cache in GPR, and Figure 16 shows the recovery rate of performance from the 2 B/FLOP system without the cache toward the 4 B/FLOP system under selective caching in GPR. We discuss cache sizes from 256 KB up to 8 MB. Here, ‘‘ALL’’ in the figures indicates that all the arrays in the loop are cached, and ‘‘Selective’’ that some of the arrays are selectively cached. ‘‘Cache Usage Rate’’ is the ratio of the number of cache hit references to the total number of memory references.

Figure 12. Computation efficiency of two kernel loops on four cache latency cases. [Two panels (Kernel Loop (1), Kernel Loop (2)): Efficiency (%) vs. cache latency case (15%, 50%, 85%, 100%); series: loop lengths 64, 128, 256.]

Figure 13. Computation efficiency of two kernel loops on the 4 B/FLOP and 2 B/FLOP systems. [Two panels (Kernel Loop (1), Kernel Loop (2)): Efficiency (%) vs. loop length (64, 128, 256); series: 4B/F, 4B/F 2MB, 2B/F, 2B/F 2MB.]


In the 256 KB cache case, the efficiencies of ‘‘ALL’’ and ‘‘Selective’’ are 33.3 and 34.1%, and the cache usage rates are 9.6 and 16.6%, respectively. The arrays H_y(i,j,k) and H_y(i-1,j,k) (line 08 in Figure 14), H_x(i,j,k) (line 11) and H_z(i-1,j,k) (line 12) can load data from the cache under ‘‘Selective.’’

01 DO 10 k=0,Nz
02   DO 10 i=0,Nx
03     DO 10 j=0,Ny
04       E_x(i,j,k) = C_x_a(i,j,k)*E_x(i,j,k)
05 &       + C_x_b(i,j,k) * ((H_z(i,j,k) - H_z(i,j-1,k))/dy
06 &       - (H_y(i,j,k) - H_y(i,j,k-1))/dz - E_x_Current(i,j,k))
07       E_z(i,j,k) = C_z_a(i,j,k)*E_z(i,j,k)
08 &       + C_z_b(i,j,k) * ((H_y(i,j,k) - H_y(i-1,j,k))/dx
09 &       - (H_x(i,j,k) - H_x(i,j-1,k))/dy - E_z_Current(i,j,k))
10       E_y(i,j,k) = C_y_a(i,j,k)*E_y(i,j,k)
11 &       + C_y_b(i,j,k) * ((H_x(i,j,k) - H_x(i,j,k-1))/dz
12 &       - (H_z(i,j,k) - H_z(i-1,j,k))/dx - E_y_Current(i,j,k))
13 10 CONTINUE

Figure 14. A kernel loop of GPR.

Figure 15. Efficiency of selective caching and cache hit rate on 2 B/FLOP system (GPR). [Dual-axis plot: Efficiency (%) and Cache Usage Rate (%) vs. cache size (256K to 8M); series: ALL (Performance), Selective (Performance), ALL (Cache usage rate), Selective (Cache usage rate).]

Figure 16. Recovery rate of performance and cache hit rate on selective caching (GPR). [Dual-axis plot: Recovery Rate of Performance (%) and Cache Usage Rate (%) vs. cache size (256K to 8M); series: ALL (Recovery Rate), Selective (Recovery Rate), ALL (Cache usage rate), Selective (Cache usage rate).]


In ‘‘ALL,’’ H_y(i-1,j,k) (line 08) and H_z(i-1,j,k) (line 12) cause cache misses, because the reusable data of these arrays are replaced by other array data. Figures 15 and 16 indicate that the performance and the cache usage rate increase with the cache size. With the 4 MB cache, the cache hit rate of ‘‘Selective’’ is 24% and its efficiency is 3% higher than that of ‘‘ALL.’’ The recovery rate of performance is a mere 34%; GPR cannot achieve high cache usage rates and recovery rates through selective caching, because this loop contains many arrays without locality, such as C_x_a, E_x and E_x_Current.

Figure 17 shows one of the kernel loops of PRF; the loop is a high-cost routine. The size of each array in PRF is 18 MB, and many loops are doubly nested. The arrays are thus treated as two-dimensional, and the cached data per array amount to only 2 MB. The arrays wDY1YC1, wDY1YC2 and wDY1YC3 in Figure 17 are defined in the preceding loop, and wDY1YC1 and wDY1YC2 have spatial locality; we select these arrays for selective caching. Here, wDY1YC3(I,KK-1,L) (line 04 in Figure 17), wDY1YC2(I,KK,L) (line 05), wDY1YC1(I,KK,L) (line 06) and wDY1YC2(I,KK,L) (line 06) can reuse data in the register files, so these references do not access the cache.

Figures 18 and 19 show the efficiency and the recovery rate of performance from the 2 B/FLOP system toward the 4 B/FLOP system under selective caching in PRF. Figure 18 indicates that the cache hit rate and efficiency of ‘‘Selective’’ are the same as those of ‘‘ALL’’ up to the 2 MB cache size: wDY1YC2(I,KK-1,L) (line 03) and wDY1YC1(I,KK-1,L) (line 06) are the cache-hit arrays in each case, and the cache hit rate is 33%. From the 4 MB cache size, ‘‘Selective’’ achieves a 66% cache hit rate, resulting in a more than 30% improvement in efficiency. Moreover, Figure 19 indicates that the recovery rate reaches 78%.

The results for GPR and PRF show that selective caching improves the performance of the applications. Compared with GPR, PRF has higher values for both the cache usage rate and the efficiency improvement, because the ratio of arrays with locality per loop is higher in PRF than in GPR. In GPR, the reused data are defined only within the loop itself, so cold misses always occur in this loop: for example, H_y(i,j,k) (line 08 in Figure 14) hits the data loaded by H_y(i,j,k) (line 06), but the reference in line 06 cannot hit, being the first access (cold miss). In PRF, by contrast, the cached data are produced in the immediately preceding loop, so such first-access misses do not occur.

01 DO KK = 2,NJ
02   DO I = 1,NI
03     wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04     wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) - wDY1YC1(I,KK,L) * wDY1YC3(I,KK-1,L)
05     wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06     wDY1YC(I,KK,L) = (wPHIY12(I,KK) - wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07   END DO
08 END DO

Figure 17. High cost kernel loop of PRF.

Figure 18. Efficiency and cache hit rate of selective caching on performance (PRF). [Dual-axis plot: Efficiency (%) and Cache Usage Rate (%) vs. cache size (256K to 8M); series: ALL (Performance), Selective (Performance), ALL (Cache usage rate), Selective (Cache usage rate).]


6.2 Effects of Loop Unrolling and Caching

Loop unrolling is effective for higher utilization of the vector pipelines; however, it also needs more cache capacity to hold all the arrays of the unrolled loops. We therefore discuss the influence of the combination of loop unrolling and vector caching on performance.

Figure 20 shows the code outer-unrolled 4 times, whose original is shown in Figure 7. In the outer-unrolled code, the reference to wary under outer loop index i (line 03 in Figure 20) reuses the data of index i-1 from the cache, and the wary references in lines 04, 05 and 06 reuse the data of line 03 in the register files; hence, memory references are reduced. However, the cache capacity needed for gd_dip increases with the number of array references in the innermost loop. Figure 21 shows the efficiency and the cache hit rate of the outer unrolling cases of PBM on the 2 B/FLOP system. ‘‘2B/F’’ indicates the case without the cache, whose efficiency increases steadily with the degree of unrolling. The cache hit rate of the 0.5 MB cache is 49% without unrolling and becomes poor as the unrolling proceeds, because the cached data of wary are evicted from the cache by gd_dip. The efficiency of the 0.5 MB cache case is thus 49% without unrolling and is similar to that of the 2 B/FLOP system with unrolling: here a conflicting effect between caching and unrolling appears. Moreover, the cache hit rate of the 8 MB cache also decreases as the degree of unrolling increases; the cached data remain in the cache, but the hit rate drops because the number of wary references (line 03) decreases. Nevertheless, the efficiency stays approximately constant at 48 or 49%, because unrolling compensates for the weakened effect of caching. This case shows that the effect of caching is comparable to that of unrolling.
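The eviction can be quantified from the Table 2 dimensions (our arithmetic, not stated in the paper): each column gd_dip(:,i) holds 32400 * 8 B ≈ 0.25 MB, so an unroll degree of u streams about u * 0.25 MB of gd_dip through the cache per sweep of the inner loop, while wary itself occupies about 0.25 MB. Without unrolling the combined working set (0.5 MB) just fits the 0.5 MB cache, so wary survives to be reused in the next outer iteration; already at u = 2 the working set exceeds the cache and wary is evicted before reuse, which is consistent with the collapse of the 0.5 MB hit rate and its survival in the 8 MB cache.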

Figure 22 shows the code obtained by unrolling the loop of Figure 17 two times. Outer unrolling reduces the memory references of the arrays wDY1YC2(I,KK,L) (line 07 in Figure 22) and wDY1YC1(I,KK,L) (line 10). Figure 23 shows the efficiency and the cache hit rate of outer unrolling for PRF on the 2 B/FLOP system. The cache hit rates of both the 0.5 MB and 8 MB caches gradually decline as the degree of unrolling increases, because of an increase in the cache miss rate. In this case, the highest efficiency is 33%, obtained with the 8 MB cache at an unrolling degree of 4, since the gain from loop unrolling outweighs the loss due to the increased miss rate. This is higher than the 30.7% of the selective caching case (cf. Figure 18). This case thus shows a synergistic effect of caching and unrolling.

Figure 19. Recovery rate of performance and cache hit rate on selective caching (PRF). [Dual-axis plot: Recovery Rate of Performance (%) and Cache Usage Rate (%) vs. cache size (256K to 8M); series: ALL (Performance), Selective (Performance), ALL (Cache usage rate), Selective (Cache usage rate).]

01 do i=1,ncells,4

02 do j=1,ncells

03 wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j)

04 wf_dip(i+1)=wf_dip(i+1)+gd_dip(j,i+1)*wary(j)

05 wf_dip(i+2)=wf_dip(i+2)+gd_dip(j,i+2)*wary(j)

06 wf_dip(i+3)=wf_dip(i+3)+gd_dip(j,i+3)*wary(j)

07 end do

08 end do

Figure 20. Outer unrolling loop of PBM.


These results indicate that loop unrolling can have both conflicting and synergistic effects with caching. Loop unrolling is the most basic optimization and is beneficial in various applications. Therefore, caching has to be used carefully, as a tuning option complementary to loop unrolling.

7. Conclusions

This paper has presented the characteristics of an on-chip vector cache with a bypass mechanism on the vector supercomputer NEC SX architecture.

Figure 21. Efficiency and cache hit rate of outer unrolling on performance (PBM). [Dual-axis plot: Efficiency (%) and Cache Hit Rate (%) vs. unrolling degree (Non-Unroll, Unroll 2, Unroll 4, Unroll 8, Unroll 16); series: 2B/F, 0.5MB, 8MB, Hit rate 0.5MB, Hit rate 8MB.]

01 DO KK = 2,NJ,2
02   DO I = 1,NI
03     wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04     wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) - wDY1YC(I,KK,L) * wDY1YC3(I,KK-1,L)
05     wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06     wDY1YC(I,KK,L) = (wPHIY12(I,KK) - wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07     wDY1YC3(I,KK,L) = wDY1YC3(I,KK,L) * wDY1YC2(I,KK,L)
08     wDY1YC2(I,KK+1,L) = wDY1YC2(I,KK+1,L) - wDY1YC(I,KK+1,L) * wDY1YC3(I,KK,L)
09     wDY1YC2(I,KK+1,L) = 1.D0 / wDY1YC2(I,KK+1,L)
10     wDY1YC(I,KK+1,L) = (wPHIY12(I,KK+1) - wDY1YC1(I,KK+1,L) * wDY1YC1(I,KK,L)) * wDY1YC2(I,KK+1,L)
11   END DO
12 END DO

Figure 22. Outer unrolling loop of PRF.

Figure 23. Efficiency and cache hit rate of outer unrolling on performance (PRF). [Dual-axis plot: Efficiency (%) and Cache Hit Rate (%) vs. unrolling degree (Non-Unroll, Unroll 2, Unroll 4, Unroll 8, Unroll 16); series: 2B/F, 0.5MB, 8MB, hit rate 0.5MB, hit rate 8MB.]


We have demonstrated the relationship between the cache hit rate and the performance using two DAXPY-like loops, and have clarified the same tendency for the five scientific applications. The vector cache can recover the lack of memory bandwidth and boosts the computation efficiency of the 2 B/FLOP and 1 B/FLOP systems. In particular, at cache hit rates of 50% or more, the 2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system. The vector cache with a bypass mechanism can provide data from both the memory and the cache at the same time, increasing the sustained memory bandwidth to the registers. Therefore, the vector cache has the potential to cover the shortage of memory bandwidth in future SX systems.

We have also discussed the relationship between performance and cache design parameters, namely cache associativity and cache latency. We have examined the effect of the cache associativity from 2-way to 8-way: when the associativity is two or more, the cache hit rate is not sensitive to it in the five applications. In addition, we have examined the effect of the cache latency on the performance, setting it to 15, 50, 85 and 100% of the memory latency. The computational efficiencies of the five applications are constant across these latency changes when the vector loop lengths are 256 or more; in these cases the latency is hidden by pipelined vector operations. However, for shorter vector loop lengths the cache latency affects the performance: the 15% latency case of Kernel (1) has twice the efficiency of the 100% case on the 2 B/FLOP system, and the 4 B/FLOP system is also boosted by the short latency of the cache. These results indicate that the performance of the cache does not largely depend on the cache associativity, whereas the cache latency affects the performance of applications with short vector loop lengths.

Finally, we have discussed selective caching and the relationship between loop unrolling and caching. We have shown that selective caching, which is controlled by means of the bypass mechanism, is effective for efficient use of limited on-chip caches. We have examined two cases, in which the ratio of arrays with locality per loop is higher and lower, respectively. In both cases, selective caching obtains higher performance than caching all the data. In addition, loop unrolling is useful for improving performance, and caching is complementary to its effect.

In future work, we will evaluate other scientific applications on the vector cache and quantitatively discuss the relationship between bank conflicts and the cache latency. In addition, we will investigate vector cache mechanisms for multi-core vector processors.

Acknowledgments

The authors would like to gratefully thank Dr. M. Sato, Dr. A. Hasegawa, Professor G. Masuya, Professor T. Ota and Professor K. Sawaya of Tohoku University for their advice on the applications. We would also like to thank Matsuo Aoyama and Hirofumi Akiba of NEC Software Tohoku for their assistance in the experiments.

REFERENCES

[1] Abts, D., et al., ‘‘The Cray BlackWidow: A Highly Scalable Vector Multiprocessor,’’ In Proceedings of the ACM/IEEE SC2007 Conference (2007).
[2] Abts, D., Scott, S., and Lilja, D. J., ‘‘So Many States, So Little Time: Verifying Memory Coherence in the Cray X1,’’ In International Parallel and Distributed Processing Symposium (2003, April).
[3] Ariyoshi, K., Matsuzawa, T., and Hasegawa, A., ‘‘The key frictional parameters controlling spatial variation in the speed of postseismic slip propagation on a subduction plate boundary,’’ Earth Planet. Sci. Lett., 256: 136–146 (2007).
[4] Dunigan Jr., T. H., Vetter, J. S., White III, J. B., and Worley, P. H., ‘‘Performance Evaluation of The Cray X1 Distributed Shared-Memory Architecture,’’ IEEE Micro, 25: 30–40 (2005).
[5] Gee, J. D., and Smith, A. J., ‘‘Vector Processor Caches,’’ Technical Report UCB/CSD-92-707, University of California (1992, October).
[6] Gee, J. D., and Smith, A. J., ‘‘The Effectiveness of Caches for Vector Processors,’’ In The 8th International Conference on Supercomputing 1994, pp. 333–343 (1994, July).
[7] Kitagawa, K., Tagaya, S., Hagihara, Y., and Kanoh, Y., ‘‘A Hardware Overview of SX-6 and SX-7 Supercomputer,’’ NEC Research & Development, 44: 2–7 (2003).
[8] Kobayashi, H., ‘‘Implication of Memory Performance in Vector-Parallel and Scalar-Parallel HEC,’’ In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2006, pp. 21–50, Springer-Verlag (2006).
[9] Kobayashi, H., Musa, A., Sato, Y., Takizawa, H., and Okabe, K., ‘‘The Potential of On-chip Memory Systems for Future Vector Architectures,’’ In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2007, pp. 247–264, Springer-Verlag (2007).
[10] Kobayashi, T., Feng, X., and Sato, M., ‘‘FDTD simulation on array antenna SAR-GPR for land mine detection,’’ In Proceedings of SSR2003: 1st International Symposium on Systems and Human Science, Osaka, Japan, pp. 279–283 (2003, November).
[11] Kontothanassis, L. I., Sugumar, R. A., Faanes, G. J., Smith, J. E., and Scott, M. L., ‘‘Cache Performance in Vector Supercomputers,’’ In Proceedings of the 1994 Conference on Supercomputing, Washington D.C., pp. 255–264 (1994).
[12] Kunz, K. S., and Luebbers, R. J., The Finite Difference Time Domain Method for Electromagnetics, Boca Raton, FL: CRC Press (1993).
[13] Musa, A., Sato, Y., Egawa, R., Takizawa, H., Okabe, K., and Kobayashi, H., ‘‘An On-Chip Cache Design for Vector Processors,’’ In Proceedings of the 8th MEDEA Workshop on Memory Performance: Dealing with Applications, Systems and Architecture, Romania, pp. 17–23 (2007).
[14] Musa, A., Takizawa, H., Okabe, K., Soga, T., and Kobayashi, H., ‘‘Implications of Memory Performance for Highly Efficient Supercomputing of Scientific Applications,’’ In Proceedings of the 4th International Symposium on Parallel and Distributed Processing and Application 2006, Italy, pp. 845–858 (2006).
[15] Nakajima, M., Yanaoka, H., Yoshikawa, H., and Ota, T., ‘‘Numerical Simulation of Three-Dimensional Separated Flow and Heat Transfer around Staggered Surface-Mounted Rectangular Blocks in a Channel,’’ Numerical Heat Transfer (Part A), 47: 691–708 (2005).
[16] Oliker, L., et al., ‘‘A Performance Evaluation of the Cray X1 for Scientific Applications,’’ In VECPAR'04: 6th International Meeting on High Performance Computing for Computational Science, Valencia, Spain (2004, June).
[17] Senta, T., Takahashi, A., Kadoi, T., Itoh, T., Kawaguchi, S., and Kishida, Y., ‘‘Itanium2 32-way Server System Architecture,’’ NEC Research & Development, 44: 8–12 (2003).
[18] SGI, ‘‘SGI Altix 4700 Servers and Supercomputers,’’ Technical report, SGI Inc., http://www.sgi.com/pdfs/3867.pdf (2007).
[19] Shan, H., and Strohmaier, E., ‘‘Performance Characteristics of the Cray X1 and Their Implications for Application Performance Tuning,’’ In 18th Annual International Conference on Supercomputing, Saint-Malo, France, pp. 175–183 (2004).
[20] Takagi, Y., Sato, H., Wagatsuma, Y., Mizuno, K., and Sawaya, K., ‘‘Study of High Gain and Broadband Antipodal Fermi Antenna with Corrugation,’’ In 2004 International Symposium on Antennas and Propagation, Vol. 1, pp. 69–72 (2004).
[21] Tsuboi, K., and Masuya, G., ‘‘Direct Numerical Simulations for Instabilities of Premixed Planar Flames,’’ In The Fourth Asia-Pacific Conference on Combustion, Nanjing, China, pp. 136–139 (2003, November).
