
    A Network Congestion-Aware Memory Controller

    Dongki Kim, Sungjoo Yoo, Sunggu Lee

    Embedded System Architecture Laboratory

Department of Electronic and Electrical Engineering, Pohang University of Science and Technology (POSTECH)

    {dongki.kim, sungjoo.yoo, slee}@postech.ac.kr

    ABSTRACT

The network-on-chip and the memory controller become correlated with each other under high network congestion, since the network port of the memory controller can be blocked by the (back-propagated) congestion. We call this problem network congestion-induced memory blocking. To resolve it, we present a novel idea: a network congestion-aware memory controller. Based on global information about network congestion, the memory controller performs (1) congestion-aware memory access scheduling and (2) congestion-aware network entry control of read data. Experimental results obtained from a 5x5 tile architecture show that the proposed memory controller yields up to 18.9% improvement in memory utilization.

1. Introduction

In many-cores, the performance of memory and network often determines system performance. To increase memory performance, multiple memory modules (in short, multiple memories) are starting to be used [1][2][3][4][5]. In addition, 3D die stacking is expected to make their usage popular [6][7][8][9]. For the performance improvement of an individual memory, several studies on memory access scheduling have been presented [10][11][12][13][14].

The network-on-chip (NoC) is becoming more and more popular in many-cores [2][3][8]. As the NoC size increases, network congestion becomes one of the most important problems limiting NoC performance and thereby system performance. Several methods, e.g., adaptive routing [15][16][17], have been presented to address the network congestion problem.

In most previous work, memory and network have been assumed to be independent of each other. However, under heavy network congestion, memory and network become correlated with each other, in a negative way, yielding system performance degradation. In this paper, we tackle a problem caused by this correlation, which we call network congestion-induced memory blocking. In this problem, network congestion propagates from hot spots up to the memory controller. It prevents the memory controller from servicing productive requests1 and, finally, degrades memory performance and thereby system performance.

To resolve the problem, in this paper, we propose a novel idea called the network congestion-aware memory controller. In the proposed method, network congestion information is first propagated to the memory controller. Then, the memory controller takes two measures, network congestion-aware memory access scheduling and network entry control, in order to prioritize productive requests (requests from the uncongested area) over unproductive ones (requests from the congested area, which only consume resources in the memory controller and/or NoC without contributing to system performance). We evaluate the proposed idea with a tile-based many-core architecture.

This paper is organized as follows. Section 2 reviews related work. Section 3 gives preliminaries. Section 4 explains our motivation. Section 5 presents the proposed method, the network congestion-aware memory controller. Section 6 reports experimental results. Section 7 concludes the paper.

2. Related Work

Rixner, et al. present memory access scheduling policies including first ready-first come first serve (FR-FCFS) and open/close page schemes [10]. Heithecker and Ernst utilize traffic shaping to assign memory resources to high-priority requests, thereby reducing the performance degradation of other, low-priority requests [11]. Ahn, et al. present a memory controller with per-thread request buffers in order to avoid inter-thread interference while exploiting intra-thread locality, e.g., a high row buffer hit rate [18]. Akesson, et al. present a memory controller called Predator, where memory accesses are grouped to enable parallel bank accesses, thereby improving memory utilization and easing QoS support [12]. Mutlu, et al. present a fair memory access scheduling scheme which balances the slow-down (the ratio of memory access latency between the shared and alone cases) among cores when a shared DRAM is accessed by multiple cores [13]. They also present another fair memory access scheduling scheme, called PAR-BS, which tries to avoid starvation of low-priority requests by applying batch scheduling [14].

1 Productive requests are the ones that will improve system performance if they are served by the memory controller.

Several solutions to the NoC congestion problem have been presented. Most existing solutions utilize, as the congestion metric, the resource utilization of the local and neighbor routers, e.g., buffer availability [15] and the queue length of output ports [16][17]. Singh, et al. present a method called GOAL where the routing decision is based on the queue length of each output port [16]. They also present a method called GAL which utilizes the information of output queue length for both quadrant selection and routing in the torus architecture [17]. Gratz, et al. present the notion of regional congestion awareness (RCA), where a router determines, for each incoming packet, the output port based on the global congestion information of each quadrant of the mesh architecture [19]. The method presented in this paper is based on the global congestion information used in RCA. In our case, we exploit the global congestion information for memory access scheduling and network entry control. Our method can be combined with the RCA method, which is beyond the scope of this paper.

There has been little work on integrated and cooperative solutions which include both the memory controller and the network (congestion avoidance). Recently, Jang and Pan present a memory-aware NoC router [20]. In this work, the router performs switch allocation, i.e., output port arbitration, considering the memory access latency of packets contending for the same output port. The router performs the memory access scheduling that was previously executed by the memory controller. However, network congestion information is not utilized in this work.

Our contribution is to identify the problem of network congestion-induced memory blocking and to integrate network congestion awareness into the memory controller in order to resolve it.

3. Preliminaries: DRAM, Memory Access Scheduling and Memory Controller

Figure 1 illustrates the internal structure of a DRAM. As shown in the figure, the memory array is divided into banks (typically 4 or 8 banks per DRAM). Each bank, which can be accessed independently of the other banks, is a two-dimensional array of memory cells consisting of rows and columns. Figure 1 illustrates a four-bank DRAM. A DRAM access typically takes three steps. In step 1, the bank and row are selected by a bank/row address and a set of control signals including RAS (row address strobe), as the arrow denoted with (1) in Figure 1 shows. The selected row (typically, 2KB of data) is fetched into the row buffer (consisting of sense amplifiers and latches), which incurs a latency of tRCD. This operation is often called row activation, and the row is said to be open when it has been fetched into the row buffer. In step 2, the desired data (the dark rectangle in Figure 1) are accessed (read or written) by a column address and a set of control signals including CAS (column address strobe), as the arrow denoted with (2) shows in the figure. In the case of a read operation, the accessed data flow over the external IO signals (DQs) after the column access latency, called CAS latency (CL). In step 3, after the accesses to the open row in the row buffer finish, the row is closed by the memory controller issuing a memory command called precharge (PRE) to the DRAM, as the arrow denoted with (3) shows in the figure. The precharge command incurs a latency of tRP. Another row of the same bank can be activated only after the previously open row has been precharged.

    Figure 1 DRAM structure and access flow
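The three-step flow fixes the first-data latency of a read as a simple sum of the timing parameters above. The following minimal sketch is ours, not the paper's; it uses the illustrative timings tRCD = CL = tRP = 3 from the Figure 2 example to compute that latency for the three possible row-buffer states:

def read_latency(row_state, tRCD=3, CL=3, tRP=3):
    """Cycles from the first DRAM command until read data appear on DQ."""
    if row_state == "open":      # row already in the row buffer: column access only
        return CL
    if row_state == "closed":    # bank precharged: activate (step 1), then access (step 2)
        return tRCD + CL
    if row_state == "conflict":  # another row open: precharge (step 3), activate, access
        return tRP + tRCD + CL
    raise ValueError(row_state)

print(read_latency("open"), read_latency("closed"), read_latency("conflict"))  # 3 6 9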

Memory access scheduling tries to improve memory performance by exploiting the facts that multiple banks can be accessed independently (as long as there is no conflict on the address/control/data buses) and that an access to an open row requires the minimum access latency. Figure 2 illustrates a memory access scheduling policy called FR-FCFS (first ready-first come first serve) [10], in which the oldest ready request is served by the memory controller. We assume a DDR SDRAM whose CL, tRP, and tRCD are all 3 cycles and whose BL (burst length) is 8.

    Figure 2 FR-FCFS policy in memory access scheduling


Assume that three read requests arrive at the memory controller at times 0, 2, and 4, as shown by the three dashed arrows at the bottom of Figure 2. Each request is represented by a tuple (bank ID, row address, column address). The first request accesses bank 0 (the 1st bank), row 0 (the 1st row in bank 0) and column 0 (the 1st data in row 0); the second one bank 0, row 1, and column 8; and the third one bank 2, row 2, and column 8. At time 0, the row activation command for the first request, B0R0C0, is sent to the DRAM, which takes 3 cycles (tRCD) to open the row.2 At time 3, the column access command is sent to access the open row. After the delay of CL, i.e., 3 cycles, at time 6, the data of the first request come out of the DRAM. At time 7, a precharge command is sent to close the open row in order to prepare the access to another row, row 1, in the same bank. Note that the precharge command can overlap with the read data operation.3

When the requests are served in a FCFS manner, the third request can start to be served only after the service of the second request has finished, which incurs a large delay in our example. FR-FCFS gives a reduced latency, as Figure 2 shows. Even though the third request arrives at the memory controller later than the second request, it is served earlier since it is ready, i.e., the row activation command for bank 2 can be issued at the time the third request arrives. As Figure 2 shows, the row activation of the third request is issued at time 4 to open row 2 in bank 2. The data of the third request start to come out of the DRAM at time 10. Thus, the service of all three requests finishes at time 20 under the FR-FCFS policy. If FCFS is applied, the total takes 30 cycles since all the requests are served sequentially.
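The selection rule behind this example can be stated compactly. The sketch below is our own illustration of FR-FCFS request selection, not the paper's implementation; the Request type, the bank_free bookkeeping, and the cycle values are assumptions for exposition (a real controller derives readiness from per-bank FSMs and the full DRAM timing):

from dataclasses import dataclass

@dataclass
class Request:
    arrival: int   # cycle the request reached the controller
    bank: int
    row: int
    col: int

def fr_fcfs(queue, bank_free, now):
    """Serve the oldest request whose bank can issue a command now
    ('first ready'); if none is ready, fall back to the oldest request."""
    arrived = [r for r in queue if r.arrival <= now]
    ready = [r for r in arrived if bank_free.get(r.bank, 0) <= now]
    pool = ready or arrived
    return min(pool, key=lambda r: r.arrival) if pool else None

# At cycle 4 in the Figure 2 example, bank 0 is still busy with the first
# request (it frees up around cycle 10, after the precharge completes), so
# the newly arrived bank-2 request is the oldest *ready* request and is
# served before the older bank-0 request.
queue = [Request(2, 0, 1, 8), Request(4, 2, 2, 8)]
print(fr_fcfs(queue, bank_free={0: 10}, now=4))   # -> the bank-2 request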

Memory access scheduling is usually performed by the memory controller. Figure 3 shows a simplified block diagram of a memory controller. It is connected to the network interface (NI) and the DRAM, as shown in the figure. It receives memory access requests from the NI, e.g., via the read/write address channels in the case of the AXI protocol [21]. The requests are stored in the request buffer (RB). The memory access scheduler checks the status of each bank, tracked by the bank FSMs, and the requests in the RB. The scheduler applies a scheduling algorithm, e.g., FR-FCFS [10], PAR-BS [14], etc., to select a request from the RB. Then, it generates the corresponding DRAM commands for row activation, column access (read or write), precharge, etc. The generated DRAM commands are usually stored in the command buffer and issued to the DRAM at their scheduled cycles.

2 In this example, for simplicity, we assume that the memory controller does not consume internal delay cycles. In our experiments, the memory controller takes internal latency cycles from the reception of a request from the network to the issue of DRAM commands.

3 The overlap period between the read and precharge operations depends on BL, tRTP (read-to-precharge latency) and AL (additive latency). For more details, refer to the DRAM specification, e.g., [24].

In the case of a write request, the data buffer (DB) receives the write data from the NI, e.g., via the write data channel in the AXI protocol, and sends them to the DRAM when the corresponding column access commands are issued. In the case of a read request, the DB receives data from the DRAM and then sends them to the NI, e.g., via the read data channel in the AXI protocol.

Figure 3 A simplified block diagram of the memory controller
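To make this flow concrete, here is a toy skeleton of the structures in Figure 3. The class and function names are our own, not the paper's design; the command expansion handles only the simple closed-row read case and ignores most timing constraints:

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int
    bank: int
    row: int
    col: int
    is_read: bool = True

def expand_to_commands(req, now, tRCD=3):
    """Toy command generation for a closed bank: ACT, then a column access."""
    return [(now, ("ACT", req.bank, req.row)),
            (now + tRCD, ("READ" if req.is_read else "WRITE", req.bank, req.col))]

class MemoryController:
    def __init__(self, scheduler):
        self.request_buffer = deque()   # RB: requests from the NI address channels
        self.data_buffer = deque()      # DB: write data in, read data out
        self.command_buffer = deque()   # (issue cycle, DRAM command) pairs
        self.scheduler = scheduler      # selection policy, e.g., FR-FCFS

    def tick(self, now):
        # pick a request and expand it into DRAM commands at scheduled cycles
        req = self.scheduler(self.request_buffer, now)
        if req is not None:
            self.request_buffer.remove(req)
            self.command_buffer.extend(expand_to_commands(req, now))
        # issue every command whose scheduled cycle has been reached
        while self.command_buffer and self.command_buffer[0][0] <= now:
            print("issue:", self.command_buffer.popleft()[1])

mc = MemoryController(lambda rb, now: rb[0] if rb else None)  # plain FCFS
mc.request_buffer.append(Request(0, bank=0, row=0, col=0))
for t in range(4):
    mc.tick(t)
# issues ACT at cycle 0 and READ at cycle 3 (tRCD = 3)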

4. Motivation

Figure 4 illustrates the problem of network congestion-induced memory blocking. Figure 4 (a) shows a 3x3 tile architecture where the memory controller (MC) is located at the center tile. We assume that there is network congestion. The shaded routers (small shaded rectangles denoted with R) and the thick links between them represent the congested area. In Figure 4 (a), the congestion is mostly on the third column of the mesh topology and reaches the MC through the router connected to it.

Figure 4 Network congestion-induced memory blocking: (a) 3x3 tile architecture, (b) blocking in MC and NI

Figure 4 (b) shows a detailed view of the MC and the connected router, corresponding to the dashed area in Figure 4 (a). In Figure 4 (b), the MC has two read requests, from PE0 and PE3 (R_PE0 and R_PE3 in the RB). The MC also holds two data items bound for PE8 and PE2 (D_PE8 and D_PE2 in the DB). In this case, since the paths from the MC to PE8 and PE2 are congested and the output port of the MC is blocked by the congestion, the two data items cannot enter the NoC. They simply wait until the congestion is removed, while occupying a precious resource, the DB. Since the DB is full and there is no space available to store data coming from the DRAM, the two read requests in the RB cannot issue their memory commands for read operations. Thus, no memory command is issued to the DRAM, and the DRAM remains idle until the network congestion is removed. We call such a case network congestion-induced memory blocking. The problem is resolved only when the congestion is removed (or mitigated). As will be reported in Section 6, this problem can cause a significant degradation in memory utilization and thereby in system performance.
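The stall can be captured in a few lines. This toy model is ours, not the authors' simulator; the destinations and capacity mirror the Figure 4 (b) scenario and show how a data buffer filled with blocked read data stalls even requests whose responses would leave through uncongested paths:

def can_issue_read(data_buffer, capacity):
    # a read needs a free DB slot for the data returning from DRAM
    return len(data_buffer) < capacity

def drain_one(data_buffer, congested_dests):
    # only data bound for an uncongested destination may enter the NoC
    for i, dest in enumerate(data_buffer):
        if dest not in congested_dests:
            return data_buffer.pop(i)
    return None   # every buffered burst is blocked

data_buffer = ["PE8", "PE2"]          # DB full (capacity 2)
congested = {"PE8", "PE2"}            # both destination paths congested
print(drain_one(data_buffer, congested))        # None: nothing can leave
print(can_issue_read(data_buffer, capacity=2))  # False: R_PE0/R_PE3 stall, DRAM idles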

From the viewpoint of the memory controller, the above problem is caused by the fact that conventional memory access scheduling does not take the network status into account. In the case of Figure 4, memory access scheduling produces the data D_PE8 and D_PE2 in order to maximize only memory performance, i.e., memory utilization, without considering system performance. The lack of knowledge of the network status in memory access scheduling causes the DB to be occupied solely by blocked data bound for the congested area.

The above problem can be resolved by a network congestion-aware memory controller. To be specific, the MC takes two measures in case of network congestion: (1) prioritization of requests from the uncongested area and (2) congestion-aware network entry control of read data. The rationale behind the two measures is as follows. Under high congestion, as shown in Figure 4 (b), the MC resources (especially the DB and the output port) tend to be occupied by requests from the congested area. Thus, in order to prevent those requests from occupying the MC resources, requests from the uncongested area need to be prioritized. Regarding the network entry control, read data bound for the congested area will, when they enter the NoC, aggravate the NoC congestion and can cause the blocking problem. Thus, in this case, read data bound for the uncongested area are prioritized in entering the NoC. By doing so, the memory controller can reclaim data buffer space more quickly and serve more requests with the reclaimed resources.

5. Network Congestion Awareness in the Memory Controller

To realize the idea of a network congestion-aware memory controller, the congestion information needs to be transferred from the NoC to the MC (Section 5.1). Then, both memory access scheduling and network entry control need to exploit the congestion information (Section 5.2).

    Figure 5 Regions (quadrants) to manage congestion information

5.1 Congestion Information Management

Congestion information is collected within the NoC and transferred to the MC. As the congestion information, we utilize the global congestion information called regional congestion in [19]. The regional congestion information represents how congested each quadrant around a router is. Figure 5 illustrates the regional congestion information. The router (shaded rectangle) at (3,3) has congestion information for each of the four quadrants (north-west/east and south-west/east) seen from its location.

    Figure 6 Propagation of regional congestion information

Figure 6 illustrates how each router manages the regional congestion information. To quantify the congestion level, we use the number of occupied VCs (vc) and the crossbar demand (xb, the number of contending packets bound for the same output port) per router.4 The higher vc and xb are, the more congested the network is in the corresponding quadrant. As the dashed arrows in Figure 6 represent, each router collects the congestion information from its neighbor routers via a set of sideband signals.5 It calculates its own congestion level as a weighted sum of its local congestion information (vc and xb) and that of its neighbors collected via the sideband signals, and propagates the new congestion information to its neighbor routers. Note that the congestion information propagates from a router to all its neighbors, i.e., in four directions in the mesh architecture. For simplicity, Figure 6 shows only the south-west-bound flow of congestion information.6

4 In [19], three kinds of congestion information are considered: the number of occupied VCs (vc), the number of occupied buffers (bf), and the crossbar demand (xb). The combination of vc and xb proves to give the best results.
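As an illustration, the per-router bookkeeping can be sketched as follows. The unit weights and the 0.5 damping factor are our assumptions for exposition; [19] explores the actual metric and aggregation choices:

def local_metric(vc_occupied, xb_demand, w_vc=1.0, w_xb=1.0):
    """Local congestion level from occupied VCs and crossbar demand."""
    return w_vc * vc_occupied + w_xb * xb_demand

def reported_level(local, neighbor_level, alpha=0.5):
    """Level a router propagates onward: its own metric plus a damped
    contribution of what its upstream neighbor reported."""
    return local + alpha * neighbor_level

# Congestion information flowing along a chain of three routers toward the MC:
level = 0.0
for vc, xb in [(3, 1), (4, 2), (2, 0)]:
    level = reported_level(local_metric(vc, xb), level)
print(level)   # 6.0: distant routers still contribute, but with decaying weight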

The congestion level of each quadrant around the MC is propagated to the MC, as the dashed line between the router and the MC in Figure 7 shows.

Figure 7 Propagating congestion information to the memory controller

5.2 Network Congestion-Aware Memory Access Scheduling and Network Entry Control

We integrate congestion awareness into two functions of the memory controller: memory access scheduling and network entry control of read data.

Figure 8 shows the pseudo code of network congestion-aware memory access scheduling. We use FR-FCFS as the baseline memory access scheduling policy. Note that congestion awareness can be integrated into any memory access scheduling policy in the same way.

5 The congestion information is carried on 9 bits of sideband signals in our work. However, the overhead of the sideband signals can become negligible, since they can be reduced down to a serial link without losing the benefit of adaptive routing, as reported in [19]. A detailed analysis of the effect of reducing the number of sideband signals is left for future work.

6 In adaptive routing which utilizes the regional congestion information, each router selects the link with the lowest congestion level. For instance, in the case of the (shaded) router at (1,1) in Figure 6, when a north-east-bound packet arrives at the router, it selects the east-bound output port as the target output port, since the congestion level of the north-bound output port (24.25) is higher than that of the east-bound output port (18.25). For more details of regional congestion awareness, refer to [19].

Figure 8 Network congestion-aware memory access scheduling

1 // CQ and UCQ: congested and uncongested quadrants
2 // R_CQ and R_UCQ: requests from CQ and UCQ
3 if (high congestion in any quadrant)
4     R_UCQ = All_Requests - R_CQ
5     selected = FR_FCFS(R_UCQ)
6     if (selected == null)
7         selected = FR_FCFS(R_CQ)
8 else // no heavy congestion
9     selected = FR_FCFS(All_Requests)

If any of the four quadrants around (the router connected to) the MC has a high congestion level (line 3), then the requests from the congested quadrant(s) (CQ) are excluded from the set of candidate requests in the request buffer (line 4). The FR-FCFS policy is applied to the new candidate set (R_UCQ in Figure 8). If there is a selected request, it is served by the MC (line 5). If there is none, then a candidate is selected, if any, from the requests from the congested quadrant(s) (R_CQ) (lines 6-7). If there is no quadrant with high congestion, then the original FR-FCFS is applied to all the requests in the request buffer (lines 8-9).
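A runnable transcription of Figure 8 follows. This is our sketch: FR_FCFS is reduced to "oldest request", and the per-request quadrant and the congestion threshold are assumed inputs; a real implementation would reuse the bank-aware FR-FCFS of Section 3:

from dataclasses import dataclass

@dataclass
class Request:
    arrival: int
    quadrant: int   # quadrant of the requesting tile, seen from the MC

def FR_FCFS(requests):
    # placeholder: oldest-first; the real policy also checks bank readiness
    return min(requests, key=lambda r: r.arrival) if requests else None

def select_request(all_requests, congestion, threshold):
    cq = {q for q, level in congestion.items() if level > threshold}
    if cq:                                                         # line 3
        r_ucq = [r for r in all_requests if r.quadrant not in cq]  # line 4
        selected = FR_FCFS(r_ucq)                                  # line 5
        if selected is None:                                       # line 6
            selected = FR_FCFS([r for r in all_requests if r.quadrant in cq])  # line 7
        return selected
    return FR_FCFS(all_requests)                                   # lines 8-9

reqs = [Request(0, quadrant=1), Request(3, quadrant=2)]
print(select_request(reqs, congestion={1: 24.25, 2: 4.0, 3: 5.0, 4: 6.0}, threshold=16.0))
# -> Request(arrival=3, quadrant=2): the older request is deprioritized
#    because it comes from congested quadrant 1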

The network entry control is applied to the read data stored in the data buffer of the memory controller. The procedure for selecting the data (to be injected into the NoC) from the data buffer is similar to that in Figure 8. First, if there is any quadrant with high congestion, the read data bound for that quadrant are excluded from the set of ready read data in the data buffer. If there are any data bound for the uncongested area, the oldest such data are selected and sent to their destination. If there are no data bound for the uncongested area, the oldest ready data bound for the congested area are selected. Note that the granularity of selection is a burst of data. In our experiments, a burst of data corresponds to 8 words of 64b data for an L1 cache miss.
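The entry-control selection admits the same kind of sketch at burst granularity (again our illustration; the Burst type and the congested-quadrant set are assumed inputs):

from dataclasses import dataclass

@dataclass
class Burst:
    ready_since: int    # cycle the read data became ready in the DB
    dest_quadrant: int  # quadrant of the destination tile

def select_for_entry(bursts, congested_quadrants):
    """Oldest burst bound for an uncongested quadrant; if every ready
    burst heads into congestion, fall back to the overall oldest."""
    clear = [b for b in bursts if b.dest_quadrant not in congested_quadrants]
    pool = clear or bursts
    return min(pool, key=lambda b: b.ready_since) if pool else None

db = [Burst(0, dest_quadrant=1), Burst(5, dest_quadrant=3)]
print(select_for_entry(db, congested_quadrants={1}))  # -> the quadrant-3 burst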

6. Experiments

6.1 Experimental Setup

Figure 9 shows the 5x5 tile architecture used in our experiments. The 5x5 architecture has one MC at the center tile. The MC runs at 400MHz and is connected to a conventional DRAM memory channel of DDR2-800 (CL-tRP-tRCD = 6-6-6). The MC has a request buffer and a data buffer whose sizes are varied in our experiments. The MC supports the AXI protocol [21] at its bus interface. Thus, we use a network interface between the router and the MC. We use a Tensilica LX2 core (64b, 16K/16K caches, cache line size of 64B, 400MHz) as the PE.

The NoC router supports a four-stage pipeline (input buffering, virtual channel allocation & lookahead routing computation, switch allocation, and switch traversal), 4 virtual channels (VCs) per input port, wormhole switching, speculative allocation [22], XY routing and 64b flits. There are two types of packets: request and response packets. The read request



At low and medium levels of network congestion, the performance gain is modest (4.4%~14.1%).

Figure 10 Performance comparison: (a) baseline, (b) proposed method

    Figure 11 Performance gains

Figure 12 gives another explanation of the high performance gain of the proposed method. The figure shows the fraction of time that the data buffer is full. As the congestion level becomes higher, the data buffer becomes full more frequently in the baseline method. This is because, at high congestion, the read data bound for the congested area occupy the output port of the memory controller as well as the majority of the data buffer; that is, network congestion-induced memory blocking occurs, as exemplified in Section 4. Thus, they prevent the other data, bound for the uncongested area, from being served. The proposed method reduces the period during which the data buffer is full by up to 12.1% (high congestion, DB size = 64), which translates into the performance gain in Figure 11.

Figure 12 Fraction of time when data buffer is full (single-program case)

Network congestion-aware memory access scheduling and network entry control can be applied independently of each other. Figure 13 shows the effect of each method when applied alone to the case of a high congestion level. The figure shows that network congestion-aware memory access scheduling (Schedule in the figure) alone offers up to 16.0% improvement in memory utilization, while network congestion-aware entry control (Entry) alone offers a performance improvement of up to 8.1%.

    Figure 13 Decomposition of performance impact

Comparing the results in Figures 11 and 13, the effects of congestion-aware memory access scheduling and network entry control are not always additive. In the case of small DB sizes (32 and 64), the effect of network congestion-aware memory access scheduling dominates. The effect of network congestion-aware network entry control becomes more significant with large DB sizes. This is because large DB sizes allow more space to keep data bound for the congested area in the case of high congestion. Thus, there are more opportunities for data bound for the uncongested area to bypass data bound for the congested area, thereby reducing the memory blocking period and finally improving memory utilization.

    Figure 14 Performance gains in the multi-program case

Figure 14 shows the performance gain of the proposed method for the multi-program mapping on the 5x5 architecture shown in Figure 9. The proposed method gives a significant performance improvement of up to 18.9% (high congestion, DB size = 64). Figure 15 confirms that the reduction in the memory blocking period (enabled by the proposed memory controller) is what provides the performance improvement shown in Figure 14.

Figure 15 Fraction of time when data buffer is full (multi-program case)

7. Conclusion

In this paper, we described a performance problem called network congestion-induced memory blocking and presented, as a solution, a memory controller which performs memory access scheduling and network entry control of read data in a network congestion-aware manner. The proposed memory controller gives significant performance improvements (up to 18.9% in memory utilization), especially when the network suffers from severe congestion. As future work, we will work on an adaptive solution to cope with dynamically changing network congestion. We will also work on fairness issues between requests from congested and uncongested areas.

8. References

[1] E. Lindholm, et al., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 28(2), March 2008.

    [2] S. Bell, et al., TILE64 - Processor: A 64-Core SoC with Mesh

Interconnect, Proc. ISSCC, 2008.

    [3] Intel, Co., Single-chip Cloud Computer, available at

    http://techresearch.intel.com/articles/Tera-Scale/1826.htm.

    [4] W. Kwon, et al., A Practical Approach of Memory Access

    Parallelization to Exploit Multiple Off-chip DDR Memories,

    Proc. DAC, 2008.

    [5] E. Aho, et al., A Case for Multi-channel Memories in Video

    Recording, Proc. DATE, 2009.

    [6] T. Ezaki, et al., A 160Gb/s Interface Design Configuration for

    Multichip LSI, Proc. ISSCC, 2004.

    [7] S. Borkar, Thousand-Core Chips - A Technology Perspective,

    Proc. DAC, 2007.

    [8] J. Bautista, Tera-scale Computing and Interconnect Challenges

    3D Stacking Considerations, Tutorial, ISCA, 2008.

    [9] G. H. Loh, 3D-Stacked Memory Architectures for Multi-Core

    Processors, Proc. ISCA, 2008.

    [10] S. Rixner, et al., Memory Access Scheduling, Proc. ISCA,

    2000.

    [11] S. Heithecker and R. Ernst, Traffic Shaping for an FPGA

    based SDRAM Controller with Complex QoS Requirements,

    Proc. DAC, 2005.

    [12] B. Akesson, K. Goossens, and M. Ringhofer, Predator: a

    Predictable SDRAM Memory Controller, Proc. CODES+ISSS,

    2007.

[13] O. Mutlu and T. Moscibroda, Stall-time Fair Memory Access Scheduling for Chip Multiprocessors, Proc. MICRO, 2007.

[14] O. Mutlu and T. Moscibroda, Parallelism-Aware Memory Access Scheduling, Proc. ISCA, 2008.

    [15] J. Kim, et al., A Low Latency Router Supporting Adaptivity

    for On-Chip Interconnects, Proc. DAC, 2005.

    [16] A. Singh, et al., GOAL: A Load-Balanced Adaptive Routing

    Algorithm for Torus Networks, Proc. ISCA, 2003.

    [17] A. Singh, et al., Globally Adaptive Load-Balanced Routing on

    Tori, IEEE Computer Architecture Letters, 3(2):2, 2004.

    [18] J. Ahn, M. Erez, and W. Dally, The Design Space of Data-

    Parallel Memory Systems, Proc. Supercomputing, 2006.

    [19] P. Gratz, et al., Regional Congestion Awareness for Load

    Balance in Networks-on-Chip, Proc. HPCA, 2008.

    [20] W. Jang and D. Z. Pan, An SDRAM-aware Router for

    Networks-on-Chip, Proc. DAC, 2009.

    [21] ARM, Ltd., AMBA3 (AXI) Protocol, available at

    http://www.arm.com/products/solutions/AMBAHomePage.html

    [22] L. Peh and W. J. Dally, A Delay Model and Speculative

    Architecture for Pipelined Routers, Proc. HPCA, 2001.

    [23] T. Sherwood, et al., Discovering and Exploiting Program

    Phases, IEEE Micro, Nov/Dec, 2003.

    [24] Samsung Electronics, Co., DDR I, II, and III Device

    Operations & Timing Diagram, available at

    http://www.samsung.com/global/business/semiconductor.
