7/30/2019 Handling Congesion Over Noc Using Mem Controller
A Network Congestion-Aware Memory Controller
Dongki Kim, Sungjoo Yoo, Sunggu Lee
Embedded System Architecture Laboratory
Department of Electronic and Electrical Engineering, Pohang University of Science and Technology (POSTECH)
{dongki.kim, sungjoo.yoo, slee}@postech.ac.kr
ABSTRACT
The network-on-chip (NoC) and the memory controller become correlated with each other under high network congestion, since the network port of the memory controller can be blocked by the (back-propagated) congestion. We call this problem network congestion-induced memory blocking. To resolve it, we present a novel network congestion-aware memory controller. Based on global network congestion information, the memory controller performs (1) congestion-aware memory access scheduling and (2) congestion-aware network entry control of read data. Experimental results obtained from a 5x5 tile architecture show that the proposed memory controller yields up to an 18.9% improvement in memory utilization.
1. Introduction
In many-cores, the performance of memory and network can often determine the system performance. In order to increase
memory performance, multiple memory modules (in short, multiple memories) have come into use [1][2][3][4][5]. In addition, 3D die stacking is expected to make their use more widespread [6][7][8][9]. For the performance of individual memories, several studies on memory access scheduling have been presented [10][11][12][13][14].
The network-on-chip (NoC) is becoming increasingly popular in many-cores [2][3][8]. As the NoC size increases, network congestion becomes one of the most important problems limiting NoC performance and, thereby, system performance. Several methods, e.g., adaptive routing [15][16][17], have been presented to address the network congestion problem.
In most previous work, memory and network have been assumed to be independent of each other. However, under heavy network congestion, memory and network become correlated with each other, in a negative way, and degrade system performance. In this paper, we tackle a problem caused by this correlation, which we call network congestion-induced memory blocking: the network congestion propagates from hot spots up to the memory controller, prevents the memory controller from servicing productive requests 1, and finally degrades memory performance and, thereby, system performance.
In order to resolve the problem, in this paper, we propose a
novel idea called network congestion-aware memory
controller. In the proposed method, first, network congestion
information is propagated to the memory controller. Then,
the memory controller takes two measures, network congestion-aware memory access scheduling and network entry control, in order to prioritize productive requests (requests from the uncongested area) over unproductive ones (requests from the congested area, which only consume resources in the memory controller and/or NoC without contributing to system performance). We
evaluate the proposed idea with a tile-based many-core
architecture.
This paper is organized as follows. Section 2 reviews
related work. Section 3 gives preliminaries. Section 4
explains our motivation. Section 5 presents the proposed
method of network congestion-aware memory controller.
Section 6 reports experimental results. Section 7 concludes
the paper.
2. Related Work
Rixner, et al. present memory access scheduling policies
including first ready-first come first serve (FR-FCFS) and
open/close page schemes [10]. Heithecker and Ernst utilize traffic shaping when assigning memory resources to high-priority requests, thereby reducing the performance degradation of low-priority requests [11]. Ahn, et al. present a memory
controller with per-thread request buffers in order to avoid
inter-thread interference while exploiting the intra-thread
locality, e.g., a high row buffer hit rate [18]. Akesson, et al. present a memory controller called Predator, where memory accesses are grouped to enable parallel bank accesses, thereby improving memory utilization and easing QoS support [12].
Mutlu, et al. present a fair memory access scheduling policy which balances the slow-down (the ratio of memory access latency between the shared and alone cases) among cores when a shared DRAM is accessed by multiple cores [13]. They also present another fair memory access scheduling policy called PAR-BS, which tries to avoid starvation of low-priority requests by applying batch scheduling [14].

1 Productive requests are the ones that will improve system performance if they are served by the memory controller.

2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip
978-0-7695-4053-5/10 $26.00 2010 IEEE
DOI 10.1109/NOCS.2010.36
Several solutions to the NoC congestion problem have been presented. Most existing solutions use, as the congestion metric, the resource utilization of the local and neighboring routers, e.g., buffer availability [15] and output-port queue length [16][17]. Singh, et al. present a method called GOAL, where the routing decision is based on the queue length of each output port [16]. They also present a method called GAL, which uses the output queue length for both quadrant selection and routing in torus architectures [17]. Gratz, et al. present the notion of
regional congestion awareness (RCA) where a router
determines, on each incoming packet, the output port based
on the global congestion information on each quadrant of
mesh architecture [19]. Our method presented in this paper is
based on the global congestion information used in the RCA
method. In our case, we exploit the global congestion
information for memory access scheduling and network entry
control. Our method can be combined with the RCA method,
which is beyond the scope of this paper.
There has been little work on an integrated and cooperative solution that spans both the memory controller and the network (congestion avoidance). Recently, Jang and Pan present a
memory-aware NoC router [20]. In this work, the router
performs switch allocation, i.e., output port arbitration
considering the memory access latency of packets
contending for the same output port. The router performs
memory access scheduling which was previously executed
by the memory controller. However, network congestion
information is not utilized in this work.
Our contribution is to identify the problem of network congestion-induced memory blocking and to integrate network congestion awareness into the memory controller in order to resolve it.
3. Preliminaries: DRAM, Memory Access Scheduling and Memory Controller
Figure 1 illustrates the internal structure of a DRAM. As shown
in the figure, the memory array is divided into banks
(typically 4 or 8 banks per DRAM). Each bank, which can be
accessed independently of the other banks, is a two
dimensional array of memory cells consisting of rows and
columns. Figure 1 illustrates a four-bank DRAM. A DRAM access typically takes three steps. In step 1, the bank
and row are selected by a bank/row address and a set of
control signals including RAS (row address strobe) as the
arrow denoted with (1) shows in Figure 1. The selected row
(typically, 2KB data) is fetched to the row buffer (consisting
of sense amplifiers and latches), which incurs a latency of
tRCD. This operation is often called row activation and it is
said that the row is open when it is fetched to the row buffer.
In step 2, the desired data (a dark rectangle in Figure 1) is
accessed (read or written) by a column address and a set of
control signals including CAS (column address strobe) as the
arrow denoted with (2) shows in the figure. The accessed
data flows over external IO signals, DQs after the latency of
column access called CAS latency (CL) in case of read
operation. In step 3, after accesses to the open row in the row buffer finish, the row is closed by the memory controller issuing a memory command called precharge (PRE) to the DRAM, as the arrow denoted with (3) shows in the figure.
The precharge command incurs a latency of tRP. Another row
from the same bank can be activated only after the previously
open row is precharged.
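The three-step access above reduces to simple cycle arithmetic. The sketch below uses the 3-cycle values of Section 3's illustration; the helper function and its parameters are hypothetical, for illustration only:

```python
# Sketch of the three-step DRAM read latency described above.
# Timing values (in cycles) follow the example in the text, not a datasheet.
tRCD = 3  # row activation: ACT to first column access
CL   = 3  # column access (CAS) to first read data
tRP  = 3  # precharge: closing a row before another can be activated

def read_latency(row_open: bool, row_hit: bool) -> int:
    """Cycles from command issue to first read data.

    row_open: some row is already open in the target bank.
    row_hit:  the open row is the one being accessed.
    """
    if row_open and row_hit:
        return CL                  # open-row hit: column access only
    if row_open:
        return tRP + tRCD + CL     # row conflict: close old row, open new one
    return tRCD + CL               # bank precharged: activate, then access

print(read_latency(True, True))    # 3
print(read_latency(False, False))  # 6
print(read_latency(True, False))   # 9
```

The three cases make the cost of a row conflict explicit: it is tRP + tRCD slower than an open-row hit, which is what memory access scheduling tries to hide.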
Figure 1 DRAM structure and access flow
Memory access scheduling tries to improve memory
performance considering the fact that multiple banks can be
accessed independently (as far as there is no conflict in the
address/control/data buses) and an access to the open row
requires a minimum access latency. Figure 2 illustrates a
memory access scheduling policy called FR-FCFS (first
ready-first come first serve) [10] where the oldest ready
request is served by the memory controller. We assume a
DDR SDRAM whose CL, tRP, and tRCD are all 3 cycles and
BL (burst length) is 8.
Figure 2 FR-FCFS policy in memory access scheduling
Assume that three read requests arrive at the memory
controller at time 0, 2, and 4 as shown by the three dashed
arrows at the bottom of Figure 2. Each request is represented by a tuple (bank ID, row address, column address). The first
request accesses bank 0 (1st bank), row 0 (1st row in bank 0)
and column 0 (1st data in row 0), the second one bank 0, row
1, and column 8, and the third one bank 2, row 2, and column
8. At time 0, the row activation command for the first request,
B0R0C0 is sent to DRAM, which takes 3 cycles (tRCD) to
open the row.2 At time 3, the column access command is sent
to access the open row. After the delay of CL, i.e., 3 cycles,
at time 6, the data of the first request come out of the DRAM.
At time 7, a precharge command is sent to close the open
row in order to prepare the access to another row, row 1 in
the same bank. Note that the precharge command can overlap
with the read data operation.3
When serving the requests in a FCFS manner, the third
request can start to be served only after the service of the
second request is finished, which will take a large delay in
our example. FR-FCFS can give a reduced latency as Figure
2 shows. Even though the third request arrives at the memory
controller later than the second request, it is served earlier
than the second one since it is ready, i.e., the memory
command of row activation for bank 2 can be issued at the
time when the third request arrives. As Figure 2 shows, the
row activation of the third request is issued at time 4 to open
the row 2 at bank 2. The data of the third request start to
come out of the DRAM at time 10. Thus, the service of all
the three requests finishes at time 20 in the FR-FCFS policy.
If the FCFS is applied, the total cycle will be 30 since all the
requests are served in a sequential manner.
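The bypass in this example can be sketched as follows; the request tuples and the `bank_busy_until` map are assumptions for illustration, not the paper's implementation:

```python
# Minimal FR-FCFS selection sketch. A request is an (arrival, bank, row)
# tuple; bank_busy_until maps a bank to the cycle at which it frees up.
def fr_fcfs(requests, bank_busy_until, now):
    """Return the oldest request whose bank is ready at cycle `now`,
    falling back to the oldest request overall (plain FCFS) if none is."""
    ready = [r for r in requests if bank_busy_until.get(r[1], 0) <= now]
    pool = ready if ready else requests
    return min(pool, key=lambda r: r[0]) if pool else None

# The situation of Figure 2 at cycle 4: B0R1C8 (arrived at 2) waits on
# bank 0, which is still busy with B0R0C0 and its precharge (assumed
# busy-until cycle); B2R2C8 (arrived at 4) targets the idle bank 2.
reqs = [(2, 0, 1), (4, 2, 2)]
busy = {0: 14}
print(fr_fcfs(reqs, busy, 4))   # (4, 2, 2): newer, but ready, so it bypasses
```

Under plain FCFS the same call would have to stall on the bank-0 request; the ready-first filter is the whole difference between the two policies.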
Memory access scheduling is usually performed by the memory controller. Figure 3 shows a simplified block diagram of the memory controller, which is connected to the network interface (NI) and the DRAM as shown in the figure. It takes
memory access requests from NI, e.g., via read/write address
channels in the case of AXI protocol [21]. The requests are
stored in the request buffer (RB). The memory access
scheduler checks the status of each bank with bank FSMs
and the requests in the RB. The scheduler applies a
scheduling algorithm, e.g., FR-FCFS [10], PAR-BS [14], etc.,
in order to select a request from the RB. Then, it generates
corresponding DRAM commands for row activation, column
access for read or write, precharge, etc. The generated
DRAM commands are usually stored in the command buffer
and issued to DRAM at their scheduled cycles.
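The command-generation step can be illustrated with a minimal sketch. The three cases below (open-row hit, precharged bank, row conflict) follow the DRAM access flow of Section 3; the helper name and representation are hypothetical:

```python
def commands_for(bank_open_row, req_row):
    """DRAM command sequence for one read request, given the bank's
    currently open row (None if the bank is precharged)."""
    if bank_open_row == req_row:
        return ["READ"]                 # open-row hit: column access only
    if bank_open_row is None:
        return ["ACT", "READ"]          # bank precharged: activate first
    return ["PRE", "ACT", "READ"]       # row conflict: close, open, access

print(commands_for(None, 0))   # ['ACT', 'READ']
print(commands_for(0, 0))      # ['READ']
print(commands_for(0, 1))      # ['PRE', 'ACT', 'READ']
```

In the real controller these commands would be placed in the command buffer and issued at their scheduled cycles, as the text describes.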
2 In this example, for simplicity, we assume that the memory controller does not consume internal delay cycles. In our experiments, the memory controller takes internal latency cycles from the reception of a request from the network to the issue of DRAM commands.
3 The overlap period between read and precharge operations depends on BL, tRTP (read-to-precharge latency) and AL (additive latency). For more details, refer to the DRAM specification, e.g., [24].
In case of write request, the data buffer (DB) receives write
data from NI, e.g., via write data channel in AXI protocol,
and sends them to DRAM when the corresponding column
access commands are issued to DRAM. In case of read
request, the DB receives data from DRAM and then sends
them to NI, e.g., via read data channel in AXI protocol.
Figure 3 A simplified block diagram of the memory controller
4. Motivation
Figure 4 illustrates the problem of network congestion-induced memory blocking. Figure 4 (a) shows a 3x3 tile architecture where the memory controller (MC) is located at the center tile. We assume that there is network congestion. The shaded routers (small shaded rectangles denoted with R) and the thick links between them represent the congested
area. In Figure 4 (a), the congestion is mostly on the third
column of the mesh topology and affects up to the MC
through the router connected to the MC.
Figure 4 Network congestion-induced memory blocking
Figure 4 (b) shows a detailed view of MC and connected
router corresponding to the dashed area in Figure 4 (a). In
Figure 4 (b), the MC has two read requests from PE0 and
PE3 (R_PE0 and R_PE3 in RB). The MC also has two data items bound for PE8 and PE2 (D_PE8 and D_PE2 in DB). In this case, since the paths from the MC to PE8 and PE2 are congested and the output port of the MC is blocked by the congestion, the two data cannot enter the NoC. Thus, they simply wait, until the congestion is removed, while occupying the precious DB resource. Since the DB is full and there is no available space to store data coming from the DRAM, the two read requests in the RB cannot issue their memory commands for read operations. Thus, no memory command is issued to the DRAM, and the DRAM remains idle until the network congestion is removed. We call such a case network congestion-induced memory blocking. The problem is resolved only when the congestion is removed (or mitigated). As will be reported in Section 6, this problem can cause a significant degradation in memory utilization and, thereby, in system performance.
From the viewpoint of the memory controller, the above problem is caused by the fact that conventional memory access scheduling does not take network status into account. In the case of Figure 4, memory access scheduling produces data D_PE8 and D_PE2 in order to maximize only memory performance, i.e., memory utilization, without considering system performance. The lack of knowledge of network status in memory access scheduling causes the DB to be occupied only by blocked data bound for the congested area.
The above problem can be resolved by a network congestion-
aware memory controller. To be specific, the MC takes two
measures in case of network congestion: (1) prioritization of
requests from the uncongested area and (2) congestion-aware
network-entry control of read data. The rationale behind the
two measures is as follows. At high congestion, as shown in
Figure 4 (b), the MC resource (especially, DB and the output
port) tends to be occupied by the requests from the congested
area. Thus, in order to prevent those requests from occupying
the MC resource, requests from the uncongested area need to
be prioritized. Regarding the network entry control, read data
bound for the congested area, when they enter the NoC, will
aggravate the NoC congestion and can cause the blocking
problem. Thus, in this case, read data bound for the
uncongested area are prioritized in entering the NoC. By
doing that, the memory controller can reclaim data buffer
space more quickly and support more requests with the
reclaimed resource.
5. Network Congestion Awareness in Memory Controller
In order to realize the idea of network congestion-aware
memory controller, the congestion information needs to be
transferred from the NoC to the MC (Section 5.1). Then,
both memory access scheduling and network entry control
need to exploit the congestion information (Section 5.2).
Figure 5 Regions (quadrants) to manage congestion information
5.1 Congestion Information Management
Congestion information is collected within the NoC and transferred to the MC. As the congestion information, we
utilize the global congestion information called regional
congestion in [19]. The regional congestion information represents how congested each quadrant around a router is.
Figure 5 illustrates the regional congestion information. The
router (shaded rectangle) at (3,3) has the congestion
information for each of the four quadrants (north-west/east,
and south-west/east) seen from its location.
Figure 6 Propagation of regional congestion information
Figure 6 illustrates how each router manages the regional
congestion information. In order to quantify the congestion
level, we use the number of occupied VCs (vc) and the crossbar demand (xb, the number of contending packets bound for the same output port) per router.4 The higher vc and xb are, the more congested the corresponding quadrant. As the dashed arrows in Figure 6 show, each router collects the congestion information
4 In [19], three types of congestion information are presented: the number of occupied VCs (vc), the number of occupied buffers (bf), and the crossbar demand (xb). The combination of vc and xb proves to give the best results.
from neighbor routers via a set of sideband signals.5 It calculates its congestion level as a weighted sum of its local congestion information (vc and xb) and that of its neighbors collected via the sideband signals, and propagates the new congestion information to its neighbor routers. Note that the congestion information propagates from a router to all its neighbors, i.e., in four directions in the mesh architecture. For simplicity, Figure 6 shows only the south-west-bound flow of congestion information.6
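The weighted-sum aggregation can be sketched as follows; the 0.5 neighbor weight and the per-router vc/xb values are assumed example numbers, not those of [19]:

```python
# Sketch of per-router regional congestion aggregation. A router's reported
# level for a direction combines its local metric (vc + xb) with a weighted
# share of the level reported by its neighbor in that direction.
def reported_level(local_vc, local_xb, neighbor_level, weight=0.5):
    return (local_vc + local_xb) + weight * neighbor_level

# A chain of three routers propagating congestion toward the MC,
# with hypothetical per-router (vc, xb) metrics:
lvl = 0.0
for vc, xb in [(2, 2), (3, 3), (4, 8)]:
    lvl = reported_level(vc, xb, lvl)
print(lvl)  # 16.0
```

The geometric weighting means nearby congestion dominates the reported level while remote congestion still contributes, which is the intent of the regional (as opposed to purely local) metric.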
The congestion level of each quadrant around the MC is
propagated to the MC as the dashed line between the router
and the MC shows in Figure 7.
Figure 7 Propagating congestion information to the memory controller
5.2 Network Congestion-Aware Memory Access Scheduling and Network Entry Control
We integrate congestion awareness into two functions of the memory controller: memory access scheduling and network entry control of read data.
Figure 8 shows the pseudo code of network congestion-
aware memory access scheduling. We use FR-FCFS as the
baseline policy of memory access scheduling. Note that there
is no limitation to memory access scheduling policies in
integrating congestion awareness.
5 The congestion information is carried on 9 bits of sideband signals in our work. However, the overhead of sideband signals can become negligible since they can be reduced down to a serial link without losing the benefit of adaptive routing, as reported in [19]. A detailed analysis of the effect of reducing the number of sideband signals is left as future work.
6 In adaptive routing that utilizes the regional congestion information, each router selects the link with the lowest congestion level. For instance, in the case of the (shaded) router at (1,1) in Figure 6, when a north-east-bound packet arrives at the router, it selects the east-bound output port as the target since the congestion level of the north-bound output port (24.25) is higher than that of the east-bound output port (18.25). For more details of regional congestion awareness, refer to [19].
Figure 8 Network congestion-aware memory access scheduling
If any of four quadrants around (the router connected to) the
MC has a high congestion level (line 3), then the requests
from the congested quadrant(s) (CQ) are excluded from the
set of candidate requests in the request buffer (line 4). The
FR-FCFS policy is applied to the new candidate set (R_UCQ
in Figure 8). If there is a selected request, it is served by the
MC (line 5). If there is no selected one, then a candidate is
selected, if any, from the requests from the congested
quadrant (R_CQ) (lines 6-7). If there is no quadrant with
high congestion, then the original FR-FCFS is applied to all
requests in the request buffer.
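The selection logic of Figure 8 can be rendered as a runnable sketch. Here the FR-FCFS baseline is reduced to picking the oldest request, and the (arrival, quadrant) request representation is an assumption for illustration:

```python
def fr_fcfs(reqs):
    """Stand-in for the FR-FCFS baseline: the oldest request wins.
    Requests are (arrival, quadrant) tuples; returns None on an empty pool."""
    return min(reqs, key=lambda r: r[0]) if reqs else None

def select_request(all_requests, congested_quadrants, high_congestion):
    """Congestion-aware selection per Figure 8: under high congestion,
    requests from uncongested quadrants (R_UCQ) are tried first, falling
    back to requests from congested quadrants (R_CQ)."""
    if high_congestion:
        r_ucq = [r for r in all_requests if r[1] not in congested_quadrants]
        selected = fr_fcfs(r_ucq)
        if selected is None:
            r_cq = [r for r in all_requests if r[1] in congested_quadrants]
            selected = fr_fcfs(r_cq)
        return selected
    return fr_fcfs(all_requests)

reqs = [(0, "NE"), (3, "SW")]
print(select_request(reqs, {"NE"}, True))   # (3, 'SW'): newer, but uncongested
print(select_request(reqs, {"NE"}, False))  # (0, 'NE'): plain FR-FCFS
```

The fallback on lines mirroring lines 6-7 of Figure 8 matters: requests from the congested area are deprioritized, never starved outright.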
The network entry control is applied to read data stored in
the data buffer of the memory controller. The procedure to select data (to be injected into the NoC) from the data buffer is similar to that in Figure 8. First, if there is any quadrant with
high congestion, then the read data bound for the quadrant
are excluded from the set of ready read data in the data buffer.
If there is any data bound for the uncongested area, then the
oldest of such data is selected and sent to its destination. If
there is no data bound for the uncongested area, then the
oldest ready data bound for the congested area is selected.
Note that the granularity of selection is a burst of data. In our experiments, a burst of data corresponds to 8 words of 64b data for an L1 cache miss.
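The entry-control selection described above can be sketched in the same style; the (age, destination-quadrant) data-buffer representation is an assumption for illustration:

```python
def select_read_data(data_buffer, congested_quadrants, high_congestion):
    """Network entry control sketch: each buffered burst is an
    (age, dest_quadrant) tuple. Under high congestion the oldest burst
    bound for an uncongested quadrant wins; otherwise (or as fallback)
    the oldest ready burst is sent."""
    if high_congestion:
        safe = [d for d in data_buffer if d[1] not in congested_quadrants]
        if safe:
            return max(safe, key=lambda d: d[0])   # oldest = largest age
    return max(data_buffer, key=lambda d: d[0]) if data_buffer else None

# An SE-bound burst is older, but SE is congested, so the NW burst bypasses:
db = [(9, "SE"), (5, "NW")]
print(select_read_data(db, {"SE"}, True))   # (5, 'NW')
print(select_read_data(db, {"SE"}, False))  # (9, 'SE')
```

Letting data bound for the uncongested area leave first is what frees data buffer space and breaks the blocking cycle described in Section 4.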
6. Experiments
6.1 Experimental Setup
Figure 9 shows a 5x5 tile architecture used in our
experiments. The 5x5 architecture has one MC at the center
tile. The MC runs at 400MHz and is connected to a
conventional DRAM memory channel of DDR2-800 (CL-tRP-tRCD = 6-6-6). The MC has a request buffer and a data
buffer whose sizes are varied in our experiments. The MC
supports AXI protocol [21] at its bus interface. Thus, we use
a network interface between the router and the MC. We use a
Tensilica LX2 core (64b, 16K/16K, cache line size of 64B,
400MHz) as the PE.
The NoC router supports four-stage pipeline (input buffering,
virtual channel allocation & lookahead routing computation,
switch allocation and switch traversal), 4 virtual channels
(VCs) per input port, wormhole switching, speculative
allocation [22], XY routing and 64b flits. There are two types
of packets: request and response packets. The read request
1 // CQ and UCQ: congested and uncongested quadrants
2 // R_CQ and R_UCQ: requests from CQ and UCQ
3 if (high congestion in any quadrant)
4   R_UCQ = All_Requests - R_CQ
5   selected = FR_FCFS(R_UCQ)
6   if selected == null
7     selected = FR_FCFS(R_CQ)
8 else // no heavy congestion
9   selected = FR_FCFS(All_Requests)
medium levels of network congestion, the performance gain
is modest (4.4% ~ 14.1%).
Figure 10 Performance comparison
Figure 11 Performance gains
Figure 12 gives another explanation of the high performance gain of the proposed method. The figure shows the fraction of time that the data buffer is full: as the congestion level rises, the data buffer becomes full more frequently in the baseline method. This is because, at high congestion, read data bound for the congested area occupy the output port of the memory controller as well as the majority of the data buffer; that is, network congestion-induced memory blocking occurs, as exemplified in Section 4, and prevents other data bound for the uncongested area from being served. The proposed method reduces the period of full data buffer by up to 12.1% (high congestion, DB size = 64), which translates into the performance gain in Figure 11.
Figure 12 Fraction of time when data buffer is full (single program case)
Network congestion-aware memory access scheduling and
network entry control can be applied independently of each
other. Figure 13 shows the effect of each method obtained
when each of them is applied independently to the case of
high congestion level. The figure shows that the network
congestion-aware memory access scheduling (Schedule in
the figure) alone offers up to 16.0% improvement in memory
utilization while the network congestion-aware entry control
(Entry) alone offers a performance improvement by up to
8.1%.
Figure 13 Decomposition of performance impact
Comparing the results in Figures 11 and 13, the effects of
both congestion-aware memory access scheduling and
network entry control are not always additive. In case of
small DB sizes (32 and 64), the effect of network congestion-
aware memory access scheduling dominates. The effect of
network congestion-aware network entry control becomes
more significant with large DB sizes. It is because large DB
sizes allow for more space to keep data bound for the
congested area in case of high congestion. Thus, there are
more opportunities that data bound for the uncongested area
bypass data bound for the congested area thereby reducing
memory blocking period and finally improving memory
utilization.
Figure 14 Performance gains in the multi-program case
Figure 14 shows the performance gain of the proposed
method for the multi-program mapping on the 5x5
architecture as shown in Figure 9. The proposed method gives a significant performance improvement of up to 18.9% (high congestion, DB size = 64). Figure 15 confirms that the reduction in memory blocking period (enabled by the proposed memory controller) yields the performance improvement shown in Figure 14.
Figure 15 Fraction of time when data buffer is full (multi-program case)
7. Conclusion
In this paper, we explained a performance problem called
network congestion-induced memory blocking and presented,
as a solution to the problem, a memory controller which
performs memory access scheduling and network entry
control of read data in a network congestion-aware manner.
The proposed memory controller gives significant
performance improvements (by up to 18.9% in memory
utilization), especially when the network suffers from severe
congestion. As future work, we will work on an adaptive
solution to cope with dynamically changing network
congestion. We will also work on fairness issues between
requests from congested and uncongested areas.
8. References
[1] E. Lindholm, et al., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 28(2), March 2008.
[2] S. Bell, et al., TILE64 - Processor: A 64-Core SoC with Mesh
Interconnect, Proc. ISSCC 2008.
[3] Intel, Co., Single-chip Cloud Computer, available at
http://techresearch.intel.com/articles/Tera-Scale/1826.htm.
[4] W. Kwon, et al., A Practical Approach of Memory Access
Parallelization to Exploit Multiple Off-chip DDR Memories,
Proc. DAC, 2008.
[5] E. Aho, et al., A Case for Multi-channel Memories in Video
Recording, Proc. DATE, 2009.
[6] T. Ezaki, et al., A 160Gb/s Interface Design Configuration for
Multichip LSI, Proc. ISSCC, 2004.
[7] S. Borkar, Thousand-Core Chips - A Technology Perspective,
Proc. DAC, 2007.
[8] J. Bautista, Tera-scale Computing and Interconnect Challenges
3D Stacking Considerations, Tutorial, ISCA, 2008.
[9] G. H. Loh, 3D-Stacked Memory Architectures for Multi-Core
Processors, Proc. ISCA, 2008.
[10] S. Rixner, et al., Memory Access Scheduling, Proc. ISCA,
2000.
[11] S. Heithecker and R. Ernst, Traffic Shaping for an FPGA
based SDRAM Controller with Complex QoS Requirements,
Proc. DAC, 2005.
[12] B. Akesson, K. Goossens, and M. Ringhofer, Predator: a
Predictable SDRAM Memory Controller, Proc. CODES+ISSS,
2007.
[13] O. Mutlu and T. Moscibroda, Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors, Proc. MICRO, 2007.
[14] O. Mutlu and T. Moscibroda, Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems, Proc. ISCA, 2008.
[15] J. Kim, et al., A Low Latency Router Supporting Adaptivity
for On-Chip Interconnects, Proc. DAC, 2005.
[16] A. Singh, et al., GOAL: A Load-Balanced Adaptive Routing
Algorithm for Torus Networks, Proc. ISCA, 2003.
[17] A. Singh, et al., Globally Adaptive Load-Balanced Routing on
Tori, IEEE Computer Architecture Letters, 3(2):2, 2004.
[18] J. Ahn, M. Erez, and W. Dally, The Design Space of Data-
Parallel Memory Systems, Proc. Supercomputing, 2006.
[19] P. Gratz, et al., Regional Congestion Awareness for Load
Balance in Networks-on-Chip, Proc. HPCA, 2008.
[20] W. Jang and D. Z. Pan, An SDRAM-aware Router for
Networks-on-Chip, Proc. DAC, 2009.
[21] ARM, Ltd., AMBA3 (AXI) Protocol, available at
http://www.arm.com/products/solutions/AMBAHomePage.html
[22] L. Peh and W. J. Dally, A Delay Model and Speculative
Architecture for Pipelined Routers, Proc. HPCA, 2001.
[23] T. Sherwood, et al., Discovering and Exploiting Program
Phases, IEEE Micro, Nov/Dec, 2003.
[24] Samsung Electronics, Co., DDR I, II, and III Device
Operations & Timing Diagram, available at
http://www.samsung.com/global/business/semiconductor.