
    A Network Congestion-Aware Memory Controller

    Dongki Kim, Sungjoo Yoo, Sunggu Lee

    Embedded System Architecture Laboratory

Department of Electronic and Electrical Engineering, Pohang University of Science and Technology (POSTECH)

    {dongki.kim, sungjoo.yoo, slee}@postech.ac.kr

    ABSTRACT

The network-on-chip and the memory controller become correlated with each other under high network congestion, since the network port of the memory controller can be blocked by the (back-propagated) congestion. We call this problem network congestion-induced memory blocking. To resolve it, we present a novel idea: a network congestion-aware memory controller. Based on global information about network congestion, the memory controller performs (1) congestion-aware memory access scheduling and (2) congestion-aware network entry control of read data. Experimental results obtained from a 5x5 tile architecture show that the proposed memory controller yields up to 18.9% improvement in memory utilization.

1. Introduction

In many-cores, the performance of memory and network often determines system performance. To increase memory performance, multiple memory modules (in short, multiple memories) are starting to be used [1][2][3][4][5]. In addition, 3D die stacking is expected to make their usage popular [6][7][8][9]. For the performance improvement of an individual memory, several studies on memory access scheduling have been presented [10][11][12][13][14].

The network-on-chip (NoC) is becoming more and more popular in many-cores [2][3][8]. As the NoC size increases, network congestion becomes one of the most important problems limiting NoC performance and thereby system performance. Several methods, e.g., adaptive routing [15][16][17], have been presented to address the network congestion problem.

In most previous work, memory and network have been assumed to be independent of each other. However, under heavy network congestion, memory and network become correlated with each other, in a negative way, yielding system performance degradation. In this paper, we tackle a problem caused by this correlation, which we call network congestion-induced memory blocking. In this problem, network congestion propagates from hot spots up to the memory controller. It prevents the memory controller from servicing productive requests1 and, finally, degrades memory performance and thereby system performance.

To resolve the problem, in this paper, we propose a novel idea called the network congestion-aware memory controller. In the proposed method, network congestion information is first propagated to the memory controller. Then, the memory controller takes two measures, network congestion-aware memory access scheduling and network entry control, in order to prioritize productive requests (requests from the uncongested area) over unproductive ones (requests from the congested area, which only consume resources in the memory controller and/or NoC without contributing to system performance). We evaluate the proposed idea with a tile-based many-core architecture.

This paper is organized as follows. Section 2 reviews related work. Section 3 gives preliminaries. Section 4 explains our motivation. Section 5 presents the proposed method, the network congestion-aware memory controller. Section 6 reports experimental results. Section 7 concludes the paper.

2. Related Work

Rixner, et al. present memory access scheduling policies including first ready-first come first serve (FR-FCFS) and open/close page schemes [10]. Heithecker and Ernst utilize traffic shaping to assign memory resources to high-priority requests, thereby reducing the performance degradation of other, low-priority requests [11]. Ahn, et al. present a memory controller with per-thread request buffers in order to avoid inter-thread interference while exploiting intra-thread locality, e.g., a high row buffer hit rate [18]. Akesson, et al. present a memory controller called Predator, where memory accesses are grouped to enable parallel bank accesses, thereby improving memory utilization and easing QoS support [12]. Mutlu, et al. present a fair memory access scheduling scheme which balances the slow-down (the ratio of memory access latency between the shared and alone cases) among cores when a shared DRAM is accessed by multiple cores [13]. They also present another fair memory access scheduling scheme, called PAR-BS, which tries to avoid starvation of low-priority requests by applying batch scheduling [14].

1 Productive requests are the ones that will improve system performance if they are served by the memory controller.

Several solutions to the NoC congestion problem have been presented. Most existing solutions utilize, as the congestion metric, the resource utilization of the local and neighbor routers, e.g., buffer availability [15] and the queue length of output ports [16][17]. Singh, et al. present a method called GOAL where the routing decision is based on the queue length of each output port [16]. They also present a method called GAL which utilizes the information of output queue length for both quadrant selection and routing in the torus architecture [17]. Gratz, et al. present the notion of regional congestion awareness (RCA), where a router determines, for each incoming packet, the output port based on the global congestion information of each quadrant of the mesh architecture [19]. The method presented in this paper is based on the global congestion information used in RCA. In our case, we exploit the global congestion information for memory access scheduling and network entry control. Our method can be combined with the RCA method, which is beyond the scope of this paper.

There has been little work on integrated and cooperative solutions which include both the memory controller and the network (congestion avoidance). Recently, Jang and Pan present a memory-aware NoC router [20]. In this work, the router performs switch allocation, i.e., output port arbitration, considering the memory access latency of packets contending for the same output port. The router performs the memory access scheduling that was previously executed by the memory controller. However, network congestion information is not utilized in this work.

Our contribution is to identify the problem of network congestion-induced memory blocking and to integrate network congestion awareness into the memory controller in order to resolve it.

3. Preliminaries: DRAM, Memory Access Scheduling and Memory Controller

Figure 1 illustrates the internal structure of a DRAM. As shown in the figure, the memory array is divided into banks (typically 4 or 8 banks per DRAM). Each bank, which can be accessed independently of the other banks, is a two-dimensional array of memory cells consisting of rows and columns. Figure 1 illustrates a four-bank DRAM. A DRAM access typically takes three steps. In step 1, the bank and row are selected by a bank/row address and a set of control signals including RAS (row address strobe), as the arrow denoted with (1) in Figure 1 shows. The selected row (typically, 2KB of data) is fetched into the row buffer (consisting of sense amplifiers and latches), which incurs a latency of tRCD. This operation is often called row activation, and the row is said to be open when it has been fetched into the row buffer. In step 2, the desired data (the dark rectangle in Figure 1) are accessed (read or written) by a column address and a set of control signals including CAS (column address strobe), as the arrow denoted with (2) shows in the figure. In the case of a read operation, the accessed data flow over the external IO signals (DQs) after the column access latency, called CAS latency (CL). In step 3, after the accesses to the open row in the row buffer finish, the row is closed by the memory controller issuing a memory command called precharge (PRE) to the DRAM, as the arrow denoted with (3) shows in the figure. The precharge command incurs a latency of tRP. Another row of the same bank can be activated only after the previously open row has been precharged.

    Figure 1 DRAM structure and access flow
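The three-step flow fixes the first-data latency of a read as a simple sum of the timing parameters above. The following minimal sketch is ours, not the paper's; it uses the illustrative timings tRCD = CL = tRP = 3 from the Figure 2 example to compute that latency for the three possible row-buffer states:

def read_latency(row_state, tRCD=3, CL=3, tRP=3):
    """Cycles from the first DRAM command until read data appear on DQ."""
    if row_state == "open":      # row already in the row buffer: column access only
        return CL
    if row_state == "closed":    # bank precharged: activate (step 1), then access (step 2)
        return tRCD + CL
    if row_state == "conflict":  # another row open: precharge (step 3), activate, access
        return tRP + tRCD + CL
    raise ValueError(row_state)

print(read_latency("open"), read_latency("closed"), read_latency("conflict"))  # 3 6 9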

Memory access scheduling tries to improve memory performance by exploiting the facts that multiple banks can be accessed independently (as long as there is no conflict on the address/control/data buses) and that an access to an open row requires the minimum access latency. Figure 2 illustrates a memory access scheduling policy called FR-FCFS (first ready-first come first serve) [10], in which the oldest ready request is served by the memory controller. We assume a DDR SDRAM whose CL, tRP, and tRCD are all 3 cycles and whose BL (burst length) is 8.

    Figure 2 FR-FCFS policy in memory access scheduling


Assume that three read requests arrive at the memory controller at times 0, 2, and 4, as shown by the three dashed arrows at the bottom of Figure 2. Each request is represented by a tuple (bank ID, row address, column address). The first request accesses bank 0 (the 1st bank), row 0 (the 1st row in bank 0) and column 0 (the 1st data in row 0); the second one bank 0, row 1, and column 8; and the third one bank 2, row 2, and column 8. At time 0, the row activation command for the first request, B0R0C0, is sent to the DRAM, which takes 3 cycles (tRCD) to open the row.2 At time 3, the column access command is sent to access the open row. After the delay of CL, i.e., 3 cycles, at time 6, the data of the first request come out of the DRAM. At time 7, a precharge command is sent to close the open row in order to prepare the access to another row, row 1, in the same bank. Note that the precharge command can overlap with the read data operation.3

When the requests are served in a FCFS manner, the third request can start to be served only after the service of the second request has finished, which incurs a large delay in our example. FR-FCFS gives a reduced latency, as Figure 2 shows. Even though the third request arrives at the memory controller later than the second request, it is served earlier since it is ready, i.e., the row activation command for bank 2 can be issued at the time the third request arrives. As Figure 2 shows, the row activation of the third request is issued at time 4 to open row 2 in bank 2. The data of the third request start to come out of the DRAM at time 10. Thus, the service of all three requests finishes at time 20 under the FR-FCFS policy. If FCFS is applied, the total takes 30 cycles since all the requests are served sequentially.
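The selection rule behind this example can be stated compactly. The sketch below is our own illustration of FR-FCFS request selection, not the paper's implementation; the Request type, the bank_free bookkeeping, and the cycle values are assumptions for exposition (a real controller derives readiness from per-bank FSMs and the full DRAM timing):

from dataclasses import dataclass

@dataclass
class Request:
    arrival: int   # cycle the request reached the controller
    bank: int
    row: int
    col: int

def fr_fcfs(queue, bank_free, now):
    """Serve the oldest request whose bank can issue a command now
    ('first ready'); if none is ready, fall back to the oldest request."""
    arrived = [r for r in queue if r.arrival <= now]
    ready = [r for r in arrived if bank_free.get(r.bank, 0) <= now]
    pool = ready or arrived
    return min(pool, key=lambda r: r.arrival) if pool else None

# At cycle 4 in the Figure 2 example, bank 0 is still busy with the first
# request (it frees up around cycle 10, after the precharge completes), so
# the newly arrived bank-2 request is the oldest *ready* request and is
# served before the older bank-0 request.
queue = [Request(2, 0, 1, 8), Request(4, 2, 2, 8)]
print(fr_fcfs(queue, bank_free={0: 10}, now=4))   # -> the bank-2 request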

Memory access scheduling is usually performed by the memory controller. Figure 3 shows a simplified block diagram of a memory controller. It is connected to the network interface (NI) and the DRAM, as shown in the figure. It receives memory access requests from the NI, e.g., via the read/write address channels in the case of the AXI protocol [21]. The requests are stored in the request buffer (RB). The memory access scheduler checks the status of each bank, tracked by the bank FSMs, and the requests in the RB. The scheduler applies a scheduling algorithm, e.g., FR-FCFS [10], PAR-BS [14], etc., to select a request from the RB. Then, it generates the corresponding DRAM commands for row activation, column access (read or write), precharge, etc. The generated DRAM commands are usually stored in the command buffer and issued to the DRAM at their scheduled cycles.

2 In this example, for simplicity, we assume that the memory controller does not consume internal delay cycles. In our experiments, the memory controller takes internal latency cycles from the reception of a request from the network to the issue of DRAM commands.

3 The overlap period between the read and precharge operations depends on BL, tRTP (read-to-precharge latency) and AL (additive latency). For more details, refer to the DRAM specification, e.g., [24].

In the case of a write request, the data buffer (DB) receives the write data from the NI, e.g., via the write data channel in the AXI protocol, and sends them to the DRAM when the corresponding column access commands are issued. In the case of a read request, the DB receives data from the DRAM and then sends them to the NI, e.g., via the read data channel in the AXI protocol.

Figure 3 A simplified block diagram of the memory controller
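To make this flow concrete, here is a toy skeleton of the structures in Figure 3. The class and function names are our own, not the paper's design; the command expansion handles only the simple closed-row read case and ignores most timing constraints:

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int
    bank: int
    row: int
    col: int
    is_read: bool = True

def expand_to_commands(req, now, tRCD=3):
    """Toy command generation for a closed bank: ACT, then a column access."""
    return [(now, ("ACT", req.bank, req.row)),
            (now + tRCD, ("READ" if req.is_read else "WRITE", req.bank, req.col))]

class MemoryController:
    def __init__(self, scheduler):
        self.request_buffer = deque()   # RB: requests from the NI address channels
        self.data_buffer = deque()      # DB: write data in, read data out
        self.command_buffer = deque()   # (issue cycle, DRAM command) pairs
        self.scheduler = scheduler      # selection policy, e.g., FR-FCFS

    def tick(self, now):
        # pick a request and expand it into DRAM commands at scheduled cycles
        req = self.scheduler(self.request_buffer, now)
        if req is not None:
            self.request_buffer.remove(req)
            self.command_buffer.extend(expand_to_commands(req, now))
        # issue every command whose scheduled cycle has been reached
        while self.command_buffer and self.command_buffer[0][0] <= now:
            print("issue:", self.command_buffer.popleft()[1])

mc = MemoryController(lambda rb, now: rb[0] if rb else None)  # plain FCFS
mc.request_buffer.append(Request(0, bank=0, row=0, col=0))
for t in range(4):
    mc.tick(t)
# issues ACT at cycle 0 and READ at cycle 3 (tRCD = 3)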

4. Motivation

Figure 4 illustrates the problem of network congestion-induced memory blocking. Figure 4 (a) shows a 3x3 tile architecture where the memory controller (MC) is located at the center tile. We assume that there is network congestion. The shaded routers (small shaded rectangles denoted with R) and the thick links between them represent the congested area. In Figure 4 (a), the congestion is mostly on the third column of the mesh topology and reaches the MC through the router connected to it.

Figure 4 Network congestion-induced memory blocking: (a) 3x3 tile architecture, (b) blocking in MC and NI

Figure 4 (b) shows a detailed view of the MC and the connected router, corresponding to the dashed area in Figure 4 (a). In Figure 4 (b), the MC has two read requests, from PE0 and PE3 (R_PE0 and R_PE3 in the RB). The MC also holds two data items bound for PE8 and PE2 (D_PE8 and D_PE2 in the DB). In this case, since the paths from the MC to PE8 and PE2 are congested and the output port of the MC is blocked by the congestion, the two data items cannot enter the NoC. They simply wait until the congestion is removed, while occupying a precious resource, the DB. Since the DB is full and there is no space available to store data coming from the DRAM, the two read requests in the RB cannot issue their memory commands for read operations. Thus, no memory command is issued to the DRAM, and the DRAM remains idle until the network congestion is removed. We call such a case network congestion-induced memory blocking. The problem is resolved only when the congestion is removed (or mitigated). As will be reported in Section 6, this problem can cause a significant degradation in memory utilization and thereby in system performance.
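The stall can be captured in a few lines. This toy model is ours, not the authors' simulator; the destinations and capacity mirror the Figure 4 (b) scenario and show how a data buffer filled with blocked read data stalls even requests whose responses would leave through uncongested paths:

def can_issue_read(data_buffer, capacity):
    # a read needs a free DB slot for the data returning from DRAM
    return len(data_buffer) < capacity

def drain_one(data_buffer, congested_dests):
    # only data bound for an uncongested destination may enter the NoC
    for i, dest in enumerate(data_buffer):
        if dest not in congested_dests:
            return data_buffer.pop(i)
    return None   # every buffered burst is blocked

data_buffer = ["PE8", "PE2"]          # DB full (capacity 2)
congested = {"PE8", "PE2"}            # both destination paths congested
print(drain_one(data_buffer, congested))        # None: nothing can leave
print(can_issue_read(data_buffer, capacity=2))  # False: R_PE0/R_PE3 stall, DRAM idles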

From the viewpoint of the memory controller, the above problem is caused by the fact that conventional memory access scheduling does not take the network status into account. In the case of Figure 4, memory access scheduling produces the data D_PE8 and D_PE2 in order to maximize only memory performance, i.e., memory utilization, without considering system performance. The lack of knowledge of the network status in memory access scheduling causes the DB to be occupied solely by blocked data bound for the congested area.

The above problem can be resolved by a network congestion-aware memory controller. To be specific, the MC takes two measures in case of network congestion: (1) prioritization of requests from the uncongested area and (2) congestion-aware network entry control of read data. The rationale behind the two measures is as follows. Under high congestion, as shown in Figure 4 (b), the MC resources (especially the DB and the output port) tend to be occupied by requests from the congested area. Thus, in order to prevent those requests from occupying the MC resources, requests from the uncongested area need to be prioritized. Regarding the network entry control, read data bound for the congested area will, when they enter the NoC, aggravate the NoC congestion and can cause the blocking problem. Thus, in this case, read data bound for the uncongested area are prioritized in entering the NoC. By doing so, the memory controller can reclaim data buffer space more quickly and serve more requests with the reclaimed resources.

5. Network Congestion Awareness in the Memory Controller

To realize the idea of a network congestion-aware memory controller, the congestion information needs to be transferred from the NoC to the MC (Section 5.1). Then, both memory access scheduling and network entry control need to exploit the congestion information (Section 5.2).

    Figure 5 Regions (quadrants) to manage congestion information

5.1 Congestion Information Management

Congestion information is collected within the NoC and transferred to the MC. As the congestion information, we utilize the global congestion information called regional congestion in [19]. The regional congestion information represents how congested each quadrant around a router is. Figure 5 illustrates the regional congestion information. The router (shaded rectangle) at (3,3) has congestion information for each of the four quadrants (north-west/east and south-west/east) seen from its location.

    Figure 6 Propagation of regional congestion information

Figure 6 illustrates how each router manages the regional congestion information. To quantify the congestion level, we use the number of occupied VCs (vc) and the crossbar demand (xb, the number of contending packets bound for the same output port) per router.4 The higher vc and xb are, the more congested the network is in the corresponding quadrant. As the dashed arrows in Figure 6 represent, each router collects the congestion information from its neighbor routers via a set of sideband signals.5 It calculates its own congestion level as a weighted sum of its local congestion information (vc and xb) and that of its neighbors collected via the sideband signals, and propagates the new congestion information to its neighbor routers. Note that the congestion information propagates from a router to all its neighbors, i.e., in four directions in the mesh architecture. For simplicity, Figure 6 shows only the south-west-bound flow of congestion information.6

4 In [19], three kinds of congestion information are considered: the number of occupied VCs (vc), the number of occupied buffers (bf), and the crossbar demand (xb). The combination of vc and xb proves to give the best results.
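As an illustration, the per-router bookkeeping can be sketched as follows. The unit weights and the 0.5 damping factor are our assumptions for exposition; [19] explores the actual metric and aggregation choices:

def local_metric(vc_occupied, xb_demand, w_vc=1.0, w_xb=1.0):
    """Local congestion level from occupied VCs and crossbar demand."""
    return w_vc * vc_occupied + w_xb * xb_demand

def reported_level(local, neighbor_level, alpha=0.5):
    """Level a router propagates onward: its own metric plus a damped
    contribution of what its upstream neighbor reported."""
    return local + alpha * neighbor_level

# Congestion information flowing along a chain of three routers toward the MC:
level = 0.0
for vc, xb in [(3, 1), (4, 2), (2, 0)]:
    level = reported_level(local_metric(vc, xb), level)
print(level)   # 6.0: distant routers still contribute, but with decaying weight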

The congestion level of each quadrant around the MC is propagated to the MC, as the dashed line between the router and the MC in Figure 7 shows.

Figure 7 Propagating congestion information to the memory controller

5.2 Network Congestion-Aware Memory Access Scheduling and Network Entry Control

We integrate congestion awareness into two functions of the memory controller: memory access scheduling and network entry control of read data.

Figure 8 shows the pseudo code of network congestion-aware memory access scheduling. We use FR-FCFS as the baseline memory access scheduling policy. Note that congestion awareness can be integrated into any memory access scheduling policy in the same way.

5 The congestion information is carried on 9 bits of sideband signals in our work. However, the overhead of the sideband signals can become negligible, since they can be reduced down to a serial link without losing the benefit of adaptive routing, as reported in [19]. A detailed analysis of the effect of reducing the number of sideband signals is left for future work.

6 In adaptive routing which utilizes the regional congestion information, each router selects the link with the lowest congestion level. For instance, in the case of the (shaded) router at (1,1) in Figure 6, when a north-east-bound packet arrives at the router, it selects the east-bound output port as the target output port, since the congestion level of the north-bound output port (24.25) is higher than that of the east-bound output port (18.25). For more details of regional congestion awareness, refer to [19].

Figure 8 Network congestion-aware memory access scheduling

1 // CQ and UCQ: congested and uncongested quadrants
2 // R_CQ and R_UCQ: requests from CQ and UCQ
3 if (high congestion in any quadrant)
4     R_UCQ = All_Requests - R_CQ
5     selected = FR_FCFS(R_UCQ)
6     if (selected == null)
7         selected = FR_FCFS(R_CQ)
8 else // no heavy congestion
9     selected = FR_FCFS(All_Requests)

If any of the four quadrants around (the router connected to) the MC has a high congestion level (line 3), then the requests from the congested quadrant(s) (CQ) are excluded from the set of candidate requests in the request buffer (line 4). The FR-FCFS policy is applied to the new candidate set (R_UCQ in Figure 8). If there is a selected request, it is served by the MC (line 5). If there is none, then a candidate is selected, if any, from the requests from the congested quadrant(s) (R_CQ) (lines 6-7). If there is no quadrant with high congestion, then the original FR-FCFS is applied to all the requests in the request buffer (lines 8-9).
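A runnable transcription of Figure 8 follows. This is our sketch: FR_FCFS is reduced to "oldest request", and the per-request quadrant and the congestion threshold are assumed inputs; a real implementation would reuse the bank-aware FR-FCFS of Section 3:

from dataclasses import dataclass

@dataclass
class Request:
    arrival: int
    quadrant: int   # quadrant of the requesting tile, seen from the MC

def FR_FCFS(requests):
    # placeholder: oldest-first; the real policy also checks bank readiness
    return min(requests, key=lambda r: r.arrival) if requests else None

def select_request(all_requests, congestion, threshold):
    cq = {q for q, level in congestion.items() if level > threshold}
    if cq:                                                         # line 3
        r_ucq = [r for r in all_requests if r.quadrant not in cq]  # line 4
        selected = FR_FCFS(r_ucq)                                  # line 5
        if selected is None:                                       # line 6
            selected = FR_FCFS([r for r in all_requests if r.quadrant in cq])  # line 7
        return selected
    return FR_FCFS(all_requests)                                   # lines 8-9

reqs = [Request(0, quadrant=1), Request(3, quadrant=2)]
print(select_request(reqs, congestion={1: 24.25, 2: 4.0, 3: 5.0, 4: 6.0}, threshold=16.0))
# -> Request(arrival=3, quadrant=2): the older request is deprioritized
#    because it comes from congested quadrant 1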

The network entry control is applied to the read data stored in the data buffer of the memory controller. The procedure for selecting the data (to be injected into the NoC) from the data buffer is similar to that in Figure 8. First, if there is any quadrant with high congestion, the read data bound for that quadrant are excluded from the set of ready read data in the data buffer. If there are any data bound for the uncongested area, the oldest such data are selected and sent to their destination. If there are no data bound for the uncongested area, the oldest ready data bound for the congested area are selected. Note that the granularity of selection is a burst of data. In our experiments, a burst of data corresponds to 8 words of 64b data for an L1 cache miss.
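The entry-control selection admits the same kind of sketch at burst granularity (again our illustration; the Burst type and the congested-quadrant set are assumed inputs):

from dataclasses import dataclass

@dataclass
class Burst:
    ready_since: int    # cycle the read data became ready in the DB
    dest_quadrant: int  # quadrant of the destination tile

def select_for_entry(bursts, congested_quadrants):
    """Oldest burst bound for an uncongested quadrant; if every ready
    burst heads into congestion, fall back to the overall oldest."""
    clear = [b for b in bursts if b.dest_quadrant not in congested_quadrants]
    pool = clear or bursts
    return min(pool, key=lambda b: b.ready_since) if pool else None

db = [Burst(0, dest_quadrant=1), Burst(5, dest_quadrant=3)]
print(select_for_entry(db, congested_quadrants={1}))  # -> the quadrant-3 burst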

6. Experiments

6.1 Experimental Setup

Figure 9 shows the 5x5 tile architecture used in our experiments. The 5x5 architecture has one MC at the center tile. The MC runs at 400MHz and is connected to a conventional DRAM memory channel of DDR2-800 (CL-tRP-tRCD = 6-6-6). The MC has a request buffer and a data buffer whose sizes are varied in our experiments. The MC supports the AXI protocol [21] at its bus interface. Thus, we use a network interface between the router and the MC. We use a Tensilica LX2 core (64b, 16K/16K caches, cache line size of 64B, 400MHz) as the PE.

The NoC router supports a four-stage pipeline (input buffering, virtual channel allocation & lookahead routing computation, switch allocation, and switch traversal), 4 virtual channels (VCs) per input port, wormhole switching, speculative allocation [22], XY routing and 64b flits. There are two types of packets: request and response packets. The read request



At low and medium levels of network congestion, the performance gain is modest (4.4%~14.1%).

Figure 10 Performance comparison: (a) baseline, (b) proposed method

    Figure 11 Performance gains

Figure 12 gives another explanation of the high performance gain of the proposed method. The figure shows the fraction of time that the data buffer is full. As the congestion level becomes higher, the data buffer becomes full more frequently in the baseline method. This is because, at high congestion, the read data bound for the congested area occupy the output port of the memory controller as well as the majority of the data buffer; that is, network congestion-induced memory blocking occurs, as exemplified in Section 4. Thus, they prevent the other data, bound for the uncongested area, from being served. The proposed method reduces the period during which the data buffer is full by up to 12.1% (high congestion, DB size = 64), which translates into the performance gain in Figure 11.

Figure 12 Fraction of time when data buffer is full (single-program case)

Network congestion-aware memory access scheduling and network entry control can be applied independently of each other. Figure 13 shows the effect of each method when applied alone to the case of a high congestion level. The figure shows that network congestion-aware memory access scheduling (Schedule in the figure) alone offers up to 16.0% improvement in memory utilization, while network congestion-aware entry control (Entry) alone offers a performance improvement of up to 8.1%.

    Figure 13 Decomposition of performance impact

Comparing the results in Figures 11 and 13, the effects of congestion-aware memory access scheduling and network entry control are not always additive. In the case of small DB sizes (32 and 64), the effect of network congestion-aware memory access scheduling dominates. The effect of network congestion-aware network entry control becomes more significant with large DB sizes. This is because large DB sizes allow more space to keep data bound for the congested area in the case of high congestion. Thus, there are more opportunities for data bound for the uncongested area to bypass data bound for the congested area, thereby reducing the memory blocking period and finally improving memory utilization.

    Figure 14 Performance gains in the multi-program case

Figure 14 shows the performance gain of the proposed method for the multi-program mapping on the 5x5 architecture shown in Figure 9. The proposed method gives a significant performance improvement of up to 18.9% (high congestion, DB size = 64). Figure 15 confirms that the reduction in the memory blocking period (enabled by the proposed memory controller) is what provides the performance improvement shown in Figure 14.

Figure 15 Fraction of time when data buffer is full (multi-program case)

7. Conclusion

In this paper, we described a performance problem called network congestion-induced memory blocking and presented, as a solution, a memory controller which performs memory access scheduling and network entry control of read data in a network congestion-aware manner. The proposed memory controller gives significant performance improvements (up to 18.9% in memory utilization), especially when the network suffers from severe congestion. As future work, we will work on an adaptive solution to cope with dynamically changing network congestion. We will also work on fairness issues between requests from congested and uncongested areas.

8. References

[1] E. Lindholm, et al., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 28(2), March 2008.

    [2] S. Bell, et al., TILE64 - Processor: A 64-Core SoC with Mesh

Interconnect, Proc. ISSCC, 2008.

    [3] Intel, Co., Single-chip Cloud Computer, available at

    http://techresearch.intel.com/articles/Tera-Scale/1826.htm.

    [4] W. Kwon, et al., A Practical Approach of Memory Access

    Parallelization to Exploit Multiple Off-chip DDR Memories,

    Proc. DAC, 2008.

    [5] E. Aho, et al., A Case for Multi-channel Memories in Video

    Recording, Proc. DATE, 2009.

    [6] T. Ezaki, et al., A 160Gb/s Interface Design Configuration for

    Multichip LSI, Proc. ISSCC, 2004.

    [7] S. Borkar, Thousand-Core Chips - A Technology Perspective,

    Proc. DAC, 2007.

    [8] J. Bautista, Tera-scale Computing and Interconnect Challenges

    3D Stacking Considerations, Tutorial, ISCA, 2008.

    [9] G. H. Loh, 3D-Stacked Memory Architectures for Multi-Core

    Processors, Proc. ISCA, 2008.

    [10] S. Rixner, et al., Memory Access Scheduling, Proc. ISCA,

    2000.

    [11] S. Heithecker and R. Ernst, Traffic Shaping for an FPGA

    based SDRAM Controller with Complex QoS Requirements,

    Proc. DAC, 2005.

    [12] B. Akesson, K. Goossens, and M. Ringhofer, Predator: a

    Predictable SDRAM Memory Controller, Proc. CODES+ISSS,

    2007.

[13] O. Mutlu and T. Moscibroda, Stall-time Fair Memory Access Scheduling for Chip Multiprocessors, Proc. MICRO, 2007.

[14] O. Mutlu and T. Moscibroda, Parallelism-Aware Memory Access Scheduling, Proc. ISCA, 2008.

    [15] J. Kim, et al., A Low Latency Router Supporting Adaptivity

    for On-Chip Interconnects, Proc. DAC, 2005.

    [16] A. Singh, et al., GOAL: A Load-Balanced Adaptive Routing

    Algorithm for Torus Networks, Proc. ISCA, 2003.

    [17] A. Singh, et al., Globally Adaptive Load-Balanced Routing on

    Tori, IEEE Computer Architecture Letters, 3(2):2, 2004.

    [18] J. Ahn, M. Erez, and W. Dally, The Design Space of Data-

    Parallel Memory Systems, Proc. Supercomputing, 2006.

    [19] P. Gratz, et al., Regional Congestion Awareness for Load

    Balance in Networks-on-Chip, Proc. HPCA, 2008.

    [20] W. Jang and D. Z. Pan, An SDRAM-aware Router for

    Networks-on-Chip, Proc. DAC, 2009.

    [21] ARM, Ltd., AMBA3 (AXI) Protocol, available at

    http://www.arm.com/products/solutions/AMBAHomePage.html

    [22] L. Peh and W. J. Dally, A Delay Model and Speculative

    Architecture for Pipelined Routers, Proc. HPCA, 2001.

    [23] T. Sherwood, et al., Discovering and Exploiting Program

    Phases, IEEE Micro, Nov/Dec, 2003.

    [24] Samsung Electronics, Co., DDR I, II, and III Device

    Operations & Timing Diagram, available at

    http://www.samsung.com/global/business/semiconductor.
