Page 1:

DRAM Memory Controller and Optimizations

CSC458: Semester Project

By Yanwei Song, Raj Parihar

Page 2:

Motivation

The “Memory Wall” problem

On-chip memory systems
  Bandwidth and latency have improved thanks to larger and faster caches

Off-chip memory systems
  DRAM-based systems are used to minimize power
  Slow, and often the bottleneck in high-performance systems
  Overall system performance is a function of DRAM performance

Page 3:

DRAM Design Space

Inputs to the DRAM system:
  Nature of the application (locality, random access)
  Organization parameters (channels, ranks, banks, etc.)
  Address mapping policies (favor locality or random access)
  Transaction scheduling policies (FCFS, Greedy, etc.)
  Hardware resources (transaction queue, bank queues)
  Timing parameters (CAS, RAS, precharge latencies)

Desired outputs:
  High sustainable bandwidth
  Low average latency
  Low power consumption
  Fairness (across multiple agents)

Page 4:

Outline

DRAM Basics

Memory Controller: System Architecture

State-of-the-Art Techniques and Designs

Simulation: DRAMsim

Results and Analysis

Future Work

Conclusion

Page 5:

DRAM Basics

Organization
  Channel >> Rank >> Bank >> Row >> Column

Memory access commands
  Row Activate <> Column Access <> Precharge

DRAM latency depends on the row-buffer state:
  Row Hit < Row Closed < Row Conflict

Page 6:

Average Access Latency

A function of the request type and the current state of the memory system: an access to an open row takes less time than an access to a closed or conflicting row.

Breakdown of a request's path (CPU -> Memory Controller -> DRAM):
  A: Delay in the processor queue
  B: Transaction sent to the memory controller
  C: Transaction translated into a command sequence
  D: Commands sent to DRAM
  E: DRAM access, one of
     E1: Requires only CAS (row hit)
     E2: Requires RAS + CAS (row closed)
     E3: Requires Precharge + RAS + CAS (row conflict)
  F: Data returned to the CPU

DRAM Latency = A + B + C + D + E + F, where E is one of {E1, E2, E3}
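A minimal sketch of the latency breakdown above, with E chosen by the row-buffer state. All timing and overhead values below are assumed placeholders for illustration, not measured numbers.

```python
# Minimal sketch of the DRAM latency breakdown (hypothetical timing values).
# E1/E2/E3 depend on the row-buffer state; A, B, C, D, F are fixed overheads here.

TIMINGS_NS = {"tCAS": 15, "tRCD": 15, "tRP": 15}        # assumed DDR2-like values
OVERHEAD_NS = {"A": 5, "B": 2, "C": 3, "D": 2, "F": 5}  # assumed queue/transfer delays

def dram_access_latency(row_state: str) -> int:
    """Return total latency in ns for a request, given the bank's row-buffer state."""
    if row_state == "hit":          # E1: row already open, only CAS needed
        e = TIMINGS_NS["tCAS"]
    elif row_state == "closed":     # E2: activate (RAS) then CAS
        e = TIMINGS_NS["tRCD"] + TIMINGS_NS["tCAS"]
    elif row_state == "conflict":   # E3: precharge, activate, then CAS
        e = TIMINGS_NS["tRP"] + TIMINGS_NS["tRCD"] + TIMINGS_NS["tCAS"]
    else:
        raise ValueError(row_state)
    return sum(OVERHEAD_NS.values()) + e

for state in ("hit", "closed", "conflict"):
    print(state, dram_access_latency(state), "ns")
```

The ordering Row Hit < Row Closed < Row Conflict falls directly out of which DRAM commands each case requires.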

Page 7:

Memory Controller System Architecture

Page 8:

Row-Buffer-Management Policy

Open Page Policy
  The next transaction to the same row incurs only a CAS
  Favors applications with high locality (temporal, spatial)
  Higher power consumption: sense amplifiers are kept open

Close Page Policy
  Assumes most new accesses go to a different row
  Good for applications that exhibit random accesses
  Low power consumption, about 3x lower than the open page policy

Hybrid (Adaptive) Page Policy
  Switches between the open and close page policies
  Uses a threshold to decide whether to keep a row open or close it (a sketch follows below)
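As referenced above, a minimal sketch of a threshold-based adaptive (hybrid) policy. The per-bank hit counter and the threshold value are assumptions for illustration, not a specific controller's implementation.

```python
class BankPolicy:
    """Threshold-based hybrid page policy (illustrative): track would-be row hits
    per bank and keep the row open only while the recent hit count stays high."""

    def __init__(self, threshold=4):       # assumed threshold value
        self.threshold = threshold
        self.last_row = None      # last row accessed in this bank
        self.open_row = None      # row currently held open (None if precharged)
        self.recent_hits = 0

    def access(self, row):
        # Count would-be hits against the last accessed row, independent of
        # whether the row was physically left open; this drives the mode decision.
        if row == self.last_row:
            self.recent_hits = min(self.recent_hits + 1, 2 * self.threshold)
        else:
            self.recent_hits = max(self.recent_hits - 1, 0)
        self.last_row = row

        # Command sequence required for this access.
        if self.open_row == row:
            cmds = "CAS"                 # row hit
        elif self.open_row is None:
            cmds = "RAS+CAS"             # bank precharged (row closed)
        else:
            cmds = "PRE+RAS+CAS"         # row conflict

        # Mode decision for after the access.
        if self.recent_hits >= self.threshold:
            self.open_row = row          # open-page behaviour: leave row open
        else:
            self.open_row = None         # close-page behaviour: precharge now
            cmds += "+PRE"
        return cmds
```

A locality-heavy stream keeps the counter above the threshold and behaves like open page; a random stream decays the counter and behaves like close page.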

Page 9:

Address Mapping

Example system: consecutive cache lines (0x00, 0x01, 0x02, 0x03, ..., 0xnn) are distributed over two channels (Channel 0, Channel 1), each with two ranks (Rank 0, Rank 1) of DRAM devices (DRAM 1, DRAM 2), 4 banks per device.

Physical address layout: | Ch ID | Rank ID | Bank ID | Row | Col |
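A minimal sketch of how the physical-address fields above can be decoded by bit slicing. The field widths are assumed for illustration and do not correspond to a particular DIMM.

```python
# Fields listed least-significant first, matching the layout above (Col lowest,
# Ch ID highest). Widths are illustrative: 1K columns, 16K rows, 4 banks,
# 2 ranks, 2 channels.
FIELD_BITS = [("col", 10), ("row", 14), ("bank", 2), ("rank", 1), ("ch", 1)]

def decode(addr: int) -> dict:
    """Split a physical address into column/row/bank/rank/channel fields."""
    fields = {}
    for name, bits in FIELD_BITS:
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields

# With the column bits lowest, consecutive cache lines land in the same row
# (favours locality); placing channel/bank bits lower instead scatters
# consecutive lines across channels and banks (favours parallelism).
print(decode(0x12345678))
```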

Page 10:

State-of-the-Art Optimizations

"In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts, but they can also destroy other threads' DRAM-bank-level parallelism."

The first scheduling algorithm that is aware of bank-level parallelism within a thread.

A two-level scheduling scheme:
  Batch scheduling: to achieve fairness
  Within-batch scheduling: to minimize average stall time and maximize throughput, exploiting
    Row-buffer locality
    Intra-thread bank parallelism

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems, ISCA 2008
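A much-simplified sketch of the two-level idea described on this slide, not the full PAR-BS algorithm: group outstanding requests into a batch and serve the whole batch before newer requests (fairness), and within the batch prefer row hits and then a thread ranking (throughput and intra-thread bank parallelism). The fixed thread ranking here is an assumed simplification.

```python
from collections import namedtuple

Request = namedtuple("Request", "thread bank row arrival")

def schedule(requests, open_rows, thread_rank):
    """Return requests in service order. open_rows maps bank -> open row;
    thread_rank maps thread id -> rank (lower rank is served earlier)."""
    batch = sorted(requests, key=lambda r: r.arrival)   # batch = all current requests

    def priority(r):
        row_hit = 0 if open_rows.get(r.bank) == r.row else 1   # row hits first
        return (row_hit, thread_rank[r.thread], r.arrival)      # then rank, then age

    return sorted(batch, key=priority)

# Request mix from the backup-slide example.
reqs = [Request("A", 0, 1, 0), Request("B", 1, 99, 1),
        Request("B", 0, 99, 2), Request("A", 1, 1, 3)]
order = schedule(reqs, open_rows={0: 1, 1: 99}, thread_rank={"A": 0, "B": 1})
for r in order:
    print(r)
```

Ranking whole threads within a batch is what lets one thread's requests proceed in parallel across banks instead of being interleaved with another thread's.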

Page 11:

Different Scheduling Algorithm Comparison

Stall time per thread under three schedulers:

              Scheduler 1   Scheduler 2   Scheduler 3
  Thread 1        4             5.5           1
  Thread 2        4             3             2
  Thread 3        5             4.5           4
  Thread 4        7             4.5           5.5
  AVG             5             4.375         3.125

Page 12:

Survey: DRAM System in the Intel 5000

Fully Buffered DIMMs (FB-DIMMs)
  Simultaneous read/write data transfers to different FB-DIMMs on a channel
  No turnarounds between back-to-back data transfers to different FB-DIMMs on a channel
  Minimal turnarounds and bubbles between back-to-back data transfers to the same FB-DIMM

Optimizations in the Controller
  Speculative memory read: read cancels are issued to the memory controller if the local-bus snoop or cross-bus snoop results in “dirty” data or a retry
  Memory interleaving: the memory controller can be programmed to scatter sequential addresses with fine-grained interleaving across memory branches, ranks, and DRAM banks
  Partial writes: coalescing partial writes and reducing the possibility of conflict serialization in multi-bus systems

Page 13:

Survey: DRAM System in the Opteron

AMD Northbridge Architecture
  More concurrency: additional open DRAM banks to reduce page conflicts
  Longer burst length to improve efficiency
  DRAM paging support uses history-based pattern prediction to increase the frequency of page hits and decrease page conflicts
  The DRAM prefetcher tracks positive, negative, and non-unit strides and has a dedicated buffer for prefetched data (a stride-detection sketch follows below)
  Write bursting minimizes read/write turnaround
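As referenced in the list above, a generic stride-detection sketch, not AMD's actual prefetcher design: remember the last address and stride, and once the same positive, negative, or non-unit stride repeats, prefetch the next address into a dedicated buffer.

```python
class StridePrefetcher:
    """Illustrative stride detector for a DRAM prefetcher."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None
        self.buffer = set()          # dedicated prefetch buffer (addresses)

    def access(self, addr: int):
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetch = addr + stride          # stride confirmed: prefetch ahead
                self.buffer.add(prefetch)
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
for a in (100, 164, 228, 292):       # stream with a +64-byte stride
    print(a, "->", pf.access(a))
```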

Page 14:

Simulation: DRAMsim

Part of SYsim, a system-level simulator; can also be used as a standalone simulator. Developed by UMD.

Detailed timing models
  SDRAM, DDR, DDR2, DRDRAM, FB-DIMM, etc.

Page 15:

Results/ Analysis

DRAMsim and SimpleScalar Traces

Page 16:

Number of Channels vs. Read Latency

Chart: gcc: Read Latency (ns) vs. Number of Channels in DDR2 (number of requests per read-latency bin, 185-230 ns; series Ch_1, Ch_2, Ch_4)

As the number of channels increases, the average latency comes down drastically, because channels provide the highest level of parallelism in the system.
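The charts report a count of requests per read-latency bin; a quick sketch of how an average read latency can be derived from such bins. The bin values below are made up for illustration, not the measured gcc data.

```python
# Hypothetical (latency_ns, request_count) pairs in the style of the charts above.
bins_1ch = [(190, 120), (200, 400), (210, 250), (220, 80)]
bins_4ch = [(185, 500), (190, 300), (195, 50)]

def average_latency(bins):
    """Weighted average of bin latencies by request count."""
    total_requests = sum(count for _, count in bins)
    return sum(lat * count for lat, count in bins) / total_requests

print("1 channel :", round(average_latency(bins_1ch), 1), "ns")
print("4 channels:", round(average_latency(bins_4ch), 1), "ns")
```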

Page 17:

Row-Management Policy vs. Read Latency

Chart: swim: Row-Management Policy vs. Read Latency (ns) (number of requests per read-latency bin, 185-225 ns; series Row_Close, Row_Open)

The open row policy favors locality, so if swim performs better with the open row policy, it has a lot of locality. An application with a lot of random accesses would favor the close row policy instead.

Page 18:

Address Mapping Policy vs. Read Latency

Chart: vortex: Address Mapping Policy vs. Read Latency (ns) (number of accesses per read-latency bin, 185-225 ns; series SDRAM_Close, intel845g)

The intel845g scheme implements an address mapping that favors locality. The row-management policy and the address mapping policy are tightly related and should favor each other.

Page 19:

Txn Queue Scheduling vs. Read Latency

Chart: applu: Transaction Queue Scheduling Policy vs. Read Latency (ns) (number of accesses per read-latency bin, 185-230 ns; series FCFS, Most_Pending, Greedy)

Greedy:

Most Pending:
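Since the definitions of Greedy and Most Pending are not spelled out here, the sketch below uses common interpretations from the memory-access-scheduling literature, which may differ from DRAMsim's exact policies: FCFS picks the oldest request, Most Pending prefers the row with the most queued requests, and Greedy prefers requests that hit an already-open row.

```python
from collections import Counter, namedtuple

Req = namedtuple("Req", "arrival bank row")

def pick(queue, policy, open_rows):
    """Choose the next request from the transaction queue under one policy.
    open_rows maps bank -> currently open row."""
    if policy == "FCFS":
        return min(queue, key=lambda r: r.arrival)          # oldest first
    if policy == "MOST_PENDING":
        pending = Counter((r.bank, r.row) for r in queue)    # requests per (bank, row)
        # Most-requested row first; break ties by age (oldest first).
        return max(queue, key=lambda r: (pending[(r.bank, r.row)], -r.arrival))
    if policy == "GREEDY":
        hits = [r for r in queue if open_rows.get(r.bank) == r.row]
        return min(hits or queue, key=lambda r: r.arrival)   # row hits first, else oldest
    raise ValueError(policy)

q = [Req(0, 0, 5), Req(1, 0, 7), Req(2, 0, 7), Req(3, 1, 9)]
for p in ("FCFS", "MOST_PENDING", "GREEDY"):
    print(p, pick(q, p, open_rows={0: 7}))
```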

Page 20:

Future Work

Performance comparison of various DRAM memory controllers
  Current results are for DDR2-type memory controllers
  Performance comparison: DDR2, DDR3, FB-DIMM, etc.

Support for multiple threads
  DRAMsim supports single-thread execution at the moment

Integration of DRAMsim into CMP_network
  CMP_network is ACAL's multiprocessor simulator
  It currently implements a fixed memory latency of 200 cycles

Page 21:

Conclusion

Memory controller optimization is highly application dependent

Mapping consecutive cache lines to different banks, ranks, and channels scatters requests across rows and favors parallelism at the expense of spatial locality

The conventional memory controller design for single-core systems is no longer the optimal choice for a shared memory controller

Page 22:

References

“Memory Systems: Cache, DRAM, Disk”, Bruce Jacob, Spencer W. Ng, and David T. Wang

“Memory Access Scheduling”, S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens

“DRAMsim: A Memory-System Simulator”, David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Katie Baynes, Aamer Jaleel, and Bruce Jacob

“Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems”, O. Mutlu and T. Moscibroda

“The AMD Opteron Northbridge Architecture”, P. Conway and B. Hughes

“The Blackford Northbridge Chipset for the Intel 5000”, S. Radhakrishnan, S. Chinthamani, and K. Cheng

Page 23:

Q & A

Page 24:

Backup

Page 25:

Bank Parallelism of a Thread

Single thread, 2 DRAM requests:
  Thread A: Bank 0, Row 1
  Thread A: Bank 1, Row 1

Timeline: Thread A computes, issues its 2 DRAM requests to Bank 0 and Bank 1, stalls while both banks are accessed in parallel, then resumes computing.

The bank access latencies of the two requests overlap, so the thread stalls for only ~ONE bank access latency.

Page 26:

Bank Parallelism Interference in DRAM

Two threads, 2 DRAM requests each:
  Thread A: Bank 0, Row 1
  Thread B: Bank 1, Row 99
  Thread B: Bank 0, Row 99
  Thread A: Bank 1, Row 1

Baseline scheduler: each thread computes, issues its 2 DRAM requests, and stalls; the requests interleave across Bank 0 and Bank 1, so each thread's bank access latencies are serialized and each thread stalls for ~TWO bank access latencies.

Page 27:

Parallelism-Aware Scheduler

Same two threads and requests:
  Thread A: Bank 0, Row 1
  Thread B: Bank 1, Row 99
  Thread B: Bank 0, Row 99
  Thread A: Bank 1, Row 1

Baseline scheduler: each thread's two bank accesses are serialized; each thread stalls for ~two bank access latencies.

Parallelism-aware scheduler: Thread A's two requests are serviced in parallel across Bank 0 and Bank 1 first, then Thread B's two requests in parallel, saving cycles.

Average stall time: ~1.5 bank access latencies.
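A small sketch that reproduces the stall-time arithmetic in these backup slides, assuming each bank access takes one latency unit and each bank services its queue in order; the request placement follows the example above.

```python
def thread_stalls(bank_queues):
    """bank_queues: {bank: [thread, thread, ...]} in service order.
    A thread stalls until its last outstanding request completes."""
    finish = {}   # thread -> completion time (in bank access latencies) of its last request
    for bank, order in bank_queues.items():
        for slot, thread in enumerate(order, start=1):
            finish[thread] = max(finish.get(thread, 0), slot)
    return finish

# Baseline scheduler: each thread's two accesses end up serialized across banks.
baseline = {0: ["A", "B"], 1: ["B", "A"]}
# Parallelism-aware scheduler: A's requests overlap, then B's requests overlap.
par_aware = {0: ["A", "B"], 1: ["A", "B"]}

for name, queues in (("baseline", baseline), ("parallelism-aware", par_aware)):
    stalls = thread_stalls(queues)
    avg = sum(stalls.values()) / len(stalls)
    print(name, stalls, "average:", avg)
```

This reproduces the slide's numbers: both threads stall ~2 latencies under the baseline, versus an average of ~1.5 with the parallelism-aware ordering.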