Page 1:

DRAM Memory Controller and Optimizations

CSC458: Semester Project

By Yanwei Song, Raj Parihar

Page 2:

Motivation

The “Memory Wall” problem

On-chip memory systems
  Bandwidth and latency have improved thanks to larger and faster caches

Off-chip memory systems
  DRAM-based systems are used to minimize power
  Slow, and often the bottleneck in high-performance systems
  Overall system performance is a function of DRAM performance

Page 3:

DRAM Design Space

Inputs to the DRAM system:
  Nature of the application (locality, random access)
  Organization parameters (channels, ranks, banks, etc.)
  Address mapping policies (favor locality or random access)
  Transaction scheduling policies (FCFS, Greedy, etc.)
  Hardware resources (transaction queue, bank queues)
  Timing parameters (CAS, RAS, precharge latencies)

Desired outputs:
  High sustainable bandwidth
  Low average latency
  Low power consumption
  Fairness (across multiple agents)

Page 4:

Outline

DRAM Basics

Memory Controller: System Architecture

State-of-the-Art Techniques and Designs

Simulation: DRAMsim

Results and Analysis

Future Work

Conclusion

Page 5:

DRAM Basics

Organization
  Channel >> Rank >> Bank >> Row >> Column

Memory access commands
  Row Activate <> Column Access <> Precharge

DRAM latency depends on the row-buffer state:
  Row Hit < Row Closed < Row Conflict

Page 6:

Average Access Latency

A function of the request type and the current state of the memory system: an access to an open row takes less time than an access to a closed or conflicting row.

Breakdown of a request's path (CPU -> Memory Controller -> DRAM):
  A: Delay in the processor queue
  B: Transaction sent to the memory controller
  C: Transaction translated into a command sequence
  D: Commands sent to DRAM
  E: DRAM access, one of
     E1: Requires only CAS (row hit)
     E2: Requires RAS + CAS (row closed)
     E3: Requires Precharge + RAS + CAS (row conflict)
  F: Data returned to the CPU

DRAM Latency = A + B + C + D + E + F, where E is one of {E1, E2, E3}
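A minimal sketch of the latency breakdown above, with E chosen by the row-buffer state. All timing and overhead values below are assumed placeholders for illustration, not measured numbers.

```python
# Minimal sketch of the DRAM latency breakdown (hypothetical timing values).
# E1/E2/E3 depend on the row-buffer state; A, B, C, D, F are fixed overheads here.

TIMINGS_NS = {"tCAS": 15, "tRCD": 15, "tRP": 15}        # assumed DDR2-like values
OVERHEAD_NS = {"A": 5, "B": 2, "C": 3, "D": 2, "F": 5}  # assumed queue/transfer delays

def dram_access_latency(row_state: str) -> int:
    """Return total latency in ns for a request, given the bank's row-buffer state."""
    if row_state == "hit":          # E1: row already open, only CAS needed
        e = TIMINGS_NS["tCAS"]
    elif row_state == "closed":     # E2: activate (RAS) then CAS
        e = TIMINGS_NS["tRCD"] + TIMINGS_NS["tCAS"]
    elif row_state == "conflict":   # E3: precharge, activate, then CAS
        e = TIMINGS_NS["tRP"] + TIMINGS_NS["tRCD"] + TIMINGS_NS["tCAS"]
    else:
        raise ValueError(row_state)
    return sum(OVERHEAD_NS.values()) + e

for state in ("hit", "closed", "conflict"):
    print(state, dram_access_latency(state), "ns")
```

The ordering Row Hit < Row Closed < Row Conflict falls directly out of which DRAM commands each case requires.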

Page 7:

Memory Controller System Architecture

Page 8:

Row-Buffer-Management Policy

Open Page Policy
  The next transaction to the same row incurs only a CAS
  Favors applications with high locality (temporal, spatial)
  Higher power consumption: sense amplifiers are kept open

Close Page Policy
  Assumes most new accesses go to a different row
  Good for applications that exhibit random accesses
  Low power consumption, about 3x lower than the open page policy

Hybrid (Adaptive) Page Policy
  Switches between the open and close page policies
  Uses a threshold to decide whether to keep a row open or close it (a sketch follows below)
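As referenced above, a minimal sketch of a threshold-based adaptive (hybrid) policy. The per-bank hit counter and the threshold value are assumptions for illustration, not a specific controller's implementation.

```python
class BankPolicy:
    """Threshold-based hybrid page policy (illustrative): track would-be row hits
    per bank and keep the row open only while the recent hit count stays high."""

    def __init__(self, threshold=4):       # assumed threshold value
        self.threshold = threshold
        self.last_row = None      # last row accessed in this bank
        self.open_row = None      # row currently held open (None if precharged)
        self.recent_hits = 0

    def access(self, row):
        # Count would-be hits against the last accessed row, independent of
        # whether the row was physically left open; this drives the mode decision.
        if row == self.last_row:
            self.recent_hits = min(self.recent_hits + 1, 2 * self.threshold)
        else:
            self.recent_hits = max(self.recent_hits - 1, 0)
        self.last_row = row

        # Command sequence required for this access.
        if self.open_row == row:
            cmds = "CAS"                 # row hit
        elif self.open_row is None:
            cmds = "RAS+CAS"             # bank precharged (row closed)
        else:
            cmds = "PRE+RAS+CAS"         # row conflict

        # Mode decision for after the access.
        if self.recent_hits >= self.threshold:
            self.open_row = row          # open-page behaviour: leave row open
        else:
            self.open_row = None         # close-page behaviour: precharge now
            cmds += "+PRE"
        return cmds
```

A locality-heavy stream keeps the counter above the threshold and behaves like open page; a random stream decays the counter and behaves like close page.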

Page 9:

Address Mapping

Example system: consecutive cache lines (0x00, 0x01, 0x02, 0x03, ..., 0xnn) are distributed over two channels (Channel 0, Channel 1), each with two ranks (Rank 0, Rank 1) of DRAM devices (DRAM 1, DRAM 2), 4 banks per device.

Physical address layout: | Ch ID | Rank ID | Bank ID | Row | Col |
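A minimal sketch of how the physical-address fields above can be decoded by bit slicing. The field widths are assumed for illustration and do not correspond to a particular DIMM.

```python
# Fields listed least-significant first, matching the layout above (Col lowest,
# Ch ID highest). Widths are illustrative: 1K columns, 16K rows, 4 banks,
# 2 ranks, 2 channels.
FIELD_BITS = [("col", 10), ("row", 14), ("bank", 2), ("rank", 1), ("ch", 1)]

def decode(addr: int) -> dict:
    """Split a physical address into column/row/bank/rank/channel fields."""
    fields = {}
    for name, bits in FIELD_BITS:
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields

# With the column bits lowest, consecutive cache lines land in the same row
# (favours locality); placing channel/bank bits lower instead scatters
# consecutive lines across channels and banks (favours parallelism).
print(decode(0x12345678))
```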

Page 10:

State-of-the-Art Optimizations

"In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts, but they can also destroy other threads' DRAM-bank-level parallelism."

The first scheduling algorithm that is aware of bank-level parallelism within a thread.

A two-level scheduling scheme:
  Batch scheduling: to achieve fairness
  Within-batch scheduling: to minimize average stall time and maximize throughput, exploiting
    Row-buffer locality
    Intra-thread bank parallelism

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems, ISCA 2008
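A much-simplified sketch of the two-level idea described on this slide, not the full PAR-BS algorithm: group outstanding requests into a batch and serve the whole batch before newer requests (fairness), and within the batch prefer row hits and then a thread ranking (throughput and intra-thread bank parallelism). The fixed thread ranking here is an assumed simplification.

```python
from collections import namedtuple

Request = namedtuple("Request", "thread bank row arrival")

def schedule(requests, open_rows, thread_rank):
    """Return requests in service order. open_rows maps bank -> open row;
    thread_rank maps thread id -> rank (lower rank is served earlier)."""
    batch = sorted(requests, key=lambda r: r.arrival)   # batch = all current requests

    def priority(r):
        row_hit = 0 if open_rows.get(r.bank) == r.row else 1   # row hits first
        return (row_hit, thread_rank[r.thread], r.arrival)      # then rank, then age

    return sorted(batch, key=priority)

# Request mix from the backup-slide example.
reqs = [Request("A", 0, 1, 0), Request("B", 1, 99, 1),
        Request("B", 0, 99, 2), Request("A", 1, 1, 3)]
order = schedule(reqs, open_rows={0: 1, 1: 99}, thread_rank={"A": 0, "B": 1})
for r in order:
    print(r)
```

Ranking whole threads within a batch is what lets one thread's requests proceed in parallel across banks instead of being interleaved with another thread's.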

Page 11:

Different Scheduling Algorithm Comparison

Stall time per thread under three schedulers:

              Scheduler 1   Scheduler 2   Scheduler 3
  Thread 1        4             5.5           1
  Thread 2        4             3             2
  Thread 3        5             4.5           4
  Thread 4        7             4.5           5.5
  AVG             5             4.375         3.125

Page 12:

Survey: DRAM System in the Intel 5000

Fully Buffered DIMMs (FB-DIMMs)
  Simultaneous read/write data transfers to different FB-DIMMs on a channel
  No turnarounds between back-to-back data transfers to different FB-DIMMs on a channel
  Minimal turnarounds and bubbles between back-to-back data transfers to the same FB-DIMM

Optimizations in the Controller
  Speculative memory read: read cancels are issued to the memory controller if the local-bus snoop or cross-bus snoop results in “dirty” data or a retry
  Memory interleaving: the memory controller can be programmed to scatter sequential addresses with fine-grained interleaving across memory branches, ranks, and DRAM banks
  Partial writes: coalescing partial writes and reducing the possibility of conflict serialization in multi-bus systems

Page 13:

Survey: DRAM System in the Opteron

AMD Northbridge Architecture
  More concurrency: additional open DRAM banks to reduce page conflicts
  Longer burst length to improve efficiency
  DRAM paging support uses history-based pattern prediction to increase the frequency of page hits and decrease page conflicts
  The DRAM prefetcher tracks positive, negative, and non-unit strides and has a dedicated buffer for prefetched data (a stride-detection sketch follows below)
  Write bursting minimizes read/write turnaround
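As referenced in the list above, a generic stride-detection sketch, not AMD's actual prefetcher design: remember the last address and stride, and once the same positive, negative, or non-unit stride repeats, prefetch the next address into a dedicated buffer.

```python
class StridePrefetcher:
    """Illustrative stride detector for a DRAM prefetcher."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None
        self.buffer = set()          # dedicated prefetch buffer (addresses)

    def access(self, addr: int):
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetch = addr + stride          # stride confirmed: prefetch ahead
                self.buffer.add(prefetch)
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
for a in (100, 164, 228, 292):       # stream with a +64-byte stride
    print(a, "->", pf.access(a))
```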

Page 14:

Simulation: DRAMsim

Part of SYsim, a system-level simulator; can also be used as a standalone simulator. Developed by UMD.

Detailed timing models
  SDRAM, DDR, DDR2, DRDRAM, FB-DIMM, etc.

Page 15:

Results/ Analysis

DRAMsim and SimpleScalar Traces

Page 16:

Number of Channels vs. Read Latency

Chart: gcc: Read Latency (ns) vs. Number of Channels in DDR2 (number of requests per read-latency bin, 185-230 ns; series Ch_1, Ch_2, Ch_4)

As the number of channels increases, the average latency comes down drastically, because channels provide the highest level of parallelism in the system.
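The charts report a count of requests per read-latency bin; a quick sketch of how an average read latency can be derived from such bins. The bin values below are made up for illustration, not the measured gcc data.

```python
# Hypothetical (latency_ns, request_count) pairs in the style of the charts above.
bins_1ch = [(190, 120), (200, 400), (210, 250), (220, 80)]
bins_4ch = [(185, 500), (190, 300), (195, 50)]

def average_latency(bins):
    """Weighted average of bin latencies by request count."""
    total_requests = sum(count for _, count in bins)
    return sum(lat * count for lat, count in bins) / total_requests

print("1 channel :", round(average_latency(bins_1ch), 1), "ns")
print("4 channels:", round(average_latency(bins_4ch), 1), "ns")
```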

Page 17:

Row-Management Policy vs. Read Latency

Chart: swim: Row-Management Policy vs. Read Latency (ns) (number of requests per read-latency bin, 185-225 ns; series Row_Close, Row_Open)

The open row policy favors locality, so if swim performs better with the open row policy, it has a lot of locality. An application with a lot of random accesses would favor the close row policy instead.

Page 18:

Address Mapping Policy vs. Read Latency

Chart: vortex: Address Mapping Policy vs. Read Latency (ns) (number of accesses per read-latency bin, 185-225 ns; series SDRAM_Close, intel845g)

The intel845g scheme implements an address mapping that favors locality. The row-management policy and the address mapping policy are tightly related and should favor each other.

Page 19:

Txn Queue Scheduling vs. Read Latency

Chart: applu: Transaction Queue Scheduling Policy vs. Read Latency (ns) (number of accesses per read-latency bin, 185-230 ns; series FCFS, Most_Pending, Greedy)

Greedy:

Most Pending:
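Since the definitions of Greedy and Most Pending are not spelled out here, the sketch below uses common interpretations from the memory-access-scheduling literature, which may differ from DRAMsim's exact policies: FCFS picks the oldest request, Most Pending prefers the row with the most queued requests, and Greedy prefers requests that hit an already-open row.

```python
from collections import Counter, namedtuple

Req = namedtuple("Req", "arrival bank row")

def pick(queue, policy, open_rows):
    """Choose the next request from the transaction queue under one policy.
    open_rows maps bank -> currently open row."""
    if policy == "FCFS":
        return min(queue, key=lambda r: r.arrival)          # oldest first
    if policy == "MOST_PENDING":
        pending = Counter((r.bank, r.row) for r in queue)    # requests per (bank, row)
        # Most-requested row first; break ties by age (oldest first).
        return max(queue, key=lambda r: (pending[(r.bank, r.row)], -r.arrival))
    if policy == "GREEDY":
        hits = [r for r in queue if open_rows.get(r.bank) == r.row]
        return min(hits or queue, key=lambda r: r.arrival)   # row hits first, else oldest
    raise ValueError(policy)

q = [Req(0, 0, 5), Req(1, 0, 7), Req(2, 0, 7), Req(3, 1, 9)]
for p in ("FCFS", "MOST_PENDING", "GREEDY"):
    print(p, pick(q, p, open_rows={0: 7}))
```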

Page 20:

Future Work

Performance comparison of various DRAM memory controllers
  Current results are for DDR2-type memory controllers
  Performance comparison: DDR2, DDR3, FB-DIMM, etc.

Support for multiple threads
  DRAMsim supports single-thread execution at the moment

Integration of DRAMsim into CMP_network
  CMP_network is ACAL's multiprocessor simulator
  It currently implements a fixed memory latency of 200 cycles

Page 21:

Conclusion

Memory controller optimization is highly application dependent

Mapping consecutive cache lines to different banks, ranks, and channels scatters requests across rows and favors parallelism at the expense of spatial locality

The conventional memory controller design for single-core systems is no longer the optimal choice for a shared memory controller

Page 22:

References

“Memory Systems: Cache, DRAM, Disk”, Bruce Jacob, Spencer W. Ng, and David T. Wang

“Memory Access Scheduling”, S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens

“DRAMsim: A Memory-System Simulator”, David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Katie Baynes, Aamer Jaleel, and Bruce Jacob

“Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems”, O. Mutlu and T. Moscibroda

“The AMD Opteron Northbridge Architecture”, P. Conway and B. Hughes

“The Blackford Northbridge Chipset for the Intel 5000”, S. Radhakrishnan, S. Chinthamani, and K. Cheng

Page 23:

Q & A

Page 24:

Backup

Page 25:

Bank Parallelism of a Thread

Single thread, 2 DRAM requests:
  Thread A: Bank 0, Row 1
  Thread A: Bank 1, Row 1

Timeline: Thread A computes, issues its 2 DRAM requests to Bank 0 and Bank 1, stalls while both banks are accessed in parallel, then resumes computing.

The bank access latencies of the two requests overlap, so the thread stalls for only ~ONE bank access latency.

Page 26:

Bank Parallelism Interference in DRAM

Two threads, 2 DRAM requests each:
  Thread A: Bank 0, Row 1
  Thread B: Bank 1, Row 99
  Thread B: Bank 0, Row 99
  Thread A: Bank 1, Row 1

Baseline scheduler: each thread computes, issues its 2 DRAM requests, and stalls; the requests interleave across Bank 0 and Bank 1, so each thread's bank access latencies are serialized and each thread stalls for ~TWO bank access latencies.

Page 27:

Parallelism-Aware Scheduler

Same two threads and requests:
  Thread A: Bank 0, Row 1
  Thread B: Bank 1, Row 99
  Thread B: Bank 0, Row 99
  Thread A: Bank 1, Row 1

Baseline scheduler: each thread's two bank accesses are serialized; each thread stalls for ~two bank access latencies.

Parallelism-aware scheduler: Thread A's two requests are serviced in parallel across Bank 0 and Bank 1 first, then Thread B's two requests in parallel, saving cycles.

Average stall time: ~1.5 bank access latencies.
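A small sketch that reproduces the stall-time arithmetic in these backup slides, assuming each bank access takes one latency unit and each bank services its queue in order; the request placement follows the example above.

```python
def thread_stalls(bank_queues):
    """bank_queues: {bank: [thread, thread, ...]} in service order.
    A thread stalls until its last outstanding request completes."""
    finish = {}   # thread -> completion time (in bank access latencies) of its last request
    for bank, order in bank_queues.items():
        for slot, thread in enumerate(order, start=1):
            finish[thread] = max(finish.get(thread, 0), slot)
    return finish

# Baseline scheduler: each thread's two accesses end up serialized across banks.
baseline = {0: ["A", "B"], 1: ["B", "A"]}
# Parallelism-aware scheduler: A's requests overlap, then B's requests overlap.
par_aware = {0: ["A", "B"], 1: ["A", "B"]}

for name, queues in (("baseline", baseline), ("parallelism-aware", par_aware)):
    stalls = thread_stalls(queues)
    avg = sum(stalls.values()) / len(stalls)
    print(name, stalls, "average:", avg)
```

This reproduces the slide's numbers: both threads stall ~2 latencies under the baseline, versus an average of ~1.5 with the parallelism-aware ordering.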