12
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM Bandwidth 1

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

Embed Size (px)

Citation preview

Page 1: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

ECE408/CS483

Applied Parallel Programming

Lecture 7: DRAM Bandwidth

1

Page 2: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

Objective

• To understand DRAM bandwidth– Cause of the DRAM bandwidth problem– Programming techniques that address the problem:

memory coalescing, corner turning,

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

2

Page 3: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

Global Memory (DRAM) Bandwidth

Ideal Reality

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

3

Page 4: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

DRAM Bank Organization

• Each core array has about 1M bits

• Each bit is stored in a tiny capacitor, made of one transistor

Memory CellCore Array

RowDecoder

Sense Amps

Column Latches

Mux

RowAddr

ColumnAddr

Off-chip Data

Wide

Narrow

Pin Interface

4

Page 5: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

A very small (8x2 bit) DRAM Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

deco

de

0 1 1

Sense amps

Mux5

Page 6: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

DRAM core arrays are slow.

• Reading from a cell in the core array is a very slow process– DDR: Core speed = ½ interface speed– DDR2/GDDR3: Core speed = ¼ interface speed– DDR3/GDDR4: Core speed = ⅛ interface speed– … likely to be worse in the future

deco

de

To sense amps

A very small capacitance that stores a data bit

About 1000 cells connected to each vertical line

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

6

Page 7: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

DRAM Bursting.

• For DDR{2,3} SDRAM cores clocked at 1/N speed of the interface:

– Load (N × interface width) of DRAM bits from the same row at once to an internal buffer, then transfer in N steps at interface speed

– DDR2/GDDR3: buffer width = 4X interface width

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

7

Page 8: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

DRAM Bursting

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

deco

de

0 1 0

Sense amps

Mux8

Page 9: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

DRAM Bursting

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

deco

de

0 1 1

Sense amps and buffer

Mux9

Page 10: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

DRAM Bursting for the 8x2 Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

time

Address bits to decoder

Core Array access delay2 bitsto pin

2 bitsto pin

Non-burst timing

Burst timing

Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.

10

Page 11: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

Multiple DRAM Banks

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

deco

de

Sense amps

Muxde

code

Sense amps

Mux

0 1 10

Bank 0 Bank 1

11

Page 12: ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM

DRAM Bursting for the 8x2 Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

time

Address bits to decoder

Core Array access delay2 bitsto pin

2 bitsto pin

Single-Bank burst timing, dead time on interface

Multi-Bank burst timing, reduced dead time 12