DRAM Memory Controller and
Optimizations
CSC458: Semester Project
By
Yanwei Song, Raj Parihar
Motivation
"Memory Wall" Problem
On-chip memory systems
Bandwidth and latency have improved due to larger and faster caches
Off-chip memory systems
DRAM-based systems are used to minimize power
Slow, and often the bottleneck in high-performance systems
Overall system performance is a function of DRAM system performance
3/7/2012 CSC458: Parallel & Distributed Systems 2
DRAM Design Space
Inputs to the DRAM system:
Nature of Application (Locality, Random Access)
Organization Parameters (Channel, Rank, Bank etc.)
Address Mapping Policies (Favor Locality or Random access)
Txn Scheduling Policies (FCFS, Greedy etc.)
Hardware Resources (Txn Queue, Bank Queue)
Timing Parameters (CAS, RAS, Pre-Charge Latency)
Desired outputs:
High Sustainable Bandwidth
Low Average Latency
Low Power Consumption
Fairness (Multiple Agents)
Outline
DRAM Basics
Memory Controller: System Architecture
State-of-the-Art Techniques/Design
Simulation: DRAMsim
Results and Analysis
Future Work
Conclusion
DRAM Basics
Organization
Channel >> Rank >> Bank >> Row >> Column
Memory Access Commands
Row Activate <> Column Access <> Pre-Charge
DRAM latency depends on the row-buffer state:
Row Hit < Row Closed < Row Conflict
Average Access Latency
Function of the type of request and the current state of the memory system
An access to an open row takes less time than an access to a closed or conflicting row
CPU -> Memory Controller -> DRAM
A: Delay in processor queue
B: Txn sent to MemCtrl
C: Txn -> CMD sequence
D: Cmd sent to DRAM
E1: Requires only CAS
E2: Requires RAS + CAS
E3: Needs Pre + RAS + CAS
F: Txn sent back to CPU
DRAM Latency = A + B + C + D + E + F, where E is one of {E1, E2, E3}
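The latency breakdown above can be sketched as a small model. All component delays below are illustrative placeholder values, not real DDR2 timings:

```python
# Sketch of the per-request latency model above. The component delays
# (in ns) are hypothetical placeholders, not real DDR2 timings.
CAS, RAS, PRE = 15, 15, 15  # assumed column, row, and precharge latencies

def dram_phase_latency(row_state):
    """E depends on the row-buffer state when the command arrives."""
    if row_state == "hit":       # E1: row already open -> CAS only
        return CAS
    if row_state == "closed":    # E2: bank idle -> RAS + CAS
        return RAS + CAS
    return PRE + RAS + CAS       # E3: row conflict -> Pre + RAS + CAS

def total_latency(row_state, A=5, B=2, C=3, D=2, F=4):
    # DRAM Latency = A + B + C + D + E + F
    return A + B + C + D + dram_phase_latency(row_state) + F

print(total_latency("hit"))       # fastest case: row hit
print(total_latency("conflict"))  # slowest case: row conflict
```

The model makes the ordering Row Hit < Row Closed < Row Conflict explicit: only the E component changes with the row-buffer state.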
Memory Controller: System Architecture
Row-Buffer-Management Policy
Open-Page Policy
The next transaction to the same row incurs only a CAS
Favors applications with high locality (temporal, spatial)
Higher power consumption: sense amplifiers are kept active
Close-Page Policy
Most of the time, a new access goes to a new row
Good for applications which exhibit random accesses
Lower power consumption: roughly 3x lower than the open-page policy
Hybrid (Adaptive) Page Policy
Switches between the open- and close-page policies
Uses a threshold to decide whether to keep a row open or close it
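A hybrid policy of this kind can be sketched as follows. The miss-streak counter and threshold scheme are illustrative assumptions, not a specific controller's algorithm:

```python
# Sketch of a hybrid (adaptive) row-buffer policy: behave like open-page
# while accesses keep hitting the open row, and fall back to close-page
# once a streak of misses suggests the locality has run out.
class HybridPagePolicy:
    def __init__(self, threshold=4):
        self.threshold = threshold   # misses in a row before closing pages
        self.miss_streak = 0

    def access(self, row_hit):
        """Return the policy to apply after observing one access."""
        if row_hit:
            self.miss_streak = 0
        else:
            self.miss_streak += 1
        # Open-page while hits dominate, close-page otherwise
        return "open" if self.miss_streak < self.threshold else "close"

policy = HybridPagePolicy(threshold=2)
decisions = [policy.access(hit) for hit in [True, False, False, True]]
```

With a threshold of 2, two consecutive misses flip the policy to close-page, and the next row hit flips it back.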
Address Mapping
Consecutive physical addresses from the cache (0x00, 0x01, 0x02, 0x03, ... 0xnn) are distributed across the DRAM organization:
Channel 0: Rank 0 and Rank 1, each with DRAM devices (DRAM 1, DRAM 2), 4 banks/device
Channel 1: Rank 0 and Rank 1, each with DRAM devices (DRAM 1, DRAM 2), 4 banks/device
Physical address fields: Ch ID | Rank ID | Bank ID | Row | Col
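Splitting a physical address into these fields can be sketched as bit-field extraction. The field widths below are hypothetical (2 channels, 2 ranks, 4 banks, 2^14 rows, 2^10 columns), chosen only to illustrate the decomposition:

```python
# Sketch of decoding a physical address into Ch | Rank | Bank | Row | Col.
# Field widths are assumptions for illustration, not a real mapping.
FIELDS = [("col", 10), ("row", 14), ("bank", 2), ("rank", 1), ("ch", 1)]

def decode(addr):
    out = {}
    for name, bits in FIELDS:              # least-significant field first
        out[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return out

# Build an address with ch=1, rank=0, bank=3, row=3, col=5 and decode it
addr = (1 << 27) | (0 << 26) | (3 << 24) | (3 << 10) | 5
fields = decode(addr)
```

Which bits go to which field is exactly what an address-mapping policy chooses: putting channel/bank bits in the low-order positions scatters consecutive cache lines for parallelism, while putting row bits there favors locality.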
State-of-the-Art Optimizations
In a shared DRAM system, requests from a thread can not only delay
requests from other threads by causing bank/bus/row-buffer conflicts but
they can also destroy other threads' DRAM-bank-level parallelism.
The first scheduling algorithm that is aware of the bank-level parallelism within a thread
A two-level scheduling scheme:
Batch scheduling: to achieve fairness
Within-batch scheduling: to minimize the average stall time and maximize throughput by exploiting
Row-buffer locality
Intra-thread bank parallelism
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems, ISCA2008
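The two-level idea can be sketched as follows. This is a deliberately simplified illustration of batching plus row-hit-first ordering, not the published PAR-BS algorithm:

```python
# Sketch of two-level scheduling: group outstanding requests into batches
# (older requests cannot be starved by younger ones outside their batch),
# then within a batch prefer row-buffer hits, oldest first.
from collections import namedtuple

Req = namedtuple("Req", "thread bank row arrival")

def schedule(requests, batch_size=4):
    order = []
    pending = sorted(requests, key=lambda r: r.arrival)
    open_rows = {}                       # bank -> currently open row
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        while batch:
            # Within the batch: row hits first, then oldest first
            batch.sort(key=lambda r: (open_rows.get(r.bank) != r.row,
                                      r.arrival))
            req = batch.pop(0)
            open_rows[req.bank] = req.row
            order.append(req)
    return order

reqs = [Req("A", 0, 1, 0), Req("B", 0, 2, 1), Req("A", 0, 1, 2)]
order = schedule(reqs, batch_size=3)   # the row hit (arrival 2) jumps ahead
```

Here the third request hits the row opened by the first, so it is serviced before the older row-conflicting request from thread B; the batch boundary is what keeps such reordering from starving anyone indefinitely.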
Different Scheduling Algorithm Comparison
Stall time per thread (in bank access latencies) under the three compared schedulers:

            Scheduler 1   Scheduler 2   Scheduler 3
Thread 1        4             5.5           1
Thread 2        4             3             2
Thread 3        5             4.5           4
Thread 4        7             4.5           5.5
AVG             5             4.375         3.125
Survey: DRAM System in Intel 5000
Fully Buffered DIMMs (FB-DIMMs)
Simultaneous read/write data transfer to different FB-DIMMs on a channel
No turnarounds between back-to-back data transfers to different FB-DIMMs on a channel
Minimal turnarounds and bubbles between back-to-back data transfers to the same FB-DIMM
…
Optimizations in Controller
Speculative memory read
Read cancels are issued to the memory controller if the local-bus snoop or cross-bus snoop results in "dirty" data or a retry
Memory interleaving
Memory controller can be programmed to scatter sequential addresses with fine-grained interleaving across memory branches, ranks, and DRAM banks
Partial writes
Coalescing the partial writes reduces the possibility of conflict serialization in multi-bus systems
Survey: DRAM System in Opteron
AMD Northbridge Architecture
More concurrency: additional open DRAM banks to reduce page conflicts
Longer burst length to improve efficiency
DRAM paging support uses history-based pattern prediction to increase the frequency of page hits and decrease page conflicts
DRAM pre-fetcher tracks positive, negative, and non-unit strides and has a dedicated buffer for pre-fetched data
Write bursting minimizes read and write turnaround
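The stride tracking described above can be sketched as follows. The single-entry detector and confirmation rule are simplifying assumptions, not AMD's implementation:

```python
# Sketch of stride detection for a DRAM prefetcher: track the delta
# between successive miss addresses and prefetch one stride ahead once
# the same delta repeats. Handles positive, negative, and non-unit strides.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None

    def observe(self, addr):
        """Return a prefetch address, or None, for one observed miss."""
        prefetch = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if delta != 0 and delta == self.stride:
                prefetch = addr + delta    # stride confirmed: fetch ahead
            self.stride = delta            # remember the latest delta
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
hints = [pf.observe(a) for a in [100, 164, 228, 292]]
```

For the +64 stream above, the first two misses train the detector and the third confirms the stride, after which every miss yields a prefetch one stride ahead.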
Simulation: DRAMsim
Part of SYsim, a system-level simulator; can also be used as a standalone simulator. Developed by UMD
Detailed timing models
SDRAM, DDR, DDR2, DRDRAM, FB-DIMM etc.
Results/ Analysis
DRAMsim and SimpleScalar Traces
Number of Channels VS Read Latency
[Figure: gcc — number of requests vs read latency (ns, ~185–230) in DDR2, for one, two, and four channels (Ch_1, Ch_2, Ch_4)]
As the number of channels increases, the average latency drops drastically, because channels provide the highest level of parallelism in the system.
Row-Management Policy VS Read Latency
[Figure: swim — number of requests vs read latency (ns, ~185–225) under the Row_Close and Row_Open row-management policies]
The open-row policy favors locality, so swim's better performance under the open-row policy indicates that it has a lot of locality. An application with many random accesses would favor the close-row policy as opposed to the open-row policy.
Address Mapping Policy VS Read Latency
[Figure: vortex — number of accesses vs read latency (ns, ~185–225) for the SDRAM_Close and intel845g address-mapping policies]
intel845g implements some form of address mapping that favors locality. The row-management policy and the address-mapping policy are tightly related and should favor each other.
Txn Queue Scheduling VS Read Latency
[Figure: applu — number of accesses vs read latency (ns, ~185–230) under the FCFS, Most_Pending, and Greedy transaction-queue scheduling policies]
Future Work
Performance comparison of various DRAM memory controllers
Current results belong to DDR2-type memory controllers
Performance comparison: DDR2, DDR3, FB-DIMM etc.
Support for multiple threads
DRAMsim supports single-threaded execution at the moment
Integration of DRAMsim into CMP_network
CMP_network is ACAL's multiprocessor simulator
Currently implements a fixed memory latency of 200 cycles
Conclusion
Memory controller optimization is highly application dependent
Mapping consecutive cache lines to different banks, ranks, and channels scatters requests across different rows and favors parallelism at the expense of spatial locality
The conventional memory controller design for single-core systems is no longer the optimal choice for a shared memory controller
References
"Memory Systems: Cache, DRAM, Disk", Bruce Jacob, Spencer W. Ng, and David T. Wang
"Memory Access Scheduling", S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens
"DRAMsim: A Memory-System Simulator", David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Katie Baynes, Aamer Jaleel, and Bruce Jacob
"Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems", O. Mutlu and T. Moscibroda
"The AMD Opteron Northbridge Architecture", P. Conway and B. Hughes
"The Blackford Northbridge Chipset for the Intel 5000", S. Radhakrishnan, S. Chinthamani, and K. Cheng
Q & A
Backup
Bank Parallelism of a Thread
Thread A issues two DRAM requests after a compute phase:
Thread A: Bank 0, Row 1
Thread A: Bank 1, Row 1
Single-thread timeline: Compute -> 2 DRAM requests (Bank 0 and Bank 1 serviced in parallel) -> Stall -> Compute
Bank access latencies of the two requests are overlapped
Thread stalls for ~ONE bank access latency
Bank Parallelism Interference in DRAM
Request stream arriving at the controller:
Thread A: Bank 0, Row 1
Thread B: Bank 1, Row 99
Thread B: Bank 0, Row 99
Thread A: Bank 1, Row 1
Baseline scheduler: A and B interleave on each bank, so each thread's two requests complete in different service rounds
Bank access latencies of each thread are serialized
Each thread stalls for ~TWO bank access latencies
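The stall-time arithmetic can be checked with a toy timeline model: each access costs one bank-access-latency unit, accesses to different banks overlap, and a thread stalls until its last request finishes. This is an illustrative model of the timelines, not the actual scheduler:

```python
# Toy model: per-bank timelines in units of one bank access latency.
# A thread's stall time is the finish time of its last request.
def stall_time(schedule, thread):
    bank_free = {}                           # bank -> time it next frees up
    finish = 0
    for t, bank in schedule:                 # (thread, bank) in service order
        start = bank_free.get(bank, 0)
        bank_free[bank] = start + 1          # one latency unit per access
        if t == thread:
            finish = bank_free[bank]
    return finish

# Interleaved order: each thread's two banks are used in different rounds
baseline = [("A", 0), ("B", 1), ("B", 0), ("A", 1)]
# Reordered: each thread's requests are kept back-to-back across banks
reordered = [("A", 0), ("A", 1), ("B", 0), ("B", 1)]
```

Under the interleaved order both threads stall for 2 units; keeping each thread's requests together lets A finish in 1 unit while B still takes 2, for an average of 1.5.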
Parallelism-Aware Scheduler
Same request stream as before:
Thread A: Bank 0, Row 1
Thread B: Bank 1, Row 99
Thread B: Bank 0, Row 99
Thread A: Bank 1, Row 1
Baseline scheduler: each thread's bank accesses are serialized; each thread stalls for ~2 bank access latencies
Parallelism-aware scheduler: A's two requests are serviced in parallel first, then B's; A stalls for ~1 latency and B for ~2, saving cycles
Average stall time: ~1.5 bank access latencies