EE5780 Advanced VLSI CAD

Z. Feng MTU EE5780 Advanced VLSI CAD11.1

EE5780 Advanced VLSI CAD

Lecture 11 SRAM and Yield AnalysisZhuo Feng


Outline■ Memory Arrays■ SRAM Architecture

► SRAM Cell► Decoders► Column Circuitry► Multiple Ports

■ Serial Access Memories


Memory Arrays

Random Access Memory Serial Access Memory Content Addressable Memory(CAM)

Read/Write Memory(RAM)

(Volatile)

Read Only Memory(ROM)

(Nonvolatile)

Static RAM(SRAM)

Dynamic RAM(DRAM)

Shift Registers Queues

First InFirst Out(FIFO)

Last InFirst Out(LIFO)

Serial InParallel Out

(SIPO)

Parallel InSerial Out

(PISO)

Mask ROM ProgrammableROM

(PROM)

ErasableProgrammable

ROM(EPROM)

ElectricallyErasable

ProgrammableROM

(EEPROM)

Flash ROM

Memory Arrays


Array Architecture■ 2n words of 2m bits each■ If n >> m, fold by 2k into fewer rows of more columns

■ Good regularity – easy to design■ Very high density if good cells are used

row decoder

columndecoder

n

n-kk

2m bits

columncircuitry

bitline conditioning

memory cells:2n-k rows x2m+k columns

bitlines

wordlines


12T SRAM Cell■ Basic building block: SRAM Cell

► Holds one bit of information, like a latch► Must be read and written

■ 12-transistor (12T) SRAM cell► Use a simple latch connected to bitline► 46 x 75 unit cell

bit

write

write_b

read

read_b


6T SRAM Cell■ Cell size accounts for most of array size

► Reduce cell size at expense of complexity■ 6T SRAM Cell

► Used in most commercial chips► Data stored in cross-coupled inverters

■ Read:► Precharge bit, bit_b► Raise wordline

■ Write:► Drive data onto bit, bit_b► Raise wordline

bit bit_b

word


SRAM Read■ Precharge both bitlines high■ Then turn on wordline■ One of the two bitlines will be pulled down by the cell■ Ex: A = 0, A_b = 1

► bit discharges, bit_b stays high► But A bumps up slightly

■ Read stability► A must not flipbit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word

0.0

0.5

1.0

1.5

0 100 200 300 400 500 600time (ps)

word bit

A

A_b bit_b


SRAM Read■ Precharge both bitlines high■ Then turn on wordline■ One of the two bitlines will be pulled down by the cell■ Ex: A = 0, A_b = 1

► bit discharges, bit_b stays high► But A bumps up slightly

■ Read stability► A must not flipbit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word

0.0

0.5

1.0

1.5

0 100 200 300 400 500 600time (ps)

word bit

A

A_b bit_b

N1 >> N2N3 >> N4


SRAM Write■ Drive one bitline high, the other low■ Then turn on wordline■ Bitlines overpower cell with new value■ Ex: A = 0, A_b = 1, bit = 1, bit_b = 0

► Force A_b low, then A rises high■ Writability

► Must overpower feedback inverter

time (ps)

word

A

A_b

bit_b

0.0

0.5

1.0

1.5

0 100 200 300 400 500 600 700

bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word


SRAM Write■ Drive one bitline high, the other low■ Then turn on wordline■ Bitlines overpower cell with new value■ Ex: A = 0, A_b = 1, bit = 1, bit_b = 0

► Force A_b low, then A rises high■ Writability

► Must overpower feedback inverter

time (ps)

word

A

A_b

bit_b

0.0

0.5

1.0

1.5

0 100 200 300 400 500 600 700

bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word

N2 >> P1N4 >> P2


SRAM Sizing■ High bitlines must not overpower inverters during reads■ But low bitlines must write new value into cell

bit bit_b

m ed

A

weak

strong

m ed

A_b

word


SRAM Column Example

bit_v1f

bit_b_v1f

2

MoreCells

SRAM Cell

word_q1bit_v1f

bit_b_v1f

data_s1

write_q1

Bitline Conditioning

Read Write


SRAM Layout■ Cell size is critical: 26 x 45 (even smaller in industry)■ Tile cells sharing VDD, GND, bitline contacts

VDD

GND GNDBIT BIT_B

WORD

Cell boundary


Thin Cell■ In nanometer CMOS

► Avoid bends in polysilicon and diffusion► Orient all transistors in one direction

■ Lithographically friendly or thin cell layout fixes this► Also reduces length and capacitance of bitlines


Commercial SRAMs■ Five generations of Intel SRAM cell micrographs

► Transition to thin cell at 65 nm► Steady scaling of cell area


Decoders■ n:2n decoder consists of 2n n-input AND gates

► One needed for each row of memory► Build AND from NAND or NOR gates

Static CMOS Pseudo-nMOS

word0

word1

word2

word3

A0A1

A1word

A0 1 1

1/2

2

4

8

16word

A0

A1

11

11

4

8

word0

word1

word2

word3

A0A1


Decoder Layout■ Decoders must be pitch-matched to SRAM cell

► Requires very skinny gates

GND

VDD

word

buffer inverterNAND gate

A0A0A1A2A3 A2A3 A1


Large Decoders■ For n > 4, NAND gates become slow

► Break large gates into multiple smaller gates

word0

word1

word2

word3

word15

A0A1A2A3


Column Circuitry■ Some circuitry is required for each column

► Bitline conditioning► Sense amplifiers► Column multiplexing


Bitline Conditioning■ Precharge bitlines high before reads

■ Equalize bitlines to minimize voltage difference when using sense amplifiers

bit bit_b

bit bit_b


Sense Amplifiers■ Bitlines have many cells attached

► Ex: 32-kbit SRAM has 256 rows x 128 cols► 128 cells on each bitline

■ tpd (C/I) V► Even with shared diffusion contacts, 64C of diffusion

capacitance (big C)► Discharged slowly through small transistors (small I)

■ Sense amplifiers are triggered on small voltage swing (reduce V)


Differential Pair Amp■ Differential pair requires no clock■ But always dissipates static power

bit bit_bsense_b sense

N1 N2

N3

P1 P2


Clocked Sense Amp■ Clocked sense amp saves power■ Requires sense_clk after enough bitline swing■ Isolation transistors cut off large bitline

capacitance

bit_bbit

sense sense_b

sense_clk isolationtransistors

regenerativefeedback


Twisted Bitlines■ Sense amplifiers also amplify noise

► Coupling noise is severe in modern processes► Try to couple equally onto bit and bit_b► Done by twisting bitlines

b0 b0_b b1 b1_b b2 b2_b b3 b3_b


Column Multiplexing■ Recall that array may be folded for good aspect ratio

■ Ex: 2k word x 16 folded into 256 rows x 128 columns► Must select 16 output bits from the 128 columns► Requires 16 8:1 column multiplexers


Tree Decoder Mux■ Column mux can use pass transistors

► Use nMOS only, precharge outputs

■ One design is to use k series transistors for 2k:1 mux► No external decoder logic needed

B0 B1 B2 B3 B4 B5 B6 B7 B0 B1 B2 B3 B4 B5 B6 B7A0

A0

A1

A1

A2

A2

Y Yto sense amps and write circuits


Single Pass-Gate Mux■ Or eliminate series transistors with separate

decoder

A0A1

B0 B1 B2 B3

Y


Dual-Ported SRAM■ Simple dual-ported SRAM

► Two independent single-ended reads► Or one differential write

■ Do two reads and one write by time multiplexing► Read during ph1, write during ph2

bit bit_b

wordBwordA


Large SRAMs■ Large SRAMs are split into subarrays for speed■ Ex: UltraSparc 512KB cache

► 4 128 KB subarrays► Each have 16 8KB banks► 256 rows x 256 cols / bank► 60% subarray area efficiency► Also space for tags & control

[Shin05]


■ Yield analysis of embedded memory modules► Play a critical role in nanoscale and emerging nonvolatile memory

designs for future microprocessor, 3D-IC, and mixed-signal system designs

► Challenged by a huge number (many millions) of Monte Carlo SPICE simulations for relatively small circuit blocks considering parametric(process, voltage, and temperature) variations

3rd Generation Intel Core Processor: 22nm ProcessSource: http://www.futurelooks.com/intel-core-i7-3770k-ivy-bridge-lga1155-processor-review/


TinySPICE Simulator■ TinySPICE: SPICE-accurate, fast, massively

parallel small nonlinear circuits simulator► Accelerates entire SPICE simulator on GPU► Run independent SPICE simulations concurrently► Designs GPU-friendly data structures► Develops efficient algorithm flow► Optimizes GPU memory accesses

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

Shared Memory

Shared Memory

bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word

bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word bit bit_b

N1

N2P1

A

P2

N3

N4

A_b

word

1i 2i

2v1v

3i

kgs

dskm v

ig

kds

dskds v

iG

1,1 1,4 1,5

2,2 2,3 2,6

3,2 3,3 3,8

4,1 4,4

5,1 5,5 5,7

6,2 6,6 6,8

7,5 7,7

8,3 8,6 8,8

a a aa a aa a a

a aa a a

a a aa a

a a a

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

Shared Memory

Shared Memory

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

Shared Memory

Shared Memory SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

SP SP

Shared Memory

Shared Memory

CKT Schematic Model Evaluation Matrix Stamping/Solve

Converged?

Return

Newton Raphson Iterations

GPU Streaming Multiprocessors Massively Parallel SPICE Simulations in GPU Shared Memory

l1

Slide 31

l1 lengfei, 5/13/2013


TinySPICE Flow Chart

TinySPICETinySPICE

Parse NetlistParse Netlist

Setup 3D LUTsNonlinear Device

Setup 3D LUTsNonlinear Device

Excitation & MOS Terminal Map Index

Excitation & MOS Terminal Map Index

Setup Matrix forLinear Circuit

Setup Matrix forLinear Circuit

G Matrix & RHS@Shared Mem.G Matrix & RHS@Shared Mem.

CPU Setup

GPU Setup

G Matrix & RHS@Texture Mem.G Matrix & RHS@Texture Mem.

Map Index@Texture Mem.

Map Index@Texture Mem.

GPU Analysis

Get Latest

Solution

Get Latest

Solution

Reset G Matrix& Update RHSReset G Matrix& Update RHS

Transistor EvaluationTransistor Evaluation

Nonlinear Stamp

Nonlinear Stamp

LU SolveLU Solve

Conv. Chec

k

Conv. Chec

k

ReturnMap Index

@Shared Mem.Map Index

@Shared Mem.GPU LUT

@Texture Mem.GPU LUT

@Texture Mem.

■ TinySPICE algorithm flow have three key steps► CPU-setup phase, GPU-setup phase and GPU-analysis phase


MNA Matrix Construction■ Linear element evaluation on CPU due to

► Corresponding entries are constant value throughout whole SPICE simulation

■ Nonlinear element evaluation on GPU due to► Need to be re-evaluated in every NR step

■ Nonlinear element evaluation by parametric 3D look-up tables (LUTs)► BSIM4 too complicated for GPU► Simplify control flow► Can capture impact of process variations by introduce

parametric LUTs


Parametric 3D LUTs

VthLeffeffth

effLeffthVth

effLthVbase

LUTLV

LLUTVLUT

LLUTVLUTLUTLUTeffth

2

22

2

baseLUT

LeffV LUTandLUTth

22 effLVth LUTandLUT

thV

effL

: Base LUT based on transistor nominal parameters

: First order coefficient LUT for transistor threshold voltage andeffective channel length

: Second order coefficient LUT for transistor threshold voltage and effective channel length

: Variation of transistor threshold voltage

: Variation of transistor effective channel length

VthLeffLUT: coefficient LUT derived from the partial derivatives of Vth and Leff

■ 2nd order Parametric 3D LUTs evaluation function

More input parameters andmore accurate requirement

More costly higher order LUTs


■ Parametric 3D LUTs extraction► High resolution means high

memory and runtime cost► One time cost, multiple usage► 0.6% of whole simulation time

■ Trilinear interpolation

LUTs Trilinear Interpolation(1st Order Demonstration)

zyxczyxczyxczyxc

zyxczyxczyxc

zyxczyxp

111

110

011

101

001

010

100

000

)1()1(

)1()1)(1(

)1()1()1)(1(

)1)(1)(1(,,

effth

effLthVbase

LPVPP

LLUTVLUTLUTLUTeffth

210

0p

Base LUT

1p

Threshold Voltage LUT

2p

Effective Channel Length LUT


GPU Friendly Data Structure Design■ System Matrix

►Adopt dense structure due to▼Complicated memory access required by sparse matrix▼Affordable memory consumption (limited problem size)

►Store in GPU register due to▼Frequent read/write memory access

■ Parametric 3D LUTs►Store in texture memory due to

▼Read only memory access

Gmat… 3D LUTs

MOSFET 1 … n

∆ Gmatvth… ∆ GmatLeff…… … …

MOSFET 1 … n MOSFET 1 … n

TextureMemory


GPU Friendly Data Structure Design (cont.)■ Index-mapping vectors & voltage sources vector

► Record stamping locations of nonlinear device and sources► Record voltage sources value in all time points► Store in Share memory due to

▼ All threads share this information▼ Data size is affordable

V_t1 V_t2 … VS_step

Voltage Source 1 Voltage Source n

... V_tn

PWL Node a … VS_map

Voltage Source 1

Idx Type d g s b … Mos_map

MOSFET 1 … MOSFET n

Voltage Source n

Node b PseudoNode P

SharedMemory


GPU Friendly Data Structure Design (cont.)

■ Solution vector►Record solution of each time point►Need send back to CPU ►Store in Global memory due to

▼Low access frequency

►Satisfy coalesced memory accesses

X1.0 X2.0 … Xn.0 X1.n X2.n … Xn.n…

T1 T2 Tn

X[0] X[n]

…

SolutionGlobalMemory


Newton-Raphson Algorithm on GPU■ Check convergence after several NR iterations

► Reduces the GPU thread divergence► Leads some overhead

Get Latest Solution

Get Latest Solution



Nonlinear StampNonlinear Stamp

LU SolveLU Solve

Return

Conv. CheckConv. Check

Get Latest Solution

Get Latest Solution



Nonlinear StampNonlinear Stamp

LU SolveLU Solve

Conv. CheckConv. Check

Return

Time to CheckTime to Check


Linux computing system: C++ & CUDA– CPU: Core 2 Quad 2.66GHz + 6GB DRAM

– GPU: GPU: NVIDIA GTX 480 + 1.5G Device memory

Tested cases

Experimental Setups

Circuit NL_Num Node_Num Vs_Num Unk_Num

6T-SRAM 6 8 5 12

D-Latch 8 9 5 13

D-Flip-Flop 16 12 5 16

Invertor-Chain

32 20 3 22

4:1 Mux 24 27 9 35NL_Num : Number of nonlinear devices

Node_Num : Number of nodes in the circuit

Vs_Num : Number of independent voltage sources

Unk_Num : number of unknowns of the nonlinear system


Run 1,536,000 simulations with different excitation and circuit design parameters

– Speedup LUTs GPU vs. LUTs CPU is up to 138X

– Speedup LUTs GPU vs. BSIM4 CPU is up to 264X

DC Simulation Result

Ru

ntim

e(lo

g)

264 X138 X

211 X92X

189 X92X

104 X47 X 22 X

45 X


Transient Simulation Result■ Speedup LUTs GPU vs. LUTs CPU is up to 57X■ Speedup LUTs GPU vs. BSIM4 CPU is up to 222X

Ru

ntim

e(lo

g)

187 X46X

222 X57X

202X53X

160 X54 X 12 X

96 X


Memory Consumption vs. Speedup■ Runtime efficiency decrease with the increase of circuit size

► Memory usage increase dramatically due to dense MNA structure

Documents

EE5780 Advanced VLSI CAD