Transient Analysis

CK Cheng

UC San Diego

Jan. 25, 2007


Outline

• Research Directions
• Simulation test case results
• Overview of Simulation
• Commercial Package
• Alternating direction implicit (ADI) Method
• General Operator Splitting Method
• Distributed Computing
• Conclusions and Future Works

Research Directions

• Simulation: SPICE, STA

• Network on Chip: topology and wire styles

• Power and Clock Networks

• Data Path Components: adders, shifters, multipliers, division

• Packaging: passive distortion compensation

6x6 Bump Simulation Results

• The Circuit:
– 184K Capacitors, 17K Current Sources, 120K Inductors and 246K Resistors.
– 306K Nodes
• Accuracy:
– Waveform and measurement results match Fujitsu's, with errors of about 0.002% to 0.005%.
• Runtime / Memory Comparison:

| | CPU Time | Memory | Computer Used |
|---|---|---|---|
| UCSD | 678s | 600.2M | Pentium 4 3.2G, Linux |
| Fujitsu Log File | 1845s | 771M | unknown |

6x6 Bump Simulation Results

• Measurement results and waveform

| | Min_pwr_l_est_10000954 | Min_18269323 | Min_33085875 |
|---|---|---|---|
| UCSD | 0.9980790 | 0.9967357 | 0.9934251 |
| Fujitsu Log File | 0.9980620 | 0.9966940 | 0.9933790 |
| Error | 0.002% | 0.004% | 0.005% |

(Red curve is UCSD result)

703KR Simulation Results

• The Circuit:
– 514K Capacitors, 76K Current Sources, 370K Inductors and 703K Resistors.
– 1.3M Nodes
• Accuracy:
– Measurement results match Fujitsu's, with errors of about 0.015% to 0.026%.
• Runtime / Memory Comparison:

| | CPU Time | Memory | Computer Used |
|---|---|---|---|
| UCSD | 2575s (0.7h) | 1.7G | Pentium 4 3.2G, Linux |
| Fujitsu Log File | 864561s (240h) | 2.28G | unknown |

703KR Simulation Results

• Measurement results and waveform

| | Min_33096003 | Min_33096004 | Min_33097557 |
|---|---|---|---|
| UCSD | 0.9400988 | 0.9421157 | 0.9370827 |
| Fujitsu Log File | 0.9399610 | 0.9419260 | 0.9368400 |
| Error | 0.015% | 0.02% | 0.026% |
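The Error rows in the two measurement tables can be reproduced as relative errors of the UCSD values against the Fujitsu log values. The sketch below is only an illustration; the function and variable names are hypothetical, and the numbers are copied from the tables.

```python
def relative_error_percent(measured, reference):
    """Relative error of `measured` against `reference`, in percent."""
    return abs(measured - reference) / abs(reference) * 100.0


# (UCSD value, Fujitsu log value) pairs copied from the two measurement tables.
RESULTS = {
    "6x6 bump": [(0.9980790, 0.9980620), (0.9967357, 0.9966940), (0.9934251, 0.9933790)],
    "703KR": [(0.9400988, 0.9399610), (0.9421157, 0.9419260), (0.9370827, 0.9368400)],
}


def error_table(results):
    """Map each test case to its list of relative errors in percent."""
    return {
        case: [relative_error_percent(ucsd, fujitsu) for ucsd, fujitsu in pairs]
        for case, pairs in results.items()
    }


if __name__ == "__main__":
    for case, errors in error_table(RESULTS).items():
        print(case, ["%.3f%%" % e for e in errors])
```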

(UCSD results only. Fujitsu waveform is not available for comparison)

Further Speed-ups

• Reduce iteration count by 50% for pure linear circuits (like 6x6 bump and 703KR)
– 2x speed up
• More effective time step control
– DVDT, breakpoint, truncation error. 1.5x - 3x speed up
• Use Multigrid solver
– 1.5x - 2x speed up for medium circuits (6x6 bump)
– 2x - 10x speed up for large circuits (703KR)
• Parallel simulation
– 4 or more processors on Linux cluster
– 32 to hundreds of processors on supercomputer.
• Overall speed-up
– 6x - 60x speed up without parallel simulation
– 12x - 1000x speed up with parallel simulation
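The 6x - 60x figure without parallel simulation is consistent with multiplying the individual ranges for large circuits. The short sketch below checks that arithmetic; it assumes the three improvements are independent and therefore multiply, which the slide does not state explicitly, and the dictionary keys are illustrative labels.

```python
from math import prod

# (low, high) speed-up ranges quoted above; the multigrid range is the
# large-circuit one (703KR). Treating them as independent is an assumption.
SERIAL_FACTORS = {
    "iteration count reduction": (2.0, 2.0),
    "time step control": (1.5, 3.0),
    "multigrid solver (large circuits)": (2.0, 10.0),
}


def combined_speedup(factors):
    """Multiply the low ends and the high ends of all speed-up ranges."""
    low = prod(lo for lo, _ in factors.values())
    high = prod(hi for _, hi in factors.values())
    return low, high


if __name__ == "__main__":
    print(combined_speedup(SERIAL_FACTORS))
```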

Performance and capacity prediction

Cases 10x-100x larger than 703KR.

| Case | Size | Preferred Solver | CPU Time | Memory |
|---|---|---|---|---|
| Small - Medium | 0.3M nodes | LU Decomposition | 11 minutes | 600M |
| Medium - Large | 1.3M nodes | Multigrid | 43 minutes | 1.7G |
| Huge | 10 - 100M nodes | Multigrid + Parallel | 5 - 100 hours | 15G - 200G |

Overview of Simulation

Our research

• Fast speed with SPICE accuracy
• Nonlinear devices
• Efficient matrix solvers
• Effective integration methods
• Time step controls according to different integration methods
• Distributed computing

(Figure: simulation flowchart with the boxes Load Circuit, Device Evaluation, Linearization, Integration Approximation, LU Decomposition, N-R Converge? (Yes/No branches), Time Step Control, Next Time Point.)
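The loop in the flowchart (device evaluation, linearization, integration approximation, linear solve, Newton-Raphson convergence check, next time point) can be illustrated on a very small example. The following is a minimal sketch, not the simulator described here: it assumes a hypothetical single-node circuit (voltage source and resistor driving a capacitor in parallel with a diode), backward Euler with a fixed time step, and a scalar division standing in for LU decomposition. All component values and names are illustrative.

```python
import math


def simulate_rc_diode(v_src=1.0, r=1e3, c=1e-9, i_sat=1e-14, v_t=0.02585,
                      t_stop=1e-5, dt=1e-8, tol=1e-12, max_iter=50):
    """Backward-Euler transient of one node with a Newton-Raphson inner loop."""
    n_steps = int(round(t_stop / dt))
    v = 0.0
    waveform = [(0.0, v)]
    for step in range(1, n_steps + 1):
        v_prev = v
        for _ in range(max_iter):
            # Device evaluation: diode current and its small-signal conductance.
            exp_term = math.exp(v / v_t)
            i_d = i_sat * (exp_term - 1.0)
            g_d = i_sat * exp_term / v_t
            # Integration approximation (backward Euler) gives the KCL residual.
            f = c * (v - v_prev) / dt + (v - v_src) / r + i_d
            # Linearization: Jacobian of the residual with respect to v.
            df = c / dt + 1.0 / r + g_d
            # "Matrix solve" (a 1x1 system here) and Newton update.
            dv = -f / df
            v += dv
            if abs(dv) < tol:
                break
        else:
            raise RuntimeError("Newton-Raphson did not converge")
        waveform.append((step * dt, v))
    return waveform


if __name__ == "__main__":
    t_end, v_end = simulate_rc_diode()[-1]
    print("v(%.1e s) = %.4f V" % (t_end, v_end))
```

A production simulator would replace the fixed `dt` with time step control and the scalar division with a sparse LU or iterative solve.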


• Matrix Solver
– LU Decomposition
– Iterative Approach
• Integration
– Time Step Control
– ADI
• Nonlinear Devices
– Two Stage Newton Raphson
• Distributed Computing
– Commercial Implementation

• Integration
– Time Step Control
– ADI (two-way partitioning)
– Operator Splitting (multi-way)
• Distributed Computing
– MPI
– Partitioning
• Three Ph.D. Students

Commercial Package: Fastrack Design

• Founded in January 2001
• Headquartered in San Jose
• Privately funded, cash-flow positive
• Two Business Units
– Design Services
– Technology Products

Analog Designs

| Design | # Elements | Sim. Len | HSpice | mSPICE | Speedup Factor |
|---|---|---|---|---|---|
| LVDS | 13490 | 20us | 80h | 26h | 3.1X |
| Oscillator | 222 | 1ms | 13,706s | 2,670s | 5.1X |
| Biasing Circuit | 49197 | 200ns | 427s | 82s | 5.2X |
| PLL | 16050 | 40us | 67d | 12d | 5.6X |
| PLL (post-layout) | 300K | 40us | 290d (est) | 16d | 18.1X |

Digital Blocks

| Design Name | Devices: MOS | Devices: R | Devices: C | Runtime: mSPICE | Runtime: Traditional Spice | Speedup Factor |
|---|---|---|---|---|---|---|
| ALU | 10.1k | 12.7k | 7.5k | 6.9m | 7m | 1.0X |
| CONTROL | 69k | 83.7k | 52.5k | 1.5h | 9.5h | 6.3X |
| YN_BLK | 205K | 242.8k | 203.9k | 3.5h | > 2d | > 13.7X |
| THP | 437k | 499.3k | 313.5k | 5.0h | COULD NOT RUN | ∞ |
| VCON | 936k | 753k | 561k | 15.0h | COULD NOT RUN | ∞ |

Memory Blocks

| Design | # Tr | # R | # C | # Vectors / Sim. Length | mSPICE Run Time |
|---|---|---|---|---|---|
| BRAM (pre) | 220K | 0 | 500 | 2 | 2.5 hours |
| SRAM (pre), 8Kx8 SP | 410K | 0 | 0 | 2 | 7 hours |
| eRAM (post), 256x16 | 72K | 28K | 427K | 48ns | 8 hours |
| BRAM (post) | 220K | 1320K | 870K | 2 | 18 hours |

• 100% accurate Spice simulation

mSPICE-Parallel

• Industry’s first practical parallel Spice simulation solution

– Increases capacity further

– Dramatically improves throughput

• Uses Matrix Level Partitioning

– No loss of accuracy

– Client-Server configuration

– Minimal memory requirement for client nodes

Client-Server Configuration

• Server distributes sub-matrices to clients
• Clients communicate partial solutions
• Minimal memory requirements for clients

(Figure: a sparse 0/1 matrix on the server split into sub-matrices that are sent to the clients.)

Experimental Results

| Design | Total Elements | Sim. Length | Runtime: 1-proc | Runtime: 2-proc | Runtime: 4-proc |
|---|---|---|---|---|---|
| ASIC | 1.2M | 8ns | 12.2h | 7.0h (1.7X) | 5.1h (2.4X) |
| 38IO SSO | 1.4M | 30ns | 3.0h | 2.0h (1.5X) | 1.4h (2.2X) |
| Signal-power | 2.1M | 1.2us | 13d | 7d18h (1.7X) | 5d12h (2.4X) |
| 4096x8 RAM (extracted) | 2.3M | 10ns | 32h | 18.5h (1.7X) | 13.4h (2.4X) |
| 120IO SSO | 3.5M | 30ns | 6.2h | 4.1h (1.5X) | 3.1h (2.0X) |
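The speed-up figures in parentheses are the 1-processor runtime divided by the multi-processor runtime. A small sketch of that arithmetic follows, together with the parallel efficiency (speed-up divided by processor count), which is a derived quantity not reported on the slide. Runtimes for two of the designs are copied from the table and converted to hours; the names are illustrative.

```python
# Runtimes in hours per processor count, copied from the table
# (13d = 312h, 7d18h = 186h, 5d12h = 132h).
RUNTIMES_HOURS = {
    "ASIC": {1: 12.2, 2: 7.0, 4: 5.1},
    "Signal-power": {1: 13 * 24.0, 2: 7 * 24.0 + 18, 4: 5 * 24.0 + 12},
}


def speedup_and_efficiency(runtimes):
    """Return {procs: (speedup, efficiency)} relative to the 1-processor run."""
    base = runtimes[1]
    return {p: (base / t, base / t / p) for p, t in runtimes.items()}


if __name__ == "__main__":
    for design, runtimes in RUNTIMES_HOURS.items():
        for procs, (s, e) in sorted(speedup_and_efficiency(runtimes).items()):
            print("%-13s %d-proc: %.1fX speed-up, %.0f%% efficiency"
                  % (design, procs, s, 100.0 * e))
```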

ADI: Previous Works

• 1999, Namiki and Ito

– the alternating direction implicit (ADI) method is used to simulate a 2D TE wave.

• 2001, Zheng et al.

– extended to 3D problems

• 2001 & 2003, Lee and Chen

– ADI is applied to a power grid modeled as transmission lines

In these works the alternation is among different geometric directions, so the simulated geometric structure is constrained.

Alternating Direction Implicit (ADI)

• ADI Integration Method
– Two-way partition of the circuit
– One partition is used for each backward integration
– Unconditionally stable (A-stable: independent of time step size)
– Time step size is chosen according to the local truncation error.

Alternating Direction Implicit (ADI)

• ADI method formulation
• Circuit partition algorithm
• Local truncation error estimation
• Stability discussion
• Experimental results

SPICE Formulation

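• Equations for RLC circuits

$$\begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix} \begin{bmatrix} \dot V(t) \\ \dot I(t) \end{bmatrix} = -\begin{bmatrix} G & E \\ -E^T & R \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + U(t)$$

where C: capacitance matrix, L: inductance matrix, R: resistance matrix, G: conductance matrix, E: incidence matrix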

ADI Formulation

• Transient simulation

– Split the resistor and inductor branches into two parts

• G = G1 + G2

• E = E1 + E2

• R = R1 + R2

– Alternate backward and forward integration on each partition

ADI Formulation (Cont.)

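• Equations of ADI method

$$\begin{bmatrix} \frac{2C}{h} + G_1 & E_1 \\ -E_1^T & \frac{2L}{h} + R_1 \end{bmatrix} \begin{bmatrix} V(t+\frac{h}{2}) \\ I(t+\frac{h}{2}) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h} - G_2 & -E_2 \\ E_2^T & \frac{2L}{h} - R_2 \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + U(t+\tfrac{h}{2})$$

$$\begin{bmatrix} \frac{2C}{h} + G_2 & E_2 \\ -E_2^T & \frac{2L}{h} + R_2 \end{bmatrix} \begin{bmatrix} V(t+h) \\ I(t+h) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h} - G_1 & -E_1 \\ E_1^T & \frac{2L}{h} - R_1 \end{bmatrix} \begin{bmatrix} V(t+\frac{h}{2}) \\ I(t+\frac{h}{2}) \end{bmatrix} + U(t+\tfrac{h}{2})$$

– the size of the left-hand-side matrix remains unchanged

– the number of non-zero elements is decreased

– direct solving methods can be efficient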
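The two ADI half-steps can be sketched in a few lines. The example below is an illustration only: it assumes a hypothetical two-node RC circuit in normalized units with no inductors, so the state is the node-voltage vector, M = C and N = G = G1 + G2. The split (grounded conductances in G1, the coupling conductance in G2) and all helper names are assumptions made for the example, and Cramer's rule stands in for a sparse direct solver.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def matvec(a, x):
    """2x2 matrix times 2-vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]


def combine(m, n, scale, weight):
    """Return scale * m + weight * n for 2x2 matrices."""
    return [[scale * m[i][j] + weight * n[i][j] for j in range(2)] for i in range(2)]


def adi_step(m, n1, n2, x, u, h):
    """One ADI time step of size h for M dX/dt = -(N1 + N2) X + U."""
    # First half step: partition 1 implicit, partition 2 explicit.
    rhs = [r + ui for r, ui in zip(matvec(combine(m, n2, 2.0 / h, -1.0), x), u)]
    x_half = solve2(combine(m, n1, 2.0 / h, 1.0), rhs)
    # Second half step: partition 2 implicit, partition 1 explicit.
    rhs = [r + ui for r, ui in zip(matvec(combine(m, n1, 2.0 / h, -1.0), x_half), u)]
    return solve2(combine(m, n2, 2.0 / h, 1.0), rhs)


# Hypothetical normalized circuit: unit capacitors, constant current into node 1.
M = [[1.0, 0.0], [0.0, 1.0]]
N1 = [[1.0, 0.0], [0.0, 0.5]]      # conductances to ground
N2 = [[1.0, -1.0], [-1.0, 1.0]]    # coupling conductance between the nodes
U = [1.0, 0.0]


def run_adi(steps, h):
    """Integrate from a zero initial state for `steps` ADI steps."""
    x = [0.0, 0.0]
    for _ in range(steps):
        x = adi_step(M, N1, N2, x, U, h)
    return x


if __name__ == "__main__":
    print(run_adi(steps=1000, h=0.1))
```

For a constant input, the DC solution of (N1 + N2) X = U is a fixed point of the two half-steps, so a long run should settle at X = (0.75, 0.5) for this example.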

Experiments of non-zero fill-ins

• A small ASIC Design

Spice matrix: Dimension: 10,286; number of non-zero elements: 46,655; number of non-zero fill-ins: 90,960

• A large I/O Design

Spice matrix: Dimension: 615,436; number of non-zero elements: 2,126,246

| | Sub-matrix1: # non-zero elements | Sub-matrix1: # non-zero fill-ins | Sub-matrix2: # non-zero elements | Sub-matrix2: # non-zero fill-ins | Total # non-zero fill-ins |
|---|---|---|---|---|---|
| Case 1 | 38,572 | 2,618 | 42,020 | 10,040 | 12,658 |
| Case 2 | 1,176,208 | 12,421,534 | 950,038 | 14,772,068 | 27,193,602 |

Local Truncation Error (LTE)

• Time step control using LTE
– In circuit transient analysis, the next time step can be estimated from the local truncation error at the present time point
– LTE is defined as the difference between the calculated solution and the exact solution

$$LTE_n = \varepsilon_n = x_{n+1} - \hat x_{n+1} = f(\Delta t_n)$$

– To ensure consistency, the local truncation error should not exceed the error tolerance, thus the time step can be estimated using

$$LTE_n = \varepsilon_n = x_{n+1} - \hat x_{n+1} = f(\Delta t_n) \le E_{tol}$$
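One common way to turn the tolerance condition into a step-size rule is sketched below. It is not taken from the slides: it assumes the LTE of a method of order p scales as the step size to the power p + 1 (p = 2 for a second-order method), and the safety factor and growth limit are conventional, hypothetical choices.

```python
def next_time_step(dt, lte, tol, order=2, safety=0.9, max_growth=2.0):
    """Suggest the next step size from the LTE measured with step size dt.

    Assumes lte ~ K * dt**(order + 1), so the step that would just meet
    the tolerance is dt * (tol / lte) ** (1 / (order + 1)).
    """
    if lte <= 0.0:
        # No measurable error: grow by the maximum allowed factor.
        return dt * max_growth
    factor = safety * (tol / lte) ** (1.0 / (order + 1))
    return dt * min(factor, max_growth)


if __name__ == "__main__":
    # LTE eight times the tolerance: a second-order method halves the step.
    print(next_time_step(dt=1.0, lte=8e-6, tol=1e-6, safety=1.0))
```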

Local Truncation Error (Cont.)

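• LTE of ADI method (1) Equations

$$\begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix} \begin{bmatrix} \dot V(t) \\ \dot I(t) \end{bmatrix} = -\begin{bmatrix} G & E \\ -E^T & R \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + U(t)$$

let

$$M = \begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix}, \quad N = \begin{bmatrix} G & E \\ -E^T & R \end{bmatrix}, \quad X = \begin{bmatrix} V(t) \\ I(t) \end{bmatrix}$$

then

$$M\dot X = -NX + U$$

$$\dot X = -M^{-1}NX + M^{-1}U = AX + BU$$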


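• LTE of ADI method (2) Estimate exact solution

If we characterize the input as a simple ramp over the interval $(t_n, t_{n+1})$, the exact analytic solution with time step $\Delta t_n$ is:

$$X_{n+1} = e^{A\Delta t_n}\left[X_n + A^{-1}\left(BU_n + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right)\right] - A^{-1}\left[B(U_n + \Delta U_n) + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right]$$

$$= \left(I + A\Delta t_n + \tfrac{1}{2}A^2\Delta t_n^2 + \tfrac{1}{6}A^3\Delta t_n^3\right)X_n + \left(B\Delta t_n + \tfrac{1}{2}AB\Delta t_n^2 + \tfrac{1}{6}A^2B\Delta t_n^3\right)U_n + \left(\tfrac{1}{2}B\Delta t_n + \tfrac{1}{6}AB\Delta t_n^2\right)\Delta U_n + O(\Delta t_n^4)$$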


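• LTE of ADI method (3) Estimate ADI solution

With $N = N_1 + N_2$ and $A_i = -M^{-1}N_i$ (so $A = A_1 + A_2$), the two ADI half-steps are

$$\left(\frac{2}{\Delta t_n}M + N_1\right)X_{n+1/2} = \left(\frac{2}{\Delta t_n}M - N_2\right)X_n + U_{n+1/2}$$

$$\left(\frac{2}{\Delta t_n}M + N_2\right)X_{n+1} = \left(\frac{2}{\Delta t_n}M - N_1\right)X_{n+1/2} + U_{n+1/2}$$

so that

$$\hat X_{n+1} = \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)X_n$$

$$\quad + \left[\left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1} + \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\right]\frac{\Delta t_n}{2}BU_{n+1/2}$$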


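• LTE of ADI method (3) Estimate ADI solution (cont.)

Expanding the ADI solution $\hat X_{n+1}$ in powers of $\Delta t_n$, with $U_{n+1/2} = U_n + \tfrac{1}{2}\Delta U_n$:

$$\hat X_{n+1} = \left(I + A\Delta t_n + \tfrac{1}{2}A^2\Delta t_n^2 + \tfrac{1}{4}A^3\Delta t_n^3 - \tfrac{1}{4}A_1A_2A\Delta t_n^3\right)X_n$$

$$\quad + \left(B\Delta t_n + \tfrac{1}{2}AB\Delta t_n^2 + \tfrac{1}{4}A^2B\Delta t_n^3 - \tfrac{1}{4}A_1A_2B\Delta t_n^3\right)U_n$$

$$\quad + \left(\tfrac{1}{2}B\Delta t_n + \tfrac{1}{4}AB\Delta t_n^2\right)\Delta U_n + O(\Delta t_n^4)$$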


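• LTE of ADI method (4) LTE estimation

$$LTE_n = \varepsilon_n = X_{n+1} - \hat X_{n+1}$$

$$= \left(-\tfrac{1}{12}A^3\Delta t_n^3 + \tfrac{1}{4}A_1A_2A\Delta t_n^3\right)X_n + \left(-\tfrac{1}{12}A^2B\Delta t_n^3 + \tfrac{1}{4}A_1A_2B\Delta t_n^3\right)U_n - \tfrac{1}{12}AB\Delta t_n^2\Delta U_n + O(\Delta t_n^4)$$

$$= -\tfrac{1}{12}\Delta t_n^3\dddot X_n + \tfrac{1}{4}A_1A_2\Delta t_n^3\dot X_n + O(\Delta t_n^4)$$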


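• LTE of ADI method (5) Time step control

$$\left(\frac{2}{\Delta t_n}M + N_1\right)X_{n+1/2} = \left(\frac{2}{\Delta t_n}M - N_2\right)X_n + U_{n+1/2}$$

$$\left(\frac{2}{\Delta t_n}M + N_2\right)X_{n+1} = \left(\frac{2}{\Delta t_n}M - N_1\right)X_{n+1/2} + U_{n+1/2}$$

Rearranging:

$$\frac{2}{\Delta t_n}M(X_{n+1/2} - X_n) = -N_1X_{n+1/2} - N_2X_n + U_{n+1/2}$$

$$\frac{2}{\Delta t_n}M(X_{n+1} - X_{n+1/2}) = -N_2X_{n+1} - N_1X_{n+1/2} + U_{n+1/2}$$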


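• LTE of ADI method (5) Time step control (cont.)

$$X_{n+1} - X_n = \frac{\Delta t_n}{2}(\dot X_{n+1} + \dot X_n) - \tfrac{1}{4}A_1A_2\Delta t_n^2(X_{n+1} - X_n)$$

$$\Delta t_n\dot X_n \approx \frac{\Delta t_n}{2}(\dot X_{n+1} + \dot X_n) - \tfrac{1}{4}A_1A_2\Delta t_n^3\dot X_n$$

$$\tfrac{1}{4}A_1A_2\Delta t_n^3\dot X_n \approx \frac{\Delta t_n}{2}(\dot X_{n+1} - \dot X_n)$$

$$\tfrac{1}{4}A_1A_2\Delta t_{n-1}^3\dot X_{n-1} \approx \frac{\Delta t_{n-1}}{2}(\dot X_n - \dot X_{n-1})$$

$$LTE_n = -\tfrac{1}{12}\Delta t_n^3\dddot X_n + \tfrac{1}{4}A_1A_2\Delta t_n^3\dot X_n \approx \left(-\frac{\dddot X_n}{12} + \frac{\dot X_n - \dot X_{n-1}}{2\Delta t_{n-1}^2}\right)\Delta t_n^3$$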

Stability Discussion

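• Stability is concerned with whether the accumulated error grows or decays as time evolves through a series of time steps.

• For one-step integration approximations, the exact solution is

$$X_{n+1} = -A^{-1}\left[B(U_n + \Delta U_n) + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right] + e^{A\Delta t_n}\left[X_n + A^{-1}\left(BU_n + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right)\right]$$

so the error is accumulated by a factor of $e^{A\Delta t_n}$, or of whatever operator the integration method uses to approximate it.

• If the final steady state error vector is smaller than the initial one, then the integration method is stable.

• In the ADI integration method:

$$e^{A\Delta t_n} \approx \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)$$

– It can be proved to be unconditionally stable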
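The stability claim can be checked numerically on a small case by looking at the spectral radius of the ADI amplification matrix for several step sizes. The sketch below is an illustration rather than a proof: it assumes a hypothetical 2x2 system in which A1 and A2 come from symmetric, passive conductance matrices, and the helper names are made up for the example. Forward Euler is included only for contrast.

```python
import cmath


def mul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inv(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def eye_plus(a, s):
    """Return I + s * a for a 2x2 matrix a."""
    return [[(1.0 if i == j else 0.0) + s * a[i][j] for j in range(2)]
            for i in range(2)]


def spectral_radius(p):
    """Largest eigenvalue magnitude of a 2x2 matrix (via trace and determinant)."""
    tr = p[0][0] + p[1][1]
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))


def adi_amplification(a1, a2, dt):
    """(I - dt/2 A2)^-1 (I + dt/2 A1) (I - dt/2 A1)^-1 (I + dt/2 A2)."""
    h = dt / 2.0
    p = mul(inv(eye_plus(a2, -h)), eye_plus(a1, h))
    p = mul(p, inv(eye_plus(a1, -h)))
    return mul(p, eye_plus(a2, h))


# Hypothetical stable system: A = A1 + A2 with A_i = -G_i (unit capacitances).
A1 = [[-1.0, 0.0], [0.0, -0.5]]
A2 = [[-1.0, 1.0], [1.0, -1.0]]
A = [[A1[i][j] + A2[i][j] for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    for dt in (0.01, 0.1, 1.0, 10.0, 1000.0):
        adi = spectral_radius(adi_amplification(A1, A2, dt))
        fe = spectral_radius(eye_plus(A, dt))  # forward Euler: I + dt * A
        print("dt=%g  ADI radius=%.6f  forward Euler radius=%.3f" % (dt, adi, fe))
```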


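| | Circuit1 | Circuit2 | Circuit3 | 1k-cell |
|---|---|---|---|---|
| #Nodes | 10,000 | 40,000 | 90,000 | 10,200 |
| #Transistors | 0 | 0 | 0 | 6,500 |
| Period | 10ns | 10ns | 10ns | 10ns |
| SPICE3 CPU time (sec) | 77.8 | 485.3 | 3,061.1 | 181.6 |
| SPICE3 #steps | 115 | 115 | 114 | 193 |
| ADI CPU time (sec) | 28.6 | 117.8 | 275.2 | 523.3 |
| ADI #steps | 102 | 102 | 102 | 949 |
| Speedup | 2.7x | 4.1x | 11.1x | - |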

Voltage drop of Circuit3 (power mesh with sinks)

Signal in 1k_cell (ASIC design)

General Operator Splitting

• General operator splitting method
– Multi-way partitions
– Each partition is considered separately in each time step of the simulation
– No geometric constraints
– Local truncation error is used to dynamically control time step size

General Operator Splitting

• Fundamental theory
• Operator splitting formulation
• Local truncation error estimation
• Stability discussion
• Experimental results

Fundamental theory

• In circuit transient simulation, the integration approximation is actually the approximation of the exponential operator

• The exponential operators can be approximated in any order using a general scheme of fractal decomposition

• The decomposition of exponential operators corresponds to the circuit multi-way partition

New integration approximation in transient simulation

Fundamental theory

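• Approximation of exponential operator
– General circuit equation and solution

$$\dot x(t) = Ax(t) + Bu(t)$$

$$x(t + \Delta t) = e^{A\Delta t}x(t) + \int_t^{t+\Delta t} e^{A(t + \Delta t - \tau)}Bu(\tau)\,d\tau$$

– If we characterize the input as a simple ramp over the interval $(t_n, t_{n+1})$, the exact analytic solution with time step $\Delta t_n$ is

$$X_{n+1} = e^{A\Delta t_n}\left[X_n + A^{-1}\left(BU_n + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right)\right] - A^{-1}\left[B(U_n + \Delta U_n) + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right]$$

– Exponential operator approximation

• Forward Euler: $e^{A\Delta t} \approx I + A\Delta t$

• Backward Euler: $e^{A\Delta t} \approx (I - A\Delta t)^{-1}$

• Trapezoidal: $e^{A\Delta t} \approx \left(I - \tfrac{1}{2}A\Delta t\right)^{-1}\left(I + \tfrac{1}{2}A\Delta t\right)$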
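For a scalar test equation, the three approximations of the exponential operator reduce to simple rational functions of z = AΔt, which makes their accuracy and stability easy to compare. The following sketch is illustrative only; the function names are hypothetical.

```python
import math


def forward_euler(z):
    """Forward Euler approximation of exp(z): 1 + z."""
    return 1.0 + z


def backward_euler(z):
    """Backward Euler approximation of exp(z): 1 / (1 - z)."""
    return 1.0 / (1.0 - z)


def trapezoidal(z):
    """Trapezoidal approximation of exp(z): (1 + z/2) / (1 - z/2)."""
    return (1.0 + z / 2.0) / (1.0 - z / 2.0)


if __name__ == "__main__":
    for z in (-0.1, -100.0):  # a small step and a very stiff step
        print("z = %g, exp(z) = %.6g" % (z, math.exp(z)))
        for method in (forward_euler, backward_euler, trapezoidal):
            approx = method(z)
            print("  %-15s %.6g (error %.3g)"
                  % (method.__name__, approx, abs(approx - math.exp(z))))
```

For small |z| the trapezoidal error behaves like |z|^3 / 12, which is the scalar counterpart of the 1/12 term in the LTE formulas. For large negative z only forward Euler has magnitude above one.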

Fundamental theory

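• Decomposition of exponential operators (Masuo Suzuki, 1991, Physics)

– Function:

$$F(x) = e^{x(A+B)}$$

– First order:

$$f_1(x) = e^{xA}e^{xB}$$

– Second order:

$$f_2(x) = e^{\frac{1}{2}xA}e^{xB}e^{\frac{1}{2}xA}$$

– Third order:

$$f_3(x) = e^{\frac{s}{2}xA}e^{sxB}e^{\frac{1-s}{2}xA}e^{(1-2s)xB}e^{\frac{1-s}{2}xA}e^{sxB}e^{\frac{s}{2}xA}, \quad s = \frac{1}{2 - \sqrt[3]{2}}$$

– (2m-1)th and (2m)th order:

$$f_{2m-1}(x) = f_{2m}(x) = f_{2m-3}(k_mx)\,f_{2m-3}((1 - 2k_m)x)\,f_{2m-3}(k_mx), \quad k_m = \frac{1}{2 - 2^{1/(2m-1)}}$$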

Fundamental theory

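• Decomposition of exponential operators

$$F(x) = e^{x(A+B)} = I + (A + B)x + \tfrac{1}{2}(A + B)^2x^2 + O(x^3)$$

$$f_2(x) = e^{\frac{1}{2}xA}e^{xB}e^{\frac{1}{2}xA}$$

$$= \left[I + \tfrac{1}{2}Ax + \tfrac{1}{8}A^2x^2 + O(x^3)\right]\left[I + Bx + \tfrac{1}{2}B^2x^2 + O(x^3)\right]\left[I + \tfrac{1}{2}Ax + \tfrac{1}{8}A^2x^2 + O(x^3)\right]$$

$$= I + \left(\tfrac{1}{2}A + B + \tfrac{1}{2}A\right)x + \left(\tfrac{1}{8}A^2 + \tfrac{1}{2}BA + \tfrac{1}{2}B^2 + \tfrac{1}{8}A^2 + \tfrac{1}{2}AB + \tfrac{1}{4}A^2\right)x^2 + O(x^3)$$

$$= I + (A + B)x + \left(\tfrac{1}{2}A^2 + \tfrac{1}{2}B^2 + \tfrac{1}{2}BA + \tfrac{1}{2}AB\right)x^2 + O(x^3)$$

$$= I + (A + B)x + \tfrac{1}{2}(A + B)^2x^2 + O(x^3)$$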
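The orders of the first- and second-order decompositions can be observed numerically: halving x should reduce the error of f1 by about 4 (error proportional to x^2) and the error of f2 by about 8 (error proportional to x^3). The sketch below assumes two arbitrary non-commuting 2x2 matrices and a plain Taylor-series matrix exponential, which is adequate only for the small-norm matrices used here. All names are illustrative.

```python
def mul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def add(a, b):
    """Sum of two 2x2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def scale(a, s):
    """Scalar multiple of a 2x2 matrix."""
    return [[s * a[i][j] for j in range(2)] for i in range(2)]


def expm(a, terms=30):
    """Matrix exponential of a small-norm 2x2 matrix via its Taylor series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        term = scale(mul(term, a), 1.0 / k)
        result = add(result, term)
    return result


def max_abs_diff(a, b):
    """Largest absolute entry of a - b."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))


def splitting_errors(a, b, x):
    """Errors of f1(x) = e^{xA} e^{xB} and f2(x) = e^{xA/2} e^{xB} e^{xA/2}."""
    exact = expm(scale(add(a, b), x))
    f1 = mul(expm(scale(a, x)), expm(scale(b, x)))
    half = expm(scale(a, x / 2.0))
    f2 = mul(mul(half, expm(scale(b, x))), half)
    return max_abs_diff(f1, exact), max_abs_diff(f2, exact)


# Two arbitrary matrices that do not commute (illustrative values).
A_MAT = [[-1.0, 0.0], [0.0, -0.5]]
B_MAT = [[-1.0, 1.0], [1.0, -1.0]]

if __name__ == "__main__":
    e1, e2 = splitting_errors(A_MAT, B_MAT, 0.02)
    e1_half, e2_half = splitting_errors(A_MAT, B_MAT, 0.01)
    print("f1 error ratio when x is halved:", e1 / e1_half)
    print("f2 error ratio when x is halved:", e2 / e2_half)
```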

General Operator Splitting Formulation

• Transient simulation:
– Apply the second order approximation

$$e^{x(A_1 + A_2)} \approx e^{\frac{1}{2}xA_1}e^{xA_2}e^{\frac{1}{2}xA_1}$$

– In each time step, every partition is calculated separately and trapezoidal integration is used for every partition

– The size of the left-hand-side matrix may be changed

– The number of non-zero elements is definitely decreased

– Can be easily extended to multi-way partitions

$$e^{xA} = e^{x(A_1 + A_2 + \dots + A_q)} \approx e^{\frac{1}{2}xA_1}e^{\frac{1}{2}xA_2}\cdots e^{\frac{1}{2}xA_{q-1}}e^{xA_q}e^{\frac{1}{2}xA_{q-1}}\cdots e^{\frac{1}{2}xA_2}e^{\frac{1}{2}xA_1}$$

General Operator Splitting Formulation

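• Equations

Each time step of size h consists of three trapezoidal sub-steps, with intermediate states written here as $(V^{(1)}, I^{(1)})$ and $(V^{(2)}, I^{(2)})$:

$$\begin{bmatrix} \frac{2C}{h} + \frac{G_1}{2} & \frac{E_1}{2} \\ -\frac{E_1^T}{2} & \frac{2L}{h} + \frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V^{(1)} \\ I^{(1)} \end{bmatrix} = \begin{bmatrix} \frac{2C}{h} - \frac{G_1}{2} & -\frac{E_1}{2} \\ \frac{E_1^T}{2} & \frac{2L}{h} - \frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + \tfrac{1}{2}U(t + \tfrac{h}{2})$$

$$\begin{bmatrix} \frac{C}{h} + \frac{G_2}{2} & \frac{E_2}{2} \\ -\frac{E_2^T}{2} & \frac{L}{h} + \frac{R_2}{2} \end{bmatrix} \begin{bmatrix} V^{(2)} \\ I^{(2)} \end{bmatrix} = \begin{bmatrix} \frac{C}{h} - \frac{G_2}{2} & -\frac{E_2}{2} \\ \frac{E_2^T}{2} & \frac{L}{h} - \frac{R_2}{2} \end{bmatrix} \begin{bmatrix} V^{(1)} \\ I^{(1)} \end{bmatrix} + \tfrac{1}{2}U(t + \tfrac{h}{2})$$

$$\begin{bmatrix} \frac{2C}{h} + \frac{G_1}{2} & \frac{E_1}{2} \\ -\frac{E_1^T}{2} & \frac{2L}{h} + \frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V(t+h) \\ I(t+h) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h} - \frac{G_1}{2} & -\frac{E_1}{2} \\ \frac{E_1^T}{2} & \frac{2L}{h} - \frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V^{(2)} \\ I^{(2)} \end{bmatrix} + \tfrac{1}{2}U(t + \tfrac{h}{2})$$

These sub-steps correspond to

$$e^{h(A_1 + A_2)} \approx e^{\frac{1}{2}hA_1}e^{hA_2}e^{\frac{1}{2}hA_1}$$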
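The three sub-steps of one operator-splitting time step can be sketched in the compact M, N notation (M dX/dt = -(N1 + N2) X + U). The example below is an illustration only: it assumes a hypothetical two-node RC circuit in normalized units, a constant input, and Cramer's rule in place of a sparse solver, and it compares the result with plain trapezoidal integration of the unsplit system. All names and values are assumptions made for the example.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def matvec(a, x):
    """2x2 matrix times 2-vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]


def combine(m, n, scale, weight):
    """Return scale * m + weight * n for 2x2 matrices."""
    return [[scale * m[i][j] + weight * n[i][j] for j in range(2)] for i in range(2)]


def trapezoidal_substep(m, n, x, u, m_scale, u_weight):
    """Solve (m_scale*M + N/2) x_out = (m_scale*M - N/2) x + u_weight*U."""
    rhs = [r + u_weight * ui
           for r, ui in zip(matvec(combine(m, n, m_scale, -0.5), x), u)]
    return solve2(combine(m, n, m_scale, 0.5), rhs)


def gos_step(m, n1, n2, x, u, h):
    """One second-order splitting step: half step on N1, full on N2, half on N1."""
    x1 = trapezoidal_substep(m, n1, x, u, 2.0 / h, 0.5)
    x2 = trapezoidal_substep(m, n2, x1, u, 1.0 / h, 0.5)
    return trapezoidal_substep(m, n1, x2, u, 2.0 / h, 0.5)


def trapezoidal_step(m, n, x, u, h):
    """Reference: one trapezoidal step on the unsplit system."""
    return trapezoidal_substep(m, n, x, u, 1.0 / h, 1.0)


# Hypothetical normalized circuit: unit capacitors, constant current into node 1.
M = [[1.0, 0.0], [0.0, 1.0]]
N1 = [[1.0, 0.0], [0.0, 0.5]]
N2 = [[1.0, -1.0], [-1.0, 1.0]]
N = [[N1[i][j] + N2[i][j] for j in range(2)] for i in range(2)]
U = [1.0, 0.0]


def compare(steps=100, h=0.01):
    """Integrate with both methods and return the two final states."""
    x_gos, x_trap = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        x_gos = gos_step(M, N1, N2, x_gos, U, h)
        x_trap = trapezoidal_step(M, N, x_trap, U, h)
    return x_gos, x_trap


if __name__ == "__main__":
    print(compare())
```

Unlike the ADI half-steps, this symmetric splitting does not reproduce the DC solution exactly for a finite step, so the comparison is made against a same-step trapezoidal run rather than against the steady state.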


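• LTE of general operator splitting method: Estimate solution

With intermediate states $X_n^{(1)}$ and $X_n^{(2)}$:

$$\left(\frac{2}{\Delta t_n}M + \frac{N_1}{2}\right)X_n^{(1)} = \left(\frac{2}{\Delta t_n}M - \frac{N_1}{2}\right)X_n + \tfrac{1}{2}U_{n+1/2}$$

$$\left(\frac{1}{\Delta t_n}M + \frac{N_2}{2}\right)X_n^{(2)} = \left(\frac{1}{\Delta t_n}M - \frac{N_2}{2}\right)X_n^{(1)} + \tfrac{1}{2}U_{n+1/2}$$

$$\left(\frac{2}{\Delta t_n}M + \frac{N_1}{2}\right)X_{n+1} = \left(\frac{2}{\Delta t_n}M - \frac{N_1}{2}\right)X_n^{(2)} + \tfrac{1}{2}U_{n+1/2}$$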


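• LTE of general operator splitting method: Estimate solution (cont.)

Write $S_1 = \left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}$, $T_1 = S_1\left(I + \tfrac{\Delta t_n}{4}A_1\right)$, $S_2 = \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}$, $T_2 = S_2\left(I + \tfrac{\Delta t_n}{2}A_2\right)$. Then

$$\hat X_{n+1} = T_1T_2T_1X_n + \left[\tfrac{1}{4}T_1T_2S_1 + \tfrac{1}{2}T_1S_2 + \tfrac{1}{4}S_1\right]\Delta t_nBU_{n+1/2}$$

$$= \left(I + A\Delta t_n + \tfrac{1}{2}A^2\Delta t_n^2 + \tfrac{1}{4}A^3\Delta t_n^3 - \left(\tfrac{1}{16}A_1^3 + \tfrac{1}{8}A_1^2A_2 + \tfrac{1}{8}A_2A_1^2 + \tfrac{1}{4}A_2A_1A_2\right)\Delta t_n^3\right)X_n$$

$$\quad + \left(B\Delta t_n + \tfrac{1}{2}AB\Delta t_n^2 + \tfrac{1}{4}A^2B\Delta t_n^3 - \left(\tfrac{3}{32}A_1^2 + \tfrac{3}{16}A_2A_1\right)B\Delta t_n^3\right)U_n$$

$$\quad + \left(\tfrac{1}{2}B\Delta t_n + \tfrac{1}{4}AB\Delta t_n^2\right)\Delta U_n + O(\Delta t_n^4)$$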


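• LTE of general operator splitting method: LTE estimation

$$LTE_n = \varepsilon_n = X_{n+1} - \hat X_{n+1}$$

$$= -\tfrac{1}{12}\Delta t_n^3\dddot X_n + \left(\tfrac{1}{16}A_1^3 + \tfrac{1}{8}A_1^2A_2 + \tfrac{1}{8}A_2A_1^2 + \tfrac{1}{4}A_2A_1A_2\right)\Delta t_n^3X_n + \left(\tfrac{3}{32}A_1^2 + \tfrac{3}{16}A_2A_1\right)B\Delta t_n^3U_n + O(\Delta t_n^4)$$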



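• LTE of general operator splitting method: Time step control

$$X_n^{(1)} - X_n = \tfrac{1}{4}A_1\Delta t_n\left(X_n^{(1)} + X_n\right) + \tfrac{1}{4}B\Delta t_nU_{n+1/2}$$

$$X_n^{(2)} - X_n^{(1)} = \tfrac{1}{2}A_2\Delta t_n\left(X_n^{(2)} + X_n^{(1)}\right) + \tfrac{1}{2}B\Delta t_nU_{n+1/2}$$

$$X_{n+1} - X_n^{(2)} = \tfrac{1}{4}A_1\Delta t_n\left(X_{n+1} + X_n^{(2)}\right) + \tfrac{1}{4}B\Delta t_nU_{n+1/2}$$

$$X_{n+1} - X_n \approx \frac{\Delta t_n}{2}(\dot X_{n+1} + \dot X_n) - \left(\tfrac{1}{16}A_1^3 + \tfrac{1}{8}A_1^2A_2 + \tfrac{1}{8}A_2A_1^2 + \tfrac{1}{4}A_2A_1A_2\right)\Delta t_n^3X_n - \left(\tfrac{3}{32}A_1^2 + \tfrac{3}{16}A_2A_1\right)B\Delta t_n^3U_n$$

$$LTE_n \approx \left(-\frac{\dddot X_n}{12} + \frac{\dot X_n - \dot X_{n-1}}{2\Delta t_{n-1}^2}\right)\Delta t_n^3$$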

Stability Discussion

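• The trapezoidal integration method is unconditionally stable for a stable system:

$$e^{A\Delta t_n} \approx \left(I - \tfrac{\Delta t_n}{2}A\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A\right)$$

• In our operator splitting method, the trapezoidal method is used for all the sub-systems, so the method is still unconditionally stable:

$$e^{x(A_1 + A_2)} \approx e^{\frac{1}{2}xA_1}e^{xA_2}e^{\frac{1}{2}xA_1}$$

$$e^{A\Delta t_n} \approx \left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{4}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)\left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{4}A_1\right)$$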


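| | Circuit1 | Circuit2 | Circuit3 |
|---|---|---|---|
| #Nodes | 10,000 | 40,000 | 90,000 |
| #Transistors | 0 | 0 | 0 |
| Period | 10ns | 10ns | 10ns |
| SPICE3 CPU time (sec) | 77.8 | 485.3 | 3,061.1 |
| SPICE3 #steps | 115 | 115 | 114 |
| GOS CPU time (sec) | 164.7 | 1011.6 | 3435.9 |
| GOS #steps | 102 | 102 | 102 |
| Comparison (GOS / SPICE3 CPU time) | 2.1x | 2x | 1.1x |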

Voltage drop of Circuit3 (power mesh with sinks)

Conclusions

• We investigate alternating direction implicit and general operator splitting integration methods for transistor-level circuit transient simulation.

• In both methods, the circuit will be divided into several sub-circuits, thus the direct matrix solver is still efficient because the matrix is simplified.

• Both methods are second-order accurate and unconditionally stable.

• Overhead:
– Circuit partition
– Each time step consists of many sub-steps, and each sub-step is an N-R iteration process

• Better for circuits with large linear networks

Distributed Computing

• Distributed Processors
– Cluster
– Supercomputer
– Multi-Core Processors (Intel Dual/Quad-Core, IBM Cell etc.)
• Standard
– MPI
– Partitioning
– Matrix Solver
• Capabilities
– Speed-up (10-100+)
– Memory Capacity (10-100+)

Future Works

• ADI method
– More experiments
• General operator splitting method
– Design and implement multi-way circuit partition algorithm
– Implement multi-way general operator splitting program
– Derive LTE for general multi-way situation
– More experiments
• Distributed Computing
– MPI Standard
– Distributed Partitioning, Matrix Solver
