What Does It Take to Accelerate SPICE on the GPU?
M. Naumov, F. Lannutti, S. Chetlur, L.S. Chien and P. Vandermersch
What is SPICE?
Simulation Program with Integrated Circuit Emphasis
— First version was developed by Laurence Nagel in 1973
— http://en.wikipedia.org/wiki/SPICE
Many variants exist, including (but not limited to):
— Academic:
ngspice, spice3 (UC Berkeley), XSPICE (Georgia Tech)
— Industrial:
HSPICE (Synopsys), PSpice (Cadence), Eldo (Mentor), EEsof (Agilent)
What does SPICE do?
Circuit (diagram): [figure: voltage source Vs (carrying current Ixs) driving resistors R1-R4 across nodes 1, 2 and 3]
Netlist (text file):
nodes i j
R1 1 2 1k
R2 2 0 1k
R3 2 3 0.4k
R4 3 0 0.1k
V1 1 0 PWL (0 0 1n 0 1.1n 5 2n 5)
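A netlist like the one above is plain text, so parsing it is straightforward. The sketch below is illustrative only (the function names `parse_value` and `parse_netlist` and the unit table are assumptions, not NGSPICE's actual parser) and handles just the resistor lines for brevity:

```python
# Minimal netlist-parsing sketch; names and unit handling are illustrative.
UNITS = {"k": 1e3, "m": 1e-3, "u": 1e-6, "n": 1e-9}

def parse_value(tok):
    """Convert a SPICE-style number like '0.4k' to a float."""
    if tok[-1].lower() in UNITS:
        return float(tok[:-1]) * UNITS[tok[-1].lower()]
    return float(tok)

def parse_netlist(text):
    """Return (name, node_i, node_j, value) tuples for resistor lines only."""
    devices = []
    for line in text.strip().splitlines():
        parts = line.split()
        if parts[0][0].upper() == "R":   # skip sources etc. for brevity
            devices.append((parts[0], int(parts[1]), int(parts[2]),
                            parse_value(parts[3])))
    return devices

netlist = """R1 1 2 1k
R2 2 0 1k
R3 2 3 0.4k
R4 3 0 0.1k
V1 1 0 PWL (0 0 1n 0 1.1n 5 2n 5)"""
devices = parse_netlist(netlist)
```

A real SPICE front end additionally parses sources, transistors, models and analysis directives, and builds the internal data structures used during simulation.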
What does SPICE do?
Physics (Kirchhoff's laws + Ohm's law + ...):

Resistor stamp:
         |   Vi    Vj   Ixs | RHS
  row i  |  1/R  -1/R       |
  row j  | -1/R   1/R       |
  row xs |                  |

Voltage source stamp:
         |   Vi    Vj   Ixs | RHS
  row i  |                1 |
  row j  |               -1 |
  row xs |    1    -1       | Vs
Linear system (sparse):

[  1/R1            -1/R1              0       1 ] [ V1  ]   [ 0  ]   (node 1)
[ -1/R1   (1/R1+1/R2+1/R3)        -1/R3       0 ] [ V2  ] = [ 0  ]   (node 2)
[    0             -1/R3      (1/R3+1/R4)     0 ] [ V3  ]   [ 0  ]   (node 3)
[    1                0               0       0 ] [ Ixs ]   [ Vs ]   (source)
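To make the stamping concrete, here is a minimal Python sketch (not from the original deck) that assembles and solves the example system for R1=R2=1k, R3=0.4k, R4=0.1k, Vs=5V. The helper names `stamp_resistor`, `stamp_vsource` and `solve` are illustrative, and the dense Gaussian elimination stands in for a real sparse solver:

```python
# Unknowns: [V1, V2, V3, Ixs]; ground (node 0) rows/columns are dropped.

def stamp_resistor(A, i, j, R):
    """Add a resistor stamp between nodes i and j (0 = ground)."""
    g = 1.0 / R
    for a, b, val in ((i, i, g), (i, j, -g), (j, i, -g), (j, j, g)):
        if a > 0 and b > 0:
            A[a-1][b-1] += val

def stamp_vsource(A, f, i, j, xs, Vs):
    """Add a voltage-source stamp: extra row/column for its current Ixs."""
    if i > 0:
        A[i-1][xs] += 1.0
        A[xs][i-1] += 1.0
    if j > 0:
        A[j-1][xs] -= 1.0
        A[xs][j-1] -= 1.0
    f[xs] = Vs

def solve(A, f):
    """Dense Gaussian elimination with partial pivoting (illustrative only)."""
    n = len(f)
    M = [row[:] + [f[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c+1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n+1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n-1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k]*x[k] for k in range(r+1, n))) / M[r][r]
    return x

n = 4                                  # 3 node voltages + 1 source current
A = [[0.0]*n for _ in range(n)]
f = [0.0]*n
stamp_resistor(A, 1, 2, 1e3)
stamp_resistor(A, 2, 0, 1e3)
stamp_resistor(A, 2, 3, 0.4e3)
stamp_resistor(A, 3, 0, 0.1e3)
stamp_vsource(A, f, 1, 0, 3, 5.0)
V1, V2, V3, Ixs = solve(A, f)          # V1 = 5.0, V2 = 1.25, V3 = 0.25
```

The resistive divider gives V2 = 1.25V and V3 = 0.25V, which matches the stamped system above.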
SPICE Details
Input
— Parse netlist and set up internal data structures
DC Analysis (Newton-Raphson)
— Device model evaluation
— Linear system solution
Transient Analysis (Newton-Raphson for each time step)
— Device model evaluation
— Linear system solution
— Truncation error + time step correction
Device model evaluation takes between 30% and 60% of the simulation time.
Linear system solution takes between 30% and 60% of the simulation time.
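The Newton-Raphson step referenced above linearizes each nonlinear device about the current operating point and iterates until the node voltages converge. A minimal one-unknown sketch (illustrative only; the circuit, the Shockley diode model and all parameter values are assumptions for the example): a source Vs feeds a diode through a resistor R, and we solve f(V) = (Vs - V)/R - Is*(exp(V/Vt) - 1) = 0 for the diode voltage V.

```python
import math

# Toy Newton-Raphson iteration, standing in for the per-step linearization
# SPICE performs on every nonlinear device.
Vs, R, Is, Vt = 5.0, 1e3, 1e-12, 0.025

def f(V):
    """Residual: resistor current minus diode current."""
    return (Vs - V) / R - Is * (math.exp(V / Vt) - 1.0)

def dfdV(V):
    """Derivative of the residual (the Jacobian in the scalar case)."""
    return -1.0 / R - (Is / Vt) * math.exp(V / Vt)

V = 0.6                       # initial guess
for _ in range(50):
    dV = -f(V) / dfdV(V)      # Newton update
    V += dV
    if abs(dV) < 1e-12:       # converged
        break
```

In a full simulator the scalar derivative becomes the sparse Jacobian assembled from the device stamps, and each Newton iteration triggers one linear system solution.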
Device Model Evaluation (focus of this presentation)
Basic models
— Resistor, Capacitor, Inductor, Voltage and Current Source
Transistor models
— MOSFET transistor (BSIM4v7, PSP, etc.)
— Bipolar transistor (Ebers–Moll, Gummel-Poon, etc.)
Other models
— Diodes, etc.
Device Model Evaluation
Key Idea (Transistor - BSIM4v7)
— Many branches depend only on fixed parameters
Temperature
Operation regime
— Reorganize the code (slightly)
Minimize thread divergence
Maximize memory coalescing
[figure: BSIM4v7 instances T1, T2, ..., Tn (10K-100K instances) evaluated through chains of if()/else branches]
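One way to realize this reorganization is to bucket the device instances by the parameters that decide their branches, so every batch (warp, on the GPU) evaluates a single branch with no divergence and reads contiguous data. The deck's actual implementation is CUDA; the sketch below is an illustrative host-side version, and the `regime` criterion and all names are assumptions:

```python
from collections import defaultdict

def regime(vgs, vth=0.7):
    """Toy branch key: cutoff vs. conduction (a stand-in for BSIM4 regimes)."""
    return "cutoff" if vgs < vth else "on"

def batch_by_branch(instances):
    """Group instance indices by their (temperature, regime) branch key, so
    each group can be evaluated with one branch-free, coalesced kernel."""
    buckets = defaultdict(list)
    for idx, (temp, vgs) in enumerate(instances):
        buckets[(temp, regime(vgs))].append(idx)
    return dict(buckets)

# (temperature, Vgs) pairs for five hypothetical transistor instances
instances = [(300, 0.2), (300, 1.1), (350, 1.3), (300, 0.3), (350, 1.2)]
batches = batch_by_branch(instances)
```

On the GPU, each bucket would then map to contiguous arrays of parameters processed by threads that all take the same path through the model code.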
Basic Device Model Evaluation
[chart: GPU speedup (y-axis, 0-40x) vs. number of model instances (8192 to 131072) for the Resistor, Capacitor and Inductor netlists]
*NGSPICE
*NVIDIA C2070, ECC on
*Intel X5690 (Nehalem, 6-core) @ 3.47GHz
1) Resistor Netlist: all resistors
2) Capacitor Netlist: half capacitors and half resistors
3) Inductor Netlist: half resistors, quarter capacitors and quarter inductors
Performance may vary based on OS version and motherboard configuration
Transistor (BSIM4v7) Device Model Evaluation
[chart: evaluation time (ms) on the ISCAS85 Benchmark Suite, CPU (1 core) vs. GPU; GPU speedup 6.67x]
*NGSPICE
*NVIDIA C2070, ECC on
*Intel X5690 (Nehalem, 6-core) @ 3.47GHz
Performance may vary based on OS version and motherboard configuration
Solution of Linear Systems (focus of this presentation)
Solve a set of (sparse) linear systems
Ai xi = fi for i=1,...,k
where the coefficient matrices Ai have the same sparsity pattern
Matrix properties
— Nonsymmetric
— Ill-conditioned
Different methods
— Direct methods (LU factorization + triangular solve)
— Iterative methods (GMRES, BiCGStab, etc.)
Sparse Direct Methods
Original linear system
A x = f
Reordering (to minimize fill-in)
(A Q) (QT x) = f where QT Q = Q QT = I
Pivoting
(PT A Q) (QT x) = PT f where PT P = P PT = I
LU factorization
PT A Q = L U
Forward and backward (triangular) solve
L (U y) = b where y = QT x and b = PT f
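The solve chain above can be sketched end to end on a tiny dense example (illustrative only; real SPICE matrices are sparse, and the function names are assumptions). The permutations P and Q are represented as index vectors, so B[i][j] = A[p[i]][q[j]] plays the role of PT A Q:

```python
def lu(B):
    """Doolittle LU without pivoting (P is assumed to have handled pivoting)."""
    n = len(B)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in B]
    for c in range(n):
        for r in range(c+1, n):
            L[r][c] = U[r][c] / U[c][c]
            for k in range(c, n):
                U[r][k] -= L[r][c] * U[c][k]
    return L, U

def solve_permuted(A, f, p, q):
    """Solve A x = f via PT A Q = L U, then L (U y) = PT f and x = Q y."""
    n = len(f)
    B = [[A[p[i]][q[j]] for j in range(n)] for i in range(n)]
    L, U = lu(B)
    b = [f[p[i]] for i in range(n)]                 # b = PT f
    y = [0.0] * n
    for r in range(n):                              # forward solve with L
        y[r] = b[r] - sum(L[r][k]*y[k] for k in range(r))
    for r in range(n-1, -1, -1):                    # backward solve with U
        y[r] = (y[r] - sum(U[r][k]*y[k] for k in range(r+1, n))) / U[r][r]
    x = [0.0] * n
    for j in range(n):                              # undo ordering: x = Q y
        x[q[j]] = y[j]
    return x

A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
f = [3.0, 5.0, 3.0]
x = solve_permuted(A, f, p=[2, 0, 1], q=[1, 2, 0])  # exact solution: [1, 1, 1]
```

In a production solver, Q comes from a fill-reducing ordering (e.g. AMD or COLAMD) and P from numerical pivoting during factorization.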
Sparse Direct Methods (focus of this presentation)
Recall
— We are solving a set of linear systems
— The coefficient matrices have the same sparsity pattern
Assume
— The reordering (to minimize fill-in) is the same for all systems
— The pivoting is also the same for all systems
LU factorization (i=1)
PT A1 Q = L1 U1
LU re-factorization (i=2,...,k)
— The sparsity pattern (and hence the required memory) of L and U is known ahead of time
GLU: LU re-factorization on the GPU
Key Idea
— LU factorization: A = L U
— Incomplete-LU factorization: M = L(zeroed) + U(zeroed) + A
— The two are equivalent when the incomplete factorization is performed on the full sparsity pattern of L and U
— Many parallel (incomplete-LU) techniques are applicable
Solving a set of systems Ai xi = fi (i=1,...,k):
— A1 = L1 U1 (i=1)
— Mi = L1(zeroed) + U1(zeroed) + Ai (i=2,...,k)
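The payoff of a fixed sparsity pattern is that the symbolic analysis (and memory allocation) happens once, and every subsequent system only needs a numeric re-factorization on the known pattern. A small Python sketch of that split (illustrative only, not GLU's CUDA code; the names `symbolic` and `refactor` are assumptions, matrices are stored as dicts keyed by (row, col), and pivoting is omitted):

```python
def symbolic(pattern, n):
    """One-time symbolic analysis: compute the fill-in pattern of L+U."""
    pat = set(pattern)
    for c in range(n):
        rows = [r for r in range(c+1, n) if (r, c) in pat]
        cols = [k for k in range(c+1, n) if (c, k) in pat]
        for r in rows:
            for k in cols:
                pat.add((r, k))          # fill-in created by elimination
    return pat

def refactor(A, pat, n):
    """Numeric LU (no pivoting) restricted to the precomputed pattern.
    Returns L and U combined in one dict, as in-place elimination would."""
    LU = {pos: A.get(pos, 0.0) for pos in pat}
    for c in range(n):
        for r in range(c+1, n):
            if (r, c) in pat:
                LU[(r, c)] /= LU[(c, c)]             # multiplier (L entry)
                for k in range(c+1, n):
                    if (c, k) in pat:
                        LU[(r, k)] -= LU[(r, c)] * LU[(c, k)]
    return LU

n = 3
A1 = {(0, 0): 4.0, (0, 1): 1.0, (2, 0): 2.0, (1, 1): 3.0, (2, 2): 5.0}
A2 = {(0, 0): 2.0, (0, 1): 4.0, (2, 0): 1.0, (1, 1): 1.0, (2, 2): 3.0}
pat = symbolic(A1.keys(), n)    # done once: position (2,1) is fill-in
LU1 = refactor(A1, pat, n)      # i = 1
LU2 = refactor(A2, pat, n)      # i = 2,...,k reuse pattern and storage
```

Because `pat` and the storage layout never change across the k systems, the numeric loop is exactly the kind of fixed-structure work that maps well to the GPU.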
GLU
— Developed in CUDA programming language for GPUs
— Sparsity pattern of L and U known ahead of time
— Memory requirements known ahead of time
vs. KLU, which is
— Designed specifically for circuit simulation
— Based on the Gilbert-Peierls algorithm (single-threaded)
vs. PARDISO, which is
— A supernodal method (multi-threaded)
GLU: LU re-factorization on the GPU
Review of sparse direct solvers can be found at
http://www.cise.ufl.edu/research/sparse/codes/
Test matrices can be found at
http://www.cise.ufl.edu/research/sparse/matrices/
GLU Speedup (C2070)
[chart: speedup of GLU vs. KLU (1 thread) and vs. PARDISO (6 threads) on rajat17, rajat23, trans4, G2_circuit, transient, ASIC_680ks, ASIC_680k, G3_circuit, Freescale1, circuit5M; off-scale bar labels: 14.3, 7.5, 7.0|25.2]
*NVIDIA C2070, ECC on
*Intel X5680 (Nehalem, 6-core) @ 3.33GHz, MKL 10.3.6
Performance may vary based on OS version and motherboard configuration
GLU Speedup (K20x)
[chart: speedup of GLU vs. KLU (1 thread) and vs. PARDISO (8 threads) on rajat17, rajat23, trans4, G2_circuit, transient, ASIC_680ks, ASIC_680k, G3_circuit, Freescale1, circuit5M; off-scale bar labels: 16.1, 8.6, 7.0|5.4]
Average speedup vs. KLU: 2x
Average speedup vs. PARDISO: 2.5x
*NVIDIA K20, ECC on
*Intel E5-2687w (Sandy Bridge, 8-core) @ 3.1GHz, MKL 10.3.6
Performance may vary based on OS version and motherboard configuration
Conclusion
The two most time-consuming parts of a SPICE simulation
— Device model evaluation
— Solution of linear systems
Device model evaluation
— Speedup* of up to 6x
Solution of linear systems
— Average speedup* of 2x
GPU (overall) acceleration
— Expected overall SPICE speedup of 2-3x
— No slowdown: easy to test an iteration (and revert if needed)
*: speedup is dependent on input parameters
Questions?
Thank you