What Does It Take to Accelerate SPICE on the GPU?
M. Naumov, F. Lannutti, S. Chetlur, L.S. Chien and P. Vandermersch
What is SPICE?
Simulation Program with Integrated Circuit Emphasis
— First version was developed by Laurence Nagel in 1973
— http://en.wikipedia.org/wiki/SPICE
Many variants exist, including (but not limited to):
— Academic:
ngspice, spice3 (UC Berkeley), XSPICE (Georgia Tech)
— Industrial:
HSPICE (Synopsys), PSpice (Cadence), Eldo (Mentor), EEsof (Agilent)
What does SPICE do?
Circuit (diagram): [figure: voltage source Vs (carrying current Ixs) driving resistors R1-R4 across nodes 1, 2 and 3]
Netlist (text file):
nodes i j
R1 1 2 1k
R2 2 0 1k
R3 2 3 0.4k
R4 3 0 0.1k
V1 1 0 PWL (0 0 1n 0 1.1n 5 2n 5)
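A netlist like the one above is plain text, so parsing it is straightforward. The sketch below is illustrative only (the function names `parse_value` and `parse_netlist` and the unit table are assumptions, not NGSPICE's actual parser) and handles just the resistor lines for brevity:

```python
# Minimal netlist-parsing sketch; names and unit handling are illustrative.
UNITS = {"k": 1e3, "m": 1e-3, "u": 1e-6, "n": 1e-9}

def parse_value(tok):
    """Convert a SPICE-style number like '0.4k' to a float."""
    if tok[-1].lower() in UNITS:
        return float(tok[:-1]) * UNITS[tok[-1].lower()]
    return float(tok)

def parse_netlist(text):
    """Return (name, node_i, node_j, value) tuples for resistor lines only."""
    devices = []
    for line in text.strip().splitlines():
        parts = line.split()
        if parts[0][0].upper() == "R":   # skip sources etc. for brevity
            devices.append((parts[0], int(parts[1]), int(parts[2]),
                            parse_value(parts[3])))
    return devices

netlist = """R1 1 2 1k
R2 2 0 1k
R3 2 3 0.4k
R4 3 0 0.1k
V1 1 0 PWL (0 0 1n 0 1.1n 5 2n 5)"""
devices = parse_netlist(netlist)
```

A real SPICE front end additionally parses sources, transistors, models and analysis directives, and builds the internal data structures used during simulation.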
What does SPICE do?
Physics (Kirchhoff's laws + Ohm's law + ...):

Resistor stamp:
         |   Vi    Vj   Ixs | RHS
  row i  |  1/R  -1/R       |
  row j  | -1/R   1/R       |
  row xs |                  |

Voltage source stamp:
         |   Vi    Vj   Ixs | RHS
  row i  |                1 |
  row j  |               -1 |
  row xs |    1    -1       | Vs
Linear system (sparse):

[  1/R1            -1/R1              0       1 ] [ V1  ]   [ 0  ]   (node 1)
[ -1/R1   (1/R1+1/R2+1/R3)        -1/R3       0 ] [ V2  ] = [ 0  ]   (node 2)
[    0             -1/R3      (1/R3+1/R4)     0 ] [ V3  ]   [ 0  ]   (node 3)
[    1                0               0       0 ] [ Ixs ]   [ Vs ]   (source)
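To make the stamping concrete, here is a minimal Python sketch (not from the original deck) that assembles and solves the example system for R1=R2=1k, R3=0.4k, R4=0.1k, Vs=5V. The helper names `stamp_resistor`, `stamp_vsource` and `solve` are illustrative, and the dense Gaussian elimination stands in for a real sparse solver:

```python
# Unknowns: [V1, V2, V3, Ixs]; ground (node 0) rows/columns are dropped.

def stamp_resistor(A, i, j, R):
    """Add a resistor stamp between nodes i and j (0 = ground)."""
    g = 1.0 / R
    for a, b, val in ((i, i, g), (i, j, -g), (j, i, -g), (j, j, g)):
        if a > 0 and b > 0:
            A[a-1][b-1] += val

def stamp_vsource(A, f, i, j, xs, Vs):
    """Add a voltage-source stamp: extra row/column for its current Ixs."""
    if i > 0:
        A[i-1][xs] += 1.0
        A[xs][i-1] += 1.0
    if j > 0:
        A[j-1][xs] -= 1.0
        A[xs][j-1] -= 1.0
    f[xs] = Vs

def solve(A, f):
    """Dense Gaussian elimination with partial pivoting (illustrative only)."""
    n = len(f)
    M = [row[:] + [f[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c+1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n+1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n-1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k]*x[k] for k in range(r+1, n))) / M[r][r]
    return x

n = 4                                  # 3 node voltages + 1 source current
A = [[0.0]*n for _ in range(n)]
f = [0.0]*n
stamp_resistor(A, 1, 2, 1e3)
stamp_resistor(A, 2, 0, 1e3)
stamp_resistor(A, 2, 3, 0.4e3)
stamp_resistor(A, 3, 0, 0.1e3)
stamp_vsource(A, f, 1, 0, 3, 5.0)
V1, V2, V3, Ixs = solve(A, f)          # V1 = 5.0, V2 = 1.25, V3 = 0.25
```

The resistive divider gives V2 = 1.25V and V3 = 0.25V, which matches the stamped system above.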
SPICE Details
Input
— Parse netlist and set up internal data structures
DC Analysis (Newton-Raphson)
— Device model evaluation
— Linear system solution
Transient Analysis (Newton-Raphson for each time step)
— Device model evaluation
— Linear system solution
— Truncation error + time step correction
Device model evaluation takes between 30% and 60% of the simulation time.
Linear system solution takes between 30% and 60% of the simulation time.
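The Newton-Raphson step referenced above linearizes each nonlinear device about the current operating point and iterates until the node voltages converge. A minimal one-unknown sketch (illustrative only; the circuit, the Shockley diode model and all parameter values are assumptions for the example): a source Vs feeds a diode through a resistor R, and we solve f(V) = (Vs - V)/R - Is*(exp(V/Vt) - 1) = 0 for the diode voltage V.

```python
import math

# Toy Newton-Raphson iteration, standing in for the per-step linearization
# SPICE performs on every nonlinear device.
Vs, R, Is, Vt = 5.0, 1e3, 1e-12, 0.025

def f(V):
    """Residual: resistor current minus diode current."""
    return (Vs - V) / R - Is * (math.exp(V / Vt) - 1.0)

def dfdV(V):
    """Derivative of the residual (the Jacobian in the scalar case)."""
    return -1.0 / R - (Is / Vt) * math.exp(V / Vt)

V = 0.6                       # initial guess
for _ in range(50):
    dV = -f(V) / dfdV(V)      # Newton update
    V += dV
    if abs(dV) < 1e-12:       # converged
        break
```

In a full simulator the scalar derivative becomes the sparse Jacobian assembled from the device stamps, and each Newton iteration triggers one linear system solution.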
Device Model Evaluation (focus of this presentation)
Basic models
— Resistor, Capacitor, Inductor, Voltage and Current Source
Transistor models
— MOSFET transistor (BSIM4v7, PSP, etc.)
— Bipolar transistor (Ebers–Moll, Gummel-Poon, etc.)
Other models
— Diodes, etc.
Device Model Evaluation
Key Idea (Transistor - BSIM4v7)
— Many branches depend only on fixed parameters
Temperature
Operation regime
— Reorganize the code (slightly)
Minimize thread divergence
Maximize memory coalescing
[figure: BSIM4v7 instances T1, T2, ..., Tn (10K-100K instances) evaluated through chains of if()/else branches]
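One way to realize this reorganization is to bucket the device instances by the parameters that decide their branches, so every batch (warp, on the GPU) evaluates a single branch with no divergence and reads contiguous data. The deck's actual implementation is CUDA; the sketch below is an illustrative host-side version, and the `regime` criterion and all names are assumptions:

```python
from collections import defaultdict

def regime(vgs, vth=0.7):
    """Toy branch key: cutoff vs. conduction (a stand-in for BSIM4 regimes)."""
    return "cutoff" if vgs < vth else "on"

def batch_by_branch(instances):
    """Group instance indices by their (temperature, regime) branch key, so
    each group can be evaluated with one branch-free, coalesced kernel."""
    buckets = defaultdict(list)
    for idx, (temp, vgs) in enumerate(instances):
        buckets[(temp, regime(vgs))].append(idx)
    return dict(buckets)

# (temperature, Vgs) pairs for five hypothetical transistor instances
instances = [(300, 0.2), (300, 1.1), (350, 1.3), (300, 0.3), (350, 1.2)]
batches = batch_by_branch(instances)
```

On the GPU, each bucket would then map to contiguous arrays of parameters processed by threads that all take the same path through the model code.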
Basic Device Model Evaluation
[chart: GPU speedup (y-axis, 0-40x) vs. number of model instances (8192 to 131072) for the Resistor, Capacitor and Inductor netlists]
*NGSPICE
*NVIDIA C2070, ECC on
*Intel X5690 (Nehalem, 6-core) @ 3.47GHz
1) Resistor Netlist: all resistors
2) Capacitor Netlist: half capacitors and half resistors
3) Inductor Netlist: half resistors, quarter capacitors and quarter inductors
Performance may vary based on OS version and motherboard configuration
Transistor (BSIM4v7) Device Model Evaluation
[chart: evaluation time (ms) on the ISCAS85 Benchmark Suite, CPU (1 core) vs. GPU; GPU speedup 6.67x]
*NGSPICE
*NVIDIA C2070, ECC on
*Intel X5690 (Nehalem, 6-core) @ 3.47GHz
Performance may vary based on OS version and motherboard configuration
Solution of Linear Systems (focus of this presentation)
Solve a set of (sparse) linear systems
Ai xi = fi for i=1,...,k
where the coefficient matrices Ai have the same sparsity pattern
Matrix properties
— Nonsymmetric
— Ill-conditioned
Different methods
— Direct methods (LU factorization + triangular solve)
— Iterative methods (GMRES, BiCGStab, etc.)
Sparse Direct Methods
Original linear system
A x = f
Reordering (to minimize fill-in)
(A Q) (QT x) = f where QT Q = Q QT = I
Pivoting
(PT A Q) (QT x) = PT f where PT P = P PT = I
LU factorization
PT A Q = L U
Forward and backward (triangular) solve
L (U y) = b where y = QT x and b = PT f
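The solve chain above can be sketched end to end on a tiny dense example (illustrative only; real SPICE matrices are sparse, and the function names are assumptions). The permutations P and Q are represented as index vectors, so B[i][j] = A[p[i]][q[j]] plays the role of PT A Q:

```python
def lu(B):
    """Doolittle LU without pivoting (P is assumed to have handled pivoting)."""
    n = len(B)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in B]
    for c in range(n):
        for r in range(c+1, n):
            L[r][c] = U[r][c] / U[c][c]
            for k in range(c, n):
                U[r][k] -= L[r][c] * U[c][k]
    return L, U

def solve_permuted(A, f, p, q):
    """Solve A x = f via PT A Q = L U, then L (U y) = PT f and x = Q y."""
    n = len(f)
    B = [[A[p[i]][q[j]] for j in range(n)] for i in range(n)]
    L, U = lu(B)
    b = [f[p[i]] for i in range(n)]                 # b = PT f
    y = [0.0] * n
    for r in range(n):                              # forward solve with L
        y[r] = b[r] - sum(L[r][k]*y[k] for k in range(r))
    for r in range(n-1, -1, -1):                    # backward solve with U
        y[r] = (y[r] - sum(U[r][k]*y[k] for k in range(r+1, n))) / U[r][r]
    x = [0.0] * n
    for j in range(n):                              # undo ordering: x = Q y
        x[q[j]] = y[j]
    return x

A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
f = [3.0, 5.0, 3.0]
x = solve_permuted(A, f, p=[2, 0, 1], q=[1, 2, 0])  # exact solution: [1, 1, 1]
```

In a production solver, Q comes from a fill-reducing ordering (e.g. AMD or COLAMD) and P from numerical pivoting during factorization.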
Sparse Direct Methods (focus of this presentation)
Recall
— We are solving a set of linear systems
— The coefficient matrices have the same sparsity pattern
Assume
— The reordering (to minimize fill-in) is the same for all systems
— The pivoting is also the same for all systems
LU factorization (i=1)
PT A1 Q = L1 U1
LU re-factorization (i=2,...,k)
— The sparsity pattern (and hence the required memory) of L and U is known ahead of time
GLU: LU re-factorization on the GPU
Key Idea
— LU factorization: A = L U
— Incomplete-LU factorization: M = L(zeroed) + U(zeroed) + A
— The two are equivalent when the incomplete factorization is performed on the full sparsity pattern of L and U
— Many parallel (incomplete-LU) techniques are applicable
Solving a set of systems Ai xi = fi (i=1,...,k):
— A1 = L1 U1 (i=1)
— Mi = L1(zeroed) + U1(zeroed) + Ai (i=2,...,k)
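The payoff of a fixed sparsity pattern is that the symbolic analysis (and memory allocation) happens once, and every subsequent system only needs a numeric re-factorization on the known pattern. A small Python sketch of that split (illustrative only, not GLU's CUDA code; the names `symbolic` and `refactor` are assumptions, matrices are stored as dicts keyed by (row, col), and pivoting is omitted):

```python
def symbolic(pattern, n):
    """One-time symbolic analysis: compute the fill-in pattern of L+U."""
    pat = set(pattern)
    for c in range(n):
        rows = [r for r in range(c+1, n) if (r, c) in pat]
        cols = [k for k in range(c+1, n) if (c, k) in pat]
        for r in rows:
            for k in cols:
                pat.add((r, k))          # fill-in created by elimination
    return pat

def refactor(A, pat, n):
    """Numeric LU (no pivoting) restricted to the precomputed pattern.
    Returns L and U combined in one dict, as in-place elimination would."""
    LU = {pos: A.get(pos, 0.0) for pos in pat}
    for c in range(n):
        for r in range(c+1, n):
            if (r, c) in pat:
                LU[(r, c)] /= LU[(c, c)]             # multiplier (L entry)
                for k in range(c+1, n):
                    if (c, k) in pat:
                        LU[(r, k)] -= LU[(r, c)] * LU[(c, k)]
    return LU

n = 3
A1 = {(0, 0): 4.0, (0, 1): 1.0, (2, 0): 2.0, (1, 1): 3.0, (2, 2): 5.0}
A2 = {(0, 0): 2.0, (0, 1): 4.0, (2, 0): 1.0, (1, 1): 1.0, (2, 2): 3.0}
pat = symbolic(A1.keys(), n)    # done once: position (2,1) is fill-in
LU1 = refactor(A1, pat, n)      # i = 1
LU2 = refactor(A2, pat, n)      # i = 2,...,k reuse pattern and storage
```

Because `pat` and the storage layout never change across the k systems, the numeric loop is exactly the kind of fixed-structure work that maps well to the GPU.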
GLU
— Developed in CUDA programming language for GPUs
— Sparsity pattern of L and U known ahead of time
— Memory requirements known ahead of time
vs. KLU, which is
— Designed specifically for circuit simulation
— Based on the Gilbert-Peierls algorithm (single-threaded)
vs. PARDISO, which is
— A supernodal method (multi-threaded)
GLU: LU re-factorization on the GPU
Review of sparse direct solvers can be found at
http://www.cise.ufl.edu/research/sparse/codes/
Test matrices can be found at
http://www.cise.ufl.edu/research/sparse/matrices/
GLU Speedup (C2070)
[chart: speedup of GLU vs. KLU (1 thread) and vs. PARDISO (6 threads) on rajat17, rajat23, trans4, G2_circuit, transient, ASIC_680ks, ASIC_680k, G3_circuit, Freescale1, circuit5M; off-scale bar labels: 14.3, 7.5, 7.0|25.2]
*NVIDIA C2070, ECC on
*Intel X5680 (Nehalem, 6-core) @ 3.33GHz, MKL 10.3.6
Performance may vary based on OS version and motherboard configuration
GLU Speedup (K20x)
[chart: speedup of GLU vs. KLU (1 thread) and vs. PARDISO (8 threads) on rajat17, rajat23, trans4, G2_circuit, transient, ASIC_680ks, ASIC_680k, G3_circuit, Freescale1, circuit5M; off-scale bar labels: 16.1, 8.6, 7.0|5.4]
Average speedup vs. KLU: 2x
Average speedup vs. PARDISO: 2.5x
*NVIDIA K20, ECC on
*Intel E5-2687w (Sandy Bridge, 8-core) @ 3.1GHz, MKL 10.3.6
Performance may vary based on OS version and motherboard configuration
Conclusion
The two most time-consuming parts of a SPICE simulation
— Device model evaluation
— Solution of linear systems
Device model evaluation
— Speedup* of up to 6x
Solution of linear systems
— Average speedup* of 2x
GPU (overall) acceleration
— Expected overall SPICE speedup of 2-3x
— No slowdown: easy to test an iteration (and revert if needed)
*: speedup is dependent on input parameters
Questions?
Thank you