
Accelerating Symbolic Computations on NVidia Fermi

Pavel Emeliyanenko

Max-Planck Institute for Informatics, Saarbrücken, Germany

[email protected]

Motivation

Resultants are a fundamental algebraic tool in elimination theory. They have numerous applications, for instance in the topological study of algebraic curves or in computer graphics.

The resultant of two bivariate polynomials f and g is the determinant of the Sylvester matrix S:
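The formula itself appeared as an image on the poster. For reference, the standard definition is reproduced below, with the resultant assumed to be taken with respect to y (matching the evaluation and interpolation in x later on): for f = \sum_{i=0}^{p} f_i(x) y^i and g = \sum_{j=0}^{q} g_j(x) y^j,

\operatorname{res}_y(f,g) \;=\; \det S, \qquad
S \;=\;
\begin{pmatrix}
f_p & f_{p-1} & \cdots & f_0    &        &      \\
    & \ddots  & \ddots &        & \ddots &      \\
    &         & f_p    & f_{p-1}& \cdots & f_0  \\
g_q & g_{q-1} & \cdots & g_0    &        &      \\
    & \ddots  & \ddots &        & \ddots &      \\
    &         & g_q    & g_{q-1}& \cdots & g_0
\end{pmatrix}
\in \mathbb{Z}[x]^{(p+q)\times(p+q)},

with q shifted rows of f-coefficients followed by p shifted rows of g-coefficients.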

Computing resultants involves a substantial amount of symbolic operations, which rapidly becomes a bottleneck for many exact geometric algorithms.

High-level structure of the algorithm

Following the ideas of the classical “divide-conquer-combine” modular algorithm of Collins [1]:

• given two bivariate polynomials with large integer coefficients, use modular and evaluation homomorphisms to reduce the problem to a simpler domain;
• compute univariate resultants over a prime field in parallel on the graphics hardware (a CPU reference sketch follows this list);
• interpolate the resultant polynomial over a prime field (GPU);
• lift the polynomial coefficients using Chinese remaindering (partly on the GPU).
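To make the "univariate resultants over a prime field" step concrete, below is a small self-contained CPU reference in plain C++ (not the poster's GPU code): it builds the Sylvester matrix of two univariate polynomials over F_p and takes its determinant by Gaussian elimination with Fermat inverses. The GPU implementation avoids this O(n³) elimination and instead uses the generalized Schur algorithm described in the next section.

#include <cstdint>
#include <cstdio>
#include <vector>

// b^e mod m (square-and-multiply); used for modular inverses via Fermat's little theorem
static uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1; b %= m;
    for (; e; e >>= 1, b = b * b % m)
        if (e & 1) r = r * b % m;
    return r;
}

// determinant of a square matrix over F_p by Gaussian elimination (p prime, p < 2^31)
static uint64_t det_mod(std::vector<std::vector<uint64_t>> a, uint64_t p) {
    size_t n = a.size();
    uint64_t det = 1;
    for (size_t i = 0; i < n; ++i) {
        size_t piv = i;
        while (piv < n && a[piv][i] == 0) ++piv;
        if (piv == n) return 0;                      // singular matrix
        if (piv != i) { std::swap(a[piv], a[i]); det = (p - det) % p; }
        det = det * a[i][i] % p;
        uint64_t inv = pow_mod(a[i][i], p - 2, p);   // pivot inverse
        for (size_t r = i + 1; r < n; ++r) {
            uint64_t f = a[r][i] * inv % p;
            for (size_t c = i; c < n; ++c)
                a[r][c] = (a[r][c] + p - f * a[i][c] % p) % p;
        }
    }
    return det;
}

// resultant of univariate f, g over F_p (coefficients given highest degree first):
// the determinant of the (deg f + deg g) x (deg f + deg g) Sylvester matrix
static uint64_t resultant_mod(const std::vector<uint64_t>& f,
                              const std::vector<uint64_t>& g, uint64_t p) {
    size_t df = f.size() - 1, dg = g.size() - 1, n = df + dg;
    std::vector<std::vector<uint64_t>> S(n, std::vector<uint64_t>(n, 0));
    for (size_t i = 0; i < dg; ++i)                  // dg shifted copies of f
        for (size_t j = 0; j <= df; ++j) S[i][i + j] = f[j];
    for (size_t i = 0; i < df; ++i)                  // df shifted copies of g
        for (size_t j = 0; j <= dg; ++j) S[dg + i][i + j] = g[j];
    return det_mod(S, p);
}

int main() {
    uint64_t p = 2147483647ull;                      // a 31-bit prime
    // f = y^2 + 1, g = y^2 - 1: their resultant is 4
    uint64_t r = resultant_mod({1, 0, 1}, {1, 0, p - 1}, p);
    std::printf("res(f, g) mod p = %llu\n", (unsigned long long)r);
    return 0;
}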


Problem: the amount of parallelism exhibited by the modular algorithm is far too low to satisfy the needs of a massively-threaded architecture like that of the GPU.

Solution: reduce the problem to computations with structured matrices, because matrix operations typically map very well to the graphics hardware.

Introduction to Displacement structure

Computation of the resultant reduces to the triangular factorization of the Sylvester matrix S, which is shift-structured [2].

The Generalized Schur Algorithm computes this factorization in O(n²) time by operating solely on the matrix generators: G and B are the generator matrices and Z is the down-shift matrix.
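The displacement equation itself was shown as a formula on the poster; a standard shift-structure form, following [2] (the exact generator pairs used for the Sylvester matrix are derived in [5]), is:

\nabla S \;=\; S - Z\,S\,Z^{T} \;=\; G\,B^{T},
\qquad
Z =
\begin{pmatrix}
0 &        &        &   \\
1 & 0      &        &   \\
  & \ddots & \ddots &   \\
  &        & 1      & 0
\end{pmatrix},
\qquad G,\,B \in \mathbb{F}_p^{\,n \times r},

where Z is the n × n down-shift matrix and the displacement rank r is small, so the O(n²) entries of S are represented by only O(n) generator entries.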

One iteration of the division-free generator recurrences:
• reduce the top rows of G and B (vector updates)
• collect factors of the resultant (see the note below)
• shift down the first columns of G and B
• iterate until the generators vanish completely
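Why the collected factors multiply to the resultant (a standard fact about triangular factorization, not spelled out on the poster; it assumes the relevant leading minors of S are nonsingular): the Schur algorithm produces a factorization with unit-triangular outer factors, so the determinant is the product of the diagonal entries d_i collected at the iterations,

S = L\,D\,U \quad\Longrightarrow\quad \operatorname{res}(f,g) = \det S = \prod_i d_i .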

Abstract: We present the first implementation of a modular resultant algorithm on GPUs [4,5]. With recent developments taking advantage of the new NVidia Fermi GPU architecture and instruction set, we have been able to achieve about a 150x speedup over the CPU-based resultant algorithm from Maple 13.

Polynomial interpolation over a prime field

Reduce the problem to solving a Vandermonde system using the generalized Schur algorithm in O(n²) time (see [5]).

Schematic view of the GPU algorithm

Input polynomials in Z[x,y]; N: number of moduli, M: number of evaluation points.
• CPU sequential code: reduce the polynomial coefficients modulo 31-bit primes.
• kernel 1, polynomial evaluation: for each modulus and each evaluation point in parallel.
• kernel 2, computing univariate resultants: factorization of the Sylvester matrix using the Schur algorithm; eliminate "bad" evaluation points (grid size: N × M).
• kernel 3, polynomial interpolation: gather the results for different evaluation points; for each modulus in parallel, solve the Vandermonde system using the Schur algorithm (see [5]) and divide by the denominator.
• kernel 4, Chinese remaindering: gather the results for different moduli; for each evaluation point in parallel, compute the mixed-radix digits using the CRA; division uses the Montgomery modular inverse (see [4]).
• CPU sequential code: recover the resultant coefficients from the mixed-radix representation (a sketch follows below).
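The very last CPU step above converts each coefficient from its mixed-radix representation back to a big integer. A minimal sketch of that conversion is shown below; it assumes GMP's mpz_class as the big-integer type and uses illustrative names, and is not the poster's actual code:

#include <gmpxx.h>
#include <cstdint>
#include <vector>

// Recover x from its mixed-radix digits d[0..k-1] w.r.t. the moduli m[0..k-1]:
//   x = d[0] + d[1]*m[0] + d[2]*m[0]*m[1] + ...   (evaluated Horner-style)
mpz_class from_mixed_radix(const std::vector<uint32_t>& d,
                           const std::vector<uint32_t>& m) {
    if (d.empty()) return 0;
    mpz_class x = 0;
    for (size_t i = d.size(); i-- > 1; )    // highest digit first
        x = (x + d[i]) * m[i - 1];
    return x + d[0];
}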

Realization of 31-bit modular arithmetic on the GPU

NVidia Fermi architectural features:
• native 32-bit integer multiplication support (instead of 24-bit multiplication on the GT200)
• full-speed double-precision arithmetic (8x faster than that of the GT200)
• the modulo operation ('%') is costly: implement modular reduction in floating point
• a new set of video instructions that can do several arithmetic operations at a time (PTX assembly [3])

Raw performance: up to 154 GMad/s on the GTX480 graphics processor (GMad/s = 10^9 modular multiply-adds per second). For comparison, a 2.5 GHz Quad-Core Xeon E5420 can do about 1 GMad/s per core.

Modular multiply-add:

// Computes (a + b*c) mod m for a 31-bit modulus m; inv_m = (double)1 / m.
// The __device__ wrapper signature is illustrative; the poster shows only the body.
// D2I_TRUNC = 3 * 2^51: adding it truncates the mantissa (fast double-to-int)
#define D2I_TRUNC 6755399441055744.0

__device__ unsigned mul_add_mod(unsigned a, unsigned b, unsigned c,
                                unsigned m, double inv_m)
{
    double f = (double)b * (double)c * inv_m + D2I_TRUNC;
    unsigned s = b * c - __double2loint(f) * m;
    // equivalent to s = min(s, s + m)
    asm volatile("vadd.u32.u32.u32.min %0,%1,%2,%3;" :
                 "=r"(s) : "r"(s), "r"(m), "r"(s));
    s += a;
    // equivalent to s = min(s, s - m)
    asm volatile("vsub.u32.u32.u32.min %0,%1,%2,%3;" :
                 "=r"(s) : "r"(s), "r"(m), "r"(s));
    return s;
}

Vector updates:

// Computes (a*b - c*d) mod m, the core operation of the generator vector
// updates; inv_m = (double)1 / m. The wrapper signature is illustrative.
__device__ unsigned vec_update_mod(unsigned a, unsigned b, unsigned c,
                                   unsigned d, unsigned m, double inv_m)
{
    double f1 = (double)a * (double)b;
    double f2 = (double)c * (double)d;
    double f  = (f1 - f2) * inv_m;
    unsigned r = (unsigned)__double2int_rd(f);
    unsigned s = a * b - c * d - r * m;
    // equivalent to s = min(s, s + m)
    asm volatile("vadd.u32.u32.u32.min %0,%1,%2,%3;" :
                 "=r"(s) : "r"(s), "r"(m), "r"(s));
    // equivalent to s = min(s, s - m)
    asm volatile("vsub.u32.u32.u32.min %0,%1,%2,%3;" :
                 "=r"(s) : "r"(s), "r"(m), "r"(s));
    return s;
}
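For reference, a minimal host-side check of the kind one might use to validate the multiply-add routine against 64-bit integer arithmetic; the kernel and the wrapper name mul_add_mod are illustrative, not part of the poster:

#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

__global__ void check_mul_add_mod(unsigned a, unsigned b, unsigned c,
                                  unsigned m, unsigned *out) {
    // inv_m is precomputed once per modulus in the real kernels
    *out = mul_add_mod(a, b, c, m, 1.0 / (double)m);
}

int main() {
    unsigned m = 2147483647u;                       // a 31-bit prime modulus
    unsigned a = 123456789u, b = 987654321u, c = 192837465u;
    unsigned *d_out, h_out;
    cudaMalloc(&d_out, sizeof(unsigned));
    check_mul_add_mod<<<1, 1>>>(a % m, b % m, c % m, m, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(unsigned), cudaMemcpyDeviceToHost);
    unsigned ref = (unsigned)((a % m + (uint64_t)(b % m) * (c % m) % m) % m);
    std::printf("gpu: %u  cpu: %u\n", h_out, ref);
    cudaFree(d_out);
    return 0;
}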

Computing univariate resultants over a prime field

[Figure: mapping of the generalized Schur algorithm onto GPU threads. Each thread of a block owns one row of the generators G_i and B_i, and the first (top) elements are shared between all threads. In the i-th iteration the threads perform the vector updates on the generator rows, a factor of the resultant is collected, and the first generator columns are shifted down to form G_{i+1} and B_{i+1} for the (i+1)-th iteration; finally, all collected factors are multiplied together using a parallel reduction.]

Efficient stream compaction on Fermi (a CUDA sketch follows below):
• use the ballot voting primitive to obtain a zero-one pattern across each warp separately
• compute element shifts per warp using population count (popc intrinsic)
• propagate shifts to subsequent warps through addition in shared memory

[Figure: each thread block writes its compacted segment of surviving evaluation points to global memory; the segments of Blocks 1, 2, ..., n may end up in any order.]

The block write offset is controlled by a global memory variable changed atomically (the relative order of the evaluation points does not matter for the interpolation).
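A minimal CUDA sketch of this compaction scheme, using the Fermi-era intrinsics __ballot and __popc; all names and the surviving-point predicate are illustrative, not the poster's code:

#include <cuda_runtime.h>

// Compact the "good" evaluation points of one batch into 'out'; 'global_offset'
// is the atomically updated write position shared by all blocks.
__global__ void compact_good_points(const unsigned *in, unsigned *out,
                                    unsigned *global_offset, int n) {
    __shared__ unsigned warp_counts[33];            // per-warp offsets + block offset
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;

    bool keep = (tid < n) && (in[tid] != 0);        // stand-in for the "bad point" test
    unsigned mask = __ballot(keep);                 // zero-one pattern of this warp
    unsigned rank = __popc(mask & ((1u << lane) - 1u)); // kept elements before this lane
    if (lane == 0) warp_counts[warp] = __popc(mask);    // kept elements per warp
    __syncthreads();

    if (threadIdx.x == 0) {                         // propagate shifts to later warps
        unsigned sum = 0;
        for (int w = 0; w < (blockDim.x + 31) / 32; ++w) {
            unsigned cnt = warp_counts[w];
            warp_counts[w] = sum;                   // exclusive prefix of warp counts
            sum += cnt;
        }
        // block write offset: a global memory variable changed atomically
        warp_counts[32] = atomicAdd(global_offset, sum);
    }
    __syncthreads();

    if (keep)
        out[warp_counts[32] + warp_counts[warp] + rank] = in[tid];
}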

[Performance plots: GTX280 using 24-bit modular arithmetic vs. GTX480 using 31-bit modular arithmetic; performance depending on the y-degree of the polynomials with the coefficient bit-length fixed; performance as a function of the coefficient bit-length with the polynomials' x/y-degrees fixed. Further kernel grid sizes from the schematic: N × M / 128, N × 1 and M × 1.]

Performance comparison with the resultant algorithm from 32-bit Maple 13 (deterministic)
Target graphics card: GeForce GTX480. Host machine: Dual-Core AMD Opteron 2220SE, Linux platform.

Instance (degrees, bit-length, density) | Maple time | GPU time | CUDA blocks executed
deg_x f: 40, deg_x g: 39; deg_y f: 19, deg_y g: 17; bits: 32; dense | 12.2 s | 0.057 s | 56 × 1372, 32×2 threads
deg_x f: 42, deg_x g: 33; deg_y f: 31, deg_y g: 20; bits: 32; dense | 101.2 s | 0.48 s | 223 × 1874, 64 threads
deg_x f: 36, deg_x g: 42; deg_y f: 19, deg_y g: 17; bits: 320; dense | 114.4 s | 0.781 s | 488 × 1353, 32×2 threads
deg_x f: 10, deg_x g: 7; deg_y f: 95, deg_y g: 93; bits: 16; sparse | 157.8 s | 1.24 s | 206 × 1604, 96 threads
deg_x f: 40, deg_x g: 30; deg_y f: 31, deg_y g: 20; bits: 100; sparse | 56.7 s | 0.4 s | 215 × 1740, 32×2 threads
deg_x f: 10, deg_x g: 7; deg_y f: 95, deg_y g: 93; bits: 120; dense | timed out (> 15 min) | 6.35 s | 951 × 1604, 96 threads

deg_x/y: degrees in x/y of the polynomials f and g; bits: coefficient bit-length; sparse/dense: density of the polynomials; CUDA blocks executed: number of blocks run by the first resultant kernel (N × M) and number of threads per block.

References:
[1] Collins G.E.: "The calculation of multivariate polynomial resultants", SYMSAC'71, 1971, 212-2
[2] Kailath T. and Sayed A.: "Displacement structure: theory and applications", SIAM Review, 1995, 297-386
[3] PTX: Parallel Thread Execution. ISA Version 2.1. NVIDIA Corp., 2010
[4] Emeliyanenko P.: "Modular Resultant Algorithm for Graphics Processors", ICA3PP'10, 2010, 427-440
[5] Emeliyanenko P.: "A complete modular resultant algorithm targeted for realization on graphics hardware", PASCO'10, 2010, 35-43

