Accelerating Bayesian Computation with Parallel Reduction using CUDA
Thanakij Pechprasarn and Noppadon Khiripet
Knowledge Elicitation and Archiving Laboratory (KEA)
National Electronics and Computer Technology Center (NECTEC)
MIWAI 2010
Bayesian Inference
• To calculate the posterior distribution of the hypothesis given observed data:

  p(H|D) = p(D|H) p(H) / p(D)

• However, to evaluate this probability, one has to solve for the integral value of the evidence:

  p(D) = ∫ p(D|H) p(H) dH

• Often, there is no closed-form solution
• Have to apply a numerical method to approximate the integrals
Monte Carlo Integration
• Monte Carlo (MC) is a method involving a random process
• Monte Carlo integration (MCI) is MC for solving integrals (especially in high dimensions) in 2 steps:
  1. generate samples (random numbers) from f: pseudo-, quasi-, or Markov chain MC (MCMC)
  2. compute the integral as the sample average: I ≈ (1/N) Σ f(x_i)
Problems with MCI
• Quality and quantity of samples are crucial to MCI
> to obtain results with acceptable error and to converge faster
• Typically, the size of sample set is large, so the computation takes time
Subtle Problems with MCI
• Typically, one would do:

  sum = 0
  for i = 1 .. N
      sum = sum + f(x[i])
  return sum / N

• However, a naïve implementation like this leads to an incorrect result because of a floating-point error called absorption. For example, running it with N = 67107840 on uniform U(0,1) data gives 0.250004, although the answer should be around 0.5
Summary of Problems
• How to produce samples with good quality
• How to assess the quality of samples
• How to estimate the quantity of samples needed
• Performance -> speed!
• Accuracy -> reduce floating-point error!

We address some parts of the problem: performance and accuracy
Survey
For accuracy:
• Kahan summation can be used to significantly reduce the absorption error:

  sum = 0
  c = 0
  for i = 1 .. N
      y = f(x[i]) - c
      t = sum + y
      c = (t - sum) - y
      sum = t
  return sum / N
For speed:
• Many parallel programming schemes can help speed up the computation, such as OpenMP, MPI and GPU computing
Our Method
• Use GPU computing with NVIDIA CUDA to accelerate the computation of MCI, because MCI is data-parallel and GPUs offer good price/performance
• Kahan summation is slow because of the extra instructions
• Instead, we propose a technique called the “div2 technique”, which guarantees that absorption error cannot occur
CUDA
• Many more CUDA cores compared to CPU cores
• Exploits parallelism in the form of blocks and threads
• Automatic scaling -> running more blocks
Parallel Reduction
• Tree-based structure pattern for parallel computing
• At each level, divide the input into several blocks according to the block size B
• A “kernel” is a function to be executed on the device, launched as f<<<gridDim, blockDim>>>(param); we have 2 kernels:
  > reduce, for the core reduction
  > compact, for gathering the distributed partial results from reduce to form a new input
• Outer loop runs log_B(N) times, calling reduce and compact
• Inner loop (in reduce) runs log_2(B) times
• Optimization techniques: “half the number of threads” and “first add during load”
Div2 Technique
• We observe that absorption occurs when the two operands differ in magnitude by a factor of about 10^7 (in single precision)
• If we can ensure that an addend never grows beyond the maximum value of the samples, then we can prevent absorption
• So what if we divide by 2 every time we do an add operation?
• We cannot simply divide on every add, but the halving maps very well onto the parallel reduction pattern:

  MCI = parallel_reduction(samples) / N = C * parallel_reduction_with_div2(samples)

• where C is a correcting factor = higher_power_of_2(N) / N
  > C = 1 if N is a power of 2
  > C lies in [1, 2)
• Have to handle the case where N is not a power of 2
• More complex when blocks and threads in CUDA have to be considered
Experiment
Data sets
1. Uniform distribution (pseudo-random)
2. Gaussian distribution (pseudo-random, Irwin-Hall algorithm)
Machine with
• CPU: Intel Core2 Quad 2.8GHz (4 cores)
• GPU: NVIDIA GeForce GTX 280 1.4GHz (240 cores, compute capability (cc) 1.3, GT200 family)
Implementations
1. Naïve
2. Kahan summation
3. Our method
Thoughts
• In CUDA with cc 1.3, the max block size B is 512, and B should also be a power of two (for efficiency and simplicity), so B can be: 32, 64, 128, 256 or 512
• With optimization, we reduce the number of threads to B/4, allowing each thread to do more work in a block. Therefore, the eligible B becomes: 128, 256, 512, 1024 or 2048
• Because we also know that the maximum number of blocks is 65535, we can compute the max problem size N for each block size B (N = 65535 x B, from the “reduce” kernel)
Block Size Problem Size
128 8388480
256 16776960
512 33553920
1024 67107840
2048 134215680
Result: Accuracy

Data Set   Problem Size   Naïve      Kahan      Our Method
U(0,1)     8388480        0.49998    0.499981   0.499981
           16776960       0.499953   0.499981   0.499981
           33553920       0.499944   0.499981   0.499981
           67107840       0.250004   0.499981   0.499981
           134215680      0.125002   0.499981   0.499981
N(6,1)     8388480        6.00014    5.99992    5.99992
           16776960       6.60712    5.99992    5.99992
           33553920       4.14664    5.99992    5.99992
           67107840       2.25161    5.99992    5.99992
           134215680      1.30409    5.99992    5.99992
Result: Execution Time
Result: Speed-Up
Future Work
• Further optimization
• Utilize the multidimensional features of CUDA’s grid and block
• Test on “Fermi” (CUDA with cc 2.0) to see how the program scales and performs
End
• We are the data mining lab (KEA) at NECTEC
• We are looking for interns! (open to undergrads and master’s students)
  - have fun with CUDA projects!
  - during summer (1.5-2 months)
  - getting paid
• email: [email protected]
The need for single precision
(why don’t we use double precision?)
• It is Monte Carlo, so we tend to use single precision
• Single precision has a smaller memory footprint (GPU memory is very small compared to CPU DRAM)
• In addition, in CUDA with cc 1.x, double precision has much worse performance than single precision
• (CUDA with cc 2.x has a major performance improvement in double precision)
The need for the div2 technique
• Yes, parallel reduction alone already alleviates the error from absorption
• But errors can still be found in certain sets of numbers
• In addition, in CUDA with cc 1.x, the single-precision floating-point implementation deviates somewhat from the IEEE 754 standard
• (in CUDA with cc 2.x, single precision fully supports the IEEE 754 standard, the same as the CPU)
• So it is easier to obtain results with floating-point errors
• Using the “div2 technique” guarantees the correctness of the answer, preventing the occurrence of absorption!
The effect of embedding div2
• In terms of execution time, the effect of div2 tends to fluctuate: better or worse depending on problem size and block size
• It is a trade-off of a small fraction of performance for reliable results
To handle extremely large N
• Utilize the multidimensional features of CUDA’s grid and block
• Divide N into smaller chunks that the program can handle -> requires a compact-like function
• Employ the “first add during load” technique