Accelerating Bayesian Computation with Parallel Reduction using CUDA
Thanakij Pechprasarn and Noppadon Khiripet
Knowledge Elicitation and Archiving Laboratory (KEA)
National Electronics and Computer Technology Center (NECTEC)
MIWAI 2010
Bayesian Inference
• To calculate the posterior distribution of the hypothesis given observed data:

  p(H|D) = p(D|H) p(H) / p(D)

• However, to evaluate this probability, one has to solve for the integral value of the evidence:

  p(D) = ∫ p(D|H) p(H) dH

• Often, there is no closed-form solution
• Have to apply a numerical method to approximate the integrals
Monte Carlo Integration
• Monte Carlo (MC) is a method involving a random process
• Monte Carlo integration (MCI) is MC for solving integrals (especially in high dimensions) in 2 steps:
  1. generate samples (random numbers) from f: pseudo-, quasi-, or Markov chain MC (MCMC)
  2. compute the integral as the sample average: I ≈ (1/N) Σ f(x_i)
Problems with MCI
• Quality and quantity of samples are crucial to MCI
> to obtain results with acceptable error and to converge faster
• Typically, the size of sample set is large, so the computation takes time
Subtle Problems with MCI
• Typically, one would do:

  sum = 0
  for i = 1 .. N
      sum = sum + f(x[i])
  return sum / N

• However, a naïve implementation like this leads to an incorrect result because of a floating-point error called absorption. For example, running it with N = 67107840 on uniform U(0,1) data gives 0.250004, although the answer should be around 0.5
Summary of Problems
• How to produce samples with good quality
• How to assess the quality of samples
• How to estimate the quantity of samples needed
• Performance -> speed!
• Accuracy -> reduce floating-point error!

We address some parts of the problem: performance and accuracy
Survey
For accuracy:
• Kahan summation can be used to significantly reduce the absorption error:

  sum = 0
  c = 0
  for i = 1 .. N
      y = f(x[i]) - c
      t = sum + y
      c = (t - sum) - y
      sum = t
  return sum / N
For speed:
• Many parallel programming schemes can help speed up the computation, such as OpenMP, MPI and GPU computing
Our Method
• Use GPU computing with NVIDIA CUDA to accelerate the computation of MCI, because MCI is data-parallel and GPUs offer good price/performance
• Kahan summation is slow because of the extra instructions
• Instead, we propose a technique called the “div2 technique”, which guarantees that absorption error cannot occur
CUDA
• Many more CUDA cores compared to CPU cores
• Exploits parallelism in the form of blocks and threads
• Automatic scaling -> running more blocks
Parallel Reduction
• Tree-based structure pattern for parallel computing
• At each level, divide the input into several blocks according to the block size B
• A “kernel” is a function to be executed on the device, launched as f<<<gridDim, blockDim>>>(param); we have 2 kernels:
  > reduce, for the core reduction
  > compact, for gathering the distributed partial results from reduce to form a new input
• Outer loop runs log_B(N) times, calling reduce and compact
• Inner loop (in reduce) runs log_2(B) times
• Optimization techniques: “half the number of threads” and “first add during load”
Div2 Technique
• We observe that absorption occurs when the two operands differ in magnitude by a factor of about 10^7 (in single precision)
• If we can ensure that an addend never grows beyond the maximum value of the samples, then we can prevent absorption
• So what if we divide by 2 every time we do an add operation?
• We cannot simply divide on every add, but the halving maps very well onto the parallel reduction pattern:

  MCI = parallel_reduction(samples) / N = C * parallel_reduction_with_div2(samples)

• where C is a correcting factor = higher_power_of_2(N) / N
  > C = 1 if N is a power of 2
  > C lies in [1, 2)
• Have to handle the case where N is not a power of 2
• More complex when blocks and threads in CUDA have to be considered
Experiment
Data sets
1. Uniform distribution (pseudo-random)
2. Gaussian distribution (pseudo-random, Irwin-Hall algorithm)
Machine with
• CPU: Intel Core2 Quad 2.8GHz (4 cores)
• GPU: NVIDIA GeForce GTX 280 1.4GHz (240 cores, compute capability (cc) 1.3, GT200 family)
Implementations
1. Naïve
2. Kahan summation
3. Our method
Thoughts
• In CUDA with cc 1.3, the max block size B is 512, and B should also be a power of two (for efficiency and simplicity), so B can be: 32, 64, 128, 256 or 512
• With optimization, we reduce the number of threads to B/4, allowing each thread to do more work in a block. Therefore, the eligible B becomes: 128, 256, 512, 1024 or 2048
• Because we also know that the maximum number of blocks is 65535, we can compute the max problem size N for each block size B (N = 65535 x B, from the “reduce” kernel)
Block Size Problem Size
128 8388480
256 16776960
512 33553920
1024 67107840
2048 134215680
Result: Accuracy

Data Set   Problem Size   Naïve      Kahan      Our Method
U(0,1)     8388480        0.49998    0.499981   0.499981
           16776960       0.499953   0.499981   0.499981
           33553920       0.499944   0.499981   0.499981
           67107840       0.250004   0.499981   0.499981
           134215680      0.125002   0.499981   0.499981
N(6,1)     8388480        6.00014    5.99992    5.99992
           16776960       6.60712    5.99992    5.99992
           33553920       4.14664    5.99992    5.99992
           67107840       2.25161    5.99992    5.99992
           134215680      1.30409    5.99992    5.99992
Result: Execution Time
Result: Speed-Up
Future Work
• Further optimization
• Utilize the multidimensional features of CUDA’s grid and block
• Test on “Fermi” (CUDA with cc 2.0) to see how the program scales and performs
End
• We are the data mining lab (KEA) at NECTEC
• We are looking for interns! (open to undergrads and master’s students)
  - have fun with CUDA projects!
  - during summer (1.5-2 months)
  - getting paid
• email: [email protected]
The need for single precision
(why don’t we use double precision?)
• It is Monte Carlo, so we tend to use single precision
• Single precision has a smaller memory footprint (GPU memory is very small compared to CPU DRAM)
• In addition, in CUDA with cc 1.x, double precision has much worse performance than single precision
• (CUDA with cc 2.x has a major performance improvement in double precision)
The need for the div2 technique
• Yes, parallel reduction alone already alleviates the error from absorption
• But errors can still be found in certain sets of numbers
• In addition, in CUDA with cc 1.x, the single-precision floating-point implementation deviates somewhat from the IEEE 754 standard
• (in CUDA with cc 2.x, single precision fully supports the IEEE 754 standard, the same as the CPU)
• So it is easier to obtain results with floating-point errors
• Using the “div2 technique” guarantees the correctness of the answer, preventing the occurrence of absorption!
The effect of embedding div2
• In terms of execution time, the effect of div2 tends to fluctuate: better or worse depending on problem size and block size
• It is a trade-off of a small fraction of performance for reliable results
To handle extremely large N
• Utilize the multidimensional features of CUDA’s grid and block
• Divide N into smaller chunks that the program can handle -> requires a compact-like function
• Employ the “first add during load” technique