
Training Random Forests on the GPU: Genomic Implications on HIV Susceptibility

Mark Seligman

Rapidics LLC

March 27, 2014


1 Outline

2 Introduction

3 Random Forests

4 Arborist

5 Data and research

6 Implementation

7 Summary and conclusions


Outline

Introduction

Random Forests

Arborist

Data and research

Implementation

Summary and conclusions

Q & A


Ongoing work with research team from University of Washington

Research team from Departments of Biostatistics, Global Health and Epidemiology.

Performing GWAS on subjects from sub-Saharan Africa, where HIV infection is quite prevalent. Some subjects may have genetic tendency toward resistance to infection.

Using Random Forests as a predictive analysis tool to identify genetic components likely to confer protection.

Rapidics supplies accelerated RF implementation for beta-testing.

Ongoing project: reporting here on progress so far.


Decision trees

Prediction method involving, intuitively, a series of yes/no questions.

Answer to each question determines which (of two) questions to pose next.

Each series of questions can be represented as a binary tree, whence the name.

A training phase builds the trees from observational data, assigning prediction scores to the leaves.

A prediction phase walks the trees over separate observational data, deriving an overall prediction from the score obtained at each tree.
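
As a concrete illustration of the tree walk, a minimal sketch using a flat-array layout of my own choosing (not the Arborist's internal representation):

  // Walk a single regression tree stored as flat arrays; illustrative only.
  #include <cstdio>
  #include <vector>

  struct Tree {
      std::vector<int>    pred;         // predictor tested at each node
      std::vector<double> split;        // split value at each node
      std::vector<int>    left, right;  // child indices; -1 marks a leaf
      std::vector<double> score;        // prediction score at leaves
  };

  double predictOne(const Tree &t, const std::vector<double> &x) {
      int node = 0;
      while (t.left[node] >= 0)         // interior node: pose the next question
          node = (x[t.pred[node]] <= t.split[node]) ? t.left[node] : t.right[node];
      return t.score[node];             // leaf: return its score
  }

  int main() {
      // One-split toy tree: x[0] <= 1.5 ? score 0.2 : score 0.8
      Tree t{{0, -1, -1}, {1.5, 0, 0}, {1, -1, -1}, {2, -1, -1}, {0, 0.2, 0.8}};
      std::printf("%f\n", predictOne(t, {2.0}));   // prints 0.800000
      return 0;
  }

A forest then combines the per-tree scores, e.g. by averaging for regression or voting for classification.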


Random Forest algorithm

Trademarked name registered to Leo Breiman (dec.) and Adele Cutler.

Binary decision tree method for either regression or classification.
  Regression predicts the mean of a numerical-valued response.
  Classification predicts the most likely category of a categorical response.

Observation data (“predictors”) can have either numerical or categorical values.

Trains on randomly-selected subset of the observations (“bagging”).

Tests predictive quality on left-out subset.
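
A minimal sketch of the bagging step, with names of my own choosing: draw a bootstrap sample of row indices for training and keep the never-drawn rows as the held-out (out-of-bag) set.

  // Bootstrap-sample rows with replacement; rows never drawn form the out-of-bag set.
  #include <random>
  #include <vector>

  void bagRows(int nRow, std::mt19937 &rng,
               std::vector<int> &inBag, std::vector<int> &outOfBag) {
      std::vector<char> used(nRow, 0);
      std::uniform_int_distribution<int> pick(0, nRow - 1);
      inBag.clear();
      for (int i = 0; i < nRow; i++) {     // nRow draws with replacement
          int r = pick(rng);
          inBag.push_back(r);
          used[r] = 1;
      }
      outOfBag.clear();
      for (int r = 0; r < nRow; r++)       // held-out rows test predictive quality
          if (!used[r]) outOfBag.push_back(r);
  }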


RF algorithm, cont.

Branching (“splitting”) decision for a subtree made from the information content of observations within that subtree.

Predictor yielding highest informational gradient (Gini gain) within some two-level partition is selected.

Selected partition is order-based for numerical predictors and subset-based for categorical.

O.k. to mix numerical, categorical predictors.

At a given subtree, candidate predictors are a randomly-selected subcollection of full set.

Subcollection size is specified by user, and varies with application.

The “winner” determines which rows (samples) lie to the left and right.

As subsequent (subtree) splitting decisions are made, this partition is enforced among all other predictors, w.r.t. sample index.
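
To make the split criterion concrete, a minimal sketch of my own, simplified to a binary response: score every order-based cut point of one numerical predictor by Gini gain, assuming 'order' holds row indices presorted by that predictor's value.

  #include <vector>

  double giniImpurity(double n1, double n) {      // binary Gini impurity: 2p(1-p)
      if (n == 0) return 0.0;
      double p = n1 / n;
      return 2.0 * p * (1.0 - p);
  }

  // Best Gini gain over all cut positions; splitPos records the winning cut.
  double bestGiniGain(const std::vector<int> &order, const std::vector<int> &yClass,
                      int &splitPos) {
      double n = order.size(), n1 = 0;
      for (int r : order) n1 += yClass[r];
      double parent = giniImpurity(n1, n), best = 0.0;
      double leftN = 0, leftN1 = 0;
      splitPos = -1;
      for (size_t i = 0; i + 1 < order.size(); i++) {
          leftN  += 1;
          leftN1 += yClass[order[i]];
          double rightN = n - leftN, rightN1 = n1 - leftN1;
          double gain = parent - (leftN / n) * giniImpurity(leftN1, leftN)
                               - (rightN / n) * giniImpurity(rightN1, rightN);
          if (gain > best) { best = gain; splitPos = (int)i; }
      }
      return best;
  }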


Opportunities and obstacles for acceleration

Since splitting is based on ordering or subset membership, it makes sense to presort the observations.

Each successive subtree remains ordered, but the implementational trick is to retain this ordering economically for “restaging”. The expensive solution would be to re-sort on each subtree.

Within a subtree, predictors can be evaluated in parallel.

Subtrees can themselves be evaluated in parallel, favoring a breadth-first approach.

Trees can be built and walked independently, a common Hadoop-based strategy, for example.

A key synchronization point, however, lies in passing from a subtree to its descendants.

At first blush, access appears to be purely random. In fact, the access patterns resemble scatters, which become progressively more local further down the tree.
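
A minimal CPU-side sketch of that restaging trick (names are mine): the winning split yields a per-row left/right flag, and each predictor's presorted row ordering is stably partitioned so that both children remain sorted without re-sorting.

  #include <vector>

  void restage(const std::vector<int> &parentOrder,    // rows, presorted by this predictor
               const std::vector<char> &goesLeft,      // per-row flag: left (1) or right (0)
               std::vector<int> &leftOrder,
               std::vector<int> &rightOrder) {
      leftOrder.clear();
      rightOrder.clear();
      for (int row : parentOrder)                      // stable pass preserves the sort order
          (goesLeft[row] ? leftOrder : rightOrder).push_back(row);
  }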


The Arborist

Commercial implementation of RF, geared to commodity high-performance hardware as well as customizable configuration within specialized applications.

Deployed implementations exploit large-memory, multicore and multiprocessor platforms.

Data and feature weighting, NA-screening, quantile regression all available as standard features.

“Loopless” reinvocation features to enable iterative calls while bypassing reinitialization.

Bindings available for the R language using Rcpp C++ template extensions, although the implementation is essentially front-end agnostic.

Python bindings proposed, to encourage adoption by the larger ML community and to facilitate comparison with other accelerated RF implementations.


Background

Began with need to improve performance of R-based implementation.

R’s implementation of regression RF employs a sort on each subtree. Reimplementation using a presort significantly accelerates performance: linear vs. quadratic in the number of samples.

Categorical RF is implemented in tight Fortran code. Slowdowns appear in larger data sets with factor-valued predictors. These are remedied by using compiled code to process the initial data.


HIV susceptibility data

Research goal is to find SNPs associated with resistance, either singly or interacting.

Performance goal is to accommodate this in a resampling framework with turnaround in hours, as opposed to days or even weeks.

100 subjects from sub-Saharan Africa, a region suspected to have genotypes conveying HIV resistance.

Raw data deeply-genotyped (40x), includes indels and repeats.

Observational data reduced to biallelic SNPs, with MAF threshold of 0.03.

SNPs coded as factor levels 1, 2 or 3: no mutation, heterozygous, homozygous.

Response is binary: diseased or disease-free.

Typically considering one chromosome at a time, > 10^5 SNPs.
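
A minimal sketch of that coding (one-byte storage is my assumption, consistent with the compact containers discussed later): map each genotype's alternate-allele count to factor level 1, 2 or 3.

  #include <cstdint>
  #include <vector>

  // altAlleleCount[i] in {0, 1, 2}: no mutation, heterozygous, homozygous.
  std::vector<std::uint8_t> codeSNP(const std::vector<int> &altAlleleCount) {
      std::vector<std::uint8_t> level(altAlleleCount.size());
      for (size_t i = 0; i < level.size(); i++)
          level[i] = (std::uint8_t)(altAlleleCount[i] + 1);   // factor levels 1, 2, 3
      return level;
  }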


Wide data: advantages and disadvantages

Highly underconstrained data set: 100 samples and up to one million predictors.

Not much potential for subtree-based parallelism: log2(100) < 7, so few levels in a balanced tree.

Potential for predictor-based parallelism is huge, however.

Noise contribution is extreme, so predictions’ significance is vetted using resampling.

Goal: derive resampled p-values in “acceptable” time.
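
One generic way to obtain such resampled p-values is a permutation test, sketched here under my own simplifications (importanceOf is a placeholder for retraining on the permuted response and scoring the predictor); this is not necessarily the team's exact procedure.

  #include <algorithm>
  #include <random>
  #include <vector>

  double permutationPValue(std::vector<int> y, double observedImportance, int nPerm,
                           std::mt19937 &rng,
                           double (*importanceOf)(const std::vector<int> &)) {
      int hits = 0;
      for (int p = 0; p < nPerm; p++) {
          std::shuffle(y.begin(), y.end(), rng);           // break the predictor/response link
          if (importanceOf(y) >= observedImportance) hits++;
      }
      return (hits + 1.0) / (nPerm + 1.0);                 // add-one avoids a p-value of zero
  }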


RbSNP: multicore reference version

Specialized version of Arborist, exploits compact predictor and response sizes: does not require general-purpose containers.

Reference implementation for University of Washington team. Used for timings andparameter selection.

Logistic regression confirms quality of predictor importance estimates.

Runs with randomly-chosen subsets of the data reveal that 10^4 trees is optimal.

Similar studies show a predictor-selection probability near 0.6 to be optimal, in accordance with published RF genomic studies.


RbSNP performance

Appears to be stable, accurate.

Suitable for apples-to-apples comparison with GPU version.

Linear scaling in tree count, predictor count and row size, as with Arborist.

Faster than Arborist, roughly three-fold for the data under consideration.

Data locality is better than Arborist’s, due to compaction.


Performance, cont.

Transfer from the front end, plus presort: can be hefty for a small tree count, but amortizes well after a few hundred trees.

Tree-walking (prediction) scales well with core count (OpenMP). GPU schemes have been devised and have been reported at GTC.

Gini computations scale well with core count. Note that there is a load-balancing challenge, given that the number of cells to traverse varies among the subtrees.

Restaging does not scale with core count. The problem may be attributable in part to a high TLB miss rate.

Restaging is also the biggest performance hurdle, roughly 80% of execution time.

Amdahl corollary: at 80%, the best speedup will be 5x. Need to look elsewhere for improvement, even if restaging’s cost shrank to zero.
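
Making that arithmetic explicit: if the fraction p = 0.8 of execution time is accelerated by a factor s, Amdahl's law bounds the overall speedup by

  S(s) = \frac{1}{(1 - p) + p/s}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p} = \frac{1}{0.2} = 5.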


GPU: strategy

Apply a series of “perturbations” to RbSNP, as is common for acceleration projects.

Learn from effects and derive new approaches accordingly.

Begin by exploiting predictor-level parallelism, as on CPU.

Subtrees restaged sequentially: no thread divergence, as offsets do not vary across predictors, only content.

Pass restaged data back to CPU for Gini, argmax, next level setup.
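
A hedged sketch of that predictor-parallel restaging for a single subtree; the layout and names are mine, not the Arborist's actual kernel. One thread handles one predictor, and since the loop bounds and child offsets are identical across predictors, threads run in lockstep; the only data-dependent step is the select of the destination slot.

  // order is column-major: order[pred * nSamp + i] is the i-th row of this subtree,
  // presorted by predictor pred. goesLeft[row] is the split's per-row decision.
  __global__ void restageKernel(const int *order, const char *goesLeft,
                                int *childOrder, int nSamp, int nPred,
                                int leftStart, int rightStart) {
      int pred = blockIdx.x * blockDim.x + threadIdx.x;
      if (pred >= nPred) return;
      int lIdx = leftStart, rIdx = rightStart;
      for (int i = 0; i < nSamp; i++) {
          int row  = order[pred * nSamp + i];
          int left = goesLeft[row];                 // same rows, different positions per predictor
          int dest = left ? lIdx : rIdx;            // branch-free destination select
          childOrder[pred * nSamp + dest] = row;
          lIdx += left;
          rIdx += 1 - left;
      }
  }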


Results, responses

Initially slower than the CPU: responded by incrementally moving functionality to the GPU.

As expected, best speedups came from eliminating per-level transfer of restaged data.

On-device Gini, argmax have very little effect. In particular, load balancing is at best a secondary concern at this point.

Finally, inter-level data structures (order 1KB) maintained on device to drive iteration.


Flow of control, per tree

Host-controlled loop until no “live” subtrees (sketched below):

Restage (if not top level) and Gini.

Argmax.

Single-threaded maintenance of inter-level information.

Report live-count to host.

End result: 15% speedup over RbSNP.

Initially surprising, as loads are coalesced and stores, though not coalesced, are non-blocking, and there is no reuse within the kernel.

Key criticism: heavy dependence on shared memory allows only 32-lane thread blocks.

Note: dynamic parallelism and stream/event construction will only be secondary players until this problem is solved.
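
A hedged sketch of this per-tree control flow; the kernel bodies are stubs and every name is a placeholder rather than the Arborist's API. The 32-thread launches echo the 32-lane limitation just noted.

  #include <cuda_runtime.h>

  __global__ void restageAndGini(int level, int *liveCount) { /* restage (if level > 0) and Gini, per predictor */ }
  __global__ void argmaxSplit(int *liveCount)                { /* argmax of Gini gain per live subtree */ }
  __global__ void nextLevel(int *liveCount) {
      if (threadIdx.x == 0) *liveCount = 0;   // single-threaded inter-level maintenance (stub: ends the loop)
  }

  void trainOneTree() {
      int *dLive;
      cudaMalloc(&dLive, sizeof(int));
      int live = 1;                                        // the root subtree is live
      cudaMemcpy(dLive, &live, sizeof(int), cudaMemcpyHostToDevice);
      for (int level = 0; live > 0; level++) {
          restageAndGini<<<128, 32>>>(level, dLive);       // restage (skipped at top level) and Gini
          argmaxSplit<<<1, 32>>>(dLive);                   // argmax
          nextLevel<<<1, 1>>>(dLive);                      // maintain inter-level information
          cudaMemcpy(&live, dLive, sizeof(int), cudaMemcpyDeviceToHost);   // report live count to host
      }
      cudaFree(dLive);
  }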


Rethink

Moving compact data packets (8 bytes) either right or left, according to the contents of a bit vector.

In the interest of data locality, L/R movement is confined to a fixed (shrinking) range: useful for the CPU, but not necessarily for the GPU.

If, e.g., all left subtrees are processed before the first right subtree, the problem maps precisely to an application of stable partitioning.

Key insight: on the GPU, can exploit well-tuned algorithms to perform restaging “horizontally”. That is, a parallel scan computes the target offset of each row and a scatter sends the data to the respective offsets.

Requires a transpose to reconcile column-major and row-major kernel orientations, but cuBLAS has highly-tuned transposes.
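
A hedged Thrust sketch of that scan-plus-scatter restaging for one column (my own simplification; the actual multi-predictor scheme would batch this across columns): an exclusive scan of the left/right flags yields each left-bound row's slot, positions minus that scan yield the right-bound slots, and a scatter completes the stable partition.

  #include <thrust/device_vector.h>
  #include <thrust/iterator/counting_iterator.h>
  #include <thrust/iterator/zip_iterator.h>
  #include <thrust/reduce.h>
  #include <thrust/scan.h>
  #include <thrust/scatter.h>
  #include <thrust/transform.h>
  #include <thrust/tuple.h>

  struct TargetOffset {
      int nLeft;
      __host__ __device__ int operator()(const thrust::tuple<int, int, int> &t) const {
          int flag     = thrust::get<0>(t);   // 1 = goes left, 0 = goes right
          int leftRank = thrust::get<1>(t);   // exclusive scan of flags: rank among left rows
          int i        = thrust::get<2>(t);   // position in the presorted column
          return flag ? leftRank : nLeft + (i - leftRank);
      }
  };

  void restageColumn(thrust::device_vector<int> &rows,         // row indices, presorted by predictor
                     const thrust::device_vector<int> &flag) { // per-position left/right flag
      int n = rows.size();
      int nLeft = thrust::reduce(flag.begin(), flag.end());

      thrust::device_vector<int> leftRank(n), target(n);
      thrust::exclusive_scan(flag.begin(), flag.end(), leftRank.begin());

      thrust::counting_iterator<int> first(0);
      thrust::transform(
          thrust::make_zip_iterator(thrust::make_tuple(flag.begin(), leftRank.begin(), first)),
          thrust::make_zip_iterator(thrust::make_tuple(flag.end(), leftRank.end(), first + n)),
          target.begin(), TargetOffset{nLeft});

      thrust::device_vector<int> out(n);
      thrust::scatter(rows.begin(), rows.end(), target.begin(), out.begin());  // send each row to its slot
      rows.swap(out);                                                          // left block first, order preserved
  }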


Revised flow of control, per tree

Transfer compact data as column-major. Loop until no live subtrees:

Out-of-place transpose (sketched below).

Restage as stable partition on transposed representation.

Gini on row-major representation, as before.

Argmax on row-major, as before.

Maintain inter-level representation.
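
For the transpose step, one plausible cuBLAS route is geam with a transposed operand, sketched here on floats although the compact packet type in question differs.

  #include <cublas_v2.h>

  // dA: m x n, column-major (leading dimension m); dAt receives the n x m transpose.
  void transposeOutOfPlace(cublasHandle_t handle, const float *dA, float *dAt, int m, int n) {
      const float one = 1.0f, zero = 0.0f;
      cublasSgeam(handle,
                  CUBLAS_OP_T, CUBLAS_OP_N,
                  n, m,                 // dimensions of the result, A^T
                  &one,  dA,  m,
                  &zero, dAt, n,
                  dAt, n);
  }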


Progress

On-device transpose on a 10^5-column synthetic benchmark adds about 2 ms per invocation: very little effect on overall timings.

Thrust has a single-vector implementation, but this approach requires a multi-vector one.

No way to extrapolate from Thrust performance, but better behavior than the sequential approach is expected, per Nvidia.

In any event, this scheme is worth pursuing.


Summary

Initialization poses a significant cost for larger data sets, but this tends to be amortized by tree count.

Prediction, while treatable by GPUs, is not a major contributor to run time.

Floating-point arithmetic and split determination are also not major contributors.

Subtree repartitioning is the major bottleneck for execution time.

At its core RF is an example of a stable partitioning problem.


Comments

Numerous opportunities exist to accelerate decision tree construction.

Which opportunities are best exploited depends critically on the characteristics of the data under consideration.

The high width and compressed, categorical nature of genomic data offers unique opportunities and challenges for the GPU.

Scalability, crucial for larger-scale genomic studies, is achievable for the techniques discussed.

Interpretation of results is still underway, but predictions are of sufficient quality to justify examination by researchers.


Future work

Finish the new kernel loop.

Investigate load balancing and computational overlap as secondary treatments, using dynamic parallelism.

Address the TLB miss rate for the CPU version by means of “huge” pages.

Implementation on multiple GPUs: essential for larger genomic data sets.

Blocking techniques to accommodate higher sample sizes and out-of-coprocessor memory loads.

Integrate with multi-node schemes to exploit cloud and data-center resources.


Acknowledgments

UW team: Jai Lingappam, Mary Emond, Romel Mackelprang, Xuan Lin Hou.

Mark Berger, NVIDIA, supplied K20 and K40 boards for advanced testing.

Insilicos lent K20 and C2075 boards for development.

Jace Mogill provided technical guidance.

Julien Demouth, Nvidia, offered technical advice.

Allison Jones and Anthony Santos, Nvidia, lent a laptop after my own failed.
