
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)


http://biglearn.org


Page 1: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-Performance Computing Needs Machine Learning... And Vice Versa
(was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”)

Nicolas Pinto
NIPS “Big Learning” | December 16th, 2011

The Rowland Institute at Harvard, Harvard University

Page 2: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC

Page 3: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC

Page 4: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Motivation...

Page 5: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Visual Object Recognition: The Problem

Page 6: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Why?

Page 7: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Why? It seems easy, right?

Page 8: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

44 years ago...

Page 9: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Visual Object Recognition: The Problem

Page 10: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Visual Object Recognition: The Problem

Page 11: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

fast

Visual Object Recognition: The Problem

Page 12: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

fast, accurate

Visual Object Recognition: The Problem

Page 13: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

fast, accurate, effortless

Visual Object Recognition: The Problem

Page 14: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

fast, accurate, effortless, critical to survival

Visual Object Recognition: The Problem

Page 15: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

fast, accurate, effortless, critical to survival

tolerant to variations!

Visual Object Recognition: The Problem

Page 16: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

hard?

Page 17: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

hard?

// the world is 3D but the retina is 2D

Page 18: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

hard?

// the world is 3D but the retina is 2D

// the curse of dimensionality

Page 19: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

hard?

// the world is 3D but the retina is 2D

// the curse of dimensionality

// considerable image variation

Page 20: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

~50% of the cortex is for vision!

Page 21: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

you (may have) learned it...

Page 22: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Background

Page 23: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The Approach: Reverse and Forward Engineering the Brain

Page 24: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The Approach: Reverse and Forward Engineering the Brain

Study Natural System (REVERSE) ⇄ Build Artificial System (FORWARD)

Page 25: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The Approach: Reverse and Forward Engineering the Brain

Study Natural System (REVERSE) ⇄ Build Artificial System (FORWARD)

Page 26: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Animation by Li N; Images by DiCarlo JJ & Cox DD

Reverse Engineering: The Ventral Visual Stream

Page 27: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Animation by Li N; Images by DiCarlo JJ & Cox DD

Reverse Engineering: The Ventral Visual Stream

Page 28: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Reverse Engineering: The Ventral Visual Stream

brain = 20 petaflops ?!

Page 29: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The Approach: Reverse and Forward Engineering the Brain

Study Natural System (REVERSE) ⇄ Build Artificial System (FORWARD)

Page 30: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Forward Engineering: The Ventral Visual Stream

all about learning ???

Page 31: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

[Model schematic (layers L1, L2) with per-layer parameters: number of filters, kernel size, normalization neighborhood, norm strength, threshold/saturation; Learning: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]

Page 32: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Page 33: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Usual Formula:

Page 34: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Usual Formula:

1) One grad student

Page 35: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Usual Formula:

1) One grad student
2) One Model (size limited by runtime)

Page 36: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Usual Formula:

1) One grad student
2) One Model (size limited by runtime)
3) Performance numbers on a few standard test sets

Page 37: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Usual Formula:

1) One grad student
2) One Model (size limited by runtime)
3) Performance numbers on a few standard test sets
4) yay. we. rock.

Page 38: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How are things done normally?

Usual Formula:

1) One grad student
2) One Model (size limited by runtime)
3) Performance numbers on a few standard test sets
4) yay. we. rock.
5) One Ph.D.

Page 39: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What do you call this?

“This is graduate student descent” - David McAllester

Page 40: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What do you call this?

“This is graduate student descent” - David McAllester

Page 41: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What’s better than this?

“Conjugate graduate student descent?” - Nicolas Poilvert

Page 42: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

Page 43: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

1) One grad student

Page 44: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

1) One grad student
2) One → Hundreds of Thousands of BIG Models

Page 45: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

1) One grad student
2) One → Hundreds of Thousands of BIG Models
3) Performance numbers on a few standard test sets

Page 46: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

1) One grad student
2) One → Hundreds of Thousands of BIG Models
3) Performance numbers on a few standard test sets

Page 47: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

1) One grad student
2) One → Hundreds of Thousands of BIG Models
3) Performance numbers on a few standard test sets

4) yay. we. rock.

Page 48: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Doing things a little bit differently

1) One grad student
2) One → Hundreds of Thousands of BIG Models
3) Performance numbers on a few standard test sets
4) yay. we. rock.
5) Hundreds of Thousands → One PhD?

Page 49: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Linus Pauling (double Nobel Prize winner):

“If you want to have good ideas you must have many ideas. Most of them will be wrong, and what you have to learn is which ones to throw away.”

Page 50: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Page 51: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Page 52: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-throughput Screening

Page 53: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

[Model schematic: input → L1 → L2 → L3 → read-out, with per-layer parameters: number of filters, kernel size, normalization neighborhood, norm strength, threshold/saturation; Learning: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]

large family of brain-inspired models, very inclusive!

Pinto, Doukhan, DiCarlo, Cox PLoS 2009

52 parameters: more than 10^25 possible unique combinations!
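To make that scale concrete, here is a minimal sketch (my illustration, with hypothetical parameter names and value sets, not the actual 52 parameters of the PLoS 2009 model) of how such a discrete model space can be enumerated and randomly sampled:

# Illustrative sketch only: sampling random model configurations from a large
# discrete hyperparameter space, in the spirit of high-throughput screening.
import random

SEARCH_SPACE = {
    # per-layer structural parameters (hypothetical examples)
    "l1_n_filters":   [16, 32, 64, 128],
    "l1_kernel_size": [3, 5, 7, 9],
    "l2_n_filters":   [16, 32, 64, 128, 256],
    "l2_kernel_size": [3, 5, 7, 9],
    # nonlinearity / normalization parameters (hypothetical examples)
    "norm_neighborhood": [3, 5, 7, 9],
    "norm_strength":     [0.1, 1.0, 10.0],
    "threshold":         [0.0, 0.5, 1.0],
    "saturation":        [None, 1.0, 2.0],
    # learning-rule parameters (hypothetical examples)
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "trace_factor":  [0.0, 0.1, 0.5],
}

def sample_model(rng=random):
    """Draw one random model configuration (one point in the search space)."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

if __name__ == "__main__":
    n_combinations = 1
    for values in SEARCH_SPACE.values():
        n_combinations *= len(values)
    print("unique combinations in this toy space:", n_combinations)
    print("example configuration:", sample_model())

With the real parameter set the product of the value counts is what blows up past 10^25, which is why the models have to be screened rather than hand-tuned.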

Page 54: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The curse of speed

Page 55: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

thousands of big models

The curse of speed

Page 56: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

thousands of big models

The curse of speed

large amounts of unsupervised learning experience

Page 57: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The curse of speed... and the blessing of massively parallel computing

No off-the-shelf solution? DIY!

Engineering (Hardware/SysAdmin/Software) Science

Page 58: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The curse of speed... and the blessing of massively parallel computing

No off-the-shelf solution? DIY!

Engineering (Hardware/SysAdmin/Software) Science

Leverage non-scientific high-tech markets and their $billions of R&D...

Gaming: Graphics Cards (GPUs), PlayStation 3

Web 2.0: Cloud Computing (Amazon, Google)

Page 59: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Build your own!

Page 60: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The blessing of GPUs

[Chart: peak GFLOP/s (computational power) of GPUs vs. CPUs over time]

DIY GPU pr0n (since 2006), Sony PlayStation 3s (since 2007)

Page 61: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

speed (in billion floating point operations per second):

Q9450 (Matlab/C) [2008]: 0.3
Q9450 (C/SSE) [2008]: 9.0
7900GTX (OpenGL/Cg) [2006]: 68.2
PS3/Cell (C/ASM) [2007]: 111.4
8800GTX (CUDA1.x) [2007]: 192.7
GTX280 (CUDA2.x) [2008]: 339.3
GTX480 (CUDA3.x, Fermi) [2010]: 974.3

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011

Page 62: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

>1000X speedup is game changing...

(same speed chart as above: from 0.3 GFLOPS for Q9450 Matlab/C to 974.3 GFLOPS for GTX480/Fermi)

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011

Page 63: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-throughput Screening: Skimming off the best models

[Histogram: count of N=2500 models vs. performance (%), from chance (50%) up to 100%, with a "stupid baseline" marked]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009

Page 64: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(same histogram as above)

Page 65: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(same histogram as above)

Page 66: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-throughput Screening: Validate on other tasks

[Bar chart: top 5 high-throughput models (~90%) vs. state-of-the-art from the literature, including V1-like and "HMAX 2.1" (~80%)]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009

Page 67: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(same chart as above)

Page 68: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(same chart as above)

Page 69: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(same chart as above)

Page 70: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-throughput Screening: Validate on faces

[Bar chart: top 5 high-throughput models and their blend vs. state-of-the-art from the literature: SIFT, GB, PHOG, PHOW, "HMAX 2.1", V1-like]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009

Page 71: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Human vs. Machine: 8-way object categorization

baseline: 31.3%, best model: 64%, best human: 99.1% (chance: 12.5%)

Page 72: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What have we learned? What does it all mean?

briefly...

Page 73: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What have we learned? What does it all mean?

[Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); each layer: Filter (φ1 ... φk) → Threshold & Saturate → Pool → Normalize]

➡ dimensionality: more filters is better
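To make the pipeline concrete, here is a minimal NumPy sketch of one such layer (an illustrative assumption, not the authors' implementation; shapes, constants, and the exact placement of the normalization step are arbitrary):

# Minimal, illustrative NumPy sketch of one layer: Filter (bank of k kernels
# phi_1..phi_k) -> Threshold & Saturate -> Pool -> local Normalize.
import numpy as np

def filterbank(x, filters):
    """'Valid' 2D correlation of a grayscale image with a bank of kernels."""
    k, fh, fw = filters.shape
    h, w = x.shape
    out = np.empty((k, h - fh + 1, w - fw + 1), dtype=np.float32)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[i:i + fh, j:j + fw]
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    return out

def threshold_saturate(x, lo=0.0, hi=1.0):
    """Clip activations from below (threshold) and above (saturation)."""
    return np.clip(x, lo, hi)

def pool(x, size=2):
    """Non-overlapping spatial sum-pooling by a factor of `size`."""
    k, h, w = x.shape
    h, w = (h // size) * size, (w // size) * size
    x = x[:, :h, :w].reshape(k, h // size, size, w // size, size)
    return x.sum(axis=(2, 4))

def normalize(x, eps=1e-6):
    """Divisive normalization across the filter dimension at each location."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True)) + eps
    return x / norm

def layer(x, filters):
    return normalize(pool(threshold_saturate(filterbank(x, filters))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64), dtype=np.float32)             # grayscale input
    phi = rng.standard_normal((16, 5, 5)).astype(np.float32)   # 16 random filters
    print(layer(image, phi).shape)                              # (16, 30, 30)

Stacking three such layers and feeding the output to a linear SVM gives the L1/L2/L3 architecture sketched above; the screened parameters control the filter counts, kernel sizes, and the nonlinearity/normalization constants.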

Page 74: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What have we learned? What does it all mean?

(same pipeline diagram as above)

➡ learning is difficult

Page 75: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What have we learned? What does it all mean?

(same pipeline diagram as above)

➡ non-linearities are important: Threshold & Saturate, Pool, Normalize

Page 76: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What have we learned? What does it all mean?

(same pipeline diagram as above)

➡ normalization is very important
missed in previous modeling efforts; now confirmed by LeCun et al., Poggio et al., Ng et al.

Page 77: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What are these models not good for?

objects, faces, backgrounds, low level

Page 78: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC

Page 79: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

iPhD: one more thing

Page 80: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Real-world apps? Testing the generality and scalability of the approach

Page 81: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

collab. with Zak Stone & Todd Zickler @ Harvard

Facebook: Really Real World Problem

3TB+ uploaded every day
billions of photos
dense, collaborative face labels
enormous scale

Page 82: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Relevance to Social Networking

slide courtesy of David Cox

Page 83: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Relevance to Social Networking

Page 84: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Page 85: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-throughput Screening

Page 86: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

High-Throughput Screening

[Chart: LFW View 1 performance of the top 5 models, HT L3s (3 layers)]

Labeled Faces in the Wild (LFW) View 1: > 30,000 large-scale models (1 to 3 layers) screened in only 3 days

No Unsupervised Learning!

Pinto, Cox (FG 2011); Pinto, Stone, Zickler, Cox (CVPR 2011)

Page 87: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Generalization: Performance on LFW View 2 (hold out)

Face Verification Performance (% correct):
V1-like: 79.4
Kumar et al. ICCV 2009: 85.3
Wolf et al. ACCV 2009 (face.com): 86.8
Ours (HT): 88.1

Pinto, Cox (FG 2011)

Page 88: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

collab. with Zak Stone & Todd Zickler @ Harvard

“Facebook100”: typical social network size?

Pinto, Stone, Zickler, Cox (CVPR 2011)

Page 89: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

collab. with Zak Stone & Todd Zickler @ Harvard

> 86% accurate (w/ 90 training examples)

Auto-tagging a network of 100 Facebook friends

Pinto, Stone, Zickler, Cox (CVPR 2011)

Page 90: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Page 91: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

vs. face.com: comparison with a heavily-specialized commercial system

[Plot: performance (% correct) vs. training example(s) per friend, comparing V1-like (one layer), L3 (hardware-accelerated brute-force random model), and face.com (best technology around)]

Pinto, Stone, Zickler, Cox (CVPR 2011)

Page 92: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Conclusion?

Page 93: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

picture courtesy of Koray Kavukcuoglu

Hardware Matters !

Yann LeCun’s Mac

Page 94: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC

Page 95: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

Two conflicting requirements

Page 96: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

Two conflicting requirements

FAST

Page 97: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

Two conflicting requirements

FAST

FLEXIBLE

Page 98: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

How to optimize?

Two conflicting requirements

FAST

FLEXIBLE

Page 99: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Page 100: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What’s the bottleneck?

Page 101: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

3D Filterbank Convolutions!

Page 102: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Our answer?

Page 103: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Meta-programming!

Page 104: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Meta-programming: What?

Page 105: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Meta-programming!

Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW):
• Dynamically compile specialized versions of the same kernel for different conditions
• Empirical run-time tuning
• For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.

Page 106: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Meta-programming!

“Instrument” your solutions:
• Block size
• Work size
• Loop unrolling
• Pre-fetching
• Spilling
• etc.

... and let the computer generate → find the optimal code
Page 107: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How?

Page 108: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Always use the right tool!

Page 109: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Page 110: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

__global__ void convolve_beta_j${j}(float4 *input, float4 *output)
{
#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
  __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
  if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
    shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
    shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
    shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
    shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
  }
#end for

Templating
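The #for / #set / ${...} directives above are Cheetah template syntax; a minimal sketch of how such a template might be rendered from Python (my assumption based on that syntax; the file name and parameter values are illustrative, not the tuned ones):

# Sketch: render the CUDA template with Cheetah (the original templates use
# xrange, i.e. Python 2-era code). Values below are placeholders.
from Cheetah.Template import Template

params = {
    "FILTER_H": 4, "FILTER_W": 4, "FILTER_D": 4, "N_FILTERS": 4,
    "BLOCK_W": 128, "LOAD_ITERATIONS": 2,
}

cuda_src = str(Template(file="conv_kernel_template.cu", searchList=[params]))

with open("conv_kernel_4x4x4.cu", "w") as f:
    f.write(cuda_src)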

Page 111: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Compilation? (with Python-based solutions)

Page 112: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

PyCUDA/PyOpenCL (by Andreas Klöckner)

Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
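A minimal PyCUDA sketch of the run-time compile-and-launch step (an assumption of typical PyCUDA usage, not the talk's code; requires pycuda, numpy, and a CUDA device; the kernel and its factor parameter are illustrative):

# Generate CUDA C at run time from a Python-side template, JIT-compile it with
# SourceModule, and launch it.
import numpy as np
import pycuda.autoinit            # creates a context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

KERNEL_TEMPLATE = """
__global__ void scale(float *out, const float *in)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // the constant is baked into the source at code-generation time
    out[i] = %(factor)ff * in[i];
}
"""

def build_kernel(factor):
    src = KERNEL_TEMPLATE % {"factor": factor}   # the "meta-programming" step
    return SourceModule(src).get_function("scale")

if __name__ == "__main__":
    n = 1024
    a = np.random.randn(n).astype(np.float32)
    out = np.empty_like(a)
    scale = build_kernel(2.0)                    # specialized, compiled at run time
    scale(drv.Out(out), drv.In(a), block=(256, 1, 1), grid=(n // 256, 1))
    assert np.allclose(out, 2.0 * a)
    print("ok")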

Page 113: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems], Pinto N, Cox DD

Page 114: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(conv_kernel_template.cu: same template as above)

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)extern "C" {

__global__ void convolve_beta_j0(float4 *input, float4 *output) {

__shared__ float shared_in[131][4+1];

// -- input/output offsets const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4;

// -- load input to shared memory { input_v4 = tex1Dfetch(tex_float4, in_idx+128*0); shared_in[threadIdx.x+128*0][0] = input_v4.x; shared_in[threadIdx.x+128*0][1] = input_v4.y; shared_in[threadIdx.x+128*0][2] = input_v4.z; shared_in[threadIdx.x+128*0][3] = input_v4.w; } if((threadIdx.x+128*1)<131) { input_v4 = tex1Dfetch(tex_float4, in_idx+128*1); shared_in[threadIdx.x+128*1][0] = input_v4.x; shared_in[threadIdx.x+128*1][1] = input_v4.y; shared_in[threadIdx.x+128*1][2] = input_v4.z; shared_in[threadIdx.x+128*1][3] = input_v4.w; } __syncthreads();

// -- compute dot products float v, w;

float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0; v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w; // -- store output output[out_idx].x += sum0; output[out_idx].y += sum1; output[out_idx].z += sum2; output[out_idx].w += sum3; }

__global__ void convolve_beta_j1(float4 *input, float4 *output) {

__shared__ float shared_in[131][4+1];

// -- input/output offsets const uint in_idx = (blockIdx.y+1)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4;

// -- load input to shared memory { input_v4 = tex1Dfetch(tex_float4, in_idx+128*0); shared_in[threadIdx.x+128*0][0] = input_v4.x; shared_in[threadIdx.x+128*0][1] = input_v4.y; shared_in[threadIdx.x+128*0][2] = input_v4.z; shared_in[threadIdx.x+128*0][3] = input_v4.w; } if((threadIdx.x+128*1)<131) { input_v4 = tex1Dfetch(tex_float4, in_idx+128*1); shared_in[threadIdx.x+128*1][0] = input_v4.x; shared_in[threadIdx.x+128*1][1] = input_v4.y; shared_in[threadIdx.x+128*1][2] = input_v4.z; shared_in[threadIdx.x+128*1][3] = input_v4.w; } __syncthreads();

// -- compute dot products float v, w;

float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0; v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w; // -- store output output[out_idx].x += sum0; output[out_idx].y += sum1; output[out_idx].z += sum2; output[out_idx].w += sum3; }

__global__ void convolve_beta_j2(float4 *input, float4 *output) {

__shared__ float shared_in[131][4+1];

// -- input/output offsets const uint in_idx = (blockIdx.y+2)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4;

// -- load input to shared memory { input_v4 = tex1Dfetch(tex_float4, in_idx+128*0); shared_in[threadIdx.x+128*0][0] = input_v4.x; shared_in[threadIdx.x+128*0][1] = input_v4.y; shared_in[threadIdx.x+128*0][2] = input_v4.z; shared_in[threadIdx.x+128*0][3] = input_v4.w; } if((threadIdx.x+128*1)<131) { input_v4 = tex1Dfetch(tex_float4, in_idx+128*1); shared_in[threadIdx.x+128*1][0] = input_v4.x; shared_in[threadIdx.x+128*1][1] = input_v4.y; shared_in[threadIdx.x+128*1][2] = input_v4.z; shared_in[threadIdx.x+128*1][3] = input_v4.w; } __syncthreads();

// -- compute dot products float v, w;

float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0; v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w; // -- store output output[out_idx].x += sum0; output[out_idx].y += sum1; output[out_idx].z += sum2; output[out_idx].w += sum3; }

__global__ void convolve_beta_j3(float4 *input, float4 *output) {

__shared__ float shared_in[131][4+1];

// -- input/output offsets const uint in_idx = (blockIdx.y+3)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x; const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x; float4 input_v4;

// -- load input to shared memory { input_v4 = tex1Dfetch(tex_float4, in_idx+128*0); shared_in[threadIdx.x+128*0][0] = input_v4.x; shared_in[threadIdx.x+128*0][1] = input_v4.y; shared_in[threadIdx.x+128*0][2] = input_v4.z; shared_in[threadIdx.x+128*0][3] = input_v4.w; } if((threadIdx.x+128*1)<131) { input_v4 = tex1Dfetch(tex_float4, in_idx+128*1); shared_in[threadIdx.x+128*1][0] = input_v4.x; shared_in[threadIdx.x+128*1][1] = input_v4.y; shared_in[threadIdx.x+128*1][2] = input_v4.z; shared_in[threadIdx.x+128*1][3] = input_v4.w; } __syncthreads();

// -- compute dot products float v, w;

float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0; v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w; v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w; v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w; v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w; v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w; // -- store output output[out_idx].x += sum0; output[out_idx].y += sum1; output[out_idx].z += sum2; output[out_idx].w += sum3; }

}

conv_kernel_template.cu → conv_kernel_4x4x4.cu

Page 115: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

(conv_kernel_template.cu: same template as above)

conv_kernel_template.cu → conv_kernel_4x4x4.cu (20 kB), conv_kernel_8x8x4.cu (64 kB)

Page 116: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Benefits?

Page 117: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Smooth syntactic ugliness

Page 118: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• loop unrolling (possibly fine-controlled)

Page 119: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• fine-controlled loop unrolling / jamming

(... the fully unrolled, generated kernels shown above ...)

Page 120: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

How about #pragma unroll? (why don’t you trust the compiler?)

Page 121: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Using GPUs for Signal

Correlation

Michael Clark, with Paul La Plante and Lincoln Greenhill

The Murchison Widefield Array

Daniel A. Mitchell

[Embedded excerpt from an MWA paper (D.A. Mitchell et al.): Figure 3 shows the J2107-2526 field imaged by integrating 8-second snapshots without blanking vs. the same field after RFI blanking and peeling; during deep integrations the MWA real-time system discards dubious data using data-quality tests such as a simple median-based detector.]


IICS‘2011

we are not alone....

Don’t trust compilers

• Compare these “identical” code fragments:

a += b*c + d*c + e*f + g*h;

a += b*c;
a += d*c;
a += e*f;
a += g*h;

One of them runs at 1020 GFLOPS, the other at 770 GFLOPS.
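The lesson generalizes: rather than trusting the compiler with a fragment like this, generate both variants and time them on the actual device. Below is a minimal PyCUDA sketch of such a micro-benchmark; the kernel body, iteration count, and launch configuration are illustrative assumptions, not the benchmark behind the numbers above.

import numpy as np
import pycuda.autoinit            # creates a CUDA context as a side effect
import pycuda.driver as drv
from pycuda.compiler import SourceModule

VARIANTS = {
    "fused": "a += b*c + d*c + e*f + g*h;",
    "split": "a += b*c; a += d*c; a += e*f; a += g*h;",
}

KERNEL = """
__global__ void bench(float *out, const float *src, int n_iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = src[i], c = b + 1.0f, d = b + 2.0f,
          e = b + 3.0f, f = b + 4.0f, g = b + 5.0f, h = b + 6.0f;
    for (int k = 0; k < n_iter; ++k) { %s }
    out[i] = a;   // keep the result live so the loop is not optimized away
}
"""

n = 1 << 20
src = np.random.rand(n).astype(np.float32)
out = np.empty_like(src)

for name, body in VARIANTS.items():
    fn = SourceModule(KERNEL % body).get_function("bench")
    start, stop = drv.Event(), drv.Event()
    start.record()
    fn(drv.Out(out), drv.In(src), np.int32(256),
       block=(256, 1, 1), grid=(n // 256, 1))
    stop.record()
    stop.synchronize()
    print("%-5s %.3f ms" % (name, stop.time_since(start)))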


Page 122: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• variable-length argument lists
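For example, a Cheetah template can emit a kernel whose argument-list length is itself a meta-parameter, something a single CUDA C source file cannot express. A minimal hedged sketch (the kernel and the N_INPUTS parameter are illustrative, not from the chapter):

from Cheetah.Template import Template

kernel_tmpl = r"""
__global__ void sum_inputs(float *out
#for i in range($N_INPUTS)
    , const float *in${i}
#end for
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
#for i in range($N_INPUTS)
    acc += in${i}[idx];
#end for
    out[idx] = acc;
}
"""

# Render the template with a concrete argument count (the chapter's own
# templates use xrange under Python 2; range works in both).
print(Template(kernel_tmpl, searchList=[{"N_INPUTS": 3}]))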

Page 123: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• index un-indexable resources (e.g. regs)
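Registers are the canonical case: CUDA C cannot index a register array with a runtime value without risking a spill to local memory, but a template can “index” registers by generating their names, exactly like the sum0...sum3 variables in the generated kernel shown earlier. A small hedged sketch (N_SUMS is an illustrative parameter):

from Cheetah.Template import Template

accum_tmpl = r"""
#for i in range($N_SUMS)
float sum${i} = 0.0f;
#end for
v = shared_in[threadIdx.x][0];
#for i in range($N_SUMS)
w = constant[0][0][${i}]; sum${i} += v*w;
#end for
"""

print(Template(accum_tmpl, searchList=[{"N_SUMS": 4}]))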

Page 124: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Explore design decision space more freely

Page 125: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study

in Biologically-Inspired Machine Vision

[GPU Computing Gems]

Pinto N, Cox DD

Page 126: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

too many optimizations?

caching, bank conflicts, coalescing, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy, loop unrolling, ...

Page 127: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

can’t decide ?

keep them all !

Page 128: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Exploring design decision space more freely

Meta-programming:

• enables efficient learning of the GPU hardware/software

• allows full exploitation of the GPU architecture

Page 129: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

__global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
    }
#end for

conv_kernel_beta_template.cu

version A:

...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B:

...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

2x faster... Why ?
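In practice the meta-programming system reduces to a render/compile/benchmark loop over meta-parameters. A hedged sketch of that loop is shown below; the parameter grid, the fixed values, and the (elided) timing harness are illustrative assumptions, and it presumes the full template file on disk, of which only an excerpt appears above.

import itertools
import pycuda.autoinit                     # creates a CUDA context
from pycuda.compiler import SourceModule
from Cheetah.Template import Template

template_src = open("conv_kernel_beta_template.cu").read()

fixed = {"FILTER_H": 9, "FILTER_W": 9, "FILTER_D": 4, "N_FILTERS": 4}
grid = {"BLOCK_W": [64, 128, 256], "LOAD_ITERATIONS": [1, 2]}

for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values), **fixed)
    source = str(Template(template_src, searchList=[params]))
    module = SourceModule(source, no_extern_c=True)   # the template supplies its own extern "C"
    kernel = module.get_function("convolve_beta_j0")
    # launch `kernel` on representative data, time it with CUDA events,
    # and keep the fastest (params, runtime) pair -- elided here.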

Page 130: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Results

Page 131: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

speed (in billion floating point operations per second):

Q9450 (Matlab/C) [2008]            0.3
Q9450 (C/SSE) [2008]               9.0
7900GTX (OpenGL/Cg) [2006]        68.2
PS3/Cell (C/ASM) [2007]          111.4
8800GTX (CUDA1.x) [2007]         192.7
GTX280 (CUDA2.x) [2008]          339.3
GTX480 (CUDA3.x, Fermi) [2010]   974.3

>1000X speedup is game changing...

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011

Page 132: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Analysis

33.4 Final Evaluation

Table 33.1 (Continued)

GPU/SDK                              Input          Filter-Bank    Default (gflops)     Auto-Tuned (gflops)    Boost
GTX580 CUDA3.2-i7-3.2-Gentoo         256x256x8      64x9x9x8       617.446 ± 1.340      669.465 ± 0.856        8.4%
                                     512x512x4      32x13x13x4     979.426 ± 3.037      1074.493 ± 0.660       9.7%
                                     1024x1024x8    16x5x5x8       745.302 ± 0.111      763.134 ± 0.278        2.4%
                                     2048x2048x4    4x8x8x4        479.679 ± 0.071      903.934 ± 0.714        88.4%
GTX580 CUDA3.2-i7-2.66-Ubuntu-10.4   256x256x8      64x9x9x8       585.959 ± 1.240      585.959 ± 1.240        0.0%
                                     512x512x4      32x13x13x4     947.092 ± 2.903      1035.999 ± 0.849       9.4%
                                     1024x1024x8    16x5x5x8       726.412 ± 0.398      744.973 ± 0.571        2.6%
                                     2048x2048x4    4x8x8x4        474.681 ± 0.160      887.974 ± 1.017        87.1%

Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware Platforms, Including Performance Tuned on One Platform and Run on the Other

                Optimized for:
Run on:         9400M       GTX480      Tuning Speedup
9400M           0.32s       2.52s       675%
GTX480          0.016s      0.011s      52%

Large performance gains are observed for the auto-tuned meta-kernels as compared to the “default” parameter set, which was hand-picked to allow correct execution of all input ranges on all GPUs, without running up against hardware limitations.

Interestingly, we note that a different peak-performing meta-parameter set was chosen for each input size for each hardware platform. Given the many demands on system resources that trade off against each other, a different “sweet-spot” implementation exists for different incoming inputs and for different combinations of hardware resources. To illustrate this point, in Table 33.2 we show the performance with the best auto-tuned parameters for two different hardware platforms (a 9400M laptop-grade GPU, and a GTX480 desktop GPU), as well as the performance for each if the parameter sets were swapped (i.e., if we tuned on the 9400M and ran on the GTX480, and vice versa). In all cases, best parameter sets were chosen using half of the time trials, and the median performances shown in the table were computed using the remaining trials. We see large differences in performance (in some cases over 100%) when a custom hardware auto-tuned kernel is used, as compared to when an optimal kernel for a different platform is used. Such performance differences are particularly important when development is done on a different machine (e.g., a laptop) than where the code will be run in production mode. Similarly, for applications that are widely deployed on a variety of user hardware, optimal performance can be achieved by either optimizing in situ or shipping with a database of parameter sets for different platforms.

➡Different hardware ?

Page 133: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Analysis

Table 33.3 Performance of Auto-Tuned Implementations on Two Input Configurations, Including Performance Tuned for One Configuration and Run with the Other

                Optimized for:
Run on:         Config1     Config2     Tuning Speedup
config1         11.1ms      15.7ms      41%
config2         fails       10.8ms      not comparable

Similarly, in Table 33.3 we show the effect of tuning on one input configuration and running on another. Again, significant speedups are obtained using kernels tailored to a specific input configuration, as opposed to generic kernels optimized under different conditions. Without metaprogramming, hand-tuning for each of the many hardware configurations in existence and for many different input configurations would be a tedious and error-prone process. By contrast, template metaprogramming in combination with a simple auto-tuning scheme allows optimal implementations to be chosen for any platform and input size.

33.5 FUTURE DIRECTIONS

Above, we have demonstrated how writing kernel templates, rather than complete kernels per se, can result in cleaner, more readable code and can provide a coherent framework for exploring the interactions of many implementation decisions. In our auto-tuning example code, we show a straightforward implementation of a brute-force auto-tuning approach, in which we grid search a large number of combinations and permutations of template parameters and auto-benchmark. While this brute-force search procedure leads to surprisingly good results despite its simplicity, it clearly becomes suboptimal as the number of template parameters increases. Thus, an important future direction is the application of more intelligent optimization algorithms (e.g., decision trees, simplex search, simulated annealing, genetic algorithms, or other derivative-free de-randomized methods such as Covariance Matrix Adaptation) to more efficiently search the space of possible implementations.

Acknowledgments

This work was funded by the Rowland Institute of Harvard, the NVIDIA Graduate Fellowship, and the National Science Foundation (IIS 0963668). Hardware support was generously provided by the NVIDIA Corporation. The authors would also like to thank Andreas Kloeckner for providing and supporting the PyCuda package and Cliff Woolley for helpful comments on earlier versions of this chapter.

➡Different input configurations

Page 134: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Page 135: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Meta-programming:

Page 136: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Meta-programming:

• can assist exploration and manual optimization

Page 137: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Meta-programming:

• can assist exploration and manual optimization

• can de-clutter highly-optimized code

Page 138: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Meta-programming:

• can assist exploration and manual optimization

• can de-clutter highly-optimized code

• is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)

Page 139: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Meta-programming:

• can assist exploration and manual optimization

• can de-clutter highly-optimized code

• is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)

➡ helps get drastic speed-ups !

Page 140: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Summary

Meta-programming:

• can assist exploration and manual optimization

• can de-clutter highly-optimized code

• is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)

➡ helps get drastic speed-ups !
➡ facilitates “auto-tuning” !

Page 141: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC

Page 142: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Intelligent and fast Auto-Tuning with Machine Learning

with James Bergstra and David Cox

Page 143: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Intelligent and fast Auto-Tuning with Machine Learning

Page 144: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

Page 145: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

• Analytical model-based optimization:

Page 146: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”

Page 147: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

Page 148: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:

Page 149: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build

Page 150: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)

Page 151: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Empirical Auto-Tuning

The goal is to empirically optimize execution time given both

• the environment

- hardware (GPU, CPU, Memory, Mobo, etc.)

- software (SDK, Compiler suite, etc.)

• the data (input dimensions, repetitions, etc.)
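Concretely, every auto-tuning trial can be logged as one record that pairs these environment and data descriptors with the chosen meta-parameters and the measured runtime. The field names in the sketch below are illustrative assumptions, not a fixed schema:

# One hypothetical benchmark record for empirical auto-tuning.
record = {
    # environment: hardware
    "gpu_name": "GTX 580", "sm_count": 16, "shared_mem_per_block": 49152,
    # environment: software
    "cuda_sdk": "3.2", "host": "i7-3.2-Gentoo",
    # data
    "input_shape": (512, 512, 4), "filter_bank": (32, 13, 13, 4), "n_repeats": 8,
    # meta-parameters being tuned
    "block_w": 128, "load_iterations": 2,
    # measurement
    "runtime_ms": 2.31,
}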

Page 152: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Empirical Auto-Tuning with Meta-programming

GPU Meta-Programming: A Case Study

in Biologically-Inspired Machine Vision

[GPU Computing Gems]

Pinto N, Cox DD

Page 153: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Intelligent and fast Auto-Tuning with Machine Learning

Page 154: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: best of both approaches ?

Page 155: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:

Page 156: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:

- pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)

Page 157: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:

- pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)

- cons: unexplored !

Page 158: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:

- pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)

- cons: unexplored !

* could be dominant in specialized libraries (e.g. machine learning!)

Page 159: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Fast Machine Learning-based Runtime Auto-Tuning

ML-based

Page 160: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Fast Machine Learning-based Runtime Auto-Tuning

Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees


ABSTRACT

The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as “auto-tuning” has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically-derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) can achieve an average of 95% of the speed-up brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.

1. INTRODUCTION

Due to power consumption and heat dissipation concerns, scientific applications have shifted from computing platforms where performance had been primarily driven by rises in the clock frequency of a single “heavy-weight” processor (with complex out-of-order control and cache structures) to a platform with ever increasing numbers of “light-weight” cores. Interestingly, this shift is now not only relevant to computational sciences but to the development of all computer systems: from ubiquitous consumer-facing devices (e.g. phones) to high-end computer farms for web-scale applications (e.g. social networks).

Although the future lies in low-power multi-core hardware designs, the field lacks consensus on exactly how the different subsystems (memory, communication and computation) should be efficiently integrated, modeled and programmed. These systems have exhibited varying degrees of memory hierarchy and multi-threading complexity and, as a consequence, they have been increasingly relying on flexible but low-level software-controlled cache management and parallelism [Asanovic et al., 2006] in order to better control and understand the various trade-offs among performance, reliability, energy efficiency, production costs, etc. This evolution has profoundly altered the landscape of application development: programmers are now facing a wide diversity of low-level architectural issues that must be carefully balanced if one is to write code that is both high-performance and portable.

1.1 Motivation

In this rapidly evolving landscape, the construction of general development tools and libraries that fully utilize system resources remains a daunting task. Even within specialized architectures from the same vendor, such as NVIDIA’s Graphics Processing Units (GPUs) and the Compute Unified Device Architecture (CUDA) [Nickolls et al., 2008, NVIDIA, 2011], many developers default to massive amounts of manual labor to optimize CUDA code to specific input domains. In addition, hand-tuning rarely generalizes well to new hardware generations or different input domains, and it can also be error-prone or far from optimal. One of the reasons is that kernels can produce staggeringly large optimization spaces [Datta et al., 2008]. The problem is further compounded by the fact that these spaces can be highly discontinuous [Ryoo et al., 2008], difficult to explore, and quasi-optimal solutions lie at the edge of “performance cliffs” induced by hard device-specific constraints (e.g. register file size or low-latency cache size).

1.2 Auto-Tuning

One strategy for addressing these challenges is to use one of a variety of automatic methods known collectively as “auto-tuning.” Two major auto-tuning approaches have emerged in the extensive literature covering the subject (see surveys in e.g. [Vuduc et al., 2001, Demmel et al., 2005, Vuduc et al., 2005, Williams, 2008, Datta et al., 2008, Cavazos, 2008, Li et al., 2009, Park et al., 2011]): analytical model-driven optimization and empirical optimization [Yotov et al., 2003]. The model-driven optimization approach uses analytical abstractions to model the hardware architectures, in order


James Bergstra, Nicolas Pinto, David Cox [submitted]
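A hedged sketch of the idea in the abstract above: fit a boosted-regression-tree timing model on previously benchmarked (meta-parameters, runtime) pairs, then search the model, not the GPU, when a new configuration arrives. The feature files, hyper-parameters, and greedy search below are illustrative assumptions, not the paper's exact setup.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: one row per benchmarked kernel variant (template meta-parameters plus
#    input/hardware descriptors); y: measured runtime in milliseconds.
X = np.load("benchmarks_X.npy")   # hypothetical archive of past auto-tuning runs
y = np.load("benchmarks_y.npy")

model = GradientBoostingRegressor(n_estimators=500, max_depth=4)
model.fit(X, np.log(y))           # log-runtime is a common target transform

def predict_ms(candidate):
    """Predicted runtime (ms) for one feature vector."""
    return float(np.exp(model.predict(candidate.reshape(1, -1))[0]))

def hill_climb(start, neighbors, n_steps=200):
    """Greedy local search over discrete meta-parameter settings,
    evaluated on the regression model instead of the hardware."""
    best, best_ms = start, predict_ms(start)
    for _ in range(n_steps):
        cand = neighbors(best)            # perturb one meta-parameter
        ms = predict_ms(cand)
        if ms < best_ms:
            best, best_ms = cand, ms
    return best, best_ms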

Page 161: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

3D Filterbank Convolutions!

Page 162: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

[Figure: scatter plots of GFLOP/s achieved by predictive (ML-based) auto-tuning vs. GFLOP/s of empirical auto-tuning on GTX 295, Tesla C1060, Tesla C2070, GTX 480 and GTX 580, with equality, “2x slower” and “2x faster” guide lines (panels a, b); panel (c): GFLOP/s of predictive auto-tuning vs. number of auto-tuned problem configurations used for training, with auto-tuned and reference means.]

NVIDIA GTX 580 (Fermi)

Preview

old way: minutes!

ML-based: < 0.1 sec

Page 163: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

[Same scatter plots as the previous slide.]

NVIDIA GTX 580 (Fermi)

Preview

old way: minutes!

ML-based: < 0.1 sec

> 1.1 TERAFLOP/s !

Page 164: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else ?

Page 165: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

Page 166: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

• Minimize failures (exascale supercomputers)

Page 167: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

• Minimize failures (exascale supercomputers)

• Minimize mixed-precision errors

Page 168: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

• Minimize failures (exascale supercomputers)

• Minimize mixed-precision errors

• Help better understand hardware features and their complex interactions

Page 169: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

• Minimize failures (exascale supercomputers)

• Minimize mixed-precision errors

• Help better understand hardware features and their complex interactions

• Help design better architectures ?

Page 170: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

• Minimize failures (exascale supercomputers)

• Minimize mixed-precision errors

• Help better understand hardware features and their complex interactions

• Help design better architectures ?

• $$$

Page 171: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

What else could we do for HPC ?

• Minimize failures (exascale supercomputers)

• Minimize mixed-precision errors

• Help better understand hardware features and their complex interactions

• Help design better architectures ?

• $$$

• etc.

Page 172: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

It would be a win-win-win situation!

(The Office Season 2, Episode 27: Conflict Resolution)

Page 173: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC

Page 174: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Acknowledgements

DiCarlo Lab @ MIT

Jim DiCarlo

David Cox

Page 175: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

Acknowledgements

Page 176: High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

COME