Parallel Computation in R: What We Want, and How We (Might) Get It

Norm Matloff
University of California at Davis

Keynote Address, useR! 2017
Brussels, 6 July 2017

Shameless Promotion

Out July 28!

(A long-held plan — decades — now finally got around to it.)


Disclaimer

• “Everyone has an opinion.”
• I’ll present mine.
• I will essentially propose general design patterns, illustrated with our own package partools, but meant to be general.
• Dissent is encouraged. :-)


The Drivers and Their Result

• Parallel hardware for the masses:
  • 4 cores standard, 16 not too expensive
  • GPUs
  • Intel Xeon Phi, ≈ 60 cores (!), coprocessor, as low as a few hundred dollars
• Big Data
  • Whatever that is.

Result: Users believe, “I’ve got the hardware and I’ve got the data I need — so I should be all set to do parallel computation in R on the data.”


Not So Simple

• Non-“embarrassingly parallel” algorithms.
• Overhead issues:
  • Contention for memory/network.
  • Bandwidth limits — CPU/memory, CPU/network, CPU/GPU.
  • Cache coherency problems (inconsistent caches in multicore systems).
  • Contention for I/O ports.
  • OS/R limits on number of sockets (network connections).
  • Serialization (see the timing sketch below).
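For instance, serialization cost alone is easy to underestimate. A minimal, machine-dependent illustration (the matrix size is made up) of what it costs just to flatten a modest object into the byte stream that would be shipped to a worker over a socket:

    a <- matrix(rnorm(5000 * 5000), 5000, 5000)  # ~200 MB of doubles
    # time to serialize only; socket transfer time comes on top of this
    system.time(bytes <- serialize(a, connection = NULL))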


Wish List

• Ability to run on various types of hardware — from R.
• Ease of use for the non-cognoscenti.
• Parameters to tweak for the experts or the daring.


The Non-cognoscenti Can Become the Daring

“Help, I’m in over my head here!” – a prominent R developer, entering the parallel computation world.


Non-cognoscenti (cont’d.)

• Casual users, even if they are deft programmers, quickly learn that this is no casual operation.
• After getting burned by disappointing performance, some will be emboldened to learn the subtleties.
• Painless parallel computation is not possible.


Example: Matrix-Vector Multiplication

• D = AX, with A being n × p and X being p × 1.
• Naive approach: Parallelize the loop

    for (i in 1:n)
       d[i] <- a[i, ] %*% x

• Naive use of the foreach package likely quite slow; scatter-gather overhead a substantial proportion of the overall time (sketch below).
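For concreteness, a minimal sketch of that naive foreach version (the doParallel backend and the sizes n, p here are assumptions for illustration):

    library(doParallel)
    registerDoParallel(cores = 2)    # 2 worker processes

    n <- 10000; p <- 50
    a <- matrix(rnorm(n * p), n, p)
    x <- rnorm(p)

    # one task per row: each tiny task pays scheduling and
    # serialization overhead, swamping the actual arithmetic
    d <- foreach(i = 1:n, .combine = c) %dopar% {
       a[i, ] %*% x
    }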


Example (cont’d.)

• Solution is obvious: For r processes, partition the n rows of A into r chunks of about n/r rows each, and change the above loop from n iterations to r (runnable sketch below).

    for (k in 1:r)
       d[rowblocks[[k]]] <- a[rowblocks[[k]], ] %*% x

• But casual users may miss this. And automatic parallelization would miss it.
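A minimal runnable sketch of the chunked version, using the base parallel package (the sizes and the cluster size r = 2 are made up for illustration):

    library(parallel)

    n <- 10000; p <- 50; r <- 2
    a <- matrix(rnorm(n * p), n, p)
    x <- rnorm(p)

    cls <- makeCluster(r)
    rowblocks <- splitIndices(n, r)   # r chunks of row indices

    # one task per chunk: each worker multiplies one block of rows
    chunks <- parLapply(cls, rowblocks,
                        function(rows, a, x) a[rows, ] %*% x, a, x)

    d <- numeric(n)
    for (k in 1:r) d[rowblocks[[k]]] <- chunks[[k]]
    stopCluster(cls)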


Use Cases

A few reference examples, somewhat spanning the space:

• Compute-intensive parametric: Quantile regression.
• Compute-intensive nonparametric: Nearest-neighbor regression.
• Compute-intensive nonparametric: Graph algorithms.
• Run-of-the-mill aggregation: Group-by-and-find-means op.
• Tougher aggregation: Credit card fraud detection.


Software Alchemy (SA)

• My term for a method developed by a number of authors (Matloff, 2016).
• Break the data into chunks. Apply the estimator, say lm(), to each chunk, then average the results (sketched below).
• For parallel comp. with r processes, use r chunks.
• Same statistical accuracy.
• Often produces superlinear speedup, i.e. > r.
• Useful in some apps.
• Available in the partools package (NM, C. Fitzgerald), github.com/matloff.
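A minimal sketch of the SA idea for a linear model, written here with base parallel rather than partools’ own API (the data, formula, and chunk count are made up for illustration):

    library(parallel)

    n <- 100000; r <- 4
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

    cls <- makeCluster(r)
    chunks <- split(dat, rep(1:r, length.out = n))   # r data chunks

    # fit the estimator independently on each chunk, in parallel...
    fits <- parLapply(cls, chunks,
                      function(ch) coef(lm(y ~ x1 + x2, data = ch)))
    # ...then average the chunk estimates
    saest <- Reduce(`+`, fits) / r
    stopCluster(cls)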


Programming World Views

• Message passing/distributed comp.: Send data to the R processes; each process works on its data; possibly combine results (sketch below).

  In R, e.g. parallel (the part from snow), Rmpi.

  In C, e.g. MPI.
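A minimal message-passing sketch with the snow-derived part of parallel: the manager ships each worker a piece of work over a socket, the workers compute, and the partial results come back to be combined (the toy task and worker count are made up):

    library(parallel)

    cls <- makeCluster(2)             # two worker processes
    pieces <- splitIndices(1e6, 2)    # one piece of the work per worker

    # send: each worker receives its piece; receive: partial sums return
    partials <- clusterApply(cls, pieces, function(idx) sum(sqrt(idx)))
    total <- sum(unlist(partials))
    stopCluster(cls)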


World Views (cont’d.)

• Shared-memory: The processes have access to a common memory, so no data transfer is needed (sketch below).

  Not (yet) common in R, but we do have Rdsm (NM), thread (R. Bartnik).

  In C, e.g. OpenMP.
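Rdsm itself needs more setup than fits here, but the no-transfer idea can be glimpsed with fork-based parallelism on Unix-alikes: mclapply() workers read the parent’s large object through copy-on-write shared pages, so nothing is serialized or shipped (the sizes and core count are made up):

    library(parallel)

    a <- matrix(rnorm(2000 * 2000), 2000, 2000)  # big object in the parent

    # forked workers see 'a' in place; no explicit data transfer
    blocks <- splitIndices(ncol(a), 4)
    sums <- mclapply(blocks, function(cols) colSums(a[, cols]),
                     mc.cores = 4)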


Premises in This Talk

• There is a lot of hype about parallel computation.
• Parallel computation is not for the casual user.
• Efficient automatic parallelization — no user intervention/sophistication needed — is generally not possible and should not be expected. Please stop asking for it. :-)
• As in politics, transparency in software tools is vital. :-) What do those APIs really do?
• UseRs are different from aggregation-oriented (e.g. Spark) users.
  • Aggregation is only part of what useRs do.
  • We need iterative estimators, std. errors, linear algebra, etc.
  • Newer methodology, e.g. ML, random graphs, etc.
  • UseRs may have become fairly good programmers, but lack systems knowledge.


Page 87: Parallel Computation in R: What We Want, and How We (Might ...heather.cs.ucdavis.edu/user2017.pdf · in R: What We Want, and How We (Might) Get It Norm Matlo University of California

ParallelComputationin R: What

We Want, andHow We

(Might) Get It

Norm MatloffUniversity ofCalifornia at

Davis

Premises (cont’d).

• Use of SA as a means of parallelization should be fine for things like linear models, quantile regression, k-nearest neighbor regression, etc.

• Some apps, e.g. graph algorithms, are based on sharing state, so a shared-memory world view/hardware may be needed.

• But in most of the Use Cases, including the SA ones, the distributed world view works well, and may be needed anyway at very large scale.

• Bottom line: For most Use Cases, use one of the following:

  • SA
  • Distributed computation, esp. using the “Leave it there” concept.


Spark

One well-publicized distributed approach today is Spark/SparkR.

• MapReduce not well-suited to most of the above Use Cases.

• Highly elaborate Spark machinery violates the transparency requirement.

• On the other hand, the distributed file system approach of Hadoop/Spark is good for useRs too.


Example Study: I

• (Gittens et al, 2016) Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies.

  In spite of careful optimization, the performance of Spark ranged from slightly slower to really, really slower. :-) Just not what Spark was designed for.

  My personal side comment: It is not clear whether, say, PCA has much accuracy or usefulness at the truly Big Data scale, including for sparse matrices.


Example Study: II

Reyes-Ortiz et al, Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf.

Abstract:

  ...MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed and provides more consistent performance. However, Spark shows better data management infrastructure and the possibility of dealing with other aspects such as node failure and data replication.

I contend that very few useRs, even those who need parallel computation, need to guard against node failure.


The Principle of “Leave It There”

Extremely simple idea, but very powerful.

• Common setting (e.g. the parallel package): scatter/gather.

  (a) Manager node partitions (scatters) the data to worker nodes.
  (b) Worker nodes work on their chunks.
  (c) Manager collects (gathers) and combines the results.

• But NO, avoid step (c) as much as possible (see the sketch below).
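To make the contrast concrete, here is a minimal sketch of scatter/gather vs. “leave it there” using base R’s parallel package. The chunking scheme and the names cls and myChunk are mine, purely for illustration:

library(parallel)

cls <- makeCluster(4)                      # 4 worker processes
x <- rnorm(1e6)

# (a) Scatter: partition the data, one chunk per worker.
chunks <- split(x, cut(seq_along(x), 4, labels = FALSE))
clusterApply(cls, chunks,
   function(ch) assign("myChunk", ch, envir = .GlobalEnv))

# (b) Workers operate on their own chunks; the data stays put.
invisible(clusterEvalQ(cls, myChunk <- myChunk^2))

# (c) Gather only a tiny aggregate, never the full data.
total <- Reduce(`+`, clusterEvalQ(cls, sum(myChunk)))

stopCluster(cls)

The point: step (c) here moves four numbers over the network, not a million.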


Example of “Leave It There”

Say we wish to perform the following on some dataset:

• Convert categorical variables to dummies.

• Replace NA values by means. (Not great, but just an example.)

• Remove outliers, as def. by |X − µ| > 3σ. (Just an example.)

• Run linear regression analysis.

The point is to NOT do the gather op after each of the above steps. Leave the data there (in distributed form). Even the first step can be done chunk by chunk, as in the sketch below.

Note too: The last step can be done in parallel too, with SA.
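For instance, the dummy-variable step never needs to leave the workers. A minimal sketch, assuming each worker holds its chunk in a data frame named xy and cls is the cluster handle (model.matrix is base R):

# Expand factor columns into dummy columns, in place at each worker.
# Note: model.matrix drops NA rows by default, so in this pipeline
# one might want to do the NA-replacement step first.
clusterEvalQ(cls,
   xy <- as.data.frame(model.matrix(~ . - 1, data = xy)))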


Comparing Just a Few Packages

A few packages that facilitate the above approach:

  pkg          flexibility   high-level ops
  partools     high          few
  ddR          medium        medium
  multidplyr   low           more


Going One Step Further: Distributed Files

• Since we will do “leave it there” over many ops,

• we might as well distribute a persistent version of the data, i.e. have distributed files.

• Like Hadoop/Spark, but without the complex machinery.

• Our partools package includes various functions for managing distributed files.


Distributed Files in partools

• A file x is spread across x.001, x.002, etc.

• filesplit(): Make a distributed file from a monolithic one.

• fileread(): If node i does fileread(x, d), then x.i will be read into the variable d.

• filesave(): Saves distributed data to a distributed file.

• Etc. (A usage sketch follows.)
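A rough sketch of how these fit together. Caution: the argument lists below are my approximation (suffix-digit counts, headers, and separators vary), so treat this as a sketch and consult the partools documentation for the exact signatures:

library(parallel)
library(partools)

cls <- makeCluster(4)
setclsinfo(cls)                  # partools per-worker bookkeeping (assumed setup step)

# Split the monolithic file xy into chunks xy.001, ..., xy.004.
filesplit(4, "xy")

# Each worker i reads its chunk xy.00i into a data frame named xy.
fileread(cls, "xy", "xy", ndigs = 3)

# ... "leave it there" operations on xy at the workers ...

# Save the modified distributed data as a new distributed file xy2.*.
filesave(cls, "xy", "xy2", ndigs = 3, sep = ",")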


Partools Example of “Leave It There”

• Say we have a distributed file xy, physically stored in files xy.001, xy.002, etc.

• Say we have written functions (not shown in the talk; one possible version is sketched below) NAtoMean and deleteOuts, to handle missing values and remove outliers, as mentioned before. The functions have been given to the workers.
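Here is one plausible way to write those two helpers. This is my own sketch — the talk does not show them:

# Hypothetical implementations of the helpers named on the slide.

NAtoMean <- function(x) {             # replace NAs in a vector by the mean
   x[is.na(x)] <- mean(x, na.rm = TRUE)
   x
}

deleteOuts <- function(x) {           # keep values within 3 SDs of the mean
   x[abs(x - mean(x)) <= 3 * sd(x)]
}

# Note: applied column by column, deleteOuts can yield columns of unequal
# length; a production version would filter whole rows instead.

# Ship the helpers to the workers:
clusterExport(cls, c("NAtoMean", "deleteOuts"))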


“Leave It There” Example (cont’d.)

# do NA removal at each worker,
# on the worker's chunk of xy
clusterEvalQ(cls, xy <- apply(xy, 2, NAtoMean))

# do the outlier removal at each worker,
# on the worker's chunk of xy
clusterEvalQ(cls, xy <- apply(xy, 2, deleteOuts))

# use Software Alchemy to perform linear regression,
# returning just the coefficients in this case
calm(cls, 'y ~ ., data = xy')$tht


What Is Happening

E.g.

clusterEvalQ(cls, xy <- apply(xy, 2, NAtoMean))

We are saying: at each worker node, do

xy <- apply(xy, 2, NAtoMean)

which means each node does the apply op on its own portion of xy.


“Leave It There” Example (cont’d.)

The key point:

For typical data analysis, hopefully we have:

• Data file stored in distributed fashion.

• Lots of “leave it there” ops:

  • Parallel.
  • No network delay.
  • No serialization overhead.

• Occasional “collect” ops, hopefully small in size, e.g. from an aggregation such as colMeans (see the sketch below).

• If we change data or create new data, save it in distributed file form too! Use partools::filesave.
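To illustrate how small a “collect” op can be: in this illustration of mine (same cls and xy as above), only per-chunk column sums and row counts cross the network, never the data itself:

# Each worker returns just a few numbers about its chunk ...
pieces <- clusterEvalQ(cls, list(s = colSums(xy), n = nrow(xy)))

# ... and the manager combines them into the grand column means.
grandMeans <- Reduce(`+`, lapply(pieces, `[[`, "s")) /
              sum(sapply(pieces, `[[`, "n"))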


Heavy Use of SA

• Have SA forms of:

  • lm()/glm()
  • k-NN
  • random forests
  • PCA
  • quantile()

• Very easy to make your own SA functions (a sketch of the idea follows).
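The core SA recipe, in my own stripped-down form: fit the estimator on each worker’s chunk, then average the per-chunk estimates. (partools’ calm(), used earlier, packages this pattern for linear models; its $tht component holds the resulting coefficients.)

# One fit per chunk, computed where the data already lives ...
coefs <- clusterEvalQ(cls, coef(lm(y ~ ., data = xy)))

# ... then average the per-chunk coefficient vectors.
thetaHat <- Reduce(`+`, coefs) / length(coefs)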


Various Collection Ops

E.g. addlists().

Say we have a distributed list with 2 components. From one, the manager node receives

list(a=3, b=8)

and from the other

list(a=5, b=1, c=12)

The function “adds” them, producing the (non-distributed)

list(a=8, b=9, c=12)
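One way to implement such a merge in base R. This is my own sketch of the idea, not the partools source:

addTwoLists <- function(l1, l2) {
   nms <- union(names(l1), names(l2))        # all field names from either list
   out <- lapply(nms, function(nm) {
      v1 <- if (nm %in% names(l1)) l1[[nm]] else 0
      v2 <- if (nm %in% names(l2)) l2[[nm]] else 0
      v1 + v2                                # a missing field counts as 0
   })
   names(out) <- nms
   out
}

addTwoLists(list(a=3, b=8), list(a=5, b=1, c=12))   # list(a=8, b=9, c=12)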


Conclusions

No “silver bullet.” But the following should go a long way toward your need for parallel computation:

• SA for the computational stuff.

• For aggregation, “leave it there” and distributed files.

• Could do in other packages, not just partools.

Ready for the dissent. :-)
