
Building a cutting-edge data processing environment on a budget

Gaël Varoquaux

This talk is not about rocket science!

Disclaimer: this talk is as much about people and projects as it is about code and algorithms.

Growing up as a penniless academic

I did a PhD in quantum physics:
vacuum (leaks), electronics (shorts), lasers (mis-alignment).
Best training ever for agile project management.

Computers were only one of the many moving parts:
Matlab, instrument control.
This shaped my vision of computing as a means to an end.

Growing up as a penniless academic

2011: tenured researcher in computer science.
Today: a growing team with data-science rock stars.

1 Using machine learning to understand brain function

Link neural activity to thoughts and cognition.

1 Functional MRI

Recordings of brain activity over time.

1 Cognitive NeuroImaging

Learn a bilateral link between brain activity and cognitive function.

1 Encoding models of stimuli

Predicting the neural response: a window into brain representations of stimuli.
"Feature engineering": a description of the world.

1 Decoding brain activity

"Brain reading".

1 Data processing feats

Visual image reconstruction from human brain activity [Miyawaki et al., 2008]: "brain reading".

"If it's not open and verifiable by others, it's not science, or engineering..." (Stodden, 2010)

Make it work, make it right, make it boring.
Code, data, ... just works™:
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
http://nilearn.github.io

A software development challenge.

1 Data accumulation

When data processing is routine... "big data" for rich models of brain function.
Accumulation of scientific knowledge and learning formal representations.

"A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations."
Stephen Hawking, A Brief History of Time.

1 Petty day-to-day technicalities

Buggy code. Slow code.
The lead data scientist leaves. A new intern to train.
I don't understand the code I wrote a year ago.

A lab is no different from a startup.
Difficulties: recruitment, limited resources (people & hardware).
Risks: bus factor, technical debt.

Our mission is to revolutionize brain data processing on a tight budget.

2 Patterns in data processing

2 The data processing workflow: agile

Interaction... → script... → module... → interaction again...

Consolidation, progressively.
Low tech and short turn-around times.

2 From statistics to statistical learning

Paradigm shift as the dimensionality of data grows:
# features, not only # samples.
From parameter inference to prediction.
Statistical learning is spreading everywhere.

3 Let's just make software to solve all these problems.

(Illustration © Theodore W. Gray)

3 Design philosophy

1. Don't solve hard problems. The original problem can be bent.
2. Easy setup, works out of the box. Installing software sucks. Convention over configuration.
3. Fail gracefully. Robust to errors. Easy to debug.
4. Quality, quality, quality. What's not excellent won't be used.

Not "one software to rule them all": break down projects by expertise.

Vision

Machine learning without learning the machinery.
A black box that can be opened.
The right trade-off between "just works" and versatility (think Apple vs Linux).

We're not going to solve all the problems for you: I don't solve hard problems.
Feature engineering, domain-specific cases... Python is a programming language: use it.
Cover the 80% of use cases in one package.

3 Performance in high-level programming

High-level programming is what keeps us alive and kicking.

3 Performance in high-level programming

The secret sauce:
- Optimize algorithms, not for loops.
- Know NumPy and SciPy perfectly: significant data should be arrays/memoryviews; avoid memory copies; rely on BLAS/LAPACK.
- line-profiler / memory-profiler; scipy-lectures.github.io
- Cython, not C/C++.

Example: hierarchical clustering, PR #2199
1. Take the 2 closest clusters.
2. Merge them.
3. Update the distance matrix.
...
Faster with constraints: a sparse distance matrix.
- Keep a heap queue of distances: cheap minimum.
- Need a sparse, growable structure for neighborhoods: a skip-list in Cython, with O(log n) insert, remove, and access; bind C++ map[int, float] with Cython.
- Fast traversal, possibly in Cython, for step 3.
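
A minimal sketch of the "heap queue of distances" idea, using plain single linkage (illustration only: the scikit-learn PR works on a sparse connectivity graph with Cython data structures, and the names below are mine):

import heapq
import numpy as np

def toy_single_linkage(points, n_clusters):
    """Agglomerative clustering driven by a heap of pairwise distances."""
    clusters = {i: {i} for i in range(len(points))}  # cluster id -> point indices
    dist = lambda a, b: min(np.linalg.norm(points[i] - points[j])
                            for i in a for j in b)
    # The heap gives the closest pair cheaply (step 1)
    heap = [(dist({i}, {j}), i, j)
            for i in range(len(points)) for j in range(i + 1, len(points))]
    heapq.heapify(heap)
    next_id = len(points)
    while len(clusters) > n_clusters:
        d, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                                  # stale entry: lazy deletion
        merged = clusters.pop(i) | clusters.pop(j)    # step 2: merge them
        for k in clusters:                            # step 3: update the distances
            heapq.heappush(heap, (dist(merged, clusters[k]), next_id, k))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

groups = toy_single_linkage(np.random.rand(30, 2), n_clusters=3)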

3 Architecture of a data-manipulation toolkit

Separate data from operations, but keep an imperative-like language.


The object API exposes a data-processing language: fit, predict, transform, score, partial_fit.
Estimators are instantiated without data, but with all their parameters.
Objects for pipelines, merging, etc.

Related ideas: the configuration/run pattern (traits, pyre), currying in functional programming (functools.partial), the MVC pattern.
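
A minimal sketch of this pattern with the scikit-learn API (the particular pipeline and data below are illustrative choices, not from the talk):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.normal(size=(1000, 20))

# Instantiated without data, but with all the parameters
model = make_pipeline(StandardScaler(), MiniBatchKMeans(n_clusters=5))

model.fit(X)               # the data-processing "verbs": fit, predict, transform...
labels = model.predict(X)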

4 Big data on small hardware

"Big data": petabytes... distributed storage, computing clusters.
Mere mortals (biggish, smallish data): gigabytes... Python programming, off-the-shelf computers.

4 On-line algorithms

Process the data one sample at a time.

Compute the mean of a gazillion numbers. Hard?
No: just do a running mean.
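
A minimal sketch of that running mean, assuming the samples arrive one at a time from any iterable:

def running_mean(samples):
    """On-line mean: one pass, O(1) memory, no need to hold the data."""
    mean, n = 0.0, 0
    for x in samples:
        n += 1
        mean += (x - mean) / n   # incremental update
    return mean

print(running_mean(range(1, 1000001)))  # 500000.5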

4 On-line algorithms

Converges to expectations.
Mini-batch = a bunch of observations, for vectorization.

Example: K-Means clustering
X = np.random.normal(size=(10000, 200))

scipy.cluster.vq.kmeans(X, 10, iter=2): 11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X): 0.62 s

4 On-the-fly data reduction

Big data is often I/O bound.
Layer memory access: CPU caches, RAM, local disks, distant storage.
Less data also means less work.

4 On-the-fly data reduction: dropping data

1. Loop: take a random fraction of the data.
2. Run the algorithm on that fraction.
3. Aggregate results across sub-samplings.

Looks like bagging: bootstrap aggregation.
Exploits redundancy across observations.
Run the loop in parallel (see the sketch below).
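
A minimal sketch of that subsample-and-aggregate loop, run in parallel with joblib (the estimator, fraction, and aggregation by averaging are illustrative assumptions):

import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import Ridge

def fit_on_fraction(X, y, fraction=0.1, seed=0):
    """Fit on a random fraction of the rows and return the coefficients."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=int(fraction * len(X)), replace=False)
    return Ridge(alpha=1.0).fit(X[idx], y[idx]).coef_

X = np.random.normal(size=(100000, 50))
y = X.dot(np.random.normal(size=50)) + np.random.normal(size=100000)

# Run the loop in parallel, then aggregate the per-subsample results
coefs = Parallel(n_jobs=2)(delayed(fit_on_fraction)(X, y, seed=s) for s in range(8))
coef_avg = np.mean(coefs, axis=0)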

4 On-the-fly data reduction

Random projections (will average features): sklearn.random_projection, i.e. random linear combinations of the features.
Fast clustering of features: sklearn.cluster.WardAgglomeration; on images, a super-pixel strategy.
Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless, so it can be used in parallel.
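
A minimal sketch of two of these reducers on toy data (module names are those from the slide; the data and parameters are illustrative):

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.feature_extraction.text import HashingVectorizer

# Random projections: 1000 features -> 100 random linear combinations
X = np.random.normal(size=(500, 1000))
X_small = SparseRandomProjection(n_components=100).fit_transform(X)

# Hashing: variable-length texts -> fixed-size sparse vectors, with no fitted state
docs = ["big data on small hardware", "less data also means less work"]
X_text = HashingVectorizer(n_features=2 ** 10).transform(docs)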

4 On-the-fly data reduction

Example: randomized SVD (a random projection): sklearn.utils.extmath.randomized_svd

X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop

linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925

4 Biggish iron

Our new box: 15 k€
48 cores, 384 GB RAM, 70 TB storage (SSD cache on the RAID controller).

It gets our work done faster than our 800-CPU cluster: it's the access patterns!

"Nobody ever got fired for using Hadoop on a cluster" (A. Rowstron et al., HotCDP '12)

5 Avoiding the framework: joblib

5 Parallel processing: the big picture

Focus on embarrassingly parallel for loops: life is too short to worry about deadlocks.
Workers compete for data access: the memory bus is a bottleneck.
The right grain of parallelism: too fine → overhead; too coarse → memory shortage.
Scale by the relevant cache pool.

5 Parallel processing: joblib

Focus on embarrassingly parallel for loops: life is too short to worry about deadlocks.

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

5 Parallel processing: joblib

IPython, multiprocessing, celery, MPI? joblib is higher-level:
- No dependencies, works everywhere.
- Better traceback reporting.
- Memmapping arrays to share memory (O. Grisel).
- On-the-fly dispatch of jobs: memory-friendly.
- Threads or processes backend.

5 Parallel processing: queues

Queues: high-performance, concurrency-friendly.
Difficulty: a callback on result arrival → multiple threads in the caller + risk of deadlocks.
The dispatch queue should fill up "slowly" → pre_dispatch in joblib.
→ Back-and-forth communication: the door is open to race conditions.

5 Parallel processing: what happens where

joblib design: caller, dispatch queue, and collect queue in the same process. Benefit: robustness.
Grand-central-dispatch design: the dispatch queue has a process of its own. Benefit: resource management in nested for loops.

5 Caching

For reproducibility: avoid manually chained scripts (make-like usage).
For performance: avoiding re-computing is the crux of optimization.

5 Caching: the joblib approach

For reproducibility: avoid manually chained scripts (make-like usage).
For performance: avoiding re-computing is the crux of optimization.

The memoize pattern:
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes b from a using f
c = g(a)    # retrieves the result from the store

Challenges in the context of big data: a & b are big.
Design goals: a & b arbitrary Python objects; no dependencies; drop-in, framework-less code.

Lego bricks for out-of-core algorithms, coming soon:
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
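
A minimal usage sketch of that memoize pattern on a costly NumPy computation (the cached function is illustrative; note that recent joblib versions spell the argument location= rather than cachedir=):

import numpy as np
import joblib

mem = joblib.Memory(cachedir='./cache', verbose=0)

@mem.cache                      # decorator form of g = mem.cache(f)
def costly_svd(X):
    return np.linalg.svd(X, full_matrices=False)

X = np.random.normal(size=(2000, 500))
U, s, Vt = costly_svd(X)        # first call: computes and stores the result
U, s, Vt = costly_svd(X)        # second call: loaded back from the store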

5 Efficient input-argument hashing: joblib.hash

Compute the md5* of the input arguments.
Trade-off between features and cost: black-boxy, but robust and completely generic.

Implementation:
1. Create an md5 hash object.
2. Subclass the standard-library pickler (a state machine that walks the object graph).
3. Walk the object graph:
   - ndarrays: pass the data pointer to the md5 algorithm ("update" method);
   - the rest: pickle.
4. Update the md5 with the pickle.

* md5 is in the Python standard library.
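
A minimal sketch of the same idea without subclassing the pickler (a simplification: joblib streams the arrays through the pickler itself, so array metadata is covered as well; the helper name is mine):

import hashlib
import pickle
import numpy as np

def simple_hash(*args):
    """md5 of arguments: raw buffers for ndarrays, pickle for everything else."""
    md5 = hashlib.md5()
    for obj in args:
        if isinstance(obj, np.ndarray):
            md5.update(np.ascontiguousarray(obj).data)  # hash the raw bytes
        else:
            md5.update(pickle.dumps(obj))
    return md5.hexdigest()

X = np.arange(12).reshape(3, 4)
print(simple_hash(X, {"alpha": 1.0}))   # same arguments -> same cache key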

5 Fast, disk-based, concurrent store: joblib.dump

Persisting arbitrary objects: once again, subclass the pickler.
Use .npy for large numpy arrays (np.save), pickle for the rest → multiple files.

Store concurrency issues. Strategy: atomic operations + try/except.
Renaming a directory is atomic; the directory layout is consistent with remove operations.
Good performance, usable on shared disks (clusters).
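
A minimal sketch of the "write, then rename atomically" strategy for one store entry (the paths and layout are illustrative assumptions, not joblib's actual on-disk format):

import os
import pickle
import tempfile
import numpy as np

def atomic_store(store_dir, key, array, meta):
    """Write an entry into a scratch directory, then rename it into place."""
    os.makedirs(store_dir, exist_ok=True)
    tmp = tempfile.mkdtemp(dir=store_dir)             # private scratch directory
    np.save(os.path.join(tmp, "data.npy"), array)     # .npy for the large array
    with open(os.path.join(tmp, "meta.pkl"), "wb") as f:
        pickle.dump(meta, f)                          # pickle for the rest
    try:
        os.rename(tmp, os.path.join(store_dir, key))  # atomic on POSIX filesystems
    except OSError:
        pass  # another worker won the race; its entry is equivalent

atomic_store("./store", "abc123", np.ones((1000, 10)), {"func": "g"})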

5 Making I/O fast

Fast compression: the CPU may be faster than disk access, in particular in parallel.
Standard library: zlib.compress with buffers (bypass the gzip module to work on-line and in-memory).

Avoiding copies: zlib.compress wants C-contiguous buffers; copyless storage of the raw buffer + meta-information (strides, class...).

Single-file dump coming soon: file opening is slow on clusters. Challenge: streaming the above to bound memory usage.

What matters on large systems:
- number of bytes stored (brings the network/SATA bus down);
- memory usage (brings compute nodes down);
- number of atomic file accesses (brings shared storage down).
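
A minimal sketch of compressing an ndarray's raw buffer with zlib while keeping the meta-information needed to rebuild it (a simplified assumption, not joblib's actual format):

import zlib
import numpy as np

def compress_array(a):
    """Compress the raw bytes; keep dtype and shape to rebuild the array."""
    a = np.ascontiguousarray(a)                  # zlib wants a C-contiguous buffer
    return zlib.compress(a.data, 3), a.dtype, a.shape

def decompress_array(payload, dtype, shape):
    return np.frombuffer(zlib.decompress(payload), dtype=dtype).reshape(shape)

X = np.zeros((1000, 1000))
blob, dtype, shape = compress_array(X)
assert np.array_equal(decompress_array(blob, dtype, shape), X)
print(len(blob), "compressed bytes for", X.nbytes, "raw bytes")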

5 Benchmarking against np.save and pytables

[Benchmark figure; y-axis scale: 1 is np.save. NeuroImaging data (MNI atlas).]

6 The bigger picture: building an ecosystem

Helping your future self.

6 Community-based development in scikit-learn

A huge feature set: the benefit of a large team.
Project growth: more than 200 contributors, ~12 core contributors, 1 full-time INRIA programmer from the start.

Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn).

6 The economics of open source

Code maintenance is too expensive to do alone:
scikit-learn ~300 emails/month, nipy ~45, joblib ~45, mayavi ~30.

"Hey Gael, I take it you're too busy. That's okay, I spent a day trying to install XXX and I think I'll succeed myself. Next time though please don't ignore my emails, I really don't like it. You can say, 'sorry, I have no time to help you.' Just don't ignore."

Your "benefits" come from a fraction of the code. Data loading? Maybe. Standard algorithms? Nah.
Share the common code... to avoid dying under code.
Code becomes less precious with time, and somebody might contribute features.

6 Many eyes make code fast

Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer

6 Six steps to a community-driven project

1. Focus on quality.
2. Build great docs and examples.
3. Use GitHub.
4. Limit the technicality of your codebase.
5. Releasing and packaging matter.
6. Focus on your contributors: give them credit and decision power.

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire

6 Core project contributors

[Figure: normalized number of commits since 2009-06, per individual committer. Credit: Fernando Perez, Gist 5843625]

6 The tragedy of the commons

"Individuals, acting independently and rationally according to each one's self-interest, behave contrary to the whole group's long-term best interests by depleting some common resource." (Wikipedia)

Make it work, make it right, make it boring.
Core projects (boring) are taken for granted → hard to fund, less excitement.
They need citation, in papers and on corporate web pages.

@GaelVaroquaux

Solving problems that matter

The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code.
scikit-learn, joblib, nilearn, ... I hope.

@GaelVaroquaux

Cutting-edge ... environment ... on a budget

1. Set the goals right: don't solve hard problems. What's your original problem?
2. Use the simplest technological solutions possible: be very technically sophisticated, but don't use that sophistication.
3. Don't forget the human factors: with your users (documentation) and with your contributors.

A perfect design?