Python for brain mining: (neuro)science with state of the art machine learning and data visualization

Python for brain mining:(neuro)science with state of the art machine learningand data visualization

Gael Varoquaux

1. Data-driven science“Brain mining”

2. Data mining in PythonMayavi, scikit-learn, joblib

1 Brain miningLearning models of brain function

Gael Varoquaux 2

1 Imaging neuroscience

Brainimages

Models offunction

Cognitive tasks

Gael Varoquaux 3

1 Imaging neuroscience

Brainimages

Models offunction

Cognitive tasks

Data-driven science

i ~ ∂

∂t Ψ = HΨ

Gael Varoquaux 3

1 Brain functional data

Rich data50 000 voxels perframeComplex underlyingdynamics

Few observations∼ 100

Drawing scientific conclusions?Ill-posed statistical problem

Gael Varoquaux 4

1 Brain functional data

Rich data50 000 voxels perframeComplex underlyingdynamics

Few observations∼ 100

Drawing scientific conclusions?Ill-posed statistical problem

Modern complex system studies:from strong hypothesizes to rich data

Gael Varoquaux 4

1 Statistics: the curse of dimensionalityy function of x1

y function of x1 and x2

More fit parameters?⇒ need exponentially more data

y function of 50 000 voxels?Expert knowledge

(pick the right ones) Machine learning

Gael Varoquaux 5

1 Statistics: the curse of dimensionalityy function of x1 y function of x1 and x2




Gael Varoquaux 5

1 Statistics: the curse of dimensionalityy function of x1 y function of x1 and x2




Gael Varoquaux 5

1 Brain readingPredict from brain images the object viewed

Correlation analysis

Gael Varoquaux 6


Inverse problem

Observations Spatial code


Inject prior: regularize

Sparse regression= compressive sensing

Gael Varoquaux 6


Inverse problem




Extract brain regionsTotal variationregression

[Michel, Trans Med Imag 2011]Gael Varoquaux 6


Inverse problem




Extract brain regionsTotal variationregression

[Michel, Trans Med Imag 2011]

Cast the problem in a prediction task:supervised learning.

Prediction is a model-selection metric

Gael Varoquaux 6

1 On-going/spontaneous activity

95% of theactivity is

unrelated to task

Gael Varoquaux 7

1 Learning regions from spontaneous activity

Multi-subject dictionarylearning

SpatialmapTime series

Sparsity + spatial continuity + spatial variability

⇒ Individual maps + functional regions atlas

[Varoquaux, Inf Proc Med Imag 2011]

Gael Varoquaux 8

1 Graphical models: interactions between regions

Estimate covariance structureMany parameters to learn

Regularize: conditionalindependence= sparsity on inverse

covariance

[Varoquaux NIPS 2010]

Gael Varoquaux 9

1 Graphical models: interactions between regions

Estimate covariance structureMany parameters to learn

Regularize: conditionalindependence= sparsity on inverse

covariance

[Varoquaux NIPS 2010]

Find structure via a density estimation:unsupervised learning.

Model selection: likelihood of new data

Gael Varoquaux 9

2 My data-science software stackMayavi, scikit-learn, joblib

Gael Varoquaux 10

2 Mayavi: 3D data visualizationRequirements

large 3D datainteractive visualizationeasy scripting

SolutionVTK: C++ data visualizationUI (traits)

+ pylab-inspired API

Black-box solutions don’t yield new intuitions

Limitationshard to installclunky & complexC++ leaking through3D visualization doesn’tpay in academia

Tragedy of thecommons or

niche product?

Gael Varoquaux 11

2 scikit-learn: statistical learningVision

Address non-machine-learning expertsSimplify but don’t dumb downPerformance: be state of the artEase of installation

Gael Varoquaux 12

2 scikit-learn: statistical learningTechnical choices

Prefer Python or Cython, focus on readabilityDocumentation and examples are paramountLittle object-oriented design. Opt for simplicityPrefer algorithms to frameworkCode quality: consistency and testing

Gael Varoquaux 13

2 scikit-learn: statistical learningAPI

Inputs are numpy arraysLearn a model from the data:estimator.fit(X train, Y train)

Predict using learned modelestimator.predict(X test)

Test goodness of fitestimator.score(X test, y test)

Apply change of representationestimator.transform(X, y)

Gael Varoquaux 14

2 scikit-learn: statistical learningComputational performance

scikit-learn mlpy pybrain pymvpa mdp shogunSVM 5.2 9.47 17.5 11.52 40.48 5.63LARS 1.17 105.3 - 37.35 - -Elastic Net 0.52 73.7 - 1.44 - -kNN 0.57 1.41 - 0.56 0.58 1.36PCA 0.18 - - 8.93 0.47 0.33k-Means 1.34 0.79 ∞ - 35.75 0.68

Algorithms rather than low-level optimizationconvex optimization + machine learning

Avoid memory copies

Gael Varoquaux 15

2 scikit-learn: statistical learningCommunity

35 contributors since 2008, 103 github forks25 contributors in latest release (3 months span)

Why this success?Trendy topic?Low barrier of entryFriendly and very skilled mailing listCredit to people

Gael Varoquaux 16

2 joblib: Python functions on steroidsWe keep recomputing the same things

Nested loops with overlapping sub-problemsVarying parameters I/O

Standard solution:pipelines

ChallengesDependencies modelingParameter tracking

Gael Varoquaux 17

2 joblib: Python functions on steroids

PhilosophySimple don’t change your codeMinimal no dependenciesPerformant big dataRobust never fail

joblib’s solution = lazy recomputation:Take an MD5 hash of function arguments,Store outputs to disk

Gael Varoquaux 18

2 joblibLazy recomputing

>>> from j o b l i b import Memory>>> mem = Memory( c a c h e d i r =’/tmp/ joblib ’)>>> import numpy as np>>> a = np. v a n d e r (np. a r a n g e (3))>>> s q u a r e = mem. cache (np. s q u a r e )>>> b = s q u a r e (a)

[Memory] C a l l i n g s q u a r e ...s q u a r e ( a r r a y ([[0 , 0, 1],

[1, 1, 1],[4, 2, 1]]))

s q u a r e - 0.0 s>>> c = s q u a r e (a)>>> # No recomputation

Gael Varoquaux 19

Conclusion

Data-driven science will needmachine learning because of thecurse of dimensionality

Scikit-learn and joblib: focus onlarge-data performance and easy ofuse

Cannot develop software and science separatelyGael Varoquaux 20

Technology

Python for brain mining: (neuro)science with state of the art machine learning and data visualization