Upload
gaelvaroquaux
View
123
Download
5
Tags:
Embed Size (px)
DESCRIPTION
Talk given at scipy 2011
Citation preview
Python for brain mining:(neuro)science with state of the art machine learningand data visualization
Gael Varoquaux
1. Data-driven science“Brain mining”
2. Data mining in PythonMayavi, scikit-learn, joblib
1 Brain miningLearning models of brain function
Gael Varoquaux 2
1 Imaging neuroscience
Brainimages
Models offunction
Cognitive tasks
Gael Varoquaux 3
1 Imaging neuroscience
Brainimages
Models offunction
Cognitive tasks
Data-driven science
i ~ ∂
∂t Ψ = HΨ
Gael Varoquaux 3
1 Brain functional data
Rich data50 000 voxels perframeComplex underlyingdynamics
Few observations∼ 100
Drawing scientific conclusions?Ill-posed statistical problem
Gael Varoquaux 4
1 Brain functional data
Rich data50 000 voxels perframeComplex underlyingdynamics
Few observations∼ 100
Drawing scientific conclusions?Ill-posed statistical problem
Modern complex system studies:from strong hypothesizes to rich data
Gael Varoquaux 4
1 Statistics: the curse of dimensionalityy function of x1
y function of x1 and x2
More fit parameters?⇒ need exponentially more data
y function of 50 000 voxels?Expert knowledge
(pick the right ones) Machine learning
Gael Varoquaux 5
1 Statistics: the curse of dimensionalityy function of x1 y function of x1 and x2
More fit parameters?⇒ need exponentially more data
y function of 50 000 voxels?Expert knowledge
(pick the right ones) Machine learning
Gael Varoquaux 5
1 Statistics: the curse of dimensionalityy function of x1 y function of x1 and x2
More fit parameters?⇒ need exponentially more data
y function of 50 000 voxels?Expert knowledge
(pick the right ones) Machine learning
Gael Varoquaux 5
1 Brain readingPredict from brain images the object viewed
Correlation analysis
Gael Varoquaux 6
1 Brain readingPredict from brain images the object viewed
Inverse problem
Observations Spatial code
Correlation analysis
Inject prior: regularize
Sparse regression= compressive sensing
Gael Varoquaux 6
1 Brain readingPredict from brain images the object viewed
Inverse problem
Observations Spatial code
Correlation analysis
Inject prior: regularize
Extract brain regionsTotal variationregression
[Michel, Trans Med Imag 2011]Gael Varoquaux 6
1 Brain readingPredict from brain images the object viewed
Inverse problem
Observations Spatial code
Correlation analysis
Inject prior: regularize
Extract brain regionsTotal variationregression
[Michel, Trans Med Imag 2011]
Cast the problem in a prediction task:supervised learning.
Prediction is a model-selection metric
Gael Varoquaux 6
1 On-going/spontaneous activity
95% of theactivity is
unrelated to task
Gael Varoquaux 7
1 Learning regions from spontaneous activity
Multi-subject dictionarylearning
SpatialmapTime series
Sparsity + spatial continuity + spatial variability
⇒ Individual maps + functional regions atlas
[Varoquaux, Inf Proc Med Imag 2011]
Gael Varoquaux 8
1 Graphical models: interactions between regions
Estimate covariance structureMany parameters to learn
Regularize: conditionalindependence= sparsity on inverse
covariance
[Varoquaux NIPS 2010]
Gael Varoquaux 9
1 Graphical models: interactions between regions
Estimate covariance structureMany parameters to learn
Regularize: conditionalindependence= sparsity on inverse
covariance
[Varoquaux NIPS 2010]
Find structure via a density estimation:unsupervised learning.
Model selection: likelihood of new data
Gael Varoquaux 9
2 My data-science software stackMayavi, scikit-learn, joblib
Gael Varoquaux 10
2 Mayavi: 3D data visualizationRequirements
large 3D datainteractive visualizationeasy scripting
SolutionVTK: C++ data visualizationUI (traits)
+ pylab-inspired API
Black-box solutions don’t yield new intuitions
Limitationshard to installclunky & complexC++ leaking through3D visualization doesn’tpay in academia
Tragedy of thecommons or
niche product?
Gael Varoquaux 11
2 scikit-learn: statistical learningVision
Address non-machine-learning expertsSimplify but don’t dumb downPerformance: be state of the artEase of installation
Gael Varoquaux 12
2 scikit-learn: statistical learningTechnical choices
Prefer Python or Cython, focus on readabilityDocumentation and examples are paramountLittle object-oriented design. Opt for simplicityPrefer algorithms to frameworkCode quality: consistency and testing
Gael Varoquaux 13
2 scikit-learn: statistical learningAPI
Inputs are numpy arraysLearn a model from the data:estimator.fit(X train, Y train)
Predict using learned modelestimator.predict(X test)
Test goodness of fitestimator.score(X test, y test)
Apply change of representationestimator.transform(X, y)
Gael Varoquaux 14
2 scikit-learn: statistical learningComputational performance
scikit-learn mlpy pybrain pymvpa mdp shogunSVM 5.2 9.47 17.5 11.52 40.48 5.63LARS 1.17 105.3 - 37.35 - -Elastic Net 0.52 73.7 - 1.44 - -kNN 0.57 1.41 - 0.56 0.58 1.36PCA 0.18 - - 8.93 0.47 0.33k-Means 1.34 0.79 ∞ - 35.75 0.68
Algorithms rather than low-level optimizationconvex optimization + machine learning
Avoid memory copies
Gael Varoquaux 15
2 scikit-learn: statistical learningCommunity
35 contributors since 2008, 103 github forks25 contributors in latest release (3 months span)
Why this success?Trendy topic?Low barrier of entryFriendly and very skilled mailing listCredit to people
Gael Varoquaux 16
2 joblib: Python functions on steroidsWe keep recomputing the same things
Nested loops with overlapping sub-problemsVarying parameters I/O
Standard solution:pipelines
ChallengesDependencies modelingParameter tracking
Gael Varoquaux 17
2 joblib: Python functions on steroids
PhilosophySimple don’t change your codeMinimal no dependenciesPerformant big dataRobust never fail
joblib’s solution = lazy recomputation:Take an MD5 hash of function arguments,Store outputs to disk
Gael Varoquaux 18
2 joblibLazy recomputing
>>> from j o b l i b import Memory>>> mem = Memory( c a c h e d i r =’/tmp/ joblib ’)>>> import numpy as np>>> a = np. v a n d e r (np. a r a n g e (3))>>> s q u a r e = mem. cache (np. s q u a r e )>>> b = s q u a r e (a)
[Memory] C a l l i n g s q u a r e ...s q u a r e ( a r r a y ([[0 , 0, 1],
[1, 1, 1],[4, 2, 1]]))
s q u a r e - 0.0 s>>> c = s q u a r e (a)>>> # No recomputation
Gael Varoquaux 19
Conclusion
Data-driven science will needmachine learning because of thecurse of dimensionality
Scikit-learn and joblib: focus onlarge-data performance and easy ofuse
Cannot develop software and science separatelyGael Varoquaux 20