80
Scientist meets web dev: how Python became the language of data Ga¨ el Varoquaux

Scientist meets web dev: how Python became the language of data

Embed Size (px)

Citation preview

Page 1: Scientist meets web dev: how Python became the language of data

Scientist meets web dev:how Python became the language of dataGael Varoquaux

Page 2: Scientist meets web dev: how Python became the language of data

Scientist meets web dev:how Python became the language of dataGael Varoquaux

Very diverse communityThis talk: a reflection on whatwe have in common, Python

I am talking aboutthings you don’t understand (my science)and things I don’t understand (web dev)

Page 3: Scientist meets web dev: how Python became the language of data

I actually did a PhDin quantum physics

Hence I think I qualify asa “scientist”

G Varoquaux 3

Page 4: Scientist meets web dev: how Python became the language of data

I now do computer science for neuroscience

Try to link neural activity to thoughts and cognition

G Varoquaux 4

Page 5: Scientist meets web dev: how Python became the language of data

I now do computer science for neuroscience

Try to link neural activity to thoughts and cognitionWe attack it as a machine learning problem

Python software: nilearnG Varoquaux 5

Page 6: Scientist meets web dev: how Python became the language of data

On the way, we createda machine-learning library:

scikit-learn

G Varoquaux 6

Page 7: Scientist meets web dev: how Python became the language of data

Data science with Python is hot

Huge success.Cool.

Data science is THE thing.

Python is the go-to languageHow did it happen?

We built scikit-learn, others pandas, etc...,but these were built on solid foundations

G Varoquaux 7

Page 8: Scientist meets web dev: how Python became the language of data

Data science with Python is hot

Huge success.Cool.

Data science is THE thing.

Python is the go-to languageHow did it happen?

We built scikit-learn, others pandas, etc...,but these were built on solid foundations

G Varoquaux 7

Page 9: Scientist meets web dev: how Python became the language of data

1 Scientists come from JupiterAnd web devs from Saturn?

And sysadmins from Neptune?

G Varoquaux 8

Page 10: Scientist meets web dev: how Python became the language of data

1 We’re different

numbers (in arrays)

arrays (of numbers)

arrays

arrays

strings

databases

object-orientedprogramming

flow control

A bit of a culture gapG Varoquaux 9

Page 11: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython site205 talks:

How OpenStack makes Python better (and vice-versa)Introduction to aiohttp

So you think your Python startup is worth $10 million...SQLAlchemy as the backbone of a Data Science company

Learn Python The Fun WayScaling Microservices with Crossbar.io

If you can read this you don’t need glasses

Let’s find some common topics with data science

G Varoquaux 10

Page 12: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython site205 talks:

How OpenStack makes Python better (and vice-versa)Introduction to aiohttp

So you think your Python startup is worth $10 million...SQLAlchemy as the backbone of a Data Science company

Learn Python The Fun WayScaling Microservices with Crossbar.io

If you can read this you don’t need glasses

Let’s find some common topics with data science

Anyone who has used Python to search textfor substring patterns has at least heard ofthe regular expression module. Many of ususe it extensively for parsers and lexers,

The py.test tool presents a rapid and simple

way to write tests for your Python code. This

training gives a quick introduction with exercises

into some distinguishing features.Chat with the core developers about how

to extend django CMS or how to integrate

your own apps seamlessly. Lets talk about

your plugins, apphooks, toolbar extensions

G Varoquaux 10

Page 13: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython site205 talks:

How OpenStack makes Python better (and vice-versa)Introduction to aiohttp

So you think your Python startup is worth $10 million...SQLAlchemy as the backbone of a Data Science company

Learn Python The Fun WayScaling Microservices with Crossbar.io

If you can read this you don’t need glasses

Let’s find some common topics with data science

Anyone who has used Python to search textfor substring patterns has at least heard ofthe regular expression module. Many of ususe it extensively for parsers and lexers,

The py.test tool presents a rapid and simple

way to write tests for your Python code. This

training gives a quick introduction with exercises

into some distinguishing features.Chat with the core developers about how

to extend django CMS or how to integrate

your own apps seamlessly. Lets talk about

your plugins, apphooks, toolbar extensions

import urllib2, bs4

import sklearn, wordcloud

G Varoquaux 10

Page 14: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteCrawl

the schedule to get a list of titles and URLstalk pages to retrieve abstract and tags

bs4: beautiful soup, matchings on the DOM tree

Vectorize

G Varoquaux 11

Page 15: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteCrawl

the schedule to get a list of titles and URLstalk pages to retrieve abstract and tags

bs4: beautiful soup, matchings on the DOM tree

VectorizeAnyone who has used Python to search textfor substring patterns has at least heard ofthe regular expression module. Many of ususe it extensively for parsers and lexers,

The py.test tool presents a rapid and simple

way to write tests for your Python code. This

training gives a quick introduction with exercises

into some distinguishing features.Chat with the core developers about how

to extend django CMS or how to integrate

your own apps seamlessly. Lets talk about

your plugins, apphooks, toolbar extensions

acancode

ismoduleprofiling

performancePython

the

20104

143219

18

Term Freq

G Varoquaux 11

Page 16: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteCrawl

the schedule to get a list of titles and URLstalk pages to retrieve abstract and tags

bs4: beautiful soup, matchings on the DOM tree

VectorizeAnyone who has used Python to search textfor substring patterns has at least heard ofthe regular expression module. Many of ususe it extensively for parsers and lexers,

The py.test tool presents a rapid and simple

way to write tests for your Python code. This

training gives a quick introduction with exercises

into some distinguishing features.Chat with the core developers about how

to extend django CMS or how to integrate

your own apps seamlessly. Lets talk about

your plugins, apphooks, toolbar extensions

acancode

ismoduleprofiling

performancePython

the

20104

143219

18

Term Freq

1321540208964123

76

1911450

Alldocs

G Varoquaux 11

Page 17: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteCrawl

the schedule to get a list of titles and URLstalk pages to retrieve abstract and tags

bs4: beautiful soup, matchings on the DOM tree

VectorizeAnyone who has used Python to search textfor substring patterns has at least heard ofthe regular expression module. Many of ususe it extensively for parsers and lexers,

The py.test tool presents a rapid and simple

way to write tests for your Python code. This

training gives a quick introduction with exercises

into some distinguishing features.Chat with the core developers about how

to extend django CMS or how to integrate

your own apps seamlessly. Lets talk about

your plugins, apphooks, toolbar extensions

acancode

ismoduleprofiling

performancePython

the

20104

143219

18

Term Freq

1321540208964123

76

1911450

Alldocs

.015

.018

.019

.014

.023

.286

.167

.047

.012

Ratio

TF-IDF in scikit-learnsklearn.feature extraction.text.TfidfVectorizer

G Varoquaux 11

Page 18: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython site

0307809070

7907

0079075270

0578

9407100600

0797

0097000800

7000

1000040040

0090

0005020500

8000

documents

the

Python

performance

profiling

module

is

code

can

a

Term-document matrix

3 78 9 7

79 7

79 7527

578

94 71 6

797

97 8

7

1 4 4

9

5 2 5

8

Can be a sparse matrix

G Varoquaux 12

Page 19: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython site

0307809070

7907

0079075270

0578

9407100600

0797

0097000800

7000

1000040040

0090

0005020500

8000

documents

the

Python

performance

profiling

module

is

code

can

a

Term-document matrix

3 78 9 7

79 7

79 7527

578

94 71 6

797

97 8

7

1 4 4

9

5 2 5

8

Can be a sparse matrix

G Varoquaux 12

Page 20: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython site

0307809070

7907

0079075270

0578

9407100600

0797

0097000800

7000

1000040040

0090

0005020500

8000

documents

the

Python

performance

profiling

module

is

code

can

a →03078

090707907

0079075270

0578

9407100600

0797

topics

the

Python

performance

profiling

module

is

code

can

a 030

007

940

009

100

000

documents

topics

+What termsare in a topics

What documentsare in a topics

A matrix factorizationOften with non-negative constraints

sklearn.decompositions.NMF

G Varoquaux 13

Page 21: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteEuroPyton abstracts

Topic 1

G Varoquaux 14

Page 22: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteEuroPyton abstracts

Topic 2

G Varoquaux 14

Page 23: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteEuroPyton abstracts

Topic 3

G Varoquaux 14

Page 24: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteEuroPyton abstracts

G Varoquaux 14

Page 25: Scientist meets web dev: how Python became the language of data

1 Let’s do something together: sort EuroPython siteEuroPyton abstracts

Add one of Python’s great templating engine... get a usable website

https://gaelvaroquaux.github.io/my_topics/ep16

G Varoquaux 14

Page 26: Scientist meets web dev: how Python became the language of data

Want to try it?$ pip install scikit-learn...ImportError: Numerical Python (NumPy) is not installed.scikit-learn requires NumPy >= 1.6.1

C:> pip install numpy...error: Unable to find vcvarsall.bat

G Varoquaux 15

Page 27: Scientist meets web dev: how Python became the language of data

Want to try it?$ pip install scikit-learn...ImportError: Numerical Python (NumPy) is not installed.scikit-learn requires NumPy >= 1.6.1

C:> pip install numpy...error: Unable to find vcvarsall.bat

G Varoquaux 15

Page 28: Scientist meets web dev: how Python became the language of data

Want to try it?$ pip install scikit-learn...ImportError: Numerical Python (NumPy) is not installed.scikit-learn requires NumPy >= 1.6.1

C:> pip install numpy...error: Unable to find vcvarsall.bat

G Varoquaux 15

Page 29: Scientist meets web dev: how Python became the language of data

Want to try it?$ pip install scikit-learn...ImportError: Numerical Python (NumPy) is not installed.scikit-learn requires NumPy >= 1.6.1

C:> pip install numpy...error: Unable to find vcvarsall.bat

G Varoquaux 15

Page 30: Scientist meets web dev: how Python became the language of data

1 We’re different

Well

fast linear algebraATLAS (Fortran) 70x faster

libfortran.so.3 ??

you’re kidding me

G Varoquaux 16

Page 31: Scientist meets web dev: how Python became the language of data

1 We’re different

Well

fast linear algebraATLAS (Fortran) 70x faster

libfortran.so.3 ??

you’re kidding me

Packaging is a major roadblock for scientific PythonA lot of compiled code + shared libraries

⇒ library + ABI compatibility issuesProgress:

- Manylinux wheels: PEP 513, RT. McGibbon, NJ. Smithrely on a conservative core set of libs

- Openblas: pure-C, fast linear algebra

G Varoquaux 16

Page 32: Scientist meets web dev: how Python became the language of data

1 We’re different

But working together gives us awesome things

Text mining ⇒ intelligent interfaces

G Varoquaux 17

Page 33: Scientist meets web dev: how Python became the language of data

2 The scientist’s view of codeNumerics versus control flow

Numerics versus databasesNumerics versus strings

Numerics versus the world

G Varoquaux 18

Page 34: Scientist meets web dev: how Python became the language of data

2 Why we love numpy100 000 term frequency vs inverse doc frequency:

In [*]: %timeit [t * i for t, i in izip(tf, idf)]100 loops, best of 3: 6.2 ms per loop

The numpy style:In [*]: %timeit tf * idf1000 loops, best of 3: 74.2 µs per loop

102 104 106

number of elements

1µs

100ns

10ns

1nstime

per e

lem

ent

listsnumpy

Array computing can be more readabletf * idf

vs[t * i for t, i in izip(tf, idf)]

G Varoquaux 19

Page 35: Scientist meets web dev: how Python became the language of data

2 Why we love numpy100 000 term frequency vs inverse doc frequency:

In [*]: %timeit [t * i for t, i in izip(tf, idf)]100 loops, best of 3: 6.2 ms per loop

The numpy style:In [*]: %timeit tf * idf1000 loops, best of 3: 74.2 µs per loop

102 104 106

number of elements

1µs

100ns

10ns

1nstime

per e

lem

ent

listsnumpy

Array computing can be more readabletf * idf

vs[t * i for t, i in izip(tf, idf)]

G Varoquaux 19

Page 36: Scientist meets web dev: how Python became the language of data

2 arrays are nothing but pointersA numpy array =

memory address data type shape strides

0387879479

7927

0179075270

1578

9407174612

4797

5497071871

7887

1365349049

5190

7475426535

8098

4872154634

9084

9034567324

5614

7895718774

5620

stride 2

stride 1

shape 1

shape 2

Represents any regulardata in a structured way:how to access elementsvia pointer arythmetics(computing offsets)

stride 2

stride 1shape 1

shape 20387879479

792701790

752701578

...

Matches the memory model of numerical libraries⇒ Enables copyless interactions

Numpy is really a memory model

G Varoquaux 20

Page 37: Scientist meets web dev: how Python became the language of data

2 arrays are nothing but pointersA numpy array =

memory address data type shape strides

0387879479

7927

0179075270

1578

9407174612

4797

5497071871

7887

1365349049

5190

7475426535

8098

4872154634

9084

9034567324

5614

7895718774

5620

stride 2

stride 1

shape 1

shape 2

Represents any regulardata in a structured way:how to access elementsvia pointer arythmetics(computing offsets)

stride 2

stride 1shape 1

shape 20387879479

792701790

752701578

...

Matches the memory model of numerical libraries⇒ Enables copyless interactions

Numpy is really a memory model

G Varoquaux 20

Page 38: Scientist meets web dev: how Python became the language of data

2 Array computing is fast

102 104 106

number of elements

1µs

100ns

10ns

1nstime

per e

lem

ent

listsnumpy

tf idf = tf * idf CPU

03878794797927

01790752701578

*tf_idf=No type checkingDirect sequential memory accessVector operations (SIMD)

G Varoquaux 21

Page 39: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent

tf idf = tf * idf CPU

03878794797927

01790752701578

*tf_idf=

2x slowdown passed a certain size

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:

G Varoquaux 22

Page 40: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent

105 ∼ size ofthe CPU cache

Memory is much slower than CPU

tf idf = tf * idf

03878794797927

01790752701578

*tf_idf=0387879

01790

cpucache

2x slowdown passed a certain size

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:

G Varoquaux 22

Page 41: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent

Memory is much slower than CPU

tf idf = tf * idf - 1

It gets worse for complex expressions

What’s going on:1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:

G Varoquaux 22

Page 42: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent

Memory is much slower than CPU

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:

G Varoquaux 22

Page 43: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

In [*]: %timeit tf * idf10000 loops, best of 3: 74.2 µs per loop

In [*]: %timeit tf * idf - 11000 loops, best of 3: 418 µs per loop

A compilation problem:

G Varoquaux 22

Page 44: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

In [*]: %timeit tf * idf10000 loops, best of 3: 74.2 µs per loop

In [*]: %timeit tf * idf - 11000 loops, best of 3: 418 µs per loop

In-place operations: reuse the allocationIn [*]: %timeit tmp = tf * idf; tmp -= 110000 loops, best of 3: 112 µs per loop

A compilation problem:

G Varoquaux 22

Page 45: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent

numpynp inplace

tmp = tf * idftmp -= 1

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:

G Varoquaux 22

Page 46: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent

numpynp inplace

tmp = tf * idftmp -= 1

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:tf idf = tf * idf - 1 tf idf = tf * idf

tf idf -= 1

G Varoquaux 22

Page 47: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent numpynp inplacenumexpr

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:• Removing/reusing temporaries• Operating on “chunks” that fit in cache

Addressed by numexpr, with string expressionsnumexpr.evaluate(’tf * idf - 1’, locals())

G Varoquaux 22

Page 48: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

103 104 105 106

number of elements

1ns

2ns

3ns

4ns

5ns

time

per e

lem

ent numpynp inplacenumexpr

tf idf = tf * idf - 1What’s going on:

1. tmp ← tf * idf2. tf idf ← tmp - 1

Big temporary: Moving data in& out of cache

A compilation problem:• Removing/reusing temporaries• Operating on “chunks” that fit in cache

Addressed by numexpr, with string expressionsAddressed by numba, with bytecode inspectionlazyarray

Similar problem to pagination with SQL queries

G Varoquaux 22

Page 49: Scientist meets web dev: how Python became the language of data

2 Array computing is limited by CPU starvation

tf idf = tf * idf

Too small:overhead

Too BIG:Out of cache

BIG Dat

a

$$$G Varoquaux 23

Page 50: Scientist meets web dev: how Python became the language of data

2 Numerics versus control flowWhat if there is an if

tf idf = tf / idftf idf[idf == 0] = 0

Suppose the we are looking at ages in a population:ages[gender == ’male’].mean()

- ages[gender == ’female’].mean()

This is really starting to be looking like databasespandas: something in between arrays and

an in-memory databaseGreat for queries, less great for numerics.

G Varoquaux 24

Page 51: Scientist meets web dev: how Python became the language of data

2 Numerics versus control flowWhat if there is an if

tf idf = tf / idftf idf[idf == 0] = 0

Suppose the we are looking at ages in a population:ages[gender == ’male’].mean()

- ages[gender == ’female’].mean()

This is really starting to be looking like databasespandas: something in between arrays and

an in-memory databaseGreat for queries, less great for numerics.

G Varoquaux 24

Page 52: Scientist meets web dev: how Python became the language of data

Install

ation

PROBLE

MS

BeautifulPythonCOde

Routinesin Fortran

or C++

ScalaBILiTY

DeploymentPROBLEMS

BeautifulPythonCOde

DATABASEin C++, JAVA,ERLANG...

ScalaBILiTY

Numpy is the scientist’sequivalent to an ORM

Gives speedwith non-Python code

G Varoquaux 25

Page 53: Scientist meets web dev: how Python became the language of data

Install

ation

PROBLE

MS

BeautifulPythonCOde

Routinesin Fortran

or C++

ScalaBILiTY

DeploymentPROBLEMS

BeautifulPythonCOde

DATABASEin C++, JAVA,ERLANG...

ScalaBILiTY

Numpy is the scientist’sequivalent to an ORM

Gives speedwith non-Python code

G Varoquaux 25

Page 54: Scientist meets web dev: how Python became the language of data

numerics vs databasesnumerics efficient on regularly spaced data

But numpy creates cache misses for big arrays⇒ Need to remove temporaries and chunk data

selection and grouping efficient with indexes or trees⇒ Need to group queries

G Varoquaux 26

Page 55: Scientist meets web dev: how Python became the language of data

numerics vs databasesnumerics efficient on regularly spaced data

But numpy creates cache misses for big arrays⇒ Need to remove temporaries and chunk data

selection and grouping efficient with indexes or trees⇒ Need to group queries

Compilation is unpythonic

A computation & query language? numexprI hate domain-specific languages (SQL)Numpy is very expressive

G Varoquaux 26

Page 56: Scientist meets web dev: how Python became the language of data

numerics vs databasesnumerics efficient on regularly spaced data

But numpy creates cache misses for big arrays⇒ Need to remove temporaries and chunk data

selection and grouping efficient with indexes or trees⇒ Need to group queries

Compilation is unpythonicA computation & query language? numexpr

I hate domain-specific languages (SQL)Numpy is very expressive

G Varoquaux 26

Page 57: Scientist meets web dev: how Python became the language of data

numerics vs databasesnumerics efficient on regularly spaced data

But numpy creates cache misses for big arrays⇒ Need to remove temporaries and chunk data

selection and grouping efficient with indexes or trees⇒ Need to group queries

Compilation is unpythonicA computation & query language? numexpr

I hate domain-specific languages (SQL)Numpy is very expressive

PonyORM: Compiling Python to optimized SQLDatascience with SQL: Ibis & Blaze

G Varoquaux 26

Page 58: Scientist meets web dev: how Python became the language of data

numerics vs databasesnumerics efficient on regularly spaced data

But numpy creates cache misses for big arrays⇒ Need to remove temporaries and chunk data

selection and grouping efficient with indexes or trees⇒ Need to group queries

Spark: java-world “big data” rising starcombines distributed store

+ computing model

We (scikit-learn) are faster when data fits in RAM

G Varoquaux 26

Page 59: Scientist meets web dev: how Python became the language of data

Operations on chunks, or algorithms on chunksMachine learning, data mining = numerics

03878794797927

03878794797927

Out-of-core opera-tions not efficient:no data locality

On-line aglorithms (streaming)03

8787

9479

7927

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

eg stochastic gradient descentAs in deep learning

G Varoquaux 27

Page 60: Scientist meets web dev: how Python became the language of data

Operations on chunks, or algorithms on chunksMachine learning, data mining = numerics

03878794797927

03878794797927

03878794797927

03878794797927

ETL (extract, transform, & load) Multivariate statistics

Out-of-core opera-tions not efficient:no data locality

On-line aglorithms (streaming)03

8787

9479

7927

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

eg stochastic gradient descentAs in deep learning

G Varoquaux 27

Page 61: Scientist meets web dev: how Python became the language of data

Operations on chunks, or algorithms on chunksMachine learning, data mining = numerics

03878794797927

03878794797927

03878794797927

03878794797927

ETL (extract, transform, & load) Multivariate statistics

Out-of-core opera-tions not efficient:no data locality

On-line aglorithms (streaming)03

8787

9479

7927

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

eg stochastic gradient descentAs in deep learning

G Varoquaux 27

Page 62: Scientist meets web dev: how Python became the language of data

Making the data-science magic happens

from sklearn import

0307809070

7907

0079075270

0578

9407100600

0797

0097000800

7000

1000040040

0090

0005020500

8000

documents

the

Python

performance

profiling

module

is

code

can

a

0307809070

7907

0079075270

0578

9407100600

0797

topics

the

Python

performance

profiling

module

is

code

can

a

030

007

940

009

100

000

documents

topics

+What termsare in a topics

What documentsare in a topics

G Varoquaux 28

Page 63: Scientist meets web dev: how Python became the language of data

Making the data-science magic happens

from sklearn import

Turning applied maths papers to robust codeHigh-level, readable, simple syntax reduces cognitive load

Thanks

G Varoquaux 28

Page 64: Scientist meets web dev: how Python became the language of data

3 Beyond numericsMake #PyData great (again)

G Varoquaux 29

Page 65: Scientist meets web dev: how Python became the language of data

3 Data/computation flow is crucial

03878794797927

03878794797927

03878794797927

03878794797927

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

Data-flow engines are everywheredask pure-Python dynamic scheduler

static compiler parallel & distributed

theano expression analysis pure-Python

tensorflow C library distributed

G Varoquaux 30

Page 66: Scientist meets web dev: how Python became the language of data

3 Data/computation flow is crucial

03878794797927

03878794797927

03878794797927

03878794797927

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

Data-flow engines are everywherePython should shine there:

reflexivity + metaprogramming + async“Python is the best numerical language out there

because it’s not a numerical language.” – Nathaniel Smith

API challenging:For algorithm design: no framework / inversion of control

G Varoquaux 30

Page 67: Scientist meets web dev: how Python became the language of data

3 Ingredients for future data flowsDistributed computation & Run-time analysis

Reflexivity is centralDebugging

Interactive workCode analysis

Persistence 03878794797927

03878794797927

Parallel computing

Pickledistribute code and data without data modelserialize intermediate resultsdeep of hash of any data structure joblib.hash

joblib:Simple parallel syntax:

Parallel(n jobs=2)(delayed(sqrt)(i) for i in range(10))

Fast persistence:joblib.dump(anything, ’filename.pkl.gz’)

Primitive for out of core:pointer = mem.cache(f).call and shelves(big data)

•Non-invasive syntax / paradigm•Fast on big numpy arrays•Soon backend system (job broker and persistence)

Gets job managment into algorithms (eg in scikit-learn)

G Varoquaux 31

Page 68: Scientist meets web dev: how Python became the language of data

3 Ingredients for future data flowsDistributed computation & Run-time analysis

Reflexivity is centralDebugging

Interactive workCode analysis

Persistence 03878794797927

03878794797927

Parallel computingPickle

distribute code and data without data modelserialize intermediate resultsdeep of hash of any data structure joblib.hash

Very limited (eg no lambda #19272)⇒ variants: dill, cloudpickle

joblib:Simple parallel syntax:

Parallel(n jobs=2)(delayed(sqrt)(i) for i in range(10))

Fast persistence:joblib.dump(anything, ’filename.pkl.gz’)

Primitive for out of core:pointer = mem.cache(f).call and shelves(big data)

•Non-invasive syntax / paradigm•Fast on big numpy arrays•Soon backend system (job broker and persistence)

Gets job managment into algorithms (eg in scikit-learn)

G Varoquaux 31

Page 69: Scientist meets web dev: how Python became the language of data

3 Ingredients for future data flowsDistributed computation & Run-time analysis

Reflexivity is centralDebugging

Interactive workCode analysis

Persistence 03878794797927

03878794797927

Parallel computingPickle

distribute code and data without data modelserialize intermediate resultsdeep of hash of any data structure joblib.hash

joblib:Simple parallel syntax:

Parallel(n jobs=2)(delayed(sqrt)(i) for i in range(10))

Fast persistence:joblib.dump(anything, ’filename.pkl.gz’)

Primitive for out of core:pointer = mem.cache(f).call and shelves(big data)

•Non-invasive syntax / paradigm•Fast on big numpy arrays•Soon backend system (job broker and persistence)

Gets job managment into algorithms (eg in scikit-learn)G Varoquaux 31

Page 70: Scientist meets web dev: how Python became the language of data

3 The Python VM is greatThe simplicity of the VM is our strength

Software Transactional Memory... would be niceBut, I want to use foreign memory

Java gained jmalloc for foreign memory

Better garbage collectionYes but, I easily plug into reference counting

A strength of Python is its clear C API⇒ Easy foreign functionality

Cython: the best of C and Python

Add types for speed (numpy arrays as float*)Call C to bind external libraries: surprisingly easy

no pointer arithmetics

An adaptation layer between Python VM and C

G Varoquaux 32

Page 71: Scientist meets web dev: how Python became the language of data

3 The Python VM is greatThe simplicity of the VM is our strength

Software Transactional Memory... would be niceBut, I want to use foreign memory

Java gained jmalloc for foreign memory

Better garbage collectionYes but, I easily plug into reference counting

A strength of Python is its clear C API⇒ Easy foreign functionality

Cython: the best of C and Python

Add types for speed (numpy arrays as float*)Call C to bind external libraries: surprisingly easy

no pointer arithmetics

An adaptation layer between Python VM and C

G Varoquaux 32

Page 72: Scientist meets web dev: how Python became the language of data

4 Working together

G Varoquaux 33

Page 73: Scientist meets web dev: how Python became the language of data

4 Scikit-learn is easy machine learningAs easy as py

from s k l e a r n import svmc l a s s i f i e r = svm.SVC()c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n )Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )

People love the encapsulationclassifier is a semi black box

The power of a simple object-oriented APIDocumentation-driven development

High-level, readable, simple API reduces cognitive loadPyData loves Python in return

G Varoquaux 34

Page 74: Scientist meets web dev: how Python became the language of data

4 Scikit-learn is easy machine learningAs easy as py

from s k l e a r n import svmc l a s s i f i e r = svm.SVC()c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n )Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )

People love the encapsulationclassifier is a semi black box

The power of a simple object-oriented APIDocumentation-driven development

High-level, readable, simple API reduces cognitive loadPyData loves Python in return

G Varoquaux 34

Page 75: Scientist meets web dev: how Python became the language of data

4 Difference is richnessWe all do different thingsWe can all benefit from others

though we don’t know how

Being didactic outside one’s community is crucialAvoiding jargon take that machine learningPrioritizing information“Simple is better than complex”

Students learning numerics don’t care about unicode

Build documentation upon very simple examplesThink stackoverflow

Sphinx + Sphinx-gallery

G Varoquaux 35

Page 76: Scientist meets web dev: how Python became the language of data

4 Difference is richness, but requires outreachWe all do different thingsWe can all benefit from others

though we don’t know how

Being didactic outside one’s community is crucialAvoiding jargon take that machine learningPrioritizing information“Simple is better than complex”

Students learning numerics don’t care about unicode

Build documentation upon very simple examplesThink stackoverflow

Sphinx + Sphinx-gallery

G Varoquaux 35

Page 77: Scientist meets web dev: how Python became the language of data

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulatelow-level constructs

with high-level wordingsConnects to other paradigms, eg C

Page 78: Scientist meets web dev: how Python became the language of data

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulatelow-level constructs

with high-level wordings

Dynamism and reflexivity⇒ meta-programming and debugging

Page 79: Scientist meets web dev: how Python became the language of data

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulatelow-level constructs

with high-level wordings

Dynamism and reflexivity⇒ meta-programming and debugging

Needs for compilation and dynamism:a difficult balance

PEP 509: guards on run-time modificationPEP 510: function specicalization

Page 80: Scientist meets web dev: how Python became the language of data

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulatelow-level constructs

with high-level wordings

Dynamism and reflexivity⇒ meta-programming and debugging

Needs for compilation and dynamism

Pydata will use DB and concurrency from web

PyData can give knowledge engineering + AI