Next.ML Boston: Data Science DevOps


Eric Chiang (eric@yhathq.com)

Data Science DevOps

yhat

Blog

Hi, my name is Eric… and I’m a software engineer.

{ "pred_class" : 0, "prob": 0.87}

Data Science DevOps

Today’s Agenda
1. Data science or machine learning?
2. A story about protein
3. What makes DevOps hard
4. Attempting solutions
5. Q & A

First, let’s get a couple things out of the way.

Next.ML

Data Science or Machine Learning?!?

Data Science: “Applying machine learning to organizational problems.”

Different Problems
- Software engineering, not strictly engineering.
- How do I work with others?
- How do I version this?
- How do I not break things?

Data Science: In the end, still machine learning.

Bro, do you even big data?

Data mining vs. model building

Okay, story time

Let’s talk about protein


(Crystallography Team)
● Uses a “brute force” approach to crystallize proteins
● Manually scores images one at a time

Crystal

$$$$$

Murky stuff

Lighting differences

Important murky stuff

What the hell is this line?

R & D

R & D → Production

Porting this was hard

Will it make my job easier?

:(

yhat

R & D → Production

Difficult to encapsulate

How to make them production ready?

“Production”
- Reliable
- Reproducible
- Scalable

Some reasons why this is hard to achieve

Model != Service (bear with me on this one)

Software stacks are complicated

All technologies can be connected to over a network

Where does machine learning fit into this stack?

What stops us from doing this?

Machine learning is stateful

import alg
...
model = alg.train(data)
...
model.predict(newdata)

- Source code doesn’t encapsulate the program
- Training is expensive (you don’t want to do it every time)

Serialization

Serialization

$ python
>>> import pickle
>>> x = 3
>>> p = pickle.dumps(x)
>>> y = pickle.loads(p)
>>> y
3
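Pickling works the same way for fitted models. A minimal sketch (illustrative dataset and file name, not from the talk): train once, persist the fitted estimator, and load it later without retraining.

import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train once (the expensive part).
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Persist the fitted model...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and restore it later (e.g. in a prediction service) without retraining.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict(X[:1]))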

Serialization has its own set of problems

Traceback (most recent call last):
  File "diabetes.py", line 33, in <module>
    R2 = cross_val_score(clf, X, y=y, cv=KFold(y.size, K), n_jobs=1).mean()
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1361, in cross_val_score
    for train, test in cv)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1459, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 273, in fit
    for i, t in enumerate(trees))
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 94, in _parallel_build_trees
    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 227, in fit
    raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]

Customer - “My model isn’t working”

What’s the problem?
…
R2 = cross_val_score(clf, X, y=y, cv=KFold(y.size, K), n_jobs=1).mean()
…
tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)

raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")

ValueError: min_weight_fraction_leaf must in [0, 0.5]

What’s the problem?
- Pickle a scikit-learn 0.16.1 model
- Unpickle it in 0.15.1
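One defensive pattern (a sketch, not something from the talk): store the library version alongside the pickled model and fail loudly on a mismatch, instead of surfacing a cryptic traceback to the customer.

import pickle

import sklearn

def save_model(model, path):
    # Record the scikit-learn version the model was pickled under.
    with open(path, "wb") as f:
        pickle.dump({"sklearn_version": sklearn.__version__, "model": model}, f)

def load_model(path):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload["sklearn_version"] != sklearn.__version__:
        raise RuntimeError(
            "model was pickled with scikit-learn %s, but %s is installed"
            % (payload["sklearn_version"], sklearn.__version__)
        )
    return payload["model"]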

Interpreted languages have a lot of runtime dependencies

Reproducing dependencies is critical
- Dependency detection can be hard to automate
- Package managers aren’t perfect

Example: pip
- Not standard
- Can do a poor job of installing dependencies
- Only recently gained precompiled packages (wheels)

Example: R
- Can’t install a specific version of a package
- No, seriously

Solution
- Use a better package manager
- Ship your dependencies (see the sketch below)
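A minimal sketch of the assumed workflow (not from the talk): capture the exact package versions the model was built against, so production can recreate them later with `pip install -r requirements.txt`.

import subprocess

# Freeze the current environment's exact package versions.
with open("requirements.txt", "w") as f:
    subprocess.check_call(["pip", "freeze"], stdout=f)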

Compiled languages are easier
- Matlab (MCC), Scala
- Linking is still an issue

Data transforms can be critical to the model

PMML

PMML?

def tokenize(s):
    s = s.lower()
    s = s.split(" ")
    return s

$ python
>>> def tokenize(s):
...     return s.lower().split(" ")
...
>>> import pickle
>>> pickle.dumps(tokenize)
'c__main__\ntokenize\np0\n.'
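That pickle string is only a reference ("__main__.tokenize"), not the function body. A sketch of why that bites (assumed scenario, not from the talk): unpickling in a process that never defined tokenize fails, so the data transform doesn't travel with the model.

import pickle

# The bytes from the slide above: a reference to __main__.tokenize, no source code.
payload = b'c__main__\ntokenize\np0\n.'

try:
    pickle.loads(payload)
except AttributeError as err:
    # Fails in any process that hasn't defined tokenize itself.
    print(err)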

Trying to get models onto a network

Databases are great

1) Compute regression

2) Shove coefficients in database

3) …
4) Profit?!?
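For a plain linear model this can actually work, since prediction is just a dot product. A minimal sketch (illustrative data, not from the talk) of why the coefficients are all you need to store, and why the approach stops scaling once the model is more than arithmetic:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data.
X = np.random.rand(100, 3)
y = X.dot(np.array([1.5, -2.0, 0.5])) + 3.0

model = LinearRegression().fit(X, y)
coefs, intercept = model.coef_, model.intercept_  # the rows you'd shove in a table

# Scoring later, anywhere that can do arithmetic (SQL, another language, ...):
new_row = np.array([0.1, 0.2, 0.3])
prediction = new_row.dot(coefs) + intercept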

Simple Web Servers
- You’re still stuck with environment management problems
- Some modeling languages are not languages you want to write a server in…
- Division of roles: NPR uses Flask for visualization development, but not for the production website
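For concreteness, a minimal sketch of the simple-web-server approach (illustrative code, not the talk's): a Flask endpoint that loads a pickled model and returns JSON in the { "pred_class": 0, "prob": 0.87 } shape from earlier.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical file produced by the earlier serialization sketch.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    pred = model.predict([features])[0]
    prob = model.predict_proba([features])[0].max()
    return jsonify({"pred_class": int(pred), "prob": float(prob)})

if __name__ == "__main__":
    app.run()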

A solution we (yhat) have decided on

Containers FTW
- Containers address a lot of the previous concerns
- Reproducibility, managing environments, etc.
- Cheap
- Word of warning: if you choose this route, you will be managing both models and Docker
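To make that concrete, a hedged sketch of what shipping your dependencies looks like in a container (a hypothetical Dockerfile and file names, not yhat's actual setup):

# Hypothetical Dockerfile: pin the runtime and exact package versions,
# then bake the pickled model and prediction server into the image.
FROM python:2.7

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY model.pkl app.py ./
CMD ["python", "app.py"]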

{ "pred_class" : 0, "prob": 0.87}

Take a model

“Deploy” to our platform

Defer to Docker

$ pip install foo==0.2.4
$ pip install bar==1.4.9
Attempt to recreate env

$ pip install foo==0.2.4
$ pip install bar==1.4.9
Replicate as necessary

The dev team is always trying to learn better ways of doing this

Thanks!

And remember to be nice to your DevOps.