Eric Chiang eric@yhathq.com

Next.ml Boston: Data Science Dev Ops

Page 1: Next.ml Boston: Data Science Dev Ops

Eric Chiang eric@yhathq.com

Page 2: Next.ml Boston: Data Science Dev Ops

Data Science DevOps

Page 3: Next.ml Boston: Data Science Dev Ops

yhat

Page 4: Next.ml Boston: Data Science Dev Ops

Blog

Page 5: Next.ml Boston: Data Science Dev Ops

Hi, my name is Eric… and I’m a software engineer.

Page 6: Next.ml Boston: Data Science Dev Ops
Page 7: Next.ml Boston: Data Science Dev Ops
Page 8: Next.ml Boston: Data Science Dev Ops

{ "pred_class" : 0, "prob": 0.87}

Page 9: Next.ml Boston: Data Science Dev Ops

Data Science DevOps

Page 10: Next.ml Boston: Data Science Dev Ops

Today’s Agenda
1. Data science or machine learning?
2. A story about protein
3. What makes DevOps hard
4. Attempting solutions
5. Q & A

Page 11: Next.ml Boston: Data Science Dev Ops

First, let’s get a couple things out of the way.

Page 12: Next.ml Boston: Data Science Dev Ops

Next.ML

Page 13: Next.ml Boston: Data Science Dev Ops

Next.ML

Page 14: Next.ml Boston: Data Science Dev Ops

Data Science or Machine Learning?!?

Page 15: Next.ml Boston: Data Science Dev Ops

Data Science
“Applying machine learning to organizational problems.”

Page 16: Next.ml Boston: Data Science Dev Ops

Different Problems
Software engineering, not strictly engineering.

Page 17: Next.ml Boston: Data Science Dev Ops

Different Problems
How do I work with others?

Page 18: Next.ml Boston: Data Science Dev Ops

Different Problems
How do I version this?

Page 19: Next.ml Boston: Data Science Dev Ops

Different Problems
How do I not break things?

Page 20: Next.ml Boston: Data Science Dev Ops

Data Science
In the end, it's still machine learning.

Page 21: Next.ml Boston: Data Science Dev Ops

Bro, do you even big data?

Page 22: Next.ml Boston: Data Science Dev Ops

Data mining vs. model building

Page 23: Next.ml Boston: Data Science Dev Ops

Data mining vs. model building

Page 24: Next.ml Boston: Data Science Dev Ops

Okay, story time

Page 25: Next.ml Boston: Data Science Dev Ops

Let’s talk about protein

Page 26: Next.ml Boston: Data Science Dev Ops
Page 27: Next.ml Boston: Data Science Dev Ops

(Crystallography Team)

Page 28: Next.ml Boston: Data Science Dev Ops

(Crystallography Team)

●Uses “brute force” approach to crystallize proteins

Page 29: Next.ml Boston: Data Science Dev Ops

(Crystallography Team)

●Uses “brute force” approach to crystallize proteins

●Manually scores images one at a time

Page 30: Next.ml Boston: Data Science Dev Ops
Page 31: Next.ml Boston: Data Science Dev Ops

Crystal

Page 32: Next.ml Boston: Data Science Dev Ops

$$$$$

Page 33: Next.ml Boston: Data Science Dev Ops
Page 34: Next.ml Boston: Data Science Dev Ops

Murky stuff

Page 35: Next.ml Boston: Data Science Dev Ops

Lighting differences

Page 36: Next.ml Boston: Data Science Dev Ops

Important murky stuff

Page 37: Next.ml Boston: Data Science Dev Ops

What the hell is this line?

Page 38: Next.ml Boston: Data Science Dev Ops
Page 39: Next.ml Boston: Data Science Dev Ops

R & D

Page 40: Next.ml Boston: Data Science Dev Ops

R & D Production

Page 41: Next.ml Boston: Data Science Dev Ops
Page 42: Next.ml Boston: Data Science Dev Ops
Page 43: Next.ml Boston: Data Science Dev Ops
Page 44: Next.ml Boston: Data Science Dev Ops
Page 45: Next.ml Boston: Data Science Dev Ops
Page 46: Next.ml Boston: Data Science Dev Ops

Porting this was hard

Page 47: Next.ml Boston: Data Science Dev Ops

Will it make my job easier?

Page 48: Next.ml Boston: Data Science Dev Ops

:(

Page 49: Next.ml Boston: Data Science Dev Ops

yhat

Page 50: Next.ml Boston: Data Science Dev Ops

R & D Production

Page 51: Next.ml Boston: Data Science Dev Ops

Difficult to encapsulate

Page 52: Next.ml Boston: Data Science Dev Ops

How to make them production ready?

Page 53: Next.ml Boston: Data Science Dev Ops

“Production”
- Reliable
- Reproducible
- Scalable

Page 54: Next.ml Boston: Data Science Dev Ops

Some reasons why this is hard to achieve

Page 55: Next.ml Boston: Data Science Dev Ops

Model != Service
(bear with me on this one)

Page 56: Next.ml Boston: Data Science Dev Ops

Software stacks are complicated

Page 57: Next.ml Boston: Data Science Dev Ops

Software stacks are complicated

Page 58: Next.ml Boston: Data Science Dev Ops

All technologies can be connected to over a network

Page 59: Next.ml Boston: Data Science Dev Ops
Page 60: Next.ml Boston: Data Science Dev Ops
Page 61: Next.ml Boston: Data Science Dev Ops
Page 62: Next.ml Boston: Data Science Dev Ops
Page 63: Next.ml Boston: Data Science Dev Ops
Page 64: Next.ml Boston: Data Science Dev Ops

Where does machine learning fit into this stack?

Page 65: Next.ml Boston: Data Science Dev Ops

What stops us from doing this?

Page 66: Next.ml Boston: Data Science Dev Ops

Machine learning is stateful

Page 67: Next.ml Boston: Data Science Dev Ops

import alg
…
model = alg.train(data)
…
model.predict(newdata)

Page 68: Next.ml Boston: Data Science Dev Ops

import alg
…
model = alg.train(data)
…
model.predict(newdata)

Page 69: Next.ml Boston: Data Science Dev Ops

- Source code doesn’t encapsulate program

Page 70: Next.ml Boston: Data Science Dev Ops

- Source code doesn’t encapsulate program
- Training is expensive (don’t want to do it every time)

Page 71: Next.ml Boston: Data Science Dev Ops

Serialization

Page 72: Next.ml Boston: Data Science Dev Ops

Serialization
$ python
>>> import pickle
>>> x = 3
>>> p = pickle.dumps(x)
>>> y = pickle.loads(p)
>>> y
3
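A minimal sketch of the same idea applied to a model, assuming a toy least-squares fit stands in for `alg.train` (the `train` and `predict` helpers and the data are hypothetical, not from the talk). The fitted parameters are state that the source code alone does not contain, so they are pickled once and reloaded rather than retrained.

```python
import pickle


def train(data):
    """Fit y = slope * x + intercept by least squares (toy stand-in for alg.train)."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in data)
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    """Score one new value with previously fitted parameters."""
    return model["slope"] * x + model["intercept"]


# Train once (the expensive step), then serialize the fitted state.
model = train([(1, 2), (2, 4), (3, 6)])
blob = pickle.dumps(model)

# Later, possibly in another process: load the state instead of retraining.
restored = pickle.loads(blob)
prediction = predict(restored, 4)
```

The model here is a plain dict so the example stays self-contained; a real estimator object would also drag its defining library into the pickle.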

Page 73: Next.ml Boston: Data Science Dev Ops

Serialization has its own set of problems

Page 74: Next.ml Boston: Data Science Dev Ops

Traceback (most recent call last):
  File "diabetes.py", line 33, in <module>
    R2 = cross_val_score(clf, X, y=y, cv=KFold(y.size, K), n_jobs=1).mean()
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1361, in cross_val_score
    for train, test in cv)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1459, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 273, in fit
    for i, t in enumerate(trees))
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 94, in _parallel_build_trees
    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 227, in fit
    raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]

Customer - “My model isn’t working”

Page 75: Next.ml Boston: Data Science Dev Ops

What’s the problem?
…
R2 = cross_val_score(clf, X, y=y, cv=KFold(y.size, K), n_jobs=1).mean()
…
tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]

Page 76: Next.ml Boston: Data Science Dev Ops

What’s the problem?
- Pickle a scikit-learn 0.16.1 model
- Unpickle it in 0.15.1
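One way to catch this kind of mismatch early, sketched under assumptions: store the library version next to the pickled model and refuse to load when the environment differs. The helper names `dump_with_version` and `load_with_version` are hypothetical, and the version is passed in as a plain string rather than read from a real library.

```python
import pickle


def dump_with_version(model, library_version):
    """Bundle the model with the version of the library that produced it."""
    return pickle.dumps({"library_version": library_version, "model": model})


def load_with_version(blob, library_version):
    """Unpickle the bundle, failing loudly if the versions do not match."""
    bundle = pickle.loads(blob)
    saved = bundle["library_version"]
    if saved != library_version:
        raise RuntimeError(
            "model was pickled with version %s but this environment has %s"
            % (saved, library_version)
        )
    return bundle["model"]


# Pickled where the library is at 0.16.1 (toy dict standing in for a model).
blob = dump_with_version({"coef": [1.0, 2.0]}, "0.16.1")

# Loading in a matching environment works.
model = load_with_version(blob, "0.16.1")

# Loading in a 0.15.1 environment fails with a clear message instead of
# an obscure error deep inside the library.
try:
    load_with_version(blob, "0.15.1")
    mismatch_error = None
except RuntimeError as exc:
    mismatch_error = str(exc)
```

This only works because the model is wrapped in a plain dict; if the estimator class itself cannot be unpickled in the older version, the check has to happen before the model bytes are touched.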

Page 77: Next.ml Boston: Data Science Dev Ops

Interpreted languages have a lot of run time dependencies

Page 78: Next.ml Boston: Data Science Dev Ops
Page 79: Next.ml Boston: Data Science Dev Ops

Reproducing dependencies is critical

Page 80: Next.ml Boston: Data Science Dev Ops

Reproducing dependencies is critical
Dependency detection can be hard to automate

Page 81: Next.ml Boston: Data Science Dev Ops

Reproducing dependencies is critical
Package managers aren’t perfect

Page 82: Next.ml Boston: Data Science Dev Ops

Example: pip

Page 83: Next.ml Boston: Data Science Dev Ops

Example: pip
- Not standard

Page 84: Next.ml Boston: Data Science Dev Ops

Example: pip
- Not standard
- Can do a poor job of installing dependencies

Page 85: Next.ml Boston: Data Science Dev Ops

Example: pip
- Not standard
- Can do a poor job of installing dependencies
- Only recently precompiled

Page 86: Next.ml Boston: Data Science Dev Ops

Example: R

Page 87: Next.ml Boston: Data Science Dev Ops

Example: R
- Can’t install a specific version of a package

Page 88: Next.ml Boston: Data Science Dev Ops

Example: R
- Can’t install a specific version of a package
- No, seriously

Page 89: Next.ml Boston: Data Science Dev Ops

Solution

Page 90: Next.ml Boston: Data Science Dev Ops

Solution
- Use a better package manager

Page 91: Next.ml Boston: Data Science Dev Ops

Solution
- Use a better package manager
- Ship your dependencies
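A rough sketch of the first step in shipping dependencies: record exactly what is installed where the model was built. It assumes Python 3.8+ for the standard library's `importlib.metadata`; the `pinned_requirements` helper is hypothetical, and it captures only Python packages, not system libraries.

```python
from importlib import metadata


def pinned_requirements():
    """Return 'name==version' pins for every distribution in this environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with broken or missing metadata.
        if name:
            pins.add("%s==%s" % (name, dist.version))
    return sorted(pins, key=str.lower)


# The resulting lines can be written next to the serialized model and fed
# to the package manager when the environment is recreated.
requirements = pinned_requirements()
```

This is similar in spirit to `pip freeze`; the point is that the pins travel with the model rather than being rediscovered later.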

Page 92: Next.ml Boston: Data Science Dev Ops

Compiled languages are easier
- Matlab (MCC), Scala

Page 93: Next.ml Boston: Data Science Dev Ops

Compiled languages are easier
- Matlab (MCC), Scala
- Linking is still an issue

Page 94: Next.ml Boston: Data Science Dev Ops

Data transforms can be critical to the model

Page 95: Next.ml Boston: Data Science Dev Ops

PMML

Page 96: Next.ml Boston: Data Science Dev Ops

PMML?
def tokenize(s):
    s = s.lower()
    s = s.split(" ")
    return s

Page 97: Next.ml Boston: Data Science Dev Ops

$ python
>>> def tokenize(s):
...     return s.lower().split(" ")
...
>>> import pickle
>>> pickle.dumps(tokenize)
'c__main__\ntokenize\np0\n.'

Page 98: Next.ml Boston: Data Science Dev Ops

$ python
>>> def tokenize(s):
...     return s.lower().split(" ")
...
>>> import pickle
>>> pickle.dumps(tokenize)
'c__main__\ntokenize\np0\n.'
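The output above is tiny because pickle stores a function by reference (module name plus function name), not its code. A small sketch of the same behavior in Python 3, using the standard library's `json.dumps` as the function so the example runs anywhere; that choice is an assumption for illustration, not something from the talk.

```python
import json
import pickle

# Pickling a function records only where to find it, not what it does.
payload = pickle.dumps(json.dumps)

stores_module_name = b"json" in payload
stores_function_name = b"dumps" in payload

# Unpickling simply re-imports the module and looks the name up again,
# so the result is the very same function object.
restored = pickle.loads(payload)
same_object = restored is json.dumps
```

For a transform like `tokenize`, this means the receiving process must already have an importable function with the same name in the same module, or unpickling fails.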

Page 99: Next.ml Boston: Data Science Dev Ops

Trying to get models onto a network

Page 100: Next.ml Boston: Data Science Dev Ops

Databases are great

Page 101: Next.ml Boston: Data Science Dev Ops

1) Compute regression

2) Shove coefficients in database

3) …

4) Profit?!?
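The steps above can be sketched with the standard library's `sqlite3`, assuming a linear model whose coefficients were computed elsewhere. The table name, feature names, and weights are hypothetical.

```python
import sqlite3

# Step 2: shove the (already computed) coefficients into a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coefficients (feature TEXT PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO coefficients VALUES (?, ?)",
    [("intercept", 1.0), ("age", 0.5), ("income", 2.0)],
)
conn.commit()


def score(conn, row):
    """Score one record using whatever coefficients are currently stored."""
    weights = dict(conn.execute("SELECT feature, weight FROM coefficients"))
    return weights["intercept"] + sum(
        weights[name] * value for name, value in row.items()
    )


# Any application that can reach the database can now score records.
result = score(conn, {"age": 2.0, "income": 3.0})
```

This works for models that reduce to a table of numbers; it gets much harder once the model needs custom transforms or is not linear.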

Page 102: Next.ml Boston: Data Science Dev Ops

Simple Web Servers

Page 103: Next.ml Boston: Data Science Dev Ops

Simple Web Servers
You’re still stuck with environment management problems

Page 104: Next.ml Boston: Data Science Dev Ops

Simple Web Servers
Some modeling languages are not languages you want to write a server in…

Page 105: Next.ml Boston: Data Science Dev Ops

Simple Web Servers
- Division of roles
- NPR uses Flask for visualization dev but not for its production website
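A minimal sketch of what "simple web server" can mean here, using only the standard library's `http.server`. The logistic `predict` function is a placeholder assumption standing in for a real model, and `handle_request`, `PredictHandler`, and `serve` are hypothetical names. The response has the same shape as the JSON shown earlier in the talk.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model: a logistic function of a single feature 'x'."""
    prob = 1.0 / (1.0 + math.exp(-features.get("x", 0.0)))
    return {"pred_class": int(prob >= 0.5), "prob": prob}


def handle_request(body):
    """Decode a JSON request body, run the model, encode the JSON response."""
    features = json.loads(body.decode("utf-8"))
    return json.dumps(predict(features)).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()


# The request handling logic can be exercised without starting the server.
response = json.loads(handle_request(b'{"x": 0}'))
```

The server itself is the easy part; the process running it still needs the same interpreter, libraries, and serialized model as the machine the model was built on.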

Page 106: Next.ml Boston: Data Science Dev Ops

A solution we (yhat) have decided on

Page 107: Next.ml Boston: Data Science Dev Ops

Containers FTW

Page 108: Next.ml Boston: Data Science Dev Ops

Containers FTW
Containers address a lot of previous concerns

Page 109: Next.ml Boston: Data Science Dev Ops

Containers FTW
Reproducibility, managing environments, etc.

Page 110: Next.ml Boston: Data Science Dev Ops

Containers FTW
Cheap

Page 111: Next.ml Boston: Data Science Dev Ops
Page 112: Next.ml Boston: Data Science Dev Ops

Containers FTW
Word of warning:

Page 113: Next.ml Boston: Data Science Dev Ops

Containers FTW
Word of warning:
If you choose this route you will be managing models and Docker

Page 114: Next.ml Boston: Data Science Dev Ops

{ "pred_class" : 0, "prob": 0.87}

Page 115: Next.ml Boston: Data Science Dev Ops

Take a model

Page 116: Next.ml Boston: Data Science Dev Ops

“Deploy” to our platform

Page 117: Next.ml Boston: Data Science Dev Ops

Defer to Docker

Page 118: Next.ml Boston: Data Science Dev Ops

$ pip install foo==0.2.4
$ pip install bar==1.4.9
Attempt to recreate env

Page 119: Next.ml Boston: Data Science Dev Ops

$ pip install foo==0.2.4
$ pip install bar==1.4.9
Replicate as necessary
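A loose sketch of how pinned installs like the ones above might be turned into a container recipe. This is not yhat's actual implementation: the `render_dockerfile` helper, the `python:3.11-slim` base image, and the `serve.py` entrypoint are all assumptions chosen for illustration.

```python
def render_dockerfile(pins, base_image="python:3.11-slim", entrypoint="serve.py"):
    """Render Dockerfile text that installs each pinned package, then runs the app."""
    lines = ["FROM %s" % base_image, "WORKDIR /app"]
    for pin in pins:
        # One install step per pin, mirroring the commands on the slide.
        lines.append("RUN pip install %s" % pin)
    lines.append("COPY . /app")
    lines.append('CMD ["python", "%s"]' % entrypoint)
    return "\n".join(lines) + "\n"


dockerfile_text = render_dockerfile(["foo==0.2.4", "bar==1.4.9"])
```

Once the environment is captured as an image, replicating it is a matter of starting more containers from that image.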

Page 120: Next.ml Boston: Data Science Dev Ops

The dev team is always trying to learn better ways of doing this

Page 121: Next.ml Boston: Data Science Dev Ops

Thanks!

And remember to be nice to your DevOps.