Eric Chiang
eric@yhathq.com
Data Science DevOps
yhat
Blog
Hi, my name is Eric… and I’m a software engineer.
{ "pred_class" : 0, "prob": 0.87}
Today’s Agenda
1. Data science or machine learning?
2. A story about protein
3. What makes DevOps hard
4. Attempting solutions
5. Q & A
First, let’s get a couple things out of the way.
Next.ML
Data Science or Machine Learning?!?
Data Science: “Applying machine learning to organizational problems.”
Different Problems
- Software engineering, not strictly engineering.
- How do I work with others?
- How do I version this?
- How do I not break things?
Data Science: In the end, still machine learning.
Bro, do you even big data?
Data mining vs. model building
Okay, story time
Let’s talk about protein
(Crystallography Team)
●Uses “brute force” approach to crystallize proteins
●Manually scores images one at a time
Crystal
$$$$$
Murky stuff
Lighting differences
Important murky stuff
What the hell is this line?
R & D
R & D Production
Porting this was hard
Will it make my job easier?
:(
yhat
R & D Production
Difficult to encapsulate
How to make them production ready?
“Production”
- Reliable
- Reproducible
- Scalable
Some reasons why this is hard to achieve
Model != Service (bear with me on this one)
Software stacks are complicated
All technologies can be connected to over a network
Where does machine learning fit into this stack?
What stops us from doing this?
Machine learning is stateful
import alg
…
model = alg.train(data)
…
model.predict(newdata)
- Source code doesn’t encapsulate the program
- Training is expensive (don’t want to do it every time)
Serialization
$ python
>>> import pickle
>>> x = 3
>>> p = pickle.dumps(x)
>>> y = pickle.loads(p)
>>> y
3
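To connect the pickle round trip to the "machine learning is stateful" point, here is a minimal sketch. The `train` and `predict` functions and the threshold "model" are hypothetical stand-ins for a real library; the only thing assumed about the real case is that training yields state that has to be saved somewhere other than the source code.

```python
import pickle
import statistics


def train(values):
    """Pretend training step: learn a threshold from the data (hypothetical model)."""
    return {"threshold": statistics.mean(values)}


def predict(model, x):
    """Classify x as 1 if it is above the learned threshold, else 0."""
    return int(x > model["threshold"])


# Train once (imagine this is the expensive part)...
model = train([1.0, 2.0, 3.0, 6.0])

# ...then serialize the learned state so it can be shipped elsewhere.
blob = pickle.dumps(model)

# A separate process can restore the state without retraining.
restored = pickle.loads(blob)
```

The source code above is identical before and after training; what makes predictions possible is the contents of `blob`.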
Serialization has its own set of problems
Traceback (most recent call last):
  File "diabetes.py", line 33, in <module>
    R2 = cross_val_score(clf, X, y=y, cv=KFold(y.size, K), n_jobs=1).mean()
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1361, in cross_val_score
    for train, test in cv)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1459, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 273, in fit
    for i, t in enumerate(trees))
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 94, in _parallel_build_trees
    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
  File "/home/eric/programming/python/env/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 227, in fit
    raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]
Customer - “My model isn’t working”
What’s the problem?
…
R2 = cross_val_score(clf, X, y=y, cv=KFold(y.size, K), n_jobs=1).mean()
…
tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]
What’s the problem?
- Pickle a scikit-learn 0.16.1 model
- Unpickle it in 0.15.1
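One way to surface this kind of mismatch early is to store the library version next to the pickled model and compare it at load time. The following is a sketch only, not something the talk prescribes: `dump_with_version` and `load_with_version` are hypothetical helpers, and the version string is passed in by the caller (with scikit-learn it could come from `sklearn.__version__`).

```python
import pickle


def dump_with_version(model, library_version):
    """Bundle the pickled model with the library version that produced it."""
    bundle = {
        "library_version": library_version,
        "payload": pickle.dumps(model),
    }
    return pickle.dumps(bundle)


def load_with_version(blob, library_version):
    """Refuse to unpickle the model if the running library version differs."""
    bundle = pickle.loads(blob)
    saved = bundle["library_version"]
    if saved != library_version:
        raise RuntimeError(
            "model was pickled with version %s but this environment has %s"
            % (saved, library_version)
        )
    return pickle.loads(bundle["payload"])


# A plain dict stands in for a trained model here.
blob = dump_with_version({"coef": [1.0]}, "0.16.1")
```

With this in place, loading the 0.16.1 bundle in a 0.15.1 environment fails with a message that names both versions, instead of a `ValueError` deep inside the library.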
Interpreted languages have a lot of runtime dependencies
Reproducing dependencies is critical
- Dependency detection can be hard to automate
- Package managers aren’t perfect
Example: pip
- Not standard
- Can do a poor job of installing dependencies
- Only recently precompiled
Example: R
- Can’t install a specific version of a package
- No, seriously
Solution
- Use a better package manager
- Ship your dependencies
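As a rough illustration of recording dependencies so they can be reproduced, the sketch below lists every Python distribution visible to the current interpreter as a pinned `name==version` line. It uses the standard library's `importlib.metadata` (Python 3.8+); `pinned_requirements` is a hypothetical helper. It only sees Python packages, not system libraries or compilers, which is part of why detection is hard to automate.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            lines.add("%s==%s" % (name, dist.version))
    return sorted(lines)


requirements = pinned_requirements()
```

The resulting lines have the same shape as a pip requirements file, so they can be written to disk and shipped alongside the model.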
Compiled languages are easier
- Matlab (MCC), Scala
- Linking still an issue
Data transforms can be critical to the model
PMML
PMML?
def tokenize(s):
    s = s.lower()
    s = s.split(" ")
    return s
$ python
>>> def tokenize(s):
...     return s.lower().split(" ")
...
>>> import pickle
>>> pickle.dumps(tokenize)
'c__main__\ntokenize\np0\n.'
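The pickle output above holds only a module name and a function name, not the function body. The sketch below shows the same behavior in Python 3, using the standard library function `textwrap.dedent` as a stand-in (chosen only because it can be imported anywhere; a function defined in an interactive session is recorded the same way, as `__main__` plus its name).

```python
import pickle
import textwrap

# Protocol 0 is the text-based format shown in the interactive session above.
payload = pickle.dumps(textwrap.dedent, protocol=0)

# Unpickling looks the function up by module and name in the *current*
# environment; no code travels inside the payload.
restored = pickle.loads(payload)
```

So a transform like `tokenize` only survives the trip if the receiving environment already has a function with the same module and name, which is why data transforms need to be shipped with the model some other way.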
Trying to get models onto a network
Databases are great
1) Compute regression
2) Shove coefficients in database
3) …4) Profit?!?
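The three steps above can be sketched with the standard library alone. This assumes a one-variable linear regression and an in-memory SQLite database; the table name `coefficients` and the helpers `fit_line` and `predict` are made up for illustration.

```python
import sqlite3


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# 1) Compute regression (toy data lying on y = 2x + 1).
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# 2) Shove coefficients in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coefficients (name TEXT PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO coefficients VALUES (?, ?)",
    [("slope", slope), ("intercept", intercept)],
)


def predict(conn, x):
    """3) Any service that can reach the database can now score x."""
    coef = dict(conn.execute("SELECT name, value FROM coefficients"))
    return coef["slope"] * x + coef["intercept"]
```

This works because a linear model's whole state is a handful of numbers. Models whose state is a forest of trees, or that depend on a custom transform, do not reduce to a table this easily.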
Simple Web Servers
- You’re still stuck with environment management problems
- Some modeling languages are not languages you want to write a server in…
- Division of roles
- NPR uses Flask for visualization dev but not production website
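For a sense of what "a simple web server around a model" can look like, here is a minimal sketch using only Python's built-in `http.server`. The threshold model, the `score` input field, and the `make_server` helper are all hypothetical. It is not production-grade, and it does nothing about the environment problems listed above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = {"threshold": 0.5}  # hypothetical stand-in for a trained model


def predict(model, features):
    """Score one JSON-decoded request body against the model."""
    score = float(features["score"])
    return {"pred_class": int(score > model["threshold"]), "score": score}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the model, reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(MODEL, features)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Build (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server(8000).serve_forever()` would start answering POST requests on port 8000. Everything the model imports still has to be installed on that machine.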
A solution we (yhat) have decided on
Containers FTW
- Containers address a lot of previous concerns
- Reproducibility, managing environments, etc.
- Cheap
- Word of warning: if you choose this route you will be managing models and Docker
{ "pred_class" : 0, "prob": 0.87}
Take a model
“Deploy” to our platform
Defer to Docker
$ pip install foo==0.2.4
$ pip install bar==1.4.9
Attempt to recreate env
Replicate as necessary
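As an illustration of turning pinned versions like the ones above into something a container build can replay, the sketch below renders Dockerfile text from a dict of pins. This is an assumption about how it could be done, not a description of yhat's platform. `render_dockerfile` is a hypothetical helper, and the `python:2.7` base image is chosen only to match the Python version in the earlier traceback.

```python
def render_dockerfile(pins, base_image="python:2.7"):
    """Render Dockerfile text that installs each pinned package with pip."""
    lines = ["FROM %s" % base_image]
    for name, version in sorted(pins.items()):
        lines.append("RUN pip install %s==%s" % (name, version))
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile({"foo": "0.2.4", "bar": "1.4.9"})
```

Building an image from that text once and running as many containers from it as needed is one way to read "attempt to recreate env, replicate as necessary".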
The dev team is always looking for better ways of doing this
Thanks!
And remember to be nice to your DevOps.