104
Data Science in the Cloud Stefan Krawczyk @stefkrawczyk linkedin.com/in/skrawczyk November 2016

in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Data Science in the Cloud

Stefan Krawczyk @stefkrawczyklinkedin.com/in/skrawczykNovember 2016

Page 2: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Who are Data Scientists?

Page 3: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 4: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 5: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 6: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Means: skills vary wildly

Page 7: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

But they’re in demand and expensive

Page 8: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

“The Sexiest Job of the 21st Century”

- HBR

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Page 9: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

How many Data Scientists do you have?

Page 10: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

At Stitch Fix we have ~80

Page 11: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

~85% have not done formal CS

Page 12: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

But what do they do?

Page 13: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

What is Stitch Fix?

Page 14: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 15: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 16: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 17: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 18: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 19: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 20: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible
Page 21: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Two Data Scientist facts:

1. Has AWS console access*.

2. End to end, they’re responsible.

Page 22: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

How do we enable this without

?

Page 23: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Make doing the right thing the easy thing.

Page 24: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Fellow Collaborators

Horizontal team focused on Data Scientist Enablement

Page 25: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

1. Eng. Skills2. Important

3. What they work on

Page 26: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Let’s Start

Page 27: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Will Only Cover

1. Source of truth: S3 & Hive Metastore2. Docker Enabled DS @ Stitch Fix3. Scaling DS doing ML in the Cloud

Page 28: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Source of truth:

S3 & Hive Metastore

Page 29: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Want Everyone to Have Same View

A

B

Page 30: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

This is Usually Nothing to Worry About

● OS handles correct access

● DB has ACID properties

A

B

Page 31: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

This is Usually Nothing to Worry About

● OS handles correct access

● DB has ACID properties

● But it’s easy to outgrow these options with a big data/team.

A

B

Page 32: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Amazon’s Simple Storage Service

● Infinite* storage

● Can write, read, delete, BUT NOT append.

● Looks like a file system*:○ URIs: my.bucket/path/to/files/file.txt

● Scales well

S3

* For all intents and purposes

Page 33: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Hadoop service, that stores:○ Schema○ Partition information, e.g. date○ Data location for a partition

Hive Metastore

Page 34: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Hadoop service, that stores:○ Schema○ Partition information, e.g. date○ Data location for a partition

Hive Metastore:

Hive Metastore

Partition Location

20161001 s3://bucket/sold_items/20161001

...

20161031 s3://bucket/sold_items/20161031

sold_items

Page 35: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Hive Metastore

Page 36: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Replacing data in a partition

But if we’re not careful

Page 37: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Replacing data in a partition

But if we’re not careful

Page 38: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

But if we’re not careful

B

A

Page 39: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

But if we’re not careful

● S3 is eventually consistent

● These bugs are hard to track down

A

B

Page 40: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Use Hive Metastore to control partition source of truth

● Principles:○ Never delete○ Always write to a new place each time a partition changes

● Stitch Fix solution: ○ Use an inner directory → called Batch ID

Hive Metastore to the Rescue

Page 41: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Batch ID Pattern

Page 42: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Batch ID Pattern

Date Location

20161001 s3://bucket/sold_items/20161001/20161002002334/

... ...

20161031 s3://bucket/sold_items/20161031/20161101002256/

sold_items

Page 43: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Overwriting a partition is just a matter of updating the location

Batch ID Pattern

Date Location

20161001 s3://bucket/sold_items/20161001/20161002002334/

... ...

20161031 s3://bucket/sold_items/20161031/20161101002256/s3://bucket/sold_items/20161031/20161102234252

sold_items

Page 44: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Overwriting a partition is just a matter of updating the location● To the user this is a hidden inner directory

Batch ID Pattern

Date Location

20161001 s3://bucket/sold_items/20161001/20161002002334/

... ...

20161031 s3://bucket/sold_items/20161031/20161101002256/s3://bucket/sold_items/20161031/20161102234252

sold_items

Page 45: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Enforce via API

Page 46: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Enforce via API

Page 47: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Python:

store_dataframe(df, dest_db, dest_table, partitions=[‘2016’])df = load_dataframe(src_db, src_table, partitions=[‘2016’])

R:sf_writer(data = result,

namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE)))

sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE)))

API for Data Scientists

Page 48: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Full partition history○ Can rollback

■ Data Scientists are less afraid of mistakes

○ Can create audit trails more easily■ What data changed and when

○ Can anchor downstream consumers to a particular batch ID

Batch ID Pattern Benefits

Page 49: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Docker Enabled DS @ Stitch Fix

Page 50: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Workstation Env. Mgmt. Scalability

Low Low

Medium Medium

High High

Ad hoc Infra: In the Beginning...

Page 51: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Workstation Env. Mgmt. Scalability

Low Low

Medium Medium

High High

Ad hoc Infra: Evolution I

Page 52: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Workstation Env. Mgmt. Scalability

Low Low

Medium Medium

High High

Ad hoc Infra: Evolution II

Page 53: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Workstation Env. Mgmt. Scalability

Low Low

Medium Medium

Low High

Ad hoc Infra: Evolution III

Page 54: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Control of environment○ Data Scientists don’t need to worry about env.

● Isolation○ can host many docker containers on a single machine.

● Better host management○ allowing central control of machine types.

Why Does Docker Lower Overhead?

Page 55: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Flotilla UI

Page 56: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Has:○ Our internal API libraries○ Jupyter Notebook:

■ Pyspark■ IPython

○ Python libs: ■ scikit, numpy, scipy, pandas, etc.

○ RStudio○ R libs:

■ Dplyr, magrittr, ggplot2, lme4, BOOT, etc.

● Mounts User NFS

● User has terminal access to file system via Jupyter for git, pip, etc.

Our Docker Image

Page 57: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Docker Deployment

Page 58: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Docker Deployment

Page 59: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Docker Deployment

Page 60: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Docker tightly integrates with the Linux Kernel.○ Hypothesis:

■ Anything that makes uninterruptable calls to the kernel can:● Break the ECS agent because the container doesn’t respond.● Break isolation between containers.

■ E.g. Mounting NFS

● Docker Hub:○ Switched to artifactory

Our Docker Problems So Far

Page 61: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Scaling DS doing ML in the Cloud

Page 62: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

1. Data Latency2. To Batch

or Not To Batch3. What’s in a Model?

Page 63: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Data Latency

How much time do you spend waiting for data?

Page 64: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

*This could be a laptop, a shared system, a batch process, etc.

Page 65: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Use Compression

*This could be a laptop, a shared system, a batch process, etc.

Page 66: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Use Compression - The Components

[ 1.3234543 0.23443434 … ]

[ 1 0 0 1 0 0 … 0 1 0 0 0 1 0 1 ... … 1 0 1 1 ]

[ 1 0 0 1 0 0 … 0 1 0 0 ]

[ 1.3234543 0.23443434 … ]

{ 100: 0.56, … ,110: 0.65, … , … , 999: 0.43 }

Page 67: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Use Compression - Python Comparison

Pickle: 60MBZlib+Pickle: 129KBJSON: 15MBZlib+JSON: 55KB

Pickle: 3.1KBZlib+Pickle: 921BJSON: 2.8KBZlib+JSON: 681B

Pickle: 2.6MBZlib+Pickle: 600KBJSON: 769KBZlib+JSON: 139KB

[ 1.3234543 0.23443434 … ]

[ 1 0 0 1 0 0 … 0 1 0 0 0 1 0 1 ... … 1 0 1 1 ]

[ 1 0 0 1 0 0 … 0 1 0 0 ]

[ 1.3234543 0.23443434 … ]

{ 100: 0.56, … ,110: 0.65, … , … , 999: 0.43 }

Page 68: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Naïve scheme of JSON + Zlib works well:

Observations

import jsonimport zlib...# compresscompressed = zlib.compress(json.dumps(value))# decompressoriginal = json.loads(zlib.decompress(compressed))

Page 69: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Naïve scheme of JSON + Zlib works well:

● Double vs Float: do you really need to store that much precision?

Observations

import jsonimport zlib...# compresscompressed = zlib.compress(json.dumps(value))# decompressoriginal = json.loads(zlib.decompress(compressed))

Page 70: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Naïve scheme of JSON + Zlib works well:

● Double vs Float: do you really need to store that much precision?

● For more inspiration look to columnar DBs and how they compress columns

Observations

import jsonimport zlib...# compresscompressed = zlib.compress(json.dumps(value))# decompressoriginal = json.loads(zlib.decompress(compressed))

Page 71: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

To Batch or Not To Batch:

When is batch inefficient?

Page 72: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Online:○ Computation occurs synchronously when needed.

● Streamed:○ Computation is triggered by an event(s).

Online & Streamed Computation

Page 73: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Online & Streamed Computation

Very likely you start with a batch system

Page 74: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Online & Streamed Computation

● Do you need to recompute:○ features for all users?○ predicted results for all users?

Very likely you start with a batch system

Page 75: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Online & Streamed Computation

● Do you need to recompute:○ features for all users?○ predicted results for all users?

● Are you heavily dependent on your

ETL running every night?

Very likely you start with a batch system

Page 76: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Online & Streamed Computation

● Do you need to recompute:○ features for all users?○ predicted results for all users?

● Are you heavily dependent on your

ETL running every night?

● Online vs Streamed depends on in house factors:○ Number of models○ How often they change○ Cadence of output required○ In house eng. expertise○ etc.

Very likely you start with a batch system

Page 77: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Online & Streamed Computation

● Do you need to recompute:○ features for all users?○ predicted results for all users?

● Are you heavily dependent on your

ETL running every night?

● Online vs Streamed depends on in house factors:○ Number of models○ How often they change○ Cadence of output required○ In house eng. expertise○ etc.

Very likely you start with a batch system

We use online system for recommendations

Page 78: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Streamed Example

Page 79: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Streamed Example

Page 80: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Streamed Example

Page 81: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Streamed Example

Page 82: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Dedicated infrastructure → More room on batch infrastructure○ Hopefully $$$ savings○ Hopefully less stressed Data Scientists

Online/Streaming Thoughts

Page 83: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Dedicated infrastructure → More room on batch infrastructure○ Hopefully $$$ savings○ Hopefully less stressed Data Scientists

● Requires better software engineering practices○ Code portability/reuse○ Designing APIs/Tools Data Scientists will use

Online/Streaming Thoughts

Page 84: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Dedicated infrastructure → More room on batch infrastructure○ Hopefully $$$ savings○ Hopefully less stressed Data Scientists

● Requires better software engineering practices○ Code portability/reuse○ Designing APIs/Tools Data Scientists will use

● Prototyping on AWS Lambda & Kinesis was surprisingly quick○ Need to compile C libs on an amazon linux instance

Online/Streaming Thoughts

Page 85: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

What’s in a Model?

Scaling model knowledge

Page 86: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Ever:● Had someone leave and then nobody understands how they trained their

models?

Page 87: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Ever:● Had someone leave and then nobody understands how they trained their

models?○ Or you didn’t remember yourself?

Page 88: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Ever:● Had someone leave and then nobody understands how they trained their

models?○ Or you didn’t remember yourself?

● Had performance dip in models and you have trouble figuring out why?

Page 89: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Ever:● Had someone leave and then nobody understands how they trained their

models?○ Or you didn’t remember yourself?

● Had performance dip in models and you have trouble figuring out why?○ Or not known what’s changed between model deployments?

Page 90: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Ever:● Had someone leave and then nobody understands how they trained their

models?○ Or you didn’t remember yourself?

● Had performance dip in models and you have trouble figuring out why?○ Or not known what’s changed between model deployments?

● Wanted to compare model performance over time?

Page 91: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Ever:● Had someone leave and then nobody understands how they trained their

models?○ Or you didn’t remember yourself?

● Had performance dip in models and you have trouble figuring out why?○ Or not known what’s changed between model deployments?

● Wanted to compare model performance over time?

● Wanted to train a model in R/Python/Spark and then deploy it a webserver?

Page 92: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Produce Model Artifacts

Page 93: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?

Produce Model Artifacts

Page 94: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

Produce Model Artifacts

Page 95: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

● Why?

Produce Model Artifacts

Page 96: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

● Why?

Produce Model Artifacts

Page 97: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

● Why?

How do you deal with organizational drift?

Produce Model Artifacts

Page 98: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

● Why?

How do you deal with organizational drift?

Produce Model Artifacts

Makes it easy to keep an archive and track changes over time

Page 99: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

● Why?

How do you deal with organizational drift?

Produce Model Artifacts

Helps a lot with model debugging & diagnosis!

Makes it easy to keep an archive and track changes over time

Page 100: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Isn’t that just saving the coefficients/model values?○ NO!

● Why?

How do you deal with organizational drift?

Produce Model Artifacts

Helps a lot with model debugging & diagnosis!

Makes it easy to keep an archive and track changes over time Can more easily use in

downstream processes

Page 101: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

● Analogous to software libraries

● Packaging:○ Zip/Jar file

Produce Model Artifacts

Page 102: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

But all the above seems complex?

Page 103: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

We’re building APIs.

Page 104: in the Cloud Data Science - QCon San Francisco · 2017-02-02 · Two Data Scientist facts: 1.Has AWS console access*. 2.End to end, they’re responsible

Fin; Questions?

@stefkrawczyk