Building a scalable data strategy with IPTOP Hugo Bowne-Anderson @hugobowne

IPTOP Building a scalable data strategy with Hugo Bowne ... · Applied math research in cell biology (Yale University, Max Planck Institute) Python curriculum engineer at DataCamp

Page 1

Building a scalable data strategy with IPTOP

Hugo Bowne-Anderson @hugobowne

Page 2

A bit about Hugo

➔ Hugo Bowne-Anderson, data scientist at DataCamp

◆ Undergrad in sciences/humanities (double math major)

◆ PhD in Pure Mathematics (UNSW, Sydney)

◆ Applied math research in cell biology (Yale University, Max Planck Institute)

◆ Python curriculum engineer at DataCamp

◆ Host of DataFramed, the DataCamp podcast

◆ Data & AI evangelist, strategy consultant

Page 3

Joint work with

Ramnath Vaidyanathan

You can find him at @ramnath_vaidya

Ramnath leads Product Research at

Page 4

Our Mission

Our mission is to democratize data science education by building the best platform to learn and teach data skills and make data fluency accessible to millions of people and businesses around the world.

Page 5

Learn by doing

➔ Short videos from expert instructors

➔ In-browser coding

➔ Real-time feedback

300+ Unmatched data science courses

➔ Languages: Python, R, SQL, Git, Shell, Spreadsheets

➔ Topics: Importing & Cleaning, Data Manipulation, Visualization, Probability & Statistics, Machine Learning, and more!

Industry-leading instructors

➔ Learn from the authors of renowned code packages and the organizations that understand data science innovation

Learn by Doing

Page 6

Today’s topics of discussion

➔ Scaling your data strategy

➔ Scaling

◆ Infrastructure

◆ People

◆ Tools

◆ Organization

◆ Processes

Page 8

What can data science do?

We can slice data science into 3 components:

1. Descriptive analytics (Business Intelligence)

2. Predictive analytics (Machine Learning)

3. Prescriptive analytics (Decision Science)

Page 9

Descriptive analytics

Page 10

Different views for different business strategies

Page 11

Descriptive analytics

Page 12

Descriptive analytics

Page 13

Another way to slice data work

Another telling way to slice data science:

1. Data work to inform decision making

2. Automated actions from data pipelines

3. Human-in-the-loop

Page 14

POLL: What percentage of your data work is actually used?

1. 0-25%

2. 26-50%

3. 51-75%

4. 76-100%

Page 15

Definition(s) of scalability

Scalability refers to the ability to take on increased demand without incurring proportional costs.

Page 16

Definition(s) of scalability

A scalable data strategy is one that can easily accommodate new projects, employees, techniques, phases of growth, tools, and infrastructural layers, among other things.

Page 17

Scaling your data strategy

[Figure: axes "How hard it is to do" and "How many people can do it"; annotations "Making the impossible possible" and "Making the possible widespread"]

David Robinson, Principal Data Scientist, Heap

Page 18

Scale your data strategy by scaling IPTOP

➔ Infrastructure

◆ Set up a data lake

◆ Enable data discovery

➔ People

◆ Map out roles and skills

◆ Identify skill gaps

◆ Personalize learning path

➔ Tools

◆ Build tools to encapsulate

◆ Build frameworks to automate

➔ Organization

◆ Embrace a hybrid model

◆ Build flexibility

➔ Processes

◆ Standardize project structure

◆ Embrace version control

◆ Embrace notebooks

Page 20

Why do we need infrastructure?

Page 21

Scaling infrastructure at DataCamp

[Diagram. Stage labels: Raw Data, Data Pipeline, Data Lake, Tools, Insights. Other labels: Campus, Sales, Assessment, Tables, Views, Knowledge Repo, Dashboards, Metabase, Visualizations.]

Page 22

Scaling infrastructure at Netflix

Data Infrastructure at Netflix

Page 23

Scaling infrastructure at Airbnb

Data Infrastructure at Airbnb

Page 24

Enable data discovery

Page 25

Enable data discovery

Amundsen: Lyft’s data discovery and metadata engine

Page 26

Recap

➔ Scaling infrastructure is key to scaling data work

➔ Developing a principled, modular tech stack is essential

➔ For data discovery, online experimentation, machine learning, and more.

Page 28

Identify roles

Page 29

Map out skills by role

Page 30

Measure competencies

DataCamp Signal: Data Science Assessments

Page 31

Identify gaps

Page 33

Support continuous learning

Page 35

Recap

➔ Identify roles

➔ Map out skills by role

➔ Measure competencies & determine gaps

➔ Personalize learning paths & support continuous learning
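
As a rough illustration of the steps in this recap, the sketch below compares measured skill levels against per-role targets and orders the resulting gaps into a simple learning path. The role names, the 0-5 scale, and the names `REQUIRED_SKILLS`, `skill_gaps`, and `learning_path` are all assumptions made for this example; they are not part of DataCamp Signal or any real tool.

```python
# Hypothetical target levels per role on a 0-5 scale (illustrative numbers only).
REQUIRED_SKILLS = {
    "data analyst": {"sql": 4, "visualization": 4, "statistics": 3},
    "data scientist": {"python": 4, "statistics": 4, "machine learning": 3},
}


def skill_gaps(role, measured):
    """Return {skill: shortfall} for skills measured below the role's target.

    Skills missing from `measured` are treated as level 0.
    """
    gaps = {}
    for skill, target in REQUIRED_SKILLS[role].items():
        shortfall = target - measured.get(skill, 0)
        if shortfall > 0:
            gaps[skill] = shortfall
    return gaps


def learning_path(role, measured):
    """Order the gaps largest-first (ties broken alphabetically)."""
    gaps = skill_gaps(role, measured)
    return sorted(gaps, key=lambda skill: (-gaps[skill], skill))
```

For example, `learning_path("data scientist", {"python": 4, "statistics": 2})` would put machine learning ahead of statistics, since the unmeasured skill has the larger shortfall.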

Page 37

The data science workflow

Hadley Wickham, Chief Scientist, RStudio

Page 38

Build tools

Hadley Wickham, Chief Scientist, RStudio

[Figure labels: datacamp(r/py), dcmetrics, dcplot, dcdash, dcdocs, dcmodels]

Page 39

Build tools

Page 40

Build frameworks

I want to track recurring revenue over the last two years, aggregated by quarter, broken down by segment and geography.

I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track.
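
Both requests share the same shape: a value, a time period, and a set of dimensions. The sketch below shows one way a framework might capture that shape declaratively and compute the result. It is a minimal illustration using only the Python standard library; `MetricSpec`, `period_key`, `compute_metric`, and the row field names are hypothetical, and it only handles summed quantities (a rate such as course completion would need a numerator and a denominator).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetricSpec:
    """Declarative description of a metric request (hypothetical fields)."""
    value_field: str   # numeric field to sum, e.g. "revenue"
    period: str        # "week" or "quarter"
    dimensions: tuple  # fields to break the metric down by


def period_key(day, period):
    """Map a date to a label for the period it falls in."""
    if period == "quarter":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if period == "week":
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unsupported period: {period}")


def compute_metric(spec, rows):
    """Sum spec.value_field per (period, *dimensions) over an iterable of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        key = (period_key(row["date"], spec.period),)
        key += tuple(row[dim] for dim in spec.dimensions)
        totals[key] += row[spec.value_field]
    return dict(totals)
```

With a spec like `MetricSpec("revenue", "quarter", ("segment", "geography"))`, a new request becomes a new spec rather than a new script, which is the kind of one-off cost a framework is meant to buy.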

Page 41

Tidymetrics: Metrics in R

Page 42

Airbnb’s framework for online experimentation

Page 43

Tool building in machine learning

Only a small part of ML systems is the learning code itself. The rest is a vast and complex infrastructure that includes various aspects of data collection and processing. Sculley et al. (Google, Inc.)

Page 44

Machine Learning workflow

Page 45

Zipline: feature engineering at Airbnb

Page 46

Recap

➔ Tools are key to abstracting over common data tasks

➔ Tools may be cool, but frameworks are cooler!

➔ Key for all types of data work, including descriptive analytics and predictive analytics (machine learning)

➔ The point: gains in efficiency for a one-off cost

Page 48

Data team structure: centralized or decentralized?

[Diagram: two team layouts. One shows Marketing, Finance, Product, and Engineering only; the other shows a Data Science team alongside Marketing, Finance, Product, and Engineering.]

Page 49

Data team structure: decentralized

[Diagram: Marketing, Finance, Product, Engineering]

Pros

➔ Each team has a dedicated DS.

➔ Clear alignment due to common roadmap for the team.

➔ Data science has a more natural “seat at the table”.

➔ Fewer dependencies across teams.

Cons

➔ Harder to move DS resources between teams to handle load.

➔ Manager of the team may not have domain knowledge.

➔ Harder for DS to collaborate.

➔ Harder for DS to drive longer-term projects, with the risk of turning into a support service.

Page 50

Data team structure: centralized

[Diagram: Data Science, Marketing, Finance, Product, Engineering]

Pros

➔ Allows DS to function as a center of excellence.

➔ Promotes more collaboration and better knowledge sharing.

➔ DS manager has domain knowledge.

➔ Easier to move resources to meet load.

➔ Easier to advocate for consistent technology stack and better tooling.

Cons

➔ Complicates the coordination between DS and their stakeholders.

➔ Risk of data science work not being aligned with product.

➔ DS is an extra function for the company to support.

Page 51

Data team structure: hybrid

[Diagram: Data Science, Marketing, Finance, Product, Engineering]

Pros

➔ DS can function as a center of excellence.

➔ DS can drive common tech stack, tooling, frameworks, and standardization.

➔ DS can collaborate and align on organizational goals.

➔ Better alignment between DS and business units.

Cons

➔ Risk of mismatched expectations between the leadership of DS and of the business unit.

➔ Everyone has at least two teams.

Page 52

Recap

➔ Centralized, decentralized, and hybrid models for data teams

➔ Pros and cons of each

Page 54

1. Define project lifecycle

Microsoft Team Data Science Process

Page 55

2. Standardize project structure

Project Template

Cookiecutter Data Science
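
To make the idea of a standard project structure concrete, here is a minimal sketch that scaffolds the same directory layout for every new project. The layout in `PROJECT_DIRS` and the function name `scaffold_project` are assumptions for illustration; they are loosely inspired by common data science templates and do not reproduce any particular template exactly.

```python
from pathlib import Path

# Hypothetical layout; adapt the directory names to your team's conventions.
PROJECT_DIRS = [
    "data/raw",         # immutable source data
    "data/processed",   # cleaned data ready for analysis
    "notebooks",        # exploratory notebooks
    "src",              # reusable project code
    "reports/figures",  # generated output
]


def scaffold_project(root, name):
    """Create the standard directories and a README under root/name.

    Safe to call again on an existing project: existing files are left alone.
    """
    project = Path(root) / name
    for rel in PROJECT_DIRS:
        (project / rel).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n")
    return project
```

Because every project starts from the same skeleton, a newcomer knows where to look for raw data, notebooks, and reports without asking.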

Page 56

3. Embrace notebooks

JupyterLab is ready for users

Page 57

3. Embrace notebooks

R Markdown from RStudio

Page 58

4. Embrace version control

Page 59

5. Adopt style guides

The Tidyverse style guide, Hadley Wickham

Page 60

5. Adopt style guides

Page 61

5. Adopt style guides

Page 62

6. Other processes to consider

➔ Code review

➔ Pair programming

➔ Data testing

➔ “Data parties”

➔ Incorporating data work into the decision function
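
Of the processes listed here, data testing is the easiest to show in code. The sketch below is one possible shape for it: a function that checks rows for missing and negative values and reports every problem it finds. The function name `check_rows` and the field names in the usage note are hypothetical, and a real pipeline would likely use a dedicated validation library instead.

```python
def check_rows(rows, required_fields, non_negative_fields=()):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, row in enumerate(rows):
        # A required field is missing if the key is absent or its value is None.
        for field in required_fields:
            if row.get(field) is None:
                problems.append(f"row {i}: missing {field}")
        # Only check the sign of values that are actually present.
        for field in non_negative_fields:
            value = row.get(field)
            if value is not None and value < 0:
                problems.append(f"row {i}: negative {field}")
    return problems
```

Called as `check_rows(rows, ["user_id", "revenue"], ["revenue"])`, it collects all problems rather than stopping at the first, so a single run shows the full state of a batch.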

Page 63

Recap

➔ Define project lifecycle

➔ Standardize project structure

➔ Embrace notebooks & version control

➔ Many more things!

Page 64

Scale data strategy by scaling IPTOP

➔ Infrastructure

◆ Set up a data lake

◆ Enable data discovery

➔ People

◆ Map out roles and skills

◆ Identify skill gaps

◆ Personalize learning path

➔ Tools

◆ Build tools to encapsulate

◆ Build frameworks to automate

➔ Organization

◆ Embrace a hybrid model

◆ Build flexibility

➔ Processes

◆ Standardize project structure

◆ Embrace version control

◆ Embrace notebooks

Page 66

What’s next?

➔ April 23 (the third Thursday of the month)

DataCamp’s online conference

Page 67

Thank you!

Hugo Bowne-Anderson, Data Scientist, @hugobowne