
Building a scalable data strategy with IPTOP

Hugo Bowne-Anderson (@hugobowne)


A bit about Hugo

➔ Hugo Bowne-Anderson, data scientist at DataCamp

◆ Undergrad in sciences/humanities (double math major)

◆ PhD in Pure Mathematics (UNSW, Sydney)

◆ Applied math research in cell biology (Yale University, Max Planck Institute)

◆ Python curriculum engineer at DataCamp

◆ Host of DataFramed, the DataCamp podcast

◆ Data & AI evangelist, strategy consultant

Joint work with

➔ Ramnath Vaidyanathan (@ramnath_vaidya)

◆ Ramnath leads Product Research at DataCamp

Our Mission

Our mission is to democratize data science education by building the best platform to learn and teach data skills and make data fluency accessible to millions of people and businesses around the world.

Learn by Doing

➔ Short videos from expert instructors

➔ In-browser coding

➔ Real-time feedback

300+ unmatched data science courses

➔ Languages: Python, R, SQL, Git, Shell, Spreadsheets

➔ Topics: Importing & Cleaning, Data Manipulation, Visualization, Probability & Statistics, Machine Learning, and more!

Industry-leading instructors

➔ Learn from the authors of renowned code packages and from organizations that understand data science innovation

Today's topics of discussion

➔ Scaling your data strategy

➔ Scaling:

◆ Infrastructure

◆ People

◆ Tools

◆ Organization

◆ Processes

Today's topics of discussion (recap): scaling your data strategy

What can data science do?

We can slice data science into three components:

1. Descriptive analytics (business intelligence)

2. Predictive analytics (machine learning)

3. Prescriptive analytics (decision science)

Descriptive analytics

➔ Different views for different business strategies

Another telling way to slice data science:

1. Data work to inform decision making

2. Automated actions from data pipelines

3. Human-in-the-loop

POLL: What percentage of your data work is actually used?

1. 0-25%

2. 26-50%

3. 51-75%

4. 76-100%

Definition(s) of scalability

➔ Scalability refers to the ability to take on increased demand without incurring proportional costs.

➔ A scalable data strategy is one that can easily accommodate new projects, employees, techniques, phases of growth, tools, and infrastructural layers, among other things.

Scaling your data strategy

[Chart, after David Robinson (Principal Data Scientist, Heap): axes "how hard it is to do" vs. "how many people can do it", contrasting "making the impossible possible" with "making the possible widespread"]

Scale your data strategy by scaling IPTOP

➔ Infrastructure: set up a data lake; enable data discovery

➔ People: map out roles and skills; identify skill gaps; personalize learning paths

➔ Tools: build tools to encapsulate; build frameworks to automate

➔ Organization: embrace a hybrid model; build flexibility

➔ Processes: standardize project structure; embrace version control; embrace notebooks

IPTOP: Infrastructure, People, Tools, Organization, Processes

Today's topics of discussion (recap): Infrastructure

Why do we need infrastructure?


Scaling infrastructure at DataCamp

[Diagram: Raw Data (Campus, Sales, Assessment) → Data Pipeline → Data Lake (Tables, Views) → Tools (e.g., Metabase) → Insights (Knowledge Repo, Dashboards, Visualizations)]
Scaling infrastructure at Netflix

[Diagram: data infrastructure at Netflix]

Scaling infrastructure at Airbnb

[Diagram: data infrastructure at Airbnb]

Enable data discovery

➔ Amundsen: Lyft's data discovery and metadata engine

Recap

➔ Scaling infrastructure is key to scaling data work

➔ Developing a principled, modular tech stack is essential

➔ For data discovery, online experimentation, machine learning, and more

Today's topics of discussion (recap): People

1. Identify roles

2. Map out skills by role

3. Measure competencies (DataCamp Signal: Data Science Assessments)

4. Identify gaps

5. Support continuous learning


Recap

➔ Identify roles

➔ Map out skills by role

➔ Measure competencies & determine gaps

➔ Personalize learning paths & support continuous learning

Today's topics of discussion (recap): Tools

The data science workflow

[Diagram: the data science workflow, after Hadley Wickham, Chief Scientist, RStudio]

Build tools

➔ Internal packages at DataCamp: datacamp (R/Python), dcmetrics, dcplot, dcdash, dcdocs, dcmodels
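As an illustration of what such an internal package might encapsulate, here is a minimal sketch in the spirit of a dcplot-style helper; the function name, palette, and defaults are hypothetical, not DataCamp's actual code:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical internal helper: every analyst gets the same styling,
# labels, and defaults for free instead of rewriting plot boilerplate.
BRAND_COLORS = ["#03EF62", "#05192D"]  # assumed palette, for illustration only

def plot_metric(df: pd.DataFrame, x: str, y: str, title: str) -> plt.Axes:
    """Line plot of a metric over time with house styling applied."""
    ax = df.plot(x=x, y=y, color=BRAND_COLORS[0], legend=False)
    ax.set_title(title)
    ax.set_xlabel(x.replace("_", " ").title())
    ax.set_ylabel(y.replace("_", " ").title())
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    return ax

# Usage: one call produces a consistent chart for any team.
df = pd.DataFrame({"week": [1, 2, 3, 4], "completion_rate": [0.61, 0.64, 0.66, 0.70]})
plot_metric(df, x="week", y="completion_rate", title="Course completion rate")
plt.show()
```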

Build frameworks

➔ "I want to track recurring revenue over the last two years, aggregated by quarter, broken down by segment and geography."

➔ "I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track."

➔ Tidymetrics: Metrics in R
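Tidymetrics is an R package; as a language-agnostic illustration of the same idea (all names below are hypothetical, not tidymetrics' API), a framework lets you declare a metric's period and dimensions once and answer both requests above through the same code path:

```python
import pandas as pd

def cross_dimensions(df, date_col, value_col, period, dimensions, agg="sum"):
    """Aggregate one metric by a time period and any subset of dimensions.

    Hypothetical helper illustrating the "define once, slice many ways"
    idea behind metric frameworks such as tidymetrics.
    """
    period_index = df[date_col].dt.to_period(period).rename("period")
    grouped = df.groupby([period_index] + dimensions)[value_col].agg(agg)
    return grouped.reset_index()

# Toy revenue data standing in for a warehouse table.
revenue = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-15", "2019-02-10", "2019-04-03", "2019-04-20"]),
    "segment": ["B2B", "B2C", "B2B", "B2C"],
    "geography": ["US", "EU", "EU", "US"],
    "recurring_revenue": [120.0, 80.0, 150.0, 95.0],
})

# "Recurring revenue, aggregated by quarter, broken down by segment and geography."
print(cross_dimensions(revenue, "date", "recurring_revenue",
                       period="Q", dimensions=["segment", "geography"]))

# The completion-rate request reuses the same helper by swapping the
# value column, the period ("W"), and the dimension list.
```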

Airbnb’s framework for online experimentation

Tool building in machine learning

➔ "Only a small part of ML systems is the learning code itself. The rest is a vast and complex infrastructure that includes various aspects of data collection and processing." (Sculley et al., Google, Inc.)

Machine learning workflow

Zipline: feature engineering at Airbnb
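Zipline itself is internal to Airbnb, so the sketch below only illustrates the general pattern such feature-engineering frameworks support (all names are hypothetical, not Zipline's API): features are declared once, computed with point-in-time discipline, and reused for both training and serving.

```python
import pandas as pd

# Hypothetical feature registry: declare each feature once, reuse everywhere.
FEATURES = {}

def feature(name):
    """Register a feature function under a stable name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("n_courses_started_30d")
def n_courses_started_30d(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    # Point-in-time correctness: only use events known before `as_of`,
    # so training features match what is available at serving time.
    window = events[(events["started_at"] < as_of) &
                    (events["started_at"] >= as_of - pd.Timedelta(days=30))]
    return window.groupby("user_id").size().rename("n_courses_started_30d")

def build_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Compute every registered feature with the same code path."""
    return pd.concat([fn(events, as_of) for fn in FEATURES.values()], axis=1)

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "started_at": pd.to_datetime(["2020-03-01", "2020-03-20", "2020-02-01"]),
})
print(build_features(events, as_of=pd.Timestamp("2020-03-25")))
```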

Recap

➔ Tools are key to abstract over common data tasks

➔ Tools may be cool, but frameworks are cooler!

➔ Key for all types of data work, including descriptive analytics and predictive analytics (machine learning)

➔ The point: gains in efficiency for a one-off cost

Today's topics of discussion (recap): Organization

Data team structure: centralized or decentralized?

[Org charts: where Data Science sits relative to Marketing, Finance, Product, and Engineering]

Data team structure: decentralized

Pros:

◆ Each team has a dedicated DS.

◆ Clear alignment due to a common roadmap for the team.

◆ Data science has a more natural "seat at the table".

◆ Fewer dependencies across teams.

Cons:

◆ Harder to move DS resources between teams to handle load.

◆ The manager of the team may not have domain knowledge.

◆ Harder for DS to collaborate.

◆ Harder for DS to drive longer-term projects, with the risk of turning into a support service.

Data team structure: centralized

Pros:

◆ Allows DS to function as a center of excellence.

◆ Promotes more collaboration and better knowledge sharing.

◆ The DS manager has domain knowledge.

◆ Easier to move resources to meet load.

◆ Easier to advocate for a consistent technology stack and better tooling.

Cons:

◆ Complicates the coordination between DS and their stakeholders.

◆ Risk of data science work not being aligned with product.

◆ DS is an extra function for the company to support.

Data team structure: hybrid

[Org chart: a central Data Science team working with Marketing, Finance, Product, and Engineering]

Pros:

◆ DS can function as a center of excellence.

◆ DS can drive a common tech stack, tooling, frameworks, and standardization.

◆ DS can collaborate and align on organizational goals.

◆ Better alignment between DS and business units.

Cons:

◆ Risk of mismatched expectations between DS leadership and the business unit.

◆ Everyone has at least two teams.

Recap

➔ Centralized, decentralized, and hybrid models for data teams

➔ Pros and cons of each

Today's topics of discussion (recap): Processes

1. Define project lifecycle

Microsoft Team Data Science Process

2. Standardize project structure

➔ Project template: Cookiecutter Data Science

3. Embrace notebooks

➔ JupyterLab is ready for users

➔ R Markdown from RStudio

4. Embrace version control

5. Adopt style guides

➔ The tidyverse style guide, Hadley Wickham

6. Other processes to consider

➔ Code review

➔ Pair programming

➔ Data testing (see the sketch after this list)

➔ “Data parties”

➔ Incorporating data work into the decision function
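On data testing specifically, here is a minimal sketch of the kind of automated checks a team might run on a table before it feeds dashboards or models; the column names and rules are assumptions for illustration:

```python
import pandas as pd

def check_course_events(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality failures for a hypothetical events table."""
    failures = []
    # Schema check: the columns downstream views rely on must exist.
    required = {"user_id", "course_id", "completed", "event_date"}
    missing = required - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures
    # Integrity checks: no null keys, sane value ranges, no future dates.
    if df["user_id"].isna().any():
        failures.append("null user_id values")
    if not df["completed"].isin([0, 1]).all():
        failures.append("completed must be 0 or 1")
    if (pd.to_datetime(df["event_date"]) > pd.Timestamp.now()).any():
        failures.append("event_date in the future")
    return failures

events = pd.DataFrame({
    "user_id": [1, 2], "course_id": [101, 101],
    "completed": [1, 0], "event_date": ["2020-03-02", "2020-03-03"],
})
problems = check_course_events(events)
assert not problems, f"data tests failed: {problems}"
```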

Recap

➔ Define project lifecycle

➔ Standardize project structure

➔ Embrace notebooks & version control

➔ Many more things!

Scale your data strategy by scaling IPTOP

➔ Infrastructure: set up a data lake; enable data discovery

➔ People: map out roles and skills; identify skill gaps; personalize learning paths

➔ Tools: build tools to encapsulate; build frameworks to automate

➔ Organization: embrace a hybrid model; build flexibility

➔ Processes: standardize project structure; embrace version control; embrace notebooks

IPTOP: Infrastructure, People, Tools, Organization, Processes

What’s next?

What’s next?

➔ April 23 (the third Thursday of the month)

DataCamp’s online conference

Thank you!

Hugo Bowne-Anderson, Data Scientist (@hugobowne)
