
Building a scalable data strategy with IPTOP

Hugo Bowne-Anderson (@hugobowne)


A bit about Hugo

➔ Hugo Bowne-Anderson, data scientist at DataCamp

◆ Undergrad in sciences/humanities (double math major)

◆ PhD in Pure Mathematics (UNSW, Sydney)

◆ Applied math research in cell biology (Yale University, Max Planck Institute)

◆ Python curriculum engineer at DataCamp

◆ Host of DataFramed, the DataCamp podcast

◆ Data & AI evangelist, strategy consultant

Joint work with

➔ Ramnath Vaidyanathan (@ramnath_vaidya)

◆ Ramnath leads Product Research at DataCamp

Our Mission

Our mission is to democratize data science education by building the best platform to learn and teach data skills and make data fluency accessible to millions of people and businesses around the world.

Learn by Doing

➔ Short videos from expert instructors

➔ In-browser coding

➔ Real-time feedback

300+ unmatched data science courses

➔ Languages: Python, R, SQL, Git, Shell, Spreadsheets

➔ Topics: Importing & Cleaning, Data Manipulation, Visualization, Probability & Statistics, Machine Learning, and more!

Industry-leading instructors

➔ Learn from the authors of renowned code packages and from organizations that understand data science innovation

Today's topics of discussion

➔ Scaling your data strategy

➔ Scaling:

◆ Infrastructure

◆ People

◆ Tools

◆ Organization

◆ Processes

Today's topics of discussion (recap): scaling your data strategy

What can data science do?

We can slice data science into three components:

1. Descriptive analytics (business intelligence)

2. Predictive analytics (machine learning)

3. Prescriptive analytics (decision science)

Descriptive analytics

➔ Different views for different business strategies

Another telling way to slice data science:

1. Data work to inform decision making

2. Automated actions from data pipelines

3. Human-in-the-loop

POLL: What percentage of your data work is actually used?

1. 0-25%

2. 26-50%

3. 51-75%

4. 76-100%

Definition(s) of scalability

➔ Scalability refers to the ability to take on increased demand without incurring proportional costs.

➔ A scalable data strategy is one that can easily accommodate new projects, employees, techniques, phases of growth, tools, and infrastructural layers, among other things.

Scaling your data strategy

[Chart, after David Robinson (Principal Data Scientist, Heap): axes "how hard it is to do" vs. "how many people can do it", contrasting "making the impossible possible" with "making the possible widespread"]

Scale your data strategy by scaling IPTOP

➔ Infrastructure: set up a data lake; enable data discovery

➔ People: map out roles and skills; identify skill gaps; personalize learning paths

➔ Tools: build tools to encapsulate; build frameworks to automate

➔ Organization: embrace a hybrid model; build flexibility

➔ Processes: standardize project structure; embrace version control; embrace notebooks

IPTOP: Infrastructure, People, Tools, Organization, Processes

Today's topics of discussion (recap): Infrastructure

Why do we need infrastructure?


Scaling infrastructure at DataCamp

[Diagram: Raw Data (Campus, Sales, Assessment) → Data Pipeline → Data Lake (Tables, Views) → Tools (e.g., Metabase) → Insights (Knowledge Repo, Dashboards, Visualizations)]
Scaling infrastructure at Netflix

[Diagram: data infrastructure at Netflix]

Scaling infrastructure at Airbnb

[Diagram: data infrastructure at Airbnb]

Enable data discovery

➔ Amundsen: Lyft's data discovery and metadata engine

Recap

➔ Scaling infrastructure is key to scaling data work

➔ Developing a principled, modular tech stack is essential

➔ For data discovery, online experimentation, machine learning, and more

Today's topics of discussion (recap): People

1. Identify roles

2. Map out skills by role

3. Measure competencies (DataCamp Signal: Data Science Assessments)

4. Identify gaps

5. Support continuous learning


Recap

➔ Identify roles

➔ Map out skills by role

➔ Measure competencies & determine gaps

➔ Personalize learning paths & support continuous learning

Today's topics of discussion (recap): Tools

The data science workflow

[Diagram: the data science workflow, after Hadley Wickham, Chief Scientist, RStudio]

Build tools

➔ Internal packages at DataCamp: datacamp (R/Python), dcmetrics, dcplot, dcdash, dcdocs, dcmodels
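As an illustration of what such an internal package might encapsulate, here is a minimal sketch in the spirit of a dcplot-style helper; the function name, palette, and defaults are hypothetical, not DataCamp's actual code:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical internal helper: every analyst gets the same styling,
# labels, and defaults for free instead of rewriting plot boilerplate.
BRAND_COLORS = ["#03EF62", "#05192D"]  # assumed palette, for illustration only

def plot_metric(df: pd.DataFrame, x: str, y: str, title: str) -> plt.Axes:
    """Line plot of a metric over time with house styling applied."""
    ax = df.plot(x=x, y=y, color=BRAND_COLORS[0], legend=False)
    ax.set_title(title)
    ax.set_xlabel(x.replace("_", " ").title())
    ax.set_ylabel(y.replace("_", " ").title())
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    return ax

# Usage: one call produces a consistent chart for any team.
df = pd.DataFrame({"week": [1, 2, 3, 4], "completion_rate": [0.61, 0.64, 0.66, 0.70]})
plot_metric(df, x="week", y="completion_rate", title="Course completion rate")
plt.show()
```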

Build frameworks

➔ "I want to track recurring revenue over the last two years, aggregated by quarter, broken down by segment and geography."

➔ "I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track."

➔ Tidymetrics: Metrics in R
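Tidymetrics is an R package; as a language-agnostic illustration of the same idea (all names below are hypothetical, not tidymetrics' API), a framework lets you declare a metric's period and dimensions once and answer both requests above through the same code path:

```python
import pandas as pd

def cross_dimensions(df, date_col, value_col, period, dimensions, agg="sum"):
    """Aggregate one metric by a time period and any subset of dimensions.

    Hypothetical helper illustrating the "define once, slice many ways"
    idea behind metric frameworks such as tidymetrics.
    """
    period_index = df[date_col].dt.to_period(period).rename("period")
    grouped = df.groupby([period_index] + dimensions)[value_col].agg(agg)
    return grouped.reset_index()

# Toy revenue data standing in for a warehouse table.
revenue = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-15", "2019-02-10", "2019-04-03", "2019-04-20"]),
    "segment": ["B2B", "B2C", "B2B", "B2C"],
    "geography": ["US", "EU", "EU", "US"],
    "recurring_revenue": [120.0, 80.0, 150.0, 95.0],
})

# "Recurring revenue, aggregated by quarter, broken down by segment and geography."
print(cross_dimensions(revenue, "date", "recurring_revenue",
                       period="Q", dimensions=["segment", "geography"]))

# The completion-rate request reuses the same helper by swapping the
# value column, the period ("W"), and the dimension list.
```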

Airbnb’s framework for online experimentation

Tool building in machine learning

➔ "Only a small part of ML systems is the learning code itself. The rest is a vast and complex infrastructure that includes various aspects of data collection and processing." (Sculley et al., Google, Inc.)

Machine learning workflow

Zipline: feature engineering at Airbnb
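Zipline itself is internal to Airbnb, so the sketch below only illustrates the general pattern such feature-engineering frameworks support (all names are hypothetical, not Zipline's API): features are declared once, computed with point-in-time discipline, and reused for both training and serving.

```python
import pandas as pd

# Hypothetical feature registry: declare each feature once, reuse everywhere.
FEATURES = {}

def feature(name):
    """Register a feature function under a stable name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("n_courses_started_30d")
def n_courses_started_30d(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    # Point-in-time correctness: only use events known before `as_of`,
    # so training features match what is available at serving time.
    window = events[(events["started_at"] < as_of) &
                    (events["started_at"] >= as_of - pd.Timedelta(days=30))]
    return window.groupby("user_id").size().rename("n_courses_started_30d")

def build_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Compute every registered feature with the same code path."""
    return pd.concat([fn(events, as_of) for fn in FEATURES.values()], axis=1)

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "started_at": pd.to_datetime(["2020-03-01", "2020-03-20", "2020-02-01"]),
})
print(build_features(events, as_of=pd.Timestamp("2020-03-25")))
```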

Recap

➔ Tools are key to abstract over common data tasks

➔ Tools may be cool, but frameworks are cooler!

➔ Key for all types of data work, including descriptive analytics and predictive analytics (machine learning)

➔ The point: gains in efficiency for a one-off cost

Today's topics of discussion (recap): Organization

Data team structure: centralized or decentralized?

[Org charts: where Data Science sits relative to Marketing, Finance, Product, and Engineering]

Data team structure: decentralized

Pros:

◆ Each team has a dedicated DS.

◆ Clear alignment due to a common roadmap for the team.

◆ Data science has a more natural "seat at the table".

◆ Fewer dependencies across teams.

Cons:

◆ Harder to move DS resources between teams to handle load.

◆ The manager of the team may not have domain knowledge.

◆ Harder for DS to collaborate.

◆ Harder for DS to drive longer-term projects, with the risk of turning into a support service.

Data team structure: centralized

Pros:

◆ Allows DS to function as a center of excellence.

◆ Promotes more collaboration and better knowledge sharing.

◆ The DS manager has domain knowledge.

◆ Easier to move resources to meet load.

◆ Easier to advocate for a consistent technology stack and better tooling.

Cons:

◆ Complicates the coordination between DS and their stakeholders.

◆ Risk of data science work not being aligned with product.

◆ DS is an extra function for the company to support.

Data team structure: hybrid

[Org chart: a central Data Science team working with Marketing, Finance, Product, and Engineering]

Pros:

◆ DS can function as a center of excellence.

◆ DS can drive a common tech stack, tooling, frameworks, and standardization.

◆ DS can collaborate and align on organizational goals.

◆ Better alignment between DS and business units.

Cons:

◆ Risk of mismatched expectations between DS leadership and the business unit.

◆ Everyone has at least two teams.

Recap

➔ Centralized, decentralized, and hybrid models for data teams

➔ Pros and cons of each

Today's topics of discussion (recap): Processes

1. Define project lifecycle

Microsoft Team Data Science Process

2. Standardize project structure

➔ Project template: Cookiecutter Data Science

3. Embrace notebooks

➔ JupyterLab is ready for users

➔ R Markdown from RStudio

4. Embrace version control

5. Adopt style guides

➔ The tidyverse style guide, Hadley Wickham

6. Other processes to consider

➔ Code review

➔ Pair programming

➔ Data testing (see the sketch after this list)

➔ “Data parties”

➔ Incorporating data work into the decision function
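On data testing specifically, here is a minimal sketch of the kind of automated checks a team might run on a table before it feeds dashboards or models; the column names and rules are assumptions for illustration:

```python
import pandas as pd

def check_course_events(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality failures for a hypothetical events table."""
    failures = []
    # Schema check: the columns downstream views rely on must exist.
    required = {"user_id", "course_id", "completed", "event_date"}
    missing = required - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures
    # Integrity checks: no null keys, sane value ranges, no future dates.
    if df["user_id"].isna().any():
        failures.append("null user_id values")
    if not df["completed"].isin([0, 1]).all():
        failures.append("completed must be 0 or 1")
    if (pd.to_datetime(df["event_date"]) > pd.Timestamp.now()).any():
        failures.append("event_date in the future")
    return failures

events = pd.DataFrame({
    "user_id": [1, 2], "course_id": [101, 101],
    "completed": [1, 0], "event_date": ["2020-03-02", "2020-03-03"],
})
problems = check_course_events(events)
assert not problems, f"data tests failed: {problems}"
```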

Recap

➔ Define project lifecycle

➔ Standardize project structure

➔ Embrace notebooks & version control

➔ Many more things!

Scale your data strategy by scaling IPTOP

➔ Infrastructure: set up a data lake; enable data discovery

➔ People: map out roles and skills; identify skill gaps; personalize learning paths

➔ Tools: build tools to encapsulate; build frameworks to automate

➔ Organization: embrace a hybrid model; build flexibility

➔ Processes: standardize project structure; embrace version control; embrace notebooks

IPTOP: Infrastructure, People, Tools, Organization, Processes

What’s next?

What’s next?

➔ April 23 (the third Thursday of the month)

DataCamp’s online conference

Thank you!

Hugo Bowne-Anderson, Data Scientist (@hugobowne)
