Bill Howe, PhD, Director of Research, Scalable Data Analytics, University of Washington eScience Institute

Big Data Curricula at the UW eScience Institute, JSM 2013


A 25-minute talk from a panel on big data curricula at JSM 2013: http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664


Page 1: Big Data Curricula at the UW eScience Institute, JSM 2013

Big Data Curricula at the University of Washington eScience Institute

Bill Howe, PhD
Director of Research, Scalable Data Analytics
University of Washington eScience Institute

Page 2: Big Data Curricula at the UW eScience Institute, JSM 2013

“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera

Page 3: Big Data Curricula at the UW eScience Institute, JSM 2013

1. Theory (last 2000 yrs)
2. Experiment (last 200 yrs)
3. Simulation (last 50 yrs)
4. Data-Driven Discovery (last 5 yrs)

Page 4: Big Data Curricula at the UW eScience Institute, JSM 2013

The University of Washington eScience Institute

• Rationale
– The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
– As a result, the techniques and technologies of data science must be widely practiced and widely adopted

• Mission
– Advance the forefront of research both in modern data science techniques and technologies, and in the fields that depend upon them

• Strategy
– Provide an umbrella organization for Big Data activities at UW and beyond (new curricula, collaborations, funding sources, hiring practices)
– Bootstrap a national network of partners and peer institutes
– Attract, develop, and retain “Pi-shaped people”

Page 5: Big Data Curricula at the UW eScience Institute, JSM 2013

π-shaped researchers

Broad in many areas; deep in at least two

Page 6: Big Data Curricula at the UW eScience Institute, JSM 2013

UW Data Science Education Efforts

Audiences:
• Students
– CS/Informatics: undergrads, grads
– Non-Major: undergrads, grads
• Non-Students: professionals, researchers

Programs:
• UWEO Data Science Certificate
• Graduate Certificate in Big Data
• CS Data Management Courses
• eScience workshops
• Intro to data programming
• eScience Masters (planned)
• MOOC: Intro to Data Science
• Incubator: On-the-job-training

Previous courses:
• Scientific Data Management, Graduate CS, Summer 2006, Portland State University
• Scientific Data Management, Graduate CS, Spring 2010, University of Washington

Page 7: Big Data Curricula at the UW eScience Institute, JSM 2013

Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects

• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons


Page 9: Big Data Curricula at the UW eScience Institute, JSM 2013


Page 10: Big Data Curricula at the UW eScience Institute, JSM 2013

• 8600 completed all programming assignments
• 7000 earned a certificate

Page 11: Big Data Curricula at the UW eScience Institute, JSM 2013
Page 12: Big Data Curricula at the UW eScience Institute, JSM 2013

Syllabus

• Data Science Landscape (~1 week)
• Data Manipulation at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Analytics
– Statistics Pearls (~1 week)
– Machine Learning Pearls (~1 week)
• Visualization (~1 week)

Page 13: Big Data Curricula at the UW eScience Institute, JSM 2013

[Figure: “This Course” positioned along four axes: tools vs. abstr., desk vs. cloud, structs vs. stats, hackers vs. analysts]

Page 14: Big Data Curricula at the UW eScience Institute, JSM 2013

What are the abstractions of data science?

[axis: tools vs. abstr.]

“Data Jujitsu”
“Data Wrangling”
“Data Munging”

Translation: “We have no idea what this is all about”

Page 15: Big Data Curricula at the UW eScience Institute, JSM 2013

What are the abstractions of data science?

[axis: tools vs. abstr.]

matrices and linear algebra?
relations and relational algebra?
objects and methods?
files and scripts?
data frames and functions?

Page 16: Big Data Curricula at the UW eScience Institute, JSM 2013

Data Access Hitting a Wall

Current practice based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow

• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~5,000 disks

• At some point you need indices to limit search, parallel data search and analysis
• This is where databases can help

• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1$)
• … 2 days and 1K$
• … 3 years and 1M$

[axis: desk vs. cloud]

[slide src: Jim Gray]
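
The scaling argument on this slide can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a single sequential reader at a fixed 10 MB/s, a hypothetical rate that is not taken from the slide, and the function names are made up for this example. It also shows the effect of spreading a petabyte across the slide's roughly 5,000 disks and reading them in parallel.

```python
# Back-of-the-envelope scan times. The 10 MB/s throughput is an assumed,
# illustrative figure, not a number taken from the slide.
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

ASSUMED_BYTES_PER_SECOND = 10 * MB  # hypothetical sequential read rate


def scan_seconds(num_bytes, bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Time to read num_bytes once, front to back, with a single reader."""
    return num_bytes / bytes_per_second


def parallel_scan_seconds(num_bytes, num_disks,
                          bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Same scan with the data spread evenly over num_disks read in parallel."""
    return scan_seconds(num_bytes, bytes_per_second) / num_disks


def humanize(seconds):
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("years", 365 * 86_400), ("days", 86_400),
                       ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for label, size in (("1 MB", MB), ("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
        print(f"scan {label}: {humanize(scan_seconds(size))}")
    # The slide's "1 PB ~5,000 disks", scanned in parallel.
    print(f"scan 1 PB on 5,000 disks: "
          f"{humanize(parallel_scan_seconds(PB, 5_000))}")
```

Under these assumptions a full scan of a petabyte takes on the order of years with one reader and on the order of hours across 5,000 disks, which is the slide's case for indices and parallel search.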

Page 17: Big Data Curricula at the UW eScience Institute, JSM 2013

US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
-- McKinsey Global Institute

[axis: hackers vs. analysts]

Page 18: Big Data Curricula at the UW eScience Institute, JSM 2013

Three types of tasks:

1) Preparing to run a model
Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
“80% of the work” -- Aaron Kimball

2) Running the model

3) Interpreting the results
“The other 80% of the work” -- Aaron Kimball

[axis: structs vs. stats]

Page 19: Big Data Curricula at the UW eScience Institute, JSM 2013

Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects

• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons

Page 20: Big Data Curricula at the UW eScience Institute, JSM 2013

New PhD Track: “Big Data U”

• Open to all departments
• New courses to “level the playing field”
– “Molecular Biology for Computer Scientists” offered this Fall
• Dual advising in two disciplines
• Joint projects leading to multiple theses
– Each methods thesis will include domain impact component
– Each domain thesis will include methods impact component
• Contribution to a shared cyberinfrastructure
– Software engineering experience as a side effect
• “Application Assistantships”
– Like RAs and TAs; focused on solving a concrete problem

Magda Balazinska

Carlos Guestrin

Page 21: Big Data Curricula at the UW eScience Institute, JSM 2013

Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science

• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons

Page 22: Big Data Curricula at the UW eScience Institute, JSM 2013

Data Science Incubator: Motivation

• We need the right people
– We produce “builders,” but 99% of them go to industry to “make people click on ads”
– They aren’t motivated by writing papers
– No viable career path in the academy
• We need the right processes
– Hands-on, extended, intensive experience is required to produce π-shaped people
– Data-driven discovery requires intensive collaboration

Page 23: Big Data Curricula at the UW eScience Institute, JSM 2013

Science Domains
Stats, Computer Science, Applied Math

• “Where’s the funding?”
• “How does this help me write a paper in my field?”
• Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects
• “Does method X work on dataset Y?”

Page 24: Big Data Curricula at the UW eScience Institute, JSM 2013

Domain Labs

Research Programmers

• Expensive; doesn’t scale
• “Code Monkey” – No viable career path
• Can’t attract top people
• No sharing, no community, no cross-pollination

Page 25: Big Data Curricula at the UW eScience Institute, JSM 2013

Data Science Incubator: Structure

• Recruit top-flight data science talent
• Give them autonomy to select collaborations and projects
• Promote them according to “altmetrics” and project impact
– “Data Scientist” “Senior Data Scientist” “Technical Fellow”
– “Data Science Fellows”
• Perhaps non-tenure, but 3-5 year commitments
• Funded with contributions from Academic units, IT, Libraries, and soft money

Page 26: Big Data Curricula at the UW eScience Institute, JSM 2013

Data Science Incubator: Seed Grants

• Domain researchers submit Seed Grant applications for short, intensive 1-6 month projects
– Reviewed by the Data Scientists themselves
• Awardees send 1+ students, postdocs, staff, or faculty to come and physically sit in the incubator space X days per week for the project duration
– Application may or may not include funding for the student

Page 27: Big Data Curricula at the UW eScience Institute, JSM 2013

Domain Labs

Incubator

• Data Scientists have their own identity and prestige
• Cross-pollination between disciplines
• Awardees leave with skills and knowledge; become “disciples”


Page 29: Big Data Curricula at the UW eScience Institute, JSM 2013

Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science

• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons

Page 30: Big Data Curricula at the UW eScience Institute, JSM 2013

MOOC “Introduction to Data Science”: https://www.coursera.org/course/datasci

Certificate program: http://www.pce.uw.edu/courses/data-science-intro

Bill Howe, UW

http://escience.washington.edu

[email protected]