
“Big Data in Texas: Then, Now, and Ahead”

Paco Nathan, Evil Mad Scientist @ Concurrent, Inc.

1

Then, Now, and Ahead

THEN

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

2

Lynn asked me to talk about Data here today

A few weeks ago we stepped back for a moment to reflect about what we’d seen happen in Austin over the years

Both of us ran alternative bookstores in Austin, twenty or so years ago, and we participated as the Internet thing exploded in the 1990s

That was a blast –

observations…

3

4

5

6

7

We noticed a trend

Thinking about some of those who kept showing up whenever interesting things were afoot…

observations…

8

9

“curation and metadata”

10

Overall, it’s about systems thinking

We have a wealth of that here, at UT/Austin in particular…

Ilya Prigogine spent years here, which is just incredible

School of Architecture, with leading work in VR, GIS, etc.

Interactive innovations at ACTLab…

Quantitative emphasis at McCombs…

⇒ major intellectual resources here

observations…

11

Then, Now, and Ahead

NOW

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

12

• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability

[Diagram labels: data science; introduced capability]

[Figure labels (event names): Unique Registration; Launched games lobby; NUI:TutorialMode; Birthday Message; Chat PublicRoom voice; Launched heyzap game; ConnectivityTest: test suite started; Create New Pet; Movie View Started: client, community; NUI:MovieMode; Buy an Item: web; Put on Clothing; Address space remaining: 512M; Customer Made Purchase Cart Page Step 2; Feed Pet; Play Pet; Chat Now; Edit Panel; Client Inventory Panel Flip Product Over; Add Friend; Open 3D Window; Change Seat; Type a Bubble; Visit Own Homepage; Take a Snapshot; NUI:BuyCreditsMode; NUI:MyProfileClicked; Address space remaining: 1G; Leave a Message; NUI:ChatMode; NUI:FriendsModedv; Website Login; Add Buddy; NUI:PublicRoomMode; NUI:MyRoomMode; Client Inventory Panel Remove Product; Client Inventory Panel Apply Product; NUI:DressUpMode]

Data Science

13

Data Science in Texas…

14

by DJ Patil

Data Jujitsu
O’Reilly, 2012

amazon.com/dp/B008HMN5BE

Building Data Science Teams
O’Reilly, 2011

amazon.com/dp/B005O4U3ZE

references…

15

[Flow diagram labels: Document Collection; Tokenize; Regex token; Scrub token; Stop Word List; HashJoin Left (RHS); GroupBy token; Count; Word Count; M; R]

Enterprise Data Workflows

cascading.org
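The steps labeled on this slide (Document Collection, Tokenize, Scrub, a join against a Stop Word List, GroupBy, Count) can be sketched in a few lines of plain Python. This is only an illustration of the workflow's shape, not Cascading code: the sample documents, the stop-word set, and the function names are all hypothetical, and the regex is an assumed tokenizer.

```python
import re
from collections import Counter

# Hypothetical inputs for illustration only; not taken from the talk.
DOCUMENTS = [
    "A rain shadow is a dry area on the lee side of a mountain.",
    "The rain falls mainly on the windward side.",
]
STOP_WORDS = {"a", "an", "the", "is", "on", "of"}


def tokenize(text):
    """Tokenize step: split text on spaces and basic punctuation."""
    return re.split(r"[ \[\]\(\),.]", text)


def scrub(token):
    """Scrub step: trim whitespace and lowercase the token."""
    return token.strip().lower()


def word_count(documents, stop_words):
    """GroupBy + Count step, after filtering against the stop-word list."""
    counts = Counter()
    for doc in documents:
        for raw in tokenize(doc):
            token = scrub(raw)
            # Stand-in for the join against the stop-word list:
            # keep only non-empty tokens that have no match in the list.
            if token and token not in stop_words:
                counts[token] += 1
    return counts


counts = word_count(DOCUMENTS, STOP_WORDS)
for token, n in sorted(counts.items()):
    print(token, n)
```

In a cluster setting the tokenize, scrub, and filter steps would typically run as map-side work and the group-and-count as reduce-side work, which is presumably what the M and R labels on the diagram indicate.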

16

Over the past 5+ years, we’ve seen many large-scale Enterprise production deployments based on Cascading, Cascalog, Scalding, PyCascading, Cascading.JRuby, etc.

Enterprise data workflows, machine learning at scale, Big Data…

Why?

Enterprise Data Workflows

amazon.com/dp/1449358721

17

Then, Now, and Ahead

NOW

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

18

Three broad categories of data
Curt Monash, 2010

dbms2.com/2010/01/17/three-broad-categories-of-data

• Human/Tabular data – human-generated data which fits well into tables/arrays

• Human/Nontabular data – all other data generated by humans

• Machine-Generated data

19

Three broad categories of data

• Adjusted Data – Dr. Don Easterbrook, Senate witness

20

Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware

This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack emerged from this

21

[Diagram labels: Customers; transactions; Web App; RDBMS; SQL Query; result sets; BI Analysts; Excel pivot tables; PowerPoint slide decks; Stakeholder; Product; strategy; Engineering; requirements; optimized code]

Circa 1996: pre-inflection point

22

Circa 1996: pre-inflection point

“Throw it over the wall”

23

[Diagram labels: Customers; customer transactions; Web Apps; Middleware; servlets; models; recommenders + classifiers; Algorithmic Modeling; RDBMS; SQL Query; result sets; Logs; event history; aggregation; dashboards; DW; ETL; Product; Engineering; UX; Stakeholder]

Circa 2001: after the big ecommerce successes

24

Circa 2001: after the big ecommerce successes

“Data products”

25

[Diagram labels: Use Cases Across Topologies; Workflow; Customers; Web Apps, Mobile, etc.; transactions, content; social interactions; services; RDBMS; Log Events; History; Data Products; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; Planner; near time; batch; Prod; Eng; DW; s/w dev; data science; discovery + modeling; dashboard metrics; business process; optimized capacity; taps; Domain Expert; Data Scientist; App Dev; Ops; introduced capability; existing SDLC]

Circa 2013: clusters everywhere

26

Circa 2013: clusters everywhere

“Optimizing topologies”

27

• Lambda Architecture: blending topologies

• Big Data by Nathan Marz, James Warren

• manning.com/marz

references…

source: Nathan Marz
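As a rough illustration of what "blending topologies" can mean, here is a toy Python sketch of the general idea behind the Lambda Architecture: a batch view recomputed from the full event log, a real-time view covering events since the last batch run, and queries that merge the two. The class and method names (LambdaCounter, ingest, run_batch, query) are hypothetical and do not come from the book or the talk; a real deployment would use separate batch and stream systems rather than one in-memory object.

```python
from collections import Counter


class LambdaCounter:
    """Toy event counter that blends a batch view with a real-time view."""

    def __init__(self):
        self.master = []                # append-only log of all events
        self.batch_view = Counter()     # recomputed from the full log
        self.realtime_view = Counter()  # events seen since the last batch run

    def ingest(self, key):
        """Record an event in the log and update the real-time view."""
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the real-time view."""
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        """Answer a query by merging the batch and real-time views."""
        return self.batch_view[key] + self.realtime_view[key]


lc = LambdaCounter()
for event in ["a", "a", "b"]:
    lc.ingest(event)
before_batch = lc.query("a")   # served entirely from the real-time view
lc.run_batch()
lc.ingest("a")
after_batch = lc.query("a")    # batch view plus one newer event
```

The design point is that the batch path can be slow but simple and recomputable, while the real-time path only has to cover the short gap since the last batch run.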

28

by Leo Breiman

Statistical Modeling: The Two Cultures
Statistical Science, 2001

bit.ly/eUTh9L

references…

29

Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

references…

30

Then, Now, and Ahead

NOW

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

31

Geoffrey Moore
Mohr Davidow Ventures, author of Crossing The Chasm
Hadoop Summit, 2012:

what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade

data as the major force… mostly through apps – verticals, leveraging domain expertise

Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
XLDB, 2012:

complex analytics workloads are now displacing SQL as the basis for Enterprise apps

Displacement

32

algorithmic modeling + machine data + curation, metadata + Open Data ⇒ data products, as feedback into automation

⇒ evolution of feedback loops

a big part of the science in data science…

internet of things + complex analytics ⇒ accelerated evolution, additional feedback loops

taking this out into a highly social dimension

Drivers

33

source: National Geographic

“A kind of Cambrian explosion”

34

Internet of Things

35

A Thought Exercise

Consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network

They will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…

Operations Research – crunching amazing amounts of data

36

That’s a $50B company, in a market segment worth $250B

Upcoming: tractors as drones – guided by complex, distributed data apps

A Thought Exercise

37

Alternatively…

climate.com

38

Two Avenues to the App Layer

scale ➞ complexity

Enterprise: must contend with complexity at scale everyday…

incumbents extend current practices and infrastructure investments

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff

39

Then, Now, and Ahead

AHEAD

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

40

Let’s drill down on that intersection of tractors and crops, as a focus…

Some of the largest use cases for large-scale data workflows which we encounter are in Agriculture

Here’s a sector which integrates some of those themes from the Internet of Things, Caterpillar, Climate Corp, etc.

For instance…

41

• single largest employer, livelihood for 40% globally

• 500 million small farms worldwide

• most family farmers rely on rain-fed agriculture

• approx $2T agricultural real estate in US alone

• high annual rate of soil depletion

• cycles of flooding, drought, desertification

• high resolution from private satellite networks, e.g., skyboximaging.com

• SMS networks for “business intelligence” among family farmers in Ethiopia (agrepedia.com)

• microfinance, e.g., kiva.org, slowmoney.org

Data and Agriculture, Ahead

42

Consider the emerging reality of drone tractors, guided by satellite feeds, with predictive analytics accessing remote cloud-based clusters, crunching data for crops planted per-plot, based on years of history evaluated in time series analysis

It would be difficult to identify a bigger Big Data problem in the world

Data and Agriculture, Ahead

43

You’ve heard about Peak Oil, Peak Phosphorus? How about Peak Snow?

In other words, rising variance of snow pack levels, increasingly earlier peak snow in the mountains… which stresses the watersheds, infrastructure, etc., which in turn stress agriculture, energy, transportation, financial markets, tax basis, etc.

Jeff Dozier, William Gail
“The Emerging Science of Environmental Applications”
The Fourth Paradigm, 2009

source: J. Dozier, et al., UCSB

Data and Agriculture, Ahead

44

Variance in the timing of the water cycle causes stress on natural resources and infrastructure: reservoirs, aqueducts, river ways, aquifers, levees, farm lands, seawater incursion, etc.

Even in the face of so much IoT data looming, we lack adequate data and modeling of snowpack, snow melt, runoff, evaporation, water basins, etc., to understand the impact of these changes – now needed to forecast where to change infrastructure or strategies

There’s not much machine data up in the mountain peaks, and satellite data only serves so far…

⇒ new opportunities for Big Data

source: J. Dozier, et al., UCSB

Data and Agriculture, Ahead

45

Data and Agriculture, Ahead

46

Data and Agriculture, Ahead

We can resolve these kinds of problems; however, solutions must leverage huge amounts of data

47

Then, Now, and Ahead

AHEAD

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

48

Agriculture is just one sector, one set of problems to tackle

We have much, much more here in Texas

For example, Houston is a major center for Maritime work…

check out: marinexplore.org

Everything’s Bigger in Texas

49

There’s also the not so small matter of the Energy and Transportation sectors

GE is putting sensors in each and every wind generator, each and every jet engine – again, the Internet of Things.

I’ve heard rumors there are a few of those wind turbines out in West Texas?

Everything’s Bigger in Texas

50

Another of the fastest growing use cases we see for large-scale predictive modeling is in Telecom

Think about the stream of CDRs, billions of us bipeds wandering about the planet with our phones…

Firehose for that makes Twitter look like MySpace!

The value of location services as data products for local businesses, communities is astounding

Everything’s Bigger in Texas

51

Then, Now, and Ahead

AHEAD

1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves

52

Approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.

Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up

Most valuable skills:

‣ learn to use programmable tools that prepare data

‣ learn to generate compelling data visualizations

‣ learn to estimate the confidence for reported results

‣ learn to automate work, making analysis repeatable
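One common way to "estimate the confidence for reported results" is a percentile bootstrap: resample the observed data with replacement many times and read an interval off the spread of the recomputed statistic. The sketch below is a minimal standard-library version, offered as an illustration rather than a recommendation from the talk; the function name bootstrap_ci, its parameters, and the sample values are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for stat(sample).

    A fixed seed keeps the analysis repeatable from run to run.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Read the interval endpoints off the sorted estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, for illustration only.
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]
lower, upper = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI ~ [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the point estimate, and keeping the seed fixed, also covers the last bullet above: the analysis can be rerun and checked by someone else.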

source: D3

What is needed?

53

• more emphasis on statistical thinking

• not SQL vs. NoSQL, but instead a focus on apps as the process of structuring data

• multi-disciplinary teams, not cubicles and silos

• evolving more feedback loops, to drive more automation

• oddly enough, we need automation to be able to employ more people in intelligent, productive ways

• otherwise, we’re left with…

source: Schwa Corporation

What else do we need?

54

source: Twentieth Century Fox

55

source: Twentieth Century Fox

Thank you very much!

56