paco-nathan
Then, Now, and Ahead
THEN
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
Lynn asked me to talk about Data here today
A few weeks ago we stepped back for a moment to reflect about what we’d seen happen in Austin over the years
Both of us ran alternative bookstores in Austin, twenty or so years ago, and we participated as the Internet thing exploded in the 1990s
That was a blast –
observations…
We noticed a trend
Thinking about some of those who kept showing up whenever interesting things were afoot…
observations…
Overall, it’s about systems thinking
We have a wealth of that here, at UT/Austin in particular…
Ilya Prigogine spent years here, which is just incredible
School of Architecture, with leading work in VR, GIS, etc.
Interactive innovations at ACTLab…
Quantitative emphasis at McCombs…
⇒ major intellectual resources here
observations…
Then, Now, and Ahead
NOW
[Diagram: roles on a data science team]
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
Labels: data science, introduced capability
Unique Registration
Launched games lobby
NUI:TutorialMode
Birthday Message
Chat PublicRoom voice
Launched heyzap game
ConnectivityTest: test suite started
Create New Pet
Movie View Started: client, community
NUI:MovieMode
Buy an Item: web
Put on Clothing
Address space remaining: 512M
Customer Made Purchase Cart Page Step 2
Feed Pet
Play Pet
Chat Now
Edit Panel
Client Inventory Panel Flip Product Over
Add Friend
Open 3D Window
Change Seat
Type a Bubble
Visit Own Homepage
Take a Snapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Address space remaining: 1G
Leave a Message
NUI:ChatMode
NUI:FriendsModedv
Website Login
Add Buddy
NUI:PublicRoomMode
NUI:MyRoomMode
Client Inventory Panel Remove Product
Client Inventory Panel Apply Product
NUI:DressUpMode
Data Science
by DJ Patil
Data Jujitsu, O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams, O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
references…
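[Diagram: word-count workflow built from Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, and Word Count, with M and R marking the map and reduce phases]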
Enterprise Data Workflows
cascading.org
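The word-count workflow in the diagram can be approximated in a few lines of plain Python. This is a rough sketch rather than Cascading code: the function names, the token-splitting regex, and the sample inputs are assumptions made for illustration, and the stop-word left join is emulated with a simple set lookup.

```python
import re
from collections import Counter

# Hypothetical delimiter set for splitting text into tokens.
TOKEN_SPLIT = re.compile(r"[ \[\]\(\),.]")


def tokenize(docs):
    """Split each (doc_id, text) pair into (doc_id, token) pairs."""
    for doc_id, text in docs:
        for token in TOKEN_SPLIT.split(text):
            yield doc_id, token


def scrub(pairs):
    """Normalize tokens and drop the empty ones."""
    for doc_id, token in pairs:
        token = token.strip().lower()
        if token:
            yield doc_id, token


def left_join_stop_words(pairs, stop_words):
    """Emulate a left join against the stop-word list.

    The third field is None when the token has no match on the
    right-hand side, which mirrors a left outer join.
    """
    for doc_id, token in pairs:
        stop = token if token in stop_words else None
        yield doc_id, token, stop


def word_count(docs, stop_words):
    """Keep tokens with no stop-word match, then group by token and count."""
    joined = left_join_stop_words(scrub(tokenize(docs)), stop_words)
    kept = (token for _, token, stop in joined if stop is None)
    return Counter(kept)
```

A possible call, with made-up documents: `word_count([("doc01", "A rain shadow is a dry area")], {"a", "is"})` returns a `Counter` keyed by the remaining tokens.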
Over the past 5+ years, we’ve seen many large-scale Enterprise production deployments based on Cascading, Cascalog, Scalding, PyCascading, Cascading.JRuby, etc.
Enterprise data workflows, Machine learning at scale, Big Data…
Why?
Enterprise Data Workflows
amazon.com/dp/1449358721
Then, Now, and Ahead
NOW
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
• Adjusted Data – Dr. Don Easterbrook, Senate witness
Q3 1997: inflection point
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack emerged from this
[Diagram labels: Customers, transactions, Web App, RDBMS, SQL Query, result sets, BI Analysts, Excel pivot tables, PowerPoint slide decks, Stakeholder, Product, strategy, Engineering, requirements, optimized code]
Circa 1996: pre-inflection point
“Throw it over the wall”
[Diagram labels: Customers, customer transactions, Web Apps, Middleware, servlets, models, recommenders + classifiers, Algorithmic Modeling, Logs, event history, aggregation, dashboards, DW, ETL, RDBMS, SQL Query, result sets, Product, Engineering, UX, Stakeholder]
Circa 2001: post-big ecommerce successes
“Data products”
[Diagram labels: Customers, Data Products, Web Apps, Mobile, etc., transactions, content, social interactions, History, services, Workflow, near time, batch, RDBMS, Log Events, In-Memory Data Grid, Hadoop, etc., Cluster Scheduler, Planner, Prod, Eng, DW, Use Cases Across Topologies, s/w dev, data science, discovery + modeling, dashboard metrics, business process, optimized capacity, taps]
[Roles: Domain Expert, Data Scientist, App Dev, Ops; introduced capability, existing SDLC]
Circa 2013: clusters everywhere
“Optimizing topologies”
• Lambda Architecture: blending topologies
• Big Data by Nathan Marz, James Warren
• manning.com/marz
references…
source: Nathan Marz
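To make "blending topologies" a bit more concrete, here is a toy sketch of the core idea in the Lambda Architecture: a batch view recomputed from the full history, a realtime view covering only recent events, and a query that merges the two. The function names, the event shape, and the page-count use case are hypothetical, chosen only to illustrate the pattern.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute counts from the full, immutable event history."""
    return Counter(event["page"] for event in master_events)


def realtime_view(recent_events):
    """Speed layer: count only the events not yet absorbed by a batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, realtime):
    """Serving side: answer a query by merging the batch and realtime views."""
    return batch[page] + realtime[page]
```

In this sketch the realtime view can be discarded each time a batch run catches up, since the batch view is always rebuilt from the raw history.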
by Leo Breiman
Statistical Modeling: The Two Cultures, Statistical Science, 2001
bit.ly/eUTh9L
references…
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
references…
Then, Now, and Ahead
NOW
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing The Chasm
Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade
data as the major force… mostly through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
XLDB, 2012:
complex analytics workloads are now displacing SQL as the basis for Enterprise apps
Displacement
algorithmic modeling + machine data + curation, metadata + Open Data ⇒ data products, as feedback into automation
⇒ evolution of feedback loops
a big part of the science in data science…
internet of things + complex analytics ⇒ accelerated evolution, additional feedback loops
taking this out into a highly social dimension
Drivers
A Thought Exercise
Consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network
They will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…
Operations Research – crunching amazing amounts of data
That’s a $50B company, in a market segment worth $250B
Upcoming: tractors as drones – guided by complex, distributed data apps
A Thought Exercise
Two Avenues to the App Layer
[Chart axes: scale ➞, complexity ➞]
Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff
Then, Now, and Ahead
AHEAD
Let’s drill down on that intersection of tractors and crops, as a focus…
Some of the largest use cases for large-scale data workflows which we encounter are in Agriculture
Here’s a sector which integrates some of those themes from the Internet of Things, Caterpillar, Climate Corp, etc.
For instance…
• single largest employer, livelihood for 40% globally
• 500 million small farms worldwide
• most family farmers rely on rain-fed agriculture
• approx $2T agricultural real estate in US alone
• high annual rate of soil depletion
• cycles of flooding, drought, desertification
• high resolution from private satellite networks, e.g., skyboximaging.com
• SMS networks for “business intelligence” among family farmers in Ethiopia agrepedia.com
• microfinance, e.g., kiva.org, slowmoney.org
Data and Agriculture, Ahead
Consider the emerging reality of drone tractors, guided by satellite feeds, with predictive analytics accessing remote cloud-based clusters, crunching data for crops planted per-plot, based on years of history evaluated in time series analysis
It would be difficult to identify a bigger Big Data problem in the world
Data and Agriculture, Ahead
You’ve heard about Peak Oil, Peak Phosphorus? How about Peak Snow?
In other words, rising variance of snow pack levels, increasingly earlier peak snow in the mountains… which stresses the watersheds, infrastructure, etc., which in turn stress agriculture, energy, transportation, financial markets, tax basis, etc.
Jeff Dozier, William Gail
“The Emerging Science of Environmental Applications”
The Fourth Paradigm, 2009
source: J. Dozier, et al., UCSB
Data and Agriculture, Ahead
Variance in the timing of the water cycle causes stress on natural resources and infrastructure: reservoirs, aqueducts, river ways, aquifers, levees, farm lands, seawater incursion, etc.
Even in the face of so much IoT data looming, we lack adequate data and modeling of snowpack, snow melt, runoff, evaporation, water basins, etc., to understand the impact of these changes – now needed to forecast where to change infrastructure or strategies
There’s not much machine data up in the mountain peaks, and satellite data only serves so far…
⇒ new opportunities for Big Data
source: J. Dozier, et al., UCSB
Data and Agriculture, Ahead
Data and Agriculture, Ahead
We can resolve these kinds of problems; however, solutions must leverage huge amounts of data
Then, Now, and Ahead
AHEAD
Agriculture is just one sector, one set of problems to tackle
We have much, much more here in Texas
For example, Houston is a major center for Maritime work…
check out: marinexplore.org
Everything’s Bigger in Texas
There’s also the not so small matter of the Energy and Transportation sectors
GE is putting sensors in each and every wind generator, each and every jet engine – again, the Internet of Things.
I’ve heard rumors there are a few of those wind turbines out in West Texas?
Everything’s Bigger in Texas
Another of the fastest growing use cases we see for large-scale predictive modeling is in Telecom
Think about the stream of CDRs, billions of us bipeds wandering about the planet with our phones…
Firehose for that makes Twitter look like MySpace!
The value of location services as data products for local businesses, communities is astounding
Everything’s Bigger in Texas
Then, Now, and Ahead
AHEAD
Approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up
Most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
source: D3
What is needed?
• more emphasis on statistical thinking
• not SQL vs. NoSQL, but instead a focus on apps as the process of structuring data
• multi-disciplinary teams, not cubicles and silos
• evolving more feedback loops, to drive more automation
• oddly enough, we need automation to be able to employ more people in intelligent, productive ways
• otherwise, we’re left with…
source: Schwa Corporation
What else do we need?