Laws and Limits of Data Science: The Next Decade
Michael L. Brodie
Big Data is opening the door to …
Grand Opportunities: Accelerating Scientific Discovery …
Grand Challenges: Many – efficacy, efficiency, …
What is Big Data?
• Defining Big Data constrains this emerging phenomenon
• Since Big Data is not
  — About data, but a problem-solving ecosystem
  — A discipline, but a multidisciplinary sub-domain of most disciplines*
• What matters is what we will do with Big Data
• Big Data is opening the door to profound change in
  — Processing
  — Thinking
• Let’s use the potential of profound change to understand Big Data
* “transformative … changing academia (… emerged .. on the critical path for their sub-discipline) and is changing society” Michael Jordan.
Starting to Understand Big Data
• Listen to Data
  — Hypothesis generation → overcome limits of human cognition*
• Multiple, Simultaneous Perspectives
  — Ensemble models → Accelerating Scientific Discovery*
• And many more …
* Necessary condition: human-guidance
Big Data is in its infancy
With at least decade-long challenges
Outline
• Big Picture: Why and What
• Grand Opportunities
• Grand Challenges
  — Efficacy, amongst many
• Laws and Limits of Data Science
Big Picture: Scientific Method
[Diagram labels: Phenomenon, Hypothesis, Experiment, Model, Causality]
Big Picture: Why & What
[Diagram labels: Phenomenon, Experiment, Model]
• What (Big Data): correlation, i.e. what might occur
• Why (Empiricism): causation, i.e. why it occurs
Why: Scientific Method and the Search for Causation
• History of Science and the Scientific Method
• Mature Disciplines: Empiricism, Clinical Studies, Drug Discovery
The Holy Grail of science is to identify accurate causality.
Empirical, clinical trial, and drug discovery methods take time: 100+ years
Three Ages of Medicine [The Remedy: Goetz]
• Free-for-All: 1850s–1940s
• Rise of Trials: 1940s–2010s
• Beyond the Lab: Post-2010
What: Models and the Search for Meaningful Correlations
• History of Modelling: mathematics, sciences, computing, …
• Disciplines
  — Mature (theory-driven): math, physics, statistics, …
  — Emerging (data-driven): data mining, machine learning, neural networks, support vector machines, …
The Holy Grail of data-intensive discovery is correlations that are meaningful.
Correlation does not imply causation
• Methodologies
  — Mature: 100s of years
  — Emerging: at least a decade
The Holy Grail of data-intensive discovery is correlations that are accurate and reliable.
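The caution that correlation does not imply causation can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the hidden common cause `z`, the noise levels, and the sample size are all assumptions chosen for illustration. Two variables driven by the same hidden factor correlate strongly, yet when an experiment sets `x` independently, the correlation with `y` disappears.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n = 5000

# Assumption for illustration: z is a hidden common cause of both x and y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Observational data: x and y correlate although neither causes the other.
r_observed = pearson(x, y)

# Experiment: x is set by the experimenter, independently of z.
# y does not depend on x, so it is unchanged by the intervention.
x_forced = [rng.gauss(0, 1) for _ in range(n)]
r_intervened = pearson(x_forced, y)

print(f"observed r = {r_observed:.2f}, after intervention r = {r_intervened:.2f}")
```

The observational correlation answers "what might occur"; only the experiment speaks to "why".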
GRAND OPPORTUNITIES Big Data
Accelerating Scientific Discovery
[Diagram labels: Experiment, Model, Correlations, Hypotheses, Why: Causation, What: Correlation, Data Driven, Theory Driven]
Accelerating Scientific Discovery
[Diagram labels: Experiment, Model, Correlations, Hypotheses, Why: Causation, What: Correlation, Data Driven, Theory Driven, Watson, Baylor, Scientists]
Wonderful Use Case
Grand Challenges
• Big Data is in its infancy: 10+ year evolution
  — Efficiency: expression/language → execution (stack)
  — Open Data: data use/reuse / sharing
  — Efficacy
“major engineering and mathematical challenge, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.” Michael Jordan
“wrt Big Data we’re now at the ‘what are the principles?’ point in time.” Michael Jordan
What is Data Science @ Scale?
Data Science @ Scale is to data-intensive discovery as the Scientific Method is to scientific discovery
• Reframe Empiricism*
  — Data Science is the data component of the Scientific Method for data
  — Concepts, tools, and techniques for data-intensive discovery
• Data-intensive discovery = virtual experiment
  — Laws and Limits of Data Science
* With Dr. Jennie Duggan, MIT & Northwestern University
First Law of Data Science
Meaning of a correlation requires empirical verification
What is seldom enough
Why is not always necessary
Best Practice #1: Efficacy-driven data discovery
(Efficacy before efficiency)
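One way to read the First Law is that a correlation found by searching data must be checked against data that played no part in the search. The sketch below is an illustrative simulation under stated assumptions, not a method from the slides: every variable is independent random noise, so by construction no real relationship exists, and the sample sizes and candidate count are arbitrary choices.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
n_discovery, n_verify, n_candidates = 30, 2000, 500

def noise(n):
    """Independent standard normal draws: a variable with no real signal."""
    return [rng.gauss(0, 1) for _ in range(n)]

# Discovery: search many candidate variables for the one that best
# correlates with the target on a small sample.
target = noise(n_discovery)
candidates = [noise(n_discovery) for _ in range(n_candidates)]
scores = [abs(pearson(c, target)) for c in candidates]
best = max(range(n_candidates), key=scores.__getitem__)
r_discovery = scores[best]

# Verification: because every variable here is independent noise, fresh
# draws stand in for new measurements of the winning candidate and target.
r_verify = abs(pearson(noise(n_verify), noise(n_verify)))

print(f"candidate {best}: |r| = {r_discovery:.2f} in discovery, "
      f"{r_verify:.2f} in verification")
```

The "best" correlation looks substantial only because it was selected from many candidates; it does not survive verification on fresh data.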
Second Law of Data Science*
Causality can be determined from correlations only by community-accepted mechanisms and metrics**, e.g., empiricism.
* With Gregory Piatetsky-Shapiro, KDNuggets
** for What and Why
Limits of Data Science
We do not know where our concepts, tools, and techniques break on massive data sets!
Caution: Big Data Winter Potential (Michael Jordan)
Best Practice #2: Experiment + Error bars everywhere
  — Common Practice: not so much
Best Practice #3: Machine-driven, human-guided
  — Common Practice: not so much
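As one possible reading of "error bars everywhere", a reported correlation can carry a bootstrap confidence interval rather than stand as a bare number. The following is a minimal sketch under assumptions: the synthetic data, the slope of 0.5, the 1000 resamples, and the percentile method are illustrative choices, not prescriptions from the slides.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(2)
n = 200

# Synthetic data (assumption): y depends weakly on x plus noise.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
r_hat = pearson(x, y)

# Percentile bootstrap: resample (x, y) pairs with replacement and
# recompute the correlation on each resample.
n_boot = 1000
boot = []
for _ in range(n_boot):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
boot.sort()
lo, hi = boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1]

print(f"r = {r_hat:.2f}, approx. 95% interval [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the estimate makes it visible how much the result could move under resampling of the same data.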
Best Practice Not So Common*
• BP1: Efficacy-driven data discovery
  — Best eScience, Journalism, Economics, Computational X, …
  — Big Data not so much (<5%)
• BP2: Experiment + Error bars everywhere
  — Above + Best Data Scientists (~5%, w/ scientific, ML, … training)
  — Big Data (<5%): Customers don’t ask; data scientists don’t practice
• BP3: Machine-driven, human-guided
  — ~5% strict; 95% not so much, e.g., ~60 Data Curation products
  — 50% partial: supervised / trained
• Example: based on the above Laws and Best Practices
*Personal un-scientific study, limited data, yet so unbiased and oh so true
Laws of Data Science Less So …
1st: Correlation ≠ Causation
  Common confusion in science*, more in Data Science, even more in business
2nd: Causality (meaning) requires verification by community-accepted norms
  Cornerstone of Science, hopefully emerging in Data Science**
* Richard Feynman, 1974
** If #1 is rare, #2 is more so
Conclusions
• Big Data is in its infancy and is opening the door to …
• Grand Opportunities
• Grand Challenges
• 10+ year evolution
• Data Science ~= Scientific Method for Data
• Laws of Data Science
  1 Correlations must be verified
  2 Verification relative to community-accepted norms
• Data Science Best Practices
  1 Efficacy-driven discovery
  2 Experiment + Error Bars everywhere
  3 Machine-Driven, Human-Guided
• Limit of Data Science: we do not know where our tools break