Data Science and Scientific Discovery: New Approaches to Nature’s Complexity

Dr. John Rumble
President, R&R Data Services
Gaithersburg MD
www.randrdata.com
[email protected]


To understand scientific and technical data today, we must first understand how the information revolution has changed both Science and Data and their relationship

DC Data Science, May 2012

My Talk

1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data

Why Do We Do Science?

Two primary motivations for advancing science

• First is our insatiable thirst to understand the world, probably dating from when we first started thinking
• Second is a direct result of the Industrial Revolution: How does the technology we are inventing actually work?

21st Century Science

• From the fundamental to the complex
  – From determining the laws of nature for a few particles to understanding real systems: cells, the atmosphere, the Earth, ecology
• From reductionism to constructionism
  – Using our basic knowledge to make models and predict the behavior of real systems, that is, all systems we find in nature or that we can construct

[Figure: Science Vol. 336, p. 707 (2012)]

[Figure: J. Schmitt et al., Science Vol. 336, p. 708 (2012)]

Today’s Science and E-Science

The Data Revolution has enabled E-Science through
  – Advanced telecommunications and networks
  – Computation power and storage
  – New algorithms for data management, visualization, analysis, and mathematics
• Today, E-Science can be done faster and more powerfully, and scientific communication can occur almost instantly

The real revolution, however, is in the relationship between science and data


When it hits the New York Times, you know it is for real!

Science and Data

To understand scientific and technical data today, we must first understand how the information revolution has changed Science and Data and their relationship

• Science today is not about reduction to a few basic laws
• Science is about how we understand and control all aspects of nature
• How is this done?
  – By careful measurement, accurate tests, keen observations, and powerful models and simulations that lead to scientific knowledge
• The results are expressed as scientific data!


Scientific Knowledge

• What does this really mean?

• Recognize a new phenomenon
• Analyze its components
• Identify the variables that govern it
• Isolate the important variables
• Demonstrate understanding by control
• Change the phenomenon

Scientific knowledge means understanding the independent variables governing a phenomenon and how they influence it

Science Today

A major theme of science today is that we are able to make accurate measurements on a complex world that
  – Advance our understanding of nature,
  – Improve our ability to harness technology,
And, in spite of many challenges,
  – Increase the importance of science to society in the future

Scientific data are at the core of modern science


The Data Revolution in Science

Today, E-Science is real

• Computer at every desk
• Connectivity: The Internet/WWW explosion
• Computerized experiments and observations
• Database tools on every computer
• Electronic publications
• Model and simulation-based R&D
• Comprehensive databases
• Virtual libraries

Four Ways to Generate Scientific Data

• Observations
• Experiments
• Standardized testing
• Modeling and simulation

Observational Science Today

Today we have exciting new capability to observe nature in situ better than ever before
  – Hubble Space Telescope
  – High sensitivity seismographs
  – Bio-macromolecule sequencing instruments
  – LTER (Long-term ecological research) platforms
  – Earth-observing satellites
  – High power computers to analyze data

Generates huge amounts of quality data

Experimental Science Today

Today we have exciting new capability to observe nature in controlled circumstances better than ever before
  – Atomic force microscopes
  – Micro-electronics and lasers
  – High energy accelerators
  – Femto-second chemical reactors
  – High power computers to analyze data

Generates large amounts of high quality data

Testing Today

Today we have new capability to test and analyze materials using standard methods
  – Electronic test equipment
  – Analytical databases fully integrated into equipment
  – Analyzing unknown substances
  – Carbon and other techniques for dating objects
  – Genomic sequencing
  – National and international standard test procedures
  – Data analysis tools to generate properties
  – Self-calibrating instruments

Generates medium amounts of high quality data

Computation Today

We now also have the ability to create a Virtual World

• Models and simulations of complex systems
• Techniques to do advanced mathematics
• Computers to execute immense calculations
• Visualization tools to examine our virtual world

Uses and generates large amounts of data

Characteristics of Approaches for Generating Scientific Data

Approach       Type of science  Independent variables  Examples                                             Data volumes
Observational  Big              As available, many     Satellites, seismic, census                          Large
Observational  Small            As available, few      Biodiversity, social                                 Small
Experiment     Big              Chosen, few            High energy physics                                  Large
Experiment     Small            Chosen, few            Chemistry                                            Small
Test           Big              Specified, few         Genomics                                             Medium
Test           Small            Specified, many        Materials testing, imaging, structure determination  Small to large
Modeling       Big              Chosen, many           Climate change, epidemiology                         Medium
Modeling       Small            Chosen, few            All disciplines                                      Small


The Data Revolution in Science is Real

• Observation, experimentation, testing, and calculation all produce, and in some cases use, large amounts of data

• E-Science has provided an incredible array of tools, technologies, and methods to collect, store, manage, analyze, exploit, preserve, and disseminate these data

Science today is more fully based on data and data collections than ever before!


Scientific Data and Scientific Databases

Data communicate measurement (experimental and observational) and computational results

“When you can measure what you are speaking about, and express it in numbers, you know something about it”
Lord Kelvin

Types of Scientific Data

• Numbers: 1, 2, 3…
• Simple text: ABCs
• Complex text: Greek, scripts, symbols
• Equations: E = mc²
• Graphs
• Diagrams
• Pictures
• Software
• Rules


All Data Are Not the Same

• Measurement or property: There is a difference!

• Measurements are a one-time look at nature

• Properties are the inherent characteristics of nature – They are Nature Itself

Measurements are for Today

• Measurements are what you see now
• Capture one point of view
• Usually limited number of variables changed

One of 1300 measurements of Diego Giacometti

Properties are Forever

• Properties are the real thing
• Need many repeated measurements
• Far too many substances and systems to determine properties
• Will never know the properties of everything

The real Diego Giacometti

The Classical Paradigm for Science and Data

[Diagram linking: Measurement, Data, Hypotheses and Questions, Theories and Models, Scientific Knowledge]

The True data paradigm has always been this

[Diagram linking: Measurement, Data, Data Collections, Hypotheses and Questions, Theories and Models, Scientific Knowledge]

Scientific Databases in History

• Preserved data collections (large and small)
• At first, simply data preservation
• Data was stored, but not really exploited

1. Accuracy
2. Comprehensiveness
3. Systematizing

Accuracy

Newgrange, Ireland
• 6000 years old
• Aligned to the rising sun at the winter solstice
• Depended on careful observational data on the rising sun
• One data point!

Volume and Accuracy Improving

Stonehenge
• 5000 years old
• Over 100 stones
• Complicated stone alignments
• Marks position of the moon and major stars as well as the sun
• Storage of several observations

Comprehensive Data Sets

Galen
• Greek physician
• Experimental physiologist
• Arabic copy from 800 AD
• Pictorial, descriptive, function describing
• Representative of botanical and animal catalogs

Systematizing a Comprehensive Collection

Pliny the Elder
• Roman scholar
• Natural History (77 AD)
• One of the earliest known encyclopedias of the natural world
• Systematization of data


Data and Scientific Discovery

• The advent of the Baconian Revolution: anchoring scientific understanding to physical observation
• Led to databases becoming the foundation of scientific discovery
• True Beginnings of Data Science!

Scientific Databases in History

• Preserved data collections (large and small) form the foundation of scientific discovery
• Trends in data preservation and discovery
  1. Accuracy
  2. Comprehensiveness
  3. Systematizing
  4. Extraction of essence
  5. Explanation of the complex
  6. Prediction of new phenomena!
  7. Physical theory from data!

Extraction of Essence

Tycho Brahe
• Late 16th Century
• Danish Astronomer
• Made precise measurements that led to Kepler’s theories
• Led to discovery of simple relationships

Explanation of the Complex

Charles Darwin
• Combined with others in geology, zoology and botany
• A wide variety of facts and phenomena recorded
• Theory of Evolution had to explain many diverse observations and measurements from different disciplines

Prediction of New Phenomena

Mendeleev and the Chemical Periodic Table

Predicting properties of unknown elements from properties (data) of known elements

Physical Theory from Data

• Notes on the Spectral Lines of Hydrogen: Johann Jacob Balmer, Annalen der Physik und Chemie 25, 80-85 (1885)
  – “I gradually arrived at a formula which, at least for these four lines, expresses a law by which their wavelengths can be represented with striking precision…From the formula, we obtained for a fifth hydrogen line 3969.65 × 10⁻⁷ mm.”
• The development of quantum mechanics
  – Bohr
  – Schrödinger
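Balmer's step from four measured lines to a predictive formula can be reproduced in a few lines. The Python sketch below assumes the usual statement of his formula, wavelength = h · m² / (m² − 4), with his constant h = 3645.6 × 10⁻⁷ mm; the function name `balmer_wavelength` is illustrative and not from the talk. The values m = 3 to 6 correspond to the four lines he fitted, and m = 7 to the fifth line he predicted.

```python
# Balmer's empirical formula for the visible hydrogen lines (1885).
# Assumption: wavelength = h * m^2 / (m^2 - 4), with h = 3645.6 in units of 1e-7 mm.
BALMER_CONSTANT = 3645.6  # units of 10^-7 mm


def balmer_wavelength(m: int) -> float:
    """Return the wavelength, in units of 10^-7 mm, for an integer m > 2."""
    if m <= 2:
        raise ValueError("m must be an integer greater than 2")
    return BALMER_CONSTANT * m**2 / (m**2 - 4)


# m = 3..6 are the four fitted lines; m = 7 is the predicted fifth line.
for m in range(3, 8):
    print(m, round(balmer_wavelength(m), 2))
```

The point of the example is the one the slide makes: a compact rule extracted from a small, accurate data set predicted a measurement that had not been part of the fit.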

Brief History of Modern S&T Databases

1950s
  – Crystal structures (software generated data)
1960s
  – Neutron data (modeling weapons)
1970s
  – Analytical chemistry (identify chemicals)
  – Thermochemistry (properties linked)
  – Environmental and toxicology
  – Large physics experiments
  – Space science
1980s
  – Astronomy
  – Materials
  – Earth sciences
  – Biology
  – Genomics

Scientific Databases Today

“Preserving” Data is Easy
• Database management tools are inexpensive and powerful
• Many models for good interfaces exist
• Collecting data (data deposition) can be routine
• Expertise is easily available from many sources

Building databases today is remarkably easy

Comprehensive Data Collections for 21st Century Science

• International Virtual Observatory: All observations for every point in the sky
• Structural Genomics: For all living things!
• Proteomics: 30,000 or 300,000?
• Climate change: Water, earth, atmosphere and all they contain
• Historic geologic: Many millennia, the entire planet
• Chemistry on demand: 60 elements, 5 at a time, many ratios, 10⁹ to 10¹⁰ compounds
• Biodiversity: 5M species? or 10M? or 50M?
• Brain scans: Every person, every thought forever

Very large databases will be found in every scientific discipline
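The "chemistry on demand" estimate (60 elements, 5 at a time, many ratios) is a combinatorial count, and a short calculation shows where the order of magnitude comes from. The Python sketch below is only an illustration: it assumes a compound is defined by a choice of 5 distinct elements plus a mixing ratio, and it back-calculates how many ratios per element set the quoted range of one to ten billion compounds would imply. The variable names are hypothetical.

```python
import math

# Assumption: a candidate compound is a set of 5 distinct elements chosen
# from 60, combined in one of many possible ratios.
element_sets = math.comb(60, 5)  # number of distinct 5-element sets

# Ratios per element set implied by the quoted totals of 1e9 and 1e10 compounds.
ratios_low = 10**9 / element_sets
ratios_high = 10**10 / element_sets

print(f"5-element sets from 60 elements: {element_sets:,}")
print(f"implied ratios per set: about {ratios_low:.0f} to {ratios_high:.0f}")
```

Under these assumptions, a few hundred to a few thousand ratios per element set is enough to reach the quoted range, which is why no such collection could be filled by measurement alone.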

The Face of 21st Century Science

• Complex
• Multi-disciplinary
• Real systems
• Virtual as well as physical

Access to quality data becomes critical

Attention to the problems and challenges of long term preservation of and access to data becomes more important than ever!

Scientific Discovery and Data Collections: The Paradigm has Changed

Yesterday
• Collections managed by a small number of people
• Collections readable by one scientist
• Collections interpretable by one person
• Discoveries made by thinking, with analysis by one person

Today
• Collections managed by groups
• Collections not readable by any individual
• Collections interpretable only with aid of software

The Future
• Discoveries aided or made by computers, with verification by people?

The Proposition

Scientific databases in the future will be an even more important source for scientific discovery

• Data collections are critical for
  – New insights
  – New scientific principles
  – New knowledge
  – Understanding complex systems

Let’s look at 3 problems and the challenges they present


Three Problems in the Data Era

1. Too much data
2. Complex systems
3. Complex science

Science and Data: How the information revolution has changed their relationship

[Diagram linking: Measurement, Data, Data Collections, Hypotheses and Questions, Theory and Models, Scientific Knowledge]

Scientific Data in the Future

Problem 1: Too much data

The Challenges
• How do you look at large volumes of data?
• What does data quality mean for large data collections?
• How do you determine which data are important?

Challenge: Too Much Data

• Too much data for any one person to read or understand

Can use
• Visualization
• Data reduction
• Anomalies and outliers

How does anyone read a terabyte of data?

Software must be used to “read” data

Can we allow software to determine what are important data?
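As one concrete illustration of software "reading" data on a person's behalf, the Python sketch below flags outliers using the median and the median absolute deviation (MAD). This is a common robust technique chosen here for illustration, not a method named in the talk; the function name `flag_outliers`, the 3.5 threshold, and the sample readings are all assumptions.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are less distorted by extreme points than mean and standard deviation.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Made-up measurements with one suspicious point.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0]
print(flag_outliers(readings))
```

Note that the code only reports which points are unusual. Whether an unusual point is an error or a discovery is still the question the slide leaves open.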

Challenge: Too Much Data

Do we have the technology to handle the overwhelming volume of data from new measurement techniques?

• What to capture when we generate too much data too fast?
• How to store, represent, manipulate and display data that are too voluminous?
• How to find out which data are important?

Challenge: Too Much Data and Data Quality

Evaluating data quality

• How can large amounts of data be evaluated? In real time? As new data are published?
• How can large data sets be integrated together correctly?
• What does quality mean in a terabyte of data? For
  – Each data point?
  – Each set of points?
  – Sub-collections?
  – An entire collection?


Challenge: Too Much Data and Data Quality

Evaluating data quality

• Bad data quality leads to bad science and bad decisions based on science

• One measurement does not make a property

• Agreement between theory and experiment does not mean both are correct

• In today’s world with terabytes of data, what does quality mean?


Challenge: Modeling and Data Quality

Data Quality

Making accurate virtual measurements on virtual systems

• What is the quality in a calculation?

• How do you establish uncertainty for a calculation?

• Which computational results should be stored, and how can those data be handled?

• How do you discover something new in a mass of computational results?


Some models have mechanisms for assessing quality

HΨ = EΨ Schrödinger equation

The variational principle applies only to energy

Quality of other properties calculated from the equation is unknown

Challenge: Modeling and Data Quality

Documenting Quality of a calculation

Making accurate virtual measurements on virtual systems

• How do you establish uncertainty for each step of a calculational result?
  1. Model assumptions (which independent variables are used)
  2. Translation into algorithms
  3. Coding
  4. Input
  5. Finite arithmetic
  6. Post-processing analysis
• Science Vol. 336 pp. 159-160 (2012): Software created by public funding must be released, just as with data themselves
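The "finite arithmetic" step in the list above is easy to underestimate. The Python sketch below is a minimal illustration, not taken from the talk: it adds the same ten numbers two ways, once with a plain running total and once with `math.fsum` from the standard library, and compares each result with the exact answer.

```python
import math

values = [0.1] * 10  # 0.1 has no exact binary floating-point representation

# Plain running total: a small rounding error is introduced at each addition.
running_total = 0.0
for v in values:
    running_total += v

# math.fsum tracks intermediate partial sums to avoid that loss of precision.
careful_total = math.fsum(values)

print(running_total == 1.0)
print(careful_total == 1.0)
print(abs(running_total - careful_total))
```

The discrepancy is tiny for ten numbers, but a stored computational result says nothing about how it was summed unless that step is documented, which is the slide's point about recording uncertainty at each stage.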

Challenge: With Too Much Data, What is Important?

When you have a lot of data, what can you do with it?

Abstraction of important features

• How can we find what is important when we have too much data?
• Or not enough of the correct data?

Truly great science is having the insight of what is important

Can we teach software how to do that?

• 80 trillion cells in body
• Number of human proteins is estimated to be 30,000 – 70,000
• Which are important and why?
• At least 150 proteins repair DNA damage

Scientific Data in the Future

Problem 2: Real systems are very complex

The Challenges
• Large number of objects
• Large number of independent variables
• Changing scientific language
• Data Integration

Complexity Challenge: Many Objects

There are too many objects to count, observe, measure, or calculate

• Number of stars
• Number of species
• Number of chemicals
• Number of individuals
• Number of rocks
• Number of cells
• Number of thoughts
• Number of ecosystems
• You get the point


Complexity Challenge: Independent Variables

• How do we use metadata (report the relevant independent variables) to describe what we preserve?

• Independent variables are a quantitative mechanism for expressing our knowledge about how and why a phenomenon occurs

• Capturing complete knowledge of independent variables requires a large or (perhaps) even an impossible amount of data

• One goal of research is to understand which variables are important and why

• Our knowledge clearly evolves over time

P = (n/V)RT (Ideal gas law)

Dependent variable: P = pressure

Independent variables: n/V = number/volume = density, T = temperature
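The ideal gas law makes the distinction between dependent and independent variables easy to see in code. The Python sketch below assumes the molar form of the law, with n/V in mol/m³, T in kelvin, and R as the molar gas constant; the function name `ideal_gas_pressure` and the sample state are illustrative choices, not from the talk.

```python
R = 8.314462618  # molar gas constant in J/(mol*K)


def ideal_gas_pressure(molar_density: float, temperature: float) -> float:
    """Return the pressure in pascals from P = (n/V) * R * T.

    molar_density is n/V in mol/m^3 and temperature is T in kelvin:
    the two independent variables. Pressure is the dependent variable.
    """
    return molar_density * R * temperature


# One mole in 22.414 litres (0.022414 m^3) at 273.15 K,
# which should come out close to one standard atmosphere.
pressure = ideal_gas_pressure(1 / 0.022414, 273.15)
print(round(pressure))
```

A data record that stored only the pressure, without the density and temperature at which it was obtained, could not be reused or compared with other data, which is why the slide treats independent variables as essential metadata.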

Complexity Challenge: Independent Variables

Major challenge in data collections is to capture evolution of knowledge of independent variables

• Must be done in such a way as to preserve data set compatibility
• Let’s work through a quick example of the complexity and how knowledge changes with time

Most data are functions of numerous independent variables

Brain Imaging

• Recording techniques evolve and improve over time
  – X-ray, CT, MRI, PET, next?
• Each technology individually evolves, as do the types of signals collected, their association with brain activity and region
• Monitoring reactions to stimulus: pain, visual, auditory, tactile, etc.
• Details must be defined and recorded

Brain History

• If we imagine the details necessary to describe this, the number of independent variables expands rapidly
  – Stimuli history
  – Physiological history
  – Developmental history
  – Environmental exposures
  – Drugs taken
  – More
• As with the development of unifying theories of the gross physical world (motion, evolution, chemistry, genetics), the details are necessary to find the dominant factors

What are the most important independent variables for recording brain history? Still an open question!


Complexity Challenge: Independent Variables

Modern science requires data from many disciplines

• If we must aggregate different data sets (e.g., over the Web) to do discovery, how do we know data are comparable?

• How do we integrate data sets with varying numbers of independent variables?

• Especially if their names and meaning change over time?
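The integration questions above can be made concrete with a small sketch. The Python example below is hypothetical throughout: the variable names, the `NAME_MAP` table, and the two tiny data sets are invented for illustration. It renames variables to one agreed vocabulary, merges records from data sets that carry different numbers of independent variables, and marks the variables a record lacks with `None`. It assumes that renamed variables already share units, which in practice would itself need checking.

```python
# Hypothetical mapping from older or discipline-specific variable names to one
# agreed name; a real project would maintain this as curated metadata.
NAME_MAP = {"temp_C": "temperature_c", "T": "temperature_c", "press": "pressure_kpa"}


def harmonize(record: dict) -> dict:
    """Rename known variables to their agreed names; keep unknown ones as-is."""
    return {NAME_MAP.get(name, name): value for name, value in record.items()}


def integrate(*datasets):
    """Merge data sets, filling variables a record lacks with None."""
    records = [harmonize(r) for ds in datasets for r in ds]
    all_vars = sorted({name for r in records for name in r})
    merged = [{name: r.get(name) for name in all_vars} for r in records]
    return all_vars, merged


old_data = [{"T": 20.0, "press": 101.3}]
new_data = [{"temp_C": 21.5, "pressure_kpa": 100.9, "humidity_pct": 40.0}]

variables, merged = integrate(old_data, new_data)
print(variables)
for row in merged:
    print(row)
```

The older record ends up with no humidity value. The merge can expose that gap, but it cannot say whether humidity mattered for the older measurement, which is the comparability problem the slide raises.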


Complexity Challenge: Evolving Language

These are powerful change factors that cannot be ignored in preserving data

How do languages evolve?
• Contractions of words
• Reordering sentences
• Borrowing words
• Dropping and adding startings and endings
• Differentiation of concepts
• Evolution of concepts

John McWhorter, The Power of Babel

• Ontologies can help

Complexity Challenge: Evolving Language

Time and Scientific Language

• Grammar rules appeared only a few hundred years ago
• Language change factors ignore authority
• Usage wins over regulations every time!
  – Are terminology standards actually used?

Are efforts such as that on the right doomed to fail?


Complexity Challenge: Evolving Language

Time and the evolution of Scientific Language

• New knowledge requires new language

• Data preservation efforts must recognize evolution of scientific language

• Not just independent variables and metadata – the scientific language itself

• So if you are going to do “discovery,” you’d better know what you are working with


Complexity Challenge: Data Integration

Developing standards for scientific data and metadata


• What is the business case for such standards?

• How can you standardize scientific language if it continues to evolve?

• How can you determine object equivalency and uniqueness with partial data sets?

• How do you persuade scientists to back off the state-of-the-art to agree on standards?


Complexity Challenge: Data Integration

Data Standards: Making exploitation of large data sets possible


• What standards are needed for making data sets work together?

• How can you trust integrated data sets?

• No science is an island by itself

• Science today is multi-discipline, international, multi-lingual, ever-changing

• Integration can be achieved by standards and clear reporting of measurements

• As knowledge of variables increases, integrating old and new data becomes more difficult


Complexity Challenge: Data Integration

Data Ownership

We must differentiate between discovery and adding value

Observing nature should not lead to data ownership

• Transforming observations through value-added intellectual effort can create IPR

• For scientific data, must be very careful not to restrict use by others

• The same observations led to many different theories of planetary motion – from Aristotle and Ptolemy to Kepler to Newton to today


Complexity Challenge: Data Integration

Maintaining full and open access to the large number of databases required for making new scientific discoveries


• What policies are needed for full and open access?

• Open access aims to provide everyone with the information and data to advance science

• Open is not necessarily free
• Long term preservation does cost money
  – Data and literature collections must be supported

• How can discoverers profit from their automated discoveries?

• How do you get the information industry to understand the new paradigm for discovery?


Complexity Challenge: Data Integration

Data Costs

Nothing is ever really free!

• It costs significant money to generate, capture, manage, store, analyze, use, disseminate, and preserve scientific data

• Data costs must be integrated into the cost of generation

Policy and practice will vary from discipline to discipline, but nothing is ever free


Complexity Challenge: Data Integration

Data Repositories

• Started with crystal structure
• Genomics
• Other disciplines following
• NSF now requiring data management plans
• Often required to publish papers
• Curation (everything reported correctly) now automated

Model does not translate easily for evolving fields


Complexity Challenge: Data Integration

Progression of Data Collection

• Individual
• Collegial
• Institution or discipline repository
• Evaluated data
• “Property values”

• Each step requires more metadata to provide adequate documentation
• Very difficult to add metadata after the fact
• For a new phenomenon, it is difficult to know which independent variables are necessary


Scientific Discovery in Preserved Data

Problem 3: Real systems are very complex and complex behavior in systems is difficult to find

The Challenges
• How do we recognize real understanding?
• What is knowledge discovery in the future?

Challenge: Real Understanding

Real systems are very complex

How can you identify the existence of a unifying theory or concept?

• Could we have derived quantum mechanics from a complete database of atomic and molecular spectra?

• What features does quantum mechanics have beyond these data?


Challenge: Real Understanding

Real systems are very complex

Multiple views of the same phenomena exist

The Simple (?) Laws of Interaction

• String theory
• Quantum theory
• Matrix mechanics
• Maxwell’s theory
• Quantum electrodynamics
• Newton’s laws of motion

Are all views of nature equally discoverable?

By computer-aided discovery?


Challenge: Real Understanding

How do we develop real understanding?

Real Scientific knowledge?

• Just because we measure a phenomenon does not mean we understand it
  – Do we know how many genes there are?
  – Does measuring the mass of the universe make us understand dark matter?

How does data lead to understanding?


Challenge: Real Understanding

Knowledge Discovery

• Large amounts of data can help find new discoveries
• How to know which data are the most important, the key to discovery
• How to know something is there to be discovered?
• Can too much data make discovery more difficult?
• Will discovery have to be automated? Can it be?


Data Collections in 21st Century Science

The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them

William Bragg


Some Final Thoughts

Scientific databases in the future will be an even more important source for scientific discovery

• Preservation of data needed for
  – New insights
  – Scientific principles
  – New knowledge
  – Understanding complex systems

The problems and challenges I have just outlined are not insurmountable – just problems and challenges


Some Final Thoughts

Science has changed and with that change, our expectations for science have changed.

We now expect science to be a force for shaping the future, not just understanding nature

Scientific databases in the future will be an even more important source for scientific discovery

The Data Revolution has become an enabling force to meet our expectations for 21st Century Science
