Page 1: Data Integration, Analysis, and Synthesis Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara

Data Integration, Analysis, and Synthesis

Matthew B. Jones

National Center for Ecological Analysis and Synthesis

University of California Santa Barbara

Scalable Information Networks for the Environment

http://knb.ecoinformatics.org

Funding: National Science Foundation (DEB99-80154, DBI99-04777)

Page 2

NCEAS’ Mission

Integrate existing data for broad ecological synthesis

Use synthesis to inform policy and management

Page 3

Synthesis at NCEAS

Research
Management
Policy

200+ synthesis projects
1900+ participating scientists

Page 4

Research projects

Hunsaker – Quantification of Uncertainty in Spatial Data for Ecological Applications
Ives & Frost – Intrinsic and Extrinsic Variability in Community Dynamics
Osenberg – Meta-Analysis, Interaction Strength and Effect Size; Application of Biological Models to the Synthesis of Experimental Data
Murdoch – Complex Population Dynamics

Page 5

Management projects

Andelman – Designing and Assessing the Viability of Nature Reserve Systems at Regional Scales: Integration of Optimization, Heuristic and Dynamic Models
Boersma & Kareiva – Prospectus For An Analysis of Recovery Plans and Delisting
Kareiva – Habitat Conservation Planning for Endangered Species
Lubchenco, Palumbi, & Gaines – Developing the Theory of Marine Reserves

Page 6

Policy projects

Costanza & Farber – The Value of the World's Ecosystem Services and Natural Capital: Toward a Dynamic, Integrated Approach

http://www.nceas.ucsb.edu/

Page 7

Synthesis projects

Use existing data...
    Distributed sources
    Varying protocols
    Varying formats
Obtained via personal collaboration

Page 8

Functional breakdown

Functional breakdown for synthesis:
    Data discovery
    Data access
    Data storage
    Data interpretation
    Quality assessment
    Data Conversion & Integration
    Analysis & Modeling
    Visualization

Page 9

Presentation Outline

Integration, Analysis, and Synthesis:
    Challenges

Page 10

Data Heterogeneity

Population survey
Experimental
Taxonomic survey
Behavioral
Meteorological
Oceanographic
Hydrology
Economic
Social (urban ecology)
Paleoecological
Historical
Land use
Demographics
…

Page 11

Types of Heterogeneity

Intensional vs. Arbitrary Heterogeneity

Syntax (format)
    CSV, Fixed ASCII, proprietary binary
Schema (organization)
    Non-normalized models
Semantics (meaning/methods)
    Protocol semantics (e.g., scale)
    Parameter semantics (e.g., bodysize (g))
    Conceptual framework (e.g., experimental trts)
    Taxonomy + nomenclature
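
As a minimal sketch of syntax heterogeneity, the Python below reads the same record from a CSV file and from a fixed-width ASCII file and arrives at one common structure. The record contents, the column names, and the field widths in `FIXED_FIELDS` are assumptions made up for illustration; a real fixed-width layout would have to come from the data set's documentation.

```python
import csv
import io

# The same (hypothetical) record in two syntaxes.
CSV_TEXT = "Date,Site,Species,Count\n10/1/1993,N654,PIRU,26\n"
FIXED_TEXT = "10/1/1993 N654PIRU 26\n"

# Assumed fixed-width layout: (field name, start, end) character offsets.
FIXED_FIELDS = [("Date", 0, 10), ("Site", 10, 14), ("Species", 14, 18), ("Count", 18, 21)]


def read_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_fixed(text, fields):
    """Parse fixed-width text by slicing each line at the given offsets."""
    rows = []
    for line in text.splitlines():
        rows.append({name: line[start:end].strip() for name, start, end in fields})
    return rows


csv_rows = read_csv(CSV_TEXT)
fixed_rows = read_fixed(FIXED_TEXT, FIXED_FIELDS)
print(csv_rows == fixed_rows)
```

Once the syntax is described explicitly, as in `FIXED_FIELDS`, the conversion is mechanical; schema and semantic differences still have to be handled separately.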

Page 12

Data Dispersion

Data are distributed among:
    Independent researcher holdings
    Research station collections
        LTER Network (24 sites)
        Org. of Biological Field Stations (168 sites)
        Univ. Cal Natural Reserve System (36 sites)
        MARINE (62 sites)
        PISCO
    Agency databases
    Museum databases
Access via personal networking
    Not scalable

Page 13

Lack of Metadata

Majority of ecological data undocumented
    Lack information on syntax, schema and semantics of data
    Impossible to understand data without contacting the original researchers
Documentation conventions vary widely
    Requires large time investment to understand each data set

Page 14

Scaling Data Integration

Because of:
    Data heterogeneity
    Data dispersion
    Lack of documentation
Integration and synthesis are limited to a manual process
Thus, difficult to scale integration efforts up to large numbers of data sets

Page 15

Data Integration

A

Date       Site  Species  Area  Count
10/1/1993  N654  PIRU     2     26
10/3/1994  N654  PIRU     2     29
10/1/1993  N654  BEPA     1     3

B

Date       Site  picrub  betpap
31Oct1993  1     13.5    1.6
14Nov1994  1     8.4     1.8

C

Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8
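
The integration of tables A and B into table C can be sketched in Python as below. This is an illustration under stated assumptions, not the KNB implementation: it assumes that density in table A is Count divided by Area, that the columns of table B already hold densities, and that the species codes map to scientific names as in the hypothetical `SPECIES_NAMES` lookup. The function names are invented for this sketch, and the month abbreviations in table B are parsed with `%b`, which assumes an English locale.

```python
from datetime import datetime

# Hypothetical lookup from each source's species codes to scientific names.
SPECIES_NAMES = {
    "PIRU": "Picea rubens", "BEPA": "Betula papyrifera",
    "picrub": "Picea rubens", "betpap": "Betula papyrifera",
}

# Table A: long format, counts of individuals within a sampled area.
TABLE_A = [
    {"Date": "10/1/1993", "Site": "N654", "Species": "PIRU", "Area": 2, "Count": 26},
    {"Date": "10/3/1994", "Site": "N654", "Species": "PIRU", "Area": 2, "Count": 29},
    {"Date": "10/1/1993", "Site": "N654", "Species": "BEPA", "Area": 1, "Count": 3},
]

# Table B: wide format, one density column per species.
TABLE_B = [
    {"Date": "31Oct1993", "Site": "1", "picrub": 13.5, "betpap": 1.6},
    {"Date": "14Nov1994", "Site": "1", "picrub": 8.4, "betpap": 1.8},
]


def normalize_date(text, fmt):
    """Parse a date in the source's format and emit month/day/year."""
    d = datetime.strptime(text, fmt)
    return f"{d.month}/{d.day}/{d.year}"


def integrate(table_a, table_b):
    """Return rows in the common schema: Date, Site, Species, Density."""
    rows = []
    for r in table_a:
        rows.append({
            "Date": normalize_date(r["Date"], "%m/%d/%Y"),
            "Site": r["Site"],
            "Species": SPECIES_NAMES[r["Species"]],
            "Density": r["Count"] / r["Area"],  # assumed: density = count / area
        })
    for r in table_b:
        date = normalize_date(r["Date"], "%d%b%Y")
        for code in ("picrub", "betpap"):  # unpivot the species columns
            rows.append({
                "Date": date,
                "Site": r["Site"],
                "Species": SPECIES_NAMES[code],
                "Density": r[code],
            })
    return rows


table_c = integrate(TABLE_A, TABLE_B)
for row in table_c:
    print(row["Date"], row["Site"], row["Species"], row["Density"])
```

Every step here depends on information that is not in the data files themselves: the date formats, the meaning of the species codes, and the fact that one table holds counts per area while the other holds densities.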

Page 16

Presentation Outline

Integration, Analysis, and Synthesis:
    Challenges
    Current work
        Knowledge Network for Biocomplexity
        Partnership for Biodiversity Informatics

Page 17

Knowledge Network for Biocomplexity (KNB)

National network for biocomplexity data
    Data discovery
    Data access
    Data interpretation
Enable advanced services
    Data integration
    Analysis framework
    Hypothesis modeling
    Visualization

Page 18

Central Role of Metadata

What metadata?
    Ownership, attribution, structure, contents, methods, quality, etc.
Critical for addressing data heterogeneity issues
Critical for developing extensible systems
Critical for long-term data preservation
Allows advanced services to be built

Page 19

KNB Components

Ecological Metadata Language (EML)
Morpho -- data management for ecologists
    Cross platform Java application
Metacat -- flexible metadata & data system
Analysis and Modeling engine
Data integration engine
Semantic Query Processor
Hypothesis Modeling Engine

Page 20

Ecological Metadata Language

XML syntax for representing metadata

Extensible – can add new metadata

Modular – can subset metadata for specific applications

Page 21

EML 2.0beta3 modules

eml-resource -- Basic resource info
eml-dataset -- Data set info
eml-literature -- Citation info
eml-software -- Software info
eml-party -- People and Organizations
eml-entity -- Data entity (table) info
eml-attribute -- Attribute (variable) info
eml-constraint -- Integrity constraints
eml-physical -- Physical format info
eml-access -- Access control
eml-distribution -- Distribution info
eml-project -- Research project info
eml-coverage -- Geographic, temporal and taxonomic coverage
eml-protocol -- Methods and QA/QC
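
To illustrate what modular metadata makes possible, the Python sketch below keeps only the parts of an XML metadata document that a given application needs. The element names (`metadata`, `dataset`, `party`, `attribute`) and the document contents are placeholders invented for this example; they are not the EML schema, and `subset_metadata` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Placeholder metadata document; element names are NOT taken from EML.
METADATA_XML = """
<metadata>
  <dataset><title>Tree density survey</title></dataset>
  <party><name>Example Researcher</name></party>
  <attribute><name>Density</name><unit>number per square meter</unit></attribute>
</metadata>
"""


def subset_metadata(xml_text, wanted):
    """Return a new root element holding only the wanted top-level modules."""
    root = ET.fromstring(xml_text.strip())
    subset = ET.Element(root.tag)
    for child in root:
        if child.tag in wanted:
            subset.append(child)
    return subset


# A data-integration tool might need only the data set and attribute info.
subset = subset_metadata(METADATA_XML, {"dataset", "attribute"})
print(ET.tostring(subset, encoding="unicode"))
```

Because each module is a separate unit, a tool can ignore the modules it does not understand, which is also what lets new metadata be added without breaking existing applications.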

Page 23

Metacat metadata system

[Figure: Metacat network diagram. Labels: LTER Metacat, NCEAS Metacat, SDSC Metacat, NRS Metacat, SEV Metacat; AND, SEV, CAP, OBFS; XML wrapper; Morpho clients; Web clients. Key: Metacat Catalog, Site metadata system.]

Page 24

Metacat architecture

[Figure: Metacat Server architecture diagram. Components:
    HTTP Server (Apache)
    Java Servlet Engine (Tomcat)
    Metacat Servlet (Dispatcher)
    Storage Subsystem
    Query Subsystem
    Replication Subsystem
    Validation Subsystem
    Transformation Subsystem
    Authentication Interface
    LDAP Adapter
    LDAP
    Data Storage Interface
    JDBC API
    RDBMS (Oracle)
    FS Adapter
    File System]

Page 25

Metacat web interface

Page 26

UC Natural Reserve System
OBFS Network
LTER Network

Page 27

Functional breakdown

Functional breakdown for synthesis:
    Data discovery
    Data access
    Data storage
    Data interpretation
    Quality assessment
    Data Conversion & Integration
    Analysis & Modeling
    Visualization

Page 28

Quality Assessment system

[Figure: Semantic Metadata + Researcher Decisions + Data, leading to a Quality Assessment Report]

Page 29

Quality Assessment

Integrity constraint checking
Data type checking
Metadata completeness
Data entry errors
Outlier detection
Check assertions about data
    e.g., trees don’t shrink
    e.g., sea urchins do
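
Checking an assertion such as "trees don't shrink" can be sketched as follows. This is a hedged illustration only: the record layout, the `NEVER_SHRINKS` rule table, the function name, and the sample measurements are all assumptions invented for the example, not part of any KNB tool.

```python
# Hypothetical rule table: organism types asserted never to decrease in size.
NEVER_SHRINKS = {"tree": True, "sea urchin": False}

# Invented sample records: (organism type, individual id, year, size).
RECORDS = [
    ("tree", "T1", 1993, 30.2),
    ("tree", "T1", 1994, 30.9),
    ("tree", "T2", 1993, 41.0),
    ("tree", "T2", 1994, 14.0),        # suspicious: possibly transposed digits
    ("sea urchin", "U1", 1993, 6.1),
    ("sea urchin", "U1", 1994, 5.4),   # allowed: no assertion for sea urchins
]


def find_shrinkage(records):
    """Return (individual, year0, size0, year1, size1) for each violation."""
    series = {}
    for organism, individual, year, size in records:
        # Only collect measurements for organisms covered by the assertion.
        if NEVER_SHRINKS.get(organism, False):
            series.setdefault(individual, []).append((year, size))
    violations = []
    for individual, points in series.items():
        points.sort()  # chronological order
        for (y0, s0), (y1, s1) in zip(points, points[1:]):
            if s1 < s0:
                violations.append((individual, y0, s0, y1, s1))
    return violations


violations = find_shrinkage(RECORDS)
print(violations)
```

The check is only meaningful because the rule is tied to the kind of organism being measured, which is the sort of knowledge that has to come from metadata rather than from the numbers alone.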

Page 30

Data Integration

[Figure: Semantic Metadata + Researcher Decisions + Data, leading to an Integrated Data Set]

Integrated Data Set

Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8

Page 31

Data Integration

A

Date       Site  Species  Area  Count
10/1/1993  N654  PIRU     2     26
10/3/1994  N654  PIRU     2     29
10/1/1993  N654  BEPA     1     3

B

Date       Site  picrub  betpap
31Oct1993  1     13.5    1.6
14Nov1994  1     8.4     1.8

C

Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8

[Figure: bar chart with one bar per row of table C; x-axis: species (Picea rubens, Betula papyrifera); y-axis: Density (#/m2), 0 to 16]

Page 32

Scaling Analysis and Modeling

Data and Metadata Input (from Morpho/Metacat)
Analysis + Model Metadata
    Inputs
    Outputs
    Processing
Execution engine (plugins)
    SAS
    R
    Matlab
    Simulation models
    ...
Output

Page 33

Scaling Analysis and Modeling

[Figure: Execution Engine flow diagram. Components:
    Data and Metadata Input
    Configuration for Analysis and Models
        DDL Specification (inputs and DDL code)
        Procedural Specification (inputs and proc code)
        Input Map Specification (test inputs mapped to metadata/data fields)
    Input Map Parser
    Test Specification Parser
    Script with unresolved variables
    Script with symbolically resolved variables
    Script/Metadata/Data Validation and Conflict Resolution
    User or ontological input for conflict resolution
    Data/Metadata Input Facilitator and Parser
    Data Package (metadata with data file)
    Script with some fully resolved variables
    Fully resolved final script
    Script Executor
    Analytical Engine Plugin
    Output Stream from Analytical Engine
    Output Renderer
    Output Config File
    Output (HTML, XML, Text, etc.)]
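
The progression from a script with unresolved variables to a fully resolved final script can be sketched with Python's `string.Template`. This is a simplified illustration, not the execution engine itself: the script text uses a made-up command syntax, and the variable names, the `INPUT_MAP` contents, and the file name are hypothetical.

```python
from string import Template

# Script with unresolved variables (made-up command syntax for illustration).
SCRIPT = "LOAD $data_file; MEAN $density_column BY $group_column"

# Hypothetical input map: script inputs mapped to metadata/data fields.
INPUT_MAP = {"density_column": "Density", "group_column": "Species"}

# Hypothetical data package: supplies the concrete data file.
DATA_PACKAGE = {"data_file": "trees.csv"}

# Step 1: apply the input map; variables it does not cover stay unresolved.
partially_resolved = Template(SCRIPT).safe_substitute(INPUT_MAP)

# Step 2: apply the data package; substitute() raises KeyError if any
# variable is still unresolved, which serves as a simple validation step.
final_script = Template(partially_resolved).substitute(DATA_PACKAGE)

print(partially_resolved)
print(final_script)
```

Keeping the analysis script separate from the input map means the same script can be reused on another data set by changing only the mapping.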

Page 35

Semantic metadata

Describes the relationship between measurements and ecologically relevant concepts
Drawn from a controlled vocabulary
    Ontology for ecological measurements

Page 36

Ecological Ontologies

[Figure: ontology diagram relating the concepts below through "isa" and "has" links]

Concepts:
    Biodiversity
    Species
    Taxon
    Organism
    Species Evenness (J')
    Shannon Diversity (H')
    Species Count (S)
    Abundance (N)
    Abundance of Species i (N_i)
    Sampling Area (A)
    Proportional Abundance of Species i (p_i)

Definitions:
    $H' = -\sum_{i=1}^{S} p_i \ln p_i$
    $J' = H' / \ln S$
    $p_i = N_i / N$
    $N = \sum_{i=1}^{S} N_i$
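
The standard definitions of Shannon diversity and evenness, H' = -sum(p_i ln p_i) and J' = H' / ln S with p_i = N_i / N, can be computed as in the Python sketch below. The function names and the sample abundances are illustrative assumptions; species with zero abundance are skipped, and evenness is treated as undefined when fewer than two species are present.

```python
import math


def shannon_diversity(abundances):
    """H' = -sum(p_i * ln(p_i)), with p_i = N_i / N over species present."""
    counts = [n for n in abundances if n > 0]
    total = sum(counts)  # N = sum of N_i
    return -sum((n / total) * math.log(n / total) for n in counts)


def species_evenness(abundances):
    """J' = H' / ln(S), where S is the number of species present."""
    s = sum(1 for n in abundances if n > 0)
    if s < 2:
        raise ValueError("evenness is undefined for fewer than two species")
    return shannon_diversity(abundances) / math.log(s)


# Hypothetical abundances N_i for four equally common species.
even_community = [10, 10, 10, 10]
print(shannon_diversity(even_community))  # equals ln(4) for equal abundances
print(species_evenness(even_community))   # equals 1 for equal abundances
```

An ontology that records how S, N_i, and p_i relate to H' and J' is what would let a system derive these indices automatically from any data set whose columns are tagged with those concepts.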

Page 37

What drives synthesis

Science questions
Hypotheses
Analyses + Models
Integrated Data
Original Data

Page 38

Conclusions

Barriers to integration can be addressed using structured metadata

Can accomplish a lot with ‘just’ mechanical transformations

Domain ontologies + semantic mediation are paths to scaling integration

Analysis drives all other phases of integration