Prof. Yannis Ioannidis “Athena” Research Center & University of Athens

Big data in the research life cycle: technologies, infrastructures, policies

Page 1: Big data in the research life cycle: technologies, infrastructures, policies

Prof. Yannis Ioannidis

“Athena” Research Center & University of Athens

Page 2: Big data in the research life cycle: technologies, infrastructures, policies

[Figure labels: BioMed; Oceans; Space & Earth; Culture; Environment; OA Policies; Data Proc; OpenMinTeD]

Page 3: Big data in the research life cycle: technologies, infrastructures, policies

MaDgIK Systems

[Figure labels: EXAREME; MaDIS; GRAPHOS; PAROS; CHESS; Optique; AITION/TopMod; KDD/ML; MDP; OpenAIRE; DCV ML; ResAnal; HBP; Capsella; W-Dance; O-MinTeD; STE; G-kak^3; BB; EarthSrvr; V-Exhibit; EFG1914; Fut-TDM; OpenUP; WDAqua; RDA; StR-ESFRI]

Page 4: Big data in the research life cycle: technologies, infrastructures, policies

Data provision Layer: Extract, Transform, Load (ETL), anonymization & pre-processing of existing resources
• Existing resources: Imaging, Video; Streaming Data; Un/Semi/Structured Biomedical Data; Legacy Data; Simulation Models; Digital Libraries (PubMed etc.); Ontologies (UMLS, GO..); Clinician knowledge

Federated data Layer & (open) research data infrastructures: Semantic Data modelling, Provenance & Integration
• Multi-modal, vertically integrated, distributed biomedical data: Biomedical Info; Registries & Metadata; Simulation Models; KDD Results

Middleware Layer: Distributed execution of complex dataflows & distributed querying engine
• Upper-level declarative language and extensible UDFs
• Distributed execution on clouds and ad-hoc clusters
• Distributed Query Engine
• Ontology Based Data Access

Application Layer: Data (pre)processing and knowledge discovery platform
• MADRefine module: Data Preprocessing & Transformation; Curation & Validation
• AITION clustering & general KDD: SoA Machine Learning Algorithms; Latent Variable & Topic Modelling
• AITION simulation: Graphical Probabilistic modelling for statistical simulation

Data Processing
• Distribution, Federation, Parallelism
• EXAREME

Data Analytics
• Cleaning & curation
• MADRefine
• Modeling, Mining
• AITION

Data Infrastructures
• ESFRI Infrastructures
• ICOS, EMSO
• E-Infrastructures
• OpenAIRE

WHAT, WHERE, HOW, WHY

Page 5: Big data in the research life cycle: technologies, infrastructures, policies

OpenAIRE HUB

CERN zenodo

• Visualize - Manage Enhanced Publications
• Get support (NOADs)
• Linked Content
• Statistics
• Search & Browse
• Curate & collaborate
• Deposit Publications & data
• Research impact: Citations, usage statistics

• Link; Classify; De-duplicate; Cite; Text Mine; APIs

• Publication repositories (Institutional & Thematic); Open Access Journals
• 17,500,000 OA publications; 700+ validated repositories; accessing >5K repos/OA journals
• Data repositories; Data Journals
• ResearchID (ORCID, ..); OpenDOAR; CRIS Systems; National funding; EC funding; Usage data

• Metadata on publications; Metadata on data

• Guidelines for Data Providers & Open Data Pilot
• Guidelines for Funding Info
• Guidelines for Publications

Page 6: Big data in the research life cycle: technologies, infrastructures, policies

ICOS

LIFEWATCH

EMSO

SIOS

EURO-ARGO

IAGOS

EPOS

EISCAT

COPAL

ACTRIS

DANUBIUS-RI

Page 7: Big data in the research life cycle: technologies, infrastructures, policies

ICOS: Integrated Carbon Observation System

Harmonized and High Precision Scientific Data on Carbon Cycle And Greenhouse Gas Budget and Perturbations

EMSO: European Multi-disciplinary Seafloor and water-column Observatory

Ocean observation systems for long-term, high-resolution, (near) real-time monitoring of environmental processes including natural hazards, climate change, and marine ecosystems

Page 8: Big data in the research life cycle: technologies, infrastructures, policies

SIOS: Svalbard Integrated Earth Observing System

Arctic environmental and climate-related challenges

EURO-ARGO: European contribution to ARGO

Ocean observation for oceanography and climate

IAGOS: In-service Aircraft for a Global Observing System

Atmospheric composition, aerosol and cloud particles

Page 9: Big data in the research life cycle: technologies, infrastructures, policies

EISCAT_3D: European Incoherent Scatter

Radar systems for the upper atmosphere, the ionosphere and the Aurora Borealis

EUFAR-COPAL: European Facility for Airborne Research

Airborne research for the environmental and geo sciences in Europe

Page 10: Big data in the research life cycle: technologies, infrastructures, policies

ACTRIS: Aerosols, Clouds and Trace gases RI

Supports models and forecast systems by offering high-quality data on atmospheric gases, clouds, and trace gases

DANUBIUS-RI: Int’l Center for Advanced Studies on River-Sea Systems

Addressing conflicts between society’s demands, environmental change and environmental protection in river–sea systems worldwide.

Page 11: Big data in the research life cycle: technologies, infrastructures, policies

[Architecture diagram repeated from Page 4; here ELIXIR is the infrastructure listed under ESFRI Infrastructures.]

Page 12: Big data in the research life cycle: technologies, infrastructures, policies

[Architecture figure labels: Gateway; Master; four Workers; Execution Engine; Optimization Engine; Data Connector; Fast Local Net; P2P Net]

Page 13: Big data in the research life cycle: technologies, infrastructures, policies

Parallel / distributed execution of complex data flows targeting data analysis and mining

Data remain at source (hospital) – dataflow / query travels

Privacy preserving: transmit only aggregated information from hospital (sufficient statistics)

Advanced data compression, on the data partitioning

Query Language: SQL + UDFs (in Python)
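
As a rough illustration of the "SQL + UDFs (in Python)" idea, the sketch below registers a Python function as a SQL function and returns only an aggregate. It uses Python's built-in `sqlite3` module as a stand-in and is not EXAREME's actual API; the `patients` table, its columns, and the `bmi` function are hypothetical.

```python
import sqlite3

def bmi(weight_kg, height_m):
    """Hypothetical UDF: body-mass index from weight (kg) and height (m)."""
    if weight_kg is None or not height_m:
        return None
    return weight_kg / (height_m ** 2)

# In-memory database standing in for one data source (assumption).
conn = sqlite3.connect(":memory:")
conn.create_function("bmi", 2, bmi)  # expose the Python function to SQL
conn.execute("CREATE TABLE patients (id INTEGER, weight_kg REAL, height_m REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 70.0, 1.75), (2, 90.0, 1.80), (3, 55.0, 1.60)],  # made-up rows
)

# Only aggregated values are returned, not individual rows.
n_patients, mean_bmi = conn.execute(
    "SELECT COUNT(*), AVG(bmi(weight_kg, height_m)) FROM patients"
).fetchone()
print(n_patients, round(mean_bmi, 2))
```

The query mixes a built-in aggregate with user code, which is the combination the slide names; how a real engine ships such a query to each site is outside this sketch.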

Page 14: Big data in the research life cycle: technologies, infrastructures, policies

Query Federation

• Decompose query into local and global parts
◦ Query: “count, avg, std”
◦ L (local): “Σx, Σx², N”
◦ G (global): “N, avg, std”

• Run local queries at each site (1 ... N) over tables (id, m-name, m-value)
◦ Partial aggregated results: (m-name, Σx, Σx², N)

• Run global queries over the partial aggregated results
◦ Result: (m-name, N, avg, std)
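
The local/global split on this slide can be sketched in a few lines of Python. Each site returns only Σx, Σx² and N, and the global step combines them into N, avg and std. The function names and sample values are hypothetical, and the population form of the standard deviation is an assumption, since the slide does not say which form is used.

```python
import math

def local_query(values):
    """Runs at one site: return only the sufficient statistics (sum x, sum x^2, N)."""
    return (sum(values), sum(v * v for v in values), len(values))

def global_query(partials):
    """Combine per-site partial aggregates into (N, avg, std)."""
    sx = sum(p[0] for p in partials)
    sx2 = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    avg = sx / n
    # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
    var = max(sx2 / n - avg * avg, 0.0)
    return n, avg, math.sqrt(var)

site_1 = [1.0, 2.0, 3.0]  # made-up measurements held at site 1
site_2 = [4.0, 5.0]       # made-up measurements held at site 2
n, avg, std = global_query([local_query(site_1), local_query(site_2)])
print(n, avg, std)
```

Raw values never leave a site in this sketch; only three numbers per site do, which is the "sufficient statistics" point made on Page 13.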

Page 15: Big data in the research life cycle: technologies, infrastructures, policies

• Distributed elastic execution

– Parallel aggregations, unions, and joins

– Resources are reserved dynamically

• Iterative dataflow execution

– Support machine learning algorithms

• Novel query optimization techniques

– SQL with User Defined Functions

– Arbitrary user code with unknown properties

– Privacy-aware query optimization

Page 16: Big data in the research life cycle: technologies, infrastructures, policies

• Time and money

• 2-dimensional optimization

Quantum: 1 hour

• Simple map-reduce flow

– A: 1 hour B: 10 minutes C: 1 hour

Schedule                 Time (hours)   Money (resource hours)   Winner
One host for all ops     18.60          19                       5x cheaper
Different host per op    2.16           102                      9x faster

Page 17: Big data in the research life cycle: technologies, infrastructures, policies

• Optimal dataflow scheduling

• Skyline of all Pareto optimal plans

[Figure: skyline plot; axes: Time and Money]
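
A skyline of Pareto-optimal plans keeps every plan that no other plan beats on both time and money. The sketch below is one simple way to compute it and is not the scheduler's actual algorithm. The first two plans use the figures from Page 16; the other two are hypothetical additions for illustration.

```python
def skyline(plans):
    """Return the plans not dominated in (time, money); lower is better on both.

    Plans are (name, time_hours, money_resource_hours) tuples. Exact
    duplicates of a kept plan are dropped.
    """
    front = []
    best_money = float("inf")
    # Scan by increasing time; a plan survives only if it is strictly cheaper
    # than every plan that is at least as fast.
    for plan in sorted(plans, key=lambda p: (p[1], p[2])):
        if plan[2] < best_money:
            front.append(plan)
            best_money = plan[2]
    return front

plans = [
    ("one host for all ops", 18.60, 19),
    ("different host per op", 2.16, 102),
    ("hypothetical middle plan", 6.0, 40),     # assumption, not from the slides
    ("hypothetical dominated plan", 20.0, 110),  # assumption, not from the slides
]
front = skyline(plans)
for name, time_h, money in front:
    print(name, time_h, money)
```

Neither of the two schedules from Page 16 dominates the other, so both stay on the skyline; the choice between them depends on how the user trades time against money.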

Page 18: Big data in the research life cycle: technologies, infrastructures, policies

[Architecture diagram repeated from Page 4; here the Middleware Layer is labelled EXAREME and ELIXIR is the infrastructure listed under ESFRI Infrastructures.]

Page 19: Big data in the research life cycle: technologies, infrastructures, policies

[Workflow diagram labels]

• Stages: Data Preprocessing; Data Mining; Knowledge Discovery; Model & Verification; Personalized Model Guided Medicine; Reasoning & decision support
• Data Preprocessing: Raw data from biomarker based personalized acquisition; Curation & Validation; Transformed & Validated Data
• Knowledge Discovery: Disease signatures; Patient grouping & similarity; Variable dependencies & causality
• Model & Verification: Create Statistical Simulation Models; Simulation Models; Domain knowledge & assumptions
• For a particular patient: Unknown / missing data; Predict value of missing variable
• Clinical workflows: Individualized diagnosis, prognosis & treatment plan
• Directions: BOTTOM-UP; TOP-DOWN

Big Data Analytics
• Capture
◦ multi source, multi modal, multi system

Management
• Data provenance
• Sanitization (Anonymization)
• Process
◦ aggregate, distributed

Analysis
• Privacy preserving
◦ Algorithms, Mechanisms

Modeling
• Personalized
• De-identified

Practice
• Ethics
• Privacy

Page 20: Big data in the research life cycle: technologies, infrastructures, policies

[Figure: variable groups]

• Group labels: Predictors; Medication; Outcome
• Category labels: demographics; imaging; genetics; clinical; lab; Synovial volume; OTHER
• Variables: SEX; AgeOnSet; ILAR; JntActDis; GlbActDis; DisDur; JntLOM; GenEval; CHAQ; ESR; CRP; ANA; MEFN; IL2RA; Poznanski; NSAID; STEROID; DMARD; BIOLOGIC; JADI
• Difference variables: JntLOMDiff; CHAQDiff; ESRDiff; CRPDiff; JntActDisDiff; GlbActDisDiff; GenEvalDiff
• Outcome variables: BOXValidatedOut; Adapted Sharp/van der Heijde Score Out; JADIOut; Extended BOX

Page 21: Big data in the research life cycle: technologies, infrastructures, policies

[Workflow diagram repeated from Page 19, without the Big Data Analytics column.]

Page 22: Big data in the research life cycle: technologies, infrastructures, policies

Extensible validation and data transformation engine

Interactive and efficient web-based interface

Data cleaning:

◦ Typographical error detection (numeric & alphanumeric)

◦ Data cleaning rules (functional dependencies, conditional functional dependencies, denial constraints)

◦ New/derived columns (discretization, computation of medical scores)

◦ Data visualisation (bar charts, pie charts, scatter plots, line charts, etc.)

End-to-end data analysis workflow support (rerun experiments, reproduce results)
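
To make the "functional dependencies" cleaning rule concrete, the sketch below flags violations of a dependency such as zip → city: any zip value that appears with more than one city. It is a generic illustration, not the engine's own interface; the function name, column names and records are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one rhs value (violations of lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        seen[key].add(tuple(row[c] for c in rhs))
    return {key: values for key, values in seen.items() if len(values) > 1}

records = [  # made-up records; the second row carries a spelling variant
    {"zip": "10431", "city": "Athens"},
    {"zip": "10431", "city": "Athina"},
    {"zip": "54621", "city": "Thessaloniki"},
]
violations = fd_violations(records, lhs=["zip"], rhs=["city"])
print(violations)
```

A conditional functional dependency would apply the same check only to rows matching a condition (for example, a given country), which amounts to filtering `rows` before the call.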

Page 23: Big data in the research life cycle: technologies, infrastructures, policies
Page 24: Big data in the research life cycle: technologies, infrastructures, policies

[Workflow diagram repeated from Page 19, without the Big Data Analytics column.]

Page 25: Big data in the research life cycle: technologies, infrastructures, policies

Disease signatures: Latent factors (patterns) that characterize disease

◦ Distribution of most relevant variables for disease (e.g., biomarkers)

◦ Multiple variables per signature, signatures per disease

Patient Cluster: Homogeneous patient group with common characteristics

Patient Similarity: Patients “like” me or mine (patient or clinician role)

◦ “like” = according to different criteria (e.g., allocation on disease signatures)

Page 26: Big data in the research life cycle: technologies, infrastructures, policies
Page 27: Big data in the research life cycle: technologies, infrastructures, policies

Similarity & Graph clustering

Topics & allocations

Modelling

Page 28: Big data in the research life cycle: technologies, infrastructures, policies

[Workflow diagram repeated from Page 19, without the Big Data Analytics column.]

Page 29: Big data in the research life cycle: technologies, infrastructures, policies

Bayesian Net: Directed Acyclic Graph + Conditional Prob Distributions

◦ Features (Nodes) & Dependencies (Edges)

◦ Compact representation of joint data distribution

Given: patient data + Domain Knowledge

Patient   X1  X2  X3  X4  X5  X6  X7  X8
1         Y   N   N   Y   Y   Y   N   Y
:
1000      N   N   Y   N   N   Y   N   N

Find: graph over nodes X1 to X8 (figure labels: Smoking; Lung cancer; Chronic bronchitis; Genetic Factor; Allergy)
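
The "compact representation of joint data distribution" point can be shown with a tiny network. The sketch assumes a structure Smoking → Lung cancer and Smoking → Chronic bronchitis; both the edges and all probabilities are hypothetical, chosen only for illustration, and are not read off the slide's figure.

```python
from itertools import product

# Hypothetical conditional probability tables (not from the slides).
p_smoking = 0.3
p_cancer_given_smoking = {True: 0.10, False: 0.01}
p_bronchitis_given_smoking = {True: 0.40, False: 0.05}

def bern(p, value):
    """Probability of a boolean value under a Bernoulli(p) distribution."""
    return p if value else 1.0 - p

def joint(smoking, cancer, bronchitis):
    """Joint probability as the product of each node given its parents."""
    return (
        bern(p_smoking, smoking)
        * bern(p_cancer_given_smoking[smoking], cancer)
        * bern(p_bronchitis_given_smoking[smoking], bronchitis)
    )

# The factorized joint sums to 1 over all 2^3 assignments.
total = sum(joint(s, c, b) for s, c, b in product([True, False], repeat=3))

# Inference by enumeration: P(smoking | lung cancer).
numerator = sum(joint(True, True, b) for b in (True, False))
denominator = sum(joint(s, True, b) for s in (True, False) for b in (True, False))
p_smoking_given_cancer = numerator / denominator
print(total, p_smoking_given_cancer)
```

This assumed network needs 5 numbers (1 + 2 + 2) instead of the 7 free parameters of a full joint table over three binary variables; the saving grows quickly with the number of variables when each node has few parents.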

Page 30: Big data in the research life cycle: technologies, infrastructures, policies

Final DAG (based on MCMC&DP, threshold=0.5)

[Figure: DAG and dependency plot over the variables Age, ParCHD, Procedures, ExIntoler, Cyanosis, CPBP, CPArrhy, CPConcl, CPTermRsn, BSA, TPVRegurg, TriRegurg, RVD, RedRV, PSMotion, RestrPatt, AVBlock, SupravArrhy, VentricArrhy; scale from 0 to 1 in steps of 0.1]

Modelling Dependency Analysis

Inference

Page 31: Big data in the research life cycle: technologies, infrastructures, policies

[Workflow diagram repeated from Page 19, without the Big Data Analytics column.]

Page 32: Big data in the research life cycle: technologies, infrastructures, policies
Page 33: Big data in the research life cycle: technologies, infrastructures, policies

Increased RVD is associated with worse values in every MR aspect (TPVRegurg, PSMotion, RedRV, AVBlock, TriRegurg)

Page 34: Big data in the research life cycle: technologies, infrastructures, policies

Brussels – 6-7 May 2014

Page 35: Big data in the research life cycle: technologies, infrastructures, policies

MyHealthMyData

Page 36: Big data in the research life cycle: technologies, infrastructures, policies

[Figure: data access tiers]

• Data: Raw Personal Data; Raw Anonymised; Summary Anonymised
• Access: Private; Controlled Access; Public
• Bioinformatics services for: All Users; Doctors (and Patients?); Researchers

Page 37: Big data in the research life cycle: technologies, infrastructures, policies

Obtaining consent not straightforward

Anonymisation: necessary, rather complicated, ensuring neither privacy nor data value

“Blending in a crowd” and k-anonymity: privacy is property not output of sanitization

How do we define privacy?
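
For reference, k-anonymity itself is easy to measure: k is the size of the smallest group of records that share the same quasi-identifier values. The sketch below computes it on made-up records; the column names and the choice of quasi-identifiers are assumptions.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

released = [  # made-up, already generalised records
    {"age_band": "30-39", "sex": "F", "diagnosis": "A"},
    {"age_band": "30-39", "sex": "F", "diagnosis": "B"},
    {"age_band": "40-49", "sex": "M", "diagnosis": "A"},
    {"age_band": "40-49", "sex": "M", "diagnosis": "A"},
]
k = k_anonymity(released, ["age_band", "sex"])
print(k)
```

The table is 2-anonymous, yet both records in the second group carry the same diagnosis, so knowing someone is in that group reveals it. This is one concrete way in which anonymisation alone does not ensure privacy.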

Page 38: Big data in the research life cycle: technologies, infrastructures, policies

data publishing: “Sanitization” (Anonymisation) hiding individual info (k-anonymity) but preserving (sufficient) aggregated statistics

data mining: Specific algorithms (usually operating in two phases) for classification, clustering, association rules, …

mechanisms: Differential Privacy & Crowd-Blending Privacy perturb data or add noise ensuring ε-indistinguishable output distribution

encryption: Fully Homomorphic Encryption (FHE) for computation and query to run over encrypted data

decentralization: Blockchain to Protect Personal Data - decentralized personal data management, users own and control their data
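
As one concrete instance of the "add noise" mechanisms above, the sketch below applies the standard Laplace mechanism to a count query: noise with scale sensitivity/ε is added to the true answer. The function names, the count of 100 and ε = 0.5 are hypothetical, and the sampling is a plain illustration rather than a hardened implementation (production systems also need care with floating-point details).

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: a count changes by at most `sensitivity` per individual."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [private_count(100, epsilon=0.5, rng=rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
print(round(mean_estimate, 2))
```

A smaller ε means a larger noise scale and stronger privacy at the cost of accuracy. Each single answer is noisy, while the average over many independent answers approaches the true count, which is why repeated queries consume the privacy budget.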

Page 39: Big data in the research life cycle: technologies, infrastructures, policies

Big data is not only about size

Data is distributed, data is heterogeneous

Processing goes to data, not data to processing

ICT (Data management & processing) advances

◦ Data compression

◦ Federated / privacy-preserving processing

◦ Scalable parallel / distributed processing

◦ Data curation (otherwise: garbage in, garbage out)

◦ Text and data analytics