DH03 – Educating for the Future:
Data Engineering Techniques
https://education.phuse.eu/eftf/data-engineering/
Contribution:
Andy Richardson, Amy Gillespie, Beate Hientzsch, Berber Snoeijer, Beverly Hayes, Jatin Patel, Karnika Dalal, Mark Bynens, Mike Carniello, Parag Shiralkar, Paul Slagle, Ralf Goetzelmann, Rohit Banga, Sagar Jain, Susan Olson, Vijay Pasapula, Vince Marinelli, Xiaohui Wang
Special thanks: Ian Fleming, James McDermott, Wendy Dobson
Presentation:
Guy Garrett, Achieve Intelligence
Renu Shukla, Janssen Research & Development
Mohit Juneja, LyfeScience
Agenda
• Introduction
• Master Data Management - Principle
• ETL Overview – “Transforming”
• Data Lake – “Trawling”
• Data Marketplace – “Targeting”
• Conclusion
Data Engineering Landscape
[Diagram labels: sources ePro, RWE, IoT, Bio-markers, eDiaries, HEcon, Wearables, Omics, RWD, Profiles, EHR, CrowdSource, eCRF, Lab; Data Pipeline; Data Lake; Centralised Data Hub; Data Marketplace]
Master Data Management
[Diagram labels: Study, Study, Study; Data Capture & Site Management; Data Management & Quality Control; Analysis & Reporting; eCRF; DM Scripts; A&R Scripts]
[Diagram labels: eCRF, Lab, Others; Automated Data Ingestion Routines; Centralised Data Hub]
* F.A.I.R. = Findability, Accessibility, Interoperability, Reusability
Master Data Management
• Principle
Approach | Investment | Return on Investment
Master Data Management | More up-front investment | Longer-term savings
Siloed Data Processing | Less up-front investment | Most costly long-term
The Data Engineering Project is looking for use cases to evidence this principle:
1) Measure current processes
2) Adopt MDM
3) Measure MDM processes
4) Analyse ROI
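The ROI comparison in step 4 could be sketched roughly as follows. This is only an illustration: every figure is a hypothetical placeholder (not a measurement from the project), and the function names are invented for the example.

```python
import math


def cumulative_cost(upfront: float, per_study: float, studies: int) -> float:
    """Total cost after a number of studies: one-off investment plus running cost."""
    return upfront + per_study * studies


def break_even_studies(extra_upfront: float, saving_per_study: float) -> int:
    """Smallest number of studies at which the extra up-front spend is recovered."""
    if saving_per_study <= 0:
        raise ValueError("no per-study saving, so the investment is never recovered")
    return math.ceil(extra_upfront / saving_per_study)


# Hypothetical cost units, for illustration only.
SILOED = {"upfront": 10, "per_study": 50}
MDM = {"upfront": 100, "per_study": 30}

extra_upfront = MDM["upfront"] - SILOED["upfront"]          # 90
saving_per_study = SILOED["per_study"] - MDM["per_study"]   # 20
studies_needed = break_even_studies(extra_upfront, saving_per_study)

siloed_total = cumulative_cost(SILOED["upfront"], SILOED["per_study"], studies_needed)
mdm_total = cumulative_cost(MDM["upfront"], MDM["per_study"], studies_needed)
```

With these made-up inputs the MDM approach becomes cheaper from the fifth study onwards; real use cases would replace the placeholders with the measurements from steps 1 and 3.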
Agenda
• Introduction
• Master Data Management - Principle
• ETL Overview – “Transforming”
  – Definition
  – Staging & Loading
  – Change Data Capture
  – Data Validation
  – Scheduling & Batch Processing
• Data Lake – “Trawling”
• Data Marketplace – “Targeting”
• Conclusion
ETL Overview – “Transforming”
• Wikipedia definition
“In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s).”
“…data engineering, is a software engineering approach to designing and developing information systems.”
i.e. it’s about building data pipelines, rather than datasets.
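To make the "pipeline rather than dataset" idea concrete, here is a minimal sketch of an extract, transform, load run. It is an assumption-laden illustration, not the project's implementation: the CSV content, the `vitals` table and all column names are hypothetical, and an in-memory SQLite database stands in for the destination system.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a database query.
RAW_CSV = """subject_id,visit_date,weight_kg
001,2020-01-15,72.5
002,2020-01-16,80.1
"""


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types into the destination's representation."""
    return [(r["subject_id"], r["visit_date"], float(r["weight_kg"])) for r in rows]


def load(conn, records):
    """Load: write the transformed records to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals (subject_id TEXT, visit_date TEXT, weight_kg REAL)"
    )
    conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, text):
    """The pipeline is the reusable unit; the dataset is just what it produces."""
    load(conn, transform(extract(text)))


conn = sqlite3.connect(":memory:")
run_pipeline(conn, RAW_CSV)
```

The same `run_pipeline` function can be re-run whenever new source data arrives, which is the practical difference from building a one-off dataset.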
[Diagram labels: Source Data; Data Pipelines; Staging Area; Central Data Hub; Control Tables; Data Validation (Simple); Data Validation (Complex); DV Rules (Simple); DV Rules (Complex); Scheduling Controls]
Staging
Loading
Change Data Capture
Changes: Additions, Updates, Deletions
Methods:
• Database Timestamping (Control Tables)
• Delta Identification (Comparison)
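The two methods could be sketched as below. This is a simplified illustration under assumed data shapes: the `modified_at` field, the record keys and the `dose` values are hypothetical, and the last-run timestamp would normally be read from a control table rather than hard-coded.

```python
from datetime import datetime


def changed_since(rows, last_run):
    """Timestamp method: keep rows modified after the last successful run."""
    return [r for r in rows if r["modified_at"] > last_run]


def delta(previous, current):
    """Comparison method: diff two snapshots keyed by record id."""
    additions = {k: v for k, v in current.items() if k not in previous}
    deletions = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return additions, updates, deletions


# Hypothetical source rows carrying a modification timestamp.
rows = [
    {"id": 1, "modified_at": datetime(2020, 1, 1)},
    {"id": 2, "modified_at": datetime(2020, 1, 3)},
]
last_run = datetime(2020, 1, 2)  # would come from a control table
recent = changed_since(rows, last_run)

# Hypothetical snapshots from two consecutive loads.
previous = {1: {"dose": 10}, 2: {"dose": 20}, 3: {"dose": 30}}
current = {1: {"dose": 10}, 2: {"dose": 25}, 4: {"dose": 40}}
additions, updates, deletions = delta(previous, current)
```

Note that timestamping alone does not reveal rows that were physically deleted at source, whereas a snapshot comparison does, at the cost of reading both snapshots in full.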
Data Validation (At Source)
e.g. EDC Patient DOB must be a valid date.
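A check of this kind might look like the sketch below. It assumes dates arrive as ISO 8601 text (YYYY-MM-DD), which the slides do not specify; the function name is illustrative.

```python
from datetime import date


def is_valid_date(text) -> bool:
    """Return True only if the text parses as a real calendar date."""
    try:
        date.fromisoformat(text)
    except (TypeError, ValueError):
        # TypeError covers missing values (None); ValueError covers bad dates.
        return False
    return True
```

For example, `is_valid_date("1980-02-29")` passes because 1980 was a leap year, while `is_valid_date("1981-02-29")` fails.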
Data Validation (Simple)
e.g. Lab Data Value unlikely to be > nnn
Failed records get diverted for cleansing processes.
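The rule-and-divert pattern can be sketched as follows. The upper limit is passed in as a parameter because the slide leaves the threshold unspecified; the value 500 and the record fields used in the example are purely hypothetical.

```python
def split_by_upper_limit(records, field, upper_limit):
    """Apply one simple rule: values above the limit are diverted, the rest pass."""
    passed, diverted = [], []
    for rec in records:
        if rec[field] > upper_limit:
            diverted.append(rec)  # goes to the cleansing process
        else:
            passed.append(rec)    # continues down the pipeline
    return passed, diverted


# Hypothetical lab records and a hypothetical limit, for illustration only.
lab_records = [
    {"subject_id": "001", "value": 42.0},
    {"subject_id": "002", "value": 9999.0},
]
passed, diverted = split_by_upper_limit(lab_records, "value", upper_limit=500)
```

Keeping the limit outside the function mirrors the idea of holding DV rules in control tables, so rules can change without changing pipeline code.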
Data Validation (Complex)
• Across different sources: this lab value is not consistent with that EDC value (e.g. males can’t be pregnant)
• Summarisation checks
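Both kinds of complex check could be sketched as below. The data shapes, field names and code values ("PREGNANCY", "POSITIVE", "M") are assumptions made for the example, and the summarisation check shown (a row-count reconciliation) is only one possible interpretation.

```python
def cross_source_issues(edc_demographics, lab_results):
    """Flag subjects whose lab result contradicts the sex recorded in EDC."""
    issues = []
    for lab in lab_results:
        sex = edc_demographics.get(lab["subject_id"], {}).get("sex")
        if lab["test"] == "PREGNANCY" and lab["result"] == "POSITIVE" and sex == "M":
            issues.append(lab["subject_id"])
    return issues


def counts_reconcile(source_rows, loaded_rows):
    """Summarisation check: the number of rows loaded matches the source."""
    return len(source_rows) == len(loaded_rows)


# Hypothetical records from two different sources.
edc = {"001": {"sex": "M"}, "002": {"sex": "F"}}
labs = [
    {"subject_id": "001", "test": "PREGNANCY", "result": "POSITIVE"},
    {"subject_id": "002", "test": "PREGNANCY", "result": "POSITIVE"},
]
issues = cross_source_issues(edc, labs)
```

Checks like these need data from more than one source to be available together, which is why they run later in the pipeline than the simple rules.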
Scheduling & Batch Processing
• Frequencies
• Dependencies
• Alerts
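Dependencies and alerts could be handled along the lines of this sketch, which uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to order jobs. The job names and the failing validation step are hypothetical, and frequencies are left out: how often the batch runs would be decided by an external scheduler.

```python
from graphlib import TopologicalSorter


def run_batch(jobs, dependencies, alert):
    """Run jobs in dependency order; on failure, raise an alert and stop.

    jobs: job name -> callable
    dependencies: job name -> set of job names that must finish first
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            jobs[name]()
        except Exception as exc:
            alert(f"{name} failed: {exc}")
            break  # downstream jobs depend on this one, so do not run them
        completed.append(name)
    return completed


def failing_validation():
    raise ValueError("bad data")


# Hypothetical batch: each step depends on the previous one.
jobs = {
    "extract": lambda: None,
    "stage": lambda: None,
    "validate": failing_validation,
    "load": lambda: None,
}
dependencies = {"stage": {"extract"}, "validate": {"stage"}, "load": {"validate"}}
alerts = []
completed = run_batch(jobs, dependencies, alerts.append)
```

Here the alert is just appended to a list; in practice it might be an email or a monitoring event.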
Data Lake – “Trawling”
• Big Data
• Unstructured/Semi-Structured
• Just-in-case: “Disk space is cheap”
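One way to picture the "just-in-case" approach is landing whatever arrives, unchanged, and deciding how to interpret it later. The sketch below assumes a simple folder layout of source and receipt date; the layout, file names and payloads are all hypothetical, and a temporary directory stands in for the lake's storage.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, received: date, name: str, payload: bytes) -> Path:
    """Store a payload exactly as received, under <source>/<date>/<name>."""
    target_dir = lake_root / source / received.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, no schema: keep the raw bytes
    return target


lake = Path(tempfile.mkdtemp())  # stand-in for real lake storage

# Semi-structured and unstructured payloads land side by side.
wearable_path = land_raw(
    lake, "wearables", date(2020, 1, 15), "device42.json",
    json.dumps({"hr": [61, 64]}).encode(),
)
note_path = land_raw(
    lake, "ehr", date(2020, 1, 15), "note.txt", b"free-text clinical note"
)
```

Nothing is validated or reshaped on the way in, which is the trade-off the slide points at: cheap to store now, with the interpretation work deferred.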
Data Marketplace – “Targeting”
• Small Data
• Specific Information
• “As required” via APIs
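In contrast to the lake, a marketplace request returns only the slice that was asked for. The sketch below models an API handler as a plain function returning JSON; the hub contents, field names and function name are hypothetical, and a real service would sit behind a web framework with authentication.

```python
import json

# Hypothetical curated hub content.
HUB = [
    {"subject_id": "001", "test": "ALT", "value": 30},
    {"subject_id": "001", "test": "AST", "value": 25},
    {"subject_id": "002", "test": "ALT", "value": 41},
]


def get_lab_values(hub, subject_id, test):
    """Return only the requested subject and test as a JSON string."""
    rows = [r for r in hub if r["subject_id"] == subject_id and r["test"] == test]
    return json.dumps(rows)


response = get_lab_values(HUB, "001", "ALT")
```

The consumer receives one small, specific answer rather than a copy of the whole dataset.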
Conclusion
Data Engineering Project (Educating for the Future PHUSE Working Group)
No copyright infringement intended.
https://education.phuse.eu/eftf/data-engineering/