Zellescher Weg 12
Willers-Bau A207
Tel. +49 351 - 463 - 35450
Wolfgang E. Nagel ([email protected])
Big Data and beyond: What can we expect in the future!
IEEE ETFA‘2016, Berlin, September 9th, 2016
Overview
Some words on Dresden
Big Data
– What is “Big Data”?
– Big Data and High Performance Computing (HPC): Two worlds?
– User Support: ScaDS Dresden/Leipzig
– User challenges to get benefit from Big Data
Summary
TU Dresden: University of Excellence
Facts & Figures
– the only comprehensive technical university (Volluniversität) in Germany
– students: approx. 35,961 (as of 01.12.2015), of whom approx. 4,800 international students from 126 nations; first-year students: 8,474
– study programmes: 124
– many cooperations with universities worldwide
– employees: approx. 7,700, of whom approx. 3,400 financed by third-party funds
– overall budget in 2014: 585 million Euros, of which 242 million Euros from third-party funds
Center for Information Services and HPC (ZIH)
Central Scientific Unit at TU Dresden
Running computing and communication infrastructure for the university
Development of algorithms and methods: Cooperation with users from all departments
Providing infrastructure and qualified service for scientists all over Saxony
Dresden CUDA Center for Excellence
Dresden Intel® Parallel Computing Center (IPCC)
Competence center for „Parallel Computing and Software Tools“
Competence center for Big Data (ScaDS)
Areas of Expertise
Research topics
– Scalable software tools to support the optimization of applications for HPC systems
– Data-intensive computing and data life cycle
– Performance and energy efficiency analysis for innovative computer architectures
– Distributed computing and cloud computing
– Data analysis, methods and modeling in life sciences
– Parallel programming, algorithms and methods
Pick up and preparation of new concepts, methods, and techniques
Teaching and Education
HPC-Infrastructure (Past)
HRSK-I (installation 2006):
– SGI Altix 4700 "Mars": 2048 Montecito cores, 6.5 TB main memory
– HPC-SAN Lustre: 79 TB capacity (8 GB/s, 3 GB/s)
– PetaByte tape archive: 1 PB capacity (1.8 GB/s)

Installation 2012:
– Megware PC-Cluster "Atlas": 5888 AMD Interlagos cores, 13 TB main memory

HRSK-II (installation 2013):
– SGI UV2000 "Venus": 512 Intel Sandy Bridge cores, 8 TB main memory
– Throughput component "Taurus": Island 1: 4320 Intel Sandy Bridge cores; Island 2: 44 nodes with 72 Tesla GPUs; Island 3: 2160 Intel Westmere cores
– Lustre: 1 PB capacity (20 GB/s); further 68 TB capacity (8 GB/s, 3 GB/s)
Lehmann Data Centre building site on January 24th, 2013
Daniel Hackenberg
Building site, January 24th, 2013
Daniel Hackenberg
New Data Center – German Data Center Award 2014
Winner in the category of energy and resource efficient data centers 2014
Plenum in the data center: A concept for efficiency and safety
What about I/O?

[Figure: flexible storage system for HPC, throughput, and storage. Users (A … Z) with different access patterns – analysis, steering, transactions, checkpointing, serial I/O, export – reach, via login nodes and the batch system, servers/file systems backed by tiered storage (SSD, SAS, SATA) and a scratch area, connected through redundant switches to the ZIH/TUD campus network.]
ZIH HPC and Big Data Infrastructure

[Figure: 100 Gbit/s cluster uplink connecting home directories, the TU Dresden archive, and the ZIH backbone with partner sites (HTW, Chemnitz, Leipzig, Freiberg, Erlangen, Potsdam). Components:]
– Other clusters: Megware cluster, IBM iDataPlex, storage
– High throughput: BULL islands with Westmere, Sandy Bridge, and Haswell processors plus GPU nodes; 700 nodes, ~15,000 cores
– Shared memory: SGI Ultraviolet 2000, 512 Sandy Bridge cores, 8 TB RAM (NUMA)
– Parallel machine: BULL HPC system with Haswell processors, 2 x 612 nodes, ~30,000 cores
Part of the ZIH Compute Infrastructure
Inauguration: May 13th, 2015
What is “Big Data“?
Motivation: How large is Big Data?
Mostly unstructured data!
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
“Big” does not refer to a fixed scale!
Motivation: How large is Big Data?
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
Where is data coming from?
Besides science: industry and consumer data!
Source: The U.S. Mobile App Report, August 2014
Big Data Definition(s)
More important: extracting new content from the data
In Science: Not just “big players” – Long Tail of Science
Requirements from the users’ perspective
Data must be managed, annotated and curated to extract their potential
Many research communities do not have the necessary tools to transform ever-growing data into scientific knowledge
How to close the gap?
Large Collaborations (e.g. @Cern)
DNA sequencing
And many more!!!
Engineering, Transportation
Motivation: How users will use data in future
How to find relevant data for a given research topic?
Example: Digital Humanities
Classical data view point: document (source) based
Scenario: find relevant information about e.g. J.W. v. Goethe (famous German writer from the classical period)
Perform keyword-based search
Search delivers links to documents only
1: repository1/part1/Brief23
2: web/text1/Goethe.html
Click!
Motivation: How users will use data in future
How to find relevant data for a given research topic?
Example: Digital Humanities – same use case
Changed data view point: content-based
Search delivers relations (content) and its connectivity
User can navigate in the data base (+ references to initial documents)
1: JWvGoethe visited Dresden
2: JWvGoethe wasBornIn Frankfurt
Knowledge/Content Base (Ontology)
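The content-based view above can be illustrated with a minimal sketch: a tiny knowledge base of (subject, predicate, object) triples in which a search returns relations and their connectivity rather than document links. The triples and the helper function are illustrative assumptions, not an actual ontology API.

```python
# Hypothetical sketch of a content-based search: every hit is a relation,
# not just a link to a document.
TRIPLES = [
    ("JWvGoethe", "visited",   "Dresden"),
    ("JWvGoethe", "wasBornIn", "Frankfurt"),
    ("Schiller",  "livedIn",   "Dresden"),
]

def relations_of(entity, triples=TRIPLES):
    """Return all relations in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]

# Navigating the knowledge base from one entity to its neighbours:
for s, p, o in relations_of("JWvGoethe"):
    print(f"{s} {p} {o}")
```

From each returned relation the user can continue navigating, e.g. from "Dresden" back to every entity connected to it.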
Outcome
This will change life in the future!
The way you look for information!
The way you run your fab!
The way you optimize your production facility!
The way you have to do marketing and logistics!
The way you order products!
Awareness Campaigns
Big Data and HPC: Two Worlds?
Motivation: How to support users with infrastructures
HPC vs. Data Analytics
Bring computing to data, or data to computing (data mover)?
Systems and infrastructure should support users, not force them to follow rigid regimens
Let users pick the approach that is best for their individual use case
HPC: traditional, rather monolithic usage, e.g. simulations
Big Data analytics: more data-centric, but not every analysis is embarrassingly parallel; iterative models still induce large data movements
There is no unique big data blueprint!
Question: which way to follow – more HPC like approach or dynamic possibilities of big data frameworks?
2. Increase coherence between technology base used for modeling and simulation and that used for data analytic computing
Modeling and Simulation: multi-scale, multi-physics, multi-resolution, multidisciplinary, coupled models
Data Science: data assimilation, visualization, image analysis, data compression, data analytics
NSF Role: Support foundational research and research infrastructure within and across all disciplines (across all NSF directorates)
This slide courtesy Irene Qualters, National Science Foundation. Used with permission; may not be reused without permission
Convergence between HPC and Big Data hardware
Used with permission from Daniel Reed & Jack Dongarra. CACM 58(7):56-68
Extension of HRSK-II for HPC Data Analytics (HPC-DA)
[Figure: HPC-DA architecture. Virtual research environments sit on an abstraction/services layer that federates classical HPC (HRSK-II), HTC, and NVRAM resources via compute and memory virtualization; Lustre and memory tiers below; frameworks such as Flink on YARN handle streams and data; supported usage modes: simulation, analysis, and throughput. HPC-DA comprises both a hardware and a software extension of HRSK-II.]
HRSK-II Hardware Extensions (Phase 1 and 2)
[Figure: two 216-port FDR InfiniBand fabrics, each connecting an HPC island and an HTC island (blade servers) with data analytics nodes, data analytics memory, and Lustre file systems on SATA and SSD.]
Big Data Analytics and HPC
Formalized workflow
Automatic provision of required environment (Hadoop, Spark, Flink)
Complex analytics based on user requirements
Execution plan of primitives (map/reduce/…) optimized by framework (e.g. Flink, below)
Timeline: HPC job allocation → Big Data cluster start-up → Big Data session → Big Data cluster shut-down → end of HPC job allocation
Automatic start of a Big Data session within seconds
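The execution-plan idea above can be sketched in miniature: like Flink or Spark, nothing runs when map/filter primitives are chained – the plan is only a description until it is executed, which is what lets the framework optimize it first. This is a toy deferred pipeline, not the actual Flink or Spark API.

```python
# Minimal sketch of a deferred execution plan built from map/filter/reduce
# primitives (illustrative, not a real framework API).
from functools import reduce

class Plan:
    def __init__(self, source, steps=()):
        self.source, self.steps = source, steps

    def map(self, fn):                      # record the step, run nothing
        return Plan(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return Plan(self.source, self.steps + (("filter", pred),))

    def execute(self, combine, initial):    # only here does work happen
        data = iter(self.source)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return reduce(combine, data, initial)

# Sum of squares of the even numbers 0..9:
plan = Plan(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(plan.execute(lambda a, b: a + b, 0))  # 0+4+16+36+64 = 120
```

In a real framework the recorded steps would be rewritten and fused by an optimizer before being shipped to the cluster.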
Big Data User Support
National Big Data Competence Center
ScaDS Dresden/Leipzig – Competence Center for Scalable Data Services and Solutions
National Big Data Competence Center
One of two national Big Data competence centers in Germany
Project period: 4 years (10/2014 – 09/2018)
After evaluation: option for funding extension by 3 more years
Many involved research groups (47 PIs) from 21 organizations
Focal point for new research activities
Close collaboration with other national and international Big Data research activities
Max Planck Institute of Molecular Cell Biology and Genetics
Associated Partners
Avantgarde-Labs GmbH (AL)
Data Virtuality GmbH (DV)
E-Commerce Genossenschaft e. G. (ECEG)
European Centre for Emerging Materials and Processes Dresden (ECEMP)
Fraunhofer-Institut für Verkehrs- und Infrastruktursysteme IVI
Fraunhofer-Institut für Werkstoff- und Strahltechnik
GISA GmbH
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Helmholtz Zentrum für Umweltforschung (UFZ)
Hochschule für Telekommunikation Leipzig (HfTL)
Institut für Angewandte Informatik e. V. (InfAI)
Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG)
Netzwerk Logistik Leipzig-Halle e. V. (NLLH)
Sächsische Landesbibliothek – Staats- und Universitätsbibliothek Dresden (SLUB)
Scionics Computer Innovation GmbH (SCI)
Technische Universität Chemnitz (TUC)
Universitätsklinikum Carl Gustav Carus (UK)
Structural Approach
Development of Big Data solutions for a broad range of scientific applications
Starting with five disciplines in the project, later open to others
Methodological focus: data quality and integration, knowledge extraction, visual analysis; cross-cutting topics: Big Data architectures and data life cycle management
Service Center as linking entity
Data Quality and Data Integration
Parallel execution of comprehensive data integration workflows
Learning based configuration of integration workflows
Real-time data integration and dynamic information enrichment
continuous changes in thousands of data sources
“This tool by far shows the most mature use of MapReduce for data deduplication” www.hadoopsphere.com
Knowledge Extraction
Efficient algorithms for structural data
Machine learning in structural models
Text-mining methods for similarity analysis
Exploration of metabolic networks
Large text corpora: access to full texts on different textual levels and annotations (CTS standard)
3D scene understanding from images and videos
Visual Analysis
Alternative reduction techniques and real-time visualization
Guided navigation and interactive data validation
Particle, volume and process visualization
particle simulation
interaction with large data sets
How Workflows Need to Change
Efficient (raw) data reduction
Raw data from the instruments must be immediately deconvolved
NGS: partially already done by the instrument
– With proprietary software
– Often GPU based
Microscopes: on-the-fly reductions are left to the users
In-Situ reduction
– Streaming of data to analysis resources
  • Efficient data transfer
  • Intelligent directed streaming
– Merge data streams after reduction
– In-Memory-Analysis
– Compression
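The in-situ reduction steps above can be sketched as a toy stream reducer: samples are reduced chunk by chunk while streaming, and the reduced result is compressed, instead of landing the full raw stream on storage first. Chunk size and the per-chunk mean are illustrative assumptions, not what any particular instrument pipeline does.

```python
# Sketch of in-situ reduction: per-chunk reduction of a sample stream,
# followed by compression of the (much smaller) reduced result.
import struct
import zlib

def stream_reduce(sample_stream, chunk_size=4):
    """Reduce the stream to one mean value per chunk, then compress."""
    reduced, chunk = [], []
    for sample in sample_stream:
        chunk.append(sample)
        if len(chunk) == chunk_size:
            reduced.append(sum(chunk) / len(chunk))   # per-chunk reduction
            chunk = []
    if chunk:                                         # flush the tail
        reduced.append(sum(chunk) / len(chunk))
    payload = struct.pack(f"{len(reduced)}d", *reduced)
    return reduced, zlib.compress(payload)            # compressed result

reduced, blob = stream_reduce(iter(range(8)))
print(reduced)  # [1.5, 5.5]
```

In a real deployment the stream would come from the instrument and the reduction would run on dedicated analysis resources close to it.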
How Workflows Need to Change
Data Life Cycle Management
Management of huge numbers of objects/files
Reuse of data
Data provenance
Classify data to choose the most cost-efficient storage
Needs to be supported by tools
Scalable
– Management
– Access capabilities
– Storage
– Retrieval
Combined with workflow management
dataone.org
How Workflows Need to Change
Metadata
Needed to describe and reuse the research data
Types of metadata
– Technical
– Contextual
– Disciplinary
Tools to automatically extract metadata
Using HPC resources is necessary and beneficial
Distributed but connected metadata and data
www.dqglobal.com
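A minimal sketch of the automatic extraction idea, restricted to technical metadata (size, timestamp, type); contextual and disciplinary metadata would need domain-specific extractors on top. The function name and the returned fields are illustrative assumptions.

```python
# Sketch: automatically extract *technical* metadata for a file.
import datetime
import mimetypes
import os

def technical_metadata(path):
    st = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    return {
        "path": path,
        "size_bytes": st.st_size,
        "modified": datetime.datetime.fromtimestamp(st.st_mtime).isoformat(),
        "mime_type": mime or "application/octet-stream",
    }

# Example: describe a freshly written file.
with open("sample.txt", "w") as f:
    f.write("measurement run 42\n")
print(technical_metadata("sample.txt")["size_bytes"])  # 19
```

At HPC scale, running such extractors in parallel over millions of files is exactly where the HPC resources mentioned above pay off.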
How Workflows Need to Change
Knowledge extraction
Automatically create information and knowledge from the data
Techniques and algorithms are needed
– Read information
– Extract context, connections, correlations, interpretations, or ideas
Already quite common in business data processing
Science disciplines are still focusing only on their original analysis strategies
Tools
– Data mining
– Query such information and knowledge
– Fast, scalable, and easy to use
[Figure: pyramid – Data → Information → Knowledge]
How Workflows Need to Change
Workflow management
Analyses often consist of many single steps
Submit thousands or millions of jobs
Tools for workflow support
– More intelligence
– Scalable
– Manage data and computing tasks altogether
– Organizing and balancing the resources needed for both
– Resilience
  • Recognize exceptions/errors and react (not only by restarting or recreating a workflow)
  • Must react to the source of the exception
  • Find ways to circumvent errors automatically
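The resilience idea above can be sketched as a step runner that reacts to the source of an exception instead of blindly restarting the whole workflow. This is a hypothetical sketch, not a real workflow engine; the exception classes chosen here stand in for transient vs. permanent failure causes.

```python
# Sketch: react to the *source* of an exception, not just restart.
import time

def run_step(step, retries=3, backoff=0.0):
    """Run one workflow step; transient errors are retried with backoff,
    permanent errors are escalated immediately instead of retried."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except TimeoutError:                 # transient: retry with backoff
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)
        except ValueError:                   # permanent: bad input, no retry
            raise RuntimeError("step needs new input, not a restart")

# A step that fails twice with a timeout, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "done"

print(run_step(flaky))  # done
```

A production engine would additionally persist step state so that only the failed step, not the whole workflow, is re-run.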
User Challenges to use Big Data
User Centric Scenarios: Challenges are manifold
Requirements from the users’ perspective
What to do with my data? Variable and different data sources:
– Large streams of raw data (e.g. microscopes, sensor arrays, …)
– Integrate heterogeneous data sets into common analysis (open data, collaborative aspects etc.)
Keep control of data:
– Cover all aspects of data life cycle
– Ensure validity and quality of data
– Is there more knowledge in the data?
Deal with heterogeneous environments:
– Different data and meta data formats
– Data not self-explanatory (missing documentation or no meta data at all)
Execution of large data-driven workflows
Challenge: support execution of data-intensive user workflows in HPC environment
– No prior HPC-knowledge required on user side
– Formulation of workload directly in workflow environment
Solution: combination of well-known and widely used tools
– KNIME for workflow formulation
– Middleware UNICORE used for HPC interaction
Use Case: processing pipeline for cell tracking (bacterium E. coli) over time
Execution of large data-driven workflows
First: export of workflow and its input data
Second: automatic generation of compute jobs and execution on HPC system
Automatic generation of thousands of computing jobs if required
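The job-generation step can be sketched as partitioning the exported workflow's input files into batches, with one compute job description generated per batch. File names, batch size, and the job dictionary are illustrative assumptions, not the actual KNIME/UNICORE interface.

```python
# Sketch: turn a large set of input files into many compute jobs.
def generate_jobs(input_files, files_per_job=1000):
    """Yield one job description per batch of input files."""
    for i in range(0, len(input_files), files_per_job):
        batch = input_files[i:i + files_per_job]
        yield {"job_id": i // files_per_job, "inputs": batch}

# E.g. 7,500 input files become 8 jobs of at most 1,000 files each:
files = [f"cell_{n:05d}.tif" for n in range(7500)]
jobs = list(generate_jobs(files))
print(len(jobs))                # 8
print(len(jobs[-1]["inputs"]))  # 500
```

Each generated description would then be submitted to the HPC batch system by the middleware on the user's behalf.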
Execution of large data-driven workflows
Evaluation data set: 1.8 TB in ~7.5 M files
Runtime improvement: previously 17 days on 4 cores, now 2 hours on 800 cores
→ 200x faster
Next steps: fully automated pipeline connecting microscope with HPC environment and research data repository
R. Grunzke, F. Jug, B. Schuller, R. Jäkel, G. Myers and W. E. Nagel: Seamless HPC Integration of Data-intensive KNIME Workflows via UNICORE, 4th International Workshop on Parallelism in Bioinformatics (PBio 2016), 2016, accepted.
Application area: environmental sciences and urban modelling
Challenges:
– Analysis of maps to trace the development of settlement areas and their internal structure over time
Settlement structures in topographic maps
Application area: environmental sciences and urban modelling
Solution:
– Avoid previously required labor intensive manual work
– Usage of image segmentation algorithms in data processing
Scenario:
– Analysis of historic maps (“Messtischblätter”)
– Good coverage of Germany in 1:25000 scale (1875-1945)
– Thorough evaluation is desired
– Accurate training set required
Settlement structures in topographic maps
Example settlement areas
Results:
– Automatic and new method for settlement detection in historic maps available
– Scalable data processing of large quantity of input maps possible
Runtime improvement:
– serial processing on ordinary workstation: ~780 minutes (13 hours)
– Parallel execution: <4 min → ~200x faster
Settlement structures in topographic maps
Input and correct output labels
Imaging in Neurosurgery - background
– no prevalent method for imaging neural activity
– perfusion monitoring limited to measurement cycle of employed tracers
– only some tumors are detectable by fluorescence marker method
Potential of medical applications using thermal imaging
– (breast) tumor segmentation
– neuronal activity monitoring
– inflammations / fever – and many more
Thermography represents a promising approach for solving these issues
Intraoperative Thermal Imaging
IR camera
InfraTec hr HEAD
Application area: low delay operation support using thermal imaging processing
Challenge:
– Perfusion and neural activity monitoring require long-term intraoperative measurements (~10 minutes) to increase statistical power and correctness
– Fast preprocessing required to decrease the delay for subsequent analysis workflows and result presentation => minimize overall operation delay
– Iterative process: 3000 frames (5.4 GB) have to be processed every minute (50 Hz sampling rate)
Intraoperative Thermal Imaging
Thermal image of acutesubdural hematoma
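A back-of-the-envelope check of the stated streaming requirement (3,000 frames, 5.4 GB per minute at a 50 Hz sampling rate) shows the sustained ingest rate the preprocessing pipeline must keep up with:

```python
# Derive per-frame size and sustained bandwidth from the stated figures.
frames_per_minute = 50 * 60          # 50 Hz sampling -> 3000 frames/min
gb_per_minute = 5.4
mb_per_frame = gb_per_minute * 1024 / frames_per_minute
mb_per_second = gb_per_minute * 1024 / 60

print(frames_per_minute)             # 3000
print(round(mb_per_frame, 2))        # ~1.84 MB per frame
print(round(mb_per_second, 1))       # ~92.2 MB/s sustained ingest
```

Roughly 92 MB/s of continuous ingest is well within reach of an SSD-backed cluster, which matches the SSD observation on the following results slide.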
Results:
– Real-time data processing pipeline using imaging data from University Hospital Dresden (UKD)
– Parallel implementation using the Apache Spark framework
– Relatively small cluster instance sufficient to achieve real-time capability (8 nodes on HPC cluster TAURUS)
– Available SSD backend further decreased overall runtime
– Fail-safe storage and operations on imaging data
Runtime improvement:
– Typical workstation @UKD: ~7,000 s / 30,000 images
– 8-node Spark cluster @Taurus: ~32 s / 30,000 images → ~220x faster
Intraoperative Thermal Imaging
Summary
There is no unique big data usage pattern
– Many different aspects are of interest (not just “volume”)
– But: transparency for users is very important
HPC systems will support an extremely large main memory, which will result in huge input/output data (size and/or number of files)
Other, more distributed approaches still valid, e.g. for Hadoop-like workloads
Still depending on use-case requirements – users need to adapt their current workloads
Big Data Analytics at the push of a button … will take a while
Center for Information Services and High Performance Computing (ZIH)
Thank You
Rene Jäkel, Michael Kluge, Andreas Knüpfer, Ralph Müller-Pfefferkorn,
Richard Grunzke, Eugene Myers, Yannis Kalaidzidis, Gerhard Fettweis