Zellescher Weg 12
Willers-Bau A207
Tel. +49 351 - 463 - 35450
Wolfgang E. Nagel ([email protected])
Big Data and beyond: What can we expect in the future!
IEEE ETFA‘2016, Berlin, September 9th, 2016
Overview
Some words on Dresden
Big Data
– What is “Big Data”?
– Big Data and High Performance Computing (HPC): Two worlds?
– User Support: ScaDS Dresden/Leipzig
– User challenges to get benefit from Big Data
Summary
TU Dresden: University of Excellence
Facts & Figures
– the only comprehensive technical university (Volluniversität) in Germany
– students: approx. 35,961 (as of 01.12.2015), of whom approx. 4,800 international students from 126 nations; first-year students: 8,474
– study programmes: 124
– many cooperations with universities worldwide
– employees: approx. 7,700, of whom approx. 3,400 financed by third-party funds
– overall budget in 2014: 585 million Euros, of which 242 million Euros from third-party funds
Center for Information Services and HPC (ZIH)
Central Scientific Unit at TU Dresden
Running computing and communication infrastructure for the university
Development of algorithms and methods: Cooperation with users from all departments
Providing infrastructure and qualified service for scientists all over Saxony
Dresden CUDA Center for Excellence
Dresden Intel® Parallel Computing Center (IPCC)
Competence center for „Parallel Computing and Software Tools“
Competence center for Big Data (ScaDS)
Areas of Expertise
Research topics
– Scalable software tools to support the optimization of applications for HPC systems
– Data-intensive computing and data life cycle
– Performance and energy efficiency analysis for innovative computer architectures
– Distributed computing and cloud computing
– Data analysis, methods and modeling in life sciences
– Parallel programming, algorithms and methods
Pick up and preparation of new concepts, methods, and techniques
Teaching and Education
HPC-Infrastructure (Past)
HRSK-I (installation 2006):
– SGI Altix 4700 "Mars": 2048 Montecito cores, 6.5 TB main memory
– HPC-SAN Lustre: 79 TB capacity (8 GB/s, 3 GB/s)
– PetaByte tape archive: 1 PB capacity (1.8 GB/s)

Installation 2012:
– Megware PC-Cluster "Atlas": 5888 AMD Interlagos cores, 13 TB main memory

HRSK-II (installation 2013):
– SGI UV2000 "Venus": 512 Intel Sandy Bridge cores, 8 TB main memory
– Throughput component "Taurus": Island 1: 4320 Intel Sandy Bridge cores; Island 2: 44 nodes with 72 Tesla GPUs; Island 3: 2160 Intel Westmere cores
– Lustre: 1 PB capacity (20 GB/s); further 68 TB capacity (8 GB/s, 3 GB/s)
Lehmann Data Centre building site on January 24th, 2013
Daniel Hackenberg
Building site, January 24th, 2013
Daniel Hackenberg
New Data Center – German Data Center Award 2014
Winner in the category of energy and resource efficient data centers 2014
Plenum in the data center: A concept for efficiency and safety
What about I/O?

[Figure: flexible storage system for HPC, throughput, and storage. Users (A … Z) with different access patterns – analysis, steering, transactions, checkpointing, serial I/O, export – reach, via login nodes and the batch system, servers/file systems backed by tiered storage (SSD, SAS, SATA) and a scratch area, connected through redundant switches to the ZIH/TUD campus network.]
ZIH HPC and Big Data Infrastructure

[Figure: 100 Gbit/s cluster uplink connecting home directories, the TU Dresden archive, and the ZIH backbone with partner sites (HTW, Chemnitz, Leipzig, Freiberg, Erlangen, Potsdam). Components:]
– Other clusters: Megware cluster, IBM iDataPlex, storage
– High throughput: BULL islands with Westmere, Sandy Bridge, and Haswell processors plus GPU nodes; 700 nodes, ~15,000 cores
– Shared memory: SGI Ultraviolet 2000, 512 Sandy Bridge cores, 8 TB RAM (NUMA)
– Parallel machine: BULL HPC system with Haswell processors, 2 x 612 nodes, ~30,000 cores
Part of the ZIH Compute Infrastructure
Inauguration: May 13th, 2015
What is “Big Data“?
Motivation: How large is Big Data?
Mostly unstructured data!
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
“Big” does not refer to a fixed scale!
Motivation: How large is Big Data?
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
Where is data coming from?
Besides science: industry and consumer data!
Source: The U.S. Mobile App Report, August 2014
Big Data Definition(s)
More important: extracting new content from the data
In Science: Not just “big players” – Long Tail of Science
Requirements from the users’ perspective
Data must be managed, annotated and curated to extract their potential
Many research communities do not have the necessary tools to transform ever-growing data into scientific knowledge
How to close the gap?
Large Collaborations (e.g. @Cern)
DNA sequencing
And many more!!!
Engineering, Transportation
Motivation: How users will use data in future
How to find relevant data for a given research topic?
Example: Digital Humanities
Classical data view point: document (source) based
Scenario: find relevant information about e.g. J.W. v. Goethe (famous German writer from the classical period)
Perform keyword-based search
Search delivers links to documents only
1: repository1/part1/Brief23
2: web/text1/Goethe.html
Click!
Motivation: How users will use data in future
How to find relevant data for a given research topic?
Example: Digital Humanities – same use case
Changed data view point: content-based
Search delivers relations (content) and its connectivity
User can navigate in the data base (+ references to initial documents)
1: JWvGoethe visited Dresden
2: JWvGoethe wasBornIn Frankfurt
Knowledge/Content Base (Ontology)
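The content-based view above can be illustrated with a minimal sketch: a tiny knowledge base of (subject, predicate, object) triples in which a search returns relations and their connectivity rather than document links. The triples and the helper function are illustrative assumptions, not an actual ontology API.

```python
# Hypothetical sketch of a content-based search: every hit is a relation,
# not just a link to a document.
TRIPLES = [
    ("JWvGoethe", "visited",   "Dresden"),
    ("JWvGoethe", "wasBornIn", "Frankfurt"),
    ("Schiller",  "livedIn",   "Dresden"),
]

def relations_of(entity, triples=TRIPLES):
    """Return all relations in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]

# Navigating the knowledge base from one entity to its neighbours:
for s, p, o in relations_of("JWvGoethe"):
    print(f"{s} {p} {o}")
```

From each returned relation the user can continue navigating, e.g. from "Dresden" back to every entity connected to it.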
Outcome
This will change life in the future!
The way you look for information!
The way you run your fab!
The way you optimize your production facility!
The way you have to do marketing and logistics!
The way you order products!
Awareness Campaigns
Big Data and HPC: Two Worlds?
Motivation: How to support users with infrastructures
HPC vs. Data Analytics
Bring computing to data, or data to computing (data mover)?
Systems and infrastructure should support users, not force them to follow rigid regimens
Let users pick the approach that is best for their individual use case
HPC: traditional, rather monolithic usage, e.g. simulations
Big Data analytics: more data-centric, but not every analysis is embarrassingly parallel; iterative models still induce large data movements
There is no unique big data blueprint!
Question: which way to follow – more HPC like approach or dynamic possibilities of big data frameworks?
2. Increase coherence between technology base used for modeling and simulation and that used for data analytic computing
Modeling and Simulation: multi-scale, multi-physics, multi-resolution, multidisciplinary, coupled models
Data Science: data assimilation, visualization, image analysis, data compression, data analytics
NSF Role: Support foundational research and research infrastructure within and across all disciplines (across all NSF directorates)
This slide courtesy Irene Qualters, National Science Foundation. Used with permission; may not be reused without permission
Convergence between HPC and Big Data hardware
Used with permission from Daniel Reed & Jack Dongarra. CACM 58(7):56-68
Extension of HRSK-II for HPC Data Analytics (HPC-DA)
[Figure: HPC-DA architecture. Virtual research environments sit on an abstraction/services layer that federates classical HPC (HRSK-II), HTC, and NVRAM resources via compute and memory virtualization; Lustre and memory tiers below; frameworks such as Flink on YARN handle streams and data; supported usage modes: simulation, analysis, and throughput. HPC-DA comprises both a hardware and a software extension of HRSK-II.]
HRSK-II Hardware Extensions (Phase 1 and 2)
[Figure: two 216-port FDR InfiniBand fabrics, each connecting an HPC island and an HTC island (blade servers) with data analytics nodes, data analytics memory, and Lustre file systems on SATA and SSD.]
Big Data Analytics and HPC
Formalized workflow
Automatic provision of required environment (Hadoop, Spark, Flink)
Complex analytics based on user requirements
Execution plan of primitives (map/reduce/…) optimized by framework (e.g. Flink, below)
Timeline: HPC job allocation → Big Data cluster start-up → Big Data session → Big Data cluster shut-down → end of HPC job allocation
Automatic start of a Big Data session within seconds
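The execution-plan idea above can be sketched in miniature: like Flink or Spark, nothing runs when map/filter primitives are chained – the plan is only a description until it is executed, which is what lets the framework optimize it first. This is a toy deferred pipeline, not the actual Flink or Spark API.

```python
# Minimal sketch of a deferred execution plan built from map/filter/reduce
# primitives (illustrative, not a real framework API).
from functools import reduce

class Plan:
    def __init__(self, source, steps=()):
        self.source, self.steps = source, steps

    def map(self, fn):                      # record the step, run nothing
        return Plan(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return Plan(self.source, self.steps + (("filter", pred),))

    def execute(self, combine, initial):    # only here does work happen
        data = iter(self.source)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return reduce(combine, data, initial)

# Sum of squares of the even numbers 0..9:
plan = Plan(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(plan.execute(lambda a, b: a + b, 0))  # 0+4+16+36+64 = 120
```

In a real framework the recorded steps would be rewritten and fused by an optimizer before being shipped to the cluster.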
Big Data User Support
National Big Data Competence Center
ScaDS Dresden/Leipzig – Competence Center for Scalable Data Services and Solutions
National Big Data Competence Center
One of two national Big Data competence centers in Germany
Project period: 4 years (10/2014 – 09/2018)
After evaluation: option for funding extension by 3 more years
Many involved research groups (47 PIs) from 21 organizations
Focal point for new research activities
Close collaboration with other national and international Big Data research activities
Max Planck Institute of Molecular Cell Biology and Genetics
Associated Partners
Avantgarde-Labs GmbH (AL)
Data Virtuality GmbH (DV)
E-Commerce Genossenschaft e. G. (ECEG)
European Centre for Emerging Materials and Processes Dresden (ECEMP)
Fraunhofer-Institut für Verkehrs- und Infrastruktursysteme IVI
Fraunhofer-Institut für Werkstoff- und Strahltechnik
GISA GmbH
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Helmholtz Zentrum für Umweltforschung (UFZ)
Hochschule für Telekommunikation Leipzig (HfTL)
Institut für Angewandte Informatik e. V. (InfAI)
Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG)
Netzwerk Logistik Leipzig-Halle e. V. (NLLH)
Sächsische Landesbibliothek – Staats- und Universitätsbibliothek Dresden (SLUB)
Scionics Computer Innovation GmbH (SCI)
Technische Universität Chemnitz (TUC)
Universitätsklinikum Carl Gustav Carus (UK)
Structural Approach
Development of Big Data solutions for a broad range of scientific applications
Starting with five disciplines in the project, later open to others
Methodological focus: data quality and integration, knowledge extraction, visual analysis; cross-cutting topics: Big Data architectures and data life cycle management
Service Center as linking entity
Data Quality and Data Integration
Parallel execution of comprehensive data integration workflows
Learning based configuration of integration workflows
Real-time data integration and dynamic information enrichment
continuous changes in thousands of data sources
“This tool by far shows the most mature use of MapReduce for data deduplication” www.hadoopsphere.com
Knowledge Extraction
Efficient algorithms for structural data
Machine learning in structural models
Text-mining methods for similarity analysis
Exploration of metabolic networks
Large text corpora: access to full texts on different textual levels and annotations (CTS standard)
3D scene understanding from images and videos
Visual Analysis
Alternative reduction techniques and real-time visualization
Guided navigation and interactive data validation
Particle, volume and process visualization
particle simulation
interaction with large data sets
How Workflows Need to Change
Efficient (raw) data reduction
Raw data from the instruments must be immediately deconvolved
NGS: partially already done by the instrument
– With proprietary software
– Often GPU based
Microscopes: on-the-fly reductions are left to the users
In-Situ reduction
– Streaming of data to analysis resources
  • Efficient data transfer
  • Intelligent directed streaming
– Merge data streams after reduction
– In-Memory-Analysis
– Compression
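The in-situ reduction steps above can be sketched as a toy stream reducer: samples are reduced chunk by chunk while streaming, and the reduced result is compressed, instead of landing the full raw stream on storage first. Chunk size and the per-chunk mean are illustrative assumptions, not what any particular instrument pipeline does.

```python
# Sketch of in-situ reduction: per-chunk reduction of a sample stream,
# followed by compression of the (much smaller) reduced result.
import struct
import zlib

def stream_reduce(sample_stream, chunk_size=4):
    """Reduce the stream to one mean value per chunk, then compress."""
    reduced, chunk = [], []
    for sample in sample_stream:
        chunk.append(sample)
        if len(chunk) == chunk_size:
            reduced.append(sum(chunk) / len(chunk))   # per-chunk reduction
            chunk = []
    if chunk:                                         # flush the tail
        reduced.append(sum(chunk) / len(chunk))
    payload = struct.pack(f"{len(reduced)}d", *reduced)
    return reduced, zlib.compress(payload)            # compressed result

reduced, blob = stream_reduce(iter(range(8)))
print(reduced)  # [1.5, 5.5]
```

In a real deployment the stream would come from the instrument and the reduction would run on dedicated analysis resources close to it.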
How Workflows Need to Change
Data Life Cycle Management
Management of huge numbers of objects/files
Reuse of data
Data provenance
Classify data to choose the most cost-efficient storage
Needs to be supported by tools
Scalable
– Management
– Access capabilities
– Storage
– Retrieval
Combined with workflow management
dataone.org
How Workflows Need to Change
Metadata
Needed to describe and reuse the research data
Types of metadata
– Technical
– Contextual
– Disciplinary
Tools to automatically extract metadata
Using HPC resources is necessary and beneficial
Distributed but connected metadata and data
www.dqglobal.com
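A minimal sketch of the automatic extraction idea, restricted to technical metadata (size, timestamp, type); contextual and disciplinary metadata would need domain-specific extractors on top. The function name and the returned fields are illustrative assumptions.

```python
# Sketch: automatically extract *technical* metadata for a file.
import datetime
import mimetypes
import os

def technical_metadata(path):
    st = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    return {
        "path": path,
        "size_bytes": st.st_size,
        "modified": datetime.datetime.fromtimestamp(st.st_mtime).isoformat(),
        "mime_type": mime or "application/octet-stream",
    }

# Example: describe a freshly written file.
with open("sample.txt", "w") as f:
    f.write("measurement run 42\n")
print(technical_metadata("sample.txt")["size_bytes"])  # 19
```

At HPC scale, running such extractors in parallel over millions of files is exactly where the HPC resources mentioned above pay off.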
How Workflows Need to Change
Knowledge extraction
Automatically create information and knowledge from the data
Techniques and algorithms are needed
– Read information
– Extract context, connections, correlations, interpretations, or ideas
Already quite common in business data processing
Science disciplines are still focusing only on their original analysis strategies
Tools
– Data mining
– Query such information and knowledge
– Fast, scalable, and easy to use
[Figure: pyramid – Data → Information → Knowledge]
How Workflows Need to Change
Workflow management
Analyses often consist of many single steps
Submit thousands or millions of jobs
Tools for workflow support
– More intelligence
– Scalable
– Manage data and computing tasks altogether
– Organizing and balancing the resources needed for both
– Resilience
  • Recognize exceptions/errors and react (not only by restarting or recreating a workflow)
  • Must react to the source of the exception
  • Find ways to circumvent errors automatically
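The resilience idea above can be sketched as a step runner that reacts to the source of an exception instead of blindly restarting the whole workflow. This is a hypothetical sketch, not a real workflow engine; the exception classes chosen here stand in for transient vs. permanent failure causes.

```python
# Sketch: react to the *source* of an exception, not just restart.
import time

def run_step(step, retries=3, backoff=0.0):
    """Run one workflow step; transient errors are retried with backoff,
    permanent errors are escalated immediately instead of retried."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except TimeoutError:                 # transient: retry with backoff
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)
        except ValueError:                   # permanent: bad input, no retry
            raise RuntimeError("step needs new input, not a restart")

# A step that fails twice with a timeout, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "done"

print(run_step(flaky))  # done
```

A production engine would additionally persist step state so that only the failed step, not the whole workflow, is re-run.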
User Challenges to use Big Data
User Centric Scenarios: Challenges are manifold
Requirements from the users’ perspective
What to do with my data? Variable and different data sources:
– Large streams of raw data (e.g. microscopes, sensor arrays, …)
– Integrate heterogeneous data sets into common analysis (open data, collaborative aspects etc.)
Keep control of data:
– Cover all aspects of data life cycle
– Ensure validity and quality of data
– Is there more knowledge in the data?
Deal with heterogeneous environments:
– Different data and meta data formats
– Data not self-explanatory (missing documentation or no meta data at all)
Execution of large data-driven workflows
Challenge: support execution of data-intensive user workflows in HPC environment
– No prior HPC-knowledge required on user side
– Formulation of workload directly in workflow environment
Solution: combination of well-known and widely used tools
– KNIME for workflow formulation
– Middleware UNICORE used for HPC interaction
Use Case: processing pipeline for cell tracking (bacterium E. coli) over time
Execution of large data-driven workflows
First: export of workflow and its input data
Second: automatic generation of compute jobs and execution on HPC system
Automatic generation of thousands of computing jobs if required
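The job-generation step can be sketched as partitioning the exported workflow's input files into batches, with one compute job description generated per batch. File names, batch size, and the job dictionary are illustrative assumptions, not the actual KNIME/UNICORE interface.

```python
# Sketch: turn a large set of input files into many compute jobs.
def generate_jobs(input_files, files_per_job=1000):
    """Yield one job description per batch of input files."""
    for i in range(0, len(input_files), files_per_job):
        batch = input_files[i:i + files_per_job]
        yield {"job_id": i // files_per_job, "inputs": batch}

# E.g. 7,500 input files become 8 jobs of at most 1,000 files each:
files = [f"cell_{n:05d}.tif" for n in range(7500)]
jobs = list(generate_jobs(files))
print(len(jobs))                # 8
print(len(jobs[-1]["inputs"]))  # 500
```

Each generated description would then be submitted to the HPC batch system by the middleware on the user's behalf.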
Execution of large data-driven workflows
Evaluation data set: 1.8 TB in ~7.5 M files
Runtime improvement: previously 17 days on 4 cores, now 2 hours on 800 cores
→ 200x faster
Next steps: fully automated pipeline connecting microscope with HPC environment and research data repository
R. Grunzke, F. Jug, B. Schuller, R. Jäkel, G. Myers and W. E. Nagel: Seamless HPC Integration of Data-intensive KNIME Workflows via UNICORE, 4th International Workshop on Parallelism in Bioinformatics (PBio 2016), 2016, accepted.
Application area: environmental sciences and urban modelling
Challenges:
– Analysis of maps to trace the development of settlement areas and their internal structure over time
Settlement structures in topographic maps
Application area: environmental sciences and urban modelling
Solution:
– Avoid previously required labor intensive manual work
– Usage of image segmentation algorithms in data processing
Scenario:
– Analysis of historic maps (“Messtischblätter”)
– Good coverage of Germany in 1:25000 scale (1875-1945)
– Thorough evaluation is desired
– Accurate training set required
Settlement structures in topographic maps
Example settlement areas
Results:
– Automatic and new method for settlement detection in historic maps available
– Scalable data processing of large quantity of input maps possible
Runtime improvement:
– serial processing on ordinary workstation: ~780 minutes (13 hours)
– Parallel execution: <4 min → ~200x faster
Settlement structures in topographic maps
Input and correct output labels
Imaging in Neurosurgery - background
– no prevalent method for imaging neural activity
– perfusion monitoring limited to measurement cycle of employed tracers
– only some tumors are detectable by fluorescence marker method
Potential of medical applications using thermal imaging
– (breast) tumor segmentation
– neuronal activity monitoring
– inflammations / fever – and many more
Thermography represents a promising approach for solving these issues
Intraoperative Thermal Imaging
IR camera
InfraTec hr HEAD
Application area: low delay operation support using thermal imaging processing
Challenge:
– Perfusion and neural activity monitoring require long-term intraoperative measurements (~10 minutes) to increase statistical power and correctness
– Fast preprocessing required to decrease the delay for subsequent analysis workflows and result presentation => minimize overall operation delay
– Iterative process: 3000 frames (5.4 GB) have to be processed every minute (50 Hz sampling rate)
Intraoperative Thermal Imaging
Thermal image of acutesubdural hematoma
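A back-of-the-envelope check of the stated streaming requirement (3,000 frames, 5.4 GB per minute at a 50 Hz sampling rate) shows the sustained ingest rate the preprocessing pipeline must keep up with:

```python
# Derive per-frame size and sustained bandwidth from the stated figures.
frames_per_minute = 50 * 60          # 50 Hz sampling -> 3000 frames/min
gb_per_minute = 5.4
mb_per_frame = gb_per_minute * 1024 / frames_per_minute
mb_per_second = gb_per_minute * 1024 / 60

print(frames_per_minute)             # 3000
print(round(mb_per_frame, 2))        # ~1.84 MB per frame
print(round(mb_per_second, 1))       # ~92.2 MB/s sustained ingest
```

Roughly 92 MB/s of continuous ingest is well within reach of an SSD-backed cluster, which matches the SSD observation on the following results slide.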
Results:
– Real-time data processing pipeline using imaging data from University Hospital Dresden (UKD)
– Parallel implementation using the Apache Spark framework
– Relatively small cluster instance sufficient to achieve real-time capability (8 nodes on HPC cluster TAURUS)
– Available SSD backend further decreased overall runtime
– Fail-safe storage and operations on imaging data
Runtime improvement:
– Typical workstation @UKD: ~7,000 s / 30,000 images
– 8-node Spark cluster @Taurus: ~32 s / 30,000 images → ~220x faster
Intraoperative Thermal Imaging
Summary
There is no unique big data usage pattern
– Many different aspects are of interest (not just “volume”)
– But: transparency for users is very important
HPC systems will support an extremely large main memory, which will result in huge input/output data (size and/or number of files)
Other, more distributed approaches still valid, e.g. for Hadoop-like workloads
Still depending on use-case requirements – users need to adapt their current workloads
Big Data Analytics at the push of a button … will take a while
Center for Information Services and High Performance Computing (ZIH)
Thank You
Rene Jäkel, Michael Kluge, Andreas Knüpfer, Ralph Müller-Pfefferkorn,
Richard Grunzke, Eugene Myers, Yannis Kalaidzidis, Gerhard Fettweis