Panel: Data Intensive Science
at HPCS 2015 – The International Conference on High Performance Computing & Simulation
http://hpcs2015.cisedu.info/, 22 July, 2015
Marian Bubak
AGH University of Science and Technology, Krakow, Poland, and University of Amsterdam, Amsterdam, The Netherlands
http://dice.cyfronet.pl


Page 2

DICE Team

Academic Computer Centre CYFRONET AGH (1973)

120 employees

http://www.cyfronet.pl/en/

Department of Computer Science AGH (1980)

800 students, 70 employees

http://www.ki.agh.edu.pl/uk/index.htm

Faculty of Computer Science, Electronics and Telecommunication (2012)

2000 students, 200 employees

http://www.iet.agh.edu.pl/

AGH University of Science and Technology (1919)

16 faculties, 36000 students, 4000 employees

http://www.agh.edu.pl/en

Other 15 faculties

Distributed Computing Environments (DICE) Team http://dice.cyfronet.pl

• Investigation of methods for building complex scientific collaborative applications
• Elaboration of environments and tools for e-Science
• Integration of large-scale distributed computing infrastructures
• Knowledge-based approach to services, components, and their semantic composition

Page 3

From Workshop on Cloud Services for File Synchronisation and Sharing, CERN Nov 17-18, 2014

• Protocols for file sharing and synchronization
• Reliability and consistency of file synchronization services
• Efficiency and scalability of file synchronization services
• File-sharing semantics
• Data analysis workflows
• Backend storage technologies
• Federated access to cloud storage
• Integration of large data repositories
• Mobile access to data

Page 4

Scalable data access

• In service orchestration, all data is passed to the workflow engine
• Data transfers are made through SOAP, which is unfit for large data transfers
• Storage federation

Spiros Koulouzis, Reggie Cushing, Kostas Karasavvas, Adam Belloum, and Marian Bubak. Enabling web services to consume and produce large datasets. IEEE Internet Computing, 16(1):52–60, 2012

Spiros Koulouzis, Dmitry Vasyunin, Reginald Cushing, Adam Belloum, and Marian Bubak. Cloud data federation for scientific applications. In Euro-Par 2013: Parallel Processing Workshops, LNCS 8374, pp 13–22. Springer, 2014
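One common way around the bottleneck named on this slide is to pass references through the workflow engine instead of the data itself. The sketch below only illustrates that idea and is not the mechanism from the cited papers: `FederatedStore`, the `store://` URI scheme, and the two service functions are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FederatedStore:
    """Hypothetical stand-in for a storage federation: maps reference URIs to payloads."""
    _blobs: dict = field(default_factory=dict)

    def put(self, name: str, payload: bytes) -> str:
        uri = f"store://{name}"  # assumed URI scheme, for illustration only
        self._blobs[uri] = payload
        return uri

    def get(self, uri: str) -> bytes:
        return self._blobs[uri]


def producer_service(store: FederatedStore, n_bytes: int) -> str:
    """Writes a large payload to storage and returns only a short reference."""
    return store.put("dataset-1", b"x" * n_bytes)


def consumer_service(store: FederatedStore, uri: str) -> int:
    """Fetches the payload directly from storage and returns its size."""
    return len(store.get(uri))


def run_workflow(store: FederatedStore, n_bytes: int):
    """The orchestrator handles the reference, never the payload."""
    ref = producer_service(store, n_bytes)
    bytes_through_engine = len(ref.encode())
    result = consumer_service(store, ref)
    return result, bytes_through_engine
```

With this layout the amount of data crossing the orchestrator is the length of the reference string, independent of the dataset size.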

Page 5

Cloud and Big Data Benchmarking and Verification Methodology

• Methodology for the evaluation of systems and applications
– Qualitative metrics (architectures, functionality)
– Quantitative metrics (performance, stability, cost)
– Test scenarios, test cases and parameters
– Experiment planning, analysis of results

• Selection of benchmarks
– Portfolio of standard benchmarks
– Design of application-specific scenarios

• Target platforms
– IaaS clouds (public, private)
– Hybrid clouds with cloud bursting
– Real-time Big Data processing systems (Hadoop, Spark, ElasticSearch)

• Collaboration with Samsung R&D Polska
– Methodology applied to cloud infrastructure at the industrial partner
– Consultancy on the analysis of results and development of a Testing-as-a-Service (TaaS) system

K. Zieliński, M. Malawski, M. Jarząb, S. Zieliński, K. Grzegorczyk, T. Szepieniec, and M. Zyśk: Evaluation Methodology of Converged Cloud Environments. In: K. Wiatr, J. Kitowski, M. Bubak (Eds) Proceedings of the Seventh ACC Cyfronet AGH Users’ Conference, ACC CYFRONET AGH, Kraków, ISBN 978-83-61433-09-5, pp. 77-78 (2014)
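The three quantitative metrics listed above (performance, stability, cost) can be derived from repeated benchmark runs. The following is a minimal sketch under assumptions of my own, not the methodology of the cited paper: the function name, the choice of coefficient of variation as the stability measure, and the hourly pricing model are all illustrative.

```python
import statistics


def summarize_runs(durations_s, ops_per_run, price_per_hour):
    """Summarize repeated benchmark runs (needs at least two runs).

    durations_s    -- wall-clock duration of each run, in seconds
    ops_per_run    -- number of operations executed in every run
    price_per_hour -- assumed hourly price of the tested instance
    """
    mean_s = statistics.mean(durations_s)
    stdev_s = statistics.stdev(durations_s)
    return {
        # performance: operations per second at the mean duration
        "throughput_ops_s": ops_per_run / mean_s,
        # stability: coefficient of variation (lower means more stable)
        "cv": stdev_s / mean_s,
        # cost: price of executing one million operations
        "cost_per_million_ops": price_per_hour * (mean_s / 3600) / ops_per_run * 1e6,
    }
```

Reporting the spread next to the mean matters on shared cloud infrastructure, where run-to-run variation can be as informative as the average.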


Page 6

Data security in clouds

• To ensure security of data in transit:
– Modern applications use secure transport protocols (e.g. TLS)
– For legacy unencrypted protocols, if absolutely needed, or as an additional security measure:
  – Site-to-site VPN, e.g. between cloud sites (outside of the instance, might use …)
  – Remote access VPN, for individual users accessing e.g. from their laptops

• Data should be securely stored and reliably deleted when no longer needed

• Clouds are not secure enough: data optimisations make it impossible to ensure that data were deleted

• A solution:
– end-to-end encryption (the decryption key stays in a protected/private zone)
– data dispersal (portions of data dispersed between nodes, so it is non-trivial/impossible to recover the whole message)

J. Meizner, M. Bubak, M. Malawski, P. Nowakowski: Secure Storage and Processing of Confidential Data on Public Clouds. In: PPAM 2013, LNCS 8384, pp. 272-282, Springer, 2014
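The data dispersal idea can be illustrated with XOR-based secret splitting, where every node stores one share and all shares are needed to rebuild the message. This is a sketch of one simple dispersal scheme, chosen for brevity; it is not claimed to be the scheme used in the cited paper, and the function names are hypothetical.

```python
import secrets
from functools import reduce


def _xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(message: bytes, n_nodes: int) -> list:
    """Split message into n_nodes shares; any n_nodes - 1 of them look random."""
    if n_nodes < 2:
        raise ValueError("dispersal needs at least two nodes")
    # n_nodes - 1 shares are pure randomness ...
    shares = [secrets.token_bytes(len(message)) for _ in range(n_nodes - 1)]
    # ... and the last share is the message XORed with all of them.
    shares.append(reduce(_xor, shares, message))
    return shares


def recover(shares: list) -> bytes:
    """XOR all shares together to rebuild the original message."""
    return reduce(_xor, shares)
```

A scheme like this tolerates no lost share; threshold schemes trade some storage overhead for the ability to recover from a subset of nodes.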

Page 7

Collaborative metadata management

Competences
• Exploitation of PaaS-based solutions with in-house installations
• Handling heterogeneous data in diverse scientific disciplines
• Building multi-layer and multi-protocol software stacks

Objectives
• Ad-hoc metadata model creation and deployment of corresponding storage facilities
• Create a research space for metadata model exchange and discovery with associated data repositories with access restrictions in place
• Different types of storage sites and data transfer protocols

Architecture
• Web Interface-based metadata model management
• PaaS-based repositories over REST
• Site-specific storage infrastructure for file persistence

D. Harężlak, M. Kasztelnik, M. Pawlik, B. Wilk, and M. Bubak: A Lightweight Method of Metadata and Data Management with DataNet. In: M. Bubak, J. Kitowski, K. Wiatr (Eds.): eScience on Distributed Computing Infrastructure, LNCS 8500. Springer, pp. 164-177, 2014

Page 8

Levee Monitoring Application ISMOP project - http://www.ismop.edu.pl/en

• Levee breach threat due to a passing wave
• High water levels lasting for up to 2 weeks
• Large areas of levees affected (100+ km)


Page 9

Flood threat assessment platform

Bartosz Balis, Marek Kasztelnik, Maciej Malawski, Piotr Nowakowski, Bartosz Wilk, Maciej Pawlik, Marian Bubak: Execution Management and Efficient Resource Provisioning for Flood Decision Support. ICCS 2015: 2377-2386, Procedia Computer Science 51, Elsevier 2015

Page 10

Collage - executable e-Science publications

Goal: Extending the traditional scientific publishing model with computational access and interactivity mechanisms; enabling readers (including reviewers) to replicate and verify experimentation results and browse large-scale result spaces.

Challenges:
• Scientific: A common description schema for primary data (experimental data, algorithms, software, workflows, scripts) as part of publications; deployment mechanisms for on-demand reenactment of experiments in e-Science.
• Technological: An integrated architecture for storing, annotating, publishing, referencing and reusing primary data sources.
• Organizational: Provisioning of executable paper services to a large community of users representing various branches of computational science; fostering further uptake through involvement of major players in the field of scientific publishing.

P. Nowakowski, E. Ciepiela, D. Harężlak, J. Kocot, M. Kasztelnik, T. Bartyński, J. Meizner, G. Dyk, M. Malawski: The Collage Authoring Environment. In: Proceedings of the International Conference on Computational Science, ICCS 2011 (2011), Winner of the Elsevier/ICCS Executable Paper Grand Challenge

E. Ciepiela, D. Harężlak, M. Kasztelnik, J. Meizner, G. Dyk, P. Nowakowski, M. Bubak: The Collage Authoring Environment: From Proof-of-Concept Prototype to Pilot Service. In: Procedia Computer Science, vol. 18, 2013

Page 11

Simulating a city, citizen science

[Diagram labels: Sensors, Open Data, Data, Simulating, Analytics, Decision]

• Understanding a city (mobility, crime, flood, health, evacuation, etc.) through computation

• A set of simulations combined together and reacting to changes

Key challenges:
• Open data (https://odkrk.hackpad.com - Tomek Gubała’s initiative)
• Distributed environment with auto-scaling capability (e.g. Atmosphere, AWS Auto Scaling, etc.)
• Simulation repository
• Decision Support System

Proof of concept projects, which use Open Data (work in progress), https://plankrk.herokuapp.com

Page 12

Automata-based dynamic data processing

[Figure: State graph describing a filtering state machine for tweets, which is mapped to 11 VMs]

• Data processing schema can be considered as a state transformation graph

• The graph facilitates data processing in many ways
– Data state can be easily tracked
– Using the graph as a protocol header, a virtual data processing network layer is achieved
– Data becomes self-routable to processing nodes
– Collaboration can be achieved by joining the virtual network

Reginald Cushing, Adam Belloum, Marian Bubak, and Cees de Laat. Automata-based dynamic data processing for clouds. In Euro-Par 2014: Parallel Processing Workshops, LNCS 8805, pp 93–104, 2014

Reginald Cushing, Adam Belloum, Marian Bubak, and Cees de Laat. Towards Computing Without Borders: Data Processing Plane. In review: Future Generation of Computer Systems, 2015
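The self-routing idea can be sketched with a small state graph in which each data item carries its current state as a header, and that header alone decides which processing step handles it next. This is an illustrative toy, not the system from the cited papers: the three filter functions, the state names, and the `Packet` type are hypothetical, and the mapping of states to VMs is reduced to a dictionary lookup.

```python
from dataclasses import dataclass


def drop_retweets(text):
    """Filter: discard retweets, pass everything else through."""
    return None if text.startswith("RT ") else text


def lowercase(text):
    """Transform: normalize case."""
    return text.lower()


def keep_flood(text):
    """Filter: keep only tweets mentioning 'flood'."""
    return text if "flood" in text else None


# Hypothetical state graph: state -> (processing function, next state)
STATE_GRAPH = {
    "raw": (drop_retweets, "original"),
    "original": (lowercase, "normalized"),
    "normalized": (keep_flood, "done"),
}


@dataclass
class Packet:
    state: str    # header: current node in the state graph
    payload: str  # the data item itself


def step(packet):
    """Route the packet by its header and apply one transformation."""
    func, next_state = STATE_GRAPH[packet.state]
    out = func(packet.payload)
    return None if out is None else Packet(next_state, out)


def process(packet):
    """Follow the graph until the packet is done or filtered out; record visited states."""
    trace = [packet.state]
    while packet is not None and packet.state != "done":
        packet = step(packet)
        if packet is not None:
            trace.append(packet.state)
    return packet, trace
```

The recorded trace shows the first listed benefit, easy tracking of data state; in a distributed setting each state would resolve to a processing node rather than a local function.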