
Data Intensive Services for the LSDF


Page 1: Data Intensive Services for the LSDF

KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association

STEINBUCH CENTRE FOR COMPUTING - SCC

www.kit.edu

Data Intensive Services for the LSDF

Jos van Wezel

Page 2: Data Intensive Services for the LSDF

Steinbuch Centre for Computing

BioQuant, First Byte Symposium

Intro

• Past and Context
• The Data Challenge ahead
• LSDF at KIT
• Software Services
• Roadmap

26 May 2011

Page 3: Data Intensive Services for the LSDF

Steinbuch Centre for Computing

Computer Centre of the Karlsruhe Institute of Technology
• IT Services for KIT
• High Performance Computing
• Scientific Computing and Simulation
• Large Scale Data Management & Analysis
• Grid Computing
• Cloud Computing
• Virtualisation

Page 4: Data Intensive Services for the LSDF

Data services at SCC

GridKa – LHC Tier 1 centre / 2002
• WLCG Tier 1 centre
• 10 PB storage, 16000 cores, 40 Gb/s networking
• Dedicated to physics off-line computing

Biology contacts: Institute for Toxicology and Genetics at KIT / 2007
• Initial use of ‘spare’ GridKa capacity, indicating storage and computing needs

BioQuant / 2008
• Prof. Dr. Wolfrum and Prof. Dr. Juling: joint proposal
• Cooperation to procure storage for genomic research

Page 5: Data Intensive Services for the LSDF

LSDF development time line at SCC/KIT

First ideas for LSDF / early 2008
• Installation of an LSDF pilot with 150 TB storage and 4 servers
• Development of initial concepts, i.e. tiered storage, Hadoop
• Result: KIT proposal

SCC Helmholtz external review / spring 2009
• LSDF is an excellent idea, but DO plan beyond KIT
• Workshop held in 2/2009 to coordinate BioQuant and SCC efforts

Storage, funded by the State of Baden-Wuerttemberg / late 2009
• Tendering and negotiations by R. Eils (BioQuant) and R. Kupsch (SCC)
• Storage for systems biology in Heidelberg (@BioQuant)
• Storage for universities in Baden-Wuerttemberg (@SCC)
• Long-term digital archives (@SCC)
• Storage support and services for state universities (@SCC)

Compute cluster for DIC & cloud research / late 2009
• Bring computing (low latency) to storage
• Use Hadoop to allow fast distributed data access

Page 6: Data Intensive Services for the LSDF

LSDF Hardware today

Dedicated LSDF Data Acquisition Network
• 10 Gb/s redundant backbone with 2 Nexus routers
• Several KIT institutes: ITG, IPE, IAI, ANKA, GPI
• For one week now: 10 Gb/s to BioQuant

File Servers and On-line Storage
• IBM → 2 PB, 6 servers, SoFS
• DDN → 750 TB, 8 servers, GPFS

Computing cluster
• 464 cores, 2 TB total memory
• Directly attached to storage (GPFS/DDN)
• 110 TB HDFS, Hadoop native filesystem
• Available from the cloud environment OpenNebula
    • users can deploy their own dedicated VMs
    • reliable, highly flexible, and very fast to deploy

Archival and off-line storage
• Tape library
• 6 LTO 5 drives

Executive scientists: Serguei Bourov, Ariel Garcia

Page 7: Data Intensive Services for the LSDF

LSDF realms

Access to LSDF (KIT) via standard protocols
• Internal (inside firewall) via NFS/CIFS and DataBrowser
• External (outside firewall) via ‘grid’ tools or HTTP

Page 8: Data Intensive Services for the LSDF

Users of the LSDF @ SCC/KIT

Biology
• High Throughput Microscopy
• Gene Sequencing
• 0.5 PB/a, automated image processing

Synchrotron radiation facility (ANKA)
• Tomography beamlines
• 240 TB/a – 1 PB/a, data management

Climate research (IMK)
• Several instruments mounted on satellites
• 300 TB/a (till 2024), 20 years archiving

In development
• BioQuant Archives
• Biophysics (Nanoscopy, Nanoparticles, …)
• Arts and Humanities (DARIAH)
• Geophysics (Seismology, Applied seismic research)
• Many others
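
The yearly rates quoted on this slide add up quickly. The following is only a back-of-the-envelope sketch: it assumes constant rates, decimal units (1 PB = 1000 TB) and a 5-year horizon chosen purely for illustration; none of these assumptions comes from the slides.

```python
# Yearly data rates quoted on the slide, in TB per year.
# Assumption: decimal units, 1 PB = 1000 TB.
BIOLOGY = 500                      # 0.5 PB/a
ANKA_LOW, ANKA_HIGH = 240, 1000    # 240 TB/a to 1 PB/a
CLIMATE = 300                      # 300 TB/a


def cumulative_tb(rate_tb_per_year, years):
    """Total volume in TB after `years` at a constant yearly rate."""
    return rate_tb_per_year * years


def combined_range_tb(years):
    """Lower and upper bound in TB for the three communities together."""
    low = cumulative_tb(BIOLOGY + ANKA_LOW + CLIMATE, years)
    high = cumulative_tb(BIOLOGY + ANKA_HIGH + CLIMATE, years)
    return low, high


if __name__ == "__main__":
    low, high = combined_range_tb(5)   # 5 years is an illustrative horizon
    print(f"after 5 years: {low / 1000:.1f} to {high / 1000:.1f} PB")
```

Under these assumptions the three communities alone would accumulate between 5.2 and 9.0 PB in five years, before any of the users listed as "in development" are counted.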

Page 9: Data Intensive Services for the LSDF

Measurement Data

Data is generated at increasing rates

Cost per byte measured is decreasing

Cost per byte of storage is decreasing

[Figure: Storage Density, 2011 to 2025; axis ticks 10, 100, 1000, 10000]

Page 10: Data Intensive Services for the LSDF

Scientific data sources

• In the past, big data resulted from simulations on supercomputers
• Today, big data results from experiments, observations, measurements
• Data is valuable because it is either unique or costly to obtain, or both

Page 11: Data Intensive Services for the LSDF

From data to knowledge

The fourth paradigm
• Experiment
• Theory
• Simulation
• Data Exploration

Widely recognised, e.g.

“Riding the wave, How Europe can gain from the rising tide of scientific data.” Final report of the High Level Expert Group on Scientific Data. October 2010

Tony Hey, Stewart Tansley, Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, ISBN 978-0982544204, http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Jim Gray, eScience Talk at NRC-CSTB meeting, Mountain View CA, 11 January 2007, http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt

Page 12: Data Intensive Services for the LSDF

A collaborative Data Infrastructure

[Diagram labels]
• DARIAH, CESSDA, LifeWatch, ENES, etc.
• EUDAT, D4Science, ELIXIR, etc.
• LSDF
• Scientific Experiments

Page 13: Data Intensive Services for the LSDF

Key demands of modern data driven science

• Data storage and management beyond petabytes
• Long-term digital archiving of raw and published data
• Analysis with tools for data intensive computing
• Visualisation and data mining tools for large amounts of measurement data
• Integration of data handling with scientific workflows
• Support and services from IT and data experts

LSDF Blueprint

Page 14: Data Intensive Services for the LSDF

Workflow: applicable for many data sources

• Data is measured, buffered and validated in storage near the instrument (T0)
• Data is curated, registered and moved into the LSDF (T1)
• Data is processed for analysis; each analysis step produces new, derived data that is also registered, stored and archived (T2)
• New data is archived: immutable data
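
The workflow can be pictured as a register in which every dataset carries its tier and a pointer to the dataset it was derived from. The sketch below is a toy model only: `DataSet`, `Registry` and the identifiers are hypothetical names invented for illustration and are not LSDF components. Frozen records stand in for the "immutable data" point.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataSet:
    """A registered dataset; frozen to mirror 'immutable data'."""
    identifier: str
    tier: str                 # "T0", "T1" or "T2"
    parent: Optional[str]     # identifier of the dataset it was derived from


class Registry:
    """Hypothetical in-memory register of datasets (not an LSDF component)."""

    def __init__(self) -> None:
        self._items: Dict[str, DataSet] = {}

    def register(self, identifier: str, tier: str,
                 parent: Optional[str] = None) -> DataSet:
        if identifier in self._items:
            raise ValueError(f"{identifier!r} is already registered")
        if parent is not None and parent not in self._items:
            raise KeyError(f"unknown parent {parent!r}")
        dataset = DataSet(identifier, tier, parent)
        self._items[identifier] = dataset
        return dataset

    def lineage(self, identifier: str) -> List[str]:
        """Identifiers from the given dataset back to its raw origin."""
        chain: List[str] = []
        current: Optional[str] = identifier
        while current is not None:
            chain.append(current)
            current = self._items[current].parent
        return chain


if __name__ == "__main__":
    registry = Registry()
    registry.register("scan-001/raw", "T0")
    registry.register("scan-001/curated", "T1", parent="scan-001/raw")
    registry.register("scan-001/segmented", "T2", parent="scan-001/curated")
    print(registry.lineage("scan-001/segmented"))
```

Because each analysis step registers a new record instead of overwriting an old one, the provenance of any derived dataset can be traced back to the raw measurement.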

Page 15: Data Intensive Services for the LSDF

LSDF developments

Software for scientific data
• Data management
• Secure access and global authentication
• Archival and bit preservation
• Persistent identifiers

Data intensive computing
• Storage and computing optimisation
• Storage and file system design

Community services
• Helpdesk and support
• Integration of existing applications

Storage for the state of Baden-Württemberg
• Scientific data (BioQuant)
• Universities, archives, libraries
• Desktop data

13/4/2011

Scientific Experiments, Applications, Communities

LSDF Infrastructure

Technologies

“for happy users”

Page 16: Data Intensive Services for the LSDF

Data Services

File systems, file protocols, databases
• GPFS, NFS, CIFS, GridFTP, Oracle, MySQL

Hadoop
• Shared cluster-wide file system, Map/Reduce framework

Cloud/OpenNebula
• Fast deployment of virtual machines

iRODS
• Rule-Oriented Data System

Automated processing of large image stacks
• Kepler workflow engine

Data ingest, meta data
• ADALAPI
• DataBrowser

Page 17: Data Intensive Services for the LSDF

Meta Data

Meta data describes the contents of data

Everybody uses meta data:
• File name and extension (e.g. picture.jpg, budget.xls, Readme.doc)
• Location (e.g. /…/EU-projects/2011/Fishy/budget.xls)
• Personal know-how

Sufficient for small file systems, desktops

Try to locate a file or info somewhere in a file system
• 15 years old?
• in the file system of a colleague?
• in a 100 petabyte file system?
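
Once file names and directory paths stop scaling, the alternative is to search explicit meta data instead of guessing locations. A minimal sketch, assuming a flat list of catalogue entries; the entries, field names and paths are made up for illustration and do not describe the LSDF meta data scheme:

```python
from typing import Dict, List

# Hypothetical catalogue entries: a physical path plus descriptive meta data.
CATALOGUE: List[Dict[str, str]] = [
    {"path": "/store/0001.xls", "project": "Fishy", "year": "2011", "kind": "budget"},
    {"path": "/store/0002.jpg", "project": "Fishy", "year": "2011", "kind": "picture"},
    {"path": "/store/0003.xls", "project": "Other", "year": "1996", "kind": "budget"},
]


def find(catalogue: List[Dict[str, str]], **criteria: str) -> List[str]:
    """Return the paths of all entries whose meta data match every criterion."""
    return [
        entry["path"]
        for entry in catalogue
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # The caller asks by content, not by location.
    print(find(CATALOGUE, project="Fishy", kind="budget"))
```

The query names what the data is about; where the file physically lives becomes a detail the catalogue answers.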

Page 18: Data Intensive Services for the LSDF

Access to large scale data

Separate frameworks for data and meta data
• Good scalability and access
• Complicates transparent access

Page 19: Data Intensive Services for the LSDF

Hierarchical Catalog System (Repository)

[Diagram labels: APIs and Tools; Catalogs; LSDF Systems (Computing, Storage); Generic file tree]

Catalog levels shown in the diagram
• Logical Project Catalog (DB): LPN → LDN, meta data
• Logical Directory Catalog (DB): LDN → LDN, LFN
• Logical File Catalogs (DB): LFN → Physical File Name
• Meta data scheme repository

Projects shown in the diagram
• Zebrafish I
• Zebrafish II
• ANKA BL1
• Material research
• Digital objects in Arts and Humanities

Properties
• Sustainable and easily extensible for large amounts of data (size and number)
• Independent of data formats
• Performance by distributed access
• Safety by redundancy
• Use of open standards
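
The three catalog levels can be read as a chain of lookups: a project name (LPN) yields directories (LDN), a directory yields sub-directories and logical file names (LFN), and each LFN yields a physical file name. The sketch below models that chain with plain dictionaries. All catalog contents, names and paths are invented for illustration, the meta data attached to a project is left out, and the real catalogs are databases rather than in-memory tables.

```python
from typing import Dict, List

# All names and paths below are made up for illustration.
# Logical Project Catalog: LPN -> top-level LDNs (meta data omitted here).
PROJECT_CATALOG: Dict[str, List[str]] = {
    "zebrafish-1": ["zebrafish-1/plate-01"],
}

# Logical Directory Catalog: LDN -> child LDNs and LFNs.
DIRECTORY_CATALOG: Dict[str, Dict[str, List[str]]] = {
    "zebrafish-1/plate-01": {
        "dirs": ["zebrafish-1/plate-01/well-a1"],
        "files": ["lfn-0001"],
    },
    "zebrafish-1/plate-01/well-a1": {"dirs": [], "files": ["lfn-0002"]},
}

# Logical File Catalog: LFN -> physical file name.
FILE_CATALOG: Dict[str, str] = {
    "lfn-0001": "/storage/pool-a/0001.tif",
    "lfn-0002": "/storage/pool-b/0002.tif",
}


def resolve_directory(ldn: str) -> List[str]:
    """Physical file names below one logical directory, depth first."""
    entry = DIRECTORY_CATALOG[ldn]
    physical = [FILE_CATALOG[lfn] for lfn in entry["files"]]
    for child in entry["dirs"]:
        physical.extend(resolve_directory(child))
    return physical


def resolve_project(lpn: str) -> List[str]:
    """Physical file names of every file that belongs to a project."""
    physical: List[str] = []
    for ldn in PROJECT_CATALOG[lpn]:
        physical.extend(resolve_directory(ldn))
    return physical


if __name__ == "__main__":
    print(resolve_project("zebrafish-1"))
```

One consequence of the indirection: moving a file to other storage only changes its entry in the file catalog, while project and directory names stay valid.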

Page 20: Data Intensive Services for the LSDF

DataBrowser

• API: data and meta data organization
• GUI: file, data and project explorer

Functions:
• Data management
• Queries in meta data catalogues
• Up-/Download
• Control of data analysis and visualisation workflows

Properties:
• Easy-to-use
• Extensible
• World-wide access

Page 21: Data Intensive Services for the LSDF

ADALAPI (Abstract Data Access Layer API)

• Java class library
• Seamless application access to LSDF
• Independent of transfer protocol and location
• Authentication
    • X.509 certificates
    • user/passwd
• Protocols and file systems
    • local files
    • gsiftp
    • sftp
    • http(s)
    • hdfs

[Diagram labels: LSDF Storage Infrastructure; Applications, Tools, DataBrowser, Scientific exp.; DAQ; Grid; Workstations; Client software; Visualization; Cloud]

Page 22: Data Intensive Services for the LSDF

Conclusions

• Important services have been deployed
• Different communities at KIT are successfully using the LSDF (storing as well as on-line computing)
• Development of new tools is in progress

Roadmap
• LSDF will grow, adding users and hardware
• Contributing to EUDAT and Helmholtz Association infrastructures
• Adding software and community services and support to hardware services

Page 23: Data Intensive Services for the LSDF

The Steinbuch Centre for Computing at KIT congratulates BioQuant on its successful LSDF4LS launch. We are proud to cooperate with them and look forward to mutually enhancing science by deploying innovative large scale data services.

Page 24: Data Intensive Services for the LSDF

You have the data, we have the technology

Thank you very much for your attention

Many thanks to: Serguei Bourov, Ariel Garcia, Rainer Kupsch, Achim Streit, Rainer Stotzka and all other KIT colleagues making LSDF happen
