Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957

Data Services
What, Why, How

e-Research Meeting
NeSC, 2nd March 2005

e-Research within The University of Edinburgh

Overview

• The difficulty with data

• Data Services

• Data Middleware

• Data Repositories

The Data Deluge

• Entering an age of data
  – Data Explosion
    – CERN: LHC will generate 1 GB/s = 10 PB/y
    – VLBA (NRAO) generates 1 GB/s today
    – Pixar generates 100 TB/movie
  – Storage getting cheaper
• Data stored in many different ways
  – Data resources
    – Relational databases
    – XML databases / files
    – Result files
• Need ways to facilitate
  – Data discovery
  – Data access
  – Data integration
• Empower e-Business and e-Science
  – The Grid is a vehicle for achieving this
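
The LHC figure above can be checked with a line of arithmetic. A sustained 1 GB/s over a full calendar year would come to roughly 31.5 PB, so 10 PB/y is consistent with about 10^7 seconds of data-taking per year. That duty cycle is an assumption made here for illustration; the slide does not state it. A minimal sketch, using decimal units (1 GB = 10^9 bytes):

```python
# Back-of-envelope check of "1 GB/s = 10 PB/y".
GB = 10**9   # bytes, decimal units
PB = 10**15  # bytes, decimal units

# Assumption (not stated on the slide): about 1e7 seconds of data-taking per year.
ASSUMED_RUN_SECONDS_PER_YEAR = 10**7
CALENDAR_SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_volume_pb(rate_gb_per_s, seconds_per_year):
    """Return the volume in PB produced at a constant rate over the given time."""
    return rate_gb_per_s * GB * seconds_per_year / PB


if __name__ == "__main__":
    # With the assumed duty cycle this gives 10 PB.
    print(yearly_volume_pb(1, ASSUMED_RUN_SECONDS_PER_YEAR))
    # Running non-stop for a calendar year gives about 31.5 PB.
    print(yearly_volume_pb(1, CALENDAR_SECONDS_PER_YEAR))
```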

What is e-Science?

• Goal: to enable better research
• Method: Invention and exploitation of advanced computational methods
  – to generate, curate and analyse research data
    – From experiments, observations and simulations
    – Quality management, preservation and reliable evidence
  – to develop and explore models and simulations
    – Computation and data at extreme scales
    – Trustworthy, economic, timely and relevant results
  – to enable dynamic distributed virtual organisations
    – Facilitate collaboration with resource sharing
    – Security, reliability, accountability, and manageability

Multiple, independently managed sources of data – each with own time-varying structure

Creative researchers discover new knowledge by combining data from multiple sources

Composing Observations in Astronomy

No. & sizes of data sets as of mid-2002, grouped by wavelength

• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects

Data and images courtesy Alex Szalay, Johns Hopkins

Data Services: challenges

• Scale
  – Many sites, large collections, many uses
• Longevity
  – Research requirements outlive technical decisions
• Diversity
  – No “one size fits all” solutions will work
  – Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
  – Independently owned & managed
    – No common goals
    – No common design
    – Work hard for agreements on foundation types and ontologies
    – Autonomous decisions change data, structure, policy, …
  – Geographically distributed
• and I haven’t even mentioned security yet!

The Discovery Process

• Choosing data sources
  – How do you find them?
  – How do they describe and advertise them?
  – Is the equivalent of Google possible?
• Obtaining access to that data
  – Overcoming administrative barriers
  – Overcoming technical barriers
• Understanding that data and extracting from multiple sources
  – The parts you care about for your research
• Combining them using sophisticated models
  – The picture of reality in your head
• Analysis on scales required by statistics
  – Coupling data access with computation
• Repeated Processes
  – Examining variations, covering a set of candidates
  – Monitoring the emerging details

Small problems

• Not just “Grand Challenges”!
  – Also the small problems
• For instance:
  – What happens to data when a researcher leaves a team?
  – How can a research leader point to “popular” data when a new researcher joins?
  – How can you manage your data when you start to run out of local storage space?
  – How do I get my data from one format/database to another?
  – How do I combine my data with your data?
• You need to manage your data: metadata
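
One way to picture "manage your data: metadata" is a small catalogue record kept alongside each data set, so that the data remains findable after its owner moves on. The sketch below is illustrative only: the `DatasetRecord` fields, the example names and the paths are hypothetical and do not come from any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata describing one data set."""
    name: str
    owner: str
    location: str
    data_format: str
    description: str = ""
    keywords: List[str] = field(default_factory=list)


def find_by_keyword(records, keyword):
    """Return the records tagged with the given keyword."""
    return [r for r in records if keyword in r.keywords]


def to_json(records):
    """Serialise the catalogue so it can outlive any one researcher's notes."""
    return json.dumps([asdict(r) for r in records], indent=2)


if __name__ == "__main__":
    # Made-up example entries.
    catalogue = [
        DatasetRecord("run-042", "alice", "/data/run-042.csv", "csv",
                      "Example calibration run", ["calibration", "popular"]),
        DatasetRecord("run-043", "bob", "/archive/run-043.xml", "xml",
                      "Example survey results", ["survey"]),
    ]
    for record in find_by_keyword(catalogue, "popular"):
        print(record.name, record.location)
    print(to_json(catalogue))
```

A new team member could then ask the catalogue for "popular" data instead of asking whoever happens to remember where it lives.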

What is a data service?

• An interface to a stored collection of data
  – e.g. Google and Amazon
  – web services
• But the data could be:
  – replicated
  – shared
  – federated
  – virtual
  – incomplete
• Don’t care about the underlying representation
  – do care about the information it represents
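
The idea of an interface that hides the underlying representation can be sketched in a few lines. Everything here is hypothetical: `DataService`, `InMemoryService`, `DelimitedTextService` and `FederatedService` are names invented for illustration, not part of any real middleware. The point is that callers use the same `query` call whether the data is held in memory, parsed from text, or federated across both.

```python
from typing import Dict, Iterable, List, Protocol

Row = Dict[str, str]


class DataService(Protocol):
    """Hypothetical interface: callers see records, not storage details."""

    def query(self, **criteria: str) -> List[Row]:
        ...


def _matches(row: Row, criteria: Dict[str, str]) -> bool:
    """True if the row has every requested field value."""
    return all(row.get(key) == value for key, value in criteria.items())


class InMemoryService:
    """Backed by a list of dictionaries."""

    def __init__(self, rows: Iterable[Row]):
        self._rows = list(rows)

    def query(self, **criteria: str) -> List[Row]:
        return [r for r in self._rows if _matches(r, criteria)]


class DelimitedTextService:
    """Backed by comma-separated text with a header line."""

    def __init__(self, text: str):
        header, *lines = text.strip().splitlines()
        names = header.split(",")
        self._rows = [dict(zip(names, line.split(","))) for line in lines]

    def query(self, **criteria: str) -> List[Row]:
        return [r for r in self._rows if _matches(r, criteria)]


class FederatedService:
    """Presents several services as one, in the order they were given."""

    def __init__(self, *services: DataService):
        self._services = services

    def query(self, **criteria: str) -> List[Row]:
        results: List[Row] = []
        for service in self._services:
            results.extend(service.query(**criteria))
        return results
```

A caller holding a `FederatedService` cannot tell, and does not need to know, which backend supplied each record.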

Examples of Data Services

• Many Data Services and applications
  – Commercial databases
  – Web interfaces
  – Applications developed individually by groups and projects
• Also many places to get hold of public data
  – Publications and citation servers
  – Results servers
• Highlight a few of these
  – principally ones trying to bridge the gap between “local” and “distributed”
• But… no such thing as a free lunch
  – Things are not yet “Plug and Play”
  – You will need to expend some effort to use these tools effectively

OGSA-DAI / DQP

• Data Access and Integration / Distributed Query Processing
  – http://www.ogsadai.org.uk
  – Provides a way to access and query heterogeneous, structured data resources
    – Relational databases
    – XML databases
    – files
  – Provides a framework for extending services
    – more smarts, closer to the data
  – “Everything looks like a database”
• National Grid Service starting to host
  – both through OGSA-DAI and Oracle
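
The "everything looks like a database" idea can be illustrated without any grid middleware. The sketch below does not use the OGSA-DAI API. It uses only Python's standard `sqlite3` and `csv` modules to expose a delimited file as a table, then joins it with an ordinary relational table in a single SQL query. The table names, column names and sample values are made up.

```python
import csv
import io
import sqlite3


def load_csv_as_table(conn, table, csv_text):
    """Expose delimited text as a SQL table (all columns stored as TEXT).

    Table and column names are interpolated directly, so this sketch
    assumes they come from a trusted source.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


def demo():
    """Join a relational resource with a file resource using one query."""
    conn = sqlite3.connect(":memory:")
    # A "relational" resource.
    conn.execute("CREATE TABLE samples (id TEXT, site TEXT)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)",
                     [("s1", "Edinburgh"), ("s2", "Glasgow")])
    # A "file" resource, made to look like a table.
    load_csv_as_table(conn, "results", "sample_id,value\ns1,0.5\ns2,0.9\n")
    rows = conn.execute(
        "SELECT samples.site, results.value FROM samples "
        "JOIN results ON results.sample_id = samples.id "
        "ORDER BY samples.id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for site, value in demo():
        print(site, value)
```

Once the file is presented as a table, the person writing the query no longer needs to care that one source was a database and the other a flat file.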

SRB

• Storage Resource Broker
  – http://www.sdsc.edu/srb/
  – Provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations.
    – may be heterogeneous, distributed and/or replicated
  – Many different ways of connecting
  – Can connect SRB systems together
    – zoneSRB
  – “Everything looks like a filesystem”
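
Access by logical name and attribute, rather than by physical location, can be sketched as a small catalogue. This is not the SRB client API. `LogicalCatalogue` and the site and path strings used with it are hypothetical, chosen only to show a logical name resolving to several replicas and being discoverable by its attributes.

```python
class LogicalCatalogue:
    """Hypothetical catalogue mapping logical names to replicas and attributes."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, physical_location, **attributes):
        """Record one physical replica, plus attributes, for a logical name."""
        entry = self._entries.setdefault(
            logical_name, {"replicas": [], "attributes": {}})
        entry["replicas"].append(physical_location)
        entry["attributes"].update(attributes)

    def locate(self, logical_name):
        """Return all known physical replicas for a logical name."""
        return list(self._entries[logical_name]["replicas"])

    def find(self, **attributes):
        """Return the logical names whose attributes match every criterion."""
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry["attributes"].get(key) == value
                   for key, value in attributes.items()))


if __name__ == "__main__":
    catalogue = LogicalCatalogue()
    # Two replicas of the same logical file at made-up sites.
    catalogue.register("/home/demo/survey.dat", "siteA:/disk1/f001",
                       project="survey")
    catalogue.register("/home/demo/survey.dat", "siteB:/tape/x9",
                       project="survey")
    print(catalogue.find(project="survey"))
    print(catalogue.locate("/home/demo/survey.dat"))
```

The user works with the logical path throughout. Which replica is actually read is a decision the catalogue, or a broker built on it, can make on the user's behalf.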

SRM and more

• Storage Resource Managers
  – http://forge.gridforum.org/projects/gsm-wg/
  – a joint effort between a number of institutions
    – EU DataGrid/CERN, FermiLab, LBNL, JL
  – to define a standardised interface to “Storage Resource Managers” so that different implementations can work together
  – principally between physics communities, extending further now
• Many other examples of data middleware
  – Replication management and location: RLS, QCDGrid
  – Many “datagrids”: SciDAC, Gfarm
  – GridFTP for efficient transfer
  – Packaged software: Virtual Data Toolkit

EDINA and friends

• EDINA
  – http://edina.ac.uk/
  – Offers the UK tertiary education and research community networked access to a library of data, information and research resources, e.g. geographical data
• Digital Curation Centre
  – http://www.dcc.ac.uk
  – supports UK institutions to store, manage and preserve these data to ensure their enhancement and their continuing long-term use.
• Other national data centres:
  – MIMAS, UKDA, CCLRC DataPortal…

Summary

• Data is important to research
  – across all disciplines
• There is already a large amount of data
  – but it’s sometimes difficult to find and bring together
• Data Services are built to standards
  – which define particular functionality
• Data Services should be composable
  – so that it is easier to work with data
• There is already software out there
  – so it is possible to evaluate against your requirements