Big data – no big deal for curation?

Graham Pryor, Associate Director, UK Digital Curation Centre

Eduserv Symposium 2012: Big Data, Big Deal?


DESCRIPTION

If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address. The DCC is currently working with 18 HEIs to support and develop their capabilities in the management of research data and, whilst the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective? We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints. That’s not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than diversity and complexity. This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.


Page 1: Graham Pryor

Big data

– no big deal for curation?

Graham Pryor, Associate Director, UK Digital Curation Centre

Eduserv Symposium 2012: Big Data, Big Deal?


This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License

Because good research needs good data

Page 2: Graham Pryor

Big data – big deal or same deal?

“What need the bridge much broader than the flood?

The fairest grant is the necessity.

Look, what will serve is fit…”

Much Ado About Nothing, Act 1 Scene 1

Page 3: Graham Pryor

Eduserv Symposium 2012 – speakers’ Research Areas

• Operating Systems & Networking

• Computer and Network Security

• Distributed Systems

• Mobile Computing

• Wireless Networking

• Software Engineering

• High performance compute clusters

• Cloud and grid technologies

• Effective management of large clusters and cluster file-systems

• Very large database systems (architecture, management and application optimization)

Page 4: Graham Pryor

The Digital Curation Centre

• a consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)

• launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline

• funded by JISC to build capacity, capability and skills in research data management across the UK HEI community

• awarded additional HEFCE funding 2011/13 for

– the provision of support to national cloud services

– targeted institutional development

Page 5: Graham Pryor

Three perspectives

Scale and complexity

– Volume and pace

– Infrastructure

– Open science

Policy

– Funders

– Institutions

– Ethics & IP

Management

– Storage

– Incentives

– Costs & Sustainability

http://www.nonsolotigullio.com/effettiottici/images/escher.jpg/

Page 6: Graham Pryor

Challenges of scale and complexity

• Globally, >100,000 neuroscientists study the CNS, generating massive, intricate and highly interrelated datasets

• Analysts require access to these data to develop algorithms, models and schemata that characterise the underlying system

• Resources and actors are rarely collocated and are therefore difficult to combine.

• The virtual laboratory is a federation of server nodes that allows distributed data to be stored local to acquisition

• Analysis codes can be uploaded and executed on the nodes so that derived datasets need not be transported over low bandwidth connections

• Data and analysis codes are described by structured metadata, providing an index for search, annotation and audit over workflows leading to scientific outcomes

• Users access the distributed resources through a web portal emulating a PC desktop

http://www.carmen.org.uk/
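The upload-and-execute model in the bullets above amounts to moving the code to the data rather than the data to the code. Here is a minimal sketch of that pattern in Python; the node address, the /execute endpoint and the dataset identifier are hypothetical illustrations, not the actual CARMEN interface:

import json
import urllib.request

# Hypothetical federation node that stores data local to acquisition.
NODE_URL = "https://node.example.org"

# Small analysis routine shipped to the node, so the large recording
# itself never has to cross a low-bandwidth connection.
ANALYSIS_CODE = """
def analyse(recording):
    # Reduce a large spike-train recording to compact summary statistics.
    return {"events": len(recording), "mean": sum(recording) / len(recording)}
"""

def run_remote(dataset_id, code):
    """Submit analysis code for execution on the node holding the data;
    only the small derived result travels back over the network."""
    payload = json.dumps({"dataset": dataset_id, "code": code}).encode("utf-8")
    request = urllib.request.Request(
        NODE_URL + "/execute",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

result = run_remote("recording-0042", ANALYSIS_CODE)
print(result)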

But this is only talking terabytes…

Page 7: Graham Pryor

Big data? – The Large Hadron Collider

• Predicted annual generation of around 15 petabytes (15 million gigabytes) of data

• Would need >1,700,000 dual layer DVDs

Searching for the Higgs Boson
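A quick sanity check on those figures, assuming the standard 8.5 GB capacity of a dual-layer DVD: 15 PB is roughly 15,000,000 GB, and 15,000,000 ÷ 8.5 ≈ 1,765,000 discs, consistent with the >1,700,000 DVD estimate above.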

Page 8: Graham Pryor

Big data – the GridPP solution

“With GridPP you need never have those data processing blues again…”

http://www.gridpp.ac.uk/about

With the Large Hadron Collider running at CERN, the grid is being used to process the accompanying data deluge. The UK grid is contributing more than the equivalent of 20,000 PCs to this worldwide effort.

Crowdsourcing for the LHC

Home and office computer users can sign up to the LHC@home project (based at Queen Mary, University of London), which makes use of idle CPU time. So far, 40,000 users in more than 100 countries have contributed the equivalent of 3,000 years on a single computer to the project.

Page 9: Graham Pryor

Yet… Data Preservation in High Energy Physics?

Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are in many cases unique. At the same time, HEP has no coherent strategy for data preservation and re-use, and many important and complex data sets are simply lost.

David M. South, on behalf of the ICFA DPHEP Study Group, arXiv:1101.3186v1 [hep-ex]

Page 10: Graham Pryor

Big data in genomics

These studies are generating valuable datasets which, due to their size and complexity, need to be skilfully managed…

Page 11: Graham Pryor

There’s a bigger deal than big data…

Socio-technical management perspectives

Information systems perspectives

Research practice perspectives

1.

• Identify drivers and champions

• Analyse stakeholders, issues

• Identify capability gaps

• Assess costs, benefits, risks

2.

• Inventory data assets

• Profile norms, roles, values

• Identify capability gaps

• Analyse current workflows

3.

• Produce feasible, desirable changes

• Evaluate fitness for purpose

Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012

Page 12: Graham Pryor

The DCC - building capacity and capability through targeted institutional development

• 18 institutional engagements, 14 roadshows

• advice and assistance in strategy and policy

• use of curation tools for audit and planning

• training and skills transfer

Page 13: Graham Pryor

Why do we do this?

1. Reports that researchers are often unaware of threats and opportunities

Page 14: Graham Pryor

“Departments don’t have guidelines or norms for personal back-up and researcher procedure, knowledge and diligence varies tremendously. Many have experienced moderate to catastrophic data loss”

Incremental Project Report, June 2010

http://www.flickr.com/photos/mattimattila/3003324844/

Page 15: Graham Pryor

Why do we do this?

1. Reports that researchers are often unaware of threats and opportunities

2. There is a lack of clarity in terms of skills availability and acquisition

Page 16: Graham Pryor

…researchers are reluctant to adopt new tools and services unless they know someone who can recommend or share knowledge about them. Support needs to be based on a close understanding of the researchers’ work, its patterns and timetables.

Page 17: Graham Pryor

Why do we do this?

1. Reports that researchers are often unaware of threats and opportunities

2. There is a lack of clarity in terms of skills availability and acquisition

3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders

Page 18: Graham Pryor

EPSRC expects all those institutions it funds

• to have developed a roadmap aligning their policies and processes with EPSRC’s nine expectations by 1st May 2012

• to be fully compliant with each of those expectations by 1st May 2015

• to recognise that compliance will be monitored and non-compliance investigated, and that

• failure to share research data could result in the imposition of sanctions

Page 19: Graham Pryor

Why do we do this?

1. Reports that researchers are often unaware of threats and opportunities

2. There is a lack of clarity in terms of skills availability and acquisition

3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders

4. …and legislators

Page 20: Graham Pryor

Rules and regulations… Compliance

• Data Protection Act 1998 – Rights, Exemptions, Enforcement

• Freedom of Information Act 2000 – Climategate, Tree Rings, Tobacco and… (what’s next?)

• Computer Misuse Act 1990 – etc. etc. etc…

Page 21: Graham Pryor

Why do we do this?

1. Reports that researchers are often unaware of threats and opportunities

2. There is a lack of clarity in terms of skills availability and acquisition

3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders

4. …and legislators

5. The advantages from planning, openness and sharing are not understood

Page 22: Graham Pryor

Open to all? Case studies of openness in research

Choices are made according to context, with degrees of openness reached according to:

• The kinds of data to be made available

• The stage in the research process

• The groups to whom data will be made available

• On what terms and conditions it will be provided

Default position of most:

• YES to protocols, software, analysis tools, methods and techniques

• NO to making research data content freely available to everyone

After all, where is the incentive?

Angus Whyte, RIN/NESTA, 2010

Page 23: Graham Pryor

DCC Institutional Engagements

http://www.dcc.ac.uk/community/institutional-engagements

Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012

Page 24: Graham Pryor

Main institutional concerns

– Compliance

– Asset management

– Cost benefits

– Incentivisation

– Complexity of the data environment

And big data? There has been no mention yet of any specific challenge from big data but…

Institutions are providing resources to work on big data, both equipment and people, and more importantly…

…the issues central to effective data management are common across the data spectrum, irrespective of size

Page 25: Graham Pryor

Some current institutional engagements

Assessing needs

RDM roadmaps

Piloting tools e.g. DataFlow

Policy development

Policy implementation

Page 26: Graham Pryor

Support offered by the DCC

Assess needs – Make the case – Develop support and services

DCC support team:

• RDM policy development

• Customised Data Management Plans

• DAF & CARDIO assessments

• Guidance and training

• Workflow assessment

• Advocacy to senior management

• Institutional data catalogues

• Pilot RDM tools

…and support policy implementation

Page 27: Graham Pryor

Four DCC Tools

Page 28: Graham Pryor

Your Data as Assets: DAF

• What are the characteristics of your research data assets?

– Number?

– Scale?

– Complexity?

– Dependencies?

– Liabilities?

• Why do researchers act the way they do with respect to data?

• Which data do they need to undertake productive research?

Page 29: Graham Pryor

DMP Online is a web-based data management planning tool that allows you to build and edit plans according to the requirements of the major UK funders. The tool also contains helpful guidance and links for researchers and other data professionals.

http://www.dcc.ac.uk/dmponline

Page 30: Graham Pryor

CARDIO is an online tool for departments or research groups to identify their current data management capabilities and identify coordinated pathways to future enhancement via a dedicated knowledge base. CARDIO emphasises a collaborative, consensus-driven approach, and enables benchmarking with other groups and institutions.

http://cardio.dcc.ac.uk/

Page 31: Graham Pryor

DRAMBORA is an audit methodology and tool for identifying and planning for the management of risks which may threaten the availability and/or usability of content in a digital repository or archive.

http://www.repositoryaudit.eu

Page 32: Graham Pryor

So, big data – no big deal for curation?

• Yes, it’s big

• It’s also very complex

• There is no single technology solution

• Issues of human infrastructure are possibly a bigger challenge

• But for big data aficionados the technology challenges are big enough

Page 33: Graham Pryor

Data Management – infrastructure and data storage challenges...

The case for cloud computing in genome informatics. Lincoln D Stein, May 2010

• Scalability

• Cost-effectiveness

• Security (privacy and IPR)

• Robustness and resilience

• Low entry barrier

• Ease of use

• Data-handling / transfer / analysis capabilities

Page 34: Graham Pryor

Help desk: 0131 651 1239

[email protected]

www.dcc.ac.uk