43
Building a scalable, open source storage and processing solution for biodiversity data Phil Cryer Anthony Goddard Thursday, November 12, 2009

Building A Scalable Open Source Storage Solution

Embed Size (px)

DESCRIPTION

The Biodiversity Heritage Library (BHL), like many other projects within biodiversity informatics, maintains terabytes of data that must be safeguarded against loss. Further, a scalable and resilient infrastructure is required to enable continuous data interoperability, as BHL provides unique services to its community of users. This volume of data and associated availability requirements present significant challenges to a distributed organization like BHL, not only in funding capital equipment purchases, but also in ongoing system administration and maintenance. A new standardized system is required to bring new opportunities to collaborate on distributed services and processing across what will be geographically dispersed nodes. Such services and processing include taxon name finding, indexes or GUID/LSID services, distributed text mining, names reconciliation and other computationally intensive tasks, or tasks with high availability requirements.

Citation preview

Page 1: Building A Scalable Open Source Storage Solution

Building a scalable, open source storage and processing solution for

biodiversity data

Phil CryerAnthony Goddard

Thursday, November 12, 2009

Page 2: Building A Scalable Open Source Storage Solution

> Biodiversity Heritage Library's data 

• all BHL storage is handled by the Internet Archive

• 38,000+ scanned books

• approximately 48 terabytes of data

• unable to self-host

Thursday, November 12, 2009

Page 3: Building A Scalable Open Source Storage Solution

> BHL - Europe

• 3 year, EU funded project

• 28 major natural history museums, botanical gardens and other cooperating institutions

• third file-store of all BHL data

• collecting cultural heritage from all over Europe

Thursday, November 12, 2009

Page 4: Building A Scalable Open Source Storage Solution

> Data explosion

• more data being created

• more data being saved

• more data tomorrow

• storage has not kept up with Moore’s Law

• this presentation will be saved online, more data!

Thursday, November 12, 2009

Page 5: Building A Scalable Open Source Storage Solution

> Data explosion

• more data being created

• more data being saved

• more data tomorrow

• storage has not kept up with Moore’s Law

• this presentation will be saved online, more data!

Thursday, November 12, 2009

Page 6: Building A Scalable Open Source Storage Solution

> Potential #fail’s

Thursday, November 12, 2009

Page 7: Building A Scalable Open Source Storage Solution

> Problem 1 - Data access

• file size we can’t store

• latency of large files

• quality user experience

• processing data-mining

Thursday, November 12, 2009

Page 8: Building A Scalable Open Source Storage Solution

> Problem 1 - Data access

• file size we can’t store

• latency of large files

• quality user experience

• processing data-mining

Access denied...

Thursday, November 12, 2009

Page 9: Building A Scalable Open Source Storage Solution

> Problem 2 - Copyright concerns

• international copyright concerns

• potential related funding issues

• we’d rather not let this be an issue ©

Thursday, November 12, 2009

Page 10: Building A Scalable Open Source Storage Solution

> Problem 3 - Redundancy

Thursday, November 12, 2009

Page 11: Building A Scalable Open Source Storage Solution

> Problem 3 - Redundancy

• computers crash

• hard drives die

• networks fail

• natural disasters occur

Thursday, November 12, 2009

Page 12: Building A Scalable Open Source Storage Solution

> Problem 3 - Redundancy

• computers crash

• hard drives die

• networks fail

• natural disasters occur

but...

This is NOT a problem!

Thursday, November 12, 2009

Page 13: Building A Scalable Open Source Storage Solution

...so plan for it.

Thursday, November 12, 2009

Page 14: Building A Scalable Open Source Storage Solution

Thursday, November 12, 2009

Page 15: Building A Scalable Open Source Storage Solution

Current

Thursday, November 12, 2009

Page 16: Building A Scalable Open Source Storage Solution

Thursday, November 12, 2009

Page 17: Building A Scalable Open Source Storage Solution

Thursday, November 12, 2009

Page 18: Building A Scalable Open Source Storage Solution

> Site 1 - Internet Archive

Thursday, November 12, 2009

Page 19: Building A Scalable Open Source Storage Solution

> Site 2 - MBL, Woods Hole

Thursday, November 12, 2009

Page 20: Building A Scalable Open Source Storage Solution

> Site 3 - NHM, London

...followed by new Data center

Thursday, November 12, 2009

Page 21: Building A Scalable Open Source Storage Solution

Data Centre – “Darwin Repository”

• €600,000 Funding secured from eContentPlus• Suitable location found with very good development

potential in collaboration with Science Museum.• Economy of scale provides additional avenues for co-

development of services.– Disaster Recovery and Business Continuity for all

Museums (help with ongoing and running costs)• DCMS funding sought to help with development.

– e-Infrastructure European initiative• Building Digital Repositories for Scientific

Communities– PESI (Biodiversity)

Thursday, November 12, 2009

Page 22: Building A Scalable Open Source Storage Solution

Proposed Data Centre Location

Swindon

Wroughton Science Museum ©2008 Google – Imagery ©2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye, Map data ©2008 Tele Atlas

Thursday, November 12, 2009

Page 23: Building A Scalable Open Source Storage Solution

Vendor Stakeholders / Partners

• Identified Technology Partners*

• Additional Funding Partners*

*Note: Discussions are ongoing with all Partners and may be at different stages

Thursday, November 12, 2009

Page 24: Building A Scalable Open Source Storage Solution

Long Term Sustainability

• No Dripping Tap

– Business case should provide for significant self funding opportunities.

• Diversity

– Darwin Repository (Data Centre) will provide an economy of scale that will provide significant efficiency gains.

• Green technology to minimise carbon footprint and provide industry leadership.

Thursday, November 12, 2009

Page 25: Building A Scalable Open Source Storage Solution

> Distributed storage

• write once, read anywhere

• replication and fault tolerance

• error correction

• automatic redundancy

• scalable horizontally

Thursday, November 12, 2009

Page 26: Building A Scalable Open Source Storage Solution

> Distributed storage - Options

• fully hosted storage (cloud)

• hosted with own storage (private cloud)

• self hosted with proprietary hardware (Sun Thumper)

• self hosted with commodity hardware

Thursday, November 12, 2009

Page 27: Building A Scalable Open Source Storage Solution

> Distributed storage - GlusterFS

• GlusterFS: a cluster file-system capable of scaling to several peta-bytes

• open source software on commodity hardware

• tunable performance • simple to install and manage

• offers seamless expansion

Thursday, November 12, 2009

Page 28: Building A Scalable Open Source Storage Solution

> Distributed storage - Archival 

• Fedora-commons is an open source repository

• accounts for all changes, so built-in version control

• provides disaster recover

• open standards to mesh with future file formats

• provides open sharing services such as OAI-PMH

Thursday, November 12, 2009

Page 29: Building A Scalable Open Source Storage Solution

> Distributed storage - Mirrored data

• now we have redundancy

• in fact, multiple redundant copies

• provides fault tolerance

• offers load balancing

• gives us future geographical distribution

Thursday, November 12, 2009

Page 30: Building A Scalable Open Source Storage Solution

> Now we have lots of computers...

Thursday, November 12, 2009

Page 31: Building A Scalable Open Source Storage Solution

> Distributed processing

• more abilities available than just storing data

• with distributed storage comes distributed processing

• distributed processing means faster answers

• faster answers mean new questions

• lather, rinse, repeat

Thursday, November 12, 2009

Page 32: Building A Scalable Open Source Storage Solution

> Distributed processing

• make your data more useful

• image and OCR processing

• distributed web services

• identifier resolution pools

• map/reduce frameworks

• generate new visualizations, text mining, NLP

Thursday, November 12, 2009

Page 33: Building A Scalable Open Source Storage Solution

TaxonFinder TaxonFinder TaxonFinder TaxonFinder

WebService WebService

Load Balancer

Cluster Node

Cluster

Site

Request

TaxonFinder TaxonFinder TaxonFinder TaxonFinder

WebService WebService

Load Balancer

Cluster Node

Cluster

Site

TaxonFinder TaxonFinder TaxonFinder TaxonFinder

WebService WebService

Load Balancer

Cluster Node

Cluster

Site

> Distributed processing

Thursday, November 12, 2009

Page 34: Building A Scalable Open Source Storage Solution

> Some assembly required (optional)

• our example uses new, faster commodity hardware

• but it could run on any hardware that can run Linux

• you could chain old "out dated" computers together

• build your own cluster for next to nothing (host it in your basement)

• solves some infrastructure funding issues

• hardware vendor neutrality

Thursday, November 12, 2009

Page 35: Building A Scalable Open Source Storage Solution

> Our proof of concept

• we ran a six box cluster to demonstrate GlusterFS

• ran stock Debian/GNU Linux

• simulated hardware failures

• synced data with a remote cluster

• ran map/reduce jobs

• defined procedures, configurations and build scripts

Thursday, November 12, 2009

Page 36: Building A Scalable Open Source Storage Solution

Thursday, November 12, 2009

Page 37: Building A Scalable Open Source Storage Solution

Thursday, November 12, 2009

Page 38: Building A Scalable Open Source Storage Solution

Raw disk arrayCommodity SATA controllers

Commodity HostsDedicated Storage Network

discoHadoop

Fedora CommonsMulgara triplestore / rdf

wwwmod_glusterfs

REST

GlusterFSExt4 (exabyte)

‘Network RAID’

rsyncBitTorrent

HTTP

stor

age

met

adat

aprocessing

API

presentationfile system

sync

sup

port

Thursday, November 12, 2009

Page 39: Building A Scalable Open Source Storage Solution

> Distributed storage - Projected costs

Graph from Backblaze (http://www.backblaze.com)

$246,000

Thursday, November 12, 2009

Page 40: Building A Scalable Open Source Storage Solution

> Other avenues - Cloud pilot

• BHL is participating in a pilot with New York Public Library and Duraspace

• Duraspace would provide a link to cloud providers

• pilot to show feasibility of hosting

• testing use of image server, other services in the cloud

• cloud could seed new clusters

Thursday, November 12, 2009

Page 41: Building A Scalable Open Source Storage Solution

> Code (63 6f 64 65)

• all of our code and configurations are open source

• hosted on Google Code

• get involved

• join the mailing-lists

• follow us on Twitter

• ask questions, we'll help!

Thursday, November 12, 2009

Page 42: Building A Scalable Open Source Storage Solution

> It’s your turn...

• similar projects?

• distributed services and processing?

• where can this be best applied?

• resilient services on top of storage

• names processing?

• LSID resolution pools?

• image processing?

• text-mining / NLP?

• #biodiv webservices?

Thursday, November 12, 2009

Page 43: Building A Scalable Open Source Storage Solution

Phil Cryer

Missouri Botanical GardenBiodiversity Heritage Library

[email protected]://philcryer.com@fak3r

Anthony Goddard

MBLWHOI LibraryBiodiversity Heritage Library

[email protected]://anthonygoddard.com@anthonygoddard

Web: http://www.biodiversitylibrary.org/Code, Support: http://code.google.com/p/bhl-bitsTwitter: @BioDivLibrary (tag #bhl)

Thursday, November 12, 2009