The Biodiversity Heritage Library (BHL), like many other projects within biodiversity informatics, maintains terabytes of data that must be safeguarded against loss. Further, a scalable and resilient infrastructure is required to enable continuous data interoperability, as BHL provides unique services to its community of users. This volume of data and associated availability requirements present significant challenges to a distributed organization like BHL, not only in funding capital equipment purchases, but also in ongoing system administration and maintenance. A new standardized system is required to bring new opportunities to collaborate on distributed services and processing across what will be geographically dispersed nodes. Such services and processing include taxon name finding, indexes or GUID/LSID services, distributed text mining, names reconciliation and other computationally intensive tasks, or tasks with high availability requirements.
- 1. Building a scalable, open source storage and processing solution for biodiversity dataPhil CryerAnthony Goddard Thursday, November 12, 2009
2. > Biodiversity Heritage Library's data all BHL storage is handled by theInternet Archive 38,000+ scanned books approximately 48terabytes of data unable to self-host Thursday, November 12, 2009 3. >BHL - Europe 3 year, EU funded project 28 major natural history museums,botanical gardens and othercooperating institutions third file-store of all BHL data collecting cultural heritage from allover Europe Thursday, November 12, 2009 4. > Data explosion more data being created more data being saved more data tomorrow storage has not kept up withMoores Law this presentation will be savedonline, more data! Thursday, November 12, 2009 5. > Data explosion more data being created more data being saved more data tomorrow storage has not kept up withMoores Law this presentation will be savedonline, more data! Thursday, November 12, 2009 6. > Potential #fails Thursday, November 12, 2009 7. > Problem 1 - Data access file size we cant store latency of large files quality user experience processing data-mining Thursday, November 12, 2009 8. > Problem 1 - Data access file size we cant store Access latency of large files quality user experience processing data-mining denied... Thursday, November 12, 2009 9. > Problem 2 - Copyright concerns international copyright concerns potential related funding issues wed rather not let this be anissue Thursday, November 12, 2009 10. >Problem 3 - Redundancy Thursday, November 12, 2009 11. >Problem 3 - Redundancy computers crash hard drives die networks fail natural disasters occur Thursday, November 12, 2009 12. >Problem 3 - Redundancy computers crash hard drives die networks fail natural disasters occur but...This is NOT a problem! Thursday, November 12, 2009 13. ...so plan for it.Thursday, November 12, 2009 14. Thursday, November 12, 2009 15. Current Thursday, November 12, 2009 16. Thursday, November 12, 2009 17. Thursday, November 12, 2009 18. >Site 1 - Internet Archive Thursday, November 12, 2009 19. >Site 2 - MBL, Woods Hole Thursday, November 12, 2009 20. >Site 3 - NHM, London ...followed by new Data center Thursday, November 12, 2009 21. Data Centre Darwin Repository 600,000 Funding secured from eContentPlus Suitable location found with very good development potential in collaboration with Science Museum. Economy of scale provides additional avenues for co- development of services. Disaster Recovery and Business Continuity for allMuseums (help with ongoing and running costs) DCMS funding sought to help with development. e-Infrastructure European initiative Building Digital Repositories for Scientic Communities PESI (Biodiversity)Thursday, November 12, 2009 22. Proposed Data Centre Location Swindon Wroughton Science Museum 2008 Google Imagery 2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye, Map data 2008 Tele Atlas Thursday, November 12, 2009 23. Vendor Stakeholders / Partners Identied Technology Partners* Additional Funding Partners* *Note: Discussions are ongoing with all Partners and may be at different stages Thursday, November 12, 2009 24. Long Term Sustainability No Dripping Tap Business case should provide for signicant self funding opportunities. Diversity Darwin Repository (Data Centre) will provide an economy of scale that will provide signicant efficiency gains. Green technology to minimise carbon footprint and provide industry leadership.Thursday, November 12, 2009 25. >Distributed storage write once, read anywhere replication and fault tolerance error correction automatic redundancy scalable horizontally Thursday, November 12, 2009 26. >Distributed storage - Options fully hosted storage (cloud) hosted with own storage (private cloud) self hosted with proprietary hardware (Sun Thumper) self hosted with commodity hardware Thursday, November 12, 2009 27. >Distributed storage - GlusterFS GlusterFS: a cluster file-systemcapable of scaling to several peta-bytes open source software oncommodity hardware tunable performance simple to install and manage offers seamless expansion Thursday, November 12, 2009 28. >Distributed storage - Archival Fedora-commons is an opensource repository accounts for all changes, so built-in version control provides disaster recover open standards to mesh withfuture file formats provides open sharing servicessuch as OAI-PMH Thursday, November 12, 2009 29. >Distributed storage - Mirrored data now we have redundancy in fact, multiple redundantcopies provides fault tolerance offers load balancing gives us future geographicaldistribution Thursday, November 12, 2009 30. >Now we have lots of computers... Thursday, November 12, 2009 31. >Distributed processing more abilities available than juststoring data with distributed storage comesdistributed processing distributed processing meansfaster answers faster answers mean newquestions lather, rinse, repeat Thursday, November 12, 2009 32. >Distributed processing make your data more useful image and OCR processing distributed web services identifier resolution pools map/reduce frameworks generate new visualizations, textmining, NLP Thursday, November 12, 2009 33. >Distributed processing Finder TaxonFinderTaxonFinderTaxonFinderTaxonFinder Taxon WebServiceWebService Load Balancer Cluster NodeClusterSite RequestThursday, November 12, 2009 34. >Some assembly required(optional) our example uses new, fastercommodity hardware but it could run on any hardwarethat can run Linux you could chain old "out dated"computers together build your own cluster for next tonothing (host it in your basement) solves some infrastructure fundingissues hardware vendor neutrality Thursday, November 12, 2009 35. >Our proof of concept we ran a six box cluster todemonstrate GlusterFS ran stock Debian/GNU Linux simulated hardware failures synced data with a remote cluster ran map/reduce jobs defined procedures, configurationsand build scripts Thursday, November 12, 2009 36. Thursday, November 12, 2009 37. Thursday, November 12, 2009 38. wwwpresentation RESTAPI mod_glusterfs disco processingHadoopFedora Commons metadataMulgara triplestore / rdfrsync GlusterFS sync support le system BitTorrent Ext4 (exabyte) HTTPNetwork RAID Raw disk array Commodity SATA controllersstorage Commodity Hosts Dedicated Storage Network Thursday, November 12, 2009 39. >Distributed storage - Projected costs$246,000 Graph from Backblaze (http://www.backblaze.com) Thursday, November 12, 2009 40. >Other avenues - Cloud pilot BHL is participating in a pilot with NewYork Public Library and Duraspace Duraspace would provide a link tocloud providers pilot to show feasibility of hosting testing use of image server, otherservices in the cloud cloud could seed new clusters Thursday, November 12, 2009 41. >Code (63 6f 64 65) all of our code and configurations areopen source hosted on Google Code get involved join the mailing-lists follow us on Twitter ask questions, we'll help! Thursday, November 12, 2009 42. >Its your turn... similar projects? distributed services and processing? where can this be best applied? resilient services on top of storage names processing? LSID resolution pools? image processing? text-mining / NLP? #biodiv webservices?Thursday, November 12, 2009 43. Web: http://www.biodiversitylibrary.org/Code, Support: http://code.google.com/p/bhl-bitsTwitter: @BioDivLibrary (tag #bhl) Phil CryerAnthony GoddardMissouri Botanical Garden MBLWHOI Library Biodiversity Heritage Library Biodiversity Heritage Libraryphil.email@example.com firstname.lastname@example.org http://philcryer.com http://anthonygoddard.com @fak3r @anthonygoddardThursday, November 12, 2009