Digital Library Collection Management using HBase

Speaker: Ron Buckley (OCLC)

OCLC has been working over the last year to move its massive repository to HBase. This talk focuses on the impetus behind the move, the implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value that HBase brings to digital collection management.

The world’s libraries. Connected.

Digital Library Collection Management using HBase

“AKA: A Success Story”

Case Studies

Ron Buckley

HBaseCon

May 5, 2014

About OCLC

Worldwide, member-owned library cooperative

• Based in Dublin, Ohio

• Founded in 1967

• Not-for-profit

WorldCat

• Union catalog of library items from 72,000 libraries in 170 countries

• Over 2 billion records, 2.5 billion location listings

Hosting

• Melvyl, University of California Digital Library (and many others) are hosted directly out of WorldCat

HBase @ OCLC

Center of our world

• 15-month project to rebuild data infrastructure with Hadoop at the center.

• Leveraged HBase to build multiple new products.

• Replaced and decommissioned multiple Oracle RAC environments.

Old Meets New

• Dewey Decimal System – OCLC owns and maintains the Dewey Decimal System, which is stored and maintained in HBase.

Moving from Relational to HBase

Why

• Data set was too big a long time ago – Not long after we built our Oracle database we removed almost all joins and views.

• Too expensive – Making a dataset available for free open access was going to cost us almost $1 million, just for storage.

• Slow – Couldn’t analyze the data set because it took a week just to walk it.

How

• Text index and our own secondary indexing for HBase.

• Transition period of about 12 months with both systems running – multiple tools built and run to find and fix discrepancies.
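The discrepancy tools mentioned above are not shown in the deck. As a minimal sketch of the idea only, the following compares two key/value snapshots and reports what differs. The class name `DiscrepancyCheck`, the `Kind` enum, and the use of in-memory maps as stand-ins for the Oracle and HBase sides are all assumptions made for illustration; values are assumed to be non-null.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: compares a snapshot of the old store with a snapshot
// of the new store. In-memory maps stand in for the real databases.
final class DiscrepancyCheck {
    enum Kind { MISSING_IN_NEW, MISSING_IN_OLD, VALUE_MISMATCH }

    // Returns one entry per key that differs between the two snapshots,
    // sorted by key so the report is deterministic.
    static Map<String, Kind> diff(Map<String, String> oldStore, Map<String, String> newStore) {
        Map<String, Kind> report = new TreeMap<>();
        Set<String> allKeys = new TreeSet<>(oldStore.keySet());
        allKeys.addAll(newStore.keySet());
        for (String key : allKeys) {
            String oldValue = oldStore.get(key);
            String newValue = newStore.get(key);
            if (oldValue == null) {
                report.put(key, Kind.MISSING_IN_OLD);
            } else if (newValue == null) {
                report.put(key, Kind.MISSING_IN_NEW);
            } else if (!oldValue.equals(newValue)) {
                report.put(key, Kind.VALUE_MISMATCH);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, String> oldStore = Map.of("rec1", "a", "rec2", "b");
        Map<String, String> newStore = Map.of("rec1", "a", "rec2", "B", "rec3", "c");
        diff(oldStore, newStore).forEach((key, kind) -> System.out.println(key + " " + kind));
    }
}
```

A real tool would stream both sides in key order rather than hold them in memory, and would feed the report to a repair step.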

HBase Book – from HBase

http://www.worldcat.org/title/HBase-the-definitive-guide/oclc/761693417

HBase - Hub of Linked Data

It is imperative that library data be available in new data formats that are native to the web.

- Tim Berners-Lee

• Databases are walked and analyzed frequently.

• Many hundreds of millions, soon billions, of interrelated endpoints are stored back to HBase.

• Endpoints are made available in multiple standard formats (RDF, JSON, Turtle, N-Triples) for machine use.
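To make one of the listed formats concrete, here is a minimal sketch of how a single statement about an endpoint could be written as an N-Triples line. The class `NTriples`, the example.org subject, and the choice of predicate are illustrative assumptions, not OCLC's actual data or code; only the basic escapes (backslash, quote, newline, carriage return) are handled.

```java
// Hypothetical helper that formats one RDF statement as an N-Triples line.
final class NTriples {
    // Wraps an IRI in angle brackets, as N-Triples requires.
    static String iri(String value) {
        return "<" + value + ">";
    }

    // Wraps a plain string literal in double quotes, escaping the basic
    // characters that cannot appear raw inside an N-Triples literal.
    static String literal(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Joins subject, predicate and object and terminates the line with " .".
    static String triple(String subject, String predicate, String object) {
        return subject + " " + predicate + " " + object + " .";
    }

    public static void main(String[] args) {
        // Subject and title are made-up placeholders.
        System.out.println(triple(
                iri("http://example.org/work/1"),
                iri("http://schema.org/name"),
                literal("An example title")));
    }
}
```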

HBase - Hub of Linked Data

http://experiment.worldcat.org/entity/work/data/1151002411.html

HBase as Content Store

“Libraries aren’t just about books”

• OCLC CONTENTdm is used by thousands of libraries to manage local digital content preservation.

• We’re moving over 40 million digital objects (many TB) into a centrally hosted HBase repository.

Digital storage in HBase

• Key – The internal key is MD5-hashed to form the HBase row key.

• PDFs – Compression (Snappy) doesn’t reduce the size of PDF documents.

• 10 MB cell size – Objects over 10 MB are not stored in HBase; we store them in HDFS. (We do store metadata rows for these objects in HBase.)
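The key and cell-size rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production code: the class `ContentKeys`, the use of the raw 16-byte digest as the row key, UTF-8 encoding of the internal key, and reading "10 MB" as 10 × 1024 × 1024 bytes are all guesses. Hashing a key this way generally spreads otherwise sequential keys across regions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper illustrating the row key and size-routing rules.
final class ContentKeys {
    // Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
    static final long MAX_CELL_BYTES = 10L * 1024 * 1024;

    enum Store { HBASE, HDFS }

    // Hashes the internal key with MD5; the 16-byte digest is the row key.
    static byte[] rowKey(String internalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(internalKey.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Objects over the limit go to HDFS; everything else is stored in HBase.
    static Store storeFor(long objectSizeBytes) {
        return objectSizeBytes > MAX_CELL_BYTES ? Store.HDFS : Store.HBASE;
    }

    // Lower-case hex rendering, handy for logging a row key.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "collection42/object7" is a made-up internal key.
        System.out.println(hex(rowKey("collection42/object7")));
        System.out.println(storeFor(2L * 1024 * 1024));
        System.out.println(storeFor(25L * 1024 * 1024));
    }
}
```

For an object routed to HDFS, the same row key would still identify its metadata row in HBase.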

University of the Pacific

http://oc.lc/bDo9l0

Academy of Motion Picture Arts and Sciences. Margaret Herrick Library.

http://collections.oscars.org/prodart/

Illinois Digital Archives (via Illinois State Library)

http://oc.lc/lrzLFr

U.S. Department of State

http://cdm15937.contentdm.oclc.org/cdm/ref/collection/DSDL01/id/46

Stability - Almost 7 months uptime

• CDH 4.3 – April 26, 2014 - 37 Region Servers up for 7 months

Performance – Fast

Performance – Cache Hits Help

M/R and HBase?

• We run hundreds of M/R jobs a day on our user-facing cluster.

• Our cluster is oversized for HBase.

• M/R jobs run with limited tasks, niced,…

• Still faster than “the old way”.

• Looking forward to multi-tenant features in upcoming releases.

Upgrading HBase

- We needed a way to upgrade HBase without downtime.

- Rolling installs on a 50-node cluster sounded cumbersome.

Replication for 0 downtime install

• HBase master-master replication is used to maintain an always-available disaster site.

• We have a middle-tier service layer (like the Thrift server) that knows about both our main cluster and our DR cluster.

• When we shut down the main cluster, the middle tier automatically switches to the disaster site.

• Each cluster runs a web server that exposes its Hadoop config.

• Example: http://HBase-config-perf.ent.oclc.org:9007/HBaseconf/HBase-site.xml
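The switch-over described above can be pictured as "use the first cluster whose config endpoint answers". The sketch below is an assumption about the shape of that logic, not the actual middle tier: `ClusterSelector`, the example.org URLs, and the pluggable reachability check are all hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical selector: given config URLs in priority order (main cluster
// first, DR cluster second), pick the first one that is currently reachable.
final class ClusterSelector {
    private final List<String> configUrls;
    private final Predicate<String> isReachable;

    ClusterSelector(List<String> configUrls, Predicate<String> isReachable) {
        this.configUrls = List.copyOf(configUrls);
        this.isReachable = isReachable;
    }

    // Returns the config URL to load, or empty if no cluster responds.
    Optional<String> activeConfigUrl() {
        return configUrls.stream().filter(isReachable).findFirst();
    }

    public static void main(String[] args) {
        // Made-up URLs; a real check would probe each endpoint over HTTP.
        List<String> urls = List.of(
                "http://main.example.org:9007/hbase-site.xml",
                "http://dr.example.org:9007/hbase-site.xml");
        // Simulate the main cluster being shut down for an upgrade.
        ClusterSelector selector = new ClusterSelector(urls, url -> !url.contains("main."));
        System.out.println(selector.activeConfigUrl().orElse("no cluster available"));
    }
}
```

Keeping the reachability check as a parameter makes the selection rule easy to test without a live cluster.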

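Replication for 0 downtime install

• Instead of relying on hbase-site.xml in the classpath, we load the hbase-site.xml via addResource.

import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public HBaseManagedConnection(String HBaseSiteUrl, int maxPoolSize) {
    tableCounter = new BlockingCounter(maxPoolSize);
    // Start from the default configuration, then layer the remote site file on top.
    Configuration config = HBaseConfiguration.create();
    try {
        config.addResource(new URL(HBaseSiteUrl));
    } catch (MalformedURLException mue) {
        LOG.error("**** URL to HBase Site is invalid, Unable to connect to HBase: {} *****", HBaseSiteUrl);
    }
    // ... remainder of the constructor is not shown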

Summary

• HBase is the center of our world and, by association, of a lot of libraries.

• You can move from relational to HBase.

• We’ve been successful running user-facing traffic alongside Map/Reduce.

• EASY to support. We have two converted Oracle DBAs as our front-line admins. Mostly, they’re lent out to MySQL support for other internal systems.

Questions?

Come to Ohio - Our snowballs roll themselves!
