Linked Library Data in the wild

How we use Linked Open Data to drive our next generation discovery interface, and how we've gone about it.

Page 1: Linked Library Data in the wild

Linked Library Data in the wild

Page 2: Linked Library Data in the wild

Technical Lead for Prism

Phil John

Introductions...

Page 3: Linked Library Data in the wild

So, what’s Prism then?

Introductions...

Page 4: Linked Library Data in the wild
Page 5: Linked Library Data in the wild
Page 6: Linked Library Data in the wild
Page 7: Linked Library Data in the wild

a next generation discovery interface

Prism

Introductions

Page 8: Linked Library Data in the wild

(yes…even configuration settings)

Built entirely on Linked Data

Prism

Page 9: Linked Library Data in the wild

Discovery of library catalogue resources

Prism

but grander plans afoot...

Page 10: Linked Library Data in the wild

...some future sources...

Prism

journal metadata

archives/records (e.g. DS Calm)

thesis repositories

rare items/special collections

and more!

Page 11: Linked Library Data in the wild

SaaS/Cloud Based

Prism

Page 12: Linked Library Data in the wild

MARC 21 → RDF

Performs data conversion

Prism

Page 13: Linked Library Data in the wild

this ensures it keeps in sync with the LMS

Initial “bulk” conversion then periodic “delta” files

Prism

Page 14: Linked Library Data in the wild

provided by a suite of RESTful web services

Borrower/Availability data pulled from LMS “live”

Prism

Page 15: Linked Library Data in the wild

just add .rss to collections or .rdf/.nt/.ttl/.json to items

Linked Data API

Prism
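
A minimal sketch of the "just add an extension" convention, assuming items and collections each have a stable URL. The base URL below is hypothetical, and the media types are the commonly registered ones for each extension rather than anything confirmed for Prism.

```python
# Extensions come from the slide; the media types are assumptions.
ITEM_FORMATS = {
    "rdf": "application/rdf+xml",
    "nt": "application/n-triples",
    "ttl": "text/turtle",
    "json": "application/json",
}
COLLECTION_FORMATS = {"rss": "application/rss+xml"}


def representation_url(resource_url: str, extension: str,
                       is_collection: bool = False) -> str:
    """Append a format extension to an item or collection URL."""
    allowed = COLLECTION_FORMATS if is_collection else ITEM_FORMATS
    if extension not in allowed:
        raise ValueError(f"unsupported extension: {extension}")
    return f"{resource_url.rstrip('/')}.{extension}"


# Hypothetical item URL, for illustration only.
item_url = "http://prism.example.org/demo/items/12345"
print(representation_url(item_url, "ttl"))
```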

Page 16: Linked Library Data in the wild
Page 17: Linked Library Data in the wild
Page 18: Linked Library Data in the wild
Page 19: Linked Library Data in the wild

The Challenges

Prism

Page 20: Linked Library Data in the wild

Extracting data from MARC 21

The Challenges

Page 21: Linked Library Data in the wild

Some quotes...

Extracting Data from MARC 21

...cataloguers may want to look away now

Page 22: Linked Library Data in the wild
Page 23: Linked Library Data in the wild

...and even if it does, there are millions of existing records that we’ll want to convert

MARC 21 is not going away anytime soon...

Extracting Data from MARC 21

Page 24: Linked Library Data in the wild
Page 25: Linked Library Data in the wild

How are we approaching it?

Extracting Data from MARC 21

Page 26: Linked Library Data in the wild

By tackling it in small chunks!

Extracting Data from MARC 21

Page 27: Linked Library Data in the wild

We’ve created a solution that...

Extracting Data from MARC 21

allows us to build the model iteratively

compartmentalises code for different sections

provides robustness

is performant

allows us to experiment

Page 28: Linked Library Data in the wild

Parser → Observer → Handlers

Our conversion pipeline

Extracting Data from MARC 21

Page 29: Linked Library Data in the wild

Parser Observer Handlers

fires events when it encounters a MARC 21 data structure; very strict with syntax

MARC 21 Parser

Extracting Data from MARC 21

Page 30: Linked Library Data in the wild

Parser Observer Handlers

listens for MARC 21 data structures and hands control over to one or more handlers

Event Observer

Extracting Data from MARC 21

Page 31: Linked Library Data in the wild

Parser Observer Handlers

know how to convert MARC 21 structures and fields into linked data

Bibliographic Handlers

Extracting Data from MARC 21
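
The three stages described on the last few slides can be sketched as a small observer pattern. This is an illustration under assumptions, not our production code: the class and function names, the `(tag, subfields)` record shape and the placeholder subject and predicate are all invented for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Field = Tuple[str, Dict[str, str]]  # (tag, {subfield code: value})
Triple = Tuple[str, str, str]
Handler = Callable[[str, Dict[str, str]], List[Triple]]


class Observer:
    """Listens for field events and hands control to registered handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, tag: str, handler: Handler) -> None:
        self._handlers[tag].append(handler)

    def notify(self, tag: str, subfields: Dict[str, str]) -> List[Triple]:
        triples: List[Triple] = []
        for handler in self._handlers.get(tag, []):
            triples.extend(handler(tag, subfields))
        return triples


def parse(record: Iterable[Field], observer: Observer) -> List[Triple]:
    """Stand-in for the parser: fires one event per field it encounters."""
    triples: List[Triple] = []
    for tag, subfields in record:
        triples.extend(observer.notify(tag, subfields))
    return triples


def title_handler(tag: str, subfields: Dict[str, str]) -> List[Triple]:
    # Hypothetical handler; subject and predicate are placeholders.
    return [("_:record", "dct:title", subfields.get("a", "").strip(" /:"))]


observer = Observer()
observer.register("245", title_handler)
record = [("245", {"a": "Goodfellas /"}), ("538", {"a": "Blu-Ray."})]
print(parse(record, observer))
```

Because each handler only sees the fields it registered for, a new section of the model can be added or rewritten without touching the others, which is what lets the model grow iteratively.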

Page 32: Linked Library Data in the wild

So, where are we up to?

Extracting Data from MARC 21

Page 33: Linked Library Data in the wild

we tackled this one first as it allows us to reason more fully about the record

Format (and duration)

Extracting Data from MARC 21

Page 34: Linked Library Data in the wild

In theory quite easy...

Format

Page 35: Linked Library Data in the wild

...in practice not so much...

Format

no code for CD (12cm sound disk, 1.4m/s)

DVD and LaserDisc share(d) a code

LC slow(ish) to support new formats in M21

limited use of control field (007) codings...

...so need to parse text from 3xx, 5xx fields

Page 36: Linked Library Data in the wild

LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher |
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert

Teasing format from a MARC 21 Record
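
A rough sketch of the text fallback, assuming the free-text 3xx and 5xx fields have already been flattened to one string per tag. The keyword list is illustrative and far from complete; the duration helper relies on 306 ‡a holding playing time as hhmmss.

```python
import re
from typing import Dict, Optional

# Keyword patterns for formats the coded fields may not distinguish.
# Illustrative only: a real list would be much longer.
FORMAT_PATTERNS = [
    ("Blu-ray", re.compile(r"blu-?\s?ray", re.IGNORECASE)),
    ("DVD", re.compile(r"\bdvd\b", re.IGNORECASE)),
    ("CD", re.compile(r"\b(cd|compact disc)\b", re.IGNORECASE)),
]


def guess_format(fields: Dict[str, str]) -> Optional[str]:
    """Look for format keywords in free-text 3xx and 5xx fields."""
    for tag in sorted(fields):
        if tag[0] in "35":
            for label, pattern in FORMAT_PATTERNS:
                if pattern.search(fields[tag]):
                    return label
    return None


def duration_minutes(field_306a: str) -> int:
    """Convert a 306 $a playing time (hhmmss) to whole minutes."""
    hours, minutes = int(field_306a[0:2]), int(field_306a[2:4])
    return hours * 60 + minutes


fields = {"007": "vd||s||||", "300": "1 Blu-Ray (139 min.) : col.", "538": "Blu-Ray."}
print(guess_format(fields), duration_minutes("021900"))
```

For the record above, the 306 value 021900 works out to 139 minutes, which agrees with the "(139 min.)" in the 300 field.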

Page 37: Linked Library Data in the wild

Which gives us...

Page 38: Linked Library Data in the wild

an important part of the record to model, or so I’ve been told

Title

Extracting Data from MARC 21

Page 39: Linked Library Data in the wild

Quite tricky because...

Title

don’t want to duplicate data that appears elsewhere (e.g. in 100/700)

‡c must be last subfield in a 245...

...so sometimes data from ‡n / ‡p is in ‡c instead...

...which means we can’t just drop the ‡c
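
One way to act on this, sketched under assumptions: assemble the title from ‡a, ‡b, ‡n and ‡p, but hand the ‡c back alongside it rather than discarding it, so a later step can check it for stray part numbers or names. The function name and the `(code, value)` input shape are hypothetical, and this is not the approach from the Code4Lib article.

```python
from typing import List, Tuple

Subfield = Tuple[str, str]  # (code, value), in record order

TITLE_CODES = {"a", "b", "n", "p"}


def split_245(subfields: List[Subfield]) -> Tuple[str, str]:
    """Return (title, statement of responsibility) from a 245 field."""
    title_parts = [value.strip() for code, value in subfields if code in TITLE_CODES]
    responsibility = " ".join(value.strip() for code, value in subfields if code == "c")
    # Keep internal ISBD punctuation, drop only what trails the whole title.
    title = " ".join(title_parts).rstrip(" /:;,=")
    return title, responsibility


field_245 = [
    ("a", "Fan chen su zi :"),
    ("b", "Min jian nian hua zhong de wen qing feng su /"),
    ("c", "Shu Huifang, Shen Hong zhu."),
]
print(split_245(field_245))
```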

Page 40: Linked Library Data in the wild

http://journal.code4lib.org/articles/3832

Got a helping hand from Code4Lib Journal (thanks!)

Title

Page 41: Linked Library Data in the wild

Now with more title

Page 42: Linked Library Data in the wild

sounds easy...acronyms from EAN to UPC describing 13 digit codes...right?

Identifier

Extracting Data from MARC 21

Page 43: Linked Library Data in the wild

what are all those other things doing in the ‡a?

...STOP!

Identifier

Page 44: Linked Library Data in the wild

Identifier

“For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.”

Library of Congress Rule Interpretation 1.8

Page 45: Linked Library Data in the wild
Page 46: Linked Library Data in the wild

(and then validate whatever’s left)

So we need to parse them out

Identifier
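
A sketch of "parse, then validate", assuming the identifier comes first in the ‡a and any qualifier follows it. The function names are made up for the example; the checksums are the standard EAN-13 (also used by ISBN-13) and ISBN-10 algorithms.

```python
import re
from typing import Optional


def extract_candidate(raw: str) -> Optional[str]:
    """Pull the leading number out of a subfield, ignoring qualifiers like (pbk.)."""
    match = re.match(r"\s*([0-9][0-9-]*[0-9Xx])\b", raw)
    if match is None:
        return None
    return match.group(1).replace("-", "").upper()


def is_valid_ean13(code: str) -> bool:
    """EAN-13 / ISBN-13 check: weights alternate 1, 3 over the first 12 digits."""
    if not re.fullmatch(r"[0-9]{13}", code):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(code[:12]))
    return (10 - total % 10) % 10 == int(code[12])


def is_valid_isbn10(code: str) -> bool:
    """ISBN-10 check: weights 10 down to 1, sum divisible by 11, X counts as 10."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", code):
        return False
    values = [10 if c == "X" else int(c) for c in code]
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0


candidate = extract_candidate("0306406152 (pbk. : alk. paper)")
print(candidate, is_valid_isbn10(candidate))
```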

Page 47: Linked Library Data in the wild

LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher |
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert

Phew, this one’s easy, no (pbk), (hbk) or even (pbk. , alk. paper) to contend with

Page 48: Linked Library Data in the wild

Now we can start performing lookups against other sources!

Page 49: Linked Library Data in the wild

hardest of the lot...

Author

Extracting Data from MARC 21

Page 50: Linked Library Data in the wild

...why?

Author

Newt Scamander

Rowling, J.K. vs Rowling, Joanne K.

Few records with relator term in 100/700 ‡e...

...so we have to parse that from the 245 ‡c...

...and we don’t just deal with English records.
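
A deliberately naive sketch of pulling roles out of a 245 ‡c when the 100/700 carries no ‡e. The phrase table is English-only and invented for the example (including the choice to map "music by" to composer), which is exactly the limitation the last bullet points at.

```python
import re
from typing import Dict, List

# English-only and illustrative; the role labels are assumptions.
ROLE_PHRASES = {
    "directed by": "director",
    "music by": "composer",
    "edited by": "editor",
    "illustrated by": "illustrator",
    "translated by": "translator",
}


def roles_from_statement(statement: str) -> Dict[str, List[str]]:
    """Map each name in a statement of responsibility to the roles found for it."""
    roles: Dict[str, List[str]] = {}
    for segment in statement.split(";"):
        segment = segment.strip().rstrip(".")
        for phrase, role in ROLE_PHRASES.items():
            if segment.lower().startswith(phrase):
                names = segment[len(phrase):].strip()
                for name in re.split(r",\s*|\s+and\s+", names):
                    if name:
                        roles.setdefault(name, []).append(role)
                break
    return roles


def direct_order(heading: str) -> str:
    """Turn a "Surname, Forename" heading into "Forename Surname" for comparison."""
    surname, _, forename = heading.partition(",")
    return f"{forename.strip()} {surname.strip()}".strip()


found = roles_from_statement("directed by Martin Scorsese ; music by Christopher Brooks")
print(found.get(direct_order("Scorsese, Martin")))
```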

Page 51: Linked Library Data in the wild
Page 52: Linked Library Data in the wild

we’ve licensed the names/subjects authority files, and created RDF from them

Library of Congress to the rescue!

Author

Page 53: Linked Library Data in the wild

LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher | $e music
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert

A contrived example (sorry!) with and without relator terms

Page 54: Linked Library Data in the wild

Hope you can all read this at the back!

Page 55: Linked Library Data in the wild

A closer look at Authority Matching

Author

Page 56: Linked Library Data in the wild

Some requirements:

Author

needs to be fast...

...(able to process 2M records in several hours)

requires accuracy

must handle pseudonyms and variant spellings

Page 57: Linked Library Data in the wild

which means that for bulk conversions we aren’t incurring HTTP overhead millions of times

So we store as RDF, but index in Solr

Author
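
The idea of a local lookup can be sketched with an in-memory dictionary standing in for the Solr index. Everything here is an assumption for illustration: the class, the normalisation rules, the example URI and the sample labels are not taken from the real authority files.

```python
import re
import unicodedata
from typing import Dict, Iterable, Optional, Tuple


def normalise(name: str) -> str:
    """Fold case, accents, punctuation and spacing so close variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents.lower()).split())


class AuthorityIndex:
    """In-memory stand-in for the search index: every known label points at one URI."""

    def __init__(self, entries: Iterable[Tuple[str, Iterable[str]]]) -> None:
        self._by_name: Dict[str, str] = {}
        for uri, labels in entries:
            for label in labels:
                self._by_name[normalise(label)] = uri

    def lookup(self, heading: str) -> Optional[str]:
        return self._by_name.get(normalise(heading))


# Hypothetical URI and sample labels, for illustration only.
ROWLING = "http://example.org/authority/rowling"
index = AuthorityIndex([
    (ROWLING, ["Rowling, J. K.", "Rowling, Joanne K.", "Scamander, Newt"]),
])
print(index.lookup("Rowling, J.K."))
```

Normalisation only catches spacing and punctuation differences; the fuller form and the pseudonym match because the sample entry lists them as labels, so accuracy depends on how rich the authority data is.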

Page 58: Linked Library Data in the wild

You can tell J.K. Rowling is successful, she’s been translated lots

Page 59: Linked Library Data in the wild

Language/Alternate Graphical Representation

Extracting Data from MARC 21

Page 60: Linked Library Data in the wild

Nice “high impact” feature

Language

allows switching between representations

both forms can be searched for

uses RDF content language feature, so useful for people using machine readable RDF

Page 61: Linked Library Data in the wild

001: | 3013197
008: | 080624s2007\\\\cc\a\\\\\\\\\\000\0\chi\d
041: , | $a chi
043: , | $a a-cc--- |
050: , 4 | $a NE1300.8.C6 | $b S48 2007 |
100: 1, | $6 880-01 | $a Shu, Huifang. |
245: 1, 0 | $6 880-02 | $a Fan chen su zi : | $b Min jian nian hua zhong de wen qing feng su / | $c Shu Huifang, Shen Hong zhu. |
246: 3, 1 | $6 880-03 | $a Min jian nian hua zhong de wen qing feng su |
250: , | $6 880-04 | $a Di 1 ban. |
260: , | $6 880-05 | $a Beijing : | $b Zhongguo gong ren chu ban she, | $c 2007. |
300: , | $a 3, 3, 229 p. : | $b col. ill. ; | $c 24 cm. |
440: , 0 | $6 880-06 | $a Zhongguo min su wen hua cong shu |
700: 1, | $6 880-07 | $a Shen, Hong. |
880: 1, | $6 100-01/$1 | $a 舒惠芳 . |
880: 1, 0 | $6 245-02/$1 | $a 凡尘俗子 : | $b 民间年画中的温情风俗 / | $c 舒惠芳 , 沈泓著 .
880: 3, 1 | $6 246-03/$1 | $a 民间年画中的温情风俗 |
880: , | $6 250-04/$1 | $a 第 1 版 . |
880: , | $6 260-05/$1 | $a 北京 : | $b 中国工人出版社 | $c 2007. |
880: , 0 | $6 440-06/$1 | $a 中国民俗文化丛书 |
880: 1, | $6 700-07/$1 | $a 沈泓 . |
852: , | $b Main Library | $c East Asian Coll.,Purple 2 | $h 398.351 | $m S4 |

Dealing with language in MARC 21

Page 62: Linked Library Data in the wild

MARC Parser Observer Handlers

tagged with an ISO-639-2 language and masquerading as the field listed in ‡6

Passes 880s back into Observer

Language
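
A sketch of the re-routing step, assuming each 880 arrives as a dictionary of subfields and the record language has already been read from elsewhere in the record (for example 008 or 041). The names are hypothetical; the ‡6 layout of linked tag, occurrence number and optional script code follows the MARC 21 linkage convention.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple


class Linkage(NamedTuple):
    tag: str               # field the 880 stands in for, e.g. "245"
    occurrence: str        # pairing number, e.g. "02"
    script: Optional[str]  # script code such as "$1" (CJK), if present


def parse_linkage(subfield_6: str) -> Linkage:
    """Split a $6 value such as "245-02/$1" into its parts."""
    match = re.fullmatch(r"(\d{3})-(\d{2})(?:/([^/]+))?(?:/r)?", subfield_6.strip())
    if match is None:
        raise ValueError(f"unrecognised linkage: {subfield_6!r}")
    return Linkage(match.group(1), match.group(2), match.group(3))


def reroute_880(subfields: Dict[str, str],
                record_language: str) -> Tuple[str, Dict[str, str], str]:
    """Return (tag to masquerade as, remaining subfields, language tag)."""
    link = parse_linkage(subfields["6"])
    payload = {code: value for code, value in subfields.items() if code != "6"}
    return link.tag, payload, record_language


print(reroute_880({"6": "245-02/$1", "a": "凡尘俗子 :"}, "chi"))
```

The returned tag is what would be passed back into the observer, so the ordinary 245 handler processes the vernacular title and only the language tag differs.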

Page 63: Linked Library Data in the wild

Which gives us...

Page 64: Linked Library Data in the wild
Page 65: Linked Library Data in the wild
Page 66: Linked Library Data in the wild
Page 67: Linked Library Data in the wild

it’s part of the reason we use Linked Data...but it’s got some challenges at the moment

Using/Linking to External Datasets

The Challenges

Page 68: Linked Library Data in the wild

Pitfalls:

External Datasets

what if a datasource suffers downtime...

...or worse, is taken offline permanently?

can we trust this data?

can we display it, or is it susceptible to vandalism?

Page 69: Linked Library Data in the wild

Potential solutions (YMMV):

External Datasets

harvest datasets and keep them close to the app...

...or, if that’s not practical, proxy requests using a caching proxy such as Squid

if using Wikipedia and worried about vandalism...

...check for lots of rapid edits, consider caching (or turning off temporarily)
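
The caching idea can be sketched in a few lines, independent of any particular proxy. This is an assumption-laden illustration, not a Squid configuration: the class name is invented, and the fetch function is injected so the example runs without a network.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StaleOnErrorCache:
    """Cache remote responses and fall back to the last good copy on failure."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        cached = self._store.get(url)
        now = self._clock()
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh, no remote call needed
        try:
            body = self._fetch(url)
        except Exception:
            # Source is down: serve the stale copy if we have one.
            return cached[1] if cached is not None else None
        self._store[url] = (now, body)
        return body


def fake_fetch(url: str) -> str:
    # Placeholder for a real HTTP request.
    return f"data for {url}"


cache = StaleOnErrorCache(fake_fetch, ttl_seconds=60)
print(cache.get("http://example.org/resource"))
```

Serving a stale copy covers short outages; a source that disappears permanently still needs the harvested-dataset approach.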

Page 70: Linked Library Data in the wild
Page 71: Linked Library Data in the wild

...or – what we’d like to see happen to Linked Library Data

The Future...

Page 72: Linked Library Data in the wild

especially on the peripheries – authority data, author information, links to other resources

More library data as LOD

The Future

Page 73: Linked Library Data in the wild

seriously – this would make our lives so much simpler

LMS vendors adopting LOD

The Future

Page 74: Linked Library Data in the wild

LOD replacing MARC 21 as the standard representation of bibliographic records

The Future

Page 75: Linked Library Data in the wild
Page 76: Linked Library Data in the wild

Photo Credits

Slide 15 - http://www.flickr.com/photos/gammaman/5241860326/
Slide 21 - http://www.flickr.com/photos/agizienski/3778965891/
Slide 40 - http://www.flickr.com/photos/54409200@N04/5070012761/
Slide 42 - http://www.flickr.com/photos/proimos/4199675334/
Slide 48 - http://www.flickr.com/photos/maveric2003/91198458/
Slide 63 - http://richard.cyganiak.de/2007/10/lod/
Slide 67 - http://www.flickr.com/photos/markchapmanphoto/5139429152/
Slide 72 - http://www.flickr.com/photos/-bast-/349497988/