How we use Linked Open Data to drive our next generation discovery interface, and how we've gone about it.
Linked Library Data in the wild
Technical Lead for Prism
Phil John
Introductions...
So, what’s Prism then?
Introductions...
a next generation discovery interface
Prism
Introductions
(yes…even configuration settings)
Built entirely on Linked Data
Prism
Discovery of library catalogue resources
Prism
but grander plans afoot...
...some future sources...
Prism
journal metadata
archives/records (e.g. DS Calm)
thesis repositories
rare items/special collections
and more!
SaaS/Cloud Based
Prism
MARC 21 → RDF
Performs data conversion
Prism
this ensures it keeps in sync with the LMS
Initial “bulk” conversion then periodic “delta” files
Prism
provided by a suite of RESTful web services
Borrower/Availability data pulled from LMS “live”
Prism
just add .rss to collections or .rdf/.nt/.ttl/.json to items
Linked Data API
Prism
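As a rough illustration of the "just add an extension" idea, the sketch below builds the alternative representation URLs for an item or a collection. The host name and paths are hypothetical placeholders, not real Prism endpoints; only the extensions come from the slide.

```python
from urllib.parse import urlsplit, urlunsplit

# Extensions taken from the slide; everything else here is illustrative.
ITEM_EXTENSIONS = ("rdf", "nt", "ttl", "json")
COLLECTION_EXTENSIONS = ("rss",)


def with_extension(url: str, ext: str) -> str:
    """Append a format extension to the URL path, keeping any query string."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit(parts._replace(path=f"{path}.{ext}"))


def item_representations(item_url: str) -> dict:
    """Map each machine-readable extension to the URL that would serve it."""
    return {ext: with_extension(item_url, ext) for ext in ITEM_EXTENSIONS}


if __name__ == "__main__":
    # Hypothetical item and search URLs.
    for ext, url in item_representations("http://prism.example.org/demo/items/12345").items():
        print(ext, url)
    print(with_extension("http://prism.example.org/demo/items?query=rowling", "rss"))
```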
The Challenges
Prism
Extracting data from MARC 21
The Challenges
Some quotes...
Extracting Data from MARC 21
...cataloguers may want to look away now
...and even if it does, there are millions of existing records that we’ll want to convert
MARC 21 is not going away anytime soon...
Extracting Data from MARC 21
How are we approaching it?
Extracting Data from MARC 21
By tackling it in small chunks!
Extracting Data from MARC 21
We’ve created a solution that...
Extracting Data from MARC 21
allows us to build the model iteratively
compartmentalises code for different sections
provides robustness
is performant
allows us to experiment
Parser Observer Handlers
Our conversion pipeline
Extracting Data from MARC 21
Parser Observer Handlers
fires events when it encounters a MARC 21 data structure; very strict with syntax
MARC 21 Parser
Extracting Data from MARC 21
Parser Observer Handlers
listens for MARC 21 data structures and hands control over to one or more handlers
Event Observer
Extracting Data from MARC 21
Parser Observer Handlers
know how to convert MARC 21 structures and fields into linked data
Bibliographic Handlers
Extracting Data from MARC 21
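The three stages above could be wired together roughly as follows. This is a minimal sketch, not the actual Prism code: the one-field-per-line input format, the class and function names, and the `dcterms:title` predicate are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class MarcField:
    tag: str
    subfields: List[Tuple[str, str]]


Handler = Callable[[str, MarcField], List[Triple]]


class Observer:
    """Listens for parsed fields and hands each one to the registered handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, tag: str, handler: Handler) -> None:
        self._handlers[tag].append(handler)

    def notify(self, record_uri: str, fld: MarcField) -> List[Triple]:
        triples: List[Triple] = []
        for handler in self._handlers.get(fld.tag, []):
            triples.extend(handler(record_uri, fld))
        return triples


def parse(lines: List[str], observer: Observer, record_uri: str) -> List[Triple]:
    """Toy strict parser for lines shaped like '245 $a Title $c Statement'."""
    triples: List[Triple] = []
    for line in lines:
        tag, _, rest = line.partition(" ")
        if not (len(tag) == 3 and tag.isdigit()):
            raise ValueError(f"malformed field: {line!r}")
        subfields = []
        for chunk in rest.split("$")[1:]:
            if not chunk:
                raise ValueError(f"empty subfield in: {line!r}")
            subfields.append((chunk[0], chunk[1:].strip()))
        # Fire an event for every field; tags without handlers yield nothing.
        triples.extend(observer.notify(record_uri, MarcField(tag, subfields)))
    return triples


def title_handler(record_uri: str, fld: MarcField) -> List[Triple]:
    """Hypothetical handler: emit one title triple per 245 subfield a."""
    return [(record_uri, "dcterms:title", v) for c, v in fld.subfields if c == "a"]


if __name__ == "__main__":
    observer = Observer()
    observer.register("245", title_handler)
    print(parse(["245 $a Goodfellas $h [videorecording] /", "538 $a Blu-Ray."],
                observer, "ex:item/1"))
```

Because each handler is registered per tag, new parts of the model can be added one handler at a time without touching the parser.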
So, where are we up to?
Extracting Data from MARC 21
we tackled this one first as it allows us to reason more fully about the record
Format (and duration)
Extracting Data from MARC 21
In theory quite easy...
Format
...in practice not so much...
Format
no code for CD (12cm sound disk, 1.4m/s)
DVD and LaserDisc share(d) a code
LC slow(ish) to support new formats in M21
limited use of control field (007) codings...
...so need to parse text from 3xx, 5xx fields
LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher |
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
Teasing format from a MARC 21 Record
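One way to combine the coded and free-text evidence is sketched below. It reads only the first character of the 007 (category of material) and then looks for format keywords in 300/538 text. The keyword table and function names are illustrative assumptions, not Prism's actual rules. Duration is read from the 306 field, which holds playing time as hhmmss.

```python
import re
from datetime import timedelta
from typing import List, Optional

# 007 position 0: category of material (only two categories shown here).
CATEGORY_007 = {"v": "videorecording", "s": "sound recording"}

# Hypothetical keyword table; the first matching pattern wins.
TEXT_HINTS = [
    (re.compile(r"blu-?\s?ray", re.I), "Blu-ray Disc"),
    (re.compile(r"\bdvd\b", re.I), "DVD"),
    (re.compile(r"\blaser\s?disc\b", re.I), "LaserDisc"),
    (re.compile(r"\b(cd|compact disc)\b", re.I), "CD"),
]


def detect_format(field_007: Optional[str], free_text: List[str]) -> Optional[str]:
    """Prefer a specific format found in free text, else fall back to the 007 category."""
    for text in free_text:
        for pattern, label in TEXT_HINTS:
            if pattern.search(text):
                return label
    return CATEGORY_007.get(field_007[:1]) if field_007 else None


def parse_playing_time(value: str) -> Optional[timedelta]:
    """Convert a 306 subfield a value (hhmmss) into a timedelta."""
    match = re.fullmatch(r"(\d{2})(\d{2})(\d{2})", value.strip())
    if not match:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    print(detect_format("vd||s||||", ["1 Blu-Ray (139 min.) :", "Blu-Ray."]))
    print(parse_playing_time("021900"))
```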
Which gives us...
an important part of the record to model, or so I’ve been told
Title
Extracting Data from MARC 21
Quite tricky because...
Title
don’t want to duplicate data that appears elsewhere (e.g. in 100/700)
‡c must be last subfield in a 245...
...so sometimes data from ‡n / ‡p is in ‡c instead...
...which means we can’t just drop the ‡c
http://journal.code4lib.org/articles/3832
Got a helping hand from Code4Lib Journal (thanks!)
Title
Now with more title
sounds easy...acronyms from EAN to UPC describing 13 digit codes...right?
Identifier
Extracting Data from MARC 21
what are all those other things doing in the ‡a?
...STOP!
Identifier
Identifier
“For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.”
Library of Congress Rule Interpretation 1.8
(and then validate whatever’s left)
So we need to parse them out
Identifier
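A minimal sketch of "parse them out, then validate whatever's left": split the leading number from any trailing qualifier such as (pbk.), then apply the standard ISBN-10 and EAN-13 check-digit rules. The function names are illustrative, and the regular expression is an assumption about how the values are shaped rather than a complete treatment of 020/024 data.

```python
import re
from typing import Optional, Tuple


def split_identifier(raw: str) -> Tuple[str, str]:
    """Split '0306406152 (pbk. : alk. paper)' into the number and its qualifier."""
    match = re.match(r"\s*([0-9Xx-]+)\s*(.*)$", raw)
    if not match:
        return "", raw.strip()
    number = match.group(1).replace("-", "").upper()
    qualifier = match.group(2).strip().strip("()").strip()
    return number, qualifier


def is_valid_isbn10(number: str) -> bool:
    """ISBN-10: weighted sum (weights 10 down to 1) must be divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dX]", number):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(number))
    return total % 11 == 0


def is_valid_ean13(number: str) -> bool:
    """EAN-13 / ISBN-13: alternating weights 1 and 3 over the first 12 digits."""
    if not re.fullmatch(r"\d{13}", number):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(number[:12]))
    return (10 - total % 10) % 10 == int(number[12])


def classify(raw: str) -> Optional[str]:
    """Return the identifier type, or None if nothing valid is left after parsing."""
    number, _qualifier = split_identifier(raw)
    if is_valid_isbn10(number):
        return "ISBN-10"
    if is_valid_ean13(number):
        return "ISBN-13" if number.startswith(("978", "979")) else "EAN-13"
    return None


if __name__ == "__main__":
    for value in ["0306406152 (pbk. : alk. paper)", "7321900108089", "0306406153"]:
        print(value, "->", classify(value))
```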
LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher |
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
Phew, this one’s easy, no (pbk), (hbk) or even (pbk. , alk. paper) to contend with
Now we can start performing lookups against other sources!
hardest of the lot...
Author
Extracting Data from MARC 21
...why?
Author
Newt Scamander
Rowling, J.K. vs Rowling, Joanne K.
Few records with relator term in 100/700 ‡e...
...so we have to parse that from the 245 ‡c...
...and we don’t just deal with English records.
we’ve licensed the names/subjects authority files, and created RDF from them
Library of Congress to the rescue!
Author
LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher | $e music
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
A contrived example (sorry!) with and without relator terms
Hope you can all read this at the back!
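When the 700 fields carry no ‡e, the roles have to be read out of the 245 ‡c statement of responsibility. The sketch below does this for the Goodfellas example with a tiny English-only phrase table. The table, role labels and function names are hypothetical, and, as the slide notes, real data is not limited to English.

```python
from typing import List, Tuple

# Hypothetical phrase table; a real one needs many more phrases and languages.
ROLE_PHRASES = {
    "directed by": "director",
    "music by": "music",
    "written by": "author",
    "edited by": "editor",
}


def roles_from_statement(statement: str) -> List[Tuple[str, str]]:
    """Extract (name, role) pairs from a 245 subfield c statement of responsibility."""
    results: List[Tuple[str, str]] = []
    for part in statement.split(";"):
        part = part.strip()
        for phrase, role in ROLE_PHRASES.items():
            if part.lower().startswith(phrase + " "):
                results.append((part[len(phrase):].strip(), role))
                break
    return results


def invert_name(name: str) -> str:
    """Naive 'Forename Surname' to 'Surname, Forename', for comparison with 700 subfield a."""
    first, _, last = name.rpartition(" ")
    return f"{last}, {first}" if first else name


if __name__ == "__main__":
    statement = "directed by Martin Scorsese ; music by Christopher Brooks"
    for name, role in roles_from_statement(statement):
        print(invert_name(name), "->", role)
```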
A closer look at Authority Matching
Author
Some requirements:
Author
needs to be fast...
...(able to process 2M records in several hours)
requires accuracy
must handle pseudonyms and variant spellings
which means that for bulk conversions we aren’t incurring HTTP overhead millions of times
So we store as RDF, but index in Solr
Author
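To show the shape of the lookup, here is a sketch that uses an in-memory dictionary as a stand-in for the Solr index. It is not a Solr client. The authority URI is a made-up example, and the normalisation rules are an assumption about what "variant spellings" might need.

```python
import re
import unicodedata
from typing import Dict, Iterable, Optional


def normalise(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_marks.lower()).split())


class AuthorityIndex:
    """Local index from every known name form to one authority URI."""

    def __init__(self) -> None:
        self._by_name: Dict[str, str] = {}

    def add(self, uri: str, preferred: str, variants: Iterable[str] = ()) -> None:
        for label in (preferred, *variants):
            self._by_name[normalise(label)] = uri

    def lookup(self, name: str) -> Optional[str]:
        # A local lookup avoids one HTTP round trip per name during bulk conversion.
        return self._by_name.get(normalise(name))


if __name__ == "__main__":
    index = AuthorityIndex()
    # Hypothetical URI; pseudonym and variant forms point at the same authority.
    index.add("http://example.org/authority/rowling", "Rowling, J. K.",
              ["Rowling, Joanne K.", "Scamander, Newt"])
    for name in ["Rowling, J.K.", "Scamander, Newt", "Pesci, Joe"]:
        print(name, "->", index.lookup(name))
```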
You can tell J.K. Rowling is successful, she’s been translated lots
Language/Alternate Graphical Representation
Extracting Data from MARC 21
Nice “high impact” feature
Language
allows switching between representations
both forms can be searched for
uses RDF content language feature, so useful for people using machine readable RDF
001: | 3013197
008: | 080624s2007\\\\cc\a\\\\\\\\\\000\0\chi\d
041: , | $a chi
043: , | $a a-cc--- |
050: , 4 | $a NE1300.8.C6 | $b S48 2007 |
100: 1, | $6 880-01 | $a Shu, Huifang. |
245: 1, 0 | $6 880-02 | $a Fan chen su zi : | $b Min jian nian hua zhong de wen qing feng su / | $c Shu Huifang, Shen Hong zhu. |
246: 3, 1 | $6 880-03 | $a Min jian nian hua zhong de wen qing feng su |
250: , | $6 880-04 | $a Di 1 ban. |
260: , | $6 880-05 | $a Beijing : | $b Zhongguo gong ren chu ban she, | $c 2007. |
300: , | $a 3, 3, 229 p. : | $b col. ill. ; | $c 24 cm. |
440: , 0 | $6 880-06 | $a Zhongguo min su wen hua cong shu |
700: 1, | $6 880-07 | $a Shen, Hong. |
880: 1, | $6 100-01/$1 | $a 舒惠芳 . |
880: 1, 0 | $6 245-02/$1 | $a 凡尘俗子 : | $b 民间年画中的温情风俗 / | $c 舒惠芳 , 沈泓著 .
880: 3, 1 | $6 246-03/$1 | $a 民间年画中的温情风俗 |
880: , | $6 250-04/$1 | $a 第 1 版 . |
880: , | $6 260-05/$1 | $a 北京 : | $b 中国工人出版社 | $c 2007. |
880: , 0 | $6 440-06/$1 | $a 中国民俗文化丛书 |
880: 1, | $6 700-07/$1 | $a 沈泓 . |
852: , | $b Main Library | $c East Asian Coll.,Purple 2 | $h 398.351 | $m S4 |
Dealing with language in MARC 21
MARC Parser Observer Handlers
tagged with an ISO 639-2 language code and masquerading as the field listed in ‡6
Passes 880s back into Observer
Language
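The "masquerading" step might look something like this sketch: read the ‡6 linkage of an 880 (for example `245-02/$1`), re-tag the field as the field it links to, attach a language code, and hand it back to the observer like any other field. The dataclass, the function name and the choice to pass the language in as a parameter are assumptions. In the example record the code would come from elsewhere, such as the 041 ‡a value "chi".

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Linkage shape: linking tag, occurrence number, optional script code after a slash.
LINKAGE = re.compile(r"^(\d{3})-(\d{2})(?:/(.+))?$")


@dataclass
class MarcField:
    tag: str
    subfields: List[Tuple[str, str]]
    language: Optional[str] = None


def unmask_880(fld: MarcField, language: str) -> MarcField:
    """Return the 880 re-tagged as its linked field, tagged with a language code."""
    linkage = next((value for code, value in fld.subfields if code == "6"), None)
    match = LINKAGE.match(linkage or "")
    if fld.tag != "880" or not match:
        return fld  # not an alternate graphic representation; leave untouched
    remaining = [(code, value) for code, value in fld.subfields if code != "6"]
    return MarcField(tag=match.group(1), subfields=remaining, language=language)


if __name__ == "__main__":
    original = MarcField("880", [("6", "245-02/$1"), ("a", "凡尘俗子 :")])
    masked = unmask_880(original, "chi")
    # The result can now be passed to the same 245 handlers as the romanised title.
    print(masked.tag, masked.language, masked.subfields)
```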
Which gives us...
it’s part of the reason we use Linked Data...but it’s got some challenges at the moment
Using/Linking to External Datasets
The Challenges
Pitfalls:
Language
what if a datasource suffers downtime...
...or worse, is taken offline permanently?
can we trust this data?
can we display it, or is it susceptible to vandalism?
Potential solutions (YMMV):
Language
harvest datasets and keep them close to the app...
...or, if that’s not practical, proxy requests using a caching proxy such as Squid
if using Wikipedia and worried about vandalism...
...check for lots of rapid edits, consider caching (or turning off temporarily)
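The caching idea can be illustrated with a small in-process wrapper that serves fresh entries from memory and falls back to a stale copy when the remote source is down. This is a simplified stand-in, not a Squid configuration. The class name, the TTL and the fetch callable are assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedSource:
    """Wrap a fetch function with a TTL cache and stale-on-error fallback."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no remote call at all
        try:
            value = self._fetch(key)
        except Exception:
            # Source is down: serve the stale copy if we have one, else nothing.
            return cached[1] if cached is not None else None
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    def flaky_fetch(uri: str) -> str:
        # Hypothetical remote call that always fails, to show the fallback path.
        raise ConnectionError("datasource offline")

    source = CachedSource(flaky_fetch, ttl_seconds=60)
    print(source.get("http://example.org/resource/1"))
```

Harvesting a whole dataset locally is the same trade-off taken further: more storage and refresh work in exchange for no runtime dependency on the remote source.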
...or – what we’d like to see happen to Linked Library Data
The Future...
especially on the peripheries – authority data, author information, links to other resources
More library data as LOD
The Future
seriously – this would make our lives so much simpler
LMS vendors adopting LOD
The Future
LOD replacing MARC 21 as the standard representation of bibliographic records
The Future
Photo Credits
Slide 15 - http://www.flickr.com/photos/gammaman/5241860326/
Slide 21 - http://www.flickr.com/photos/agizienski/3778965891/
Slide 40 - http://www.flickr.com/photos/54409200@N04/5070012761/
Slide 42 - http://www.flickr.com/photos/proimos/4199675334/
Slide 48 - http://www.flickr.com/photos/maveric2003/91198458/
Slide 63 - http://richard.cyganiak.de/2007/10/lod/
Slide 67 - http://www.flickr.com/photos/markchapmanphoto/5139429152/
Slide 72 - http://www.flickr.com/photos/-bast-/349497988/