In and out: how does that metadata get into a knowledgebase anyhow? Heather Sherman Head of Library Programme Management – Dawson Books

UKSG Conference 2015 - In and out: how does that metadata get into a knowledgebase anyhow? Heather Sherman Dawson Books and Benjamin Johnson and Dave Hovenden ProQuest


In and out: how does that metadata get into a knowledgebase anyhow?

Heather Sherman

Head of Library Programme Management – Dawson Books


Connect Group PLC

Creation process


Sign contract with publisher

Acquire content and basic metadata

Correct metadata errors

Enhance basic metadata

Create ProQuest xml feed

Create TOC data


Sign contract with publisher

Process starts with a publisher agreeing to host their titles on dawsonera.

Publishers are asked to send Dawson the ebook content, jacket image and associated metadata.

Some send this in xml. Others complete a spreadsheet.


Publisher sends files of metadata

Publishers supply key pieces of metadata

eISBN

Title

Subtitle

Author(s)

Price

Currency

PDF file name

Jacket image

Publisher

Imprint

Publication date

Edition

Country of publication

Usage model


Spreadsheet of metadata


Publisher sends files of metadata

However….

Not all publishers supply the key data, so we have to go and find it.

Some supply incorrect data, so we have to fix that.

Dawson’s automated import process checks that key data is present and correct, and reports any errors.
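A check of this kind could be sketched as follows. This is a minimal illustration, not Dawson's actual import process: the field names in `REQUIRED_FIELDS`, the functions `isbn13_is_valid` and `check_record`, and the price value are all assumptions made for the example. Only the ISBN-13 check-digit rule is standard.

```python
# Hypothetical field names, loosely based on the key metadata listed earlier.
REQUIRED_FIELDS = ("eisbn", "title", "authors", "price", "currency",
                   "pdf_file_name", "publisher", "publication_date")


def isbn13_is_valid(isbn: str) -> bool:
    """Return True if the value is 13 digits with a correct check digit."""
    digits = isbn.replace("-", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    # ISBN-13 weights alternate 1, 3; the weighted total must be divisible by 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one row of publisher metadata."""
    errors = [f"missing {name}" for name in REQUIRED_FIELDS
              if not str(record.get(name, "")).strip()]
    eisbn = str(record.get("eisbn", "")).strip()
    if eisbn and not isbn13_is_valid(eisbn):
        errors.append(f"invalid eISBN check digit: {eisbn}")
    return errors


good = {
    "eisbn": "9780191015021",
    "title": "The Global Revolution",
    "authors": "Silvio Pons",
    "price": "10.00",  # placeholder value, not the real price
    "currency": "GBP",
    "pdf_file_name": "9780191015021.pdf",
    "publisher": "Oxford University Press",
    "publication_date": "20140815",
}
bad = dict(good, eisbn="9780191015022", price="")

print(check_record(good))
print(check_record(bad))
```

Returning a list of messages rather than raising on the first problem lets one report cover every error in a row, which suits the "reports any errors" step.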


Metadata errors


Table of contents data created

PDF files are sent to an agency that creates Table of Contents (TOC) data.

For ePub files, the TOC is extracted directly from the file.

TOC data is imported into the Dawson system and matched up with the PDFs and metadata.


TOC xml


Metadata enhanced

Publisher metadata and TOC data are matched to existing print records in the Dawson title database.

A hybrid record is created incorporating data from the publishers and Dawson.

This produces a record containing as much information as Dawson have about the title.
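As a rough illustration of the hybrid record idea, the sketch below merges two dictionaries. The slides do not say which source takes precedence, so the rule used here (publisher values win when present, the print record fills the gaps) is an assumption, as are the function name `build_hybrid_record` and the field names.

```python
def build_hybrid_record(publisher: dict, print_record: dict) -> dict:
    """Combine two records: start from the print record, then overlay
    every non-empty publisher value (assumed precedence rule)."""
    hybrid = dict(print_record)
    for field, value in publisher.items():
        if value not in (None, ""):
            hybrid[field] = value
    return hybrid


# Hypothetical inputs using values that appear in the sample record.
publisher_data = {
    "eisbn": "9780191015021",
    "title": "The Global Revolution",
    "dewey": "",  # publisher left this blank
}
print_data = {
    "print_isbn": "9780199657629",
    "dewey": "320.53209",
    "series": "Oxford studies in modern European history",
}

hybrid = build_hybrid_record(publisher_data, print_data)
print(hybrid)
```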


Dawson ebook MARC record

=LDR 01354nam 2200349 4500
=001 DAW28874972
=007 cr
=008 140327s2014\\\\enk\\\\fs\\\\\001|0|eng|d
=020 \\$a0191015024 (e-book)
=020 \\$a9780191015021 (e-book)
=040 \\$aStDuBDS$cStDuBDS$erda$dDAWSON
=041 1\$aeng$hita
=082 04$a320.53209$223
=100 1\$aPons, Silvio,$eauthor.
=245 14$aThe global revolution$h[electronic resource] : $ba history of international communism, 1917-1991 / $cSilvio Pons ; translated by Allan Cameron.
=264 \1$aOxford :$bOxford University Press,$c2014.
=300 \\$axx, 365 pages
=336 \\$atext$2rdacontent
=337 \\$acomputer$2rdamedia
=338 \\$aonline resource$2rdacarrier
=490 1\$aOxford studies in modern European history
=500 \\$aTranslated from the Italian.
=504 \\$aIncludes bibliographical references and index.
=530 \\$aAlso available in printed form.
=533 \\$aElectronic reproduction.$cDawson Books.$nMode of access: World Wide Web.
=650 \0$aCommunism$xHistory.
=650 \0$aCommunism.
=655 \7$aElectronic books.$2lcsh
=700 1\$aCameron, Allan,$d1952-$etranslator.
=776 0\$cHardback$z9780199657629
=830 \0$aOxford studies in modern European history.


ProQuest feed created

Hybrid record is extracted and turned into an xml record.

Dawson sends daily files of new titles and updated data to ProQuest.

A weekly file of data for all titles is also sent.


xml data sent to ProQuest

<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
  <eisbn>
    <eisbn13>9780191015021</eisbn13>
    <eisbn10>0191015024</eisbn10>
  </eisbn>
  <isbn-group>
    <isbn10 type="hb">0199657629</isbn10>
    <isbn13 type="hb">9780199657629</isbn13>
  </isbn-group>
  <title-group>
    <title>The Global Revolution: A History of International Communism 1917-1991</title>
    <subtitle>A History of International Communism 1917-1991</subtitle>
  </title-group>
  <author-group>
    <author>
      <person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
    </author>
  </author-group>
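A recipient of this feed could read the fields with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`. The sample repeats the fragment shown on the slide, with a closing `</document>` tag added so that it parses (the slide stops before the end of the record). The function name `summarize` and the keys of the returned dictionary are illustrative choices, not part of the feed specification.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide; the closing </document> tag is added here
# only so the sample is well-formed.
SAMPLE = """<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
  <eisbn>
    <eisbn13>9780191015021</eisbn13>
    <eisbn10>0191015024</eisbn10>
  </eisbn>
  <isbn-group>
    <isbn10 type="hb">0199657629</isbn10>
    <isbn13 type="hb">9780199657629</isbn13>
  </isbn-group>
  <title-group>
    <title>The Global Revolution: A History of International Communism 1917-1991</title>
    <subtitle>A History of International Communism 1917-1991</subtitle>
  </title-group>
  <author-group>
    <author>
      <person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
    </author>
  </author-group>
</document>"""


def summarize(document_xml: str) -> dict:
    """Pull a few identifying fields out of one <document> record."""
    root = ET.fromstring(document_xml)
    return {
        "eisbn13": root.findtext("eisbn/eisbn13"),
        "print_isbn13": root.findtext("isbn-group/isbn13"),
        "title": root.findtext("title-group/title"),
        "authors": [a.findtext("person-name") for a in root.iter("author")],
        "language": root.get("lang"),
    }


summary = summarize(SAMPLE)
print(summary)
```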


IN AND OUT: HOW DOES THAT METADATA GET INTO A KNOWLEDGEBASE ANYHOW?

Ben Johnson

Lead Metadata Librarian, KB Provider Data

[email protected]

Acquisition and Ingestion of Provider Data into a Knowledgebase (KB)


Introduction

What do I do?

4/15/2015 15


Lots of times it feels more like this:


Introduction

Acquire

• Get the data

• Verify compatibility

• Map the data

Ingest

• Transform the data

• Load

• Review

• Accept/Reject

Correct

• Customer inquiries

• Content integrity

• Product interoperability

… Profit!


Providers we partner with

• Publishers

• Content Aggregators (PQ, Gale)

• University and Library Local Content

• Library Consortia (JISC, BIBSAM)


Content Acquisition

• No data

• No contracts

• Provider Relations


KBART

• Joint NISO/UKSG Group

• Librarians, Vendors, Providers

• Transmission of metadata to vendors

• Human and machine readable data

• http://www.niso.org/workrooms/kbart


Ingestion – mapping and transformation

Acquire the data

• FTP, HTML

• CSV/Text, Excel, XML, HTML

Create packages

• Data for existing content is mapped to KB packages (new T&F package, JISC/BIBSAM new license)

Transform the content

• Map the content to our schema

• Normalize the data (dates, diacritics)
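Normalizing dates and diacritics might look something like the sketch below. It assumes incoming dates use the compact `YYYYMMDD` form seen in the Dawsonera feed and that the target is ISO 8601. It treats "diacritics" as a Unicode composition problem solved with NFC normalization. The function names are hypothetical, and the real KB rules are not described in the slides.

```python
import unicodedata
from datetime import datetime


def normalize_date(raw: str) -> str:
    """Convert a compact YYYYMMDD date to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw.strip(), "%Y%m%d").date().isoformat()


def normalize_diacritics(raw: str) -> str:
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", raw)


print(normalize_date("20140815"))
# "e" followed by a combining acute accent becomes the single character "é".
print(normalize_diacritics("Cafe\u0301") == "Caf\u00e9")
```

Composing to one canonical form matters because two strings that look identical on screen can otherwise fail an exact-match comparison.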


XML Data from Dawsonera


File ready for ingestion


Ingestion – Loading and Review


Currency (Updating)

Acquisition

Ingestion

Review


Corrections


Correcshunz Corrections


Downstream products

Data in KB

Downstream Products

Product functionality

Discovery Access


IN AND OUT: HOW DOES THAT METADATA GET INTO A KNOWLEDGEBASE ANYHOW?

Dave Hovenden – Content Operations Manager, Summon

ProQuest

UKSG Conference – 30 March – 1 April, 2015


The Content Ingestion Streams for Summon


The Content Ingestion Process at Summon for Commercial Content

Identify New Content to Add into Summon


Identifying New Commercial Content to Add into Summon

• Product Management, Sales, and our Global Content Alliance work together to identify new content to add into Summon

• New content requests from Summon customers are also considered

• Publishers and content providers may also request to have their content added into Summon


The Content Ingestion Process at Summon for Commercial Content

Identify New Content to Add into Summon

Engage with Publisher/Provider

Pre-Agreement Content Sample Analysis


Pre-Agreement Content Sample Analysis

• The sample analysis is used to help determine the quality and extent of the metadata and the metadata schema

• We also try to determine things such as linking methods, how rights are assigned to the content, and what databases we may need in our knowledgebase (if they don’t already exist)

• Summon often indexes content at the article or chapter level, as that is usually the granularity at which the content is supplied


What Metadata Do We Look For During Sample Analysis?


Title Metadata

• Article titles, Book titles, Publication titles, Subtitles, etc.

Identifier Metadata

• Unique IDs for specific articles, chapters, etc.

• Publication-level unique identifiers such as ISSN or ISBN

• Additional identifiers such as OCLC Number, LCCN, Dewey, DOI, etc.

Publication Information Metadata

• Publisher, Author(s), Corporate Authors, Volume Numbers, Issue Numbers, Start Page, Publication Date, Publication Series, etc.

Additional Metadata

• Subject Headings, Keywords, Language


Dawsonera Book Example – The Global Revolution: A History of International Communism 1917-1991 (ISBN-13 – 9780199657629)

<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
  <eisbn>
    <eisbn13>9780191015021</eisbn13>
    <eisbn10>0191015024</eisbn10>
  </eisbn>
  <territory-group/>
  <parent-isbn/>
  <isbn-group>
    <isbn10 type="hb">0199657629</isbn10>
    <isbn13 type="hb">9780199657629</isbn13>
  </isbn-group>
  <title-group>
    <title>The Global Revolution: A History of International Communism 1917-1991</title>
    <subtitle>A History of International Communism 1917-1991</subtitle>
  </title-group>
  <author-group>
    <author>
      <person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
    </author>
  </author-group>
  <endnote-authors>
    <endnote-author>Pons, Silvio,</endnote-author>
    <endnote-author>Cameron, Allan,</endnote-author>
  </endnote-authors>


Dawsonera Book Example (cont.) – The Global Revolution: A History of International Communism 1917-1991 (ISBN-13 – 9780199657629)

  <publisher>
    <publisher-name>Oxford University Press</publisher-name>
    <imprint>Oxford University Press</imprint>
  </publisher>
  <pub-place>GB</pub-place>
  <pub-date>20140815</pub-date>
  <date-added>20140911</date-added>
  <first-published/>
  <edition/>
  <copyright>© Oxford University Press 2014</copyright>
  <classification type="dewey">320.53209</classification>
  <classification type="loc">HX40</classification>
  <classification type="bic">HB</classification>
  <series issn="" series-name="Oxford studies in modern European history." number-within-series="">Oxford studies in modern European history.</series>
  <abstract-text>The Global Revolution. A History of International Communism 1917-1991 establishes a relationship between the history of communism and the main processes of globalization in the past century. Drawing on a wealth of archival sources, Silvio Pons analyses the multifaceted and contradictory relationship between the Soviet Union and the international communist movement, to show how communism played a major part in the formation of our modern world. The volume presents the argument that during the age of wars from 1914 to 1945, the establishment of the Soviet state in Russia and the birth of the communist movement had an enormous impact because of their promise of world revolution and international civil war. Such perspective appeared even more plausible in the aftermath of the Second World War and of revolution in China, which paved the way for the expansion of communism in the post-colonial world. Communism challenged the West in the Cold War - by means of anti-capitalist modernization and anti-imperialist mobilization - showing itself to be a powerful factor in the politicization of global trends. However, the international legitimacy of communism declined rapidly in the post-war era. Soviet power exposed its inability to exercise hegemony, as distinct from domination. The consequences of Sovietization in Europe and the break between the Soviet Union and China were the primary reasons for the decline of communist influence and appeal. Since communism lost its political credibility and cultural cohesion, its global project had failed. The ground was prepared for the devastating impact of Western globalization on communist regimes in Europe and the Soviet Union.</abstract-text>
  <keywords>Communism</keywords>


Summon and the Knowledgebase

• Summon relies upon the knowledgebase to help facilitate rights access to the content

• Rights access is assigned by tracking a particular title by ISSN or ISBN in the knowledgebase, or by Database ID

• The knowledgebase also helps Summon indicate when content has full-text availability
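Conceptually, a rights check of this kind is a lookup from an identifier to the databases that hold the title, compared against what a library subscribes to. The sketch below is purely illustrative and does not describe how Summon or the ProQuest knowledgebase are implemented: `KB_HOLDINGS`, the database ID `DB-EBOOKS-1`, and `has_full_text` are all invented for the example.

```python
# Hypothetical knowledgebase extract: ISBN/ISSN -> IDs of databases holding the title.
KB_HOLDINGS = {
    "9780191015021": {"DB-EBOOKS-1"},
}


def has_full_text(identifier: str, subscribed_databases: set) -> bool:
    """True if at least one database holding the title is one the library subscribes to."""
    return bool(KB_HOLDINGS.get(identifier, set()) & subscribed_databases)


print(has_full_text("9780191015021", {"DB-EBOOKS-1"}))
print(has_full_text("9780191015021", {"DB-OTHER"}))
```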


The Content Ingestion Process at Summon for Commercial Content

Identify New Content to Add into Summon

Engage with Publisher/Provider

Pre-Agreement Content Sample Analysis

Formalize and Sign Data Sharing Agreement

Data is Delivered in Full from Publisher/Provider

Data Normalization, Mapping, and Enrichment


Data Normalization, Mapping, and Enrichment Work

Data Normalization

• Very basic high-level clean-up of the data to standardize it

• Examples include:

  • Remove leading/trailing white spaces in Title and Subtitle fields

  • Clean-up diacritics and other encoding issues

Mapping

• Map the metadata fields in the records to the Summon schema

• This allows the metadata to appear in the UI and/or be made searchable within Summon

Enrichment

• Enriching the content by adding additional metadata when applicable

• Examples:

  • Scholarly/peer-reviewed flags from Ulrich’s

  • Citation counts from Scopus

  • Book cover images from Syndetics
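The whitespace clean-up and field mapping steps can be pictured with a small sketch. The source field names come from the Dawsonera XML sample. The target names in `FIELD_MAP` are invented stand-ins, since the actual Summon schema is not shown in the slides, and `map_record` is a hypothetical helper.

```python
# Provider field -> hypothetical index field (the real Summon schema is not shown).
FIELD_MAP = {
    "title": "Title",
    "subtitle": "Subtitle",
    "publisher-name": "Publisher",
    "pub-date": "PublicationDate",
}


def map_record(provider_record: dict) -> dict:
    """Trim leading/trailing whitespace and rename fields; drop empty or unmapped ones."""
    mapped = {}
    for source_field, target_field in FIELD_MAP.items():
        value = provider_record.get(source_field)
        if value is None:
            continue
        cleaned = value.strip()
        if cleaned:
            mapped[target_field] = cleaned
    return mapped


raw = {
    "title": "  The Global Revolution ",
    "subtitle": "A History of International Communism 1917-1991 ",
    "publisher-name": "Oxford University Press",
    "edition": "",  # not in FIELD_MAP, so it is ignored
}
mapped = map_record(raw)
print(mapped)
```

Keeping the mapping in a table rather than in code means a new provider format only needs a new table, not new logic.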


The Content Ingestion Process at Summon for Commercial Content

Identify New Content to Add into Summon

Engage with Publisher/Provider

Pre-Agreement Content Sample Analysis

Formalize and Sign Data Sharing Agreement

Data is Delivered in Full from Publisher/Provider

Data Normalization, Mapping, and Enrichment

Indexing


The Title as it Appears in Summon Once Indexed


The Content Ingestion Process at Summon for Commercial Content

Identify New Content to Add into Summon

Engage with Publisher/Provider

Pre-Agreement Content Sample Analysis

Formalize and Sign Data Sharing Agreement

Data is Delivered in Full from Publisher/Provider

Data Normalization, Mapping, and Enrichment

Indexing

Post-Ingestion Maintenance


Post-Ingestion Maintenance


Currency

• Currency is the process by which the publisher/provider sends Summon new or updated metadata records, or deletion records for content that needs to be removed

• Frequency of providing updates is often at the discretion of the publisher/provider

Metadata Issues

• Address metadata quality issues reported by Summon customers

• Most issues involve incorrect metadata, or slight variations in the metadata that may impact OpenURL linking or the record deduplication process (Match & Merge)
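To see why slight variations matter for deduplication, compare the title as it appears in the MARC record with the title in the XML feed: they differ in case and punctuation. The sketch below builds a simple match key that ignores both. It is a toy illustration, not a description of Summon's Match & Merge process. `title_key` and `group_duplicates` are hypothetical names, and a real system would also compare identifiers such as ISBN.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", title.casefold())
    return " ".join(no_punct.split())


def group_duplicates(records: list) -> list:
    """Return lists of record IDs that share the same title key."""
    groups = defaultdict(list)
    for record in records:
        groups[title_key(record["title"])].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    {"id": "marc", "title": "The global revolution : a history of international communism, 1917-1991"},
    {"id": "feed", "title": "The Global Revolution: A History of International Communism 1917-1991"},
    {"id": "other", "title": "Communism"},
]
duplicates = group_duplicates(records)
print(duplicates)
```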


Thank you – Any Questions?

Heather Sherman

[email protected]

Benjamin Johnson

[email protected]

Dave Hovenden

[email protected]