Linked Data and the Future of Scientific Publishing Bradley P. Allen, Elsevier Labs Presentation to NFAIS Webinar – “Linked Data: What It Is, What It Does and The Future of Information Discovery” 2012-10-25

Presentation to NFAIS Webinar on "Linked Data: What It Is, What It Does and The Future of Information Discovery", delivered 2012-10-25.

Page 1: Linked data and the future of scientific publishing

Linked Data and the Future of Scientific Publishing

Bradley P. Allen, Elsevier Labs

Presentation to NFAIS Webinar – “Linked Data: What It Is, What It Does and The Future of Information Discovery”

2012-10-25

Page 2: Linked data and the future of scientific publishing

Scientific knowledge in a post-print world

“Our new knowledge does not consist of a careful set of works that have passed through a series of gates. … Our new knowledge is not even a set of works. It is an infrastructure of connection.”

David Weinberger. 2011. Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. Basic Books, New York, NY.

Page 3: Linked data and the future of scientific publishing

“Infrastructure of connection” = linked data

Type of data: what the literature is about

Content inputs:
• XML
• Long-form free text
• Short-form free text
• Tables
• Images
• Video
• Audio

Linked data outputs:
• Asset metadata
• Citations
• Classifications
• Clusters
• Entities
• Relations
• Language models
• Probabilistic graphical models

Benefits:
• Better discoverability
• Better visualization and understandability
• Better integration for use in information solutions

Type of data: how the literature is being used

Content inputs:
• Article views
• Search queries
• User behavior
• Social media streams

Linked data outputs:
• Article-level metrics
• Sentiment analysis
• Ranking and impact metrics
• User interest profiles

Benefits:
• Provides the researcher insight about her career
• Provides institutions data about their performance and impact
• Provides publishers data for optimizing our business

Page 4: Linked data and the future of scientific publishing

Linked data as standards and best practices

“Linked data is just a term for how to publish data on the web while working with the web. And the web is the best architecture we know for publishing information in a hugely diverse and distributed environment, in a gradual and sustainable way.”

Jeni Tennison. 2010. Why Linked Data for data.gov.uk? http://www.jenitennison.com/blog/node/140

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards
4. Include links to other URIs, so that they can discover more things

Tim Berners-Lee. 2006. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html
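
The four principles on this slide can be illustrated with a small in-memory sketch. This is not from the presentation: the `WEB` dictionary stands in for real HTTP lookups, and every `example.org` URI, the sample titles and names, and the helper functions `look_up` and `discover` are hypothetical. The Dublin Core, FOAF and SKOS property URIs are real vocabulary terms.

```python
# Hypothetical in-memory "web". Each key is an HTTP URI naming a thing
# (principles 1 and 2); each value is the set of triples returned when
# that URI is looked up (principle 3).
WEB = {
    "http://example.org/article/1": [
        ("http://example.org/article/1", "http://purl.org/dc/terms/title", "A study of lipids"),
        ("http://example.org/article/1", "http://purl.org/dc/terms/creator", "http://example.org/person/a"),
        ("http://example.org/article/1", "http://purl.org/dc/terms/subject", "http://example.org/concept/lipid"),
    ],
    "http://example.org/person/a": [
        ("http://example.org/person/a", "http://xmlns.com/foaf/0.1/name", "A. Author"),
    ],
    "http://example.org/concept/lipid": [
        ("http://example.org/concept/lipid", "http://www.w3.org/2004/02/skos/core#prefLabel", "Lipid"),
    ],
}


def look_up(uri):
    """Stand-in for an HTTP GET: return the triples published at a URI."""
    return WEB.get(uri, [])


def is_uri(value):
    """Treat any http(s) string as a URI; everything else is a literal."""
    return value.startswith("http://") or value.startswith("https://")


def discover(start_uri):
    """Follow object links between URIs (principle 4), breadth first."""
    seen, queue = [], [start_uri]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        for _subject, _predicate, obj in look_up(uri):
            if is_uri(obj) and obj not in seen:
                queue.append(obj)
    return seen


reached = discover("http://example.org/article/1")
```

Starting from the article URI alone, `discover` reaches the author and the concept because the article's description links to them. That is the "discover more things" step in the fourth principle.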

Page 5: Linked data and the future of scientific publishing

Scientific publication as linked data

[Diagram. Workflow stages: Acquire; Transform, Enhance, Index, Analyze, Compose; Deliver. Other labels: Document, Entity record, Media object, Linked data, Asset metadata, Relational metadata, Provenance metadata.]

Page 6: Linked data and the future of scientific publishing

Linked data is increasingly important in science

Page 7: Linked data and the future of scientific publishing

The challenge for publishers

• Create greater online engagement with our content and platform
• Semantically enrich our content and enhance the value of our discovery services relative to the same or similar content on other platforms
• Drive additional usage (in journals and books, in downloads and interactivity)
• Improve our ability to be a partner in research and a publisher that adds value
• Improve our connection with the scientific community through productive collaborations that improve search and discovery for all researchers

Page 8: Linked data and the future of scientific publishing

Elsevier’s approach to linked data

• Expose existing asset and subject metadata as linked data in Web pages to aid discovery
• Embrace linked data principles while leveraging our existing content production workflow and infrastructure
• Leverage partners for content enhancement and knowledge organization
• Reuse Web-standard vocabularies, taxonomies, ontologies and entity resources where possible
• Collaborate in building needed authoritative resources for identity resolution and metrics
• Deliver benefits across the complementary use cases of researcher and practitioner
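
As one possible reading of the first bullet on this slide, asset and subject metadata can be embedded in a Web page as a JSON-LD block that reuses the schema.org vocabulary. The presentation does not say which embedding syntax is used (RDFa and microdata are alternatives), so treat this as a minimal sketch under that assumption. The function `article_jsonld`, the DOI and the sample values are hypothetical; `ScholarlyArticle`, `name`, `author` and `about` are real schema.org terms.

```python
import json


def article_jsonld(title, authors, doi, subjects):
    """Build an HTML <script> element carrying schema.org metadata as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "@id": "https://doi.org/" + doi,  # HTTP URI naming the article
        "name": title,
        "author": [{"@type": "Person", "name": name} for name in authors],
        "about": [{"@type": "Thing", "name": subject} for subject in subjects],
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical sample values, for illustration only.
snippet = article_jsonld(
    title="A study of lipids",
    authors=["A. Author"],
    doi="10.1234/example",
    subjects=["Diabetes", "Hypertension"],
)
```

Placing `snippet` in the page's HTML lets crawlers read the same metadata that human readers see rendered.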

Page 9: Linked data and the future of scientific publishing

Creating smart content by extracting & linking

[Diagram labels: Asset Metadata, Entities, Relations, Citations, Usage]

Page 10: Linked data and the future of scientific publishing

Methods for extracting and linking content & data

• Very mature, but hard to scale
• Crowdsourcing is a possible solution, but quality control is a challenge
• Variable degrees of maturity, but huge strides through machine learning research and practical application on the consumer Internet
• Data-driven, so the more data the better
• Models can be used to build applications, and can be a new type of publication
• Language-driven, so challenging to generalize and scale
• Crucial to realize the promise of ease of integration

Page 11: Linked data and the future of scientific publishing

Packaging linked data for content production

[Diagram labels, in extraction order: SKOS Generator; RDF Generator; LDR; sat:Satellite; rdf:RDF + namespaces; Concept schemes; Statement 1; Diabetes; Statement 2; Hypertension; Tags; ...; Para1-Statement-1; Diabetes; Hypertension; Para2-Statement-2; ...; Region Tags; tag:satelliteWrapper + XML Schema]

Example RDF Statements

• Tags from a taxonomy for a given document
• Document sections relevant to a given concept
• Document sections providing answers to a given question
• Learning objects compliant with a given state educational standard
• Genes mentioned in a given document
• Documents supporting or disputing conclusions of a given document
• Concepts that are in the areas of expertise for a given author
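
The first kind of statement listed on this slide, tags from a taxonomy for a given document, could be written as RDF roughly as follows. This is a minimal sketch that emits N-Triples, not the satellite format in the diagram. The `example.org` URIs and the function `tag_statements` are hypothetical, and the choice of `dcterms:subject` with SKOS concepts is an assumption about how such tags might be modeled. The two labels, Diabetes and Hypertension, are the ones shown on the slide.

```python
# Real vocabulary terms (Dublin Core, SKOS, RDF).
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text, lang="en"):
    """Render a language-tagged string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"@' + lang


def tag_statements(document_uri, concepts):
    """Return N-Triples lines that tag a document with taxonomy concepts."""
    lines = []
    for concept_uri, label in concepts:
        lines.append(f"{iri(concept_uri)} {iri(RDF_TYPE)} {iri(SKOS_CONCEPT)} .")
        lines.append(f"{iri(concept_uri)} {iri(SKOS_PREF_LABEL)} {literal(label)} .")
        lines.append(f"{iri(document_uri)} {iri(DCTERMS_SUBJECT)} {iri(concept_uri)} .")
    return lines


# Hypothetical document and concept URIs.
statements = tag_statements(
    "http://example.org/article/1",
    [
        ("http://example.org/concept/diabetes", "Diabetes"),
        ("http://example.org/concept/hypertension", "Hypertension"),
    ],
)
```

Each concept contributes three lines: its type, its label, and the link from the document to the concept.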

Page 12: Linked data and the future of scientific publishing

Infrastructure for storing and publishing linked data

[Architecture diagram. Recoverable labels: Discovery Services; Data Spaces; Pipeline Services (Hadoop EMR); JSON; Transform; N-Quads; Extract; Reasoning; Interlinking; RDF Validation; Ontology Svcs; Annotation Satellites; Pipeline Coordination; Loader (REST); MongoDB; Virtuoso Triplestore; SIREN/SOLR; Amazon S3; A&E; Discovery Service API (REST); Atom Feed; Admin & Monitoring; SPARQL Endpoint; Analytics; Asset Satellites; Vocab Satellites; 3rd Party Data; Ontology Service; Load Balance & Failover (Akamai GTM & Amazon ELB)]
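
The diagram lists JSON, Transform and N-Quads among the pipeline elements, but how they connect is not recoverable from the extracted labels. As a purely illustrative sketch of what serializing a JSON record into N-Quads can look like, assuming a made-up record shape (`id` plus a `properties` map of predicate URI to values), the following may help. The functions `term` and `record_to_nquads`, the record layout and all `example.org` URIs are hypothetical and are not taken from the presentation.

```python
import json


def term(value):
    """Render a value as an IRI if it looks like one, else as a plain literal."""
    if value.startswith("http://") or value.startswith("https://"):
        return "<" + value + ">"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def record_to_nquads(record_json, graph_uri):
    """Turn one JSON record into N-Quads lines placed in a named graph."""
    record = json.loads(record_json)
    subject = "<" + record["id"] + ">"
    graph = "<" + graph_uri + ">"
    quads = []
    for predicate, values in record["properties"].items():
        for value in values:
            # N-Quads line: subject predicate object graph .
            quads.append(f"{subject} <{predicate}> {term(value)} {graph} .")
    return quads


# Hypothetical input record, for illustration only.
example = json.dumps({
    "id": "http://example.org/article/1",
    "properties": {
        "http://purl.org/dc/terms/title": ["A study of lipids"],
        "http://purl.org/dc/terms/subject": [
            "http://example.org/concept/diabetes",
            "http://example.org/concept/hypertension",
        ],
    },
})

quads = record_to_nquads(example, "http://example.org/graph/annotations")
```

The fourth term in each line names the graph the statement belongs to, which is what distinguishes N-Quads from N-Triples and lets a store keep statements from different sources apart.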

Page 13: Linked data and the future of scientific publishing

Integrating content & data services with linked data

Page 14: Linked data and the future of scientific publishing

Delivering linked data through multiple online services

Organization | Main driver | Example | Benefits | Linked data
S&T Journals | Making the article more engaging and informative through visualization and linking | Article of the Future | Understanding, Discovery | Entities, Citations, Relations
Books | Making the book more engaging and informative through visualization and linking | Brain Navigator | Understanding, Discovery | Entities
A&G Research | Making the discovery of relevant content easier and more engaging | Lipids SciVerse App | Discovery, Integration | Entities, Asset Metadata
A&G Institutional | Making data about the production and use of scientific content easier to understand | SciVal Spotlight | Understanding | Entities, Citations, Usage
Corporate Alternative Fuels | Making the exploration of design alternatives easier | Elsevier Biofuels | Discovery | Entities, Citations
Bibliographical Databases | Automating the indexing of content for traditional discovery channels | Embase | Discovery | Asset Metadata, Entities
Engineering & Technology | Making the discovery of technology trends and sources easier | Illumin8 | Discovery | Entities, Citations, Relations
Pharma Biotech | Rich integration of content and data in support of research and design workflows | Target Insights | Discovery, Understanding | Entities, Citations
HS CDS | Delivering actionable information in the context of medical decision making | Order Sets | Integration | Entities, Relations
GCR | Making the discovery of relevant medical content easier and contextual | Clinical Key | Discovery | Entities, Asset Metadata
NHP | Making the delivery and organization of medical content easier to integrate with educational workflows | General Education Platform | Discovery, Integration | Entities, Asset Metadata, Relations

Page 15: Linked data and the future of scientific publishing

Challenges in implementing linked data

• Access to content and data
– Usage data not integrated or leveraged
– Hard to stage content for modeling and analytics
• Integration
– Adoption of standards across silos and legacy systems
– Globalization/localization of knowledge organization systems
– Named entity registries for identity resolution for accreditation, provenance and trust
• Human resources
– Scarcity of data scientists, language engineers
• Production
– Manually intensive knowledge engineering
– Balancing production validation and rapid iterative development
– Relation extraction needed but capabilities are minimal at best
– Tools for syntactic rather than semantic validation
• Sharing
– Culture and legacy
– Business model disincentives
– Identifier, URI and namespace governance
• Quality control
– Lack of clean external data
– Gaps in linked data resources
– Bugs in knowledge organization systems
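
The production challenge about tools for syntactic rather than semantic validation can be made concrete with a toy example. A statement can be perfectly well formed and still say something the vocabulary does not allow. The sketch below is an assumption-laden illustration, not a description of any tool named in the presentation: the `example.org` vocabulary, the `mentionsGene` property, the type table and both checking functions are hypothetical, and the regular expression covers only simple IRI-to-IRI triples.

```python
import re

# Rough shape of an N-Triples line whose object is an IRI: <s> <p> <o> .
TRIPLE_SHAPE = re.compile(r"^<([^<>\s]+)> <([^<>\s]+)> <([^<>\s]+)> \.$")

EX = "http://example.org/"

# Hypothetical knowledge about entities and about what a property expects.
ENTITY_TYPES = {EX + "BRCA1": EX + "Gene", EX + "Diabetes": EX + "Disease"}
EXPECTED_OBJECT_TYPE = {EX + "mentionsGene": EX + "Gene"}


def is_syntactically_valid(line):
    """Check only that the line has the right shape."""
    return TRIPLE_SHAPE.match(line) is not None


def is_semantically_valid(line):
    """Check that the object has the type the predicate expects, if any."""
    match = TRIPLE_SHAPE.match(line)
    if match is None:
        return False
    _subject, predicate, obj = match.groups()
    expected = EXPECTED_OBJECT_TYPE.get(predicate)
    return expected is None or ENTITY_TYPES.get(obj) == expected


good = f"<{EX}doc1> <{EX}mentionsGene> <{EX}BRCA1> ."
bad = f"<{EX}doc1> <{EX}mentionsGene> <{EX}Diabetes> ."
```

Both `good` and `bad` pass the syntactic check, but only `good` passes the semantic one, because `Diabetes` is typed as a disease and not as a gene. A purely syntactic validator would accept both.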

Page 16: Linked data and the future of scientific publishing

Trends within Elsevier today

• Increasing acquisition of data and text analytics capabilities
• Shifting dependence from partners to in-house resources for content enhancement and knowledge organization
• Innovation in new knowledge organization systems (some through integration of existing ones)
– Two main design emphases: taxonomy for discovery, ontology for understanding and integration
• Emergence of shared smart content infrastructure based on linked data principles

Page 17: Linked data and the future of scientific publishing

Smart content is a bridge to the future of publishing

• Smart content allows publishers to create new products and services through structuring content for better discovery, insight and utility
– The value is in the structure, not the content
– Creating that structure is hard work
– The kind of hard work that publishers have traditionally focused on
• Consumer Internet businesses are using text and data mining to add structure to content today… quickly and on the cheap
• Publishers, societies and libraries both large and small can use the same techniques to follow suit

Page 18: Linked data and the future of scientific publishing

Thank you

Bradley P. Allen

bradleypallen on Twitter and GitHub