Presentation to NFAIS Webinar on "Linked Data: What It Is, What It Does and The Future of Information Discovery", delivered 2012-10-25.
Linked Data and the Future of Scientific Publishing
Bradley P. Allen, Elsevier Labs
Presentation to NFAIS Webinar – “Linked Data: What It Is, What It Does and The Future of Information Discovery”
2012-10-25
“Our new knowledge does not consist of a careful set of works that have passed through a series of gates. … Our new knowledge is not even a set of works. It is an infrastructure of connection.”
David Weinberger. 2011. Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room, Basic Books, New York, NY
Scientific knowledge in a post-print world
“Infrastructure of connection” = linked data
Type of data: What the literature is about
Content inputs: XML; long-form free text; short-form free text; tables; images; video; audio
Linked data outputs: asset metadata; citations; classifications; clusters; entities; relations; language models; probabilistic graphical models
Benefits: better discoverability; better visualization and understandability; better integration for use in information solutions

Type of data: How the literature is being used
Content inputs: article views; search queries; user behavior; social media streams
Linked data outputs: article-level metrics; sentiment analysis; ranking and impact metrics; user interest profiles
Benefits: provides the researcher insight about her career; provides institutions data about their performance and impact; provides publishers data for optimizing our business
“Linked data is just a term for how to publish data on the web while working with the web. And the web is the best architecture we know for publishing information in a hugely diverse and distributed environment, in a gradual and sustainable way.”
Jeni Tennison. 2010. Why Linked Data for data.gov.uk? http://www.jenitennison.com/blog/node/140
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards
4. Include links to other URIs, so that they can discover more things
Tim Berners-Lee. 2006. Linked Data http://www.w3.org/DesignIssues/LinkedData.html
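The four principles above can be illustrated with a small in-memory sketch. It is not taken from the presentation: the example.org URIs, the names Statement, look_up and links_from, and the sample values are all hypothetical, and a dictionary-style lookup stands in for a real HTTP request. Only the Dublin Core and FOAF property URIs are real vocabulary terms.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """One subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str


# Principles 1 and 2: things are named with HTTP URIs.
# example.org is a reserved domain; these identifiers are hypothetical.
BASE = "http://example.org/id/"
ARTICLE = BASE + "article/123"
AUTHOR = BASE + "person/456"

STATEMENTS = [
    Statement(ARTICLE, "http://purl.org/dc/terms/title", "An example article"),
    Statement(ARTICLE, "http://purl.org/dc/terms/creator", AUTHOR),
    Statement(AUTHOR, "http://xmlns.com/foaf/0.1/name", "A. Researcher"),
]


def look_up(uri: str) -> List[Statement]:
    """Principle 3: looking up a URI returns useful statements about it.

    A real service would answer an HTTP GET; here the lookup is simulated.
    """
    return [s for s in STATEMENTS if s.subject == uri]


def links_from(uri: str) -> List[str]:
    """Principle 4: the description links to other URIs that can be looked up."""
    return [s.obj for s in look_up(uri) if s.obj.startswith("http://")]


if __name__ == "__main__":
    # Follow the links from the article to discover more things.
    for target in links_from(ARTICLE):
        for statement in look_up(target):
            print(statement.predicate, "->", statement.obj)
```

Following a link from the article to its author and then looking up the author is the "discover more things" step of the fourth principle.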
Linked data as standards and best practices
Scientific publication as linked data
[Diagram: workflow stages Acquire; Transform, Enhance, Index, Analyze; Compose; Deliver. Labeled elements: Document, Entity record, Media object, Provenance metadata, Asset metadata, Relational metadata, Linked data.]
Linked data is increasingly important in science
• Create greater online engagement with our content and platform
• Semantically enrich our content and enhance the value of discovery services compared to the same and similar content on other platforms
• Drive additional usage (in journals and books, in downloads and interactivity)
• Improve our ability to be a partner in research and a publisher that adds value
• Improve our connection with the scientific community through productive collaborations that improve search and discovery for all researchers
The challenge for publishers
• Expose existing asset and subject metadata as linked data in Web pages to aid discovery
• Embrace linked data principles while leveraging our existing content production workflow and infrastructure
• Leverage partners for content enhancement and knowledge organization
• Reuse Web-standard vocabularies, taxonomies, ontologies and entity resources where possible
• Collaborate in building needed authoritative resources for identity resolution and metrics
• Deliver benefits across the complementary use cases of researcher and practitioner
Elsevier’s approach to linked data
Creating smart content by extracting & linking
[Diagram labels: Asset Metadata, Entities, Relations, Citations, Usage]
Methods for extracting and linking content & data
• Very mature, but hard to scale
• Crowdsourcing is a possible solution, but quality control is a challenge
• Variable degrees of maturity, but huge strides through machine learning research and practical application on the consumer Internet
• Data-driven, so the more data the better
• Models can be used to build applications, can be a new type of publication
• Language-driven, so challenging to generalize and scale
• Crucial to realize promise of ease of integration
Packaging linked data for content production
[Diagram labels: SKOS Generator; RDF Generator; LDR; sat:Satellite; rdf:RDF + namespaces; Concept schemes; Tags (Statement 1: Diabetes; Statement 2: Hypertension; ...); Region Tags (Para1-Statement-1: Diabetes, Hypertension; Para2-Statement-2: ...); tag:satelliteWrapper + XML Schema]
Example RDF Statements
Tags from a taxonomy for a given document
Document sections relevant to a given concept
Document sections providing answers to a given question
Learning objects compliant with a given state educational standard
Genes mentioned in a given document
Documents supporting or disputing conclusions of a given document
Concepts that are in the areas of expertise for a given author
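As a rough illustration of the first kind of statement listed above (tags from a taxonomy for a given document), the sketch below encodes the Diabetes and Hypertension tags as triples and writes them out as N-Triples lines. It is an assumption-laden example, not the packaging format used in production: the example.org URIs and the names TRIPLES, to_ntriples and tags_for are hypothetical, and the choice of dcterms:subject and skos:prefLabel as properties is an assumption rather than something the slides specify.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Real vocabulary terms (Dublin Core Terms and SKOS).
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Hypothetical identifiers for one document and two taxonomy concepts.
DOC = "http://example.org/doc/1"
DIABETES = "http://example.org/concept/diabetes"
HYPERTENSION = "http://example.org/concept/hypertension"

TRIPLES: List[Triple] = [
    (DOC, DCT_SUBJECT, DIABETES),
    (DOC, DCT_SUBJECT, HYPERTENSION),
    (DIABETES, SKOS_PREF_LABEL, "Diabetes"),
    (HYPERTENSION, SKOS_PREF_LABEL, "Hypertension"),
]


def is_uri(term: str) -> bool:
    # Simplification for this sketch: anything starting with http:// is a URI,
    # everything else is treated as a plain string literal.
    return term.startswith("http://")


def to_ntriples(triples: List[Triple]) -> List[str]:
    """Serialize each triple as one N-Triples line."""
    lines = []
    for subject, predicate, obj in triples:
        if is_uri(obj):
            rendered = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return lines


def tags_for(document: str, triples: List[Triple]) -> List[str]:
    """Return the labels of the taxonomy concepts a document is tagged with."""
    labels = {s: o for s, p, o in triples if p == SKOS_PREF_LABEL}
    return [labels.get(o, o) for s, p, o in triples
            if s == document and p == DCT_SUBJECT]


if __name__ == "__main__":
    print("\n".join(to_ntriples(TRIPLES)))
    print(tags_for(DOC, TRIPLES))
```

The other statement types in the list (genes mentioned in a document, sections relevant to a concept, and so on) would follow the same subject-predicate-object shape with different properties.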
Infrastructure for storing and publishing linked data
[Diagram labels:
Discovery Services; Data Spaces
Pipeline Services (Hadoop EMR): JSON, Transform, N-Quads, Extract, Reasoning, Interlinking, RDF Validation, Ontology Svcs, Annotation Satellites, Pipeline Coordination
Loader (REST)
Storage: MongoDB, Virtuoso Triplestore, SIREN/SOLR, Amazon S3
Access: A&E, Discovery Service API (REST), Atom Feed, Admin & Monitoring, SPARQL Endpoint, Analytics
Inputs: Asset Satellites, Vocab Satellites, 3rd Party Data
Ontology Service
Load Balance & Failover (Akamai GTM & Amazon ELB)]
Integrating content & data services with linked data
Organization | Main driver | Example | Benefits | Linked data
S&T Journals | Making the article more engaging and informative through visualization and linking | Article of the Future | Understanding, Discovery | Entities, Citations, Relations
Books | Making the book more engaging and informative through visualization and linking | Brain Navigator | Understanding, Discovery | Entities
A&G Research | Making the discovery of relevant content easier and more engaging | Lipids SciVerse App | Discovery, Integration | Entities, Asset Metadata
A&G Institutional | Making data about the production and use of scientific content easier to understand | SciVal Spotlight | Understanding | Entities, Citations, Usage
Corporate Alternative Fuels | Making the exploration of design alternatives easier | Elsevier Biofuels | Discovery | Entities, Citations
Bibliographical Databases | Automating the indexing of content for traditional discovery channels | Embase | Discovery | Asset Metadata, Entities
Engineering & Technology | Making the discovery of technology trends and sources easier | Illumin8 | Discovery | Entities, Citations, Relations
Pharma Biotech | Rich integration of content and data in support of research and design workflows | Target Insights | Discovery, Understanding | Entities, Citations
HS CDS | Delivering actionable information in the context of medical decision making | Order Sets | Integration | Entities, Relations
GCR | Making the discovery of relevant medical content easier and contextual | Clinical Key | Discovery | Entities, Asset Metadata
NHP | Making the delivery and organization of medical content easier to integrate with educational workflows | General Education Platform | Discovery, Integration | Entities, Asset Metadata, Relations
Delivering linked data through multiple online services
• Access to content and data
– Usage data not integrated or leveraged
– Hard to stage content for modeling and analytics
• Integration
– Adoption of standards across silos and legacy systems
– Globalization/localization of knowledge organization systems
– Named entity registries for identity resolution for accreditation, provenance and trust
• Human resources
– Scarcity of data scientists, language engineers
• Production
– Manually intensive knowledge engineering
– Balancing production validation and rapid iterative development
– Relation extraction needed but capabilities are minimal at best
– Tools for syntactic rather than semantic validation
• Sharing
– Culture and legacy
– Business model disincentives
– Identifier, URI and namespace governance
• Quality control
– Lack of clean external data
– Gaps in linked data resources
– Bugs in knowledge organization systems
Challenges in implementing linked data
• Increasing acquisition of data and text analytics capabilities
• Shifting dependence from partners to in-house resources for content enhancement and knowledge organization
• Innovation in new knowledge organization systems (some through integration of existing ones)
– Two main design emphases: taxonomy for discovery, ontology for understanding and integration
• Emergence of shared smart content infrastructure based on linked data principles
Trends within Elsevier today
• Smart content allows publishers to create new products and services through structuring content for better discovery, insight and utility
– The value is in the structure, not the content
– Creating that structure is hard work
– The kind of hard work that publishers have traditionally focused on
• Consumer Internet businesses are using text and data mining to add structure to content today… quickly and on the cheap
• Publishers, societies and libraries both large and small can use the same techniques to follow suit
Smart content is a bridge to the future of publishing
Thank you
Bradley P. Allen
bradleypallen on twitter, github