Presented at the 10th annual Data Harmony Users Group meeting on Tuesday, February 11, 2014 by Rachel Drysdale of PLOS. Discusses the process of building and integrating their new thesaurus into the PLOS journals workflow and publication platform. From constructing the thesaurus to creating channels for feedback and updates, through building new current awareness and discovery tools, to gathering data for article level metrics and web site analytics, follow their progress through to today’s PLOS websites and services.
The PLOS Thesaurus: the first year
Rachel Drysdale – Taxonomy Manager, PLOS
DHUG 2014
11th February, 2014
Public Library of Science - evolution
2000 PLOS founded
2003 PLOS Biology
2004 PLOS Medicine
2005 PLOS Computational Biology (June)
PLOS Genetics (July)
PLOS Pathogens (September)
2006 PLOS ONE
2007 PLOS Neglected Tropical Diseases
Journal Article Count
PLOS Biology 3,450
PLOS Medicine 2,626
PLOS Computational Biology 3,112
PLOS Genetics 4,048
PLOS Pathogens 3,639
PLOS ONE 87,296
PLOS Neglected Tropical Diseases 2,444
beautiful monster….
Overview – today’s talk
The Solution: Good Thesaurus + Machine Aided Indexing
Building the new Thesaurus with Access Innovations (AI)
The initial implementation at plos.org
MAIstro integration into Publishing workflow
Thesaurus maintenance
The Service:
Content Discovery
Article Analysis Relative Metrics
Starting point
2011 – the old Taxonomy
Inadequate
in content – just over 3100 specific terms
Inflexible
in structure – terms in pre-defined paths
Housed in Editorial Manager
ossified and difficult to update
Author-chosen terms - association with article
PLOS delivered to Access Innovations….
A copy of the old PLOS Taxonomy
Over 2,000 suggested changes
“Research analysis and methods” branch request
Use cases:
Subject Area-based searches
Hierarchy-based exploration of our corpus
Email Alerts based on Subject Area searches
RSS Feeds based on Subject Areas
Access Innovations added:
STEM vocabulary
Broader/Narrower term relationships
Rules for the Machine Aided Indexing
Synonyms
Analysis with respect to the PLOS corpus
… to and fro with PLOS …
Result:
Vastly improved NISO Z39.19-compliant thesaurus
Statistics
Measure     Old Taxonomy    A. I. Thesaurus
Terms       3,132           10,156
Synonyms    0               3,291
Tiers       5               7
Rules       0               14,798
Top-level Terms
1. Biology and life sciences
2. Computer and information sciences
3. Earth sciences
4. Engineering and technology
5. Environmental sciences and ecology
6. Medicine and health sciences
7. Physical sciences
8. Research and analysis methods
9. Science policy
10. Social sciences
Infrastructure
PLOS Taxonomy server:
Thesaurus – plos2012thes
Data Harmony Thesaurus Master and
MAI Rule Builder
Corpus fed to the Taxonomy Server for MAI
Article by article
Initial implementation:
Title – Abstract – Results – Methods
Top 8 hits selected
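Selecting the top 8 hits can be sketched as follows. This is a minimal illustration, assuming the indexer returns a count for each suggested term; the function name `top_hits`, the tie-breaking rule and the sample counts are hypothetical and not taken from MAIstro.

```python
def top_hits(term_counts, limit=8):
    """Return up to `limit` terms, most frequent first (ties in alphabetical order)."""
    ranked = sorted(term_counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _count in ranked[:limit]]


# Hypothetical counts for one article; real counts would come from the indexer.
sample_counts = {
    "White blood cells": 12, "Immune cells": 9, "Immunology": 7,
    "Cell biology": 5, "Animal cells": 4, "Blood cells": 4,
    "Cellular types": 3, "T cells": 2, "Obesity": 1, "Webs": 1,
}
print(top_hits(sample_counts))
```

Terms below the cut-off are simply dropped, which is why rule quality matters for the terms that do make the list.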
Elapsed time from project kick-off until terms appeared on published articles: 9 months
Learning curve – teething troubles
Not all articles had Subject Area terms – why not?
Initial implementation – text to index:
Title + Abstract* + Results + Methods
Upon consideration – text to index:
Full Text (though not references)
Implementation of “all paths”
Polyhierarchy implications
Consider “White blood cells”
Biology and life sciences > Immunology > Immune cells > White blood cells
Medicine and health sciences > Immunology > Immune cells > White blood cells
Biology and life sciences > Cell biology > Cellular types > Animal cells > Blood cells > White blood cells
Biology and life sciences > Cell biology > Cellular types > Animal cells > Immune cells > White blood cells
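The "all paths" idea can be sketched as a walk up a polyhierarchy in which a term may have more than one broader term. The broader-term map below is transcribed from the "White blood cells" example; treating "Immune cells" as a single term with two parents is an assumption, and `all_paths` is a hypothetical helper, not part of Data Harmony.

```python
# Term -> its broader terms (assumed structure, read off the example above).
BROADER = {
    "White blood cells": ["Immune cells", "Blood cells"],
    "Immune cells": ["Immunology", "Animal cells"],
    "Blood cells": ["Animal cells"],
    "Animal cells": ["Cellular types"],
    "Cellular types": ["Cell biology"],
    "Cell biology": ["Biology and life sciences"],
    "Immunology": ["Biology and life sciences", "Medicine and health sciences"],
}


def all_paths(term, broader):
    """Return every path from a top-level term down to `term`."""
    parents = broader.get(term, [])
    if not parents:  # no broader term: this is a top-level term
        return [[term]]
    return [path + [term]
            for parent in parents
            for path in all_paths(parent, broader)]


for path in all_paths("White blood cells", BROADER):
    print(" > ".join(path))
```

A single indexed term therefore contributes several ancestor chains, which is what lets a search on either top-level branch find the article.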
The polyhierarchy and Search
Establishing update cycle - articles:
Initial implementation:
Entire back-corpus indexed at once
New Papers:
PLOS submits text to MAIstro at publication
MAI returns terms and term frequencies
PLOS stores terms in search engine
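The three steps for new papers can be sketched as a small pipeline. Everything here is a stand-in: `index_text` is a toy phrase matcher rather than the MAIstro interface, `RULES` holds invented trigger phrases, and a plain dictionary plays the role of the search engine.

```python
import re
from collections import Counter

# Hypothetical trigger phrases per thesaurus term (not real PLOS rules).
RULES = {
    "White blood cells": ["white blood cell", "leukocyte"],
    "Obesity": ["obesity"],
}


def index_text(text, rules):
    """Return {term: frequency} for every term whose phrases occur in the text."""
    lowered = text.lower()
    counts = Counter()
    for term, phrases in rules.items():
        for phrase in phrases:
            counts[term] += len(re.findall(re.escape(phrase), lowered))
    return {term: n for term, n in counts.items() if n > 0}


def publish(article_id, text, store, rules=RULES):
    """Index an article at publication time and keep its terms in the store."""
    store[article_id] = index_text(text, rules)
    return store[article_id]


search_store = {}
publish("example-article",
        "Leukocytes and white blood cells were counted. White blood cell counts rose.",
        search_store)
print(search_store)
```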
Establishing update cycle - thesauri:
Separate instances (nerves):
Production server – plosthes.2013-6
Working version – plosthes.2013-7
When ready to release a new version:
Load onto test server – MAI corpus - Index
Test: new/changed/deleted terms
rule changes
structural changes
any implementation changes
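Listing new, deleted and moved terms between the production and working versions can be sketched as a comparison of two term maps. This assumes each version can be exported as a mapping from term to its broader terms; `diff_terms` and the sample data are hypothetical, and the hierarchy shown is invented for illustration rather than copied from the PLOS thesaurus.

```python
def diff_terms(old, new):
    """Compare two {term: [broader terms]} maps and report what changed."""
    old_terms, new_terms = set(old), set(new)
    return {
        "added": sorted(new_terms - old_terms),
        "deleted": sorted(old_terms - new_terms),
        # Structural change: same term, different set of broader terms.
        "moved": sorted(t for t in old_terms & new_terms
                        if set(old[t]) != set(new[t])),
    }


# Invented example versions (not the real plosthes releases).
production = {"T cells": ["White blood cells"], "Webs": ["Ecology"],
              "Snails": ["Molluscs"]}
working = {"T cells": ["White blood cells"], "Memory T cells": ["T cells"],
           "Snails": ["Gastropods"]}

print(diff_terms(production, working))
```

The resulting lists give a checklist of terms to inspect once the test server has re-indexed the corpus.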
Thesaurus updates – why?
More terms : Memory T cells, Monocotyledons
Errrm… : Report gene detection
What? : Webs
Hierarchy changes deemed desirable:
Geographical locations
Organisms
(Un)Rule(y) : snails, fabrication, pumas
Thesaurus updates – how?
Rule-Building in MAIstro – Pumas before...
PUMA: p53 upregulated modifier of apoptosis
or puma, the animal? [image not preserved]
Rule-Building in MAIstro – Pumas after…
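The before and after rule screenshots are not preserved here, so the following only illustrates the general idea of a conditional indexing rule: the word "puma" alone is not enough, and the animal term is assigned only when the surrounding text supports it. The cue lists and the function `index_puma` are assumptions written in Python, not MAI Rule Builder syntax.

```python
import re

# Hypothetical context cues; a real rule set would be tuned against the corpus.
ANIMAL_CUES = ("cougar", "felid", "habitat", "predator", "wildlife")
GENE_CUES = ("p53", "apoptosis", "bcl-2", "upregulated")


def index_puma(text):
    """Return the term "Pumas" only when the context looks zoological."""
    lowered = text.lower()
    if not re.search(r"\bpumas?\b", lowered):
        return None  # trigger word absent: rule does not fire
    animal = sum(cue in lowered for cue in ANIMAL_CUES)
    gene = sum(cue in lowered for cue in GENE_CUES)
    # Assign the animal term only when animal cues outnumber gene cues.
    return "Pumas" if animal > gene else None


print(index_puma("PUMA mediates p53-dependent apoptosis in tumour cells."))
print(index_puma("Pumas are apex predators whose habitat is shrinking."))
```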
Thesaurus updates – prioritisation?
Mis-hits and missed term reports:
Ourselves:
article pages
Our readers:
in email
complaints on Twitter
in correspondence with our editorial staff
via Journal and Saved Search alerts
via article pages – Flagged Term reports
Things we learned – Thesaurus editorial
Tension:
strict and rigorous taxonomy/ontology construction
vs
user utility
Abbreviations and Synonyms
Issues that continue to exercise us:
T cells/Memory T cells
Obesity/Childhood obesity
When should we make both explicit?
Rule work – working to top 8
Building a new project - exports
Building a new project - import
Content Discovery
How has having the thesaurus changed the way that users interact with PLOS web sites?
• Journal alerts
• Saved Searches
• RSS feeds
• Hierarchy exploration
Problem:
How to keep up?
Solution:
Current Awareness Tools
Journal alerts
Saved search
RSS feeds
Hierarchy exploration
Relative Metrics
Relative Metrics: Defining a Paper’s Peer Group
1. Group papers by Subject Area
Accommodate multiple topics per paper
2. Group papers by age
Important for comparison of cumulative measures like total downloads or citations
3. Determine norms for peer group
The average usage of each paper is compared with the median usage of its peer group
More on Relative Metrics at:
http://www.plosone.org/static/almInfo#relativeMetrics
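The three steps can be sketched as follows: for each Subject Area of a paper, take the papers of the same age in that area, compute the median usage, and express the paper's usage relative to it. This is a minimal sketch under assumptions: publication year stands in for age, "views" stands in for usage, the paper counts as a member of its own peer group, and the records are invented.

```python
from statistics import median


def relative_usage(paper, corpus):
    """Map each Subject Area of `paper` to its views divided by the peer median."""
    result = {}
    for area in sorted(paper["areas"]):
        # Peer group: same Subject Area and same publication year.
        peers = [p["views"] for p in corpus
                 if area in p["areas"] and p["year"] == paper["year"]]
        norm = median(peers) if peers else 0
        if norm:  # skip areas with no usable norm
            result[area] = paper["views"] / norm
    return result


# Invented records for illustration only.
corpus = [
    {"id": "A", "year": 2012, "areas": {"Immunology", "Cell biology"}, "views": 300},
    {"id": "B", "year": 2012, "areas": {"Immunology"}, "views": 100},
    {"id": "C", "year": 2012, "areas": {"Immunology"}, "views": 200},
    {"id": "D", "year": 2012, "areas": {"Cell biology"}, "views": 150},
    {"id": "E", "year": 2011, "areas": {"Immunology"}, "views": 1000},
]

print(relative_usage(corpus[0], corpus))
```

Computing one ratio per Subject Area is one way to accommodate multiple topics per paper; paper E is excluded from A's peer groups because it is a year older.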
Area of development - Editorial Workflow
The PLOS Thesaurus and Peer Review
Maintaining a copy of the PLOS thesaurus in Editorial Manager helps with editor and reviewer matching
Classifications for People
Classifications for Papers
• Authors select Subject Area terms related to their article submissions
• Editors and Reviewers select terms that represent their areas of expertise
• Staff and Editors use these terms to help ensure editors and reviewers are well matched to the submissions they are handling
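One simple way to express "well matched" is the overlap between a submission's terms and each reviewer's expertise terms. The sketch below uses Jaccard similarity for that purpose; this is an illustrative assumption, not the method used by PLOS or Editorial Manager, and the reviewer names and terms are invented.

```python
def jaccard(a, b):
    """Share of terms in common between two term collections (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_reviewers(paper_terms, reviewers):
    """Return reviewer names with any overlap, best match first."""
    scored = [(jaccard(paper_terms, terms), name)
              for name, terms in reviewers.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


# Invented reviewers and expertise terms.
reviewers = {
    "Reviewer 1": {"Immunology", "T cells"},
    "Reviewer 2": {"Obesity"},
    "Reviewer 3": {"Immunology"},
}
print(rank_reviewers({"Immunology", "T cells"}, reviewers))
```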
Planned Enhancements
• Automate the application of terms associated with Editors, Reviewers and submitted articles with MAIstro
• Provide Editors and Staff with detailed terms to assist with reviewer selection and vetting
– Academic disciplines help Editors gauge Subject Area relevance of potential Reviewers
– Methods, protocols and model organisms help Editors gauge technical suitability of potential Reviewers
Dramatis personae:
Jonas Dupuich – Product Manager
Patrick Polischuk – Product Manager
Sebastian Toomey – Interaction Designer
Jennifer Lin – Senior Product Manager
Martin Fenner – ALM Technical Lead
Kallie Huss – Senior Publications Assistant
John Chodacki – Director, Product Management