NATIONAL LIBRARY OF MEDICINE
Structured documents: Why we want them, When we want them, and How we get them when we want them
Jeff Beck, NCBI/NLM/NIH
But first ...
A personal opinion on PDFs.
“Beyond the PDF” should not mean making better PDFs. It should mean insisting on something better than a PDF.
Quick notes about me
Background in printing/publishing.
Handle all SGML/XML ingest into PubMed Central.
Manage the “NLM DTD” project and co-chair the NISO “Standardized Markup for Journal Articles” project.
Why Structured Documents?
1. support automated processing
2. ease interchange between organizations
3. support reuse of articles as a whole or in part
4. support external indexing/tagging and text mining
Simply put
A structured document has its parts labeled for what they ARE rather than what they look like.
<article-meta>
  <article-id pub-id-type="manuscript">nihpa244840</article-id>
  <article-id pub-id-type="doi">10.3758/BRM.40.1.278</article-id>
  <title-group>
    <article-title>New and updated tests of print exposure and reading abilities in college students</article-title>
  </title-group>
  <contrib-group>
    <contrib contrib-type="author">
      <name>
        <surname>Acheson</surname>
        <given-names>Daniel J.</given-names>
      </name>
    </contrib>
    <contrib contrib-type="author">
      <name>
        <surname>Wells</surname>
        <given-names>Justine B.</given-names>
      </name>
    </contrib>
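Because each part is labeled, a program can pull out the identifiers, title, and authors without guessing. A minimal Python sketch of that idea follows. It assumes the excerpt above is closed off with the end tags that the slide cuts away (`</contrib-group>` and `</article-meta>` are added here only so the fragment parses), and the function name `read_article_meta` is illustrative, not part of any NLM tool.

```python
import xml.etree.ElementTree as ET

# The slide's excerpt, with closing tags added (an assumption) so it parses.
ARTICLE_META = """
<article-meta>
  <article-id pub-id-type="manuscript">nihpa244840</article-id>
  <article-id pub-id-type="doi">10.3758/BRM.40.1.278</article-id>
  <title-group>
    <article-title>New and updated tests of print exposure and reading abilities in college students</article-title>
  </title-group>
  <contrib-group>
    <contrib contrib-type="author">
      <name><surname>Acheson</surname><given-names>Daniel J.</given-names></name>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Wells</surname><given-names>Justine B.</given-names></name>
    </contrib>
  </contrib-group>
</article-meta>
"""

def read_article_meta(xml_text):
    """Return ids, title, and author names found in an <article-meta> fragment."""
    root = ET.fromstring(xml_text.strip())
    # Every identifier is keyed by its declared type, so no pattern matching is needed.
    ids = {e.get("pub-id-type"): e.text for e in root.findall("article-id")}
    title = root.findtext("title-group/article-title")
    authors = [
        f"{c.findtext('name/given-names')} {c.findtext('name/surname')}"
        for c in root.findall("contrib-group/contrib[@contrib-type='author']")
    ]
    return {"ids": ids, "title": title, "authors": authors}

meta = read_article_meta(ARTICLE_META)
print(meta["ids"]["doi"])
print(meta["authors"])
```

Nothing in the lookup depends on font size, position on the page, or any other visual clue; it relies only on the element names and attributes.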
The Costs of Document Structure
But getting a document into XML has costs: monetary costs, time costs, and potential quality costs.
(Because humans must be involved somewhere)
Generally, the later in the workflow that the XML is made, the higher the cost and the poorer the quality.
If we can get articles into the structured format at authoring time, we'll be able to simplify the workflow and get more richly tagged XML at the end of the process.
But ...
Authors* don't think about structure
They want “The Word Authoring Experience”
They just want to control the look of their documents …
And we XML folks sneer at that.
Re-But ...
Authors do think about the document structure.
Sure, they are used to starting with a blank page and writing whatever they want – even if they are writing some very formally structured content.
They are used to working in a WYSIWYG environment.
Pseudo-Structure
Although they start with a blank page, they imply the structure of the article with formatting clues that they put into the article – the same way that it is done in print.
Content in XML is explicitly structured. Content in print (or in a Word file) is implicitly structured – or pseudo-structured.
When you look at a printed journal article, it is as easy (sometimes easier) to tell what the title of the article is as when you look at the explicitly structured XML:
- in the XML you will find a string with an element name like <article-title>
- in the printed paper, it is just there and you know it. It is the title of the article. It is obvious!
- it is obvious because the reader has been trained to recognize the formatting clues that identify the title of the article as the title.
So, authors know and understand the structure of their article, and define the structure using the same formatting clues we have all been trained to decipher.
The StyleSheet is in our brains.
The real issue with moving from pseudo-structured content to explicitly structured content is not the transformation (in theory this is simple); it is the disconnect between what something looks like in the editor and what is available for the machine to read.
When something looks like a second-level heading in print, it IS a second-level heading to the author and to anyone reading the screen or a printed-out copy of the article.
BUT, our existing authoring tools are so flexible that they allow many ways to get the same on-screen output. This, of course, is a service to any author who only cares what the article LOOKS LIKE.
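To make the problem concrete, here is a small Python sketch. The `Paragraph` record and the `looks_like` and `headings_by_style` helpers are hypothetical, not the data model of any real word processor; they only illustrate two paragraphs that render identically while just one of them carries a machine-readable label.

```python
from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str
    style: str     # named style chosen by the author, e.g. "Heading 2"
    bold: bool     # resolved on-screen appearance
    size_pt: int   # resolved on-screen appearance

def looks_like(p):
    """Return the (bold, size) pair a human reader actually sees."""
    return (p.bold, p.size_pt)

def headings_by_style(paragraphs):
    """What a machine finds if it can only trust style names."""
    return [p.text for p in paragraphs if p.style.startswith("Heading")]

doc = [
    Paragraph("Methods", "Heading 2", True, 14),  # styled as a heading
    Paragraph("Results", "Normal", True, 14),     # hand-formatted to match
]

same_look = looks_like(doc[0]) == looks_like(doc[1])
print(same_look)               # True
print(headings_by_style(doc))  # ['Methods']
```

Both headings are obvious to a reader, but only the first is visible to the machine. Guessing from bold and size would recover the second, and would also mislabel any bold 14-point line that is not a heading.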
The Siren Song of WYSIWYG
The tool cannot be WYSIWYG because what you get is not always what you want.
We need WYSIWYM – What You See Is What You Mean – so that the authors and the machines can SEE the explicit structure of the document.
For a true structured authoring tool, we need explicit structure from the beginning of the authoring process.
- because we cannot reliably impose an explicit structure onto the pseudo-structured content we get from WYSIWYG tools.
Therefore, we cannot “post-process” content from a pseudo-structured tool like Google Knol to get valid and correct XML every time.
For an authoring tool to be successful, it needs to make some demands of authors that existing tool makers have not been willing to impose:
- form-based input for some structures like frontmatter
- not allowed to “do anything anywhere”, i.e., you can only insert a figure where a figure is allowed in the controlling schema
- text formatting must be restricted (so it is not abused)
- some objects might not be available.
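The "only where the schema allows it" rule can be sketched as a lookup against a content model. In the Python sketch below, the `ALLOWED_CHILDREN` table is a made-up, heavily simplified stand-in for a controlling schema (it is not the actual JATS content model), and `can_insert` and `insert` are hypothetical editor hooks.

```python
# Hypothetical, simplified content model: parent element -> permitted children.
ALLOWED_CHILDREN = {
    "body": {"sec", "p", "fig"},
    "sec": {"title", "sec", "p", "fig"},
    "p": {"bold", "italic"},   # restricted inline formatting only
    "article-title": set(),    # text only, in this sketch
}

def can_insert(parent, child):
    """Return True if the sketch schema permits `child` directly inside `parent`."""
    return child in ALLOWED_CHILDREN.get(parent, set())

def insert(tree, parent, child):
    """Record `child` under `parent`, refusing anything the schema disallows."""
    if not can_insert(parent, child):
        raise ValueError(f"<{child}> is not allowed inside <{parent}>")
    tree.setdefault(parent, []).append(child)
    return tree

doc = {}
insert(doc, "sec", "fig")                  # accepted: a figure may sit in a section
print(can_insert("article-title", "fig"))  # False
```

An editor built this way would simply not offer "insert figure" while the cursor is in the title, which is the kind of restriction the list above asks authors to accept.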
And
People have to use it
Which means people have to want to use it.
Or need to use it.
Things I'd love to talk about
1. Building Domain-specific models or ontologies from general JATS elements.
With a domain-specific validation layer.
2. Any suggested extensions to the JATS (NLM DTD).
3. NCBI/NLM needs to ...