NATIONAL LIBRARY OF MEDICINE

Structured documents: Why we want them, When we want them, and How we get them when we want them

Jeff Beck, NCBI/NLM/NIH

[email protected]

But first ...

A personal opinion on PDFs.

“Beyond the PDF” should not mean making better PDFs. It should mean insisting on something better than a PDF.

Quick notes about me

Background in printing/publishing.

Handle all SGML/XML ingest into PubMed Central.

Manage the “NLM DTD” project and co-chair the NISO “Standardized Markup for Journal Articles” project.

Why Structured Documents?

1.  enable automated processing

2.  ease interchange between organizations

3.  support reuse of articles as a whole or in part

4.  support external indexing/tagging and text mining

Simply put

A structured document has its parts labeled for what they ARE rather than what they look like.

<article-meta>
  <article-id pub-id-type="manuscript">nihpa244840</article-id>
  <article-id pub-id-type="doi">10.3758/BRM.40.1.278</article-id>
  <title-group>
    <article-title>New and updated tests of print exposure and reading abilities in college students</article-title>
  </title-group>
  <contrib-group>
    <contrib contrib-type="author">
      <name>
        <surname>Acheson</surname>
        <given-names>Daniel J.</given-names>
      </name>
    </contrib>
    <contrib contrib-type="author">
      <name>
        <surname>Wells</surname>
        <given-names>Justine B.</given-names>
      </name>
    </contrib>
  </contrib-group>
</article-meta>
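Because each part is labeled, a program can pull out the pieces it needs without guessing. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the function name `summarize` is hypothetical, and closing tags have been added to the fragment so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide; closing tags added here so it parses on its own.
ARTICLE_META = """
<article-meta>
  <article-id pub-id-type="manuscript">nihpa244840</article-id>
  <article-id pub-id-type="doi">10.3758/BRM.40.1.278</article-id>
  <title-group>
    <article-title>New and updated tests of print exposure and reading abilities in college students</article-title>
  </title-group>
  <contrib-group>
    <contrib contrib-type="author">
      <name><surname>Acheson</surname><given-names>Daniel J.</given-names></name>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Wells</surname><given-names>Justine B.</given-names></name>
    </contrib>
  </contrib-group>
</article-meta>
"""


def summarize(xml_text):
    """Return the DOI, title, and author names found in an <article-meta> fragment."""
    root = ET.fromstring(xml_text)
    # Elements are selected by what they ARE (name and attributes), not by appearance.
    doi = root.findtext("article-id[@pub-id-type='doi']")
    title = root.findtext("title-group/article-title")
    authors = [
        f"{c.findtext('name/given-names')} {c.findtext('name/surname')}"
        for c in root.findall("contrib-group/contrib[@contrib-type='author']")
    ]
    return {"doi": doi, "title": title, "authors": authors}


if __name__ == "__main__":
    print(summarize(ARTICLE_META))
```

The same lookup against a PDF or a Word file would have to infer the title and author names from position and formatting.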

The Costs of Document Structure

But getting a document into XML has costs: monetary costs, time costs, and potential quality costs.

(Because humans must be involved somewhere)

Generally, the later in the workflow that the XML is made, the higher the cost and the poorer the quality.

If we can get articles into a structured format at authoring time, we can simplify the workflow and get more richly tagged XML at the end of the process.

But ...

Authors* don't think about structure

They want “The Word Authoring Experience”

They just want to control the look of their documents …

And we XML folks sneer at that.

Re-But ...

Authors do think about the document structure.

Sure, they are used to starting with a blank page and writing whatever they want – even if they are writing some very formally structured content. 

They are used to working in the WYSIWYG environment.

Pseudo-Structure

Although they start with a blank page, they imply the structure of the article with formatting clues that they put into the article – the same way that it is done in print. 

Content in XML is explicitly structured. Content in print (or in a Word file) is implicitly structured – or pseudo-structured. 

When you look at a printed journal article, it is as easy (sometimes easier) to tell what the title of the article is as when you look at the explicitly structured XML.

- in the XML, you will find a string with an element name like <article-title>

- in the printed paper, it is just there and you know it. It is the title of the article. It is obvious!

- it is obvious because the reader has been trained to recognize the formatting clues that identify the title of the article as the title.

So, authors know and understand the structure of their article, and define the structure using the same formatting clues we have all been trained to decipher.

The StyleSheet is in our brains.

The real issue with moving from pseudo-structured content to explicitly structured content is not the transformation (in theory, this is simple); it is the disconnect between what something looks like in the editor and what is available for the machine to read.

When something looks like a second-level heading in print, it IS a second-level heading to the author and to anyone reading the screen or a printed-out copy of the article.

BUT, our existing authoring tools are so flexible that they allow many ways to get the same on-screen output. This, of course, is a service to any author who only cares what the article LOOKS LIKE. 
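To make the disconnect concrete, here is a small sketch in Python. The `Paragraph` record, its fields, and the assumption that a "Heading 2" style renders as bold 14 pt are all hypothetical; no real word-processor file format is being modeled.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    style: str    # named style recorded in the file (hypothetical field)
    bold: bool
    size_pt: int


# Assumption for illustration: the "Heading 2" style renders as bold 14 pt.
# All three paragraphs therefore look identical on screen.
doc = [
    Paragraph("Methods", "Heading 2", True, 14),             # styled as a heading
    Paragraph("Results", "Normal", True, 14),                # hand-formatted to look like one
    Paragraph("Do not reuse samples.", "Normal", True, 14),  # emphasis, not a heading
]


def headings_by_style(paragraphs):
    """Trust only the named style: misses hand-formatted headings."""
    return [p.text for p in paragraphs if p.style == "Heading 2"]


def headings_by_look(paragraphs):
    """Trust only the appearance: also picks up emphasized body text."""
    return [p.text for p in paragraphs if p.bold and p.size_pt == 14]


if __name__ == "__main__":
    print(headings_by_style(doc))
    print(headings_by_look(doc))
```

Neither function recovers the author's intent (the two headings "Methods" and "Results"): the information that separates a heading from emphasized text was never recorded in the file.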

The Siren Song of WYSIWYG

The tool cannot be WYSIWYG because what you get is not always what you want.

We need WYSIWYM – What You See Is What You Mean – so that the authors and the machines can SEE the explicit structure of the document.

For a true structured authoring tool, we need explicit structure from the beginning of the authoring process. 

 - because we cannot reliably impose an explicit structure onto the pseudo-structured content we get from WYSIWYG tools.

Therefore, we cannot “post-process” content from a pseudo-structured tool like Google Knol to get valid and correct XML every time. 

For an authoring tool to be successful, it needs to make some demands of authors that existing tool makers were not willing to impose:

- form-based input for some structures like front matter

- not allowed to “do anything anywhere”, i.e., you can only insert a figure where a figure is allowed in the controlling schema

- text formatting must be restricted (so it is not abused)

- some objects might not be available.
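The "only where the schema allows it" rule can be sketched as a lookup the editor consults before offering an insert. The content model below is hypothetical and heavily simplified; it is not the actual NLM DTD content model, and the function names are illustrative only.

```python
# Hypothetical, heavily simplified content model: parent element -> allowed children.
# Illustrative only; this is NOT the actual NLM DTD / JATS content model.
ALLOWED_CHILDREN = {
    "body": {"sec", "p", "fig"},
    "sec": {"title", "p", "fig", "sec"},
    "p": {"bold", "italic"},
    "title": {"italic"},
}


def can_insert(parent, child):
    """Return True if the controlling model allows `child` directly inside `parent`."""
    return child in ALLOWED_CHILDREN.get(parent, set())


def insert_options(parent):
    """List what the editor should offer at this point, in a stable order."""
    return sorted(ALLOWED_CHILDREN.get(parent, set()))


if __name__ == "__main__":
    print(can_insert("sec", "fig"))    # a figure inside a section
    print(can_insert("title", "fig"))  # a figure inside a title
    print(insert_options("p"))
```

An editor built this way would disable or hide the "insert figure" command while the cursor is in a title, so invalid structure is never created in the first place.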

And

People have to use it

Which means people have to want to use it.

Or need to use it.

Things I'd love to talk about

1. Building Domain-specific models or ontologies from general JATS elements.

With a domain-specific validation layer.

2. Any suggested extensions to the JATS (NLM DTD).

3. NCBI/NLM needs to ...
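One way to picture the domain-specific validation layer from item 1 is a set of extra checks run on top of documents that already use the general elements. The sketch below uses Python's standard `xml.etree.ElementTree`; the two rules (a DOI is required, every author needs a surname) and the name `check_domain_rules` are hypothetical examples, not requirements of the NLM DTD.

```python
import xml.etree.ElementTree as ET


def check_domain_rules(article_meta_xml):
    """Apply hypothetical domain rules to an <article-meta> fragment.

    Returns a list of problem descriptions; an empty list means all rules passed.
    """
    root = ET.fromstring(article_meta_xml)
    problems = []
    # Hypothetical rule 1: this domain requires a DOI.
    if root.find("article-id[@pub-id-type='doi']") is None:
        problems.append("missing DOI article-id")
    # Hypothetical rule 2: every author must have a non-empty surname.
    for contrib in root.findall("contrib-group/contrib[@contrib-type='author']"):
        if not (contrib.findtext("name/surname") or "").strip():
            problems.append("author without surname")
    return problems


if __name__ == "__main__":
    sample = (
        "<article-meta>"
        "<article-id pub-id-type='manuscript'>nihpa244840</article-id>"
        "<contrib-group><contrib contrib-type='author'>"
        "<name><given-names>Daniel J.</given-names></name>"
        "</contrib></contrib-group>"
        "</article-meta>"
    )
    print(check_domain_rules(sample))
```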