
NATIONAL LIBRARY OF MEDICINE

Structured documents: Why we want them, When we want them, and How we get them when we want them

Jeff Beck

NCBI/NLM/NIH

[email protected]


But first ...

A personal opinion on PDFs.


“Beyond the PDF” should not mean making better PDFs. It should mean insisting on something better than a PDF.


Quick notes about me

Background in printing/publishing.

Handle all SGML/XML ingest into PubMed Central.

Manage the “NLM DTD” project and co-chair the NISO “Standardized Markup for Journal Articles” project.


Why Structured Documents?

1. Enable automated processing

2. Ease interchange between organizations

3. Support reuse of articles as a whole or in part

4. Support external indexing/tagging and text mining


Simply put

A structured document has its parts labeled for what they ARE rather than what they look like.

<article-meta>
  <article-id pub-id-type="manuscript">nihpa244840</article-id>
  <article-id pub-id-type="doi">10.3758/BRM.40.1.278</article-id>
  <title-group>
    <article-title>New and updated tests of print exposure and reading abilities in college students</article-title>
  </title-group>
  <contrib-group>
    <contrib contrib-type="author">
      <name>
        <surname>Acheson</surname>
        <given-names>Daniel J.</given-names>
      </name>
    </contrib>
    <contrib contrib-type="author">
      <name>
        <surname>Wells</surname>
        <given-names>Justine B.</given-names>
      </name>
    </contrib>
  </contrib-group>
</article-meta>
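To make "automated processing" concrete, here is a minimal sketch of what a machine can do with metadata like the excerpt above. It uses only Python's standard library. The function name `summarize` and the shape of the returned dictionary are illustrative assumptions, not part of any NLM or PMC tool.

```python
import xml.etree.ElementTree as ET

# The excerpt from the slide, with its open elements closed so it parses on its own.
ARTICLE_META = """
<article-meta>
  <article-id pub-id-type="manuscript">nihpa244840</article-id>
  <article-id pub-id-type="doi">10.3758/BRM.40.1.278</article-id>
  <title-group>
    <article-title>New and updated tests of print exposure and reading abilities in college students</article-title>
  </title-group>
  <contrib-group>
    <contrib contrib-type="author">
      <name><surname>Acheson</surname><given-names>Daniel J.</given-names></name>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Wells</surname><given-names>Justine B.</given-names></name>
    </contrib>
  </contrib-group>
</article-meta>
"""


def summarize(meta_xml):
    """Pull the DOI, title, and author names out of an <article-meta> fragment.

    Hypothetical helper: every lookup is by element name, never by appearance.
    """
    root = ET.fromstring(meta_xml)
    doi = root.findtext("article-id[@pub-id-type='doi']")
    title = root.findtext("title-group/article-title")
    authors = [
        f"{c.findtext('name/given-names')} {c.findtext('name/surname')}"
        for c in root.findall("contrib-group/contrib[@contrib-type='author']")
    ]
    return {"doi": doi, "title": title, "authors": authors}


summary = summarize(ARTICLE_META)
print(summary["title"])
print(summary["authors"])
```

No guessing is involved: the title is whatever sits inside the article-title element, and an author is whatever is labeled as one.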


The Costs of Document Structure

But getting a document into XML has costs: monetary costs, time costs, and potential quality costs.

(Because humans must be involved somewhere)

Generally, the later in the workflow that the XML is made, the higher the cost and the poorer the quality.


If we can get articles into the structured format at authoring time, we'll be able to have a simplified workflow and get richer tagged XML at the end of the process.


But ...

Authors* don't think about structure

They want “The Word Authoring Experience”

They just want to control the look of their documents …

And we XML folks sneer at that.


Re-But ...

Authors do think about the document structure.

Sure, they are used to starting with a blank page and writing whatever they want – even if they are writing some very formally structured content.

They are used to working in the WYSIWYG environment.


Pseudo-Structure

Although they start with a blank page, they imply the structure of the article with formatting clues that they put into the article – the same way that it is done in print.

Content in XML is explicitly structured. Content in print (or in a Word file) is implicitly structured – or pseudo-structured.


When you look at a printed journal article it is as easy (sometimes easier) to tell what the title of the article is as when you look at the explicitly structured XML.

- in the XML you will find a string with an element name like <article-title>

- in the printed paper, it is just there and you know it. It is the title of the article. It is obvious!

- it is obvious because the reader has been trained to recognize the formatting clues that identify the title of the article as the title. So, authors know and understand the structure of their article, and define the structure using the same formatting clues we have all been trained to decipher.


The StyleSheet is in our brains.



The real issue with moving from pseudo-structured content to explicitly structured content is not the transformation (in theory this is simple),

it is the disconnect between what something looks like in the editor and what is available for the machine to read.






When something looks like a second-level heading in print, it IS a second-level heading to the author and to anyone reading the screen or a printed-out copy of the article.

BUT, our existing authoring tools are so flexible that they allow many ways to get the same on-screen output. This, of course, is a service to any author who only cares what the article LOOKS LIKE.
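A small sketch of that disconnect, under stated assumptions: the `Paragraph` model, the style table, and both helper functions are hypothetical and do not correspond to any real word processor's file format. Two paragraphs look identical on screen, but only one of them tells the machine it is a heading.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical named styles and how they render: (font size in points, bold).
STYLE_LOOK = {"Heading 2": (14, True), "Normal": (11, False)}


@dataclass
class Paragraph:
    text: str
    style: str = "Normal"
    size_override: Optional[int] = None   # manual font-size formatting
    bold_override: Optional[bool] = None  # manual bold formatting


def rendered_look(p):
    """What the author sees: the named style plus any manual overrides."""
    size, bold = STYLE_LOOK[p.style]
    if p.size_override is not None:
        size = p.size_override
    if p.bold_override is not None:
        bold = p.bold_override
    return (size, bold)


def explicit_heading_level(p):
    """What the machine can rely on: the named style only."""
    if p.style.startswith("Heading "):
        return int(p.style.split()[1])
    return None


def guess_heading_level(p):
    """A post-processing heuristic: if it looks like Heading 2, call it one."""
    return 2 if rendered_look(p) == STYLE_LOOK["Heading 2"] else None


styled = Paragraph("Methods", style="Heading 2")
manual = Paragraph("Methods", size_override=14, bold_override=True)
warning = Paragraph("Do not exceed the stated dose.", size_override=14, bold_override=True)

print(rendered_look(styled) == rendered_look(manual))  # same look on screen
print(explicit_heading_level(manual))                  # but no explicit structure
print(guess_heading_level(warning))                    # and the heuristic misfires
```

The heuristic rescues the manually formatted heading, but it also promotes an emphasized warning to a heading. The look alone cannot tell the two apart.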


The Siren Song of WYSIWYG

The tool cannot be WYSIWYG because what you get is not always what you want.

We need WYSIWYM (What You See Is What You Mean), so that the authors and the machines can SEE the explicit structure of the document.


For a true structured authoring tool, we need explicit structure from the beginning of the authoring process.

- because we cannot reliably impose an explicit structure onto the pseudo-structured content we get from WYSIWYG tools.

Therefore, we cannot “post-process” content from a pseudo-structured tool like Google Knol to get valid and correct XML every time.


For an authoring tool to be successful, it needs to make some demands of authors that existing tool makers were not willing to impose:

- form-based input for some structures, like front matter

- not allowed to “do anything anywhere”, i.e., you can only insert a figure where a figure is allowed in the controlling schema

- text formatting must be restricted (so it is not abused)

- some objects might not be available.
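The "only where the schema allows it" demand can be sketched as an editor that checks a content model before every insert. The content model below is a deliberately simplified, hypothetical one (it is not the real JATS content model), and `Node`, `StructureError`, and `insertable` are illustrative names.

```python
# Hypothetical, simplified content model: which children each element may contain.
ALLOWED_CHILDREN = {
    "article": {"front", "body", "back"},
    "body": {"sec", "p"},
    "sec": {"title", "p", "fig", "sec"},
    "p": {"bold", "italic"},
    "fig": {"label", "caption", "graphic"},
}


class StructureError(ValueError):
    """Raised when an insert would violate the controlling content model."""


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def insertable(self):
        """What the editor would offer in its 'insert' menu at this point."""
        return sorted(ALLOWED_CHILDREN.get(self.tag, set()))

    def insert(self, tag):
        """Append a child element, but only if the content model allows it here."""
        if tag not in ALLOWED_CHILDREN.get(self.tag, set()):
            raise StructureError(f"<{tag}> is not allowed inside <{self.tag}>")
        child = Node(tag)
        self.children.append(child)
        return child


section = Node("body").insert("sec")
section.insert("fig")        # allowed: this model permits figures in a section
para = section.insert("p")

try:
    para.insert("fig")       # rejected: this model has no figures inside paragraphs
    rejected = False
except StructureError as err:
    rejected = True
    print(err)

print(para.insertable())
```

Because every insert is checked at authoring time, the document cannot drift into a state that has to be repaired later. The cost is the one named above: the author cannot "do anything anywhere".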


And

People have to use it

Which means people have to want to use it

Or need to use it.


Things I'd love to talk about

1. Building Domain-specific models or ontologies from general JATS elements.

With a domain-specific validation layer.

2. Any suggested extensions to the JATS (NLM DTD).

3. NCBI/NLM needs to ...