Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (“Sasha”) Schwarzman
American Geophysical Union (AGU)
JATS-Con, November 2, 2010
Summary
We built a superset of the NLM Journal Publishing Tag Set (JPTS) in order to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset could have been sufficient to meet AGU's needs if it were used in conjunction with an appropriate layer validation technology, such as Schematron.
Alexander (“Sasha”) Schwarzman 2 Superset Me—Not JATS-Con Nov 2, 2010
Contents
• Why we built a JPTS superset
• DTD vs. Schematron
– Attribute values
– Number of element occurrences
– Element position & sequence
– References
• Lessons learned
Why we built a JPTS superset
• No generic book model
• Lack of familiarity with Schematron
• Lack of mature tool support (running SVRL not a viable option in Production environment)
• Lack of expertise on integrating Schematron with validation against relational DB
• JATS v2.3: no Compound Keywords, not all content models parameterized
DTD vs. Schematron: Attribute values
Requirement: Article type is required and can be one of three types: a regular article (rga), a correction (cor), or an editorial (edt)
Strict DTD
<!ATTLIST article article-type (rga | cor | edt) #REQUIRED >
JPTS
<!ATTLIST article article-type CDATA #IMPLIED >
DTD vs. Schematron: Attribute values (cont’d)
XML instance (contains non-allowed article type)
<article article-type='xxx'/>
Schematron
<rule context="article">
  <assert test="@article-type=('rga','cor','edt')">
    @article-type '<value-of select="@article-type"/>' not allowed, must be 'rga', 'cor', or 'edt'</assert>
</rule>
Schematron message
@article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
DTD vs. Schematron: Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly one paragraph, except for two journals (journal code ‘ja’ and ‘rg’) where Acknowledgments must contain two paragraphs
Strict DTD
<!ELEMENT ack (p, p?) >
JPTS
<!ELEMENT ack (p*) >
DTD vs. Schematron: Number of occurrences (cont’d)
XML instance (wrong number of paragraphs)
<article>
  ...
  <journal-id>jb</journal-id>
  ...
  <ack>
    <p>Blah</p>
    <p>Blah-blah</p>
  </ack>
</article>
DTD vs. Schematron: Number of occurrences (cont’d)
Schematron
<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
  <assert test="count(p) eq 2">
    '<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain exactly two paragraphs</assert>
</rule>
<rule context="ack">
  <assert test="count(p) eq 1">
    '<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain only one paragraph</assert>
</rule>
DTD vs. Schematron: Number of occurrences (cont’d)
Schematron message
'ack' in 'jb' must contain only one paragraph
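The logic of the two rules can also be sketched outside Schematron. The Python below is only an illustration, not AGU's implementation: the function name check_ack and the simplified instance (journal-id as a direct child of article) are assumptions made for this sketch. It mirrors the usual Schematron behaviour that, when both rules sit in the same pattern, a node is tested only by the first rule whose context matches, which is why the journal-specific rule comes first.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must contain two paragraphs (from the requirement).
TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}


def check_ack(article_xml):
    """Return messages for every <ack> with the wrong number of <p> children.

    Hypothetical helper: it mimics the two Schematron rules, with the
    journal-specific rule taking precedence over the general one.
    """
    root = ET.fromstring(article_xml)
    journal_id = root.findtext(".//journal-id")
    messages = []
    for ack in root.iter("ack"):
        paragraphs = len(ack.findall("p"))
        if journal_id in TWO_PARAGRAPH_JOURNALS:
            # First rule: journals 'ja' and 'rg' need exactly two paragraphs.
            if paragraphs != 2:
                messages.append(
                    f"'ack' in '{journal_id}' must contain exactly two paragraphs")
        elif paragraphs != 1:
            # Fallback rule: every other journal needs exactly one paragraph.
            messages.append(
                f"'ack' in '{journal_id}' must contain only one paragraph")
    return messages


SAMPLE = (
    "<article><journal-id>jb</journal-id>"
    "<ack><p>Blah</p><p>Blah-blah</p></ack></article>"
)
print(check_ack(SAMPLE))
```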
DTD vs. Schematron: Element position & sequence
Requirement: If a journal has subject grouping (ToC category, subset) and the article belongs to a special collection (special section, theme), then subject grouping info must precede special collection info
Strict DTD
<!ELEMENT article-categories (subject-group*, special-collection?) >
JPTS
<!ELEMENT article-categories (subj-group*) >
DTD vs. Schematron: Element position & sequence (cont’d)
XML instance (wrong sequence of subject groups)
<article-categories>
  <subj-group subj-group-type="special-section">
    <subject content-type="EARLYWARN1">New Methods and Applications of Earthquake Early Warning</subject>
  </subj-group>
  <subj-group subj-group-type="toc-category">
    <subject content-type="SDE">Solid Earth</subject>
  </subj-group>
</article-categories>
DTD vs. Schematron: Element position & sequence (cont’d)
Schematron
<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
  <assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
    <name/>/@subj-group-type='<value-of select="@subj-group-type"/>' must appear after a ToC Category or a Subset when either is present</assert>
</rule>
Schematron message
subj-group/@subj-group-type='special-section' must appear after a ToC Category or a Subset when either is present
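For readers less familiar with the following-sibling axis, the same ordering check can be sketched in Python. This is an illustration under assumptions, not part of the AGU Schematron: the function name check_group_order is hypothetical, and the input is assumed to be a bare article-categories element.

```python
import xml.etree.ElementTree as ET

SPECIAL_TYPES = {"special-section", "theme"}    # special collection info
GROUPING_TYPES = {"toc-category", "subset"}     # subject grouping info


def check_group_order(categories_xml):
    """Flag special-collection groups that are followed by a subject grouping.

    Equivalent in spirit to
    not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')]).
    """
    categories = ET.fromstring(categories_xml)
    groups = categories.findall("subj-group")  # child elements in document order
    messages = []
    for index, group in enumerate(groups):
        group_type = group.get("subj-group-type")
        if group_type not in SPECIAL_TYPES:
            continue
        later = groups[index + 1:]  # the following siblings of this group
        if any(g.get("subj-group-type") in GROUPING_TYPES for g in later):
            messages.append(
                f"subj-group/@subj-group-type='{group_type}' must appear after "
                "a ToC Category or a Subset when either is present")
    return messages


WRONG_ORDER = (
    '<article-categories>'
    '<subj-group subj-group-type="special-section"><subject>A</subject></subj-group>'
    '<subj-group subj-group-type="toc-category"><subject>B</subject></subj-group>'
    '</article-categories>'
)
print(check_group_order(WRONG_ORDER))
```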
DTD vs. Schematron: References
Validating references is a challenge:
• Variety vs. the need to enforce editorial style
Strict DTD:
• Fixed element order, no mixed content
• Punctuation, spacing, face markup – on output
JPTS:
• Lots of elements, any order, mixed content
• Punctuation, spacing, face markup included
DTD vs. Schematron: References (cont’d)
Strict DTD
<!ELEMENT book-standalone-citation ((person-group | string-name), year, source, edition?, (person-group | string-name)?, size?, elocation-id?, publisher-name, publisher-loc) >
<!ATTLIST book-standalone-citation id ID #REQUIRED >
DTD vs. Schematron: References (cont’d)
JPTS
<!ELEMENT mixed-citation (#PCDATA | person-group | string-name | year | source | edition | size | elocation-id | publisher-name | publisher-loc | ... | ...)* >
<!ATTLIST mixed-citation id ID #IMPLIED publication-type CDATA #IMPLIED >
DTD vs. Schematron: References (cont’d)
Example:
Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.
DTD vs. Schematron: References (cont’d)
XML instance (strict DTD)
<book-standalone-citation id="mood63">
  <person-group person-group-type="author">
    <name><surname>Mood</surname> <given-names>A. M.</given-names></name>
    <name><surname>Graybill</surname> <given-names>F. A.</given-names></name>
  </person-group>
  <year>1963</year>
  <source>Introduction to the Theory of Statistics</source>
  <edition>2nd</edition>
  <size units="page">295 pp</size>
  <publisher-name>McGraw-Hill</publisher-name>
  <publisher-loc>New York</publisher-loc>
</book-standalone-citation>
DTD vs. Schematron: References (cont’d)
XML instance (JPTS)
<mixed-citation publication-type="book-standalone">
  <string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>, and
  <string-name><given-names>F. A.</given-names> <surname>Graybill</surname></string-name>
  (<year>1963</year>), <source><italic>Introduction to the Theory of Statistics</italic></source>,
  <edition>2</edition>nd ed., <size units="page">295</size> pp.,
  <publisher-name>McGraw-Hill</publisher-name>, <publisher-loc>New York</publisher-loc>.
</mixed-citation>
DTD vs. Schematron: References (cont’d)
Schematron can check that all required elements are present and are in the correct sequence (note the required elements and that edition, if present, follows source):
<!ELEMENT book-standalone-citation ((person-group | string-name), year, source, edition?, (person-group | string-name)?, size?, elocation-id?, publisher-name, publisher-loc) >
DTD vs. Schematron: References (cont’d)
• Schematron can check that all required elements are present:
<rule context="mixed-citation[@publication-type='book-standalone']">
  <assert test="(person-group | string-name) and year and source and publisher-name and publisher-loc">
    required element missing</assert>
</rule>
• and that the elements are in the correct sequence:
DTD vs. Schematron: References (cont’d)
XML instance (JPTS) (edition is in the wrong place)
<mixed-citation publication-type="book-standalone">
  <string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>, and
  <string-name><given-names>F. A.</given-names> <surname>Graybill</surname></string-name>
  (<year>1963</year>), <edition>2</edition>nd ed.,
  <source><italic>Introduction to the Theory …</italic></source>,
  <size units="page">295</size> pp.,
  <publisher-name>McGraw-Hill</publisher-name>, <publisher-loc>New York</publisher-loc>.
</mixed-citation>
DTD vs. Schematron: References (cont’d)
This Schematron uses the positional predicate [1] to check that year is immediately followed by source:
<rule context="mixed-citation[@publication-type='book-standalone']/year">
  <assert test="following-sibling::*[1]/self::source">
    '<name/>' must be immediately followed by 'source', not by '<value-of select="name(following-sibling::*[1])"/>'</assert>
</rule>
Schematron message
'year' must be immediately followed by 'source', not by 'edition'
DTD vs. Schematron: References (cont’d)
But how to check the sequence of required elements when there might be optional elements interspersed between them?
This Schematron checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between:
<rule context="mixed-citation[@publication-type='book-standalone']/publisher-name">
  <assert test="preceding-sibling::source">
    '<name/>' must be preceded by 'source'</assert>
</rule>
DTD vs. Schematron: References (cont’d)
• Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
– Each element rewritten as a string of its child element names
– Content model represented as a regular expression
– Schematron checks the string of names against the regex
– Schematron generates an error message if content does not match the model
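The idea can be illustrated outside Schematron with a short Python sketch. It is not Rick Jelliffe's code or the AGU implementation: the regular expression is a hand translation of the strict book-standalone content model shown earlier, and the names BOOK_STANDALONE, child_name_string, and matches_model are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex over a space-separated string of child element names.
# Assumption: it mirrors the strict DTD model, with edition, a second author
# element, size, and elocation-id all optional.
BOOK_STANDALONE = re.compile(
    r"(person-group|string-name) year source( edition)?"
    r"( (person-group|string-name))?( size)?( elocation-id)?"
    r" publisher-name publisher-loc"
)


def child_name_string(citation):
    """Rewrite a citation as its child element names, in document order.

    Text between the elements is ignored, so mixed content stays permitted.
    """
    return " ".join(child.tag for child in citation)


def matches_model(citation_xml):
    """Return True if the citation's element sequence matches the model."""
    citation = ET.fromstring(citation_xml)
    return BOOK_STANDALONE.fullmatch(child_name_string(citation)) is not None


GOOD = (
    '<mixed-citation publication-type="book-standalone">'
    '<person-group>Mood and Graybill</person-group> (<year>1963</year>), '
    '<source>Title</source>, <edition>2</edition>nd ed., '
    '<publisher-name>McGraw-Hill</publisher-name>, '
    '<publisher-loc>New York</publisher-loc>.</mixed-citation>'
)
# Same citation with edition placed before source.
BAD = GOOD.replace(
    '<source>Title</source>, <edition>2</edition>nd ed., ',
    '<edition>2</edition>nd ed., <source>Title</source>, ',
)
print(matches_model(GOOD), matches_model(BAD))
```

Because only element names are compared, punctuation, spacing, and face markup inside the citation do not affect the result.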
DTD vs. Schematron: References (cont’d)
An XML file, e.g., citation-models.xml, specifies structured citation models:
...
<model publication-type="book-standalone">
  ((string-name | person-group), year, source, edition, (string-name | person-group)?, size?, elocation-id?, publisher-name, publisher-loc)
</model>
...
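A model written in this DTD-like syntax can be turned into a regular expression mechanically. The Python sketch below shows one possible translation; it is an assumption about how such a step might work, not the code used at AGU, and the function names model_to_regex and conforms are hypothetical. Each element name becomes a group that matches the name plus a trailing space, commas become plain concatenation, and |, ?, *, + keep their meaning.

```python
import re


def model_to_regex(model):
    """Translate a DTD-style content model into a compiled regex.

    The regex is meant to be matched against a string in which every child
    element name is followed by one space, e.g. 'year source edition '.
    """
    parts = []
    for token in re.findall(r"[A-Za-z][\w.-]*|[(),|?*+]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")|?*+":
            parts.append(token)  # same meaning in DTD models and regexes
        else:
            # An element name: match it together with its trailing space.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def conforms(names, regex):
    """Check a list of child element names against the translated model."""
    return regex.fullmatch("".join(name + " " for name in names)) is not None


# The book-standalone model exactly as written in the slide.
MODEL = ("((string-name | person-group), year, source, edition, "
         "(string-name | person-group)?, size?, elocation-id?, "
         "publisher-name, publisher-loc)")
BOOK = model_to_regex(MODEL)

in_order = ["person-group", "year", "source", "edition", "size",
            "publisher-name", "publisher-loc"]
edition_misplaced = ["person-group", "year", "edition", "source", "size",
                     "publisher-name", "publisher-loc"]
print(conforms(in_order, BOOK), conforms(edition_misplaced, BOOK))
```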
DTD vs. Schematron: References (cont’d)
• Advantages:
– Content is still DTD-valid
– Mixed content is permitted
– Type-sensitive handling of references is possible
• Caveat: XSLT 2.0!
Lessons learned
• AGU Tag Set + Schematron (200+ checks)
– Ensures data quality
– Ensures markup integrity
– Provides control over production processes
• AGU Tag Set is a superset of JPTS
– Based on JPTS
– Uses the same modularization principles
– Can be easily mapped to JPTS
• Were we to do this again, we would develop a JPTS subset and a Schematron schema
Lessons learned (cont’d)
• Appropriate layer validation
– Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
– Rules-based checking needed anyway
– May as well use “Californian” JPTS (de facto industry standard) adopted by publishers, conversion & composition vendors, archives, etc.
• Paradigm shift: the crux of validation shifts from XML parser to Schematron engine
Lessons learned (cont’d)
• This shift is not without costs:
– Content may be valid to JPTS but make no sense
– Dependency on Schematron for semantic integrity
– Constraints on business partners: must be Schematron-capable and have tools
– Schematron does not “fix” problems; people do. Processes and procedures must be well-defined
Lessons learned (cont’d)
• Writing a simple Schematron is easy; building a complex and efficient one is not:
– Elicit, document, convey, and clarify the Requirements
– Ensure Schematron fits into your workflow
– Modularize Schematron
– Ensure that individual Schematron rules aren’t in conflict
– Optimize Schematron performance
– Employ XSLT 2.0
– Test, test, test
– Cultivate Schematron & XSLT 2.0 expertise in-house
Conclusion
• What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
• When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say: “Superset Me—Not!”