Schematron and Other Useful Tools

Schematron (and other useful tools)

Stuart Myles

smyles@ap.org

An Aside: AP’s Ingestion Pipeline

This is greatly simplified, obviously.

ATOM + XHTML

XSLT Transform

APPL + NITF

One way we ingest content:

we transform ATOM and XHTML into

our internal XML (APPL) and NITF

Converting from HTML to XML

<p>The budget was just &pound;100.</p>

<p>How could it be done for so little money?

<p>Luckily open source tools were available.</p>

These are not new problems.</p>

The solutions were even standardized.<p/>
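To make the problem concrete, here is a minimal sketch, assuming Python's standard-library XML parser as a stand-in for any strict downstream XML consumer. It feeds three paragraphs modelled on the slide to the parser: one uses an HTML-only entity, one is missing its closing tag, and one is well formed.

```python
import xml.etree.ElementTree as ET

# Paragraphs modelled on the slide: an HTML-only entity, an unclosed <p>,
# and a paragraph that is already well-formed XML.
SNIPPETS = [
    "<p>The budget was just &pound;100.</p>",
    "<p>How could it be done for so little money?",
    "<p>Luckily open source tools were available.</p>",
]


def is_well_formed(fragment):
    """Return True if the fragment parses as XML on its own."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


results = [is_well_formed(snippet) for snippet in SNIPPETS]
for snippet, ok in zip(SNIPPETS, results):
    print("OK  " if ok else "FAIL", snippet)
```

Only the third paragraph parses. Browsers tolerate the first two, which is why HTML has to be cleaned up before it enters an XML pipeline.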

Hard to enforce rules in the spec

“HeadLine - this element must contain the same

value as the entry’s <title> element”

“summary is required for non-text content items,

such as news photos and video. This element is

optional for text story content items.”

XML structure complies with XSD…

…but can fail in downstream systems
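The two quoted rules can be sketched as ordinary code to show why a grammar-based schema struggles with them: each one compares values across elements. This is a minimal sketch, not AP's implementation. The `urn:example:ap` namespace, the placement of `HeadLine` and `MediaType` inside the entry, and the value "Text" are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Hypothetical namespace and element layout, for illustration only.
AP = "{urn:example:ap}"


def check_entry(entry):
    """Return the business-rule violations found in one atom:entry."""
    errors = []
    title = (entry.findtext(ATOM + "title") or "").strip()
    headline = (entry.findtext(AP + "HeadLine") or "").strip()
    if headline != title:
        errors.append("HeadLine must contain the same value as the entry's title")
    media_type = (entry.findtext(AP + "MediaType") or "").strip()
    summary = (entry.findtext(ATOM + "summary") or "").strip()
    # Assumed convention: text stories are marked MediaType "Text".
    if media_type != "Text" and not summary:
        errors.append("summary is required for non-text content items")
    return errors


PHOTO = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom" xmlns:ap="urn:example:ap">'
    "<title>Budget vote</title>"
    "<ap:HeadLine>Budget Vote Passes</ap:HeadLine>"
    "<ap:MediaType>Photo</ap:MediaType>"
    "</entry>"
)
violations = check_entry(PHOTO)
for message in violations:
    print(message)
```

The sample entry is structurally plausible yet breaks both rules: the headline differs from the title, and a photo has no summary.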

Validate and Fix Prior to Ingestion

Original ATOM + XHTML

Tidy fixes sloppy HTML

Custom XSLT tidies up XML

W3C schema validates structure & syntax

Schematron schema validates business rules

Valid ATOM + XHTML, ready for ingestion
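The staged flow above can be sketched as a list of functions applied in order, stopping at the first stage that fails. This is only an outline of the control flow. The two stages shown are simple stand-ins (an entity replacement in place of Tidy, a well-formedness check in place of schema validation), and the names `run_pipeline`, `StageError`, and `STAGES` are hypothetical.

```python
import xml.etree.ElementTree as ET


class StageError(Exception):
    """Raised when a pipeline stage rejects the document."""


def run_pipeline(document, stages):
    """Pass the document through each (name, function) stage in order."""
    for name, stage in stages:
        try:
            document = stage(document)
        except Exception as exc:
            raise StageError(f"{name} failed: {exc}") from exc
    return document


def fix_entities(document):
    # Stand-in for HTML Tidy: replace one HTML-only entity with a
    # numeric character reference that XML parsers accept.
    return document.replace("&pound;", "&#163;")


def check_well_formed(document):
    # Stand-in for schema validation: only checks that the text parses.
    ET.fromstring(document)
    return document


STAGES = [
    ("fix entities", fix_entities),
    ("check well-formedness", check_well_formed),
]

clean = run_pipeline("<p>The budget was just &pound;100.</p>", STAGES)
print(clean)
```

Keeping each step as a separate stage means a rejected feed reports which step failed, which is the point of validating before ingestion rather than discovering the problem downstream.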

HTML Tidy

Fix sloppy HTML

HTML -> XHTML

Schematron

Fact checker for XML documents

Business rules that can’t be expressed in W3C XSD schema

• MediaType="Video"

• Format="ANPA1312"

Previously, we had to inspect new feeds to catch errors

The risk is that feeds are approved but errors appear later

(Not to mention manual checking of XML is tedious)

Schematron

Small, powerful, lightweight fact-checker for XML documents

Schematron Schema

Validate

Specify constraints using XPath rules

You write the error messages

One time compile

into an XSLT

Validation reports

Validation as an

XSLT transform

Presence or absence of

specific content

Relationships between

elements and attributes

Anatomy of a Schematron Rule

<sch:rule context="atom:feed/atom:link">

<sch:assert test="starts-with(@href, 'http://')">

The feed/link/@href must contain an http url

</sch:assert>

</sch:rule>

Establish the context of the rule with an XPath expression

An XSLT-style test establishes the constraint for each assert

You write the error message to be used if the assert fails
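For readers who think more easily in procedural code, the rule above behaves roughly like the following sketch. It is a hand-written equivalent using Python's standard library, not output from a Schematron processor, and the sample feed and the name `check_feed_links` are made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
NS = {"atom": ATOM_NS}
MESSAGE = "The feed/link/@href must contain an http url"


def check_feed_links(root):
    """Collect one message per atom:link under an atom:feed that fails the test."""
    failures = []
    # Context: every atom:link whose parent is an atom:feed.
    for feed in root.iter("{%s}feed" % ATOM_NS):
        for link in feed.findall("atom:link", NS):
            # Test: starts-with(@href, 'http://'); a missing href also fails.
            if not link.get("href", "").startswith("http://"):
                failures.append(MESSAGE)
    return failures


FEED = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link href="http://example.com/feed"/>'
    '<link href="ftp://example.com/feed"/>'
    "</feed>"
)
failures = check_feed_links(FEED)
for message in failures:
    print(message)
```

The three parts of the rule map directly onto the code: the nested loops are the context, the `startswith` check is the test, and `MESSAGE` is the author-supplied error text. The Schematron version states the same thing declaratively.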

DSDL – Pipeline Validation

XSD, RELAX NG: Grammar

Schematron: Rules

NVDL: Namespace dispatch

DTLL: Datatype

CREPDL: Character repertoire

DSRL: Document Semantic Renaming

Still under development

Declaratively specify a pipeline (using XML, naturally)

Similar in concept to

Yahoo! Pipes

BizTalk

But XML specific and a W3C standard

Thanks!