XML Vocabularies: Opportunities for Efficiency and Reliability

Steven R. Newcomb

[email protected]

TechnoTeacher and ISOGEN Int’l Corp.


A “Markup Vocabulary” is a list of names

Minimally, XML parsing yields elements with named types (tag names).

The list of these named element types (tag names) is the “vocabulary” of the document. (The names of their attributes are also part of the “vocabulary”.)


I can parse it, but what is it?

<hperm><thone>UDLINGBLON</thone><kallow>29</kallow><spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>

(Vocabulary: hperm, thone, kallow, spec, date)
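As a minimal sketch of what "the vocabulary of a document" means in practice, the following Python collects the element type names and attribute names from the sample above. The function name `vocabulary` and the constant `SAMPLE` are illustrative assumptions, not part of any standard.

```python
import xml.etree.ElementTree as ET

# The sample resource from the slide, split across lines for readability.
SAMPLE = (
    "<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
    "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>"
)

def vocabulary(xml_text):
    """Return (element type names, attribute names) in first-seen order."""
    element_names, attribute_names = [], []
    for element in ET.fromstring(xml_text).iter():
        if element.tag not in element_names:
            element_names.append(element.tag)
        for name in element.attrib:
            if name not in attribute_names:
                attribute_names.append(name)
    return element_names, attribute_names

elements, attributes = vocabulary(SAMPLE)
print(elements)    # ['hperm', 'thone', 'kallow', 'spec', 'date']
print(attributes)  # []
```

The parser recovers the names without difficulty, but nothing in the output says what a "thone" or a "kallow" is.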

Is information in XML interchangeable?


XML “Namespaces”

XML “Namespaces” are vocabularies. The XML “Namespace” recommendation is a step on the road toward interoperability for XML messages.

A “namespace” amounts to an abstract “place” where there is a list of element type names (tag names) and/or attribute names.

URIs specify the namespaces in use.

There is no requirement that the specified URI is valid, much less that the indicated resource conforms to any sort of specification.

XML “Namespaces” provide a way for names to be guaranteed to be unique, and that’s all.


XML “Namespaces” and expectations

In some sense, an XML resource that uses the names of an XML “namespace” must inherit from it expectations as to the meaning and conventional use of each of the names.

(Right? Otherwise, why use it at all?)


I can parse it, but what is it?

<hp:hperm xmlns:hp='http://www.gov.ng/hp.x'><hp:thone>UDLINGBLON</hp:thone><hp:kallow>29</hp:kallow><hp:spec>GUINEA FOWL</hp:spec><hp:date>2000 3 9</hp:date></hp:hperm>

(Vocabulary in the HP namespace: hperm, thone, kallow, spec, date)
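A small sketch, assuming Python's standard `xml.etree.ElementTree`, of what a namespace-aware parse of the sample above yields. Each name comes back qualified by the namespace URI, and the parser never dereferences that URI, so the parse succeeds whether or not anything exists at that address. The helper name `qualified_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<hp:hperm xmlns:hp='http://www.gov.ng/hp.x'>"
    "<hp:thone>UDLINGBLON</hp:thone><hp:kallow>29</hp:kallow>"
    "<hp:spec>GUINEA FOWL</hp:spec><hp:date>2000 3 9</hp:date>"
    "</hp:hperm>"
)

def qualified_names(xml_text):
    """Return (namespace URI, local name) for every element, in document order."""
    names = []
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        if tag.startswith("{"):
            # ElementTree reports namespaced names as '{uri}local'.
            uri, _, local = tag[1:].partition("}")
        else:
            uri, local = "", tag
        names.append((uri, local))
    return names

names = qualified_names(SAMPLE)
print(names[0])  # ('http://www.gov.ng/hp.x', 'hperm')
```

The names are now unique, but the result carries no more meaning than the unqualified version did.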

Do namespaces help with interchange?


XML “Namespaces”: Unresolved issues

How to express a namespace so it can be shared? What is the list of names?

How to write processing software for a namespace?

How to determine whether software for a namespace works according to the expectations of other users of the namespace?

How to determine whether an XML resource conforms to the syntactic and semantic requirements of the namespace?

How to determine, when information interchange fails, whose software is at fault? (The software that created the XML resource, or the recipient application?)

Accountability is vital to interchange.


XML Vocabularies in open environments

Ideally, an XML resource is self-describing. Since many XML resources use the same vocabularies, it’s efficient to describe them in terms of the vocabularies they use.

Anybody who receives a well-described XML resource should be able to interpret it accurately.

Anybody should be able to create an XML resource that uses a vocabulary correctly, so that its recipient will interpret it accurately.

Vocabularies should be able to support entire industries and areas of human endeavor, in open, multivendor environments.

Vocabularies should offer huge advantages in efficiency and reliability.


XML Vocabularies in closed environments

Closed syndicates and would-be cartels need to resolve the same issues, so that their XML messages will interoperate.

It’s extremely inefficient for each syndicate to invent the methodologies and tools for guaranteeing reliable vocabulary-based interoperability.

It’s also a net contraction in the noosphere of the syndicate. Where to find technical expertise? How to maintain it? Etc.

Enlightened self-interest demands that the same methodologies and tools that support open interoperability be used internally.

Vocabularies should offer huge advantages in efficiency and reliability.


Methodologies and Tools for Vocabularies

Vocabularies can be used to make XML resources fully self-describing, fully interchangeable and fully interoperable, down to the last syntactic and semantic feature.

This can be accomplished using existing W3C and ISO recommendations and standards, all from the XML and SGML families of recommendations and standards.

Alternatively, the same principles could be applied using different modeling syntaxes, purpose-built for the Web.

…but if it can be done without reinventing everything, why bother?


Processing of XML resources: 2 stages

The first stage of vocabulary processing can be accomplished by a single generic piece of software, the XML parser.

XML parsers don’t do much vocabulary processing yet.

First stage: vocabulary syntax processing and validation:

Check for conformance of the XML resource to each of the vocabularies it uses, to see whether invalid names were used.

Check for conformance to the structural model (DTD?) of each vocabulary used. Is each name used in a valid context with respect to the other names in the same namespace? Meanings of names may change with context!

Check for conformance of data and attribute values to lexical models of valid data of each element/attribute in each vocabulary.
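The three checks above can be sketched as one generic routine driven by a description of the vocabulary. This is only an illustration under stated assumptions: the dictionary format, the name `HPERM_VOCABULARY`, and every regular expression in it are invented here, since the source does not say what values the sample vocabulary permits.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical vocabulary description: each name has either a required
# sequence of child names or a regular expression for its character data.
# The lexical patterns are guesses made for the sake of the example.
HPERM_VOCABULARY = {
    "hperm": {"children": ["thone", "kallow", "spec", "date"]},
    "thone": {"pattern": r"[A-Z]+"},
    "kallow": {"pattern": r"\d+"},
    "spec": {"pattern": r"[A-Z ]+"},
    "date": {"pattern": r"\d{4} \d{1,2} \d{1,2}"},
}

def validate(element, vocabulary):
    """Return a list of error strings; an empty list means every check passed."""
    rule = vocabulary.get(element.tag)
    if rule is None:
        # Check 1: the name is not in the vocabulary.
        return [f"unknown name: {element.tag}"]
    errors = []
    if "children" in rule:
        # Check 2: the name is used in a valid structural context.
        found = [child.tag for child in element]
        if found != rule["children"]:
            errors.append(f"{element.tag}: expected {rule['children']}, found {found}")
    else:
        # Check 3: the data conforms to the lexical model.
        if len(element):
            errors.append(f"{element.tag}: unexpected child elements")
        if not re.fullmatch(rule["pattern"], element.text or ""):
            errors.append(f"{element.tag}: bad value {element.text!r}")
    for child in element:
        errors.extend(validate(child, vocabulary))
    return errors

GOOD = ("<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
        "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>")
BAD = GOOD.replace("<kallow>29</kallow>", "<kallow>many</kallow>")

print(validate(ET.fromstring(GOOD), HPERM_VOCABULARY))  # []
print(validate(ET.fromstring(BAD), HPERM_VOCABULARY))   # ["kallow: bad value 'many'"]
```

Because the routine takes the vocabulary as data, the same code serves any vocabulary that can be described this way, which is the efficiency argument for doing this work once in the parser.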


Processing of XML resources: 2 stages

Second stage of vocabulary processing is semantic interpretation of the vocabularies.

Since all vocabularies are different, according to the natures of their applications, no generic piece of software can interpret all vocabularies.

However, a paradigm in which the software for one vocabulary never has to duplicate code found in the software for any other vocabulary could offer significant efficiencies and enhanced reliability.

More on this in a moment.


More efficiency/reliability in Stage 1

(Reminder: Stage 1 is vocabulary syntax processing and validation.)

Provide a formalism for the expression of vocabularies: the list of names, the contexts in which names can be used, and lexical models for the data contained in elements and in the value of attributes named in vocabularies.

The existing DTD formalism can already do most of this.

Let’s not force applications to duplicate the functionality of checking the validity of vocabulary usage in XML resources. Let’s build it into re-usable validating XML parsers.

They already validate against DTDs. Why not use that existing functionality for inherited vocabularies?


SX already validates inherited vocabularies.

There is an ISO standard for declaring, in an XML resource, conformance to one or more inheritable XML vocabularies. (In the ISO context, such a vocabulary is called an “inheritable information architecture”.)

Vocabularies can inherit from other vocabularies.

A single XML resource can inherit from more than one vocabulary.

Vocabularies are expressed using ordinary DTD syntax (with minor, optional enhancements).

Demonstration using the Topic Map inheritable vocabulary.


How to document vocabularies?

It would be great to be able to document vocabularies more effectively than we can now.


Which constructs are the comments about?

<!ELEMENT hperm (thone, kallow, spec, date)>
<!ELEMENT thone (#PCDATA)>
<!ELEMENT kallow (#PCDATA)>
<!ELEMENT spec (#PCDATA)>
<!ELEMENT date (#PCDATA)>

<hperm><thone>UDLINGBLON</thone><kallow>29</kallow><spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>


Documenting vocabularies

Topic maps are an extremely powerful way of documenting DTDs.

...but that’s another story for another time.



Reminder: “Stage 2” is application-specific (i.e., vocabulary-specific) processing of XML resources, after parsing and other processing common to all XML resources has already been done.

Stage 2 is about resource interoperability, not just about interchangeability. It’s about how we can guarantee that everyone understands the resource in the same way.

It’s about the meaning of each name in a vocabulary.

It’s about the meaning of the data associated with each vocabulary name in each resource that uses the vocabulary.

It’s about expectations: the resource creator’s expectations about what will be understood by recipients of the resource, and the recipients’ expectations about the kinds of things that a resource that uses a certain vocabulary can say.



No generic processor can understand all vocabularies. In general, a special processor is needed for each vocabulary.

Still, there are huge opportunities, even in Stage 2, for efficiency and reliability:

There can be a common way to express vocabulary-specific semantics.

At least some of these expressions can be formal and machine-readable, so tools can be built that enhance the productivity of application builders.

Many XML resources can inherit multiple vocabularies, thus recycling existing knowledge about vocabularies, and avoiding redundant learning cycles. (Example: XLL combined with BizTalk.)

A re-usable software engine can be built for each vocabulary, and means for plugging such engines into applications can be developed. (Same example applies.)


Modeling is the key

In Stage 1 of XML resource processing, models of the structural and lexical requirements associated with each vocabulary can drive a generic parsing/validating process.

In Stage 2 of XML resource processing, models of the abstract information sets that can be conveyed by specific vocabularies can be created.

These “abstract APIs” give names to each of the properties of the information set that “emerges” from processing a vocabulary.

Abstract API models are contracts between programmers, just as a DTD is a contract between information users and providers.

In an actual implementation of a vocabulary processing engine, these property names can become function calls (or whatever).

In other words, these abstract information set models can drive a generic engine-building process that produces vocabulary-specific engines.
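One way to picture a model-driven engine-building process is the sketch below, in which a declarative property model is turned into a vocabulary-specific engine by generic code. Everything in it is an assumption made for illustration: the model format, the names `build_engine` and `HPERM_PROPERTIES`, the property types, and the reading of the date as year, month, day.

```python
import xml.etree.ElementTree as ET
from datetime import date

def parse_date(text):
    """Assumed lexical form: 'year month day', separated by spaces."""
    year, month, day = (int(part) for part in text.split())
    return date(year, month, day)

# Hypothetical abstract API model: property name -> (source element, converter).
HPERM_PROPERTIES = {
    "thone": ("thone", str),
    "kallow": ("kallow", int),
    "spec": ("spec", str),
    "date": ("date", parse_date),
}

def build_engine(root_name, properties):
    """Generic builder: returns an engine specific to one vocabulary."""
    def engine(xml_text):
        root = ET.fromstring(xml_text)
        if root.tag != root_name:
            raise ValueError(f"expected <{root_name}>, found <{root.tag}>")
        values = {}
        for prop, (source, convert) in properties.items():
            text = root.findtext(source)
            if text is None:
                raise ValueError(f"missing <{source}>")
            values[prop] = convert(text)
        return values
    return engine

hperm_engine = build_engine("hperm", HPERM_PROPERTIES)
record = hperm_engine(
    "<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
    "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>"
)
print(record["kallow"] + 1)  # 30
print(record["date"].year)   # 2000
```

Applications call the engine by property name and never touch the element structure, so the model, not the code, is what programmers have to agree on.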


Bi-directional transformation

All XML resources convey information that really has two forms:

The interchangeable (but otherwise useless) XML form, and

The parsed, processed, application-internal form.

“Stages 1 and 2” are about the conversion from the interchange form to the useful form. The other transformation -- from the useful form to the interchange form -- is at least equally important.

For reliable, efficient information interchange, the nature of both transformations must be documented.

It would be great if the URI of the vocabulary’s “namespace” pointed at a document that had both models, and explained the algorithms involved in transforming information between them.
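The two transformations can be sketched as a pair of functions whose round trip is checkable. The `Hperm` class, its field types, and the function names `from_xml` and `to_xml` are hypothetical; the source does not define an internal form for the sample vocabulary.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SAMPLE = (
    "<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
    "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>"
)

@dataclass
class Hperm:
    """Assumed application-internal form of one hperm resource."""
    thone: str
    kallow: int
    spec: str
    date: str

def from_xml(xml_text):
    """Interchange form -> internal form."""
    root = ET.fromstring(xml_text)
    return Hperm(
        thone=root.findtext("thone"),
        kallow=int(root.findtext("kallow")),
        spec=root.findtext("spec"),
        date=root.findtext("date"),
    )

def to_xml(record):
    """Internal form -> interchange form."""
    root = ET.Element("hperm")
    for name in ("thone", "kallow", "spec", "date"):
        ET.SubElement(root, name).text = str(getattr(record, name))
    return ET.tostring(root, encoding="unicode")

record = from_xml(SAMPLE)
print(to_xml(record) == SAMPLE)  # True
```

A documented pair like this gives both sides of an interchange something concrete to test against when a transfer fails.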


A common fallacy: DTD is API

The fallacy is: the structure of an XML resource should also be the API to the information it contains.

Trying to make the element structure also be the API makes it impossible to have both a good interchange structure and a good API. The attempt introduces inefficiency and invites unreliability of information interchange.

The Document Object Model (DOM) is an API to the generic structure of XML resources. It is not and can never be the API to the information sets conveyed by all vocabularies.

If, e.g., the XLL vocabulary’s functionality gets built into the DOM, what vocabulary’s functionality shouldn’t be built into the DOM? No committee can possibly do all this work!


Desirable qualities in an interchange syntax

Maximal appropriateness to the information it conveys

intrinsic character of information well reflected in interchange structure.

Communications efficiency

no redundancy

Validatability

no ambiguity

Neutrality

no hidden assumptions about platform, vendor or application

Self-description

conformance to intelligible, well-documented formal model


Interchange syntax model is a contract

A DTD is a contract between:

information creators

information consumers

application developers

DTD enhanced with type checking, lexical typing, etc., is a more detailed contract between the same players


Desirable qualities in an Abstract API

Maximal convenience for applications developers

The abstract API is intuitive to learn and use.

Abstract APIs often need redundant access methods, for the convenience of programmers.

Processing tasks common to all applications (beyond parsing and validation) are supported by the implementation of the abstract API.

The abstract API should include both:

Properties directly derivable from the syntactic structure of the interchange form.

Properties implicit in the architecture but not reflected in syntactic structures.

Neutrality

no hidden assumptions about platform, vendor or application.

Self-description

API is intelligible, well-documented


Abstract API model is a contract, too

...between programmers of applications that, with respect to a given vocabulary:

Create XML resources.

Receive XML resources and use the information they convey.

Support the creation of XML resources that link to the emergent properties of other resources.

Support the querying of XML resources with respect to the values of specific emergent properties.


Two sides of one coin

The interchange syntax model and the abstract API are two aspects of the same information set:

Syntax model = consensus about the interchange format of the information set

Abstract API = consensus about the abstract properties of the information set


XML needs:

Enhanced syntactic modeling capabilities for generic XML processing/validation.

Especially: Means for inheriting multiple vocabularies in XML instances, and for proving that they are all used correctly.

Note: lexical modeling features, and many other syntactic enhancements, can be made to XML by means of vocabularies.

Semantic modeling capabilities that allow us to give names to the emergent properties of XML resources that use vocabularies.

A convention, such as that which exists for XML “Namespaces” today, for pointing to these models from within XML resources, so as to indicate the use of a given vocabulary.


Semantic modeling: emergent properties

Example of an “emergent” property: The property of being a target of an xlink (considering XLL as a vocabulary, as it is in ISO-land).

All emergent properties of a vocabulary must be described clearly, comprehensively, unambiguously, and formally, because

accuracy and reliability are important.

the information is expected to be useful in multi-vendor application environments (if not, why inherit a vocabulary at all?).

implementation of vocabulary-specific applications must be done at reasonable cost.


Semantic validation becomes a side-effect

Computing an emergent property value often isn’t possible without validating the interchanged information on which the computation is based.

For example, if an element that inherits from a vocabulary specifies a "start-time" attribute and an "end-time" attribute, we may intend that the duration of time between the start-time and the end-time be calculable and that it fall within a certain range (or at least be non-negative). In any case, we can’t calculate the value of the “duration” property unless the start-time and end-time values exist and are amenable to calculation.
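The start-time/end-time example can be sketched as follows. The element name `session` and the ISO 8601 timestamp format are assumptions made for the example; the point is only that computing the emergent “duration” property cannot succeed without validating its inputs along the way.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def duration(element):
    """Compute the emergent 'duration' property, validating as a side effect."""
    start_text = element.get("start-time")
    end_text = element.get("end-time")
    if start_text is None or end_text is None:
        raise ValueError("start-time and end-time are both required")
    # fromisoformat raises ValueError if a value is not a usable timestamp.
    start = datetime.fromisoformat(start_text)
    end = datetime.fromisoformat(end_text)
    if end < start:
        raise ValueError("end-time precedes start-time")
    return end - start

# Hypothetical element that carries the two attributes.
session = ET.fromstring(
    "<session start-time='2000-03-09T10:00:00' end-time='2000-03-09T11:30:00'/>"
)
print(duration(session))  # 1:30:00
```

A resource with a missing, malformed, or reversed pair of times is rejected here, not by a separate validation pass.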


A standard property language exists…

It’s called “Property Sets”

A property set is an XML document that conforms to the ISO standard DTD for property sets.

Already in commercial use; the software already works with XML.

Every class of information component (“node”), and every property of every class, has a unique name.

These names can be used in queries.

This whole idea is often called “the Grove Paradigm.” It’s the basis of SGML processing, and the SGML Property Set aided the development of the DOM.


In the Grove Paradigm...

Vocabulary-specific engines can be plugged together in applications that support XML resources that use multiple vocabularies.

Vocabulary-specific engines generate a "grove" (object graph with relevant Property Set as schema) from any vocabulary-conforming XML instance.

Vocabulary-specific engines can mature and offer reliable semantic validation and processing services in a variety of application contexts, instead of being rebuilt in each application.

Time and cost of developing applications is reduced, while reliability of information interchange increases.
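To make the idea of a grove more concrete, here is a toy sketch of an object graph whose nodes are checked against a schema of named classes and properties. The node classes (`element`, `data`) and property names (`name`, `content`, `chars`) are invented for illustration and are not taken from the ISO property set; the helpers `make_node`, `build_grove`, and `walk` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical miniature property set: node class -> its property names.
PROPERTY_SET = {
    "element": {"name", "content"},
    "data": {"chars"},
}

@dataclass
class Node:
    node_class: str
    properties: dict = field(default_factory=dict)

def make_node(node_class, **properties):
    """Create a node, refusing properties the property set does not declare."""
    if set(properties) != PROPERTY_SET[node_class]:
        raise ValueError(f"{node_class}: properties do not match the property set")
    return Node(node_class, properties)

def build_grove(element):
    """Turn a parsed element into a graph of typed nodes."""
    content = []
    if element.text:
        content.append(make_node("data", chars=element.text))
    for child in element:
        content.append(build_grove(child))
        if child.tail:
            content.append(make_node("data", chars=child.tail))
    return make_node("element", name=element.tag, content=content)

def walk(node):
    """Yield every node in the grove, depth first."""
    yield node
    for child in node.properties.get("content", []):
        yield from walk(child)

grove = build_grove(ET.fromstring(
    "<hperm><thone>UDLINGBLON</thone><kallow>29</kallow>"
    "<spec>GUINEA FOWL</spec><date>2000 3 9</date></hperm>"
))

# Query by class name and property name, not by markup syntax.
specs = [node for node in walk(grove)
         if node.node_class == "element" and node.properties["name"] == "spec"]
print(specs[0].properties["content"][0].properties["chars"])  # GUINEA FOWL
```

Because every class and property has a name, a query such as the one at the end can be written against the schema without knowing how the information was serialized.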


The Grove Paradigm is Portable

The Grove Paradigm is highly portable: it can be used with any notation, not just XML and SGML.

Property sets can be used as a way to represent consensus about how to address the abstract properties of any notation.

Think about it: a vocabulary is a notation. (And XML is a notation for vocabulary-notations.)

Let’s look at some groves! (GroveMinder demo.)


Summary: Designing XML Vocabularies

Questions to ask:

Must certain semantic processing and validation operations be performed by all applications of this vocabulary?

Will more than one application have to deal with this vocabulary?

If so, its syntactic requirements deserve to be made explicit in a DTD (or something like a DTD), and

A property set (or other explicit Abstract API) defined for it will pay big dividends:

in software reuse

in achieving widespread consensus about what the vocabulary really means

in determining what went wrong when vocabulary-mediated information interchange fails

The preceding SX and GroveMinder demos are available from

Steve Newcomb

[email protected]
