Introduction to XML. Outline Background XML Basics Document Type Descriptors (DTDs) XML schema CML


  • Introduction to XML

  • Outline: Background; XML Basics; Document Type Descriptors (DTDs); XML schema; CML

  • From HTML to XML. HTML (HyperText Markup Language) is a means of structuring text for visual presentation. It was designed to describe how a Web browser should arrange text, images and push-buttons on a page. HTML describes both intra-document structure and inter-document structure.

    [Figure: annotated markup example. Sample text: "Introduction to XML", "XML". Labels: opening tag, text (PCDATA), closing tag, attribute name, attribute value.]

  • From HTML to XML. There is a need for data structuring in applications more general than display. Examples: extracting biological data from an NCBI search result page to run a bioinformatics tool; extracting financial data from web pages to conduct financial analyses. Solution: a markup language that structures document contents (XML).

  • XML: brief history. XML stands for eXtensible Markup Language. It is a subset of SGML. The first version (1.0) was formally ratified by the W3C in 1998; the current version is XML 1.1, released in 2004. XML is becoming the standard for data interchange between applications.

  • XML: brief history. Purpose: structuring the content of documents. XML is the basis for various application-specific markup languages, including: GML (Geography Markup Language); OFX (Open Financial Exchange markup language); SBML (Systems Biology Markup Language); MusicXML (music markup language); CML (Chemical Markup Language); and many more.

  • XML: brief history. Some advantages of XML: it is extensible; it is both human readable and computer readable; it is platform and language independent; it is a public standard; its tool set is large and growing; it works well with the Internet; XML documents can be transformed; XML is global.

  • XML: brief history. Some disadvantages of XML: it is verbose; it is not a cure-all for data integration; it does not guarantee a unified format; it has a steep learning curve.

  • Outline: Background; XML Basics; Document Type Descriptors (DTDs); XML schema; CML

  • XML structure - Example

    P26954 IL3B_MOUSE Interleukin-3 receptor class II beta chain [Precursor] CSF2RB2 AI2CA IL3RB2 IL3R Mus musculus FUNCTION: IN MOUSE THERE ARE TWO CLASSES OF HIGH-AFFINITY IL-3 RECEPTORS. ONE CONTAINS THIS IL-3-SPECIFIC BETA CHAIN AND THE OTHER CONTAINS THE BETA CHAIN ALSO SHARED BY HIGH-AFFINITY IL-5 AND GM-CSF RECEPTORS. SUBUNIT: Heterodimer of an alpha and a beta chain.Receptor Glycoprotein Signal

  • XML structure. Key components: tags and text.

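  • XML structure. Tags represent element names and are used in pairs (an opening tag and a matching closing tag). Tags must be properly nested: an element opened inside another element must be closed before the enclosing element is closed.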
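
    The nesting rule can be illustrated with a small stack-based check. This is only a toy sketch: the regular expression ignores comments, CDATA sections and processing instructions, and the element names `paper` and `author` are hypothetical, chosen for illustration.

```python
import re

# Matches an opening, closing or self-closing tag and captures its name.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")

def properly_nested(text):
    """Return True if every opening tag is closed in the reverse order of opening."""
    stack = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # <name/> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # nothing may be left open

good = "<paper><author>Johnston, M.</author></paper>"
bad = "<paper><author>Johnston, M.</paper></author>"
print(properly_nested(good), properly_nested(bad))
```

    A real XML parser performs this check (and many others) itself; the sketch only makes the "last opened, first closed" discipline visible.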

  • XML structure. Element names follow the XML name specification. XML names may include: alphanumeric characters; non-English characters; ideograms; underscore (_), hyphen (-), period, colon. They should not include: white space, quotation marks, apostrophes, dollar signs, percent symbols, carets, or semicolons. They may only start with: letters, ideograms, or the underscore character.
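
    The naming rules listed above can be approximated with a regular expression. This is a simplified sketch, not the exact grammar of the XML specification (which enumerates specific Unicode ranges and also lets a colon start a name); the function name `looks_like_xml_name` is hypothetical.

```python
import re

# First character: a letter, ideogram or underscore (a word character that is not a digit).
# Remaining characters: word characters, period, hyphen or colon.
NAME = re.compile(r"[^\W\d][\w.\-:]*")

def looks_like_xml_name(candidate):
    """Approximate check of the naming rules listed on the slide."""
    return NAME.fullmatch(candidate) is not None

for candidate in ["protein_set", "_id", "rec.type", "2nd", "first name", "a$b", "-x"]:
    print(candidate, looks_like_xml_name(candidate))
```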

  • XML structure. Text: XML has only one basic type -- text. Text is bounded by tags. E.g.: Interleukin-3 receptor class II beta chain [Precursor] 2650 --- 2650 is still text. XML text is called PCDATA (for parsed character data).
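
    A short sketch of the "2650 is still text" point, using Python's standard `xml.etree.ElementTree` parser. The tag names `protein`, `name` and `length` are assumptions made for illustration; the original markup is not shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup around the slide's example values.
protein = ET.fromstring(
    "<protein>"
    "<name>Interleukin-3 receptor class II beta chain [Precursor]</name>"
    "<length>2650</length>"
    "</protein>"
)

length_text = protein.findtext("length")
print(type(length_text).__name__)   # the parser hands back a string
length_value = int(length_text)     # any numeric reading is the application's job
```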

  • XML structure. Tag nesting is used to express various data structures, including the tuple (record):

    Johnston, M. The nucleotide sequence of Saccharomyces cerevisiae chromosome XII

  • XML Terminology - Elements. An element is the segment of an XML document between an opening tag and its corresponding closing tag. An element nested inside another element is a sub-element of it. Example content: Johnston, M. Hillier, L. The nucleotide sequence of Saccharomyces cerevisiae chromosome XII 1997

  • XML Terminology - Elements. Mixed content: an element may contain a mixture of sub-elements and PCDATA. E.g.: My First XML

  • XML Terminology - Attributes. An (opening) tag may contain attributes, typically used to describe the content of an element. Syntax: attribute_name = "value" (attribute values must be quoted). Attribute names follow the XML naming rules.

  • XML Terminology - Attributes. A common use for attributes is to express dimension or type.

    2400 96 M05-.+C$@02!G96YE
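
    As a sketch of how attributes carry a dimension, the values above can be wrapped in hypothetical markup and read back with `xml.etree.ElementTree`. The element names `picture`, `height`, `width` and the attribute name `dim` are assumptions; the slide's original tags were not preserved.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the attribute states the unit of the element's text.
picture = ET.fromstring(
    '<picture><height dim="cm">2400</height><width dim="in">96</width></picture>'
)

height = picture.find("height")
width = picture.find("width")
print(height.text, height.get("dim"))  # element text and its dimension attribute
print(width.text, width.get("dim"))
```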

  • XML Terminology - Using IDs. An ID is a special attribute used to uniquely identify elements. It can be used by other elements for referencing purposes. The value of an ID attribute is unique. It must be declared of type ID in the DTD.

  • XML Terminology - Using IDs

    Jane Doe John Doe Mary Doe

  • A Complete XML Document. An XML document includes a declaration part (the XML declaration, which is optional in XML 1.0 but recommended) and must include exactly one root element.

  • Well-formed documents. XML documents must be well-formed: there is exactly one root element; names follow XML naming rules; tags are properly matched; tags are properly nested; attribute values are quoted; the name of an attribute is unique within an element; comments and processing instructions do not appear inside tags; no un-escaped < or & appears in the character data of an element or an attribute.
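
    Several of these rules can be observed with any conforming parser, which must reject documents that are not well-formed. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample documents are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

SAMPLES = {
    "one root, quoted attribute": '<entry id="P26954"><name>IL3B_MOUSE</name></entry>',
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<entry id=P26954/>",
    "duplicate attribute name": '<entry id="1" id="2"/>',
    "unescaped ampersand": "<name>IL-5 & GM-CSF</name>",
    "escaped ampersand": "<name>IL-5 &amp; GM-CSF</name>",
}

results = {label: is_well_formed(doc) for label, doc in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```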

  • Outline: XML Basics; Document Type Descriptors (DTDs); XML schema; CML

  • Document Type Descriptors. Document Type Descriptors (DTDs, more commonly called Document Type Definitions) impose structure on an XML document. The DTD is a syntactic specification. General syntax: <!DOCTYPE DTD-name [ ... ]> Note: DTD-name corresponds to the root element of the XML documents that use the DTD for validation.

  • Example: An Address Book. A person entry: MacNiel, John; Dr. John MacNiel; 1234 Huron Street; Rome, OH 98765; (321) 786 2543; (321) 786 2543; (321) 786 2543; jm@abc.com. Annotations: exactly one name; at most one greeting; as many address lines as needed (in order); mixed telephones and faxes; as many email addresses as needed.

  • Specifying the structure. name: specifies a name element. greet?: specifies an optional (0 or 1) greet element. name,greet?: specifies a name followed by an optional greet.

  • Specifying the structure (cont). addr*: specifies 0 or more address lines. tel | fax: a tel or a fax element. (tel | fax)*: 0 or more repeats of tel or fax. email*: 0 or more email elements.

  • Specifying the structure (cont)So the whole structure of a person entry is specified by

    name, greet?, addr*, (tel | fax)*, email*

    This is known as a regular expression
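
    Because the content model is a regular expression over child element names, it can be tested with an ordinary regex engine. The sketch below is only an illustration of the idea, not how a DTD validator is implemented: it turns the sequence of child tags into a string of space-terminated tokens and matches it against a transcription of the expression above. The function name `matches_person_model` is hypothetical.

```python
import re

# name, greet?, addr*, (tel | fax)*, email*
# Each child tag becomes one token followed by a single space.
PERSON_MODEL = re.compile(r"name (?:greet )?(?:addr )*(?:(?:tel|fax) )*(?:email )*")

def matches_person_model(child_tags):
    """Return True if the sequence of child element names fits the content model."""
    encoded = "".join(tag + " " for tag in child_tags)
    return PERSON_MODEL.fullmatch(encoded) is not None

print(matches_person_model(["name", "greet", "addr", "addr", "tel", "fax", "tel", "email"]))
print(matches_person_model(["greet", "name"]))          # wrong order
print(matches_person_model(["name", "email", "tel"]))   # tel after email
```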

  • A DTD for the address book

    ]>

  • Summary of XML regular expressions for element description

    e          the tag e occurs
    e1,e2      the expression e1 followed by e2
    e*         0 or more occurrences of e
    e?         optional -- 0 or 1 occurrences
    e+         1 or more occurrences
    e1 | e2    either e1 or e2
    (e)        grouping

  • Specifying attributes in the DTD Example:

    The dimension attribute is required; the accuracy attribute is optional. CDATA is the type of the attribute -- it means string.

  • Specifying attributes in the DTD General syntax:

    Attribute types include: CDATA, ENUMERATION, ID, IDREF, IDREFS, NOTATION, NMTOKEN, NMTOKENS, ENTITY, ENTITIES. Attribute default values include: #IMPLIED, #REQUIRED, #FIXED, literal.

  • Specifying ID and IDREF attributes

    ]>

  • Some conforming data

    Jane Doe John Doe Mary Doe

  • Consistency of ID and IDREF attribute values. If an attribute is declared as ID, the associated values must all be distinct (no confusion). If an attribute is declared as IDREF, the associated value must exist as the value of some ID attribute. The same holds for all the values of an IDREFS attribute. ID and IDREF attributes are not typed.
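
    These consistency rules can be sketched as a small check over a parsed document. Python's standard library does not validate against DTDs, so the sketch takes the attribute names as parameters instead of reading them from a DTD. The element and attribute names (`family`, `person`, `id`, `mother`, `father`, `children`) and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def check_id_consistency(root, id_attr="id", idref_attrs=(), idrefs_attrs=()):
    """Return a list of problems: duplicate IDs and references to missing IDs."""
    problems = []
    seen = set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if value in seen:
            problems.append(f"duplicate ID: {value}")
        seen.add(value)
    for elem in root.iter():
        refs = [elem.get(a) for a in idref_attrs if elem.get(a) is not None]
        for a in idrefs_attrs:
            # An IDREFS value is a whitespace-separated list of references.
            refs.extend((elem.get(a) or "").split())
        for ref in refs:
            if ref not in seen:
                problems.append(f"dangling reference: {ref}")
    return problems

FAMILY = """<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane"><name>John Doe</name></person>
  <person id="mary" children="jane"><name>Mary Doe</name></person>
</family>"""

ok = check_id_consistency(ET.fromstring(FAMILY),
                          idref_attrs=("mother", "father"),
                          idrefs_attrs=("children",))
broken = check_id_consistency(ET.fromstring(FAMILY.replace('mother="mary"', 'mother="ann"')),
                              idref_attrs=("mother", "father"),
                              idrefs_attrs=("children",))
print(ok, broken)
```

    Note that the check never asks what kind of element a reference points to, which mirrors the remark that ID and IDREF attributes are not typed.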

  • Connecting the document with its DTD

    In line: the DTD is embedded in the document itself.

    Another file: the document refers to a DTD stored in a separate file.

  • Valid Documents. XML documents are checked for validity by an XML validator against a schema such as a DTD. Validity means that the document conforms to the DTD: it conforms to the regular expression grammar, the types of its attributes are correct, and the constraints on ID and IDREF(S) are satisfied.

  • Outline: Background; XML Basics; Document Type Descriptors (DTDs); XML schema; CML

  • XML schema. A W3C recommendation and the successor of DTDs, used to validate XML documents. The specification is lengthy and rather complex. It was proposed to address the pitfalls of DTDs.

  • XML schema. Features: data typing (compared to DTDs, where elements and attributes are strings); schema files are XML files; support for object-oriented practices; additional validation rules (e.g. a pattern for an element's content, minimum/maximum values for attributes); full support of namespaces.

  • XML schema. XML schema for scientific applications: used in several areas, such as bioinformatics, chemical informatics and laboratory informatics. Examples include: AGAVE, CML, PEML, PSI-MI, SBML, UniProt XML, XFF.

  • Introductory Example. Example 1: Representing protein data

  • XML Schema constructs. The schema element is the root of an XML schema document.

    The prefix xs references the XML Schema namespace and is used to reference schema constructs such as xs:annotation and xs:complexType.

  • XML Schema constructs. Schema documentation: the schema element xs:annotation is used to document schema documents, providing information about the document and detailed information about its elements and attributes. Two types of documentation: human readable, using the element xs:documentation; machine readable, using the element xs:appinfo. E.g.

    Sample XML Schema for representing Protein data.

  • XML Schema constructs. Simple types vs. complex types: a schema element is either of simpleType or of complexType. An element is of simple type if it does not contain any attributes or child elements. An element is of complex type if it includes children, attributes or both. See Example 1.
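
    The simple/complex distinction can be sketched by applying the rule to the elements of an instance document. This is only an illustration of the rule, not schema processing: a real schema declares the type of each element explicitly. The function name `type_kind`, the element names `protein` and `name`, and the attribute `id` are hypothetical; only `protein_set` appears in the slides.

```python
import xml.etree.ElementTree as ET

def type_kind(elem):
    """'complex' if the element has attributes or child elements, otherwise 'simple'."""
    return "complex" if elem.attrib or len(elem) > 0 else "simple"

doc = ET.fromstring(
    '<protein_set><protein id="P26954"><name>IL3B_MOUSE</name></protein></protein_set>'
)

kinds = {elem.tag: type_kind(elem) for elem in doc.iter()}
print(kinds)
```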

  • XML Schema constructs. Global elements vs. local elements: global elements are direct children of the root schema element. E.g. protein_set in Example 1 is a global element. Local elements are not direct