Extensible Markup Language: XML


  • Extensible Markup Language: XML

    XML was developed by the World Wide Web Consortium's (W3C's) XML Working Group (1996).
    XML is a portable, widely supported technology for describing data.
    XML is quickly becoming the standard for data exchange between applications.

  • 15.2 XML Documents

    XML marks up data using tags, which are names enclosed in angle brackets < >.
    All tags appear in pairs: <tag> .. </tag>
    Elements: units of data (i.e., everything included between a start tag and its corresponding end tag).
    The root element contains all other document elements.
    Tag pairs cannot appear interleaved: <a><b> .. </a></b>. Must be: <a><b> .. </b></a>
    Nested elements form hierarchies (trees).

    Thus: What defines an XML document is not its tag names but that it has tags that are formatted in this way.
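    The nesting rule can be checked with any XML parser. The following is a minimal sketch in Python 3 (the slides themselves use Python 2), based on the standard library's xml.dom.minidom. The helper name is_well_formed and the tag names a and b are made up for illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        parseString(text)
    except ExpatError:
        return False
    return True


nested = "<a><b>data</b></a>"        # tag pairs properly nested
interleaved = "<a><b>data</a></b>"   # tag pairs interleaved

print(is_well_formed(nested))
print(is_well_formed(interleaved))
```

    The parser accepts the nested version and rejects the interleaved one, whatever the tag names are.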

  • article.xml

    <?xml version = "1.0"?>

    <article>
       <title>Simple XML</title>
       <date>December 21, 2001</date>
       <author>
          <firstName>John</firstName>
          <lastName>Doe</lastName>
       </author>
       <summary>XML is pretty easy.</summary>
       <content>In this chapter, we present a wide variety of examples
          that use XML.</content>
    </article>

    The root element contains all other document elements.

    The optional XML declaration includes the version information parameter.

    XML comments are delimited by <!-- and -->.

    Because of the nice <tag> .. </tag> structure, the data can be viewed as organized in a tree:

    article
       title
       date
       author
          firstName
          lastName
       summary
       content

  • <SEQUENCEDATA>
     <SEQ>
      <TYPE>dna</TYPE>
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
            acctcccatccgtgtctattgtaccctgttgcttcgg
            cgggcccgccgcttgtcggccgccgggggggcgcctctg
            ccccccgggcccgtgcccgccggagaccccaacacgaac
            actgtctgaaagcgtgcagtctgagttgattgaatgcaat
            cagttaaaactttcaacaatggatctcttggttccggc</DATA>
     </SEQ>
    </SEQUENCEDATA>


  • Parsing and displaying XML

    XML is just another data format.
    We need to write yet another parser.
    No more filters, please!

    No! XML is becoming standard.
    Many different systems can read XML; not many systems can read our I-sequence format.
    Thus, parsers exist already.

  • XML document opened in Internet Explorer

    Each parent element/node can be expanded and collapsed with the plus sign or minus sign next to it.

  • XML document opened in Mozilla

    Again: each parent element/node can be expanded and collapsed (here by pressing the minus, not the element).

  • letter.xml (element text only)

    Jane Doe
    Box 12345
    15 Any Ave.
    Othertown
    Otherstate
    67890
    555-4321

    John Doe
    123 Main St.
    Anytown
    Anystate
    12345
    555-1234

    Dear Sir:

    Attributes: data can also be placed in attributes, which are name/value pairs.

  • letter.xml (continued)

    It is our privilege to inform you about our new
    database managed with XML. This
    new system allows you to reduce the load on
    your inventory list server by having the client machine
    perform the work of sorting and filtering the data.

    Please visit our Web site for availability
    and pricing.

    Sincerely

    Ms. Doe
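    To illustrate how data stored in an attribute differs from data stored as element text, here is a small Python 3 sketch using xml.dom.minidom. The fragment is hypothetical: the element names contact and name and the attribute type are invented for this example and are not taken from the listing above.

```python
from xml.dom.minidom import parseString

# Hypothetical fragment: element and attribute names are invented here.
doc = parseString('<contact type="from"><name>Jane Doe</name></contact>')
contact = doc.documentElement

# An attribute is a name/value pair on the start tag and is read by name.
contact_type = contact.getAttribute("type")

# Element text lives in a child text node of the element.
name = contact.getElementsByTagName("name")[0].firstChild.nodeValue

print(contact_type, name)
```

    Asking for an attribute that is not present returns an empty string in minidom, so hasAttribute is the way to tell "missing" from "empty".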

  • Intermezzo 1

    1. Finish this i2xml.py filter so it translates a list of Isequence objects into XML (following the above structure) and saves it in a file. Assume the list contains only one Isequence object. Use your module with this driver program and translate this Fasta file into XML. Load the resulting XML file into a browser.
    2. Change the XML structure defined by your filter so that TYPE is no longer a tag by itself but an attribute of the SEQ tag (see page 496).
    3. Modify your i2xml filter so that it can now translate a list of several Isequence objects into one XML file, using the structure from part 2. Test your program with the same driver on this Fasta file.

    http://www.daimi.au.dk/~chili/CSS/Intermezzi/30.10.1.html
    All files can be found from the Example Programs page.

  • solution

    from Isequence import Isequence
    import sys

    # Save a list of Isequences in XML
    # (tag names follow the SEQUENCEDATA / SEQ / NAME / ID / DATA structure
    # that the xml2i parser expects)

    class SaveToFiles:
        """Stores a list of Isequences in XML format"""

        def save_to_files(self, iseqlist, savefilename):

            try:
                savefile = open(savefilename, "w")
                print >> savefile, '<?xml version = "1.0"?>'
                print >> savefile, "<SEQUENCEDATA>"
                for seq in iseqlist:

                    print >> savefile, ' <SEQ type="%s">' % seq.get_type()
                    print >> savefile, "  <NAME>%s</NAME>" % seq.get_name()
                    print >> savefile, "  <ID>%s</ID>" % seq.get_id()
                    print >> savefile, "  <DATA>%s</DATA>" % seq.get_sequence()
                    print >> savefile, " </SEQ>"

                print >> savefile, "</SEQUENCEDATA>"

                savefile.close()
            except IOError, message:
                sys.exit(message)
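    For readers on Python 3, a filter of this kind might look as follows. This is only a sketch under assumptions: the Seq named tuple is a stand-in for the course's Isequence class (which is not shown here), the function name to_xml is invented, and the tag names follow the SEQUENCEDATA / SEQ / NAME / ID / DATA structure with the type stored as an attribute. Unlike the slide code, it escapes characters such as & and < so that the output stays well formed.

```python
from collections import namedtuple
from xml.sax.saxutils import escape, quoteattr

# Stand-in for the course's Isequence class, which is not shown here.
Seq = namedtuple("Seq", "type name id sequence")


def to_xml(seqs):
    """Return the sequences as one XML string, with TYPE as an attribute."""
    lines = ['<?xml version="1.0"?>', "<SEQUENCEDATA>"]
    for seq in seqs:
        # quoteattr adds the surrounding quotes and escapes the value.
        lines.append(" <SEQ type=%s>" % quoteattr(seq.type))
        lines.append("  <NAME>%s</NAME>" % escape(seq.name))
        lines.append("  <ID>%s</ID>" % escape(seq.id))
        lines.append("  <DATA>%s</DATA>" % escape(seq.sequence))
        lines.append(" </SEQ>")
    lines.append("</SEQUENCEDATA>")
    return "\n".join(lines) + "\n"


print(to_xml([Seq("dna", "Aspergillus awamori", "U03518", "aacctgcgga")]))
```

    Building the string first and writing it to a file afterwards keeps the formatting logic separate from file handling, which makes it easier to test.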

  • solution XML file loaded in Internet Explorer

  • Parsers and trees

    We've already seen that XML markup can be displayed as a tree.

    Some XML parsers exploit this. They
    parse the file,
    extract the data,
    return it organized in a tree data structure called a Document Object Model.

    article
       title
       date
       author
          firstName
          lastName
       summary
       content

  • 15.4 Document Object Model (DOM)

    A DOM parser retrieves data from an XML document.
    The result is a hierarchical tree structure called a DOM tree.
    Each component of an XML document is represented as a tree node.
    Parent nodes contain child nodes.
    Sibling nodes have the same parent.
    A single root (or document) node contains all other document nodes.

  • DOM tree of previous example

    Fig. 15.6 Tree structure for article.xml.

    Figure labels: one single document root node, sibling nodes, parent node, child nodes.

    Simple XML December 21, 2001 John Doe XML is pretty easy. In this chapter, we present a wide variety of examples that use XML.

  • Python provides a DOM parser!

    All nodes have a name (of the tag) and a value.
    Text (incl. whitespace) is represented in nodes with tag name #text.

    article
       #text
       title
          #text  Simple XML
       #text
       date
          #text  Dec..2001
       #text
       author
          #text
          firstName
             #text  John
          #text
          lastName
             #text  Doe
          #text
       #text
       summary
          #text  XML..easy.
       #text
       content
          #text  In this..XML.
       #text
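    The #text nodes are easy to see in a small experiment. The sketch below is written for Python 3 and uses a cut-down stand-in for article.xml (only two child elements), so the variable names and the shortened document are assumptions made for illustration.

```python
from xml.dom.minidom import parseString

# A cut-down stand-in for article.xml with the same kind of layout.
xml_text = """<article>
   <title>Simple XML</title>
   <date>December 21, 2001</date>
</article>"""

root = parseString(xml_text).documentElement

# Whitespace between elements shows up as nodes named #text.
names = [node.nodeName for node in root.childNodes]
print(names)

# The text inside <title> is also a #text node, one level further down.
title = root.childNodes[1]
print(title.firstChild.nodeName, repr(title.firstChild.nodeValue))
```

    The list of names alternates between #text and element names, because every line break and indentation between two tags becomes its own text node.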

  • revisedfig16_04.py

    import sys
    from xml.dom.minidom import parse          # stuff we have to import
    from xml.parsers.expat import ExpatError   # (the book uses an old version)

    # .. (the code that opens the XML file as `file` is not shown on the slide)

    try:
        document = parse( file )
        file.close()
    except ExpatError:
        sys.exit( "Error processing XML file" )

    # get root element of the DOM tree;
    # the documentElement attribute refers to the root node
    rootElement = document.documentElement
    print "Here is the root element of the document: %s" % rootElement.nodeName

    # traverse all child nodes of root element;
    # childNodes is the list of a node's children
    print "The following are its child elements:"
    for node in rootElement.childNodes:
        print node.nodeName

    # get first child node of root element
    child = rootElement.firstChild
    print "\nThe first child of root element is:", child.nodeName
    print "whose next sibling is:",

    # get next sibling of first child
    sibling = child.nextSibling
    print sibling.nodeName

    print 'Text inside "' + sibling.nodeName + '" tag is',
    textnode = sibling.firstChild

    # print text value of sibling
    print textnode.nodeValue
    print "Parent node of %s is: %s" % ( sibling.nodeName, sibling.parentNode.nodeName )

    nodeName refers to the element's tag name.

    Other node attributes: firstChild, nextSibling, nodeValue, parentNode

  • Program output

    Here is the root element of the document: article
    The following are its child elements:
    #text
    title
    #text
    date
    #text
    author
    #text
    summary
    #text
    content
    #text

    The first child of root element is: #text
    whose next sibling is: title
    Text inside "title" tag is Simple XML
    Parent node of title is: article
    ..

    print 'Text inside "' + sibling.nodeName + '" tag is',
    textnode = sibling.firstChild

    # print text value of sibling
    print textnode.nodeValue
    ..

  • Parsing XML sequence?

    We have the i2xml filter; we want xml2i also.
    We don't have to write an XML parser, Python provides one.
    Thus, algorithm:
    Open file
    Use Python parser to obtain the DOM tree
    Traverse tree to extract sequence information, build Isequence objects

    Ignoring whitespace nodes, we have to search a tree like this:

    SEQUENCEDATA
       SEQ (type)
          NAME
          ID
          DATA
       SEQ (type)
          NAME
          ID
          DATA
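    The algorithm above can be sketched in a few lines of Python 3. This is an illustration under assumptions, not the course's solution: it parses a string instead of opening a file, it returns plain dictionaries as a stand-in for Isequence objects, the sample data is shortened, and it uses getElementsByTagName instead of a hand-written recursive traversal. The names SAMPLE, element_text and extract_sequences are invented here.

```python
from xml.dom.minidom import parseString

# Hypothetical sample in the SEQUENCEDATA/SEQ structure; the data is shortened.
SAMPLE = """<SEQUENCEDATA>
 <SEQ type="dna">
  <NAME>Aspergillus awamori</NAME>
  <ID>U03518</ID>
  <DATA>aacctgcgga</DATA>
 </SEQ>
</SEQUENCEDATA>"""


def element_text(parent, tag):
    """Return the text of the first <tag> element below parent."""
    node = parent.getElementsByTagName(tag)[0]
    return "".join(c.nodeValue for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE).strip()


def extract_sequences(xml_text):
    """Collect one dict per SEQ element (a stand-in for Isequence objects)."""
    root = parseString(xml_text).documentElement
    if root.nodeName != "SEQUENCEDATA":
        raise ValueError("This is not a sequence file")
    sequences = []
    for node in root.childNodes:
        # Skip the whitespace #text nodes between SEQ elements.
        if node.nodeType != node.ELEMENT_NODE or node.nodeName != "SEQ":
            continue
        sequences.append({
            "type": node.getAttribute("type"),
            "name": element_text(node, "NAME"),
            "id": element_text(node, "ID"),
            "data": element_text(node, "DATA"),
        })
    return sequences


print(extract_sequences(SAMPLE))
```

    Checking nodeType before looking at a node is what "ignoring whitespace nodes" amounts to in practice: only element nodes named SEQ are turned into sequence records.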

  • from Isequence import Isequence
    import sys
    from xml.dom.minidom import parse
    from xml.parsers.expat import ExpatError

    class Parser:
        """Parses xml file, stores sequences in Isequence list"""

        def __init__( self ):
            self.iseqlist = [] # make empty list

        def parse_file( self, loadfilename ):
            try:
                loadfile = open( loadfilename, "r" )
            except IOError, message:
                sys.exit( message )

            # Use Python's own xml parser to parse xml file:
            try:
                dom = parse( loadfilename )
                loadfile.close()
            except ExpatError:
                sys.exit( "Couldn't parse xml file" )

            # now dom is our dom tree structure. Was the xml file a sequence file?
            if dom.documentElement.nodeName == "SEQUENCEDATA":

                # recursively search the parse tree:
                for child in dom.documentElement.childNodes:
                    self.traverse_dom_tree( child )
            else:
                sys.exit( "This is not a sequence file" )

            return self.iseqlist

    part 1 of 2

  •     def traverse_dom_tree( self, node ):
            """Recursive method that traverses the DOM tree"""