<XML> Pierre Lindenbaum http://plindenbaum.blogspot.com @yokofakun(http://twitter.com/yokofakun) INSERM-UMR1087 Nantes January 2013 https://github.com/lindenb/courses/tree/master/about.xml

XML for bioinformatics


DESCRIPTION

My short course about XML and bioinformatics. January 2013.


Page 1: XML for bioinformatics

<XML>

Pierre Lindenbaum

http://plindenbaum.blogspot.com
@yokofakun (http://twitter.com/yokofakun)

INSERM-UMR1087 Nantes
January 2013

https://github.com/lindenb/courses/tree/master/about.xml

Page 2: XML for bioinformatics

Extensible Markup Language

Page 3: XML for bioinformatics

Machine Readable

Page 4: XML for bioinformatics

Human Readable

Page 5: XML for bioinformatics

DOM

Page 6: XML for bioinformatics

... not always

artOfLineage></rdf:Description><rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/12292"><rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/><rank rdf:resource="http://purl.uniprot.org/core/Species"/><reviewed rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</reviewed><mnemonic>NVMV</mnemonic><scientificName>Nicotianavelutinamosaicvirus</scientificName><commonName>NvMV</commonName><host rdf:resource="http://purl.uniprot.org/taxonomy/49454"/><rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/12429"/><partOfLineage rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</partOfLineage></rdf:Description><rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/12439"><rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/><rank rdf:resource="http://purl.uniprot.org/core/Species"/><scientificName>20SRNAreplicon</scientificName><rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/12429"/><partOfLineage rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</partOfLineage></rdf:Description><rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/12440"><rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/><rank rdf:resource="http://purl.uniprot.org/core/Species"/><reviewed rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</reviewed><replaces rdf:resource="http://purl.uniprot.org/taxonomy/36457"/><replaces rdf:resource="http://purl.uniprot.org/taxonomy/12646"/><mnemonic>HSVAB</mnemonic><scientificName>Non-Anon-Bhepatitisvirus</scientificName><otherName>Non-A,non-Bhepatitisvirus</otherName><otherName>enterically-transmittednon-A,non-BhepatitisvirusET-NANBHV</otherName><otherName>non-A</otherName><otherName>non-A,non-BhepatitisvirusET-NANBHV</otherName><otherN

Page 7: XML for bioinformatics

Just a format

Page 8: XML for bioinformatics

*.txt

PMID- 16381885
OWN - NLM
STAT- MEDLINE
DA - 20051229
DCOM- 20060228
LR - 20091118
IS - 1362-4962 (Electronic)
IS - 0305-1048 (Linking)
VI - 34
IP - Database issue
DP - 2006 Jan 1
TI - From genomics to chemical genomics: new developments in KEGG.
PG - D354-7
AB - The increasing amount of genomic and molecular information is the basis for understanding higher-order biological systems, such as the cell and the organism, and their interactions with the environment, as well as for medical, industrial and other practical applications. The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. This reflects our attempt to computerize functional interpretations as part of the pathway reconstruction process based on the hierarchically structured knowledge about the genomic, chemical and network spaces. In accordance with the new chemical genomics initiatives, the scope of KEGG LIGAND has been significantly expanded to cover both endogenous and exogenous molecules. Specifically, RPAIR contains curated chemical structure transformation patterns extracted from known enzymatic reactions, which would enable analysis of genome-environment interactions, such as the prediction of new reactions and new enzyme genes that would degrade new environmental compounds. Additionally, drug information is now stored separately and linked to new KEGG DRUG structure maps.
AD - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan. [email protected]
FAU - Kanehisa, Minoru
AU - Kanehisa M
FAU - Goto, Susumu
AU - Goto S
FAU - Hattori, Masahiro
AU - Hattori M
FAU - Aoki-Kinoshita, Kiyoko F
AU - Aoki-Kinoshita KF
FAU - Itoh, Masumi
AU - Itoh M
FAU - Kawashima, Shuichi
AU - Kawashima S
FAU - Katayama, Toshiaki
AU - Katayama T
FAU - Araki, Michihiro
AU - Araki M
FAU - Hirakawa, Mika
AU - Hirakawa M
LA - eng
PT - Journal Article
PT - Research Support, Non-U.S. Gov't
PL - England
TA - Nucleic Acids Res
JT - Nucleic acids research
JID - 0411011
RN - 0 (Enzymes)
RN - 0 (Ligands)
RN - 0 (Pharmaceutical Preparations)
SB - IM
MH - *Biotransformation
MH - Chemical Phenomena
MH - *Chemistry
MH - *Databases, Factual
MH - *Databases, Genetic
MH - Environment
MH - Enzymes/chemistry/genetics
MH - *Genomics
MH - Humans
MH - Internet
MH - Ligands
MH - Pharmaceutical Preparations/chemistry/classification
MH - Signal Transduction
MH - Systems Integration
MH - User-Computer Interface
PMC - PMC1347464
OID - NLM: PMC1347464
EDAT- 2005/12/31 09:00
MHDA- 2006/03/01 09:00
CRDT- 2005/12/31 09:00
AID - 34/suppl_1/D354 [pii]
AID - 10.1093/nar/gkj102 [doi]
PST - ppublish
SO - Nucleic Acids Res. 2006 Jan 1;34(Database issue):D354-7.

Page 9: XML for bioinformatics

*.xml

<?xml version="1.0"?><!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2008//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_080101.dtd"><PubmedArticleSet><PubmedArticle> <MedlineCitation Status='MEDLINE' Owner='NLM'> <PMID Version='1'>16381885</PMID> <DateCreated> <Year>2005</Year> <Month>12</Month> <Day>29</Day> </DateCreated> <DateCompleted> <Year>2006</Year> <Month>02</Month> <Day>28</Day> </DateCompleted> <DateRevised> <Year>2009</Year> <Month>11</Month> <Day>18</Day> </DateRevised> <Article PubModel='Print'> <Journal> <ISSN IssnType='Electronic'>1362-4962</ISSN> <JournalIssue CitedMedium='Internet'> <Volume>34</Volume> <Issue>Database issue</Issue> <PubDate> <Year>2006</Year> <Month>Jan</Month> <Day>1</Day> </PubDate> </JournalIssue> <Title>Nucleic acids research</Title> <ISOAbbreviation>Nucleic Acids Res.</ISOAbbreviation> </Journal> <ArticleTitle>From genomics to chemical genomics: new developments in KEGG.</ArticleTitle> <Pagination> <MedlinePgn>D354-7</MedlinePgn> </Pagination> <Abstract> <AbstractText>The increasing amount of genomic and molecular information is the basis for understanding higher-order biological systems, such as the cell and the organism, and their interactions with the environment, as well as for medical, industrial and other practical applications. The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. This reflects our attempt to computerize functional interpretations as part of the pathway reconstruction process based on the hierarchically structured knowledge about the genomic, chemical and network spaces. 
In accordance with the new chemical genomics initiatives, the scope of KEGG LIGAND has been significantly expanded to cover both endogenous and exogenous molecules. Specifically, RPAIR contains curated chemical structure transformation patterns extracted from known enzymatic reactions, which would enable analysis of genome-environment interactions, such as the prediction of new reactions and new enzyme genes that would degrade new environmental compounds. Additionally, drug information is now stored separately and linked to new KEGG DRUG structure maps.</AbstractText> </Abstract> <Affiliation>Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan. [email protected]</Affiliation> <AuthorList CompleteYN='Y'> <Author ValidYN='Y'> <LastName>Kanehisa</LastName> <ForeName>Minoru</ForeName> <Initials>M</Initials> </Author> <Author ValidYN='Y'> <LastName>Goto</LastName> <ForeName>Susumu</ForeName> <Initials>S</Initials> </Author> <Author ValidYN='Y'> <LastName>Hattori</LastName> <ForeName>Masahiro</ForeName> <Initials>M</Initials> </Author> <Author ValidYN='Y'> <LastName>Aoki-Kinoshita</LastName> <ForeName>Kiyoko F</ForeName> <Initials>KF</Initials> </Author> <Author ValidYN='Y'> <LastName>Itoh</LastName> <ForeName>Masumi</ForeName> <Initials>M</Initials> </Author> <Author ValidYN='Y'> <LastName>Kawashima</LastName> <ForeName>Shuichi</ForeName> <Initials>S</Initials> </Author> <Author ValidYN='Y'> <LastName>Katayama</LastName> <ForeName>Toshiaki</ForeName> <Initials>T</Initials> </Author> <Author ValidYN='Y'> <LastName>Araki</LastName> <ForeName>Michihiro</ForeName> <Initials>M</Initials> </Author> <Author ValidYN='Y'> <LastName>Hirakawa</LastName> <ForeName>Mika</ForeName> <Initials>M</Initials> </Author> </AuthorList> <Language>eng</Language> <PublicationTypeList> <PublicationType>Journal Article</PublicationType> <PublicationType>Research Support, Non-U.S. 
Gov't</PublicationType> </PublicationTypeList> </Article> <MedlineJournalInfo> <Country>England</Country> <MedlineTA>Nucleic Acids Res</MedlineTA> <NlmUniqueID>0411011</NlmUniqueID> <ISSNLinking>0305-1048</ISSNLinking> </MedlineJournalInfo> <ChemicalList> <Chemical> <RegistryNumber>0</RegistryNumber> <NameOfSubstance>Enzymes</NameOfSubstance> </Chemical> <Chemical> <RegistryNumber>0</RegistryNumber> <NameOfSubstance>Ligands</NameOfSubstance> </Chemical> <Chemical> <RegistryNumber>0</RegistryNumber> <NameOfSubstance>Pharmaceutical Preparations</NameOfSubstance> </Chemical> </ChemicalList> <CitationSubset>IM</CitationSubset> <CommentsCorrectionsList> <CommentsCorrections RefType='Cites'> <RefSource>Nucleic Acids Res. 2001 Jan 1;29(1):22-8</RefSource> <PMID Version='1'>11125040</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>Nucleic Acids Res. 2002 Jan 1;30(1):42-6</RefSource> <PMID Version='1'>11752249</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>J Am Chem Soc. 2003 Oct 1;125(39):11853-65</RefSource> <PMID Version='1'>14505407</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>Bioinformatics. 1998;14(7):591-9</RefSource> <PMID Version='1'>9730924</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>J Am Chem Soc. 2004 Dec 22;126(50):16487-98</RefSource> <PMID Version='1'>15600352</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4</RefSource> <PMID Version='1'>15608248</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>Trends Genet. 1997 Sep;13(9):375-6</RefSource> <PMID Version='1'>9287494</PMID> </CommentsCorrections> <CommentsCorrections RefType='Cites'> <RefSource>Nucleic Acids Res. 
2004 Jan 1;32(Database issue):D277-80</RefSource> <PMID Version='1'>14681412</PMID> </CommentsCorrections> </CommentsCorrectionsList> <MeshHeadingList> <MeshHeading> <DescriptorName MajorTopicYN='Y'>Biotransformation</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Chemical Phenomena</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='Y'>Chemistry</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='Y'>Databases, Factual</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='Y'>Databases, Genetic</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Environment</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Enzymes</DescriptorName> <QualifierName MajorTopicYN='N'>chemistry</QualifierName> <QualifierName MajorTopicYN='N'>genetics</QualifierName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='Y'>Genomics</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Humans</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Internet</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Ligands</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Pharmaceutical Preparations</DescriptorName> <QualifierName MajorTopicYN='N'>chemistry</QualifierName> <QualifierName MajorTopicYN='N'>classification</QualifierName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Signal Transduction</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>Systems Integration</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN='N'>User-Computer Interface</DescriptorName> </MeshHeading> </MeshHeadingList> <OtherID Source='NLM'>PMC1347464</OtherID> </MedlineCitation> <PubmedData> <History> <PubMedPubDate PubStatus='pubmed'> <Year>2005</Year> <Month>12</Month> <Day>31</Day> 
<Hour>9</Hour> <Minute>0</Minute> </PubMedPubDate> <PubMedPubDate PubStatus='medline'> <Year>2006</Year> <Month>3</Month> <Day>1</Day> <Hour>9</Hour> <Minute>0</Minute> </PubMedPubDate> <PubMedPubDate PubStatus='entrez'> <Year>2005</Year> <Month>12</Month> <Day>31</Day> <Hour>9</Hour> <Minute>0</Minute> </PubMedPubDate> </History> <PublicationStatus>ppublish</PublicationStatus> <ArticleIdList> <ArticleId IdType='pii'>34/suppl_1/D354</ArticleId> <ArticleId IdType='doi'>10.1093/nar/gkj102</ArticleId> <ArticleId IdType='pubmed'>16381885</ArticleId> <ArticleId IdType='pmc'>PMC1347464</ArticleId> </ArticleIdList> </PubmedData></PubmedArticle></PubmedArticleSet>

Page 10: XML for bioinformatics

*.json

{ "header": { "type": "efetch.pubmed", "version": "0.3" }, "result": [ { "medlinecitation": { "pmid": { "version": "1", "value": "17284678" }, "datecreated": { "year": "2007", "month": "03", "day": "02" }, "datecompleted": { "year": "2007", "month": "04", "day": "05" }, "daterevised": { "year": "2009", "month": "11", "day": "18" }, "article": { "journal": { "issn": { "issntype": "Print", "value": "1088-9051" }, "journalissue": { "citedmedium": "Print", "volume": "17", "issue": "3", "pubdate": [ "2007", "Mar" ] }, "title": "Genome research", "isoabbreviation": "Genome Res." }, "articletitle": "Sequencing and analysis of chromosome 1 of Eimeria tenella reveals a unique segmental organization.", "pagination": [ "311-9" ], "abstract": { "abstracttexts": [ { "value": "Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation." 
} ] }, "affiliation": "Malaysia Genome Institute, UKM-MTDC Smart Technology Centre, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor DE, Malaysia.", "authorlist": [ { "completeyn": true, "type": "authors" }, { "validyn": true, "lastname": "Ling", "forename": "King-Hwa", "initials": "KH", "nameids": [ ] }, { "validyn": true, "lastname": "Rajandream", "forename": "Marie-Adele", "initials": "MA", "nameids": [ ] }, { "validyn": true, "lastname": "Rivailler", "forename": "Pierre", "initials": "P", "nameids": [ ] }, { "validyn": true, "lastname": "Ivens", "forename": "Alasdair", "initials": "A", "nameids": [ ] }, { "validyn": true, "lastname": "Yap", "forename": "Soon-Joo", "initials": "SJ", "nameids": [ ] }, { "validyn": true, "lastname": "Madeira", "forename": "Alda M B N", "initials": "AM", "nameids": [ ] }, { "validyn": true, "lastname": "Mungall", "forename": "Karen", "initials": "K", "nameids": [ ] }, { "validyn": true, "lastname": "Billington", "forename": "Karen", "initials": "K", "nameids": [ ] }, { "validyn": true, "lastname": "Yee", "forename": "Wai-Yan", "initials": "WY", "nameids": [ ] }, { "validyn": true, "lastname": "Bankier", "forename": "Alan T", "initials": "AT", "nameids": [ ] }, { "validyn": true, "lastname": "Carroll", "forename": "Fionnadh", "initials": "F", "nameids": [ ] }, { "validyn": true, "lastname": "Durham", "forename": "Alan M", "initials": "AM", "nameids": [ ] }, { "validyn": true, "lastname": "Peters", "forename": "Nicholas", "initials": "N", "nameids": [ ] }, { "validyn": true, "lastname": "Loo", "forename": "Shu-San", "initials": "SS", "nameids": [ ] }, { "validyn": true, "lastname": "Isa", "forename": "Mohd Noor Mat", "initials": "MN", "nameids": [ ] }, { "validyn": true, "lastname": "Novaes", "forename": "Jeniffer", "initials": "J", "nameids": [ ] }, { "validyn": true, "lastname": "Quail", "forename": "Michael", "initials": "M", "nameids": [ ] }, { "validyn": true, "lastname": "Rosli", "forename": "Rozita", "initials": "R", 
"nameids": [ ] }, { "validyn": true, "lastname": "Nor Shamsudin", "forename": "Mariana", "initials": "M", "nameids": [ ] }, { "validyn": true, "lastname": "Sobreira", "forename": "Tiago J P", "initials": "TJ", "nameids": [ ] }, { "validyn": true, "lastname": "Tivey", "forename": "Adrian R", "initials": "AR", "nameids": [ ] }, { "validyn": true, "lastname": "Wai", "forename": "Siew-Fun", "initials": "SF", "nameids": [ ] }, { "validyn": true, "lastname": "White", "forename": "Sarah", "initials": "S", "nameids": [ ] }, { "validyn": true, "lastname": "Wu", "forename": "Xikun", "initials": "X", "nameids": [ ] }, { "validyn": true, "lastname": "Kerhornou", "forename": "Arnaud", "initials": "A", "nameids": [ ] }, { "validyn": true, "lastname": "Blake", "forename": "Damer", "initials": "D", "nameids": [ ] }, { "validyn": true, "lastname": "Mohamed", "forename": "Rahmah", "initials": "R", "nameids": [ ] }, { "validyn": true, "lastname": "Shirley", "forename": "Martin", "initials": "M", "nameids": [ ] }, { "validyn": true, "lastname": "Gruber", "forename": "Arthur", "initials": "A", "nameids": [ ] }, { "validyn": true, "lastname": "Berriman", "forename": "Matthew", "initials": "M", "nameids": [ ] }, { "validyn": true, "lastname": "Tomley", "forename": "Fiona", "initials": "F", "nameids": [ ] }, { "validyn": true, "lastname": "Dear", "forename": "Paul H", "initials": "PH", "nameids": [ ] }, { "validyn": true, "lastname": "Wan", "forename": "Kiew-Lian", "initials": "KL", "nameids": [ ] } ], "grantlist": [ { "completeyn": true }, { "agency": "Wellcome Trust", "country": "United Kingdom" } ], "publicationtypelist": [ "Comparative Study", "Journal Article", "Research Support, Non-U.S. 
Gov't" ], "elocationids": [ ], "languages": [ "eng" ], "articledates": [ { "datetype": "Electronic", "year": "2007", "month": "02", "day": "06" } ] }, "medlinejournalinfo": { "country": "United States", "medlineta": "Genome Res", "nlmuniqueid": "9518021", "issnlinking": "1088-9051" }, "commentscorrectionslist": [ { "reftype": "Cites", "refsource": "Nucleic Acids Res. 1999 Jan 15;27(2):573-80", "pmid": { "version": "1", "value": "9862982" } }, { "reftype": "Cites", "refsource": "Nucleic Acids Res. 1997 Mar 1;25(5):955-64", "pmid": { "version": "1", "value": "9023104" } }, { "reftype": "Cites", "refsource": "Genome Res. 2000 Oct;10(10):1587-93", "pmid": { "version": "1", "value": "11042156" } }, { "reftype": "Cites", "refsource": "Genome Res. 2000 Nov;10(11):1737-42", "pmid": { "version": "1", "value": "11076859" } }, { "reftype": "Cites", "refsource": "Bioinformatics. 2000 Oct;16(10):944-5", "pmid": { "version": "1", "value": "11120685" } }, { "reftype": "Cites", "refsource": "Nature. 2001 Feb 15;409(6822):860-921", "pmid": { "version": "1", "value": "11237011" } }, { "reftype": "Cites", "refsource": "Nature. 2002 Jul 4;418(6893):79-85", "pmid": { "version": "1", "value": "12097910" } }, { "reftype": "Cites", "refsource": "Nature. 2002 Oct 3;419(6906):498-511", "pmid": { "version": "1", "value": "12368864" } }, { "reftype": "Cites", "refsource": "Nature. 2002 Oct 3;419(6906):527-31", "pmid": { "version": "1", "value": "12368867" } }, { "reftype": "Cites", "refsource": "Exp Parasitol. 2002 Jun-Jul;101(2-3):168-73", "pmid": { "version": "1", "value": "12427472" } }, { "reftype": "Cites", "refsource": "Nucleic Acids Res. 2003 Jan 1;31(1):439-41", "pmid": { "version": "1", "value": "12520045" } }, { "reftype": "Cites", "refsource": "Genome Res. 2003 Mar;13(3):443-54", "pmid": { "version": "1", "value": "12618375" } }, { "reftype": "Cites", "refsource": "Avian Pathol. 
2003 Apr;32(2):115-27", "pmid": { "version": "1", "value": "12745365" } }, { "reftype": "Cites", "refsource": "Parasitol Res. 2003 Aug;90(6):473-5", "pmid": { "version": "1", "value": "12802683" } }, { "reftype": "Cites", "refsource": "Trends Parasitol. 2004 May;20(5):199-201", "pmid": { "version": "1", "value": "15105014" } }, { "reftype": "Cites", "refsource": "J Mol Biol. 2004 May 14;338(5):1027-36", "pmid": { "version": "1", "value": "15111065" } }, { "reftype": "Cites", "refsource": "BMC Bioinformatics. 2004 May 14;5:59", "pmid": { "version": "1", "value": "15144565" } }, { "reftype": "Cites", "refsource": "Bioinformatics. 2004 Nov 1;20(16):2878-9", "pmid": { "version": "1", "value": "15145805" } }, { "reftype": "Cites", "refsource": "Parasitol Today. 1991 May;7(5):99-105", "pmid": { "version": "1", "value": "15463458" } }, { "reftype": "Cites", "refsource": "Nature. 2005 May 5;435(7038):43-57", "pmid": { "version": "1", "value": "15875012" } }, { "reftype": "Cites", "refsource": "Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W116-20", "pmid": { "version": "1", "value": "15980438" } }, { "reftype": "Cites", "refsource": "Science. 2005 Jul 15;309(5733):416-22", "pmid": { "version": "1", "value": "16020726" } }, { "reftype": "Cites", "refsource": "Nat Genet. 2005 Sep;37(9):986-90", "pmid": { "version": "1", "value": "16086015" } }, { "reftype": "Cites", "refsource": "Chromosome Res. 2005;13(5):517-24", "pmid": { "version": "1", "value": "16132816" } }, { "reftype": "Cites", "refsource": "Nat Rev Genet. 2005 Oct;6(10):743-55", "pmid": { "version": "1", "value": "16205714" } }, { "reftype": "Cites", "refsource": "Bioinformatics. 2006 Feb 1;22(3):361-2", "pmid": { "version": "1", "value": "16332714" } }, { "reftype": "Cites", "refsource": "Mol Microbiol. 2006 Apr;60(1):5-15", "pmid": { "version": "1", "value": "16556216" } }, { "reftype": "Cites", "refsource": "Mol Biochem Parasitol. 
1990 Jan 15;38(2):169-73", "pmid": { "version": "1", "value": "2325704" } }, { "reftype": "Cites", "refsource": "Parasite Immunol. 1986 Nov;8(6):529-39", "pmid": { "version": "1", "value": "3543808" } }, { "reftype": "Cites", "refsource": "Parasitol Res. 1994;80(5):366-73", "pmid": { "version": "1", "value": "7971922" } }, { "reftype": "Cites", "refsource": "Nucleic Acids Res. 1995 Dec 25;23(24):4992-9", "pmid": { "version": "1", "value": "8559656" } }, { "reftype": "Cites", "refsource": "Int J Parasitol. 1999 Dec;29(12):1885-92", "pmid": { "version": "1", "value": "10961844" } } ], "meshheadinglist": [ { "descriptorname": { "majortopicyn": false, "value": "Animals" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Base Sequence" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Chromosome Mapping" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Chromosome Structures" }, "qualifiernames": [ { "majortopicyn": true, "value": "genetics" } ] }, { "descriptorname": { "majortopicyn": false, "value": "Computational Biology" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Eimeria tenella" }, "qualifiernames": [ { "majortopicyn": true, "value": "genetics" } ] }, { "descriptorname": { "majortopicyn": false, "value": "Genes, Protozoan" }, "qualifiernames": [ { "majortopicyn": true, "value": "genetics" } ] }, { "descriptorname": { "majortopicyn": false, "value": "Minisatellite Repeats" }, "qualifiernames": [ { "majortopicyn": false, "value": "genetics" } ] }, { "descriptorname": { "majortopicyn": false, "value": "Molecular Sequence Data" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Polymorphism, Restriction Fragment Length" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Sequence Analysis, DNA" }, "qualifiernames": [ ] } ], "citationsubsets": [ "IM" ], "otherids": [ { 
"source": "NLM", "value": "PMC1800922" } ], "otherabstracts": [ ], "keywordlists": [ ], "spaceflightmissions": [ ], "generalnotes": [ ] }, "pubmeddata": { "history": [ { "pubstatus": "aheadofprint", "year": "2007", "month": "2", "day": "6" }, { "pubstatus": "pubmed", "year": "2007", "month": "2", "day": "8", "hour": "9", "minute": "0" }, { "pubstatus": "medline", "year": "2007", "month": "4", "day": "6", "hour": "9", "minute": "0" }, { "pubstatus": "entrez", "year": "2007", "month": "2", "day": "8", "hour": "9", "minute": "0" } ], "publicationstatus": "ppublish", "articleidlist": [ { "idtype": "pii", "value": "gr.5823007" }, { "idtype": "doi", "value": "10.1101/gr.5823007" }, { "idtype": "pubmed", "value": "17284678" }, { "idtype": "pmc", "value": "PMC1800922" } ] } }, { "medlinecitation": { "pmid": { "version": "1", "value": "9997" }, "datecreated": { "year": "1976", "month": "12", "day": "30" }, "datecompleted": { "year": "1976", "month": "12", "day": "30" }, "daterevised": { "year": "2003", "month": "11", "day": "14" }, "article": { "journal": { "issn": { "issntype": "Print", "value": "0006-3002" }, "journalissue": { "citedmedium": "Print", "volume": "446", "issue": "1", "pubdate": [ "1976", "Sep", "28" ] }, "title": "Biochimica et biophysica acta", "isoabbreviation": "Biochim. Biophys. Acta" }, "articletitle": "Magnetic studies of Chromatium flavocytochrome C552. A mechanism for heme-flavin interaction.", "pagination": [ "179-91" ], "abstract": { "abstracttexts": [ { "value": "Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. 
In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction." } ] }, "authorlist": [ { "completeyn": true, "type": "authors" }, { "validyn": true, "lastname": "Strekas", "forename": "T C", "initials": "TC", "nameids": [ ] } ], "publicationtypelist": [ "Journal Article" ], "elocationids": [ ], "languages": [ "eng" ], "articledates": [ ] }, "medlinejournalinfo": { "country": "NETHERLANDS", "medlineta": "Biochim Biophys Acta", "nlmuniqueid": "0217513", "issnlinking": "0006-3002" }, "chemicallist": [ { "registrynumber": "0", "nameofsubstance": "Cytochrome c Group" }, { "registrynumber": "0", "nameofsubstance": "Flavins" }, { "registrynumber": "14875-96-8", "nameofsubstance": "Heme" }, { "registrynumber": "7439-89-6", "nameofsubstance": "Iron" } ], "meshheadinglist": [ { "descriptorname": { "majortopicyn": false, "value": "Binding Sites" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Chromatium" }, "qualifiernames": [ { "majortopicyn": true, "value": "enzymology" } ] }, { "descriptorname": { "majortopicyn": true, "value": "Cytochrome c Group" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Electron Spin Resonance Spectroscopy" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Flavins" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Heme" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Hydrogen-Ion Concentration" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Iron" }, "qualifiernames": [ { "majortopicyn": false, "value": "analysis" } ] }, { "descriptorname": { "majortopicyn": false, "value": 
"Magnetics" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Oxidation-Reduction" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Protein Binding" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Protein Conformation" }, "qualifiernames": [ ] }, { "descriptorname": { "majortopicyn": false, "value": "Temperature" }, "qualifiernames": [ ] } ], "citationsubsets": [ "IM" ], "otherids": [ ], "otherabstracts": [ ], "keywordlists": [ ], "spaceflightmissions": [ ], "generalnotes": [ ] }, "pubmeddata": { "history": [ { "pubstatus": "pubmed", "year": "1976", "month": "9", "day": "28" }, { "pubstatus": "medline", "year": "1976", "month": "9", "day": "28", "hour": "0", "minute": "1" }, { "pubstatus": "entrez", "year": "1976", "month": "9", "day": "28", "hour": "0", "minute": "0" } ], "publicationstatus": "ppublish", "articleidlist": [ { "idtype": "pubmed", "value": "9997" } ] } } ] }

Page 11: XML for bioinformatics
Page 12: XML for bioinformatics

XML namespace

<my-database>
  <record>
    <title>Record1</title>
    <html>
      <head>
        <title>hello</title>
      </head>
      <body>
        <h1>Hello</h1>
      </body>
    </html>
  </record>
</my-database>

<my-database xmlns="http://mydatabase.org" xmlns:h="http://www.w3.org/1999/xhtml">
  <record>
    <title>Record1</title>
    <h:html>
      <h:head>
        <h:title>hello</h:title>
      </h:head>
      <h:body>
        <h:h1>Hello</h:h1>
      </h:body>
    </h:html>
  </record>
</my-database>
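
The two `title` elements share a local name but live in different namespaces, which a namespace-aware parser can tell apart. The following is a minimal sketch using the JDK's built-in DOM parser; the class name `NamespaceDemo` and the helper `titleIn` are made up for illustration, and the document is embedded as a string rather than read from a file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceDemo {
    // The namespaced record from the slide, with matching open/close tags.
    static final String XML =
        "<my-database xmlns='http://mydatabase.org' xmlns:h='http://www.w3.org/1999/xhtml'>"
        + "<record><title>Record1</title>"
        + "<h:html><h:head><h:title>hello</h:title></h:head>"
        + "<h:body><h:h1>Hello</h:h1></h:body></h:html>"
        + "</record></my-database>";

    /** Returns the text of the first "title" element found in the given namespace. */
    static String titleIn(String namespaceUri) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories ignore namespaces unless asked
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        return doc.getElementsByTagNameNS(namespaceUri, "title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titleIn("http://mydatabase.org"));        // Record1
        System.out.println(titleIn("http://www.w3.org/1999/xhtml")); // hello
    }
}
```

The prefix `h` is only a local shorthand; the lookup is done on the namespace URI, so the same code would work if the document used another prefix.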

Page 13: XML for bioinformatics

xmllint

Page 14: XML for bioinformatics

xsltproc

Page 15: XML for bioinformatics

Parsing

Page 16: XML for bioinformatics

DOM

Element root = document.getDocumentElement();
for (Node item = root.getFirstChild(); item != null; item = item.getNextSibling()) {
    if (item.getNodeType() == Node.ELEMENT_NODE) {
        System.out.println(((Element) item).getAttribute("id"));
    }
}
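
The loop on this slide assumes an already parsed `document`. A self-contained sketch of the same traversal might look like the following; the class name `DomDemo`, the method `childIds` and the two-gene input string are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomDemo {
    /** Collects the "id" attribute of every element that is a direct child of the root. */
    static List<String> childIds(String xml) throws Exception {
        // DOM loads the whole document into memory as a tree.
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<String> ids = new ArrayList<>();
        Element root = document.getDocumentElement();
        for (Node item = root.getFirstChild(); item != null; item = item.getNextSibling()) {
            // Whitespace between tags appears as text nodes, hence the node type check.
            if (item.getNodeType() == Node.ELEMENT_NODE) {
                ids.add(((Element) item).getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input, not taken from the slides.
        String xml = "<genes>\n  <gene id='1'/>\n  <gene id='2'/>\n</genes>";
        System.out.println(childIds(xml)); // [1, 2]
    }
}
```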

Page 17: XML for bioinformatics

StAX

public interface XMLStreamReader {
    public int next();
    public boolean hasNext();
    public String getText();
    public String getLocalName();
    public String getNamespaceURI();
    // ...other methods not shown
}
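
With StAX the application pulls events from the reader one at a time instead of building a tree. Here is a small sketch of that loop using the JDK's `javax.xml.stream` package; the class `StaxDemo`, the method `names` and the sample input are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    /** Pulls events from the reader and collects the text of every "name" element. */
    static List<String> names(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        while (reader.hasNext()) {
            int event = reader.next(); // the caller decides when to advance
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("name")) {
                // getElementText() reads up to the matching end tag of a text-only element.
                result.add(reader.getElementText());
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input modelled on the genes example used for XPath.
        String xml = "<genes><gene id='1'><name>Gene1</name><name>gene-1</name></gene></genes>";
        System.out.println(names(xml)); // [Gene1, gene-1]
    }
}
```

Because only the current event is held in memory, this style suits very large files such as full database dumps.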

Page 18: XML for bioinformatics

SAX

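public interface ContentHandler {
    public void startDocument() throws SAXException;
    public void endDocument() throws SAXException;
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException;
    public void endElement(String uri, String localName, String qName) throws SAXException;
    public void characters(char ch[], int start, int length) throws SAXException;
    // ...other methods not shown
}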
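
With SAX the parser drives the process and calls back into a handler. The sketch below extends `DefaultHandler`, which provides empty implementations of the `ContentHandler` callbacks, and overrides only `startElement`; the class `SaxDemo`, the method `geneIds` and the input string are assumptions for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    /** Collects the "id" attribute of every "gene" element reported by the parser. */
    static List<String> geneIds(String xml) throws Exception {
        List<String> ids = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The factory is not namespace-aware by default, so match on qName.
                if (qName.equals("gene")) {
                    ids.add(atts.getValue("id"));
                }
            }
        };
        // The parser pushes events to the handler while it reads the input.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input, not taken from the slides.
        System.out.println(geneIds("<genes><gene id='1'/><gene id='2'/></genes>")); // [1, 2]
    }
}
```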

Page 19: XML for bioinformatics

XPath

<?xml version="1.0" encoding="UTF-8"?>
<genes>
  <gene id="1">
    <name>Gene1</name>
    <name>gene-1</name>
    <sequence>ATAATGCTAGCTAGCTATCGAATG</sequence>
  </gene>
  <gene id="2">
    <name>Gene2</name>
    <name>gene-2</name>
    <sequence>AATTGCGATTCATCGATGCTATA</sequence>
  </gene>
</genes>

$ xmllint --xpath \
    '/genes/gene[1]/name[2]/text()' \
    genes1.xml
gene-1

$ xmllint --xpath \
    '/genes/gene[1]/name[2]' \
    genes1.xml
<name>gene-1</name>

$ xmllint --xpath \
    'count(/genes/gene)' \
    genes1.xml
2

$ xmllint --xpath \
    '/genes/gene[@id="2"]/name[1]/text()' \
    genes1.xml
Gene2
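The same expressions can be evaluated from Java with the `javax.xml.xpath` package. This is only a sketch: the class `XPathGenes` and its two helpers are hypothetical, and the document is inlined as a string rather than read from genes1.xml.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original slides.
class XPathGenes {
    static final String SAMPLE =
        "<genes>"
        + "<gene id='1'><name>Gene1</name><name>gene-1</name>"
        + "<sequence>ATAATGCTAGCTAGCTATCGAATG</sequence></gene>"
        + "<gene id='2'><name>Gene2</name><name>gene-2</name>"
        + "<sequence>AATTGCGATTCATCGATGCTATA</sequence></gene>"
        + "</genes>";

    /** Evaluates the expression and returns the result as a string. */
    static String text(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates the expression and returns the result as a number. */
    static double number(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(expression,
                new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text(SAMPLE, "/genes/gene[1]/name[2]/text()"));        // gene-1
        System.out.println(text(SAMPLE, "/genes/gene[@id='2']/name[1]/text()")); // Gene2
        System.out.println((int) number(SAMPLE, "count(/genes/gene)"));           // 2
    }
}
```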

Page 20: XML for bioinformatics

XInclude

<?xml version="1.0" encoding="UTF-8"?>
<genes xmlns:xi="http://www.w3.org/2001/XInclude">
  <gene id="1">
    <name>Gene1</name>
    <name>gene-1</name>
    <sequence><xi:include href="sequence.txt" parse="text" /></sequence>
  </gene>
  <xi:include href="gene2.xml" parse="xml"/>
</genes>

Page 21: XML for bioinformatics

XHTML

Page 22: XML for bioinformatics

SVG

<svg xmlns="http://www.w3.org/2000/svg" width='300px' height='300px'>

<circle cx='120' cy='150' r='60' style='fill: gold;' />

<polyline points='120 30, 25 150, 290 150' stroke-width='4' stroke='brown' style='fill: none;' />

<polygon points='210 100, 210 200, 270 150' style='fill: lawngreen;' />

<text x='60' y='250' fill='blue'>Hello, World!</text>

</svg>

Page 23: XML for bioinformatics

XSL-FO

<?xml version="1.0" encoding="ISO-8859-1"?>

<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

<fo:layout-master-set>
  <fo:simple-page-master master-name="A4">
    <!-- Page template goes here -->
  </fo:simple-page-master>
</fo:layout-master-set>

<fo:page-sequence master-reference="A4">
  <!-- Page content goes here -->
</fo:page-sequence>

</fo:root>

Page 24: XML for bioinformatics

RDF

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF (...)>
  <rdf:Description rdf:about="http://…/isbn/2020386682">
    <f:titre xml:lang="fr">Le palais des miroirs</f:titre>
    <f:original rdf:resource="http://…/isbn/000651409X"/>
  </rdf:Description>
</rdf:RDF>

Page 25: XML for bioinformatics

RDF

Page 26: XML for bioinformatics

SOAP

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope (...)>
  <SOAP-ENV:Body>
    <r:queryPathwaysForReferenceIdentifiers>
      <r:referenceIdentifiers>
        <soapenc:string>Q9Y266</soapenc:string>
        <soapenc:string>P17480</soapenc:string>
        <soapenc:string>P2048</soapenc:string>
      </r:referenceIdentifiers>
    </r:queryPathwaysForReferenceIdentifiers>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Page 27: XML for bioinformatics

WSDL

(...)
<wsdl:message name="getEvsData">
  <wsdl:part element="tns:getEvsData" name="parameters">
  </wsdl:part>
</wsdl:message>
<wsdl:message name="getEvsDataResponse">
  <wsdl:part element="tns:getEvsDataResponse" name="parameters">
  </wsdl:part>
</wsdl:message>
<wsdl:portType name="DataQuery">
  <wsdl:operation name="getEvsData">
    <wsdl:input message="tns:getEvsData" name="getEvsData">
    </wsdl:input>
    <wsdl:output message="tns:getEvsDataResponse" name="getEvsDataResponse">
    </wsdl:output>
  </wsdl:operation>
</wsdl:portType>
<wsdl:binding name="DataQueryServiceSoapBinding" type="tns:DataQuery">
  <soap:binding style="document" transport="http://schemas.xmlsoap.org/soap/http" />
  <wsdl:operation name="getEvsData">
    <soap:operation soapAction="" style="document" />
    <wsdl:input name="getEvsData">
      <soap:body use="literal" />
    </wsdl:input>
    <wsdl:output name="getEvsDataResponse">
      <soap:body use="literal" />
    </wsdl:output>
  </wsdl:operation>
</wsdl:binding>

Page 28: XML for bioinformatics

WSDL

$ wsimport \
    "http://evs.gs.washington.edu/wsEVS/EVSDataQueryService?wsdl"

parsing WSDL...
Generating code...
Compiling code...

Page 29: XML for bioinformatics

WSDL

$ more ./edu/washington/gs/evs/webservice/Locus.java

package edu.washington.gs.evs.webservice;
(...)
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "locus", propOrder = {
    "geneName",
    "chromosome",
    "strand",
    "mrnaAccession",
    "geneId",
    "txStart",
    "txEnd",
    "keggPathwayIds"
})
public class Locus {

    protected String geneName;
    protected String chromosome;
    protected String strand;
    protected String mrnaAccession;
    protected int geneId;
    protected int txStart;
    protected int txEnd;
    @XmlElement(nillable = true)
    (...)

Page 30: XML for bioinformatics

Well formed..

<a><b>c</a></b>

Not well formed: the tags overlap. The correctly nested version is <a><b>c</b></a>.
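Well-formedness can be checked without any schema: just try to parse. A small sketch, assuming a hypothetical class `WellFormedCheck`; any conforming XML parser must reject overlapping tags with a fatal error, which here surfaces as a `SAXException`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original slides.
class WellFormedCheck {
    /** Returns true if the parser reads the document to the end without a fatal error. */
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content and rethrows fatal errors.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // e.g. mismatched or overlapping tags
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b>c</a></b>")); // false
        System.out.println(isWellFormed("<a><b>c</b></a>")); // true
    }
}
```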

Page 31: XML for bioinformatics

Validated (DTD)

$ cat genes1.dtd

<!ELEMENT genes (gene+)>
<!ELEMENT gene ((name+),sequence)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT sequence (#PCDATA)>
<!ATTLIST gene id CDATA #REQUIRED>

$ xmllint --dtdvalid genes1.dtd genes1.xml

Page 32: XML for bioinformatics

DTD/JAXB : no need to create a parser

$ xjc -dtd genes1.dtd
parsing a schema...
compiling a schema...
generated/Gene.java
generated/Genes.java
generated/Name.java
generated/ObjectFactory.java

Page 33: XML for bioinformatics

Validated (XSD)

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="Genes">
    <xsd:sequence>
      <xsd:element name="gene" type="Gene" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="Gene">
    <xsd:sequence>
      <xsd:element name="name" maxOccurs="unbounded" type="xsd:string"/>
      <xsd:element name="sequence" type="xsd:string"/>
    </xsd:sequence>
    <xsd:attribute name="id" use="required" type="xsd:int"/>
  </xsd:complexType>

  <xsd:element type="Genes" name="genes"/>

</xsd:schema>

Page 34: XML for bioinformatics

Validated (XSD)

$ xmllint --noout \
    --schema genes1.xsd \
    genes1.xml
genes1.xml validates
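The same validation can be sketched in Java with `javax.xml.validation`. The class `XsdValidation` and the method `isValid` are illustrative names, and the schema is the genes XSD inlined as a string instead of being read from genes1.xsd.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original slides.
class XsdValidation {
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:complexType name='Genes'><xsd:sequence>"
        + "<xsd:element name='gene' type='Gene' maxOccurs='unbounded'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='Gene'><xsd:sequence>"
        + "<xsd:element name='name' maxOccurs='unbounded' type='xsd:string'/>"
        + "<xsd:element name='sequence' type='xsd:string'/>"
        + "</xsd:sequence>"
        + "<xsd:attribute name='id' use='required' type='xsd:int'/>"
        + "</xsd:complexType>"
        + "<xsd:element type='Genes' name='genes'/>"
        + "</xsd:schema>";

    /** Returns true if the document conforms to the schema above. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            // With no error handler set, the first validation error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<genes><gene id='1'><name>Gene1</name><sequence>ATG</sequence></gene></genes>";
        String bad = "<genes><gene id='one'><name>Gene1</name><sequence>ATG</sequence></gene></genes>";
        System.out.println(isValid(ok));  // true
        System.out.println(isValid(bad)); // false: id is not an xsd:int
    }
}
```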

Page 35: XML for bioinformatics

XSD/JAXB : no need to create a parser

$ xjc genes1.xsd
parsing a schema...
compiling a schema...
generated/Gene.java
generated/Genes.java
generated/ObjectFactory.java

Page 36: XML for bioinformatics

XSLT

Page 37: XML for bioinformatics

XSLT (text)

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    version='1.0'>
  <xsl:output method='text'/>

  <xsl:template match="/">
    <xsl:apply-templates select="genes"/>
  </xsl:template>

  <xsl:template match="genes">
    <xsl:apply-templates select="gene"/>
  </xsl:template>

  <xsl:template match="gene">
    <xsl:text>&gt;id:</xsl:text>
    <xsl:value-of select="@id"/>
    <xsl:text>|</xsl:text>
    <xsl:value-of select="name[1]"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="sequence"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

$ xsltproc genes2txt.xsl genes1.xml

>id:1|Gene1
ATAATGCTAGCTAGCTATCGAATG
>id:2|Gene2
AATTGCGATTCATCGATGCTATA
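The transformation does not need xsltproc: the JDK ships an XSLT 1.0 processor behind `javax.xml.transform`. A sketch under the assumption that both the stylesheet and the input are inlined as strings; the class `GenesToFasta` and the method `toFasta` are made-up names, and the line breaks are written as `&#10;`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of the original slides.
class GenesToFasta {
    static final String XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:apply-templates select='genes'/></xsl:template>"
        + "<xsl:template match='genes'><xsl:apply-templates select='gene'/></xsl:template>"
        + "<xsl:template match='gene'>"
        + "<xsl:text>&gt;id:</xsl:text><xsl:value-of select='@id'/>"
        + "<xsl:text>|</xsl:text><xsl:value-of select='name[1]'/>"
        + "<xsl:text>&#10;</xsl:text><xsl:value-of select='sequence'/>"
        + "<xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static final String SAMPLE =
        "<genes>"
        + "<gene id='1'><name>Gene1</name><name>gene-1</name>"
        + "<sequence>ATAATGCTAGCTAGCTATCGAATG</sequence></gene>"
        + "<gene id='2'><name>Gene2</name><name>gene-2</name>"
        + "<sequence>AATTGCGATTCATCGATGCTATA</sequence></gene>"
        + "</genes>";

    /** Applies the stylesheet to the document and returns the text output. */
    static String toFasta(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints two FASTA records, one per gene.
        System.out.print(toFasta(SAMPLE));
    }
}
```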

Page 38: XML for bioinformatics

XSLT (html)

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    version='1.0'>
  <xsl:output method='html'/>

  <xsl:template match="/">
    <html><body>
      <xsl:apply-templates select="genes"/>
    </body></html>
  </xsl:template>

  <xsl:template match="genes">
    <h1>
      <xsl:value-of select="count(gene)"/> genes
    </h1>
    <xsl:apply-templates select="gene"/>
  </xsl:template>

  <xsl:template match="gene">
    <h2>
      <xsl:text>&gt;id:</xsl:text>
      <xsl:value-of select="@id"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="name[1]"/>
    </h2>
    <pre>
      <xsl:value-of select="sequence"/>
    </pre>
  </xsl:template>

</xsl:stylesheet>

$ xsltproc \
    genes2html.xsl \
    genes1.xml

<html><body>
<h1>2 genes</h1>
<h2>&gt;id:1|Gene1</h2>
<pre>ATAATGCTAGCTAGCTATCGAATG</pre>
<h2>&gt;id:2|Gene2</h2>
<pre>AATTGCGATTCATCGATGCTATA</pre>
</body></html>

Page 39: XML for bioinformatics

XSLT Embedded

<?xml-stylesheet type="text/xsl" href="genes2html.xsl"?>

Page 40: XML for bioinformatics

XSLT (xml)

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns="http://www.w3.org/2000/svg"
    xmlns:math="http://exslt.org/math"
    version="1.0">
  <xsl:output method='xml'/>

  <xsl:template match="/">
    <svg width="500" height="500" version='1.0'>
      <xsl:apply-templates select="genes"/>
    </svg>
  </xsl:template>

  <xsl:template match="genes">
    <xsl:apply-templates select="gene[1]"/>
  </xsl:template>

  <xsl:template match="gene">
    <text x="250" y="250">
      <xsl:value-of select="name[1]"/>
    </text>
    <xsl:call-template name="drawseq">
      <xsl:with-param name="i" select="number(1.0)"/>
      <xsl:with-param name="s" select="sequence"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="drawseq">
    <xsl:param name="i"/>
    <xsl:param name="s" />
    <xsl:variable name="L" select="string-length($s)"/>
    <text>
      <xsl:variable name="angle" select="$i * ( (2.0*3.14159) div $L )"/>
      <xsl:attribute name="x"><xsl:value-of select="250+200*math:cos( $angle )"/></xsl:attribute>
      <xsl:attribute name="y"><xsl:value-of select="250+200*math:sin( $angle )"/></xsl:attribute>
      <xsl:value-of select="substring($s,$i,1)"/>
    </text>
    <xsl:if test="$i+1 &lt;= $L">
      <xsl:call-template name="drawseq">
        <xsl:with-param name="i" select="1 + $i"/>
        <xsl:with-param name="s" select="$s"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Page 41: XML for bioinformatics
Page 42: XML for bioinformatics

END

Page 43: XML for bioinformatics

Photos from wikipedia and W3C.