Ontology Driven Data Collection for EuPathDB Jie Zheng, Omar Harb, Chris Stoeckert Center for Bioinformatics, University of Pennsylvania


Ontology Driven Data Collection for EuPathDB

Jie Zheng, Omar Harb, Chris Stoeckert 

Center for Bioinformatics, University of Pennsylvania


Issues associated with Data Collection

• Heterogeneity of free text
• Difficulty in data integration, which requires human intervention
• Complex queries are limited



Examples: GenBank

GenBank: AY129684.1

LOCUS       AY129684      535 bp   DNA   linear   INV 21-APR-2003
DEFINITION  Cryptosporidium parvum isolate HM5-C sporozoite surface antigen p23 gene, complete cds.
ACCESSION   AY129684
VERSION     AY129684.1  GI:30038069
KEYWORDS    .
SOURCE      Cryptosporidium parvum
  ORGANISM  Cryptosporidium parvum
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida;
            Eimeriorina; Cryptosporidiidae; Cryptosporidium.
FEATURES             Location/Qualifiers
     source          1..535
                     /organism="Cryptosporidium parvum"
                     /mol_type="genomic DNA"
                     /isolate="HM5-C"
                     /isolation_source="isolated from Homo sapiens patient infected with HIV"
                     /db_xref="taxon:5807"
                     /country="USA"
                     /note="genotype: I"
     CDS             62..397
                     /codon_start=1
                     /product="sporozoite surface antigen p23"
                     /protein_id="AAN08813.1"
                     /db_xref="GI:30038070"
                     /translation="MGCSSSKPETKVAENKSAADANKQRELAEKKAQLAKAVKNPAPISNQAQQKPEEPKKSEPASNNPPAADAPAAQAPPAPAEPAPQDKPAE

GenBank: AB195219.1

LOCUS       AB195219      125 bp   DNA   linear   INV 10-JAN-2009
DEFINITION  Giardia intestinalis SSUrDNA gene for small subunit ribosomal RNA, partial sequence, isolate: GH-125.
ACCESSION   AB195219
VERSION     AB195219.1  GI:58036440
KEYWORDS    .
SOURCE      Giardia intestinalis (Giardia lamblia)
  ORGANISM  Giardia intestinalis
            Eukaryota; Diplomonadida; Hexamitidae; Giardiinae; Giardia.
FEATURES             Location/Qualifiers
     source          1..125
                     /organism="Giardia intestinalis"
                     /mol_type="genomic DNA"
                     /isolate="GH-125"
                     /isolation_source="human"
                     /host="Homo sapiens"
                     /db_xref="taxon:5741"
                     /country="Japan: Osaka"
     gene            <1..>125
                     /gene="SSUrDNA"
     rRNA            <1..>125
                     /gene="SSUrDNA"
                     /product="small subunit ribosomal RNA"

GenBank: AF262324.1

LOCUS       AF262324      826 bp   DNA   linear   INV 14-DEC-2000
DEFINITION  Cryptosporidium baileyi small subunit ribosomal RNA gene, partial sequence.
ACCESSION   AF262324
VERSION     AF262324.1  GI:11761734
KEYWORDS    .
SOURCE      Cryptosporidium baileyi
  ORGANISM  Cryptosporidium baileyi
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida;
            Eimeriorina; Cryptosporidiidae; Cryptosporidium.
FEATURES             Location/Qualifiers
     source          1..826
                     /organism="Cryptosporidium baileyi"
                     /mol_type="genomic DNA"
                     /isolate="GG"
                     /db_xref="taxon:27987"
                     /country="USA: New York"
                     /note="isolated from storm waters
                     genotype: W10"
     rRNA            <1..>826
                     /product="small subunit ribosomal RNA"



Data Collection for EuPathDB

• Apply ontology to data submission form design
  – Form to collect sequence data and information on isolates of pathogens
    • Geographic location where the isolate specimen was collected
    • Host organism information: species, age, clinical information
  – Genetic manipulation with resulting phenotype data collection form
    • Mutation method
    • Effects of genetic modification on the parasite and on the location, function, and involvement in biological process of the resultant modified protein

These data are important for parasite epidemiology and research on vaccines and anti-parasitic drugs.

• Enable queries
  – Compare sequence data from Plasmodium isolates that are restricted to East Africa to those from West Africa and are controlled for age and health of hosts
  – List genes that when knocked out result in a defect in parasite growth during the erythrocytic cycle
  – List genes fused to green fluorescent protein (GFP) that when expressed are located in the cell membrane
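
The first query above could look roughly like the sketch below once isolate metadata is stored as structured fields rather than free text. This is only an illustration: the `IsolateRecord` class, its field names, the region groupings, and the sample records are all invented here and are not taken from EuPathDB.

```python
from dataclasses import dataclass


@dataclass
class IsolateRecord:
    isolate_id: str
    species: str    # scientific name, e.g. an NCBI Taxon label
    country: str    # standardized country name
    host_age: str   # controlled-vocabulary age class


# Hypothetical region groupings; a real system might derive these from a gazetteer.
EAST_AFRICA = {"Kenya", "Tanzania", "Uganda"}
WEST_AFRICA = {"Ghana", "Mali", "Senegal"}


def select_isolates(records, genus, countries, host_age):
    """Return isolates of one genus from the given countries, controlled for host age."""
    return [
        r for r in records
        if r.species.split()[0] == genus
        and r.country in countries
        and r.host_age == host_age
    ]


# Invented records, for illustration only.
records = [
    IsolateRecord("iso-1", "Plasmodium falciparum", "Kenya", "Adult"),
    IsolateRecord("iso-2", "Plasmodium falciparum", "Ghana", "Adult"),
    IsolateRecord("iso-3", "Plasmodium falciparum", "Uganda", "Neonate"),
    IsolateRecord("iso-4", "Cryptosporidium parvum", "Kenya", "Adult"),
]

east = select_isolates(records, "Plasmodium", EAST_AFRICA, "Adult")
west = select_isolates(records, "Plasmodium", WEST_AFRICA, "Adult")
print([r.isolate_id for r in east], [r.isolate_id for r in west])
```

The filter is trivial precisely because every field holds a standardized value; with free-text descriptions the same comparison would need manual curation first.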


EuPathDB

EuPathDB (Eukaryotic Pathogen Database Resources) is a NIAID Bioinformatics Resource Center covering eukaryotic parasites.

EuPathDB: a portal to eukaryotic pathogen databases. Aurrecoechea C, et al. Nucleic Acids Res. 2010


Isolate Data

Need to import and integrate datasets from GenBank

But GenBank did not specify needed metadata for isolates

Manual curation required
• Harmonize to enable host queries: Human -> Homo sapiens
• Deconvolute descriptions in free text:
  – isolated from storm waters
  – isolated from Homo sapiens patient infected with HIV
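
As a rough illustration of the harmonization step, the sketch below maps free-text host values to a scientific name and looks for a known host inside a longer description. The synonym table and both function names are assumptions made for this example; the slides do not describe how the curation was actually carried out, and real curation would need a far larger vocabulary plus human review.

```python
import re

# Hypothetical synonym table; only the Human -> Homo sapiens mapping comes from the slide.
HOST_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def harmonize_host(value):
    """Map a free-text host value to a scientific name, or None if unknown."""
    return HOST_SYNONYMS.get(value.strip().lower())


def find_host_in_text(text):
    """Look for a known host synonym inside a free-text description."""
    lowered = text.lower()
    for synonym, scientific_name in HOST_SYNONYMS.items():
        if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
            return scientific_name
    return None


print(harmonize_host("human"))
print(find_host_in_text("isolated from Homo sapiens patient infected with HIV"))
print(find_host_in_text("isolated from storm waters"))
```

The last call returns `None`, which is the case that still needs a curator: the description names an environmental source, not a host.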


Isolate Data: GenBank -> EuPathDB

[Slide repeats the three GenBank records from the Examples: GenBank slide: AY129684.1, AB195219.1, AF262324.1]


Isolate Submission Form

• Target isolate information
• Geographic location
• Source organism samples information, or environmental samples information
• Sequence information

Form fields, in order:
  Isolate ID
  Date Collected
  Country
  State or province
  City
  GPS Coordinates
  Isolate Species
  Isolate Environmental Source
  Host
  Sequence 1 product Name
  Sequence 1
  Sequence 2 product Name
  Sequence 2
  Sequence 3 product Name
  Sequence 3
  Sequence 4 product Name
  Sequence 4

Field groups labeled on the form: Geographic Location, Source, Nucleotide Sequence


Ontology-based Representation of Isolate Data

The data collected in the submission form are shown in bold. Fields that require ontology terms are shown in thick-bordered boxes.


Isolate Submission Form: Before and After

Before:
  Isolate ID
  Date Collected
  Country
  State or province
  City
  GPS Coordinates
  Isolate Species
  Isolate Environmental Source
  Host
  Sequence 1 product Name
  Sequence 1
  Sequence 2 product Name
  Sequence 2
  Sequence 3 product Name
  Sequence 3
  Sequence 4 product Name
  Sequence 4
Field groups: Geographic Location, Source, Nucleotide Sequence

After:
  Isolate ID
  Date Collected
  Isolate
    Isolate Species
    Additional Classification -- genotype
    Additional Classification -- subtype
    Other organism isolated from same sample
  Geographic Location
    Country
    Region -- State or province
    County
    City/village/locality
    Latitude/longitude Coordinates
  Isolation Source
    Isolate Environmental Source
  Host Information
    Host Species -- scientific name
    Race/Breed
    Age
    Sex
    Host Material Isolated from
    Symptoms
    Non-human Habitat
  Additional Notes
  Nucleotide Sequence (repeated for Sequence 1 through Sequence 4)
    Sequence N product or locus Name
    Sequence N Primer Pairs
    Sequence N description
    Sequence N


Ontology Selection

Legend: Ontology Term, Controlled Vocabulary

Field: source of allowed values
  Date Collected: Day, Month, Year
  Isolate Species: NCBI Taxon
  Other organism isolated from same sample: NCBI Taxon
  Country: Gazetteer (GAZ)
  Region -- State or province: Gazetteer (GAZ)
  County: Gazetteer (GAZ)
  City/village/locality: Gazetteer (GAZ)
  Isolate Environmental Source: Environment Ontology (EnVO)
  Host Species -- scientific name: NCBI Taxon
  Race/Breed: American Indian, Asian, White, etc.
  Age: Neonate, Weanling, Adult, etc.
  Sex: PATO
  Host Material Isolated from: OBI
  Symptoms: Ontology for General Medical Science (OGMS)
  Non-human Habitat: wild, captive wild, domestic, etc.
  Sequence description: complete sequence, partial sequence, partial cds, unknown

Other fields on the form: Isolate ID, Additional Classification -- genotype, Additional Classification -- subtype, Latitude/longitude Coordinates, Additional Notes, Sequence product or locus Name, Sequence Primer Pairs, Sequence
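
A form backed by controlled vocabularies can reject values that are not on the list. The sketch below shows one way such a check might work, using only term lists that appear on this slide. Which form field each list belongs to is an assumption made here, the habitat list is marked "etc." on the slide and so is incomplete, and `invalid_fields` is a hypothetical helper, not part of any EuPathDB tool.

```python
# Term lists copied from the slide; their pairing with field names is an assumption.
CONTROLLED_VOCABULARY = {
    "Non-human Habitat": {"wild", "captive wild", "domestic"},  # slide says "etc."
    "Sequence description": {
        "complete sequence", "partial sequence", "partial cds", "unknown",
    },
}


def invalid_fields(record):
    """Return the names of fields whose value is not in the controlled vocabulary."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value is not None and value.strip().lower() not in allowed:
            problems.append(field)
    return problems


submission = {
    "Non-human Habitat": "Wild",
    "Sequence description": "partial genome",
}
print(invalid_fields(submission))
```

Fields backed by a full ontology (NCBI Taxon, GAZ, EnVO, PATO, OBI, OGMS) would need a term lookup service rather than a small in-memory set.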


Excel Format

• Generally already collected in this format according to our community advisors
  – Lowers the barrier for usage

• Easily converted to GenBank submission-ready format automatically

• Allows multiple sequence submission


Parser for GenBank Submission

Example spreadsheet with two sample isolates (values listed in column order):

Isolate ID: W17211, W26211
Date Collected
  Day: 2
  Month: 10, 6
  Year: 2008, 2010
Isolate
  Isolate Species: Cryptosporidium cuniculus, Cryptosporidium parvum
  Additional Classification -- genotype: unknown genotype sp. 42
  Additional Classification -- subtype (include subtype information if known): VaA18, VaA26
  Other organism isolated from same sample: Eimeria sp., Eimeria sp.
Geographic Location
  Country: United Kingdom, United States
  Region -- State or province: Northamptonshire, Northamptonshire
  County
  City/village/locality
  Latitude/longitude Coordinates
Isolation Source: Environment
  Isolate Environmental Source
Host Information
  Host Species -- scientific name: Oryctolagus cuniculus, Oryctolagus cuniculus
  Race/Breed
  Age: Juvenile, Juvenile
  Sex
  Host Material Isolated from
  Symptoms: no symptoms, no symptoms
  Non-human Habitat: Wild/Feral, Wild/Feral
Additional Notes: previously known as Cryptosporidium sp. rabbit genotype (both samples)
Nucleotide Sequence
  Sequence 1 product or locus Name: 60 kDa glycoprotein (GP60), 60 kDa glycoprotein (GP60)
  Sequence 1 Primer Pairs
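
The slides do not show the parser itself. As a hedged sketch of the idea, the code below turns one spreadsheet row into GenBank-style source qualifiers. The qualifier names (`/organism`, `/isolate`, `/host`, `/country`, `/collection_date`) are standard GenBank source qualifiers, but the row keys, the function names, and the choice of which form fields map to which qualifier are assumptions made for this example. The sample values are loosely based on the first sample shown on this slide.

```python
MONTH_ABBREVIATIONS = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]


def collection_date(year, month=None, day=None):
    """Format a date as YYYY, Mmm-YYYY, or DD-Mmm-YYYY depending on what is known."""
    if month is None:
        return str(year)
    mon = MONTH_ABBREVIATIONS[month - 1]
    if day is None:
        return f"{mon}-{year}"
    return f"{day:02d}-{mon}-{year}"


def source_qualifiers(row):
    """Build GenBank-style source qualifier lines from one form row (hypothetical keys)."""
    country = row["Country"]
    if row.get("Region"):
        country = f"{country}: {row['Region']}"
    pairs = [
        ("organism", row["Isolate Species"]),
        ("isolate", row["Isolate ID"]),
        ("host", row.get("Host Species")),
        ("country", country),
        ("collection_date",
         collection_date(row["Year"], row.get("Month"), row.get("Day"))),
    ]
    # Skip qualifiers whose value is missing or empty.
    return [f'/{name}="{value}"' for name, value in pairs if value]


row = {
    "Isolate ID": "W17211",
    "Isolate Species": "Cryptosporidium cuniculus",
    "Country": "United Kingdom",
    "Region": "Northamptonshire",
    "Host Species": "Oryctolagus cuniculus",
    "Year": 2008, "Month": 10, "Day": 2,
}
for line in source_qualifiers(row):
    print(line)
```

Because the form already stores country, region, host, and date in separate standardized fields, the conversion is a direct mapping with no text parsing.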


Genetic Manipulation and Phenotype Data

T. brucei RNAi knockdowns

• Integrate phenotype data from other resources (GeneDB)
• Allow individuals to submit phenotype data via the EuPathDB web site via User Comments on Gene pages
• Either way, these are free text descriptions, limiting utility for data exploration


Genetic Manipulation and Phenotype Submission Form

• Genetic Manipulation
  – Mutation method including selective marker, report if available
  – Mutation type (effect on gene function)

• Phenotype data – impact of genetic manipulation on four possible observed features:
  – Quality of the organism
  – Cellular location of gene product
  – Molecular function of gene product
  – Biological process of gene product
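
One way to picture a structured submission with these two parts is the record sketched below. The class name, field names, gene identifier, and example values are all hypothetical; in the actual form the feature values would be ontology terms rather than plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhenotypeSubmission:
    gene_id: str
    mutation_method: str
    mutation_type: str
    selective_marker: Optional[str] = None
    # The four possible observed features; any subset may be filled in.
    organism_quality: Optional[str] = None
    cellular_location: Optional[str] = None
    molecular_function: Optional[str] = None
    biological_process: Optional[str] = None

    def observed_features(self):
        """Return only the phenotype features that were actually reported."""
        features = {
            "organism_quality": self.organism_quality,
            "cellular_location": self.cellular_location,
            "molecular_function": self.molecular_function,
            "biological_process": self.biological_process,
        }
        return {name: value for name, value in features.items() if value is not None}


# Invented example: a knockout with a growth defect reported as an organism quality.
submission = PhenotypeSubmission(
    gene_id="GENE_0001",  # placeholder identifier, not a real gene
    mutation_method="gene knock out",
    mutation_type="loss of function",
    organism_quality="decreased growth rate",
)
print(submission.observed_features())
```

Keeping each observed feature in its own field is what lets a query such as "genes whose knockout causes a growth defect" be answered without reading free-text comments.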


Ontology-based Representation of Genetic Manipulation with Resulting Phenotype Data

The data collected in the submission form are shown in bold. Fields that require ontology terms are shown in thick-bordered boxes. The Ontology for Parasite Lifecycle (OPL) will be used to annotate life cycle stage.


Ontology-based Representation of Genetic Manipulation – Gene Knock Out


Genetic Manipulation Section

OBI


Phenotype Section

Panels shown: Cellular location, Biological process

Ontologies referenced: GO, OBI, OPL, PATO


Web-based Form

• Collect the data directly from specific components of the EuPathDB web site

• Change dynamically based on the user's input (lifecycle stage options depend on species; the selective marker, report, and similar sections are displayed only when needed)


Future Work

• Submission forms are at the prototype stage

• Distribute isolate submission forms to EuPathDB users

• Incorporate genetic manipulation and phenotype form into EuPathDB website

• Evaluation of submission forms based on the data collected

• Improve the submission forms based on feedback


Acknowledgements

• Stoeckert Lab
• Haiming Wang and EuPathDB Team
• EuPathDB Community
  Dr. G Robinson, Dr. R Chalmers, Dr. CJ Janse, Dr. G. Widmer, Dr. L. Xiao, Dr. SM Khan
• Funding
  – NIH grant 5R01GM93132-1
  – National Institute of Allergy and Infectious Diseases at the National Institutes of Health Award NO1-AI900038C Contract No. HHSN272200900038C


Thank You!