Ontology Driven Data Collection for EuPathDB
Jie Zheng, Omar Harb, Chris Stoeckert
Center for Bioinformatics, University of Pennsylvania
Issues associated with Data Collection
• Heterogeneity of free text
• Difficulty in data integration; requires human intervention
• Complex queries are limited
Examples: GenBank

GenBank: AY129684.1
LOCUS       AY129684    535 bp    DNA    linear    INV    21-APR-2003
DEFINITION  Cryptosporidium parvum isolate HM5-C sporozoite surface antigen p23 gene, complete cds.
ACCESSION   AY129684
VERSION     AY129684.1  GI:30038069
KEYWORDS    .
SOURCE      Cryptosporidium parvum
  ORGANISM  Cryptosporidium parvum
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida;
            Eimeriorina; Cryptosporidiidae; Cryptosporidium.
FEATURES             Location/Qualifiers
     source          1..535
                     /organism="Cryptosporidium parvum"
                     /mol_type="genomic DNA"
                     /isolate="HM5-C"
                     /isolation_source="isolated from Homo sapiens patient infected with HIV"
                     /db_xref="taxon:5807"
                     /country="USA"
                     /note="genotype: I"
     CDS             62..397
                     /codon_start=1
                     /product="sporozoite surface antigen p23"
                     /protein_id="AAN08813.1"
                     /db_xref="GI:30038070"
                     /translation="MGCSSSKPETKVAENKSAADANKQRELAEKKAQLAKAVKNPAPISNQAQQKPEEPKKSEPASNNPPAADAPAAQAPPAPAEPAPQDKPAE

GenBank: AB195219.1
LOCUS       AB195219    125 bp    DNA    linear    INV    10-JAN-2009
DEFINITION  Giardia intestinalis SSUrDNA gene for small subunit ribosomal RNA, partial sequence, isolate: GH-125.
ACCESSION   AB195219
VERSION     AB195219.1  GI:58036440
KEYWORDS    .
SOURCE      Giardia intestinalis (Giardia lamblia)
  ORGANISM  Giardia intestinalis
            Eukaryota; Diplomonadida; Hexamitidae; Giardiinae; Giardia.
FEATURES             Location/Qualifiers
     source          1..125
                     /organism="Giardia intestinalis"
                     /mol_type="genomic DNA"
                     /isolate="GH-125"
                     /isolation_source="human"
                     /host="Homo sapiens"
                     /db_xref="taxon:5741"
                     /country="Japan: Osaka"
     gene            <1..>125
                     /gene="SSUrDNA"
     rRNA            <1..>125
                     /gene="SSUrDNA"
                     /product="small subunit ribosomal RNA"

GenBank: AF262324.1
LOCUS       AF262324    826 bp    DNA    linear    INV    14-DEC-2000
DEFINITION  Cryptosporidium baileyi small subunit ribosomal RNA gene, partial sequence.
ACCESSION   AF262324
VERSION     AF262324.1  GI:11761734
KEYWORDS    .
SOURCE      Cryptosporidium baileyi
  ORGANISM  Cryptosporidium baileyi
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida;
            Eimeriorina; Cryptosporidiidae; Cryptosporidium.
FEATURES             Location/Qualifiers
     source          1..826
                     /organism="Cryptosporidium baileyi"
                     /mol_type="genomic DNA"
                     /isolate="GG"
                     /db_xref="taxon:27987"
                     /country="USA: New York"
                     /note="isolated from storm waters
                     genotype: W10"
     rRNA            <1..>826
                     /product="small subunit ribosomal RNA"
Data Collection for EuPathDB
• Apply ontology to data submission form design
– Form to collect sequence data and information on isolates of pathogens
  • Geographic location where the isolate specimen was collected
  • Host organism information: species, age, clinical information
– Form to collect genetic manipulation data with the resulting phenotype
  • Mutation method
  • Effects of genetic modification on the parasite and on the location, function, and involvement in biological process of the resultant modified protein
These data are important for parasite epidemiology and for research on vaccines and anti-parasitic drugs
• Enable queries
– Compare sequence data from Plasmodium isolates that are restricted to East Africa with those from West Africa, controlling for age and health of hosts
– List genes that, when knocked out, result in a defect in parasite growth during the erythrocytic cycle
– List genes fused to green fluorescent protein (GFP) whose expressed products are located in the cell membrane
EuPathDB
EuPathDB (Eukaryotic Pathogen Database Resources) is a NIAID Bioinformatics Resource Center covering eukaryotic parasites
EuPathDB: a portal to eukaryotic pathogen databases. Aurrecoechea C, et al. Nucleic Acids Res. 2010
Isolate Data
Need to import and integrate datasets from GenBank
But GenBank did not specify needed metadata for isolates
Manual curation required
Harmonize to enable host queries: Human -> Homo sapiens
Deconvolute descriptions in free text:
isolated from storm waters
isolated from Homo sapiens patient infected with HIV
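The harmonization step above can be illustrated with a minimal Python sketch. It is not the curation procedure used by EuPathDB, which the slides describe as manual; the synonym table and the `harmonize_host` function are hypothetical, and a real pipeline would resolve names against NCBI Taxon rather than a hard-coded dictionary. The qualifier values are copied from the three GenBank example records.

```python
import re

# Hypothetical synonym table (assumption); a real pipeline would query NCBI Taxon.
HOST_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}

def harmonize_host(qualifiers):
    """Return a harmonized host name from GenBank source qualifiers, or None."""
    # Check the explicit /host qualifier first, then fall back to free-text fields.
    for key in ("host", "isolation_source", "note"):
        text = qualifiers.get(key, "").lower()
        for synonym, scientific_name in HOST_SYNONYMS.items():
            if re.search(r"\b" + re.escape(synonym) + r"\b", text):
                return scientific_name
    return None

# Source qualifiers taken from the example records.
records = {
    "AY129684.1": {"isolation_source": "isolated from Homo sapiens patient infected with HIV"},
    "AB195219.1": {"isolation_source": "human", "host": "Homo sapiens"},
    "AF262324.1": {"note": "isolated from storm waters genotype: W10"},
}
hosts = {accession: harmonize_host(q) for accession, q in records.items()}
```

The storm-water record yields no host, which is the kind of case that still needs a curator to decide how the environmental source should be recorded.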
Isolate Data: GenBank -> EuPathDB
GenBank: AY129684.1, GenBank: AB195219.1, GenBank: AF262324.1
Isolate Submission Form
• Target isolate information
• Geographic location
• Source organism sample information, or environmental sample information
• Sequence information
Form fields: Isolate ID; Date Collected; Country; State or province; City; GPS Coordinates; Isolate Species; Isolate Environmental Source; Host; Sequence 1 product Name; Sequence 1; Sequence 2 product Name; Sequence 2; Sequence 3 product Name; Sequence 3; Sequence 4 product Name; Sequence 4
Form sections: Geographic Location; Source; Nucleotide Sequence
Ontology-based Representation of Isolate Data
Data collected in the submission form are shown in bold. Fields that require ontology terms are shown in thick-bordered boxes.
Isolate Submission Form: Before and After
Before
Fields: Isolate ID; Date Collected; Country; State or province; City; GPS Coordinates; Isolate Species; Isolate Environmental Source; Host; Sequence 1 product Name; Sequence 1; Sequence 2 product Name; Sequence 2; Sequence 3 product Name; Sequence 3; Sequence 4 product Name; Sequence 4
Sections: Geographic Location; Source; Nucleotide Sequence
After
Isolate ID
Date Collected: Day; Month; Year
Isolate: Isolate Species; Additional Classification -- genotype; Additional Classification -- subtype; Other organism isolated from same sample
Geographic Location: Country; Region -- State or province; County; City/village/locality; Latitude/longitude Coordinates
Isolation Source: Isolate Environmental Source
Host Information: Host Species -- scientific name; Race/Breed; Age; Sex; Host Material Isolated from; Symptoms; Non-human Habitat
Additional Notes
Nucleotide Sequence (for each of Sequence 1 to 4): product or locus Name; Primer Pairs; description; sequence
Ontology Selection
Legend: Ontology Term; Controlled Vocabulary
Ontologies: Gazetteer (GAZ); Environment Ontology (EnVO); NCBI Taxon; PATO; OBI; Ontology for General Medical Science (OGMS)
Controlled vocabularies:
American Indian, Asian, White, etc.
Neonate, Weanling, Adult, etc.
complete sequence, partial sequence, partial cds, unknown
wild, captive wild, domestic, etc.
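A controlled vocabulary makes it possible to check a submitted row mechanically. The following Python sketch shows one way this might look; it is not the authors' implementation. The term lists are copied from the slide, but pairing them with the field names `Age`, `Sequence description` and `Non-human Habitat` is an assumption, and the "etc." on the slide means the real lists are longer than shown here.

```python
# Term lists from the slide; the field-to-list pairing is an assumption,
# and the real vocabularies are longer ("etc." on the slide).
CONTROLLED_VOCABULARIES = {
    "Age": {"Neonate", "Weanling", "Adult"},
    "Sequence description": {"complete sequence", "partial sequence", "partial cds", "unknown"},
    "Non-human Habitat": {"wild", "captive wild", "domestic"},
}

def invalid_fields(row):
    """Return names of filled-in fields whose value is not in the vocabulary."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARIES.items():
        value = row.get(field)
        # Empty fields are skipped; only supplied values are validated.
        if value is not None and value not in allowed:
            problems.append(field)
    return problems

# Hypothetical submitted row with one value outside the vocabulary.
row = {"Age": "Adult", "Non-human Habitat": "feral", "Sequence description": "partial cds"}
problems = invalid_fields(row)
```

Fields backed by an ontology (for example GAZ or NCBI Taxon) would need a lookup against the ontology itself rather than a fixed set.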
Excel Format
• Generally already collected in this format according to our community advisors
– Lowers the barrier for usage
• Easily converted to GenBank submission-ready format automatically
• Allows multiple sequence submission
Parser for GenBank Submission
Example spreadsheet values for two isolates (W17211, W26211), in sheet order:
Day: 2; Month: 10, 6; Year: 2008, 2010
Cryptosporidium cuniculus, Cryptosporidium parvum
unknown genotype sp. 42
VaA18, VaA26
Eimeria sp., Eimeria sp.
United Kingdom, United States
Northamptonshire, Northamptonshire
Environment
Oryctolagus cuniculus, Oryctolagus cuniculus
Juvenile, Juvenile
no symptoms, no symptoms
Wild/Feral, Wild/Feral
previously known as Cryptosporidium sp. rabbit genotype (both isolates)
60 kDa glycoprotein (GP60) (both isolates)
Form sections and fields:
Isolate ID
Date Collected
Isolate: Isolate Species; Additional Classification -- genotype; Additional Classification -- subtype (Include subtype information if known); Other organism isolated from same sample
Geographic Location: Country; Region -- State or province; County; City/village/locality; Latitude/longitude Coordinates
Isolation Source: Isolate Environmental Source
Host Information: Host Species -- scientific name; Race/Breed; Age; Sex; Host Material Isolated from; Symptoms; Non-human Habitat
Additional Notes
Nucleotide Sequence: Sequence 1 product or locus Name; Sequence 1 Primer Pairs
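As a rough illustration of how a spreadsheet row might be turned into submission-ready text, the Python sketch below writes FASTA records whose definition lines carry bracketed source modifiers, a convention accepted by NCBI submission tools. The slides do not describe the actual parser, so the column-to-modifier mapping and the `row_to_fasta` function are assumptions, the row is modeled on the example sheet, and the sequence is a made-up placeholder.

```python
import csv
import io

# Hypothetical mapping from form columns to NCBI source modifier names.
COLUMN_TO_MODIFIER = {
    "Isolate Species": "organism",
    "Isolate ID": "isolate",
    "Country": "country",
    "Host Species": "host",
}

def row_to_fasta(row, seq_column="Sequence 1"):
    """Format one spreadsheet row as a FASTA record with bracketed modifiers."""
    modifiers = " ".join(
        f"[{modifier}={row[column]}]"
        for column, modifier in COLUMN_TO_MODIFIER.items()
        if row.get(column)  # skip columns the submitter left empty
    )
    return f">{row['Isolate ID']} {modifiers}\n{row[seq_column]}\n"

# Tab-separated export of the spreadsheet; the sequence is a placeholder.
sheet = (
    "Isolate ID\tIsolate Species\tCountry\tHost Species\tSequence 1\n"
    "W17211\tCryptosporidium cuniculus\tUnited Kingdom\tOryctolagus cuniculus\tACGTACGTAC\n"
)
rows = list(csv.DictReader(io.StringIO(sheet), delimiter="\t"))
fasta = "".join(row_to_fasta(row) for row in rows)
```

Because each row is handled independently, the same loop covers the multiple-sequence submission mentioned on the Excel Format slide.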
Genetic Manipulation and Phenotype Data
T. brucei RNAi knockdowns
• Integrate phenotype data from other resources (GeneDB)
• Allow individuals to submit phenotype data through User Comments on Gene pages of the EuPathDB web site
• Either way, these are free-text descriptions, which limits their utility for data exploration
Genetic Manipulation and Phenotype Submission Form
• Genetic Manipulation
– Mutation method including selective marker, report if available
– Mutation type (effect on gene function)
• Phenotype data: impact of genetic manipulation on four possible observed features:
– Quality of the organism
– Cellular location of gene product
– Molecular function of gene product
– Biological process of gene product
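Once the form fields are captured as structured values instead of free text, queries such as "genes that when knocked out result in a growth defect" reduce to simple filters. The Python sketch below is only an illustration under stated assumptions: the `PhenotypeRecord` class, the gene identifiers, and the literal values such as "decreased growth" are hypothetical, and real submissions would carry ontology term identifiers rather than plain strings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhenotypeRecord:
    """One submission: a genetic manipulation plus up to four observed features."""
    gene_id: str                              # hypothetical identifier
    mutation_method: str
    mutation_type: str
    lifecycle_stage: Optional[str] = None     # assumed field for the life cycle stage
    organism_quality: Optional[str] = None
    cellular_location: Optional[str] = None
    molecular_function: Optional[str] = None
    biological_process: Optional[str] = None

def genes_with_growth_defect(records, stage):
    """List genes whose knock out decreased growth at the given stage."""
    return sorted({
        r.gene_id for r in records
        if r.mutation_method == "gene knock out"
        and r.lifecycle_stage == stage
        and r.organism_quality == "decreased growth"
    })

# Made-up records for illustration only.
records = [
    PhenotypeRecord("GENE_A", "gene knock out", "loss of function",
                    lifecycle_stage="erythrocytic", organism_quality="decreased growth"),
    PhenotypeRecord("GENE_B", "gene knock out", "loss of function",
                    lifecycle_stage="erythrocytic", organism_quality="normal growth"),
    PhenotypeRecord("GENE_C", "GFP fusion", "tagged",
                    cellular_location="cell membrane"),
]
defective = genes_with_growth_defect(records, "erythrocytic")
```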
Ontology-based Representation of Genetic Manipulation with Resulting Phenotype Data
Data collected in the submission form are shown in bold. Fields that require ontology terms are shown in thick-bordered boxes. The Ontology for Parasite Lifecycle (OPL) will be used to annotate life cycle stage.
Ontology-based Representation of Genetic Manipulation – Gene Knock Out
Figure sections: Genetic Manipulation Section; Phenotype Section (including Cellular location and Biological process)
Ontologies labeled in the figure: OBI; GO; OPL; PATO
Web-based Form
• Collects data directly from specific components of the EuPathDB web site
• Changes dynamically based on the user's input (e.g., lifecycle stage choices depend on species; the selective marker, report, etc. sections are displayed when needed)
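The dynamic behavior described above can be sketched as a function that decides what the form shows for a given input. This Python example is a simplified assumption, not the EuPathDB web code: the stage lists are illustrative only (the slides say OPL would supply the real terms), and the `form_sections` function and section names are hypothetical.

```python
# Illustrative stage lists only (assumption); OPL would supply the real terms.
LIFECYCLE_STAGES = {
    "Plasmodium falciparum": ["sporozoite", "merozoite", "trophozoite", "schizont", "gametocyte"],
    "Trypanosoma brucei": ["bloodstream form", "procyclic form"],
}

def form_sections(species, uses_selective_marker=False):
    """Return the sections and lifecycle stage choices to display."""
    sections = ["Genetic Manipulation", "Phenotype"]
    if uses_selective_marker:
        # The marker section appears only when the submitter needs it.
        sections.insert(1, "Selective Marker")
    return {
        "sections": sections,
        "lifecycle_stage_options": LIFECYCLE_STAGES.get(species, []),
    }

layout = form_sections("Trypanosoma brucei", uses_selective_marker=True)
```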
Future Work
• Submission forms are at the prototype stage
• Distribute isolate submission forms to EuPathDB users
• Incorporate genetic manipulation and phenotype form into EuPathDB website
• Evaluation of submission forms based on the data collected
• Improve the submission forms based on feedback
Acknowledgements
• Stoeckert Lab
• Haiming Wang and EuPathDB Team
• EuPathDB Community
Dr. G Robinson, Dr. R Chalmers, Dr. CJ Janse, Dr. G. Widmer, Dr. L. Xiao, Dr. SM Khan
• Funding
– NIH grant 5R01GM93132-1
– National Institute of Allergy and Infectious Diseases at the National Institutes of Health Award NO1-AI900038C Contract No. HHSN272200900038C
Thank You!