Databases in bioinformatics II
Marcela Davila-Lopez
Department of Medical Biochemistry and Cell Biology
Institute of Biomedicine
BIOINFORMATICS AND SYSTEMS BIOLOGY, MSC PROGR Sequence analysis, UMF018, 2008
Databases in bioinformatics 2
Overview
– UniProt/Swiss-Prot
– Divisions at NCBI (nucleotide database)
– Sequencing methods
– EST
– RefSeq vs GenBank
– Trace Archive
– Refining searches at Entrez
– eUtils (programming utilities)
UniProt/SwissProt
1980s: protein sequence database with high-quality, detailed curation (EBI + SIB)
TrEMBL (Translation of EMBL nucleotide sequences): quick release of data not yet annotated; only computationally annotated entries
2002: EBI + SIB + PIR form the UniProt Consortium
http://www.expasy.ch/sprot/sprot_details.html
UniProtKB
Central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation:
– Ontologies
– Classifications
– Cross-references
– Indications of the quality of annotation (experimental vs computational)
• Manually annotated records, based on the literature and curator-evaluated analysis: "UniProtKB/Swiss-Prot"
• Computationally analyzed records that await full manual annotation: "UniProtKB/TrEMBL"
UniProt - UniRef
http://www.uniprot.org/
Clustered sets of sequences (UniProt Knowledgebase + UniParc)
– Complete coverage of sequence space at several resolutions
– Hides redundant sequences (but not their descriptions)
UniRef100: merges identical sequences and sub-fragments (11 or more residues) into one entry, which holds
– the sequence of a representative protein
– the accession numbers of all the merged entries
– links to the corresponding records
UniRef90 and UniRef50: built by clustering UniRef100 at 90% or 50% sequence identity
The smaller sets make sequence searches faster.
UniProt - record
[Figures: an example UniProt record, shown over four slides]
Functional divisions in Nucleotide DB at NCBI
Nucleotide sequence records are organized into discrete functional types, which makes it possible to
– query specific subsets produced by a particular technique
– interpret the data from a proper biological point of view

Division  Content
EST       300-500 bp single reads from mRNA (cDNA)
STS       200-500 bp sequence-tagged sites
GSS       Similar to EST but of genomic origin
HTG       Unfinished DNA sequences generated by HTS
HTC       Unfinished sequences from HT cDNA projects
PAT       Patent sequences
CON       Constructed records of chromosomes, genomes and other long DNA sequences
Genome sequencing
Encompasses biochemical methods for determining the order of the nucleotide bases (AGCT) in a DNA oligonucleotide (~20, today 200)
Why Sequencing Genomes
Organisms are remarkably similar at the molecular level despite their obvious outward differences
Genes with similar DNA sequences tend to perform similar functions
By understanding the function of a gene in one organism, we may get an idea of what function that gene performs in a more complex organism (humans)
Applied in various fields: medicine, biological engineering, forensics
Sequencing methods
1954 Whitfeld PR: sequencing by degradation
Sequencing by synthesis
1975 F. Sanger, A.R. Coulson (plus-minus method)
1977 Walter Gilbert, A. Maxam (chemical modification); F. Sanger (chain termination)
1979 Shotgun sequencing
1984 Ligation based (Applied Biosystems)
1988 Pyrosequencing (Roche, Biotage)
1994 Reversible dye terminators (Illumina, Helicos)
Non-enzymatic
1989 Sequencing by hybridization (Affymetrix)
DNA cannot be synthesized from scratch.
Archon X Prize: $10 million for sequencing 100 human genomes in 10 days at $10,000 per genome
Maxam-Gilbert sequencing
- Chemical modification of DNA (radiolabelling)
- Cleavage at specific bases (G, G+A, C, C+T)
- Size separation (gel electrophoresis)
- Autoradiography (X-ray film)
PROS: Purified DNA can be used directly
CONS: Technically complex; uses hazardous chemicals; difficult to scale up
Reading the gel, lane by lane:
Strong band in the 1st lane with a weaker band in the 2nd: A
Strong band in the 2nd lane with a weaker band in the 1st: G
Band in the 3rd and 4th lanes: C
Band only in the 4th lane: T
Maxam AM, Gilbert W., A new method for sequencing DNA, Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4
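The band-reading rule on this slide can be written as a small decision function. The following is only a sketch: the function names, the use of numeric intensities (0 meaning "no band"), and the example gel are assumptions made for illustration, not part of the original method description.

```python
def call_base(lane1, lane2, lane3, lane4):
    """Call one base from the band intensities at a single gel position.

    Lane numbering follows the reading rule on the slide. Intensities are
    arbitrary non-negative numbers; 0 means no band (hypothetical encoding).
    """
    if lane1 > lane2 > 0:
        return "A"   # strong band in lane 1, weaker band in lane 2
    if lane2 > lane1 > 0:
        return "G"   # strong band in lane 2, weaker band in lane 1
    if lane3 > 0 and lane4 > 0:
        return "C"   # band in lanes 3 and 4
    if lane4 > 0:
        return "T"   # band only in lane 4
    return "N"       # no recognizable pattern


def read_gel(rows):
    """Read a gel given as (lane1, lane2, lane3, lane4) tuples, one per position."""
    return "".join(call_base(*row) for row in rows)


# Made-up intensities for four consecutive positions
gel = [(1.0, 0.3, 0, 0), (0.2, 0.9, 0, 0), (0, 0, 0.8, 0.7), (0, 0, 0, 0.9)]
print(read_gel(gel))  # AGCT
```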
Sanger method
dNTP (deoxynucleotide), ddNTP (dideoxynucleotide)
Radioactively or fluorescently labelled nucleotides
[Figures: the Sanger method, shown step by step over five slides]
Sanger method: variations
Dye-labeled primer
PROS: Upon completion, these four reactions can be combined into one lane on a gel, and run on a machine that can scan the lanes with a laser
http://www.escience.ws/b572/L8/L8.htm
Sanger method: variations
Dye-terminator sequencing
Single reaction (a different dye for each nucleotide)
PROS: Uses an optical system; faster, more economical, and suited to automation
Large scale sequencing strategies
Sanger: not practical for sequencing a complete genome
– Only about 1000 bases can be sequenced accurately
– A primer of known sequence is required
A privately funded sequencing project: Celera Genomics
– No libraries of BAC clones
– Human genome → fragments of 2-10 kb → sequence them
– Assembly?
The publicly funded Human Genome Project: NIH/DOE
– 'Libraries' of BAC clones → sequence them
Hierarchical shotgun sequencing
[Figure: hierarchical shotgun sequencing; labels: 150 Mb, contig]
PROS: Individual clones can be sequenced by different people; each stretch of DNA only needs to be sequenced once
CONS: Slow process of sub-cloning and mapping the clones; requires significant human manipulation
http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/
Shotgun sequencing
Suited to prokaryotic genomes (smaller in size, less repetitive DNA)
PROS: Faster and less expensive
CONS: Prone to errors due to incorrect assembly of the finished sequence; much more sequencing is needed to bring the probability of missing a sub-clone below 1%
http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/
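To get a feeling for how much "much more sequencing" is, here is a rough sketch. It assumes a simple Poisson model (in the style of Lander and Waterman) in which reads land at random, so that the chance of a given position being missed at coverage c is e^(-c). The model and the function names are assumptions for illustration; the slides do not state which model is behind the 1% figure.

```python
import math


def fraction_missing(coverage):
    """Expected fraction of the target left unsequenced at a given coverage,
    assuming reads are placed uniformly at random (Poisson approximation)."""
    return math.exp(-coverage)


def coverage_needed(p_missing):
    """Coverage required so that the chance of missing a position is p_missing."""
    return math.log(1.0 / p_missing)


c = coverage_needed(0.01)
print(round(c, 2))                    # 4.61
print(round(fraction_missing(1), 3))  # 0.368
```

Under this assumption, roughly 4.6-fold coverage is needed to reach p < 1%, whereas one-fold coverage would still leave more than a third of the target unread.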
Next generation platforms
Platform            Chemistry                     Read length
Affymetrix          Sequencing by hybridization   ~200 bp
Roche (454)         Pyrosequencing                230-400 bp
Illumina (Solexa)   Sequencing by synthesis       40 bp
ABI SOLiD           Ligation-based sequencing     35 bp
Sequencing by synthesis
– Start from single-stranded DNA
– Enzymatically synthesize its complementary strand
– Detect the fluorescence of one nucleotide at a time
– Remove the blocking group
– Polymerize the next nucleotide
http://www.illumina.com/media.ilmn?Title=Sequencing-By-Synthesis%20Demo&Cap=&PageName=solexa%20technology&PageURL=203&Media=1
[Figures: sequencing by synthesis, shown step by step over three slides]
Pyrosequencing
Detects the activity of DNA polymerase with a chemiluminescent enzyme while the complementary strand is synthesized.
PROS: 96 samples in 1 hr (vs 24 hr)
CONS: Reads of only 300-500 nucleotides
Used for resequencing or sequencing of genomes for which the sequence of a close relative is already available
Fungal, bacterial and viral identification
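How a pyrogram arises can be sketched in a few lines. The sequence below is the one used on the pyrosequencing figure slides (C G T C C G G A); the dispensation order and the function name are assumptions chosen for illustration. The sketch idealizes the chemistry: the peak height is simply the number of bases incorporated when a nucleotide is dispensed.

```python
def pyrogram(synthesized, dispensation):
    """Return (nucleotide, peak height) for each dispensed nucleotide.

    `synthesized` is the strand being built. The peak height is the number
    of consecutive bases incorporated in that step, so a homopolymer run of
    length 2 gives a peak of height 2 and a non-matching nucleotide gives 0.
    """
    pos = 0
    peaks = []
    for nt in dispensation:
        run = 0
        while pos < len(synthesized) and synthesized[pos] == nt:
            run += 1
            pos += 1
        peaks.append((nt, run))
    return peaks


# Hypothetical dispensation order; A is dispensed once without being incorporated
print(pyrogram("CGTCCGGA", "CGATCGA"))
# [('C', 1), ('G', 1), ('A', 0), ('T', 1), ('C', 2), ('G', 2), ('A', 1)]
```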
[Figures, seven slides: stepwise nucleotide additions for the sequence C G T C C G G A. Each incorporation releases PPi, which sulfurylase converts to ATP; luciferase uses the ATP to turn luciferin into oxyluciferin, and the emitted light is recorded by a charge-coupled device (CCD) as a peak in the pyrogram. Apyrase removes the nucleotides that were not incorporated. The labels show (1) PPi / (1) ATP for a single incorporation and (2) PPi / (2) ATP for a double incorporation.]
http://www.biotagebio.com/DynPage.aspx?id=7454
Sequencing by ligation
The method:
It is based on sequential ligation of dye-labeled oligonucleotide probes, whereby each probe queries two base positions at a time
DNA ligase is used rather than polymerase
The system uses 4 fluorescent dyes to encode the 16 possible two-base combinations
Multiple cycles of probe hybridization, ligation, imaging and analysis are performed
The resulting product is then removed
The process is repeated in further extension rounds with primers hybridized to positions n-1, n-2, etc. in the adaptor (5 primer rounds in total).
http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/index.htm
Sequencing by ligation: 2-base color encoding
1 dye = 4 possible dinucleotides
2 bases are interrogated in each ligation reaction providing increased specificity
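A minimal sketch of 2-base color encoding and decoding follows. It assumes the commonly described scheme in which the bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of its two base codes; the color numbers (rather than dye names) and the function names are assumptions for illustration.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a color is the XOR of two base codes


def encode(seq):
    """Translate a base-space sequence into colors, one per adjacent base pair."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode(first_base, colors):
    """Rebuild the base-space sequence from the known first base and the colors."""
    code = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        code ^= color            # the color tells us how to step to the next base
        out.append(BASES[code])
    return "".join(out)


colors = encode("ATGGCA")
print(colors)               # [3, 1, 0, 3, 1]
print(decode("A", colors))  # ATGGCA

# Each color stands for 4 of the 16 possible dinucleotides
for color in range(4):
    print(color, [a + b for a in BASES for b in BASES
                  if BASES.index(a) ^ BASES.index(b) == color])
```

Because decoding chains from one base to the next, a single wrong color changes every base after it, whereas a true single-base change alters two adjacent colors. This is what lets color-space data separate measurement errors from SNPs.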
Sequencing by ligation: primer rounds
[Figures: primer round 1 and primer round 2]
Total of 5 primer rounds
Each base is interrogated twice, in different reactions, which improves the signal-to-noise ratio
Sequencing by ligation: decoding
[Figure: decoding from color space to base space; labels: possible dinucleotides, base zero, decoded sequence, base-space sequence]
Sequencing by ligation: re-sequencing
[Figure: alignment in color space (CS) and base space (BS); labels: reference sequence, CS reference, CS reads, CS consensus, BS consensus, polymorphism, error]
Higher accuracy thanks to the built-in error-checking capability: discrimination between measurement errors and SNPs
Sequencing by hybridization
Microarray (DNA chip)
[Figure: hybridization of a sample to the probes on the chip]
[Figure: the four steps, illustrated with the sample ACGCATC and a chip of 3-base probes (ACG, TAC, GGG, CAT, ...)]
1. DNA sample: many copies of ACGCATC
2. Hybridization
3. Spectrum: ACG, CGC, GCA, CAT, ATC
4. Reconstruct the sequence: A C G C A T C
Drmanac R et al. Adv Biochem Eng Biotechnol. 2002
Sequencing by hybridization
Sequence 1: A C C G C C T C C A
Spectrum: ACC, CCG, CGC, GCC, CCT, CTC, TCC, CCA
Sequence 2: A C C T C C G C C A
Spectrum: ACC, CCT, CTC, TCC, CCG, CGC, GCC, CCA
Problem: different sequences can have the same spectrum
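The ambiguity can be checked directly by computing the spectrum (the collection of all overlapping k-mers) of both sequences from this slide. This is only a sketch; the function name and the use of a Counter to hold the spectrum are choices made for illustration.

```python
from collections import Counter


def spectrum(seq, k=3):
    """All overlapping k-mers of a sequence, with their multiplicities."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


first = spectrum("ACCGCCTCCA")
second = spectrum("ACCTCCGCCA")

print(sorted(first))    # the eight 3-mers of the first sequence
print(first == second)  # True: the chip cannot tell the two sequences apart
```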
Sequencing by hybridization
Oligomers on the chip = 4^(number of bases); 12 bases = 4^12 = 16,777,216 oligomers! (6.5 million)
Probe: 5-25 bases
Probe overlap: each base is read by multiple probes (SNP detection)
Hybridization conditions are not homogeneous: the melting temperature depends strongly on the GC/AT ratio
Repeats
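Two of the points above lend themselves to a quick calculation. The sketch below counts the probes needed for a complete chip and estimates melting temperatures with the Wallace rule (2 °C per A/T, 4 °C per G/C), a rough approximation meant for short oligonucleotides. The Wallace rule is an assumption brought in for illustration; the slides do not say which model is meant.

```python
def probe_count(n_bases):
    """Number of distinct probes of a given length (4 choices per position)."""
    return 4 ** n_bases


def wallace_tm(probe):
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


print(probe_count(12))             # 16777216
print(wallace_tm("ATATATATATAT"))  # 24
print(wallace_tm("GCGCGCGCGCGC"))  # 48
```

Two probes of the same length can therefore differ widely in melting temperature, which is why a single hybridization condition does not suit every probe on the chip.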
Sequencing and gene expression
Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression.
Expression under normal circumstances
Expression in an altered state (?)
Identify and study the protein(s) coded by a gene
Identify the gene (genome bioinformatics)
EST
Expressed Sequence Tags: pieces of DNA sequence from an expressed gene
200 to 500 nt long
Obtained from cells, tissues or organs under certain conditions
5' EST: protein-coding region, conserved across species
3' EST: non-coding region (UTR)
Generated rapidly and inexpensively
Used in gene identification, e.g. for hereditary diseases
Redundancy at GenBank
Many sequences are represented more than once in GenBank, which leads to a huge degree of redundancy
2003 RefSeq collection: a curated, non-redundant secondary database for selected organisms
• Genomic DNA (assemblies)
• Transcripts (RNA)
• Protein
RefSeq vs GenBank
GenBank | RefSeq
Not curated | Curated
Author submits | NCBI creates from existing data
Only the author can revise | NCBI revises as new data emerge
Multiple records for the same locus are common | Single record for each molecule of major organisms
Records can contradict each other | (no counterpart)
No limit to the species included | Limited to model organisms
Data exchange among INSDC members | Exclusive NCBI database
Akin to primary literature | Akin to review articles
Proteins identified and linked | Proteins and transcripts identified and linked
Access via NCBI Nucleotide db | Access via Nucleotide and Protein db
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook
Trace Archive
2001: NCBI and EMBL/Ensembl
Purpose: collect raw data from sequencing centers worldwide; a PERMANENT repository of single-pass reads
Data: 22 trillion bytes in size (a stack of CDs 10 stories high), and it keeps growing
Traces: pieces of a puzzle, each between 300 and 1,000 DNA letters
Vital in the hunt for polymorphisms in gene sequences
– linked to disease (human DNA)
– linked to virulence (viral DNA)
dbSNP: detailed information on > 25 million SNPs
Insights into the impact of genetic variation on health
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?
Entrez
Refining search results
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.section.EntrezHelp.Searching_Entrez_usi
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Searching_PubMed
Limits
Refine search results to retrieve only the most relevant documents
Allow restriction of a search to a defined subset of the database
[Figure: refining search results]
Index
Alphabetical lists of terms from searchable database fields
Used to browse and/or select the terms by which records and/or data are described
[Figure: refining search results]
Search Field Descriptions and Qualifiers
Index search field    Qualifier
Accession             [ACCN] or [ACCESSION]
All Fields            [ALL] or [ALL FIELDS]
Author                [AUTH] or [AUTHOR]
EC/RN Number          [ECNO]
Feature Key           [FKEY]
Filter                [FILT] or [SB]
Gene Name             [GENE]
Issue                 [ISS] or [ISSUE]
Keyword               [KYWD] or [KEYWORD]
Journal Name          [JOUR] or [JOURNAL]
Modification Date     [MDAT]
Organism              [ORGN] or [ORGANISM]
Page Number           [PAGE]
Primary Accession     [PACC]
Title                 [TITL]
Title/Abstract        [TIAB]
Volume                [VOL]
Entrez date           [EDAT]
Journal title         [TA]
Language              [LA]
MeSH term             [MH]
Properties            [PROP]
Protein Name          [PROT]
Publication Date      [PDAT]
SeqID String          [SQID]
Sequence Length       [SLEN]
Substance Name        [SUBS]
Text Word             [WORD]
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.table.EntrezHelp.T7
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip
Advanced search statements
term [field] OPERATOR term [field]
Find all human nucleotide sequences with D-loop annotations:
D-loop[FKEY] AND human[ORGN] in the Nucleotide database
Find Drosophila population studies published in the Journal of Molecular Evolution
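The pattern term [field] OPERATOR term [field] is easy to assemble in a script. The helper names below are hypothetical and the sketch only builds the query string for the D-loop example from this slide; it does not contact Entrez.

```python
def fielded(term, qualifier):
    """Attach a search-field qualifier to a term, e.g. human[ORGN]."""
    return f"{term}[{qualifier}]"


def combine(operator, *terms):
    """Join fielded terms with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(terms)


query = combine("AND", fielded("D-loop", "FKEY"), fielded("human", "ORGN"))
print(query)  # D-loop[FKEY] AND human[ORGN]
```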
History
Provides a record of the searches performed during a search session.
Database specific
Lost after eight hours of inactivity
Used to review, revise, or combine the results of earlier searches.
[Figure: combining results]
[Figure: query translation]
Details
Displays your search strategy as translated using Entrez's search and syntax rules
Shows error messages, when applicable
[Figures: author search; example searches by author and by journal]
eUtils: Entrez Programming Utilities
• Tools that provide access to Entrez data outside of the regular web query interface
• Set of 8 server-side programs: ESearch, ESummary, EGQuery, EInfo, EFetch, ELink, EPost, ESpell
• Helpful for retrieving search results to be manipulated in another environment
• Can be called from Perl, Python, Java, and C++
• Currently includes 35 databases
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
Uses
• Perform searches on large datasets
• Implement data pipelines for genomic, proteomic, or microarray analysis
• Create automated searches to keep local databases current
• Create and download customized datasets
• Seamlessly combine local data with NCBI data
• Develop a focused interface to NCBI data
URL → Result (XML)
Common Entrez Engine
Assemble a list of UIDs
– ESearch (for a given db)
– EGQuery (global version, all dbs)
Retrieve a brief summary record (DocSum)
– ESummary (for a list of UIDs)
URL
http://www.ncbi.nlm.nih.gov/sites/gquery?term=cancer+stem+cells
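http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=9913&retmode=xml
The URL is assembled from four parts: [Base_URL][Eutils_URL][DB][Query]
  [Base_URL]   = http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
  [Eutils_URL] = esummary.fcgi?
  [DB]         = db=taxonomy
  [Query]      = &id=9913&retmode=xml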
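Assembling such a URL from its parts can be sketched with the Python standard library. The function name is hypothetical, and the https scheme is an assumption about the current service (the slides show http); the sketch only builds the string and does not send a request.

```python
from urllib.parse import urlencode

# Base URL as on the slide, but with https (assumed for the current service)
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(utility, **params):
    """Build an eUtils URL: base URL + utility name + URL-encoded parameters."""
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"


print(eutils_url("esummary", db="taxonomy", id=9913, retmode="xml"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=9913&retmode=xml

# urlencode also takes care of spaces in search terms
print(eutils_url("egquery", term="brca1 OR brca2"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2
```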
URL: DB
Each Entrez database has an E-Utility name (used instead of its original name)
Entrez Database   E-Utility Database Name
3D Domains        domains
Domains           cdd
Genome            genome
Nucleotide        nucleotide
OMIM              omim
PopSet            popset
Protein           protein
ProbeSet          geo
PubMed            pubmed
Structure         structure
SNP               snp
Taxonomy          taxonomy
UniGene           unigene
UniSTS            unists
URL: Query
[Table: URL parameters accepted by each utility]
Utilities (columns): EGQuery, ESpell, EInfo, ESearch, ESummary, EFetch (Tax, Seq, Lit), ELink, EPost
Parameters (rows): db, history, WebEnv, query_key, term, id, dbfrom, report, strand, seq_start, seq_stop, field, reldate, mindate, maxdate, datetype, retstart, retmax, retmode, rettype, cmd
ESpell
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?
Retrieves spelling suggestions, when available (PubMed only)
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?db=pubmed&term=brest+cancer
EInfo
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?
Provides detailed information about a given database: term counts, last update and available links
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed
EGQuery
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi
Provides Entrez database counts in XML for a single search using GQuery
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2&rettype=html
ESummary
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?
Retrieves DocSums from a list of primary IDs
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=11850928,11482001&retmode=xml
retmode values: xml, ref, html, text, asn.1
UIDs: Unique ID
Entrez Database   Primary ID    E-Utility Database Name
3D Domains        3D SDI        domains
Domains           PSSM-ID       cdd
Genome            Genome ID     genome
Nucleotide        GI number     nucleotide
OMIM              MIM number    omim
PopSet            Popset ID     popset
Protein           GI number     protein
ProbeSet          GEO ID        geo
PubMed            PMID          pubmed
Structure         MMDB ID       structure
SNP               SNP ID        snp
Taxonomy          TAXID         taxonomy
UniGene           UniGene ID    unigene
UniSTS            UniSTS ID     unists
• Always integers
• Each refers to a unique record in a given Entrez database
• Each Entrez database has an E-Utility name (used instead of its original name)
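Since UIDs are always integers and the example URLs pass several of them separated by commas, a script can check a list before putting it into the id parameter. The helper below is a hypothetical sketch of such a check, not part of the eUtils themselves.

```python
def uid_list(uids):
    """Validate UIDs (integers only) and join them for the `id` URL parameter."""
    cleaned = []
    for uid in uids:
        text = str(uid).strip()
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"not an integer UID: {uid!r}")
        cleaned.append(text)
    return ",".join(cleaned)


print(uid_list([11850928, "11482001"]))  # 11850928,11482001
```

An accession number or another non-numeric identifier raises a ValueError instead of ending up in the URL.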
ELink
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?
Checks for the existence of an external or Related Articles link from a list of UIDs
Retrieves IDs related to a list of UIDs (in the same database or in another database)
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=protein&id=7140346
Creates a hyperlink to the primary LinkOut provider for a specific ID
Lists LinkOut URLs and attributes for multiple IDs
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10611131&retmode=ref&cmd=prlinks
EPost
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?
Returns a label (query_key) and an encoded server address (WebEnv) that correspond to a UID list, for use in subsequent search strategies
Optimal for large datasets (see Example)
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=pubmed&id=11237011
ESearch
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
Returns a list of UIDs matching a text search in a given Entrez database
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=100
datetype values: edat, mdat, dp
EFetch
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
Generates formatted output for a list of input IDs, e.g. abstracts from PubMed or FASTA format from Protein
Databases:
– Literature databases: PubMed, Journals, PubMed Central, OMIM
– Sequence and other molecular biology databases: Nucleotide, Protein, Gene, etc.
– Taxonomy
Rettype
Rettype    Scope             Description
count      PubMed            Hit counts
sort       PubMed and Gene
abstract   PubMed
citation   PubMed
medline    PubMed
full       PubMed
uilist     all               Default format for viewing hits
native     all               Default format for viewing sequences
fasta      sequence          FASTA view of a sequence
gb         nucleotide        GenBank view for sequences
est        dbEST             EST report
gp         protein           GenPept view
seqid      sequence          Converts a list of GIs into a list of seqids
acc        sequence          Converts a list of GIs into a list of accessions
chr        dbSNP only        SNP chromosome report
EFetch - Literature
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12345,9997&retmode=html&rettype=abstract
EFetch - Sequences
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=fasta
strand values: 1 (+), 2 (-)
EFetch - Taxonomy
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=44689&report=docsum
report values: uilist, brief, docsum, xml
Exercise
– Search in Journals for the term obstetrics
– In PubMed, display PMIDs 12091962 and 9997 in html retrieval mode and abstract retrieval type
– From Entrez Gene, display GeneID 2 as xml
– Retrieve PubMed related articles for protein 61742829 with a publication date from 1995 to the present
Combining eUtils calls
The eUtils are useful on their own in single URLs; however, their full potential is reached when successive eUtils URLs are combined into a data pipeline
• Retrieving data records matching an Entrez query
ESearch → ESummary
ESearch → EFetch
• Finding IDs linked to records matching an Entrez query
ESearch → ELink
• Retrieving data records in database B linked to records in database A matching an Entrez query
ESearch → ELink → ESummary
ESearch → ELink → EFetch
A Perl example
TASK: Retrieve protein sequences of factor IX in FASTA format
ESearch → EFetch
use strict;
use warnings;

# Step 1 (ESearch): build the search URL from its parts
my $Base_URL    = "http://www.ncbi.nlm.nih.gov/entrez/eutils/";
my $esearch_URL = "esearch.fcgi?";
my $DB          = "db=protein&";
my $Query       = "term=factor+ix+human";   # spaces in the term are written as +
# usehistory=y keeps the result on the server, so ESearch returns WebEnv and QueryKey
my $esearch_Parameters = "retmax=1&usehistory=y&";
my $E_search = "$Base_URL$esearch_URL$DB$esearch_Parameters$Query";
http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&retmax=1&usehistory=y&term=factor+ix+human
Output from ESearch
QueryKey - WebEnv
$WebEnv: cookie value used with EFetch in place of primary ID result list
$QueryKey: value used for a history search number
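The hand-over from ESearch to EFetch can be sketched as follows: read QueryKey and WebEnv from the ESearch XML and place them in the EFetch URL. The XML below is a shortened, made-up stand-in for real ESearch output (the element names QueryKey and WebEnv are the ones ESearch returns; the values are placeholders), the function names are hypothetical, and the https base URL is an assumption about the current service. No request is sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Shortened, made-up ESearch output; the values are placeholders
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>42</Count>
  <RetMax>1</RetMax>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>12345</Id></IdList>
</eSearchResult>"""


def history_from_esearch(xml_text):
    """Return (QueryKey, WebEnv) from ESearch XML obtained with usehistory=y."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def efetch_url(db, query_key, web_env, rettype="fasta", retmode="text"):
    """Build an EFetch URL that refers to a stored result set instead of IDs."""
    params = {"db": db, "rettype": rettype, "retmode": retmode,
              "query_key": query_key, "WebEnv": web_env}
    return ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
            + urlencode(params))


query_key, web_env = history_from_esearch(SAMPLE_ESEARCH_XML)
print(efetch_url("protein", query_key, web_env))
```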
A Perl example (continued)
# Step 2 (EFetch): fetch the records of the stored result set.
# $QueryKey and $WebEnv are taken from the ESearch output.
my $efetch_URL = "efetch.fcgi?";
my $efetch_Parameters =
"rettype=fasta&retmode=text&query_key=$QueryKey&WebEnv=$WebEnv";
my $E_fetch = "$Base_URL$efetch_URL$DB$efetch_Parameters";
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&query_key=1&WebEnv=0ujfmXBW0U0hNr3FjaUutLkz1bR-NnJ9kp5vybL3u1AbTQdD7uMETHEtG5N@1EE047D172B3B8D0_0015SID
Output from EFetch