Databases in bioinformatics II - Göteborgs universitet
bio.lundberg.gu.se/courses/ht08/bio2/DB2_2008_toprint.pdf

Databases in bioinformatics II

Marcela Davila-Lopez
Department of Medical Biochemistry and Cell Biology

Institute of Biomedicine

BIOINFORMATICS AND SYSTEMS BIOLOGY, MSC PROGR Sequence analysis, UMF018, 2008


Overview

– UniProt/Swiss-Prot
– Divisions at NCBI (nt db)
– Sequencing methods
– EST
– RefSeq vs GenBank
– Trace Archive
– Refining searches at Entrez
– eUtils (programmer utilities)


UniProt/SwissProt

1980s: Protein sequence database
High-quality, detailed curation
EBI + SIB

Quick release of data not yet annotated
TrEMBL (Translation of EMBL nucleotide sequences)
Only computationally annotated entries

2002: EBI + SIB + PIR
UniProt Consortium

http://www.expasy.ch/sprot/sprot_details.html


UniProtKB

Central hub for the collection of functional Information on proteins with accurate, consistent and rich annotation

Ontologies
Classifications
Cross-references
Indications of the quality of annotation (Exp – Comp)

• Manually annotated records, based on literature and curator-evaluated analysis: "UniProtKB/Swiss-Prot"

• Computationally analyzed records that await full manual annotation: "UniProtKB/TrEMBL"


UniProt - UniRef
http://www.uniprot.org/

Clustered sets of sequences (UniProt Knowledgebase + UniParc)

Complete coverage of sequence space at several resolutions
Hides redundant sequences (but not their descriptions)

UniRef100: identical sequences and sub-fragments (11 or more residues) merged into one entry
- sequence of a representative protein
- accession numbers of all the merged entries
- links to the corresponding records

UniRef90 and UniRef50: built by clustering UniRef100 at 90% or 50% sequence identity

Faster in sequence searches.


UniProt - record




Functional divisions in Nucleotide DB at NCBI

Organization of nucleotide sequence records into discrete functional types:

Query specific subsets
Particular technique
Interpretation of data from a proper biological point of view

EST  300-500 bp single reads from mRNA (cDNA)
STS  200-500 bp
GSS  Similar to EST but of genomic origin
HTG  Unfinished DNA sequences generated by HTS
HTC  Unfinished sequences from HT cDNA projects
PAT  Patent sequences
CON  Constructed records of chrs, genomes and other long DNA sequences


Genome sequencing

Encompasses biochemical methods for determining the order of the nucleotide bases (AGCT) in a DNA oligonucleotide (~20, today 200)


Why Sequencing Genomes

Organisms are remarkably similar at the molecular level despite their obvious outward differences

Genes with similar DNA sequences tend to perform similar functions

By understanding the function of a gene in one organism, we may get an idea of what function that gene performs in a more complex organism (humans)

Applied to various fields: medicine, biological engineering, forensics


Sequencing methods

1954 Whitfeld PR - Sequencing by degradation

Sequencing by synthesis
1975 F. Sanger, A.R. Coulson (plus-minus method)
1977 Walter Gilbert, A. Maxam (chemical modification)
     F. Sanger (chain termination)
1979 Shotgun sequencing
1984 Ligation based (Applied Biosystems)
1988 Pyrosequencing (Roche, Biotage)
1994 Reversible dye terminators (Illumina – Helicos)

Non-enzymatic
1989 Sequencing by hybridization (Affymetrix)

DNA cannot be synthesized from scratch.

Archon X Prize: $10 million for sequencing 100 human genomes in 10 days at $10,000 per genome


Maxam-Gilbert sequencing

- Chemical modification of DNA (radiolabelling)

- Cleavage at specific bases (G, G+A, C, C+T)

- Size separation (gel electrophoresis)

- Autoradiography (X-ray film)

PROS: Purified DNA could be used directly

CONS:
Technically complex
Use of hazardous chemicals
Difficult to scale up

Reading the gel (four lanes):
Strong band in the 1st lane with a weaker band in the 2nd: A
Strong band in the 2nd lane with a weaker band in the 1st: G
Band in the 3rd and 4th lanes: C
Band only in the 4th lane: T

Maxam AM, Gilbert W., A new method for sequencing DNA, Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4
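
The lane-reading rules on this slide can be sketched in Python. This is only an illustration: the function names, the numeric band intensities and the example gel are assumptions made for the sketch, and the four lanes are simply numbered 1 to 4 as on the slide.

```python
def call_base(lane1, lane2, lane3, lane4):
    """Call one base from the band intensities (0 = no band) seen at a
    single gel position in lanes 1-4, using the rules listed on the slide."""
    if lane1 > 0 or lane2 > 0:
        if lane1 > lane2:
            return "A"  # strong band in lane 1, weaker (or none) in lane 2
        if lane2 > lane1:
            return "G"  # strong band in lane 2, weaker (or none) in lane 1
        raise ValueError("lanes 1 and 2 are equally strong: ambiguous")
    if lane3 > 0 and lane4 > 0:
        return "C"      # band in both lane 3 and lane 4
    if lane4 > 0:
        return "T"      # band only in lane 4
    raise ValueError("no interpretable band at this position")


def read_gel(positions):
    """positions: list of (lane1, lane2, lane3, lane4) tuples, ordered from
    the shortest fragment to the longest."""
    return "".join(call_base(*p) for p in positions)


# Hypothetical intensities for a four-position gel.
example_gel = [
    (1.0, 0.3, 0.0, 0.0),
    (0.2, 1.0, 0.0, 0.0),
    (0.0, 0.0, 0.8, 0.9),
    (0.0, 0.0, 0.0, 1.0),
]
print(read_gel(example_gel))
```

Reading positions in order of fragment length mirrors reading the autoradiograph from the bottom of the gel upwards.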


Sanger method

dNTP (deoxynucleotide), ddNTP (dideoxynucleotide)


Sanger method

Radio-/fluorescently labelled nt




Sanger method: variations

Dye-labeled primer

PROS: Upon completion, these four reactions can be combined into one lane on a gel, and run on a machine that can scan the lanes with a laser

http://www.escience.ws/b572/L8/L8.htm


Sanger method: variations

Dye-terminator sequencing

PROS: Uses an optical system
Faster
More economical
Automation

Single reaction (a different dye for each nt)


Large scale sequencing strategies

Sanger:
Not practical for sequencing a complete genome
Only about 1000 bases can be sequenced accurately
A primer of known sequence is required

A Privately-Funded Sequencing Project: Celera Genomics

No libraries of BAC clones
Human genome → fragments of 2-10 kb → sequence them
Assembly?

The Publicly-Funded Human Genome Project: NIH/DOE

'Libraries' of BAC clones → sequence them


Hierarchical shotgun sequencing

[Figure labels: 150 Mb; contig]

PROS:
Individual clones can be sequenced by different people
Each stretch of DNA only needs to be sequenced once

CONS:
Slow process of sub-cloning and mapping of the clones
Requires significant human manipulation

http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/


Shotgun sequencing

Prokaryotic genomes (smaller in size, less repetitive DNA)

PROS: Faster and less expensive

CONS:
Prone to errors due to incorrect assembly of the finished sequence
Much more sequencing is needed to have p < 1% of missing a sub-clone
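
The "p < 1% of missing" figure can be made concrete with a small calculation. The sketch below assumes that read start positions are uniformly random, so that the number of reads covering a position is approximately Poisson distributed (the Lander-Waterman style model; the slide does not name a model). Under that assumption the probability that a position is missed at coverage c is e^(-c). The genome size and read length are made-up numbers.

```python
import math


def probability_missed(coverage):
    """Probability that a given position is covered by no read at all,
    under the Poisson assumption stated above."""
    return math.exp(-coverage)


def coverage_needed(p_missing):
    """Smallest coverage (total read bases / genome size) that keeps the
    probability of missing a position at or below p_missing."""
    return -math.log(p_missing)


def reads_needed(genome_size, read_length, p_missing):
    """Number of reads of read_length bases that gives the required coverage."""
    return math.ceil(coverage_needed(p_missing) * genome_size / read_length)


# Illustrative numbers: a 1 Mb genome sequenced with 500 bp reads.
print(round(coverage_needed(0.01), 2))
print(reads_needed(1_000_000, 500, 0.01))
```

Under this model roughly 4.6-fold coverage is needed to push the chance of missing a position below 1%, which is why whole-genome shotgun projects sequence each base several times over.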

http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/


Next generation platforms

Platform            Chemistry                      Read length
Affymetrix          Sequencing by hybridization    ~200 bp
Roche (454)         Pyrosequencing                 230-400 bp
Illumina (Solexa)   Sequencing by synthesis        40 bp
ABI SOLiD           Ligation-based sequencing      35 bp


Sequencing by synthesis

ss DNA
Enzymatically synthesize its complementary strand
Detect fluorescence of one nucleotide at a time
Remove the blocking group
Polymerization of another nucleotide

http://www.illumina.com/media.ilmn?Title=Sequencing-By-Synthesis%20Demo&Cap=&PageName=solexa%20technology&PageURL=203&Media=1

Pyrosequencing

Detects the activity of DNA polymerase (nucleotide incorporation) with a chemiluminescent enzyme while the complementary strand is synthesized.

PROS: 96 samples in 1 hr (vs 24 hr)
CONS: Reads limited to 300-500 nucleotides

Used for resequencing or sequencing of genomes for which the sequence of a close relative is already available

Fungal, bacterial and viral identification
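
How a pyrogram relates to a sequence can be sketched as a small simulation. The sketch assumes an idealized reaction in which the light peak is exactly proportional to the number of nucleotides incorporated per dispensation; the function names and the cyclic dispensation order A, C, G, T are assumptions, and the input is taken to be the strand being synthesized (the read), using the sequence shown on the following slides.

```python
from itertools import cycle


def pyrogram(read, dispensation_order="ACGT"):
    """Simulate an idealized pyrogram.

    Nucleotides are dispensed one at a time in a repeating order. A
    dispensation gives a peak whose height equals the number of consecutive
    bases of that kind incorporated (0 if the next base is different).
    Returns a list of (nucleotide, peak_height) tuples.
    """
    peaks = []
    position = 0
    for nucleotide in cycle(dispensation_order):
        if position >= len(read):
            break
        height = 0
        while position < len(read) and read[position] == nucleotide:
            height += 1
            position += 1
        peaks.append((nucleotide, height))
    return peaks


def decode(peaks):
    """Read the sequence back from the peak heights."""
    return "".join(nucleotide * height for nucleotide, height in peaks)


peaks = pyrogram("CGTCCGGA")
print(peaks)
print(decode(peaks))
```

The double-height peaks for CC and GG correspond to the "(2) PPi, (2) ATP" frames of the slide animation; in practice, estimating the length of long homopolymer runs from peak height is the main source of error.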


Pyrosequencing

[Figure, shown as a series of animation frames: sequence C G T C C G G A. Each incorporated nucleotide releases PPi; sulfurylase converts PPi to ATP; luciferase uses ATP to convert luciferin to oxyluciferin, emitting light; a charge-coupled device (CCD) records the light as a peak in the pyrogram; apyrase degrades unincorporated nucleotides. One incorporated nucleotide gives (1) PPi and (1) ATP; two identical nucleotides in a row give (2) PPi and (2) ATP.]

http://www.biotagebio.com/DynPage.aspx?id=7454



Sequencing by ligation

The method:

It is based on sequential ligation of dye-labeled oligonucleotide probes, whereby each probe queries two base positions at a time

Uses DNA ligase rather than polymerase

The system uses 4 fluorescent dyes to encode the 16 possible two-base combinations

Multiple cycles of probe hybridization, ligation, imaging and analysis are performed

The resulting product is then removed

The process is repeated for 5 more extension rounds with primers hybridized to position n-1, n-2, etc. in the adaptor.

http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/index.htm


Sequencing by ligation: 2-base color encoding data

1 dye = 4 possible dinucleotides

2 bases are interrogated in each ligation reaction providing increased specificity


Sequencing by ligation: Primer round 1


Sequencing by ligation: Primer round 2

Total of 5 primer rounds

Each sequence is interrogated twice in different reactions
Improves the signal-to-noise ratio


Sequencing by ligation: Decoding

Color space

Possible dinucleotides

Base zero Decoded sequence

Base space sequence
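
The decoding step can be sketched in Python. The sketch uses the commonly described two-base scheme in which the bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of the two numbers; the color numbers 0-3 are an assumption of the sketch and are not tied to particular dyes, which the slides do not name.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def color(first, second):
    """Color (0-3) of a dinucleotide: XOR of the two base codes."""
    return BASES.index(first) ^ BASES.index(second)


def encode(sequence):
    """Base space -> (base zero, list of colors)."""
    return sequence[0], [color(a, b) for a, b in zip(sequence, sequence[1:])]


def decode(base_zero, colors):
    """Color space -> base space, starting from the known base zero."""
    bases = [base_zero]
    for c in colors:
        bases.append(BASES[BASES.index(bases[-1]) ^ c])
    return "".join(bases)


# Each color stands for exactly 4 of the 16 possible dinucleotides.
dinucleotides_per_color = {
    c: [a + b for a in BASES for b in BASES if color(a, b) == c]
    for c in range(4)
}

base_zero, colors = encode("ATGGA")
print(base_zero, colors)
print(decode(base_zero, colors))
```

Because every base is covered by two colors, a true SNP changes two adjacent colors, whereas a single changed color is inconsistent with any one-base substitution and can be flagged as a measurement error. Decoding a read with one wrong color changes every base downstream of it, which is why comparisons are usually made in color space first.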


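Sequencing by ligation: re-sequencing

[Figure: reads aligned to a reference in color space (CS) and base space (BS). Rows: Ref seq, CS Ref, CS Reads, CS consensus, BS consensus. Marked features: Polymorphism, Error.]

Higher accuracy through built-in error checking
Discrimination between measurement errors and SNPs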


Sequencing by hybridization: Microarray – DNA chip

Hybridization

Probe


Sequencing by hybridization

[Figure: sequencing by hybridization with 3-mer probes]

1. DNA sample: copies of ACGCATC
2. Hybridization to the chip probes:
ACG TAC GGG CAT
GAT GTT CTA TTT
CGC CCC ATC GTA
ACT AAG AAA GCA
3. Spectrum (probes that hybridized): ACG, CGC, GCA, CAT, ATC
4. Reconstruct the sequence from overlapping probes: ACGCATC

Drmanac R et al. Adv Biochem Eng Biotechnol. 2002



ACCGCCTCCA: ACC, CCG, CGC, GCC, CCT, CTC, TCC, CCA

ACCTCCGCCA: ACC, CCT, CTC, TCC, CCG, CGC, GCC, CCA

Problem: different sequences have the same spectrum
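
The spectrum and the ambiguity can be checked with a short Python sketch. It uses the two 10-base sequences and the 3-mer probes from the slides; the function names and the brute-force search are assumptions of the sketch (practical for a handful of probes only), and it assumes no probe occurs twice in the target.

```python
def spectrum(sequence, k=3):
    """All k-mers of the sequence, sorted (what the chip would report)."""
    return sorted(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def reconstructions(kmers):
    """Every sequence that uses each k-mer exactly once, found by chaining
    k-mers whose (k-1)-base suffix and prefix overlap."""
    results = set()

    def extend(path, remaining):
        if not remaining:
            results.add(path[0] + "".join(kmer[-1] for kmer in path[1:]))
            return
        for i, kmer in enumerate(remaining):
            if path[-1][1:] == kmer[:-1]:
                extend(path + [kmer], remaining[:i] + remaining[i + 1:])

    for i, kmer in enumerate(kmers):
        extend([kmer], kmers[:i] + kmers[i + 1:])
    return sorted(results)


print(reconstructions(spectrum("ACGCATC")))
print(spectrum("ACCGCCTCCA") == spectrum("ACCTCCGCCA"))
print(reconstructions(spectrum("ACCGCCTCCA")))
```

The first spectrum has a single reconstruction, while the second yields both 10-base sequences: the repeated dinucleotide CC lets the two internal segments be visited in either order.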



Oligomers on chip = 4^(number of bases)
12 bases: 4^12 = 16,777,216 oligomers! (6.5 million)

Probe: 5-25 bases

Probe overlap
Each base is read by multiple probes (SNP detection)

Hybridization conditions are not homogeneous: melting temperature depends strongly on the GC/AT ratio

Repeats


Sequencing and gene expression

Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression.

Expression in normal circumstances

altered state (?)

Identify and study the protein(s) coded by a gene
Identify gene (Genome bioinformatics)


EST

Expressed Sequence Tags
Pieces of DNA sequence from an expressed gene

200 to 500 nt long

From cells, tissues, organs
Under certain conditions

5' EST: protein-coding, conserved across species

3' EST: non-coding (UTR)

Generated rapidly and inexpensively

Used in gene identification
Hereditary diseases


Redundancy at GenBank

Many sequences are represented more than once in GenBank

Huge degree of redundancy

2003 RefSeq collection: curated secondary database
Non-redundant
Selected organisms

• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein


RefSeq vs GenBank

GenBank                                   RefSeq
Not curated                               Curated
Author submits                            NCBI creates from existing data
Only author can revise                    NCBI revises as new data emerge
Multiple records for same loci common     Single records for each molecule of major organisms
Records can contradict each other
No limit to species included              Limited to model organisms
Data exchange among INSDC members         Exclusive NCBI database
Akin to primary literature                Akin to review articles
Proteins identified and linked            Proteins and transcripts identified and linked
Access via NCBI Nucleotide db             Access via Nucleotide and Protein db

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook


Trace Archive

2001: NCBI and EMBL/ENSEMBL
Purpose: collect raw data from sequencing centers worldwide
PERMANENT repository of single-pass reads

Data: 22 trillion bytes in size (a stack of CDs 10 stories high), and still growing

Traces: "pieces of a puzzle", between 300 and 1,000 DNA letters each

Vital in the hunt for polymorphisms in gene sequences
- linked to disease (human DNA)
- linked to virulence (viral DNA)

dbSNP: detailed information on > 25 million SNPs

Insights into the impact of genetic variation on health

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?


Entrez


Refining search results
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.section.EntrezHelp.Searching_Entrez_usi
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Searching_PubMed


Limits

Refine search results retrieve only the most relevant documents

Allow restriction of a search to a defined subset of the database




Index

Alphabetical lists of terms from searchable database fields

Used to browse and/or select the terms by which records and/or data are described




Search Field Descriptions and Qualifiers

Index search field Qualifier

Accession [ACCN] or [ACCESSION]

All Fields [ALL] or [ALL FIELDS]

Author [AUTH] or [AUTHOR]

EC/RN Number [ECNO]

Feature Key [FKEY]

Filter [FILT] or [SB]

Gene Name [GENE]

Issue [ISS] or [ISSUE]

Keyword [KYWD] or [KEYWORD]

Journal Name [JOUR] or [JOURNAL]

Modification Date [MDAT]

Organism [ORGN] or [ORGANISM]

Page Number [PAGE]

Primary Accession [PACC]

Index search field Qualifier

Title [TITL]

Title/Abstract [TIAB]

Volume [VOL]

Entrez date [EDAT]

Journal title [TA]

Language [LA]

MeSH term [MH]

Properties [PROP]

Protein Name [PROT]

Publication Date [PDAT]

SeqID String [SQID]

Sequence Length [SLEN]

Substance Name [SUBS]

Text Word [WORD]

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.table.EntrezHelp.T7
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip


Advanced search statements

term [field] OPERATOR term [field]

Find all human nucleotide sequences with D-loop annotations:
D-loop[FKEY] AND human[ORGN] in Nucleotide database

Find Drosophila population studies published in the Journal of Molecular Evolution
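
The "term [field] OPERATOR term [field]" pattern can be assembled programmatically. The helper names below are hypothetical; the qualifiers come from the table of search fields on the earlier slide.

```python
def entrez_term(term, field=None):
    """Format one search term, optionally restricted to a field qualifier."""
    return f"{term}[{field}]" if field else term


def combine(operator, *statements):
    """Join search statements with one Boolean operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(statements)


query = combine(
    "AND",
    entrez_term("D-loop", "FKEY"),
    entrez_term("human", "ORGN"),
)
print(query)
```

The resulting string is what would be typed into the Nucleotide search box, or passed as the term parameter of an ESearch URL.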


History

Provides a record of the searches performed during a search session.

Database specific
Lost after eight hours of inactivity

Used to review, revise, or combine the results of earlier searches.


Combining results


Query translation


Details

Display your search strategy as translated using Entrez's search and syntax rules

Error messages, when applicable


Author search


Example - author


Example - journal


eUtils: Entrez Programming Utilities

•Tools that provide access to Entrez data outside of the regular web query interface.

• Set of 8 server-side programs

• Helpful for retrieving search results (to be manipulated in another environment)

• Can be called from Perl, Python, Java, and C++

• Currently includes 35 databases

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

ESearch

ESummary

EGQuery

EInfo

EFetch

ELink

EPost

ESpell


Uses

• Perform searches on large datasets
• Implement data pipelines for genomic, proteomic, or microarray analysis
• Create automated searches to keep local databases current
• Create and download customized datasets
• Seamlessly combine local data with NCBI data
• Develop a focused interface to NCBI data

URL → Result (XML)


Common Entrez Engine

Assemble a list of UIDs

ESearch (for a given db)

EGQuery (global version all db)

ESummary (for a list of UIDs)

Retrieve a brief summary record (DocSum)


URL

http://www.ncbi.nlm.nih.gov/sites/gquery?term=cancer+stem+cells

[Base_URL] [Query] [DB][Eutils_URL]

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=9913&retmode=xml

[Base_URL] [Query][DB][Eutils_URL]
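
The anatomy of an eUtils URL can be reproduced with a few lines of Python. This is a sketch: the function name is hypothetical, and the base URL is the one printed on the slides.

```python
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(utility, **params):
    """Assemble [Base_URL][Eutils_URL]?[DB][Query] from its parts.

    utility: name of the program, e.g. "esummary" or "esearch".
    params:  URL parameters such as db, id, term, retmode; urlencode takes
             care of escaping spaces and other special characters.
    """
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"


url = eutils_url("esummary", db="taxonomy", id=9913, retmode="xml")
print(url)
print(eutils_url("esearch", db="pubmed", term="cancer stem cells"))
```

Letting `urlencode` build the query string avoids hand-written separators and turns the spaces of a multi-word term into `+`.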


URL: DB

[Base_URL] [Query][DB][Eutils_URL]

Entrez Database   E-Utility Database Name
3D Domains        domains
Domains           cdd
Genome            genome
Nucleotide        nucleotide
OMIM              omim
PopSet            popset
Protein           protein
ProbeSet          geo
PubMed            pubmed
Structure         structure
SNP               snp
Taxonomy          taxonomy
UniGene           unigene
UniSTS            unists

Each Entrez DB has an E-Utility name (used instead of its original name)


URL: Query

[Table: URL parameters accepted by each utility (cell marks not reproduced)]
Columns: EFetch (Lit, Seq, Tax), EGQuery, ESpell, EInfo, ESearch, ESummary, ELink, EPost
Rows: db, history, WebEnv, query_key, term, id, dbfrom, report, strand, seq_start, seq_stop, field, reldate, mindate, maxdate, datetype, retstart, retmax, retmode, rettype, cmd


ESpell

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?db=pubmed&term=brest+cancer

Retrieves spelling suggestions when available

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?

Only PubMed


EInfo

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed

Provides detailed information about a given database: term counts, last update and available links

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?


EGQuery

Provides Entrez database counts in XML for a single search using GQuery

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2&rettype=html


ESummary

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=11850928,11482001&retmode=xml

retmode: xml, ref, html, text, asn.1

Retrieves DocSums from a list of primary IDs

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?


UIDs: Unique ID

Entrez Database   Primary ID    E-Utility Database Name
3D Domains        3D SDI        domains
Domains           PSSM-ID       cdd
Genome            Genome ID     genome
Nucleotide        GI number     nucleotide
OMIM              MIM number    omim
PopSet            Popset ID     popset
Protein           GI number     protein
ProbeSet          GEO ID        geo
PubMed            PMID          pubmed
Structure         MMDB ID       structure
SNP               SNP ID        snp
Taxonomy          TAXID         taxonomy
UniGene           UniGene ID    unigene
UniSTS            UniSTS ID     unists

• Always integers

• Each refers to a unique record in a given Entrez database

• Each Entrez DB has an E-Utility name (used instead of its original name)


ELink

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=protein&id=7140346

Checks for the existence of an external/Related Articles link from a list of UIDs
Retrieves IDs related to a list of UIDs (same db, external db)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?


ELink

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10611131&retmode=ref&cmd=prlinks

Creates a hyperlink to the primary LinkOut provider for a specific ID
Lists LinkOut URLs and attributes for multiple IDs.


EPost

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=pubmed&id=11237011

Returns a label (query_key) and an encoded server address (WebEnv) that corresponds to a UID list for subsequent search strategies

Optimal for large datasets (see Example)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?


ESearch

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=100

Returns a list of matching UIDs (text search) in a given Entrez database

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?

datetype: edat, mdat, dp


EFetch

Generates formatted output for a list of input IDs:
abstracts from PubMed
FASTA format from Protein

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?

DBs:
Literature databases: PubMed, Journals, PubMed Central, OMIM
Sequence and other molecular biology databases: Nucleotide, Protein, Gene, etc.
Taxonomy


Rettype

Rettype    Scope            Description
count      PubMed           Hit counts
sort       PubMed and Gene
abstract   PubMed
citation   PubMed
medline    PubMed
full       PubMed
uilist     all              Default format for viewing hits
native     all              Default format for viewing sequences
fasta      sequence         FASTA view of a sequence
gb         nucleotide       GenBank view for sequences
est        dbEST            EST report
gp         protein          GenPept view
seqid      sequence         Converts a list of GIs into a list of SeqIDs
acc        sequence         Converts a list of GIs into a list of accessions
chr        dbSNP only       SNP chromosome report


EFetch - Literature

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12345,9997&retmode=html&rettype=abstract


EFetch - Sequences

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=fasta

strand: 1 (+), 2 (-)


EFetch - Taxonomy

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=44689&report=docsum

report: uilist, brief, docsum, xml


Exercise

Search in Journals for the term obstetrics:

In PubMed, display PMIDs 12091962 and 9997 in html retrieval mode and abstract retrieval type:

From Entrez Gene, display as xml the GenomeID 2:

Retrieve PubMed related articles for protein 61742829 with a publication date from 1995 to the present:


Combining eUtils calls

The eUtils are useful when used by themselves in single URLs; however their full potential is reached when successive eUtils URLs are combined to create a data pipeline

• Retrieving data records matching an Entrez query

ESearch → ESummary
ESearch → EFetch

• Finding IDs linked to records matching an Entrez query

ESearch → ELink

• Retrieving data records in database B linked to records in database A matching an Entrez query

ESearch → ELink → ESummary
ESearch → ELink → EFetch
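
An ESearch → EFetch pipeline can be sketched in Python, one of the languages the slides list for eUtils. The function names are hypothetical, the base URL is the one printed on the slides, and the sketch passes the UIDs explicitly rather than through the History server. So that it runs without network access, the demonstration uses a stand-in `fake_get` with made-up responses; `http_get` is the transport that would contact NCBI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_get(url):
    """Real transport: download a URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def esearch_then_efetch(db, term, rettype, retmax=5, get=http_get):
    """Step 1: ESearch returns the UIDs matching the query.
    Step 2: EFetch returns formatted records for those UIDs."""
    search_url = BASE_URL + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})
    ids = [element.text for element in ET.fromstring(get(search_url)).iter("Id")]
    if not ids:
        return ""
    fetch_url = BASE_URL + "efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"})
    return get(fetch_url)


def fake_get(url):
    """Stand-in transport with invented responses, for offline illustration."""
    if "esearch.fcgi" in url:
        return ("<eSearchResult><IdList><Id>111</Id><Id>222</Id></IdList>"
                "</eSearchResult>")
    return ">record_111\nMKV\n>record_222\nMAL\n"


print(esearch_then_efetch("protein", "factor ix human", "fasta", get=fake_get))
```

Calling `esearch_then_efetch("protein", "factor ix human", "fasta")` without the `get` argument would send the two requests to NCBI; the current NCBI documentation gives the eUtils addresses with https.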


a PERL example

TASK: Retrieve protein sequences of the factor IX in fasta format

use strict;
use warnings;

# Build the ESearch URL from its parts (spaces in the query are written as "+")
my $Base_URL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/";
my $esearch_URL = "esearch.fcgi?";
my $DB = "db=protein&";
my $Query = "term=factor+ix+human";
# retmax=1 returns one UID; usehistory=y keeps the result list on the server
my $esearch_Parameters = "retmax=1&usehistory=y&";
my $E_search = "$Base_URL$esearch_URL$DB$esearch_Parameters$Query";

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&retmax=1&usehistory=y&term=factor+ix+human

ESearch → EFetch


Output from ESearch


QueryKey - WebEnv

$WebEnv: cookie value used with EFetch in place of primary ID result list

$QueryKey: value used for a history search number
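
Extracting the two values from the ESearch output and reusing them in EFetch can be sketched as follows, here in Python rather than Perl. The XML below is a shortened, made-up response (the WebEnv string is a placeholder), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Invented ESearch output for a search run with usehistory=y.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>2</Count><RetMax>1</RetMax><RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>123456</Id></IdList>
</eSearchResult>"""


def history_handles(esearch_xml):
    """Return (QueryKey, WebEnv) from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def efetch_url_from_history(db, query_key, web_env,
                            rettype="fasta", retmode="text"):
    """EFetch URL that refers to the stored result list instead of listing IDs."""
    params = {"db": db, "rettype": rettype, "retmode": retmode,
              "query_key": query_key, "WebEnv": web_env}
    return ("http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
            + urlencode(params))


query_key, web_env = history_handles(SAMPLE_ESEARCH_XML)
print(efetch_url_from_history("protein", query_key, web_env))
```

Passing query_key and WebEnv instead of an explicit ID list keeps the URL short however many records the search matched.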


a PERL example

# $QueryKey and $WebEnv are taken from the ESearch output
my $efetch_URL = "efetch.fcgi?";
my $efetch_Parameters = "rettype=fasta&retmode=text&query_key=$QueryKey&WebEnv=$WebEnv";
my $E_fetch = "$Base_URL$efetch_URL$DB$efetch_Parameters";

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&query_key=1&WebEnv=0ujfmXBW0U0hNr3FjaUutLkz1bR-NnJ9kp5vybL3u1AbTQdD7uMETHEtG5N@1EE047D172B3B8D0_0015SID

ESearch → EFetch

TASK: Retrieve protein sequences of the factor IX in fasta format


Output from EFetch
