Information organization. Oct 2, 2012 - PowerPoint PPT Presentation
Information organization
Oct 2, 2012
Learning objectives
• Demonstrate the Dotter program.
• Understand how information is stored in GenBank.
• Learn how to read a GenBank flat file.
• Learn how to search GenBank for information.
• Understand the difference between header, features and sequence.
• Distinguish between a primary database and a secondary database.
Homework #2 due today.
Homework #3 due Tues. Oct. 9.
What is GenBank?
Gene sequence database
Annotated records that each represent a single contiguous stretch of DNA or RNA; a record may have more than one coding region.
Generated from direct submissions to the DNA sequence databases from the authors.
Part of the International Nucleotide Sequence Database Collaboration.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
History of GenBank
Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965).
In 1986 it shared data with EMBL and in 1987 it shared data with DDBJ.
Primary database.
Examples of secondary databases derived from GenBank: UniProt, EST database.
GenBank Flat File (GBFF) is a human-readable form of a GenBank record.
[Figure: gene structure and information flow. Labels on the DNA: coding strand and template strand (5’ and 3’ ends marked), promoter, transcription initiation site, transcription termination site. Labels on the transcript: 5’ untranslated region (5’UTR), protein coding sequence (CDS), 3’ untranslated region (3’UTR); upstream and downstream are given relative to the CDS. Flow: DNA, transcription, RNA, translation, protein, protein folding, folded protein.]
[Figure: transcript splicing. The DNA runs 5’ to 3’ from start of gene to end of gene, with segments 1 to 4 separated by intron 1, intron 2 and intron 3. Flow: DNA, transcription, primary transcript (1 2 3 4), splicing, mRNA, translation, protein. A further panel is labeled "Alternative splicing".]
General Comments on GBFF
Three sections:
1) Header: information about the whole record.
2) Features: description of annotations, each represented by a key.
3) Nucleotide sequence: each record ends with // on its last line.
DNA-centered
Translated sequence is a feature
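The three-section layout can be sketched in a few lines of Python. This is a minimal illustration, not NCBI's parser: the function name split_flat_file is hypothetical, and it assumes the features section starts at the FEATURES line and the sequence section starts at BASE COUNT or ORIGIN, as in the example record in these slides.

```python
def split_flat_file(record_text):
    """Split one flat file record into header, features and sequence lines."""
    header, features, sequence = [], [], []
    section = header                      # a record starts with the header
    for line in record_text.splitlines():
        if line.startswith("//"):         # terminator on the last line of the record
            break
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            section = sequence
        section.append(line)
    return header, features, sequence


if __name__ == "__main__":
    demo = "\n".join([
        "LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999",
        "ACCESSION U49845",
        "FEATURES Location/Qualifiers",
        "source 1..5028",
        "ORIGIN",
        "1 gatcctccat atacaacggt",
        "//",
    ])
    head, feats, seq = split_flat_file(demo)
    print(len(head), len(feats), len(seq))
```

Anything after the // line is ignored, so the sketch handles one record at a time.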
Feature Keys
Purpose:
1) Indicates biological nature of sequence
2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             (Protein) coding sequence
Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400 that has a product called “alcohol dehydrogenase” and corresponds to the gene called “adhI”.
Feature Keys-Terminology (Cont.)
Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
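To make the location syntax concrete, here is a small Python sketch that reads the two location styles shown above, a simple span and a join of several spans. The names parse_location and feature_length are hypothetical, and the regular expression only covers the forms that appear in these slides (it also treats a "<" or ">" marker as a sign of a partial feature).

```python
import re


def parse_location(location):
    """Return (spans, is_partial) for a location such as 23..400 or join(...).

    spans is a list of (start, end) pairs, 1-based and inclusive,
    exactly as the numbers are written in the flat file.
    """
    is_partial = "<" in location or ">" in location
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    spans = [(int(start), int(end)) for start, end in pairs]
    return spans, is_partial


def feature_length(spans):
    """Total number of bases covered by the spans."""
    return sum(end - start + 1 for start, end in spans)


if __name__ == "__main__":
    spans, partial = parse_location("23..400")
    print(spans, partial, feature_length(spans))   # [(23, 400)] False 378

    spans, partial = parse_location("join(544..589,688..1032)")
    print(spans, partial, feature_length(spans))   # [(544, 589), (688, 1032)] False 391
```

The first CDS covers 378 bases, and the joined CDS covers 46 + 345 = 391 bases once the two elements are read as one contiguous sequence.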
Record from GenBank
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
KEYWORDS .
SOURCE baker's yeast.
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
Saccharomycetaceae; Saccharomyces.
Annotations:
• LOCUS line: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal), modification date (21-JUN-1999)
• cds (in DEFINITION): coding sequence
• ACCESSION: accession number (never changes)
• VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence)
• GI: GeneInfo identifier (changes whenever there is a change)
• KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
• SOURCE: common name for organism
• ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database
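As an illustration of how these header fields can be pulled apart, the following Python sketch splits the LOCUS and VERSION lines of the example record. The function names are hypothetical, and the field positions are an assumption based only on this record; LOCUS lines in other records can carry additional fields, so this is not a general parser.

```python
def parse_locus(line):
    """Split a LOCUS line laid out like the example record."""
    fields = line.split()
    # fields: LOCUS, locus name, length, "bp", molecule type, division, date
    return {
        "locus_name": fields[1],
        "length_bp": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "modification_date": fields[6],
    }


def parse_version(line):
    """Split a VERSION line into accession, version number and GI."""
    _, accession_version, gi_field = line.split()
    accession, version = accession_version.split(".")
    return {
        "accession": accession,               # never changes
        "version": int(version),              # changes when the sequence changes
        "gi": int(gi_field.split(":")[1]),
    }


if __name__ == "__main__":
    print(parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999"))
    print(parse_version("VERSION U49845.1 GI:1293613"))
```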
Record from GenBank (cont.1)
REFERENCE 1 (bases 1 to 5028)
AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
TITLE Cloning and sequence of REV7, a gene whose function is required
for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
JOURNAL Yeast 10 (11), 1503-1509 (1994)
MEDLINE 95176709
REFERENCE 2 (bases 1 to 5028)
AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M.
TITLE Selection of axial growth sites in yeast requires Axl2p, a
novel plasma membrane glycoprotein
JOURNAL Genes Dev. 10 (7), 777-793 (1996)
MEDLINE 96194260
Annotations: the oldest reference is listed first; the MEDLINE line gives the Medline UID.
REFERENCE 3 (bases 1 to 5028)
AUTHORS Roemer,T.
TITLE Direct Submission
JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
New Haven, CT, USA
Annotation: the submitter of the sequence is always the last reference.
Record from GenBank (cont.2)
FEATURES Location/Qualifiers
source 1..5028
/organism="Saccharomyces cerevisiae"
/db_xref="taxon:4932"
/chromosome="IX"
/map="9"
CDS <1..206
/codon_start=3
/product="TCP1-beta"
/protein_id="AAA98665.1"
/db_xref="GI:1293614"
/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
AEVLLRVDNIIRARPRTANRQHM"
Partial sequence on the 5’ end. The 3’ end is complete.
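There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).
Annotations:
• Keys: source, CDS
• Location: 1..5028, <1..206
• Qualifiers and values: in /product="TCP1-beta" the qualifier is product and the value is "TCP1-beta"
• Descriptive free text must be in quotations
• /codon_start: start of open reading frame
• /db_xref: database cross-references
• /protein_id: protein sequence ID #
• Note: only a partial sequence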
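The key, location and qualifier structure can be illustrated with a short Python sketch that reads the lines of one feature. The name parse_feature is hypothetical, and the handling of continuation lines is a simplification: wrapped /translation values are joined without a space, any other wrapped value with a space.

```python
def parse_feature(lines):
    """Return (key, location, qualifiers) for the lines of a single feature.

    The first line holds the key and the location; every following line
    either starts a qualifier with "/" or continues the previous value.
    """
    key, location = lines[0].split(None, 1)
    qualifiers = {}
    name = None
    for raw in lines[1:]:
        text = raw.strip()
        if text.startswith("/"):
            name, _, value = text[1:].partition("=")
            # a qualifier without "=" (such as /partial) is stored as a flag
            qualifiers[name] = value.strip('"') if value else True
        elif name is not None:
            glue = "" if name == "translation" else " "
            qualifiers[name] = (qualifiers[name] + glue + text).strip('"')
    return key, location.strip(), qualifiers


if __name__ == "__main__":
    cds = [
        "CDS <1..206",
        "/codon_start=3",
        '/product="TCP1-beta"',
        '/protein_id="AAA98665.1"',
        '/db_xref="GI:1293614"',
        '/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA',
        'AEVLLRVDNIIRARPRTANRQHM"',
    ]
    key, location, qualifiers = parse_feature(cds)
    print(key, location, qualifiers["codon_start"], qualifiers["product"])
```

Values are kept as text, so /codon_start comes back as the string "3" and would need converting before use as a number.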
Record from GenBank (cont.3)
gene 687..3158
/gene="AXL2"
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
/codon_start=1
/function="required for axial budding pattern of S. cerevisiae"
/product="Axl2p"
/protein_id="AAA98666.1"
/db_xref="GI:1293615"
/translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
gene complement(3300..4037)
/gene="REV7"
CDS complement(3300..4037)
/gene="REV7"
/codon_start=1
/product="Rev7p"
/protein_id="AAA98667.1"
/db_xref="GI:1293616"
/translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."
Annotations:
• Cutoff: both translations are cut off on the slide (. . .)
• Another location: 687..3158 and complement(3300..4037)
• complement(3300..4037): coding strand is complementary strand
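A complement location means the feature is read from the strand opposite to the one stored in the record, so its sequence is the reverse complement of the stored bases. The sketch below shows the idea on a short toy sequence (the first ten bases of the example record); the function names are hypothetical, and it assumes only the bases a, c, g and t occur.

```python
# Translation table that swaps each base for its complementary base.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(seq, start, end, on_complement=False):
    """Return the bases from start to end (1-based, inclusive).

    With on_complement=True the reverse complement is returned,
    as for a location written complement(start..end).
    """
    piece = seq[start - 1:end]
    return reverse_complement(piece) if on_complement else piece


if __name__ == "__main__":
    toy = "gatcctccat"
    print(extract(toy, 2, 5))                      # atcc
    print(extract(toy, 2, 5, on_complement=True))  # ggat
```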
Record from GenBank (cont.4)
BASE COUNT 1510 a 1074 c 835 g 1609 t
ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct . . .
//
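The sequence lines tie back to the first CDS feature (<1..206 with /codon_start=3): translating the bases shown above, starting at base 3, reproduces the beginning of that feature's /translation. The Python sketch below checks this for the 120 bases printed on the slide. The names translate and CODON_TABLE are hypothetical, and the table is the standard genetic code, which is assumed here to be the appropriate one for this yeast nuclear gene.

```python
from itertools import product

# Standard genetic code, codons ordered with t, c, a, g at each position.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, codon_start=1):
    """Translate complete codons, beginning at the 1-based codon_start."""
    dna = dna.lower()
    first = codon_start - 1
    protein = []
    for i in range(first, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


if __name__ == "__main__":
    # Bases 1 to 120 of the record, with position numbers and spaces removed.
    origin = (
        "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg"
        "ccgacatgagacagttaggtatcgtcgagagttacaagctaaaacgagcagtagtcagct"
    )
    print(translate(origin, codon_start=3))
    # SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVS
```

Only complete codons are translated, so the single base left over at the end of the 120 shown is dropped.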
[Figure: DNA, RNA (cDNA) and protein, each linked to the databases listed below.]
DNA databases derived from GenBank containing data for a single gene:
• Non-redundant (nr)
• dbGSS
• dbSTS
RNA (cDNA) databases derived from GenBank containing data for a single gene:
• dbEST
• UniGene
• RefSeq
Protein databases derived from GenBank containing data for a single gene:
• Non-redundant (nr)
• UniProtKB
Types of primary databases carrying biological information
GenBank/EMBL/DDBJ
dbEST-expressed sequence tags-single pass cDNA sequences (high error freq.)
It is non-redundant
PDB-Three-dimensional structure coordinates of biological molecules
PROSITE-database of protein domain/function relationships.
Summary
GenBank-longest running molecular biology database.
Three sections in every GenBank record
Primary databases and secondary databases.
RefSeq-contains unique record for each RNA variant.
UniProtKB-protein centered
Workshop
Do problem 1 in Chapter 2.
Homework
Do problems 2 and 3 in Chapter 2.