Gene Structure and Identification
Genes and Genomes
ORFs and more
Consensus Sequences
Gene Finding
BIO520 Bioinformatics Jim Lund
Reading: sections 1.3, 9.1-9.6
Gene
The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.
Gene-Informatics
Genes are character strings embedded in much larger strings called the genome. A gene usually encodes a protein. Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.
Genomes
• Genome sequence has only limited use by itself
– Markers, SNPs, etc.
• Functional annotation
– Identify proteins and their functions.
– And regulatory regions, etc.
• Parts list: a source for understanding all biology, and it ushers in the post-genomic age of biology.
Characteristics of Protein Coding Genes
• ORF
– long (usually >100 aa)
– matches to “known” proteins make a gene likely
• Basal signals
– Transcription, splicing, translation
• Regulatory signals
– Depend on organism
• Prokaryotes vs. eukaryotes
• Vertebrates vs. fungi, e.g.
Infer Gene Structure: “Gene Model”
Promoter
• Strength
• Regulation
mRNA
• Exons
• Splicing
• Stability
• ORF = protein
Genomes: Gene Content
Human
27,148 genes × 2 kbp = 54 Mb mRNA
Introns = 300 Mb?
Regulatory regions = 300 Mb?
2,446 Mb = ?
Complex Genome DNA
• ~10% highly repetitive (300 Mb)
– NOT GENES
• ~25% moderately repetitive (750 Mb)
– Some genes
• ~10% exons and introns (354 Mb)
• 55% = ?
– Regulatory regions
– Intergenic regions
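The arithmetic behind these figures can be checked with a short sketch. The gene count, mRNA size, and intron estimate come from the slides; the 3,000 Mb genome total is an assumption inferred from "~10% = 300 Mb", and the helper name `percent` is made up for illustration.

```python
# Figures from the slides; GENOME_MB is an assumed total (10% = 300 Mb).
GENES = 27_148
MRNA_KBP_PER_GENE = 2
INTRON_MB = 300
HIGHLY_REPETITIVE_MB = 300
MODERATELY_REPETITIVE_MB = 750
GENOME_MB = 3_000

mrna_mb = GENES * MRNA_KBP_PER_GENE / 1_000   # kbp -> Mb
genic_mb = mrna_mb + INTRON_MB                # exons plus introns
remainder_mb = (GENOME_MB - HIGHLY_REPETITIVE_MB
                - MODERATELY_REPETITIVE_MB - genic_mb)

def percent(part_mb, total_mb=GENOME_MB):
    """Share of the genome, in percent."""
    return 100 * part_mb / total_mb

print(f"mRNA: {mrna_mb:.0f} Mb, exons + introns: {genic_mb:.0f} Mb")
print(f"unexplained remainder: {percent(remainder_mb):.0f}% of the genome")
```

Under these assumptions the unexplained remainder comes out slightly above half the genome, in line with the rounded 55% on the slide.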
Easy problem: Bacterial Gene Finding
• Dense genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
• Complete genomes
E. coli genome
• 4,415 genes
• Ave. distance between genes: 118 bp
• 318 aa, average protein length
• 57 proteins longer than 1000 aa.
• 318 shorter than 100 aa.
• 2,584 operons, 70% contain one gene.
• 1.5% repetitive DNA (mostly viral fragments).
Prokaryotic Gene Expression
[Figure: Promoter, Cistron 1, Cistron 2, … Cistron N, Terminator. Transcription (RNA polymerase) yields one mRNA, 5’ to 3’. Translation (ribosome, tRNAs, protein factors) yields separate polypeptides 1, 2, … N, each with its own N and C terminus.]
Prokaryotic gene prediction
• ORFs
• Biased nucleotide distribution
– Periodicity of 3
– Codon bias (codon usage statistics)
– Commonly measured with the Codon Adaptation Index (CAI).
• Signal sequences
• Homology
• Other biological info: for E. coli, partial N-terminal protein sequences.
Prokaryotic signal sequences
• Ribosome binding site (RBS)/Shine-Dalgarno element
– 3-9 purines complementary to sequence at the 3’ end of the 16S rRNA in the small subunit of the ribosome.
– Located 4-7 bp 5’ of the AUG.
• Promoter
– -35 consensus site (TTGACA)
– -10 consensus site (TATAAT)
• Signal peptides
• Regulatory protein binding sites (4 to 8 bp)
ORF finding tools
• Artemis
– analyze ORFs
• Testcode (Fickett’s)
• CodonPreference
• ORF Finder (NCBI)
• BCM Search Launcher
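To make the idea of ORF finding concrete, here is a minimal six-frame scan. It is only a sketch and not how any of the tools listed above is implemented: it assumes ATG as the only start codon, the standard stop codons, and the 100 aa length cutoff mentioned earlier. The function names are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_orfs(seq, min_aa=100):
    """Return (strand, frame, start, end, length_aa) for each ATG..stop ORF.

    start and end are 0-based, end-exclusive offsets on the scanned strand;
    the span includes the stop codon, length_aa does not count it.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                    # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    length_aa = (i - start) // 3
                    if length_aa >= min_aa:
                        orfs.append((strand, frame, start, i + 3, length_aa))
                    start = None                 # look for the next ORF
    return orfs

# Toy example with a lowered cutoff so the short ORF is reported.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_aa=5))
```

An ORF that runs off the end of the sequence without a stop codon is not reported by this sketch; real tools usually handle that case for partial genes.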
Codon Bias
• Genetic code degenerate
• Codon usage varies
– Organism to organism
– Gene to gene
• High bias correlates with high-level expression
• Bias correlates with tRNA isoacceptors
• Change bias or tRNAs, change expression
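Codon usage statistics start from simple counts. The sketch below tallies codons in one ORF and reports how the usage splits among a group of synonymous codons; the six leucine codons serve as the example group. The function names are hypothetical, and this is a plain frequency count, not the Codon Adaptation Index.

```python
from collections import Counter

LEUCINE_CODONS = ("CTG", "CTA", "CTC", "CTT", "TTA", "TTG")

def codon_counts(orf):
    """Count in-frame codons; trailing bases that do not fill a codon are ignored."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))

def codon_fractions(orf, synonymous):
    """Fraction of each codon within one group of synonymous codons."""
    counts = codon_counts(orf)
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    return {c: counts[c] / total for c in synonymous}

toy_orf = "ATGCTGCTGCTGTTATAA"   # Met, Leu x4, stop
print(codon_fractions(toy_orf, LEUCINE_CODONS))
```

Comparing such fractions between a candidate ORF and a set of known genes from the same organism is one way to ask whether the ORF looks "typical".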
Nucleotide Bias
• Coding DNA vs non-coding DNA
– often G+C content higher than bulk
• Empirical statistics (Fickett’s TESTCODE)
Useful:
• ORF matches “typical”
– organism, bias
• ORF obscured by STOP codons
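A simple way to look at nucleotide bias is to compute G+C content overall and separately for each codon position, which exposes the period-3 signal in coding DNA. This is a minimal sketch with made-up function names; it is not Fickett’s TESTCODE statistic.

```python
def gc_fraction(seq):
    """G+C fraction among the A/C/G/T bases of seq (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def gc_by_codon_position(orf):
    """G+C fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    orf = orf.upper()
    return tuple(gc_fraction(orf[pos::3]) for pos in range(3))

print(gc_fraction("ATGC"))
print(gc_by_codon_position("AAGAACAAG"))
```

A large difference between the three positions in one reading frame, and little difference in the others, hints that the frame is coding.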
We found ORFs: now what?
• Work backwards
– Locate adjacent cistrons
– Locate RBS
– Locate promoter
– Locate terminator
– Locate regulatory sites
Translation: Ribosome Binding Site, Shine-Dalgarno Site
nnAGGAGGAGGAGGnnnnnATG…
Consensus not always used, example E. coli gene:
nnAaGAGGAaGAGGnnnnATG
(Better represented as a PSSM or an HMM)
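Locating a candidate RBS can be sketched as a mismatch-tolerant search upstream of a start codon. The sketch assumes AGGAGG as the core motif and reads "4-7 bp 5’ of the AUG" as the number of bases between the motif and the ATG; both are simplifying assumptions, and the function name is made up. A PSSM or HMM would replace the simple match count.

```python
SD_CORE = "AGGAGG"  # assumed core of the Shine-Dalgarno consensus

def best_sd_site(seq, atg_index, min_gap=4, max_gap=7, core=SD_CORE):
    """Best match to `core` ending min_gap..max_gap bases upstream of the ATG.

    Returns (matches, start_index, gap), or None if no window fits in seq.
    """
    seq = seq.upper()
    best = None
    for gap in range(min_gap, max_gap + 1):
        start = atg_index - gap - len(core)
        if start < 0:
            continue                       # window would start before the sequence
        window = seq[start:start + len(core)]
        matches = sum(a == b for a, b in zip(window, core))
        if best is None or matches > best[0]:
            best = (matches, start, gap)
    return best

toy = "TT" + "AGGAGG" + "ACTTC" + "ATGAAA"
print(best_sd_site(toy, toy.index("ATG")))
```

A match count below the motif length, as in the E. coli example above, still identifies the site as long as it beats the other windows.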
Bacterial Promoter
-35: T82 T84 G78 A65 C54 A45
… (16-18 bp) …
-10: T80 A95 T45 A60 A50 T96
… +1 (A,G)
(The number after each base is the percentage of promoters with that base at that position.)
Alternate sigma factors: CCCTTGAA….CCCGATNT
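The numbers attached to the -35 and -10 consensus bases can drive a crude position-specific score. The sketch below reads them as percent frequencies, and it assumes that the remaining probability at each position is split evenly over the other three bases and that the background is uniform. Those two assumptions are not from the slides, and the function names are hypothetical.

```python
import math

# (consensus base, percent of promoters with that base), as given on the slide.
MINUS_35 = [("T", 82), ("T", 84), ("G", 78), ("A", 65), ("C", 54), ("A", 45)]
MINUS_10 = [("T", 80), ("A", 95), ("T", 45), ("A", 60), ("A", 50), ("T", 96)]
BACKGROUND = 0.25  # assumed uniform base composition

def hexamer_score(hexamer, model):
    """Sum of log2 odds; non-consensus bases share the leftover probability."""
    score = 0.0
    for base, (consensus, pct) in zip(hexamer, model):
        p = pct / 100 if base == consensus else (1 - pct / 100) / 3
        score += math.log2(p / BACKGROUND)
    return score

def best_promoter(seq, spacers=(16, 17, 18)):
    """Return (score, pos_minus35, spacer) for the best -35/-10 pair, or None."""
    seq = seq.upper()
    best = None
    for i in range(len(seq) - 12 - min(spacers) + 1):
        for spacer in spacers:
            j = i + 6 + spacer             # start of the -10 hexamer
            if j + 6 > len(seq):
                continue
            score = (hexamer_score(seq[i:i + 6], MINUS_35)
                     + hexamer_score(seq[j:j + 6], MINUS_10))
            if best is None or score > best[0]:
                best = (score, i, spacer)
    return best

toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGA"
print(best_promoter(toy))
```

Because only the consensus frequency is known for each position, this is weaker than a full PSSM built from aligned promoters, but it already ranks near-consensus sites above random sequence.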
Terminators
Rho-independent
• Stem/loop
– structural only
• 3’-U tail
Rho-dependent
• C-rich
• G-poor
• “loose” consensus
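The stem/loop followed by a 3’-U tail can be searched for directly, since the stem is defined by structure and not by sequence. The sketch below looks for a perfect inverted repeat, a short loop, and a T-rich run on the DNA strand. The stem length, loop range, and tail threshold are illustrative assumptions, and real terminator finders also allow mismatches and score stem stability.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_terminators(seq, stem=6, min_loop=3, max_loop=8, tail=6, min_t=4):
    """Return (stem_start, loop_length) for each perfect stem/loop + T-rich tail."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop - tail + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            right_start = i + stem + loop
            end = right_start + stem
            if end + tail > len(seq):
                break                      # longer loops would not fit either
            pairs = seq[right_start:end] == reverse_complement(left)
            t_rich = seq[end:end + tail].count("T") >= min_t
            if pairs and t_rich:
                hits.append((i, loop))
    return hits

toy = "CC" + "GCCGCC" + "GAAA" + "GGCGGC" + "TTTTTT" + "CC"
print(find_terminators(toy))
```

Rho-dependent terminators have only a loose consensus, so a pattern search like this one does not carry over to them.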
Difficulties in gene prediction
• Frame shifts
– sequencing errors
• Overlapping ORFs
– Rare (a few percent)
• Short ORFs
• Unusual genes
– bp composition
– signal sequences