Gene Structure and Identification Genes and Genomes ORFs and more Consensus Sequences Gene Finding BIO520 Bioinformatics Jim Lund Reading: sections 1.3, 9.1-9.6


Gene Structure and Identification

Genes and Genomes

ORFs and more

Consensus Sequences

Gene Finding

BIO520 Bioinformatics Jim Lund

Reading: sections 1.3, 9.1-9.6

Gene

The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.

Gene-Informatics

Genes are character strings embedded in much larger strings called the genome. A gene usually encodes a protein. Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.

ACGT to Gene

Cells recognize genes from DNA sequence.

Genes

• Protein coding
• RNA genes
– rRNA
– tRNA
– siRNA, miRNA, snRNA, snoRNA…

Genomes

• Genome sequence has only limited use by itself
– Markers, SNPs, etc.
• Functional annotation
– Identify proteins and their functions.
– And regulatory regions, etc.
• Parts list: a source for understanding all biology, ushering in the post-genomic age of biology.

Genomes

2002 Mus musculus 2,700,000,000

3,100,000,000

Characteristics of Protein Coding Genes

• ORF
– long (usually >100 aa)
– similarity to “known” proteins makes a real gene likely
• Basal signals
– Transcription, splicing, translation
• Regulatory signals
– Depend on organism
– Prokaryotes vs. eukaryotes
– Vertebrates vs. fungi, e.g.

Infer Gene Structure: “Gene Model”

Promoter
• Strength
• Regulation

mRNA
• Exons
• Splicing
• Stability
• ORF = protein

Genomes: Gene Content

E. coli

4000 genes × 1 kbp/gene = 4 Mbp

Genome = 4 Mbp!

Genomes: Gene Content

Human

27,148 genes × 2 kbp = 54 Mb mRNA

Introns = 300 Mb?
Regulatory regions = 300 Mb?

2,446 Mb = ?
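The gene-content arithmetic on these two slides can be checked with a few lines of Python. This is only a sketch: the function name `total_mb` is made up for illustration, and the 3,100 Mb human genome size is an assumption taken from the genome-size figure quoted earlier in the deck.

```python
# Back-of-the-envelope gene-content arithmetic from the slides.
# total_mb is a hypothetical helper name; HUMAN_GENOME_MB is an assumption
# (3,100 Mb, the figure that makes the slide's 2,446 Mb remainder work out).

def total_mb(n_genes, kbp_per_gene):
    """Total sequence in Mb for n_genes of average length kbp_per_gene."""
    return n_genes * kbp_per_gene / 1000.0

HUMAN_GENOME_MB = 3100
INTRONS_MB = 300       # slide estimate, marked with "?" in the source
REGULATORY_MB = 300    # slide estimate, marked with "?" in the source

ecoli_genes_mb = total_mb(4000, 1)     # genes alone fill the 4 Mbp genome
human_mrna_mb = total_mb(27148, 2)     # about 54 Mb of mRNA
unaccounted_mb = HUMAN_GENOME_MB - round(human_mrna_mb) - INTRONS_MB - REGULATORY_MB

print(f"E. coli genes: {ecoli_genes_mb:.0f} Mb")
print(f"Human mRNA: {human_mrna_mb:.0f} Mb, unaccounted: {unaccounted_mb} Mb")
```

The contrast is the point of the slides: in E. coli the genes account for essentially the whole genome, while in human most of the sequence is left unexplained by mRNA, introns, and regulatory regions.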

Complex Genome DNA

• ~10% highly repetitive (300 Mb)
– NOT GENES
• ~25% moderately repetitive (750 Mb)
– Some genes
• ~10% exons and introns (354 Mb)
• 55% = ?
– Regulatory regions
– Intergenic regions

Easy problem: Bacterial Gene Finding

• Dense genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
• Complete genomes

E. coli genome

• 4,415 genes
• Average distance between genes: 118 bp
• Average protein length: 318 aa
• 57 proteins longer than 1000 aa
• 318 shorter than 100 aa
• 2,584 operons, 70% contain one gene
• 1.5% repetitive DNA (mostly viral fragments)

Prokaryotic Gene Expression

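[Figure: operon layout: Promoter, Cistron 1, Cistron 2, … Cistron N, Terminator. Transcription (RNA polymerase) produces a single mRNA (5’ to 3’). Translation (ribosome, tRNAs, protein factors) produces polypeptides 1, 2, … N, each with its own N and C terminus.]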

Prokaryotic gene prediction

• ORFs
• Biased nucleotide distribution
– Periodicity of 3
– Codon bias (codon usage statistics)
– Commonly summarized by the Codon Adaptation Index (CAI)
• Signal sequences
• Homology
• Other biological info: for E. coli, partial N-terminal protein sequences.

Prokaryotic signal sequences

• Ribosome binding site (RBS)/Shine-Dalgarno element
– 3-9 purines complementary to sequence at the 3’ end of the 16S rRNA in the small subunit of the ribosome.
– Located 4-7 bp 5’ of the AUG.
• Promoter
– -35 consensus site (TTGACA)
– -10 consensus site (TATAAT)
• Signal peptides
• Regulatory protein binding sites (4 to 8 bp)

ORFs

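Probability that a run of n random codons contains no stop codon (61 of the 64 codons are sense codons, all codons assumed equally likely):

$P(n) = (61/64)^n$

$P(20) = (61/64)^{20} = 0.38$

$P(100) = 0.008$

$P(200) \approx 10^{-4}$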
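The numbers on this slide are easy to reproduce. The snippet below is a minimal sketch under the same assumption the formula makes, namely that all 64 codons are equally likely in random sequence; `p_open` is an illustrative name, not from the source.

```python
# Chance that n consecutive random codons contain no stop codon,
# assuming all 64 codons are equally likely (61 of them are not stops).

def p_open(n_codons):
    """Probability that a random stretch of n_codons stays open."""
    return (61 / 64) ** n_codons

for n in (20, 100, 200):
    print(f"P({n}) = {p_open(n):.2g}")
```

Long open stretches are therefore unlikely to arise by chance, which is why a length cutoff around 100 codons is a useful first filter.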

ORF finding tools

• Artemis
– analyze ORFs
• Testcode (Fickett’s)
• CodonPreference
• ORF Finder (NCBI)
• BCM Search Launcher

ORFs in E. coli

[Figure: ORF map of an E. coli sequence, one track per reading frame (Frame 1, 2, 3, -1, -2, -3).]
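A six-frame ORF scan of the kind shown in the figure can be sketched in plain Python. This is not how any of the listed tools work internally; it is a minimal illustration that assumes ATG as the only start codon and the three standard stops, and it reports coordinates on whichever strand was scanned. All function names are made up for the example.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a real gene finder).
# Assumptions: ATG is the only start codon; TAA/TAG/TGA are the stops;
# minus-frame coordinates are given on the reverse-complement strand.

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) for each ATG...stop ORF in one reading frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first ATG opens the ORF
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3         # end includes the stop codon
            start = None

def six_frame_orfs(seq, min_codons=100):
    """Return (frame, start, end) tuples; frames are 1, 2, 3, -1, -2, -3."""
    seq = seq.upper()
    hits = []
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset, min_codons):
                hits.append((strand * (offset + 1), start, end))
    return hits

demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(six_frame_orfs(demo, min_codons=5))
```

The default `min_codons=100` follows the "usually >100 aa" rule of thumb from the earlier slide; the demo lowers it only so that a toy sequence produces a hit.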

Codon Bias

• Genetic code is degenerate
• Codon usage varies
– Organism to organism
– Gene to gene
• High bias correlates with high-level expression
• Bias correlates with tRNA isoacceptors
• Change bias or tRNAs, change expression

Codon Bias

Gly  GGG  6  0.21
Gly  GGA  6  0.17
Gly  GGT  6  0.38
Gly  GGC  6  0.24

Codon Bias Gene Differences

Codon     GAL4   ADH1
Gly GGG   0.21   0
Gly GGA   0.17   0
Gly GGT   0.38   0.93
Gly GGC   0.24   0.07
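Fractions like those in the table come from counting synonymous codons in a coding sequence. The sketch below shows the counting step for glycine only; `codon_fractions` and the toy sequence are illustrative, not taken from the source.

```python
# Fraction of each synonymous codon among all codons for one amino acid.
# Illustrative sketch: glycine only, input assumed to be an in-frame CDS.
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")

def codon_fractions(cds, codons=GLY_CODONS):
    """Return {codon: fraction} over the given synonymous codon set."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}

# Toy CDS: ATG GGT GGT GGC GGT TAA
print(codon_fractions("ATGGGTGGTGGCGGTTAA"))
```

A strongly skewed result, like the ADH1 column, is the kind of pattern associated with high-level expression on the preceding slide.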

Nucleotide Bias

• Coding DNA vs. non-coding DNA
– often G+C content higher than bulk
• Empirical statistics (Fickett’s TESTCODE)

Useful:
• ORF matches “typical”
– organism, bias
• ORF obscured by STOP codons
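One simple way to look at nucleotide bias is to compare G+C content overall and at each of the three codon positions. The sketch below does only that; it is not Fickett's TESTCODE statistic, and the function names are invented for illustration.

```python
# G+C content overall and per codon position (1st, 2nd, 3rd base).
# Illustrative only; this is NOT the TESTCODE statistic.

def gc_content(seq):
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0

def gc_by_codon_position(orf):
    """Return [GC at position 1, position 2, position 3] for an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3     # ignore a trailing partial codon
    return [gc_content(orf[pos:usable:3]) for pos in range(3)]

print(gc_content("ATGGCC"), gc_by_codon_position("ATGGCC"))
```

In real coding sequence the three positions tend to differ from one another, which is one source of the period-3 signal mentioned under gene prediction.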

We found ORFs, now what?

• Work backwards
– Locate adjacent cistrons
– Locate RBS
– Locate promoter
– Locate terminator
– Locate regulatory sites

Operon Structure

Promoter?

Translation: Ribosome Binding Site, Shine-Dalgarno Site

nnAGGAGGAGGAGGnnnnnATG…

Consensus not always used; example E. coli gene:

nnAaGAGGAaGAGGnnnnATG

(Better represented as a PSSM or an HMM)
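Before moving to a PSSM or HMM, a mismatch-tolerant consensus search already illustrates the idea. The sketch below looks for the best match to an assumed core motif, AGGAGG, ending 4-7 bp upstream of a given ATG. The motif choice, the gap definition (bases between motif end and the A of ATG), and the `min_matches` threshold are all assumptions made for the example.

```python
# Look for a Shine-Dalgarno-like site upstream of a start codon.
# Assumptions (not from the source): core motif AGGAGG, gap counted as the
# bases between the motif's 3' end and the ATG, at least 5 of 6 matches.

def count_matches(a, b):
    return sum(x == y for x, y in zip(a, b))

def find_rbs(seq, atg_pos, motif="AGGAGG", min_gap=4, max_gap=7, min_matches=5):
    """Return (start, matches, gap) of the best upstream site, or None."""
    seq = seq.upper()
    best = None
    for gap in range(min_gap, max_gap + 1):
        start = atg_pos - gap - len(motif)
        if start < 0:
            continue                      # window would run off the sequence
        matches = count_matches(seq[start:start + len(motif)], motif)
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (start, matches, gap)
    return best

demo = "TT" + "AGGAGG" + "ACTTA" + "ATGAAA"   # ATG begins at index 13
print(find_rbs(demo, 13))
```

A position-specific scoring matrix would replace the all-or-nothing match count with per-position weights, which handles sites like the lowercase-mismatch example above more gracefully.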

Bacterial Promoter

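-35 box: T(82) T(84) G(78) A(65) C(54) A(45)
spacer: 16-18 bp
-10 box: T(80) A(95) T(45) A(60) A(50) T(96)
+1: A or G
(numbers in parentheses: percent of promoters with that base at that position)

Alternate sigma factors: CCCTTGAA….CCCGATNT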
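The conservation percentages on this slide can be turned into a crude promoter score: add up the percentage for every position that matches the consensus base, over a -35 box and a -10 box separated by 16-18 bp. This is a rough sketch, not a proper PSSM (which would need frequencies for all four bases at each position), and the names below are invented for illustration.

```python
# Crude promoter scan using the consensus percentages from the slide.
# Score = sum of the slide's percentages at positions matching the consensus.
# Not a true PSSM: non-consensus bases simply score zero (a simplification).

MINUS_35 = [("T", 82), ("T", 84), ("G", 78), ("A", 65), ("C", 54), ("A", 45)]
MINUS_10 = [("T", 80), ("A", 95), ("T", 45), ("A", 60), ("A", 50), ("T", 96)]

def box_score(site, box):
    """Sum the weights of positions where site matches the consensus base."""
    return sum(pct for base, (cons, pct) in zip(site, box) if base == cons)

def best_promoter(seq, spacers=(16, 17, 18)):
    """Return (score, start of -35 box, spacer length) for the best hit, or None."""
    seq = seq.upper()
    best = None
    for i in range(len(seq) - 5):
        for sp in spacers:
            j = i + 6 + sp                # start of the candidate -10 box
            if j + 6 > len(seq):
                continue
            score = box_score(seq[i:i + 6], MINUS_35) + box_score(seq[j:j + 6], MINUS_10)
            if best is None or score > best[0]:
                best = (score, i, sp)
    return best

demo = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(best_promoter(demo))
```

Weighting by conservation means a mismatch at a weakly conserved position (for example the last base of the -35 box) costs less than one at a strongly conserved position (for example the second or last base of the -10 box).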

Terminators

Rho-independent
• Stem/loop
– structural only
• 3’-U tail

Rho-dependent
• C-rich
• G-poor
• “loose” consensus

Difficulties in gene prediction

• Frame shifts
– sequencing errors
• Overlapping ORFs
– Rare (a few percent)
• Short ORFs
• Unusual genes
– bp composition
– signal sequences

Programs for prokaryotic gene prediction

•Glimmer

•ORPHEUS

•GeneMark

•90%+ sensitivity and specificity

•GENSCAN