http://cs273a.stanford.edu [Bejerano Fall09/10] 1 This Friday 10am Beckman B-200 Introduction to the UCSC Browser.


http://cs273a.stanford.edu [Bejerano Fall09/10] 1

This Friday 10am Beckman B-200

Introduction to the UCSC Browser.


Lecture 6

Genome Evolution

Chromosomal Mutations

Paralogy & Orthology

Chains & Nets


One Cell, One Genome, One Replication

Every cell holds a copy of all its DNA = its genome.

The human body is made of ~10^13 cells.

All originate from a single cell through repeated cell divisions.

[Figure: a cell and its genome (genome = all DNA; DNA strings = chromosomes). Through repeated cell division an egg becomes a chicken: chicken ≈ 10^13 copies (DNA) of egg (DNA).]


Mutation Rate per bp

• 10^-9 per base pair per cell division

• This refers to mutations that are not repaired

• Thus, there are at least six new mutations in each kid that were not present in either parent

• Mutations range from the smallest possible (single base pair change) to the largest – whole genome duplication.

• Selection does not tolerate all of these mutations, but it sure does tolerate some.
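The "at least six" figure follows from simple arithmetic. Here is a minimal sketch, assuming a diploid human genome of roughly 6 × 10^9 bp (an assumed round number that is not stated on the slide):

```python
# Rate taken from the slide: unrepaired mutations per base pair per cell division.
MUTATION_RATE = 1e-9

# Assumption (not on the slide): ~3e9 bp haploid human genome, two copies per cell.
DIPLOID_GENOME_BP = 6e9


def expected_new_mutations(cell_divisions: int) -> float:
    """Expected number of unrepaired mutations accumulated over some cell divisions."""
    return MUTATION_RATE * DIPLOID_GENOME_BP * cell_divisions


# A single division already gives about six new mutations.
print(expected_new_mutations(1))
```

Because more than one cell division separates a parent's first cell from the gametes it passes on, one division's worth is a lower bound, which is presumably why the slide says "at least".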

[Figure: chicken → egg → chicken]


Example: Human-Chimp Genomic Differences

[Figure: number of events by type. Categories: nucleotide substitutions; indels < 10 Kb; microinversions < 100 Kb; deletions/duplications; microinversions > 100 Kb; pericentric inversions; fusion. Annotations: 1%, 3%, "Open question.."]


Chromosomal (i.e. Big) Mutations

• May involve:
  – Changing the structure of a chromosome
  – The loss or gain of part of a chromosome


Chromosome Mutations

• Five types exist:
  – Deletion
  – Inversion
  – Translocation
  – Nondisjunction
  – Duplication


Deletion

• Due to breakage
• A piece of a chromosome is lost


Inversion

• Chromosome segment breaks off
• Segment flips around backwards
• Segment reattaches


Duplication

• Occurs when a gene sequence is repeated


Whole Genome Duplication at the Base of the Vertebrate Tree


[Figure label: Xen. laevis WGD]


Translocation

• Involves two chromosomes that aren’t homologous

• Part of one chromosome is transferred to another chromosome
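The structural mutations described so far (deletion, inversion, duplication, translocation) can be pictured as edits to an ordered list of segments. The sketch below is only an illustration: the function names and marker letters are hypothetical, strand orientation of each marker is not tracked, and the duplication is assumed to be tandem (a copy placed right after the original).

```python
def delete(chrom, start, end):
    """Deletion: the segment chrom[start:end] is lost."""
    return chrom[:start] + chrom[end:]


def invert(chrom, start, end):
    """Inversion: the segment breaks off, flips around, and reattaches."""
    return chrom[:start] + chrom[start:end][::-1] + chrom[end:]


def duplicate(chrom, start, end):
    """Duplication (assumed tandem): the segment is repeated right after itself."""
    return chrom[:end] + chrom[start:end] + chrom[end:]


def translocate(donor, recipient, start, end, at):
    """Translocation: donor[start:end] moves into another chromosome at index `at`."""
    segment = donor[start:end]
    return donor[:start] + donor[end:], recipient[:at] + segment + recipient[at:]


chrom1 = list("ABCDE")
chrom2 = list("VWXYZ")

print("".join(delete(chrom1, 1, 3)))      # ADE
print("".join(invert(chrom1, 1, 4)))      # ADCBE
print("".join(duplicate(chrom1, 1, 3)))   # ABCBCDE
new1, new2 = translocate(chrom1, chrom2, 3, 5, 0)
print("".join(new1), "".join(new2))       # ABC DEVWXYZ
```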


Nondisjunction

• Failure of chromosomes to separate during meiosis
• Causes gametes to have too many or too few chromosomes
• Disorders:
  – Down Syndrome – three copies of chromosome 21
  – Turner Syndrome – single X chromosome
  – Klinefelter’s Syndrome – XXY chromosomes


Chromosome Mutation Animation


The Species Tree

[Figure: species tree. Labels: Sampled Genomes; S = Speciation.]


A Gene tree evolves with respect to a Species tree

[Figure: a gene tree drawn against a species tree. Event labels: Speciation, Speciation, Duplication, Loss.]


Terminology

Orthologs: Genes related via speciation (e.g. C,M,H3)

Paralogs: Genes related through duplication (e.g. H1,H2,H3)

Homologs: Genes that share a common origin (e.g. C,M,H1,H2,H3)

[Figure: species tree and gene tree descending from a single ancestral gene. Event labels: Speciation, Speciation, Duplication, Loss.]
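One way to make these definitions concrete is to label each internal node of a gene tree with the event that created it and look at the last common ancestor of two genes. The sketch below assumes that convention. The `Node` class, the helper names, and the tree itself are hypothetical: the tree is chosen to agree with the examples on the slide and is not a transcription of the figure (which also shows a loss event).

```python
class Node:
    def __init__(self, label, event=None, children=()):
        self.label = label              # gene name for leaves
        self.event = event              # "speciation" or "duplication" for internal nodes
        self.children = list(children)


def leaves(node):
    """Set of gene names found below (or at) this node."""
    if not node.children:
        return {node.label}
    found = set()
    for child in node.children:
        found |= leaves(child)
    return found


def last_common_ancestor(node, a, b):
    """Deepest node whose leaf set contains both genes a and b."""
    for child in node.children:
        if {a, b} <= leaves(child):
            return last_common_ancestor(child, a, b)
    return node


def relation(root, a, b):
    """Classify two distinct genes by the event at their last common ancestor."""
    event = last_common_ancestor(root, a, b).event
    return "orthologs" if event == "speciation" else "paralogs"


# Hypothetical gene tree: chicken (C), mouse (M), and three human copies (H1-H3).
human = Node("dup1", "duplication",
             [Node("H1"), Node("dup2", "duplication", [Node("H2"), Node("H3")])])
root = Node("spec1", "speciation",
            [Node("C"), Node("spec2", "speciation", [Node("M"), human])])

print(relation(root, "C", "H3"))   # orthologs
print(relation(root, "M", "H3"))   # orthologs
print(relation(root, "H1", "H3"))  # paralogs
```

All five genes share the root, so they are all homologs regardless of which label applies to a given pair.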


Gene trees and even species trees are figments of our (scientific) imagination

Species trees and gene trees can be wrong.

All we really have are extant observations, and fossils.

[Figure: species tree and gene tree descending from a single ancestral gene (Speciation, Speciation, Duplication, Loss), with parts of the trees marked Observed and Inferred.]


Gene Families

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/orthologs3.gif


Gu et al. (2002). Age distribution of human gene families shows significant roles of both large-scale and small-scale duplication in vertebrate evolution. Nature Genetics 31: 205-208.


Chaining Alignments

Chaining highlights homologous regions between genomes; it bridges the gulf between syntenic blocks and base-by-base alignments.

Local alignments tend to break at transposon insertions, inversions, duplications, etc.

Global alignments tend to force non-homologous bases to align.

Chaining is a rigorous way of joining together local alignments into larger structures.


“Raw” Blastz track (no longer displayed)

Protease Regulatory Subunit 3

Alignment = homologous regions


Chains & Nets: How they’re built

• 1: Blastz one genome to another
  – Local alignment algorithm
  – Finds short blocks of similarity

Hg18: AAAAAACCCCCAAAAA
Mm8:  AAAAAAGGGGGAAAAA

Hg18.1-6 +    AAAAAA    Mm8.1-6 +    AAAAAA
Hg18.7-11 +   CCCCC     Mm8.1-5 -    CCCCC
Hg18.12-16 +  AAAAA     Mm8.1-5 +    AAAAA
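The middle block is reported on the minus strand because human CCCCC only matches mouse GGGGG after reverse complementing. Here is a minimal sketch of that check, using the 16-base toy mouse sequence as it appears on the chaining slide. This is not how Blastz itself works, and `reverse_complement` is a helper defined here for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Read a DNA string from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


hg18 = "AAAAAACCCCCAAAAA"
mm8 = "AAAAAAGGGGGAAAAA"

# Bases 7-11 (1-based, inclusive) of each toy sequence.
human_block = hg18[6:11]
mouse_block = mm8[6:11]

print(human_block == mouse_block)                      # False: no match on the plus strand
print(human_block == reverse_complement(mouse_block))  # True: match on the minus strand
```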


Chains & Nets: How they’re built

• 2: “Chain” alignment blocks together
  – Links blocks that preserve order and orientation
  – Not single coverage in either species

Hg18: AAAAAACCCCCAAAAA
Mm8:  AAAAAAGGGGGAAAAA

Hg18: AAAAAACCCCCAAAAA
Mm8 chains:
  Mm8.1-6 +
  Mm8.7-11 -
  Mm8.12-16 +
  Mm8.12-15 +    Mm8.1-5 +


Another Chain Example

[Figure: an ancestral sequence with segments A B C D E, and the human and mouse sequences derived from it, including a copy B’. Panel "In Human Browser": implicit human sequence with mouse chains drawn against it. Panel "In Mouse Browser": implicit mouse sequence with human chains drawn against it.]


Chains join together related local alignments

Protease Regulatory Subunit 3

likely ortholog

likely paralogs


Chains

• A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
• Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
• Not just orthologs, but paralogs too, can result in good chains. But that's useful!
• Chains should be symmetrical -- e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
• Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
• Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).

[Angie Hinrichs, UCSC wiki]
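The first three properties can be checked mechanically. The sketch below assumes a hypothetical block representation, `(target_start, query_start, size)` with half-open coordinates already expressed on the chain's strand. It is an illustration of the invariants, not the UCSC chain file format or tooling.

```python
from typing import List, Tuple

# Hypothetical representation: (target_start, query_start, size), gapless, half-open.
Block = Tuple[int, int, int]


def is_valid_chain(blocks: List[Block]) -> bool:
    """True if no block overlaps or precedes its predecessor in target or query."""
    for (t1, q1, size1), (t2, q2, _) in zip(blocks, blocks[1:]):
        if t2 < t1 + size1 or q2 < q1 + size1:
            return False
    return True


def gap_kinds(blocks: List[Block]) -> List[str]:
    """Label each gap between consecutive blocks: none, target, query, or double."""
    kinds = []
    for (t1, q1, size1), (t2, q2, _) in zip(blocks, blocks[1:]):
        t_gap = t2 - (t1 + size1)
        q_gap = q2 - (q1 + size1)
        if t_gap > 0 and q_gap > 0:
            kinds.append("double")   # unaligned sequence on both sides
        elif t_gap > 0:
            kinds.append("target")
        elif q_gap > 0:
            kinds.append("query")
        else:
            kinds.append("none")
    return kinds


chain = [(0, 0, 6), (10, 8, 5), (15, 20, 5)]
print(is_valid_chain(chain))                   # True
print(gap_kinds(chain))                        # ['double', 'query']
print(is_valid_chain([(0, 0, 6), (4, 8, 5)]))  # False: target coords overlap
```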


Before and After Chaining


Chaining Algorithm

Input: blocks of gapless alignments from blastz.

Dynamic program based on the recurrence relationship:

$$\mathrm{score}(B_i) = \max_{j<i} \bigl( \mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j) \bigr)$$

Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
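The recurrence above can be sketched with a plain quadratic dynamic program. This is not the KD-tree method the slide refers to: it checks every pair of blocks, so it runs in O(N^2) and is only suitable for toy inputs. The block representation, the match score (block size), the linear `gap_cost`, and the base case (a block may start a chain on its own) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (target_start, query_start, size), gapless, half-open.
Block = Tuple[int, int, int]


def gap_cost(prev: Block, cur: Block) -> float:
    """Assumed linear gap penalty; the real cost function is not given on the slide."""
    t_gap = cur[0] - (prev[0] + prev[2])
    q_gap = cur[1] - (prev[1] + prev[2])
    return 0.1 * (t_gap + q_gap)


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return the score and blocks of the highest-scoring chain."""
    if not blocks:
        return 0.0, []
    blocks = sorted(blocks)                # order by target start, so predecessors have j < i
    score = [float(b[2]) for b in blocks]  # base case: match(B_i) alone, with match = size
    back = [-1] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # prev can precede cur only if it ends before cur starts in both genomes
            if prev[0] + prev[2] <= cur[0] and prev[1] + prev[2] <= cur[1]:
                candidate = score[j] + cur[2] - gap_cost(prev, cur)
                if candidate > score[i]:
                    score[i], back[i] = candidate, j
    i = max(range(len(blocks)), key=score.__getitem__)
    total, chain = score[i], []
    while i != -1:                         # follow back-pointers to recover the chain
        chain.append(blocks[i])
        i = back[i]
    return total, chain[::-1]


total, chain = best_chain([(0, 0, 6), (12, 12, 5), (6, 40, 5)])
print(round(total, 2), chain)
```

In this toy input the block at query position 40 is skipped, because reaching it costs more in gap penalty than linking the two blocks that stay close to the diagonal.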


Netting Alignments

Multiple mouse alignments can commonly be found for a given human region, particularly for coding regions.

Net finds the best mouse match for each human region. Highest-scoring chains are used first. Lower-scoring chains fill in gaps within chains, inducing a natural hierarchy.


Chains & Nets: How they’re built

• 3: “Net” the chains heuristically to find “best guess” of orthologs
  – Pick highest-scoring chains that do not overlap chains already added to net
  – Single coverage in target (reference), not in query
  – Not symmetrical
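The greedy idea in step 3 can be sketched on a toy target. This is a per-base simplification under stated assumptions, not the UCSC netter: chains are hypothetical `(name, score, target_blocks)` tuples, only target coordinates are tracked, and a lower-scoring chain simply claims whatever target bases are still unclaimed.

```python
from itertools import groupby


def net_chains(chains, target_len):
    """Assign each target base to the highest-scoring chain that covers it."""
    owner = [None] * target_len
    # Highest-scoring chains are placed first; later chains only fill what is left.
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        for start, end in blocks:
            for pos in range(start, end):
                if owner[pos] is None:
                    owner[pos] = name
    return owner


def net_runs(owner):
    """Compress the per-base assignment into (start, end, chain_name) runs."""
    runs, pos = [], 0
    for name, group in groupby(owner):
        length = len(list(group))
        runs.append((pos, pos + length, name))
        pos += length
    return runs


chains = [
    ("ortholog", 900, [(0, 4), (8, 12)]),  # high-scoring chain with a gap at 4-8
    ("paralog", 300, [(2, 10)]),           # lower-scoring chain overlapping it
]
print(net_runs(net_chains(chains, 12)))
# [(0, 4, 'ortholog'), (4, 8, 'paralog'), (8, 12, 'ortholog')]
```

Every target base ends up with at most one owner, which is the single-coverage-in-target property. Nothing constrains how often a query region is used, so swapping target and query would generally give a different result.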


Net Focuses on Ortholog


Nets

• A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels.
• A net is single-coverage for target but not for query.
• Because it's single-coverage in the target, it's no longer symmetrical.
• The netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
• Nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.

[Angie Hinrichs, UCSC wiki]


Before and After Netting