Introduction to Bioinformatics

Variation Within and Between Species

Case study: Are Neanderthals still among us?

LECTURE 5: Variation within and between species

* Chapter 5: Are Neanderthals among us?

Neandertal, Germany, 1856

Initial interpretations:

* bear skull

* pathological idiot

* Old Dutchman ...

5.1 Variation in DNA sequences

* Even closely related individuals differ in genetic sequences

* (Point) mutations: copy errors at a certain location

* Sexual reproduction – diploid genome

Diploid chromosomes

Mitosis: diploid reproduction

Meiosis: diploid (=double) → haploid (=single)

* Typing error rate of a very good typist: 1 error per 1,000 typed letters

* All our diploid cells constantly reproduce 7 billion letters

* Typical cell copying error rate is ~1 error per 1 Gbp
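As a rough back-of-the-envelope check of these figures, the sketch below multiplies the number of copied letters by each error rate. It takes the round numbers from the bullets at face value; the constant and function names are illustrative and not part of the lecture.

```python
# Round numbers taken from the bullets above; all names are illustrative.
GENOME_LETTERS = 7_000_000_000        # letters copied per cell division
TYPIST_ERROR_RATE = 1 / 1_000         # errors per typed letter
CELL_ERROR_RATE = 1 / 1_000_000_000   # errors per copied base pair (1 per Gbp)


def expected_errors(letters: int, error_rate: float) -> float:
    """Expected number of errors when copying `letters` symbols."""
    return letters * error_rate


typist_errors = expected_errors(GENOME_LETTERS, TYPIST_ERROR_RATE)
cell_errors = expected_errors(GENOME_LETTERS, CELL_ERROR_RATE)

print(f"typist: {typist_errors:,.0f} errors per genome copy")
print(f"cell:   {cell_errors:.0f} errors per genome copy")
```

With these numbers a very good typist would make millions of errors per genome copy, while the cell makes only a handful.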

GERM LINE

Reverse time and follow your cells:

• Now you count ~10^13 cells

• One generation ago you had 2 cells ‘somewhere’ in your parents’ bodies

• Small T generations ago you had (2^T – multiple ancestors) cells

• Large T generations ago you counted #(fertile ancestors) cells

• Congratulations: you are 3.4 billion years old!!!

Fast-forward time and follow your cells:

• Only a few cells in your reproductive organs have a chance to live on in the next generations

• The rest (including you) will die …

GERM LINE MUTATIONS

This potentially immortal lineage of (germ) cells is called the GERM LINE

All mutations that we have accumulated are en route on the germ line

* Polymorphism: multiple possibilities for a nucleotide; each variant is an allele

* Single Nucleotide Polymorphism, SNP (“snip”): a point mutation, for example AAATAAA vs AAACAAA

* Humans: SNP = 1/1500 bases = 0.067%

* STR = Short Tandem Repeats (microsatellites), for example CACACACACACACACACA …

* Transition - transversion

Purines – Pyrimidines

Transitions – Transversions
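The distinction can be made concrete with a small classifier. This is a minimal sketch: a transition stays within one chemical class (purine A, G or pyrimidine C, T), a transversion switches class. The function name `substitution_type` is an illustrative choice, not something from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(old: str, new: str) -> str:
    """Classify a point substitution as 'transition' or 'transversion'."""
    old, new = old.upper(), new.upper()
    valid = PURINES | PYRIMIDINES
    if old not in valid or new not in valid:
        raise ValueError("expected nucleotides from A, C, G, T")
    if old == new:
        raise ValueError("not a substitution: both nucleotides are identical")
    # Same chemical class (both purines or both pyrimidines) means transition.
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(substitution_type("T", "C"))  # the SNP example AAATAAA vs AAACAAA
print(substitution_type("G", "T"))
```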

5.2 Mitochondrial DNA

* Mitochondria are inherited only via the maternal line!

* Very suitable for comparing evolution: mtDNA is not reshuffled by recombination

H. sapiens mitochondrion

EM photograph of H. sapiens mtDNA

5.3 Variation between species

* Genetic variation accounts for morphological, physiological and behavioral variation

* Genetic variation (i.e. genetic distance) relates to phylogenetic relationship

* Hence the need to measure distances between sequences: a metric

Substitution rate

* Mutations originate in single individuals

* Mutations can become fixed in a population

* Mutation rate: rate at which new mutations arise

* Substitution rate: rate at which a species fixes new mutations

* For neutral mutations

Substitution rate and mutation rate

* For neutral mutations

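* ρ = 2Nμ · 1/(2N) = μ (2Nμ new mutations arise per generation in a diploid population of size N, and each is fixed with probability 1/(2N))

* ρ = K/(2T) (K: genetic distance between two species; T: time since they diverged; the factor 2 counts both lineages)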
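The second relation translates directly into a one-line estimate. The sketch below uses hypothetical values for K and T, chosen only for illustration; the function name is likewise an assumption.

```python
def substitution_rate(K: float, T: float) -> float:
    """rho = K / (2T): substitutions per site per unit of time."""
    if T <= 0:
        raise ValueError("divergence time T must be positive")
    return K / (2 * T)


# Hypothetical numbers, for illustration only:
# K = 0.02 substitutions per site, T = 5 million years since divergence.
rho = substitution_rate(K=0.02, T=5_000_000)
print(f"rho = {rho:.1e} substitutions per site per year")
```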

5.4 Estimating genetic distance

* Substitutions are independent (?)

* Substitutions are random

* Multiple substitutions may occur

* Back-mutations mutate a nucleotide back to an earlier value

Multiple substitutions and back-mutations conceal the real genetic distance

GACTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCATCGGAACTGATCGT
GTCTGATCCACCTCTGATCCATTGGAACTGATCGT

observed: 2 (= d)

actual: 4 (= K)

* Saturation: on average one substitution per site

* Two random sequences of equal length will match for approximately ¼ of their sites

* In saturation, therefore, the observed proportional genetic distance is ¾ (only ¼ of the sites still match)
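The ¼ match rate for random sequences can be checked with a quick simulation. This is a sketch assuming all four nucleotides are equally likely; the helper names and the seed are illustrative. The complementary fraction of differing sites then comes out near ¾.

```python
import random


def random_sequence(n: int, rng: random.Random) -> str:
    """Random DNA sequence of length n with equally likely nucleotides."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def match_fraction(a: str, b: str) -> float:
    """Fraction of aligned positions at which a and b carry the same letter."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000
frac = match_fraction(random_sequence(n, rng), random_sequence(n, rng))
print(f"matching sites:  {frac:.3f}")
print(f"differing sites: {1 - frac:.3f}")
```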

* True genetic distance (proportion): K

* Observed proportion of differences: d

* Due to multiple substitutions and back-mutations: K ≥ d
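The observed proportion d is simply the fraction of aligned sites that differ. A minimal sketch, assuming the two sequences are already aligned and of equal length (the function name is illustrative):

```python
def observed_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ (d)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# The SNP example from section 5.1: one difference in seven sites.
d = observed_distance("AAATAAA", "AAACAAA")
print(f"d = {d:.3f}")
```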

SEQUENCE EVOLUTION is a Markov process: a sequence at generation (= time) t depends only on the sequence at generation t-1

The Jukes-Cantor model

Correction for multiple substitutions

Substitution probability per site per second is α

Substitution means there are 3 possible replacements (e.g. C → {A,G,T})

Non-substitution means there is 1 possibility (e.g. C → C)

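Therefore, the one-step Markov process has the following transition matrix (rows and columns ordered A, C, G, T):

$$
M_{JC} =
\begin{pmatrix}
1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
\alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
\alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
\alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{pmatrix}
$$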
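The matrix can be built and sanity-checked in a few lines. This sketch uses plain Python lists; the value α = 0.03 is an arbitrary illustration, and the function name is an assumption. Each row should sum to 1, since a nucleotide either stays or becomes one of the three others.

```python
NUCLEOTIDES = "ACGT"


def jukes_cantor_matrix(alpha: float) -> list[list[float]]:
    """One-step Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be a probability")
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]


M = jukes_cantor_matrix(0.03)  # alpha = 0.03 is an arbitrary illustrative value
for nucleotide, row in zip(NUCLEOTIDES, M):
    print(nucleotide, [round(p, 3) for p in row], "row sum:", round(sum(row), 12))
```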

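After t generations the substitution probability is:

$$M(t) = M_{JC}^{\,t}$$

Eigenvalues and eigenvectors of $M_{JC}$ (eigenvectors normalised to unit length and mutually orthogonal, as the spectral decomposition requires):

$$\lambda_1 = 1 \ (\text{multiplicity 1}): \quad v_1 = \tfrac{1}{2}\,(1\ \ 1\ \ 1\ \ 1)^T$$

$$\lambda_{2..4} = 1 - \tfrac{4\alpha}{3} \ (\text{multiplicity 3}): \quad v_2 = \tfrac{1}{2}\,(-1\ \ -1\ \ 1\ \ 1)^T$$

$$v_3 = \tfrac{1}{2}\,(-1\ \ 1\ \ 1\ \ -1)^T$$

$$v_4 = \tfrac{1}{2}\,(1\ \ -1\ \ 1\ \ -1)^T$$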

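Spectral decomposition of M(t):

$$M_{JC}^{\,t} = \sum_i \lambda_i^{\,t}\, v_i v_i^T$$

Write M(t) as:

$$
M_{JC}^{\,t} =
\begin{pmatrix}
r(t) & s(t) & s(t) & s(t) \\
s(t) & r(t) & s(t) & s(t) \\
s(t) & s(t) & r(t) & s(t) \\
s(t) & s(t) & s(t) & r(t)
\end{pmatrix}
$$

Therefore, the probability s(t) of each particular substitution (e.g. A → C) per site after t generations is:

$$s(t) = \tfrac{1}{4} - \tfrac{1}{4}\left(1 - \tfrac{4\alpha}{3}\right)^{t}$$

and r(t) = 1 - 3 s(t), because each row sums to 1.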
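The closed form for s(t) can be checked numerically against a brute-force matrix power. The sketch below multiplies the one-step matrix t times using plain Python lists; α = 0.01 and t = 50 are arbitrary illustrative values, and all function names are assumptions.

```python
def jc_matrix(alpha: float) -> list[list[float]]:
    """One-step Jukes-Cantor matrix."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]


def matmul(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def matrix_power(M: list[list[float]], t: int) -> list[list[float]]:
    """M multiplied by itself t times (t >= 0), starting from the identity."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = matmul(result, M)
    return result


def s_closed_form(alpha: float, t: int) -> float:
    """s(t) = 1/4 - 1/4 * (1 - 4*alpha/3)**t."""
    return 0.25 - 0.25 * (1 - 4 * alpha / 3) ** t


alpha, t = 0.01, 50  # arbitrary illustrative values
Mt = matrix_power(jc_matrix(alpha), t)
print(f"off-diagonal of M^t: {Mt[0][1]:.6f}   closed form s(t): {s_closed_form(alpha, t):.6f}")
print(f"diagonal of M^t:     {Mt[0][0]:.6f}   1 - 3 s(t):        {1 - 3 * s_closed_form(alpha, t):.6f}")
```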

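Probability s(t) of each particular substitution per site after t generations:

$$s(t) = \tfrac{1}{4} - \tfrac{1}{4}\left(1 - \tfrac{4\alpha}{3}\right)^{t}$$

A site is observed to differ if any of the 3 possible substitutions has occurred, so the observed genetic distance d after t generations ≈ 3 s(t):

$$d = \tfrac{3}{4} - \tfrac{3}{4}\left(1 - \tfrac{4\alpha}{3}\right)^{t}$$

For small α, using ln(1 - 4α/3) ≈ -4α/3:

$$\alpha t \approx -\tfrac{3}{4}\,\ln\left(1 - \tfrac{4}{3}\,d\right)$$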

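For small α the observed genetic distance d satisfies:

$$\alpha t \approx -\tfrac{3}{4}\,\ln\left(1 - \tfrac{4}{3}\,d\right)$$

The actual genetic distance is (of course):

K = αt

So:

$$K \approx -\tfrac{3}{4}\,\ln\left(1 - \tfrac{4}{3}\,d\right)$$

This is the Jukes-Cantor formula: independent of α and t.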

The Jukes-Cantor formula:

$$K \approx -\tfrac{3}{4}\,\ln\left(1 - \tfrac{4}{3}\,d\right)$$

For small d, using ln(1+x) ≈ x: K ≈ d. So: actual distance ≈ observed distance.

For saturation: d ↑ ¾: K → ∞. So: if the observed distance corresponds to the distance between random sequences, the actual distance becomes indeterminate.
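Both limits are easy to see numerically. The following is a minimal sketch of the correction K = -¾ ln(1 - 4d/3); the function name and the sample values of d are illustrative choices.

```python
import math


def jukes_cantor_distance(d: float) -> float:
    """K = -3/4 * ln(1 - 4d/3) for an observed proportion of differences d."""
    if not 0 <= d < 0.75:
        raise ValueError("d must lie in [0, 3/4); at d >= 3/4 the sequences are saturated")
    return -0.75 * math.log(1 - 4 * d / 3)


# For small d the correction is negligible; close to 3/4 it grows without bound.
for d in (0.01, 0.1, 0.3, 0.6, 0.74):
    print(f"d = {d:.2f}  ->  K = {jukes_cantor_distance(d):.3f}")
```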

Jukes-Cantor

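Variance in K

If K = f(d) then:

$$\delta K = \frac{\partial K}{\partial d}\,\delta d \;\Rightarrow\; \delta K^2 = \left(\frac{\partial K}{\partial d}\right)^{2} \delta d^2$$

So:

$$\mathrm{Var}(K) = \left(\frac{\partial K}{\partial d}\right)^{2} \mathrm{Var}(d)$$

Generation of a sequence of length n with substitution rate d is a binomial process:

$$\mathrm{Prob}(k) = \binom{n}{k}\, d^{k} (1-d)^{n-k}$$

and therefore with variance: Var(d) = d(1-d)/n

Because of the Jukes-Cantor formula:

$$\frac{\partial K}{\partial d} = \frac{1}{1 - \tfrac{4}{3}\,d}$$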

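Variance in K

Variance: Var(d) = d(1-d)/n

Jukes-Cantor:

$$\frac{\partial K}{\partial d} = \frac{1}{1 - \tfrac{4}{3}\,d}$$

So:

$$\mathrm{Var}(K) \approx \frac{d\,(1-d)}{n\left(1 - \tfrac{4}{3}\,d\right)^{2}}$$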
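To get a feeling for how quickly the uncertainty grows towards saturation, the variance formula can be evaluated for a few values of d. This is a sketch; n = 1000 and the sample values of d are illustrative, and the function name is an assumption.

```python
import math


def jc_variance(d: float, n: int) -> float:
    """Var(K) ~ d(1-d) / (n * (1 - 4d/3)^2) for sequence length n."""
    if n <= 0:
        raise ValueError("sequence length n must be positive")
    if not 0 <= d < 0.75:
        raise ValueError("d must lie in [0, 3/4)")
    return d * (1 - d) / (n * (1 - 4 * d / 3) ** 2)


n = 1000  # illustrative sequence length
for d in (0.05, 0.3, 0.6, 0.7):
    sd = math.sqrt(jc_variance(d, n))
    print(f"d = {d:.2f}  ->  standard deviation of K = {sd:.3f}")
```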

Var(K)

EXAMPLE 5.4 on page 90

* Create artificial data with n = 1000: generate K* mutations

* Count d

* With Jukes-Cantor relation reconstruct estimate K(d)

* Plot K(d) – K*
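The steps above can be sketched as a small simulation. It rests on assumptions not stated on the slide: K* is read as the true number of substitutions per site (so K*·n substitution events are applied at random sites), every substitution picks one of the three other nucleotides with equal probability, and the differences K(d) – K* are printed instead of plotted. All names and the seed are illustrative.

```python
import math
import random


def simulate_observed_distance(n: int, k_star: float, rng: random.Random) -> float:
    """Apply round(k_star * n) random substitutions and return the observed d."""
    ancestor = [rng.choice("ACGT") for _ in range(n)]
    descendant = ancestor[:]
    for _ in range(round(k_star * n)):
        site = rng.randrange(n)
        # A substitution always changes the nucleotide currently at the site.
        others = [base for base in "ACGT" if base != descendant[site]]
        descendant[site] = rng.choice(others)
    return sum(a != b for a, b in zip(ancestor, descendant)) / n


def jukes_cantor_distance(d: float) -> float:
    """K = -3/4 * ln(1 - 4d/3); only defined for d < 3/4."""
    return -0.75 * math.log(1 - 4 * d / 3)


rng = random.Random(42)  # fixed seed so the run is reproducible
n = 1000
for k_star in (0.1, 0.5, 1.0, 1.5):
    d = simulate_observed_distance(n, k_star, rng)
    k_est = jukes_cantor_distance(d)
    print(f"K* = {k_star:.1f}  d = {d:.3f}  K(d) = {k_est:.3f}  K(d) - K* = {k_est - k_star:+.3f}")
```

The sketch does not guard against a simulated d of ¾ or more, which would make the logarithm undefined; for the values of K* used here that is very unlikely.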

EXAMPLE 5.4 on page 90 (= FIG 5.3)

The Kimura 2-parameter model

Include substitution bias in correction factor

Transition probability (G↔A and T↔C) per site per second is α

Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per second is β

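The one-step Markov process substitution matrix now becomes (rows and columns ordered A, C, G, T; β is the probability of each individual transversion, so the diagonal is 1-α-2β and every row sums to 1):

$$
M_{K2P} =
\begin{pmatrix}
1-\alpha-2\beta & \beta & \alpha & \beta \\
\beta & 1-\alpha-2\beta & \beta & \alpha \\
\alpha & \beta & 1-\alpha-2\beta & \beta \\
\beta & \alpha & \beta & 1-\alpha-2\beta
\end{pmatrix}
$$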
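A short sketch that builds this matrix from the transition/transversion rule and checks that each row is a probability distribution. It assumes β is the probability of each individual transversion, so the diagonal entry that makes a row sum to 1 is 1 - α - 2β; the numeric values and names are illustrative.

```python
NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def kimura_matrix(alpha: float, beta: float) -> list[list[float]]:
    """One-step Kimura 2-parameter matrix, rows and columns ordered A, C, G, T."""
    if alpha < 0 or beta < 0 or alpha + 2 * beta > 1:
        raise ValueError("need alpha, beta >= 0 and alpha + 2*beta <= 1")
    matrix = []
    for x in NUCLEOTIDES:
        row = []
        for y in NUCLEOTIDES:
            if x == y:
                row.append(1 - alpha - 2 * beta)  # no substitution
            elif (x, y) in TRANSITIONS:
                row.append(alpha)                 # transition
            else:
                row.append(beta)                  # transversion
        matrix.append(row)
    return matrix


M = kimura_matrix(alpha=0.02, beta=0.005)  # illustrative values
for nucleotide, row in zip(NUCLEOTIDES, M):
    print(nucleotide, [round(p, 3) for p in row], "row sum:", round(sum(row), 12))
```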

After t generations the substitution probability is:

$$M(t) = M_{K2P}^{\,t}$$

Determine the eigenvalues $\{\lambda_i\}$ and eigenvectors $\{v_i\}$ of M(t).

Spectral decomposition of M(t):

$$M_{K2P}^{\,t} = \sum_i \lambda_i^{\,t}\, v_i v_i^T$$

Determine fraction of transitions per site after t generations: P(t)

Determine fraction of transversions per site after t generations: Q(t)

Genetic distance: K ≈ -½ ln(1 - 2P - Q) - ¼ ln(1 - 2Q)

Fraction of substitutions d = P + Q → Jukes-Cantor
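The distance formula can be sketched as a function of the observed fractions P and Q. As a consistency check, when all three substitutions are equally likely (P = d/3, Q = 2d/3) it should reproduce the Jukes-Cantor value for d = P + Q. The function names and the numeric inputs are illustrative assumptions.

```python
import math


def kimura_distance(P: float, Q: float) -> float:
    """K = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), P = transitions, Q = transversions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if P < 0 or Q < 0 or a <= 0 or b <= 0:
        raise ValueError("P and Q are outside the range where the formula is defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def jukes_cantor_distance(d: float) -> float:
    """K = -3/4 ln(1 - 4d/3)."""
    return -0.75 * math.log(1 - 4 * d / 3)


d = 0.3
print(f"no bias (P = d/3, Q = 2d/3): K2P = {kimura_distance(d / 3, 2 * d / 3):.4f}, "
      f"JC = {jukes_cantor_distance(d):.4f}")
print(f"transition bias (P = 0.2, Q = 0.1): K2P = {kimura_distance(0.2, 0.1):.4f}")
```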

Other models for nucleotide evolution

* Different types of transitions/transversions

* Pairwise substitutions: the GTR (= General Time Reversible) model

* Amino-acid substitution matrices

* …

SHORTCOMING:

All models above assume symmetric substitution probabilities:

prob(A→T) = prob(T→A)

There is now strong evidence that this assumption is not true

Challenge: incorporate this in a self-consistent model

5.5 CASE STUDY: Neanderthals

* mtDNA of 206 H. sapiens from different regions

* Fragments of mtDNA of 2 H. neanderthalensis, including the original 1856 specimen

* All 208 samples come from GenBank

* A homologous sequence of 800 bp of the HVR (hypervariable region) could be found in all 208 specimens

* Pairwise genetic difference, corrected with the Jukes-Cantor formula

* d(i,j) is the JC-corrected genetic difference between pair (i,j)

* d^T = d (the distance table is symmetric)

* MDS (Multi-Dimensional Scaling): translate the distance table d to an n-dimensional map X, here a 2D map
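The MDS step can be illustrated with a small sketch of classical (Torgerson) MDS, which may differ from the MDS variant used for the lecture's figure. It uses only the standard library and finds the leading eigenvectors by power iteration, which is adequate for small, nearly Euclidean distance tables. The four-sample distance table and the labels are hypothetical, not the GenBank data.

```python
import math


def classical_mds(dist: list[list[float]], dims: int = 2, iterations: int = 500) -> list[list[float]]:
    """Embed n points in `dims` dimensions from a symmetric distance table."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    def apply(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    def normalise(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm > 1e-12 else None

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = normalise([float(i + 1) for i in range(n)])  # deterministic start vector
        for _ in range(iterations):
            w = normalise(apply(B, v))
            if w is None:  # remaining matrix is numerically zero
                break
            v = w
        eigenvalue = sum(x * y for x, y in zip(v, apply(B, v)))  # Rayleigh quotient
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate: remove the component just found before the next dimension.
        B = [[B[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Hypothetical JC-corrected distances: three close samples and one distant one.
labels = ["sapiens_1", "sapiens_2", "sapiens_3", "neanderthal_1"]
d = [
    [0.000, 0.010, 0.012, 0.080],
    [0.010, 0.000, 0.011, 0.082],
    [0.012, 0.011, 0.000, 0.081],
    [0.080, 0.082, 0.081, 0.000],
]
for label, (x, y) in zip(labels, classical_mds(d)):
    print(f"{label:14s} x = {x:+.4f}  y = {y:+.4f}")
```

In the resulting 2D map the three close samples should cluster together while the distant sample lies far from them along the first axis.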

distance map d(i,j)

MDS map: H. sapiens and H. neanderthalensis are well-separated

Phylogenetic tree

END of LECTURE 5
