CAI and the most biased genes

Zinovyev Andrei

Institut des Hautes Études Scientifiques

Introduction

For bacterial genomes the main source of heterogeneity of the genetic text is the signal corresponding to the presence of coding information

Mutual information in three consecutive letters

$$M = \sum_{ijk} f_{ijk} \log_2 \frac{f_{ijk}}{p_i p_j p_k}$$

$f_{ijk}$ - frequency of triplet ijk

$p_i$ - frequency of letter i
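The mutual information of three consecutive letters can be estimated directly from a sequence. The sketch below is only an illustration: the function name `triplet_mutual_information`, the `frame` parameter, and the choice to count non-overlapping triplets in a single reading frame are assumptions of this example, not details given on the slide.

```python
from collections import Counter
from math import log2


def triplet_mutual_information(seq, frame=0):
    """Estimate M = sum_ijk f_ijk * log2(f_ijk / (p_i * p_j * p_k))."""
    seq = seq.upper()
    # Assumption: non-overlapping triplets read in one frame.
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Skip triplets containing ambiguous letters such as N.
    triplets = [t for t in triplets if set(t) <= set("ACGT")]
    if not triplets:
        return 0.0

    triplet_counts = Counter(triplets)
    letter_counts = Counter("".join(triplets))
    n_triplets = len(triplets)
    n_letters = 3 * n_triplets

    m = 0.0
    for triplet, count in triplet_counts.items():
        f_ijk = count / n_triplets
        # Expected triplet frequency if the three letters were independent.
        expected = 1.0
        for letter in triplet:
            expected *= letter_counts[letter] / n_letters
        m += f_ijk * log2(f_ijk / expected)
    return m
```

Comparing the value obtained in each of the three frames of a coding region with the value for a non-coding region is one way to see the coding signal in practice.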

Example: Codon bias in E. coli

[Figure panels: overall codon usage; highly expressed genes]

Different types of codon bias

Translational (mainly fast-growing bacteria)

GC-rich (or AT-rich) codons are preferred

Codons with G and C (or A and T) in the 3rd position are preferred

Influenced by GC-skew, (G-C)/(G+C), or AT-skew

Influenced by strand (leading or lagging)

Codon bias connected with genes from other organisms (horizontally transferred)
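Two of the compositional quantities listed above are simple to compute. This is a minimal sketch with hypothetical helper names (`gc_skew`, `gc3`). Note that `gc3` counts every third codon position, which is a simplification of the GC3s statistic (GC3s is restricted to synonymous third positions).

```python
def gc_skew(seq):
    """GC-skew (G - C) / (G + C) of a sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def gc3(cds):
    """Fraction of G or C at the third position of each complete codon."""
    cds = cds.upper()
    # Positions 2, 5, 8, ... are the third letters of complete codons.
    third_positions = cds[2::3]
    if not third_positions:
        return 0.0
    return sum(base in "GC" for base in third_positions) / len(third_positions)
```

For example, `gc_skew("GGGC")` evaluates to 0.5, because G exceeds C by two out of four G or C letters.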

Questions

How is the codon usage of different genes in different genomes organized?

How can codon bias be described quantitatively?

How can the main source of codon bias be detected?

Qualitative study of codon usage

We can describe every gene by its codon frequencies – a vector with 64 components (59 of them are informative for codon bias: the three stop codons and the single-codon amino acids Met and Trp carry no synonymous choice)

PCA (principal component analysis) and CA (correspondence analysis) are the most common techniques for exploratory study of codon usage

Close points – genes with similar codon usage
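A gene's codon usage vector can be built as follows. This is a sketch under stated assumptions: it uses the standard genetic code, normalizes over the 59 informative codons only, and the names `INFORMATIVE_CODONS` and `codon_usage_vector` are chosen for this example. The resulting vectors, one per gene, are the input that a PCA or CA routine would take.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Drop stop codons (*), Met (M) and Trp (W): 64 - 3 - 1 - 1 = 59 codons.
INFORMATIVE_CODONS = sorted(
    codon for codon, aa in GENETIC_CODE.items() if aa not in "*MW"
)


def codon_usage_vector(cds):
    """Return the 59 codon frequencies of one coding sequence."""
    cds = cds.upper()
    counts = dict.fromkeys(INFORMATIVE_CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    total = sum(counts.values())
    # Assumption: frequencies are normalized over informative codons only.
    return [counts[c] / total if total else 0.0 for c in INFORMATIVE_CODONS]
```

Two genes with similar vectors end up as close points after projection, which is what the plots in this section rely on.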

Common pattern of fast-growing bacteria

[Figure: regions I, II, III and IV]

Genes of class I (most genes)

Genes of class II (highly expressed)

Genes of class III (unusual)

Genes of class IV (hydrophobic)

Typical case of fast-growing bacterium:

Bacillus subtilis

[Figure: genes of class I (most genes), class II (highly expressed), class III (unusual) and class IV (hydrophobic)]

Escherichia coli

[Figure: genes of class I (most genes), class II (highly expressed), class III (unusual) and class IV (hydrophobic)]

Lower-eukaryotic organism:

Saccharomyces cerevisiae

[Figure: genes of class I (most genes), class II (highly expressed), class III (unusual) and class IV (hydrophobic)]

Higher-eukaryotic organism:

Caenorhabditis elegans

[Figure: genes of class I (most genes), class II (highly expressed), class III (unusual) and class IV (hydrophobic)]

Slow-growing bacterium:

Helicobacter pylori

[Figure: genes of class I (most genes) and class IV (hydrophobic)]

Slow-growing bacterium:

Borrelia burgdorferi

[Figure: genes on the leading strand and genes on the lagging strand]

Some conclusions: sources of sequence heterogeneity

Hydrophobicity

Evolutionary pressure (translational bias)

Horizontal transfer

Different GC (AT) content

Strand heterogeneity

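Quantitative measures of bias

Effective number of codons $N_c$

Relative Synonymous Codon Usage

$$\mathrm{RSCU}_i = \frac{f_i}{\frac{1}{N_k}\sum_{j=1}^{N_k} f_j}$$

where the sum runs over the $N_k$ synonymous codons of the amino acid $k$ encoded by codon $i$

Relative Codon Adaptiveness [0..1]

$$w_i = \frac{f_i}{\max\{f_j : \text{for all synonyms } j \text{ of } i\}}$$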
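Both RSCU and the relative codon adaptiveness can be computed from raw codon counts, because the normalization to frequencies cancels within each synonymous family. The sketch below assumes the standard genetic code; the names `count_codons`, `rscu` and `relative_adaptiveness` are hypothetical, and codons of a family that never occurs are given the value 0.0 by convention of this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Synonymous families: amino acid -> list of its codons (stops excluded).
SYNONYMS = defaultdict(list)
for codon, aa in GENETIC_CODE.items():
    if aa != "*":
        SYNONYMS[aa].append(codon)


def count_codons(genes):
    """Count codons over a collection of in-frame coding sequences."""
    counts = Counter()
    for gene in genes:
        gene = gene.upper()
        counts.update(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    return counts


def rscu(counts):
    """RSCU_i = f_i divided by the mean f_j over the synonyms of codon i.

    `counts` must be a Counter (missing codons count as 0).
    """
    values = {}
    for family in SYNONYMS.values():
        mean = sum(counts[c] for c in family) / len(family)
        for c in family:
            values[c] = counts[c] / mean if mean else 0.0
    return values


def relative_adaptiveness(counts):
    """w_i = f_i divided by the largest f_j among the synonyms of codon i."""
    values = {}
    for family in SYNONYMS.values():
        best = max(counts[c] for c in family)
        for c in family:
            values[c] = counts[c] / best if best else 0.0
    return values
```

With lysine counts AAA = 2 and AAG = 1, RSCU gives 4/3 and 2/3, while the adaptiveness gives 1 and 0.5.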

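Codon Adaptation Index (CAI)

Codon bias with respect to some small set of genes (Reference Set)

$$w_i = \frac{f_i}{\max\{f_j : \text{for all synonyms } j \text{ of } i\}}$$

$$\mathrm{CAI}(gene) = \left(\prod_{i=1}^{L} w_i\right)^{1/L}$$

$f_i$ – frequency of codon i, calculated over the reference set S; L – number of all codons in a gene (in the product, i runs over the L codon positions of the gene)

$$\ln \mathrm{CAI}(gene) = \sum_{i=1}^{64} g_i \ln w_i = \langle \ln w_i \rangle_{gene}$$

$g_i$ – frequency of codon i in a gene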
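The CAI of a gene is the geometric mean of the weights of its codons, which is most safely computed through the logarithmic form. The following is a minimal sketch, not the authors' implementation. The function names, the exclusion of stop codons, and the `pseudocount` of 0.5 (used so that a codon absent from the reference set does not produce ln 0) are assumptions of this example.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SENSE_CODONS = {c for c, aa in GENETIC_CODE.items() if aa != "*"}


def split_codons(gene):
    """Return the sense codons of an in-frame coding sequence."""
    gene = gene.upper()
    codons = (gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    return [c for c in codons if c in SENSE_CODONS]


def adaptiveness(reference_genes, pseudocount=0.5):
    """Compute w_i over a reference set of genes."""
    counts = Counter(c for g in reference_genes for c in split_codons(g))
    w = {}
    for codon in SENSE_CODONS:
        aa = GENETIC_CODE[codon]
        family = [c for c in SENSE_CODONS if GENETIC_CODE[c] == aa]
        best = max(counts[c] for c in family)
        # Assumption: zero counts are replaced by a pseudocount.
        w[codon] = max(counts[codon], pseudocount) / max(best, pseudocount)
    return w


def cai(gene, w):
    """CAI(gene) = exp of the mean of ln w_i over the codons of the gene."""
    codons = split_codons(gene)
    if not codons:
        return 0.0
    return exp(sum(log(w[c]) for c in codons) / len(codons))
```

For a reference set in which AAA is twice as frequent as AAG, a gene made of one AAA and one AAG has a CAI equal to the square root of 0.5.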

An expert chooses the Reference Set

Ribosomal proteins

Elongation factors

Glycolytic proteins

…

Problems:

Functions of genes need to be known

The expert needs to know the type of codon bias already (otherwise the results will be meaningless)

The genes in the Reference Set may not have the highest CAIs

As the Reference Set we use the genes that are most biased with respect to the dominating codon bias, which is not necessarily translational

The most biased set of genes $S_R$

Calculate CAI (with $w_i$ calculated over $S_R$) for every gene in the genome

Then every gene in $S_R$ has a higher CAI than any gene that is not in $S_R$:

$$\mathrm{CAI}(gene \in S_R) > \mathrm{CAI}(gene \notin S_R)$$

One genome can have several sets $S_R$, each of them reflecting the presence of some type of codon bias

Algorithm for detecting dominating codon bias

1. Calculate $w_i$ over 100% of the genes, and CAIs for all genes

2. Select the 50% of genes with the highest CAIs, recalculate $w_i$ over them, recalculate CAIs

3. Select the 25% of genes with the highest CAIs, recalculate $w_i$ over them, recalculate CAIs

… When the fraction to select reaches 1% of genes or less, repeat with 1% until convergence.
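The halving procedure above can be sketched in a few lines. This is an illustrative reading of the steps, not the authors' code. The function names, the `final_fraction` and `max_iter` parameters, the pseudocount for unseen codons, and the convergence test (the selected set stops changing) are all assumptions of this example.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SENSE_CODONS = {c for c, aa in GENETIC_CODE.items() if aa != "*"}


def split_codons(gene):
    gene = gene.upper()
    codons = (gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    return [c for c in codons if c in SENSE_CODONS]


def adaptiveness(genes, pseudocount=0.5):
    """w_i over a set of genes; zero counts get a pseudocount (assumption)."""
    counts = Counter(c for g in genes for c in split_codons(g))
    w = {}
    for codon in SENSE_CODONS:
        aa = GENETIC_CODE[codon]
        best = max(counts[c] for c in SENSE_CODONS if GENETIC_CODE[c] == aa)
        w[codon] = max(counts[codon], pseudocount) / max(best, pseudocount)
    return w


def cai(gene, w):
    codons = split_codons(gene)
    if not codons:
        return 0.0
    return exp(sum(log(w[c]) for c in codons) / len(codons))


def most_biased_set(genes, final_fraction=0.01, max_iter=100):
    """Return (indices of the most biased genes, weights w_i)."""
    n = len(genes)
    selected = list(range(n))  # start from 100% of the genes
    fraction = 1.0
    for _ in range(max_iter):
        w = adaptiveness([genes[i] for i in selected])
        scores = [cai(g, w) for g in genes]
        # Halve the fraction until it reaches the final one, then keep it.
        fraction = max(fraction / 2, final_fraction)
        size = max(1, round(fraction * n))
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        new_selected = sorted(ranked[:size])
        if fraction == final_fraction and new_selected == selected:
            return new_selected, w  # the set reproduces itself
        selected = new_selected
    return selected, adaptiveness([genes[i] for i in selected])
```

On convergence the returned set has the defining property of $S_R$ up to ties: ranked by CAI with weights computed over the set itself, its members come first.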

Example: Bacillus subtilis

How it works for fast-growers

Reference set

Dominating bias, connected with translation

Dominating bias, connected with GC3s

Dominating bias, connected with strand

Example of non-dominating bias

Genes in Class III (possibly horizontally transferred genes) of Bacillus subtilis

We can detect and measure this bias by finding the most biased genes in class III with an analog of the proposed algorithm

REFERENCE

A. Carbone, A. Zinovyev, F. Képès. “Codon Adaptation Index as a measure of dominating codon bias”, preprint of Institut des Hautes Études Scientifiques, 2003.

http://www.ihes.fr/~materials