FUNCTIONAL GENOMICS COURSE 26.5.2006
Petri Pehkonen, Laboratory of Functional Genomics and Bioinformatics, Department of Neurobiology, A.I.Virtanen Institute for Molecular Sciences


Page 1: FUNCTIONAL GENOMICS  COURSE 26.5.2006

FUNCTIONAL GENOMICS COURSE
26.5.2006

Petri Pehkonen
Laboratory of Functional Genomics and Bioinformatics
Department of Neurobiology
A.I.Virtanen Institute for Molecular Sciences
University of Kuopio

Page 2: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ INTRODUCTION ] Gene lists from microarrays

Genes are compared between two or more samples

A t-test is used to detect differential expression

Genes are ranked according to their p-values

A p-value cut-off is set to select differentially expressed genes (up/down regulated)

Output: up-regulated, down-regulated and non-regulated gene lists

2 X UP: AA407331 BG062929 BG062930 BG062931 BG062932 AA407367 BG062933 BG062934 AA407377 BG062935 BG062936 BG062937 BG063015 BG063016 BG063017 BG063018

2 X DOWN: BG069315 BG082333 BG082334 BG069318 BG069319 BG069320 BG069321 BG069322 AU018797 BG069323 BG069324 BG069325 BG069326 BG069327 BG069328 BG069406 BG069407 BG082348 AU018835 BG069409 BG082350 BG082351 BG069412 BG069413

NON REGULATED: BG068248 AU022405 BG068249 BG068250 BG068251 BG068252 BG068253 BG068254 BG081291 BG081292 BG068257 BG068258 BG068259 BG068260 BG068261 BG068262 BG068263 BG068264 BG068265 AU022448 BG081301 BG068267 AU022455 BG068268 BG068269 BG068270 BG081306 BG068272 BG068273 BG068274 BG068275 BG068276 AU022477 BG068277 BG068278 BG068279 BG068280 BG068281 BG081317 BG068283 BG068284

mRNA Hybridization Dataset
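The ranking and cut-off steps can be sketched in a few lines of Python. This is only an illustration: the accession IDs are taken from the slide, but the fold changes and p-values are invented, and the function name `split_gene_list` and the two-fold threshold (suggested by the "2 X" list names) are assumptions. The p-values are assumed to come from a t-test run elsewhere.

```python
# Hypothetical per-gene results: accession -> (fold change, t-test p-value).
# The numbers are made up purely for illustration.
results = {
    "AA407331": (2.6, 0.003),
    "BG062929": (2.1, 0.020),
    "BG069315": (0.4, 0.001),
    "BG082333": (0.45, 0.040),
    "BG068248": (1.1, 0.600),
    "AU022405": (2.3, 0.200),  # large change, but not significant
}


def split_gene_list(results, p_cutoff=0.05, min_fold=2.0):
    """Rank genes by p-value, then sort them into up, down and non-regulated lists."""
    ranked = sorted(results, key=lambda gene: results[gene][1])
    up, down, non_regulated = [], [], []
    for gene in ranked:
        fold, p = results[gene]
        if p <= p_cutoff and fold >= min_fold:
            up.append(gene)
        elif p <= p_cutoff and fold <= 1.0 / min_fold:
            down.append(gene)
        else:
            non_regulated.append(gene)
    return up, down, non_regulated


up, down, non_regulated = split_gene_list(results)
print("2 X UP:", up)
print("2 X DOWN:", down)
print("NON REGULATED:", non_regulated)
```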

Page 3: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ INTRODUCTION ] Gene associated data

[Figure: genes G1 to G4 linked to four kinds of associated elements]

Chromosomal regions: 6p13, 6p14, 6p15, 6p12

Functional classes: Cell cycle, Apoptosis, Neurogenesis, Cell death, ATPase activity

Words from scientific literature: "is", "segregation", "protein", "antioxidant", "dopamine"

Transcription factor binding sites

G = a gene

T = element

Page 4: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Motivation

Aim

Find the chromosomal regions where our genes are over-represented

Biological basis

Gene duplication during evolution

Genes that are located near each other and/or share similar regulatory elements may be regulated by the same factors and participate in the same biological process

Page 5: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Data representation

Gene sets can be obtained with any kind of laboratory technology or in silico technique

Gene locations in a chromosome can be represented as ordered binomial data vectors

1 indicates that the gene belongs to the row's class, 0 indicates the opposite

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
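As a small sketch, the table above can be held in Python as parallel vectors ordered by chromosomal location. The names `Gene`, `genes` and `count_region` are illustrative assumptions. The region of ranks 5 to 10 used in the last lines is also an assumption, chosen because it contains 6 genes, 4 of them over-expressed and 1 under-expressed.

```python
from dataclasses import dataclass

# The three data rows of the slide, ordered by chromosomal location.
OVER = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
UNDER = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0]
LOCATION_KBP = [11, 16, 20, 30, 34, 39, 45, 51, 53, 54, 56, 60, 65, 69, 72, 75, 80, 82]


@dataclass
class Gene:
    rank: int          # 1-based rank order along the chromosome
    location_kbp: int  # physical location
    over: int          # 1 if the gene is in the over-expressed class
    under: int         # 1 if the gene is in the under-expressed class


genes = [
    Gene(rank=i + 1, location_kbp=loc, over=o, under=u)
    for i, (loc, o, u) in enumerate(zip(LOCATION_KBP, OVER, UNDER))
]


def count_region(genes, first_rank, last_rank):
    """Return (total, over-expressed, under-expressed) counts for a rank range."""
    region = [g for g in genes if first_rank <= g.rank <= last_rank]
    return len(region), sum(g.over for g in region), sum(g.under for g in region)


print(count_region(genes, 5, 10))
```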

Page 6: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] First analysis

We can select a region and count the genes in it

Total: 6 genes
Over-expressed: 4 genes
Under-expressed: 1 gene

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 7: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Our first analysis

We can select a region and count the genes in it

We can calculate the statistical significance of the region

Contingency table tests: Fisher's exact test, chi-squared test etc.

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Total: 6 genes
Over-expressed: 4 genes
Under-expressed: 1 gene

Page 8: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Our first analysis

Contingency table test: determines whether there is a difference between two proportions

We can now test whether the selected chromosomal region and the whole genome differ IN PROPORTIONS OF OVER-EXPRESSED GENES

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Total: 6 genes
Over-expressed: 4 genes
Under-expressed: 1 gene

Page 9: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Our first analysis

2x2 contingency table for over-expressed vs. other genes:

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

            Over-expr.   Other   Sums
Selected         4          2       6
Other            4          8      12
Sums             8         10      18
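The 2x2 table can be derived directly from the over-expressed indicator vector. The sketch below assumes that the selected region is ranks 5 to 10 (a 6-gene window consistent with the counts on the slide); `contingency_table` is a hypothetical helper name.

```python
OVER = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]


def contingency_table(in_class, first_rank, last_rank):
    """Build [[selected in class, selected other], [rest in class, rest other]]."""
    selected = in_class[first_rank - 1:last_rank]
    rest = in_class[:first_rank - 1] + in_class[last_rank:]
    table = []
    for part in (selected, rest):
        hits = sum(part)
        table.append([hits, len(part) - hits])
    return table


table = contingency_table(OVER, 5, 10)
print(table)

# The two proportions that the contingency table test compares.
selected_share = table[0][0] / sum(table[0])
rest_share = table[1][0] / sum(table[1])
print(round(selected_share, 3), round(rest_share, 3))
```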

Page 10: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Hypergeometric distribution

Regulated genes can be seen as a sample taken one by one without replacement from a population of all genes tested in the experiment

The hypergeometric distribution describes how random this kind of sample is for one associated class

[Figure: genes (regulated genes and other genes) against classes C1, C2 and C3]

[Figure: hypergeometric probability density function (HygePDF) and hypergeometric distribution function (HygeCDF). X axis: number of genes that were associated to class C in the regulated list. Y axis: hypergeometric probability]

Page 11: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Hypergeometric probability

The hypergeometric probability f answers whether class C is randomly distributed between regulated and other genes

To calculate f for class C, we need the sizes of the regulated gene list (N) and of the array (M), and the numbers of class C associated genes in the gene list (x) and in the array (n)

[Figure: genes (regulated genes and other genes) against classes C1, C2 and C3]

[Figure: hypergeometric probability f at x. X axis: number of genes that were associated to class C in the regulated list. Y axis: hypergeometric probability]
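With the slide's symbols, the hypergeometric probability is f(x, M, N, n) = C(n, x) C(M-n, N-x) / C(M, N), where C(a, b) is the binomial coefficient. A minimal Python sketch follows. The function name `hyge_pdf` is an assumption, and the numbers in the example are the 18-gene toy chromosome (8 over-expressed genes, a 6-gene region), not real data.

```python
from math import comb


def hyge_pdf(x, M, N, n):
    """Probability of exactly x class genes in a list of N genes drawn without
    replacement from an array of M genes, n of which belong to the class."""
    if x < 0 or x > min(N, n) or N - x > M - n:
        return 0.0  # impossible outcome
    return comb(n, x) * comb(M - n, N - x) / comb(M, N)


# Toy example: 4 over-expressed genes inside a 6-gene region.
print(hyge_pdf(4, M=18, N=6, n=8))

# The probabilities over all possible x add up to 1.
print(sum(hyge_pdf(x, 18, 6, 8) for x in range(0, 7)))
```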

Page 12: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Hypergeometric probability

Low probability means a non-random distribution => the class is either over- or under-represented

Figures: class C1 is over-represented and C3 under-represented in the gene list. How about C2?

[Figure: genes (regulated genes and other genes) against classes C1, C2 and C3]

[Figure: hypergeometric probability curves for C1, C2 and C3. X axis: number of genes that were associated to class C in the regulated list. Y axis: hypergeometric probability]

Page 13: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Fisher's exact test

The hypergeometric probability f(x,M,N,n) measures the probability of detecting exactly x genes that are associated to the class C in a random sample

[Figure: genes (regulated genes and other genes) against classes C1, C2 and C3]

[Figure: hypergeometric probability for class C2 with x marked. X axis: number of genes that were associated to class C in the regulated list. Y axis: hypergeometric probability]

Page 14: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Fisher's exact test

The hypergeometric probability f(x,M,N,n) measures the probability of detecting exactly x genes that are associated to the class C in a random sample

It is more natural to ask: what is the probability of detecting x or more class associated genes in a random sample?

Fisher's exact test F answers this by summing the tail of the HygePDF

FISHER'S TEST FOR CLASS C2:

F(x=36, M=1300, N=400, n=95) = 0.1916

C2 DOES NOT SHOW STATISTICALLY SIGNIFICANT OVER-REPRESENTATION (WITH A SIGNIFICANCE LEVEL α=0.05)

[Figure: genes (regulated genes and other genes) against classes C1, C2 and C3]

[Figure: hypergeometric probability for class C2 with the tail from x marked. X axis: number of genes that were associated to class C in the regulated list. Y axis: hypergeometric probability]
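The tail sum can be written directly from this definition. The sketch below is a one-sided test for over-representation only. `fisher_over` is an illustrative name, and the example reuses the 18-gene toy chromosome with a selected 6-gene region containing 4 of the 8 over-expressed genes.

```python
from math import comb


def hyge_pdf(x, M, N, n):
    """Hypergeometric probability of exactly x class genes in a list of N."""
    return comb(n, x) * comb(M - n, N - x) / comb(M, N)


def fisher_over(x, M, N, n):
    """One-sided Fisher's exact test: probability of x or more class genes."""
    return sum(hyge_pdf(k, M, N, n) for k in range(x, min(N, n) + 1))


p = fisher_over(4, M=18, N=6, n=8)
print(round(p, 4))
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

In this toy case the tail probability is well above 0.05, so 4 over-expressed genes among 6 is not convincing evidence on its own with so few genes.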

Page 15: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

We want to analyse the chromosome along its whole length

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 16: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

We want to analyse the chromosome along its whole length

We can discretize the chromosome into equal-sized regions

based on the number of genes or

based on the physical locations of genes

Then we can calculate the significance of each region separately

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
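A fixed-bin scan might look like the following sketch, which bins by number of genes (6 genes per bin is an arbitrary choice for the toy data) and applies a one-sided Fisher's exact test to each bin. The helper names are assumptions. Binning by physical location would work the same way with bins defined on the kbp coordinates.

```python
from math import comb

OVER = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]


def fisher_over(x, M, N, n):
    """Probability of x or more class genes among N genes drawn from M (n in class)."""
    tail = sum(comb(n, k) * comb(M - n, N - k) for k in range(x, min(N, n) + 1))
    return tail / comb(M, N)


def scan_fixed_bins(in_class, bin_size):
    """Return (first rank, last rank, class genes, p-value) for each bin."""
    M, n = len(in_class), sum(in_class)
    results = []
    for start in range(0, M, bin_size):
        members = in_class[start:start + bin_size]
        x = sum(members)
        results.append((start + 1, start + len(members), x,
                        fisher_over(x, M, len(members), n)))
    return results


bins = scan_fixed_bins(OVER, bin_size=6)
for first, last, x, p in bins:
    print(f"ranks {first}-{last}: {x} over-expressed, p = {p:.4f}")
```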

Page 17: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Statistics applied

We wanted to see whether over-represented chromosome regions exist in the down-regulated gene list obtained from a C. elegans strain comparison

Chromosomes were split into 250 kbp segments (themes)

Fisher's test based on the hypergeometric (Hyge) distribution was used to calculate over-representation

[Figure: genes G1 to G4 placed in chromosomal regions used as themes: 250-500 kbp, 500-750 kbp, 750-1000 kbp, 1000-1250 kbp. Genes G, regions T]

FISHER'S EXACT TEST

F(x, M, N, n) where:

x = Down-regulated genes in region T
M = All chip genes in chromosome
N = All down-regulated genes
n = All genes in region T

Page 18: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Statistics applied

GENE LIST: DOWN REGULATED BETWEEN C. ELEGANS HAWAIIAN VS. N2 STRAINS

[Figure: six panels, CHR I, CHR II, CHR III, CHR IV, CHR V and CHR X. X axis: position (Mbp). Y axis: significance]

Resulting p-values were transformed into negative base-10 logarithmic form, where p<0.05 corresponds to -log p > 1.3

Some regions were found where genes were highly over-represented

Kaja Reisner, Petri Pehkonen, Garry Wong

FISHER'S EXACT TEST EXAMPLE

For region T (12750...13000 kbp):
x = Down-regulated genes in region T = 14
M = All chip genes in chromosome = 3407
N = All down-regulated genes = 92
n = All genes in region T = 59

P(x,M,N,n) ~ 0.0000000001984
-log P ~ 9.73
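The transformation to the negative base-10 logarithm can be sketched as follows, using the x, M, N and n values of the example region. The helper name `fisher_over` is an assumption, and the test is written as a one-sided tail sum. The exact figure the script prints depends on such details, so it may differ slightly from the rounded values on the slide.

```python
from math import comb, log10


def fisher_over(x, M, N, n):
    """Probability of x or more class genes among N genes drawn from M (n in class)."""
    tail = sum(comb(n, k) * comb(M - n, N - k) for k in range(x, min(N, n) + 1))
    return tail / comb(M, N)


# Example region T (12750...13000 kbp) from the slide.
p = fisher_over(x=14, M=3407, N=92, n=59)
significance = -log10(p)
print(f"p = {p:.3e}, -log10 p = {significance:.2f}")

# The significance threshold p < 0.05 on the transformed scale.
threshold = -log10(0.05)
print(f"threshold = {threshold:.2f}")
```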

Page 19: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

What is the shortcoming of the previous approach?

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 20: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

What is the shortcoming of the previous approach?

Discretization can split 'good' regions, so that we may not find them

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Total: 4 genes
Over-expressed: 3 genes
Under-expressed: 1 gene

Page 21: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

Our bin size can also be too large to detect some regions and too small to detect others

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Total: 8 genes
Over-expressed: 4 genes
Under-expressed: 4 genes

Page 22: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

Solution (partial):

Sliding window technique

Simply slide a bin along the chromosome

Calculate the significance at each position

Window size can be defined as a number of genes or as a physical region size

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
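A sliding window over the toy chromosome can be sketched like this. The window is defined as a number of genes (5 is an arbitrary choice), each position is scored with a one-sided Fisher's exact test, and the function names are assumptions rather than anything from the course software.

```python
from math import comb

OVER = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]


def fisher_over(x, M, N, n):
    """Probability of x or more class genes among N genes drawn from M (n in class)."""
    tail = sum(comb(n, k) * comb(M - n, N - k) for k in range(x, min(N, n) + 1))
    return tail / comb(M, N)


def sliding_window(in_class, window):
    """Score every window position; returns (first rank, last rank, x, p-value)."""
    M, n = len(in_class), sum(in_class)
    results = []
    for start in range(0, M - window + 1):
        x = sum(in_class[start:start + window])
        results.append((start + 1, start + window, x, fisher_over(x, M, window, n)))
    return results


windows = sliding_window(OVER, window=5)
best = min(windows, key=lambda w: w[3])
print(f"best window: ranks {best[0]}-{best[1]}, "
      f"{best[2]} over-expressed, p = {best[3]:.4f}")
```

Because every start position is tested, the run of over-expressed genes at ranks 7 to 11 is found as a whole instead of being cut by a bin border.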

Page 23: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

Good sides

Now we do not split any region in half

Rather, we gather information at every possible position

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 24: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

Shortcomings

The user has to select a constant window size

We can still miss 'good' regions that are much smaller or larger than the window

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 25: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CHROMOSOMAL ANALYSIS ] Analysis of a chromosome

Our solution (Pehkonen, Törönen and Wong)

Hierarchical segmentation analysis of the chromosome

Possibility to find 'significant' regions of different sizes in different locations of the chromosome

Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0

Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 26: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Background

An ordered set of data includes signal-rich regions among noise

Data can be partitioned into segments in order to separate signal from noise, and patterns from other patterns

Existing method: recursive segmentation

Existing applications: DNA segmentation, image analysis etc.

[Figure: ordered set of data points as a binary image (rows: classes of data), showing a signal-rich region, a noisy region and a segmentation split]

Page 27: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Existing methods

Recursive segmentation

Used with e.g. a maximum likelihood ratio test to decide whether to proceed with splitting

Shortcomings:

Unsatisfactory definition of the stopping criterion

The algorithm is often unable to detect the global optimum

The ML model is based on classical probability theory and does not take uncertainty into account

Page 28: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Existing methods

Iterative algorithm instead of recursive

Proceeds to the split that most increases the global segmentation score

Reasonable visualization of the result with a dendrogram

Facilitates observation of changes in segmentation score between local maxima and minima, and global maxima

Bayesian segmentation score

Dirichlet-multinomial model

Takes uncertainty into account

Page 29: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Our improvements

Improved model selection criterion for finding the best segmentation level:

segmentation_score = F(p; \cdot) - \log \binom{N-1}{k-1}

[Equation: F(p; \cdot) is a double sum over clusters j = 1, ..., k and classes i = 1, ..., C of four logarithmic terms involving the per-cluster class counts and the cluster sizes c_j; the individual terms are not recoverable from the extracted text]

Symbols: N = data size; k = number of clusters; C = number of classes; p_{ji} = probability of class i in cluster j; c_j = size of cluster j. The legend also defines the prior of class i and the sum of priors.

Page 30: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Our improvements

Improved model selection criterion for finding the best segmentation level

There are N data points in the data

=> There are N-1 positions between them

segmentation_score = F(p; \cdot) - \log \binom{N-1}{k-1}

Page 31: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Our improvements

Improved model selection criterion for finding the best segmentation level

There are N data points in the data

=> There are N-1 positions between them

There are k clusters in the model

=> There are k-1 edges (splits)

segmentation_score = F(p; \cdot) - \log \binom{N-1}{k-1}

Page 32: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Our improvements

Improved model selection criterion for finding the best segmentation level

There are N data points in the data

=> There are N-1 positions between them

There are k clusters in the model

=> There are k-1 edges (splits)

In how many ways can k-1 edges be placed into the N-1 positions between data points?

segmentation_score = F(p; \cdot) - \log \binom{N-1}{k-1}

Page 33: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Our improvements

Improved model selection criterion for finding the best segmentation level

There are N data points in the data

=> There are N-1 positions between them

There are k clusters in the model

=> There are k-1 edges (splits)

In how many ways can k-1 edges be placed into the N-1 positions between data points?

Corresponds to an MDL-based model selection criterion for detecting the clustering level

segmentation_score = F(p; \cdot) - \log \binom{N-1}{k-1}
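The iterative idea can be illustrated with a short Python sketch. It is not the authors' implementation. As a stand-in for F it uses the standard Dirichlet-multinomial marginal log-likelihood with a symmetric prior (alpha = 1 per class, an assumption), and it subtracts the log of the number of ways to place k-1 edges into N-1 positions as the model selection penalty. All function names are hypothetical.

```python
from math import comb, lgamma, log


def segment_fit(labels, n_classes, alpha=1.0):
    """Dirichlet-multinomial log marginal likelihood of one segment (assumed score)."""
    counts = [0] * n_classes
    for label in labels:
        counts[label] += 1
    total_alpha = alpha * n_classes
    fit = lgamma(total_alpha) - lgamma(total_alpha + len(labels))
    for c in counts:
        fit += lgamma(c + alpha) - lgamma(alpha)
    return fit


def segmentation_score(labels, bounds, n_classes):
    """Sum of segment fits minus log C(N-1, k-1) for k segments."""
    fit = sum(segment_fit(labels[a:b], n_classes) for a, b in zip(bounds, bounds[1:]))
    k = len(bounds) - 1
    return fit - log(comb(len(labels) - 1, k - 1))


def iterative_segmentation(labels, n_classes, max_segments):
    """Greedily add the split that gives the highest global score at each level."""
    bounds = [0, len(labels)]
    history = [(tuple(bounds), segmentation_score(labels, bounds, n_classes))]
    while len(bounds) - 1 < max_segments:
        candidates = [p for p in range(1, len(labels)) if p not in bounds]
        if not candidates:
            break
        best = max(candidates, key=lambda p: segmentation_score(
            labels, sorted(bounds + [p]), n_classes))
        bounds = sorted(bounds + [best])
        history.append((tuple(bounds), segmentation_score(labels, bounds, n_classes)))
    return history


# Toy data: two clean regions of 10 points each, with classes 0 and 1.
labels = [0] * 10 + [1] * 10
history = iterative_segmentation(labels, n_classes=2, max_segments=4)
best_level = max(history, key=lambda level: level[1])
for bounds, score in history:
    print(bounds, round(score, 2))
print("best level:", best_level[0])
```

The `history` list records one segmentation per level, which is the information a dendrogram of the splits would display. The level with the highest penalized score is taken as the best segmentation.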

Page 34: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Evaluation of our method

Simulated data creator

Creates artificial data with random clusters and noise from a given model

Evaluation of segmentation methods

How closely the clustering result corresponds to the given original model

Kullback-Leibler distance, Jensen-Shannon divergence, mutual information
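For reference, two of the listed measures can be computed for discrete distributions as in the sketch below. How the course software turns a clustering result and a model into distributions is not described on the slides, so the class-probability vectors used here are made-up inputs.

```python
from math import log


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); q must be positive wherever p is."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical class distributions of a model segment and a found segment.
model = [0.7, 0.2, 0.1]
found = [0.6, 0.3, 0.1]

print(round(kl_divergence(model, found), 4))
print(round(js_divergence(model, found), 4))
```

The Kullback-Leibler divergence is not symmetric in its arguments, whereas the Jensen-Shannon divergence is, and it stays finite even when the two distributions do not overlap.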

Page 35: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Evaluation of our method

Comparison of the given model vs. the created clustering result with KL divergence

Page 36: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Evaluation of our method

Multinomial data with 5 classes:

[Figure: clustering vs. model distance (KL divergence, axis from 0 to 600) over simulated artificial datasets, for the criteria AIC, BIC, BIC2 and OWN]

Page 37: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ SEGMENTATION ] Evaluation of our method

Multinomial data with 30 classes:

[Figure: clustering vs. model distance (KL divergence, axis from 0 to 3500) over simulated artificial datasets, for the criteria AIC, BIC, BIC2 and OWN]

Page 38: FUNCTIONAL GENOMICS  COURSE 26.5.2006

CATALIST SOFTWARE

Two main parts

Method testing: creation of artificial datasets and evaluation of segmentation methods

Analysis of biological datasets: import of microarray data or gene lists from files

Page 39: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ CATALIST ] Results visualization

Draws a dendrogram that shows the progress of the segmentation

The dendrogram reveals the hierarchical localization of significant regions

Detailed information on genes is shown by clicking the found regions

C. elegans strain comparison

See the three clusters found in chromosome 4

[Figure: CHROMOSOME IV]

Page 40: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ LOCATION ANALYSIS ] Results

Yeast cell cycle genes were clustered according to gene expression data

K-means clustering was performed with 2..10 clusters

Page 41: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ LOCATION ANALYSIS ] Results

Each clustering into k groups was encoded as a multinomial data vector with k classes

Clustering to 2 1 0 2 2 2 2 0 0 0 0 0 1 1 1 0 0 2 0

Clustering to 3 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0

Clustering to 4 4 4 3 3 3 3 4 4 4 4 4 3 1 1 2 2 4 4

Clustering to 5 5 5 4 5

Clustering to 6 1 0 4 4 4 4 1 1 1 1 1 0 0 0 1 0 1 0

...... 1 1 7 7 7 7 1 2 2 1 2 1 1 1 0 6 0 0

Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82

Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 42: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ LOCATION ANALYSIS ] Results

Yeast cell cycle gene expression clusters, chromosome 4:

Page 43: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ LOCATION ANALYSIS ] Results

Result of hierarchical segmentation:

Page 44: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ LOCATION ANALYSIS ] Results

Real results (red) compared to the segmentation of several randomized data sets (blue):

Page 45: FUNCTIONAL GENOMICS  COURSE 26.5.2006

[ LOCATION ANALYSIS ] Results

Nuclear receptors gene list: