Chip arrays and gene expression data

Chip arrays and gene expression data. Motivation


Page 1: Chip arrays and gene expression data. Motivation

Chip arrays and gene expression data

Page 2: Chip arrays and gene expression data. Motivation

Motivation


Page 3: Chip arrays and gene expression data. Motivation

With chip array technology, one can measure the expression of all genes at once (even all exons). This can answer questions such as:

1. Which genes are expressed in a muscle cell?

2. Which genes are expressed during the first week of pregnancy in the mother? In the new baby?

3. Which genes are expressed in cancer?

Page 4: Chip arrays and gene expression data. Motivation

4. If one mutates a transcription factor (TF): which genes are no longer expressed following this change?

5. Which genes are not expressed in the brain of a baby with an intellectual disability?

6. Which genes are expressed when one is asleep versus when the same person is awake?

Page 5: Chip arrays and gene expression data. Motivation

Analyzing Output


Page 6: Chip arrays and gene expression data. Motivation

Output

               w.t    Brain tumor (males)    Brain tumor (females)
Gene 1
Gene 2
Gene 3
…
Gene 25,000

Each cell is either an absolute number or a relative one, depending on the technology used.

Page 7: Chip arrays and gene expression data. Motivation

Repeats

               w.t    Brain tumor (male1)    Brain tumor (male2)    Brain tumor (female1)
Gene 1
Gene 2
Gene 3
…
Gene 25,000

A repeat can be either a technical repeat (the same sample on a different chip) or a “real” biological repeat (a different sample).

Page 8: Chip arrays and gene expression data. Motivation

Expression profile

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4   15   16   17   23
g2      7    5    4    6    6    3    7    9
g3      2    3    2    5   25   26   30   60

Genes 1 and 3 show the same trend (both are high under the same conditions). That is, they have the same expression profile.
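One way to make "the same trend" concrete is to correlate the profiles. The sketch below assumes Pearson correlation as the similarity measure; the slides do not name one, so treat that choice (and the helper name `pearson`) as illustrative. It uses the three profiles from the table above.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Profiles over the conditions wt1..wt4, bt1..bt4 (values from the slide).
profiles = {
    "g1": [4, 3, 5, 4, 15, 16, 17, 23],
    "g2": [7, 5, 4, 6, 6, 3, 7, 9],
    "g3": [2, 3, 2, 5, 25, 26, 30, 60],
}

r13 = pearson(profiles["g1"], profiles["g3"])  # strongly correlated
r12 = pearson(profiles["g1"], profiles["g2"])  # only weakly correlated
print(f"r(g1, g3) = {r13:.2f}, r(g1, g2) = {r12:.2f}")
```

A correlation close to 1 for g1 and g3, and a much lower one for g1 and g2, matches the observation that genes 1 and 3 share a profile.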

Page 9: Chip arrays and gene expression data. Motivation

Clustering

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4   15   16   17   23
g2      7    5    4    6    6    3    7    9
g3      2    3    2    5   25   26   30   60

In general, we want to find all the genes that share the same expression profile → suggestive of a functional linkage.

There are clustering algorithms, which do exactly that.

Page 10: Chip arrays and gene expression data. Motivation

Clustering

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5   60    1   66    1

Clustering of the conditions can suggest two types of brain tumor (bt)

Page 11: Chip arrays and gene expression data. Motivation

Clustering

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5    6    1    7    3

Bi-clustering: both on the conditions and the genes.

Page 12: Chip arrays and gene expression data. Motivation

Applications


Page 13: Chip arrays and gene expression data. Motivation

Applications

Think of growing E. coli under increasing glucose concentrations and running a chip array at each concentration.

One can potentially discover all genes in the glucose pathway.

Knocking out a gene → discover all genes that interact with it.

Page 14: Chip arrays and gene expression data. Motivation

Applications

Analyzing expression of genes can help reveal the gene network of a given organism.

Page 15: Chip arrays and gene expression data. Motivation

Gene network

Page 16: Chip arrays and gene expression data. Motivation

Clinical

New sample: g1 = 11, g2 = 4, g3 = 0

Does someone have a brain tumor?

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5   60    1   66    1

Page 17: Chip arrays and gene expression data. Motivation

MammaPrint

Used to assess the risk that a breast tumor will spread to other parts of the body (metastasis). It is based on the well-known 70-gene breast cancer gene signature.

In February 2007, the FDA cleared the MammaPrint test for use in the U.S.

Page 18: Chip arrays and gene expression data. Motivation

Sequence by hybridization

It was thought that the following procedure could work for sequencing a genome:

1. Make a chip containing all x-mers (e.g., x = 25).

2. Hybridize a genome to the chip.

3. By analyzing all the hybridizations with their overlaps, assemble the genome.

Problem: it doesn’t work.

Page 19: Chip arrays and gene expression data. Motivation

ChIP-on-chip: a method for measuring protein-DNA interactions.

Proteins that bind DNA include:

Those responsible for transcription regulation

Transcription factors (TFs)

Replication proteins

Histones…

Page 20: Chip arrays and gene expression data. Motivation

ChIP-on-chip: the first “ChIP” stands for Chromatin ImmunoPrecipitation and the second “chip” refers to the DNA microarray.

The method is used mostly to detect TF binding sites.

Page 21: Chip arrays and gene expression data. Motivation

Tiling arrays

Here the chip array should cover not only protein-coding genes but also control regions, or simply the entire genome.

Page 22: Chip arrays and gene expression data. Motivation

Deep sequencing reads

Yoder-Himes D.R. et al. PNAS (2009)

Page 23: Chip arrays and gene expression data. Motivation

Machine learning

Learning mode on. Bioinfo is great.

Page 24: Chip arrays and gene expression data. Motivation

Clustering

Page 25: Chip arrays and gene expression data. Motivation

Clustering (of expression data)

UPGMA is one such direct clustering method: it receives a distance matrix as input and gives an ultrametric tree as output.

It was suggested by Sokal and Michener (1958).

Page 26: Chip arrays and gene expression data. Motivation

Clustering (of expression data)

Often, there is a one-to-one transformation between the data and points in space.

For example, the expression of all genes under a specific condition is a point:

              Condition 1
Gene 1             5
Gene 2             7
Gene 3             2
…
Gene 20000        54

(5, 7, 2, …, 54) is a point in a space of dimension 20,000.

Page 27: Chip arrays and gene expression data. Motivation

Clustering (of expression data)

As another example, each expression profile is a point in a space whose dimension is the number of conditions:

          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1        50            20             4            33

(50, 20, 4, 33) is a point in a space of dimension 4.

Page 28: Chip arrays and gene expression data. Motivation

In space: each point is a gene

[Figure: scatter plot with axes Condition 1 and Condition 2; one point is labeled g1]

Page 29: Chip arrays and gene expression data. Motivation

Our goal will be to cluster genes

[Figure: scatter plot of genes with axes Condition 1 and Condition 2]

Genes that are in the same cluster (show similar patterns of expression) are likely to be functionally related.

Page 30: Chip arrays and gene expression data. Motivation

Distance between two expression profiles

The Euclidean distance between the profiles of Gene 1 and Gene 2:

          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1        50            20             4            33
Gene 2        30            20             3            31

$$d_{1,2} = \sqrt{(50-30)^2 + (20-20)^2 + (4-3)^2 + (33-31)^2}$$
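As a quick check of the formula, here is a minimal sketch that evaluates it for the two profiles on this slide. The function name `euclidean` is only illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

gene1 = [50, 20, 4, 33]
gene2 = [30, 20, 3, 31]

# Squared differences per condition: 400, 0, 1 and 4, so d = sqrt(405).
d12 = euclidean(gene1, gene2)
print(round(d12, 2))
```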

Page 31: Chip arrays and gene expression data. Motivation

Distance between two expression profiles

We can compute the distances between each pair of expression profiles and obtain a distance table.

          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1        50            20             4            33
Gene 2        30            20             3            31
Gene 3        30            20             3            31
Gene 4        30            20             3            31
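Building the distance table amounts to applying the same distance to every pair of profiles. The sketch below assumes the Euclidean distance from the previous slide and uses the four profiles listed here; `distance_table` is a hypothetical helper name.

```python
import math

def distance_table(profiles):
    """Map each unordered gene pair (a, b) to the Euclidean distance between them."""
    names = list(profiles)
    table = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            table[(a, b)] = math.dist(profiles[a], profiles[b])
    return table

profiles = {
    "Gene 1": [50, 20, 4, 33],
    "Gene 2": [30, 20, 3, 31],
    "Gene 3": [30, 20, 3, 31],
    "Gene 4": [30, 20, 3, 31],
}

table = distance_table(profiles)
for (a, b), d in table.items():
    print(f"{a} - {b}: {d:.2f}")
```

Only the upper triangle is stored, since the distance is symmetric and the diagonal is always 0.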

Page 32: Chip arrays and gene expression data. Motivation

The distance table

         g1    g2    g3    g4    g5    g6    g7    g8
g1        0    32    48    51    50    48    98   148
g2              0    26    34    29    33    84   136
g3                    0    42    44    44    92   152
g4                          0    44    38    86   142
g5                                0    24    89   142
g6                                      0    90   142
g7                                            0   148
g8                                                  0


Page 34: Chip arrays and gene expression data. Motivation

Starting tree

[Tree: g5 and g6 joined under a new node]

We call the parent node of g5 and g6 “g56”.

Page 35: Chip arrays and gene expression data. Motivation

Removing the g5 and g6 rows and columns, and adding the g56 row and column

         g1    g2    g3    g4   g56    g7    g8
g1        0    32    48    51     ?    98   148
g2              0    26    34     ?    84   136
g3                    0    42     ?    92   152
g4                          0     ?    86   142
g56                               0  89.5   142
g7                                      0   148
g8                                            0

Page 36: Chip arrays and gene expression data. Motivation

Computing distances

         g1    g2    g3    g4    g5    g6    g7    g8
g1        0    32    48    51    50    48    98   148

$$D(g_{56}, g_1) = \tfrac{1}{2} D(g_5, g_1) + \tfrac{1}{2} D(g_6, g_1) = \tfrac{1}{2} \cdot 50 + \tfrac{1}{2} \cdot 48 = 49$$

Page 37: Chip arrays and gene expression data. Motivation

The updated table. Starting the second iteration…

         g1    g2    g3    g4   g56    g7    g8
g1        0    32    48    51    49    98   148
g2              0    26    34    31    84   136
g3                    0    42    44    92   152
g4                          0    41    86   142
g56                               0  89.5   142
g7                                      0   148
g8                                            0

Page 38: Chip arrays and gene expression data. Motivation

Building the tree - Continued

We call the parent node of g2 and g3 “g23”.

[Tree so far: g56 = (g5, g6); g23 = (g2, g3)]

Page 39: Chip arrays and gene expression data. Motivation

Computing distances

         g1    g2    g3    g4   g56    g7    g8
g56      49    31    44    41     0  89.5   142

$$D(g_{23}, g_{56}) = \tfrac{1}{2} D(g_2, g_{56}) + \tfrac{1}{2} D(g_3, g_{56}) = \tfrac{1}{2} \cdot 31 + \tfrac{1}{2} \cdot 44 = 37.5$$

Page 40: Chip arrays and gene expression data. Motivation

The updated table. Starting a new iteration…

         g1   g23    g4   g56    g7    g8
g1        0    40    51    49    98   148
g23             0    38  37.5    88   144
g4                    0    41    86   142
g56                         0  89.5   142
g7                                0   148
g8                                      0

Page 41: Chip arrays and gene expression data. Motivation

Tree

[Tree so far: g2356 = ((g5, g6), (g2, g3)), where g56 = (g5, g6) and g23 = (g2, g3)]

Page 42: Chip arrays and gene expression data. Motivation

Computing distances

         g1   g23    g4   g56    g7    g8
g1        0    40    51    49    98   148

$$D(g_{2356}, g_1) = \tfrac{1}{2} D(g_{23}, g_1) + \tfrac{1}{2} D(g_{56}, g_1) = \tfrac{1}{2} \cdot 40 + \tfrac{1}{2} \cdot 49 = 44.5$$

Page 43: Chip arrays and gene expression data. Motivation

Starting a new iteration…

              g1   g2356      g4      g7      g8
g1             0    44.5      51      98     148
g2356                  0    39.5   88.75     143
g4                             0      86     142
g7                                     0     148
g8                                             0

Page 44: Chip arrays and gene expression data. Motivation

Building the tree

[Tree so far: g23456 = (((g5, g6), (g2, g3)), g4), where g2356 = ((g5, g6), (g2, g3))]

Page 45: Chip arrays and gene expression data. Motivation

Computing distances

              g1   g2356      g4      g7      g8
g1             0    44.5      51      98     148

$$D(g_{23456}, g_1) = \tfrac{4}{5} D(g_{2356}, g_1) + \tfrac{1}{5} D(g_4, g_1) = \tfrac{4}{5} \cdot 44.5 + \tfrac{1}{5} \cdot 51 = 45.8$$
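The weights 4/5 and 1/5 in this step reflect cluster sizes: g2356 contains four genes and g4 contains one. The slides show only the numeric instances; assuming the standard UPGMA update rule, the general form when clusters A and B are merged, for any other cluster C, can be written as:

```latex
D(A \cup B,\, C) \;=\; \frac{|A|}{|A| + |B|}\, D(A, C) \;+\; \frac{|B|}{|A| + |B|}\, D(B, C)
```

With |A| = |B| this reduces to the plain average used in the earlier iterations, for example D(g56, g1) = (50 + 48) / 2 = 49.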

Page 46: Chip arrays and gene expression data. Motivation

Starting an additional iteration…

              g1  g23456      g7      g8
g1             0    45.8      98     148
g23456                 0    88.2   142.8
g7                             0     148
g8                                     0

Page 47: Chip arrays and gene expression data. Motivation

Constructing the tree

[Tree so far: g123456 = ((((g5, g6), (g2, g3)), g4), g1), where g23456 = (((g5, g6), (g2, g3)), g4)]

Page 48: Chip arrays and gene expression data. Motivation

One more iteration…

           g123456        g7        g8
g123456          0    89.833    143.66
g7                         0       148
g8                                   0

Page 49: Chip arrays and gene expression data. Motivation

Reconstructing the tree

[Tree so far: g1234567 = (((((g5, g6), (g2, g3)), g4), g1), g7), where g123456 = ((((g5, g6), (g2, g3)), g4), g1)]

Page 50: Chip arrays and gene expression data. Motivation

The new table

            g1234567          g8
g1234567           0    144.2857
g8                             0

Page 51: Chip arrays and gene expression data. Motivation

Resulting tree

[Final tree: ((((((g5, g6), (g2, g3)), g4), g1), g7), g8), with internal nodes g56, g23, g2356, g23456, g123456 and g1234567; the root joins g1234567 and g8]

Page 52: Chip arrays and gene expression data. Motivation

From tree to clusters

[Figure: the tree with leaves ordered g5, g6, g2, g3, g4, g1, g7, g8 and a cut line just below the root]

If we want two clusters, we cut just below the root and obtain g8 versus g1-7.

Page 53: Chip arrays and gene expression data. Motivation

From tree to clusters

[Figure: the tree with leaves ordered g5, g6, g2, g3, g4, g1, g7, g8 and a lower cut line]

If we want 3 clusters, we cut one level lower and obtain g8, g7, and g1-6.

Page 54: Chip arrays and gene expression data. Motivation

From tree to clusters

[Figure: the tree with leaves ordered g5, g6, g2, g3, g4, g1, g7, g8 and a still lower cut line]

The 4 clusters are: g8, g7, g1, and g23456.
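The whole walkthrough, from the 8 x 8 distance table to the clusters, can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes the size-weighted average update used in the iterations above, and it stops merging once k clusters remain, which gives the same groups as cutting the finished tree into k clusters. The names `upgma`, `rows` and `leaf_dist` are illustrative.

```python
from itertools import combinations

def upgma(labels, leaf_dist, k=1):
    """Merge the closest clusters (UPGMA) until k clusters remain.

    labels:    list of leaf names.
    leaf_dist: dict mapping (a, b) leaf pairs to their distance.
    Returns (clusters, merges); each merge is (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in labels]
    dist = {frozenset(((a,), (b,))): v for (a, b), v in leaf_dist.items()}
    merges = []
    while len(clusters) > k:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset((a, b))]))
        merged = a + b
        others = [c for c in clusters if c not in (a, b)]
        for c in others:
            # Size-weighted average of the distances from the two merged clusters.
            dist[frozenset((merged, c))] = (
                len(a) * dist[frozenset((a, c))] + len(b) * dist[frozenset((b, c))]
            ) / len(merged)
        clusters = others + [merged]
    return clusters, merges

labels = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
# Upper triangle of the distance table, row by row (diagonal omitted).
rows = [
    [32, 48, 51, 50, 48, 98, 148],
    [26, 34, 29, 33, 84, 136],
    [42, 44, 44, 92, 152],
    [44, 38, 86, 142],
    [24, 89, 142],
    [90, 142],
    [148],
]
leaf_dist = {
    (labels[i], labels[i + 1 + j]): v
    for i, row in enumerate(rows)
    for j, v in enumerate(row)
}

four_clusters, _ = upgma(labels, leaf_dist, k=4)
_, merges = upgma(labels, leaf_dist, k=1)
for a, b, d in merges:
    print("+".join(a), "and", "+".join(b), "merge at distance", round(d, 4))
```

Running to k=1 reproduces the merge order of the slides (g5 with g6, g2 with g3, then g23 with g56, and so on), while k=4 stops at the clusters g8, g7, g1 and g23456.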

Page 55: Chip arrays and gene expression data. Motivation

Classification

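          Condition 1   Condition 2
Gene 1        50            20
Gene 2        30            20
Gene 3        30            20
Gene 4        30            20

[Figure: scatter plot with axes Gene 1 and Gene 2; red and yellow points, plus an unclassified point marked “?”]

If red = brain tumor and yellow = healthy, do I have a brain tumor?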

Page 56: Chip arrays and gene expression data. Motivation

SVM = support vector machine

In SVM we find a (hyper)plane that divides the space in two.

[Figure: scatter plot with axes Gene 1 and Gene 2, a separating line, and a query point “?”]

          Condition 1   Condition 2
Gene 1        50            20
Gene 2        30            20
Gene 3        30            20
Gene 4        30            20

Page 57: Chip arrays and gene expression data. Motivation

SVM – confidence in classification

[Figure: scatter plot with axes Gene 1 and Gene 2, a separating line, and a query point “?”]

The further the point is from the separating (hyper)plane, the more confident we are in the classification.
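To illustrate the confidence idea, the sketch below computes the signed distance of a point from a separating line. It does not train an SVM: the line `x1 + x2 - 50 = 0`, the labels on each side, and the sample points are all hypothetical values chosen for illustration.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical separating line in the (Gene 1, Gene 2) plane.
w, b = [1.0, 1.0], -50.0

def classify(x):
    """Return (label, confidence); confidence is the distance to the line."""
    d = signed_distance(w, b, x)
    return ("red" if d > 0 else "yellow"), abs(d)

near = classify([26, 25])  # barely on the "red" side
far = classify([60, 50])   # deep inside the "red" side
print(near, far)
```

Both points get the same label, but the second lies much further from the line, so its classification would be trusted more.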

Page 58: Chip arrays and gene expression data. Motivation

SVM – cannot always perfectly classify

[Figure: scatter plot with axes Gene 1 and Gene 2 and a query point “?”]

Sometimes we cannot perfectly separate the training data. In this case, we find the best separation.

Page 59: Chip arrays and gene expression data. Motivation

KNN = k nearest neighbors

[Figure: scatter plot with axes Gene 1 and Gene 2 and a query point “?”]

KNN is another method for classification. For each point, it looks at its k nearest neighbors.

If red = brain tumor and yellow = healthy, do I have a brain tumor?

Page 60: Chip arrays and gene expression data. Motivation

KNN = k nearest neighbors

[Figure: scatter plot with axes Gene 1 and Gene 2 and a query point “?”]

For each point, it looks at its k nearest neighbors. For example, the method with k=3 looks at the point's 3 nearest neighbors to decide how to classify it. If the majority are “Red”, it classifies the point as red.

If red = brain tumor and yellow = healthy, do I have a brain tumor?
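The rule can be sketched directly. The training points below are hypothetical (the slide shows them only as a figure), and `knn_classify` is an illustrative name; the distance is assumed to be Euclidean.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train: list of ((gene1, gene2), label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training samples: (Gene 1, Gene 2) expression and a label.
train = [
    ((10, 12), "yellow"), ((12, 9), "yellow"), ((9, 14), "yellow"),
    ((30, 28), "red"), ((33, 31), "red"), ((29, 35), "red"),
]

label = knn_classify(train, (27, 30), k=3)
print(label)
```

An odd k avoids ties when there are two classes, which is one reason k=3 is a common illustrative choice.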

Page 61: Chip arrays and gene expression data. Motivation

KNN = k nearest neighbors

[Figure: scatter plot with axes Gene 1 and Gene 2 and a query point “?”]

KNN is better than SVM for the above case.

If red = brain tumor and yellow = healthy, do I have a brain tumor?

Page 62: Chip arrays and gene expression data. Motivation

KNN - exercise

[Figure: scatter plot with axes Gene 1 and Gene 2 and a query point “?”]

In the above example, how will the point be classified by KNN with k=1? By SVM?

Page 63: Chip arrays and gene expression data. Motivation

Training dataset

[Figure: scatter plot with axes Gene 1 and Gene 2; red and yellow points, plus a query point “?”]

The red and yellow points are used to train the classifier.

The more training data one has, the better the classifier will perform.

Page 64: Chip arrays and gene expression data. Motivation

Test dataset

[Figure: scatter plot with axes Gene 1 and Gene 2 and a query point “?”]

Usually some points for which we know the answer are not given to the classifier and are used to TEST its performance.

Page 65: Chip arrays and gene expression data. Motivation

Decision tree

Age        Gene1    Gene2    Smoker    Operation
<20        high     high     yes       yes
<20        high     high     yes       yes
<20        low      low      no        no
[20,40]    low      high     yes       yes
[20,40]    high     high     no        yes
[20,40]    high     low      yes       no
>40        low      low      yes       no
>40        high     low      no        no
>40        low      high     no        no

Page 66: Chip arrays and gene expression data. Motivation

Decision tree

Age > 40?
  Yes: Operation = no
  No:  Gene 2?
         high: Operation = yes
         low:  Operation = no

Decision trees are built automatically from training data and are used for classification.

They also tell us which features are most important.
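The tree above can be written as two nested tests and checked against the training table. In this sketch the three age groups are assumed to be the disjoint bins `<20`, `[20,40]` and `>40`, and `needs_operation` is an illustrative name.

```python
def needs_operation(age_group, gene2):
    """Decision tree from the slide: split on age first, then on Gene 2."""
    if age_group == ">40":
        return "no"
    return "yes" if gene2 == "high" else "no"

# Training table: (age group, Gene 1, Gene 2, smoker, operation).
# The "<20" label for the youngest group is an assumption.
rows = [
    ("<20", "high", "high", "yes", "yes"),
    ("<20", "high", "high", "yes", "yes"),
    ("<20", "low", "low", "no", "no"),
    ("[20,40]", "low", "high", "yes", "yes"),
    ("[20,40]", "high", "high", "no", "yes"),
    ("[20,40]", "high", "low", "yes", "no"),
    (">40", "low", "low", "yes", "no"),
    (">40", "high", "low", "no", "no"),
    (">40", "low", "high", "no", "no"),
]

# Rows whose recorded outcome disagrees with the tree's prediction.
mismatches = [r for r in rows if needs_operation(r[0], r[2]) != r[4]]
print(len(mismatches), "training rows are misclassified")
```

Gene 1 and Smoker never appear in the tree, which is the sense in which a decision tree indicates which features matter most for this table.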

Page 67: Chip arrays and gene expression data. Motivation

Voting

Training data that need a classification algorithm (Yes/No)

Voting uses an array of machine learning algorithms and chooses the classification suggested by most classifiers.

[Diagram: the training data are given to three classifiers (decision trees, KNN, SVM); for a new test datum they vote No, Yes, Yes, and the combined answer is YES]
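Majority voting itself is a one-liner once each classifier has produced its answer. In the sketch below, the assignment of the individual votes to specific classifiers is an assumption; the slide only shows that the three votes are No, Yes, Yes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label suggested by most classifiers."""
    return Counter(predictions.values()).most_common(1)[0][0]

# Hypothetical assignment of the three votes shown on the slide.
votes = {"Decision trees": "No", "KNN": "Yes", "SVM": "Yes"}

decision = majority_vote(votes)
print(decision)
```

Using an odd number of classifiers guarantees a clear majority for a Yes/No question.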

Page 68: Chip arrays and gene expression data. Motivation

Classification is used outside the scope of bioinformatics

The distance between the query and each point in the dataset is computed. Based on the identity of the k nearest members, the digit is identified.

*More advanced algorithms allow rotation and enlargement of the digit to be classified.

Page 69: Chip arrays and gene expression data. Motivation

UPGMA - exercise

How will the points in the following distance table be clustered using UPGMA?

        x1    x2    x3    x4
x1       0     2    12    30
x2             0     8    10
x3                   0     4
x4                         0

After merging x1 and x2:

       x12    x3    x4
x12      0    10    20
x3             0     4
x4                   0

After merging x3 and x4:

       x12   x34
x12      0    15
x34            0

Page 70: Chip arrays and gene expression data. Motivation

Dataset sizes

A classifier is needed to detect “Pupko disease” based on gene expression.

Pupko disease is extremely rare (say, it afflicts 1 out of 100,000 people).

A classifier was trained on a large volume of samples in which all cases are negative. On a test dataset it correctly classified 99.9% of the cases…

That is no great feat: the fraction of positive cases in the test data is only ~0.01%.

Take-home message: (1) It is better to train a classifier on roughly equal numbers of “positive” and “negative” cases.

(2) Reporting only the percentage of accurate classifications is not enough. One has to report FP, FN, TP and TN (in this example, all positive cases are false negatives: a false negative rate of 100%).
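The point can be made concrete by counting the four outcomes for a classifier that always answers "negative". The test set below is hypothetical (100,000 samples, 10 of them positive), so its accuracy figure differs slightly from the one quoted on the slide.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical test set: 10 positive cases among 100,000 samples (0.01%).
y_true = [1] * 10 + [0] * 99_990
# A classifier that has only seen negatives and always predicts "negative".
y_pred = [0] * 100_000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # fraction of positive cases that are detected
print(f"accuracy = {accuracy:.4f}, sensitivity = {sensitivity:.2f}")
```

The accuracy looks excellent while the sensitivity is zero, which is why all four counts need to be reported.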

Page 71: Chip arrays and gene expression data. Motivation

Exercises - examples

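In clustering with UPGMA, I merged gene X and gene Y. The distance between gene X and gene T was 7, and the distance between gene Y and gene T was 9. Which of the following statements are correct?

• (a) The distance between the cluster that unites genes X and Y and gene T is 8.

• (b) The distance between the union of X and Y and gene T cannot be computed, because the distance between X and Y is not given.

• (c) The distance between gene X and gene Y is smaller than 7.

• a + b.

• a + c.

• b + c.

• a + b + c.

• None of the answers is correct.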


Page 73: Chip arrays and gene expression data. Motivation

Exercises - examples

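23. Which of the following statements is correct?

a. In SVM, the smaller the distance between the point to be classified and the separating surface, the smaller the chance that the classification is wrong.

b. In SVM, all points of type A are always on one side and all points of type B are on the other side.

c. One can develop an SVM that classifies proteins into transmembrane and non-transmembrane proteins.

d. None of the answers is correct.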


Page 75: Chip arrays and gene expression data. Motivation

Exercises - examples

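24. Consider the following figure:

[Figure: scatter plot with axes Gene 1 and Gene 2; black and white points, plus a query point “?”]

Which of the following statements is correct?

a. According to a (linear) SVM, the point with the question mark will be classified as a black point.

b. According to KNN with the number of neighbors equal to one, the point with the question mark will be classified as a white point.

c. a + b.

d. None of the answers is correct.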


Page 77: Chip arrays and gene expression data. Motivation

Legionella pneumophila case study

Page 78: Chip arrays and gene expression data. Motivation

How did it all begin?

Legionella pneumophila

Page 79: Chip arrays and gene expression data. Motivation

Legionnaires' disease nowadays

Page 80: Chip arrays and gene expression data. Motivation

Legionella pneumophila

Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology

Page 81: Chip arrays and gene expression data. Motivation

Identifying the effectors

Page 82: Chip arrays and gene expression data. Motivation

The features

• Homology to host proteins

• Regulatory elements

• Genome proximity to other effectors

• Secretion signal

• Abundance in Metazoa / Bacteria

• GC content

• Sequence homology

Page 83: Chip arrays and gene expression data. Motivation

The effectors machine

Page 84: Chip arrays and gene expression data. Motivation

The big picture

Features: similarity to known effectors, regulatory elements, similarity to host proteins, G-C content, secretory signals, abundance in Metazoa / Bacteria, genome arrangement

Feature selection

Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting

[Diagram elements: prior knowledge, validated effectors, non-effectors, trained model, unclassified genes, predicted effectors, predicted non-effectors, experimental validation, newly validated effectors]

Page 85: Chip arrays and gene expression data. Motivation

Does it really work?

Machine learning

Page 86: Chip arrays and gene expression data. Motivation