Clustered alignments of gene-expression time series data
Adam A. Smith, Aaron Vollrath, Christopher A. Bradfield and Mark Craven
Department of Biostatistics & Medical Informatics, Department of Computer Sciences and Department of Oncology, University of Wisconsin, Madison, USA
Bioinformatics, Vol. 25, pages i119-i127, 2009



Outline

• Introduction
• Method
  – SCOW
  – Clustered alignments
• Results and Discussion
• Conclusion


Introduction

• Characterizing and comparing temporal gene-expression responses is an important computational task for answering a variety of questions in biological studies.

• One application: toxicogenomics, which characterizes the potential toxicity of chemicals.


Introduction

• Answering similarity queries: assess similarity by determining the temporal correspondence between the query and the treatment.


Introduction

• Two issues:
  – First:
    • (Treatment B) All genes should be aligned together.
    • (Treatment C) Some genes need to be warped separately.
  – Second:
    • The best alignment may not account for the complete extent of both time series.
    • Allow a type of local alignment in which the end of one series is unaligned.
    • This is called shorting the alignment.


Introduction

• Multi-segment alignment method:

Shorting: an alignment path that represents shorting ends in the top row or the right column of the alignment space diagram, but not in the top-right cell.
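
As a small illustration of this definition, the check below classifies the end cell of an alignment path. The function name and the convention that the path end is given as (end_x, end_y), with x indexing the database series (columns) and y indexing the query series (rows), are assumptions made for this sketch and are not notation from the slides.

```python
def is_shorted(end_x, end_y, d_len, q_len):
    """Return True if a path ending at cell (end_x, end_y) represents shorting.

    Assumed convention (not from the slides): columns 0..d_len index the
    database series and rows 0..q_len index the query series.
    """
    in_top_row = end_y == q_len
    in_right_col = end_x == d_len
    # Shorted: the path reaches one edge of the alignment space
    # but stops somewhere other than the top-right corner.
    return (in_top_row or in_right_col) and not (in_top_row and in_right_col)
```

A path ending in the top-right cell aligns both series completely, so it is not shorted; a path ending anywhere else on the top row or right column leaves the tail of one series unaligned.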


Introduction

• To relax the assumption that "all genes are aligned in lockstep with one another":
  – Calculate clustered alignments.
  – Find clusters of genes such that genes within a cluster share a common alignment.
  – Each cluster is aligned independently of the others.
  – Similar to k-means:
    • Alternates between assigning genes to clusters and recomputing the alignment for each cluster using the genes assigned to it.

• To relax the requirement that the alignment cover "the complete extent of both time series":
  – Multi-segment alignment
  – Shorting


Method – SCOW (Shorting COW)

• COW (Nielsen et al., 1998)
  – A dynamic programming algorithm designed to find an optimal alignment between two series with multiple channels of information (such as genes).
  – Briefly, it aligns and scores two given time series based on their similarity.
  – The two series are denoted q (for query series) and d (for database series).
  – The series are partitioned into m segments, in which the i-th segments of the two series correspond to each other.
  – The score of a given alignment is the sum of the correlations between corresponding segments.
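
The scoring rule above can be sketched in a few lines of Python. This is a minimal single-gene illustration, not the authors' implementation: the helper names, the representation of an alignment as two lists of knot indices, and the use of linear interpolation to bring each database segment to the length of its query segment are all assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def resample(seg, n):
    """Linearly interpolate seg onto n evenly spaced points (assumed helper)."""
    if n == 1 or len(seg) == 1:
        return [seg[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(seg) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(seg) - 1)
        frac = pos - lo
        out.append(seg[lo] * (1 - frac) + seg[hi] * frac)
    return out


def alignment_score(q, d, q_knots, d_knots):
    """Sum of correlations between the m corresponding segments of q and d.

    q_knots and d_knots each hold m + 1 indices; segment i runs from
    knot i to knot i + 1, inclusive of both endpoints.
    """
    total = 0.0
    for i in range(len(q_knots) - 1):
        q_seg = q[q_knots[i]:q_knots[i + 1] + 1]
        d_seg = d[d_knots[i]:d_knots[i + 1] + 1]
        total += pearson(q_seg, resample(d_seg, len(q_seg)))
    return total
```

With m segments the score is at most m, reached when every pair of corresponding segments is perfectly correlated after warping.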


Method – SCOW

The segments are assumed to be of constant length, and are usually evenly spaced, in q.

Segment lengths are variable in d.

COW searches for good segment boundaries in only a limited area of the alignment space.

The vector K contains the coordinates of the knots (segment endpoints) in q.


Method – SCOW

– A zero-indexed score matrix, of dimensions m+1 by |d|+1.

– The element at (k, x) contains the score of the best alignment of d from zero to x and q from zero to k.

[Equation: recurrence for the matrix elements, scored with the Pearson correlation]

q(a,b): the subseries of q from a to b; d(a,b) is defined likewise.

The predecessor function lists valid starting locations in d for segments ending at x.
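
A form of the recurrence consistent with the definitions above is sketched below. The symbols Γ (the score matrix), ρ (the Pearson correlation), pred (the predecessor function) and K_k (the k-th knot in q) are notation assumed for this sketch, and the exact expression in the paper may differ.

```latex
\Gamma_{0,0} = 0, \qquad
\Gamma_{k,x} = \max_{x' \in \mathrm{pred}(x)}
  \Big[ \Gamma_{k-1,x'} + \rho\big(q(K_{k-1}, K_k),\, d(x', x)\big) \Big],
  \quad 1 \le k \le m .
```

Read this way, each element adds the correlation of the newest pair of segments to the best score of an alignment that is one segment shorter.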


Method – SCOW

– The best score: [Equation]
– A one-channel time series: the expression profile of a single gene.
– A multi-channel time series: the expression profile of a set of genes.
– The only difference between these two cases is in how the correlations are calculated.

– COW is apt to align segments which differ greatly in magnitude.


Method – SCOW

• SCOW
  – Searches for optimal knots in both dimensions.
  – First step: search independently in both dimensions.
  – Second step: SCOW alternates horizontal and vertical movement of each knot until it converges.


Method – SCOW

[Figure: the first step and the second step of the SCOW knot search]


Method – SCOW

– One score matrix is calculated when the algorithm searches for knots with respect to q and holds them constant with respect to d, while a second matrix is calculated in the opposite case.

– The predecessor function: a cone-shaped search space.


Method – SCOW

– Score function:
  • Includes terms that incur penalties for segments that involve stretching and significant differences in amplitude.

The stretching s_i is defined as the ratio of lengths between q_i and d_i, and a_i is the amplitude ratio between the two, as determined by a weighted least squares fitting procedure.
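
The two quantities can be illustrated as follows. This is a sketch under stated assumptions, not the paper's procedure: the fit is a one-parameter weighted least squares through the origin (q ≈ a · d), the weights are supplied by the caller, both segments are assumed to have been resampled to the same length beforehand, and the direction of each ratio (q over d) is chosen for illustration. How the ratios enter the penalty terms is not shown on the slide and is not modeled here.

```python
def stretch_ratio(q_len, d_len):
    """s_i: ratio of the query segment's length to the database segment's length."""
    return q_len / d_len


def amplitude_ratio(q_seg, d_seg, weights=None):
    """a_i: weighted least squares estimate of a in q ≈ a * d.

    Minimizing sum(w * (q - a * d) ** 2) over a gives
    a = sum(w * q * d) / sum(w * d * d).
    """
    if weights is None:
        weights = [1.0] * len(q_seg)  # assumed default: uniform weights
    num = sum(w * x * y for w, x, y in zip(weights, q_seg, d_seg))
    den = sum(w * y * y for w, y in zip(weights, d_seg))
    if den == 0:
        raise ValueError("database segment is all zeros; ratio undefined")
    return num / den
```

A stretch ratio far from 1 marks a heavily warped segment, and an amplitude ratio far from 1 marks two segments that correlate well but differ in magnitude, which is the case plain COW does not penalize.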


Method – Clustered alignment

• Find sets of genes that would have very similar alignments if they were aligned independently.

• A variant of traditional k-means clustering
  – Identifies clusters in which the genes have similar warpings.
  – The genes in one of our clusters may have very different expression profiles.


Method – Clustered alignment

The first step is to assign the initial alignment centroids by selecting a representative set of gene alignments as the centroids.

Subroutine Align returns the best alignment between two series based on a given set of genes.

ScoreGene returns the score of two series when aligned using a given alignment and a specified gene.

Record the best score so far for that gene using one of the current centroids.


Method – Clustered alignment

It alternates between assigning genes to clusters and recomputing the alignment for each cluster using the genes assigned to it.
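
The alternation can be sketched as a k-means-style loop. This is a minimal illustration under stated assumptions, not the paper's algorithm: the initial centroids are passed in rather than selected from per-gene alignments, and the toy `align` and `score_gene` functions below stand in for SCOW by treating an "alignment" as a single integer time shift scored by negative squared error.

```python
def clustered_alignment(genes, centroids, align, score_gene, max_iter=100):
    """Alternate between assigning genes to alignments and refitting them.

    align(gene_list) -> alignment; score_gene(alignment, gene) -> float.
    Returns (dict gene -> cluster index, list of cluster alignments).
    """
    centroids = list(centroids)  # do not mutate the caller's list
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each gene joins the centroid that scores it best.
        new_assignment = {
            g: max(range(len(centroids)), key=lambda c: score_gene(centroids[c], g))
            for g in genes
        }
        if new_assignment == assignment:
            break  # converged: no gene changed cluster
        assignment = new_assignment
        # Update step: realign each non-empty cluster using only its own genes.
        for c in range(len(centroids)):
            members = [g for g in genes if assignment[g] == c]
            if members:
                centroids[c] = align(members)
    return assignment, centroids


# Toy stand-ins (assumptions): query q and database d hold one series per gene.
q = {"g1": [0, 1, 2, 3], "g2": [5, 4, 3, 2], "g3": [1, 0, 1, 0]}
d = {"g1": [9, 0, 1, 2, 3, 9], "g2": [9, 5, 4, 3, 2, 9], "g3": [9, 9, 1, 0, 1, 0]}


def score_gene(shift, gene):
    """Negative squared error of one gene when q is shifted by `shift` in d."""
    return -sum((a - d[gene][t + shift]) ** 2 for t, a in enumerate(q[gene]))


def align(gene_list):
    """Best common shift (0, 1 or 2) for a set of genes, by brute force."""
    return max(range(3), key=lambda s: sum(score_gene(s, g) for g in gene_list))


assignment, alignments = clustered_alignment(list(q), [0, 2], align, score_gene)
```

In this toy run g1 (rising) and g2 (falling) end up in the same cluster because they share a shift, which mirrors the point that genes in one cluster can have very different expression profiles.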


Results and Discussion

• SCOW experiments
  – We construct queries for which we know the correct matching database treatments and their correct alignments.
  – The data we use come from the EDGE toxicology database (http://edge.oncology.wisc.edu).
  – The dataset consists of 216 unique observations of microarray data, each of which represents the values for 1600 different genes.
  – Times range from 6 h up to 96 h.
  – The data span 11 different treatments.


Results and Discussion

– Assemble 10 queries for each treatment by randomly sub-sampling time series in our dataset.

– We measure two kinds of accuracy:
  • Treatment accuracy: identify the treatment from which each query series was extracted.
  • Alignment accuracy: align the query points to their actual time points in the treatment.


Results and Discussion

The top line: treatment accuracy with different orders of splines.

The middle line: alignment accuracy, obtained by adding the criterion that the average time error in the mapping is less than or equal to 24 h.

The bottom line: alignment accuracy where this tolerance is decreased to 12 h.
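
The two measures described above can be computed as in the sketch below. The record layout (a dict per query with the keys shown) and the function name are assumptions for illustration; the only logic taken from the slides is that alignment accuracy adds an average-time-error criterion on top of identifying the correct treatment.

```python
def accuracies(results, tolerance_h):
    """Return (treatment_accuracy, alignment_accuracy) over a list of query results.

    Each result is a dict with keys (names assumed for this sketch):
      true_treatment, predicted_treatment: treatment labels
      true_times, mapped_times: equal-length lists of time points in hours
    """
    treatment_hits = 0
    alignment_hits = 0
    for r in results:
        if r["predicted_treatment"] != r["true_treatment"]:
            continue
        treatment_hits += 1
        errors = [abs(t - m) for t, m in zip(r["true_times"], r["mapped_times"])]
        # Additional criterion: the average time error is within the tolerance.
        if sum(errors) / len(errors) <= tolerance_h:
            alignment_hits += 1
    n = len(results)
    return treatment_hits / n, alignment_hits / n
```

Because the alignment criterion is applied only to queries whose treatment was identified correctly, alignment accuracy can never exceed treatment accuracy, and tightening the tolerance from 24 h to 12 h can only lower it.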


Results and Discussion

– Conclusion:
  • Multi-segment alignments computed by SCOW, COW and the generative multi-segment method are superior to the alignments determined by ordinary dynamic time warping and the linear alignment method.
  • SCOW finds more accurate alignments than the other two multi-segment algorithms.


Results and Discussion

• Clustered alignment experiments


Conclusion

• The paper presents new methods that advance the alignment of gene-expression time series in two ways:
  – Computing clustered alignments
  – A new multi-segment alignment method, called SCOW