PROTERAN: ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY

The need for Bioinformatics

Bioinformatics: Application of computational techniques to the management and analysis of biological information.

Clustering techniques alone are not enough; a good visual representation of the data is also needed.

Agenda

Microarrays

Review of existing clustering and visualization techniques on gene expression data

The need for a customized visualization tool for use by Dr. Laxmi Parida & Dr. Ruhong Zhou of the computational biology group at the IBM Watson Research Center for visual analysis of protein characteristics

Introduce our new technique that makes use of an animated terrain, implemented in the program called PROTERAN

Function of Genes & Proteins

Through the proteins they encode, genes orchestrate the mysteries of life

Protein functions vary widely from mechanical support to transportation to regulation.

Still a lot of work ahead

Traditionally, gene functions were discovered on a gene-by-gene basis, so throughput was low.

Many genes are believed to work together, which a one-by-one approach cannot reveal.

Microarrays

Solve the throughput problem

Allow scientists to see genes on a genomic level

Expression Matrix

          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM

Clustering & Visualization Techniques Review

Clustering

Clustering: Act of grouping similar objects together

Applied to gene expression in order to find the function of unknown genes

Many different clustering techniques exist in the literature. Representative techniques are discussed next.

Determining similarity between two genes

Choose a distance measure to compare genes

e.g. Euclidean distance

          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
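As a minimal sketch of the Euclidean distance mentioned above, the following Python function compares two rows of the expression matrix. The function name and the sample expression values are hypothetical and only illustrate the calculation.

```python
import math


def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two expression profiles of equal length."""
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes need one value per experiment")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))


# Hypothetical expression ratios for two genes across M = 3 experiments.
gene_1 = [1.0, 2.0, 4.0]
gene_2 = [1.0, 5.0, 0.0]
distance = euclidean_distance(gene_1, gene_2)  # sqrt(0 + 9 + 16) = 5.0
```

A smaller distance means the two genes behave more alike across the experiments.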

Hierarchical Clustering

1. Create a distance matrix between all pairs of genes

2. Find the two closest genes

3. Merge these two genes into one cluster and recompute the distance matrix

4. Repeat steps 2-3 until only one cluster is left

Dendrogram

Binary tree with a distinguished root, which has all the data items at the leaves

Re-orders the expression matrix to place similar genes beside each other

Example

      A   B   C   D
A     0   1   6   8
B         0   5   7
C             0   2
D                 0

        (A,B)   C   D
(A,B)   0       5   7
C               0   2
D                   0

        (A,B)   (C,D)
(A,B)   0       5
(C,D)           0

Agglomerative Hierarchical Clustering
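The example above can be reproduced with a short Python sketch. The merged distances in the example (5 = min(6, 5) and 7 = min(8, 7)) are consistent with single linkage, so that rule is assumed here; the slides do not name the linkage rule, and the function name is hypothetical.

```python
def single_linkage(labels, distances):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(label,) for label in labels]
    # Distances are keyed by an unordered pair of clusters (tuples of labels).
    d = {frozenset({(a,), (b,)}): v for (a, b), v in distances.items()}
    merges = []
    while len(clusters) > 1:
        # Step 2: find the two closest clusters.
        x, y = min(
            ((p, q) for i, p in enumerate(clusters) for q in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merges.append((x, y, d[frozenset({x, y})]))
        # Step 3: merge them and recompute distances (single linkage = minimum).
        merged = x + y
        clusters = [c for c in clusters if c not in (x, y)]
        for c in clusters:
            d[frozenset({merged, c})] = min(d[frozenset({x, c})], d[frozenset({y, c})])
        clusters.append(merged)
    return merges


example = {("A", "B"): 1, ("A", "C"): 6, ("A", "D"): 8,
           ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 2}
merges = single_linkage("ABCD", example)
# merges: A+B at 1, then C+D at 2, then (A,B)+(C,D) at 5
```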

Advantages

Familiar to biologists

Few parameters to specify

Disadvantages

Requires fast CPUs and large amounts of memory

Does not identify important clusters

Only represents hierarchically organized data

Does not scale up

Disadvantages cont.

A dendrogram always offers 2^{n-1} equivalent representations (where n = number of elements), because the two branches at each of the n-1 internal nodes can be swapped
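The number of equivalent dendrogram drawings can be checked by brute force. The sketch below assumes a dendrogram encoded as nested 2-tuples with string leaves (a hypothetical encoding chosen for illustration) and enumerates every leaf order obtained by swapping branches.

```python
from itertools import product


def leaf_orders(tree):
    """All left-to-right leaf orders reachable by swapping branches."""
    if isinstance(tree, str):  # a leaf
        return [[tree]]
    left, right = tree
    orders = []
    for l, r in product(leaf_orders(left), leaf_orders(right)):
        orders.append(l + r)  # branches as drawn
        orders.append(r + l)  # branches swapped
    return orders


# Dendrogram over n = 4 elements: ((A, B), (C, D)).
orders = leaf_orders((("A", "B"), ("C", "D")))
n_orders = len({tuple(o) for o in orders})  # 2 ** (4 - 1) = 8
```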

Self-Organizing Maps (SOMs)

User picks the number of clusters, called nodes

Nodes are randomly mapped to M-dimensional space (M = number of experiments)

Node values are adjusted using vectors picked at random from the original data

After the node values settle, each vector is clustered to its closest node
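The steps above can be sketched in plain Python. This is a simplified illustration, not the algorithm used by any particular package: the nodes are assumed to sit on a one-dimensional chain, the learning rate decays linearly, and only the immediate chain neighbours of the winning node are moved (at half strength). All names and parameter values are hypothetical.

```python
import random


def closest(nodes, x):
    """Index of the node nearest to vector x (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], x)))


def train_som(data, n_nodes, n_iters=2000, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(v[j] for v in data) for j in range(dim)]
    hi = [max(v[j] for v in data) for j in range(dim)]
    # Nodes start at random positions inside the range of the data.
    nodes = [[rng.uniform(lo[j], hi[j]) for j in range(dim)] for _ in range(n_nodes)]
    for t in range(n_iters):
        lr = lr0 * (1 - t / n_iters)      # assumed linear decay
        x = rng.choice(data)              # random vector from the original data
        best = closest(nodes, x)
        for k, node in enumerate(nodes):
            gap = abs(k - best)
            if gap > 1:
                continue                  # only the winner and its chain neighbours move
            strength = lr if gap == 0 else lr * 0.5
            for j in range(dim):
                node[j] += strength * (x[j] - node[j])
    return nodes


# Hypothetical expression vectors (M = 2 experiments).
data = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (9.8, 10.1)]
nodes = train_som(data, n_nodes=2)
labels = [closest(nodes, x) for x in data]  # each vector goes to its closest node
```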

Visualization

1. Dendrogram

2. Error Bar Representation

Visualization

3. U-Matrix

Advantages

User has partial control over structure

Fuzzy clusters

Variety of visual techniques applicable

Disadvantages

Number of clusters must be known beforehand

Many parameters to specify

Principal Component Analysis (PCA)

Mathematical technique that can be used to reduce the number of dimensions of data
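To illustrate how PCA reduces dimensions, the sketch below finds the first principal component with power iteration on the covariance matrix and projects the data onto it. Power iteration is used only to keep the example dependency-free; it is an illustrative choice, the function names are hypothetical, and the method assumes the data has non-zero variance and that the starting vector is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, n_iters=100):
    """Unit vector along the direction of greatest variance."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim
    for _ in range(n_iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered


# Hypothetical 2-D data lying on the line y = 2x.
rows = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
pc1, centered = first_principal_component(rows)
# One coordinate per row instead of two.
projected = [sum(c * p for c, p in zip(row, pc1)) for row in centered]
```

For this data the component points along (1, 2) / sqrt(5), so a single coordinate per row captures all of the variance.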

Visualization

Advantages

No parameters required

3D Visualization

Disadvantages

Little control over structure

Running time of O(N^3)

Not applicable when input is a distance matrix

Biclustering

Clustering of both rows and columns simultaneously

Available Software

Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison | http://abs.cit.nih.gov/fscan/
TIGR SpotFinder | Spot identification | http://www.tigr.org/software/
Cluster | Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA | http://rana.lbl.gov/EisenSoftware.htm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA | http://genome.tugraz.at/Software/GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization | http://www.molmine.com/frameset/frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm

Protein Folding

Reaction Coordinates

Folding determines the function of a protein

All-atom recreation of a protein is unrealistic

Reaction coordinates are used to describe protein structure:

1. Fraction of native contacts

2. Radius of gyration

3. RMSD from the native structure

4. Number of beta-strand hydrogen bonds

5. Number of alpha-helix turns

6. Hydrophobic core radius of gyration

7. Principal components

Protein States

While folding, a protein goes through certain states

The raw data is similar to microarray data.

Dr. Parida and Dr. Zhou have developed their own techniques and clustered β-Hairpin data.

Reaction Coordinates used on the β-Hairpin

1. Number of native β-strand hydrogen bonds

2. Radius of gyration of the hydrophobic core residues

3. Radius of gyration of the entire protein

4. Fraction of native contacts

5. Principal component 1

6. Principal component 2

7. Root mean square deviation (RMSD) from the native structure
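Two of the reaction coordinates listed above have simple closed forms, sketched here in Python. The sketch assumes equal atomic masses for the radius of gyration and assumes the two structures are already superimposed for the RMSD (no alignment step is performed); the function names and coordinates are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Root mean square distance of the atoms from their centroid (equal masses assumed)."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_sq = sum(sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords) / n
    return math.sqrt(mean_sq)


def rmsd(coords, native):
    """Root mean square deviation between matching atoms of two pre-aligned structures."""
    if len(coords) != len(native):
        raise ValueError("structures must contain the same number of atoms")
    mean_sq = sum(
        sum((a[k] - b[k]) ** 2 for k in range(3)) for a, b in zip(coords, native)
    ) / len(coords)
    return math.sqrt(mean_sq)


rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])  # 1.0
deviation = rmsd([(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
                 [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])         # 1.0
```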

Raw Data

Patterned Cluster

Each patterned cluster is written on two lines, color-coded on the original slide as follows:

RED = Number of columns in pattern (also defined as the Pattern Type)

WHITE = Column Number

PURPLE = Column Value

YELLOW = Number of occurrences

GREEN = Occurrences

2  0  0.1  4  0.23

3  23  26  27

In this example the pattern type is 2 (column 0 with value 0.1 and column 4 with value 0.23), and the pattern has 3 occurrences: 23, 26 and 27.

Sample Patterned Cluster File

2 0 7.335 1 0.735

1006 59728 87235 94826-94831 95748-95752 95761-95763 … 120424-120426

2 0 7.335 1 0.736

1003 59728 87235 94826-94831 95748-95752 95761-95763 … 95769

3 0 7.335 4 -5.881 6 3.292

1036 59728 72071 87235 94826 94828-94831 … 95761-95763

3 0 7.335 4 -5.881 5 2.214

1056 59728 72071 87235 94826 94828-94831 … 95761-95763

:

5 2 8.144 3 0.899 4 -3.855 5 -33.574 6 3.292

1089 45533 59728 72071 87235 94826 … 95748-95752
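A record in this format can be read with a few lines of Python. The sketch below assumes that each patterned cluster spans two lines (pattern line, then occurrence line) and that a token such as 94826-94831 denotes an inclusive range of occurrences; both points are inferred from the sample above, and the function name and returned keys are hypothetical.

```python
def parse_patterned_cluster(pattern_line, occurrence_line):
    """Parse one two-line patterned cluster record into a dictionary."""
    fields = pattern_line.split()
    pattern_type = int(fields[0])
    if len(fields) != 1 + 2 * pattern_type:
        raise ValueError("expected one (column, value) pair per column in the pattern")
    # Map column number -> column value.
    columns = {int(fields[1 + 2 * i]): float(fields[2 + 2 * i])
               for i in range(pattern_type)}

    tokens = occurrence_line.split()
    count = int(tokens[0])
    occurrences = []
    for token in tokens[1:]:
        if "-" in token:                      # assumed inclusive range, e.g. 7-9
            start, end = token.split("-")
            occurrences.extend(range(int(start), int(end) + 1))
        else:
            occurrences.append(int(token))
    return {"pattern_type": pattern_type, "columns": columns,
            "count": count, "occurrences": occurrences}


record = parse_patterned_cluster("2 0 0.1 4 0.23", "3 23 26 27")
ranged = parse_patterned_cluster("1 0 7.335", "4 5 7-9")
```

Given the reported file size of about 500 MB, a real reader would stream the file two lines at a time rather than load it whole.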

The need for Visual Analysis of Patterned Cluster Data

The β-Hairpin file is approximately 500 MB

Difficult to study the textual representation and get a global view

Very difficult to see how all the patterned clusters interact with each other

Also very difficult to remember all patterned clusters and their occurrence in time

Visual Requirements

Global View

Navigation & Focus

Relative growth

Details of characteristics on demand

Need for Customized Tool

All of the existing visualization techniques on microarrays had one or more drawbacks

None could depict the relative growth of clusters.

Terrain Metaphor

Has been shown to be a useful technique in searching a corpus of documents

Very recently the idea has been applied to gene expression data, with high-density clusters represented as mountains

Using a Landscape Metaphor to meet our requirements

Each mountain represents a patterned cluster

Mountain growth represents the evolution of the patterned cluster

Clicking on a mountain returns the details of its patterned cluster

PROTERAN

Mapping of Patterned Cluster Data into Terrain Geometry

Mapping of Patterned Cluster Data into Terrain Geometry

Pattern Type: Number of columns in a patterned cluster

Column Combination: Unique number that identifies a combination of columns

2  0  0.1 4  0.23

3  23    26     27

Column Combinations

Number of column combinations = \frac{c!}{(c - t)! \, t!}

c = number of characteristics

t = pattern number (the pattern type)

With the c = 7 characteristics used for the β-Hairpin:

Pattern Type   Number of Column Combinations
2              21
3              35
4              35
5              21
6              7
7              1
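The counts in the table can be checked, and the column combination labels generated, with Python's standard library. The sketch assumes c = 7 characteristics numbered 0 to 6, which matches both the table and the labels used in the layout; the function name is hypothetical.

```python
from itertools import combinations
from math import comb

N_CHARACTERISTICS = 7  # assumed: columns are numbered 0..6


def column_combinations(pattern_type, n_characteristics=N_CHARACTERISTICS):
    """Labels such as '01' or '0123456' for every combination of that many columns."""
    return ["".join(str(c) for c in combo)
            for combo in combinations(range(n_characteristics), pattern_type)]


counts = {t: len(column_combinations(t)) for t in range(2, 8)}
# counts == {2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
formula_agrees = all(counts[t] == comb(N_CHARACTERISTICS, t) for t in counts)
```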

Layout

We first thought of using an automated layout technique. However, one of Dr. Zhou’s requirements was that the same patterned cluster should appear in the same position for consistent interpretation.

Another was that the larger pattern types (6 and 7 columns) must be placed so that they are clearly distinguishable.

Hence we decided to use a manual layout design, described next.

Layout

01 02 03 01234 01235 01236 012 013 014 015 016

04 05 06 01245 01246 01256 023 024 025 026 034

12 13 14 01345 01346 01356 035 036 045 046 056

15 16 23 01456 02345 02346 123 124 125 126 134

24 25 26 02356 02456 03456 135 136 145 146 156

34 35 36 12345 12346 12356 234 235 236 245 246

45 46 56 12456 13456 23456 256 345 346 356 456

0123 0124 0125 0126 0134

0135 0136 0145 0146 0156

012345 012346 012356 0234 0235 0236 0245 0246

0123456 012456 013456 023456 0256 0345 0346 0356 0456

123456 1234 1235 1236 1245 1246

1256 1345 1346 1356 1456

2345 2346 2356 2456 3456

Top Patterned Clusters Visualized

A final requirement from Dr. Parida and Dr. Zhou was that only the 10 largest patterned clusters of each column combination should be visualized

[Figure: placement of the ten highest-occurrence patterned clusters of column combination 01, labeled from Highest Occurrence to 10th Highest Occurrence]
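Selecting the ten largest patterned clusters per column combination could look like the following Python sketch. It assumes each cluster is a dictionary with a `columns` tuple and an occurrence `count`; these field names and the function name are hypothetical, not taken from PROTERAN.

```python
import heapq
from collections import defaultdict


def top_clusters(clusters, limit=10):
    """Keep the `limit` clusters with the most occurrences in each column combination."""
    by_combination = defaultdict(list)
    for cluster in clusters:
        by_combination[cluster["columns"]].append(cluster)
    return {
        combo: heapq.nlargest(limit, group, key=lambda c: c["count"])
        for combo, group in by_combination.items()
    }


# Hypothetical input: twelve clusters on columns (0, 1) and one on (0, 4).
clusters = [{"columns": (0, 1), "count": n} for n in range(1, 13)]
clusters.append({"columns": (0, 4), "count": 7})
selected = top_clusters(clusters)
kept_counts = [c["count"] for c in selected[(0, 1)]]  # 12 down to 3
```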

PROTERAN LAYOUT

Animated Terrain Evolution

Time proceeds from 0 to the maximum number of experiments

At each time unit, all patterned clusters are checked

If a patterned cluster has an occurrence at that time, its mountain’s height is increased
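The update rule above can be sketched as a generator that yields one height snapshot per time unit. The fixed height increment per occurrence, the function name, and the sample data are assumptions for illustration; the rendering of the terrain itself is outside the scope of the sketch.

```python
def terrain_frames(clusters, max_time, step=1.0):
    """Yield (time, heights) snapshots; `clusters` maps a name to its occurrence times."""
    heights = {name: 0.0 for name in clusters}
    occurrence_sets = {name: set(times) for name, times in clusters.items()}
    for t in range(max_time + 1):
        # Check every patterned cluster at this time unit.
        for name, times in occurrence_sets.items():
            if t in times:
                heights[name] += step  # the mountain grows by an assumed fixed step
        yield t, dict(heights)


# Hypothetical occurrence times for two patterned clusters.
frames = list(terrain_frames({"A": [0, 2, 3], "B": [3]}, max_time=3))
# Final frame: (3, {"A": 3.0, "B": 1.0})
```

Converting each occurrence list to a set keeps the per-time-unit check cheap even when a cluster has many occurrences.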

Mountains of PROTERAN

Results & Extensions

Results

Very encouraging feedback

The easy-to-use layout and interface allow:

1. Identification of states

2. Obtaining the values of patterned clusters

3. Relating patterned clusters to each other as they grow over time

On first use, Dr. Zhou said that he was able to find that the hydrophobic core is largely formed before the beta-strand hydrogen bonds are formed.

Future of PROTERAN

Introduced at the Intelligent Systems for Molecular Biology (ISMB) conference in Scotland, where it was very well received

Robert-Cedergren Bioinformatics Colloquium at the University of Montreal (Sept 23-24)

Extensions

Analysis of different types of protein data

More generic layout with more characteristics

Application to different types of data

Summary

1. Review of existing techniques to cluster and visualize gene expression data

2. Protein characteristics data is similar to gene expression data

3. None of the existing techniques met our requirements, hence the need for a customized visualization

4. Terrain metaphor used to meet our requirements, implemented in the program PROTERAN

Questions