LSM3241: Bioinformatics and Biocomputing Lecture 8: Gene Expression Profiles and Microarray Data Analysis Prof. Chen Yu Zong Tel: 6874-6877 Email: [email protected]

  • LSM3241: Bioinformatics and Biocomputing

    Lecture 8: Gene Expression Profiles and Microarray Data Analysis

    Prof. Chen Yu Zong

    Tel: 6874-6877

    Email: [email protected]://xin.cz3.nus.edu.sg

    Room 07-24, level 7, SOC1, NUS

  • Biology and Cells

    All living organisms consist of cells (trillions of cells in a human; yeast has one cell).

    Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg).

    Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

  • Gene Expression

    Cells are different because of differential gene expression. About 40% of human genes are expressed at one time.

    A gene is expressed by transcribing DNA into single-stranded mRNA.

    mRNA is later translated into a protein.

    Microarrays measure the level of mRNA expression.

  • Overview of Molecular Biology

    [Figure labels: Cell, Nucleus, Chromosome, Protein, Gene (DNA), Gene (mRNA), single strand, cDNA]

  • Gene Expression

    Genes control cell behavior by controlling which proteins are made by a cell

    Housekeeping genes vs. cell/tissue specific genes

    Regulation:

    Transcriptional (promoters and enhancers)

    Post Transcriptional (RNA splicing, stability, localization, small non-coding RNAs)

  • Gene Expression

    Regulation:

    Translational (3' UTR repressors, poly A tail)

    Post Transcriptional (RNA splicing, stability, localization, small non-coding RNAs)

    Post Translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)

  • Gene Expression Measurement

    mRNA expression represents dynamic aspects of the cell

    mRNA expression can be measured with the latest technology

    mRNA is isolated and labeled with a fluorescent dye

    mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser

  • Traditional Methods

    Northern Blotting: single RNA isolated, probed with labeled cDNA

    RT-PCR: primers amplify specific cDNA transcripts

  • Microarray Technology

    Microarray:

    New technology (first paper: 1995)

    Allows study of thousands of genes at the same time

    Glass slide of DNA molecules

    Molecule: string of bases (25 bp to 500 bp) that uniquely identifies the gene or unit to be studied

  • Gene Expression Microarrays

    The main types of gene expression microarrays:

    Short oligonucleotide arrays (Affymetrix)

    cDNA or spotted arrays (Brown/Botstein)

    Long oligonucleotide arrays (Agilent Inkjet)

    Fiber-optic arrays

    ...

  • Fabrications of Microarrays

    Size of a microscope slide

    Images: http://www.affymetrix.com/

  • Differing Conditions

    Ultimate goal: understand expression level of genes under different conditions

    Helps to:

    Determine genes involved in a disease

    Determine pathways to a disease

    Serve as a screening tool

  • Gene Conditions

    Cell types (brain vs. liver)

    Developmental (fetal vs. adult)

    Response to stimulus

    Gene activity (wild vs. mutant)

    Disease states (healthy vs. diseased)

  • Expressed Genes

    Genes under a given condition

    mRNA extracted from cells

    mRNA labeled

    Labeled mRNA is mRNA present in a given condition

    Labeled mRNA will hybridize (base pair) with corresponding sequence on slide

  • Two Different Types of Microarrays

    Custom spotted arrays (up to 20,000 sequences): cDNA, oligonucleotide

    High-density (up to 100,000 sequences) synthetic oligonucleotide arrays: Affymetrix (25 bases)

    (Show Affymetrix layout)

  • Custom Arrays

    Mostly cDNA arrays

    2-dye (2-channel)

    RNA from two sources (cDNA created)

    Source 1: labeled with red dye

    Source 2: labeled with green dye

  • Two Channel Microarrays

    Microarrays measure gene expression

    Two different samples: control (green label) and sample (red label)

    Both are washed over the microarray

    Hybridization occurs

    Each spot is one of 4 colors

  • Microarray Technology

  • Microarray Image Analysis

    Microarrays detect gene interactions: 4 colors

    Green: high control

    Red: high sample

    Yellow: equal

    Black: none

    Problem is to quantify image signals

  • Single Color Microarrays

    Prefabricated: Affymetrix (25mers)

    Custom: cDNA (500 bases or so), spotted oligos (70-80 bases)

  • Microarray Animations

    Davidson University: http://www.bio.davidson.edu/courses/genomics/chip/chip.html

    Imagecyte: http://www.imagecyte.com/array2.html

  • Basic idea of Microarray

    Construction

    Place array of probes on microchip

    Probe (for example) is an oligonucleotide ~25 bases long that characterizes a gene or genome

    Each probe has many, many clones

    Chip is about 2 cm by 2 cm

    Application principle

    Put (liquid) sample containing genes on microarray, allow probe and gene sequences to hybridize, and wash away the rest

    Analyze hybridization pattern

  • Microarray analysis

    Operation Principle:

    Samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization)

    Microarray may have 60K probes

  • Gene Expression Data

    Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples)

    Gene expression level of gene i in mRNA sample j = Log(Red intensity / Green intensity), or Log(Avg. PM - Avg. MM)

    Gene   sample1   sample2   sample3   sample4   sample5   ...
    1       0.46      0.30      0.80      1.51      0.90     ...
    2      -0.10      0.49      0.24      0.06      0.46     ...
    3       0.15      0.74      0.04      0.10      0.20     ...
    4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
    5      -0.06      1.06      1.35      1.09     -1.09     ...
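
The log-ratio definition on this slide can be illustrated with a minimal sketch. The base of the logarithm is not stated on the slide, so base 2 is an assumption here, and the spot intensities are hypothetical values chosen only for illustration.

```python
import math

def log_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot (base 2 is an assumption)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected intensities (red, green); not from the lecture.
spots = {
    "gene1": (1200.0, 600.0),   # red twice as bright as green
    "gene2": (500.0, 500.0),    # equal in both channels
    "gene3": (250.0, 1000.0),   # green four times as bright as red
}

ratios = {gene: log_ratio(red, green) for gene, (red, green) in spots.items()}
for gene, value in ratios.items():
    print(f"{gene}: {value:+.2f}")
```

With a log ratio, equal intensities map to 0, and a two-fold change in either direction has the same magnitude with opposite sign.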

  • Some possible applications

    Sample from specific organ to show which genes are expressed and responsible for a functionality

    Compare samples from healthy and sick host to find gene-disease connection

    Analyze samples to differentiate sick and healthy, disease subtypes, drug response groups

    Probe samples, including human pathogens, for disease detection

  • Huge amount of data from single microarray

    If just two color, then amount of data on array with N probes is 2N

    Cannot analyze pixel by pixel

    Analyze by pattern cluster analysis

  • Major Data Mining Techniques

    Link Analysis: Associations Discovery, Sequential Pattern Discovery, Similar Time Series Discovery

    Predictive Modeling: Classification (assigns genes into known classes), Clustering (groups genes into unknown clusters)

  • Supervised vs. Unsupervised Learning

    Supervised: there is a teacher, class labels are known (support vector machines, backpropagation neural networks)

    Unsupervised: no teacher, class labels are unknown (clustering, self-organizing maps)

  • Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both

    Strengthens signal when averages are taken within clusters of genes (Eisen)

    Useful (essential?) when seeking new subclasses of cells, diseases, drug responses etc.

    Leads to readily interpreted figures

  • Some clustering methods and software

    Partitioning: K-Means, K-Medoids, PAM, CLARA

    Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK

    Density-based: CAST, DBSCAN, OPTICS, CLIQUE

    Grid-based: STING, CLIQUE, WaveCluster

    Model-based: SOM (self-organized map), COBWEB, CLASSIT, AutoClass

    Two-way clustering

    Block clustering

  • Partitioning

  • Density-based clustering

  • Hierarchical (used most often)

  • Expression Vectors

    Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

    Numeric vector: -0.8 0.8 1.5 1.8 0.5 -1.3 -0.4 1.5

    [Figure: the same vector shown as a line graph and as a heat map]

  • Expression Vectors As Points in Expression Space

    [Figure: genes G1 to G5, each measured at t1, t2 and t3, plotted as points on axes Experiment 1, Experiment 2 and Experiment 3; genes with similar expression lie close together]

  • Cluster Analysis

    Group a collection of objects into subsets or clusters such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

  • How can we do this?

    What is closely related? Distance or similarity metric

    What is close? Clustering algorithm

    How do we minimize distance between objects in a group while maximizing distances between groups?

  • Distance Metrics

    Euclidean distance measures average distance

    Manhattan (city block) measures average in each dimension

    Correlation measures difference with respect to linear trends

    [Figure: two points, (5.5, 6) and (3.5, 4), plotted on axes Gene Expression 1 and Gene Expression 2]

  • Clustering Time Series Data

    Measure gene expression on consecutive days

    Gene measurement matrix:

    G1 = [1.2 4.0 5.0 1.0]
    G2 = [2.0 2.5 5.5 6.0]
    G3 = [4.5 3.0 2.5 1.0]
    G4 = [3.5 1.5 1.2 1.5]

  • Euclidean Distance

    Distance is the square root of the sum of the squared differences between coordinates

  • City Block or Manhattan Distance

    Distance is the sum of the absolute differences between coordinates

    G1 = [1.2 4.0 5.0 1.0]
    G2 = [2.0 2.5 5.5 6.0]
    G3 = [4.5 3.0 2.5 1.0]
    G4 = [3.5 1.5 1.2 1.5]
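
The two definitions above can be applied directly to the four profiles G1 to G4. The following is a minimal sketch in plain Python; the helper names (`euclidean`, `manhattan`, `closest_pair`) are illustrative and not part of the lecture.

```python
import math
from itertools import combinations

# Expression profiles copied from the slide (four consecutive days).
profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def closest_pair(metric):
    """Return the pair of gene names with the smallest distance under `metric`."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: metric(profiles[pair[0]], profiles[pair[1]]),
    )

# Print every pairwise distance under both metrics.
for g, h in combinations(sorted(profiles), 2):
    e = euclidean(profiles[g], profiles[h])
    m = manhattan(profiles[g], profiles[h])
    print(f"{g}-{h}: Euclidean {e:.3f}, Manhattan {m:.3f}")
```

For these four profiles, both metrics pick G3 and G4 as the closest pair, so an agglomerative method would merge them first under either choice.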

  • Correlation Distance

    Pearson correlation measures the degree of linear relationship between variables, range [-1, 1]

    Distance is 1 - (Pearson correlation), range [0, 2]
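
A minimal sketch of the correlation distance, written without external libraries. The profiles are the G2, G3 and G4 vectors from the time series slide; the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    return 1.0 - pearson(x, y)

g2 = [2.0, 2.5, 5.5, 6.0]   # rising profile
g3 = [4.5, 3.0, 2.5, 1.0]   # falling profile
g4 = [3.5, 1.5, 1.2, 1.5]   # mostly falling profile

print(f"d(G2, G3) = {correlation_distance(g2, g3):.3f}")
print(f"d(G3, G4) = {correlation_distance(g3, g4):.3f}")
```

Because the correlation ignores offset and scale, G3 and G4 come out close (both fall over time), while G2 and G3 are near the maximum distance of 2.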

  • Similarity Measurements

    Pearson correlation between two profiles (vectors)

    +1 ≥ Pearson correlation ≥ -1

  • Hierarchical Clustering (HCL-1)

    IDEA: Iteratively combine genes into groups based on similar patterns of observed expression.

    By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.

    Display the data as a heat map and dendrogram.

    Cluster genes, samples, or both.

  • Hierarchical Clustering

    [Figure: dendrogram and Venn diagram of clustered data]

  • Hierarchical clustering

    Merging (agglomerative): start with every measurement as a separate cluster, then combine

    Splitting: make one large cluster, then split up into smaller pieces

    What is the distance between two clusters?

  • Distance between clusters

    Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster

    Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster

    Average: distance is the average of the pairwise distances between members of the two clusters

    Ward: merges the two clusters whose combination gives the smallest increase in the within-cluster sum of squares
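
The linkage rules can be compared on a small example. This is a sketch under stated assumptions: the 2-D points are hypothetical, Euclidean distance is used between points, and two readings of "average" are shown side by side (mean of all pairwise distances, and distance between the cluster means, often called centroid linkage). Ward's criterion is not covered here.

```python
import math

def point_distance(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(point_distance(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Longest distance between any member of c1 and any member of c2."""
    return max(point_distance(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Mean of all pairwise distances between members of c1 and c2."""
    return sum(point_distance(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    """Distance between the coordinate-wise means of the two clusters."""
    def mean(c):
        return tuple(sum(col) / len(c) for col in zip(*c))
    return point_distance(mean(c1), mean(c2))

# Hypothetical clusters, chosen so that the four rules give different values.
cluster_x = [(0.0, 0.0), (0.0, 2.0)]
cluster_y = [(3.0, 0.0), (4.0, 2.0)]

for rule in (single_link, complete_link, average_link, centroid_link):
    print(f"{rule.__name__}: {rule(cluster_x, cluster_y):.3f}")
```

Single link always gives the smallest value and complete link the largest, with the pairwise average in between, which is one reason the choice of linkage can change the resulting tree.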

  • Hierarchical Clustering: Merging

    [Figure: gene expression time series clustered with Euclidean distance and average linking; axis label: Distance between clusters when combined]

  • Manhattan Distance

    [Figure: gene expression time series clustered with Manhattan distance and average linking; axis label: Distance between clusters when combined]

  • Correlation Distance

  • Data Standardization

    Data points are normalized with respect to mean and variance, "sphering" the data

    After sphering, Euclidean and correlation distance are equivalent

    Standardization makes sense if you are not interested in the size of the effects, but in the effect itself

    Results are misleading for noisy data
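
The equivalence can be checked numerically. The sketch below assumes that each profile is standardized on its own to mean 0 and variance 1 (dividing by n, not n - 1). Under that assumption the squared Euclidean distance between two standardized profiles of length n equals 2n(1 - r), where r is their Pearson correlation, so both distances rank pairs identically.

```python
import math

def standardize(x):
    """Shift to mean 0 and scale to unit (population) variance."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / sd for a in x]

def pearson(x, y):
    """Pearson correlation, computed as the mean product of standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def squared_euclidean(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# G2 and G3 from the time series slide.
g2 = [2.0, 2.5, 5.5, 6.0]
g3 = [4.5, 3.0, 2.5, 1.0]
n = len(g2)

lhs = squared_euclidean(standardize(g2), standardize(g3))
rhs = 2 * n * (1 - pearson(g2, g3))
print(f"squared Euclidean after standardizing: {lhs:.4f}")
print(f"2n(1 - r):                             {rhs:.4f}")
```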

  • Hierarchical Clustering

    Initial data items and distance matrix:

    Dist   A    B    C    D
    A           20   7    2
    B                10   25
    C                     3
    D

  • Hierarchical Clustering (Single Linkage)

    Current clusters and distance matrix (A and D merge at height 2):

    Dist   A    B    C    D
    A           20   7    2
    B                10   25
    C                     3
    D

  • Hierarchical Clustering (Single Linkage)

    Current clusters and distance matrix:

    Dist   AD   B    C
    AD          20   3
    B                10
    C

  • Hierarchical Clustering (Single Linkage)

    Current clusters and distance matrix (AD and C merge at height 3):

    Dist   AD   B    C
    AD          20   3
    B                10
    C

  • Hierarchical Clustering (Single Linkage)

    Current clusters and distance matrix:

    Dist   ADC  B
    ADC         10
    B

  • Hierarchical Clustering (Single Linkage)

    Current clusters and distance matrix (ADC and B merge at height 10):

    Dist   ADC  B
    ADC         10
    B

  • Hierarchical Clustering (Single Linkage)

    Final result: all items in one cluster

    Dist   ADCB
    ADCB
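
The single-linkage walkthrough on the four items A, B, C, D can be reproduced with a short script. This is a minimal sketch, not an efficient implementation: it recomputes cluster distances from the original matrix at every step, and the function name `single_linkage` is illustrative.

```python
from itertools import combinations

# Pairwise distances from the initial distance matrix on the slides.
pairs = {
    ("A", "B"): 20, ("A", "C"): 7, ("A", "D"): 2,
    ("B", "C"): 10, ("B", "D"): 25, ("C", "D"): 3,
}
dist = {frozenset(pair): d for pair, d in pairs.items()}

def single_linkage(dist):
    """Agglomerative clustering; returns a list of (cluster1, cluster2, height)."""
    items = sorted(set().union(*dist))
    clusters = [frozenset([item]) for item in items]
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage: minimum distance over all cross-cluster item pairs.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the closest pair of current clusters and merge them.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        height = cluster_distance(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), height))
    return merges

merges = single_linkage(dist)
for left, right, height in merges:
    print(f"merge {''.join(left)} + {''.join(right)} at height {height}")
```

Replacing `min` with `max` inside `cluster_distance` would turn this into complete linkage, which may produce a different tree for the same matrix.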

  • Hierarchical Clustering

  • Hierarchical Clustering

    The Leaf Ordering Problem:

    Find optimal layout of branches for a given dendrogram architecture

    2^(N-1) possible orderings of the branches

    For a small microarray dataset of 500 genes, there are 1.6*E150 branch configurations

    [Figure: heat map with axes Samples and Genes]
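
The count quoted for 500 genes follows from the fact that a binary dendrogram with N leaves has N - 1 internal nodes, each of which can be flipped independently without changing the tree. A short check, with an illustrative function name:

```python
def leaf_orderings(n_leaves):
    """Number of leaf orderings consistent with a binary dendrogram of n_leaves leaves."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    # Each of the n_leaves - 1 internal nodes can swap its two branches.
    return 2 ** (n_leaves - 1)

count = leaf_orderings(500)
print(f"{float(count):.1e}")  # order of magnitude for 500 genes
```

Python integers have arbitrary precision, so the exact value is available; it is converted to a float only for display.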

  • Hierarchical Clustering

    Pros: commonly used algorithm; simple and quick to calculate

    Cons: real genes probably do not have a hierarchical organization

  • Using Hierarchical Clustering

    Choose what samples and genes to use in your analysis

    Choose similarity/distance metric

    Choose clustering direction

    Choose linkage method

    Calculate the dendrogram

    Choose height/number of clusters for interpretation

    Assess results

    Interpret cluster structure

  • Limitations

    Cluster analyses: usually outside the normal framework of statistical inference; less appropriate when only a few genes are likely to change; need lots of experiments

    Single gene tests: may be too noisy in general to show much; may not reveal coordinated effects of positively correlated genes; hard to relate to pathways

  • Useful Links

    Affymetrix: www.affymetrix.com

    Michael Eisen Lab at LBL (hierarchical clustering software Cluster and Tree View (Windows)): rana.lbl.gov/

    Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html

    ArrayExpress at the EBI: http://www.ebi.ac.uk/arrayexpress/

    Stanford MicroArray Database: http://genome-www5.stanford.edu/

    Yale Microarray Database: http://info.med.yale.edu/microarray/

    Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html

    Actually, a number of clustering methods have been proposed. I'll go through some representative types in the following slides. The partitioning method divides the data items into a user-specified number of clusters according to their spatial relationships. The most well-known method is k-means.

    The hierarchical method works by grouping similar objects level by level, resulting in a hierarchical structure. The main feature of the density-based method is that it considers the density of objects in a cluster in addition to the spatial relationship between objects. So this kind of method performs pretty well in screening out outliers. One representative method is CAST.

    The grid-based method tries to partition the sample space of data objects into cubes for clustering. This kind of method performs very efficiently in high-dimensional data space.

    Let me explain hierarchical clustering first for those of you who are not familiar with it.

    Here is a very simple example.

    We have 4 data items.

    The initial distance matrix is given in this table.

    In hierarchical agglomerative clustering, we consider each data item as an independent cluster initially, so we have 4 clusters now.

    At first, we choose the most similar pair, A and D. These two items will be merged together to be a new cluster.

    The height of this new subtree is 2. We should update the distance matrix since we have new clusters. The distances between the new cluster and the remaining clusters can be updated in many different ways.

    Let's assume that we use single linkage. When we calculate the distance between this new cluster (A and D) and, for example, B, we choose the minimum of the distance between A and B and the distance between D and B.

    In this example, the distance between A and B is 20, and the distance between D and B is 25. So the distance between the new cluster {A,D} and B will be 20.

    Next, we choose 3 in the new distance matrix, so {A,D} and C are merged together. The height of this new subtree is 3. We have a new distance matrix with 2 clusters. Finally, we merge the remaining two clusters. This binary tree is the result of hierarchical clustering using single linkage. If we use a different linkage method, the result can be different from this one.