Microarray Analysis Software


  • Microarray Analysis Software

    Maximiliano Corredor, Institute of Biology, Leiden University

  • Steps of a Microarray Experiment

    RNA → (RT) → cDNA → (labeling) → cDNA-Cy3 / -Cy5 → hybridization → Image Processing → Statistical Analysis

    Genomic sequence / EST library sequence → Annotation → Probe design

  • Bioinformatic steps of MA experiments

    Probe design

    Image processing (with QC)

    Normalisation (with QC)

    Statistical analysis and data mining

    Database management

  • Probe design software

    Array Designer - software that can design hundreds of primers for DNA or oligonucleotide microarrays; a product of Premier Biosoft. http://www.premierbiosoft.com/dnamicroarray/index.html

    OligoArray2 - free software that computes gene-specific oligonucleotides for genome-scale oligonucleotide microarray construction. http://berry.engin.umich.edu/oligoarray2/

    OligoWiz2 Server - server for designing oligonucleotide probes for microarrays. http://www.cbs.dtu.dk/services/OligoWiz2/

    ProbeWiz Server - the CBS ProbeWiz WWW server predicts optimal PCR primer pairs for generating probes for cDNA arrays. http://www.cbs.dtu.dk/services/DNAarray/probewiz.php

    Primer3 - commonly used software for designing primers for microarray construction. http://frodo.wi.mit.edu/primer3/primer3_code.html

  • Image processing

    Addressing: estimate the location of spot centers

    Segmentation: classify pixels as foreground or background

    Information extraction, for each spot on the array and each channel:
    - Foreground intensities
    - Background intensities
    - Quality measures
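
    The segmentation and information-extraction steps above can be sketched in a few lines. This is only an illustration, assuming a fixed-circle segmentation around a known spot centre; the function name extract_spot, the radius parameter and the toy pixel values are hypothetical and do not reproduce what any particular image-processing package does.

```python
from statistics import mean, median, pstdev

def extract_spot(image, center, radius):
    """Summarise one spot in one channel.

    image  : list of rows of pixel intensities
    center : (row, col) spot centre, as estimated by the addressing step
    radius : assumed spot radius in pixels (fixed-circle segmentation)
    """
    cr, cc = center
    foreground, background = [], []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            # Segmentation: pixels inside the circle are foreground.
            inside = (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2
            (foreground if inside else background).append(value)
    # Information extraction: summary statistics per pixel class.
    return {
        "fg_mean": mean(foreground),
        "fg_median": median(foreground),
        "fg_sd": pstdev(foreground),
        "bg_median": median(background),
        "bg_sd": pstdev(background),
    }

# Toy 5x5 patch: a bright cross-shaped spot on a dim, uniform background.
patch = [
    [100, 100, 100, 100, 100],
    [100, 100, 900, 100, 100],
    [100, 900, 900, 900, 100],
    [100, 100, 900, 100, 100],
    [100, 100, 100, 100, 100],
]
spot = extract_spot(patch, center=(2, 2), radius=1)
```

    Real packages use more elaborate segmentation (adaptive circles, histogram or seeded-region methods); the fixed circle is simply the easiest variant to read.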

  • Image processing software

    GenePix Pro (Axon Instruments), for Windows. Spot identification, scatter plot, histogram, normalization, quality control. http://www.moleculardevices.com/pages/software/gn_genepix_pro.html

    ScanArray (PerkinElmer), for Windows. Quantitation, spot quality measures and normalization. http://las.perkinelmer.com/Catalog/default.htm?CategoryID=Analysis+Software

    ScanAlyze (Eisen's lab, Lawrence Berkeley National Lab (LBNL)), for Windows. Processes fluorescent images of microarrays. Semi-automatic definition of grids and complex pixel and spot analyses. Free for academic use. http://rana.lbl.gov/EisenSoftware.htm

    TIGR Spotfinder (TIGR), for Windows. Spot identification; microarray image processing. Free.

  • Image processing with GenePix

  • QC: Background subtraction

    Background arises from glass autofluorescence, dust particles or washing defects.

    Background (BG) and specific hybridisation are assumed to be additive (but look at the image!).

    Low background can be subtracted from the average intensity of the spot.

    Features with high background should be removed from the analysis: artificial saturation may occur, in which case the maximum measured value is no longer the sum of the background and the real specific intensity.

    Features with strongly negative intensities after background subtraction (like those in the image) should also be removed.

    Features whose background is similar to the spot intensity give a normal distribution centered on 0 intensity and can therefore be considered absent.

  • Background correction

    Different types of background subtraction

    Possibility of flagging features that don't match our QC criteria:
    - high background intensity
    - % of pixels above background
    - background higher than foreground
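
    A minimal sketch of additive background correction with the three flagging criteria listed above. The function name correct_and_flag and both thresholds (max_bg, min_pct_above_bg) are hypothetical and chosen only for illustration; they are not values recommended by the slides or by any package.

```python
def correct_and_flag(fg_median, bg_median, pct_above_bg,
                     max_bg=500, min_pct_above_bg=50.0):
    """Return (background-corrected intensity, list of QC flags) for one feature.

    max_bg and min_pct_above_bg are illustrative thresholds only.
    """
    flags = []
    if bg_median > max_bg:
        flags.append("high_background")
    if pct_above_bg < min_pct_above_bg:
        flags.append("few_pixels_above_background")
    if bg_median > fg_median:
        flags.append("background_above_foreground")
    # Background and specific signal are assumed additive, so subtract.
    return fg_median - bg_median, flags

good = correct_and_flag(fg_median=2400, bg_median=150, pct_above_bg=95.0)
bad = correct_and_flag(fg_median=300, bg_median=800, pct_above_bg=10.0)
```

    Here good carries no flags, while bad fails all three criteria and ends up with a negative corrected intensity, the kind of feature the previous slide suggests removing.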

  • QC: Histogram and scatterplot

    The intensities should follow a normal distribution with:
    - Natural lower limit: only positive intensities exist (minimum RNA concentration is 0)
    - Long tail towards the higher intensities
    - Artificial upper limit: saturation of the detector and/or the TIFF file. This can cause an accumulation of points at the highest intensity

    This effect can also be observed in the scatterplot.
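
    The accumulation of points at the upper limit can be checked numerically as well as visually. The sketch below assumes 16-bit images (maximum pixel value 65535); the function names, the number of bins and the example intensities are hypothetical.

```python
SATURATION = 2 ** 16 - 1  # 65535, the largest value a 16-bit pixel can hold

def intensity_histogram(intensities, n_bins=8, top=SATURATION):
    """Count intensities in equal-width bins spanning 0..top."""
    counts = [0] * n_bins
    width = (top + 1) / n_bins
    for value in intensities:
        # Clamp so that the maximum value falls in the last bin.
        counts[min(int(value / width), n_bins - 1)] += 1
    return counts

def saturated_fraction(intensities, top=SATURATION):
    """Fraction of features sitting at the detector/file maximum."""
    return sum(1 for v in intensities if v >= top) / len(intensities)

spots = [120, 300, 450, 800, 1500, 4000, 20000, 65535, 65535, 65535]
counts = intensity_histogram(spots)
frac = saturated_fraction(spots)
```

    In this toy data most features fall in the lowest bin, the upper bins are nearly empty, and the last bin shows the pile-up of saturated features.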

  • QC: Std. Dev. vs. Avg

    Good spots should be homogeneous: low standard deviation

    Linear correlation between std. dev. and average

    Higher std. dev. = variability within spot

    Lower std. dev. = uniformity within spot (saturation)

  • Sources of technical variability

    SYSTEMATIC (calibration can correct for them):
    - Chip production
    - Efficiencies of RNA extraction, reverse transcription, labeling and photodetection

    STOCHASTIC (error model normalization):
    - PCR yield
    - DNA quality
    - Spotting efficiency, spot size
    - Cross-/unspecific hybridization
    - Stray signal

  • Normalisation

    Several assumptions:
    - Normal distribution of intensities
    - All channels behave equally

    Centering and scaling: intensities are transformed so that the averages and ranges are the same (and therefore comparable).

    Within-hyb normalisation: in two-channel data, both channels are centered and scaled. More complex normalisations may be needed to ensure linearity across the whole intensity range.

    Between-hybs normalisation: whenever two or more chips are to be compared, all of them must be centered and scaled. Normalisation should take the experimental design into account; the error model must distinguish between experimental units, biological replicates and technical replicates.
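
    Centering and scaling can be sketched as follows. Two choices here are assumptions rather than something stated on the slide: the intensities are log2-transformed first, and the mean and standard deviation are used as the measures of centre and range. The function name center_and_scale and the chip values are hypothetical.

```python
from math import log2
from statistics import mean, pstdev

def center_and_scale(intensities):
    """Log-transform, then shift to mean 0 and rescale to standard deviation 1.

    Assumes positive intensities that are not all identical.
    """
    logs = [log2(v) for v in intensities]
    mu, sigma = mean(logs), pstdev(logs)
    return [(x - mu) / sigma for x in logs]

chip_a = [200, 400, 800, 1600]
chip_b = [1000, 2000, 4000, 8000]  # same expression pattern, five times brighter
norm_a = center_and_scale(chip_a)
norm_b = center_and_scale(chip_b)
```

    After the transformation the two chips are directly comparable: the overall brightness difference disappears and only the relative pattern remains.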

  • Normalisation software

    Basic normalisation within a hybridisation is possible in GenePix.

    Acuity includes more advanced normalization algorithms (Lowess, etc.).

    Rosetta implements several pipelines for normalization:
    - Within hybs, when data are uploaded to the database, using the manufacturers' indications to develop the error models (and therefore providing p-values)
    - Between hybs, when hybridisations are compared to each other (centering and scaling)

  • QC: M vs A

    M stands for Log(Ratio); A is the average of the Log(Intensity) of both channels, i.e. the log of the geometric mean of the two intensities.

    If the two channels behave symmetrically, everything is OK. Otherwise, we may have dye bias.

    Deviations of this kind are very common in the tails of the distribution (lowess normalisation can help here).
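
    For reference, M and A can be computed per feature using the usual definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2, where R and G are the two channel intensities. The sketch below also applies a global median centering of M, which is a much simpler stand-in for the intensity-dependent lowess normalisation mentioned above; the function names and the example intensities are hypothetical.

```python
from math import log2
from statistics import median

def ma_values(red, green):
    """Return (M, A) lists for paired channel intensities."""
    m = [log2(r) - log2(g) for r, g in zip(red, green)]
    a = [(log2(r) + log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Remove a constant dye bias by shifting the median of M to 0."""
    shift = median(m)
    return [x - shift for x in m]

red = [1024, 2048, 512, 4096]
green = [512, 1024, 256, 1024]
m, a = ma_values(red, green)
m_centered = median_center(m)
```

    In this toy example the red channel is twice as bright for most features (M = 1), which looks like a global dye bias; after median centering only the last feature still stands out.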

  • QC: M vs A

    Before normalisation (left), the average ratio was higher than 0.

    Intensity saturation of one channel produces a skewed tail. Normalisation does not remove this effect; it requires calibration of the image acquisition (or elimination of saturated spots from the analysis).

  • QC and basic statistics software

    Some image processing packages, like GenePix, include basic statistics functions.

    There are numerous stand-alone programs, and plug-ins or scripts for more general statistical packages like R/Bioconductor, Matlab, SPSS and MS Excel. http://ihome.cuhk.edu.hk/~b400559/arraysoft_statistics.html

    All microarray analysis packages include these functions and many more.

  • Database systems

    Acuity (Axon Instruments). Runs on Windows 2000/XP client; Windows 2000 server (recommended). Stores data in a relational database (Microsoft SQL or Oracle). Various visualization tools; normalization; hierarchical, k-means and k-medians clustering with many different similarity metrics; SOM, PCA, gene shaving. Scripting engine for customizable analysis. http://www.moleculardevices.com/pages/software/gn_acuity.html

    ArrayDB (NHGRI). HTML / Linux or Unix. Analyzed expression data are stored in a relational database. A software suite that provides an interactive user interface for the mining and analysis of microarray gene expression data. http://genome.nhgri.nih.gov/arraydb/

  • Database systems

    BASE (BioArray Software Environment), Department of Oncology, Lund University. Linux server, MySQL, web client. Manages biomaterial information, raw data and images, and provides integrated and "plug-in"-able normalization, data viewing and analysis tools. The system also has array production LIMS features; supports MIAME and MAGE-ML.

    Rosetta Resolver (Rosetta Biosoftware). Java / UNIX with Oracle relational database. The Rosetta Resolver system combines advanced analysis software, a high-capacity database, and a high-performance server framework in one enterprise-wide tool.

  • Database systems

    Stanford Microarray Database (SMD) package (Stanford University). Oracle server; web server; UNIX with Perl support. SMD stores raw and normalized data from microarray experiments, as well as their corresponding image files. In addition, SMD provides interfaces for data retrieval, analysis and visualization. http://genome-www5.stanford.edu//download/

    Longhorn Array Database (Institute for Cellular and Molecular Biology, University of Texas at Austin). Linux and PostgreSQL. The Longhorn Array Database (LAD) is a MIAME-compliant microarray database. It is a fully open source version of the Stanford Microarray Database (SMD). http://www.longhornarraydatabase.org/

  • Rosetta Resolver

    Excellent database
    - But requires dedicated staff to maintain

    Ideal for institutions and big companies
    - Who are the only ones able to afford it

    Includes a good set of statistical tools
    - But it isn't very transparent

    GUI user-friendly(ish)

    Flexible advanced statistics available as visual scripts and R implementation
    - However, this requires deep knowledge of the DB structure and some programming skills

    Compatible with a multitude of data formats
    - But hard to get info out of the system (no MIAME yet)

  • Statistical Analysis and Data Mining

    The basic output of a microarray experiment is a list of differentially transcribed genes. This can be obtained easily (e.g. in Excel) from the image processing output.

    However, the list is arbitrary: fold-change thresholds are arbitrarily chosen and there is no measure of the significance of the observed difference. To do science we need statistics.

    Many packages like Acuity, BASE and Rosetta Resolver combine database and statistical analysis tools, but there are also many other programs exclusively devoted to the statistical analysis of microarray experiments: http://ihome.cuhk.edu.hk/~b400559/arraysoft_mining_comprehensive.html
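
    The point about fold change versus significance can be illustrated with two hypothetical genes that show the same two-fold change but very different reproducibility. Welch's t-statistic on log2 intensities is used here purely as an example of a test statistic; the slides do not prescribe a particular test, and all names and values below are made up.

```python
from math import log2, sqrt
from statistics import mean, variance

def log_fold_change(treated, control):
    """Difference of mean log2 intensities (1.0 means a two-fold increase)."""
    return mean(log2(v) for v in treated) - mean(log2(v) for v in control)

def welch_t(treated, control):
    """Welch's t-statistic on log2 intensities (unequal variances allowed)."""
    lt = [log2(v) for v in treated]
    lc = [log2(v) for v in control]
    se = sqrt(variance(lt) / len(lt) + variance(lc) / len(lc))
    return (mean(lt) - mean(lc)) / se

# Three replicates per condition for each hypothetical gene.
consistent = ([2000, 2100, 1900], [1000, 1050, 950])
noisy = ([8000, 500, 2000], [1000, 1100, 900])

lfc_consistent = log_fold_change(*consistent)
lfc_noisy = log_fold_change(*noisy)
t_consistent = welch_t(*consistent)
t_noisy = welch_t(*noisy)
```

    Both genes would pass a two-fold cut-off, but only the first has a large t-statistic; the second owes its fold change to a single outlying replicate.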

  • Statistical analysis and Data mining software

    GeneSpring (Silicon Genetics). Analyzes various array types; scatter plot, cluster analysis, PCA, SOM, statistical tools, 2D and 3D plotting.

    J-Express (MolMine). Hierarchical clustering, K-means partitional clustering, Principal component analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization. Free for academics.

    BioConductor. An open source software project providing infrastructure, in terms of design and software, for analysing genomic data, with some form of graphical user interface for selected libraries. For other microarray-related R packages: http://ihome.cuhk.edu.hk/~b400559/arraysoft_rpackages.html

    SpotFire (Spotfire). Hierarchical, bi-directional hierarchical and K-means cluster analysis, PCA, profile search, coincidence testing, normalization, a number of interactive plots for visualization of data, access to GATC databases.

  • Basic plots and tables

  • Classification tasks for microarrays

    Classification of SAMPLES: generate gene expression profiles that can
    (i) discriminate between different known cell types or conditions, e.g. between tumor and normal tissue,
    (ii) identify different and previously unknown cell types or conditions, e.g. new subclasses of an existing class of tumors.

    Classification of GENES:
    (i) assign an unknown cDNA sequence to one of a set of known gene classes,
    (ii) partition a set of genes into new (unknown) functional classes on the basis of their expression patterns across a number of samples.

    Discriminant analysis: CLASSES KNOWN

    Cluster analysis: CLASSES NOT KNOWN
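
    As an illustration of the "classes known" case, the sketch below uses a nearest-centroid rule: each known class is summarised by its mean expression profile, and a new sample is assigned to the closest one. This is a deliberately simple stand-in for discriminant analysis, not the method used in the slides; the class labels, profiles and function names are hypothetical.

```python
from math import dist
from statistics import mean

def centroids(training):
    """training: {class_label: [profile, ...]} -> {class_label: mean profile}."""
    return {
        label: [mean(column) for column in zip(*profiles)]
        for label, profiles in training.items()
    }

def classify(profile, class_centroids):
    """Assign a profile to the class whose centroid is nearest (Euclidean)."""
    return min(class_centroids,
               key=lambda label: dist(profile, class_centroids[label]))

# Hypothetical log-expression profiles over three genes.
training = {
    "tumor": [[5.0, 1.0, 3.0], [6.0, 1.5, 2.5]],
    "normal": [[1.0, 4.0, 3.0], [1.5, 5.0, 3.5]],
}
model = centroids(training)
label_1 = classify([5.2, 0.8, 3.1], model)
label_2 = classify([0.9, 4.4, 2.9], model)
```

    Cluster analysis differs in that no training labels are available, so the groups themselves have to be discovered from the data.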

  • Cluster analysis

    Grouping a collection of objects into subsets or clusters, such that those within each cluster are more closely related to one another than to objects assigned to different clusters.

    Two ingredients are needed to group objects:
    - Distance measurement
    - Clustering algorithm

    Clustering columns: grouping similar samples

    Clustering rows: grouping similarly expressed genes
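
    The two ingredients can be made concrete with a small sketch that clusters rows (genes). The distance used here is 1 minus the Pearson correlation and the algorithm is average-linkage agglomerative clustering; both are common choices but are assumptions of this example, as are the gene names and expression values.

```python
from math import sqrt
from statistics import mean

def correlation_distance(x, y):
    """1 - Pearson correlation: near 0 for similar shapes, 2 for opposite ones."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1 - cov / norm

def average_linkage(profiles, n_clusters, distance=correlation_distance):
    """Merge the two closest clusters until n_clusters remain.

    profiles: {name: expression profile}; returns a list of sets of names.
    """
    clusters = [{name} for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster-to-cluster distance: mean over all member pairs.
                d = mean(distance(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

genes = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.2],
}
groups = average_linkage(genes, n_clusters=2)
```

    Clustering columns (samples) works the same way after transposing the data so that each profile holds one sample's values across genes.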

  • Clustering of genes

    Genes with similar patterns of expression (synexpression groups) cluster together.

    Synexpression groups may be functional groups (this is a hypothesis that always has to be tested). Iyer et al., Science 1999

  • Clustering of samples

    Given a sufficient number of samples, functional relationships might be found. Golub et al. http://www.genome.wi.mit.edu/MPR

  • Discriminant analysis

  • Useful links

    http://ihome.cuhk.edu.hk/~b400559/arraysoft.html Comprehensive compilation of information on microarray software

    https://www.cs.tcd.ie/Nadia.Bolshakova/softwaretotal.html Catalogue of microarray analysis software

    http://genome-www5.stanford.edu/resources/restech.shtml Stanford Microarray Database Software and Tools

    http://www.tigr.org/software/microarray.shtml The Institute for Genomic Research Microarray Software

    Laser scans produce a 16-bit TIFF image for each channel; each pixel stores the intensity of light emitted by a point excited by the laser beam. If the pixel size is smaller than the beam size, blurring occurs.

    Inkjet synthesis companies may also provide image processing by default, and in any case it is very straightforward because of the high quality of the features (even size and pattern).

    Spotted oligos need user-defined image processing because of:
    - Uneven feature size (differences due to the volume spotted or the drying process)
    - Uneven spacing between features (pin movements)
    - Curved grids (unbalanced chip)

    This means that a visual check of feature alignment is needed (extremely painful sometimes, and it introduces a lot of variability).

    Numerical info: for the pixels of the feature, the intensities are used to calculate the mean, median and standard deviation. The median is more robust to outliers. The same is done for the pixels of the background.

    Non-feature intensities arise from glass fluorescence, dust or hybridisation washing defects. Background and specific hybridization are assumed to be additive, so the background intensity can be subtracted from the feature intensity.

    High background may arise from a bad region of the chip; such features should be flagged as bad and removed from further analysis. For intense features saturation may occur, and for genes with low expression the difference may be masked so that they are taken as not expressed.

    If both the background and the feature intensity are low, the feature is flagged as absent. This removes it from the analysis.

    There is a linear correlation between the std. dev. and the mean or median. Dust particles can be detected by a higher standard deviation in the feature. When a feature is too intense, the number of saturated pixels increases, so the standard deviation decreases. This deviation from linearity can be easily assessed.

    As a result of the experimental procedure outlined, there are many sources of variability that must not be confused with biological variability between the samples assayed.