Introduction to Gene Array

Introduction to Gene Array. Overview Basic cell biology Basic Biotechnological methods DNA chips & microarrays Statistical issues




Introduction to Gene Array


Overview

• Basic cell biology
• Basic biotechnological methods
• DNA chips & microarrays
• Statistical issues


Biological Background


Biological Background

All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.


Human Genome


DNA Organization

Source: Alberts et al.


DNA Organization

• In 1953 James Watson and Francis Crick deduced the 3-D double helix structure of DNA and immediately inferred its method of replication.

• April 2003: Human Genome Project (HGP) sequencing is completed and the Project is declared finished. “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning” (W. Churchill, 1942).


The double helix


DNA Components

Four nucleotide types:
• Adenine
• Guanine
• Cytosine
• Thymine

Hydrogen bonds (electrostatic connection):
• A-T
• C-G


Replication


So, how does the cell use this data?


Transcription

RNA nucleotides: similar to DNA, with U (uracil) instead of T.
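The U-for-T rule can be illustrated with a minimal sketch. The function name `transcribe` is hypothetical, and the sketch assumes the input is the coding (sense) strand, so that the mRNA has the same sequence with T replaced by U.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (T replaced by U)."""
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("expected only the letters A, C, G, T")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTTA"))  # AUGGCCUUA
```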


Protein production and Overview


Genetic code

There are 20 amino acids from which proteins are built.
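A small sketch of how the genetic code maps codons to amino acids. It builds the standard codon table from the conventional TCAG ordering; the names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes a DNA coding strand read in frame from its first base.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for the 64 codons in TCAG order; "*" marks a stop codon.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(coding_strand: str) -> str:
    """Translate a DNA coding strand codon by codon, stopping at the first stop codon."""
    dna = coding_strand.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTTATAA"))  # MAL
```

The table has 64 codons but only 20 distinct amino acids, so several codons encode the same amino acid.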


Protein structure


The Central Dogma

Gene → (transcription) → mRNA → (translation) → Protein

Analogy: gene = hard disk, mRNA = one program, protein = its output.

Cells express different subsets of their genes in different tissues and under different conditions.


Figures…

• Number of cells in the human body ~ 10-100 × 10^12.

• Length of human DNA ~ 3 × 10^9 bp.

• 99.9% of the genome is exactly the same in all people.

• Only 2-3% of the genome encodes protein data.

• Number of genes in humans ~ 25,000 (most of their functions are still unknown).

• Average gene length – 3,000 bases (largest – 2.4 × 10^6).

• The total number of protein variants is estimated at 10^6.

• Average protein length – 200 aa.


Figures (cont.)

There are over 200 types of cells in the human body, such as muscle, nervous tissue, sensory cells and blood.


Basic Biotechnology

• Restriction Enzymes

• Sequencing by Gel (500-800 bases) – Advanced machines can simultaneously sequence 96 different sequences of 500-700 nucleotides in a few hours.

• PCR (Polymerase Chain Reaction) – Allows the DNA from a selected region of a genome to be amplified a billion fold, provided that at least part of its nucleotide sequence is already known.

How can we access the data/processes?


Basic Biotechnology (cont.)

Hybridization:

• DNA double strands form by the “gluing” of complementary single strands.

• Complementarity rule: A-T/U, G-C.
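The complementarity rule can be sketched in a few lines. The helper names `reverse_complement` and `can_hybridize` are hypothetical, and the sketch only tests for a perfect, full-length antiparallel match; real hybridization also tolerates partial matches, which this ignores.

```python
# A pairs with T (or U in RNA), G pairs with C.
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(seq: str) -> str:
    """Return the DNA strand that pairs with `seq`, read in the opposite direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def can_hybridize(probe: str, target: str) -> bool:
    """True if `target` is the perfect antiparallel complement of `probe`."""
    return reverse_complement(target) == probe.upper().replace("U", "T")


print(can_hybridize("ATGC", "GCAT"))  # True
print(can_hybridize("ATGC", "GGGG"))  # False
```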


Motivation for Genome Research

• Molecular Medicine

• Bioarchaeology, Anthropology, Evolution and Human Migration

• DNA Identification

• Agriculture, Livestock Breeding and Bioprocessing


Functional Genomics

A study of the functionality of specific genes, their relations to diseases, their associated proteins and their participation in biological processes. It is widely believed that thousands of genes function in a complicated and orchestrated way that creates the mystery of life.

However, traditional methods in molecular biology generally work on a “one gene in one experiment” basis, which means that the throughput is very limited and the “whole picture” of gene function is hard to obtain.


Functional Genomics (cont.)

• Technologies for simultaneously analyzing the expression levels of large numbers of genes provide the opportunity to study the activity of whole genomes.

• In the long term, large-scale gene expression analysis will enable the behavior of co-regulated gene networks to be studied.


Functional Genomics (cont.)

• The technology can be used to look for groups of genes involved in a particular biological process or in a specific disease by identifying genes whose expression levels change under certain circumstances.

• The RNA transcription profiles of a wild type (a normal organism) and a mutant can be compared using gene expression technologies.

• Our main interest is in protein levels; unfortunately, the technology for measuring them is still undeveloped.


Monitoring Gene Expression

• Goal: simultaneous measurement of the expression levels of all genes in one experiment.

• Two fundamental biological assumptions:
1. Transcription level indicates a gene's regulation.
2. Only genes which contribute to the organism's fitness are expressed in a particular condition.

Detecting changes in a gene's expression level provides clues to the function of its product.


The DNA Chip

• Terminology: biochip, DNA chip, DNA microarray, and gene array.

• An array is an orderly arrangement of samples. Those samples can be either DNA or DNA products.

• Each spot in the array contains many copies of the sample.

• The array provides a medium for matching known and unknown DNA samples based on base-pairing (hybridization) rules and automating the process of identifying the unknowns.


The DNA Chip (cont.)

• These arrays usually contain hundreds of thousands of spots.

• An experiment with a single DNA chip can provide information on thousands of genes simultaneously – a dramatic increase in throughput.

• A probe is the tethered nucleic acid with a known sequence. We use it to discover information about the target, which is the free nucleic acid sample whose identity/abundance is being detected.


The DNA Chip (cont.)

The basic idea is to generate probes that capture each coding region as specifically as possible. The length of the oligos (short sequences of nucleotides) depends on the application, but is usually no more than 25 bases.

Since the oligos are short, the density of these chips is very high. For instance, a chip of 1 cm by 1 cm can easily contain 100,000 oligo types.


The DNA Chip (cont.)

Two variants of arrays technology:

• Format I - DNA microarray: The target (500-5,000 bases long) is immobilized to a solid surface such as glass using robot spotting and exposed to a set of probes either separately or in a mixture.

• Format II - DNA chip: An array of oligonucleotide (20-80-mer oligos) probes is synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization. The array is exposed to labeled sample DNA, hybridized, and the identity/abundance of complementary sequences is determined.


Chips and Features


Expression Assay Using Affymetrix GeneChip

Now we can infer which of the genes were expressed and at what intensity.

Owing to some biological processes, the sequence that hybridizes to the oligo is not always the correct one.


GeneChip Array Design

• The Mismatch oligos help us reduce some of the background noise.

• Since each gene is represented by several spots, we have to normalize those values.
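One simple way to combine the several spots of a gene is to average the perfect-match (PM) minus mismatch (MM) differences over its probe pairs. The sketch below shows only that idea; the function name `average_difference` and the intensity values are made up for illustration, and this is not claimed to be the exact GeneChip algorithm.

```python
from statistics import mean


def average_difference(pm, mm):
    """Average PM - MM over the probe pairs that represent one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equally long, non-empty PM and MM lists")
    # Subtracting the mismatch intensity removes part of the non-specific signal.
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for three probe pairs of one gene.
print(average_difference([1200, 900, 1500], [200, 300, 400]))  # 900
```

Note that the result can be negative when MM exceeds PM, which matters for any later log transformation.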


GeneChip® Manufacturing Process


Synthesis of Ordered Oligonucleotide Arrays


cDNA Microarrays

• Although this method gives more gene-specific hybridization, manufacturing limitations (variation in the amount of cDNA spotted on the chip) mean that we cannot read absolute expression values. Instead, we label the control and case cells with different dyes (red/green) and measure the ratio of their expression levels.



Process Example


The Raw Data

Row - a gene's expression pattern.
Column - an experiment/condition's profile.

Entries of the Raw Data matrix:
• Ratio levels
• Absolute values

Example conditions: exposing the cell to a radioactive/chemical substance and sampling at several time intervals, stages of the cell's life cycle, abnormal tissues (cancer, etc.), starvation.


Computational Challenges

• Dissociate actual gene expression values from experimental noise (missing data).

• Normalization: How does one best normalize thousands of signals from the same or different conditions/experiments?

• Clustering: Partition the genes into subsets that show similar patterns across experiments.

• Classification: Given a partition of the conditions into types, classify the types of new conditions (SVM, NN, etc.).

• Feature selection: Given a partition of the conditions into types, find for each type a subset of the genes that distinguishes it.

• Assign statistical significance to your answers.


Clustering Example

[Figure: clustered expression matrix; axes: genes, tissue state]


Reading the Chip

Often one finds shifts between the expected grid and the observed grid.

We average the pixels within the circle to obtain the signal.
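A minimal sketch of the pixel-averaging step, assuming the spot center `(cx, cy)` and `radius` have already been located on the observed grid (correcting the grid shift itself is not shown). The function name `spot_signal` and the toy image are illustrative only.

```python
def spot_signal(image, cx, cy, radius):
    """Mean intensity of the pixels whose coordinates fall inside the spot circle.

    `image` is a list of rows; pixel (x, y) is image[y][x].
    """
    inside = [
        pixel
        for y, row in enumerate(image)
        for x, pixel in enumerate(row)
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    ]
    if not inside:
        raise ValueError("no pixels fall inside the circle")
    return sum(inside) / len(inside)


# Toy 5x5 image: a bright plus-shaped spot centered at (2, 2) on a dark background.
image = [[0] * 5 for _ in range(5)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    image[y][x] = 100

print(spot_signal(image, 2, 2, 1))  # 100.0
```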


Data Pre-processing - Example

“Molecular classification of cancer: class discovery and class prediction by gene expression monitoring” – Golub et al.

The dataset comes from a study of gene expression in two types of acute leukemias: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high density oligonucleotide arrays containing p = 6817 human genes. The data comprise 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML.

The following pre-processing steps were applied:

(i) thresholding: floor of 100 and ceiling of 16,000.

(ii) filtering: exclusion of genes with max/min ≤ 5 or (max-min) ≤ 500, where max and min refer respectively to the maximum and minimum expression levels of a particular gene across mRNA samples.

(iii) base 10 logarithmic transformation.
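The three steps can be sketched as follows. This is a plain-Python illustration, not the authors' code: the function name `preprocess` and the toy data are hypothetical, and the filter is read here as dropping a gene when either criterion is at or below its threshold, which is an assumption about step (ii).

```python
import math


def preprocess(matrix, floor=100, ceiling=16000, min_ratio=5, min_range=500):
    """Threshold, filter and log-transform a genes-by-samples matrix.

    Returns (gene_index, transformed_row) pairs for the genes that pass the filter.
    """
    kept = []
    for index, row in enumerate(matrix):
        # (i) thresholding: clip every value into [floor, ceiling].
        clipped = [min(max(value, floor), ceiling) for value in row]
        highest, lowest = max(clipped), min(clipped)
        # (ii) filtering: drop genes that vary too little across samples.
        if highest / lowest <= min_ratio or highest - lowest <= min_range:
            continue
        # (iii) base 10 logarithmic transformation.
        kept.append((index, [math.log10(value) for value in clipped]))
    return kept


toy = [
    [50, 20000, 300],   # kept after clipping to [100, 16000, 300]
    [100, 200, 150],    # dropped: max/min = 2
    [100, 550, 120],    # dropped: max - min = 450
]
result = preprocess(toy)
print([index for index, _ in result])  # [0]
```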


Data Pre-processing - Missing/Negative values (~6%)

Let x* denote a gene with missing entries and X^C the set of genes with no missing entries.

• Imputing using SVD (use only perfect genes, i.e. genes with no missing entries) – most of the genes do not have missing entries.

• SVD imputation using all the data (may not do well for unusual genes not well represented by the bulk of the data).

• Nearest-neighbor imputation:
1. Compute the Euclidean distance between x* and all the genes in X^C, using only those coordinates not missing in x*. Identify the K closest.
2. Impute the missing coordinates of x* by averaging the corresponding coordinates of the K closest.

• Imputing using regression – using the standard EM approach for fitting multivariate Gaussian means and covariances in the presence of missing data.

“Imputing Missing Data for Gene Expression Arrays” T. Hastie et al.
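The two nearest-neighbor steps can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation from the cited paper: missing entries are represented by `None`, the function name `knn_impute` is hypothetical, and ties between equally distant genes are broken by input order.

```python
import math


def knn_impute(target, complete_genes, k=3):
    """Fill the None entries of `target` using its k nearest complete genes."""
    observed = [j for j, value in enumerate(target) if value is not None]
    if not observed or not complete_genes:
        raise ValueError("need at least one observed coordinate and one complete gene")

    def distance(gene):
        # Step 1: Euclidean distance over the coordinates observed in the target.
        return math.sqrt(sum((gene[j] - target[j]) ** 2 for j in observed))

    nearest = sorted(complete_genes, key=distance)[:k]
    # Step 2: average the neighbors' values at each missing coordinate.
    return [
        value if value is not None else sum(gene[j] for gene in nearest) / len(nearest)
        for j, value in enumerate(target)
    ]


complete = [[1, 2, 3], [1, 2, 5], [10, 10, 10]]
print(knn_impute([1, 2, None], complete, k=2))  # [1, 2, 4.0]
```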


Data Pre-processing

• Normalization: controls for chip-wide variations in intensity (to control for variations in the total harvest of mRNA across samples - normalize the distribution of all genes).

• This variation could be due to inconsistent washing, inconsistent sample preparation, or other microarray production or microfluidics imperfections.
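One simple way to control for chip-wide intensity differences is to rescale every array so that all arrays share the same median. The sketch below assumes a genes-by-arrays matrix with positive medians; the function name `scale_to_common_median` is hypothetical, and median scaling is only one of several possible normalization choices.

```python
from statistics import median


def scale_to_common_median(matrix):
    """Rescale each column (array) of a genes-by-arrays matrix to a shared median."""
    n_arrays = len(matrix[0])
    medians = [median(row[j] for row in matrix) for j in range(n_arrays)]
    target = median(medians)  # common level that every array is scaled to
    return [
        [row[j] * target / medians[j] for j in range(n_arrays)]
        for row in matrix
    ]


# The second array is uniformly twice as bright as the first.
raw = [[100, 200], [200, 400], [300, 600]]
print(scale_to_common_median(raw))
# [[150.0, 150.0], [300.0, 300.0], [450.0, 450.0]]
```

After scaling, the two arrays agree, which is what one would expect if the only difference between them was the total amount of mRNA harvested.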


Data Pre-processing

The ratio is determined by dividing the signal (raw data) by the control strength. A ratio above one indicates over-expression. Under-expressed genes have ratios between zero and one, so under-expression appears flattened. Taking the log of the ratio spaces the values logarithmically, so that a k-fold increase and a k-fold decrease have the same magnitude with opposite signs (better than just dividing by the maximum value).

• LogRatio(R,G) = log(R/G)

• RelDiff(R,G) = (R-G)/(0.5*(R+G)) – handles negative values (which arise when MM > PM).
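The two measures can be compared directly. The source does not state the base of the logarithm, so base 2 is an assumption in this sketch; the function names simply mirror the formulas above.

```python
import math


def log_ratio(r, g):
    """log2(R/G); requires both intensities to be positive."""
    return math.log2(r / g)


def rel_diff(r, g):
    """(R - G) / (0.5 * (R + G)); defined whenever R + G is nonzero."""
    return (r - g) / (0.5 * (r + g))


print(log_ratio(400, 100), log_ratio(100, 400))  # 2.0 -2.0
print(rel_diff(300, 100), rel_diff(100, 300))    # 1.0 -1.0
print(rel_diff(-50, 150))                        # -4.0
```

Both measures are antisymmetric in R and G. For non-negative intensities RelDiff stays within [-2, 2], whereas the log ratio is unbounded and undefined for zero or negative values.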


Standardization

It is common practice to use the correlation between the gene expression profiles of two mRNA samples to measure their similarity. Consequently, we can standardize the observations (arrays) to have mean 0 and variance 1 across variables (genes).

With the data standardized in this fashion, the distance between two mRNA samples may be measured by their Euclidean distance, angle or Hamming distance (binary arrays).

• Normalize to the median instead of the average (the average has a breakdown point of 0).

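$$x^*_{i,j} = \frac{x_{i,j} - \mu_i}{\sigma_i}, \qquad 1 \le i \le p,\ 1 \le j \le N,$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $x_{i,1}, \dots, x_{i,N}$.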
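A short sketch of why standardization ties Euclidean distance to correlation: for two profiles of length n standardized to mean 0 and (population) variance 1, the squared Euclidean distance equals 2n(1 - r), where r is their correlation. The function names and the two toy profiles below are illustrative only.

```python
import math
from statistics import fmean, pstdev


def standardize(values):
    """Shift and scale a profile to mean 0 and population variance 1."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant profile cannot be standardized")
    return [(value - mu) / sigma for value in values]


def correlation(a, b):
    """Pearson correlation, computed as the mean product of standardized values."""
    za, zb = standardize(a), standardize(b)
    return fmean([x * y for x, y in zip(za, zb)])


a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 9.0]
squared_distance = math.dist(standardize(a), standardize(b)) ** 2
identity = 2 * len(a) * (1 - correlation(a, b))
print(math.isclose(squared_distance, identity, abs_tol=1e-9))  # True
```

So on standardized data, ranking pairs by Euclidean distance and ranking them by correlation give the same ordering.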


Genetic Network


Discussion

• The Fourier transform (moving to the frequency domain) enables us to easily detect expression cycles, remove noise, etc.
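As a sketch of cycle detection in the frequency domain, the code below computes a direct discrete Fourier transform of an evenly sampled expression time series and reports the period with the most power. The function name `dominant_period` and the synthetic series are assumptions for illustration; a real analysis would typically use an FFT library and handle uneven sampling and noise.

```python
import cmath
import math


def dominant_period(series):
    """Period (in samples) of the strongest non-constant frequency component."""
    n = len(series)
    average = sum(series) / n
    centered = [value - average for value in series]  # remove the constant offset
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(centered)
        )
        power = abs(coefficient) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        raise ValueError("series is constant or too short")
    return n / best_k


# Synthetic expression profile: baseline 5 plus a cycle that repeats every 8 samples.
profile = [5 + math.sin(2 * math.pi * t / 8) for t in range(32)]
print(dominant_period(profile))  # 8.0
```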


Questions?


References

• Prof. Ron Shamir's lecture notes
• Prof. Shlomo Moran's course: Algorithms in computational biology
• Ron Ophir's presentation on Microarray Analysis
• “Imputing Missing Data for Gene Expression Arrays” T. Hastie et al.
• “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data” Sandrine Dudoit et al.
• “Is log ratio a good value for identifying differential expressed genes in microarray experiments?” Alfred Ultsch