Introduction to Gene Array

Overview

• Basic cell biology

• Basic biotechnological methods

• DNA chips & microarrays

• Statistical issues

Biological Background

All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.

Human Genome

DNA Organization

Source: Alberts et al.

DNA Organization

• In 1953 James Watson and Francis Crick deduced the 3-D double helix structure of DNA and immediately inferred its method of replication.

• April 2003: Human Genome Project (HGP) sequencing is completed and the project is declared finished. “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning” (W. Churchill, 1942).

The double helix

DNA Components

Four nucleotide types: Adenine (A), Guanine (G), Cytosine (C), Thymine (T)

Hydrogen bonds (electrostatic connection): A-T, C-G

Replication

So, how does the cell use this data?

Transcription

RNA nucleotides: similar to DNA, but with uracil (U) instead of thymine (T).
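The T-to-U substitution can be illustrated with a minimal sketch. It assumes the input is the coding (sense) strand written 5' to 3', so the mRNA has the same sequence with U in place of T; the function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (T replaced by U)."""
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("expected only the letters A, C, G and T")
    return dna.replace("T", "U")


print(transcribe("ATGGCT"))  # AUGGCU
```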

Protein production and Overview

Genetic code

There are 20 amino acids from which proteins are built.

Protein structure

The Central Dogma

Gene → (Transcription) → mRNA → (Translation) → Protein

Cells express different subsets of their genes in different tissues and under different conditions.

Computer analogy: the genome is the hard disk, a gene is one program, and the protein is its output.

Figures…

• Number of cells in the human body ~ 10-100 × 10^12.

• Length of human DNA ~ 3 × 10^9 bp.

• 99.9% of the genome is exactly the same in all people.

• Only 2-3% of the genome encodes protein data.

• Number of genes in human ~ 25,000 (the functions of most of them are still unknown).

• Average gene length: 3,000 bases (largest: 2.4 × 10^6).

• Total number of protein variants is estimated at 10^6.

• Average protein length: 200 aa.

Figures (cont.)

There are over 200 types of cells in the human body, such as muscle, nervous tissue, sensory cells and blood.

Basic Biotechnology

• Restriction Enzymes

• Sequencing by Gel (500-800 bases) – Advanced machines can sequence simultaneously 96 different sequences of 500-700 nucleotides in a few hours.

• PCR (Polymerase Chain Reaction) – Allows the DNA from a selected region of a genome to be amplified a billion fold, provided that at least part of its nucleotide sequence is already known.
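To put the "billion fold" figure in perspective, here is a small worked calculation. It assumes an idealized model in which each cycle multiplies the number of copies by (1 + efficiency); the function name and the efficiency parameter are illustrative assumptions, not part of the original slides.

```python
def cycles_needed(fold: float, efficiency: float = 1.0) -> int:
    """Smallest number of cycles n with (1 + efficiency) ** n >= fold."""
    if fold < 1 or not 0 < efficiency <= 1:
        raise ValueError("need fold >= 1 and 0 < efficiency <= 1")
    copies, n = 1.0, 0
    while copies < fold:
        copies *= 1 + efficiency  # perfect doubling when efficiency == 1
        n += 1
    return n


print(cycles_needed(1e9))  # 30, since 2**30 is about 1.07e9
```

Under perfect doubling, about 30 cycles are enough; a lower per-cycle efficiency needs a few more.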

How can we access the data/processes?

Basic Biotechnology (cont.)

Hybridization:

• DNA double strands form by “gluing” of complementary single strands.

• Complementarity rule: A-T/U, G-C
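The complementarity rule can be sketched in a few lines. This is a simplified illustration that only tests for perfect, gap-free matches; real hybridization also tolerates partial matches. The helper names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with seq (A-T, G-C), read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hybridizes(probe: str, target: str) -> bool:
    """True if the probe is perfectly complementary to a stretch of the target.

    RNA targets are accepted by treating U like T (the A-T/U rule).
    """
    return reverse_complement(probe) in target.upper().replace("U", "T")


print(reverse_complement("ATGC"))       # GCAT
print(hybridizes("GCAT", "TTATGCTT"))   # True
```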

Motivation for Genome Research

• Molecular Medicine

• Bioarchaeology, Anthropology, Evolution and Human Migration

• DNA Identification

• Agriculture, Livestock Breeding and Bioprocessing

Functional Genomics

A study of the functionality of specific genes, their relations to diseases, their associated proteins and their participation in biological processes. It is widely believed that thousands of genes function in a complicated and orchestrated way that creates the mystery of life.

However, traditional methods in molecular biology generally work on a “one gene in one experiment” basis, which means that the throughput is very limited and the “whole picture” of gene function is hard to obtain.

• Technologies for simultaneously analyzing the expression levels of large numbers of genes provide the opportunity to study the activity of whole genomes.

• In the long-term, large-scale gene expression analysis will enable the behavior of co-regulated gene networks to be studied.

Functional Genomics (cont.)

• The technology can be used to look for groups of genes involved in a particular biological process or in a specific disease by identifying genes whose expression levels change under certain circumstances.

• The RNA transcription profiles of wild type (a normal organism) and mutant can be compared using gene expression technologies.

• Our main interest is in protein levels; unfortunately, the technology for measuring them on this scale is still undeveloped.

Functional Genomics (cont.)

Monitoring Gene Expression

• Goal: Simultaneous measurement of the expression levels of all genes in one experiment.

• Two fundamental biological assumptions:

1. The transcription level indicates the gene's regulation.

2. Only genes which contribute to organism fitness are expressed in a particular condition.

Detecting changes in a gene's expression level provides clues about the function of its product.

The DNA Chip

• Terminologies: biochip, DNA chip, DNA microarray, and gene array.

• An array is an orderly arrangement of samples. Those samples can be either DNA or DNA products.

• Each spot in the array contains many copies of the sample.

• The array provides a medium for matching known and unknown DNA samples based on base-pairing (hybridization) rules and automating the process of identifying the unknowns.

The DNA Chip (cont.)

• These arrays usually contain hundreds of thousands of spots.

• An experiment with a single DNA chip can provide information on thousands of genes simultaneously – a dramatic increase in throughput.

• A probe is the tethered nucleic acid with known sequence, which we use to discover information about the target. The target is the free nucleic acid sample whose identity/abundance is being detected.

The DNA Chip (cont.)

The basic idea is to generate probes that capture each coding region as specifically as possible. The length of the oligos (short sequences of nucleotides) depends on the application, but is usually no longer than 25 bases.

Since the oligos are short, the density of these chips is very high. For instance, a 1 cm by 1 cm chip can easily contain 100,000 oligo types.

The DNA Chip (cont.)

Two variants of array technology:

• Format I - DNA microarray: The probe cDNA (500-5,000 bases long) is immobilized to a solid surface such as glass using robot spotting and exposed to a set of targets either separately or in a mixture.

• Format II - DNA chip: An array of oligonucleotide probes (20-80-mer oligos) is synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization. The array is exposed to labeled sample DNA, hybridized, and the identity/abundance of complementary sequences is determined.

Chips and Features

Expression Assay Using Affymetrix GeneChip

Now we can infer which of the genes were expressed and at what intensity.

Due to some biological processes, the sequence that hybridizes to the oligo is not always the correct one.

GeneChip Array Design

• The Mismatch oligos help us reduce some of the background noise.

• Since each gene is represented by several spots, we have to normalize those values.
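As a rough illustration of how mismatch probes and several spots per gene can be combined, the sketch below averages the perfect-match (PM) minus mismatch (MM) intensities over one probe set. The slides do not say which summary statistic is used, so this average difference, the function name and the intensities are assumptions chosen for illustration.

```python
import numpy as np


def average_difference(pm, mm) -> float:
    """Mean of PM - MM over the probe pairs that represent one gene."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    if pm.shape != mm.shape or pm.size == 0:
        raise ValueError("pm and mm must be non-empty and of equal length")
    # Subtracting MM removes part of the non-specific background signal.
    return float(np.mean(pm - mm))


# Hypothetical intensities for four probe pairs of one gene.
print(average_difference([1200, 900, 1500, 1100], [200, 300, 250, 150]))  # 950.0
```

Note that the result can be negative when MM exceeds PM, which is one source of the negative values discussed under data pre-processing.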

GeneChip® Manufacturing Process

Synthesis of Ordered Oligonucleotide Arrays

cDNA Microarrays

• Although this method gives more gene-specific hybridization, manufacturing limitations (variation in the amount of cDNA spotted on the chip) mean that we cannot read absolute expression values. Instead, we label control and case cells with different dyes (red/green) and measure the ratio between the two signals.

Process Example

The Raw Data

Row - a gene's expression pattern.

Column - an experiment's/condition's profile.

Entries of the Raw Data matrix:

• Ratio levels

• Absolute values

Example conditions: exposing the cell to a radioactive or chemical substance and sampling at several time intervals; the life cycle of the cell; abnormal tissues (cancer, etc.); starvation.

Computational Challenges

• Dissociate actual gene expression values from experimental noise (missing data).

• Normalization: How does one best normalize thousands of signals from the same or different conditions/experiments?

• Clustering: Partition the genes into subsets that show similar patterns across experiments.

• Classification: Given a partition of the conditions into types, classify new conditions into those types (SVM, NN, etc.).

• Feature selection: Given a partition of the conditions into types, find for each type a subset of the genes that distinguishes it.

• Assign statistical significance to your answers.
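To make the feature-selection task concrete, here is a minimal sketch for two condition types. It ranks genes by a signal-to-noise score, (mean_0 - mean_1) / (sd_0 + sd_1), in the style of the statistic used by Golub et al.; the slides do not prescribe a particular score, so the choice, the function names and the toy data are assumptions.

```python
import numpy as np


def signal_to_noise(X, labels):
    """Per-gene score; X is genes x samples, labels holds 0/1 per sample."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    a, b = X[:, labels == 0], X[:, labels == 1]
    # Assumes every gene varies within each class (non-zero denominator).
    spread = a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    return (a.mean(axis=1) - b.mean(axis=1)) / spread


def top_genes(X, labels, k):
    """Indices of the k genes with the largest absolute score."""
    return np.argsort(-np.abs(signal_to_noise(X, labels)))[:k]


# Toy data: 3 genes, 3 samples of each type.
X = [[1, 2, 3, 11, 12, 13],
     [5, 6, 7, 5, 6, 7],
     [10, 12, 14, 8, 10, 12]]
labels = [0, 0, 0, 1, 1, 1]
print(signal_to_noise(X, labels))  # scores -5, 0 and 0.5
print(top_genes(X, labels, 2))     # genes 0 and 2
```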

Clustering Example

[Figure: clustered expression matrix; axis labels: Genes, Tissue state]

Reading the Chip

Often one may find shifts between the expected grid and the observed grid.

We average the pixels within the spot's circle to obtain the signal.
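The pixel-averaging step can be sketched as follows. This is a minimal illustration that assumes the spot center and radius are already known (that is, the grid has been aligned); the function name and the toy image are hypothetical.

```python
import numpy as np


def spot_signal(image, center_row, center_col, radius) -> float:
    """Mean intensity of the pixels inside a circle around the spot center."""
    image = np.asarray(image, dtype=float)
    rows, cols = np.ogrid[:image.shape[0], :image.shape[1]]
    inside = (rows - center_row) ** 2 + (cols - center_col) ** 2 <= radius ** 2
    if not inside.any():
        raise ValueError("the circle does not cover any pixel")
    return float(image[inside].mean())


# Toy 5x5 scan with a single bright pixel in the middle.
scan = np.zeros((5, 5))
scan[2, 2] = 9.0
print(spot_signal(scan, 2, 2, 1))  # 1.8 (five pixels inside the circle)
```

A shift between the expected and observed grid would show up here as a wrong center, which pulls background pixels into the average.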

Data Pre-processing - Example

“Molecular classification of cancer: class discovery and class prediction by gene expression monitoring”, Golub et al.

The dataset comes from a study of gene expression in two types of acute leukemias: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high density oligonucleotide arrays containing p = 6817 human genes. The data comprise 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML.

The following pre-processing steps were applied:

(i) thresholding: floor of 100 and ceiling of 16,000.

(ii) filtering: exclusion of genes with max/min ≤ 5 and (max-min) ≤ 500, where max and min refer respectively to the maximum and minimum expression levels of a particular gene across mRNA samples.

(iii) base 10 logarithmic transformation.
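The three steps above can be sketched with NumPy. The sketch follows the filtering rule exactly as worded here (a gene is excluded when both conditions hold); the function name, the parameter names and the toy matrix are assumptions.

```python
import numpy as np


def preprocess(X, floor=100.0, ceiling=16000.0, min_ratio=5.0, min_diff=500.0):
    """Threshold, filter and log-transform a genes x samples matrix.

    Returns the log10 values of the retained genes and a boolean mask
    marking which of the original genes were kept.
    """
    # (i) thresholding
    X = np.clip(np.asarray(X, dtype=float), floor, ceiling)
    # (ii) filtering on fold change and absolute change across samples
    gmax, gmin = X.max(axis=1), X.min(axis=1)
    excluded = (gmax / gmin <= min_ratio) & (gmax - gmin <= min_diff)
    keep = ~excluded
    # (iii) base 10 logarithm
    return np.log10(X[keep]), keep


X = [[50, 20000, 300],   # clipped to 100 .. 16000, strongly varying: kept
     [200, 250, 300],    # ratio 1.5, difference 100: excluded
     [100, 400, 700]]    # ratio 7, difference 600: kept
logged, keep = preprocess(X)
print(keep)    # gene 1 is filtered out
print(logged)
```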

Data Pre-processing - Missing/Negative values (~6%)

• Imputing using SVD computed only from complete genes (most of the genes do not have missing entries).

• SVD imputation using all the data (may not do well for unusual genes not well represented by the bulk of the data).

• Nearest-neighbor imputation for a gene x* with missing entries:

1. Compute the Euclidean distance between x* and all the genes in X^C (the genes with no missing entries), using only those coordinates not missing in x*. Identify the K closest.

2. Impute the missing coordinates of x* by averaging the corresponding coordinates of the K closest.

• Imputing using regression: using the standard EM approach for fitting multivariate Gaussian means and covariance in the presence of missing data.

“Imputing Missing Data for Gene Expression Arrays” T. Hastie et al.
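The two nearest-neighbor steps can be sketched for a single gene as follows. This is a minimal illustration, not the implementation from Hastie et al.: missing entries are assumed to be encoded as NaN, and the function name and toy data are hypothetical.

```python
import numpy as np


def knn_impute_gene(x_star, X_complete, k=3):
    """Fill the NaN entries of x_star from its k nearest complete genes."""
    x_star = np.asarray(x_star, dtype=float)
    X_complete = np.asarray(X_complete, dtype=float)
    missing = np.isnan(x_star)
    if missing.all():
        raise ValueError("x_star has no observed coordinates")
    # Step 1: Euclidean distance over the coordinates observed in x_star.
    diffs = X_complete[:, ~missing] - x_star[~missing]
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dist, kind="stable")[:k]
    # Step 2: average the neighbors at the missing coordinates.
    filled = x_star.copy()
    filled[missing] = X_complete[nearest][:, missing].mean(axis=0)
    return filled


X_complete = [[1.0, 2.0, 3.0, 4.0],
              [1.0, 2.0, 3.0, 6.0],
              [10.0, 10.0, 10.0, 10.0]]
print(knn_impute_gene([1.0, 2.0, 3.0, np.nan], X_complete, k=2))  # last entry becomes 5.0
```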

Data Pre-processing

• Normalization: controls for chip-wide variations in intensity (to control for variations in the total harvest of mRNA across samples, normalize the distribution of all genes).

• This variation could be due to inconsistent washing, inconsistent sample preparation, or other microarray production or microfluidics imperfections.
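One simple way to control for chip-wide intensity differences is to rescale every array to a common median. The slides do not name a specific method, so median scaling, the function name and the toy data below are assumptions used only to illustrate the idea.

```python
import numpy as np


def scale_arrays(X, target=None):
    """Rescale each column (array) of a genes x arrays matrix to one median."""
    X = np.asarray(X, dtype=float)
    medians = np.median(X, axis=0)
    if np.any(medians <= 0):
        raise ValueError("array medians must be positive")
    if target is None:
        target = medians.mean()  # common level shared by all arrays
    return X * (target / medians)


# The second array is uniformly twice as bright as the first.
X = [[100, 200], [200, 400], [300, 600]]
print(scale_arrays(X))  # both columns become 150, 300, 450
```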

Data Pre-processing

The ratio is computed by dividing the signal (raw data) by the control strength. A ratio above one indicates over-expression; a ratio between zero and one indicates under-expression, so under-expressed values appear flattened into a narrow range. Taking the logarithm of the ratio spaces the values symmetrically around zero (better than just dividing by the maximum value).

• LogRatio(R,G) = log(R/G)

• RelDiff(R,G) = (R-G)/(0.5*(R+G)), which can handle negative values (MM > PM).
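The two measures can be compared with a short sketch. The base of the logarithm is not stated in the slides; base 2 is assumed here, and the function names are illustrative.

```python
import math


def log_ratio(r: float, g: float, base: float = 2.0) -> float:
    """log(R/G); defined only for positive intensities."""
    if r <= 0 or g <= 0:
        raise ValueError("log ratio needs positive R and G")
    return math.log(r / g, base)


def rel_diff(r: float, g: float) -> float:
    """(R - G) / (0.5 * (R + G)); also defined when R or G is negative."""
    if r + g == 0:
        raise ValueError("R + G must not be zero")
    return (r - g) / (0.5 * (r + g))


print(log_ratio(400, 100), log_ratio(100, 400))  # symmetric around 0
print(rel_diff(300, 100), rel_diff(100, 300))    # 1.0 and -1.0
print(rel_diff(-50, 100))                        # works where log_ratio cannot
```

For positive intensities RelDiff stays between -2 and 2, while the log ratio is unbounded.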

Standardization

It is common practice to use the correlation between the gene expression profiles of two mRNA samples to measure their similarity. Consequently, we can standardize the observations (arrays) to have mean 0 and variance 1 across variables (genes).

With the data standardized in this fashion, the distance between two mRNA samples may be measured by their Euclidean distance, angle or Hamming distance (binary arrays).

• Normalize to the median (instead of the average, whose breakdown point is 0).

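$$x^{*}_{i,j} = \frac{x_{i,j} - \mu_i}{\sigma_i}, \qquad 1 \le i \le p,\ 1 \le j \le N$$

where $\mu_i$ and $\sigma_i$ are the center and the scale used for standardizing profile $i$.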
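The link between correlation and Euclidean distance on standardized data can be checked numerically. For two arrays standardized to mean 0 and variance 1 across p genes, the squared Euclidean distance equals 2p(1 - r), where r is their correlation. The sketch below assumes a genes x arrays matrix and the population variance (ddof = 0); the function name and the random data are illustrative.

```python
import numpy as np


def standardize_arrays(X):
    """Give every column (array) mean 0 and variance 1 across the genes."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))      # 50 genes, 2 arrays of synthetic data
Z = standardize_arrays(X)
p = Z.shape[0]

r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
squared_distance = np.sum((Z[:, 0] - Z[:, 1]) ** 2)
print(np.isclose(squared_distance, 2 * p * (1 - r)))  # True
```

So after standardization, ranking sample pairs by Euclidean distance gives the same order as ranking them by decreasing correlation.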

Genetic Network

Discussion

• The Fourier transform, i.e. moving to the frequency space, enables us to easily detect expression cycles, remove noise, etc.
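As a small illustration of detecting an expression cycle in the frequency space, the sketch below finds the dominant period of an evenly sampled time series with NumPy's FFT. The synthetic signal, the sampling interval and the function name are assumptions.

```python
import numpy as np


def dominant_period(signal, dt=1.0) -> float:
    """Period of the strongest non-constant frequency component."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=dt)
    peak = int(np.argmax(spectrum[1:])) + 1  # skip the zero frequency
    return float(1.0 / freqs[peak])


# Synthetic expression profile: a cycle of period 16 plus small noise.
rng = np.random.default_rng(0)
t = np.arange(64)
profile = 5.0 + np.sin(2 * np.pi * t / 16) + 0.2 * rng.normal(size=t.size)
print(dominant_period(profile))  # 16.0 for this signal
```

Noise removal follows the same route: zero the weak frequency components and transform back with `np.fft.irfft`.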

Questions?

References

• Prof. Ron Shamir, lecture notes

• Prof. Shlomo Moran, course: Algorithms in Computational Biology

• Ron Ophir, presentation on Microarray Analysis

• T. Hastie et al., “Imputing Missing Data for Gene Expression Arrays”

• Sandrine Dudoit et al., “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data”

• Alfred Ultsch, “Is log ratio a good value for identifying differential expressed genes in microarray experiments?”

http://www.cs.tau.ac.il/~rshamir/ge/02/ge02.html

http://www.cs.technion.ac.il/~cs236522/index_winter03.html

http://bip.weizmann.ac.il/course/ISMBM_03/presentation_ro/index.htm


