Mining and Pattern Analysis in Large Data Sets for ...parida/DIMACSworkshopJune20... · Raghavendra Guru • Dave Alberts • Anne Cress • Gene Gerner • Serrine Lau - SWEHSC •

Mining and Pattern Analysis in Large Data Sets for Biological Information.

David W. Mount

Arizona Cancer Center

• Analysis of gene expression microarray data sets with goal of preventing or curing cancer

– Statistical analysis of data

– Using biological information to interpret data

• Future types of genetic analyses

My major objectives.

Develop hypotheses based on data analysis that can be tested in the laboratory or clinic

Use and develop new methods for data analysis - pattern analysis, clustering, data mining, biological models

Focus 1: early changes in colorectal and prostate cancer

Focus 2: drugs for pancreatic cancer

Major goal: to discover the unusual based on statistical and biological data analyses

Using data from two types of microarrays for measuring gene expression of ~35,000 human genes.

[Figure: workflow for the two array types.

Spotted arrays: cDNA/EST collection on slide; control sample mRNA and test sample mRNA converted to Cy3- and Cy5-labeled cDNA; mix hybridized to one slide; Cy5/Cy3 for each gene.

Affymetrix arrays: matched and mismatched oligos (oligo 1, oligo 2, oligo 3, ... for each gene), synthesis on slide; biotin labeled cDNAs hybridized to oligos; control sample mRNA to one slide, test sample mRNA to another slide; slide1/slide2 for each gene.]

Using gene expression microarrays for predicting genetic variation in tissues - Michigan Prostate Study

[Figure: expression heat map. Green – down or Red – up <2-4 fold. Tissue classes: NAP normal adjacent; BPH benign hyperplasia; PCA localized; MET metastatic.]

Underexpressed: predicting lost functions

Overexpressed: predicting new metabolism

Use data to find

• An unusual gene product or gene expression value that indicates a good drug target

• An early change that can help with early detection/diagnosis

Microarrays provide new drug targets - 1

Over-expressed genes in metastatic tissue. What genes, what pathways, what functions, where in cell? Cancer cells need these additional proteins to support their abnormal metabolism.

[Figure: normal cell with A; cancer cell with AAA; inhibitor of A product.]

Microarrays provide new drug targets - 2

Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb).

[Figure: normal cell A+B+; cancer cell A-B+; inhibitor of B product.]

Careful Experimental Design and Statistical Analysis are Extremely Important

1. Plan experiment so as to identify sources of variation

2. Include biological replication

3. Perform data quality analysis

4. Find genes that are varying significantly using data model in 1.

5. Mine this gene list for biological information

Complications: genetic variability person to person, cancer stage, tissues are cell mixtures

Analysis of biological data with a variable genetic component is not new!

We are using R statistical computing/BioConductor for data analysis combined with Perl/Bioperl for biological mining.

R has tools for looking at data quality, etc.

Background varies slide to slide

[Figure: bad spotted array; good affy array.]

Antibody used for immunochemical stain reveals which cells are producing a protein (cytokeratin)

[Figure: labeled cells; unlabeled cells.]

Example of Pancreatic Cancer

• About 1 in 200 people get pancreatic cancer; 1 in 4 if they have pancreatitis

• It is a very painful and debilitating disease

• Death usually within 1-2 years of discovery

• Few drugs available - gemcitabine is hopeful but only helps a small percentage of people

• There is very little currently being spent on research into pancreatic cancer compared to other cancers

• I will describe early results: 4 cancer tissues vs one normal tissue on Agilent spotted arrays (24K genes).

Boxplots of normalized data of 4 tissues show that between-slide comparisons should be valid.

Boxplots illustrate that the distribution of M values in each sample is similar. Bars are 25% and 75% levels.
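The comparison behind those boxplots can be sketched as follows. This is a minimal illustration, assuming each slide is just a list of normalized M values; the slide names and numbers are hypothetical and are not data from the talk.

```python
import statistics

def box_summary(m_values):
    """Return (25% level, median, 75% level) for one slide's M values."""
    q1, q2, q3 = statistics.quantiles(m_values, n=4, method="inclusive")
    return q1, q2, q3

# Hypothetical normalized M values for two slides (not data from the talk).
slides = {
    "tissue1": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "tissue2": [-1.2, -0.4, 0.0, 0.4, 1.2],
}

# Similar summaries across slides suggest between-slide comparison is reasonable.
summaries = {name: box_summary(vals) for name, vals in slides.items()}
for name, (q1, med, q3) in summaries.items():
    print(f"{name}: 25%={q1:.2f} median={med:.2f} 75%={q3:.2f}")
```

If the medians and quartiles line up across slides, as they roughly do here, the slides share a common scale.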

Normalization within arrays corrects for labeling and label detection variation.

[Figure: MA plot with no normalization; MA plot with Loess normalization. Red - tumor; Blue - normal.]

A = average of the log2 R and log2 G values (log2 of the square root of their product)

M = log of R to G ratio to the base 2, i.e. log2(R/G).

MA plot from the first cancer tissue sample vs. control. Each point is one of approx. 24,000 genes. The crowd of spots in the lower part of the graph, two of which are labeled R25, are the +ve control with a deliberately reduced R/G ratio; two -ve controls, which should not change, are on the center left near 0; two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in these tissues, are also shown. Normalization restores M of most genes to approx. 0.
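As a rough sketch of the quantities on an MA plot, the snippet below computes M and A for a few spots and then centers M on its median. The median shift is only a crude stand-in for the Loess normalization used in the talk (which fits an intensity-dependent curve), and the intensities are hypothetical.

```python
import math
import statistics

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red / green)                     # log2 ratio
    a = 0.5 * (math.log2(red) + math.log2(green))  # log2 of sqrt(R * G)
    return m, a

def median_center(m_list):
    """Shift M values so their median is 0 (assumption: not Loess)."""
    med = statistics.median(m_list)
    return [m - med for m in m_list]

# Hypothetical (R, G) intensities, not data from the talk.
spots = [(2048.0, 512.0), (1024.0, 512.0), (512.0, 512.0)]
m_raw = [ma_values(r, g)[0] for r, g in spots]
m_norm = median_center(m_raw)
```

After centering, the typical spot sits at M = 0, which matches the statement that normalization restores M of most genes to approximately 0.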

Top 100 genes that are statistically best supported are mostly down regulated.

[Figure: volcano plot. Red - tumor; Blue - normal.]

Volcano plot of fold change (x axis) against log odds that gene is differentially expressed (y axis) for 100 most significantly varying genes.

This plot also shows that the most significantly varying genes in the pancreatic cancer tissues are down regulated, which probably means they are not functional. Some down regulated genes are also tumor suppressor genes and thus are candidates for project 2 drug screens in the Pancreatic PPG.

A log odds of 5 means that the odds that these genes are NOT varying significantly from M=0 are e^-5 ≈ 1/148. This serves as a rough measure of the false discovery rate.
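The arithmetic behind that figure can be checked with a short sketch. The function name is illustrative only; the conversion itself is the standard relation between log odds, odds, and probability.

```python
import math

def log_odds_to_probability(b):
    """Convert a log-odds score B into a probability of differential expression."""
    odds = math.exp(b)
    return odds / (1.0 + odds)

b = 5.0
odds_for = math.exp(b)                            # odds in favor of a real change
odds_against = math.exp(-b)                       # odds that the gene is NOT changing
p_not_changed = 1.0 - log_odds_to_probability(b)  # the same statement as a probability
print(f"odds for: {odds_for:.1f} to 1, odds against: 1/{1 / odds_against:.0f}")
```

Note the small difference between odds and probability: odds of 1/148 against correspond to a probability of about 1/149.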

Example of genes varying significantly between 4 pancr. cancer tissues and a normal pancr. tissue sample. - TGen data - Agilent arrays.

Gb_accession | GeneName | Description | M = log2(R/G) | A | t | p corr. for FDR | B
BC004490 | FOS | V-fos transcr. factor | 3.6 | 10.3 | 36.9 | 0.00090 | 7.41
NM_033194 | NM_033194.1 | Heat shock pr B9 | -1.7 | 8.2 | -30.0 | 0.00090 | 6.64
Y12661 | VGF | VGF nerve growth factor | -2.5 | 13.7 | -28.3 | 0.00090 | 6.41
AF488739 | GABABL | family G protein coupled rec. | -2.0 | 10.2 | -26.1 | 0.00090 | 6.06
……
NM_015711 | GLTSCR1 | Glioma tumor suppressor | -1.0 | 10.3 | -15.3 | 0.00188 | 3.51
……
BC000311 | COPEB | Kruppel-like transcr. factor | 1.6 | 10.8 | 13.8 | 0.00250 | 2.99
NM_006999 | POLS | DNA Poly. sigma | 0.8 | 9.0 | 13.8 | 0.00250 | 2.98
……
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.4 | 7.4 | 7.6 | 0.00865 | -0.19
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.6 | 7.3 | 7.6 | 0.00867 | -0.20
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.5 | 7.3 | 7.6 | 0.00872 | -0.21
NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.6 | 7.3 | 7.5 | 0.00888 | -0.26

p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change. B = log-odds that gene is differentially expressed, e.g. if B = 1.5, odds are e^1.5 = 4.48, i.e. odds of correct prediction are 4.48 to 1. For B = 0, odds are 1 to 1.
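For readers unfamiliar with the Benjamini and Hochberg adjustment, a minimal sketch is given here. It follows the standard step-up procedure (multiply each sorted p-value by n/rank, then enforce monotonicity from the largest down); the raw p-values are hypothetical, and the analysis in the talk was done in R/BioConductor, not with this code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
adj = benjamini_hochberg(raw)

# Genes kept at an acceptable FDR level of 5%.
kept = [i for i, p in enumerate(adj) if p <= 0.05]
```

Raising or lowering the acceptable FDR level changes how many genes are kept, which is why the size of a "significant" gene list depends on that choice.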

What do you do with a list of genes?

• Influence on known metabolic and regulatory pathways (usually ~1/4 of genes)

• Gene Ontology (GO) terms

• Protein-protein and gene-gene interactions

• Where located - genome amplification, rearrangements?

• Agreement with models - biological and computational

Local genome databases are maintained at AZCC

• Local databases of human, rat, mouse, and model organisms

• Direct links to genetic, proteomic, and regulatory/pathway databases

• Information on protein-protein and gene-gene interactions

• http://www.biorag.org is public access Web site

Pathway Miner

• http://www.biorag.org/pathway.html

• Pandey et al. 2004 Bioinformatics. 20:2156-8

• Builds genetic network displays based on regulatory and metabolic relationships

• Produces lists of genes in Excel format

Genetic network analysis of pancreatic data with Pathway Miner - top 800 pancreatic genes - GenMAPP pathways

A Java interactive display that can be filtered in many ways. Click on gene names to retrieve all relevant information and on edges to view the pathways in common. Any list of genes can be uploaded for analysis.

Five genes in the top 800 are in MAPK, including FOS
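One way to judge whether an overlap such as five pathway genes in a list of 800 is more than chance is a hypergeometric tail probability. The sketch below is not part of Pathway Miner; the universe of 24,000 genes is the array size quoted for the Agilent slides, and the pathway size of 100 is an invented placeholder, since the size of the MAPK gene set is not given.

```python
from math import comb

def hypergeom_tail(universe, in_pathway, selected, overlap):
    """P(at least `overlap` pathway genes among `selected` genes drawn at random)."""
    total = comb(universe, selected)
    upper = min(in_pathway, selected)
    hits = sum(
        comb(in_pathway, k) * comb(universe - in_pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total

# Assumptions: 24,000 genes on the array, a hypothetical pathway of 100 genes,
# 800 selected genes, 5 of them in the pathway.
p_mapk = hypergeom_tail(24000, 100, 800, 5)
print(f"P(overlap >= 5 by chance) = {p_mapk:.3f}")
```

With these placeholder numbers about 3.3 pathway genes would be expected by chance, so an overlap of five is not striking on its own; the real pathway size would be needed to draw any conclusion.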

About prostate cancer

• Men screened for a serum antigen - PSA

• If levels go up -> biopsy specimens examined for evidence of cancer (black box -> Gleason score)

• Decision made about prostatectomy (undesirable - incontinence, sexual dysfunction, etc.)

• Survival about 2/3

• Tissues collected from men used for gene expression analysis using Affymetrix arrays (about 12,500 genes)

Analysis of a large Affy prostate data set (Singh et al. 2002, Cancer Cell)

• 50 normal tissues

• 52 staged tissues

• Perform BioConductor Linear Models (LIMMA) analysis

• Trying advanced statistical modeling and clustering of genes e.g. independent component analysis (fastICA, MLICA), mixture models (nlme)

• Test models of penetration, altered metabolism, etc.

The data set: human Affy hgu95av2 chip - 1/3 of genome

• 50 normal prostate tissues

• 52 cancer tissues at different stages

– 29 negative capsule penetration/20 positive

– 13 positive resection surg. margin/37 negative

– 9 non re-occurring/5 re-occurring

– Gleason score available

– no apparent dissection

– no apparent pairing of N/C samples

Singh et al. Cancer Cell March 2002

Results from prostate data set

• Can find about 600-1,000 genes changing N/C depending on acceptable level of FDR

• No significant changes with capsular penetration, surgical margin, or recurrence data (agrees with paper)

• What next?

Biocarta pathways

Metabolic changes in prostate cancer - cells deprived of oxygen depend on these changes.

Cancer cells in general learn to survive with reduced oxygen and they make a factor (vascular endothelial growth factor or VEGF) that induces growth of blood vessels. This is clearly observed in gene expression data.

Getting at the unknown gene relationships.

• Try to identify sets of genes that are regulated independently of other sets

• A new method is independent component analysis (vs. principal components analysis, etc.)

– Can superimpose regulatory models and build a more detailed model

• Genes interact to different degrees

• Problem: find sets of genes that are statistically most different across the tissue samples

• R provides resources

Independent component analysis: suitable for building and testing regulatory models

[Figure: Matrix X (genes × samples) = Matrix S (genes × components) × Matrix A (components × samples). Example rows of Matrix A across samples N N … C C: 33331111, 31331311, 31311311.]

Do any of these gene groups better separate the sample classes C and N?
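The question on this slide can be made concrete with a small sketch. It does not perform ICA (the talk names fastICA and MLICA in R for that); it only illustrates the model X = S × A and scores each row of A by how far apart the N and C samples sit. The three rows of A are the digit strings printed on the slide; the assignment of the first four samples to N and the last four to C, and the gene loadings in S, are assumptions made for illustration.

```python
from statistics import mean

def matmul(s, a):
    """Multiply S (genes x components) by A (components x samples)."""
    return [[sum(s_row[k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))] for s_row in s]

def class_gap(row, labels):
    """Absolute difference between the mean value in N and in C samples."""
    n_vals = [v for v, lab in zip(row, labels) if lab == "N"]
    c_vals = [v for v, lab in zip(row, labels) if lab == "C"]
    return abs(mean(n_vals) - mean(c_vals))

labels = ["N"] * 4 + ["C"] * 4   # assumed ordering of the 8 samples
A = [                            # rows as printed on the slide
    [3, 3, 3, 3, 1, 1, 1, 1],
    [3, 1, 3, 3, 1, 3, 1, 1],
    [3, 1, 3, 1, 1, 3, 1, 1],
]
S = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]  # hypothetical gene loadings
X = matmul(S, A)                       # expression matrix implied by the model

gaps = [class_gap(row, labels) for row in A]
best = gaps.index(max(gaps))           # component that best separates N from C
```

Under these assumptions the first component separates the classes cleanly, while the other two mix N and C samples.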

Use of ICA in analysis of endometrial cancer

Noise

Good separation

SA Saidi et al. Oncogene 2004

Some samples of ICA: objective - can we find a set to discriminate gene and tissue classes in prostate ca.?

Hierarchical clustering (complete) of 102 prostate tissue samples

Boxplots of 102 samples after ICA
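For reference, complete-linkage hierarchical clustering can be sketched in a few lines. This is a toy implementation, assuming Euclidean distance and stopping at a fixed number of clusters; the five two-gene profiles are hypothetical, whereas the slide clusters 102 real tissue samples.

```python
import math

def complete_linkage(points, n_clusters):
    """Agglomerative clustering; cluster distance = largest pairwise distance."""
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters whose farthest members are closest.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical expression profiles (2 genes) for 5 samples.
samples = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 0.3)]
groups = complete_linkage(samples, 2)
```

Complete linkage tends to give compact clusters because a merge is only as cheap as its two most distant members.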

Another approach - use list of genes that are of biological interest during early stages of prostate Ca. and build model.

• 14-3-3 sigma

• actinin

• BP180

• BP230

• cadherin

• catenin

• CD151

• CD44

• CD63

• CD81

• CD9

• connexin 32

• desmocollin

• desmoglein

• desmoplakin

• ehm2

• EWI

• Ezrin

• fascin

• fibulin

• HD1

• keratin

• laminin

• MTA3

• Nanos homolog 1

• PKC-delta

• plakoglobin

• Plectin

• SNAI1

• tenascin

• vinculin

• Zona Occludens 1

• Zona Occludens 2

Genes related to cell adhesion to intracellular matrix

If these genes change, then cells would be expected to be able to penetrate the capsule and invade surrounding tissues.

The source of germline variability in humans

[Figure: one of my pairs of chromosomes (maternal and paternal); what I pass on to our children; what my wife passes on to our children.]

Hundreds of thousands of differences in sequence, called SNPs - single nucleotide polymorphisms

Inheritance is through haplotype blocks of 10s to 100s of kbases

Genotype revealed in humans by haplotype structure of 5q31 (Daly et al. 2001)

Goal: relationship between genotype and expression.

Large scale expression analysis mapped against genotype (Pomp et al. 2004).

Conclusions and future plans

• gene expression data are used to identify drug targets

• further analysis

– ICA analysis - Maximum likelihood method

– Examine all penetration related genes for possible variation

Acknowledgements

Colleagues at UMC/AZCC/SWEHSC

• Ritu Pandey, Greg Thomas, Rob Klein, Raghavendra Guru

• Dave Alberts

• Anne Cress

• Gene Gerner

• Serrine Lau - SWEHSC

• Clark Lantz

• Ray Nagle

• Garth Powis

• George Tsaprailis and the proteomics core

• Bernie Futscher and George Watts of the genomics core

Colleagues at TGen, Phoenix

• Dan Von Hoff

• Jeff Trent

• Phillip Stafford

• Haiyong Han

• Spyro Mousses
