Mining and Pattern Analysis in Large Data Sets for Biological Information. David W. Mount, Arizona Cancer Center. Analysis of gene expression microarray data sets with the goal of preventing or curing cancer (statistical analysis of data; using biological information to interpret data). Future types of genetic analyses.


Page 1: Mining and Pattern Analysis in Large Data Sets for ...parida/DIMACSworkshopJune20... · Raghavendra Guru • Dave Alberts • Anne Cress • Gene Gerner • Serrine Lau - SWEHSC •

Mining and Pattern Analysis in Large Data Sets for Biological Information.

David W. Mount

Arizona Cancer Center

• Analysis of gene expression microarray data sets with goal of preventing or curing cancer
  – Statistical analysis of data
  – Using biological information to interpret data
• Future types of genetic analyses

Page 2

My major objectives.

Develop hypotheses based on data analysis that can be tested in the laboratory or clinic

Use and develop new methods for data analysis - pattern analysis, clustering, data mining, biological models

Focus 1: early changes in colorectal and prostate cancer

Focus 2: drugs for pancreatic cancer

Major goal: to discover the unusual based on statistical and biological data analyses

Page 3

Using data from two types of microarrays for measuring gene expression of ~35,000 human genes.

[Diagram: two microarray workflows.
Spotted arrays: a cDNA/EST collection is spotted on the slide; control sample mRNA and test sample mRNA are converted to Cy3- and Cy5-labeled cDNA, and the mix is hybridized to one slide. Readout: Cy5/Cy3 for each gene.
Affymetrix arrays: oligo 1, oligo 2, oligo 3, … for each gene (matched and mismatched oligos) are synthesized on the slide; control sample mRNA goes to one slide and test sample mRNA to another slide, as biotin-labeled cDNAs hybridized to the oligos. Readout: slide1/slide2 for each gene.]

Page 4

Using gene expression microarrays for predicting genetic variation in tissues. - Michigan Prostate Study

[Heat map legend: Green – down or Red – up <2-4 fold. Sample classes: NAP normal adjacent, MET metastatic, PCA localized, BPH benign hyperplasia.]

Underexpressed - predicting lost functions

Overexpressed - predicting new metabolism

Page 5

Use data to find

• An unusual gene product or gene expression value that indicates a good drug target
• An early change that can help with early detection/diagnosis

Page 6

Microarrays provide new drug targets - 1

Over-expressed genes in metastatic tissue. What genes, what pathways, what functions, where in cell? Cancer cells need these additional proteins to support their abnormal metabolism.

[Diagram: normal cell with A; cancer cell with AAA; inhibitor of A product]

Page 7

Microarrays provide new drug targets - 2

Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb).

[Diagram: normal cell A+B+; cancer cell A-B+; inhibitor of B product]

Page 8

Careful Experimental Design and Statistical Analysis are Extremely Important

1. Plan experiment so as to identify sources of variation
2. Include biological replication
3. Perform data quality analysis
4. Find genes that are varying significantly using data model in 1.
5. Mine this gene list for biological information

Complications: genetic variability person to person, cancer stage, tissues are cell mixtures
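Step 4 can be illustrated with a simple per-gene test. The sketch below is an assumption made for illustration, not the analysis used in the talk: it computes an ordinary one-sample t-statistic for the log ratios of one gene across replicate arrays, asking whether their mean differs from 0. The function name and the example values are hypothetical.

```python
import math
import statistics


def one_sample_t(log_ratios, mu0=0.0):
    """Ordinary one-sample t-statistic for H0: mean(log_ratios) == mu0."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("need at least two replicates to estimate variance")
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("zero variance: t-statistic is undefined")
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical log2 ratios for one gene on four replicate arrays.
example = [1.0, 2.0, 3.0, 2.0]
print(round(one_sample_t(example), 3))
```

With only a handful of replicates the per-gene variance estimate is noisy, which is one reason biological replication (step 2) matters.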

Page 9

Analysis of biological data with a variable genetic component is not new!

Page 10

We are using R statistical computing/BioConductor for data analysis combined with Perl/Bioperl for biological mining.

Page 11

R has tools for looking at data quality, etc.

Background varies slide to slide

[Figure panels: bad spotted array; good Affy array]

Page 12

Antibody used for immunochemical stain reveals which cells are producing a protein (cytokeratin)

[Figure labels: labeled cells; unlabeled cells]

Page 13

Example of Pancreatic Cancer

• 1/200 people get pancreatic cancer; 1/4 if they have pancreatitis
• It is a very painful and debilitating disease
• Death usually within 1-2 years of discovery
• Few drugs available - gemcitabine hopeful but only helps small percentage of people
• There is very little currently being spent on research into pancreatic cancer compared to other cancers
• I will describe early results: 4 cancer tissues vs one normal tissue on Agilent spotted arrays (24K genes).

Page 14

Boxplots of normalized data of 4 tissues reveal that between-slide comparisons should be valid.

Boxplots illustrate that distribution of M values in each sample is similar. Bars are 25% and 75% levels.
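The 25% and 75% levels that bound each box can be computed directly. This is a minimal sketch in Python using the standard library; the M values are hypothetical, and other tools may use slightly different quartile conventions.

```python
import statistics


def box_edges(m_values):
    """Return the 25%, 50% and 75% levels of a list of M values."""
    q1, median, q3 = statistics.quantiles(m_values, n=4)
    return q1, median, q3


# Hypothetical M values for two arrays; similar edges suggest
# that the two distributions are comparable.
array_1 = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
array_2 = [-2.5, -1.5, -0.5, 0.0, 0.5, 1.5, 2.5]
print(box_edges(array_1))
print(box_edges(array_2))
```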

Page 15

Normalization within arrays corrects for labeling and label detection variation.

[Figure panels: MA plot with no normalization; MA plot with Loess normalization. Red - tumor; Blue - normal.]

A = average of R and G values (square root of their product)

M = log of R to G ratio to the base 2.

MA plot from the first cancer tissue sample vs. control. Each point is one of approx. 24,000 genes. The crowd of spots in the lower part of the graph, two of which are labeled R25, are the positive controls with a deliberately reduced R/G ratio; two negative controls, which should not change, are on the center left near 0; and two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in these tissues, are also shown. Normalization restores M of most genes to approx. 0.
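The M and A definitions can be written out for a single spot. This is a minimal sketch, assuming A is taken on the log2 scale, i.e. log2 of the square root of R times G; that scale is an assumption, chosen because it matches the size of the A values reported later in the talk. The function name and intensities are hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    M = log2(R / G)
    A = log2(sqrt(R * G)), i.e. the average of log2(R) and log2(G)
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


# Hypothetical intensities: red is four times green, so M is 2.
m, a = ma_values(red=4096.0, green=1024.0)
print(m, a)
```

A gene with equal red and green signal gives M = 0, which is why normalization is expected to bring most genes back to M near 0.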

Page 16

Top 100 genes that are statistically best supported are mostly down regulated.

[Figure legend: Red - tumor; Blue - normal. A1 = average of R and G values (square root of their product); M1 = log of R to G ratio to the base 2.]

Page 17

Volcano plot of fold change (x axis) against log odds that gene is differentially expressed (y axis) for 100 most significantly varying genes.

This plot also shows that the most significantly varying genes in the pancreatic cancer tissues are down regulated, which probably means they are not functional. Some down regulated genes are also tumor suppressor genes and thus are candidates for project 2 drug screens in the Pancreatic PPG.

Log odds of 5 means that the odds that these genes are NOT varying significantly from M=0 are 1/e^5, about 1/148. This is a measure of the false discovery rate.
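The conversion from log odds to odds and probability is a one-liner. The sketch below assumes natural-log odds, as the e^5 in the slide implies; the function name is hypothetical.

```python
import math


def log_odds_to_probability(b):
    """Convert natural-log odds b into the probability of differential expression."""
    return 1.0 / (1.0 + math.exp(-b))


b = 5.0
odds = math.exp(b)                         # odds in favour of differential expression
p_not = 1.0 - log_odds_to_probability(b)   # chance the gene is not changing
print(round(odds, 1), round(p_not, 4))
```

Odds of about 148 to 1 correspond to a probability of roughly 1 in 149 that the gene is not changing, which is close to the 1/148 quoted on the slide.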

Page 18

Example of genes varying significantly between 4 pancr. cancer tissues and a normal pancr. tissue sample. - TGen data - Agilent arrays.

| Gb_accession | GeneName | Description | M = log2(R/G) | A = √(RG) | t | p corr. for FDR | B |
|---|---|---|---|---|---|---|---|
| BC004490 | FOS | V-fos transcr. factor | 3.6 | 10.3 | 36.9 | 0.00090 | 7.41 |
| NM_033194 | NM_033194.1 | Heat shock pr B9 | -1.7 | 8.2 | -30.0 | 0.00090 | 6.64 |
| Y12661 | VGF | VGF nerve growth factor | -2.5 | 13.7 | -28.3 | 0.00090 | 6.41 |
| AF488739 | GABABL | family G protein coupled rec. | -2.0 | 10.2 | -26.1 | 0.00090 | 6.06 |
| … | | | | | | | |
| NM_015711 | GLTSCR1 | Glioma tumor suppressor | -1.0 | 10.3 | -15.3 | 0.00188 | 3.51 |
| … | | | | | | | |
| BC000311 | COPEB | Kruppel-like transcr. factor | 1.6 | 10.8 | 13.8 | 0.00250 | 2.99 |
| NM_006999 | POLS | DNA Poly. sigma | 0.8 | 9.0 | 13.8 | 0.00250 | 2.98 |
| … | | | | | | | |
| NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.4 | 7.4 | 7.6 | 0.00865 | -0.19 |
| NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.6 | 7.3 | 7.6 | 0.00867 | -0.20 |
| NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.5 | 7.3 | 7.6 | 0.00872 | -0.21 |
| NM_001530 | HIF1A | Hypoxia-ind factor 1 | 1.6 | 7.3 | 7.5 | 0.00888 | -0.26 |

p corr. for FDR = p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change. B = log-odds that gene is differentially expressed, e.g. if B=1.5, odds is e^1.5 = 4.48, i.e., odds of correct prediction is 4.48/1. For B=0, odds = 1/1.
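The Benjamini and Hochberg adjustment can be sketched in a few lines. This is a plain Python illustration of the standard step-up procedure, not the BioConductor code used for the table; the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * n / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(benjamini_hochberg(raw))
```

Tied adjusted values, like the repeated 0.00090 in the table, are a normal consequence of the monotonicity step.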

Page 19

What do you do with a list of genes?

• Influence on known metabolic and regulatory pathways (usually ~1/4 of genes)
• Gene Ontology (GO) terms
• Protein-protein and gene-gene interactions
• Where located - genome amplification, rearrangements?
• Agreement with models - biological and computational
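One common way to ask whether a gene list is enriched for a pathway or GO term is a hypergeometric over-representation test. The talk does not say which test was used, so the sketch below is an assumption for illustration; the function name and counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, list_size, term_size, universe_size):
    """P(X >= k) when list_size genes are drawn from universe_size genes,
    of which term_size carry the annotation (pathway or GO term)."""
    total = comb(universe_size, list_size)
    upper = min(list_size, term_size)
    hits = sum(
        comb(term_size, i) * comb(universe_size - term_size, list_size - i)
        for i in range(k, upper + 1)
    )
    return hits / total


# Hypothetical counts: 10 genes on the array, 3 annotated to the term,
# 4 genes in the list, 2 of which carry the annotation.
print(hypergeom_upper_tail(k=2, list_size=4, term_size=3, universe_size=10))
```

Testing many terms at once calls for a multiple-testing correction on the resulting p-values.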

Page 20

Local genome databases are maintained at AZCC

• Local databases of human, rat, mouse, and model organisms
• Direct links to genetic, proteomic, and regulatory/pathway databases
• Information on protein-protein and gene-gene interactions
• http://www.biorag.org is public access Web site

Page 21

Pathway Miner

• http://www.biorag.org/pathway.html
• Pandey et al. 2004 Bioinformatics. 20:2156-8
• Builds genetic network displays based on regulatory and metabolic relationships
• Produces lists of genes in Excel format

Page 22

Genetic network analysis of pancreatic data with Pathway Miner - top 800 pancreatic genes - GenMAPP pathways

A Java interactive display that can be filtered in many ways. Click on gene names to retrieve all relevant information and on edges to view the pathways in common. Any list of genes can be uploaded for analysis.
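The idea of an edge that records the pathways two genes have in common can be sketched as follows. This is not Pathway Miner's implementation; the function, gene names, and pathway names are hypothetical placeholders.

```python
from itertools import combinations


def shared_pathway_edges(gene_to_pathways):
    """Map each gene pair to the set of pathways the two genes share."""
    edges = {}
    for gene_a, gene_b in combinations(sorted(gene_to_pathways), 2):
        common = gene_to_pathways[gene_a] & gene_to_pathways[gene_b]
        if common:
            edges[(gene_a, gene_b)] = common
    return edges


# Hypothetical annotations; gene and pathway names are placeholders.
annotations = {
    "GENE_A": {"pathway_1", "pathway_2"},
    "GENE_B": {"pathway_2"},
    "GENE_C": {"pathway_3"},
}
print(shared_pathway_edges(annotations))
```

Genes with no shared pathway, like GENE_C here, end up without edges and would appear as isolated nodes.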

Page 23

Five genes in the top 800 are in MAPK, including FOS

Page 24

About prostate cancer

• Men screened for a serum antigen - PSA
• If levels go up -> biopsy specimens examined for evidence of cancer (black box -> Gleason score)
• Decision made about prostatectomy (undesirable - incontinence, sex dysfunction, etc.)
• Survival about 2/3
• Tissues collected from men used for gene expression analysis using Affymetrix arrays (about 12,500 genes)

Page 25

Analysis of a large Affy prostate data set (Singh et al. 2002, Cancer Cell)

• 50 normal tissues
• 52 staged tissues
• Perform BioConductor Linear Models (LIMMA) analysis
• Trying advanced statistical modeling and clustering of genes, e.g. independent component analysis (fastICA, MLICA), mixture models (nlme)
• Test models of penetration, altered metabolism, etc.

Page 26

The data set: human Affy hgu95av2 chip - 1/3 of genome

• 50 normal prostate tissues
• 52 cancer tissues at different stages
  – 29 negative capsule penetration/20 positive
  – 13 positive resection surg. margin/37 negative
  – 9 non re-occurring/5 re-occurring
  – Gleason score available
  – no apparent dissection
  – no apparent pairing of N/C samples

Singh et al. Cancer Cell March 2002

Page 27

Results from prostate data set

• Can find about 600-1,000 genes changing N/C depending on acceptable level of FDR
• No significant changes for capsular, margin, or recurrence data (agrees with paper)
• What next?

Page 28

Biocarta pathways

Page 29

Metabolic changes in prostate cancer - cells deprived of oxygen depend on these changes.

Cancer cells in general learn to survive with reduced oxygen and they make a factor (vascular endothelial growth factor or VEGF) that induces growth of blood vessels. This is clearly observed in gene expression data.

Page 30

Getting at the unknown gene relationships.

• Try to identify sets of genes that are regulated independently of other sets
• A new method is independent component analysis (vs. principal components analysis, etc.)
  – Can superimpose regulatory models and build a more detailed model
• Genes interact to different degrees
• Problem: find sets of genes that are statistically most different across the tissue samples
• R provides resources

Page 31

Independent component analysis: suitable for building and testing regulatory models

[Diagram: a genes × samples matrix equals a genes × components matrix times a components × samples matrix; the matrices are labeled Matrix X, Matrix S, and Matrix A. Sample columns are labeled N N … C C, with example rows 33331111, 31331311, 31311311.]

Do any of these gene groups better separate the sample classes C and N?
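The matrix model behind ICA can be shown in the forward direction with a toy example. The sketch below assumes the X = S A convention, with S as genes × components and A as components × samples; that assignment of names is an assumption, and all values are hypothetical. It only builds X from known factors; estimating S and A from X is the job of an ICA algorithm and is not shown here.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    if len(left[0]) != len(right):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


# Hypothetical toy example: 3 genes, 2 components, 4 samples (N, N, C, C).
S = [[1.0, 0.0],   # gene 1 loads only on component 1
     [0.0, 1.0],   # gene 2 loads only on component 2
     [1.0, 1.0]]   # gene 3 loads on both components
A = [[3.0, 3.0, 1.0, 1.0],   # component 1: high in N, low in C
     [1.0, 1.0, 1.0, 1.0]]   # component 2: flat across samples
X = matmul(S, A)
for row in X:
    print(row)
```

In this toy case only component 1 separates the N and C samples, which is the kind of component the question on the slide is looking for.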

Page 32

Use of ICA in analysis of endometrial cancer

[Figure labels: Noise; Good separation]

SA Saidi et al. Oncogene 2004

Page 33

Some samples of ICA: objective - can we find a set to discriminate gene and tissue classes in prostate ca.?

[Figure panels: Hierarchical clustering (complete) of 102 prostate tissue samples; Boxplots of 102 samples after ICA]

Page 34

Another approach - use list of genes that are of biological interest during early stages of prostate Ca. and build model.

• 14-3-3 sigma

• actinin

• BP180

• BP230

• cadherin

• catenin

• CD151

• CD44

• CD63

• CD81

• CD9

• connexin 32

• desmocollin

• desmoglein

• desmoplakin

• ehm2

• EWI

• Ezrin

• fascin

• fibulin

• HD1

• keratin

• laminin

• MTA3

• Nanos homolog 1

• PKC-delta

• plakoglobin

• Plectin

• SNAI1

• tenascin

• vinculin

• Zona Occludens 1

• Zona Occludens 2

Genes related to cell adhesion to intracellular matrix

If these genes change, then expect cells to be able to penetrate the capsule and invade surrounding tissues.

Page 35

The source of germline variability in humans

[Diagram: one of my pairs of chromosomes (maternal and paternal); what I pass on to our children; what my wife passes on to our children]

Hundreds of thousands of differences in sequence, called SNPs - single nucleotide polymorphisms

Inheritance is through haplotype blocks of 10s to 100s of kbases

Page 36

Genotype revealed in humans by haplotype structure of 5q31 (Daly et al. 2001)

Page 37

Goal: relationship between genotype and expression.

Large scale expression analysis mapped against genotype.

Pomp et al. 2004

Page 38

Conclusions and future plans

• gene expression data are used to identify drug targets
• further analysis
  – ICA analysis - Maximum likelihood method
  – Examine all penetration related genes for possible variation

Page 39

Acknowledgements

Colleagues at UMC/AZCC/SWEHSC

• Ritu Pandey, Greg Thomas, Rob Klein, Raghavendra Guru
• Dave Alberts
• Anne Cress
• Gene Gerner
• Serrine Lau - SWEHSC
• Clark Lantz
• Ray Nagle
• Garth Powis
• George Tsaprailis and the proteomics core
• Bernie Futscher and George Watts of the genomics core

Colleagues at TGen, Phoenix

• Dan Von Hoff
• Jeff Trent
• Phillip Stafford
• Haiyong Han
• Spyro Mousses