Exploratory Failure Time Analysis and Copy Number Variation Inference Cheng Cheng Department of Biostatistics St. Jude Children’s Research Hospital


Exploratory Failure Time Analysis and Copy Number Variation Inference

Cheng Cheng

Department of Biostatistics

St. Jude Children’s Research Hospital


Outline

Part I Background

Part II Exploratory Failure Time Analysis

Part III Copy Number Variation Inference

I. Background
• Nucleus, nucleotides, DNA, chromosomes, SNP
• SNP arrays
• Genome Wide Association Study (GWAS)
• Multiple tests
• Cause-specific failure and competing risk
• Cumulative incidence function, Gray's test, Fine-Gray hazard rate regression model
• Censoring at the time of a competing event: OK for testing stochastic independence, biased for estimation

Animal Cell Organelles

Nucleus, Nucleolus, Endoplasmic Reticulum, Centriole, Centrosome, Golgi, Cytoskeleton, Cytosol, Mitochondrion, Secretory Vesicle, Lysosome, Peroxisome, Vacuole

Nucleus Functions

The cell nucleus is an organelle that forms the package for our genes and their controlling factors.
• Store genes on chromosomes
• Organize genes into chromosomes to allow cell division
• Transport regulatory factors & gene products via nuclear pores
• Produce messages (messenger ribonucleic acid or mRNA) that code for proteins
• Produce ribosomes in the nucleolus
• Organize the uncoiling of DNA to replicate key genes

Chromosome inside nucleus

• What is a chromosome?
– In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes.
– Each chromosome is made up of DNA tightly coiled many times around proteins called histones that support its structure.

DNA = deoxyribonucleic acid


Human chromosomes

• In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46.

• Twenty-two of these pairs, called autosomes, look the same in both males and females.

• The 23rd pair, the sex chromosomes, differ between males and females.
– Females have two copies of the X chromosome
– Males have one X and one Y chromosome


Chromosome Structure

• Each chromosome has a constriction point called the centromere, which divides the chromosome into two sections, or “arms.”

• The short arm of the chromosome is labeled the “p arm.” The long arm of the chromosome is labeled the “q arm.”

• Each chromosome has two chromatids as a result of duplication of the DNA which took place during interphase. The two chromatids are linked together at a centromere.


DNA structure

DNA is a double-stranded molecule twisted into a helix (think of a spiral staircase). Each spiraling strand, comprised of a sugar-phosphate backbone and attached bases, is connected to a complementary strand by non-covalent hydrogen bonding between paired bases. The bases are adenine (A), thymine (T), cytosine (C) and guanine (G).


SNPs occur in human DNA at a frequency of one every 1,000 bases. These variations can be used to track inheritance in families.

Genetic code is specified by the four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine).

A Single Nucleotide Polymorphism (SNP) is a change of a single nucleotide: one letter, such as a T, replaces one of the other three nucleotide letters (A, C, or G) within a person's DNA sequence.

SNP Array Design

[Figure: genomic sequence (5´ to 3´) with a T/G SNP; each SNP probe = 25 bases; a probe quartet consists of perfect-match and mismatch probes for allele ‘A’ and for allele ‘B’]


Hundreds of Millions of Pixel Intensities…..


Genotype Calling

AA AB BB

Genome Wide Association Study (GWAS)

Typically 400,000 to 900,000 SNPs are investigated in a single study

Number of subjects in a study typically ranges from a few hundred to 20,000

Each SNP takes three possible (generic) values “AA”, “AB”, “BB”, often coded as 0, 1, 2

Each SNP in each individual has a unique value, which is one of 0, 1, or 2

A small number of phenotypes: disease status (yes/no), or quantitative trait
This lecture: time to a cause-specific failure

n subjects, n observed trait values Y1, …, Yn, n observed SNP values for the ith SNP Xi1, …, Xin

Inference (Test) for stochastic dependence of the ith SNP with the trait based on the dataset (Xij, Yj), j=1,…,n; do this for each SNP; thus many tests of the null hypothesis of stochastic independence.

Massive Multiple Tests

“Genome-wide significance”
Bonferroni-type adjustment: declare statistical significance if P ≤ 10^-7 (0.05/500K)

FDR and q value
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS-B, 57, 289–300.

Storey, J. D., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. JRSS-B, 66, 187–205.

Profile information criteria
Cheng, C., Pounds, S., Boyett, J. M. et al (2004). Statistical significance threshold criteria for analysis of microarray gene expression data. Statistical Applications in Genetics and Molecular Biology 3, Article 36. URL //www.bepress.com/sagmb/vol3/iss1/art36

Cheng, C. (2006). An adaptive significance threshold criterion for massive multiple hypotheses testing. IMS Lecture Notes - Monograph Series 2nd Lehmann Symposium – Optimality 49, 51–76.
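The two thresholding ideas on this slide can be illustrated with a short sketch: the Bonferroni-type cutoff 0.05/500K, and the Benjamini-Hochberg (1995) step-up rule for FDR control. The function names and the ten example P values are hypothetical and chosen only for illustration; they are not from the lecture.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, q):
    """Benjamini-Hochberg step-up rule: reject the k smallest P values, where
    k is the largest rank with p_(k) <= k * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Bonferroni-type genome-wide cutoff for 500K SNPs
threshold = bonferroni_threshold(0.05, 500_000)

# Hypothetical P values, for illustration only
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_rejected = benjamini_hochberg(example_p, q=0.05)
```

With these made-up P values the step-up rule rejects only the two smallest, while none would pass a Bonferroni cutoff computed for 500K tests.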

Cause-specific failure and competing risk

[Diagram: transitions from “Alive” to three failure types]
• Relapse: failure type 1 (of interest)
• 2nd cancer: failure type 2 (competing risk/event)
• Die in remission: failure type 3 (competing risk)

Klein, J. P. (2010) Competing risks. WIREs Comp Stat, www.wiley.com/wires/compstats, DOI: 10.1002/wics.83

Cumulative incidence function (CIN)
(T, δ); Fj(t) = Pr(T ≤ t and δ = j)

Gray’s test: Compare CIN across K groups
Analog of weighted log-rank test

Gray, R. J. (1988) A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Statist. 16, 1141-1154.

Fine-Gray’s CIN hazard rate regression model
Analog of Cox’s hazard rate regression model
Fine, J. P., Gray, R. J. (1999) A proportional hazards model for the subdistribution of a competing risk. JASA, 94, 496-509.

Censor at the time of competing event
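A minimal sketch of the standard nonparametric estimate of Fj(t) may make the definition concrete. It also shows what happens when competing events are simply censored and 1 minus Kaplan-Meier is used instead. The function names and the four-subject toy data are assumptions for illustration; status 0 is taken to mean censored and positive integers to label failure types.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Nonparametric estimate of F_j(t) = Pr(T <= t, delta = j).
    causes[i] == 0 means censored; positive integers label failure types.
    Returns a list of (time, estimate) at each distinct observed time."""
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0   # probability of no failure of any type before the current time
    cif = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [c for (u, c) in data[i:] if u == t]
        d_all = sum(1 for c in tied if c != 0)
        d_j = sum(1 for c in tied if c == cause_of_interest)
        cif += surv * d_j / n_at_risk        # increment uses survival just before t
        surv *= 1.0 - d_all / n_at_risk
        out.append((t, cif))
        n_at_risk -= len(tied)
        i += len(tied)
    return out

def one_minus_km(times, causes, cause_of_interest):
    """Treat competing events as censored, then compute 1 - Kaplan-Meier."""
    recoded = [c if c == cause_of_interest else 0 for c in causes]
    return cumulative_incidence(times, recoded, cause_of_interest)

# Toy data (hypothetical): failure types 1 and 2, one censored subject
toy_times = [1, 2, 3, 4]
toy_causes = [1, 2, 1, 0]
cif_final = cumulative_incidence(toy_times, toy_causes, 1)[-1][1]
km_final = one_minus_km(toy_times, toy_causes, 1)[-1][1]
```

On the toy data the cumulative incidence of type 1 ends at 0.5 (two of four subjects), whereas censoring the competing event gives 0.625, which illustrates the upward bias in estimation.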

II. Exploratory Failure Time Analysis
• Large-scale Genomic Association Analysis
  o Feature (variable) screening and feature extraction
• A Motivating Example from a GWAS
• Correlation Profile Test (CPT)
  o Hypotheses
  o Correlation profile function
  o CPT statistic
  o Hybrid permutation test of significance
• A Simulation Study: Strength and Weakness
• Example: Analysis of SNPs on Chromosome 9
• Summary and Remarks
• Feature Extraction (sparse regression)
• Example: “Prognostic” Gene (RNA) expression
• Summary and remarks

Large-scale Genomic Association Analysis
• Feature (variable) screening
– Find individual genomic features (factor/predictor variables) associated with one or more phenotypes (response variables)
• GWAS
– Association: stochastic dependence
– Parametric/semi-parametric approaches: linear models, GLMs, hazard rate (Cox) regression
• Feature extraction
– Find (linear) combinations (or sets) of genomic features (variables) associated with one or more phenotypes
– Determine sets of variables using biological knowledge (gene signaling pathways, functional/ontology groups, etc.): GSEA
– Variable/Model selection methods: ridge regression, LASSO, SCAD, SEAMLESS, sparse regression

A Motivating Example
• GWAS to screen SNP markers for risk of relapse in childhood leukemia patients

               AA    AB    BB
X               0     1     2
Relapse        70     0     0
Comp. Event    24     0     0
Censored      585    11     9

A Motivating Example

Need: a more omnibus and algorithmically robust test procedure

               AA    AB    BB
X               0     1     2
Relapse        70     0     0
Comp. Event    24     0     0
Censored      585    11     9

                                                      P         Coeff. (s.e.)
Cox Regression (R gives a warning)
  Test of coeff.                                      1         -16.4 (2901)
  LR test                                             0.0458
Fine-Gray Regression (JASA 1999 94(446):496-509)
  Test of coeff.                                      0         -11.4 (0.334)
  LR test                                             0.0488
Gray's Test (Ann. Statist. 1988 16(3):1141-1154)      0.3542
Jung's Test (Statist. Medicine 2005 24:3077-3088)     0.6607

Correlation Profile Test (CPT)
• Model, null and alternative hypotheses (classical survival setting)

Correlation Profile Test (CPT)
• Sample correlation profile function

observed event point process of individual i

Can do rank transformation for continuous X

Correlation Profile Test (CPT)

Correlation Profile Test (CPT)
• CPT statistic, hybrid permutation test

Back to the SNP Example

               AA    AB    BB
X               0     1     2
Relapse        70     0     0
Comp. Event    24     0     0
Censored      585    11     9

                                                      P         Coeff. (s.e.)
Cox Regression (R gives a warning)
  Test of coeff.                                      1         -16.4 (2901)
  LR test                                             0.0458
Fine-Gray Regression (JASA 1999 94(446):496-509)
  Test of coeff.                                      0         -11.4 (0.334)
  LR test                                             0.0488
Gray's Test (Ann. Statist. 1988 16(3):1141-1154)      0.3542
Jung's Test (Statist. Medicine 2005 24:3077-3088)     0.6607
CPT (test stat is negative)                           0.2582

A Simulation Study
• A model mimicking the SNP example

Generate X: Pr(X=0)=0.98, Pr(X=1)=0.015, Pr(X=2)=0.005

Generate censoring time TC ~ Exp(0.2)

Generate failure indicator IF|X ~ Bernoulli(πF); πF = 0.2exp{-θ(X-2)}

If IF = 1, generate failure time TF|X ~ LogNormal(βX,1); else set TF = ∞

Generate competing risk indicator IR ~ Bernoulli(0.1)

If IR = 1, generate competing failure time TR ~ Unif(0,7); else set TR = ∞

Observed failure time T = min{TC, TF, TR}

Repeat the above n times to simulate n individuals
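The data-generating recipe above can be sketched with NumPy as follows. Several details are assumptions rather than statements from the slide: Exp(0.2) is read as an exponential with rate 0.2 (mean 5), LogNormal(βX,1) as log-mean βX and log-sd 1, πF is capped at 1 as a safeguard, and the status coding (0 = censored, 1 = failure of interest, 2 = competing event) and function names are hypothetical.

```python
import numpy as np

def simulate_subject(rng, theta, beta):
    """Generate (X, T, status) for one individual following the slide's recipe."""
    x = int(rng.choice([0, 1, 2], p=[0.98, 0.015, 0.005]))
    # ASSUMPTION: Exp(0.2) means rate 0.2, i.e. scale (mean) 1/0.2 = 5
    t_c = rng.exponential(scale=1 / 0.2)
    # ASSUMPTION: cap the failure probability at 1 in case theta is large
    pi_f = min(1.0, 0.2 * np.exp(-theta * (x - 2)))
    # ASSUMPTION: LogNormal(beta*X, 1) is parameterized by log-mean and log-sd
    t_f = rng.lognormal(mean=beta * x, sigma=1.0) if rng.random() < pi_f else np.inf
    t_r = rng.uniform(0, 7) if rng.random() < 0.1 else np.inf
    t = min(t_c, t_f, t_r)
    # Hypothetical coding: 0 = censored, 1 = failure of interest, 2 = competing event
    status = 0 if t == t_c else (1 if t == t_f else 2)
    return x, t, status

def simulate(n, theta, beta, seed=0):
    """Repeat the recipe n times; returns arrays X, T, status."""
    rng = np.random.default_rng(seed)
    rows = [simulate_subject(rng, theta, beta) for _ in range(n)]
    x, t, status = map(np.array, zip(*rows))
    return x, t, status

x_sim, t_sim, status_sim = simulate(n=2000, theta=0.8, beta=1.2, seed=0)
```

A full power study would repeat `simulate` many times and apply each test to every replicate; that part is not sketched here.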

A Simulation Study
• A model mimicking the SNP example

Each cell: Pwr est. (s.e.)

Significance level 0.05
                  CPT                FG                Jung              Gray
Null θ=0, β=0     0.0486 (0.00215)   0.1953 (0.0040)   0 (0)             0.0711 (0.0007)
θ=0, β=0.5        0.09 (0.0028)      0.17 (0.0038)     0 (0)             0.096 (0.0029)
θ=0.5, β=1.2      0.453 (0.0050)     0.502 (0.0050)    0 (0)             0.277 (0.0045)
θ=0.8, β=0        0.67 (0.0047)      0.802 (0.0040)    0.004 (0.0006)    0.807 (0.0039)
θ=0.8, β=1.2      0.981 (0.0014)     0.99 (0.0010)     0.034 (0.0018)    0.967 (0.0018)

Significance level 0.01
                  CPT                FG                Jung              Gray
Null θ=0, β=0     0.0199 (0.0014)    0.1658 (0.0037)   0 (0)             0.0281 (0.0003)
θ=0, β=0.5        0.007 (0.0008)     0.069 (0.0025)    0 (0)             0.016 (0.0012)
θ=0.5, β=1.2      0.04 (0.0020)      0.189 (0.0039)    0 (0)             0.05 (0.0022)
θ=0.8, β=0        0.378 (0.0048)     0.495 (0.0050)    0 (0)             0.565 (0.0050)
θ=0.8, β=1.2      0.875 (0.0033)     0.867 (0.0034)    0.003 (0.0005)    0.869 (0.0034)

A Simulation Study

Exact Proportional Hazard, continuous predictor

Each cell: estimate (s.e.)

          Significance level 0.01                                Significance level 0.005
          CPT               Cox               Jung               CPT               Cox               Jung
Null      0.0088 (0.0009)   0.0112 (0.0011)   0.0096 (0.0010)    0.0047 (0.0007)   0.0067 (0.0008)   0.0053 (0.0007)
β=0.5     0.093 (0.0029)    0.173 (0.0038)    0.117 (0.0032)     0.066 (0.0025)    0.123 (0.0033)    0.082 (0.0027)
β=0.8     0.359 (0.0048)    0.617 (0.0049)    0.518 (0.0050)     0.268 (0.0044)    0.524 (0.0050)    0.436 (0.0050)

A Simulation Study

Exact Proportional Hazard, continuous predictor

          Significance level 0.01                          Significance level 0.005
          CPT             Cox             Jung             CPT             Cox             Jung
β=0.5     n=300: 0.149    n=200: 0.173    n=200: 0.117     n=300: 0.094    n=200: 0.123    n=200: 0.082
          n=400: 0.181                                     n=400: 0.127
β=0.8     n=300: 0.537    n=200: 0.617    n=200: 0.518     n=300: 0.449    n=200: 0.524    n=200: 0.436
          n=400: 0.730                                     n=400: 0.655

A Simulation Study

Continuous predictor, deviation from proportional hazard

Each cell: estimate (s.e.)

                Significance level 0.01                                Significance level 0.005
                CPT               FG                Jung               CPT               FG                Jung
β1=0, β=0       0.0082 (0.0009)   0.0159 (0.0012)   0.0124 (0.0011)    0.0039 (0.0006)   0.0096 (0.0010)   0.0052 (0.0007)
β1=1, β=0       0.193 (0.0039)    0.019 (0.0014)    0.027 (0.0016)     0.13 (0.0033)     0.011 (0.0010)    0.016 (0.0012)
β1=2, β=0       0.361 (0.0048)    0.031 (0.0017)    0.045 (0.0021)     0.254 (0.0044)    0.022 (0.0015)    0.025 (0.0016)
β1=3, β=0       0.399 (0.0049)    0.054 (0.0022)    0.076 (0.0026)     0.302 (0.0046)    0.033 (0.0018)    0.046 (0.0021)
β=0.6, β1=0     0.325 (0.0047)    0.236 (0.0042)    0.201 (0.0040)     0.231 (0.0042)    0.165 (0.0037)    0.125 (0.0033)
β=1.2, β1=0     0.631 (0.0048)    0.698 (0.0046)    0.596 (0.0049)     0.502 (0.0050)    0.587 (0.0049)    0.488 (0.0050)

A Simulation Study

Ordinal predictor, deviation from proportional hazard

Each cell: estimate (s.e.)

Significance level 0.01
                  CPT               FG                Jung              Gray
θ=0, β=0.6        0.172 (0.0038)    0.057 (0.0023)    0.025 (0.0016)    0.061 (0.0024)
θ=0, β=1.2        0.551 (0.0050)    0.36 (0.0048)     0.211 (0.0041)    0.273 (0.0044)
θ=0.25, β=0       0.059 (0.0024)    0.099 (0.0030)    0.057 (0.0023)    0.075 (0.0026)
θ=0.25, β=1.2     0.862 (0.0034)    0.708 (0.0045)    0.569 (0.0050)    0.621 (0.0048)
θ=0.5, β=0        0.314 (0.0046)    0.525 (0.0050)    0.375 (0.0048)    0.441 (0.0050)
θ=0.5, β=0.6      0.895 (0.0031)    0.831 (0.0037)    0.713 (0.0045)    0.782 (0.0041)

Significance level 0.005
                  CPT               FG                Jung              Gray
θ=0, β=0.6        0.121 (0.0033)    0.043 (0.0020)    0.016 (0.0012)    0.041 (0.0020)
θ=0, β=1.2        0.471 (0.0050)    0.286 (0.0045)    0.142 (0.0035)    0.214 (0.0041)
θ=0.25, β=0       0.037 (0.0020)    0.076 (0.0027)    0.039 (0.0019)    0.061 (0.0024)
θ=0.25, β=1.2     0.801 (0.0040)    0.632 (0.0048)    0.464 (0.0050)    0.538 (0.0050)
θ=0.5, β=0        0.236 (0.0042)    0.452 (0.0050)    0.288 (0.0045)    0.355 (0.0048)
θ=0.5, β=0.6      0.847 (0.0036)    0.76 (0.0043)     0.618 (0.0049)    0.712 (0.0045)

Null θ=0, β=0 (only six entries appear for this row, listed in their original order): 0.0088 (0.0009), 0.0109 (0.0010), 0.005 (0.0007), 0.0044 (0.0006), 0.0063 (0.0008), 0.0019 (0.0004)

Opposite scenario of the SNP example

AA AB BB

Example: Germline SNPs on Chr 9 and risk of relapse in childhood Acute Lymphoblastic Leukemia (ALL)

21,909 SNPs on Chr 9 obtained by Affy 100K and 500K SNP arrays were tested for association with relapse of childhood ALL

[Diagram: transitions from “Alive” to three failure types]
• Relapse: failure type 1 (of interest)
• 2nd cancer: failure type 2 (competing risk/event)
• Die in remission: failure type 3 (competing risk)

Example: Germline SNPs on Chr 9 and risk of relapse in childhood Acute Lymphoblastic Leukemia (ALL)

• n=707 subjects from the two most recent clinical trials at SJCRH

• 21,909 SNPs

• CPT test performed on each SNP, with 200 permutations in the hybrid permutation test

• Significance determined by the profile info criteria Ip (Cheng et al. 2004); 200 SNPs were considered statistically significant, estimated FDR=48.7%

[Figure: two panels plotted against P (0 to 1): cdf (0 to 1) and pdf (0 to 10). pi0 = 0.9535]

[Figure: Ip plotted against alpha (0 to 0.10). pi0 = 0.9535, alph.opt = 0.004815, FDR = 0.4869, min Ip = 160.745]


SNP Pval.CPT Annotation

SNP_A-4216803 6.44E-06 C9orf82.downstream.461318.AFFY

SNP_A-2142223 7.87E-05 TMC1.upstream.48246.AFFY//ZFAND5.upstream.108579.AFFY

SNP_A-1878719 8.80E-05 PTPRD.In_gene.5000.5kRuleLD

SNP_A-4201296 0.000126717 DBC1.In_gene.5000.5kRuleLD

SNP_A-4254975 0.000131144 JMJD2C.downstream.206626.AFFY

SNP_A-2289668 0.000138093 GLIS3.In_gene.5000.5kRuleLD

SNP_A-1847956 0.000140329 RFX3.downstream.268660.AFFY

SNP_A-1995935 0.000169608 C9orf150.downstream.49950.AFFY

SNP_A-1996276 0.000182668 ELAVL2.In_gene.5000.5kRuleLD

SNP_A-2202030 0.000216207 C9orf93.downstream.130169.AFFY

SNP_A-2100956 0.000254439 C9orf82.downstream.513396.AFFY

SNP_A-2098514 0.000337669 BNC2.In_gene.5000.5kRuleLD

SNP_A-2228460 0.000398633 GLIS3.upstream.6282.AFFY

SNP_A-4252517 0.00040459 TUSC1.downstream.286633.AFFY

SNP_A-1917590 0.000430395 PTPRD.In_gene.5000.5kRuleLD

SNP_A-2052752 0.000432044 DMRT1.In_gene.5000.5kRuleLD

SNP_A-2061098 0.000448184 C9orf94.In_gene.5000.5kRuleLD

SNP_A-1786517 0.000478508 GSN.In_gene.5000.5kRuleLD

SNP_A-2304920 0.000575635 UBE2R2.In_gene.5000.5kRuleLD

SNP_A-1902372 0.00057778 GSN.In_gene.5000.5kRuleLD

SNP_A-2201300 0.000588075 ABL1.In_gene.5000.5kRuleLD

SNP_A-2238268 0.000592182 PCSK5.upstream.10183.AFFY

SNP_A-1830183 0.000613767 TUSC1.downstream.181857.AFFY

[Figure: correlation profile ρ^(tj), j=1, …, J=9; axes: time (1 to 5) and Corr (-1.0 to 1.0). Test stat = -3.478]

[Figure: Probability (0 to 1) against Years (0 to 14), with curves for AA, AB, BB and Overall. An at-risk table is listed for AA, AB, BB and Overall; only the Overall row survives: 707 676 646 546 454 366 295]

AA 5.1%, AB 28.7%, BB 66.2%

P values: Gray’s test 0.0451; Fine-Gray regression 0.0380, coeff=-0.3905

ABL1 Gene Germline SNP

AA            AB             BB             Tot
36 (0.051)    201 (0.287)    464 (0.662)    701 (1.00)

A    273 (0.195)
B    1129 (0.805)

                               AA    AB    BB
T13B intermediate/high risk    12    27    75    (0.152)
T13B Low risk                   7    33    67    (0.065)
T15 standard/high risk         11    74   161    (0.047)
T15 Low risk                    6    67   161    (0.026)
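The allele counts on this slide follow from the genotype counts by simple counting: each AA subject carries two A alleles, each AB subject one A and one B, and each BB subject two B alleles. A short sketch of that arithmetic (the function name is hypothetical):

```python
def allele_counts(n_aa, n_ab, n_bb):
    """Count A and B alleles from genotype counts (two alleles per subject)."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    return n_a, n_b

# Genotype counts from the slide: AA = 36, AB = 201, BB = 464
n_a, n_b = allele_counts(36, 201, 464)
freq_a = n_a / (n_a + n_b)
freq_b = n_b / (n_a + n_b)
```

This reproduces the slide's 273 A alleles and 1129 B alleles out of 2 × 701 = 1402, with frequencies that round to 0.195 and 0.805.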

Extension to Recurrent Events
• Model, null and alternative hypotheses

Multiple event times

# events occurred ≤ t


Extension to Recurrent Events

N = # events occurred ≤ tN

Summary and Remarks
• Correlation Profile Test:
– Computationally more robust
– More omnibus: covers certain deviations from the semi-parametric hazard regression model
– Highly competitive with other non-parametric procedures (Gray’s test, Jung’s test)
– Relative deficiency vs. Cox model under PH ??
– Extension to recurrent-event phenotypes
– Informative censoring in the presence of competing risk


Feature Extraction (Sparse regression)

• Identify (linear) combinations of covariate variables that are associated with the failure phenotype

Feature Extraction (Sparse regression)

• Sparse regression by the General Path Seeking (GPS) algorithm (Friedman 2008)
• Exploratory failure time analysis by weighted least squares -- the association criteria
• The modified GPS algorithm to find a solution
• A small simulation study
• Example: Gene (RNA) expression “prognostic” for relapse of childhood ALL

Sparse Regression by General Path Seeking (GPS, Friedman 2008)
http://www-stat.stanford.edu/~jhf//ftp/GPSpub.pdf

General Setup

$Y \mid X \sim F(X; \beta)$, $\beta \in R^m$, $m$ is large

$\mathrm{Loss}(\beta) = L(Y, F(X; \beta))$

$R(\beta) = E_{X,Y}(\mathrm{Loss}(\beta))$

Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$

$\hat{R}(\beta) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(X_i; \beta))$

$\hat{R}(\beta; \lambda) = \hat{R}(\beta) + \lambda P(\beta) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(X_i; \beta)) + \lambda P(\beta)$

Lasso (Tibshirani 1996), grouped lasso (Yuan and Lin 2006), SCAD (Fan and Li 2001)
Elastic net (Zou and Hastie 2005)
SEAL (Xihong Lin, 2009 JSM)

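Feature Extraction (Sparse regression)
• The general GPS algorithm

$\hat{R}(\beta; \lambda) = \hat{R}(\beta) + \lambda P(\beta) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(X_i; \beta)) + \lambda P(\beta)$, $\beta \in R^m$

Assume: convex and differentiable loss, and $\partial P(\beta) / \partial |\beta_j| > 0$, $j = 1, \ldots, m$

$g_j = -\partial \hat{R}(\beta) / \partial \beta_j$, $p_j = \partial P(\beta) / \partial |\beta_j|$, $\lambda_j = g_j / p_j$, $j = 1, \ldots, m$

Initialize: $\beta_j = 0$, $j = 1, \ldots, m$
REPEAT
  Compute $g_j$, $\lambda_j$; set $S = \{j : \lambda_j \beta_j < 0\}$
  IF $S = \emptyset$: $j^* = \arg\max_j |\lambda_j|$ ELSE $j^* = \arg\max_{j \in S} |\lambda_j|$
  $\beta_{j^*} \leftarrow \beta_{j^*} + \Delta\nu \cdot \mathrm{sign}(\lambda_{j^*})$
UNTIL $\lambda_j = 0$, $j = 1, \ldots, m$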
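A rough sketch of the general GPS loop on this slide, under assumptions that are not from the lecture: the loss is squared error, (1/2n) Σ (y_i − x_i'β)², the penalty is the lasso Σ|β_j| (so p_j = 1 and λ_j = g_j), and the loop stops after a fixed number of steps rather than waiting for all λ_j = 0. The function name and the simulated data are hypothetical. This is not the weighted least squares criterion or the modified algorithm used later in the lecture.

```python
import numpy as np

def gps_path(X, y, step=0.01, n_steps=200):
    """Fixed-step path seeking for squared-error loss with a lasso penalty.
    Each iteration updates exactly one coefficient by +/- step."""
    n, m = X.shape
    beta = np.zeros(m)                     # initialize all coefficients at zero
    for _ in range(n_steps):
        resid = y - X @ beta
        g = X.T @ resid / n                # g_j = -dR/dbeta_j for (1/2n) * sum of squares
        p = np.ones(m)                     # p_j = dP/d|beta_j| = 1 for the lasso
        lam = g / p
        if np.all(lam == 0):
            break
        S = np.flatnonzero(lam * beta < 0) # coefficients whose sign opposes lambda_j
        if S.size == 0:
            j_star = int(np.argmax(np.abs(lam)))
        else:
            j_star = int(S[np.argmax(np.abs(lam[S]))])
        beta[j_star] += step * np.sign(lam[j_star])
    return beta

# Hypothetical data: only the first of 10 predictors affects the response
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 10))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * rng.standard_normal(200)
beta_hat = gps_path(X_demo, y_demo, step=0.01, n_steps=200)
```

Because each step moves one coefficient by a fixed amount, the total path length after 200 steps of size 0.01 is at most 2, so the number of steps acts as the amount of regularization; stopping early leaves most coefficients exactly at zero.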

Feature Extraction (Sparse regression)
• Exploratory failure time analysis: setup

Feature Extraction (Sparse regression)
• Association criteria: Penalized weighted least squares

Feature Extraction (Sparse regression)
• The power penalty function |β|^γ, 0<γ≤1

[Figure: |β|^γ plotted for β from -0.04 to 0.04 (vertical axis 0 to 1), with curves for γ = 1, γ = 0.5 and γ = 0.0001]

Feature Extraction (Sparse regression)
• The modified GPS algorithm

$\hat{R}(\beta; \lambda) = \hat{R}(\beta) + \lambda P(\beta) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(X_i; \beta)) + \lambda P(\beta)$, $\beta \in R^m$

Assume: convex and differentiable loss, and $\partial P(\beta) / \partial |\beta_j| > 0$, $j = 1, \ldots, m$

$g_j = -\partial \hat{R}(\beta) / \partial \beta_j$, $p_j = \partial P(\beta) / \partial |\beta_j|$, $\lambda_j = g_j / p_j$, $j = 1, \ldots, m$

Initialize: $\beta_j = 0$, $j = 1, \ldots, m$
REPEAT
  Compute $g_j$, $\lambda_j$; set $S = \{1, 2, \ldots, m\}$
  Set $j^* = \arg\max_{j \in S} |\lambda_j|$
  $\beta_{j^*} \leftarrow \beta_{j^*} + \Delta\nu \cdot \mathrm{sign}(\lambda_{j^*})$
  IF $j^*$ is the same as in the previous iteration AND $\mathrm{sign}(g_{j^*})$ changed THEN remove $j^*$ from $S$
UNTIL $\mathrm{abs}(\|g\| - \|g_{pre}\|) \le$ specified value OR max # of iterations

Feature Extraction (Sparse regression)
• Gradient descent with a fixed step size; searches for the solution along a sequence of increasing values of the penalty parameter λ, thus no need to use CV or GCV type criteria to determine λ.
• Initial value: all β’s are set to zero
• Each iteration modifies just one of the m dimensions; the criterion for choosing which dimension to update involves the gradient of the association criterion and the penalty function
• Need to modify for this particular application
– Relatively large step size Δν: 0.01 (sometimes)
– Stopping rule: stop if the size of the gradient vector does not change by more than Δν from the previous iteration, or the pre-specified max number of iterations is reached

Feature Extraction (Sparse regression)
• A small simulation study

Simulation model: Proportional hazard

Feature Extraction (Sparse regression)
• A small simulation study

Performance assessment

Characteristic of Solution       Freq.    %
X1 & X2 no false + (perfect)     27       5.4
X1 & X2 w/ false +               40       8
X2 only no false +               223      44.6
X2 only w/ false +               199      39.8
X1 only no false +               2        0.4
X1 only w/ false +               1        0.2
None (all false +, worst)        8        1.6
TOTAL                            500      100

Feature Extraction (Sparse regression)
• A small simulation study

Performance assessment

#non-zero    Freq. (%)     X1-only    X2-only    Both    none
1            230 (46)      2          223        0       5
2            132 (26.4)    0          103        27      2
3            70 (14)       1          54         14      1
4            31 (6.2)      0          21         10      0
5            19 (3.8)      0          10         9       0
6            10 (2)        0          5          5       0
7            7 (1.4)       0          5          2       0
8            1 (0.2)       0          1          0       0

Feature Extraction (Sparse regression)
• A small simulation study

Performance assessment
R = # of non-zeros in solution
V = # of incorrect non-zeros in solution
FDR = E(V/R | R>0)

Estimated FDR = 0.2984
s.e. = 0.0142

99% CI (0.2618, 0.3350)
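The reported interval is consistent with a normal-approximation confidence interval, estimate ± z × s.e. with z the 99.5th percentile of the standard normal. The sketch below checks that arithmetic; it assumes the interval was formed this way, which the slide does not state, and the function name is hypothetical.

```python
from statistics import NormalDist

def normal_ci(estimate, se, level):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 2.576 for level = 0.99
    return estimate - z * se, estimate + z * se

# Values from the slide: estimated FDR and its standard error
ci_low, ci_high = normal_ci(0.2984, 0.0142, 0.99)
```

Rounded to four decimals, the limits agree with the slide's (0.2618, 0.3350).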

Feature Extraction (Sparse regression)
• An Example: Gene (RNA) expression in ALL and risk of relapse
• Affymetrix U133A GeneChip
• n=287 Arrays (subjects)
• m=22,278 Probesets
• Two clinical variables: Age group at Dx, lineage
• Intercept term
• Total number of variables = m+3

Feature Extraction (Sparse regression)

Example (Cont.)
• Run parameters:
– step size = 6×10^-4, γ = 0.01
• Initial values:
– Coeff. of Age = 6×10^-4
– Coeff. of Lineage = 6×10^-4
– All others set to 10^-8


• Top meaningful findings

Variable Coeff Gene

Age.DX 0.183

Lineage 0.1302

212869_x_at 0.003 TPT1; (similar to) tumor protein; translationally controlled

201288_at 0.0006 RhoGDI2; plays a role in apoptosis; may be a marker for tumor progression in gastric and breast cancer; literature not consistent

Summary and Remarks
• The sparse regression approach to exploratory survival analysis
– Step size of descent is crucial
– Newton-type descent more adaptive, maybe better, how to incorporate?
– Stopping criteria: change in gradient vector? size of the gradient vector?
– Other association criteria: -log likelihood by Logistic, Probit, Poisson etc. links
– “Oracle” property of the solution? “Accuracy” of the solution? in asymptopia

III. Copy Number Variation Inference

• Cell division
• DNA Copy Number Variation (CNV)
• Use SNP array signals to infer CNV
• Reference signal alignment: example
• Reference signal alignment procedure
• Recent development
• Examples

Cell division: How cells grow and divide

• Mitosis: The process in somatic cell division by which the nucleus divides

• Meiosis: The process of cell division in sexually reproducing organisms that reduces the number of chromosomes in reproductive cells from diploid to haploid, leading to the production of gametes in animals and spores in plants

Mitosis: Prophase
• Prophase is the first stage of cell division, in which the cell prepares itself for division. The nucleus swells, and chromosomes become visible.

• Each chromosome has two chromatids as a result of duplication of the DNA which took place during interphase. The two chromatids are linked together at a centromere.

• The centrosome (2 centrioles) duplicates into 2 diplosomes, and each diplosome, or aster moves toward opposite poles of the nucleus.

Mitosis: Metaphase
• Microtubules assemble, and form a network (the spindle fibres).

• The chromosomes move towards the equator of the cell, where they are visible.

• This is the phase in which morphological studies of chromosomes are carried out, often for clinical purposes.

Page 65

Mitosis: Anaphase

• The two sister chromatids separate.

• The chromatids migrate to opposite ends of the cell, so each daughter cell has an identical complement of chromosomes.

• The nuclear membrane has disappeared at this stage. The cell membrane expands as the cell itself elongates.

• The diameter of the cell decreases at the equator.

Page 66

Mitosis: Telophase

• A new membrane forms around the new nuclei and two cells are quickly formed.

• The chromatid, now called a chromosome, uncoils, and the nucleolus becomes visible again.

• Each cell contains a diploid set of paired chromosomes (2n chromosomes)

Page 67

Meiosis

• The process of meiosis essentially involves two cycles of division: a gamete mother cell (diploid cell) divides and then divides again to form 4 haploid cells. Each division can be subdivided into four distinct phases, which form a continuous process.

• 1st Division

– Prophase - Homologous chromosomes in the nucleus begin to pair up with one another and then split into chromatids (one half of a chromosome), where cross over can occur. Cross over can increase genetic variation.

– Metaphase - Chromosomes line up at the equator of the cell in a random order, increasing genetic variation via independent assortment.

– Anaphase - The homologous chromosomes move from the equator to opposing poles.

– Telophase - A new nucleus forms near each pole around its new chromosome complement.

– At this stage two haploid cells have been created from the original diploid cell of the parent.

• 2nd Division

– Prophase II - The nuclear membrane disappears and the second meiotic division is initiated.

– Metaphase II - Pairs of chromatids line up at the equator.

– Anaphase II - The chromatids of each pair move away from the equator to the poles via spindle fibres.

– Telophase II - Four new haploid gametes are created, which will fuse with the gametes of the opposite sex to create a zygote.

Page 68

Meiosis

Page 69

Meiosis vs. Mitosis

MITOSIS (in somatic cells)

• One single division of the mother cell (m) results in two daughter cells (d)

• The number of chromosomes per nucleus remains the same after division

• Chromosomal re-distribution

MEIOSIS (in reproductive cells)

• Two divisions of the mother cell result in four meiotic products (p)

• The meiotic products contain a haploid (n) number of chromosomes, in contrast to the 2n mother cell

• Chromosomal change (cross over) and re-distribution

Gain/duplication or loss of DNA can occur in either process, resulting in deviations from the normal 2-copy state of whole chromosomes or of segments on chromosomes.

Page 70

Oncogenesis

• DNA gain: Excess of genes promoting cell division and proliferation

• DNA loss: Loss of gene functions regulating cell cycles, such as signaling apoptosis.

• DNA loss: Loss of functions necessary for proper lineage differentiation

Page 71

Karyotyping and Complex CNV patterns in tumor genomes

• An assay technology to assess gains/losses of DNA; now routinely performed at diagnosis of childhood leukemia; not so readily available for solid tumors

67<3n>,XXY,-3,+8,-9,-16,-17,+20/66,idem,del(X)(p22.1),-8,del(10)(q22q26),-20,+mar

Chromosome (copies): 1-2 (3), 3 (2), 4-7 (3), 8 (4), 9 (2), 10-15 (3), 16-17 (2), 18-19 (3), 20 (4), 21-22 (3), X (2)

Chromosome count: 6, 2, 12, 4, 2, 18, 4, 6, 4, 6, 2; total = 66; +Y = 67
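The chromosome tally on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `chromosome_count` and the compact `chromosome(copies)` string format are assumptions made only for this illustration.

```python
import re

def chromosome_count(spec: str) -> int:
    """Sum copies over chromosomes for a string such as '1-2(3)3(2)X(2)'.

    Hypothetical helper for illustration; ranges are assumed to be numeric.
    """
    total = 0
    # Each token is a chromosome (or numeric range) followed by "(copies)".
    for first, last, copies in re.findall(r"(\d+|X|Y)(?:-(\d+))?\((\d+)\)", spec):
        n_chrom = int(last) - int(first) + 1 if last else 1
        total += n_chrom * int(copies)
    return total

spec = "1-2(3)3(2)4-7(3)8(4)9(2)10-15(3)16-17(2)18-19(3)20(4)21-22(3)X(2)"
without_y = chromosome_count(spec)
print(without_y, without_y + 1)  # 66 without Y, 67 with the single Y
```

A normal diploid male or female count can be checked the same way, for example `chromosome_count("1-22(2)X(2)")` for 46,XX.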

Page 72

Contemporary technology: Array Comparative Genomic Hybridization (aCGH)

Page 73

Use SNP array signals to infer CNV

• Goal: Infer loss/gain -- qualitative

Page 74

Use SNP array signals to infer CNV

• Importance of normalization

• Reason for single-array reference alignment: example from paper

Page 75

Use SNP array signals to infer CNV

• Motivation/Reason for single-array reference alignment

Page 76

Use SNP array signals to infer CNV

• Basic algorithm (Pounds et al. 2009)

1. Select a chromosome that is most likely in the 2-copy (diploid) state – make an educated guess

2. Use the empirical distribution (EDF) of the signals of the markers on this chromosome to transform all marker signals into the unit interval (0,1) via the probability-integral (quantile) transformation

3. Map the above transformed data into a known, convenient target distribution (e.g., N(0,1)); this produces the reference-aligned signals

4. Perform CNV segmentation using the above reference-aligned signals.

Note that after Step 3 the empirical distribution of the reference markers is essentially the same as the target distribution (N(0,1)).
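Steps 2 and 3 of the basic algorithm can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of Pounds et al. (2009): the function name `reference_align`, the toy signal values, and the `(rank - 0.5) / n` rescaling with clipping (used to keep values strictly inside (0,1)) are all choices made for this sketch.

```python
from bisect import bisect_right
from statistics import NormalDist

def reference_align(signals, reference_signals):
    """Map marker signals onto the N(0,1) scale via the reference EDF.

    Hypothetical helper: reference_signals are the signals of the markers
    on the chromosome assumed to be in the 2-copy state (Step 1).
    """
    ref = sorted(reference_signals)
    n = len(ref)
    std_normal = NormalDist()  # target distribution N(0,1)
    aligned = []
    for x in signals:
        # Step 2: probability-integral transform with the reference EDF.
        # The continuity correction and clipping are assumptions of this sketch.
        u = (bisect_right(ref, x) - 0.5) / n
        u = min(max(u, 0.5 / n), 1 - 0.5 / n)
        # Step 3: quantile function of the target distribution.
        aligned.append(std_normal.inv_cdf(u))
    return aligned

reference = [0.90, 0.95, 1.00, 1.05, 1.10]   # toy 2-copy signals
all_markers = [0.50, 1.00, 1.60]             # toy loss-like, normal, gain-like
print(reference_align(all_markers, reference))
```

In this toy input, the marker below the whole reference range maps to a negative aligned value, the reference median maps to about zero, and the marker above the range maps to a positive value, which is the scale on which Step 4 would then segment.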

Page 77

HOW TO SELECT THE REFERENCE CHROMOSOME? Pounds et al. (2009)

• Cytonormalization

– Utilize karyotype data: select a chromosome not implied as abnormal by karyotyping

• Algorithmic selection

– Select a chromosome that appears most likely to be in the diploid state based on a set of statistics, such as the percentage of heterozygous calls and the joint behavior of signal mean and standard deviation (details in paper).
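The per-chromosome statistics named above can be tabulated as in the sketch below. It only computes the summaries (percentage of heterozygous calls, signal mean, signal standard deviation); the actual selection rule is described in the paper and is not reproduced here. The function name `chromosome_summaries`, the tuple layout, and the toy data are assumptions for illustration.

```python
from statistics import mean, stdev

def chromosome_summaries(markers):
    """Summarize markers given as (chromosome, genotype_call, signal) tuples.

    Hypothetical helper; 'AB' is taken to denote a heterozygous call.
    """
    by_chrom = {}
    for chrom, call, signal in markers:
        by_chrom.setdefault(chrom, []).append((call, signal))
    summaries = {}
    for chrom, rows in by_chrom.items():
        signals = [s for _, s in rows]
        n_het = sum(1 for call, _ in rows if call == "AB")
        summaries[chrom] = {
            "pct_het": 100.0 * n_het / len(rows),
            "mean": mean(signals),
            # stdev needs at least two values
            "sd": stdev(signals) if len(signals) > 1 else 0.0,
        }
    return summaries

toy = [
    ("1", "AB", 1.0), ("1", "AA", 1.1), ("1", "AB", 0.9), ("1", "BB", 1.0),
    ("9", "AA", 0.5), ("9", "BB", 0.6), ("9", "AA", 0.4), ("9", "BB", 0.5),
]
for chrom, stats in chromosome_summaries(toy).items():
    print(chrom, stats)
```

In the toy data, chromosome "9" has no heterozygous calls and a lower mean signal, the kind of pattern that would argue against using it as the 2-copy reference.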

Page 78

Use SNP array signals to infer CNV

New Development

• Affymetrix SNP6 array: Genotype (SNP) and CNV probesets – two types of signals; they don’t always follow the same distribution

• The auto selection of reference chromosome fails on cases with complex CNV patterns

• Modified algorithm: more flexible

– Marker (instead of chrom.) based reference

– Initial CNV inference, adjustment, final CNV inference

Page 79

Use SNP array signals to infer CNV

• Modified algorithm: Work in progress …

1. Select all SNP markers with heterozygous calls (‘AB’) as reference markers; use the empirical distribution of these marker signals as the initial reference distribution

2. Map all the SNP marker signals into (0,1) by probability-integral transformation using the above distribution

3. Map the above transformed signals to a target distribution – take N(0,1), to produce initially reference-aligned signals

4. Map the CNV marker signals to have the same distribution as the SNP markers via a quantile transformation

5. Perform initial CNV segmentation – windowed t test + run-length encoding

6. Assess each autosome and each “large” (>20 SNP markers) inferred CNV segment to identify problems

7. Correct the problems by adjusting the initial reference-aligned signals (step 3) chromosome by chromosome

8. Perform final CNV segmentation using corrected signals – windowed t test plus run-length encoding.
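The slides do not give the details of the "windowed t test + run-length encoding" segmentation in Steps 5 and 8, so the following is only one plausible reading, offered as a sketch. It assumes a one-sample t statistic of each sliding window's mean against 0 (the center of the N(0,1) reference scale), a fixed call threshold, and run-length encoding of the resulting per-marker calls. The names `window_calls` and `run_length_encode` and the parameters `half_width` and `threshold` are hypothetical.

```python
from itertools import groupby
from math import copysign, inf, sqrt
from statistics import mean, stdev

def window_calls(aligned, half_width=2, threshold=3.0):
    """Call each marker 'loss', 'gain' or '2cp' from a windowed t statistic."""
    calls = []
    for i in range(len(aligned)):
        window = aligned[max(0, i - half_width): i + half_width + 1]
        m = mean(window)
        if len(window) < 2:
            t = 0.0  # too few markers to test
        else:
            se = stdev(window) / sqrt(len(window))
            if se > 0:
                t = m / se
            else:
                t = 0.0 if m == 0 else copysign(inf, m)
        calls.append("loss" if t < -threshold else "gain" if t > threshold else "2cp")
    return calls

def run_length_encode(calls):
    """Collapse consecutive identical calls into (state, start, end) segments."""
    segments, start = [], 0
    for state, run in groupby(calls):
        length = len(list(run))
        segments.append((state, start, start + length - 1))
        start += length
    return segments

aligned = [0.1, -0.2, 0.05, -0.1, 0.15, -3.0, -3.1, -2.9, -3.05, -2.95, -3.0]
print(run_length_encode(window_calls(aligned)))
```

Each marker's call depends only on its own window, and the encoding step turns the per-marker calls into contiguous segments whose lengths can then be screened, for example against the ">20 SNP markers" criterion in Step 6.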

Page 80

Male, Hypodiploid (<46 chromosomes) ALL and germline samples: Steps 3, 4, 5, 6

[Figure: Tumor and Germline panels; annotations: initial reference; overall 2cp, initially inferred loss; t(9) initial reference]

Page 81

Use SNP array signals to infer CNV

New Development

• Examples:

– Two ALL cases: one hypodiploid (<46 chr’s) with matched germline, one hyperdiploid (triploid, 66 chr’s) with matched germline and relapse samples

Page 82

Male, Hypodiploid ALL and germline samples

Probeset signal: mean of probes, directly from the .CEL file

Page 83

Male, Hypodip ALL and matched germline

[Figure annotations: %Het=26%; %Het=0.16%; %Het=29%]

Page 84

[Figure panels: Germline; Tumor]

Page 85

Male, Hypodiploid ALL and germline samples

[Figure: Tumor and Germline panels; annotations: initial reference; overall 2cp, initially inferred loss; t(9) initial reference]

Page 86

Male, Triploid ALL, DNA index 1.46

67<3n>,XXY,-3,+8,-9,-16,-17,+20/66,idem,del(X)(p22.1),-8,del(10)(q22q26),-20,+mar

Chromosome (copies): 1-2 (3), 3 (2), 4-7 (3), 8 (4), 9 (2), 10-15 (3), 16-17 (2), 18-19 (3), 20 (4), 21-22 (3), X (2)

Chromosome count: 6, 2, 12, 4, 2, 18, 4, 6, 4, 6, 2; total = 66; +Y = 67

Matched germline sample, relapse sample

Probeset signal: generated by Affy MAS 5.0 (?) package

Page 87

67<3n>,XXY,-3,+8,-9,-16,-17,+20/66,idem,del(X)(p22.1),-8,del(10)(q22q26),-20,+mar

Chromosome (copies): 1-2 (3), 3 (2), 4-7 (3), 8 (4), 9 (2), 10-15 (3), 16-17 (2), 18-19 (3), 20 (4), 21-22 (3), X (2)

Chromosome count: 6, 2, 12, 4, 2, 18, 4, 6, 4, 6, 2; total = 66; +Y = 67

Page 88
Page 89

[Figure: Relapsed Tumor and Germline panels; annotations: initial reference; overall 2cp, initially inferred loss; t(9) initial reference]

Page 90

Use SNP array signals to infer CNV

New Development

• Extension to NextGen sequence data

– Use sequence coverage counts as raw signals

– Preprocessing: Adjust signals for reference genome and sequence features that affect the depth of coverage (“mapability”, GC content, etc.)

– The new alignment and segmentation algorithms consist of only single-marker based computation, so it is straightforward to implement divide-and-conquer strategies and parallelization on multiprocessor or CPU cluster systems
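Because the alignment is computed marker by marker, the signals can be split into chunks, processed independently, and concatenated. The sketch below illustrates that property with a thread pool from the Python standard library; on a real multiprocessor or cluster system a process pool or job scheduler would take its place. The function names, the chunk size, and the toy coverage values are assumptions, and the preprocessing adjustments (mappability, GC content) are not shown.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from statistics import NormalDist

def align_chunk(chunk, sorted_ref):
    """Reference-align one chunk; each marker is handled independently."""
    n = len(sorted_ref)
    out = []
    for x in chunk:
        u = (bisect_right(sorted_ref, x) - 0.5) / n
        u = min(max(u, 0.5 / n), 1 - 0.5 / n)  # keep u strictly inside (0,1)
        out.append(NormalDist().inv_cdf(u))
    return out

def align_in_chunks(signals, reference, chunk_size=1000, workers=4):
    """Divide the signals into chunks, align them in parallel, and recombine."""
    sorted_ref = sorted(reference)
    chunks = [signals[i:i + chunk_size] for i in range(0, len(signals), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves chunk order, so the results concatenate correctly
        parts = list(pool.map(align_chunk, chunks, [sorted_ref] * len(chunks)))
    return [value for part in parts for value in part]

coverage = [30, 28, 31, 15, 14, 16, 29, 45, 44, 30]  # toy coverage counts
reference = [28, 29, 30, 31, 32]                     # toy 2-copy reference
print(align_in_chunks(coverage, reference, chunk_size=3))
```

The chunked result is identical to aligning all markers in one pass, which is what makes the divide-and-conquer strategy safe for this kind of computation.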

Page 91

ACKNOWLEDGEMENTS

NIH/NIGMS Pharmacogenetics Research Network and Database (U01 GM61393, U01 GM61374, http://pharmgkb.org/), National Institutes of Health

Cancer Center Support Grant P30 CA-21765, NIH

The American Lebanese and Syrian Associated Charities (ALSAC).

Stan Pounds, Deqing Pei, Xueyuan Cao – Biostatistics

Mary Relling, William Evans, Jun Yang, Wenjian Yang – Pharmaceutical Science

Ching-Hon Pui, Dario Campana – Oncology & Pathology

Charles Mullighan, James Downing -- Pathology

Geoff Neale, Yiping Fan -- Bioinformatics

Javier Rojo – My first and most favorite Math Statistics teacher

Page 92

THANK YOU !!