Fixing the leaks in the pipeline from public genomics data to the clinic

http://jhudatascience.org/

https://www.coursera.org/specialization/genomics/41

@simplystats http://simplystatistics.org

@jtleek http://www.jtleek.com

https://www.counsyl.com/

Their basic pitch was “Genomics is a fraud”

http://www.technologyreview.com/news/535771/a-contrarian-in-biotech/

“The explosive growth of next-generation sequencing data submitted into the SRA exceeds the growth rate of storage capacity”

http://www.ncbi.nlm.nih.gov/pubmed/22009675

3: cost, analyst variation, motivation

1 cost

costs: money, interpretability

http://arxiv.org/pdf/math/0606441.pdf

http://www.ncbi.nlm.nih.gov/pubmed/19276151

@leekgroup

http://www.ncbi.nlm.nih.gov/pubmed/25788628

[Figure: accuracy (0% to 100%) of the Pam Scaled, Pam Unscaled, and TSP predictors. Panels: Agilent/Grade 1, Agilent/Grade 3, Illumina/Grade 1, Illumina/Grade 3.]

http://www.ncbi.nlm.nih.gov/pubmed/25788628

algorithm
1. select useful pairs
2. screen pairs for association
3. build a simple CART predictor

http://www.ncbi.nlm.nih.gov/pubmed/19276151

Patil et al. (in prep)

Data:

$x_{ik}$ - value for feature $i$, sample $k$

$y_k$ - group indicator for sample $k$

TSP is the $(i,j)$ pair that maximizes:

$$\left| \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 1) - \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 0) \right|$$
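The definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `tsp_scores` and `top_scoring_pair` and the toy data are assumptions made for the example, and probabilities are estimated by simple within-group proportions.

```python
from itertools import combinations

def tsp_scores(x, y):
    """Score every feature pair (i, j).

    x[i][k] is the value of feature i in sample k (hypothetical layout);
    y[k] is the 0/1 group indicator for sample k.
    """
    idx1 = [k for k, g in enumerate(y) if g == 1]
    idx0 = [k for k, g in enumerate(y) if g == 0]
    scores = {}
    for i, j in combinations(range(len(x)), 2):
        # Empirical Pr(x_ik < x_jk | y_k = 1) and Pr(x_ik < x_jk | y_k = 0)
        p1 = sum(x[i][k] < x[j][k] for k in idx1) / len(idx1)
        p0 = sum(x[i][k] < x[j][k] for k in idx0) / len(idx0)
        scores[(i, j)] = abs(p1 - p0)
    return scores

def top_scoring_pair(x, y):
    """Return the (i, j) pair with the largest score."""
    scores = tsp_scores(x, y)
    return max(scores, key=scores.get)

# Toy data (made up): feature 0 sits below feature 1 only in group 1.
x = [
    [1.0, 2.0, 1.5, 5.0, 6.0, 5.5],  # feature 0
    [3.0, 4.0, 3.5, 2.0, 1.0, 2.5],  # feature 1
    [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],  # feature 2
]
y = [1, 1, 1, 0, 0, 0]
best_pair = top_scoring_pair(x, y)
```

Because only the ordering within each sample is used, rescaling a sample by any increasing function leaves the scores unchanged.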

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1989150/

$$z_{ijk} = 1(x_{ik} < x_{jk})$$

$$E[z_{ijk} \mid y_k] = a_{0ij} + a_{1ij} y_k \;\rightarrow\; \max |a_{1ij}| = \text{TSP}$$
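A short worked step may help connect the regression to the TSP score. It assumes the coefficients are estimated by ordinary least squares, which the slide does not state explicitly. With a binary group indicator, the fitted values are just the two group means of the indicator:

$$\hat{a}_{0ij} = \frac{1}{n_0} \sum_{k: y_k = 0} z_{ijk} = \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 0)$$

$$\hat{a}_{0ij} + \hat{a}_{1ij} = \frac{1}{n_1} \sum_{k: y_k = 1} z_{ijk} = \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 1)$$

$$|\hat{a}_{1ij}| = \left| \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 1) - \widehat{\Pr}(x_{ik} < x_{jk} \mid y_k = 0) \right|$$

Here $n_0$ and $n_1$ are the group sizes (notation introduced for this step only), so maximizing the absolute slope over pairs recovers the top scoring pair.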

Patil et al. (in prep)

$$E[y_k \mid z_{ijk}] = u_{0ij} + u_{1ij} z_{ijk}$$

• Not the same as TSP
• But $|\hat{a}/\text{s.e.}(\hat{a})| = |\hat{u}/\text{s.e.}(\hat{u})|$, algebraically
• “Variance regularized” TSP
• $z_{ijk}$ invariant to monotone transformations
• Fix parameters → find features
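The t-statistic identity can be checked numerically. The sketch below assumes both models are simple least-squares regressions with the usual slope standard error; the helper `slope_t_stat` and the toy vectors are hypothetical, chosen so that the two slopes differ while the t-statistics agree.

```python
import math

def slope_t_stat(pred, resp):
    """t-statistic of the slope in a least-squares fit of resp on pred."""
    n = len(pred)
    mp, mr = sum(pred) / n, sum(resp) / n
    sxx = sum((p - mp) ** 2 for p in pred)
    slope = sum((p - mp) * (r - mr) for p, r in zip(pred, resp)) / sxx
    intercept = mr - slope * mp
    rss = sum((r - intercept - slope * p) ** 2 for p, r in zip(pred, resp))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err

# Made-up data for one pair (i, j): z is the pair indicator, y the group.
z = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

t_a = slope_t_stat(y, z)  # slope a_1 in E[z | y] = a_0 + a_1 * y
t_u = slope_t_stat(z, y)  # slope u_1 in E[y | z] = u_0 + u_1 * z
```

Both statistics are functions of the sample correlation between `z` and `y` and the sample size only, which is why swapping the roles of response and predictor does not change them.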

Patil et al. (in prep)

1. Calculate t-statistic for all pairs
2. Choose top pair (or covariate)
3. Continue for a fixed number of pairs

$$E[y_k \mid z_{ijk}] = u_{0ij} + u_{1ij} z_{ijk}$$
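The three steps above can be sketched as a greedy loop. The slides do not say how later pairs account for earlier ones, so this sketch makes an assumption: after each choice it regresses the current response on the chosen indicator and carries the residuals forward (a stagewise scheme). The names `fit_line` and `select_pairs` and the toy data are hypothetical, and clinical covariates are left out for brevity.

```python
import math
from itertools import combinations

def fit_line(pred, resp):
    """Least-squares fit of resp on pred: (intercept, slope, slope t-statistic)."""
    n = len(pred)
    mp, mr = sum(pred) / n, sum(resp) / n
    sxx = sum((p - mp) ** 2 for p in pred)
    if sxx == 0:  # constant indicator carries no information
        return mr, 0.0, 0.0
    slope = sum((p - mp) * (r - mr) for p, r in zip(pred, resp)) / sxx
    intercept = mr - slope * mp
    rss = sum((r - intercept - slope * p) ** 2 for p, r in zip(pred, resp))
    if rss < 1e-12:  # perfect fit: avoid dividing by zero
        t = math.copysign(math.inf, slope) if slope != 0 else 0.0
    else:
        t = slope / math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, t

def select_pairs(x, y, n_pairs):
    """Greedily pick n_pairs feature pairs by absolute t-statistic."""
    n = len(y)
    # Step 1 input: indicator z_ijk = 1(x_ik < x_jk) for every pair
    z = {(i, j): [1.0 if x[i][k] < x[j][k] else 0.0 for k in range(n)]
         for i, j in combinations(range(len(x)), 2)}
    resid = [float(v) for v in y]
    chosen = []
    for _ in range(n_pairs):  # step 3: fixed number of pairs
        candidates = [p for p in z if p not in chosen]
        if not candidates:
            break
        # Steps 1 and 2: t-statistic for all remaining pairs, keep the top one
        best = max(candidates, key=lambda p: abs(fit_line(z[p], resid)[2]))
        b0, b1, _ = fit_line(z[best], resid)
        # Assumed update: remove what the chosen pair already explains
        resid = [r - b0 - b1 * v for r, v in zip(resid, z[best])]
        chosen.append(best)
    return chosen

# Toy data (made up): 3 features, 8 samples.
x = [
    [1, 1, 1, 1, 5, 5, 5, 1],
    [3, 3, 3, 3, 3, 3, 3, 3],
    [1, 1, 5, 5, 5, 5, 1, 1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]
chosen = select_pairs(x, y, n_pairs=2)
```

Fixing the number of pairs in advance keeps the number of tuning parameters small, in line with "Fix parameters → find features".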

Patil et al. (in prep)

http://astor.som.jhmi.edu/~marchion//breastTSP.html

[Figure: decision tree built on three gene pairs. Splits (each with No/Yes branches): USP7 < RP11-423C15.3; NM_018610 < MTCH1; RND1 < LGALS14. Leaves: No Recur (three), Recur (one).]

Mammaprint

Patil et al. (in prep)

2 analyst variation

what went wrong?

2 things

what went wrong? transparency

The data/code weren’t reproducible

what went wrong? transparency

There was a lack of cooperation

what went wrong? expertise

They used silly prediction rules

(Pr(FEC) = 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
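One possible reason to call this rule silly is simple arithmetic: it can return values that are not probabilities. The sketch below evaluates the formula quoted above at its extremes; the function name `pr_fec` and the chosen inputs are illustrative only.

```python
def pr_fec(pr_f, pr_e, pr_c):
    """Combination rule quoted on the slide: 5/8 * (sum of the three) - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4

high = pr_fec(1.0, 1.0, 1.0)  # all three drugs predicted certain to work
low = pr_fec(0.0, 0.0, 0.0)   # all three drugs predicted certain to fail

# Neither extreme is a valid probability.
outside_unit_interval = (high > 1.0, low < 0.0)
```

With all inputs at 1 the rule gives 1.625, and with all inputs at 0 it gives -0.25.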

what went wrong? expertise

They had study design problems

(Batch effects)

what went wrong? expertise

Their predictions weren’t locked down

Today: Pr(FEC) = 0.8
Tomorrow: Pr(FEC) = 0.1

At the end of the day the Potti analysis was fully reproducible

The problem is that the analysis was wrong

http://bit.ly/10vS1yt

http://bit.ly/OgW3xv

Drinkel et al. Organometallics 2013

http://simplystatistics.tumblr.com/post/19646774024/laws-of-nature-and-the-law-of-patents-supreme-court

3 motivation

$ (from reducing sample size)

basic idea
• randomization isn’t perfect
• “rebalance” with baseline covariates
• improve estimator precision
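The basic idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the estimator used in the talk: it swaps in a plain linear (ANCOVA-style) adjustment for a single baseline covariate, and the data-generating model, effect size, and function names are all made up for the example.

```python
import random
from statistics import mean, pvariance

def simulate_trial(rng, n=100, effect=1.0):
    """One simulated two-arm trial with a prognostic baseline covariate w."""
    arm = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(arm)  # randomize, but chance imbalance in w remains
    w = [rng.gauss(0, 1) for _ in range(n)]
    y = [effect * a + 2.0 * wk + rng.gauss(0, 1) for a, wk in zip(arm, w)]
    return arm, w, y

def unadjusted(arm, w, y):
    """Difference in mean outcome between arms."""
    y1 = [v for a, v in zip(arm, y) if a == 1]
    y0 = [v for a, v in zip(arm, y) if a == 0]
    return mean(y1) - mean(y0)

def adjusted(arm, w, y):
    """Difference in means, corrected for chance imbalance in w."""
    num = den = 0.0
    means = {}
    for g in (0, 1):
        wg = [wk for a, wk in zip(arm, w) if a == g]
        yg = [yk for a, yk in zip(arm, y) if a == g]
        mw, my = mean(wg), mean(yg)
        means[g] = (mw, my)
        num += sum((wk - mw) * (yk - my) for wk, yk in zip(wg, yg))
        den += sum((wk - mw) ** 2 for wk in wg)
    b = num / den  # pooled within-arm slope of y on w
    return (means[1][1] - means[0][1]) - b * (means[1][0] - means[0][0])

rng = random.Random(0)
trials = [simulate_trial(rng) for _ in range(300)]
var_unadjusted = pvariance([unadjusted(*t) for t in trials])
var_adjusted = pvariance([adjusted(*t) for t in trials])
```

In this setup the adjusted estimate varies much less across simulated trials than the unadjusted one, which is the sense in which rebalancing on baseline covariates buys precision, or equivalently a smaller sample size for the same precision.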

Ack Math!!!!

1. Estimate probability of being in arm given baseline covariates
2. Calculate initial estimate for each person using each arm model using propensity score weighted logistic regression (WLR)
3. Define a covariate as the residual from fitting the arm-level models minus the arm-level means and fit new propensity models
4. Use these propensities to re-fit WLR from (2), then average predictions to get covariate-adjusted treatment effect

http://astor.som.jhmi.edu/~marchion//breastTSP.html

Age, Tumor Size, Grade: 5.1%
Age, Tumor Size, Grade, ER Status: 4.9%
Mammaprint Risk Category (MRC): 5.4%
Age, Tumor Size, Grade, ER Status, MRC: 7.8%
Age, Tumor Size, Grade, ER Status, TSP: 6.2%

3: cost, analyst variation, motivation

acknowledgements

Leek group: Prasad Patil, Leo Collado Torres, Abhi Nellore, Claire Ruberman, Jack Fu, Kai Kammers

Collaborators: Michael Rosenblum, Benjamin Haibe-Kains, P.O. Bachant-Winner, Roger Peng

Prasad Patil http://www.biostat.jhsph.edu/~prpatil/

Links

https://github.com/leekgroup/sig2trial

http://jtleek.com/talks/
