Statistical Concepts and Methodologies for Data Analyses. Benilton Carvalho, Computational Biology and Statistics Group, Department of Oncology, University of Cambridge

  • Slide 1

Statistical Concepts and Methodologies for Data Analyses
Benilton Carvalho
Computational Biology and Statistics Group, Department of Oncology, University of Cambridge

  • Slide 2

FROM RANDOM VARIABLES TO HYPOTHESIS TESTING

  • Slide 3

Random Variables
A function that associates probabilities with:
Countable items (discrete random variable): tumor vs. normal; yes vs. no; head vs. tail;
Uncountable items (continuous random variable): log-expression; weight; height;
Characterized by a distribution function: Bernoulli; Binomial; Geometric; Negative-Binomial; Poisson; Normal; Student's t; Gamma;

  • Slide 4

Examples: Discrete Distributions

  • Slide 5

Examples: Continuous Distributions

  • Slide 6

Common Uses of Different Distributions
Bernoulli: probability of a success in a single trial;
Binomial: probability of K successes in a fixed number of trials;
Geometric: probability of K failures before the 1st success;
Negative-Binomial: probability of K failures before R successes;
Poisson: probability of K rare events;

  • Slide 7

The Questions
Investigation of populations, or of groups within a population, leads to questions:
How does BRCA1 behave across groups?
Can genotype predict drug response?
Does transcript abundance change as a function of time?

  • Slide 8

The Experiment
A procedure used to answer the questions;
Comprises multiple items: population; sample; hypotheses; test statistic; rejection criteria;

  • Slide 9

Population
The full set of subjects of interest;
Ideally, every subject in the population is surveyed;
Issues with the census approach;

  • Slide 10

Sample
Select some subjects from the population;
We refer to this subset as the sample;
A subject in a sample can be called a replicate;
Replicates: technical vs. biological;

  • Slide 11

Hypotheses
Sets that define the underlying truth;
Null Hypothesis (H0): the default situation. It cannot be proven: we either reject it (in favor of H1) or fail to reject it;
Alternative Hypothesis (H1): the alternative (duh!). It complements H0 in the parameter space and helps define the rejection criteria.
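The discrete distributions listed under Common Uses of Different Distributions (Slide 6) can be sketched with their textbook probability functions. This is only a minimal illustration using the Python standard library; the function names and the parameter values in the example calls are assumptions made for this sketch and do not come from the slides.

```python
from math import comb, exp, factorial


def bernoulli_pmf(x, p):
    """P(X = x) for a single trial with success probability p (x is 0 or 1)."""
    return p if x == 1 else 1 - p


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(exactly k failures before the 1st success)."""
    return (1 - p) ** k * p


def negative_binomial_pmf(k, r, p):
    """P(exactly k failures before the r-th success), r a positive integer."""
    return comb(k + r - 1, k) * (1 - p) ** k * p**r


def poisson_pmf(k, rate):
    """P(exactly k events) when events occur at the given average rate."""
    return rate**k * exp(-rate) / factorial(k)


# Illustrative (assumed) parameter values.
print("Bernoulli  P(success)      :", bernoulli_pmf(1, 0.3))
print("Binomial   P(2 of 5)       :", binomial_pmf(2, 5, 0.3))
print("Geometric  P(4 failures)   :", geometric_pmf(4, 0.3))
print("Neg-Binom  P(4 fail, r = 2):", negative_binomial_pmf(4, 2, 0.3))
print("Poisson    P(3 events)     :", poisson_pmf(3, 1.5))
```

With R = 1 the Negative-Binomial function reduces to the Geometric one, and with a single trial the Binomial reduces to the Bernoulli, which matches the way the slide orders the distributions.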

  • Slide 12

Examples of Hypotheses (P1)
Comparing expression, tumor vs. normal:
H0: expression in tumor is at most as high as in normal;
H1: expression in tumor is higher than in normal;

  • Slide 13

Examples of Hypotheses (P2)
Comparing expression, tumor vs. normal:
H0: expression in tumor is at least as high as in normal;
H1: expression in tumor is lower than in normal;

  • Slide 14

Examples of Hypotheses (P3)
Comparing expression, tumor vs. normal:
H0: expression in tumor and in normal is the same;
H1: expression in tumor and in normal is different;

  • Slide 15

Test Statistic
Summary of the data;
Built under H0;
Independent of unknown parameters;
Has a known distribution;
Measures the compatibility between the data and H0;

  • Slide 16

Test Statistic
What the statistician sees

  • Slide 17

Rejection Criteria
Function of three factors: test statistic; hypotheses; Type I error (false positive) rate, $\alpha$;
Determines the thresholds used to reject H0:
One threshold: one-sided tests;
Two thresholds: two-sided tests;
Defines what is extreme for the experiment;

  • Slide 18

Rejection Criteria

  • Slide 19

From Rejection Criteria to P-value!
p-value

  • Slide 20

Rejection Criteria

  • Slide 21

From Rejection Criteria to P-value!
p-value

  • Slide 22

Rejection Criteria

  • Slide 23

From Rejection Criteria to P-value!
p-value

  • Slide 24

Sampling and testing
10% red balls and 90% blue balls;
Random sample of 10 balls from the box;
Discrete observations;
When do I think that I am not sampling from this box anymore?
How many reds could I expect to get just by chance alone?
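The two questions on Slide 24 can be explored numerically. The sketch below assumes that the 10 draws are independent (sampling with replacement, or from a very large box), so that the number of reds follows a Binomial(10, 0.1) distribution, and it assumes a one-sided significance level of 0.05. Neither assumption is stated on the slide, and all names are hypothetical.

```python
from math import comb

P_RED = 0.10    # proportion of red balls in the box (from the slide)
N_DRAWS = 10    # sample size (from the slide)
ALPHA = 0.05    # assumed significance level, not given on the slide


def prob_reds(k, n=N_DRAWS, p=P_RED):
    """P(exactly k reds) under the null box, assuming independent draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n=N_DRAWS, p=P_RED):
    """One-sided tail probability P(k or more reds) under the null box."""
    return sum(prob_reds(j, n, p) for j in range(k, n + 1))


def smallest_extreme_count(alpha=ALPHA, n=N_DRAWS, p=P_RED):
    """Smallest number of reds whose tail probability is at most alpha."""
    for k in range(n + 1):
        if prob_at_least(k, n, p) <= alpha:
            return k
    return None


# How many reds could we get by chance alone?
for k in range(6):
    print(f"reds = {k}: P(exactly) = {prob_reds(k):.4f}, "
          f"P(at least) = {prob_at_least(k):.4f}")

print("Smallest count judged extreme at the assumed level:",
      smallest_extreme_count())
```

The tail probability plays the role of a one-sided p-value: a count is called extreme only when seeing that many reds or more would be rare if the sample really came from this box.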
#red = 3

  • Slide 25

Discrete observations
10% red balls and 90% blue balls;
Random sample of 10 balls from the box;
Sample;
Null hypothesis (about the population that is being sampled);
Rejection criteria (based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?);
Test statistic: #red = 3;

  • Slide 26

Continuous observations
Sample: 4, 2.3, 5.2, 4.7, 2.1, 3.5, ..
Null hypothesis (about the population that is being sampled);
Rejection criteria (based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?);
mean = 3, sd = 0.6;
Test statistic;

  • Slide 27

Summary of the Experiment
1) hypotheses; 2) sample; 3) test statistic; 4) decision;

  • Slide 28

Useful Facts
The Law of Large Numbers guarantees that the larger the sample size, the closer the sample average is to the true mean;
The normality assumption is not that important with a large sample size;
The Central Limit Theorem states that the sample average is asymptotically normal;

  • Slide 29

Useful Facts
The Z-score depends on precise knowledge of the variance term: $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$;
Estimating the variance changes the distribution of the test statistic: $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$ follows a Student's t distribution with $n - 1$ d.f.;

  • Slide 30

Useful Facts
The Student's t distribution is similar to the Normal distribution, but has heavier tails;
Larger sample size, more d.f.;
More d.f., closer to Normal;

  • Slide 31

Multiple Testing
We are doing high-throughput experiments;
Comparing thousands of units simultaneously;
At this scale, we can observe several instances of rare events just by chance:
Event A: 1 in 1000 chance of happening;
Event B: 999 in 1000 chance of happening;
And the experiment is tried 20,000 times;
We expect to observe 20 occurrences of Event A (20,000 × 1/1000), although Event B is much more likely;

  • Slide 32

Multiple Testing
Similar scenario, for example, with differential expression (DE);
Most genes are not differentially expressed;
High-throughput experiments;
Differential expression is tested for 20K genes;
Need to protect against false positives;
Suggestion: use non-specific filtering;

  • Slide 33

DATA MODELING

  • Slide 34

What is a model?

  • Slide 35

Statistical Models
There is no correct model;
Models are approximations of the truth;
There is a useful model;
Understand the mechanisms of the system for better choices of model alternatives;

  • Slide 36

Revisiting Microarrays
Scanned images;
Fluorescence intensities;
Proportional to target abundances;
Restricted dynamic range;
Asymmetrical distribution;
Log-intensities behave better;

  • Slide 37

Revisiting Microarrays

  • Slide 38

Intensities

  • Slide 39

Log-Intensities

  • Slide 40

Back to Data Modeling: Linear Regression / ANOVA
Nature of the data: continuous;
Linear regression often used;
For subject i, known factors/covariates are candidates to predict the log-intensities of a gene: $y_i = b_0 + b_1 X_i + \varepsilon_i$;
Residuals expected to be Normal;

  • Slide 41

Interpreting Coefficients
Statisticians indicate that a parameter is estimated by using a hat on top of it: $\hat{y} = \hat{b}_0 + \hat{b}_1 X$;
Assuming that X = 0 for normal tissue: $\hat{y} = \hat{b}_0$;
Assuming that X = 1 for tumor tissue: $\hat{y} = \hat{b}_0 + \hat{b}_1$;

  • Slide 42

Interpreting Coefficients
$\hat{b}_0$: average log-intensity for normal tissue;
$\hat{b}_1$: change in average log-intensity associated with tumor tissue;
$\hat{b}_0 + \hat{b}_1$: average log-intensity for tumor tissue;

  • Slide 43

GLM
Generalized Linear Models;
Generic framework;
Accommodates different types of data;
Special cases: linear regressions and ANOVAs;

  • Slide 44

Example GLM: Binomial Family
Responses: yes/no; dead/alive; sick/healthy;
Predictors: gene expression / genotype / age;
Example:
Response: cytogenetic abnormalities (Yes/No);
Predictor: log-expression of probeset 1059_at;

  • Slide 45

Log-Expression vs. Abnormalities

  • Slide 46

Modeling a Binary Response
Response in the previous example:
Observed cytogenetic abnormalities;
Did not observe cytogenetic abnormalities;
Linear regression does not work;

  • Slide 47

Modeling a Binary Response
Instead of modeling the actual response, we model the probability of that response;
Linear regression still fails;
Valid results;

  • Slide 48

Logistic Regression: Rationale
Probability is restricted to the [0, 1] interval;
Linear regression is not;
Need to transform the probability;

  • Slide 49

Logistic Regression: Rationale
Instead of the probability, model the odds: $p / (1 - p)$;
Odds range from 0 to Infinity;
A linear regression approach would still fail;

  • Slide 50

Logistic Regression: Rationale
Instead of the odds, model the log-odds: $\log\left(p / (1 - p)\right)$;
Log-odds range from -Infinity to Infinity;
An approach like linear regression, using the log-odds scale, would work fine;

  • Slide 51

Back to GLM
In the previous example: $\operatorname{logit}(p) = \log\left(p / (1 - p)\right) = b_0 + b_1 X$;
Link function: logit;
Linear predictor: $b_0 + b_1 X$;

  • Slide 52

Interpreting Coefficients on a Logistic Model
b0: average log-odds for normal tissue;
b1: average change in log-odds for tumor tissue;
Suppose b0 = 10.87 and b1 = -3.46: how do we interpret them?
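One way to approach the interpretation question on Slide 52 is to move the coefficients from the log-odds scale back to odds and probabilities. The sketch below uses the two values given on the slide; the coding X = 0 for normal and X = 1 for tumor is carried over from the earlier coefficient slides, and the function and variable names are hypothetical.

```python
from math import exp


def inverse_logit(log_odds):
    """Map a log-odds value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-log_odds))


b0 = 10.87   # log-odds for normal tissue (value from the slide)
b1 = -3.46   # change in log-odds for tumor tissue (value from the slide)

log_odds_normal = b0        # X = 0 (assumed coding)
log_odds_tumor = b0 + b1    # X = 1 (assumed coding)

# exp(b1) is the odds ratio: odds for tumor divided by odds for normal.
odds_ratio = exp(b1)

p_normal = inverse_logit(log_odds_normal)
p_tumor = inverse_logit(log_odds_tumor)

print(f"log-odds  normal = {log_odds_normal:.2f}, tumor = {log_odds_tumor:.2f}")
print(f"odds ratio (tumor vs. normal) = {odds_ratio:.4f}")
print(f"probability  normal = {p_normal:.6f}, tumor = {p_tumor:.6f}")
```

Under these assumptions the negative b1 means the odds of the modeled response are multiplied by roughly 0.03 in tumor relative to normal. Because b0 is large, both probabilities remain close to 1, which shows why effects are usually reported on the odds scale and not as differences in probability.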

  • Slide 53

Model Selection
Likelihood measures the probability of observing the data under a certain model;
Given two models, M1 and M2 (M2 ⊃ M1, that is, M1 is nested in M2):
Get L1: likelihood of the data under M1;
Get L2: likelihood of the data under M2;
The distribution of LRT = -2 log(L1/L2) is known;
Small LRT: choose M1;
Large LRT: choose M2;

  • Slide 54

MODELING STRATEGIES FOR SEQUENCING DATA

  • Slide 55

Sequencing Rationale: Technical Replicates
For sample j, transcript i is generated at rate $\lambda_{ij}$;
A fragment attaches to the flow cell with a (low) probability $p_{ij}$;
The number of observed tags, $y_{ij}$, is Poisson distributed with rate proportional to $\lambda_{ij} p_{ij}$;
Adapted from notes by Tom Hardcastle

  • Slide 56

Poisson
Probability function: $P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, for $k = 0, 1, 2, \ldots$

  • Slide 57

Analysis method: GLM
Equation annotations: expected count of region i in sample j; design matrix; library size effect; (differential) effect for region i; noise part; deterministic part;

  • Slide 58

Technical replicates: consistent with Poisson;
Biological replicates: not consistent with Poisson;
Need to account for extra variability;
Based on the data of Nagalakshmi et al., Science 2008; slide adapted from Huber

  • Slide 59

Sequencing Rationale: Biological Replicates
For subject j, on transcript i:
Different subjects have different rates, which we can model through a distribution on the rates;
This hierarchy changes the distribution of Y;

  • Slide 60

Negative Binomial
Probability function:

  • Slide 61

Adding an additional source of variation;
Smooth dispersion-mean relation;

  • Slide 62

CONSIDERATIONS ON EXPERIMENT DESIGN

  • Slide 63

Consideration
Sample size is crucial. The larger, the better;
With differential expression, one can observe this more easily;
Is RNA-Seq really worth it when we consider cost, strategies for analysis, and technical requirements?
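The contrast between the Poisson model for technical replicates and the Negative Binomial model for biological replicates (Slides 55 to 61) can be illustrated by comparing the variance of the two distributions at the same mean. The sketch below assumes a mean/dispersion parameterization of the Negative Binomial, in which the variance equals mean + dispersion × mean². The slides do not show which parameterization the author used, and the numeric values and names are purely illustrative.

```python
from math import exp, lgamma, log


def poisson_pmf(k, mean):
    """Poisson probability of observing k counts, computed on the log scale."""
    return exp(k * log(mean) - mean - lgamma(k + 1))


def negative_binomial_pmf(k, mean, dispersion):
    """Negative Binomial probability of k counts.

    Assumed parameterization: size = 1 / dispersion, so that
    variance = mean + dispersion * mean**2.
    """
    size = 1.0 / dispersion
    log_prob = (lgamma(k + size) - lgamma(size) - lgamma(k + 1)
                + size * log(size / (size + mean))
                + k * log(mean / (size + mean)))
    return exp(log_prob)


def mean_and_variance(pmf, upper=2000):
    """Numerically sum a probability function over 0 .. upper - 1."""
    probs = [pmf(k) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance


MEAN = 10.0        # illustrative expected count
DISPERSION = 0.5   # illustrative extra (biological) variability

pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, MEAN))
nb_mean, nb_var = mean_and_variance(
    lambda k: negative_binomial_pmf(k, MEAN, DISPERSION))

print(f"Poisson:           mean = {pois_mean:.3f}, variance = {pois_var:.3f}")
print(f"Negative Binomial: mean = {nb_mean:.3f}, variance = {nb_var:.3f}")
```

Both distributions share the same mean, but only the Negative Binomial lets the variance exceed it, which is the extra variability that Slide 58 says must be accounted for.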

  • Slide 64

Differential Expression Across Groups: Flow Cell Confounded With Group
Figure: Flow Cells 1 to 4, each with lanes 1 to 8; Groups A, B, C and D, one group per flow cell.

  • Slide 65

Differential Expression Across Groups: Randomize Samples with Respect to Flow Cell
Figure: Flow Cells 1 to 4, each with lanes 1 to 8.

  • Slide 66

Differential Expression Across Groups: Barcoding vs. Lane Effect
Figure: Flow Cells 1 to 4, each with lanes 1 to 8.

  • Slide 67

CONSIDERATIONS ON DATA PROCESSING

  • Slide 68

Normalization
Samples are sequenced at different depths:
Genes with higher expression in Sample 2;
Adjusting by total reads can be misleading;

Gene           Sample 1      Sample 2
Gene 1         500,000
Gene N         0             500,000
Total Reads    15,000,000    30,000,000

  • Slide 69

Normalization
Length can affect relative inference of expression across genes;
At the same expression level, Gene A, being K times longer than Gene B, is expected to have K times more reads than B;
Figure: Gene A; Gene B
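The length effect described on Slide 69 can be illustrated with a simple reads-per-kilobase adjustment. This is one common way of putting genes of different lengths on a comparable scale, not necessarily the method the author goes on to recommend. The gene lengths, read counts, and function name below are made up for the example.

```python
def reads_per_kilobase(read_count, gene_length_bp):
    """Scale a raw read count by gene length expressed in kilobases."""
    return read_count / (gene_length_bp / 1000.0)


# Hypothetical genes with the same expression level: Gene A is K = 4 times
# longer than Gene B, so it collects 4 times as many reads.
gene_a = {"length_bp": 4000, "reads": 400}
gene_b = {"length_bp": 1000, "reads": 100}

rpk_a = reads_per_kilobase(gene_a["reads"], gene_a["length_bp"])
rpk_b = reads_per_kilobase(gene_b["reads"], gene_b["length_bp"])

print("Raw reads:          A =", gene_a["reads"], " B =", gene_b["reads"])
print("Reads per kilobase: A =", rpk_a, " B =", rpk_b)
```

The raw counts suggest that Gene A is more highly expressed, while the length-adjusted values are equal. The adjustment matters when comparing different genes within a sample; it cancels out when the same gene is compared across samples.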