1
Applied Statistics – Challenges and Reward
Wenjiang Fu, Ph.D
Computational Genomics Lab, Department of Epidemiology
Michigan State University
[email protected] www.msu.edu/~fuw
2
What is Statistics?
“Lies, Damned Lies, and Statistics”
“Figures fool when fools figure”
A branch of mathematical science that studies data through probability distributions and modeling.
Fields: probability theory, actuarial science, biostatistics, financial statistics, industrial statistics, etc.
Related fields: biometrics, bioinformatics, geostatistics, statistical mechanics, econometrics, etc.
3
Grand challenges we are facing …
[Diagram labels: “Data”; Knowledge & Information; Decision; Statistics]
The 21st century will be the golden age of statistics!
4
Grand challenges we are facing …
1. Data collection technology has advanced dramatically, but often without adequate statistical sampling design and experimental design.
2. Technology for discovering and retrieving useful information has lagged behind and has become the bottleneck.
3. More sophisticated approaches are needed for decision making and risk management.
5
Statistical Challenges – Massive Amount of Data
6
Statistical Challenges – Image Data
7
Statistical Challenges – Functional Data, Graph (Network) Data, and Shape Data
8
Statistical Challenges – Click Stream Data
9
Statistical Challenges – Data Fusion and Assimilation
Data
10
Statistics in Science
Cosmic microwave background radiation
High Energy Physics
Tick-by-tick stock data
Genomic/proteomic data
11
Statistics in Science
Fingerprints
Microarray
12
What do we do?
New ways of thinking and attacking problems
Finding sub-optimal but computationally feasible solutions.
New paradigm for new types of data
Be satisfied with ‘very rough’ approximations
Turn research results into easy-to-use, publicly available software and programs
Join forces with computer scientists.
13
Some ‘hot’ research directions
Dimension reduction
Visualization
Dynamic systems
Simulation and real time computation
Uncertainty and risk management
Interdisciplinary research
14
Example 1. Sociology data
Homicide Arrest Rate (per 10^5) (R. O'Brien, 2000)
Age    1960   1965   1970   1975   1980   1985   1990   1995
15     8.89   9.07  17.22  17.54  18.02  16.32  36.52  35.24
20    14.00  15.18  23.76  25.62  23.95  21.11  29.10  32.34
25    13.45  14.69  20.09  21.05  18.91  16.79  17.99  16.75
30    10.73  11.70  16.00  15.81  15.22  12.59  12.44  10.05
35     9.37   9.76  13.13  12.83  12.31   9.60   9.38   7.27
40     6.48   7.41  10.10  10.52   8.79   7.50   6.81   5.48
45     5.71   5.56   7.51   7.32   6.76   5.31   5.17   3.67
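The rows of the table above are ages and the columns are calendar years, so each cell also belongs to a birth cohort. A minimal Python sketch of that bookkeeping, assuming the usual convention that cohort is period minus age (the function name `birth_cohort` is illustrative and not from the talk):

```python
# Ages (rows) and periods (columns) copied from the homicide arrest rate table.
ages = [15, 20, 25, 30, 35, 40, 45]
periods = [1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995]


def birth_cohort(period, age):
    """Approximate birth cohort of a table cell (assumed convention: period - age)."""
    return period - age


# Cells on the same diagonal of the table share a birth cohort.
cohorts = sorted({birth_cohort(p, a) for p in periods for a in ages})
print(len(cohorts), cohorts[0], cohorts[-1])
```

Because cohort is an exact linear function of age and period under this convention, the three effects cannot all be estimated separately without some additional constraint; this is the well-known identification problem in age-period-cohort modeling.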
15
Results from statistical modeling
[Figure: three panels of estimated effects, each with a vertical axis from about -0.5 to 1.0. Age trend: age effect vs. age (15 to 45). Period trend: period effect vs. period (1960 to 1990). Cohort trend: cohort effect vs. cohort (1920 to 1980).]
16
Example 2. Epidemiological study data
Mortality from Cervical Cancer in Ontario 1960-94: rate (per 10^5 person-years), with frequency in parentheses
Age    1960-64      1965-69      1970-74      1975-79      1980-84      1985-89      1990-94
20-24  0.15 (2)     0.11 (2)     0.15 (3)     0.14 (3)     0.14 (3)     0.20 (4)     0.13 (1)
25-29  1.22 (14)    0.52 (8)     1.24 (23)    0.80 (16)    0.88 (20)    0.47 (11)    0.93 (8)
30-34  3.15 (35)    2.94 (37)    2.01 (32)    1.45 (27)    1.79 (38)    1.31 (32)    1.08 (11)
35-39  5.38 (62)    4.47 (52)    3.59 (46)    3.86 (61)    3.12 (60)    2.47 (55)    2.16 (21)
40-44  9.80 (116)   7.15 (84)    4.32 (51)    5.12 (66)    3.71 (60)    2.47 (63)    2.16 (33)
45-49  15.66 (160)  10.97 (130)  7.75 (91)    4.69 (55)    5.17 (67)    5.02 (83)    3.41 (27)
50-54  17.01 (151)  13.32 (138)  8.19 (97)    6.82 (80)    6.12 (72)    4.65 (61)    5.79 (35)
55-59  18.56 (141)  15.23 (133)  11.53 (118)  9.12 (107)   5.94 (70)    5.81 (69)    5.77 (29)
60-64  22.44 (144)  16.08 (121)  13.66 (117)  10.71 (108)  7.93 (92)    7.35 (86)    4.02 (19)
65-69  23.53 (128)  18.87 (119)  15.31 (112)  13.79 (115)  10.36 (102)  7.60 (86)    6.83 (31)
70-74  25.89 (116)  19.36 (97)   15.36 (89)   15.18 (103)  13.95 (108)  10.42 (96)   10.44 (44)
75-79  29.12 (94)   20.08 (75)   23.84 (102)  16.29 (82)   14.90 (88)   11.50 (78)   12.73 (38)
80-84  31.76 (62)   24.72 (59)   21.51 (60)   23.82 (79)   12.69 (50)   17.40 (81)   12.77 (27)
85+    33.16 (42)   28.95 (50)   22.90 (50)   24.94 (68)   15.23 (51)   13.88 (56)   10.42 (19)
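Each cell of the table pairs a rate per 10^5 person-years with a death count, so the approximate person-years at risk can be backed out from the two numbers. A small Python sketch of that arithmetic (the function names are illustrative, and the results are only as precise as the rounded rates in the table):

```python
def person_years(deaths, rate_per_100k):
    """Person-years at risk implied by a death count and a rate per 10^5 person-years."""
    return deaths / rate_per_100k * 1e5


def rate_per_100k(deaths, person_years_at_risk):
    """Mortality rate per 10^5 person-years."""
    return deaths / person_years_at_risk * 1e5


# Cell for ages 40-44 in the first period: rate 9.80 per 10^5, 116 deaths.
py_40_44 = person_years(116, 9.80)
print(f"about {py_40_44 / 1e6:.2f} million person-years")
```

Cells with very small counts (for example the 20-24 age group) give rates that are much less stable than those for the older age groups.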
17
Results from statistical modeling
[Figure: three panels of estimated effects with 95% confidence intervals, each with a vertical axis from about -3 to 1. Age trend: age effect vs. age (20 to 80). Period trend: period effect vs. period (1960 to 1990). Cohort trend: cohort effect vs. cohort (1880 to 1960).]
18
Example 3. Medical study data: Ob/Gyn
Modeling of PlGF: Placental Growth Factor
19
SNP: Single Nucleotide Polymorphism
Homologous pairs of chromosomes
Paternal allele
Maternal allele
Paternal allele
Maternal allele
ACGAACAGCT    TGCTTGTCGA
ACGAGCAGCT    TGCTCGTCGA
SNP A/G
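A SNP is a single position at which two otherwise identical sequences differ. The sketch below locates such positions in Python, reading the fragments on this slide as two aligned 10-base sequences (an assumption about the figure layout); `snp_positions` is an illustrative helper, not part of any library.

```python
def snp_positions(seq1, seq2):
    """Return (1-based position, base in seq1, base in seq2) for every mismatch."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq1, seq2), start=1)
        if a != b
    ]


# The two copies differ at a single base, matching the "SNP A/G" label.
print(snp_positions("ACGAACAGCT", "ACGAGCAGCT"))  # [(5, 'A', 'G')]
```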
20
The International HapMap Consortium (Nature 2003)
21
Allele, Haplotype and Diplotype
A
B
a
b
SNP 1: two alleles A and a
SNP 2: two alleles B and b
Haplotype [AB]
Diplotype [AB][ab]
Haplotype [ab]
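With two SNPs, a haplotype is one allele from each SNP on the same chromosome, and a diplotype is the pair of haplotypes an individual carries. A short Python sketch that enumerates them, assuming two bi-allelic SNPs with the allele labels used on this slide (the `genotype` helper is illustrative):

```python
from itertools import combinations_with_replacement, product

# Four possible haplotypes for SNP 1 (A/a) and SNP 2 (B/b).
haplotypes = ["".join(h) for h in product("Aa", "Bb")]

# A diplotype is an unordered pair of haplotypes.
diplotypes = list(combinations_with_replacement(haplotypes, 2))


def genotype(diplotype):
    """Unphased genotype at each SNP implied by a diplotype."""
    h1, h2 = diplotype
    return tuple("".join(sorted(pair)) for pair in zip(h1, h2))


# Diplotypes consistent with being heterozygous at both SNPs.
compatible = [d for d in diplotypes if genotype(d) == ("Aa", "Bb")]
print(haplotypes)   # ['AB', 'Ab', 'aB', 'ab']
print(compatible)   # [('AB', 'ab'), ('Ab', 'aB')]
```

The double heterozygote is the ambiguous case: genotype data alone cannot distinguish [AB][ab] from [Ab][aB], which is why haplotypes usually have to be inferred statistically.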
22
Microarray Technology: 2 channels
Hybridization:
A T C G T A G
| | | | | | |
T A G C A T C
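The pairing shown above follows the Watson-Crick rules (A with T, C with G). A minimal Python sketch that reproduces the lower strand from the upper one, base by base in the same left-to-right order as the diagram:

```python
# Watson-Crick base pairing for DNA.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand (not reversed)."""
    return "".join(COMPLEMENT[base] for base in strand)


print(complement("ATCGTAG"))  # TAGCATC
```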
23
Microarray normalization: between slides
Boxplots of log ratios from 3 replicate self-self hybridizations.
Left panel: before normalization.
Middle panel: after within print-tip group normalization.
Right panel: after a further between-slide scale normalization.
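One common way to do a between-slide scale normalization is to rescale each slide's log ratios so that all slides share the same median absolute deviation (MAD). The Python sketch below illustrates that idea on made-up numbers; it assumes the log ratios are already centered within each slide, and it is not necessarily the exact procedure behind the boxplots.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)


def scale_normalize(slides):
    """Rescale each slide so its MAD equals the geometric mean of all slide MADs.

    Assumes every slide has a non-zero MAD.
    """
    mads = [mad(s) for s in slides]
    geo_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [[v * geo_mean / m for v in s] for s, m in zip(slides, mads)]


# Toy log ratios for 3 replicate slides (illustrative values, not real data).
slides = [
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.8, -0.4, 0.0, 0.4, 0.8],
    [-0.4, -0.2, 0.0, 0.2, 0.4],
]
normalized = scale_normalize(slides)
print([round(mad(s), 3) for s in normalized])
```

After rescaling, the three slides have the same spread, which is what the right panel of such a boxplot display is meant to show.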
24
Affymetrix SNP Array
Illustration of SNP annotation on Affymetrix SNP array.
Adapted from Matsuzaki et al., 2004.
‘AB’ coding for an A/C SNP: allele A = A, allele B = C.
25
Computational Genomics Data: SNP Genotype
Error rate: 1–5%. GIGO: garbage in, garbage out.
26
Computational Genomics Data: SNP Genotype
27
Prospects I – Genome-oriented Medicine
Genetic variation influences
- disease susceptibility
- disease progression
- therapeutic response
- unwanted drug effects
Genetics is pointing the way to personalized medicine…
With the development of the human HapMap project, coupled with advanced statistical approaches, we are entering an era of personalized medicine designed around an individual’s genetic profile.
28
Whole Genome-wide Association Studies
29
Whole Genome-wide Association Studies
Successful study:
Wellcome Trust Case-Control Consortium
GWAS on 7 diseases with 14,000 patients and 3,000 shared controls (Nature 2007).
Hypertension, diabetes, etc.
30
Recruiting Graduate Students
Epidemiology: study of the distribution of disease;
Biostatistics: data modeling, computation;
Quantitative Biology Initiative: MSU cross-disciplinary center.
Background: Mathematics, Statistics, Physics, Biology, Chemistry, and others.
Opportunity: Contact your department graduate director/chairman for funding from the Ministry of Education. MSU Epi/Biostatistics provides partial funding and covers tuition fees.
Qualification: TOEFL, GRE, GPA, reference letters.
My contact: [email protected] www.msu.edu/~fuw
Application: WWW.MSU.EDU
31
Thank you!
Q and A.
Office: CMS 415.