A simple statistical model for deciphering the cdc15-synchronized yeast cell cycle-regulated gene expression data
Ker-Chau Li, Robert Yuan (Statistics, UCLA)
Ming Yan (Biochemistry, UCLA)
The goal of this study is to demonstrate how simple statistical models can help organize and explain complex gene expression patterns.
Outline
• Introduction: microarray and cell cycle
• Data: cdc15 experiment
• A statistical model
• Phase determination
• Comparison with Spellman et al. (1998)
• Regularly oscillated genes
• Further discussion
MicroArray
• Allows measuring the mRNA level of thousands of genes in one experiment -- system level response
• The data generation can be fully automated by robots
• Common experimental themes:
  – Time course
  – Mutation/knockout response
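Time course: [Figure: expression level plotted against time, following a change of condition.]
Or: [Table: response matrix with rows and columns labeled A, B, C, D, E, ...; the diagonal is blank (--); example entries include 2.1, 0.8, 1.3, 0.5 in row A and 0.2, -0.5, 2.3, 0.22 in row B.]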
MicroArray Technique:
Synthesize gene-specific DNA oligos
Attach oligos to solid support
Tissue or cell
Extract mRNA
Amplification and labeling
Hybridize
Scan and quantitate
Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)
Getting a homogeneous population of cells:
Cells at various stages of the cell cycle
Synchronization conditions:
- Temperature shift to 37 C for the CDC15 yeast ts-strain
- Add pheromone
- Elutriation
Release back into the cell cycle
Take samples as cells progress through the cycle simultaneously
The data set is available at http://cellcycle-www.stanford.edu
We focus on one experiment in which a strain of yeast (cdc15-2) was incubated at a high temperature (35 degrees C) for a long time, causing cdc15 arrest. Cells were then shifted back to a low temperature (23 degrees C), and gene expression was monitored every 10 min for 300 min.
Data from some chips are not available. We concentrate on the 19 consecutive time points from 70 min to 250 min.
24 time points (min): 10, 30, 50, 70, 80, ..., 240, 250, 270, 290. The points from 70 to 250 are 10 min apart.
Use of the full data will be discussed later.
Genes with missing values are also deleted.
There are 4530 genes remaining. The data can be represented by a 4530 by 19 matrix.
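The filtering just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the toy numbers, and the choice to look for missing values only inside the selected time window are not taken from the slides.

```python
import numpy as np

def select_complete_genes(data, times, t_start=70, t_end=250):
    """Keep the columns with t_start <= time <= t_end, then drop rows with NaN.

    Assumption: a gene is "missing" only if it has a NaN inside the window.
    """
    times = np.asarray(times)
    cols = (times >= t_start) & (times <= t_end)
    window = np.asarray(data, dtype=float)[:, cols]
    complete = ~np.isnan(window).any(axis=1)   # True for genes with no NaN
    return window[complete], times[cols], complete

# Toy example (hypothetical values): 3 genes, 5 time points, NaN = missing.
times = [50, 70, 80, 90, 270]
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
        [0.0, np.nan, 0.1, 0.2, 0.3],
        [np.nan, 1.0, 1.1, 1.2, np.nan]]
X, kept_times, kept_rows = select_complete_genes(data, times)
print(X.shape)  # (2, 3)
```

Applied to the real data, the same steps would give the 4530 by 19 matrix, provided the missing-value rule matches the one actually used.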
Example of the time curve:
Histone gene HTT2 (ORF: YNL031C), time course:
[Figure: example expression time courses, including panels labeled YKL164C and YNL082W; x-axis: time in minutes (50 to 250) or time-point index (0 to 20).]
Preliminary study with two-way ANOVA
This is to investigate the constancy of the average expression level over time for each gene, and the constancy of the average expression level over all genes at each time point.
> cdc15
Factor   |    df | SS        | MS        | F
gene     |  4529 | 5.2408E+2 | 1.1572E-1 | 6.4169E-1
time     |    18 | 2.9745E+2 | 1.6525E+1 | 9.1638E+1
residual | 81522 | 1.4701E+4 | 1.8033E-1 |
total    | 86069 | 1.5522E+4 |           |
Gene effect: insignificant.
Time effect: appears statistically significant, but ... (next slide)
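A two-way ANOVA without replication, with genes as rows and time points as columns, can be sketched as follows. The function name and the toy matrix are illustrative assumptions; the slides do not say which software produced the table.

```python
import numpy as np

def two_way_anova(X):
    """Two-way ANOVA without replication; rows = genes, columns = time points.

    Returns (df, SS, MS, F) for each factor, (df, SS, MS) for the residual,
    and (df, SS) for the total.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    grand = X.mean()
    ss_gene = p * np.sum((X.mean(axis=1) - grand) ** 2)
    ss_time = n * np.sum((X.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((X - grand) ** 2)
    ss_res = ss_total - ss_gene - ss_time
    df_gene, df_time = n - 1, p - 1
    df_res = df_gene * df_time
    ms_res = ss_res / df_res
    return {
        "gene": (df_gene, ss_gene, ss_gene / df_gene, ss_gene / df_gene / ms_res),
        "time": (df_time, ss_time, ss_time / df_time, ss_time / df_time / ms_res),
        "residual": (df_res, ss_res, ms_res),
        "total": (n * p - 1, ss_total),
    }

# Toy 2 x 2 example (hypothetical numbers).
table = two_way_anova([[1.0, 2.0], [3.0, 6.0]])
print(table["gene"], table["time"])
```

For a 4530 by 19 matrix the degrees of freedom come out as 4529, 18, and 4529 x 18 = 81522, which matches the table.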
[Figure: column means (time effect) from the ANOVA result, plotted against time-point index.] The values are small.
The expression level is log_2 of the red/green ratio.
Red = light intensity of the red channel - "noise"
Green = light intensity of the green channel - "noise"
Red channel = mRNA from cells at one time point
Green channel = mRNA from unsynchronized cells
A .5 fold increase corresponds to log_2 1.5 = .585; 2^.15 = 1.11, i.e. a .11 fold increase.
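The conversion between log ratios and fold increases can be checked with a short sketch. The function names and the background ("noise") arguments are hypothetical; only the two numerical examples come from the slide.

```python
import math

def log2_ratio(red, green, red_noise=0.0, green_noise=0.0):
    """log_2 of the background-corrected red/green intensity ratio."""
    return math.log2((red - red_noise) / (green - green_noise))

def fold_increase(log_ratio):
    """Fractional increase implied by a log_2 ratio (0.5 means 1.5 times)."""
    return 2.0 ** log_ratio - 1.0

print(round(log2_ratio(150.0, 100.0), 3))  # 0.585
print(round(fold_increase(0.15), 2))       # 0.11
```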
A statistical model
• Motivation: modeling each curve with simple functions such as linear, quadratic, sine, or cosine appears reasonable but inflexible.
• Parsimony and accuracy can be gained if the basis curves are chosen by the data themselves.
• The model: each gene expression curve = c0 + c1 V1 + c2 V2 + c3 V3 + ε, where V1 is the 1st basis curve, V2 the 2nd basis curve, and V3 the 3rd basis curve.
The model (continued)
The errors have mean zero, are uncorrelated, and have the same variance across time, but the variance may depend on the gene. (This is important.)
It turns out that we can find the basis functions by applying PCA.
(see pdf file for pca)
Enhanced PCA for curve fitting
• Choose the number of basis curves by eigenvalues
• Assess the goodness of each curve fit by R-squared and by the residual sum of squares
• Identify genes that comply well with the model
• Interactive plotting helps reset user-specified parameters
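Assessing one curve fit by R-squared and residual sum of squares can be sketched as an ordinary least squares fit of the curve on a constant plus the basis curves. The function name and the stand-in basis built below are assumptions for illustration; the real basis curves come from the data.

```python
import numpy as np

def fit_curve(y, basis):
    """Least squares fit of y on an intercept plus the basis columns.

    Returns (coefficients, RSS, R-squared, sigma_hat), where sigma_hat uses
    len(y) minus the number of fitted coefficients as degrees of freedom.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), basis])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    rss = float(resid @ resid)
    tss = float(np.sum((y - y.mean()) ** 2))
    df = len(y) - design.shape[1]          # 19 - 4 = 15 for three bases
    return coef, rss, 1.0 - rss / tss, float(np.sqrt(rss / df))

# Hypothetical stand-in basis: three orthonormal, zero-mean curves on 19 points.
t = np.arange(19)
raw = np.column_stack([np.sin(2 * np.pi * t / 19),
                       np.cos(2 * np.pi * t / 19),
                       t - t.mean()])
basis, _ = np.linalg.qr(raw)
y = 0.5 + 2.0 * basis[:, 0] - 1.0 * basis[:, 1]   # noise-free toy curve
coef, rss, r2, sigma_hat = fit_curve(y, basis)
print(np.round(coef, 6), round(r2, 6))
```

With 19 time points and three basis curves plus a constant, the residual degrees of freedom are 15, the figure that appears in the regression output and test statistics elsewhere in the deck.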
PCA:
For a list of vectors, PCA can be used to find a common basis from the covariance matrix.
Covariance matrix: $\Sigma = (X - \mu)'(X - \mu)$
The directions found have the highest variance along them.
Find the directions by eigenvalue decomposition: $\Sigma \nu_i = \lambda_i \nu_i$
Model the curves by the PCA directions: $X_i = a_{1i}\nu_1 + a_{2i}\nu_2 + \cdots + a_{ki}\nu_k + \varepsilon$
Here, we chose the first three PCA directions as our basis.
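The PCA step can be sketched with an eigenvalue decomposition of the centered cross-product matrix. This is a minimal illustration, not the authors' implementation: the function name, the random demo data, and the choice to center by the column mean are assumptions.

```python
import numpy as np

def pca_basis(X, k=3):
    """Return (eigenvalues, first k directions, loadings) for rows of X.

    Rows are gene curves; columns are time points. The column mean plays the
    role of mu (an assumption about how the data were centered).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    scatter = centered.T @ centered               # (X - mu)'(X - mu)
    eigvals, eigvecs = np.linalg.eigh(scatter)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    basis = eigvecs[:, :k]                        # orthonormal directions
    loadings = centered @ basis                   # coefficients a_1i, ..., a_ki
    return eigvals, basis, loadings

# Demo on random data (hypothetical): 50 curves over 6 time points.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
eigvals, basis, loadings = pca_basis(X_demo, k=3)
print(basis.shape, loadings.shape)  # (6, 3) (50, 3)
```

Because the directions are orthonormal, the loadings are simply projections of each centered curve onto the basis, and the sorted eigenvalues give a direct way to judge how many basis curves to keep.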
[Figure: four panels showing the 1st, 2nd, and 3rd PCA directions and the eigenvalues; x-axis: index (0 to 20).]
1. Compliance check:
   H0: the three-bases model holds.
   Reject if $R_i^2 < 0.56$ and $RSS_i > 7.25$
   (Corr. coeff. between fit and observed < .75 and error s.d. bigger than .70, which is equivalent to a .5 fold increase.)
2. Cycle component check:
   H0: $a_{2i} = a_{3i} = 0$
   Reject if $\frac{(a_{2i}^2 + a_{3i}^2)/2}{RSS_i/15} > F_{2,15}(0.95) = 3.68$
3. Smoothness check:
   H0: $a_{1i} = 0$
   Reject if $\frac{|a_{1i}|}{\sqrt{RSS_i/15}} > t_{15}(0.975) = 2.131$
Gene screening flowchart:
6178 genes
  - missing values: 1648
  - complete: 4530
      - non-compliance: 41
      - compliance: 4489
          - insignificant cycle components: 2824
          - significant cycle components: 1665
              - smooth: 714
              - non-smooth: 951
For the non-compliance group, each curve pattern is examined visually. *** of these 41 have visible cycle patterns.
Noncompliance genes (41)
• High overall expression levels
• May or may not show cycle patterns
• Recommendation: inspect each gene separately
[Figure: time courses of YJL159W and YLR126C; x-axis: time in minutes (50 to 250) or time-point index (0 to 25).]
Phase determination
• The second and the third basis curves show clear cycle patterns. The third basis appears to be a 40 min-delayed version of the second basis, with an R-squared value of .78
• Linear combinations of these two basis curves show a variety of expression patterns.
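The "delayed version" comparison can be sketched as a squared correlation between one curve and a time-shifted copy of another. With samples 10 min apart, a 40 min delay is a shift of 4 steps. The function name and the sine curves (with a made-up 120 min period) are assumptions standing in for the real basis curves.

```python
import numpy as np

def lagged_r_squared(reference, delayed, lag_steps):
    """Squared correlation between reference[t] and delayed[t + lag_steps]."""
    reference = np.asarray(reference, dtype=float)
    delayed = np.asarray(delayed, dtype=float)
    if lag_steps > 0:
        a, b = reference[:-lag_steps], delayed[lag_steps:]
    else:
        a, b = reference, delayed
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)

# Hypothetical stand-ins for the 2nd and 3rd basis curves.
t = np.arange(70, 260, 10)                    # 19 time points, in minutes
v2 = np.sin(2 * np.pi * t / 120.0)
v3 = np.sin(2 * np.pi * (t - 40) / 120.0)     # v2 delayed by 40 min
print(lagged_r_squared(v2, v3, lag_steps=4))  # close to 1 for an exact delay
```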
Construction of a compass plot
• Use of known cycle-regulated genes (a list of 104 known genes in 6 groups)
• Compliance checking with the RSS/R^2 plot
• Cycle-exhibition checking with projection angles
• Coherent pattern checking by ANOVA
Phases of genes
Identify the phases of genes:
Prior knowledge: there were 104 known genes whose phases had been determined by traditional experimental methods.
Known genes: there are 6 groups of genes.
• SCB (G1 phase)
• MCB (G1 phase)
• Histone (S phase)
• S/G2 phase
• G2/M phase
• M/G1 phase
Noncompliant genes and genes without significant cycle components are excluded. The SCB group is also excluded because of inconsistent patterns within its expression vectors.
[Figure: compass plots (axes from -1 to 1) shown with time courses (x-axis: time-point index, 0 to 25); one labeled gene: YNL082W.]
82 known-phase genes without missing values
Remove genes with insignificant cycle components
Points are obtained by normalizing the loading coefficients for the 2nd and 3rd bases to unit length
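Normalizing the two loading coefficients to unit length places each gene on the unit circle, where its position can be summarized by an angle. The sketch below shows one way to do this; the function name and the angle convention (atan2 of the 3rd coefficient over the 2nd) are assumptions, and the slides do not state the convention behind their reported angles.

```python
import math

def compass_point(a2, a3):
    """Scale (a2, a3) to unit length and return (x, y, angle in radians)."""
    norm = math.hypot(a2, a3)
    if norm == 0.0:
        raise ValueError("both cycle coefficients are zero; no direction")
    x, y = a2 / norm, a3 / norm
    return x, y, math.atan2(y, x)

# Coefficients for Variable 1 and Variable 2 of FKS1 (YLR342W) from the deck.
x, y, angle = compass_point(0.479678, 0.762583)
print(x, y, angle)
```

Genes whose points fall close together on the circle share a similar mix of the two cyclic basis curves, which is what allows known-phase genes to mark out sectors.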
Late G1, SCB-regulated genes: YBR067C, YCL055W, YDL055C, YER001W, YGL225W, YJL187C, YLR342W, YMR307W, YPL256C, YPR159W
[Figure: compass plot (axes from -1 to 1) and time courses (x-axis: time-point index) for these genes.]
Compass plot for phase assignment
[Figure: compass plot with sectors labeled S, S/G2, G2/M, M/G1, and G1; histone genes are marked.]
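Phase Assignment
[Figure: phase assignment plots for the smooth and non-smooth groups, with regions labeled G1, S, S/G2, G2/M, and M/G1. Numbers printed on the plots, in extraction order: 108, 31, 352, 90, 295; 103, 27, 255, 239, 90, 165.]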
Comparison
• We re-classified the 800 cell cycle-regulated genes identified by Spellman et al. with our method. If a gene does not comply with our model, or does not have significant second or third regression coefficients, we do not assign a phase.
• Contingency tables of mismatched and unclassified cases.
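One simple summary of such a contingency table is the share of genes on the diagonal, where the two classifications agree. The sketch below uses the 222-gene table from this deck (rows: our phase; columns: Spellman et al.'s phase); the variable names are illustrative, and the agreement rate is a summary added here, not one reported on the slides.

```python
import numpy as np

phases = ["G1", "S", "S/G2", "G2/M", "M/G1"]
# Counts copied from the 222-gene contingency table in this deck.
table = np.array([
    [59, 6,  0,  0,  0],
    [4,  3,  0,  0,  0],
    [1,  7, 31, 17,  0],
    [0,  0,  3, 47,  1],
    [18, 0,  0,  4, 21],
])

agree = int(np.trace(table))      # genes given the same phase by both methods
total = int(table.sum())
print(agree, total, round(agree / total, 3))  # 161 222 0.725

# Largest off-diagonal cell: the most common disagreement.
off = table.copy()
np.fill_diagonal(off, 0)
i, j = np.unravel_index(off.argmax(), off.shape)
print(phases[i], phases[j], int(off[i, j]))   # M/G1 G1 18
```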
Gene screening flowchart for the 800 genes:
800 genes
  - missing values
  - complete: 654
      - non-compliance: 9
      - compliance: 645
          - insignificant cycle components: 130
          - significant cycle components: 515
              - smooth: 293
              - non-smooth: 222
The group of 130 genes with insignificant cycle components appears quite bumpy. All but one in the non-compliance group show clear cycle patterns.
A non-compliance gene
YJL159W:
Spellman et al.'s score: 10.86
R2: 0.36273 (M/G1)
RSS: 14.15322
Angle: -2.43803
Least Squares Estimates:
Constant   -4.794002E-16 (0.222846)
Variable 0  1.28464 (0.971364)
Variable 1 -2.04016 (0.971364)
Variable 2 -1.49779 (0.971364)
[Figure: time course, x-axis: time-point index. Black: data curve; red: fitted curve (full model); blue: fitted curve (cyclic model).]
Locus_info:
  Other_name: PIR2 YJL159W CCW7 ORE1
  Gene_class: HSP
  Gene_Info: HSP150
  Gene_product: Heat shock protein, secretory glycoprotein
  Function: cell wall structural protein
  Cellular_Component: cell wall
  Process: cell wall organization and biogenesis
  Phenotype: Null mutant is viable
  Locus_notes: 14 HSP150 has also been called gp400
Position_info:
  Chromosome: X
  ORF_name: YJL159W
An example of our non-compliance gene
YDR055W:
Spellman et al.'s score: 7.266
R2: 0.30136 (M/G1)
RSS: 7.94018
Angle: -2.81396 (Insig. Coef.)
Least Squares Estimates:
Constant   -5.428720E-16 (0.166914)
Variable 0  1.47329 (0.727561)
Variable 1 -1.07451 (0.727561)
Variable 2 -0.316032 (0.727561)
[Figure: time course, x-axis: time-point index. Black: data curve; red: fitted curve (full model); blue: fitted curve (cyclic model).]
Locus_info:
  Other_name: YDR055W
  Gene_class: PST
  Gene_Info: PST1
  Description: Protoplasts-secreted
  Gene_product: The gene product has been detected among the proteins secreted by regenerating protoplasts
  Phenotype: Viable
Position_info:
  Chromosome: IV
  ORF_name: YDR055W
An example of a non-compliance gene
YNL082W:
Spellman et al.'s score: 4.843
R2: 0.229191 (G1)
RSS: 18.247480537500003
Least Squares Estimates:
Constant   -6.087129E-16 (0.253035)
Variable 0  1.51725 (1.10295)
Variable 1 -1.74757 (1.10295)
Variable 2  0.263945 (1.10295)
[Figure: time course, x-axis: time in minutes (50 to 250). Black: data curve; red: fitted curve (full model); blue: fitted curve (cyclic model).]
Top 10 scores and gene names from the insignificant cycle component group
Score | Gene
3.69  | YOR263C
3.85  | YOR320C
3.874 | YGR035C
4.022 | YCR042C
4.048 | YPR019W
4.13  | YJL194W
4.41  | YJR010W
5.047 | YEL068C
6.28  | YGR124W
6.716 | YKL172W
78 genes score higher than 6.716; 188 genes score higher than 4.022; 213 genes score higher than 3.69.
Yet these genes appear very bumpy; see next slide.
An example of an insignificant cycle component gene
YGR124W:
Spellman et al.'s score: 6.28 (S/G2)
R2: 0.364945 (small)
RSS: 0.812496 (small)
Angle: 3.13118
[Figure: CDC15 time course, from 70 min to 250 min.]
Locus_info:
  Other_name: YGR124W
  Gene_class: ASN
  Gene_Info: ASN2
  Description: Asn1p and Asn2p are isozymes
  Gene_product: asparagine synthetase
  Phenotype: Null mutant is viable; L-asparagine auxotrophy occurs upon mutation of both ASN1 and ASN2
Position_info:
  Chromosome: VII
  ORF_name: YGR124W
[Figure: time courses of EBP2: YKL172W, TSM1: YCR042C, and YOR263C.]
Non-smooth group from 800 genes
Our\their | G1  | S  | S/G2 | G2/M | M/G1 | Total
G1        | 59  | 6  | 0    | 0    | 0    | 65
S         | 4   | 3  | 0    | 0    | 0    | 7
S/G2      | 1   | 7  | 31   | 17   | 0    | 56
G2/M      | 0   | 0  | 3    | 47   | 1    | 51
M/G1      | 18  | 0  | 0    | 4    | 21   | 43
Total     | 82  | 16 | 34   | 68   | 22   | 222

Smooth group from 800 genes
Our\their | G1  | S  | S/G2 | G2/M | M/G1 | Total
G1        | 74  | 8  | 0    | 0    | 1    | 83
S         | 7   | 10 | 1    | 0    | 0    | 18
S/G2      | 5   | 11 | 43   | 17   | 1    | 77
G2/M      | 0   | 0  | 1    | 39   | 1    | 41
M/G1      | 43  | 0  | 0    | 3    | 28   | 74
Total     | 129 | 29 | 45   | 59   | 31   | 293

Low overall expression level
[Figure: time courses of CLN2: YPL256C (G1), HTA1: YDR225W (S), CLB4: YLR210W (S/G2), and YJL091C (phase ??); x-axis: time-point index.]
FKS1: YLR342W
From 5 cell
Least Squares Estimates:
Constant   -5.706461E-16 (4.704328E-2)
Variable 0 -0.170979 (0.205057)
Variable 1  0.479678 (0.205057)
Variable 2  0.762583 (0.205057)
R Squared: 0.571396
Sigma hat: 0.205057
Number of cases: 19
Degrees of freedom: 15
YOR264W
From 1 , total SS small
Oscillated genes
• The first basis curve oscillates in an extremely regular way
• There are over 200 genes with such regular oscillating patterns
• Role unknown: systematic error? Common upstream promoter region?
DIM1 (YPL266W)
Locus_info:
  Other_name: YPL266W
  Gene_class: DIM
  Gene_Info: DIM1
  Description: Dimethyladenosine transferase (rRNA (adenine-N6,N6-)-dimethyltransferase), responsible for m6[2]Am6[2]A dimethylation in 3'-terminal loop of 18S rRNA
  Gene_product: dimethyladenosine transferase
  Function: rRNA (adenine-N6,N6-)-dimethyltransferase
  Cellular_Component: nucleolus
  Process: 35S primary transcript processing rRNA modification
  Phenotype: Null mutant is inviable
Position_info:
  Chromosome: XVI
  ORF_name: YPL266W
RPS1A (YLR441C)
Locus_info:
  Other_name: YLR441C RP10A
  Gene_class: RPS
  Gene_Info: RPS1A
  Description: Homologous to rat S3A
  Gene_product: Ribosomal protein S1A (rp10A)
  Function: structural protein of ribosome
  Cellular_Component: cytosolic small ribosomal (40S)-subunit
  Process: 0006416 protein biosynthesis
  Locus_notes: 13 RP10A (RPS1A) and RP10B (RPS1B) are nearly identical; this gene has also been called PLC1, but should not be confused with PLC1 on chromosome XVI encoding a phosphoinositide-specific phospholipase
Position_info:
  Chromosome: XII
  ORF_name: YLR441C
GLN1: YPR035W
[Figure: time course, x-axis: time in minutes (50 to 250).]
Least Squares Estimates:
Constant   -6.276471E-16 (4.762055E-2)
Variable 0 -2.47649 (0.207573)
Variable 1  3.958405E-2 (0.207573)
Variable 2  1.01860 (0.207573)
R Squared: 0.917337
Sigma hat: 0.207573
One gene from the non-smooth group. Not in Spellman et al.'s list.
Further discussion
• Others who use PCA
• Clustering
• Other data set
• Use of SIR/PHD
• Without a time scale ? B-cell lymphoma data
• Pathway study
YGR231C
[Figure: time course, x-axis: time-point index.]
Least Squares Estimates:
Constant   -5.803153E-16 (4.131369E-2)
Variable 0 -0.156478 (0.180082)
Variable 1 -1.59995 (0.180082)
Variable 2 -0.623201 (0.180082)
R Squared: 0.859375
Sigma hat: 0.180082
The total sum of squares equals 3.4591, which is at about the 71.6th percentile among all genes. The median of the total sum of squares is 2.27735.
One gene from the smooth group. Not in Spellman et al.'s list.
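The total sum of squares and its percentile rank among genes can be sketched as below. The function names and the toy matrix are assumptions; the last lines recover the total sum of squares from the reported R-squared and sigma hat, using RSS = 15 x sigma_hat^2 and TSS = RSS / (1 - R^2).

```python
import numpy as np

def total_ss(X):
    """Per-gene total sum of squares about each gene's own mean."""
    X = np.asarray(X, dtype=float)
    return np.sum((X - X.mean(axis=1, keepdims=True)) ** 2, axis=1)

def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100.0 * float(np.mean(np.asarray(values) <= x))

# Toy matrix (hypothetical): 4 genes, 3 time points.
toy = [[0, 0, 0], [1, -1, 0], [2, -2, 0], [3, -3, 0]]
tss = total_ss(toy)
print(tss, percentile_rank(tss, 8.0), np.median(tss))

# Consistency check with the regression output for YGR231C.
tss_from_fit = 15 * 0.180082 ** 2 / (1 - 0.859375)
print(tss_from_fit)  # close to the 3.4591 quoted on the slide
```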
• Could genes with small overall expression levels have been removed from the beginning?
THE END
[Figure: time courses (x-axis: time-point index) for YBL002W, YDR224C, YER124C, YJL159W, YKL163W, YKL164C, YKL185W, YMR003W, YMR011W, YNL160W, and YDR055W.]