FINDING CONSISTENT SUBNETWORKS ACROSS MICROARRAY DATASETS
Fan Qi
GS5002 Journal Club
OUTLINE
Introduction
Methodology
Results & Discussions
Conclusions
INTRODUCTION
Identify differential gene expression
  Identify significant genes with respect to a phenotype
Importance:
  Testing the effectiveness of treatments
  Biological insights into diseases
  Developing new treatments
  Disease prophylaxis
  Any others?
CURRENT METHODS
Individual Genes
  Search for individual differentially expressed genes
  Fold-change, t-test, SAM
Gene Pathway Detection
  Looking at a set of genes instead of individual genes
  Bayesian learning and Boolean network learning
Gene Classes
  Adding existing biological insights
  Over-representation analysis (ORA), Functional Class Scoring (FCS), GSEA, NEA, ErmineJ
CHALLENGE
Different results from different datasets of the SAME disease!
Zhang M et al. [1] demonstrated inconsistency in SAM:

Dataset          DEGs     POG   nPOG
Prostate cancer  Top 10   0.3   0.3
Prostate cancer  Top 50   0.14  0.14
Prostate cancer  Top 100  0.15  0.15
Lung cancer      Top 10   0.00  0.00
Lung cancer      Top 50   0.20  0.19
Lung cancer      Top 100  0.31  0.30
DMD              Top 10   0.20  0.20
DMD              Top 50   0.42  0.42
DMD              Top 100  0.54  0.54

Reconstructed from Table 1 in [1]
Inconsistency among datasets
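The overlap measure in the table can be illustrated with a minimal Python sketch. It assumes POG is simply the fraction of the top-k genes shared by two ranked DEG lists; the exact definition in [1] may add further conditions, and the normalization behind nPOG is not reproduced here. The function name `pog` and the gene lists are hypothetical.

```python
def pog(ranked_a, ranked_b, k):
    """Fraction of the top-k genes that two ranked gene lists share.

    Assumption: a plain set overlap; the definition in [1] may differ.
    """
    top_a = set(ranked_a[:k])
    top_b = set(ranked_b[:k])
    return len(top_a & top_b) / k


# Hypothetical ranked DEG lists from two datasets of the same disease.
dataset_1 = ["g1", "g2", "g3", "g4", "g5"]
dataset_2 = ["g3", "g1", "g9", "g2", "g8"]

for k in (2, 5):
    print(f"Top {k}: POG = {pog(dataset_1, dataset_2, k):.2f}")
```

A value near 1 would mean the two datasets agree on their top genes; the values in the table are mostly far below that.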
NEW APPROACH
SNet [2]
  Proposed in 2011
  Utilizes gene-gene relationships in the analysis
Gene-gene relationship
  Activates vs. inhibits
Gene subnetwork
  A gene is a vertex, a relationship is an edge
[Figure: from Fig 1 in [2]]
[Figure: example subnetwork containing RHOA, VAV, PIK3R2, ARHGEF1, RAC1, and IQGAP1; partially adapted from Fig 2 in [2]]
METHODOLOGY
Input: genes labeled with phenotype
  Obtained from microarray experiments
Third-party information:
  Gene pathway information
  Gene reaction information
Attributes of a subnetwork:
  Size, score
Output: a set of significant subnetworks
Pipeline: Subnetwork Extraction -> Subnetwork Scoring -> Subnetwork Significance
METHODOLOGY – STEP 1
[Figure: each patient's ranked gene list, with patients grouped by phenotype (P1, P2, P3, ...)]
METHODOLOGY – STEP 1
Only the top-ranked genes are kept for each patient
Repeat for every phenotype group
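The per-patient filtering can be sketched in Python as follows. This is only an illustration: the cutoff used by SNet is not legible on the slide, so `top_percent` is a hypothetical parameter, and the expression values are made up.

```python
def top_genes(expression, top_percent):
    """Return the genes in the top `top_percent` of one patient's ranking.

    `expression` maps gene name -> expression value for a single patient.
    `top_percent` is a hypothetical cutoff, not a value taken from [2].
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    keep = max(1, len(ranked) * top_percent // 100)  # always keep at least one gene
    return set(ranked[:keep])


# Hypothetical expression profile of one patient.
patient = {"a1": 9.1, "a2": 7.4, "a3": 2.0, "a4": 8.3, "a5": 1.2}
print(sorted(top_genes(patient, top_percent=40)))
```

Applying `top_genes` to every patient in every phenotype group yields one set of retained genes per patient.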
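METHODOLOGY – STEP 1
Select one phenotype as d and treat the others as ¬d
From the patients in d, select the genes that occur in at least β% of the patients (β = 50)
The selected genes form the gene list GL
[Figure: the top-gene lists of the P1 (d) patients combined into GL]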
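A minimal Python sketch of this selection is shown below. It assumes each patient of phenotype d is represented by the set of top genes kept for that patient, and that a gene enters GL when it appears in at least β% of those sets. The function name `build_gene_list` and the patient data are hypothetical.

```python
from collections import Counter


def build_gene_list(top_sets, beta=50):
    """Genes present in at least `beta` percent of the patients' top-gene sets.

    Assumption: "at least" is the intended reading of the beta threshold.
    """
    counts = Counter(gene for genes in top_sets for gene in genes)
    n_patients = len(top_sets)
    # Integer comparison avoids floating-point rounding at the threshold.
    return {g for g, c in counts.items() if c * 100 >= beta * n_patients}


# Hypothetical top-gene sets for four patients of phenotype d.
d_patients = [{"a1", "a2"}, {"a1", "a3"}, {"a1", "a2"}, {"a4"}]
GL = build_gene_list(d_patients)
print(sorted(GL))
```

Here a1 occurs in three of four patients and a2 in two of four, so both pass the 50% threshold, while a3 and a4 do not.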
METHODOLOGY – STEP 1
Partition GL into multiple pathways
Generate subnetworks
[Figure: genes a1 to a7 from GL, before and after being grouped into subnetworks]
Result: a list of subnetworks with respect to d
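One way to picture subnetwork generation is as finding connected components among the GL genes within a pathway. The Python sketch below makes that assumption explicit; it is not taken from [2]. The edge list, the `min_size` filter, and the function name `generate_subnetworks` are hypothetical.

```python
from collections import deque


def generate_subnetworks(gene_list, pathway_edges, min_size=2):
    """Connected components of the pathway graph restricted to `gene_list`.

    Assumption: a subnetwork is a connected component; `min_size` is a
    hypothetical filter for dropping isolated genes.
    """
    genes = set(gene_list)
    adjacency = {g: set() for g in genes}
    for u, v in pathway_edges:
        if u in genes and v in genes:  # ignore edges that leave the gene list
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, components = set(), []
    for start in sorted(genes):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            components.append(component)
    return components


# Hypothetical pathway edges among the genes a1..a7 (a9 is not in the gene list).
gene_list = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
edges = [("a1", "a3"), ("a3", "a5"), ("a4", "a7"), ("a2", "a9")]
for subnetwork in generate_subnetworks(gene_list, edges):
    print(sorted(subnetwork))
```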
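METHODOLOGY – STEP 2
For each subnetwork $sp$ in the phenotype $d$ and each patient $p_k$, compute the overall expression level:
$SNet_{sp,k} = \sum_i \frac{n_i}{n}$, where
  $g_i$: a gene in $sp$ that is highly expressed in $p_k$
  $n_i$: number of patients in $d$ who have $g_i$ highly expressed
  $n$: total number of patients in $d$
For the patients in $d$ and in $\neg d$, compute a t-test:
$S_{sp,d} = \langle SNet_{sp,1}, SNet_{sp,2}, \ldots, SNet_{sp,n} \rangle$
$S_{sp,\neg d} = \langle SNet_{sp,n+1}, SNet_{sp,n+2}, \ldots, SNet_{sp,m} \rangle$
The t-test between $S_{sp,d}$ and $S_{sp,\neg d}$ gives the score $S_{sp,t}$
Assign $S_{sp,t}$ to each subnetwork
[Figure: a subnetwork of genes a1 to a7 scored for a patient P1 of phenotype d]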
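The scoring step can be sketched in Python under a few stated assumptions. Each gene of the subnetwork that is highly expressed in a patient contributes the fraction of phenotype-d patients in which that gene is highly expressed. "Highly expressed" is modelled as membership in a patient's top-gene set, and Welch's form of the t-statistic is used because the slide does not say which variant applies. All names and data are hypothetical.

```python
import math
from statistics import mean, variance


def subnetwork_score(subnetwork, patient_top, d_tops):
    """Sum, over subnetwork genes highly expressed in the patient, of the
    fraction of phenotype-d patients that also express the gene highly."""
    n = len(d_tops)
    score = 0.0
    for gene in subnetwork:
        if gene in patient_top:
            n_i = sum(1 for top in d_tops if gene in top)
            score += n_i / n
    return score


def welch_t(sample_a, sample_b):
    """Welch's t-statistic (an assumed variant of the t-test)."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


# Hypothetical subnetwork and top-gene sets for patients in d and not in d.
subnetwork = {"a1", "a2", "a3"}
d_tops = [{"a1", "a2", "a3"}, {"a1", "a2"}, {"a1", "a3"}, {"a1"}]
not_d_tops = [{"a2"}, set(), {"a3"}, {"a7"}]

scores_d = [subnetwork_score(subnetwork, p, d_tops) for p in d_tops]
scores_not_d = [subnetwork_score(subnetwork, p, d_tops) for p in not_d_tops]
t_score = welch_t(scores_d, scores_not_d)
print(scores_d, scores_not_d, t_score)
```

A large positive t-score indicates that the subnetwork is expressed much more strongly in the patients of phenotype d than in the others.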
METHODOLOGY – STEP 3
A. Randomly swap the phenotype labels of the patients, then recreate the subnetworks and t-test scores (steps 1-2)
B. Repeat [A] for 1,000 permutations
   Forms a 2-D histogram
C. Estimate the nominal p-value of each subnetwork
D. Select the subnetworks that pass the p-value cutoff
   Null hypothesis: the subnetwork is not significant
[Figure: Fig 5 in the original paper [2]]
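The idea of a permutation-based p-value can be illustrated with a simplified Python sketch. It departs from the procedure above in two labelled ways: it shuffles the labels of fixed per-patient scores instead of regenerating the subnetworks in each permutation, and it uses the difference in group means as the test statistic instead of building a 2-D histogram. The function name, the seed, and the scores are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(scores_d, scores_not_d, n_perm=1000, seed=0):
    """One-sided nominal p-value for the difference in group means.

    Simplification: only labels of fixed scores are permuted; the full
    procedure would rebuild subnetworks and scores for each permutation.
    """
    rng = random.Random(seed)
    observed = mean(scores_d) - mean(scores_not_d)
    pooled = list(scores_d) + list(scores_not_d)
    n = len(scores_d)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of phenotype labels
        if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical subnetwork scores for patients in d and not in d.
p_value = permutation_p_value([2.0, 1.5, 1.5, 1.0], [0.5, 0.0, 0.5, 0.0])
print(p_value)
```

A small p-value means that few random relabelings separate the two groups as well as the true labels do.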
RESULTS AND DISCUSSIONS
Datasets:
  Leukemia: Golub vs. Armstrong
  ALL: Ross vs. Yeoh
  DMD: Haslett vs. Pescatori
  Lung: Bhattacharjee vs. Garber
Performance comparison:
  Subnetwork overlap (with GSEA)
  Gene overlap (GSEA, SAM, t-test)
Other comparisons:
  Network size, gene validity with t-test
RESULTS AND DISCUSSIONS
Subnetwork Overlap

Disease   Dataset 1      Dataset 2  SNet overlap  GSEA overlap  SNet count  GSEA count
Leukemia  Golub          Armstrong  83.33%        0%            20          0
ALL       Ross           Yeoh       47.63%        23.1%         10          6
DMD       Haslett        Pescatori  58.33%        55.6%         7           10
Lung      Bhattacharjee  Garber     90.90%        0%            9           0

Synthesized from Tables 1 and 2 in [2]. Higher is better.
RESULTS AND DISCUSSIONS
Gene Overlap

Disease   SNet    GSEA   t-test (p < 0.05)  t-test (top)  SAM (p < 0.05)  SAM (top)
Leukemia  91.30%  2.38%  73.01%             14.29%        49.96%          22.62%
ALL       93.01%  4.0%   60.20%             57.33%        81.25%          49.33%
DMD       69.23%  28.9%  49.60%             20.00%        76.98%          42.22%
Lung      51.18%  4.0%   65.61%             26.16%        65.61%          24.62%

Synthesized from Tables 3, 4, and 5 in [2]. Higher is better.
RESULTS AND DISCUSSIONS
Size of subnetworks

          t-test              SNet
Disease   2    3    4    5    5    6    7    8    >8
Leukemia  84   8    1    0    0    2    3    2    1
Subtype   75   5    1    1    1    1    0    1    6
DMD       45   3    1    0    0    1    0    0    5
Lung      65   3    2    1    0    5    3    0    1

Reconstructed from Table 6 in [2]
RESULTS AND DISCUSSIONS
Validity
  Compare the genes in EACH subnetwork with those in the t-test
  Around 70%-100% of the genes in each subnetwork also appear in the t-test
Selected results (too large to present in full):

Subnetwork Name        Percentage  Subnetwork Name   Percentage
Leukaemia_B Cell-VAV1  81.82%      SNET_CTNNB1       100%
Leukaemia_UBC          100%        SNET_TNFSF10      60%
Leukaemia_RAC1         57.15%      SNET_PYGM         60%
DMD_RHOA               75%         DMD_ACTB          83.33%
DMD_SDC3               88.89%      Leukaemia_POU2F2  75.00%
MLLBCR_ACAA1           28.67%      BCR_T_RASA1       44.44%
MLLBCR_BLNK            72.73%      BCR_ABL1          75.00%
SNET_NOTCH3            100%        DMD_CALM1         80%

Selected from Tables 7, 8, 9, and 10 in [2]
CONCLUSIONS
Traditional methods are inconsistent across different datasets of the same disease
SNet utilizes biological insights to mitigate the gap
  Gene-to-gene relationships
  Gene pathway knowledge
SNet shows better results than established algorithms
  More consistent
REFERENCES
[1] Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C, Guo Z: Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes.
[2] Soh D, Dong D, Guo Y, Wong L: Finding consistent disease subnetworks across microarray datasets.
THANK YOU!!