View
1
Download
0
Category
Preview:
Citation preview
A Machine Learning approach to generate Gene Interaction
Networks
Nicola Lazzarini
Dr. Jaume BacarditProf. Natalio Krasnogor
Supervisors:
Nicola Lazzarini 11th March 2014BNC Seminar
• Gene Interaction Networks• Co-Prediction• General Pipeline• Co-Prediction evaluation• Co-Expression comparison• Results• Conclusions
Outline
Nicola Lazzarini 11th March 2014BNC Seminar
Gene Interaction Networks
A graph where nodes represents genes and e d g e s i n d i c a t e a relationship among them.
There exist many methods to infer GIN: Correlation based, Bayesian Networks, ODEs, etc.
Nicola Lazzarini 11th March 2014BNC Seminar
ML Approach
Machine Learning approach:Use the dataset to learn a model (e.g classification) from the existing samples, and afterwards infer gene interactions from the structure of the model
ML techniques used to infer GINs:
• Model Trees Inferring gene regression networks with model trees (Nepomuceno-Chamorro et. al)
• Rule Based Systems An analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier
systems (Urbanowicz et al.)
• Association Rules Inferring gene-gene associations from Quantitative Association Rules (Martinez-Ballestreros et. al)
Our approach: use a rule based classification algorithm to define GINs
Nicola Lazzarini 11th March 2014BNC Seminar
Attribute1 Attribute2 score
Attribute1 Attribute4 score
Attribute1 Attribute5 score
... .... ....
Attribute1 score
Attribute2 score
Attribute3 score
.... ....
Core assumption:Genes present within the same rules, due to their co-operation in classification, might be functionally related
List of nodes:
BioHEL is a rule based classification algorithm:rule1: if AttributeX > value1 AND AttributeY < value2 then Cancerrule2: if AttributeX < value3 AND AttributeZ < value4 then Normal ... ... ... ... ... ...rule n: ... ... ... ... ...
Network Generation
List of edges:
1
Improving the scalability of rule-based evolutionary learning (Bacardit et al.)1:
Nicola Lazzarini 11th March 2014BNC Seminar
What is Co-Prediction
This way of inferring GINs is termed: Co-PredictionInteraction between genes that act together for class prediction
We want to generalise this analysis:• Define a general pipeline to create Co-Prediction networks• Test on a set of data (8 datasets)• Compare with other GINs inferring methods (Co-Expression)
Already used in specific studies:- Reduced Neonatal Mortality in Meishan Piglets: A Role for Hepatic Fatty Acids? (H.P. Fainberg et al.)- Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features (Bacardit et al.)- Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets (Bassell et al.)
Nicola Lazzarini 11th March 2014BNC Seminar
General Pipeline
There are different ways of create the Co-Prediction Network
4 different variants/configurations
2 main switches allow different variants of the inferring process
Feature Selection
Rule Base Network Generation
Permutation Test
2nd Training
Rule Base Network Generation
Co-Prediction Network
Yes
Yes No
Original Dataset
No
Nicola Lazzarini 11th March 2014BNC Seminar
Feature Selection
Remove irrelevant features that might penalize the classification
Selected method: SVM-RFE (Recursive Feature Elimination)Gene Selection for Cancer Classification using Support Vector Machines (Guyon et al.)
Removes features based on SVM model
Only 10% of features are selected from the original set
Nicola Lazzarini 11th March 2014BNC Seminar
Permutation Test
We want to select o n l y t h e m o s t important nodes
Shuffle
Original Dataset
.... }Learning Phase - Network Construction
100 Random Datasets
Original
Y Score Y Score Z Score
... ...
Perm.1 Perm.2 Perm.99 Perm.100
Statistical Test
Strong Nodes
Y Score Y Score Z Score
... ...
Y Score Y Score Z Score
... ...
Y Score Y Score Z Score
... ...
Y Score Y Score Z Score
... ...
....
Nicola Lazzarini 11th March 2014BNC Seminar
Permutation Test
Two ways to use strong nodes:
1) Filter the list of edges: keep only the interactions where at least one node is “strong” (consider their neighbours)
2) Further Feature Selection method: consider the “strong” nodes (and their neighbours) for a second rule based network inferring process
Nicola Lazzarini 11th March 2014BNC Seminar
Configurations
Configuration Description
Conf1 Feature Selection + 1 Training Phase
Conf3 Original Dataset + 1 Training Phase
Conf5 Feature Selection + 2 Training Phases
Conf7 Original Dataset + 2 Training Phases
Final 4 Configurations used:
Conf2, Conf4, Conf6, Conf8 discarded by previous tests
Nicola Lazzarini 11th March 2014BNC Seminar
Datasets
Name Attributes Samples
Leukemia 7129 72
Lung Harvard 12534 181
Lung Michigan 7129 96
CNS 7129 60
dlbcl 2647 77
GSE2191 12625 54
GSE3726 22283 52
ProstateML 12600 136
Cohort of 8 human cancer related datasets
Nicola Lazzarini 11th March 2014BNC Seminar
Topological Analysis
Topology: layout of a network, is the arrangement of its various elements
Five different parameters: • Clustering Coefficients (Triangle & Square)• Density• Diameter• Shortest Path Length
Nicola Lazzarini 11th March 2014BNC Seminar
Enrichment Analysis
Biological terms overrepresented in a gene setSelected categories: Gene Ontology (GO), KEGG pathways and PubMed
Enrichment Analysis done using DAVID web toolSystematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources (Huang et al.)
Nicola Lazzarini 11th March 2014BNC Seminar
Co-Expression Comparison
Pearson Correlation Coefficient (PCC)
P (X1, ..., Xn)
P (X1, ..., Xn) =n�
i=1
P (Xi = xi�Xj = xj , ..., Xj+p = xj+p)
P (A,B) = P (B�A) ∗ P (A) =P (A�B)∗P (B)
P (G | D) = P (D|G)∗P (G)P (D)
P (G)P (D | G)
P (D)
xi(t) = fi (x1, ..., xN , u, θi)
θi xi(t)t xi(t) =
dxidt (t)
fi θi
rx,y =
�i (xi − x) (yi − y)��
i (xi − x)2 ×��
i (yi − y)2
(x, y) xi yix y
Measures how two genes tend to respond in the same direction across different samples
Ranges from -1 to +1
Nicola Lazzarini 11th March 2014BNC Seminar
Co-Expression Comparison
Given a Co-Prediction network with N nodes and E edges we build 2 type of Co-Expression networks:
• SE type: with E edges • SN type: with N nodes
Considering the edges with highest PCC
This is done for every Configuration
Nicola Lazzarini 11th March 2014BNC Seminar
MTERF
PAX5
PRKCI
PRTN3BLVRA
TAL1
ALDH3B1RFC1
CTSA
PGD
FSCN1
ATP6V0C
WT1
C10orf116
SERPINB6
ABAT APLNRCSTA
MGP
RPS26
TSTA3SLC14A2
AKR1C1C21orf2
CD36GYPE RBP1
PDLIM7
CST3GCNT2
ANPEP
HSD17BP1
CAPNS1
SMPDL3BMT1BSTOM
S100A4RPS6
FAM190B
PXN
IL8
TEAD1
HIST1H4H
TMED10
CFD
CLC
SERPINA1
SC4MOL
TPSAB1
ALCAM
USP9XARID3A
SNAPC3AGPAT2
ETS1
SNTB2
TTN
MNX1 TRD@
SLC2A5
MMRN1HIST2H2AA3
FKBP2
LEPROTGJB1
NMB
XKPOLB
DNM1
FPR1SLC18A3
PCBD1 CSF1
MUC5AC
SERPING1
RHD
MMP16
DHPS
IGJ
MGST1 MICA
SHC1PRKAR2B
TMEFF1
APOC1NONO
NFIX
MLLT3
RPA1RPL27A
FCER1A
AES
HAPLN1
ACTN1
MAPK4
GLB1
PKNOX1
KRT5ZYX
PTPN2UOX LOC100131311
SERPINB1
ABCC6
CSTF2
SQSTM1
GSTP1
NTRK2AHSG
RAPGEF3
CG030 ACADVL
PTGDSATP6V1F
GNS
CCDC6
ANKS1A
RIPK1
MYO1F
CDKN3PTH2R
GNAI2
ACY1
URODCENPC1 VSNL1
FAH AGPAT1OMPPTGS2
ALDH1A1MPO
DDOST
CCNH
NME4HOXA9
ASAH1
ELANE
GYPA
DRD3LTC4S
RAB3B
PIGA KYNU
SALL1
RGS1RALB
CTSCIGHM
TUBBFASGTF2E1
ZFP36L1
ZNF33BACOT7TMED9
MGST2
SLC2A4
PNMTGATA2
FOSL1
ADAM17ZNF174
MLC1
BGLAPVWF
ABCC2
FLNCMYO7A
CPA3
PABPC1
GOLIM4
TERT
IFITM2RPLP1
CD151
POLR3D
ZNF136
HOXB6CHRFAM7A
P4HB
HSD11B1
TAGLN2
GRPR
SLC17A2
DNAJB4
GJB2
TPM2RND2
GH1
PML
CGREF1
GSTO1
HEXA
ARF1AMHKAT5
RHAG
STAR
MSH2ADRM1
TXN
FOLH1
KLRC3
ABCA4MGLL
RPL31TALDO1KMO
PIK3R2
FMO1FASN POLR2F
SEPP1
PSMC3IP
MBL2
CA9 PRKCSH
FBLN1
RARA
NR1H2
HYAL3
DLD
NDUFV3SREBF2CFH
RBM10
PROS1
ZNF787
TGFB1FANCA
PPP2R5B
ERG
ITGB1BP1
EN2
PPARA
COL4A5DIRAS1
SLC17A1 KCNJ8
ZNF91
SOD3
HSD17B1
ITKNMU
TIMM8ACYBA AMPH TIAM1ARC
MAD1L1
CCR5
LOC96610
GPX4CD302
CITED2
RPS3 EPHA1
SPTLC2FHL1MSTP9
HINT1
GRIK3
ALOX12MAGEA11
MYL4
EEF1A1PPT1
FMO3
IRF5
CA6
BDNF RPL8
EPHA2
HDCCLCN6RAB2A
RPN1ID3
CSRP1 IFITM1
GAD1
CEBPD
CD37
SLCO2A1OPRK1
COL2A1HMGA1
MPST
SEMA3B
HBB
MBOAT7
MUC1
GM2A
PKN1PEX5GATA1
CTNNB1
ETF1
SRGN
TTC35
PTMA
ACSM3LZTR1
GATAD1
CD8A
SPINK2
SEPT7
HMGN1
PTPRD
CYP2A6
KRT18
MCM2
PRKG1 C12orf32
TPM1PPIF
TCF3
SLC3A2
CDC20CUL5
CD33ITGA2B EPCAM
POU2F2PCTK2
EWSR1
AIF1
ITPK1OXCT1 BMP6
TMSB10
NUP88MRPS12 PLD1
IL18
CRYAA
RAB5B FTL
MFSD10
RHOA
PPP3CC
WNT10B
DTYMK
MATN1
CD63
MAPKAPK3GNB3
PRAME
SEPT6
HFE JUNB
VPREB1
COMP
CDKN2A
SEPT5
RIN2
FTH1
MPL
GALC GPX1
CXCR4
TBXAS1
SCN7A
PRF1
RBM
GGT5
KIAA0195
CD7
TPM4
ZNF135
APOC4
RNF113A
S100A2
APLP2
IL7
CYB5R3
LYL1
ZFP36L2
HOXB2
UROS
OS9
MAP2K5NEK3
MAB21L1
RPN2ENO1
EYA2
UQCRH
NQO1
CTRB1
SYMPKME3
GRIK1
EBP
NCSTN
GYPB
FCGR1A
PPIB
ICAM3BUD31
TNNC2
ITGB2
NPC2CFP
POLD2
AHDC1
NAT1 ID2B
LAMB2 APOA1
ACTA2CTSDLOC728849
RNH1
MST1R
CYP24A1
CAT
POU2AF1
NME3 FPGS
1 Connected Component
Dataset: LeukemiaType: Co-Prediction
Configuration: Conf1Nodes: 421
Edges: 1529
Triangle Coeff: 0.712
Square Coeff: 0.013
Density: 0.017
Diameter: 5
Avg. SPL: 2.284
Topology Analysis
Nicola Lazzarini 11th March 2014BNC Seminar
PMS1 C12orf24
IQGAP2
ADH5
STXBP3
FARSA
UTP14CZNF146
NMI
CLNS1APTTG1IPPTDSS1
GTF2A2
HGF
NSMAF
TAF7
SRP54
ZNHIT3
KTN1
UBE2A
SLC35A1
PAFAH1B1
PSMC6
DPM1
ZCCHC11
PSMA3
TUBGCP3
PPWD1
EIF4A2
VRK1
METAP1
PUM1
U2AF1
SFRS5
PNN
PIK3R1
RBM4
CUL3
SAFB
DDX5
RBM39
C1QBP
GLO1
ATP5A1
SEPT7
SNRPG
SNRPF
YWHAQ
GARS
SNRPE
CCT6A
PPP1CC
FXR1
ACP1
MAPRE2
SUMO1
BTF3
CCNG1
UQCRFS1
NAP1L1
TOMM20
BCLAF1
EIF4H
PSMD6
HNRNPK
LOC728844
PSMA1
RPS7
SNRPD2
UQCRC2
CEP57
ATXN10
PRMT1
TGFBR2
COX6C
SSBP1
GCSH
SRP14
COX7C
SHFM1
ATP5C1
NAE1
EBNA1BP2
SET
EIF2S2
DDX1
CCT5
PKLR
DDX18
H3F3A
ATP5G3
COX6A1
NDUFA4
COX11
COX7A2
NACA
R3HDM1
TCEB1
FEZ2
WAPAL
LOC90925
SUZ12
KLRK1
UBXN4
SSB
TCF3
PARP1
TCP1
TOP2B
PRPF18
NOLC1
TCEA1
UBE4A
CCT7
SFRS3
PPP2R5E
HNRNPC
TMEM131
SKIV2L2ZNF638
CRYZ
RRAGA
NARS
MEF2APRDX6
PSMB7
MOBKL1B PSMD14PTPN11
KPNB1
CNBPTDG
HMGN4GLRX
PSMA4
SMAD1AVPSERPINI1 ZBTB16
H19HIST1H2AE
SPOCK2LIG4 ITGAXPOSTNAPCS CCL23 WARSALAS1 AGPSDMPK CALML3 CYP4A11SLC14A2 FOSBMCAMPTPN5 ZFP36MCM3 CCT8RING1 GNA11 SDCBP ATP6V1B2 NPTX2ATP5J PRKCZCRYMINPP5J P2RX5
HPGDHIST1H2BG
HIST1H4E
ADARB1
HIST1H2AIDPP4
FAM3CPTPRZ1
PTPRM
CYFIP1
CTSD
CYP2B7P1
GBA
PLAU
CD33
GCNT1DYRK1A
ADIPOQ CDC7N4BP2L2CTGF ZNF91MS4A1 CTLA4VLDLR
MRPS27
ARCN1 UBL4A
TRA2B
SNRPD3
MDH1
IMMT
CCT3
POLR2G
SRP9
MRPS22
TSNLRPPRC
SNRPD1
APEX1
PUM2
XRCC5
SP3
TCF12
PRKDC
PSMA5
RBMX
SFRS2
CALM2
CCT4
YWHAERBBP4
MRPL3
ACADM
COPS5
FNTACBFB
UBE2V2PWP1
TP53
PRMT5
PRKAR1AXPO1
ACLY
DCK
ERH
IARS EPRS
DARS
HNRNPF
SPCS2C11orf58
NMT1DYNLL1
MRPS31FAM120A
FLI1
C5orf13
GTF2I
NEDD8
FABP5L7
HNRNPU
BCL2
NCL
IRF2
HDAC2SFRS7
LOC728849
LBR
MX1
IRF7
IGBP1
IFI44L
XAF1
SLC25A5
RBBP7
HPRT1
CSTF3HLA-DMA
HLA-DRB1
HLA-DMB
SETDB1HLA-DPA1
GCM1ISG15 ANGPT1IKBKEIFIT1 ASMTDNA2
RPS29
EEF1A1
RPL37A
MT1P2
MT1A
RPL18A
RPL41
HLA-DRA
HLA-DQB1
CYP2D6
WNT2B
CD74
CTSA
MT2A
ADSS
PARP4TARBP1
SNRNP70 TM9SF2
SPG7
RPLP1
RBM25
DCAF8
CAPRIN1TP53BP1
B3GNTL1 DDOST FOXO1RPN2 CD24 TRH MPO IFI44 FUCA1DAD1 PIK3C3IRF9 PSMA6PPP2R5C PRSS3PRSS1INHBACREMPRG2CLCSSR2SND1KIAA0101 NFKBIACD72 H2AFZ PSMB8PSMB9CSRP2MYL6B MSLN CCR2 SLC6A8 ANXA11 LIMS1AQP1MOBP ITGA8CD83
IL8RBINHA
TAF1BNIP1
SLC30A3
PIP4K2B
IFI35
CSNK2A2
FABP3
CHIC1
B4GALT1
TRD@
NEUROG1
TMEM106A
KRT86MAP2K5
FAS
CLCN5
DAZL
P2RY1ANK3
TGFA CYP2E1
TM4SF4IL5
ARHGEF7
ID4
PDGFRA
GNG11
ITGA6
PMP22
GLDC
RIMS3
GLIPR1MAN1A1
GJA1ACP5
CLTC
KIT
NFE2
GATM
DRG2KCNJ4
ADCYAP1
SMARCD1 FOLR1
CYP2C19
BBS9
INA
KYNU
SMA4
IFNA1
FOXF1TCIRG1
PCSK6
CYP3A4
EGF
SILVBNC1
FOXF2
E2F5
LHFPL2
AKAP12
MPPED2
KDRTNFSF9UGP2
CLK3 GP9
NCOR2
DNASE1
MAP3K10
ASNA1
CD68
MCHR1GASTAPOB
MMP14MED22ADAM11
TBX5
TULP1
SMARCD3
ACOX2 NCOA2GOLGA2 S100A7
ZRSR2
PAPPASRPK3
IGL@
LOC91316
IGHG1
DGCR2HLA-C
INHBB
CD2PRTN3
PTP4A2ELANE
BPI DEFA4
FAT1CEACAM8
FXYD2
SEPW1 RHOHLCK ZAP70ITGA1 STAB1CD3DCHI3L2
TRBV5-4
TCF7 FES
FDPS
MLLT11
CD3G
CD3EPLCB2 TM9SF4
ANXA3
ATP6V1C1CDC25B
LTF
UBA7
MMP8
MMP9
CPT1B
LCN2CAMP
RGL2
ICAM1IL1B
IL6 NINJ1G0S2
CCL4 IFIT2
SERPINB2NIACR2
CCL3
PF4
PDE4C
PPBP
P4HB
PLEK
ADAM8
LMNA
CD63
ANXA1
UPP1
SOD2
LRPAP1PLAUR
ACSL1IL1RN
CXCL3
PTX3THBS1
GCA
BTN2A1
CD1BABI2
CD1E
CEACAM6
CR2MAL
S100P
DEFB1
BPGMGYPEGYPB
ALDH1A1BCL2L1
SEPT5ITGA2B
MARCH8
MYL4
CXCL2
XKSNCA
PPIF
BCL2A1NFKB1
EPB42
RHAG
MFGE8
GYPA
SPTA1
HBA2
BCL3
PRDX2
URODIL8
HBZ
SLC2A1
NAMPT
TFRC BSG
HBQ1
BLVRB
HMBS
COL16A1
FECH
LYZ
S100A12
ALAS2
S100A8
SELENBP1
HBB
CTAG1B
RENBP
MYLPF
SIRPD
SLC18A1
PMS2L11
PDE4A
PCDH1
HERC2P3CRABP2
ALPP
MAP1ACCL2
S100A9
GH2AQP2
LYST
NEFL
NCAPD3
MVK
SP1
CST4
AVPR1B COMTAGER
ST14
HADHB
TEC
PDIA2
CCR7
C4orf8
SPRR1A
SLC10A1
SLC17A2
ZWINTAS
CES2
CSN3
EFNA3
CALCRL
PAX8
IFNA21
DHRS2
FEV
MPP2
BCAT2
SLC2A4 ZNF8
KRT31
ERCC4RNF113A
ATP6V0A1
PTPRU
CDH5
MYH6
HLA-DQA1FANCC
CCL22
NELL2ARHGEF2
TIMP3
PDE6B
BCL6
GDF5
ATAD4
PSG7
FMOD
HTR2C
PTGIS
AOC2
GABRA1
NR2C1
COL11A2HMGA2REG1A
HCG8CCR9
GNB3 LPCAT3
GHRHR
CHRNE
MADDAFAP1
ADORA3
GFAP
CD4
SAA1
LIMK2 TGFBI
CTSSCX3CR1
CSF1R
CD1D
FGL2
IFI30
HCK
FCER1G
CTSH
PRKCD
S100A10
NCF2
ANXA2
TNFSF10
PLD3
LGALS2
CCL5
SERPINA1
IL13RA1
CD14CFP
VCAN
CST3
ZYX
TNFAIP2
SPI1
GRN
APLP2
RHOGCD302
ITGB2
ITGAM
TSPO
S100A4
FCGR1A
CSTA
AOAH
ASAH1
S100A11
Dataset: LeukemiaType: Co-Expression
Configuration: Conf1_SE
Nodes: 683
Edges: 1529
Triangle Coeff: 0.333
Square Coeff: 0.14
Density: 0.007
Diameter: 24
Avg. SPL: 8.856
59 Connected Components
Topology Analysis
Nicola Lazzarini 11th March 2014BNC Seminar
SFRS7
LBR
LRPPRC
PSMA5
ATP5A1
SRP9
DARSPTPN11
CNBPRBBP4
MOBKL1B
C11orf58
SNRPF
GCSH
SSBP1
SNRPE
SNRPG
PPP1CCNDUFA4
COX6CNAE1
SUMO1ACP1COX7C GARS
YWHAQ MYLPF
CTAG1BPMS2L11 BCAT2
CSN3
MPP2
SMA4
TM4SF4
CLCN5
IL5
FOXF2
FOXF1
ANK3
EGF
FAS
MAP3K10
SILV
HLA-C
IGHG1
TNFSF9
SRPK3
DNASE1
ITGA1
FDPS
LHFPL2
ID4 ITGA6
ARHGEF7
CLTC
CD1B
PDGFRA
MPPED2
PMP22PLD3
CCL5
E2F5
AKAP12
RIMS3
GLDCGNG11
IL13RA1
LRPAP1
CD63
P4HB
UPP1LMNA
PRTN3
ELANE
BTN2A1
CEACAM6 RBM39
SAFB
UBXN4
PUM1
BPI
CEACAM8S100P
DEFA4
CLC PRG2 CREM INHBA HLA-DRA CD74 HPGD ZBTB16
AVPH19
PSMD6 EIF4H BCLAF1 HNRNPK
DPP4
CTSH
CD14
VCAN
CFP
CSTA
TNFAIP2
GRN
FGL2
HCK
FCGR1A
S100A4
ASAH1TSPO
SERPINA1
LGALS2
FCER1G
IFI30
SPI1
ZYX
S100A11
ITGAM
NCF2
KYNU
APLP2
CTSSCD1D
SMARCD3
CD68
ITGB2
CD302
IGL@LOC91316ANXA3MMP8 PRSS3PRSS1
FEV
CST4
TAF1
EFNA3
CES2
SPRR1ACDH5
BCL2A1CCL22CCR9 SLC2A4
THBS1IL1RN
ACSL1
PLAURNAMPT
AGERLPCAT3PTPRU
ERCC4 MVK REG1A
TIMP3
HMGA2
ADCYAP1CYP2C19
NCAPD3ARHGEF2 AVPR1B
GH2
PAX8
DHRS2
SMARCD1
ZWINTAS
IL8RB
SLC30A3
FOLR1FANCC
TMEM106AGNB3
NEUROG1
MYH6RNF113ACHIC1
ATP6V0A1
PDE6B
IL1B
PDE4C
CCL3IFIT2
IL6 CCL4G0S2
SOD2
PTX3
ICAM1
GABRA1CALCRL
SLC18A1
IFNA21
MAP2K5
IFI35
MADD
GFAPCOL11A2
B4GALT1
MDH1
POLR2G
TCF12
CCT5CRYZ
CCT4
APEX1
GLRX
CBFB
FLI1PARP1
PSMA4
CCT6A
DPM1
CR2FXYD2
MEF2A
LOC90925
IMMT
EPRS
ZNF638
FAM120A
SNRPD1
CHI3L2
SEPW1
CD3G
TCF7
FAT1
CDC25B
PTP4A2
CD3EMAL
LCK
TULP1
ZAP70
TRBV5-4
CD3D
CD1E
MED22
ACOX2GAST
UGP2
APOB
CX3CR1
TBX5
MMP14
SPCS2
TRA2B
UBE2V2
CALM2
CCT3
XPO1
SP3
TSN
RBMX
FNTA
IARS
XRCC5
TMEM131
SSB
H3F3A
WAPAL
PSMC6
PKLR
ADH5
ZNHIT3
NARS
ZNF146STXBP3
FARSA
PMS1
C12orf24
UTP14C
BCL2
DYNLL1
COPS5
ACADM
PRKAR1A
PUM2
NEDD8ERH
HDAC2
DCK
LOC728849
MAN1A1
NFE2
GLIPR1
KIT
PTDSS1
GTF2A2
NSMAF
IQGAP2
ACP5
HGF
GJA1
RHAGGYPB
MYL4
ALDH1A1
GYPE
SEPT5
PTPRZ1
BNC1
TGFA
CYP3A4
CYFIP1
ADARB1
FAM3C
UROD
GYPA
HMBS
TFRC
SLC2A1FECH
BLVRB
PRDX2
HBB
IL8
SELENBP1
ALAS2
BPGM
SPTA1
HBQ1
MARCH8
BSG
NFKB1
EPB42
NEFL
SLC10A1
CXCL3 CXCL2
DYRK1A LCN2 HLA-DRB1
CAMP
RPL18AATP6V1C1
UBA7GCNT1
PLAU
TM9SF4 CYP2B7P1
STAB1CPT1B
MT1P2
MT2A
MT1A
EEF1A1
MMP9
RPL41
LTF
RPS29
RPL37A
GBA
CYP2D6
PPBPSERPINB2BCL6
PSG7
HLA-DMA
NR2C1
HLA-DMB
HLA-DPA1
PF4
ITGA8 RBM25 TARBP1 COX6A1 COX7A2 ATXN10 PRMT1 ANXA2R3HDM1 PPP2R5E PRPF18 MSLN CCR2 NFKBIA CD83 MOBPDAD1S100A9S100A8S100A12HIST1H4EHIST1H2AIHIST1H2AEDMPK ZCCHC11 CD72 CSRP2 KIAA0101 H2AFZ PSMB9 PSMB8 ATP5C1PPP2R5C SFRS3 CCT7 FUCA1 PIK3C3 PSMA6 MYL6B PPWD1 S100A10 SLC6A8 AQP1 CTSA CTSD ANXA11 LIMS1 SND1 SPG7SSR2 ATP5G3 SNRPD2 SNRNP70
PRKCD CST3
APCS POSTN
53 Connected Components
Dataset: LeukemiaType: Co-Expression
Configuration: Conf1_SN
Nodes: 422
Edges: 680
Triangle Coeff: 0.323
Square Coeff: 0.199
Density: 0.008
Diameter: 16
Avg. SPL: 5.644
Topology Analysis
Nicola Lazzarini 11th March 2014BNC Seminar
Topology Analysis
Co-Prediction vs. Co-ExpressionConfiguration Description
Conf1 F.S + 1 Train.
Conf3 Orig + 1 Train
Conf5 F.S + 2 Train.
Conf7 Orig + 2 Train.
- +
Nodes
Conf5P
Conf1P
Conf1SE
Conf5SE
Conf3SE
Conf7SE
Conf7P
Conf3P
- +
Triangle Clustering Coefficient
Conf3P
Conf7P
Conf1SE
Conf5SE
Conf3SE
Conf7SE
Conf1P
Conf5P
- +
Edges
Conf5SN
Conf1SN
Conf1P
Conf5P
Conf3P
Conf7P
Conf7SN
Conf3SN
- +
Square Clustering Coefficient
Conf3P
Conf7P
Conf1P
Conf5P
Conf7SE
Conf3SE
Conf1SE
Conf5SE
- +
Diameter
Conf1P
Conf5P
Conf7P
Conf3P
Conf1SE
Conf7SE
Conf3SE
Conf5SE
8 1234567
8 1234567
8 1234567
8 1234567
8 1234567
Nicola Lazzarini 11th March 2014BNC Seminar
Co-Prediction configurations diversity
O(Conf1,Conf2) = Commons
Commons + Unique1+ Unique2
GO Conf1 Conf3 Conf5 Conf7
Conf1 - 0.166 0.658 0.190
Conf3 - 0.156 0.571
Conf5 - 0.179
Conf7 -
KEGG Conf1 Conf3 Conf5 Conf7
Conf1 - 0.238 0.690 0.124
Conf3 - 0.127 0.508
Conf5 - 0.139
Conf7 -
Overlap of Enriched terms among different configurations
Nicola Lazzarini 11th March 2014BNC Seminar
Generally small overlap
Variants on the inferring process lead to different networks
PubMed Conf1 Conf3 Conf5 Conf7
Conf1 - 0.065 0.720 0.080
Conf3 - 0.062 0.423
Conf5 - 0.077
Conf7 -
Similar configurations (1,5) and (3,7) share more
Co-Prediction configurations diversity
Nicola Lazzarini 11th March 2014BNC Seminar
Co-Prediction vs. Co-Expression biological knowledge
Enrichment Score to have unbiased results
ES = # of Enriched Terms
# of Nodes
Average Ranking across 8 datasets
Conf1_P Conf1_ESE Conf3_P Conf3_ESE Conf5_P Conf5_ESE Conf7_P Conf7_ESE
GO 3.000 (2.5) 3.000 (2.5) 6.562 (8) 5.250 (6) 2.250 (1) 4.625 (4) 5.062 (5) 6.250 (7)
KEGG 4.437 (6) 3.125 (1) 6.687 (8) 4.312 (5) 3.625 (2) 4.125 (3) 4.250 (4) 5.437 (7)
PubMed 3.750 (4) 3.375 (1) 5.750 (7) 3.500 (2) 3.625 (3) 4.375 (5) 6.250 (8) 5.375 (6)
Conf1_P Conf1_ESN Conf3_P Conf3_ESN Conf5_P Conf5_ESN Conf7_P Conf7_ESN
GO 3.00 (3) 2.750 (2) 6.187 (7) 7.625 (8) 2.000 (1) 3.375 (4) 4.943 (5) 6.125 (6)
KEGG 4.500 (5) 3.625 (3.5) 6.562 (8) 5.937 (7) 3.625 (3.5) 3.250 (1) 3.562 (2) 4.937 (6)
PubMed 4.000 (4) 3.125 (1) 5.875 (8) 5.000 (5) 3.375 (2) 3.562 (3) 5.75 (7) 5.125 (6)
Co-Expression SE type
Co-Expression SN type
Nicola Lazzarini 11th March 2014BNC Seminar
Conf1 Conf3 Conf5 Conf7
GO 0.155 0.325 0.166 0.294
KEGG 0.068 0.256 0.092 0.215
PubMed 0.045 0.120 0.043 0.147
Conf1 Conf3 Conf5 Conf7
GO 0.149 0.469 0.144 0.371
KEGG 0.049 0.421 0.068 0.300
PubMed 0.048 0.270 0.047 0.193
Overlap of Enriched terms: Co-Prediction vs. Co-Expression
Co-Expression SE type
Co-Expression SN type
Co-Prediction vs. Co-Expression biological diversity
Nicola Lazzarini 11th March 2014BNC Seminar
Statistical test for performance evaluation
Is one method better than the other?
Consider each configuration as independent algorithm
Friedman testNull-hypothesis: the performances of the algorithms are equivalent
Enrichment Score used as metric
Post-Hoc with Nemenyi testCompare each algorithm vs. each other to check if one is better
Significance level α=0.05
Nicola Lazzarini 11th March 2014BNC Seminar
GO Terms•Null-hypothesis rejected with p-value 5.322e for SN type-07
Conf5_P better than: Conf3_P, Conf7_SN, Conf3_SNConf1_P better than: Conf3_SNConf1_SN and Conf5_SN better than: Conf3_SN
Statistical test for performance evaluation
Conf5_P better than: Conf3_P, Conf7_SE
•Null-hypothesis rejected with p-value 0.001012 for SE type
KEGG and PubMed TermsNull-hypothesis is not rejected
Nicola Lazzarini 11th March 2014BNC Seminar
Literature Mining
Manual literature mining for “important” nodes:• Top Hubs (highest degree)• Central Nodes (highest betweenness centrality)
Those important nodes appear to play an important role on the disease studied in the microarray
Example: GSE3726 (Colon vs. Breast Cancer)
6 out 7 “important” genes involved in one of those 2 cancers.
Nicola Lazzarini 11th March 2014BNC Seminar
We presented a ML approach to define GINs called Co-Prediction.•It has general pipeline to infer GINs with 4 different variants•It identifies interactions in a different way compared with statistical methods as Co-Expression•Compared to Co-Expression networks, Co-Prediction GINs:
✦Encode the same amount of biological knowledge ✦Different in: - Topology
- Enriched Terms
•It creates biologically relevant networks
Conclusions
Nicola Lazzarini 11th March 2014BNC Seminar
Thank you!
Recommended