DESCRIPTION
Linked Open Data for ACademia (LODAC), together with the National Museum of Nature and Science, has started collecting linked data on interspecies interactions and predicting links to guide future observations. The initial data is very sparse and disconnected, which makes it difficult to predict potential missing links using collaborative filtering alone. In this paper, we introduce Link Prediction on Interspecies Interaction (LPII), which addresses this situation with a hybrid recommendation approach. Our prediction model combines three scoring functions that take into account collaborative filtering, community structure, and biological classification. We found LPII to be more accurate than other combinations of prediction models. Using statistical significance testing, we demonstrate that these scoring functions are important and play different roles depending on the conditions of the linked data. This suggests that LPII can be applied to other real-world link prediction problems.
Link Prediction in Linked Data of Interspecies Interactions using
Hybrid Recommendation Approach
Hideaki TAKEDA, Professor
Chiang Mai, Thailand, JIST 2014, November 10th, 2014
Tsuyoshi HOSOYA, Mycologist
Rathachai [email protected]
Linked Open Data for ACademia (LODAC)
[Figure: lodac:Salix_pierotii, labelled "Salix pierotii", linked to lodac:Salix by species:hasSuperTaxon]
National Museum of Nature and Science
30,000 interactions, 4,000 fungi, 7,000 hosts
Let's find the missing links between species
LPII: Link Prediction on Interspecies Interactions
Objective:
To predict missing links between fungi and hosts
Agenda
• Dataset
• Introduction
• Hybrid Recommendation
  • Collaborative Filtering
  • Community Structure
  • Biological Classification
• Evaluation
• Summary
• Future work
# Fungus: a species whose super taxon is the genus Melampsora
lodac:Melampsora_yezoensis
    rdfs:label "Melampsora yezoensis"@la ;
    species:hasTaxonRank species:Species ;
    species:hasSuperTaxon lodac:Melampsora .
lodac:Melampsora species:hasTaxonRank species:Genus .

# Host: a scientific name whose super taxon is the genus Salix
lodac:Salix_pierotii
    rdfs:label "Salix pierotii"@la ;
    rdf:type species:ScientificName ;
    species:hasSuperTaxon lodac:Salix .
lodac:Salix species:hasTaxonRank species:Genus .

# Interaction link: the fungus grows on the host
lodac:Melampsora_yezoensis species:growsOn lodac:Salix_pierotii .
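As a rough illustration of how such triples could feed the later steps, the sketch below turns the example into a fungus-to-hosts map and a taxon-to-super-taxon map. It is plain Python over hand-copied tuples, not a real RDF or SPARQL client; the names `TRIPLES` and `build_bipartite` are hypothetical.

```python
from collections import defaultdict

# Triples copied by hand from the Turtle example (labels omitted).
TRIPLES = [
    ("lodac:Melampsora_yezoensis", "species:hasTaxonRank", "species:Species"),
    ("lodac:Melampsora_yezoensis", "species:hasSuperTaxon", "lodac:Melampsora"),
    ("lodac:Melampsora", "species:hasTaxonRank", "species:Genus"),
    ("lodac:Salix_pierotii", "rdf:type", "species:ScientificName"),
    ("lodac:Salix_pierotii", "species:hasSuperTaxon", "lodac:Salix"),
    ("lodac:Salix", "species:hasTaxonRank", "species:Genus"),
    ("lodac:Melampsora_yezoensis", "species:growsOn", "lodac:Salix_pierotii"),
]


def build_bipartite(triples):
    """Return (hosts_of, super_taxon) extracted from (s, p, o) triples."""
    hosts_of = defaultdict(set)   # fungus -> set of hosts it grows on
    super_taxon = {}              # taxon -> its parent taxon (e.g. genus)
    for subject, predicate, obj in triples:
        if predicate == "species:growsOn":
            hosts_of[subject].add(obj)
        elif predicate == "species:hasSuperTaxon":
            super_taxon[subject] = obj
    return dict(hosts_of), super_taxon


hosts_of, super_taxon = build_bipartite(TRIPLES)
```

Here `hosts_of` holds the fungus-host links of the bipartite graph, and `super_taxon` holds the genus information that the biological classification score relies on.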
Dataset
[Figure: the example as a graph. The fungus lodac:Melampsora_yezoensis is linked to the host lodac:Salix_pierotii by species:growsOn; each species is linked to its genus (lodac:Melampsora, lodac:Salix) by species:hasSuperTaxon.]
Selected: 903 rust fungi, 2,001 hosts, 2,966 links
[Figure: biological classification of fungi and biological classification of hosts]
Introduction
[Figure: the LPII workflow in three stages.
Data preparation: the fungus-host interaction dataset is transformed using a weight function.
LPII approach: three scores (collaborative filtering, community structure, biological classification) are combined to find missing links.
Result: a list of fungus-host interactions with predictive scores is generated, and biologists use it when making observations.]
Collaborative Filtering
Fungi that are found at the same host are neighbors.
If some close neighbors of the fungus f are found at a host h, the fungus f may be found at the host h.
[Figure: bipartite graph of fungi f1 to f5 and hosts h1 to h5]
Collaborative Filtering for Link Prediction
PCF(f1, h2) = ?
PCF: sum of similarities between fungi with common hosts
Jaccard Index
[Figure: the weight w of a link between two fungi is the Jaccard index of their host sets; in the example the two weights are w = 0.50 and w = 0.33.]
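The Jaccard weight between two fungi can be sketched as follows. The host sets here are hypothetical, since the edges of the slide's graph are not recoverable from the text; they are chosen only so that the two weights come out as 0.50 and 0.33, as in the figure.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: hosts on which each fungus has been observed.
hosts_of = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
}

w_f1_f2 = jaccard(hosts_of["f1"], hosts_of["f2"])  # 1 shared host out of 2
w_f2_f3 = jaccard(hosts_of["f2"], hosts_of["f3"])  # 1 shared host out of 3
```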
Predictive Score using Collaborative Filtering
PCF(f1, h2) = 0.50
PCF(f2, h3) = 0.33
PCF(f1, h3) = ???
PCF(f4, h3) = ???
PCF(f4, h5) = ???
etc.
[Figure: bipartite graph with weights w = 0.50 and w = 0.33; dashed red lines are predicted links.]
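A minimal sketch of the collaborative filtering score, under one reading of "sum of similarities between fungi with common hosts": PCF(f, h) adds up the Jaccard similarity between f and every other fungus already observed at h. That reading and the toy graph are assumptions; the graph is hypothetical and chosen so that the two scores stated on the slide are reproduced.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two host sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pcf(f: str, h: str, hosts_of: dict) -> float:
    """Sum of similarities between f and each other fungus observed at host h."""
    return sum(
        jaccard(hosts_of[f], hosts_of[g])
        for g in hosts_of
        if g != f and h in hosts_of[g]
    )


# Hypothetical toy graph (not the slide's actual edges).
hosts_of = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
    "f4": {"h3", "h4"},
    "f5": {"h5"},
}

score_f1_h2 = pcf("f1", "h2", hosts_of)  # only f2 is similar to f1
score_f2_h3 = pcf("f2", "h3", hosts_of)  # only f3 is similar to f2
```

On this toy graph the two scores are 0.50 and about 0.33, the values given for PCF(f1, h2) and PCF(f2, h3).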
Community Structure
If a host h is commonly found in the community of the fungus f, the fungus f may be found at the host h.
Bipartite Graph Projection of Fungi
[Figure: the fungus-host bipartite graph projected onto the fungi f1 to f5, with weighted edges such as 0.50 and 0.33.]
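One way to picture the projection is as a weighted graph over fungi in which two fungi are connected when they share at least one host. The sketch below assumes the Jaccard index as the edge weight and uses a hypothetical toy graph; `project_fungi` is an illustrative name.

```python
from itertools import combinations


def project_fungi(hosts_of: dict) -> dict:
    """One-mode projection: edge (f, g) -> Jaccard weight, for fungi sharing a host."""
    edges = {}
    for f, g in combinations(sorted(hosts_of), 2):
        shared = hosts_of[f] & hosts_of[g]
        if shared:  # fungi with no common host stay unconnected
            edges[(f, g)] = len(shared) / len(hosts_of[f] | hosts_of[g])
    return edges


# Hypothetical toy graph (not the slide's actual edges).
hosts_of = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
    "f4": {"h3", "h4"},
    "f5": {"h5"},
}

projection = project_fungi(hosts_of)
```

In this toy case f5 shares no host with any other fungus, so it stays isolated in the projection, which is the kind of disconnected node that motivates the additional scoring functions.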
[Figure: Community structure of rust fungi]
Using Modularity with Random Walk
[Figure: the projection of fungi is split into Community #1, Community #2, and Community #3, and the communities are linked to the hosts h1 to h5.]
PCS(f, h) = (number of links between the community of the fungus f and the host h) / (number of all links given by the community of the fungus f)
PCS(f3, h1) = 2 / 5 = 0.40
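The ratio above can be sketched in a few lines. The community assignment is taken as given input here (the community detection step itself is not implemented), and both the toy graph and the assignment are hypothetical, chosen so that PCS(f3, h1) comes out as 2 / 5.

```python
def pcs(f: str, h: str, hosts_of: dict, community_of: dict) -> float:
    """Links from f's community to host h, divided by all links from that community."""
    members = [g for g in hosts_of if community_of[g] == community_of[f]]
    total_links = sum(len(hosts_of[g]) for g in members)
    links_to_h = sum(1 for g in members if h in hosts_of[g])
    return links_to_h / total_links if total_links else 0.0


# Hypothetical toy graph and community labels (assumed, not computed).
hosts_of = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
    "f4": {"h3", "h4"},
    "f5": {"h5"},
}
community_of = {"f1": 1, "f2": 1, "f3": 1, "f4": 2, "f5": 3}

score_f3_h1 = pcs("f3", "h1", hosts_of, community_of)  # 2 of 5 links reach h1
```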
How to deal with many very small communities?
Biological Classification
If a host h is commonly found in the biological classification of the fungus f, the fungus f may be found at the host h.
BIOLOGICAL CLASSIFICATION (TAXONOMY)
Domain e.g. Eukaryota
Kingdom e.g. Fungi
Phylum e.g. Basidiomycota
Class e.g. Urediniomycetes
Order e.g. Uredinales
Family e.g. Melampsoraceae
Genus e.g. Melampsora
Species e.g. Melampsora yezoensis
Classification Example
[Figure: bipartite graph of fungi f1 to f5 and hosts h1 to h5 with biological classification, showing the groups G1 and G2.]
PBC(f, h) = (number of links between the biological classification of the fungus f and the host h) / (number of all links given by the biological classification of the fungus f)
PBC(f4, h2) = 1 / 4 = 0.25
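The same ratio can be sketched with a taxonomic group in place of a detected community. The grouping of fungi into G1 and G2 below is hypothetical, as is the toy graph; they are chosen so that PBC(f4, h2) comes out as 1 / 4.

```python
def pbc(f: str, h: str, hosts_of: dict, group_of: dict) -> float:
    """Links from f's taxonomic group to host h, divided by all links from that group."""
    members = [g for g in hosts_of if group_of[g] == group_of[f]]
    total_links = sum(len(hosts_of[g]) for g in members)
    links_to_h = sum(1 for g in members if h in hosts_of[g])
    return links_to_h / total_links if total_links else 0.0


# Hypothetical toy graph and genus-level grouping (assumed for illustration).
hosts_of = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
    "f4": {"h3", "h4"},
    "f5": {"h5"},
}
group_of = {"f1": "G1", "f2": "G1", "f5": "G1", "f3": "G2", "f4": "G2"}

score_f4_h2 = pbc("f4", "h2", hosts_of, group_of)  # 1 of 4 links reaches h2
```

Unlike the collaborative filtering score, this one can be non-zero even for a fungus that shares no host with the fungi already observed at h, because the group membership comes from the taxonomy rather than from the interaction graph.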
Hybrid Recommender Approach
PII(f, h) is a combination of
• PCF(f, h): Collaborative Filtering
• PCS(f, h): Community Structure
• PBC(f, h): Biological Classification
Evaluation
Learning and Testing
• Training set (2,500 links)
• Test set (500 links)
• Candidates (400,000 links)
[Figure: all possible links between fungi f1 to f5 and hosts h1 to h5 are divided into existent links and missing links, and each candidate link receives a predictive score.]
AUC: Area Under the receiver operating characteristic Curve
① PII( f1,h2 ) = 0.70
② PII( f2,h3 ) = 0.60
③ PII( f1,h3 ) = 0.50
④ PII( f4,h3 ) = 0.40
⑤ PII( f2,h2 ) = 0.30
⑥ PII( f3,h3 ) = 0.20
⑦ PII( f4,h3 ) = 0.10
① PII( f1,h2 ) = 0.70
② PII( f2,h2 ) = 0.60
③ PII( f3,h3 ) = 0.50
④ PII( f2,h3 ) = 0.50
⑤ PII( f1,h3 ) = 0.40
⑥ PII( f4,h3 ) = 0.30
⑦ PII( f4,h3 ) = 0.10
Predicted List #1
(sorted by predictive score)
Low AUC / High AUC
For n comparisons,
• n' is the number of times the test links have a higher score than the missing links.
• n" is the number of times the test links have the same score as the missing links.
Predicted List #2
(sorted by predictive score)
(Red scores are test links.)
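The counts n, n', and n" are usually combined as AUC = (n' + 0.5 n") / n. That formula is not visible in the extracted slide text, so treat it as an assumption about what the slide shows. The sketch below also compares every test link with every missing link rather than drawing n random comparisons, and the scores are made up for illustration.

```python
from itertools import product


def auc(test_scores, missing_scores) -> float:
    """AUC = (n' + 0.5 * n'') / n over all (test link, missing link) pairs."""
    n = n_higher = n_equal = 0
    for t, m in product(test_scores, missing_scores):
        n += 1
        if t > m:
            n_higher += 1   # test link ranked above the missing link
        elif t == m:
            n_equal += 1    # tie counts as half
    if n == 0:
        raise ValueError("need at least one test score and one missing score")
    return (n_higher + 0.5 * n_equal) / n


# Made-up scores: 6 comparisons, 4 wins and 1 tie for the test links.
auc_value = auc([0.7, 0.5], [0.6, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking, and values closer to 1 mean the test links tend to be ranked above the missing links.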
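AUC: Area Under the receiver operating characteristic Curve
Combination                    Scoring Function(s)    AUC
Stand-alone function           PCF                    0.859
Stand-alone function           PCS                    0.823
Stand-alone function           PBC                    0.680
Summation of functions         PCF + PCS              0.867
Summation of functions         PCF + PBC              0.876
Summation of functions         PCS + PBC              0.865
Summation of functions         PCF + PCS + PBC        0.892
Multiplication of functions    PCF × PCS              0.817
Multiplication of functions    PCF × PBC              0.862
Multiplication of functions    PCS × PBC              0.827
Multiplication of functions    PCF × PCS × PBC        0.818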
Overall
[Figure: the overall LPII workflow.
Data preparation: RDF data of interspecies interactions is retrieved by SPARQL querying into a bipartite graph, which is transformed using a weight function into a projection of fungi.
LPII approach: collaborative filtering; community structure, clustering using a community detection method; biological classification, selecting connected fungi and clustering using biological classification. The results are the input of the scoring functions PCF(f, h), PCS(f, h), and PBC(f, h), which are added to give PII(f, h) and find missing links.
Result: predicted missing fungus-host links together with prediction scores, ranked in decreasing order. A domain expert makes observations; when a link is found, the knowledge base is updated and shared on the LOD Cloud.
Legend: data, process, third-party method, scoring function, input argument, linear operation, decision, dataflow.]
Hybrid Recommender Approach
PII(f, h) = α · PCF(f, h) + β · PCS(f, h) + γ · PBC(f, h)
γ should be very low, between about 0.1 and 0.2.
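A minimal sketch of the weighted combination and the ranking step, assuming a plain weighted sum. The default values of alpha and beta, and the candidate scores, are hypothetical; only the low range for gamma comes from the slide.

```python
def pii(pcf: float, pcs: float, pbc: float,
        alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.1) -> float:
    """Weighted sum of the three scores; alpha and beta defaults are assumptions."""
    return alpha * pcf + beta * pcs + gamma * pbc


# Hypothetical candidate links with (PCF, PCS, PBC) scores.
candidates = {
    ("f1", "h2"): (0.50, 0.40, 0.00),
    ("f4", "h2"): (0.00, 0.00, 0.25),
    ("f2", "h3"): (0.33, 0.20, 0.00),
}

# Rank candidate links by combined score, highest first.
ranked = sorted(candidates, key=lambda link: pii(*candidates[link]), reverse=True)
```

With a small gamma, the biological classification score mainly separates candidates for which the other two scores are zero, such as ("f4", "h2") here, rather than reordering well-supported links.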
Conclusion
Informatics
• RDF model for interspecies interaction
• Improves the use of collaborative filtering with a sparse dataset using
  • community structure
  • and biological classification
• It has been found that
  • in the general case, PCF + PCS is enough;
  • but when a node has few common neighbors and is located in a small community, PBC becomes a key player for link prediction.
Biology
• This model supports the view that most fungi under the same genus have similar parasitic behavior.
• Some predicted links with high predictive scores have been found in other literature, such as
  • Phragmidium mucronatum on ハマナス (hamanasu, Rosa rugosa)
  • Phragmidium fusiforme on ハマナス (hamanasu, Rosa rugosa)
  • Phragmidium potentillae on イワキンバイ (iwakinbai, a Potentilla species)
• The next enhancement is to analyze fungal species by fungal spore type.
Future Work
[Figure: the hybrid combination PII(f, h) of PCF(f, h), PCS(f, h), and PBC(f, h) with weights α, β, γ, shown together with further inputs x1(f, h), x2(f, h), x3(f, h).]
[Figure: the overall workflow repeated with notation added: the bipartite graph GBipt including the existing links LExist, the missing links LMiss, the projection of fungi NFungi-Projection (or GProjFungi), and the weights α, β, γ on PCF(f, h), PCS(f, h), and PBC(f, h).]
Any idea for improvement?