eagle-genomics-ltd
DESCRIPTION
The volume and diversity of life science and healthcare data have created huge data integration challenges. Technologies such as federation and warehousing allowed us to manage volume but didn't give us the ability to respond flexibly to change or to routinely create novel insights from those data. This is due in part to imperfect recording and understanding of the context of the data, and in part to the representations we use. This talk will explore some of the historical and future approaches to large-scale semantic data integration and look at new graph and geometrical approaches to large-scale knowledge modelling.
Citation preview
select aminoid, seq1[0:6], xss[0:6] from amino a where seq1='R[2,4]+polar++hydroxyl+'
GO:0003673 : Gene_Ontology (28348)
GO:0008150 : biological_process (21805)
GO:0005575 : cellular_component (13866)
GO:0003674 : molecular_function (20801)
GO:0008369 : obsolete (289)
GO:0004432 : 1-phosphatidylinositol-4-phosphate kinase, class IA (0)
GO:0003824 : enzyme (7162)
GO:0016301 : kinase (1027)
GO:0004428 : inositol/phosphatidylinositol kinase (37)
GO:0016307 : phosphatidylinositol phosphate kinase (9)
GO:0000285 : 1-phosphatidylinositol-3-phosphate 5-kinase (1)
GO:0016740 : transferase (2130)
GO:0016772 : transferase, transferring phosphorus-containing groups (1239)
GO:0016773 : phosphotransferase, alcohol group as acceptor (969)
GO:0004428 : inositol/phosphatidylinositol kinase (37)
GO:0016307 : phosphatidylinositol phosphate kinase (9)
GO:0000285 : 1-phosphatidylinositol-3-phosphate 5-kinase (1)
[Figure labels: Ontology; Structured Data Sources; Unstructured Data Sources]
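[Figure: context vectors and term vectors, each with dimensions 1, 2, 3 ... n. The term vectors for 'Zinc' and 'Finger' are combined by OR (vector addition) into a query vector for 'Zinc finger'.]
Dot-product comparisons of the query vector against the term/context vectors give the semantic distance.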
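The query-vector idea on this slide can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration: the four-dimensional `TERM_VECTORS`, their values, and the helper names are hypothetical, and the dot product is normalised (cosine similarity) so that scores are comparable across terms, which the slide does not specify.

```python
import math

# Hypothetical term vectors; a real system would learn these from a corpus.
TERM_VECTORS = {
    "zinc": [1.0, 0.0, 1.0, 0.0],
    "finger": [0.0, 1.0, 1.0, 0.0],
    "tachycardia": [0.0, 0.0, 0.0, 1.0],
}


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def query_vector(terms):
    """Combine term vectors by element-wise addition (the 'OR' on the slide)."""
    dims = len(next(iter(TERM_VECTORS.values())))
    total = [0.0] * dims
    for term in terms:
        total = [a + b for a, b in zip(total, TERM_VECTORS[term])]
    return total


def rank(terms):
    """Score every known term against the query; higher means closer."""
    q = query_vector(terms)
    q_norm = math.sqrt(dot(q, q))
    scores = []
    for term, vec in TERM_VECTORS.items():
        denom = q_norm * math.sqrt(dot(vec, vec))
        # Assumption: cosine similarity stands in for the slide's dot-product comparison.
        scores.append((term, dot(q, vec) / denom if denom else 0.0))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for term, score in rank(["zinc", "finger"]):
        print(f"{term}: {score:.3f}")
```

In this toy setup the query for 'zinc' plus 'finger' scores both constituent terms highly and the unrelated term at zero, which is the behaviour the slide describes.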
'Tachycardia' search (untrained, no starting vocabulary provided): 400K clinical trials (500 MB of XML), unfiltered result set, approx. 1.2M 'terms' in corpus.
Vector length = semantic distance (in corpus); colour = term density in corpus.
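$$
\begin{pmatrix} x' & y' & z' \end{pmatrix}
=
\begin{pmatrix} x & y & z \end{pmatrix}
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\phi & \sin\phi \\
0 & -\sin\phi & \cos\phi
\end{pmatrix}
$$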
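As a worked illustration of the rotation about the x-axis on this slide, the sketch below applies the 3×3 matrix with rows (1, 0, 0), (0, cos φ, sin φ), (0, −sin φ, cos φ) to a point. It assumes the point is a row vector multiplied on the left of the matrix, which is one plausible reading of the slide; the function names are hypothetical.

```python
import math


def rotation_matrix_x(phi):
    """3x3 matrix for a rotation by phi radians about the x-axis."""
    c, s = math.cos(phi), math.sin(phi)
    return [
        [1.0, 0.0, 0.0],
        [0.0, c, s],
        [0.0, -s, c],
    ]


def rotate_x(point, phi):
    """Return point * M, treating the point as a row vector (an assumption)."""
    m = rotation_matrix_x(phi)
    return tuple(
        sum(point[i] * m[i][j] for i in range(3))
        for j in range(3)
    )


if __name__ == "__main__":
    # A quarter turn carries the y-axis onto the z-axis under this convention.
    print(rotate_x((0.0, 1.0, 0.0), math.pi / 2))
```

The x component is untouched and vector length is preserved, as expected for a rotation; under the column-vector convention the same matrix would rotate in the opposite sense.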
A B C D
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 1
1 1 1 0
1 1 1 1
A B C D
0 0 0 0
0 1 0 1 0 1 1
0 1 0 1 1 0
1 0 1 0 1 0 1 1