Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Enriching DBpedia by Knowledge Base Population and Dark Entity Resolution
Key-Sun [email protected]
School of ComputingKorea Advanced Institute of Science & Technology
(www.kaist.edu)
Semantic Web Research Center
Overview of Fact Triplet Extraction
2
2
Natural LanguageSentence à L2K
Bullet list à B2K
• nationality(Ernest_Hemingway, Unites_States)• occupation(Ernest_Hemingway, Novelist)• occupation(Ernest_Hemingway, Journalist)
• notableWork(Ernest_Hemingway, The_Torrents_of_Spring)• notableWork(Ernest_Hemingway, The_Sun_Also_Rises)
�
Subject Relation Object
Ernest_Hemingway
award Nobel_Prize_in_Literature
birthPlace United_States
notableWork
The_Old_Man_and_the_Sea
For_Whom_the_Bells_Tolls
A_Farewell_to_Arms
Learning
Extraction
Learning
Semantic Web Research Center
Goal and Challenges
• Goal– Constructing an iterative learning platform based on
Distant Supervision to improve knowledge learning quality
• Challenges– 1) How to build a good quality & quantity initial knowledge
base?– 2) How to make a reliable knowledge base population
system?
3
Semantic Web Research Center
Machine need to learn from text
• Corpus– Wikipedia, News, Blog, Twitter, Dialog, …
• Knowledge Base– DBpeida, Freebase, Wikidata, …
4
Semantic Web Research Center
KBox : Extended DBpedia
• KBox is a new KB that expands Korean DBpedia.
– follows the DBpedia ontology schema.
– continually expands much of the knowledge extracted from text using the Korean KBP system.
5
Knowl. Evolver
Entity-TypeResolution
LBox
Relation Extractor
CNN
LSTM B2K
Lattice MLN
Entity Linker
Entity Discovery
Knowledge Validation
Textual Data Analyzer
Sentence Segmentation
POS Tagger
Co-reference resolution
Surface Graph Parser
KB Agent
EntitySummarization
NER
Textual Source
KBox
Triple layerRDB layer
��
Crowdsourcing Learning
Frame-Semantic
Parser
DependencyParser
RL
Entity Linking
KB Completion
Iteration Control
Knowledge Seed
T2KConfidence
Filtering
DBpedia
Continuous
expansion
Semantic Web Research Center
Kbox Initialization, kbox.kaist.ac.kr- Korean enriched DBPedia
6
Korean DBpedia
Convert local property to DBO ontology triple
Triple extraction and mergefrom ENG.Dbpedia using :sameAs link
Type inference using Wikidata
Type inference using automatic type inference system
8,874,922
8,814,984
6,628,484
5,645,729
3,967,566
# of triplesKBox Initial
Semantic Web Research Center
Kbox Initialization - How to build a good quality & quantity initial knowledge base
• 2-layer Storage– RDB (MySQL) : All the information about all the KBox
triples, such as triple score, extraction module, source sentence.
– Triple Store (Stardog, Virtuoso) : Reliable triples• 1) The initial triples extracted from the DBpedia and Wikidata
– 1-1) Convert local property (prop-ko) to DBO property (출생지àbirthPlace)
– 1-2) Type inference using :sameAs link from EN.DBpedia– 1-3) Triple generation using :sameAs link from EN.Dbpedia and
Wikidata• 2) Automatically extracted triples using the Korean KB
Population with a confidence score above 0.9
7
Semantic Web Research Center
KBox Statistics(How to build a good quality & quantity initial knowledge base?)
8
0
200000
400000
600000
800000
1000000
1200000
DBpedia 3.8 2015-10 2016-10 KBox
Enrichment in Entity and Types
# of entities # of types
050000
100000150000200000250000300000350000400000
2016-10 KBox
Enrichment in Typed entities
# of typed entities # of untyped entities
Version # of Korean local triple
# of DBO triple
# of relational
triples
DBpedia 3.8 1,219,355 432,600 1,651,649
2015-10 2,569,432 738,599 3,308,031
2016-10 3,077,297 890,269 3,967,383
KBox 3,077,297 2,350,292 5,426,857
0
1000000
2000000
3000000
4000000
5000000
6000000
DBpedia 3.8 2015-10 2016-10 KBox
Enrichment in Triples
# of Korean-local triple # of DBO tr iple
Semantic Web Research Center
Essential Tasks for KB Population
9
RDFTriplesText
Entity Linking
K-Box
relation extraction
Learning
Extracting
KB Completion
Entity Discovery
L-Box Co-reference resolution Crowdsourcing
Distant Supervision
Semantic Web Research Center
Example of Knowledge Learning / Extracting
10
Relation Extraction Entity Linking
Semantic Web Research Center
Problem and solutions in KB PopulationHow to make a reliable knowledge base population system?
11
� � � � � �� �
� � � �
� � �
� � � � � � �
Problems: Solutions:
� , , à�
, - � � � �- - �, � - �, à �
, � ,- �- � �, � �
, , ,� -
, � , � ,� �, ��
, ,
Semantic Web Research Center
Distant Supervision (Relation Extraction)
• Distant Supervision (Mintz et al. 2009)– For a triple fact r(e1, e2) in a KB, all sentences that
mention both entities e1 and e2 are aligned with relation r.
12
Semantic Web Research Center
Automatic labeling between KB and sentence
13
Semantic Web Research Center
Problem : Noise data
14
Wrong label
Semantic Web Research Center
• Average error rate reported in this paper1
– 74.1% (DS data from Wikipedia – YAGO)
– 31.0% (DS data from NYT News – Freebase)
• Intent : Up to 20% performance improvement by
learning models with noise-free data (in our
experiments)
Problem : Noise data
15
1. Ru, C., Tang, J., Li, S., Xie, S., & Wang, T. (2018). Using semantic similarity to reduce wrong labels in distant supervision for relation extraction. Information Processing & Management.
Semantic Web Research Center
Sentence collecting problem and Imbalanced-labeled data
16
0
0.2
0.4
0.6
0.8
1
0
20000
40000
60000
80000
100000
120000
birt
hPla
cese
rvin
gRai
lway
Line
com
putin
gPla
tfor
mpu
blis
her
com
pose
rlic
ense
last
ProM
atch
foun
dedB
yro
uteN
umbe
rca
mpu
sar
chite
ctpi
ctur
eFor
mat
first
Lead
er r1c
gend
erm
ilita
ryU
nit
first
Driv
erTe
amen
gine
oppo
nent
regi
onal
Lang
uage
colle
geho
meh
rca
usal
ties
wor
ldAt
hlet
em
otm
form
erBr
oadc
astN
etw
ork
low
estM
ount
ain
ency
clop
edia
subs
idau
tom
obile
Plat
form
sour
ceCo
nflu
ence
Plac
eal
tBoo
ster
fuel
auth
orM
ask
push
pinM
apCa
ptio
ngr
owin
gGra
pebo
oste
rnam
eop
phav
prop
ella
ntto
talS
hips
Pres
erve
daR
d2Lo
ser
foun
ders
freq
uenc
yOfP
ublic
atio
nfo
llow
edBy
infr
aord
oto
talS
hips
Laid
Up
edith
orav
ioni
csso
urce
1Cou
ntry
hom
cura
tor
KBox
# of total property : 1099# of property that account for 90% of the total : 122
Deep-learning Approach Rule-based Approach
Semantic Web Research Center
Solution 1 : Crowdsourcing
• Goal : Improving the performance of knowledge learning models with human-annotated dataset on 5 tasks
17
� �-�
� �
��
Corpus Target Tasks Now GoalWikipedia Task 1-5 500 Par. 40K Par.
News Task 4 relation extraction 5,482 Par. 60K Par.
(Par. = Paragraph)
Semantic Web Research Center
What makes increase Recall
• Entity Discovery– Finding new dark entities that can be registered to the
KB
• Co-reference resolution– Intent : Up to 21.6% improved knowledge
extraction coverage using Entity Discovery and Co-reference resolution
18
# of sentence
Sentence with no or only one entity.
(Exception from knowledge extraction target)
Sentences that are included in knowledge extraction targets with
Entity Discover / CoreferenceResolution
Coverage
1,435 445 9621.6%
(can be improved)
Crowdsourced Gold Set Analysis Result
Semantic Web Research Center
Solution 2 : Ensemble
• Even a state-of-the-art relation extraction model shows low performance (F1-score, 40-50%).
19
RDFTriplesText
Entity Linking
K-Box
relation extraction
Learning
Extracting
KB Completion
Entity Discovery
L-Box Co-reference resolution Crowdsourcing
Distant Supervision
�
• � � - -
• � -� - -
• � - �)- � � -
• 2 � � ( -� -
• � � � -
Semantic Web Research Center
Full-text Graph
• Purpose– KBP/QA coverage expansion by combining
ontological KB and full-text graph
• Lack of expression– There is a lot of knowledge that can not be expressed
in an ontological relation.
• ISO/TC37/SC4 Proposal: Surface Knowledge graph with Linguistic Content
20
Semantic Web Research Center
Full-text Graph: Example of Application
22
AGF� � G�N9 � � A �H GF� G� 9 � � G �2G (� T_ a WbS R(
)F N � � G9 �)E F F
DBpedia
G9 �)E F F
OH G G N ?A9F HGH
G 2G
occupationnationality
knownFor
G9 �)E F F
G �2G
: GE �9F�OH G
G ? F 9 �1 GG � GF
reach (EUN, E, ACTIVE)
decide (EUN, RO, ACTIVE)
A �E9F AF_ (_, RO, _)
born (EUN, RO, PASSIVE)
_ (_, E-SEO, _)
Surface(Full-text) Graph
G 2G
G N29 9?
. ::99
award
inspired (_, EUL, PASSIVE)
A BG 9F FQ �G AF?�G �
F 9F
knownForGE )M P
Hard to get an exact answer.
Semantic Web Research Center
Application: www.OKBQA.org(Open Knowledge Base and Question Answering)
23
KB
English ver.
Framework Setting(Which module to use)
Knowledge Base Setting(Virtuoso Endpoint + Graph IRI)