22
Enriching DBpedia by Knowledge Base Population and Dark Entity Resolution Key-Sun Choi [email protected] School of Computing Korea Advanced Institute of Science & Technology (www.kaist.edu)

Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Enriching DBpedia by Knowledge Base Population and Dark Entity Resolution

Key-Sun [email protected]

School of ComputingKorea Advanced Institute of Science & Technology

(www.kaist.edu)

Page 2: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Overview of Fact Triplet Extraction

2

2

Natural LanguageSentence à L2K

Bullet list à B2K

• nationality(Ernest_Hemingway, Unites_States)• occupation(Ernest_Hemingway, Novelist)• occupation(Ernest_Hemingway, Journalist)

• notableWork(Ernest_Hemingway, The_Torrents_of_Spring)• notableWork(Ernest_Hemingway, The_Sun_Also_Rises)

Subject Relation Object

Ernest_Hemingway

award Nobel_Prize_in_Literature

birthPlace United_States

notableWork

The_Old_Man_and_the_Sea

For_Whom_the_Bells_Tolls

A_Farewell_to_Arms

Learning

Extraction

Learning

Page 3: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Goal and Challenges

• Goal– Constructing an iterative learning platform based on

Distant Supervision to improve knowledge learning quality

• Challenges– 1) How to build a good quality & quantity initial knowledge

base?– 2) How to make a reliable knowledge base population

system?

3

Page 4: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Machine need to learn from text

• Corpus– Wikipedia, News, Blog, Twitter, Dialog, …

• Knowledge Base– DBpeida, Freebase, Wikidata, …

4

Page 5: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

KBox : Extended DBpedia

• KBox is a new KB that expands Korean DBpedia.

– follows the DBpedia ontology schema.

– continually expands much of the knowledge extracted from text using the Korean KBP system.

5

Knowl. Evolver

Entity-TypeResolution

LBox

Relation Extractor

CNN

LSTM B2K

Lattice MLN

Entity Linker

Entity Discovery

Knowledge Validation

Textual Data Analyzer

Sentence Segmentation

POS Tagger

Co-reference resolution

Surface Graph Parser

KB Agent

EntitySummarization

NER

Textual Source

KBox

Triple layerRDB layer

��

Crowdsourcing Learning

Frame-Semantic

Parser

DependencyParser

RL

Entity Linking

KB Completion

Iteration Control

Knowledge Seed

T2KConfidence

Filtering

DBpedia

Continuous

expansion

Page 6: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Kbox Initialization, kbox.kaist.ac.kr- Korean enriched DBPedia

6

Korean DBpedia

Convert local property to DBO ontology triple

Triple extraction and mergefrom ENG.Dbpedia using :sameAs link

Type inference using Wikidata

Type inference using automatic type inference system

8,874,922

8,814,984

6,628,484

5,645,729

3,967,566

# of triplesKBox Initial

Page 7: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Kbox Initialization - How to build a good quality & quantity initial knowledge base

• 2-layer Storage– RDB (MySQL) : All the information about all the KBox

triples, such as triple score, extraction module, source sentence.

– Triple Store (Stardog, Virtuoso) : Reliable triples• 1) The initial triples extracted from the DBpedia and Wikidata

– 1-1) Convert local property (prop-ko) to DBO property (출생지àbirthPlace)

– 1-2) Type inference using :sameAs link from EN.DBpedia– 1-3) Triple generation using :sameAs link from EN.Dbpedia and

Wikidata• 2) Automatically extracted triples using the Korean KB

Population with a confidence score above 0.9

7

Page 8: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

KBox Statistics(How to build a good quality & quantity initial knowledge base?)

8

0

200000

400000

600000

800000

1000000

1200000

DBpedia 3.8 2015-10 2016-10 KBox

Enrichment in Entity and Types

# of entities # of types

050000

100000150000200000250000300000350000400000

2016-10 KBox

Enrichment in Typed entities

# of typed entities # of untyped entities

Version # of Korean local triple

# of DBO triple

# of relational

triples

DBpedia 3.8 1,219,355 432,600 1,651,649

2015-10 2,569,432 738,599 3,308,031

2016-10 3,077,297 890,269 3,967,383

KBox 3,077,297 2,350,292 5,426,857

0

1000000

2000000

3000000

4000000

5000000

6000000

DBpedia 3.8 2015-10 2016-10 KBox

Enrichment in Triples

# of Korean-local triple # of DBO tr iple

Page 9: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Essential Tasks for KB Population

9

RDFTriplesText

Entity Linking

K-Box

relation extraction

Learning

Extracting

KB Completion

Entity Discovery

L-Box Co-reference resolution Crowdsourcing

Distant Supervision

Page 10: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Example of Knowledge Learning / Extracting

10

Relation Extraction Entity Linking

Page 11: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Problem and solutions in KB PopulationHow to make a reliable knowledge base population system?

11

� � � � � �� �

� � � �

� � �

� � � � � � �

Problems: Solutions:

� , , à�

, - � � � �- - �, � - �, à �

, � ,- �- � �, � �

, , ,� -

, � , � ,� �, ��

, ,

Page 12: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Distant Supervision (Relation Extraction)

• Distant Supervision (Mintz et al. 2009)– For a triple fact r(e1, e2) in a KB, all sentences that

mention both entities e1 and e2 are aligned with relation r.

12

Page 13: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Automatic labeling between KB and sentence

13

Page 14: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Problem : Noise data

14

Wrong label

Page 15: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

• Average error rate reported in this paper1

– 74.1% (DS data from Wikipedia – YAGO)

– 31.0% (DS data from NYT News – Freebase)

• Intent : Up to 20% performance improvement by

learning models with noise-free data (in our

experiments)

Problem : Noise data

15

1. Ru, C., Tang, J., Li, S., Xie, S., & Wang, T. (2018). Using semantic similarity to reduce wrong labels in distant supervision for relation extraction. Information Processing & Management.

Page 16: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Sentence collecting problem and Imbalanced-labeled data

16

0

0.2

0.4

0.6

0.8

1

0

20000

40000

60000

80000

100000

120000

birt

hPla

cese

rvin

gRai

lway

Line

com

putin

gPla

tfor

mpu

blis

her

com

pose

rlic

ense

last

ProM

atch

foun

dedB

yro

uteN

umbe

rca

mpu

sar

chite

ctpi

ctur

eFor

mat

first

Lead

er r1c

gend

erm

ilita

ryU

nit

first

Driv

erTe

amen

gine

oppo

nent

regi

onal

Lang

uage

colle

geho

meh

rca

usal

ties

wor

ldAt

hlet

em

otm

form

erBr

oadc

astN

etw

ork

low

estM

ount

ain

ency

clop

edia

subs

idau

tom

obile

Plat

form

sour

ceCo

nflu

ence

Plac

eal

tBoo

ster

fuel

auth

orM

ask

push

pinM

apCa

ptio

ngr

owin

gGra

pebo

oste

rnam

eop

phav

prop

ella

ntto

talS

hips

Pres

erve

daR

d2Lo

ser

foun

ders

freq

uenc

yOfP

ublic

atio

nfo

llow

edBy

infr

aord

oto

talS

hips

Laid

Up

edith

orav

ioni

csso

urce

1Cou

ntry

hom

cura

tor

KBox

# of total property : 1099# of property that account for 90% of the total : 122

Deep-learning Approach Rule-based Approach

Page 17: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Solution 1 : Crowdsourcing

• Goal : Improving the performance of knowledge learning models with human-annotated dataset on 5 tasks

17

� �-�

� �

��

Corpus Target Tasks Now GoalWikipedia Task 1-5 500 Par. 40K Par.

News Task 4 relation extraction 5,482 Par. 60K Par.

(Par. = Paragraph)

Page 18: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

What makes increase Recall

• Entity Discovery– Finding new dark entities that can be registered to the

KB

• Co-reference resolution– Intent : Up to 21.6% improved knowledge

extraction coverage using Entity Discovery and Co-reference resolution

18

# of sentence

Sentence with no or only one entity.

(Exception from knowledge extraction target)

Sentences that are included in knowledge extraction targets with

Entity Discover / CoreferenceResolution

Coverage

1,435 445 9621.6%

(can be improved)

Crowdsourced Gold Set Analysis Result

Page 19: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Solution 2 : Ensemble

• Even a state-of-the-art relation extraction model shows low performance (F1-score, 40-50%).

19

RDFTriplesText

Entity Linking

K-Box

relation extraction

Learning

Extracting

KB Completion

Entity Discovery

L-Box Co-reference resolution Crowdsourcing

Distant Supervision

• � � - -

• � -� - -

• � - �)- � � -

• 2 � � ( -� -

• � � � -

Page 20: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Full-text Graph

• Purpose– KBP/QA coverage expansion by combining

ontological KB and full-text graph

• Lack of expression– There is a lot of knowledge that can not be expressed

in an ontological relation.

• ISO/TC37/SC4 Proposal: Surface Knowledge graph with Linguistic Content

20

Page 21: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Full-text Graph: Example of Application

22

AGF� � G�N9 � � A �H GF� G� 9 � � G �2G (� T_ a WbS R(

)F N � � G9 �)E F F

DBpedia

G9 �)E F F

OH G G N ?A9F HGH

G 2G

occupationnationality

knownFor

G9 �)E F F

G �2G

: GE �9F�OH G

G ? F 9 �1 GG � GF

reach (EUN, E, ACTIVE)

decide (EUN, RO, ACTIVE)

A �E9F AF_ (_, RO, _)

born (EUN, RO, PASSIVE)

_ (_, E-SEO, _)

Surface(Full-text) Graph

G 2G

G N29 9?

. ::99

award

inspired (_, EUL, PASSIVE)

A BG 9F FQ �G AF?�G �

F 9F

knownForGE )M P

Hard to get an exact answer.

Page 22: Enriching DBpediaby Knowledge Base Population and Dark ... · triples, such as triple score, extraction module, source sentence. –Triple Store (Stardog, Virtuoso) : Reliable triples

Semantic Web Research Center

Application: www.OKBQA.org(Open Knowledge Base and Question Answering)

23

KB

English ver.

Framework Setting(Which module to use)

Knowledge Base Setting(Virtuoso Endpoint + Graph IRI)