LOD2 KOREA : Towards Publishing Korean Linked Data on the Web Key-Sun Choi Joint work with Martin Rezk Jungyeul Park Yoon Yongun Kyungtae Lim YoungGyun Hahm


Page 1: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web

LOD2 KOREA: Towards Publishing Korean Linked Data on the Web

Key-Sun Choi

Joint work with Martin Rezk, Jungyeul Park, Yoon Yongun, Kyungtae Lim, and YoungGyun Hahm

Page 2: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web

Key-Sun Choi - Personal History
• NEC C&C Lab.: PIVOT Japanese-Korean Machine Translation
• Korean Part-of-Speech Tagset, Corpus, Dictionary
• CoreNet (Korean-Chinese-Japanese) Semantic Wordnet (2004)
• KORTERM: Korea Terminology Research Center for Language and Knowledge Engineering (1998-2007), Research Center of Ministry of Culture
• KAIST Research Grand Award (1998)
• ISO/TC37/SC4 Founding member (Language Resource Management Standards)
• ISWC 2007 PC Co-Chair (International Semantic Web Conference)
• AFNLP President (2009-2010)
• DBpedia Korea http://ko.dbpedia.org/
• http://lod2.eu/ partner (EU FP7)

Page 3: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web

Key-Sun Choi - LOD2 Korea

NLP2RDF
• Triple in Natural Language
  • Subject
  • Object
  • Predicate
• Extract from Sentences
  • 野生種의 장미는 主로 北半球의 溫帶와 寒帶 地方에 分布한다.
  • Wild roses are distributed mainly in the temperate and frigid zones of the Northern Hemisphere.
  1. Subject: 장미 (rose)
  2. Object: 북반구의 온대 지방, 한대 지방 (Northern Hemisphere, temperate and frigid zones)
  3. Predicate: 分布 (isDistributedAt)
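The subject/predicate/object decomposition of the rose sentence can be sketched in code. This is only an illustration, not the project's implementation: the `Triple` type, the `http://example.org/` namespace, the identifier spellings and the `to_ntriples` helper are all hypothetical names chosen for this example.

```python
from typing import NamedTuple

# Hypothetical namespace, used only for illustration.
EX = "http://example.org/"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as an N-Triples-style line with all three terms as IRIs."""
    return f"<{EX}{t.subject}> <{EX}{t.predicate}> <{EX}{t.obj}> ."


# The wild-rose sentence from the slide, one triple per object.
rose_triples = [
    Triple("rose", "isDistributedAt", "NorthernHemisphereTemperateZone"),
    Triple("rose", "isDistributedAt", "NorthernHemisphereFrigidZone"),
]

for t in rose_triples:
    print(to_ntriples(t))
```

Splitting the compound object into two triples is a design choice of this sketch; the slide itself lists both zones under a single object.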

Page 4: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Page 5: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web

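[Figure: concept graph for the embedded-software domain. Node labels (translated from Korean): software system, embedded software, embedded system, home appliance, operating system, middleware, embedded operating system, platform, application program, development environment, communication middleware, browser, media player, DVD player, set-top box, MP3 player, digital camera, manufacturer, real-time embedded operating system, non-real-time embedded operating system, RTOS, VxWorks, pSOS, VRTX, WinCE, Microsoft, Wind River. Edge labels: consists_of, reside_on, 제조사 (maker).]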

Page 6: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web

NLP2RDF

Sentence: 'Barack Obama is the President of the United States'

The Output of NLP tools
  Barack Obama
  URI = sen1word1 (unique within the document)
  <POStag> NNG </POStag>, ...

"KNIF" Wrapper

<Conceptual Layer>
<DBpedia> (based on DBpedia Ontology)
  Barack Obama
  URI = dbpedia12415 (unique per concept)
  <Career> President
  <Nationality> United States
  <Party> Democrats, ...

LOD algorithm

Page 7: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


For this work
1. For RDF Mapping
  • Triples and URI
  • Ontology
    • String Ontology
    • Structured Sentence Ontology
    • NIF and Korean language
2. For LOD Mapping
  • URI for DBpedia entity
  • Mapping words in text to DBpedia

Page 8: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


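Parse tree to Summary
• 물체의 낙하 거리는 시간의 제곱에 비례한다 (The distance an object falls is proportional to the square of the time.)

<Triple>
1. Subject
  • 물체의 낙하거리 (the distance an object falls)
2. Predicate
  • 비례한다 (is proportional to)
3. Contents
  • 시간의 제곱 (the square of the time)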

Page 9: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Why NLP? Why Syntax and Semantics?

• Advanced technology on the higher-level layers

Page 10: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


NLP Layer Cake


Page 11: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Semantic Web vs. NLP layer cake

Example: 철수가 2층에 있는 세미나실을 예약한다. (John reserves the seminar room on the 2nd floor.)
Gloss: John-SUBJ 2-floor-LOC room-OBJ reserve-FIN

Layers, from top to bottom:
• Discourse: John: X1, room: L2
• Syntactic structure: subject, object, predicate
• Phrase: Room in 2nd floor
• Semantic tagging: [John: Human], [2-FL: Loc], [seminar-room: Room]
• Morph. analysis: +가 // 2+층+에
• POS tagging: NPP/JOSA // Numeral/
• Tokenization: 철수가 // 2층에 //
• String: URI Encoding

Page 12: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


How to develop a parser and a semantic classifier creatively?
• Open Source NLP tools
  • Rich English, Japanese open tools/resources
  • A few Korean tools
• How to adapt Korean tools to the already developed tools
• Already developed Korean language resources
  • KAIST tools/resources
  • KAIST open source in sourceforge and web
  • Cambridge University Press: NLP Textbook (in progress)
• Linked Data: http://lod2.eu/ partner

Page 13: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Background

• The idea of linking data from different sources is not new:
  • Network Database Model: 1970s
  • Linked Data: today
• The goal is to facilitate sharing and re-using information.
• Linked Data aims to extend the Web with a data commons by creating typed links between data from different sources.

Page 14: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Background

• These links are usually modeled using the Resource Description Framework (RDF).
• Each piece of data is identified with a URI.
• The first task towards linking data is to identify which resources and which properties we want to describe.

Page 15: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Introduction
• NLP2RDF is a LOD2 Community project that is developing the NLP Interchange Format (NIF)
• NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
• The output of NLP tools can be converted into RDF and used in the LOD2 Stack
• http://nlp2rdf.org

NIF...
• Is based on RDF/OWL
• Enables users to annotate for several languages in a uniform way
• Enables users to query text documents with SPARQL (example: http://semanticweb.kaist.ac.kr/nlp2rdf/ )
  • Sentence: 다크나이트는 미국의 영화이다.
  • The Dark Knight is an American film.
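To make the idea of querying annotated text concrete, here is a minimal pure-Python sketch of the kind of pattern matching a SPARQL basic graph pattern performs. It is not SPARQL and not the demo site's implementation. The token identifiers (`ex:w1`, ...), the English tokens, their tags and the `match` helper are made up for illustration; `sso:posTag` and `str:anchorOf` are property names that appear in the English NIF example in these slides.

```python
# Toy triple store: each annotation is a (subject, predicate, object) tuple.
# Token IDs and tags are invented for this example.
triples = [
    ("ex:w1", "str:anchorOf", "Dark Knight"),
    ("ex:w1", "sso:posTag", "NNP"),
    ("ex:w2", "str:anchorOf", "American"),
    ("ex:w2", "sso:posTag", "JJ"),
    ("ex:w3", "str:anchorOf", "film"),
    ("ex:w3", "sso:posTag", "NN"),
]


def match(pattern, data):
    """Yield variable bindings for one triple pattern; terms starting with '?' are variables."""
    for triple in data:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding


def proper_nouns(data):
    """Roughly: SELECT ?text WHERE { ?w sso:posTag "NNP" . ?w str:anchorOf ?text }"""
    results = []
    for b in match(("?w", "sso:posTag", "NNP"), data):
        for b2 in match((b["?w"], "str:anchorOf", "?text"), data):
            results.append(b2["?text"])
    return results


print(proper_nouns(triples))
```

A real deployment would hand the query to a SPARQL engine over the RDF produced by the NIF wrapper; the nested loops here only mimic joining two triple patterns on a shared variable.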

Page 16: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Page 17: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


NIF Wrapping
• NLP Interchange Format (NIF) is an RDF/OWL-based format that allows several Natural Language Processing (NLP) tools to be combined and chained in a flexible, light-weight way.

Source: Sebastian Hellmann, AKSW, Universität Leipzig, NLP Interchange Format (NIF)

Page 18: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Structure of NLP2RDF
• NLP Layer
• Interchange Layer
• Data Layer

Page 19: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Example of NLP Layer

[Figure labels: Input Sentence, English NLP, Tokenization, CFG Parser, Dependency Parser]

Page 20: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


How to create RDF from NLP output

[Figure labels: Raw Texts, NLP Tools, output, NIF Wrapper, RDF]

Example: My dog also likes eating sausage.
Process: StanfordWrapper.Java

Page 21: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Example of NLP2RDF in ENG
• http://nlp2rdf.lod2.eu/demo.php
• Sentence: Obama is the president of USA.

<http://prefix.given.by/theClient#offset_0_5>
    sso:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    sso:posTag "NNP" ;
    sso:lemma "Obama" ;
    str:referenceContext <http://prefix.given.by/theClient#offset_0_30> ;
    str:anchorOf "Obama" ;
    rdf:type sso:Word , str:String .
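The offsets in the URIs above are plain character positions: "Obama" spans characters 0 to 5, and the whole sentence spans 0 to 30. The following sketch shows how such descriptions could be generated. It is an assumption-laden illustration, not the demo's code: `offset_uri` and `describe_words` are hypothetical helpers, words are found with a simple regular expression instead of a real tokenizer, and the POS tag, lemma and OLiA link are left out because they require an actual NLP tool.

```python
import re
from typing import List

# Prefix copied from the slide's example; any client-chosen prefix would do.
PREFIX = "http://prefix.given.by/theClient#"


def offset_uri(begin: int, end: int) -> str:
    """Build an offset-style URI of the form <prefix>offset_<begin>_<end>."""
    return f"{PREFIX}offset_{begin}_{end}"


def describe_words(text: str) -> List[str]:
    """Return one Turtle-like description per word (runs of word characters)."""
    context = offset_uri(0, len(text))
    blocks = []
    for m in re.finditer(r"\w+", text):
        blocks.append(
            f"<{offset_uri(m.start(), m.end())}>\n"
            f"    str:referenceContext <{context}> ;\n"
            f'    str:anchorOf "{m.group()}" ;\n'
            f"    rdf:type sso:Word , str:String ."
        )
    return blocks


sentence = "Obama is the president of USA."
for block in describe_words(sentence):
    print(block)
```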

Page 22: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Korean NLP2RDF

• Resources: morphemes, words (eojeols) and sentences in Korean
• Properties: POS, grammatical roles, etc.
• Problems to solve:
  • Linguistic Modeling (OLiA)
  • Processing Korean Text (NLP)
  • How to Produce and Query RDF

Page 23: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Linguistic Modeling (1)
• We use OLiA (Ontologies of Linguistic Annotation) to link the Sejong tagset with language-independent reference concepts.
  • The Sejong tagset is the de facto standard tagset for Korean
• OLiA consists of three different ontologies:
  • the OLiA reference model (language-independent),
  • the OLiA annotation model (depends on the tagset),
  • the OLiA linking model (depends on the tagset).
• We developed a fragment of these last two ontologies for Korean, that is, for the Sejong tagset.

Page 24: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Linguistic Modeling (2)
• We use the NIF (NLP Interchange Format) to
  • standardize the input/output of the different tools to ease the connection among them, and to
  • uniquely identify (parts of) text, entities and relationships.
• NIF provides two URI schemes to identify resources
  • Offset-based
  • Hash-based
• Our application uses the hash-based scheme.
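As a rough illustration of what a hash-based identifier looks like, the sketch below derives a URI from a digest of the annotated string plus a window of surrounding context. It is only loosely modeled on the idea of NIF's hash-based scheme: the exact recipe (what is hashed, which fields appear in the URI and in what order) is an assumption here and should be checked against the NIF specification. The prefix and the `hash_uri` helper are hypothetical.

```python
import hashlib
from urllib.parse import quote

PREFIX = "http://example.org/doc#"  # hypothetical document prefix


def hash_uri(text: str, begin: int, end: int, context_len: int = 10) -> str:
    """Identify text[begin:end] by a digest of the span plus nearby context."""
    span = text[begin:end]
    left = text[max(0, begin - context_len):begin]
    right = text[end:end + context_len]
    digest = hashlib.md5(f"{left}({span}){right}".encode("utf-8")).hexdigest()
    # Short percent-encoded prefix of the span so the URI hints at its content.
    readable = quote(span[:20])
    return f"{PREFIX}hash_{context_len}_{len(span)}_{digest}_{readable}"


sentence = "다크나이트는 미국의 영화이다."
uri = hash_uri(sentence, 0, 6)  # the first eojeol, 다크나이트는
print(uri)
```

In this sketch, text appended beyond the context window does not change the identifier, whereas an offset-based URI for a later word would shift whenever text is inserted before it.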

Page 25: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Korean NLP2RDF Platform

Pipeline: RAW Text → Morpheme Analyzer → Parser → Wrapper → NIF output

Morpheme Analyzer: HanNanum
• Korean Open Source Morpheme Analyzer
• Developed by SWRC, KAIST

Parser: Korean Berkeley Parser
• Training set: Modified Sejong Treebank (DongHyun Choi, Jungyeul Park, Key-Sun Choi, Korean Treebank Transformation for Parser Training, ACL - SPMRL 2012)
• F1-score: 82.12%

Wrapper: Produce triples
• Use OLiA (Ontologies of Linguistic Annotation) to link the Korean tagsets with language-independent reference concepts
• The OLiA annotation model and the OLiA linking model produce triples using the Sejong tagset

Page 26: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


[Figure: system architecture. Labels: Input Korean Sentence; Korean NLP (Morph. Analyzer, CFG Parser, Dependency Parser); Parsed result (URI, Tag); DataBase; OnTop Framework; Mappings; Ontologies; RDF generator; RDF triples; SPARQL Query; SPARQL Query Handler; Korean Grammar; Framework; Korean Language information.]

Page 27: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


NIF Output

• Each piece of data is identified with a URI (hash-based)
• Resources: morphemes, words (eojeols), sentences in Korean
• Properties: POS tag, grammatical roles, etc.

[Figures: parsing results; some produced triples]

DEMO site: http://semanticweb.kaist.ac.kr/nlp2rdf

Page 28: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


NIF Output
• 이탈리아에서 공부하고 온 마틴은 한국을 사랑합니다.
• Martin, who came from Italy after studying there, loves Korea.

Page 29: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Specific Issues of Korean
1. Korean Tagset
2. Linking with OLiA

Parser Output:
1. String
2. Word, Sentence, Phrase, ...
3. Tag
4. ...

Ontology:
1. String Ontology
2. Structured Sentence Ontology (SSO)
3. OLiA
4. Penn

NLP2RDF:
• Produce Triples
• Sejong Tag Set

RDF output

Page 30: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Tag                                Sejong                           OLiA
(superclass)                       LinguisticAnnotation/Tag/        LinguisticConcept/MorphosyntacticCategory/
MA                                 Adverb                           Adverb
MAJ                                Adverb/ConjunctiveAdverb         Adverb and Conjunction/CoordinatingConjunction
MAG                                Adverb/GeneralAdverb             Adverb
SN, XN                             CardinalNumber                   Quantifier/Numeral
MM                                 Determiner                       PronounOrDeterminer/Determiner
SH, SL                             ForeignWord                      Residual/Foreign
IC                                 Interjection                     Interjection
NA, NF, NN                         Noun                             Noun
XR                                 Noun/BaseMorpheme                Noun/CommonNoun
NNB, NNG                           Noun/CommonNoun                  Noun/CommonNoun
NNP                                Noun/ProperNoun                  Noun/ProperNoun
NP                                 Pronoun                          PronounOrDeterminer/Pronoun
SE, SF, SO, SP, SS                 Symbol                           Punctuation
NV, V                              Verb                             Verb
VA                                 Verb/Adjective                   Verb and Adjective/PredicativeAdjective
VX                                 Verb/AuxiliaryPredicate          Verb/AuxiliaryVerb
VC, VCN, VCP                       Verb/Copula                      Verb/FiniteVerb
VV                                 Verb/VerbalPredicate             Verb
E, JK, XP, XS                      Particle                         MorphologicalCategory/morpheme/
JC, JX                             Particle/AuxiliaryPostposition   MorphologicalCategory/morpheme/MorphologicalParticle
JKB, JKC, JKG, JKO, JKQ, JKS, JKV  Particle/CaseMarker              MorphologicalCategory/morpheme/MorphologicalParticle
XPN                                Particle/Prefix                  MorphologicalCategory/morpheme/prefix
XSA, XSN, XSV                      Particle/Suffix                  MorphologicalCategory/morpheme/suffix
EC, EF, EP, ETM, ETN               Particle/VerbalEnding            MorphologicalCategory/morpheme/suffix
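In code, a mapping like the table above amounts to a lookup from a Sejong tag to its OLiA category. The sketch below copies a handful of rows into a dictionary. It is only a simplified illustration: in the framework described here the link is expressed through the OLiA annotation and linking ontologies rather than a hard-coded table, and `SEJONG_TO_OLIA` and `olia_category` are hypothetical names.

```python
# A few rows of the Sejong -> OLiA table, keyed by Sejong tag.
# Values are (Sejong category, OLiA category) as listed in the table.
SEJONG_TO_OLIA = {
    "MAG": ("Adverb/GeneralAdverb", "Adverb"),
    "MM": ("Determiner", "PronounOrDeterminer/Determiner"),
    "NNG": ("Noun/CommonNoun", "Noun/CommonNoun"),
    "NNP": ("Noun/ProperNoun", "Noun/ProperNoun"),
    "NP": ("Pronoun", "PronounOrDeterminer/Pronoun"),
    "VV": ("Verb/VerbalPredicate", "Verb"),
    "VX": ("Verb/AuxiliaryPredicate", "Verb/AuxiliaryVerb"),
    "JKS": ("Particle/CaseMarker",
            "MorphologicalCategory/morpheme/MorphologicalParticle"),
    "EF": ("Particle/VerbalEnding", "MorphologicalCategory/morpheme/suffix"),
}


def olia_category(sejong_tag: str) -> str:
    """Return the OLiA category listed for a Sejong tag; raises KeyError if unmapped."""
    _, olia = SEJONG_TO_OLIA[sejong_tag]
    return olia


for tag in ("NNP", "JKS", "VX"):
    print(tag, "->", olia_category(tag))
```

Raising `KeyError` for unmapped tags is a deliberate choice in this sketch: the language-dependent part of the mapping is described as incomplete, so silent fallbacks would hide gaps.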

Page 31: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Conclusions:
• We presented a framework that allows
  • processing Korean text,
  • efficiently producing RDF triples, and
  • querying the NLP tools' outcome
• The RDF outcome of our framework is compliant with the NIF (NLP Interchange Format) and the OLiA ontologies to facilitate its combination with other NLP tools
• Future:
  • complete the development of the language-dependent part of the OLiA ontologies,
  • include the missing features required by NIF,
  • allow richer SPARQL queries, and
  • disambiguate the different entities in the text and link them with Wikipedia articles.

Page 32: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Issues

• DBpedia
  • How to link between produced triples and DBpedia triples
• Josa (postposition case marker)
  • Korean-specific grammatical feature

Sentence: 다크나이트는 미국의 영화이다.
Sentence: The Dark Knight is an American movie.

Page 33: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web


Source

• OnTop
  • https://babbage.inf.unibz.it/trac/obdapublic/wiki/ObdalibPluginIntro
• Demo site for Korean
  • http://semanticweb.kaist.ac.kr/nlp2rdf
• Demo site for English
  • http://nlp2rdf.lod2.eu/demo.php
• NLP2RDF
  • http://nlp2rdf.org

Page 34: LOD2 KOREA : Towards  Publishing Korean  Linked  Data  on  the  Web

Key-Sun Choi (CS/WebST), Mun-Yong Yi (Knowledge Service Eng.), In-Young Koh (CS/WebST), Younghee Lee (CS)
Tony Veale (Invited Professor, Computational Creativity)
Yoon, Yong-Un (research professor, NLP+DB)
Martin Rezk (postdoctoral researcher, Logic)
Park, Jung-Yeol (researcher, parser)
Lee, Jae-Sung (Professor, morphology and word)

Graduate Students: Soon-Gil Hong, Young-Gyun Hahm, Kyungtae Lim, Se-Mi Jang, Youngho Jeong, …

http://ko.dbpedia.org/
