LOD2 KOREA: Towards Publishing Korean Linked Data on the Web
Key-Sun Choi
Joint work with Martin Rezk, Jungyeul Park, Yoon Yongun, Kyungtae Lim, YoungGyun Hahm
Key-Sun Choi - Personal History
• NEC C&C Lab. – PIVOT Japanese-Korean Machine Translation
• Korean Part-of-Speech Tagset, Corpus, Dictionary
• CoreNet (Korean-Chinese-Japanese) Semantic Wordnet (2004)
• KORTERM: Korea Terminology Research Center for Language and Knowledge Engineering (1998-2007), Research Center of Ministry of Culture
• KAIST Research Grand Award (1998)
• ISO/TC37/SC4 Founding member (Language Resource Management Standards)
• ISWC 2007 PC Co-Chair (International Semantic Web Conference)
• AFNLP President (2009-2010)
• DBpedia Korea: http://ko.dbpedia.org/
• http://lod2.eu/ partner (EU FP7)
Key-Sun Choi - LOD2 Korea
NLP2RDF
• Triple in natural language
  • Subject
  • Object
  • Predicate
• Extract from sentences
  • 野生種의 장미는 主로 北半球의 溫帶와 寒帶 地方에 分布한다.
  • Wild roses are distributed mainly in the temperate and frigid zones of the Northern Hemisphere.
1. Subject: 장미 (rose)
2. Object: 북반구의 온대 지방, 한대 지방 (temperate and frigid zones of the Northern Hemisphere)
3. Predicate: 分布 (isDistributedAt)
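The subject/predicate/object breakdown above can be sketched as a small data structure. This is only an illustration: the `Triple` type, the `to_ntriples` helper, the `http://example.org/` base and the simplified English names are assumptions made for the example, not part of the NLP2RDF tools.

```python
from typing import NamedTuple

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


class Triple(NamedTuple):
    """One subject-predicate-object statement extracted from a sentence."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple, base: str = BASE) -> str:
    """Serialize a triple as one N-Triples-style line, minting URIs under `base`."""
    s, p, o = (f"<{base}{part}>" for part in triple)
    return f"{s} {p} {o} ."


# Hand-written result for the rose sentence: one triple per object.
triples = [
    Triple("rose", "isDistributedAt", place)
    for place in ("TemperateZone", "FrigidZone")
]
lines = [to_ntriples(t) for t in triples]
for line in lines:
    print(line)
```

A sentence with a coordinated object, as here, yields one triple per object rather than a single triple with a list value.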
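[Figure: concept map of the embedded-software domain. Concepts: software system, embedded software, embedded system, home appliance, operating system, middleware, embedded operating system, platform, application program, development environment, communication middleware, browser, media player, DVD player, set-top box, MP3 player, digital camera, manufacturer, real-time embedded operating system (RTOS), non-real-time embedded operating system. Instances: VxWorks, pSOS, VRTX, WinCE, Microsoft, Wind River. Relations: consists_of, reside_on, manufacturer.]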
NLP2RDF
The output of NLP tools
Sentence: ‘Barack Obama is the President of the United States’
• Barack Obama: URI = sen1word1 (unique within the document), <POStag> NNG </POStag>, ...
• “KNIF” Wrapper
• <Conceptual Layer> <DBpedia> (based on DBpedia Ontology)
• Barack Obama: URI = dbpedia12415 (conceptually unique), <Career> President, <Nationality> United States, <Party> Democrats, ...
• LOD algorithm
For this work
1. For RDF mapping
• Triples and URIs
• Ontology
  • String Ontology
  • Structured Sentence Ontology
  • NIF and the Korean language
2. For LOD mapping
• URI for each DBpedia entity
• Mapping words in text to DBpedia
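Parse tree to summary
• 물체의 낙하 거리는 시간의 제곱에 비례한다 (The falling distance of an object is proportional to the square of time)
<Triple>
1. Subject
• 물체의 낙하거리 (falling distance of an object)
2. Predicate
• 비례한다 (is proportional to)
3. Contents
• 시간의 제곱 (square of time)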
Why NLP? Why syntax and semantics?
• Advanced technology on the higher-level layers
NLP Layer Cake
Semantic Web vs. NLP layer cake
Example: 철수가 2층에 있는 세미나실을 예약한다. (John reserves the seminar room on the 2nd floor.)
Gloss: John-SUBJ 2-floor-LOC room-OBJ reserve-FIN
• Discourse: John: X1, room: L2
• Syntactic structure: subject, object, predicate
• Phrase: room on the 2nd floor
• Semantic tagging: [John: Human], [2-FL: Loc], [seminar-room: Room]
• Morph. analysis: 철수 + 가 // 2 + 층 + 에 // ...
• POS tagging: NPP/JOSA // Numeral/...
• Tokenization: 철수가 // 2층에 // ...
• String: URI encoding
How to develop parser and semantic classifier creatively?
• Open-source NLP tools
  • Rich English and Japanese open tools/resources
  • A few Korean tools
• How to adapt Korean tools to the already developed tools
• Already developed Korean language resources
  • KAIST tools/resources
  • KAIST open source on SourceForge and the web
  • Cambridge University Press: NLP textbook (in progress)
• Linked Data – http://lod2.eu/ partner
Background
• The idea of linking data from different sources is not new:
  • Network Database Model: 1970s
  • Linked Data: today
• The goal is to facilitate sharing and re-using information.
• Linked Data aims to extend the Web with data commons by creating typed links between data from different sources
Background
• These links are usually modeled using the Resource Description Framework (RDF)
• Each piece of data is identified with a URI
• The first task towards linking data is to identify which resources and which properties we want to describe
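As a rough sketch of these ideas, the snippet below stores two small sources as sets of (subject, predicate, object) triples and connects them with a typed link. The `ko.example.org` / `en.example.org` URIs, the ontology terms under `example.org` and the `describe` helper are hypothetical; only the `owl:sameAs` and `rdf:type` URIs are standard vocabulary.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical namespaces standing in for two independent data sources.
KO = "http://ko.example.org/resource/"
EN = "http://en.example.org/resource/"
ONT = "http://example.org/ontology/"

source_ko = {(KO + "Rose", RDF_TYPE, ONT + "Plant")}
source_en = {(EN + "Rose", ONT + "family", EN + "Rosaceae")}

# The typed link: both URIs identify the same real-world thing.
links = {(KO + "Rose", OWL_SAME_AS, EN + "Rose")}


def describe(uri, graphs, links):
    """Collect statements about `uri` and about anything it is linked to via owl:sameAs."""
    same = {uri} | {o for s, p, o in links if s == uri and p == OWL_SAME_AS}
    return sorted(t for g in graphs for t in g if t[0] in same)


statements = describe(KO + "Rose", [source_ko, source_en], links)
for s, p, o in statements:
    print(s, p, o)
```

Following the link lets a consumer of the first source reuse statements published by the second one without copying them.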
Introduction
• NLP2RDF is a LOD2 Community project that is developing the NLP Interchange Format (NIF)
• NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
• The output of NLP tools can be converted into RDF and used in the LOD2 Stack
• http://nlp2rdf.org
NIF…
• Is based on RDF/OWL
• Enables users to annotate for several languages in a uniform way
• Enables users to query text documents with SPARQL (e.g. http://semanticweb.kaist.ac.kr/nlp2rdf/)
• Sentence: 다크나이트는 미국의 영화이다.
• The Dark Knight is an American film.
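To give a feel for what querying annotated text means, here is a minimal pure-Python pattern matcher over a hand-written set of triples for the example sentence. It is a stand-in for a SPARQL engine, not the demo's implementation: the `doc#w…` identifiers and the tag assignments are assumptions, and the property names `sso:posTag` and `str:anchorOf` are borrowed from the English demo output shown elsewhere in these slides.

```python
# Hand-assigned annotations for three words of the example sentence (illustrative).
graph = [
    ("doc#w1", "str:anchorOf", "다크나이트"),
    ("doc#w1", "sso:posTag", "NNP"),
    ("doc#w2", "str:anchorOf", "미국"),
    ("doc#w2", "sso:posTag", "NNP"),
    ("doc#w3", "str:anchorOf", "영화"),
    ("doc#w3", "sso:posTag", "NNG"),
]


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def strings_with_tag(graph, tag):
    """Roughly: SELECT ?text WHERE { ?w sso:posTag <tag> ; str:anchorOf ?text }"""
    texts = []
    for word, _, _ in match(graph, p="sso:posTag", o=tag):
        texts.extend(text for _, _, text in match(graph, s=word, p="str:anchorOf"))
    return texts


proper_nouns = strings_with_tag(graph, "NNP")
print(proper_nouns)
```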
NIF Wrapping
• NLP Interchange Format (NIF) is an RDF/OWL-based format that allows several Natural Language Processing (NLP) tools to be combined and chained in a flexible, light-weight way.
Sebastian Hellmann, AKSW, Universität Leipzig, NLP Interchange Format (NIF)
Structure of NLP2RDF
• NLP Layer
• Interchange Layer
• Data Layer
Example of NLP Layer
[Figure: English NLP layer. Labels: input sentence, tokenization, CFG parser, dependency parser.]
How to create RDF from NLP output
[Figure: raw texts → NLP tools → output → NIF wrapper → RDF. Example process: the sentence "My dog also likes eating sausage." handled by StanfordWrapper.Java.]
Example of NLP2RDF in English
• http://nlp2rdf.lod2.eu/demo.php
• Sentence: Obama is the president of USA.
<http://prefix.given.by/theClient#offset_0_5>
    sso:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    sso:posTag "NNP" ;
    sso:lemma "Obama" ;
    str:referenceContext <http://prefix.given.by/theClient#offset_0_30> ;
    str:anchorOf "Obama" ;
    rdf:type sso:Word , str:String .
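The offset-style identifiers in the example output (`offset_0_5` for "Obama", `offset_0_30` for the whole sentence) can be reproduced with a few lines of code. This is a minimal sketch under assumptions: the regular-expression tokenizer is a stand-in for a real NLP tool, and the `offset_uris` helper is hypothetical rather than taken from the NLP2RDF wrappers.

```python
import re

PREFIX = "http://prefix.given.by/theClient#"


def offset_uris(text, prefix=PREFIX):
    """Yield (uri, token) pairs, naming each token by its character offsets."""
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        yield f"{prefix}offset_{m.start()}_{m.end()}", m.group()


sentence = "Obama is the president of USA."
context_uri = f"{PREFIX}offset_0_{len(sentence)}"
pairs = list(offset_uris(sentence))

for uri, token in pairs:
    # Emit a Turtle-like description of each token and its sentence context.
    print(f'<{uri}> str:anchorOf "{token}" ; str:referenceContext <{context_uri}> .')
```

Because the identifier is derived purely from character positions, any tool that sees the same text mints the same URI for the same span.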
Korean NLP2RDF
• Resources: morphemes, words (eojeols) and sentences in Korean
• Properties: POS, grammatical roles, etc.
• Problems to solve:
  • Linguistic modeling (OLiA)
  • Processing Korean text (NLP)
  • How to produce and query RDF
Linguistic Modeling (1)
• We use OLiA (Ontologies of Linguistic Annotation) to link the Sejong tagset with language-independent reference concepts.
  • The Sejong tagset is the de facto standard tagset for Korean
• OLiA consists of three different ontologies:
  • the OLiA reference model (language-independent),
  • the OLiA annotation model (depends on the tagset),
  • the OLiA linking model (depends on the tagset).
• We developed a fragment of these last two ontologies for Korean, that is, for the Sejong tagset.
Linguistic Modeling (2)
• We use NIF (NLP Interchange Format) to
  • standardize the input/output of the different tools to ease the connection among them, and to
  • uniquely identify (parts of) text, entities and relationships.
• NIF provides two URI schemes to identify resources
  • Offset-based
  • Hash-based
• In our application we opt for the hash-based scheme
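The difference between the two schemes can be illustrated with a simplified sketch. Note the assumptions: the exact hash recipe is defined by the NIF specification and is not reproduced here; the context length of 10, the digest message layout, the `example.org` prefix and the helper names are all choices made for this example.

```python
import hashlib

PREFIX = "http://example.org/doc#"  # hypothetical document prefix


def span(text, word):
    """Return (begin, end) character offsets of the first occurrence of `word`."""
    begin = text.index(word)
    return begin, begin + len(word)


def offset_uri(prefix, begin, end):
    """Offset-based identifier: depends on where the string sits in the text."""
    return f"{prefix}offset_{begin}_{end}"


def hash_uri(prefix, text, begin, end, context_len=10):
    """Simplified hash-based identifier: digest of the string plus nearby context."""
    left = text[max(0, begin - context_len):begin]
    right = text[end:end + context_len]
    message = f"{left}({text[begin:end]}){right}"
    digest = hashlib.md5(message.encode("utf-8")).hexdigest()
    return f"{prefix}hash_{context_len}_{end - begin}_{digest}"


plain = "다크나이트는 미국의 영화이다."
shifted = "참고: " + plain  # same sentence with text inserted in front
WORD = "영화이다"

for text in (plain, shifted):
    b, e = span(text, WORD)
    print(offset_uri(PREFIX, b, e), hash_uri(PREFIX, text, b, e))
```

In this sketch the offset-based URI changes when text is inserted earlier in the document, while the hash-based URI stays the same as long as the string and its immediate context are unchanged. That stability is one plausible reason to prefer the hash-based scheme.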
Korean NLP2RDF Platform
Pipeline: raw text → morpheme analyzer → parser → wrapper → NIF output
Morpheme analyzer: HanNanum
• Korean open-source morpheme analyzer
• Developed by SWRC, KAIST
Parser: Korean Berkeley Parser
• Training set: modified Sejong Treebank (DongHyun Choi, Jungyeul Park, Key-Sun Choi, Korean Treebank Transformation for Parser Training, ACL - SPMRL 2012)
• F1-score: 82.12%
Produce triples
• Use OLiA (Ontologies of Linguistic Annotation) to link the Korean tagsets with language-independent reference concepts
• The OLiA annotation model and the OLiA linking model produce triples using the Sejong tagset
[Figure: system architecture. Labels: input Korean sentence; Korean NLP (morphological analyzer, CFG parser, dependency parser); Korean grammar; Korean language information; parsed result (URI, tag); database; OnTop framework; RDF generator; mappings; ontologies; RDF triples; SPARQL query; SPARQL query handler.]
NIF Output
• Each piece of data is identified with a URI (hash-based)
• Resources: morphemes, words (eojeols), sentences in Korean
• Properties: POS tag, grammatical roles, etc.
[Figure: parsing results and some produced triples.]
Demo site: http://semanticweb.kaist.ac.kr/nlp2rdf
NIF Output
• Sentence: 이탈리아에서 공부하고 온 마틴은 한국을 사랑합니다.
• Martin, who came from Italy after studying there, loves Korea.
Specific Issues of Korean
1. Korean tagset
2. Linking with OLiA
Parser output:
1. String
2. Word, sentence, phrase, ...
3. Tag
4. ...
Ontology:
1. String Ontology
2. Structured Sentence Ontology (SSO)
3. OLiA
4. Penn
NLP2RDF: produce triples (Sejong tag set) → RDF output
Tag | Sejong (superclass LinguisticAnnotation/Tag/) | OLiA (superclass LinguisticConcept/MorphosyntacticCategory/)
MA | Adverb | Adverb
MAJ | Adverb/ConjunctiveAdverb | Adverb and Conjunction/CoordinatingConjunction
MAG | Adverb/GeneralAdverb | Adverb
SN, XN | CardinalNumber | Quantifier/Numeral
MM | Determiner | PronounOrDeterminer/Determiner
SH, SL | ForeignWord | Residual/Foreign
IC | Interjection | Interjection
NA, NF, NN | Noun | Noun
XR | Noun/BaseMorpheme | Noun/CommonNoun
NNB, NNG | Noun/CommonNoun | Noun/CommonNoun
NNP | Noun/ProperNoun | Noun/ProperNoun
NP | Pronoun | PronounOrDeterminer/Pronoun
SE, SF, SO, SP, SS | Symbol | Punctuation
NV, V | Verb | Verb
VA | Verb/Adjective | Verb and Adjective/PredicativeAdjective
VX | Verb/AuxiliaryPredicate | Verb/AuxiliaryVerb
VC, VCN, VCP | Verb/Copula | Verb/FiniteVerb
VV | Verb/VerbalPredicate | Verb
E, JK, XP, XS | Particle | MorphologicalCategory/morpheme/
JC, JX | Particle/AuxiliaryPostposition | MorphologicalCategory/morpheme/MorphologicalParticle
JKB, JKC, JKG, JKO, JKQ, JKS, JKV | Particle/CaseMarker | MorphologicalCategory/morpheme/MorphologicalParticle
XPN | Particle/Prefix | MorphologicalCategory/morpheme/prefix
XSA, XSN, XSV | Particle/Suffix | MorphologicalCategory/morpheme/suffix
EC, EF, EP, ETM, ETN | Particle/VerbalEnding | MorphologicalCategory/morpheme/suffix
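A mapping like the one in this table can be applied with a plain lookup. The sketch below copies a handful of rows into a dictionary; the `SEJONG_TO_OLIA` name, the `annotate` helper and the hand-tagged morphemes are assumptions for illustration, whereas the actual framework expresses the mapping as OLiA annotation and linking ontologies.

```python
# Subset of the table: Sejong tag -> (Sejong class, OLiA reference category).
PARTICLE = "MorphologicalCategory/morpheme/MorphologicalParticle"
SEJONG_TO_OLIA = {
    "MAG": ("Adverb/GeneralAdverb", "Adverb"),
    "NNG": ("Noun/CommonNoun", "Noun/CommonNoun"),
    "NNP": ("Noun/ProperNoun", "Noun/ProperNoun"),
    "NP": ("Pronoun", "PronounOrDeterminer/Pronoun"),
    "VV": ("Verb/VerbalPredicate", "Verb"),
    "VCP": ("Verb/Copula", "Verb/FiniteVerb"),
    "JKS": ("Particle/CaseMarker", PARTICLE),
    "JX": ("Particle/AuxiliaryPostposition", PARTICLE),
    "EF": ("Particle/VerbalEnding", "MorphologicalCategory/morpheme/suffix"),
}


def olia_category(tag):
    """Return the OLiA reference category for a Sejong tag, or None if unmapped."""
    entry = SEJONG_TO_OLIA.get(tag)
    return entry[1] if entry else None


def annotate(morphemes):
    """Attach the OLiA category to each (form, Sejong tag) pair."""
    return [(form, tag, olia_category(tag)) for form, tag in morphemes]


# Hand-tagged morphemes of the eojeol 영화이다 ("is a film"); illustrative only.
annotated = annotate([("영화", "NNG"), ("이", "VCP"), ("다", "EF")])
for row in annotated:
    print(row)
```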
Conclusions:
• We presented a framework that allows
  • processing Korean text,
  • efficiently producing RDF triples, and
  • querying the output of the NLP tools
• The RDF output of our framework is compliant with NIF (NLP Interchange Format) and the OLiA ontologies, to facilitate its combination with other NLP tools
• Future:
  • complete the development of the language-dependent part of the OLiA ontologies,
  • include the missing features required by NIF,
  • allow richer SPARQL queries, and
  • disambiguate the different entities in the text and link them with Wikipedia articles.
Issues
• DBpedia
  • How to link the produced triples with DBpedia triples
• Josa (postposition case marker)
  • Korean-specific grammatical feature
Sentence: 다크나이트는 미국의 영화이다.
Sentence: The Dark Knight is an American film.
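The two issues interact: a surface word such as 다크나이트는 carries the particle 는, so it does not match an entity label until the particle is separated. The sketch below shows the idea with naive suffix stripping. Everything in it is an assumption for illustration: the short `JOSA` list, the `strip_josa` and `link` helpers, and the `example.org` URIs; a real system would rely on a morphological analyzer rather than string suffixes.

```python
# A few common particles (josa); far from complete.
JOSA = ("에서", "은", "는", "이", "가", "을", "를", "의", "에")

# Hypothetical label-to-URI table standing in for a DBpedia lookup.
LABELS = {
    "다크나이트": "http://example.org/resource/The_Dark_Knight",
    "미국": "http://example.org/resource/United_States",
}


def strip_josa(eojeol):
    """Split a trailing particle off an eojeol; returns (stem, particle)."""
    for josa in sorted(JOSA, key=len, reverse=True):  # try longer particles first
        if eojeol.endswith(josa) and len(eojeol) > len(josa):
            return eojeol[:-len(josa)], josa
    return eojeol, ""


def link(eojeol):
    """Look up the stem of an eojeol in the label table; None if there is no match."""
    stem, _ = strip_josa(eojeol)
    return LABELS.get(stem)


for word in "다크나이트는 미국의 영화이다".split():
    print(word, strip_josa(word), link(word))
```

Suffix stripping is fragile: 고양이 ("cat") ends in 이 and would be wrongly cut to 고양, which is why particle handling is better left to a morphological analyzer.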
Source
• OnTop: https://babbage.inf.unibz.it/trac/obdapublic/wiki/ObdalibPluginIntro
• Demo site for Korean: http://semanticweb.kaist.ac.kr/nlp2rdf
• Demo site for English: http://nlp2rdf.lod2.eu/demo.php
• NLP2RDF: http://nlp2rdf.org
Key-Sun Choi, Mun-Yong Yi, In-Young Koh, Younghee Lee (CS/WebST, Knowledge Service Eng., CS/WebST, CS)
Tony Veale (Invited Professor, Computational Creativity)
Yoon, Yong-Un (Research Professor, NLP+DB)
Martin Rezk (Postdoctoral Researcher, Logic)
Park, Jung-Yeol (Researcher, parser)
Lee, Jae-Sung (Professor, morphology and word)
Graduate students: Soon-Gil Hong, Young-Gyun Hahm, Kyungtae Lim, Se-Mi Jang, Youngho Jeong, …
http://ko.dbpedia.org/
http://semanticweb.kaist.ac.kr
kschoi@kaist.ac.kr