LOD2 KOREA : Towards Publishing Korean Linked Data on the Web




NLP2RDF

LOD2 KOREA: Towards Publishing Korean Linked Data on the Web
Key-Sun Choi
Joint work with Martin Rezk, Jungyeul Park, Yoon Yongun, Kyungtae Lim, YoungGyun Hahm

Key-Sun Choi - Personal History
NEC C&C Lab.: PIVOT Japanese-Korean Machine Translation
Korean Part-of-Speech Tagset, Corpus, Dictionary
CoreNet (Korean-Chinese-Japanese) Semantic Wordnet (2004)
KORTERM: Korea Terminology Research Center for Language and Knowledge Engineering (1998-2007), Research Center of Ministry of Culture
KAIST Research Grand Award (1998)
ISO/TC37/SC4 Founding member (Language Resource Management Standards)
ISWC 2007 PC Co-Chair (International Semantic Web Conference)
AFNLP President (2009-2010)

DBpedia Korea: http://ko.dbpedia.org/
http://lod2.eu/ partner (EU FP7)

NLP2RDF
Triple in Natural Language
Subject, Object, Predicate: extracted from sentences

Sentence: Wild roses are distributed mainly in the temperate and frigid zones of the northern hemisphere.
Subject: rose
Object: Northern hemisphere, Temperate and Frigid Zones
Predicate: isDistributedAt
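The wild rose example can be written down as data. The following is a minimal sketch, not the authors' extraction code: the `Triple` type and the `triples_for` helper are hypothetical names introduced here, and the English labels stand in for the Korean words on the original slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str


def triples_for(subject, predicate, objects):
    """Build one triple per object, since the sentence names several objects."""
    return [Triple(subject, predicate, o) for o in objects]


# Values taken from the slide; the English glosses replace the Korean originals.
wild_rose = triples_for(
    "rose",
    "isDistributedAt",
    ["Northern hemisphere", "Temperate and Frigid Zones"],
)

for t in wild_rose:
    print(t.subject, t.predicate, t.obj)
```

Emitting one triple per object keeps each statement atomic, which is how RDF expects multi-valued properties to be expressed.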

Key-Sun Choi - LOD2 Korea

[Figure: example graph linking DVD, MP3, RTOS, VxWorks, pSOS, VRTX, WinCE and Wind River through consists_of and reside_on relations]

DB Mapping

NLP2RDF: The Output of NLP tools

Sentence: Barack Obama is the President of the United States
Barack Obama: URI = sen1word1 (unique within the document), NNG, ...

KNIFWrapper (based on DBpedia Ontology)
Barack Obama: URI = dbpedia12415 (unique as a concept), President, United States, Democrats, ...

LOD algorithm
For RDF Mapping
  Triples and URI
  Ontology: String Ontology, Structured Sentence Ontology
  NIF and Korean language
For LOD Mapping
  URI for DBpedia entity
  Mapping Word in Text to DBpedia
For this work

DB update

Parser tree to Summary

Subject

Predicate

Contents

Why NLP? Why Syntactic, Semantics?
Advanced technology on the higher-level layers

NLP Layer Cake

Semantic Web vs. NLP layer cake
Example (gloss): John-SUBJ 2-floor-LOC room-OBJ reserve-FIN
Discourse: John: X1, room: L2
Syntactic structure: subject, object, predicate
Phrase: Room in 2nd floor
Semantic tagging: [John: Human], [2-FL: Loc], [seminar-room: Room]
Morph. Analysis
POS tagging: NPP, JOSA, Numeral
Tokenization
String URI
Encoding

How to develop parser and semantic classifier creatively?
Open Source NLP tools
  Rich English, Japanese open tools/resources
  A few Korean tools
How to adapt Korean tools to the already developed tools
Already developed Korean language resources
  KAIST tools/resources
  KAIST open source in sourceforge and web
Cambridge University Press: NLP Textbook (in progress)
Linked Data: http://lod2.eu/ partner

Background
The idea of linking data from different sources is not new:
  Network Database Model: 70s
  Linked Data: Today
The goal is to facilitate sharing and re-using information.
Linked Data aims to extend the Web with data commons by creating typed links between data from different sources.

Background
These links are usually modeled using the Resource Description Framework (RDF).

Each piece of data is identified with a URI.

The first task towards linking data is to identify which resources and which properties we want to describe

Introduction
NLP2RDF is a LOD2 Community project that is developing the NLP Interchange Format (NIF).
NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
The output of NLP tools can be converted into RDF and used in the LOD2 Stack.
http://nlp2rdf.org

NIF
Is based on RDF/OWL
Enables users to annotate for several languages in a uniform way
Enables users to query text documents with SPARQL (example: http://semanticweb.kaist.ac.kr/nlp2rdf/ )
Sentence: Dark Knight is an American film.
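To make "query text documents" concrete, here is a small in-memory sketch of triple-pattern matching, the basic operation behind a SPARQL SELECT. It is plain Python, not a SPARQL engine and not the KAIST demo. The `ex:` identifiers and the Penn-style POS tags for the Dark Knight sentence are illustrative assumptions; only the property names `str:anchorOf` and `sso:posTag` appear elsewhere in this deck.

```python
# Hypothetical annotations for "Dark Knight is an American film."
# (token identifiers and POS tags are assumptions made for illustration)
TRIPLES = [
    ("ex:w1", "str:anchorOf", "Dark"),     ("ex:w1", "sso:posTag", "NNP"),
    ("ex:w2", "str:anchorOf", "Knight"),   ("ex:w2", "sso:posTag", "NNP"),
    ("ex:w3", "str:anchorOf", "is"),       ("ex:w3", "sso:posTag", "VBZ"),
    ("ex:w4", "str:anchorOf", "an"),       ("ex:w4", "sso:posTag", "DT"),
    ("ex:w5", "str:anchorOf", "American"), ("ex:w5", "sso:posTag", "JJ"),
    ("ex:w6", "str:anchorOf", "film"),     ("ex:w6", "sso:posTag", "NN"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples that fit a pattern; None acts as a wildcard (a query variable)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def words_with_tag(triples, tag):
    """Roughly: SELECT ?text WHERE { ?w sso:posTag tag ; str:anchorOf ?text }."""
    subjects = [s for s, _, _ in match(triples, p="sso:posTag", o=tag)]
    return [o for w in subjects for _, _, o in match(triples, s=w, p="str:anchorOf")]


proper_nouns = words_with_tag(TRIPLES, "NNP")
print(proper_nouns)
```

A real deployment would load the RDF into a triple store and send the query to a SPARQL endpoint; the join on the shared subject is the same idea.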


NIF Wrapping
NLP Interchange Format (NIF) is an RDF/OWL-based format that allows several Natural Language Processing (NLP) tools to be combined and chained in a flexible, lightweight way.
Sebastian Hellmann, AKSW, Universität Leipzig, NLP Interchange Format (NIF)

Structure of NLP2RDF
NLP Layer, Interchange Layer, Data Layer

Example of NLP Layer
English NLP: Input Sentence, Tokenization, CFG Parser, Dependency Parser

How to create RDF from NLP output
Raw texts -> NLP tools -> output -> NIF Wrapper -> RDF

Example sentence: My dog also likes eating sausage.
Wrapper: StanfordWrapper.Java
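The wrapping step can be sketched as follows. This is a Python illustration of the idea only. It is not StanfordWrapper.Java, whose code the slides do not show. The function name `wrap`, the base URI, the simplified `offset_begin_end` identifiers, and the Penn-style tags assigned to the example sentence are all assumptions; the property names `str:anchorOf` and `sso:posTag` and the class `sso:Word` are the ones used elsewhere in this deck.

```python
def wrap(text, tagged, base="http://example.org/doc#"):
    """Turn (word, POS tag) pairs from a tagger into Turtle-like lines.

    Each word gets an identifier built from its character offsets in `text`
    (a simplified scheme chosen for this sketch).
    """
    lines = []
    cursor = 0
    for word, tag in tagged:
        begin = text.index(word, cursor)  # find the word at or after the cursor
        end = begin + len(word)
        cursor = end
        uri = f"<{base}offset_{begin}_{end}>"
        lines.append(
            f'{uri} rdf:type sso:Word ; str:anchorOf "{word}" ; sso:posTag "{tag}" .'
        )
    return lines


sentence = "My dog also likes eating sausage."
# Assumed tagger output for the example sentence (illustrative Penn-style tags).
tagger_output = [
    ("My", "PRP$"), ("dog", "NN"), ("also", "RB"), ("likes", "VBZ"),
    ("eating", "VBG"), ("sausage", "NN"), (".", "."),
]

rdf_lines = wrap(sentence, tagger_output)
for line in rdf_lines:
    print(line)
```

Keeping the wrapper separate from the tagger is what lets different NLP tools be chained: each tool only needs a thin adapter that emits the shared vocabulary.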

Example of NLP2RDF in ENG
http://nlp2rdf.lod2.eu/demo.php

Sentence: Obama is the president of USA.
sso:oliaLink ;
sso:posTag "NNP" ;
sso:lemma "Obama" ;
str:referenceContext ;
str:anchorOf "Obama" ;
rdf:type sso:Word , str:String .

Korean NLP2RDF
Resources: morphemes, words (eojeols) and sentences in Korean
Properties: POS, grammatical roles, etc.
Problems to solve:
  Linguistic Modeling (OLiA)
  Processing Korean Text (NLP)
  How to Produce and Query RDF

Linguistic Modeling (1)
We use OLiA (Ontologies of Linguistic Annotation) to link the Sejong tagset with language-independent reference concepts.
The Sejong tagset is the de facto standard tagset for Korean.

OLiA consists of three different ontologies:
  the OLiA reference model (language-independent),
  the OLiA annotation model (depends on the tagset),
  the OLiA linking model (depends on the tagset).

We developed a fragment of these last two ontologies for Korean, that is, for the Sejong tagset.
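In code, the linking step amounts to a lookup from a Sejong tag to a reference concept. The sketch below uses a handful of rows copied from the Sejong-to-OLiA table shown later in this deck. The dictionary and function names are hypothetical, the Korean morphemes and their tags are illustrative inputs chosen here, and the real linking model is an OWL ontology rather than a Python dictionary.

```python
# A few rows from the deck's Sejong -> OLiA table (not the full mapping).
SEJONG_TO_OLIA = {
    "MAG": "Adverb",
    "MM": "PronounOrDeterminer/Determiner",
    "IC": "Interjection",
    "NNG": "Noun/CommonNoun",
    "NNB": "Noun/CommonNoun",
    "NNP": "Noun/ProperNoun",
    "NP": "PronounOrDeterminer/Pronoun",
    "VV": "Verb",
    "VX": "Verb/AuxiliaryVerb",
    "XPN": "MorphologicalCategory/morpheme/prefix",
}


def olia_concept(sejong_tag):
    """Return the OLiA reference concept for a Sejong tag, or None if unmapped."""
    return SEJONG_TO_OLIA.get(sejong_tag)


def link_morphemes(tagged):
    """Attach the OLiA concept to each (morpheme, Sejong tag) pair."""
    return [(form, tag, olia_concept(tag)) for form, tag in tagged]


# Illustrative input: "한국" (Korea) as a proper noun, "사랑하" (love) as a verb stem,
# and a made-up tag "ZZ" to show the unmapped case.
linked = link_morphemes([("한국", "NNP"), ("사랑하", "VV"), ("?", "ZZ")])
for row in linked:
    print(row)
```

Returning `None` for an unknown tag makes gaps in the language-dependent models visible instead of silently guessing a category.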

Linguistic Modeling (2)
We use NIF (NLP Interchange Format) to standardize the input and output of the different tools, to ease the connection among them, and to uniquely identify (parts of) text, entities and relationships.

NIF provides two URI schemes to identify resources:
  Offset-based
  Hash-based

In our application we opt for the hash-based scheme.
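The difference between the two schemes can be shown with a short sketch. It loosely follows the published description of the NIF 1.0 URI schemes (begin and end offsets for one, an MD5 digest over the string plus a few characters of context for the other), but the exact field layout, the context length of 4, and the base URI are assumptions made here; consult the NIF specification before relying on the format.

```python
import hashlib
from urllib.parse import quote


def offset_uri(base, text, begin, end):
    """Identify a substring by its character offsets in the document."""
    return f"{base}#offset_{begin}_{end}_{quote(text[begin:end][:20])}"


def hash_uri(base, text, begin, end, context=4):
    """Identify a substring by a digest of the string and its immediate context."""
    anchor = text[begin:end]
    left = text[max(0, begin - context):begin]
    right = text[end:end + context]
    message = f"{left}({anchor}){right}"  # string in brackets, context on both sides
    digest = hashlib.md5(message.encode("utf-8")).hexdigest()
    return f"{base}#hash_{context}_{len(anchor)}_{digest}_{quote(anchor[:20])}"


def locate(text, word):
    """Return (begin, end) of the first occurrence of word in text."""
    begin = text.index(word)
    return begin, begin + len(word)


BASE = "http://example.org/doc"  # hypothetical document URI
original = "Obama is the president of USA."
edited = "Today, " + original  # same sentence after text is inserted in front

o_begin, o_end = locate(original, "president")
e_begin, e_end = locate(edited, "president")

print(offset_uri(BASE, original, o_begin, o_end))
print(offset_uri(BASE, edited, e_begin, e_end))
print(hash_uri(BASE, original, o_begin, o_end) == hash_uri(BASE, edited, e_begin, e_end))
```

The offset-based identifier changes as soon as text is inserted earlier in the document, while the hash-based one stays the same as long as the string and its local context are unchanged. That stability is one plausible reason to prefer the hash-based scheme; the slides do not state the authors' reason.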

Korean NLP2RDF Platform
RAW Text -> Morpheme Analyzer -> Parser -> Wrapper -> NIF output

Morpheme Analyzer: HanNanum
  Korean Open Source Morpheme Analyzer
  Developed by SWRC, KAIST
Parser: Korean Berkeley Parser
  Training set: Modified Sejong Treebank (DongHyun Choi, Jungyeul Park, Key-Sun Choi, Korean Treebank Transformation for Parser Training, ACL - SPMRL 2012)
  F1-score: 82.12%
Wrapper: Produce triples
  Use OLiA (Ontologies of Linguistic Annotation) to link the Korean tagsets with language-independent reference concepts
  The OLiA annotation model and the OLiA linking model produce triples using the Sejong tagset

[Figure: architecture diagram with the components Input Korean Sentence; Korean NLP (Morph. Analyzer, CFG Parser, Dependency Parser); Parsed result (URI, Tag); DataBase; OnTop Framework (RDF generator, Mappings, Ontologies); RDF triples; SPARQL Query; SPARQL Query Handler; Korean Grammar Framework; Korean Language information]

NIF Output
Each piece of data is identified with a URI (hash-based)
Resources: Morphemes, Words (eojeols), Sentences in Korean
Properties: POS-tag, Grammatical roles, etc.

[Figures: parsing results; some produced triples]

DEMO site: http://semanticweb.kaist.ac.kr/nlp2rdf

NIF Output
Example sentence: Martin, who came from Italy after studying there, loves Korea.

Specific Issues of Korean
Korean Tagset
Linking with OLiA

Parser Output: String (Word, Sentence, Phrase, ...), Tag, ...
Ontology: String Ontology, Structured Sentence Ontology (SSO), OLiA, Penn
NLP2RDF: Produce Triples, Sejong Tag Set, RDF output

Tag | Sejong | OLiA
(superclass) | LinguisticAnnotation/Tag/ | LinguisticConcept/MorphosyntacticCategory/
MA | Adverb | Adverb
MAJ | Adverb/ConjunctiveAdverb | Adverb and Conjunction/CoordinatingConjunction
MAG | Adverb/GeneralAdverb | Adverb
SN, XN | CardinalNumber | Quantifier/Numeral
MM | Determiner | PronounOrDeterminer/Determiner
SH, SL | ForeignWord | Residual/Foreign
IC | Interjection | Interjection
NA, NF, NN | Noun | Noun
XR | Noun/BaseMorpheme | Noun/CommonNoun
NNB, NNG | Noun/CommonNoun | Noun/CommonNoun
NNP | Noun/ProperNoun | Noun/ProperNoun
NP | Pronoun | PronounOrDeterminer/Pronoun
SE, SF, SO, SP, SS | Symbol | Punctuation
NV, V | Verb | Verb
VA | Verb/Adjective | Verb and Adjective/PredicativeAdjective
VX | Verb/AuxiliaryPredicate | Verb/AuxiliaryVerb
VC, VCN, VCP | Verb/Copula | Verb/FiniteVerb
VV | Verb/VerbalPredicate | Verb
E, JK, XP, XS | Particle | MorphologicalCategory/morpheme/
JC, JX | Particle/AuxiliaryPostposition | MorphologicalCategory/morpheme/MorphologicalParticle
JKB, JKC, JKG, JKO, JKQ, JKS, JKV | Particle/CaseMarker | MorphologicalCategory/morpheme/MorphologicalParticle
XPN | Particle/Prefix | MorphologicalCategory/morpheme/prefix
XSA, XSN, XSV | Particle/Suffix | MorphologicalCategory/morpheme/suffix
EC, EF, EP, ETM, ETN | Particle/VerbalEnding | MorphologicalCategory/morpheme/suffix

Conclusions
We presented a framework that allows processing Korean text, efficiently producing RDF triples, and querying the output of the NLP tools.

The RDF output of our framework is compliant with NIF (NLP Interchange Format) and the OLiA ontologies, which facilitates combining it with other NLP tools.

Future: complete the development of the language-dependent part of the OLiA ontologies, include the missing features required by NIF, allow richer SPARQL queries, and disambiguate the different entities in the text and link them with Wikipedia articles.