Building DBpedia Japanese and Linked Data Cloud in Japanese
Fumihiro Kato, Hideaki Takeda, Seiji Koide, Ikki Ohmukai
{fumi, takeda, koide, i2k}@nii.ac.jp
National Institute of Informatics (NII)
Research Organization of Information and Systems (ROIS)
Graduate University for Advanced Studies (Sokendai)
2013 Linked Data in Practice Workshop (LDPW2013), 30 November 2013


Presented at 2013 Linked Data in Practice Workshop (LDPW2013), 30 November, 2013


Page 1: Building DBpedia Japanese and Linked Data Cloud in Japanese

Building DBpedia Japanese and Linked Data Cloud in Japanese

Fumihiro Kato, Hideaki Takeda, Seiji Koide, Ikki Ohmukai
{fumi, takeda, koide, i2k}@nii.ac.jp
National Institute of Informatics (NII)
Research Organization of Information and Systems (ROIS)
Graduate University for Advanced Studies (Sokendai)

2013 Linked Data in Practice Workshop (LDPW2013), 30 November 2013

Page 2: Building DBpedia Japanese and Linked Data Cloud in Japanese

Two Driving Forces Pushing LOD in Japan

• LOD for ACademia (LODAC) Project, since 2010
  – A research project at ROIS and NII
  – Research on Linked Data for research
• Linked Open Data Initiative, Inc. (LODI), since 2012
  – Non-profit organization
  – Promotion of LOD in Japan
  – Collaboration with various stakeholders
    • Government, public sector, companies
• The members of the two groups largely overlap

Page 3: Building DBpedia Japanese and Linked Data Cloud in Japanese

LODAC Project - connecting academic data -

[Figure: overview of the LODAC project components]
• LODAC SPECIES: connecting species data by name (specimen DB, species info. DB, taxon name DB, GBIF, BioSci. DB, research DB)
  – No. of names: 113,118
  – No. of triples: 14,532,449
• LODAC Museum: LOD of data in museums
  – Data from sources A and B are integrated around Work, Museum and Creator entities (minimum data to identify entities, plus raw data for entities)
  – Links use dc:references, dc:creator and crm:P55_has_current_location
• App. for query expansion
• CKAN Japanese: catalog for open data
• DBpedia Japanese
• LODAC Location: integration of location information

Page 4: Building DBpedia Japanese and Linked Data Cloud in Japanese

LODAC Museum

• Integrated database for information on museums in Japan
  – Data
    • No. of museums: 114
    • No. of triples: 40,059,131
• Integration by creator, work and institute
• Data publication in RDF
• Some applications using the data

Type of information | RDF type | No. of items
Collections (total) | lodac:Specimen + lodac:Work | ca. 1,770,000
Collections (specimen) | lodac:Specimen | ca. 1,690,000
Collections (creative and historical work) | lodac:Work | ca. 130,000
Creators | foaf:Person | ca. 8,800
Institutes | foaf:Organization | ca. 200,000

Page 5: Building DBpedia Japanese and Linked Data Cloud in Japanese

Yokohama Art Spot

• Application using museum and local data
• Data related to art in Yokohama
  – Collections
  – Events
  – Q&A
• http://lod.ac/apps/yas/
• Uses LODAC Museum × Yokohama Art LOD × PinQA

Page 6: Building DBpedia Japanese and Linked Data Cloud in Japanese

LODAC SPECIES: Linking Species Information with names

[Figure: the Taxon Name LOD links a museum specimen DB, a species info. DB, GBIF, a BioSci. DB and a research DB by species name]
• No. of species names: 113,118
• No. of triples: 14,532,449

Page 7: Building DBpedia Japanese and Linked Data Cloud in Japanese

Search application with LODAC SPECIES

http://lod.ac/apps/lsdcs

Page 8: Building DBpedia Japanese and Linked Data Cloud in Japanese

Specified Non-profit Corporation

Linked Open Data Initiative, Inc.

Page 9: Building DBpedia Japanese and Linked Data Cloud in Japanese

Prospectus

• LOD is becoming an infrastructure of our society
  – Similar to the impact the Web has had on society
  – LOD helps society mature and diversify
• We want to spread LOD more widely in Japan!
  – For governments (central and local)
  – For companies
  – For citizens
• How?
  – By researchers, engineers and citizens working together

Page 10: Building DBpedia Japanese and Linked Data Cloud in Japanese

Projects

• Platforms
  – CKAN Japanese
  – DBpedia Japanese
• Collaborative projects
  – with the Ministry of Economy, Trade and Industry (METI)
    • Open Data METI
  – with the National Statistics Center
    • Scheme design for area codes
  – with Sabae City
    • e.g., "Sabae Burari"
• Promotional events

Page 11: Building DBpedia Japanese and Linked Data Cloud in Japanese
Page 12: Building DBpedia Japanese and Linked Data Cloud in Japanese

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 13: Building DBpedia Japanese and Linked Data Cloud in Japanese

provided by NDL

Page 14: Building DBpedia Japanese and Linked Data Cloud in Japanese

Motivation

• Data hub for Japanese resources
  – To promote LOD in Japan
  – To connect datasets in Japanese
• Two linguistic datasets
  – DBpedia Japanese
  – RDFized Japanese WordNet

Page 15: Building DBpedia Japanese and Linked Data Cloud in Japanese

DBpedia Japanese

• DBpedia i18n project
  – 14 chapters
• Generated from Japanese Wikipedia dump files
  – DIEF (DBpedia Information Extraction Framework)
  – ~80 million triples
• Linking to
  – Japanese WordNet
  – Japanese Wikipedia Ontology
  – other DBpedia chapters
• http://ja.dbpedia.org

Page 16: Building DBpedia Japanese and Linked Data Cloud in Japanese

i18n/l10n efforts

• IRI, IRI, IRI, ...
• Configurations for extractors and parsers
• DBpedia mappings for each chapter

Page 17: Building DBpedia Japanese and Linked Data Cloud in Japanese

Extraction process

ref: D. Kontokostas et al. "Internationalization of Linked Data. The case of the Greek DBpedia edition." Journal of Web Semantics: Science, Services and Agents on the World Wide Web, vol. 15, No.3, Sep. 2012, pp.51-61

Page 18: Building DBpedia Japanese and Linked Data Cloud in Japanese

DBpedia Information Extraction Framework

• Software to extract data from Wikipedia dumps
  – including custom extractors/parsers to apply language-specific configurations
• Extractors / Parsers
  – DisambiguationExtractor
  – HomepageExtractor
  – ImageExtractor
  – PersondataExtractor

Page 19: Building DBpedia Japanese and Linked Data Cloud in Japanese

DisambiguationExtractor

• "ja" -> "(曖昧さ回避)" (disambiguation)
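A minimal sketch of how such a per-language title marker could be applied. DIEF itself is written in Scala; the table `DISAMBIGUATION_MARKERS`, the helper `is_disambiguation_title` and the English entry are hypothetical names and values chosen for illustration, not DIEF's API.

```python
# Hypothetical per-language marker table, mirroring the "ja" entry above.
DISAMBIGUATION_MARKERS = {
    "ja": "(曖昧さ回避)",
    "en": "(disambiguation)",  # assumed English counterpart
}


def is_disambiguation_title(title: str, lang: str) -> bool:
    """Return True if a page title ends with the marker configured for lang."""
    marker = DISAMBIGUATION_MARKERS.get(lang)
    return marker is not None and title.rstrip().endswith(marker)
```

A language without an entry in the table is simply never reported as a disambiguation page, which is why each chapter needs its own configuration.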

Page 20: Building DBpedia Japanese and Linked Data Cloud in Japanese

HomepageExtractor

• propertyNamesMap
  – "ja" -> Set("homepage", "website", "ホームページ", "ウェブサイト", "Webサイト", "Webサイト")
• externalLinkSectionsMap
  – "ja" -> "外部リンク" (external links)
• officialMap
  – "ja" -> "公式" (official)
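The property-name part of this configuration can be sketched as a lookup over infobox keys. This is an illustration only: `HOMEPAGE_PROPERTIES` and `find_homepage` are hypothetical names, the infobox is assumed to be already parsed into a dict, and the fallback through the "外部リンク" section and the "公式" marker is not covered.

```python
from typing import Dict, Optional

# Property names taken from the "ja" entry of propertyNamesMap above.
HOMEPAGE_PROPERTIES = {"homepage", "website", "ホームページ", "ウェブサイト", "Webサイト"}
_NORMALIZED = {name.lower() for name in HOMEPAGE_PROPERTIES}


def find_homepage(infobox: Dict[str, str]) -> Optional[str]:
    """Return the first URL stored under a known homepage property, if any."""
    for key, value in infobox.items():
        if key.strip().lower() in _NORMALIZED and value.startswith("http"):
            return value
    return None
```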

Page 21: Building DBpedia Japanese and Linked Data Cloud in Japanese

ImageExtractor

• "ja" -> """(?i)\{\{\s?(Non free|Non-free pubart)\s?\}\}""".r
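The Scala pattern above carries over to Python almost unchanged, with the inline `(?i)` flag expressed as `re.IGNORECASE`. The sketch assumes the pattern is meant to recognise non-free licence templates in an image page's wikitext; `is_non_free` is a hypothetical helper name.

```python
import re

# Python rendering of the Scala regex above.
NON_FREE = re.compile(r"\{\{\s?(Non free|Non-free pubart)\s?\}\}", re.IGNORECASE)


def is_non_free(image_page_wikitext: str) -> bool:
    """Return True if the wikitext contains one of the non-free templates."""
    return NON_FREE.search(image_page_wikitext) is not None
```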

Page 22: Building DBpedia Japanese and Linked Data Cloud in Japanese

PersondataExtractor

• Names of templates for personal information
  – "名前" (name)
  – "別名" (alias)
  – "概要" (abstract)
  – dates and places of birth and death

Page 23: Building DBpedia Japanese and Linked Data Cloud in Japanese

Extracted triples after configurations

Type Triples

disambiguation 106,386

homepages 49,355

images 843,170

persondata 1,811

Page 24: Building DBpedia Japanese and Linked Data Cloud in Japanese

Image of Infobox Extraction

[Figure: a Wikipedia infobox template, the mapping of the infobox to the ontology, and the data extraction step; the mapping is used for the extraction process]

Page 25: Building DBpedia Japanese and Linked Data Cloud in Japanese
Page 26: Building DBpedia Japanese and Linked Data Cloud in Japanese
Page 27: Building DBpedia Japanese and Linked Data Cloud in Japanese

{{TemplateMapping
| mapToClass = ComicsCreator
| mappings =
  {{PropertyMapping | templateProperty = 名前 | ontologyProperty = foaf:name }}
  {{PropertyMapping | templateProperty = 本名 | ontologyProperty = foaf:name }}
  {{PropertyMapping | templateProperty = 生年 | ontologyProperty = birthYear }}
  {{PropertyMapping | templateProperty = 生地 | ontologyProperty = birthPlace }}
  {{PropertyMapping | templateProperty = 没年 | ontologyProperty = deathYear }}
  {{PropertyMapping | templateProperty = 没地 | ontologyProperty = deathPlace }}
  {{PropertyMapping | templateProperty = 国籍 | ontologyProperty = nationality }}
  {{PropertyMapping | templateProperty = 受賞 | ontologyProperty = award }}
  {{PropertyMapping | templateProperty = 公式サイト | ontologyProperty = foaf:homepage }}
  {{PropertyMapping | templateProperty = 画像 | ontologyProperty = foaf:depiction }}
  {{PropertyMapping | templateProperty = ジャンル | ontologyProperty = genre }}
  {{PropertyMapping | templateProperty = 画像サイズ | ontologyProperty = imageSize }}
  {{PropertyMapping | templateProperty = 職業 | ontologyProperty = occupation }}
  {{PropertyMapping | templateProperty = 代表作 | ontologyProperty = notableWork }}
}}
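What such a mapping does during extraction can be sketched as a dictionary lookup: each infobox property that has a mapping yields a triple with the ontology property, and unmapped properties are skipped. This is a simplified illustration, not DIEF's implementation; `COMICS_CREATOR_MAPPING` and `map_infobox` are hypothetical names, and values are kept as plain strings rather than parsed into typed literals or resources.

```python
# Subset of the PropertyMapping entries above: template property -> ontology property.
COMICS_CREATOR_MAPPING = {
    "名前": "foaf:name",
    "生年": "birthYear",
    "生地": "birthPlace",
    "受賞": "award",
    "代表作": "notableWork",
}


def map_infobox(subject, infobox, mapping=COMICS_CREATOR_MAPPING, map_to_class="ComicsCreator"):
    """Return (subject, predicate, object) tuples for the mapped infobox properties."""
    triples = [(subject, "rdf:type", map_to_class)]
    for template_property, value in infobox.items():
        ontology_property = mapping.get(template_property)
        if ontology_property is not None:  # unmapped properties are skipped
            triples.append((subject, ontology_property, value))
    return triples
```

Properties left unmapped still appear in the raw infobox extraction (for example dbp-prop:生年), but only mapped ones use the shared ontology vocabulary, which is what makes cross-chapter queries possible.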

Page 28: Building DBpedia Japanese and Linked Data Cloud in Japanese
Page 29: Building DBpedia Japanese and Linked Data Cloud in Japanese

Statistics for DBpedia Mappings

Each rate is the share of the corresponding items in Wikipedia that are covered by a mapping.

Rate | DBpedia Japanese | DBpedia (English)
Templates mapped | 4.67% (81 of 1,733) | 6.33% (369 of 5,826)
Properties mapped | 2.47% (1,581 of 62,679) | 3.47% (6,169 of 177,599)
Template occurrences mapped | 47.99% (286,858 of 597,696) | 82.24% (2,435,773 of 2,728,357)
Property occurrences mapped | 38.75% (3,128,208 of 8,071,982) | 54.95% (27,283,343 of 49,654,072)
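As a cross-check, the rates can be recomputed from the counts. The sketch below copies the figures exactly as printed; `STATS` and `recompute` are hypothetical names. Six of the eight rates are reproduced, while two rows (Japanese properties, English template occurrences) come out differently, and the slides do not say whether the count or the percentage is the intended figure in those rows.

```python
# (mapped, total, printed percentage) as given in the statistics above.
STATS = {
    ("templates", "ja"): (81, 1733, 4.67),
    ("templates", "en"): (369, 5826, 6.33),
    ("properties", "ja"): (1581, 62679, 2.47),
    ("properties", "en"): (6169, 177599, 3.47),
    ("template occurrences", "ja"): (286858, 597696, 47.99),
    ("template occurrences", "en"): (2435773, 2728357, 82.24),
    ("property occurrences", "ja"): (3128208, 8071982, 38.75),
    ("property occurrences", "en"): (27283343, 49654072, 54.95),
}


def recompute(stats=STATS):
    """Return {row: (printed, recomputed)} for rows whose printed rate differs."""
    mismatches = {}
    for key, (mapped, total, printed) in stats.items():
        recomputed = round(100 * mapped / total, 2)
        if abs(recomputed - printed) > 0.005:
            mismatches[key] = (printed, recomputed)
    return mismatches
```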

Page 30: Building DBpedia Japanese and Linked Data Cloud in Japanese

"Mapping Party"

• The mapping task is not easy
  – Wikipedia templates
  – DBpedia Ontology
  – Well-known vocabularies
• We held hands-on sessions
  – Aug. 2012: 10 people
  – Mar. 2013: 25 people

Page 31: Building DBpedia Japanese and Linked Data Cloud in Japanese

DBpedia Publishing Architecture

Page 32: Building DBpedia Japanese and Linked Data Cloud in Japanese

URI case

[Figure: publishing pipeline in the URI case; URIs are used at every stage and are decoded only for display to users]

Page 33: Building DBpedia Japanese and Linked Data Cloud in Japanese

IRI case

[Figure: publishing pipeline in the IRI case; IRIs are used at every stage, with one IRI-to-URI conversion step]

Page 34: Building DBpedia Japanese and Linked Data Cloud in Japanese

IRI issues

[Figure: the IRI-case pipeline (IRI at each stage, one IRI-to-URI conversion step), annotated with the issues below]

1. IRIs have to be used properly in queries
2. Input URIs must be decoded to IRIs
3. Some serializations cannot use IRIs
4. Don't decode IRIs
5. Use the latest version
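The conversion behind issues 1 and 2 can be sketched with Python's standard library: an IRI becomes a URI by percent-encoding its non-ASCII characters as UTF-8, and an incoming URI is decoded back before lookup. This is an illustration under simplifying assumptions, not the code running on ja.dbpedia.org; `iri_to_uri` and `uri_to_iri` are hypothetical helper names, and `unquote` also decodes percent-encoded ASCII characters, which a production converter would need to treat more carefully.

```python
from urllib.parse import quote, unquote

BASE = "http://ja.dbpedia.org/resource/"

# Characters left untouched: URI delimiters plus "%" to avoid double encoding.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters (as UTF-8) so the IRI becomes a URI."""
    return quote(iri, safe=_SAFE)


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 sequences back into an IRI."""
    return unquote(uri)
```

For example, `iri_to_uri(BASE + "宮城県")` yields an ASCII-only URI, and `uri_to_iri` restores the original IRI.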

Page 35: Building DBpedia Japanese and Linked Data Cloud in Japanese

[Figure: example RDF graph around dbp:石ノ森章太郎 (Shotaro Ishinomori). Resources shown: dbp:石ノ森章太郎, dbp:サイボーグ009, dbp:宮城県, dbp:村井嘉浩, dbp:手塚治虫文化賞. Classes: dbp-owl:ComicsCreator, dbp-owl:Comics, dbp-owl:AdministrativeRegion, foaf:Person. Properties: rdf:type, rdfs:label, dbp-prop:生年 (value 1938), dbp-owl:notableWork, dbp-owl:award, dbp-owl:birthPlace, dbp-owl:leaderName.]

Query: Notable comics written by comics creators who have received the Tezuka Osamu Cultural Prize

PREFIX dbp: <http://ja.dbpedia.org/resource/>
PREFIX dbp-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?creatorName ?comicName
WHERE {
  ?creator a dbp-owl:ComicsCreator ;
           dbp-owl:award dbp:手塚治虫文化賞 ;
           dbp-owl:notableWork ?comic ;
           rdfs:label ?creatorName .
  ?comic a dbp-owl:Comics ;
         rdfs:label ?comicName .
}
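A sketch of how this query could be sent over the SPARQL protocol as an HTTP GET request. Only the request URL is built, so nothing here touches the network. The endpoint path `http://ja.dbpedia.org/sparql` is an assumption (the slides give only http://ja.dbpedia.org), and `build_request_url` is a hypothetical helper. The IRI inside the query is percent-encoded only as part of the URL; within the query text it stays an IRI.

```python
from urllib.parse import urlencode

QUERY = """
PREFIX dbp: <http://ja.dbpedia.org/resource/>
PREFIX dbp-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?creatorName ?comicName WHERE {
  ?creator a dbp-owl:ComicsCreator ;
           dbp-owl:award dbp:手塚治虫文化賞 ;
           dbp-owl:notableWork ?comic ;
           rdfs:label ?creatorName .
  ?comic a dbp-owl:Comics ;
         rdfs:label ?comicName .
}
"""


def build_request_url(query: str, endpoint: str = "http://ja.dbpedia.org/sparql") -> str:
    """Build a SPARQL-protocol GET URL; the default endpoint is an assumption."""
    return endpoint + "?" + urlencode({"query": query})
```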

Page 36: Building DBpedia Japanese and Linked Data Cloud in Japanese

Japanese Linked Data Cloud

• 21 datasets
• Criteria
  – providing more than 1,000 triples
  – providing at least one of dereferenceable URIs, a data dump, or a SPARQL endpoint
  – including Japanese labels
  – linking to other datasets in the LOD cloud or JLDC
• An open license is not mandatory
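The inclusion criteria can be read as a simple predicate over dataset metadata. The sketch below is one possible encoding; the `Dataset` fields and `meets_jldc_criteria` are hypothetical names, and the slides do not describe how the check was actually carried out.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    triples: int
    dereferenceable: bool = False
    has_dump: bool = False
    has_sparql_endpoint: bool = False
    has_japanese_labels: bool = False
    links_to_cloud: bool = False  # links to datasets in the LOD cloud or JLDC
    open_license: bool = False    # recorded, but not required for inclusion


def meets_jldc_criteria(d: Dataset) -> bool:
    """Apply the four JLDC criteria; the licence is deliberately ignored."""
    return (
        d.triples > 1000
        and (d.dereferenceable or d.has_dump or d.has_sparql_endpoint)
        and d.has_japanese_labels
        and d.links_to_cloud
    )
```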

Page 37: Building DBpedia Japanese and Linked Data Cloud in Japanese

JLDC with LOD cloud criteria

21 → 9 datasets (9 of the 21 JLDC datasets also meet the LOD cloud criteria)

Page 38: Building DBpedia Japanese and Linked Data Cloud in Japanese

Links to/from Japanese WordNet

Type | Links | WN nouns | DBpedia IRIs | WN to DBpedia | DBpedia to WN
Resources | 33,017 | 65,788 | 1,456,158 | 50.1% | 2.3%
Properties | 1,245 | 65,788 | 16,020 | 1.9% | 7.8%

Page 39: Building DBpedia Japanese and Linked Data Cloud in Japanese

Ongoing Work

• More Wikipedia entries and infoboxes
  – Wikipedia Town
• More DBpedia mappings
  – Mapping Party
• Parsers for Japanese
  – Japanese calendar: 慶応3年1月2日 => "1868-01-02"^^xsd:date

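A minimal sketch of a Japanese-calendar date parser, restricted to the case that is purely arithmetic. It is not the planned DIEF parser: `ERA_START`, `PATTERN` and `parse_japanese_date` are hypothetical names, and only the eras from Meiji to Heisei are handled. Japan adopted the Gregorian calendar in 1873, so earlier dates (including 慶応 dates such as the slide's example) follow the lunisolar calendar and would need a conversion table; the sketch rejects them instead of guessing. Era boundaries within a year are not validated.

```python
import re
from datetime import date

# Gregorian year in which each supported era begins.
ERA_START = {"明治": 1868, "大正": 1912, "昭和": 1926, "平成": 1989}
PATTERN = re.compile(r"(明治|大正|昭和|平成)\s*(元|\d+)\s*年\s*(\d+)\s*月\s*(\d+)\s*日")


def parse_japanese_date(text: str) -> str:
    """Convert e.g. '昭和13年1月25日' to an xsd:date lexical form."""
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unsupported date: {text!r}")
    era, year_text, month, day = m.groups()
    era_year = 1 if year_text == "元" else int(year_text)  # 元年 means year 1
    year = ERA_START[era] + era_year - 1
    if year < 1873:
        raise ValueError("dates before 1873 use the lunisolar calendar")
    return date(year, int(month), int(day)).isoformat()
```

The returned string can be attached to a literal as `"..."^^xsd:date`.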
Page 40: Building DBpedia Japanese and Linked Data Cloud in Japanese

Summary

• Linked Data in Japan is steadily expanding
  – Started by the research project
  – Now extended to various areas
• Creating a local chapter of DBpedia is key to promoting Linked Data in the local language
  – A hub in the local language
  – People in any field can find connections between DBpedia and their own data
• Promotion of open licenses is still in progress