Presented at 2013 Linked Data in Practice Workshop (LDPW2013), 30 November, 2013
Building DBpedia Japanese and Linked Data Cloud in Japanese
Fumihiro Kato, Hideaki Takeda, Seiji Koide, Ikki Ohmukai
{fumi, takeda, koide, i2k}@nii.ac.jp
National Institute of Informatics (NII)
Research Organization of Information and Systems (ROIS)
Graduate University for Advanced Studies (Sokendai)
2013 Linked Data in Practice Workshop (LDPW2013) , 30 November, 2013
Two Driving Forces to push LOD in Japan
• LOD for ACademia (LODAC) Project, since 2010
  – A research project in ROIS and NII
  – Research on Linked Data for research
• Linked Open Data Initiative, Inc. (LODI), since 2012
  – Non-profit organization
  – Promotion of LOD in Japan
  – Collaboration with various stakeholders
    • Government, public sector, companies
• The members of the two groups largely overlap
LODAC Project - connecting academic data -
LODAC SPECIES: Connecting species data by name
[Figure: Specimen DB, Species Info. DB, Taxon Name DB, GBIF, BioSci. DB and Research DB connected by name]
No. of Names: 113,118
No. of Triples: 14,532,449
LODAC Museum: LOD of data in museums
[Figure: data from Source A and Source B (raw data for entities) and integrated data (minimum data to identify entities: Work, Museum, Creator); edge labels dc:references, dc:creator and crm:P55_has_current_location]
App. for query expansion
CKAN Japanese: Catalog for Open Data
DBpedia Japanese
LODAC Location: Integration of location information
LODAC Museum
• Integrated database for information on museums in Japan
  – Data
    • No. of museums: 114
    • No. of triples: 40,059,131
• Integration by creator, work and institute
• Data publication by RDF
• Some applications using the data
Type of Information                          RDF type                      No. of items
Collections (total)                          lodac:Specimen + lodac:Work   ca. 1,770,000
Collections (specimen)                       lodac:Specimen                ca. 1,690,000
Collections (creative and historical work)   lodac:Work                    ca. 130,000
Creators                                     foaf:Person                   ca. 8,800
Institutes                                   foaf:Organization             ca. 200,000
Yokohama Art Spot
– Application using museum and local data
– Data related to art in Yokohama
  • Collections
  • Events
  • Q&A
http://lod.ac/apps/yas/
LODAC Museum × Yokohama Art LOD × PinQA
LODAC SPECIES: Linking Species Information with names
[Figure: Museum Specimen DB, Species Info. DB, Taxon Name LOD, GBIF, BioSci. DB and Research DB linked by name]
No. of Species Names: 113,118
No. of Triples: 14,532,449
Specified Non-profit Corporation
Linked Open Data Initiative, Inc.
Prospectus
• LOD is becoming part of society's infrastructure
  – Its impact is similar to that of the Web
  – LOD helps society mature and diversify
• We want to spread LOD more widely in Japan!
  – For governments (central and local)
  – For companies
  – For citizens
• How?
  – Researchers, engineers and citizens working together
Projects
• Platforms
  – CKAN Japanese
  – DBpedia Japanese
• Collaborative Projects
  – with the Ministry of Economy, Trade and Industry (METI)
    • Open Data METI
  – with the National Statistics Center
    • Scheme design for area codes
  – with Sabae City
    • e.g., "Sabae Burari"
• Promotional Events
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
provided by NDL
Motivation
• Data hub for Japanese resources
  – To promote LOD in Japan
  – To connect datasets in Japanese
• Two linguistic datasets
  – DBpedia Japanese
  – RDFized Japanese WordNet
DBpedia Japanese
• DBpedia i18n project
  – 14 chapters
• Generated from Japanese Wikipedia dump files
  – DIEF (DBpedia Information Extraction Framework)
  – ~80 million triples
• Linking to
  – Japanese WordNet
  – Japanese Wikipedia Ontology
  – other DBpedia chapters
• http://ja.dbpedia.org
i18n/l10n efforts
• IRI, IRI, IRI, ...
• Configurations for Extractors and Parsers
• DBpedia Mappings for each chapter
Extraction process
ref: D. Kontokostas et al. "Internationalization of Linked Data. The case of the Greek DBpedia edition." Journal of Web Semantics: Science, Services and Agents on the World Wide Web, vol. 15, No.3, Sep. 2012, pp.51-61
DBpedia Information Extraction Framework
• Software to extract data from Wikipedia dumps
  – including custom extractors/parsers to apply language-specific configurations
• Extractors / Parsers
  – DisambiguationExtractor
  – HomepageExtractor
  – ImageExtractor
  – PersondataExtractor
DisambiguationExtractor
• "ja" -> "(曖昧さ回避)" (曖昧さ回避 means "disambiguation")
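A per-language entry like this can be read as a lookup table keyed by language code. The following is a minimal Python sketch, not the framework's own code. The names `DISAMBIGUATION_MARKER` and `is_disambiguation_title` are hypothetical, and the sketch assumes that the configured string marks a disambiguation page by appearing at the end of the page title.

```python
# Hypothetical per-language table, modelled on the "ja" entry above.
DISAMBIGUATION_MARKER = {
    "ja": "(曖昧さ回避)",
}


def is_disambiguation_title(title: str, lang: str) -> bool:
    """Return True if the page title ends with the language's disambiguation marker."""
    marker = DISAMBIGUATION_MARKER.get(lang)
    if marker is None:
        # No configuration for this language, so nothing can be detected.
        return False
    return title.rstrip().endswith(marker)
```

Adding a language then only means adding one entry to the table, which is the point of keeping such settings in configuration rather than in extractor logic.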
HomepageExtractor
• propertyNamesMap
  – "ja" -> Set("homepage", "website", "ホームページ", "ウェブサイト", "Webサイト", "Webサイト")
• externalLinkSectionsMap
  – "ja" -> "外部リンク" (external links)
• officialMap
  – "ja" -> "公式" (official)
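To illustrate how a property-name set of this kind might be used, here is a small Python sketch. It is an assumption about usage, not the extractor's actual logic: `PROPERTY_NAMES` and `find_homepage` are hypothetical names, and the infobox is simplified to a plain dictionary of property name to value.

```python
from typing import Dict, Optional

# Hypothetical table, modelled on the "ja" entry of propertyNamesMap above.
PROPERTY_NAMES = {
    "ja": {"homepage", "website", "ホームページ", "ウェブサイト", "Webサイト"},
}


def find_homepage(infobox: Dict[str, str], lang: str) -> Optional[str]:
    """Return the first non-empty infobox value whose key is a configured homepage name."""
    names = {name.lower() for name in PROPERTY_NAMES.get(lang, set())}
    for key, value in infobox.items():
        if key.strip().lower() in names and value.strip():
            return value.strip()
    return None
```

Without the Japanese entries, an infobox that uses `ウェブサイト` instead of `website` would yield no homepage at all, which is why each chapter needs its own configuration.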
ImageExtractor
• "ja" -> """(?i)\{\{\s?(Non free|Non-free pubart)\s?\}\}""".r
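The pattern above is written as a Scala regex literal. As a rough equivalent, the sketch below expresses it in Python, where the inline `(?i)` flag becomes `re.IGNORECASE`. The function name `is_non_free` is hypothetical, and the sketch assumes the pattern is applied to the wikitext of an image description page.

```python
import re

# Python rendering of the "ja" pattern above; (?i) becomes re.IGNORECASE.
NON_FREE_JA = re.compile(r"\{\{\s?(Non free|Non-free pubart)\s?\}\}", re.IGNORECASE)


def is_non_free(wikitext: str) -> bool:
    """Return True if the wikitext contains one of the configured non-free templates."""
    return NON_FREE_JA.search(wikitext) is not None
```

Note that the pattern only accepts the bare template, optionally padded by one whitespace character, so a template call with parameters would not match.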
PersondataExtractor
• Names of templates for personal information
• "名前" (name)
• "別名" (alias)
• "概要" (abstract)
• dates and places of birth and death
Extracted triples after configurations
Type Triples
disambiguation 106,386
homepages 49,355
images 843,170
persondata 1,811
Image of Infobox Extraction
[Figure: a Wikipedia infobox template, the mapping of the infobox to the ontology, and data extraction; the mapping is used for the extraction process]
{{TemplateMapping
| mapToClass = ComicsCreator
| mappings =
  {{PropertyMapping | templateProperty = 名前 | ontologyProperty = foaf:name }}
  {{PropertyMapping | templateProperty = 本名 | ontologyProperty = foaf:name }}
  {{PropertyMapping | templateProperty = 生年 | ontologyProperty = birthYear }}
  {{PropertyMapping | templateProperty = 生地 | ontologyProperty = birthPlace }}
  {{PropertyMapping | templateProperty = 没年 | ontologyProperty = deathYear }}
  {{PropertyMapping | templateProperty = 没地 | ontologyProperty = deathPlace }}
  {{PropertyMapping | templateProperty = 国籍 | ontologyProperty = nationality }}
  {{PropertyMapping | templateProperty = 受賞 | ontologyProperty = award }}
  {{PropertyMapping | templateProperty = 公式サイト | ontologyProperty = foaf:homepage }}
  {{PropertyMapping | templateProperty = 画像 | ontologyProperty = foaf:depiction }}
  {{PropertyMapping | templateProperty = ジャンル | ontologyProperty = genre }}
  {{PropertyMapping | templateProperty = 画像サイズ | ontologyProperty = imageSize }}
  {{PropertyMapping | templateProperty = 職業 | ontologyProperty = occupation }}
  {{PropertyMapping | templateProperty = 代表作 | ontologyProperty = notableWork }}
}}
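Conceptually, a template mapping is a table from template property names to ontology properties, applied to each infobox that uses the template. The Python sketch below shows that idea only. It is not how the extraction framework is implemented: `COMICS_CREATOR_MAPPING` and `map_infobox` are hypothetical names, only a few rows of the mapping are copied, and values are kept as plain strings instead of being parsed into typed literals or resources.

```python
from typing import Dict, List, Tuple

# A few rows of the ComicsCreator mapping above: template property -> ontology property.
COMICS_CREATOR_MAPPING: Dict[str, str] = {
    "名前": "foaf:name",
    "生年": "birthYear",
    "生地": "birthPlace",
    "代表作": "notableWork",
}


def map_infobox(subject: str, infobox: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn infobox key/value pairs into (subject, property, value) triples."""
    # The mapped class comes from mapToClass in the template mapping.
    triples = [(subject, "rdf:type", "ComicsCreator")]
    for template_property, value in infobox.items():
        ontology_property = COMICS_CREATOR_MAPPING.get(template_property)
        # Unmapped or empty properties are skipped rather than guessed.
        if ontology_property is not None and value:
            triples.append((subject, ontology_property, value))
    return triples
```

Properties that have no row in the mapping produce no ontology triple, which is why mapping coverage matters for the quality of the extracted data.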
Statistics for DBpedia Mappings
(share of all templates and properties in the respective Wikipedia that are mapped)

                               DBpedia Japanese                   DBpedia (English)
Templates mapped               4.67% (81 of 1,733)                6.33% (369 of 5,826)
Properties mapped              2.47% (1,581 of 62,679)            3.47% (6,169 of 177,599)
Template occurrences mapped    47.99% (286,858 of 597,696)        82.24% (2,435,773 of 2,728,357)
Property occurrences mapped    38.75% (3,128,208 of 8,071,982)    54.95% (27,283,343 of 49,654,072)
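The rates in this table are simple ratios of mapped items to all items. As a small worked example, the sketch below recomputes two of the DBpedia Japanese figures. The function name `coverage` is hypothetical.

```python
def coverage(mapped: int, total: int) -> float:
    """Share of items that are mapped, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * mapped / total


# Figures for DBpedia Japanese, taken from the statistics above.
print(f"templates: {coverage(81, 1733):.2f}%")
print(f"template occurrences: {coverage(286_858, 597_696):.2f}%")
```

The gap between the two numbers reflects that the mapped templates are used far more often than the average template, so a small set of mappings already covers close to half of all template uses.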
"Mapping Party"
• The mapping task is not easy
  – Wikipedia Template
  – DBpedia Ontology
  – Well-known vocabularies
• We held hands-on sessions
  – Aug. 2012: 10 people
  – Mar. 2013: 25 people
DBpedia Publishing Architecture
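URI case
[Figure: publishing pipeline in which components exchange URIs, with a "decode URI for users" step]
IRI case
[Figure: publishing pipeline in which components exchange IRIs, with an "IRI to URI" step]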
IRI issues
[Figure: the IRI-case pipeline, annotated with the following issues]
1. IRIs have to be used properly in queries
2. Input URIs must be decoded to IRIs
3. Some serializations cannot use IRIs
4. Don't decode IRIs
5. Use the latest version
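The conversions mentioned in these issues can be sketched with the Python standard library. This is a simplified illustration, not the publishing stack's own code: `iri_to_uri` and `uri_to_iri` are hypothetical helper names, and `uri_to_iri` decodes every percent-escape, whereas a strict conversion would leave escapes of reserved characters untouched.

```python
from urllib.parse import quote, unquote

RESOURCE_BASE = "http://ja.dbpedia.org/resource/"


def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII characters of an IRI as UTF-8 octets."""
    # ASCII delimiters are kept as they are; "%" is also kept so that an
    # input that is already encoded is not encoded a second time.
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%~")


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 octets back into characters (simplified)."""
    return unquote(uri)
```

For example, `iri_to_uri(RESOURCE_BASE + "石ノ森章太郎")` gives an ASCII-only URI, and `uri_to_iri` restores the original IRI, which corresponds to decoding input URIs before they are matched against IRIs stored in the dataset.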
[Figure: example RDF graph. Resources: dbp:石ノ森章太郎, dbp:サイボーグ009, dbp:宮城県, dbp:村井嘉浩, dbp:手塚治虫文化賞. Classes: dbp-owl:ComicsCreator, dbp-owl:Comics, dbp-owl:AdministrativeRegion, foaf:Person. Edge labels: rdf:type, rdfs:label, dbp-prop:生年, dbp-owl:notableWork, dbp-owl:award, dbp-owl:birthPlace, dbp-owl:leaderName. Literals: "サイボーグ009", "宮城県", 1938]
Query: Notable comics written by comics creators who have received the Tezuka Osamu Cultural Prize
PREFIX dbp: <http://ja.dbpedia.org/resource/>
PREFIX dbp-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?creatorName ?comicName
WHERE {
  ?creator a dbp-owl:ComicsCreator ;
           dbp-owl:award dbp:手塚治虫文化賞 ;
           dbp-owl:notableWork ?comic ;
           rdfs:label ?creatorName .
  ?comic a dbp-owl:Comics ;
         rdfs:label ?comicName .
}
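A query like this is usually sent to a SPARQL endpoint over HTTP. The sketch below only builds the request URL and does not contact any server. The endpoint path `/sparql` is an assumption (the slides give only http://ja.dbpedia.org), `build_request_url` is a hypothetical helper, and the `rdfs:` prefix is declared explicitly because not every endpoint predefines it.

```python
from urllib.parse import urlencode

# Assumed endpoint location; only http://ja.dbpedia.org appears in the slides.
ENDPOINT = "http://ja.dbpedia.org/sparql"

QUERY = """\
PREFIX dbp: <http://ja.dbpedia.org/resource/>
PREFIX dbp-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?creatorName ?comicName
WHERE {
  ?creator a dbp-owl:ComicsCreator ;
           dbp-owl:award dbp:手塚治虫文化賞 ;
           dbp-owl:notableWork ?comic ;
           rdfs:label ?creatorName .
  ?comic a dbp-owl:Comics ;
         rdfs:label ?comicName .
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Build a GET URL carrying the query in the standard 'query' parameter."""
    # urlencode percent-encodes the Japanese IRI in the query as UTF-8.
    return endpoint + "?" + urlencode({"query": query})
```

The resulting URL can be fetched with any HTTP client, with an `Accept` header such as `application/sparql-results+json` to choose the result format.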
Japanese Linked Data Cloud
• 21 datasets
• Criteria
  – providing more than 1,000 triples
  – providing dereferenceable URIs, a data dump or a SPARQL endpoint
  – including Japanese labels
  – linking to other datasets in the LOD cloud or JLDC
• An open license is not mandatory
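The inclusion criteria above can be read as a simple predicate over a dataset description. The following Python sketch shows one way to write it down. The `Dataset` class and all of its field names are hypothetical and only mirror the listed criteria.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # All field names are hypothetical; they mirror the criteria listed above.
    name: str
    triples: int
    dereferenceable: bool
    has_dump: bool
    has_sparql_endpoint: bool
    has_japanese_labels: bool
    links_to_lod_cloud_or_jldc: bool
    open_license: bool = False  # recorded, but not required for inclusion


def meets_jldc_criteria(d: Dataset) -> bool:
    """Apply the four inclusion criteria; the license is deliberately ignored."""
    accessible = d.dereferenceable or d.has_dump or d.has_sparql_endpoint
    return (
        d.triples > 1000
        and accessible
        and d.has_japanese_labels
        and d.links_to_lod_cloud_or_jldc
    )
```

Because `open_license` plays no part in the check, a dataset without an open license can still be included, in line with the last bullet.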
JLDC with LOD cloud criteria
21 → 9
Links to/from Japanese WordNet
             links    WN nouns   DBpedia IRIs   WN to DBpedia   DBpedia to WN
resources    33,017   65,788     1,456,158      50.1%           2.3%
properties    1,245   65,788        16,020       1.9%           7.8%
Ongoing Work
• More Wikipedia entries and infoboxes
  – Wikipedia Town
• More DBpedia mappings
  – Mapping Party
• Parsers for Japanese
  – Japanese Calendar: 慶応3年1月2日 => "1868-01-02"^^xsd:date
Summary
• Linked Data in Japan is steadily expanding
  – Started by a research project
  – Now extended to various areas
• Creating a local chapter of DBpedia is key to promoting Linked Data in the local language
  – It serves as a hub in the local language
  – People in any field can find connections between DBpedia and their own data
• Promotion of open licenses is still in progress