
Linked data migrational framework


In collaboration with

NANYANG TECHNOLOGICAL UNIVERSITY

Wee Kim Wee

School of Communication & Information

K6299 – Critical Inquiry in Knowledge Management

Proposal for Designing a Linked Data Migrational Framework for Singapore Government Data Sets

Under the guidance of

Dr. Khoo Soo Guan, Christopher (Assoc Prof)

Mr. Soy Boom Lim (Manager, iDA Singapore)

Submitted by

SESAGIRI RAAMKUMAR ARAVIND (G1101761F)

THANGAVELU MUTHU KUMAAR (G1101765E)

KALEESWARAN SUDARSAN (G1001065F)


Introduction

“The Internet is becoming the town square for the global village of tomorrow.” This quote from Bill Gates, Chairman of Microsoft, aptly describes today's business world, in which the Internet is the dominant medium for connecting resources across geographies and enables a large volume of transactions with ease. The challenge now lies in enabling machines to read and understand data on the Internet, so that chains of intelligent transactions can replace steps that were previously manual because the traditional WWW presents content in a human-readable format. This idea was formulated as the Semantic Web, in which content is defined with semantics (Berners-Lee, Hendler & Lassila, 2001). Building on this concept, the Linked Data principles were released to guide individuals, enterprises and public bodies in publishing their data in a common standard, RDF (Resource Description Framework), to form a web of data (Berners-Lee, 2006). Standardised data representation provides more scope for interlinking data sets across domains, creating avenues for multi-point usage and knowledge discovery through intelligent software applications built over it.

The large-scale application of Linked Data chosen for exploration is the set of eGovernment (eGov) initiatives by the US, the UK and many other nations to publish their Open Government Data (OGD) on governance and public affairs, for transparency and value co-creation, so as to empower people with appropriate knowledge. The recent Open Government Partnership [1] mandates nations to publish their OGD in linked data format. Many nations have started to publish their data as linked data, the latest being Brazil's data portal, data.gov.br [2]. The start of the Linked Data movement spurred the release of new data sets, as highlighted by the LOD cloud [3] maintained through the CKAN [4] registry. The US and UK governments have realised the benefits by releasing selected data sets in linked data format on the portals data.gov [5] and data.gov.uk [6] respectively. Well-defined relationships between these data sets, together with ready-made applications, guide the public's daily activities related to transport, business and other needs. Some of the existing applications are Numberhood [7], FixMyTransport [8], BIS Research Funding Explorer [9], SemaPlorer [10] and the “Linking Wildland Fire and Government Budget” mashup [11].

[1] Open Government Partnership http://www.state.gov/g/ogp/
[2] Brazil Data Portal data.gov.br
[3] The LOD cloud diagram shows datasets that have been published in Linked Data format by contributors to the Linking Open Data community project and other individuals and organisations http://richard.cyganiak.de/2007/10/lod/
[4] Comprehensive Knowledge Archive Network http://ckan.net/
[5] data.gov
[6] data.gov.uk

Singapore's current OGD ecosystem does not make use of Linked Data standards. This proposal suggests a migrational framework from the existing system of data publishing. As a starting point, a study of the current OGD ecosystem in Singapore is being carried out. iDA [12] maintains the portal data.gov.sg [13], which handles data collated from different government agencies (Chee Hean, 2011). The portal aims to meet the Singapore public's data needs and to establish a co-creative environment. The data is provided in different structured and unstructured formats such as text, Excel, PDF, XML, web pages and maps, and also through agency-specific Application Programming Interfaces (APIs) and web services. There are multiple endpoints for data consumption; prominent examples include data.gov.sg, the OneMap API [14], Singapore Statistics [15], mytransport.sg [16] and Integrated Land Information Services [17]. There is some redundancy in data across the different sources in the current OGD ecosystem, with limited interlinking and re-use capabilities. The vocabularies used by the agencies are specific to each agency, with limited standardisation of commonly used terms. As a result, building a mash-up application that leverages data across agencies is complex. This study has indicated scope for applying linked data, since linked data requires standardised data representation at the source level and a common interface at the publication level, with the data sets linked by interconnected vocabularies.

[7] http://www.Numberhood.net
[8] http://www.fixmytransport.com/
[9] http://consulting.talis.com/case-study/bis-research-funding-explorer/
[10] http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/Research/systeme/semap
[11] http://logd.tw.rpi.edu/demo/linking_wildland_fire_and_government_budget
[12] Infocomm Development Authority of Singapore (iDA) http://www.ida.gov.sg/home/index.aspx
[13] data.gov.sg
[14] http://www.onemap.sg
[15] http://www.singstat.gov.sg/
[16] http://mytransport.sg
[17] http://www.inlis.gov.sg/layout/homepage.aspx#


Fig1: Linked Data implementation over current DGS (DATA.GOV.SG) Ecosystem

Objectives of the Proposal

The current study aims to build a linked data migrational framework that iDA and Singapore Government agencies could use to publish their data sets to the public as linked data. A multi-step methodology will be devised, with clearly defined activities and deliverables at each step, based on the current ecosystem of data.gov.sg and other OGD publishing portals in Singapore. Geographical and statistical data have been selected for describing each step in the framework.

The framework build process is based on the metadata and specifications provided by iDA and government agencies. The current study focuses on linking the internal data sets. Additionally, it aims to provide recommendations on a few use cases that leverage external linked data. The holistic nature of the framework will be validated with geographical and statistical data provided by SLA and DOS.

Other objectives of the study are as follows:

1.) Explore case studies pertaining to the implementation of Linked Open Government Data (LOGD)
2.) Prepare an inventory by assessing different linked data tools, technical frameworks and processes
3.) Provide recommendations for linked data implementation according to the nature of the government agency
4.) Build an Ontology Network model (Haase et al., 2006) meant to unify vocabularies from different agency domains
5.) Build a proof-of-concept (POC) application based on the devised methodology to validate its applicability. This objective is subject to the availability of sufficient time and infrastructure.


The migrational framework will be useful to iDA in formulating its Linked Data implementation strategy in the near future, as the government body intends to make data.gov.sg the cornerstone portal for OGD publication. The common output interface suggested by the framework will showcase the potential of unifying the different endpoints provided by the agencies, thereby simplifying access and facilitating the creation of applications that integrate data from disparate sources. The ontology network suggested by the framework will help the agencies standardise vocabulary across domains, so that they better understand their data and its relation to data from other agencies.

The framework can also be used by enterprises and individuals to understand the steps, tools and processes involved in releasing their data to the WWW in the form of linked data.

Literature Review

The Semantic Web facilitates a web of data [18] that works on top of the URI [19], RDF [20], Ontology [21] and SPARQL [22] concepts. Resources and values are identified and described in a common standard, RDF, based on a modelled ontology that specifies the relationships (Berners-Lee, Hendler & Lassila, 2001). The LOD2 [23] initiative aims to build a LOD stack of products, frameworks and processes to accelerate the implementation of linked data across the globe. W3C has set up two committees [24] to provide best practices and recommendations for governments to publish their OGD in a standardised linked data format. Bizer, Heath, Idehen & Berners-Lee (2008), Villazón-Terrazas, Vilches-Blázquez, Corcho & Gómez-Pérez (2011) and Hyland & Wood (2011) provide cookbooks and guidelines for converting OGD to Linked Data format. These are helpful in understanding the general steps and tools required to convert and publish OGD in Linked Data format. Governments that are new entrants in adopting a Linked Data publication strategy need a tailored migrational framework specific to the local OGD ecosystem. The customised framework could be used by the government steering committee to expedite the migration to LOGD format.

[18] Linked Data and Web of Data http://www.youtube.com/watch?v=GKfJ5onP5SQ
[19] Uniform Resource Identifiers (URIs) are short strings that identify resources in the web: documents, images, downloadable files, services, electronic mailboxes, and other resources. They make resources available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail addressable in the same simple way http://www.w3.org/Addressing/
[20] RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed http://www.w3.org/RDF/
[21] Ontologies or vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern. http://www.w3.org/standards/semanticweb/ontology
[22] SPARQL is an RDF query language; its name is an acronym that stands for SPARQL Protocol and RDF Query Language. http://www.w3.org/TR/rdf-sparql-query/
[23] LOD2 Project http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html
[24] http://www.w3.org/2011/gld/charter and http://www.w3.org/egov/

Methodology

Prior to this proposal, the project team held discussions with iDA staff, SLA staff and NIIT staff (the IT vendor supporting the DGS [25] platform) to get a basic understanding of the current architecture and to identify the DGS components that could accommodate changes as part of this study. Primary data will be provided by iDA and SLA. The data sets selected for the study are listed in Table 1.1. These seemingly disparate data sets can be connected to give context-specific knowledge about each site, so that prospective tenderers can gain insights into consumer and locality trends based on demographics.

Data set | Agency | Category | Data type
Resident Population by DGP Zone/Subzone and Age Group, Type of Dwelling, Ethnic Group | Department of Statistics | Population and Household Characteristics | Textual
Sites Sold by URA - Details | Urban Redevelopment Authority (URA) | Housing and Urban Planning | Textual

Table 1.1: Primary datasets used for the study
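How the two data sets might be connected can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (subzone, age_group, residents, site, use) and all sample values are invented placeholders, and it assumes that each sold site can be associated with a subzone, which the actual URA data set may or may not record directly.

```python
from collections import defaultdict

# Placeholder rows: field names and values are invented for illustration
# and do not come from the DOS or URA data sets.
population_rows = [
    {"subzone": "Subzone A", "age_group": "25-29", "residents": 1200},
    {"subzone": "Subzone A", "age_group": "30-34", "residents": 1500},
    {"subzone": "Subzone B", "age_group": "25-29", "residents": 800},
]
site_rows = [
    {"site": "Site 1", "subzone": "Subzone A", "use": "Residential"},
    {"site": "Site 2", "subzone": "Subzone B", "use": "Commercial"},
]


def residents_by_subzone(rows):
    """Sum the resident counts of all age groups per subzone."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["subzone"]] += row["residents"]
    return dict(totals)


def site_context(sites, population):
    """Attach the subzone's total residents to each site record."""
    totals = residents_by_subzone(population)
    # A site in a subzone without population rows gets residents=None.
    return [{**site, "residents": totals.get(site["subzone"])} for site in sites]


context = site_context(site_rows, population_rows)
for entry in context:
    print(entry)
```

The join key is the whole point of the exercise: once both data sets refer to a subzone by the same identifier, the demographic context of a site follows from a simple lookup.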

Only the latest year's data from these data sets will be used for the study, rather than the entire data sets. Secondary data will be extracted from LOGD statistical and geospatial data sets on the portal thedatahub.org for building the framework. The migrational framework will be customised to the current architecture of DGS, because its steps will be devised from an understanding of the different layers in DGS; it will nevertheless be generic enough to apply to other cases. The project team will conduct interviews with iDA support staff to collect specification documents and insights relevant to the current architecture of DGS.

The framework formulation will be based on a context-specific integration of different approaches put forth by LOGD advocates, researchers and practitioners. The steps in the framework will be sequential, each comprising sub-steps that cover its intrinsic activities. For example, object modelling of the different data objects in the selected data sets is a step that precedes the RDF modelling and ontology/vocabulary building steps. The steps will be substantiated with sample implementations using the primary data.

[25] DGS: Data.gov.sg data store


Suggestions from the W3C LOGD steering groups [24] will be taken into account in formulating the framework. The tools identified as part of the inventory will be used for framework activities such as RDF creation, RDF storage and ontology re-use/modelling.
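As a rough illustration of what the RDF creation activity could produce, the sketch below turns one modelled data object into N-Triples lines using only the Python standard library. The namespace http://example.org/sg/, the property names (subzone, ageGroup, residents) and the sample record are placeholders introduced here for illustration; the actual URI scheme and vocabulary would be outcomes of the URI formulation and ontology steps, and a dedicated RDF tool from the inventory would likely replace this hand-written serialisation.

```python
BASE = "http://example.org/sg/"  # placeholder namespace, not an official URI scheme
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def slug(text):
    """Lower-case a label and collapse non-alphanumeric runs into hyphens."""
    out = []
    for ch in text.lower():
        if ch.isalnum():
            out.append(ch)
        elif out and out[-1] != "-":
            out.append("-")
    return "".join(out).strip("-")


def literal(value):
    """Render a Python value as an N-Triples literal."""
    if isinstance(value, int):
        return f'"{value}"^^<{XSD_INTEGER}>'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_ntriples(record):
    """Describe one population observation as three N-Triples statements."""
    subzone_uri = f"<{BASE}subzone/{slug(record['subzone'])}>"
    subject = (
        f"<{BASE}observation/{slug(record['subzone'])}/{slug(record['age_group'])}>"
    )
    return [
        f"{subject} <{BASE}def/subzone> {subzone_uri} .",
        f"{subject} <{BASE}def/ageGroup> {literal(record['age_group'])} .",
        f"{subject} <{BASE}def/residents> {literal(record['residents'])} .",
    ]


# Hypothetical record; the values are not taken from the DOS data set.
lines = record_to_ntriples(
    {"subzone": "Subzone A", "age_group": "25-29", "residents": 1200}
)
print("\n".join(lines))
```

Minting the subzone as a URI rather than a plain string is what later allows other data sets to link to the same resource.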

Difficulties and Issues

Agencies do not provide raw data to iDA. Instead, aggregated report data is split into X dimensions representing columns, Y dimensions representing rows, and data points representing cells. These fields are provided in an XML file that is sent to iDA periodically. There is no separate master data file, and the hierarchy in the master data dimensions is not explicitly set or provided. Therefore, a mechanism needs to be devised to identify the master data and the relationships between the different levels in the master data dimensions. Because the data representation in the files is implicit, this mechanism may not serve as a generic transformation applicable to all agencies.
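The first part of such a mechanism, recombining X dimensions, Y dimensions and data points into individual observations, might look like the sketch below. The XML layout shown (the element names xdim, ydim and point and their attributes) is an assumption made purely for illustration, since the actual file format used by the agencies is not described here, and the sample values are invented. The sketch does not attempt to recover the dimension hierarchy, which remains the open problem noted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout: the real agency XML schema may differ.
SAMPLE = """
<report>
  <xdim id="x1">Males</xdim>
  <xdim id="x2">Females</xdim>
  <ydim id="y1">Subzone A</ydim>
  <ydim id="y2">Subzone B</ydim>
  <point x="x1" y="y1">10</point>
  <point x="x2" y="y1">12</point>
  <point x="x1" y="y2">7</point>
  <point x="x2" y="y2">9</point>
</report>
"""


def flatten(xml_text):
    """Turn the column/row/cell split back into one record per cell."""
    root = ET.fromstring(xml_text.strip())
    columns = {e.get("id"): e.text for e in root.iter("xdim")}
    rows = {e.get("id"): e.text for e in root.iter("ydim")}
    return [
        {
            "row": rows[p.get("y")],
            "column": columns[p.get("x")],
            "value": int(p.text),
        }
        for p in root.iter("point")
    ]


observations = flatten(SAMPLE)
for obs in observations:
    print(obs)
```

Each flattened record carries its row and column labels explicitly, which is the form needed before the labels can be matched against master data.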

The data conversion to RDF will not be done at the agency level; instead, it will be done on top of the data model in the iDA data store. This leads to data duplication, since the same data is held both in the data store and in RDF format for the Linked Data implementation.

There is currently no master data management system in place that standardises the dimension values across agencies. Standardisation is required to link common data in the data sets used in the study. This might be a complex task because different versions of master data values occur within a single data set and also across data sets.
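One simple way to approach this standardisation is a lookup table that maps every known variant of a dimension value to a single canonical identifier, as in the sketch below. The alias table is hypothetical: the variants listed are invented examples of how agencies might spell the same planning area, not values observed in the data sets, and a real solution would need a curated and maintained mapping.

```python
import re

# Hypothetical alias table: variant spellings mapped to one canonical identifier.
ALIASES = {
    "ang mo kio": "ang-mo-kio",
    "amk": "ang-mo-kio",
    "bukit merah": "bukit-merah",
    "bt merah": "bukit-merah",
}


def normalise(label):
    """Return the canonical identifier for a label, or None if it is unknown."""
    key = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
    return ALIASES.get(key)


def standardise(values):
    """Split raw dimension values into mapped and unmatched ones."""
    mapped, unmatched = {}, []
    for value in values:
        canonical = normalise(value)
        if canonical is None:
            unmatched.append(value)
        else:
            mapped[value] = canonical
    return mapped, unmatched


mapped, unmatched = standardise(["ANG MO KIO", "Ang-Mo-Kio", "Bt. Merah", "Unknown Zone"])
print(mapped)
print(unmatched)
```

Keeping the unmatched values separate rather than guessing makes the gaps in the master data visible, so that they can be resolved with the owning agency.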

The current OGD ecosystem of Singapore provides users with multiple endpoints such as APIs, web services and files. A common endpoint in the form of a Linked Data API would mean building different wrappers over these endpoints. The diagram below, from Bizer, Heath, Idehen & Berners-Lee (2008), illustrates the different approaches to implementing linked data over existing systems.


Fig2: Different Linked Data Implementation Approaches
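The wrapper idea can be sketched as a set of small adapters, one per existing endpoint type, registered behind a single access point that returns subject-predicate-object triples. Everything in this sketch is a stand-in: api_source and CSV_SOURCE imitate an agency API response and a published file, the class name CommonEndpoint is made up for illustration, and the triples use plain strings where a real implementation would use URIs and serve them through a Linked Data API or SPARQL endpoint.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


def api_source() -> List[dict]:
    """Stand-in for an agency API response (hypothetical payload)."""
    return [{"id": "site-1", "use": "Residential"}]


CSV_SOURCE = "id,residents\nsubzone-a,2700\n"  # stand-in for a published file


def wrap_api() -> Iterator[Triple]:
    """Wrapper over the API-style source."""
    for row in api_source():
        yield (row["id"], "use", row["use"])


def wrap_csv() -> Iterator[Triple]:
    """Wrapper over the file-style source."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield (row["id"], "residents", row["residents"])


class CommonEndpoint:
    """Single access point that delegates to the registered wrappers."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, Callable[[], Iterable[Triple]]] = {}

    def register(self, name: str, wrapper: Callable[[], Iterable[Triple]]) -> None:
        self._wrappers[name] = wrapper

    def triples(self, predicate: Optional[str] = None) -> List[Triple]:
        """Collect triples from every wrapper, optionally filtered by predicate."""
        result = []
        for wrapper in self._wrappers.values():
            for triple in wrapper():
                if predicate is None or triple[1] == predicate:
                    result.append(triple)
        return result


endpoint = CommonEndpoint()
endpoint.register("api", wrap_api)
endpoint.register("file", wrap_csv)
all_triples = endpoint.triples()
print(all_triples)
```

A consumer only sees the common endpoint, so a new agency source can be added by writing one more wrapper without changing client applications.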

Schedule

The schedule for the study is covered in the embedded Gantt chart.

Proposed Report Outline

The proposed final report will be structured in the following format.

1. Abstract
2. Introduction
   a. Introduction to Linked Data and its relevance to Open Government Data and eGov
   b. Overview of SG OGD Ecosystem
3. Literature Review
   a. Government Linked Data Implementation Cookbooks, Guidelines and Recommendations
      i. URI formulation
      ii. RDF creation
      iii. Ontology Formulation
      iv. Publication and Exploitation
4. Migrational Framework
   a. Multi-step methodology
      i. Formulation and Description
      ii. Examples
5. Implementation Results and Observations
   a. POC details
   b. Description of issues faced in implementation
6. Limitations
7. Conclusion and Recommendations

A few new sections and sub-sections might be added in the final report.

Dissemination of Results

The migrational framework will be published as a report, subject to review by the NTU supervisor, followed by submission to iDA. The researchers plan to publish the report as a conference paper later in the year.

References

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34.

Berners-Lee, T. (2006). Linked Data. Available: http://www.w3.org/DesignIssues/LinkedData.html. Last accessed 11th Jan 2012.

Chee Hean, T. (2011). Keynote Address by Mr Teo Chee Hean, Deputy Prime Minister, Coordinating Minister for National Security and Minister for Home Affairs at the e-Gov Global Exchange 2011. Available: http://www.ida.gov.sg/News%20and%20Events/20110620114104.aspx?getPagetype=21. Last accessed 11th Jan 2012.

Bizer, C., Heath, T., Idehen, K., & Berners-Lee, T. (2008). Linked Data: Evolving the Web into a Global Data Space. (J. Hendler & F. Van Harmelen, Eds.) Proceeding of the 17th international conference on World Wide Web WWW 08 (Vol. 1, p. 1265). ACM Press.

Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., & Gómez-Pérez, A. (2011). Methodological guidelines for publishing government linked data. In Wood, D., editor, Linking Government Data, chapter 2, pages 27-49. Springer New York, New York, NY.

Hyland, B., & Wood, D. (2011). The joy of data - a cookbook for publishing linked government data on the web. In Wood, D., editor, Linking Government Data, chapter 1, pages 3-26. Springer New York, New York, NY.

Haase, P., Rudolph, S., Wang, Y., Brockmans, S., Palma, R., Euzenat, J., & d'Aquin, M. (2006, November). Networked Ontology Model. Technical Report, NeOn project deliverable D1.1.1.
