© Insight 2014. All Rights Reserved Knowledge Processing with Big Data and Semantic Web Technologies Ali Hasnain, Narumol Prangnawarat, Stefan Decker, Naoise Dunne

Knowledge Processing with Big Data and Semantic Web Technologies


Page 1: Knowledge Processing with Big Data and  Semantic Web Technologies

© Insight 2014. All Rights Reserved

Knowledge Processing with Big Data and Semantic Web Technologies

Ali Hasnain, Narumol Prangnawarat, Stefan Decker, Naoise Dunne

Page 2: Knowledge Processing with Big Data and  Semantic Web Technologies


Presenters and Contributors

Ali Hasnain

Stefan Decker

Narumol Prangnawarat

Naoise Dunne

Page 3: Knowledge Processing with Big Data and  Semantic Web Technologies

Agenda
• Motivation
• Infrastructure
• Data Curation
• Query Federation
• Analyze
• Visualization
• Hands On Session

Page 4: Knowledge Processing with Big Data and  Semantic Web Technologies


Session 0: Motivation

Page 5: Knowledge Processing with Big Data and  Semantic Web Technologies

The Web is evolving...

WWW (Tim Berners-Lee)
“There was a second part of the dream […] we could then use computers to help us analyse it, make sense of what we're doing, where we individually fit in, and how we can better work together.”

Page 6: Knowledge Processing with Big Data and  Semantic Web Technologies

A Network of Knowledge
● Interconnected
● Universal
● All encompassing
● assists humans, organisations and systems with problem solving
● enabling innovation and increased productivity

Page 7: Knowledge Processing with Big Data and  Semantic Web Technologies

Two Key Ingredients

1. RDF – Resource Description Framework
Graph-based data: nodes and arcs
• Identifies objects (URIs)
• Interlinks information (relationships)

2. Vocabularies (Ontologies)
• provide a shared understanding of a domain
• organise knowledge in a machine-comprehensible way
• give an exploitable meaning to the data
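To make the "nodes and arcs" idea concrete, here is a minimal Python sketch that stores a graph as (subject, predicate, object) triples. The `http://example.org/` namespace, the property names and the helper functions are illustrative assumptions, not part of any vocabulary used in the slides.

```python
# A tiny in-memory RDF-style graph: each statement is a
# (subject, predicate, object) triple. All URIs below are made up.
EX = "http://example.org/"

triples = {
    (EX + "Ireland", EX + "hasCapital", EX + "Dublin"),
    (EX + "Ireland", EX + "hasLargestCity", EX + "Dublin"),
    (EX + "Dublin", EX + "label", "Dublin"),
}

def objects(graph, subject, predicate):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)

def predicates_between(graph, subject, obj):
    """Return the arcs (relationships) that connect two nodes, sorted."""
    return sorted(p for s, p, o in graph if s == subject and o == obj)

print(objects(triples, EX + "Ireland", EX + "hasCapital"))
print(predicates_between(triples, EX + "Ireland", EX + "Dublin"))
```

Because URIs identify the nodes, two datasets that use the same URI for Dublin can be merged simply by taking the union of their triple sets.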

Page 8: Knowledge Processing with Big Data and  Semantic Web Technologies

Why Graphs?

[Figure: RDF graph combining data from Wikipedia.org and Gov.ie around the node EU:RepublicOfIreland. Nodes: Cities:Dublin, Geo:IslandOfIreland, 84421km2, Person:EndaKenny, IE:DepartmentOfFinance. Edges: Geo:locatedOn, Geo:area, Geo:hasCapital, Geo:hasLargestCity, Gov:hasTaoiseach, Gov:hasDepartment.]


Page 10: Knowledge Processing with Big Data and  Semantic Web Technologies

Why Graphs?

[Figure: the gene TGFβ-3 described in two databases that link through the shared node.
Gene Database: nci:has_description "transforming growth factor, beta 3"; nih:organism Homo sapiens; nih:sequence CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGG …; nih:location 14q24.
Pathway Database: rea:process "Platelet activation, signalling, aggregation"; rea:process "Response to elevated platelet cytosol Ca2+"; rea:process "Platelet degranulation".]


Page 12: Knowledge Processing with Big Data and  Semantic Web Technologies


Linked Open Data Cloud

Page 13: Knowledge Processing with Big Data and  Semantic Web Technologies


Life Sciences….

Page 14: Knowledge Processing with Big Data and  Semantic Web Technologies


Cultural Institutions...

Page 15: Knowledge Processing with Big Data and  Semantic Web Technologies


Open Government Data...

Page 16: Knowledge Processing with Big Data and  Semantic Web Technologies


Legacy Data Sources….

Page 17: Knowledge Processing with Big Data and  Semantic Web Technologies

How to analyse these data?

Heterogeneity in
• Data: different formats
• Domains: how to cross discipline borders?
• Users: Life Science data needs different analysis and visualisation than Cultural data

An analysis tool for each domain?

Page 18: Knowledge Processing with Big Data and  Semantic Web Technologies

Reusable Infrastructure: Knowledge Pipeline

[Figure: pipeline leading from Information to Action through three stages: Networked Data Management; Abstraction, Reasoning, Analytics; Visualisation, Collaboration, Exploitation.]

Page 19: Knowledge Processing with Big Data and  Semantic Web Technologies

Challenges for a Knowledge Pipeline

• How to ingest data sets
• How to automate and scale processing
• How to realise large-scale Linked Data processing
• How to analyse large data sets
• How to visualise large data sets
• How to combine different components

Note (Stefan Decker): Different things the tutorial can cover: 1) RDF databases 2) IR techniques 3) Query optimisation and graph summarisation
Page 20: Knowledge Processing with Big Data and  Semantic Web Technologies

Session 1: Infrastructure
Trends in big linked data infrastructure

Page 21: Knowledge Processing with Big Data and  Semantic Web Technologies

Introduction
Cloud computing frameworks tailored for managing and analysing big data sets are powering ever larger clusters of computers.

This presentation describes the infrastructure required to serve a linked data flavour of big data.

Page 22: Knowledge Processing with Big Data and  Semantic Web Technologies

What is Big Data?
“Big Data is characterized by High Volume, Velocity and Variety requiring specific Technology and Analytical Methods for its transformation into Value” - Gartner

• Volume - the data is too big to fit on even the largest server
• Velocity - the data must be handled at the speed at which it arrives
• Variety - heterogeneous data that comes in many forms
• Veracity (sometimes included) - the quality of the data

Page 23: Knowledge Processing with Big Data and  Semantic Web Technologies

[Figure: infographic summarizing the 4 Vs]

Page 24: Knowledge Processing with Big Data and  Semantic Web Technologies

The qualities of a big data infrastructure
• Distributed
  • Data and its processing are shared across a large cluster of cheap commodity servers
  • Loose coupling, isolation, location transparency, data locality & app-level composition
• High Utilisation
  • The infrastructure makes the best use of computing resources
• Resilience (handling failure)
  • The infrastructure stays responsive in the face of failure and can “heal”
• Scalability
  • Grows to meet demand (elastic) and stays responsive under varying workload (load balanced)
• Operationally efficient
  • Needs to be highly automated and very easy to maintain

Page 25: Knowledge Processing with Big Data and  Semantic Web Technologies

Distributed
High Utilisation & Scalability

The Rise of Distributed Datacenter Schedulers

Page 26: Knowledge Processing with Big Data and  Semantic Web Technologies

Why Datacenter Schedulers?
Schedulers run your distributed apps. They:
• are an operating system kernel for the cloud
• coordinate execution of work on the cluster
• help you get as many compute resources as you want, whenever you want them
• abstract away some scalability and load-balancing issues

Page 27: Knowledge Processing with Big Data and  Semantic Web Technologies

Benefits of using a Scheduler
• Efficiency - best use of computing resources
• Agility - change your application mix with no turnaround
• Scalability - grow to the current demand of your app
• Modularity - two-level schedulers have plugin frameworks that allow quick repurposing of the core and no reliance on one vendor (more later)

Page 28: Knowledge Processing with Big Data and  Semantic Web Technologies

Datacenter schedulers
Schedulers help you focus on your own work and not the infrastructure.
“It's great to be able to focus on what it is you want to be doing rather than worrying about how do you get what it is you need in order to be able to get stuff done” - John Wilkes (Google)

Page 29: Knowledge Processing with Big Data and  Semantic Web Technologies

Quick history of distributed schedulers

History of Datacenter Schedulers (timeline, 2003 to 2015)
• 2003 Google File System
• 2003 Slurm
• 2004 MapReduce paper
• 2004 Google Borg
• 2005 Hadoop started
• 2008 Hadoop released
• 2010 Spark paper
• 2010 Nexus (Mesos)
• 2011 Hadoop 1.0
• 2011 Mesos paper
• 2013 YARN
• 2013 Mesos released
• 2014 Kubernetes
• 2014 Google Omega paper

Page 30: Knowledge Processing with Big Data and  Semantic Web Technologies

Hadoop
Monolithic scheduler: the original open-source datacenter scheduler

• Jobs are batched and executed
• Designed only to run MapReduce jobs
• No concurrency between apps
• Evolving into YARN

[Figure: Hadoop stack. Labels: Linux Server (×2); hadoop - resource management; mesos slave (×2); Linked data M/R job (×2); Linked data app.]

Page 31: Knowledge Processing with Big Data and  Semantic Web Technologies

Mesos
Two-level scheduler: more flexible

• Can schedule many kinds of applications
• Frameworks (such as Spark) are delegated the per-application scheduling
• Mesos is responsible for distributing resources between applications and enforcing overall fairness
• Very modular thanks to two-level scheduling: frameworks manage their apps as they like

[Figure: Mesos stack. Labels: Linux Server (×2); Mesos - resource management; Mesos - scheduler jobs; frameworks Chronos, Spark and Marathon; mesos slave (×2); Hadoop M/R job; Linked data job; Linked data app.]
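The split of responsibilities in two-level scheduling can be sketched in a few lines of Python. This is a toy model written for this transcript, not the Mesos API: the `Framework` class, the `allocate` function and the equal-share policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Framework:
    """Second level: decides which of its own tasks fit into an offer."""
    name: str
    pending: list  # CPU demand of each waiting task
    launched: list = field(default_factory=list)

    def accept(self, offered_cpus):
        """Launch waiting tasks that fit the offer; return CPUs used."""
        used = 0
        for cpus in list(self.pending):
            if used + cpus <= offered_cpus:
                self.pending.remove(cpus)
                self.launched.append(cpus)
                used += cpus
        return used

def allocate(total_cpus, frameworks):
    """First level: offer free CPUs to frameworks, equal share first."""
    free = total_cpus
    share = total_cpus // len(frameworks)
    for fw in frameworks:
        free -= fw.accept(min(share, free))
    # Hand any leftover capacity to whoever still has work.
    for fw in frameworks:
        free -= fw.accept(free)
    return free

spark = Framework("spark", pending=[4, 4, 2])
marathon = Framework("marathon", pending=[1, 1])
leftover = allocate(12, [spark, marathon])
print(spark.launched, marathon.launched, leftover)
```

The resource manager never inspects individual tasks; it only decides how much capacity each framework is offered, which is what keeps the two levels independent.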

Page 32: Knowledge Processing with Big Data and  Semantic Web Technologies

Schedulers Recap
Schedulers allow you to use your cluster as one machine. They:
• ease operations
• provide elasticity and load balancing
• can run both batch and longer-running jobs
• are efficient, agile, scalable and modular

Page 33: Knowledge Processing with Big Data and  Semantic Web Technologies

Managing Failure in Infrastructure

“Everything fails all the time”
Werner Vogels (CTO Amazon)

Page 34: Knowledge Processing with Big Data and  Semantic Web Technologies

Handling Failure
The harsh reality: all distributed infrastructures must deal with failure.
For the designers of applications running on distributed infrastructure (even Mesos), a great number of design mistakes need to be avoided...

Page 35: Knowledge Processing with Big Data and  Semantic Web Technologies

Fallacies of distributed computing
The most common misconceptions that lead to failure of network infrastructure:
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.

All prove to be false in the long run and all cause big trouble and painful learning experiences.

Page 36: Knowledge Processing with Big Data and  Semantic Web Technologies

How successful infrastructure avoids failure

No single point of failure
Multiple masters, multiple copies of data, and redundancy on all services. Use elections between masters, distributed locks and thresholds.

Design applications to expect failure
Applications should continue to function even if the underlying physical hardware fails. Evolving infrastructure eases this through the use of schedulers and container managers.

Special channel for failures
Provides the means to delegate errors as messages on their own channel or service. Techniques include log aggregation and shared monitoring services.
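One common way for an application to "expect failure" is to retry transient errors with an increasing delay. The sketch below is a generic illustration, not something taken from the slides: `call_with_retry`, `flaky_lookup` and the choice of `ConnectionError` as the transient error are assumptions.

```python
import time

def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `operation`; on ConnectionError wait and retry, doubling the delay."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))

# Hypothetical flaky service: fails twice, then succeeds.
calls = {"n": 0}

def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node unavailable")
    return "ok"

delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_lookup, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the back-off schedule easy to inspect; in production code a random jitter is usually added to the delay as well.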

Page 37: Knowledge Processing with Big Data and  Semantic Web Technologies

Operationally efficient
Using Containers for better Quality of Service

Page 38: Knowledge Processing with Big Data and  Semantic Web Technologies

Containers
Containers run your application in isolation, in a portable and repeatable fashion.
“Because of the way that ... containers separate the application constraints from infrastructure concerns, we help solve that dependency hell.” - Docker

Page 39: Knowledge Processing with Big Data and  Semantic Web Technologies

Why Containers

Containers are:
• Small (footprint)
• Portable
• Fast

Containers allow:
• Resiliency
  • can be redeployed in seconds
• Operational efficiency
  • the infrastructure can “heal”
• Scalability
  • on a single server, resources (such as CPU) can be dialed up or down

Page 40: Knowledge Processing with Big Data and  Semantic Web Technologies

Containers vs. VMs

Virtual Machines
Emulate virtual hardware and require considerable overhead in CPU, disk and memory.

Containers
Share the host operating system and are much more efficient than hypervisors in terms of system resources.

Page 41: Knowledge Processing with Big Data and  Semantic Web Technologies


Containers vs. VMs

Page 42: Knowledge Processing with Big Data and  Semantic Web Technologies

Containers
But which container to use?

Many choices exist… LXC, Docker, Rocket

Page 43: Knowledge Processing with Big Data and  Semantic Web Technologies

Docker - our container of choice
Why Docker?
• Best of breed at the moment
• Integrates natively with Mesos and Kubernetes
• Has great infrastructure, including
  • Docker registry for looking up containers
  • Docker compose for combining containers

Alternatives
• Rocket - more secure as it uses init.d, but Linux only

Page 44: Knowledge Processing with Big Data and  Semantic Web Technologies

Container Standardisation
But choosing a container is not a commitment... Standardisation looks to be around the corner via the Open Container Initiative:

https://www.opencontainers.org/

Initiative Sponsors: Apcera, AT&T, AWS, Cisco, ClusterHQ, CoreOS, Datera, Docker, EMC, Fujitsu, Google, Goldman Sachs, HP, Huawei, IBM, Intel, Joyent, Kismatic, Kyup, the Linux Foundation, Mesosphere, Microsoft, Midokura, Nutanix, Oracle, Pivotal, Polyverse, Rancher, Red Hat, Resin.io, Suse, Sysdig, Twitter, Verizon, VMWare

Page 45: Knowledge Processing with Big Data and  Semantic Web Technologies

Recap: Containers
Containers:
• are small (footprint), portable and fast
• allow you to deploy applications repeatably
• work well with schedulers such as Mesos
• help with resiliency and scalability

Page 46: Knowledge Processing with Big Data and  Semantic Web Technologies

Putting it all together
Insight's Linked Data infrastructure

Page 47: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked Data Infrastructure

[Figure: architecture stack. Three Linux servers, each running a Mesos client and Docker; Mesos - resource management; Mesos - scheduler for short jobs and Mesos - scheduler for long-running jobs; Spark framework, Chronos framework, Marathon framework; datastores (HDT, Neo4J), GraphX, Granatum Revealed, Graph Jobs; OS Monitor and Mesos Monitor.]

• Resources (CPU, memory, disk) are managed by Mesos
• Applications work with frameworks to get the resources they need
• Frameworks negotiate with Mesos to run their jobs
• Docker manages isolation on the Linux servers

Page 48: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked Data Infrastructure

[Figure: the same architecture stack. Three Linux servers, each running a Mesos client and Docker; Mesos - resource management; Mesos - scheduler for short jobs and Mesos - scheduler for long-running jobs; Spark framework, Chronos framework, Marathon; datastores (HDT, Neo4J), GraphX, Granatum Revealed; OS Monitor and Mesos Monitor.]

• We use GraphX for large graph batch jobs
• We use both HDT (RDF store) and Neo4J (graph)
• We deploy specialised linked data applications to the cluster

Page 49: Knowledge Processing with Big Data and  Semantic Web Technologies

Recap
To provide linked data at scale and with the right service mix, infrastructures need to consider:

• Services that help applications to be distributed and scalable
• High utilisation of computing resources
• Knowing how you will handle failure
• Operational efficiency

We suggest using schedulers such as Mesos with containers (Docker), together with suitable frameworks (GraphX/Spark) and datastores (Neo4J).

Page 50: Knowledge Processing with Big Data and  Semantic Web Technologies

Data Curation
Ali Hasnain, Narumol Prangnawarat, Naoise Dunne

Page 51: Knowledge Processing with Big Data and  Semantic Web Technologies

Intro
We will discuss:
• Serialisation formats for RDF
• Converting between these…
• Mapping to conventional data
  • D2RQ
  • TARQL

Page 52: Knowledge Processing with Big Data and  Semantic Web Technologies


Serialisation Formats

Page 53: Knowledge Processing with Big Data and  Semantic Web Technologies

RDF Serialization formats - W3C Standards
• Turtle
  • a compact, human-friendly format
• N-Triples, N-Quads
  • a simple, easy-to-parse, line-based format that is not as compact as Turtle. N-Quads is a superset of N-Triples for multiple RDF graphs
• JSON-LD
  • a JSON-based serialization
• RDF/XML
  • the first standard format for serializing RDF

Page 54: Knowledge Processing with Big Data and  Semantic Web Technologies

N-Triples: unreadable

<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .

<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Art Barstow" .

<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/publisher> <http://www.w3.org/> .
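Although N-Triples is hard for people to read, its line-based layout makes it easy for programs to parse. The following Python sketch handles only a simplified subset (URI subjects and predicates, URI or plain-literal objects); the regular expression and the `parse_ntriples` helper are assumptions for illustration, and a real project would use a full RDF library instead.

```python
import re

# Simplified N-Triples line: <subject> <predicate> (<object> | "literal") .
# Assumption: no blank nodes, language tags, datatypes or escaped quotes.
LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.$')

def parse_ntriples(text):
    """Return a list of (subject, predicate, object, object_is_uri) tuples."""
    triples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        s, p, uri, literal = match.groups()
        if uri is not None:
            triples.append((s, p, uri, True))
        else:
            triples.append((s, p, literal, False))
    return triples

DATA = '''
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/publisher> <http://www.w3.org/> .
'''

for triple in parse_ntriples(DATA):
    print(triple)
```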

Page 55: Knowledge Processing with Big Data and  Semantic Web Technologies

RDF Serialization formats - Non Standard
Non-standard but popular formats

• N3 (Notation3)
  • a non-standard serialization that is very similar to Turtle but has some additional features, such as the ability to define inference rules
• HDT
  • compressed binary RDF; HDT compresses big RDF datasets while still supporting search and browse operations
• Microformats
  • similar to RDF and can be converted to RDF. They use HTML pages as both a human-readable document and machine-readable data, and are very widely used on the web

Page 56: Knowledge Processing with Big Data and  Semantic Web Technologies

Converting between standards
Any23

• Created at DERI (Insight) for converting between popular serialisation formats
• Now a first-class Apache project
• Online converter: http://any23-vm.apache.org/

Page 57: Knowledge Processing with Big Data and  Semantic Web Technologies

Mapping traditional data to Linked Data
Two very popular tools created at Insight (DERI):

D2RQ - http://d2rq.org/
Maps relational databases to RDF

TARQL
Maps tables (CSV) to RDF

Page 58: Knowledge Processing with Big Data and  Semantic Web Technologies

Why Tarql
Very simple mapping syntax.

• Most of the world’s structured data is stored as tables
• Most RDBMS database tables can be denormalized to a single table
• Data cleansing can be done as an earlier step, using best-in-class tools for tabular data
• Very easy to learn and use compared to tools such as D2RQ

Page 59: Knowledge Processing with Big Data and  Semantic Web Technologies

TARQL

General structure of a TARQL mapping:

• A normal SPARQL query, but with a FROM clause that names the input file
• Works with both SELECT and CONSTRUCT queries

SELECT DISTINCT ?id ?name
FROM <file:filename.csv>
WHERE {}
LIMIT 100

Page 60: Knowledge Processing with Big Data and  Semantic Web Technologies

TARQL - Select

Looks very much like normal SPARQL. In the WHERE clause:

• Special BIND statements bind column names to graph constructs

SELECT ...
WHERE {
  BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
  BIND (STRLANG(?a, 'en') AS ?with_language_tag)
}

Page 61: Knowledge Processing with Big Data and  Semantic Web Technologies

TARQL - Construct

Looks very much like normal SPARQL. In the WHERE clause:

• Special BIND statements bind column names to graph constructs

CONSTRUCT {
  ?URI a ex:Organization;
    ex:name ?NameWithLang;
    ex:CIK ?CIK;
    ex:LEI ?LEI;
    ex:ticker ?Stock_ticker;
}
FROM <file:companies.csv>
WHERE {
  BIND (URI(CONCAT('companies/', ?Stock_ticker)) AS ?URI)
  BIND (STRLANG(?Name, "en") AS ?NameWithLang)
}
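To show what such a mapping produces, the Python sketch below imitates the CONSTRUCT template row by row. It is not TARQL itself: the sample CSV content, the `EX` namespace assumed for the `ex:` prefix and the `rows_to_triples` helper are all illustrative assumptions.

```python
import csv
import io

# Hypothetical sample of companies.csv; the column names follow the mapping.
CSV_TEXT = """Name,CIK,LEI,Stock_ticker
Example Corp,0000001,LEI0001,EXC
"""

BASE = "companies/"            # same relative base as in the mapping
EX = "http://example.com/ns#"  # assumed expansion of the ex: prefix

def rows_to_triples(csv_text):
    """Emit one set of triples per CSV row, mirroring the CONSTRUCT template."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = BASE + row["Stock_ticker"]  # BIND (URI(CONCAT(...)) AS ?URI)
        name = (row["Name"], "en")        # BIND (STRLANG(?Name, "en") AS ...)
        triples += [
            (uri, "rdf:type", EX + "Organization"),
            (uri, EX + "name", name),
            (uri, EX + "CIK", row["CIK"]),
            (uri, EX + "LEI", row["LEI"]),
            (uri, EX + "ticker", row["Stock_ticker"]),
        ]
    return triples

for triple in rows_to_triples(CSV_TEXT):
    print(triple)
```

Each CSV column becomes a variable of the same name, which is why the mapping can refer to `?Name` and `?Stock_ticker` without declaring them.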


Page 63: Knowledge Processing with Big Data and  Semantic Web Technologies

Accessing Data - Query Federation

Ali Hasnain, Narumol Prangnawarat, Naoise Dunne

Page 64: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL Query Federation Approaches

• SPARQL Endpoint Federation (SEF)
• Linked Data Federation (LDF)
• Distributed Hash Tables (DHTs)
• Hybrid of SEF+LDF

Courtesy Muhammad Saleem (AKSW)

Page 65: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL Query Federation Approaches

Courtesy Muhammad Saleem (AKSW)

Page 66: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL Endpoint Federation Approaches
• The most commonly used approaches
• Make use of SPARQL endpoint URLs
• Fast query execution
• RDF data needs to be exposed via SPARQL endpoints
• E.g., HiBISCus, FedX, SPLENDID, ANAPSID, LHD etc.

Courtesy Muhammad Saleem (AKSW)

Page 67: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked Data Federation Approaches
• Data need not be exposed via SPARQL endpoints
• Use URI lookups at runtime
• Data should follow Linked Data principles
• Slower than SPARQL endpoint federation
• E.g., LDQPS, SIHJoin, WoDQA etc.

Courtesy Muhammad Saleem (AKSW)

Page 68: Knowledge Processing with Big Data and  Semantic Web Technologies

Query federation on top of Distributed Hash Tables
• Uses DHT indexing to federate SPARQL queries
• Space efficient
• Cannot deal with the whole LOD cloud
• E.g., ATLAS

Courtesy Muhammad Saleem (AKSW)

Page 69: Knowledge Processing with Big Data and  Semantic Web Technologies

Hybrid of SEF+LDF
• Federation over SPARQL endpoints and Linked Data
• Can potentially deal with the whole LOD cloud
• E.g., ADERIS-Hybrid

Courtesy Muhammad Saleem (AKSW)

Page 70: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL Endpoint Federation

[Figure: federation engine sitting on top of four RDF sources, S1 to S4.]

• Parsing/Rewriting: rewrite the query and get individual triple patterns
• Source Selection: identify capable sources for the individual triple patterns
• Federator Optimizer: generate an optimized sub-query execution plan
• Execute the sub-queries against the sources
• Integrator: integrate the sub-query results

Courtesy Muhammad Saleem (AKSW)


Page 74: Knowledge Processing with Big Data and  Semantic Web Technologies

FedBench (LD3): Return for all US presidents their party membership and news pages about them.

SELECT ?president ?party ?page WHERE {
  ?president rdf:type dbpedia:President .                  # TP1
  ?president dbpedia:nationality dbpedia:United_States .   # TP2
  ?president dbpedia:party ?party .                        # TP3
  ?x nyt:topicPage ?page .                                 # TP4
  ?x owl:sameAs ?president .                               # TP5
}

Source Selection Algorithm: triple pattern-wise source selection

Sources (RDF): S1 dbpedia, S2 KEGG, S3 ChEBI, S4 NYT, S5 SWDF, S6 LMDB, S7 Jamendo, S8 Geo Names, S9 DrugBank

Source Selection
TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1 S2 S4-S9

Total triple pattern-wise sources selected = 1+1+1+1+8 = 12

Courtesy Muhammad Saleem (AKSW)
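The source selection walk-through can be reproduced with a small Python sketch. The capability index below is hypothetical and was chosen only so that the result matches the slide; real federation engines typically derive such information from ASK queries or dataset summaries. `PREDICATE_INDEX`, `CLASS_INDEX` and `select_sources` are names invented for this example.

```python
ALL = {f"S{i}" for i in range(1, 10)}  # S1 .. S9

# Hypothetical capability index, chosen to reproduce the slide's result.
PREDICATE_INDEX = {
    "dbpedia:nationality": {"S1"},
    "dbpedia:party": {"S1"},
    "nyt:topicPage": {"S4"},
    "owl:sameAs": ALL - {"S3"},
    "rdf:type": ALL,
}
CLASS_INDEX = {"dbpedia:President": {"S1"}}

QUERY = [  # (subject, predicate, object) triple patterns TP1..TP5
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]

def select_sources(pattern):
    """Return the sources that may contribute answers to one triple pattern."""
    _, predicate, obj = pattern
    candidates = set(PREDICATE_INDEX.get(predicate, ALL))
    if predicate == "rdf:type" and obj in CLASS_INDEX:
        candidates &= CLASS_INDEX[obj]  # narrow by the requested class
    return candidates

selection = [select_sources(tp) for tp in QUERY]
total = sum(len(sources) for sources in selection)
for number, sources in enumerate(selection, start=1):
    print(f"TP{number}:", sorted(sources))
print("total:", total)
```

The total number of triple pattern-wise selected sources is a common measure of source selection quality: every extra source means an extra sub-query sent over the network.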

Page 76: Knowledge Processing with Big Data and  Semantic Web Technologies

Overview of Implementation Details of Federated SPARQL Query Engines

Page 77: Knowledge Processing with Big Data and  Semantic Web Technologies

System Features of Federated SPARQL Query Engines

Page 78: Knowledge Processing with Big Data and  Semantic Web Technologies

Systems’ Support for SPARQL Query Constructs

Page 79: Knowledge Processing with Big Data and  Semantic Web Technologies

Analysis of Linked Data at scale

Narumol Prangnawarat, Ali Hasnain, Naoise Dunne

Page 80: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked Data
A method of publishing structured data so that it can be interlinked.

• Normally represented as RDF
• Normally queried using SPARQL

4 principles of linked data
1. Use URIs to name (identify) things.
2. Use HTTP URIs so that they can be looked up.
3. Provide metadata about what the thing is, using open standards (RDF, SPARQL, etc.).
4. Link to other things using their HTTP URI-based names.

Page 81: Knowledge Processing with Big Data and  Semantic Web Technologies

How do I query Linked Data?

Page 82: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked data as a graph
Many approaches to querying and reasoning over linked data exist.

The most popular query language in the RDF community is SPARQL, but alternatives exist…

… if we think about linked data as a graph

Page 83: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked data as a graph
Because Linked Data graphs share a graph structure and can be connected, we can reason over them using graph algorithms.

Popular approaches to querying graphs are:
• Declarative pattern matching - SPARQL
• Graph traversal languages - Cypher, Gremlin
• Distributed graph data structures - GraphX
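The difference between declarative pattern matching and traversal can be illustrated on a toy graph in plain Python. Neither function below is SPARQL, Cypher or Gremlin; `match`, `shortest_path` and the sample data are assumptions made to show the two styles side by side.

```python
from collections import deque

# Toy graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("alice", "worksAt", "insight"),
]

def match(pattern, triples):
    """Declarative style: None is a wildcard, like a SPARQL variable."""
    return [t for t in triples
            if all(want is None or want == got for want, got in zip(pattern, t))]

def shortest_path(start, goal, triples):
    """Traversal style: breadth-first search along outgoing edges."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for s, _, o in triples:
            if s == path[-1] and o not in seen:
                seen.add(o)
                queue.append(path + [o])
    return None

print(match((None, "knows", None), TRIPLES))
print(shortest_path("alice", "dave", TRIPLES))
```

Pattern matching describes what the answer looks like and leaves the search strategy to the engine, while traversal spells out how to walk the graph, which is why path-style questions are usually easier to express in the second style.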

Page 84: Knowledge Processing with Big Data and  Semantic Web Technologies

Comparing Graph Query Approaches

Page 85: Knowledge Processing with Big Data and  Semantic Web Technologies

Comparing Graph Query Approaches

Cypher - graph traversal
Somewhere between graph traversal and pattern matching; runs on Neo4J.
Characteristics
• Property graph
• Index-free adjacency
• Uses adjacency tables
Pros
• Great at localized searches (shortest path, for instance)
• Is a database
• Is an expressive language
Cons
• Poor at aggregation

GraphX - message passing
A DSL rather than a language. For “programmers”: Scala, Java and Python APIs.
Characteristics
• Resilient Distributed Property Graph
• Index-free adjacency
• Vertex/edge tables
Pros
• Best distributed execution (of the three listed)
• Powerful mix of table and traversal operations
• Has a set of optimised operators
Cons
• Not a database; data needs to be loaded and stored separately

SPARQL - declarative
A declarative language, allowing better expression and abstracting the writer from optimisation problems.
Characteristics
• Stores triples (tuples)
• Vendors normally optimise popular queries
• Often built on an RDBMS
Pros
• Is an expressive language
• Easy to express search patterns
Cons
• Difficult to scale
• Poor at traversal
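The message-passing style can be sketched in plain Python as a synchronous, vertex-centric shortest-path computation. This only mimics the idea and is not the GraphX API; `pregel_sssp` and the sample edge list are assumptions, and the sketch assumes non-negative edge weights.

```python
import math

def pregel_sssp(edges, source):
    """Single-source shortest paths by synchronous message passing."""
    vertices = {v for edge in edges for v in edge[:2]}
    distance = {v: math.inf for v in vertices}
    messages = {source: 0.0}  # initial message to the source vertex
    while messages:
        # Vertex program: keep the best distance and note who improved.
        improved = []
        for vertex, candidate in messages.items():
            if candidate < distance[vertex]:
                distance[vertex] = candidate
                improved.append(vertex)
        # Send step: improved vertices message their out-neighbours;
        # messages to the same target are merged with min().
        messages = {}
        for src, dst, weight in edges:
            if src in improved:
                offer = distance[src] + weight
                messages[dst] = min(offer, messages.get(dst, math.inf))
    return distance

EDGES = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]
print(pregel_sssp(EDGES, "a"))
```

Each round only touches vertices that received a message, and the rounds are independent per vertex, which is what lets this style be distributed across a cluster.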

Page 86: Knowledge Processing with Big Data and  Semantic Web Technologies

Comparing Graph Query Approaches

Cypher - graph traversal
Somewhere between graph traversal and pattern matching; runs on Neo4J.

GraphX - message passing
A DSL rather than a language. Scala, Java and Python APIs.

SPARQL - declarative
A declarative language, allowing better expression and abstracting the writer from optimisation problems.

[Figure: the three approaches placed along four axes: Abstract to Concrete; Considerable work (for vendors) to scale to Little work to scale; Optimised for aggregation versus Optimised for connections; Global to Local.]

Page 87: Knowledge Processing with Big Data and  Semantic Web Technologies

Other Graph Query Languages

Cypher - graph traversal
Somewhere between graph traversal and pattern matching; runs on Neo4j.
Alternatives
• Gremlin: a “pure” graph traversal language based on XPath
• NetworkX: a Python library for graphs

GraphX - message passing
An API (DSL) rather than a query language, aimed at programmers. It runs on Spark, which has Scala, Java and Python APIs.
Alternatives
• GraphLab: very powerful commercial product
• Giraph: a Hadoop API for graphs

SPARQL - declarative
Declarative language, allowing better expression and abstracting the writer from optimisation problems.
Alternatives
• Cypher(!): has pattern matching constructs and can do much the same as SPARQL on a Neo4j database

Page 88: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL at big data scale

What about using SPARQL in distributed big data infrastructure?

Page 89: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL at big data scale
At big data scale and in distributed infrastructure, SPARQL quickly becomes an impediment. Why?

• It is difficult to optimise SPARQL at scale with fast data

• As optimisations are embedded in the query, optimised SPARQL queries become less natural and harder to write

• SPARQL abstractions “leak”, leading to “hacks” of big data RDF infrastructure

Page 90: Knowledge Processing with Big Data and  Semantic Web Technologies

Why use Graph algorithms

Why not SPARQL
• For distributed computing, declarative languages such as SPARQL are (for now) problematic
• You cannot know in advance whether a query is NP or EXP time
• It is difficult to create query plans, especially with fast or changing graphs
• You have to rely on federation, which is still being researched

Why Graph Traversal
• Proven to scale
• Graph traversal is lower level but easier to tune so that it runs in P time
• Most popular algorithms are local (they only query neighbouring nodes at any time) and are thus easier to break up across compute nodes

Page 91: Knowledge Processing with Big Data and  Semantic Web Technologies

Scaling SPARQL
So you still want to use SPARQL at scale. How do we query RDF data with SPARQL at big data scale?

Clustering
• Some vendors/platforms provide clustered triple stores

Federation
Became available in SPARQL 1.1 with the SERVICE keyword.
• Federation emerged with great promises
• For distributed computing, SPARQL federation has limitations (for now)
• Query planning for SPARQL is NP-complete
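To see why federation is costly, it helps to look at what a federated engine ultimately has to do: join solution mappings produced locally with solution mappings returned by a remote SERVICE endpoint. The sketch below is a simplified, in-memory illustration in plain Python, not any engine's actual implementation. The `join` helper and every value in the two lists are hypothetical placeholders.

```python
def join(left, right):
    """Join two lists of solution mappings (dicts) on their shared variables."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[v] == r[v] for v in shared):
                out.append({**l, **r})
    return out

# Hypothetical bindings: in a real federated query the first list would come
# from the local store and the second from a remote SERVICE endpoint.
local_targets = [{"dbXref": "protein-1"}, {"dbXref": "protein-2"}]
remote_pathways = [
    {"dbXref": "protein-1", "pathname": "pathway A"},
    {"dbXref": "protein-9", "pathname": "pathway B"},
]

results = join(local_targets, remote_pathways)
```

Only the `protein-1` row survives the join, yet every remote row had to be shipped over the network first. Deciding which side to evaluate first, and how much to send, is the query planning problem mentioned above.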

Page 92: Knowledge Processing with Big Data and  Semantic Web Technologies

Alternatives to Scaling SPARQL
Fortunately, both graph traversal and GraphX are approaches to linked data that work at scale today. We will look at these approaches now.

Page 93: Knowledge Processing with Big Data and  Semantic Web Technologies

Analysing Linked Data as a Graph

Querying linked data at scale with technologies complementary to SPARQL

Page 94: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked data as a graph
Linked data can be considered a heterogeneous graph.

What do we mean by a heterogeneous graph?
• Linked data graphs share a graph structure but have mixed characteristics
• Nodes (vertices) contain different data
• Links (edges) are a mix of directed and undirected
• A linked data graph can be weighted or not
• It can have a mix of classes and types from differing ontologies
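The characteristics above can be sketched with a few lines of plain Python. Everything in this example is hypothetical: the `ex:`, `exA:` and `exB:` prefixes stand in for two made-up ontologies, and treating `exB:interactsWith` as undirected is an assumption chosen for illustration.

```python
from collections import defaultdict

# Hypothetical triples mixing classes and predicates from two vocabularies.
triples = [
    ("ex:drug1", "rdf:type", "exA:Drug"),
    ("ex:drug1", "exA:targets", "ex:protein1"),
    ("ex:protein1", "rdf:type", "exB:Protein"),
    ("ex:protein1", "exB:interactsWith", "ex:protein2"),
    ("ex:protein2", "rdf:type", "exB:Protein"),
]
UNDIRECTED = {"exB:interactsWith"}  # predicates treated as symmetric links

def build_graph(triples):
    """Split triples into node types and a labelled adjacency structure."""
    types = defaultdict(set)      # node -> set of classes
    adjacency = defaultdict(set)  # node -> set of (predicate, neighbour)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
            continue
        adjacency[s].add((p, o))
        if p in UNDIRECTED:
            adjacency[o].add((p, s))  # add the reverse direction as well
    return types, adjacency

types, adjacency = build_graph(triples)
```

The resulting graph has nodes of different classes, a directed `exA:targets` edge and an undirected `exB:interactsWith` edge, all in one structure.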

What do these graphs look like?

Page 95: Knowledge Processing with Big Data and  Semantic Web Technologies

Graphs within a Heterogeneous Graph

Page 96: Knowledge Processing with Big Data and  Semantic Web Technologies

Linked data as a graph
As linked data graphs share a graph structure and can be connected, we can reason over them using graph algorithms.

Popular approaches are:
• Graph traversal languages - Cypher, Gremlin
• Distributed graph data structures - GraphX

Declarative pattern matching (SPARQL) is also an option, but can be hard to scale.

Page 97: Knowledge Processing with Big Data and  Semantic Web Technologies

Cypher graph language
Some basic Cypher...

Find the actor named "Tom Hanks":

    MATCH (tom {name: "Tom Hanks"})
    RETURN tom

Who directed "Cloud Atlas"?

    MATCH (cloudAtlas {title: "Cloud Atlas"})<-[:DIRECTED]-(directors)
    RETURN directors.name

Page 98: Knowledge Processing with Big Data and  Semantic Web Technologies

Shortest Path
Shortest path is the problem of finding a path between two vertices (or nodes) in a graph with the lowest total weight (this could be cost or distance).

Page 99: Knowledge Processing with Big Data and  Semantic Web Technologies

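What is Dijkstra
An approach to shortest path, using traversal, that finds the answer in polynomial time instead of enumerating every possible path, which can take exponential time.

A naive approach such as Floyd-Warshall, which computes shortest paths between all pairs of vertices, has a complexity of:

O(|V|³)

With Dijkstra's approach (single source, non-negative weights, Fibonacci-heap priority queue) we get a worst-case complexity of:

O(|E| + |V| log |V|)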

Page 100: Knowledge Processing with Big Data and  Semantic Web Technologies

Dijkstra using Cypher

    MATCH p = shortestPath(
      (bacon:Person {name: "Kevin Bacon"})-[*]-(meg:Person {name: "Meg Ryan"})
    )
    RETURN p

Find the shortest path between the Person named Kevin Bacon and the Person named Meg Ryan. Note that shortestPath counts relationships (hops) and does not take edge weights into account.
DEMO

Page 101: Knowledge Processing with Big Data and  Semantic Web Technologies

Cypher Shortest path

Page 102: Knowledge Processing with Big Data and  Semantic Web Technologies

Another useful algorithm: community detection
What is community detection?
A graph is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally.
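"Densely connected internally" can be quantified with modularity, the score that the Louvain method (used later in this section) tries to maximise. The sketch below is a plain-Python illustration for an unweighted, undirected graph given as an edge list. It only scores a given grouping; it does not search for one, so it is not an implementation of Louvain. The function name and the toy graph in the usage note are assumptions.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition.
    `edges` is a list of undirected (u, v) pairs; `community` maps node -> label.
    Q = sum over communities of (intra_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    intra = defaultdict(int)   # edges with both ends in the community
    degree = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

For two triangles joined by a single edge, grouping each triangle as its own community scores 5/14 (about 0.357), while putting all six nodes in one community scores 0. A higher score means the grouping captures more of the internal density.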

Page 103: Knowledge Processing with Big Data and  Semantic Web Technologies

Visualisation tools for large graphs
On larger datasets, Neo4j's UI becomes unresponsive for large result sets from queries such as community detection.

• Instead, you may want to use a visualisation tool such as Gephi.

• We will now demonstrate Louvain community detection using the graph visualisation tool Gephi.

Page 104: Knowledge Processing with Big Data and  Semantic Web Technologies

Using Gephi
Here is a graph that we calculated using Louvain earlier...

This example shows clustering around topics discussed on Twitter:
• The graph clusters tweets around the topic under discussion
• Each retweet or reply creates a link in the graph

Page 105: Knowledge Processing with Big Data and  Semantic Web Technologies

Gephi Screenshot

Page 106: Knowledge Processing with Big Data and  Semantic Web Technologies

Using Gephi
Here is a graph that we calculated using Louvain earlier...

This example shows clustering around topics discussed on Twitter:
• We cluster tweets around the topic under discussion
• Each retweet or reply creates a link in the graph

Once the Louvain analysis is complete, we load the results into Gephi and gain new insights by visualising the clusters around these topics:

• Compared to text analysis, we are able to detect deeper community structures from the retweets and replies to tweets

• This gives us a deeper understanding of the individuals and communities

Page 107: Knowledge Processing with Big Data and  Semantic Web Technologies

SPARQL Federation example

Q: How are the protein targets of the Gleevec drug differentially expressed, and which pathways are they involved in?

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
    PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
    PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
    PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
    PREFIX sio: <http://semanticscience.org/resource/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?dbXref (str(?pathwayname) AS ?pathname) ?factorLabel
    WHERE {
      # query chembl for gleevec (CHEMBL941) protein targets
      ?act a cco:Activity ;
           cco:hasMolecule chembl_molecule:CHEMBL941 ;
           cco:hasAssay ?assay .
      ?assay cco:hasTarget ?target .
      ?target cco:hasTargetComponent ?targetcmpt .
      ?targetcmpt cco:targetCmptXref ?dbXref .
      ?targetcmpt cco:taxonomy .
      ?dbXref a cco:UniprotRef

      # query for pathways by those protein targets
      SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
        ?protein rdf:type biopax3:Protein .
        ?protein biopax3:memberPhysicalEntity [biopax3:entityReference ?dbXref] .
        ?pathway biopax3:displayName ?pathwayname .
        ?pathway biopax3:pathwayComponent ?reaction .
        ?reaction ?rel ?protein .
      }

      # get Atlas experiment plus experimental factor where protein is expressed
      SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
        ?probe atlasterms:dbXref ?dbXref .
        ?value atlasterms:isMeasurementOf ?probe .
        ?value atlasterms:hasFactorValue ?factor .
        ?value rdfs:label ?factorLabel .
      }
    }

Page 108: Knowledge Processing with Big Data and  Semantic Web Technologies

Why use Graph algorithms

Polynomial time
Many queries on linked data can be expressed as well-known graph traversal algorithms that work in P time, such as Dijkstra for positively weighted directed graphs. Because these algorithms are localised, they suit distributed computing.

Page 109: Knowledge Processing with Big Data and  Semantic Web Technologies

More pure Dijkstra using Cypher

    MATCH (from:Location {LocationName: "x"}), (to:Location {LocationName: "y"}),
          paths = allShortestPaths((from)-[:CONNECTED_TO*]->(to))
    WITH REDUCE(dist = 0, rel IN rels(paths) | dist + rel.distance) AS distance, paths
    RETURN paths, distance
    ORDER BY distance
    LIMIT 1

The other approach used a built-in function. This one sums the distance property along each candidate path and returns the path with the lowest total, which is a closer approximation of the actual algorithm.
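The query above is only an approximation of a weighted shortest path, assuming allShortestPaths selects its candidates by number of relationships before the distances are summed. The plain-Python sketch below mimics that two-step behaviour on a hypothetical four-node network (`GRAPH`, the node names and the distances are all made up) so the difference is visible.

```python
from collections import deque
from functools import reduce

# Hypothetical network: node -> {neighbour: distance}.
GRAPH = {
    "x": {"a": 1, "y": 10},
    "a": {"b": 1},
    "b": {"y": 1},
    "y": {},
}

def all_fewest_hop_paths(graph, start, goal):
    """Breadth-first search returning every path with the minimum edge count."""
    paths, best = [], None
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # BFS order: all remaining paths are longer
        if path[-1] == goal:
            best = len(path)
            paths.append(path)
            continue
        for nbr in graph[path[-1]]:
            if nbr not in path:
                queue.append(path + [nbr])
    return paths

def total_distance(graph, path):
    """Counterpart of REDUCE(dist = 0, rel IN rels(path) | dist + rel.distance)."""
    return reduce(lambda dist, hop: dist + graph[hop[0]][hop[1]],
                  zip(path, path[1:]), 0)

candidates = all_fewest_hop_paths(GRAPH, "x", "y")
best = min(candidates, key=lambda p: total_distance(GRAPH, p))
```

Here the only fewest-hop candidate is the direct edge `x -> y` with distance 10, so that is what gets returned, even though `x -> a -> b -> y` has a total distance of 3. A true weighted search such as Dijkstra would find the second path.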

Page 110: Knowledge Processing with Big Data and  Semantic Web Technologies

Dijkstra Pseudocode

    dist[s] ← 0                                  (distance to source vertex is zero)
    for all v ∈ V - {s}
        do dist[v] ← ∞                           (set all other distances to infinity)
    S ← ∅                                        (S, the set of visited vertices, is initially empty)
    Q ← V                                        (Q, the queue, initially contains all vertices)
    while Q ≠ ∅                                  (while the queue is not empty)
        do u ← mindistance(Q, dist)              (select and remove the element of Q with the min. distance)
           S ← S ∪ {u}                           (add u to list of visited vertices)
           for all v ∈ neighbors[u]
               do if dist[v] > dist[u] + w(u,v)            (if new shortest path found)
                      then dist[v] ← dist[u] + w(u,v)      (set new value of shortest path)
                           (if desired, add trace back code)
    return dist
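The pseudocode can be turned into a short runnable sketch. This version uses Python's `heapq` binary heap as the queue, which is an implementation choice not stated on the slide; with a binary heap the running time is O((|V| + |E|) log |V|) rather than the Fibonacci-heap bound. It assumes every vertex appears as a key in `graph` and that all weights are non-negative.

```python
import heapq
import math

def dijkstra(graph, source):
    """Shortest distances from `source`.
    `graph` maps vertex -> {neighbour: non-negative weight}."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    visited = set()          # S in the pseudocode
    queue = [(0, source)]    # Q, ordered by tentative distance
    while queue:
        d, u = heapq.heappop(queue)   # u <- mindistance(Q, dist)
        if u in visited:
            continue                  # stale queue entry, already settled
        visited.add(u)
        for v, w in graph[u].items():
            if dist[v] > d + w:       # new shortest path found
                dist[v] = d + w
                heapq.heappush(queue, (dist[v], v))
    return dist
```

For the hypothetical graph `{"s": {"a": 7, "b": 2}, "b": {"a": 3, "c": 8}, "a": {"c": 1}, "c": {}}`, calling `dijkstra(graph, "s")` gives distances of 2 to `b`, 5 to `a` (via `b`) and 6 to `c` (via `b` and `a`).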

Page 111: Knowledge Processing with Big Data and  Semantic Web Technologies

Visualization
Ali Hasnain, Stefan Decker, Naoise Dunne

Page 112: Knowledge Processing with Big Data and  Semantic Web Technologies

Visualization

• Visualize your Data!
• Available Tools
  • ReVeaLD
  • FedViz
  • Genome Wheel

Page 113: Knowledge Processing with Big Data and  Semantic Web Technologies

ReVeaLD Search Platform
ReVeaLD (Real-Time Visual Explorer and Aggregator of Linked Data) is a user-driven, domain-specific search platform.

Intuitively formulate advanced search queries using a click-input-select mechanism

Visualize the results in a domain–suitable format.

Assembly of the query is governed by a Domain Specific Language (DSL), which in this case is the Granatum Biomedical Semantic Model (CanCO)

Page 114: Knowledge Processing with Big Data and  Semantic Web Technologies

ReVeaLD Search Platform
Availability: http://n10.soma.insight-centre.org:31005/explorer

Demo: https://www.youtube.com/watch?v=6HHK4ASIkJM&hd=1

Courtesy of Maulik Kamdar

Page 115: Knowledge Processing with Big Data and  Semantic Web Technologies

DSL Visual Representation
Concept Map Visualization is used.

Page 116: Knowledge Processing with Big Data and  Semantic Web Technologies


Visual Query Builder

Page 117: Knowledge Processing with Big Data and  Semantic Web Technologies


Visual Query Model

Page 118: Knowledge Processing with Big Data and  Semantic Web Technologies

FedViz: A Visual Interface for SPARQL Queries Formulation and Execution

FedViz is an online application that provides biologists with a flexible visual interface to formulate and execute both federated and non-federated SPARQL queries.

It translates the visually assembled queries into their SPARQL equivalent and executes them using the FedX query engine.

Availability: http://srvgal86.deri.ie/FedViz/index.html

Courtesy of Sana e Zainab

Page 119: Knowledge Processing with Big Data and  Semantic Web Technologies


FedViz: A Visual Interface for SPARQL Queries Formulation and Execution

Page 120: Knowledge Processing with Big Data and  Semantic Web Technologies


Using FedViz: Step by Step

Page 121: Knowledge Processing with Big Data and  Semantic Web Technologies

GenomeSnip Platform
A semantic, visual analytics prototype devised to expedite knowledge exploration and discovery in cancer research.

Idea: ‘Snip’ the human genome informatively into fragments through interaction with an aggregative, circular visualization, the ‘Genomic Wheel’ (circular), and introspectively analyze the snipped fragments in a ‘Genomic Tracks’ (linear) display.

Technologies: Web-based client application developed using native technologies like HTML5 Canvas, JavaScript and JSON. The KineticJS library, an HTML5 Canvas JavaScript framework, is used for node nesting, layering, caching and event handling.

Availability: http://srvgal78.deri.ie/genomeSnip/

Courtesy of Maulik Kamdar

Page 122: Knowledge Processing with Big Data and  Semantic Web Technologies


Genome Browser

Page 123: Knowledge Processing with Big Data and  Semantic Web Technologies


Hands On Session

Narumol Prangnawarat, Ali Hasnain, Naoise Dunne

Page 124: Knowledge Processing with Big Data and  Semantic Web Technologies

Convert CSV File to RDF using TARQL

Instruction manual: https://goo.gl/xLpF8Y
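Before working through the manual, it may help to see what a CSV-to-RDF conversion amounts to. The sketch below is plain Python using only the standard library; it is not TARQL and does not use TARQL's mapping syntax. The `http://example.org/` namespace, the `csv_to_ntriples` helper and the sample rows are hypothetical, and it assumes column names and identifiers are safe to embed in an IRI.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace

def csv_to_ntriples(text, id_column):
    """Turn each CSV row into N-Triples lines: one subject per row,
    one predicate per column, cell values as plain string literals."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}resource/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the key column and empty cells
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}property/{column}> "{literal}" .')
    return lines

sample = "id,name,city\n1,Alice,Galway\n2,Bob,\n"
ntriples = csv_to_ntriples(sample, "id")
```

The sample produces three triples: a name and a city for row 1, and only a name for row 2 because its city cell is empty. TARQL expresses the same kind of row-to-triple mapping as a SPARQL CONSTRUCT query instead of code.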