RDF and Java

Monica Macoveiciuc and Constantin Stan

Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi

Abstract. The Web is a universal medium for information, data and knowledge exchange. The Semantic Web is an extension of the World Wide Web, “in which information is given well-defined meaning, better enabling computers and people to work in cooperation”[1]. RDF, together with SPARQL, provides a powerful mechanism for describing and interchanging metadata on the Web. This paper briefly presents the two concepts - RDF and SPARQL - and three of the most popular frameworks (written in Java) that offer support for RDF: Jena, Sesame and JRDF.

RDF and SPARQL

1 What is RDF?

RDF (Resource Description Framework) is the W3C standard for encoding knowledge. It is a structure for describing and interchanging metadata on the Web, in numerous forms and for numerous purposes.

RDF provides a consistent framework and syntax for describing and querying data. It also makes it easy to share website descriptions. RDF's family of specifications is quite complex and difficult to manage, which is why using its full capabilities is not always easy.

RDF offers a model for describing resources that have properties (attributes or characteristics). Any object that is uniquely identifiable by a URI (Uniform Resource Identifier) is considered a resource by RDF. Resources have properties associated with them; these properties are identified by property-types, which in turn have associated values. Property-types define the relations between values and resources. The values may be atomic or may be other resources (which can, in turn, have properties of their own). A group of properties that belong to the same resource is called a description.

The core of RDF is the triple described above: three pieces of information are all that is needed to fully define a single piece of knowledge.

These are the resource (or subject), which is the thing being described (identified by a URI); the property-type (or predicate), such as a relationship, an attribute or a characteristic; and the value of that property-type for the resource (or object). An RDF triple records these three pieces of information in a consistent manner, so that, ideally, the same data can be consumed by both humans and machines. This allows human meaning to be interpreted consistently and mechanically. For example, consider these three sentences:

I have a name, which is Monica Macoveiciuc.
I have a gender, which is female.
I have a job, which is programmer.

We can quickly identify the triple discussed earlier within the sentences above:

I (subject) have a name (property), which is Monica Macoveiciuc (property value).
I (subject) have a gender (property), which is female (property value).
I (subject) have a job (property), which is programmer (property value).

There are many ways to represent a triple. For example, we can use the 3-tuple representation, which has the form:

{subject, predicate, object}

Applied to the examples above, we get:

{I, name, Monica Macoveiciuc}
{I, gender, female}
{I, job, programmer}
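To make the 3-tuple notation concrete, the following is a minimal Java sketch that models a triple as a plain value type and builds the three example statements. The names Triple, TripleExample and toTupleString are illustrative assumptions; they do not belong to any RDF library, and real subjects and predicates would be URIs rather than short words.

```java
import java.util.List;

/** A statement in the 3-tuple form (subject, predicate, object). */
record Triple(String subject, String predicate, String object) {

    /** Renders the triple in the brace notation used in the text. */
    String toTupleString() {
        return "{" + subject + ", " + predicate + ", " + object + "}";
    }
}

/** Hypothetical holder for the example data; not part of any framework. */
final class TripleExample {

    /** The three example sentences, expressed as triples. */
    static List<Triple> exampleTriples() {
        return List.of(
                new Triple("I", "name", "Monica Macoveiciuc"),
                new Triple("I", "gender", "female"),
                new Triple("I", "job", "programmer"));
    }

    public static void main(String[] args) {
        // Print each triple in the {subject, predicate, object} notation.
        exampleTriples().forEach(t -> System.out.println(t.toTupleString()));
    }
}
```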

The above is just one way of serializing RDF data. The formal representation is the directed labeled graph. Two main reasons led to choosing it as the default representation: graphs are extremely easy to read (there can be no confusion between the three core elements of the tuple, or about the statements being made), and some RDF data models can be represented as RDF graphs but not in RDF/XML. The graph is a set of nodes connected by arcs, forming a node-arc-node pattern. There are three types of nodes: blank nodes, literals and URI references (urirefs).

RDF requires a syntax that represents this model, in order to store instances of the model in machine-readable files and to communicate these instances among applications. The answer is XML. For XML to support the consistent representation of semantics, RDF imposes a formal structure on it. To guarantee that identifiers are unique, RDF uses the namespace mechanism (which is part of the XML technology). RDF Schema acts as a bootstrapping mechanism for declaring the vocabulary needed to express the data model.

Elements such as rdf:RDF or rdf:Description have specific meanings. Both belong to the same namespace, rdf. For example, the rdf:RDF tag marks the boundaries within an XML document where the content is intended to map to an RDF data model instance, and the rdf:Description tag groups the statements made about one resource. The constraints imposed by RDF are there to support the consistent encoding and exchange of standardized metadata defined by different communities.

2 What is SPARQL?

SPARQL (pronounced “sparkle”; a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language. It is a recent W3C Recommendation, about which Sir Tim Berners-Lee said that it “will make a huge difference”. RDF is foundational to the Semantic Web. Until the launch of SPARQL, RDF had a data model, a formal semantics and a concrete serialization (in XML), but it did not have a standard query language.

SPARQL now offers the Semantic Web and Web 2.0 a common data manipulation language, in the form of expressive queries against the RDF data model. The SPARQL Protocol for RDF uses WSDL 2.0 to describe a very simple web service with a single operation, query, which is available with both HTTP and SOAP bindings. This operation is how SPARQL queries are sent to other sites and how the results are returned. The HTTP bindings are REST-friendly, and a simple SPARQL protocol client takes little code to implement.
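As an illustration of how little code the HTTP binding requires, the sketch below builds the URI of a GET request for the query operation, passing the URL-encoded query text in the query parameter. The endpoint http://example.org/sparql is a placeholder, and the class and method names are assumptions made for this example only.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that prepares a SPARQL protocol request over HTTP GET. */
final class SparqlHttpRequest {

    /**
     * Builds the request URI: the endpoint address followed by the
     * URL-encoded query text in the "query" parameter.
     */
    static URI buildQueryUri(String endpoint, String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        // The endpoint below is a placeholder, not a real service.
        URI uri = buildQueryUri("http://example.org/sparql",
                "SELECT ?s WHERE { ?s ?p ?o }");
        System.out.println(uri);
    }
}
```

The resulting URI can then be sent with any HTTP client (for instance java.net.http.HttpClient), and the response parsed as SPARQL Query Results XML.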

SPARQL consists of three separate specifications. The first is the query language specification, which makes up the core. The second is the query results XML format, which describes an XML format for serializing the results of SPARQL SELECT and ASK queries. The third is the data access protocol, which uses WSDL 2.0 to define simple SOAP and HTTP protocols for remotely querying RDF databases (or any data repository that can be mapped to the RDF model). Altogether, SPARQL consists of a query language, a means of conveying a query to a query processor service, and the XML format in which the results are returned.

Some issues are not yet addressed by SPARQL. The most notable is that it cannot modify an RDF dataset (it is read-only). As mentioned previously, RDF is built on the triple, a 3-tuple consisting of subject, predicate and object. SPARQL is likewise built on the triple pattern, which also consists of a subject, a predicate and an object. SPARQL matches patterns in an RDF graph using triple patterns, which are like triples except that they may contain variables in place of concrete values (the variables act as “wildcards” that match RDF terms in the dataset).
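The idea of a triple pattern can be sketched in a few lines of Java. This is a simplified illustration, not how a SPARQL engine is implemented: terms are plain strings, a term starting with '?' is treated as a variable, and all names (TriplePatternMatcher, match, bind) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy matcher for a single triple pattern over an in-memory list of triples. */
final class TriplePatternMatcher {

    record Triple(String subject, String predicate, String object) {}

    /** A term starting with '?' is a variable; any other term must match exactly. */
    static boolean isVariable(String term) {
        return term.startsWith("?");
    }

    /** Checks a constant term or binds a variable; returns false on a mismatch. */
    private static boolean bind(String patternTerm, String value, Map<String, String> bindings) {
        if (!isVariable(patternTerm)) {
            return patternTerm.equals(value);
        }
        String previous = bindings.putIfAbsent(patternTerm, value);
        // A variable used twice in the pattern must bind to the same value.
        return previous == null || previous.equals(value);
    }

    /** Returns one variable-to-value map (a solution) per matching triple. */
    static List<Map<String, String>> match(Triple pattern, List<Triple> data) {
        List<Map<String, String>> solutions = new ArrayList<>();
        for (Triple t : data) {
            Map<String, String> bindings = new HashMap<>();
            if (bind(pattern.subject(), t.subject(), bindings)
                    && bind(pattern.predicate(), t.predicate(), bindings)
                    && bind(pattern.object(), t.object(), bindings)) {
                solutions.add(bindings);
            }
        }
        return solutions;
    }
}
```

For instance, matching the pattern (?who, job, ?what) against the three example triples from Section 1 yields a single solution that binds ?who to I and ?what to programmer.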

The SELECT query extracts data from an RDF graph and returns it as a tabular result set. More complex graph patterns can combine required and OPTIONAL parts. UNION queries select among alternatives in the dataset. It is also possible to order the results with ORDER BY, skip ahead through the results using OFFSET, and LIMIT the amount of data returned. The SPARQL Query Results XML Format specification includes several relevant examples. Given the simple and regular structure of this format, manipulating it with XSLT or XQuery is fairly straightforward.
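The effect of ordering, OFFSET and LIMIT on a result set can be sketched with Java streams. This is only an analogy for the order in which these modifiers are applied (sort first, then skip, then truncate), using a hypothetical list of already-extracted values; it is not a query engine, and the names are ours.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrates the order in which the result modifiers are applied. */
final class SolutionModifiers {

    /**
     * Roughly what "ORDER BY ?v OFFSET offset LIMIT limit" does to a
     * single-column result: sort ascending, skip, then truncate.
     */
    static List<String> page(List<String> values, int offset, int limit) {
        return values.stream()
                .sorted(Comparator.naturalOrder())
                .skip(offset)
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

Applying the offset before sorting would give a different answer, which is why paging through results is only predictable when an ordering is specified.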

Syntax shortcuts make writing queries much simpler. They are especially useful with repetitive graph patterns and long URIs. SPARQL presents itself as the long-awaited missing piece of the Semantic Web and Web 2.0.

Java APIs for RDF

There are many frameworks for processing RDF available to Java programmers. Some of them also offer support for SPARQL queries. This paper presents three of the most popular frameworks: Jena, Sesame and JRDF.

3 Jena

3.1 The Model

Jena uses the concept of a graph for dealing with the data: the nodes correspond to URIs, while the edges are the triples. Graphs are represented through the Model interface, which has several implementations: a memory-based one, one that uses a relational database, and so on. The memory-based model is the simplest and easiest to use.

A triple is represented through an interface called Statement. A statement corresponds to an edge in the graph and consists of three parts:

– the subject - the resource from which the arc leaves - implements the Resource interface;

– the predicate - the property (the label of the arc) - implements the Property interface;

– the object - the resource or literal to which the arc points - implements the Resource or the Literal interface.

The components of the statement have a common base - the RDFNode interface.

The object component is more complex. A statement can be used as the object component of the triple, since RDF allows nested statements. Objects implementing the Container, Alt, Bag or Seq interface can also be used as objects. A resource is declared as follows:

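import com.hp.hpl.jena.rdf.model.*;

// Create a memory-based model, then a resource identified by its URI.
Model model = ModelFactory.createDefaultModel();
String resourceURL = "http://localhost:8080/George";
Resource person = model.createResource(resourceURL);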

The ModelFactory method createDefaultModel() creates a memory-based model, which is then used for creating a resource. This is done by calling the createResource method, to which we provide the URI of the resource. The Jena API contains constant classes for some well-known schemas, such as RDF and RDF Schema, Dublin Core and DAML. Adding the Formatted Name property of the vCard file format can be done easily:

person.addProperty(VCARD.FN, "George");

An RDF Model is represented as a set of statements. The components of a statement can be accessed through the getSubject, getPredicate and getObject methods of the Statement interface. The API provides methods for the most common operations:

– addProperty - adds a new statement (triple) to the model;

– listSubjects - lists the subject component of each triple in the model;

– listObjects - lists the object component of each triple in the model;

– write - writes the model in RDF/XML format to the output stream given as parameter;

– read - reads statements in RDF/XML format into a model.

The Jena2 persistent storage subsystem implements an extension of the Model class that provides transparent persistence for models through the use of a database engine. Implementations for MySQL, HSQLDB, PostgreSQL, Oracle and Microsoft SQL Server are provided, and other databases have been added by third parties.

TDB and SDB are two components of Jena that provide large-scale storage and querying of RDF datasets.

SDB is a system that uses relational databases for storing RDF and OWL. It supports many open source and commercial databases, including MySQL, PostgreSQL, Oracle 11g, Microsoft SQL Server and IBM DB2. It scales to graphs of 100 million triples.

TDB is a non-transactional, faster database solution for use by a single system. It scales well beyond SDB and is simpler to set up.

3.2 Inferences

SPARQL is implemented in Jena through the ARQ package, and queries may be made from within Java code or via a SPARQL client distributed with Jena.

The package that offers SPARQL support is com.hp.hpl.jena.query. Four types of queries are supported by the Jena classes: SELECT, ASK, DESCRIBE and CONSTRUCT.

An ASK query returns true if the query's graph pattern has any match in the dataset and false otherwise.

A DESCRIBE query returns a graph containing information related to the nodes matched in the graph pattern.

A CONSTRUCT query returns an RDF graph built by instantiating a template for each solution of the query.

For running a query, one needs:

– a Query object, obtained through the create method of the QueryFactory;

– a QueryExecution object, obtained through the QueryExecutionFactory;

– an execute method that depends on the type of the query (execSelect, execAsk, execDescribe or execConstruct).

For a SELECT query, the results are provided as a ResultSet, which is used to iterate over the solutions; each solution is a QuerySolution object. The results can be refined through the SPARQL options DISTINCT, LIMIT, OFFSET and ORDER BY, through optional and alternative matches, and through filters.

Jena offers support for working with multiple graphs. The DatasetFactory class can be used to specify, programmatically, the named graphs to be queried.

4 Sesame

4.1 The Model

Like Jena, Sesame uses a graph model for the resources. URIs are nodes, and each triple is a pair of edges (an edge from subject to predicate, and an edge from predicate to object). A central concept in Sesame is the Repository. A repository is an abstraction of a storage container for RDF data. This can mean Java objects in memory, or it can mean a relational database. Virtually all operations in Sesame happen with respect to a repository: the repository is the provider of persistence and querying capability.

The Graph API provides a representation of an RDF graph in the form of a Java object. The Graph object is used to store the triples. In order to be able to add statements to the graph, one must obtain a ValueFactory object from the Graph.

Graph graph = new org.openrdf.model.impl.GraphImpl();

ValueFactory factory = graph.getValueFactory();

Adding a statement is done in a similar way to Jena:

// Namespace shared by the subject and the predicate.
String namespace = "http://localhost:8080/human#";
URI subject = factory.createURI(namespace, "person");
URI predicate = factory.createURI(namespace, "hasName");
Literal object = factory.createLiteral("George");
// Add the (subject, predicate, object) statement to the graph.
graph.add(subject, predicate, object);

Sesame offers the possibility of running SeRQL CONSTRUCT queries in order to create and update graphs. The framework also allows adding graphs to and removing graphs from a repository.

SAIL (Storage And Inference Layer) is Sesame's abstraction over the storage format used; it also provides reasoning support. In the persistence layer, there are SAIL implementations for PostgreSQL, MySQL, SQL Server and Oracle databases. SAIL can be used to implement concurrent access handling and caching. Each Sesame repository has its own SAIL object to represent it.

The SAIL abstraction defines a few operations, such as adding and removing triples, starting and committing transactions, and clearing the repository.

4.2 Inferences

Sesame does not offer support for SPARQL, but it does include a new RDF/RDFS query language, SeRQL.

SeRQL stands for “Sesame RDF Query Language”. It combines the best features of other query languages (RQL, RDQL, N-Triples, N3), also adding some of its own. Its most important features include RDF Schema support, XML Schema datatype support, graph transformation and optional path matching.

SPARQL and SeRQL are quite similar: they both support advanced path expressions such as branching and chaining, optional paths and partial matching of the target graph. SeRQL allows the SELECT, CONSTRUCT and DESCRIBE query types, and their functionality is similar to that provided by SPARQL.

When it comes to set operations, SPARQL is limited, UNION being the only operation allowed. SeRQL offers support for more operations:

– union - UNION;

– intersection - INTERSECT;

– difference - MINUS.
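The three set operations listed above behave like the usual operations on sets of results. The Java sketch below illustrates their semantics on plain java.util.Set values; it is an analogy with hypothetical names, not SeRQL itself and not part of the Sesame API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Set semantics corresponding to UNION, INTERSECT and MINUS. */
final class ResultSetOperations {

    /** UNION: every result that appears in either operand. */
    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.addAll(b);
        return result;
    }

    /** INTERSECT: only the results that appear in both operands. */
    static <T> Set<T> intersect(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** MINUS: the results of the first operand that are not in the second. */
    static <T> Set<T> minus(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```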

The operators IN, ANY, ALL and EXISTS, as well as nested queries, are other features supported by SeRQL.

Some limitations of SeRQL include the lack of an ORDER BY clause and the lack of support for regular expressions.

5 JRDF

5.1 The Model

JRDF (Java RDF Binding) is an attempt to create a standard set of APIs and base implementations for RDF using Java. It is based on existing libraries, such as Jena, Sesame, Aquamarine and Sergey Melnik's RDF API. Unlike the other frameworks, JRDF tries to deal with most of the aspects that are useful for Java programmers and to ensure a high degree of modularity. It includes a default memory implementation that can be used in conjunction with Mulgara to provide a scalable RDF solution.

Like Jena and Sesame, JRDF offers a graph-based view of the RDF data. The Graph interface is used for the representation of the graph. A graph consists of RDF structures such as triples, literals and URI references. A graph is created as follows:

JRDFFactory factory = SortedMemoryJRDFFactory.getFactory();

Graph graph = factory.getGraph();

GraphElementFactory elementFactory = graph.getElementFactory();

Node node = elementFactory.createURIReference(URI.create("urn:node"));

graph.add(node, node, node);

The methods provided by the API allow adding, removing and finding triples.

The components of the triple - the subject, the predicate and the object - have a common base: the Node interface. This represents the top of the class hierarchy of the JRDF model. The Node is subclassed by the positional nodes: Subject, Predicate and Object. These are in turn subclassed by other types of node, such as URI, Literal and bnode (the blank node).

There are four JRDF Graph implementations:

1. The memory graph - it is included in the jrdf jar and it is useful for small graphs.

2. The server-side JRDF Graph - it is a server-side interface provided by Mulgara. The graph is created in the JVM and can be used for direct access to the database using a graph API.

3. The client JRDF Graph - Mulgara provides a client-side JRDF graph interface for accessing a model, which represents a scalable solution for remote client applications.

4. The iTQL graph - this is a read-only graph that can be created from the results of an iTQL query (iTQL being used for retrieving data from and updating Mulgara databases). This offers the possibility of displaying the results as a subgraph.

5.2 Inferences

JRDF contains an implementation of SPARQL, although it is not complete. However, the API does offer support for developing a powerful query engine. Such an implementation (based on JRDF) requires a mapping between RDF and the relational model.

One approach is to use a modified relational algebra to represent the JOIN, UNION and OPTIONAL operations. This algebra must support untyped relations and operations. These must be defined to work with tuples of differing attributes, to cover all the possible types that a tuple can contain.
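A minimal sketch of such an algebra is given below, assuming that a tuple is an untyped map from attribute names to values and that a relation is a list of tuples. It is an illustration of the idea only, not JRDF's implementation; all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy relational operations over tuples that may have differing attributes. */
final class UntypedRelations {

    /** Two tuples are compatible if they agree on every attribute they share. */
    static boolean compatible(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    /** Combines the attributes of two compatible tuples into a new tuple. */
    static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new HashMap<>(a);
        merged.putAll(b);
        return merged;
    }

    /** JOIN: merge every compatible pair of tuples. */
    static List<Map<String, String>> join(List<Map<String, String>> left,
                                          List<Map<String, String>> right) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> l : left) {
            for (Map<String, String> r : right) {
                if (compatible(l, r)) {
                    result.add(merge(l, r));
                }
            }
        }
        return result;
    }

    /** OPTIONAL (left outer join): a left tuple with no compatible partner is kept as-is. */
    static List<Map<String, String>> optional(List<Map<String, String>> left,
                                              List<Map<String, String>> right) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> l : left) {
            boolean extended = false;
            for (Map<String, String> r : right) {
                if (compatible(l, r)) {
                    result.add(merge(l, r));
                    extended = true;
                }
            }
            if (!extended) {
                result.add(l);
            }
        }
        return result;
    }

    /** UNION: concatenation; the tuples need not share the same attributes. */
    static List<Map<String, String>> union(List<Map<String, String>> left,
                                           List<Map<String, String>> right) {
        List<Map<String, String>> result = new ArrayList<>(left);
        result.addAll(right);
        return result;
    }
}
```

Because tuples are plain maps, the result of optional may mix tuples that carry an attribute with tuples that lack it, which is exactly the situation a typed relational model does not allow.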

6 Support, Documentation and Licensing

Jena, Sesame and JRDF are all cross-platform, and they are available under a BSD-style license. However, Jena seems to be the most popular among these solutions. This is because it provides a robust API and great support for reasoning, along with good documentation and support for developers.

The Jena Documentation page contains the public API, together with a tutorial and a FAQ section. Great attention is paid to practical examples - there are many HowTos included, covering a wide range of topics, from creating models to concurrency and locking issues. Other resources, such as SPARQL, are presented with useful links. There is also a mailing list (jena-dev) and a large developer community built around the project. The Jena website includes a user contributions page, which contains interesting examples provided by Jena users.

The Sesame documentation is comparable to that provided by Jena. A user manual describes each part of the framework in detail, with examples. The Documentation section includes some tutorials, FAQs and links to external resources. There are also some mailing lists and an old (no longer functional) forum. Users can also report bugs and problems through an issue tracker.

JRDF offers less support for developers than the other two frameworks. A Wiki section contains some basic descriptions and examples. Javadocs are available for six releases of the project, providing a good way of tracking the changes. There is also a mailing list and some links to related publications.

Conclusion

All three frameworks are mature enough to support complex applications. Each of them is better than the others in certain respects, and it is the user who should decide which API best covers the application's needs. One criterion to take into account is the query language that the application needs to use, since Sesame doesn't support SPARQL (although it does come with its own solution) and JRDF's SPARQL implementation is incomplete. Sesame provides support in scripting languages - Perl, PHP5 - which can be really useful. JRDF is a good example of good practice, trying to use standard Java conventions.

References

[1] Berners-Lee, T.; Hendler, J.; Lassila, O.: The Semantic Web. Scientific American Magazine (March 26, 2008)
[2] Powers, Shelly: Practical RDF. O’Reilly 2003
[3] http://jena.sourceforge.net/
[4] http://www.xml.com/pub/a/2001/01/24/rdf.html
[5] http://www.ibm.com/developerworks/xml/library/j-sparql/
[6] http://www.openrdf.org/documentation.jsp
[7] http://www.dlib.org/dlib/may98/miller/05miller.html
[8] http://www.oreillynet.com/xml/blog
[9] http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html
[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://en.wikipedia.org/wiki/SPARQL