MashQL: A Query-by-Diagram Language

Technical Article

Mustafa Jarrar, University of Cyprus

Marios D. Dikaiakos, University of Cyprus

ABSTRACT

This article is motivated by the massively increasing structured data on the web (data web), and the need for novel methods to expose this data to its full potential. Building on the remarkable success of Web 2.0 mashups, and especially Yahoo Pipes, we generalize the idea of mashups and regard the Internet as a database. Each internet data source is seen as a table, and a mashup is seen as a query on these tables. We assume that web data sources are represented in RDF, and that SPARQL is the query language. We propose a mashup language, called MashQL, which allows people to mash up data intuitively. In the background, MashQL queries are translated into and executed as SPARQL queries. The novelty of MashQL is that it allows one to query (and navigate) an RDF graph without any prior knowledge about its structure or technical details; it also supports pipelines and materialized queries as built-in concepts. Users do not need any knowledge about RDF/SPARQL to get started. Although we focus on RDF/SPARQL mashups, our approach can be easily reused for other data formats and query languages.

Keywords

Query-by-diagram, Query Formulation, Data Mashup, Semantic Mashup, Semantic Web, Data Web, Linked Data, Web 2.0

Table of Contents

1. Background and Motivation
2. Related Work and Contributions
3. The Basics of MashQL
4. Formal Definition of MashQL
5. Query Pipelines
6. Implementation
7. Use Cases
8. Discussion
9. Evaluation
Appendices

Remark: This is a live report, updated regularly (every week/month) to include our latest developments on all aspects related to MashQL (its formal syntax and semantics, implementation, use cases, evaluation, etc.). The content of this article is not published yet, thus if you wish to cite this work, please use the reference below. Any publication stemming from this report will be listed here. Reference: Jarrar M, Dikaiakos M: MashQL: A query-by-diagram. Technical Article TAR200805. University of Cyprus, 2008 Download from: http://www.jarrar.info/mashql/TA/

1. Background and Motivation

The rapid growth of Web 2.0 content has created a high demand for making this content more reusable. Companies are competing not only on gathering more content and contributions from people, but also on making their content available for use inside other websites. Many companies such as Google, Yahoo, Microsoft, Amazon, eBay, LinkedIn, and Wikipedia have made their content publicly accessible through APIs. People are encouraged to make their own applications and profit based on others’ content. For example, one can build a program to access the content of the Craigslist real-estate database to find apartments in a certain area, mix this content with location information from Google Maps, and provide a new web service that was not originally provided by either source. Another web service can be created to find the events happening in a city, by integrating the content of several event databases (such as Upcoming and Google Base), mixing the results with relevant photos from Flickr, and rendering the final results on Yahoo Maps. Web applications that consume content originating from third parties and retrieved via a public interface or API are called Mashups.

To expose the massive amount of public content and to allow people to build mashups easily, several mashup editors have been launched, including Google Mashup, Microsoft’s Popfly, IBM’s Smash, Yahoo Pipes, and a few others. Yahoo Pipes have received the greatest attention thanks to their simplicity. Yahoo Pipes allow people to combine different data sources into mashups, in a graphical and user-friendly way, without having to write code. Yahoo Pipes “generalize the idea of the mashup, providing a drag and drop editor that … can easily fetch data from any data source providing an RSS, Atom or RDF feed, extract the data the user wants, combine it with data from other sources, apply various built-in filters, and have the output directed to a web page or to other users’ pipes” [Ot]. Some people consider Yahoo Pipes to be “a milestone in the history of the internet” [Ot]. The most interesting aspect is that a user does not really need technical experience to get started with Yahoo Pipes.

However, a limitation of Yahoo Pipes and other mashup editors is that they focus only on providing encapsulated access to some public APIs or feeds. Web feeds are capable of representing only news items; they cannot represent structured data retrieved from the so-called Data Web. Currently, the development of data mashups requires extensive programming skills.

1.1 Position (Mashups as a Data Retrieval Paradigm)

To build on the remarkable success of Web 2.0 mashups and expose the data web to its full potential, we propose to regard the Internet as a database; each data source is seen as a table, and each mashup as a query. The average user should be able to build mashups intuitively, and mashups should be reusable, so that people can reuse and build on each other’s results. Mashups should not be limited to sophisticated queries, but can also be simple inquiries, e.g., “What books and articles did Hacker write last month”. In other words, we propose to also regard mashups as a mechanism for general-purpose data retrieval.

In what follows we present our motivation for using RDF and SPARQL in data mashups. The main challenges to building such data mashups are presented at the end of this section.

1.2 Why RDF and SPARQL

Assuming the Internet data is represented in RDF, querying this data can be done using SPARQL [P], the RDF query language. SPARQL is a recent recommendation by the W3C. It allows one to query remote RDF resources, in a manner similar to the querying of databases using SQL. A SPARQL query is a set of graph patterns; any data triple matching these patterns is added to the query results. The example shown in Figure 1 retrieves Hacker’s articles published after 2000, from two web locations.

http://Site1.com/RDF

    :a1 :Title "Web 2.0"
    :a1 :Author "Hacker B."
    :a1 :Year 2007
    :a1 :Publisher "Springer"
    :a2 :Title "Web 3.0"
    :a2 :Author "Smith B."

http://Site2.com/RDF

    :4 :Title "Semantic Web"
    :4 :Author "Tom Lara"
    :4 :PubYear 2005
    :5 :Title "Web services"
    :5 :Author "Bob Hacker"

Query:

    PREFIX S1: <http://site1.com/rdf>
    PREFIX S2: <http://site2.com/rdf>
    SELECT ?ArticleTitle
    FROM <http://site1.com/rdf>
    FROM <http://site2.com/rdf>
    WHERE {
      {{?X S1:Title ?ArticleTitle} UNION {?X S2:Title ?ArticleTitle}}
      {{?X S1:Author ?X1} UNION {?X S2:Author ?X1}}
      {{?X S1:Year ?X2} UNION {?X S2:PubYear ?X2}}
      FILTER regex(?X1, "^Hacker")
      FILTER (?X2 > 2000)
    }

Results:

    ArticleTitle
    Web 2.0

Figure 1. An example of a SPARQL query.
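To make the matching semantics of Figure 1 concrete, the following Python sketch evaluates the same inquiry by hand over the two sources held in memory. It is only an illustration under simplifying assumptions: namespaces are dropped, the two sources are merged by list concatenation, and the names `SITE1`, `SITE2`, `values`, and `hacker_articles_after` are hypothetical, not part of MashQL or of any SPARQL engine.

```python
import re

# Triples copied from Figure 1 (namespaces omitted for brevity).
SITE1 = [
    ("a1", "Title", "Web 2.0"), ("a1", "Author", "Hacker B."),
    ("a1", "Year", 2007), ("a1", "Publisher", "Springer"),
    ("a2", "Title", "Web 3.0"), ("a2", "Author", "Smith B."),
]
SITE2 = [
    ("4", "Title", "Semantic Web"), ("4", "Author", "Tom Lara"),
    ("4", "PubYear", 2005),
    ("5", "Title", "Web services"), ("5", "Author", "Bob Hacker"),
]


def values(triples, subject, predicates):
    """Objects of `subject` under any of `predicates` (emulates UNION)."""
    return [o for s, p, o in triples if s == subject and p in predicates]


def hacker_articles_after(triples, year):
    """Titles of items whose author starts with 'Hacker' and whose year > `year`."""
    titles = []
    for subject in sorted({s for s, _, _ in triples}):
        authors = values(triples, subject, {"Author"})
        years = values(triples, subject, {"Year", "PubYear"})
        # Corresponds to FILTER regex(?X1, "^Hacker") and FILTER (?X2 > year).
        author_ok = any(re.search("^Hacker", a) for a in authors)
        year_ok = any(y > year for y in years)
        if author_ok and year_ok:
            titles.extend(values(triples, subject, {"Title"}))
    return titles


merged = SITE1 + SITE2  # the two FROM clauses, merged into one dataset
print(hacker_articles_after(merged, 2000))
```

Only `a1` satisfies both filters: “Bob Hacker” does not start with “Hacker”, and item `5` has no year at all, so a required year pattern cannot match it.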

RDF is expressive in representing data and its semantics in a portable and linked form [BHB], and other data formats can be easily mapped into RDF [Wr]. Although RDF has been standardized by W3C since 1999 (to play the role of a semantically enabled metadata model), only recently has it received special attention from leading companies. For example, Yahoo announced that the next generation of their search engine will understand web semantics through RDF [Y]. Several models of RDF (such as RDFa and eRDF, microformats, and standard vocabularies) will also be supported by Yahoo. MySpace announced that they are adopting semantic web technology and that they will use RDF for profile and data portability [O]. Upcoming is already publishing their content in microformats and RDFa. Furthermore, Oracle 11g supports RDF storage and query. Querying RDF in Oracle is done in a SPARQL-like style. As shown by Oracle in [CDES], this implementation is scalable. For example, a query of medium complexity over 80 million RDF triples (5.2 GB) takes one or a few seconds. This support from leading companies is indeed accelerating the adoption of RDF as the main metadata language. Therefore, we believe that RDF and SPARQL mashups will be an important trend in web applications in the near future.

In particular, we expect a wide spread of RDFa (embedding RDF triples inside XHTML). For example, when searching (Yahoo Local, Scholar, Jobs.ac.uk, etc.), the results will be represented in RDFa, so that both humans and machines can understand these results. In this way, the web is seen as a set of RDF sources that can be queried and processed as structured and linked data, or the so-called data web¹.

¹ We are not strict that the vocabulary in an RDF source is grounded in an agreed and shared ontology. Hence, we prefer (as O'Reilly [M]) to call a web of RDF sources a data web, rather than a semantic web. Ontologies will be learnt and used as a next step.

1.3 Challenges to Building Data Mashups

The problem is that building data mashups requires strong programming skills and intensive effort. There is not yet an approach for easily accessing and consuming structured data on the web. In the following we specify two challenges. Although we focus on RDF/SPARQL mashups, other data formats and query languages face similar challenges.

The schema challenge. Understanding the structure of an RDF source (in order to formulate a query about it) is a challenging task. Before a user formulates a query, she needs to know how the data is structured and what the labels of the data elements are, i.e., the schema. Web users are not expected to investigate “what is the schema” each time they search or filter structured information. This issue is particularly difficult in the case of RDF. RDF data may come without a schema, or the schema may be mixed up with the data. In addition, as RDF data is a graph, one has to navigate this graph in order to formulate a query about it. Imagine a large RDF source with diverse content: how would you manage to understand the data structure, inter-relationships, namespaces, and the unwieldy labels of the data elements? In short, formulating queries in open environments, where data structures and vocabularies are unknown in advance, is a hard challenge, and may hamper building data mashups by non-IT people, i.e., regarding data mashups as a mechanism for general-purpose data retrieval.

The pipelining challenge. It is very likely that the output of a mashup is used as input to another. For example, one may create a mashup by querying Upcoming to retrieve all events happening in Paris (Q1); another person may create another mashup based on Q1 to retrieve only the scientific events (Q2); a third person may mash up the results of Q2 with the events retrieved from the ACM Conferences (Q3); and so on. Mashups that connect to each other in this way are called pipes. Although the pipelining concept is very useful, because it enables people to collaborate and build on each other’s efforts (which we call social collaboration and innovation), allowing queries to be pipelined in an open environment may lead to poor performance and complexities. For example, it is possible to end up with loops: the results of mashup A are used as input in mashup C, and the results of C are used by A. Such loops may lead to empty results after some iterations, and may cause computational overheads. Furthermore, when querying a set of remote RDF data sources, these sources must be downloaded and stored locally (at the query-engine side) before executing the query [MPTL]. The time complexity of the query itself is not the problem (as queries on local data are scalable [CERE]); the bottleneck will be the downloading time. The more queries call each other, the worse the performance may be. The time required to execute a pipe of queries depends on the pipeline depth, the amount of transferred data, and the transfer bandwidth.

This article proposes a mashup language called MashQL, which uses SPARQL as a backend query language. It encapsulates the complexity of SPARQL and allows people to query RDF sources intuitively (see Figure 2). In the background, MashQL queries are translated into and executed as SPARQL queries. The novelty of MashQL is that it allows one to formulate a query over one or more data sources without any prior knowledge about their schema. MashQL also does not assume that its users have any general knowledge about RDF or SPARQL to get started. Hence, the average internet user can use MashQL to develop data mashups easily. Furthermore, MashQL supports pipelines and materialized queries as built-in concepts.
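The loop and depth concerns raised under the pipelining challenge can be illustrated with a small Python sketch. It assumes a hypothetical model in which each mashup name maps to the list of mashups or sources whose results it reads; the functions `find_loop` and `depth` are illustrative names and do not describe how MashQL itself handles pipelines.

```python
def find_loop(inputs, start):
    """Return the mashups forming a loop reachable from `start`, or None."""
    path, on_path, done = [], set(), set()

    def visit(node):
        if node in on_path:                    # we came back to an ancestor
            return path[path.index(node):] + [node]
        if node in done:                       # already explored, no loop there
            return None
        on_path.add(node)
        path.append(node)
        for upstream in inputs.get(node, []):
            loop = visit(upstream)
            if loop:
                return loop
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


def depth(inputs, node):
    """Pipeline depth: the longest chain of upstream inputs (assumes no loops)."""
    return 1 + max((depth(inputs, up) for up in inputs.get(node, [])), default=0)


# Q3 reads Q2 and ACM; Q2 reads Q1; Q1 reads Upcoming (the example in the text).
pipes = {"Q3": ["Q2", "ACM"], "Q2": ["Q1"], "Q1": ["Upcoming"]}
print(find_loop(pipes, "Q3"))   # None
print(depth(pipes, "Q3"))       # 4

# Mashup A reads C, and C reads A.
looped = {"A": ["C"], "C": ["A"]}
print(find_loop(looped, "A"))   # ['A', 'C', 'A']
```

A check of this kind could be run before a pipe is saved, so that a loop is reported at design time rather than discovered during execution.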

2. Related Work and Contributions

In this section we overview different approaches to query formulation, focusing on the usability of these approaches for non-IT people.

Accessing and consuming structured data is a task that is usually limited to professional programmers. This is indeed a general challenge and is not only limited to RDF and SPARQL. Non-IT professionals who cannot write queries in a technical language (like SQL) are limited to accessing databases using forms.

Query-by-form is an old practice; users can fill in and submit a form, where all fields in this form are seen as query variables. This way of data access is simple; however, it is neither flexible nor expressive. For each query, a form needs to be developed, and any change to the query implies changing the form.

Query-by-example allows users to formulate their queries by filling in a table [Z]. The names of the queried relations and fields are selected first; then users can enter their keywords. Although this approach is claimed to be easy for non-IT people to learn, it was not used by such people. In our opinion, this is because users are still required to understand the relational structure, which is difficult for non-IT people.

Conceptual query languages are an alternative approach to query formulation. As many databases are modeled conceptually using EER or ORM diagrams, one can also query these databases starting from those diagrams. Users can select some concepts from a given conceptual diagram, and their selection is automatically translated into SQL queries. This scenario was implemented by several EER-based [CERE,PS] and ORM-based [DMV, HPW] approaches. ConQuer [BH] is another ORM-based language with some additional features. Instead of starting from a conceptual diagram that may not exist, it starts from the logical schema and converts it into lists of concepts and relations. Users can then drag and drop from these lists to formulate their queries. What users drag and drop becomes a tree of facts, and this tree is seen as a query. Although this drag-and-drop scenario is not simple, structuring a query as a tree pattern looks intuitive.

Although conceptual query languages have received a considerable amount of research, none of these languages has been used in practice. In our opinion, this is because formulating a query starting from a conceptual diagram is still a difficult task for non-IT people. In addition, the need to query databases at the conceptual level is not an important issue, because a database is a single enterprise’s project, and the world of its developers and users is closed. In open worlds such as the Web, structured data is being created and consumed by different users, and the need for a mechanism to mash up and consume this distributed and heterogeneous data easily is a real demand.

In recent years, some advanced techniques have started to emerge for filtering streams of information (see e.g. Outlook, Yahoo Pipes, Google Base), which we call Query-by-Filter. With these techniques one can block or permit an item (e.g., an email, feed, or job) according to a certain set of conditions. For example, Yahoo Pipes does not support any query language; however, its Filter module has some general concepts that allow people to block any web feed that contains (or does not contain, is the same as, is less than, …) a certain keyword in the title of a feed. This way of expressing filters is also used in most email applications for filtering and organizing emails. Google Base allows one to search information in a mixture of query-by-form and query-by-filter manner. Although the flexibility and expressivity of such filters are very limited compared to query languages, this approach is well understood and is being used successfully by non-IT people.

Furthermore, the need for simplified query techniques is receiving increasing attention within the semantic web community. Most approaches propose to visualize triple patterns; see GRQL [ACK], iSPARQL [iS], NITELIGHT [RSBS] and RDFAuthor [Ra]. The idea is to represent triple patterns graphically as ellipses connected with arrows, so that one would need fewer programming skills to formulate a query. Other semantic web approaches suggest using visual scripting languages, such as SPARQLMotion [Sm] and Deri Pipes [TPM]. Their idea is to allow users to work with visual boxes and lines; however, queries are still written in a textual form. We found that all of these approaches assume advanced knowledge of RDF and SPARQL, and thus cannot be used by the casual user.

See [KB] for a usability study on what casual users prefer, which concludes that a query language should be close to natural language and graphically intuitive.

MashQL is a generalization and extension of many aspects of the above approaches, yielding a formal and expressive, yet simple, query-by-diagram language. MashQL inherits some aspects from conceptual queries and query-by-filters. Similar to ConQuer (and to some extent LISA-D), MashQL queries are represented as trees, which makes queries easy to understand. Tree branches in MashQL are similar to filtering rules, which makes query formulation as simple as building filters. The look-and-feel of MashQL is inspired by Yahoo Pipes.

The difference between MashQL and the query-by-filter approaches is that MashQL is a general language for querying any structured data, not only filtering a specific structure of a data stream, as in Yahoo Pipes. In addition, unlike conceptual queries that start from a conceptual or logical database schema, MashQL is fundamentally different as it assumes that it is not necessary for the queried data sources to have a schema at all.

3. The Basics of MashQL

MashQL is a query-by-diagram language. The idea of MashQL is to allow people to query, mash up, and pipeline structured data intuitively. In the background, MashQL queries are automatically translated into and executed as SPARQL queries. The novelty of MashQL is that it allows one to query (and navigate) an RDF graph without any prior knowledge about its structure or technical details; it also supports query pipelines as a built-in concept. Figure 2 shows a MashQL query that is equivalent to the SPARQL query in Figure 1. The first module specifies the query input, while the second MashQL module specifies the query body. The output of this query can be piped into a third module (not shown here), which renders the results into a certain format (such as HTML or XML), or passes them as RDF input to other MashQL queries.

Figure 2. An example of MashQL query.

3.1 The Intuition of MashQL

The intuition of MashQL is as follows: each MashQL query is seen as a tree. The root of this tree is called the query subject (e.g. Article), which is the subject matter being inquired about. Each branch of the tree is called a query restriction and is used to restrict a certain property of the query subject. Branches can be expanded to allow sub trees (called query paths), which enable one to navigate the underlying data sources (see Figure 3). The query in Figure 3 retrieves the recent articles from Cyprus, i.e. every article that is written by an author who has an address, where this address has a country called Cyprus, and the article is published after 2000.

    PREFIX …
    SELECT ?ArticleTitle ?Institute
    FROM …
    WHERE {
      ?Article :Title ?ArticleTitle .
      ?Article :Author ?X1 .
      ?X1 :Address ?X2 .
      ?X2 :Country ?X3 .
      OPTIONAL { ?X1 :Affiliation ?Institute }
      ?Article :Year ?X4 .
      FILTER (?X3 = "Cyprus")
      FILTER (?X4 > 2000)
    }

Figure 3. Query paths (/sub trees) in MashQL.
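The way a query path unfolds into a chain of triple patterns can be sketched in a few lines of Python. The tree encoding used here (a list of `(predicate, target)` pairs, where a nested list stands for a sub tree) and the function name `flatten` are assumptions made for illustration; filters and optional branches are left out to keep the sketch short.

```python
import itertools


def flatten(subject, restrictions, counter=None):
    """Turn a tree of (predicate, target) restrictions into triple patterns.

    `target` is either a variable name (str) or a list of nested
    restrictions; in the latter case a fresh variable ?X1, ?X2, ...
    joins the parent level to the sub tree.
    """
    if counter is None:
        counter = itertools.count(1)
    patterns = []
    for predicate, target in restrictions:
        if isinstance(target, list):               # a query path (sub tree)
            var = f"?X{next(counter)}"
            patterns.append(f"{subject} :{predicate} {var}")
            patterns.extend(flatten(var, target, counter))
        else:                                      # a plain restriction
            patterns.append(f"{subject} :{predicate} {target}")
    return patterns


# A simplified version of the Figure 3 tree (filters omitted).
article = [
    ("Title", "?ArticleTitle"),
    ("Author", [("Address", [("Country", "?Country")])]),
    ("Year", "?Year"),
]
for pattern in flatten("?Article", article):
    print(pattern)
```

Each level of nesting introduces exactly one intermediate variable, which is why the user never has to name `?X1` or `?X2` explicitly.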

3.2 The Query Formulation Protocol

Formulating MashQL queries is designed to be an interactive process, in which the complexity and the responsibility of understanding data structures are moved from the user to the query editor. Users only use drop-down lists to express their queries. While the user interacts with the query editor, the editor performs some background queries and dynamically generates these lists. In what follows, we describe these background queries.

• After a user selects the dataset in the RDF Input module, formulating a MashQL query starts by selecting the query subject, which is offered through a drop-down list generated from the union of all subject and object identifiers² in the RDF dataset, no matter whether an identifier represents an instance or a type. Users can also choose not to select from the list and introduce their own subject label. In this case, the subject is seen as a variable and displayed in italic, which means any data item³.

• To add a restriction on the chosen subject, a list of the possible properties for this subject is dynamically generated. For example, given the data in Figure 1, if a user chooses a1 as a subject, the list of a1’s properties will be {Title, Author, Publisher, Year}; if the subject is a variable, the list will be the set of all properties in the dataset. Users can also choose whether this property is required, optional, or unbound. If a property is prefixed with “maybe”, it is considered optional (see Figure 4); if it is prefixed with “without”, it is considered unbound; and if it is not prefixed, it is required.

• Users may then choose an object filter (such as Equals, Contains, Doesn’t contain, OneOf, Between, MoreThan, Not, etc.), or may select an object identifier from a list, which is generated from the set of the possible objects, depending on the previously chosen subject and predicate. Furthermore, users can also click on the restriction icon to expand the tree, i.e., declare a query path as shown in Figure 3. The symbol can be used before subject, property, or object variables to indicate that this variable will be returned in the results. For example, the results of the above query are a one-column table that contains the list of all retrieved titles⁴.
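The three lists described above can be sketched over an in-memory set of triples. This Python fragment is only an approximation of the protocol: the actual editor issues background queries against the RDF source, whereas here the Site 1 triples of Figure 1 are held in a list, identifiers are assumed to be strings starting with ":", and all function names are hypothetical.

```python
TRIPLES = [
    (":a1", ":Title", "Web 2.0"), (":a1", ":Author", "Hacker B."),
    (":a1", ":Year", 2007), (":a1", ":Publisher", "Springer"),
    (":a2", ":Title", "Web 3.0"), (":a2", ":Author", "Smith B."),
]


def is_identifier(term):
    """Assumption of this sketch: identifiers are strings that start with ':'."""
    return isinstance(term, str) and term.startswith(":")


def subject_candidates(triples):
    """Union of all subject and object identifiers (literals are skipped)."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples if is_identifier(o)}
    return sorted(subjects | objects)


def property_candidates(triples, subject=None):
    """Properties of `subject`; all properties if the subject is a variable (None)."""
    return sorted({p for s, p, _ in triples if subject is None or s == subject})


def object_candidates(triples, subject, predicate):
    """Possible objects for the chosen subject (None = variable) and predicate."""
    found = {o for s, p, o in triples
             if (subject is None or s == subject) and p == predicate}
    return sorted(found, key=str)


print(subject_candidates(TRIPLES))
print(property_candidates(TRIPLES, ":a1"))
print(object_candidates(TRIPLES, None, ":Title"))
```

Choosing `:a1` yields the four properties named in the text, while a variable subject (`None`) yields every property that occurs in the dataset.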

3.3 MashQL Intuitiveness

The trade-off between expressivity and simplicity in MashQL is achieved by making technical variables and namespaces implicit, and through the tree structure of MashQL queries, which is close to the intuition people use in their natural language communication. For example, the query path shown in Figure 3 means: retrieve the article that has an Author x1, and x1 has an address x2, and x2 has a country x3, and x3 equals “Cyprus”. Furthermore, suppose you would like to ask: “Give me the list of all stores that sell parts of the iPhone mobile, and that are located in Brussels”; or, “Which cinemas are located in Brussels, offer a movie called ‘Fahrenheit’ and will be played between 20:00 and 23:00”. Apart from some terms (such as: give me the list of all, which, who, that are), all of these inquiries can be directly converted into MashQL queries. Hence, MashQL can be used by both average internet users and IT professionals to create data mashups intuitively and with the expressivity of SPARQL.

² People not familiar with the RDF terminology may glance at section 4.1.

³ The default value for a subject (in case a user does not select from the offered list or introduce his own label) is the variable “Everything”.

⁴ MashQL supports some novel interface features that are lengthy or difficult to illustrate here, especially the edit-and-verbalize modes. For example, when a user clicks on a restriction, it enters the editing mode and all other restrictions enter the verbalize mode (i.e., all boxes and lists are made invisible, but the verbalization of their content is generated and displayed instead, as shown in all figures). This not only makes the query formulation process easier and more enjoyable but, more importantly, from a methodological viewpoint it brings the readability of the queries closer to natural language, by which users are guided to achieve what they intended to query. Similarly, when two predicates originating from different sources have the same label, their namespaces are hidden and one of them is displayed, unless the user decides otherwise.

4. The Formal Definition of MashQL

Before presenting the formal definition of MashQL, we define the structure of a data source.

4.1 Dataset

MashQL assumes the queried dataset is structured as (or mapped into) a directed labeled graph, i.e. similar to but not necessarily the same as RDF. For example, suppose you have a person with the first name Bob and surname Hacker, working at a company called IBM that is located in the USA. This information can be represented as primitive facts forming the graph: {<P1,Firstname,“Bob”>, <P1,Surname,“Hacker”>, <C1,Name,“IBM”>, <C1,Country,“USA”>, <P1,WorkAt,C1>, <C1,Type,Company>, <P1,Type,Person>}. Notice that the last two triples declare P1 as an instance of a person and C1 as an instance of a company⁵. A MashQL dataset D is a set of data triples <Subject, Predicate, Object> forming a directed labeled graph. The value of a subject and a predicate can only be an identifier I, i.e. a URL or a key; the value of an object can be an identifier I or a literal L.

Def.1 (Dataset): A dataset D is a set of triples, where each triple t is formed as <S, P, O>, with S ∈ I, P ∈ I, and O ∈ I ∪ L.

The only difference between this definition and the RDF model is that we allow an identifier to be any form of key (i.e. not necessarily a URL). Allowing an identifier to be a database key, for example, would enable MashQL to be used for querying databases. Relational databases (and XML) can be converted to this primitive data structure. For example, the primary key of a table can be seen as a subject, a column label as a predicate, and the data entry in that column as an object. Foreign keys represent relationships between data elements across tables.

Each object literal must have a datatype. If an object value does not have an explicit datatype, one can be implicitly assumed by taking advantage of XML and RDF conventions: in RDF and XML, the general syntax for literals is a String, enclosed in either double or single quotes; Integers can be written directly without quotes or an explicit datatype; Boolean literals are written as true or false; and so on. In RDF and XML Schema, stating a datatype explicitly is done using namespaces, such as: "1"^^xsd:integer, "2004-12-06"^^xsd:date.

Def.2 (Typed Literals): Every object literal must have a datatype D: If O ∈ L then O ∈ D.

Object literals may also have a language tag Lt (such as “En”, “Gr”, “It”) to indicate the language to which the object value belongs. In RDF best practice, language tags are expressed using @ followed by a language tag, such as “Person”@En, “Ατομο”@Gr.

Def.3 (Language Tags): An object literal (O ∈ L) may have a language tag Lt.

⁵ The “Type” predicate (which to some extent represents the schema of an RDF source) might not be declared at all, i.e. in poorly schematized data. This is another reason why query formulation in MashQL does not start from data schemas, in contrast to the conceptual query languages (see our discussion in section 2).
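The conversion from a relational row to the primitive triple structure, and the implicit datatype convention, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the row is given as a dictionary, the function names `row_to_triples` and `implicit_datatype` are hypothetical, and only three datatypes are distinguished.

```python
def row_to_triples(key_column, row):
    """Primary key -> subject, column label -> predicate, cell value -> object."""
    subject = row[key_column]
    return [(subject, column, value)
            for column, value in row.items()
            if column != key_column and value is not None]


def implicit_datatype(value):
    """Assume a datatype for a literal that carries no explicit one."""
    if isinstance(value, bool):      # checked first: bool is a subclass of int
        return "xsd:boolean"
    if isinstance(value, int):
        return "xsd:integer"
    return "xsd:string"


# The person from the running example, stored as one table row.
# WorkAt holds a foreign key, which becomes a link to another subject.
person = {"id": "P1", "Firstname": "Bob", "Surname": "Hacker", "WorkAt": "C1"}
for triple in row_to_triples("id", person):
    print(triple)

print(implicit_datatype(2007), implicit_datatype(True), implicit_datatype("Bob"))
```

Null cells are skipped rather than emitted as triples, which mirrors the fact that a missing property is simply absent from the graph.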

4.2 The MashQL Basics

This section presents the formal definition and the meaning of MashQL constructs. The SPARQL interpretation of each MashQL construct is presented as a mapping rule after each definition. A MashQL query is seen as a tree. The root of this tree is called the query subject, and each branch is called a restriction. A MashQL query can be defined on one data source, or on a merge⁶ of several data sources, called the dataset. All triples in the dataset matching the query restrictions are retrieved.

Def. 4 (Query): A Query Q with a subject S, denoted by Q(S), is a set of restrictions on S. Q(S) = R1 AND … AND Rn.

Def. 5 (Subject): A subject S ∈ (I ∪ V), where I is an identifier and V is a variable.

Def. 6 (Restriction): A restriction R = <Rx , P, Of>, where Rx is an optional restriction prefix that can be (maybe | without), P is a predicate (P ∈ I ∪ V), and Of is an object filter (which shall be defined shortly).

The meaning of the MashQL restrictions is as follows. When evaluating a query Q(S) against a dataset D, only the triples that satisfy all restrictions are retrieved, such that: First, if a restriction does not have a prefix (R=< , P, Of>), the restriction evaluates to true if the subject, the predicate, and the object filter are matched (see the first two restrictions in Figure 4). This case is mapped into a normal graph pattern in SPARQL (see rule-3). Second, if a restriction is prefixed with “maybe” (R=<maybe, P, Of>), it always evaluates to true (see the third restriction in Figure 4). This case is mapped into an optional graph pattern in SPARQL (see rule-4). Third, if a restriction is prefixed with “without” (R=<without, P, Of>), it evaluates to true if the subject S and the predicate P do not appear together in a triple (see the last restriction in Figure 4). In SPARQL, this restriction means that the object O should not be bound (see rule-5).

The following rules map the MashQL constructs into SPARQL:

Rule-1: The symbol before a variable means that it will be returned in the results, i.e., included in the SELECT part of the SPARQL query. If the output of the query is input to another, use CONSTRUCT *.

Rule-2: In any of the following rules, if a subject, predicate, or object is italicized, it is seen as a SPARQL variable, i.e. prefixed with “?”.

Rule-3: If S is a subject and R = < , P, Of>, the mapping is: {S P O}.

Rule-4: If S is a subject and R = <maybe, P, Of>, the mapping is: {OPTIONAL{S P O}}.

Rule-5: If S is a subject and R = <without, P, Of>, the mapping is: {OPTIONAL{S P ?O} FILTER (!bound(?O))}. The pattern has to be optional here; if it were required, ?O would always be bound in every solution and the filter would reject all of them.
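Rules 3 to 5 can be read as a small translation function from restrictions to SPARQL text. The Python sketch below is one possible reading, not the MashQL implementation: the tuple encoding of a restriction and the names `restriction_to_sparql` and `query_to_sparql` are assumptions, and for the “without” prefix it uses the common SPARQL idiom of an OPTIONAL pattern followed by a `!bound` filter, since a required pattern would always bind the object.

```python
def restriction_to_sparql(subject, prefix, predicate, obj):
    """Map one restriction <prefix, predicate, object> on `subject` to SPARQL."""
    triple = f"{subject} {predicate} {obj}"
    if prefix is None:                 # Rule-3: a normal graph pattern
        return f"{triple} ."
    if prefix == "maybe":              # Rule-4: an optional graph pattern
        return f"OPTIONAL {{ {triple} }}"
    if prefix == "without":            # Rule-5: the object must stay unbound
        return f"OPTIONAL {{ {triple} }} FILTER (!bound({obj}))"
    raise ValueError(f"unknown restriction prefix: {prefix!r}")


def query_to_sparql(subject, restrictions, returned):
    """Assemble a SELECT query; `returned` lists the variables to output (Rule-1)."""
    body = "\n  ".join(restriction_to_sparql(subject, *r) for r in restrictions)
    return f"SELECT {' '.join(returned)}\nWHERE {{\n  {body}\n}}"


# The four restrictions of the Figure 4 example.
restrictions = [
    (None, ":Name", "?PersonName"),
    (None, ":WorkFor", ":Yahoo"),
    ("maybe", ":StudyAt", "?University"),
    ("without", ":Salary", "?X1"),
]
print(query_to_sparql("?Person", restrictions, ["?PersonName", "?University"]))
```

Keeping one function per restriction makes the conjunction Q(S) = R1 AND … AND Rn explicit: the query body is simply the concatenation of the individual mappings.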

Example. The query in Figure 4 means: retrieve everything (call it Person) that: has a name (call it PersonName); works at Yahoo; is possibly studying somewhere (call it University); and does not have any salary. In other words, when evaluating this query, we retrieve the triples that have a subject that i) appears in a triple with the predicate Name, ii) appears in a triple with the predicate WorkFor and the object Yahoo, iii) may appear in a triple with the predicate StudyAt, and iv) does not appear in a triple with the predicate Salary.

⁶ In this paper we do not discuss how several RDF sources can be merged into one graph, which is part of the W3C standard. Basically, all triples are put together, but prefixed with their respective namespaces.

    SELECT ?PersonName ?University
    WHERE {
      ?Person :Name ?PersonName .
      ?Person :WorkFor :Yahoo .
      OPTIONAL { ?Person :StudyAt ?University }
      OPTIONAL { ?Person :Salary ?X1 }
      FILTER (!bound(?X1))
    }

Figure 4. A MashQL query and its mapping into SPARQL.

The above definitions specify the simple forms of restrictions, i.e. without object filters Of. In what follows, we define nine forms of object filters, which allow further restrictions based on the value of an object (such as Equals, Between, OneOf, Not, etc). The last definition (Def.7.9) specifies how to allow sub-queries (i.e. query paths) based on object values.

Def.7 (Object Filter): An object filter Of = <O, f>, where O is an object and f is a filtering function. An object filter can have one of the following nine forms:

1. Of = <O>, where O is an object, O ∈ V ∪ I. This is the simplest object filter, i.e., it does not add any restriction on the object value of the retrieved triples, as shown in Figure 4.

2. Of = <O, Equals(X, T, Lt)>, where X can be a variable or a constant, T is a datatype, and Lt is a language tag. This filter restricts the retrieved results, such that, the object value O should be equal to X, with datatype T, and with language Lt. See the restriction (Country “Cyprus”) in Figure 3. This object filter is mapped into a simple SPARQL filter (see rule-6).

3. Of = <O, Contains(X, T, Lt)>, where O is an object variable, X is a regex literal, T is a datatype, and Lt is a language tag. This filter restricts the retrieved results, such that the object value O should match the regular expression X, with datatype T, and with language Lt. A regex literal is a literal that contains a regular expression matching pattern. See the restriction (Author “^Hacker”) in Figure 2. This object filter is mapped into a SPARQL filter with a regex function (see rule-7).

4. Of = <O, MoreThan(X, T)>, where O is an object variable, X is a variable or a constant, and T is a datatype. This filter restricts the retrieved results, such that the object value O should be more than X and with datatype T. See the restriction (Year > 2000) in Figure 3, and the SPARQL mapping in rule-8.

5. Of = <O, LessThan(X, T)>, where O is an object variable, X is a variable or a constant, T is a datatype identifier. This filter restricts the retrieved results, such that, the object value O should be less than X and with datatype T (see rule-9).

6. Of = <O, Between(X, Y, T)>, where X and Y are variables or constants, and T is a datatype identifier. This filter restricts the retrieved results, such that the object value O should be more than or equal to X, less than or equal to Y, and with datatype T (see rule-10).

7. Of = <O, OneOf(V)>, where O is an object variable, and V is a set of values {v1, ... , vn}, where each vi is a variable or a constant. This filter restricts the retrieved results, such that the object value O should be equal to one of the values in V. See Figure 13. Notice that this construct is not supported in SPARQL, but can be emulated using filters (see rule-11).

8. Of = <O, Not(f)>, where f is one of the functions defined above. This filter extends all of the above functions with simple negation (see rule-12). As an example, see the query restriction (-Lifestyle Not “Vitamins in Take”) in Figure 15. This filter is the same as the Equals filter but with negation, i.e., Not Equal.

9. Of = <O, Qi(O)>, where O is an object (O ∈ V ∪ I), and Qi(O) is a sub-query with O being the query subject. The restrictions defined in the sub-query Qi(O) should be satisfied as well (e.g. see Figure 3). Notice that this definition is recursive; however, this does not mean the query itself is recursive.

Rule 6. If Of = <O, Equals(X, T, Lt)>:
Append the mapping with: FILTER(?O = X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O) = T)
If Lt ≠ Null: Append the mapping with: FILTER(lang(?O) = Lt)

Rule 7. If Of = <O, Contains(X, T, Lt)>:
Append the mapping with: FILTER regex(?O, X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O) = T)
If Lt ≠ Null: Append the mapping with: FILTER(lang(?O) = Lt)

Rule 8. If Of = <O, MoreThan(X, T)>:
Append the mapping with: FILTER(?O > X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O) = T)

Rule 9. If Of = <O, LessThan(X, T)>:
Append the mapping with: FILTER(?O < X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O) = T)

Rule 10. If Of = <O, Between(X, Y, T)>:
Append the mapping with: FILTER(?O >= X && ?O <= Y)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O) = T)

Rule 11. If Of = <O, OneOf(V)>:
Append the mapping with: FILTER(?O = V1 || . . . || ?O = Vn)
If Vi is a regex literal, the ith disjunct above should be replaced with: regex(?O, Vi)

Rule 12. If Of = <O, Not(f)>: The f filter will be generated as above, but with a negation.

Rule 13. If Of = <O, Qi(O)>: Repeat all mapping rules to generate Qi(O).
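As an illustration of rules 6 to 12, the following Python sketch generates the SPARQL filter text for one object filter. The function names and the way arguments are passed are hypothetical, the Between and OneOf cases are assumed to be emitted as a single FILTER with a combined boolean expression, and values are assumed to be already serialized as SPARQL terms.

```python
def filter_expression(var, kind, args):
    """Return the boolean expression for one filtering function (rules 6-11)."""
    if kind == "Equals":
        return f"{var} = {args[0]}"
    if kind == "Contains":
        return f"regex({var}, {args[0]})"
    if kind == "MoreThan":
        return f"{var} > {args[0]}"
    if kind == "LessThan":
        return f"{var} < {args[0]}"
    if kind == "Between":
        return f"{var} >= {args[0]} && {var} <= {args[1]}"
    if kind == "OneOf":
        return " || ".join(f"{var} = {value}" for value in args)
    raise ValueError("unknown filter: " + kind)


def map_object_filter(var, kind, args, datatype=None, lang=None, negate=False):
    """Return the FILTER clauses appended to the mapping of a restriction."""
    expression = filter_expression(var, kind, args)
    if negate:  # rule 12: same filter, wrapped in a negation
        expression = f"!({expression})"
    clauses = [f"FILTER({expression})"]
    if datatype is not None:
        clauses.append(f"FILTER(datatype({var}) = {datatype})")
    if lang is not None:
        clauses.append(f'FILTER(lang({var}) = "{lang}")')
    return " ".join(clauses)
```

For example, map_object_filter("?X2", "MoreThan", ["2000"]) yields FILTER(?X2 > 2000), which corresponds to the restriction (Year > 2000).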

4.3 Other Constructs

There are other constructs that are not presented in this article for the sake of simplicity. This section highlights some of them.

Def.8 (Types): A subject (S ∈ I) or an object (O ∈ I) can be prefixed with “a” or “an” to mean the instances of this subject/object type, instead of the subject/object itself. The following query retrieves every instance of the concept Person that works for an instance of the concept Company.

SELECT ?Person ?Company WHERE {
  ?Person rdf:type :Person.
  ?Person :WorkFor ?Company.
  ?Company rdf:type :Company
}

Figure 5. MashQL’s support of types.

Rule 14. If a subject S is prefixed with “a” or “an”: Append the mapping with: {?S rdf:type :S}

Rule 15. If an object O is prefixed with “a” or “an”: Append the mapping with: {?O rdf:type :O}

Def.9 (Union): A union can be declared between objects, predicates, subjects and/or queries, in the following forms:

1. On = <O1\O2 \ . . . \On>, to indicate unions between objects, where Oi ∈ I. See Figure 6.A.

2. Pn = <P1\P2 \ . . . \Pn>, to indicate unions between predicates, where Pi ∈ I. See Figure 6.B.

3. Sn = <S1\S2 \ . . . \Sn>, to indicate unions between subjects, where Si ∈ I. See Figure 6.C.

4. Qn = <Q1\Q2 \ . . . \Qn>, to indicate unions between queries. See Figure 6.D.

A

SELECT ?Person WHERE { {?Person :WorkFor :Google} UNION {?Person :WorkFor :Yahoo} }

B

SELECT ?FName WHERE { {?Person :Surname ?FName} UNION {?Person :Firstname ?FName} }

C

SELECT ?AgentName ?AgentPhone WHERE {
  {?Person rdf:type :Person. ?Person :Name ?AgentName. ?Person :Phone ?AgentPhone}
  UNION
  {?Company rdf:type :Company. ?Company :Name ?AgentName. ?Company :Phone ?AgentPhone} }

D

SELECT ?CustName WHERE {
  {?Person :Name ?CustName}
  UNION
  {?Company :Title ?CustName. ?Company :City ?X1. FILTER regex(?X1, "Paris")} }

Figure 6. MashQL’s support of unions.

Rule 16. Given On , If n >1 and Oi ∈ I : The mapping in rules 3-4 will be: {{S P :O1} UNION . . . UNION {S P :On}}

Rule 17. Given Pn , If n >1 and Pi ∈ I : The mapping in rules 3-4 will be: {{S :P1 O} UNION . . . UNION {S :Pn O}}

Rule 18. Given Sn , If n >1 and Si ∈ I : Regenerate the query n times, each time with Si as a root, and with a UNION between the queries.

Rule 19. Given Qn , If n >1 : Add UNION between the n queries.
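Rules 16 and 17 can be illustrated with a short Python sketch. The helper name union_patterns is hypothetical, and combining a predicate union with an object union in the same restriction (as a cross product of alternatives) is an assumption made here for illustration; the rules above only define each case separately.

```python
def union_patterns(subject, predicates, objects):
    """Expand a restriction whose predicate and/or object is a union.

    `predicates` and `objects` are lists of SPARQL terms; a list with a
    single element means that no union was declared at that position.
    """
    alternatives = [
        "{" + f"{subject} {p} {o}" + "}"
        for p in predicates
        for o in objects
    ]
    if len(alternatives) == 1:
        # No union at all: falls back to the plain mapping of rule-3.
        return alternatives[0]
    return "{" + " UNION ".join(alternatives) + "}"


# Figure 6.A: a union between objects (rule 16).
print(union_patterns("?Person", [":WorkFor"], [":Google", ":Yahoo"]))
# Figure 6.B: a union between predicates (rule 17).
print(union_patterns("?Person", [":Surname", ":Firstname"], ["?FName"]))
```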

Def.10 (Reverse): <~P> indicates the reverse of the predicate P. Let R1 be a restriction on S such that <S P O>, and let R2 be <O ~P S>; then R1 and R2 have the same meaning. This construct allows concise queries. For example, consider the dataset D = {<:C1, :name, “IBM”>, <:P1, :name, “Bob”>, <:P1, :WorkFor, :C1>}. To retrieve the names of the companies that people work for, one could write the right-side query in Figure 7, which might not be intuitive for some people; with the reverse construct, one retrieves the same results using the left-side query.

Figure 7. MashQL’s support of reverse predicates.

Mapping the reverse construct into SPARQL is done by simply swapping the positions of the subject and the object (see rule-20).

Rule 20. If S is a subject and R = <~P, O>, the mapping is: {O P S}.

5. Query Pipelines in MashQL

As discussed earlier, allowing people to pipeline queries (i.e., build on each other’s results) is an important requirement for social collaboration and innovation. Similar to Yahoo Pipes, MashQL supports query pipelines as a built-in concept, rather than only an interface issue. For example, depending on the pipeline structure, MashQL generates either SELECT or CONSTRUCT queries (see Figure 13). In addition, to overcome the pipelining challenges described earlier, MashQL allows query results to be materialized, and enforces pipelines to be acyclic. The result of a query (called a derived source) is stored physically and deployed as a concrete RDF source. Primal input sources (called base sources) are also cached for performance purposes.

Def.11: Given a query Q over a set of base or derived sources {D1,…,Dm}, the result of this query is denoted as D = Q(D1,…,Dm), and D ∉ {D1,…,Dm}.

We define a Pipe as an acyclic chain of SPARQL queries, where the result of a query is an input to the next query (see Figure 8).

Def.12: Given a derived RDF source D, the chain of the queries that derives D is acyclic, and denoted as the pipe P(D).

For example, the pipe P(D10) in Figure 8 is the chain Q5(Q3(Q1(D1,D2),Q2(D2,D3,D4)),D8), and the pipe P(D7) is the chain Q3(Q1(D1,D2),Q2(D2,D3,D4)). A pipe can also be defined recursively, such as P(D10) = Q5(P(D7), P(D8)), and P(D7) = Q3(P(D5), P(D6)). If D is a base source, then P(D) = D. In other words, a pipe to a base source leads to the source itself. As we shall discuss shortly, refreshing a pipe P(D) implies re-executing all of the queries in this pipe, while refreshing Q(D) implies re-executing only the query that derives it.
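The recursive reading of a pipe can be sketched as follows. The DERIVED registry is a hypothetical data structure that records which query derives each source; naming the outputs of Q1, Q2, and Q3 as D5, D6, and D7 is an assumption taken from the recursive definition above, since Figure 8 is not reproduced here.

```python
# Hypothetical registry: derived source -> (query, input sources).
DERIVED = {
    "D5": ("Q1", ["D1", "D2"]),
    "D6": ("Q2", ["D2", "D3", "D4"]),
    "D7": ("Q3", ["D5", "D6"]),
    "D10": ("Q5", ["D7", "D8"]),
}


def pipe(source, derived=DERIVED, _path=()):
    """Expand P(source) into its chain of queries.

    A base source (one that no query derives) expands to itself.
    A source that depends on itself violates acyclicity and is rejected.
    """
    if source in _path:
        raise ValueError("cyclic pipe through " + source)
    if source not in derived:
        return source
    query, inputs = derived[source]
    inner = ",".join(pipe(d, derived, _path + (source,)) for d in inputs)
    return f"{query}({inner})"


print(pipe("D10"))  # Q5(Q3(Q1(D1,D2),Q2(D2,D3,D4)),D8)
print(pipe("D8"))   # D8, because D8 is a base source
```

Rejecting a source that appears on its own derivation path is one simple way to enforce the acyclicity required by the definition.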

Figure 8. Depiction of Semantic Web Pipes.

We call the problem of keeping a pipe up to date pipe consistency.

Def.13: (Consistency) Let D be the result of a query Q over a set of sources {D1,…,Dm}. Let T be the latest time the set {D1,…,Dm} has been changed. Then, D is consistent at T, if D = Q(D1,…,Dm).

5.1 Updating Strategies

To maintain pipe consistency, we propose two simple updating strategies: query auto-refresh and pipe auto-refresh. Let us first introduce the notion of a source registry $R^{D}$ and the notion of a query registry $R^{Q}$, as follows. Each derived or base source D has a time stamp of its last update $R^{D}_{T}$ (such as $R^{D_1}_{T}$ = 13042008,23:44:23), and an auto-refresh time interval $R^{D}_{A}$ (such as $R^{D_1}_{A}$ = 2592000, i.e. to be refreshed every month). Each query has a time stamp of its previous successful execution $R^{Q}_{T}$ (such as $R^{Q_3}_{T}$ = 17052007,13:04:53), and an auto-refresh time interval $R^{Q}_{A}$ (such as $R^{Q_3}_{A}$ = 2592000). Each query will be automatically executed if its auto-refresh interval expires and one of its input sources is updated. Notice that the auto-refresh boundaries might be enforced by the system in order to limit heavy and undesired automatic executions.

Def.14: (Query Auto-Refresh) Let Qi be a query over a set of sources {D1,…,Dm}, and T be a given time. Qi will be executed if ($R^{Q_i}_{T}$ + $R^{Q_i}_{A}$) ≤ T and ($R^{Q_i}_{T}$ < $R^{D_j}_{T}$) for some j, where 1 ≤ j ≤ m.

Similarly, a pipe P(D) will be automatically refreshed if $R^{D}_{A}$ expires. Refreshing a pipe implies executing the chain of queries in this pipe, starting from the bottom to the topmost.

Def.15: (Pipe Auto-Refresh) Let P(D) be a pipe, D = Qn(D1,…,Dm), and T be a given time. If ($R^{D}_{T}$ + $R^{D}_{A}$) ≤ T, each ith query in P(D) will be executed if ($R^{Q_i}_{T}$ < $R^{D_j}_{T}$), where 1 ≤ j ≤ m for Qi, and 1 ≤ i ≤ n. Queries in P(D) are executed from the bottom to the topmost, or recursively as Qn(P(D1),…,P(Dm)).
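The two refresh conditions can be sketched in Python as follows. All names here are hypothetical: the registries are modelled as plain dictionaries of datetime values, "executing" a query is reduced to updating its time stamps, and the check that the pipe's own auto-refresh interval has expired is assumed to be done by the caller.

```python
from datetime import datetime, timedelta


def query_needs_refresh(now, last_run, interval, input_updates):
    """Query auto-refresh: the interval has expired and at least one
    input source was updated after the query's last successful run."""
    return last_run + interval <= now and any(last_run < u for u in input_updates)


def refresh_pipe(target, derived, source_updated, query_last_run, now):
    """Pipe auto-refresh: walk the pipe bottom-up and re-run stale queries.

    derived:        derived source -> (query, input sources)
    source_updated: source -> time stamp of its last update
    query_last_run: query -> time stamp of its last successful execution
    Returns the queries that were re-executed, in execution order.
    """
    executed = []

    def visit(source):
        if source not in derived:      # base source: nothing to execute
            return
        query, inputs = derived[source]
        for d in inputs:               # refresh the inputs first (bottom-up)
            visit(d)
        if any(query_last_run[query] < source_updated[d] for d in inputs):
            query_last_run[query] = now
            source_updated[source] = now   # the derived source is now fresh
            executed.append(query)

    visit(target)
    return executed


month = timedelta(seconds=2592000)
print(query_needs_refresh(datetime(2008, 8, 28),
                          datetime(2007, 5, 17, 13, 4, 53),
                          month,
                          [datetime(2008, 4, 13, 23, 44, 23)]))  # True
```

Because a re-executed query stamps its output with the current time, every query above it in the pipe becomes stale in turn and is re-executed as well.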

Although these updating strategies are simple and might be sufficient in many practical cases, the data warehousing literature shows that incremental update strategies (e.g., [AD, ZGHW]) are more efficient. The idea is that if a base source receives a new transaction (insert or delete), only these transactions are transformed and only the affected queries are refreshed. We would like to remark that incremental updating is still an open research issue for RDF, especially since, unlike data warehouses, RDF sources and their queries live in an open world.

5.2 Performance-based Materialization

In this section, we present an algorithm to detect which queries are most important to materialize. Based on the log of the previous executions of a query, and on how many times the results of this query are used, the algorithm finds the most expensive and important queries, and suggests whether a query should be materialized or not.

As our implementation prototype is implemented using Oracle 11g, we extend the XXX table, which is a system table that Oracle generates to keep a log of the execution of queries. This table contains …. We extend this table with …. To be continued soon

6. Implementation Issues

MashQL can generally be implemented by online mashup editors (similar to or as an extension to Yahoo Pipes), or as a query interface for online RDF datasets (e.g., Freebase, DBpedia, or DBLP). It can also be implemented as a query plug-in to offline RDF stores (e.g., AllegroGraph or Oracle); or it can be used to filter metadata streams in, e.g., iTunes, jobs.ac.uk, eBay, or Upcoming. We are currently developing a MashQL prototype in a Yahoo Pipes style. This prototype is an AJAX web-based plug-in to Oracle 11g. We chose Oracle not only because of its scalability, but also because Oracle’s SPARQL inherits all functionalities of SQL, including aggregation and grouping functions, which are not supported in the standard SPARQL. Our implementation prototype not only generates the W3C’s standard SPARQL, but is also being extended to generate Oracle’s SPARQL [CDES].

Although this article focuses on using MashQL for querying RDF data sources using SPARQL, MashQL can similarly be used for querying relational databases or XML documents. In this case, one needs to either develop a stylesheet that translates MashQL markup into SQL, XQuery, or any preferred backend query language; or map the dataset (e.g. using views) into an RDF-like model.

6.1 MashQL Markup

Based on the formal definition of MashQL that we have presented in section 4, this section presents the technical specification of MashQL queries, called the MashQL markup. The goal of this markup is to serialize graphical MashQL queries in a textual and interchangeable format. In this way, tools will be able to save, load, process, and exchange queries easily.

Example. Figure 9 illustrates the markup of the MashQL pipes shown in Figure 2, which is also equivalent to the SPARQL query in Figure 1. In appendix A2 we also illustrate the markup of the job-seeking use case shown in Figure 13.

<Pipe ID="0" xsi:noNamespaceSchemaLocation="MashQL.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Meta Content="Title" Name="Bob's Articles"/>
  <Meta Content="Creator" Name="Bob Hacker"/>
  <RDFInput ID="1" X1="10" Y1="10" X2="50" Y2="110">
    <Source Order="1">
      <Ref>http://www.site1.com/rdf</Ref>
      <LastUpload>2008-06-02T09:30:47.0Z</LastUpload>
    </Source>
    <Source Order="2">
      <Ref>http://www.site2.com/rdf</Ref>
      <LastUpload>2008-06-02T09:32:47.0Z</LastUpload>
    </Source>
  </RDFInput>
  <Query ID="0" X1="0" X2="0" Y1="xs:integer" Y2="0" InputModule="1">
    <Header>
      <Meta Content="Title" Name=""/>
      <Prefix Tag="S1" Ref="http://www.site1.com/rdf"/>
      <Prefix Tag="S2" Ref="http://www.site2.com/rdf"/>
    </Header>
    <Body>
      <Subject Name="Article" Type="Variable" isReturn="false"/>
      <Restriction Prefix="">
        <Predicate Name="S1:Title" Type="Identifier" isReturn="false" />
        <Predicate Name="S2:Title" Type="Identifier" isReturn="false" />
        <Object Name="ArticleTitle" Type="Variable" isReturn="true" />
      </Restriction>
      <Restriction Prefix="">
        <Predicate Name="S1:Author" Type="Identifier" isReturn="false" />
        <Predicate Name="S2:Author" Type="Identifier" isReturn="false" />
        <Object Name="X1" Type="Variable" isReturn="false" />
        <ObjectFilter xsi:type="Contains" Value="^Hacker" Type="Variable" Language="" DataType="" />
      </Restriction>
      <Restriction Prefix="">
        <Predicate Name="S2:Year" Type="Identifier" isReturn="false" />
        <Predicate Name="S2:PubYear" Type="Identifier" isReturn="false" />
        <Object Name="X2" Type="Variable" isReturn="false" />
        <ObjectFilter xsi:type="MoreThan" Value="2000" Type="Constant" DataType="" />
      </Restriction>
    </Body>
    <Footer>
      <Order><Variable Name="" Direction=""/></Order>
      <Modifier Limit="" Offset="" Duplication=""/>
      <Output Stylesheet="" Format=""/>
    </Footer>
  </Query>
</Pipe>

Figure 9. An example of a MashQL query and its markup.

The XML schema of the complete MashQL markup can be found in appendix A1. This schema is used as the reference MashQL grammar. Any markup of a MashQL query should be valid according to this XML schema. Notice that the MashQL markup is not meant to be written by hand or interpreted by humans. It is meant to be implemented, for example, as a "save as" or "export to" functionality in MashQL tools. As we will explain in the next section, we use this schema to automatically translate MashQL into SPARQL. The translator takes the markup of a MashQL query as input and generates its SPARQL equivalent.

Figure 10 presents a high-level depiction of the XML schema. As shown in this figure, a MashQL pipe consists of four sections: Meta, RDFInput, UserInput, and Query.

The Meta section is used to represent metadata about the pipe itself, such as its title, creator, owner, last execution, publication status, etc. These elements are used for management and discovery of pipes. Notice that our design is not restricted to a certain set or a standard of metadata elements. Any element can be included by specifying its "name" and "content", see Figure 9.

The RDFInput section represents RDF sources. Each RDFInput element represents an RDFInput module in a pipe, of which there can be zero or many. Each module represents the URL of the RDF sources, the latest upload, caching locations, and order in the module.

The UserInput element represents UserInput modules, of which there can be zero or many in a pipe. Each UserInput element consists of a set of DataItems. Each item is described by its Value, Type (which can be: Constant literal, Regex-Constant literal, Variable, or Identifier), Datatype, and language tag.

The Query element serializes MashQL query modules in a pipe. A query consists of three main parts: Header, Body, and Footer. The Header serializes the prefixes, and some metadata about the query, such as its title, last execution, and execution time cost, among others. The Body element serializes mainly a query subject and its restrictions. As shown in Figure 10, a restriction consists of three elements: a predicate, an object, and an object filter. The ObjectFilter is defined as an abstract element (an XML datatype) that can be one of the following complex elements {Contains, Doesn’t Contain, Equals, Not Equal, LessThan, Not LessThan, MoreThan, Not MoreThan, Between, Not Between, OneOf, Not OneOf, SubQuery}. Finally, the Footer element serializes result modifiers (such as limit, offset, and duplication parameters), ordering preferences, and output styles. Notice that these parameters are usually not displayed in the main window of a MashQL query, but are configured as "query options".

Based on this XML schema we have developed a MashQL parser, in order to read and write MashQL queries; see the UML class diagram in Figure 11.

Figure 10. A high-level depiction of the MashQL XML schema.

Figure 11. The Class diagram of the MashQL-markup parser.

6.2 MashQL to SPARQL Translator

As explained earlier, MashQL is used for building data mashups diagrammatically; in the background, however, MashQL is translated into and executed as SPARQL queries. Translating MashQL into SPARQL is done according to the mapping rules defined in section 4. These rules are implemented by the MashQL-to-SPARQL translator, which is a Java program. This translator takes the markup of a MashQL query and generates its SPARQL equivalent according to these rules. In Figure 12, we present the use case diagram of the translator. When a user executes or debugs a MashQL query, this query is translated into SPARQL first, then the generated SPARQL is executed.
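To give a feel for the translation step, the following Python sketch reads a reduced fragment of the markup and emits a SPARQL string. It is not the Java translator mentioned above: the fragment, the function names, and the output layout are hypothetical, only the MoreThan filter is handled, and restriction prefixes (such as maybe and without) are ignored for brevity.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Reduced, hypothetical fragment in the style of the markup in Figure 9.
MARKUP = """
<Query xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Body>
    <Subject Name="Article" Type="Variable" isReturn="false"/>
    <Restriction Prefix="">
      <Predicate Name="S1:Title" Type="Identifier"/>
      <Predicate Name="S2:Title" Type="Identifier"/>
      <Object Name="ArticleTitle" Type="Variable" isReturn="true"/>
    </Restriction>
    <Restriction Prefix="">
      <Predicate Name="S2:PubYear" Type="Identifier"/>
      <Object Name="X2" Type="Variable" isReturn="false"/>
      <ObjectFilter xsi:type="MoreThan" Value="2000"/>
    </Restriction>
  </Body>
</Query>
"""


def term(element):
    """Variables are prefixed with '?'; identifiers are used as written."""
    name = element.get("Name")
    return "?" + name if element.get("Type") == "Variable" else name


def translate(markup):
    body = ET.fromstring(markup).find("Body")
    subject = term(body.find("Subject"))
    selected, patterns = [], []
    for restriction in body.findall("Restriction"):
        obj_element = restriction.find("Object")
        obj = term(obj_element)
        if obj_element.get("isReturn") == "true":
            selected.append(obj)
        # Several predicates in one restriction are read as a union.
        triples = ["{" + f"{subject} {term(p)} {obj}" + "}"
                   for p in restriction.findall("Predicate")]
        patterns.append(triples[0] if len(triples) == 1
                        else "{" + " UNION ".join(triples) + "}")
        object_filter = restriction.find("ObjectFilter")
        if object_filter is not None and object_filter.get(XSI_TYPE) == "MoreThan":
            patterns.append(f"FILTER({obj} > {object_filter.get('Value')})")
    return "SELECT " + " ".join(selected) + " WHERE { " + " ".join(patterns) + " }"


print(translate(MARKUP))
```

Keeping the markup as the only input of the translation means that any tool able to write valid markup can reuse the same mapping rules.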

Figure 12. Use Case diagram of the translator.

7. Use Cases

This section presents five use cases of using MashQL for building semantic data mashups. The goal of these use cases is to demonstrate the utility of MashQL, and to evaluate how well MashQL can solve some real-life problems for non-IT experts. These use cases also suggest novel and useful Semantic Web applications in areas like job-seeking, data integration, web information systems, life-sciences research, and business data auditing.

7.1 Use case: Job Seeking

Bob has a PhD in bioinformatics. He is looking for a full-time, well-paid, and research-oriented job in some European countries. He spent an enormous amount of time searching different job portals, each time trying many keywords and filters. Instead, Bob used MashQL to find the job that meets his specific preferences. Figure 13 shows Bob’s queries on Google Base and on Jobs.ac.uk. First, he visited Google Base and performed a keyword search (bioinformatics OR "computational biology" OR "systems biology" OR e-health); he copied the link of the retrieved results from Google into the RDFInput module⁷; and then created a MashQL query on these results. He performed a similar task to query Jobs.ac.uk. The third MashQL module in Figure 13 mixes the results of the above two queries and filters them based on location preferences (provided in the UserInput module). The SPARQL equivalent to Bob’s MashQL query is shown in Figure 14. Notice that we translate MashQL queries that pipe into each other using "CONSTRUCT *", which is not part of the current SPARQL standard. However, this construct is one of the top proposed extensions to SPARQL (see [Se]). If this construct is not included in the next version of SPARQL, our translation will instead return every triple involved in the query patterns.

Figure 13. Bob’s MashQL Queries.

7 We assume that both Google and Jobs.ac.uk render their search results in RDFa (i.e. the RDF triples are embedded in HTML), as many companies have started to do nowadays. However, Bob can also use a third party’s service (e.g. triplify.org) to extract triples from HTML pages.

… CONSTRUCT * WHERE {
  {{?Job :Category :Health} UNION {?Job :Category :Medicine}}
  ?Job :Role ?X1. ?Job :Salary ?X2. ?X2 :Currency :UPK. ?X2 :Minimun ?X3.
  FILTER(?X1="Research" || ?X1="Academic")
  FILTER(?X3 > 50000) }

… CONSTRUCT * WHERE {
  ?Job :JobIndustry ?X1. ?Job :Type ?X2. ?Job :Currency ?X3. ?Job :Salary ?X4.
  FILTER(?X1="Education" || ?X1="HealthCare")
  FILTER(?X2="Full-Time" || ?X2="Fulltime" || ?X2="Contract")
  FILTER(regex(?X3, "^Euro") || regex(?X3, "^€"))
  FILTER(?X4 >= 75000 && ?X4 <= 120000) }

… SELECT ?Job ?Firm WHERE {
  ?Job :Location ?X1. ?X1 :Country ?X2.
  FILTER(?X2="Italy" || ?X2="Spain" || ?X2="Greece" || ?X2="Cyprus")
  OPTIONAL {{?Job :Organization ?Firm} UNION {?Job :Employer ?Firm}} }

Figure 14. The SPARQL translation of the MashQL queries in Figure 13.

7.2 Use case: eHealth Research

Alice is a PhD student in biology. She wants to search an online eHealth database to find out what most often causes prostate cancer. First, she wrote a simple query to find all patients who certainly have prostate cancer, i.e. every person whose prostate biopsy test is positive. Suppose she found 3500 cases. Then she started to add and remove restrictions to this query. Her goal was to reach the closest number to 3500, with a maximum number of relevant restrictions. Alice found that the restrictions shown in Figure 15 are the indicators for 90% of the people who had cancer.

Figure 15. Alice’s research query.

7.3 Use case: Car Rental

This use case demonstrates a different, offline use of MashQL for declaring business rules in an auditing application. The government usually audits whether car rental companies comply with the local regulations. All companies have to open their databases for auditing. The auditors visit companies irregularly, and run a set of queries (called auditing queries) to discover violations. As each company has a different database design and vocabulary, the auditors have to analyze and understand the database schema, and write the auditing queries from scratch each time. The government decided to introduce the new semantic technology. They built a wrapper that connects to any database and automatically converts it into one large RDF table; this takes a few minutes to complete. See a sample of such data in Figure 16. The auditors then use the MashQL editor to write and execute their set of auditing queries. Figure 17 shows three examples of such auditing queries. The first query checks whether there is a car rental that occurred during a period in which the car was not insured. The result was {<:Rental1>}, as the car was not insured during the first 3 days of the rental. The second query checks whether there is a car rental that occurred for a customer who does not have a driving license; the result is empty. The third query checks whether there is a car rental that occurred for a customer whose driving license does not authorize him to drive that type of car; the result is {<:Rental2>}, because the license of Customer2 is B, while the category of the car he rented requires license C or higher.

<http://localhost/Company1>
:Vehicle1 rdf:type :Vehicle
:Vehicle1 :PlateNumber “ABC323”
:Vehicle1 :VehicleCategory “C”
:Vehicle1 :Insurance :InsContract1
:Vehicle2 rdf:type :Vehicle
:Vehicle2 :PlateNumber “BDC987”
:Vehicle2 :VehicleCategory “B”
:Vehicle2 :Insurance :InsContract2
:InsContract1 rdf:type :VehicleInsurance
:InsContract1 :InsuranceType “Full”
:InsContract1 :InsuranceSDate 23-11-2007
:InsContract1 :InsuranceEDate 22-11-2008
:InsContract2 rdf:type :VehicleInsurance
:InsContract2 :InsuranceType “Full”
:InsContract2 :InsuranceSDate 10-04-2007
:InsContract2 :InsuranceEDate 11-04-2008
:Customer1 rdf:type :Customer
:Customer1 :CustomerID “6783”
:Customer1 :LicenseNumber “8723”
:Customer1 :LicenseType “B”
:Customer2 rdf:type :Customer
:Customer2 :CustomerID “3456”
:Customer2 :LicenseNumber “8723”
:Customer2 :LicenseType “B”
:Rental1 rdf:type :Rental
:Rental1 :Vehicle :Vehicle2
:Rental1 :Customer :Customer1
:Rental1 :StartsOn 20-11-2007
:Rental1 :EndsOn 25-11-2007
:Rental2 rdf:type :Rental
:Rental2 :Vehicle :Vehicle1
:Rental2 :Customer :Customer2
:Rental2 :StartsOn 15-02-2008
:Rental2 :EndsOn 16-02-2008

Figure 16. Sample of RDF for a car rental company.

Figure 17. Examples of Auditing Queries (panels a, b, and c).

7.4 Use case: Citations List

Bob would like to compile the list of articles that cited his articles (excluding his own self-citations). He built a mashup using MashQL to mix his citations retrieved from both Google Scholar and CiteSeer, and then filter out the self-citations. First, he performed a keyword search (“Bob Hacker”) on both Google Scholar and CiteSeer⁸. Figure 18 shows a sample of the extracted RDF triples. Bob’s MashQL query is shown in Figure 19, and its SPARQL equivalent in Figure 20. In this query, Bob wrote: retrieve every article that has a title (call it CitingArticle), has an author that does not contain "Bob Hacker" or "Hacker B.", and cites another article that has a title (call it CitedArticle), and has an author that contains "Bob Hacker" or "Hacker B.". Figure 21 shows the result of this query.

8 Similar to the previous use case, we also assume that Google Scholar’s and CiteSeer’s retrieved results are formatted in RDFa.

http://scholar.google.com/scholar?q=bob+Hacker
<g:3> :Title “Prostate Cancer”
<g:3> :Author “Hacker B., Hacker A.”
<g:4> :Title “Best and Worst Lifestyles”
<g:4> :Author “Bob Hacker”
<g:4> :Cites <g:3>
<g:7> :Title “Protein Categories”
<g:7> :Author “Bob Smith”
<g:7> :Cites <g:3>
<g:7> :Cites <g:4>
<g:8> :Title “Cancer Vaccines”
<g:8> :Author “Alice Hacker”
<g:8> :Cites <g:3>

http://www.citeseer.com/search?s=“Bob Hacker”
_:1 :Title “Prostate Cancer”
_:1 :Author “Hacker B., Hacker A.”
_:2 :Title “Protocols in Molecular Biology”
_:2 :Author “Bob Hacker”
_:2 :ArticleCited _:1
_:3 :Title “Cancer Vaccines”
_:3 :Author “Eve Lee, Bob Hacker”
_:4 :Title “Overview about Systems Biology”
_:4 :Author “Tom Lara”
_:4 :ArticleCited _:1
_:4 :ArticleCited _:2

Figure 18. Sample of RDF data about Bob’s articles.

Figure 19. A MashQL query to mash up citations from different sites.

PREFIX s1: <http://scholar.google.com/scholar?q=bob+Hacker>
PREFIX s2: <http://www.citeseer.com/search?s=“Bob Hacker”>
SELECT ?CitingArticle ?CitedArticle
FROM <http://scholar.google.com/scholar?q=bob+Hacker>
FROM <http://www.citeseer.com/search?s=“Bob Hacker”>
WHERE {
  {{?X1 s1:Title ?CitingArticle} UNION {?X1 s2:Title ?CitingArticle}}
  {{?X1 s1:Author ?X2} UNION {?X1 s2:Author ?X2}}
  {{?X1 s1:Cites ?X3} UNION {?X1 s2:ArticleCited ?X3}}
  {{?X3 s1:Title ?CitedArticle} UNION {?X3 s2:Title ?CitedArticle}}
  {{?X3 s1:Author ?X4} UNION {?X3 s2:Author ?X4}}
  FILTER (!(regex(?X2, "^Bob Hacker") || regex(?X2, "^Hacker B.")))
  FILTER (regex(?X4, "^Bob Hacker") || regex(?X4, "^Hacker B."))
}

Figure 20. The SPARQL translation.

CitingArticle | CitedArticle
Protein Categories | Prostate Cancer
Protein Categories | Best and Worst Lifestyles
Cancer Vaccines | Prostate Cancer
Overview about Systems Biology | Prostate Cancer
Overview about Systems Biology | Protocols in Molecular Biology

Figure 21. The query results.
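The logic of this query can be checked with a small Python sketch over the sample data of Figure 18. This is plain Python rather than a SPARQL engine: the dictionaries below are a hand transcription of the sample triples, the variable names are hypothetical, and the two regular expressions are joined into one pattern for brevity.

```python
import re

# Article id -> (title, author string), transcribed from the sample data.
ARTICLES = {
    "g:3": ("Prostate Cancer", "Hacker B., Hacker A."),
    "g:4": ("Best and Worst Lifestyles", "Bob Hacker"),
    "g:7": ("Protein Categories", "Bob Smith"),
    "g:8": ("Cancer Vaccines", "Alice Hacker"),
    "_:1": ("Prostate Cancer", "Hacker B., Hacker A."),
    "_:2": ("Protocols in Molecular Biology", "Bob Hacker"),
    "_:3": ("Cancer Vaccines", "Eve Lee, Bob Hacker"),
    "_:4": ("Overview about Systems Biology", "Tom Lara"),
}

# (citing article, cited article); Cites and ArticleCited are treated alike,
# which plays the role of the predicate union in the query.
CITATIONS = [
    ("g:4", "g:3"), ("g:7", "g:3"), ("g:7", "g:4"), ("g:8", "g:3"),
    ("_:2", "_:1"), ("_:4", "_:1"), ("_:4", "_:2"),
]

# Same patterns as in the query: the author string starts with Bob's name.
BY_BOB = re.compile(r"^Bob Hacker|^Hacker B.")

results = []
for citing, cited in CITATIONS:
    citing_title, citing_author = ARTICLES[citing]
    cited_title, cited_author = ARTICLES[cited]
    # Keep the citation if the cited article is Bob's and the citing one is not.
    if BY_BOB.search(cited_author) and not BY_BOB.search(citing_author):
        results.append((citing_title, cited_title))

for row in results:
    print(row)
```

On this sample the sketch yields five rows, one per citation of Bob's articles by other authors, while the two self-citations are dropped.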

7.5 Use case: Fnac retailer

Fnac is a large retailer of cultural and consumer electronics products. When a new product arrives at Fnac, it has to be entered into the inventory database. This is usually done by scanning the barcode on each product, and then manually filling in the product specifications. Furthermore, as Fnac trades in many countries, their product specifications have to be translated into several languages. To save time entering and translating information manually, Fnac decided to reuse the product data specifications (and their translations) that are produced at the factory side. For example, suppose Fnac received three packages from Cannon, Alfred, and IMDB. Fnac would like to scan the barcodes of the received products and then get their specifications directly from the online catalogues of those suppliers. In Figure 22 we show samples of online product catalogues of the three suppliers (we assume they are published in RDFa). Figure 23 illustrates a query that Fnac built to look up the multilingual titles of three products. This query is a mashup of three RDF data sources with a user-input of three barcode numbers. The query takes each of these barcodes and finds the English and French titles. Notice that Fnac assumed that short titles provided by Cannon are in English; thus, they are joined with the other titles that are tagged with "@en". See the retrieved results in Figure 25. In the same way, a barcode reader could be connected to a user-input module, to retrieve the specifications (which could be stored at the supplier side) each time a product is scanned.

http:www.cannon/products/rdf

_:P1 :ShortName “CanScan 4400F”
_:P1 :FullName “Canon CanoScan 4400F Color Image Scanner”
_:P1 :Producer “Canon”
_:P1 :ShippingWeight “4 pounds”
_:P1 :Barcode 9780133557022
_:P2 :ShortName “PowerShot SD100”
_:P2 :FullName “Canon PowerShot SD1000 7.1MP Camera 3x Zoom”
_:P2 :Producer “Canon”
_:P2 :ShippingWeight “2 pounds”
_:P2 :Barcode 9781143557532

http://www.alfred.com/books
<:B1> :Type <:Book>
<:B1> :Title “The Prophet”@en
<:B1> :Title “Le prophète”@fr
<:B1> :BCode 8765422097653
<:B1> :Authors “Kahlil Gibran”
<:B1> :ISBN-10 0394404289
<:B3> :Type <:Book>
<:B3> :Title “Alfred Nobel”@en
<:B3> :Title “Alfred Nobel”@fr
<:B3> :BCode 75639898123
<:B3> :Authors “Kenne Fant”
<:B3> :ISBN-10 0531123286

http://www.imdb.com/movies
_:1 rdf:Type <:Movie>
_:1 :Title “All about my mother”@en
_:1 :Title “Tout sur ma mère”@fr
_:1 :ProdCode 3248765355133
_:1 :NumberOfDiscs 1
_:2 rdf:Type <:Movie>
_:2 :Title “Lords of the rings”@en
_:2 :Title “Seigneur des anneaux”@fr
_:2 :ProdCode 4852834058083
_:2 :NumberOfDiscs 3

Figure 22. Sample of RDF data about products

Figure 23. A query to mash up product titles from different resources.

PREFIX s1: <http:www.cannon/products/rdf>
PREFIX s2: <http://www.alfred.com/books>
PREFIX s3: <http://www.imdb.com/movies>
SELECT ?Barcode ?EnglishTitle ?FrenchTitle
FROM <http:www.cannon/products/rdf>
FROM <http://www.alfred.com/books>
FROM <http://www.imdb.com/movies>
WHERE {
  {{?x s1:Barcode ?Barcode} UNION {?x s2:BCode ?Barcode} UNION {?x s3:ProdCode ?Barcode}}
  FILTER (regex(?Barcode, "9781143557532") || regex(?Barcode, "8765422097653") || regex(?Barcode, "3248765355133"))
  OPTIONAL {
    {?x s1:ShortName ?EnglishTitle}
    UNION
    {{{?x s2:Title ?EnglishTitle} UNION {?x s3:Title ?EnglishTitle}} FILTER (lang(?EnglishTitle) = "en")} }
  OPTIONAL {
    {{?x s2:Title ?FrenchTitle} UNION {?x s3:Title ?FrenchTitle}} FILTER (lang(?FrenchTitle) = "fr") }
}

Figure 24. The SPARQL equivalent of Figure 23.

Barcode          EnglishTitle           FrenchTitle
9781143557532    CanScan 4400F
8765422097653    The Prophet            Le prophète
3248765355133    All about my mother    Tout sur ma mère

Figure 25. Retrieved product titles.

8. Discussion

8.1 Coverage and Elegance

We have learned from these use cases that some MashQL constructs (especially OneOf, Between, and the Union “\”) are useful to have in the language, as they were used in most cases. On the other hand, we found that some important constructs are missing, especially aggregation functions. Thus, we decided to extend our approach and use Oracle’s SPARQL, instead of only the W3C’s SPARQL standard. Oracle’s SPARQL inherits all functionalities of SQL, including aggregation functions and grouping, among others.

All SPARQL constructs are supported by MashQL in some form, and all MashQL constructs can be expressed in SPARQL. Some constructs are mapped directly, while others require lengthy emulation. This suggests that MashQL constructs are more concise and intuitive than their SPARQL counterparts. For example, SPARQL query scripts containing the union operator are almost twice as long as their MashQL equivalents. The OneOf operator, which is represented as a set of constants in MashQL, is emulated by long SPARQL filters.
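To illustrate why a OneOf set grows into a long filter, the following is a minimal Python sketch that expands a set of constants into a SPARQL FILTER disjunction. It is not the translation used by MashQL: the function names are hypothetical, and the use of equality tests (rather than, say, the regex tests of Figure 24) is an assumption made for brevity.

```python
def one_of_filter(variable, constants):
    """Build a SPARQL FILTER that emulates a OneOf restriction (assumed form)."""
    if not constants:
        raise ValueError("OneOf needs at least one constant")

    def literal(value):
        # Escape backslashes and double quotes so the constant is a valid literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # One equality test per constant, joined with logical OR.
    tests = [f"?{variable} = {literal(c)}" for c in constants]
    return "FILTER (" + " || ".join(tests) + ")"


def one_of_pattern(subject, predicate, variable, constants):
    """Triple pattern plus the emulating filter (hypothetical helper)."""
    return f"?{subject} :{predicate} ?{variable} . " + one_of_filter(variable, constants)


print(one_of_pattern("Job", "JobIndustry", "X1", ["Education", "HealthCare"]))
```

Each additional constant adds one more disjunct to the filter, whereas in MashQL it adds only one more item to the set.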

The main limitation of MashQL’s elegance is that an RDF data source containing unwieldy terms may yield inelegant MashQL queries. For example, suppose you want to query this dataset: {<:C1,:eno,AB12345> <:C1,:eid,987665> <:C1,:efname,Bob>}. The technical predicate labels, which are difficult to understand for people who did not create them, may yield inelegant or technical queries. However, we believe that if an RDF source is made public for external users, its predicate vocabulary will be clear and mostly based on standard RDF vocabularies (such as Dublin Core, FOAF, GeoRSS, etc.) or on a shared ontology [J05].

Furthermore, the use cases also taught us that representing queries visually (as modules connected through pipes) helps non-IT users to visualize and understand the information flow in their queries. In other words, this visualization helps users organize and modularize their queries at a high level of abstraction. For example, one may notice that Bob could create a single query and get the same results as the three queries shown in Figure 13. However, instead of building such a complex query, Bob preferred to modularize his queries in this way, as they are easier for him to build and maintain. In short, modularizing queries and piping them in this way is easier for non-IT experts to grasp.

8.2 Performance and Complexity

Suppose one asks how long it takes to execute the query shown in Figure 2. The answer is: exactly as long as the query shown in Figure 1, which is its SPARQL translation. In other words, although users build and view their mashups in MashQL, MashQL queries are not executed themselves; it is their SPARQL translation that is executed. The time needed to translate a MashQL query into SPARQL is negligible, as it takes less than a second. Thus, the complexity of executing a MashQL query is bounded by the complexity and performance of its backend query language, which is SPARQL in our case.

Executing a SPARQL query over a huge RDF dataset can be very fast. As shown by Oracle [CDES] and AllegroGraph [All], a query of medium complexity over hundreds of millions of triples takes only one or a few seconds. Oracle has also shown that both querying and bulk-loading of RDF data are scalable [CDES, DCWAS]. However, a problem may arise when querying remote RDF sources. When a user defines a query over a set of remote sources, as shown in Figure 1, these sources must be transferred and stored locally before executing the query [TPM]. Hence, the time required to execute a query depends mainly on the amount of transferred data and the available bandwidth. The performance of using SPARQL (and thus MashQL) to mash up remote sources might be unacceptable if these sources are very large. However, in most practical application scenarios (see our use cases), people would build mashups on data sources of acceptable size, for example HTML pages that contain RDFa, small RDF files, or results retrieved from online RDF stores that support SPARQL endpoints, among many other scenarios. As this topic is related to the performance of SPARQL rather than MashQL, we tackle it separately; see the future work section.
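As a rough illustration of why the transfer dominates for remote sources, the sketch below computes a lower bound on the time needed to copy a source locally from its size and the link bandwidth. The source sizes and the 10 Mbit/s bandwidth are hypothetical values chosen for the example, not measurements, and the function name is ours.

```python
def transfer_seconds(size_megabytes, bandwidth_megabits_per_s):
    """Lower bound on the time to copy a remote source locally.

    Ignores latency, protocol overhead, and the time to load the data
    into the local store, so real figures would be higher.
    """
    if bandwidth_megabits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # 1 megabyte = 8 megabits.
    return size_megabytes * 8 / bandwidth_megabits_per_s


# Hypothetical source sizes: a small RDF file, a medium dump, a large dump.
for size_mb in (2, 150, 4000):
    print(f"{size_mb} MB over 10 Mbit/s: {transfer_seconds(size_mb, 10):.1f} s")
```

Under these assumed numbers, a small file is transferred in a couple of seconds, while a multi-gigabyte dump takes close to an hour before the query can even start.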

Furthermore, recall that when formulating a MashQL query, some background queries are used to generate drop-down lists (see Section 3). These queries should be fast enough to allow efficient query formulation. For example, after specifying the RDF sources to be queried, users select the query subject from a drop-down list that is dynamically generated from the union of the subject and object identifiers in the data sources. Similarly, predicates and objects are selected from dynamic lists (see Section 3). In general, for a query with n restrictions, at most (2n+1) queries are performed in the background, i.e., during the query formulation process. These queries should be fast: as soon as a user selects from one list and moves the mouse to another, the next list should be ready. As discussed above, when the queried sources are stored locally, the background queries are indeed fast, as each query takes only one second. However, remote input sources need to be transferred and stored locally before the query formulation process starts. Therefore, query formulation over large remote sources might be unacceptably slow. In Section 10, we discuss our future plans to overcome this challenge, which is a SPARQL rather than a MashQL issue.
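The (2n+1) bound can be read as one query for the subject list plus, for each of the n restrictions, one query for the predicate list and one for the object list. The Python sketch below generates such a set of queries. The query shapes and function names are assumptions made for illustration; the background queries actually issued by the MashQL editor may differ, for instance by narrowing each list according to earlier restrictions.

```python
def subject_list_query():
    # Candidate subjects: union of subject and object identifiers (no literals).
    return ("SELECT DISTINCT ?s WHERE { { ?s ?p ?o } UNION { ?x ?p ?s } "
            "FILTER (!isLiteral(?s)) }")


def predicate_list_query(subject):
    # Predicates that occur with the chosen subject term.
    return f"SELECT DISTINCT ?p WHERE {{ {subject} ?p ?o }}"


def object_list_query(subject, predicate):
    # Objects that occur with the chosen subject and predicate.
    return f"SELECT DISTINCT ?o WHERE {{ {subject} {predicate} ?o }}"


def background_queries(subject, predicates):
    """Return the drop-down queries for a query with len(predicates) restrictions."""
    queries = [subject_list_query()]          # 1 query for the subject list
    for predicate in predicates:              # 2 queries per restriction
        queries.append(predicate_list_query(subject))
        queries.append(object_list_query(subject, predicate))
    return queries


queries = background_queries("?Job", [":JobIndustry", ":Type", ":Salary"])
print(len(queries))
for query in queries:
    print(query)
```

For the three restrictions in this example the sketch produces seven queries, matching 2n+1 with n = 3.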

9. Evaluation

Coming soon.

10. Conclusions and Future Work

In this article, we have proposed a query-by-diagram language that allows building data mashups intuitively. Not only is MashQL user-friendly for non-IT people, but it also allows querying (and navigating) RDF data sources without having to know the schema or the technical details of these sources. We have demonstrated MashQL through different use cases, and discussed the lessons learned regarding its elegance, coverage, and performance. As we have discussed and demonstrated, MashQL is not merely an interface to SPARQL. Although it can be used as such, it can also be used as a general query language in its own right. In addition, MashQL can be used for filtering metadata streams.

In this article, we focus on using the W3C’s SPARQL standard as the backend query language. However, we also plan to translate MashQL into Oracle’s SPARQL. This translation is being implemented as an AJAX web-based plug-in to Oracle 11g. We chose Oracle not only because of its scalability, but also because Oracle’s SPARQL inherits all functionalities of SQL, including aggregation and grouping functions, which are not supported in standard SPARQL. Furthermore, as querying large remote sources using SPARQL matters (see our discussion in Section 8.2), we are developing a framework called Semantic Web Pipes. This framework extends MashQL and allows caching remote sources, materializing query results, distributing queries, and publishing and discovering queries, among other issues. Last but not least, we also plan to extract schemas from RDF sources and depict them using ORM, based on [J07a, J07b], which would provide extra offline support for MashQL users.

Acknowledgement

We are indebted to George Pallis for his valuable comments and feedback on the early drafts of this paper. This research is partially supported by the SEARCHiN project (FP6-042467, Marie Curie Actions).

References

[ACK] Athanasis, N., Christophides, V., and Kotzinos, D.: Generating On the Fly Queries for the Semantic Web. In Proceedings of the ISWC. LNCS, Springer. 2004.

[AD] Abiteboul, S., and Duschka, O.: Complexity of Answering Queries Using Materialized Views. ACM SIGACT-SIGMOD-SIGART. 1998.

[All] AllegroGraph http://www.franz.com/products/allegrograph (May 2008)

[BH] Bloesch, A. and Halpin, T.: Conceptual Queries using ConQuer–II. In Proceedings of the ER. LNCS, Springer. 1997.

[BHB] Bizer, C., Heath, T., and Berners-Lee, T.: Linked Data: Principles and State of the Art. WWW. 2008.

[CDES] Chong, E., Das, S., Eadon, G., and Srinivasan, J.: An efficient SQL-based RDF querying scheme. In Proceedings of VLDB. LNCS, Springer. 2005

[DCWAS] Das, S., Chong, E., Wu, Z., Annamalai, M., and Srinivasan, J.: A Scalable Scheme for Bulk Loading Large RDF Graphs into Oracle. In Proceedings of the ICDE. ACM. 2008.

[CERE] Czejdo, B., Elmasri, R., Rusinkiewicz, M., and Embley, D.: An algebraic language for graphical query formulation using an EER model. In Proceedings of the ACM Conference on Computer Science. 1987.

[DMV] De Troyer, O., Meersman, R., and Verlinden, P.: RIDL on the CRIS Case: A Workbench for NIAM. In Proceedings of the IFIP.8.1 Conference. 1988.

[HPW] Hofstede, A., Proper, H., and van der Weide, T.: Computer Supported Query Formulation in an Evolving Context. In Proceedings of the ADC. 1995.

[I] Iskold, A.: Semantic Web: Difficulties with the Classic Approach. The ReadWriteWeb magazine. Sep. 19, 2007

[iS] iSPARQL http://demo.openlinksw.com/isparql (May 2008)

[J05] Jarrar, M.: Towards Methodological Principles for Ontology Engineering. PhD Thesis. Vrije Universiteit Brussel. May 2005

[J07a] Jarrar, M.: Towards Automated Reasoning on ORM Schemes -Mapping ORM into the DLRidf Description Logic. In Proceedings of the ER. LNCS, Springer. 2007.

[J07b] Jarrar, M.: Mapping ORM into the SHOIN/OWL Description Logic -Towards a Methodological and Expressive Graphical Notation for Ontology Engineering. In Proceedings of the OTM’07 Workshops. LNCS, Springer. 2007

[JD08] Jarrar, M., and Dikaiakos, M. D.: MashQL: A Query-By-Diagram Language for Data Mashups. Technical Article (No. TAR200805). University of Cyprus, 2008. http://www.jarrar.info/publications/JD08.pdf

[KB] Kaufmann, E., and Bernstein, A.: How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users. In Proceedings of the ISWC. LNCS, Springer. 2007.

[M] Mathieson, S. A.: Spread the word, and join it up. The Guardian. April 5, 2006.

[MPTL] Morbidoni, C., Polleres, A., Tummarello, G., and Le Phuoc, D.: Semantic Web Pipes. Technical Report. DERI. 2007.

[O] O'Donoghue, J.: MySpace joins the 'semantic' web. The Web User online magazine. May 9, 2008

[Ot] O'Reilly, T.: http://radar.oreilly.com/archives/2007/02/pipes-and-filters-for-the-inte.html (May 2008)

[PAG] Perez, J., Arenas, M., and Gutierrez, C.: Semantics and Complexity of SPARQL. In Proceedings of the ISWC. LNCS, Springer. 2006

[P] Prud’hommeaux, E. (ed.): SPARQL Query Language for RDF, W3C Working Draft, 4 Oct. 2006

[P07] Polleres, A.: From SPARQL to rules (and back). In proceedings of the WWW. ACM. 2007

[PS] Parent, C., and Spaccapietra, S.: About Complex Entities, Complex Objects and Object-Oriented Data Models. In Proceedings of the IFIP 8.1 conference. 1989

[Ra] RDFAuthor http://rdfweb.org/people/damian/2001/10/RDFAuthor (May 2008)

[RSBS] Russell, A., Smart, R., Braines, D., and Shadbolt, R.: NITELIGHT: A Graphical Tool for Semantic Query Construction. In Proceedings of the Semantic Web User Interaction Workshop (SWUI). 2008.

[Se] SPARQL Extensions (March 2008) http://esw.w3.org/topic/SPARQL/Extensions?highlight=%28sparql%29

[Sm] SPARQLMotion http://www.topquadrant.com/sparqlmotion (May 2008)

[TPM] Tummarello, G., Polleres, A., and Morbidoni, C.: Who the FOAF knows Alice? A needed step toward Semantic Web Pipes. In Proceedings of the ISWC Workshops. 2007

[Wr] W3C RDB2RDF Incubator Group (Aug. 08) http://www.w3.org/2005/Incubator/rdb2rdf/

[Y] Yahoo Blog http://www.ysearchblog.com/archives/000527.html (Mar. 2008)

[Z] Zloof, M.: Query-by-Example: A Data Base Language. IBM Systems Journal, 16(4). 1977

[ZGHW] Zhuge, Y., Garcia-Molina, H., Hammer, J., and Widom, J.: View Maintenance in a Warehousing Environment. SIGMOD Conference. 1995.

Appendices

A1. MashQL XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:complexType name="ObjectFilter" abstract="true"/>
  <xs:complexType name="NotContains">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
        <xs:attribute name="Language"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Contains">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
        <xs:attribute name="Language"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="NotEquals">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
        <xs:attribute name="Language"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Equals">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
        <xs:attribute name="Language"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="NotLessThan">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="LessThan">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="NotBetween">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="minValue"/>
        <xs:attribute name="maxValue"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Between">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="minValue"/>
        <xs:attribute name="maxValue"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="NotMoreThan">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="MoreThan">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:attribute name="Value"/>
        <xs:attribute name="Type"/>
        <xs:attribute name="DataType"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="NotOneOf">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:choice>
          <xs:element name="DataItem" maxOccurs="unbounded">
            <xs:complexType>
              <xs:attribute name="Value"/>
              <xs:attribute name="Type"/>
              <xs:attribute name="DataType"/>
              <xs:attribute name="Language"/>
            </xs:complexType>
          </xs:element>
          <xs:element name="UserInput">
            <xs:complexType>
              <xs:attribute name="ModuleID"/>
            </xs:complexType>
          </xs:element>
        </xs:choice>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="OneOf">
    <xs:complexContent>
      <xs:extension base="ObjectFilter">
        <xs:choice>
          <xs:element name="DataItem" maxOccurs="unbounded">
            <xs:complexType>
              <xs:attribute name="Value"/>
              <xs:attribute name="Type"/>
              <xs:attribute name="DataType"/>
              <xs:attribute name="Language"/>
            </xs:complexType>
          </xs:element>
          <xs:element name="UserInput">
            <xs:complexType>
              <xs:attribute name="ModuleID"/>
            </xs:complexType>
          </xs:element>
        </xs:choice>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="SubQuery">
    <xs:sequence>
      <xs:element name="Restrictions" type="Restriction"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Body">
    <xs:sequence>
      <xs:element name="Subject">
        <xs:complexType>
          <xs:attribute name="Name" use="required"/>
          <xs:attribute name="Type" use="required"/>
          <xs:attribute name="isRetrun" type="xs:boolean" use="required"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="Restriction" type="Restriction" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Header">
    <xs:sequence>
      <xs:element name="Meta" type="Meta" minOccurs="0"/>
      <xs:element name="Prefix" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="Tag" use="required"/>
          <xs:attribute name="Ref" use="required"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Footer">
    <xs:sequence>
      <xs:element name="Order">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Variable">
              <xs:complexType>
                <xs:attribute name="Name"/>
                <xs:attribute name="Direction"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="Modifier">
        <xs:complexType>
          <xs:attribute name="Duplication"/>
          <xs:attribute name="Limit"/>
          <xs:attribute name="Offset"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="Output">
        <xs:complexType>
          <xs:attribute name="Format"/>
          <xs:attribute name="Stylesheet"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Restriction">
    <xs:sequence>
      <xs:element name="Predicate" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="Name"/>
          <xs:attribute name="Type"/>
          <xs:attribute name="isReturn" type="xs:boolean"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="Object" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="Name"/>
          <xs:attribute name="Type"/>
          <xs:attribute name="isReturn" type="xs:boolean"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="ObjectFilter" type="ObjectFilter" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="Prefix"/>
  </xs:complexType>
  <xs:element name="Pipe">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Meta" type="Meta" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="RDFInput" type="RDFInput" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="UserInput" type="UserInput" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Query" type="Query" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="ID" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="Meta">
    <xs:attribute name="Name"/>
    <xs:attribute name="Content"/>
  </xs:complexType>
  <xs:complexType name="RDFInput">
    <xs:sequence>
      <xs:element name="Source" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Ref" type="xs:anyURI"/>
            <xs:element name="LastUpload" type="xs:dateTime"/>
            <xs:element name="CashedAt" minOccurs="0"/>
          </xs:sequence>
          <xs:attribute name="Order" type="xs:integer"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="ID" type="xs:integer" use="required"/>
    <xs:attribute name="Y1" type="xs:integer"/>
    <xs:attribute name="X1" type="xs:integer"/>
    <xs:attribute name="X2" type="xs:integer"/>
    <xs:attribute name="Y2" type="xs:integer"/>
  </xs:complexType>
  <xs:complexType name="UserInput">
    <xs:sequence>
      <xs:element name="DataItem" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="Value"/>
          <xs:attribute name="Type"/>
          <xs:attribute name="DataType"/>
          <xs:attribute name="Langauge"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="ID" type="xs:integer" use="required"/>
    <xs:attribute name="X1" type="xs:integer"/>
    <xs:attribute name="Y1" type="xs:integer"/>
    <xs:attribute name="X2" type="xs:integer"/>
    <xs:attribute name="Y2" type="xs:integer"/>
  </xs:complexType>
  <xs:complexType name="Query">
    <xs:sequence>
      <xs:element name="Header" type="Header" minOccurs="0"/>
      <xs:element name="Body" type="Body"/>
      <xs:element name="Footer" type="Footer" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="ID" type="xs:integer" use="required"/>
    <xs:attribute name="InputModule" type="xs:integer" use="required"/>
    <xs:attribute name="X1" type="xs:integer"/>
    <xs:attribute name="Y1" type="xs:integer"/>
    <xs:attribute name="X2" type="xs:integer"/>
    <xs:attribute name="Y2" type="xs:integer"/>
  </xs:complexType>
</xs:schema>

A2. The markup of the Job-Seeking use case

<?xml version="1.0" encoding="UTF-8"?>
<Pipe ID="0" xsi:noNamespaceSchemaLocation="MashQL.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Meta Content="Title" Name="Bob's Jobs"/>
  <Meta Content="Creator" Name="Bob Hacker"/>
  <Meta Content="DateCreated" Name="2008-05-17T09:30:47.0Z"/>
  <Meta Content="DateExecuted" Name="2008-05-18T08:22:10.0Z"/>
  <Meta Content="UserID" Name="012345"/>
  <RDFInput ID="1" X1="10" Y1="10" X2="110" Y2="60">
    <Source Order="0">
      <Ref>http://base.google.com/jobs/rdf</Ref>
      <LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
    </Source>
  </RDFInput>
  <RDFInput ID="2" X1="210" Y1="10" X2="210" Y2="60">
    <Source Order="0">
      <Ref>http://jobs.ac.uk/rdf</Ref>
      <LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
    </Source>
  </RDFInput>
  <RDFInput ID="6" X1="210" Y1="10" X2="210" Y2="60">
    <Source Order="0">
      <Ref>:4</Ref>
      <LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
    </Source>
    <Source Order="0">
      <Ref>:5</Ref>
      <LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
    </Source>
  </RDFInput>
  <UserInput ID="3" X1="230" Y1="160" X2="300" Y2="280">
    <DataItem Value="UK" Type="Constant" DataType="" Langauge=""/>
    <DataItem Value="Belgium" Type="Constant" DataType="" Langauge=""/>
    <DataItem Value="Germany" Type="Constant" DataType="" Langauge=""/>
    <DataItem Value="Austria" Type="Constant" DataType="" Langauge=""/>
    <DataItem Value="Holland" Type="Constant" DataType="" Langauge=""/>
  </UserInput>
  <Query ID="4" X1="0" Y1="70" X2="150" Y2="170" InputModule="1">
    <Header>
      <Meta Content="Title" Name="Google Jobs"/>
      <Prefix Ref="" Tag=""/>
    </Header>
    <Body>
      <Subject Name="Job" Type="Variable" isRetrun="false"/>
      <Restriction Prefix=" ">
        <Predicate Name="JobIndustry" Type="Identifier" isReturn="false"/>
        <Object Name="X1" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="OneOf">
          <DataItem Value="Education" Type="Constant" DataType="" Language=""/>
          <DataItem Value="HealthCare" Type="Constant" DataType="" Language=""/>
        </ObjectFilter>
      </Restriction>
      <Restriction Prefix=" ">
        <Predicate Name="Type" Type="Identifier" isReturn="false"/>
        <Object Name="X2" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="OneOf">
          <DataItem Value="Full-Time" Type="Constant" DataType="" Language=""/>
          <DataItem Value="FullTime" Type="Constant" DataType="" Language=""/>
          <DataItem Value="Contract" Type="Constant" DataType="" Language=""/>
        </ObjectFilter>
      </Restriction>
      <Restriction Prefix=" ">
        <Predicate Name="Currency" Type="Identifier" isReturn="false"/>
        <Object Name="X3" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="OneOf">
          <DataItem Value="^Euro" Type="Constant" DataType="" Language=""/>
          <DataItem Value="^€" Type="RegexConstant" DataType="" Language=""/>
        </ObjectFilter>
      </Restriction>
      <Restriction Prefix=" ">
        <Predicate Name="Salary" Type="Identifier" isReturn="false"/>
        <Object Name="X4" Type="Variable" isReturn="true"/>
        <ObjectFilter xsi:type="Between" minValue="75000" maxValue="120000" Type="Constant" DataType=""/>
      </Restriction>
    </Body>
    <Footer>
      <Order>
        <Variable Name="" Direction=""/>
      </Order>
      <Modifier Limit="" Offset="" Duplication=""/>
      <Output Stylesheet="" Format=""/>
    </Footer>
  </Query>
  <Query ID="5" X1="160" Y1="70" X2="300" Y2="170" InputModule="2">
    <Header>
      <Meta Content="Title" Name="Jobs.uk"/>
      <Prefix Ref="" Tag=""/>
    </Header>
    <Body>
      <Subject Name="Job" Type="Variable" isRetrun="false"/>
      <Restriction Prefix=" ">
        <Predicate Name="Category" Type="Identifier" isReturn="false"/>
        <Object Name="X5" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="OneOf">
          <DataItem Value="Health" Type="Constant" DataType="" Language=""/>
          <DataItem Value="BioSciences" Type="Constant" DataType="" Language=""/>
        </ObjectFilter>
      </Restriction>
      <Restriction Prefix=" ">
        <Predicate Name="Type" Type="Identifier" isReturn="false"/>
        <Object Name="X2" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="Equals" Value="Research/Academic" Type="Constant" DataType="" Language=""/>
      </Restriction>
      <Restriction Prefix=" ">
        <Predicate Name="SalaryCurrency" Type="Identifier" isReturn="false"/>
        <Object Name="X6" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="Equals" Value="UPK" Type="Constant" DataType="" Language=""/>
      </Restriction>
      <Restriction Prefix=" ">
        <Predicate Name="Salary" Type="Identifier" isReturn="false"/>
        <Object Name="X4" Type="Variable" isReturn="true"/>
        <ObjectFilter xsi:type="MoreThan" Value="50000" Type="Constant" DataType=""/>
      </Restriction>
    </Body>
    <Footer>
      <Order>
        <Variable Name="" Direction=""/>
      </Order>
      <Modifier Limit="" Offset="" Duplication=""/>
      <Output Stylesheet="" Format=""/>
    </Footer>
  </Query>
  <Query ID="7" X1="80" Y1="180" X2="200" Y2="250" InputModule="6">
    <Header>
      <Meta Content="Title" Name="Job Seeking"/>
    </Header>
    <Body>
      <Subject Name="Job" Type="Variable" isRetrun="true"/>
      <Restriction Prefix="">
        <Predicate Name="Location" Type="Identifier" isReturn="false"/>
        <Object Name="X9" Type="Variable" isReturn="false"/>
        <ObjectFilter xsi:type="OneOf">
          <UserInput ModuleID="3"/>
        </ObjectFilter>
      </Restriction>
    </Body>
    <Footer>
      <Order>
        <Variable Name="" Direction=""/>
      </Order>
      <Modifier Limit="" Offset="" Duplication=""/>
      <Output Stylesheet="" Format=""/>
    </Footer>
  </Query>
</Pipe>
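To show how the pipeline structure can be read from this markup, the following Python sketch maps each Query module to the sources of its input module. It works on a trimmed copy of the Job-Seeking markup (only the RDFInput and Query elements, without their bodies). The function name is hypothetical, and reading a Ref such as ":4" as "the output of Query 4" is our interpretation of the markup, not something the schema states.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Job-Seeking pipe: inputs and queries only.
PIPE_XML = """
<Pipe ID="0">
  <RDFInput ID="1"><Source Order="0"><Ref>http://base.google.com/jobs/rdf</Ref></Source></RDFInput>
  <RDFInput ID="2"><Source Order="0"><Ref>http://jobs.ac.uk/rdf</Ref></Source></RDFInput>
  <RDFInput ID="6">
    <Source Order="0"><Ref>:4</Ref></Source>
    <Source Order="0"><Ref>:5</Ref></Source>
  </RDFInput>
  <Query ID="4" InputModule="1"/>
  <Query ID="5" InputModule="2"/>
  <Query ID="7" InputModule="6"/>
</Pipe>
"""


def query_inputs(pipe_xml):
    """Map each Query ID to the Ref values of the RDFInput it reads from."""
    pipe = ET.fromstring(pipe_xml)
    # Collect the Ref values of every RDFInput module, keyed by module ID.
    refs_by_input = {
        rdf_input.get("ID"): [ref.text for ref in rdf_input.iter("Ref")]
        for rdf_input in pipe.findall("RDFInput")
    }
    # Resolve each query's InputModule attribute against that table.
    return {
        query.get("ID"): refs_by_input.get(query.get("InputModule"), [])
        for query in pipe.findall("Query")
    }


for query_id, refs in query_inputs(PIPE_XML).items():
    print(f"Query {query_id} reads from {refs}")
```

Under this reading, Queries 4 and 5 each read one remote source, and Query 7 reads from the module that combines their results.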
