PRINCIPLES OF DATASPACE SYSTEMS By Alon Halevy (Google Inc.) Michael Franklin (University of California, Berkeley) David Maier (Portland State University)

PRINCIPLES OF DATASPACE SYSTEMS

By

Alon Halevy (Google Inc.)

Michael Franklin (University of California, Berkeley)

David Maier (Portland State University)

PRESENTATION BY…

Aditya Kumar

Rajesh Ramakrishnan

INDEX

Introduction

DSSP

DSSP vs. Traditional Database and Data Integration

Examples

Query Answering

  Query Answering Model

  Obtaining Answers

Dataspace Introspection

  Lineage, Uncertainty and Inconsistency

  Finding the Right Answers

Reusing Human Attention

Conclusion

INTRODUCTION

In most data management scenarios today, it is rare that all the data fits nicely into a conventional relational DBMS, or into any other single data model or system.

The first set of challenges are user-facing functions: locating relevant data sources, providing search and query capability, and tracing lineage and determining the accuracy of the data.

The second set of challenges, on the administration side, includes enforcing rules, integrity constraints and naming conventions across a collection, and providing availability, recovery and access control.

PERSONAL INFORMATION MANAGEMENT [Semex: Dong et al.]

[Figure: relationships among personal information items, with labels OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor]

DSSP (DATASPACE SUPPORT PLATFORM)

Dataspaces are not a data integration approach; rather, they are more of a data co-existence approach.

Dataspaces are a new abstraction for data management in such scenarios; the authors propose the design and development of DataSpace Support Platforms (DSSPs) as a key agenda item for the data management field.

A DSSP offers a suite of interrelated services and guarantees that enable developers to focus on the specific challenges of their applications.

DSSP (DATASPACE SUPPORT PLATFORM)

A DSSP helps to identify sources in a dataspace and inter-relate them, and offers basic query mechanisms over them, including the ability to introspect about their contents.

A DSSP also provides some mechanisms for enforcing constraints and some limited notions of consistency and recovery.

DSSPs can be viewed as the next step in the evolution of data integration architectures, but are distinct from current data integration systems.

DSSP VS. TRADITIONAL DATABASE AND DATA INTEGRATION

A DSSP must deal with data and applications in a wide variety of formats accessible through many systems with different interfaces.

Unlike a DBMS, a DSSP is not in full control of its data.

Queries to a DSSP may offer varying levels of service, and in some cases may return best-effort or approximate answers.

A DSSP must offer the tools and pathways to create tighter integration of data in the space as necessary.

DSSP can be viewed as a ________ approach?

1. Integrated

2. Linear

3. Database

4. Data co-existence

5. None of the above

EXAMPLES

Personal Information Management

Scientific Data Management

Structured Queries and content on the WWW

A dataspace can be viewed as the next step in the evolution of ________

1. Networks

2. Query Processing

3. Data Integration

4. DBMS

5. None of the above

THE WEB IS GETTING SEMANTIC

Forms (millions)

Vertical search engines (hundreds)

Annotation schemes: Flickr, ESP Game

Google Coop

DB search engine coming soon!

“A little semantics goes a long way”

GOOGLE BASE

QUERY ANSWERING

Dataspace Participants and Relationships

  Model a dataspace as a set of participants and relationships.

Queries

  When users interact more deeply with certain data sources, they may pose more complex queries, which may lead to more complex SQL or XQuery queries.

Answers

  Ranked

  Heterogeneous

  Sources to answers

  Iterative

  Reflection
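
As a rough illustration of modeling a dataspace as a set of participants and relationships, the sketch below uses plain Python data classes. The class names, the `kind` field, and the example sources are assumptions made for illustration only; the relationship labels are borrowed from the personal information management figure, and their direction here is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Participant:
    name: str
    kind: str  # hypothetical tag, e.g. "relational", "xml", "files"


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    label: str


@dataclass
class Dataspace:
    participants: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_participant(self, participant):
        self.participants[participant.name] = participant

    def relate(self, source, target, label):
        # Only registered participants can take part in a relationship.
        if source not in self.participants or target not in self.participants:
            raise KeyError("both ends must be registered participants")
        self.relationships.append(Relationship(source, target, label))

    def related_to(self, name):
        # Names of participants connected to `name` in either direction.
        found = set()
        for rel in self.relationships:
            if rel.source == name:
                found.add(rel.target)
            elif rel.target == name:
                found.add(rel.source)
        return sorted(found)


def build_example():
    # Hypothetical sources, for illustration only.
    ds = Dataspace()
    ds.add_participant(Participant("mail_archive", "files"))
    ds.add_participant(Participant("papers_db", "relational"))
    ds.add_participant(Participant("web_pages", "xml"))
    ds.relate("mail_archive", "papers_db", "AttachedTo")
    ds.relate("papers_db", "web_pages", "PublishedIn")
    return ds
```

Calling `build_example().related_to("papers_db")` lists the two neighbouring sources. Keeping relationships as first-class records, rather than folding them into a schema, is one way to reflect the co-existence view of a dataspace.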

QUERY ANSWERING MODEL (CHALLENGES)

Develop a formal model for studying query answering in dataspaces.

Develop an intuitive semantics for answering a query that takes into consideration a sequence of earlier queries leading up to it.

Develop a formal model of information gathering tasks that include a sequence of lower-level operations on a dataspace.

Develop algorithms that, given a keyword query and a large collection of data sources, rank the data sources according to how likely they are to contain the answer.

Develop methods for ranking answers that are obtained from multiple heterogeneous sources (even when semantic mappings are available).
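
To make the source-ranking challenge concrete, here is a deliberately naive sketch that ranks sources by the relative frequency of the query keywords in a text summary of each source. The scoring rule, the function name `rank_sources`, and the idea of a per-source text summary are all assumptions for illustration; they are not a method proposed in the slides.

```python
from collections import Counter


def rank_sources(query, sources):
    """Rank sources by summed relative frequency of the query terms.

    `sources` maps a source name to a text summary of its contents
    (a hypothetical stand-in for whatever a real system would index).
    """
    terms = query.lower().split()
    scored = []
    for name, text in sources.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values()) or 1  # avoid division by zero
        score = sum(counts[term] / total for term in terms)
        scored.append((name, score))
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

For example, with summaries `{"email": "budget meeting budget report", "papers": "dataspace query answering paper"}`, the query `"budget report"` places `email` ahead of `papers`.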

OBTAINING ANSWERS

The most significant challenge to answering queries in dataspaces arises from data heterogeneity.

Hence, to answer queries in a dataspace we need to shift the attention away from semantic mappings.

OBTAINING ANSWERS (CHALLENGES)

Develop methods for answering queries from multiple sources that do not rely solely on applying a set of correct semantic mappings.

Develop techniques for answering queries based on the following ideas, or combinations thereof:

apply several approximate or uncertain mappings and compare the answers obtained by each,

apply keyword search techniques to obtain some data or some constants that can be used in instantiating mappings.

examine previous queries and answers obtained from data sources in the dataspace and try to infer mappings between the data sources.
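
The first idea above, applying several approximate mappings and comparing the answers each one yields, can be sketched as follows. The representation of a mapping as a dictionary from a query attribute to a source column, the numeric weights, and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def answer_with_mappings(records, mappings, attribute):
    """Collect candidate answers for `attribute` under several uncertain mappings.

    `mappings` is a list of (mapping, weight) pairs, where each mapping is a
    dict from query attributes to source columns and `weight` is a
    hypothetical confidence in that mapping.
    """
    support = defaultdict(float)
    for mapping, weight in mappings:
        column = mapping.get(attribute)
        if column is None:
            continue  # this mapping says nothing about the attribute
        for record in records:
            if column in record:
                support[record[column]] += weight
    # Answers backed by more (or more trusted) mappings come first.
    return sorted(support.items(), key=lambda kv: (-kv[1], str(kv[0])))
```

Each answer keeps the total weight of the mappings that produced it, so disagreement between mappings shows up as several ranked candidates rather than a single asserted value.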

OBTAINING ANSWERS (CHALLENGES)

Develop a formal model for approximate semantic mappings and for measuring the accuracy of answers obtained with them.

Given two data sets that use the same terminology but different data models, develop automatic best-effort methods for translating a query over one data set onto the other.

QUESTIONS?????

DATASPACE INTROSPECTION

Lineage, Uncertainty and Inconsistency (LUI)

Projects

Uncertain Database

Inconsistencies in Database

Modeling Data Lineage

LUI Introspection

Finding the Right Answers

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

Lineage, Uncertainty and Inconsistency are highly related to each other.

A DSSP should have a single mechanism that models all three.

Inconsistencies can be modeled as Uncertainty about which data value of several is correct.

Uncertainty and Inconsistency need to be ultimately resolved, and Lineage is often the only way of doing so.

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

PROJECTS

The relationship between uncertainty and lineage has recently formed the foundation for the Trio Project.

The need to manage inconsistency along with lineage is one of the main ideas underlying the Orchestra Project.

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

UNCERTAIN DATABASES

Uncertainty arises in data management applications because the exact state of the world is not known.

The goal of an uncertain database is to represent a set of possible states of the world, typically referred to as possible worlds.

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

UNCERTAIN DATABASES

Several formalisms have been proposed for uncertain databases:

A-Tuple

X-Tuple

C-Tables

A-TUPLE

An a-tuple differs from an ordinary tuple in two ways.

First, instead of having a single value for an attribute, it may have several values.

Second, the tuple may be a maybe-tuple.

As an example, consider the following two tuples:

(Karina Powers, {345-9934, 345-9935})

(George Flowers, 674-9912)
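
A small sketch can show how a-tuples expand into possible worlds. The encoding below (a Python set for a multi-valued attribute, a boolean flag for a maybe-tuple) is an assumption for illustration, and treating the George Flowers tuple as a maybe-tuple is likewise an assumption made only to exercise both features.

```python
from itertools import product


def atuple_instances(values, maybe=False):
    """All ordinary tuples an a-tuple may stand for; None means 'absent'."""
    # A plain value is treated as a one-element set of alternatives.
    options = [v if isinstance(v, (set, frozenset)) else {v} for v in values]
    instances = [tuple(choice) for choice in product(*[sorted(o) for o in options])]
    if maybe:
        instances.append(None)  # the tuple might not be in the database at all
    return instances


def atuple_possible_worlds(atuples):
    """Enumerate the possible worlds of a relation given as (values, maybe) pairs."""
    per_tuple = [atuple_instances(values, maybe) for values, maybe in atuples]
    worlds = set()
    for combo in product(*per_tuple):
        worlds.add(frozenset(t for t in combo if t is not None))
    return worlds


relation = [
    (("Karina Powers", {"345-9934", "345-9935"}), False),
    (("George Flowers", "674-9912"), True),  # assumed maybe-tuple
]
atuple_worlds = atuple_possible_worlds(relation)
```

Under these assumptions the relation has four possible worlds: two choices for Karina's number, each with or without the George Flowers tuple.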

X-TUPLE

An x-tuple is simply a set of ordinary tuples, meant to describe different possible states; x-tuples too can be marked as maybe-tuples.

Consider the following x-tuple, where the second column is the person's work phone and the third is the work fax:

(Karina Powers, 345-9934, 345-9935)

(Karina Powers, 345-9935, 345-9934)

The x-tuple represents the fact that we're not sure which number is the work phone and which is the fax, and represents two possible states of the world.

While x-tuples are more powerful than a-tuples, they are still not closed under relational operators.
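
The difference from a-tuples can be seen by enumerating worlds. In the sketch below an x-tuple is encoded as a list of alternative ordinary tuples plus a maybe flag; this encoding and the function name are assumptions for illustration. Exactly one alternative is chosen per x-tuple, so the phone and fax values stay correlated.

```python
from itertools import product


def xtuple_possible_worlds(xtuples):
    """Possible worlds of a relation given as (alternatives, maybe) pairs."""
    per_xtuple = []
    for alternatives, maybe in xtuples:
        options = list(alternatives)
        if maybe:
            options.append(None)  # the x-tuple may contribute nothing
        per_xtuple.append(options)
    return {
        frozenset(t for t in combo if t is not None)
        for combo in product(*per_xtuple)
    }


karina = [
    ("Karina Powers", "345-9934", "345-9935"),
    ("Karina Powers", "345-9935", "345-9934"),
]
xtuple_worlds = xtuple_possible_worlds([(karina, False)])
```

The x-tuple yields exactly two worlds. An a-tuple with the set {345-9934, 345-9935} in both columns would instead allow four combinations, including the two in which phone and fax are the same number.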

C-TABLES

There are two worlds, depending on the value of x.

If it is 1, then only the first tuple is in the database.

If x is not 1, then the second and third tuples are in.

Hence, we can model a constraint saying that if a particular tuple is not in the database, then two other ones must be.

Example
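
As a sketch, a c-table consistent with the description above can be written as rows paired with a condition on the variable x. The concrete rows, the domain chosen for x, and the function name are hypothetical; only the conditions (x = 1 for the first tuple, x ≠ 1 for the other two) follow the text.

```python
def ctable_possible_worlds(ctable, domain):
    """Evaluate each row's condition for every value of x in `domain`."""
    worlds = set()
    for x in domain:
        worlds.add(frozenset(row for row, condition in ctable if condition(x)))
    return worlds


# Hypothetical rows; each is paired with a condition on x.
ctable = [
    (("Karina Powers", "345-9934"), lambda x: x == 1),
    (("Karina Powers", "345-9935"), lambda x: x != 1),
    (("George Flowers", "674-9912"), lambda x: x != 1),
]
ctable_worlds = ctable_possible_worlds(ctable, domain=[1, 2, 3])
```

Although x ranges over three values here, only two distinct worlds result, because every value other than 1 selects the same pair of rows.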

C-TABLES

C-tables are closed under relational operators.

C-tables can be shown to be complete. That is, given any set of possible worlds S of a schema R, there exists a database of c-tables D_S whose possible worlds are precisely S.

The disadvantages of c-tables:

They are a bit harder for a user to understand.

Checking whether a set of tuples I is a possible world of D_S is known to be NP-complete.

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

INCONSISTENCIES IN DATABASES

Inconsistent databases are meant to handle situations in which the database contains conflicting data.

The most common type of inconsistency is disagreement on single-valued attributes of a tuple. E.g., a database storing the salary of an employee may have two different values for the salary, each coming from a different source.
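
The salary example suggests how inconsistency can be treated as uncertainty over alternative values, with lineage (here simply the originating source) kept alongside each value so the conflict can later be resolved. The sketch below is one possible encoding under that assumption; the source names, salary figures, and the trusted-source resolution rule are hypothetical.

```python
from collections import defaultdict


def alternatives_by_key(facts):
    """Group (key, value, source) facts into {key: {value: [sources]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for key, value, source in facts:
        grouped[key][value].append(source)
    return {key: dict(values) for key, values in grouped.items()}


def inconsistent_keys(alternatives):
    """Keys whose single-valued attribute has more than one candidate value."""
    return sorted(key for key, values in alternatives.items() if len(values) > 1)


def resolve(alternatives, trusted_sources):
    """Pick, per key, the value reported by the most trusted source available."""
    resolved = {}
    for key, values in alternatives.items():
        for source in trusted_sources:
            matches = [v for v, sources in values.items() if source in sources]
            if matches:
                resolved[key] = matches[0]
                break
    return resolved


facts = [
    ("alice", 52000, "hr_db"),
    ("alice", 55000, "payroll_export"),
    ("bob", 48000, "hr_db"),
]
salary_alternatives = alternatives_by_key(facts)
```

Here `inconsistent_keys(salary_alternatives)` reports only `alice`, and `resolve(salary_alternatives, ["payroll_export", "hr_db"])` uses the recorded sources to choose between her two salary values.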

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

MODELLING DATA LINEAGE

The lineage of a tuple explains how the tuple was derived to be a member of a particular set.

Internal Lineage

External Lineage

LINEAGE, UNCERTAINTY AND INCONSISTENCY (LUI)

LUI INTROSPECTION

A DSSP should provide a single unified mechanism for modeling uncertainty, inconsistency and lineage. Broadly, the challenge is the following:

Develop formalisms that enable modeling uncertainty, inconsistency and lineage in a unified fashion.

Develop formalisms that capture uncertainty about common forms of inconsistency in databases.

Develop formalisms for representing and reasoning about external lineage.

Develop a general technique to extend any uncertainty formalism with lineage, and study the representational and computational advantages of doing so.

Develop formalisms where uncertainty can be attached to tuples in views and view uncertainty can be used to derive uncertainty of other view tuples.

FINDING THE RIGHT ANSWERS

The ability to introspect about data and query answers raises the next natural question: what are good answers to a query? Answers can be judged along several dimensions:

relevance to the query,

certainty of the answer (or whether it contradicts another),

completeness and precision requested by the user,

maximum latency required in answering the query.
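
One way to picture how these dimensions might interact is a selection step that first enforces a latency budget and a certainty floor, then orders the remaining answers by relevance weighted by certainty. The scoring formula, field names, and thresholds below are assumptions for illustration; the slides do not prescribe any particular metric.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    value: str
    relevance: float  # hypothetical score in [0, 1]
    certainty: float  # hypothetical score in [0, 1]
    latency_ms: int   # time needed to obtain this answer


def select_answers(answers, max_latency_ms, min_certainty=0.0):
    """Keep answers within the latency and certainty limits, best first."""
    eligible = [
        a for a in answers
        if a.latency_ms <= max_latency_ms and a.certainty >= min_certainty
    ]
    # Assumed quality measure: relevance weighted by certainty.
    return sorted(eligible, key=lambda a: (-(a.relevance * a.certainty), a.value))
```

A highly relevant but slow answer is dropped when it exceeds the latency budget, which mirrors the idea that the user's requirements, not only the data, determine what counts as a good answer.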

FINDING THE RIGHT ANSWERS (CHALLENGES)

Define metrics for comparing the quality of answers and answer sets over dataspaces, and efficient query processing techniques.

Develop query-language extensions and their corresponding semantics that enable specifying preferences on answer sets along the dimensions of completeness and precision, certainty and inconsistency, lineage preferences and latency.

Define notions of query containment that take into consideration completeness and precision, uncertainty and inconsistency and lineage of answers, and efficient algorithms for computing containment.

Develop methods for efficient processing of queries over uncertain and inconsistent data that conserve the external and internal lineage of the answers. Study whether existing query processors can be leveraged for this goal.

What does LUI stand for?

1. Look, Use and Interpret

2. Linear, Uncertain and Incomplete

3. Lineage, Uncertainty and Inconsistency

4. None of the above

TECHNICAL OUTLINE

Query

Evolve

Reflect

Reuse human attention

REUSING HUMAN ATTENTION

Principle:

  User action = statement of semantic relationship

  Leverage actions to infer other semantic relationships

Examples

  Providing a semantic mapping

    Infer other mappings

  Writing a query

    Infer content of sources, relationships between sources

  Creating a “digital workspace”

    Infer “relatedness” of documents/sources

    Infer co-reference between objects in the dataspace

  Annotating, cutting & pasting, browsing among docs

REUSING HUMAN ATTENTION (CHALLENGES)

Develop methods that capture users' activities when interacting with a dataspace and analyze these activities to create additional meaningful relationships between sources in a dataspace.

Develop techniques that examine collections of queries over data sources and their results to build new mappings between disparate data sources.

Develop algorithms for grouping actions on a dataspace into tasks.
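
To illustrate what grouping actions into tasks could look like in its simplest form, the sketch below splits a log of timestamped user actions wherever the gap between consecutive actions exceeds a threshold. This time-gap heuristic, the five-minute default, and the action descriptions are assumptions for illustration; the challenge as stated leaves the actual algorithm open.

```python
def group_into_tasks(actions, max_gap_seconds=300):
    """Split (timestamp_seconds, description) actions into tasks by time gaps."""
    tasks = []
    for timestamp, description in sorted(actions):
        last_timestamp = tasks[-1][-1][0] if tasks else None
        if last_timestamp is not None and timestamp - last_timestamp <= max_gap_seconds:
            tasks[-1].append((timestamp, description))  # continue current task
        else:
            tasks.append([(timestamp, description)])  # start a new task
    return tasks
```

Sources touched within the same inferred task could then be treated as candidates for a relationship, which ties the grouping back to the first challenge on this slide.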

REUSING HUMAN ATTENTION (CHALLENGES)

Develop methods that examine the known semantic relationships in a dataspace and generate a few select questions that can be posed to a user and whose answer would improve semantic integration most significantly.

Develop a formal framework for learning from human attention in dataspaces.

  A definition of the specific learning problem.

  We need a formalism for describing approximate semantic mappings and distances between semantic mappings.

  We need to spell out the space of possible mappings the learning problem will consider.

  Finally, we need a way of interpreting the training examples.

CONCLUSION AND OUTLOOK

Data management moving to consumers

Dataspaces: key element in this agenda

  Pay as you go data management

  Reuse human attention

The role of theory:

  Reflect, generalize and explain

People, people, people

The major challenge in developing the framework for learning from human attention is ________?

1. A clear definition

2. Need for formalism

3. Spell out all possible mappings

4. Interpret the examples

5. All of the above

QUESTIONS?????

Thank You….