
Int. J. Signal and Imaging Systems Engineering, Vol. 0, No. 00, 2004 1

Organic Databases

H. V. Jagadish
Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA
E-mail: [email protected]

Li Qian
Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA
E-mail: [email protected]

Arnab Nandi
Department of Computer Science and Engineering, Ohio State University, Columbus, OH, USA
E-mail: [email protected]

Abstract: Databases today are carefully engineered: there is an expensive and deliberate design process, after which a database schema is defined; during this design process, various possible instance examples and use cases are hypothesized and carefully analyzed; finally, the schema is ready and can then be populated with data. All of this effort is a major barrier to database adoption.

In this paper, we explore the possibility of organic database creation instead of the traditional engineered approach. The idea is to let the user start storing data in a database with a schema that is just enough to cover the instances at hand. We then support efficient schema evolution as new data instances arrive. By designing the database to evolve, we can sidestep the expensive front-end cost of carefully engineering the design of the database.

Indeed, the deliberate design model complicates not only database creation, but also database transformation (i.e., schema mapping), because traditional schema mapping tools rely on carefully engineered declarative specifications hidden beneath complex user interfaces. In this paper, we also study the issue of organic database transformation, which automatically induces schema mappings from sample target database instances.

1 Motivation

Database technology has made great strides in the past decades. Today, we are able to process efficiently ever larger numbers of ever more complex queries on ever more humongous data sets. We can be justifiably proud of what we have accomplished.

However, when we see how information is created, accessed, and shared today, database technology remains only a bit player: much of the data in the world today remains outside database systems. Even worse, in the places where database systems are used extensively, we find an army of database administrators, consultants, and other technical experts all busily helping users get

Copyright © 2009 Inderscience Enterprises Ltd.


data into and out of a database. For almost all organizations, the indirect cost of maintaining a technical support team far exceeds the direct cost of hardware infrastructure and database product licenses. Not only are support staff expensive, they also interpose themselves between the users and the databases. Users cannot interact with the database directly and are therefore less likely to try less straightforward operations. This hidden opportunity cost may be greater than the visible costs of hardware/software and technical staff. Most of us remember the day not too long ago when booking a flight meant calling a travel agent who used magic incantations at an arcane system to pull up information regarding flights and to make bookings. Today, most of us book our own flights on the web through interfaces that are simple enough for anyone to use. Many enjoy the power of being able to explore options for themselves that would have been too much trouble to explain to an agent, such as willingness to trade off price against convenience of a flight connection.

Search engines have done a remarkable job at directly connecting users with the web. Users can publish documents of any form on the Web. For a keyword query, the user is pointed to a set of documents that are most likely to be relevant. This best-effort nature can lead to possibly inaccurate results, but it gives users the ability to easily and efficiently get information into and out of the ever-changing Web.

In contrast, the database world has had the heritage of constructing rigid, precisely defined, carefully planned, explicitly engineered silos of information based on predictions regarding data and queries. It was assumed that information would be clean, rigid, and well structured. This has led to databases today being hard to design, hard to modify, and hard to query.

When we look at the characteristics of search, we find that very little prediction and planning burden is placed on users – neither to query nor to publish data. Furthermore, precision, while desirable, is not required. In contrast, users interacting with databases find themselves fighting an uphill battle with the constant flux of the data they deal with in today’s highly connected world.

The emergence of “big data” in enterprise settings has also presented a unique set of critical data management challenges. Due to the sheer size, data is typically stored in large, unindexed

data warehouses running on large clusters. Data is curated in a highly collaborative manner using data pipelines built by hundreds, if not thousands, of engineers. Data manipulations are not restricted to simple database querying. They involve tasks such as information extraction and the building of statistical models. Datasets range from completely unstructured to fully structured, and represent a wide variety of data models. While existing database systems pay a lot of attention to aspects such as query execution for simple database queries, the ability to deal with this new kind of data paradigm is still extremely primitive. For example, practical tasks such as “Which IP-address-to-zip-code table is most accurate to use while building a classifier for my user location data?” are encountered regularly and are currently solved by trial and error, along with duplication of engineering effort and cluster usage. Clearly, there are a host of data management issues, ranging from schema, workflow, and provenance management to efficient indexing of heterogeneous structured data. These issues permeate all types of enterprise-class data, be it scientific or web-centric data management. With massive changes in the scale, size, and complexity of both data and its use, a wide variety of research problems emerge.

Our goal in this paper is to render database interaction lenient in its demands for prediction, planning, and precision. We call this organic, to distinguish it from the carefully designed and engineered “synthetic” database and query systems used today. The result of an organic query may not be as perfect as the result of an engineered query, but it has the benefit of not requiring precision and planning, and hence is more “natural” for most users. To be able to develop such an organic system, let us first study the precision and planning challenges that users face as they interact with databases.

2 Challenges

2.1 Structure Specification Challenge

Precise specification is challenging for users interacting with a database. Consider an airline database with a basic schema shown in Figure 1, for tracking planes and flights. The data encapsulated is starting location, destination, plane


information, and times — essentially what every passenger thinks of as a flight. Yet, in our normalized relational representation, this single concept is recorded across four different tables. Such splattering of data decreases the usability of the database in terms of schema comprehension, join computation, and query expression.

First, given the large number of tables in a database, often with poorly named entities, it is usually not easy to understand how to locate a particular piece of data. Even in a toy schema such as Figure 1, there is the possibility of trouble. Obviously, the airports table has information about the starting location and the destination. To find what is used by a particular flight, we have to bring up the schema and follow the foreign key constraint, or trace the database creation statements. Neither solution is user-friendly, and thus the current solution is often to leave the task to DBAs.

The next problem users face is computing the joins. We break apart information during the database design phase such that everything is normalized — space-efficient and amenable to updates. However, the users will have to stitch the information back together to answer most real queries. The fundamental issue is that the normalized decomposition severs the connections between information pertaining to the same real-world entities, and joins must restore them. Query specification is consequently non-intuitive to most ordinary users. But even the design is brittle. What if a single flight has multiple flight numbers on account of code sharing? What about special flights not on a weekly schedule? There are any number of such unanticipated possibilities that could render a carefully designed structure inadequate instantly.
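To make the cost concrete, here is a minimal sketch of the multi-way join a user must write just to reassemble one “flight”. It uses SQLite; the table and column names follow Figure 1, while the from_airport/to_airport foreign keys and all data values are our own illustrative additions.

```python
import sqlite3

# Build a toy instance of the Figure 1 schema (names are illustrative DDL,
# not the paper's exact statements).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE airports (id INTEGER PRIMARY KEY, city_name TEXT, airport_name TEXT);
CREATE TABLE airplane (id INTEGER PRIMARY KEY, type TEXT, serial_number TEXT);
CREATE TABLE schedule (id INTEGER PRIMARY KEY, day_of_week TEXT,
                       departure_time TEXT, arrival_time TEXT);
CREATE TABLE flight_info (id INTEGER PRIMARY KEY, flight_number TEXT,
                          airplane_id INTEGER REFERENCES airplane(id),
                          schedule_id INTEGER REFERENCES schedule(id),
                          from_airport INTEGER REFERENCES airports(id),
                          to_airport INTEGER REFERENCES airports(id));
INSERT INTO airports VALUES (1, 'Detroit', 'DTW'), (2, 'Los Angeles', 'LAX');
INSERT INTO airplane VALUES (1, 'A320', 'SN-001');
INSERT INTO schedule VALUES (1, 'Mon', '08:00', '10:30');
INSERT INTO flight_info VALUES (1, 'NW123', 1, 1, 1, 2);
""")

# Reconstructing the single concept "flight" already takes a five-way join.
row = con.execute("""
SELECT f.flight_number, src.airport_name, dst.airport_name,
       a.type, s.departure_time
FROM flight_info f
JOIN airports src ON f.from_airport = src.id
JOIN airports dst ON f.to_airport   = dst.id
JOIN airplane a   ON f.airplane_id  = a.id
JOIN schedule s   ON f.schedule_id  = s.id
""").fetchone()
print(row)  # ('NW123', 'DTW', 'LAX', 'A320', '08:00')
```

Every one of those join conditions is knowledge the user must carry in her head before she can ask the simplest question.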

2.2 Remote Specification Challenge

Querying in its current form requires prediction on the part of the user. In our airline database example, consider the specification of a three-letter airport code. Some interfaces provide a drop-down list of all the cities that the airline flies into. For an airline of any size, this list can have hundreds of entries, most of which are not relevant to the user. The fact that it is alphabetized may not help — there may be multiple airports for some major cities, the airport may be named for a neighboring city, and so on.

A better interface allows a user to enter the name of the place they want to get to, and then

looks for close matches. This cannot be a simple string comparison — we need Narita airport to be suggested no matter whether the user entered Narita or Tokyo or even Tokyu. This does not seem too hard, and some airline web sites will do this. But now consider a user who wants to visit Aizu. No airline search interface today, to our knowledge, can suggest flying into Narita airport in response to a search for Aizu airport, even though that is likely to be the preferred solution for most travelers.

On account of the difficulty of prediction, it is often the case that the user does not initially specify the query correctly. The user then has to revise her query and resubmit if it did not return the desired results. However, essentially all query languages, including visual query builders, separate query specification from output.

Our goal is to enable users to query a database in a WYSIWYG (What You See Is What You Get) fashion. Consider the display of a world map. The user could zoom into the area of interest and select airports geographically from the choices presented. Most map databases today provide excellent direct manipulation capabilities, including pan, zoom, and so on. Imagine a map database without these facilities that requires users to specify, through a text selection of zip code or latitude/longitude, the portion of the map that is of interest each time. We would find it terribly frustrating. Unfortunately, most database query interfaces today are not WYSIWYG and can be compared to this hypothetical frustrating map query interface.

What does WYSIWYG mean for databases? After all, the point of specifying a query is to get information that the user does not possess. Even search engines are not WYSIWYG. A WYSIWYG interface for selection specification and data results involves a constant predictive capability on the part of the system. For example, instantaneous-response interfaces [58] allow users to gain insights into the schema and the data during query time, which allows the user to continuously refine the query as they are typing the initial query. By the time the user has typed out the entire query, the query has been correctly formulated and the results have returned. Furthermore, if the user then wishes to modify the query, this should be possible by direct manipulation of the result set rather than an ab initio restatement of the query.
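The keystroke-by-keystroke behavior described above can be caricatured in a few lines. The alias table and the prefix-matching rule below are purely illustrative stand-ins, not the actual interface of [58]:

```python
# As each keystroke arrives, the system re-ranks candidates instead of
# waiting for a complete, precisely spelled query. Alias data is hypothetical.
AIRPORTS = {
    "NRT": ["narita", "tokyo", "tokyu"],   # common misspelling included
    "DTW": ["detroit", "wayne county"],
    "LAX": ["los angeles"],
}

def suggest(partial_input: str):
    """Return airport codes any of whose aliases match the partial input."""
    q = partial_input.lower()
    return [code for code, aliases in AIRPORTS.items()
            if any(alias.startswith(q) for alias in aliases)]

# The user need not finish (or correctly spell) the query:
print(suggest("tok"))   # ['NRT']
print(suggest("toky"))  # ['NRT']  -- matches both 'tokyo' and the typo 'tokyu'
```

A real system would use fuzzy matching and geographic proximity (to handle the Aizu-to-Narita case) rather than simple prefixes, but the interaction pattern is the same.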


[Figure 1: schema diagram. The four base tables are airplane(id, type, serial_number); schedule(id, day_of_week, departure_time, arrival_time); flight_info(id, flight_number, airplane_id, schedule_id, date); and airports(id, city_name, airport_name).]

Figure 1 The base tables needed to store a “flight”. A flight combines starting location, destination, airplane info, and schedule, yet consists of at least four tables. Note that an actual schema for such data is likely to involve many more attributes and tables.

2.3 Schema Evolution Challenge

While database systems have fully established themselves in the corporate market, they have not made a large impact on how users organize their everyday information. Many users would like to put into their databases [8] information such as shopping lists, expense reports, etc. The main reason for this is that creating a database is not easy.

Database systems require that the schema be specified in advance, and then populated with data. This burdens the user with developing an abstract design of the schema – without any concrete data – a task that we computer scientists are trained to do, but most others find very difficult. Furthermore, careful planning is required, as users are expected to predict what data they will need to store in the future and what queries they may ask, and to use these predictions to develop a suitable schema.

Example 2.1 Consider a user, Jane, who started to keep track of her shopping lists. The first list she created simply contained a list of items and quantities of each to be purchased. After the first shopping trip, Jane realized that she needed to add price information to the list to monitor her expenses, and she also started marking items that were not in stock at the store. A week before Thanksgiving, Jane created another shopping list. However, this time, the items were gifts to her friends,

and information about the friends therefore needed to be added to create this “gift list.” A week after Christmas, Jane started to create another “gift list” to track gifts she received from her friends. However, the friends information was now about friends giving her gifts. In the end, what started as a simple list of items for Jane had become a repository of items, stores, and, more importantly, friends — an important part of Jane’s life.

The above example, although simple, illustrates how an everyday database evolves and the many usability challenges facing a database system. First, users do not have a clear knowledge of what the final structure of the database will be, and therefore a comprehensive design of the database is impossible at the beginning. For example, Jane did not know that she needed to keep track of information about her friends until the time had come to buy gifts for them. Second, the structure of the database grows as more information becomes available. For example, the information about price and out-of-stock items only became available after the shopping trip. Finally, information structures may be heterogeneous. For example, the two “gift lists” that Jane created had different semantics in their friends information, and the database needs to gracefully handle this heterogeneity.

In summary, for everyday data, the structure grows incrementally, and a database system must provide interfaces for users to easily create both


unstructured and structured information and to fluidly manipulate the structure when necessary.

2.4 Schema Mapping Challenge

Schema mapping has long been one of the most important problems in industry. Moreover, as the amount of structured Web-based information explodes, users are directly exposed to the task of combining, structuring, and re-purposing information [15]. Unfortunately, schema mapping remains extremely complicated even with state-of-the-art approaches.

Nowadays, schema mapping requires a substantial amount of careful planning in advance. This planning is usually based on complicated data transformation specification languages such as datalog and source-to-target tgds [28], which are difficult for normal users to understand. Furthermore, planning the mappings with modern tools requires declarative precision, which the users may not possess before they fully interpret the semantics of the mappings they are trying to construct.
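For concreteness, a source-to-target tgd over schemas like those in Figure 2 (relation and variable names illustrative) might read:

\[
\forall m, p, t, n \; \big( \mathrm{Movie}(m, t) \wedge \mathrm{Director}(p, m) \wedge \mathrm{Person}(p, n) \rightarrow \mathrm{MyMovieInfo}(t, n) \big)
\]

Even this small formula presupposes that the user already knows the join path through the Director relation — precisely the kind of planning burden discussed above.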

Example 2.2 When exploring the Yahoo Movies database, a user wishes to store the movie title as Name and the director name as Director in a target MyMovieInfo relation, as shown in Figure 2. Given the fact that most users are unable or unwilling to specify the mappings in complex mapping languages, modern mapping tools usually offer a mapping interface as shown in Figure 3. However, users still have to precisely plan for two essential mapping components.

First, similar to the data location challenge described before, it can be difficult to search the source schema and locate the specific attribute that is being mapped to the corresponding target attribute. In a movie database, there are typically dozens of relations with hundreds of attributes. Worse, end-users usually have no access to foreign key constraints and/or database creation statements. In such a case, picking the right corresponding attribute can be a major pain.

Even if the set of correspondences is all correctly and precisely established, the user must face the structural specification challenge. In other words, how are these source attributes joined and projected to the target? Again, joins prevent the user from linking the desired concrete target concept to the normalized stored information, and thus need to be specified precisely. What if there are hidden intermediate join relations?

What are the join attributes? All these planning questions are left to users, who should not need to know the answers.

3 Proposed Solution

3.1 Presentation Data Model

We propose the use of a presentation data model [37], as a full-fledged layer above the physical and logical layers in the database. Just as the logical layer provides data abstraction and saves the user from having to worry about physical data aspects such as data structures, indices, access methods, etc., the presentation layer saves the user from having to worry about logical data aspects such as relational structure, keys, joins, constraints, etc. To do this, the presentation layer should be able to represent data in a form most suited for the user to easily comprehend, manipulate, and query.

3.2 Addressing the Structure Specification Challenge

We address the structure specification challenge through the qunit search paradigm [59], where the database is translated into a collection of independent qunits, which can be treated as documents for standard IR-like document retrieval. A qunit is the basic, independent semantic unit of information in a database. It represents a quantified unit of information in response to a user’s query. The database search problem then becomes one of choosing the most appropriate qunit(s) to return, in ranked order. Users only have to input keywords, which is much simpler than navigating a complex database schema and specifying a structured query. In other words, the precision burden is lifted from the user. Consider the flight example in Figure 1. A qunit “flight” can be defined to represent the complete information of what a passenger thinks of as a flight. The qunit includes starting location, destination, plane, and time of travel. This completely relieves users from having to manually perform joins among all the tables. As a user inputs a search criterion, for example “from DTW to LAX, Jan. 2010”, qunits are ranked based on the input and the best matches are presented to the user.
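As a sketch of the retrieval side, one can treat each qunit instance as a bag of words and rank by overlap with the keyword query. The instances and the crude scoring function below are illustrative stand-ins for the actual ranking machinery of [59]:

```python
# Each qunit instance is a pre-joined "document"; keyword search reduces
# to IR-style ranking over these documents. Data is hypothetical.
qunit_instances = [
    {"type": "flight", "text": "DTW LAX Jan 2010 A320 08:00"},
    {"type": "flight", "text": "DTW NRT Feb 2010 B777 11:15"},
    {"type": "airport", "text": "DTW Detroit Wayne County"},
]

def rank(query: str, instances):
    """Rank qunit instances by bag-of-words overlap with the query."""
    terms = set(query.lower().split())
    scored = []
    for inst in instances:
        doc = set(inst["text"].lower().split())
        scored.append((len(terms & doc), inst))   # crude overlap score
    scored.sort(key=lambda s: -s[0])
    return [inst for score, inst in scored if score > 0]

best = rank("from DTW to LAX Jan 2010", qunit_instances)[0]
print(best["text"])  # 'DTW LAX Jan 2010 A320 08:00'
```

Note that the user never mentions tables or joins: the joins were paid once, when the qunit instances were materialized.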


[Figure 2: schema-mapping diagram. Source schema: Movie(mid PK, title), Person(pid PK, name), Director(pid FK, mid FK), Writer(pid FK, mid FK). Target schema: MyMovieInfo(Name, Director), with correspondences drawn from source attributes to target attributes.]

Figure 2 An Example Schema Mapping with the Question Mark Indicating a Join Path Ambiguity.

Figure 3 A Screenshot of IBM InfoSphere Data Architect.


[Figure 4 (a): a simplified database schema with entities person (name, birthdate, gender), cast (role, level), and movie (title, releasedate, rating, genre, info). Figure 4 (b): qunit search on IMDb — the typed keyword query “star wars cast” is translated into a qunit query, matched against qunit definitions, and answered with qunit instances drawn from the database.]

Figure 4 Qunit Example: (a) a simplified database schema; (b) qunit search on IMDb.

We now explain the definition of qunits over a database, and how to search based on qunits. We use a slightly more complex IMDb movie database in order to explain more effectively. Figure 4 (a) shows a simplified example schema of a movie database, which contains entities such as movie, cast, person, etc. Qunits are defined over this database corresponding to various information needs. For example, we can define a qunit “cast” as the people associated with a movie. Meanwhile, rather than having the name of the movie repeated with each tuple, we may prefer to have a nested presentation with the movie title on top and one tuple for each cast member. The base data in IMDb is relational, and against its schema, we would write the base expression in SQL with the conversion expression in XSL-like markup as follows:

SELECT * FROM person, cast, movie

WHERE cast.movie_id = movie.id AND

cast.person_id = person.id AND

movie.title = "$x"

RETURN

<cast movie="$x">

<foreach:tuple>

<person>$person.name</person>

</foreach:tuple>

</cast>

The combination of these two expressions forms our qunit definition. On applying this definition to a database, we derive qunit instances, one per movie.

To search based on qunits, consider the user query, star wars cast, as shown in Figure 4 (b). Queries are first processed to identify entities using standard query segmentation techniques [75]. In our case, one high-ranking segmentation is “[movie.name] [cast]”, and this has a very high overlap with the qunit definition that involves a join between “movie.name” and “cast”. Now, standard IR techniques can be used to evaluate this query against qunit instances of the identified type, each considered independently even if they contain elements in common. The qunit instance describing the cast of the movie Star Wars is chosen as the appropriate result.
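The segmentation step can be illustrated with a toy splitter that tries each break point, treating the prefix as a data value (matched against movie titles) and the suffix as a schema term (matched against qunit names). The vocabularies are hypothetical, and real segmenters as in [75] score many candidate segmentations probabilistically rather than taking the first match:

```python
# Toy query segmentation: split "star wars cast" into a data value and
# a schema term. Vocabulary is illustrative.
MOVIE_TITLES = {"star wars", "star trek"}
QUNIT_NAMES = {"cast", "quotes", "genre"}

def segment(query: str):
    tokens = query.lower().split()
    # Try every split point, longest value first: prefix as a value,
    # suffix as a schema term.
    for i in range(len(tokens), 0, -1):
        value, schema = " ".join(tokens[:i]), " ".join(tokens[i:])
        if value in MOVIE_TITLES and (not schema or schema in QUNIT_NAMES):
            return ("movie.title", value), ("qunit", schema or None)
    return None

print(segment("star wars cast"))
# (('movie.title', 'star wars'), ('qunit', 'cast'))
```

The resulting pair — a value binding plus a qunit name — is exactly what is needed to select and rank qunit instances of the right type.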

In current models of keyword search in databases, several heuristics are applied to leverage the database structure to construct a result on the fly. These heuristics are often based on the assumption that the structure within the database reflects the semantics assumed by the user (though data/link cardinality is not necessarily evidence of importance), and that all structure is actually relevant towards ranking (though internal id fields are never really meant for search).


A significant sub-challenge is the automated derivation of qunit definitions themselves. In addition to clustering-based techniques [79] that leverage the data and structure of the database, there is also a wealth of useful information that exists in the form of external evidence. External evidence can be in the form of existing “reports” – published results of queries to the database, or relevant web pages that present parts of the data. Such evidence is common in settings where such reports and web pages may be published and exchanged but the queries to the data are not published. Information extraction techniques such as wrapper induction [48] allow us to extract templates by considering each piece of evidence as a qunit instance, which can then be assembled into qunit definitions by mapping them to the database.

3.3 Addressing the Schema Evolution Challenge

In this section we address the schema evolution challenge (Sec. 2.3) by proposing a technique for drag-and-drop modification of data schemas in the spreadsheet-like presentation model, enabling organic evolution of a schema and lifting the planning burden from the user. Consider the example of Jane’s shopping list again. Figure 5 shows how Jane can organically grow the schema of the shopping list table. Initially, she has only columns for items to shop for (Figure 5 (a)). She later tries to add information about the friends to whom the gifts will be given, for instance, by adding a “name” column in “Shopping List”. But now Peter, a close friend of Jane, appears twice, since both the Xbox and the iPod will be given to him. As a result, Jane may think it makes more sense to group the gifts by person. Jane can do this by dragging the header of the name column and dropping it on the lower edge of the “Shopping List” (Figure 5 (b)). This moves the name attribute up a level; the rest of the columns form a sub-relation “Gift” (shown in Figure 5 (c)). Now Jane can feel free to add new information, such as an “address” attribute, for her friends without worrying that this information would be duplicated (Figure 5 (d)). This process shows how effortless it is for Jane to grow the table about shopping items to include information about friends and structure the table as she desires.

Next, we briefly outline the challenges in building a system such as this, and our plans to tackle these challenges.

Specification: Specifying a schema update as in Figure 5 is challenging using existing tools. For example, using conventional spreadsheet software, it is impossible to arrive at a hierarchical schema as shown in Figure 5 (d). To specify the schema update, one has to split the table manually. Alternatively, using a relational DBMS, one has to set up the cross-table relationship, which is not easy for end-users, even with support from GUI tools.

We show how to use a presentation layer to address the specification challenge. We design the presentation layer based on a next-generation spreadsheet, and it supports easy schema creation and modification through a simple drag-and-drop interface. We call such a spreadsheet a span table because it is presented in such a way that both table headers and data fields can span multiple cells. The presentation supports four key operations: move an attribute to be part of a sub-relation (e.g., we can move the “Name” column back to “Gift” in Figure 5 (d)); move an attribute out of a sub-relation (the converse of the previous one); create an intermediate sub-relation by moving an attribute up one level (e.g., Jane moves “Name” out to create a new sub-relation under “Shopping List” as in Figure 5 (b)); or move it down a level (e.g., moving “In Stock” down deepens it by inserting a new intermediate level, with only “In Stock” in it; Jane can later add new columns, such as “Date”, to indicate the timestamp of the stocking information).
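The “move an attribute up one level” operator is, in essence, a grouping: the promoted attribute becomes the outer key and the remaining columns form the sub-relation. A minimal sketch, with column names following the Jane example and data values invented for illustration:

```python
# Nest a flat relation by promoting one attribute; the remaining columns
# become a sub-relation ("Gift") per friend.
flat = [
    {"Name": "Peter", "Item": "Xbox", "Qty": 1},
    {"Name": "Peter", "Item": "iPod", "Qty": 1},
    {"Name": "Mary",  "Item": "Scarf", "Qty": 2},
]

def nest(rows, key):
    """Promote `key` up a level; group the remaining columns under it."""
    nested = {}
    for row in rows:
        rest = {k: v for k, v in row.items() if k != key}
        nested.setdefault(row[key], []).append(rest)
    return nested

grouped = nest(flat, "Name")
print(grouped["Peter"])
# [{'Item': 'Xbox', 'Qty': 1}, {'Item': 'iPod', 'Qty': 1}]
```

Peter now appears once, with his two gifts nested under him, which is exactly the restructuring Jane performed with a single drag-and-drop.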

Data Migration: Once a new schema is specified, there is still the critical task of migrating existing data to the new schema. Because the schema structure is changed, one has to introduce a complex mapping in order to “fit” the old data into the new schema. Even if spreadsheet software supporting a hierarchical schema is provided, the user may still have to manually copy data in a cell-by-cell manner to perform such a mapping, which is extremely time-consuming and error-prone.

We address this challenge with an algebraic layer. Directly below the presentation layer, the algebraic layer must translate drag-and-drops into operations that modify the basic structure of the span table. For this purpose, we have proposed a novel span table algebra consisting of three sets of operations. The first set, schema restructuring operators, corresponds to the four aforementioned operations in the presentation layer. We also have a second set of schema modification operators for


(a) Initial Shopping List (b) Moving Name Column

(c) After Moving Name Column (d) Adding Address Column

Figure 5 Organic Schema Evolution

adding/dropping columns in any sub-relation. Finally, there is a set of data manipulation operators (insert, delete, and update), which extends traditional data editing to our hierarchical presentation. This algebraic layer completely automates the data migration as soon as the schema modification is performed.

Data Integrity: Expressing and understanding integrity constraints is central to schema design, and thus also critical for an organic database where the schema is continuously evolving. Functional dependencies (FDs) are often used in database design to add semantics to schemas and to assert integrity constraints for data.

Nested functional dependencies have been studied extensively in the past [33]. However, CRIUS presents some new challenges due to its user-centric support for data and schema modification. When a user updates data, or modifies the schema, it is important to understand how the update affects existing dependencies so that we can communicate this information back to the user, and optionally take steps to resolve any resulting inconsistencies.

For this challenge, we consider two specific operations: data value updates and schema updates. For the former case, we show how data value updates and integrity constraints interact with each other and how we may take advantage of such interaction to guide user data entry from a set of appropriately maintained FDs. Specifically, we feature autocompletion for qualified data entry and provide a contextual menu to alert the user each time she issues an update that violates a given FD, in order to preserve data integrity. For schema

updates, our hypothesis is that for each schema update operation there is a way to “rewrite” the involved FDs to preserve their validity. Precisely how to perform this rewriting is described in detail in [64].
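The violation check at the heart of such alerts is simple to state: an update breaks an FD X → Y if it makes two rows agree on X but disagree on Y. A toy sketch (the FD, column names, and data are illustrative, not CRIUS internals):

```python
# Flag an update that would violate a functional dependency lhs -> rhs.
rows = [
    {"Name": "Peter", "Address": "12 Elm St", "Item": "Xbox"},
    {"Name": "Peter", "Address": "12 Elm St", "Item": "iPod"},
]

def violates_fd(rows, lhs, rhs):
    """Return True if two rows agree on `lhs` but disagree on `rhs`."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return True
        seen[key] = row[rhs]
    return False

print(violates_fd(rows, "Name", "Address"))  # False: Name -> Address holds

# An edit that breaks the dependency can be caught before it is committed:
rows[1]["Address"] = "34 Oak Ave"
print(violates_fd(rows, "Name", "Address"))  # True
```

In the real system this check runs incrementally as the user types, driving both the autocompletion suggestions and the contextual warning menu.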

Performance: Schema evolution is usually a heavy-weight operation in traditional database systems. It is not unusual for a commercial database to take days to complete the maintenance required after schema evolution. IT organizations carefully plan schema changes, and make them only infrequently. In contrast, everyday users are unlikely to plan carefully. We would like to develop techniques that support quick schema evolution without giving up any of the other desirable features.

We address the performance challenge with a storage layer that provides a practical means of actually storing the data. Conventionally, database systems have been designed with the goal of optimizing query processing, while schema modifications (e.g., ALTER TABLE) remain time-consuming, heavy-weight operations in current systems. We use a vertically partitioned format for the storage layer. Our goal is to significantly reduce the performance penalty incurred by schema modifications, at a very modest cost in query processing overhead.
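A toy Python sketch illustrates why vertical partitioning makes schema changes cheap: each column is stored as its own list, so adding or dropping a column never touches the other columns. The ColumnStore class is illustrative only; a real storage layer would persist each partition separately.

```python
# Sketch of a vertically partitioned store: one partition per column, so
# schema changes are local to the affected column instead of rewriting
# whole rows as a row-store ALTER TABLE often does.

class ColumnStore:
    def __init__(self):
        self.cols = {}          # column name -> list of values
        self.length = 0         # number of rows

    def add_column(self, name, default=None):
        # Only the new partition is written; other columns are untouched.
        self.cols[name] = [default] * self.length

    def drop_column(self, name):
        del self.cols[name]     # just discard one partition

    def insert(self, row):
        for name, values in self.cols.items():
            values.append(row.get(name))
        self.length += 1

t = ColumnStore()
t.add_column("name")
t.insert({"name": "Alice"})
t.add_column("email")           # old row gets None; "name" is not copied
t.insert({"name": "Bob", "email": "bob@example.org"})
```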

Understanding Schema Evolution: When a schema has evolved over an extended period of time, it is difficult for a user to keep track of the changes. A natural need is to concisely convey to the user how a database has been evolving. For example, the user may query the relationship between columns in the initial schema and the final schema, and how the transformation from old columns to new ones took place over time. We want to show users the gradual organic changes rather than a sudden transformation. We could keep track of all the changes step by step, which requires all changes to be maintained. If such information is not available, which is frequently the case when the user looks at external data sources, we seek to automatically discover such evolution from the data. Challenges involve mining conceptual changes from large numbers of changes to the database (e.g., inferring the splitting of every "Name" column in each table into "First Name" and "Last Name" columns, followed by a normalization of the names into a single table). Such inferences can be mined using either just the data, or a combination of the data and provenance information.

3.4 Addressing the Schema Mapping Challenge

In this section, we address the schema mapping challenge (Sec. 2.4) by proposing a sample-driven mapping approach, as opposed to the traditional match-driven approach adopted by most state-of-the-art schema mapping tools. The sample-driven approach lets the user freely provide sample target instances from which the system automatically infers the desired schema mapping, enabling organic data transformation by reducing the user's planning burden.

Let us examine a natural extension of Example 2.2, in which the user wishes to establish a mapping from the whole source to a toy target consisting of the movie title, the director's name, and the production company. Even with such a simple target, the traditional mapping approach can cause considerable planning pain. First, for a given target attribute, there may be several candidate source attributes that match. Moreover, even if the source attributes have a one-to-one matching with the target attributes, there may be various ways these source attributes can be joined. Resolving either ambiguity with a match-driven interface can be overwhelming for the user, as shown in the top part of Figure 6.

With the sample-driven approach, the user can bypass source-to-target correspondence establishment and join path construction. All she needs to do is type sample instances into a restful WYSIWYG target relation, as shown in the bottom part of Figure 6. For instance, the user begins by entering "Avatar", and the system may find the value in both source attributes Movie.title and Company.name. As the user enters more movie names, such as "Harry Potter", the set of attributes eventually converges to the single attribute Movie.title. The same concept applies to join path specification. After the user inputs "Avatar" and "James Cameron" in the first row, the relationship between these two entries is not yet clear: it may stand for a movie-director relationship or a movie-writer relationship. However, after the user enters "Harry Potter" and "David Yates", the system knows the desired relationship must be movie-director, since the writer of "Harry Potter" is "J.K. Rowling". In this way, the user easily specifies the mapping without planning for the underlying complicated details.
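The convergence behavior described above can be sketched in a few lines of Python: each new sample value narrows the set of source attributes that could have produced all samples so far. The toy database contents below are invented for illustration.

```python
# Toy "source database": which values each source attribute contains.
database = {
    "Movie.title":  {"Avatar", "Harry Potter", "Alice in Wonderland"},
    "Company.name": {"Avatar", "Lightstorm Co."},
    "Person.name":  {"James Cameron", "David Yates", "Tim Burton"},
}

def candidates_for(samples):
    """Source attributes whose value sets contain every sample so far."""
    return {attr for attr, values in database.items()
            if all(s in values for s in samples)}

# "Avatar" alone is ambiguous between a movie title and a company name:
assert candidates_for(["Avatar"]) == {"Movie.title", "Company.name"}
# Adding "Harry Potter" converges to a single attribute:
assert candidates_for(["Avatar", "Harry Potter"]) == {"Movie.title"}
```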

In the remainder of this section, we briefly highlight the challenges in supporting such an organic schema mapping system, and our solutions.

Mapping Deduction: The key challenge behind the scenes is to automatically deduce the desired mapping using just the user-provided sample target instances. This is difficult for two reasons. First, the user input is small and incomplete compared with the whole desired target relation that the user eventually wishes to generate. Second, there is no prior knowledge of what this "desired target" would be. In other words, there may be a large number of candidate mappings consistent with the current user input, yet no one knows which one is the ground truth. Therefore, we have to keep track of all possible mappings.

Indeed, the deduction problem is a reverse-engineering problem that uses only incomplete information to search for the most suitable process that would have generated such information, while the number of possible processes may be large. The reason for such ambiguity, viewed from the schema mapping context, is obvious. For each piece of user-input data, there may be multiple source attributes that contain it. Moreover, for each pair of source attributes matched to the target, there may be multiple ways to join them. Consequently, the critical question for the sample-driven mapping approach becomes how to effectively compute the set of candidate mappings that would yield the current user-input target sample instances.


[Figure 6: Traditional Schema Mapping vs. Organic Schema Mapping with the Sample-Driven Approach. The match-driven approach (top) requires precise knowledge of both the source and the target schema and a comprehensive interpretation of the schema mapping: the user manually refines correspondences and the mapping structure (e.g., Movie join Director on ... join Person on ... versus Movie join Writer on ... join Person on ...). The sample-driven approach (bottom) requires only knowledge of the target schema and a few sample instances, which the user types directly into the target table MyMovieInfo.]

In traditional schema mapping design, database engineers may spend considerable time discussing and choosing the optimal mapping. In the organic mapping scenario, however, these mappings have to be generated quickly enough that the user gets "interactive-speed" feedback, allowing her to review the current system status before continuing to provide more samples, or stopping if the system has generated the desired mapping. As a result, the system has to generate the set of candidate mappings with satisfactory performance.

We address this mapping deduction challenge by abstracting schema mappings into mapping paths, and developing algorithms that efficiently derive bigger mapping paths from smaller ones, where the smallest mapping paths are generated by a depth-limited breadth-first search of user-input connections in the source database instance. The highlight of the algorithm is that every piece of required information is captured in those smallest mapping paths, and the process of composing them into the final candidate mappings follows a natural bottom-up procedure with a "weaving" operation that can be done entirely in memory. The algorithm is provably sound and complete. And because all database accesses occur only when generating the smallest mapping paths, the overall performance is noticeably high and meets practical requirements.

Mapping Pruning: After we deduce the set of all candidate mappings, we have to efficiently prune them to obtain the final desired mapping. This process is also time-critical, since it must be done while the system interacts with continuous user input. We propose two approaches to mapping pruning.

After the initial set of candidate mappings is generated, the user may continue to enter additional samples to prune the candidate set. This basic static pruning works in two steps. In the first step, source attributes in existing candidate mappings are checked against the newly input samples; source attributes not containing the new samples, together with the corresponding mappings, are pruned. In the second step, the connections between newly input samples are checked against the candidate mappings, and mappings that do not satisfy the new connections are removed.
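The two pruning steps can be sketched in Python over an invented toy source; the attribute sets and join paths here are placeholders, not the system's actual data structures.

```python
# Toy source: which values each attribute contains, and which
# (movie, person) pairs each candidate join path connects.
ATTR_VALUES = {
    "Movie.title": {"Avatar", "Harry Potter"},
    "Person.name": {"James Cameron", "David Yates", "J.K. Rowling"},
}
JOIN_PAIRS = {
    "directed": {("Avatar", "James Cameron"), ("Harry Potter", "David Yates")},
    "wrote":    {("Avatar", "James Cameron"), ("Harry Potter", "J.K. Rowling")},
}

def prune(candidates, new_row):
    """Keep join-path candidates consistent with a new (title, name) row."""
    survivors = []
    title, name = new_row
    for join in candidates:
        # Step 1: the new values must occur in the matched source attributes.
        if title not in ATTR_VALUES["Movie.title"]:
            continue
        if name not in ATTR_VALUES["Person.name"]:
            continue
        # Step 2: the new values must be connected under this join path.
        if new_row not in JOIN_PAIRS[join]:
            continue
        survivors.append(join)
    return survivors

cands = ["directed", "wrote"]
cands = prune(cands, ("Avatar", "James Cameron"))      # both survive
cands = prune(cands, ("Harry Potter", "David Yates"))  # only "directed"
assert cands == ["directed"]
```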

As a complement to static pruning, we propose dynamic pruning based on automatic sample instance recommendation. In general, a user may not realize the critical differences between various candidate mappings, and may continue with samples supporting every existing mapping. This, indeed, leaves the candidate set's convergence time without an upper bound. To make the candidate set converge faster, we can automatically construct target instances that satisfy only part of the candidate set, and ask the user whether such a selection is desired. In this way, the size of the candidate set quickly shrinks, and the user obtains the goal mapping within just a few interactions.
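A sketch of the recommendation step in Python: find a target row produced by some, but not all, remaining candidates, so that the user's answer is guaranteed to shrink the candidate set. The toy join paths are invented for illustration.

```python
# Toy candidate mappings: each join path produces a set of target rows.
JOIN_PAIRS = {
    "directed": {("Avatar", "James Cameron"), ("Harry Potter", "David Yates")},
    "wrote":    {("Avatar", "James Cameron"), ("Harry Potter", "J.K. Rowling")},
}

def distinguishing_sample(candidates):
    """A row produced by at least one candidate but not by all of them,
    together with the candidates that produce it."""
    all_rows = set().union(*(JOIN_PAIRS[c] for c in candidates))
    for row in sorted(all_rows):
        holders = [c for c in candidates if row in JOIN_PAIRS[c]]
        if 0 < len(holders) < len(candidates):
            return row, holders
    return None

row, holders = distinguishing_sample(["directed", "wrote"])
# Asking the user about ("Harry Potter", "David Yates") separates the two
# candidates: accepting it keeps only "directed", rejecting it removes it.
```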

4 Related Work

Database usability started to receive attention more than 25 years ago [23] and has gained more momentum lately [37]. Research in database usability has mainly focused on innovative and effective query interface design, including visual, text (i.e., keyword), natural language, direct manipulation, and spreadsheet interfaces.

Visual Interfaces: Query By Example [82], which is the first study on building a query interface not based on a database query language, allows users to implicitly construct queries by identifying examples of desired data elements. This work is followed more recently by QBT [67], Kaleidoquery [57], VISIONARY [9], MIX [56], Xing [27], and XQBE [12]. Alternatively, forms-based query interface design has also been receiving attention. Early works on such interfaces include [26, 20], which provide users with visual tools to frame queries and to perform tasks such as database design and view definition. This direction is more recently followed by GRIDS [66] and Acuity [70], and, in XML database systems, by FoXQ [1], EquiX [21], and QURSED [62]. Adaptive form construction is studied in DRIVE [55], which enables runtime context-sensitive interface editing for object-oriented databases, and in [39], which studies how forms can be automatically designed and constructed based on past query history. Recent work by Jayapandian and Jagadish proposes techniques for automatic construction of forms based on database schema and data [40] and expressive form customization [41].

Text Interfaces: The success of Information Retrieval (IR) style (i.e., keyword-based) search among ordinary users has prompted database researchers to study a similar search interface for database systems. The goal is to maintain the simplicity of the search while exploiting not only the textual content of the tuples, but also the structures within and across tuples, to rank the results in a way that is more effective than the traditional IR-style ranking mechanism. For relational databases, this approach was first studied by Goldman et al. in [29] and followed by many systems, including DBXplorer [2], BANKS [10], DISCOVER [35], and ObjectRank [7]. For XML databases, the inherently more complicated structure within the database content allows researchers to explore query languages ranging from pure keywords to approximate structural queries, and has led to various projects including XSEarch [22], XRANK [30], JuruXML [16], FlexPath [5], Schema-Free XQuery [50], and Meaningful Summary Query [80]. A more recent trend in keyword-based search is to analyze a keyword query and automatically discover the hidden semantic structures that the query carries. This trend has influenced the design of projects for both traditional database search [42] and web search [53].

Natural Language Interfaces: Constructing a natural language interface to databases has a long history [6]. In particular, [68] analyzed the expressive power of a declarative query language (SEQUEL) in comparison to natural language. Most recently, NaLIX [49] proposed a generic natural language interface to XML databases, capable of adapting to multiple domains through user feedback. However, to this day, natural language understanding remains an extremely difficult problem, and current systems tend to be unreliable or unable to answer questions outside a few predefined narrow domains [63].

Direct Manipulation Interfaces: Direct manipulation [69], although a crucial concept in the user interface field, is seldom mentioned in the database literature. Pasta-3 [47] is one of the earliest efforts attempting a direct manipulation interface for databases, but its support for direct manipulation is limited to allowing users to manipulate a query expression with clicks and drags. Tioga-2 [4] (later developed into DataSplash [60]) is a direct manipulation database visualization tool, and its visual query language allows specification with a drag-and-drop interface. Its emphasis, however, is on visualization rather than querying. Recent work by Liu and Jagadish [52] develops a direct manipulation query interface based on a spreadsheet algebra.

Spreadsheet Interfaces: Spreadsheets have proven to be one of the most user-friendly and popular interfaces for handling data, as evidenced in part by the ubiquity of Microsoft Excel. FOCUS [73] provides an interface for manipulating local tables. Its query operations are quite simple (e.g., allowing only one level of grouping and being highly restrictive on the form of query conditions). Tableau [31], which is built on VizQL [32], specializes in interactive data visualization and is limited in querying capability. Spreadsheets have also been used for data cleaning [65], logic programming [72], visualization exploration [38], and photo management [44]. Witkowski et al. [77] proposed SQL extensions supporting spreadsheet-like computations in an RDBMS.

The query interface is just one aspect of database usability; many other research fields have a direct or indirect impact on the usability of databases, which we briefly describe below.

Personalization: Studies in this field attempt to customize database systems for each individual user, thereby making it easier for that user to explore the database and extract information, e.g., [24]. In addition, studies have focused on analyzing past query workloads to detect user interests and provide better results tuned to those interests, e.g., [46, 19, 36]. It is also worth noting that the notion of personalization has found interest in the information retrieval community, where the ranking of search results is biased using a certain personalized metric [34, 43].

Automatic Database Management: To alleviate the burden on database administrators, commercial database systems come with a suite of auxiliary tools. The AutoAdmin project [3, 18] at Microsoft, initiated by Surajit Chaudhuri and his colleagues, has made great strides on many aspects of database configuration, including physical design and index tuning. Similarly, the Autonomic Computing project [51, 54] at IBM provides a platform to tune a database system, including query optimization. However, none of these projects deals with the user-level database usability that is our focus here.

Database Schema Design: This has been studied extensively [11, 78, 61]. There is a great deal of work on defining a good schema, both from the perspective of capturing real-life requirements (e.g., normalization) and of supporting efficient queries. However, schema design has typically been considered a heavyweight, one-time operation, performed by a technically skilled database administrator based on careful requirements analysis and planning. The new challenge of enabling non-expert users to "give birth" to a database schema was posed recently [37], but no solution was provided.

Usability Study in Other Systems: Usability of information retrieval systems was studied in [74, 81], which analyzed usability errors and design flaws, and also in [25], which performed a comparison of usability testing methods. Principles of user-centered design were introduced in [45, 76], including how they could complement software engineering techniques to create interactive systems. Incorporating usability into the evaluation of computer systems was first studied in [13]. An extensive user study was performed in [17] to identify the reasons for user frustration in computing experiences, while [14] takes a more formal approach to model user behavior for usability analysis. There is also a recent move in the software systems community to conduct serious user studies [71]. However, for database systems in particular, these only scratch the surface of what needs to be done to improve usability.

5 Conclusion

Today, many users have to manage their own data, without the luxury of having it managed for them by a trained DBA. Without technical training, it is difficult for ordinary users to reason about abstract concepts, such as schema, let alone successfully design a database schema or map the schema of an unfamiliar database. To meet the needs of such users, we have introduced the notion of an organic database, distinguished from the traditional carefully engineered database. In this paper, we showed two instantiations of this concept, one for creating a database and one for schema mapping in data integration.

Acknowledgement

The project is supported in part by NSF grants IIS 1017296 and IIS 1250880.


References

[1] R. Abraham. FoXQ - XQuery by forms. In IEEE Symposium on Human Centric Computing Languages and Environments, 2003.

[2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A System for Keyword-Based Search over Relational Databases. In ICDE, 2002.

[3] S. Agrawal, S. Chaudhuri, L. Kollar, A. Marathe, V. Narasayya, and M. Syamala. Database Tuning Advisor for Microsoft SQL Server 2005. In VLDB, 2004.

[4] A. Aiken, J. Chen, M. Stonebraker, and A. Woodruff. Tioga-2: A direct manipulation database visualization environment. In ICDE, pages 208–217, 1996.

[5] S. Amer-Yahia, L. V. S. Lakshmanan, and S. Pandit. FleXPath: Flexible Structure and Full-Text Querying for XML. In SIGMOD, 2004.

[6] I. Androutsopoulos, G. Ritchie, and P. Thanisch. Natural Language Interfaces to Databases–an introduction. Journal of Language Engineering, 1(1):29–81, 1995.

[7] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In VLDB, 2004.

[8] G. Bell and J. Gemmell. A Digital Life, 2007.

[9] F. Benzi, D. Maio, and S. Rizzi. Visionary: A Viewpoint-based Visual Language for Querying Relational Databases. Journal of Visual Languages and Computing, 10(2), 1999.

[10] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In ICDE, 2002.

[11] J. Biskup. Achievements of relational database schema design theory revisited. In Semantics in Databases, pages 29–54. Springer-Verlag, 1998.

[12] D. Braga, A. Campi, and S. Ceri. XQBE (XQuery By Example): A Visual Interface to the Standard XML Query Language. ACM Trans. Database Syst., 30(2), 2005.

[13] A. B. Brown, L. C. Chung, and D. A. Patterson. Including the Human Factor in Dependability Benchmarks. In DSN Workshop on Dependability Benchmarking, 2002.

[14] R. Butterworth, A. Blandford, and D. Duke. Using Formal Models to Explore Display-Based Usability Issues. Journal of Visual Languages and Computing, 10(5), 1999.

[15] M. Cafarella, A. Halevy, and N. Khoussainova. Data integration for the relational web. VLDB, 2(1):1090–1101, 2009.

[16] D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Soffer. Searching XML Documents via XML Fragments. In SIGIR, 2003.

[17] I. Ceaparu, J. Lazar, K. Bessiere, J. Robinson, and B. Shneiderman. Determining Causes and Severity of End-User Frustration. International Journal of Human Computer Interaction, 17(3), 2004.

[18] S. Chaudhuri and G. Weikum. Rethinking Database System Architecture: Towards a Self-Tuning, RISC-style Database System. In VLDB, 2000.

[19] Z. Chen and T. Li. Addressing Diverse User Preferences in SQL-Query-Result Navigation. In SIGMOD, 2007.

[20] J. Choobineh, M. V. Mannino, and V. P. Tseng. A Form-Based Approach for Database Analysis and Design. CACM, 35(2), 1992.

[21] S. Cohen, Y. Kanza, Y. Kogan, Y. Sagiv, W. Nutt, and A. Serebrenik. EquiX–A Search and Query Language for XML. JASIST, 53(6), 2002.

[22] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A Semantic Search Engine for XML. In VLDB, 2003.

[23] C. J. Date. Database Usability. In SIGMOD, New York, NY, USA, 1983. ACM Press.

[24] X. Dong and A. Halevy. A Platform for Personal Information Management and Integration. In CIDR, 2005.

[25] A. Doubleday, M. Ryan, M. Springett, and A. Sutcliffe. A Comparison of Usability Techniques for Evaluating Design. In DIS, 1997.

[26] D. W. Embley. NFQL: The Natural Forms Query Language. ACM Trans. Database Syst., 1989.

[27] M. Erwig. A Visual Language for XML. In IEEE Symposium on Visual Languages, 2000.

[28] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, May 2005.

[29] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity Search in Databases. In VLDB, 1998.

[30] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents. In SIGMOD, 2003.

[31] P. Hanrahan. VizQL: A Language for Query, Analysis and Visualization. SIGMOD, pages 721–721, 2006.

[32] P. Hanrahan. VizQL: a language for query, analysis and visualization. In SIGMOD, page 721, 2006.

[33] C. Hara and S. Davidson. Reasoning about nested functional dependencies. In PODS, 1999.


[34] T. Haveliwala. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, 2003.

[35] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In VLDB, 2002.

[36] Y. E. Ioannidis and S. Viglas. Conversational Querying. Inf. Syst., 31(1):33–56, 2006.

[37] H. V. Jagadish, A. Chapman, A. Elkiss, M. Jayapandian, Y. Li, A. Nandi, and C. Yu. Making database systems usable. In SIGMOD, 2007.

[38] T. J. Jankun-Kelly and K.-L. Ma. A spreadsheet interface for visualization exploration. In IEEE Visualization, pages 69–76, 2000.

[39] M. Jayapandian and H. V. Jagadish. Automating the Design and Construction of Query Forms. In ICDE, 2006.

[40] M. Jayapandian and H. V. Jagadish. Automated creation of a forms-based database query interface. In VLDB, 2008.

[41] M. Jayapandian and H. V. Jagadish. Expressive query specification through form customization. In EDBT, 2008.

[42] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar Information Extraction System. IEEE Data Eng. Bull., 29(1):40–48, 2006.

[43] G. Jeh and J. Widom. Scaling Personalized Web Search. WWW, pages 271–279, 2003.

[44] S. Kandel, A. Paepcke, M. Theobald, and H. Garcia-Molina. The photospread query language. Technical report, Stanford Univ., 2007.

[45] J. F. Kelley. An Iterative Design Methodology for User-Friendly Natural Language Office Information Applications. ACM Trans. Database Syst., 2(1), 1984.

[46] G. Koutrika and Y. Ioannidis. Personalization of Queries in Database Systems. In ICDE, 2004.

[47] M. Kuntz and R. Melchert. Pasta-3's graphical query language: Direct manipulation, cooperative queries, full expressive power. In VLDB, pages 97–105, 1989.

[48] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1):15–68, 2000.

[49] Y. Li, H. Yang, and H. V. Jagadish. NaLIX: A Generic Natural Language Search Environment for XML Data. ACM Transactions on Database Systems (TODS), 32(4), 2007.

[50] Y. Li, C. Yu, and H. V. Jagadish. Enabling Schema-Free XQuery with Meaningful Query Focus. VLDB Journal, in press.

[51] S. Lightstone, G. M. Lohman, P. J. Haas, et al. Making DB2 Products Self-Managing: Strategies and Experiences. IEEE Data Eng. Bull., 29(3):16–23, 2006.

[52] B. Liu and H. V. Jagadish. A spreadsheet algebra for a direct data manipulation query interface. In ICDE, 2009.

[53] J. Madhavan, S. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In CIDR, 2007.

[54] V. Markl, G. M. Lohman, and V. Raman. LEO: An Autonomic Query Optimizer for DB2. IBM Systems Journal, 42(1):98–106, 2003.

[55] K. Mitchell and J. Kennedy. DRIVE: An Environment for the Organized Construction of User-Interfaces to Databases. In Interfaces to Databases (IDS-3), 1996.

[56] P. Mukhopadhyay and Y. Papakonstantinou. Mixing Querying and Navigation in MIX. In ICDE, 2002.

[57] N. Murray, N. Paton, and C. Goble. Kaleidoquery: A Visual Query Language for Object Databases. In Advanced Visual Interfaces, 1998.

[58] A. Nandi and H. V. Jagadish. Assisted Querying using Instant-Response Interfaces. In SIGMOD, 2007.

[59] A. Nandi and H. V. Jagadish. Qunits: queried units for database search. CIDR, 2009.

[60] C. Olston, A. Woodruff, A. Aiken, M. Chu, V. Ercegovac, M. Lin, M. Spalding, and M. Stonebraker. DataSplash. In SIGMOD, pages 550–552, 1998.

[61] E. Papadomanolakis and A. Ailamaki. AutoPart: Automating schema design for large scientific databases using data partitioning. In SSDBM, 2004.

[62] Y. Papakonstantinou, M. Petropoulos, and V. Vassalos. QURSED: Querying and Reporting Semistructured Data. In SIGMOD, 2002.

[63] A.-M. Popescu, O. Etzioni, and H. A. Kautz. Towards a Theory of Natural Language Interfaces to Databases. In IUI, 2003.

[64] L. Qian, K. LeFevre, and H. V. Jagadish. CRIUS: User-friendly database design. PVLDB, 4(2):81–92, Nov. 2010.

[65] V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB, pages 381–390, 2001.

[66] R. E. Sabin and T. K. Yap. Integrating Information Retrieval Techniques with Traditional DB Methods in a Web-Based Database Browser. In SAC, 1998.


[67] A. Sengupta and A. Dillon. Query by Templates: A Generalized Approach for Visual Query Formulation for Text Dominated Databases. In ADL, 1997.

[68] B. Shneiderman. Improving the Human Factors Aspect of Database Interactions. ACM Trans. Database Syst., 3(4), 1978.

[69] B. Shneiderman. The future of interactive systems and the emergence of direct manipulation. Behaviour & Information Technology, 1(3):237–256, 1982.

[70] S. Sinha, K. Bowers, and S. A. Mamrak. Accessing a Medical Database using WWW-Based User Interfaces. Technical report, Ohio State University, 1998.

[71] C. Soules, S. Shah, G. R. Ganger, and B. D. Noble. It's Time to Bite the User Study Bullet. Technical report, University of Michigan, 2007.

[72] M. Spenke and C. Beilken. A spreadsheet interface for logic programming. In CHI, pages 75–80, 1989.

[73] M. Spenke, C. Beilken, and T. Berlage. FOCUS: The interactive table for product comparison and selection. In UIST, pages 41–50, 1996.

[74] A. Sutcliffe, M. Ryan, A. Doubleday, and M. Springett. Model Mismatch Analysis: Towards a Deeper Explanation of Users' Usability Problems. Behavior & Information Technology, 19(1), 2000.

[75] B. Tan and F. Peng. Unsupervised query segmentation using generative language models and Wikipedia. In WWW, 2008.

[76] A. I. Wasserman. User Software Engineering and the Design of Interactive Systems. In ICSE, Piscataway, NJ, USA, 1981. IEEE Press.

[77] A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, and S. Subramanian. Spreadsheets in RDBMS for OLAP. In SIGMOD, 2003.

[78] S. K. M. Wong, C. J. Butz, and Y. Xiang. Automated database schema design using mined data dependencies. Journal of the American Society for Information Science, 49:455–470, 1998.

[79] C. Yu and H. V. Jagadish. Schema Summarization. In VLDB, 2006.

[80] C. Yu and H. V. Jagadish. Querying Complex Structured Databases. In VLDB, 2007.

[81] W. Yuan. End-User Searching Behavior in Information Retrieval: A Longitudinal Study. JASIST, 48(3), 1997.

[82] M. M. Zloof. Query-by-Example: the Invocation and Definition of Tables and Forms. In VLDB, 1975.