Semex a Platform for Personal Information Management and ...homes.cs.washington.edu/~alon/files/NYDBIR05.pdf · Semex: a Platform for Personal Information Management and Integration

Semex: a Platform for Personal Information Management and Integration

Alon Halevy

University of Washington (On Sabbatical @ Stanford & Transformic Inc.)

April 15, 2005
NY Area DB/IR Day

Joint work with: Luna Dong, Jayant Madhavan

Did You Mean: DB and IR, DB or IR?

Personal information management: Pushes the limits on DB&IR.

Demonstrate some DB&IR issues through the Semex Project.

NSF starting to get interested in PIM: First brainstorming workshop in January. People from DB, IR, HCI, Psychology.

What is PIM?

Questions I Can’t Answer

Find my VLDB04 paper, and the PowerPoint (maybe in an attachment?).
Find emails from my Californian friends.
Which paper by Ken Ross did I cite in my latest SIGMOD paper?
What quarter was Mary in my class and what grade did she get?
Which experiments did I run with NF1 and which emails discussed them?

Why?

HTML, Mail & calendar, Papers, Files, Presentations

Information is organized by application, not by any semantically meaningful logical organization.
Vannevar Bush said this in 1945: Personal Memex.

[Figure: associations between personal data items; labels: OriginatedFrom, EarlyVersion, PublishedIn, ConfHomePage, ExperimentOf, PaperAbout, BudgetOf, Sender, Recipient, CourseGradeIn, AddressOf, Attached, Cites, PresentationFor, CoAuthor, FrequentEmailer, HomePage]

Association queries

[Screenshots; labels: Miller, Barton Miller, R. Miller; for R. Miller: Email, Articles, Contact info]

Article: “Data driven understanding and refinement of schema mapping”
  IsCitedBy / Cites
Article: “The Piazza Peer-data Management Project”

PIM vs. Web Search

"But there's a fundamental difference between searching a universe of documents created by strangers and searching your own personal library.

When you're freewheeling through ideas that you yourself have collated -- particularly when you'd long ago forgotten about them -- there's something about the experience that seems uncannily like freewheeling through the corridors of your own memory. It feels like thinking."

Steven Johnson, New York Times, January 30, 2005

Semex Over-arching Goals

Create an ‘AHA!’ experience with a PIM system
  “How did I ever live without this?”
  Extensible to arbitrary associations.

Leverage the PIM environment and knowledge to increase productivity in other tasks
Build a platform for <your cool stuff here>

Leveraging Semex: On-the-Fly Data Integration

Who published at SIGMOD but was not recently on the PC?

Partial Success of DI

EII: Enterprise Information Integration
  Starting to catch on.
  See SIGMOD-05 industrial paper for good perspectives.
Mostly in applications such as:
  Customer Relationship Management
  Portal construction
  Frequently occurring queries.

Still quite an effort to set up an integration scenario.

On-The-Fly Integration

[Figure: domain model; labels: Conference, PC, Presentation, OrganizedBy, publishedIn, Person, Paper, servesOn, Author, presentedIn]
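
The question posed earlier ("Who published at SIGMOD but was not recently on the PC?") can be read as a set difference over two associations of this domain model. The following is only a minimal sketch in Python, assuming the external data has already been imported and reconciled; the sample people, years, and the `recent_since` cutoff are hypothetical and not taken from Semex.

```python
# Hypothetical, already-reconciled instances of two associations from the
# domain model: publishedIn (reached through Author/Paper) and servesOn.
published_in = [
    ("alice", "SIGMOD", 2004),
    ("bob", "SIGMOD", 2003),
    ("carol", "VLDB", 2004),
]
serves_on = [
    ("alice", "SIGMOD", 2004),
    ("bob", "SIGMOD", 1999),
]


def published_but_not_on_pc(conference, recent_since):
    """People with a paper at `conference` who did not serve on its PC
    in any year >= `recent_since`."""
    authors = {p for p, conf, _ in published_in if conf == conference}
    recent_pc = {
        p for p, conf, year in serves_on
        if conf == conference and year >= recent_since
    }
    return sorted(authors - recent_pc)


result = published_but_not_on_pc("SIGMOD", recent_since=2002)
print(result)
```

The hard part in practice is not this query but getting the two relations into one model on the fly, which is what the rest of the talk addresses.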


Outline

Semex (open) architecture
The glue: reference reconciliation
Current research “what if’s”, and challenges:
  Malleable schemas
  On-the-fly information integration
  Association queries and indexing
  Visualizations of personal information
  More challenges

System Architecture

[Figure: architecture diagram; labels: Word, Excel, PPT, PDF, Bibtex, Latex, Email, Contacts; Semi-structured domain model repository; Association extractor (shown four times); Reference Reconciliation; Simple, Extracted, External, Defined]

IR/DB Themes

Axiom: the desktop is the database.

Need to manage any kind of data: Once you touch it, it’s managed!

Schema? Sure! A bit here and there.

Association queries

Reference reconciliation

Halevy

Multi-class Reconciliation

Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)

Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)

Reference Reconciliation

Input: A set of references R
Output: A partitioning over R, such that
  Each partition refers to a single real-world entity – high precision
  Different partitions refer to different entities – high recall
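
To make the output format concrete: once pairwise "same entity" decisions exist, the partitioning is their transitive closure. Below is a minimal sketch using a union-find structure; it illustrates the input/output contract only and is not the reconciliation algorithm itself. The pairwise decisions are assumed for illustration, based on the person names in the running example.

```python
def partition(references, same_entity_pairs):
    """Group references into partitions from pairwise 'same entity' decisions."""
    parent = {r: r for r in references}

    def find(r):
        # Follow parent pointers to the representative, halving paths as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in same_entity_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for r in references:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


refs = ["p1", "p2", "p3", "p4", "p5", "p6"]
# Assumed decisions: each full name matches its abbreviated form.
decisions = [("p1", "p4"), ("p2", "p5"), ("p3", "p6")]
partitions = partition(refs, decisions)
print(partitions)
```

Merging two references that are in fact different people hurts precision; leaving two references to the same person in separate partitions hurts recall.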

Reference Reconciliation Results

Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)
  p7=(“Eugene Wong”, “[email protected]”)
  p8=(null, “[email protected]”)
  p9=(“mike”, “[email protected]”)

Novel Challenges

1. Multiple Classes
2. Limited Information
3. Multi-value Attributes
4. Lack of training data

Applying Traditional Record Linkage Algorithm

[Chart: #(Person Partitions) vs. Evidence; data labels: 3159, 1409]
Person references: 24076
Real-world persons: 1750

Main Ideas [Dong et al., SIGMOD 05]

Leverage the context (network) of the references.
Propagate reconciliation decisions between different classes.
Enrich references as we go along.
Enforce some integrity constraints.

I. Exploiting Context Information

Associated Reference I – Contact list
  p5=(“Stonebraker, M.”, null, {p4, p6})
  p8=(null, “[email protected]”, {p7})
  p6=p7

Associated Reference II – Authored articles
  p2=(“Michael Stonebraker”, null)
  p5=(“Stonebraker, M.”, null)
  p2 and p5 authored the same article

Cross-attribute similarity – Name&email
  p5=(“Stonebraker, M.”, null)
  p8=(null, “[email protected]”)
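
Cross-attribute similarity compares values of different attributes, here a name against an email address. The sketch below is one simple heuristic and should not be read as the measure used in Semex. The addresses in the slides are redacted, so the `example.org` addresses here are hypothetical, as are the function names and the minimum token length of 3.

```python
import re


def name_tokens(name):
    """Lower-cased alphabetic tokens, e.g. 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return re.findall(r"[a-z]+", name.lower())


def name_matches_email(name, email):
    """Heuristic: some name token of length >= 3 occurs in the email's local part."""
    local = email.split("@", 1)[0].lower()
    return any(len(tok) >= 3 and tok in local for tok in name_tokens(name))


# Hypothetical addresses standing in for the redacted ones.
print(name_matches_email("Stonebraker, M.", "stonebraker@example.org"))
print(name_matches_email("Wong, E.", "stonebraker@example.org"))
```

A match like this is weak evidence on its own, which is why it is combined with the contact-list and co-authorship context above.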

I. Exploiting Context Information

[Chart: #(Person Partitions) by Evidence]
Attr-wise: 3159
Name&Email: 2169
Article: 2169
Contact: 2096
Other data labels: 1409, 346


Merging Propagation

Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)

Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)

II. Merging Propagation

[Chart: #(Person Partitions) by Evidence; legend: Traditional, Propagation]
Traditional: 3159, 2169, 2169, 2096
Propagation: 3159, 2146, 2135, 2022


III. Reference Enrichment

p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)

p8-9=(“mike”, “[email protected]”, {p7})
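
Enrichment means that once two references are judged to be the same person, their attribute values are pooled, so the merged reference carries more evidence for later comparisons. A minimal sketch, assuming each attribute is kept as a set of values; the `PersonRef` class, the `enrich` function, and the `example.org` address (standing in for the redacted one) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PersonRef:
    names: set = field(default_factory=set)
    emails: set = field(default_factory=set)
    contacts: set = field(default_factory=set)


def enrich(a, b):
    """Merge two references to the same person by unioning their attribute values."""
    return PersonRef(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)


# p8 has an email and a contact but no name; p9 has a name and the same email.
p8 = PersonRef(emails={"mike@example.org"}, contacts={"p7"})
p9 = PersonRef(names={"mike"}, emails={"mike@example.org"})
p8_9 = enrich(p8, p9)
print(p8_9)
```

The merged reference now has both a name and a contact list, so it can be compared with p2 on attributes that neither p8 nor p9 offered alone.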

III. Reference Enrichment

[Chart: #(Person Partitions) by Evidence; legend: Traditional, Merge, Propagation; data labels: 3159, 2169, 2169, 2096 and 3169, 2036, 2036, 1910]


Overall Results

[Chart: #(Person Partitions) by Evidence; legend: Traditional, Merge, Propagation, Full; data labels: 3159, 2169, 2169, 2096 and 3169, 2002, 1990, 1873; other data labels: 1409, 346, 125]

The Dependency Graph

Reference similarity nodes:
  (a1, a2)
  (p1, p4)
  (p2, p5)
  (p3, p6)
  (c1, c2)

Attribute similarity nodes:
  (“Distributed…”, “Distributed …”)
  (“169-180”, “169-180”)
  (“Robert S. Epstein”, “Epstein, R.S.”)
  (“Michael Stonebraker”, “Stonebraker, M.”)
  (“Eugene Wong”, “Wong, E.”)
  (“ACM …”, “ACM SIGMOD”)
  (“1978”, “1978”)

Propagation I-V

[Animation, five frames: the dependency graph, with nodes marked as Reconciled or Similar]
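
The propagation idea can be sketched as a work queue over the dependency graph: when a reference pair is reconciled, the similarity of each dependent pair is raised and that pair is re-examined. The code below is only an illustration under stated assumptions and not the algorithm of the SIGMOD 05 paper: the initial scores, the `THRESHOLD` and `BOOST` constants, and the edges in `depends` are all hypothetical.

```python
from collections import deque

# Hypothetical initial similarity scores for the reference-pair nodes.
score = {
    ("p1", "p4"): 0.9,
    ("p2", "p5"): 0.9,
    ("p3", "p6"): 0.9,
    ("c1", "c2"): 0.8,
    ("a1", "a2"): 0.6,
}
# Assumed edges: reconciling the key node raises the score of each listed node.
depends = {
    ("p1", "p4"): [("a1", "a2")],
    ("p2", "p5"): [("a1", "a2")],
    ("p3", "p6"): [("a1", "a2")],
    ("a1", "a2"): [("c1", "c2")],
    ("c1", "c2"): [("a1", "a2")],
}
THRESHOLD = 0.85  # hypothetical merge threshold
BOOST = 0.1       # hypothetical increment per reconciled neighbor


def propagate(score, depends):
    """Return the set of node pairs reconciled after propagation."""
    score = dict(score)  # work on a copy
    reconciled = set()
    queue = deque(n for n, s in score.items() if s >= THRESHOLD)
    while queue:
        node = queue.popleft()
        if node in reconciled:
            continue
        reconciled.add(node)
        for nb in depends.get(node, []):
            if nb in reconciled:
                continue
            score[nb] = min(1.0, score[nb] + BOOST)
            if score[nb] >= THRESHOLD:
                queue.append(nb)
    return reconciled


final = propagate(score, depends)
print(sorted(final))
```

With these numbers the three author pairs are reconciled first, which lifts the article pair over the threshold, which in turn lifts the venue pair: decisions in one class propagate to another.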

Comparison with UAI Methods

Propagating similarities between classes was investigated with UAI methods (e.g., Russell and Pasula).
  Fit everything into a probabilistic model of the domain.
Our approach exploits the dependencies, but does not enforce a model.

Outline

Semex (open) architecture
The glue: reference reconciliation
Current research “what if’s”, and challenges:
  Malleable schemas
  On-the-fly information integration
  Association queries and indexing
  Visualizations of personal information
  More challenges

DB/IR Themes

How do we model an application domain that involves both structured and unstructured data?

Malleable Schemas

Most DB/IR work is on seamless querying:
  Integration after the fact.

But what about the modeling phase?
  How can we design applications that manipulate both kinds of data?
Domains where:
  Border between two types is not clear or evolving,
  There is no obvious structure,
  Structure is not known at modeling time,
  Complete structure would be too complicated for users.

A Different Example: Web Data Integration

Building a meta-search engine for classifieds sites on the web.
Modeling the class RealEstate.
  Realize: subclasses are a messy proposition.
  Instead: describe subclasses by keywords.

Malleable Schemas: Keywords as Schema Constructs

Web: modeling the class RealEstate.
  Realize: subclasses are a messy proposition.
  Instead: describe subclasses by keywords.

PIM: modeling property Participant.
  Realize: there are many shades of participation.
  Instead: describe variants with keywords.

Key point: keywords are seen as replacements for some schema constructs.
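
One way to picture keywords standing in for a schema construct: the RealEstate class keeps its structured attributes, but instead of a subclass hierarchy each instance carries a set of descriptive keywords, and queries mix structured predicates with keyword matching. This is a minimal sketch only; the `Listing` class, its fields, the sample data, and `find_listings` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    """An instance of the class RealEstate."""
    title: str
    price: int
    keywords: set = field(default_factory=set)  # replaces a subclass hierarchy


listings = [
    Listing("2BR near campus", 1200, {"apartment", "rental"}),
    Listing("Lakefront cabin", 250000, {"house", "sale", "waterfront"}),
    Listing("Downtown loft", 1800, {"loft", "rental"}),
]


def find_listings(listings, required_keywords, max_price=None):
    """Structured predicate on price combined with keyword matching."""
    required = set(required_keywords)
    return [
        item.title for item in listings
        if required <= item.keywords
        and (max_price is None or item.price <= max_price)
    ]


print(find_listings(listings, {"rental"}, max_price=1500))
```

New "subclasses" (say, "waterfront") appear simply by using a new keyword, with no change to the schema.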

Importing External Data

[Figure: domain model; labels: Conference, PC, Presentation, OrganizedBy, publishedIn, Person, Paper, servesOn, Author, presentedIn]


Importing Data w/Background Knowledge

We know a lot about the domain model and its possible instances:
  Schema matching is easier: [à la Doan et al.]
  Reference reconciliation is easier: [Etzioni and Perkowitz, 95]
  Wrapper construction: [Kushmerick et al, 97]

Leverage:
  The user’s past actions
  Colleagues with the same data needs

Challenge: matching relationships (in addition to attributes).

Help When Looking at Data

View external sources from my perspective:
  Highlight people I know (and why) on a web page
Fill in blanks:
  Tell me which people may be missing in a spreadsheet I’ve received
  Suggest other names (papers, etc.) when I’m creating a list.

Association Querying

Find objects not referenced in the query:
  Ask for Semex: Get Luna Dong, Jayant Madhavan
Learn interesting/useful association paths:
  Co-author, collaborator, relatedProject

Rank lists intelligently (lineage)
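
The "ask for Semex, get Luna Dong and Jayant Madhavan" behavior can be pictured as a bounded traversal of the association graph: start from objects that match the keyword, follow associations for a few hops, and return the people reached. The sketch below is a plain breadth-first search under assumed data, with no ranking; the triples, the `type:name` naming scheme, and the `associated` function are hypothetical.

```python
from collections import deque

# Hypothetical association graph as (object, association, object) triples.
triples = [
    ("paper:Semex overview", "Author", "person:Luna Dong"),
    ("paper:Semex overview", "Author", "person:Jayant Madhavan"),
    ("email:Semex demo plan", "Sender", "person:Luna Dong"),
    ("paper:Other work", "Author", "person:Someone Else"),
]


def associated(keyword, target_prefix, max_hops=2):
    """Objects of the target type within max_hops of any object matching keyword."""
    graph = {}
    for s, _, o in triples:
        graph.setdefault(s, set()).add(o)
        graph.setdefault(o, set()).add(s)
    start = [n for n in graph if keyword.lower() in n.lower()]
    seen = set(start)
    queue = deque((n, 0) for n in start)
    hits = set()
    while queue:
        node, hops = queue.popleft()
        if node.startswith(target_prefix):
            hits.add(node)
        if hops < max_hops:
            for nb in graph[node] - seen:
                seen.add(nb)
                queue.append((nb, hops + 1))
    return sorted(hits)


people = associated("Semex", "person:")
print(people)
```

Neither person's name contains the query keyword; they are returned because of their associations to objects that do. Ranking the results, for example by the number or kind of paths that reach each person, is the open question raised above.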

Views on Personal Information

Semex enables multiple views on personal information:
  See everything about a project
  View the progression of a project or paper
  Activity clustering [Mitchell does for email]
  Pointers to external resources

From PIM to G(roup)IM

We would like to share:
  Subsets of our data
  Fragments of the domain model

Create personal profiles for:
  Better web search, online shopping, ad placement.
Manage information along a social network.
To share or not to share?

Summary

The goal of Semex is to bring the benefits of data management to the desktop
  Needs to be invisible!
  Automatically create associations between data items.
Fundamental challenges to DB/IR:
  Manage everything
  Exploit schema(s) when you have them
  Model data flexibly
  Support new types of queries.

“The most profound technologies are those that disappear”. Mark Weiser

Some References

Overview: CIDR 2005
Reference Reconciliation: SIGMOD 2005
A cool demo: SIGMOD 2005
The website: http://data.cs.washington.edu/semex/
NSF PIM Workshop: http://pim.ischool.washington.edu/
