Cimple: Building Community Portal Sites through Crawling & Extraction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 4, 2008 Slides based on content by AnHai Doan, used with permission


Cimple: Building Community Portal Sites through Crawling & Extraction

Zachary G. Ives

University of Pennsylvania

CIS 650 – Implementing Data Management Systems

November 4, 2008

Slides based on content by AnHai Doan, used with permission


Administrivia

By next Tuesday: a rough schedule and division of duties for your project

Please read the Halevy et al. paper on Piazza


The Web Is Full of Special-Interest Portal Sites for Communities

Academia Certain bioinformatics topics; citations; etc.

Medicine WebMD

Infotainment Rotten Tomatoes, IMDB, fantasy football

Business enterprise intranets, tech support groups, lawyers

CIA / homeland security Intellipedia

Some of these gather information from the Web


Cimple Project @ Wisconsin (+ Yahoo)

[Figure: data sources (researcher homepages, conference pages, group pages, the DBworld mailing list, DBLP, Web pages, text documents) feed an ER graph of entities and relationships (e.g., Jim Gray give-talk SIGMOD-04). The graph supports services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Side labels: maintain and add more sources; mass collaboration.]

Develops a general solution to community Web portals using extraction + integration + mass collaboration


The Basic Ideas

Architecture mainly consists of extractors and ER-graphs

The key challenge is to extract “good” information for the ER-graphs, and to allow all of the information to be extended and repaired


Prototype System: DBLife

Integrate data of the DB research community

1164 data sources

Crawled daily, 11000+ pages = 160+ MB / day


Data Integration

Raghu Ramakrishnan

co-authors = A. Doan, Divesh Srivastava, ...


Resulting ER Graph

[Figure: ER graph connecting the paper “Proactive Re-optimization” and the conference SIGMOD 2005 with Jennifer Widom, Shivnath Babu, David DeWitt, and Pedro Bizarro through coauthor, advise, write, PC-Chair, and PC-member edges]


Provide Services

DBLife system


Mass Collaboration via Wiki


Issues Addressed by Cimple

Cimple addresses challenges in:

1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration


1. Source Selection

[Figure: Cimple architecture diagram, repeated: data sources → ER graph → services, with source maintenance and mass collaboration]


Current Solutions vs. Cimple

Current solutions: topic-specific crawlers
  find all relevant data sources (e.g., using focused crawling, search engines)
  maximize coverage, which results in many “noisy” sources

Cimple allows for incremental development and deployment
  starts with a small set of high-quality “core” sources
  incrementally adds more sources
  only from “high-quality” places or as suggested by users (mass collaboration)


Start with a Small Set of “Core” Sources

Key observation: communities often follow an 80-20 rule
  20% of sources cover 80% of interesting activities

An initial portal over these 20% is often already quite useful

How do we select these 20%?
  select as many sources as possible
  then evaluate and select the most relevant ones


Evaluate the Relevance of Sources

Use PageRank + virtual links across entities + TF/IDF

... Gerhard Weikum

G. Weikum

See [VLDB-07a]
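
One possible reading of this recipe, as a rough sketch only: run PageRank over the hyperlink graph after adding "virtual" links between pages that mention the same entity (for example, a page mentioning "Gerhard Weikum" and one mentioning "G. Weikum", once both are normalized to one entity). The page names, the entity normalization, and the damping factor are illustrative assumptions, not the method of [VLDB-07a], and the TF/IDF component is left out.

```python
from collections import defaultdict
from itertools import combinations


def add_virtual_links(links, mentions):
    """Copy `links` and add bidirectional virtual links between pages
    that mention the same (already normalized) entity."""
    graph = defaultdict(set)
    for src, dsts in links.items():
        graph[src].update(dsts)
    pages_by_entity = defaultdict(set)
    for page, entities in mentions.items():
        graph.setdefault(page, set())  # every page becomes a node
        for entity in entities:
            pages_by_entity[entity].add(page)
    for pages in pages_by_entity.values():
        for a, b in combinations(sorted(pages), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread rank uniformly."""
    nodes = set(graph) | {d for dsts in graph.values() for d in dsts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = graph.get(n) or nodes  # dangling node: link to all
            share = damping * rank[n] / len(targets)
            for d in targets:
                new_rank[d] += share
        rank = new_rank
    return rank


# Hypothetical pages: "blog" has no incoming hyperlinks, but it mentions
# the same researcher as "homepage".
links = {"homepage": {"dblp"}, "conf": {"dblp"}, "blog": set()}
mentions = {"homepage": {"gerhard weikum"}, "blog": {"gerhard weikum"}, "conf": set()}

rank_plain = pagerank(links)
rank_virtual = pagerank(add_virtual_links(links, mentions))
print(rank_plain["blog"], rank_virtual["blog"])
```

In this toy graph the virtual link gives "blog" an incoming edge it otherwise lacks, so its score rises relative to the hyperlink-only run.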


Add More Sources over Time

Key observation: the most important sources will eventually be mentioned within the community
  so monitor certain “community channels” to find them

Example message from such a channel:
  Message type: conf. ann.
  Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
  Call for Participation
  Workshop on "Management of Uncertain Data"
  in conjunction with VLDB 2007
  http://mud.cs.utwente.nl ...

Also allow users to suggest new sources, e.g., the Silicon Valley Database Society


Summary: Source Selection

Incremental approach:
  start with highly relevant sources
  expand carefully
  minimize “garbage in, garbage out”

Need a notion of source relevance

Need a way to compute this


2. Extraction and Integration

[Figure: Cimple architecture diagram, repeated: data sources → ER graph → services, with source maintenance and mass collaboration]


Extracting Entity Mentions

Key idea: reasonable plan, then “patch”

Reasonable basic plan:
  collect person names, e.g., David Smith
  generate variations, e.g., D. Smith, Dr. Smith, etc.
  find occurrences of these variations

[Figure: execution plan applying ExtractMbyName to the Union of sources s1 … sn]

Works well, but can’t handle certain difficult spots
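
The basic plan can be sketched as follows. This is a minimal illustration, not the DBLife implementation: the variation rules, the function names `name_variations`, `extract_m_by_name`, and `union_extract`, and the sample sources are all assumptions made for the example.

```python
import re


def name_variations(full_name):
    """Generate a few simple variations of 'First Last' (illustrative rules)."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        full_name,                # David Smith
        f"{first[0]}. {last}",    # D. Smith
        f"Dr. {last}",            # Dr. Smith
        f"{last}, {first[0]}.",   # Smith, D.
    }


def extract_m_by_name(doc, dictionary):
    """Return (canonical name, matched text, offset) for each variation found."""
    mentions = []
    for name in dictionary:
        for variant in name_variations(name):
            # Lookarounds keep matches from starting or ending mid-word.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            for m in re.finditer(pattern, doc):
                mentions.append((name, m.group(), m.start()))
    return sorted(mentions, key=lambda t: t[2])


def union_extract(sources, dictionary):
    """Apply the extractor to every source s1 ... sn and union the results."""
    return {
        (source_id, *mention)
        for source_id, doc in sources.items()
        for mention in extract_m_by_name(doc, dictionary)
    }


sources = {
    "s1": "Talk by Dr. Smith on data integration.",
    "s2": "D. Smith and David Smith are listed.",
}
all_mentions = union_extract(sources, ["David Smith"])
for mention in sorted(all_mentions):
    print(mention)
```

Because every variation is searched for everywhere, the plan is cheap and has high recall, which is also why it misfires in the difficult spots mentioned above.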


Handling Difficult Spots

Example: R. Miller, D. Smith, B. Jones
  if “David Miller” is in the dictionary, the basic plan will flag “Miller, D.” as a person name

Solution: patch such spots with stricter plans

[Figure: execution plan over the Union of s1 … sn combining ExtractMbyName with FindPotentialNameLists and ExtractMStrict]
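
A small sketch of the patching idea, under stated assumptions: the name-list heuristic (three or more comma-separated "F. Last" items), the helper names, and the strict rule are hypothetical stand-ins for FindPotentialNameLists and ExtractMStrict, whose actual definitions the slides do not give.

```python
import re

# Assumed heuristic: a run of three or more "F. Last" items separated by commas.
INITIAL_NAME = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{INITIAL_NAME}(?:, {INITIAL_NAME}){{2,}}")


def loose_variants(full_name):
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return [full_name, f"{first[0]}. {last}", f"{last}, {first[0]}."]


def extract_loose(text, dictionary):
    """Basic plan: any variation found anywhere counts as a mention."""
    return [(name, v) for name in dictionary
            for v in loose_variants(name) if v in text]


def find_potential_name_lists(text):
    """Return (start, end) spans that look like lists of names."""
    return [m.span() for m in NAME_LIST.finditer(text)]


def extract_strict(list_text, dictionary):
    """Stricter plan: split the list on commas first, then require an item
    to equal the 'F. Last' variation exactly."""
    items = [item.strip() for item in list_text.split(",")]
    found = []
    for name in dictionary:
        parts = name.split()
        short = f"{parts[0][0]}. {parts[-1]}"
        if short in items:
            found.append((name, short))
    return found


def extract_patched(text, dictionary):
    """Use the strict plan inside name lists and the basic plan elsewhere."""
    mentions, cursor = [], 0
    for start, end in find_potential_name_lists(text):
        mentions += extract_loose(text[cursor:start], dictionary)
        mentions += extract_strict(text[start:end], dictionary)
        cursor = end
    mentions += extract_loose(text[cursor:], dictionary)
    return mentions


text = "PC members: R. Miller, D. Smith, B. Jones"
dictionary = ["David Miller", "David Smith"]
print(extract_loose(text, dictionary))    # includes the spurious "Miller, D."
print(extract_patched(text, dictionary))  # only "D. Smith" survives
```

The basic plan stays in place for most text; only the spans identified as risky pay the cost of the stricter extractor.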


Matching Entity Mentions

Key idea: reasonable plan, then patch

Reasonable plan:
  if mention names are the same (modulo some variation), match
  e.g., David Smith and D. Smith

[Figure: execution plan applying MatchMbyName to the output of the Extract Plan over the Union of s1 … sn]

Works well, but can’t handle certain difficult spots
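
As an illustration of name-based matching, the sketch below reduces each mention to a (first initial, last name) key and groups mentions that share a key. The normalization rule and the function names are assumptions for this example; it only handles the forms "David Smith", "D. Smith", and "Smith, D.".

```python
import re
from collections import defaultdict


def name_key(mention):
    """Reduce a mention to (first initial, last name), lowercased."""
    text = mention.strip()
    if "," in text:  # 'Smith, D.' -> 'D. Smith'
        last, first = [part.strip() for part in text.split(",", 1)]
        text = f"{first} {last}"
    tokens = re.findall(r"[A-Za-z]+", text)
    return (tokens[0][0].lower(), tokens[-1].lower())


def match_m_by_name(mentions):
    """Group mentions whose names agree modulo the variations above."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[name_key(mention)].append(mention)
    return list(groups.values())


mentions = ["David Smith", "D. Smith", "Smith, D.", "B. Jones"]
print(match_m_by_name(mentions))
```

Note that this key would also merge two different people who happen to share a first initial and last name, which is one kind of difficult spot a stricter matcher has to handle.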


Handling Difficult Spots

Estimate the semantic ambiguity of data sources
  use social networking techniques related to cohesion of graphs [see ICDE-07a]

Apply stricter matchers to more ambiguous sources

[Figure: execution plan with MatchMbyName over the Extract Plan for {s1 … sn} \ DBLP and MatchMStrict over the Extract Plan for DBLP, combined by Union]

Example of an ambiguous source, DBLP: Chen Li
  · · ·
  41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
  · · ·
  38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
  · · ·
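
A hedged sketch of choosing a matcher per source: mentions from sources flagged as ambiguous must share a name and at least one co-author, while other mentions match on name alone. The record layout, the co-author rule, and the hard-coded set of ambiguous sources are assumptions; the ambiguity estimate of [ICDE-07a] is not reproduced here.

```python
def match_by_name(a, b):
    """Basic matcher: identical names are treated as the same person."""
    return a["name"] == b["name"]


def match_strict(a, b):
    """Stricter matcher (assumed rule): same name and a shared co-author."""
    return a["name"] == b["name"] and bool(a["coauthors"] & b["coauthors"])


def same_entity(a, b, ambiguous_sources):
    """Use the strict matcher if either mention comes from an ambiguous source."""
    strict = a["source"] in ambiguous_sources or b["source"] in ambiguous_sources
    matcher = match_strict if strict else match_by_name
    return matcher(a, b)


# The two "Chen Li" entries from the DBLP listing in the slide.
m1 = {"source": "DBLP", "name": "Chen Li",
      "coauthors": {"Bin Wang", "Xiaochun Yang"}}
m2 = {"source": "DBLP", "name": "Chen Li",
      "coauthors": {"Ping-Qi Pan", "Jian-Feng Hu"}}

print(same_entity(m1, m2, ambiguous_sources={"DBLP"}))  # strict matcher
print(same_entity(m1, m2, ambiguous_sources=set()))     # name-only matcher
```

With DBLP flagged as ambiguous, the two entries are kept apart because they share no co-authors; without the flag, name matching alone would merge them.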


Summary: Extraction and Integration

Most current solutions
  try to find a single good plan, applied to all of the data

Cimple solution: reasonable plan, then patch

So the focus shifts to:
  how to find a reasonable plan?
  how to detect problematic data spots?
  how to patch those?

Need a notion of semantic ambiguity
  different from the notion of source relevance


3. Detecting Problems and Making Corrections

[Figure: Cimple architecture diagram, repeated: data sources → ER graph → services, with source maintenance and mass collaboration]


How to Detect Problems?

After extraction and matching, build services
  e.g., superhomepages

Many such homepages contain minor problems, e.g.:
  X graduated in 19998
  X chairs SIGMOD-05 and VLDB-05
  X published 5 SIGMOD-03 papers

Intuitively, something is semantically incorrect

To fix this, build a Semantic Debugger
  learns what is a normal profile for researcher, paper, etc.
  alerts the builder to potentially buggy superhomepages
  so corrections / feedback can be provided
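
To make the Semantic Debugger idea concrete, here is a deliberately crude sketch: it learns a min/max range per numeric attribute from records assumed to be trustworthy, then flags values outside that range. The attribute names, the trusted records, and the range-based notion of a "normal profile" are all assumptions; the slides do not say how the actual debugger learns profiles.

```python
def learn_profile(trusted_records):
    """Learn a 'normal' [low, high] range per attribute from trusted records."""
    ranges = {}
    for record in trusted_records:
        for attr, value in record.items():
            low, high = ranges.get(attr, (value, value))
            ranges[attr] = (min(low, value), max(high, value))
    return ranges


def debug_record(record, ranges):
    """Return a warning for every attribute outside its learned range."""
    warnings = []
    for attr, value in record.items():
        if attr in ranges:
            low, high = ranges[attr]
            if not low <= value <= high:
                warnings.append(
                    f"{attr}={value} outside normal range [{low}, {high}]")
    return warnings


# Hypothetical trusted superhomepage data.
trusted = [
    {"grad_year": 1985, "sigmod_papers_per_year": 1, "pc_chairs_per_year": 0},
    {"grad_year": 2003, "sigmod_papers_per_year": 2, "pc_chairs_per_year": 1},
]
profile = learn_profile(trusted)

# A suspicious record modeled on the slide's examples.
suspect = {"grad_year": 19998, "sigmod_papers_per_year": 5, "pc_chairs_per_year": 2}
for warning in debug_record(suspect, profile):
    print(warning)
```

The warnings are only alerts for the portal builder; whether a flagged value is really wrong is left to human feedback.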


What Types of Feedback?

Say that a certain data item Y is wrong

Provide correct value for Y, e.g., Y = SIGMOD-06

Add domain knowledge
  e.g., no researcher has ever published 5 SIGMOD papers in a year

Add more data
  e.g., X was advised by Z
  e.g., here is the URL of another data source

Modify the underlying algorithm
  e.g., pull out all data involving X
  match using names and co-authors, not just names


How to Make Providing Feedback Very Easy?

Extremely crucial in DBLife context

If feedback can be provided easily
  can get more feedback
  can leverage the mass of users


How to Make Providing Feedback Very Easy?

Types of feedback:
  Say that a certain data item Y is wrong
  Provide correct value for Y, e.g., Y = SIGMOD-06
  Add domain knowledge
  Add more data
  Modify the underlying algorithm

Possible interfaces:
  Provide form interfaces
  Provide a Wiki interface

Status:
  Critical but unsolved
  Unsolved: some recent interest on how to mass customize software


Summary: Detection and Feedback

How to detect problems?
  Semantic Debugger

What types of feedback & how to easily provide them?
  critical, largely unsolved

What feedback would make most impact?
  crucial in large-scale systems
  need a notion of a Feedback Advisor
  need a precise notion of system quality


4. Mass Collaboration

[Figure: Cimple architecture diagram, repeated: data sources → ER graph → services, with maintenance and expansion and mass collaboration]


Mass Collaboration: Voting

Can be applied to numerous problems


Example: Matching

Hard for machine, but easy for human

Mouse for Dell laptop 200 series ...

Dell X200; mouse at reduced price ...

Dell laptop X200 with mouse ...
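
Voting on a matching question can be sketched as a simple majority rule. The vote labels, the minimum vote count, and the tie handling below are assumptions chosen for illustration; the slides do not specify how votes are aggregated.

```python
from collections import Counter


def majority_vote(votes, min_votes=3):
    """Return the winning answer, or None when there are too few votes
    or the top two answers are tied."""
    if len(votes) < min_votes:
        return None
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical user votes on whether two product listings refer to the same item.
pair = ("Dell X200; mouse at reduced price", "Dell laptop X200 with mouse")
votes = ["no match", "no match", "match"]
print(pair, "->", majority_vote(votes))
```

Returning None for undecided cases lets the system keep asking users instead of committing to a low-confidence answer.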


Mass Collaboration: Wiki

Community wikipedia
  built by machine + human
  backed up by a structured database

[Figure: diagram with nodes DataSources, G, T, V1–V3, W1–W3, u1, V3’, W3’, T3’, and M]


Mass Collaboration: Wiki

[Figure: three successive versions of a wiki page's source, with renderings of the first and last; edits are labeled Machine or Human]

Page source, version 1:

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>

Rendered:

David J. DeWitt
Professor
Interests: Parallel Database

Page source, version 2:

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>

Page source, version 3:

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW-Madison #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>

Rendered:

David J. DeWitt
John P. Morgridge Professor
UW-Madison since 1976
Interests: Parallel Database
Privacy
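
The markup above suggests that each `<# ... #>` field ties a piece of page text to a path in the structured database. The sketch below parses such fields and renders a page by keeping only the values. The grammar (`entity(id=N)` steps joined by dots, then `{attribute}=value`) is inferred from the slide's example alone, and the function names are hypothetical.

```python
import re

# One field: <# path{attribute}=value #>, with optional whitespace around parts.
FIELD = re.compile(
    r"<#\s*(?P<path>.+?)\s*\{(?P<attr>\w+)\}\s*=\s*(?P<value>.*?)\s*#>",
    re.DOTALL,
)
# One step of a path, e.g. person(id=1).
STEP = re.compile(r"(\w+)\(id=(\d+)\)")


def parse_fields(markup):
    """Return (path, attribute, value) for every structured field."""
    fields = []
    for m in FIELD.finditer(markup):
        path = [(name, int(num)) for name, num in STEP.findall(m.group("path"))]
        fields.append((path, m.group("attr"), m.group("value")))
    return fields


def render(markup):
    """Replace each field with its value, leaving free text and HTML alone."""
    return FIELD.sub(lambda m: m.group("value"), markup)


markup = (
    "<# person(id=1){name}=David J. DeWitt #>\n"
    "<# person(id=1){title}=John P. Morgridge Professor #>\n"
    "<# person(id=1){organization}=UW-Madison #> since 1976\n"
    "<strong>Interests:</strong>"
    "<# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>"
)

for path, attr, value in parse_fields(markup):
    print(path, attr, value)
print(render(markup))
```

Keeping the path alongside each value is what would let an edit to the page text be written back to the right attribute in the database, while free text such as "since 1976" stays in the page only.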


Summary: Mass Collaboration

What can users contribute?

How to evaluate user quality?

How to reconcile inconsistent data?


Summary: Cimple

A very interesting attempt to rethink Web crawling and information extraction

Based on a “best-effort” notion
  one of many concurrent efforts in that vein (“Dataspaces”)

Simple building blocks, progressive refinement


Open Questions and Issues

Incorporating uncertain data

Can we really scale to, and manage the complexity of, many extractors that may produce data of differing quality?

How do we provide feedback, particularly when there are many learners and both data and feedback are very sparse?

Others?
