
Query Processing in Data Integration + a (corny) ending 4/30 Project 3 due today Demos (to the TA) as scheduled FHW+presentation due 5/8 Agenda today:


Query Processing in Data Integration + a (corny) ending

4/30

• Project 3 due today
• Demos (to the TA) as scheduled
• FHW+presentation due 5/8

Agenda today:
• 3:15–3:30: Soft Joins
• 3:30–4:00: Query processing in data integration
• 4:00–4:30: End review


May 8th 2:40–4:30pm

• Each student gives a 5 min presentation
  – 19x5=95min
  – Also bring a hard copy of the review with you

• 15min buffer + wrapup

• I’ll get refreshments; you keep us all awake.


• What is the problem that the paper is addressing?

• Why is the problem interesting?

• How is it related to what we learned in the class?

• What is the solution that the authors propose?

• What is your criticism of the solution presented?


Course Outcomes

• After this course, you should be able to answer:
  – How search engines work and why some are better than others
  – Can the web be seen as a collection of (semi)structured databases?
    • If so, can we adapt database technology to the Web?
  – Can useful patterns be mined from the pages/data of the web?

What did you think these were going to be??

CSE 494/598 Information Retrieval, Mining and Integration on the Internet

about XML/XQuery/RDF

Hello, Subbarao Kambhampati. We have recommendations for you.

REVIEW


Main Topics

• Approximately three halves plus a bit:
  – Information retrieval
  – Information integration/Aggregation
  – Information mining
  – other topics as permitted by time

REVIEW


Adapting old disciplines for Web-age

• Information (text) retrieval
  – Scale of the web
  – Hyper text/ Link structure
  – Authority/hub computations
• Databases
  – Multiple databases
    • Heterogeneous, access limited, partially overlapping
  – Network (un)reliability
• Datamining [Machine Learning/Statistics/Databases]
  – Learning patterns from large scale data

REVIEW


Topics Covered

• Introduction (1)
• Text retrieval; vectorspace ranking (3)
• Correlation analysis & Latent Semantic Indexing (2)
• Indexing; Crawling; Exploiting tags in web pages (2)
• Social Network Analysis (2)
• Link Analysis in Web Search (A/H; Pagerank) (3+)
• Clustering (2)
• Text Classification (1)
• Filtering/Recommender Systems (2)
• Why do we even care about databases in the context of web (1)
• XML and handling semi-structured data + Semantic Web standards (3)
• Information Extraction (2)
• Information/data Integration (2+)

Discussion Classes: ~3+


Finding “Sweet Spots” in computer-mediated cooperative work

• It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop
  – All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
  – …and the human very gratefully does the in-depth analysis on those few potential solutions
• Examples:
  – The incredible success of “Bag of Words” model!
    • Bag of letters would be a disaster ;-)
    • Bag of sentences and/or NLP would be good
      – ..but only to your discriminating and irascible searchers ;-)
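To make the “Bag of Words” point concrete, here is a minimal sketch of a ranker that ignores word order entirely and still surfaces plausible candidates for a human to inspect. The function names (`bag_of_words`, `cosine_similarity`, `rank`) and the raw-count weighting are illustrative assumptions, not something taken from the slides; a real engine would add tf-idf weighting, stemming, and so on.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Lowercase the text and count word occurrences, discarding word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, best match first (hypothetical helper)."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Note that "dog bites man" and "man bites dog" get identical representations here: the computer only pre-filters, and the semantic judgment is left to the human reading the few top results.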


Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs

• A lot of exciting research related to web currently involves “co-opting” the masses to help with large-scale tasks
  – It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there is ever one ;-)
    • Remember the mice in the Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
  – Collaborative knowledge compilation (wikipedia!)
  – Collaborative Curation
  – Collaborative tagging
  – Paid collaboration/contracting
• Many big open issues
  – How do you pose the problem such that it can be solved using collaborative computing?
  – How do you “incentivize” people into letting you steal their brain cycles?


Tapping into the Collective Unconscious AKA “Wisdom of the Crowds”

• Another thread of exciting research is driven by the realization that WEB is not random at all!
  – It is written by humans
  – …so analyzing its structure and content allows us to tap into the collective unconscious ..
    • Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
• Examples:
  – Analyzing term co-occurrences in the web-scale corpora to capture semantic information (today’s paper)
  – Analyzing the link-structure of the web graph to discover communities
    • DoD and NSA are very much into this as a way of breaking terrorist cells
  – Analyzing the transaction patterns of customers (collaborative filtering)
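As a rough illustration of meaning emerging from co-occurrence, the sketch below scores term pairs by pointwise mutual information (PMI) over document-level co-occurrence. PMI is an assumption chosen for illustration; the slides do not say which measure the paper under discussion uses, and the function name `cooccurrence_pmi` is hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_pmi(documents):
    """Map each co-occurring term pair (a, b), a < b, to its PMI.

    PMI = log( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
    as the fraction of documents containing the term(s).
    """
    n_docs = len(documents)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {
        (a, b): math.log((joint * n_docs) / (term_df[a] * term_df[b]))
        for (a, b), joint in pair_df.items()
    }
```

On a toy corpus such as `["jaguar car engine", "jaguar car dealer", "jaguar cat jungle", "tiger cat jungle"]`, the pair ("cat", "jungle") scores higher than ("cat", "jaguar"), purely from counting which words appear together.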


If you don’t take Autonomous/Adversarial Nature of the Web into account, then it is gonna getcha..

• Most “first-generation” ideas of web make too generous an assumption of the “good intentions” of the source/page/email creators. The reasonableness of this assumption is increasingly going to be called into question as Web evolves in an uncontrolled manner…

• Controlling creation rights removes the very essence of scalability of the web. Instead we have to factor in adversarial nature..
  – Links can be manipulated to change page importance
    • So we need “trust rank”
  – Fake annotations can be added to pages and images
    • So we need ESP-game like self-correcting annotations..
  – Fake/spam mails can be sent (and the nature of the spam mails can be altered to defeat simple spam classifiers…)
    • So we need adversarial classification techniques
  – Fake pages (in large numbers) can be created and put on the web (although, as of now, I don’t yet see the economic motive for this)
    • So we can not see web as the collective unconscious.. and co-occurrence may not imply semantic proximity.
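The claim that links can be manipulated to change page importance can be sketched with a plain PageRank power iteration and a small link farm. This is textbook PageRank, not TrustRank; the graph, the page names, and the `pagerank` helper are made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    `links` maps each page to the list of pages it links to. Pages without
    outgoing links spread their rank evenly over all pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# A tiny "honest" web: nobody links to page t or page c.
honest = {"a": ["b"], "b": ["a"], "c": ["a"], "t": ["a"]}

# The same web after t's owner adds five farm pages that all link to t.
farmed = dict(honest)
farmed.update({f"farm{i}": ["t"] for i in range(5)})

rank_before = pagerank(honest)
rank_after = pagerank(farmed)
```

Before the farm is added, `t` and `c` receive the same score; afterwards `t` outranks `c` although its content has not changed. That is the kind of manipulation a trust-aware ranking would have to discount.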


Anatomy may be likened to a harvest-field.

• First come the reapers, who, entering upon untrodden ground, cut down great store of corn from all sides of them. These are the early anatomists of Europe

• Then come the gleaners, who gather up ears enough from the bare ridges to make a few loaves of bread. Such were the anatomists of last century.

• Last of all come the geese, who still contrive to pick up a few grains scattered here and there among the stubble, and waddle home in the evening, poor things, cackling with joy because of their success.

Gentlemen, we are the geese.
-- John Barclay, English Anatomist


Information Integration on Web still rife with uncut corn

• Unlike anatomy of Barclay’s day, Web is still young. We are just figuring out how to tap its potential

• …You have great stores of uncut corn in front of you.

• ……