Keywords Selection Problem in Hidden Web Crawling Ka Cheung Sia, Richard March 15 2004

Page 1

Keywords Selection Problem in Hidden Web Crawling

Ka Cheung Sia, Richard

March 15 2004

Page 2

Agenda

  • What is Hidden Web?
  • How to crawl the Hidden Web?
  • Problem formalization
  • Searching for “best” keyword
      • Greedy
      • Tree searching
      • Pruning
  • Experiments & results
  • Conclusion

Page 3

What is Hidden Web?

  • Hidden
      • Unreachable by following hyperlinks
      • Dynamically generated
      • Accessible only through a search interface
  • Informative
  • Examples
      • http://citeseer.ist.psu.edu/ – CS research papers
      • http://www.pubmed.org – medical research papers
      • http://catalog.loc.gov – Library of Congress

Page 4

What is Hidden Web?

Search interface

http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1

Page 5

What is Hidden Web?

Result

Page 6

What is Hidden Web?

Document

Page 7

How to crawl the Hidden Web

http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1

[Figure labels: Figure out a keyword; Hidden Web; Query; Result; Our task]
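A minimal sketch of how a crawler might turn a chosen keyword into the query URL shown on this slide. The base URL and parameter names are copied from the slide's example; the helper name `build_query_url` is hypothetical, and whether the site still accepts these parameters is not something the slides confirm.

```python
from urllib.parse import urlencode

# Base URL and parameter names are taken from the example URL on the slide.
BASE_URL = "http://citeseer.ist.psu.edu/cis"


def build_query_url(keywords: str) -> str:
    """Return the search URL for a keyword string (hypothetical helper)."""
    params = {"q": keywords, "submit": "Search Documents", "cs": 1}
    # urlencode escapes spaces as "+", matching the slide's URL.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query_url("heuristic search"))
```

The crawler's remaining work is deciding which keyword to pass in next, which is the problem the following slides formalize.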

Page 8

Problem formalization

Set-cover

  • Vertices – documents
  • Hyper-edges – query words

Page 9

Goal

Maximize the number of unique documents retrieved with the minimum number of query words
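To make the set-cover view concrete, here is a small sketch in which each query word is a hyper-edge over document ids. The words, document ids, and the helper `covered` are made up for illustration and do not come from the slides.

```python
# Toy hypergraph: each query word (hyper-edge) maps to the set of
# document ids (vertices) it retrieves. All data here is invented.
edges = {
    "linux": {1, 2, 3},
    "debian": {1, 3},
    "kernel": {3, 4},
}


def covered(query_words):
    """Unique documents retrieved by issuing every word in query_words."""
    docs = set()
    for word in query_words:
        docs |= edges[word]
    return docs


# Two queries each, but different coverage: the choice of words matters.
print(len(covered(["linux", "kernel"])))  # 4
print(len(covered(["linux", "debian"])))  # 3
```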

Page 10

Problem formalization

  • P(q_i): portion of unique documents retrieved by issuing query word q_i (portion of documents containing “q_i”)
  • P(q_i v q_j): portion of unique documents retrieved by issuing query words q_i and q_j (portion of documents containing q_i or q_j)
  • P(q_i | q_j): portion of documents containing q_i in the set of documents retrieved by issuing query word q_j
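The three quantities can be computed directly when the whole collection is known. The sketch below assumes a toy inverted index and a collection size `N`; both are invented, and in real crawling only the conditional quantity is observable from documents already retrieved.

```python
from fractions import Fraction

# Invented toy data: word -> ids of documents containing it.
index = {"linux": {1, 2, 3}, "debian": {1, 3}, "kernel": {3, 4}}
N = 5  # assumed collection size; document 5 contains none of the words


def p(word):
    """P(q): portion of all documents containing the word."""
    return Fraction(len(index[word]), N)


def p_or(a, b):
    """P(a v b): portion of all documents containing a or b."""
    return Fraction(len(index[a] | index[b]), N)


def p_given(a, b):
    """P(a | b): portion of the documents retrieved by b that contain a."""
    return Fraction(len(index[a] & index[b]), len(index[b]))


print(p("linux"), p_or("linux", "kernel"), p_given("debian", "linux"))
```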

Page 11

Problem formalization

What is the next “best” query word?

P((q_1 v … v q_{i-1}) v q_i)
  = P(q_1 v … v q_{i-1}) + P(q_i) – P((q_1 v … v q_{i-1}) ^ q_i)
  = P(q_1 v … v q_{i-1}) + P(q_i) – P(q_1 v … v q_{i-1}) P(q_i | q_1 v … v q_{i-1})

  • P(q_1 v … v q_{i-1}) – known
  • P(q_i | q_1 v … v q_{i-1}) – known
  • P(q_i) – unknown

Approximate P(q_i) using P(q_i | q_1 v … v q_{i-1})
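Substituting the approximation into the expansion gives an estimate that needs only the two known quantities. The function below is a sketch of that arithmetic; its name and the sample numbers are illustrative, not from the slides.

```python
import math


def estimated_coverage(p_base: float, p_cond: float) -> float:
    """Estimate P(base v q_i) from P(base) and P(q_i | base).

    Uses P(base v q_i) = P(base) + P(q_i) - P(base) * P(q_i | base)
    with the unknown P(q_i) replaced by P(q_i | base).
    """
    return p_base + p_cond - p_base * p_cond


# If 40% of the collection is retrieved and half of those documents
# contain the candidate word, the estimated coverage after the query is 70%.
print(math.isclose(estimated_coverage(0.4, 0.5), 0.7))
```

Since P(base) is fixed when comparing candidates, the estimate grows with P(q_i | base), so ranking candidates by this estimate is the same as ranking them by their frequency in the retrieved documents.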

Page 12

Search for best query word

Greedy: choose the most frequently occurring word so far to be the query

  • Choose q_i with maximum P(q_i | q_1 v … v q_{i-1})
  • For the set-cover problem, greedy is proven to obtain a solution within a logarithmic factor of the optimal one
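A minimal sketch of the greedy policy, assuming the hidden site can be simulated by a dictionary from document id to the set of words in that document. The data, the function name `greedy_crawl`, the seed word, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter


def greedy_crawl(docs, seed, max_queries):
    """Issue the seed, then repeatedly query the word that occurs in the
    most already-retrieved documents (docs: doc_id -> set of words)."""
    issued, retrieved = [], set()
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        # Simulated search interface: documents containing the query word.
        retrieved |= {d for d, words in docs.items() if query in words}
        # Document frequency of each unissued word among retrieved documents.
        freq = Counter(w for d in retrieved for w in docs[d] if w not in issued)
        # Ties are broken alphabetically so the run is deterministic.
        query = max(sorted(freq), key=freq.get) if freq else None
    return issued, retrieved


docs = {
    1: {"linux", "debian"},
    2: {"linux", "redhat"},
    3: {"linux", "debian", "kernel"},
    4: {"windows", "kernel"},
}
print(greedy_crawl(docs, "linux", 3))
```

In this toy run the second query, "debian", returns no new documents: the most frequent word among retrieved documents is not always the word with the largest gain, because the frequency is only an approximation of P(q_i).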

Page 13

Search for best query word

Can we do better?

Intuition: correlation of keywords

  • E.g. linux: debian, redhat, suse, knoppix, fedora, etc.
  • We might save the query word “linux”!

Page 14

Search for best query word

[Figure labels: whole document collection; already retrieved documents; documents retrieved by q_i; documents retrieved by q_j; documents retrieved by q_k]

Page 15

Search for best query word

[Figure labels: linux; debian; redhat]

f(x) = number of documents we get by issuing the queries linux, debian, and redhat, minus the overlap between “redhat, linux”, “debian, linux”, and “redhat, debian”
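The sketch below computes f for three invented result sets and compares it with the exact number of unique documents. The data and the name `pairwise_estimate` are assumptions for illustration.

```python
from itertools import combinations


def pairwise_estimate(results):
    """Sum of result-set sizes minus every pairwise overlap."""
    total = sum(len(docs) for docs in results.values())
    overlap = sum(len(a & b) for a, b in combinations(results.values(), 2))
    return total - overlap


# Invented result sets (document ids) for the three queries.
results = {
    "linux": {1, 2, 3, 4},
    "debian": {1, 2, 5},
    "redhat": {2, 3, 6},
}

exact = len(set().union(*results.values()))
print(pairwise_estimate(results), exact)  # 5 6
```

The two numbers differ by the number of documents that all three queries share (document 2 here), since subtracting pairwise overlaps removes such documents once too often.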

Page 16

Search for best query word

  • The search tree is huge (branching factor)
  • We look ahead for the 10 most frequent keywords
  • We only search up to depth 6
  • Pruning
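A sketch of a depth-limited, exhaustive look-ahead over a fixed candidate list. It assumes the result set of every candidate is already known, which is a simplification: the slides estimate these quantities from retrieved documents. The name `dfs_lookahead` and the data are illustrative.

```python
def dfs_lookahead(candidates, max_depth):
    """Try every ordered sequence of up to max_depth distinct candidates.

    candidates: word -> set of document ids the word would retrieve.
    Returns (best word sequence, documents covered, nodes examined).
    """
    best = {"covered": 0, "words": ()}
    nodes = 0

    def visit(words, covered):
        nonlocal nodes
        for word, docs in candidates.items():
            if word in words:
                continue
            nodes += 1  # one node per (prefix, next word) pair
            new_words = words + (word,)
            new_covered = covered | docs
            if len(new_covered) > best["covered"]:
                best["covered"], best["words"] = len(new_covered), new_words
            if len(new_words) < max_depth:
                visit(new_words, new_covered)

    visit((), frozenset())
    return best["words"], best["covered"], nodes


candidates = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
print(dfs_lookahead(candidates, 2))
```

With 3 candidates and depth 2 the search examines 3 + 3*2 = 9 nodes; the count grows quickly with more candidates and greater depth, which is why pruning matters.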

Page 17

Search for best query word

DFBnB (depth-first branch and bound)

Prune any sub-tree where the sum of documents retrieved, assuming no overlap between keywords, is less than the current best solution
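A sketch of depth-first branch and bound with the optimistic bound described on this slide. It assumes every candidate's result set is known in advance, expands larger sets first, and also prunes when the bound merely equals the current best; those choices, the name `dfbnb`, and the data are assumptions, not details from the slides.

```python
def dfbnb(candidates, max_depth):
    """Depth-first search over keyword sequences with pruning.

    candidates: word -> set of document ids the word would retrieve.
    Returns (best word sequence, documents covered, nodes examined).
    """
    sizes = {word: len(docs) for word, docs in candidates.items()}
    best = {"covered": 0, "words": ()}
    nodes = 0

    def visit(words, covered):
        nonlocal nodes
        remaining = [w for w in candidates if w not in words]
        slots = max_depth - len(words)
        # Optimistic bound: the largest remaining result sets, no overlap.
        largest = sorted((sizes[w] for w in remaining), reverse=True)[:slots]
        if len(covered) + sum(largest) <= best["covered"]:
            return  # this sub-tree cannot beat the current best
        for word in sorted(remaining, key=sizes.get, reverse=True):
            nodes += 1
            new_words = words + (word,)
            new_covered = covered | candidates[word]
            if len(new_covered) > best["covered"]:
                best["covered"], best["words"] = len(new_covered), new_words
            if len(new_words) < max_depth:
                visit(new_words, new_covered)

    visit((), frozenset())
    return best["words"], best["covered"], nodes


candidates = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
print(dfbnb(candidates, 2))
```

On this toy input the search still finds the best coverage of 3 documents while examining fewer than the 3 + 3*2 = 9 nodes an unpruned search would visit.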

Page 18

Experiment

  • Document collection: ~100K front pages of randomly selected websites
  • Query interface: an inverted index (a program that returns documents containing the given query word)
  • Methods
      • Greedy
      • DFS search (look ahead for 10 words, up to depth 6)
      • DFS search with pruning (DFBnB)
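A minimal sketch of an inverted index acting as the query interface. The whitespace tokenization, lowercasing, page texts, and function names are assumptions; the slides do not describe how the experimental index was built.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercased word to the ids of the pages containing it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def query(index, word):
    """Return the ids of pages containing the word (empty set if none)."""
    return index.get(word.lower(), set())


# Invented stand-ins for website front pages.
pages = {1: "Linux kernel news", 2: "Debian Linux", 3: "Privacy policy"}
index = build_inverted_index(pages)
print(query(index, "linux"))
```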

Page 19

Results

Does searching help?

Keyword list 1 (keyword, count):
provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77

Keyword list 2 (keyword, count):
provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77

Page 20

Results

Does searching help?

Page 21

Results

How much does pruning save?

  • Without pruning – 187300 nodes are examined:
    187300 = (10) + (10*9) + (10*9*8) + (10*9*8*7) + (10*9*8*7*6) + (10*9*8*7*6*5)
  • With pruning – 5558 nodes are examined on average (when we choose the most frequent keyword to expand)
  • DFBnB examines roughly 30 times fewer nodes
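The node count and the savings ratio can be checked with a few lines of arithmetic. The sketch assumes Python 3.8 or later for `math.perm`; the figure 5558 is taken from the slide.

```python
import math

# Ordered sequences of 1 to 6 distinct keywords chosen from 10 candidates.
nodes_without_pruning = sum(math.perm(10, depth) for depth in range(1, 7))
print(nodes_without_pruning)  # 187300

# Average node count with pruning, as reported on the slide.
nodes_with_pruning = 5558
print(round(nodes_without_pruning / nodes_with_pruning, 1))  # 33.7
```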

Page 22

Conclusion

  • Searching helps little “in this problem”
  • DFBnB is “really effective” in pruning the search tree

Page 23

End

Page 24

More results

Prior information helps

Page 25

Results

Page 26

Results

Page 27

Search & Greedy

Page 28

Search with prune & Greedy

Page 29

Search for best query word

base = q_1 v … v q_i

P(base v q_{i+1} v q_{i+2})
  = P(base v q_{i+1}) + P(q_{i+2}) – P((base v q_{i+1}) ^ q_{i+2})

P((base v q_{i+1}) ^ q_{i+2})
  = P((base ^ q_{i+2}) v (q_{i+1} ^ q_{i+2}))
  = P(base ^ q_{i+2}) + P(q_{i+1} ^ q_{i+2}) – P(base ^ q_{i+1} ^ q_{i+2})
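The two-step expansion can be sanity-checked on small sets, using document counts in place of probabilities. The sets and the function name `union_size` are invented for the example.

```python
def union_size(base, q1, q2):
    """Size of base v q1 v q2 computed through the expansion above."""
    # |(base v q1) ^ q2| = |base ^ q2| + |q1 ^ q2| - |base ^ q1 ^ q2|
    overlap = len(base & q2) + len(q1 & q2) - len(base & q1 & q2)
    # |base v q1 v q2| = |base v q1| + |q2| - |(base v q1) ^ q2|
    return len(base | q1) + len(q2) - overlap


# Invented document-id sets; document 3 is shared by all three.
base, q1, q2 = {1, 2, 3}, {3, 4, 5}, {3, 5, 6}
print(union_size(base, q1, q2), len(base | q1 | q2))  # 6 6
```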

Page 30

2 words overlapping

Page 31

3 words overlapping