Web Search – Summer Term 2006 VII. Web Search - Indexing: Structure Index (c) Wolfgang Hürst, Albert-Ludwigs-University


Web Search – Summer Term 2006

VII. Web Search - Indexing: Structure Index

(c) Wolfgang Hürst, Albert-Ludwigs-University


Structure Index (Links)

Represents the links between the indexed pages

Important for:
- Relevance calculation (PageRank, HITS, ...)
- Crawling (importance metrics, ...)
- Some other applications (Web mining, etc.)

Most critical issues (again):
- Size and rate of change

Most important requirements:
- Reduce space / compression
- Support required operations (random and streaming access, add / delete)
- Speed


The Web Graph

The structure index represents the web graph:
- Node = web page
- Directed edge = link

[Figure: example web graph with three nodes 1, 2, 3]

Common representation techniques for graphs:
a) Adjacency matrix
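As a rough illustration of the adjacency matrix, here is a minimal Python sketch. The edge set is an assumption made up for this example (the links of the three-node figure are not recoverable from the slide), and `build_adjacency_matrix` is a hypothetical helper name.

```python
# Hypothetical 3-page web graph; the edges are an assumption for illustration.
EDGES = [(1, 2), (1, 3), (3, 2)]
NUM_PAGES = 3


def build_adjacency_matrix(num_pages, edges):
    """Return an n x n matrix with matrix[i][j] == 1 iff page i+1 links to page j+1."""
    matrix = [[0] * num_pages for _ in range(num_pages)]
    for source, target in edges:
        matrix[source - 1][target - 1] = 1
    return matrix


matrix = build_adjacency_matrix(NUM_PAGES, EDGES)
for row in matrix:
    print(row)
```

The matrix always needs n x n entries, no matter how few links exist, which is the main reason it scales poorly to a sparse graph the size of the web.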


Common representation techniques for graphs:
b) Adjacency list
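A minimal Python sketch of an adjacency list for a small graph follows. The edge set and the helper name `build_adjacency_list` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical 3-page web graph; the edges are an assumption for illustration.
EDGES = [(1, 2), (1, 3), (3, 2)]


def build_adjacency_list(edges):
    """Map each page to the list of pages it links to (its successors)."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
    return dict(successors)


adjacency = build_adjacency_list(EDGES)
print(adjacency)
```

Storage grows with the number of links rather than with the square of the number of pages; pages without outgoing links (page 2 here) need no entry at all.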


The Structure Index

Example: The Connectivity Server [3]

Based on a data structure that supports the following operations:

- Given a URL u (or a set of URLs U), return a list of pages that point to u (U), i.e. its predecessors (back links) and a list of pages that are pointed to from u (U), i.e. its successors (forward links)

- Given a set of URLs U and a distance, return the respective neighborhood of U in the graph
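The two operations above can be sketched in Python as follows. This is only an in-memory illustration, not the Connectivity Server's actual interface: the class `LinkGraph`, its method names, the example URLs, and the choice to follow links in both directions when computing the neighborhood are all assumptions.

```python
from collections import defaultdict, deque


class LinkGraph:
    """In-memory link graph keyed by URL (hypothetical helper for illustration)."""

    def __init__(self, edges):
        self.out_links = defaultdict(set)  # URL -> URLs it points to
        self.in_links = defaultdict(set)   # URL -> URLs pointing to it
        for source, target in edges:
            self.out_links[source].add(target)
            self.in_links[target].add(source)

    def successors(self, urls):
        """Forward links of a URL set U."""
        return sorted({t for u in urls for t in self.out_links.get(u, ())})

    def predecessors(self, urls):
        """Back links of a URL set U."""
        return sorted({s for u in urls for s in self.in_links.get(u, ())})

    def neighborhood(self, urls, distance):
        """All pages within `distance` links of U, ignoring link direction (assumption)."""
        seen = set(urls)
        frontier = deque((url, 0) for url in seen)
        while frontier:
            url, depth = frontier.popleft()
            if depth == distance:
                continue
            adjacent = self.out_links.get(url, set()) | self.in_links.get(url, set())
            for neighbor in adjacent:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return sorted(seen)


# Made-up example links.
graph = LinkGraph([
    ("a.example/", "b.example/"),
    ("d.example/", "b.example/"),
    ("b.example/", "c.example/"),
    ("c.example/", "e.example/"),
])
print(graph.predecessors(["b.example/"]))
print(graph.successors(["b.example/"]))
print(graph.neighborhood(["b.example/"], 1))
```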


The Connectivity Server

Nodes: Array (1 node = 1 element)

Edges:
- OUTLIST: Adjacency list (successors)
- INLIST: Inverted adjacency list (predecessors)

[Figure: NODE TABLE with one row per node, each row holding PTR TO URL (into the URL DATABASE), PTR TO INLIST (into the INLIST TABLE) and PTR TO OUTLIST (into the OUTLIST TABLE)]
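One way to picture this layout is a node table whose rows store offsets into two flat arrays. The Python sketch below is an assumption-laden illustration, not the server's actual storage format: node IDs are taken to be 0-based integers, "pointers" are array indices, and the sentinel row and helper names are invented for the example.

```python
from array import array


def build_tables(urls, edges):
    """Build a node table plus flat INLIST and OUTLIST tables.

    urls  -- list of URLs; a node's ID is its index in this list
    edges -- list of (source_id, target_id) pairs
    """
    out_lists = [[] for _ in urls]
    in_lists = [[] for _ in urls]
    for source, target in edges:
        out_lists[source].append(target)
        in_lists[target].append(source)

    node_table = []            # rows: (ptr to URL, ptr to inlist, ptr to outlist)
    inlist_table = array("I")
    outlist_table = array("I")
    for node in range(len(urls)):
        node_table.append((node, len(inlist_table), len(outlist_table)))
        inlist_table.extend(sorted(in_lists[node]))
        outlist_table.extend(sorted(out_lists[node]))
    # Sentinel row so that every list's end is the next row's start.
    node_table.append((len(urls), len(inlist_table), len(outlist_table)))
    return node_table, inlist_table, outlist_table


def predecessors(node_table, inlist_table, node):
    return list(inlist_table[node_table[node][1]:node_table[node + 1][1]])


def successors(node_table, outlist_table, node):
    return list(outlist_table[node_table[node][2]:node_table[node + 1][2]])


# Made-up example: three pages, links 0->1, 0->2, 2->1.
URLS = ["a.example/", "b.example/", "c.example/"]
node_table, inlist_table, outlist_table = build_tables(URLS, [(0, 1), (0, 2), (2, 1)])
print(successors(node_table, outlist_table, 0))
print(predecessors(node_table, inlist_table, 1))
```

Keeping both an INLIST and an OUTLIST doubles the edge storage but makes predecessor and successor lookups equally cheap.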


The Connectivity Server (cont.)

Additional data structure to map URLs to IDs (and vice versa)

ID = index in the lexicographically sorted list of all crawled URLs

Advantage: Neighboring URLs in the sorted list share long prefixes, which allows compression by delta encoding

Example:

ORIGINAL TEXT
WWW.FOOBAR.COM/
WWW.FOOBAR.COM/GANDALF.HTM
WWW.FOOGRAB.COM/

DELTA ENCODING (length of shared prefix, remaining suffix, ID)
0    WWW.FOOBAR.COM/    1
15   GANDALF.HTM        26
7    GRAB.COM/          41
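The delta encoding of the three example URLs can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch stores only the shared-prefix length and the suffix; the IDs shown on the slide are left out.

```python
def common_prefix_length(a, b):
    """Number of leading characters that a and b have in common."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def delta_encode(sorted_urls):
    """Encode each URL as (shared prefix length with previous URL, remaining suffix)."""
    encoded = []
    previous = ""
    for url in sorted_urls:
        shared = common_prefix_length(previous, url)
        encoded.append((shared, url[shared:]))
        previous = url
    return encoded


def delta_decode(encoded):
    """Rebuild the URL list; each URL depends on the one decoded before it."""
    urls = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        urls.append(previous)
    return urls


URLS = ["WWW.FOOBAR.COM/", "WWW.FOOBAR.COM/GANDALF.HTM", "WWW.FOOGRAB.COM/"]
encoded = delta_encode(URLS)
for entry in encoded:
    print(entry)
```

Decoding has to walk the list from the start, because every entry only makes sense relative to its predecessor.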


The Connectivity Server (cont.)

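Problem: Because of delta encoding, reconstructing a URL requires scanning all URLs before it (i.e. saves space at the cost of speed)

Solution: Include checkpoint URLs (stored in full), so that a scan can start at the nearest checkpoint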
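A minimal Python sketch of the checkpoint idea follows, assuming that every k-th URL is stored in full. The class name `CheckpointedUrlIndex`, the interval of 2, and the example URLs are assumptions chosen for illustration; a real system would pick a much larger interval.

```python
import bisect


def common_prefix_length(a, b):
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


class CheckpointedUrlIndex:
    """Delta-encoded sorted URL list with a full URL every `interval` entries."""

    def __init__(self, sorted_urls, interval):
        self.interval = interval
        self.entries = []      # (shared prefix length, suffix)
        self.checkpoints = []  # full URLs at positions 0, interval, 2*interval, ...
        previous = ""
        for index, url in enumerate(sorted_urls):
            if index % interval == 0:
                self.entries.append((0, url))  # checkpoint: stored in full
                self.checkpoints.append(url)
            else:
                shared = common_prefix_length(previous, url)
                self.entries.append((shared, url[shared:]))
            previous = url

    def url_for_id(self, url_id):
        """Decode by scanning forward from the nearest checkpoint only."""
        start = (url_id // self.interval) * self.interval
        url = ""
        for shared, suffix in self.entries[start:url_id + 1]:
            url = url[:shared] + suffix
        return url

    def id_for_url(self, url):
        """Binary search over checkpoints, then scan at most `interval` entries."""
        block = bisect.bisect_right(self.checkpoints, url) - 1
        if block < 0:
            return None
        start = block * self.interval
        current = ""
        for offset, (shared, suffix) in enumerate(self.entries[start:start + self.interval]):
            current = current[:shared] + suffix
            if current == url:
                return start + offset
        return None


# Made-up, lexicographically sorted example URLs.
URLS = [
    "www.foobar.com/",
    "www.foobar.com/gandalf.htm",
    "www.foobar.com/index.htm",
    "www.foograb.com/",
    "www.foograb.com/a.htm",
]
index = CheckpointedUrlIndex(URLS, interval=2)
print(index.url_for_id(3))
print(index.id_for_url("www.foograb.com/"))
```

The interval trades space for speed: a smaller interval stores more full URLs but shortens every scan.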

Another problem: Updates are hard to do

Several other (newer) approaches exist that take into account (e.g.) the actual web structure


S-Node Representation [4]

Observations on the web structure:

- Link copying: Lots of clusters with nodes containing very similar adjacency lists

- Domain and URL locality: A significant fraction of links on a page point to pages from the same domain

- Page similarity: Pages that have very similar adjacency lists are likely to be related

Idea: Make use of these observations, e.g. by grouping related pages / similar URLs


S-Node Representation - Example

PARTITION P = {N1, N2, N3}
N1 = {P1, P2}
N2 = {P3}
N3 = {P4, P5}

[Figure: web graph on pages 1-5, the INTRA-NODES Ni, and the SUPERNODE GRAPH on N1, N2, N3]

S-Node Representation - Example (cont.)

[Figure: same partition, INTRA-NODES Ni and SUPERNODE GRAPH as above, extended by the POSITIVE SUPEREDGES and NEGATIVE SUPEREDGES graphs between the supernodes]
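The pieces of the S-Node representation can be sketched in Python for the partition from the example. The page-level links below are made up, since the edges of the figure are not recoverable here. Following the description in [4] as far as it is understood here, a positive superedge graph lists the links that exist from Ni to Nj, a negative one lists the possible links that do not exist, and the smaller of the two is kept; `build_snode` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import product

# Partition from the example; the page-level links are an assumption.
PARTITION = {"N1": [1, 2], "N2": [3], "N3": [4, 5]}
EDGES = [(1, 2), (1, 3), (3, 4), (4, 5), (1, 4), (1, 5), (2, 4)]


def build_snode(partition, edges):
    owner = {page: name for name, pages in partition.items() for page in pages}
    intranode = defaultdict(set)  # Ni -> links between pages inside Ni
    positive = defaultdict(set)   # (Ni, Nj) -> existing links from Ni to Nj
    for source, target in edges:
        ni, nj = owner[source], owner[target]
        if ni == nj:
            intranode[ni].add((source, target))
        else:
            positive[(ni, nj)].add((source, target))

    # The supernode graph has an edge Ni -> Nj iff some page in Ni links into Nj.
    supernode_graph = sorted(positive)

    superedges = {}
    for (ni, nj), links in positive.items():
        all_pairs = set(product(partition[ni], partition[nj]))
        missing = all_pairs - links
        # Keep whichever description is smaller.
        if len(missing) < len(links):
            superedges[(ni, nj)] = ("negative", sorted(missing))
        else:
            superedges[(ni, nj)] = ("positive", sorted(links))
    return supernode_graph, dict(intranode), superedges


supernode_graph, intranode, superedges = build_snode(PARTITION, EDGES)
print(supernode_graph)
print(superedges)
```

In this made-up graph three of the four possible links from N1 to N3 exist, so recording the single missing link (2, 5) is cheaper than listing the three present ones.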


Creating partitions

1. Initial partition: Based on URL (top two levels of DNS), e.g.
   - www.informatik.uni-freiburg.de
   - ad.informatik.uni-freiburg.de
   - www.imtek.uni-freiburg.de

2. URL Split: Split the Ni based on URL prefixes, e.g.
   - www.informatik.uni-freiburg.de/students
   - www.informatik.uni-freiburg.de/studienberatung

3. Clustered Split: Use a clustering algorithm to split partitions into groups with similar adjacency lists


References - Indexing

[1] A. ARASU, J. CHO, H. GARCIA-MOLINA, A. PAEPCKE, S. RAGHAVAN: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL 1/1, AUG. 2001
    Chapter 4 (Indexing)

[2] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998
    Chapter 4 (System Anatomy)

[3] BHARAT, BRODER, HENZINGER, KUMAR, VENKATASUBRAMANIAN: "THE CONNECTIVITY SERVER: FAST ACCESS TO LINKAGE INFORMATION ON THE WEB", WWW 1998

[4] RAGHAVAN, GARCIA-MOLINA: "REPRESENTING WEB GRAPHS", STANFORD TECHNICAL REPORT 2002


General Web Search Engine Architecture

[Figure (cf. [1], Fig. 1): architecture diagram with the components CLIENT, QUERY ENGINE, RANKING, CRAWL CONTROL, CRAWLER(S), WWW, PAGE REPOSITORY, INDEXER MODULE, COLLECTION ANALYSIS MODULE and INDEXES (STRUCTURE, UTILITY, TEXT); arrows labelled QUERIES, RESULTS and USAGE FEEDBACK]


The evolution of search engines

1st generation: Use only "on page", text data
- Word frequency, language

1995-1997 (AltaVista, Excite, Lycos, etc.)

2nd generation: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor-text (how people refer to a page)

From 1998 (made popular by Google, but used by everyone now)

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002


The evolution of search engines

3rd generation: Answer the need behind the query (still experimental)

Semantic analysis - What is it about?

Focus on user need, rather than on query
- Corpus reflects user needs / expectations
- Integrates multiple sources of data
- Help the user create a good query

Context determination
- Spatial (user location / target location)
- Query stream (previous queries)
- Personal (user profile)
- Explicit (vertical search)
- Implicit (on altavista.de)

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002


The evolution of search engines

3rd generation: Answer the need behind the query (cont.; still experimental)

Helping the user
- UI, spell checking, query refinement, query suggestion, syntax driven feedback, context help, context transfer, etc.

Integration of search and text analysis

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002


3rd generation: Answer the need behind the query

Example: Google


Web Search Lecture - Schedule

1. Classic IR (Basics)

2. Classic IR Exercises

3. Web Search (Basics)

4. Web Search Exercises [June 28th to July 12th]

5. Web Search (Selected Topics) [July 18th to July 26th]


Web Search – Summer Term 2006

Web Search Basics - (Programming) Exercises

(c) Wolfgang Hürst, Albert-Ludwigs-University



Programming Exercises

Exercise sheet 1: Tools, Library (Lucene)

Exercise sheet 2: Database (and text index)

Exercise sheet 3: Index (structure index)

Exercise sheet 4: Search (link-based ranking)



New Lecturnity Player

Advanced replay features (developed by us)

Modification of replay speed (while preserving the pitch of the voice)