Design of a Click-tracking Network for Full-text Search Engine

Group 5: Yuan Hu, Yu Ge, Youwen Gong, Zenghui Qiu and Miao Liu


Outline

• Introduction

• Objective

• Project diagram
– Web Crawling
– Indexing schema

• Ranking strategies
– PageRank Algorithms
– Neural Network
– Content-Based Ranking

• Software and Reference

Introduction

• Full-text Search Engine
– search on key words
– rank results

• What is in a Search Engine?
– Crawling
– Indexing
– Ranking results of query

Objective

• Design a full-text search engine

• Rank search results in different ways

Project Diagram

[Figure: pipeline diagram with the stages Website, Crawling, Text & urls, Database, Indexing, Query Function, the three rankers (Click-Tracking Network, PageRank Algorithms, Content-Based Ranking), and Ranked results.]

Web Crawling

Depth 1: crawling all the url links on the main page

Depth 2: crawling all the url links found in depth 1

Main page:

……

http://en.wikipedia.org/wiki/Machine_learning

http://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain

http://en.wikipedia.org/wiki/Machine_learning#Decision_tree_learning

……

# Implemented with Python urllib2 module and BeautifulSoup API
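The depth-limited crawl described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the group's code: it uses the Python 3 standard library (`html.parser`, `urllib.parse`) as a stand-in for urllib2 and BeautifulSoup, the `fetch` callable and the `example.org` pages are hypothetical, and stripping `#fragment` suffixes so that one page is stored once is an assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute link targets found in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith("http") and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; returns {url: depth} and a list of (from, to) links."""
    depth_of = {start_url: 0}
    edges = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # pages at the maximum depth are recorded but not expanded
        for target in extract_links(url, fetch(url)):
            edges.append((url, target))
            if target not in depth_of:
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of, edges


# Tiny in-memory "web" (hypothetical pages) so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.org/a": '<a href="/c">C</a>',
    "http://example.org/b": '<a href="/">home</a>',
    "http://example.org/c": '<a href="/d">D</a>',
}
depths, links = crawl("http://example.org/", lambda u: PAGES.get(u, ""))
```

In a real crawl, `fetch` would download the page; passing it in as a parameter keeps the traversal logic testable offline.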

[Figure: pages (URL) connected by links (LINK), from the main page to depth 1 and depth 2.]

Schema for Basic Index

Link: Row_ID, From_ID, To_ID

Url_list: Row_ID, Url

Word_location: Url_ID, Word_ID, Location

Word_list: Row_ID, Word

Link_words: Word_ID, Link_ID

# Implemented with SQLite
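A minimal sketch of how such a schema might be created with Python's `sqlite3` module. The lowercase table and column names, the indexes, and the `add_page` helper are illustrative assumptions; Row_ID is taken to be SQLite's implicit `rowid`, so it is not declared explicitly.

```python
import sqlite3

SCHEMA = """
CREATE TABLE url_list      (url TEXT);
CREATE TABLE word_list     (word TEXT);
CREATE TABLE word_location (url_id INTEGER, word_id INTEGER, location INTEGER);
CREATE TABLE link          (from_id INTEGER, to_id INTEGER);
CREATE TABLE link_words    (word_id INTEGER, link_id INTEGER);
CREATE INDEX word_idx     ON word_list(word);
CREATE INDEX url_idx      ON url_list(url);
CREATE INDEX word_url_idx ON word_location(word_id);
"""


def create_index_db(path=":memory:"):
    """Open a database and create the five index tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_page(conn, url, text):
    """Store a page and the position of each of its words (hypothetical helper)."""
    url_id = conn.execute(
        "INSERT INTO url_list (url) VALUES (?)", (url,)).lastrowid
    for location, word in enumerate(text.lower().split()):
        row = conn.execute(
            "SELECT rowid FROM word_list WHERE word = ?", (word,)).fetchone()
        if row:
            word_id = row[0]
        else:
            word_id = conn.execute(
                "INSERT INTO word_list (word) VALUES (?)", (word,)).lastrowid
        conn.execute(
            "INSERT INTO word_location VALUES (?, ?, ?)",
            (url_id, word_id, location))
    return url_id


conn = create_index_db()
page_id = add_page(conn, "http://example.org/", "machine learning machine")
```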

Results for Multiple-word Query

[Screenshot of the query function output. Labels: words combination, same url_id, word location.]

Note: the url_ids returned are not ranked.
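A multiple-word query of this kind can be sketched as a self-join on the word-location table, so that every returned row refers to one url_id and carries one location per query word. The function name `match_rows`, the table names, and the sample data are assumptions made for illustration.

```python
import sqlite3


def match_rows(conn, words):
    """Return rows (url_id, loc_word0, loc_word1, ...) for pages containing every word."""
    if not words:
        return []
    word_ids = []
    for word in words:
        row = conn.execute(
            "SELECT rowid FROM word_list WHERE word = ?", (word,)).fetchone()
        if row is None:
            return []  # an unknown word means no page can match all words
        word_ids.append(row[0])

    fields, tables, clauses = ["w0.url_id"], [], []
    for i in range(len(word_ids)):
        fields.append("w%d.location" % i)
        tables.append("word_location w%d" % i)
        if i > 0:
            # every word must occur on the same page
            clauses.append("w%d.url_id = w%d.url_id" % (i - 1, i))
        clauses.append("w%d.word_id = ?" % i)
    sql = "SELECT %s FROM %s WHERE %s" % (
        ", ".join(fields), ", ".join(tables), " AND ".join(clauses))
    return conn.execute(sql, word_ids).fetchall()


# Sample data: page 1 contains both words, page 2 only "machine".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word_list (word TEXT);
CREATE TABLE word_location (url_id INTEGER, word_id INTEGER, location INTEGER);
INSERT INTO word_list (word) VALUES ('machine'), ('learning');
INSERT INTO word_location VALUES (1, 1, 0), (1, 2, 1), (1, 1, 5), (2, 1, 3);
""")
rows = match_rows(conn, ["machine", "learning"])
```

Each row is one combination of word positions on a page, which is why a page can appear several times and why the result still needs a ranking step.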

PageRank Algorithm

• Developed by Larry Page and Sergey Brin at Stanford University in 1996.
• Measures how important a page is.
• The importance of a page is calculated from all the other pages that link to it.

http://www.rasch.org/rmt/rmt232a.htm


How to Calculate PR

• d: damping factor, 0 < d < 1, here 0.85.
• PR(B), …, PR(D), …: PageRank value of each webpage linking to page A.
• L(B), …, L(D), …: the number of links going out of pages B, …, D, ….
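The formula itself did not survive extraction. A reconstruction from the symbols defined above, in the unnormalized form that matches the 0.15 + 0.85 * (…) arithmetic used in this deck, would be:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(B)}{L(B)} + \cdots + \frac{PR(D)}{L(D)} + \cdots \right)
```

Each page linking to A passes on its own PageRank divided by its number of outgoing links, and the damping factor d scales the total.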

Example

PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
= 0.15 + 0.85 * ( 0.5/4 + 0.7/4 + 0.2/1 )
= 0.15 + 0.85 * ( 0.125 + 0.175 + 0.2 )
= 0.15 + 0.85 * 0.5
= 0.575

How to Update the PR Value

If we don't know what the PR values should be to begin with, just assign an initial PR value to every page, then update every page's PR repeatedly (20 iterations here).

http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
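The iterative update can be sketched in a few lines. This is an illustration under assumptions: the initial value of 1.0, the dictionary-based graph, and the choice to compute each round from the previous round's values are not stated in the slides.

```python
def pagerank(links, damping=0.85, iterations=20, initial=1.0):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # inbound[p] lists the pages linking to p; out_count[p] is L(p).
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)
    out_count = {page: len(set(links.get(page, []))) for page in pages}

    pr = {page: initial for page in pages}
    for _ in range(iterations):
        # Every page is recomputed from the values of the previous round.
        pr = {
            page: (1 - damping) + damping * sum(
                pr[src] / out_count[src] for src in inbound[page])
            for page in pages
        }
    return pr


# Hypothetical three-page graph: A and B link to each other, C links to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

A page with no inbound links, such as C here, settles at 1 - d = 0.15, the lowest value the formula can produce.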


Results for PageRank

[Screenshot: PageRank values]

Neural Network

Why?
• Make reasonable guesses about results for queries it has never seen before.

Click-tracking
• The weights are updated based on which search results the user clicked.

Neural Network

• Step 1: Setting Up the Database

• Step 2: Feeding Forward Activation

• Step 3: Training with Backpropagation

How does the neural network work?

[Figure legend: solid line = strong connection; bold text = active node.]

Step 1: Setting Up the ANN Database

• Create a table for the hidden layer (red box)

• Create two tables for the connections (green boxes)
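A minimal sketch of these three tables in SQLite. The table names, the `create_key` column, and the default strengths (-0.2 for word-to-hidden links, 0 for hidden-to-URL links, following the convention in the Segaran book listed in the references) are assumptions rather than the group's confirmed design.

```python
import sqlite3

# Strength assumed for a connection that has not been stored yet.
DEFAULT_STRENGTH = {"word_hidden": -0.2, "hidden_url": 0.0}


def create_ann_tables(conn):
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS hidden_node (create_key TEXT);
    CREATE TABLE IF NOT EXISTS word_hidden (from_id INTEGER, to_id INTEGER, strength REAL);
    CREATE TABLE IF NOT EXISTS hidden_url  (from_id INTEGER, to_id INTEGER, strength REAL);
    """)


def get_strength(conn, from_id, to_id, table):
    """Return the stored connection strength, or the default if none exists."""
    default = DEFAULT_STRENGTH[table]  # KeyError rejects unknown table names
    row = conn.execute(
        "SELECT strength FROM %s WHERE from_id = ? AND to_id = ?" % table,
        (from_id, to_id)).fetchone()
    return row[0] if row else default


def set_strength(conn, from_id, to_id, table, strength):
    """Insert the connection, or update it if it already exists."""
    if table not in DEFAULT_STRENGTH:
        raise ValueError("unknown connection table: %r" % table)
    row = conn.execute(
        "SELECT rowid FROM %s WHERE from_id = ? AND to_id = ?" % table,
        (from_id, to_id)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO %s (from_id, to_id, strength) VALUES (?, ?, ?)" % table,
            (from_id, to_id, strength))
    else:
        conn.execute(
            "UPDATE %s SET strength = ? WHERE rowid = ?" % table,
            (strength, row[0]))


conn = sqlite3.connect(":memory:")
create_ann_tables(conn)
set_strength(conn, 1, 1, "word_hidden", 0.5)
```

Returning a default for missing rows means connections only need to be stored once training has touched them.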

Step 2: Feeding Forward Activation

• Objective: activate the ANN.
– Take words as inputs
– Activate the links in the network
– Give outputs for URL

• Hyperbolic tangent function

X-axis: total input to the node
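The feed-forward pass can be sketched with plain lists. This is an illustration under assumptions: the weight layout (`w_in[i][j]` from word i to hidden node j, `w_out[j][k]` from hidden node j to URL k) and the sample weights are hypothetical.

```python
import math


def feed_forward(word_activations, w_in, w_out):
    """Propagate word activations through one hidden layer using tanh.

    w_in[i][j]  : strength from word i to hidden node j
    w_out[j][k] : strength from hidden node j to URL k
    """
    n_words = len(word_activations)
    n_hidden = len(w_out)
    n_urls = len(w_out[0])

    # Each hidden node applies tanh to its total weighted input.
    hidden = [
        math.tanh(sum(word_activations[i] * w_in[i][j] for i in range(n_words)))
        for j in range(n_hidden)
    ]
    # Each URL output applies tanh to the weighted hidden activations.
    outputs = [
        math.tanh(sum(hidden[j] * w_out[j][k] for j in range(n_hidden)))
        for k in range(n_urls)
    ]
    return hidden, outputs


# Two query words, one hidden node, two candidate URLs (made-up weights).
hidden, outputs = feed_forward([1.0, 1.0], [[0.5], [0.5]], [[1.0, 0.0]])
```

Because tanh maps any total input into the range -1 to 1, the URL outputs can be compared directly as relevance scores.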

Step 3: Training with Backpropagation

• Train the network every time someone performs a search and chooses one of the links

• The same algorithm covered in class.

• Learning rate = 0.5
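One training step might look like the sketch below. It assumes standard gradient-descent backpropagation for tanh units, which may differ in detail from the version covered in class; the target vector (1 for the clicked URL, 0 for the others) and the starting weights are illustrative assumptions.

```python
import math

LEARNING_RATE = 0.5


def dtanh(y):
    """Derivative of tanh expressed in terms of its output y."""
    return 1.0 - y * y


def forward(inputs, w_in, w_out):
    hidden = [math.tanh(sum(x * w_in[i][j] for i, x in enumerate(inputs)))
              for j in range(len(w_out))]
    outputs = [math.tanh(sum(h * w_out[j][k] for j, h in enumerate(hidden)))
               for k in range(len(w_out[0]))]
    return hidden, outputs


def train_step(inputs, targets, w_in, w_out, rate=LEARNING_RATE):
    """Update both weight matrices in place for one (query, clicked URL) pair."""
    hidden, outputs = forward(inputs, w_in, w_out)
    # Error signal at each output node.
    output_deltas = [dtanh(o) * (t - o) for o, t in zip(outputs, targets)]
    # Error propagated back to each hidden node (uses the old w_out).
    hidden_deltas = [
        dtanh(h) * sum(d * w_out[j][k] for k, d in enumerate(output_deltas))
        for j, h in enumerate(hidden)
    ]
    for j, h in enumerate(hidden):
        for k, d in enumerate(output_deltas):
            w_out[j][k] += rate * d * h
    for i, x in enumerate(inputs):
        for j, d in enumerate(hidden_deltas):
            w_in[i][j] += rate * d * x


# Two query words, two hidden nodes, three URLs; the user clicked URL 1.
w_in = [[0.5, 0.5], [0.5, 0.5]]
w_out = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
query, clicked = [1.0, 1.0], [0.0, 1.0, 0.0]
_, before = forward(query, w_in, w_out)
for _ in range(20):
    train_step(query, clicked, w_in, w_out)
_, after = forward(query, w_in, w_out)
```

Repeating the step for the same query pushes the clicked URL's output up and the others toward zero, which is the click-tracking effect the slides describe.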

Results For Neural Network

Step 1: [Screenshot labels: From ID, To ID, Strength, Hidden node]

Step 2: [Screenshot labels: relevance of URL, input URL]

Step 3: Training with one query

Results For Neural Network (contd)

Step 3: Training with more queries

Content-Based Ranking

• Word frequency

• Document location

• Word distance

Basic Idea: Calculate a score based only on the query and the content of the page
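The three metrics can be combined into a single score as sketched below. The input format (one tuple per match: url_id followed by the location of each query word), the normalization to the range 0 to 1, and the equal weights are assumptions for illustration.

```python
TINY = 0.00001  # avoids division by zero


def normalize(scores, small_is_better=False):
    """Scale scores so that the best page gets 1.0."""
    if small_is_better:
        best = max(TINY, min(scores.values()))
        return {u: best / max(TINY, s) for u, s in scores.items()}
    best = max(TINY, max(scores.values()))
    return {u: s / best for u, s in scores.items()}


def frequency_score(rows):
    """More matching word combinations on a page is better."""
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return normalize(counts)


def location_score(rows):
    """Query words appearing earlier in the document is better."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        best[row[0]] = min(total, best.get(row[0], total))
    return normalize(best, small_is_better=True)


def distance_score(rows):
    """Query words appearing closer together is better."""
    if len(rows[0]) <= 2:  # a single query word has no distance
        return {row[0]: 1.0 for row in rows}
    best = {}
    for row in rows:
        dist = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        best[row[0]] = min(dist, best.get(row[0], dist))
    return normalize(best, small_is_better=True)


def rank(rows, weights=(1.0, 1.0, 1.0)):
    """Return (url_id, score) pairs, best first."""
    if not rows:
        return []
    parts = [frequency_score(rows), location_score(rows), distance_score(rows)]
    total = {}
    for weight, scores in zip(weights, parts):
        for url_id, score in scores.items():
            total[url_id] = total.get(url_id, 0.0) + weight * score
    return sorted(total.items(), key=lambda item: item[1], reverse=True)


# Page 1 matches twice, early and close together; page 2 once, late and far apart.
ranked = rank([(1, 0, 1), (1, 5, 1), (2, 10, 40)])
```

Normalizing each metric before summing keeps one metric from dominating just because its raw values are larger.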

Reference

• Programming Collective Intelligence – Toby Segaran
• SQLite Tutorial – ZetCode
• Dive into Python – Mark Pilgrim

Software

• Ubuntu 11.04
• Python 2.7.3
• SQLite

Thank you.