Search Engine and Web Mining · 2020-03-29

Search Engine and Web Mining

Hamed Monkaresi

Department of Computer Engineering and Information Technology, Razi University, Kermanshah


Outline

• Web challenges

• Search engines

• Web crawling

• Web ranking

– Ranking algorithms

– Ranking challenges



Why has the Web succeeded?

• A distributed system

• A simple protocol

• Producing and publishing content is very simple



Web Retrieval

[Figure: a search engine matches the user space to the information space. User space: retrieval, browsing. Information space: index terms, full text, full text + structure (e.g. hypertext).]



IR vs Data Retrieval

• Data retrieval (DR) aims at retrieving all objects that satisfy clearly defined conditions, such as those expressed in a regular expression

• DR does not solve the problem of retrieving information about a given subject or object



Comparing IR to databases (vs data retrieval)

                      Databases                               IR

Data                  Structured                              Unstructured

Fields                Clear semantics (SSN, age)              No fields (other than text)

Queries               Defined (relational algebra, SQL)       Free text (“natural language”), Boolean

Query specification   Complete                                Incomplete

Matching              Exact (results are always “correct”)    Imprecise (need to measure effectiveness)

Error response        Sensitive                               Insensitive
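
The contrast between exact and imprecise matching in the comparison above can be sketched in a few lines of Python. This is only an illustration: the toy documents, the regular expression, and the word-overlap score are all assumptions made for the example, not something taken from the slides.

```python
import re

# Hypothetical toy collection; the texts are made up for illustration.
DOCS = {
    "d1": "python is a programming language",
    "d2": "the python is a large snake",
    "d3": "java is a programming language too",
}

def data_retrieval(pattern):
    """Exact matching: return every doc that satisfies the condition, unordered."""
    regex = re.compile(pattern)
    return {doc_id for doc_id, text in DOCS.items() if regex.search(text)}

def information_retrieval(query):
    """Imprecise matching: score docs by the number of shared words, then rank."""
    query_words = set(query.lower().split())
    scores = {
        doc_id: len(query_words & set(text.split()))
        for doc_id, text in DOCS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, score in ranked if score > 0]

exact = data_retrieval(r"\bpython\b.*\blanguage\b")
ranked = information_retrieval("python programming language")
print(exact)   # a set: a document either satisfies the condition or it does not
print(ranked)  # an ordered list: partial matches are kept, best match first
```

The data retrieval call returns a set with no notion of "better" or "worse" answers, while the IR call returns an ordering whose quality has to be evaluated.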



Main points in IR

• What is the definition of relevancy?

• Evaluation!

– Subjective (unlike the evaluation of hardware or networks)



Web Challenges

• Huge size of information

– 11.5 billion pages (2005)

– 64 billion pages (5 June 2008)

• Proliferation and dynamic nature

– New pages are created at a rate of 8% per week

– Only 20% of the current pages will be accessible after one year

– New links are created at a rate of 25% per week

• Heterogeneous content

– HTML/Text/Audio/…
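
To get a feel for what "only 20% of the current pages will be accessible after one year" means for a crawler, the yearly figure can be converted into a weekly one. The calculation below is a rough sketch that assumes pages disappear at a constant weekly rate (exponential decay); the slides do not state this, so the result is only indicative.

```python
import math

survival_after_year = 0.20   # from the slide: 20% of pages still accessible after one year
weeks_per_year = 52

# Assumption: pages disappear at a constant weekly rate (exponential decay).
weekly_survival = survival_after_year ** (1 / weeks_per_year)
weekly_loss = 1 - weekly_survival

# Half-life: number of weeks until half of today's pages are gone.
half_life_weeks = math.log(0.5) / math.log(weekly_survival)

print(f"weekly loss: {weekly_loss:.2%}")
print(f"half-life: {half_life_weeks:.1f} weeks")
```

Under this assumption roughly 3% of the current pages vanish every week, and half of them are gone in a little over 22 weeks, which is why freshness is a recurring concern for crawling.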



Web IR (SE) Challenges (1)

• The definition of relevancy

• Connectivity together with content on the Web

– A huge graph

• Different types of queries

– Narrow

    • Needle in a haystack

– Wide

    • Overlapping with many areas

• Users have little patience: they commonly browse only the first ten results (i.e. one screen), hoping to find the “right” document for their query there


Web IR (SE) Challenges (continued)


• Spamming phenomenon

– It is crucial for business sites to be ranked highly by the major search engines.

– Quite a few companies sell this kind of expertise (also known as “search engine optimization”); they actively research the ranking algorithms and heuristics of search engines, and know how many keywords to place (and where) in a Web page to improve the page’s ranking

– SEO books

• Content and connectivity spamming

• Anti-spamming solutions


Web IR (SE) Challenges (continued)


• Rich-get-richer problem

– It takes a long time for a young, high-quality web page to receive an appropriate ranking

– Unfairness

– It steers the growth of web content in bad directions


Web IR (SE) Challenges (continued)


• Crawling challenges

– Huge size of information with dynamic nature

– Freshness and coverage

    • Google covers only 70% of the Web

– A suitable scheduling policy

– Hidden web (600 times bigger)

• Using meta search engines to increase coverage

– Merging and ranking problem
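
The merging and ranking problem of a meta search engine can be illustrated with reciprocal rank fusion, a simple and widely used way to combine ranked lists. The slides do not say which merging method is meant, so the choice of technique is an assumption; the engine result lists and the constant `k=60` are likewise illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: a document scores sum(1 / (k + rank)) over the lists containing it."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists returned by two underlying engines for the same query.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]
merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

The merged list contains four distinct results although each engine returned only three, which is the coverage gain; the price is that the fused order is only a heuristic, because the engines' internal scores are not comparable.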


Web IR (SE) Challenges (continued)


• User evaluation is subjective and changes over time

– The relevancy of a document to a query depends on the user and on the time

– Two users with the same query may expect different results


Web IR (SE) Challenges (continued)


• Query Ambiguity

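– Python (one word, several meanings: the programming language or the snake)

– Car & automobile (different words, same meaning)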



Web Structure

• The Web graph has a bow-tie shape

• It has a scale-free topology

– Many features of the graph follow a power-law distribution: $p(x) \propto x^{-\gamma}$

– The core has the small-world property

    • The shortest directed path from any page in the core to any other page in the core involves 16–20 links on average
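
A quantity x follows a power law when its frequency is proportional to x raised to a negative exponent, so the distribution appears as a straight line on a log-log plot and the exponent is the negated slope. The sketch below recovers the exponent from a histogram by least squares in log-log space. The data are synthetic and the exponent 2.1 is simply a value chosen for the example; on real, noisy data this simple fit is known to be biased and maximum-likelihood estimators are usually preferred.

```python
import math

def fit_power_law_exponent(xs, counts):
    """Least-squares slope of log(count) against log(x); returns the estimated exponent."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    return -cov / var  # the slope is negative, the exponent is reported as positive

# Synthetic in-degree histogram generated from count = 1e6 * x^(-2.1); made-up data.
degrees = list(range(1, 101))
page_counts = [1e6 * d ** -2.1 for d in degrees]
gamma = fit_power_law_exponent(degrees, page_counts)
print(round(gamma, 3))
```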



Distribution of Web Graph: Power-Law



Search Engines Trends

• 625 million search queries are received by major search engines each day

• 80% of web surfers discover the new sites that they visit through search engines

• Web search currently generates more than 85% of the traffic to most web sites



Components of Search Engines

• Crawling

• Indexing

• Ranking



Architecture of Search Engines

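[Figure: search engine architecture. Components: crawler(s), page repository, indexer module, collection analysis module, indexes (text, structure, utility), query engine, ranking, and client. The crawlers fetch pages from the Web; the client sends queries and receives results.]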



Web Crawling Issues

• Coverage

– Google, the biggest search engine, covers only 70% of web content

– We must focus on high-quality pages

• Freshness

– Keep the local copy synchronized with the source pages

• Politeness

– Crawl without disrupting the web, and obey webmasters’ constraints
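
Politeness is commonly implemented with two checks: honour the site's robots.txt rules, and leave a minimum delay between requests to the same host. The sketch below uses Python's standard `urllib.robotparser` for the first check. The class name `PoliteFetchPolicy`, the two-second delay, the user agent string, and the robots.txt content are all hypothetical, and for brevity a single rule set is applied to every host, whereas a real crawler would load one robots.txt per site.

```python
from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

class PoliteFetchPolicy:
    """Decides whether a URL may be fetched now, without doing any network I/O."""

    def __init__(self, robots_lines, min_delay_seconds=2.0):
        self.rules = robotparser.RobotFileParser()
        self.rules.parse(robots_lines)
        self.min_delay = min_delay_seconds
        self.last_fetch = {}  # host -> time of the last permitted fetch

    def may_fetch(self, url, now, user_agent="ExampleCrawler"):
        host = urlparse(url).netloc
        if not self.rules.can_fetch(user_agent, url):
            return False  # the webmaster disallows this path
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_delay:
            return False  # too soon: would hammer the same host
        self.last_fetch[host] = now
        return True

policy = PoliteFetchPolicy(ROBOTS_TXT.splitlines())
first = policy.may_fetch("http://example.com/index.html", now=0.0)
too_soon = policy.may_fetch("http://example.com/about.html", now=0.5)
later = policy.may_fetch("http://example.com/about.html", now=3.0)
blocked = policy.may_fetch("http://example.com/private/data.html", now=10.0)
print(first, too_soon, later, blocked)
```

Passing the clock value in as `now` keeps the policy free of sleeping and easy to test; a crawler would call it with `time.monotonic()` and re-queue any URL that is refused for being too soon.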



Web crawling

Crawler



Crawling Scheduling

• Breadth-First

• Back-link count

• PageRank,…
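
The difference between the first two scheduling policies can be seen on a toy link graph. The sketch below is an illustration under stated assumptions: the graph `LINKS` is made up, and the back-link count policy is the simplest possible variant, which always downloads the known page with the most in-links discovered so far.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["D"],
    "C": ["E"],
    "D": ["E"],
    "E": [],
}

def breadth_first_order(seed):
    """Download pages in the order they are discovered (FIFO frontier)."""
    order, seen, frontier = [], {seed}, deque([seed])
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

def backlink_count_order(seed):
    """Always download the known page with the most in-links seen so far."""
    order, backlinks, frontier, downloaded = [], {seed: 0}, {seed}, set()
    while frontier:
        # sorted() makes ties deterministic: the alphabetically first page wins.
        page = max(sorted(frontier), key=lambda p: backlinks[p])
        frontier.remove(page)
        downloaded.add(page)
        order.append(page)
        for target in LINKS[page]:
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in downloaded:
                frontier.add(target)
    return order

print(breadth_first_order("A"))
print(backlink_count_order("A"))
```

In this graph page D is linked from both A and B, so the back-link policy fetches it before C, while breadth-first sticks to discovery order. A PageRank-based scheduler follows the same pattern with a different priority function.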


Crawling scheduling

[Figure: crawling scheduling loop. Components: downloader, Web, repository, ranking algorithm; labelled flow: URLs and links.]



Indexing

• Text operations form the index words (tokens).

– Stopword removal

– Stemming

• Indexing constructs an inverted index of word-to-document pointers.
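
The two steps above can be sketched as a tiny indexing pipeline. The stopword list is a small illustrative sample and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's; both are assumptions for the example, as are the three toy documents.

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}  # tiny illustrative list

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Text operations: split, strip punctuation, lowercase, drop stopwords, stem."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [crude_stem(w) for w in words if w and w not in STOPWORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The crawler is crawling the web.",
    2: "Search engines rank web pages.",
    3: "Ranking pages is the job of a search engine.",
}
index = build_inverted_index(docs)
print(index["engine"], index["web"])
```

Because "engines" and "engine", or "rank" and "Ranking", are reduced to the same term, a query for one form finds documents containing the other. Answering a query then becomes a lookup and intersection of posting lists instead of a scan over all documents.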



Indexing Systems

• Google File System

• MG4J (Managing Gigabytes for Java)

• Lucene (Java, Apache License)

• Swish-e (C, Linux)



Ranking : Definition

• Ranking is the process which estimates the quality of a set of results retrieved by a search engine

• Ranking is the most important part of a search engine



Ranking Types

• Content-based

– Classical IR

• Connectivity-based (web)

– Query independent

– Query dependent

• User-behavior based


Classical Information Retrieval

• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)

– Vector space

– Probabilistic

[Figure: a query matched against a words-by-documents matrix (words 1, 2, …, w; documents 1, 2, …, n).]
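
A minimal vector space sketch of tf-idf ranking follows. It assumes one common weighting, raw term frequency multiplied by idf = log(N / df), and cosine similarity between the query and document vectors; many other tf-idf variants exist and the slides do not commit to one. The three documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf_rank(query, docs):
    """Rank documents by cosine similarity between tf-idf vectors of query and document."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    def vector(tokens):
        counts = Counter(tokens)
        # tf * idf with idf = log(N / df); terms unseen in the collection are dropped.
        return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items() if t in doc_freq}

    def cosine(u, v):
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vector(query.lower().split())
    scores = {doc_id: cosine(query_vec, vector(tokens)) for doc_id, tokens in tokenized.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {
    "d1": "web search engine ranking",
    "d2": "web crawling and web indexing",
    "d3": "cooking recipes for pasta",
}
ranking = tf_idf_rank("web ranking", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The rarer query term "ranking" carries more weight than "web", which occurs in two of the three documents, so d1 is ranked above d2 even though d2 mentions "web" twice.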



Classical Information Retrieval

• This works because of the following assumptions in classical IR:

– Queries are long and well specified

– Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic

– The vocabulary is small and relatively well understood



Web information retrieval

• Queries are short: 2.35 terms on average

• Huge variety in documents: language, quality, duplication

• Huge vocabulary: hundreds of millions of terms

• Deliberate misinformation

• Spamming!

– A page’s rank is completely under the control of the Web page’s author



Ranking in Web IR

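• Ranking is a function of the query terms and of the hyperlink structure

– Using content of other pages to rank current pages

• It is out of the control of the page’s author

– Spamming is hard

[Figure: a query matched against a words-by-documents matrix (words 1, 2, …, w; documents 1, 2, …, n) combined with the documents-by-documents Web graph (documents 1, 2, …, n).]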
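
PageRank, listed earlier among the scheduling criteria, is one well-known way to turn hyperlink structure into a query-independent score. The sketch below runs a plain power iteration on a made-up four-page graph; the damping factor 0.85, the fixed iteration count, and the even redistribution of rank from dangling pages are conventional choices assumed for the example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web; every page named as a target also appears as a key.
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The page "orphan" links out but receives no links, so it ends with the minimum score no matter what its own content says. An author can change a page's text at will, but not the links that other pages point at it.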


Books


• Main textbook:

– C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

– http://www.cs.utexas.edu/~mooney/ir-course/

• Secondary:

– R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.


Assessment

• Final Exam: 10 Marks

• Project: 5 Marks

• Homework: 2 Marks

• Paper Review and Presentation: 3 Marks


Papers for Review

• Cho, Junghoo, and Sourashis Roy. "Impact of search engines on page popularity." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.

• Spink, Amanda, et al. "Searching the web: The public and their queries." Journal of the Association for Information Science and Technology 52.3 (2001): 226-234.

• Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web." Scientific American 284.5 (2001): 34-43.


Contacts

• Be a member of this group in Shagerdaneh:

https://shagerdaneh.ir/

• Telegram Channel

https://telegram.me/RaziWM982

• Instructor’s email:

[email protected]


QUESTIONS ?