Search Engine and Web Mining

Hamed Monkaresi
Department of Computer Engineering and Information Technology, Razi University, Kermanshah

2020-03-29



Outline

• Web challenges

• Search engines

• Web crawling

• Web ranking

– Ranking algorithms

– Ranking challenges


Why has the Web succeeded?

• It is a distributed system

• It uses a simple protocol

• Producing and publishing content is very simple


Web Retrieval

[Figure: the user space is matched against the information space by retrieval or browsing (the search engine); documents are represented by index terms, full text, or full text + structure (e.g. hypertext).]


IR vs Data Retrieval

• Data retrieval (DR) aims to retrieve all objects that satisfy clearly defined conditions, such as those in a regular expression

• DR does not solve the problem of retrieving information about a subject or topic


Comparing IR to databases (data retrieval)

                     Databases                              IR
Data                 Structured                             Unstructured
Fields               Clear semantics (SSN, age)             No fields (other than text)
Queries              Defined (relational algebra, SQL)      Free text (“natural language”), Boolean
Query specification  Complete                               Incomplete
Matching             Exact (results are always “correct”)   Imprecise (need to measure effectiveness)
Error response       Sensitive                              Insensitive
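The contrast between exact and imprecise matching can be sketched in a few lines of Python. This is only an illustration: the records, documents, and the term-overlap score are hypothetical choices, not something taken from the slides.

```python
import re

# Hypothetical structured records and free-text documents, for illustration only.
RECORDS = ["age=31", "age=47", "name=Ali"]
DOCS = {
    "d1": "python snake habitat",
    "d2": "python programming language tutorial",
    "d3": "car repair",
}

def data_retrieval(pattern, records):
    """Exact matching: return every record that satisfies the condition."""
    return [r for r in records if re.fullmatch(pattern, r)]

def information_retrieval(query, docs):
    """Imprecise matching: score documents by shared query terms and rank them."""
    query_terms = set(query.lower().split())
    scores = {
        doc_id: len(query_terms & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if scores[doc_id] > 0]

exact = data_retrieval(r"age=\d+", RECORDS)
ranked = information_retrieval("python programming tutorial", DOCS)
```

The first function either returns a record or it does not; the second returns an ordering whose usefulness has to be evaluated, which is the "need to measure effectiveness" entry in the table.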


Main points in IR

• What is the definition of relevancy?

• Evaluation!

– Subjective (unlike the evaluation of hardware or networks)


Web Challenges

• Huge size of information

– 11.5 billion pages (2005)

– 64 billion pages (5 June 2008)

• Proliferation and dynamic nature

– New pages are created at a rate of 8% per week

– Only 20% of the current pages will be accessible after one year

– New links are created at a rate of 25% per week

• Heterogeneous content

– HTML/Text/Audio/…
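To get a feel for the 20% yearly survival figure, it can be converted into a weekly rate. The sketch below assumes that pages disappear at a constant rate every week, which is a simplifying assumption and not a claim made in the slides.

```python
WEEKS_PER_YEAR = 52

def weekly_survival(yearly_survival, weeks=WEEKS_PER_YEAR):
    """Weekly survival fraction that compounds to the given yearly fraction."""
    return yearly_survival ** (1.0 / weeks)

# 20% of pages still accessible after one year (figure from the slide).
weekly = weekly_survival(0.20)
weekly_loss = 1.0 - weekly  # fraction of pages lost per week under this assumption
```

Under this assumption roughly 3% of the current pages vanish each week, which is one reason freshness is a central concern for crawlers.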


Web IR (SE) Challenges (1)

• The definition of relevancy

• The connectivity of content on the Web

– A huge graph

• Different types of queries

– Narrow

    • Needle in a haystack

– Wide

    • Overlapping with many areas

• Users have little patience: they commonly browse only the first ten results (i.e. one screen), hoping to find the “right” document for their query


Web IR (SE) Challenges (2)

• Spamming phenomenon

– It is crucial for business sites to be ranked highly by the major search engines

– Quite a few companies sell this kind of expertise (known as “search engine optimization”, SEO); they actively research the ranking algorithms and heuristics of search engines and know how many keywords to place in a Web page (and where) to improve the page’s ranking

– SEO books

• Content and connectivity spamming

• Anti-spamming solutions


Web IR (SE) Challenges (3)

• Rich-get-richer problem

– It takes a long time for a young, high-quality web page to receive an appropriate ranking

– Unfairness

– Web content grows in undesirable directions


Web IR (SE) Challenges (4)

• Crawling challenges

– Huge size of information with dynamic nature

– Freshness & coverage

    • Google covers only 70% of the Web

– A suitable scheduling policy

– Hidden web (600 times bigger)

• Using meta search engines to increase coverage

– Merging and ranking problem


Web IR (SE) Challenges (5)

• User evaluation is subjective and changes over time

– The relevancy of a document to a query depends on the user and the time

– Two users with the same query may expect different results


Web IR (SE) Challenges (6)

• Query Ambiguity

– Polysemy: “Python” (the snake or the programming language?)

– Synonymy: “car” & “automobile”


Web Structure

• The Web graph has a bow-tie shape

• It has a scale-free topology

– Many features of the graph follow a power-law distribution, $p(x) \propto x^{-\gamma}$

– The core has the small-world property

    • The shortest directed path from any page in the core to any other page in the core involves 16–20 links on average
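A power law appears as a straight line on log-log axes, so its exponent can be read off as the negative slope of that line. The sketch below fits the slope by ordinary least squares on synthetic, noise-free data; the exponent value 2.1 and the function name are chosen only for illustration, and a log-log regression like this is not a robust estimator for real, noisy measurements.

```python
import math

def fit_power_law_exponent(points):
    """Least-squares slope of log(p) against log(x); returns the exponent (-slope)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(p) for _, p in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic data that follows p(x) = x^(-GAMMA) exactly (hypothetical exponent).
GAMMA = 2.1
points = [(k, k ** -GAMMA) for k in range(1, 51)]
estimate = fit_power_law_exponent(points)
```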


Distribution of Web Graph: Power-Law


Search Engine Trends

• 625 million search queries are received by major search engines each day

• 80% of web surfers discover the new sites that they visit through search engines

• Web search currently generates more than 85% of the traffic to most web sites


Components of Search Engines

• Crawling

• Indexing

• Ranking


Architecture of Search Engines

[Figure: architecture diagram with the components Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, and Client (queries in, results out).]


Web Crawling Issues

• Coverage

– Google, the biggest search engine, covers only 70% of web content

– We must focus on high-quality pages

• Freshness

– Keep the local copy synchronized with the source pages

• Politeness

– Crawl without disrupting the web, and obey the webmasters’ constraints
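One common way webmasters express their constraints is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check URLs against an inline robots.txt; the file contents, the `ExampleCrawler` agent name, and the example.org URLs are hypothetical, and a real crawler would download the file from the site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, agent="ExampleCrawler"):
    """Return True if the robots.txt rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

allowed = may_fetch("https://example.org/index.html")
blocked = may_fetch("https://example.org/private/data.html")
delay = parser.crawl_delay("ExampleCrawler")  # seconds to wait between requests
```

Honouring the crawl delay between requests to the same host is what keeps the crawler from disrupting the site.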



Web crawling

[Figure: Crawler]


Crawling Scheduling

• Breadth-First

• Back-link count

• PageRank,…
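The first two policies can be compared on a toy link graph. This is a minimal sketch: the graph, the function names, and the tie-breaking rule (alphabetical order) are assumptions made for illustration, and the back-link count uses only links seen from pages crawled so far.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c", "d"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def breadth_first_order(seed):
    """Crawl pages in the order they are discovered (FIFO frontier)."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

def backlink_count_order(seed):
    """Crawl next the frontier page with the most back-links seen so far."""
    crawled, order = set(), []
    backlinks = {seed: 0}  # frontier page -> number of known in-links
    while backlinks:
        # Ties are broken alphabetically because max() keeps the first maximum.
        page = max(sorted(backlinks), key=backlinks.get)
        del backlinks[page]
        crawled.add(page)
        order.append(page)
        for target in LINKS[page]:
            if target not in crawled:
                backlinks[target] = backlinks.get(target, 0) + 1
    return order

bfs = breadth_first_order("a")
by_backlinks = backlink_count_order("a")
```

On this graph the back-link policy visits "d" before "c", because "d" has already collected two in-links when the choice is made.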


Crawling scheduling

[Figure: crawl scheduling diagram with the components Web, Downloader, Repository, Ranking Algorithm, and URLs and Links.]


Indexing

• Text operations form the index words (tokens)

– Stopword removal

– Stemming

• Indexing constructs an inverted index of word-to-document pointers
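The two steps above can be sketched as follows. The stopword list, the suffix-stripping `crude_stem` helper, and the sample documents are all hypothetical stand-ins; a real system would use a full stopword list and a proper stemmer such as Porter's.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "of", "and", "is", "in", "to"}  # tiny illustrative list

def crude_stem(token):
    """Very rough suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Text operations: tokenize, remove stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

DOCS = {
    1: "Crawling the web",
    2: "Ranking web pages",
    3: "The crawler crawls pages",
}
index = build_inverted_index(DOCS)
```

A query term is processed with the same text operations and then looked up in `index` to get its candidate documents.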


Indexing Systems

• Google file system

• MG4J (Managing Gigabytes for Java)

• Lucene (Java, Apache License)

• Swish-e (C++-Linux)


Ranking: Definition

• Ranking is the process that estimates the quality of each result in the set retrieved by a search engine

• Ranking is the most important part of a search engine


Ranking Types

• Content-based

– Classical IR

• Connectivity-based (web)

– Query independent

– Query dependent

• User-behavior based


Classical Information Retrieval

• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)

– Vector space

– Probabilistic

[Figure: a words × docs matrix (words 1 … w, docs 1 … n) matched against the query.]
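A minimal sketch of the vector space variant, assuming raw term counts for tf, log(N / df) for idf, and cosine similarity for matching. These weighting choices and the three sample documents are assumptions for illustration; many other tf-idf variants exist.

```python
import math
from collections import Counter

# Hypothetical document collection.
DOCS = {
    "d1": "web search engine ranking",
    "d2": "web crawling and web indexing",
    "d3": "classical information retrieval ranking",
}

def idf_table(docs):
    """idf(term) = log(N / df), where df counts documents containing the term."""
    n = len(docs)
    df = Counter(term for text in docs.values() for term in set(text.split()))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(text, idf):
    """Sparse tf-idf vector; terms unknown to the collection are ignored."""
    tf = Counter(text.split())
    return {term: tf[term] * idf[term] for term in tf if term in idf}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Order document ids by cosine similarity to the query, best first."""
    idf = idf_table(docs)
    query_vec = tfidf_vector(query, idf)
    scores = {d: cosine(query_vec, tfidf_vector(t, idf)) for d, t in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank("web crawling", DOCS)
```

The rare term "crawling" carries more weight than the common term "web", so the document that contains both is ranked first.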


Classical Information Retrieval

• This works because of the following assumptions in classical IR:

– Queries are long and well specified

– Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic

– The vocabulary is small and relatively well understood


Web information retrieval

• Queries are short: 2.35 terms on average

• Huge variety in documents: language, quality, duplication

• Huge vocabulary: hundreds of millions of terms

• Deliberate misinformation

• Spamming!

– With content-based ranking, a page’s rank is completely under the control of its author


Ranking in Web IR

• Ranking is a function of the query terms and of the hyperlink structure

– Using the content of other pages to rank the current page

• It is out of the control of the page’s author

– Spamming is hard

[Figure: a words × docs matrix (words 1 … w, docs 1 … n) combined with a docs × docs Web graph (docs 1 … n), matched against the query.]
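PageRank, named earlier in the deck, is one example of a connectivity-based, query-independent score. The sketch below is a simplified power-iteration version on a hypothetical four-page graph; the damping factor 0.85 is a conventional choice, and the uniform handling of pages without out-links is an assumption of this sketch.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over a dict: page -> out-links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page shares its damped rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: "c" is linked to by every other page.
LINKS = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = pagerank(LINKS)
best = max(scores, key=scores.get)
```

The score of a page depends on which pages link to it, not on its own text, which is why it is harder for the author to manipulate.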


Books


• Main textbook:

– C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

– http://www.cs.utexas.edu/~mooney/ir-course/

• Secondary:

– R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.


Assessment

• Final Exam: 10 Marks

• Project: 5 Marks

• Homework: 2 Marks

• Paper Review and Presentation: 3 Marks


Papers for Review

• Cho, Junghoo, and Sourashis Roy. "Impact of search engines on page popularity." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.

• Spink, Amanda, et al. "Searching the web: The public and their queries." Journal of the Association for Information Science and Technology 52.3 (2001): 226-234.

• Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web." Scientific American 284.5 (2001): 34-43.


Contacts

• Join the course group on Shagerdaneh:

https://shagerdaneh.ir/

• Telegram Channel

https://telegram.me/RaziWM982

• Instructor’s email:

[email protected]


Questions?
