Search Engine and Web Mining
Hamed Monkaresi
Department of Computer Engineering and Information Technology,
Razi University, Kermanshah
Outline
• Web challenges
• Search engines
• Web crawling
• Web ranking
– Ranking algorithms
– Ranking challenges
Why has the Web been so successful?
• It is a distributed system
• It uses a simple protocol
• Producing and publishing content is very simple
Web Retrieval
[Figure: the search engine matches the user space (retrieval, browsing) against the information space (index terms; full text; full text + structure, e.g. hypertext).]
IR vs Data Retrieval
• Data retrieval (DR) aims to retrieve all objects that satisfy clearly defined conditions, such as those in a regular expression
• DR does not solve the problem of retrieving information about a subject or topic
Comparing IR to databases (data retrieval)

                      Databases                             IR
Data                  Structured                            Unstructured
Fields                Clear semantics (SSN, age)            No fields (other than text)
Queries               Defined (relational algebra, SQL)     Free text (“natural language”), Boolean
Query specification   Complete                              Incomplete
Matching              Exact (results are always “correct”)  Imprecise (need to measure effectiveness)
Error response        Sensitive                             Insensitive
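The contrast in the comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the documents, and the function names `data_retrieval` and `ir_search` are made up for the example, and the IR score is a plain count of query-term occurrences rather than any particular ranking model.

```python
import re

# Hypothetical structured records (database side) and free-text documents (IR side).
records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 27},
    {"name": "Cy", "age": 41},
]
docs = {
    "d1": "web crawling and web ranking for search engines",
    "d2": "database systems and relational algebra",
    "d3": "ranking web pages by link structure",
}

def data_retrieval(rows, min_age):
    """Exact matching: a row either satisfies the condition or it does not."""
    return [row["name"] for row in rows if row["age"] >= min_age]

def ir_search(collection, query):
    """Imprecise matching: score documents by query-term occurrences, then rank."""
    terms = re.findall(r"\w+", query.lower())
    scores = {}
    for doc_id, text in collection.items():
        words = re.findall(r"\w+", text.lower())
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

exact = data_retrieval(records, 30)    # every returned row is "correct"
ranked = ir_search(docs, "web ranking")  # an ordering whose quality must be evaluated
```

The database query has a single right answer, while the IR result is an ordering that is only as good as the scoring function behind it.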
Main points in IR
• What is the definition of relevancy?
• Evaluation!
– Subjective (unlike the evaluation of hardware or networks)
Web Challenges
• Huge size of information
– 11.5 billion pages (2005)
– 64 billion pages (5 June 2008)
• Proliferation and dynamic nature
– New pages are created at a rate of 8% per week
– Only 20% of the current pages will be accessible after one year
– New links are created at a rate of 25% per week
• Heterogeneous content
– HTML/Text/Audio/…
Web IR (SE) Challenges (1)
• The definition of relevancy
• Connectivity as well as content on the Web
– A huge graph
• Different types of queries
– Narrow: needle in a haystack
– Wide: overlapping with many areas
• Users have little patience: they commonly browse through only the first ten results (i.e. one screen), hoping to find there the “right” document for their query
Web IR (SE) Challenges (2)
• Spamming phenomenon
– It is crucial for business sites to be ranked highly by the major search engines.
– Quite a few companies sell this kind of expertise (also known as “search engine optimization”): they actively research the ranking algorithms and heuristics of search engines, and know how many keywords to place (and where) in a Web page so as to improve the page’s ranking
– SEO books
• Content & connectivity spamming
• Anti-spamming solutions
Web IR (SE) Challenges (3)
• Rich-get-richer problem
– It takes a long time for a young, high-quality web page to receive an appropriate ranking
– Unfairness
– Steers the growth of web content in bad directions
Web IR (SE) Challenges (4)
• Crawling challenges
– Huge size of information with dynamic nature
– Freshness & coverage: Google covers only 70% of the Web
– A suitable scheduling policy
– Hidden web (600 times bigger)
• Using meta search engines to increase coverage
– Merging and ranking problem
Web IR (SE) Challenges (5)
• User evaluation is subjective and changes over time
– Relevancy between a query and a document depends on the user and the time
– Two users with the same query may expect different results
Web IR (SE) Challenges (6)
• Query Ambiguity
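– Python: one word, several meanings (e.g. the snake or the programming language)
– Car & automobile: different words, same meaning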
Web Structure
• The Web graph has a bow-tie shape
• It has a scale-free topology
– Many features of the graph follow a power-law distribution, $p(x) \propto x^{-\gamma}$ for some exponent $\gamma > 0$
– The core has the small-world property: the shortest directed path from any page in the core to any other page in the core involves 16–20 links on average
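A power law appears as a straight line on a log-log plot, because taking logarithms of a distribution proportional to x raised to a negative exponent gives a line whose slope is minus that exponent. The sketch below uses this to estimate the exponent by ordinary least squares. The data are synthetic and the exponent 2.1 is an arbitrary value chosen for the illustration; `fit_power_law_exponent` is a hypothetical helper, and a straight-line fit is only the simplest of several possible estimators.

```python
import math

def fit_power_law_exponent(xs, ps):
    """Least-squares slope of log p against log x; returns the exponent (minus the slope)."""
    log_x = [math.log(x) for x in xs]
    log_p = [math.log(p) for p in ps]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_p = sum(log_p) / n
    cov = sum((a - mean_x) * (b - mean_p) for a, b in zip(log_x, log_p))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return -cov / var

# Synthetic, noise-free data (an assumption for illustration only):
# the share of pages having x incoming links, for x = 1..50.
xs = list(range(1, 51))
ps = [0.5 * x ** -2.1 for x in xs]

gamma = fit_power_law_exponent(xs, ps)  # recovers the exponent used to build the data
```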
Distribution of Web Graph: Power-Law
Search Engines Trends
• 625 million search queries are received by major search engines each day
• 80% of web surfers discover the new sites that they visit through search engines
• Web search currently generates more than 85% of the traffic to most web sites
Components of Search Engines
• Crawling
• Indexing
• Ranking
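Architecture of Search Engines
[Figure: crawler(s) fetch pages from the Web into a page repository; the indexer module and the collection analysis module build the indexes (text, structure, utility); the query engine, together with ranking, uses the indexes to answer client queries and return results.]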
Web Crawling Issues
• Coverage
– Google, the biggest search engine, covers only 70% of web content
– We must focus on high-quality pages
• Freshness
– Keep the local copy synchronized with the source pages
• Politeness
– Crawl without disrupting the web, and obey webmasters' constraints
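One common way for a crawler to respect webmasters' constraints is to honour the site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` for this. The robots.txt content, the agent name `ExampleCrawler`, and the example.com URLs are all hypothetical; a real crawler would download the file from the site and also wait between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def make_policy(robots_txt):
    """Parse robots.txt text into a parser that answers permission queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

AGENT = "ExampleCrawler"  # hypothetical user-agent name
policy = make_policy(ROBOTS_TXT)

allowed = policy.can_fetch(AGENT, "https://example.com/index.html")
blocked = policy.can_fetch(AGENT, "https://example.com/private/data.html")
delay = policy.crawl_delay(AGENT)  # seconds to wait between requests, or None if unset
```

A polite crawler would skip any URL for which `can_fetch` returns `False` and sleep for at least `delay` seconds between requests to the same host.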
Web crawling
Crawler
Crawling Scheduling
• Breadth-First
• Back-link count
• PageRank,…
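The first two policies above can be compared on a toy example. This is a minimal sketch under stated assumptions: `WEB` is a hypothetical in-memory graph standing in for real downloads, and the back-link policy counts only links seen from pages already downloaded, which is one plausible reading of "back-link count". A real scheduler would use a priority queue rather than a linear scan.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to.
WEB = {
    "a": ["b", "c", "d"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["leaf"],
    "hub": [],
    "leaf": [],
}

def crawl_breadth_first(web, seed):
    """Visit pages in the order they were discovered (FIFO frontier)."""
    order, seen, frontier = [], {seed}, deque([seed])
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in web[url]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

def crawl_by_backlinks(web, seed):
    """Always download the frontier page with the most known back-links."""
    order, done = [], set()
    backlinks = {seed: 0}  # frontier: discovered page -> links seen pointing to it
    while backlinks:
        # Linear scan for clarity; ties are broken alphabetically.
        url = min(backlinks, key=lambda u: (-backlinks[u], u))
        del backlinks[url]
        done.add(url)
        order.append(url)
        for link in web[url]:
            if link not in done:
                backlinks[link] = backlinks.get(link, 0) + 1
    return order

bfs_order = crawl_breadth_first(WEB, "a")
backlink_order = crawl_by_backlinks(WEB, "a")
```

In this toy graph the back-link policy downloads `hub` before `d`, because two already-downloaded pages point to it, whereas breadth-first keeps strict discovery order.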
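Crawling scheduling
[Figure: the downloader fetches pages from the Web into the repository; a ranking algorithm orders the extracted URLs and links to decide what the downloader fetches next.]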
Indexing
• Text operations form the index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of word-to-document pointers.
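The two steps above can be sketched as follows. This is an illustration only: the stopword list is a tiny made-up sample, `stem` is a crude suffix stripper standing in for a real stemmer (for example Porter's), and the three documents are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "are", "in", "to"}

def stem(word):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Text operations: lowercase, split into words, drop stopwords, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

DOCS = {
    1: "The crawler is crawling the web",
    2: "Search engines rank web pages",
    3: "Indexing the pages of a crawler",
}
index = build_inverted_index(DOCS)
```

Looking up a query term is then a dictionary access, for example `index["web"]`, instead of a scan over every document.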
Indexing Systems
• Google file system
• MG4J (Managing Gigabytes for Java)
• Lucene (Java, Apache License)
• Swish-e (C++-Linux)
Ranking : Definition
• Ranking is the process of estimating the quality of each result retrieved by a search engine and ordering the results accordingly
• Ranking is the most important part of a search engine
Ranking Types
• Content-based
– Classical IR
• Connectivity based (web)
– Query independent
– Query dependent
• User-behavior based
Classical Information Retrieval
• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
– Vector space
– Probabilistic
[Figure: a query matched against a words × documents matrix (words 1 to w, documents 1 to n).]
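A minimal sketch of tf-idf ranking follows. It assumes one common weighting, raw term count multiplied by the logarithm of (number of documents / number of documents containing the term), summed over the query terms. It is not the full vector space model, which would also normalise by document length, and the documents and the function name `rank_tf_idf` are made up for the illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_tf_idf(docs, query):
    """Score each document by the sum of tf * idf over the query terms."""
    n = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                score += counts[term] * math.log(n / df[term])
        scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

DOCS = {
    "d1": "web search engines crawl the web",
    "d2": "ranking algorithms for web search",
    "d3": "cooking recipes and kitchen tips",
}
ranking = rank_tf_idf(DOCS, "web ranking")
```

Here `d2` outranks `d1` even though `d1` mentions "web" twice, because "ranking" is the rarer term and therefore carries a higher idf weight.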
Classical Information Retrieval
• This works because of the following assumptions in classical IR:
– Queries are long and well specified
– Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic
– The vocabulary is small and relatively well understood
Web information retrieval
• Queries are short: 2.35 terms on average
• Huge variety in documents: language, quality, duplication
• Huge vocabulary: hundreds of millions of terms
• Deliberate misinformation
• Spamming!
– With content-based ranking, a page’s rank is completely under the control of the page’s author
Ranking in Web IR
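• Ranking is a function of the query terms and of the hyperlink structure
– Using the content of other pages to rank the current page
• It is out of the control of the page’s author
– Spamming is hard
[Figure: a query matched against a words × documents matrix (words 1 to w, documents 1 to n) combined with a documents × documents Web graph (documents 1 to n).]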
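As one example of a score derived purely from hyperlink structure, the sketch below computes PageRank by power iteration. It is a simplified illustration, not a description of any search engine's production ranking: the link graph `LINKS` is hypothetical, the damping factor 0.85 and the iteration count are conventional but arbitrary choices, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a link graph given as page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of its in-links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page site.
LINKS = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(LINKS)
best = max(scores, key=scores.get)
```

The author of `orphan` cannot raise its score by editing the page itself, since the score depends only on which other pages link to it.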
Books
• Main textbook:
– C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
– http://www.cs.utexas.edu/~mooney/ir-course/
• Secondary:
– R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
Assessment
• Final Exam: 10 Marks
• Project: 5 Marks
• Homework: 2 Marks
• Paper Review and Presentation: 3 Marks
Papers for Review
• Cho, Junghoo, and Sourashis Roy. "Impact of search engines on page popularity." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.
• Spink, Amanda, et al. "Searching the web: The public and their queries." Journal of the Association for Information Science and Technology 52.3 (2001): 226-234.
• Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web." Scientific American 284.5 (2001): 34-43.
Contacts
• Be a member of this group in Shagerdaneh:
https://shagerdaneh.ir/
• Telegram Channel
https://telegram.me/RaziWM982
• Instructor’s email:
Questions?