Chapter 5 Introduction to WWW Application


Chapter 5

Introduction to WWW Application


WWW Applications

Search Engine / Meta-Search Engine

Web Data Mining

Bots and Internet Intelligent Agents

Electronic Commerce

Web Titles

e-Learning

Section 5-1

Search Engine / Meta-Search Engine


What is a Search Engine?

A mechanism that helps users find online resources quickly.

[Figure: Internet user, browser, search engine, and database]


Popular Search Engines

AltaVista (http://www.altavista.com)

Excite (http://www.excite.com)

Google (http://www.google.com)

HotBot (http://www.hotbot.com)

Lycos (http://www.lycos.com)

Yahoo! (http://www.yahoo.com)

WebCrawler (http://www.webcrawler.com)

Openfind, GAIS, Yam, etc.


Types of Search Tools

Search Engines & Meta-Search Engines
– Search Engine: Google
– Meta-Search Engine: MetaCrawler, SavvySearch

Subject Directories
– Yahoo!

Specialized Databases (The Invisible Web)
– Librarian's Index


How to choose a starting point?

Search Engines
– Advantage: Can be fast.
– Disadvantage: Irrelevant information can overwhelm useful information. (Good choice of keywords can help here.)

Specialized Web Site
– Advantage: Leads to information inaccessible to search engines.
– Disadvantage: May not exist for your topic.


How to choose a starting point? (Cont.)

FAQ
– Advantage: A great place to start.
– Disadvantage: Not all topics have FAQs.

Guess
– Advantage: Can be very fast.
– Disadvantage: Requires experience, intuition.

Discussion group
– Advantage: Reaches a community of experts.
– Disadvantage: Relatively slow. Experts may tire of beginner questions.


Search Engine & Catalog

A catalog is the set of Web pages that a search engine knows how to find. It is also called a database or index.

A search engine can find only the Web pages in its catalog.

No catalog covers the entire Internet. Because the Internet keeps changing, catalogs are never completely up to date.


Give a query, get a hit

A keyword is a word, partial word, or phrase that you can give to a search engine. It is also called a search term.

A query is one or more keywords that, together, represent the concept that you want to find on the Net. It is also called a search string.

A hit is a Web page in the catalog that matches your query. It is also called a match.
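
The relationship between catalog, query, and hit can be sketched in a few lines of Python. This is only a toy illustration: the URLs, the page texts, and the function name search are made up here, and a real search engine matches keywords in far more sophisticated ways.

```python
# Toy catalog: URL -> page text. All URLs and texts are invented for illustration.
catalog = {
    "http://example.com/a": "web search engines and spiders",
    "http://example.com/b": "electronic commerce on the web",
    "http://example.com/c": "intelligent agents and bots",
}

def search(query, catalog):
    """Return the hits: URLs whose text contains every keyword in the query."""
    keywords = query.lower().split()
    hits = []
    for url, text in catalog.items():
        words = text.lower().split()
        if all(keyword in words for keyword in keywords):
            hits.append(url)
    return sorted(hits)

print(search("web", catalog))           # two hits: pages a and b
print(search("web commerce", catalog))  # one hit: page b
print(search("xml", catalog))           # no hits: nothing in the catalog matches
```

A page that is not in the catalog can never be returned as a hit, no matter how well it would match the query.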


Techniques to build catalogs

Active Search Engine
– Collects Web page information by itself.
– Uses a program called a spider (also called a robot, wanderer, or crawler) that travels around the Net, locates Web pages, and adds entries to the catalog.
– Some spiders run all the time, adding information to the catalog on a regular basis. Others run less frequently, perhaps updating the catalog weekly or monthly.
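
As a rough sketch of what a spider does, the Python code below walks a tiny in-memory "Web" and adds every page it reaches to a catalog. The WEB dictionary, the URLs, and the function name crawl are assumptions made for this example. A real spider would fetch pages over HTTP and parse their links, and the breadth-first visiting order used here is just one possible choice.

```python
from collections import deque

# Hypothetical miniature Web: URL -> (page text, outgoing links).
WEB = {
    "http://example.com/": ("home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/", "http://example.com/c"]),
    "http://example.com/c": ("page c", []),
    "http://example.com/orphan": ("never linked from anywhere", []),
}

def crawl(start_url, web):
    """Follow links from start_url and return a catalog of the pages found."""
    catalog = {}
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, so no page is visited twice
    while queue:
        url = queue.popleft()
        if url not in web:      # dead link: nothing to add
            continue
        text, links = web[url]
        catalog[url] = text     # add an entry to the catalog
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return catalog

found = crawl("http://example.com/", WEB)
print(list(found))
```

The page at http://example.com/orphan never enters the catalog because no other page links to it, which is one reason a spider-built catalog cannot cover everything.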


[Figure: An active search engine collects URLs from the WWW; a passive search engine has Web pages registered with it ("My Web Page", "Register").]


Techniques to build catalogs (Cont.)

Passive Search Engine
– Does not seek out Web pages by itself.
– Allows people to register their Web pages, usually by filling out a form online. Once a page is registered with the search engine, the page can be found by queries.
– Some search engines have both active and passive features. They use a spider to gather information, but also allow users to register pages.


Techniques to build catalogs (Cont.)

Meta-Search Engine
– Does not catalog any Web pages itself.
– Forwards the user's queries to other search engines, which do the actual work.
– When results come back from the other search engines, the meta-search engine presents them to the user, possibly summarizing them or at least giving them a consistent appearance.
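
The forwarding-and-merging idea can be sketched as follows. The two engine functions are stand-ins invented for this example: they ignore the query and return fixed URLs, whereas a real meta-search engine would send the query to remote search engines over the network. Removing duplicate URLs is one simple way of giving the combined hits a consistent appearance.

```python
# Stand-in search engines: each takes a query and returns a list of hit URLs.
# They are hypothetical stubs that ignore the query and return fixed results.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2"]

def engine_b(query):
    return ["http://example.com/2", "http://example.com/3"]

ENGINES = {"A": engine_a, "B": engine_b}

def meta_search(query, engines):
    """Forward the query to every engine and merge the hits, dropping duplicates."""
    merged = []
    seen = set()
    for name, engine in engines.items():
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                # Every hit is reported in the same format, whatever its source.
                merged.append({"url": url, "source": name})
    return merged

for hit in meta_search("spider", ENGINES):
    print(hit["source"], hit["url"])
```

This sketch calls the engines one after another, so the total search time is at least the sum of their response times.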


[Figure: MetaCrawler receives the user's query, sends a query to each of AltaVista, Lycos, and Yahoo, and returns the hits that come back.]


Comparison of Search Engines

Active Search Engine
– Advantage: Large catalog.
– Disadvantage: Too many hits.

Passive Search Engine
– Advantage: Possibly more organized.
– Disadvantage: Smaller catalog; items may be cataloged in unexpected places.

Meta-Search Engine
– Advantage: One query goes a long way.
– Disadvantage: Longer search time.


Choose keywords with care

The success of a Web search depends heavily on the keywords you choose. Be sure to watch out for:
– Misspellings
– Alternate spellings
– Synonyms
– Word forms


The forms of advanced query

And
– Appearance: AND, &, &&
– Meaning: Match all of these keywords.

Or
– Appearance: OR, |, ||
– Meaning: Match at least one of these keywords.

Not
– Appearance: NOT, ~, -
– Meaning: Match if this keyword is not present.

Some
– Appearance: Usually an on/off switch
– Meaning: Only some of the keywords must be matched.

Required keyword
– Appearance: +
– Meaning: Along with the "Some" operator, indicates a keyword that must be matched.

Near
– Appearance: NEAR
– Meaning: Match these keywords if they are near each other.


The forms of advanced query (Cont.)

Adjacent
– Appearance: "quotation marks"
– Meaning: Match these keywords if they are next to each other, in order.

Grouping
– Appearance: (parentheses)
– Meaning: Try to match these keywords before matching the rest of the keywords.

Allow misspellings
– Appearance: Usually an on/off switch
– Meaning: Match words that are almost the same spelling as these keywords.

Allow partial words
– Appearance: Usually an on/off switch
– Meaning: Also called substring match. Match words that contain your keyword.


The forms of advanced query (Cont.)

Case sensitivity
– Appearance: Usually an on/off switch
– Meaning: Ignore or obey capitalization when matching words.

Wildcard
– Appearance: *
– Meaning: Match anything.

Limit search
– Appearance: Usually an on/off switch
– Meaning: Search only part of the search engine's catalog.
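
Several of the advanced query forms map naturally onto set operations. The sketch below is an illustration only: the PAGES data and the helper names match and phrase are invented for this example, and each real search engine has its own query syntax. And, Or, and Not become set intersection, union, and difference; Grouping is expressed with ordinary parentheses; Wildcard uses the standard fnmatch module; Adjacent is a word-by-word phrase check.

```python
import fnmatch

# Made-up pages: page id -> text.
PAGES = {
    "p1": "the quick brown fox",
    "p2": "the lazy brown dog",
    "p3": "a quick red fox jumps",
}

def match(keyword):
    """Pages containing a word that matches the keyword; * acts as a wildcard."""
    return {page for page, text in PAGES.items()
            if any(fnmatch.fnmatchcase(word, keyword) for word in text.split())}

def phrase(keywords):
    """Pages where the keywords appear next to each other, in order (Adjacent)."""
    target = keywords.split()
    n = len(target)
    result = set()
    for page, text in PAGES.items():
        words = text.split()
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            result.add(page)
    return result

print(sorted(match("quick") & match("fox")))   # And
print(sorted(match("brown") | match("red")))   # Or
print(sorted(match("fox") - match("brown")))   # Not
print(sorted((match("fox") | match("dog")) & match("brown")))  # Grouping
print(sorted(phrase("quick brown")))           # Adjacent
print(sorted(match("qu*")))                    # Wildcard
```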


Search Strategies

General search: When you know little about your topic.

Specific search: When you know a lot about your topic.

Incremental search: Zeroing in on your topic.

Substring search: Matching several similar keywords at once.

Search-and-jump: A speedy, two-part search technique.

Category search: Convenient browsing of a topic area.

Search-and-rank: Locating the most relevant hits first.


Comparison of search strategies

General search
– Advantage: Likely to get a relevant hit.
– Disadvantage: Likely to get many irrelevant hits too.

Specific search
– Advantage: Hits are more likely to be relevant.
– Disadvantage: Low odds of getting a hit.

Incremental search
– Advantage: Zero in on your goal.
– Disadvantage: Multiple queries are time-consuming.

Substring search
– Advantage: Can simplify queries.
– Disadvantage: Likely to produce irrelevant hits.

Search-and-jump
– Advantage: Faster than multiple queries.
– Disadvantage: Download time may be longer; less powerful than multiple queries.

Category search
– Advantage: Logical, organized, great for browsing.
– Disadvantage: Relies on the skill of the organizer, whose world view may or may not match yours.

Search-and-rank
– Advantage: Lists the most relevant hits first.
– Disadvantage: Effective ranking functions are still undiscovered.


Some Meta-Search Engines

WebCrawler

– Characteristics

• It uses a content-based, full-text indexing system to provide a high-quality index.

• It uses a breadth-first search strategy to create a broad index.

• It tries to include as many Web servers as possible.


Some Meta-Search Engines (Cont.)

– Architecture

• The search engine.

• The agents.

• The database.

• The query server.

[Figure: WebCrawler architecture, showing the Internet Webspace, Agents, Search Engine, Database, and Query Server]


Some Meta-Search Engines (Cont.)

Lycos

– It extracts the following pieces of information from each document that it retrieves:

• Title

• Headings and subheadings

• 100 most important words

• First 20 lines

• Size in bytes

• Number of words
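
To make this kind of extraction concrete, here is a minimal sketch using Python's standard html.parser module. It is not Lycos's actual code. The class name SummaryParser, the function name summarize, and the sample document are assumptions for this example, and only the title, headings, size in bytes, and number of words are computed.

```python
from html.parser import HTMLParser

class SummaryParser(HTMLParser):
    """Collects the title, the headings, and the visible text of an HTML page."""
    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.text_parts = []
        self._current = None  # title or heading tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            if tag != "title":
                self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        else:
            if self._current in self.HEADING_TAGS:
                self.headings[-1] += data
            self.text_parts.append(data)  # headings count as body text too

def summarize(html):
    """Return a small record of per-document information."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.text_parts).split()
    return {
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
        "size_bytes": len(html.encode("utf-8")),
        "word_count": len(words),
    }

DOC = ("<html><head><title>Demo</title></head><body><h1>Intro</h1>"
       "<p>Hello web world</p><h2>More</h2></body></html>")
print(summarize(DOC))
```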


Some Meta-Search Engines (Cont.)

– The 100 most important words are selected using the Tf * IDf weighting algorithm.

• Tf (Term Frequency) is the number of occurrences of a particular term in a document.

• Df (Document Frequency) is the number of documents in the collection in which a particular term occurs.

• IDf (Inverse Document Frequency)

• N: the number of documents in the collection

• IDf = log(N / Df)

• weight = Tf * IDf = Tf * log(N / Df)
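
The weighting formula can be tried out on a tiny collection. In the sketch below the documents, the function names tf_idf and top_words, and the use of the natural logarithm are assumptions for illustration (the formula does not specify a base for the logarithm), and Tf is counted within a single document.

```python
import math
from collections import Counter

# Made-up collection: document id -> text.
docs = {
    "d1": "web search engine web spider",
    "d2": "web commerce",
    "d3": "search agents",
}

def tf_idf(doc_id, docs):
    """Return weight = Tf * log(N / Df) for every term of one document."""
    n = len(docs)                               # N: documents in the collection
    tf = Counter(docs[doc_id].split())          # Tf: occurrences in this document
    weights = {}
    for term, count in tf.items():
        # Df: number of documents in which the term occurs
        df = sum(1 for text in docs.values() if term in text.split())
        weights[term] = count * math.log(n / df)
    return weights

def top_words(doc_id, docs, k):
    """Return the k highest-weighted terms (ties broken alphabetically)."""
    weights = tf_idf(doc_id, docs)
    return sorted(weights, key=lambda term: (-weights[term], term))[:k]

print(tf_idf("d1", docs))
print(top_words("d1", docs, 2))  # ['engine', 'spider']
```

Although "web" occurs twice in d1, it also appears in another document, so its weight ends up lower than that of "engine" and "spider", which occur once but only in d1.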


Some Meta-Search Engines (Cont.)

Harvest

– It is an integrated tool that provides a scalable, customizable architecture for gathering, indexing, caching, replicating, and accessing Internet information.

[Figure: WWW Robots and Web Servers]


Some Meta-Search Engines (Cont.)

[Figure: Harvest architecture, showing Brokers (Index), Gatherers, Web Servers, and Filters]

Subsystems

– Gatherer collects indexing information

– Broker provides a flexible interface to gathered information

– Index/Search subsystem allows the information space to be flexibly indexed and searched in a variety of ways

– Object Cache stores contents of retrieved objects to alleviate access bottlenecks to popular data

– Replicator mirrors index information of Brokers to alleviate server bottlenecks
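
The division of labor between a Gatherer and a Broker can be illustrated with a very small conceptual sketch. This is not Harvest's actual software or interface: the class names simply reuse the subsystem names, and the methods gather and query, the sample pages, and the word-set summaries are all hypothetical.

```python
class Gatherer:
    """Collects indexing information (here: the set of words) from one site."""
    def __init__(self, pages):
        self.pages = pages  # URL -> text, standing in for one Web server

    def gather(self):
        return {url: set(text.lower().split()) for url, text in self.pages.items()}

class Broker:
    """Builds an index from several gatherers and answers queries against it."""
    def __init__(self, gatherers):
        self.index = {}  # word -> set of URLs
        for gatherer in gatherers:
            for url, words in gatherer.gather().items():
                for word in words:
                    self.index.setdefault(word, set()).add(url)

    def query(self, keyword):
        return sorted(self.index.get(keyword.lower(), set()))

site1 = Gatherer({"http://site1.example/a": "Web search tools"})
site2 = Gatherer({"http://site2.example/b": "search agents"})
broker = Broker([site1, site2])
print(broker.query("search"))
```

Keeping gathering separate from the broker's index means one broker can combine information from several gatherers.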