Searching the Internet
CSCI-N 100
Department of Computer and Information Science

Searching the Internet
  What is the Internet?
  Does anyone own the Internet?
  How is the Internet controlled?

The Internet…
  It is not a centrally owned or organized institution.
  It is not a single entity.
  It is not a "Den of Iniquity."
  It is not crawling with eight-year-old children controlling nuclear bombs.
  It is not a hive of viruses waiting to attack your computer.
  It is not just for pimple-faced teenagers with propeller beanies.

The Internet…
  Is a vast repository of information.
  Is relatively universal.
  Is dynamic, changing minute by minute.

The Internet
  InterNIC - Internet Network Information Center - An international coalition of Internet organizations that has what control there is of the Internet
  IAB - Internet Architecture Board - An organization that sets standards for the Internet
  ICANN - Internet Corporation for Assigned Names and Numbers - An organization responsible for the global coordination of the Internet's system of unique identifiers
  W3C - World Wide Web Consortium - Develops interoperable technologies, specifications, guidelines, software, and tools

Search engines
  A search engine is an information retrieval system
  It allows one to ask for content meeting specific criteria
  The resulting list is often sorted with respect to some measure of relevance
  Search engines use regularly updated indexes to operate quickly and efficiently
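
The idea of answering queries from an index and sorting the results by relevance can be sketched in a few lines of Python. This is only an illustration: the sample pages, the names `build_index` and `search`, and the choice of word counts as the relevance measure are all assumptions made for the example, not a description of how any real search engine ranks pages.

```python
from collections import Counter, defaultdict

# Hypothetical sample pages; a real engine would index pages fetched from the web.
PAGES = {
    "page1": "peanut butter and jelly sandwich",
    "page2": "butter cookies with extra butter",
    "page3": "peanut farming in Georgia",
}

def build_index(pages):
    """Map each word to a Counter of {page name: occurrences}."""
    index = defaultdict(Counter)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word][name] += 1
    return index

def search(index, query):
    """Return page names ranked by total occurrences of the query words."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    # Sort by score (highest first), then by name so ties have a fixed order.
    return sorted(scores, key=lambda name: (-scores[name], name))

INDEX = build_index(PAGES)
print(search(INDEX, "peanut butter"))
```

The index is built once and then reused for every query, which is why a search does not need to reread the pages themselves.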

Search engines
  First search engines
    Archie - "archive" without the "v"
    Created in 1990 by a student in Montreal
    The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of filenames
    It could not search by file contents
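
A searchable database of filenames, in the spirit of what Archie did, might look like the sketch below. The host names, paths, and the functions `build_filename_db` and `find_files` are hypothetical; the point is only that the search looks at file names and never opens the files.

```python
# Hypothetical directory listings, keyed by FTP host name.
LISTINGS = {
    "ftp.example.edu": ["/pub/games/chess.tar", "/pub/docs/readme.txt"],
    "ftp.example.org": ["/mirror/chess-openings.zip", "/mirror/notes.txt"],
}

def build_filename_db(listings):
    """Flatten the per-host listings into (host, path) records."""
    return [(host, path) for host, paths in listings.items() for path in paths]

def find_files(db, term):
    """Return records whose file name (last path component) contains term."""
    term = term.lower()
    return [(host, path) for host, path in db
            if term in path.rsplit("/", 1)[-1].lower()]

DB = build_filename_db(LISTINGS)
print(find_files(DB, "chess"))
```

A file about chess that happened to be named `notes.txt` would not be found, because the contents are never examined.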

Search engines
  Gopher
    Indexed plain text documents
    Created in 1991 at the University of Minnesota; Gopher was named after the school's mascot
    Most of the Gopher sites became websites after the creation of the World Wide Web because these were text files

Search engines
  Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives)
    Provided a keyword search of most Gopher menu titles in the entire Gopher listings
  Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display)
    A tool for obtaining menu information from various Gopher servers

And the answer is …
  People have trouble with
    How to ask
    What to ask
    Where to ask
    When to ask

How to ask
  Search criteria
    Build a query
      Date
      File name
      Location
      Keyword
      Domain
      Country
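
Building a query from several criteria can be pictured as filtering records on each criterion that was supplied. The sketch below is an assumption-laden illustration: the `Page` record, the sample data, and the `run_query` function are invented for this example, and the domain check doubles as a country check when given a country suffix such as `.uk`.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Page:
    url: str
    file_name: str
    modified: date
    text: str

# Hypothetical records standing in for a search engine's database.
PAGES = [
    Page("http://www.example.edu/menu.html", "menu.html", date(2008, 3, 1), "peanut butter menu"),
    Page("http://www.example.com/news.html", "news.html", date(2005, 6, 9), "peanut prices rise"),
    Page("http://www.example.co.uk/menu.html", "menu.html", date(2009, 1, 20), "butter tea menu"),
]

def run_query(pages, keyword=None, domain=None, file_name=None, since=None):
    """Keep pages that satisfy every criterion that was supplied."""
    hits = []
    for page in pages:
        host = page.url.split("/")[2]  # the part between "//" and the next "/"
        if keyword and keyword.lower() not in page.text.lower():
            continue
        if domain and not host.endswith(domain):
            continue
        if file_name and page.file_name != file_name:
            continue
        if since and page.modified < since:
            continue
        hits.append(page.url)
    return hits

print(run_query(PAGES, keyword="peanut", domain=".edu"))
```

Each added criterion can only narrow the result list, which is the usual reason to combine criteria in one query.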

How to ask
  Boolean phrases
    And, + (plus)
      Finds documents containing all of the specified words or phrases
      Peanut AND butter finds documents with both the word peanut and the word butter
    Or
      Finds documents containing at least one of the specified words or phrases
      Peanut OR butter finds documents containing either peanut or butter. The found documents could contain both items, but not necessarily.
    Not, - (minus)
      Excludes documents containing the specified word or phrase
      Peanut NOT butter finds documents with peanut but not containing butter
    Wild card (*)
      Finds documents using just the given characters; the * fills in the rest
      Pea* returns all pages containing any word that begins with pea, such as peanut, peach, or peace (Be careful!)
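
The Boolean operators above correspond closely to set operations, which the following sketch tries to show. The documents and the helper names `containing` and `matching` are made up for the example; the wild card is handled with Python's standard `fnmatch` function.

```python
from fnmatch import fnmatch

# Hypothetical documents, each reduced to a set of lowercase words.
DOCS = {
    "d1": {"peanut", "butter", "sandwich"},
    "d2": {"peanut", "brittle"},
    "d3": {"butter", "cookies"},
    "d4": {"peach", "pie"},
}

def containing(word):
    """Names of documents that contain the word."""
    return {name for name, words in DOCS.items() if word in words}

def matching(pattern):
    """Names of documents with at least one word matching a * pattern."""
    return {name for name, words in DOCS.items()
            if any(fnmatch(w, pattern) for w in words)}

both = containing("peanut") & containing("butter")         # peanut AND butter
either = containing("peanut") | containing("butter")       # peanut OR butter
only_peanut = containing("peanut") - containing("butter")  # peanut NOT butter
wild = matching("pea*")                                    # pea*

print(sorted(both), sorted(either), sorted(only_peanut), sorted(wild))
```

Note that the wild card search picks up the document about peach pie, which is the kind of surprise the "Be careful" warning refers to.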

What to ask
  All of these words
    Documents must contain all of the words you list
  This exact phrase
    Documents must contain these exact words in the order you typed them
  Any of these words
    Documents must contain at least one of the words you list
  None of these words
    Documents that contain these words will be omitted from your results
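
The four fields above can be read as four tests applied to each document. The function below is a minimal sketch under that reading; the name `matches` and its parameters are hypothetical, and the exact-phrase test is a simple substring check, so it is looser than what a real search form does.

```python
def matches(text, all_words=(), exact_phrase="", any_words=(), none_words=()):
    """Check one document against the four advanced-search fields."""
    lowered = text.lower()
    words = set(lowered.split())
    if any(w.lower() not in words for w in all_words):
        return False  # a required word is missing
    if exact_phrase and exact_phrase.lower() not in lowered:
        return False  # the phrase does not appear in the order typed
    if any_words and not any(w.lower() in words for w in any_words):
        return False  # none of the optional words appear
    if any(w.lower() in words for w in none_words):
        return False  # an excluded word appears
    return True

doc = "Creamy peanut butter cookies"
print(matches(doc, all_words=["peanut", "cookies"]))
print(matches(doc, exact_phrase="butter peanut"))
```

A field that is left empty places no restriction on the document, which mirrors how such search forms are normally used.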

Where to ask
  Search engines
    Do not really search the World Wide Web directly
    Search a database of the full text of web pages selected from the billions of web pages out there residing on servers
    Search engine databases are selected and built by computer robot programs called "spiders"
    After spiders find pages, they pass them on to another computer program for "indexing"
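
The hand-off from spider to indexer can be sketched as two small functions. Everything here is a stand-in: the `WEB` dictionary plays the role of the real web, and `spider` and `index_pages` are hypothetical names. A real spider would fetch pages over the network and extract the links from the HTML.

```python
from collections import deque

# Hypothetical miniature web: URL -> (page text, outgoing links).
WEB = {
    "http://a.example/": ("welcome to the peanut page", ["http://b.example/"]),
    "http://b.example/": ("butter recipes", ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("peanut butter cookies", []),
    "http://d.example/": ("nobody links to this page", []),
}

def spider(start_url):
    """Follow links breadth-first and yield (url, text) for each page found."""
    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        yield url, text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)

def index_pages(pages):
    """Build the searchable database: word -> set of URLs containing it."""
    database = {}
    for url, text in pages:
        for word in text.split():
            database.setdefault(word, set()).add(url)
    return database

DATABASE = index_pages(spider("http://a.example/"))
print(sorted(DATABASE["peanut"]))
```

The page at `http://d.example/` never enters the database because no other page links to it, so the spider has no way to find it.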

Types of Search Tools
  Search engines
    Built by computer robot programs ("spiders") -- not by human selection
    NOT organized by subject categories -- all pages are ranked by a computer algorithm
    Contain the full text (every word) of the web pages they link to -- you find pages by matching words in the pages you want
    Huge, and often retrieve a lot of information -- for complex searches, use ones that allow you to search within results
    Unevaluated -- contain the good, the bad, and the ugly -- YOU must evaluate everything you find
    Examples: Google, Yahoo, Ask.com

Types of Search Tools
  Subject directories
    Built by human selection -- not by computers or robot programs
    Organized into subject categories, with pages classified by subject -- subjects are not standardized and vary according to the scope of each directory
    NEVER contain the full text of the web pages they link to -- you can only search what you can see (titles, descriptions, subject categories, etc.) -- use broad or general terms
    Range from small and specialized to large, but are smaller than most search engines -- huge range in size
    Often carefully evaluated and annotated (but not always!)
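
The practical difference between searching a directory and searching a full-text database can be shown with one invented entry. The `ENTRY` record and both function names are hypothetical; a real directory would not store the page text at all, and it appears here only for comparison.

```python
# Hypothetical directory entry: only the title, description, and category
# are stored by the directory; the full text is included for comparison only.
ENTRY = {
    "title": "Baking Basics",
    "description": "A beginner's guide to home baking",
    "category": "Cooking",
    "full_text": "Cream the butter and sugar, then fold in chopped peanuts.",
}

def directory_search(entry, term):
    """Search only what a directory stores: title, description, category."""
    visible = " ".join([entry["title"], entry["description"], entry["category"]])
    return term.lower() in visible.lower()

def full_text_search(entry, term):
    """Search every word of the page, as a search engine database allows."""
    return term.lower() in entry["full_text"].lower()

print(directory_search(ENTRY, "butter"), full_text_search(ENTRY, "butter"))
print(directory_search(ENTRY, "baking"))
```

The specific word "butter" is found only by the full-text search, while the broad term "baking" is found in the directory entry, which is why broad or general terms work better in directories.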

Directories
  Librarians Index - www.lii.org
  Infomine - infomine.ucr.edu
  AcademicInfo - www.academicinfo.us
  About.com - www.about.com
  Google Directory - directory.google.com
  Yahoo! - dir.yahoo.com

Types of Search Tools
  Searchable database contents or the "Invisible Web"
    The Invisible Web is estimated to offer two to three times as many pages as the visible web
    Pages in non-HTML formats (PDF, Word, Excel, Corel suite, etc.) are "translated" into HTML
    Script-based pages, whose links contain a ? or other script coding, no longer cause most search engines to exclude them
    Pages generated dynamically by other types of database software (e.g., Active Server Pages, Cold Fusion) can be indexed if there is a stable URL somewhere that search engine spiders can find
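
A link that "contains a ?" is one that carries a query string, and Python's standard `urllib.parse` module can pull it apart. The sketch below uses made-up example URLs, and `describe_url` is a hypothetical helper; it only shows what a spider would see when it inspects such a link.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Report whether a link carries a query string (the part after '?')."""
    parts = urlsplit(url)
    return {
        "path": parts.path,
        "script_based": bool(parts.query),
        "parameters": parse_qs(parts.query),
    }

print(describe_url("http://www.example.com/catalog.asp?item=42&color=red"))
print(describe_url("http://www.example.com/about.html"))
```

The first URL is the script-based kind described above; the second is an ordinary static page address.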

Types of search engines
  Meta-Search Engines
    You submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages
    Meta-search engines do not own a database of Web pages
    Examples
      Dogpile.com
      Clusty.com
      Surfwax.com
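
A meta-search engine can be sketched as a function that forwards one query to several engines and merges what comes back. In the example below, `engine_a` and `engine_b` are hypothetical stand-ins that return fixed lists, the scoring rule is an assumption chosen for illustration, and the engines are queried one after another rather than simultaneously to keep the code short.

```python
# Hypothetical stand-ins for individual search engines. Each returns its own
# ranked list of URLs; a real meta-search engine would query them over the network.
def engine_a(query):
    return ["http://one.example/", "http://two.example/", "http://three.example/"]

def engine_b(query):
    return ["http://two.example/", "http://four.example/"]

def meta_search(query, engines):
    """Send the query to every engine and merge the results.

    A URL earns points from each engine that returns it, with more points
    for a higher position. The scoring rule is an illustrative assumption.
    """
    scores = {}
    for engine in engines:
        results = engine(query)
        for position, url in enumerate(results):
            scores[url] = scores.get(url, 0) + (len(results) - position)
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

print(meta_search("peanut butter", [engine_a, engine_b]))
```

Notice that `meta_search` keeps no database of its own; it only combines the lists returned by the other engines.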
