
Page 1: Tema 4. Búsquedas en el Web

Topic 4. Web Searching (Tema 4. Búsquedas en el Web)

Document Management Systems (Sistemas de Gestión Documental)

Page 2: Tema 4. Búsquedas en el Web

Introduction

• The WWW dates from the late 1980s.
• It is growing at an exponential rate.
• It holds textual information, but also multimedia.
• The Web can be viewed as an enormous unstructured database.

Page 3: Tema 4. Búsquedas en el Web

Introduction

The problem is how to find information on the Web. There are three different ways to search:

• Use search engines (they index part of the Web as documents in a text database).
• Use Web directories (they classify documents by topic).
• Search by exploiting the hyperlink structure.

Page 4: Tema 4. Búsquedas en el Web

Introduction

The main problems we face are:

• Distributed data.
• A high percentage of volatile data.
• An enormous amount of information.
• Redundant and unstructured data.
• Data quality.
• Heterogeneous data.

Page 5: Tema 4. Búsquedas en el Web

Types of Search Tools

Search Engines (& Meta-Search Engines)

Characteristics:
• Full text of selected Web pages.
• Search by keyword, trying to match exactly the words in the pages.
• No browsing, no subject categories.
• Databases compiled by "spiders" (computer-robot programs) with minimal human oversight.
• Search-engine size: from small and specialized to huge (about 20 billion websites or pages).
• Meta-Search Engines quickly and superficially search several individual search engines at once and return results compiled into a sometimes convenient format. Caveat: they only catch about 1% of search results in any of the search engines they visit.

Examples:
• Google, Yahoo Search, Ask.com
• Meta-Search Engines: Dogpile, Copernic

Page 6: Tema 4. Búsquedas en el Web

Types of Search Tools

Subject Directories

Characteristics:
• Human-selected sites picked by editors (sometimes experts in a subject).
• Often carefully evaluated and kept up to date, but not always, and frequently not if large and general.
• Usually organized into hierarchical subject categories.
• Often annotated with descriptions (not in Yahoo!).
• Can browse subject categories or search using broad, general terms.
• NO full text of documents. Searches need to be less specific than in search engines, because you are not matching on the words in the pages you eventually want. In directories you are searching only the subject categories and descriptions you see in their pages.

Examples:
• Librarians' Index, Infomine, Google Directory, About.com, AcademicInfo
• There are thousands more Subject Directories on practically every topic you can think of.

Page 7: Tema 4. Búsquedas en el Web

Types of Search Tools

Specialized Databases (The Invisible Web)

Characteristics:
• The Web provides access through a search box into the contents of a database in a computer somewhere.
• Can be on any topic; can be trivial, commercial, task-specific, governmental, or a rich treasure devoted to your topic.
• Also includes many pages generated as search results from library online catalogs, and the many copyright-protected articles in the databases of journal and magazine publishers.

Examples:
• Locate specialized databases by looking for them in good Subject Directories like the Librarians' Index, Yahoo!, or AcademicInfo; in special guides to searchable databases; and sometimes by keyword searching in general search engines.

Page 8: Tema 4. Búsquedas en el Web

Search Engines: how do they work?

• They do not search the Web directly; they use a database of Web pages.
• The databases are built by spiders or crawlers.
• Spiders find pages by following the links that pages contain. A page that is not linked from anywhere will never be indexed.
• Spiders send the Web pages to indexing programs, which identify text, links, and so on, and store the indexed terms in the database.
• Some types of pages are excluded from indexing by some rule (pages not found, unsuitable content, formats that cannot be processed, dynamically generated information, etc.).
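
The crawl-then-index pipeline described on this slide can be sketched in a few lines. This is only an illustrative toy: the `WEB` dictionary, its `.example` URLs, and the function names are assumptions made up for the example, not any real engine's design. It shows that queries run against the index and that a page nobody links to never gets indexed.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (text, outgoing links). Not real sites.
WEB = {
    "a.example": ("web search engines index pages", ["b.example"]),
    "b.example": ("spiders follow links between pages", ["a.example", "c.example"]),
    "c.example": ("indexers store terms in a database", []),
    "orphan.example": ("nobody links to this page", []),
}

def crawl(seed):
    """Breadth-first spider: visits only pages reachable by links from the seed."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        yield url, text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)

def build_index(seed):
    """Indexer: maps each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in crawl(seed):
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, term):
    """Queries go to the index (the database), never to the live web."""
    return sorted(index.get(term.lower(), set()))

index = build_index("a.example")
print(search(index, "pages"))   # pages reached by the spider
print(search(index, "nobody"))  # empty: the orphan page was never crawled
```

A real spider would also apply the exclusion rules listed above (unprocessable formats, dynamic content, and so on); the sketch leaves those out.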

Page 9: Tema 4. Búsquedas en el Web

Search Engines

Engines compared: Google (www.google.com), Yahoo! Search (search.yahoo.com), Ask.com (www.ask.com).

Size, type (size varies frequently and widely)
• Google: HUGE. Size not disclosed in any way that allows comparison. Probably the biggest. Biggest in tests.
• Yahoo! Search: HUGE. Claims over 20 billion total "web objects."
• Ask.com: LARGE. Claims to have 2 billion fully indexed, searchable pages. Strives to become #1 in size.

Noteworthy features and limitations
• Google: Popularity ranking using PageRank™. Indexes the first 101KB of a Web page, and 120KB of PDFs. ~ before a word sometimes finds synonyms (~help > FAQ, tutorial, etc.).
• Yahoo! Search: Shortcuts give quick access to dictionary, synonyms, patents, traffic, stocks, encyclopedia, and more.
• Ask.com: Subject-Specific Popularity™ ranking. Suggests broader and narrower terms.

Phrase searching
• Google: Yes. Use " ". Searches common "stop words" if in phrases in quotes.
• Yahoo! Search: Yes. Use " ".
• Ask.com: Yes. Use " ". Searches common "stop words" if in phrases in quotes.

Boolean logic
• Google: Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting. In Advanced Search, partial Boolean available in boxes.
• Yahoo! Search: Accepts AND, OR, NOT or AND NOT, and ( ). Must be capitalized. You must enclose terms joined by OR in parentheses (classic Boolean).
• Ask.com: Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting.

+Requires / -Excludes
• Google: - excludes. + will allow you to retrieve "stop words" (e.g., +in).
• Yahoo! Search: - excludes. + will allow you to search common words: "+in truth".
• Ask.com: - excludes. + will allow you to retrieve "stop words" (e.g., +in).

Page 10: Tema 4. Búsquedas en el Web

Search Engines

Engines compared: Google (www.google.com), Yahoo! Search (search.yahoo.com), Ask.com (www.ask.com).

Sub-searching
• Google: Sort of. At bottom of results page, click "Search within results" and enter more terms. Adds terms.
• Yahoo! Search: Add terms.
• Ask.com: Sort of. Add terms.

Results ranking
• Google: Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Fuzzy AND also invoked. Matching and ranking based on "cached" version of pages that may not be the most recent version.
• Yahoo! Search: Automatic Fuzzy AND.
• Ask.com: Based on Subject-Specific Popularity™, links to a page by related pages.

Field limiting
• Google: link:, site:, intitle:, inurl:. Advanced Search boxes for most of these. Offers Uncle Sam for US federal pages and other special searches.
• Yahoo! Search: link:, site:, intitle:, inurl:, url:, hostname:.
• Ask.com: intitle:, inurl:, site:.
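
The "results ranking" row says Google ranks a page highly when many other pages link to it. A minimal sketch of that idea is the textbook PageRank power iteration below. It is a simplified illustration, not Google's production ranking; the link graph, the page names, and the damping value of 0.85 are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank

# Hypothetical link graph: every link target must also be a key.
LINKS = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "lonely": ["home"],
}

ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)    # page with the most link-derived popularity
worst = min(ranks, key=ranks.get)   # page that nobody links to
print(best, worst)
```

Here "home" receives links from three pages and ends up on top, while "lonely", which has no incoming links, keeps only the baseline share.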

Page 11: Tema 4. Búsquedas en el Web

Search Engines

Engines compared: Google (www.google.com), Yahoo! Search (search.yahoo.com), Ask.com (www.ask.com).

Truncation, stemming
• Google: No truncation. Stems some words. Search variant endings and synonyms separately, separating with OR (capitalized): airline OR airlines
• Yahoo! Search: Neither. Search with OR as in Google.
• Ask.com: Neither. Search with OR as in Google.

Case sensitivity
• Google: No.
• Yahoo! Search: No.
• Ask.com: No.

Language
• Google: Yes. Major Romanized and non-Romanized languages in Advanced Search.
• Yahoo! Search: Yes. Major Romanized and non-Romanized languages.
• Ask.com: Yes. Major Romanized languages. Use Advanced Search to limit.

Limit by age of documents
• Google: In Advanced Search.
• Yahoo! Search: In Advanced Search.
• Ask.com: In Advanced Search.

Translation
• Google: Yes, in the "Translate this page" link following some pages. To and sometimes from English and major European languages and Chinese, Japanese, Korean.
• Yahoo! Search: Yes.
• Ask.com: No.

Page 12: Tema 4. Búsquedas en el Web

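Search Engines

Features Chart. Last updated Oct. 1, 2007.

Google
• Boolean: -, OR
• Default: and
• Proximity: Phrase
• Truncation: No (stems); word in phrase
• Fields: intitle, inurl, link, site, more
• Limits: Language, filetype, date, domain
• Stop words: Few, + searches
• Sorting: Relevance, site

Yahoo!
• Boolean: AND, OR, NOT, ( ), -
• Default: and
• Proximity: Phrase
• Truncation: No; word in phrase
• Fields: intitle, inurl, link, site, more
• Limits: Language, file type, date, domain
• Stop words: No
• Sorting: Relevance, site

Ask
• Boolean: -, OR
• Default: and
• Proximity: Phrase
• Truncation: No
• Fields: intitle, inurl, site
• Limits: Language, site, date
• Stop words: Yes, + searches
• Sorting: Relevance, metasites

Live Search
• Boolean: AND, OR, NOT, ( ), -
• Default: and
• Proximity: Phrase
• Truncation: No
• Fields: intitle, link, site, loc, url
• Limits: Language, site
• Stop words: Varies, + searches
• Sorting: Relevance, site, sliders

Gigablast
• Boolean: AND, OR, AND NOT, ( ), +, -
• Default: and
• Proximity: Phrase
• Truncation: No
• Fields: title, site, ip, more
• Limits: Domain, type
• Stop words: Varies, + searches
• Sorting: Relevance

Exalead
• Boolean: AND, OR, NOT, ( ), -
• Default: and
• Proximity: Phrase, NEAR
• Truncation: Yes and stems
• Fields: intitle, inurl, link, site
• Limits: Language, file type, date, domain
• Stop words: Varies, + searches
• Sorting: Relevance, date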

Page 13: Tema 4. Búsquedas en el Web

Search Engines (different?)

http://www.bruceclay.com/searchenginerelationshipchart.htm


Page 18: Tema 4. Búsquedas en el Web

Metasearch

Meta-search tools: what is searched (as of date at bottom of page; they change often), complex search ability, and results display.

Clusty (clusty.com)
• What's searched: Currently searches a number of free search engines and directories, not Google or Yahoo.
• Complex search ability: Accepts and "translates" complex searches with Boolean operators and field limiting.
• Results display: Results accompanied by subject subdivisions based on words in search results, usually giving the major themes (Vivisimo Clustering Engine™). Click on these to search within results on each theme.

Dogpile (www.dogpile.com)
• What's searched: Searches Google, Yahoo, LookSmart, AskJeeves/Teoma, Google ADS, MSN search. Sites that have purchased ranking and inclusion are blended in. Watch for "Sponsored by..." links below search results.
• Complex search ability: Accepts Boolean logic, especially in advanced search modes.
• Results display: Dogpile allows you to see each search engine's results separately in a useful list for comparison. Click the search engine icons by "Best of Breed."

Page 19: Tema 4. Búsquedas en el Web

Metasearch

Meta-search tools: what is searched (as of date at bottom of page; they change often), complex search ability, and results display.

SurfWax (www.surfwax.com)
• What's searched: A better than average set of search engines. Can mix with educational, US Govt tools, and news sources, or many other categories.
• Complex search ability: Accepts " ", +/-. Default is AND between words. I recommend fairly simple searches, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results.
• Results display: Click on source link to view complete search results there. Click on the icon to view the helpful "SiteSnap™" extracted from most sites in the frame on the right. Many additional features for probing within a site.

Copernic Agent (www.copernic.com)
• What's searched: Select from list of search engines by clicking the Properties button following the Advanced Search search box.
• Complex search ability: ALL, ANY, Phrase, and more. Also Boolean searching within results under Refine (powerful!).
• Results display: Must be downloaded and installed, but the Basic version is free of charge.

Page 20: Tema 4. Búsquedas en el Web

Metasearch

Dogpile (http://www.dogpile.com)
Popular metasearch site owned by InfoSpace that sends a search to a customizable list of search engines, directories and specialty search sites, then displays results from each search engine individually.

Vivisimo (http://vivisimo.com/)
Enter a search term, and Vivisimo will not only pull back matching responses from major search engines but also automatically organize the pages into categories. Slick and easy to use.

Kartoo (http://www.kartoo.com)
If you like the idea of seeing your web results visually, this meta search site shows the results with sites being interconnected by keywords.

Mamma (http://www.mamma.com)
Founded in 1996, Mamma.com is one of the oldest meta search engines on the web. Mamma searches against a variety of major crawlers, directories and specialty search sites. The service also provides a paid listings option for advertisers, Mamma Classifieds.

SurfWax (http://www.surfwax.com)
Searches against major engines or provides those who open free accounts the ability to choose from a list of hundreds. Using the "SiteSnaps" feature, you can preview any page in the results and see where your terms appear in the document. Allows results or documents to be saved for future use.
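
What a metasearch site does with the answers it gets back can be sketched as a merge of several ranked lists. The snippet below is one possible approach (round-robin interleaving with duplicate removal), not the method of any tool named above; the three result lists and their `.example` URLs are invented placeholders for real engine responses.

```python
from itertools import zip_longest

def merge_results(result_lists):
    """Round-robin interleave ranked lists, keeping the first copy of each URL."""
    merged, seen = [], set()
    # zip_longest yields the 1st hit of every engine, then the 2nd, and so on,
    # padding shorter lists with None.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical result lists, standing in for responses from three engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example"]
engine_c = ["x.example", "v.example", "y.example"]

merged = merge_results([engine_a, engine_b, engine_c])
print(merged)
```

Because only the top few hits of each engine are merged, a scheme like this sees just a small slice of what each engine could return, which matches the caveat given earlier about meta-search engines.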

Page 21: Tema 4. Búsquedas en el Web

Metasearch

Clusty (http://www.clusty.com)
CurryGuide (http://web.curryguide.com/)
Excite (http://www.excite.com)
Fazzle (http://www.fazzle.com/)
Gimenei (http://gimenei.com/)
IceRocket (http://www.icerocket.com/)
Info.com (http://www.info.com)
MetaEureka (http://www.metaeureka.com)
ProFusion (http://www.profusion.com)
Query Server (http://www.queryserver.com/web.htm)
Turbo10 (http://turbo10.com)
Search.com (http://www.search.com)
Ujiko (http://www.ujiko.com/)
WebCrawler (http://www.webcrawler.com)
ZapMeta (http://www.zapmeta.com)
InfoGrid (http://www.infogrid.com)
Infonetware RealTerm Search (http://www.infonetware.com)
Ixquick (http://www.ixquick.com/)
iZito (http://www.izito.com)
Jux2 (http://www.jux2.com/)
Meceoo (http://www.meceoo.com/)
MetaCrawler (http://www.metacrawler.com)

Page 22: Tema 4. Búsquedas en el Web

Directories

Subject Directories
• Librarians' Index (www.lii.org)
• Infomine (infomine.ucr.edu)
• Academic Info (www.academicinfo.us)

Recommend Browsing
• About.com (www.about.com)
• Google Directory (directory.google.com)
• Yahoo! (dir.yahoo.com)

Size, type
• Librarians' Index: Over 16,000. Compiled by public librarians in information supply business. Highest quality sites only. Great, reliable annotations.
• Infomine: Over 120,000. Great, reliable annotations. Cooperatively compiled by university and college-level academic librarians of the UC campuses.
• Academic Info: Rich selection of about 25,000 pages, selected as "college and research level Internet resources" aimed "at the undergraduate level or above." Brief annotations.
• About.com: Over 2 million. Generally good annotations done by "Guides" with various levels of expertise.
• Google Directory: About 5 million web pages, selected by the Open Directory Project and enhanced by Google searching and ranking. Often useful to find "better" results, especially on broad or widely covered topics.
• Yahoo!: About 4 million. Scarce descriptions and annotations. Often useful, especially for popular and commercial topics.

Phrase searching
• Librarians' Index: Yes. Use " ".
• Infomine: Yes. Use " ". |term term| requires exact match.
• Academic Info: No. " " make searches fail.
• About.com: Yes. Use " ".
• Google Directory: Yes. Use " ".
• Yahoo!: Yes. Use " ".

Page 23: Tema 4. Búsquedas en el Web

Directories

Subject Directories compared: Librarians' Index (www.lii.org), Infomine (infomine.ucr.edu), Academic Info (www.academicinfo.us), About.com (www.about.com), Google Directory (directory.google.com), Yahoo! (dir.yahoo.com).

Boolean logic
• Librarians' Index: AND implied between words. Also accepts OR and NOT, and ( ).
• Infomine: AND implied between words. Also accepts OR, NOT, and ( ).
• Academic Info: OR implied between words. Accepts AND, OR, NOT and ( ). Recommend AND between words in most searches.
• About.com: No.
• Google Directory: OR, capitalized, as in Google's web search engine.
• Yahoo!: Yes, as in Yahoo! Search web search engine.

Truncation
• Librarians' Index: Use *. Also stems. Can turn stemming off on Advanced Search page.
• Infomine: Use *. Also stems. Can turn stemming off. Use " " or | | to search exact terms.
• Academic Info: No.
• About.com: Use *. Not accepted consistently.
• Google Directory: No.
• Yahoo!: No.

Field searching
• Librarians' Index: Advanced Search allows Boolean searching within subject, titles, description, parts of URLs, and more.
• Infomine: Select boxes under search box to limit.
• Academic Info: No.
• About.com: No.
• Google Directory: Same as in Google's web search engine.
• Yahoo!: As in Yahoo! Search web search engine.

Page 24: Tema 4. Búsquedas en el Web

The invisible Web: what is it?

• The visible Web is what appears in the results of a query to a search engine or in directories.
• The invisible Web consists of all the pages and content that search engines cannot process and catalog in their indexes. For example:
  • Dynamic information.
  • Searchable databases.
  • Pages excluded from search engines by some processing policy.
• Search engines cannot find the information offered in these pages.
• To reach information on the invisible Web, you have to go directly to the page that offers it and search there.

Page 25: Tema 4. Búsquedas en el Web

The invisible Web: how do you search it?

• Keep the concept of "databases" in mind and stay alert to any information about them that search engines and directories may offer.
• Such pages can turn up at any point while browsing or running queries.
• To find invisible-Web pages, use a search engine and add the term "base de datos" or "database" to the query. Example: plane crash database
• Besides planning a good search with a suitable strategy in a search engine or directory, set aside time to investigate the databases you find on the topics of your information need.

Page 26: Tema 4. Búsquedas en el Web

The invisible Web

When dealing with the Deep Web, keep these points in mind:

• Information that is likely to be stored in a database is a part of the deep Web.

• Information that is new and dynamically changing in content will appear on the deep Web.

• Web sites of searchable databases can be retrieved via directories and search engines.

• Many search engine sites and commercial portals feature searchable databases as part of their package of services.

• Some search engines will search the deep Web for related content subsequent to an initial search.

• Topical coverage on the deep Web is extremely varied.

• Some of the information stored on Web-accessible databases may not be substantive or useful to most searchers.

Page 27: Tema 4. Búsquedas en el Web

The invisible Web

The Invisible Web: Databases not accessible to ordinary search engines.

Librarians’ Internet Index (lii.org)

Lots of categorized databases.

Findarticles.com (www.findarticles.com)

Search hundreds of journals.

Complete Planet (www.completeplanet.com)

Hundreds of databases by category.

Magportal (www.magportal.com)

Full text magazine articles.

All Academic (www.allacademic.com)

Journals & other free academic content.

Infomine (infomine.ucr.edu)

Scholarly Internet Resource Collections.

Invisible-Web.net (www.invisible-web.net)

Companion site to Invisible Web book.

Online Books Page (onlinebooks.library.upenn.edu)

Full text of more than 18,000 books.

Page 28: Tema 4. Búsquedas en el Web

Some statistics

[Figure: Millions of textual documents indexed]

Page 31: Tema 4. Búsquedas en el Web

Some statistics

[Figure: Billions of textual documents indexed, December 1995-September 2003]

Search engine size, November 2004
• Google: reported size 8.1 billion; page depth 101K
• MSN: reported size 5.0 billion; page depth 150K
• Yahoo: reported size 4.2 billion (estimate); page depth 500K
• Ask Jeeves: reported size 2.5 billion; page depth 101K+


Page 41: Tema 4. Búsquedas en el Web

Some statistics

How many searches are performed each day? Below is the number of searches within the United States in March 2006, based on comScore figures.

Searches per day (millions) and per month (millions)
• Google: 91 per day; 2,733 per month
• Yahoo: 60 per day; 1,792 per month
• MSN: 28 per day; 845 per month
• AOL: 16 per day; 486 per month
• Ask: 13 per day; 378 per month
• Others: 6 per day; 166 per month
• Total: 213 per day; 6,400 per month
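
As a quick check of the figures above, the monthly counts can be turned into market shares. The snippet only restates the numbers from the slide; the variable names are illustrative. It also shows that the per-day column adds up to 214 rather than the printed total of 213, presumably because each row was rounded separately.

```python
# Figures copied from the slide (millions of searches, US, March 2006).
MONTHLY_MILLIONS = {
    "Google": 2733, "Yahoo": 1792, "MSN": 845,
    "AOL": 486, "Ask": 378, "Others": 166,
}
DAILY_MILLIONS = {
    "Google": 91, "Yahoo": 60, "MSN": 28,
    "AOL": 16, "Ask": 13, "Others": 6,
}

monthly_total = sum(MONTHLY_MILLIONS.values())   # matches the printed 6,400
daily_total = sum(DAILY_MILLIONS.values())       # 214, one more than printed

# Share of monthly searches per engine, as a percentage with one decimal.
shares = {
    name: round(100 * count / monthly_total, 1)
    for name, count in MONTHLY_MILLIONS.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```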


Page 43: Tema 4. Búsquedas en el Web

How others search the Web

Page 46: Tema 4. Búsquedas en el Web

How to search the Web: Strategies

Step #1. Analyze your topic to decide where to begin

Does your topic...

• have distinctive words or phrases?
  methernitha: unique meaning
  "affirmative action": specific, accepted meaning in word cluster

• have NO distinctive words or phrases you can think of? You have only common or general terms that get the "wrong" pages.
  "order out of chaos": used in too many contexts to be useful
  sundiata: retrieves a myth, a rock group, a person, etc.

• seek an overview of a broad topic?
  victorian literature, alternative energy sources

• specify a narrow aspect of a broad or common topic?
  automobile recyclability: want current research, future designs, not how to recycle or oil recycling or other community efforts

• have synonymous, equivalent terms, or variant spellings or endings that need to be included?
  echinoderm OR echinoidea OR "sea urchin": any may be in useful pages
  "cold fusion energy" OR "hydrogen energy": some use one term, some the other; you want both, although not precisely equivalent
  millennium OR millennial OR millenium OR millenial OR "year 2000", etc.: pages you want may contain any or all

• make you feel confused? Don't really know much about the topic yet? Need guidance?

Page 47: Tema 4. Búsquedas en el Web

How to search the Web: Strategies

Step #2. Pick the right starting place using this table:

Distinctive word or phrase?
• Search Engines: Enclose phrases in " ". Test run your word or phrase in Google.
• Subject Directories: Search the broader concept, what your term is "about."

NO distinctive words or phrases?
• Search Engines: Use more than one term or phrase in " " to get fewer results.
• Subject Directories: Try to find distinctive terms in Subject Directories.

Seek an overview?
• Search Engines: NOT RECOMMENDED.
• Subject Directories: Look for a specialized Subject Directory focused on your topic.

Narrow aspect of broad or common topic?
• Search Engines: Boolean searching as in Yahoo! Search.
• Subject Directories: Look for a Directory focused on the broad subject.

Synonyms, equivalent terms, variants
• Search Engines: Choose search engines with Boolean OR, or Truncation, or Field limiting.
• Subject Directories: NOT RECOMMENDED.

Confused? Need more information?
• Search Engines: NOT RECOMMENDED.
• Subject Directories: Look for a Gateway Page (Subject Guide). Try an encyclopedia. Ask at a library reference desk.

For any kind of topic:
• Specialized Databases ("Invisible Web"): Want data? Facts? Statistics? All of something? One of many like things? Schedules? Maps? Look for a specialized database on the Invisible Web. Hard to predict what you might find.
• Find an Expert: Look for a specialized subject directory on your topic. E-mail the author of a good page you find. Ask a discussion group or blog. Never hurts to seek help.
• LUCK: Always on your side. Keep your mind open. Learn as you search.

Page 48: Tema 4. Búsquedas en el Web

How to search the Web: Strategies

Step #3. Learn as you go & VARY your approach with what you learn.

Step #4. Don't bog down in any strategy that doesn't work.

Step #5. Return to previous strategies better informed.

Don't assume you know what you want to find. Look at search results and see what you might use in addition to what you've thought of.

Switch from search engines to directories and back. Find specialized directories on your topic. Think about possible databases and look for them.

Page 49: Tema 4. Búsquedas en el Web

How to search the Web: Strategies

Search Strategies We Do NOT Recommend

Because of their inefficiency and often haphazard and frustrating results, we do not recommend either of the following two approaches to finding Web documents:

• Browsing searchable directories. If you can find a search box, search a directory. BROWSING is sometimes fun but rarely as efficient. The term "directories" refers here to any collection of web resources organized into subject categories or some other breakdown appropriate to the content (Subject Directories or directories of specialized databases). Browsing locates documents by your trying to match your topic in first the top, broadest layer of a subject hierarchy, then by choosing narrower sub-subject-categories in the hierarchy that you hope will lead to your target. Browsing encounters the difficulty of guessing under which subject category your topic is classified. The taxonomy in every directory differs, making browsing inconsistent from one search tool to another. The category "health" may contain documents on medicine, homeopathy, psychiatry, and fitness in one directory. In another "medicine" may include health, mental health, and alternative medicine, but not the term psychiatry and may classify fitness only under "lifestyle." Searching (typing keywords in a search box) retrieves occurrences of your words no matter where they may be classified by subject. Use broad terms in searching any directory.

• Following links to sites recommended by heavy use or commercial interest.  Often in search engine results, you will see links to sites that are selected based on how often they are visited by others, or based on fees paid to the browser.  Or you may see recommended "cool" sites.  Use these with caution!   Others may visit sites for reasons having no relation to your information interests, and the best sites for you may still be largely undiscovered by the vast public searching the Web.  Taste varies and should vary.  Make your own evaluations.

Page 50: Tema 4. Búsquedas en el Web

How to search the Web: Strategies

Features of your search inquiry, and the matching search tool features worth learning.

Are you looking for a proper name or a distinct phrase?
• The name of an organization or society or movement
• The proper name of an individual
• A distinctive string of words generally associated with your topic

Can you think of an organization, proper name, or phrase to search for? It might help zoom in on the pages you want.

PHRASE SEARCHING is a feature you want in every search tool you choose. It requires your terms all to appear in exactly the order you enter them. Enclose the phrase in double quotation marks " ".

Examples: "affirmative action", "world health organization", "a person's name"

In [...], capitalizing initial letters will cause the terms to be searched as a phrase: World Health Organization

Are some of your terms common words with many meanings and contexts?
• Children in conjunction with television and also violence
• Censorship as an aspect of ethics in journalism

BOOLEAN AND will help:
children AND television AND violence
journalism AND ethics AND censorship

Google, AllTheWeb, and most other search engines put AND in between words automatically (by default):
children television violence
journalism ethics censorship

Do you anticipate lots of search results with terms you do not want?
• Your search for biomedical engineering and cancer brings you lots of academic programs, and you want research reports. So you try to exclude documents containing Department of or School of.

BOOLEAN AND NOT will help:
"biomedical engineering" AND cancer AND NOT "Department of" AND NOT "School of"

or its -EXCLUDES near equivalent:
"biomedical engineering" cancer -"Department of" -"School of"
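
The effect of AND and AND NOT is easiest to see as set operations on the lists of documents that contain each term. The sketch below uses a made-up four-document collection and a deliberately naive substring match (so "school of" behaves like a quoted phrase); it illustrates the logic only and is not how any particular engine stores its index.

```python
# Hypothetical mini-collection: document id -> text.
DOCS = {
    1: "children television violence study",
    2: "television schedule for children",
    3: "violence in journalism ethics",
    4: "children and television violence in school of media",
}

def postings(term):
    """Set of document ids whose text contains the term (or quoted phrase)."""
    needle = term.lower()
    return {doc_id for doc_id, text in DOCS.items() if needle in text}

# children AND television AND violence  ->  set intersection
all_three = postings("children") & postings("television") & postings("violence")

# ... AND NOT "school of"  ->  set difference
without_school = all_three - postings("school of")

print(sorted(all_three))
print(sorted(without_school))
```

Each extra AND can only shrink the result set, which is why adding terms is the standard way to narrow a search on common words.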

Page 51: Tema 4. Búsquedas en el Web

How to search the Web: Strategies

Features of your search inquiry, and the matching search tool features worth learning.

Are there synonyms, spelling variations, or foreign spellings for some of your terms?
• women, females with networking
• Sarajevo, Sarayevo with peace
• literature, litterature with French, francaise

BOOLEAN OR will help:
(women OR females) AND networking
(Sarajevo OR Sarayevo) AND peace
(literature OR litterature) AND (French OR francaise)

In Google, capitalize OR (no need to type "and"):
peace sarajevo OR sarayevo
literature OR litterature french OR francaise

In AllTheWeb, use parentheses and omit the OR:
peace (sarajevo sarayevo)
(literature litterature) (french francaise)

Are you looking for home pages and/or other documents primarily about your term(s)?
• The home page of the American Dietetic Association
• Pages primarily about Affirmative Action

LIMIT TO TITLE FIELD IN DOCUMENTS:
intitle:"American Dietetic Association"
intitle:"affirmative action"

In Google, use intitle:"affirmative action"

Are you looking for terms with many possible endings?
• Feminism, feminist, feminine
• Children, child

Some systems search word ending variants automatically (stemming). See the specific instructions for each of the recommended search tools. To be sure, use OR searches:

children OR child
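
The classic Boolean patterns shown on these strategy slides (synonyms joined by OR inside parentheses, groups joined by AND, unwanted phrases added with AND NOT) can be generated mechanically. The helper below is a hypothetical sketch: `quote` and `build_query` are names invented here, and the output follows the generic syntax written on the slides, which individual engines may not accept as-is.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term

def build_query(groups, exclude=()):
    """Join synonyms with OR, groups with AND, and exclusions with AND NOT."""
    parts = []
    for synonyms in groups:
        joined = " OR ".join(quote(term) for term in synonyms)
        # Parentheses are only needed when a group holds alternatives.
        parts.append(f"({joined})" if len(synonyms) > 1 else joined)
    query = " AND ".join(parts)
    for term in exclude:
        query += f" AND NOT {quote(term)}"
    return query

synonym_query = build_query([["women", "females"], ["networking"]])
exclusion_query = build_query(
    [["biomedical engineering"], ["cancer"]],
    exclude=["Department of", "School of"],
)
print(synonym_query)
print(exclusion_query)
```

For an engine with Google-style syntax, the same groups would instead be written without AND and with a leading - for exclusions, as the slides note.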

Page 52: Tema 4. Búsquedas en el Web

How to search the Web: Commands

Each entry gives the command, how it is written, and the engines that support it.

Must Include Term
• +: supported by all

Must Exclude Term
• -: supported by all

Must Include Phrase
• " ": supported by all

Match All Terms
• Automatic at all

Match Any Terms
• Via Advanced Search: AllTheWeb, AltaVista, Google, Lycos, MSN Search, Teoma, Yahoo (HotBot offers but failed to work when tested)
• OR (must be done in ALL CAPS): AltaVista, AOL Search, Ask Jeeves, Google, HotBot, MSN Search, Teoma, Yahoo
• AllTheWeb, Lycos: only works for two words
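
To make the +, - and " " commands concrete, here is a small parser that splits a query string into required terms, excluded terms, and plain terms, keeping quoted text together as a phrase. It is a sketch of the general convention in the table, with an invented `parse_query` helper; real engines have their own parsing rules and edge cases.

```python
import re

# Optional + or - sign, followed by either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required, excluded, and plain terms."""
    parsed = {"required": [], "excluded": [], "plain": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase if phrase else word
        if not term:
            continue  # skip empty quotes such as ""
        key = {"+": "required", "-": "excluded"}.get(sign, "plain")
        parsed[key].append(term)
    return parsed

parsed = parse_query('+in truth -lies "plane crash" -"school of"')
print(parsed)
```

The `+in` case mirrors the note in the comparison tables that + forces an engine to keep a common "stop word" it would otherwise ignore.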

Page 53: Tema 4. Búsquedas en el Web

How to search the Web: Commands

Each entry gives the command as typed and the engines that support it.

Title Search (updated March 11, 2003)
• title: supported by AltaVista, AllTheWeb, Inktomi
• intitle: supported by Google, Teoma
• allintitle: supported by Google

Site Search
• host: supported by AltaVista
• site: supported by Excite, Google (Netscape, Yahoo)
• url.host: supported by AllTheWeb, Lycos (for AllTheWeb results only)
• domain: supported by Inktomi (HotBot, iWon, LookSmart)
• none: AOL, Direct Hit, HotBot, LookSmart, Lycos, MSN, Netscape, Northern Light, Open Directory, Yahoo

Page 54: Tema 4. Búsquedas en el Web

How to search the Web: Commands

URL Search
• url: supported by AltaVista, Excite, Northern Light
• url.all: supported by AllTheWeb, Lycos (for AllTheWeb results only)
• allinurl: and inurl: supported by Google
• originurl: supported by Inktomi (AOL, GoTo, HotBot)
• u: supported by Yahoo
• none: AOL, Direct Hit, HotBot, LookSmart, MSN. Not yet updated, but may be still correct: Open Directory

Link Search
• link: supported by AltaVista, Google, Northern Light
• linkdomain: supported by Inktomi (AOL, HotBot, iWon, MSN). NOTE: measures links to entire domains.
• link.all: supported by AllTheWeb, Lycos (for AllTheWeb results only)
• none: AOL, Direct Hit, Excite, HotBot, LookSmart, Northern Light. Not yet updated, but may be still correct: Netscape, Yahoo (n/a)

Page 55: Tema 4. Búsquedas en el Web

How to search the Web: Commands

Wildcard
• *: supported by AltaVista, Inktomi (iWon), Northern Light. Not yet updated, but may be still correct: Yahoo
• ?: supported by AOL Search, Inktomi (iWon)
• %: supported by Northern Light
• none: AllTheWeb, Direct Hit, Excite, Google, HotBot, LookSmart, Lycos, MSN (MSN's help says it offers wildcard, but it failed to during testing)

Anchor Search
• anchor: supported by AltaVista
• none: AllTheWeb, AOL Search, Direct Hit, Excite, Google, Inktomi, HotBot, Lycos
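
The wildcard commands listed above expand one typed pattern into several word variants. The sketch below imitates that with Python's standard `fnmatch` module over a tiny invented vocabulary. It assumes shell-style semantics (`*` for any run of characters, `?` for exactly one character), which may not match what each engine in the table actually meant by these symbols.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary standing in for the terms stored in an index.
VOCABULARY = [
    "child", "children", "childhood", "chill",
    "feminism", "feminist", "feminine",
]

def expand(pattern, vocabulary=VOCABULARY):
    """Return the vocabulary words matched by a pattern using * or ?."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]

print(expand("child*"))    # truncation: any ending after "child"
print(expand("chil?"))     # exactly one more character
print(expand("femini*"))   # variants of a common stem
```

Where an engine offers no wildcard at all, the same effect needs an explicit OR search over the variants, as recommended in the strategy slides.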

Page 56: Tema 4. Búsquedas en el Web

How to search the Web: Aids

Each entry gives the feature and the engines that offer it.

• Related Searches: AltaVista, AllTheWeb, Excite, HotBot, Lycos, MSN, Yahoo. Not yet updated, but may be still correct: iWon
• Clustering: AltaVista, AllTheWeb, Excite, Google, HotBot, MSN, Northern Light
• Find Similar: AltaVista, AOL Search, Google
• Stemming: AOL Search, Direct Hit, HotBot, Inktomi (HotBot, MSN)
• Search Within: AltaVista, Google, HotBot, Lycos
• Spidered Version: Google
• Search By Language: AltaVista, AllTheWeb, Excite, Google, HotBot, Lycos, MSN, Northern Light
• Page Translation: AltaVista, Google, Lycos
• Porn Filter: AltaVista, AllTheWeb, Google
• Porn Warning: HotBot, MSN, Northern Light

Page 57: Tema 4. Búsquedas en el Web

57

How to search the Web: Search aids

Feature                                       Supported by
Number Of Listings Shown (10 unless noted)    AltaVista, AllTheWeb, AOL Search (5), Direct Hit, Excite, Google, HotBot, LookSmart (15), Lycos, MSN (15), Northern Light. Not yet updated, but may still be correct: iWon, Netscape, Yahoo (20)
Ability To Increase Number Of Listings?       AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may still be correct: Yahoo
See 20 Results                                AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may still be correct: Yahoo
See 50 Results                                AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may still be correct: Yahoo
See 100 Results                               AllTheWeb, Google, HotBot. Not yet updated, but may still be correct: Yahoo
Sort By Date                                  MSN Search, Northern Light
Date Range                                    AltaVista, Google, HotBot, MSN, Northern Light. Not yet updated, but may still be correct: iWon, Yahoo
Date Displayed?                               AltaVista, HotBot (for Inktomi results), Northern Light
Display Titles Only?                          AltaVista, Excite, HotBot (URLs only option), MSN
Other Major Customize Options                 AltaVista, AllTheWeb, Google

Page 58: Tema 4. Búsquedas en el Web

58

How to search the Web: Operators

Command    How        Supported by
Or         OR         AltaVista, AOL Search, Excite, Google, Inktomi (HotBot, MSN), Lycos, Northern Light
           none       AllTheWeb, Direct Hit, LookSmart. Not yet updated, but may still be correct: Yahoo
And        AND        AltaVista, AOL Search, Excite, Inktomi (HotBot, MSN), Lycos, Northern Light
           none       AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may still be correct: Yahoo
Not        NOT        AOL Search, Excite, Inktomi (HotBot), Lycos, Northern Light
           AND NOT    AltaVista, Inktomi (MSN). Not yet updated, but may still be correct: Netscape
           none       AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may still be correct: Yahoo
Nesting    ( )        AltaVista, AOL Search, Excite, Inktomi (MSN), Northern Light
           none       AllTheWeb, Direct Hit, Google, Inktomi (HotBot), LookSmart, Lycos. Not yet updated, but may still be correct: Yahoo
Near       NEAR       AltaVista (10 words), AOL Search (specify number), Lycos (25 words)
           none       AllTheWeb, Direct Hit, Google, Inktomi (HotBot, MSN), LookSmart

Notes
At AltaVista, Boolean only works on the advanced search page.
At Excite, Google & MSN, Boolean commands must be in UPPERCASE.
At Inktomi-powered services, set the menu to "Boolean".
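The operators in this table map naturally onto set operations over an index. The sketch below is a toy model, not any engine's implementation: `INDEX` is made-up data, and the 10-word window for NEAR simply mirrors the AltaVista figure quoted in the table.

```python
# Toy positional index: term -> {doc_id: [word positions]} (made-up data).
INDEX = {
    "web":    {1: [0, 7], 2: [3], 3: [1]},
    "search": {1: [1], 3: [20]},
    "engine": {2: [4], 3: [21]},
}


def docs(term):
    """Set of documents that contain the term."""
    return set(INDEX.get(term, {}))


def near(a, b, window=10):
    """Documents where a and b occur within `window` words of each other."""
    result = set()
    for doc in docs(a) & docs(b):
        if any(abs(i - j) <= window for i in INDEX[a][doc] for j in INDEX[b][doc]):
            result.add(doc)
    return result


print(docs("web") & docs("search"))                      # web AND search
print(docs("search") | docs("engine"))                   # search OR engine
print(docs("web") - docs("search"))                      # web AND NOT search
print((docs("search") | docs("engine")) & docs("web"))   # (search OR engine) AND web
print(near("web", "search"))                             # web NEAR search
```

Document 3 contains both "web" and "search" but 19 words apart, so it satisfies AND yet fails NEAR with a 10-word window.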

Page 59: Tema 4. Búsquedas en el Web

59

An example: Google

• Google = [googol] = 10^100
• Goal at its creation (1997): to improve on the search quality of existing search engines. Example: of the 4 main search engines of the time, only 1 could find itself.
• It aims for very high precision at the expense of recall.
• It uses both the text and the link structure as an improvement over other systems.

Page 60: Tema 4. Búsquedas en el Web

60

An example: Google

Features:
• Uses the link structure to compute the ranking of each page, through a measure called PageRank.
• Uses links to improve search results. The information in a link is recorded both for the page that contains it and for the page it points to (in some cases, the link text describes the target page better than the page's own content).
• Keeps information about where terms occur. This allows proximity searches and lets proximity be used when computing relevance.
• Keeps information about typography and presentation (bold, quotation marks, ...) to determine the importance of a term.
• Stores every page it analyzes in compressed form (HTML content only).
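A minimal sketch of the anchor-text idea from the list above: words in a link's text are indexed both for the page that contains the link and for the page it points to. The page data and the function name `build_index` are made up for illustration.

```python
from collections import defaultdict

# Made-up pages: URL -> (body text, [(anchor text, target URL), ...]).
PAGES = {
    "a.example/home": ("welcome to my links", [("python tutorial", "b.example/py")]),
    "b.example/py": ("chapter one variables", []),
}


def build_index(pages):
    """Map each word to the URLs it describes, via body text or anchor text."""
    index = defaultdict(set)
    for url, (body, links) in pages.items():
        for word in body.split():
            index[word].add(url)
        for anchor, target in links:
            for word in anchor.split():
                index[word].add(url)      # the page that contains the link
                index[word].add(target)   # the page the link points to
    return index


index = build_index(PAGES)
print(sorted(index["tutorial"]))
```

Here "b.example/py" becomes findable under "tutorial" even though that word never appears in its own text.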

Page 61: Tema 4. Búsquedas en el Web

61

An example: Google

PageRank
An objective measure of the importance of a page, based on the references to it found in other pages.

It takes into account:
• The number of references to the page.
• The quality of the pages that reference it.
• The total number of references on each page that references it.
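The three factors listed above can be sketched as a simple iterative computation. This is a minimal sketch and not Google's implementation: the damping factor d = 0.85, the normalization so that ranks sum to 1, and the handling of pages without outgoing links are assumptions borrowed from the commonly published formulation, and the link graph is made up.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) / N + d * sum(PR(q) / outdegree(q)) over pages q linking to p."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it references.
                share = d * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += d * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it references.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph C ranks highest because three pages reference it, A comes next because its single reference comes from the highly ranked C, and D comes last because nothing references it.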

Page 62: Tema 4. Búsquedas en el Web

62

An example: Google

Elements considered:
• The web is not a controlled collection.
• Improving search need not be limited to improving the query (users can ask whatever they want, however they want).
• There is no control over what people put on the web.
• Commercial companies exploit the way search engines work to manipulate them and obtain high rankings.

Page 63: Tema 4. Búsquedas en el Web

63

An example: Google

Architecture

Page 64: Tema 4. Búsquedas en el Web

64

An example: Google

How it works
• The URLServer sends URLs to the crawlers.
• The pages fetched are sent to the StoreServer, which stores them (compressed) in the Repository.
• The Indexer reads the repository, decompresses the documents and parses them. It converts each document into a set of word occurrences called hits. A hit records the word, its position in the document, the font size and the capitalization. The Indexer distributes the hits into the barrels, creating the partially sorted forward index. It also stores information about the links found in the pages.
• The URLResolver converts relative URLs into absolute ones and generates the document identifiers. It builds the links database used to compute PageRank.
• The Sorter re-sorts the contents of the barrels by word identifier instead of by document identifier. This produces the inverted file.
• The Searcher answers the queries.
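The Sorter step can be sketched as a re-sort of the hits. This is a simplified illustration: the hits here carry only a document identifier, a word identifier and a position (no font size or capitalization), barrels are ignored, and all identifiers are made up.

```python
# Forward index: hits as (doc_id, word_id, position), grouped by document.
forward_index = [
    (1, 42, 0), (1, 7, 1), (1, 42, 5),
    (2, 7, 0), (2, 99, 3),
]


def invert(hits):
    """Re-sort hits by word_id and group them as word_id -> [(doc_id, position), ...]."""
    inverted = {}
    for doc_id, word_id, position in sorted(hits, key=lambda h: (h[1], h[0], h[2])):
        inverted.setdefault(word_id, []).append((doc_id, position))
    return inverted


inverted_index = invert(forward_index)
print(inverted_index[7])     # [(1, 1), (2, 0)]
print(inverted_index[42])    # [(1, 0), (1, 5)]
```

The data is the same in both indexes; only the sort key changes, which is why a query for one word can then be answered by reading a single contiguous list.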

Page 65: Tema 4. Búsquedas en el Web

65

An example: Google

Data structures:
• BigFiles: virtual files.
• Repository: compressed documents.
• Document indexes.
• Lexicon: complete list of words.
• Hit lists.
• Forward index: partially sorted (barrels).
• Inverted index: fully sorted (barrels).

Page 66: Tema 4. Búsquedas en el Web

66

An example: Google

The indexing process:
• Parsing: many problems arise from syntax errors and content types.
• Indexing the document into the barrels: parsing produces documents that are encoded into the barrels.
• Sorting: the inverted index is generated by sorting on word identifiers.
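The parsing step has to cope with malformed HTML. The sketch below uses Python's standard `html.parser`, which tolerates unclosed tags, to turn a page into simplified hits. The hit layout (word, position, bold flag, capitalized flag) is an assumption chosen for illustration; it only approximates the position, font and capitalization data described in the slides.

```python
from html.parser import HTMLParser


class HitParser(HTMLParser):
    """Collect (word, position, is_bold, is_capitalized) tuples from HTML text."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self.bold_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        for word in data.split():
            position = len(self.hits)
            self.hits.append((word.lower(), position, self.bold_depth > 0, word[0].isupper()))


parser = HitParser()
# The second <p> is never closed; the parser still delivers its text.
parser.feed("<p>Web <b>search engines</b> index pages<p>unclosed paragraph")
parser.close()
print(parser.hits[:3])
```

A real indexer would also map each word to a word identifier from the lexicon before writing the hit to a barrel.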

Page 67: Tema 4. Búsquedas en el Web

67

An example: Google

The search process
• Parse the query.
• Convert the words into identifiers.
• Seek to the start of the document list for each word.
• Find the documents that contain all the words.
• Compute the rank of each document.
• Sort and show the first k documents.
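The steps above can be sketched end to end on toy data. Everything here is hypothetical: the lexicon, the document lists and the PageRank values are made up, and the scoring function (PageRank multiplied by the total term count) is a stand-in chosen for brevity, not the ranking actually used by Google.

```python
# Hypothetical data structures.
LEXICON = {"web": 1, "search": 2, "engine": 3}      # word -> word_id
DOCLISTS = {                                        # word_id -> {doc_id: term count}
    1: {10: 3, 11: 1, 12: 2},
    2: {10: 1, 12: 4},
    3: {12: 1},
}
PAGERANK = {10: 0.2, 11: 0.5, 12: 0.3}              # doc_id -> PageRank


def search(query, k=10):
    """Return the top-k documents that contain every query word."""
    words = query.lower().split()                    # parse the query
    if not words or any(w not in LEXICON for w in words):
        return []                                    # an unknown word matches nothing
    word_ids = [LEXICON[w] for w in words]           # words -> identifiers
    doclists = [DOCLISTS[w] for w in word_ids]       # document list per word
    candidates = set(doclists[0]).intersection(*doclists[1:])

    def score(doc_id):
        # Stand-in ranking: PageRank times the total count of query terms.
        return PAGERANK[doc_id] * sum(doclist[doc_id] for doclist in doclists)

    return sorted(candidates, key=score, reverse=True)[:k]


print(search("web search"))           # [12, 10]
print(search("web search engine"))    # [12]
```

Document 11 has the highest PageRank but lacks "search", so it is filtered out before ranking is ever computed.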

Page 68: Tema 4. Búsquedas en el Web

68

An example: Google

Some statistics (1997)

Storage Statistics
Total Size of Fetched Pages                 147.8 GB
Compressed Repository                       53.5 GB
Short Inverted Index                        4.1 GB
Full Inverted Index                         37.2 GB
Lexicon                                     293 MB
Temporary Anchor Data (not in total)        6.6 GB
Document Index Incl. Variable Width Data    9.7 GB
Links Database                              3.9 GB
Total Without Repository                    55.2 GB
Total With Repository                       108.7 GB

Web Page Statistics
Number of Web Pages Fetched    24 million
Number of URLs Seen            76.5 million
Number of Email Addresses      1.7 million
Number of 404's                1.6 million
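The totals in the storage table can be checked with a little arithmetic, which also gives the compression ratio of the repository and the average page size. The sketch assumes decimal units (1 GB = 1000 MB = 1,000,000 KB) and leaves out the temporary anchor data, which the table marks as not included in the total.

```python
# Sizes in GB, copied from the storage table (Lexicon: 293 MB = 0.293 GB).
sizes_gb = {
    "Short Inverted Index": 4.1,
    "Full Inverted Index": 37.2,
    "Lexicon": 0.293,
    "Document Index": 9.7,
    "Links Database": 3.9,
}
fetched_gb = 147.8
repository_gb = 53.5
pages_fetched = 24e6

total_without_repository = sum(sizes_gb.values())
total_with_repository = total_without_repository + repository_gb
compression_ratio = fetched_gb / repository_gb
average_page_kb = fetched_gb * 1e6 / pages_fetched

print(round(total_without_repository, 1))   # 55.2
print(round(total_with_repository, 1))      # 108.7
print(round(compression_ratio, 2))          # 2.76
print(round(average_page_kb, 1))            # 6.2
```

The index structures together are about the same size as the compressed repository, and the full system is smaller than the raw pages it describes.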

Page 69: Tema 4. Búsquedas en el Web

69

An example: Google

Page 70: Tema 4. Búsquedas en el Web

70

An example: Google

Page 71: Tema 4. Búsquedas en el Web

71

Evaluating pages

• Search engines retrieve information but (for now) give no data about the quality of the pages they find.
• In some cases the ranking of the results tries to take page quality into account (Google's PageRank), but there are no objective criteria for assessing it.
• The pages found need to be evaluated objectively. This requires:
    • Using techniques to identify the characteristics of the pages and the information that is needed.
    • Thinking critically about the content and asking a series of questions to decide on its quality.

Page 72: Tema 4. Búsquedas en el Web

72

Evaluating pages

1. What can the URL tell you?

Questions to ask / What are the implications?

Question: Is it somebody's personal page?
• Read the URL carefully:
• Look for a personal name (e.g., jbarker or barker) following a tilde ( ~ ), a percent sign ( % ), or the words "users," "members," or "people."
• Is the server a commercial ISP or other provider mostly of web page hosting (like aol.com or geocities.com)?
Implications: Personal pages are not necessarily "bad," but you need to investigate the author very carefully. For personal pages, there is no publisher or domain owner vouching for the information in the page.

Question: What type of domain does it come from? (educational, nonprofit, commercial, government, etc.)
• Is the domain appropriate for the content?
• Government sites: look for .gov, .mil, .us, or other country code
• Educational sites: look for .edu
• Nonprofit organizations: look for .org
• If from a foreign country, look at the country code and read the page to be sure who published it.
Implications: Look for appropriateness, fit. What kind of information source do you think is most reliable for your topic?

Question: Is it published by an entity that makes sense? Who "published" the page?
• In general, the publisher is the agency or person operating the "server" computer from which the document is issued.
• The server is usually named in the first portion of the URL (between http:// and the first /)
• Have you heard of this entity before?
• Does it correspond to the name of the site? Should it?
Implications: You can rely more on information that is published by the source:
• Look for New York Times news from www.nytimes.com
• Look for health information from any of the agencies of the National Institute of Health on sites with nih somewhere in the domain name.
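Some of these URL checks can be automated as a first pass. The sketch below uses Python's standard `urllib.parse`; the function name `inspect_url`, the returned fields and the example URLs are assumptions made for illustration, and the result is only a hint that still needs human judgment.

```python
from urllib.parse import urlsplit

PERSONAL_MARKERS = ("users", "members", "people")
HOSTING_DOMAINS = ("aol.com", "geocities.com")   # the examples named in the slide


def inspect_url(url):
    """Extract the server name, its top-level domain and a 'personal page' hint."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    personal_path = any(
        s.startswith(("~", "%")) or s.lower() in PERSONAL_MARKERS for s in segments
    )
    hosted = any(host == d or host.endswith("." + d) for d in HOSTING_DOMAINS)
    return {
        "server": host,
        "top_level_domain": host.rsplit(".", 1)[-1] if "." in host else "",
        "looks_personal": personal_path or hosted,
    }


print(inspect_url("http://www.example.edu/~jbarker/essay.html"))
print(inspect_url("http://www.nytimes.com/news/index.html"))
```

The first URL is flagged because of the tilde followed by a name; the second is not, and its server name matches the publisher one would expect for the content.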

Page 73: Tema 4. Búsquedas en el Web

73

Evaluating pages

2. Scan the perimeter of the page

Questions to ask / What are the implications?

Question: Who wrote the page?
• Look for the name of the author, or the name of the organization, institution, agency, or whatever who is responsible for the page
• An e-mail contact is not enough
• If there is no personal author, look for an agency or organization that claims responsibility for the page.
• If you cannot find this, locate the publisher by truncating back the URL (see technique above). Does this publisher claim responsibility for the content? Does it explain why the page exists in any way?
Implications: Web pages are all created with a purpose in mind by some person or agency or entity. They do not simply "grow" on the web like mildew grows in moist corners. You are looking for someone who claims accountability and responsibility for the content. An e-mail address with no additional information about the author is not sufficient for assessing the author's credentials. If this is all you have, try e-mailing the author and asking politely for more information about him/her.

Question: Is the page dated? Is it current enough?
• Is it "stale" or "dusty" information on a time-sensitive or evolving topic?
• CAUTION: Undated factual or statistical information is no better than anonymous information. Don't use it.
Implications: How recent the date needs to be depends on your needs. For some topics you want current information. For others, you want information put on the web near the time it became known. In some cases, the importance of the date is to tell you whether the page author is still maintaining an interest in the page, or has abandoned it.

Question: What are the author's credentials on this subject?
• Does the purported background or education look like someone who is qualified to write on this topic?
• Might the page be by a hobbyist, self-proclaimed expert, or enthusiast?
• Is the page merely an opinion? Is there any reason you should believe its content more than any other page?
• Is the page a rant, an extreme view, possibly distorted or exaggerated?
• If you cannot find strong, relevant credentials, look very closely at documentation of sources (next section).
Implications: Anyone can put anything on the web for pennies in just a few minutes. Your task is to distinguish between the reliable and questionable. Many web pages are opinion pieces offered in a vast public forum. You should hold the author to the same degree of credentials, authority, and documentation that you would expect from something published in a reputable print resource (book, journal article, good newspaper).
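"Truncating back the URL" means removing one path segment at a time until a page is reached that identifies the publisher. A minimal sketch of that idea follows; the function name `truncate_back` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def truncate_back(url):
    """Yield progressively shorter URLs, dropping one path segment at a time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()
        path = "/" + "/".join(segments) + ("/" if segments else "")
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))


for candidate in truncate_back("http://www.example.org/dept/staff/jdoe/page.html"):
    print(candidate)
```

Each candidate would be opened in turn, stopping at the first page that says who is responsible for the content.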

Page 74: Tema 4. Búsquedas en el Web

74

Evaluating pages

3. Look for indicators of quality information

Questions to ask / What are the implications?

Question: Are sources documented with footnotes or links?
• Where did the author get the information?
• As in published scholarly/academic journals and books, you should expect documentation.
• If there are links to other pages as sources, are they to reliable sources?
• Do the links work?
Implications: In scholarly/research work, the credibility of most writings is proven through footnote documentation or other means of revealing the sources of information. Saying what you believe without documentation is not much better than just expressing an opinion or a point of view. What credibility does your research need? An exception can be journalism from highly reputable newspapers. But these are not scholarly. Check with your instructor before using this type of material. Links that don't work or are to other weak or fringe pages do not help strengthen the credibility of your research.

Question: If reproduced information (from another source), is it complete, not altered, not fake or forged?
• Is it retyped? If so, it could easily be altered.
• Is it reproduced from another publication?
• Are permissions to reproduce and copyright information provided?
• Is there a reason there are not links to the original source if it is online (instead of reproducing it)?
Implications: You may have to find the original to be sure a copy of something is not altered and is complete. Look at the URL: is it from the original source? If you find a legitimate article from a reputable journal or other publication, it should be accompanied by the copyright statement and/or permission to reprint. If it is not, be suspicious. Try to find the source. If the URL of the document is not to the original source, it is likely that it is illegally reproduced, and the text could be altered, even with the copyright information present.

Question: Are there links to other resources on the topic?
• Are the links well chosen, well organized, and/or evaluated/annotated?
• Do the links work?
• Do the links represent other viewpoints?
• Do the links (or absence of other viewpoints) indicate a bias?
Implications: Many well developed pages offer links to other pages on the same topic that they consider worthwhile. They are inviting you to compare their information with other pages. Links that offer opposing viewpoints as well as their own are more likely to be balanced and unbiased than pages that offer only one view. Anything not said that could be said? And perhaps would be said if all points of view were represented? Always look for bias. Especially when you agree with something, check for bias.

Page 75: Tema 4. Búsquedas en el Web

75

Evaluating pages

4. What do others say?

Questions to ask / What are the implications?

Question: Who links to the page?
• Are there many links?
• What kinds of sites link to it?
• What do they say?
• Are any of them directories? Try looking at what directories say.
Implications: Sometimes a page is linked to only by other parts of its own site (not much of a recommendation). Sometimes a page is linked to by its fan club, and by detractors. Read both points of view. If a page or its site is in a bona fide directory, think about whether there is much critical evaluation of the links in the directory.

Question: Is the page listed in one or more reputable directories or pages?
Implications: Good directories include a tiny fraction of the web, and inclusion in a directory is therefore noteworthy. But read what the directory says! It may not be 100% positive.

Question: What do others say about the author or responsible authoring body?
Implications: "Googling someone" (new term for this) can be revealing. Be sure to consider the source. If the viewpoint is radical or controversial, expect to find detractors. Think critically about all points of view.

Page 76: Tema 4. Búsquedas en el Web

76

Evaluating pages

5. Does it all add up?

Questions to ask / So what? What are the implications?

Question: Why was the page put on the web?
• Inform, give facts, give data?
• Explain, persuade?
• Sell, entice?
• Share?
• Disclose?
Implications: These are some of the reasons to think of. The web is a public place, open to all. You need to be aware of the entire range of human possibilities of intentions behind web pages.

Question: Might it be ironic? Satire or parody?
• Think about the "tone" of the page.
• Humorous? Parody? Exaggerated? Overblown arguments?
• Outrageous photographs or juxtaposition of unlikely images?
• Arguing a viewpoint with examples that suggest that what is argued is ultimately not possible.
Implications: It is easy to be fooled, and this can make you look foolish in turn.

Question: Is this as good as resources I could find if I used the library, or some of the web-based indexes available through the library, or other print resources?
• Are you being completely fair? Too harsh? Totally objective? Requiring the same degree of "proof" you would from a print publication?
• Is the site good for some things and not for others?
• Are your hopes biasing your interpretation?
Implications: What is your requirement (or your instructor's requirement) for the quality of reliability of your information? In general, published information is considered more reliable than what is on the web. But many, many reputable agencies and publishers make great stuff available by "publishing" it on the web. This applies to most governments, most institutions and societies, many publishing houses and news sources. But take the time to check it out.

Page 77: Tema 4. Búsquedas en el Web

77

Evaluating search engines

Index creation
• How is the index compiled?
• Size: number of pages indexed
• Coverage (http, ftp, www, news, ...)
• Are there special inclusion criteria?
• Does the spider have access to password-protected sites?
• Where does the engine not search?
• Which elements of the pages are indexed?
• Is there vocabulary control?
• Are stopwords used?
• Update frequency
• Time taken to index a submitted page
• Pages indexed per day
• Checking for dead links

Page 78: Tema 4. Búsquedas en el Web

78

Evaluating search engines

Search capability
• Where does it search (what is in the index)?
• Searching in several places at once
• Handling of stopwords
• Range of search functions
• Search refinement
• Advanced options
• Use of fields
• Use of Boolean logic (yes/no, easy/hard, ...)
• Handling of synonyms / use of thesauri
• Can the search be saved?

Page 79: Tema 4. Búsquedas en el Web

79

Evaluating search engines

Quality of the answers
• Response time
• Number of results
• Quality of the hit-list summary (host, reason, link, ranking, ...)
• Detail given about the relevance criterion used
• Elimination of duplicates
• Handling of results (display, sorting, export, find-similar, ...)
• Saving the search results
• Methodological analysis (precision, recall, relevance, coverage, reliability, usefulness, novelty, ...)
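Two of the measures named in the methodological analysis, precision and recall, have standard definitions that are easy to compute once a set of relevant documents has been judged by hand. The sketch below uses made-up document identifiers.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgment: 5 results returned, 4 documents known to be relevant.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d4", "d6", "d7"]
print(precision_recall(retrieved, relevant))   # (0.4, 0.5)
```

Two of the five results are relevant (precision 0.4), and two of the four relevant documents were found (recall 0.5). On the web the full set of relevant documents is rarely known, so recall can usually only be estimated.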

Page 80: Tema 4. Búsquedas en el Web

80

Evaluating search engines

Usability
• Interface (clarity, simplicity, ...)
• Readability (font size, text layout, paragraph arrangement, ...)
• Ease of use (navigation)
• Online help
• Query construction process
• Customization options
• Saving preferences
• Load and response times