Don’t accept the limits of Google! Presentation for the Energy Institute April 2009 Terry Kendrick Information Now Limited terry.kendrick@btconnect.com 01603 628818


Page 1: Don’t accept the limits of Google! Presentation for the Energy Institute April 2009 Terry Kendrick Information Now Limited terry.kendrick@btconnect.com

Don’t accept the limits of Google!

Presentation for the Energy Institute, April 2009

Terry Kendrick, Information Now Limited

terry.kendrick@btconnect.com 01603 628818

Page 2:

Google enough?

Comprehensive?

And enough for any searcher?

Biggest?

Best?
- ease of use?
- sources?

90% plus market share for search

Page 3:

Google the biggest? (sometimes but not always …)

“Terry Kendrick” (hits)

Yahoo.com 3230

Altavista.com 3240

Live.com 2,900000

Google.com 3040

Ask.com 554

Source: Search 27 April 2009 19.00

Hmmm… but how many hits can you really see anyway?

Page 4:

Google the biggest? (sometimes but not always …)

“Terry Kendrick” (hits)

Yahoo.com 5,620

Altavista.com 5,470

Live.com 2,320

Google.com 2,690

Ask.com 428

Source: Search 12 October 2008 20.50

Hmmm… but how many hits can you really see anyway?

Cuil – 3,126

Page 5:

Google best?

• Google is great for coverage and accessibility. Academic library resources are better quality: Brophy, J., & Bawden, D. (2005). Is Google enough? Comparison of an internet search engine with academic library resources. Aslib Proceedings, 57(6), 498-5

Page 6:

Comprehensive and all you need?

• “There is nothing in this study to explain why web users seem to greatly prefer the Google search engine, since overall the performance of Google and Yahoo is more or less equivalent, and ahead of their competitors. We therefore suppose that the reasons go beyond the criteria of relevance of results” – Jean Veronis, University of Provence, “Comparative Study of Six Search Engines”, 2006

Page 7:

Limits of Google

• Doesn’t have everything on the web in its cache

• Doesn’t show you everything it has got in its cache

• Other search engines may have some different material

• Even “breaking” Google will only give you up to around 1000 hits per search

• Advanced Search is better done by typing operators directly into the search line rather than through the Advanced Search form (the “mask”)

• (But it’s still an excellent search engine!)
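As a minimal sketch of putting the advanced options straight into the search line, the snippet below assembles a query from operators and URL-encodes it. The operators `site:`, `filetype:` and `intitle:` are standard Google query operators; the function names `build_query` and `search_url`, their parameters, and the example search terms are illustrative assumptions and do not come from the slides.

```python
from urllib.parse import urlencode


def build_query(terms, site=None, filetype=None, intitle=None):
    """Compose a search-line query using operators instead of the form."""
    parts = [terms]
    if site:
        parts.append(f"site:{site}")          # restrict to a domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict to a file type
    if intitle:
        parts.append(f'intitle:"{intitle}"')  # phrase must be in the title
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Return a URL carrying the query in the standard `q` parameter."""
    return f"{base}?{urlencode({'q': query})}"


# Hypothetical example: PDF documents on UK academic sites.
query = build_query('"wind energy"', site="ac.uk", filetype="pdf")
print(query)
print(search_url(query))
```

Because the query is just a string, the same line can be pasted into other engines to compare results, although operator support varies between engines.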

Page 8:

First page results – Google, Microsoft, Yahoo, Ask

• Among 12,570 random user-defined queries, just over 1 percent of first page search results were the same across the engines

– The percent of total results unique to one search engine was 88.3 percent.

– The percent of total results shared by any two search engines was 8.9 percent.

– The percent of total results shared by three search engines was 2.2 percent.

– The percent of total results shared by the top four search engines was 0.6 percent.

Source: Dogpile, April 2007

Research by: Queensland University of Technology and Pennsylvania State University
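The slide does not say how the overlap percentages were calculated. The sketch below shows one plausible reading: pool the first-page results of all engines, then report what share of the distinct results was returned by exactly one, two, three or four engines. The function name `overlap_shares`, the engine names and the URLs are made up for illustration.

```python
from collections import Counter


def overlap_shares(results_by_engine):
    """Return {k: share of distinct results returned by exactly k engines}."""
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        engines_per_url.update(set(urls))  # an engine counts each URL once
    total = len(engines_per_url)
    by_k = Counter(engines_per_url.values())
    return {k: by_k[k] / total for k in range(1, len(results_by_engine) + 1)}


# Hypothetical first-page results; not data from the Dogpile study.
sample = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u1", "u5", "u6", "u7"],
    "engine_c": ["u1", "u2", "u8", "u9"],
    "engine_d": ["u1", "u10", "u11", "u12"],
}

shares = overlap_shares(sample)
for k, share in shares.items():
    print(f"returned by exactly {k} engine(s): {share:.1%}")
```

The four shares always sum to 100 percent under this definition, which matches the slide's figures (88.3 + 8.9 + 2.2 + 0.6 = 100.0).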

Page 9:

Despite Dogpile’s self-supporting research, there’s a high overlap in the first ten pages or so though, right?

Intuitive … but is it really the case?

See: http://ranking.thumbshots.com/

Page 10:

“Must See” search engines (all .com unless noted otherwise)

• Yahoo
• Altavista
• Alltheweb
• Google
• Live
• Ask
• BBC
• Searchme
• Cuil

• Trovando.it
• Exalead
• Quintura
• A9

……

• Ixquick
• Vivisimo / Clusty
• Mamma
• Dogpile
• ez2www
• Surfwax
• Webcrawler
• Fazzle
• Killerinfo
• Icerocket

• Zuula
• Mahalo
• Toolbe
• Baidu (China) / Yandex (Russia)

• Altsearchengines (top 100)
• http://altsr.us/
• www.thesearchrace.com/

Page 11:

… don’t forget specialist search engines

Examples:

www.zoominfo.com – people / company summary
www.base-search.net – academic search engine
www.searchmil.com/ – military search engine … but good for tools and techniques
www.truveo.com – video search engine
www.questia.com – “world’s largest online library”
www.archive.org – includes “wayback machine”
www.seeqpod.com / www.songza.com – playable audio files
www.bandsintown.com – gigs
www.masterseek.com – business directory

Page 12:

Human web: blogs, newsgroups and mailing lists

• www.boardreader.com
• www.twazzup.com
• www.bloogz.com
• www.blogpulse.com
• www.feedster.com
• www.technorati.com

• http://groups.google.com/groups
• http://google.com/blogsearch

• …also Dark Net (see www.darknet.com) such as Bittorrents

Searching them

Page 13:
Page 14:

Google’s view on the size of the web

• “Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

• the number of individual web pages out there is growing by several billion pages per day.

• So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite -- for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We're not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what's a useful page, and there is no exact answer.

We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers. But we're proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world's data.”

• Google Blog http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

Page 15:

How big is the deep web?

“The Deep Web covers somewhere in the vicinity of 900 billion pages of information located through the World Wide Web in various files and formats that the current search engines on the Internet either cannot find or have difficulty accessing. The current search engines find about 8 billion pages at the time of this writing.”

Source: Deep Web Research 2006 by Marcus P. Zillman, published January 15, 2006

Fall 2007 data:

• Google.com indexes 12.5 billion public web pages.

• 71 billion static web pages are publicly-available. These pages can easily be found by Google and other search engines.

• 6.5 billion static pages are hidden from the public. As private intranet content, these are the corporate pages that are only open to employees of specific companies.

• 220+ billion database-driven pages are completely invisible to Google.

Google therefore = 6% of the internet?

http://netforbeginners.about.com/cs/secondaryweb1/a/secondaryweb.htm
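The slide does not show how the 6% figure was reached. As a rough check, the sketch below divides Google's indexed pages by two candidate denominators taken from the Fall 2007 figures above. Which denominator the original source intended is an assumption here, and the variable names are illustrative.

```python
# Figures in billions of pages, as quoted on the slide (Fall 2007 data).
google_indexed = 12.5
public_static = 71.0
private_static = 6.5
database_driven = 220.0  # quoted as "220+", so treated as a lower bound


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Reading 1: Google's index compared with database-driven pages only.
vs_database = percent(google_indexed, database_driven)

# Reading 2: Google's index compared with all three page categories.
vs_everything = percent(
    google_indexed, public_static + private_static + database_driven
)

print(f"Index vs database-driven pages: {vs_database:.1f}%")
print(f"Index vs all listed pages:      {vs_everything:.1f}%")
```

Under the first reading the share is about 5.7%, close to the slide's 6%. Under the second it falls to about 4.2%. Either way, the point of the slide holds: the index covers only a small fraction of what is out there.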

Page 16:

Invisible Web includes key information resources…

• Databases
– E.g. Companies House
– Library catalogues
– Picture collections
– “Mash-ups”

• Password protected / subscription sites
– E.g. Newspaper archives

Page 17:

Example databases (many invisible web)

• www.oscars.org
• http://vads.ahds.ac.uk/collections/ST.html
• www.a2a.org.uk
• www.ipo.gov.uk
• www.ncjrs.gov/abstractdb/Search.asp
• http://businesscreditusa.com/index.asp
• http://plants.ifas.ufl.edu/search80/NetAns2/
• www.allmusic.com/
• http://aad.archives.gov/aad/
• www.eric.ed.gov

• www.istl.org/01-winter/internet.html

Page 18:

Mashups and podcasts

• www.folkestonegerald.com/map/
• www.chicagocrime.org/map
• www.housingmaps.com
• www.yourhistoryhere.com
• www.ufomaps.com
• www.gypsymaps.com

• www.programmableweb.com/matrix

Podcasting: www.ipodder.org; http://britcaster.com/; www.podcast.net; www.podcastcentral.com
Subject specific example: www.jodcast.net/amp/index.html

Google maps

Page 19:

Video streaming

• www.researchchannel.org
• www.britishpathe.com
• http://mitworld.mit.edu/index.php
• http://web.sls.csail.mit.edu/lectures/
• http://videolectures.net
• www.monkeysee.com/

• www.loc.gov/film/arch.html

• www.mediachannel.com
• http://showbiz.quickfound.net/video_search_and_news.html

• www.youtube.com
• www.veoh.com
• www.eefoof.com
• http://communityvideo.aol.com/Main.do

• cf. www.video.google.com

Academic

Community

Page 20:

Open access repositories

• www.doaj.org/
• http://oaister.umdl.umich.edu/o/oaister/viewcolls.html
• www.freefulltext.com

• www.arl.org/sparc/repos/ir.html

• http://archives.eprints.org/
• www.sherpa.ac.uk

• http://re.cs.uct.ac.za//
• www.hw.ac.uk/libwww/irn/irn142/irn142.html (large list)

• http://www.interdok.com/dopp/search.cfm – conference proceedings, not free access

Page 21:

What if?

• The bot visits the site but goes away before doing the whole site (e.g. parts of pages, number of pages)?
• Page author used a “No robots” command?
• The material was put up last week or is real time?
• The content is dynamically generated (CGI, ASP and others)
• Material is graphic or embedded deep (e.g. ppt notes pages)
• Spelling is wrong! (e.g. Mary J Bilge)
• Other reasons!
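To make the “No robots” case concrete, here is a minimal sketch using Python's standard `urllib.robotparser` to test whether a crawler may fetch a page under a given robots.txt. The robots.txt content, the `example.org` URLs, the `ExampleBot` user agent and the helper name `is_crawlable` are all hypothetical. Page authors can also exclude pages with a robots meta tag, which this sketch does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the rules in robots_txt allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network call
    return parser.can_fetch(user_agent, url)


for url in (
    "http://example.org/public/index.html",
    "http://example.org/private/report.html",
):
    status = "crawlable" if is_crawlable(ROBOTS_TXT, url) else "excluded"
    print(f"{url}: {status}")
```

A well-behaved search engine bot skips the excluded page, so its content stays out of the index even though anyone with the link can read it.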

Page 22:

How invisible is the invisible web?

• http://oedb.org/library/college-basics/research-beyond-google “Research Beyond Google: 119 Authoritative, Invisible, and Comprehensive Resources”

• www.completeplanet.com/ (and Brightplanet – a little out of date)
• http://virtualchase.com/search_engines/databases.html
• www.freepint.com/gary/direct.htm (very out of date)
• www.deepwebresearch.info (up to date – incredibly detailed, often techy)

• www.turbo10.com (Hmm…..)
• www.incywincy.com
• www.deepdyve.com

• http://www.osti.gov/media/deepWebWM_256.html
• www.enth.com
• www.iage.com/invisible.html
• www.weblens.org/invisible.html

• www.deepweb.us

• www.llrx.com/features/deepweb2009.htm

• http://library.laguardia.edu/invisibleweb/webography

• Federated search – Deep Web Technologies

• Long shot ……… “Search our database” [subject term]
– Database [subject term]

How do I find these “invisible” resources?
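The two “long shot” patterns from the list above can be generated for any subject term. This is only a sketch: the function name `long_shot_queries` and the example subject are assumptions, and the resulting strings are meant to be pasted into whichever search engine is being tried.

```python
def long_shot_queries(subject):
    """Return the two 'long shot' query patterns for a subject term."""
    return [
        f'"search our database" {subject}',  # exact phrase plus subject
        f"database {subject}",               # looser fallback
    ]


# Hypothetical subject term for illustration.
for query in long_shot_queries("offshore wind"):
    print(query)
```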

Page 23:

Virtual libraries / Gateways / Portals

Examples:

• www.hw.ac.uk/libWWW/irn/pinakes/pinakes.html

• www.intute.ac.uk

• www.loc.gov/rr/askalib/virtualref.html
• www.loc.gov/rr/international/portals.html
• www.lii.org

Page 25:

Google on the future

Coming up with elegant, fitting and relevant solutions to meet the challenges of mobility, modes, media, personalization, location, socialization, and language will take decades.

Search is a science that will develop and advance over hundreds of years. Think of it like biology and physics in the 1500s or 1600s: it’s a new science where we make big and exciting breakthroughs all the time. ……. Just like biology and physics several hundred years ago, the biggest advances are yet to come. That’s what makes the field of Internet search so exciting.

http://googleblog.blogspot.com/2008/09/future-of-search.html