Search Engines: Challenges & Trends

David Rashty (david.rashty@gmail.com)

May 2007

Search Challenges

• Where web search fails
• Search engine characteristics
• Search engine user interface
• Search engine trends
• Q & A

About Myself

• One of the first WWW developers (1992). Created the 10th website of its kind.
• Founder of two start-ups/ventures: Addwise.com (1999-2005), which deals with software applications and information architecture, and Snunit (1994-1999), which develops e-learning activities.
• Currently involved in a new venture called "ResearchTrail.com".

Where Web Search Fails?

How People Search

• Navigational (find the address of a website): "How do I find the website of CNN?"
• Factual (find exact information): "Population of China", "President Bush's email", "Flights from NY to Detroit"
• Comprehensive (build a picture of a new world): "I need to understand the market around wireless networking", "I need to know more about leukemia"

Search Skills Vary Significantly between People

Some may succeed and some may fail in locating what they are looking for.

Web +/- refers to Web expertise; Econo +/- refers to domain knowledge.

From Christoph Hölscher & Gerhard Strube (2000), http://www9.org/w9cdrom/81/81.html

Only users who could rely both on high web expertise and high domain knowledge ("double experts") were able to solve an average of 3.2 out of the 5 tasks.

(Christoph Hölscher & Gerhard Strube, 2000)

Gap in Web Search

• Despite the existence of huge websites and powerful search engines, novice users have difficulty finding comprehensive information about even common topics.

Searching for relevant information on the World Wide Web is often a laborious and frustrating task for casual and experienced users.

(Christoph Hölscher & Gerhard Strube, 2000)

Search Challenge: Effectiveness

If users don't find the result with their first query, they are progressively less and less likely to succeed with additional searches. Many users don't even bother… (source: Nilsen, 2002)

JupiterResearch found that 71% of online consumers use search engines to find health-related information, but only 16% find the information they are looking for.

(ZDNet Research, June 2006)

Scattered Nature of Information

• Users often retrieve incomplete information because of the complex scatter of relevant facts about a topic across web pages (source: Bhavnani 2006)

Information Density

• General pages contained information on many subjects with a medium amount of detail (portals)
• Specific pages contained information on a few subjects with a high amount of detail (articles, expert sites)
• Sparse pages contained information on a few subjects with little detail (references)

(source: Bhavnani 2006)

Search Challenge

• Searching for comprehensive information requires knowledge and skills
• Novice users lack advanced search skills
• Information scatter is not addressed by search engines, and novice users are usually unaware of it.

Search Engines Characteristics

Search Engines Overlap

• A study looked at search results from more than 12,500 random queries on Ask Jeeves, Google, MSN Search and Yahoo, and found that the overlap in first-page results for these four engines was a scant 1.1% on average for a given query
• 84.9% of total results were unique to one engine
• 11.4% of total results were shared by any two engines
• 2.6% of total results were shared by any three engines
• 1.1% of total results were shared by all four engines

Source: Search Engine Watch
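The four percentages above add up to 100% (84.9 + 11.4 + 2.6 + 1.1), which suggests that each distinct result was counted by how many engines returned it. A minimal sketch of that kind of tally follows, assuming the categories mean "returned by exactly k engines". The engine names and URLs are made-up placeholders, not data from the study.

```python
from collections import Counter


def overlap_distribution(results_by_engine):
    """Share of distinct results returned by exactly k engines, for k = 1..N."""
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        # Count each URL at most once per engine.
        for url in set(urls):
            engines_per_url[url] += 1
    total = len(engines_per_url)
    by_k = Counter(engines_per_url.values())
    return {k: by_k.get(k, 0) / total
            for k in range(1, len(results_by_engine) + 1)}


# Hypothetical first-page results for one query (placeholder data).
sample = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u1", "u5", "u6", "u7"],
    "engine_c": ["u1", "u2", "u8", "u9"],
    "engine_d": ["u1", "u10", "u11", "u12"],
}
shares = overlap_distribution(sample)
for k, share in shares.items():
    print(f"returned by exactly {k} engine(s): {share:.1%}")
```

Averaging such per-query shares over many queries would give figures of the kind quoted on the slide, though the study's exact method is not described here.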

Search Engines Overlap (continued)

Users' Attention (source: Checkit)

Search Query Syntax

Old SE / Professional

Modern SE / Novice

Is it more advanced, or is it helping define the query better?

Search Query Problems

• Vocabulary - Two people are unlikely to use the same word to describe the same thing… (source: Google)
• Operators - Most people don't know how to use search engine operators ("Only 16% of participants used quotation marks, many incorrectly", source: Hargittai)
• Query length - The average query was 2.6 words long in 2001, up from 2.4 words in 1997 (source: Google)
• Boolean operators - People don't understand Boolean logic (AND, OR)! (source: Google)
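To make the Boolean point concrete, here is a minimal sketch of how AND and OR behave over a toy inverted index. AND narrows the result set and OR widens it, which is the distinction many users reportedly get backwards. The documents and function names are illustrative assumptions and do not reflect how any particular search engine is implemented.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, terms):
    """Documents containing every term (intersection)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, terms):
    """Documents containing at least one term (union)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


# Hypothetical mini-corpus.
docs = {
    1: "wireless networking market report",
    2: "wireless headphones review",
    3: "networking events calendar",
}
index = build_index(docs)
and_hits = search_and(index, ["wireless", "networking"])
or_hits = search_or(index, ["wireless", "networking"])
print("AND:", sorted(and_hits))
print("OR: ", sorted(or_hits))
```

With this corpus the AND query matches only document 1, while the OR query matches all three.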

Search Query Length (source: Yahoo)

SE Commands, Operators & Shortcuts

• Google: more than 60 (http://www.google.com/intl/en/help/features.html)
• Yahoo: more than 60 (http://help.yahoo.com/help/us/ysearch/basics/basics-04.html)
• How many people use them?
• How many people are aware of them?
• Are they useful?

Advanced Search?

Used by less than 10%!

Invisible Web / Hidden Web

• Deep Web (or Deepnet, invisible Web or hidden Web) refers to World Wide Web content that is not part of the surface Web indexed by search engines. (source: Wikipedia)
• Includes: dynamic content, unlinked content, limited-access content, scripted content, non-text content
• More than 500 times as much information as traditional search engines "know about" is available in the deep Web (source: Computerworld)

Invisible Web / Hidden Web (continued)

Misleading & Spam Content

• Spam, adware
• People add unrelated terms, use multiple domains, link farms, guestbook bots
• Cloaking - Also known as stealth, a technique used by some Web sites to deliver one page to a search engine for indexing while serving an entirely different page to everyone else
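One way to illustrate cloaking is to compare the page a site serves to a crawler with the page it serves to a regular browser and flag large differences. The sketch below assumes both HTML strings have already been fetched by some other means. The Jaccard word-overlap measure and the 0.5 threshold are illustrative choices, not a description of how any search engine actually detects cloaking.

```python
import re


def word_set(html):
    """Strip tags crudely and return the set of lowercase words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(crawler_html, browser_html, threshold=0.5):
    """Flag the pair if the two versions share too little vocabulary.

    The threshold is an arbitrary assumption for this sketch.
    """
    similarity = jaccard(word_set(crawler_html), word_set(browser_html))
    return similarity < threshold


# Hypothetical pages: what the crawler sees vs. what a visitor sees.
crawler_page = "<html><body>cheap flights hotel deals travel insurance</body></html>"
browser_page = "<html><body>win a prize click here now</body></html>"
same_page = "<html><body>cheap flights hotel deals travel insurance</body></html>"

print(looks_cloaked(crawler_page, browser_page))
print(looks_cloaked(crawler_page, same_page))
```

Legitimate pages also vary between requests (ads, personalization, timestamps), so a real check would need a more forgiving comparison than this one.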

Spam Content (1)

Source: http://www.yr-bcn.es/webspam/

Black nodes are spam, white nodes are non-spam.

The corpus consists of 77 million pages from 12,000 hosts. These pages have been annotated at the level of hosts. Over 3,000 hosts have been manually labelled by at least two human judges as "Spam", "Not Spam" or "Borderline".

Spam Content (2)

Source: http://www.yr-bcn.es/webspam/

Red nodes are spam, blue nodes are normal, and green nodes are normal pages with spam content.

It is composed of a connected graph of 5,000 Web pages and is labeled at the page level. Each Web page is labeled as "Spam", "Not Spam" or "Borderline". The last category corresponds to Web pages where the content is only partially spam, for example blog spam pages.

Misleading Content Agency

Unsafe Content (2006)

1- "Red" rated sites failed SiteAdvisor's safety tests. Examples are sites that distribute adware, send a high volume of spam, or make unauthorized changes to a user's computer.

2 - "Yellow" rated sites engage in practices that warrant important advisory information based on SiteAdvisor's safety tests. Examples are sites which send a high volume of "non-spammy" email, display many popup ads, or prompt a user to change browser settings.

Source: http://www.siteadvisor.com/studies/search_safety_dec2006

Search Engines Challenge

• Can we trust SE ranking?
• How to handle misleading & spam information?
• How to use advanced features; NLP is not enough

Search User Interface

AltaVista, 1995

Google, 1998

Google, 2007

KartOO, 2007

Advanced UI???

Search UI Challenge

• Search engine UIs didn't change much in the last 10 years (the web did change…).
• Search engine UIs do not reflect what is known about user behavior.
• 1,000,000… results, but only 30 are currently useful.
• Too much noise!

Search Engines Trends

Search Engines Statistics

Total: 97.71%

2.29% left for all the others

Search Trends

• Vertical
• Content
• Clustering
• Visualization
• Improved UI
• Tailored
• Assisted Search
• Community
• Factual QA
• Experts
• Meta
• Language

Clusty, 2007 (clustering)

Grokker, 2007 (clustering + visualization)

Rollyo, 2007 (tailor-made search)

Yahoo Search Builder, 2007 (tailor-made search)

MetaCrawler, 2007 (combined search)

ChaCha, 2007 (expert/community search)

Trexy, 2007 (strategies)

Snap, 2007 (improved UI)

SearchMash, 2007 (Google playground)

Swiki, 2007 (social search)

Pure Video, 2007 (content oriented)

Pure Video, 2007 (subject oriented/vertical)

Yahoo Answers (Q&A)

Kasamba (experts)

Baidu (language)

Search Trends (recap): vertical, content, clustering, visualization, improved UI, tailored, assisted search, community, factual QA, experts, meta, language

Search Trends Challenge

• How do we combine all the relevant features together without complicating the user interface?
• Will Google add more advanced features?

ResearchTrail

ResearchTrail Alpha Application

http://www.ResearchTrail.com

References & QA

References

• Article in YNET - http://www.ynet.co.il/articles/0,7340,L-3338735,00.html
• Pandia - http://www.pandia.com/
• ResearchBuzz - http://www.researchbuzz.org/wp/
• Prolog (Hebrew) - http://www.i-zm.info/

More Examples…

Thank You!

David Rashty (david.rashty@gmail.com)