
Characterizing the Web

CSCI 572: Information Retrieval and Search Engines

Summer 2010


May-20-10 CS572-Summer2010 CAM-2

Outline

• The web
– Scale
– Complexity
– Growth
• Differences between then and now
• Where the web is headed


The Web

• Massive-scale directed graph
• Driven by the underlying REST architecture
– The key abstraction of information is a resource, named by a URL.
– The representation of a resource is a sequence of bytes, plus representation metadata to describe those bytes.
– All interactions are context-free: each interaction contains all of the information necessary to understand the request.
– Components perform only a small set of well-defined methods on a resource, producing a representation to capture the current or intended state of that resource, and transfer that representation between components.
– Representation metadata are encouraged in support of caching and representation reuse.
– The presence of intermediaries is promoted.

Copyright © Richard N. Taylor, Nenad Medvidovic, and Eric M. Dashofy. All rights reserved.
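The REST constraints above can be illustrated with a small sketch. It is not a real HTTP server: `Representation`, `RESOURCES`, `handle`, and the example URL are hypothetical names chosen for illustration, and the request is modeled as a plain dictionary. The point is only that the handler is context-free (it reads nothing but the request itself) and that representation metadata such as an entity tag lets a client reuse a cached copy.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class Representation:
    """A sequence of bytes plus metadata describing those bytes."""
    body: bytes
    metadata: dict = field(default_factory=dict)


# Hypothetical in-memory store: each resource is named by a URL.
RESOURCES = {
    "http://example.org/courses/csci572": Representation(
        body=b"<h1>CSCI 572</h1>",
        metadata={"Content-Type": "text/html", "ETag": '"v1"'},
    ),
}


def handle(request: dict) -> Tuple[int, Optional[Representation]]:
    """Context-free handler: everything it needs is inside `request`.

    No session state is kept between calls.
    """
    if request["method"] != "GET":
        return 405, None
    rep = RESOURCES.get(request["url"])
    if rep is None:
        return 404, None
    # Representation metadata supports caching: if the client already
    # holds the current version, no body needs to be transferred.
    if request.get("headers", {}).get("If-None-Match") == rep.metadata["ETag"]:
        return 304, None
    return 200, rep
```

Because each call is self-contained, an intermediary such as a cache or proxy could answer the second, conditional request without the origin component keeping any memory of the first.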


Scale

• http://www.worldwidewebsize.com/

GYBA = Sorted on Google, Yahoo!, Bing and Ask
YGBA = Sorted on Yahoo!, Google, Bing and Ask


How is the scale measured?

• Number of web pages indexed by search engines?
– Is this an accurate representation?
• Published data from major ISPs?
– Is this accurate information?
• What’s missing?
– The “deep” web, or dynamic pages
– Pages behind security firewalls


Why is scale important?

• Scale strongly influences the ultimate use cases of the web
– Discovery and retrieval of information via:
• Search engines
• Web services and grid computing
• Targeted communities like social networking and the growing field of analytics
• Scale strongly influences the way we build software for web-scale systems
– New programming paradigms, e.g., MapReduce
– New technologies to handle huge-scale computing, or “Big Data”
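As a rough illustration of the MapReduce paradigm mentioned above, here is a single-process word-count sketch. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and nothing here is distributed; a real framework would run the map and reduce steps on many machines and do the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all values emitted for one key into a single result."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
    return dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
```

The map and reduce functions share no state, which is what would let a framework spread them across many workers as the input grows.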


Complexity


Proliferation of content types available

• By some accounts, 16K to 51K content types*
• What to do with content types?
– Parse them
• How?
• Extract their text and structure
– Index their metadata
• In an indexing technology like Lucene, Solr, or Compass, or in Google Appliance
– Identify what language they belong to
• N-grams

*http://filext.com/
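The n-gram idea for language identification can be sketched as follows. This is a toy, assuming character trigrams and cosine similarity as the comparison (the slide names only "n-grams", not a specific method); the two one-sentence training samples and the names `ngrams`, `cosine`, `PROFILES`, and `identify` are hypothetical. A usable identifier would need far more training text per language.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Count character n-grams, padding with spaces to mark word edges."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy training samples, for illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the web is growing",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y la red está creciendo",
}
PROFILES = {lang: ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to `text`."""
    profile = ngrams(text)
    scores = {lang: cosine(profile, ref) for lang, ref in PROFILES.items()}
    return max(scores, key=scores.get)
```

For example, `identify("the dog is over the web")` picks the English profile because most of its trigrams occur in the English sample and almost none occur in the Spanish one.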


Growth

• Steady growth on a logarithmic scale since the mid-1990s
• Well into the hundreds of millions of websites and tens of billions of web pages (even without the deep web)
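"Steady growth on a logarithmic scale" means the size is multiplied by a roughly constant factor per unit of time, so it looks like a straight line once you take logarithms. A minimal sketch, using made-up numbers (a starting size of one million and a yearly doubling) that are assumptions for illustration and not measurements of the web:

```python
import math


def log10_series(start, factor, years):
    """log10 of a size that is multiplied by `factor` every year."""
    return [math.log10(start * factor ** t) for t in range(years)]


# Hypothetical numbers for illustration only, not measurements.
logs = log10_series(start=1_000_000, factor=2, years=5)

# On the log scale every yearly step has the same height, log10(factor),
# even though the absolute increase gets larger each year.
steps = [later - earlier for earlier, later in zip(logs, logs[1:])]
```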


What does growth mean to us (you)?

• Need for efficient algorithms for all sorts of things
– Mining the web for information on you to target ads
– Mining the web for information on you to decide whether to hire you or not
– Disseminating news effectively (to you)
– Disseminating media effectively (to you)
– Providing rich browser experiences to lure you to web sites so that you can be sold products
• NOTE: I underlined "you" everywhere above, for those who missed it. We’ll get back to this.


The Web: Then and Now

• Before
– The purpose of the web was for geeks to exchange email, post on bulletin boards about their favorite D&D games, and send files to one another
– Scope was limited to geeks; broad infection was many years away
– Search* since 1996: Hotbot, Excite, WebCrawler, AskJeeves, Yahoo!, Google, DogPile, Altavista, Lycos, MSN Search, AOL Search, Infoseek, Netscape, Metacrawler, AllTheWeb

*http://sixrevisions.com/web_design/popular-search-engines-in-the-90s-then-and-now/


The Web: Then and Now

• Now
– The purpose is limitless
• Computation with services, semantic description of content, proliferation of content, rich browsers, clients, interaction, media
• Social web is the next big thing
– Scope is everyone (I kid you not, from a 2-year-old on up)
– Search* now: Google, with competitors like Yahoo! and Bing bringing up the rear and trying to build out open source computational infrastructures to compete

*http://sixrevisions.com/web_design/popular-search-engines-in-the-90s-then-and-now/


The movement towards the social web

• Social networking companies have figured out that mining info about you guys can help build the “semantic” information that was once dreamed about by the likes of Tim Berners-Lee in his 2001 Scientific American article
• Why did the semantic web fail to gain acceptance while the social web has succeeded?
– The realization that machines are poor annotators of information and that they are even worse at establishing trust
– And that you guys are the experts at this!


Social Web and “Big Data”

• Many challenges induced by the complexity, scale, and growth of the traditional web are only increased when the social web is taken into account
• The development of algorithms to crawl the social graph has led to several Ph.D.s, and those algorithms are huge money makers for existing businesses
– Analytics is what they call this nowadays
• Search is a HUGE challenge and an interesting research problem within the social web
– Instead of using information retrieval to deduce a “rank” for a page, use the trust value assigned via your social graph
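One way to picture trust-based ranking is the sketch below. It is an assumption-laden toy, not a method described in these slides: trust is taken to halve with each hop away from you in the social graph (the `decay` parameter), a page's score is the summed trust of the people who endorsed it, and the names `trust_from`, `rank_pages`, `follows`, and `endorsements` are all hypothetical.

```python
from collections import deque


def trust_from(user, follows, decay=0.5):
    """Breadth-first walk of the social graph from `user`.

    Trust starts at 1.0 and is multiplied by `decay` at each hop.
    """
    trust = {user: 1.0}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        for friend in follows.get(current, ()):
            if friend not in trust:  # the first (shortest) path wins
                trust[friend] = trust[current] * decay
                queue.append(friend)
    return trust


def rank_pages(user, follows, endorsements):
    """Order pages by the summed trust of the people who endorsed them."""
    trust = trust_from(user, follows)
    scores = {}
    for person, pages in endorsements.items():
        for page in pages:
            # People outside the user's social graph contribute no trust.
            scores[page] = scores.get(page, 0.0) + trust.get(person, 0.0)
    # Highest score first; ties are broken alphabetically by page name.
    return sorted(scores, key=lambda page: (-scores[page], page))
```

With `follows = {"you": ["ann"], "ann": ["bob"]}`, a page endorsed by ann outranks one endorsed by bob, and a page endorsed only by a stranger comes last, regardless of how the pages would score on content alone.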


Wrapup

• The web has changed dramatically in the last 10 years
• Understand the different dimensions of the web and the variation points
– Scale, complexity, and growth are only a select few
• Understand where the web is going and why