CS315-Web Search & Data Mining

A Semester in 50 minutes or less

The Web
  History
  Key technologies and developments
  Its future

Information Retrieval (IR)
  How do you find the information you need, fast?

IR on the Web
  Web crawling and indexing
  Link analysis
  Quality of information

Introduction to “The Social Web”
  Blogs, Twitter, FB, …
  Social networks

Web’s Search Engines

What are they?
How did they start?
How do they work?
What do they really do?
How do they make money?

Should I care about privacy?
How high is the quality of their results?
Can they be improved?

[Figure: search results page with two labeled regions, organic results and paid results]

Problems of Search and Mining

The Web poses a number of difficulties
  Populist medium
  Abundance and authority problem
  Uniform access
  Data with little structure

The Web: A populist medium

Anyone can be an author!
  # of writers ~= # of readers
  Because both ~= # of online members

The evolution of memes
  Memes: ideas, theories, etc., that spread from person to person by imitation
  Now more easily spread via the Web

Easier to connect to people with similar interests
  Gave rise to a plethora of online social networks

Abundance of information

Liberal and informal culture of content generation and dissemination
Redundancy
Non-standard form and content
Millions of qualifying pages for broad queries
  E.g.: java, kayaking, panther

No authoritative information about the reliability or trustworthiness of content on a site
  Your favorite urban legend?

Problems from uniform access

Little support for adapting to the background of specific users
  Does your grandmother surf and search the Web as easily as you do?
  Personalized search might help (somewhat)

Commercial interests routinely influence the operation of Web search
  “Search Engine Optimization”
  AdSense

(Lack of) Structured Information

Hypertext refers to the ability to click and link, not to the structure of data

Semi-structured or unstructured
  No schema (precise description of data)

Large number of attributes
  Each word is a potential feature
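To make "each word is a potential feature" concrete, here is a minimal sketch of a bag-of-words representation. The tokenizer, the three sample texts, and the helper name `bag_of_words` are assumptions chosen for illustration; they are not part of the course material.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical sample documents (illustration only).
docs = [
    "Java is a programming language",
    "Java is an island in Indonesia",
    "Kayaking on the river",
]

bags = [bag_of_words(d) for d in docs]

# Every distinct word becomes one attribute (one column).
vocabulary = sorted(set().union(*bags))

# Each document becomes a vector with one count per vocabulary word.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(len(vocabulary), "features for", len(docs), "short documents")
```

Even three one-line documents already produce more features than documents, and most entries in each vector are zero. With no schema to restrict the attributes, the feature set grows with the vocabulary.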

Major topics to cover

History of the Web
Relevant network protocols
Search engines and directories
Clustering and classification
Hyperlink analysis
Measuring and modeling the Web
Quality of information
Social networks
The future of the Web

Reading for next time

Vannevar Bush: “As We May Think”

Tim Berners-Lee:
  Chapters 1 (Enquire within) & 2 (Tangles, Bits, Webs)

Find online and watch the “now-famous video, which [TBL] didn’t see until 1994”
  Make notes of your actions to find the video

A few more details

Search Engines: Crawling, Indexing, Ranking

Crawl: quickly fetch a large number of Web pages into a local repository
Index: based on keywords
Rank: order responses to maximize the chance that the first few satisfy the user’s information need
Early search engines: WebCrawler, Lycos (1994)

Search engines from the beginning
  Successful, even with the difficulties described
  Started as university research projects with small infrastructure, yet eminently useful
  Based in part on traditional IR techniques
  Had interesting ideas that are still useful
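The index and rank steps can be sketched in a few lines. This is a toy illustration, not how any particular search engine works: the `pages` dictionary stands in for the repository a crawler would have filled, the URLs and texts are made up, and ranking by raw keyword count is an assumed, deliberately naive scoring rule.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase keyword tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Inverted index: keyword -> {url: number of occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index

def search(index, query):
    """Return URLs containing every query word, best score first."""
    words = tokenize(query)
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    # Naive score (assumption): total occurrences of the query words.
    scores = {url: sum(index[w][url] for w in words) for url in candidates}
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical "crawled" repository (illustration only).
pages = {
    "a.example/java": "Java coffee and Java island travel",
    "b.example/lang": "The Java programming language",
    "c.example/kayak": "Kayaking travel guide",
}

index = build_index(pages)
print(search(index, "java"))         # pages mentioning "java", most mentions first
print(search(index, "java travel"))  # pages mentioning both words
```

A broad one-word query such as "java" matches pages about unrelated topics, which is the abundance problem in miniature. Keyword counts alone cannot tell which page is authoritative.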

Web directories

Yahoo! directory to locate useful Web sites

Efforts for organizing knowledge into ontologies
  Centralized: Yahoo!
  Decentralized: About.com, the Open Directory Project (dmoz)

Clustering and classification

Clustering
  Discover groups in a set of documents such that documents within a group are more similar to each other than to documents in other groups
  Subjective disagreements due to
    Different similarity measures
    Large feature sets

Classification
  For assisting human efforts in maintaining taxonomies (topic directories)
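A small sketch shows how the choice of similarity measure and threshold changes the groups that are found. It assumes cosine similarity over word counts and a naive single-pass grouping rule (each document joins the first group whose first member is similar enough). Both are illustrative choices, as are the sample texts and function names.

```python
import math
import re
from collections import Counter

def bag(text):
    """Word-count vector for a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(bags, threshold):
    """Naive single-pass grouping; returns lists of document indices."""
    groups = []
    for i, b in enumerate(bags):
        for group in groups:
            if cosine(bags[group[0]], b) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    return groups

# Hypothetical documents (illustration only).
bags = [bag(t) for t in (
    "java programming language",
    "java language compiler",
    "river kayaking trip",
)]

print(cluster(bags, threshold=0.5))  # looser threshold: the two Java documents merge
print(cluster(bags, threshold=0.7))  # stricter threshold: every document stands alone
```

The same three documents yield different clusterings under two thresholds, which is one source of the subjective disagreements noted above.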

(Hyper)Link Analysis

Traditional IR insufficient
  Short queries
  Abundance and authority problems

Take advantage of the structure of the Web graph
  Indicators of prestige of a page (e.g., citations)
  HITS & PageRank
  Anchor text

Bibliometry
  Bibliographic citation graph of academic papers
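As one example of a link-based prestige indicator, here is a minimal sketch of the PageRank idea computed by repeated iteration. The four-page graph, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions made for illustration; production systems differ in many details.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate PageRank for a dict of page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical Web graph (illustration only): page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked from three pages and ends up with the highest score, while D, which nobody links to, keeps only the base share. This is the citation intuition from bibliometry applied to hyperlinks.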

Measuring and Modeling the Web

Useful to better understand the structure of the Web

Can we characterize the Web?
  Distribution of hyperlinks per page
  Patterns of linkage within topic communities
  Path lengths between pages

Can we build a generative model with the same characteristics?
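Two of the measurements listed above, the distribution of links per page and the path length between pages, can be sketched on a toy graph. The graph and the function names are hypothetical; the sketch counts incoming links and uses breadth-first search for the shortest click path.

```python
from collections import Counter, deque

def in_degree_distribution(links):
    """Map each in-degree value to the number of pages that have it."""
    in_degree = Counter({page: 0 for page in links})
    for targets in links.values():
        for target in targets:
            in_degree[target] += 1
    return Counter(in_degree.values())

def path_length(links, start, goal):
    """Fewest clicks from start to goal, or None if goal is unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, distance = queue.popleft()
        if page == goal:
            return distance
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, distance + 1))
    return None

# Hypothetical Web graph (illustration only): page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

print(in_degree_distribution(links))  # how many pages have 0, 1, 3 in-links
print(path_length(links, "D", "B"))   # clicks needed to get from D to B
print(path_length(links, "A", "D"))   # None: no page links to D
```

A generative model of the Web would be judged by whether graphs it produces show the same kinds of distributions and path lengths as measurements like these taken on real crawls.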

Structured vs Web data mining

Traditional data mining
  Data is structured and relational
  Well-defined tables, columns, rows, keys, and constraints

Web data
  Readily available data, rich in features and patterns
  Spontaneous formation and evolution of
    topic-induced graph clusters
    hyperlink-induced communities

Our goal: to discover patterns that are spontaneously driven by semantics.