Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1


Web Mining

By-

Pawan Singh

Piyush Arora

Pooja Mansharamani

Pramod Singh

Praveen Kumar


Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions


Four Problems

• Finding relevant information
  o Low precision: many of the search results are irrelevant, which makes the relevant information hard to find.
  o Low recall: not all of the information available on the Web can be indexed, which makes relevant but unindexed information hard to find.
• Creating new knowledge out of the information available on the Web
  o While the problem above is a query-triggered (retrieval-oriented) process, this problem is a data-triggered process.
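The precision and recall problems named on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name `precision_recall` and the document ids are assumptions, not part of the original slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents returned by the search system
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 2 of them relevant,
# while 5 relevant documents exist in total.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d5", "d6", "d7"})
print(p, r)
```

Low precision shows up as a small first value (many returned documents are irrelevant); low recall shows up as a small second value (many relevant documents were never returned, for instance because they were never indexed).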


• Personalizing the information
  o Catering to personal preferences in content and presentation (associated with the type and presentation of the information)
• Learning about the consumers
  o What does the customer want to do?
  o Using web data to effectively market products and/or services


Other Approaches

• Web mining is NOT the only approach
• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
  o In-depth syntactic and semantic analysis
• Web document community
  o Standards, manually appended meta-information, maintained directories, etc.


Direct vs. Indirect Web Mining

Web mining techniques can be used to solve the information overload problems:
• Directly
  o Attack the problem with web mining techniques
  o E.g. a newsgroup agent classifies news as relevant
• Indirectly
  o Used as part of a bigger application that addresses the problems
  o E.g. used to create index terms for a web search service


The Research

Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning)

Focusing on research from the machine learning point of view


Web Mining: Definition

“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
• Can be viewed as four subtasks
• Not the same as Information Retrieval
• Not the same as Information Extraction


Web Mining: Subtasks

• Resource finding
  o Retrieving intended documents
• Information selection/pre-processing
  o Select and pre-process specific information from retrieved web resources
• Generalization
  o Discover general patterns within and across web sites
• Analysis
  o Validation and/or interpretation of mined patterns


Web Mining: Not IR

Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible

Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)


Web Mining: Not IE

• Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
• IE systems for the general Web are not feasible
• Most focus on specific Web sites or content


IE - IR

Information Retrieval
• Automatic retrieval of relevant documents
• Primary goals:
  o Indexing text
  o Searching for useful documents in a collection
• Views text as a “bag of unordered words”
• The “Web document classification” task is an instance of IR

Information Extraction
• Extract relevant facts from documents
• Primary goals:
  o Transform a collection of retrieved documents into information
  o Capture the structure or representation of a document
• IE has a higher level of granularity
• Result:
  o Structured database
  o Compression or summary of text or documents
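The “bag of unordered words” view attributed to IR above can be illustrated in a few lines. This is a minimal sketch; the helper name `bag_of_words` and the simple tokenisation rule are assumptions made for the example, not something the slides specify.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


a = bag_of_words("web mining finds patterns in web data")
b = bag_of_words("data in web patterns finds mining web")

# Word order is discarded, so both sentences give the same bag.
print(a == b)
print(a["web"])
```

Because only the word counts survive, this representation is enough for indexing and searching, but not for IE, which needs to know how the words relate to each other inside the document.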


Types of IE

IE from unstructured texts (classical)
• Unstructured: free text, e.g. news stories
• Basic to deep linguistic pre-processing

IE from semi-structured texts (structural)
• Semi-structured: HTML
• Uses meta-information, e.g. HTML tags
• Wrapper induction: machine learning used to build systems (semi-)automatically
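To show how HTML tags can serve as meta-information for structural IE, here is a small hand-written wrapper built on Python's standard `html.parser` module. It is only a sketch: the class name `RowExtractor`, the sample page, and the field names `course` and `time` are hypothetical, and the wrapper is written by hand rather than induced by machine learning as the slide describes.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Hand-written wrapper: collect the text of each <td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Hypothetical semi-structured page
page = """
<table>
  <tr><td>Data Mining</td><td>Mon 10:00</td></tr>
  <tr><td>Web Mining</td><td>Wed 14:00</td></tr>
</table>
"""

parser = RowExtractor()
parser.feed(page)
records = [{"course": row[0], "time": row[1]} for row in parser.rows]
print(records)
```

The output is a list of structured records, which is the kind of result IE aims for. A wrapper like this only works for pages that share the assumed layout, which is why most IE systems focus on specific sites.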


Web Mining and Machine Learning

• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT learning from the Web.
• Some applications of machine learning on the web are NOT Web Mining
• Methods used for Web Mining are NOT limited to machine learning
• There is a close relationship between web mining and machine learning


Web Mining and Machine Learning

• Machine learning techniques support web mining, as they can be applied to the processes in web mining.
• For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques.
• In short, web mining intersects with the application of machine learning on the web.
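As one illustration of machine learning applied to text classification, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing. The slides do not name a specific algorithm, so the choice of Naive Bayes, the function names `train` and `classify`, and the toy training data are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (text, label). Returns the counts used by classify()."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids log(0) for unseen words
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training documents
training = [
    ("football match score goal", "sports"),
    ("team wins the cup final", "sports"),
    ("new processor chip released", "tech"),
    ("software update fixes bug", "tech"),
]
model = train(training)
print(classify("the team scored a goal", model))
print(classify("chip software bug", model))
```

The classifier learns word statistics from labelled examples instead of relying on hand-tuned matching rules, which is the kind of improvement over traditional IR techniques that the slide refers to.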


Web Mining Categories

• Web Content Mining
  o Discovering useful information from web contents/data/documents.
• Web Structure Mining
  o Discovering the model underlying link structures (topology) on the Web.
  o E.g. discovering authorities and hubs
• Web Usage Mining
  o Making sense of data generated by surfers
  o Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
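The "authorities and hubs" example under Web Structure Mining refers to the idea behind the HITS algorithm: a good hub links to good authorities, and a good authority is linked to by good hubs. The sketch below is a simplified version under stated assumptions; the function name `hits`, the fixed iteration count, and the toy link graph are hypothetical.

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of pages linked to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: three pages linking to content pages a, b, c
graph = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "fan": ["a"]}
hub, auth = hits(graph)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the page with the most outgoing links to well-cited pages gets the top hub score, and the page cited by every hub gets the top authority score.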


Web Content Data Structure

• Unstructured: free text
• Semi-structured: HTML
• More structured: table- or database-generated HTML pages
• Multimedia data: receives less attention than text or hypertext


Web Structure Mining

• Interested in the structure between Web documents (not within a document)
• Example: PageRank (Google)
• Applications:
  o Discovering micro-communities in the Web
  o Measuring the “completeness” of a Web site
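The PageRank example can be sketched as a short power iteration over a link graph. This is a simplified textbook version, not a description of any production system: the function name `pagerank`, the damping value of 0.85 (a commonly used default), and the three-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to its targets
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical three-page site
web = {"home": ["about", "docs"], "about": ["home"], "docs": ["home", "about"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The ranks depend only on the links between documents, never on their content, which is what distinguishes structure mining from content mining.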


Web Usage Mining

• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  o Web client data
  o Proxy server data
  o Web server data
• Two common approaches
  o Map usage data into relational tables before using adapted data mining techniques
  o Use log data directly by utilizing special pre-processing techniques
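The first approach, mapping usage data into relational tables, can be sketched with Python's standard `re` and `sqlite3` modules. The log lines, the assumption that they follow the Common Log Format, and the table name `hits` are all hypothetical choices made for this example.

```python
import re
import sqlite3

# Hypothetical web server log lines, assumed to be in Common Log Format
LOG = """\
10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120
10.0.0.1 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 5120
10.0.0.1 - - [10/Oct/2023:13:58:40 +0000] "GET /missing.html HTTP/1.1" 404 312
"""

PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)'
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (ip TEXT, ts TEXT, method TEXT, url TEXT, status INTEGER)"
)
for line in LOG.splitlines():
    match = PATTERN.match(line)
    if match:  # skip lines that do not fit the assumed format
        ip, ts, method, url, status, _size = match.groups()
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (ip, ts, method, url, int(status)),
        )

# Once the data is relational, ordinary queries apply
top_pages = conn.execute(
    "SELECT url, COUNT(*) AS n FROM hits WHERE status = 200 "
    "GROUP BY url ORDER BY n DESC, url"
).fetchall()
print(top_pages)
```

After this step, standard data mining techniques can be run against the table; the second approach on the slide would instead work on the raw log lines with dedicated pre-processing.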


Thank you!
