
Web Mining

By-

Pawan Singh

Piyush Arora

Pooja Mansharamani

Pramod Singh

Praveen Kumar

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions


Four Problems

• Finding relevant information
o Low precision, which is due to the irrelevance of many of the search results. This makes it difficult to find the relevant information.
o Low recall, which is due to the inability to index all the information available on the web. This makes it difficult to find unindexed information that is relevant.

• Creating new knowledge out of available information on the web
o While the problem above is a query-triggered process (retrieval oriented), this problem is a data-triggered process.


• Personalizing the information
o Catering to personal preferences in the type and presentation of the information

• Learning about the consumers
o What does the customer want to do?
o Using web data to effectively market products and/or services
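The low-precision and low-recall problems named under "Finding relevant information" can be made concrete with a small calculation. This is only an illustrative sketch: the document identifiers and the function name precision_recall are hypothetical and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns 4 pages, 2 of them relevant,
# while 5 relevant pages exist on the web (3 were never indexed).
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Irrelevant results lower the first number; relevant pages missing from the index lower the second.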

Other Approaches

Web mining is NOT the only approach:
• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
o In-depth syntactic and semantic analysis
• Web document community
o Standards, manually appended meta-information, maintained directories, etc.


Direct vs. Indirect Web Mining

Web mining techniques can be used to solve the information overload problems:
• Directly
o Attack the problem with web mining techniques
o E.g. a newsgroup agent classifies news as relevant
• Indirectly
o Used as part of a bigger application that addresses the problems
o E.g. used to create index terms for a web search service
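As an illustration of the indirect case, index terms for a search service could be chosen by scoring words with TF-IDF. The slides do not name a method, so TF-IDF is an assumption here, and the page names, texts, and the helper index_terms are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def index_terms(docs, k=2):
    """Pick the k highest-scoring TF-IDF terms of each document."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    result = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        scores = {
            t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in tf.items()
        }
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Hypothetical pages, for illustration only.
pages = {
    "page_a": "web mining discovers patterns in web data",
    "page_b": "search engines index web pages",
    "page_c": "mining usage logs reveals user patterns",
}
terms = index_terms(pages)
```

Words that appear in many pages (such as "web") score low, so the selected terms are the ones that distinguish a page. A real service would also remove stop words such as "in".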


The Research

• Converging research from database, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• Focusing on research from the machine learning point of view


Web Mining: Definition

“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
• Can be viewed as four subtasks
• Not the same as Information Retrieval
• Not the same as Information Extraction


Web Mining: Subtasks

• Resource finding
o Retrieving intended documents
• Information selection/pre-processing
o Select and pre-process specific information from retrieved web resources
• Generalization
o Discover general patterns within and across web sites
• Analysis
o Validation and/or interpretation of mined patterns
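The four subtasks can be sketched as a toy pipeline. Everything below is an assumption made for illustration: the in-memory WEB dictionary stands in for real crawling, the URLs are placeholders, and "terms shared by several pages" is just one simple kind of pattern.

```python
import re
from collections import Counter

# Hypothetical in-memory "web": URL -> HTML. A real system would crawl.
WEB = {
    "http://example.org/a": "<html><body><h1>Cheap flights</h1>"
                            "<p>Book cheap flights to Rome.</p></body></html>",
    "http://example.org/b": "<html><body><p>Cheap hotels and flights in Rome."
                            "</p></body></html>",
    "http://example.org/c": "<html><body><p>Recipe for apple pie.</p>"
                            "</body></html>",
}


def find_resources(web, keyword):
    """Resource finding: retrieve the documents that mention the keyword."""
    return {url: html for url, html in web.items() if keyword in html.lower()}


def preprocess(html):
    """Selection/pre-processing: strip tags, lowercase, tokenize."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())


def generalize(token_lists, min_support=2):
    """Generalization: terms that occur in at least min_support documents."""
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: n for t, n in df.items() if n >= min_support}


def analyze(patterns, stop_words=("and", "in", "to", "for")):
    """Analysis: keep only the patterns that look interpretable."""
    return sorted(t for t in patterns if t not in stop_words)


docs = find_resources(WEB, "flights")
tokens = [preprocess(html) for html in docs.values()]
patterns = analyze(generalize(tokens))
print(patterns)  # ['cheap', 'flights', 'rome']
```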


Web Mining: Not IR

Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible

Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)


Web Mining: Not IE

• Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
• IE systems for the general Web are not feasible
• Most focus on specific Web sites or content


IE - IR

Information Retrieval
• Automatic retrieval of relevant documents
• Primary goals:
o Indexing text
o Searching for useful documents in a collection
o Views a document as a “bag of unordered words”
o The “Web document classification” task is an instance of IR

Information Extraction
• Extract relevant facts from documents
• Primary goals:
o Transform a collection of retrieved documents into information
o Concerned with the structure or representation of a document
o IE has a higher level of granularity
• Result:
o Structured database
o Compression or summary of text or documents
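The contrast between the two views can be shown on a single sentence. This is a minimal sketch: the sentence, the Boolean matching function, and the hand-written extraction pattern are all hypothetical and only cover this one sentence form.

```python
import re
from collections import Counter

doc = "Acme Corp appointed Jane Doe as CEO on 12 March 2004."

# IR view: the document is a bag of unordered words used for matching.
bag = Counter(re.findall(r"[a-z0-9]+", doc.lower()))


def matches(query, bag):
    """True if every query word occurs in the bag (Boolean retrieval)."""
    return all(word in bag for word in query.lower().split())


# IE view: pull specific facts out of the text into a structured record.
# The pattern is a hand-written template, not a general extractor.
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+(?: [A-Z]\w+)*) appointed "
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) as (?P<role>[A-Z]+)"
)


def extract(text):
    """Return a dict of extracted facts, or None if the template fails."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None


record = extract(doc)
print(matches("ceo acme", bag))  # True
print(record)
```

IR only decides whether the document is relevant to a query; IE turns the same text into fields that could be stored as a row in a database.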


Types of IE

IE from unstructured texts (classical)
• Unstructured: free text, e.g. news stories
• Basic to deep linguistic pre-processing

IE from semi-structured texts (structural)
• Semi-structured: HTML
• Uses meta-information, e.g. HTML tags
• Wrapper induction: machine learning is used to build systems (semi-)automatically
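For the semi-structured case, a wrapper uses HTML tags as meta-information to locate the data. The sketch below is a hand-written wrapper built on Python's standard html.parser module; wrapper induction would aim to learn such rules from examples rather than hard-code them. The class name TableWrapper and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being read, if any
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain no <td> and are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page fragment, for illustration only.
PAGE = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Data Mining</td><td>$40</td></tr>
  <tr><td>Web Search</td><td>$35</td></tr>
</table>
"""

wrapper = TableWrapper()
wrapper.feed(PAGE)
records = [{"title": t, "price": p} for t, p in wrapper.rows]
```

The wrapper only works for pages that share this table layout, which is why such systems usually target specific Web sites.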

Web Mining and Machine Learning

• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT learning from the Web.
• Some applications of machine learning on the web are NOT Web Mining
• Methods used for Web Mining are NOT limited to machine learning
• There is a close relationship between web mining and machine learning


• Machine learning techniques support web mining because they can be applied to the processes in web mining.

• For example, recent research shows that applying machine learning techniques can improve text classification compared with traditional IR techniques.

• In short, web mining intersects with the application of machine learning on the web.
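One way to picture machine learning applied to text classification is a Naive Bayes classifier trained on labeled examples. The slides do not say which learning method the cited research used, so Naive Bayes is an assumption, and the training texts and labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are ignored.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(class) + sum of log P(word | class)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, for illustration only.
train_texts = [
    "new web mining paper on link analysis",
    "search engine ranking and web crawling",
    "cheap watches buy now",
    "win money now click here",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = NaiveBayes().fit(train_texts, train_labels)
label = model.predict("web crawling and link mining")
print(label)  # relevant
```

The classifier learns its word weights from the examples instead of relying on a fixed query, which is the sense in which learning supports the mining process.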


Web Mining Categories

• Web Content Mining
o Discovering useful information from web contents/data/documents
• Web Structure Mining
o Discovering the model underlying link structures (topology) on the Web
o E.g. discovering authorities and hubs
• Web Usage Mining
o Making sense of data generated by surfers
o Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
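The "authorities and hubs" example under Web Structure Mining can be sketched with the iterative hub/authority computation commonly known as HITS. The slides do not name an algorithm, so treating it as HITS is an assumption, and the link graph below is hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Iteratively compute hub and authority scores.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to it.
        auths = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auths[q] += hubs[p]
        # Hub: sum of the authority scores of the pages it links to.
        hubs = {p: sum(auths[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths


# Hypothetical link graph: two portals and a blog point at two sites.
links = {
    "portal1": ["siteA", "siteB"],
    "portal2": ["siteA", "siteB"],
    "blog": ["siteA"],
}
hubs, auths = hits(links)
top_authority = max(auths, key=auths.get)
print(top_authority)  # siteA
```

A good hub links to many good authorities, and a good authority is linked to by many good hubs, so the two scores reinforce each other.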

Web Content Data Structure

• Unstructured – free text
• Semi-structured – HTML
• More structured – table- or database-generated HTML pages
• Multimedia data – receives less attention than text or hypertext


Web Structure Mining

• Interested in the structure between Web documents (not within a document)
• Example: PageRank – Google
• Applications:
o Discovering micro-communities in the Web
o Measuring the “completeness” of a Web site
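The PageRank example can be illustrated with the commonly published power-iteration formulation. This is a sketch under stated assumptions, not Google's production method: the damping value 0.85, the handling of pages without outgoing links, and the link graph are all choices made for this example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict: page -> list of outgoing links."""
    pages = sorted(set(graph) | {q for t in graph.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        # Every page shares its rank equally among the pages it links to.
        for p in pages:
            targets = graph.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical site structure, for illustration only.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}
rank = pagerank(links)
best = max(rank, key=rank.get)
print(best)  # home
```

The score depends only on the links between documents, not on their content, which is what separates structure mining from content mining.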


Web Usage Mining

• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
o Web client data
o Proxy server data
o Web server data
• Two common approaches
o Map usage data into relational tables before using adapted data mining techniques
o Use log data directly by utilizing special pre-processing techniques
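The first approach, mapping usage data into relational tables, can be sketched with Python's standard re and sqlite3 modules. The log lines below are hypothetical and assume the Common Log Format; the table name hits and its columns are choices made for this example.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method path protocol" status bytes
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical sample entries; real logs would be read from a file.
SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2005:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
10.0.0.1 - - [10/Oct/2005:13:56:01 +0000] "GET /products.html HTTP/1.0" 200 5120
10.0.0.2 - - [10/Oct/2005:14:02:11 +0000] "GET /index.html HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2005:14:02:40 +0000] "GET /missing.html HTTP/1.0" 404 -
"""


def load_log(text):
    """Parse log lines into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits "
        "(host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        if m:  # malformed lines are skipped
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (m["host"], m["time"], m["method"], m["path"],
                 int(m["status"])),
            )
    return conn


conn = load_log(SAMPLE_LOG)
# Once the data is relational, ordinary SQL answers usage questions.
top_pages = conn.execute(
    "SELECT path, COUNT(*) AS n FROM hits WHERE status = 200 "
    "GROUP BY path ORDER BY n DESC, path"
).fetchall()
print(top_pages)  # [('/index.html', 2), ('/products.html', 1)]
```

From here, adapted data mining techniques can run on the table, for example grouping hits by host to reconstruct user sessions.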


Thank you!
