Web Mining
By-
Pawan Singh
Piyush Arora
Pooja Mansharamani
Pramod Singh
Praveen Kumar
Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions
Four Problems
• Finding relevant information
  o Low precision: many of the search results are irrelevant, which makes the relevant information hard to find.
  o Low recall: not all the information available on the web can be indexed, which makes relevant but unindexed information hard to find.
• Creating new knowledge out of the information available on the web
  o While the problem above is a query-triggered process (retrieval oriented), this problem is a data-triggered process.
• Personalizing the information
  o Catering to personal preferences in content and presentation (associated with the type and presentation of the information)
• Learning about the consumers
  o What does the customer want to do?
  o Using web data to effectively market products and/or services
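The precision and recall notions from the first problem can be made concrete with a small calculation. The sketch below is only illustrative: the helper `precision_recall` and the document IDs are hypothetical and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: set of document IDs returned by the search service
    relevant:  set of document IDs that are actually relevant
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results come back and 2 of them are relevant,
# while 5 relevant documents exist overall (3 were never indexed).
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d7", "d8", "d9"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here half of the results are irrelevant (low precision) and most relevant documents are missed because they are not indexed (low recall).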
Other Approaches
Web mining is NOT the only approach
• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
  o In-depth syntactic and semantic analysis
• Web document community
  o Standards, manually appended meta-information, maintained directories, etc.
Direct vs. Indirect Web Mining
Web mining techniques can be used to solve the information overload problems:
• Directly
  o Attack the problem with web mining techniques
  o E.g. a newsgroup agent classifies news as relevant
• Indirectly
  o Used as part of a bigger application that addresses the problems
  o E.g. used to create index terms for a web search service
The Research
• Converging research from database, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• Focusing on research from the machine learning point of view
Web Mining: Definition
“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
• Can be viewed as four subtasks
• Not the same as Information Retrieval
• Not the same as Information Extraction
Web Mining: Subtasks
• Resource finding
  o Retrieving intended documents
• Information selection/pre-processing
  o Select and pre-process specific information from retrieved web resources
• Generalization
  o Discover general patterns within and across web sites
• Analysis
  o Validation and/or interpretation of mined patterns
Web Mining: Not IR
Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
Web Mining: Not IE
• Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
• IE systems for the general Web are not feasible
• Most focus on specific Web sites or content
IE - IR
Information Retrieval
• Automatic retrieval of relevant documents
• Primary goals:
  o Indexing text
  o Searching for useful documents in a collection
• “Bag of unordered words”
• The “Web document classification” task is an instance of IR

Information Extraction
• Extract relevant facts from documents
• Primary goals:
  o Transform a collection of retrieved documents into information
  o Structure of representation of a document
• IE has a higher level of granularity
• Result:
  o Structured database
  o Compression or summary of text or documents
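The “bag of unordered words” view used in IR can be illustrated with a short Python sketch. The helper `bag_of_words`, its simple tokenization rule, and the two sample texts are assumptions made for this example, not something the slides specify.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two hypothetical documents with the same words in a different order.
doc_a = "Web mining discovers knowledge from Web data"
doc_b = "data from Web: knowledge, mining discovers Web"

# Word order is discarded, so both documents get the same representation.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
print(bag_of_words(doc_a)["web"])
```

Because only word counts survive, this representation suits indexing and document selection, while fact extraction in IE needs the structure that the bag discards.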
Types of IE
IE from unstructured texts (classical)
• Unstructured: free text, e.g. news stories
• Basic to deep linguistic pre-processing
IE from semi-structured texts (structural)
• Semi-structured: HTML
• Uses meta-information, e.g. HTML tags
• Wrapper induction; machine learning is used to build systems (semi-)automatically
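As a rough illustration of IE from semi-structured text, the sketch below uses HTML tags as meta-information to pull facts out of a page. It is a hand-written wrapper built on Python's standard `html.parser`; wrapper induction would aim to learn such rules (semi-)automatically instead. The class `ProductWrapper`, the page markup, and the `name`/`price` class attributes are all hypothetical.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Collect the text inside <span class="name"> and <span class="price">."""

    FIELDS = {"name", "price"}

    def __init__(self):
        super().__init__()
        self.records = []   # extracted rows, one dict per <li> element
        self._field = None  # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li":
            self.records.append({})
        elif tag == "span" and cls in self.FIELDS:
            self._field = cls

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


# Hypothetical page fragment, invented for this sketch.
page = """
<ul>
  <li><span class="name">Keyboard</span> <span class="price">25.00</span></li>
  <li><span class="name">Mouse</span> <span class="price">10.50</span></li>
</ul>
"""

wrapper = ProductWrapper()
wrapper.feed(page)
print(wrapper.records)
```

The result is a list of structured records, which matches the "structured database" outcome of IE; the wrapper only works for pages that follow this particular layout, which is why such systems tend to target specific sites.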
Web Mining and Machine Learning
• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT learning from the Web
  o Some applications of machine learning on the web are NOT Web Mining
  o Methods used for Web Mining are NOT limited to machine learning
• There is a close relationship between web mining and machine learning
• Machine learning techniques support web mining because they can be applied to the processes within web mining.
• For example, recent research shows that applying machine learning techniques can improve text classification compared with traditional IR techniques.
• In short, web mining intersects with the application of machine learning on the web.
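To give a feel for machine learning applied to text classification, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is chosen here only as a familiar example; the slides do not name a specific algorithm. The functions `train` and `classify`, the labels, and the training texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts the model needs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for this sketch.
training = [
    ("web mining finds patterns in web data", "relevant"),
    ("link structure and usage logs of web sites", "relevant"),
    ("cheap watches buy now", "irrelevant"),
    ("win money now with this offer", "irrelevant"),
]
model = train(training)
print(classify("mining usage logs", model))
print(classify("buy cheap offer now", model))
```

A classifier of this kind could sit behind an agent that marks incoming documents as relevant or not, given enough labeled examples.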
Web Mining Categories
• Web Content Mining
  o Discovering useful information from web contents/data/documents
• Web Structure Mining
  o Discovering the model underlying link structures (topology) on the Web
  o E.g. discovering authorities and hubs
• Web Usage Mining
  o Making sense of data generated by surfers
  o Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
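The "authorities and hubs" example under Web Structure Mining can be sketched with a simple iterative scheme in the style of the HITS algorithm: good hubs point to good authorities, and good authorities are pointed to by good hubs. The function `hits`, the fixed iteration count, and the link graph below are assumptions made for illustration.

```python
def hits(graph, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of authority scores of pages it links to.
        hub = {p: sum(auth[dst] for dst in graph.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: two directory pages point to a few articles.
links = {
    "dir1": ["articleA", "articleB"],
    "dir2": ["articleA", "articleB", "articleC"],
    "articleA": [],
    "articleB": [],
    "articleC": [],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("A is a stronger authority than C:", auth["articleA"] > auth["articleC"])
```

In this toy graph the directory pages act as hubs and the articles they agree on act as authorities.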
Web Content Data Structure
• Unstructured – free text
• Semi-structured – HTML
• More structured – table- or database-generated HTML pages
• Multimedia data – receives less attention than text or hypertext
Web Structure Mining
• Interested in the structure between Web documents (not within a document)
• Example: PageRank – Google
• Application:
  o Discovering micro-communities in the Web
  o Measuring the “completeness” of a Web site
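The PageRank example can be illustrated with a basic power-iteration sketch that scores pages purely from the links between them. This is a simplified textbook formulation, not Google's production method; the function `pagerank`, the damping value of 0.85 (a commonly used default), the iteration count, and the four-page site are assumptions for this example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """graph: dict mapping each page to the list of pages it links to.

    Every link target must also appear as a key of graph.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, targets in graph.items():
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
        rank = new_rank
    return rank


# Hypothetical four-page site, invented for this sketch.
site = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "contact"],
    "contact": ["home"],
}
ranks = pagerank(site)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page that every other page links to ends up with the highest score, even though nothing about its content was examined.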
Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  o Web client data
  o Proxy server data
  o Web server data
• Two common approaches
  o Map usage data into relational tables before using adapted data mining techniques
  o Use log data directly by utilizing special pre-processing techniques
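The first approach, mapping usage data into relational tables, might look like the sketch below. It assumes web server logs in Common Log Format and uses Python's built-in `sqlite3` with an in-memory database; the table name `requests`, its columns, and the sample log lines are hypothetical.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log lines, invented for this sketch.
log_lines = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 1500',
    '10.0.0.2 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'malformed line that the parser skips',
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
)
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:  # skip lines that do not follow the expected format
        conn.execute(
            "INSERT INTO requests VALUES (?, ?, ?, ?, ?)",
            (match["host"], match["time"], match["method"],
             match["path"], int(match["status"])),
        )

# With the data in a table, ordinary SQL answers simple usage questions.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM requests GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)
```

Once the log is relational, standard or adapted data mining techniques can run over the table; the second approach would instead work on the raw log after specialized pre-processing such as session identification.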
Thank you!