arpit-verma
WEB MINING
By Arpit
Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services
Discovering useful information from the World-Wide Web and its usage patterns
Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web Mining
Web usage mining
Web usage mining is the process of extracting useful information from server logs, that is, finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. More formally, web usage mining is the application of data mining techniques to discover interesting usage patterns from web data in order to understand and better serve the needs of web-based applications.
Web Mining
Web content mining
  Web page content mining
  Search result mining
Web structure mining
Web usage mining
  General access pattern tracking
  Customized usage tracking
Web Mining Taxonomy
Data mining techniques:
  Association rules
  Sequential patterns
  Classification
  Clustering
  Outlier discovery
Applications to the Web:
  E-commerce
  Information retrieval (search)
  Network management
Web Mining
The WWW is a huge, widely distributed, global information service centre for:
  Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  Hyperlink information
  Access and usage information
The WWW provides rich sources of data for data mining
Web Mining
Enormous wealth of information on the Web:
  Financial information (e.g. stock quotes)
  Book/CD/video stores (e.g. Amazon)
  Restaurant information (e.g. Zagat's)
  Car prices (e.g. CarPoint)
Lots of data on user access patterns:
  Web logs contain the sequence of URLs accessed by users
Possible to mine interesting nuggets of information:
  People who ski also travel frequently to Europe
  Tech stocks have corrections in the summer and rally from November until February
Why Mine the Web?
The Web is a huge collection of documents, plus:
  Hyperlink information
  Access and usage information
The Web is very dynamic:
  New pages are constantly being generated
Challenge: develop new web mining algorithms and adapt traditional data mining algorithms to:
  Exploit hyperlinks and access patterns
  Be incremental
Why is Web Mining Different?
Given:
  A source of textual documents
  A well-defined, limited query (text based)
Find:
  Sentences with relevant information
  Extract the relevant information and ignore non-relevant information (important!)
  Link related information and output it in a predetermined format
What is Information Extraction?
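The given/find steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "query" is a single keyword, the "relevant information" is a dollar amount, and the function name `extract_prices` and the output fields are hypothetical choices, not part of the original slides.

```python
import re

def extract_prices(text, keyword):
    """Find sentences mentioning `keyword` and pull out dollar amounts.

    Assumption: relevance is a simple case-insensitive keyword match and the
    information of interest is a price such as "$25" or "$299.99".
    """
    # Split into sentences at whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    records = []
    for sentence in sentences:
        if keyword.lower() not in sentence.lower():
            continue  # ignore non-relevant sentences
        for price in re.findall(r"\$\d+(?:\.\d+)?", sentence):
            # Output in a predetermined format: one flat record per price
            records.append({"keyword": keyword, "price": price, "sentence": sentence})
    return records
```

A real extraction system would use a richer query language and linguistic analysis; the sketch only shows the filter, extract, and format stages.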
Keyword (or term) based association analysis
Automatic document (topic) classification
Similarity detection:
  Cluster documents by a common author
  Cluster documents containing information from a common source
Sequence analysis: predicting a recurring event, discovering trends
Anomaly detection: finding information that violates usual patterns
Types of text mining
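As an illustration of similarity detection, one common approach (an assumption here, the slides do not name a specific measure) is cosine similarity between term-frequency vectors. The helper names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Count lowercase alphanumeric terms in a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    va, vb = term_vector(doc_a), term_vector(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Documents whose pairwise similarity exceeds some threshold could then be grouped, for example to cluster pages that draw on a common source.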
Phases: Pre-processing -> Pattern discovery -> Pattern analysis
Data flow: Raw server log -> User session file -> Rules and patterns -> Interesting knowledge
Web Usage Mining – Three Phases
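The pre-processing phase, turning a raw server log into a user session file, might look roughly like the sketch below. It assumes the log has already been parsed into `(user, timestamp_in_seconds, url)` tuples and that a gap of more than 30 minutes starts a new session; both the record layout and the timeout are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed, commonly used cut-off

def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) records into per-user sessions.

    Returns a list of (user, [url, ...]) tuples, one per session.
    """
    hits_by_user = defaultdict(list)
    for user, timestamp, url in records:
        hits_by_user[user].append((timestamp, url))

    sessions = []
    for user, hits in hits_by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        last_seen = hits[0][0]
        for timestamp, url in hits[1:]:
            if timestamp - last_seen > timeout:
                sessions.append((user, current))  # gap too long: close session
                current = []
            current.append(url)
            last_seen = timestamp
        sessions.append((user, current))
    return sessions
```

The resulting session list is the input that the pattern discovery phase would work on.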
Creating a model of web organization
Classify web pages
Create similarity measures between web pages
PageRank
The Clever system
Hyperlink-Induced Topic Search (HITS)
Web Structure Mining
Combine the intelligent IR tools:
  Meaning of words
  Order of words in the query
  User dependency for the data
  Authority of the source
With the unique web features:
  Retrieve hyperlink information
  Utilize hyperlinks as input
Intelligent Web Search
A program which browses the WWW in a methodical, automated manner
Copies pages into a cache and indexes them
Starts from a seed URL
Searches and finds links and keywords
Types of crawler:
  Context focused
  Focused
  Incremental
  Periodic
Web Crawler
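The crawler behaviour described above (start from a seed URL, cache each page, follow the links found) can be sketched as a breadth-first traversal. To keep the example self-contained, page retrieval is passed in as a `fetch` callable that returns HTML or `None`; that parameter, the `max_pages` limit, and the class and function names are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, fetch, max_pages=50):
    """Breadth-first crawl from seed_url; returns {url: html} as the cache."""
    cache = {}
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        cache[url] = html  # copy in cache; indexing would happen here
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return cache
```

In practice `fetch` would wrap an HTTP client and respect robots.txt and rate limits; a focused crawler would additionally score pages for topical relevance before queueing their links.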
A link analysis algorithm which assigns a numerical weight to each web page.
The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E).
The PageRank value for a page u depends on the PageRank value of each page v in the set Bu (the set of all pages linking to page u), divided by the number L(v) of links from page v: PR(u) = sum over v in Bu of PR(v) / L(v).
PageRank
Increase effectiveness of search engines
Based on the number of back links
Rank sink problem exists
Page Rank
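The PageRank definition above can be computed by simple iteration. The sketch below is a minimal version under stated assumptions: the link graph is a dict mapping each page to the pages it links to, and a `damping` parameter is included as one widely used remedy for rank sinks. With `damping=1.0` the update is exactly the sum of PR(v) / L(v) over back-linking pages; the function name and defaults are illustrative.

```python
def pagerank(links, damping=1.0, iterations=100):
    """Iteratively compute PageRank for a link graph.

    links: {page: [pages it links to]}
    damping: 1.0 gives the plain formula; values below 1.0 (0.85 is a
             common choice) mix in a uniform jump to reduce rank sinks.
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1.0 - damping) / n for p in pages}
        for v, outs in links.items():
            if not outs:
                continue  # pages without out-links pass on no rank here
            share = pr[v] / len(outs)  # PR(v) / L(v)
            for u in outs:
                new_pr[u] += damping * share
        pr = new_pr
    return pr
```

For the small graph `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}` with the default settings, the values settle at roughly 0.4 for A, 0.2 for B, and 0.4 for C. Pages with no out-links simply drop their rank in this sketch; production implementations redistribute it.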
Finds both authoritative pages and hubs
Authoritative: best source
Hub: links to authoritative pages
Most valuable pages returned
Hyperlink-Induced Topic Search
Keywords
Authority and hub measure
Clever System
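The authority and hub measures can be illustrated with the standard HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is a minimal sketch of that general scheme, not the Clever system's actual implementation; the graph format and function name are assumptions.

```python
import math

def hits(links, iterations=50):
    """Compute (hub, authority) score dicts for a link graph.

    links: {page: [pages it links to]}
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing here
        auth = {p: 0.0 for p in pages}
        for u, outs in links.items():
            for v in outs:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub update: sum of authority scores of pages pointed to
        hub = {p: sum(auth[v] for v in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth
```

In a keyword-driven setting, the graph would be built from the pages matching a query plus their neighbours, and the top authorities returned as results.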
Applies mining on web usage data or weblogs or clickstream data
Client perspective
Server perspective
Aids in personalization
Helps in evaluating quality and effectiveness
Preprocessing, pattern discovery and data structures
Web Usage Mining
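Pattern discovery on clickstream data can be as simple as counting which pages tend to be visited together. The sketch below counts page pairs that co-occur in at least `min_support` sessions, a basic building block of association rule mining; the session format (a list of URL lists) and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support=2):
    """Return {(page_a, page_b): count} for pairs seen in >= min_support sessions."""
    counts = Counter()
    for pages in sessions:
        # Use each page once per session and sort so pairs have a fixed order
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

Pairs that pass the support threshold are candidates for rules such as "visitors who view page X also view page Y", which can feed personalization or site restructuring.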
Trackers for site usage and analysis
Queries ‘N Suggestions