119-03-08. Web mining is the use of data mining techniques to automatically discover and extract...



Over 1 billion HTML pages, 15 terabytes
Wealth of information
    Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets
Diverse media types: text, images, audio, video
Heterogeneous formats: HTML, XML, PostScript, PDF, JPEG, MPEG, MP3
Highly dynamic
    1 million new pages each day
    Average page changes in a few weeks
Graph structure with links between pages
    Average page has 7-10 links
Hundreds of millions of queries per day
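The graph structure mentioned above (pages as nodes, links as directed edges) can be illustrated with a minimal sketch. The page names and the two helper functions are hypothetical, chosen only for illustration; a real crawl would have billions of nodes and an average of roughly 7-10 outgoing links per page rather than the handful shown here.

```python
# Hypothetical toy web: each key is a page, each value lists the pages it links to.
web_graph = {
    "home.html": ["news.html", "maps.html", "books.html"],
    "news.html": ["home.html", "stocks.html"],
    "maps.html": ["home.html"],
    "books.html": ["home.html", "news.html"],
    "stocks.html": [],
}


def average_out_degree(graph):
    """Return the mean number of outgoing links per page."""
    if not graph:
        return 0.0
    return sum(len(links) for links in graph.values()) / len(graph)


def in_degrees(graph):
    """Count how many links point at each page."""
    counts = {page: 0 for page in graph}
    for links in graph.values():
        for target in links:
            counts[target] = counts.get(target, 0) + 1
    return counts


print(average_out_degree(web_graph))
print(in_degrees(web_graph))
```

An adjacency list is used here because it stores only the links that exist, which suits a graph where each page links to a tiny fraction of all other pages.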
