1. CS 430: Information Discovery
   Lecture 18: Web Search Engines: Google

2. Course Administration

3. Web Search

Goal: provide information discovery for large amounts of open-access material on the web.

Challenges:
- Volume of material -- several billion items, growing steadily
- Items created dynamically or held in databases
- Great variety -- length, format, quality control, purpose, etc.
- Inexperience of users -- a wide range of needs
- Economic models to pay for the service

4. Strategies

- Subject hierarchies: Yahoo! -- use of human indexing
- Web crawling + automatic indexing:
  General -- Google, AltaVista, Ask Jeeves, NorthernLight, ...
  Subject-based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com, ...
- Mixed models: human-directed web crawling and automatic indexing -- BBC News

5. Components of a Web Search Service

Components:
- Web crawler
- Indexing system
- Search system

Considerations:
- Economics
- Scalability

6. Economic Models

- Subscription: a monthly fee with logon provides unlimited access (introduced by InfoSeek)
- Advertising: access is free, with display advertisements (introduced by Lycos); can lead to distortion of results to suit advertisers
- Licensing: the company's costs are covered by fees and by licensing of software and specialized services

8. Cost Example (Google)

- 85 people -- 50% technical, 14 with a Ph.D. in Computer Science
- Equipment -- 2,500 Linux machines, 80 terabytes of spinning disk, 30 new machines installed daily
- Reported by Larry Page, Google, March 2000; at that time Google was handling 5.5 million searches per day, increasing at 20% per month
- By fall 2002, Google had grown to over 400 people
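A back-of-the-envelope check on those numbers (my arithmetic, not from the slides): 20% per month compounds to 1.2^12 ≈ 8.9 over a year, so 5.5 million searches per day in March 2000 projects to roughly 49 million per day by March 2001 if the rate held -- which is why installing 30 new machines a day was not extravagant.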
9. Indexing

Goal: precision. Short queries applied to very large numbers of items lead to large numbers of hits, so usability requires:
- Ranking hits in an order that fits the user's requirements
- Effective presentation: helpful summary records, removal of duplicates, grouping of results from a single site

Completeness of the index is not the most important factor.

10. Effective Information Retrieval

- Comprehensive metadata with Boolean retrieval (e.g., a monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
- Full-text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous material, but requires the full text to be available (a minimal sketch of this approach follows slide 11).

11. Effective Information Retrieval (continued)

- Full-text indexing with contextual information and ranked retrieval (e.g., Google). Excellent for mixed textual information with rich structure.
- Contextual information alone, without indexing the non-textual materials themselves, with ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
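A minimal sketch of ranked full-text retrieval in the vector-space model: tf-idf term weights compared by cosine similarity, whose normalization also supplies the document-length correction mentioned on slide 12. This is the textbook construction, not Google's actual implementation, and the toy documents are invented:

    import math
    from collections import Counter

    def tf_idf_vectors(docs):
        # Weight each term by (term frequency) * log(N / document frequency).
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))          # count each term once per document
        idf = {t: math.log(n / df[t]) for t in df}
        return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs], idf

    def cosine(a, b):
        # Cosine similarity of sparse vectors: normalizes away document length.
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    docs = [
        "cornell sports law overview".split(),
        "ncaa sports drug testing rules".split(),
        "web search engines google".split(),
    ]
    vectors, idf = tf_idf_vectors(docs)
    query = {t: idf.get(t, 0.0) for t in "sports law".split()}
    ranking = sorted(range(len(docs)),
                     key=lambda i: cosine(query, vectors[i]), reverse=True)
    print(ranking)   # document indices, best match first: [0, 1, 2]

Because both vectors are length-normalized, a long document cannot win simply by repeating query terms.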
12. Google: Ranking

1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. PageRank

The balance between 3, 4, and 5 is not made public.
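Item 5 is the component Google has described publicly (Brin and Page, 1998): PageRank models a random surfer who follows a link with probability d and otherwise jumps to a random page, and scores each page by its share of the surfer's time. A minimal power-iteration sketch -- d = 0.85 is the commonly cited damping factor, and the four-page link graph is invented:

    def pagerank(links, d=0.85, iterations=50):
        # links maps each page to the list of pages it links to.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - d) / n for p in pages}    # random-jump share
            for p, outs in links.items():
                if outs:
                    share = d * rank[p] / len(outs)    # rank splits over out-links
                    for q in outs:
                        new[q] += share
                else:                                  # dangling page: spread evenly
                    for q in pages:
                        new[q] += d * rank[p] / n
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(links))   # "c" scores highest: it receives the most rank

How this score is blended with the vector-space and field weights of items 3 and 4 is exactly the balance the slide says is not public.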
13. Usability: Display of Results

14. Usability: Dynamic Abstracts

Query: Cornell sports
    LII: Law about... Sports
    "sports law: an overview. Sports Law encompasses a multitude of areas of law brought together in unique ways. Issues... vocation. Amateur Sports. ..."

Query: NCAA Tarkanian
    LII: Law about... Sports
    "purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On..."

The same page yields a different abstract for each query: the summary is built around the query terms rather than fixed in advance.
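Such query-biased abstracts can be generated at search time by scoring short windows of the page by the query terms they contain and showing the best one. A deliberately minimal sketch -- production systems also weight term proximity, sentence boundaries, and titles:

    def dynamic_abstract(text, query, window=20):
        # Return the `window`-word span containing the most query terms.
        words = text.split()
        terms = {t.lower() for t in query.split()}
        best_start, best_hits = 0, -1
        for start in range(max(1, len(words) - window + 1)):
            hits = sum(1 for w in words[start:start + window]
                       if w.lower().strip(".,;:()") in terms)
            if hits > best_hits:
                best_start, best_hits = start, hits
        return "... " + " ".join(words[best_start:best_start + window]) + " ..."

    page = ("Sports law encompasses a multitude of areas of law. "
            "See NCAA v. Tarkanian on the state action doctrine. "
            "Drug testing rules may also raise state action questions.")
    print(dynamic_abstract(page, "NCAA Tarkanian"))
    print(dynamic_abstract(page, "sports law"))

Each query pulls out a different span of the same page, which is the behavior shown on the slide.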
15. Limitations of Web Crawling

- Time delay: typically a monthly cycle, so crawlers are ineffective for sites that change rapidly, e.g., news.
- Pages not linked to: crawlers find only those pages that are reachable by paths of links from their seeds.
- Depth of crawl: crawlers do not index every page on a site (algorithms to avoid crawler traps).

but... creators of information are increasingly organizing it to be accessible to the web search services (e.g., Springer-Verlag).
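The second and third limitations fall directly out of the algorithm: a crawler is a bounded breadth-first traversal of the link graph outward from its seeds. A minimal sketch, assuming well-formed HTML and omitting robots.txt handling, politeness delays, and URL canonicalization, all of which a real crawler needs:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        # Collect the href attribute of every <a> tag.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(seeds, max_depth=3, max_pages=100):
        # Breadth-first traversal: pages deeper than max_depth, or not
        # reachable from the seeds at all, are never seen (slide 15).
        seen = set(seeds)
        queue = deque((url, 0) for url in seeds)
        while queue and len(seen) < max_pages:
            url, depth = queue.popleft()
            if depth >= max_depth:       # depth cap: crude crawler-trap guard
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        return seen

The time-delay limitation is the other side of the same design: revisiting every discovered page at polite request rates takes weeks, so rapidly changing sites such as news are stale almost as soon as the index is rebuilt.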
16. Scalability

[Figure: "The growth of the web" -- a chart whose logarithmic axis runs from 1,000 to 10,000,000,000 pages.]

17. Scalability (continued)

Web search services are centralized systems. Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function. Will this continue? Possible areas for concern are telecommunications costs and disk access rates.

18. Case Study: Google

- Python with C/C++
- Linux
- Module-based architecture
- Multi-machine, multi-threaded

19. Performance

Storage:
- Scales with the size of the web; the repository is comparatively small
- Good/fast compression and decompression (a sketch follows the conclusion)

System:
- Crawling, indexing, and sorting -- the last two run simultaneously

Searching:
- Bounded by disk I/O

20. Image Search: indexing by contextual information only

21. Google API

22. Selective searching

23. Google News

24. Conclusion

Google is a scalable search engine with a complete architecture; many research ideas arise from it, and there is always something to improve. High-quality search is the dominant factor:
- precision
- presentation of results
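On slide 19's storage point: compressing the repository trades CPU for disk, and since searching is bounded by disk I/O that trade favors a fast compressor -- Brin and Page's 1998 paper reports choosing zlib for its speed over better-compressing alternatives. A minimal sketch of the idea; the record handling is invented for illustration:

    import zlib

    def store(record: bytes) -> bytes:
        # Level 1 favors compression/decompression speed over ratio.
        return zlib.compress(record, level=1)

    def load(blob: bytes) -> bytes:
        return zlib.decompress(blob)

    page = b"<html><body>Example crawled page, repetitive text. " * 50
    blob = store(page)
    print(len(page), "->", len(blob), "bytes")   # disk saved per record
    assert load(blob) == page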