CSCI 5417 Information Retrieval Systems Jim Martin Lecture 17 10/25/2011

  • View
    213

  • Download
    0

Embed Size (px)

Text of CSCI 5417 Information Retrieval Systems Jim Martin Lecture 17 10/25/2011

  • CSCI 5417Information Retrieval Systems

    Jim MartinLecture 1710/25/2011

    CSCI 5417 - IR

  • TodayFinish topic models model introStart on web search

  • What if?What if we just have the documents but no class assignments?But assume we do have knowledge about the number of classes involvedCan we still use probabilistic models? In particular, can we use nave Bayes?Yes, via EMExpectation Maximization

  • EMGiven some model, like NB, make up some class assignments randomly.Use those assignments to generate model parameters P(class) and P(word|class)Use those model parameters to re-classify the training data.Go to 2

  • Nave Bayes Example (EM)

    DocCategoryD1?D2?D3?D4?D5?

  • Nave Bayes Example (EM)

    DocCategoryD1SportsD2PoliticsD3SportsD4PoliticsD5Sports

    DocCategory{China, soccer}Sports{Japan, baseball}Politics{baseball, trade}Sports{China, trade}Politics{Japan, Japan, exports}Sports

    Sports (.6)baseball2/13China2/13exports2/13Japan3/13soccer2/13trade2/13

    Politics (.4)baseball2/10China2/10exports1/10Japan2/10soccer1/10trade2/10

  • Nave Bayes Example (EM)Use these counts to reassess the class membership for D1 to D5. Reassign them to new classes. Recompute the tables and priors.Repeat until happy

  • TopicsWhats the deal with trade?

    DocCategory{China, soccer}Sports{Japan, baseball}Sports{baseball, trade}Sports{China, trade}Politics{Japan, Japan, exports}Politics

  • Topics{basketball2, strike3}

    DocCategory{China1, soccer2}Sports{Japan1, baseball2}Sports{baseball2, trade2}Sports{China1, trade1}Politics{Japan1, Japan1, exports1}Politics

  • TopicsSo lets propose that instead of assigning documents to classes, we assign each word token in each document to a class (topic).Then we can some new probabilities to associate with words, topics and documentsDistribution of topics in a docDistribution of topics overallAssociation of words with topics*CSCI 5417 - IR*

    CSCI 5417 - IR

  • TopicsExample. A document like {basketball2, strike3}Can be said to be .5 about topic 2 and .5 about topic 3 and 0 about the rest of the possible topics (may want to worry about smoothing later.For a collection as a whole we can get a topic distribution (prior) by summing the words tagged with a particular topic, and dividing by the number of tagged tokens.*CSCI 5417 - IR*

    CSCI 5417 - IR

  • ProblemWith normal text classification the training data associates a document with one or more topics.Now we need to associate topics with the (content) words in each documentThis is a semantic tagging task, not unlike part-of-speech tagging and word-sense taggingIts hard, slow and expensive to do right*CSCI 5417 - IR*

    CSCI 5417 - IR

  • Topic modelingDo it without the human taggingGiven a set of documentsAnd a fixed number of topics (given)Find the statistics that we need*CSCI 5417 - IR*

    CSCI 5417 - IR

  • Graphical Models Notation: Take 2

  • Unsupervised NBNow suppose that Cat isnt observedThat is, we dont have category labels for each documentThen we need to learn two distributions:P(Cat)P(w|Cat)How do we do this?We might use EMAlternative: Bayesian methods

  • Bayesian document categorization

  • Latent Dirichlet Allocation: Topic Models(Blei, Ng, & Jordan, 2001; 2003)NdDziwi (d) (j) (d) Dirichlet()zi Discrete( (d) ) (j) Dirichlet()wi Discrete( (zi) )T

  • Given ThatWhat could you do with it.Browse/explore a collection and individual documents is the basic task

    *CSCI 5417 - IR*

    CSCI 5417 - IR

  • Visualize the topics*CSCI 5417 - IR*

    CSCI 5417 - IR

  • Visualize documents*CSCI 5417 - IR*

    CSCI 5417 - IR

  • Break*CSCI 5417 - IR*

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Brief History of Web SearchEarly keyword-based enginesAltavista, Excite, Infoseek, Inktomi, Lycos ca. 1995-1997Sponsored search ranking: WWWW (Colorado/McBryan) -> Goto.com (morphed into Overture.com Yahoo! ???)Your search ranking depended on how much you paidAuction for keywords: casino was an expensive keyword!

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Brief history1998+: Link-based ranking introduced by GooglePerception was that it represented a fundamental improvement over existing systemsGreat user experience in search of a business modelMeanwhile Goto/Overtures annual revenues were nearing $1 billionGoogle adds paid-placement ads to the side, distinct from search results2003: Yahoo follows suitacquires Overture (for paid placement) and Inktomi (for search)

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Web search basics

    CSCI 5417 - IR

    Web

    Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

    Sponsored Links

    CG Appliance ExpressDiscount Appliances (650) 756-3931Same Day Certified Installationwww.cgappliance.comSan Francisco-Oakland-San Jose, CA

    Miele Vacuum CleanersMiele Vacuums- Complete SelectionFree Shipping!www.vacuums.com

    Miele Vacuum CleanersMiele-Free Air shipping!All models. Helpful advice.www.best-vacuum.com

    Miele, Inc -- Anything else is a compromiseAt the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances.Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... www.miele.com/ - 20k - Cached - Similarpages

    MieleWelcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similarpages

    Miele - Deutscher Hersteller von Einbaugerten, Hausgerten ... - [ Translate this page ]Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit...ein Leben lang. ... Whlen Sie die Miele Vertretung Ihres Landes. www.miele.de/ - 10k - Cached - Similarpages

    Herzlich willkommen bei Miele sterreich - [ Translate this page ]Herzlich willkommen bei Miele sterreich Wenn Sie nicht automatischweitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERTE ... www.miele.at/ - 3k - Cached - Similarpages

  • User NeedsNeed [Brod02, RL04]Informational want to learn about something (~40% / 65%)

    Navigational want to go to that page (~25% / 15%)

    Transactional want to do something (web-mediated) (~35% / 20%)Access a serviceDownloads ShopGray areasFind a good hubExploratory search see whats there Low hemoglobinUnited AirlinesCar rental BrazilSec. 19.4.1**CSCI 5417 - IR

    CSCI 5417 - IR

  • How far do people look for results?(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)**CSCI 5417 - IR

    CSCI 5417 - IR

  • Users empirical evaluation of resultsQuality of pages varies widelyRelevance is not enoughOther desirable qualitiesContent: Trustworthy, diverse, non-duplicated, well maintainedWeb readability: display correctly & fastNo annoyances: pop-ups, etcPrecision vs. recallOn the web, recall seldom mattersWhat mattersPrecision at 1? Precision at k?Comprehensiveness must be able to deal with obscure queriesRecall matters when the number of matches is very small**CSCI 5417 - IR

    CSCI 5417 - IR

  • Users empirical evaluation of enginesRelevance and validity of resultsUI Simple, no clutter, error tolerantTrust Results are objectiveCoverage of topics for polysemic queriesPre/Post process tools providedMitigate user errors (auto spell check, search assist,)Explicit: Search within results, more like this, refine ...Anticipative: related searches, suggest, instant searchDeal with idiosyncrasiesWeb specific vocabularyImpact on stemming, spell-check, etcWeb addresses typed in the search box**CSCI 5417 - IR

    CSCI 5417 - IR

  • The Web as a Document CollectionNo design/co-ordinationDistributed content creation, linking, democratization of publishingContent includes truth, lies, obsolete information, contradictions Unstructured (text, html, ), semi-structured (XML, annotated photos), structured (Databases)Scale much larger than previous text collections but corporate records are catching upGrowth slowed down from initial volume doubling every few months but still expandingContent can be dynamically generated**CSCI 5417 - IR

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Web search engine piecesSpider (a.k.a. crawler/robot) builds corpusCollects web pages recursivelyFor each known URL, fetch the page, parse it, and extract new URLsRepeatAdditional pages from direct submissions & other sourcesThe indexer creates inverted indexesUsual issues wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, language issues, etc.Query processor serves query resultsFront end query reformulation, word stemming, capitalization, optimization of Booleans, phrases, wildcards, spelling, etc.Back end finds matching documents and ranks them

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Search Engine: Three sub-problemsMatch ads to query/contextGenerate and Order the adsPricing on a click-through

    CSCI 5417 - IR

  • *CSCI 5417 - IR*The trouble with search adsThey cost real money. Search Engine Optimization:Tuning your web page to rank highly in the search results for select keywordsAlternative to paying for placementThus, intrinsically a marketing functionPerformed by companies, webmasters and consultants (Search engine optimizers) for their clientsSome perfectly legitimate, some very shady

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Basic crawler operationBegin with known seed pagesFetch and parse themExtract URLs they point toPlace the extracted URLs on a queueFetch each URL on the queue and repeat

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Crawling pictureUnseen WebSeedpages

    CSCI 5417 - IR

  • *CSCI 5417 - IR*Simple picture complicationsEffective Web c