CSCI 5417 Information Retrieval Systems Jim Martin Lecture 16 10/18/2011

  • View
    214

  • Download
    1

Embed Size (px)

Text of CSCI 5417 Information Retrieval Systems Jim Martin Lecture 16 10/18/2011

  • Slide 1

CSCI 5417 Information Retrieval Systems Jim Martin Lecture 16 10/18/2011 Slide 2 Today Review clustering K-means Review nave Bayes Unsupervised classification EM Nave Bayes/EM for text classification Topic models model intuition Slide 3 5/2/2015 CSCI 5417 - IR 3 K-Means Assumes documents are real-valued vectors. Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c: Iterative reassignment of instances to clusters is based on distance to the current cluster centroids. (Or one can equivalently phrase it in terms of similarities) Slide 4 5/2/2015 CSCI 5417 - IR 4 K-Means Algorithm Select K random docs {s 1, s 2, s K } as seeds. Until stopping criterion: For each doc d i : Assign d i to the cluster c j such that dist(d i, s j ) is minimal. For each cluster c_j s_j = m(c_j) Slide 5 5/2/2015 CSCI 5417 - IR 5 K Means Example (K=2) Pick seeds Assign clusters Compute centroids x x Reassign clusters x x x x Compute centroids Reassign clusters Converged! Slide 6 5/2/2015 CSCI 5417 - IR 6 Termination conditions Several possibilities A fixed number of iterations Doc partition unchanged Centroid positions dont change Slide 7 Convergence Why should the K-means algorithm ever reach a fixed point? A state in which clusters dont change. K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. EM is known to converge. Number of iterations could be large. But in practice usually isnt Sec. 16.4 Slide 8 5/2/2015 CSCI 5417 - IR 8 Text j single document containing all docs j for each word x k in Vocabulary n k number of occurrences of x k in Text j Nave Bayes: Learning From training corpus, extract Vocabulary Calculate required P(c j ) and P(x k | c j ) terms For each c j in C do docs j subset of documents for which the target class is c j Slide 9 5/2/2015 CSCI 5417 - IR 9 Multinomial Model Slide 10 5/2/2015 CSCI 5417 - IR 10 Nave Bayes: Classifying positions all word positions in current document which contain tokens found in Vocabulary Return c NB, where Slide 11 5/2/2015 CSCI 5417 - IR 11 Apply Multinomial Slide 12 Nave Bayes Example DocCategory D1Sports D2Sports D3Sports D4Politics D5Politics DocCategory {China, soccer}Sports {Japan, baseball}Sports {baseball, trade}Sports {China, trade}Politics {Japan, Japan, exports}Politics Sports (.6) baseball3/12 China2/12 exports1/12 Japan2/12 soccer2/12 trade2/12 Politics (.4) baseball1/11 China2/11 exports2/11 Japan3/11 soccer1/11 trade2/11 Using +1; |V| =6; |Sports| = 6; |Politics| = 5 Politics"> 5/2/2015 CSCI 5417 - IR 13 Nave Bayes Example Classifying Soccer (as a doc) Soccer | sports =.167 Soccer | politics =.09 Sports > Politics Slide 14 5/2/2015 CSCI 5417 - IR 14 Example 2 Howa about? Japan soccer Sports P(japan|sports)P(soccer|sports)P(sports).166 *.166*.6 =.0166 Politics P(japan|politics)P(soccer|politics)P(politics).27 *.09 *. 4 =.00972 Sports > Politics Slide 15 Break No class Thursday; work on the HW No office hours either. HW questions? The format of the test docs will be same as the current docs minus the.M field which will be removed. How should you organize your development efforts? 5/2/2015CSCI 5417 - IR15 Slide 16 Example 3 What about? China trade Sports.166 *.166 *.6 =.0166 Politics.1818 *.1818 *. 4 =.0132 Again Sports > Politics Sports (.6) basebal l 3/12 China2/12 exports1/12 Japan2/12 soccer2/12 trade2/12 Politics (.4) baseball1/11 China2/11 exports2/11 Japan3/11 soccer1/11 trade2/11 Slide 17 Problem? 5/2/2015CSCI 5417 - IR17 DocCategory {China, soccer}Sports {Japan, baseball}Sports {baseball, trade}Sports {China, trade}Politics {Japan, Japan, exports}Politics Nave Bayes doesnt remember the training data. It just extracts statistics from it. Theres no guarantee that the numbers will generate correct answers for all members of the training set. Slide 22 Nave Bayes Example (EM) Use these counts to reassess the class membership for D1 to D5. Reassign them to new classes. Recompute the tables and priors. Repeat until happy Slide 23 Topics DocCategory {China, soccer}Sports {Japan, baseball}Sports {baseball, trade}Sports {China, trade}Politics {Japan, Japan, exports}Politics Whats the deal with trade? Slide 24 Topics DocCategory {China 1, soccer 2 }Sports {Japan 1, baseball 2 }Sports {baseball 2, trade 2 }Sports {China 1, trade 1 }Politics {Japan 1, Japan 1, exports 1 }Politics {basketball 2, strike 3 }