Text Categorization
Hongning Wang, CS@UVa
Today's lecture
• Bayes decision theory
• Supervised text categorization
– General steps for text categorization
– Feature selection methods
– Evaluation metrics
Text mining in general
(Diagram: the text mining pipeline has three stages: access (filter information), mining (discover knowledge), and organization (add structure/annotations); it serves IR applications, is based on NLP/ML techniques, and is a sub-area of DM research.)
Applications of text categorization
• Automatically classify political news from sports news
Applications of text categorization
• Recognizing spam emails (Spam = True/False)
Applications of text categorization
• Sentiment analysis
Basic notions about categorization
• Data points/Instances
– X: an N-dimensional feature vector (vector space representation)
• Labels
– y: a categorical value from {1, …, k}
• Classification hyper-plane
– f(X) → y
Key question: how to find such a mapping?
Bayes decision theory
• If we know P(y) and P(X|y), the Bayes decision rule is
ŷ = argmax_y P(y|X) = argmax_y P(X|y)P(y) / P(X)
(P(X) is constant with respect to y)
– Example in binary classification
• ŷ = 1, if P(X|y = 1)P(y = 1) > P(X|y = 0)P(y = 0)
• ŷ = 0, otherwise
• This leads to the optimal classification result
– Optimal in the sense of "risk" minimization
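A minimal sketch of this decision rule in code, assuming P(y) and p(x|y) are already known; the prior values and Gaussian class-conditionals below are hypothetical choices for illustration only:

from math import exp, pi, sqrt

prior = {0: 0.6, 1: 0.4}                      # P(y), assumed known
params = {0: (0.0, 1.0), 1: (2.0, 1.0)}       # (mean, std) of the assumed p(x|y)

def likelihood(x, y):
    mu, sigma = params[y]
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def bayes_decision(x):
    # y_hat = argmax_y p(x|y) P(y); P(x) is constant w.r.t. y, so it is dropped
    return max((0, 1), key=lambda y: likelihood(x, y) * prior[y])

print(bayes_decision(0.3), bayes_decision(1.8))   # -> 0 1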
Bayes risk
• Risk: assigning an instance to a wrong class
– Type I error (false positive): (y* = 0, ŷ = 1)
• p(X|y = 0)P(y = 0)
– Type II error (false negative): (y* = 1, ŷ = 0)
• p(X|y = 1)P(y = 1)
– Risk by the Bayes decision rule
• r(X) = min{p(X|y = 1)P(y = 1), p(X|y = 0)P(y = 0)}
• It can determine a "reject region"
Bayes risk
• Risk: assigning an instance to a wrong class
(Figures: the weighted class-conditional densities p(X|y = 1)P(y = 1) and p(X|y = 0)P(y = 0) plotted over X; the decision boundary splits X into a ŷ = 0 region and a ŷ = 1 region, which determine the false-negative and false-positive areas. Shifting the boundary away from the point where the two curves cross increases the error; the optimal Bayes decision boundary lies at that crossing point.)
Bayes risk
• Expected risk
E[r(X)] = ∫ r(X) p(X) dX
        = ∫ min{p(X|y = 1)P(y = 1), p(X|y = 0)P(y = 0)} dX
        = P(y = 1) ∫_{R0} p(X|y = 1) dX + P(y = 0) ∫_{R1} p(X|y = 0) dX
where R0 is the region where we assign X to class 0 and R1 is the region where we assign X to class 1.
Will the error of assigning c1 to c0 always equal the error of assigning c0 to c1?
Loss function
• The penalty we pay when misclassifying instances
• Goal of classification in general
– Minimize loss
E[L] = L_{1,0} P(y = 1) ∫_{R0} p(X|y = 1) dX + L_{0,1} P(y = 0) ∫_{R1} p(X|y = 0) dX
where L_{1,0} is the penalty for misclassifying c1 as c0, L_{0,1} is the penalty for misclassifying c0 as c1, and R0, R1 are the regions where we assign X to class 0 and class 1, respectively.
Will this still be optimal?
Supervised text categorization
• Supervised learning
– Estimate a model/method from labeled data
– The model can then be used to determine the labels of unobserved samples
(Diagram: Training: labeled documents {(x_i, y_i)}, i = 1..N, with categories such as Sports, Business, Education, and Science, are used to learn a classifier f(x; θ) → y. Testing: the classifier assigns labels to unlabeled documents {(x_i, ?)}, producing predictions {(x_i, ŷ_i)}.)
Types of classification methods
• Model-less
– Instance-based classifiers
• Use observations directly
• E.g., kNN
(Diagram: the same training/testing pipeline as before, except that no model is estimated; test cases are answered by direct instance lookup in the labeled training set.)
Key: assuming similar items have similar class labels!
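A minimal sketch of such an instance-based classifier, assuming bag-of-words vectors, cosine similarity, and a hypothetical three-document training set; no model is estimated, and prediction is a direct lookup of the k most similar labeled instances:

from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, training, k=3):
    # training: list of (term-count dict, label); similar items get similar labels
    neighbors = sorted(training, key=lambda d: cosine(query, d[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [(Counter("game score team".split()), "sports"),
         (Counter("stock market price".split()), "business"),
         (Counter("team coach win".split()), "sports")]
print(knn_predict(Counter("team win score".split()), train, k=3))  # -> sports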
Types of classification methods
• Model-based
– Generative models
• Model the joint probability p(x, y)
• E.g., Naïve Bayes
– Discriminative models
• Directly estimate a decision rule/boundary
• E.g., SVM
Key: i.i.d. assumption!
Generative vs. discriminative models
• Binary classification as an example
(Figure: the generative model's view vs. the discriminative model's view of the same data.)
Generative vs. discriminative models
Generative
• Specifies the joint distribution
– A full probabilistic specification of all the random variables
• Dependence assumptions have to be specified for p(x|y) and p(y)
• Flexible; can also be used in unsupervised learning
Discriminative
• Specifies the conditional distribution
– Only explains the target variable
• Arbitrary features can be incorporated for modeling p(y|x)
• Needs labeled data; only suitable for (semi-)supervised learning
General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
(Example task: classify news articles into Political, Sports, or Entertainment.)
General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
1.1 How to represent the text documents?
1.2 Do we need all those features?
Feature construction for text categorization
• Vector space representation
– Standard procedure in document representation
– Features
• N-grams, POS tags, named entities, topics
– Feature value
• Binary (presence/absence)
• TF-IDF (many variants)
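A minimal sketch of constructing binary and TF-IDF feature values over a tiny hypothetical corpus; the log-based IDF below is just one of the many TF-IDF variants:

import math
from collections import Counter

docs = ["the team won the game", "the market fell today", "the team lost the game"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
N = len(tokenized)

# document frequency and inverse document frequency
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
idf = {t: math.log(N / df[t]) for t in vocab}

def tfidf_vector(doc):
    tf = Counter(doc)
    return [tf[t] * idf[t] for t in vocab]          # TF-IDF value per feature

def binary_vector(doc):
    return [1 if t in doc else 0 for t in vocab]    # presence/absence value

print(vocab)
print(binary_vector(tokenized[0]))
print([round(v, 2) for v in tfidf_vector(tokenized[0])])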
Recall MP1
• How many unigrams + bigrams are there in our controlled vocabulary?
– 130K on Yelp_small
• How many review documents do we have for training?
– 629K on Yelp_small
Very sparse feature representation!
Feature selection for text categorization
• Select the most informative features for model training
– Reduce noise in the feature representation
• Improves final classification performance
– Improve training/testing efficiency
• Lower time complexity
• Less training data needed
Feature selection methods
• Wrapper method
– Find the best subset of features for a particular classification method
(Figure, from Kohavi and John: the wrapper approach, in which the feature-subset search wraps around the same classifier that is used for evaluation.)
R. Kohavi and G. H. John, Artificial Intelligence 97 (1997), 273-324.
Feature selection methods
• Wrapper method
– Search in the whole space of feature subsets
• Sequential forward selection or genetic search can be used to speed up the search
Feature selection methods
• Wrapper method
– Considers all possible dependencies among the features
– Impractical for text categorization
• Cannot deal with a large feature set
• An NP-complete problem
– No direct relation between feature subset selection and evaluation
Feature selection methods
• Filter method
– Evaluates the features independently of the classifier and of the other features
• No indication of a classifier's performance on the selected features
• No dependency among the features is considered
– Feasible for very large feature sets
• Usually used as a preprocessing step
R. Kohavi and G. H. John, Artificial Intelligence 97 (1997), 273-324.
Feature scoring metrics
• Document frequency
– Rare words are non-influential for global prediction; removing them reduces the vocabulary size
(Figure: word frequency distribution; remove head words at one end and rare words at the other.)
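A minimal sketch of document-frequency-based filtering, with hypothetical thresholds min_df and max_df_ratio standing in for the "rare word" and "head word" cut-offs:

from collections import Counter

def df_filter(tokenized_docs, min_df=2, max_df_ratio=0.9):
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                      # count each term once per document
    return {t for t, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}

docs = [["the", "team", "won"], ["the", "market", "fell"], ["the", "team", "lost"]]
print(df_filter(docs))   # "the" removed as a head word, singletons removed as rare words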
Feature scoring metrics
• Information gain
– Decrease in the entropy of the categorical prediction when the feature is present vs. absent
(Figure: for an informative term, class uncertainty decreases when the term is observed; for an uninformative term, class uncertainty stays intact.)
Feature scoring metrics
• Information gain
– Decrease in the entropy of the categorical prediction when the feature is present or absent

IG(t) = −Σ_c P(c) log P(c)
        + P(t) Σ_c P(c|t) log P(c|t)
        + P(t̄) Σ_c P(c|t̄) log P(c|t̄)

– First term: entropy of the class label alone
– Second term: (negated, weighted) entropy of the class label if t is present
– Third term: (negated, weighted) entropy of the class label if t is absent
– P(c|t): probability of seeing class label c in documents where t occurs; P(c|t̄): probability of seeing class label c in documents where t does not occur
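A minimal sketch of computing IG(t) from per-document class labels and a term-presence indicator; the four-document example is hypothetical:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(labels, term_present):
    # IG(t) = H(C) - P(t) H(C|t) - P(t_bar) H(C|t_bar)
    with_t = [y for y, p in zip(labels, term_present) if p]
    without_t = [y for y, p in zip(labels, term_present) if not p]
    p_t = len(with_t) / len(labels)
    return entropy(labels) - p_t * entropy(with_t) - (1 - p_t) * entropy(without_t)

labels       = ["sports", "sports", "politics", "politics"]
term_present = [True,     True,     False,      False]       # e.g., term "game"
print(round(information_gain(labels, term_present), 3))      # -> 0.693 (= ln 2)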
Feature scoring metrics
• χ² statistics
– Test whether the distributions of two categorical variables are independent of one another
• H0: they are independent
• H1: they are dependent

              t      t̄
     c        A      B
     c̄        C      D

χ²(t, c) = (A + B + C + D)(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]

(Column sums: A + C = DF(t), B + D = N − DF(t); row sums: A + B = #documents in class c, C + D = #documents not in class c.)
Feature scoring metrics
• χ² statistics
– Test whether the distributions of two categorical variables are independent of one another
• Degrees of freedom = (#columns − 1) × (#rows − 1)
• Significance level α: reject H0 if the p-value < α

              t      t̄
     c        36     30
     c̄        14     25

χ²(t, c) = 105 × (36 × 25 − 14 × 30)² / (50 × 55 × 66 × 39) = 3.418

Look up the χ² distribution table for the threshold: DF = 1, α = 0.05 gives a threshold of 3.841. Since 3.418 < 3.841, we cannot reject H0, so t is not a good feature to choose.
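A minimal sketch reproducing the χ² computation above from the 2×2 contingency counts:

def chi_square(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

score = chi_square(36, 30, 14, 25)
print(round(score, 3))                 # -> 3.418
print(score > 3.841)                   # False: cannot reject H0 at alpha = 0.05, DF = 1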
Feature scoring metrics
• χ² statistics
– Test whether the distributions of two categorical variables are independent of one another
• Degrees of freedom = (#columns − 1) × (#rows − 1)
• Significance level α: reject H0 if the p-value < α
• For the features passing the threshold, rank them in descending order of their χ² values and choose the top k features
Feature scoring metrics
• χ² statistics with multiple categories
– χ²_avg(t) = Σ_c P(c) χ²(t, c)
• Expectation of χ² over all the categories
– χ²_max(t) = max_c χ²(t, c)
• Strongest dependency between the term and any single category
• Problem with χ² statistics
– The normalization breaks down for very-low-frequency terms
• χ² values become incomparable between high-frequency terms and very-low-frequency terms (the distribution assumption of the test becomes inappropriate)
Feature scoring metrics
• Many other metrics
– Mutual information
• Relatedness between term t and class c
PMI(t; c) = P(t, c) log[ P(t, c) / (P(t) P(c)) ]
– Odds ratio
• Odds of term t occurring with class c, normalized by the odds of t occurring without c
Odds(t; c) = [ P(t, c) / (1 − P(t, c)) ] × [ (1 − P(t, c̄)) / P(t, c̄) ]
– The same trick as in the χ² statistics applies to multi-class cases
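A minimal sketch of the PMI score estimated from document counts; the counts below are hypothetical:

import math

def pmi(n_tc, n_t, n_c, n_docs):
    # PMI(t; c) = P(t, c) * log( P(t, c) / (P(t) P(c)) )
    p_tc, p_t, p_c = n_tc / n_docs, n_t / n_docs, n_c / n_docs
    return p_tc * math.log(p_tc / (p_t * p_c))

# term appears in 40 docs, class has 50 docs, they co-occur in 35 of 100 docs
print(round(pmi(n_tc=35, n_t=40, n_c=50, n_docs=100), 3))   # positive -> related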
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. JMLR, 3, 1289-1305.
A graphical analysis of feature selection
• Isoclines for each feature scoring metric
– Machine-learning papers vs. other CS papers
(Figure: isoclines of the scoring metrics over term-count space, shown zoomed in and after stopword removal.)
Effectiveness of feature selection methods
• On a multi-class classification dataset
– 229 documents, 19 classes
– Binary features, SVM classifier
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. JMLR, 3, 1289-1305.
Empirical analysis of feature selection methods
• Text corpora
– Reuters-22173
• 13,272 documents, 92 classes, 16,039 unique words
– OHSUMED
• 3,981 documents, 14,321 classes, 72,076 unique words
• Classifiers: kNN and LLSF
General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
2.1 What is the unique property of this problem?
2.2 What type of classifier should we use?
Model specification
• Specify dependency assumptions
– Linear relation between x and y
• wᵀx → y
• Features are assumed independent of each other
– Naïve Bayes, linear SVM
– Non-linear relation between x and y
• f(x) → y, where f(·) is a non-linear function of x
• Features are not independent of each other
– Decision tree, kernel SVM, mixture model
• Choose based on our domain knowledge of the problem (we will discuss these choices later)
General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
3.1 How to estimate the parameters of the selected model?
3.2 How to control the complexity of the estimated model?
Model estimation and selection
• General philosophy
– Loss minimization
E[L] = L_{1,0} P(y = 1) ∫_{R0} p(x|y = 1) dx + L_{0,1} P(y = 0) ∫_{R1} p(x|y = 0) dx
– L_{1,0}: penalty for misclassifying c1 as c0; L_{0,1}: penalty for misclassifying c0 as c1
– R0 and R1 are the regions where we assign x to class 0 and class 1, respectively
– In practice, this is estimated empirically from the training set (empirical loss)
Key assumption: Independent and Identically Distributed!
Empirical loss minimization
• Overfitting
– Good empirical loss, terrible generalization loss
– High model complexity: prone to overfitting the noise
(Figure: the underlying dependency is linear, but an over-complicated polynomial assumption fits the noise.)
Generalization loss minimization
• Avoid overfitting
– Measure model complexity as well
– Model selection and regularization
(Figure: error vs. model complexity; the error on the training set keeps decreasing, while the error on the testing set eventually rises again.)
Generalization loss minimization
• Cross validation
– Avoid noise in the train/test separation
– k-fold cross-validation
1. Partition all training data into k equal-sized disjoint subsets;
2. Leave one subset out for validation and use the other k−1 for training;
3. Repeat step (2) k times, with each of the k subsets used exactly once as the validation data.
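A minimal sketch of k-fold cross-validation; the train_and_score callback and the toy data are hypothetical placeholders for a real classifier and corpus:

import random

def k_fold_cv(data, k, train_and_score):
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k equal-sized disjoint subsets
    scores = []
    for i in range(k):
        validation = folds[i]                         # each fold is held out exactly once
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(training, validation))
    return sum(scores) / k                            # average validation performance

# toy usage with a dummy scorer on integer "documents"
data = list(range(20))
print(k_fold_cv(data, k=5, train_and_score=lambda tr, va: len(va) / (len(tr) + len(va))))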
Generalization loss minimization
• Cross validation
– Avoid noise in the train/test separation
– k-fold cross-validation
• Choose the model (among different models, or the same model with different settings) that has the best average performance on the validation sets
• Some statistical test is needed to decide whether one model is significantly better than another (covered shortly)
General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
4.1 How to judge the quality of the learned model?
4.2 How can you further improve the performance?
Classification evaluation
• Accuracy
– Percentage of correct predictions over all predictions, i.e., the fraction of cases where ŷ = y
– Limitation
• Highly skewed class distributions, e.g., P(y = 1) = 0.99
– Trivial solution: predict every testing case as positive
– The classifiers' capability is then differentiated by only 1% of the testing cases
Evaluation of binary classification
• Precision
– Fraction of documents predicted positive that are indeed positive, i.e., P(y = 1 | ŷ = 1)
• Recall
– Fraction of truly positive documents that are predicted positive, i.e., P(ŷ = 1 | y = 1)

                 y = 1                   y = 0
   ŷ = 1    true positive (TP)     false positive (FP)
   ŷ = 0    false negative (FN)    true negative (TN)

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)
Evaluation of binary classification
(Figure: the weighted class-conditional densities p(x|y = 1)P(y = 1) and p(x|y = 0)P(y = 0) with the optimal Bayes decision boundary; the areas on either side of the boundary correspond to the true positives, true negatives, false positives, and false negatives.)
Precision and recall trade-off
• Precision decreases as the number of documents predicted positive increases (unless the classification is perfect), while recall keeps increasing
• These two metrics emphasize different perspectives of a classifier
– Precision: prefers a classifier that recognizes fewer documents but is highly accurate on them
– Recall: prefers a classifier that recognizes more of the positive documents
Summarizing precision and recall
• With a single value
– In order to compare different classifiers
– F-measure: weighted harmonic mean of precision and recall; α balances the trade-off

F = 1 / [ α(1/P) + (1 − α)(1/R) ]
F1 = 2 / (1/P + 1/R)    (equal weight on precision and recall)

– Why the harmonic mean?
• Classifier 1: P = 0.53, R = 0.36
• Classifier 2: P = 0.01, R = 0.99

                   Harmonic mean    Arithmetic mean
   Classifier 1        0.429             0.445
   Classifier 2        0.019             0.500
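A minimal sketch contrasting the harmonic mean (F-measure) with the arithmetic mean on the two classifiers above:

def f_measure(p, r, alpha=0.5):
    # weighted harmonic mean; alpha = 0.5 gives the usual F1
    return 1.0 / (alpha / p + (1 - alpha) / r)

for name, (p, r) in {"Classifier 1": (0.53, 0.36), "Classifier 2": (0.01, 0.99)}.items():
    print(name, round(f_measure(p, r), 3), round((p + r) / 2, 3))
# Classifier 1 0.429 0.445
# Classifier 2 0.02 0.5   <- the harmonic mean exposes the useless precision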
Summarizing precision and recall
• With a curve
1. Order all the testing cases by the classifier's prediction score (assuming the higher the score, the more likely the case is positive);
2. Scan through the testing cases: treat all cases above the current one (including itself) as positive and all cases below it as negative; compute precision and recall;
3. Plot the precision and recall computed for each testing case in step (2).
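A minimal sketch of the three steps above, turning prediction scores and true labels into precision-recall points; the scores and labels are hypothetical:

def pr_curve(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])   # step 1: order by score
    total_pos = sum(labels)
    points, tp = [], 0
    for i, (_, y) in enumerate(ranked, start=1):                # step 2: scan cut-offs
        tp += y
        precision, recall = tp / i, tp / total_pos
        points.append((recall, precision))                      # step 3: one point per case
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   1,   0,   1,   0,   0]
print(pr_curve(scores, labels))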
Summarizing precision and recall
• With a curve
– A.k.a. the precision-recall curve
– Area Under the Curve (AUC)
– At each recall level, we prefer a higher precision
Multi-class categorization
• Confusion matrix
– A generalized contingency table for precision and recall
Statistical significance tests
• How confident are you that an observed difference doesn't simply result from the particular train/test separation you chose?

   Fold      Classifier 1    Classifier 2
   1            0.20            0.18
   2            0.21            0.19
   3            0.22            0.21
   4            0.19            0.20
   5            0.18            0.37
   Average      0.20            0.23
Background knowledge
• The p-value in a statistical test is the probability of obtaining data at least as extreme as what was observed, if the null hypothesis were true (e.g., if the observation were totally random)
• If the p-value is smaller than the chosen significance level (α), we reject the null hypothesis (i.e., conclude the observation is not random)
• We seek to reject the null hypothesis (i.e., to show that the observation is not a random result), so small p-values are good
Paired t-test
• Paired t-test
– Tests whether two sets of observations are significantly different from each other
• In k-fold cross-validation, the different classifiers are applied to the same train/test separations
– Null hypothesis: the difference between the two responses measured on the same statistical unit has a zero mean value
• One-tailed vs. two-tailed?
– If you aren't sure, use two-tailed
Statistical significance test
(Figure: distribution of the paired per-fold differences between Algorithm 1 and Algorithm 2 under the null hypothesis, centered at 0, with the central 95% of outcomes marked.)

   Fold      Classifier A    Classifier B    Difference
   1            0.20            0.18           +0.02
   2            0.21            0.19           +0.02
   3            0.22            0.21           +0.01
   4            0.19            0.20           −0.01
   5            0.18            0.37           −0.19
   Average      0.20            0.23

Paired t-test: p = 0.4987 (not significant at α = 0.05)
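A minimal sketch of running the paired t-test on the per-fold results above with scipy (assuming scipy is available); it should reproduce p ≈ 0.4987:

from scipy import stats

classifier_a = [0.20, 0.21, 0.22, 0.19, 0.18]
classifier_b = [0.18, 0.19, 0.21, 0.20, 0.37]

# the test is applied to the paired per-fold differences (same train/test splits)
t_stat, p_value = stats.ttest_rel(classifier_a, classifier_b)
print(round(t_stat, 3), round(p_value, 4))   # small |t|, p ~ 0.50: not significant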
What you should know
• Bayes decision theory
– Bayes risk minimization
• General steps for text categorization
– Text feature construction
– Feature selection methods
– Model specification and estimation
– Evaluation metrics
Today's reading
• Introduction to Information Retrieval
– Chapter 13: Text classification and Naive Bayes
• 13.1 The text classification problem
• 13.5 Feature selection
• 13.6 Evaluation of text classification