
A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004


Page 1: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM

Lan Man
3 Nov, 2004

Page 2: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Synopsis: Purpose of this work; Experiment Design; Results and Discussions; Conclusions

Page 3: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Purpose of this work: Text categorization is the task of assigning unlabelled documents into predefined categories. Classifiers include kNN, Decision Tree, Neural Network, Naïve Bayes, Linear Regression, SVM, Perceptron, Rocchio, etc., as well as classifier committees, Bagging and Boosting. SVM has shown rather good performance.

Page 4: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Purpose of this work (Cont.)

Biblio          | Term Weighting | Kernel of SVM    | Data Collection               | Performance Evaluation
Dumais, 1998    | Binary         | Linear           | Reuters-21578 top 10          | .92 (microaveraged breakeven point)
Joachims, 1998  | tf.idf         | Polynomial & RBF | Reuters-21578 top 90          | .86 & .864 (microaveraged breakeven point)
Dai, 2003       | logtf.idf      | Linear           | Partial Reuters-21578 top 10  | .9402 (F1)
...             | ...            | ...              | ...                           | ...

Page 5: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Purpose of this work (Cont.)

Does the difference in performance come from different text representations or from different kernel functions of SVM?

[Leopold, 2002] points out that it is the text representation scheme, rather than the kernel function of SVM, that dominates the performance of text categorization.

Page 6: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Purpose of this work (Cont.)

Therefore, choosing an appropriate term weighting scheme is more important than choosing and tuning kernel functions of SVM for the text categorization task.

However, previous work is not sufficient to draw a definite conclusion about which term weighting scheme is better for SVM.

Page 7: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Purpose of this work (Cont.)

Different data preparation: stemming, stop words, feature selection, term weighting schemes

Different data collections: Reuters (whole, top 10, top 90, partial top 10)

Different classifiers with various parameters

Different performance evaluation

Page 8: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Purpose of this work (Cont.)

Our study focuses on the various term weighting schemes for SVM.

The reasons for choosing the linear kernel function: it is simple and fast, and based on our preliminary experiments and previous studies, linear models perform better than non-linear models even when handling high-dimensional data (see the sketch below).

Comparison of term weighting schemes, rather than the choosing and tuning of kernel functions, is our current work.
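As a concrete picture of this setup, here is a minimal sketch of a one-vs-rest linear-kernel SVM run, assuming scikit-learn's LinearSVC as a modern stand-in for the SVM package used in the original experiments; the variable and function names are ours.

```python
# Minimal sketch (assumption: scikit-learn as a stand-in for the SVM package
# actually used in the experiments). One binary linear-kernel SVM is trained
# per category, as in a one-vs-rest setup.
from sklearn.svm import LinearSVC

def train_one_vs_rest(X_train, y_train_per_category):
    """X_train: (n_docs, n_terms) matrix of term weights (e.g. tf.rf values).
    y_train_per_category: dict mapping category name -> list of 0/1 labels."""
    classifiers = {}
    for category, labels in y_train_per_category.items():
        clf = LinearSVC(C=1.0)       # linear kernel: simple and fast
        clf.fit(X_train, labels)     # one binary classifier per category
        classifiers[category] = clf
    return classifiers
```

In this arrangement, the only part that changes between the compared runs is how the entries of X_train are weighted.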

Page 9: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Term Weighting Schemes

10 different term weighting schemes were selected due to their reported superior classification results or because they are typical representations used with SVM. They are: binary, tf, logtf, ITF, idf, tf.idf, logtf.idf, tf.idf-prob, tf.chi, tf.rf.

Page 10: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Term Weighting Schemes

The following four are related to term frequency alone:
binary: 1 for term present and 0 for term absent in a vector
tf: # of times a term occurs in a document
logtf: 1 + log(tf), where the log is used to mend unfavorable linearity
ITF: 1 - r/(r+tf), usually r=1 (inverse term frequency, presented by Leopold)
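As a concrete reading of these four definitions, a small sketch in plain Python (the function names are ours, not from the original work):

```python
import math

def binary(tf):
    # 1 if the term is present, 0 otherwise
    return 1 if tf > 0 else 0

def raw_tf(tf):
    # number of times the term occurs in the document
    return tf

def logtf(tf):
    # 1 + log(tf); 0 for absent terms, to keep the weight finite
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def itf(tf, r=1.0):
    # inverse term frequency (Leopold): 1 - r/(r + tf)
    return 1.0 - r / (r + tf)
```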

Page 11: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Term Weighting Schemes

The following four are related to the idf factor:
idf: log(N/ni), where N is the # of docs and ni is the # of docs which contain term ti
tf.idf: the widely-used term representation
logtf.idf: (1 + log(tf)) * idf
tf.idf-prob: idf-prob = log((N-ni)/ni), an approximate representation of the term relevance weight, also called probabilistic idf
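A matching sketch of the idf-related schemes, under the same assumptions (plain Python; n_docs and doc_freq are our names for N and ni):

```python
import math

def idf(n_docs, doc_freq):
    # idf = log(N / ni)
    return math.log(n_docs / doc_freq)

def idf_prob(n_docs, doc_freq):
    # probabilistic idf = log((N - ni) / ni)
    return math.log((n_docs - doc_freq) / doc_freq)

def tf_idf(tf, n_docs, doc_freq):
    return tf * idf(n_docs, doc_freq)

def logtf_idf(tf, n_docs, doc_freq):
    logtf = 1.0 + math.log(tf) if tf > 0 else 0.0
    return logtf * idf(n_docs, doc_freq)

def tf_idf_prob(tf, n_docs, doc_freq):
    return tf * idf_prob(n_docs, doc_freq)
```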

Page 12: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Term Weighting Schemes

tf.chi: a representative of the schemes that combine tf with feature selection measures (chi^2, information gain, odds-ratio, gain ratio, etc.)
tf.rf: newly proposed by us; relevance frequency (rf) = log(1 + ni/ni_), where ni is the # of docs which contain term ti and ni_ is the # of negative docs which contain term ti
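A sketch of the proposed tf.rf weight as defined above (helper names are ours; the max(1, ...) guard anticipates the zero-denominator case addressed on the next slides):

```python
import math

def rf(n_docs_with_term, n_neg_docs_with_term):
    # relevance frequency: rf = log(1 + ni / ni_), guarding against ni_ = 0
    return math.log(1.0 + n_docs_with_term / max(1, n_neg_docs_with_term))

def tf_rf(tf, n_docs_with_term, n_neg_docs_with_term):
    # tf.rf weight of one term in one document
    return tf * rf(n_docs_with_term, n_neg_docs_with_term)
```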

Page 13: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Analysis of Discriminating Power: Different Formulas

Let a be the # of positive docs containing the term, b the # of positive docs not containing it, c the # of negative docs containing it, and d the # of negative docs not containing it, so N = a+b+c+d (and typically d >> a, b, c).
idf = log(N/(a+c))
chi^2 = N*((ad-bc)^2) / ((a+c)(b+d)(a+b)(c+d))
idf-prob = log((b+d)/(a+c))
rf = log(2 + a/c); to avoid c = 0, we set rf = log(2 + a/max(1,c))
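The four factors written out from these contingency counts, as a direct transcription of the formulas above rather than code from the original work:

```python
import math

def factors(a, b, c, d):
    """a: positive docs containing the term, b: positive docs without it,
    c: negative docs containing the term, d: negative docs without it."""
    N = a + b + c + d
    idf = math.log(N / (a + c))
    chi2 = N * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))
    idf_prob = math.log((b + d) / (a + c))
    rf = math.log(2.0 + a / max(1, c))   # max(1, c) avoids division by zero
    return idf, chi2, idf_prob, rf
```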

Page 14: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Analysis of Discriminating Power (Cont.): Assume the six terms have the same tf value. The first three terms have the same idf1, and the last three have the same idf2.

idf = log(N/(a+c)); idf1 > idf2; N = a+b+c+d

Page 15: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Analysis of Discriminating Power (Cont.): Given idf1 > idf2, the classical tf.idf gives more weight to the first three terms than to the last three.

But t1 has more discriminating power than t2 and t3 in the positive category, so the tf.idf representation may lose its discriminating power. We therefore propose the new factor relevance frequency: rf = log(1 + (a+c)/c).
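A small worked example of this point, with hypothetical counts chosen only for illustration: two terms with the same document frequency receive identical idf, while rf separates them according to how they are spread over the positive and negative categories.

```python
import math

N = 10000                        # total documents (hypothetical corpus)

def idf(a, c):                   # a + c = document frequency
    return math.log(N / (a + c))

def rf(a, c):                    # rf = log(1 + (a + c)/c) = log(2 + a/c)
    return math.log(1.0 + (a + c) / max(1, c))

# t1 concentrated in the positive category, t2 spread evenly:
# same idf, clearly different rf
print(idf(a=90, c=10), rf(a=90, c=10))   # idf ~ 4.61, rf ~ 2.40
print(idf(a=50, c=50), rf(a=50, c=50))   # idf ~ 4.61, rf ~ 1.10
```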

Page 16: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Benchmark Data Collection 1

Data Collection 1 – Reuters-21578 top 10: 7193 training and 2787 test documents
Remove stop words (292), punctuation and numbers
Porter stemming performed
Minimal term length is 4
Top p features per category selected using the chi-square metric (sketched below), p = {50, 150, 300, 600, 900, 1200, 1800, 2400, All}
Null vectors are removed
15959 terms
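A sketch of the per-category chi-square feature selection step referred to above (plain Python; the contingency structure and helper names are ours):

```python
def chi_square(a, b, c, d):
    # chi^2 for one (term, category) pair from the contingency counts
    N = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return N * (a * d - b * c) ** 2 / denom if denom else 0.0

def top_p_features(contingency, p):
    """contingency: dict term -> (a, b, c, d) for one category.
    Returns the p terms with the highest chi^2 score."""
    scored = sorted(contingency.items(),
                    key=lambda kv: chi_square(*kv[1]),
                    reverse=True)
    return [term for term, _ in scored[:p]]
```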

Page 17: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Summary of the Reuters-21578 Data Set

p    | #-Features | #-Trains | #-Tests
50   | 405        | 6123     | 2381
150  | 1217       | 6240     | 2422
300  | 2425       | 6318     | 2452
600  | 4938       | 6364     | 2468
900  | 7007       | 6410     | 2479
1200 | 9045       | 6423     | 2486
1800 | 11142      | 6456     | 2510
2400 | 12741      | 6469     | 2512
All  | 15937      | 6489     | 2519

Page 18: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Benchmark Data Collection 2

Data Collection 2 – 20 Newsgroups: 200 training and 100 test documents per category, 20 categories; 4000 training and 2000 test documents in total
Remove stop words, punctuation and numbers
Minimal term length is 4
Top p features per category selected using the chi-square metric, p = {5, 25, 50, 75, 100, 150, 200, 250, 300, 400, 500}
Null vectors are removed
50088 terms

Page 19: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Summary of the 20 Newsgroups Data Set

p   | #-Features | #-Trains | #-Tests
50  | 991        | 3813     | 1861
75  | 1483       | 3886     | 1918
100 | 1973       | 3933     | 1940
150 | 2966       | 3961     | 1961
200 | 3955       | 3974     | 1973
250 | 4938       | 3981     | 1980
300 | 5901       | 3987     | 1985
400 | 7856       | 3992     | 1994
500 | 9803       | 3996     | 1995

Page 20: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Comparison of the Two Data Sets

Reuters: skewed category distribution. Among the 7193 training documents, the most common category (earn) contains 2877 documents (40%), while 80% of the categories have less than 7.5% of the training samples.
20 Newsgroups: uniform distribution. We selected the first 200 training and the first 100 test documents per category based on the 20news-bydate partition, giving 200 positive and 3800 negative samples per chosen category.

Page 21: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Performance Measure
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
Precision/recall breakeven point: tune the classifier parameter to yield the hypothetical point at which precision and recall are equal.

Page 22: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

McNemar’s significance test
Two classifiers f1 and f2 are based on two term weighting schemes.
Contingency table:

n00: # of examples misclassified by both f1 and f2
n01: # of examples misclassified by f1 but not by f2
n10: # of examples misclassified by f2 but not by f1
n11: # of examples correctly classified by both f1 and f2

Page 23: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

McNemar’s Significance Test
If the two classifiers have the same error rate, then n10 = n01.
chi = (|n10 - n01| - 1)^2 / (n01 + n10) is approximately distributed as chi^2 with 1 degree of freedom.
If the null hypothesis is correct, then the probability that this quantity is greater than chi^2(1, 0.99) = 6.64 is less than 0.01 (significance level alpha).

Page 24: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Results on the Reuters

Page 25: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Results on the Reuters

Observation: the break-even point increases as the number of features grows. All schemes reach a maximum value at the full vocabulary, and the best BEP is 0.9272, achieved by the tf.rf scheme.

Page 26: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Significance Test Results on Reuters

#-features | McNemar’s Test
200        | {tf.chi} << {all the others}
400-1500   | {binary, tf.chi} << {all the others}
2500       | {binary, tf.chi} << {idf, tf.idf, tf.idf-prob} < {all the others}
5000+      | {binary, idf, tf.chi} << {tf.idf, logtf.idf, tf.idf-prob} << {tf, logtf, ITF} < {tf.rf}

'A < B' and 'A << B' denote that B performs better than A at significance levels 0.01 and 0.001, respectively; '{}' groups schemes with no significant difference.

Page 27: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Results on the 20 Newsgroups

Page 28: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Results on the 20 Newsgroups

Observation: the trends are not monotonically increasing. All schemes reach a maximum value at a small vocabulary size, in the range of 1000 to 3000 features. The best BEP is 0.6743, achieved by the tf.rf scheme.

Page 29: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Significance Tests on the 20 Newsgroups

#-features  | McNemar’s Test
100-500     | {tf.chi} << {all the others}
1000        | {tf.chi} << {binary} << {all the others}
1500        | {tf.chi} << {binary} < {all the others} < {ITF, idf, tf.rf}
2000        | {tf.chi, binary} << {all the others} < {ITF, tf.rf}
3000-5000   | {binary, tf.chi} << {all the others} < {tf.rf}
6000-10000  | {binary} << {all the others} << {tf.rf}

Page 30: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Discussion
To achieve a high break-even point, different vocabulary sizes are required for the two data sets.
Reuters: diverse subject matters per category with overlapping vocabularies; large vocabularies are required.
20 Newsgroups: a single narrow subject per category with a limited vocabulary; 50-100 features per category are sufficient.

Page 31: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Discussion
tf.rf shows significantly better performance than the other schemes on the two different data sets. Both of the best break-even points are achieved by the tf.rf scheme, whether the category distribution is skewed or uniform. The significance tests support this observation.

Page 32: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Discussion
There is no evidence that the idf factor adds to a term's discriminating power for text categorization when combined with the tf factor.
Reuters: tf, logtf and ITF achieve higher break-even points than the schemes combined with idf (tf.idf, logtf.idf and tf.idf-prob).
20 Newsgroups: the differences between tf alone, idf alone, or both combined are not significant.
Hence, the idf factor adds no discriminating power, and may even decrease a term's discriminating power.

Page 33: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Discussion
Binary and tf.chi show consistently worse performance than the other schemes.
The binary scheme ignores the frequency information, which is crucial to representing the content of the document.
The feature selection metric chi^2 involves the d value, where d >> a, b and c. The d value dominates the chi^2 value and may not appropriately express a term's discriminating power.

Page 34: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Discussion
In particular, the ITF scheme has comparably good performance on the two data sets, but is still worse than the tf.rf scheme.

Page 35: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Conclusions
Our newly proposed tf.rf shows significantly better performance than the other schemes on two widely-used data sets with different category distributions.
Schemes related to tf alone (tf, logtf, ITF) show rather good performance, but are still worse than the tf.rf scheme.

Page 36: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Conclusions
The idf and chi factors, which take the collection distribution into consideration, do not improve, and may even decrease, a term's discriminating power for categorization.
Binary and tf.chi significantly underperform the other schemes.

Page 37: A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

Thanks for your time and participation!