
Page 1: UMass Amherst at TDT 2003

UMass Amherst at TDT 2003

James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst

Page 2: UMass Amherst at TDT 2003

What we did

Tasks:
• Story Link Detection
• Topic Tracking
• New Event Detection
• Cluster Detection

Page 3: UMass Amherst at TDT 2003

Outline

• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models

Page 4: UMass Amherst at TDT 2003

ROI motivation

• Analyzed vector space similarity measures
  – Failed to distinguish between similar topics, e.g. two “health care” stories from different topics
    · different locations and individuals
    · similarity dominated by “health care” terms: drugs, cost, coverage, plan, prescription
• Possible solution: first categorize stories
  – different category → different topics (mostly true)
  – use within-category statistics, so “health care” may be less confusing
• Rules of Interpretation provide natural categories

Page 5: UMass Amherst at TDT 2003

ROI intuition

• Each document in the corpus is classified into one of the ROI categories
• Stories in different ROIs are less likely to be in the same topic
• If two stories belong to different ROIs, we should trust their similarities less

[Figure: ROI-tagged corpus]
• Same ROI:       sim_new(s1, s2) = sim_old(s1, s2)
• Different ROIs: sim_new(s1, s2) < sim_old(s1, s2)

Page 6: UMass Amherst at TDT 2003

ROI classifiers

• Naïve Bayes
• BoosTexter [Schapire and Singer, 2000]
  – Decision tree classifier
  – Generates and combines simple rules
  – Features are terms with tf as weights
• Used most likely single class
  – Explored distribution of all classes; unable to do so successfully
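For concreteness, a minimal sketch of an ROI classifier over tf-weighted term features; a multinomial Naïve Bayes stands in for the classifiers named above, and the toy stories and ROI labels are illustrative, not drawn from the TDT corpora.

from collections import Counter, defaultdict
import math

def train_naive_bayes(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes ROI classifier.
    docs: list of token lists; labels: parallel list of ROI names."""
    class_term_counts = defaultdict(Counter)
    class_doc_counts = Counter(labels)
    vocab = set()
    for tokens, roi in zip(docs, labels):
        class_term_counts[roi].update(tokens)
        vocab.update(tokens)
    model = {"prior": {}, "cond": {}, "vocab": vocab}
    total_docs = len(docs)
    for roi in class_doc_counts:
        model["prior"][roi] = math.log(class_doc_counts[roi] / total_docs)
        total_terms = sum(class_term_counts[roi].values())
        model["cond"][roi] = {
            w: math.log((class_term_counts[roi][w] + alpha) /
                        (total_terms + alpha * len(vocab)))
            for w in vocab
        }
    return model

def classify_roi(model, tokens):
    """Return the single most likely ROI for a story (tf-weighted terms)."""
    tf = Counter(t for t in tokens if t in model["vocab"])
    best_roi, best_score = None, float("-inf")
    for roi, prior in model["prior"].items():
        score = prior + sum(cnt * model["cond"][roi][w] for w, cnt in tf.items())
        if score > best_score:
            best_roi, best_score = roi, score
    return best_roi

# Toy example (illustrative ROI labels only)
docs = [["election", "vote", "party"], ["earthquake", "rescue", "aid"]]
labels = ["Elections", "Natural Disasters"]
model = train_naive_bayes(docs, labels)
print(classify_roi(model, ["vote", "party", "campaign"]))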

Page 7: UMass Amherst at TDT 2003

Training Data for Classification

• Experiments: train on TDT-2, test on TDT-3
• Submissions: train on TDT-2 plus TDT-3
• Training data prepared the same way
  – Stories in each topic tagged with the topic’s ROI
  – Remove duplicate stories (in topics with the same ROI)
  – Remove all stories with more than one ROI
    · Worst case: a single story relevant to…
      “Chinese Labor Activists” with ROI Legal/Criminal Cases,
      “Blair Visits China in October” with ROI Political/Diplomatic Mtgs.,
      “China will not allow Opposition Parties” with ROI Miscellaneous
• Experiments with removing named entities for training

Page 8: UMass Amherst at TDT 2003

Naïve Bayes vs. BoosTexter

• Similar classification accuracy
  – Overall accuracy is the same
  – Errors are substantially different
• Our training results (TDT-3): BoosTexter beat Naïve Bayes for SLD and NED
  – BoosTexter used in most tasks for submission
• Evaluation results: in Link Detection, using Naïve Bayes was more useful

Page 9: UMass Amherst at TDT 2003

ROI classes in link detection

• Given a story pair and their estimated ROIs
  – If the estimated ROIs are the same, leave the score alone
  – If they are different, reduce the score
    · Reduced to 1/3 of the original value, based on training runs
• Used four different ROI classifiers
  – ROI-BT, ne: BoosTexter with named entities
  – ROI-BT, no-ne: BoosTexter without named entities
  – ROI-NB, ne: Naïve Bayes with named entities
  – ROI-NB, no-ne: Naïve Bayes without named entities
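A minimal sketch of this score adjustment, assuming a cosine similarity function and an ROI classifier are supplied by the caller; the 1/3 reduction factor is the value tuned on the training runs above.

def roi_adjusted_similarity(story_a, story_b, cosine_sim, classify_roi,
                            penalty=1.0 / 3.0):
    """Leave the similarity alone when the estimated ROIs agree;
    otherwise reduce it (to 1/3 of its value in the submitted runs)."""
    score = cosine_sim(story_a, story_b)
    if classify_roi(story_a) != classify_roi(story_b):
        score *= penalty
    return score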

Page 10: UMass Amherst at TDT 2003

Training effectiveness (TDT-3)

Story Link Detection, minimum normalized cost, for various types of databases:

                 1Dcos    4Dcos    UDcos
original         0.3536   0.2556   0.3254
ROI-BT, ne       0.2959   0.2360   0.2748
ROI-BT, no-ne    0.4600   0.3670   0.4246
ROI-NB, ne       0.3724   0.3047   0.3380
ROI-NB, no-ne    0.4072   0.3269   0.3718

Page 11: UMass Amherst at TDT 2003

Evaluation results

Story Link Detection, minimum normalized cost, for various types of databases:

                 1Dcos    4Dcos    UDcos
original         0.2472   0.1983   0.2439
ROI-BT, ne       0.3090   0.2587   0.2938
ROI-BT, no-ne    0.3220   0.2649   0.3020
ROI-NB, ne       0.2867   0.2407   0.2697
ROI-NB, no-ne    0.2937   0.2463   0.2738

Page 12: UMass Amherst at TDT 2003

ROI for tracking

• Compare story to centroid of topic
  – Built from training stories
• If ROI does not match, drop the score based on how bad the mismatch is
• Used the ROI-BT, ne classifier only

[Figure: topic model and incoming story are ROI-classified; score_new is score_old, reduced (to 1/3) on an ROI mismatch]
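A sketch of ROI-penalized tracking against a topic centroid. The centroid here is a simple average of sparse term vectors, and the flat mismatch penalty is illustrative; the submitted runs graded the penalty by how bad the mismatch was, which is not detailed here.

import math
from collections import Counter

def centroid(vectors):
    """Average a list of sparse term->weight dicts into a topic centroid."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: v / len(vectors) for w, v in total.items()}

def cosine(u, v):
    """Cosine similarity of two sparse term->weight dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def track_score(story_vec, story_roi, topic_centroid, topic_roi,
                mismatch_penalty=1.0 / 3.0):
    """Cosine of story vs. topic centroid, reduced when ROIs disagree."""
    score = cosine(story_vec, topic_centroid)
    if story_roi != topic_roi:
        score *= mismatch_penalty  # illustrative flat penalty
    return score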

Page 13: UMass Amherst at TDT 2003

Training for tracking

Topic tracking on TDT-3, minimum normalized cost; ROI = BoosTexter with named entities only. Various types of databases:

                      1Dcos    4Dcos    ADcos    UDcos
Nt=1  orig            0.1890   0.1819   0.1390   0.1819
      ROI-BT, ne      0.1659   0.1489   0.1280   0.1541
Nt=4  orig            0.1427   0.1294   0.1076   0.1321
      ROI-BT, ne      0.1639   0.1314   0.1078   0.1494

Page 14: UMass Amherst at TDT 2003

Evaluation results

Topic tracking on TDT-4, minimum normalized cost; ROI = BoosTexter with named entities only. Various types of databases:

                      1Dcos    4Dcos    ADcos    UDcos
Nt=1  orig            0.1968   0.2149   0.2270   0.2604
      ROI-BT, ne      0.3965   0.3807   0.3572   0.5002
Nt=4  orig            0.1716   0.1610   0.1463   0.1988
      ROI-BT, ne      0.2996   0.2682   0.2525   0.3677

Page 15: UMass Amherst at TDT 2003

ROI-based vocabulary pruning

• New Event Detection only
• Create a “stop list” for each ROI
  – 300 most frequent terms in stories within the ROI
  – Obtained from the TDT-2 corpus
• When a story is classified into an ROI…
  – Remove those terms from the story’s vector
  – ROI determined from the BoosTexter classifier
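A sketch of this per-ROI pruning: build a 300-term frequent-word list for each ROI (from TDT-2 in the actual system) and strip those terms from a story once it has been classified. The data structures are placeholders.

from collections import Counter

def build_roi_stoplists(stories_by_roi, top_n=300):
    """For each ROI, take the top_n most frequent terms across its stories."""
    stoplists = {}
    for roi, stories in stories_by_roi.items():
        counts = Counter()
        for tokens in stories:
            counts.update(tokens)
        stoplists[roi] = {term for term, _ in counts.most_common(top_n)}
    return stoplists

def prune_story(tokens, roi, stoplists):
    """Remove the ROI's frequent terms from the story's term list."""
    stop = stoplists.get(roi, set())
    return [t for t in tokens if t not in stop]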

Page 16: UMass Amherst at TDT 2003

New Event Detection approach

• Cosine similarity measure
• ROI-based vocabulary pruning
• Score normalization
• Incremental IDF
• Remove short documents
• Preprocessing
  – Train BoosTexter on TDT-2 & TDT-3
  – Include named entities while training
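A minimal sketch of such an NED loop, assuming incremental IDF that is updated as each story arrives and a cosine comparison against all previously seen stories. Score normalization, ROI pruning, and the short-document filter are omitted, and the novelty threshold is illustrative.

import math
from collections import Counter

class IncrementalNED:
    def __init__(self, threshold=0.2):
        self.threshold = threshold      # novelty threshold (illustrative value)
        self.doc_freq = Counter()       # document frequency, updated incrementally
        self.num_docs = 0
        self.past_vectors = []

    def _vector(self, tokens):
        tf = Counter(tokens)
        # tf.idf with incremental IDF from the documents seen so far
        return {w: tf[w] * math.log((self.num_docs + 1) / (self.doc_freq[w] + 0.5))
                for w in tf}

    @staticmethod
    def _cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def process(self, tokens):
        """Return True if the story looks like a new event."""
        self.num_docs += 1
        self.doc_freq.update(set(tokens))
        vec = self._vector(tokens)
        max_sim = max((self._cosine(vec, old) for old in self.past_vectors),
                      default=0.0)
        self.past_vectors.append(vec)
        return max_sim < self.threshold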

Page 17: UMass Amherst at TDT 2003

NED Results

[Charts: New Event Detection results on TDT-3 and TDT-4]

Page 18: UMass Amherst at TDT 2003

ROI Conclusions

• Both uses of ROI helped in training
  – Score reduction for ROI mismatch (tracking and link detection)
  – Vocabulary pruning for new event detection
• Score reduction failed in evaluation
  – Named entities important in the ROI classifier
  – TDT-4 has a different set of entities (time gap)
  – Possible overfitting to TDT-3?
• Preliminary work applying to detection
  – Unsuccessful to date

Page 19: UMass Amherst at TDT 2003

Outline

• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models

Page 20: UMass Amherst at TDT 2003

Comparing multilingual stories

• Baseline
  – All stories converted to English
  – Using provided machine translations
• New approaches
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptation in tracking

Page 21: UMass Amherst at TDT 2003

Dictionary Translation of Arabic

• Probabilistic translation model
  – Each Arabic word has multiple English translations
  – Obtain P(e|a) from the UN Arabic-English parallel corpus
• Forms a pseudo-story in English representing the Arabic story
  – Can get large due to multiple translations per word
  – Keep the English words whose summed probabilities are the greatest
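A sketch of the pseudo-story construction: each Arabic word spreads probability mass over its English translations via P(e|a), and only the English words with the greatest summed probability are kept. The translation table and the cutoff of 100 terms are illustrative stand-ins for the values estimated from the UN parallel corpus.

from collections import Counter, defaultdict

def translate_story(arabic_tokens, p_e_given_a, keep=100):
    """Build an English pseudo-story from an Arabic story.
    p_e_given_a: dict mapping an Arabic word to {english_word: probability}."""
    english_mass = defaultdict(float)
    tf = Counter(arabic_tokens)
    for a_word, count in tf.items():
        for e_word, prob in p_e_given_a.get(a_word, {}).items():
            english_mass[e_word] += count * prob
    # keep only the English words with the greatest summed probability
    top = sorted(english_mass.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    return dict(top)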

Page 22: UMass Amherst at TDT 2003

Language-specific comparisons

• Language representations:
  – Arabic: CP1256 encoding and light stemming
  – English: stopped and stemmed with kstem
  – Chinese: segmented if necessary, and overlapping bigrams
• Linking task:
  – If stories are in the same language, use that language
  – All other comparisons done using all stories translated into English
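A small sketch of the overlapping-bigram representation used for Chinese, applied to a character sequence.

def overlapping_bigrams(text):
    """Turn a sequence of Chinese characters into overlapping bigrams,
    e.g. "ABCD" -> ["AB", "BC", "CD"]."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(overlapping_bigrams("新闻报道"))  # ['新闻', '闻报', '报道']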

Page 23: UMass Amherst at TDT 2003

Adaptation in tracking

• Adaptation: stories added to the topic when they have a high similarity score
• Establish a topic representation in each language as soon as an added story in that language appears
• Similarity of an Arabic story compared to the Arabic topic representation, etc.
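A sketch of per-language adaptive tracking as described above; centroid and cosine are as in the earlier tracking sketch. The fallback to the English (translated) representation when a language has no representation yet, and the adaptation threshold, are assumptions.

def adaptive_multilingual_track(topic_vectors_by_lang, story_vec, story_lang,
                                centroid, cosine, adapt_threshold=0.3):
    """Score a story against the topic representation in its own language,
    adapting the topic when the score is high enough."""
    # assumption: fall back to the English (translated) representation
    lang = story_lang if story_lang in topic_vectors_by_lang else "english"
    topic_vec = centroid(topic_vectors_by_lang[lang])
    score = cosine(story_vec, topic_vec)
    if score > adapt_threshold:   # threshold is illustrative
        # the first on-topic story in a language establishes its representation
        topic_vectors_by_lang.setdefault(story_lang, []).append(story_vec)
    return score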

Page 24: UMass Amherst at TDT 2003

Cross-Lingual Link Detection Results

Translation Condition    Minimum Cost (TDT-3)   Minimum Cost (TDT-4)   Cost (TDT-4)
1DcosIDF                 0.3536                 0.2472                 0.2523
UDcosIDF                 0.3254 (-8%)           0.2439 (-1%)           0.2597
4DcosIDF                 0.2556 (-28%)          0.1983 (-20%)          0.2000

Translation conditions:
• 1DcosIDF: baseline, all stories in English using the provided translations.
• UDcosIDF: all stories in English, but using the dictionary translation of Arabic.
• 4DcosIDF: a pair of stories is compared in its native language if both stories are in the same language; otherwise they are compared in English using the dictionary translation of Arabic.

Page 25: UMass Amherst at TDT 2003

Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman)

Translation Condition    Minimum Cost (TDT-3)   Minimum Cost (TDT-4)   Cost (TDT-4)
1DcosIDF                 0.1890                 0.1968                 0.1964
UDcosIDF                 0.1853 (-2%)           0.2024 (+3%)           0.2604
4DcosIDF                 0.1819 (-4%)           0.2036 (+3%)           0.2149
ADcosIDF                 0.1390 (-26%)          0.2007 (+2%)           0.2270

Translation conditions:
• 1DcosIDF: baseline.
• UDcosIDF: dictionary translation of Arabic.
• 4DcosIDF: comparing a pair of stories in their native language.
• ADcosIDF: baseline plus adaptation; add a story to the centroid vector if its similarity score exceeds the adaptation threshold, with the vector limited to the top 100 terms and at most 100 stories added to the centroid.

Page 26: UMass Amherst at TDT 2003

Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr)

Translation Condition    Minimum Cost (TDT-3)   Minimum Cost (TDT-4)   Cost (TDT-4)
1DcosIDF                 0.1427                 0.1676                 0.1716
UDcosIDF                 0.1321 (-7%)           0.1594 (-5%)           0.1988
4DcosIDF                 0.1294 (-9%)           0.1501 (-10%)          0.1610
ADcosIDF                 0.1076 (-25%)          0.1443 (-14%)          0.1463

Translation conditions:
• 1DcosIDF: baseline.
• UDcosIDF: dictionary translation of Arabic.
• 4DcosIDF: comparing a pair of stories in their native language.
• ADcosIDF: baseline plus adaptation.

Page 27: UMass Amherst at TDT 2003

Outline

• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models

Page 28: UMass Amherst at TDT 2003

Relevance Models for SLD

• Relevance Model (RM): “model of stories relevant to a query”
• Algorithm: given stories A, B
  1. compute “queries” Q_A and Q_B
  2. estimate relevance models P(w|Q_A) and P(w|Q_B)
  3. compute divergence between the relevance models

Relevance model estimate:

  P(w|Q) = Σ_M P(w|M) P(M|Q)

Probability of observing tf_w occurrences of word w by chance, given its collection frequency cf_w, story length n, and collection size N (hypergeometric):

  P_chance(tf_w | cf_w, n, N) = [ C(cf_w, tf_w) · C(N − cf_w, n − tf_w) ] / C(N, n)
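A sketch of the link detection score under the algorithm above: estimate each story's relevance model by mixing smoothed document models P(w|M) weighted by P(M|Q), then compare the two relevance models with a symmetrized KL divergence. The Jelinek-Mercer smoothing, the choice of P(M|Q) proportional to the query likelihood P(Q|M), the use of whole stories as their own "queries", and the symmetrized divergence are simplifying assumptions, not the exact submitted configuration.

import math
from collections import Counter

def doc_model(tokens, collection_probs, lam=0.5):
    """Smoothed unigram model P(w|M) for one story."""
    tf = Counter(tokens)
    length = max(1, sum(tf.values()))
    return {w: lam * tf[w] / length + (1 - lam) * collection_probs.get(w, 1e-9)
            for w in set(tf) | set(collection_probs)}

def relevance_model(query_tokens, doc_models):
    """P(w|Q) = sum_M P(w|M) P(M|Q), with P(M|Q) proportional to P(Q|M)."""
    log_pq = [sum(math.log(m.get(w, 1e-9)) for w in query_tokens) for m in doc_models]
    top = max(log_pq)
    weights = [math.exp(lp - top) for lp in log_pq]
    z = sum(weights)
    rm = Counter()
    for wgt, m in zip(weights, doc_models):
        for w, p in m.items():
            rm[w] += (wgt / z) * p
    return rm

def sym_kl(p, q, eps=1e-9):
    """Symmetrized KL divergence between two word distributions."""
    vocab = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(w, eps) * math.log(a.get(w, eps) / b.get(w, eps))
                   for w in vocab)
    return kl(p, q) + kl(q, p)

def link_score(story_a_tokens, story_b_tokens, corpus_models):
    """Higher score (lower divergence) => the stories are more likely linked."""
    rm_a = relevance_model(story_a_tokens, corpus_models)  # the story serves as its "query"
    rm_b = relevance_model(story_b_tokens, corpus_models)
    return -sym_kl(rm_a, rm_b)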

Page 29: UMass Amherst at TDT 2003

Results: Story Link Detection

                    TDT-3    TDT-4
Cosine / tf.idf     .2551    .1983
Relevance Model     .1938    .1881
Rel. Model + ROI    .1862    .1863

Page 30: UMass Amherst at TDT 2003

Relevance Models for Tracking

1. Initialize:
   • set P(M|Q) = 1/Nt if M is a training doc
   • compute the relevance model as before
2. For each incoming story D:
   • score = divergence between P(w|D) and RM
   • if (score > threshold), add D to the training set and recompute RM
   • allow no more than k adaptations
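A runnable sketch of this adaptive loop, reusing sym_kl from the link detection sketch. Uniform P(M|Q) = 1/Nt mirrors the initialization above and is kept for adapted stories as well, which is a simplification; the threshold, the adaptation cap, and the sign convention for turning a divergence into a score are illustrative.

def uniform_relevance_model(doc_models):
    """RM with P(M|Q) = 1/|docs| over the current (possibly adapted) training set."""
    rm = {}
    weight = 1.0 / len(doc_models)
    for model in doc_models:
        for w, p in model.items():
            rm[w] = rm.get(w, 0.0) + weight * p
    return rm

def adaptive_rm_tracking(training_doc_models, stream, sym_kl,
                         threshold=-2.0, max_adaptations=100):
    """Track a topic with an adaptive relevance model.
    stream yields P(w|D) document models for incoming stories."""
    doc_models = list(training_doc_models)
    rm = uniform_relevance_model(doc_models)
    adaptations, decisions = 0, []
    for story_model in stream:
        # the slides score by divergence; negate so that larger = more on-topic
        score = -sym_kl(rm, story_model)
        on_topic = score > threshold              # illustrative threshold
        decisions.append(on_topic)
        if on_topic and adaptations < max_adaptations:
            doc_models.append(story_model)        # add D to the training set
            rm = uniform_relevance_model(doc_models)  # recompute RM
            adaptations += 1
    return decisions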

Page 31: UMass Amherst at TDT 2003

Results: Topic Tracking

                    TDT-3    TDT-4
Cosine / tf.idf     .1888    .1964
Language Model      .1481    .2122
Adaptive tf.idf     .1390    .2007
Relevance Model     .0953    .1784

Page 32: UMass Amherst at TDT 2003

Conclusions

• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models