Sentence Level Information Patterns for Novelty Detection


Xiaoyan Li
PhD in Computer Science, UMass Amherst

Visiting Assistant Professor
Department of Computer Science
Mount Holyoke College


Research Backgrounds and Interests

● Information Retrieval (IR)
  − [CIKM’03], [IPM’07], [ECIR’08]
● Novelty Detection (ND)
  − [IPM’07], [CIKM’05], [CIKM’06]
● Question Answering (QA)
  − [HLT’01], [SIGIR’03]
● Database Systems & Data Mining
  − MIS at Tsinghua, and now at MHC
● Bioinformatics
  − Now at MHC
● The Intersection of IR, QA, MIS and Data Mining
  − Organize/access data in relational databases and indexed free text
  − Answer users’ questions instead of matching query words


Outline

● What is Novelty Detection?
● Related Work
● Novelty and Information Patterns
  − New definition of “novelty”
  − Information patterns and analysis
● ip-BAND: An Information Pattern-Based Approach
  − Query analysis
  − Relevant sentence retrieval
  − Novel sentence detection
● Experiments and Results
● Conclusions and Future Work


What Is Novelty Detection?

car bomb

http://news.google.com/

Any car bomb events recently?

When and where did they happen?

[Screenshot: news search results for “Car bomb”, 16,400 results]

Car Bomb, Baghdad, Monday June 14th.

Car Bomb, Gaza Strip, Tuesday June 15th.



● Novelty Detection at the Event Level
  − Document is relevant to the query
  − Document discusses a new event
● Novelty Detection at the Sentence Level
  − A sentence is relevant to the query
  − A sentence has new information about an old event, or a sentence discusses a new event



● Task of a Novelty Detection System (NDS)
  − Given a query, an NDS retrieves a list of sentences, each of which is both relevant to the query and contains new information not covered by previous sentences.
  − Relevance judgment: independent of other sentences.
  − Novelty judgment: depends on previously delivered sentences.
● Goal of a Novelty Detection System
  − Let the user get useful information without going through redundant or non-relevant sentences


Related Work: Novelty Detection At Different Levels

● Novelty Detection at the Event Level
  − New event detection from Topic Detection and Tracking (TDT) research
  − Most techniques are based on:
      − Bag-of-words representation
      − Clustering algorithms



● Novelty Detection at the Event Level
● Novelty Detection at the Sentence Level
  − TREC novelty tracks (2002-2004)
  − New words appearing in a sentence contribute to its novelty score



● Novelty Detection at the Event Level
● Novelty Detection at the Sentence Level
● Novelty Detection in Other Applications
  − “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries”, Carbonell and Goldstein (SIGIR 1998)
  − “Novelty and Redundancy Detection in Adaptive Filtering”, Zhang, Callan and Minka (SIGIR 2002)
  − “Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval”, Zhai, Cohen and Lafferty (SIGIR 2003)



● Novelty Detection at the Sentence Level
  − Similarity functions in IR
  − New words contribute to novelty scores
  − High Sim(query, S) -> increase the relevance rank
  − High Sim(S, previous sentences) -> decrease the novelty rank
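
As a rough illustration of the similarity-based scheme on this slide, the following Python sketch scores one sentence against the query (relevance) and against earlier sentences (redundancy). Cosine similarity over raw term counts, and the helper names `tokenize`, `cosine` and `score_sentence`, are assumptions made for this example; the systems surveyed here use a variety of similarity functions.

```python
import math
from collections import Counter

_PUNCT = '.,;:!?"\'()[]'

def tokenize(text):
    """Lowercase whitespace tokenizer with light punctuation stripping (an assumption)."""
    tokens = (w.strip(_PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score_sentence(query, sentence, previous):
    """Return (relevance, redundancy) for one sentence.

    relevance  = Sim(query, S): higher means a better relevance rank.
    redundancy = max Sim(S, previous sentence): higher means a worse novelty rank.
    """
    q = Counter(tokenize(query))
    s = Counter(tokenize(sentence))
    relevance = cosine(q, s)
    redundancy = max((cosine(s, Counter(tokenize(p))) for p in previous), default=0.0)
    return relevance, redundancy
```

A sentence that repeats an earlier one verbatim gets a redundancy close to 1.0, while a sentence sharing no terms with the query gets a relevance of 0.0.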


Novelty Detection at the Sentence Level

[Diagram: given a query, a set of sentences is split into relevant and non-relevant sentences; the relevant sentences are then split into novel and redundant sentences]


Related Work -- Limitations

● Query 306 (TREC novelty track 2002)

− <title> “African Civilian Deaths”

− <description> “How many civilian non-combatants have been killed in the various civil wars in Africa?”

− <Narrative> A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.


Related Work -- Limitations

● Four Sentences:
  − Sentence 1 (Relevant): “It could not verify Somali claims of more than 100 civilian deaths”
  − Sentence 2 (Relevant): “Natal's death toll includes another massacre of 11 ANC [African National Congress] supporters”
  − Sentence 3 (Non-relevant): “Once the slaughter began, following the death of President Juvenal Habyarimana in an air crash on April 6, hand grenades were thrown into schools and churches that had given refuge to Tutsi civilians.”
  − Sentence 4 (Non-relevant): “A Ghana News Agency correspondent with the West African force said that rebels loyal to Charles Taylor began attacking the civilians shortly after the peace force arrived in Monrovia last Saturday to try to end the eight-month-old civil war.”


Related Work -- Motivations

● A Deeper Query Understanding
  − A query representation beyond keywords
  − Type of information required in addition to topical relevance (“number” for query 306)
● Determination of Novel Sentences
  − Topical relevance + the right type of information
  − New Words != New Information


Novelty and Information Patterns

● What is Novelty or New Information?

− Novelty or new information means new answers to the potential questions representing a user’s request or information need

− Two aspects
    − Query -> question(s)
    − New answers -> novel or new information


Information Patterns

• Information Patterns in Sentences
  - Indicators of answers to users’ questions
• Understanding Information Patterns
  - Sentence Lengths (SLs)
  - Named Entities (NEs)
  - Opinion Patterns (OPs)


Sentence Lengths (SLs)

• SL Observations:

• relevant sentences on average have more words than non-relevant sentences

• differences in SLs between novel and non-relevant sentences are slightly larger

Types of Sentences (S.)   TREC 2002: 49 topics      TREC 2003: 50 topics
                          # of S.     Length        # of S.     Length
Relevant                  1365        15.58         15557       13.1
Novel                     1241        15.64         10226       13.3
Non-relevant              55862       9.55          24263       8.5


Named Entities (NEs)

• NE Observation 1.

- The five most frequent types (>25%) of NE are: PERSON, ORGANIZATION, LOCATION, DATE and NUMBER.

TREC 2002 Novelty Track: Total S# = 57227, Total Rel# = 1365, Total Non-Rel# = 55862
TREC 2003 Novelty Track: Total S# = 39820, Total Rel# = 15557, Total Non-Rel# = 24263

                TREC 2002                          TREC 2003
NEs             Rel # (%)       Non-Rel # (%)      Rel # (%)        Non-Rel # (%)
PERSON          381 (27.91%)    13101 (23.45%)     6633 (42.64%)    7211 (29.72%)
ORGANIZATION    532 (38.97%)    17196 (30.78%)     6572 (42.24%)    9211 (37.96%)
LOCATION        536 (39.27%)    11598 (20.76%)     5052 (32.47%)    5168 (21.30%)
DATE            382 (27.99%)    6860 (12.28%)      3926 (25.24%)    4236 (17.46%)
NUMBER          444 (32.53%)    14035 (25.12%)     4141 (26.62%)    6573 (27.09%)
ENERGY          0 (0.00%)       5 (0.01%)          0 (0.00%)        0 (0.00%)
MASS            31 (2.27%)      1455 (2.60%)       34 (0.22%)       19 (0.08%)


Named Entities (NEs) in Opinion/Event Topics

• NE Observations 2 & 3:

- PERSON, LOCATION and DATE are more important than NUMBER and ORGANIZATION for relevance

- PERSON, LOCATION and DATE play a more important role in event topics than in opinion topics

The statistics of named entities in opinion and event topics (2003)

TREC 2003 Novelty Track, Event Topics: Total S# = 18705, Rel# = 7802, Non-Rel# = 10903
TREC 2003 Novelty Track, Opinion Topics: Total S# = 21115, Rel# = 7755, Non-Rel# = 13360

            Event Topics                       Opinion Topics
NEs         Rel # (%)        Non-Rel # (%)     Rel # (%)        Non-Rel # (%)
PERSON      3833 (49.13%)    3228 (29.61%)     2800 (36.11%)    3983 (29.81%)
LOCATION    3100 (39.73%)    2567 (23.54%)     1952 (25.17%)    2601 (19.47%)
DATE        2342 (30.02%)    1980 (18.16%)     1584 (20.43%)    2256 (16.89%)


Named Entities (NEs) in Novelty

• NE Observations 4 & 5:
  - Novel sentences have more new named entities than relevant but redundant sentences
  - PERSON, LOCATION, ORGANIZATION and DATE (POLD) NEs are more important for novelty

Previously unseen NEs and Novelty/Redundancy (TREC 2002 & UMass)

Types of Sentences   Total # of Sentences   # of Sentences w/ New NEs (%)   # of Queries
Novel S.             4170                   2801 (67.2%)                    101
Redundant S.         777                    355 (45.7%)                     75


Opinion Patterns (OPs)

• OP Observation

- There are more opinion sentences among relevant (and novel) sentences than among non-relevant sentences

Opinion patterns for 22 opinion topics (2003)

Sentences (S.)   Total # of S.   # of Opinion S. (and %)
Relevant         7755            3733 (48.1%)
Novel            5374            2609 (48.6%)
Non-relevant     13360           3788 (28.4%)


Opinion Patterns (OPs)

• An opinion pattern is detected in a sentence if it includes quotation marks or one or more opinion expressions indicating that it states an opinion

Opinion expressions: quotation marks “ ”, said, say, according to, add, addressed, agree, affirmed, reaffirmed, argue, believe, believes, claim, concern, consider, disagreed, expressed, finds that, found that, fear that, idea that, insist, maintains that, predicted, reported, report, state that, stated that, states that, show that, showed that, shows that, think, wrote, etc.
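
The detection rule on this slide can be sketched in Python as follows. The expression list is copied from the slide, which ends in "etc.", so it is known to be incomplete. Whole-word, case-insensitive matching and the function name `has_opinion_pattern` are assumptions of this sketch, not details given in the talk.

```python
import re

# Expressions listed on the slide; the slide's list is open-ended ("etc.").
OPINION_EXPRESSIONS = [
    "said", "say", "according to", "add", "addressed", "agree", "affirmed",
    "reaffirmed", "argue", "believe", "believes", "claim", "concern",
    "consider", "disagreed", "expressed", "finds that", "found that",
    "fear that", "idea that", "insist", "maintains that", "predicted",
    "reported", "report", "state that", "stated that", "states that",
    "show that", "showed that", "shows that", "think", "wrote",
]

# Assumption: match expressions as whole words, ignoring case.
_EXPRESSION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(e) for e in OPINION_EXPRESSIONS) + r")\b",
    re.IGNORECASE,
)

# Straight and curly double quotation marks.
_QUOTES = ('"', "\u201c", "\u201d")

def has_opinion_pattern(sentence):
    """True if the sentence contains quotation marks or a listed opinion expression."""
    if any(q in sentence for q in _QUOTES):
        return True
    return _EXPRESSION_RE.search(sentence) is not None
```

With exact whole-word matching, inflected forms that are not on the list (for example "says") are not detected; a stemmer would be one way to widen coverage.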


information-pattern-BAsed Novelty Detection (ip-BAND)

Query (Topic) -> Information Patterns -> Relevant Sentences -> Novel Sentences

Step 1. Query Analysis
  (1) Question Generation
      a. One or more specific questions, or
      b. A general question
  (2) Information Pattern (IP) Determination
      a. For each specific question, IP entities include (a specific NE type, sentence length (SL))
      b. For a general question, IP entities include (POLD NE types, SL, and opinion/event type)

Step 2. Relevant Sentence Retrieval
  - Sentence Relevance Ranking using TFIDF Scores
  - Information Pattern Detection in Sentences
  - Sentence Re-Ranking using Information Patterns

Step 3. Novel Sentence Detection
  - For specific topics: New and Specific NE Detection
  - For general topics: New NEs and New Words Detection


ip-BAND: Query Analysis

● Classify topics
  - Specific topics: multiple NE questions
  - General topics: opinion topics, event topics and others
● Determine the possible query-related information patterns for specific topics:
  - Query words + expected answer types


An Example Query

● Query 306 (TREC novelty track 2002)

− <title> “African Civilian Deaths”

− <description> “How many civilian non-combatants have been killed in the various civil wars in Africa?”

− <Narrative> A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.



● Query Analysis

− Determine the possible query-related patterns

    -- Query words + expected answer type
    -- African Civilian Death + NUMBER (query 306)
    -- Civilian Death + NUMBER (query 306)
    …
    -- Expanded query words + NUMBER



● Query Analysis: determine expected answer types

Word patterns for the five types of NE questions

Answer type     Word patterns
Person          who, individual, person, people, participant, candidate, customer, victim, leader, member, player, name
Organization    who, company, companies, organization, agency, agencies, name, participant
Location        where, location, nation, country, countries, city, cities, town, area, region
Number          how many, how much, length, number, polls, death tolls, injuries, how long
Date            when, date, time, which year, which month, which day
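
A minimal Python sketch of how the word-pattern table might be applied to a question. The patterns are those in the table; whole-word, case-insensitive matching and the name `expected_answer_types` are assumptions of this example, and the talk does not say how ties between types (for example "who" for both Person and Organization) are resolved.

```python
import re

# Word patterns from the table, keyed by NE answer type.
ANSWER_TYPE_PATTERNS = {
    "PERSON": ["who", "individual", "person", "people", "participant", "candidate",
               "customer", "victim", "leader", "member", "player", "name"],
    "ORGANIZATION": ["who", "company", "companies", "organization", "agency",
                     "agencies", "name", "participant"],
    "LOCATION": ["where", "location", "nation", "country", "countries", "city",
                 "cities", "town", "area", "region"],
    "NUMBER": ["how many", "how much", "length", "number", "polls", "death tolls",
               "injuries", "how long"],
    "DATE": ["when", "date", "time", "which year", "which month", "which day"],
}

def expected_answer_types(question):
    """Return every answer type whose word patterns occur in the question."""
    text = question.lower()
    found = []
    for answer_type, patterns in ANSWER_TYPE_PATTERNS.items():
        # Assumption: a pattern counts only when it appears as whole words.
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in patterns):
            found.append(answer_type)
    return found

description = ("How many civilian non-combatants have been killed "
               "in the various civil wars in Africa?")
print(expected_answer_types(description))  # ['NUMBER']
```

For the description field of query 306 this yields NUMBER only, which matches the "African Civilian Death + NUMBER" pattern shown for that query.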


ip-BAND: Relevant Sentence Retrieval

• Retrieve sentences indicating “possible answers”

• TFIDF ranking

• Sentence re-ranking - SL, NE, OP

• Filter out sentences without “answers” for specific topics

$S_0 = \sum_{i=0}^{n} \left[ tf_s(t_i) \cdot tf_q(t_i) \cdot idf(t_i)^2 \right]$
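
A small Python sketch of the TFIDF ranking score, read here as a sum over query terms of (term frequency in the sentence) x (term frequency in the query) x (squared inverse document frequency). The function name `tfidf_score` is hypothetical, and the idf values are passed in as a plain dictionary because the slide does not state how idf is computed.

```python
from collections import Counter

def tfidf_score(sentence_terms, query_terms, idf):
    """Initial relevance score S0 for one sentence.

    sentence_terms, query_terms: lists of tokens.
    idf: dict mapping a term to its idf value (how idf is estimated is an
         assumption left to the caller; unknown terms contribute 0).
    """
    tf_s = Counter(sentence_terms)
    tf_q = Counter(query_terms)
    return sum(tf_s[t] * tf_q[t] * idf.get(t, 0.0) ** 2 for t in tf_q)

idf = {"civilian": 2.0, "deaths": 1.5}
score = tfidf_score(["civilian", "deaths", "civilian"], ["civilian", "deaths"], idf)
print(score)  # 10.25  (2*1*2.0**2 + 1*1*1.5**2)
```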


ip-BAND: Relevant Sentence Retrieval

• Sentence re-ranking
  - Sentence length adjustment: $S_1 = S_0 \cdot (L / \bar{L})$
  - Named entity adjustment: $S_2 = S_1 \cdot [1 + \alpha (F_{person} + F_{location} + F_{date})]$
  - Opinion adjustment (general opinion topics only): $S_3 = S_2 \cdot [1 + \beta F_{opinion}]$
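
The three adjustments can be chained as in the Python sketch below. Several details are assumptions of this example rather than facts from the slide: each F value is taken to be a 0/1 indicator that the sentence contains that pattern, the length adjustment divides the sentence length by an average sentence length, and the weights `alpha` and `beta` (with placeholder defaults of 0.5) stand in for coefficients whose symbols and values did not survive in the slide text.

```python
def rerank_score(s0, length, avg_length, f_person, f_location, f_date,
                 f_opinion=0, alpha=0.5, beta=0.5, opinion_topic=False):
    """Apply sentence-length, named-entity and opinion adjustments to S0.

    alpha and beta are hypothetical weights; the F arguments are assumed
    to be 0/1 indicators for the corresponding information pattern.
    """
    s1 = s0 * (length / avg_length)                             # sentence length adjustment
    s2 = s1 * (1 + alpha * (f_person + f_location + f_date))    # named entity adjustment
    if opinion_topic:                                           # general opinion topics only
        return s2 * (1 + beta * f_opinion)
    return s2

# A 20-word sentence (average 10) containing a PERSON and a DATE entity.
print(rerank_score(10.0, 20, 10, f_person=1, f_location=0, f_date=1))  # 40.0
```

Longer sentences and sentences carrying PERSON, LOCATION or DATE entities move up the ranking, in line with the sentence-length and named-entity observations earlier in the talk.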


ip-BAND: Novel Sentence Detection

• Identify sentences with “new answers”
• $S_n = \omega_w N_w + \omega_{ne} N_{ne}$
  - Novel if $S_n > T$
• For specific topics/questions
  - IPs: new answer NEs
  - $\omega_w = 0$, $\omega_{ne} = 1$, and $T = 1$
• For general topics/questions
  - IPs: new words and new NEs
  - $\omega_w = 1$, $\omega_{ne} = 1$, and $T = 4$
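
The novelty decision can be sketched in Python as a weighted count of new words and new named entities compared against the threshold T. The names `novelty_score` and `detect_novel`, the weight parameters `w_words` and `w_nes`, and the choice to add every processed sentence to the history (rather than only those delivered as novel) are assumptions of this sketch.

```python
def novelty_score(words, nes, seen_words, seen_nes, w_words, w_nes):
    """Weighted count of previously unseen words and named entities."""
    n_w = len(set(words) - seen_words)
    n_ne = len(set(nes) - seen_nes)
    return w_words * n_w + w_nes * n_ne

def detect_novel(sentences, w_words, w_nes, threshold):
    """Return indices of sentences whose novelty score exceeds the threshold.

    sentences: list of (words, named_entities) pairs in ranked order.
    Assumption: every sentence, novel or not, is added to the history.
    """
    seen_words, seen_nes = set(), set()
    novel = []
    for idx, (words, nes) in enumerate(sentences):
        if novelty_score(words, nes, seen_words, seen_nes, w_words, w_nes) > threshold:
            novel.append(idx)
        seen_words.update(words)
        seen_nes.update(nes)
    return novel

ranked = [
    (["car", "bomb", "exploded", "in", "baghdad"], ["baghdad"]),
    (["car", "bomb", "in", "baghdad"], ["baghdad"]),
    (["second", "car", "bomb", "hit", "gaza", "on", "tuesday"], ["gaza", "tuesday"]),
]
# General-topic settings from the slide: both weights 1, T = 4.
print(detect_novel(ranked, w_words=1, w_nes=1, threshold=4))  # [0, 2]
```

The second sentence adds no new words or entities, so it scores 0 and is treated as redundant. For specific topics the same function would be called with a word weight of 0, an NE weight of 1 and T = 1, with the NE list restricted to entities of the expected answer type.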


Experiments and Results

● Data From TREC Novelty Tracks (2002-2004)

− 49 queries (2002), 50 queries (2003) and 50 queries (2004)

− For each query, up to 25 relevant documents (2002, 2003); 25 relevant documents plus additional non-relevant documents (2004)

− Documents were pre-segmented into sentences.

− Redundancy: 2002 (9.1%), 2003 (34.3%), 2004(58.6%)
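
As a quick consistency check, the 2002 and 2003 redundancy figures can be reproduced from the relevant and novel sentence counts in the sentence-length table, assuming that redundancy here means the fraction of relevant sentences that were not judged novel. That reading, and the function name `redundancy_rate`, are assumptions; the 2004 counts are not given in the slides, so that figure is not checked.

```python
def redundancy_rate(relevant, novel):
    """Fraction of relevant sentences that are not novel (assumed definition)."""
    return (relevant - novel) / relevant

print(round(redundancy_rate(1365, 1241) * 100, 1))    # 9.1   (TREC 2002)
print(round(redundancy_rate(15557, 10226) * 100, 1))  # 34.3  (TREC 2003)
```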



● Baseline Approaches
  − B-NN: Initial Retrieval Ranking
      − No novelty detection performed
  − B-NW: New Word Detection
      − New words in a sentence indicate novelty
  − B-NWT: New Word Detection with Threshold T
      − T new words in a sentence indicate novelty
  − B-MMR: Maximal Marginal Relevance
      − Carbonell and Goldstein (1998)
      − Reported to work well in non-redundant text summarization, novelty detection in document filtering, and subtopic retrieval
      − MMR may incorporate various novelty measures



● Baseline Approaches
  − B-MMR: Maximal Marginal Relevance
      − (1) Start with a sentence relevance ranking and select the first sentence (it is always novel)
      − (2) Calculate the MMR score for the remaining sentences
      − (3) Pick the one with the max MMR score and go to (2) until the last sentence is selected
  − MMR score:

$MMR = \arg\max_{S_i \in R \setminus N} \left[ \lambda \, Sim_1(S_i, Q) - (1 - \lambda) \max_{S_j \in N} Sim_2(S_i, S_j) \right]$
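
The three-step B-MMR procedure can be sketched in Python as a greedy loop. The function name `mmr_order`, the precomputed similarity inputs and the default `lam=0.5` are assumptions of this example; the slide does not say which similarity functions or which value of lambda were used.

```python
def mmr_order(ranked_ids, query_sim, pair_sim, lam=0.5):
    """Greedy MMR re-ordering of sentences.

    ranked_ids: sentence ids sorted by relevance (best first).
    query_sim:  dict id -> Sim1(sentence, query).
    pair_sim:   function (id_a, id_b) -> Sim2 between two sentences.
    lam:        trade-off between relevance and novelty (placeholder value).
    """
    remaining = list(ranked_ids)
    selected = [remaining.pop(0)]  # step (1): the top-ranked sentence is always taken
    while remaining:
        # step (2): MMR score of each remaining sentence against those already selected
        def mmr(i):
            return lam * query_sim[i] - (1 - lam) * max(pair_sim(i, j) for j in selected)
        # step (3): take the sentence with the maximum MMR score, then repeat
        best = max(remaining, key=mmr)
        remaining.remove(best)
        selected.append(best)
    return selected

query_sim = {"a": 0.9, "b": 0.8, "c": 0.5}
sims = {frozenset("ab"): 0.9, frozenset("ac"): 0.1, frozenset("bc"): 0.2}
order = mmr_order(["a", "b", "c"], query_sim, lambda i, j: sims[frozenset((i, j))])
print(order)  # ['a', 'c', 'b']
```

In this toy input, sentence "b" is more relevant than "c" but nearly duplicates "a", so MMR moves "c" ahead of it.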



● Three Sets of Experiments

− (1) Performance of identifying novel sentences for queries transformed into multiple specific questions

− (2) Performance of identifying novel sentences for queries transformed into a general question

− (3) Performance of finding relevant sentences for all queries

[Figure: Performance of Novelty Detection for Specific Topics]

[Figure: Performance of Novelty Detection for General Topics]

[Figure: The Overall Performance of Novelty Detection]



● The proposed approach outperforms all baselines at top (5,10,15, 20, 30) ranks

● The proposed approach beats the baseline approaches across different data collections

● All approaches achieve better performance on specific topics than on general topics


Conclusions and Future Work

● New definition of novelty
  - New answers to potential questions from a query
● Analysis of information patterns
  - Sentence lengths, named entities, opinion patterns
● ip-BAND
  - information-pattern-BAsed Novelty Detection approach


Conclusions and Future Work (cont.)

● Combine with other IR approaches
  - Information patterns + language modeling
● Improve query analysis
  - Other types of questions in addition to NE questions
  - Why, what, how …
● Extend to other novelty-based applications
  - New Event Detection
  - Multi-document Summarization


Research Backgrounds and Interests

● Information Retrieval (IR)
  − Robust High Performance IR
● Novelty Detection (ND)
● Question Answering (QA)
● Database Systems & Data Mining
● Bioinformatics
● The Intersection of IR, QA, MIS and Data Mining
  − Organize and access data in relational databases and indexed files of free text
  − Answer users’ questions instead of matching query words

Thank You!

Questions?