Fast Mining of Interesting Phrases from Subsets of Text Corpora

IBM Research - India, Bengaluru, India

April 19, 2023


Deepak P, Atreyee Dey, Debapriyo Majumdar* (IBM Research - India, Bengaluru, India)

EDBT 2014 Conference, Athens, Greece

*presently with Indian Statistical Institute, Kolkata, India


Problem Description

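[Figure: a keyword query (e.g., "ukraine, crimea …") selects the chosen subset D’ from the text corpus D; the output is a ranked list of phrases with interestingness scores, e.g., Crimea independence, 0.90; USA Russia Relations, 0.85; G8 Membership, 0.81; …]

Interestingness of a phrase p for D’: $ID(p, D') = \frac{freq(p, D')}{freq(p, D)}$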

Given a text corpus D, and a subset D’, specified by a keyword query, find the top-k Interesting Phrases for D’ wrt D
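As a point of reference, the problem statement can be sketched by brute force. The snippet below is only an illustration under stated assumptions: freq(p, X) is taken to be the number of documents in X that contain p, interestingness is freq(p, D’) / freq(p, D), the subset D’ is selected by an AND keyword query, and the toy corpus and the name `top_k_interesting` are made up for the example.

```python
from collections import Counter

def top_k_interesting(corpus, query_words, k):
    """Brute-force top-k interesting phrases for an AND keyword query.

    corpus: dict doc_id -> (set of words, set of phrases)  [hypothetical layout]
    Interestingness is assumed to be freq(p, D') / freq(p, D), where freq
    counts the documents that contain phrase p.
    """
    query = set(query_words)
    freq_all = Counter()     # freq(p, D)
    freq_subset = Counter()  # freq(p, D')
    for words, phrases in corpus.values():
        freq_all.update(phrases)
        if query <= words:   # document matches every query word
            freq_subset.update(phrases)
    scores = {p: freq_subset[p] / freq_all[p] for p in freq_subset}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Toy corpus, invented for illustration only.
corpus = {
    "d1": ({"ukraine", "crimea", "russia"}, {"crimea independence", "g8 membership"}),
    "d2": ({"ukraine", "crimea"}, {"crimea independence"}),
    "d3": ({"trade", "russia"}, {"g8 membership", "foreign exchange"}),
    "d4": ({"trade"}, {"foreign exchange"}),
}
result = top_k_interesting(corpus, ["ukraine", "crimea"], k=2)
print(result)
```

This scan touches every document and every phrase, which is the cost the indexing approaches on the following slides try to avoid.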

Earlier Approaches

Phrase Indexing, Simitsis et al., VLDB 2008

p1: d12 d13 d30 d9901

p9876: d1 d11 d305 d8100

Complexity: O(|P|)

Document Indexing, Bedathur et al., VLDB 2010 and Gao & Michel, EDBT 2012

d1: p5 p43 p167 p8970

d9998: p23 p49 p305 p9987

Complexity: O(|D’|)

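Estimating Interestingness: AND Query

Consider an AND query composed of k keywords:

Q = {Q1, Q2, …, Qk}

$$ID(p, D') = \frac{freq(p, D')}{freq(p, D)} = \frac{p(p \mid Q_1, \ldots, Q_k) \cdot \#Docs(\{Q_1, \ldots, Q_k\})}{\#Docs(p)}$$

$$ID(p, D') \propto \frac{p(p \mid Q_1, \ldots, Q_k)}{p(p)} \propto p(Q_1, \ldots, Q_k \mid p)$$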

Query Word Independence Assumption

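Consider an AND query of two words, Q1 and Q2.

We would like to estimate p(P1|Q1, Q2) as a measure of the interestingness of P1.

Instead, we can estimate p(Q1, Q2|P1) (as shown on the previous slide):

$$p(Q_1, Q_2 \mid P_1) = p(Q_1 \mid Q_2, P_1)\, p(Q_2 \mid P_1) \approx p(Q_1 \mid P_1)\, p(Q_2 \mid P_1)$$

For k query words:

$$p(Q_1, \ldots, Q_k \mid p) \approx \prod_{i=1}^{k} p(Q_i \mid p)$$

Since the logarithm is monotonic, phrases can equivalently be ranked by $\sum_{i=1}^{k} \log p(Q_i \mid p)$.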

For OR Query Handling details, refer to the paper
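To make the AND-query scoring under the independence assumption concrete, here is a minimal sketch that ranks phrases by the sum of log p(Qi|p). The probability values, the dictionary layout, and the function name `independence_score` are hypothetical; treating a missing word as probability zero (so the phrase can never satisfy the AND query) is also an assumption of this sketch.

```python
import math

def independence_score(p_w_given_p, query_words):
    """Sum of log p(Qi | p) over the query words (independence assumption).

    p_w_given_p: dict word -> p(word | phrase) for one phrase [hypothetical layout].
    Returns -inf when a query word never co-occurs with the phrase.
    """
    total = 0.0
    for word in query_words:
        prob = p_w_given_p.get(word, 0.0)
        if prob <= 0.0:
            return float("-inf")
        total += math.log(prob)
    return total

# Invented conditional probabilities p(w | p), for illustration only.
stats = {
    "P1": {"Q1": 0.5, "Q2": 0.4},
    "P2": {"Q1": 0.9, "Q2": 0.1},
    "P3": {"Q1": 0.3},  # never co-occurs with Q2
}
query = ["Q1", "Q2"]
ranking = sorted(stats, key=lambda p: independence_score(stats[p], query), reverse=True)
print(ranking)
```

Working in log space turns the product into a sum, which is convenient for list aggregation and avoids underflow for longer queries.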

Our Disk-Resident Indexes

w1: p30 (0.23), p12 (0.21), p990 (0.18), p13 (0.002)

w9876: p810 (0.1), p11 (0.08), p305 (0.007), p8 (0.0001)

The score stored along with each phrase is p(w|p).

All values are stored in sorted order (descending by score).
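One way such word-specific lists could be built is sketched below. This is an assumption-laden illustration, not the authors' indexing code: p(w|p) is estimated as the fraction of documents containing phrase p that also contain word w, documents are given as (word set, phrase set) pairs, and `build_word_index` is a hypothetical name.

```python
from collections import defaultdict

def build_word_index(docs):
    """Build word -> [(phrase, p(w|p)), ...] lists, sorted by descending score.

    docs: iterable of (set of words, set of phrases)  [hypothetical layout].
    p(w|p) is estimated here as #docs containing both w and p / #docs containing p.
    """
    phrase_docs = defaultdict(int)  # number of documents containing each phrase
    co_docs = defaultdict(int)      # number of documents containing (word, phrase)
    for words, phrases in docs:
        for phrase in phrases:
            phrase_docs[phrase] += 1
            for word in words:
                co_docs[(word, phrase)] += 1
    index = defaultdict(list)
    for (word, phrase), count in co_docs.items():
        index[word].append((phrase, count / phrase_docs[phrase]))
    for entries in index.values():
        # Descending score, phrase name as a deterministic tie-breaker.
        entries.sort(key=lambda e: (-e[1], e[0]))
    return dict(index)

# Toy documents, invented for illustration only.
docs = [
    ({"trade", "reserves"}, {"p1", "p2"}),
    ({"trade"}, {"p1"}),
    ({"reserves"}, {"p2"}),
    ({"trade"}, {"p2"}),
]
index = build_word_index(docs)
print(index["trade"])
print(index["reserves"])
```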

Aggregation Approach: NRA

We use the well-known NRA (No Random Access) algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases.

At any point, we have upper and lower bounds on the aggregate score of each phrase seen so far. An example with sum-aggregation:

w1: (P1, 0.04167), (P5, 0.0333)

w2: (P103, 0.26), (P1, 0.113)

P1 – [0.1547, 0.1547]

P5 – [0.0333, 0.1463]

P103 – [0.26, 0.2933]

……
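The bound bookkeeping in NRA can be sketched as follows, using the two list prefixes from the example and sum-aggregation. This is a simplified illustration: it assumes every list still has unread entries (so the last score read from a list bounds any unseen entry in it), and the function names `nra_bounds` and `can_stop` are hypothetical.

```python
def nra_bounds(prefixes):
    """Lower/upper bounds on the summed score of every phrase seen so far.

    prefixes[i]: the (phrase, score) entries read so far from list i, in
    descending score order. Each list is assumed to have further unread entries.
    """
    last = [prefix[-1][1] for prefix in prefixes]  # last score read per list
    seen = {}                                      # phrase -> {list index: score}
    for i, prefix in enumerate(prefixes):
        for phrase, score in prefix:
            seen.setdefault(phrase, {})[i] = score
    bounds = {}
    for phrase, scores in seen.items():
        lower = sum(scores.values())
        # In a list where the phrase is unseen, its score is at most the last one read.
        upper = lower + sum(s for i, s in enumerate(last) if i not in scores)
        bounds[phrase] = (lower, upper)
    return bounds

def can_stop(bounds, prefixes, k):
    """NRA stopping test: no other phrase can overtake the current top-k."""
    ranked = sorted(bounds.values(), key=lambda b: b[0], reverse=True)
    if len(ranked) < k:
        return False
    kth_lower = ranked[k - 1][0]
    unseen_upper = sum(prefix[-1][1] for prefix in prefixes)  # entirely unseen phrase
    others_upper = max((b[1] for b in ranked[k:]), default=0.0)
    return kth_lower >= max(unseen_upper, others_upper)

prefixes = [
    [("P1", 0.04167), ("P5", 0.0333)],   # list for w1
    [("P103", 0.26), ("P1", 0.113)],     # list for w2
]
bounds = nra_bounds(prefixes)
for phrase, (lower, upper) in bounds.items():
    print(phrase, round(lower, 4), round(upper, 4))
print(can_stop(bounds, prefixes, k=1))
```

A full NRA loop would alternate sorted reads across the lists and call the stopping test after each round; that loop is omitted here to keep the sketch short.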

Our In-Memory Indexes

w1: p12 (0.21), p13 (0.002), p30 (0.23), p990 (0.18)

w9876: p8 (0.0001), p11 (0.08), p305 (0.007), p810 (0.1)

The score stored along with each phrase is p(w|p).

All values are stored in PhraseID-sorted order.

Indexes may be created by preserving just the top-10% values of each list

We will use simple Sort-Merge-Join on these lists for In-Memory operation
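A sort-merge join over two PhraseID-sorted lists might look like the sketch below, shown for a two-word AND query with log-score aggregation. The first list reuses the w1 values from the slide; the second list, the integer phrase IDs, and the names `smj_and` and `top_k` are assumptions made for the example.

```python
import math

def smj_and(list_a, list_b):
    """Sort-merge join of two lists of (phrase_id, p(w|p)), sorted by phrase_id.

    Returns (phrase_id, log p(w_a|p) + log p(w_b|p)) for phrases in both lists.
    """
    i = j = 0
    joined = []
    while i < len(list_a) and j < len(list_b):
        id_a, score_a = list_a[i]
        id_b, score_b = list_b[j]
        if id_a == id_b:
            joined.append((id_a, math.log(score_a) + math.log(score_b)))
            i += 1
            j += 1
        elif id_a < id_b:
            i += 1   # phrase only in list_a: cannot satisfy the AND query
        else:
            j += 1   # phrase only in list_b
    return joined

def top_k(joined, k):
    """Pick the k phrases with the highest aggregated score."""
    return sorted(joined, key=lambda e: e[1], reverse=True)[:k]

w1 = [(12, 0.21), (13, 0.002), (30, 0.23), (990, 0.18)]  # values from the slide
w2 = [(8, 0.3), (12, 0.05), (30, 0.1), (990, 0.2)]       # hypothetical second list
joined = smj_and(w1, w2)
best = top_k(joined, 2)
print(joined)
print(best)
```

Each list is read once from front to back, so the join is linear in the total list length; truncating lists to their top-scoring entries shortens this scan at the cost of possibly missing some phrases.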

Example Results

Query: trade reserves (Reuters Dataset)

– economic minister

– reserves

– taiwan’s foreign exchange reserves

– economic planning

– economic planning and development

Result Quality Evaluation

[Figure: Prec, MRR, NDCG, and MAP for 20-AND and 50-AND on the PubMed dataset; y-axis ranges from 0.9 to 0.99.]

Running Times: Disk-based Operation (NRA)

[Figure: running times for AND-NRA, OR-NRA, and AND-GM on the PubMed dataset; y-axis ticks at 100000, 1000000, and 10000000. X-axis: percentage of NRA lists traversed (0 to 100).]

Percentages of Lists Traversed (NRA)

[Figure: percentage of lists traversed for Reuters-AND, Reuters-OR, Pubmed-AND, and Pubmed-OR; axis ranges from 27 to 34.]

Running Times: Mem-based Operation (SMJ)

[Figure: running times for AND-SMJ, OR-SMJ, and AND-GM on the PubMed dataset; y-axis ticks at 100000, 1000000, and 10000000. X-axis: percentage of entries stored (0 to 100).]

Shortcomings

Index Sizes

– Earlier approaches index only phrases and documents

– Our method has word-specific indexes, with each word having a list in the index

– The number of words across documents could be much larger than the number of phrases

– If we would like to support querying over all possible words, index sizes could get large

Queries on Metadata Facets

– Instead of using keyword queries, document subsets could also be chosen using metadata facets

– E.g., venue:sigmod AND year:2007, on a set of scholarly publications

– Our independence assumption has not yet been tested on metadata facets

Summary

Proposed an approach for the problem of mining interesting phrases from subsets of text corpora

Outlined the query word independence assumption that is seen to be empirically useful in accurately identifying interesting phrases

Our approach is seen to be up to 90% accurate, while being able to achieve turnaround times that are orders of magnitude better than those of the current techniques

Future Work

– Other potential avenues for leveraging the independence assumption for phrase analytics

– Methods to speed up interesting phrase mining over metadata facets


Thank You

Questions, Comments, Suggestions?
