IBM Research - India, Bengaluru, India
April 19, 2023
Fast Mining of Interesting Phrases from Subsets of Text Corpora
Deepak P, Atreyee Dey, Debapriyo Majumdar*
IBM Research - India, Bengaluru, INDIA
EDBT 2014 Conference, Athens, Greece
*presently with Indian Statistical Institute, Kolkata, India
Problem Description
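[Figure: a keyword query (e.g., "ukraine, crimea …") selects the Chosen Subset D' from the Text Corpus D; the output is a ranked list of phrases with interestingness scores]
Crimea independence, 0.90
USA Russia Relations, 0.85
G8 Membership, 0.81
……
$ID(p, D') = \frac{freq(p, D')}{freq(p, D)}$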
Given a text corpus D, and a subset D’, specified by a keyword query, find the top-k Interesting Phrases for D’ wrt D
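As a baseline for the problem statement above, here is a minimal brute-force sketch in Python. It assumes that freq counts the documents containing a phrase, that the keyword query is an AND over whitespace tokens, and that candidate phrases are supplied by the caller. The function name and the toy corpus are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def top_k_interesting(corpus: Dict[str, str], query: List[str],
                      phrases: List[str], k: int) -> List[Tuple[str, float]]:
    """Rank phrases by ID(p, D') = freq(p, D') / freq(p, D) with a full scan."""
    docs = {doc_id: text.lower() for doc_id, text in corpus.items()}
    # D': documents whose tokens include every query word (AND semantics, assumed).
    subset = {d for d, text in docs.items()
              if all(w.lower() in text.split() for w in query)}
    scored = []
    for p in phrases:
        # Documents of D that contain the phrase (plain substring match, assumed).
        in_d = {d for d, text in docs.items() if p.lower() in text}
        if in_d:  # skip phrases that never occur, to avoid dividing by zero
            scored.append((p, len(in_d & subset) / len(in_d)))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy corpus, for illustration only.
corpus = {
    "d1": "crimea independence vote in ukraine",
    "d2": "ukraine and crimea independence debate",
    "d3": "trade talks on crimea independence",
    "d4": "trade talks continue",
}
result = top_k_interesting(corpus, ["ukraine", "crimea"],
                           ["crimea independence", "trade talks"], k=2)
print(result)
```

A scan like this touches every document for every phrase, which is the cost that index-based methods try to avoid.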
Earlier Approaches
Phrase Indexing, Simitsis et al., VLDB 2008: O(|P|)
p1: d12 d13 d30 d9901
…
p9876: d1 d11 d305 d8100
Document Indexing, Bedathur et al., VLDB 2010 and Gao & Michel, EDBT 2012: O(|D’|)
d1: p5 p43 p167 p8970
…
d9998: p23 p49 p305 p9987
Estimating Interestingness: AND Query
Consider an AND query composed of k keywords
Q = {Q1, Q2, …, Qk}
$ID(p, D') = \frac{freq(p, D')}{freq(p, D)} = \frac{p(p \mid Q_1, \ldots, Q_k) \cdot \#Docs(\{Q_1, \ldots, Q_k\})}{p(p) \cdot \#Docs}$
$\frac{p(p \mid Q_1, \ldots, Q_k)}{p(p)} \propto p(Q_1, \ldots, Q_k \mid p)$
Query Word Independence Assumption
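Consider an AND Query of two words Q1 and Q2
We would like to estimate p(P1|Q1, Q2) as an estimate of the interestingness of P1
Instead, we could estimate p(Q1, Q2|P1) (as shown in previous slide)
$p(Q_1, Q_2 \mid P_1) = p(Q_1 \mid Q_2, P_1) \, p(Q_2 \mid P_1) \approx p(Q_1 \mid P_1) \, p(Q_2 \mid P_1)$
$p(Q_1, \ldots, Q_k \mid p) \approx \prod_{i=1}^{k} p(Q_i \mid p)$, which ranks phrases in the same order as $\sum_{i=1}^{k} \log p(Q_i \mid p)$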
For OR Query Handling details, refer to the paper
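To make the independence assumption concrete, the sketch below compares the product of per-word estimates with the exact joint estimate on a toy example. It assumes each p(w|p) is estimated from document counts as #Docs({w, p}) / #Docs(p). The function names and the document-id sets are hypothetical.

```python
import math
from typing import List, Set


def cond_prob(word_docs: Set[int], phrase_docs: Set[int]) -> float:
    """Estimate p(w | p) as #Docs({w, p}) / #Docs(p); phrase_docs must be non-empty."""
    return len(word_docs & phrase_docs) / len(phrase_docs)


def independent_log_score(query_docs: List[Set[int]], phrase_docs: Set[int]) -> float:
    """Sum of log p(Qi | p); -inf if some query word never co-occurs with p."""
    total = 0.0
    for word_docs in query_docs:
        prob = cond_prob(word_docs, phrase_docs)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


# Hypothetical document-id sets for one phrase and two query words.
docs_p = {1, 2, 3, 4}
docs_q1 = {1, 2, 3, 9}
docs_q2 = {2, 3, 4, 8}

# Product of the per-word estimates (independence assumption).
approx = math.exp(independent_log_score([docs_q1, docs_q2], docs_p))
# Exact joint estimate p(Q1, Q2 | p) from the same counts.
exact = len(docs_q1 & docs_q2 & docs_p) / len(docs_p)
print(approx, exact)
```

The two values differ on this toy data, which illustrates that the product is an approximation of the joint probability rather than an identity.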
Our Disk-Resident Indexes
w1: (p30, 0.23) (p12, 0.21) (p990, 0.18) (p13, 0.002)
…
w9876: (p810, 0.1) (p11, 0.08) (p305, 0.007) (p8, 0.0001)
The score that is stored along with each phrase is p(w|p)
All values are stored in score-sorted (descending) order
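One way such word-specific lists could be built is sketched below. It assumes p(w|p) is estimated as #Docs({w, p}) / #Docs(p) and that zero-score entries are simply left out. The function name and the toy document-id sets are hypothetical, and the sketch ignores the on-disk layout entirely.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_word_index(phrase_docs: Dict[str, Set[int]],
                     word_docs: Dict[str, Set[int]]) -> Dict[str, List[Tuple[str, float]]]:
    """Map each word w to [(phrase, p(w|p))], sorted by score in descending order."""
    index = defaultdict(list)
    for phrase, pdocs in phrase_docs.items():
        if not pdocs:
            continue  # a phrase with no documents has no defined p(w|p)
        for word, wdocs in word_docs.items():
            score = len(pdocs & wdocs) / len(pdocs)
            if score > 0:
                index[word].append((phrase, score))
    for postings in index.values():
        postings.sort(key=lambda entry: entry[1], reverse=True)
    return dict(index)


# Hypothetical toy input: document ids containing each phrase and each word.
phrase_docs = {"p1": {1, 2, 3, 4}, "p2": {3, 4}}
word_docs = {"w1": {1, 2, 3}, "w2": {4}}
index = build_word_index(phrase_docs, word_docs)
print(index)
```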
Aggregation Approach: NRA
We use the well-known NRA (No Random Access) algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases
At any point, we have upper and lower bounds. An example sum-aggregation below
w1: (P1, 0.04167) (P5, 0.0333)
w2: (P103, 0.26) (P1, 0.113)
P1 – [0.1547, 0.1547]
P5 – [0.0333, 0.1463]
P103 – [0.26, 0.2933]
……
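The bound bookkeeping in NRA can be sketched as follows. This is a simplified illustration that assumes non-negative scores, sum-aggregation, and lists read in lockstep (one entry per list per round); it is not the authors' implementation. The function names `nra_bounds` and `nra_top_k` are hypothetical.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (phrase id, score); each list sorted by score, descending


def nra_bounds(prefixes: List[List[Posting]], more_left: List[bool]):
    """Return ({phrase: (lower, upper)}, upper bound for any still-unseen phrase)."""
    # Highest score that an entry not yet read from list i can still have.
    ceiling = []
    for prefix, more in zip(prefixes, more_left):
        if not more:
            ceiling.append(0.0)             # list exhausted: nothing left to add
        elif prefix:
            ceiling.append(prefix[-1][1])   # bounded by the last score read
        else:
            ceiling.append(float("inf"))    # nothing read yet: no bound
    seen: Dict[str, Dict[int, float]] = {}
    for i, prefix in enumerate(prefixes):
        for phrase, score in prefix:
            seen.setdefault(phrase, {})[i] = score
    bounds = {}
    for phrase, parts in seen.items():
        lower = sum(parts.values())
        upper = lower + sum(c for i, c in enumerate(ceiling) if i not in parts)
        bounds[phrase] = (lower, upper)
    return bounds, sum(ceiling)


def nra_top_k(lists: List[List[Posting]], k: int) -> List[str]:
    """Read the lists in lockstep; stop once the top-k set (k >= 1) cannot change."""
    ranked: List[str] = []
    for depth in range(1, max((len(l) for l in lists), default=0) + 1):
        prefixes = [l[:depth] for l in lists]
        more_left = [len(l) > depth for l in lists]
        bounds, unseen_upper = nra_bounds(prefixes, more_left)
        ranked = sorted(bounds, key=lambda ph: bounds[ph][0], reverse=True)
        if len(ranked) >= k:
            kth_lower = bounds[ranked[k - 1]][0]
            rivals = [bounds[ph][1] for ph in ranked[k:]] + [unseen_upper]
            if kth_lower >= max(rivals):
                break
    return ranked[:k]


# The two list prefixes from the example, with both lists assumed to continue.
prefixes = [[("P1", 0.04167), ("P5", 0.0333)], [("P103", 0.26), ("P1", 0.113)]]
bounds, _ = nra_bounds(prefixes, more_left=[True, True])
for phrase, (lower, upper) in bounds.items():
    print(phrase, round(lower, 4), round(upper, 4))
```

A phrase seen in every list has equal lower and upper bounds. A phrase missing from a list gets that list's last-read score added to its upper bound.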
Our In-Memory Indexes
w1: (p12, 0.21) (p13, 0.002) (p30, 0.23) (p990, 0.18)
…
w9876: (p8, 0.0001) (p11, 0.08) (p305, 0.007) (p810, 0.1)
The score that is stored along with each phrase is p(w|p)
All values are stored in PhraseID sorted order
Indexes may be created by preserving just the top-10% values of each list
We will use simple Sort-Merge-Join on these lists for In-Memory operation
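The in-memory path can be sketched as a pruning step followed by a sort-merge join over PhraseID-sorted lists. The sketch assumes integer phrase ids, strictly positive scores, AND semantics where a phrase missing from any (possibly truncated) list is dropped, and a score equal to the sum of log p(w|p). The function names are hypothetical, and the second list in the usage example is made up for illustration.

```python
import heapq
import math
from typing import List, Tuple

Posting = Tuple[int, float]  # (phrase id, p(w|p))


def keep_top_fraction(postings: List[Posting], fraction: float) -> List[Posting]:
    """Keep the highest-scoring fraction of a list, then sort it by phrase id."""
    keep = math.ceil(len(postings) * fraction)
    best = sorted(postings, key=lambda e: e[1], reverse=True)[:keep]
    return sorted(best)


def smj_and(lists: List[List[Posting]], k: int) -> List[Tuple[int, float]]:
    """Intersect phrase-id-sorted lists; score each match by the sum of log p(w|p)."""
    if not lists:
        return []
    pos = [0] * len(lists)
    matches = []
    while all(p < len(l) for p, l in zip(pos, lists)):
        heads = [l[p][0] for p, l in zip(pos, lists)]
        target = max(heads)
        if all(h == target for h in heads):
            score = sum(math.log(l[p][1]) for p, l in zip(pos, lists))
            matches.append((target, score))
            pos = [p + 1 for p in pos]
        else:
            # Advance every list that is behind the largest current phrase id.
            pos = [p + 1 if h < target else p for p, h in zip(pos, heads)]
    return heapq.nlargest(k, matches, key=lambda m: m[1])


# w1 uses the values from the slide; w2 is a hypothetical second list.
w1 = [(12, 0.21), (13, 0.002), (30, 0.23), (990, 0.18)]
w2 = [(12, 0.5), (30, 0.1), (305, 0.007)]
top = smj_and([w1, w2], k=2)
print(top)
print(keep_top_fraction(w1, 0.5))
```

Because every list is sorted by phrase id, the join needs only one forward pass over each list, which suits in-memory operation.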
Example Results
Query: trade reserves (Reuters Dataset)
– economic minister
– reserves
– taiwan’s foreign exchange reserves
– economic planning
– economic planning and development
Result Quality Evaluation
[Figure: Prec, MRR, NDCG and MAP for 20-AND and 50-AND on the PubMed Dataset; y-axis from 0.9 to 1]
Running Times: Disk-based Operation (NRA)
[Figure: running times of AND-NRA, OR-NRA, AND-GM and OR-GM on the PubMed Dataset; y-axis from 100 to 10000000 in powers of ten; X-Axis: Percentage of NRA Lists Traversed, 0 to 100]
Percentages of Lists Traversed (NRA)
[Figure: percentage of lists traversed by NRA for Reuters-AND, Reuters-OR, Pubmed-AND and Pubmed-OR; axis from 27 to 34]
Running Times: Mem-based Operation (SMJ)
[Figure: running times of AND-SMJ, OR-SMJ, AND-GM and OR-GM on the PubMed Dataset; y-axis from 1 to 10000000 in powers of ten; X-Axis: Percentage of Entries Stored, 0 to 100]
Shortcomings
Index Sizes
– Earlier approaches index only phrases and documents
– Our method has word-specific indexes, with each word having a list in the index
– Number of words across documents could be much more than the number of phrases
– If we would like to support querying over all possible words, index sizes could get large
Queries on Metadata Facets
– Instead of using keyword queries, document subsets could also be chosen using metadata facets
– E.g., venue:sigmod AND year:2007, on a set of scholarly publications
– Our independence assumption has not yet been tested on metadata facets
Summary
Proposed an approach for the problem of mining interesting phrases from subsets of text corpora
Outlined the query word independence assumption that is seen to be empirically useful in accurately identifying interesting phrases
Our approach is seen to be up to 90% accurate, while being able to achieve turnaround times that are orders of magnitude better than those of the current techniques
Future Work
– Other potential avenues for leveraging the independence assumption for phrase analytics
– Methods to speed up interesting phrase mining over metadata facets
Thank You
Questions, Comments, Suggestions?