©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved.
Apollo – Automated Content Management System
Srikanth Kallurkar
Capabilities
Automated domain relevant information gathering
Gathers documents relevant to domains of interest from the web or proprietary databases.
Automated content organization
Organizes documents by topics, keywords, sources, time references and features of interest.
Automated information discovery
Comparison to existing manual information gathering method (what most users do currently)
User performs a “Keyword Search”
The goal is to maximize the results for a user keyword query
[Flowchart; surviving step labels: 2. Form Keywords, 3. Search, 6. Satisfied, Take a break, 7a. Give up]
Apollo Information Gathering method
User explores, filters and discovers documents assisted by Apollo features
The focus is on informative results seeded by a user selected combination of features
[Flowchart; surviving labels: User, Information Need, Data, Features, 2. Explore Features, 3. Filter, Yes, No]
Apollo Architecture
Apollo Domain Modeling
(behind the scenes)
1. Bootstrap Domain
3. Get Training Documents
(Select a small sample)
4. Build Domain Signature
Compute Classification Threshold
5. Organize Documents
Classify into defined topics/subtopics
Apollo Data Organization
[Diagram; surviving labels: "e.g. Published Article,"; Feature A; Doc 1, Doc 2, Doc N]
Snapshot of Apollo process to collect a domain relevant document
Snapshot of Apollo process to evolve domain relevant libraries
Apollo Information Discovery
User selects a feature via the Apollo interface
Apollo builds a set of documents from the library that contains the feature
Apollo collates all other features from the set and ranks them by domain relevance
User is presented with co-occurring features to expand or restrict the focus of search based on driving interests
e.g.: user selects the phrase “global warming”
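The discovery steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Apollo's implementation: the function name `co_occurring_features`, the dictionary-based library, and the precomputed `relevance` scores are all assumptions, and the sample documents are made up.

```python
from collections import Counter


def co_occurring_features(library, selected, relevance):
    """Return (matching documents, other features ranked by relevance).

    library:   dict mapping a document id to the set of its features (hypothetical layout)
    selected:  the feature the user picked
    relevance: dict mapping a feature to a domain relevance score (assumed to be given)
    """
    # Build the set of documents from the library that contain the feature.
    matching = [doc for doc, feats in library.items() if selected in feats]

    # Collate all other features that appear in those documents.
    counts = Counter()
    for doc in matching:
        counts.update(f for f in library[doc] if f != selected)

    # Rank by domain relevance; break ties by co-occurrence count, then by name.
    ranked = sorted(counts, key=lambda f: (-relevance.get(f, 0.0), -counts[f], f))
    return matching, ranked


# Made-up sample data for illustration.
library = {
    "doc1": {"global warming", "carbon dioxide", "sea level"},
    "doc2": {"global warming", "carbon dioxide"},
    "doc3": {"sea level", "tides"},
}
relevance = {"carbon dioxide": 0.9, "sea level": 0.7, "tides": 0.2}

docs, features = co_occurring_features(library, "global warming", relevance)
print(docs)
print(features)
```

Selecting one of the returned features and intersecting its document set with the current one would restrict the focus; taking the union would expand it.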
Illustration: Apollo Web Content Management Application for the domain “Climate Change”
“Climate Change” Domain Model
Can be reduced by input from human experts
Apollo Prototype
Inline Document View
Expanded Document View
Automatically Generated Domain Vocabulary
Apollo Performance
Experiment Setup
The experiment used the Text Retrieval Conference (TREC) document collection from the 2002 filtering track [1]. The document collection statistics were:
The collection contained documents from Reuters Corpus Volume 1.
There were 83,650 training and 723,141 testing documents.
There were 50 assessor topics and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories. For intersection topics, the relevant documents are taken to be those to which both category labels have been assigned.
The main metrics were T11F, an FBeta measure with a coefficient of 0.5, and T11SU, a normalized linear utility.
[1] http://trec.nist.gov/data/filtering/T11filter_guide.html
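For readers unfamiliar with the two metrics, the sketch below computes them from raw counts. The FBeta formula is the standard one with beta = 0.5. The utility weights (2 per relevant document retrieved, 1 per non-relevant document retrieved) and the floor `min_nu = -0.5` are assumptions taken from my reading of the track guidelines [1] and should be checked against that guide. The function and parameter names are hypothetical.

```python
def t11f(rel_retrieved, nonrel_retrieved, rel_missed, beta=0.5):
    """F-beta from raw counts; beta = 0.5 weights precision more than recall."""
    if rel_retrieved == 0:
        return 0.0
    b2 = beta * beta
    # Algebraically equal to (1 + b2) * P * R / (b2 * P + R).
    return (1 + b2) * rel_retrieved / (
        (1 + b2) * rel_retrieved + b2 * rel_missed + nonrel_retrieved
    )


def t11su(rel_retrieved, nonrel_retrieved, total_relevant, min_nu=-0.5):
    """Normalized, scaled linear utility (weights and floor are assumptions)."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    utility = 2 * rel_retrieved - nonrel_retrieved   # assumed linear utility
    normalized = utility / (2 * total_relevant)      # divide by the best possible utility
    # Clip very negative utilities at min_nu, then rescale to [0, 1].
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)


# A filter that retrieves 10 documents, all relevant, out of 160 relevant ones:
# precision is 1.0 but recall is low, so the F-beta score stays small.
print(t11f(rel_retrieved=10, nonrel_retrieved=0, rel_missed=150))
print(t11su(rel_retrieved=10, nonrel_retrieved=0, total_relevant=160))
```

Under these assumptions a filter that retrieves nothing scores 0 on T11F but 1/3 on T11SU, which is why the two metrics are usually read together.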
Experiment
Each topic was set as an independent domain in Apollo.
Only the relevant documents from the topic's training set were used to create the topic signature.
The topic signature was used to output a vector – called the filter vector – that comprised single word terms that were weighted by their ranks.
A threshold of comparison was calculated based on the mean and standard deviation of the cross products of the training documents with the filter vector.
Different distributions were assumed to estimate the appropriate thresholds.
In addition, the number of documents to be selected was set to be a multiple of the training sample size.
The entire testing set was indexed using Lucene.
For each topic, the documents were compared using the cross product with the topic filter vector in the document order prescribed by TREC.
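The filtering procedure above can be illustrated with a small sketch. The slides do not say how the signature ranks terms, how ranks become weights, or which distribution gave the final threshold, so everything specific here is an assumption: terms are ranked by document frequency in the training set, weights decrease linearly with rank, "cross product" is read as a dot product with term counts, and the threshold is the mean minus `k` standard deviations. All function names are hypothetical.

```python
import math
from collections import Counter


def build_filter_vector(training_docs, top_k=50):
    """Rank single-word terms and weight them by rank (assumed scheme)."""
    df = Counter()
    for doc in training_docs:
        df.update(set(doc.lower().split()))
    # Rank by document frequency, ties broken alphabetically.
    ranked = sorted(df, key=lambda t: (-df[t], t))[:top_k]
    n = len(ranked)
    # The top-ranked term gets weight 1.0; weights fall linearly with rank.
    return {term: (n - i) / n for i, term in enumerate(ranked)}


def score(doc, filter_vector):
    """Dot product of the document's term counts with the filter vector."""
    tf = Counter(doc.lower().split())
    return sum(weight * tf[term] for term, weight in filter_vector.items())


def threshold(training_docs, filter_vector, k=1.0):
    """Mean minus k standard deviations of the training scores (assumed rule)."""
    scores = [score(d, filter_vector) for d in training_docs]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean - k * std


# Made-up training sample for one topic.
training = ["climate change warming", "climate warming ice", "climate policy"]
fv = build_filter_vector(training, top_k=3)
cutoff = threshold(training, fv)

for doc in ["climate change warming trends", "stock market report"]:
    print(doc, score(doc, fv) >= cutoff)
```

In a streaming setting each test document would be scored in arrival order and accepted when its score reaches the cutoff, optionally stopping once a cap on the number of selected documents is hit.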
Initial Results
Initial results show that Apollo filtering effectiveness is very competitive with TREC benchmarks.
Precision and recall can be improved by leveraging additional components of the signatures.
[2] Cancedda et al., “Kernel Methods for Document Filtering”, in NIST Special Publication 500-251: Proceedings of the Eleventh Text Retrieval Conference, Gaithersburg, MD, 2002.
50 Assessor Topics
Topic Performance
[Chart4: Precision and FBeta by topic]
Precision     FBeta
0.9900990099  0.8968609865
0.5799086758  0.61352657
1             0.7763975155
0.3857142857  0.4336188437
0.8666666667  0.7647058824
0.2857142857  0.1694915254
0.5454545455  0.48
0.3846153846  0.3731343284
0.925         0.7905982906
0.3           0.2112676056
0.25          0.1612903226
0.7           0.5833333333
0.8947368421  0.5821917808
0.8222222222  0.7644628099
0.8333333333  0.2873563218
0.5384615385  0.4320987654
0.4375        0.3645833333
0.6666666667  0.3846153846
0             0
0.9428571429  0.7534246575
0.9076923077  0.8575581395
0.606557377   0.6271186441
0             0
0.7777777778  0.5072463768
0.94          0.8834586466
0.9841269841  0.7311320755
0.6101694915  0.6474820144
0.6666666667  0.6535947712
0.6666666667  0.5649717514
1             0.25
0.3333333333  0.1020408163
0.8           0.6451612903
0.5238095238  0.4910714286
0.6666666667  0.4316546763
0.8803827751  0.7843137255
0.1363636364  0.0967741935
0             0
0.5192307692  0.5357142857
0.3333333333  0.243902439
0.2941176471  0.1851851852
0.7894736842  0.7258064516
1             0.7142857143
0.6           0.3488372093
0.7307692308  0.7224334601
1             0.8
0.9186046512  0.8681318681
0.3913043478  0.3571428571
0.9565217391  0.6547619048
0             0
0.9545454545  0.7394366197
Apollo Filtering Performance
Apollo training time was linear in the number and size of the training documents (num training docs vs. avg. training time).
On average, the filtering time per document was constant (avg. test time).
[Charts: Chart6, Chart8, Chart9. Their underlying values are listed under Sheet1.]
Sheet1
topic #training avgsize         avgtime         maxpos (no header)  (no header) (no header)
R101  7         559.7142857143  1363.2857142857 307    0.3589475797 289731      80.1785714286
R102  135       605.4962962963  2901.9037037037 159    0.3542447173 285935      474.235
R103  14        640.9285714286  1621.0714285714 61     0.387655358  312903      92.725
R104  120       601.375         2728.9          94     0.3502517444 282712      402.265
R105  16        615.1875        1696.5          50     0.3605036374 290987      117.48
R106  4         865             1867.5          31     0.3588236897 289631      93.925
R107  3         1006            2245.6666666667 37     0.3592065097 289940      83.6666666667
R108  3         776.6666666667  1795.6666666667 15     0.3576021349 288645      90.35
R109  20        849.15          2375.6          74     0.3687807247 297668      197.08
R110  5         526.6           1278.4          31     0.3514993161 283719      93.72
R111  4         684.25          1659.5          15     0.3481815434 281041      75.6625
R112  6         313.1666666667  863.5           20     0.3502133385 282681      61.85
R113  12        483.5           1260.5833333333 70     0.3582649461 289180      107.685
R114  5         814.2           1948.8          62     0.3624016314 292519      102.83
R115  3         817             1865.3333333333 63     0.3553138876 286798      86.35
R116  16        889.0625        2246.4375       87     0.4341314324 350417      139.23
R117  3         1013.3333333333 2249            32     0.3598581708 290466      102.9
R118  3         961.6666666667  2139            14     0.358807584  289618      79.1166666667
R119  4         548.5           1370.75         40     0.3554402553 286900      84.8875
R120  9         386.8888888889  976.7777777778  158    0.3577545195 288768      67.05
R121  14        706.2142857143  1803            84     0.363848666  293687      135.955
R122  15        779.6666666667  2027.6          51     0.3733089022 301323      124.845
R123  3         964.3333333333  2233            17     0.3581051281 289051      94.0166666667
R124  6         1542.1666666667 3522.3333333333 33     0.4133018157 333604      152.0666666667
R125  12        750.25          1926.8333333333 132    0.3723834443 300576      177.185
R126  19        369.9473684211  1104.2631578947 172    0.3546461208 286259      186.525
R127  5         709.8           1759.4          42     0.3537850856 285564      138.26
R128  4         930.25          2006            33     0.3635315077 293431      92.8
R129  17        740.7647058824  1962.5882352941 57     0.3701893534 298805      157.925
R130  3         1277            2698.3333333333 16     0.3614811291 291776      113.5
R131  4         866             1936.25         74     0.3607836287 291213      81.4375
R132  7         749.5714285714  1720.1428571429 22     0.3700171464 298666      106.3928571429
R133  5         579.8           1380.6          28     0.3579267265 288907      77.87
R134  5         606.4           1363            67     0.3592907548 290008      98.23
R135  14        561.1428571429  1451.2857142857 337    0.3652349944 294806      117.155
R136  8         952.375         2227.5          67     0.3892361937 314179      97.29375
R137  3         913             2051.3333333333 9      0.3585647598 289422      130.1428571429
R138  7         894.2857142857  2135            44     0.361026453  291409      115.0571428571
R139  3         1274            2646.6666666667 17     0.3687757691 297664      102.6
R140  11        536.5454545455  1333.9090909091 67     0.3736000436 301558      103.77
R141  24        802.5416666667  2299.8333333333 82     0.3990767721 322122      171.55
R142  4         782.25          1790.5          24     0.3600712615 290638      68.55
R143  4         648.25          1602.75         23     0.3513481704 283597      68.9625
R144  6         353             965.5           55     0.3441266254 277768      60.45
R145  5         454.8           1103.4          27     0.3582748573 289188      70.59
R146  13        785.5384615385  2042.2307692308 111    0.3563855356 287663      144.565
R147  6         727.1666666667  1654.3333333333 34     0.3692688511 298062      109.8166666667
R148  12        407.3333333333  1087            228    0.363206916  293169      91.225
R149  5         705.2           1645.2          57     0.3572824988 288387      95.43
R150  4         426.75          1089.75         54     0.3528088329 284776      61.3625
[Chart: Precision v/s T11F]