Estimation of the Number of Relevant Images in Infinite Databases
Presented by: Xiaoling Wang
Supervisor: Prof. Clement Leung
Introduction
Due to the increased importance of the Internet, the use of image search engines is becoming increasingly widespread. However, it is difficult for users to decide which image search engine to use.
The more effective the system is, the more satisfaction it offers the user.
Retrieval effectiveness is therefore one of the most important parameters for measuring the performance of image retrieval systems.
Measures:
Precision: $P = \dfrac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}}$
Recall: $R = \dfrac{\text{number of relevant images retrieved}}{\text{total number of relevant images}}$
Significant challenge: the total number of relevant images is not directly observable in such a potentially infinite database.
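As a small illustration of the two measures, the sketch below computes precision and recall from raw counts. The function names and the sample numbers are hypothetical; in particular, the total number of relevant images is simply assumed here, which is exactly the quantity that cannot be observed directly for a web-scale image search engine.

```python
def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of the retrieved images that are relevant."""
    if total_retrieved <= 0:
        raise ValueError("total_retrieved must be positive")
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant images that were retrieved."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return relevant_retrieved / total_relevant


if __name__ == "__main__":
    # Hypothetical counts: 14 relevant images among 20 retrieved,
    # with an ASSUMED total of 50 relevant images in the collection.
    print(precision(14, 20))
    print(recall(14, 50))
```

Precision needs only what the user can see on the result pages, whereas recall depends on the assumed denominator, which is why an estimate of the number of relevant images is required.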
Objective
To investigate the probabilistic behavior of the distribution of relevant images among the results returned by image search engines:
a) Independent Distribution
b) Markov Chain Distribution
From such models, we shall determine algorithms for the meaningful estimation of recall.
Independent Model
Let $p_k$ denote the probability of relevance for page k, reflecting the cumulative relevance of all the images in that page.
In general, for search engines, the first pages provide a larger probability, so that
$p_1 \ge p_2 \ge \dots \ge p_k \ge p_{k+1} \ge \dots$
Since the relevant outcomes of different ranked images are not mutually exclusive events, and since the search results do not feasibly terminate, we have in general
$\sum_{k=1}^{\infty} p_k \neq 1$
and
$p_k \to 0$ as $k \to \infty$.
Independent Model
Record the number of relevant images per page as a stochastic process $X_{i1}, X_{i2}, \dots, X_{ik}$, where i = 1, 2, …, 69 and k = 1, 2, …
Investigate the quadratic formula:
$P_k = \alpha_1 k^2 + \alpha_2 k + \beta$, where k = 1, 2, 3, …
Determine the parameters using the least squares method.
Obtain the mean number of relevant images for each page:
$\bar{X}_k = \dfrac{1}{69} \sum_{i=1}^{69} X_{ik}$, k = 1, 2, …
Calculate the percentage representing the cumulative relevance of all the images in page k:
$p_k = \dfrac{\bar{X}_k}{20}$, k = 1, 2, …
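The fitting steps of the independent model can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes NumPy is available, uses `numpy.polyfit` for the least-squares quadratic fit, and all function names and sample values are hypothetical.

```python
import numpy as np


def mean_relevant_per_page(counts):
    """Average the per-page relevant-image counts over all recorded sequences.

    counts: 2-D array-like of shape (number of sequences, number of pages).
    """
    return np.asarray(counts, dtype=float).mean(axis=0)


def fit_quadratic(pages, values):
    """Least-squares fit of values ~ a*k**2 + b*k + c; returns (a, b, c)."""
    a, b, c = np.polyfit(np.asarray(pages, dtype=float),
                         np.asarray(values, dtype=float), deg=2)
    return float(a), float(b), float(c)


def r_squared(pages, values, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    values = np.asarray(values, dtype=float)
    predicted = np.polyval(coeffs, np.asarray(pages, dtype=float))
    ss_res = np.sum((values - predicted) ** 2)
    ss_tot = np.sum((values - values.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


if __name__ == "__main__":
    # Hypothetical counts for 3 sequences over 5 pages, 20 images per page.
    counts = [[20, 18, 17, 15, 14],
              [19, 18, 16, 16, 13],
              [18, 17, 17, 14, 12]]
    pages = np.arange(1, 6)
    percentages = 100.0 * mean_relevant_per_page(counts) / 20
    coeffs = fit_quadratic(pages, percentages)
    print(coeffs, r_squared(pages, percentages, coeffs))
```

The fitted curve can then be evaluated at any page number k to predict the percentage of relevant images on that page.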
Markov Chain Model
Since in Internet image search, results are returned in units of pages, we shall focus on the integer-valued stochastic process $X_1, X_2, \dots$, where $X_J$ represents the aggregate relevance of all the images in page J. The sequence $X = \{X_1, X_2, \dots\}$ will be modeled as a Markov chain.
Take the conditional probability of the number of relevant images in $X_J$ given the number of relevant images in $X_{J-1}$ to be the transition probability:
$p_{(J-1),J} = P\{X_J = x_J \mid X_{J-1} = x_{J-1}\}$.
Markov Chain Model
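From this, we construct the transition probability matrix
$$P = \begin{pmatrix} p_{00} & p_{01} & \cdots & p_{0n} \\ p_{10} & p_{11} & \cdots & p_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n0} & p_{n1} & \cdots & p_{nn} \end{pmatrix}$$
where n is the number of images contained in a page.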
Markov Chain Model
Calculate the initial probabilities. The probabilities are placed in a vector of state probabilities:
$\pi(J)$ = vector of state probabilities for page J
$= (\pi_0, \pi_1, \pi_2, \pi_3, \dots, \pi_n)$
where $\pi_k$ is the probability of having k relevant images.
Therefore, from this model, we can estimate the number of relevant images page by page using the formula: $\pi(J) = \pi(J-1) P$, J = 1, 2, 3, …, n
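The Markov chain procedure can be sketched as below. This is an illustrative sketch under several assumptions that are not stated in the slides: transition probabilities are estimated by counting observed page-to-page transitions, states that never occur as a starting state get a uniform row, the initial vector is taken from the first-page counts, and the estimate for each page is the expected value under the state vector. All function names are hypothetical.

```python
import numpy as np


def estimate_transition_matrix(sequences, n):
    """Estimate an (n+1) x (n+1) transition matrix from observed sequences.

    Each sequence lists the number of relevant images (0..n) on successive pages.
    """
    counts = np.zeros((n + 1, n + 1))
    for seq in sequences:
        for prev, curr in zip(seq[:-1], seq[1:]):
            counts[prev, curr] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # ASSUMPTION: rows with no observations fall back to a uniform distribution.
    uniform = np.full_like(counts, 1.0 / (n + 1))
    return np.divide(counts, row_sums, out=uniform, where=row_sums > 0)


def initial_distribution(sequences, n):
    """State probabilities for the first page, from first-page counts."""
    pi0 = np.zeros(n + 1)
    for seq in sequences:
        pi0[seq[0]] += 1
    return pi0 / pi0.sum()


def expected_relevant_by_page(pi0, P, pages):
    """Propagate the state vector page by page and return the expected
    number of relevant images on each page."""
    states = np.arange(len(pi0))
    pi = np.asarray(pi0, dtype=float).copy()
    expected = []
    for _ in range(pages):
        expected.append(float(pi @ states))
        pi = pi @ P  # state vector for the next page
    return expected


if __name__ == "__main__":
    # Hypothetical training sequences with n = 3 images per page.
    sequences = [[3, 2, 2, 1], [3, 3, 2, 1], [2, 2, 1, 0]]
    P = estimate_transition_matrix(sequences, n=3)
    pi0 = initial_distribution(sequences, n=3)
    print(expected_relevant_by_page(pi0, P, pages=4))
```

Summing the per-page expectations over the pages a user is willing to inspect gives a running estimate of the number of relevant images retrieved so far.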
Experiment
Image search engine selection: Google, Yahoo, Msn
Query selection: the queries consist of one-word, two-word, and three-or-more-word queries, ranging from simple words like "apple", to more specific queries like "apple computers", and finally to highly specific queries like "eagle catching fish"
Record the stochastic sequence X={X1, X2 ,…} for each query
Apply the models: Independent Model and Markov Chain Model
Test the models on the results returned for the test queries: volcano, Tibetan girl, desert camel shadow
Independent Model and Testing Results for Google
Figure 1. Independent Model for Google. [Plot: percentage of the number of relevant images per page (0 to 100) against page number (1 to 10); series: Google, with polynomial trend line y = -0.0189x^2 - 1.9129x + 97.25, R^2 = 0.9523.]
Figure 2. Testing Results and Independent Distribution Model for Google. [Plot: number of relevant images (0 to 25) against page number (1 to 10); series: Volcano, Tibetan Girl, Desert Camel Shadow, Independent Distribution.]
Independent Model and Testing Results for Yahoo
Figure 3. Independent Model for Yahoo. [Plot: percentage of the number of relevant images per page (0 to 100) against page number (1 to 10); series: Yahoo, with polynomial trend line y = 0.3788x^2 - 6.1364x + 96.667, R^2 = 0.8559.]
Figure 4. Testing Results and Independent Distribution Model for Yahoo. [Plot: number of relevant images (0 to 25) against page number (1 to 10); series: Volcano, Tibetan Girl, Desert Camel Shadow, Independent Distribution.]
Independent Model and Testing Results for Msn
Figure 5. Independent Model for Msn. [Plot: percentage of the number of relevant images per page (0 to 100) against page number (1 to 10); series: MSN, with polynomial trend line y = 0.1894x^2 - 4.8409x + 93.833, R^2 = 0.961.]
Figure 6. Testing Results and Independent Distribution Model for Msn. [Plot: number of relevant images (0 to 25) against page number (1 to 10); series: Volcano, Tibetan Girl, Desert Camel Shadow, Independent Distribution.]
Markov Chain Model and Testing Results for Google
Figure 7. Search Result of Testing Queries and Markov Chain Model for Google. [Plot: number of relevant images (0 to 25) against page number (1 to 10); series: Markov Chain Model, Volcano, Tibetan Girl, Desert Camel Shadow.]
Markov Chain Model and Testing Results for Yahoo
Figure 8. Search Result of Testing Queries and Markov Chain Model for Yahoo. [Plot: number of relevant images (0 to 25) against page number (1 to 10); series: Markov Chain Model, Volcano, Tibetan Girl, Desert Camel Shadow.]
Markov Chain Model and Testing Results for Msn
Figure 9. Search Result of Testing Queries and Markov Chain Model for Msn. [Plot: number of relevant images (0 to 25) against page number (1 to 10); series: Markov Chain Model, Volcano, Tibetan Girl, Desert Camel Shadow.]
Measure of Accuracy
One measure of accuracy is the mean absolute deviation (MAD):
$\text{MAD} = \dfrac{\sum |\text{forecast error}|}{n}$
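The MAD calculation can be illustrated with a short sketch. The function name and the sample values are hypothetical; the forecast error is taken as the difference between the observed and the predicted number of relevant images on each page.

```python
def mean_absolute_deviation(actual, forecast):
    """Mean of the absolute forecast errors |actual - forecast|."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    return total_error / len(actual)


if __name__ == "__main__":
    # Hypothetical observed vs. predicted relevant images on three pages.
    observed = [20, 18, 15]
    predicted = [19, 18, 17]
    print(mean_absolute_deviation(observed, predicted))
```

A lower MAD means the model's per-page predictions are closer to the observed counts.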
MAD by image search engine (ISE) and query length

Model        Google                            Yahoo                             Msn
             One-word  Two-word  Three-word    One-word  Two-word  Three-word    One-word  Two-word  Three-word
INDP Model   1.2       2.4       1.1           2.9       4.6       2.5           2.7       2.6       11.8
MC Model     1         0.4       2.3           2.9       0         2.1           1.4       1.7       15.8
Conclusion
In terms of MAD, we conclude that the Markov Chain Model generally estimates the number of relevant images for the ISE better than the Independent Model does (lower MAD in six of the nine engine and query-length combinations).
Except for three-word queries on Msn, both models estimate the number of relevant images for the image search engines quite well.
Future Work
Optimal stopping rules for the different models will be established.
Time series modeling and exponential smoothing will be explored, because the previous models indicate that the situation may be modeled as a time series, with the page number representing time.
Q & A