
Methods for evaluating dynamic changes in search engine rankings: a case study

Judit Bar-Ilan, Bar-Ilan University, Ramat Gan, Israel, and

Mark Levene and Mazlita Mat-Hassan, School of Computer Science and Information Systems, Birkbeck University of London, London, UK

Abstract

Purpose – The objective of this paper is to characterize the changes in the rankings of the top-ten results of major search engines over time and to compare the rankings between these engines.

Design/methodology/approach – The paper compares rankings of the top-ten results of the search engines Google and AlltheWeb on ten identical queries over a period of three weeks. Only the top-ten results were considered, since users do not normally inspect more than the first results page returned by a search engine. The experiment was repeated twice, in October 2003 and in January 2004, in order to assess changes to the top-ten results of some of the queries during the three-month interval. In order to assess the changes in the rankings, three measures were computed for each data collection point and each search engine.

Findings – The findings in this paper show that the rankings of AlltheWeb were highly stable over each period, while the rankings of Google underwent constant yet minor changes, with occasional major ones. Changes over time can be explained by the dynamic nature of the web or by fluctuations in the search engines' indexes. The top-ten results of the two search engines had surprisingly low overlap. With such small overlap, the task of comparing the rankings of the two engines becomes extremely challenging.

Originality/value – The paper shows that, because of the abundance of information on the web, ranking search results is of extreme importance. The paper compares several measures for computing the similarity between rankings of search tools, and shows that none of the measures is fully satisfactory as a standalone measure. It also demonstrates the apparent differences in the ranking algorithms of two widely used search engines.

Keywords Search engines, Data collection, Worldwide web

Paper type Research paper

Introduction

The web is growing continuously, as new pages are published on the web every day. However, it is not enough to publish a web page – this page must also be locatable. Currently, the primary tools for locating information on the web are the search engines, and by far the most popular search engine is Google (Nielsen/NetRatings, 2004; Sullivan and Sherman, 2004).

Google reportedly covers over 4.2 billion pages as of mid-February 2004 (Google, 2004; Price, 2004), a considerable increase from the more than 3.3 billion reported during the period from August 2003 until mid-February 2004. As of February 2004 the second largest search engine in terms of the reported number of indexed pages was AlltheWeb, with over 3.1 billion pages (AlltheWeb, 2004).


At the time of our data collection, the two search engines were of similar size, and AlltheWeb, although already owned by Yahoo! (Sullivan, 2003a), was still operating as an independent service. At the time of writing of this paper (October 2004), AlltheWeb is powered by Yahoo!, but uses essentially the same interface as before.

There are no recent studies on the coverage of web search engines, but the 1999 study of Lawrence and Giles found that the then-largest search engine (NorthernLight) covered only about 16 percent of the Web. Today, authors of web pages can influence the inclusion of their pages through paid-inclusion services. AlltheWeb has a paid-inclusion service, and even though Google does not, one's chances of being crawled are increased if the pages appear in one of the major directories (which do have paid-inclusion services) (Sullivan, 2003b).

However, it is not enough to be included in the index of a search engine; ranking is also crucial, since most web users do not browse beyond the first ten or at most 20 results (Silverstein et al., 1999; Spink et al., 2002). Moreover, paid inclusion is not supposed to influence the ranking of the page. SEOs (Search Engine Optimizers) offer their services to increase the ranking of web pages for certain queries (see, for example, Search Engine Optimization, Inc, www.seoinc.com/), but Google (Google, 2003a) has issued a strong warning against careless use of such services. It is clear that the top-ten results retrieved on a given query have the best chance of being visited by web users, and this was the main motivation for the research we present herein. In addition, we were motivated to examine the changes over time in the top-ten results for a set of queries on the two largest search engines at the time, Google and AlltheWeb. In parallel to this line of enquiry, we also studied the similarity (or rather non-similarity) between the top-ten results of these two tools.

For this study, we could not analyze the ranking algorithms of the search engines, since these are kept secret, both because of the competition between the different tools and in order to avoid misuse of the knowledge of these algorithms by users who want to be ranked high on specific queries. For example, Google is only willing to disclose that its ranking algorithm involves more than 100 factors, but "due to the nature of our business and our interest in protecting the integrity of our search results, this is the only information we make available to the public about our ranking system" (Google, 2003b). Thus, we had to use empirical methods to study the differences in the ranking algorithms and the influence of time on the rankings of search engines. We applied several methods for comparing the rankings of two search engine result pages for the same query. The two result pages compared were either results of two different search engines or results of the same engine at different points in time. Three measures were used, because none of them was satisfactory by itself; the three measures complement each other and provide a fuller picture of the dynamics of rankings.

Literature review

The usual method of evaluating rankings is through human judgment. In an early study by Su et al. (1998), users were asked to choose and rank the five most relevant items from the first 20 results retrieved for their queries. In their study, Lycos performed better on this criterion than the other three examined search engines. Hawking et al. (2001) compared precision at 20 of five commercial search engines with precision at 20 of six TREC systems.


The results for the commercial engines were retrieved from their proprietary databases, while the TREC engines' results came from an 18.5 million page test collection of web pages. The findings showed that the TREC systems outperformed the web search engines, and the authors concluded that: "the standard of document rankings produced by public web search engines is by no means state-of-the-art." On the other hand, Singhal and Kaszkiel (2001) compared a well-performing TREC system with four web search engines and found that "for finding the web page/site of an entity, commercial web search engines are notably better than a state-of-the-art TREC algorithm." In their study, they were looking for home pages of the entity, and they evaluated the search tool by the rank of the URL in the search results that pointed to the desired site. In fall 1999, Hawking et al. (2001) evaluated the effectiveness of 20 public web search engines on 54 queries. One of the measures used was the reciprocal rank of the first relevant document – a measure closely related to ranking. The results showed significant differences between the search engines, and high inter-correlation between the different measures. Chowdhury and Soboroff (2002) also evaluated search effectiveness based on the reciprocal rank – they computed the reciprocal rank of a known item for a query (a URL they a priori paired with the query).

Evaluations based on human judgments are unavoidably subjective. Voorhees (2000) examined this issue, and found very high correlations among the rankings of the systems produced by different relevance judgment sets. That paper considers rankings of the different systems and not rankings within the search results, and despite the fact that the agreement on the ranking performance of the search tools was high, the mean overlap between the relevance judgments on individual documents of two judges was below 50 percent (binary relevance judgments were made). Soboroff et al. (2001) found that differences in human judgments of relevance do not affect the relative evaluated performance of the different systems, and proposed a ranking system based on randomly selecting "pseudo-relevant" documents. In a recent study, Vaughan (2004) compared human rankings of 24 participants with those of three large commercial search engines, Google, AltaVista, and Teoma, on four search topics. The highest average correlation between the human-based rankings and the rankings of the search engines was for Google, where the average correlation was 0.72. The average correlation for AltaVista was only 0.49.

Fagin et al. (2003) proposed a method for comparing the top-k results retrieved by different search engines. One of the applications of the metrics they proposed was comparing the rankings of the top 50 results of seven public search tools (AltaVista, Lycos, AlltheWeb, HotBot, NorthernLight, AOLSearch and MSNSearch – some of them received their results from the same source, e.g. Lycos and AlltheWeb) on 750 queries. The basic idea of their method was to assign some reasonable, virtual ranking to documents that appear in one of the lists but not in the other. The resulting measures were proven to be metrics, which is a major point they emphasize in their paper.

The studies we have mentioned concentrate on comparing the search results of several engines at one point in time. In contrast, this study examines the temporal changes in search results over a period of time within a single engine and between different engines. In particular, we concentrate on the results of two of the largest search engines (at the time of data collection), Google and AlltheWeb, using three different measures described below.


Methodology

Data collection

The data for this study was collected during two time periods of approximately three weeks each, the first during October 2003 and the second during January 2004. The data collection for the first period was a course assignment at Birkbeck, University of London. Each student was required to choose a query from a list of ten queries and also to choose an additional query of his/her own liking. The ten queries listed by us constitute a convenience sample. As can be seen in Table I, some of the queries were "everyday" queries, while others were more academically oriented. The purpose of this experiment was to demonstrate the applicability of the different comparison measures.

These two queries were to be submitted to Google (google.com) and AlltheWeb (alltheweb.com) twice a day (morning and evening) during a period of three weeks. The students were to record the ranked list of the top-ten retrieved URLs for each search point. Overall, 34 different queries were tracked by 27 students (some of the queries were tracked by more than one student). The set of all queries that were processed, with the numbering assigned to them, appears in Table I.

The process was repeated at the beginning of January 2004. We picked ten queries from the list of 34 queries. This time we queried Google.com, Google.co.uk, Google.co.il and AlltheWeb, in order to assess the differences between the different Google sites as well. In this experiment, at each data collection point all the searches were carried out within a 20-minute timeframe. The reason for rerunning the searches was to study the effect of time on the top-ten results. Between the two parts of the experiment, Google most likely introduced a major change into its ranking algorithm (called the "Florida Google Dance" – Sullivan, 2003c), and we were interested in studying the effects of this change. For the second period queries q01-q10 were analyzed. The search terms were not submitted as phrases at either stage.

The measures

We used three measures in order to assess the changes over time in the rankings of the search engines and to compare the results of Google and AlltheWeb. The first and simplest measure is the size of the overlap between two top-ten lists.

The second measure was Spearman's footrule (Diaconis and Graham, 1977; Dwork et al., 2001; Fagin et al., 2003). Spearman's footrule is applied to two rankings of the same set, and thus if the size of the set is N, all the rankings must be between 1 and N (the measure is based on permutations, thus no ties are allowed).

Table I. The queries

Query ID   Query
q01        Modern architecture
q02        Web data mining
q03        World rugby
q04        Web personalization
q05        Human cloning
q06        Internet security
q07        Organic food
q08        Snowboarding
q09        DNA evidence
q10        Internet advertising techniques


Since the top-ten results retrieved by two search engines on a given query, or retrieved by the same engine on two consecutive days, are not necessarily identical, the two lists had to be transformed before Spearman's footrule could be computed. First, the non-overlapping URLs were eliminated from both lists, and then the remaining lists were re-ranked: each URL was given its relative rank in the set of remaining URLs in each list. The result of the re-ranking is two permutations σ1 and σ2 on 1, …, |S|, where |S| is the number of overlapping URLs. After these transformations Spearman's footrule could be computed:

Fr^{(|S|)}(\sigma_1, \sigma_2) = \sum_{i=1}^{|S|} \left| \sigma_1(i) - \sigma_2(i) \right|

When the two lists are identical, Fr^{(|S|)} is zero; its maximum value is (1/2)|S|^2 when |S| is even, and (1/2)(|S| + 1)(|S| - 1) when |S| is odd. If we divide by the maximum value, Fr^{(|S|)} will be between 0 and 1, independent of the size of the overlap (this is defined only for |S| > 1). Thus, we compute the normalized Spearman's footrule, Fr*:

Fr^{*} = \frac{Fr^{(|S|)}}{\max Fr^{(|S|)}}

Note that Spearman's footrule is based on the re-ranked lists, and thus, for example, if the original ranks of the URLs that appear in both lists (the overlapping pairs) are (1,8), (2,9) and (3,10), the re-ranked pairs will be (1,1), (2,2) and (3,3), and the value of Spearman's footrule will be 0.
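As an illustration of the first two measures, a minimal Python sketch is given below; it assumes the top-ten results are held as ordered lists of URL strings, and the function and variable names are ours, not part of the original study.

```python
from typing import Sequence

def overlap(list_a: Sequence[str], list_b: Sequence[str]) -> int:
    """First measure: size of the overlap between two top-ten URL lists."""
    return len(set(list_a) & set(list_b))

def normalized_footrule(list_a: Sequence[str], list_b: Sequence[str]) -> float:
    """Second measure: normalized Spearman's footrule Fr* on the overlapping URLs.

    Non-overlapping URLs are dropped, the survivors are re-ranked 1..|S| within
    each list, and the footrule distance is divided by its maximum possible
    value; the measure is defined only for an overlap |S| > 1.
    """
    common = set(list_a) & set(list_b)
    s = len(common)
    if s <= 1:
        raise ValueError("Fr* is defined only for an overlap of at least two URLs")
    # Relative (re-ranked) position of each overlapping URL within each list.
    rank_a = {url: i + 1 for i, url in enumerate(u for u in list_a if u in common)}
    rank_b = {url: i + 1 for i, url in enumerate(u for u in list_b if u in common)}
    footrule = sum(abs(rank_a[u] - rank_b[u]) for u in common)
    # Maximum footrule: |S|^2 / 2 for even |S|, (|S| + 1)(|S| - 1) / 2 for odd |S|.
    max_footrule = (s * s) // 2 if s % 2 == 0 else ((s + 1) * (s - 1)) // 2
    return footrule / max_footrule
```

On the worked example above, two lists whose overlapping URLs sit at original positions (1,8), (2,9) and (3,10) re-rank to (1,1), (2,2) and (3,3) and give Fr* = 0, as stated.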

The third measure utilized by us was one of the metrics introduced by Fagin et al. (2003). It is relatively easy to compare two rankings of the same list of items – for this, well-known statistical measures such as Kendall's tau, Spearman's rho or Spearman's footrule can be utilized. The problem arises when the two search engines that are being compared rank non-identical sets of documents. To cover this case (which is the usual case when comparing top-k lists created by different search engines), Fagin et al. (2003) extended the previously mentioned metrics. Here we discuss only the extension of Spearman's footrule, but the extensions of Kendall's tau are shown by Fagin et al. to be equivalent to the extension of Spearman's footrule. A major point in their method was to develop measures that are either metrics or "near" metrics. Spearman's footrule is the L1 distance between two permutations, i.e. F(σ1, σ2) = Σ_i |σ1(i) − σ2(i)|.

This metric is extended to the case where the two lists are not identical by assigning to documents appearing in one of the lists but not in the other an arbitrary ranking in the second list (which is larger than the length of the list) – when comparing lists of length k this placement can be k + 1 for all the documents not appearing in the list. The rationale for this extension is that the ranking of those documents must be k + 1 or higher – Fagin et al. do not take into account the possibility that those documents are not indexed at all by the other search engine. The extended metric becomes:

F^{(k+1)}(\tau_1, \tau_2) = 2(k - z)(k + 1) + \sum_{i \in Z} \left| \tau_1(i) - \tau_2(i) \right| - \sum_{i \in S} \tau_1(i) - \sum_{i \in T} \tau_2(i)


where Z is the set of overlapping documents, and z is the size of Z, S is the set of documents that appear only in the first list and T is the set of documents that appear only in the second list. A problem with the measures proposed by Fagin et al. (2003) is that when the two lists have little in common, the non-common documents have a major effect on the measure. Our experiments show that usually the overlap between the top-ten results of two search engines for an identical query is very small, and the non-overlapping elements have a major effect.

F^(k+1) was normalized by Fagin et al. (2003) so that the values lie between 0 and 1. For k = 10 the normalization factor max F^(k+1) is 110. Thus, we compute:

G^{(k+1)} = \frac{F^{(k+1)}}{\max F^{(k+1)}}

which we refer to as the G measure.

As we have already mentioned, Spearman's footrule is calculated on the re-ranked list of overlapping elements, and ignores the actual ranks of the overlapping elements. Thus, for the case when there are only two overlapping elements, it cannot differentiate between the cases where the original placements of the overlapping elements are, say:

(1,1), (2,2)

(1,9), (2,10)

(1,2), (2,10)

In all three cases Spearman's footrule is 0, since after the re-ranking, in all three cases we are considering the pairs (1,1), (2,2).

The G measure does take into account the placement of the overlapping elements in the lists. For the above examples, the values of G will be:

0.655

0.945

0.818

This measure captures our intuition that, when the overlapping elements appear in the same order in the two lists, the G value should be smaller the more similar their placements are. On the other hand, even in the first case, where the top two URLs are identical and appear in the same order in both lists, the G value is considerable. Note that if the two lists are identical, then G = 0.
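The following sketch, under the same assumptions as before (top-ten lists of URL strings, illustrative names of our own), computes the normalized extended footrule G directly from the formula above and reproduces the three values quoted for the two-element overlap cases.

```python
from typing import Dict, Sequence

def g_measure(list_a: Sequence[str], list_b: Sequence[str], k: int = 10) -> float:
    """Third measure: normalized extended Spearman's footrule G (Fagin et al., 2003).

    URLs missing from one of the two top-k lists are treated as if ranked k+1
    there; the raw distance is divided by its maximum k(k+1), i.e. 110 for k = 10,
    so G is 0 for identical lists and approaches 1 for disjoint ones.
    """
    rank_a: Dict[str, int] = {url: i + 1 for i, url in enumerate(list_a[:k])}
    rank_b: Dict[str, int] = {url: i + 1 for i, url in enumerate(list_b[:k])}
    common = rank_a.keys() & rank_b.keys()
    z = len(common)
    f = (2 * (k - z) * (k + 1)
         + sum(abs(rank_a[u] - rank_b[u]) for u in common)
         - sum(r for u, r in rank_a.items() if u not in common)
         - sum(r for u, r in rank_b.items() if u not in common))
    return f / (k * (k + 1))

# Reproduce the three two-element-overlap cases from the text with placeholder URLs.
base = [f"a{i}" for i in range(1, 11)]

def two_overlap_case(positions_in_b: Sequence[int]) -> float:
    other = [f"b{i}" for i in range(1, 11)]
    # The URLs ranked 1 and 2 in `base` reappear in `other` at the given positions.
    other[positions_in_b[0] - 1] = base[0]
    other[positions_in_b[1] - 1] = base[1]
    return g_measure(base, other)

print(round(two_overlap_case((1, 2)), 3))   # 0.655
print(round(two_overlap_case((9, 10)), 3))  # 0.945
print(round(two_overlap_case((2, 10)), 3))  # 0.818
```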

Data analysis

For a given search engine and a given query we computed these measures on the results for consecutive data collection points. When comparing two search engines we computed the measures on the top-ten results retrieved by both engines at the given data collection point. The two periods were compared on five queries – here we calculated the overlap between the two periods and assessed the changes in the rankings of the overlapping elements based on the average rankings.
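A minimal sketch of this per-query bookkeeping is given below; `snapshots` is a hypothetical time-ordered list of recorded top-ten lists for one query on one engine, and the measure functions passed in would typically be the overlap, Fr* and G sketches above.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Measure = Callable[[Sequence[str], Sequence[str]], float]

def compare_consecutive(snapshots: Sequence[Sequence[str]],
                        measures: Iterable[Measure]) -> List[Tuple[float, ...]]:
    """Apply each comparison measure to every pair of consecutive snapshots."""
    measures = list(measures)  # allow the iterable to be reused for every pair
    return [
        tuple(measure(earlier, later) for measure in measures)
        for earlier, later in zip(snapshots, snapshots[1:])
    ]
```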


Results and discussion

A single engine over time

AlltheWeb was very stable during both phases on all queries, as can be seen in Table II. There were almost no changes either in the set of URLs retrieved or in the relative placement of these URLs in the top-ten results. Some queries were monitored by several students; thus, the number of data comparisons (comparing the results of consecutive data collection points) was high. For each query we present the total number of URLs identified during the period, and the average and minimum number of URLs that were retrieved at both of two consecutive data collection points (overlap). The maximum overlap was 10 for each of the queries, and an overlap of 10 was rather frequent; thus we computed the percentage of the comparisons where the set of URLs was not identical at both of the points that were compared (percent of points with overlap less than 10). In addition, Table II displays the percentage of comparisons where the relative ranking of the overlapping URLs changed, and the maximal values of Spearman's footrule (Fr*) and of G (the minimum values were 0 in all cases). Finally, in order to assess the changes in the top-ten URLs over a longer period of time, we also present the number of URLs that were retrieved at both the first and the last data collection points. Note that in some cases (q02, q03 and q08) the lists of retrieved URLs were identical at the first and last data collection points, even though at some time during the data collection period additional URLs also appeared in the top-ten results. The maximal value of Fr* for q09 is rather unusual; it is due to the fact that the rankings of the overlapping URLs were completely reversed between the two time points used for this measurement.

When considering the data for Google we see somewhat larger variability, but still the changes between two consecutive data points are rather small. Note that in Table III, for query q03 (world rugby), which was very topical at the time, there were frequent changes in the placement of the top-ten URLs.

Figures 1 and 2 present time series for query q04, web personalization. The x-axis of both graphs shows consecutive time-ordered data capture points. In Figure 1 we see that, during the observed period, the G metric fluctuates mainly between 0 and 0.2, apart from a significant rise to 0.3 during three data capture points in the middle of the period. This is due to the decrease in the size of the overlap (from 9 to 7) and to the observed changes in the ranking of the top-ten URLs.

Figure 2 illustrates the change in Google's ranking of one web page entitled "Web personalization – Computer World", which contains an article from the Computer World website. The ranking of this page was stable at position 6 for the first twenty-three data points observed. The ranking then fluctuated between positions 8, 9 and 10 from data capture points 25 to 35. It is interesting to observe that, during this period, the rank of this web page changed twice a day, in the morning and in the evening. The page then settled at position 9, and disappeared completely from the top-ten result list three days before the end of the observed period.

Figure 3 demonstrates an interesting case for AlltheWeb, where a URL that was initially ranked at position 3 did not appear among the top-ten for 12 days, reappeared first at position 5, and then was moved down to position 7. This was rather unusual for AlltheWeb, since, as can be seen from Table II, its rankings were very stable. The usual stability of the rankings of AlltheWeb can also be seen in Figure 4. This can be compared to the typical fluctuations in Google's rankings, which can be observed in Figure 5.


Table II. AlltheWeb – first period. For each query (q01-q10) the table reports the number of days monitored, the number of data comparisons, the number of URLs identified during the period, the average and minimum overlap, the percentage of points with overlap less than 10, the percentage of points where the relative ranking changed, the maximum Fr*, the maximum G, and the overlap between the first and last points.


Table III. Google.com – first period. Same layout as Table II: for each query, the number of days monitored, the number of data comparisons, the number of URLs identified during the period, the average and minimum overlap, the percentage of points with overlap less than 10, the percentage of points where the relative ranking changed, the maximum Fr*, the maximum G, and the overlap between the first and last points.


Similar analysis was carried out for the queries during the second period. The results appear in Tables IV and V. During the second period the results and the rankings of AlltheWeb were also highly stable. Google.com exhibited considerable variability, even though the average overlap was above 9 for all ten queries. Unlike AlltheWeb, quite often the relative placements of the URLs changed.

Perhaps the most interesting case for Google.com was query q10 (internet advertising techniques), where at some point all except two of the previous hits were replaced by completely new ones (and the relative rankings of the two remaining URLs were swapped), and from this point on the search engine presented this new set of results. This was not accidental: the same behavior was observed on Google.co.uk and Google.co.il as well. We do not display the results for Google.co.uk and Google.co.il here, since the descriptive statistics are very similar, even though there are slight differences between the result sets. This point will be discussed more extensively when we compare the results of the different engines.

Figure 1. Time series of the G metric for the query web personalization submitted to Google.com

Figure 2. Time series of Google's rankings for the "Web personalization – Computer World" page


Comparing two engines

At the time of the data collection, the two search engines reportedly indexed approximately the same number of documents (approximately 3 billion each). In spite of this, the results show that the overlap between the top-ten results is extremely small (see Tables VI and VII). The large averages of the normalized Spearman's footrule (Fr*) in cases where the average size of the overlap is greater than 3 indicate that the relative rankings of the overlapping elements are considerably different – thus, even for those URLs that are considered highly relevant for the given topic by both search engines, the agreement on the relative importance of these documents is rather low. In cases where the overlap is very small (two common URLs), nothing conclusive can be said – either the URLs appear in the same order (e.g. query q05 during the second period) or in opposite order (e.g. query q01 during the second period).

Figure 3. Time series of AlltheWeb's rankings for the "DNA Evidence Shelf" page

Figure 4. Time series of AlltheWeb's rankings for the "Web site promotion techniques" page


There are two possible reasons why a given URL does not appear in the top-ten results of a search engine: either it is not indexed by the search engine, or the engine ranks it after the first ten results. We checked whether the URLs identified by the two search engines during the second period are indexed by the search engines (we ran this check in February 2004). We defined three cases:

(1) The URL was in the top-ten list of the engine some time during the period (called "top-ten").

(2) It was not in the top-ten, but was indexed by the search engine ("indexed").

(3) It was not indexed at all ("not indexed").

The results for queries q01-q05 appear in Table VIII. The results for these five queries show that both engines index most of the URLs located (between 67.6 and 96.6 percent of the URLs, i.e. the values for top-ten and indexed combined); thus it seems that the ranking algorithms of the two search engines are highly dissimilar.
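The three-way classification just described can be sketched as follows; the `is_indexed` check is a caller-supplied placeholder (the paper does not specify how index membership was queried), and all names are illustrative.

```python
from typing import Callable, Dict, Iterable, Set

def classify_urls(urls: Iterable[str],
                  seen_in_top_ten: Set[str],
                  is_indexed: Callable[[str], bool]) -> Dict[str, str]:
    """Label each URL with one of the three categories used in Table VIII."""
    labels: Dict[str, str] = {}
    for url in urls:
        if url in seen_in_top_ten:
            labels[url] = "top-ten"       # appeared in the engine's top ten at some point
        elif is_indexed(url):
            labels[url] = "indexed"       # known to the engine, but never in the top ten
        else:
            labels[url] = "not indexed"   # not found in the engine's index at all
    return labels
```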

During the second period we collected data not only from Google.com, but also from Google.co.uk and Google.co.il. Overall, the results are rather similar, but there are some differences, as can be seen from the results for five randomly chosen queries, comparing Google.co.il and AlltheWeb (Table IX – compare with Table VII) and comparing Google.com with Google.co.il (Table X).

When comparing AlltheWeb with Google.com versus Google.co.il, we observe a considerable difference in the normalized footrule for query q08 (snowboarding). This was the only query for which the top-ten results of Google.com and Google.co.il were at no point in time identical. This also had an influence on the overlap with AlltheWeb: although the average size of the overlap is rather similar (3.31 versus 3.51), some of the URLs on which the two interfaces of Google intersected with AlltheWeb were different at the same data collection point.

For query q02 (web data mining), the results of Google.co.il and Google.com became very different at one point (only three overlapping elements); however, the relative ranking of these three remained the same (thus the low average value for the footrule in Table X), and this had little influence on the comparison with AlltheWeb, since the intersection with AlltheWeb was essentially the same for Google.com and for Google.co.il. The G values for the initial data collection points were all 0 (total agreement between the two lists), but at the last ten data collection points (where the overlap decreased to 3), the G value increased to values around 0.6 (see Table X).

Figure 5. Time series of Google's rankings for the "Don't Panic Eat Organic" page


Table IV. AlltheWeb – second period. Same layout as Table II.


Table V. Google.com – second period. Same layout as Table II.


Table VI. Comparing the search results for AlltheWeb and Google.com – first period. For each query the table reports the number of days monitored, the number of comparisons, the average, minimum and maximum overlap, the average, minimum and maximum Fr*, and the average, maximum and minimum G.


Table VII. Comparing the search results for AlltheWeb and Google.com – second period. Same layout as Table VI.


Comparing two periods

The second period of data collection took place about three months after the first one. We tried to assess the changes in the top-ten lists of the two search engines. The findings are summarized in Table XI. Here we see again that AlltheWeb is less dynamic than Google, except for query q04 (web personalization), where considerable changes were recorded for AlltheWeb as well.

Discussion and conclusions

In this paper, we computed a number of measures in order to assess the changes that occur over time to the rankings of the top-ten results on a number of queries for two search engines. We computed several measures, since none of them was satisfactory as a standalone measure for such an assessment. Overlap does not assess rankings at all, while Spearman's footrule ignores the non-overlapping elements and takes into account relative placement only. Moreover, the extended G measure gives too much weight to the non-overlapping elements. The three measures together provide a better picture than any of these measures alone. Since none of these measures is completely satisfactory, we recommend experimenting with additional measures in the future.

The results indicate that the top-ten results usually change gradually. Abrupt changes were observed only very occasionally. Overall, AlltheWeb seemed to be much less dynamic than Google. One possible reason could have been that during this period AlltheWeb's database was refreshed less often than Google's. The ranking algorithms of the two search engines seem to be highly dissimilar, despite the fact that both engines index most of the URLs that appeared in the top-ten lists. The differences in the top-ten lists of the two search engines are large (the overlap is small and the correlations between the rankings of the overlapping elements are usually small, sometimes even negative). One reason for Google being more dynamic may be that the copies of its search indexes are unsynchronized while they are being updated, combined with the non-deterministic nature of query processing due to its distributed architecture. This may also explain the slight differences between the results produced by Google's different interfaces.

Table VIII. URLs indexed by both engines

                              AlltheWeb                                   Google.com
Query  URLs identified  Top-ten (%)  Indexed (%)  Not indexed (%)  Top-ten (%)  Indexed (%)  Not indexed (%)
q01    28               35.7         42.9         21.4             71.4         25.0         3.6
q02    24               45.8         45.8         8.4              70.8         25.0         4.2
q03    22               50.0         31.8         18.2             77.3         13.6         9.1
q04    39               33.3         35.9         30.8             71.8         12.8         15.4
q05    20               50           25           25               60           30           10


Table IX. Comparing the search results for AlltheWeb and Google.co.il – second period (five queries). Same layout as Table VI.


Table X. Comparing the search results for Google.com and Google.co.il – second period (five queries). Same layout as Table VI.


Table XI. Comparing the two periods. For each query and for each engine (AlltheWeb and Google) the table reports the number of URLs identified over the two periods, the overlap between the two periods, the number of URLs missing from the second set, and the minimum and maximum change in average ranking.


An additional area for further research, along the lines of the research carried out by Vaughan (2004), is comparing the rankings provided by the search engines with human judgments placed on the value of the retrieved documents.

References

AlltheWeb (2004), available at: www.alltheweb.com (accessed February 18, 2004).

Chowdhury, A. and Soboroff, I. (2002), "Automatic evaluation of worldwide web search services", Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 421-2.

Diaconis, P. and Graham, R.L. (1977), "Spearman's footrule as a measure of disarray", Journal of the Royal Statistical Society, Vol. 39 No. 2, pp. 262-8 (Series B (Methodological)).

Dwork, C., Kumar, R., Naor, M. and Sivakumar, D. (2001), "Rank aggregation methods for the web", Proceedings of the 10th Worldwide Web Conference, May, Hong Kong, pp. 613-22.

Fagin, R., Kumar, R. and Sivakumar, D. (2003), "Comparing top k lists", SIAM Journal on Discrete Mathematics, Vol. 17 No. 1, pp. 134-60.

Google (2003a), Google Information for Webmasters, available at: www.google.com/webmasters/seo.html (accessed February 18, 2004).

Google (2003b), Google Information for Webmasters, available at: www.google.com/webmasters/4.html (accessed February 18, 2004).

Google (2004), Google, available at: www.google.com (accessed February 18, 2004).

Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001), “Measuring search engine quality”,Information Retrieval, Vol. 4, pp. 33-59.

Nielsen/NetRatings (2004), NetView Usage Metrics, available at: www.netratings.com/news.jsp?section=dat_to (accessed September 16, 2004).

Price, G. (2004), “Google ups total page count”, Resourceshelf, available at: http://www.resourceshelf.com/archives/2004_02_01_resourceshelf_archive.html#107702946623981034 (accessed February 18).

Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. (1999), "Analysis of a very large web search engine query log", ACM SIGIR Forum, Vol. 33 No. 1, available at: www.acm.org/sigir/forum/F99/Silverstein.pdf (accessed February 18, 2004).

Singhal, A. and Kaszkiel, M. (2001), "A case study in web search using TREC algorithms", Proceedings of the 10th International Worldwide Web Conference, May, pp. 708-16, available at: www10.org/cdrom/papers/pdf/p317.pdf (accessed February 18, 2004).

Spink, A., Ozmutlu, S., Ozmutlu, H.C. and Jansen, B.J. (2002), "US versus European web searching trends", SIGIR Forum, Fall, available at: www.acm.org/sigir/forum/F2002/spink.pdf (accessed February 18, 2004).

Soboroff, I., Nicholas, C. and Cahan, P. (2001), "Ranking retrieval systems without relevance judgments", Proceedings of the 24th Annual International ACM SIGIR Conference, pp. 66-72.

Su, L.T., Chen, H.L. and Dong, X.Y. (1998), "Evaluation of web-based search engines from the end-user's perspective: a pilot study", Proceedings of the ASIS Annual Meeting, Vol. 35, pp. 348-61.

Sullivan, D. (2003a), Yahoo to Buy Overture, available at: http://searchenginewatch.com/searchday/article.php/2234821 (accessed October 4, 2004).

Sullivan, D. (2003b), Buying Your Way in: Search Engine Advertising Chart, available at: www.searchenginewatch.com/webmasters/article.php/2167941 (accessed February 18, 2004).


Sullivan, D. (2003c), Florida Google Dance Resources, available at: www.searchenginewatch.com/searchday/article.php/3285661 (accessed February 18, 2004).

Sullivan, D. and Sherman, C. (2004), The 4th Annual Search Engine Watch 2003 Awards, available at: www.searchenginewatch.com/awards/article.php/3309841 (accessed February 18, 2004).

Vaughan, L. (2004), "New measurements for search engine evaluation proposed and tested", Information Processing and Management, Vol. 40, pp. 677-91.

Voorhees, E.M. (2000), "Variations in relevance judgments and the measurement of retrieval effectiveness", Information Processing and Management, Vol. 36, pp. 697-716.

Further reading

Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (2004), "Results and challenges in web search evaluation", Computer Networks, Vol. 31 Nos 11-16, pp. 1321-30.

Corresponding author

Mark Levene can be contacted at: M.Levene@dcs.bbk.ac.uk
