EXAMINE THE FREQUENCY AND PERIODICITY OF REPETITION BEHAVIOUR

A study submitted in partial fulfillment of the requirements

For the degree of

Master of Science in Information Management

At the

Department of Information Studies,

The University of Sheffield

By

Ruoning Qian

September 2010

ABSTRACT

Query repetition is very common in web searching. Numerous studies have examined
different aspects of this behavior, either through search log analysis or through
experimental studies. While many of them focused on identifying the characteristics of
the behavior through static observation, only a few have studied how repetition unfolds
over time. During the last few years, a number of studies have explored the potential of
using the time-related features of this behavior to assist search engine development.
These possible applications make further longitudinal study of those features more
attractive than before.

This study aimed to identify the periodicity of users’ query repetition behavior and to
understand how that behavior varies by query type and by user frequency. A query log
containing the queries and click-through data of thousands of users over a period of
three months was collected from a popular American search engine and then analyzed in a
relational database. The results show that, in general, users’ repetition behavior
follows a 7-day periodicity; however, informational and navigational queries are
repeated in different ways when examined separately, and frequent search engine users
repeat queries differently from low-frequency users. In particular, it was found that a
query repeated after 9 days is likely to meet a rank change. Based on these results, we
conclude that, since users’ query repetition behavior follows different patterns
depending on query intent and user type, it is important to identify both the query type
and the user type in order to determine the likely periodicity for each.

ACKNOWLEDGEMENT

I thank my supervisor, Dr. Mark Sanderson, who provided both the log data and useful
advice during the study, and I am grateful for his understanding and patience throughout
the process. Thanks are also due to Mr. Peter Stordy and Mr. John Holiday, who were
willing to help me with the use of Oracle Database, and to Dr. Paul Clough for
understanding my special situation. Finally, I thank my friend Wang, who supported me
with a server-like computer for the log analysis.

Word Count

Number of Pages: 80

Number of Words: 16,500

CONTENTS

ACKNOWLEDGEMENT

1 INTRODUCTION

1.1 Background

1.2 Motivation

1.3 Research Objectives

1.4 Dissertation Structure

1.5 Key Findings

2 LITERATURE REVIEW

2.1 Overall Query Repetition Examination

2.2 Repetition Examined by Individual and Group

2.3 Static Locality Examination

2.4 Temporal Locality Examination

2.5 Temporal Repetition Features Examined by Query Type

2.6 User Variance from General Repetition Pattern

2.7 Applications and Implications

3 METHODOLOGY

3.1 Search Log Analysis as Method

3.1.1 Search Log

3.1.2 Theoretical Foundation

3.1.3 Possible Issues Related to the Process

3.2 Data Collection

3.2.1 Data Source

3.2.2 Standard Field in Search Log

3.2.3 Privacy Issue and Offensive Information

3.3 Platform and Tools

3.4 Data Preparation

3.4.1 Data Importing

3.4.2 Remove Corrupted Data

3.4.3 Abnormal User Identification

3.4.4 Group Query Episode

3.4.5 Query Classification

3.5 Metric for Analyzing

4 DATA ANALYSIS

4.1 Process Design

4.2 Special Method

4.2.1 Time-Based Frequency

4.2.2 Repetition Distance

4.2.3 Normalization

4.3 Overall Repetition Examination

4.4 Temporal Repetition Rate Examination

4.4.1 User Frequency

4.4.2 User Variance on Temporal Repetition Rate

4.5 Repetition Periodicity Identification

4.5.1 General Query Repetition Periodicity

4.5.2 Query Repetition Periodicity Examined by Type

4.5.3 User Variance on Repetition Periodicity

4.6 Rank First Convert to Change

5 RESULTS

5.1 Results for Data Preparation

5.2 Results for Overall Query Repetition Examined

5.3 Temporal Query Repetition Rate Examined

5.4 Repetition Periodicity Examination

5.5 Rank First Convert to Change

5.6 Result Summary

5.7 Result Discussion

6 CONCLUSIONS

6.1 Contribution

6.2 Limitation and Future Work

REFERENCE

- Figure 1 Abnormal User Detection
- Figure 2 Temporal Repetition Rate Distribution
- Figure 3 User Frequency Distribution
- Figure 4 User Variance on Temporal Repetition Rate
- Figure 5 User Repetition Rate Distribution
- Figure 6 Same User Query Time Intervals
- Figure 7 Same User Same Query Time Intervals
- Figure 8 Normalized General Query Repetition Time Intervals
- Figure 9 Informational Query Repetition Periodicity
- Figure 10 Navigational Query Repetition Time Intervals
- Figure 11 User Variance on Repetition Periodicity
- Figure 12 General Rank Change Time Intervals
- Figure 13 Rank Changes Distribution
- Figure 14 Rank First Change Periodicity
- Table 1 Data Preparation Result
- Table 2 Repetition Overview
- Table 3 Click Repetition Overview
- Table 4 Most Repeated Queries
- Table 5 Most Repeated Queries across User
- Table 6 Result Sheet

1 INTRODUCTION

1.1 BACKGROUND

According to a Nielsen report in 2009, web searching is the most popular internet
activity. More and more people use search engines every day to find information or to
navigate, and as the web grows in size, people’s reliance on search engines grows with
it. This dramatic increase in search engine usage has given rise to a growing interest
in web searching studies, including the modeling of user behavior and of web search
engine performance, as summarized by Spink & Jansen (2004). During the last ten years,
numerous studies of web searching, especially web user studies, have been carried out in
order to better understand users’ information needs, search engine usage and related
topics.

Among the user behaviors identified so far, one is of special interest to many
researchers. Although millions of queries are submitted to a search engine by thousands
of different users every day, a relatively small set of queries is issued again and
again. Large numbers of repeated queries are found in most search engine query logs,
which confirms that people tend to repeat queries that have been searched before, either
by themselves or by others.

This finding has prompted much interest, both in investigating the behavior further and
in exploiting it in other research. Many findings about the behavior have been reported
since. For example, besides repeating queries, users tend to click on the same results
they have clicked on before, whether from the same query or from a different one, so
their choice of results tends to be highly consistent with the past (Smyth et al., 2004).
A small number of queries are repeated very frequently, while the majority of queries
are repeated less often (Xie and O’Hallaron, 2002). Some studies have also identified a
weekly and daily periodicity in the behavior, as well as repetition patterns that vary
by task (Sanderson and Dumais, 2006). Other lines of research have made progress by
building on it: studies of user trends identify the common interests of users from the
most frequently repeated queries; research on caching has benefited from a two-level
caching strategy that depends on whether a query is repeated often by a single user or
shared among the majority; and result ranking could take a repeated click as a sign of
high relevance.

Beyond these existing findings, many new research directions have been pointed out,
among which further examination of the periodic features of the behavior is of special
interest.

1.2 MOTIVATION

Understanding the time-related features of users’ query repetition behavior can be very
beneficial to the development of search engine strategies.

How long does a burst of an informational query last, and how often do users use a
search engine to re-navigate? These questions are of great interest to search engines,
because caching strategies should be based on the different temporal demand of repeated
queries, both those shared by the majority and those pursued by an individual over time.
For a search engine cache, the tension is always between limited cache space and the
need to respond quickly from stored results. Deciding the right time to replace a cached
query that is unlikely to be repeated with a new query that is likely to be repeated is
therefore important. It would be valuable to predict the likelihood of a query being
repeated within a certain time, based on the user intention behind the query.

In order to provide personalized search services for different users, it is necessary to
decide how long a user profile should be kept and used for query re-use and result
re-ranking. Dou et al. (2008) found that users of different frequency benefit from
different personalization strategies. Whether to build a short-term or a long-term
profile for future repetition is therefore an important decision, and it can be informed
by identifying the different user frequency groups.

In order to prevent the re-ranking of results from hindering the re-finding of
previously viewed result pages (Teevan et al., 2006), it is better to choose the time to
re-rank results based on the likely time interval between a query and click and their
repetition. It has also been suggested that the periodic character of a query’s
repetition can be used as a signature to identify semantically related queries (Vlachos
et al., 2004) or as a criterion for query classification ().

Given these potential benefits, it is both interesting and necessary to identify the
time-related features of the repetition behavior that could be used to improve search
engine effectiveness and to provide timely and personalized services. This study aims to
identify some of these features and to yield new findings.

1.3 RESEARCH OBJECTIVES

Based on the above motivation, this research focuses on the temporal features of the
repetition behavior, both in general and specifically by query type (which varies by
task) and by user. As mentioned above, a change of rank is one of the challenges a user
may meet when re-accessing a result, so examining the relation between re-ranking and
re-clicking is also included. The research aims of this study are as follows:

- To examine the temporal query repetition rate and its variance across users;
- To identify the general query repetition periodicity;
- To identify the repetition periodicity as it varies by the user’s query intention;
- To identify the variance in periodicity based on user frequency;
- To identify the rank change periodicity during repetition and to find the relation
between the two periodicities.

1.4 DISSERTATION STRUCTURE

In this study, search log analysis is performed to examine the query repetition behavior
identified in a large search query log. The dissertation starts by reviewing related
work on query repetition behavior and other studies in which search log analysis was the
main method. The data and methodology are presented next, giving both the detailed steps
of the research process and a methodological explanation of the methods used. The
results are shown in a later chapter, followed by a discussion that tries to explain the
results and relate them to those of previous work. The limitations of the study and
suggestions for future research are given at the end.

1.5 KEY FINDINGS

This study shows that, in general, users repeat their queries within a 7-day period.
However, further analysis by query type, based on user intention, shows that while a
repeated informational query tends to burst within 3 days, a navigational query tends to
be issued again by the same user after a week. Analysis of users of different frequency
reveals that frequent users repeat more than non-frequent users; however, they also seem
to issue more unique queries after they have issued a certain number of queries. The
comparison between the rank change periodicity and the query repetition periodicity
shows that a user who repeats a query, most likely a navigational one, after 9 days and
wants to click on the same result as last time will possibly meet a rank change.

This introduction provides an overview of the whole dissertation; the details are given
in the remaining chapters. The next chapter first reviews previous work on query
repetition.

2 LITERATURE REVIEW

Reviewing previous work before carrying out the study helps to position this study
within the related work on the topic and to establish a general framework for the
research. The literature review in this chapter discusses relevant studies of query
repetition behavior, organized into the following aspects:

- Overall Repetition Examination

- Repetition Examined by individual and group

- Repetition Frequency Examination

- Query repetition examined by user intent

- User variance in query repetition

- Temporal features examined

- Applications and implications

2.1 OVERALL QUERY REPETITION EXAMINATION

Smyth et al. (2004) made two basic assumptions about web searching: that similar queries
tend to recur, and that similar results tend to be selected again. Much work has tried
to identify query repetition in different query logs.

Markatos (2000) analyzed a million queries from the Excite web search engine and found
that roughly 20%-30% of the submitted queries were resubmitted by either the same or
different users. Xie and O’Hallaron (2002) studied the Vivisimo log data over a period
of 35 days and found that about 32% of the queries were repeated at least once; the
study of the Excite log showed that 42% of the queries were repeated ones.

Teevan et al. (2004) observed 13,060 queries and 21,942 clicks from 114 Web browsers
over a period of one year, and found that 67% of all queries were submitted more than
once. They also found that 71% of the repeated clicks were from the same query and 28%
were from the same user, while only 7% were from multiple users. Users were more likely
to click on previously viewed result pages.

Sanderson and Dumais (2006) analyzed 3.3 million queries with 7.7 million clicks from
more than 30 thousand unique users over a period of 3 months. They found that repeat
queries accounted for more than 50% of all submitted queries, and that 17.5% of the
total clicks were repeated ones.

Dou et al. (2008) analyzed query repetition in a large Chinese search engine log and
found that about 21.87% of the distinct queries had been submitted more than once, and
that the repeated query instances accounted for 54.78% of the total query instances.

As these statistics show, repeated queries accounted for 20%-67% of the total queries.
Although the figures differ somewhat from each other, all of these works strongly
suggest that query repetition is quite common in today’s web searching and that people’s
choices of results tend to be consistent with their previous ones. These findings make
the modeling of user behavior possible, and these two premises have established the
foundation for related studies on query repetition behavior.
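The percentages above are not directly comparable, partly because "repetition" can be counted per distinct query or per query instance, as in the two figures reported by Dou et al. (2008). The following minimal sketch illustrates the difference on a toy list of queries. The function name `repetition_stats`, the lower-casing of queries and the choice to count every occurrence of a repeated query (including its first) as a repeated instance are assumptions made for illustration, not the definitions used in the cited studies.

```python
from collections import Counter

def repetition_stats(queries):
    """Return two shares for a list of query strings (illustrative only):
    - the share of distinct queries submitted more than once
    - the share of query instances that belong to such queries
    """
    # Assumption: queries are compared after trimming and lower-casing.
    counts = Counter(q.strip().lower() for q in queries)
    repeated = {q: n for q, n in counts.items() if n > 1}
    distinct_share = len(repeated) / len(counts)
    instance_share = sum(repeated.values()) / sum(counts.values())
    return distinct_share, instance_share

# Toy log, invented for this example.
log = ["ebay", "weather", "ebay", "ebay", "python tutorial", "maps", "weather"]
distinct_share, instance_share = repetition_stats(log)
print(f"distinct queries repeated: {distinct_share:.2%}")
print(f"instances that are repeats: {instance_share:.2%}")
```

Even in this small log the two measures diverge (half of the distinct queries are repeated, but they make up most of the instances), which is one reason the reported rates span such a wide range.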

2.2 REPETITION EXAMINED BY INDIVIDUAL AND GROUP

Examining repetition by user group is very useful, especially for developing
personalized strategies.

Individual user analysis has implications for personalization, while group user analysis
is generally used for trend discovery, news event detection and query re-use strategies.
Some works have drawn out both implications. It was suggested that frequently repeated
queries shared by groups of users should be cached on the server side in order to meet
general information needs, while query re-use at the personal level may be better served
by client-side caching. This two-level caching strategy was shown to be both useful and
effective (Fagni, 2006).

Web user trend detection is one of the most popular routes followed by early work, as
well as one of the most widely adopted methods for exploring new markets and for
group-targeted advertising. A search engine query log can be viewed as a database of
user interests. Brooks (2005) discussed one application of query repetition analysis in
advertising. He tried to identify a causal relation between repeated searches for a
certain product and the final purchase, adopting a time-to-convert method that analyzes
the number of clicks before payment to identify the occurrences of repeated searches
most likely to lead to a purchase.

Apart from general trends, individual query repetition may be used for interest
detection or user group classification. Many studies have focused on the vocabulary of
personal term use. Xie and O’Hallaron (2002) suggested that many users draw on a small
set of terms, so term-level analysis should be rather effective for long-term query
prediction. However, Dou et al. (2008) suggested that term-level analysis captures only
short-term information needs, and that a long-term personalization strategy would
benefit from detecting underlying interests instead.

2.3 STATIC LOCALITY EXAMINATION

Examining query frequency gives a cumulative overview of the number of times each query
was submitted within the period analyzed, which sheds light on the degree of repetition
of different queries.

Jansen et al. (2000, 2001) indicated that neither queries nor query terms follow a
Zipfian distribution, because they had identified large numbers of infrequently repeated
queries and terms in the log. This was revised by Saraiva et al. (2001), who found that
query frequencies follow a Zipf-like distribution in an analysis of 10 thousand queries
from a Brazilian search engine. Xie and O’Hallaron (2002) later identified a similar
distribution of query frequency in a comparative study of the Vivisimo and Excite query
logs. Lempel and Moran (2003) analyzed around seven million queries submitted to
AltaVista in 2001 and found that query frequency followed a power law; this was also
confirmed by Eiron and McCurley (2003) in their study of web query vocabulary.

These works suggest that only a small percentage of queries are repeated many times,
while a large share of queries are repeated less often. They were based on static
observation of repetition frequency and provide no insight into how queries are repeated
over a period of time.
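A Zipf-like or power-law distribution means that a query's frequency falls off roughly as a power of its frequency rank, so the rank-frequency plot is close to a straight line on log-log axes. The sketch below estimates that slope with an ordinary least-squares fit. It is only an illustration: the function name `zipf_slope` and the synthetic log are invented here, and a simple log-log regression is a rough diagnostic rather than the fitting procedure used in the studies cited above.

```python
import math
from collections import Counter

def zipf_slope(queries):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf case; other negative slopes
    indicate a more general power law.
    """
    freqs = sorted(Counter(queries).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic log: the query of rank r is issued about 120 / r times.
synthetic_log = []
for rank in range(1, 21):
    synthetic_log.extend([f"q{rank}"] * round(120 / rank))

slope = zipf_slope(synthetic_log)
print(f"estimated log-log slope: {slope:.2f}")
```

On a real query log, the long tail of queries seen only once tends to flatten the fitted line, which is in keeping with the observation above that most queries are rarely repeated.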

2.4 TEMPORAL LOCALITY EXAMINATION

Whereas the static observations above tend to focus on verifying or describing the
existence of query repetition, studies that analyze the temporal evolution of users’
repetition behavior can shed light on its time-related features.

Wang et al. (2003) examined the query logs of a university search engine over four years
(1997-2001) and analyzed the temporal query frequency by day, month and season. Beitzel
et al. (2004) analyzed a very large AOL query log containing queries from millions of
users over a period of one week, and found that the hourly query repetition rate
remained constant throughout the day.

Dou et al. (2008) analyzed the evolution of hourly query repetition rates over a period
of one month on a large Chinese search engine. Such temporal analyses give an overview
of how the cumulative number of repetitions changes by hour or day, but they provide no
insight into when a particular query is repeated after it is first issued.

Wedig and Madani (2006) discovered that some users repeat clicks over long periods of
time. Xie and O’Hallaron (2002) found that the majority of repeated queries are repeated
within a short time interval, while a number of queries are repeated over a relatively
longer period.

These works have contributed to estimating when repeated navigational queries are likely
to occur. They suggest that a repeated event can be predicted either by identifying both
the time span in which a certain repetition would occur and its probability of
occurrence, or by observing a frequently occurring combination of the two repeated
events. However, their findings apply only as a general rule, describing the query
repetition pattern without regard to query type.

2.5 TEMPORAL REPETITION FEATURES EXAMINED BY QUERY TYPE

Previous repetition studies by query type were generally based on examining the
co-occurrence of repeated queries and clicks. In their study of re-finding, Lee et al.
(2005) found that navigational queries tend to have a highly concentrated click
distribution, while users click on a wider range of results for informational queries.
They then used the click distribution to discriminate navigational queries from
informational ones.
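The idea of separating query types by how concentrated their clicks are can be sketched as follows. For each query, the code computes the share of clicks that went to the most-clicked URL and the entropy of the click distribution; a query is labeled navigational when the top share is high. The function names, the sample clicks and the 0.8 threshold are all hypothetical and chosen only for illustration; they are not the features or cut-offs used by Lee et al. (2005).

```python
import math
from collections import Counter, defaultdict

def click_profiles(clicks):
    """Map each query to (top_share, entropy) of its click distribution.

    clicks: iterable of (query, clicked_url) pairs.
    """
    by_query = defaultdict(Counter)
    for query, url in clicks:
        by_query[query][url] += 1
    profiles = {}
    for query, urls in by_query.items():
        total = sum(urls.values())
        top_share = max(urls.values()) / total
        entropy = -sum((n / total) * math.log2(n / total) for n in urls.values())
        profiles[query] = (top_share, entropy)
    return profiles

def label(top_share, threshold=0.8):
    # Hypothetical cut-off: clicks concentrated on one URL suggest navigation.
    return "navigational" if top_share >= threshold else "informational"

# Invented click data: one concentrated query, one spread-out query.
clicks = (
    [("ebay", "ebay.com")] * 9
    + [("ebay", "ebay.co.uk")]
    + [("jaguar", url) for url in ("cars.example", "zoo.example",
                                   "wiki.example", "team.example")]
)
profiles = click_profiles(clicks)
for query, (top_share, entropy) in profiles.items():
    print(query, label(top_share), f"top share {top_share:.2f}, entropy {entropy:.2f} bits")
```

In practice a threshold like this would need to be tuned against labeled queries, and queries with very few clicks would need separate handling because their click distributions are unreliable.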

Lu et al. (2006) later confirmed and extended Lee’s work. They examined the features of
the click-through data produced by informational and navigational queries (whose types
were pre-defined in a training set) over a period of time. They discovered that
navigational queries show more stable temporal features than informational queries,
producing less diversity in the click-through data. The top pages clicked as a result of
these queries are unlikely to differ much over time. This means that, when repeated by
one or more users, a navigational query produces a smaller set of clicks concentrated on
only a few most-clicked URLs.

Teevan et al. (2006, 2007) found that navigational queries tend to be repeated with one
or two often-repeated clicks. Using this property, combined with other criteria, they
identified 12% of all queries as navigational. They also found that navigational queries
tend to be repeated more often than others and at longer time intervals, and suggested
that, given these features, navigational query behavior is particularly easy to predict.

Later work, such as Asur and Buehrer (2009), identified the different temporal patterns
exhibited by navigational queries and news queries. They found that most news-oriented
queries tend to occur in bursts, restricted to only a few time intervals between two
repeated events, whereas navigational queries occur more frequently without showing any
strong pattern over a short period of time.

Although the works above focused on the temporal features of both informational and
navigational queries, they did not specify over how long these queries are repeated, nor
did they examine the periodic features of the behavior.

Sanderson and Dumais (2006), extending the previous work of Teevan et al. (2004),
examined the temporal features of repeated queries and clicks over a period of three
months in 2006. They measured the time interval between paired repetition events and
identified a dominant 7-day periodicity in the daily analysis and a 24-hour periodicity
in the hourly analysis. They also separated the repetition pattern of navigational
queries from the rest and found that navigational queries are more likely to be repeated
at longer time intervals, whereas the remaining queries, most of which are
information-seeking, are repeated in close temporal proximity.
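Measuring the interval between paired repetition events can be sketched as a histogram of gaps, in whole days, between successive submissions of the same query by the same user. A weekly periodicity would then show up as peaks at 7, 14, 21 days and so on. The sketch below makes several assumptions that are not taken from Sanderson and Dumais (2006): events are paired only with the immediately preceding occurrence, gaps are rounded to the nearest day, and the function name and sample events are invented.

```python
from collections import Counter
from datetime import datetime

def repeat_interval_histogram(events):
    """Count gaps (rounded to whole days) between successive repeats.

    events: iterable of (user_id, query, timestamp) tuples.
    Returns a Counter mapping gap in days to number of repeat pairs.
    """
    last_seen = {}
    histogram = Counter()
    for user, query, when in sorted(events, key=lambda e: e[2]):
        key = (user, query.strip().lower())
        if key in last_seen:
            gap_days = round((when - last_seen[key]).total_seconds() / 86400)
            histogram[gap_days] += 1
        last_seen[key] = when
    return histogram

# Invented events: one weekly re-navigation, one short informational burst.
events = [
    ("u1", "bbc news", datetime(2010, 5, 3, 9, 0)),
    ("u1", "bbc news", datetime(2010, 5, 10, 9, 5)),
    ("u1", "bbc news", datetime(2010, 5, 17, 8, 55)),
    ("u2", "cheap flights", datetime(2010, 5, 3, 10, 0)),
    ("u2", "cheap flights", datetime(2010, 5, 3, 10, 20)),
    ("u2", "cheap flights", datetime(2010, 5, 4, 11, 0)),
]
histogram = repeat_interval_histogram(events)
for gap in sorted(histogram):
    print(f"{gap} day(s): {histogram[gap]} repeat(s)")
```

Replacing the day-level rounding with hour-level bins would give the corresponding hourly view in which a 24-hour periodicity could be looked for.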

2.6 USER VARIANCE FROM GENERAL REPETITION PATTERN

Some recent works have explored personalization opportunities by examining how users
vary from the general repetition pattern.

Dou et al. (2007) explored the relation between query frequency and repetition frequency
by experimenting with short-term and long-term caching strategies separately on unique
queries and repeated queries. By comparing the hit rates of the different groups, they
found that long-term caching improves the chance of predicting a click from a previous
query-click pair as the number of searches by a user grows; but after a certain query
frequency (70 queries), users tend to submit more new queries, which means that less
repetition is observed.

In later work, Dou et al. (2008) tried to confirm and extend this finding by analyzing
large query logs. They showed that frequent searchers tend to have a different
repetition pattern from low-frequency users, and that the repetition rate stabilizes
beyond a certain query frequency. Their findings suggest that query repetition is
observed less often beyond a certain distance, measured as the number of intervening
queries.

These findings accord with those of Wedig and Madani (2006), who found that a user’s
interests diverge from the general population after the user has issued more than 100
queries. This means that fewer repeated queries are shared among users after a certain
period.

Exploring user variance in query repetition is therefore important for determining how
far past queries can be exploited to predict a repeated query or click.

2.7 APPLICATIONS AND IMPLICATIONS

- Query expansion based on term-level repetition

Query repetition examined at the term level holds great promise for query suggestion.
Xie and O’Hallaron (2002) found that, although users tend to use different queries, they
tend to form them from a small number of words. This indicates that exploring repeated
terms is more cost-effective than working at the query level, and that query suggestions
could be made from the most often repeated terms. Term-level repetition analysis could
also be used for query clustering and query expansion.

- Repetition periodicity used for Re-Ranking strategies

Re-ranking the results according to the past query history is not a new route of
study. However, previous studies have shown that a change of rank hinders the process
of re-finding and also reduces the chance of a click (Teevan et al, 2007). This
finding suggests that the re-ranking of results would better happen after the longest
period within which a repeated click would occur.

- Repetition periodicity for predicting repeated click

Teevan et al (2007) discussed the possibility of utilizing repetition periodicity to
predict the occurrence of the next repetition. The same idea interested Xie and
O’Halloran (2002), who proposed that repetition features could be used to predict the
likelihood of a query being issued again. However, as discussed in the previous
section, prediction based on periodic features is left for future study.

- Repetition pattern for identifying semantically related queries

In the work of Vlachos et al (2004, 2005, 2010), the pattern in which a query is
repeated over time, regardless of user, was used to identify semantically similar
queries. This is based on the finding that semantically related queries tend to have
the same query demand over time. Zhao et al (2006) tried to use click-through data
as a way to measure the similarity between queries.

- Repetition periodicity to aid Information Re-finding

Re-finding behavior is closely related to query repetition behavior: as shown by
Teevan (2007), people’s re-finding of a particular website relies largely on re-using
a search engine, and can therefore be identified by search log analysis.

- Repetition for Personalization

Personalization strategies are based on the assumption that, when users resubmit a
query, their selections of results tend to be highly consistent with the previous
ones (Smyth et al, 2004). Exploiting the results of past queries enables the search
engine to gather a collection of possible choices for the user to choose from if the
same query is submitted again. Personalization strategies based on past repeated
clicks can be very effective (Dou et al, 2007).

- Repetition for query classification

At the beginning, classification was mainly done manually. Broder (2002) defined
queries as informational, navigational or transactional and manually classified 200
queries by studying an online survey of AltaVista users; Beitzel et al. (2003)
categorized queries in a search log as navigational by matching them to a list
generated from the titles of web directories.

Later works have been able to classify queries automatically. Lee et al (2003) tried
to classify queries automatically by comparing navigational and informational
queries; Yates et al (2006) used machine learning to classify queries as
informational and non-informational.

Jansen et al (2007) provided, for each category, a series of characteristics
identified from a qualitative study of a sample query log, which enabled automatic
classification based on those criteria; Teevan et al (2007) discussed the criteria
for identifying navigational queries based on an examination of re-finding behavior.

Broder et al. (2007) later used the text of the top result pages to judge the query
intent of the user. They found this method to be much better than the previous
method, which used only the query string. Beitzel et al. (2005) performed
semi-supervised learning on query logs to classify queries into topical categories,
and also used training data that had been annotated manually beforehand. Some work
have used the rank to

After reviewing the related works that examine query repetition behavior, the
framework for this study takes shape. The next section discusses the methodology of
this study.

3 METHODOLOGY

A search log from a web-based commercial search engine was collected as the data to
be analyzed in this study. This chapter discusses how search log analysis (SLA) was
used to analyze search engine user behavior. First, we briefly introduce search log
analysis as a methodology, including its theoretical foundations and the issues
related to the process. The process of SLA is then described in detail, including
data collection and data preparation. Since the data analysis process is quite
important, it is separated from this chapter and described in detail in the next
chapter. The outline of this chapter is:

- Search log analysis as method

- Related issues in the process

- Data collection

- Data preparation

3.1 SEARCH LOG ANALYSIS AS METHOD

Compared with the large number of findings yielded by SLA, the methodological
significance of this method has never been fully addressed. However, before adopting
a method in a study, one should always make sure that the results generated by
applying the method will lead to conclusions that support the research objectives.
In order to build a link between the method and the study objectives, a brief
discussion of the theoretical foundations that serve as the basis of this study is
given first.

3.1.1 Search Log

Two main kinds of log are often studied: client-side logs and server-side logs. A
client-side log keeps track of the user’s interaction with the web browser and is
often used in web browsing studies, while a server-side log records the user’s search
engine usage and is often used in studies of web searching behavior. The log analyzed
in this study is a search engine log captured by software at the server side.

Search engine log, often referred to as search log:

“….is an electronic data which keep down the interaction between a web searcher

and the search engine being used during web searching process…”

-- (Jansen, 2009)

The interaction between the searcher and the search engine includes the activities of
both the user and the search engine. The user activities captured include submitting
a query, clicking on one of the results, requesting the next page, and returning to
the search result page. The search engine activities logged include the returning and
ranking of results. The data contained in the search log, including both query data
and click-through data, are analyzed by researchers for different research purposes.
For example, the user’s query can be used to infer the underlying information need,
and the combination of query and click-through data can serve as implicit feedback on
result relevance.

3.1.2 Theoretical Foundation

3.1.2.1 Behaviorism

User study has always been an important area of research in web searching. For search
engine users, the behaviors observed during their searching process are a mechanical
expression of underlying information needs or motivations (Otsuka et al, 2004). Most
of the time, a user’s information needs are expressed in the form of search queries
as well as the URLs clicked; when a specific searching pattern is identified from the
behaviors exhibited by a group of users, the general features of that user group can
be summarized. The reasons to study user behavior can be summarized as follows.
First, the user’s needs and motivations behind the behavior are very important for
the service provider, who can provide better service on that basis; second, the
user’s reaction to the service provided can serve as an implicit snapshot of his/her
perception of that service.

3.1.2.2 Historical Data Re-using

Six years ago, Smyth et al. (2004) made two basic assumptions about web searching.

They assumed that the world of the web searching tends to be a place where similar

queries tend to re-occur and similar results tend to be selected again. These assumptions

have changed the search engine world dramatically.

Based on these assumptions, historical queries and clicks have a large chance of
being re-used, so they carry useful implications for query expansion, result caching
and user profile building, which were proved to be more effective than the previous
content-based methods in dealing with vague queries. Both query repetition and
selection regularity can also serve as implicit feedback from users. Whether a
query-click pair is repeated by one user or by many, it is taken to be a sign of high
relevance.

3.1.2.3 Web logging

Web users’ activities are largely logged on today’s web. On the one hand, web usage,
including web browsing and web searching, has become more and more popular in recent
years, and the increasing usage of web services has created a large amount of
user-system interaction. On the other hand, the need to understand user behavior is
growing as the number of web users increases at a rapid pace, and attention is now
focused on how to provide better service to those users.

Web logging, which makes it possible to study the user behavior contained in
historical data, is therefore widely carried out nowadays. It captures web search
engine user activity by keeping the searching history data. Keeping a log of web
usage is mainly motivated by two purposes: the log, which captures the process of web
activities, can be used to understand web user behavior, providing insight into the
needs and motivations that lie behind it, and it can be used to predict future
events. In short, as one of the premises of search log analysis, the logging of the
web searching process is very useful and necessary for search engine improvement as
well as for identifying user trends.

3.1.3 Possible Issues Related to the Process

The standard process of search log analysis includes three main steps: data
collection, data preparation and data analysis. As summarized by Jansen (2009):

- Data collection: the process of collecting the interaction data for a period of

time from a web search engine;

- Data preparation: the process of cleaning and preparing the log for further

analysis;

- Data analysis: the process of analyzing the prepared data;

A few problems arise from these processes, which have been addressed in many previous
works. Generally, the discussion revolves around the following issues, each specific
to one of the processes.

In recent years, privacy issues have come into the center of public focus, which has
created obstacles to collecting search logs for research purposes. On the other hand,
both the academic and the commercial world are calling for more access to those logs
and suggesting the building of a centralized search log database (Clough, 2009). Some
recent works that try to anonymize the log to prevent user tracing have yielded some
achievements. However, the issues relating to log collection still call for further
attention.

The data preparation and the analysis processes are the major sources of the
inconsistency that exists in today’s SLA studies. Although metrics and frameworks
have been developed to standardize the processes and make results comparable, the
adoption of relevant terms and the levels of analysis are hard to keep the same
across studies with different perceptions and research objectives. This further
requires the analysis metrics to be defined.

However, in spite of the problems mentioned above, SLA is still the dominant method
adopted in web searching studies. Its scalability and ease of data collection still
cannot be matched by experimental or questionnaire-based methods.

3.2 DATA COLLECTION

The query log examined was collected from a public commercial search engine which is
one of the most popular search engines in the US. The collection consists of more than
3,600,000 search queries submitted by 658,000 users during the three months between
March and May 2006. The log data was stored in an ASCII text file of over 2GB in
size. Queries containing pornographic content were not removed, since the specific
information contained in a query is not of interest in this analysis. No personal
information was contained in the log: user IP addresses had been removed, and each
user was identified by a unique system-generated number.

3.2.1 Data Source

Logs from this search engine have been studied extensively in the past. Since the
search engine is used worldwide, some problems should be noted. First, there may be
non-English queries among those submitted. However, given the relatively small number
of non-English queries, and given that this study is independent of query content,
this issue can be ignored.

Another problem is a general issue faced by all search log analyses, especially when
the query time field is part of the analysis. The server-side software captures only
its own local time upon reception of a query launched by people around the world, so
the time contained in the query log may not faithfully reflect the time at which the
query was submitted from the user’s own point of view. Works based on absolute time
recordings, especially studies that try to identify users’ searching behavior at
different times of day, should treat their results with care. In order to be free
from the effect of time zone differences, this study uses a distance-based method
(detailed in a later section).

3.2.2 Standard Field in Search Log

A framework has been provided by Jansen & Pooch (2001) to enable communication and
comparison between results. As defined in that framework, a standard search log
should contain the user ID, the date, the time and the search URL. The log studied
here contains the following fields:

- Anonymous user ID: a system-generated unique identifier used to distinguish users,
assigned on the basis of the user’s IP address, which was removed before analysis

- Search query: the query that is entered during user’s interaction with search

engine

- Query time: the time recorded by the server side software upon reception of the

query

- Item rank: the rank of the clicked result in the result page

- Click URL: the actual page URL of the clicked result
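
As an illustration only, the five fields listed above could be held in a small record type. The sketch below assumes a tab-separated text line with the fields in the order listed and a `YYYY-MM-DD HH:MM:SS` timestamp; the actual file layout of the log is not described in this study, and the names `LogRecord` and `parse_line` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    user_id: int                # anonymous, system-generated user ID
    query: str                  # query string as entered
    query_time: datetime        # server-side reception time
    item_rank: Optional[int]    # rank of the clicked result, None if no click
    click_url: Optional[str]    # URL of the clicked result, None if no click


def parse_line(line: str) -> LogRecord:
    # Assumed layout: five tab-separated columns in the order listed above;
    # the last two columns may be empty or missing when nothing was clicked.
    parts = line.rstrip("\n").split("\t")
    parts += [""] * (5 - len(parts))
    user_id, query, time_text, rank, url = parts[:5]
    return LogRecord(
        user_id=int(user_id),
        query=query,
        query_time=datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S"),
        item_rank=int(rank) if rank else None,
        click_url=url or None,
    )
```

A record without a click keeps both click fields empty, so query events and click events can share one structure.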

3.2.3 Privacy Issue and Offensive Information

The privacy issue related to the data is worth mentioning. Although the users’ IP
addresses were removed beforehand, a user can still be tracked by the user ID
assigned by the system. The tracking of a single user should be done very carefully,
since click-through data can sometimes reveal the actual identity of the user. Some
researchers once correlated the relevant information and matched searching activities
with the exact people, who had previously used the search engine to log into their
email box (whose address may contain their name), or even Facebook. The log data may
also contain offensive information, which has not been removed in order to keep the
data original.

In this study, the information content of the clicked URLs was of no interest to the
research, so the analysis performed on the data does not give rise to the problems
stated above.

3.3 PLATFORM AND TOOLS

Regarding the choice of tools and platform, only a few previous studies have
mentioned the tools they used to facilitate log analysis. Jansen (2009) addressed
this issue in his handbook, published last year, by presenting several tools that can
be used to support SLA. He compared the two most widely adopted methods: using a
relational database and using text processing scripts. However, no comparison of the
effectiveness and ease of use of the two methods has been made before in academic
research.

Existing log analysis tools (for example, Google Analytics) are widely used by
businesses to generate general reports on the traffic of their websites. However,
those tools have limited ability to perform analyses defined by research goals, and
cannot meet the needs of in-depth academic research.

A combination of a text processing language (most often Perl) and a log file (such as
a .txt file) is usually used to perform the analysis. This method requires a good
command of the programming language. It should also be noted that the algorithm is
very important in determining the efficiency of the analysis process: for very large
log data, a badly designed algorithm can be very time-consuming (sometimes taking
more than 20 hours).

Another widely adopted method is to import the log data into a relational database
(in most cases Microsoft Access, Microsoft SQL Server or MySQL), which can then be
queried with SQL. Manipulating log data in a database is relatively easier and more
effective than with many of the combinations mentioned above. Some databases may not
be able to accommodate data over 2GB, so for very large logs the choices are narrowed
down to only a few.

In this study, all data analysis was carried out on an Oracle 11g Database installed
on a Windows XP operating system with 4GB memory, a Core™ 2 processor, 6MB
random-access memory and 12 threads. The PC used was sufficient to meet the computing
requirements, taking no more than 2-3 hours in total to run all the code. SQL was
used to query the database in order to generate basic statistics, as well as to
manipulate the data by correlating, grouping, cross-referencing and counting, and to
generate views so that data could be viewed in aggregation or correlation. The key
steps of the whole log analysis process, including the steps for data cleaning and
data analysis, are given in the appendix.

3.4 DATA PREPARATION

3.4.1 Data Importing

The data had to be imported into a relational database before it could be analyzed
further. An Oracle database was used here as the platform to store the data in a way
that made it easier to manipulate. SQL*Loader was used to upload the 2.12GB of data
into a table that had been created beforehand, with the field names and data types
set accordingly.

One thing worth mentioning here is that the time cost of the uploading process varied
with the tools and methods used. Both the size of the data and the original format of
the document should be considered when choosing the tool. For the relatively large
data set in this experiment (over 2GB), which exceeded the largest size that ordinary
software could deal with, the original data would need to be partitioned, at the
price of efficiency. In this situation the difference was quite obvious: SQL*Loader
took less than 20 minutes to complete the task, while other tools took more than 14
hours to do the same job. The result of this step is shown in Table 1.

3.4.2 Remove Corrupted Data

Search logs may contain corrupted data, which can arise for many reasons. Removing
such data is usually carried out before data analysis. One basic problem in this
process is that, for a large data set, it is impossible to identify and remove those
data manually. One method, provided by Jansen (2009), is to sort the data by its key
fields so that the abnormal data appear together either at the top or at the bottom
of the table. Some studies choose to ignore corrupted data, since they may be very
small relative to the overall data set and so have little effect on the final
results. In this study, initial observation showed many queries with only a dot in
the query column. In some cases this would not be a problem. However, since the later
query classification is mainly based on string matching, those data were removed at
this stage to avoid matching them all the way through classification. A simple
process was conducted to remove them: records containing an empty query, or a query
without any letter or number (containing only symbols and marks), were considered
invalid data to be removed beforehand. The result of this step is shown in Table 1.
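
The validity rule described above can be sketched in a few lines. This is an illustrative Python version, not the SQL actually used in the study (which is given in the appendix); the function names are hypothetical, and "letter or number" is interpreted here as any alphanumeric character.

```python
def is_valid_query(query: str) -> bool:
    """Return True if the query contains at least one letter or digit."""
    # An empty string, or one made only of symbols and whitespace, is invalid.
    return any(ch.isalnum() for ch in query)


def remove_corrupted(queries):
    """Keep only the queries that pass the validity rule, in original order."""
    return [q for q in queries if is_valid_query(q)]
```

For example, a query consisting of a single dot is dropped, while a query such as `c++` is kept because it contains a letter.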

3.4.3 Abnormal User Identification

Sometimes abnormal user behaviors can be identified in the query stream. These
abnormal users tend to appear in bursts, that is, a more-than-usual number of queries
submitted within a shorter-than-usual time span. Some of them may be searching
agents. Sometimes it could be an attack, if the time over which they appear is rather
short.

To identify these users, we observe both user frequency and user active time. Users
who have a very high query frequency within a short period of time are considered
abnormal users that should be removed.

Robots that act like normal users were ignored in this study: if a robot tries to act
as a human, its actions can be very close to those of a real human user and can, at
the least, be used to represent human behavior. The identification of non-human users
is also not simple, and could be a research topic in itself. As Silverstein et al
(1999) pointed out, there is no way to totally distinguish human users from non-human
ones. So, in order to keep the log data as close to the original as possible, we only
separate out the abnormal users whose actions could affect the observation of normal
users. In this study, we use a time-based user frequency distribution to identify
abnormal behaviors. The active time was calculated for all users, and user frequency
was plotted as a function of time. Using this method, we can identify users who leave
large trails of queries during a fast-come, fast-go visit. The result is shown in
Figure 1 and Table 1.
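
A minimal sketch of the idea follows, assuming that active time is the span between a user's first and last logged query. In the study the abnormal users were picked out by inspecting the plotted distribution; the two thresholds below are hypothetical parameters that stand in for that inspection, and `find_burst_users` is an illustrative name.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_burst_users(events, min_queries, max_active_time):
    """events: iterable of (user_id, query_time) pairs.

    Returns the users who issued at least `min_queries` queries while being
    active for no longer than `max_active_time` (both thresholds are
    assumptions chosen by the analyst, not values from the study).
    """
    first_seen, last_seen = {}, {}
    counts = defaultdict(int)
    for user_id, query_time in events:
        counts[user_id] += 1
        if user_id not in first_seen or query_time < first_seen[user_id]:
            first_seen[user_id] = query_time
        if user_id not in last_seen or query_time > last_seen[user_id]:
            last_seen[user_id] = query_time
    return {
        user_id
        for user_id, n in counts.items()
        if n >= min_queries
        and last_seen[user_id] - first_seen[user_id] <= max_active_time
    }
```

A user with many queries spread over weeks is not flagged; only a large number of queries packed into a short active time is.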

3.4.4 Group Query Episode

On the search engine server side, a user’s request for the next page or a click on
another result is logged as a separate query, assigned a different time or even the
same time when the time difference is too small to notice. Subsequent queries from
the same user that are identical to the previous one(s) are referred to as identical
queries. The logging of those subsequent queries clutters the query stream with many
query events close in time (or even exact duplicates) that are triggered by only one
query entry. This study is not interested in repeated queries that happen within one
hour, and it takes the side of modeling user behavior rather than evaluating search
engine performance. Those identical queries, which are not trivial in number and
would inflate the count of repeated queries and so distort the distribution of
repetition frequency, should therefore be grouped into a query episode starting from
the user’s initial query.

In this study, a query episode was considered to be a period of continuous
interaction with the search engine under one query submitted by a single user. Such
an episode can be constructed by grouping the continuous query events that follow the
initial query from the same user at a time interval smaller than a certain period.
The grouping of query episodes is less discussed than the grouping of user sessions
in previous studies. Some past works treated a subsequent query submitted to the
search engine as a new query, or simply removed the repeated records from the
original data. Teevan et al. (2006) grouped all queries with the same query string
that occurred within thirty minutes. Jansen (2009) mentioned a method of grouping
such episodes by removing the repeated queries and extracting the first one
submitted. He & Göker (2001) addressed the issue by defining a web search period as a
set of continuous queries by a user with no longer than a certain time limit from one
query to the next. They also suggested that little difference was observed between
using 15 minutes and 60 minutes as the threshold. We therefore use a value of 30
minutes as the cut-off for identifying continuous queries: identical queries from the
same user launched at an interval of no more than 30 minutes were considered to be
within one query episode. This step also removed repeated records. The results were
saved as a view, with reference to the original table to fetch the multiple clicks
made within the episode. A query episode can be represented as <Query, click URL>.
The number of grouped episodes is presented in Table 1.
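
The grouping rule can be illustrated as follows. This is a sketch under stated assumptions rather than the SQL view used in the study: the 30-minute interval is measured from the previous event of the same user and query (the text leaves open whether it is measured from the previous event or from the initial query), and the function name and dictionary layout are hypothetical.

```python
from datetime import datetime, timedelta


def group_episodes(events, max_gap=timedelta(minutes=30)):
    """events: iterable of (user_id, query, query_time, click_url or None).

    Identical queries from the same user are merged into one episode while
    each event follows the previous one within `max_gap`.
    """
    episodes = []
    open_episode = {}   # (user_id, query) -> index of the current episode
    last_seen = {}      # (user_id, query) -> time of the latest event
    for user_id, query, query_time, url in sorted(events, key=lambda e: e[2]):
        key = (user_id, query)
        if key in last_seen and query_time - last_seen[key] <= max_gap:
            episode = episodes[open_episode[key]]
        else:
            episode = {"user_id": user_id, "query": query,
                       "start": query_time, "clicks": []}
            open_episode[key] = len(episodes)
            episodes.append(episode)
        if url and url not in episode["clicks"]:
            episode["clicks"].append(url)   # duplicate records collapse here
        last_seen[key] = query_time
    return episodes
```

Each returned episode carries the initial query time and the distinct clicked URLs, which corresponds to the <Query, click URL> representation.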

3.4.5 Query Classification

As defined by Broder (2002), a navigational query is a query that the user issues to
locate the home page or website that the user has in mind. An informational query is
a query that the user issues to gather information from relevant web pages on a
particular topic. Based on this classification, many works have been able to identify
the features of different searching behaviors arising from different underlying user
intentions.

This study took a special interest in examining the different patterns in which the
different types of query are repeated by users. Previous work has identified
differences in the way navigational and non-navigational queries are repeated by
users (Sanderson and Dumais, 2006). This suggests that the general query features may
not apply to all queries in each category. Classifying queries by task is therefore
useful for examining the features of each type independently.

The classification of queries according to user intent has always been a tough task,
since it involves human judgment and inference, which in the past were done manually
and applied only to small query data sets. Moreover, manual classification of queries
is inefficient and expensive, and is not feasible for a very large set of query data.

Algorithms have been developed to separate queries automatically. The characteristics
of each query type were identified earlier by Jansen et al (2007). They gave criteria
for each query type (in their case, queries were broken down into three categories:
navigational, informational and transactional) so that queries could be classified
accordingly. Lee et al. (2005) also provided criteria for classifying navigational
queries, and Teevan et al. (2006) gave a detailed description of a navigational query
as well. Generally, these works contain the following criteria.

Navigational Query

- Repeated equal-query queries, meaning that only one or two results were clicked
each time

- Queries whose viewed result is ranked higher than usual

- Queries that contain a full or partial URL

- Queries that contain a web site or company name

- Queries repeated more often than usual over a long period

Informational Query

- Queries that use question words: what/how/when/where/who

- Queries that were beyond the first query submitted;

- Queries where the searcher viewed multiple results pages;

- Queries with a length greater than 2 words.

Some of the criteria are vague and cannot be used independently to judge the query
type, such as picking out company names or judging by query length: although some
works have proposed a cut-off length, the figure is based on specific training data
and may not apply to all logs. Some are quite strict criteria with high precision,
such as selecting the queries that contain a URL: all queries aimed at a website
known in advance are, by the definition of a navigational query, considered
navigational. In addition, some of the time-related criteria could affect the results
of this study; for example, the criterion ‘query being repeated more often at a
longer period’ concerns one of our study objectives. Based on these considerations,
the following steps were carried out to identify the query type.

1. Repeated queries (submitted at least twice, regardless of user) with only 1-2
clicks each time are marked as N (navigational);

2. Queries marked as N whose clicked result is ranked beyond 10 are re-marked as I
(informational);

3. Among the remaining unmarked queries, those containing a full or partial URL are
marked as N;

4. The queries still not marked as I or N are matched against a selected list of the
most searched company and website names, and those that match are marked as N; the
rest are marked as I.
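
The four steps above can be sketched as a small rule cascade. The details are assumptions made for illustration: the regular expression is only a rough stand-in for "full or partial URL", step 2 is read as "any clicked result ranked beyond 10", step 4 uses exact matching against a hypothetical `known_sites` set, and the classification actually used in the study was carried out in SQL.

```python
import re

# Rough, assumed pattern for a full or partial URL inside a query string.
URL_PATTERN = re.compile(r"(https?://|www\.|\.(com|org|net|edu|gov)\b)", re.IGNORECASE)


def classify(query, submissions, known_sites):
    """Return "N" (navigational) or "I" (informational).

    submissions: one list of clicked ranks per time the query was issued,
                 regardless of user, e.g. [[1], [1, 2]].
    known_sites: set of lower-case company/website names (hypothetical list).
    """
    label = None
    # Step 1: repeated at least twice, with only 1-2 clicks each time.
    if len(submissions) >= 2 and all(1 <= len(ranks) <= 2 for ranks in submissions):
        label = "N"
    # Step 2: an N-marked query with a click ranked beyond 10 becomes I.
    if label == "N" and any(rank > 10 for ranks in submissions for rank in ranks):
        label = "I"
    # Step 3: unmarked queries containing a full or partial URL.
    if label is None and URL_PATTERN.search(query):
        label = "N"
    # Step 4: match the rest against the name list; everything else is I.
    if label is None:
        label = "N" if query.strip().lower() in known_sites else "I"
    return label
```

The order of the rules matters: a query re-marked as I in step 2 is no longer "unmarked", so steps 3 and 4 leave it untouched.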

The navigational query list was later checked by inspecting randomly generated query
samples. It was found that 83% of the sampled queries were navigational. The sample
query list is provided in the Appendix.

3.5 METRIC FOR ANALYZING

The importance of developing the analysis metric beforehand has already been

discussed in the previous section. Both the analysis level and the key terms used in

the analysis should be determined at this stage.

The analyses in this study were carried out at query level, examining repetition in
both queries and clicks. The definitions of the basic terms, as given by Jansen
(2009), are listed below:

- A query: a query episode (grouped as continuous interaction with the search engine
under one query by one user issued within one hour), in which one query results in
zero or several clicks

- A repeated query: a query submitted more than once, regardless of user

- A click: a URL returned by a query

- A repeated click: a click from the same user, regardless of the query it came from

- Navigational query: a query aimed at a pre-assumed website or online source, which
is in the navigational list identified above

- Repeated navigational query: a query in the above list which has been submitted
twice or more

- Informational query: a query which is information seeking based; in this study, a
query which is not in the navigational query list

- Repeated informational query: an informational query which has been submitted twice
or more

After defining the analysis metric, the analysis can be carried out accordingly,
aimed at achieving the research objectives. In the next chapter, the analysis process
is described in detail, together with the necessary explanations of the methods used.

4 DATA ANALYSIS

In order to describe the process in detail, this part is separated from the main

methodology to become a new chapter. The sections included in this chapter are:

- Process design

- Special method

- Overall repetition examination

- Temporal repetition rate examination

- Repetition periodicity examination

- Rank first convert to change

- Result summary

- Result discussion

4.1 PROCESS DESIGN

Based on the research aims of this study, the analyzing process was partitioned into

four parts.

- Overall repetition examination: this step identifies the existence of query/click
repetition in this log, on which the remaining analyses are based.

- Temporal repetition rate examination: this step examines the daily repetition rate
and also the temporal user variance in repetition rate.

- Query repetition periodicity examination: this step examines the periodicity of
query repetition behavior in general, by query type and by user variance.

- First convert to change examination: this step was carried out to provide possible
implications for re-ranking strategy based on the previously identified periodicity.

4.2 SPECIAL METHOD

Several special methods were used in this study, which are briefly introduced here in
advance.

4.2.1 Time-Based Frequency

In this study, in order to examine user variance in query repetition, it was necessary to classify users according to their querying frequency. Although the word ‘frequency’ has been used widely in previous studies, most of them captured it only through a static count of total occurrences. It would be wrong to measure user frequency simply by counting the queries a user has submitted. Users enter the log at different times: some appear at the beginning of the three-month period and some near the end. Because the search log captures only a snapshot of the query traffic, the activity of users who appear late is cut off by the end of the log. In this study, user frequency was therefore measured as the total number of queries submitted by a user divided by the time interval from the user’s first appearance until the end of the log. Measured this way, user frequency represents the number of queries a user submits per day, which better describes how regularly the user searches.
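The time-based frequency described above can be illustrated with a minimal sketch. The function name, the input shape (one date per query) and the example dates are hypothetical; counting the first day as a full day is an assumption made here to avoid dividing by zero, and is not stated in the study.

```python
from datetime import date

def user_frequency(query_dates, log_end):
    """Return queries per day between a user's first appearance and the log end.

    query_dates: one date per query submitted by the user (hypothetical input shape).
    log_end: the last day covered by the log.
    """
    first_seen = min(query_dates)
    # Assumption: the first day itself counts, so the span is never zero.
    active_days = (log_end - first_seen).days + 1
    return len(query_dates) / active_days

# Made-up example: two users with three queries each in a ten-day log.
log_end = date(2000, 1, 10)
early_user = [date(2000, 1, 1), date(2000, 1, 1), date(2000, 1, 5)]
late_user = [date(2000, 1, 9), date(2000, 1, 10), date(2000, 1, 10)]

print(user_frequency(early_user, log_end))  # 3 queries over 10 days
print(user_frequency(late_user, log_end))   # 3 queries over 2 days
```

Both users submitted the same number of queries, but the user who entered the log late receives a higher frequency, which is the effect a plain query count would miss.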

4.2.2 Repetition Distance

To further identify the repetition periodicity, this study used a distance-based analysis to find the dominant time interval between repeated events. The repetition distance is measured as the number of days between consecutive log events, after the events have been grouped and ordered. Two ordering strategies were used. To study the general repetition periodicity, which is independent of time order, the events were ordered randomly. Later, when identifying the first-convert-to-change time for the repetition behavior, the log events were ordered by time.

The number of occurrences of each distance is then plotted, so that the most frequent time interval between two repeated events can be found by inspection. Distributions based on the distance between related events have been used in previous research to investigate periodicity (Fagni et al, 2006). The distance can be measured as a number of events or, as in this case, as the time interval between repeated log events. Usually, periodicity is identified from a time series plotted against time. However, the distance-based analysis turned out to be a better method here for two reasons. First, the time interval becomes the direct target of the analysis, which makes the periodicity straightforward to observe. Second, obtaining the distance between two events from the same user by subtracting one timestamp from the other eliminates the problems caused by time zone differences. In this work, the distance between two repetitions was measured as a day difference, which is independent of time order. The number of repeated queries observed at each time interval was accumulated so that the dominant repetition intervals could be seen in the graph.
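A minimal sketch of the distance-based counting follows. It assumes log events are available as (user, query, date) tuples; the function name, the tuple layout and the sample events are hypothetical, and the shuffle stands in for the random ordering mentioned above.

```python
import random
from collections import Counter, defaultdict
from datetime import date

def distance_histogram(events, seed=0):
    """Count day differences between consecutive repeats of the same query by the same user.

    events: iterable of (user, query, day) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for user, query, day in events:
        groups[(user, query)].append(day)

    rng = random.Random(seed)
    histogram = Counter()
    for days in groups.values():
        rng.shuffle(days)  # random ordering within each group
        for first, second in zip(days, days[1:]):
            # The absolute day difference does not depend on which event came first.
            histogram[abs((second - first).days)] += 1
    return histogram

# Made-up events: two users each repeat one query a week later.
events = [
    ("u1", "weather", date(2000, 1, 1)),
    ("u1", "weather", date(2000, 1, 8)),
    ("u2", "maps", date(2000, 1, 2)),
    ("u2", "maps", date(2000, 1, 9)),
    ("u2", "news", date(2000, 1, 3)),
]
hist = distance_histogram(events)
print(sorted(hist.items()))
```

Plotting the counts in `hist` against the day difference gives the kind of distribution that is inspected for a dominant interval.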

4.2.3 Normalization

When analyzing a phenomenon, the results need to be treated with care. Since the real world is complex, a phenomenon can rarely be observed directly. The data being analyzed is also shaped by the time span it covers, the particular time at which it was collected and the source it was collected from. Sanderson and Dumais (2006) specified a way to remove both the windowing effect and the weekly effect identified in their analyses, together with the effect of the underlying search engine usage. Wedig and Madani (2006), in their analysis of user persistence, likewise noted that the time frame, which artificially cut off their results, would be misleading if not removed. The normalization in this study was carried out using the formula of Sanderson and Dumais (2006).

In this study, a seasonal effect, a windowing effect or a weekly effect may be present. One simple way to remove them all is to plot the more specific analysis against the more general one. This works because the more specific data inherits the attributes of the general data, so it is safe to use the general data to normalize the more specialized data. This rule was used for all data normalization in this study. Following the above discussion of the methods, the analysis steps are detailed in the rest of the chapter.
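The exact normalization formula is not reproduced in this text, so the following is only one plausible reading of "using the general data to normalize the more specialized data": the share of events at each interval in the specific distribution is divided by the share at the same interval in the general distribution. This ratio form is an assumption, not necessarily the formula of Sanderson and Dumais (2006); the function name and the sample counts are hypothetical.

```python
def normalize(specific, general):
    """Divide the specific distribution's share at each interval by the general share.

    specific, general: dicts mapping day difference -> event count.
    A ratio above 1 means the interval is over-represented in the specific data.
    """
    total_specific = sum(specific.values())
    total_general = sum(general.values())
    ratios = {}
    for interval, count in specific.items():
        baseline = general.get(interval, 0)
        if baseline > 0:  # skip intervals with no baseline to compare against
            ratios[interval] = (count / total_specific) / (baseline / total_general)
    return ratios

# Made-up counts: the specific data is concentrated at an interval of 1 day.
general_counts = {1: 50, 7: 50}
specific_counts = {1: 30, 7: 10}
print(normalize(specific_counts, general_counts))
```

Under this reading, any effect shared by both distributions (for example a weekly rhythm in overall usage) cancels in the ratio.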

4.3 OVERALL REPETITION EXAMINATION

The analysis started with an overall examination of the repetition percentage in both the query and the click data. The distinct queries which had been submitted more than once were counted. The queries submitted more than once by a single user and those submitted more than once by groups of users were counted separately; the repetition percentages of navigational and informational queries were also examined. All of the results are shown in Table 2.

Click repetition was also examined. The repeated clicks by a single user were counted, and repeated clicks coming from the same query and from different queries were examined separately. The results are shown in Table 5.

4.4 TEMPORAL REPETITION RATE EXAMINATION

The daily repetition rate was calculated for each day of the three-month period, which makes it possible to follow how query repetition evolves over time. All queries were grouped by date, and the number of queries on each day was counted and plotted over the 92 days. The same method was used to count the repeated queries on each of the 92 days. The daily repetition rate was then calculated as the percentage of that day’s queries that were repeated queries. The repetition rate over the 92-day period is plotted in Figure 2.
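The daily calculation can be sketched as follows. The sketch assumes that a query instance counts as repeated when its query string occurs at least twice in the whole log, following the "submitted twice or more" definition; the study does not spell out this detail for the daily rate, and the function name, the (day, query) layout and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def daily_repetition_rate(events):
    """Return {day: share of that day's queries whose string occurs 2+ times in the log}.

    events: list of (day_index, query_string) tuples (hypothetical layout).
    """
    occurrences = Counter(query for _, query in events)
    per_day = defaultdict(lambda: [0, 0])  # day -> [repeated count, total count]
    for day, query in events:
        per_day[day][1] += 1
        if occurrences[query] >= 2:  # assumption: "repeated" means seen twice or more
            per_day[day][0] += 1
    return {day: repeated / total for day, (repeated, total) in sorted(per_day.items())}

# Made-up two-day log.
sample = [(1, "google"), (1, "weather"), (2, "google"), (2, "maps"), (2, "google")]
rates = daily_repetition_rate(sample)
print(rates)
```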

4.4.1 User Frequency

User variance is of particular interest in this study, so user frequency had to be calculated first in order to classify users into frequency groups. As introduced in the previous section, user frequency was calculated as the total number of query instances of a user divided by the user’s active time. The formula for the calculation is:

User frequency = (total number of query instances of the user) / (number of days from the user’s first appearance to the end of the log)

The user frequency distribution is shown in Figure 3.

4.4.2 User Variance on Temporal Repetition Rate

Based on the user frequency identified in the last step, the study tried to further identify user variance in repetition behavior. Previous studies have found that users of different frequency tend to exhibit different patterns of query repetition. In this section, the temporal repetition rates of frequent and non-frequent users were calculated separately and plotted in the same temporal distribution diagram (Figure 5) to facilitate comparison.

4.5 REPETITION PERIODICITY IDENTIFICATION

This step further identifies the repetition periodicity. Here, the distance-based measure described earlier was used to reveal a dominant periodicity in the query log.

4.5.1 General query repetition periodicity

According to the definition given above, we first calculated the time interval between every two log activities of the same user, which serves as an analysis of the user’s overall search engine usage. Queries issued by the same user were ordered by a randomly generated value from the database. The number of days between two consecutive query events was calculated as the day difference d; the number of occurrences of each value of d was then counted and plotted as a function of d. The results are shown in Figure 6. This data was later used to normalize the data generated by applying the same steps to the same query from the same user.

The same method was then used to generate the distribution of the time interval between repeated queries from the same user. This data was used to produce the diagram in Figure 7. To remove possible effects of other factors, the study adopted the normalizing formula of Sanderson and Dumais (2006) and normalized this data with the data generated in the previous step. The normalized view is shown in Figure 8.

4.5.2 Query Repetition Periodicity Examine By Type

Repeating the method used before, the time interval distributions of both navigational and informational queries were analyzed. The queries marked as I and N from the same user were randomly ordered, and the distributions of query time intervals were plotted and normalized by the result generated in the last section. The normalized results are shown in Figures 9 and 10.

4.5.3 User Variance on Repetition Periodicity

Then, we grouped all queries by user and calculated the average time interval between two repeated queries from the same user. The user frequency was then plotted as a function of the user’s average time interval between two repeated queries. We use this to describe the different repetition periodicity of users of different frequency. The result is shown in Figure 11.

4.6 RANK FIRST CONVERT TO CHANGE

One of the motivations of this study is to develop a better re-ranking strategy for queries that, based on historical data, are expected to occur again. For previously issued queries, re-ranking the results would benefit from an estimate of when the query will next occur. Normally, when users repeat a query on the search engine to return to a page they viewed before, they expect the previously viewed result to appear on the same result page or even at the same rank. This suggests that re-ranking of the results should preferably take place after the last time a repeated query with the same click is expected to be observed.

The study tried to find out whether the rank of a repeated click changes at a smaller (or larger) time interval than the interval between two repeated queries. The same distance-based analysis was performed here. All identical clicks were grouped and ordered by time. The first time a change of rank was observed within the time sequence of clicks, the time interval between the two click events was recorded. The result is shown in the diagram in Figure 14.
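The first-convert-to-change step can be sketched as below. The sketch assumes click events are available as (user, query, url, day, rank) tuples with the day given as an integer index, and that the recorded interval is the gap between the two consecutive clicks at which the rank first differs; both points are assumptions, and the function name and sample clicks are hypothetical.

```python
from collections import Counter, defaultdict

def first_rank_change_intervals(clicks):
    """Count the day gap at which each repeated click first shows a different rank.

    clicks: iterable of (user, query, url, day, rank) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for user, query, url, day, rank in clicks:
        groups[(user, query, url)].append((day, rank))

    histogram = Counter()
    for sequence in groups.values():
        sequence.sort()  # time-based ordering of the identical clicks
        for (day_a, rank_a), (day_b, rank_b) in zip(sequence, sequence[1:]):
            if rank_b != rank_a:
                histogram[day_b - day_a] += 1
                break  # keep only the first change in each sequence
    return histogram

# Made-up clicks: u1 sees the rank move between day 3 and day 12; u2 never does.
clicks = [
    ("u1", "q", "a.example", 0, 1),
    ("u1", "q", "a.example", 3, 1),
    ("u1", "q", "a.example", 12, 2),
    ("u1", "q", "a.example", 20, 3),
    ("u2", "q", "a.example", 0, 1),
    ("u2", "q", "a.example", 5, 1),
]
print(first_rank_change_intervals(clicks))
```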

5 RESULTS

This chapter presents the results of the analyses described in the previous chapter. The results are listed as:

- Result 1.1: Result for Data Preparation

- Result 2.1: Query Repetition Overview

- Result 2.2: Click Repetition

- Result 3.1: Query Repetition frequency Distribution

- Result 3.2: Click Repetition Frequency Distribution

- Result 3.3: User repetition Frequency Distribution

- Result 3.4: Further Examination of the Frequent Query List

- Result 3.5: Repetition Rate Variance

- Result 4.1: General Query Repetition Periodicity

- Result 4.2: Informational Query Repetition Periodicity

- Result 4.3: Navigational Query Repetition Periodicity

- Result 4.4: User Variance On Query Repetition Periodicity

- Result 5.1: Rank First-convert-to-change Periodicity

5.1 RESULTS FOR DATA PREPARATION

Result 1.1: Result for Data Preparation

Non-human users were identified by using the trend of user frequency over time to detect abnormal user behavior. As can be seen from the graph, there is a peak between days 37 and 38, where a single user submitted over 13 thousand queries, either repeated or unique. This was considered abnormal behavior, so we identified the user ID and removed all related records. Using the same method, another similar user ID was removed. The graph in which the abnormality was detected is shown below.

Figure 1 Abnormal User Detection

Table 1 Data Preparation Result

Total Query 35,382,016

Corrupted Queries 1,005,069( )

Non-human User 142,775( )

Query Episode 20,714,848( )

Unique query 10,152,834 ( )

Navigational Query 3,201,860 (31.54%)

Informational Query 6,950,974 (68.46%)

5.2 RESULTS FOR OVERALL QUERY REPETITION EXAMINED

Result 2.1: Query Repetition Overview

Among the 10,152,834 distinct query strings found in the 20,714,848 query episodes, 29.8% were submitted more than once, while 70.2% were unique queries. Among the repeated queries, repetition by an individual user makes up 43.75% of the total, while repetition across a group of users makes up 56.25%. The repeated queries were then examined by query type. As discussed in the previous section, informational queries were identified as the queries marked as I. The results show that around 42.46% of the repeated queries are informational queries, while 57.54% are navigational queries.

Query Repetition

Total Query (Episode)            20,714,848
Distinct query string            10,152,834 (49%)
Unique queries                   7,126,165 (70.2%)
Repeated queries                 3,026,669 (29.8%)
Individual repetition            1,324,168 (43.75%)
Group repetition                 1,702,501 (56.25%)
Repeated Navigational Query      1,741,545 (57.54%)
Repeated Informational Query     1,285,124 (42.46%)

Table 2 Repetition Overview

Result 2.2: Further Examination of the Frequent Query List

The frequency lists are listed below.

The 20 most repeated queries

Query    Number of times searched

Google 279445

eBay 129968

yahoo 186150

MapQuest 102050

MySpace 145650

internet 32136

weather 22174

http 21074

bank of America 29339

American idol 16892

pogo 16263

Hotmail 27963

msn 14060

craigslist 13518

.com 13115

dictionary 13016

yahoo mail 12689

ask Jeeves 11547

Wal-Mart 11475

mycl.cravelyrics.com 10471

Table 3 Most Repeated Queries

Examining the top 20 most frequently repeated queries shows that most of them are navigational queries pointing to another search engine (Google, Yahoo, MSN), a map website (mapquest.com), an e-commerce website (eBay, Amazon) or a social network (MySpace). Some everyday inquiries such as weather and TV also appear in the list.

A list of the top 20 queries by the number of distinct users who searched them was also generated. The result shows the same trend as the previous one.

The 20 Most Shared Repeated Queries

Query Number of Users

Google 120782

eBay 76178

yahoo.com 67606

MapQuest 59098

myspace.com 34003

internet 21996

http 17041

weather 12967

American idol 9980

dictionary 8118

Wal-Mart 7562

ask Jeeves 7125

home depot 6700

ask.com 6202

southwest airlines 5927

target 5925

white pages 5767

maps 5589

hotmail.com 5370

yellow pages 5308

Table 4 Most Repeated Queries across User

Then the top 20 most repeated queries within a single user were generated into a list as follows.

As can be seen from the above lists, both the most frequently repeated queries and the queries most shared across users are almost all navigational. However, some everyday informational inquiries such as weather or TV also appear in the top 20 lists. This means that, although most queries repeated on a daily basis tend to be navigational, some informational queries can also show a daily demand.

Result 2.3: Click Repetition

Clicks by the same user on the same result that appeared more than once were counted. Out of the total click stream, 35.39% were found to be repeated clicks. Among the repeated clicks, 68.40% came from the same query and 31.60% from different queries.

Click Repetition

Total Click             9,826,259
Unique Click            6,348,868 (64.61%)
Repeated Click          3,477,391 (35.39%)
From Same Query         2,378,465 (68.40%)
From Different Query    1,098,926 (31.60%)

Table 5 Click Repetition Overview

The repetition found in the click stream is a little higher than the repetition of queries. The results of examining both query and click repetition support previous work by identifying a similar proportion of repetition behavior in new log data. This further indicates that the existence of such repetition behavior is independent of the user group being analyzed.

5.3 TEMPORAL QUERY REPETITION RATE EXAMINED

Result 3.1: Temporal Repetition Rate Distribution

The following shows the daily repetition rate distribution. The x-axis represents the 92 days of the observation period and the y-axis represents the percentage of repeated queries on that day. As can be seen from Figure 2, the daily repetition rate does not change much over time; it stays stable at around 60% every day.

Figure 2 Temporal Repetition Rate Distribution (x-axis: Day; y-axis: Repetition Rate Per-Day)

Result 3.3: User Frequency

The following diagram shows the frequency distribution of different users.

Figure 3 User Frequency Distribution (axes: Number of Users, on a logarithmic scale, and User Frequency)

As can be seen from the diagram, most users do not submit more than 10 queries per day, while a small number of users submit more than 10 queries a day. Based on the diagram, the cut-off was set at 10 queries per day: users above it are regarded as frequent users, and users below it as non-frequent users.

Result 3.5: Repetition Rate Variance

The following diagram shows how users’ average repetition rate changes with their search frequency. Below a certain user frequency, about 300 in this graph, high frequency users tend to repeat more than low frequency users, and the relation is linear. Above that point, although the repetition rate in general still rises, it tends to fluctuate considerably. This is probably because high frequency users, who are the exceptional cases in the user group, tend to vary a lot from each other.

Figure 4 User Repetition Rate Distribution (chart title: User Repetition Variance; x-axis: User Frequency, on a logarithmic scale)

Result 3.4: User Variance on Temporal Repetition Rate

The following diagram shows the daily repetition rates of the frequent and the non-frequent user groups.

Figure 5 User Variance on Temporal Repetition Rate (x-axis: Days; series: frequent user, non-frequent user)

As can be seen from the above graph, non-frequent users tend to repeat at a stable rate, while frequent users tend to repeat more as time goes by. However, after a certain point (around 45 days), their return to previously submitted queries tends to decline a little and then remains stable over time. This indicates that frequent users tend to repeat more within a short time interval and tend to develop more varied interests later on.

5.4 REPETITION PERIODICITY EXAMINATION

Result 4.1: General Query Repetition Periodicity

- Step1: User’s searching activity

The following shows the distribution of time intervals between two events of the same user.

Figure 6 Same User Query Time Intervals (x-axis: Time Interval; series: Query from Same User)

As can be seen, a weekly effect is present in the diagram. The decrease at the end is due to the 92-day length of the log. This data was used for later normalization.

- Step2: User’s Query Repetition

The same steps were performed to calculate the distribution of the day difference between repeated queries from the same user. This data was used to produce the diagram in Figure 7. The weekly effect is quite obvious.

Figure 7 Same User Same Query Time Intervals (x-axis: Time Interval; series: Same Query From Same User)

- Step3: Normalized Repetition Periodicity

As can be seen from the normalized view below, events above the line y=1 are more likely to happen, while events below it tend to occur less. The 7-8 day cut-off shows that a user’s repetition of a query is more likely to happen within the following 7-8 days; after this period, the chance of the query being repeated decreases.

Figure 8 Normalized General Query Repetition Time Intervals (x-axis: Time Intervals; series: General Query Repetition)

Result 4.2 Informational Query Repetition Periodicity

The same method was then used to examine the repetition pattern by query type. The informational and navigational queries identified in the previous analysis were analyzed in the same way as in the general analysis above. The data representing users’ general query repetition time intervals was used to normalize the informational query repetition data, which produced Figure 9.

Figure 9 Informational Query Repetition Periodicity (x-axis: Time Interval)

From the graph we can see that an informational query tends to be repeated in a burst within 3 days, after which its repetition is unlikely to be seen again. This is in line with the suggestion made by many previous studies that informational queries are more likely to be repeated within a few days, in a burst.

Result 4.3: Navigational Query Repetition Periodicity

In the same way, we produced the repetition time interval distribution for navigational queries and normalized it in the same way as for the informational queries. The normalized view is shown below:

Figure 10 Navigational Query Repetition Time Intervals (x-axis: Time Interval; chart title: Navigational Query Repetition Periodicity)

As can be seen, the repetition of navigational queries follows a different pattern from that of both general and informational queries. Repeated navigational queries are unlikely to be observed within the 7 days following their first submission.

Combined with the finding for general query repetition, the repetition of informational queries within 3-4 days may account for most of the repeated queries observed within the first 7 days. These observations are in line with the findings of Sanderson and Dumais (2006): although the informational query repetition period turned out to be shorter than expected, the previous findings still hold.

It can be concluded from the above analyses that, in general, previously issued queries, most of which are informational, tend to be repeated within a short time interval, while queries with a navigational task are usually re-issued at a later point in time. The repetition pattern differs according to the user’s intention.

Result 4.4: User Variance on Query Repetition Periodicity

The following shows the repetition time span for users of different frequency. As can be seen from the graph, almost all low frequency users issue a repeated query within an interval of 13-20 days on average, while high frequency users tend to repeat a query within 2-35 days on average. This means that low frequency users tend to repeat queries within a narrow, relatively near time window, while high frequency users may repeat a query after a relatively longer interval.

Figure 11 User Variance on Repetition Periodicity

5.5 RANK FIRST CONVERT TO CHANGE

Result 5.1: First Convert To Change

This section analyzes the chance of observing a rank change when re-accessing a previously viewed result page through the same path. The method used in this section is the same as in the previous analyses.

- Step1: General Rank change

The following diagram shows the general time interval at which a rank change of a click occurs.

Figure 12 General Rank Change Time Intervals (x-axis: Time Interval Between Two Clicks)

From the above diagram, we can see a weak 7-day periodicity. It is not as obvious as the periodicity analyzed previously, since this is not a human behavior; rank change is more of a subjective presence. However, it can still be used to normalize the data generated later, since other factors contained in the rank change would otherwise affect the final result.

- Step2: Repeated Click Rank Change

The same procedure was used to calculate the first time a change of rank is observed by the user when the same result is clicked from the same query. The results are plotted in Figure 13.

Figure 13 Rank Changes Distribution (x-axis: Time Interval; chart title: Rank Change During Repetition)

- Step3: Normalized view

The data from Figure 13 was normalized by the data from Figure 12; the normalized view is shown in the figure below.

Figure 14 Rank First Change Periodicity (x-axis: First Time Rank Change Will Be Perceived)

As can be seen from the diagram, a user is unlikely to see a change of rank if he or she repeats a query and clicks the same URL within the following 9 days. A repeated click (from the same query by the same user) that happens after that may result in a change of rank being observed by the user. Combined with the previous finding about the repetition of navigational queries (which more often happens after 7 days), the result suggests that a repeated navigational query is likely to be met with a change of rank.

5.6 RESULT SUMMARY

- Results for Data Preparation

Of the 35,382,016 original queries, about 2.84% were removed as corrupted data. 134,635 records from one user were removed after abnormal search behaviour was identified. By grouping consecutive submissions of the same query by the same user at intervals of no more than 30 minutes, 20,714,848 query episodes were established. Within these query episodes, 10,152,834 distinct queries were identified, of which 3,201,860 were navigational queries and 6,950,974 informational queries.

- Query Repetition Overview

The initial analysis of overall repetition found that about 29.8% of the distinct queries are repeated queries that had previously been issued by users. Of the repeated queries, 56.25% are repeated by different users while 43.75% are repeated by the same user. Examining repetition by query type reveals that more than half of the repeated queries are navigational (57.54%); in comparison, the repetition of informational queries is a little less common (42.46%). The examination of the click data stream shows that, of all the clicked results, 35.39% are repeated clicks, of which 68.40% come from the same query and 31.60% from different queries.

- Temporal Repetition Rate Examination

The temporal repetition rate examination shows that the daily repetition rate tends to remain stable over time, and that frequent users are the minority: most users do not submit more than 10 queries per day. Frequent users tend to repeat more as they search more within a short time interval, but tend to have more unique queries in the long term. In contrast, non-frequent users tend to repeat at an even rate.

- Repetition Periodicity Examination

The examination of repetition periodicity shows that both query and click repetition follow a weekly periodicity. General query repetition shows a 7-day cut-off, which means repetition in general tends to occur within 7 days. The examination of informational and navigational queries shows that, when a query is identified as informational or navigational, the repetition patterns vary considerably. The graphs show that, while most informational query repetition tends to happen within 3 days, navigational queries tend to be repeated after 7 days.

- Rank First-Convert-to-Change Periodicity Analysis

The examination of rank change reveals that if a user clicks on the same result using the same query within 9 days, he or she is unlikely to experience a rank change.

- User Variance Examination

The examination of user variance shows that, generally, users with a higher query frequency tend to repeat queries more often. For users who search more than 300 queries, the repetition rates fluctuate considerably. The periodicity analysis of user variance also shows that high frequency users tend to repeat queries at intervals of 5-35 days, while low frequency users tend to repeat queries at intervals of 10-15 days.
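The 30-minute grouping of queries into episodes mentioned in the data preparation summary above can be sketched as follows. The sketch assumes that a submission joins the current episode when the same user submitted the same query no more than 30 minutes earlier, with no other query in between; this chaining rule, the function name and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta

def to_episodes(events, max_gap=timedelta(minutes=30)):
    """Collapse consecutive submissions of one query by one user into episodes.

    events: list of (user, query, timestamp) tuples (hypothetical layout).
    Returns the first submission of each episode.
    """
    episodes = []
    last_seen = {}  # user -> (query, timestamp) of that user's previous submission
    for user, query, ts in sorted(events, key=lambda e: (e[0], e[2])):
        previous = last_seen.get(user)
        starts_new = (
            previous is None
            or previous[0] != query       # a different query interrupts the episode
            or ts - previous[1] > max_gap  # too long since the previous submission
        )
        if starts_new:
            episodes.append((user, query, ts))
        last_seen[user] = (query, ts)
    return episodes

# Made-up submissions by one user within one morning.
log = [
    ("u1", "weather", datetime(2000, 1, 1, 10, 0)),
    ("u1", "weather", datetime(2000, 1, 1, 10, 10)),  # same episode (10 min gap)
    ("u1", "weather", datetime(2000, 1, 1, 10, 50)),  # new episode (40 min gap)
    ("u1", "maps", datetime(2000, 1, 1, 10, 55)),     # new episode (new query)
    ("u1", "weather", datetime(2000, 1, 1, 11, 0)),   # new episode (interrupted)
]
print(len(to_episodes(log)))
```

Grouping in this way lowers the measured individual repetition, because rapid re-submissions of one query are counted once.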

5.7 RESULT DISCUSSION

The results generated in this analysis were then compared with other studies in order to validate or justify them. Since the studies sometimes express a measure differently, the figures were in some cases converted to a common standard.

Result comparison

                                This Study    Teevan et al.    Sanderson and Dumais
Overall query repetition        29.8%         -                50%
Individual repetition           25.39%        33%              -
Group user repetition           32.64%        18%/7%           -
Repeated navigational query     57.54%        47%              80%
Repeated clicks of same user    30.39%        29%              17.5%
From same query                 68.4%         -                83%
From different query            31.6%         -                17%

Table 6 Result Sheet

As can be seen from the above comparison, although some of the percentages vary a little, they are either close to each other or fall within the range of the figures from previous studies. Some obvious differences lie in the percentage of repeated queries and in the repetition percentages by user group. This is probably because, in this study, consecutive repeated queries issued by a single user within a short time were grouped into query episodes, so the percentage of individual repetition may be lower than the figures in other studies.

The findings about user frequency variance are in accordance with those of Dou et al. (2008): users tend to repeat more as they search more, up to a certain point, after which they show a variety of interests in both repeated and unique queries. This implies that, for all users, query history may be useful for short-term query re-use, whereas for frequent searchers some long-term personalization strategies based on the past query list may not work well. The examination of the temporal repetition rate is also consistent with Dou et al.’s (2008) findings: frequent users tend to have varied interests in the long run. They therefore suggested that, for a frequent user, a profile based on long-term interests would work better than a query-based profile.

This study also extended previous work by examining the repetition periodicity from various aspects. The study found a 7-day periodicity for general query repetition, which is in line with Sanderson and Dumais (2006). The examination of navigational query repetition shows a similar 6-7 days before a repetition is likely to happen. The study examined informational queries in particular and found a 3-day cut-off. Although the query classification cannot identify informational queries very precisely, this finding is still consistent with the previous belief that informational queries are repeated in bursts. The three days may be a little shorter than expected, but can be regarded as reflecting the trend.

The study then analysed the possible differences between users. The general 7-day period for a repetition to occur reflects only the mixed behaviour of high and low frequency users. When examined separately, low frequency users repeat queries within a concentrated period, while high frequency users repeat queries over a comparatively wide range of time intervals. Although the 13-20 days for low frequency users and the 2-35 days for high frequency users may to some extent be specific to this data set, they roughly indicate a range of 2-3 weeks for low frequency users and of up to 5 weeks for high frequency users. This means that high frequency users can repeat a query very soon, but can also hold a long-term interest in a particular query and repeat it at a later point. Low frequency users usually repeat a query around 2 weeks after first submitting it; after that, they will possibly never be seen repeating the query again. Some previous findings have suggested that frequent search engine users tend to have varied information needs, so their repetition behaviour is less periodic than that of users who use the search engine less. This agrees with the above finding of this study.

The analysis of the first rank change during repetition is specific to this study,
since the time interval of the rank change is measured on the query repeating process.
In general, a rank change can happen at any time, or at other times that have been found
in general search analyses. In this study, however, only the rank changes perceived
while re-accessing a previously viewed result page are included in the analysis. This
rank change periodicity may vary with the repetition periodicity; it is analysed here
only to suggest that a repetition after 9 days is likely to encounter a rank change.
Combined with the earlier finding on navigational query repetition periodicity, this
suggests that some navigational queries re-issued after 9 days by the same user, trying
to find the same page with the same query, are likely to meet a rank change. Even though
in most cases the best result for a given navigational query will not fall out of the
top 10 and will usually still appear on the result page, the change of rank can still
hinder re-finding to some extent, as Teevan et al. (2007) also noted in their analysis
of the Yahoo query log.

To sum up, the analyses in this study have further identified features of the frequency
and periodicity of repetition behaviour from different aspects, including examination
by query type and by user group. As an extension of previous work, the study has yielded
findings that shed light on the different patterns exhibited in users’ repetition
behaviour.

6 CONCLUSIONS

This chapter presents the final conclusions of the study. The main contribution is
discussed first, followed by the limitations and suggestions for future work.

6.1 CONTRIBUTION

In this study, we confirmed that query repetition is quite common in web searching. By
examining the types of query being repeated, and the patterns of repetition by both
individual users and user groups, we corroborated many previous findings about this
particular user behaviour.

The major contribution of this work is that it extends previous work by examining query
repetition patterns both by query type and by user. It looked specifically at the
repetition patterns of navigational and informational queries, and found that
informational queries, as expected, tend to be repeated in a burst within 3-4 days,
while navigational queries are likely to be repeated after the 7 days identified in
previous work. Users of different frequency tend to exhibit different patterns when
repeating a query. High frequency users repeat queries in more varied ways, and their
repetitions can occur anywhere within 1-5 weeks, whereas the behavior of low frequency
users tends to be highly periodic, with repetitions concentrated between week 2 and week
3 after the query was first issued. A suggestion for re-ranking is also given, based on
the finding that a rank change will be perceived when the same user repeats the same
query 9 days later in order to return to a previously viewed result page. This may also
indicate that re-finding based on a navigational query is likely to be hindered by rank
change. The examination of user variance in this study complements previous work, and
the examination of the period before a rank change can be observed during query
repetition suggests a possible limitation of the re-ranking strategy.

6.2 LIMITATION AND FUTURE WORK

One limitation of this study is that the classification of queries was not based on
standardized algorithms. The criteria developed for identifying different types of
query are open to different interpretations, so the algorithms for classifying queries
may vary. The analysis in this paper was also carried out at query level, based on exact
matching; the modeling of real user behavior may benefit from taking users’ modified
queries into consideration. The study analyzed only daily periodicity and did not
include hourly analysis. Moreover, because of the size of the data set and the period
it covers, it cannot shed light on monthly or seasonal periodicity.
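The query-level, exact-match, daily-granularity analysis described above can be illustrated with a minimal sketch. It is not the relational database procedure used in this study: the record layout `(user_id, query, timestamp)`, the function name `repeat_intervals_in_days` and the sample records are all hypothetical, and the gap is measured, as an assumption, from the user’s first issue of the query.

```python
from collections import Counter
from datetime import datetime


def repeat_intervals_in_days(log):
    """Count repeats by the whole-day gap since the first issue of a query.

    `log` is an iterable of (user_id, query, timestamp) tuples (assumed layout).
    A repeat is counted only when the same user issues exactly the same query
    string again, so modified or re-cased queries are treated as new queries.
    """
    first_seen = {}
    gaps = Counter()
    for user_id, query, ts in sorted(log, key=lambda row: row[2]):
        key = (user_id, query)
        if key not in first_seen:
            first_seen[key] = ts  # first issue of this query by this user
        else:
            gaps[(ts - first_seen[key]).days] += 1  # daily granularity only
    return gaps


# Hypothetical records, invented purely for illustration.
sample_log = [
    ("u1", "sheffield weather", datetime(2010, 3, 1, 9, 0)),
    ("u1", "sheffield weather", datetime(2010, 3, 8, 9, 30)),
    ("u1", "Sheffield Weather", datetime(2010, 3, 8, 9, 31)),  # not an exact match
    ("u2", "sheffield weather", datetime(2010, 3, 2, 10, 0)),  # different user
    ("u1", "sheffield weather", datetime(2010, 3, 15, 10, 0)),
]
interval_counts = repeat_intervals_in_days(sample_log)
print(sorted(interval_counts.items()))
```

Because matching is on the exact string, the re-cased query in the sample is not counted as a repeat, which illustrates how modified queries fall outside this kind of analysis.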

Another limitation of this study is that it did not fully adopt the same metrics as
previous work. The grouping of queries into episodes cuts both ways. On the one hand, a
repetition rate that excludes duplicate and identical queries tends to be a better
reflection of users’ repetition rates; on the other hand, the 30-minute cut-off is based
on estimation. As Sanderson and Dumais (2006) noted in their work, grouping queries or
users on the basis of any such approximation tends to be error-prone. The study also did
not remove the oversized sessions that are suspected to come from robots, so the results
may contain deviant points that have not yet been removed from the log.
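To make the episode grouping concrete, the sketch below splits one user’s query timestamps into episodes using a 30-minute cut-off. It assumes the cut-off is applied to the gap between consecutive queries, which is one common reading and may differ from the grouping actually used here; `split_into_episodes` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(minutes=30)  # assumed inactivity threshold


def split_into_episodes(timestamps, cutoff=CUTOFF):
    """Group one user's query timestamps into episodes.

    A new episode starts whenever the gap to the previous query exceeds `cutoff`.
    """
    episodes = []
    for ts in sorted(timestamps):
        if episodes and ts - episodes[-1][-1] <= cutoff:
            episodes[-1].append(ts)  # still within the current episode
        else:
            episodes.append([ts])  # gap too long (or first query): new episode
    return episodes


# Hypothetical timestamps for a single user.
one_user = [
    datetime(2010, 3, 1, 9, 0),
    datetime(2010, 3, 1, 9, 10),
    datetime(2010, 3, 1, 9, 45),  # 35 minutes after the previous query
    datetime(2010, 3, 1, 10, 0),
]
episodes = split_into_episodes(one_user)
print([len(episode) for episode in episodes])
```

Changing `CUTOFF` shifts the episode boundaries and therefore the measured repetition rate, which is why an estimated cut-off is a source of error.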

This study did not take into consideration other effects that could mask the periodicity
of repetition behavior. As Sanderson et al. (2006) mentioned in their work, without
identifying the underlying web usage and computer usage patterns, the observed
periodicity cannot be guaranteed to be a feature unique to query repetition behavior.

Another major limitation of this study is that it did not include a survey or
questionnaire to complement the quantitative study. This stems from the deep-rooted
limitation of SLA as an unobtrusive method for user studies. As Jansen (2009) discussed,
the computer screens off most of the user information that would form the background of
the users being studied. This information includes basic personal details such as the
user’s gender, age and career; user-side activities such as downloading a document or
copying and pasting; the needs and perspectives that motivated the query; and the user’s
emotional state and education level. The lack of background information about the user
causes many problems when trying to derive conclusions from the results.

In summary, this work has only skimmed the surface of the temporal repetition patterns
exhibited by web search engine users. Whether the repetition pattern is due to other
effects is left for further investigation. The visual inspection of periodicity from
graphs would also benefit from an exact examination using the Fourier transform to
detect significant frequencies. It would be of interest to find the probability that
both the repeated query and the click fall within a certain time span. User variance in
repetition periodicity would be another route for future study. Finally, it would be
good to have a larger data set covering a longer period of time, so that monthly and
even seasonal analysis can be performed.
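As an indication of what a Fourier-based examination could look like, the sketch below computes a plain discrete Fourier transform of a daily repetition-count series and reports the period with the largest spectral power. It is not an analysis performed in this study: the function name `dominant_period` is hypothetical, and the input is a synthetic series with a built-in weekly cycle, not data from the query log.

```python
import cmath
import math


def dominant_period(series):
    """Return (period_in_samples, power) of the strongest non-zero frequency."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]  # remove the constant (zero-frequency) part
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):  # frequencies up to the Nyquist limit
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centred)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k, best_power


# Synthetic 12-week series of daily repeat counts with a weekly cycle.
daily_repeats = [10 + 5 * math.cos(2 * math.pi * day / 7) for day in range(84)]
period, power = dominant_period(daily_repeats)
print(round(period, 2))
```

On real log data the spectrum would contain noise, so a significance threshold on the power would be needed before reading any peak as a genuine periodicity.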

REFERENCES

Adar, E., D. Weld, et al. (2007). Why we search: visualizing and predicting user

behavior, In Proc. of the Int'l WWW Conf.

Aula, A., N. Jhaveri, et al. (2005). Information search and re-access strategies of
experienced web users, In Proceedings of WWW'05, 583-592.

Beitzel, S., E. Jensen, et al. (2007). "Temporal analysis of a very large topically

categorized web query log." Journal of the American Society for Information Science

and Technology 58(2): 166-178.

Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D. and Frieder, O. (2004).
Hourly analysis of a very large topically categorized Web query log. In Proceedings of
SIGIR, 321-328.

Bruce, H., Jones, W. and Dumais, S. (2004). Keeping and re-finding information on the
Web: What do people do and what do they need? In Proceedings of ASIST.

Brooks, N. (2006). "Repeat search behavior: Implications for advertisers." Bulletin of
the American Society for Information Science and Technology 32(2): 16.

Chien, S. and N. Immorlica (2005). Semantic similarity between search engine

queries using temporal correlation, In Proc. of the 14th Int’l World Wide Web

Conference, 2-11.

Cockburn, A. and B. McKenzie (2001). "What do Web users do? An empirical

analysis of Web use." International Journal of Human-Computer Studies 54(6):

903-922.

Cui, H., J. Wen, et al. (2002). Probabilistic query expansion using query logs, Proc.
11th World Wide Web Conf., pp. 325-332.

Capra, R. and Pérez-Quiñones, M.A. (2005). Using Web search engines to find and re-find
information. IEEE Computer, 38 (10), 36-42.

Dou, Z., R. Song, et al. (2007). A large-scale evaluation and analysis of personalized
search strategies, In Proceedings of WWW'07, 581-590.

Dou, Z., X. Yuan, et al. (2008). "Analysis of Query Repetition in Large-scale

Chinese Search Log." Jisuanji Gongcheng/ Computer Engineering 34(21): 40-41.

Fagni, T., R. Perego, et al. (2006). "Boosting the performance of web search engines:

Caching and prefetching query results by exploiting historical usage data." ACM

Transactions on Information Systems (TOIS) 24(1): 78.

Global Faces and Networked Places: A Nielsen report on social networking’s new social
footprint. blog.nielsen.com/nielsenwire/wp.../nielsen_globalfaces_mar09.pdf Accessed
28th July 2010.

Han, W., J. Lee, et al. (2007). Ranked subsequence matching in time-series databases,
In International Conference on Very Large Data Bases (VLDB), pp. 423–434.

Jansen, B., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A

study and analysis of user queries on the web. Information Processing and

Management, 36(2), 207–227.

Jansen, B.J. and Pooch, U. (2001). A review of web searching studies and a

framework for future research. Journal of the American Society for Information

Science and Technology, 52(3), 235–246.

Jansen, B.J. and Spink, A. (2006). How are we searching the World Wide Web? A

comparison of nine search engine transaction logs. Information Processing and

Management, 42(1), 248–263.

Jansen, B.J. (2006). Search log analysis: What it is, what's been done, how to do it.

Library and Information Science Research, 28(3), 407–432.

Jansen, B.J. (2008). The methodology of search log analysis. In B.J. Jansen, A. Spink,

& I. Taksa (Eds.), Handbook of research on Web log analysis (pp. 99–121). Hershey,

PA: Idea Group Inc.

Jansen, B.J., Spink, A., & Pedersen, J. (2005). Trend analysis of AltaVista Web
searching. Journal of the American Society for Information Science and Technology,
56(6), 559–570.

Koshman, S., A. Spink, et al. (2006). "Web searching on the Vivisimo search

engine." Journal of the American Society for Information Science and Technology

57(14): 1875-1887.

Lee, U., Z. Liu, et al. (2005). Automatic identification of user goals in web search,
In Proceedings of The World Wide Web Conference. Chiba, Japan, 391-401.

Liu, F., C. Yu, et al. (2002). Personalized web search by mapping user queries to
categories, In Proceedings of the Eleventh International Conference on Information and
Knowledge Management (CIKM'02). USA, 558-565.

Lau, T. and Horvitz, E. (1999) Patterns of search: Analyzing and modeling Web

query refinement. In Proceedings of the UM ‘99, 119-128.

Mahanti, A., D. Eager, et al. (2000). "Temporal locality and its impact on Web proxy

cache performance." Performance Evaluation 42(2-3): 187-203.

Obendorf, H., Weinreich, H., Herder, E., and Mayer, M. (2007). Web page revisitation
revisited: Implications of a long-term click-stream study of browser usage. In
Proceedings of CHI, 597–606.

Ozmutlu, S., Spink, A. and Ozmutlu, H.C. (2004). A day in the life of web searching:

an exploratory study. Information processing and management, 40(2), 319–345.

Ross, N., & Wolfram, D. (2000). End user searching on the Internet: An analysis of term
pair topics submitted to the Excite search engine. Journal of the American Society for
Information Science, 51(10), 949–958.

Sanderson, M. and Dumais, S. (2007). Examining repetition in user search behavior. In
Proceedings of ECIR ’07.

Silverstein, C. et al. (1999). Analysis of a very large web search engine query log. In

ACM SIGIR Forum. pp. 6–12.

Spink, A. et al. (2002). US versus European Web searching trends. In ACM SIGIR

Forum. pp. 32–38.

Spink, A., Bateman, J. and Jansen, B.J. (1998). Searching heterogeneous collections

on the Web: behaviour of Excite users. Information Research, 4(2), 4–2.

Smyth, B. (2007). "A community-based approach to personalizing web search."

Computer 40(8): 42-50.

Smyth, B., E. Balfe, et al. (2004). "Exploiting query repetition and regularity in an

adaptive community-based web search engine." User Modeling and User-Adapted

Interaction 14(5): 383-423.

Teevan, J., E. Adar, et al. (2007). Information re-retrieval: repeat queries in Yahoo's

logs, In SIGIR'07: Proceedings of the 30th annual international ACM SIGIR

conference on Research and development in information retrieval, pp. 151–158.

Tyler, S. and J. Teevan (2010). Large scale query log analysis of re-finding, Proc.
WSDM, 191-200.

Tauscher, L. and Greenberg, S. (1997) How people revisit Web pages: Empirical

findings and implications for the design of history systems. International Journal of

Human-Computer Studies, 47 (1), 97–137.

Teevan, J. (2007). Supporting finding and re-finding through personalization. Doctoral
Thesis, MIT, February.

Teevan, J., Adar, E., Jones, R. and Potts, M. (2006). History repeats itself: Repeat
queries in Yahoo’s logs. In Proceedings of SIGIR, 703-704.

Teevan J., Alvarado C., Ackerman M. S., and Karger D. R. (2004) The perfect search

engine is not enough: A study of orienteering behavior in directed search. In

Proceedings of CHI, 415-422.

Vlachos, M., S. Kozat, et al. (2009). Optimal distance bounds on time-series data, In
SDM.

Vlachos, M., S. Kozat, et al. (2010). "Optimal distance bounds for fast search on

compressed time-series query logs." ACM Transactions on the Web (TWEB) 4(2):

1-28.

Vlachos, M., C. Meek, et al. (2004). Identifying similarities, periodicities and bursts
for online search queries, In Proceedings of the ACM SIGMOD International Conference on
Management of Data, pp. 131–142.

Vlachos, M., P. Yu, et al. (2005). On periodicity detection and structural periodic
similarity, In Proceedings of the SIAM International Conference on Data Mining (SDM 05).

Wang, P., Berry, M.W. and Yang, Y. (2003). Mining longitudinal Web queries: Trends

and patterns. Journal of the American Society for Information Science and

Technology, 54(8), 743–758.

Wedig S. and Madani, O. (2006). A large-scale analysis of query logs for assessing

personalization opportunities. In Proceedings of KDD, 742–747.

Wen, J.-R., Nie, J.-Y. and Zhang, H.-J. (2002). Query clustering using user logs. TOIS,
20 (1), 59–81.

Xie, Y. and D. O'Hallaron (2002). Locality in search engine queries and its implications
for caching, In Proceedings of the twenty-first annual joint conference of the IEEE
computer and communications societies, pp. 307–317.

Zhang, Y., B. Jansen, et al. (2009). "Time series analysis of a Web search engine

transaction log." Information Processing & Management 45(2): 230-245.

Zhao, Q., S. Hoi, et al. (2006). Time-dependent semantic similarity measure of queries
using historical click-through data, In WWW'06: Proceedings of the 15th international
conference on World Wide Web, New York, NY, USA, pp. 543–552.

Zhao, Q., T. Liu, et al. (2006). Event detection from evolution of click-through data,
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 484–493.