EXAMINE THE FREQUENCY AND PERIODICITY OF
REPETITION BEHAVIOUR
A study submitted in partial fulfillment of the requirements
For the degree of
Master of Science in Information Management
At
Department of Information Studies
At
The University of Sheffield
By
Ruoning Qian
September 2010
ABSTRACT
Query repetition has been found to be very common in web searching. Numerous studies
have examined different aspects of this behavior, either through search log analysis
or through experimental studies. While many of them focused on identifying the
characteristics of this behavior through static observation, only a few have studied
how repetition unfolds over time. During the last few years, a number of studies have
explored the potential of using the time-related features of this behavior to assist
search engine development. These possible applications have made further longitudinal
study of those features more attractive than before.
This study aimed to identify the periodicity of users’ query repetition behavior, and
to further understand how the behavior varies by query type and user frequency. A
query log containing thousands of users’ queries and click-through data over a period
of three months was collected from a popular American search engine and then analyzed
in a relational database. The results show that, in general, users’ repetition
behavior follows a 7-day periodicity; however, informational queries and navigational
queries tend to be repeated in different ways when examined separately, and frequent
search engine users repeat queries differently from low-frequency users. In
particular, it was found that a query repeated after 9 days is likely to encounter a
rank change. Based on these results, we conclude that, since users’ query repetition
behavior follows different patterns depending on query intent and user type, it is
important to identify both the query type and the user type in order to determine the
likely periodicity for each.
ACKNOWLEDGEMENT
I would like to thank my supervisor, Dr. Mark Sanderson, who provided both the log
data and useful advice during the study, and who showed understanding and patience
throughout the process. Thanks are also due to Mr. Peter Stordy and Mr. John Holiday,
who were willing to help me with the use of Oracle Database, and to Dr. Paul Clough
for understanding my special situation. Finally, I thank my friend Wang, who supported
me with a server-class computer for the log analysis.
Word Count
Number of Pages: 80
Number of Words: 16,500
CONTENTS
ACKNOWLEDGEMENT
1 INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Research Objectives
1.4 Dissertation Structure
1.5 Key Findings
2 LITERATURE REVIEW
2.1 Overall Query Repetition Examination
2.2 Repetition Examined by Individual and Group
2.3 Static Locality Examination
2.4 Temporal Locality Examination
2.5 Temporal Repetition Features Examined by Query Type
2.6 User Variance from General Repetition Pattern
2.7 Applications and Implications
3 METHODOLOGY
3.1 Search Log Analysis as Method
3.1.1 Search Log
3.1.2 Theoretical Foundation
3.1.3 Possible Issues in Related to the Process
3.2 Data Collection
3.2.1 Data Source
3.2.2 Standard Field in Search Log
3.2.3 Privacy Issue and Offensive Information
3.3 Platform and Tools
3.4 Data Preparation
3.4.1 Data Importing
3.4.2 Remove Corrupted Data
3.4.3 Abnormal User Identification
3.4.4 Group Query Episode
3.4.5 Query Classification
3.5 Metric for Analyzing
4 DATA ANALYSIS
4.1 Process Design
4.2 Special Method
4.2.1 Time-Based Frequency
4.2.2 Repetition Distance
4.2.3 Normalization
4.3 Overall Repetition Examination
4.4 Temporal Repetition Rate Examination
4.4.1 User Frequency
4.4.2 User Variance on Temporal Repetition Rate
4.5 Repetition Periodicity Identification
4.5.1 General Query Repetition Periodicity
4.5.2 Query Repetition Periodicity Examine By Type
4.5.3 User Variance on Repetition Periodicity
4.6 Rank First Convert to Change
5 RESULTS
5.1 Results for Data Preparation
5.2 Results for Overall Query Repetition Examined
5.3 Temporal Query Repetition Rate Examined
5.4 Repetition Periodicity Examination
5.5 Rank First Convert to Change
5.6 Result Summary
5.7 Result Discussion
6 CONCLUSIONS
6.1 Contribution
6.2 Limitation and Future Work
REFERENCE
- Figure 1 Abnormal User Detection
- Figure 2 Temporal Repetition Rate Distribution
- Figure 3 User Frequency Distribution
- Figure 4 User Variance on Temporal Repetition Rate
- Figure 5 User Repetition Rate Distribution
- Figure 6 Same User Query Time Intervals
- Figure 7 Same User Same Query Time Intervals
- Figure 8 Normalized General Query Repetition Time Intervals
- Figure 9 Informational Query Repetition Periodicity
- Figure 10 Navigational Query Repetition Time Intervals
- Figure 11 User Variance on Repetition Periodicity
- Figure 12 General Rank Change Time Intervals
- Figure 13 Rank Changes Distribution
- Figure 14 Rank First Change Periodicity
- Table 1 Data Preparation Result
- Table 2 Repetition Overview
- Table 3 Click Repetition Overview
- Table 4 Most Repeated Queries
- Table 5 Most Repeated Queries across User
- Table 6 Result Sheet
1 INTRODUCTION
1.1 BACKGROUND
Web searching is the most popular internet activity according to the Nielsen report in
2009. More and more people are using search engines every day to find information or
to navigate. As the web grows in size, people’s reliance on search engines grows with
it. The dramatic increase in search engine usage has given rise to a growing interest
in web searching studies, including the modeling of user behavior and web search
engine performance, as summarized by Spink & Jansen (2004). During the last ten years,
numerous studies of web searching, especially web user studies, have been carried out
in order to better understand users’ information needs, search engine usage and so on.
Among all the user behaviors identified so far, one is of special interest to many
researchers. It was found that, although millions of queries are submitted to a search
engine by thousands of different users every day, only a relatively small set of
queries is submitted again and again. Large numbers of repeated queries are found in
most search engine query logs, which confirms that people tend to repeat queries that
have been searched before, either by themselves or by others.
This finding has aroused much interest, both in investigating it further and in
exploiting its potential in other research. Many interesting findings about this
behavior have been reported since. For example, it was found that, besides repeating
queries, users tend to click on the same result they clicked on before, whether from
the same query or a different one, so that their choice of results tends to be highly
consistent with the past (Smyth et al., 2004); that a small number of queries are
repeated very frequently while the majority are repeated less often (Xie and
O’Hallaron, 2002); and that the behavior shows a weekly and daily periodicity as well
as a repetition pattern that varies by task (Sanderson and Dumais, 2006). Other
research has also built on it: studies of user trends identify the common interests of
users from the most frequently repeated queries; research on caching has benefited
from a two-level caching strategy that depends on whether a query is repeated often by
a single user or shared among the majority; and result ranking could take a repeated
click as a sign of high relevance.
Besides the existing findings, many new research directions have been pointed out,
among which further examination of the periodic features of the behavior is of special
interest.
1.2 MOTIVATION
Understanding the time-related features of users’ query repetition behavior can be
very beneficial to the development of search engine strategies.
How long does a burst of an informational query last, and how often do users use a
search engine to re-navigate? These questions are of great interest to search engines,
because caching strategies should be based on the different temporal demand for
repeated queries, both those shared by the majority and those pursued by an individual
over time. For a search engine cache, the tension is always between the limited cache
space and the need to provide timely responses from stored results. Deciding the right
time to replace a cached query that is unlikely to be repeated with a new query that
is likely to be repeated is therefore important. It would be valuable to predict the
likelihood of a query being repeated within a certain time, based on the user
intention behind the query.
In order to provide personalized search engine services for different users, it is
necessary to decide how long a user profile should be kept and used for query re-use
and result re-ranking. Dou et al. (2008) found that users of different frequency
benefit from different personalization strategies. Whether to build a short-term or a
long-term profile for future repetition is quite important, and this can be determined
by identifying user groups of different frequency.
In order to prevent the re-ranking of results from hindering the re-finding of
previously viewed result pages (Teevan et al., 2006), it is better to choose the time
to re-rank results based on the likely time interval between two repeated queries and
clicks. It has also been suggested that the periodic character of a query’s repetition
can be used as a signature to identify semantically related queries (Vlachos et al.,
2004) or as a criterion for query classification().
Given the above potential benefits, it is both interesting and necessary to try to
identify the time-related features of repetition behavior that could be used to
improve search engine effectiveness and to provide timely and personalized services.
This study aims to identify some of these time-related features and to yield new
findings.
1.3 RESEARCH OBJECTIVES
Based on the above motivation, this research focuses on the temporal features of
repetition behavior, both in general and specifically by query type (varied by task)
and by user. As mentioned above, a change of rank is one of the challenges that may be
encountered when re-accessing a result, so the relation between re-ranking and
re-clicking is also examined. The research aims of this study are as follows:
- To examine the temporal query repetition rate and user variance;
- To identify the general query repetition periodicity;
- To identify the repetition periodicity as it varies by users’ query intention;
- To identify the variance in periodicity by user frequency;
- To identify the rank change periodicity during repetition and to find the relation
between the two periodicities.
1.4 DISSERTATION STRUCTURE
In this study, search log analysis is performed to examine the query repetition
behavior identified in a large search query log. The dissertation starts by reviewing
related work on query repetition behavior and other studies in which search log
analysis was the main method. The data and methodology are presented next, giving both
the detailed steps of the research process and a methodological explanation of the
methods used. The results are shown in the following chapter, followed by a discussion
that tries to explain the results and relate them to those of previous work. The
limitations of the study and suggestions for future research are given at the end.
1.5 KEY FINDINGS
From this study, we know that, in general, users repeat their queries within a 7-day
period. However, further analysis by query type, based on user intention, shows that
while a repeated informational query tends to burst within 3 days, a navigational
query tends to be issued again by the same user after a week. Analysis of users of
different frequency reveals that frequent users repeat more than infrequent users;
however, they also seem to issue more unique queries once they have issued a certain
number of queries. The comparison between the rank change periodicity and the query
repetition periodicity shows that a user who repeats a query, most likely a
navigational query, after 9 days will probably encounter a rank change when trying to
click on the same result as last time.
The above introduction provides an overview of the whole dissertation; the details are
given in the remaining chapters. In the next chapter, previous work on query
repetition is reviewed first.
2 LITERATURE REVIEW
Reviewing previous work before carrying out the study is useful for understanding
where this study stands among related studies on the topic, and for obtaining a
general framework for the work to be done. The literature review in this chapter
discusses relevant studies of query repetition behavior. The literature is organized
into the following aspects:
- Overall repetition examination
- Repetition examined by individual and group
- Repetition frequency examination
- Query repetition examined by user intent
- User variance in query repetition
- Temporal features examined
- Applications and implications
2.1 OVERALL QUERY REPETITION EXAMINATION
Smyth et al. (2004) made two basic assumptions about web searching: that similar
queries tend to recur, and that similar results tend to be selected again. Much work
has been done to identify query repetition in different query logs.
Markatos (2000) analyzed a million queries from the Excite web search engine and found
that 20%-30% of the submitted queries were resubmitted by either the same or different
users. Xie and O’Hallaron (2002) studied Vivisimo log data over a period of 35 days
and found that about 32% of the queries were repeated at least once; the study of the
Excite log showed that 42% of the queries were repeated ones.
Teevan et al. (2004) analyzed 13,060 queries and 21,942 clicks from 114 Web browsers
over a period of one year, and found that 67% of all queries were submitted more than
once. They also found that 71% of the repeated clicks came from the same query and 28%
from the same user, while only 7% of repeated clicks came from multiple users. Users
were thus more likely to click on previously viewed result pages.
Sanderson and Dumais (2006) analyzed 3.3 million queries with 7.7 million clicks from
more than 30 thousand unique users over a period of 3 months. They found that repeated
queries accounted for more than 50% of all submitted queries, and that 17.5% of all
clicks were repeated ones.
Dou et al. (2008) analyzed query repetition in a large Chinese search engine log, and
found that about 21.87% of the distinct queries had been submitted more than once, and
that the repeated query instances accounted for 54.78% of the total query instances.
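The two figures reported by Dou et al. measure different things: the share of distinct
queries that recur, and the share of query instances that belong to recurring queries.
A minimal sketch of how both could be computed from a list of query strings is given
below. The function name and the toy log are illustrative assumptions, as is the
choice to count every instance of a recurring query (including its first occurrence)
as a repeated instance; the cited studies may count differently.

```python
from collections import Counter


def repetition_rates(queries):
    """Return (distinct_share, instance_share) for a list of query strings.

    distinct_share: fraction of distinct queries submitted more than once.
    instance_share: fraction of all query instances that belong to such queries
                    (assumption: the first occurrence is counted as well).
    """
    counts = Counter(queries)
    if not counts:
        return 0.0, 0.0
    repeated = {q: n for q, n in counts.items() if n > 1}
    distinct_share = len(repeated) / len(counts)
    instance_share = sum(repeated.values()) / sum(counts.values())
    return distinct_share, instance_share


# Toy log, invented for illustration only.
log = ["weather", "bbc", "bbc", "bbc", "python", "weather", "sheffield"]
distinct_share, instance_share = repetition_rates(log)
print(f"distinct queries repeated: {distinct_share:.2%}")
print(f"instances that are repeats: {instance_share:.2%}")
```

Even in this small example the two rates differ widely, which is one reason the
percentages quoted across studies are not directly comparable.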
As can be seen from the above statistics, repeated queries accounted for 20%-67% of
total queries. Although the figures differ from one study to another, all these works
strongly suggest that query repetition is quite common in today’s web searching, and
that people’s choices of results tend to be consistent with their previous ones. These
two premises make the modeling of user behavior possible and have established the
foundation for related studies of query repetition behavior.
2.2 REPETITION EXAMINED BY INDIVIDUAL AND
GROUP
Examining the repetition by user group is very useful, especially for developing
personalized strategies.
Individual user analysis carries implications for personalization, while group user
analysis is generally used for trend discovery, news event detection and query re-use
strategies. Some works have pointed out both implications. It was suggested that
frequently repeated queries shared by groups of users should be cached on the server
side in order to meet general information needs, while personal-level query re-use may
be better served by client-side caching. This two-level caching strategy was shown to
be both useful and effective (Fagni, 2006).
Web user trend detection is one of the most popular routes followed by early work, as
well as one of the most widely adopted methods for exploring new markets and
group-targeted advertising. A search engine query log can be viewed as a database of
user interests. Brooks (2005) discussed one application of query repetition analysis
in advertising. He tried to identify a causal relation between repeated searches for a
certain product and the final purchase, adopting a time-to-convert method to identify
the occurrences of repeated searches most likely to lead to a purchase by analysing
the number of clicks before paying.
Instead of the general trend, individual query repetition may be used for interest
detection or user group classification. Many studies have focused on the vocabulary of
personal term use. Xie and O’Hallaron (2002) suggested that many users draw on a small
set of terms; term-level analysis would therefore be rather effective for long-term
query prediction. However, Dou et al. (2008) suggested that term-level analysis
captures only short-term information needs, and that a long-term personalization
strategy would benefit from detecting underlying interests.
2.3 STATIC LOCALITY EXAMINATION
Examining query frequency provides a cumulative overview of the number of times each
query is submitted within the period analyzed, which sheds light on the degree of
repetition of different queries.
Jansen et al. (2000, 2001) indicated that neither queries nor query terms follow a
Zipfian distribution, since they identified large numbers of infrequently repeated
queries and terms in the log. This was revised by Saraiva et al. (2001), who found
that query frequencies follow a Zipf-like distribution in an analysis of 10 thousand
queries from a Brazilian search engine. Xie and O’Hallaron (2002) later identified a
similar distribution of query frequency in a comparative study of the Vivisimo and
Excite query logs. Lempel and Moran (2003) analyzed around seven million queries from
AltaVista in 2001 and found that query frequency followed a power law; this was also
confirmed by Eiron and McCurley (2003) in their study of web query vocabulary.
These studies suggest that only a small percentage of queries are repeated many times,
while a large share of queries are repeated rarely. They were based on static
observation of repetition frequency, providing no insight into how queries are
repeated over a period of time.
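A Zipf-like or power-law claim of this kind can be checked informally by ranking
queries by frequency and fitting a straight line to log frequency against log rank; a
slope near -1 corresponds to the classic Zipf case. The sketch below is only
illustrative: the helper names and the synthetic log are assumptions, and an ordinary
least-squares fit on log-log axes is a rough diagnostic rather than the estimation
method used in the cited studies.

```python
import math
from collections import Counter


def rank_frequency(queries):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent query."""
    freqs = sorted(Counter(queries).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Requires at least two distinct ranks.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic log (assumption): the query at rank i occurs 120 // i times,
# i.e. frequencies 120, 60, 40, 30, 24, 20, an exact Zipf pattern.
log = [f"q{i}" for i in range(1, 7) for _ in range(120 // i)]
pairs = rank_frequency(log)
slope = loglog_slope(pairs)
print(pairs)
print(f"log-log slope: {slope:.3f}")
```

On a real log the long tail of queries seen only once tends to flatten the fit, which
may be part of why the earlier and later studies reached different conclusions.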
2.4 TEMPORAL LOCALITY EXAMINATION
Compared with the static observation of query repetition behavior, which tends to
focus on verifying or describing its existence, studies of the temporal evolution of
users’ repetition behavior can shed light on its time-related features.
Wang et al. (2003) examined the query logs of a university search engine over a period
of four years (1997-2001) and analyzed the temporal query frequency by day, month and
season. Beitzel et al. (2004) analyzed a very large AOL query log containing queries
from millions of users over a period of one week, and found that the hourly query
repetition rate remained constant throughout the day.
Dou et al. (2008) analyzed the evolution of query repetition rates by hour over a
period of one month on a large Chinese search engine. Such temporal analysis provides
an overview of how the cumulative number of repetitions changes by hour or day;
however, it gives no insight into when a particular query is repeated after it is
first issued.
Wedig and Madani (2006) discovered that some users repeat clicks over long periods of
time. Xie and O’Hallaron (2002) also found that the majority of repeated queries are
repeated within a short time interval, while a number of queries are repeated over a
relatively longer period.
These works have contributed to estimating when repeated navigational queries are
likely to occur. They suggest that a repeated event can be predicted either by
identifying both the time span within which a repetition is likely to occur and its
probability of occurrence, or by observing a frequently occurring combination of two
repeated events. However, their findings can only be applied as a general rule: they
shed light on the query repetition pattern without distinguishing by query type.
2.5 TEMPORAL REPETITION FEATURES EXAMINED
BY QUERY TYPE
Previous repetition studies by query type were generally based on examining the
co-occurrence of repeated queries and clicks. In their study of re-finding, Lee et al.
(2005) found that navigational queries tend to have a highly concentrated click
distribution, while users click on a wider range of results for informational queries.
They then used the click distributions to discriminate navigational queries from
informational ones.
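One simple way to quantify how concentrated a click distribution is, offered here only
as an illustration and not as the measure used by Lee et al., is the Shannon entropy
of the clicked URLs for a query: values near zero mean that almost all clicks go to
one result, while larger values mean that clicks are spread over many results. The
function name, the example URLs and the threshold of 1.0 bit are all assumptions made
for this sketch.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the clicked-URL distribution for one query."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Invented click lists: one concentrated, one spread evenly over four URLs.
navigational_clicks = ["bbc.co.uk"] * 9 + ["bbc.com"]
informational_clicks = ["a.example", "b.example", "c.example", "d.example"] * 3

THRESHOLD_BITS = 1.0  # hypothetical cut-off, would need tuning on labelled data

for label, clicks in [("navigational-looking", navigational_clicks),
                      ("informational-looking", informational_clicks)]:
    entropy = click_entropy(clicks)
    guess = "navigational" if entropy < THRESHOLD_BITS else "informational"
    print(f"{label}: entropy={entropy:.3f} bits -> {guess}")
```

A single threshold is unlikely to be sufficient in practice, since rarely issued
queries have too few clicks for the entropy estimate to be reliable.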
Lu et al. (2006) later confirmed and extended Lee’s work. They examined the features
of the click-through data produced by informational and navigational queries (whose
types were pre-defined in a training set) over a period of time. They discovered that
navigational queries show more stable temporal features than informational queries,
with less diversity in the click-through data: the top pages clicked by users as a
result of these queries are unlikely to differ much over time. This means that, when
repeated by one or more users, a navigational query produces a smaller set of clicked
URLs, concentrated on only a few most-clicked ones.
Teevan et al. (2006, 2007) found that navigational queries tend to be repeated with
one or two frequently repeated clicks. Using this feature, combined with other
criteria, they identified 12% of all queries as navigational. They also found that
navigational queries tend to be repeated more often than others and at longer time
intervals. They suggested that, based on these features, navigational query behavior
is particularly easy to predict.
Later work, such as Asur and Buehrer (2009), identified the different temporal
patterns exhibited by navigational queries and news queries. They found that, while
most news-oriented queries tend to occur in bursts, restricted to only a few time
intervals between two repeated events, navigational queries occur more frequently
without showing any strong pattern over a short period of time.
Although the above studies focused on the temporal features of both informational and
navigational queries, they did not specify over how long these queries are repeated,
nor did they examine the periodic features of such behavior.
Sanderson and Dumais (2006), extending the earlier work of Teevan et al. (2004),
examined the temporal features of repeated queries and clicks over a period of three
months in 2006. They measured the time interval between paired repetition events, and
identified a dominant 7-day periodicity from daily analysis and a 24-hour periodicity
from hourly analysis. They also separated the repetition pattern of navigational
queries from the rest, and found that navigational queries are more likely to be
repeated at longer time intervals, whereas the remaining queries, most of which are
information-seeking, tend to be repeated in close temporal proximity.
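The interval measurement described above can be sketched as follows: group log entries
by user and query, sort each group by time, and count the gaps between successive
submissions in whole days; a peak at 7 (and at multiples of 7) in the resulting
histogram would indicate a weekly periodicity. This is a minimal sketch under stated
assumptions: the tuple layout, the function name and the sample events are invented,
and pairing only consecutive submissions is one possible reading of "paired repetition
events", not necessarily the pairing used by Sanderson and Dumais.

```python
from collections import Counter, defaultdict
from datetime import datetime


def repetition_intervals_in_days(events):
    """Histogram of whole-day gaps between consecutive repeats.

    events: iterable of (user_id, query, timestamp) tuples.
    Returns a Counter mapping gap in days -> number of repetition pairs.
    """
    by_user_query = defaultdict(list)
    for user_id, query, timestamp in events:
        by_user_query[(user_id, query)].append(timestamp)

    histogram = Counter()
    for stamps in by_user_query.values():
        stamps.sort()
        # Pair each submission with the next one of the same user and query.
        for earlier, later in zip(stamps, stamps[1:]):
            histogram[(later - earlier).days] += 1
    return histogram


# Invented events for illustration only.
events = [
    ("u1", "bbc", datetime(2006, 5, 1, 9, 0)),
    ("u1", "bbc", datetime(2006, 5, 8, 9, 30)),
    ("u1", "bbc", datetime(2006, 5, 15, 10, 0)),
    ("u2", "weather", datetime(2006, 5, 1, 8, 0)),
    ("u2", "weather", datetime(2006, 5, 2, 8, 5)),
    ("u2", "bbc", datetime(2006, 5, 3, 12, 0)),  # issued once, so no interval
]
histogram = repetition_intervals_in_days(events)
for gap_days in sorted(histogram):
    print(f"{gap_days:2d} day(s): {histogram[gap_days]}")
```

Binning the same gaps by hour instead of by day would give the corresponding view for
a 24-hour periodicity.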
2.6 USER VARIANCE FROM GENERAL REPETITION
PATTERN
Some recent works have explored personalization opportunities by examining user
variance from the general repetition pattern.
Dou et al. (2007) explored the relation between query frequency and repetition
frequency by experimenting with short-term and long-term caching strategies separately
on unique queries and repeated queries. By comparing the hit rates of the different
groups, they found that long-term caching improves the chance of predicting a click
from previous query-click pairs as the number of searches by a user grows; but after a
certain query frequency (70 queries), users tend to submit more new queries, which
means that less repetition is observed.
In later work, Dou et al. (2008) tried to confirm and extend their previous finding by
analyzing large query logs. They showed that frequent searchers have a different
repetition pattern from low-frequency users, and that the repetition rate stabilizes
at a certain level beyond some query frequency. Their findings suggest that query
repetition is observed less often beyond a certain distance, measured as the number of
intervening queries.
These findings accord with those of Wedig and Madani (2006), who found that a user’s
interests diverge from the general population after the user has issued more than 100
queries. This means that fewer repeated queries are shared among users after a certain
period.
Exploring user variance in query repetition is therefore important for deciding how
far past queries can be exploited to predict repeated queries or clicks.
2.7 APPLICATIONS AND IMPLICATIONS
- Query expending based on term level repetition examining
The query repetition carried out at term level holds great promise for query
suggesting. It was found out by Xie and O’Halloran (2002) that, although users
tend to user different queries, they tend to use a small number of words to form
the queries. This has indicated that: exploring the repeated terms will be more
cost-effective than trying to do it on query level. And based on the often
repeated terms, query suggestions could be made. Also the term level repetition
analysis will be able to be used for query clustering; for query expending etc.
- Repetition periodicity used for re-ranking strategies
Re-ranking results according to past query history is not a new route of
study. However, previous studies have shown that a change of rank hinders
the process of re-finding and also reduces the chance of a click
(Teevan et al, 2007). This finding suggests that re-ranking of results would
better happen after the longest period within which a repeated click would
occur.
- Repetition periodicity for predicting repeated clicks
Teevan et al (2007) discussed the possibility of using repetition
periodicity to predict the occurrence of the next repetition. This also aroused the
interest of Xie and O’Halloran (2002), who proposed that repetition
features could be used to predict the likelihood of a query being issued again.
However, as discussed in the previous section, prediction based
on periodic features is left for future study.
- Repetition patterns for identifying semantically related queries
In the work of Vlachos et al (2004, 2005, 2010), the pattern in which a query is
repeated over time, regardless of user, was used to identify semantically similar
queries. This is based on the finding that semantically related queries tend to
have the same query demand over time. Zhao et al (2006) tried to use
click-through data as a way to measure the similarity between queries.
- Repetition periodicity to aid information re-finding
Re-finding behavior is closely related to query repetition behavior: as shown
by Teevan (2007), people’s re-finding of a certain website relies
largely on re-using a search engine, which can then be identified by
search log analysis.
- Repetition for personalization
Personalization strategy is based on the assumption that, when users resubmit a
query, their selections of results tend to be highly consistent with their previous
ones (Smyth et al, 2004). Exploiting the results of past queries enables the
search engine to gather a collection of possible choices for the user to choose from if
the same query is submitted again. Personalization strategies which are based on
past repeated clicks can be very effective (Dou et al, 2007).
- Repetition for query classification
At the beginning, classification was mainly done manually. Broder
(2002) defined queries as informational, navigational or transactional and
manually classified 200 queries by studying an online survey of AltaVista users;
Beitzel et al (2003) categorized search log queries as navigational by
matching them to a list generated from the titles of web directories.
Later works have been able to classify queries automatically. Lee et al (2003)
tried to classify queries automatically by comparing navigational and
informational queries; Yates et al (2006) used machine learning to classify
queries as informational and non-informational.
Jansen et al (2007) provided, for each category, a series of characteristics
identified from a qualitative study of a sample query log, which enabled
automatic classification based on those criteria; Teevan et al (2007) discussed the
criteria for identifying navigational queries based on an examination of re-finding
behavior.
Broder et al. (2007) later used the text of the top result pages to judge the query
intent of the user. They found that this method performs much better than the previous
method, which used only the query string. Beitzel et al. (2005) performed
semi-supervised learning on query logs to classify queries into topical
categories, and also used training data which was annotated manually
beforehand. Some work have used the rank to
After reviewing the related work on query repetition behavior, the
framework for this study takes shape. The next chapter discusses the methodology of
this study.
3 METHODOLOGY
A search log from a web-based commercial search engine was collected as the data to
be analyzed in this study. This chapter discusses how search log analysis (SLA) was
used to analyze search engine user behavior. First, search log analysis is briefly
introduced as a methodology, including its theoretical foundations and the
issues related to the process. The process of SLA is then described in detail, including
data collection and data preparation. Since the data analysis process is quite important, it
is separated from this chapter and described in detail in the next chapter. The
outline of this chapter is:
- Search log analysis as method
- Related issues in the process
- Data collection
- Data preparation
3.1 SEARCH LOG ANALYSIS AS METHOD
Compared to the large number of findings yielded by SLA, the
methodological significance of this method has never been fully addressed.
However, before adopting a method in a study, one should always make sure that
the results generated by applying the method will lead to
conclusions which support the research objectives. In order to build a
link between the method and the study objectives, a brief discussion of the
theoretical foundations that serve as the basis of this study is given
first.
3.1.1 Search Log
Two main kinds of log are often studied: client-side logs and server-side
logs. A client-side log keeps track of the user’s interaction with the web browser and is
often used in web browsing studies, while a server-side log records the user’s
search engine usage and is often used to study web searching behavior. The log
analyzed in this study is a search engine log captured by software at
the server side.
Search engine log, often referred to as search log:
“….is an electronic data which keep down the interaction between a web searcher
and the search engine being used during web searching process…”
-- (Jansen, 2009)
The interaction between the searcher and the search engine includes the activities
of both the user and the search engine. The user activities captured include submitting
a query, clicking on one of the results, requesting the next page, and
returning to the search result page. The search engine activities logged include the
returning and ranking of results. The data contained in the search log, including both
query data and click-through data, are analyzed by researchers for different
research purposes. For example, the user’s query can be used to infer the underlying
information need, and the combination of query and click-through data can serve as
implicit feedback on result relevance.
3.1.2 Theoretical Foundation
3.1.2.1 Behaviorism
User study has always been an important area of research in web searching. For
search engine users, the behaviors observed during their searching
process are a mechanical expression of underlying information needs or
motivations (Otsuka et al, 2004). Most of the time, a user’s information needs are
expressed in the form of search queries or the URLs clicked. When
a specific searching pattern is identified from the behaviors exhibited by a group of
users, the general features of that user group can be summarized. There are two
reasons to study user behavior. First, the user’s needs and motivations
behind the behavior are very important for the service provider, since better
service can be provided on that basis; second, the user’s reaction to the service provided
can serve as an implicit snapshot of his/her perception of that service.
3.1.2.2 Historical Data Re-using
Six years ago, Smyth et al. (2004) made two basic assumptions about web searching.
They assumed that the world of web searching is a place where similar
queries tend to re-occur and similar results tend to be selected again. These assumptions
have changed the search engine world dramatically.
Based on these assumptions, historical queries and clicks have a large chance of
being re-used, so they carry useful implications for query expansion,
result caching, and user profile building, which were shown to be more effective
than the previous content-based methods in dealing with vague queries. In addition, both
query repetition and selection regularity can serve as implicit feedback from users.
Whether a query-click pair is repeated by one user or by many, it
is taken to be a sign of high relevance.
3.1.2.3 Web logging
Web users’ activities are widely logged in today’s web. On the one hand, web
usage, including web browsing and web searching, has become more and more
popular in recent years, and the increasing use of web services has created a large
amount of user-system interaction. On the other hand, the need to understand user
behavior is growing as the number of web users increases at a rapid pace, and
attention is now focused on how to provide better service to those people.
Web logging, which makes the study of user behavior contained in historical data
possible, is therefore widely carried out nowadays. It captures web search
engine user activity by keeping the search history data. Keeping a log of web
usage is mainly motivated by two purposes: the log, which captures the process of
web activities, can be used to understand web user behavior and thus
provide insight into the needs and motivations lying behind it, and it can be
used to predict future events. In short, as one of the premises of
search log analysis, the logging of the web searching process is very useful and
necessary for search engine improvement as well as for identifying user trends.
3.1.3 Possible Issues Related to the Process
The standard process of search log analysis includes three main steps: data
collection, data preparation and data analysis. As summarized by Jansen
(2009):
- Data collection: the process of collecting the interaction data for a period of
time from a web search engine;
- Data preparation: the process of cleaning and preparing the log for further
analysis;
- Data analysis: the process of analyzing the prepared data;
These processes give rise to a few problems, which have been
addressed by many previous works. Generally, the discussion revolves around the
following issues, each specific to one of the processes.
In recent years, privacy issues have come to the center of public
attention, which has created obstacles to collecting search logs for research purposes.
On the other side, however, both the academic and the commercial world are calling
for more access to those logs and suggesting the building of a centralized search
log database (Clough, 2009). Some recent works trying to anonymize logs to prevent
user tracing have yielded some achievements. However, the issues related to log
collection still call for further attention.
Data preparation and analysis are the major sources of the
inconsistency that exists in today’s SLA studies. Although metrics and frameworks
have been developed to standardize the processes and make
results exchangeable, the terms adopted and the levels of analysis are hard to keep
the same across studies because of different perceptions and research
objectives. This makes it necessary to define the analysis metrics for each study.
In spite of the problems mentioned above, SLA is still the dominant method
adopted in web searching studies. Its scalability and ease of data
collection still cannot be matched by any experimental or questionnaire-based
method.
3.2 DATA COLLECTION
The query log examined was collected from a public commercial search engine which is
one of the most popular search engines in the US. The collection consists of more than
3,600,000 search queries submitted by 658,000 users during the three months between
March and May 2006. The log data was stored in an ASCII text file of over
2GB in size. Queries containing pornographic content were not removed, since
the specific information contained in a query is not of interest in the analysis.
No personal information was contained in the log: user IP addresses had been removed,
and each user was identified by a unique system-generated number.
3.2.1 Data Source
Logs from this search engine have been studied a lot in the past. As it is a
search engine used by people around the world, some problems should be noted.
First, there might be non-English queries among those submitted. However, given the
relatively small number of non-English queries and the fact that this study is
independent of query content, this issue can be ignored.
Another problem is a general issue faced by all search log analysis, especially when
the query time field is part of the analysis. The server-side software captures only the
local server time upon reception of a query launched by people around the world;
therefore the time contained in the query log may not faithfully reflect the local time
at which the user submitted the query. Works that are
based on absolute recorded time, especially studies trying to identify users’
searching behavior at different times of the day, should treat their results with care.
In order to be free from the effect of time zone differences, this study uses a
distance-based method (detailed in a later section), which is independent of absolute time.
3.2.2 Standard Field in Search Log
A framework was provided by Jansen & Pooch (2001) in order to enable
communication and comparison between results. As defined in that framework,
a standard search log should contain the user ID, the date, the time and the search
URL. The log studied here contains the following fields:
- Anonymous user ID: a system-generated unique identifier used to distinguish
different users, based on the user’s IP address, which was removed before
analysis
- Search query: the query that is entered during user’s interaction with search
engine
- Query time: the time recorded by the server side software upon reception of the
query
- Item rank: the rank of the clicked result in the result page
- Click URL: the actual page URL of the clicked result
3.2.3 Privacy Issue and Offensive Information
The privacy issue related to the data is worth mentioning. Although the IP addresses
of the users were removed beforehand, a user can still be tracked by the user ID assigned
by the system. The tracking of a single user should be done very carefully, since the
click-through data can sometimes reveal the actual identity of the user. Some researchers
once correlated the relevant information and matched searching activities to the
exact people who had used the search engine to log into their email box (which
may contain the name in the address), or even Facebook. There may also be offensive
information contained in the log data, which was not removed in order to keep
the data original.
In this study, the information contained in the clicked URLs was of no
interest to the research, so the analysis performed on the data does not give rise to
the problems stated above.
3.3 PLATFORM AND TOOLS
Regarding the choice of tools and platform, only a few previous studies
have mentioned the tools they used to facilitate log analysis. Jansen (2009)
addressed this issue in his handbook, published last year, by presenting several tools
that can be used to support SLA. He compared the two most widely adopted
methods: using a relational database or using text-processing scripts. However, no
comparison of the effectiveness and ease of use of the two methods has
been made before in academic research.
Existing tools for log analysis are widely used by businesses to generate
general reports on the traffic of their websites (e.g. Google Analytics). However, those
tools have limited ability to perform analyses defined by research goals and cannot
meet the needs of in-depth academic research.
A combination of a text-processing language (most often Perl) and a log file (such
as a .txt file) is commonly used to perform the analysis. This method requires a good
command of the programming language. It should also be noted that the algorithm is very
important in determining the efficiency of the analysis process. For very large log
data, a badly designed algorithm can be very time-consuming (sometimes taking more
than 20 hours).
The other widely adopted method is to import the log data into a relational database (in
most cases Microsoft Access, Microsoft SQL Server or MySQL), which can then be
queried with SQL. Manipulating log data in a database is relatively
easier and more effective than the script-based approach mentioned above. However, some
databases may not be able to accommodate data over 2GB, so for very large logs the
choices are narrowed down to only a few.
In this study, all data analysis was carried out on an Oracle 11g database
installed on a Windows XP operating system with 4GB memory, Core™ 2 processor,
6MB random-access memory, 12 threads. The PC used was sufficient to meet
the computing requirements, taking no more than 2-3 hours in total to run all the code.
SQL was used to query the database in order to generate basic statistics as
well as to manipulate the data by correlating, grouping, cross-referencing, counting,
etc., and to generate views showing the data in aggregation or correlation. The
key steps of the whole log analysis process, including the steps for data cleaning and
data analysis, are given in the appendix.
3.4 DATA PREPARATION
3.4.1 Data Importing
The data has to be imported into a relational database before it can be further
analyzed. An Oracle database was used here as the platform to store the data in a way
that allows it to be manipulated more easily. SQL*Loader was used to load the
2.12GB of data into a table that had been created beforehand, with field names and data
types set accordingly.
One thing worth mentioning here is that the time cost of the loading process
varied when different tools and methods were used. Both the size of the data and the
original format of the file should be considered when choosing the tool. The
relatively large data set in this experiment (over 2GB) exceeded the largest
size that ordinary software can deal with, so with such tools the original data has to be
partitioned, at the price of efficiency. In this situation, the difference was quite obvious:
SQL*Loader took less than 20 minutes to complete the task, while other tools took
more than 14 hours to do the same job. The result of this step is shown in Table 1.
3.4.2 Remove Corrupted Data
Search logs may contain corrupted data, which can arise for many reasons.
Removing such data is usually carried out before data analysis. One basic problem in
this process is that, for a large data set, it is impossible to identify and remove such
data manually. One method provided by Jansen (2009) is to sort the data by
its key fields so that the abnormal data appears together either at the top
or at the bottom of the table. Some studies choose to ignore corrupted data, since
it may be very small relative to the overall data set
and have little effect on the final results. In this study, initial observation showed
that there were many queries containing only a dot in the query column. In some cases
this would not be a problem. However, since the later query classification is
mainly based on string matching, those records were removed at this stage to avoid
matching them all the way through classification. A simple process was
conducted to remove them. Records containing an empty query or a query without
any letter or number (containing only symbols and markers) were considered invalid
data to be removed beforehand. The result of this step is shown in Table 1.
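The removal rule described above can be illustrated with a minimal Python sketch. It is not the SQL used in this study (which is given in the appendix); the record layout and the names is_valid_query and remove_corrupted are assumptions made only for illustration. A query is kept when it contains at least one letter or digit.

```python
def is_valid_query(query):
    """Return True if the query is non-empty and has at least one letter or digit."""
    if not query:
        return False
    # str.isalnum() is True for letters and digits (including non-English ones)
    return any(ch.isalnum() for ch in query)


def remove_corrupted(records):
    """Keep only the records whose 'query' field passes is_valid_query."""
    return [r for r in records if is_valid_query(r.get("query"))]


# Hypothetical sample records; only the 'query' field matters here.
sample = [
    {"user_id": 1, "query": "sheffield university"},
    {"user_id": 2, "query": "."},
    {"user_id": 3, "query": ""},
    {"user_id": 4, "query": "- -"},
    {"user_id": 5, "query": "mp3"},
]
cleaned = remove_corrupted(sample)
print([r["user_id"] for r in cleaned])
```

Checking for "any letter or digit" rather than listing the unwanted symbols keeps the rule short and avoids having to enumerate every possible marker character.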
3.4.3 Abnormal User Identification
Sometimes abnormal user behaviors can be identified in the query
stream. These abnormal users tend to appear in bursts, that is, a
more-than-usual number of queries submitted within a shorter-than-usual time
span. Some of them may be search agents. Sometimes it could be an attack, if the
time over which they appear is rather short.
To identify these users, we observed both user frequency and user active
time. Users with a very high query frequency within a short period of time
were considered abnormal users that should be removed.
Robots that act like normal users were ignored in this study. If a
robot tries to act as a human, its behavior could be very close to that of a real human
user and could at least be used to represent human behavior. Also, the identification of
non-human users is not simple and could be a research topic in itself. As
Silverstein et al (1999) once pointed out, there is no way to totally distinguish human
users from non-human ones. So, in order to keep the log data as original as possible, we
only separated out the abnormal users whose activity could affect the observation of
normal users. In this study, we used a time-based user frequency distribution to identify
abnormal behaviors. The active time was calculated for all users and the user
frequency was plotted as a function of time. Using this method, we could identify
users who leave large trails of queries during a fast-come, fast-go visit. The result is
shown in Figure 1 and Table 1.
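As a rough illustration of this step, the Python sketch below computes the active time and query count of each user and flags those with many queries in a short active time. In this study the abnormal users were identified by inspecting the plotted distribution, so the two thresholds used here (min_queries and max_active_hours) are hypothetical values chosen only to make the example concrete, as is the function name find_burst_users.

```python
from datetime import datetime, timedelta


def find_burst_users(events, min_queries=500, max_active_hours=24.0):
    """Flag users with at least min_queries queries within max_active_hours.

    events: iterable of (user_id, timestamp) pairs.
    Both thresholds are illustrative assumptions, not values from the study.
    """
    first, last, count = {}, {}, {}
    for user_id, ts in events:
        count[user_id] = count.get(user_id, 0) + 1
        if user_id not in first or ts < first[user_id]:
            first[user_id] = ts
        if user_id not in last or ts > last[user_id]:
            last[user_id] = ts

    flagged = set()
    for user_id, n in count.items():
        # Active time: span between the user's first and last logged query
        active_hours = (last[user_id] - first[user_id]).total_seconds() / 3600.0
        if n >= min_queries and active_hours <= max_active_hours:
            flagged.add(user_id)
    return flagged


# Hypothetical data: one burst user and one ordinary weekly user.
start = datetime(2006, 3, 1, 12, 0, 0)
demo_events = [("bot", start + timedelta(seconds=i)) for i in range(600)]
demo_events += [("human", start + timedelta(days=d)) for d in (0, 7, 14)]
flagged = find_burst_users(demo_events)
print(flagged)
```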
3.4.4 Group Query Episode
On the search engine server side, a user’s request for the next page or a click on another
result is logged as a separate query, assigned a different time or even the
same time when the time difference is too small to register. Subsequent queries from the
same user that are identical to the previous one(s) are referred to as identical queries.
The logging of those subsequent queries clutters the query stream with many
query events close in time (or even exact duplicates) which are triggered by only
one query entry. This study is not interested in repeated queries that happen within
one hour, and it takes the side of modeling user behavior rather than evaluating search
engine performance. Those identical queries are not trivial in number and would
inflate the number of repeated queries and thereby distort the distribution of repetition
frequency, so they should be grouped into query episodes starting from the user’s
initial query.
In this study, a query episode is considered to be a period of continuous interaction
with the search engine under one query submitted by a single user. Such an episode
can be constructed by grouping the continuous query events that follow the initial
query from the same user at a time interval smaller than a certain period of time. The
grouping of query episodes is less discussed in previous studies than the grouping of
user sessions. Some past works treated a subsequent query submitted to the
search engine as a new query, or simply removed the repeated records from the original
data. Teevan et al. (2006) grouped all queries with the same query string that occurred
within thirty minutes. Jansen (2009) mentioned a method of grouping such episodes by
removing the repeated queries and extracting the first one submitted. He
& Göker (2001) addressed the issue by defining a web search period as a set of
continuous queries by a user with no more than a certain time limit from one query to
the next. They also suggested that little difference was observed between using
15 minutes and 60 minutes as the threshold. Thus we used a value of 30 minutes as the
cut-off to identify continuous queries: identical queries from the same user
launched at an interval of no more than 30 minutes were considered to be within one
query episode. This step also removed repeated records. The results were
saved as a view, with reference to the original table in order to fetch the multiple
results clicked within an episode. A query episode can be represented as <Query, click
URL>. The number of grouped episodes is presented in Table 1.
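The episode grouping can be sketched in Python as follows. This is only an illustration under stated assumptions, not the view definition used in the study: the 30-minute limit is applied to the gap between consecutive events of the same user and query, the event tuple layout is hypothetical, and the name group_episodes is introduced here for the example.

```python
from datetime import datetime, timedelta

EPISODE_GAP = timedelta(minutes=30)  # threshold taken from the text above


def group_episodes(events, gap=EPISODE_GAP):
    """Group log events into query episodes.

    events: list of (user_id, query, timestamp, click_url_or_None).
    Assumption: an event joins the current episode when it comes from the
    same user with the same query no more than `gap` after the previous event.
    """
    episodes = []
    current = {}  # (user_id, query) -> most recent episode for that pair
    for user_id, query, ts, url in sorted(events, key=lambda e: (e[0], e[1], e[2])):
        key = (user_id, query)
        ep = current.get(key)
        if ep is None or ts - ep["last_seen"] > gap:
            ep = {"user_id": user_id, "query": query,
                  "start": ts, "last_seen": ts, "clicks": []}
            episodes.append(ep)
            current[key] = ep
        ep["last_seen"] = ts
        # Exact duplicate records collapse because each URL is stored once
        if url and url not in ep["clicks"]:
            ep["clicks"].append(url)
    return episodes


# Hypothetical events: three records close in time, one on the next day.
t0 = datetime(2006, 3, 1, 9, 0)
demo = [
    (1, "bbc news", t0, "bbc.co.uk"),
    (1, "bbc news", t0 + timedelta(minutes=10), "news.bbc.co.uk"),
    (1, "bbc news", t0 + timedelta(minutes=10), "news.bbc.co.uk"),
    (1, "bbc news", t0 + timedelta(days=1), "bbc.co.uk"),
]
episodes = group_episodes(demo)
print(len(episodes), episodes[0]["clicks"])
```

Each resulting episode keeps the start time of the initial query together with the list of clicked URLs, which corresponds to the <Query, click URL> representation.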
3.4.5 Query Classification
As defined by Broder (2002), a navigational query is a query
which the user uses to locate the home page or website the user has in mind. An
informational query, in turn, is a query which the user uses to gather information from
relevant web pages containing information on a particular topic. Based on this
classification, many works have been able to further identify the features of different
searching behaviors based on different underlying user intentions.
This study takes a special interest in examining the different patterns in which
different types of query are repeated by users. Previous work has identified differences
in the way navigational and non-navigational queries are repeated by users
(Sanderson and Dumais, 2006). This suggests that general query features may
not apply to all queries in each category. Classifying the queries by task is therefore
useful for examining the features of each type independently.
The classification of queries according to user intent has always been a tough task,
since it involves human judgment and inference, which in the past were basically done
manually and applied only to small query data sets. Moreover, manual classification
of queries can be rather inefficient and expensive, and it is not workable for a very
large set of query data.
Algorithms have been developed to separate queries automatically. The
characteristics of each query type were identified earlier by Jansen et al
(2007). They gave criteria for each query type (in their case, queries were broken down
into three categories: navigational, informational and transactional) in order to
classify them accordingly. Lee et al. (2005) also provided criteria for classifying
navigational queries, and Teevan et al. (2006) defined navigational
queries in detail as well. Generally, these works use the following criteria.
Navigational Query
- Repeated equal-query queries, meaning that only one or two results were clicked
each time
- Queries whose viewed results are ranked higher than usual
- Queries containing a full or partial URL
- Queries containing a web site or company name
- Queries repeated more often than usual over a long period
Informational Query
- Queries using question words (what/how/when/where/who)
- Queries beyond the first query submitted
- Queries where the searcher viewed multiple results pages
- Queries longer than 2 words
Some of the criteria are vague and cannot be used independently to judge the
query type, such as picking out company names or judging by query
length; although some works have proposed a cut-off length, the figure is based
on specific training data and may not apply to all logs. Some criteria are quite strict
and their precision is quite high, such as selecting the queries which contain
a URL: all queries aimed at a pre-known website are, according to the definition of
navigational query, considered navigational. Also, some of the time-related
criteria could affect the results of this study; for example, the criterion ‘query being
repeated more often over a longer period’ is itself one of our study objectives. Based
on the above considerations, the following steps were carried out to identify the query
type.
1. Repeated queries (submitted at least twice, regardless of user) with only 1-2 clicks
each time are marked as N;
2. Queries marked N whose item rank is beyond 10 are re-marked as I;
3. Of the remaining unmarked queries, those containing a full or partial URL
are marked as N;
4. The remaining queries not yet marked I or N are matched against a selected
list of the most searched company/website names; those that match are marked as N
and the rest are marked as I.
The navigational query list was later checked by inspecting randomly
generated query samples. It was found that 83% of the sampled queries
were navigational. The sample query list is provided in the Appendix.
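The four marking steps can be summarized in the following Python sketch. It is a simplified illustration rather than the procedure actually run on the database: the episode layout (a query string plus the list of ranks clicked in that episode), the URL pattern URL_HINT, the name classify_queries and the known_sites list are all assumptions introduced for the example, and "1-2 clicks each time" is read as one or two clicks in every episode of the query.

```python
import re
from collections import defaultdict

# Hypothetical pattern for "full or partial URL"; the study's exact rule is not given.
URL_HINT = re.compile(r"(www\.|https?://|\.(com|org|net|edu|gov)\b)", re.IGNORECASE)


def classify_queries(episodes, known_sites):
    """Label each distinct query 'N' (navigational) or 'I' (informational).

    episodes: list of {"query": str, "ranks": [int, ...]} (ranks of clicked results).
    known_sites: set of lower-case company/website names (assumed list).
    """
    by_query = defaultdict(list)
    for ep in episodes:
        by_query[ep["query"]].append(ep)

    labels = {}
    for query, eps in by_query.items():
        # Step 1: repeated at least twice, with 1-2 clicks each time
        if len(eps) >= 2 and all(1 <= len(ep["ranks"]) <= 2 for ep in eps):
            labels[query] = "N"
            # Step 2: a clicked rank beyond 10 turns the mark into I
            if any(rank > 10 for ep in eps for rank in ep["ranks"]):
                labels[query] = "I"

    for query in by_query:
        if query in labels:
            continue
        if URL_HINT.search(query):            # Step 3: full or partial URL
            labels[query] = "N"
        elif query.lower() in known_sites:    # Step 4: known site name
            labels[query] = "N"
        else:                                 # Step 4: everything else
            labels[query] = "I"
    return labels


# Hypothetical episodes covering each step.
demo_episodes = [
    {"query": "ebay", "ranks": [1]},
    {"query": "ebay", "ranks": [1]},
    {"query": "cheap flights", "ranks": [12]},
    {"query": "cheap flights", "ranks": [3]},
    {"query": "www.bbc.co.uk", "ranks": [1]},
    {"query": "amazon", "ranks": [1]},
    {"query": "how to bake bread", "ranks": [2, 5, 7]},
]
labels = classify_queries(demo_episodes, known_sites={"amazon"})
print(labels)
```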
3.5 METRIC FOR ANALYZING
The importance of developing the analysis metric beforehand has already been
discussed in the previous section. Both the analysis level and the key terms used in
the analysis should be determined at this stage.
The analyses in this study were carried out at query level, examining repetition
in both queries and clicks. The definitions of the basic terms, which follow
Jansen (2009), are listed below:
- A query: a query episode (grouped as continuous interaction with the search engine
under one query issued by one user within one hour), in which one query
results in zero or several clicks
- A repeated query: a query submitted more than once, regardless of user
- A click: a URL returned by a query
- A repeated click: a click from the same user, regardless of the query it came from
- Navigational query: a query aimed at a pre-assumed website or online
source, which is in the navigational list identified above
- Repeated navigational query: a query in the above list which has been
submitted twice or more
- Informational query: a query which is information-seeking; in this
study, a query which is not in the navigational query list
- Repeated informational query: an informational query
which has been submitted twice or more
After defining the analysis metric, the analysis can be carried out accordingly,
aimed at achieving the research objectives. In the next chapter, the analysis
process is described in detail, together with the necessary explanations of the
methods used.
4 DATA ANALYSIS
In order to describe the process in detail, this part is separated from the main
methodology to become a new chapter. The sections included in this chapter are:
- Process design
- Special method
- Overall repetition examination
- Temporal repetition rate examination
- Repetition periodicity examination
- Rank first convert to change
- Result summary
- Result discussion
4.1 PROCESS DESIGN
Based on the research aims of this study, the analysis process was partitioned into
four parts.
- Overall repetition examination: this step identifies the existence of
query/click repetition in the log, on which the rest of the analyses are
based.
- Temporal repetition rate examination: this step examines the daily
repetition rate and also the temporal user variance in repetition rate.
- Query repetition periodicity examination: this step examines the
periodicity of query repetition behavior in general, by query type and by
user variance.
- First convert to change examination: this step was carried out to provide
possible implications for re-ranking strategy based on the previously identified
periodicity.
4.2 SPECIAL METHOD
Several special methods were used in the study, which are
briefly introduced here in advance.
4.2.1 Time-Based Frequency
In this study, in order to examine user variance in query repetition, it was
necessary to classify users according to their querying frequency. Although the
word ‘frequency’ has been used a lot in previous studies, most of those studies
only captured ‘frequency’ by static observation of total occurrences. It would
be wrong to measure user frequency simply by counting the queries
submitted by the user. Users enter the log at different times: some
appear at the beginning while others appear near the end of the three-month
period. Since the search log captures only a snapshot of the query traffic, the activity
of later users is cut off by the end of the log. In this study, user frequency was
therefore measured as the total number of queries by a user divided by the time
interval from the user’s first appearance to the end of the log. Measured this way,
user frequency represents the number of queries launched by a user per day, which is a
better way to describe the regularity of a user.
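A minimal Python sketch of this measure is given below. The function name user_frequency, the end date of the log and the one-day lower bound on the observation interval (which avoids division by a near-zero interval for users first seen on the last day) are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def user_frequency(query_times, log_end):
    """Queries per day between the user's first appearance and the end of the log.

    query_times: list of timestamps of one user's queries.
    log_end: end of the logging period (assumed value in the example below).
    The interval is floored at one day, which is an assumption of this sketch.
    """
    first_seen = min(query_times)
    days_observed = max((log_end - first_seen).total_seconds() / 86400.0, 1.0)
    return len(query_times) / days_observed


LOG_END = datetime(2006, 6, 1)  # hypothetical end of the three-month log

# An early user with one query a day, and a late user with 10 queries on one day.
early_user = [datetime(2006, 3, 1) + timedelta(days=i) for i in range(92)]
late_user = [datetime(2006, 5, 22) + timedelta(hours=i) for i in range(10)]

early_freq = user_frequency(early_user, LOG_END)
late_freq = user_frequency(late_user, LOG_END)
print(early_freq, late_freq)
```

In this example the raw query counts differ widely (92 against 10), while the time-based frequency of the two users is the same, since the late user has been observable for only 10 days.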
4.2.2 Repetition Distance
In order to identify the repetition periodicity, this study used a distance-based
analysis to find the dominant time interval between repeated events. The repetition
distance is measured as the number of days between consecutive log events, which
were grouped and ordered beforehand. Two ordering strategies were used. To study
the general repetition periodicity, which is independent of time order, the events
were ordered randomly. Later, to identify the first-convert-to-change time for the
repetition behavior, the events were ordered by time.
The number of occurrences of each distance is then plotted, and the most frequent
time interval between two repeated events is found by inspection. Distributions of
the distance between related events have been used in previous research to
investigate periodicity (Fagni et al, 2006). The distance can be measured as a number
of events or, as in this case, as the time interval between repeated log events.
Periodicity is usually identified from a time series plotted against time. However,
the distance-based analysis turned out to be a better method here for two reasons.
First, the time interval becomes the direct target of the analysis, so the periodicity
can be observed directly. Second, obtaining the distance by subtracting the times of
two events from the same user removes the problems caused by time zone
differences. In this work, the distance between two repetitions was measured as a
difference in days, which is independent of time order. The number of repeated
queries at each time interval was accumulated in order to observe the dominant
repetition intervals in the graph.
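The distance-based counting can be sketched as follows. This is only an illustration under stated assumptions: the field names (`user`, `query`, `day`), the `key` parameter and the sample events are hypothetical, and where the study ordered events by a random database value, the sketch simply keeps the input order unless `order_by_time` is set.

```python
from collections import Counter, defaultdict
from datetime import date


def distance_histogram(events, key, order_by_time=False):
    """Count day differences between consecutive events within each group.

    `events` is a list of dicts with at least a "day" field; `key` maps an
    event to its group, e.g. the user, or the user plus the query string.
    """
    groups = defaultdict(list)
    for event in events:
        groups[key(event)].append(event["day"])

    histogram = Counter()
    for days in groups.values():
        if order_by_time:
            days = sorted(days)
        # The absolute difference makes the distance independent of which of
        # the two events came first.
        for first, second in zip(days, days[1:]):
            histogram[abs((second - first).days)] += 1
    return histogram


events = [
    {"user": "u1", "query": "weather", "day": date(2000, 1, 1)},
    {"user": "u1", "query": "weather", "day": date(2000, 1, 8)},
    {"user": "u1", "query": "weather", "day": date(2000, 1, 15)},
    {"user": "u2", "query": "maps", "day": date(2000, 1, 3)},
    {"user": "u2", "query": "maps", "day": date(2000, 1, 4)},
]
# Group by user and query to look at repeated queries from the same user.
same_query = distance_histogram(events, key=lambda e: (e["user"], e["query"]))
```

Grouping by `e["user"]` alone would instead give the distribution of intervals between any two events of the same user, which is the kind of general baseline used later for normalization.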
4.2.3 Normalization
When analyzing a phenomenon, the results need to be treated with care. The real
world is complex, so a phenomenon can rarely be observed directly. The data being
analyzed may also be shaped by the time span it covers, the particular time at which
it was collected, and the source from which it was collected. Sanderson and Dumains
(2006) specified a way to remove both the windowing and the weekly effects
identified in their analyses, as well as the effect exerted by the underlying search
engine usage. Wedig and Madani (2006), in analyzing user persistence, likewise
noted that the time frame, which cut off their results artificially, would be
misleading if not removed. The normalization in this study was carried out using the
formula of Sanderson and Dumains (2006).
In this study, seasonal, windowing and weekly effects, among others, may be
present. One simple way to remove them all is to plot the more specific analysis
relative to the more general one. This works because the more specific problem
always inherits the attributes of the general problem, so it is safe to use the general
data to normalize the more specialized data. This rule was used for all data
normalization in this study. Following the above discussion of the methods, the
analysis steps are detailed in the rest of the chapter.
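The normalizing formula itself is not reproduced in this section, so the following is only one plausible reading of the rule above, not the formula of Sanderson and Dumains: the share of the specific distribution at each distance is divided by the share of the general distribution at the same distance. Under this assumption a value above 1 means the interval is over-represented in the specific data, which matches the later reading of the normalized graphs around y=1. The function name and the sample counts are hypothetical.

```python
def normalize(specific, general):
    """Ratio of relative frequencies at each distance (assumed normalization).

    `specific` and `general` map a distance in days to a count, for example
    the histogram of repeated-query intervals and the histogram of all
    same-user intervals.
    """
    total_specific = sum(specific.values())
    total_general = sum(general.values())
    ratios = {}
    for distance, count in specific.items():
        baseline = general.get(distance, 0)
        if baseline == 0:
            continue  # no general data at this distance, so no ratio is defined
        ratios[distance] = (count / total_specific) / (baseline / total_general)
    return ratios


# Made-up counts for three distances.
general = {1: 50, 7: 30, 14: 20}
repeated = {1: 10, 7: 8, 14: 2}
normalized = normalize(repeated, general)
```

Because both histograms are built from the same log, effects such as the shrinking number of long intervals near the end of the observation window appear in both and largely cancel in the ratio.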
4.3 OVERALL REPETITION EXAMINATION
The analysis started with an overall examination of the repetition percentage in both
the query and the click data. Distinct queries submitted more than once were
counted, distinguishing queries repeated by a single user from queries repeated
across a group of users; the repetition percentages of navigational and informational
queries were also examined. The results are shown in Table 2.
Click repetition was also examined. Clicks repeated by a single user were counted,
and repeated clicks arising from the same query and from different queries were
examined separately. The results are shown in Table 5.
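The query counts in this step can be sketched as below. The split between individual and group repetition is an assumption: a repeated query is treated as individual repetition if at least one user submitted it more than once, and as group repetition otherwise. The function name and the sample records are illustrative only.

```python
from collections import Counter


def repetition_overview(records):
    """Summarise distinct, unique and repeated queries.

    `records` is an iterable of (user_id, query) pairs, one per query episode.
    """
    records = list(records)
    per_query = Counter(query for _, query in records)
    per_user_query = Counter(records)

    repeated = {q for q, n in per_query.items() if n > 1}
    # Assumption: individual repetition means some single user repeated it.
    individual = {q for (_, q), n in per_user_query.items() if n > 1}
    group = repeated - individual
    return {
        "distinct": len(per_query),
        "unique": len(per_query) - len(repeated),
        "repeated": len(repeated),
        "individual": len(individual),
        "group": len(group),
    }


sample = [
    ("u1", "google"), ("u1", "google"), ("u2", "google"),
    ("u2", "weather"), ("u3", "weather"),
    ("u3", "pogo"),
]
overview = repetition_overview(sample)
```

With this definition the individual and group counts always add up to the number of repeated queries, as they do in Table 2.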
4.4 TEMPORAL REPETITION RATE EXAMINATION
The daily repetition rate was calculated for each day of the three-month period, so
that the temporal evolution of query repetition could be identified. All queries were
grouped by date, and the number of queries on each day was counted and plotted
over the 92 days. The same method was used to count the repeated queries on each
of the 92 days. The daily repetition rate was then calculated as the percentage of
repeated queries among all queries on that day. The repetition rate over the 92-day
period is plotted in Figure 2.
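A minimal sketch of the daily rate follows. It assumes that a query instance counts as repeated when its query string occurs more than once anywhere in the log; the text does not state this definition explicitly, so it should be read as one possible interpretation. The day index and the sample records are made up.

```python
from collections import Counter, defaultdict


def daily_repetition_rate(records):
    """Share of repeated query instances per day.

    `records` is an iterable of (day_index, query) pairs.
    """
    records = list(records)
    totals = Counter(query for _, query in records)

    per_day = defaultdict(lambda: [0, 0])  # day -> [repeated, total]
    for day, query in records:
        per_day[day][1] += 1
        if totals[query] > 1:  # assumed definition of a repeated query
            per_day[day][0] += 1
    return {day: rep / total for day, (rep, total) in sorted(per_day.items())}


sample = [
    (1, "google"), (1, "pogo"),
    (2, "google"), (2, "weather"), (2, "weather"), (2, "maps"),
]
rates = daily_repetition_rate(sample)
```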
4.4.1 User Frequency
User variance is of particular interest in this study, so user frequency was calculated
first in order to classify users into frequency groups. As introduced in the previous
section, user frequency was calculated as the total number of query instances of a
user divided by the user’s active time. The formula for the calculation is:
user frequency = (total number of query instances of the user) / (number of days
from the user’s first appearance to the end of the log)
The user frequency distribution is shown in Figure 3.
4.4.2 User Variance on Temporal Repetition Rate
Based on the user frequency identified in the last step, the study then examined user
variance in repetition behavior. Previous studies have found that users of different
frequency tend to exhibit different patterns in query repeating. In this section, the
temporal repetition rates of frequent and non-frequent users were calculated
separately and plotted in the same temporal distribution diagram (Figure 5) to
facilitate comparison.
4.5 REPETITION PERIODICITY IDENTIFICATION
This step further identifies the repetition periodicity. The distance-based measure
introduced earlier was used to reveal the dominant periodicity in the query log.
4.5.1 General query repetition periodicity
Following the definition given above, we first calculated the time interval between
every two consecutive log activities from the same user, which serves as an analysis
of the user’s overall search engine usage. Queries issued by the same user were
ordered by a random value generated by the database. The number of days between
two consecutive query events was calculated as the day difference d; the number of
occurrences of each value of d was then counted and plotted as a function of d. The
result is shown in Figure 6. This data was later used to normalize the data generated
by applying the same steps to repeated instances of the same query from the same
user.
The same method was then used to generate the distribution of the time intervals
between repeated queries from the same user; this data was used to produce the
diagram in Figure 7. To remove possible effects exerted by other factors, the study
adopted the normalizing formula of Sanderson and Dumains (2006) and normalized
this distribution with the data generated in the previous step. The normalized view is
shown in Figure 8.
4.5.2 Query Repetition Periodicity Examined By Type
Using the same method, the time interval distributions of navigational and
informational queries were analyzed. The queries marked as I and as N from the
same user were randomly ordered, and the distributions of query time intervals were
plotted and normalized by the result generated in the previous section. The
normalized results are shown in Figures 9 and 10.
4.5.3 User Variance on Repetition Periodicity
Then, we grouped all queries by user and calculated the average time interval
between two repeated queries from the same user. User frequency was then plotted
as a function of the user’s average time interval between two repeated queries. We
use this to describe the different repetition periodicity of users of different
frequency. The result is shown in Figure 11.
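The per-user average interval can be sketched as follows. The record layout (user, query, integer day index) is an assumption, as is the choice to take gaps between time-ordered consecutive submissions of the same query and to leave out users who never repeat a query.

```python
from collections import defaultdict
from statistics import mean


def average_repeat_interval(records):
    """Average number of days between repeated queries, per user.

    `records` is an iterable of (user_id, query, day_index) triples.
    Users with no repeated query do not appear in the result.
    """
    days_by_user_query = defaultdict(list)
    for user, query, day in records:
        days_by_user_query[(user, query)].append(day)

    gaps_by_user = defaultdict(list)
    for (user, _), days in days_by_user_query.items():
        days.sort()
        for earlier, later in zip(days, days[1:]):
            gaps_by_user[user].append(later - earlier)
    return {user: mean(gaps) for user, gaps in gaps_by_user.items()}


sample = [
    ("u1", "weather", 1), ("u1", "weather", 8),
    ("u1", "maps", 2), ("u1", "maps", 5),
    ("u2", "pogo", 3),
]
intervals = average_repeat_interval(sample)
```

Pairing each user's value with that user's frequency gives the points of the kind of scatter described for Figure 11.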
4.6 RANK FIRST CONVERT TO CHANGE
One of the motivations of this study is to develop better re-ranking strategies, based
on historical data, for queries that are expected to occur again. For previously issued
queries, re-ranking the results would benefit from an estimate of when the query will
next occur. Normally, users expect a previously viewed result to appear on the same
result page, or even at the same rank, when they repeat a query in order to return to
that page. This suggests that re-ranking of the results is best done after the last
expected repetition of a query with the same click has been observed.
The study examined whether the rank of a repeated click changes at a smaller (or
larger) time interval than the interval between two repeated queries. The same
distance-based analysis was performed here. All identical clicks were grouped and
ordered by time. The first time a change of rank was observed within the
time-ordered sequence of clicks, the time interval between the two click events was
recorded. The result is shown in the diagram in Figure 14.
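The first-rank-change measurement can be sketched as below. It assumes that identical clicks are those sharing the same user, query and URL, and that the recorded interval is the gap between the first click showing a different rank and the click immediately before it; both points are interpretations of the description above rather than confirmed details. All names and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def first_rank_change_intervals(clicks):
    """Histogram of the day gap at which a rank change is first observed.

    `clicks` is an iterable of (user_id, query, url, day_index, rank) tuples.
    """
    groups = defaultdict(list)
    for user, query, url, day, rank in clicks:
        groups[(user, query, url)].append((day, rank))

    histogram = Counter()
    for events in groups.values():
        events.sort()  # order the identical clicks by time
        for (day0, rank0), (day1, rank1) in zip(events, events[1:]):
            if rank1 != rank0:
                histogram[day1 - day0] += 1
                break  # only the first observed change is recorded
    return histogram


clicks = [
    ("u1", "maps", "mapquest.com", 1, 1),
    ("u1", "maps", "mapquest.com", 3, 1),
    ("u1", "maps", "mapquest.com", 12, 2),
    ("u1", "maps", "mapquest.com", 20, 3),
    ("u2", "pogo", "pogo.com", 2, 1),
    ("u2", "pogo", "pogo.com", 4, 1),
]
rank_changes = first_rank_change_intervals(clicks)
```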
5 RESULTS
This chapter presents the results of the analyses described in the previous chapter.
The results are listed as:
- Result 1.1: Result for Data Preparation
- Result 2.1: Query Repetition Overview
- Result 2.2: Click Repetition
- Result 3.1: Query Repetition frequency Distribution
- Result 3.2: Click Repetition Frequency Distribution
- Result 3.3: User repetition Frequency Distribution
- Result 3.4: Further Examination of the Frequent Query List
- Result 3.5: Repetition Rate Variance
- Result 4.1: General Query Repetition Periodicity
- Result 4.2: Informational Query Repetition Periodicity
- Result 4.3: Navigational Query Repetition Periodicity
- Result 4.4: User Variance On Query Repetition Periodicity
- Result 5.1: Rank First-convert-to-change Periodicity
5.1 RESULTS FOR DATA PREPARATION
Result 1.1: Result for Data Preparation
Non-human users were identified by plotting user frequency against time and
looking for abnormal behavior in the trend. As can be seen from the graph, there is a
peak between days 37 and 38, where a single user issued over 13 thousand queries,
either repeated or unique. This was considered abnormal behavior, so the user ID
was identified and all related records were removed. Using the same method,
another similar user ID was removed. The graph in which the abnormality was
detected is shown below.
Figure 1 Abnormal User Detection
Table 1 Data Preparation Result
Total Query 35,382,016
Corrupted Queries 1,005,069( )
Non-human User 142,775( )
Query Episode 20,714,848( )
Unique query 10,152,834 ( )
Navigational Query 3,201,860 (31.54%)
Informational Query 6,950,974 (68.46%)
5.2 RESULTS FOR OVERALL QUERY REPETITION EXAMINED
Result 2.1: Query Repetition Overview
Among the 10,152,834 distinct queries found in the 20,714,848 query episodes,
29.8% were submitted more than once by users, while 70.2% are unique queries.
Among the repeated queries, individual user repetition accounts for 43.75%, while
group repetition accounts for 56.25%. The repeated queries were then examined by
query type. As discussed in the previous section, informational queries are those
marked as I. The results show that around 42.46% of the repeated queries are
informational queries, while 57.54% are navigational queries.
Query Repetition
Total Query (Episode)            20,714,848
Distinct query string            10,152,834 (49%)
Unique queries                   7,126,165 (70.2%)
Repeated queries                 3,026,669 (29.8%)
Individual repetition            1,324,168 (43.75%)
Group repetition                 1,702,501 (56.25%)
Repeated Navigational Query      1,741,545 (57.54%)
Repeated Informational Query     1,285,124 (42.46%)
Table 2 Repetition Overview
Result 2.2: Further Examination of the Frequent Query List
The frequency lists are listed below.
The 20 most repeated queries
Query Number of times searched
Google 279445
eBay 129968
yahoo 186150
MapQuest 102050
MySpace 145650
internet 32136
weather 22174
http 21074
bank of America 29339
American idol 16892
pogo 16263
Hotmail 27963
msn 14060
craigslist 13518
.com 13115
dictionary 13016
yahoo mail 12689
ask Jeeves 11547
Wal-Mart 11475
mycl.cravelyrics.com 10471
Table 3 Most Repeated Queries
An examination of the 20 most frequently repeated queries shows that most of them
are navigational queries pointing to another search engine (Google, Yahoo, MSN), a
map website (mapquest.com), an e-commerce website (eBay, Amazon) or a social
network (MySpace). Some everyday inquiries, such as weather or TV, also appear in
the list.
A list of the top 20 queries by number of distinct users who searched for them was
also generated. It shows the same trend as the previous one.
The 20 Most Shared Repeated Queries
Query Number of Users
Google 120782
eBay 76178
yahoo.com 67606
MapQuest 59098
myspace.com 34003
internet 21996
http 17041
weather 12967
American idol 9980
dictionary 8118
Wal-Mart 7562
ask Jeeves 7125
home depot 6700
ask.com 6202
southwest airlines 5927
target 5925
white pages 5767
maps 5589
hotmail.com 5370
yellow pages 5308
Table 4 Most Repeated Queries across User
A list of the 20 queries most repeated by a single user was then generated.
As can be seen from the above lists, both the most frequently repeated queries and
the queries most shared across users are almost all navigational. However, some
everyday informational inquiries, such as weather or TV, also appear in the top 20
lists. This means that although most queries repeated on a daily basis tend to be
navigational, some informational queries are also in daily demand.
Result 2.3: Click Repetition
Clicks from the same user that appeared more than once were counted. Out of the
total click stream, 35.39% were repeated clicks. Among the repeated clicks, 68.40%
came from the same query and 31.60% from different queries.
Click Repetition
Total Click 9,826,259
Unique Click 6,348,868(64.61%)
Repeated Click 3,477,391(35.39%)
From Same Query 2,378,465 (68.40%)
From Different Query 1,098,926(31.60%)
Table 5 Click Repetition Overview
The repetition found in the click stream is a little higher than the repetition of
queries. The results for both query and click repetition support previous work, since
similar proportions of repetition behavior are found in the new log data. This further
indicates that the existence of such repetition behavior is independent of the user
group being analyzed.
5.3 TEMPORAL QUERY REPETITION RATE EXAMINED
Result 3.1: Temporal Repetition Rate Distribution
The following shows the daily repetition rate distribution. The x-axis represents the
92 days of the observation period, and the y-axis represents the percentage of
repeated queries on each day. As can be seen from Figure 2, the daily repetition rate
does not change much over time and remains stable at around 60% every day.
Figure 2 Temporal Repetition Rate Distribution (x-axis: Day; y-axis: Repetition Rate Per-Day)
Result 3.3: User Frequency
The following diagram shows the frequency distribution of different users.
Figure 3 User Frequency Distribution (axes: Number of Users, logarithmic scale; User Frequency)
As can be seen from the diagram, most users do not submit more than 10 queries per
day, while a small number of users submit more than 10. Based on the diagram, the
cut-off was set at 10 queries per day: users above it are regarded as frequent users,
and users below it as non-frequent users.
Result 3.5: Repetition Rate Variance
The following diagram shows how users’ average repetition rate changes with their
search frequency. Below a certain user frequency, about 300 in this graph,
higher-frequency users tend to repeat more than lower-frequency users, and the two
quantities follow a linear relation. Beyond that point, although the repetition rate is
still generally rising, it tends to fluctuate a lot. This is probably because
high-frequency users, who are the exceptional members of the user group, tend to
vary a lot from one another.
Figure 4 User Repetition Rate Distribution (chart title: User Repetition Variance; x-axis: User Frequency, logarithmic scale)
Result 3.4: User Variance on Temporal Repetition Rate
The following diagram shows the daily repetition rates of the frequent and the
non-frequent user groups.
Figure 5 User Variance on Temporal Repetition Rate (x-axis: Days; series: frequent user, non-frequent user)
As can be seen from the graph, non-frequent users tend to repeat at a stable rate,
while frequent users tend to repeat more as time goes by. However, after a certain
point (around 45 days), their pursuit of previously submitted queries tends to decline
a little and then remains stable. This indicates that frequent users tend to repeat more
within a short time interval and tend to have more varied interests later on.
5.4 REPETITION PERIODICITY EXAMINATION
Result 4.1: General Query Repetition Periodicity
- Step 1: User’s searching activity
The following shows the distribution of the time intervals between two events from
the same user.
Figure 6 Same User Query Time Intervals (x-axis: Time Interval; y-axis: logarithmic scale)
As can be seen, a weekly effect is present in the diagram. The decrease at the end is
due to the 92-day window of the log. This data was used for later normalization.
- Step 2: User’s Query Repetition
The same steps were performed to calculate the distribution of the day difference
between repeated queries from the same user. The data was used to produce the
diagram in Figure 7. The weekly effect is quite obvious.
Figure 7 Same User Same Query Time Intervals (x-axis: Time Interval; y-axis: logarithmic scale)
- Step 3: Normalized Repetition Periodicity
As can be seen from the normalized view below, intervals above y=1 are more likely
to occur, while intervals below it tend to occur less. The cut-off at 7-8 days shows
that a user’s repetition of a query is more likely to happen within the following 7-8
days; after this period, the chance of the query being repeated decreases.
Figure 8 Normalized General Query Repetition Time Intervals (x-axis: Time Intervals; y-axis: logarithmic scale)
Result 4.2 Informational Query Repetition Periodicity
The same method was then used to examine the repetition pattern by query type.
The informational and navigational queries identified in the previous analysis were
analyzed in the same way as in the general analysis. The data on users’ general query
repetition time intervals was used to normalize the informational query repetition
data, which produced Figure 9.
Figure 9 Informational Query Repetition Periodicity (x-axis: Time Interval; y-axis: logarithmic scale)
From the graph we can see that an informational query is repeated in a burst, within
3 days, after which it is unlikely to be repeated again. This is in line with the
suggestion made by many previous studies that informational queries are more
likely to be repeated in a burst within a few days.
Result 4.3: Navigational Query Repetition Periodicity
In the same way, we produced the repetition time interval distribution for
navigational queries and normalized it in the same way as for informational queries.
The normalized view is shown below:
Figure 10 Navigational Query Repetition Time Intervals (x-axis: Time Interval; y-axis: logarithmic scale)
As can be seen, the repetition of navigational queries follows a different pattern from
that of both general and informational queries. A repeated navigational query is
unlikely to be observed within the 7 days after it was first issued.
Combined with the finding for general query repetition, the repetition of
informational queries within 3-4 days may account for most of the repeated queries
observed within the first 7 days. These observations are in line with the findings of
Sanderson and Dumains (2006): although the informational query repetition period
turned out to be shorter than expected, the previous findings still hold.
It can be concluded from the above analyses that, in general, previously issued
queries, most of which are informational, tend to be repeated within a short time
interval, while queries with a navigational task are usually re-issued at a later point
in time. The repetition pattern differs according to the user’s intention.
Result 4.4: User Variance on Query Repetition Periodicity
The following shows the repetition time span for users of different frequency. As can
be seen from the graph, almost all low-frequency users repeat a query within an
interval of 13-20 days on average, while high-frequency users repeat a query within
2-35 days on average. This means that low-frequency users repeat queries within a
narrow range of intervals, whereas the intervals of high-frequency users are spread
over a much wider range and can be considerably longer.
Figure 11 User Variance on Repetition Periodicity
5.5 RANK FIRST CONVERT TO CHANGE
Result 5.1: First Convert To Change
This section analyses the chance of observing a rank change while re-accessing a
previously viewed result page by the same path. The method used in this section is
the same as in the previous analyses.
- Step 1: General Rank Change
The following diagram shows the general time intervals at which a rank change
occurs.
Figure 12 General Rank Change Time Intervals (x-axis: Time Interval Between Two Clicks; y-axis: logarithmic scale)
From the above diagram, we can see a weak 7-day periodicity. It is less obvious than
the periodicities analysed previously because a rank change is not a human
behaviour; it is something the user perceives rather than does. It is nevertheless still
valid to use this data to normalize the data generated later, since other factors
involved in rank changes would otherwise affect the final result.
- Step 2: Repeated Click Rank Change
The same procedure was used to calculate the first time a change of rank is observed
by the user when the same result from the same query is clicked again. The results
are plotted in Figure 13.
Figure 13 Rank Changes Distribution (x-axis: Time Interval; y-axis: logarithmic scale)
- Step 3: Normalized View
The data from Figure 13 was normalized by the data from Figure 12; the normalized
view is shown in Figure 14.
Figure 14 Rank First Change Periodicity (x-axis: First Time Rank Change Will Be Perceived; y-axis: logarithmic scale)
As can be seen from the diagram, a user is unlikely to see a change of rank if he or
she repeats a query and clicks the same URL within the following 9 days. A repeated
click (resulting from the same user query) that happens after that may come with a
change of rank observed by the user. Combined with the previous finding that
navigational queries are more often repeated after 7 days, this result suggests that a
repeated navigational query is likely to encounter a change of rank.
5.6 RESULT SUMMARY
- Results for Data Preparation
Of all the 35,382,016 queries, about 2.84% were removed as corrupted data.
134,635 records from one user were removed after abnormal searching
behaviour was identified. By grouping identical queries submitted consecutively
by the same user at intervals of no more than 30 minutes, 20,714,848 query
episodes were established. Within these query episodes, 10,152,834 distinct
queries were identified, comprising 3,201,860 navigational queries and
6,950,974 informational queries.
- Query Repetition Overview
The initial analysis of overall repetition found that about 29.8% of the distinct
queries were issued more than once by users. Of the repeated queries, 56.25%
are repeated across different users, while 43.75% are repeated by the same user.
The examination of repetition by query type reveals that more than half of the
repeated queries are navigational (57.54%); in comparison, the repetition of
informational queries is a little less common (42.46%). The examination of
repetition in the click stream shows that, of all the clicked results, 35.39% are
repeated clicks, of which 68.40% are from the same query and 31.60% from
different queries.
- Temporal repetition rate examination
The temporal repetition rate examination shows that the daily repetition rate
tends to remain stable over time. Frequent users are a minority: most users do
not submit more than 10 queries per day. Frequent users tend to repeat more as
they search more within a short time interval, but tend to have more unique
queries in the long term. By contrast, non-frequent users tend to repeat at an
even rate.
- Repetition Periodicity Examination
The examination of repetition periodicity shows that both query and click
repetition follow a weekly periodicity. General query repetition shows a 7-day
cut-off, which means that repetition in general tends to occur within 7 days.
The examination of informational and navigational queries shows that, once a
query is identified as informational or navigational, the repetition patterns vary
considerably. The graphs show that, while most informational query repetition
tends to happen within 3 days, navigational queries tend to be repeated after 7
days.
- Rank First-Convert-to-Change Periodicity Analysis
The examination of rank change reveals that if a user clicks on the same result
using the same query within 9 days, he or she is unlikely to experience a rank
change.
- User Variance Examination
The examination of user variance shows that, generally, users with a higher
query frequency tend to repeat queries more frequently. For users who search
more than 300 queries, the repetition rates fluctuate a lot. The periodicity
analysis of user variance also shows that high-frequency users tend to repeat
queries at an interval of 2-35 days, whereas low-frequency users tend to repeat
queries at an interval of 13-20 days.
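The query-episode grouping mentioned under Results for Data Preparation can be sketched as follows. This is an illustration under assumptions: consecutive submissions of the same query by the same user are merged as long as each follows the previous one within 30 minutes, and the record layout, function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def to_episodes(records, gap=timedelta(minutes=30)):
    """Collapse consecutive identical queries from one user into episodes.

    `records` is an iterable of (user_id, query, timestamp) triples. Each
    episode is represented by the first submission in the run.
    """
    episodes = []
    previous = None  # (user_id, query, timestamp) of the previous submission
    for user, query, stamp in sorted(records, key=lambda r: (r[0], r[2])):
        same_run = (
            previous is not None
            and previous[0] == user
            and previous[1] == query
            and stamp - previous[2] <= gap
        )
        if not same_run:
            episodes.append((user, query, stamp))
        previous = (user, query, stamp)
    return episodes


records = [
    ("u1", "weather", datetime(2000, 1, 1, 9, 0)),
    ("u1", "weather", datetime(2000, 1, 1, 9, 10)),  # merged into the first
    ("u1", "weather", datetime(2000, 1, 1, 9, 50)),  # 40 minutes later: new episode
    ("u1", "maps", datetime(2000, 1, 1, 10, 0)),
]
episodes = to_episodes(records)
```

Merging such runs lowers the count of individual repetitions, which is relevant when the percentages are compared with studies that did not group queries in this way.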
5.7 RESULT DISCUSSION
The results generated in this analysis were then compared with those of other studies
in order to validate or explain them. Because different studies express some
measures differently, some figures were converted to a common standard.
Result comparison
                               This Study   Teevan et al.   Sanderson and Dumains
Overall Query repetition       29.8%        -               50%
Individual repetition          25.39%       33%             -
Group user repetition          32.64%       18%/7%          -
Repeated navigational query    57.54%       47%             80%
Repeated clicks of same user   30.39%       29%             17.5%
From same query                68.4%        -               83%
From different query           31.6%        -               17%
Table 6 Result Sheet
As can be seen from the above comparison, although some of the percentages vary a
little, they are either close to each other or fall within the range of the figures from
previous studies. The most obvious differences lie in the percentage of repeated
queries and in the repetition percentages for the different user groups. This is
probably because, in this study, consecutive repeated queries issued by a single user
within a short time were grouped into query episodes, so the percentage of
individual repetition may be lower than in other studies.
The findings about user frequency variance accord with those of Dou et al. (2008):
users tend to repeat more as they search more, up to a certain point, after which they
show a variety of interests in both repeated and unique queries. This implies that, for
all users, query history may be useful for short-term query re-use, whereas for
frequent searchers some long-term personalization strategies based on the past query
list may not work well. The examination of the temporal repetition rate is also
consistent with Dou et al.’s (2008) findings: frequent users tend to have varied
interests in the long run. They therefore suggested that, for a frequent user, a profile
based on long-term interests would work better than a query-based profile.
This study also extended previous work by examining repetition periodicity from
various aspects. It found a 7-day periodicity for general query repetition, which is in
line with Sanderson and Dumains (2006). The examination of navigational query
repetition shows a similar 6-7 days before a repetition is likely to happen. The study
examined informational queries in particular and found a 3-day cut-off. Although the
query classification cannot identify informational queries very precisely, this finding
is still consistent with the previous belief that informational queries are repeated in
bursts. The three days may be a little shorter than expected, but can be regarded as
reflecting the trend.
The study then analysed the possible differences between users. The general 7-day
period for a repetition to occur reflects only the mixed behaviour of high- and
low-frequency users. When examined separately, low-frequency users repeat queries
within a concentrated period, while high-frequency users repeat queries over a
comparatively wide range of time intervals. Although the 13-20 days for
low-frequency users and the 2-35 days for high-frequency users may to some extent
be specific to this data set, they roughly indicate a range of 2-3 weeks for
low-frequency users and of up to 5 weeks for high-frequency users. This means that
a high-frequency user can repeat a query very soon, but can also maintain a
long-term interest in a particular query and repeat it at a later point. A low-frequency
user usually repeats a query around 2 weeks after first submitting it; after that, they
may never be seen repeating the query again. Some previous findings have suggested
that frequent search engine users tend to have varied information needs, so their
repetition behaviour is less periodic than that of users who use the search engine
less. This agrees with the above finding of this study.
The analysis of the first rank change during repetition is specific to this study,
since the time interval before a rank change is measured within the query repetition
process. In general, a rank change can happen at any time, or at other intervals that
have been found in general search analyses. In this study, however, only the rank
changes perceived while re-accessing a previously viewed result page are included in
the analysis. This rank change periodicity may vary with the repetition periodicity,
but it is analysed here only to suggest that a repetition after 9 days will possibly
experience a rank change. Combined with the previous finding about the periodicity of
navigational query repetition, this may suggest that some navigational queries
re-issued after 9 days by the same user, trying to find the same page with the same
query, will possibly meet a rank change. Even though in most cases the best result for
a given navigational query will not fall out of the top 10 and will usually appear
near the top of the result page, the change of rank can still hinder the process of
re-finding to some degree. This was also noted by Teevan et al (2007) in their
analysis of the Yahoo query log.
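The notion of a perceived rank change can be made concrete with a short sketch. It
is offered only as an illustration under stated assumptions, not as the query used
against the database in this study. The record layout (user_id, query, url, day,
rank) and the sample values are hypothetical.

```python
def first_rank_change_intervals(clicks):
    """Return the day intervals before the first perceived rank change.

    clicks: iterable of (user_id, query, url, day, rank) tuples. For each
    (user, query, url) triple, the interval is the number of days between the
    first click on the result and the first later click on the same result at
    which its rank differed from the rank at the first click.
    """
    first = {}       # (user, query, url) -> (day, rank) of the first click
    done = set()     # triples whose first rank change has been recorded
    intervals = []
    for user, query, url, day, rank in sorted(clicks, key=lambda c: c[3]):
        key = (user, query, url)
        if key not in first:
            first[key] = (day, rank)
        elif key not in done and rank != first[key][1]:
            intervals.append(day - first[key][0])
            done.add(key)
    return intervals


# Made-up click-through records: (user_id, query, url, day, rank)
clicks = [
    ("u1", "sheffield university", "shef.ac.uk", 1, 1),
    ("u1", "sheffield university", "shef.ac.uk", 5, 1),
    ("u1", "sheffield university", "shef.ac.uk", 10, 2),
    ("u1", "sheffield university", "shef.ac.uk", 20, 3),
    ("u2", "bbc", "bbc.co.uk", 2, 1),
    ("u2", "bbc", "bbc.co.uk", 9, 1),
]
changes = first_rank_change_intervals(clicks)
```

Only changes that the user could actually have seen are counted, because a rank is
only recorded when the result is clicked again. Rank changes that occur between
visits remain invisible to this kind of analysis.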
To sum up, the analyses in this study have further identified some features of the
frequency and periodicity of repetition behaviour from different aspects, including
examination by query type and by user group. As an extension of previous work, the
study has yielded some findings that shed light on the different patterns exhibited
in users' repetition behaviour.
6 CONCLUSIONS
This chapter presents the final conclusions of this study. First the main
contribution of the study is discussed, and then its limitations and suggestions for
future work are provided.
6.1 CONTRIBUTION
In this study, we confirmed that query repetition is quite common in web searching.
By examining the types of query being repeated, and the patterns of query repetition
both for individual users and for user groups, we corroborated many previous findings
regarding this particular user behaviour.
The major contribution of this work is that it has extended previous work by
examining the query repetition pattern both by query type and by user. It looked
specifically at the repetition patterns of navigational and informational queries,
and found that informational queries, as expected, tended to be repeated in a burst
within 3-4 days, while the repetition of navigational queries will possibly happen
after 7 days, an interval that has been previously identified. Users of different
frequency tend to exhibit different patterns when repeating a query. High frequency
users tend to repeat queries in more varied ways, and their repetition of a query
could happen within 1-5 weeks, while the behavior of low frequency users tends to be
highly periodic, showing a concentrated inclination to repeat a query between week 2
and week 3 after first issuing it. In addition, a suggestion for re-ranking is given,
based on the finding that a rank change will be perceived when the same user repeats
the same query 9 days later in order to return to a previously viewed result page.
This could also indicate that re-finding based on a navigational query will possibly
be hindered by rank change. In this study, the examination of user variance
complements previous work, and the examination of the period before a rank change
could be observed during query repetition suggests a possible limitation of the
re-ranking strategy.
6.2 LIMITATION AND FUTURE WORK
One of the limitations of this study is that the classification of queries was not
based on standardized algorithms. There are different interpretations of the criteria
developed for identifying different types of query; therefore the algorithms for
classifying queries may vary. Also, the analysis in this paper was carried out at
query level, based on exact matching. The modeling of real user behavior may benefit
from taking users' modified queries into consideration. The study analyzed only daily
periodicity; it did not include hourly analysis, and, because of the size of the data
set and the period covered, it cannot shed light on monthly or seasonal periodicity.
Another limitation of this study is that it did not fully adopt the same metrics as
previous work. The grouping of queries into episodes cuts both ways. On the one hand,
a repetition rate that excludes duplicate and identical queries tends to be a better
reflection of users' repetition rates; on the other hand, the 30-minute cut-off is
based on an estimate. As Sanderson and Dumais (2007) noted in their work, any grouping
of queries or users based on such approximations is prone to error. Also, the study
did not remove the oversized sessions that are suspected of being generated by robots,
so the results may contain outliers that have not yet been removed from the log.
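To make the grouping step concrete, a minimal sketch of episode segmentation with a
30-minute cut-off is given below. It is an illustration under assumptions rather than
the exact procedure of this study: the record layout (user_id, timestamp, query), the
function names and the sample data are hypothetical, and a query is counted as
repeated only when it recurs, by exact match, in a later episode of the same user.

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(minutes=30)   # the 30-minute cut-off discussed above


def group_episodes(events, cutoff=CUTOFF):
    """Group each user's queries into episodes.

    events: iterable of (user_id, timestamp, query). A new episode starts
    whenever the gap since the user's previous query exceeds the cut-off.
    Returns {user_id: [[query, ...], ...]} in chronological order.
    """
    episodes = {}
    last_time = {}
    for user, ts, query in sorted(events, key=lambda e: (e[0], e[1])):
        if user not in episodes or ts - last_time[user] > cutoff:
            episodes.setdefault(user, []).append([])
        episodes[user][-1].append(query)
        last_time[user] = ts
    return episodes


def repeated_across_episodes(user_episodes):
    """Return the queries that occur in more than one episode of one user.

    Duplicates inside a single episode are ignored.
    """
    seen, repeated = set(), set()
    for episode in user_episodes:
        for query in set(episode):
            if query in seen:
                repeated.add(query)
            else:
                seen.add(query)
    return repeated


# Made-up sample: three queries in the morning, one three hours later
t0 = datetime(2010, 1, 1, 9, 0)
events = [
    ("u1", t0, "bbc"),
    ("u1", t0 + timedelta(minutes=10), "bbc"),
    ("u1", t0 + timedelta(minutes=20), "news"),
    ("u1", t0 + timedelta(hours=3), "bbc"),
]
episodes = group_episodes(events)
repeated = repeated_across_episodes(episodes["u1"])
```

Changing the cut-off changes which queries count as repeated, which is why any
repetition rate computed this way depends on the approximation chosen.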
This study did not take into consideration other effects that could mask the
periodicity of repetition behavior. As Sanderson and Dumais (2007) noted in their
work, without identifying the underlying patterns of web usage and computer usage,
the observed periodicity cannot be guaranteed to be a feature unique to query
repetition behavior.
Another major limitation of this study is that it did not include a survey or
questionnaire to complement the quantitative analysis. This is due to the deep-rooted
limitation of SLA as an unobtrusive method for user studies. As discussed by Jansen
(2009), the computer screens off most of the information that would form the
background of the users being studied. This information includes basic personal
details such as the user's gender, age and career; user-side activities such as
downloading a document or copying and pasting; the needs and perspective that
motivated the query; and the user's emotional state and education level. The lack of
such background information about users causes many problems when trying to derive
conclusions from the results.
In summary, this work has only skimmed the surface of the temporal repetition patterns
exhibited by web search engine users. Whether the repetition pattern is due to other
effects is left for further investigation. Also, the visual inspection of periodicity
from graphs would benefit from a more exact examination using the Fourier transform to
detect significant frequencies. It would be of interest to find the probability that
both the repeated query and the repeated click fall within a certain time span. The
user variance in repetition periodicity would also be a route for future study.
Finally, it would be good to have a larger data set covering a longer period of time,
in order to be able to perform monthly and even seasonal analysis.
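As an indication of what such a Fourier-based examination might look like, the sketch
below computes a simple periodogram of a daily count series with a direct discrete
Fourier transform and reports the period with the largest power. It is only a sketch
under assumptions: the input series is synthetic, the function names are hypothetical,
and picking the largest peak is not a test of statistical significance, which would
still have to be added.

```python
import cmath
import math


def periodogram(series):
    """Return [(period_in_days, power)] for a daily count series.

    The mean is removed first, then the discrete Fourier transform is
    evaluated directly at the frequencies k/n for k = 1 .. n // 2.
    """
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        result.append((n / k, abs(coeff) ** 2 / n))
    return result


def dominant_period(series):
    """Return the period (in days) whose periodogram power is largest."""
    return max(periodogram(series), key=lambda p: p[1])[0]


# Synthetic daily repetition counts with a weekly cycle over 12 weeks
counts = [10 + 5 * math.cos(2 * math.pi * day / 7) for day in range(84)]
period = dominant_period(counts)
```

For a real log the series would be the number of repeated queries per day, and a
library implementation of the fast Fourier transform would be preferable for long
series. Note that only periods that divide into the observation window reasonably
well can be resolved, so a three-month log cannot reveal seasonal cycles.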
REFERENCES
Adar, E., D. Weld, et al. (2007). Why we search: visualizing and predicting user
behavior, In Proc. of the Int'l WWW Conf.
Aula, A., N. Jhaveri, et al. (2005). Information search and re-access strategies of
experienced web users, In Proceedings of WWW'05, 583-592.
Beitzel, S., E. Jensen, et al. (2007). "Temporal analysis of a very large topically
categorized web query log." Journal of the American Society for Information Science
and Technology 58(2): 166-178.
Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D. and Frieder, O.
(2004). Hourly analysis of a very large topically categorized Web query log. In
Proceedings of SIGIR, 321-328.
Bruce, H., Jones, W. and Dumais, S. (2004). Keeping and re-finding information on
the Web: What do people do and what do they need? In Proceedings of ASIST.
Brooks, N. (2006). "Repeat search behavior: Implications for advertisers."
BULLETIN-AMERICAN SOCIETY FOR INFORMATION SCIENCE AND
TECHNOLOGY 32(2): 16.
Chien, S. and N. Immorlica (2005). Semantic similarity between search engine
queries using temporal correlation, In Proc. of the 14th Int’l World Wide Web
Conference, 2-11.
Cockburn, A. and B. McKenzie (2001). "What do Web users do? An empirical
analysis of Web use." International Journal of Human-Computer Studies 54(6):
903-922.
Cui, H., J. Wen, et al. (2002). Probabilistic query expansion using query logs, Proc.
11th World Wide Web Conf., pp. 325-332.
Capra, R. and Pérez-Quiñones, M.A. (2005). Using Web search engines to find and
re-find information. IEEE Computer, 38 (10), 36-42.
Dou, Z., R. Song, et al. (2007). A large-scale evaluation and analysis of personalized
search strategies, In Proceedings of WWW'07, 581-590.
Dou, Z., X. Yuan, et al. (2008). "Analysis of Query Repetition in Large-scale
Chinese Search Log." Jisuanji Gongcheng/ Computer Engineering 34(21): 40-41.
Fagni, T., R. Perego, et al. (2006). "Boosting the performance of web search engines:
Caching and prefetching query results by exploiting historical usage data." ACM
Transactions on Information Systems (TOIS) 24(1): 78.
Global Faces and Networked Places: A Nielsen report on social networking’s new
social footprint. blog.nielsen.com/nielsenwire/wp.../nielsen_globalfaces_mar09.pdf
Accessed 28th July 2010.
Han, W., J. Lee, et al. (2007). Ranked subsequence matching in time-series databases,
In International Conference on Very Large Data Bases (VLDB), pp. 423–434.
Jansen, B., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A
study and analysis of user queries on the web. Information Processing and
Management, 36(2), 207–227.
Jansen, B.J. and Pooch, U. (2001). A review of web searching studies and a
framework for future research. Journal of the American Society for Information
Science and Technology, 52(3), 235–246.
Jansen, B.J. and Spink, A. (2006). How are we searching the World Wide Web? A
comparison of nine search engine transaction logs. Information Processing and
Management, 42(1), 248–263.
Jansen, B.J. (2006). Search log analysis: What it is, what's been done, how to do it.
Library and Information Science Research, 28(3), 407–432.
Jansen, B.J. (2008). The methodology of search log analysis. In B.J. Jansen, A. Spink,
& I. Taksa (Eds.), Handbook of research on Web log analysis (pp. 99–121). Hershey,
PA: Idea Group Inc.
Jansen, B.J., Spink, A., & Pedersen, J. (2005). Trend analysis of AltaVista Web
searching. Journal of the American Society for Information Science and Technology,
56(6), 559–570.
Koshman, S., A. Spink, et al. (2006). "Web searching on the Vivisimo search
engine." Journal of the American Society for Information Science and Technology
57(14): 1875-1887.
Lee, U., Z. Liu, et al. (2005). Automatic identification of user goals in web search,
In Proceedings of The World Wide Web Conference. Chiba, Japan, 391-401.
Liu, F., C. Yu, et al. (2002). Personalized web search by mapping user queries to
categories, In Proceedings of the Eleventh International Conference on Information
and Knowledge Management (CIKM'02). USA, 558-565.
Lau, T. and Horvitz, E. (1999). Patterns of search: Analyzing and modeling Web
query refinement. In Proceedings of the UM ’99, 119-128.
Mahanti, A., D. Eager, et al. (2000). "Temporal locality and its impact on Web proxy
cache performance." Performance Evaluation 42(2-3): 187-203.
Obendorf, H., Weinreich, H., Herder, E., and Mayer, M. (2007). Web page revisitation
revisited: Implications of a long-term click-stream study of browser usage. In
Proceedings of CHI, 597–606.
Ozmutlu, S., Spink, A. and Ozmutlu, H.C. (2004). A day in the life of web searching:
an exploratory study. Information processing and management, 40(2), 319–345.
Ross, N., & Wolfram, D. (2000). End user searching on the Internet: An analysis of
term pair topics submitted to the Excite search engine. Journal of the American
Society for Information Science, 51(10), 949–958.
Sanderson, M. and Dumais, S. (2007). Examining repetition in user search behavior.
In Proceedings of ECIR ’07.
Silverstein, C. et al. (1999). Analysis of a very large web search engine query log. In
ACM SIGIR Forum. pp. 6–12.
Spink, A. et al. (2002). US versus European Web searching trends. In ACM SIGIR
Forum. pp. 32–38.
Spink, A., Bateman, J. and Jansen, B.J. (1998). Searching heterogeneous collections
on the Web: behaviour of Excite users. Information Research, 4(2), 4–2.
Smyth, B. (2007). "A community-based approach to personalizing web search."
Computer 40(8): 42-50.
Smyth, B., E. Balfe, et al. (2004). "Exploiting query repetition and regularity in an
adaptive community-based web search engine." User Modeling and User-Adapted
Interaction 14(5): 383-423.
Teevan, J., E. Adar, et al. (2007). Information re-retrieval: repeat queries in Yahoo's
logs, In SIGIR'07: Proceedings of the 30th annual international ACM SIGIR
conference on Research and development in information retrieval, pp. 151–158.
Tyler, S. and J. Teevan (2010). Large scale query log analysis of
re-finding, Proc. WSDM, 191-200.
Tauscher, L. and Greenberg, S. (1997). How people revisit Web pages: Empirical
findings and implications for the design of history systems. International Journal of
Human-Computer Studies, 47 (1), 97–137.
Teevan, J. (2007). Supporting finding and re-finding through personalization.
Doctoral Thesis, MIT, February.
Teevan, J., Adar, E., Jones, R. and Potts, M. (2006). History repeats itself: Repeat
queries in Yahoo’s logs. In Proceedings of SIGIR, 703-704.
Teevan J., Alvarado C., Ackerman M. S., and Karger D. R. (2004). The perfect search
engine is not enough: A study of orienteering behavior in directed search. In
Proceedings of CHI, 415-422.
Vlachos, M., S. Kozat, et al. (2009). Optimal distance bounds on time-series data, In
SDM.
Vlachos, M., S. Kozat, et al. (2010). "Optimal distance bounds for fast search on
compressed time-series query logs." ACM Transactions on the Web (TWEB) 4(2):
1-28.
Vlachos, M., C. Meek, et al. (2004). Identifying similarities, periodicities and bursts
for online search queries, In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pp. 131–142.
Vlachos, M., P. Yu, et al. (2005). On periodicity detection and structural periodic
similarity, In Proceedings of the SIAM International Conference on Data Mining (SDM
05).
Wang, P., Berry, M.W. and Yang, Y. (2003). Mining longitudinal Web queries: Trends
and patterns. Journal of the American Society for Information Science and
Technology, 54(8), 743–758.
Wedig S. and Madani, O. (2006). A large-scale analysis of query logs for assessing
personalization opportunities. In Proceedings of KDD, 742–747.
Wen, J.-R., Nie, J.-Y. and Zhang, H.-J. (2002). Query clustering using user logs. TOIS,
20 (1), 59–81.
Xie, Y. and D. O'Hallaron (2002). Locality in search engine queries and its
implications for caching, In Proceedings of the twenty-first annual joint conference of
the IEEE computer and communications societies, pp. 307–317.
Zhang, Y., B. Jansen, et al. (2009). "Time series analysis of a Web search engine
transaction log." Information Processing & Management 45(2): 230-245.
Zhao, Q., S. Hoi, et al. (2006). Time-dependent semantic similarity measure of
queries using historical click-through data, In WWW'06: Proceedings of the 15th
international conference on World Wide Web, New York, NY, USA, pp. 543–552.
Zhao, Q., T. Liu, et al. (2006). Event detection from evolution of click-through
data, In Proceedings of the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 484–493.