Discovering Tasks from Search Engine Query Logs · 2013. 9. 25. · Discovering Tasks from Search Engine Query Logs 14:3 The rationale behind this two-step strategy is as follows

14

Discovering Tasks from Search Engine Query Logs

CLAUDIO LUCCHESE, ISTI-CNRSALVATORE ORLANDO, Universita Ca’ Foscari VeneziaRAFFAELE PEREGO and FABRIZIO SILVESTRI, ISTI-CNRGABRIELE TOLOMEI, Universita Ca’ Foscari Venezia

Although Web search engines still answer user queries with lists of ten blue links to webpages, people areincreasingly issuing queries to accomplish their daily tasks (e.g., finding a recipe, booking a flight, readingonline news, etc.). In this work, we propose a two-step methodology for discovering tasks that users try toperform through search engines. First, we identify user tasks from individual user sessions stored in searchengine query logs. In our vision, a user task is a set of possibly noncontiguous queries (within a user searchsession), which refer to the same need. Second, we discover collective tasks by aggregating similar user tasks,possibly performed by distinct users. To discover user tasks, we propose query similarity functions based onunsupervised and supervised learning approaches. We present a set of query clustering methods that exploitthese functions in order to detect user tasks. All the proposed solutions were evaluated on a manually-builtground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks,we propose four methods that cluster previously discovered user tasks, which in turn are represented bythe bag-of-words extracted from their composing queries. These solutions were also evaluated on anothermanually-built ground truth.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications—Data min-ing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering, Queryformulation, Search process; H.3.4 [Information Storage and Retrieval]: Systems and Software

General Terms: Algorithms, Design, Experimentation

Additional Key Words and Phrases: Query log analysis, query clustering, user search intent, user searchsession boundaries, user tasks, user task discovery, collective tasks, collective task discovery

ACM Reference Format:Lucchese, C., Orlando, S., Perego, R., Silvestri, F., and Tolomei, G. 2013. Discovering tasks from search enginequery logs. ACM Trans. Inf. Syst. 31, 3, Article 14 (July 2013), 43 pages.DOI: http://dx.doi.org/10.1145/2493175.2493179

1. INTRODUCTION

People rely heavily on Web search engines to organize their daily activities. A key reasonfor the popularity of today’s search engines is their user-friendly interface [Baeza-Yatesand Ribeiro-Neto 1999], which allows users to easily query for their needs by issuingtheir own lists of keywords. Users exploit this simple query-based interface to retrieveWeb information and resources, which in turn are used to perform one or more Web-mediated tasks [Spink et al. 2006], for example, finding a recipe, booking a flight,reading online news, etc.

This work has been partially supported by projects MIDAS (FP7 EU Grant Agreement no. 318786),InGeoCloudS (CIP-PSP EU Grant Agreement no. 297300), and MIUR PRIN ARS TechnoMedia.Authors’ addresses: C. Lucchese, R. Perego, and F. Silvestri, ISTI-CNR, Pisa, Italy; S. Orlando and G. Tolomei,DAIS, Universita Ca’ Foscari Venezia, Italy; corresponding author’s email: [email protected] to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for profit or commercial advantage and thatcopies show this notice on the first page or initial screen of a display along with the full citation. Copyrights forcomponents of this work owned by others than ACM must be honored. Abstracting with credit is permitted.To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of thiswork in other works requires prior specific permission and/or a fee. Permissions may be requested fromPublications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)869-0481, or [email protected]© 2013 ACM 1046-8188/2013/07-ART14 $15.00

DOI: http://dx.doi.org/10.1145/2493175.2493179

ACM Transactions on Information Systems, Vol. 31, No. 3, Article 14, Publication date: July 2013.

14:2 C. Lucchese et al.

Fig. 1. Our two-step task discovery process: user task and collective task discovery.

Although search engines nowadays offer several mechanisms to help their users(e.g., query suggestion, search-as-you-type, result diversification, etc.), in essence theyare document retrieval systems that answer a user query with a simple list of ten bluelinks. If the results returned are not satisfactory, the user may decide to refine his query,thereby getting a new list of results that is hopefully more relevant to his needs. How-ever, this query-look-refine paradigm is not effective for all the tasks to be accomplished.

We believe next-generation search engines should progress from being mere Webdocument retrieval tools to becoming multifaceted systems which fully support userswhile they are interacting with the Web. This creates novel and exciting researchchallenges ranging from the ability to recognize latent tasks from the issued queries, tothe design of new recommendation strategies and user interfaces for showing relevantresults.

This manuscript focuses on this first research challenge and proposes an effectivemethodology for discovering the tasks that users try to perform through queries theyissue to search engines.

Interesting search behaviors and patterns can be revealed by analyzing and miningsearch engine query logs, which record the activities of many users [Broder 2002; Roseand Levinson 2004; Lee et al. 2005; Jansen and Spink 2006; Silvestri 2010]. It has beendiscovered that query logs are suitable sources from which tasks can be extracted. Inconcrete terms, a very important aspect we can analyze from a query log is representedby query sessions, that is, specific sets/sequences of queries issued by a user whileinteracting with a search engine.

There are two distinct levels of granularity to consider when detecting the set oftasks from a query log. The first is the intra-user level, where tasks are searched forwithin individual user query sessions. The challenge here is to find relationships thatgo beyond lexical query similarity, for instance, the case of a query session for the userAlice which contains the queries new york hotel and waldorf astoria, which are notnecessarily issued consecutively. Those two queries clearly refer to the task reservinga hotel room in New York, yet they do not share any common terms. The second levelof granularity is the inter-user level, where even more subtle problems may occur andneed to be handled, for instance, let another user Bob type the following two queries:hotels in new york and holiday inn ny. Both users are clearly trying to achieve thesame task (i.e., reserving a hotel room in New York), but by means of different queries.This second problem occurs because distinct users tend to phrase even the same taskin several different ways.

Therefore, a task discovery methodology has to take into consideration these two pre-ceding aspects in order to be effective. To this end, we divide the problem of discoveringtasks into two subproblems, and we address them separately. First, we extract fromeach individual user session those sets of queries that were issued to achieve specifictasks. We call each of those sets a user task, since they strictly depend on each individ-ual user. Second, we consider all the user sessions of the query log, and we identify allthe queries related to a common task—possibly performed by distinct users—by group-ing together similar user tasks. We refer to each agglomeration of similar user tasksas a collective task. Figure 1 shows this two-step task discovery process just described.


Discovering Tasks from Search Engine Query Logs 14:3

The rationale behind this two-step strategy is as follows. In a previous work[Lucchese et al. 2011], we already proved that user tasks can effectively be foundby exploiting the lexical and semantic content similarity of queries issued by individu-als within specific search contexts (i.e., time-bounded subsessions of the original querysession). Conversely, the same approach would not be able to discover collective tasksif applied directly to users’ queries, because queries that are issued by two users, whichare lexically or semantically similar, might refer to different latent needs. In addition,this two-step method guarantees scalability, since the agglomeration needs to processa reduced number of objects, that is, groups of queries rather than single ones.

Finally, two kinds of contributions are presented in this manuscript: one concernsuser task discovery while the other is related to collective task discovery, and both areanalyzed in the following.

Contributions on User Task Discovery. In Lucchese et al. [2011], we already showedthat users perform multitasking search activities in the query streams issued to asearch engine [Spink et al. 2006]. Multitasking refers to the way users interact with asearch engine, by intertwining different tasks within the same time period. This makesit difficult to identify user tasks by just splitting each user session into time-basedsequences of queries. Thus, a more precise measure of the task relatedness betweenquery pairs was needed. To this end, we proposed an unsupervised learning approachfor measuring task-based query similarity, which relied on a selection of both internaland external query log features. Internal features are available from the original querylog data, whereas external ones can be derived from other data sources. This approachresulted in two query similarity functions. The first, σ1, was a convex combination of theselected query log features. The second, σ2, combined typical lexical content similaritymeasures with the collaborative knowledge provided by Wiktionary1 and Wikipedia.2These external knowledge bases were used to enrich the semantics of each query, that is,to wikify each query in order to make more accurate decisions during the actual usertask discovery step. Furthermore, the preceding notion of task relatedness betweenquery pairs was used to discover the final set of user tasks. To this end, we introduceda set of query clustering methods with the aim of grouping together task-related queries,namely queries that are assumed to be part of the same user task on the basis of aspecified task-relatedness measure. In particular, we compared two techniques derivedfrom well-known clustering algorithms, that is, K-MEANS [MacQueen 1967] and DB-SCAN [Ester et al. 1996] with two other graph-based methods. All four proposed solutionswere evaluated on a ground truth, that is, a set of manually annotated user tasks.Results showed that the latter two techniques performed better than the former, andthey also improved other state-of-the-art approaches noticeably.

As an innovation contribution to this work, we also propose and evaluate a super-vised learning approach for determining the task-based similarity between query pairs.Unlike the unsupervised learning approach introduced in Lucchese et al. [2011], herethe task relatedness is learned by training several classifiers in our ground truth. Inparticular, we exploit the binary classifiers introduced by Jones and Klinkner [2008]and use the prediction provided by these classifiers in order to determine whether twoqueries belong to the same task. The probability value associated to that prediction isin turn used as a measure of how strong the task relatedness is between each pair ofqueries.

We train the classifiers over all the features that Jones and Klinkner [2008] claimto be the most suitable for predicting if two queries belong to the same search goal.

1http://www.wiktionary.org2http://www.wikipedia.org



Moreover, we expand these training features with both Wikipedia, that is, the wik-ification of the query, as well as the URL overlapping degree between the resultsreturned by a search engine for each query, that is, the Jaccard index between thetop-20 URLs returned for each query. This supervised learning approach leads to a setof new task-based query similarity functions, which are in turn exploited by the twobest-performing clustering methods for user task discovery introduced in our previouswork, namely QC-WCC and QC-HTC.

Experimental results have shown that combining supervised task-relatedness learn-ing with our query clustering methods does not substantially improve the overall effec-tiveness in discovering user tasks. However, the performance of the classifiers proposedby Jones and Klinkner [2008] improves significantly with the combined use of wikifi-cation and URL overlapping along with the other features.

Furthermore, we test our two best-performing user task discovery methods on alarger dataset (besides evaluating them on the smaller ground truth, as in Luccheseet al. [2011]). Interestingly, we discover that the analysis conducted on both the smallerand the larger datasets are coherent, in other words, they result in similar conclusions.This means that we can be relatively confident that replicating experiments on bothdatasets would also lead to similar quantitative results.

Contributions on Collective Task Discovery. The true innovation in this work, whichcompletes the overall roadmap we sketched for finding tasks from search engine querylogs, is based on the notion of collective task (see Figure 1). In a similar way to discov-ering user tasks, we provide a second ground truth by manually grouping a set of usertasks into a set of collective tasks. In addition, we present and discuss four methods usedto actually discover collective tasks. All of them are user task clustering techniques,where each user task is represented by the bag of words of its composing queries. Eachsolution adopts a different clustering strategy (i.e., partitional vs. agglomerative) anda different user task similarity measure (i.e., cosine similarity vs. Pearson’s coefficient).We quantitatively evaluate the four methods on the ground truth of collective tasks.The best results were obtained by a partitional clustering technique, which uses thecosine similarity measure. Finally, this best-performing technique is also run on alarger dataset of user tasks, and its performance is assessed by means of examples ofevidence.

Structure of the Article. The rest of the article is organized as follows. Section 2describes related work on query log analysis and mostly focuses on query sessionboundaries detection. Section 3 provides the description and analysis of our benchmarkdataset, that is, the 2006 AOL query log. In Section 4, we propose our theoretical modeland the statement of the user task discovery problem (UTDP). Section 5 presents theconstruction of our ground truth by manually grouping queries that are considered to betask related in a portion of our sample dataset. In addition, we propose some statisticsrelating to this corpus of manually identified user tasks. Section 6 introduces severalapproaches for measuring the task relatedness between query pairs, that is, task-basedquery similarity functions, which in turn are exploited by the user task discoverymethods proposed and compared in Section 7. Thus, Section 8 shows the experimentswe conducted on user task discovery as well as the results we obtained. Section 9bridges the gap between user task and collective task discovery by introducing the ideaof collective tasks, and it introduces a set of four algorithms for finding collective tasksfrom the set of previously discovered user tasks. We test the quality of all the proposedsolutions by comparing them with a common manually-built ground truth. To test thestrength of the best-performing solution for collective task discovery, we apply it to alarger dataset and illustrate some resulting evidence. Finally, Section 10 presents ourconclusions and indicates some possible future research.



2. RELATED WORK

The analysis of query logs collected by Web search engines has increasingly gainedinterest within the Web mining research community. Query logs record informationabout the search activities of users and are thus precious data sources for under-standing how people search the Web [Silvestri 2010]. Moreover, a number of differentapplications, that is, caching, index partitioning, document prioritization, query sug-gestion, can benefit from analysis performed on search engine query logs [Silvestriet al. 2008].

Typical statistics that can be drawn from query logs simply consider the query setin order to measure query popularity, term popularity, average query length, distancebetween repetitions of queries or terms, etc. A more in-depth analysis consists instudying search sessions, that is, temporal sequences of queries issued by users.

The first study on a query log from a commercial search engine was conducted byJansen et al. [1998]. In their research, the authors analyzed a one-day log collectedby the Excite search engine, which contains 51,473 queries issued by 18,113 users. Inaddition, Silverstein et al. [1999] present an exhaustive analysis of a very large querylog collected by the AltaVista search engine, which consists of about a billion queriessubmitted in a period of 42 days by approximately 285 million users. The authors showinteresting results, including the analysis of user query sessions and of correlationsbetween query terms.

However, most works relating to mining query logs aim to understand the real intentbehind queries issued by users. Broder [2002] claims that the “need behind the query”in a Web context is not clearly informational, like in a standard information retrievaldomain. Hence, he introduces a taxonomy of Web searches by classifying the queriesinto three classes according to their intent: (i) navigational, whose intent is to reacha specific Web site; (ii) informational, which aims to acquire some information fromone or more Web documents; and (iii) transactional, whose intent is to perform someWeb-mediated activity. Rose and Levinson [2004] propose their own user search goalsclassification by adding more hierarchical levels to this taxonomy. Lee et al. [2005]describe whether and how search goal identification processes behind a user querymight be automatically performed on the basis of two features, namely past user-clickbehavior and anchor-link distribution.

Many works deal with the identification of user search sessions boundaries. Previouspapers on this topic, which is very relevant for our work, can be classified into thefollowing groups on the basis of the technique used: (i) time-based, (ii) content-based,and (iii) ad-hoc techniques, which usually combine (i) and (ii).

Time-Based. Time-based techniques have been extensively proposed in past researchworks for detecting meaningful search sessions because of their simplicity and easeof implementation. Indeed, these approaches are based on the assumption that thetime interval between queries issued by the same user is the predominant factor indetermining a topic shift in search activities. Roughly speaking, two consecutive userqueries are likely to be related if the time gap between them is lower than a fixedthreshold.

With this view in mind, Silverstein et al. [1999] first defined the concept of session asfollows: two consecutive queries are part of the same session if they are issued withina five-minute time window. On the basis of this definition, they split the benchmarkdataset into sessions containing 2.02 queries on average. He and Goker [2000] usedifferent thresholds ranging from 1 to 50 minutes to devise user sessions from a Excitequery log.

Radlinski and Joachims [2005] observe that users often perform a sequence of queriesbased on a similar information need, and they refer to those sequences of reformulated



queries as query chains. Their work presents a method for automatically detectingquery chains in query and click-through logs using a 30-minute threshold to determineif two consecutive queries belong to the same search session.

Jansen and Spink [2006] make a comparison of nine search engine transaction logsfrom the perspectives of session length, query length, query complexity, and contentviewed. In their paper, they provide another definition of session, that is, search episode,and describe it as the period of time between the first and the last recorded time stampon the search engine server from a particular user on a single day, so the session lengthmay vary from less than a minute to a few hours.

By using the same concept of search episode, Spink et al. [2006] investigatemultitasking behaviors for user interaction with the search engine. Multitasking dur-ing Web searches involves a seek-and-switch process between several topics within asingle user session. Again, the authors define a user session as the entire series ofqueries submitted by a user during one interaction with the search engine, so that thesession length may vary from less than one minute to a few hours. The results of thisanalysis performed on an AltaVista query log showed that multitasking is a commoncharacteristic of Web searching. In our work, we reveal the presence of multitaskingalso within shorter user activities.

Shi and Yang [2006] describe the so-called dynamic sliding window segmentationmethod, which is based on three temporal constraints: α as the maximum time intervalbetween two consecutive queries in the same session, β as the maximum inactivity timewithin the same session, and γ as the maximum length of a single session. The authorsempirically set α, β, and γ to be 5 minutes, 24 hours, and 60 minutes, respectively.

Finally, Richardson [2008] shows the value of long-term search engine query logswith respect to short-term, that is, within-session query information. He claims thatlong-term query logs can be used to better understand the world where we live andshows that query effects are long lasting. Basically, in his work, Richardson does notlook at term co-occurrences only within a search session (which he agrees to be a30-minute time window) but rather across entire query histories.

Content-Based. Some previous works propose exploiting the lexical content of queriesin order to determine session boundaries corresponding to possible topic shifts in thestream of queries issued by users [Lau and Horvitz 1999; He et al. 2002; Ozmutlu andCavdur 2005]. To this end, several search patterns are proposed by means of lexical com-parison based on different string similarity metrics (e.g., Levenshtein, Jaccard, etc.).

Nevertheless, approaches relying only on content features suffer the so-calledvocabulary-mismatch problem, namely the existence of topically-related queries with-out any shared terms (e.g., the queries nba and kobe bryant are completely differ-ent from a lexical content perspective, but they are undoubtedly related). In order toovercome this issue, Shen et al. [2005] compare expanded representations of queriesinstead of the actual queries themselves. Each individual expanded query is obtainedby concatenating the titles and the Web snippets for the top 50 results provided by asearch engine for the specific query. Therefore, the relatedness between query pairs iscomputed by using the cosine similarity between the corresponding expanded queries.

Ad-Hoc. Jansen et al. [2007] assume that a new search pattern always identifiesthe start of a new session. Moreover, He et al. [2002] show that statistical informationcollected from query logs can be used to estimate the probability that a search patternactually implies a session boundary. In particular, they extend a previous work [Heand Goker 2000] to consider both temporal and lexical information.

Similarly, Ozmutlu and Cavdur [2005] describe a mechanism for identifying topicchanges in user search behavior by combining time and query content features. Theytest the validity of their approach using a genetic algorithm in order to learn the



parameters of the topic identification task. The algorithm takes into account the topicshift and continuation probabilities of the dataset leveraging on query patterns (i.e.,lexical content) and time intervals.

Seco and Cardoso [2006] propose that a candidate query belongs to a new session ifit does not have any terms in common with the queries of the current session or thetime interval between the candidate query and the last query in the current session isgreater than 60 minutes.

Gayo-Avello [2009] presents an exhaustive survey on session boundary detectionmethods. Furthermore, the author introduces a new technique which works on thebasis of a geometric interpretation of both time gap and content similarity betweenconsecutive query pairs.

Other approaches tackle the session boundary detection problem by leveraging morecomplex models and by combining more advanced features.

Boldi et al. [2008] introduce the query-flow graph (QFG) as a model for representingdata collected in search engine query logs. Intuitively, in the QFG, a directed edge fromquery qi to query qj means that the two queries are likely to be part of the same searchgoal. Any path over the QFG may be seen as a searching behavior, the likelihood ofwhich is given by the strength of the edges along the path. The authors exploit thismodel to segment the query stream into sets of related information-seeking queries,thus reducing the problem to an instance of the asymmetric traveling salesman problem(ATSP).

Jones and Klinkner [2008] address a problem that appears to be similar to ours.In particular, they argue that within a user’s query stream, it is possible to recognizespecific hierarchical units, that is, search missions, which are in turn divided intodisjoint search goals. A search goal is defined as an atomic information need resultingin one or more queries, while a search mission is a set of topically-related informationneeds, resulting in one or more goals. Given a manually-generated ground-truth, theauthors investigate how to learn a suitable binary classifier, which is aimed at detectingwhether two queries belong to the same task. One of the various results is that theyrealize that timeouts, whatever their lengths, are of limited use in predicting whethertwo queries belong to the same goal and unsuitable for identifying session boundaries.However, the authors do not explore how such binary classifier could be exploited foractually segmenting users’ query streams into goals and missions.

Mei et al. [2009] present a general framework for studying sequences of search activ-ities performed by users. According to the hierarchical user search model introducedby Jones and Klinkner [2008], this framework captures the user behavior at multiplelevels of granularity: whole search sessions, search missions and their composite sub-tasks (i.e., search goals), blocks of related queries, individual queries, click-throughdata, and eye-tracking fixations.

Donato et al. [2010] adopt the same hierarchical structure of user search behaviorand show how search missions can be automatically identified on-the-fly as the userinteracts with the Web search engine. In particular, the authors plug this automaticsearch mission discovery mechanism into SearchPad, a novel Yahoo! application aimedat helping users keep track of results they have consulted. The novelty of this approachhowever is that it is automatically triggered only when the system decides, with a fairlevel of confidence, that the user is actually undertaking a search mission.

However, when no labeled training set is available, suitable mechanisms for identify-ing session boundaries, and thus for extracting search goals, may rely on unsupervisedlearning approaches. Typically, these mechanisms are based on query clustering algo-rithms which group queries that are related to the same need. The rationale for usingquery clustering is based on the assumption that if two queries end up in the samecluster, then they are topically related.



Beeferman and Berger [2000] introduce a technique for mining a collection of usertransactions with a search engine, which discovers clusters of similar queries andsimilar URLs. The proposed algorithm does not consider query content but analyzesURL co-occurences within the click-through data stored in the query log, which ismodeled as a bipartite graph.

Wen et al. [2002] describe a query clustering method that makes use of user logsto identify the Web documents users have selected for a certain query. The similaritybetween two queries may be deduced from counting the documents the users clickedon and that are shared between the queries. Fu et al. [2003] propose a hybrid methodto cluster queries by utilizing both query terms and query results, showing that thiscombination performs better than using either method alone.

Cao et al. [2008] propose a novel clustering algorithm for summarizing queries intoconcepts throughout a click-through bipartite graph built from a search log.

Leung et al. [2008] develop online techniques to extract concepts from the snippets ofquery results and use these concepts to identify related queries. Moreover, the authorspropose a new two-phase personalized agglomerative clustering algorithm that is ableto generate personalized query clusters.

A couple of more recent related works have appeared. Kotov et al. [2011] discussmethods for modeling and analyzing user search behavior that extends over multiplesearch sessions. The authors focus on two problems: (i) given a user query, identifyall related queries from previous sessions that the user has issued, and (ii) given amulti-query task for a user, predict whether the user will return to this task in thefuture. Both problems are modeled within a classification framework that uses featuresof individual queries and long-term user search behavior at different granularity. Theoutcomes of this work have improved search for complex information needs and havehelped in designing search engine features to support cross-session search tasks.

Finally, Guo et al. [2011] introduce the concept of intent-aware query similarity,namely a novel approach for computing query pair similarity, which takes into accountthe potential search intents behind user queries and then measures query similarity fordifferent intents using intent-aware representations. The authors show the usefulnessof their approach by applying it to query recommendation, thereby suggesting diversequeries in a structured way to search users.

3. QUERY LOG ANALYSIS

The research challenges we want to address in this work rely on extracting usefulpatterns from search engine query logs. However, query log analysis is often hard toperform due to the lack of publicly-available datasets. As a result, we used the 2006AOL query log as our referential dataset. This query log is a very large and long-termcollection consisting of about 20 million Web queries issued by more than 657,000 usersover three months (from 03/01/2006 to 05/31/2006).3

3.1. Session Size Distribution

We analyzed the entire AOL query log and extracted several statistics, such as thetotal number of queries, the number of queries in each session, the average durationof sessions, etc.

The distribution of long-term sessions size over the entire collection is depicted inFigure 2. This is characterized by a Zipf’s law, that is, one of a family of related discretepower law probability distributions [Reed 2001]. Indeed, 67.5% of user sessions containsless than 30 queries, meaning that more than 2/3 of the users issued about ten queriesper month on average. Besides this, longer user sessions, that is, sessions with more

3http://sifaka.cs.uiuc.edu/xshen/aol_querylog.html



1

10

100

1000

10000

100000

1 10 100 1000 10000 100000 1e+06

Freq

uenc

y

Session size (#queries)

Fig. 2. The distribution of long-term sessions size (log-log scale).

than 1,000 queries over three months, represent only about 0.14% of the whole records.Interestingly, this is compliant with the analysis of long-tailed distributions of queryfrequencies and query-term frequencies showed by Baeza-Yates et al. [2008]. Here,the authors observed that a small portion of the terms appearing in a large query logwere used very often, while the remaining terms were individually used less often. Inthe same way, a small portion of the user sessions contains a large number of queries,while the remaining sessions are composed of few queries. Finally, it is worth pointingout that in our analysis we do not consider empty sessions, which account for less than1% of the total.

3.2. Query Time-Gap Distribution

Since users tend to issue bursts of queries for relatively short periods of time, usuallyfollowed by longer periods of inactivity, the time gap between queries plays a significantrole in detecting session boundaries.

According to Silverstein et al. [1999], session boundaries are detected by consideringthe user’s inactivity periods, that is, the time gaps between consecutive queries in eachlong-term user session. To establish whether a time gap between two queries actuallyrefers to a session boundary, a suitable threshold � is needed. This may be obtainedby analyzing the distribution of time gaps between all the consecutive query pairs inour dataset.

We divide all the time gaps into several buckets of 60-seconds each. We then analyzethe query interarrival times distribution which, again, is revealed to be a power-law(see Figure 3). This model closely fits user behavior during Web search activities whenconsecutive queries issued within a short period of time are often not independentbecause they are also topically-related.More formally, given the following general form of a power-law distribution p(x),

p(x) = α − 1xmin

(x

xmin

)−α

,

where α > 1 and xmin is the minimum value of x from which the law holds, we areinterested in finding the value x, such that two consecutive queries whose time gapis smaller than x are considered to belong to the same time-gap session. When theunderlying distribution is unknown, it makes sense to assume a Gaussian distribution



10

100

1000

10000

100000

1e+06

1 10 100 1000 10000

#Que

ry p

airs

Time gap (min.)

Fig. 3. The distribution of time gaps between each consecutive query pair (log-log scale).

and use a threshold x = μ + σ being equal to mean μ plus standard deviation σ , whichresults in “accepting” λ = 84.1% of the samples. This is equivalent to considering thecumulative distribution P(x) = Pr(X ≤ x) and to determining x, such that P(x) = λ.

Since we know the underlying distribution, and since P(x) = Pr(X ≤ x) = 1− Pr(X >x), we map the threshold λ into our context as follows:

Pr(X > x) = C∫ ∞

xp(X) dX = α − 1

x−α+1min

∫ ∞

xX−α dX =

(x

xmin

)−α+1

.

Hence, for our purpose, we have to solve for x in the following equation:

P(x) = 1 − Pr(X > x) = 1 −(

xxmin

)−α+1

= λ = 0.841. (1)

The value xmin represents the minimum query pair time gap and corresponds to thefirst interval, that is, 60 seconds. Therefore, we estimate α = 1.564, and finally wecan solve Eq. (1), finding x � 26 minutes. This means assuming 84.1% of consecutivequery pairs are issued within 26 minutes. As described in detail in Section 8.2.1, inour experiments, we use this value as the threshold � for splitting each long-term usersession of the the query log.

4. USER TASK DISCOVERY PROBLEM

In this section, we address our research challenge, namely to find concrete user tasksrecorded in a search engine query log whose final aim is to satisfy a specific latent need.We use the generic term task to refer to this type of atomic latent need, whereas wegive the name user task to any concrete instance of a task performed by a particularuser. Indeed, users may repeatedly enact the same task over the time and often bymeans of several distinct queries.

In the following, we describe the theoretical model as well as the notation we adoptin order to formally define our research problem.

4.1. Theoretical Model

Let QL be a query log, which records the queries—along with other information, suchas user IDs, time stamps, clicked URLs, etc.—issued by a set U of N distinct users.

We denote by Su = 〈q1, q2, . . . , q|Su|〉 the chronologically ordered sequence of all thequeries in QL issued by a user u ∈ U . The sequence Su of the queries submitted by u



Fig. 4. A generic time-gap session decomposed in a set of interleaved user tasks θ i−1, θ i , and θ i+1.

is the result of multiple long-term interactions with the search engine. Therefore, weconsider each sequence Su to be composed of a sequence of time-gap sessions s, whichresult from applying a time-splitting technique, that is, Su = ⋃

si∈Susi. Basically, each

time-gap session contains the set of contiguous queries submitted by a user such thateach pair of consecutive queries is submitted within a time-gap threshold �.

Note that this time-splitting technique does not impose any restrictions on the totaltime elapsed between the first and the last query of a time-gap session. However, itprovides an inactivity time boundary to reasonably determine a shift in user task. Usu-ally, such an inactivity threshold is fixed arbitrarily. Conversely, we set � = 26 minuteson the basis of the analysis described in Section 3.2.

Since a time-gap session may include queries related to different needs due to mul-titasking [Spink et al. 2006; Lucchese et al. 2011], we further identify a partitioning ofeach time-gap session into subsets of related queries constituting distinct user tasks.

Definition 4.1 (User Task). Given a time-gap session s ⊆ Su of user u ∈ U , a user taskθ , θ ⊆ s, is the maximal subsequence of possibly nonconsecutive queries in s referringto the same latent need. The set of all user tasks in s is a partitioning of s.

The set of user tasks performed by u ∈ U is denoted by u = ⋃si⊆Su

⋃θ j⊆si

θ j , whilethe set of all the user tasks from all the N users is denoted by = ⋃

u∈U u.In general, the queries belonging to the various user tasks θ in each time-gap session

s may interleave due to multitasking. Thus we order the user tasks by looking atthe time stamps of the first query issued within each θ . Using this ordering, let θ i

denote the ith user task performed by the user within a time-gap session. This allowsus to represent each u as an ordered set, namely a sequence, of user tasks, thatis, u = 〈θ1, θ2, . . . , θ |u|〉. Figure 4 illustrates the partitioning of a generic time-gapsession s ⊆ Su to identify the various user tasks θ i and also summarizes the notationwe have just introduced.

Therefore, the problem of finding = ⋃u∈U u in a given query log QL can be

formulated as the user task discovery problem (UTDP), whose goal is to find the bestquery partitioning strategy π that approximates the actual set of user tasks u, whenused to segment the time-gap sessions recorded in the query log.

USER TASK DISCOVERY PROBLEM (UTDP): Given a query log QL and a useru ∈ U , let Tu be the set of user tasks discovered by the query partitioningstrategy π applied to the time-gap sessions of Su, that is, Tu = ⋃

si⊆Suπ (si).

Also, let = ⋃u∈U u and T = ⋃

u∈U Tu.The UTDP requires the best partitioning π , such that

π = argmaxπ

ξ (, T , π ), (2)

where function ξ (·) is an accuracy measure which evaluates how well thequery partitioning strategy π approximates the actual user tasks .



Fig. 5. Snapshot of the Web application used for generating the ground truth.

Several quality measures can be used to evaluate the accuracy of a user task extrac-tion method, and consequently, several ξ functions can be devised. In Section 8, weinstantiate ξ in terms of F1, the Rand index, and the Jaccard index.

5. GROUND TRUTH: DEFINITION AND ANALYSIS

According to the user task discovery problem statement, we need to find the querypartitioning strategy π that gives the set of user tasks T which best approximates theactual user tasks . Such optimal user task partitioning can be manually built from areal-world search engine query log.

To this end, we developed a Web application that helped human assessors manuallyidentify the optimal set of user tasks from the AOL query log. As a result, we produceda ground truth that can be used for evaluating any automatic user task discoverymethod, and which is also freely available to download.4 In Figure 5 we show a samplesnapshot of this Web application.

For each time-gap session, human annotators were presented with the sequence ofqueries as they were originally issued. They therefore grouped together those queriesthat they considered to be task related. In addition, they had opportunity to discardmeaningless queries from those sessions as well as to submit ambiguous queries tothe most important search engines (i.e., Google, Yahoo!, and Bing). For each manuallyidentified user task (i.e., set of task-related queries), the evaluators added a tag and,if appropriate, a longer description. The resulting dataset can be seen as a semanticknowledge base of users’ search goals (i.e., a taxonomy of user tasks).

Furthermore, dealing with such massive query log datasets typically needs a pre-processing phase in order to clean the collection and to make it more suitable forperforming further analysis. This meant that we removed query log records contain-ing both empty queries and query strings composed of only punctuation symbols. In

4http://miles.isti.cnr.it/~tolomei/downloads/aol-task-ground-truth.tar.gz



0

5

10

15

20

25

30

35

40

10 20 30 40 50 60

Freq

uenc

y (%

)

Time-gap session duration (min.)

Fig. 6. The distribution of time-gap session duration.

addition, we removed all the stop words from each query string. Then, we ran the Porterstemming algorithm [Porter 1980] in order to remove the most common morphologi-cal and inflectional English endings from the terms of each query string. Finally, wediscarded from the initial dataset those long-term user sessions containing too manyqueries, which were probably generated by robots. In particular, we are referring tothose sessions which had a total number of queries that would have been physicallydifficult for a human user to issue even, over a period of three months. Indeed, thelongest user session contains 240,180 queries, which amounts to approximately 2,669queries per day. On average, this would be like issuing approximately two queries eachminute 24 hours a day for 90 days, which is a highly unlikely query submission ratefor a human user.

Our sample dataset was based on the 500 user sessions with the highest numberof queries, herein called the top-500 dataset. This dataset contains a total amount of518,790 queries, meaning that each user issued on average approximately 1,038 queriesin three months, that is, roughly 12 queries per day. The maximum number of queriesin a user session is 7,892 and the minimum is 729. This means users submitted fromapproximately 8 to 88 queries per day. However, only a small fraction of the wholedataset (i.e., the first week of user activities) was showed to annotators in order tosimplify the overall manual labeling step. From now on, we will refer to that sub-set as top-500-1week. As regards the human evaluators, they were selected from ourlaboratory but were not directly involved in this work.

The manual annotation procedure referred to a total of 2,004 queries, from which446 time-gap sessions were extracted automatically. one hundred thirty-nine time-gapsessions were discarded as meaningless by the annotators and were therefore removedfrom the ground truth. In the end, 1,424 queries were actually clustered from 307 time-gap sessions.

Figure 6 shows the distribution of time-gap session length, using a discretizationfactor of 60 seconds. While there are a large number of sessions which are shorterthan one minute and which usually contain only one or two queries, the duration of atime-gap session is nevertheless 15 minutes on average. Indeed, as can be observed,sessions lasting 40 minutes (or more) occurred on a fairly frequent basis. Even in thesecases, the session lengths suggest that the interaction of users with search engines isnontrivial and that it is likely to include multitasking behaviors. The longest time-gap



0

5

10

15

20

25

30

5 10 15 20 25 30 35 40

Freq

uenc

y (%

)

Time-gap session size (#queries)

Fig. 7. The distribution of time-gap session size.

0

5

10

15

20

25

30

35

40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17

Freq

uenc

y (%

)

User Task size (#queries)

Fig. 8. The distribution of user task size.

session lasted 9,207 seconds, that is, about two and a half hours, and this happenedonly once in our dataset.5

In Figure 7, we report the time-gap session size distribution. On average, each time-gap session contained 4.49 queries, and sessions with at most five queries coveredslightly more than half of the query log. The other half of the query log contained longersessions, where there was a high likelihood that multiple tasks were carried out.

The total number of human-annotated user tasks were 554, with an average of2.57 queries per user task. The user task size distribution is illustrated in Figure 8.Furthermore, the average number of user tasks accomplished in a time-gap sessionwas 1.80 (see Figure 9).

Among all 307 time-gap sessions considered, 145 contained multiple user tasks.We found that this 50% split between single-tasking and multitasking sessions was

5It is highly likely that this user was a robot or a script.



0

10

20

30

40

50

60

1 2 3 4 5 6 9

Freq

uenc

y (%

)

#User Tasks per time-gap session

Fig. 9. The distribution of user tasks per time-gap session.

consistent across the various users. Interestingly enough, this shows that a good taskdetection algorithm has to be able to handle efficiently both single- and multitaskingsessions. If we considered all the queries included in each user task, then 1,046 out of1,424 queries were included in multitasking sessions, meaning that about 74% of theuser activity was multitasking.

Finally, we also evaluated the degree of multitasking by taking into account thenumber of overlapping user tasks. We say that a jump occurs whenever two queries ina manually labeled user task are not consecutive. For instance, let su = 〈q1, q2, . . . , q9〉be a time-gap session in Su, and let θ1

u , θ2u , θ3

u be the result of the manual annotationprocedure for su, where θ1

u = {q1, q2, q3, q4}, θ2u = {q5, q7}, and θ3

u = {q6, q8, q9}. In thiscase, the number of jumps observed in su is two, because there are two query pairs(q5, q7) ∈ θ2

u and (q6, q8) ∈ θ3u which are not consecutive. The number of jumps gives a

measure of the simultaneous multitasking activities. We denote by j(su) the simulta-neous multitasking degree of su as the ratio of user tasks in su which have at least onejump. In the previous example, j(su) � 0.67, since two out of three user tasks containat least one jump. In Figure 10, we show the distribution of the multitasking degreeover all the time-gap sessions. Note that the result for j(su) = 0 is omitted, because wealready know that 50% of the sessions were single-tasking.

6. TASK-BASED QUERY SIMILARITY

Task relatedness is the probability that two queries belong to the same user task, wherethe user task is not known in advance. A good task-relatedness measure is the buildingblock for discovering user tasks from a given query log QL.

In this section, we describe three different ways of computing task relatedness, theresult of which is a set of task-based query similarity functions. The first approach,named time-based, considers only the time gap between two adjacent queries. Wesubsequently evaluate both unsupervised and supervised learning approaches by ex-ploiting a number of query log features as well as external knowledge bases, such asWikipedia and Yahoo! Search Boss API.

6.1. Time-Based Approach

The simplest approach for measuring task-based query similarity was presented bySilverstein et al. [1999]. The proposed measure is based on the assumption that if two



0

2

4

6

8

10

0.2 0.4 0.6 0.8 1

Freq

uenc

y (%

)

Multi-tasking degree

Fig. 10. The distribution of multitasking degree.

consecutive queries are issued within a small enough time window t, then they are alsovery likely to be task related. In this approach, only the query submission time of aquery, denoted by τ (q), is taken into account.

Given an adjacent query pair (qi, qi+1) belonging to the same su, the following binarysimilarity function σtime is defined.

σtime(qi, qi+1) ={

1, if |τ (qi+1) − τ (qi)| ≤ t;0, if |τ (qi+1) − τ (qi)| > t.

(3)

The effectiveness of this approach depends on the length of the time window t. Severalpast works empirically evaluated different values of t, ranging from 5 to 60 minutes[Silverstein et al. 1999; Shi and Yang 2006; Richardson 2008]. Instead, in this work,we devised a threshold value t = 26 minutes from a statistical analysis of the querylog, as already discussed in Section 3.2.

6.2. Unsupervised Learning Approach

We first discuss two classes of features, that is, internal and external. Then, we definetwo task-based query similarity functions which exploit both classes of features.

6.2.1. Feature Selection. Most of the previous approaches used to measure query sim-ilarity are based on the lexical content of queries [Salton and Mcgill 1986]. The effec-tiveness of those approaches is, however, quite low due to the short length of queries(about 2.5 words per query, on average) [Jansen et al. 1998; Silverstein et al. 1999],as well as the lack of contextual information in which queries are issued [Wen et al.2002].

Some approaches try to expand these short queries by exploiting URLs returned byWeb search engines [Glance 2001], as well as retrieved Web documents [Raghavan andSever 1995], or even Web document snippets [Leung et al. 2008], that is, two queriesare similar if they return similar results.

In line with these approaches, we propose two similarity measures by consideringqueries’ lexical content and semantics.

Content-Based (σcontent). Two queries that share some common terms are likely to berelated. Sometimes, the terms may be very similar, but not identical, due to misspellingor different prefixes/suffixes. To capture content similarity even in those cases, we adopt



a Jaccard index on character tri-grams [Jarvelin et al. 2007]. Let T (q) be the charactertri-grams from the terms of query q, we define the similarity σ jaccard as follows.

σ jaccard(qi, qj) = |T (qi) ∩ T (qj)||T (qi) ∪ T (qj)| .

In addition, we exploit a normalized Levenshtein similarity σlevenshtein, which Jonesand Klinkner [2008] claimed to be the best edit-based feature for identifying goalboundaries. Finally, the overall content-based similarity is computed as follows.

σcontent(qi, qj) = σ jaccard(qi, qj) + σlevenshtein(qi, qj)2

.

Semantic-Based (σsemantic). In order to enrich queries with a semantic-based context,it is possible to use external knowledge bases. Several semantic relatedness metricsdealing with semantic resources have previously been proposed. They can be classifiedas follows.

—Path-based, in which knowledge is modeled as a graph of concepts, and the metricsrely on paths over that graph [Rada et al. 1989; Leacock and Chodorow 1998].

—Information content-based takes into account the information content of a con-cept [Resnik 1995].

—Gloss-based considers term overlaps between definitions of concepts [Lesk 1986].—Vector-based models each concept as a vector of terms [Gabrilovich and Markovitch

2007] or anchor links [Milne and Witten 2008].

On the basis of the last approach, we assume that either a Wiktionary entry or aWikipedia article describes a certain concept and that the presence of a term in a givenarticle is evidence of the correlation between that term and that concept. We describethe wikification

−→C (t) of a term t as its representation in a high-dimensional concept

space−→C (t) = (c1, c2, . . . , cW ), where W is the number of articles in the knowledge base,

and ci scores the relevance of the term t for the ith article. We measure the relevanceci by using the well-known t f -idf score [Salton and Mcgill 1986].

In order to wikify a query q, we sum up the contribution from its terms.−→C (q) =

∑t∈q

−→C (t).

The task relatedness σwiki f ication(qi, qj) between two queries qi and qj is estimated bythe cosine similarity of their wikification.

σwiki f ication(qi, qj) =−→C (qi) · −→

C (qj)

‖−→C (qi)‖‖−→C (qj)‖.

We use this wikification process by exploiting the Wiktionary or Wikipedia knowledgebases to compute the query similarity measures σwiktionary and σwikipedia. Finally, theoverall semantic-based similarity is computed as follows.

σsemantic(qi, qj) = max(σwiktionary(qi, qj), σwikipedia(qi, qj)).

6.2.2. Similarity Functions. We blend the lexical content σcontent and semantic σsemanticsimilarity scores is through a convex combination.

σ1 = α · σcontent + (1 − α) · σsemantic. (4)

We also propose a novel similarity function σ2 on the basis of the following conjecture.If two queries have similar lexical content, they are very likely to be task related, and



semantic expansion could be useless. On the other hand, queries could be similar evenif they do not share any lexical features, for example, the two queries nba and kobebryant. If there is great similarity in content, then we can be confident that queriesare task related. Otherwise, we compute the similarity score as the maximum valuebetween content- and semantic-based similarities.

σ2 ={

σcontent, if σcontent ≥ t;max(σcontent, b · σsemantic), otherwise.

(5)

Both σ1 and σ2 rely on the estimation of certain parameters, that is, α for σ1, t and bfor σ2, which were learned directly from the ground truth.

6.3. Supervised Learning Approach

Jones and Klinkner [2008] argued that user search activities can be divided into hier-archical units, that is, search missions, which are in turn composed of disjoint searchgoals. According to the authors, a search goal is defined as an atomic information needresulting in one or more queries, while a search mission is a set of topically-relatedinformation needs resulting in one or more goals. Therefore a search goal is equivalentto our definition of user task (see Def. 4.1).

Jones and Klinkner [2008] investigated a supervised learning approach. Basically,given a pair of queries, they used several query features, namely temporal, word- andcharacter-edit, query log sequence, and Web search features, to predict whether thetwo queries belonged to the same goal/mission. It is worth noting that the authors didnot explore how such binary classifiers could be exploited for segmenting user querystreams into goals and missions, and thus for actually discovering search goals, thatis, user tasks and search missions.

In addition, we extend our study to include supervised learning approaches in orderto detect queries belonging to the same user task. The set of features we use includesthose used by Jones and Klinkner [2008], F jk, and the features we have already definedin previous sections.

We train several binary classifiers using various combinations of those features onour ground truth of user tasks. Finally, the output of these classifiers determines the12 new task-based query similarity scores, which in turn will be exploited by our twobest-performing user task discovery methods, as specified later in Section 7.2.1.

6.3.1. Feature Selection. According to Jones and Klinkner [2008], given any two queriesqi and qj , the following set of features F jk provides best accuracy results in predictingwhether they are part of the same user task.

—edlevGT2. This is a binary feature that evaluates as 1 if the normalized Levenshteinedit distance between qi and qj is greater than 2, or 0 otherwise.

—wordr. This feature corresponds to the Jaccard distance between the sets of wordswhich qi and qj are composed of.

—char suf. This feature counts the number of common characters between qi and qj ,starting from the right.

—nsubst qj X. Given P(qi −→ qj), the probability of qi being reformulated as qj , thisfeature is computed as count(X : ∃ P(qj −→ X)).

—time diff. This feature represents the inter-query time gap between qi and qj , ex-pressed in seconds.

—sequential. This binary feature is positive if the queries qi and qj are sequentiallyissued.

—prisma. This feature refers to the cosine distance between vectors obtained from thetop-50 Web pages returned as search results for the terms of qi and qj , respectively.



—entropy qi X. This feature relates to the entropy of rewrite probabilities from queryqi and it is computed as

∑k P(qk|qi) log2(P(qk|qi)).

In addition to this set F jk, we propose two further features. The first is the semanticfeature we already used for computing the task relatedness measures of our unsuper-vised learning approach, that is, σwikipedia (see Section 6.2.1). The second is σ jaccard url,which measures the Jaccard similarity between the top 20 URLs returned as searchresults in response to qi and qj . The rationale for this feature is to capture the similar-ity between two apparently different queries, which share many relevant links to Webdocuments, that is, URLs retrieved by the most popular Web search engines. Giventhat url20(q) = {u1, u2, . . . , u20} is the set of top-20 URLs returned by a search engine inresponse to the query q, this feature was computed as follows.

σ jaccard url(qi, qj) = |url20(qi) ∩ url20(qj)||url20(qi) ∪ url20(qj)| .

In our test, we retrieved each url20(q) by querying the Yahoo! search engine via itsSearch Boss API.6

6.3.2. Binary Classifiers. We exploited the features described in Section 6.3.1 to trainseveral binary classifiers. In particular, we devised four different combinations of fea-tures extracted from our manually-generated ground truth described in Section 5.

—F1 ≡ F jk.—F2 = F jk ∪ σwikipedia.—F3 = F jk ∪ σ jaccard url.—F4 = F jk ∪ σwikipedia ∪ σ jaccard url.

In particular, we used three different kinds of classification algorithms.

—Cdt. A clone of the C4.5 decision tree learner [Quinlan 1993].—Cnb. A naıve bayesian learner.—Clr. A logistic regression learner.

Therefore, the classification step requires training in the preceding set of threeclassifiers, that is, C = {Cdt, Cnb, Clr}, using the four possible feature sets, that is,F = {F1,F2,F3,F4}. By combining each of the three classifier models with each fea-ture set, we obtained 12 distinct classifiers, that is, Cy

x , where x ∈ {dt, nb, lr} andy ∈ {1, 2, 3, 4}.

The training set of each classifier was generated by considering each query pair(qi, qj) in our ground truth, and to each of them, we assigned a binary class at-tribute same task = yes if and only if qi and qj were part of the same task, otherwisesame task = no. Note that these supervised methods are the only ones which exploitthe task labeling of the ground truth.

6.3.3. Similarity Functions. The same task = yes class probability prediction provided byeach binary classifier can be interpreted as a similarity score. This in turn is exploitedby the clustering techniques presented in Section 7.2.1.

Table I summarizes the similarity functions defined as a combination of the classifi-cation algorithm and the set of exploited features.

In the following, we describe the performance of each similarity function. All theevaluations were measured on the basis of ten-fold cross-validation.

First, we start by showing some stratified cross-validation statistics, namely Kappacoefficient, Mean absolute error, Relative absolute error, and Root relative squared error.

6http://developer.yahoo.com/search/boss/



Table I. Supervised Task-Based Query Similarity Functions

Feature Set

F1 F2 F3 F4

Classification Algorithm

Cdt σ 1dt σ 2

dt σ 3dt σ 4

dt

Cnb σ 1nb σ 2

nb σ 3nb σ 4

nb

Clr σ 1lr σ 2

lr σ 3lr σ 4

lr

Table II. Statistical Indicators on the Set of Classifiers Derived from Cdt

Statistical Indicators

Kappa Mean abs. err. Rel. abs. err. (%) Root rel. sq. err. (%)

Classifiers

C1dt 0.61 0.02 48.39 72.30

C2dt 0.62 0.02 48.22 72.54

C3dt 0.63 0.02 47.34 72.40

C4dt 0.63 0.02 46.90 72.30

Then, we express the performance of each binary classifier in terms of TP Rate, FP Rate,Precision, Recall, and F1.

All these measures are shown separately for the two classes of prediction, thatis, same task = yes and same task = no, and then appropriately weight-averaged toprovide the reader with a global performance indicator. In particular, TP Rate refersto the ratio of true positive examples, that is, queries that are correctly classified (foreach of the two classes of prediction), and it is equivalent to Recall. Similarly, FP Ratedescribes the ratio of false positive examples, that is, queries that are misclassified(for each of the two classes of prediction). Furthermore, Precision is the proportion ofexamples of a certain class among all those that are classified with that class. Finally,F1 is the harmonic mean of Precision and Recall.

F1 = 2 × Precision × Recall(Precision + Recall)

.

Note that the distribution of query pairs across these two classes are far from be-ing uniform: the class same task = yes is much less probable than same task = no.Therefore, the classifier performance measured on the first class might be noticeablylower than the one computed on the second class.

Decision Tree Classifier (Cdt). This classification algorithm is based on a clone of theC4.5 decision tree learner [Quinlan 1993]. First, Table II shows some stratified cross-validation statistical indicators on the four binary classifiers, that is, C1

dt, C2dt, C3

dt, andC4

dt. Here, it is worth noting that our newly introduced features, namely σwikipedia andσ jaccard url, improve the performance of the classifier, especially when both are used, asin the case of C4

dt.Moreover, Table III describes the performance of the binary classifiers in terms of

TP Rate, FP Rate, Precision, Recall, and F1. For each classifier, we indicate the valuesof all the preceding measures both individually for each class of prediction and globally,by weight-averaging them over the two classes.

Interestingly, we can derive that our newly proposed features affect two complimen-tary aspects of the performance. Indeed, on the one hand, the introduction of σwikipediaincreases the recall of positive examples, that is, those that are actually labeled withsame task = yes. On the other hand, the usage of σ jaccard url helps to increase theprecision of positive examples. Finally, when combined together in C4

dt, we obtained the



Table III. Performance Evaluation of the Set of Classifiers Derived from Cdt

Accuracy MeasuresTP Rate FP Rate Precision Recall F1 same task

Classifiers

C1dt

0.997 0.473 0.990 0.997 0.993 no0.527 0.003 0.768 0.527 0.625 yes0.987 0.463 0.985 0.987 0.985 (weighted avg.)

C2dt


C3dt


C4dt


Table IV. Statistical Indicators on the Set of Classifiers Derived from Cnb



Classifiers

C1nb 0.52 0.03 76.01 91.26

C2nb 0.50 0.02 59.14 89.16

C3nb 0.53 0.03 76.69 91.65

C4nb 0.51 0.02 59.51 89.21

Table V. Performance Evaluation of the Set of Classifiers Derived from Cnb


Classifiers

C1nb


C2nb


C3nb


C4nb


very best results. This means that the similarity function of choice for this classificationalgorithm is σ 4

dt.

Naıve Bayesian Classifier (Cnb). This classification algorithm is based on a naıvebayesian learner. Table IV shows the statistical indicators for each classifier, that is,C1

nb, C2nb, C3

nb, and C4nb. Moreover, Table V describes the performance of these four binary

classifiers.In this case, the very best accuracy is obtained both with C1

nb and C3nb, by using the

set of features F1 and F3, respectively. This means that the newly introduced features,namely σwikipedia and σ jaccard url, do not significantly improve the performance of theclassifier. Therefore, the chosen similarity functions can be either σ 1

nb or σ 3nb.



Table VI. Statistical Indicators on the Set of Classifiers Derived from Clr



Classifiers

C1lr 0.48 0.02 60.89 78.03

C2lr 0.48 0.02 60.84 77.99

C3lr 0.48 0.02 60.88 78.02

C4lr 0.48 0.02 60.82 77.98

Table VII. Performance Evaluation of the Set of Classifiers Derived From Clr


Classifiers

C1lr


C2lr


C3lr


C4lr


Logistic Regression Classifier (Clr). This classification algorithm is based on logisticregression. Table VI describes the statistical indicators of four binary classifiers, thatis, C1

lr, C2lr, C3

lr, and C4lr, which are obtained using this approach in combination with the

sets of features F1,F2,F3, and F4. As this table highlights, no significant differencesarise from this comparison. Instead, a more detailed evaluation of the performance ofeach classifier is provided in Table VII. Although all four classifiers behave similarlyin general, comments can be made nevertheless. In particular, it is worth noting thatadding our new set of features results in better true positive and false positive rates. Itis nonetheless true that these enhancements are not crucial to the overall performance.Therefore, any classifier could be chosen almost arbitrarily as well, as their relatingsimilarity functions, that is, σ ∗

lr, where ∗ ∈ {1, 2, 3, 4}.We can conclude that the very best performing classifier is C4

dt. Indeed, consideringthe weighted-average performances, it gains nearly 0.5% in terms of F1, and it reducesthe FP Rate by about 8.4% to the best naıve bayesian classifiers, that is, C1

nb andC3

nb. Similarly, it gains roughly 0.5% in terms of F1 and it reduces the FP Rate byapproximately 35.3% to the best logistic regression classifiers, that is, C4

lr. This meansthat σ 4

dt could be considered the very best query similarity function in order to determinetask relatedness.

As a last note, we would like to comment on why the supervised learning approachproposed by Jones and Klinkner [2008] alone is not suitable for effectively discoveringuser tasks, and why we used it only to learn the task relatedness, which in turn is fedinto more complex user task discovery methods, described in Section 7.

Let us consider only the very best classifier, namely C4dt. Among a total of 113,474

classified query pairs, 112,009 (i.e., 98.7%) were correctly classified. However, the dis-tribution of query pairs across the two classes is very skewed, since 111,080 (i.e.,97.9%) belong to one class, namely same task = no. It turns out that evaluating the



performance of the classifier only in terms of its accuracy might overestimate its actualeffectiveness. A fairer approach is to validate the classifier on the rarest class, whichis same task = yes. If we focus only on the ability of the classifier to correctly predictqueries that were actually in the same tasks, then precision reaches at most 78% inthe best case, which is considerably lower than the 98.7% obtained on average.

Section 8.2.3 shows the outcomes from the two best-performing user task discoverymethods using the three best supervised similarity scores, that is, σ 4

dt, σ 1nb (or, equiva-

lently, σ 3nb), and σ ∗

lr, where ∗ ∈ {1, 2, 3, 4}.7. DISCOVERING USER TASKS

In this section, we tackle the user task discovery problem (UTDP) in Section 4.1, as wellas present and discuss several clustering techniques which adopt the query relatednessmeasures presented in Section 6. We use as baseline the time-based task relatednessmeasure σtime (see Section 6.1), that is, a simple splitting by using a time threshold t.

7.1. Time Splitting

For any consecutive query pair (qi, qi+1), if σtime(qi, qi+1) = 1, then time splitting consid-ers both queries as part of the same session. Otherwise, qi is the last (resp. qi+1 is thefirst) query in a distinct session. The time complexity of time-splitting is linear in thenumber of input queries, and in this work, we use time splitting as the preprocessingstep in order to approach UTDP. In fact, there exist task-oriented sessions that aremade up of sequences of consecutive queries (i.e., no multitasking). In such cases, time-splitting methods are a suitable choice. Therefore, we choose to adopt time-splittingtechniques as the baseline method to discover user tasks, by using the time thresholdsTS-5 and TS-15, that is, 5- and 15-minute thresholds, as well as TS-26 [Silversteinet al. 1999; He and Goker 2000] (Section 8.2). The threshold of t = � = 26 minutes(TS-26) was determined on the basis of the statistical analysis we conducted on thetesting dataset (see Section 3.2).

Time-splitting techniques are not able to deal with task-related sessions, since theycan only identify sequences of timely-consecutive queries, whereas multitasking ses-sions represent a significant sample of all the task-related sessions (see Section 5 formore details on the analysis).

7.2. Query Clustering

In order to discover user tasks, in the following, we present several clustering algo-rithms that we apply to TS-26 split time-gap sessions

We start by describing two algorithms derived from well-known clustering methods:QC-MEANS [MacQueen 1967] and QC-SCAN [Ester et al. 1996]. In addition, we introducetwo graph-based techniques: QC-WCC and its computationally-lighter variation QC-HTC.The effectiveness of all these methods mostly depends on the robustness of the simi-larity functions, that is, the measures of task-based query similarity (as described inSection 6) which are exploited by the algorithms.

7.2.1. Algorithms. While QC-MEANS and QC-SCAN are inspired to well-known clusteringalgorithms, QC-WCC and QC-HTC follow a graph-based approach. QC-WCC identifies usertasks from the connected components of a query similarity graph, while QC-HTC, whichis a variation of QC-WCC, is aimed at reducing the computational cost of clusteringwithout affecting the overall effectiveness.

Each query clustering method is associated with a specific partitioning strategy π andoperates on each time-gap session. Let s = 〈qi . . . qi+n−1〉 be a generic time-gap sessionbelonging to a long-term session S of user u, where |s| = n, and i ≥ 1, 1 ≤ n ≤ |S|− i +1.It is worth noting that we do not use the subscript symbol u when there is no ambiguity



for users in order to simplify the readability of notation. Each algorithm provides asoutput π (s) = {t1, t2, . . . t|π (s)|}, that is, the set of user tasks of s, obtained by applying apartitioning strategy π ∈ {QC-MEANS, QC-SCAN, QC-WCC, QC-HTC}.

QC-Means. This is a centroid-based algorithm and a variation of the well-knownK-MEANS [MacQueen 1967]. We replaced the usual K parameter, that is, the numberof clusters to be extracted, with a ρ threshold which establishes the maximum radiusof each cluster. This allowed us to better deal with the varying lengths of the varioususer sessions as well as to avoid specifying the number K of final clusters a priori.

At each step, a query qi ∈ s ⊆ S is either added to an existing cluster of queries tj ifits similarity with respect to the centroid query of tj is at least 1 −ρ; otherwise qi itselfbecomes the centroid of a new cluster tk. The worst case is when each cluster containsa single query. In this case, we need to compute the similarity between all query pairsand the complexity of QC-MEANS becomes quadratic in the size of the input, that is,O(n2).

QC-SCAN. It is the density-based DB-SCAN algorithm [Ester et al. 1996], specificallytailored to extract user tasks from Web search engine query logs. The rationale foralso evaluating a variation of DB-SCAN is that a centroid-based approach may sufferfrom the presence of noise in query logs. Again, QC-SCAN may require computing thesimilarity of all query pairs, thereby making its worst-case time complexity quadraticin the size of the input.

QC-WCC. This algorithm extracts query clusters corresponding to weighted connectedcomponents of a graph [Lucchese et al. 2011]. Given a time-gap session s ⊆ S, we firstbuild a complete graph Gs = (V, E, w) whose vertices V are the queries in s, that is,V = {qi | qi ∈ s}, and whose E edges are weighted by the similarity of the correspondingvertices. The weighting function w, w : E �−→ [0, 1], is computed in terms of the task-based query similarity functions proposed in Section 6. Thus, the graph Gs models thetask-based similarity between any pair of queries in the given time-gap session.

The algorithm works in two steps. In the first, given the graph Gs, we remove weakedges whose weights are smaller than a given threshold, that is, w(e) < η, thus obtain-ing a pruned graph G′

s. In the second step, we extract the connected components of thegraph and consider them as clusters of task-related queries π (s) = {t1, t2, . . . t|π (s)|}.

Assuming a robust similarity function, the QC-WCC algorithm is able to handlethe multitasking nature of users sessions. Groups of related queries are isolatedby the pruning of weak edges. Links with high similarity identify the generaliza-tion/specialization steps of the users, as well as restart from a previous query when thecurrent query chain is found to be unsuccessful.

The computational complexity of QC-WCC is dominated by the construction of thegraph Gs. The similarity between any pair of edges must be computed, resulting in anumber of computations which is quadratic in the number of vertices, that is, O(|s|2).On the other hand, the connected components of a graph can be easily computed inlinear time (in terms of the numbers of vertices and edges of the graph) using eitherbreadth-first search or depth-first search [Hopcroft and Tarjan 1973]. In either case,a search that begins at a particular vertex v eventually finds the entire connectedcomponent containing v before returning.

QC-HTC. This is a variation of the preceding QC-WCC algorithm, which does not needto compute the full similarity graph yet maintains the quality of the obtained queryclustering QC-WCC [Lucchese et al. 2011]. The graph we consider is not complete. Weuse an edge-weighting function w, w : E �−→ [0, 1], which is computed in terms of thetask-based query similarity functions proposed in Section 6. Similarly to QC-WCC, for



QC-HTC, we also exploit a threshold η: two queries q and q′ cannot be considered taskrelated if w(e(q, q′)) < η, where e(q, q′) ∈ E.

The algorithm works in two phases. In the first, we identify query chains withineach time-gap session s ⊆ S. Each chain, called a sequential cluster, is denoted by tjand only contains consecutive queries in a given time-gap session, where each query issimilar (task related) to the chronologically following one. This means that to detect thevarious tj , we only need to compute the weights of the edges e(qi, qi+1), where queries qiand qi+1 occur consecutively in the session. Note that a chain of k task-related queries(qj, . . . , qj+k) must be maximal. If qj−1 (resp. qj+k+1) exists in s, then w(e(qj−1, qj)) < η(resp. w(e(qj+k, qj+k+1)) < η).

The rationale for first detecting query chains is that without losing generality, a usertask can be decomposed into a set of these chains even in the presence of multitasking.Unsurprisingly, due to multitasking, chains of different user tasks can interleave ina given time-gap session. Thus, the algorithm has to finally identify user tasks byrecomposing these chains.

The latter phase of the algorithm therefore merges the sequential clusters but doesnot compute the similarity measures between all the queries included in each cluster.

Instead, we guess that a sequential cluster can be well described by its(chronologically-) first and last queries, respectively, denoted by head(tj) andtail(tj) [Lucchese et al. 2011]. This is because a user involved in a given task oftencarries out a process of specialization/generalization of queries, and thus the middlequeries might be less representative of the user’s real intent. For example, two userscould start a chain from the same query (head) and end in two different query special-izations (tail), whereas they could start a chain from different queries (head) and endin the same specialization (tail). Therefore, the similarity sim between two sequentialclusters tj , tk is computed as follows.

sim(tj, tk) = minq∈{head(tj ),tail(tj )} q′∈{head(tk),tail(tk)}

w(e(q, q′)),

where w weights the edge e(q, q′) linking the queries q and q′ with respect to theirtask-based similarity, analogously to QC-WCC.

We can finally discuss in more detail how this second clustering phase works. Thefirst cluster t1 is initialized with the oldest sequential cluster t1 in a given session, andt1 is removed from the set of sequential clusters. Then, t1 is compared with any otherchronologically-ordered sequential cluster tj by computing the similarity as previously.We still use the threshold η: only if sim(t1, tj) ≥ η, then tj is merged into t1, the head andtail queries of t1 are updated consequently, and tj is removed from the set of sequentialclusters. The algorithm continues comparing the new cluster t1 with the remainingsequential clusters. When all the sequential clusters have been considered, the oldestsequential cluster available is used to build a new cluster t2, and so on. The algorithmiterates this procedure until no more sequential clusters are left.

The worst-case complexity of QC-HTC is still quadratic in the number of queries in s; inpractice there are frequent cases in which the real execution time results to be greatlyreduced with respect to QC-WCC. First, note that the first step of QC-HTC only computesthe similarity between time-adjacent queries, and thus its computational cost is linearin the number of queries in s. We already showed that 52.8% of the time-gap sessionscontain one user task only. Hence, it is highly likely that such user tasks are justfound after the first step of the algorithm, if these tasks exactly correspond to chains oftask-related queries. To detect multitasking sessions, the second step of the algorithmmerges chains, and thus the complexity of is quadratic in the number m of sequentialclusters extracted, that is, O(m2). If m = β · |s| , with 0 < β ≤ 1, while the asymptoticcomplexity is still quadratic in |s| since β is a constant, in practice the execution time



of the second step is reduced by a factor β2. In addition, the algorithm can run evenfaster, since QC-HTC algorithm does not compute all the pairwise similarities amongthe sequential clusters in advance.

8. EXPERIMENTS ON USER TASK DISCOVERY

In this section, we discuss the results obtained with all the user task discovery methodsfor approaching the UTDP, which were described in Section 7. In addition, we compareour results with those provided by two other task discovery methods: (i) the simpletime-splitting technique TS-26, which is considered as the baseline solution, and (ii) thesession extraction method based on the query-flow graph (QFG) proposed by Boldi et al.[2008], which can be considered as the state-of-the-art approach.

For all our clustering methods we can either use the unsupervised learned task-based query similarities (i.e., σ1 and σ2) or the 12 supervised learned similarities. Wehave chosen to start from the unsupervised learned similarity functions [Lucchese et al.2011], and we show that QC-WCC and QC-HTC outperform both QC-MEANS and QC-SCAN,but also state-of-the-art approaches, that is, QFG introduced by Boldi et al. [2008].Then, we concentrate on QC-WCC and QC-HTC only and instantiate their function w forweighting the query graph with the supervised learned similarities, that is, w = σ 4

dt,w = σ 1

nb (or, analogously, w = σ 3nb), and w = σ ∗

lr, where ∗ ∈ {1, 2, 3, 4}.8.1. Measures of Clustering Validity

In order to evaluate all the methods we mentioned, we needed to measure the degreeof correspondence between manually-extracted user tasks of the ground truth (seeSection 5) and user tasks produced by our algorithms. To this end, we used bothclassification- and similarity-oriented measures [Tan et al. 2005]. In the following,predicted class is the user task where a query is assigned to by a specific algorithm,whereas true class indicates the user task where the same query was in the groundtruth.

Classification-oriented approaches measure the degree to which predicted classescorrespond to true classes, and F1 is one of the most popular scores in this category,as it combines both precision and recall. In our case, precision measures the fractionof queries that were assigned to a user taskand that were actually part of that usertask. Instead, recall measures how many queries were assigned to a user task amongall the queries that were really contained in that user task. Globally, F1 evaluates theextent to which a user task contains only and all the queries that were actually part ofit. Given p(i, j), r(i, j) is the precision and recall of user task i with respect to class j,the F1 corresponds to the following weighted harmonic mean of p(i, j) and r(i, j).

F1(i, j) = 2 × p(i, j) × r(i, j)p(i, j) + r(i, j)

.

To compute a global F1, we first considered the set of predicted tasks T associatedwith each long-term session S, which is obtained as T = ⋃

s∈S π (s) = {t1, t2, . . . , t|T |},namely as the union set of all the user tasks extracted from each time-gap session byusing the partitioning strategy π . Analogously, we took into account the set of truetasks = {θ1, θ2, . . . , θ||}, that is, the set of tasks performed by user u according to theground truth.

In addition, in order for the two sets T and to have the same size, that is, |T | = ||,we padded them with all the unclassified queries, which are all the queries that appearin session S but that were discarded during the automatic and/or the manual clustering.This can, in some way, be equivalent to considering discarded queries as singletonclusters, that is, single tasks composed of only one query.



Thus, for each predicted task tj , we computed the maximum F1, that is, F1max(tj),with respect to the true tasks as follows.

F1max(tj) = argmaxk

F1(tj, θk).

Globally, F1 is averaged on the set of all predicted tasks for all the users u ∈ U inthe training set T = ⋃

u∈U Tu with respect to the set of all true tasks = ⋃u∈U u as

follows.

F1(T ,) = w j · F1max(tj)∑|T |j=1 w j

,

where w j = |tj |.Similarity-oriented measures consider pairs of objects instead of single objects.

Again, let s ⊆ S be the generic time-gap session of a long-term session S such that|s| > 1. Furthermore, let T and be the sets of predicted and true tasks of S, re-spectively (both padded with discarded queries as described previously). Thus, for eachS we computed the following quantities.

—tn = number of query pairs that are in different true tasks and in different predictedtasks (true negatives).

— f p = number of query pairs that are in different true tasks but in the same predictedtask (false positives).

— f n = number of query pairs that are in the same true task but in different predictedtasks (false negatives).

—tp = number of query pairs that are in the same true task and in the same predictedtasks (true positives).

Then, we used two different measures.

—Rand index. R(T ) = tn+tptn+ f p+ f n+tp .

—Jaccard index. J(T ) = tpfp+ f n+tp .

A global value of both the Rand and Jaccard index, that is, R and J respectively,might be computed as follows:

R = w j · R(T )∑|T |j=1 w j

, J = w j · J(T )∑|T |j=1 w j

,

where w j = |S|.As specified before, when computing both the Rand and Jaccard index, we did not

consider time-gap sessions containing only one singleton task, that is, time-gap sessionscontaining only one single-query cluster. However, we did take into account time-gapsessions that were composed of a single task with more than one query.

8.2. Evaluation on the Ground Truth

In the following, we show the results we obtained using our two sets of user taskdiscovery methods, namely time-splitting and query clustering methods, respectively.Also, we compare them with a state-of-the-art approach based on the query-flow graph(QFG) [Boldi et al. 2008].

8.2.1. Time-Splitting. This set of task discovery methods is exclusively based on the task-based query similarity function described in Section 6.1, that is, σtime. In particular, herewe compare three different time-splitting techniques—TS-5, TS-15, and TS-26—whichuse 5-, 15-, and 26-minute thresholds of t, respectively.



Table VIII. TS-5, TS-15, and TS-26

F1 Rand Jaccard

TS-5 0.28 0.75 0.03TS-15 0.28 0.71 0.08TS-26 0.65 0.34 0.34

Table IX. QFG Varying the Threshold η

η F1 Rand Jaccard

QFG

0.1 0.68 0.47 0.360.2 0.68 0.49 0.360.3 0.69 0.51 0.370.4 0.70 0.55 0.380.5 0.71 0.59 0.380.6 0.74 0.65 0.390.7 0.77 0.71 0.400.8 0.77 0.71 0.400.9 0.77 0.71 0.40

Table VIII shows the results we obtained using these techniques on the groundtruth. The best result in terms of F1 is found by considering all the time-gap sessionsidentified with TS-26, without splitting them into shorter time-gap sessions. Hence,we consider TS-26 as the baseline approach for addressing UTDP. Roughly speaking,this is equivalent to identifying user tasks with time-gap sessions.

8.2.2. Query-Flow Graph. QFG is constructed over a training segment of the AOLtop-500 user sessions. This method uses chaining probabilities measured by meansof a machine-learning method. First, we extracted some features from the trainingsearch engine log and stored them in a compressed graph representation. In particu-lar, we considered 25 different features (i.e., time-related, session, and textual features)for each pair of queries (q, q′) that were issued consecutively in at least one session ofthe query log.

The validity of QFG was tested on the ground truth, and the results we obtained areshown in Table IX. We found the best values using a threshold η = 0.7. In fact, it wasshown that results do not improve when using a greater threshold value.

QFG significantly improves the baseline TS-26. In particular, F1 gains about 16%,while Rand and Jaccard roughly gain 52% and 15%, respectively.

8.2.3. Query Clustering. We now evaluate all the clustering-oriented user task discoverymethods described in Section 7.2.1.

First, we present the results we obtained using the task-based query similarityfunctions derived from the unsupervised learning approach described in Section 6.2that is, σ1 and σ2. Therefore, as a major innovative contribution to this work, we alsoshow the outcomes of two of these task discovery methods, that is, QC-WCC and QC-HTC,when exploiting the supervised learned similarities proposed in Section 6.3, that is,σ 4

dt, σ 1nb (or, equivalently, σ 3

nb), and σ ∗lr, where ∗ ∈ {1, 2, 3, 4}.

Unsupervised Learned Task-Based Similarity. We start on the premise that theQC-MEANS clustering algorithm uses both the unsupervised learned query similari-ties σ1 and σ2. We empirically set the radius ρ of this centroid-based algorithm to 0.4for both similarity functions, that is, two queries could be part of the same cluster ifand only if their similarity is equal to or greater than 0.6. The overall results of thismethod are shown in Table X.



Table X. QC-MEANS Using Unsupervised Learned Task-Based Query Similarities σ1 and σ2

QC-MEANS σ1

F1 Rand Jaccardα (1 − α)1 0 0.71 0.73 0.26

0.5 0.5 0.68 0.70 0.140 1 0.68 0.70 0.13

QC-MEANS σ2

F1 Rand Jaccardt b

0.5 4 0.72 0.74 0.27

Table XI. QC-SCAN Using Unsupervised Learned Task-Based Query Similarities σ1 and σ2

QC-SCAN σ1

F1 Rand Jaccardα (1 − α)1 0 0.77 0.71 0.17

0.5 0.5 0.74 0.68 0.060 1 0.75 0.68 0.07

QC-SCAN σ2

F1 Rand Jaccardt b

0.5 4 0.77 0.71 0.19

Concerning σ1, the best results were obtained by using only the content-based simi-larity, that is, with α = 1. However, the very best results for QC-MEANS were found whenusing σ2. Here, we significantly improve the baseline TS-26 in terms of F1 (≈10%) andRand (≈54%), while we lose nearly 21% in terms of Jaccard. Moreover, if we comparethe best QC-MEANS with the best QFG, we notice that QC-MEANS loses about 6% interms of F1, 33% in terms of Jaccard, but it gains approximately 4% in terms of Rand.

We now discuss the QC-SCAN algorithm, again using both the similarity functionsσ1 and σ2. We used several combinations of the two density-based parameters, that is,minPts and eps, and we found the best results with minPts = 2 and eps = 0.4.

Table XI illustrates the fact that QC-SCAN provides globally better results thanQC-MEANS for both σ1 and σ2. Similarly, for σ1 the best results were obtained by usingonly content-based similarity, that is, with α = 1. However, our proposed conditionalfunction μ2 reveals a significant improvement with respect to all measures.

Finally, it is worth noting that QC-SCAN behaves exactly the same as QFG, except forthe Jaccard where QC-SCAN roughly loses 53%.

The third algorithm we consider is QC-WCC. Here, we used a breadth-first search inorder to find the connected components of the graph which represent each time-gapsession [Hopcroft and Tarjan 1973]. Table XII shows the results we found using thisalgorithm either with σ1 and σ2, and by varying the pruning threshold η. In particular,concerning σ1 we only consider the best convex combination when α = 0.5.

The best results with σ1 were obtained when η = 0.2, while even better results werefound with σ2 when η = 0.3. In this last case, the overall evaluation is significantlyhigher than the baseline TS-26 but also higher than the state-of-the-art approach QFG.With regard to TS-26, the best QC-WCC gains about 20%, 56%, and 23% in terms of F1,Rand, and Jaccard, respectively. Moreover, QC-WCC also improves the results of QFG,gaining nearly 5% in terms of F1, about 9% in terms of Rand, and approximately 10%in terms of Jaccard.

QC-HTC is the last algorithm we introduced and represents one of the innovativecontributions to our previous work [Lucchese et al. 2011]. The results obtained fromusing this approach with both similarity functions σ1 and σ2 varying the pruningthreshold η are shown in Table XIII. Similarly to QC-WCC, with regard to σ1, we onlyconsider the best convex combination when α = 0.5. Again, the best results with σ1 wereobtained when η = 0.2, while the global best results were found with σ2 when η = 0.3.As the table shows, the overall results are very close to those obtained with QC-WCC.In particular, QC-HTC improves TS-26 by roughly gaining 19%, 56%, and 21% in terms



Table XII. QC-WCC Using Unsupervised Learned Task-Based Query Similarities σ1 and σ2

QC-WCC σ1 (α = 0.5)η F1 Rand Jaccard

0.1 0.78 0.71 0.420.2 0.81 0.78 0.430.3 0.79 0.77 0.370.4 0.75 0.73 0.270.5 0.72 0.71 0.200.6 0.75 0.70 0.140.7 0.74 0.69 0.110.8 0.74 0.68 0.070.9 0.72 0.67 0.04

QC-WCC σ2 (t = 0.5, b = 4)η F1 Rand Jaccard

0.1 0.67 0.45 0.330.2 0.78 0.71 0.420.3 0.81 0.78 0.440.4 0.81 0.78 0.410.5 0.80 0.77 0.370.6 0.78 0.75 0.320.7 0.75 0.73 0.230.8 0.71 0.70 0.150.9 0.69 0.68 0.08

Table XIII. QC-HTC Using Unsupervised Learned Task-Based Query Similarities σ1 and σ2

QC-HTC σ1 (α = 0.5)η F1 Rand Jaccard

0.1 0.78 0.72 0.410.2 0.80 0.78 0.410.3 0.78 0.76 0.350.4 0.75 0.73 0.250.5 0.73 0.70 0.180.6 0.75 0.70 0.130.7 0.74 0.69 0.100.8 0.74 0.68 0.060.9 0.72 0.67 0.03

QC-HTC σ2 (t = 0.5, b = 4)η F1 Rand Jaccard

0.1 0.68 0.56 0.320.2 0.78 0.73 0.410.3 0.80 0.78 0.430.4 0.80 0.77 0.380.5 0.78 0.76 0.340.6 0.77 0.74 0.300.7 0.74 0.72 0.210.8 0.71 0.70 0.140.9 0.68 0.67 0.07

of F1, Rand, and Jaccard, respectively. It is therefore clear that QC-HTC provides betterresults than QFG and gains about 4% in terms of F1, nearly 9% in terms of Rand, andapproximately 8% in terms of Jaccard.

Supervised Learned Task-Based Similarity. Another major contribution to this workconcerns the supervised learning approach for computing the task-based query simi-larity functions, as described in Section 6.3.

In short, a set of query similarity functions was learned by training a family ofclassifiers on a set of both internal and external query log features. This contrasts withthe unsupervised learning approach, where query similarity functions were directlyderived from the query log data without any supervised learning step.

Therefore, here we also evaluate how this new approach for measuring the taskrelatedness between query pairs impacts the effectiveness of the two best-performingclustering-oriented task discovery methods, that is, QC-WCC and QC-HTC.

It is worth remembering that supervised learned similarities affect the way in whichwe build the similarity graph either in QC-WCC or in QC-HTC. Indeed, an edge betweena query pair (qi, qj) is created whenever the considered classifier assigns the classattribute same task = yes to (qi, qj). Moreover, the weight assigned to each creatededge corresponds to the prediction accuracy value provided by the classifier.

Based on the performance evaluation of the classifiers we proposed in Section 6.3.3,we ran both the QC-WCC and QC-HTC algorithms using the three best task-based querysimilarity functions: σ 4

dt, σ 1nb (or, analogously, σ 3

nb), and σ ∗lr, where ∗ ∈ {1, 2, 3, 4}. These

similarity scores were used to compute the weighting edge similarity function w of ourgraph-based algorithms.



Table XIV. QC-WCC vs. QC-HTC Using Supervised Learned Task-Based Query Similarity σ 4dt

QC-WCC using σ 4dt

η F1 Rand Jaccard0.0 0.76 0.69 0.430.1 0.76 0.69 0.430.2 0.76 0.69 0.430.3 0.76 0.69 0.430.4 0.76 0.69 0.430.5 0.76 0.69 0.430.6 0.78 0.77 0.460.7 0.79 0.78 0.450.8 0.80 0.79 0.450.9 0.80 0.79 0.421.0 0.71 0.70 0.13

QC-HTC using σ 4dt

η F1 Rand Jaccard0.0 0.76 0.73 0.420.1 0.76 0.73 0.420.2 0.76 0.73 0.420.3 0.76 0.73 0.420.4 0.76 0.73 0.420.5 0.76 0.73 0.420.6 0.78 0.79 0.440.7 0.79 0.79 0.430.8 0.79 0.79 0.420.9 0.78 0.78 0.381.0 0.68 0.69 0.10

Table XV. QC-WCC vs. QC-HTC Using Supervised Learned Task-Based Query Similarities σ 1nb or σ 3

nb

QC-WCC using σ 1nb or σ 3

nbη F1 Rand Jaccard

0.0 0.65 0.36 0.330.1 0.65 0.36 0.330.2 0.65 0.36 0.330.3 0.65 0.36 0.330.4 0.65 0.36 0.330.5 0.65 0.36 0.330.6 0.65 0.36 0.330.7 0.65 0.37 0.330.8 0.64 0.40 0.320.9 0.65 0.48 0.301.0 0.75 0.73 0.24

QC-HTC using σ 1nb or σ 3

nbη F1 Rand Jaccard

0.0 0.65 0.38 0.330.1 0.65 0.38 0.330.2 0.65 0.38 0.330.3 0.65 0.38 0.330.4 0.65 0.38 0.330.5 0.65 0.38 0.330.6 0.65 0.38 0.330.7 0.65 0.39 0.330.8 0.64 0.42 0.310.9 0.65 0.50 0.301.0 0.75 0.72 0.22

Table XIV illustrates the results obtained both with QC-WCC and QC-HTC using thesupervised learned similarity σ 4

dt. Concerning QC-WCC, the best results were providedwhen η = 0.8, while QC-HTC performed best when η = 0.7.

Similarly, Table XV shows the results obtained both with QC-WCC and QC-HTC usingthe supervised learned similarity σ 1

nb (or, analogously, σ 3nb). In both cases, best F1 and

Rand values were obtained when η = 1.0, whereas best Jaccard results were obtainedwhen 0.0 ≤ η ≤ 0.7. However, all the validity measures are significantly worst thanthose obtained when using σ 4

dt. Another difference is that when using σ 4dt, the best

results are around a unique value of the threshold η, that is, η = 0.8 and η = 0.7,whereas here it appears there is a less strong relationship between the overall bestresults and η.

Table XVI shows the results obtained both with QC-WCC and QC-HTC using the su-pervised learned similarity σ ∗

lr. Both QC-WCC and QC-HTC achieve their best outcomeswhen the threshold η = 0.7. However, even in this case, all the validity measures losesignificant value with respect to QC-WCC and QC-HTC when using σ 4

dt. With regard toσ 4

dt there is a clear relationship between the best validity measures and the value of η.Table XVII compares the best results found with each approach and highlights sim-

ilar behaviors when using unsupervised or supervised learned similarities.Finally, Table XVIII clearly points out the benefit of exploiting collaborative knowl-

edge like Wikipedia. QC-HTC used the similarity function σ2 to capture and grouptogether two queries that are completely different from a content-based perspective,



Table XVI. QC-WCC vs. QC-HTC Using Supervised Learned Task-Based Query Similarities σ ∗lr (∗ ∈ {1, 2, 3, 4})

QC-WCC using σ ∗lr

η F1 Rand Jaccard0.0 0.65 0.50 0.300.1 0.65 0.50 0.300.2 0.65 0.50 0.300.3 0.65 0.50 0.300.4 0.65 0.50 0.300.5 0.65 0.50 0.300.6 0.70 0.64 0.300.7 0.77 0.75 0.310.8 0.76 0.73 0.240.9 0.74 0.70 0.151.0 0.73 0.66 0.00

QC-HTC using σ ∗lr

η F1 Rand Jaccard0.0 0.65 0.51 0.300.1 0.65 0.51 0.300.2 0.65 0.51 0.300.3 0.65 0.51 0.300.4 0.65 0.51 0.300.5 0.65 0.51 0.300.6 0.68 0.65 0.280.7 0.76 0.75 0.300.8 0.75 0.73 0.240.9 0.74 0.70 0.141.0 0.73 0.66 0.00

Table XVII. Best Results Obtained with Each Method Using Both Unsupervised and Supervised LearnedSimilarities

F1 Rand Jaccard

TS-26 (baseline) 0.65 0.34 0.34QFG best (state of the art) 0.77 0.71 0.40

unsupervised learned similarity σ2

QC-MEANS best 0.72 0.74 0.27QC-SCAN best 0.77 0.71 0.19QC-WCC best 0.81 0.78 0.44QC-HTC best 0.80 0.78 0.43

supervised learned similarity σ 4dt

QC-WCC best 0.80 0.79 0.45QC-HTC best 0.79 0.79 0.43

Table XVIII. The Impact of Wikipedia: σ1 vs. σ2

QC-HTC σ1 (α = 1) QC-HTC σ2 (0.5, 4)Query ID Query String Query ID Query String

63 los cabos

64 cancun

65 hurricane wilma 65 hurricane wilma

68 hurricane wilma 68 hurricane wilma

but that are closely correlated from a the point of view of semantics. Indeed, Cancunis one of the regions affected by Hurricane Wilma which hit in 2005 (see the cross ref-erence in the corresponding Wikipedia article7). Moreover, Los Cabos and Cancun areboth in Mexico despite being a great distance apart. It might be the case that the userwas looking for the relative position of Los Cabos from Cancun in order to understandif Los Cabos was hit by the hurricane as well.

8.3. Evaluation on a Larger Dataset

So far, we have evaluated our user task discovery methods on a manually-labeleddataset, which we referred to as our ground truth. However, an evaluation on alarger dataset may give useful hints on whether our proposed techniques are ableto generalize.

In this section, we consider the two best approaches we proposed, that is, QC-WCC

and QC-HTC. Both these methods were run on the public dataset top-500-1week, which

7http://en.wikipedia.org/wiki/Cancun



0

2

4

6

8

10

10 20 30 40 50 60 70 80

Freq

uenc

y (%

)

#User Tasks per 1-week session

Fig. 11. The distribution of user task frequency using the QC-WCC algorithm.

refers to the 500 user sessions with the highest number of queries, yet limited to thefirst week of logging. It is worth noting that a subset of top-500-1week was used tobuild our ground truth (see Section 5). This dataset contains a total amount of 48,257queries, meaning about 97 queries per week per user on average, which corresponds tonearly 14 queries per day per user, and it is available for download.8 Here, the longestuser session, that is, the session with the highest number of queries, is 1,774 queries.

8.3.1. QC-WCC. When the QC-WCC algorithm was executed on this larger dataset, a totalnumber of 8,191 user tasks was found. In Figure 11, we plot the frequency distributionof user tasks over the user sessions contained in the dataset. The maximum number ofdiscovered tasks for a single user session is 72. On average, each user performed 16.4tasks per week.

Moreover, the user task size distribution (i.e., the number of queries for each dis-covered user task) is depicted in Figure 12. This plot shows that the user task sizedistribution in the larger dataset reflects the ground truth, which was reported inFigure 8. However, the actual average number of queries per user task is about 3.93,which is slightly greater than the ground truth (i.e., about 2.57).

On the basis of the analysis conducted on the ground truth and described in Section 5,we also evaluated how user tasks were distributed over time-gap sessions, namely howmany user tasks were discovered within the same time-gap session, using the QC-WCC

algorithm. The plot in Figure 13 shows some similarities to the one depicted in Figure 9,which instead refers to the ground truth. However, the QC-WCC algorithm discoveredabout 1.34 user tasks per time-gap session, as opposed to 1.80 found in the groundtruth.

8.3.2. QC-HTC. The QC-HTC algorithm identified a total figure of 8,301 user tasks onthe larger dataset. Figure 14 depicts the frequency distribution of the number of usertasks for each one-week session. Here, the maximum number of discovered user tasksfor a single user session is 163, whereas the minimum is 1. This means each userperformed about 16.6 tasks, on average.

As regards the QC-WCC, we also evaluated the user task size distribution using thisalgorithm, and the result is shown in Figure 15. Here, not only is the curve progress

8http://miles.isti.cnr.it/~tolomei/downloads/aol-top500-1w.tgz



0

10

20

30

40

50

60

5 10 15 20 25 30

Freq

uenc

y (%

)


Fig. 12. The distribution of user task size using the QC-WCC algorithm.

0

20

40

60

80

100

2 4 6 8 10 12 14

Freq

uenc

y (%

)


Fig. 13. The distribution of user tasks across time-gap sessions using the QC-WCC algorithm.

compliant with the user task size distribution of the ground truth, but also, the averagenumber of queries per task (i.e., about 3.38) is closer to the one we discovered in ourgolden set. Interestingly, the QC-HTC was able to detect about 1.49 user tasks pertime-gap session, and the whole distribution is shown in Figure 16.

9. DISCOVERING COLLECTIVE TASKS

The last major contribution to this work is a method for detecting collective tasks.Given the set T of user tasks extracted from the query log with one of one of thetechniques previously discussed, let ti ∈ T be a generic user task, and ti its bag-of-wordsrepresentation. More specifically, if q is the bag-of-words representation of a queryq ∈ QL, it follows that ti = ⊎

q∈ti q, where⊎

is the bag union operator. Therefore, eachti can be considered as a text document, and the problem of discovering the collectivetasks can be reduced to the clustering of similar text documents [Zhao and Karypis2002, 2004].



0

2

4

6

8

10

0 20 40 60 80 100 120 140 160 180

Freq

uenc

y (%

)

#User Tasks per 1-week session

Fig. 14. The distribution of user task frequency using the QC-HTC algorithm.

0

10

20

30

40

50

60

5 10 15 20 25 30

Freq

uenc

y (%

)


Fig. 15. The distribution of user task size using the QC-HTC algorithm.

In the rest of this section, we first present a manually-generated ground truth ofcollective tasks, which is used to evaluate the quality of the collective tasks extractedby means of a user task clustering algorithm. Then, we discuss a set of possible usertask clustering algorithms and their evaluation.

9.1. Ground Truth of Collective Tasks

The user tasks that were manually annotated to create the ground truth of collectivetasks were identified by running QC-HTC on the same portion of the top-500-1weekquery log, which had previously been used to generate the ground truth of user tasks(see Section 5).

QC-HTC discovered a total amount of 318 user tasks, which in turn were manuallygrouped into a set of collective tasks using the same annotators employed in construct-ing the ground truth of user tasks. The annotators discarded 16 user tasks, since noagreement was reached on the cluster assignments for these user tasks. They grouped



0

20

40

60

80

100

2 4 6 8 10 12 14

Freq

uenc

y (%

)


Fig. 16. The distribution of user tasks across time-gap sessions using the QC-HTC algorithm.

Table XIX. Statistical Indicators on Manually-Identified Collective Tasks

Cluster SizeAvg. Std. Dev. Max Min Median

5.70 13.27 61 1 5

the remaining 302 (i.e., ≈95% of the total) into 53 collective tasks, each one containingon average 5.70 user tasks. Table XIX shows some statistics relating to the size of theclusters which had been manually generated.

9.2. Clustering Algorithms

In order to automatically discover collective tasks, we propose clustering the set ofalready detected user tasks. In particular, in order to cluster the set T of user tasks, weselected a set of algorithms included in the CLUTO9 toolkit. Each algorithm producesa set of K clusters, namely a set of K collective tasks.

Regardless of the clustering algorithm chosen, three input parameters have to beprovided: (i) a similarity measure, (ii) an objective function, and (iii) a number K ofclusters. For the first option, we adopted two measures, namely the well-known cosinesimilarity and the Pearson’s correlation coefficient. Concerning the second, we chose tomaximize the intra-cluster similarity according to the following function:

maxK∑

i=1

√ ∑u,v∈Si

sim(u, v),

where K is the total number of produced clusters, Si is the set of objects assigned to theith cluster, and sim(u, v) is the similarity between the two objects u, v ∈ Si (i.e., eithercosine or Pearson’s correlation coefficient).

Method 1: Repeated Bisections (rbr). This is the first clustering approach we used,where the desired K-way clustering solution is computed by performing a sequence ofK − 1 repeated bisections. Here, the similarity matrix is first clustered into two groups,then one of these groups is selected and bisected further. This process continues until

9http://glaros.dtc.umn.edu/gkhome/views/cluto



the desired number of clusters is found. During each step, the cluster is bisected sothat the resulting two-way clustering solution optimizes the chosen criterion function.The cluster that is selected for further partitioning is customizable, and by default, itcoincides with the biggest cluster at each stage. Note that this approach ensures thatthe criterion function is locally optimized within each bisection, but in general, it is notglobally optimized. Therefore, we selected a variant of this method which, in the end,globally optimizes the objective function.

Method 2: Agglomerative (agg). In this approach, the desired K-way clustering so-lution is computed using the agglomerative paradigm whose goal is again to locallyoptimize the selected clustering objective function. The solution is obtained by stop-ping the agglomeration process when K clusters are left.

Eventually, we came up with four solutions by mixing the two preceding clusteringmethods with the two similarity scores, namely rbr-cosine, rbr-pearson, agg-cosine, andagg-pearson.

9.3. Evaluation on the Ground Truth

All the automatic solutions just described were run on the same set of user tasks weused to manually build the ground truth of collective tasks (see Section 9.1). In orderto evaluate our clustering algorithms, we set the final number of clusters as K = 53,which is the exact number of collective tasks discovered by the human assessors.

Similarly to Section 8.1, we refer to classification-oriented measures of validity inorder to assess the performance of the various clustering methods in relation to thecollective tasks in the ground truth, namely precision, recall, and F1. Table XX reportsthese measurements of clustering validity for the various algorithms, along with somestatistical indicators.

(a) rbr-cosine. This solution produces the set of K = 53 output clusters by performinga sequence of K − 1 repeated bisections. Furthermore, it uses the cosine similarityto compare the textual representations of any two user tasks. From the original set of318 user tasks, 297 were clustered (i.e., ≈93%), whereas 21 were discarded. On average,each collective task contains about 5.60 user tasks, which is close to the value obtainedfrom the ground truth.

(b) rbr-pearson. As with the previous method, rbr-pearson produces the final set ofoutput clusters by performing a sequence of repeated bisections. However, it uses thePearson’s correlation to measure the similarity between pairs of user tasks. Fromthe original set of 318 user tasks, 293 were clustered (i.e., ≈92%), whereas 25 werediscarded.

(c) agg-cosine. This solution agglomerates user tasks by locally optimizing the se-lected criterion function, which is based on the cosine similarity between textual rep-resentations of any two user tasks. All 318 original user tasks were clustered, therebyeach collective task has six user tasks on average. On the basis of the quality of producedclusters, this method results in a significant drop in precision, recall, and F1 scorescompared to the preceding partitional methods.

(d) agg-pearson. Similarly to the preceding method , agg-pearson agglomerates usertasks. This solution uses Pearson’s correlation to measure the similarity between usertasks. As with agg-cosine, all 318 original user tasks were clustered, thereby eachcollective task has six user tasks on average. Again, the overall quality of clustering isworse than that obtained with partitional methods.



Table XX. Statistical Indicators and Quality Evaluation of Each User Task Clustering Algorithm


rbr-cosine 5.60 14.19 89 2 3rbr-pearson 5.53 14.41 80 3 5agg-cosine 6.00 16.97 97 1 3

agg-pearson 6.00 43.86 250 1 1

Cluster QualityPrecision Recall F1

rbr-cosine 0.71 0.48 0.57rbr-pearson 0.68 0.46 0.55agg-cosine 0.59 0.42 0.49

agg-pearson 0.54 0.39 0.45

Table XXI. Statistical Indicators on User Task Clustering Using rbr-cosine


6.81 15.43 479 1 5

Intra-Cluster SimilarityAvg. Std. Dev. Median

0.59 0.17 0.56

Finally, from the results reported in Table XX, the best clustering algorithm is parti-tional (i.e., top-down), that is, rbr-cosine, which achieves the highest values of precision,recall, and F1, respectively.

9.4. Evaluation on a Larger Dataset

In this section, we assess the behavior of the best-performing clustering algorithm (i.e.,rbr-cosine) when applied to a larger collection of tasks. These user tasks were extractedfrom the whole top-500-1week dataset (see Section 8.3.2) by the QC-HTC algorithm.

Unlike the previous tests, which were conducted to find the best algorithm, in thiscase we do not have an a priori knowledge of the number of collective tasks in the wholedataset. In order to select the number K of clusters, we thus observed the behaviorof the objective function by varying K. It was noted as K increases, the objectivefunction monotonically increases as well. Indeed, the maximum intra-cluster similarityis obtained when K is equal to the number of documents to be clustered (i.e., when eachcluster contains exactly one document). It is clear that we need to find a trade-off, andthis is indicated by a well-established empirical criterion, also known as the elbowmethod. Generally speaking, we chose K = K such that for any K, K > K the slopeof our objective function appeared to increase less than for any K, K < K. The reasonfor selecting this method is to choose a number of clusters such that adding anothercluster does not give a much better model for fitting the data. By following this methodon the original collection of 8,301 user tasks, we eventually obtained a set of K = 1,024collective tasks.

Since we do not have a ground truth for such a large dataset, we first show somestatistics relating to the obtained clusters, for example, the number of user taskswithin each collective task, the intra-cluster similarity, etc. Furthermore, we presentsome analysis on the popularity of collective tasks, namely we show how collectivetasks are actually distributed across the original user sessions stored on the query log.Finally, we illustrate some examples as anecdotical evidence.

From the initial input set of 8,301 user tasks, 6,970 (≈84%) were actually clustered.Table XXI shows some statistical indicators on the output clusters of user tasks. In theleft-hand table, we indicate the variety in cluster size, that is, the number of user taskscontained within a collective task. We found that the collective task with the highestnumber contained 479 user tasks. By manually inspecting this large collective task,we found it mainly contains navigational user tasks [Broder 2002], mostly related tocontents of a sexual nature. On average, each collective task included approximatelyseven user tasks. In Figure 17, we also plot the cluster size distribution yet limitedto collective tasks of less than 50 user tasks. Indeed, it is worth noting that the vast



0

5

10

15

20

5 10 15 20 25 30 35 40 45 50

Freq

uenc

y (%

)

Collective Task size (#user tasks)

Fig. 17. The distribution of collective task size by means of the number of composing user tasks.

0

5

10

15

20

5 10 15 20 25 30 35 40 45 50

Freq

uenc

y (%

)

Collective Task popularity

Fig. 18. The distribution of collective task popularity across the original set of user sessions.

majority of collective tasks (i.e., 99.9%) contained less than 50 user tasks. The right-hand table shows some indicators of intra-cluster similarity.

In addition, we were interested in checking if some collective tasks occurred morefrequently in the query log, both within the same user session and across distinct usersessions. To this end, we rewrote each original user session as a set of user tasks (i.e.,actually collective tasks), instead of a sequence. Therefore, for each collective task, wecomputed the percentage of user sessions where this appeared, disregarding its orderand any possible repetition within a single user session. According to this study, thetop-most popular collective task occurred in 183 out of 500 user sessions (i.e., about36.6%). In contrast, the vast majority of collective tasks (i.e., about 93.5%) appeared inless than 11 user sessions, and a collective task occurred in about five user sessions,on average. In Figure 18, we show the popularity distribution of collective tasks acrossuser sessions.



Table XXII. A Collective Task Containing User Tasks Referring to hobby/gardening

Collective Task # 314User Task IDs Queries2439668-18-1 cottage garden qvc, cottage garden roses

1188448-3-7 private hot tub garden calistoga area lodging

1012899-3-2 vegetable garden, vegetable garden ideas

2061454-23-1 japanese garden, decor japanese garden, . . .. . . ...

679436-11-2 tv garden shows, rebecca cole garden show tv, . . .297468-3-2 dry garden, dry garden berkeley, . . .297468-10-1 california garden blog, garden blog, best garden blogs

297468-21-1 open garden, open garden day sacramento

297468-26-1 horton farm iris garden

Table XXIII. A Collective Task Containing User Tasks Referring to History of Rome

Collective Task # 578User Task IDs Queries12472900-4-1 louis xvi descended clovis, descendants roman nobility, . . .8566671-21-3 roman history, . . .4110454-5-4 roman claim conquest, roman historian ivy . . .

Table XXIV. A Collective Task Containing User Tasks Referring to medical diseases

Collective Task # 693User Task IDs Queries

57424-1-1 california sweats company, low sugar sweetener, . . .1524276-2-1 hypoglycemia periods, low blood sugar periods menstrual cycle, . . .257689-1-1 blood sugar 500, blood sugar chart

543587-32-3 prolonged periods perimenopausal, . . .4401012-9-1 high blood sugar use diuretics, blood sugar levels fasting, . . .

Table XXV. A Collective Task Containing User Tasks Referring to math/physics

Collective Task # 946User Task IDs Queries

292860-6-1 calculate moment rotational inertia, kinematic equations

349670-16-2 entropy equations, entropy1411796-16-1 schroedinger’s equation, schrodinger’s equation

Finally, Tables XXII, XXIII, XXIV, and XXV show some examples of anecdoticalevidence within the collective tasks found. In particular, we show the user tasks10

of four collective tasks—discovered by our rbr-cosine clustering—which refer to fourreal-life situations.

10. CONCLUSIONS

This work addresses some important research challenges in developing next-generationWeb search engines which better satisfy user needs. We claim that people increasinglyphrase queries to search engines in order to find information which can simplify theirdaily tasks. Examples of these tasks include finding a recipe, booking a flight, readingonline news, etc. To verify this theory and to discover those tasks, we carried outa detailed analysis of the historical data recorded in long-term search engine query

10User task IDs are uniquely determined by the following pattern: userID-sessionID-taskID.



logs. Our approach involves a two-step methodology. First, we identify user tasks fromindividual user sessions stored in the query log. In our vision, a user task is a set ofpossibly noncontiguous queries occurring within a search session which relates to thesame need. Then, as a second step, we discover collective tasks by aggregating similaruser tasks, possibly performed by distinct users.

For the initial step, we define the user task discovery problem (UTDP) as the problemof finding the best partitioning of a set of queries into subsets of queries related tothe same user task. The UTDP involves two main issues: (i) it requires a robust measureto evaluate the task relatedness between any two queries, and (ii) it needs an effectivemethod in order to discover user tasks on the basis of this measure. With referenceto (i), we propose both unsupervised and supervised learning approaches for devisingseveral task-based query similarities, whereas we tackle (ii) by introducing a set ofquery clustering methods specifically designed to discover user tasks.

We evaluate all the proposed solutions by means of a manually-built ground truth,namely a task-oriented partitioning of the queries in our benchmarking dataset per-formed by human annotators. In particular, two of the proposed clustering methods,that is, QC-WCC and QC-HTC, have been shown to outperform state-of-the-art solutions.

For the second stage, we introduce and investigate the problem of discovering col-lective tasks. To this end, we propose four methods for clustering previously mineduser tasks, which are represented by the bag-of-words extracted from the associatedqueries.

We evaluate all these solutions both on a manually-built ground truth and on a largerdataset. The experiments conducted reveal that our two-step approach can effectivelydetect similar latent needs from a query log by first mining the search behavior ofeach single user, and then by aggregating the similar user tasks performed by differentusers.

As future work, we plan to exploit the collective tasks mined from the query log tobuild a model for representing the task-by-task search behavior of users. This modelcould subsequently be used to devise novel applications like a task recommender systemthat goes beyond query suggestion mechanisms currently offered by modern Web searchengines.

ACKNOWLEDGMENTS

We acknowledge the authors of Boldi et al. [2008] and the Yahoo! Research Labs in Barcelona, Spain, forproviding us their query-flow graph implementation, and Franco Maria Nardini for adapting this implemen-tation to our needs.

REFERENCES

BAEZA-YATES, R., GIONIS, A., JUNQUEIRA, F. P., MURDOCK, V., PLACHOURAS, V., AND SILVESTRI, F. 2008. Designtrade-offs for search engine caching. ACM Trans. Web 2, 4, 1–28.

BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley Longman Pub-lishing Co., Inc., Boston, MA.

BEEFERMAN, D. AND BERGER, A. 2000. Agglomerative clustering of a search engine query log. In Proceedingsof the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00).ACM, New York, NY. 407–416.

BOLDI, P., BONCHI, F., CASTILLO, C., DONATO, D., GIONIS, A., AND VIGNA, S. 2008. The query-flow graph: Model andapplications. In Proceedings of the 17th ACM International Conference on Information and KnowledgeManagement (CIKM’08). ACM, New York, NY, 609–618.

BRODER, A. 2002. A taxonomy of Web search. SIGIR Forum 36, 2, 2, 3–10.CAO, H., JIANG, D., PEI, J., HE, Q., LIAO, Z., CHEN, E., AND LI, H. 2008. Context-aware query suggestion by mining

click-through and session data. In Proceedings of the 14th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining (KDD’08). ACM, New York, NY, 875–883.



DONATO, D., BONCHI, F., CHI, T., AND MAAREK, Y. 2010. Do you want to take notes?: Identifying research missionsin Yahoo! Search Pad. In Proceedings of the 19th International Conference on World Wide Web (WWW’10).ACM, New York, NY, 321–330.

ESTER, M., KRIEGEL, H. P., SANDER, J., AND XU, X. 1996. A density-based algorithm for discovering clusters inlarge spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD’96). ACM, New York, NY, 226–231.

FU, L., GOH, D. H.-L., FOO, S. S.-B., AND NA, J.-C. 2003. Collaborative querying through a hybrid queryclustering approach. In Proceedings of the 6th International Conference on Asian Digital Libraries(ICADL’03). Lecture Notes in Computer Science, vol. 2911, Springer-Verlag, Berlin Heidelberg, 111–122.

GABRILOVICH, E. AND MARKOVITCH, S. 2007. Computing semantic relatedness using Wikipedia-based explicitsemantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence.6–12.

GAYO-AVELLO, D. 2009. A survey on session detection methods in query logs and a proposal for future evalu-ation. Info. Sci. 179, 12, 1822–1843.

GLANCE, N. S. 2001. Community search assistant. In Proceedings of the 6th ACM International Conferenceon Intelligent User Interfaces (IUI’01). ACM, New York, NY, 91–96.

GUO, J., CHENG, X., XU, G., AND ZHU, X. 2011. Intent-aware query similarity. In Proceedings of the 20th ACMInternational Conference on Information and Knowledge Management (CIKM’11). ACM, New York, NY,259–268.

HE, D. AND GOKER, A. 2000. Detecting session boundaries from Web user logs. In Proceedings of the 22ndAnnual Colloquium on Information Retrieval Research (BCS-IRSG). 57–66.

HE, D., GOKER, A., AND HARPER, D. J. 2002. Combining evidence for automatic web session identification. Info.Process. Manage. 38, 5, 727–742.

HOPCROFT, J. AND TARJAN, R. 1973. Algorithm 447: Efficient algorithms for graph manipulation. Commun.ACM 16, 6, 372–378.

JANSEN, B. J. AND SPINK, A. 2006. How are we searching the world wide Web?: A comparison of nine searchengine transaction logs. Info. Process. Manage. 42, 1, 248–263.

JANSEN, B. J., SPINK, A., BATEMAN, J., AND SARACEVIC, T. 1998. Real life information retrieval: A study of userqueries on the web. SIGIR Forum 32, 1, 5–17.

JANSEN, B. J., SPINK, A., BLAKELY, C., AND KOSHMAN, S. 2007. Defining a session on Web search engines: Researcharticles. J. Amer. Soci. Info. Scie. Technol. 58, 6, 862–871.

JARVELIN, A., JARVELIN, A., AND JARVELIN, K. 2007. s-grams: Defining generalized n-grams for informationretrieval. Info. Process. Manage. 43, 4, 1005–1019.

JONES, R. AND KLINKNER, K. L. 2008. Beyond the session timeout: Automatic hierarchical segmentation ofsearch topics in query logs. In Proceedings of the 17th ACM International Conference on Informationand Knowledge Management (CIKM’08). ACM, New York, NY, 699–708.

KOTOV, A., BENNETT, P. N., WHITE, R. W., DUMAIS, S. T., AND Teevan, J. 2011. Modeling and analysis of cross-session search tasks. In Proceedings of the 34th International ACM SIGIR Conference on Research andDevelopment in Information Retrieval (SIGIR’11). ACM, New York, NY, 5–14.

LAU, T. AND HORVITZ, E. 1999. Patterns of search: Analyzing and modeling Web query refinement. In Proceed-ings of the 7th International Conference on User Modeling. Springer-Verlag, Berlin, 119–128.

LEACOCK, C. AND CHODOROW, M. 1998. Combining Local Context and WordNet Similarity for Word SenseIdentification. The MIT Press, Cambridge, MA, 11, 265–283.

LEE, U., LIU, Z., AND CHO, J. 2005. Automatic identification of user goals in Web search. In Proceedings of the14th International World Wide Web Conference (WWW’05). ACM, New York, NY, 391–400.

LESK, M. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pinecone from an ice cream cone. In Proceedings of the 5th ACM International Conference on SystemsDocumentation (SIGDOC’86). ACM, New York, NY, 24–26.

LEUNG, K. W. T., NG, W., AND LEE, D. L. 2008. Personalized concept-based clustering of search engine queries.IEEE Trans. Knowl. Data Engi. 20, 11, 1505–1518.

LUCCHESE, C., ORLANDO, S., PEREGO, R., SILVESTRI, F., AND TOLOMEI, G. 2011. Identifying task-based sessions insearch engine query logs. In Proceedings of the 4th ACM International Conference on Web Search andData Mining (WSDM’11). ACM, New York, NY, 277–286.

MACQUEEN, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Pro-ceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam andJ. Neyman Eds., Vol. 1. University of California Press, Berkeley, CA, 281–297.



MEI, Q., KLINKNER, K., KUMAR, R., AND TOMKINS, A. 2009. An analysis framework for search sequences. InProceeding of the 18th Conference on Information and Knowledge Management (CIKM’09). ACM, NewYork, NY, 1991–1994.

MILNE, D. AND WITTEN, I. H. 2008. An effective, low-cost measure of semantic relatedness obtained fromwikipedia links. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI’08). AAAI Press,Menlo Park, CA, 25–30.

OZMUTLU, H. C. AND CAVDUR, F. 2005. Application of automatic topic identification on excite web search enginedata logs. Info. Process. Manage. 41, 5, 1243–1262.

PORTER, M. F. 1980. An Algorithm for Suffix Stripping Vol. 14. Morgan Kaufmann Publishers, San Francisco,CA, 130–137.

QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco,CA.

RADA, R., MILI, H., BICKNELL, E., AND BLETTNER, M. 1989. Development and application of a metric on semanticnets. IEEE Trans. Syst. Man Cybernet. 19, 1, 17–30.

RADLINSKI, F. AND JOACHIMS, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceedingsof the KDD Cup Workshop at the 11th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining (KDD’05). ACM, New York, NY, 239–248.

RAGHAVAN, V. V. AND SEVER, H. 1995. On the reuse of past optimal queries. In Proceedings of the 18th ACMSIGIR International Conference on Research and Development in Information Retrieval (SIGIR’95).ACM, New York, NY, 344–350.

REED, W. 2001. The Pareto, zipf and other power laws. Econ. Lett. 74, 1, 15–19.RESNIK, P. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of

the 14th International Joint Conference on Artificial Intelligence (IJCAI). 448–453.RICHARDSON, M. 2008. Learning about the world through long-term query logs. ACM Trans. Web 2, 4, 1–27.ROSE, D. E. AND LEVINSON, D. 2004. Understanding user goals in web search. In Proceedings of the 13th

International World Wide Web Conference (WWW’04). ACM, New York, NY, 13–19.SALTON, G. AND MCGILL, M. J. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New

York, NY.SECO, N. AND CARDOSO, N. 2006. Detecting user sessions in the tumba! web log. Tech. rep. Faculdade de

Ciencias da Universidade de Lisboa.SHEN, X., TAN, B., AND ZHAI, C. 2005. Implicit user modeling for personalized search. In Proceeding of the 14th

Conference on Information and Knowledge Management (CIKM’05). ACM, New York, NY, 824–831.SHI, X. AND YANG, C. C. 2006. Mining related queries from search engine query logs. In Proceedings of the

15th International World Wide Web Conference (WWW’06). ACM, New York, NY, 943–944.SILVERSTEIN, C., MARAIS, H., HENZINGER, M., AND MORICZ, M. 1999. Analysis of a very large Web search engine

query log. SIGIR Forum 33, 1, 6–12.SILVESTRI, F. 2010. Mining Query Logs: Turning search usage data into knowledge. Found. Trends Info. Ret.

1, 1–2, 1–174.SILVESTRI, F., BARAGLIA, R., LUCCHESE, C., ORLANDO, S., AND PEREGO, R. 2008. (Query) history teaches every-

thing, including the future. In Proceedings of the 6th Latin American Web Congress (LA-WEB’08). IEEEComputer Society, Washington, DC, 12–22.

SPINK, A., PARK, M., JANSEN, B. J., AND PEDERSEN, J. 2006. Multitasking during Web search sessions. Info.Process. Manage. 42, 1, 264–275.

TAN, P. N., STEINBACH, M., AND KUMAR, V. 2005. Introduction to Data Mining. Addison-Wesley, Boston, MA.WEN, J. R., NIE, J. Y., AND ZHANG, H. 2002. Query clustering using user logs. ACM Trans. Info. Syst. 20, 1,

59–81.ZHAO, Y. AND KARYPIS, G. 2002. Evaluation of hierarchical clustering algorithms for document datasets. In

Proceeding of the 11th Conference on Information and Knowledge Management (CIKM’02). ACM, NewYork, NY, 515–524.

ZHAO, Y. AND KARYPIS, G. 2004. Empirical and theoretical comparisons of selected criterion functions fordocument clustering. Machine Learn. 55, 3, 311–331.

Received May 2011; revised June, November 2012, March 2013; accepted March 2013


Documents

Discovering Tasks from Search Engine Query Logs · 2013. 9. 25. · Discovering Tasks from Search Engine Query Logs 14:3 The rationale behind this two-step strategy is as follows