Improving the effectiveness of Web searching: Methodological issues Barry Eaglestone Department of Information Studies University of Sheffield [email protected]


Page 1: Improving the effectiveness of Web searching:  Methodological issues

Improving the effectiveness of Web searching: Methodological issues

Barry Eaglestone
Department of Information Studies
University of Sheffield
[email protected]

Page 2: Improving the effectiveness of Web searching:  Methodological issues

Overview

• An inductive study to build evidence-based meta-cognitive models of web searching by the general public.

• Data modelling issues
  – A temporal data modelling solution

• Discussion & Final thoughts

Page 3: Improving the effectiveness of Web searching:  Methodological issues

An inductive study of how the general public search on the web.

Setting the scene – the database approach and state of the art.

Page 4: Improving the effectiveness of Web searching:  Methodological issues

Motivation

• Need to develop new models for searching: update outdated usage paradigms.
  – Improve training methods
  – Develop automated assistance systems

Page 5: Improving the effectiveness of Web searching:  Methodological issues

Previous studies of search logs

• Web search is shallow + promiscuous
• Low use of advanced features
• Global statistics
  – Number of queries / search
  – Pages viewed / user
  – Query reformulation (change in no. of terms)
  – Most users enter few terms
  – Little to be gained by increasing complexity

Page 6: Improving the effectiveness of Web searching:  Methodological issues

The Team

• Information Seeking
• Chemoinformatics
• Database

Page 7: Improving the effectiveness of Web searching:  Methodological issues

Spectrum of Research Perspective

Soft ↔ Hard

[Diagram labels]
• Modelling / engineering / empirical
• Qualitative / quantitative data analysis / modelling
• Human / organisational issues
• Formally defined problems
• Computer world formalisations
• Hardware / software solutions
• CS: Computer world
• IS: People world
• Invention
• Discovery
• Problem solving
• Formalism

Page 8: Improving the effectiveness of Web searching:  Methodological issues

Who are the searchers? What are they searching for?

Context: the GENERAL PUBLIC
Volunteers (c500 searches):
• ICT courses
• University evening classes
• City Learning Centre courses
• Citizens’ forum
• Personal contacts
• Library
• Advertising
• Students and academics
+ over 1,000,000 search logs from anonymous searchers

How do they search?

Observe and record
• Over 1,000,000 anonymous search engine transaction logs
• c500 observed and recorded searches; talk to searchers
  – Self-selected searches explained through interview and think-aloud protocols
  – 2-3 set searches

Meta-cognitive knowledge about web searching?

• Determine query similarity
• Delimit searches
• Code query transformation
• Model searches as transformation graphs
• Data mine for stereotypical search strategies
• Correlate with who, why and effectiveness
Thus, establish evidence-based models of search strategy, related to user and problem characteristics and likelihood of success.

Effectiveness?

Infer effectiveness from
• search transformation patterns
• subject’s narrative

How will we use it?

• Evidence-based meta-cognitive training
• Intelligent interfaces

Page 9: Improving the effectiveness of Web searching:  Methodological issues

Why Meta-cognition?

“Meta-cognition refers to higher order thinking which involves active control over the cognitive processes engaged in learning. ….”
Livingston (1997)

• Meta-cognitive knowledge
  – “…knowledge of personal variables to general knowledge about how human beings learn and process information, as well as individual knowledge of one’s own learning processes…” e.g. “I have a bad memory!”
• Meta-cognitive regulation
  – “… activities used to ensure that a cognitive goal has been met….”, e.g., question yourself about the text and then re-read.
Livingston (1997)

Page 10: Improving the effectiveness of Web searching:  Methodological issues

Cognitive Styles Analysis
http://www.memletics.com/manual/default.asp?ref=ga&data=999+learning+styles+free+test

Dimensions: Holist ↔ Analyst; Verbalizer ↔ Imager

Page 11: Improving the effectiveness of Web searching:  Methodological issues

Syntactical / quantitative ↔ Semantic / qualitative

Excite search logs: ~10^6 searches

Holistic search logs: supplemented with qualitative data

Page 12: Improving the effectiveness of Web searching:  Methodological issues

Preliminary work

• Analysis of search logs

• Development of descriptive codes

• Aim is to form a basis for the analysis of our experimental data

Page 13: Improving the effectiveness of Web searching:  Methodological issues

Strengths / Limitations
• Large sample
• Definitely general public.
• No enquiry context – what are they looking for? What are they thinking?
• No measure of success.
• Are they searching or just browsing?
• Where does one enquiry end and another begin?
• Limited to one search engine – what did they do during a delay?

Page 14: Improving the effectiveness of Web searching:  Methodological issues

Excite Database Sample (~10^6 queries)

qid  uid               time    rank  query                          querymore  totwords
343  000000000000006a  192141  0     alco fence company ohio        No         4
344  000000000000006a  192219  0     alco fence company ohio        No         4
345  000000000000006a  192228  10    alco fence company ohio        No         4
346  000000000000006a  192243  20    alco fence company ohio        No         4
347  000000000000006a  192328  0     lifetime fence company ohio    No         4
348  000000000000006a  192359  10    lifetime fence company ohio    No         4
349  000000000000006a  192455  0     lifetime wire fence            No         3
350  000000000000006a  192634  0     high tensile wire fence        No         4
351  000000000000006b  161906  0     sickle cell anemia             No         3
352  000000000000006b  162006  10    sickle cell anemia             No         3
353  000000000000006b  162130  0     sickle cell anemia             No         3
354  000000000000006c  144303  0     Hilton Garden Inn              No         3
355  000000000000006c  144331  0     Hilton Garden Inn Jacksonville No         4
356  000000000000006c  144433  0     Hotel Search                   No         2
357  000000000000006c  144541  0     Jacksonvill Hotel              No         2
358  000000000000006c  144728  0     www.hilton.com                 No         1

Sessions: 1 (qid 343-350), 2 (qid 351-353), 3 (qid 354-358)

Page 15: Improving the effectiveness of Web searching:  Methodological issues

Query Transformations
• Changes in search strategy
  – Conceptual, e.g. changes in type of search: broad ↔ specific, text ↔ image
  – Linguistic: syntactic, query structure.
• Examples
  Q1: shakespeare hamlet
  Q2: shakespeare hamlet quotes
  Q3: to be or not to be
  Q4: “to be or not to be”
  Q5: “to be or not to be” +shakespeare

Page 16: Improving the effectiveness of Web searching:  Methodological issues

Our Preliminary Analysis

• To look at textual (syntactic) changes.
• Link queries by text similarity.
• Infer enquiry change from textual dissimilarity.
• Use these elements to develop a machine-readable codification of QT’s.
• To mine for characteristic patterns.

Page 17: Improving the effectiveness of Web searching:  Methodological issues

Example Transformations

Code  Transformation
N     New query
R     A repeated query / same page rank – relevance feedback
P     Page ranking (seek more)
p     Page ranking (earlier pages)
I(k)  Identical
C(k)  Conjoint
S(k)  Sub-phrase in common
s(k)  Sub-phrase + words in common
M(k)  Other textual similarity

Page 18: Improving the effectiveness of Web searching:  Methodological issues

QT graphs

[Figure: query transformation graph for uid 74, running from Start to END, with nodes numbered by query and edges labelled with QT codes (N, M, C, S, s, R, P)]

uid 74: NM(1)C(2)C(3)S(4)s(5)PPRPRRRRPPRRppI(5)s(6)s(22)s(22)s(23)s(25)s(26)s(22)R

Example queries:
• nursing careers
• paid undergraduate nursing schools in baltimore city maryland

Code Transformation

N New query

R A repeated query /same page rank – relevance feedback.

P Page ranking (seek more)

p Page ranking (earlier pages)

I(k) Identical

C(k) Conjoint

S(k) Sub-phrase in common

s(k) Sub-phrase + words in common

M(k) Other textual similarity

Page 19: Improving the effectiveness of Web searching:  Methodological issues

QT graphs

[Figure: query transformation graph for uid 342, running from Start to END, with edges labelled M, C, QJ, QD, P(3) and P(7) and a Delay between query groups]

uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)

molsworth

"us army"

Page 20: Improving the effectiveness of Web searching:  Methodological issues

Preliminary Conclusions
• We have developed a rich set of codes describing the syntactic part of QT’s
• These can be used to develop a graph-based description
• Correlations between the codes are meaningful/interesting
• They form part of the analysis for our current experimental study.

Page 21: Improving the effectiveness of Web searching:  Methodological issues

…and if you want to read about our preliminary results….

• Whittle M, Eaglestone B, Ford N, Madden A (2007), Data Mining of Search Logs, Journal of the American Society for Information Science and Technology (in press)

• Whittle M, Eaglestone B, Ford N, Gillet V.J., Madden A (2006), Query Transformations And Their Role In Web Searching By The General Public, Information Research, 12(1) October 2006

• Whittle M, Eaglestone B, Ford N, Gillet V, Madden A (2006), Query transformations and their role in web searching by the general public. Information Seeking in Context Conference 2006 ISIC, Australia

• Madden A, Eaglestone B, Ford N, Whittle M (2006), Search engines: a first step to finding information: preliminary findings from a study of observed searches, Information Seeking in Context Conference 2006 ISIC, Australia.

Page 22: Improving the effectiveness of Web searching:  Methodological issues

Sheffield Experimental Study

[Diagram labels]
• Screens, Audio
• Keystrokes, Queries, Web page titles
• Transcribing
• Pre-processing
• Qualitative analysis
• Quantitative analysis
• Temporal database
• Model development

Page 23: Improving the effectiveness of Web searching:  Methodological issues

Data modelling issues

Page 24: Improving the effectiveness of Web searching:  Methodological issues

Evolution of databases

Setting the scene – the database approach and state of the art.

The database approach – A database should be a natural representation of information as data, suitable for all relevant applications without duplication, including the ones you have not yet thought of.

“A well designed database system will mirror its users’ perceptions of the problem space, and thus allows them to address the problem in hand without complexities and distractions of computer world implementation details… Implicit is the notion that users should work within the bounds of ‘good practice’”

Page 25: Improving the effectiveness of Web searching:  Methodological issues

The semantic gap

The gap between what you wish to represent and what you can represent.

What you wish to represent (entity-relationship diagram):
Customer (1) – Placed_by – (n) Sales_Order (m) – Take_by – (1) Salesperson

What you can represent (relational tables):

Customer
C#  Name …
C1  Dr. Eaglestone
C2  Ms Smith

Salesperson
SP#  Names …
S5   Mr. Chan …
S8   Dr. Shao

SalesOrder
C#  SP#  Product  Quantity
C1  S5   P99      120
C1  S5   P2       10

Setting the scene – the database approach and state of the art.

Page 26: Improving the effectiveness of Web searching:  Methodological issues

….. & Data Independence

Applications/Users

External Model

Logical Model

Internal Model

Principles of database technology…

Page 27: Improving the effectiveness of Web searching:  Methodological issues

QT graphs

[Figure: query transformation graph for uid 342, running from Start to END, with edges labelled M, C, QJ, QD, P(3) and P(7) and a Delay between query groups]

uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)

molsworth

"us army"

Page 28: Improving the effectiveness of Web searching:  Methodological issues

A Ready-made Temporal data modelling solution

Page 29: Improving the effectiveness of Web searching:  Methodological issues

GENREG – A ready-made solution that has also been proposed for healthcare?

The Organisation: National Museum of Denmark

• Multimedia
  – Pictures as well as descriptions
• Distributed
  – Each department ran their own database system for their collection (ownership!)
• Object-oriented design
  – Entities, not just values
• Relational implementation

Page 30: Improving the effectiveness of Web searching:  Methodological issues

Database Research

Science

Technology

Application

Praxis

Theory

Page 31: Improving the effectiveness of Web searching:  Methodological issues

Topology

Departments connected by a LAN:
• Danish Pre-history
• Department of Antiquity
• Ethnographic Department
• Coin Collection

1,000,000 artefacts
200,000 images

Page 32: Improving the effectiveness of Web searching:  Methodological issues

Design / Abstractions

• Design
  – Object oriented
  – Based on a curator’s perspective
  – “Curators apply scientific training to determine the history of artefacts…creating knowledge about past and present societies by determining relationships which group artefacts within certain times and places in history”
• Abstractions
  – Artefact
  – Event
  – Relationship: relate artefacts which participate in common events

Page 33: Improving the effectiveness of Web searching:  Methodological issues

Mould – used to fabricate – Brooches

Page 34: Improving the effectiveness of Web searching:  Methodological issues

GENREG data model

ARTIFACT

EVENT/ARTIFACT

One (or more) artifacts participate in one or more events.
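As an illustration of the artifact/event relationship on this slide, the following is a minimal Python sketch, not the GENREG implementation. The class name `Register` and its methods are hypothetical; the only idea taken from the slides is that artefacts are related when they participate in a common event.

```python
from collections import defaultdict


class Register:
    """Hypothetical in-memory register of artefacts and events."""

    def __init__(self):
        # event id -> ids of the artefacts that participate in it
        self._participants = defaultdict(set)
        # artefact id -> ids of the events it participates in
        self._events_of = defaultdict(set)

    def add_participation(self, event_id, artefact_id):
        """Record that an artefact participates in an event."""
        self._participants[event_id].add(artefact_id)
        self._events_of[artefact_id].add(event_id)

    def related_artefacts(self, artefact_id):
        """Artefacts sharing at least one event with the given artefact."""
        related = set()
        for event_id in self._events_of.get(artefact_id, ()):
            related |= self._participants[event_id]
        related.discard(artefact_id)
        return related


if __name__ == "__main__":
    reg = Register()
    # The mould and the brooch both take part in a "fabricate" event.
    reg.add_participation("fabricate-1", "mould")
    reg.add_participation("fabricate-1", "brooch")
    reg.add_participation("burial-1", "brooch")
    reg.add_participation("burial-1", "sword")
    print(sorted(reg.related_artefacts("brooch")))
```

Keeping both directions of the mapping makes the "related through a common event" lookup cheap, at the cost of storing each participation twice.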

Page 35: Improving the effectiveness of Web searching:  Methodological issues

[Diagram: a Burial site containing two Graves, which in turn contain six Artefacts]

Page 36: Improving the effectiveness of Web searching:  Methodological issues

[Diagram: nodes A to L relating a Merchant’s House and a Manor House to their Rooms and Furniture, with a Purchase event linking furniture]

Page 37: Improving the effectiveness of Web searching:  Methodological issues
Page 38: Improving the effectiveness of Web searching:  Methodological issues

Integrated Care Pathways Application

[Procter, P., Eaglestone, B.M. & Burdis, C. “A unified model to support an information intensive healthcare environment”, MIE '99]

[Diagram: nodes P1 to P6 connected by steps labelled It, It+1 and It+2; annotations: Treatment, Alternative diagnoses, Alternative prognoses]

Page 39: Improving the effectiveness of Web searching:  Methodological issues

A formal GENREG Model

type Genreg = abs [tuple[ Collection : Artifacts, Events : set[Event]]
    new : () → Genreg,
    = : (Genreg × Genreg) → boolean,
    events : (Genreg) → set[Event],
    collection : (Genreg) → Artifacts]

type Artifacts = graph[Artifact]

type Event = abs[ tuple [id : E_Id, type : Event_Type, t : Time, place : Location, actors : set[Actor_Type], edge : set[Edge]]
    = : (Event × Event) → boolean,
    id : (Event) → E_Id,
    type : (Event) → Event_Type,
    t : (Event) → Time,
    place : (Event) → Location,
    actors : (Event) → set[Actor_Type],
    edgeset : (Event) → set[Edge]]
…

Page 40: Improving the effectiveness of Web searching:  Methodological issues

type Time = abs[tuple[ lower, upper : T]
    new : () → Time,
    = : (Time × Time) → boolean,
    before : (Time × Time) → boolean,
    meets : (Time × Time) → boolean,
    overlaps : (Time × Time) → boolean,
    during : (Time × Time) → boolean,
    starts : (Time × Time) → boolean,
    finishes : (Time × Time) → boolean,
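The operator names in the Time type (before, meets, overlaps, during, starts, finishes) match the standard interval relations of Allen's interval algebra. The sketch below assumes that reading; the slides do not give the definitions, so the class name `Interval` and the exact comparisons are illustrative, with integer bounds standing in for the unspecified type T.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed time interval [lower, upper]; bounds are assumed to be integers."""
    lower: int
    upper: int

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    def before(self, other: "Interval") -> bool:
        # self ends strictly before other starts
        return self.upper < other.lower

    def meets(self, other: "Interval") -> bool:
        # self ends exactly where other starts
        return self.upper == other.lower

    def overlaps(self, other: "Interval") -> bool:
        # self starts first and ends inside other
        return self.lower < other.lower < self.upper < other.upper

    def during(self, other: "Interval") -> bool:
        # self lies strictly inside other
        return other.lower < self.lower and self.upper < other.upper

    def starts(self, other: "Interval") -> bool:
        # same start, self ends earlier
        return self.lower == other.lower and self.upper < other.upper

    def finishes(self, other: "Interval") -> bool:
        # same end, self starts later
        return self.upper == other.upper and self.lower > other.lower


if __name__ == "__main__":
    a = Interval(1950, 1960)
    b = Interval(1960, 2004)
    print(a.meets(b), a.before(b), Interval(1955, 1958).during(a))
```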

Page 41: Improving the effectiveness of Web searching:  Methodological issues

• add_artifact / delete_artifact (D, a)
• add_event / delete_event (D, e)
• merge (D,F,E)
• select_artefacts (D,p)
• select_events (D,p)
• related_to (D,n)
• related_by (D,e,n)

Page 42: Improving the effectiveness of Web searching:  Methodological issues

Temporal Data Models
(See also SQL/Temporal)

[Diagram: cube with axes Entity, Attribute and Time]

• Entity: Barry; Height: 5’ 10’’; Time: 2004
• Entity: Barry; Height: 2’ 3’’; Time: 1950

Page 43: Improving the effectiveness of Web searching:  Methodological issues

• Artefact histories are created retrospectively

• Multiple orthogonal time dimensions can be represented (using specialised events), e.g., discovery and historic time.

• Relationships between events and states are modelled.

• Multiple objects can represent different states and interpretations of an entity.

Page 44: Improving the effectiveness of Web searching:  Methodological issues
Page 45: Improving the effectiveness of Web searching:  Methodological issues

QT graphs

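[Figure: query transformation graph for uid 342, running from Start to END, with edges labelled M, C, QJ, QD, P(3) and P(7) and a Delay between query groups; annotated with queries Q3 and Q4 and the transformation QJ at time t]

uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)

molsworth

"us army"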

Page 46: Improving the effectiveness of Web searching:  Methodological issues

Some final thoughts…

Page 47: Improving the effectiveness of Web searching:  Methodological issues

Some final thoughts…

• The Database Approach?
• Semantic gap?
• Data independence?
• Temporal modelling?
• Query language?
• So, what’s happening?

Page 48: Improving the effectiveness of Web searching:  Methodological issues

IR & DB?

IR – collections of artefacts are available for ad hoc querying (any relevant problem). The problem is modelled by the query.

DB – collections of artefacts are structured to model the problem space.

[Diagram]
• Server(s): Internet-accessible repositories of artefacts
• Client(s): users are researchers who derive knowledge from retrieved artefacts
• Client to server: problem-related query
• Server to client: problem-relevant artefacts
• Researcher’s workspace: developed to model the problem space; holds the artefact collection

Page 49: Improving the effectiveness of Web searching:  Methodological issues

…final thoughts…
• Knowledge of research methodology is important (qualitative and quantitative)
• Nudist, Atlas, SPSS don’t support mixed methods
• Database approach allows integration of qualitative and quantitative data, and organisation of data to evolve to model emerging theory
• Temporal data models are key to modelling evolving strategy…

Page 50: Improving the effectiveness of Web searching:  Methodological issues

Acknowledgments

• The project team – Nigel Ford, Andrew Madden, Martin Whittle

• Arts and Humanities Research Council (formerly Board) for funding

• Mark Sanderson and Amanda Spink for making the Excite logs available

• Val Gillet and Eleanor Gardiner for help with graphs.

Page 51: Improving the effectiveness of Web searching:  Methodological issues
Page 52: Improving the effectiveness of Web searching:  Methodological issues

Summary

• Feedback can lead to semantic changes
• Complexity can be a hindrance
• Searches don’t necessarily end when a searcher leaves a search engine.

Page 53: Improving the effectiveness of Web searching:  Methodological issues

Algorithm

for i = 1 to n              (loop over session queries)
    for j = 1 to i-1        (loop over previous queries)
        Compare query i with j
    Choose most similar pair i,j
    Analyse to assign QT type

[Diagram: time axis running from query 1 to query n, with each previous query j lying before query i]

Page 54: Improving the effectiveness of Web searching:  Methodological issues

Some Preliminary Observations

• Quote marks are likely to be used with a new query.

• Delay is strongly associated with N (New query): these are successful single queries within a session.

• B (Include Boolean) & C (Conjoint) are positively associated

• B & D (Disjoint) are negatively associated

Page 55: Improving the effectiveness of Web searching:  Methodological issues

Number of words/query: Excite 2001

[Figure: normalised frequency (0 to 0.35) against terms/query (log scale, 1 to 100)]

Page 56: Improving the effectiveness of Web searching:  Methodological issues

Classification of textual QT’s

• Word order, addition, subtraction.
• Inclusion or removal of
  – Boolean terms
  – “quotes”
• Detection of new enquiries.
• We use similarity methods to compare words and queries.

Page 57: Improving the effectiveness of Web searching:  Methodological issues

Self-selected searches

Prompts:
• Think about the last time you had trouble finding something you were looking for on the Internet.
• Do you have any hobbies or interests for which the Internet might provide useful information?

Page 58: Improving the effectiveness of Web searching:  Methodological issues

Hölscher & Strube (2000): Graphical Representation

Close-up of direct interaction with a search engine: numbers show transition probabilities.

Experts and novices doing specific search tasks

Page 59: Improving the effectiveness of Web searching:  Methodological issues

Set searches

Heads:
• What was written on Neville Chamberlain’s piece of paper?
• You’ve won a holiday to Saga. What can you find out about the place that interests you?

Page 60: Improving the effectiveness of Web searching:  Methodological issues

Set searches

Tails:
• You’ve received a postcard from friends who say they’re visiting Map. Where are they?
• There are many opportunities to win things on the Internet. Can you find some that relate to your interests?

Additional search:
• Find the postcode of the tallest building in the UK outside of London.

Page 61: Improving the effectiveness of Web searching:  Methodological issues

All searches recorded using Spector Pro (keystroke recorder) and My Screen Recorder (which records voice + activities on PC).

Page 62: Improving the effectiveness of Web searching:  Methodological issues

Annotated transcripts

Each entry starts with the time at which the stated action takes place; the bracketed value is the browse time preceding the action.

Search 1
00.50 “I might as well go with what I know best”
01.20 (enters ‘CD albums collection’)
01.27 (6s browse) Selects 2nd link (CD universe)
01.53 (31s browse) – selects Dance = 7 of ? (>24) (on LHS). “See this is the trouble, cos I don’t really know what category it would go into. It was a mixed CD so it’s got all sorts of different things on, and there’s not really a category for that, I don’t think.”
01.56 (8s browse) – Selects Dance Collections = 7 of 12 (top of page)

Page 63: Improving the effectiveness of Web searching:  Methodological issues

Search dimensions

Volunteer  Search no.  On  Off  On/(On+Off)  Depth  Intensity: mean (s.d.)
1          1           2   1    0.67         1      43.33 (24.66)
1          2           10  8    0.56         2      14.72 (15.1)
1          3           6   5    0.55         3      12.27 (11.26)
1          4           3   1    0.75         1      7.5 (6.45)
2          1           30  14   0.68         6      4.55 (6.36)
2          2           22  8    0.73         2      7.67 (9.8)
2          3           8   1    0.89         1      13.33 (16.96)
2          4           24  2    0.92         1      6.73 (12.88)

Page 64: Improving the effectiveness of Web searching:  Methodological issues

Progress

c54 volunteers observed since Oct 2005 (representing c200 searches).

Page 65: Improving the effectiveness of Web searching:  Methodological issues

cf Transaction Logs

Internet searches are often regarded as being ‘shallow and promiscuous’ (= many short, simple searches). This idea supports the perception of searches viewed from search engine transaction logs. Such logs give a useful summary of search engine use, but not of Web search behaviour viewed as a whole.

Page 66: Improving the effectiveness of Web searching:  Methodological issues

Feedback loops

Learn from previous searches
E.g. semantic shifts

Sheffield Pals Battalion

Richard Sparling

Page 67: Improving the effectiveness of Web searching:  Methodological issues

Complex search ≠ good search

Familiarity with search engine facilities (Boolean, “”, etc) does not always indicate competence. E.g.: postcode "tallest building outside london" –london.

Page 68: Improving the effectiveness of Web searching:  Methodological issues

Use the general to find the specialist

Search engine used to find a more focussed search tool. E.g. – searcher looking for info on B&B in York finds a directory of holiday accommodation.

Page 69: Improving the effectiveness of Web searching:  Methodological issues

• Jansen ref re complexity
• Findings title
• Search dimensions slide
• Database side – modelling.

Page 70: Improving the effectiveness of Web searching:  Methodological issues

Previous studies of search logs
• Web search is shallow + promiscuous
• Low use of advanced features
• Global statistics
  – Number of queries / search
  – Pages viewed / user
  – Query reformulation (change in no. of terms)
  – Most users enter few terms
  – Little to be gained by increasing complexity

Page 71: Improving the effectiveness of Web searching:  Methodological issues

Strengths
• Large sample.
• Natural environment.
• Definitely general public.

Limitations
• No enquiry context – what are they looking for? What are they thinking?
• No measure of success.
• Are they searching or just browsing?
• Where does one enquiry end and another begin?
• Limited to one search engine – what did they do during a delay?

Page 72: Improving the effectiveness of Web searching:  Methodological issues

Experimental Study

• Strengths
  – Very detailed information.
  – Searching not surfing.
  – Comparison of identical enquiries.
• Limitations
  – Small sample of queries.
  – Limited public sample – volunteers.

Page 73: Improving the effectiveness of Web searching:  Methodological issues

This work

• Development of quantitative analysis

• Analysis of search logs (Excite 2001)

• Development of descriptive codes

• Aim is to form a basis for the analysis of our experimental data

Page 74: Improving the effectiveness of Web searching:  Methodological issues

Aims of Quantitative Analysis

• To look at textual (syntactic) changes.
• Link queries by text similarity.
• Infer enquiry change from textual dissimilarity.
• Use these elements to develop a machine-readable codification of QT’s.

Page 75: Improving the effectiveness of Web searching:  Methodological issues

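Word similarity

Dice Coefficient

$S_W = \frac{2c}{a + b}$

where $a$ and $b$ are the lengths of the two words and $c$ is the length of the letter sequence they have in common.

Example: elected / election: $S_W = \frac{10}{7 + 8} = 0.667$

[Diagram: "elected" is shifted against "election"; the five aligned letters "elect" match (1) and all other positions do not (0)]

Drawback: on this measure doing and going are very similar (0.8) while bug and debugging have $S_W = 0.5$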
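The word-similarity measure on this slide can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: it assumes that the common part c is the longest run of consecutive letters shared by the two words, which reproduces the worked values on the slides (0.667, 0.8, 0.5). The function names and the lower-casing step are assumptions.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # character of a and at b[j-1]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def word_similarity(a: str, b: str) -> float:
    """Dice-style score 2c / (len(a) + len(b)); c = longest common substring."""
    if not a or not b:
        return 0.0
    c = longest_common_substring(a.lower(), b.lower())
    return 2 * c / (len(a) + len(b))


if __name__ == "__main__":
    for pair in [("elected", "election"), ("doing", "going"), ("bug", "debugging")]:
        print(pair, round(word_similarity(*pair), 3))
```

The dynamic-programming table plays the role of the "shift" in the slide: every alignment of one word against the other is examined, and the longest matching run is kept.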

Page 76: Improving the effectiveness of Web searching:  Methodological issues

Word Similarity Threshold

• ping / ding: $S_W = \frac{6}{8} = 0.75$
• thing / bring: $S_W = \frac{6}{10} = 0.6$
• string / trying: $S_W = \frac{6}{12} = 0.5$
• training / nursing: $S_W = \frac{6}{15} = 0.4$

• Partial solution: introduce threshold WST = 0.4
• Anything less similar than WST is given $S_W = 0$

Page 77: Improving the effectiveness of Web searching:  Methodological issues

Query Similarity

• For each word in query 1 find the most similar word in query 2 and combine results
• Accommodates repeated words (in query 2) without weighting
• Main point of WST is to avoid the accumulation of many small contributions to the query similarity

Page 78: Improving the effectiveness of Web searching:  Methodological issues

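Query Similarity Example

Query 1: leaf gelatin supplier barcelona
Query 2: gelatine supplies in spain

Best score for each word of query 1:
• leaf: 0
• gelatin: 0.93
• supplier: 0.88
• barcelona: 0

$S_Q = \frac{\text{sum of scores}}{\text{max number of words}}$

Evaluate: $S_Q = 0.453$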
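The query-similarity calculation can be sketched as below. It is a hedged illustration, not the authors' implementation: the word score is assumed to be 2c / (a + b) with c the longest common run of letters, scores below WST = 0.4 are set to zero, and the sum of best scores is divided by the larger of the two word counts. All function names are hypothetical.

```python
def _common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive letters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def word_similarity(a: str, b: str, wst: float = 0.4) -> float:
    """Word score 2c / (len(a) + len(b)), zeroed when it falls below wst."""
    score = 2 * _common_run(a, b) / (len(a) + len(b))
    return score if score >= wst else 0.0


def query_similarity(q1: str, q2: str, wst: float = 0.4) -> float:
    """Sum of best word scores for q1's words, over the larger word count."""
    words1, words2 = q1.lower().split(), q2.lower().split()
    if not words1 or not words2:
        return 0.0
    total = sum(max(word_similarity(w1, w2, wst) for w2 in words2)
                for w1 in words1)
    return total / max(len(words1), len(words2))


if __name__ == "__main__":
    s = query_similarity("leaf gelatin supplier barcelona",
                         "gelatine supplies in spain")
    print(round(s, 3))
```

Note that the measure is computed from the point of view of query 1, so swapping the two queries can in general give a different score.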

Page 79: Improving the effectiveness of Web searching:  Methodological issues

Query Similarity Threshold

We are looking for the most similar previous query j to query i.

[Diagram: time axis with previous query j lying before query i]

If none are similar, maybe i is a new enquiry.

Set QST = 0.3 as lowest acceptable similarity for a valid query connection

Page 80: Improving the effectiveness of Web searching:  Methodological issues

Setting WST and QST

• Result narrowed down by close inspection

• In the first 300 queries the set with WST = 0.4 and QST = 0.3 agreed with a human analysis of the best categorisation in all cases bar one, which was in any case an unusual entry.

Page 81: Improving the effectiveness of Web searching:  Methodological issues

Algorithm

for i = 1 to n              (loop over session queries)
    for j = 1 to i-1        (loop over previous queries)
        Compare query i with j
    Choose most similar pair i,j
    Assign k=j
    Analyse to assign QT type k → i

[Diagram: time axis running from query 1 to query n, with each previous query j lying before query i]
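The linking loop can be sketched in Python as follows. This is an illustrative sketch only: the similarity function is passed in as a parameter, and the `word_overlap` function used in the demonstration is a simple stand-in, not the query-similarity measure from the slides. Resolving ties in favour of the earliest previous query is also an assumption.

```python
from typing import Callable, List, Optional


def link_queries(queries: List[str],
                 similarity: Callable[[str, str], float],
                 qst: float = 0.3) -> List[Optional[int]]:
    """For each query i, return the 1-based index k of its most similar
    previous query, or None when no previous query reaches the threshold
    qst (a candidate new enquiry)."""
    links: List[Optional[int]] = []
    for i, query in enumerate(queries):          # loop over session queries
        best_j, best_score = None, 0.0
        for j in range(i):                       # loop over previous queries
            score = similarity(query, queries[j])
            if score > best_score:               # strict '>' keeps earliest on ties
                best_j, best_score = j, score
        if best_j is not None and best_score >= qst:
            links.append(best_j + 1)
        else:
            links.append(None)
    return links


def word_overlap(q1: str, q2: str) -> float:
    """Stand-in similarity: shared words over the larger word count."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / max(len(a), len(b)) if a and b else 0.0


if __name__ == "__main__":
    session = ["f8", "f8 airplane", "ceo compensation", "2000 ceo compensation"]
    print(link_queries(session, word_overlap))
```

A query linked to `None` would be coded as a new query (N); the linked index k is the reference used by the substantive codes such as C(k).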

Page 82: Improving the effectiveness of Web searching:  Methodological issues

“Trivial” Transformations

Code  Transformation
U     Unique
N     New query
R     Repeated query
P     Page viewing (seek more)
p     Page viewing (earlier pages)
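A possible way to assign the R, P and p codes from two consecutive log rows is sketched below. It rests on assumptions that the slides do not state explicitly: that the comparison is with the immediately preceding row of the same session, and that the log's rank field is the offset of the results page being viewed. The function name is hypothetical.

```python
from typing import Optional


def trivial_code(prev_query: str, prev_rank: int,
                 query: str, rank: int) -> Optional[str]:
    """Return 'R', 'P' or 'p' when the query text is unchanged, else None."""
    if query != prev_query:
        return None      # text changed: needs a substantive code instead
    if rank > prev_rank:
        return "P"       # viewing later result pages (seek more)
    if rank < prev_rank:
        return "p"       # going back to earlier result pages
    return "R"           # same query and same page rank: repeated query


if __name__ == "__main__":
    rows = [("alco fence company ohio", 0),
            ("alco fence company ohio", 0),
            ("alco fence company ohio", 10),
            ("lifetime fence company ohio", 0)]
    for (q0, r0), (q1, r1) in zip(rows, rows[1:]):
        print(trivial_code(q0, r0, q1, r1))
```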

Page 83: Improving the effectiveness of Web searching:  Methodological issues

Substantive Transformations I

Code Transformation (relative to k)

I(k) Identical

J(k) Identical apart from Quotes/Boolean

C(k) Conjoint

D(k) Disjoint

S(k) Sub-phrase in common

s(k) Sub-phrase + words in common

Page 84: Improving the effectiveness of Web searching:  Methodological issues

Substantive Transformations II

Code  Transformation (relative to k)
W(k)  Single word in common
w(k)  Separated single words in common
M(k)  Other textual similarity

Below Threshold Similarity
Z(k)  Not similar but word in common
z(k)  Not similar but words in common

Page 85: Improving the effectiveness of Web searching:  Methodological issues

Target: one two three (written as 123)

Comparison  Symbol  Type

Basic transformations
1234        C       Conjunction
12          D       Disjunction

Common sub-phrase
124         S       Replacement
231         s       Reordering
1243        s       Insertion/removal

Common word
145         W       Replacement
132         w       Reordering
143         w       Replacement/insertion

Below threshold similarity
1456        Z       Common word
1245678     z       Common phrase

Page 86: Improving the effectiveness of Web searching:  Methodological issues

Supplementary Transformations

Code  Transformation
B     Include Boolean term
b     Remove Boolean term
Q     Include quote marks
q     Remove quote marks
_     Delay > 1 hour

Page 87: Improving the effectiveness of Web searching:  Methodological issues

Example full transformation

May include up to 4 terms, e.g. BQC(4)_

B     Boolean
Q     Quote marks
C(4)  Substantive
_     Delay
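Because each transformation is written as up to four terms in a fixed order (Boolean, quote marks, main code with optional reference k, delay), a session string such as N_NC(2)PPPPW(3)NC(9)P can be split mechanically. The sketch below is one way to do this with a regular expression; the `QT` tuple and `parse_session` function are hypothetical names, and the set of main codes is taken from the code tables in these slides.

```python
import re
from typing import List, NamedTuple, Optional


class QT(NamedTuple):
    boolean: str        # "B", "b" or ""
    quotes: str         # "Q", "q" or ""
    code: str           # trivial or substantive code, e.g. "N", "P", "C"
    k: Optional[int]    # reference query for substantive codes, else None
    delay: bool         # True when followed by "_" (delay > 1 hour)


# Order assumed from the "BQC(4)_" example: Boolean, quotes, code(k), delay.
_QT_RE = re.compile(r"([Bb]?)([Qq]?)([UNRPpIJCDSsWwMZz])(?:\((\d+)\))?(_?)")


def parse_session(codes: str) -> List[QT]:
    """Split a session's modification list into one QT per query."""
    result = []
    pos = 0
    while pos < len(codes):
        match = _QT_RE.match(codes, pos)
        if match is None:
            raise ValueError(f"unrecognised code at position {pos}: {codes[pos:]!r}")
        boolean, quotes, code, k, delay = match.groups()
        result.append(QT(boolean, quotes, code, int(k) if k else None, delay == "_"))
        pos = match.end()
    return result


if __name__ == "__main__":
    for qt in parse_session("N_NC(2)PPPPW(3)NC(9)P"):
        print(qt)
```

Parsing the example string yields eleven transformations, one per query in the session.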

Page 88: Improving the effectiveness of Web searching:  Methodological issues

Some examples

Code    Query1                                           Query2
QJ(k)   bargain music                                    “bargain music”
QC(k)   Bacteremia                                       “Pneumoccol Bacteremia”
qJ(k)   “university of texas” “alternative medicine”     university of texas” “alternative medicine”
qw(k)   "tax law_depreciation system"                    tax law/depreciation system
BC(k)   "the sopranos"                                   "the sopranos" +scripts
BJ(k)   +"Complaint form letters" Insurance              +"Complaint form letters" +Insurance
BS(k)   doppler effect labs                              doppler effect +lab

Page 89: Improving the effectiveness of Web searching:  Methodological issues

More examples

Code     Query1                                     Query2
Bs(k)    conferences image processing               +image +processing +conferences +finland
BqW(k)   "Craig Larman"                             +Larman +Valtech
BqZ(k)   +"lbp 1000" +review                        +canon +review +laser +printer
BqW(k)   Hevia AND bagpipe                          "Spanish bagpipe"
bQs(k)   +used +horse +trailer                      +arndt +"horse trailer" used
bqW(k)   +arndt +"horse trailer" used               +Arndt trailer
bqs(k)   +Moby +southside +"Gwen Stefani" +mp3      +Moby +southside +mp3

Page 90: Improving the effectiveness of Web searching:  Methodological issues

Output for the first 100 Excite queries

Source file: excite.txt
word modification threshold : 0.400000
query modification level : 0.300000
sub-session delay/s : 3600

qid0  uid      nq  Modification list
1     1   **   1   U
2     2   **   5   NW(1)_NPP
7     3   **   4   NS(1)PP
11    4   **   1   U
12    5   **   1   U
13    6   **   1   U
14    7   **   5   N_QNPPP
19    8   **   4   NPPP
23    9   **   1   U
24    10  **   4   NQJ(1)NQN
28    11  **   5   N_NN_NP
33    12  **   2   N_N
35    13  **   3   NR_R
38    14  **   1   U
39    15  **   1   U
40    16  **   4   NM(1)RN
44    17  **   21  N_N_NC(1)PPPPNW(9)PPPPC(10)PPPPPP
65    18  **   2   NP
67    19  **   10  NRPC(1)RP_NS(7)D(7)I(7)
77    20  **   1   QU
78    21  **   1   U
79    22  **   1   U
80    23  **   1   U
81    24  **   1   U
82    25  **   1   QU
83    26  **   11  N_NC(2)PPPPW(3)NC(9)P
94    27  **   5   NNW(2)RR
99    28  **   1   U
100   29  **   3   NW(1)_M(1)

Highlighted session (uid 26): N_NC(2)PPPPW(3)NC(9)P

Page 91: Improving the effectiveness of Web searching:  Methodological issues

One session - 3 sub-sessions

n   qid  uid               time    rank  query                  querymore  totwords  QT
1   83   000000000000001a  083122  0     chicago sun times      No         3         N_
2   84   000000000000001a  105439  0     f8                     No         1         N
3   85   000000000000001a  105453  0     f8 airplane            No         2         C(2)
4   86   000000000000001a  105536  10    f8 airplane            No         2         P
5   87   000000000000001a  105614  20    f8 airplane            No         2         P
6   88   000000000000001a  105630  30    f8 airplane            No         2         P
7   89   000000000000001a  105731  40    f8 airplane            No         2         P
8   90   000000000000001a  105740  0     airplanes f8           No         2         W(3)
9   91   000000000000001a  113441  0     ceo compensation       No         2         N
10  92   000000000000001a  113633  0     2000 ceo compensation  No         3         C(9)
11  93   000000000000001a  113752  10    2000 ceo compensation  No         3         P

Page 92: Improving the effectiveness of Web searching:  Methodological issues

Query lengths

[Figure: frequency (log scale, 1 to 1,000,000) against length in queries (log scale, 1 to 100), plotted for sessions and sub-sessions]

10% of sub-sessions are at least 7 queries in length

Page 93: Improving the effectiveness of Web searching:  Methodological issues

QT relative frequencies

[Figure: bar chart of percentage frequency (0 to 35) for each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]

Page 94: Improving the effectiveness of Web searching:  Methodological issues

Terminal QT’s

[Figure: bar chart of terminal QT ratio (0 to 1.2) for each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]

$$\text{Ratio} = \frac{\text{FinalFreq}(QT)}{\text{Freq}(QT)}$$

i.e. the last queries in a sub-session
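A minimal sketch of how such a ratio could be computed, reading it as the number of times a code ends a sub-session divided by the number of times it occurs at all. The function name `terminal_ratios` and the input layout (one list of codes per sub-session) are assumptions for illustration, and the sample data is made up.

```python
from collections import Counter


def terminal_ratios(sub_sessions):
    """Return FinalFreq(QT) / Freq(QT) for every code that occurs.

    sub_sessions: iterable of lists of QT codes, one list per sub-session.
    """
    total = Counter()   # Freq(QT): every occurrence of a code
    final = Counter()   # FinalFreq(QT): occurrences as the last code
    for codes in sub_sessions:
        if not codes:
            continue
        total.update(codes)
        final[codes[-1]] += 1
    return {code: final[code] / total[code] for code in total}


# Made-up sub-sessions, only to show the call.
ratios = terminal_ratios([["U"], ["N", "P", "P"], ["N", "C", "P"], ["N", "R"]])
print(ratios["U"], ratios["N"])  # 1.0 0.0
```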

Page 95: Improving the effectiveness of Web searching:  Methodological issues

QT graphs

[Figure: graph of the query transformations for uid 74, running from Start to END, with numbered nodes and edges labelled by QT code.]

uid 74: NM(1)C(2)C(3)S(4)s(5)PPRPRRRRPPRRppI(5)s(6)s(22)s(22)s(23)s(25)s(26)s(22)R

nursing careers

paid undergraduate nursing schools in baltimore city maryland

Page 96: Improving the effectiveness of Web searching:  Methodological issues

QT graphs

[Figure: graph of the query transformations for uid 342, running from Start to END, with numbered nodes, edges labelled by QT code and a Delay marker.]

uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)

molsworth

"us army"

Page 97: Improving the effectiveness of Web searching:  Methodological issues

Frequency of nodes with k connections

[Figure: plot of ln(f) (0 to 12) against k (0 to 10); series: Query length 10, Query length 20.]

Slope = -1

Exponential scaling

Page 98: Improving the effectiveness of Web searching:  Methodological issues

Intra-QT correlations

• f(A,B): measured coincident frequency of codes A and B

• E{}: expected value

• V{}: variance

$$D_f(A_i, A_j) = \frac{f(A_i, A_j) - E\{f(A_i, A_j)\}}{V\{f(A_i, A_j)\}}$$

Correlations within a transform, e.g. [BQC(3)_]

Page 99: Improving the effectiveness of Web searching:  Methodological issues

Intra-QT correlations

Type         B       b       Q       q       _
U        20.60       –    1.32       –       –
N        -1.48       –   23.26       –   78.27
P            –       –       –       –  -66.16
p            –       –       –       –   -9.63
R            –       –       –       –   10.53
I            –       –       –       –    4.45
J        61.85   47.37  136.42   78.37   -5.74
C        46.02  -42.81  -15.14  -19.22   -4.70
D       -34.07   62.20  -15.09   13.45   -4.79
S       -24.52  -11.14  -20.69   -7.63   -5.65
s        -2.62    9.93   -7.05    3.65   -8.04
W       -35.00  -10.35  -32.99   -6.81   -6.05
w        -2.63    9.14  -11.51   -0.98   -8.18
M       -21.05  -12.98  -37.31  -13.28   -1.97
Z        -2.26   14.11  -10.06    2.23   -0.90
z         1.78    2.82    0.55    1.45    0.95
B         0.00       –    1.16   76.78  -15.01
b            –    0.00   74.95   10.05  -11.07
Q         1.16   74.95    0.00       –   -0.28
q        76.78   10.05       –    0.00   -7.77
_       -15.01  -11.07   -0.28   -7.77    0.00

Example:

[BQC(3)_]

Page 100: Improving the effectiveness of Web searching:  Methodological issues

Some Observations

• Quote marks are likely to be used with a new query.

• Delay is strongly associated with N: these are successful single queries within a session.

• B & C are positively associated

• B & D are negatively associated

Page 101: Improving the effectiveness of Web searching:  Methodological issues

Application to Experimental Results

Page 102: Improving the effectiveness of Web searching:  Methodological issues

Query Transforms

qid  SS  Query                            QM(similarity)  QM(preceding)
  1  *   CD albums collection             N               N
  2      CD albums collection             R               R
  3  *   Autotrader                       N               N
  4  *   atlas                            N               N
  5  *   place names                      N               N
  6      place names                      R               R
  7  *   map                              N               N
  8  *   online competitions              N               N
  9  *   Tall British buildings           N               N
 10      Tall buildings                   w(9)            w(9)
 11      Tall buildings                   R               R
 12      Tall buildings                   R               R
 13      Tall buildings in Britain        w(9)            C(12)
 14      Tallest building outside London  M(9)            M(13)

Page 103: Improving the effectiveness of Web searching:  Methodological issues

Temporal Database

• A repository of all data for each session
• Accessible to SQL
• Used to build evidence-based models for searching

[Diagram: data sets held in the database, linked by uid and qid keys]

• Background details, Web experience, cognitive style scores
• Subjects' appraisal of searches
• Search queries, Web page titles
• Key stroke record, activity timings
• Query modification codes
• Qualitative analysis
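To illustrate what "accessible to SQL" could look like, here is a small sketch using Python's built-in sqlite3 module. The table and column names (`subject`, `search_query`, `query_modification`, `issued_at` and so on) are hypothetical and are not the project's actual schema. Only the uid and qid keys come from the slide, the two sample queries and their codes are taken from the Query Transforms slide, and the timestamps are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: subjects keyed by uid, queries and codes keyed by qid.
conn.executescript(
    """
    CREATE TABLE subject (
        uid INTEGER PRIMARY KEY,
        web_experience TEXT,
        cognitive_style_score REAL
    );
    CREATE TABLE search_query (
        qid INTEGER PRIMARY KEY,
        uid INTEGER NOT NULL REFERENCES subject(uid),
        issued_at TEXT,              -- placeholder timestamp column
        query_text TEXT NOT NULL
    );
    CREATE TABLE query_modification (
        qid INTEGER NOT NULL REFERENCES search_query(qid),
        qm_code TEXT NOT NULL
    );
    """
)

conn.execute("INSERT INTO subject VALUES (1, NULL, NULL)")
conn.executemany(
    "INSERT INTO search_query VALUES (?, ?, ?, ?)",
    [
        (9, 1, "2000-01-01T10:00:00", "Tall British buildings"),
        (10, 1, "2000-01-01T10:01:30", "Tall buildings"),
    ],
)
conn.executemany(
    "INSERT INTO query_modification VALUES (?, ?)",
    [(9, "N"), (10, "w(9)")],
)

# One subject's queries in time order, joined to their modification codes.
rows = conn.execute(
    """
    SELECT q.qid, q.query_text, m.qm_code
    FROM search_query AS q
    JOIN query_modification AS m ON m.qid = q.qid
    WHERE q.uid = ?
    ORDER BY q.issued_at
    """,
    (1,),
).fetchall()
print(rows)
```

Keeping each data set in its own table and joining on uid or qid is one way to combine the quantitative codes with the other recorded data in a single query.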

Page 104: Improving the effectiveness of Web searching:  Methodological issues

Acknowledgments

• Arts and Humanities Research Council (formerly Board) for funding

• Mark Sanderson and Amanda Spink for making the Excite logs available

Page 105: Improving the effectiveness of Web searching:  Methodological issues

Questions ?

Page 106: Improving the effectiveness of Web searching:  Methodological issues

Setting WST and QST

excite: WST = 0.4

[Figure: frequency (0 to 350000) plotted over the range 0 to 1, horizontal axis labelled Query Transformation; series: Tot New, Tot Mod, z+Z.]

Page 107: Improving the effectiveness of Web searching:  Methodological issues

Inter-QT correlations

• f(A | B): measured frequency of codes B following A

• E{}: expected value

• V{}: variance

$$D_f(A_i \mid B_j) = \frac{f(A_i \mid B_j) - E\{f(A_i \mid B_j)\}}{V\{f(A_i \mid B_j)\}}$$

Correlations of one transform with the next.
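The measured frequencies can be obtained by counting consecutive pairs of codes. The sketch below does only that counting step. How the expected value and variance are modelled is not stated on the slide, so they are left out. The function name `transition_counts` and the use of single code letters are assumptions made for illustration.

```python
from collections import Counter


def transition_counts(sessions):
    """Count how often each code is followed by each other code.

    sessions: iterable of lists of QT code letters, in query order.
    Returns a Counter keyed by (prior, posterior).
    """
    counts = Counter()
    for codes in sessions:
        for prior, posterior in zip(codes, codes[1:]):
            counts[(prior, posterior)] += 1
    return counts


# Substantive letters of N NC(2)PPPPW(3)NC(9)P, with references dropped.
counts = transition_counts([list("NCPPPPWNCP")])
print(counts[("P", "P")], counts[("N", "C")])  # 3 2
```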

Page 108: Improving the effectiveness of Web searching:  Methodological issues

Inter-QT correlations

Columns: prior transformation type. Rows: posterior transformation.

         N       P       p       R       I       J       C       D       S       s       W       w       M       Z       z       B       b       Q       q       _
N    82.40  -39.20   -2.92   13.26   22.95    2.22   10.77    9.92    5.92   -2.37   23.22    2.37   30.55   11.24    3.99   22.86    8.45   17.84    6.60  102.17
P   -42.39  323.03    9.91  -15.98  -17.58   -9.12   -4.10   -6.83  -12.90   -5.45  -19.89   -5.75  -32.02   -8.76   -2.25  -25.01  -19.35  -18.59   -7.81  -71.47
p   -50.08   79.89  154.30   17.11    4.96   -8.42  -18.06  -10.74  -15.30  -10.98  -21.52  -11.35  -18.32   -8.35   -2.35  -21.79  -10.70  -17.30   -7.25   21.57
R   125.10  -85.27    3.73  198.05   23.30   -2.83    0.55   -2.51   -3.93   -6.24    1.94   -3.17   14.86    1.31   -0.46  -16.30  -12.24   -0.72   -6.71   89.80
I    -8.96  -39.39    7.11   25.19  152.36   23.27   35.60   20.45   19.44   10.92   33.41   15.91   61.29    5.88    1.04    0.33    6.43   -0.72    4.76   61.21
J    31.31  -28.13    0.42   -1.56   -2.36   45.43   29.05   12.92   21.68   19.21   15.47   15.55   10.37    7.08    4.06   66.72   37.31   70.63   46.88   -5.89
C    98.65  -27.61   -2.25   -7.92   -3.51    9.43   50.98   -1.42    2.57   -5.27   11.76   -2.43    7.80   10.78    1.98   33.37    6.34   25.51    3.16   -8.53
D    39.12  -24.03   -2.58   -3.66   -0.82   23.95   14.41   21.89   32.39   29.83   26.52   21.93   -4.62   11.31    4.55   45.21   24.60   57.86   14.67    5.58
S    35.67  -30.46   -3.62   -7.55    0.35   12.88   31.20   28.48  108.55   44.56   27.07   25.89   -6.91   26.54    5.79   56.24   35.14   39.28   17.90    6.90
s     8.44  -18.69   -2.58   -6.79   -1.78   15.49   43.13   15.71   59.83  117.15    1.57   34.34  -12.48   30.55   21.59   46.67   34.77   33.33   22.27    1.00
W    79.54  -43.79   -5.10   -9.05    4.91   15.72   16.39   32.98   10.95   -0.93  117.56   23.20   24.22   14.02   -0.47   70.07   38.85   46.57   17.34   27.82
w    17.74  -17.47   -2.16   -5.35    2.10   12.61   23.19   16.82   22.55   23.51   44.17   66.50   -2.25   18.13    3.57   39.50   35.21   26.21   14.42    6.21
M   109.09  -57.39   -6.00    0.68    8.81    4.55   -5.14    7.04  -11.05  -11.98    4.69   -7.25  160.36   -3.45   -2.86   31.61   14.40    9.17    4.19   31.52
Z    37.56  -13.24   -3.22   -0.98    1.32    6.09    9.11    5.53   17.10   13.88    5.76    5.96   -2.27   19.33    3.01   29.60   10.64   12.79    6.22   30.99
z     9.83   -4.61    0.69    0.25   -0.56    2.35    2.28   -0.82    7.06    8.53   -0.52    3.29   -2.42    8.85   20.34   12.08    4.22    4.48    2.57    4.33
B    61.06  -42.37    3.02   -0.11   -3.05   56.39   36.12   14.63   22.86   19.43   33.25   19.98   23.90   14.39    4.25  204.51   70.57   72.24   51.54    0.67
b    38.59  -32.48   -8.39  -14.33   -4.12   50.59   17.99   24.07   35.23   41.38   27.86   27.48   12.74   19.47    9.57  247.85  145.67   44.35   48.16    4.51
Q    35.97  -24.81   -5.29   -9.96   -3.80  112.76   21.46   12.99   19.11   17.75   23.45   15.62    7.47    8.74    2.70   81.08   67.37  126.97   50.84    5.15
q    18.26  -22.93   -2.71   -5.39   -0.10   54.20   17.40   22.01   23.42   28.37   23.34   14.49    6.52    7.45    3.91   41.28   40.34  173.97  135.55    5.06
_    54.44  -16.60    0.96   28.90   14.56    0.59    9.51    3.69    7.01    0.35    9.44    1.59   11.49    3.65    0.14    0.87   -1.84    4.33   -0.49   65.46

Example: [BQC(3)_][bqD(5)]

Page 109: Improving the effectiveness of Web searching:  Methodological issues

Some Observations

• Self-correlations suggest habitual tendencies

• Substantive QT’s rarely follow or precede page-viewing. They are associated with active searching.

• Delay is followed by N (a new query), R or I, suggesting memory refresh.

Page 110: Improving the effectiveness of Web searching:  Methodological issues

Number of words/query: Excite 2001

[Figure: normalised frequency (0 to 0.35) against terms/query (1 to 100, logarithmic axis).]

Page 111: Improving the effectiveness of Web searching:  Methodological issues

Hölscher & Strube (2000): Graphical Representation

Close-up of direct interaction with a search engine: numbers show transition probabilities.

Experts and novices doing specific search tasks

Page 112: Improving the effectiveness of Web searching:  Methodological issues

Word Similarity

e l e c t e d e l e c t i o n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Shift word along until the best match is found

e l e c t e d e l e c t i o n 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

logical AND: same letter
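The shift-and-compare idea can be sketched as follows: slide one word along the other, count the positions where the letters coincide, and keep the best count. How the count is turned into a similarity score is not given on the slide. Dividing by the length of the longer word is an assumption made here, as are the function names.

```python
def best_shift_matches(word_a, word_b):
    """Largest number of coinciding letters over all relative shifts."""
    best = 0
    # shift is the offset of word_a relative to word_b
    for shift in range(-(len(word_a) - 1), len(word_b)):
        matches = sum(
            1
            for i, letter in enumerate(word_a)
            if 0 <= i + shift < len(word_b) and word_b[i + shift] == letter
        )
        best = max(best, matches)
    return best


def word_similarity(word_a, word_b):
    """Assumed normalisation: best match count over the longer word length."""
    longest = max(len(word_a), len(word_b))
    return best_shift_matches(word_a, word_b) / longest if longest else 0.0


print(best_shift_matches("elected", "election"))  # 5
print(word_similarity("elected", "election"))     # 0.625
```

With this normalisation, "elected" and "election" score 0.625. Whether the original analysis would treat that as the same word depends on how its word threshold is applied, which this sketch does not try to reproduce.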

Page 113: Improving the effectiveness of Web searching:  Methodological issues

Motivation

• Need to develop new models for searching: update outdated usage paradigms.

• Improve training methods

• Develop automated assistance systems

Page 114: Improving the effectiveness of Web searching:  Methodological issues

Context

• How do the general public search the web?

• Experimental study
  – general public volunteers
  – record sound, screens, keystrokes

• Goal: evidence-based model of effective searching

Page 115: Improving the effectiveness of Web searching:  Methodological issues

Previous studies of search logs

• Web search is shallow + promiscuous
• Low use of advanced features
• Global statistics
  – number of queries/search
  – Pages viewed / user
  – query reformulation (change in no of terms)
  – Most users enter few terms
  – Little to be gained by increasing complexity

Page 116: Improving the effectiveness of Web searching:  Methodological issues

This work

• Development of quantitative analysis

• Analysis of search logs (Excite 2001)

• Development of descriptive codes

• Aim is to form a basis for the analysis of our experimental data

Page 117: Improving the effectiveness of Web searching:  Methodological issues

Aims of Quantitative Analysis

• To look at textual (syntactic) changes.
• Link queries by text similarity.
• Infer enquiry change from textual dissimilarity.
• Use these elements to develop a machine-readable codification of QT’s.

Page 118: Improving the effectiveness of Web searching:  Methodological issues

Target: one two three

Target: 123

Comparison  Symbol  Type

Basic transformations
1234        C       Conjunction
12          D       Disjunction

Common sub-phrase
124         S       Replacement
231         s       Reordering
1243        s       Insertion/removal

Common word
145         W       Replacement
132         w       Reordering
143         w       Replacement/insertion

Below threshold similarity
1456        Z       Common word
1245678     z       Common phrase

Page 119: Improving the effectiveness of Web searching:  Methodological issues

Supplementary Transformations

Code  Transformation
B     Include Boolean term
b     Remove Boolean term
Q     Include quote marks
q     Remove quote marks
_     Delay > 1 hour

Page 120: Improving the effectiveness of Web searching:  Methodological issues

Example full transformation

May include up to 4 terms, e.g. BQC(4)_

B     Boolean
Q     Quote marks
C(4)  Substantive
_     Delay
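As an illustration of how a full transformation such as BQC(4)_ breaks down into its terms, the sketch below decodes a single code into named parts. The function name `decode_qt`, the dictionary layout and the reading of the bracketed number as the query being compared against are assumptions. The descriptions of B, b, Q and q follow the supplementary transformations table.

```python
import re

SUPPLEMENTARY = {
    "B": "include Boolean term",
    "b": "remove Boolean term",
    "Q": "include quote marks",
    "q": "remove quote marks",
}

# Up to four terms: Boolean, quote marks, substantive code, delay.
QT_CODE = re.compile(r"([Bb])?([Qq])?([UNPpRIJCDSsWwMZz])(?:\((\d+)\))?(_)?")


def decode_qt(code):
    """Decode one QT code, e.g. 'BQC(4)_', into its terms."""
    match = QT_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a recognised QT code: {code!r}")
    boolean, quotes, substantive, ref, delay = match.groups()
    return {
        "boolean": SUPPLEMENTARY.get(boolean),
        "quotes": SUPPLEMENTARY.get(quotes),
        "substantive": substantive,
        "refers_to": int(ref) if ref is not None else None,
        "delay": delay is not None,
    }


print(decode_qt("BQC(4)_"))
```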

Page 121: Improving the effectiveness of Web searching:  Methodological issues

Some examples

Code   Query1                                        Query2
QJ(k)  bargain music                                 “bargain music”
QC(k)  Bacteremia                                    “Pneumoccol Bacteremia”
qJ(k)  “university of texas” “alternative medicine”  university of texas” “alternative medicine”
qw(k)  "tax law_depreciation system"                 tax law/depreciation system
BC(k)  "the sopranos"                                "the sopranos" +scripts
BJ(k)  +"Complaint form letters" Insurance           +"Complaint form letters" +Insurance
BS(k)  doppler effect labs                           doppler effect +lab

Page 122: Improving the effectiveness of Web searching:  Methodological issues

More examples

Code    Query1                                   Query2
Bs(k)   conferences image processing             +image +processing +conferences +finland
BqW(k)  "Craig Larman"                           +Larman +Valtech
BqZ(k)  +"lbp 1000" +review                      +canon +review +laser +printer
BqW(k)  Hevia AND bagpipe                        "Spanish bagpipe"
bQs(k)  +used +horse +trailer                    +arndt +"horse trailer" used
bqW(k)  +arndt +"horse trailer" used             +Arndt trailer
bqs(k)  +Moby +southside +"Gwen Stefani" +mp3    +Moby +southside +mp3

Page 136: Improving the effectiveness of Web searching:  Methodological issues

Conclusions

• We have developed a rich set of codes describing the syntactic part of QT’s
• These can be used to develop a graph-based description
• Correlations between the codes are meaningful/interesting
• They will form part of the analysis for our experimental study.
