Search, Exploration and Analytics of Evolving Data Nattiya Kanhabua L3S Research Center Hannover, Germany The 1st Keystone Training School on Keyword Search over Big Data 23 July 2015, Malta

Page 1: Search, Exploration and Analytics of Evolving Data

Search, Exploration and

Analytics of Evolving Data

Nattiya Kanhabua

L3S Research Center

Hannover, Germany

The 1st Keystone Training School on

Keyword Search over Big Data

23 July 2015, Malta

Page 2: Search, Exploration and Analytics of Evolving Data

Lecturer

Educational qualifications

2007 - 2011: Ph.D. degree, Norwegian University of Science and Technology, Norway

Thesis: “Time-aware Approaches to Information Retrieval”

2003 - 2005: M.Sc. in Computer Science, Asian Institute of Technology, Thailand

Thesis: “Agent-based Simulation of Trade in Barter Trade Exchanges”

1997 - 2001: B.Eng. in Computer Engineering, Kasetsart University, Thailand

Project: “Software Process Enhancement and Control System”

Work experience

2011 - now: Postdoc, L3S Research Center, Germany

05/2015: Visiting researcher, University of Trento, Italy

03-05/2010: Research intern, Yahoo! Research, Spain

2007 - 2011: Temporary Scientific Staff, NTNU, Norway

2006 - 2007: Research assistant, University of Trento, Italy

06-10/2006: Research assistant, AIT, Thailand

2005 - 2006: Analyst programmer, IFDS Group, UK

2002 - 2003: Research assistant, Kasetsart University, Thailand

2001 - 2002: System analyst, Accenture, Thailand & Singapore

Skills

• 7+ years of research experience in information retrieval, data mining, machine learning, predictive methods and spatio-temporal analysis

• 3+ years of research experience in Big Data, e.g., large-scale processing and MapReduce

Hadoop Pig Mahout HBase

Tomcat Servlet Lucene MySQL

Python JAVA JSP PHP

Weka R UML JSON

Eclipse NLP RDF WARC

2 23 July 2015 The 1st Keystone Summer School: Keyword Search

over Big Data

Page 3: Search, Exploration and Analytics of Evolving Data

Schedule

9:00 – 10:30 Part I

Introduction to Temporal Dynamics

Temporal Information Extraction

Temporal Query Analysis (I)

10:30 – 11:00 Coffee break

11:00 – 12:30 Part II

Temporal Query Analysis (II)

Time-aware Retrieval and Ranking

Applications of Temporal IR

Conclusions and Outlook

Page 4: Search, Exploration and Analytics of Evolving Data

Additional Resource

Book: Temporal Information Retrieval

Foundations and Trends® in Information Retrieval

Volume 9, Issue 2, pp 91-208, 2015

Download: http://goo.gl/TunlBb

References can be found in the book


Page 5: Search, Exploration and Analytics of Evolving Data

Introduction to Temporal Dynamics

What are temporal dynamics?

Why do they occur and impact search?

When and how to leverage temporal information for IR?


Page 6: Search, Exploration and Analytics of Evolving Data


Temporal Dynamics

Figure: Internet Growth/Usage Phases/Tech Events

(created by Mark Schueler, used with permission)


Page 7: Search, Exploration and Analytics of Evolving Data

Temporal Web Dynamics

The Web changes over time in many aspects, e.g., size, content, structure, and how it is accessed through user interactions or queries.

Size: web pages are added and deleted all the time

Content: web pages are edited and modified

Query: users’ information needs change

[Risvik et al., CN 2002; Ke et al., CN 2006] [WebDyn 2010; Dumais, SIAM-SDM 2012]

Page 11: Search, Exploration and Analytics of Evolving Data

Web and Index Sizes

Timeline axis: 1995 to 2020

2000: First billion-URL index. The world’s largest! ≈5000 PCs in clusters!

2004: Index grows to 4.2 billion pages

2008: Google counts 1 trillion unique URLs

2009: TBs or PBs of data/index. Tens of thousands of PCs

?

Page 12: Search, Exploration and Analytics of Evolving Data

Web and Index Sizes

http://www.worldwidewebsize.com/

Page 13: Search, Exploration and Analytics of Evolving Data

Content Change

The content of the Web changes constantly over time, e.g., web documents are added, modified or deleted continuously.

National and international initiatives collect and preserve parts of the Web [Gomes et al., TPDL 2011; Costa et al., TempWeb 2013]

Figure: Wayback Machine, a web archive search tool by the Internet Archive

Page 14: Search, Exploration and Analytics of Evolving Data

Content Change

Challenge:

Document representation and retrieval


Page 15: Search, Exploration and Analytics of Evolving Data

Categorization of Content Change

Implication: Crawling, Indexing, Ranking

Page 16: Search, Exploration and Analytics of Evolving Data

User Interaction Dynamics

Browsing and querying (or search) behavior

User preference, e.g., likes, comments, interests

User profiles [Rybak et al., ECIR 2014]

Page 17: Search, Exploration and Analytics of Evolving Data

Query Popularity Change

Challenge:

Time-sensitive queries

Query understanding and processing

Google Insights for Search: http://www.google.com/insights/search/

Query: Halloween


Page 18: Search, Exploration and Analytics of Evolving Data

Categorization of Web Search Queries

Source: http://www.google.com/insights/search

Implication: Query Analysis, Ranking

Page 19: Search, Exploration and Analytics of Evolving Data

Temporal Information Extraction

(1) Document Creation Time

(2) Document Focus Time

(3) Entity and Event Evolution


Page 20: Search, Exploration and Analytics of Evolving Data

Motivation

Incorporating time into search can increase retrieval effectiveness, but only when temporal information is available

Research problems:

How to determine the publication time of a document?

How to extract temporal information from document contents?

Page 21: Search, Exploration and Analytics of Evolving Data

Two Time Aspects

1. Publication or modified time

Task: determining timestamps of documents

Method: rule-based techniques or temporal language models

2. Content or focus time

Task: temporal information extraction

Method: natural language processing, or time and event recognition algorithms

Page 22: Search, Exploration and Analytics of Evolving Data

content time

publication time


Page 25: Search, Exploration and Analytics of Evolving Data

Determining Document Creation Time

Problem statement: it is hard to find a trustworthy time for a web page

Time gap between crawling and indexing

Decentralization and relocation of web documents

No standard metadata for time/date

“For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”

Example: “I found a bible-like document, but I have no idea when it was created.” “Let me see… This document was probably written in 850 A.C., with 95% confidence.”

Page 26: Search, Exploration and Analytics of Evolving Data

Current Approaches

1. Content-based

Temporal language model [de Jong et al., AHC 2005; Kanhabua and Nørvåg, ECDL 2008]

Classifier using features based on the text’s time expressions [Chambers, ACL 2012; Ge et al., EMNLP 2013]

Using burstiness of terms for estimating timestamps [Kotsakos et al., SIGIR 2014]

2. Non-content-based

Finding the oldest version of a page in a web archive [Jatowt et al., WIDM 2007]

Leveraging external resources [Hauff and Azzopardi, ECIR 2005; Nunes et al., WIDM 2007; SalahEldeen and Nelson, TempWeb 2013]

Page 31: Search, Exploration and Analytics of Evolving Data

Content-based Approach

Temporal Language Models

Based on the statistical usage of words over time

Compare each word of a non-timestamped document with a reference corpus

Tentative timestamp: the time partition whose word usage overlaps most with the document

Reference corpus:

Partition  Word        Freq
1999       tsunami     1
1999       Japan       1
1999       tidal wave  1
2004       tsunami     1
2004       Thailand    1
2004       earthquake  1

A non-timestamped document: tsunami, Thailand

Similarity Scores

Score(1999) = 1

Score(2004) = 1 + 1 = 2

Most likely timestamp is 2004
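The toy scoring on this slide can be reproduced in a few lines. The following is a minimal sketch, assuming the score of a partition is simply the summed frequency of the document's words in that partition; the names `REFERENCE` and `overlap_scores` are illustrative and not taken from the cited papers.

```python
from collections import defaultdict

# Reference corpus from the slide: (partition, word, frequency).
REFERENCE = [
    (1999, "tsunami", 1), (1999, "japan", 1), (1999, "tidal wave", 1),
    (2004, "tsunami", 1), (2004, "thailand", 1), (2004, "earthquake", 1),
]

def overlap_scores(doc_words, reference):
    """Sum, per time partition, the frequencies of the document's words."""
    freq = defaultdict(dict)
    for partition, word, count in reference:
        freq[partition][word] = count
    return {
        partition: sum(words.get(w.lower(), 0) for w in doc_words)
        for partition, words in freq.items()
    }

# The non-timestamped document from the slide.
scores = overlap_scores(["tsunami", "Thailand"], REFERENCE)
best = max(scores, key=scores.get)  # partition with the largest overlap
```

Here `scores` maps each partition to its overlap with the document, and `best` is the tentative timestamp.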

Page 32: Search, Exploration and Analytics of Evolving Data

Normalized Log-likelihood Ratio

Temporal language models are scored with the normalized log-likelihood ratio [Kraaij, SIGIR Forum 2005]

Variant of Kullback-Leibler divergence

Similarity of a document and time partitions

C is the background model estimated on the corpus

Linear interpolation smoothing to avoid the zero probability of unseen words
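A small sketch may help to make the ratio concrete. It assumes the commonly used form NLLR(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), with P(w|p) smoothed by linear interpolation against the background model C. The interpolation weight `lam`, the function name and the toy data are assumptions for illustration, not values from the slides.

```python
import math
from collections import Counter

def nllr(doc_words, partition_words, corpus_words, lam=0.9):
    """Normalized log-likelihood ratio of a document against one time partition.

    lam is a hypothetical interpolation weight for smoothing P(w|p) with P(w|C).
    """
    doc, part, corpus = Counter(doc_words), Counter(partition_words), Counter(corpus_words)
    score = 0.0
    for word, count in doc.items():
        p_w_c = corpus[word] / len(corpus_words)
        if p_w_c == 0.0:
            continue  # word unseen in the whole corpus carries no evidence
        p_w_d = count / len(doc_words)
        p_w_p = lam * part[word] / len(partition_words) + (1.0 - lam) * p_w_c
        score += p_w_d * math.log(p_w_p / p_w_c)
    return score

# Toy partitions reusing the words of the slide's reference corpus.
partitions = {
    1999: ["tsunami", "japan", "tidal wave"],
    2004: ["tsunami", "thailand", "earthquake"],
}
corpus = [w for words in partitions.values() for w in words]
document = ["tsunami", "thailand"]
nllr_scores = {p: nllr(document, words, corpus) for p, words in partitions.items()}
```

With this toy data the 2004 partition receives the higher score, in line with the simple overlap count.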

Page 33: Search, Exploration and Analytics of Evolving Data

Improving Temporal LMs

Enhancement techniques

1. Semantic-based data preprocessing

2. Search statistics to enhance similarity scores

3. Temporal entropy as term weights

Intuition: Direct comparison between extracted words

and corpus partitions has limited accuracy

Approach: Integrate semantic-based techniques into

document preprocessing

[Kanhabua et al., ECDL 2008] (Slide provided by the authors) 33 23 July 2015

Page 36: Search, Exploration and Analytics of Evolving Data

Semantic-based Preprocessing

Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy

Approach: Integrate semantic-based techniques into document preprocessing

Semantic-based preprocessing: Description

Part-of-speech tagging: Select only interesting classes of words, e.g. nouns, verbs, and adjectives

Collocation extraction: Co-occurrence of different words can alter the meaning, e.g. “United States”

Word sense disambiguation: Identify the correct sense of a word from context, e.g. “bank”

Concept extraction: Compare concepts instead of original words, e.g. “tsunami” and “tidal wave” have the common concept of “disaster”

Word filtering: Select the top-ranked words according to TF-IDF scores for a comparison

[Kanhabua et al., ECDL 2008] (Slide provided by the authors)

Page 39: Search, Exploration and Analytics of Evolving Data

Leveraging Search Statistics

Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition

Approach: Linearly combine a GZ score with the normalized log-likelihood ratio

P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query and P(wi) = 0.5 for a declining query

f(R) converts a rank number into a weight; a higher-ranked query is more important

ipf is an inverse partition frequency, ipf = log(N/n)

[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
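The linear combination and the inverse partition frequency can be sketched as follows. The mixing weight `beta` and both function names are assumptions made for illustration; the slide does not state how the weight is chosen or how the GZ score itself is assembled from P(wi), f(R) and ipf.

```python
import math

def inverse_partition_frequency(num_partitions, partitions_with_word):
    """ipf = log(N / n): N partitions in total, n of them containing the word."""
    return math.log(num_partitions / partitions_with_word)

def combined_score(nllr_score, gz_score, beta=0.5):
    """Linear combination of an NLLR score and a GZ score (beta is hypothetical)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * nllr_score + beta * gz_score
```

For example, `combined_score(0.4, 0.8, beta=0.25)` weights the NLLR score three times as heavily as the GZ score.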

Page 40: Search, Exploration and Analytics of Evolving Data

Temporal Entropy

A measure of the temporal information that a word conveys.

Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.

Tells how good a term is at separating a partition from others.

A term occurring in few partitions has higher temporal entropy than one appearing in many partitions.

The higher the temporal entropy of a term, the better it represents a partition.

Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)

Approach: Propose temporal entropy, based on a term selection presented in Lochbaum and Streeter

[Kanhabua et al., ECDL 2008] (Slide provided by the authors)

Page 43: Search, Exploration and Analytics of Evolving Data

Temporal Entropy

Np is the total number of partitions in a corpus

The formula also uses the probability of a partition p containing a term wi

[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
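The formula itself did not survive on this slide, so the sketch below assumes the usual entropy-based term weight in the style of Lochbaum and Streeter: TE(w) = 1 + (1 / log Np) · Σ_p P(p|w) · log P(p|w), with P(p|w) estimated from the term's frequency in each partition. Both the formula and the estimate of P(p|w) are assumptions, chosen because they match the stated behaviour (terms in few partitions score higher).

```python
import math

def temporal_entropy(term_freq_by_partition, num_partitions):
    """Entropy-based weight in [0, 1]; 1 means the term occurs in a single partition.

    term_freq_by_partition maps a partition label to the term's frequency there.
    """
    total = sum(term_freq_by_partition.values())
    if total == 0 or num_partitions < 2:
        raise ValueError("need a term that occurs and at least two partitions")
    entropy_sum = 0.0
    for freq in term_freq_by_partition.values():
        if freq > 0:
            p = freq / total  # assumed estimate of P(partition | term)
            entropy_sum += p * math.log(p)
    return 1.0 + entropy_sum / math.log(num_partitions)
```

A term confined to one partition gets weight 1, while a term spread evenly over all partitions gets weight 0.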

Page 44: Search, Exploration and Analytics of Evolving Data

Non-content-based Approaches

Dating a document using its neighbors

1. Web pages linking to the document, i.e., incoming links

2. Web pages pointed to by the document, i.e., outgoing links

3. Media assets associated with the document, e.g., images

The timestamp is estimated by averaging the last-modified dates of the neighbors

[Hauff and Azzopardi, 2005; Nunes et al., WIDM 2007]
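The neighbor-averaging idea can be illustrated with a short sketch. It assumes the last-modified dates of the neighbors have already been collected as timezone-aware datetimes; the function name is hypothetical, and how the dates are obtained (e.g., from HTTP headers) is outside the sketch.

```python
from datetime import datetime, timezone

def average_timestamp(neighbor_dates):
    """Estimate a document's timestamp as the mean of its neighbors' last-modified dates."""
    if not neighbor_dates:
        raise ValueError("at least one neighbor date is required")
    epoch_seconds = [d.timestamp() for d in neighbor_dates]
    mean_seconds = sum(epoch_seconds) / len(epoch_seconds)
    return datetime.fromtimestamp(mean_seconds, tz=timezone.utc)

# Last-modified dates of, e.g., an incoming link and an embedded image.
neighbors = [
    datetime(2010, 1, 1, tzinfo=timezone.utc),
    datetime(2010, 1, 3, tzinfo=timezone.utc),
]
estimate = average_timestamp(neighbors)
```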

Page 45: Search, Exploration and Analytics of Evolving Data

Non-content-based Approaches

Drawbacks:

Rely on the availability and accuracy of other information

Cover only pages from the most recent years

Cannot determine the age of the actual contents

[SalahEldeen and Nelson, 2013]

Page 46: Search, Exploration and Analytics of Evolving Data

Determining Document Focus Time

Three types of temporal expressions

1. Explicit: time mentions that map directly to a time point or interval, e.g., “July 4, 2012”

2. Implicit: imprecise time point or interval, e.g., “Independence Day 2012”

3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month”

Time and event recognition [Mani and Wilson, ACL 2000]

A mix of hand-crafted and machine-learnt rules

Ranking the most relevant temporal expressions [Strötgen et al., TempWeb 2012]
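To make the three types concrete, here is a toy resolver covering exactly the three example expressions. It is a hand-written sketch, not how a real temporal tagger works: the `NAMED_DAYS` lookup is a hypothetical, US-specific table, and month-name parsing with `%B` assumes an English locale.

```python
import calendar
import re
from datetime import date, datetime

# Hypothetical lookup for implicit expressions (illustration only).
NAMED_DAYS = {"independence day": (7, 4)}

def resolve(expression, publication_date):
    """Return (type, start_date, end_date) for a temporal expression."""
    text = expression.strip()
    # Explicit: a full date such as "July 4, 2012".
    try:
        day = datetime.strptime(text, "%B %d, %Y").date()
        return ("explicit", day, day)
    except ValueError:
        pass
    # Implicit: a named day plus a year, such as "Independence Day 2012".
    match = re.fullmatch(r"(.+?)\s+(\d{4})", text)
    if match and match.group(1).lower() in NAMED_DAYS:
        month, day_of_month = NAMED_DAYS[match.group(1).lower()]
        day = date(int(match.group(2)), month, day_of_month)
        return ("implicit", day, day)
    # Relative: resolved against the publication date.
    if text.lower() == "next month":
        year = publication_date.year + publication_date.month // 12
        month = publication_date.month % 12 + 1
        last_day = calendar.monthrange(year, month)[1]
        return ("relative", date(year, month, 1), date(year, month, last_day))
    return ("unresolved", None, None)
```

Only the relative case needs the publication date; the other two are resolved from the expression alone.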

Page 47: Search, Exploration and Analytics of Evolving Data

Time Taggers for Calculating Focus Time

HeidelTime: http://heideltime.ifi.uni-heidelberg.de/heideltime

Timestamp: 2013/7/15

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 48: Search, Exploration and Analytics of Evolving Data

Limitations

Documents may lack any temporal expressions

Temporal expressions may be weakly related to the document’s theme

Temporal taggers are not perfect

Estimating document focus time without using temporal expressions

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 49: Search, Exploration and Analytics of Evolving Data

Focus Time of Documents

Def. A document has focus time t if its content refers to t

23 July 2015 49 [Jatowt et al., CIKM 2013](Slide provided by the authors)

Page 50: Search, Exploration and Analytics of Evolving Data

Estimating Focus Time: Concept

Use time-referenced documents for estimating the focus time of a target document

Figure: news article collections, in which words (A, B, C) co-occur with date mentions (e.g., 1915, 1935, 1948, 1978, 2003, May 2011, 2012), are combined with the target document (containing A, B, C) to estimate the target document focus time

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 51: Search, Exploration and Analytics of Evolving Data

Word Graph

Word co-occurrence graph built from large collections of news articles

Link weight estimated by the Jaccard coefficient, using the sentence as the unit

Figure: example graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
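The link weights can be sketched as below, assuming each sentence is already tokenized and that year mentions are kept as ordinary tokens. The Jaccard coefficient of two tokens is the number of sentences containing both divided by the number of sentences containing either; the function name is illustrative.

```python
from itertools import combinations

def jaccard_link_weights(sentences):
    """Map each co-occurring token pair to its sentence-level Jaccard coefficient."""
    occurrences = {}
    for index, tokens in enumerate(sentences):
        for token in set(tokens):
            occurrences.setdefault(token, set()).add(index)
    weights = {}
    for a, b in combinations(sorted(occurrences), 2):
        shared = len(occurrences[a] & occurrences[b])
        if shared:  # only pairs that co-occur at least once get an edge
            weights[(a, b)] = shared / len(occurrences[a] | occurrences[b])
    return weights

toy_sentences = [
    ["war", "1945", "hiroshima"],
    ["war", "1939", "germany"],
    ["war", "1945"],
]
link_weights = jaccard_link_weights(toy_sentences)
```

Pairs are keyed in sorted order, so the edge between "war" and "1945" is stored under `("1945", "war")`.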

Page 52: Search, Exploration and Analytics of Evolving Data

Estimating Direct Word-Year Association

Word-year associations are derived from the graph

Word w is strongly associated with year y if it frequently co-occurs with y

Figure: each word has a vector of associations over years, e.g., A(war, 1900), A(war, 1901), ..., A(war, 1944), A(war, 1945), ..., A(war, 2009), A(war, 2010), and likewise for hiroshima and every other word

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 53: Search, Exploration and Analytics of Evolving Data

Second Level Term-Year Association

Word w is strongly associated with year y if many other words that frequently co-occur with w are also strongly associated with y

$$A_{2}(w_i, y) = \frac{1}{V} \sum_{j=1}^{V} A(w_i, w_j) \, A_{dir}(w_j, y)$$

Figure: the example word graph (war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima) extended with the node israel

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
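A second-level association of this kind can be sketched as an average, over the vocabulary, of the word-word association multiplied by the neighbor's direct word-year association. This averaging form is an assumption based on the verbal description; the dictionaries `word_word` and `word_year` and all numbers are made up for illustration.

```python
def second_level_association(word, year, word_word, word_year, vocabulary):
    """Average of A(word, other) * A_dir(other, year) over all vocabulary words."""
    total = sum(
        word_word.get((word, other), 0.0) * word_year.get((other, year), 0.0)
        for other in vocabulary
    )
    return total / len(vocabulary)

# Toy associations (illustrative numbers only).
word_word = {("hiroshima", "war"): 0.5, ("hiroshima", "nazi"): 0.25}
word_year = {("war", 1945): 0.8, ("nazi", 1945): 0.4}
assoc = second_level_association("hiroshima", 1945, word_word, word_year, ["war", "nazi"])
```

Even if "hiroshima" never co-occurred with 1945 directly, it would inherit an association through "war" and "nazi".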

Page 54: Search, Exploration and Analytics of Evolving Data

Estimating Document-Year Association

If a document contains many words strongly associated with year y, the document is strongly associated with y

Figure: association curves A(word, year) for word A, word B and word C over 1900 to 2000; for a document containing the words A, B, C, B, C, the document-year association is the sum A + 2B + 2C

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
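The A + 2B + 2C example corresponds to a frequency-weighted sum of word-year associations. A minimal sketch, with made-up association values and an illustrative function name:

```python
from collections import Counter

def document_year_association(doc_words, word_year, years):
    """Sum the word-year associations of all word occurrences, per year."""
    counts = Counter(doc_words)
    return {
        year: sum(n * word_year.get((word, year), 0.0) for word, n in counts.items())
        for year in years
    }

# Toy associations (illustrative numbers only).
toy_word_year = {("A", 1945): 1.0, ("B", 1945): 0.5, ("C", 1945): 0.25, ("A", 1980): 0.5}
doc_assoc = document_year_association(["A", "B", "C", "B", "C"], toy_word_year, [1945, 1980])
```

For 1945 this evaluates A + 2B + 2C with the toy values above.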

Page 55: Search, Exploration and Analytics of Evolving Data

Finding Discriminative Features

Not every word is useful for estimating text focus time, e.g., “man” and “city” have stable associations with years

Temporal entropy: measure of the variability of word associations

Temporal kurtosis: measure of the peakedness of word associations, e.g., “war”, “earthquake” vs. “hitler”, “stalingrad”

Figure: A(word, year) curves over 1900 to 2000 for two words A and B, illustrating Temporal_Entropy(A) < Temporal_Entropy(B) and Temporal_Kurtosis(A) > Temporal_Kurtosis(B)

Temporal entropy and temporal kurtosis are used as temporal weights for words

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 56: Search, Exploration and Analytics of Evolving Data

Importance of Words in Document

Words weakly related to the document theme should be skipped

Target document:

“President Obama took part in the celebrations of the Polish Independence Day. The US president met main Polish politicians in Warsaw. Poland regained independence at the end of the World War I following Bolshevik Revolution. It then lost the independence as a result of Nazi and Soviet invasions led by Hitler and Stalin. Poland is located in East Europe.”

The document is converted to a word graph (nodes such as independence, poland, war, hitler) and scored with TextRank:

0.90 independence
0.82 poland
0.74 war
0.61 nazi
0.56 hitler
0.54 ….

TextRank scores are used as discriminatory semantic weights for words [Mihalcea and Tarau, EMNLP 2004]

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
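For readers unfamiliar with TextRank, the sketch below implements its unweighted, undirected update rule, S(v) = (1 - d) + d · Σ_{u ∈ neighbors(v)} S(u) / degree(u), on a co-occurrence graph built with a sliding window. The window size, damping factor and iteration count are assumed defaults; the slide does not say how the document graph was built, so the scores will not match the figure.

```python
def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Score words by running the TextRank update on a co-occurrence graph."""
    neighbors = {}
    for i, word in enumerate(tokens):
        # Link each token to the next `window` tokens (undirected edges).
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors.setdefault(word, set()).add(other)
                neighbors.setdefault(other, set()).add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1.0 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores

toy_tokens = ["poland", "independence", "war", "independence",
              "nazi", "independence", "hitler"]
toy_scores = textrank(toy_tokens, window=1)
```

In this toy sequence every other word is adjacent only to "independence", so it receives the highest score.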

Page 57: Search, Exploration and Analytics of Evolving Data

Estimating Focus Time

Figure: the A(word, year) curves of the document’s words (word A, word B, word C; document contents A, B, C, B, C) over 1900 to 2000 are combined into a weighted sum (temporality and semantics)

Focus time: interval-based (using a threshold)

Focus time: instant-based

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 58: Search, Exploration and Analytics of Evolving Data

Combined Approach

Combining the estimated focus time and the temporal expressions in the text

Representing dates on a timeline: Gaussian kernel density estimate

Mixture of Gaussian distributions with means centered on the extracted dates

The combined score $S_{Comb}(d, y)$ is computed from $S_{Est}(d, y)$ and $S_{TempExp}(d, y)$

Figure: a target document mentioning the dates 1932, 1935, 1940, 2001 and 2011, and the resulting density over the timeline

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
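The kernel density representation of extracted dates can be sketched as an equal-weight mixture of Gaussians, one centered on each extracted year. The bandwidth of 2 years is an assumed value, and the list of years mirrors the dates visible in the figure (1932 appears twice).

```python
import math

def date_density(year, extracted_years, bandwidth=2.0):
    """Gaussian kernel density estimate over a timeline of years.

    bandwidth is the standard deviation of each Gaussian (hypothetical value).
    """
    norm = 1.0 / (len(extracted_years) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((year - mean) / bandwidth) ** 2) for mean in extracted_years
    )

# Dates extracted from the target document in the figure.
extracted = [1935, 2011, 1932, 1940, 1932, 2001]
```

Years close to several extracted dates, such as the early 1930s here, receive a higher density than isolated ones.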

Page 59: Search, Exploration and Analytics of Evolving Data

Experimental Settings: Word Graphs

News articles collected from Google News Archive using country names as queries

Germany (87k), UK (149k), France (110k), Japan (97k), Israel (92k)

Published within [1990, 2010]

Dates falling in [1900, 2013] were found using regular expressions

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
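As an illustration of the last step, a regular expression restricted to the years 1900 to 2013 might look as follows. The authors' actual expressions are not given on the slide, so this pattern and the function name are assumptions.

```python
import re

# Matches standalone four-digit years from 1900 to 2013 (assumed pattern).
YEAR_PATTERN = re.compile(r"\b(19\d{2}|200\d|201[0-3])\b")

def find_years(text):
    """Return all year mentions in [1900, 2013] found in the text, in order."""
    return [int(year) for year in YEAR_PATTERN.findall(text)]
```

The word boundaries keep the pattern from matching inside longer numbers, and years outside the range are ignored.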

Page 60: Search, Exploration and Analytics of Evolving Data

Experimental Settings: Test Datasets

Datasets on events related to countries:

Wiki: 250 Wikipedia pages about events

Books: 735 paragraphs from 2 textbooks about history (timelines)

Web: 812 paragraphs from web pages on history (BBC timelines, etc.)

Dataset  total #doc  avg. #sent  avg. time span of events  avg. year of events  avg. #dates
Wiki     250         179         3.4 years                 1958                 14.5
Books    735         43          4.4 years                 1982                 4.5
Web      819         18.3        1.3 years                 1957                 2.4

Page 61: Search, Exploration and Analytics of Evolving Data

Experimental Settings: Baselines

Baselines:

Random

Date-based (using only dates in document text)

LDA-based

1. 100 topics over sentences containing year mentions

2. Finding topic distribution of each year

3. Calculating document-year association based on topic distribution of documents

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 62: Search, Exploration and Analytics of Evolving Data

Experimental Settings: Measurements

Measures:

Average error (in years), for the instant-based representation

Pearson correlation coefficient (-1..+1) between ground truth years and years in focus time, for the interval-based representation

Figure: ground truth vs. estimated focus time for both representations

[Jatowt et al., CIKM 2013] (Slide provided by the authors)
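Both measures are straightforward to compute. The sketch below assumes that the instant-based representation is a single year per document and that the interval-based representation is a 0/1 indicator per year of the timeline; these encodings are assumptions made for illustration.

```python
import math

def average_error(true_years, estimated_years):
    """Mean absolute difference, in years, between true and estimated focus years."""
    pairs = list(zip(true_years, estimated_years))
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (std_x * std_y)
```

For interval-based focus time, `pearson` would be applied to the ground-truth and estimated indicator vectors over the same range of years.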

Page 63: Search, Exploration and Analytics of Evolving Data

Experimental Results

Average error (in years)

Dataset  Random  LDA   Date-based  Proposed (no dates)  Proposed combined (with dates)
Wiki     36.5    27.2  3           18.3                 2.83
Books    39.3    37.3  48.1        23.5                 20.4
Web      40.5    41.4  53.4        23.6                 20.7

Pearson Correlation Coefficient

Dataset  Random  LDA   Date-based  Proposed (no dates)  Proposed combined (with dates)
Wiki     0       0.1   0.65        0.29                 0.66
Books    0       0.04  0.01        0.25                 0.30
Web      0       0.02  -0.03       0.26                 0.41

[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Page 64: Search, Exploration and Analytics of Evolving Data

Effect of Time Distance on Focus Time

How well can we estimate the focus time of documents about the distant past?

Figure: results for the Wiki, Books and Web datasets, using the instant-based focus time representation

Page 65: Search, Exploration and Analytics of Evolving Data

Questions?

Page 66: Search, Exploration and Analytics of Evolving Data

Temporal Query Analysis

(1) Temporal query intent

(2) Dynamic query subtopics


Page 67: Search, Exploration and Analytics of Evolving Data

Temporal Queries

Temporal information needs

Searching temporal document collections

E.g., digital libraries, web/news archives

Users: historians, librarians, journalists or students

Temporal queries exist in both standard collections and the Web

Relevance depends on time

Documents are about events at a particular time

Page 68: Search, Exploration and Analytics of Evolving Data

Types of Temporal Queries

Two types of temporal queries

1. Explicit: time is provided, e.g., “Presidential election 2012”

2. Implicit: time is not provided, e.g., “Germany World Cup”

Temporal intent can be implicitly inferred, i.e., the query refers to the World Cup event in 2006

Studies of web search query logs show a significant fraction of temporal queries

1.5% of web queries are explicit

~7% of web queries are implicit

13.8% of queries contain explicit time and 17.1% of queries have temporal intent implicitly provided

[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]

Page 69: Search, Exploration and Analytics of Evolving Data

Figure: Variances of temporal queries and their dynamics

Page 70: Search, Exploration and Analytics of Evolving Data

Understanding Temporal Query Intent

Current approaches:

1. Mining temporal patterns in query logs

2. Analyzing top-k search results

[Vlachos et al., SIGMOD 2004; Radinsky et al., WWW 2012]

[Jones and Diaz, TOIS 2007; Campos et al., CIKM 2012]

Page 71: Search, Exploration and Analytics of Evolving Data

Motivation

Temporal queries are a significant fraction of Web

search queries [Zhang et al., EMNLP 2010]

13.8% of explicit temporal queries

17.1% of implicit temporal queries

Characteristics:

Certain temporal patterns, i.e., spikes, periodicity

(hourly or daily), seasonality and trends

Underlying temporal information needs without

temporal patterns observed

Tasks:

Understand temporal search intent

Enable advanced enhancement techniques

Automatic method for detecting events in search streams

Examples: US Election 2016, Brazil FIFA World Cup

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 72: Search, Exploration and Analytics of Evolving Data

Preliminaries

Data model:

Set of queries Q issued at different time points

Set of clicked URLs U and click-through data

Temporal document collection D

q: keywords or term(q), and hitting time(q)

yq: time series data extracted from Q, U and D

Two-step approach:

Automatically extract a set of candidate queries {q1, ..., qn} from Q

Classify candidates as event-related queries {e1, ..., em} using

machine learning techniques

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 73: Search, Exploration and Analytics of Evolving Data

Identifying Event Candidates

Time and keyword-based clustering:

Step 1: Partition query logs into one-week windows

• Group queries from the same event

• A window may contain multiple, unrelated events

Step 2: Cluster queries by lexical similarity

• Pre-process and sort queries alphabetically

• Compute Jaccard similarity of a query pair

Easter - easter 2006, easter 2007, easter 20crafts,

easter activities, easter animation, easter animations,

easter background, easter basket, easter bread,

easter bucket, easter bunny, easter bunny decorations,

easter bunny lights
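The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token-level Jaccard similarity, the greedy comparison against the first query of the current cluster, and the names `jaccard` and `cluster_queries` are assumptions, and the default threshold of 0.2 is only borrowed from the cleansing-step setting reported later in the deck.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_queries(queries, threshold=0.2):
    """Greedy single pass over alphabetically sorted, de-duplicated queries.

    A query joins the current cluster if its similarity to the cluster's
    first query exceeds the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for q in sorted(set(q.lower().strip() for q in queries)):
        if clusters and jaccard(clusters[-1][0], q) > threshold:
            clusters[-1].append(q)
        else:
            clusters.append([q])
    return clusters


# Queries from one (hypothetical) one-week partition of a query log.
week = ["easter bunny", "easter basket", "easter", "ncaa bracket", "ncaa"]
clusters = cluster_queries(week)
```

Sorting first means lexically similar queries end up adjacent, so a single pass is enough for this sketch; a production system would likely compare against more than one cluster member.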

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 74: Search, Exploration and Analytics of Evolving Data

Event-related Query Classification

Classify a query as event-related or not:

Periodic and seasonal events

Popular and trending events

Sporadic (rare) and unseen events

General time-sensitive queries

Underlying temporal information needs

Features:

Time-series features, e.g., seasonality or trends

Popularity-based features, e.g., click-through and burstiness

Statistical features, e.g., the probability distribution of results: temporal KL-divergence and skewness (kurtosis)

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 75: Search, Exploration and Analytics of Evolving Data

Seasonality

Figure panels: Query: Easter, Query: World cup

Detect seasonal queries [Shokouhi, SIGIR 2011]

E.g., annual events such as the US Open and Easter, or a 4-year recurring event such as the FIFA World Cup

Method: time-series decomposition using Holt-Winters adaptive exponential smoothing

Input: time-series data extracted from external

document collections, YD

Compute a cosine similarity as seasonality

Y is the original time-series data

S is the seasonality component
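The seasonality score, i.e., the cosine similarity between Y and S, can be sketched as below. The seasonal component here is a naive stand-in (the mean of each phase of the period), not Holt-Winters smoothing; the function names and the `period` parameter are assumptions made for illustration.

```python
import math


def seasonal_component(y, period):
    """Naive seasonal component: the mean of each phase, repeated over the series.

    This is a simplification that replaces the Holt-Winters decomposition.
    """
    if period <= 0 or len(y) < period:
        raise ValueError("series must cover at least one full period")
    phase_means = []
    for p in range(period):
        vals = y[p::period]
        phase_means.append(sum(vals) / len(vals))
    return [phase_means[i % period] for i in range(len(y))]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * z for x, z in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(z * z for z in b))
    return dot / (na * nb) if na and nb else 0.0


def seasonality(y, period):
    """Cosine similarity between the original series Y and its seasonal part S."""
    return cosine(y, seasonal_component(y, period))


periodic = [0, 0, 10, 0] * 3                         # peak recurs at the same phase
drifting = [10, 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, 0]   # peak moves between periods
```

A series whose peaks recur at the same phase scores close to 1, while a series whose peaks drift scores lower.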

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 76: Search, Exploration and Analytics of Evolving Data

Autocorrelation

Detect trending events by their predictability

Cross-correlation of a time series with itself, i.e., between its past and future values at different time lags

The stronger the inter-day dependencies, the higher the autocorrelation value

where lag=1, shifting the 2nd time series by

one day, called 1st-order autocorrelation
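A minimal sketch of the lag-k sample autocorrelation, assuming a daily query-frequency series `y`; the function name and the choice of the standard (biased) sample estimator are assumptions, since the slide's formula is not preserved.

```python
def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag (lag=1 is first-order)."""
    n = len(y)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must satisfy 0 < lag < len(y)")
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y)
    if var == 0:
        return 0.0  # a constant series carries no dependency signal
    cov = sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag))
    return cov / var


trending = [1, 2, 3, 4, 5, 6, 7, 8]      # steadily growing query volume
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
```

The trending series yields a positive first-order autocorrelation, whereas the alternating series yields a negative one.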

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 77: Search, Exploration and Analytics of Evolving Data

Temporal KL-divergence

Analyze a temporal distribution in a result set

Measure the difference between the distribution over time

of top-k documents of q and the document collection C

P(t|q) is the probability of generating a publication date t

given q

P(t|C) is the probability of a publication date t in the

collection
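The temporal KL-divergence between P(t|q) and P(t|C) can be sketched as follows. Estimating both distributions as relative frequencies of publication dates is an assumption of this sketch (the original may weight documents by retrieval score), as is the function name `temporal_kl`.

```python
import math
from collections import Counter


def temporal_kl(topk_dates, collection_dates):
    """KL(P(t|q) || P(t|C)) over publication dates.

    topk_dates: publication dates of the top-k documents retrieved for q.
    collection_dates: publication dates of all documents in the collection C.
    """
    p_q, p_c = Counter(topk_dates), Counter(collection_dates)
    n_q, n_c = len(topk_dates), len(collection_dates)
    kl = 0.0
    for t, count in p_q.items():
        if p_c[t] == 0:
            raise ValueError("top-k date %r does not occur in the collection" % (t,))
        pq = count / n_q
        pc = p_c[t] / n_c
        kl += pq * math.log(pq / pc)
    return kl


collection = [2004] * 5 + [2005] * 5 + [2006] * 5 + [2007] * 5
focused = temporal_kl([2006] * 4, collection)                 # results all from 2006
unfocused = temporal_kl([2004, 2005, 2006, 2007], collection)  # results mirror C
```

A result set concentrated on one period diverges strongly from the collection, which hints at a time-sensitive query; a result set that mirrors the collection gives a divergence of zero.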

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 78: Search, Exploration and Analytics of Evolving Data

Surprise Score

Detect unseen events or surprisingly popular queries [Radinsky et al., WWW 2012]

Assume that an unplanned event is happening when there is a significant prediction error

Compute the sum of squared errors of prediction

(SSE) using a simple linear regression model
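One way to read the surprise score is sketched below: fit a least-squares line to the past query frequencies, predict the following observations, and sum the squared prediction errors. Splitting the series into `history` and `observed` is an assumption of this sketch, as are the function names.

```python
def fit_line(y):
    """Ordinary least-squares line through (0, y[0]), (1, y[1]), ..."""
    n = len(y)
    mean_x = (n - 1) / 2
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in enumerate(y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def surprise_score(history, observed):
    """Sum of squared errors between observed values and the linear forecast."""
    if len(history) < 2:
        raise ValueError("need at least two historical points to fit a line")
    slope, intercept = fit_line(history)
    start = len(history)
    return sum(
        (value - (slope * (start + i) + intercept)) ** 2
        for i, value in enumerate(observed)
    )


expected = surprise_score([1, 2, 3, 4], [5, 6])    # continues the trend
surprising = surprise_score([1, 2, 3, 4], [5, 16])  # sudden spike
```

A query whose frequency continues its trend scores near zero, while a sudden spike produces a large score.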

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 79: Search, Exploration and Analytics of Evolving Data

Experiments

Query logs:

• Two datasets, i.e., AOL and MSN

• AOL: 30M queries March 1 - May 31, 2006

• MSN: 15M queries from May 2006

Temporal collection:

• The New York Times Annotated Corpus

• 1.8M documents from 1987 - 2007

Setting:

• HeidelTime for time extraction and OpenNLP for entity extraction

• Cleansing-step parameters: Jaccard similarity threshold>0.2; edit

distance<3; overlap n-gram=2

• For burstiness features, default parameters for the burst detection

technique provided by CISHELL

In total, 837 event-related queries

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 80: Search, Exploration and Analytics of Evolving Data

Experimental Results (I)

Feature selection:

• Study high-impact (best) features

• Investigate their importance independently of classification algorithms

• InfoGainAttributeEval method in WEKA

Main findings:

• Discriminative features are mostly derived from D and Q

• TemporalKL and kurtosis are among the influential features

• Trend-based features, such as autocorrelation, burst weight, and trending level, play an important role

• Seasonality computed from Q has less impact than seasonality extracted from D

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 81: Search, Exploration and Analytics of Evolving Data

Experimental Results (II)

Query classification:

• Several classifiers, i.e., support vector

machine (SVM), AdaBoost, decision tree

(J48), and neural network (NN)

• Metrics: accuracy, precision, recall, F-measure using 10-fold cross validation

Main findings:

• J48 is the best performing algorithm

• TemporalKL achieves accuracy of 84%

• Adding autocorrelation, kurtosis, and

seasonality increases the performance

• However, performance drops after adding max. query frequency and further features

[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Page 82: Search, Exploration and Analytics of Evolving Data

Analyzing Top-k Search Results

Using temporal language models

Determine time of queries when no time is given explicitly

Re-rank search results using the determined time

Exploiting time from search snippets

Extract temporal expressions (i.e., years) from the contents of top-k

retrieved web snippets for a given query

Content-based language-independent approach

[Kanhabua and Nørvåg, ECDL 2010; Campos et al., CIKM 2012]

Page 83: Search, Exploration and Analytics of Evolving Data

Determining Time of Queries

Approach I. Dating using keywords*

Approach II. Dating using top-k documents*

Queries are short keywords

Inspired by pseudo-relevance feedback

Approach III. Using timestamp of top-k documents

No temporal language models are used

*Using Temporal Language Models proposed by de Jong et al.
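Approach III, which needs no temporal language model, can be sketched as counting the publication years of the top-k retrieved documents. The function name `temporal_profile` and the use of normalised year counts as weights are assumptions for illustration.

```python
from collections import Counter


def temporal_profile(topk_years):
    """Rank candidate query times by how many top-k documents carry that timestamp.

    Returns (year, weight) pairs sorted by descending weight; weights sum to 1.
    """
    counts = Counter(topk_years)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        ((year, c / total) for year, c in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Publication years of the top-6 documents for a hypothetical query.
profile = temporal_profile([2005, 2005, 2004, 2006, 2005, 2004])
```

For this input the determined times come out in the order 2005, 2004, 2006.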

[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)


Page 85: Search, Exploration and Analytics of Evolving Data

I. Dating using Keywords

Query’s temporal profiles

[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)


Page 87: Search, Exploration and Analytics of Evolving Data

II. Dating using Top-k Documents

Query’s temporal profiles

[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)


Page 89: Search, Exploration and Analytics of Evolving Data

III. Using Timestamp of Documents

Query’s temporal profiles

[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)


Page 91: Search, Exploration and Analytics of Evolving Data

Re-ranking Search Results

Intuition: documents published close to the time of the query are more relevant

Assign document priors based on publication dates

Figure: query → News archive → Initial retrieved results (D2009); Determine time (2005, 2004, 2006, ...) → Re-ranked results (D2005)

[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
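Re-ranking with date-based document priors can be sketched as below. Multiplying the initial retrieval score by a prior taken from the query's temporal profile, and the small `floor` that keeps documents from other years rankable, are assumptions of this sketch rather than the method of the cited paper.

```python
def rerank(results, profile, floor=0.01):
    """Re-rank (doc_id, score, pub_year) triples using a date-based prior.

    profile maps a year to the weight of that year for the query.
    """
    rescored = [
        (doc_id, score * (floor + profile.get(year, 0.0)), year)
        for doc_id, score, year in results
    ]
    return sorted(rescored, key=lambda r: r[1], reverse=True)


initial = [("D2009", 0.9, 2009), ("D2005", 0.8, 2005)]
query_profile = {2005: 0.5, 2004: 0.3, 2006: 0.2}
reranked = rerank(initial, query_profile)
```

With this hypothetical profile, the document from 2005 moves ahead of the document from 2009 even though its initial score is lower.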

Page 92: Search, Exploration and Analytics of Evolving Data

Change of Query Subtopics over Time

query: ncaa

14/03/2006: march madness began

18/03/2006: ncaa women tournament began

01/04/2006: final four began

[Nguyen and Kanhabua, ECIR 2014]

Page 93: Search, Exploration and Analytics of Evolving Data

Mining Temporal Anchor Texts

Anchor texts are complementary descriptions of target pages, widely used to improve search

Characteristics:

Short summary (a few words) of target pages

Collective wisdom of people other than authors

Similar behavior to real-world queries and titles

Capturing aboutness or what a document is about

Main ideas:

Temporal anchor texts mined from the edit history of

Wikipedia as a hook for tracking entity evolution

Large-scale analysis and a more robust discovery of

evolving information using limited resources


Page 94: Search, Exploration and Analytics of Evolving Data

Mining Temporal Anchor Texts

1. Partition Wikipedia revisions using the one-month granularity

2. For each Wikipedia snapshot, identify named entity articles/pages

3. Extract anchor texts from all articles linking to an entity page

4. Rank aggregated entity-anchor relationships at a particular time t
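Steps 1 and 3 can be sketched on raw wikitext as follows. This is an illustrative assumption, not the authors' pipeline: links are matched with a simple regular expression for the `[[Target|anchor]]` form, snapshots are given as hypothetical `(month, wikitext)` pairs, and the set of entity pages from step 2 is passed in ready-made.

```python
import re
from collections import Counter, defaultdict

# Matches [[Target]] and [[Target|anchor]]; links with several pipes
# (e.g. image links) do not match and are skipped.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_anchors(wikitext):
    """Return (target_page, anchor_text) pairs found in one revision."""
    pairs = []
    for m in LINK.finditer(wikitext):
        target = m.group(1).strip()
        anchor = (m.group(2) or target).strip()
        pairs.append((target.replace(" ", "_"), anchor))
    return pairs


def anchors_by_month(snapshots, entities):
    """Aggregate anchor counts per (month, entity) over monthly snapshots."""
    table = defaultdict(Counter)
    for month, text in snapshots:
        for target, anchor in extract_anchors(text):
            if target in entities:
                table[(month, target)][anchor] += 1
    return table


text = "[[Barack Obama|Senator Barack Obama]] met [[Illinois]] voters."
table = anchors_by_month([("2008-05", text)], {"Barack_Obama"})
```

The resulting table holds, for each month and entity, the anchor texts that step 4 would then rank.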

[Kanhabua and Nørvåg, JCDL 2010]


Page 96: Search, Exploration and Analytics of Evolving Data

Mining Temporal Anchor Texts

Figure: anchor texts linking to the entity page President_of_the_United_States at different times: "George W. Bush" (11/2004), "President Bush (43)" (10/2005), "Barack Obama" (11/2008)

Page 97: Search, Exploration and Analytics of Evolving Data

Recognizing Named Entity Articles

1. Multi-word titles with all words capitalized, except prepositions, determiners, etc.

E.g., President_of_the_United_States => entity

2. Single-word titles with multiple capital letters

E.g., UNICEF and WHO => entities

3. At least 75% of the title's occurrences in the article text itself are capitalized (not counting the beginning of a sentence)

[Bunescu and Pasca, EACL 2006]
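The three heuristics for recognizing named entity articles can be sketched as a single predicate. The list of function words, the naive sentence splitting on `.`, `!` and `?`, and the name `is_named_entity` are assumptions made for this illustration.

```python
import re

# Small illustrative list of words allowed to stay lowercase in a title.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "at", "for", "to", "and", "or"}


def is_named_entity(title, article_text=""):
    words = title.replace("_", " ").split()
    if not words:
        return False
    # Rule 1: multi-word title whose content words are all capitalized.
    if len(words) > 1:
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    word = words[0]
    # Rule 2: single-word title with multiple capital letters.
    if sum(ch.isupper() for ch in word) > 1:
        return True
    # Rule 3: at least 75% of non-sentence-initial occurrences are capitalized.
    hits = caps = 0
    for sentence in re.split(r"[.!?]+", article_text):
        for token in sentence.split()[1:]:  # skip the sentence-initial word
            token = token.strip(",;:()\"'")
            if token.lower() == word.lower():
                hits += 1
                caps += token[0].isupper()
    return hits > 0 and caps / hits >= 0.75
```

For example, `is_named_entity("President_of_the_United_States")` and `is_named_entity("UNICEF")` both hold, while a single-word common-noun title fails rule 3 when its occurrences in the text are lowercase.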

Page 98: Search, Exploration and Analytics of Evolving Data

Temporal Anchor Weighting

Weight anchor texts by importance with respect to a target entity at a particular time:

• Link-independent: inlink pages are independent and equally important to the target page

• Compute based on the whole collection of Wikipedia entity pages at a particular time t

• Two variants: 1) article links, and 2) distinct pages

[Dou et al., SIGIR 2009]
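The two weighting variants can be sketched as relative frequencies over the links that point to one entity page at time t. The exact weighting formula is not preserved in the slides, so the normalisation below and the name `anchor_weights` are assumptions.

```python
from collections import Counter


def anchor_weights(links, distinct_pages=False):
    """Weight each anchor text of one target entity at one point in time.

    links: (source_page, anchor_text) pairs pointing to the entity page.
    distinct_pages=False counts every article link;
    distinct_pages=True counts each source page once per anchor text.
    """
    if distinct_pages:
        links = set(links)
    counts = Counter(anchor for _, anchor in links)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}


links = [
    ("Page_A", "President Bush"),
    ("Page_A", "President Bush"),  # same page links twice
    ("Page_B", "George W. Bush"),
]
by_links = anchor_weights(links)
by_pages = anchor_weights(links, distinct_pages=True)
```

Counting distinct pages dampens the effect of a single article that repeats the same anchor many times.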


Page 100: Search, Exploration and Analytics of Evolving Data

Experiments

Data collection:

• A dump of English Wikipedia edit history (2.8 TB)

• All pages and revisions 03/2001 to 03/2008

• 85 snapshots + 4 additional snapshots

(24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)

Tools:

• Preprocess/store revisions using MWDumper

http://www.mediawiki.org/wiki/Mwdumper

• Store anchor texts: MySQL databases


Page 101: Search, Exploration and Analytics of Evolving Data

Top-100 Named Entities



Page 105: Search, Exploration and Analytics of Evolving Data

Evolving Context

Ranked anchor texts for “Barack Obama” over time (05/2008, 07/2008, 10/2008, 03/2009)

1. Senator Barack Obama
2. Senator Obama's legislative accomplishments
3. Illinois
4. U.S. Sen. Barack Obama

1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barack Hussein Obama II
4. Senator Obama's legislative accomplishments

1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement

1. President Barack Obama
2. Senator Barack Obama
3. U.S. President Barack Obama
4. 44th President of the United States
5. Obama Administration

Page 106: Search, Exploration and Analytics of Evolving Data

Main Findings

Evolving information & context

• Role changes for political entities

• Geographic name changes for locations

• Trend or things in vogue for celebrities

• Products in demand for technology


Page 113: Search, Exploration and Analytics of Evolving Data

Question?


Page 114: Search, Exploration and Analytics of Evolving Data

Time-aware Retrieval and Ranking

(1) Recency-based Ranking

(2) Time-dependent Ranking


Page 115: Search, Exploration and Analytics of Evolving Data

RECAP

Two time dimensions

1. Publication or modified time

2. Content or focus time


Page 116: Search, Exploration and Analytics of Evolving Data

Searching the past

Historical or temporal information needs

A journalist working on the historical background of a particular news article

A Wikipedia contributor finding relevant information that has not been written about yet

“Temporal document collections”: Web archives, news archives, blogs, emails

Retrieve documents about Pope Benedict XVI written before 2005

Term-based IR approaches may give unsatisfactory results

Page 117: Search, Exploration and Analytics of Evolving Data

Temporal Query Examples

A temporal query consists of:

Query keywords

Temporal expressions

A document consists of:

Terms, i.e., bag-of-words

Publication time and temporal expressions


Page 118: Search, Exploration and Analytics of Evolving Data

Temporal Query Examples

[Berberich et al., ECIR 2010]

Page 119: Search, Exploration and Analytics of Evolving Data

Recency-based Ranking

Assign prior probabilities using an exponential function

E.g., a more recent creation date obtains a higher probability

Current approaches:

Time-based language model [Li and Croft, CIKM 2003]

Using retention functions [Peetz and de Rijke, ECIR 2013]

Incorporating freshness into web authority [Dai and Davison, SIGIR 2010]
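An exponential recency prior can be sketched as follows. The rate value, the measurement of document age in days, and the combination with a query log-likelihood are assumptions for illustration; the cited approaches differ in their details.

```python
import math


def recency_prior(doc_age, rate=0.01):
    """Exponential prior: rate * exp(-rate * age); newer documents get more mass."""
    if doc_age < 0:
        raise ValueError("document age must be non-negative")
    return rate * math.exp(-rate * doc_age)


def recency_score(query_log_likelihood, doc_age, rate=0.01):
    """Log of P(q|d) * P(d), with P(d) given by the recency prior."""
    return query_log_likelihood + math.log(recency_prior(doc_age, rate))


# Two documents that match the query equally well but differ in age (days).
newer = recency_score(-12.0, doc_age=10)
older = recency_score(-12.0, doc_age=100)
```

With equal textual evidence, the newer document receives the higher score; the `rate` parameter controls how quickly older documents are discounted.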

Page 120: Search, Exploration and Analytics of Evolving Data

Time-dependent Ranking

Time must be explicitly modeled in order to increase the effectiveness of ranking

To order search results so that the most relevant ones come first

Time uncertainty should be taken into account

Two temporal expressions can refer to the same time period even though they are not equally written

E.g., the query “Independence Day 2011”

A retrieval model relying on term-matching only will fail to retrieve documents mentioning “July 4, 2011”

Page 121: Search, Exploration and Analytics of Evolving Data

Time-dependent Ranking

Two main approaches:

1. Mixture model [Kanhabua et al., ECDL 2010]

Linearly combining textual- and temporal similarity

2. Probabilistic model [Berberich et al., ECIR 2010]

Generating a query from the textual part and temporal part

of a document independently



Page 123: Search, Exploration and Analytics of Evolving Data

Mixture Model

Linearly combine textual- and temporal similarity

α indicates the importance of similarity scores

Both scores are normalized before combining

Textual similarity can be determined using any term-based retrieval model

E.g., tf.idf or a unigram language model
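The linear combination can be sketched as below. The slide's formula is not preserved, so this assumes the form (1 − α) · textual + α · temporal with min-max normalisation of both score sets; the function names are illustrative.

```python
def min_max(scores):
    """Min-max normalise a {doc_id: score} mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Combine normalised textual and temporal similarity per document."""
    text_n, time_n = min_max(text_scores), min_max(time_scores)
    return {
        d: (1 - alpha) * text_n[d] + alpha * time_n.get(d, 0.0)
        for d in text_n
    }


text = {"d1": 2.0, "d2": 1.0}   # e.g. tf.idf scores
time = {"d1": 0.0, "d2": 4.0}   # temporal similarity scores
mixed = mixture_scores(text, time, alpha=0.25)
```

Setting α to 0 ranks by text alone, while α = 1 ranks by temporal similarity alone.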

How to determine temporal similarity?

Page 124: Search, Exploration and Analytics of Evolving Data

Temporal Similarity

Figure: similarity score versus time for documents d1 and d2 and the query time <q>, with distances Dist(d1,q) and Dist(d2,q)

[Kanhabua et al., ECDL 2010]

Page 125: Search, Exploration and Analytics of Evolving Data

Temporal Similarity

Assume that temporal expressions in the query are generated

independently from a two-step generative model:

P(tq|td) can be estimated based on publication time using an

exponential decay function [Kanhabua et al., ECDL 2010]

Linear interpolation smoothing is applied to eliminate zero probabilities

I.e., an unseen temporal expression tq in d
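A sketch of the temporal similarity under stated assumptions: the decay takes the form DecayRate^(λ · |tq − td| / μ), each query time is generated by first picking a document time uniformly, and smoothing interpolates with a constant background probability. The parameter names and default values are illustrative and not taken from the slides.

```python
def decay(tq, td, decay_rate=0.5, lam=0.5, mu=2.0):
    """Exponential decay of P(tq|td) with the distance between the two times."""
    return decay_rate ** (lam * abs(tq - td) / mu)


def temporal_similarity(query_times, doc_times, background=1e-6, beta=0.1, **decay_args):
    """Product over query times of the smoothed probability of generating tq from d."""
    score = 1.0
    for tq in query_times:
        if doc_times:
            # Two-step generation: pick td uniformly from d, then generate tq from td.
            p_doc = sum(decay(tq, td, **decay_args) for td in doc_times) / len(doc_times)
        else:
            p_doc = 0.0
        # Linear interpolation keeps the probability non-zero for unseen times.
        score *= (1 - beta) * p_doc + beta * background
    return score


close = temporal_similarity([2005], [2005, 2007], beta=0.0, lam=1.0)
unseen = temporal_similarity([2005], [], background=0.2, beta=0.1)
```

Documents whose times lie close to the query time score higher, and a document without any usable time still receives the small background probability.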


Page 126: Search, Exploration and Analytics of Evolving Data

Comparison of time-aware ranking

Five time-aware ranking models

LMT [Berberich et al., ECIR 2010]

LMTU [Berberich et al., ECIR 2010]

TS [Kanhabua et al., ECDL 2010]

TSU [Kanhabua et al., ECDL 2010]

FuzzySet [Kalczynski et al., Inf. Process. 2005]

[Kanhabua et al., SIGIR 2011]

Page 127: Search, Exploration and Analytics of Evolving Data

Discussion

Experiment:

New York Times Annotated Corpus

40 temporal queries [Berberich et al., ECIR 2010]

Result:

TSU outperforms other methods significantly for most metrics

Conclusions:

Although TSU achieves the best performance, it can only be applied to a collection with time metadata

LMT and LMTU can be applied to any collection without time metadata, but time extraction is needed

Page 128: Search, Exploration and Analytics of Evolving Data

Applications for Temporal IR

(1) Searching the Future

(2) Time-aware Recontextualization

Page 129: Search, Exploration and Analytics of Evolving Data

Searching the Future

People are naturally curious about the future

What will happen to EU economies in the next 5 years?

What will be the potential effects of climate change?


Page 130: Search, Exploration and Analytics of Evolving Data

Previous work

Searching the future

Extract temporal expressions from news articles

Retrieve future information using a probabilistic model, i.e.,

multiplying textual similarity and a time confidence

Supporting analysis of future-related information in news and on the Web

Extract future mentions from news snippets obtained from search

engines

Summarize and aggregate results using clustering methods, but no

ranking

[Baeza-Yates, SIGIR Forum 2005; Jatowt et al., JCDL 2009]

Page 131: Search, Exploration and Analytics of Evolving Data

Recorded Future

http://www.recordedfuture.com/


Page 132: Search, Exploration and Analytics of Evolving Data

Yahoo! Time Explorer

[Matthews et al., HCIR 2010]

Page 133: Search, Exploration and Analytics of Evolving Data

Ranking News Predictions

Over 32% of 2.5M documents from Yahoo! News (July’09 –

July’10) contain at least one prediction

Retrieve predictions related to a news story in news archives and

rank by relevance


Page 134: Search, Exploration and Analytics of Evolving Data

Related News Predictions

[Kanhabua et al., SIGIR 2011]

Page 137: Search, Exploration and Analytics of Evolving Data

Approach

Four classes of features

Term similarity, entity-based similarity, topic similarity and temporal similarity

Rank results using a learning-to-rank technique

[Kanhabua et al., SIGIR 2011]

Page 138: Search, Exploration and Analytics of Evolving Data

System Architecture

Step 1: Document annotation

Extract temporal expressions using time and event recognition

Normalize them to dates so they can be anchored on a timeline

Output: sentences annotated with named entities and dates, i.e., predictions

Step 2: Retrieving predictions

Automatically generate a query from a news article being read

Retrieve predictions that match the query

Rank predictions by relevance (i.e., a prediction is “relevant” if it is about the topics of the article)

[Kanhabua et al., SIGIR 2011]

Page 139: Search, Exploration and Analytics of Evolving Data

Term Similarity

Capture the term similarity between q and p

1. TF-IDF scoring function

Problem: keyword matching on short texts; predictions may not match the query terms

2. Field-aware ranking function, e.g., BM25F

Search the context of a prediction, i.e., surrounding sentences

[Kanhabua et al., SIGIR 2011]

Page 140: Search, Exploration and Analytics of Evolving Data

Entity-based Similarity

Measure the similarity between q and p using annotated entities in dp, p, q

Features commonly employed in entity ranking

[Kanhabua et al., SIGIR 2011]


Page 142: Search, Exploration and Analytics of Evolving Data

Topic Similarity

Compute the similarity between q and p on topic

Latent Dirichlet allocation [Blei et al., J. Mach. Learn. Res. 2003] for modeling topics

1. Train a topic model

2. Infer topics

3. Compute topic similarity

[Kanhabua et al., SIGIR 2011]


Page 144: Search, Exploration and Analytics of Evolving Data

Temporal Similarity

Hypothesis I. Predictions that are more recent to the query are more relevant

Hypothesis II. Predictions extracted from more recent documents are more relevant

[Kanhabua et al., SIGIR 2011]

Page 145: Search, Exploration and Analytics of Evolving Data

Ranking Method

Learning-to-rank: given an unseen pair (q, p), p is ranked using a model trained over a set of labeled query/prediction pairs

SVM-MAP [Yue et al., SIGIR 2007]

RankSVM [Joachims, KDD 2002]

SGD-SVM [Zhang, ICML 2004]

PegasosSVM [Shalev-Shwartz et al., ICML 2007]

PA-Perceptron [Crammer et al., J. Mach. Learn. Res. 2006]

[Kanhabua et al., SIGIR 2011]

Page 146: Search, Exploration and Analytics of Evolving Data

Relevance Judgments

42 future-related topics

[Kanhabua et al., SIGIR 2011]

Page 147: Search, Exploration and Analytics of Evolving Data

Experiments

New York Times Annotated Corpus

1.8 million articles, over 20 years

More than 25% contain at least one prediction

Annotation process uses several language processing tools

OpenNLP for tokenizing, sentence splitting, part-of-speech tagging, shallow parsing

SuperSense tagger for named entity recognition

TARSQI for extracting temporal expressions

Apache Lucene for indexing and retrieving

44,335,519 sentences and 548,491 predictions

939,455 future dates (avg. future date/prediction is 1.7)

[Kanhabua et al., SIGIR 2011]

Page 148: Search, Exploration and Analytics of Evolving Data

Discussion

Results:

Topic features play an important role in ranking

The top-5 features with the lowest weights are entity-based features

Open issues:

Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc.

Sentiment analysis for future-related information

[Kanhabua et al., SIGIR 2011]

Page 149: Search, Exploration and Analytics of Evolving Data

Prior to 1964, many of the cigarette companies advertised their brand by falsely claiming that their product did not have serious health risks. A couple of examples would be "Play safe with Philip Morris" and "More doctors smoke Camels". Such claims were made both to increase the sales of their product and to combat the increasing public knowledge of smoking's negative health effects.

Advertisement poster from the 1950s


Time-aware Contextualization

[Tran et al., WSDM 2015] (Slide provided by the authors)

Page 150: Search, Exploration and Analytics of Evolving Data

Physician

http://en.wikipedia.org/wiki/Physician

Camel (cigarette)

http://en.wikipedia.org/wiki/Camel_(cigarette)

Cigarette

http://en.wikipedia.org/wiki/Cigarette

Entity linking is not sufficient

Wikipedia pages tend to contain large amounts of content

Relevant information might be distributed over various articles

The crucial temporal aspect is missing in pure linking approaches

Entity Linking

[Tran et al., WSDM 2015] (Slide provided by the authors)

Page 151: Search, Exploration and Analytics of Evolving Data

Problem Statement


Time-aware contextualization aims to associate an information item d with time-aware, concise and coherent context information c for easing its understanding

Several sub-goals of the information search process have to be combined with each other

c has to be relevant for d

c has to complement the information already available in d

c has to consider the time of creation of d

the context information should be concise to avoid overloading the user

[Tran et al., WSDM 2015] (Slide provided by the authors)

Page 152: Search, Exploration and Analytics of Evolving Data

[Figure: pipeline diagram. Components shown: User, Article, Context Hook Identification, Query Formulation, Context Retrieval, Context Ranking, Context, Contextualization units Extraction, Contextualization units Index]

Approach Overview

[Tran et al., WSDM 2015] (Slide provided by the authors)

Page 153: Search, Exploration and Analytics of Evolving Data

The goal is to generate a set of queries for a given document to retrieve candidates as input for the re-ranking step

We explore two families of query formulation methods

Document-based methods: title, lead, title+lead

Hook-based methods: each_hook, all_hooks, and query performance prediction (qpp_r@k) with the following features

Linguistic features

Document frequency

Scope

Temporal document frequency

Temporal scope

Temporal similarity

Query Formulation

[Tran et al., WSDM 2015] (Slide provided by the authors)
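The document-based and hook-based query formulation methods can be sketched as below. This is one plausible reading of the method names, not the authors' code: each_hook is assumed to issue one query per hook, and all_hooks is assumed to combine all hooks into a single query. The document structure (a dict with title, lead and hooks) is hypothetical, and the qpp_r@k method is left out because the slides do not specify how it selects queries.

```python
def formulate_queries(doc, method):
    """Return a list of query strings for a document.

    doc: dict with keys 'title' (str), 'lead' (str) and 'hooks' (list of str).
    method: one of 'title', 'lead', 'title+lead', 'each_hook', 'all_hooks'.
    """
    if method == "title":
        return [doc["title"]]
    if method == "lead":
        return [doc["lead"]]
    if method == "title+lead":
        return [doc["title"] + " " + doc["lead"]]
    if method == "each_hook":
        # Assumption: one separate query per context hook.
        return list(doc["hooks"])
    if method == "all_hooks":
        # Assumption: a single query built from all hooks together.
        return [" ".join(doc["hooks"])]
    raise ValueError("unknown query formulation method: " + method)


# Made-up example document.
article = {
    "title": "More doctors smoke Camels",
    "lead": "Cigarette advertising before 1964",
    "hooks": ["Camel cigarette", "physician", "health risks"],
}
per_hook = formulate_queries(article, "each_hook")
combined = formulate_queries(article, "all_hooks")
```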

Page 154: Search, Exploration and Analytics of Evolving Data

Context retrieval:

Learning to rank context:

• The ranking algorithm needs to balance two goals, i.e., high topical and temporal relevance as well as complementarity for providing additional information

• Use supervised machine learning that takes as input a set of labeled examples and various complementarity features

Topic diversity

Text difference

Entity difference

Anchor text difference

Distributional similarity

Cosine distance

Relevance

Temporal similarity

Context Ranking

[Tran et al., WSDM 2015] (Slide provided by the authors)
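Two of the listed complementarity features can be sketched in a simple form. The slides do not define them, so the definitions below are assumptions: cosine distance is computed over whitespace-tokenised bags of words, and entity difference is taken to be the fraction of context entities that do not occur in the article. In a learning-to-rank setup such values would be combined with relevance features into one feature vector per article/context pair.

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """1 - cosine similarity of bag-of-words vectors (1.0 if either text is empty)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def entity_difference(article_entities, context_entities):
    """Fraction of the context's entities that are new with respect to the article."""
    context = set(context_entities)
    if not context:
        return 0.0
    return len(context - set(article_entities)) / len(context)


def complementarity_features(article_text, context_text, article_entities, context_entities):
    """Feature vector for one article/context pair (hypothetical layout)."""
    return [
        cosine_distance(article_text, context_text),
        entity_difference(article_entities, context_entities),
    ]
```

A context that scores high on relevance but near zero on both features mostly repeats the article and adds little for the reader.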

Page 155: Search, Exploration and Analytics of Evolving Data

Experiments


Datasets:

51 news articles from New York Times Corpus

Wikipedia (2013), 26 million contextualization units (paragraphs)

9,464 manually labeled examples (article/context pairs)

Learning to rank algorithms: RankBoost, Random Forests and AdaRank

Baselines

Entity linking (Milne and Witten)

Language model (LM)

Time-aware language model (LM-T)

[Tran et al., WSDM 2015] (Slide provided by the authors)

Page 156: Search, Exploration and Analytics of Evolving Data

Evaluating Query Formulation Methods

[Figure: bar chart of P@1, P@3, P@10 and MAP (y-axis 0 to 0.7) for title+lead, all_hooks and qpp_r@100]

Wikification technique achieves a low recall of 0.229

Hook-based approaches outperform the document-based approaches

Query performance prediction method obtains the highest results on all metrics

[Tran et al., WSDM 2015] (Slide provided by the authors)
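For reference, the metrics reported above (P@k and MAP) can be computed as in the following sketch. It assumes binary relevance labels listed in ranked order, and it normalises average precision by the number of relevant items found in the ranked list, which is one common convention. The evaluation setup of the cited work may differ.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance: list of 0/1)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision values at the ranks where a relevant result appears."""
    hits = 0
    total = 0.0
    for position, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP over several ranked lists, one per query."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Made-up judgments for two queries (1 = relevant, 0 = not relevant).
runs = [[1, 0, 1, 0], [0, 1]]
map_score = mean_average_precision(runs)
```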

Page 157: Search, Exploration and Analytics of Evolving Data

The Effect of Complementarity Features

[Figure: bar chart of P@1, P@3, P@10 and MAP (y-axis 0 to 1) for LM-T and RF]

Purely using the time dimension in context retrieval is not sufficient in the contextualization task

Complementarity plays an important role in contextualization

[Tran et al., WSDM 2015] (Slide provided by the authors)

Page 158: Search, Exploration and Analytics of Evolving Data

Conclusions and Outlook

Introduced the general topic of web evolution.

Pinpointed a number of issues related to temporal IR.

Focused on temporal information extraction, temporal query

analysis, as well as time-aware retrieval and ranking.

Wrapped up with related applications to temporal IR.

Future directions:

Real-time web mining

Spatio-temporal search and analytics

Brain-inspired information access


Page 159: Search, Exploration and Analytics of Evolving Data

Thank you!
