Estimating Time Models for News Article Excerptspeople.mpi-inf.mpg.de/~kberberi/publications/2016-cikm2016.pdf · Grammy award which took place on “February 2, 2016”. Tempo-

Estimating Time Models for News Article Excerpts

Arunav Mishra˚

˚Max Planck Institute for InformaticsSaarbrücken, Germany

{amishra,kberberi}@mpi-inf.mpg.de

Klaus Berberich˚,:

:htw saarSaarbrücken, Germany

[email protected]

ABSTRACTIt is often difficult to ground text to precise time intervals due tothe inherent uncertainty arising from either missing or multiple ex-pressions at year, month, and day time granularities. We addressthe problem of estimating an excerpt-time model capturing the tem-poral scope of a given news article excerpt as a probability distri-bution over chronons. For this, we propose a semi-supervised dis-tribution propagation framework that leverages redundancy in thedata to improve the quality of estimated time models. Our methodgenerates an event graph with excerpts as nodes and models var-ious inter-excerpt relations as edges. It then propagates empiricalexcerpt-time models estimated for temporally annotated excerpts,to those that are strongly related but miss annotations. In our exper-iments, we first generate a test query set by randomly sampling 100Wikipedia events as queries. For each query, making use of a stan-dard text retrieval model, we then obtain top-10 documents with anaverage of 150 excerpts. From these, each temporally annotated ex-cerpt is considered as gold standard. The evaluation measures arefirst computed for each gold standard excerpt for a single query, bycomparing the estimated model with our method to the empiricalmodel from the original expressions. Final scores are reported byaveraging over all the test queries. Experiments on the English Gi-gaword corpus show that our method estimates significantly bettertime models than several baselines taken from the literature.

CCS Concepts•Information systems Ñ Content analysis and feature selec-tion; Probabilistic retrieval models;

Keywordsexcerpt-time model; temporal scoping; distribution propagation;temporal content analysis; probabilistic models; sparsity reduction;

1. INTRODUCTIONTime is considered an integral dimension for understanding and

retrospecting on past events. Temporal information associated witha past event, not only indicates its occurrence period but also helps

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others thanACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected].

CIKM’16 , October 24-November 28, 2016, Indianapolis, IN, USAc© 2016 ACM. ISBN 978-1-4503-4073-1/16/10. . . $15.00

DOI: http://dx.doi.org/10.1145/2983323.2983802

to understand its causality, evolution, and ramifications. In thisbig data era, with ever increasing amounts of information on pastevents made available on the World Wide Web, time becomes animportant indicator to organize and search relevant information.Typical sources of such temporal information are digital news archi-ves, blogs, and online encyclopedias like Wikipedia. With this con-ception, time as a dimension has thus been leveraged to improve theeffectiveness of various tasks in temporal information extraction,retrieval, and multi-document summarization.

Temporal information extraction [34] focuses on recognizing andnormalizing temporal expressions embedded in text into precisetime points or intervals. The normalized time intervals are usuallyrepresented in a standard format like TIMEX3 [34] of the TimeMLmarkup language. Commonly, temporal expressions are catego-rized into four types: explicit, relative, implicit, and free-text. Mostof the approaches [1, 20, 32] operate only on individual terms orphrases, and associating temporal information to larger textual units(like sentences or paragraphs) is out of scope. For such larger tex-tual units describing events with multiple aspects, it is difficult totag them with precise time intervals.

In temporal information retrieval, most approaches make use ofthe meta-data like creation time or publication dates associated withdocuments to identify their temporal scope. To estimate the rele-vance of a document for a given query, approaches [23, 30] modelthe scope of documents by applying an exponential decay func-tion to smooth their publication dates. There are a few approaches[2, 5, 17, 28, 31] that leverage expressions embedded in the text ofthe documents to estimate their temporal scopes. In a prior work[27], we present an approach to estimate query-time models whichare combined with language models to improve document retrievaleffectiveness. However, most of these approaches that rely on ex-plicit temporal expressions are not easy to extend to sentences orpassage retrieval due to the high sparsity of annotations.

In extractive summarization, time has been leveraged to ordersentences in textual summaries. For a summary generated in con-text of an event, it is intuitive to present a chronological ordering ofthe sentences extracted from the news articles. Some approaches[3] leverage the publication dates of the source news articles asa proxy to model the temporal scope of sentences extracted fromthem. This strategy however assumes that the documents are highlycoherent and focus on a single time period indicated by their publi-cation dates. Other approaches [6, 29] combine chronological withseveral strategies while ordering multiple sentences from the samedocument. In another recent work [26], we estimate time mod-els for sentences by considering temporal expressions in a set ofpseudo-relevant documents. However, in their approach, all sen-tences from a single document that do not come with any tempo-ral expression end up having similar time models. Though this

Excerpts (Events) Time

ε1: The electronic producers Skrillex andDiplo, who under the name Jack Ü had one ofthe biggest hits last year with “Where Are ÜNow”, featuring Justin Bieber, won both bestdance recording for that song and best elec-tronic/dance album.

T1: “last year” = [01-01-2015, 31-12-2015]

ε2: On Monday, Skrillex and Diplo won thebest dance/electronic album for “Skrillex andDiplo Present Jack and best dance recording forWhere Are Ü Now”

T2: “Monday” = [02-02-2016, 03-02-2016]

ε3: Diplo and Skrillex took home the gold twiceat the Grammy Awards night winning BestDance Recording and Best Dance/ElectronicAlbum.

Table 1: Example of redundant news article excerpts describing thesame event, along with their temporal annotations.

Figure 1: An event graph generated from the excerpts whereε1 and ε2 are associated with their two-dimensional empiricaltime models estimated from annotated temporal expressions.

approach proves to be more effective than text-only approaches,it can be further improved with more accurate and discriminativetime models estimated for sentences with missing annotations dur-ing summary generation.

It can be noted that approaches addressing the above mentionedtasks severely suffer due to high sparsity of temporal informationwhen extended to finer granularity of textual units like passages orsentences. We refer to the finer textual units simply as excerpts. Todeal with sparsity, we propose to estimate an excerpt-time modelthat captures the temporal scope of a given excerpt describing anevent. An excerpt-time model can be understood as a probabil-ity distribution over time units or chronons such as days, months,or years. Thus, the temporal scope of excerpts is represented asa probabilistic time model instead of precise intervals on a timescale. To further explain this, consider the first example excerpt ε1from Table 1. The single excerpt gives information on two indepen-dent time periods. First, the song “Where Are Ü Now” becoming ahit in the year “2015” and second, Skrillex and Diplo winning theGrammy award which took place on “February 2, 2016”. Tempo-rally tagging ε1, will associate the entire excerpt to the year “2015”.However, the desideratum is that the excerpt is associated with aprobabilistic model in a time domain where the salient time unitspertaining to the year “2015” and the day “February 2, 2016” re-ceive higher probabilities. Also, even though ε3 does not mentionany expression, it still conveys temporal information with a scopethat needs to be estimated as its time model.

One plausible approach to estimate a more accurate time modelfor a given excerpt is to leverage the redundancy in the data. Forexample, given both ε1 and ε2 from Table 1, one can note that thesecond part of ε1 highly overlaps with the event described in ε2.Since ε1 comes with an expression “Monday” that is normalized to“February 2, 2016” with a standard temporal tagger, we can use thisredundancy to estimate a more accurate time model for ε1. Simi-larly, the time model of ε3 that does not come with any temporalexpression can be estimated by combining the time models of ε2and ε2. This is because it contains overlap with both the excerpts.

Motivated from the above examples, in this paper we address thefollowing problem: from a given set of excerpts describing events,for each excerpt automatically estimate an excerpt-time model cap-turing its temporal scope. As input, we consider: 1) a set of ex-cerpts along with their source documents; 2) a set of normalizedtemporal expressions extracted from the input set of excerpts. Asoutput our method automatically estimates a probabilistic time modelfor each excerpt capturing its temporal scope.

Challenges include estimating an excerpt-time model from a givenset of salient temporal expressions for an excerpt. In addition,leveraging redundancy in text to improve the quality of estimatedexcerpt-time models is a challenge.

Applications. Excerpt-time models can be leveraged in variousapplications. In temporal information extraction, these models canlead to improvements in accuracy of free text temporal expressionnormalization. For example, ε3 contains the expression “GrammyAwards night” whose normalization can be improved by combiningthe other two excerpts in Table 1. For information retrieval, accu-rate excerpt-time models can be used to estimate better query-timemodels [27]. A direct application in extractive summarization is toleverage the time models to improve the summary quality [26], andchronologically ordering excerpts. Several tasks in timeline gener-ation, question answering, and event detection also benefit from thetime models.

Contributions made by this paper are as follows: 1) We proposethe problem of temporally scoping excerpts by estimating proba-bilistic excerpt-time models; 2) we propose a distribution propaga-tion framework that models several inter-excerpt relations, and thenestimates excerpt-time models by propagating information from re-lated excerpts that come with temporal expressions; 3) finally, wepropose two new measures, Model Quality and Generative Power,to evaluate our method on a real-world data set.

Organization. In Section 2, we review related work from the liter-ature and give technical background; Section 3 gives details of ourapproach. Conducted experiments and their results are described inSection 4. Finally, we conclude in Section 6.

2. RELATED WORK AND BACKGROUNDWe identify related work along the following four lines.

Temporal Information Extraction. A temporal expression [34]is a term or a phrase that refers to a precise time point or an in-terval. In the information extraction community, recently a lot ofattention has been given to recognizing and normalizing tempo-ral expressions. TempEval [34] competitions are held with a spe-cific goal to evaluate temporal taggers on different document col-lections. Some existing systems are HeidelTime [32], SUTime [8],and Tarsqi toolkit [35]. Though all these systems mainly target ex-plicit and relative temporal expressions, a recent work by Kuzeyet al. [20] propose to automatically normalize free-text temporalexpressions or temponyms.

Some attention has also been given to annotating text segmentsthat do not contain any expressions. Chasin et al. [9] temporallyanchor events to a time period between the prior time stamp andsubsequent time stamp occurring in text. However, they them-selves consider this as a naïve algorithm which is also acknowl-edged by Gung et al. [14] as a good baseline. Gung et al. usea clustering-based method to annotate textually similar sentenceswith the same time-stamps. Though this seems to be a reasonableassumption, hard clustering may lead to overly-specific or generictemporal scope for specific incidents of larger events. More com-plex approaches [7, 21, 36] make use of additional linguistic fea-tures to anchor time. Bramsen et al. [7] break complex sentencesinto clauses and look for temporal markers like after, in next year,etc., to detect breaks in temporal continuity. Jatowt et al. [15] andKotsakos et al. [19] attempt to estimate the precise focus time oftext documents in a news corpus at the year granularity by leverag-ing temporal language models. More recently, Laparra et al. [21]use event detection algorithms to propagate time stamps.

In the context of our problem, recognizing and normalizing tem-poral expressions is not sufficient to estimate excerpt-time models.This is because the majority of excerpts extracted from documentsdo not come with expressions. Also, the expressions often sufferfrom uncertainty [5]. Other approaches mentioned above rely oncorpus statistics and temporal relations. Firstly, these methods arenot comparable as they aim at annotating text segments with a sin-gle precise time point or interval as opposed to a distribution likein our problem. Secondly, there exists a difference in their notionfor “events” which makes them incomparable. Most commonlyin prior works, events are described as topics with few keywordswhile we consider a sentence from a news article. We use the SU-Time toolkit to annotate the temporal expressions in the excerptsthat are extracted from news articles.

Temporal Information Retrieval is another research area whereleveraging temporal information associated with documents has re-ceived much attention. Among the first works, Li and Croft [23]proposed a method to combine a document’s language model witha time dependent prior indicating its temporal scope. This approachwas extended by Peetz et al. [30] by investigating different priorsbased on cognitive models. Berberich et al. [5] was the first topresent a two dimensional representation to model the time domain,and design a query-likelihood based method to rank documents bycombining text and time. Recently, Efron et al. [12] presenteda non-parametric kernel density estimation method to incorporatetemporal relevance feedback from users to improve retrieval qual-ity. As a recent work in this direction, Mishra et al. [28] presenta method to link Wikipedia events to news articles by treating agiven event as a query and estimating query-time models from aset of pseudo-relevant documents. Most of the approaches men-tioned above rely only on the meta-data associated with the docu-ments like their creation times. In addition, they focus on documentretrieval and may not extend to excerpts due to high sparsity.

In this work, we adopt the two-dimensional representation oftime as presented by Berberich et al. [5]. This allows us to esti-mate two-dimensional time models by capturing the uncertainty intime. Mishra et al. [28] proposed a method to estimate a querytime model that captures the temporal scope of the query. In thiswork, we adopt a methodology proposed by Mishra et al. [28] toestimate excerpt-time models from a set of temporal expressions.However, our work in this paper differs in two ways. First, we aimto estimate time models for excerpts, instead of entire documents,which may not come with expressions. Second, we propose a dis-tribution propagation framework to estimate the time-models moreaccurately instead of relying on pseudo-relevant documents.

Extractive Summarization. Many of the studies that leveragedtime, either used document publication date as a proxy for estimat-ing sentence temporal scope [3, 6, 29] or created handcrafted rulesto identify temporal annotations for relations [13, 24, 25]. Laterworks [14] leveraged rule-based taggers to tag and normalize tem-poral expression. However, as an issue pointed out by Mani et al.[24] temporal expressions only constitute about 25% of the tempo-ral information in a typical news corpus and are insufficient for anylearning-based method. Most recently, Mishra et al. [26] addressthe problem of event digest generation from an input set of newsarticles. They presented a framework to diversify across text, time,geolocations, and named entities for the digest generation.

In this work, we adopt the concept of excerpt-time models intro-duced by Mishra et al. [26]. In their work, they attempt to estimatea probabilistic time model for each excerpt that capture the tempo-ral scope of the events described by them. However, the focus oftheir problem was different. Moreover, they relied on the temporalexpressions that are mentioned in the source documents of the ex-cerpt to estimate the time models. In their approach, excerpts takenfrom the same documents that do not come with any explicit ex-pression get associated with similar time models that are no longerdiscriminative. This stands as a contrast to this work where wefocus on estimating an accurate time model for each excerpt byleveraging the redundancy of information in a set of documents.

Semi-Supervised Label Propagation [4, 18, 37–39] aims at label-ing data by propagating class labels from labeled data. For this, thegeometry of the data is leveraged by first generating a data graphwhere the nodes represent data points, and an edge between twonodes represents similarity between them. This paradigm was in-troduced by Zhu and Ghahramani [38]. In their later work [39],they proposed another formulation of the propagation algorithmthat models it as a Harmonic Gaussian Field. A similar propaga-tion algorithm was proposed by [37] that introduces self-loops sothat during each iteration, nodes also receive a small contributionfrom their initial values. Recently, Karasuyama et al. [18] proposedan algorithm to efficiently identify the manifold structure of a datagraph and at the same time learn the hyper-parameters by a novelfeature propagation method.

In this work, among the several approaches, we choose to adoptthe algorithm proposed by Zhu and Ghahramani [38] for designingour distribution propagation framework. However, there are a fewfundamental differences. Firstly, their algorithm is designed fordata points in Euclidean space. However, in the text space, this typeof similarity (distance function) is known to not perform well. Sowe incorporate JS-Divergence as our distance metric. Secondly, intheir original objective function, they introduce a hyper-parameterσ as an importance parameter for each dimension. This is learnedindependently for each dimension. However, we linearly combineall the inter-excerpt relation weights and have a single scaling pa-rameter that is set to the average of the total weight for simplicity.

3. APPROACHWe design a distribution propagation algorithm that is based on

label propagation proposed by Zhu et al. [38]. In our method, fromthe input set of pseudo-relevant excerpts R, the subset of excerptsthat come with temporal expressions are treated as a seed set Rseed.For each excerpt in Rseed, we estimate an empirical excerpt-timemodel from their original expressions. The time models for ex-cerpts with missing expressions are initialized to a uniform distri-bution. An event graph is generated by considering all excerptsin R as nodes, and models their relationships as weighted edges.Finally, as an iterative process, the empirical time models from ex-

Notations Descriptions

d, C, V Document, Collection, and Vocabularydevent, dcontext Wikipedia Current Events portal, and source documentε, εtext, εtime Excerpt, excerpt text part, excerpt time partEtext, Etime Excerpt-text, and excerpt-time modelt, τ Temporal expression, time unitrtb, tes, tb, te Time interval, begin time point, and end time pointtbl, tel Lower bounds on begin and end time pointstbu, teu Upper bound on begin and end time pointsR, Rseed, Rtest Input set, seed set with annotated excerpts, and test set with a

single excerpt whose time model is to be estimatedmint Earliest time unit in Rseed

maxt Latest time unit in Rseed

Gevent, wij Event graph, and edge weight between two nodesδij Total similarity between two excerptsδtext, δpos, δcep,δctx Textual, positional, conceptual, and contextual similarity

Table 2: List of notations

cerpts with expressions are propagated to those that are stronglyrelated but do not come with any temporal expression.

A time model for an excerpt can be understood as a probabil-ity distribution in our time domain that captures the true temporalscope of the excerpt. We leverage redundancy in data to estimate atime model for excerpts that do not come with explicit information.We follow the idea that strongly related excerpts refer to the sameevent and hence share the same temporal focus.

3.1 DefinitionsWe first define the notations and representations used to design

our method. Table 2 summarizes our notations.

Excerpt ε is a single unit of an input document d that gives in-formation on an event. For this work, fix an excerpt to a singlesentence. Each excerpt can be described as having two parts: textεtext and time εtime derived from the excerpt description. As abag of words, εtext is drawn from a fixed vocabulary V derivedfrom the entire collection C. Similarly, εtime is a bag of temporalexpressions derived from the textual description the excerpt.

We treat the input documents as a set of excerpts referred to asR. In our algorithm, we refer to excerpts that originally come withtemporal expressions as the seed set Rseed Ă R.

In our methods, we design a text background model of a givenexcerpt from its source document referred to as dcontext, and co-alescing the Wikipedia Current Events portal1 into a single docu-ment which is referred devent.

Time unit or chronon τ indicates the time passed (to pass) since(until) a reference date such as the UNIX epoch. Every temporalexpression t is treated as an interval rtb, tes in a two dimensionaltime domain T ˆ T with begin tb and end time te. Further, aninterval is described as a quadruple rtbl, tbu, tel, teus [5] wheretbl gives the lower bound, and tbu gives the upper bound of begintime tb. Analogously, tel and teu give the bounds for the end timete of the interval.

Event graph Geventpε, rq is similar to the widely accepted notionof a sentence network [7, 10, 11, 16] where each excerpt describesan event. We generate a simple graph Geventpε, rq where eachexcerpt ε P R becomes a node and inter-excerpt relationship r ismodeled as a weighted edge. Since any excerpt εi can be related toany other excerpt εj , the event graph Gevent becomes a completegraph. An illustration of an event graph generated from three ex-cerpts in Table 1 is shown in Figure 1. In our experiments describedlater in Section 4, we design a set of methods that make use of thissimple graph to estimate the temporal scope of excerpts.

1https://en.wikipedia.org/wiki/Portal:Current_events

3.2 Excerpt ModelsWith our notations and representations defined, we now adopt

query-modeling techniques described by Mishra et al. [27].

Excerpt-Text Model Etext refers to a unigram language modelestimated from εtext that captures the event described in the ex-cerpt. It can be observed that any εtext comes with two types ofterms: first, that convey background information, and second, thatdescribe a specific event. To stress on the terms salient to the de-scribed event, we combine the empirical excerpt-text model with abackground model estimated from: 1) the textual content of thesource news article, dcontext; and 2) the textual descriptions ofevents from Wikipedia Current Events portal coalesced into a singledocument, devent. The dcontext background model puts emphasison the contextual terms, like {Skrillex, Diplo, Grammy, album}.The devent background model puts emphasis on terms like {won,award} that are discriminative for the specific event at hand.

We combine the excerpt-text model with a background model bylinear interpolation. The generative probability of a word w fromthe excerpt-text model Etext is estimated as,

P pw|Etextq “ p1 ´ λq ¨ P pw | εtextq ` λ ¨ “β ¨ P pw | deventq

` p1 ´ βq ¨ P pw | dcontextq‰. (1)

A term w is generated from the background model with probabilityλ and from the original excerpt with probability 1´λ. Since we usea subset of the available terms, we finally re-normalize the excerpt-text model. The new probability P̂ pw | Etextq is estimated as,

P̂ pw | Etextq “ P pw | Etextqřw1PV P pw1 | Etextq . (2)

Excerpt-Time Model Etime can be understood as probability dis-tribution that captures the salient periods for an event described inthe excerpt. Further, we assume that a temporal expression t Pεtime is sampled from the excerpt-time model Etime. The gen-erative probability of any time unit τ from the time model Etime

is estimated by iterating over all the temporal expressions t “rtbl, tbu, tel, teus in εtime as,

P pτ | Etimeq “ÿ

rtb,tesPεtime

1pτ P rtbl, tbu, tel, teusq|rtbl, tbu, tel, teus| (3)

where the 1p¨q function returns 1 if there is an overlap betweena time unit τ and an interval rtbl, tbu, tel, teus. The denominatorcomputes the area of the temporal expression in T ˆ T . For anygiven temporal expression, we can compute its area and its inter-section with other expressions as described in [5]. Intuitively, theabove equation assigns higher probability to time units that overlapwith a larger number of specific (smaller area) intervals in εtime.Finally, we re-normalize as per Equation 2.

3.3 Inter-Excerpt RelationsEdge weights in an event graph denote the relationship between

two excerpts. Larger weights between two excerpts indicate closerrelation with more informational overlap, and hence point to thefact that excerpts may focus on the same time periods. In ourmethod, the edge weights are computed by enforcing an exponen-tial function over the total similarity δij between two excerpts εiand εj thus allowing propagation from similar excerpts more freely.Formally the edge weights are computed as,

wij “ exp´ δ2ij

σ2

¯(4)

where σ is the scaling parameter and is set to the average similaritybetween the excerpts. We model δij as a linear combination of five

similarity scores capturing different relationships as,

δij “ δ̂text ij ` δ̂cep ij ` δ̂pos ij ` δ̂ctx ij ` δ̂sem ij . (5)

In the above, each of the factors are normalized across the excerptsusing min-max normalization. Formally, this is computed as,

δ̂ij “ δij ´ minpδijqmaxpδijq ´ minpδijq .

Finally, as motivated by Zhu et al. [38], we additionally smooththe weight matrix with a uniform transition probability matrix Uwhere Uij “ 1{|R| to compute W as,

W “ γ ¨ U ` p1 ´ γq ¨ W . (6)

We compute the following inter-excerpt relations by leveragingthe excerpt models estimated in Section 3.2.

Text similarity δtext ij between two excerpts can be a strong in-dicator of their information overlap. Leveraging this idea, we statethe following hypothesis:If two excerpts are textually similar, then they most likely discussthe same event and time period.In order to estimate the text similarity between any two excerptsεi and εj , we compute Jensen-Shannon divergence between theirexcerpt-text models which is estimated as per Equation 1. For-mally, this is given as

δtext ij “ ´JSDpEi text||Ej textq . (7)

Positional similarity δpos ij . It can be assumed that the sourcenews articles from which the excerpts have been extracted exhibita coherent structure. This observation to a certain degree can begeneralized to the temporal dimension of the articles. With thisassumption, we state the following hypothesis:If two excerpts occur in the same document, and have higher posi-tional proximity in the document, then they most likely discuss thesame event and time period.Unlike the two prior similarities for calculating the positional sim-ilarity, we compute the absolute distance between the sentence-positions of two excerpts, and apply an exponential decay functionover it as shown by Tao et al. [33]. Formally this is,

δpos ij “#logpa ` expp´Distpεi, εjqqq ifεi, εj P d

0 otherwise(8)

where a is set to 0.3 [33], and Distpq function returns the differ-ence in positions as

maxppospεiq, pospεjq ´ minpmaxppospεiq, pospεjqq .Conceptual similarity δcep ij between two excerpts can be com-puted as the noun phrase overlap between them. Intuitively, nounphrase extraction can be considered as a soft form of entity recog-nition. Higher similarity between excerpt-noun-phrase models in-dicates a stronger relationship. We state the following hypothesis:If two excerpts contain similar noun phrases, then they most likelydiscuss the same event and time period.To estimate δcep ij , we first run it through an open source part-of-speech based noun phrase extractor. Then for an excerpt εi, we es-timate its excerpt-noun-phrase model Ei np according to Equation 1by simply treating each noun phrase as a distinct term. Finally, con-ceptual similarity between any two excerpts εi and εj is computedas Jensen-Shannon divergence between their excerpt-noun-phrasemodels. Formally, that is,

δcep ij “ ´JSDpEi np||Ej npq . (9)

Algorithm 1 Temporal Distribution Propagation

Construct set R from input documentsInitialize the mint and maxt as earliest and latest day in R, receptivelyfor excerpt εi P R do

if |εi time| ą 0 thenEstimate Ei time (in Section 3.2)Yl Ð Yl Y Ei time

elseInitialize P pτ |Ei timeq Ð 1

|maxt´mint|Yu Ð Yu Y Ei time

end ifend forGenerate an event graph G (in Section 3.1)Compute the affinity matrix W (in Section 3.3)Compute diagonal degree matrix Dii “ ř

j wij

Initialize Y p0q Ð pYl, Yuqwhile not convergence to Y p8q do

Y pt`1q Ð D´1 ¨ W ¨ Y t

Ypt`1ql Ð Y

ptql

end whileFinal Ei time for εi is then obtained from Y p8q

Contextual similarity δctx ij becomes an important indicator ofthe relationship between two excerpts if their textual description issparse. It may also indicate if two events are part of a commonlarger event and should have happened in a similar time period.Considering this idea, we state the following hypothesis:If two excerpts have higher contextual similarity, then most likelythey are a part of the same event and time periodIn order to estimate δctx ij , we leverage the lead (first) paragraphsof the source news articles of εi and εj . For each excerpt, we firstestimate an excerpt-context model Ectx from the lead paragraph oftheir source news articles. This is done similar to Equation 1. Withthis the contextual similarity can simply by defined as the Jensen-Shannon divergence between the excerpt-context models as,

δctx ij “ ´JSDpEi ctx||Ej ctxq . (10)

3.4 Distribution PropagationOur distribution propagation algorithm is based on label propa-

gation as proposed by Zhu et al. [38]. Intuitively, we leverage theidea that if two excerpts have a strong inter-excerpt relationshipsbetween them, then they may refer to the same event, and hencehave a similar temporal scope.

Pseudo-code of our method is illustrated in Algorithm 1. As thefirst step, we extract a set of excerpts R from a given set of inputdocuments. We then extract the earliest and latest time unit at afixed time granularity from R. This fixes the total scope of ourtime domain T ˆ T . In the next step, we estimate the excerpt-timemodel for each excerpt that comes with temporal expression in theseed set Rseed. We add these excerpts to the labeled class Yl. Forthe rest of the excerpts, we assume that the generative probabil-ity of any time unit τ is uniform, and we add these to the unla-beled class Yu. Given all excerpts in R, we then construct an eventgraph G with excerpts as nodes. The weight matrix W is computedthat models the relationship between every excerpt-pair and treatsthem as weighted edges. The algorithm then performs the follow-ing two steps until convergence: first, all excerpts propagate theirtime models. Second, instead of letting the generative probabilitiesin excerpt-time models in Yl getting readjusted, we reinitialize theiroriginal distributions. Finally, we retrieve the excerpt-time modelsof the unlabeled excerpts from Y p8q. With the weight matrix Wrow normalized, the algorithm is guaranteed to converge as provenby Zhu et al. [38].

4. EXPERIMENTSIn this section, we describe the details of the conducted experi-

ments. We make all our experimental data publicly available2.

4.1 SetupWe note that there is no ready-to-use ground truth for our task.

In order to generate a test set of excerpts, we adopt a query-drivenmethodology where we randomly select a set of Wikipedia eventsand treat them as a user query. We then retrieve top-K documentsrelevant to an event and treat them as a set of pseudo-relevant ex-cerpts. This methodology has two effects. First, the search spacefor our distribution propagation algorithm is reduced. Second, thismethod can be considered as filtering out noisy temporal expres-sions. However, it is worthwhile to highlight that our method is notdependent on this filtering step and can be generalized to any set ofexcerpts. This method, however enables us to systematically evalu-ate the effectiveness of our method as compared to other baselines.

Test Collections. We perform experiments on the English Giga-word corpus with about 9 million news articles published between1991 and 2010. We process the queries in our test set with a stan-dard query-likelihood document retrieval model. Top-10 retrieveddocuments with an average of 150 excerpts are considered pseudo-relevant and input into our methods.

Test Queries are generated from the timeline of modern history3 inWikipedia that enlists the most prominent news events in the 20thand the 21st centuries. We randomly sample 100 events that tookplace between 1987 and 2007, and treat them as test queries. Eachquery comes with a short textual description and a time intervalindicating the occurrence of the event. For the experiments, weignore the time interval, and only leverage the textual expression asa keyword query to retrieve the pseudo-relevant documents.

Ground Truth. We rely on the empirical temporal expressionsthat originally occur in the excerpts to evaluate our methods. In ourevaluation methodology, we perform leave-one-out cross validationby randomly selecting excerpts that come with temporal expres-sions into ground truth. The evaluation measure is computed basedon how well the estimated time model for an excerpt describes theempirical model that is estimated from the original expressions intextual description of the excerpt. For a use-case experiment, wemake use of the time interval associated with each query as groundtruth. We describe the use-case experiment in Section 5.

Implementation. All our methods are implemented in Java. Forthe temporal annotation, and noun-phrase extraction we use Stan-ford SUTime toolkit [8]. We additionally use the Weka toolkit4 toimplement the cross-validation framework for the evaluation.

4.2 Evaluation Method, Goals and MeasuresExcerpt-time model estimated by our distribution propagation

method can simply be understood as a two-dimensional probabilitydistribution that captures the true temporal scope of the excerpt. Inthis distribution, the time units with high probability represent thetemporal focus of the event in the excerpt. To evaluate the qualityof the time models estimated for the excerpts, we propose the fol-lowing steps: 1) From the set of excerpts retrieved for each query,we first distinguish those that originally come with an expression.As described in Section 3, these are then considered as an initialseed set for the distribution propagation. 2) We then perform leave-one-out fold cross-validation by randomly selecting one of these

2http://resources.mpi-inf.mpg.de/d5/excerptTime/3https://en.wikipedia.org/wiki/Timeline_of_modern_history4http://www.cs.waikato.ac.nz/ml/weka/

excerpts and ignoring temporal expressions occurring in them. Thisis considered as set Rtest. For each fold, we run our algorithm inSection 3 by considering the rest of the excerpts as seed set Rseed.3) We truncate and re-normalize the estimated model based on thescope of the empirical model to make them comparable, and thencompute the evaluation measures. Final scores are reported by firstaveraging across the folds for a single query and then taking themean across all the queries in the test set.

In addition to evaluating the excerpt-time model quality, we alsoinvestigate the effects of varying the time granularity for the mod-eling. Thus, we perform experiments at three fixed time granulari-ties: day, month, and year. For this, we fix the granularity of timeand represent the original expression in that granularity. The finestgranularity in our experiments is the day level. For experimentsat the month granularity, we additionally perform a preprocessingstep where we relax an expression originally in the day granular-ity into its month. For example, “February 2, 2016” is relaxed to“February 2016”. Similarly, for the experiments at year granular-ity, we relax the original expression occurring at the day and themonth granularity into its year. For example, “February 2, 2016”is relaxed to “2016”.

In our experiments we aim to evaluate: 1) the quality of esti-mated excerpt-time models; 2) the importance of the inter-excerptrelations; 3) the quality of excerpt-time models at different time-granularities. In order to achieve our evaluation goals, we definethe following measures to compare our methods:

Model Quality MQ. We compute how close an excerpt-time modelEtime estimated after propagation is to the empirical model εtime.We define model quality as the KL-divergence between the two.Formally,

MQ “ 1

|Rtest|ÿ

εPRtest

´KLDpεtime||Etimeq . (11)

Intuitively, a method estimating Etime more accurately should havea smaller divergence to εtime and hence higher MQ. As a relativemeasure, the methods can be ranked according to the model quality.

Generative Power GP . With the assumption that the empiricaltime part of an excerpt is generated from the excerpt-time model,we define the generative power as the likelihood of generating theoriginal time intervals in the time part εtime from the estimatedtime model Etime after propagation. To compute GP, we adoptthe approach to estimate the generative probability of time inter-vals proposed by Berberich et al. [5], and compute the generativeprobability as,

GP “ÿ

εPRtest

ˆ1

|εtime|ÿ

tPεtime

P pt|Etimeq˙

. (12)

Intuitively, the better the estimate of Etime, the higher the likeli-hood of generating the empirical time part εtime.

Precision P . We test how well our estimated model can predict theempirical time-part of an excerpt. For this, we note that temporalexpressions associated to excerpts in the Rtest come at a certaingranularity, i.e., year, month or day. For each excerpt, we generate aranked list Rt of temporal intervals at a fixed granularity, where theranking is based on their generative probabilities from the estimatedexcerpt-time models. Finally, we define a notion of relevance forthe time intervals and compute time-precision indicating the qualityof the estimated model.

A generated interval is considered relevant if it overlaps with theempirical time-part of the excerpt. Formally, we define the binaryrelevance Relpq between an empirical time interval t and an esti-

mated time interval t̂ as,

Relp t, t̂ q “#1 if minp te, t̂e q ´ maxp tb, t̂b q ą 0

0 otherwise(13)

where tb and te are begin and end time points of t. This formulationis also used in [15]. Using this notion for relevance we formallydefine precision score for each excerpt as,

P “ 1

|Rt|ÿ

tPεtime

ÿt̂PRt

Relp t, t̂ q (14)

where |Rt| is the total number of time units in Rt. Using the samenotion for relevance, we additionally compute measures like recall,mean average precision MAP , and normalized discounted cumu-lative gain NDCG scores at 5 and 10 cut off levels which are stan-dard for information retrieval tasks.

4.3 MethodsWe categorize our methods into three major types: local time

(LT) based, nearest-neighbor (NN) based, and distribution propaga-tion (DP) based methods. The LT methods take into account onlythe information in the source documents to estimate excerpt-timemodels. We consider two methods that use the publication date(pd), and the surrounding temporal expressions (S) as describedlater. On the other hand, both NN and DP methods estimate excerpt-time models by leveraging the event graph generated as describedin Section 3.2. We define several variants of the methods that con-sider different combinations of the inter-excerpt relations, i.e., text(T), position (P), conceptual (N), and contextual (X) similarities asindicated by their suffixes. Finally, we compare a state-of-the-artmethod proposed by Jatowt et al.[15] that uses term-time associa-tions to predict focus time period of an excerpt. Next, we describethe different methods that we compare in detail.

LT-pd method assumes that each excerpt taken from a news articledescribes an event from the time period indicated by its publicationdate as motivated by several works [3, 6, 23, 29, 30].

LT-pdS method in addition to the publication date takes into con-sideration the surrounding temporal expressions of a given excerptin the source document. Intuitively, temporal expressions denotetemporal contextual change. Thus, the closest mentioned tempo-ral expression prior to a given excerpt, and the closest subsequentmentioned expression can indicate its temporal scope. This methodhowever assumes that all excerpts in a document are temporally co-herent. This method is motivated from [9, 14].

NN-T method estimates excerpt-time model from time models oftextually similar excerpts. This can be understood as a two-stepmethod where in the first step, for a given excerpt, we compute itstextual similarity (as described in Section 3.3) to the other excerpts.Then in the second step, we estimate the excerpt-time model by in-terpolating empirical models of excerpts weighted by their textualsimilarity. Similar methods have been used in the past for estimat-ing query-time models. The two-step method is simply realizedby iterating once in our distribution propagation algorithm (Algo-rithm 1) on an event graph that models on the textual relationshipbetween the excerpts. This is motivated from the pseudo-relevancebased method presented in [26, 28].

NN-TPNX method is analogous to the simpler NN-T however, asan extended method, it leverages all the relationship described inSection 3.3 between the excerpts.

DP-TPNX method implements the distribution algorithm describedin Section 1 by leveraging all the relations described in Section 3.3to generate the event graph. We additionally compare the DP-T,

DP-P, DP-N, DP-X, DP-TPN, DP-PNX, and DP-TNX variants ofthis method that leverage different combinations of inter-excerptrelations as indicated by their suffixes.

ADJ refers to the method proposed by Jatowt et al. [15] for esti-mating the temporal focus of documents. As a baseline, we extendtheir approach to estimate focus time for excerpts from news ar-ticles. Briefly, this method first generates an undirected weightedgraph GpV,Eq where the V denotes the unique words, and E de-notes word relationships. This graph is then used to estimate word-time association scores. Using these scores, a temporal weight isestimated for each word which is then used to estimate the focustime of each excerpt. From several variants of method proposed byJatowt et al., we select their best performing method on Web datathe uses the Adirpw, tq , ωtemK

w , and STF pε, tq. We refer to theirprior work [15] for a full description of the method. We first usetheir method to predict the focus time interval of an excerpt in theyear, month, and the day time granularities. We then estimate anexcerpt-time model with the predicted time interval as described inSection 3.2.

Rand method randomly sets the edge weights in an event graph.This method highlights the quality of the temporal expressions inthe seed set.

Parameters. We set the following parameters for our methods. InEquation 1, we set λ “ 0.85 and β “ 0.5 as motivated by [28]. InEquation 4, σ is set to the average inter-excerpt relation weight es-timated for each query. Finally, for smoothing the estimated weightmatrix in Equation 6, we set γ “ 0.0005 as motivated by [38]. Inall the DP methods, we set the maximum iteration to 15.

4.4 ResultsWe compare the different methods and report the results in terms

of the various measures introduced earlier.

Overall Results from all our methods are shown in Table 3. Wefind the distribution propagation method DP-TPNX proves to bemost effective method for estimating time models for excerpts acrossall measures. The difference in overall result quality of all the meth-ods, in terms of GP, are found to be statistically significant with stu-dent’s t test at α ă 0.001. However, the difference between NN-Tand NN-TPNX is found to be insignificant.

Firstly, we find that the ADJ proves to be the weakest method toestimate temporal models for excerpts across all granularities. Thesimplest LT-pd method is the second weakest at the day and yeargranularities across all metrics. However, it performs significantlybetter than the ADJ method at all three granularities. Leveragingthe closest prior and subsequent temporal expressions around anexcerpt along with the publication date for estimation of its timemodel as the LT-pdS method shows significant improvement overthe simpler LT-pd method. We observe a gain of 10% over LT-pdSin GP at the day granularity. We also observe a similar improve-ment across other metrics. The nearest-neighbor methods NN-Tand NN-TPNX perform significantly better than the LT methodsat the day and year granularities. The more complex NN-TPNXshows marginal improvement over the NN-T method at all granu-larities in terms of GP. Finally, the distribution propagation methodDP-TPNX outperforms other methods across all the time granu-larities and measures. We find more than 35% improvement fromthe NN methods in the day granularity in terms of GP. The methodalso shows similar significant improvements at the month, and yeargranularities over the other methods. The Rand method gives theworst results in the month granularity as compared to the year orday. This is indicative of sparsity of annotations in the input set atthis granularity and may not be generalizable.

Method MQ GP P@5 P@10 MAP Recall ndcg@5 ndcg@10D

AY

DP-TPNX -2.227 0.754 0.828 0.736 0.836 0.902 0.847 0.839NN-TPNX -14.082 0.395 0.408 0.356 0.182 0.404 0.406 0.377NN-T -14.120 0.389 0.407 0.355 0.182 0.404 0.404 0.376Rand -14.212 0.388 0.407 0.354 0.182 0.404 0.405 0.375LT-pdS -14.614 0.471 0.472 0.449 0.286 0.332 0.467 0.429LT-pd -17.807 0.368 0.368 0.368 0.185 0.192 0.362 0.310ADJ -23.120 0.033 0.033 0.033 0.016 0.017 0.032 0.028

MO

NT

H


YE

AR


Table 3: Comparison of results from all methods over 100 queries.Figure 2: The effect of increasing the size of the eventgraph on DP-TPNX (blue) and NN-T (yellow) methods.

Ablation Test. To get an insight into the importance of differentinter-excerpt relations, we compare several variants of our distri-bution propagation method as shown in Table 4. We find that themost effective inter-excerpt relation proves to be text similarity asshown by the DP-T method. The next best relation is found to beconceptual similarity as indicated by the DP-N method. As a differ-ence to the simple full-text, the conceptual similarity is estimatedby regarding only the noun phrases in the excerpt descriptions. In-tuitively, the DP-N method suffers due to increase in sparsity ascompared to considering all terms in case of the DP-T method. Po-sition similarity seems to negatively affect the quality. Intuitively,this relation makes strong coherence assumption on the structureof news articles which may not hold for all news articles. More-over, this relation can only be estimated between excerpts from asingle document. DP-X proves to be the worst of the inter-excerptrelations in terms of MQ at all granularities. However, a combi-nation of contextual, conceptual, and text similarity to model theinter-excerpt relations as leveraged by DP-TNX is observed to bemore effective for estimating time models. Finally, we observe thatthe DP-TPNX is the top fifth method.

Varying Graph Size. Intuitively, a larger number of strongly re-lated excerpts with temporal expressions for a given excerpt in theevent graph should lead to better estimation of its time model. Wetest this by increasing the event graph size by considering more in-put pseudo-relevant documents. Figure 2, illustrates the effects ofincreasing the event graph size on the average quality of excerpt-time model estimated by the DP-TPNX and NN-T methods in termsof generative power, mean average precision, and precision at 5 and10 cut-off levels. We find that both methods show an improvementin the quality as more data is considered for the excerpt-time modelestimation. However, our method consistently outperforms the NN-T method even as the event graph size grows. Moreover, we findthat both the methods show large improvements by increasing thegraph size up to 375 nodes. Further increase in the graph size showssmaller improvements in the average result quality. This is becauseconsidering lower ranked documents, we add more irrelevant ex-cerpts to the event graph for a given query. They often do not posestrong relation to the other excerpts, and hence have a negative im-pact on the overall result quality.

Gain/Loss Analysis. To get insights into the individual queries forwhich our DP-TPNX shows the highest gain and worst loss in termsof P@5 against the best baseline at all three granularities.

At the day granularity, DP-TPNX shows the highest gain of `0.63with a score of 0.88 over the LT-pd method which proves to be thebest performing baseline with 0.25, for the following query:

q1: Rwandan President Juvénal Habyarimana and BurundiPresident Cyprien Ntaryamira die when a missile shoots downtheir jet near Kigali, Rwanda. This is taken as a pretext to beginthe Rwandan Genocide.

It suffers the worst loss of ´0.05, with a score equal to 0.88 whilethe best performing method for this query with a 0.93 score is LT-pd method, for the following query:

q2: Ten-Day War: Fighting breaks out when the Yugoslav Peo-ple’s Army attacks secessionists in Slovenia.

The NN-T method is the next best baseline with score of 0.42.At the month granularity, the query for which the DP-TPNX

shows the largest gain of `0.43 with a score of 0.70 is as follows:

q3: Troops of Laurent Kabila march into Kinshasa. Zaire isofficially renamed Democratic Republic of the Congo.

The best baseline for this query is the LT-pdS with 0.26 and the nextbest method is NN-T with 0.21 P@5 scores. DP-TPNX suffers theworst loss of ´0.23 with a score of 0.11 for the following query:

q4: Cold War: The leaders of the Yemen Arab Republic and thePeople’s Democratic Republic of Yemen announce the unifica-tion of their countries as the Republic of Yemen.

We find that for this query, the LT-pdS method that gets a score of0.33 and become the best method.

At the year granularity, DP-TPNX gets the most gain `0.17 witha score of 0.83 for the following query:

q5: Several explosions at a military dump in Lagos, Nigeria killmore than 1,000.

We find that all the other baselines receive an equal score of 0.67for this query. Our method suffers the worst loss of ´0.89 withscore 0 for the following query:

q6: Dissolution of Czechoslovakia: The Czech Republic andSlovakia separate in the so-called Velvet Divorce.

For this query, we find that the LT-pdS is the best method with ascore of 0.91 and LT-pdS is the second best with a score of 0.89.

DAY MONTH YEARMQ GP MQ GP MQ GP

DP-T -1.829 0.794 -2.082 0.488 -1.463 0.749DP-TNX -2.019 0.775 -2.128 0.483 -1.450 0.751DP-N -2.058 0.780 -2.142 0.485 -1.463 0.749DP-TPN -2.147 0.759 -2.194 0.478 -1.452 0.751DP-TPNX -2.227 0.754 -2.278 0.477 -1.452 0.751DP-P -2.413 0.742 -2.292 0.473 -1.458 0.749DP-PNX -2.428 0.744 -2.267 0.474 -1.459 0.749DP-X -2.578 0.766 -2.393 0.483 -1.511 0.744

Table 4: Ablation test in Model Quality and Generative Power

Method P@1 P@5 P@10 MAP ndcg@5 ndcg@10

YE

AR DP-TPNX 0.46 0.13 0.07 0.52 0.54 0.55

NN-T 0.36 0.13 0.08 0.46 0.49 0.51ADJ 0.18 0.09 0.06 0.30 0.32 0.36

MO

NT

H DP-TPNX 0.28 0.09 0.05 0.31 0.31 0.32NN-T 0.22 0.08 0.05 0.27 0.28 0.29ADJ 0.09 0.05 0.04 0.16 0.16 0.19

DA

Y

DP-TPNX 0.19 0.16 0.12 0.19 0.17 0.21NN-T 0.10 0.10 0.10 0.15 0.10 0.15ADJ 0.02 0.01 0.01 0.01 0.01 0.01

Table 5: Results of estimating the focus time of 100 queries

Discussion. First, we discuss the performance of the different meth-ods. The ADJ method turns out to be the least successful methodfor the time model estimation. The single temporal expression pre-dicted by this method often does not overlap with the expressionsin the ground truth. Since our evaluation metrics rely on the over-lap, this method receives very low scores at all granularities. Thepublication date-based LT-pd method is observed to be less effec-tive for estimating excerpt-time models. Similar to the ADJ, thismethod also relies on a single expression to estimate time modelsand suffers from sparsity. Further, the assumption that the newsarticles present information only on events occurring around itspublication is repudiated. The LT-pdS method which is a simpleextension to the LT-pd estimates much better excerpt-time model.The Rand method benefits over the LT methods from more numberof temporal expressions randomly selected. Among the methodsthat leverage the event graph, the NN-T is motivated from the pop-ular pseudo-relevance feedback models [26] for estimating excerptmodels. As expected, this method estimates more accurate timemodels as compared to the simpler publication date-based meth-ods in terms of MQ. Finally, the distribution propagation improvesthe estimates of the excerpt-time model through multiple iterations(ideally until convergence).

Next, we discuss the different granularities. At the day gran-ularity, larger interval are represented as a set of days with uni-form probability (as described in Section 4.2). Similarly, at year alltemporal expressions are relaxed. However, in the month granular-ity, year-level expression are expanded while the days-level expres-sions are relaxed. Due to this mixed effect, we find that on averagequality of the time models across all the methods is reduced. Wefind that the results are more pronounced at the day as comparedto the year granularity. At the year, due to the relaxation we findthat it becomes easier to estimate time models as indicated by allthe baselines getting much better scores.

Finally, from the gain/loss analysis, we find that our method crit-ically depends on the redundancy in the input set. The queries q2,q4, and q6 describe events with large ramifications and multiple as-pects. DP-TPNX receives low scores as result of propagating fromexcerpts from the input set that describe different aspects.

5. USE-CASE EXPERIMENTAs described in Section 4.1, we randomly sample 100 Wikipedia

events that come with a textual description and a time interval indi-cating its occurrence period. So far, we ignore the time part of thequery and leverage only the text part as a keyword query to retrievea set of documents which are then regarded as pseudo-relevant andinput to the system. In this section, as an application-oriented use-case, we slightly shift the focus to the following problem: for agiven event as a user query that comes with a textual description,estimate its occurrence time period. As input, our method takes an

event query; as output a time interval in the day, month, and yeargranularity is returned indicating its occurrence period.

An event query q is described with two parts: text qtext as a bagof words derived from the textual description of the query; and timeqtime as a single temporal expression that is explicitly given. Thesingle temporal expression in the query is then represented in ourtime domain as described in Section 4.1.

We design a two-stage approach. In Stage 1, leveraging qtextwe make use of a standard KL-divergence based retrieval modelto retrieve a set of top-100 documents. We then generate a set ofexcerpts ε from these documents be fixing an excerpt to a singlesentence. In Stage 2, we then generate an event graph Geventpε, rqas described in Section 4.1. However, as a slight variation to theprevious method, we inject the qtext as a special node. Next, werun our distribution propagation algorithm as described in Section 3to estimate a query-time model Qtime.

We compare our distribution propagation method DP-TNX thattakes into consideration the textual (T), conceptual (N), and con-textual (X) similarity. Positional similarity becomes inapplicablein this setting. Further, we compare the query text against the leadparagraph of the source document for an excerpt to estimate theircontextual similarity. As baselines, we consider the nearest neigh-bor NN-T method, and the method proposed by Jatowt et al. [15].

Analogous to an excerpt-time model, a query-time model Qtime

can be understood as a probability distribution over the time unitsthat captures the temporal scope of the event in the query. Intu-itively, the units that occur with high probability indicate the salienttime period associated with the event. From an estimated query-time model, we generate a rank list of time units with decreasingprobabilities in Qtime. However, unlike previous experiments, wedo not truncate but consider the full estimated query-time modelwhere the temporal scope of the query is set to [mint, maxt]. Asground truth, we make leverage the time part qtime in the origi-nal query. We compare the methods at three time granularity, year,month, and day using the measures define in Section 4.2.

Table 5 shows the results of the experiment. The results arefound to be statistically significant with students t test at α “ 0.05.The DP-TNX method is able to best estimate the occurrence timeperiod at all time granularities across all measures. The result qual-ity is lower than compared to the main experiments because ofmainly two reasons. Firstly, we note that most queries come witha single specific temporal expression usually at the day granularitywhich is considered as ground truth. This is in contrast to the pre-vious setting where pseudo-relevant excerpts may come with mul-tiple expressions at different granularities. Secondly, we find thatthe accuracy of query focus time is strongly dependent on the qual-ity of the input pseudo-relevant excerpts. Topical drifts in the inputexcerpts will result in a query focus time that does not match withthe ground truth. This makes the use-case problem even harder.

6. CONCLUSION AND FUTURE WORKWe proposed a novel problem to estimate time models for ex-

cerpts that are extracted from news articles to capture the true tem-poral scope of events described by the excerpts. To address thisproblem, we proposed a distribution framework that extends a pop-ular semi-supervised label propagation algorithm to propagate timemodels from excerpts that come with temporal expressions to thosethat are strongly related but missing annotations. In our exper-iments, we found that our method estimates most accurate timemodels as compared to several baselines. We compare the meth-ods in terms of existing precision, recall, NDCG measures, andtwo new Model Quality and Generative Power measures.

One interesting perspective on the time models can come fromthe direction of generating temporal embeddings for excerpts (sen-tences, paragraphs, or documents). Recently, leveraging word em-beddings [22] have shown significant improvements in various text-based applications. The time model estimation presented may beconsidered as the first step of embedding excerpts into a space thatmodels the time dimension of information content to capture thetemporal semantics. As a future direction, we plan to design ad-vanced methods to firstly, identify better source of temporal infor-mation than relying on blind feedback, and secondly output denserepresentation of excerpts to better capture its temporal semantics.

References[1] O. Alonso, J. Strötgen, R. A. Baeza-Yates, and M. Gertz.

Temporal information retrieval: Challenges andopportunities. TWAW 2011.

[2] I. Arikan, S. Bedathur, and K. Berberich. Time will tell:Leveraging temporal expressions in ir. In In WSDM 2009.

[3] R. Barzilay, N. Elhadad, and K. R. McKeown. Inferringstrategies for sentence ordering in multidocument newssummarization. Journal of Artificial Intelligence Research,17(1), 2002.

[4] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagationand quadratic criterion. Semi-supervised learning, 2006.

[5] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. Alanguage modeling approach for temporal information needs.In ECIR 2010.

[6] D. Bollegala, N. Okazaki, and M. Ishizuka. A preferencelearning approach to sentence ordering for multi-documentsummarization. Information Sciences, 2012.

[7] P. Bramsen, P. Deshpande, Y. K. Lee, and R. Barzilay.Inducing temporal graphs. In EMNLP 2006.

[8] A. X. Chang and C. D. Manning. Sutime: A library forrecognizing and normalizing time expressions. In LREC2012.

[9] R. Chasin, D. Woodward, J. Witmer, and J. Kalita.Extracting and displaying temporal and geospatial entitiesfrom articles on historical events. The Computer Journal,page bxt112, 2013.

[10] C. R. Chowdary and P. Sreenivasa Kumar. Sentence orderingfor coherent multi-document summary generation. SharingData, Information and Knowledge), 2008.

[11] J. Donghong and N. Yu. Sentence Ordering based on ClusterAdjacency in Multi-Document Summarization. In IJCNLP,2008.

[12] M. Efron, J. Lin, J. He, and A. De Vries. Temporal feedbackfor tweet search with non-parametric density estimation. InSIGIR 2014.

[13] A. Feng and J. Allan. Finding and linking incidents in news.In CIKM 2007.

[14] J. Gung and J. Kalita. Summarization of historical articlesusing temporal event clustering. In ACL 2012.

[15] A. Jatowt, C.-M. Au Yeung, and K. Tanaka. Estimatingdocument focus time. In CIKM 2013.

[16] P. D. Ji and S. Pulman. Sentence ordering withmanifold-based classification in multi-documentsummarization. In EMNLP 2006.

[17] R. Jones and F. Diaz. Temporal profiles of queries. ACMTransactions on Information Systems (TOIS), 25(3), 2007.

[18] M. Karasuyama and H. Mamitsuka. Manifold-basedsimilarity adaptation for label propagation. In NIPS 2013.

[19] D. Kotsakos, T. Lappas, D. Kotzias, D. Gunopulos,N. Kanhabua, and K. Nørvåg. A burstiness-aware approachfor document dating. In SIGIR, 2014.

[20] E. Kuzey, V. Setty, J. Strötgen, and G. Weikum. As time goesby: comprehensive tagging of textual phrases with temporalscopes. In WWW 2016.

[21] E. Laparra, I. Aldabe, and G. Rigau. Document leveltime-anchoring for timeline extraction. Volume 2: ShortPapers, 2015.

[22] Q. V. Le, and T. Mikolov. Distributed Representations ofSentences and Documents. In ICML 2014.

[23] X. Li and W. B. Croft. Time-based language models. InCIKM 2003.

[24] I. Mani, B. Schiffman, and J. Zhang. Inferring temporalordering of events in news. In NAACL 2003.

[25] I. Mani, M. Verhagen, B. Wellner, C. M. Lee, andJ. Pustejovsky. Machine learning of temporal relations. InACL 2006.

[26] A. Mishra and K. Berberich. Event digest: A holistic viewon past events. In SIGIR 2016.

[27] A. Mishra and K. Berberich. Leveraging semanticannotations to link wikipedia and news archives. In ECIR2016.

[28] A. Mishra, D. Milchevski, and K. Berberich. Linkingwikipedia events to past news. In TAIA 2014.

[29] N. Okazaki, Y. Matsuo, and M. Ishizuka. Improvingchronological ordering of sentences extracted from multiplenewspaper articles. In TALIP 2005.

[30] M.-H. Peetz and M. de Rijke. Cognitive temporal documentpriors. In ECIR 2013.

[31] M.-H. Peetz, E. Meij, and M. de Rijke. Using temporal burstsfor query modeling. Information Retrieval, 17(1), 2014.

[32] J. Strötgen and M. Gertz. Multilingual and cross-domaintemporal tagging. Language Resources and Evaluation,47(2), 2013.

[33] T. Tao and C. Zhai. An exploration of proximity measures ininformation retrieval. In SIGIR, 2007.

[34] N. UzZaman, H. Llorens, J. Allen, L. Derczynski,M. Verhagen, and J. Pustejovsky. Tempeval-3: Evaluatingevents, time expressions, and temporal relations. arXivpreprint arXiv:1206.5333, 2012.

[35] M. Verhagen and J. Pustejovsky. The tarsqi toolkit. In LREC2012.

[36] Y. Zhao and C. Hauff. Sub-document timestamping of webdocuments. In SIGIR 2015.

[37] X. Zhu and Z. Ghahramani. Learning from labeled andunlabeled data with label propagation. Technical report,Citeseer.

[38] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervisedlearning using gaussian fields and harmonic functions. InICML 2003.