
T

Table Extraction

▸ Table Extraction from Text Documents

Table Extraction from Text Documents

James Hodson
AI for Good Foundation, New York, NY, USA

Abstract

Tables appear in text documents in almost every form imaginable, from simple lists to nested, hierarchical, and multidimensional layouts. They are primarily designed for human consumption and therefore can require a wide variety of visual cues and interpretive capabilities to be fully understood. This chapter deals with the challenges machines face when attempting to process and understand tables, along with state-of-the-art methods and performance on this task.

Synonyms

Table extraction; Table parsing; Table understanding

Definition

The objective of table extraction is to convert human-focused notation into a logical, machine-readable, and machine-understandable form. This task is closely related to, and could be viewed as a subproblem of, document structure extraction. It is generally considered a higher-level natural language processing problem, requiring a pipeline of capabilities to address.

Motivation and Background

Tables appear in text documents in almost every form imaginable, from simple lists to nested, hierarchical, and multidimensional layouts. They are primarily designed for human consumption and therefore can require a wide variety of visual cues and interpretive capabilities to be fully understood. In fact, the assumption of human consumption allows for a breadth of content presentation that is practically limitless. Often, critical information that is relevant to the interpretation is assumed, provided in shorthand notation, or inferred from other aspects of the content or layout.

Given that the problem of table extraction is motivated by the presence of electronic documents, it has only been formally studied since the early 1990s, as the prevalence of computer-based document storage, editing, and retrieval increased (Laurentini and Viada 1992; Guthrie et al. 1993). With hundreds of document formats,



layout preferences, and established customs for data interchange, the problem has only become worse at web scale, with very few document originators choosing machine-readable syntax over visual layouts (i.e., drawing).

This article explores the genesis of the problem domain, how to formalize and break down the various tasks involved in building a table extraction solution, and the methodologies generally used.

There have been several attempts at formalizing the table model and notation. Some of these were designed independently of automatic table extraction research (Association of American Publishers 1986) and pertain to the best practices for tabular data design. Computationally driven table models generally refer to the widely used Augmented Wang Notation (Wang 1996), which specifies a hierarchical schema for describing types, classes, and relations among cells. Common table models are necessary for the interoperation of different stages of the extraction pipeline, as well as the common evaluation of different approaches with the same gold standard reference data (Govindaraju et al. 2013). As in most machine learning pipelines, it is often convenient to isolate component parts for algorithmic development and testing.

Approaches to the evaluation of table extraction techniques vary widely and can be looked at from multiple perspectives: document level, table level, access level, and cell level. Each stage of the extraction pipeline can be evaluated separately, or one can look at overall goal-achievement measures. Long (2010) adopts a multilevel structural evaluation approach which can be particularly informative.

Recent work is part of a sparser literature, with consistently decreasing focus since the early 2000s. In spite of this, table extraction is not a problem that has any broadly adopted solution. It is a fragmented environment, often viewed as a practitioner's problem forming part of larger systems. However, certain industries (e.g., finance) and the rise of web-scale information extraction have led to a renewed focus on these technologies in both research and applied settings (Mitchell et al. 2015).

Structure of the Learning System

We will consider each of the logical steps that form part of a complete table extraction system. Hurst (2000) and Fang et al. (2012) both propose pipelines that allow for the evaluation of discrete capabilities. Starting from a raw text document, each subsequent pass adds more and more structure, getting us closer to the final goal – a disambiguated relational table object. Approaches at each stage can consider not only textual features but also layout and other visual cues. In fact, it is often the case that table extraction techniques on text documents will use a variety of methodologies from the computer vision community.

Table Detection
Given a text document, the objective is to identify whether or not it contains a table object. It should be possible to signal when a document contains multiple such distinct objects and their rough contiguous locations (Kornfeld and Wattecamps 1998). Often, this step is combined with the next (boundary detection) to perform joint detection and delineation of tabular areas.

In the case where detection is performed in isolation, it may be viewed as a binary classification, sequence labeling, or clustering task over the document. Lopresti et al. (2000) approached this problem from a text density/clustering perspective over single-columned ASCII text documents, though more recent efforts in industrial applications tend to benefit from cascading binary (SVM) classifiers or random forest approaches.
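As a concrete illustration of the classification view of detection, the sketch below treats each line of a plain-text document as an instance and labels it as table-like or not with a random forest. The layout features, the scikit-learn dependency, and the labeled training lines are illustrative assumptions rather than a method from the works cited above.

```python
# Minimal sketch: per-line table detection as binary classification
# (assumes scikit-learn; features and training labels are illustrative only).
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def line_features(line: str) -> list:
    """Simple layout cues that tend to separate table rows from prose."""
    tokens = line.split()
    return [
        len(line),                                          # raw line length
        len(tokens),                                        # token count
        len(re.findall(r"\s{2,}", line)),                   # runs of >=2 spaces (column gaps)
        sum(t.replace(".", "").isdigit() for t in tokens),  # numeric tokens
        line.count("|") + line.count("\t"),                 # explicit delimiters
    ]

def detect_table_lines(train_lines, labels, new_lines):
    """Train on labeled lines, then flag probable table lines in a new document."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(np.array([line_features(l) for l in train_lines]), labels)
    return clf.predict(np.array([line_features(l) for l in new_lines]))
```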

Table Boundary Identification
Table boundary identification recognizes the boundaries of detected tables such that they can be isolated from the surrounding information. Laurentini and Viada (1992) make use of the Hough transform to identify connected shapes and components that represent the margins of tables. These must be separated from charts, images, and other visual components, which is the aim of the table detection step mentioned above. The identification of table boundaries can also benefit the table detection task by providing additional structural features on which


to predicate the distinction from other visual objects in a document.

Structural Inference
For each recognized table, identify the column and row structure, such that each cell can be uniquely identified. In practice, the methods applied to this task mirror those of table boundary identification. However, there are additional constraints that often make it worthwhile to consider this step separately. For example, tables are structurally constrained to maintain linear relationships among cells – rows and columns must remain broadly coherent. Furthermore, the task may be recursive, where tables contain tables or other structural items as inserts. It is important that this step provide the most accurate microstructure possible. As such, it can often be beneficial to look at measures of content coherence for merging or splitting neighbors, at the same time as optimizing overall coherence.
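For plain-text tables, one simple way to realize structural inference is to look for character positions that are whitespace in nearly every line of the detected region and treat the remaining runs as column cells. The sketch below assumes monospaced, whitespace-aligned input and an untuned threshold; it is only one possible heuristic, not a standard algorithm from the works cited here.

```python
# Minimal sketch: inferring column boundaries from whitespace alignment
# across the lines of a detected table region (plain-text tables assumed;
# the threshold is illustrative, not a tuned value from the literature).
def column_boundaries(table_lines, min_gap_ratio=0.8):
    """Columns are split wherever a character position is whitespace
    in at least `min_gap_ratio` of the lines."""
    width = max(len(l) for l in table_lines)
    padded = [l.ljust(width) for l in table_lines]
    gap = [
        sum(line[i] == " " for line in padded) / len(padded) >= min_gap_ratio
        for i in range(width)
    ]
    # Column cells live in the maximal runs of non-gap positions.
    boundaries, start = [], None
    for i, is_gap in enumerate(gap + [True]):
        if not is_gap and start is None:
            start = i
        elif is_gap and start is not None:
            boundaries.append((start, i))
            start = None
    return boundaries

def split_cells(table_lines):
    cols = column_boundaries(table_lines)
    return [[line[a:b].strip() for (a, b) in cols] for line in table_lines]
```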

Functional Classification
The logical definition of a table is that of a set of associated keys and values. Headers, or groupings of headers, represent keys, which intersect along the axes of a table. The intersections of these header cells represent the values of interest. Headers provide the information necessary to understand the type of data as well as to uniquely pinpoint the location of each value. Functional classification, therefore, identifies, for each cell, whether it represents a key or a value (Liu 2009).

Functional Interpretation
For each cell representing a value, classify its type (e.g., weight, location, distance, revenue) according to its associated headers. In addition, many tables rely on auxiliary information and interpretation, such as footnotes or implied coherence (e.g., all adjacent cells have the same property but do not explicitly define it). These additional structures need to be identified and associated with each cell. Furthermore, cell values should be fully normalized according to the available information. If a header states that all values are in $M, all numbers should take this into account.
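A minimal sketch of such unit normalization is shown below; the unit table and parsing rules are purely illustrative, and a real system would need a far richer inventory of units, footnote handling, and locale-aware number parsing.

```python
# Minimal sketch: normalizing cell values using unit cues from the header
# (e.g., "$M" meaning millions of dollars). The unit table is illustrative.
import re

UNIT_MULTIPLIERS = {"$m": 1_000_000, "$k": 1_000, "%": 0.01}

def normalize_cell(raw_value: str, header: str) -> float:
    """Parse a numeric cell and scale it according to units found in the header."""
    value = float(re.sub(r"[,$\s]", "", raw_value))
    for unit, multiplier in UNIT_MULTIPLIERS.items():
        if unit in header.lower():
            return value * multiplier
    return value

# Example: a "Revenue ($M)" column.
assert normalize_cell("1,234", "Revenue ($M)") == 1_234_000_000.0
```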

Disambiguation
In most cases, the reason for reading and extracting a table from text is to be able to work with the information held therein. Comparing values to prior years' numbers, reasoning about them, and filtering all require that the data fit into some logical representation of the domain of interest, whether implicitly or explicitly defined. Disambiguating the values allows them to be used consistently and stored uniformly for later querying (Liu 2009; Hurst 2000).

Disambiguation requires some desired final representation, whether a formal ontology or a relational database schema. Ideally the representation would cover the entire universe of interest, allowing every possible value type to be logically captured. However, it is often necessary to account for content that has not been encountered before.

Generally, disambiguation can be viewed as a supervised classification problem, whereby explicit or implicit (latent, structural) features are mapped probabilistically to available outcomes, constrained by meta-schemas such as length, primitive type, and relative position. Additionally, structural factors (number of values, etc.) can be used, within an iterative framework, to further limit the output space. That is, as more of the table has been disambiguated, fewer options remain that would be consistent with the prior results. As such, this can be viewed as a constrained optimization, where the schema is sufficiently well defined.
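The sketch below illustrates this constrained-classification view on a hypothetical schema: meta-schema constraints (primitive type, length) prune the candidate types for a column, and a scoring function standing in for a trained probabilistic classifier chooses among the survivors.

```python
# Minimal sketch of disambiguation as constrained classification.
# The schema and scoring function are hypothetical.
SCHEMA = {
    "year":    {"primitive": int,   "max_len": 4},
    "revenue": {"primitive": float, "max_len": 15},
    "country": {"primitive": str,   "max_len": 40},
}

def satisfies(value: str, constraints: dict) -> bool:
    """Meta-schema check: length and primitive type."""
    if len(value) > constraints["max_len"]:
        return False
    try:
        constraints["primitive"](value.replace(",", ""))
        return True
    except ValueError:
        return constraints["primitive"] is str

def disambiguate_column(values, score):
    """`score(values, type_name)` would come from a trained classifier."""
    candidates = [t for t, c in SCHEMA.items() if all(satisfies(v, c) for v in values)]
    return max(candidates, key=lambda t: score(values, t)) if candidates else None
```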

Cross-References

▸ Semantic Annotation of Text Using Open Semantic Resources

▸ Entity Resolution

Recommended Reading

Association of American Publishers (1986) Markup of tabular material. Technical report, Association of American Publishers, Manuscript Series

Fang J, Mitra P, Tang Z, Lee GC (2012) Table header detection and classification. In: Proceedings of AAAI, Toronto


Gobel M, Hassan T, Oro E, Orsi G (2012) A methodology for evaluating algorithms for table understanding in PDF documents. In: Proceedings of the 2012 ACM symposium on document engineering, Atlanta, pp 45–48

Govindaraju V, Zhang C, Re C (2013) Understanding tables in context using standard NLP toolkits. In: Proceedings of the ACL, Sofia

Guthrie J, Weber S, Kimmerly N (1993) Searching documents: cognitive processes and deficits in understanding graphs, tables, and illustrations. Contemp Educ Psychol 18:186–221

Hurst MF (2000) The interpretation of tables in texts. Ph.D. thesis, University of Edinburgh, Edinburgh

Kornfeld W, Wattecamps J (1998) Automatically locating, extracting and analyzing tabular data. In: SIGIR'98: Proceedings of the 21st annual international ACM SIGIR conference, Melbourne, pp 347–348

Laurentini A, Viada P (1992) Identifying and understanding tabular material in compound documents. In: Proceedings of 11th IAPR international conference on pattern recognition. Conference B: pattern recognition methodology and systems, IEEE, The Hague, vol II, pp 405–409

Liu Y (2009) Tableseer: automatic table extraction, search, and understanding. Ph.D. thesis, The Pennsylvania State University

Long V (2010) An agent-based approach to table recognition and interpretation. Ph.D. thesis, Macquarie University, Sydney

Lopresti D, Hu J, Kashi R, Wilfong G (2000) Medium-independent table detection. In: SPIE document recognition and retrieval VII, San Jose, pp 291–302

Mitchell T, Cohen W, Hruschka E, Talukdar P, Betteridge J, Carlson A, Dalvi B, Gardner M, Kisiel B, Krishnamurthy J, Lao N, Mazaitis K, Mohamed T, Nakashole N, Platanios E, Ritter A, Samadi M, Settles B, Wang R, Wijaya D, Gupta A, Chen X, Saparov A, Greaves M, Welling J (2015) Never-ending learning. In: Proceedings of the conference on artificial intelligence (AAAI)

Padmanabhan R, Jandhyala RC, Krishnamoorthy M, Nagy G, Seth S, Silversmith W (2009) Interactive conversion of large web tables. In: Proceedings of eighth international workshop on graphics recognition, GREC 2009. City University of La Rochelle, La Rochelle

Sarawagi S, Chakrabarti S (2014) Open-domain quantity queries on Web tables: annotation, response, and consensus models. In: Proceedings of ACM SIGKDD, New York

Shafait F, Smith R (2010) Table detection in heterogeneous documents. In: Proceedings of the 9th IAPR international workshop on document analysis systems, Boston, pp 65–72

Thompson M (1996) A tables manifesto. In: Proceedings of SGML Europe, Munich, pp 151–153

Wang X (1996) Tabular abstraction, editing, and formatting. Ph.D. thesis, University of Waterloo, Waterloo

Table Parsing

▸ Table Extraction from Text Documents

Table Understanding

▸ Table Extraction from Text Documents

Tagging

▸ POS Tagging

TAN

▸ Tree Augmented Naive Bayes

Taxicab Norm Distance

▸ Manhattan Distance

TD-Gammon

Definition

TD-Gammon is a world-champion strength backgammon program developed by Gerald Tesauro. Its development relied heavily on machine learning techniques, in particular on ▸ Temporal-Difference Learning. Contrary to successful game programs in domains such as chess, which can easily out-search their human opponents but still trail them in the ability of estimating the positional merits of the current board configuration, TD-Gammon was able to excel in backgammon for the same reasons that


humans play well: its grasp of the positional strengths and weaknesses was excellent. In 1998, it lost a 100-game competition against the world champion by only 8 points. Its sometimes unconventional but very solid evaluation of certain opening strategies had a strong impact on the backgammon community and was soon adopted by professional players.

Description of the Learning System

TD-Gammon is a conventional game-playing program that uses very shallow search (the first versions only searched one ply) for determining its move. Candidate moves are evaluated by a ▸ Neural Network, which is trained by TD(λ), a well-known algorithm for Temporal-Difference Learning (Tesauro 1992). This evaluation function is trained on millions of games that the program played against itself. At the end of each game, a reinforcement signal that indicates whether the game has been lost or won is passed through all moves of the game. TD-Gammon developed useful concepts in the hidden layer of its network. Tesauro (1992) shows examples for two hidden units of TD-Gammon that he interpreted as a race-oriented feature detector and an attack-oriented feature detector.

TD-Gammon clearly surpassed its predecessors, in particular the Computer Olympiad champion Neurogammon, which was trained with ▸ Preference Learning (Tesauro 1989). In fact, early versions of TD-Gammon, which only used the raw board information as features, already learned to play as well as Neurogammon, which used a sophisticated set of features. Adding more sophisticated features to the input representation further improved TD-Gammon's playing strength. Over time, not only did TD-Gammon increase the number of training games that it played against itself, but Tesauro also increased the search depth and changed the network architecture, so that TD-Gammon eventually reached world-championship strength (Tesauro 1995).

Cross-References

▸ Machine Learning and Game Playing

Recommended Reading

Tesauro G (1989) Connectionist learning of expert preferences by comparison training. In: Touretzky D (ed) Proceedings of the advances in neural information processing systems 1 (NIPS-88). Morgan Kaufmann, San Francisco, pp 99–106

Tesauro G (1992) Practical issues in temporal difference learning. Mach Learn 8:257–278. http://mlis.www.wkap.nl/mach/abstracts/absv8p257.htm

Tesauro G (1995) Temporal difference learning and TD-Gammon. Commun ACM 38(3):58–68. http://www.research.ibm.com/massdist/tdl.html

TDIDT Strategy

▸ Divide-and-Conquer Learning

Temporal Credit Assignment

▸ Credit Assignment

Temporal Data

▸ Time Series

Temporal Difference Learning

William Uther
NICTA and The University of New South Wales, Sydney, NSW, Australia

Definition

Temporal Difference Learning, also known as TD-Learning, is a method for computing the long term utility of a pattern of behavior from a


series of intermediate rewards (Sutton 1984, 1988, 1998). It uses differences between successive utility estimates as a feedback signal for learning. The Temporal Differencing approach to model-free ▸ reinforcement learning was introduced by, and is often associated with, R.S. Sutton. It has ties to both the artificial intelligence and psychological theories of reinforcement learning as well as ▸ dynamic programming and operations research from economics (Bellman 1957; Samuel 1959; Watkins 1989; Puterman 1994; Bertsekas 1996).

While TD learning can be formalised using the theory of ▸ Markov Decision Processes, in many cases it has been used more as a heuristic technique and has achieved impressive results even in situations where the formal theory does not strictly apply, e.g., Tesauro's TD-Gammon (Tesauro 1995) achieved world champion performance in the game of backgammon. These heuristic results often did not transfer to other domains, but over time the theory behind TD learning has expanded to cover large areas of reinforcement learning.

Formal Definitions
Consider an agent moving through a world in discrete time steps, $t_1, t_2, \dots$. At each time step, $t$, the agent is informed of both the current state of the world, $s_t \in S$, and its reward, or utility, for the previous time step, $r_{t-1} \in \mathbb{R}$.

As the expected long term utility of a pattern of behavior can change depending upon the state, the utility is a function of the state, $V : S \to \mathbb{R}$. $V$ is known as the value function or state-value function. The phrase "long term utility" can be formalized in multiple ways.

Undiscounted Sum of Reward
The simplest definition is that long term reward is the sum of all future rewards.

$$V(s_t) = r_t + r_{t+1} + r_{t+2} + \dots = \sum_{\delta=0}^{\infty} r_{t+\delta}$$

Unfortunately, the undiscounted sum of reward is only well defined if this sum converges. Convergence is usually achieved by the addition of a constraint that the agent's experience terminates at some, finite, point in time and all rewards after that point are zero.

Discounted Sum of Reward
The discounted utility measure discounts rewards exponentially into the future.

$$V(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{\delta=0}^{\infty} \gamma^{\delta} r_{t+\delta}, \qquad \gamma \in [0, 1]$$

Note that when $\gamma = 1$ the discounted and undiscounted regimes are identical. When $\gamma < 1$, the discounted reward scheme does not require that the agent's experience terminates at some finite time for convergence. The discount factor $\gamma$ can be interpreted as an inflation rate, a probability of failure for each time step, or simply as a mathematical trick to achieve convergence.

Average Reward
Rather than consider a sum of rewards, the average reward measure of utility estimates both the expected reward per future time step, also known as the gain, and the current difference from that long-term average, or bias.

$$G(s_t) = \lim_{n \to \infty} \frac{1}{n} \sum_{\delta=0}^{n} r_{t+\delta}$$

$$B(s_t) = \sum_{\delta=0}^{\infty} \left[ r_{t+\delta} - G(s_{t+\delta}) \right]$$

A system where any state has a nonzero probability of being reached from any other state is known as an ergodic system. For such a system the gain, $G(s)$, will have the same value for all states and the bias, $B(s)$, serves a similar purpose to $V(s)$ above in indicating the relative worth of different states. While average reward has a theoretical advantage in that there is no discount factor to choose, historically average


reward has been considered more complex to use than the discounted reward regimes and so has been less used in practice. There is a strong theoretical relationship between average reward and discounted reward in the limit as the discount factor approaches one.

Here we focus on discounted reward.

Estimating Discounted Sum of Reward
The temporal differencing estimation procedure is based on a recursive reformulation of the above definitions. For the discounted case:

$$V(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots$$
$$= r_t + \gamma \left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \right]$$
$$= r_t + \gamma V(s_{t+1})$$

From the recursive formulation we can see that the long term utility for one time step is closely related to the long term utility at the next time step. If there is already an estimate of the long term utility at $s_t$, $V(s_t)$, then we could calculate a change in that value given a new trajectory as follows:

$$\Delta_t = \left[ r_t + \gamma V(s_{t+1}) \right] - V(s_t)$$

If we are dealing with a stochastic system, then we may not want to update $V(s_t)$ to the new value in one jump, but rather only move partway toward the new value:

$$\Delta_t = \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)$$

where $\alpha$ is a learning rate between 0 and 1. As an assignment, this update can be written in a number of equivalent ways, the two most common being:

$$V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right), \quad \text{or}$$
$$V(s_t) \leftarrow (1 - \alpha) V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) \right)$$

This update, error, learning, or delta rule is the core of temporal difference learning. It is from this formulation, which computes a delta based on the difference in estimated long term utility of the world at two consecutive time steps, that we get the term temporal differencing.

Having derived this update rule, we can now apply it to finding the long term utility of a particular agent. In the simplest case we will assume that there are a finite number of Markov states of the world, and that these can be reliably detected by the agent at run time. We will store the function V as an array of real numbers, with one number for each world state.

After each time step, $t$, we will use the knowledge of the previous state, $s_t$, the instantaneous reward for the time step, $r_t$, and the resulting state, $s_{t+1}$, to update the value of the previous state, $V(s_t)$, using the delta rule above:

$$V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)$$
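The following sketch implements this tabular procedure (often called TD(0)); the environment interface, the fixed policy, and the constants are illustrative assumptions, not part of the formal definition above.

```python
# Minimal sketch of tabular TD(0) value estimation as described above.
# The environment interface (reset/step) and episode counts are illustrative.
from collections import defaultdict

def td0_value_estimation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.95):
    """Estimate V(s) for a fixed policy by applying the delta rule after each step."""
    V = defaultdict(float)  # one real number per encountered state
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(s_t) <- V(s_t) + alpha * delta_t
            s = s_next
    return V
```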

Eligibility Traces and TD(λ)

Basic temporal differencing as represented above can be quite slow to converge in many situations. Consider, for example, a simple corridor with a single reward at the end, and an agent that walks down the corridor. Assume that the value function was initialized to a uniform zero value. On each walk down the corridor, useful information is only pushed one step back toward the start of the corridor.

Eligibility traces try to alleviate this problem by pushing information further back along the trajectory of the agent with each update to $V$. An algorithm incorporating eligibility traces can be seen as a mixture of "pure" TD, as described above, and ▸ Monte-Carlo estimation of the long term utility. In particular, the $\lambda$ parameter to the TD(λ) family of algorithms specifies where in the range from pure TD, when $\lambda = 0$, to pure Monte-Carlo, when $\lambda = 1$, a particular algorithm falls.

Eligibility traces are implemented by keeping a second function of the state space, $\varepsilon : S \to \mathbb{R}$. The $\varepsilon$ function represents how much an experience now should affect the value of a state the agent has previously passed through. When the


agent performs an update, the values of all states are changed according to their eligibility.

The standard definition of the eligibility of a particular state uses an exponential decay over time, but this is not a strict requirement and other definitions of eligibility could be used. In addition, each time a state is visited, its eligibility increases. Formally, on each time step,

8s2S".s/ ��".s/ and then,

".st / ".st /C 1

This eligibility is used to update all state values by first calculating the delta for the current state as above, but then applying it to all states according to the eligibility values:

$$\delta_t = \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)$$
$$\forall s \in S: \; V(s) \leftarrow V(s) + \delta_t \, \varepsilon(s)$$

Convergence

TD value function estimation has been shown to converge under many conditions, but there are also well known examples where it does not converge at all, or does not converge to the correct long term reward (Tsitsiklis and Van Roy 1997).

In particular, temporal differencing has been shown to converge to the correct value of the long term discounted reward if:

• The world is finite.
• The world state representation is Markovian.
• The rewards are bounded.
• The representation of the V function has no constraints (e.g., a tabular representation with an entry for each state).
• The learning rate, $\alpha$, is reduced according to the Robbins-Monro conditions: $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$.

Much of the further work in TD learning since its invention has been in finding algorithms that provably converge in more general cases.

These convergence results require that a Markovian representation of state be available to the agent. There has been research into how to acquire such a representation from a sequence of observations. The approach of the Temporal Differencing community has been to use TD-Networks (Sutton and Tanner 2004).

Control of Systems
Temporal Difference Learning is used to estimate the long term reward of a pattern of behavior. This estimation of utility can then be used to improve that behavior, allowing TD to help solve a reinforcement learning problem. There are two common ways to achieve this: an Actor-Critic setup uses value function estimation as one component of a larger system, and the Q-learning and SARSA techniques can be viewed as slight modifications of the TD method which allow the extraction of control information more directly from the value function.

First we will formalise the concept of a pattern of behavior. In the preceding text it was left deliberately vague as TD can be applied to multiple definitions. Here we will focus on discrete action spaces.

Assume there is a set of allowed actions for the agent, $A$. We define a Markov policy as a function from world states to actions, $\pi : S \to A$. We also define a stochastic or mixed Markov policy as a function from world states to probability distributions over actions, $\pi : S \to A \to [0, 1]$. The goal of the control algorithm is to find an optimal policy: a policy that maximises long term reward in each state. (When function approximation is used (see section "Approximation"), this definition of an optimal policy no longer suffices. One can then either move to average reward if the system is ergodic, or give a, possibly implicit, weighting function specifying the relative importance of different states.)

Actor-Critic Control Systems
Actor-Critic control is closely related to mixed policy iteration from Markov Decision Process theory. There are two parts to an actor-critic system: the actor holds the current policy for


the agent, and the critic evaluates the actor and suggests improvements to the current policy.

There are a number of approaches that fall under this model. One early approach stores a preference value for each world state and action pair, $p : S \times A \to \mathbb{R}$. The actor then uses a stochastic policy based on the Gibbs softmax function applied to the preferences:

$$\pi(s, a) = \frac{e^{p(s,a)}}{\sum_{x \in A} e^{p(s,x)}}$$

The critic then uses TD to estimate the long term utility of the current policy, and also uses the TD update to change the preference values. When the agent is positively surprised it increases the preference for an action; when negatively surprised it decreases the preference for an action. The size of the increase or decrease is modulated by a parameter, $\beta$:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t$$

Convergence of this algorithm to an optimal policy is not guaranteed.
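The following sketch puts the pieces of this preference-based actor-critic together: a Gibbs softmax actor over preferences, and a TD critic whose delta also adjusts the preferences. The environment interface and constants are illustrative, and, as noted, convergence is not guaranteed.

```python
# Minimal sketch of the tabular actor-critic scheme above. Environment
# interface and parameter values are illustrative.
import math, random
from collections import defaultdict

def actor_critic(env, actions, episodes=1000, alpha=0.1, beta=0.1, gamma=0.95):
    V = defaultdict(float)                       # critic: state values
    p = defaultdict(float)                       # actor: preferences p(s, a)
    def sample_action(s):
        weights = [math.exp(p[(s, a)]) for a in actions]  # Gibbs softmax
        return random.choices(actions, weights=weights)[0]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = sample_action(s)
            s_next, r, done = env.step(a)
            delta = alpha * (r + (0.0 if done else gamma * V[s_next]) - V[s])
            V[s] += delta                        # critic update
            p[(s, a)] += beta * delta            # actor update: p <- p + beta * delta
            s = s_next
    return V, p
```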

A second approach requires the agent to have an accurate model of its environment. In this approach the critic uses TD to learn a value function for the current behavior. The actor uses model-based forward search to choose an action likely to lead to a state with a high expected long term utility. This approach is common in two player, zero sum, alternating move games such as Chess or Checkers where the forward search is a deterministic game tree search.

More modern approaches which guarantee convergence are related to policy gradient approaches to reinforcement learning (Di Castro and Meir 2010). These store a stochastic policy in addition to the value function, and then use the TD updates to estimate the gradient of the long term utility with respect to that policy. This allows the critic to adjust the policy in the direction of the negative gradient with respect to long term value, and thus improve the policy.

Other Value Functions
The second class of approaches to using TD for control relies upon extending the value function to estimate the value of multiple actions. Instead of $V$ we use a state-action value function, $Q : S \times A \to \mathbb{R}$. The update rule for this function is minimally modified from the TD update defined for $V$ above.

Once these state-action value functions have been estimated, a policy can be selected by choosing for each state the action that maximizes the state-action value function, and then adding some exploration.

In order for this extended value function to be learned, the agent must explore each action in each state infinitely often. Traditionally this has been assured by making the agent select random actions occasionally, even when the agent believes that action would be sub-optimal. In general the choice of when to explore using a sub-optimal action, the exploration/exploitation trade-off, is difficult to optimize. More recent approaches to optimizing the exploration/exploitation trade-off in reinforcement learning estimate the variance of the value function to decide where they need to explore (Auer and Ortner 2007).

The requirement for exploration leads to two different value functions that could be estimated. The agent could estimate the value function of the pattern of behavior currently being executed, which includes the exploration. Or, the agent could estimate the value function of the current best policy, excluding the exploration currently in use. These are referred to as on-policy and off-policy methods respectively.

Q-Learning is an off-policy update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - Q(s_t, a_t) \right)$$
where $V(s_{t+1}) = \max_{a \in A} Q(s_{t+1}, a)$.

SARSA is an on-policy update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)$$


Then for both:

$$\pi(s) = \arg\max_{a \in A} Q(s, a)$$

and some exploration.

As can be seen above, the update rules for SARSA and Q-learning are very similar – they only differ in the value used for the resulting state. Q-learning uses the value of the best action, whereas SARSA uses the value of the action that will actually be chosen.

Q-Learning converges to the best policy to use once you have converged and can stop exploring. SARSA converges to the best policy to use if you want to keep exploring as you follow the policy (Lagoudakis and Parr 2003).
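A minimal sketch of tabular Q-learning with ε-greedy exploration is given below; swapping the maximization in the target for the value of the action actually selected next would turn it into SARSA. The environment interface and constants are illustrative.

```python
# Minimal sketch of tabular Q-learning (off-policy) with epsilon-greedy
# exploration, matching the update rule above. Environment interface is
# illustrative.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q, greedy
```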

Approximation
A major problem with many state based algorithms, including TD learning, is the so-called ▸ curse of dimensionality. In a factored state representation, the number of states increases exponentially with the number of factors. This explosion of states produces two problems: it can be difficult to store a function over the state space, and even if the function can be stored, so much data is required to learn the function that learning is impractical.

The standard response to the curse of dimensionality is to apply function approximation to any function of state. This directly attacks the representation size, and also allows information from one state to affect another "similar" state, allowing generalisation and learning.

While the addition of function approximation can significantly speed up learning, it also causes difficulty with convergence. Some types of function approximation will stop TD from converging at all. The resulting algorithms can either oscillate forever or approach infinite values. Other forms of approximation cause TD to converge to an estimate of long term reward which is only weakly related to the true long term reward (Gordon 1995; Boyan and Moore 1995; Baird 1995).

Most styles of function approximation used in conjunction with TD learning are parameterized, and the output is differentiable with respect to those parameters. Formally we have $V : \Theta \to S \to \mathbb{R}$, where $\Theta$ is the space of possible parameter vectors, so that $V_\theta(s)$ is the value of $V$ at state $s$ with parameter vector $\theta$, and $\nabla V_\theta(s)$ is the gradient of $V$ with respect to $\theta$ at $s$. The TD update then becomes:

$$\delta_t = \alpha \left( r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t) \right)$$
$$\theta \leftarrow \theta + \delta_t \, \nabla V_\theta(s_t)$$
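For the common linear case, $V_\theta(s) = \theta \cdot \phi(s)$ and the gradient is simply the feature vector $\phi(s)$, so the update above takes the form sketched below; the feature map and environment interface are illustrative assumptions.

```python
# Minimal sketch of the gradient TD update above for a linear value function
# V_theta(s) = theta . phi(s); the gradient is just phi(s). The feature map
# phi and the environment are illustrative.
import numpy as np

def linear_td(env, policy, phi, dim, episodes=1000, alpha=0.01, gamma=0.95):
    theta = np.zeros(dim)
    value = lambda s: float(np.dot(theta, phi(s)))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * value(s_next)
            delta = alpha * (target - value(s))
            theta += delta * phi(s)     # theta <- theta + delta_t * grad V_theta(s_t)
            s = s_next
    return theta
```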

We describe three styles of approximation: state abstraction, linear approximation, and smooth general approximators (e.g., neural networks).

State abstraction refers to grouping states together and thereafter using the groups, or abstract states, instead of individual states. This can significantly reduce the amount of storage required for the value function as only values for abstract states need to be stored. It also preserves convergence results. A slightly more advanced form of state abstraction is the tile coding or CMAC (Albus 1981). In this type of function approximation, the state representation is assumed to be factored, i.e., each state is represented by a vector of values rather than a single scalar value. The CMAC represents the value function as the sum of separate value functions, one for each dimension of the state. Those individual dimensions can each have their own state abstraction. Again, TD has been shown to converge when used with a CMAC value function representation.

In general, any form of function approximation that forms a contraction mapping will converge when used with TD (see the entry on ▸ Markov Decision Processes). Linear interpolation is a contraction mapping, and hence converges. Linear extrapolation is not a contraction mapping, and care needs to be taken when using general linear functions with TD. It has been shown that general linear function approximation used with TD will converge, but only when complete trajectories are followed through the state space (Tsitsiklis and Van Roy 1997).

It is not uncommon to use various types of back-propagation neural nets with TD, e.g., Tesauro's TD-Gammon. More recently, TD


algorithms have been proposed that converge for arbitrary differentiable function approximators (Papavassiliou and Russell 1999; Maei et al. 2009). These use more complex update techniques than those shown above.

Related Differencing Systems
TD learning was originally developed for use in environments where accurate models were unavailable. It has a close relationship with the theory of Markov Decision Processes where an accurate model is assumed. Using the notation $V(s_t) \rightsquigarrow V(s_{t+1})$ for a TD-style update that moves the value at $V(s_t)$ closer to the value at $V(s_{t+1})$

(including any discounting and intermediate rewards), we can now consider many possible updates.

As noted above, one way of applying TD to control is to use forward search. Forward search can be implemented using dynamic programming, and the result is closely related to TD. Let state $c(s)$ be the best child of state $s$ in the forward search. We can then consider an update, $V(s) \rightsquigarrow V(c(s))$. If we let $l(s)$ be the best leaf in the forward search, we could then consider an update $V(s) \rightsquigarrow V(l(s))$. Neither of these updates considers the world after an actual state transition, only simulated state transitions, and so neither is technically a TD update.

Some work has combined both simulated time steps and real time steps. The TD-Leaf learning algorithm for alternating move games uses the $V(l(s_t)) \rightsquigarrow V(l(s_{t+1}))$ update rule (Baxter et al. 1998).

An important issue to consider when using forward search is whether the state distribution where learning takes place is different to the state distribution where the value function is used. In particular, if updates only occur for states the agent chooses to visit, but the search is using estimates for states that the agent is not visiting, then TD may give poor results. To combat this, the TreeStrap(α–β) algorithm for alternating move games updates all nodes in the forward search tree to be closer to the bound information provided by their children (Veness et al. 2009).

Biological Links
There are strong relationships between TD learning and the Rescorla–Wagner model of Pavlovian conditioning. The Rescorla–Wagner model is one way to formalize the idea that learning occurs when the co-occurrence of two events is surprising, rather than every time a co-occurrence is experienced. The $\Delta_t$ value calculated in the TD update can be viewed as a measure of surprise. These findings appear to have a neural substrate in that dopamine cells react to reward when it is unexpected and to the predictor when the reward is expected (Schultz et al. 1997; Sutton and Barto 1990).

Cross-References

▸ Curse of Dimensionality
▸ Markov Decision Processes
▸ Markov Chain Monte Carlo
▸ Reinforcement Learning

Recommended Reading

Albus JS (1981) Brains, behavior, and robotics. BYTE, Peterborough. ISBN:0070009759

Auer P, Ortner R (2007) Logarithmic online regret bounds for undiscounted reinforcement learning. Neural and information processing systems (NIPS), Vancouver

Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Prieditis A, Russell S (eds) Machine learning: proceedings of the twelfth international conference (ICML95). Morgan Kaufmann, San Mateo, pp 30–37

Baxter J, Tridgell A, Weaver L (1998) KnightCap: a chess program that learns by combining TD(λ) with game-tree search. In: Shavlik JW (ed) Proceedings of the fifteenth international conference on machine learning (ICML'98). Morgan Kaufmann, San Francisco, pp 28–36

Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton

Bertsekas DP, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont

Boyan JA, Moore AW (1995) Generalization in reinforcement learning: safely approximating the value function. In: Tesauro G, Touretzky DS, Leen TK (eds) Advances in neural information processing systems, vol 7. MIT, Cambridge

Di Castro D, Meir R (2010) A convergent online single time scale actor critic algorithm. J Mach Learn Res 11:367–410. http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html


Gordon GF (1995) Stable function approximation in dynamic programming. Technical report CMU-CS-95-103. School of Computer Science, Carnegie Mellon University

Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J Mach Learn Res 4:1107–1149. http://www.cs.duke.edu/~parr/jmlr03.pdf

Maei HR et al (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Neural and information processing systems (NIPS), pp 1204–1212. http://books.nips.cc/papers/files/nips22/NIPS2009_1121.pdf

Mahadevan S (1996) Average reward reinforcement learning: foundations, algorithms, and empirical results. Mach Learn 22:159–195. doi:10.1023/A:1018064306595

Papavassiliou VA, Russell S (1999) Convergence of reinforcement learning with general function approximators. International Joint Conference on Artificial Intelligence, Stockholm

Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. Wiley, New York

Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3(3):210–229

Schultz W, Dayan P, Read Montague P (1997) A neural substrate of prediction and reward. Science 275(5306):1593–1599. doi:10.1126/science.275.5306.1593

Sutton RS (1984) Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts, Amherst

Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44. doi:10.1007/BF00115009

Sutton RS, Barto AG (1990) Time-derivative models of Pavlovian reinforcement. In: Gabriel M, Moore J (eds) Learning and computational neuroscience: foundations of adaptive networks. MIT, Cambridge, pp 497–537

Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT, Cambridge

Sutton R, Tanner B (2004) Temporal difference networks. Neural and information processing systems (NIPS), Vancouver

Tesauro G (1995) Temporal difference learning and TD-gammon. Commun ACM 38(3):58–67

Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690

Veness J et al (2009) Bootstrapping from game tree search. Neural and information processing systems (NIPS), Whistler

Watkins CJCH (1989) Learning from delayed rewards. Ph.D. thesis, Psychology Department, Cambridge University, Cambridge

Test Data

Synonyms

Evaluation data; Test instances

Definition

Test data are data to which a model is applied for the purposes of ▸ evaluation. When ▸ holdout evaluation is performed, test data are also called out-of-sample data, holdout data, or the holdout set.

Cross-References

▸ Test Set

Test Instances

▸ Test Data

Test Set

Synonyms

Evaluation data; Evaluation set; Test data

Definition

A test set is a ▸ data set containing data that are used for ▸ evaluation by a learning system. Where the ▸ training set and the test set contain disjoint sets of data, the test set is known as a ▸ holdout set.

Cross-References

▸ Data Set


Test Time

A learning algorithm is typically applied at two distinct times. Test time refers to the time when an algorithm is applying a learned model to make predictions. ▸ Training time refers to the time when an algorithm is learning a model from ▸ training data. ▸ Lazy learning usually blurs the distinction between these two times, deferring most learning until test time.

Test-Based Coevolution

Synonyms

Competitive coevolution

Definition

A coevolutionary system constructed to simultaneously develop solutions to a problem and challenging tests for candidate solutions. Here, individuals represent complete solutions or their tests. Though not precisely the same as competitive coevolution, there is a significant overlap.

Text Learning

▸ Text Mining

Text Mining

Dunja Mladenic
Artificial Intelligence Laboratory, Jozef Stefan Institute, Ljubljana, Slovenia

Abstract

Text Mining, also referred to as Data Mining on Text, has emerged at the intersection of several research areas, some focused on data analytics and others focused more on handling text data.

This entry provides a definition of Text Mining and links it to related research areas, most of them included in this book.

Synonyms

Analysis of text; Data mining on text; Text learning

Definition

The term text mining is used analogously to ▸ Data Mining when the data is text. As there are some data specificities when handling text compared to handling data from ▸ databases, text mining has a number of specific methods and approaches. Some of these are extensions of data mining and machine learning methods, while others are rather text specific. Text mining approaches combine methods from several related fields, including machine learning, data mining, ▸ Information Retrieval, natural language processing, ▸ Statistical Learning, and the ▸ Semantic Web. Basic text mining approaches are also extended to enable handling of different natural languages (▸ Cross-Lingual Text Mining) and are combined with methods for handling different data types, such as links and graphs (▸ Link Mining and Link Discovery, Graph Mining), images and video (multimedia mining), and sensor data. Adopting ▸ Stream Mining methods for text data enables analysis of text streams, such as news feeds or social media texts. Text stream mining can also be combined with other types of data streams, such as sensor readings, economic indicators, and video, where the time stamp and location of the data can play a crucial role in analytics.

Cross-References

▸ Cross-Lingual Text Mining
▸ Feature Construction in Text Mining
▸ Feature Selection in Text Mining
▸ Semi-Supervised Text Processing
▸ Stream Mining


▸ Text Mining for Advertising
▸ Text Mining for News and Blogs Analysis
▸ Text Mining for Spam Filtering
▸ Text Mining for the Semantic Web
▸ Text Visualization

Text Mining for Advertising

Massimiliano Ciaramita
Yahoo! Research Barcelona, Barcelona, Spain

Synonyms

Content match; Contextual advertising; Sponsored search; Web advertising

Definition

Text mining for advertising is an area of investigation and application of text mining and machine learning methods to problems such as Web advertising; e.g., automatically selecting the most appropriate ads with respect to a Web page, or a query submitted to a search engine. Formally, the task can be framed as a ranking or matching problem where the unit of retrieval, rather than a Web page, is an advertisement. Most of the time ads have simple and homogeneous predefined textual structures; however, formats can vary and include audio and visual information. Advertising is a challenging problem due to several factors such as the economic nature of the transactions involved, engineering issues concerning scalability, and the inherent complexity of modeling the linguistic and multimedia content of advertisements.

Motivation and Background

The role of advertising in supporting and shaping the development of the Web has substantially increased over the past years. According to the Interactive Advertising Bureau (IAB), Internet

advertising revenues in the USA totaled almost $8 billion in the first 6 months of 2006, a 36.7 % increase over the same period in 2005, the latest in a series of consecutive increases. Search, i.e., ads placed by Internet companies in Web pages or in response to specific queries, is the largest source of revenue, accounting for 40 % of total revenue (Internet Advertising Bureau 2006). The most important categories of Web advertising are keyword match, also known as sponsored search or paid listing, which places ads in the search results for specific queries (see Fain and Pedersen 2006 for a brief history of sponsored search), and content match, also called content-targeted advertising or contextual advertising, which places ads in Web pages based on the page content. Figure 1 shows an example of sponsored search, where ads are listed on the right side of the page.

Currently, most of the focus in Web advertising involves sponsored search, because matching based on keywords is a well-understood problem. Content match has greater potential for content providers, publishers, and advertisers, because users spend most of their time on the Web on content pages as opposed to search engine result pages. However, content match is a harder problem than sponsored search. Matching ads with query terms is to a certain degree straightforward, because advertisers themselves choose the keywords that characterize their ads, which are matched against keywords chosen by users while searching. In contextual advertising, matching is determined automatically by the page content, which complicates the task considerably.

Advertising touches challenging problems concerning how ads should be analyzed, and how accurately and efficiently systems select the best ads. This area of research is developing rapidly in information retrieval. How best to model the structure and components of ads, and the interaction between the ads and the contexts in which they appear, are open problems. Information retrieval systems are designed to capture relevance, which is a basic concept in advertising as well. Elements of an ad such as text and images tend to be mutually relevant, and often (in print media, for example) ads are placed in contexts that match a certain product at


Text Mining for Advertising, Fig. 1 Ads ranked next to a search results page for the query “Spain holidays”

a topical level; e.g., an ad for sneakers placed on a sports news page. However, topical relevance is only one of the basic parameters which determine a successful advertisement placement. For example, an ad for sneakers might be appropriate and effective on a page comparing MP3 players, because they may share a target audience (e.g., joggers) although they arguably refer to different topics, and it is possible they share no common vocabulary. Conversely, there may be ads that are topically similar to a Web page but cannot be placed there because they are inappropriate. An example might be placing ads for a product in the page of a competitor.

The language of advertising is rich and sophisticated and can rely considerably on complex inferential processes. A picture of a sunset in an ad for life insurance carries a different implication than a picture of a sunset in an ad for beer. Layout and visual content are designed to elicit inferences, possibly hinging on cultural elements; e.g., the age, appearance, and gender of people in an ad affect its meaning. Adequate automatic modeling will likely involve, to a substantial degree,

understanding the language of advertisement and the inferential processes involved (Vestergaard and Schroeder 1985). Today this seems beyond what traditional information retrieval systems are designed to cope with. In addition, the global context can be captured only partially by modeling the text alone. As the Web evolves into an immense infrastructure for social interaction and multimedia information sharing, the need for modeling structured "content" becomes more and more crucial. This applies to information retrieval and specifically to advertising. For this reason, the problem of content match is of particular interest and opens new problems and opportunities for interdisciplinary research.

Today, contextual advertising, the most interesting sub-task from a mining perspective, consists mostly in selecting ads from a pool to match the textual content of a particular Web page. Ads provide a limited amount of text: typically a few keywords, a title, and a brief description. The ad-placing system needs to identify relevant ads, from huge ad inventories, quickly and efficiently based on this very limited amount of information.


Current approaches have focused on augmenting the representation of the page to increase the chance of a match (Ribeiro-Neto et al. 2005), on using machine learning to find complex ranking functions (Lacerda et al. 2006), or on reducing the problem of content match to that of sponsored search by extracting keywords from the Web page (Yih et al. 2006). All these approaches are based on methods that quantify the similarity between the ad and the target page on the basis of traditional information retrieval notions such as cosine similarity and tf.idf features. The relevance of an ad for a page depends on the number of overlapping words, weighted individually and independently as a function of their individual distributional properties in the collection of documents or ads.

Structure of Learning Problem

The typical elements of an advertisement are a set of keywords, a title, a textual description, and a hyperlink pointing to a page, the landing page, relative to a product or service. In addition, an ad has an advertiser id and can be part of a campaign, i.e., a subset of all the ads with the same advertiser id. This latter information can be used, for example, to impose constraints on the number of ads to display relative to the campaign or advertiser. While this is possibly the most common layout, it is important to realize that ad structure can vary significantly and include relevant audio and visual content.

In general, the learning problem for an ad-placing system can be formalized as a ranking task. Let $A$ be a set of ads, $P$ the set of possible pages, and $Q$ the set of possible queries. In keyword match, the goal is to find a function $F : A \times Q \to \mathbb{R}$; e.g., a function that counts the number of individual common terms or n-grams of such terms. In content match, the objective is to find a function $F : A \times P \to \mathbb{R}$. The keyword match problem is to a certain extent straightforward and amounts to matching a small set of terms defined manually by both the user and the advertiser. The content match task shares with the former task the peculiarities of the units of retrieval (the ads), but introduces new and interesting issues for text mining and learning because of the more complex conditioning environment, the Web page content, which needs to be modeled automatically.

In general terms, an ad can be represented as a feature vector $x = \Phi(a_i, p_j)$ such that $a_i \in A$, $p_j \in P$, and, given a $d$-dimensional feature space $X \subseteq \mathbb{R}^d$, $\Phi : A \times P \to X$. In the traditional machine learning setting, one introduces a weight vector $\alpha \in \mathbb{R}^d$ which quantifies each feature's contribution individually. The vector's weights can be learned from manually edited rankings (Lacerda et al. 2006; Ribeiro-Neto et al. 2005) or from click-through data as in search results optimization (Joachims 2002). In the case of a linear classifier the score of an ad-target page pair $x_i$ would be:

$$F(x; \alpha) = \sum_{s=1}^{d} \alpha_s x_s. \qquad (1)$$

Several methods can be used to learn similar or related models, such as the perceptron, SVM, boosting, etc. Constraints on the number of advertisers or campaigns could be easily implemented as post-ranking filters on top of the ranked list of ads or included in a suitable objective function.

A basic model for ranking ads can be defined in the vector space model for information retrieval, using a ranking function based on cosine similarity, where ads and target pages are represented as vectors of terms weighted by fixed schemes such as tf.idf. If only one feature is used, the cosine based on tf.idf between the ad and the page, a standard vector space model baseline is obtained, which is the basis of the ad-placing ranking function variants proposed by Ribeiro-Neto et al. (2005). Recent work has shown that machine learning-based models are considerably more accurate than such baselines. However, as in document retrieval, simple feature maps which include mostly coarse-grained statistical properties of the ad-page pairs, such as tf.idf-based cosine, are the most desirable for efficiency and bias reasons. Properties of the different components of the ad can be used and weighted in different ways, and soft or hard constraints introduced to model the presence of the ad's keywords in the Web page. The design space for ad-placing systems is vast and still little explored. All systems presented so far in the literature make use of manually annotated data for training and/or evaluating a model.
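As a concrete illustration of this single-feature baseline, the sketch below (a minimal, hypothetical example using scikit-learn, not any system described in the cited work) ranks a small in-memory set of ads by the tf.idf cosine between each ad's text and the target page:

```python
# Minimal sketch of the vector-space baseline: ads and the target page
# are represented as tf.idf vectors and ads are ranked by cosine
# similarity to the page. The `page` and `ads` inputs are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_ads(page, ads):
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the page plus all ad texts so they share one vocabulary.
    matrix = vectorizer.fit_transform([page] + ads)
    page_vec, ad_vecs = matrix[0], matrix[1:]
    scores = cosine_similarity(ad_vecs, page_vec).ravel()
    # Higher cosine means a better match; return (ad index, score) pairs.
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

page = "Cheap flights and hotel deals for your summer vacation in Italy."
ads = ["Book discount flights to Rome today",
       "Best life insurance quotes online",
       "Luxury hotels in Florence and Venice"]
for idx, score in rank_ads(page, ads):
    print(f"{score:.3f}  {ads[idx]}")
```

In a learned model such as Eq. (1), this cosine value would simply be one of the $d$ features weighted by $\alpha$.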

Structure of Learning Systems

Web advertising presents peculiar engineering and modeling challenges and has motivated research in different areas. Systems need to be able to deal in real time with huge volumes of data and transactions involving billions of ads, pages, and queries. Hence several engineering constraints need to be taken into account; efficiency and computational costs are crucial factors in the choice of matching algorithms (The Yahoo! Research Team 2006). Ad-placing systems might require new global architecture designs; e.g., Attardi et al. (2004) proposed an architecture for information retrieval systems that need to handle large-scale targeted advertising based on an information filtering model. The ads that appear on Web pages or search results pages will ultimately be determined taking into account the expected revenues and the price of the ads. Modeling the microeconomic factors of such processes is a complex area of investigation in itself (Feng et al. 2005).

Another crucial issue is the evaluation of the effectiveness of the ad-placing systems. Studies have emphasized the impact of the quality of the matching on the success of the ad in terms of click-through rates (Gallagher et al. 2001; Sherman and Deighton 2001). Although click-through rates (CTRs) provide a traditional measure of effectiveness, it has been found that ads can be effective even when they do not solicit any conscious response and that the effectiveness of the ad is mainly determined by the level of congruency between the ad and the context in which it appears (Yoo 2006).

Keyword Extraction Approaches

Since the query-based ranking problem is better understood than contextual advertising, one way of approaching the latter would be to represent the content page as a set of keywords and then rank the ads based on the keywords extracted from the content page. Carrasco et al. (2003) proposed clustering of bipartite advertiser-keyword graphs for keyword suggestion and identifying groups of advertisers. Yih et al. (2006) proposed a system for keyword extraction from content pages. The goal is to determine which keywords, or key phrases, are more relevant in a Web page. Yih et al. develop a supervised approach to this task from a corpus of pages where keywords have been manually identified. They show that a model learned with ▶ logistic regression outperforms traditional vector models based on fixed tf.idf weights. The most useful features to identify good keywords efficiently are, in this case, term frequency and document frequency of the candidate keywords, and particularly the frequency of the candidate keyword in a search engine query log. Other useful features include the similarity of the candidate with the page's URL and the length, in number of words, of the candidate keyword. In terms of feature representation, then, they propose a feature map $\Phi : P \to Q$, which represents a Web page as a set of keywords. The accuracy of the best learned system is 30.06 %, in terms of the top predicted keyword being in the set of manually generated keywords for a page, against 13.01 % for the simpler tf.idf-based model. While this approach is simple to apply, it remains to be seen how accurate it is at identifying good ads for a page. It identifies potentially useful sources of information in automatically generated keywords. An interesting related finding concerning keywords is that longer keywords, about four words long, lead to increased click-through rates (OneUpWeb 2005).
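The following sketch illustrates, under simplifying assumptions, the supervised keyword-scoring idea: candidate keywords are described by a handful of the features listed above and scored with logistic regression. The feature values, labels, and candidate phrases are invented for illustration and do not reproduce the cited system:

```python
# Hedged sketch of keyword scoring with logistic regression. Each row
# describes one candidate keyword by simple features; labels mark
# whether a human annotator picked it for the page. All values are toy.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [term frequency on page, log document frequency,
#            log query-log frequency, appears in URL (0/1), length in words]
X_train = np.array([
    [12, 2.1, 8.5, 1, 2],   # "digital camera"       -> good keyword
    [ 3, 6.0, 1.2, 0, 1],   # "page"                 -> not a keyword
    [ 9, 3.0, 7.8, 1, 3],   # "cheap flight tickets" -> good keyword
    [ 1, 5.5, 0.4, 0, 1],   # "click"                -> not a keyword
])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

candidates = {"rome hotel deals": [7, 2.5, 6.9, 1, 3],
              "read more":        [4, 6.2, 0.9, 0, 2]}
for phrase, feats in candidates.items():
    p_good = model.predict_proba([feats])[0, 1]
    print(f"P(keyword) = {p_good:.2f}  {phrase}")
```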

The Vocabulary Impedance Problem

Ribeiro-Neto et al. (2005) introduced an approach to content match which focuses on the vocabulary mismatch problem. They notice that there tends to be not enough overlap in the text of the ad and the target page to guarantee good accuracy; they call this the vocabulary impedance problem. To overcome this limitation they propose to generate an augmented representation of the target page by means of a Bayesian model previously applied to document retrieval (Ribeiro-Neto and Muntz 1996). The expanded vector representation of the target page includes a significant number of additional words which can potentially match some of the terms in the ad. They find that such a model improves over a standard vector space model baseline, evaluated by means of 11-point average precision on a test bed of 100 Web pages, from 0.168 to 0.253. One possible shortcoming of such an approach is that generating the augmented representation involves crawling a significant number of additional related pages. It has also been argued (Yih et al. 2006) that this model complicates pricing of the ads because the keywords chosen by the advertisers might not be present in the content of the matching page.
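The intuition behind the expansion can be sketched as follows; note that this toy version simply merges down-weighted term counts from related pages, whereas the cited approach uses a Bayesian network model:

```python
# Toy sketch of augmenting a target page's term representation with
# terms from related pages, so that more ad terms can find a match.
# The weighting scheme and the "related pages" are illustrative only.
from collections import Counter

def augmented_representation(page_terms, related_pages, weight=0.5):
    """Combine the page's term counts with down-weighted counts
    taken from related pages (e.g., crawled neighbouring pages)."""
    rep = Counter(page_terms)
    for other in related_pages:
        for term, count in Counter(other).items():
            rep[term] += weight * count
    return rep

page = "vacation italy beach".split()
related = ["rome florence travel guide".split(),
           "cheap hotel booking italy".split()]
print(augmented_representation(page, related).most_common(5))
```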

Learning with Genetic Programming

Lacerda et al. (2006) proposed to use machine learning to find good ranking functions for contextual advertising. They use the same dataset described in Ribeiro-Neto et al. (2005), but use part of the data for training a model and part for evaluation purposes. They use a genetic programming algorithm to select a ranking function which maximizes the average precision on the training data. The resulting ranking function is a nonlinear combination of simple components based on the frequency of ad terms in the target page, document frequencies, document length, and size of the collections. Thus, in terms of the feature representation defined earlier, they choose a feature map which extracts traditional features from the ad-page pair, but then apply genetic programming methods to select complex nonlinear combinations of such features that maximize a fitness function based on average precision. Lacerda et al. (2006) find that the ranking functions selected in this way are considerably more accurate than the baseline proposed in Ribeiro-Neto et al. (2005); in particular, the best function selected by genetic programming achieves an average precision at position three of 0.508, against 0.314 for the baseline.
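The core of such an approach is the fitness computation; the sketch below shows only how one candidate ranking function (a hand-written nonlinear combination standing in for a genetically evolved function tree) would be scored by average precision over hypothetical training pages, while the genetic search itself is omitted:

```python
# Sketch of fitness evaluation for candidate ranking functions: rank the
# ads for each training page and average the resulting average-precision
# values. `candidate` is one hand-written nonlinear combination used as
# a stand-in for a function produced by genetic programming.
import math

def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 relevance flags in ranked order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def candidate(tf_in_page, df, page_len, collection_size):
    # One possible nonlinear combination of the simple components.
    return tf_in_page * math.log(1 + collection_size / (1 + df)) / (1 + page_len)

def fitness(ranking_fn, training_pages):
    """training_pages: list of (ad_feature_tuples, relevance_flags)."""
    aps = []
    for ad_feats, relevant in training_pages:
        scores = [ranking_fn(*f) for f in ad_feats]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        aps.append(average_precision([relevant[i] for i in order]))
    return sum(aps) / len(aps)
```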

Semantic Approaches to Contextual Advertising

The most common approaches to contextual advertising are based on matching terms between the ad and the content page. Broder et al. (2007) notice that this approach (which they call the "syntactic" model) can be improved by adopting a matching model which additionally takes into account topical proximity; i.e., a "semantic" model. In their model the target page and the ad are classified with respect to a taxonomy of topics. The similarity of ad and target page estimated by means of the taxonomy provides an additional factor in the ad ranking function. The taxonomy, which has been manually built, contains approximately 6,000 nodes, where each node represents a set of queries. The concatenation of all queries at each node is used as a meta-document; ads and target pages are associated with a node in the taxonomy using a nearest neighbor classifier and tf.idf weighting. The ultimate score of an ad $a_i$ for a page $p$ is a weighted sum of the taxonomy similarity score and the similarity of $a_i$ and $p$ based on standard syntactic measures (vector cosine). On evaluation, Broder et al. (2007) report a 25 % improvement for mid-range recalls of the syntactic-semantic model over the pure syntactic one.
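A much-simplified sketch of combining the two signals is shown below; the two-node "taxonomy", the nearest-neighbour assignment, and the mixing weight are all illustrative assumptions, and the cited model uses a graded taxonomy-distance score rather than a same-node indicator:

```python
# Toy sketch of a syntactic-plus-semantic ad score: ad and page are each
# assigned to the closest taxonomy node (node meta-documents matched by
# tf.idf cosine), and the final score mixes taxonomy agreement with the
# plain ad-page cosine. Taxonomy and alpha are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TAXONOMY = {"travel":  "flights hotels vacation booking tours",
            "finance": "insurance loans mortgage credit quotes"}

def score(ad, page, alpha=0.8):
    texts = [ad, page] + list(TAXONOMY.values())
    m = TfidfVectorizer().fit_transform(texts)
    ad_vec, page_vec, nodes = m[0], m[1], m[2:]
    ad_node = cosine_similarity(ad_vec, nodes).argmax()      # nearest node for the ad
    page_node = cosine_similarity(page_vec, nodes).argmax()  # nearest node for the page
    semantic = 1.0 if ad_node == page_node else 0.0
    syntactic = float(cosine_similarity(ad_vec, page_vec)[0, 0])
    return alpha * semantic + (1 - alpha) * syntactic

print(score("Book discount flights to Rome",
            "Planning a summer vacation in Italy: hotels and tours"))
```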

Cross-References

▶ Boosting
▶ Genetic Programming
▶ Information Retrieval
▶ Model Space
▶ Precision
▶ Support Vector Machines
▶ TF–IDF

Recommended Reading

Attardi G, Esuli A, Simi M (2004) Best bets, thousands of queries in search of a client. In: Proceedings of the 13th international conference on World Wide Web, alternate track papers & posters. ACM Press, New York

Broder A, Fontoura M, Josifovski V, Riedel L (2007) A semantic approach to contextual advertising. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York

Carrasco JJ, Fain D, Lang K, Zhukov L (2003) Clustering of bipartite advertiser-keyword graph. In: Workshop on clustering large datasets, IEEE conference on data mining. IEEE Computer Society Press, New York

Fain D, Pedersen J (2006) Sponsored search: a brief history. In: Proceedings of the 2nd workshop on sponsored search auctions, Ann Arbor. Web Publications

Feng J, Bhargava H, Pennock D (2005, forthcoming) Implementing sponsored search in web search engines: computational evaluation of alternative mechanisms. Inf J Comput

Gallagher K, Foster D, Parsons J (2001) The medium is not the message: advertising effectiveness and content evaluation in print and on the Web. J Advert Res 41(4):57–70

Internet Advertising Bureau (IAB) (2006) IAB internet advertising revenue report. http://www.iab.net/resources/adrevenue/pdf/IAB PwC%202006Q2.pdf

Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM conference on knowledge discovery and data mining (KDD). ACM Press, New York

Lacerda A, Cristo M, Goncalves MA, Fan W, Ziviani N, Ribeiro-Neto B (2006) Learning to advertise. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 549–556

OneUpWeb (2005) How keyword length affects conversion rates. http://www.oneupweb.com/landing/keywordstudy landing.htm

Ribeiro-Neto B, Cristo M, Golgher PB, de Moura ES (2005) Impedance coupling in content-targeted advertising. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 496–503

Ribeiro-Neto B, Muntz R (1996) A belief network model for IR. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 253–260

Sherman L, Deighton J (2001) Banner advertising: measuring effectiveness and optimizing placement. J Interact Mark 15(2):60–64

The Yahoo! Research Team (2006) Content, metadata, and behavioral information: directions for Yahoo! Research. IEEE Data Eng Bull 29(4):10–18

Vestergaard T, Schroeder T (1985) The language of advertising. Blackwell, Oxford

Yih W, Goodman J, Carvalho VR (2006) Finding advertising keywords on web pages. In: Proceedings of the 15th international conference on World Wide Web. ACM Press, New York, pp 213–222

Yoo CY (2006) Preattentive processing of web advertising. Ph.D. thesis, University of Texas, Austin

Text Mining for News and Blogs Analysis

Bettina Berendt
KU Leuven, Leuven, Belgium

Abstract

News and blogs are temporally indexed online texts and play a key role in today's information distribution and consumption. News communicate selected information on current events, written by professional or citizen journalists; blogs are updated publications on the Web that span a much wider range of topics, styles, and authors. Particularly important in recent years have been microblogs such as Twitter. The entry gives an overview of how text mining (for tasks such as description, classification, prediction, search, recommendation, or summarization) is applied to analyze the textual parts of news and blogs, extracting topics, events, opinions, sentiments, and other aspects of content. Often, textual analysis is complemented by the analysis of further data such as the social network of authors and readers. The properties of news and blogs data structures and language use require methods for preprocessing and analyzing that are tailored to news and (micro)blogs, and the tasks often profit from an interactive approach in which the user plays an active role in sensemaking. The methods are deployed in a wide range of applications and services.

Definition

News is "the communication of selected information on current events," where the selection is guided by "newsworthiness" or "what interests the public." News are also stories, from which the reader usually expects answers to the five Ws: who, what, when, where, and why, to which a "how" is often added. News-style writing – as opposed to, for example, commentary writing – generally strives for objectivity and/or neutrality (the representation of different views on the event).

In this content-centric sense, news can be written/authored and published by professional journalists and news outlets (such as newspapers or radio or TV stations) but also by anyone else and in any other form, often called citizen journalism: "an alternative and activist form of newsgathering and reporting that functions outside mainstream media institutions, often as a response to shortcomings in the professional journalistic field, that uses similar journalistic practices but is driven by different objectives and ideals and relies on alternative sources of legitimacy than traditional or mainstream journalism" (Radsch 2013, p. 159). However, news, or mainstream (media) news, is also often thought of in a source-centric way: reports authored by professional journalists in mainstream media institutions, as opposed to reporting from citizen journalists (or anyone else) who generally publish on the Web, in the form of blogs with a certain form of periodicity.

A blog is a (more or less) frequently updated publication on the Web, sorted in reverse chronological order of the constituent blog posts. Blog content may reflect any interest including journalistic, personal, corporate, and many others. Early blog posts (late 1990s) tended to be published on content management platforms without length restrictions; with the success of Twitter and similar microblogging platforms, much blogging (and much of blog mining) has shifted to short posts (e.g., 140 characters on Twitter.com and Weibo.cn, although the latter's Chinese characters allow for much more complex messages). Twitter in particular has attained a major worldwide role in the fast diffusion of news (or short summaries and statements, enriched by hyperlinks to more text and other media), with citizen journalists, mainstream media themselves, politicians, and others being the publishers (Kwak et al. 2010). Current research in blog mining, and the remainder of the present entry, reflect this dominance of (a) news or news-related content and (b) the microblog format. In addition, blog mining overlaps with social-media mining (Zafarani et al. 2014). In particular, the social graph of a microblogger allows the mining analyst to track the blogger's sources and readers/"followers" along with the contents.

News and blogs consist of textual and (in some cases) pictorial content and, when Web-based, may contain additional content in any other format (e.g., video, audio) and hyperlinks. They are indexed by time and structured into smaller units: news media into articles and blogs into blog posts. In most news and blogs, textual content dominates. Therefore, text analysis is the most often applied form of knowledge discovery. This comprises tasks and methods from data/text mining, ▶ information retrieval, and related fields. In accordance with the increasing convergence of these fields, this entry refers to all of them as ▶ text mining. The present entry will illustrate the overlap with/use of these fields and highlight the specifics that derive from the domain, including data, tasks, users, and use cases.

Motivation and Background

News and blogs are today's most common sources for learning about current events and also, in the case of blogs, for uttering opinions about current events. In addition, they may deal with topics of more long-term interest. Both reflect and form societies', groups', and individuals' views of the world, fast or even instantaneous with the events triggering the reporting. However, there are differences between these two types of media regarding authoring, content, and form. News is generally authored by people with journalistic training who abide by journalistic standards regarding the style and language of reporting. Topics and ways of reporting are circumscribed by general societal consensus and the policies of the news provider.

In contrast, everybody with Internet access can start a blog, and there are no restrictions on content and style (beyond the applicable types of censorship). Thus, blogs offer end users a wider range of topics and views on them.

These application characteristics lead to various linguistic and computational challenges for text mining analyses of news and blogs:

– Indexing, taxonomic categorization, partial redundancy, and data streams: News is indexed by time and by source (news agency or provider). In a multisource corpus, many articles published at about the same time (in the same or in other languages) describe the same events. Over time, a story may develop in the articles. Such multiple reporting and temporal structures are also observed for popular topics in blogs.

– Language and meaning: News is written in clear, correct, "objective," and somewhat schematized language. Usually, the start of a news article summarizes the whole article (feeds are a partial analogue of this in blogs). Information from external sources such as press agencies is generally reused rather than referenced. In sum, news makes fewer assumptions about the reader's background and context knowledge than many other texts.

– Nonstandard language and subjectivity: The language in blogs ranges from high-quality, "news-like" language via poor-quality, restricted-code language with many spelling and grammatical errors to creative, sometimes literary, language. A blog may employ high-quality language but operate outside the news genre or across journalistic genres (e.g., combining current-events reporting with commentary and background information). Jargon is very common in blogs, and new linguistic developments are adopted far more quickly than could be reflected in external resources such as lexica. Many blog authors strive not for objectivity but for subjectivity and emotionality.

– Thematic diversity and new forms of categorization: News are generally categorized by topic area ("politics," "business," etc.). In contrast, a blog author may choose to write about differing, arbitrary topics. When blogs are labeled, it is usually not with reference to a stable, taxonomic system, but with an arbitrary number of tags: free-form, often informal labels chosen by the user.

– Context and its impact on content and meaning: The content of a blog (post) is often not contained in the text alone. Rather, blog software supports "Web" and "Social Web" behavior, and bloggers practice it: multiway communication rather than broadcasting and semantics-inducing referencing of both content and people. Specifically, hyperlinks to other resources provide not only context but also content, as do links to and from cited resp. citing people/sources. The latter evolved from "blogrolls" resp. "trackback links" in early blog software to "followees" and "retweet" links resp. "followers" in platforms such as Twitter.

Structure of the Learning System

Tasks

From a text mining point of view, tasks can be grouped by different criteria:

– Basic task and type of result: description, classification, and prediction (supervised or unsupervised, includes, for example, topic identification, tracking, and/or novelty detection, spam detection), search (ad hoc or filtering), recommendation (of blogs, blog posts, or (hash-)tags), and summarization

– Higher-order characterization to be extracted: especially topic or event, opinion, or sentiment

– Time dimension: nontemporal, temporal (stream mining), and multiple streams (e.g., in different languages; see cross-lingual ▶ text mining)

– User adaptation: none (no explicit mention of user issues and/or general audience), customizable, and personalized

Real-world applications increasingly employ selections or, more often, combinations of these tasks by their intended users and use cases, in particular:

– News aggregators allow laypeople and professional users (e.g., journalists) to see "what's in the news" and to compare different sources' texts on one story. Reflecting the presumption that news (especially mainstream news – sources for news aggregators are usually whitelisted) are mostly objective/neutral, these aggregators focus on topics and events. News aggregators are now provided by all major search engines.

– Social-media monitoring tools allow laypeople and professional users to track not only topical mentions of a keyword or named entity (e.g., person, brand) but also aggregate sentiment toward it. The focus on sentiment reflects the perceptions that even when news-related, social-media content tends to be subjective and that studying the blogosphere is therefore an inexpensive way of doing market research or public opinion research. The whitelist here is usually the platforms (e.g., Twitter, Tumblr, LiveJournal, Facebook) rather than the sources themselves, reflecting the huge size and dynamic structure of the blogosphere/the Social Web. The landscape of commercial and free social-media monitoring tools is wide and changes frequently; up-to-date overviews and comparisons can easily be found on the Web.

– Emerging application types include text mining not of but for journalistic texts, in particular natural language generation in domains with highly schematized event structures and reporting, such as sports and finance reporting (e.g., Allen et al. 2010, narrativescience.com), and social-media monitoring tools for helping journalists find sources (Diakopoulos et al. 2012).

Some tools have dashboard-style interfaces and complex data graphics, which may be most interesting for some professional users. However, the increasing move especially of casual users toward mobile devices with small screens has led to most applications showing original content and mining output that consists of (especially short) texts and a small number of (especially numeric) analytics.

Solution Approaches

Standardization: Tasks, Datasets, and APIs

The development of methods for mining news, blogs, and social media in general has profited from standard datasets and standard tasks and competitions. Prominent examples are the Reuters-21578 dataset, which is not only a collection of newswire articles but also the most classical dataset for text mining in general (https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection); the larger and also multilingual RCV1, RCV2, and TRC2 datasets (http://trec.nist.gov/data/reuters/reuters.html); the blog datasets provided by the International Conference on Weblogs and Social Media (ICWSM, http://www.icwsm.org); and the SNAP datasets (https://snap.stanford.edu/data). The Topic Detection and Tracking (TDT) research program and workshops (http://www.itl.nist.gov/iad/mig/tests/tdt; Allan 2002) were essential in the formation of news mining as a research topic. Important tasks and competitions that are ongoing, and that also offer important datasets, include the Text Retrieval Conference (TREC, http://trec.nist.gov) and the Text Analysis Conference (TAC, http://www.nist.gov/tac), formerly the Document Understanding Conference (DUC, http://duc.nist.gov). The history of tracks/tasks over time in these conferences also illustrates how fields have matured or become less relevant; for example, "blog tracks" have been replaced since 2010 by "microblog tracks," and "topic detection" has given way to "event detection."

Standard datasets are one answer to a central problem in news, blogs, and social-media mining in general. Since most platforms are commercial, they restrict access to their current or archived editions. Other platforms offer a free API but make it return a sample whose representativeness and/or even sampling criteria are not known; this can affect mining results severely (Morstatter et al. 2013). In addition, the terms of use present a challenge for creating reusable datasets (for a solution approach, see McCreadie et al. 2012).

A further caveat concerns all social-media mining results: In general, APIs only give access to "public" posts and not to posts that users have set to "private" or otherwise limited to a restricted audience. In addition, having gained access to an individual's online communication does not mean one may use or process it. Thus, privacy and data protection considerations limit the uses of social media for research, and they require careful interpretations of the results: these may be representative of the public utterances of users, but not all of their online communication.

The Modeling Phase of Text Mining

Solution approaches are based on general data mining methods and adapted to the conceptual specifics of news and blogs and their mining tasks (see the list of tasks above). Methods include (document) ▶ classification and ▶ clustering and latent-variable techniques such as (P)LSA or LDA (cf. ▶ feature construction; specifically for an overview of topic models, see Blei 2012), ▶ mixture models, ▶ time series, and ▶ stream mining methods.
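As one concrete example of these latent-variable techniques, the following sketch (a toy illustration with scikit-learn; the corpus, topic count, and other settings are arbitrary choices, not recommendations) extracts two LDA topics from a handful of news snippets:

```python
# Small illustrative sketch of topic modelling on news texts with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The central bank raised interest rates amid inflation fears",
    "The championship final ended with a dramatic penalty shootout",
    "Markets fell as investors reacted to the rate decision",
    "The striker scored twice in the cup final",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top_terms)}")
```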

Named-entity recognition (e.g., Atkinson and Van der Goot 2009; Ritter et al. 2011; Li et al. 2012) is an important part or companion of tasks such as topic detection or text enrichment (e.g., Stajner et al. 2010). Topic tracking and event threading are used to follow a news story unfolding over time (e.g., Shahaf and Guestrin 2010), and especially for the purposes of summarization over time, special attention is paid to bursty topics or events (term introduced by Kleinberg 2002; see Subasic and Berendt 2013 for further references and empirical comparison), i.e., those that are marked by "spikes" in the frequency or other weight of reporting at certain points in time.

Information extraction can help to extract the event(s) of a news story. Events involve named entities (e.g., people and locations), a time, and a characterization of what the event is about. Information extraction can leverage background ontologies (e.g., Kuzey et al. 2014). This covers the first four of the "five Ws" of a news story; the "why" and "how" at present remain to be extracted by human readers from the original text (which is therefore generally accessible from platforms; see remarks on semiautomatic sensemaking below). Clustering can be useful for the extraction of events from multilingual sources (Leban et al. 2014). Regularities in how reporting (or the world?) evolves have also been used for predicting events from news (Radinsky and Horvitz 2013). The brevity of microblogs combined with the speed and volume of their streams poses special challenges for event detection (McCreadie et al. 2013).

Sentiment analysis and opinion mining are key especially for analyzing blogs and other social media (see overviews in Feldman 2013; Pang and Lee 2007; Potts 2013), and they are evolving toward more sophisticated methods that take syntactic structure and background knowledge/semantics into account (e.g., Gangemi et al. 2014). Sentiment analysis and opinion mining are designed to detect and classify "subjective" content and as such describe (some) social-media content well. They can also be appropriate for "subjective" journalistic genres such as commentary. However, this does not mean that news is really – or can ever be truly – objective. The often subtle and often subconscious structures, backgrounds, and convictions that express themselves in how a news story is told are referred to as media bias or framing, and text mining has begun to address them (e.g., Recasens et al. 2013; Pollak et al. 2011; Odijk et al. 2013).
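At its simplest, sentiment scoring can be lexicon based, as in the sketch below; the tiny word lists are illustrative, and the more sophisticated systems cited above use much richer lexica, syntax, and semantics:

```python
# Deliberately simple lexicon-based sentiment sketch for short posts.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text):
    """Return a crude polarity score: positive minus negative word hits."""
    score = 0
    for token in text.lower().split():
        word = token.strip("#@!,.?")
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

print(sentiment("Love the new phone, battery life is great!"))   # 2
print(sentiment("Terrible service, I hate waiting"))              # -2
```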

Further classification tasks that are specifically relevant for news and blogs are generally solved with features that are characteristic of the domain and/or can be easily extracted from its data. They include (a) geolocation (e.g., Hale et al. 2012), (b) recommendation (e.g., tracking multiple topics over time in news, personalized to a user whose interests may change over time, was developed by Pon et al. 2007; an approach for microblogs was proposed by Ren et al. 2013), and (c) spam detection and blocking (Kolari et al. 2006; for a general overview, see Castillo and Davison 2011).

Text summarization (for an overview, see Fiori 2014; specifically for microblogs, see Mackie et al. 2014) is a key technique for helping users to get an overview of (a) a single document's key messages or (b) a multitude of different documents, often from different sources that in turn may have copied from one another. Today, most summarizations are extractive, either extracting key sentences or non-sentence structures such as graphs. In real-world applications, even simpler forms are still predominant, including the extraction of single terms based, for example, on frequency and their display in tag clouds, and the use of the first sentences of news articles that, by journalistic writing conventions, are designed to summarize the text. Abstractive summarization involves the generation of natural language, which remains a hard problem. Today, it is used mostly for text genres that are highly schematized, such that templates can be used and filled with the entities/constants relevant to the story at hand (see "Emerging application types" above).
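A minimal sketch of frequency-based extractive summarization, of the simple kind mentioned above, is shown below; the stopword list and the scoring are deliberately crude:

```python
# Toy extractive summarizer: score each sentence by the average corpus
# frequency of its non-stopword terms and keep the top sentences in
# their original document order.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "on", "is", "was", "for"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)
    best = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in best)

article = ("The storm hit the coast overnight. Thousands lost power. "
           "Officials said the storm was the strongest in a decade. "
           "Schools will remain closed.")
print(summarize(article))
```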

Texts, or text summaries, can be represented not only as bags of words, sets of topics or events, but also as graphs in which words and/or named entities stand in multiple relations to one another (see Berendt et al. 2014, for examples and further references). (Shallow) semantic parsing is often used to extract triples (e.g., subject-predicate-object statements) (e.g., Stajner et al. 2010; Sudhahar et al. 2015).

Text-based modeling can be enhanced by (e.g., social) network structure (e.g., Mei et al. 2008) (cf. ▶ link mining and link discovery). The analysis of how the actors in a network influence one another is important for the domain of news and social media (Guille et al. 2013). Such analyses are applied not only to individual text producers but more often to whole domains. One general question is how blogs and news, viewed in the aggregate, refer to and contextualize each other (e.g., Gamon et al. 2008; Berendt and Trumper 2009; Leskovec et al. 2009).

Specifics of Data Understanding, Data Cleaning, and Data Preparation

Data cleaning is similar to that of other online documents; in particular, it requires the provision or learning of wrappers for removing mark-up elements. Analysis methods that focus on text mining usually ignore hypermedia elements such as photographs and videos or use only their metadata.

While news texts employ standard language and can be handled with general-purpose text-analysis software, the language of (micro-)blogs requires specific lexica (e.g., containing the frequently used emoticons), abbreviation expansion and grammatical rules, and similar techniques (see "Noah's ARK" at http://www.ark.cs.cmu.edu/TweetNLP/ for a suite of tools and references), and linguists have found that rather than being "wrong" and ungrammatical, microblogs are evolving toward new systems that resemble spoken language and indicate nuances such as geographical region (Eisenstein 2015). Like other social media, they often contain irony and other indirect uses of language for expressing appreciation or discontent (e.g., Veale and Hao 2010), and this remains a major stumbling block for the machine understanding of these texts.
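A small sketch of such microblog-specific preprocessing is given below; the emoticon and abbreviation tables are tiny illustrative samples rather than the resources referenced above:

```python
# Sketch of microblog preprocessing: URLs and user mentions become
# placeholders, a few emoticons are mapped to tokens, and common
# abbreviations are expanded.
import re

EMOTICONS = {":)": "EMO_POS", ":(": "EMO_NEG", ":D": "EMO_POS"}
ABBREVIATIONS = {"u": "you", "r": "are", "gr8": "great", "imo": "in my opinion"}

def normalize_tweet(text):
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "USER", text)
    for emoticon, token in EMOTICONS.items():
        text = text.replace(emoticon, f" {token} ")
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in text.split()]
    return " ".join(tokens)

print(normalize_tweet("@alice u r gr8 :) see https://example.com"))
# -> "USER you are great EMO_POS see URL"
```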

The semi-structured nature of blogs and news can give valuable cues for understanding. For example, the format elements "timestamp" and "number of comments" can be treated as indicators of increased topical relevance and likelihood of being opinionated, respectively (Mishne 2007). A combination of text clustering and tag analysis can serve to identify topics as well as the blogs that are on topic and likely to retain this focus over time (Hayes et al. 2007). Twitter hashtags have been used, for example, as indicators of sentiment (Wang et al. 2011).

Like other online texts, news and blogs make frequent use of hyperlinks, and the content of linked materials may be necessary even for a human reader to understand a post. This is particularly true for microblogs that are often mere pointers to a URL, or a URL plus a short comment. Many mining methods therefore enrich the text by, for example, the contents of referenced URLs (e.g., Abel et al. 2011). Semantic enrichment can also utilize external (semi-)structured data; for example, Wikification can add context information to microblogs by drawing on Wikipedia or DBPedia (e.g., Cheng and Roth 2013). All these methods can help to enrich and to disambiguate meaning.

The Importance of Interactive Tools for Semi-automatic Sensemaking

Like most of text mining, machine analyses of news, blogs, and other social media are a first step in a process of human sensemaking, whether for news consumers or for news producers. It is therefore imperative to provide them with interfaces that support further steps. Thus, tools for news consumers (such as news aggregators) typically provide links to the original articles. Tools for news producers show statistics (such as aggregate opinions of "the crowd" or properties of one potential source) as information for journalists, and topics or events detected in corpora are generally a starting point for a story, but not a story in and of themselves. Reading, understanding, and writing news and blogs can probably never be totally automated. One reason for this is that different people read a given text differently, which is well known in social science media research but still often neglected in computational research – maybe because it requires us to question key methodological concepts of text mining such as "the ground truth." Interactive tools for story detection and tracking have been proposed as an answer to this dilemma (Berendt et al. 2014), and drag-and-drop story editors are used to create one's own new story (storify.com).

In addition, text mining as a method for dealing with large data volumes is often in competition with, or combined with, human intelligence for doing the same. Thus, for example, the contributions from many (often unpaid) volunteers and interface elements such as voting constitute the "social news aggregator" reddit.com, and Twitter's "retweeting" is a major, and human-led, way in which tweets are fed into, and develop influence across, multiple sub-networks formed by the platform's users. In these human-machine collaborations, the algorithms employed by a platform, however, are not neutral companions, but shape how users perceive others' opinions, which in turn affects their further posting behavior. For example, Twitter's "trending topics" algorithm rewards bursty topics (cf. Wilson 2013). This implies that even a topic contained in many tweets can, if the interest over time remains stable, disappear from the trending topics and thereby from public visibility. The implications of such algorithmic decisions on user choices and perceptions as well as public decisions and policy are a new research topic that will be relevant not only for text mining.

Recommended Reading

Abel F, Gao Q, Houben G-J, Tao K (2011) Semantic enrichment of Twitter posts for user profile construction on the social web. In: Proceedings of ESWC (2), pp 375–389

Allan J (ed) (2002) Topic detection and tracking: event-based information organization. Kluwer Academic Publishers, Norwell

Allen ND, Templon JR, McNally PS, Birnbaum L, Hammond K (2010) StatsMonkey: a data-driven sports narrative writer. In: Proceedings of 2010 AAAI fall symposium series. AAAI Press. http://www.aaai.org/ocs/index.php/FSS/FSS10/paper/view/2305

Atkinson M, Van der Goot E (2009) Near real time information mining in multilingual news. In: Proceedings of the 18th international conference on World Wide Web (WWW'09). ACM, New York, pp 1153–1154

Berendt B, Last M, Subasic I, Verbeke M (2014) New formats and interfaces for multi-document news summarization and its evaluation. In: Fiori, pp 231–255

Berendt B, Trumper D (2009) Semantics-based analysis and navigation of heterogeneous text corpora: the Porpoise news and blogs engine. In: Ting I-H, Wu H-J (eds) Web mining applications in e-commerce and e-services. Springer, Berlin

Blei DM (2012) Probabilistic topic models. Commun ACM 55(4):77–84

Castillo C, Davison BD (2011) Adversarial web search. Found Trends Inf Retr 4(5):377–486. doi:10.1561/1500000021

Cheng X, Roth D (2013) Relational inference for Wikification. In: Proceedings of EMNLP 2013, pp 1787–1796

Diakopoulos N, De Choudhury M, Naaman M (2012) Finding and assessing social media information sources in the context of journalism. In: Proceedings of CHI 2012. ACM, pp 2451–2460

Eisenstein J (2017) Written dialect variation in online social media. In: Boberg C, Nerbonne J, Watt D (eds) The handbook of dialectology. Wiley-Blackwell, Hoboken. Preprint available at http://www.cc.gatech.edu/jeisenst/papers/dialectology-chapter.pdf

Feldman R (2013) Techniques and applications for sentiment analysis. Commun ACM 56(4):82–89

Fiori A (ed) (2014) Innovative document summarization techniques: revolutionizing knowledge understanding. IGI Global, Hershey

Gamon M, Basu S, Belenko D, Fisher D, Hurst M, Konig AC (2008) BLEWS: using blogs to provide context for news articles. In: Adar E, Hurst M, Finin T, Glance N, Nicolov N, Tseng B, Salvetti F (eds) Proceedings of the second international conference on weblogs and social media (ICWSM'08), Seattle/Menlo Park. http://www.aaai.org/Papers/ICWSM/2008/ICWSM08-015.pdf

Gangemi A, Presutti V, Reforgiato Recupero D (2014) Frame-based detection of opinion holders and topics: a model and a tool. IEEE Comput Intell Mag 9(1):20–30

Guille A, Hacid H, Favre C, Zighed DA (2013) Information diffusion in online social networks: a survey. SIGMOD Rec 42(2):17–28

Hale S, Gaffney D, Graham M (2012) Where in the world are you? Geolocation and language identification in Twitter. In: Proceedings of ICWSM'12, pp 518–521

Hayes C, Avesani P, Bojars U (2007) An analysis of bloggers, topics and tags for a blog recommender system. In: Berendt B, Hotho A, Mladenic D, Semeraro G (eds) From web to social web: discovering and deploying user and content profiles. LNAI 4737. Springer, Berlin

Kleinberg JM (2002) Bursty and hierarchical structure in streams. In: Proceedings of SIGKDD 2002, pp 91–101

Kolari P, Java A, Finin T, Oates T, Joshi A (2006) Detecting spam blogs: a machine learning approach. In: Proceedings of the 21st national conference on artificial intelligence. AAAI, Boston

Kuzey E, Vreeken J, Weikum G (2014) A fresh look on knowledge bases: distilling named events from news. In: Proceedings of CIKM 2014, pp 1689–1698

Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of WWW. ACM, pp 591–600

Leban G, Fortuna B, Brank J, Grobelnik M (2014) Event registry: learning about world events from news. In: Proceedings of WWW 2014 (companion volume), pp 107–110

Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: Elder IV JF, Fogelman-Soulie F, Flach PA, Zaki MJ (eds) Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, Paris/New York

Li C, Weng J, He Q, Yao Y, Datta A, Sun A, Lee B-S (2012) TwiNER: named entity recognition in targeted Twitter stream. In: Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval (SIGIR'12). ACM, New York, pp 721–730. doi:10.1145/2348283.2348380

Mackie S, McCreadie R, Macdonald C, Ounis I (2014) Comparing algorithms for microblog summarisation. In: Proceedings of CLEF 2014, pp 153–159

McCreadie R, Macdonald C, Ounis I, Osborne M, Petrovic S (2013) Scalable distributed event detection for Twitter. In: Proceedings of BigData conference 2013, pp 543–549

McCreadie R, Soboroff I, Lin J, Macdonald C, Ounis I, McCullough D (2012) On building a reusable Twitter corpus. In: Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval (SIGIR'12). ACM, New York, pp 1113–1114. doi:10.1145/2348283.2348495

Mei Q, Cai D, Zhang D, Zhai C (2008) Topic modeling with network regularization. In: Huai J, Chen R (eds) Proceedings of the 17th international conference on World Wide Web (WWW'08), Beijing/New York. doi:10.1007/978-0-387-30164-8 827

Mishne G (2007) Using blog properties to improve retrieval. In: Glance N, Nicolov N, Adar E, Hurst M, Liberman M, Salvetto F (eds) Proceedings of the international conference on weblogs and social media (ICWSM), Boulder. http://www.icwsm.org/papers/paper25.html

Morstatter F, Pfeffer J, Liu H, Carley KM (2013) Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose. In: Proceedings of ICWSM 2013. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/6071

Odijk D, Burscher B, Vliegenthart R, de Rijke M (2013) Automatic thematic content analysis: finding frames in news. In: Social informatics 2013. LNCS 8238. Springer, Berlin, pp 333–345

Pang B, Lee L (2007) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135

Pollak S, Coesemans R, Daelemans W, Lavrac N (2011) Detecting contrast patterns in newspaper articles by combining discourse analysis and text mining. Pragmatics 21(4):647–683

Pon RK, Cardenas AF, Buttler D, Critchlow T (2007) Tracking multiple topics for finding interesting articles. In: Berkhin P, Caruana R, Wu X (eds) Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, San Jose/New York

Potts (2013) Introduction to sentiment analysis (slide set). http://www.stanford.edu/class/cs224u/slides/2013/cs224u-slides-02-26.pdf. Retrieved 15 Feb 2015

Radinsky K, Horvitz E (2013) Mining the web to predict future events. In: Proceedings of WSDM 2013, pp 255–264

Radsch CC (2013) Digital dissidence & political change: cyberactivism and citizen journalism in Egypt. Doctoral Dissertation, School of International Service, American University. Available at SSRN: http://ssrn.com/abstract=2379913

Recasens M, Danescu-Niculescu-Mizil C, Jurafsky D (2013) Linguistic models for analyzing and detecting biased language. In: Proceedings of ACL

Ren Z, Liang S, Meij E, de Rijke M (2013) Personalized time-aware tweets summarization. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval (SIGIR'13). ACM, New York, pp 513–522. doi:10.1145/2484028.2484052

Ritter A, Clark S, Mausam, Etzioni O (2011) Named entity recognition in tweets: an experimental study. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP'11). Association for Computational Linguistics, Stroudsburg, pp 1524–1534

Shahaf D, Guestrin C (2010) Connecting the dots between news articles. In: Proceedings of SIGKDD 2010, pp 623–632

Sudhahar S, de Fazio G, Franzosi R, Cristianini N (2015) Network analysis of narrative content in large corpora. Nat Lang Eng 21(1):81–112

Stajner T, Rusu D, Dali L, Fortuna B, Mladenic D, Grobelnik M (2010) A service oriented framework for natural language text enrichment. Informatica (Ljublj.) 34(3):307–313

Subasic I, Berendt B (2013) Story graphs: tracking document set evolution using dynamic graphs. Intell Data Anal 17(1):125–147

Veale T, Hao Y (2010) Detecting ironic intent in creative comparisons. In: Coelho H, Studer R, Wooldridge M (eds) Proceedings of the 2010 conference on ECAI 2010: 19th European conference on artificial intelligence. IOS Press, Amsterdam, pp 765–770

Wang X, Wei F, Liu X, Zhou M, Zhang M (2011) Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach. In: Berendt B, de Vries A, Fan W, Macdonald C, Ounis I, Ruthven I (eds) Proceedings of the 20th ACM international conference on information and knowledge management (CIKM'11). ACM, New York, pp 1031–1040. doi:10.1145/2063576.2063726

Wilson R (2013) Trending on Twitter: a look at algorithms behind trending topics. Ignite social media blog. http://www.ignitesocialmedia.com/twitter-marketing/trending-on-twitter-a-look-at-algorithms-behind-trending-topics/. Retrieved 15 Feb 2015

Zafarani R, Abbasi MA, Liu H (2014) Social media mining: an introduction. Cambridge University Press, Cambridge

Text Mining for Spam Filtering

Aleksander Kołcz
Microsoft, One Microsoft Way, Redmond, WA, USA

Synonyms

Commercial email filtering; Junk email filtering; Spam detection; Unsolicited commercial email filtering

Definition

Spam filtering is the process of detecting unsolicited commercial email (UCE) messages on behalf of an individual recipient or a group of recipients. Machine learning applied to this problem is used to create discriminating models based on labeled and unlabeled examples of spam and nonspam. Such models can serve populations of users (e.g., departments, corporations, ISP customers) or they can be personalized to reflect the judgments of an individual. An important aspect of spam detection is the way in which textual information contained in email is extracted and used for the purpose of discrimination.

Motivation and Background

Spam has become the bane of existence for both Internet users and entities providing email services. Time is lost when sifting through unwanted messages, and important emails may be lost through omission or accidental deletion. According to various statistics, spam constitutes the majority of emails sent today and a large portion of emails actually delivered. This translates to large costs related to bandwidth and storage use. Spam detection systems help to alleviate these issues, but they may introduce problems of their own, such as more complex user interfaces, delayed message delivery, and accidental filtering of legitimate messages. It is not clear if any one approach to fighting spam can lead to its complete eradication, and a multitude of approaches have been proposed and implemented. Among existing techniques are those relying on the use of supervised and unsupervised machine learning techniques, which aim to derive a model differentiating spam from legitimate content using textual and nontextual attributes. These methods have become an important component of the antispam arsenal and draw from the body of related research such as text classification, fraud detection, and cost-sensitive learning. The text mining component of these techniques is of particular prominence given that email messages are primarily composed of text. Application of machine learning and data mining to the spam domain is challenging, however, due, among other things, to the adversarial nature of the problem (Dalvi et al. 2004; Fawcett 2003).

Structure of the Learning System

Overview

A machine-learning approach to spam filtering relies on the acquisition of a learning sample of email data, which is then used to induce a classification or scoring model, followed by tuning and setup to satisfy the desired operating conditions. Domain knowledge may be injected at various stages into the induction process. For example, it is common to specify a priori features that are known to be highly correlated with the spam label, e.g., certain patterns contained in email headers or certain words or phrases. Depending on the application environment, messages classified as spam are prevented from being delivered (e.g., are blocked or "bounced"), or are delivered with a mechanism to alert users to their likely spam nature. Filter deployment is followed by continuous evaluation of its performance, often accompanied by the collection of error feedback from its users.

Data Acquisition

A spam filtering system relies on the presence of labeled training data, which are used to induce a model of what constitutes spam and what is legitimate email. Spam detection represents a two-class problem, although it may sometimes be desired to introduce special handling of messages for which a confident decision, either way, cannot be made. Depending on the application environment, the training data may represent emails received by one individual or a group of users. Ideally, the data should correspond to a uniform sample acquired over some period of time preceding filter deployment. Typical problems with data collection revolve around privacy issues, whereby users are unwilling to donate emails of a personal or sensitive nature. Additionally, labeling mistakes are common, where legitimate emails may be erroneously marked as spam or vice versa. Also, since for certain types of emails the spam/legitimate distinction is personal, one may find that the same message content is labeled in a conflicting manner by different users (or even by the same user at different times). Therefore, data cleaning and conflict resolution techniques may need to be deployed, especially when building filters that serve a large and diverse user population.

Due to privacy concerns, few large publicly available email corpora exist. The ones created for the TREC Spam Track (TREC data is available from http://plg.uwaterloo.ca/~gvcormac/treccorpus/) stand out in terms of size and availability of published comparative results.

Content Encoding and Deobfuscation

Spam has been evolving in many ways over the course of time. Some changes reflect the shift in content advertised in such messages (e.g., from pornography and pharmaceuticals to stock schemes and phish). Others reflect the formatting of content. While early spam was sent in the form of plain text, it subsequently evolved into more complex HTML, with deliberate attempts to make extraction of meaningful textual features as difficult as possible. Typically, obfuscation (a list of obfuscation techniques is maintained at http://www.jgc.org/tsc.html) aims at:

(a) Altering the text extracted from the message for words visible to the user (e.g., by breaking up the characters in the message source by HTML tags, encoding the characters in various ways, using character look-alikes, wrapping the display of text using script code executed by the message viewer). This tactic is used to hide the message "payload."

(b) Adding content that is not visible to the user (e.g., using the background color or zero-width font to render certain characters/words). This tactic typically attempts to add "legitimate content."

(c) Purposeful misspelling of words known to be fairly incriminating (e.g., Viagra as V1agr@), in a way that allows the email recipient to still understand the spammer's message.

The line of detection countermeasures aimed at preventing effective content extraction continues in the form of image spam, where the payload message is encoded in the form of an image that is easily legible to a human but poses challenges to an automatic content extraction system. To the extent that rich and multimedia content gets sent out by legitimate users in increasing proportions, spammers are likely to use the complexity of these media to obfuscate their messages even further. The very fact that obfuscation is attempted, however, provides an opportunity for machine learning techniques to use obfuscation presence as a feature. Thus, even if payload content cannot be faithfully decoded, the very presence of elaborate encoding may help in identifying spam.
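Both points can be illustrated with a small sketch that undoes simple character substitutions and records whether any obfuscation was seen; the substitution map and the tag-stripping rule are tiny illustrative samples:

```python
# Hedged sketch of (a) undoing simple character-substitution obfuscation
# before feature extraction and (b) recording obfuscation presence as a
# feature in its own right.
import re

SUBSTITUTIONS = str.maketrans({"1": "i", "0": "o", "@": "a", "$": "s", "3": "e"})

def deobfuscate(text):
    # Drop HTML tags that may have been inserted to break up words.
    stripped = re.sub(r"<[^>]+>", "", text)
    normalized = stripped.translate(SUBSTITUTIONS)
    obfuscation_seen = normalized.lower() != text.lower()
    return normalized, obfuscation_seen

print(deobfuscate("V1agr@ at l0w pr1ces"))
# -> ('Viagra at low prices', True)
```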

Feature Extraction and Selection

An email message represents a semistructured document, commonly following the rfc822 standard (www.faqs.org/rfcs/rfc822.html). Its header consists of fields indicative of formatting, authorship, and delivery information, while its body contains the actual content being sent. There can be little correctness enforcement of the header fields, and spamming techniques often rely on spoofing and forging of the header data, although this may provide evidence of tampering. Many early approaches to detecting spam depended predominantly on hand-crafted rules identifying inconsistencies and peculiarities of spam email headers. Manually or automatically generated header features continue to be relevant even when other features (e.g., message text) are considered.

Given that an email message tends to be primarily text, features traditionally useful in text categorization have also been found useful in spam detection. These include individual words, phrases, character n-grams, and other textual components (Siefkes et al. 2004). Natural language processing (NLP) techniques such as stemming, stop-word removal, and case folding are also sometimes applied to normalize the features further. Text extraction is often nontrivial due to the application of content obfuscation techniques. For example, standard lexical feature extractors may need to be strengthened to correctly identify word boundaries (e.g., in cases where groups of characters within a word are separated by zero-width HTML tags).
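For instance, character n-gram features, one of the feature types mentioned above, can be extracted with a few lines of code and are less sensitive to broken word boundaries than whole-word tokens; the sketch below is a minimal illustration:

```python
# Minimal sketch of character n-gram feature extraction from message text.
from collections import Counter

def char_ngrams(text, n=4):
    normalized = " ".join(text.lower().split())   # collapse whitespace
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

features = char_ngrams("Cheap med1cations online!!!")
print(features.most_common(5))
```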

Extraction of features from nontextual attachments (e.g., images, audio, and video) is also possible but tends to be more computationally demanding. Other types of features capture the way a message is formatted, encoded in HTML, composed of multiple parts, etc.

Although nontextual features have different properties than text, it is common practice to combine them with textual features and present a single unified representation to the classifier. Indeed, some approaches make no distinction between text and formatting even during the process of feature extraction, and apply pattern discovery techniques to identifying complex features automatically (Rigoutsos and Huynh 2004). The advantage of such techniques is that they do not require rich domain knowledge and can discover new useful patterns. Due to the large space of possible patterns, they can potentially be computationally expensive. However, even the

seemingly simplistic treatment of an email message as a plain-text document with "words" delimited by white space often leads to very good results.

Even though typical text documents are already very sparse, the problem is even more pronounced for the email medium due to frequent misspelling and deliberate randomization performed by spammers. Insisting on using all such variations may lead to overfitting for some classifiers, and it leads to large filter memory footprints that are undesirable from an operational standpoint. However, due to the constantly changing distribution of content, it may be dangerous to rely on very few features. Traditional approaches to feature selection based on measures such as Information Gain have been reported as successful in the spam filtering domain, but even simple rudimentary attribute selection based on removing very rare and/or very frequent features tends to work well.
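
The sketch below illustrates the rudimentary document-frequency-based selection mentioned above, again assuming scikit-learn; the corpus and thresholds are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "free offer click here",
    "free offer limited time",
    "project meeting notes attached",
    "meeting notes for the project",
]

# Rudimentary attribute selection: drop very rare terms (document
# frequency below 2) and very frequent ones (present in more than 90%
# of the messages). The thresholds are illustrative, not prescriptive.
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.get_feature_names_out()))
```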

There are a number of entities that can be extracted from message text and that tend to be of relevance in spam detection. Among others, there are telephone numbers and URLs. In commercial email and in spam, these provide a means of ordering products and services and thus offer important information for vendor and campaign tracking. Detection of signature and mailing address blocks can also be of interest, even if only to indicate their presence or absence.

Learning Algorithms

A variety of learning algorithms have been applied in the spam filtering domain. These range from linear classifiers such as Naive Bayes (Metsis et al. 2006), logistic regression (Goodman and Yih 2006), or linear support vector machines (Drucker et al. 1999; Kołcz and Alspector 2001; Sculley and Wachman 2007) to nonlinear ones such as boosted decision trees (Carreras and Marquez 2001). Language modeling and statistical compression techniques have also been found quite effective (Bratko et al. 2006). In general, due to the high dimensionality of the feature space, the classifier chosen should be able

to handle tens of thousands of attributes, or more, without overfitting the training data.

It is usually required that the learned model provides a scoring function such that, for an email message x, score(x) ∈ R, with higher score values corresponding to higher probability of the message being spam. The score function can also be calibrated to represent the posterior probability P(spam | x) ∈ [0, 1], although accurate calibration is difficult due to constantly changing class and content distributions. The scoring function is used to establish a decision rule:

score(x) ≥ th → spam

where the choice of the decision threshold th is driven by the minimization of the expected cost. In the linear case, the scoring function takes the form

score(x) = w · x + b

where w is the weight vector and x is a vector representation of the message. Sometimes scores are normalized with a monotonic function, e.g., to give an estimate of the probability of x being spam.
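
A small sketch of the linear decision rule described above (the features, weights, and threshold are purely illustrative):

```python
import numpy as np

def score(x, w, b):
    # Linear scoring function: score(x) = w . x + b
    return float(np.dot(w, x) + b)

def classify(x, w, b, th=0.0):
    # Decision rule: score(x) >= th  ->  spam
    return "spam" if score(x, w, b) >= th else "legitimate"

def spam_probability(x, w, b):
    # Monotonic (logistic) normalization of the raw score into [0, 1];
    # a rough surrogate for P(spam | x), not a calibrated probability.
    return 1.0 / (1.0 + np.exp(-score(x, w, b)))

# Toy 3-feature message representation and weights
x = np.array([1.0, 0.0, 2.0])
w = np.array([0.8, -1.2, 0.5])
b = -0.3
print(classify(x, w, b), spam_probability(x, w, b))
```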

Linear classifiers tend to provide sufficiently high accuracy, which is also consistent with other application domains involving the text medium. In particular, many variants of the relatively simple Naive Bayes classifier have been found successful in detecting spam, and Naive Bayes often provides a baseline for systems employing more complex classification algorithms (Metsis et al. 2006).

One Model Versus Multiple Models

It often pays off to combine different types of classifiers (even different linear ones) in a sequential or parallel fashion to benefit from the fact that different classifiers may provide an advantage in different regions of the feature space. Stacking via ▸ linear regression has been reported to be effective for this purpose (Sakkis et al. 2001; Segal et al. 2004). One can generally distinguish between cases where all classifiers are induced over the same data and cases where several different datasets are used. In the former case, the

combination process exploits the biases of different learning algorithms. In the latter case, one can consider building a multitude of detectors, each targeting a different subclass of spam (e.g., phish, pharmaceutical spam, "Nigerian" scams, etc.). Datasets can also be defined on a temporal basis, so that different classifiers have shorter or longer memory spans. Other criteria for providing different datasets are also possible (e.g., based on the language of the message).

Additional levels of complexity in the classifier combination process can be introduced by considering alternative feature representations for each dataset. For example, a single data collection and a single learning method can be used to create several different classifiers, based upon alternative representations of the same data (e.g., using just the header features or just the message text features).

The method of classifier combination will necessarily depend on their performance and intended areas of application. The combination regimes can range from a simple logical OR, through linear combinations, to complex nonlinear rules, either derived automatically to maximize the desired performance or specified manually with the guidance of expert domain knowledge.

Off-Line Adaptation Versus Online Adaptation

A spam filtering system can be configured to receive instant feedback from its users, informing it whenever certain messages get misdelivered (this necessarily excludes cases where misclassified legitimate messages are simply blocked). In the case of online filters, the feedback information may be immediately used to update the filtering profile. This allows a filter to adjust to the changing distribution of email content and to detection countermeasures employed by spammers. Not all classifiers are easily amenable to the online learning update, although online versions of well-known learners such as logistic regression (Goodman and Yih 2006) and linear SVMs (Sculley and Wachman 2007) have been proposed. The distinguishing factor is the amount of the original training data that needs to be

retained in addition to the model itself to perform future updates. In this respect, Naive Bayes is particularly attractive since it does not require any of the original data for adaptation, with the model itself providing all the necessary information.

One issue with the user feedback signal, however, is its bias toward current errors of the classifier; for learners that assume the training data to be an unbiased sample drawn from the underlying distribution, this may lead to overcompensation rather than an improvement in filtering accuracy. As an alternative, unbiased feedback can be obtained either by selectively querying users about the nature of uniformly sampled messages or by deriving the labels implicitly.

In the case where off-line adaptation is in use, the feedback data is collected and saved for later use, whereby the filtering models are retrained periodically or only as needed using the data collected. The advantage of off-line adaptation is that it offers more flexibility in terms of the learning algorithm and its optimization. In particular, model retraining can take advantage of a larger quantity of data and does not have to be constrained to be an extension of the current version of the model. As a result, it is, e.g., possible to redefine the features from one version of the spam filter to the next. One disadvantage is that model updates are likely to be performed less frequently and may be lagging behind the most recent spam trends.
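
As a hedged illustration of the online variant, the sketch below uses scikit-learn's SGDClassifier with logistic loss and its partial_fit method to update a model from a single piece of user feedback. This is just one possible realization of online training, not the specific systems cited above, and the loss name may differ across library versions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online filter update: the model is adjusted from user feedback one
# (or a few) messages at a time, without retaining the original data.
clf = SGDClassifier(loss="log_loss", random_state=0)

classes = np.array([0, 1])                         # 0 = legitimate, 1 = spam
X_init = np.array([[1.0, 0.0], [0.0, 1.0]])
y_init = np.array([1, 0])
clf.partial_fit(X_init, y_init, classes=classes)   # initial training

# Later: a user reports a misdelivered message; update the model in place.
x_feedback = np.array([[0.9, 0.1]])
y_feedback = np.array([1])
clf.partial_fit(x_feedback, y_feedback)
print(clf.predict(x_feedback))
```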

User-Specific Versus User-Independent Spam Detection

What constitutes a spam message tends to be personal, at least for some types of spam. Various commercial messages, such as promotions and advertisements, may be distributed in a solicited or unsolicited manner, and sometimes only the end recipient may be able to judge which. In that sense, user-specific spam detection has the potential of being most accurate, since a user's own judgments are used to drive the training process. Since the nonspam content received by any particular user is likely to be more narrowly distributed when compared to a larger user population, this makes the discrimination problem much simpler. Additionally, in the adversarial

context, a spammer should find it more difficult to measure the success of penetrating personalized filter defenses, which makes it more difficult to craft a campaign that reaches sufficiently many mail inboxes to be profitable.

One potential disadvantage of such solutions is the need to acquire labeled data on a user-by-user basis, which may be challenging. For some users, historical data may not yet exist (or has already been destroyed); for others, even if such data exist, labeling may seem too much of a burden. Aside from the data collection issues, personal spam filtering faces maintainability issues, as the filter is inherently controlled by its user. This may result in less-than-perfect performance, e.g., if the user misdirects filter training.

From the perspective of institutions and email service providers, it is often more attractive to maintain just one set of spam filters that services a larger user population. This makes the filters simpler to operate and maintain, but their accuracy may depend on the context of any particular user. The advantage of centralized filtering when serving large user populations is that global trends can be more readily spotted and any particular user may be automatically protected against spam affecting other users. Also, the domain knowledge of the spam-filtering analysts can be readily injected into the filtering pipeline.

To the extent that a service provider maintains personal filters for its population of users, there are potentially large system costs to account for, so that a complete cost-benefit analysis needs to be performed to assess the suitability of such a solution as opposed to a user-independent filtering complex. More details on the nature of such trade-offs can be found in Kołcz et al. (2006).

Clustering and Volumetric Techniques

Content clustering can serve as an important data understanding technique in spam filtering. For example, large clusters can justify the use of specialized classifiers and/or the use of cost-sensitive approaches in classifier learning and evaluation (e.g., where different costs are assigned to different groups of content within each class (Kołcz and Alspector 2001)).

Both spam and legitimate commercial emails are often sent in large campaigns, where the same or highly similar content is sent to a large number of recipients, sometimes over prolonged periods of time. Detection of email campaigns can therefore play an important role in spam filtering. Since individual messages of a campaign are highly similar to one another, this can be considered a variant of near-replica document detection (Kołcz 2005). It can also be seen as relying on identification of highly localized spikes in the content density distribution. As found in Yoshida et al. (2004), density distribution approaches can be highly effective, which is especially attractive given that they do not require explicitly labeled training data. Tracking of spam campaigns may be made difficult by content randomization, and some research has been directed at making the detection methods robust in the presence of such countermeasures (Kołcz 2005; Kołcz and Chowdhury 2007).
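
A simplified sketch of campaign (near-replica) detection: each message is reduced to a small set of shingle hashes, and two messages are compared by the overlap of their sketches. This is only a min-hash-like toy, not the fingerprinting methods of the cited work.

```python
import hashlib

def fingerprints(text, k=4, keep=8):
    # Hash every k-word shingle and keep the `keep` smallest hashes as
    # a compact sketch of the message content (a min-hash-like idea).
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return set(hashes[:keep])

def resemblance(a, b):
    # Jaccard overlap of the two sketches; near-replicas score close to 1
    return len(a & b) / len(a | b)

m1 = "Act now to claim your exclusive prize before the offer expires today"
m2 = "Act now to claim your exclusive prize before this offer expires today"
print(resemblance(fingerprints(m1), fingerprints(m2)))
```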

Misclassification Costs and Filter Evaluation

An important aspect of spam filtering is that the costs of misclassifying spam as legitimate email are not the same as the costs of making the opposite mistake. It is thus commonly assumed that the costs of a false positive mistake (i.e., a legitimate email being misclassified as spam) are much higher than the cost of mistaking spam for legitimate email. Given the prevalence of spam π and the false-spam (FS) and false-legitimate (FL) rates of the classifier, the misclassification cost c can be expressed as

c = C_FS · (1 − π) · FS + C_FL · π · FL

where C_FS and C_FL are the costs of making a false-spam and false-legitimate mistake, respectively (there is no penalty for making the correct decision). Since actual values of C_FS and C_FL are difficult to quantify, one typically sees them combined in the form of a ratio, λ = C_FS / C_FL, and the overall cost can be expressed relative to the cost of a false-legitimate misclassification, e.g.,

c_rel = λ · (1 − π) · FS + π · FL
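
For concreteness, the relative cost can be computed as in the sketch below (the numbers are purely illustrative):

```python
def relative_cost(pi, fs_rate, fl_rate, lam):
    # c_rel = lambda * (1 - pi) * FS + pi * FL, expressed relative to the
    # cost of one false-legitimate mistake (lambda = C_FS / C_FL)
    return lam * (1.0 - pi) * fs_rate + pi * fl_rate

# Illustrative numbers: 70% of traffic is spam, 0.1% false-spam rate,
# 5% false-legitimate rate, and a false-spam mistake costs 100x more.
print(relative_cost(pi=0.7, fs_rate=0.001, fl_rate=0.05, lam=100))  # -> 0.065
```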

Practical choices of λ tend to range from 1 to 1,000. Nonuniform misclassification costs can be used during the process of model induction or in postprocessing when setting up the operating parameters of a spam filter, e.g., using receiver operating characteristic (ROC) analysis.

Since the costs and cost ratios are sometimes hard to define, some approaches to evaluation favor direct values of the false-spam and false-legitimate error rates. This captures the intuitive requirement that an effective spam filter should provide a high detection rate at a close-to-zero false-spam rate. Alternatively, threshold-independent metrics such as the area under the ROC curve (AUC) can be used (Bratko et al. 2006; Cormack and Lynam 2006), although other measures have also been proposed (Sakkis et al. 2001).
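
A sketch of this style of evaluation with scikit-learn (toy labels and scores): compute the AUC, then pick the threshold with the largest detection rate whose false-spam rate stays close to zero.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])          # 1 = spam
scores = np.array([-2.1, -0.3, 0.2, 0.9, 1.7, 2.4, 0.1, -1.0])

print("AUC:", roc_auc_score(y_true, scores))

# Threshold selection: keep the false-spam (false positive) rate near
# zero while maximizing the spam detection rate.
fpr, tpr, thresholds = roc_curve(y_true, scores)
ok = fpr <= 0.01
best = np.argmax(tpr[ok])
print("threshold:", thresholds[ok][best], "detection rate:", tpr[ok][best])
```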

Adaptation to Countermeasures

Spam filtering is an inherently adversarial task, where any solution deployed on a large scale is likely to be met with a response on the part of the spammers. To the extent that the success of a spam filter can be pinpointed to any particular component (e.g., the type of features used), that prominent component is likely to be attacked directly and may become a victim of its own success. For example, the use of word features in spam filtering encourages countermeasures in the form of deliberate misspellings, word fragmentation, and "invisible ink" in HTML documents. Also, since some words are considered by a model inherently more legitimate than others, "word stuffing" has been used to inject large blocks of potentially legitimate vocabulary into an otherwise spammy message in the hope that this information outweighs the evidence provided by the spam content (Lowd and Meek 2005).

Some authors have attempted to put the adversarial nature of spam filtering in the formal context of game theory (Dalvi et al. 2004). One difficulty of drawing broad conclusions based on such analyses is the breadth of the potential attack/defense front, of which only small sections have been successfully captured in the game-theory formalism. The research on countering the countermeasures points at using multiple diverse filtering components and at normalization of features

to keep them invariant to irrelevant alterations. A key point is that frequent filter retraining is likely to help in keeping up with the shifts in content distribution, both natural and due to countermeasures.

Future Directions

Reputation Systems and Social Networks

There has been a growing interest in developing reputation systems capturing the trustworthiness of a sender with respect to a particular user or group of users. To this end, however, the identity of the sender needs to be reliably verified, which poses challenges and presents a target for potential abuses of such systems. Nevertheless, reputation systems are likely to grow in importance, since they are intuitive from the user perspective in capturing the communication relationships between users. Sender reputation can be hard or soft. In the hard variant, the recipient always accepts or declines messages from a given sender. In the soft variant, the reputation reflects the level of trustworthiness of the sender in the context of the given recipient. When sender identities resolve to individual email addresses, the reputation system can be learned via analysis of a large social network that documents who exchanges email with whom. The sender identities can also be broader, however, e.g., assigning reputation to a particular mail server or all mail servers responsible for handling the outbound traffic for a particular domain. On the recipient side, reputation can also be understood globally to represent the trustworthiness of the sender with respect to all recipients hosted by the system. Many open questions remain with regard to computing and maintaining reputations as well as using them effectively to improve spam detection. In the context of text mining, one such question is the extent to which email content analysis can be used to aid the process of reputation assessment.

Cross-References

▸Cost-Sensitive Learning
▸Document Categorization
▸Linear Separability
▸Logistic Regression
▸Naïve Bayes
▸Support Vector Machines

Recommended Reading

Bratko A, Cormack GV, Filipic B, Lynam TR, Zupan B (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698
Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. In: Proceedings of RANLP-01, the 4th international conference on recent advances in natural language processing. ACM, New York
Cormack GV, Lynam TR (2006) On-line supervised spam filter evaluation. ACM Trans Inf Syst 25(3):11
Dalvi N, Domingos P, Sanghai MS, Verma D (2004) Adversarial classification. In: Proceedings of the tenth international conference on knowledge discovery and data mining, vol 1. ACM, New York, pp 99–108
Drucker H, Wu D, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 5(10):1048–1054
Fawcett T (2003) 'In vivo' spam filtering: a challenge problem for data mining. KDD Explor 5(2):140–148
Goodman J, Yih W (2006) Online discriminative spam filter training. In: Proceedings of the third conference on email and anti-spam (CEAS-2006), Mountain View
Kołcz A (2005) Local sparsity control for naive bayes with extreme misclassification costs. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York
Kołcz A, Alspector J (2001) SVM-based filtering of e-mail spam with content-specific misclassification costs. In: TextDM'2001 (IEEE ICDM-2001 workshop on text mining), San Jose
Kołcz A, Bond M, Sargent J (2006) The challenges of service-side personalized spam filtering: scalability and beyond. In: Proceedings of the first international conference on scalable information systems (INFOSCALE). ACM, New York
Kołcz AM, Chowdhury A (2007) Hardening fingerprinting by context. In: Proceedings of the fourth international conference on email and anti-spam, Mountain View
Lowd D, Meek C (2005) Good word attacks on statistical spam filters. In: Proceedings of the second conference on email and anti-spam (CEAS-2005), Mountain View
Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with naive bayes – which naive bayes? In: Proceedings of the third conference on email and anti-spam (CEAS-2006), Mountain View
Rigoutsos I, Huynh T (2004) Chung-Kwei: a pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (SPAM). In: Proceedings of the first conference on email and anti-spam (CEAS-2004), Mountain View
Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk email. In: AAAI workshop on learning for text categorization, Madison. AAAI technical report WS-98-05
Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. In: Lee L, Harman D (eds) Proceedings of empirical methods in natural language processing (EMNLP 2001), pp 44–50. http://www.cs.cornell.edu/home/llee/emnlp/proceeding.html
Sculley D, Wachman G (2007) Relaxed online support vector machines for spam filtering. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York
Segal R, Crawford J, Kephart J, Leiba B (2004) SpamGuru: an enterprise anti-spam filtering system. In: Proceedings of the first conference on email and anti-spam (CEAS-2004), Mountain View
Siefkes C, Assis F, Chhabra S, Yerazunis W (2004) Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In: Proceedings of the European conference on principle and practice of knowledge discovery in databases. Springer, New York
Yoshida K, Adachi F, Washio T, Motoda H, Homma T, Nakashima A et al (2004) Density-based spam detection. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 486–493

Text Mining for the Semantic Web

Marko Grobelnik1, Dunja Mladenic1, and Michael Witbrock2

1Artificial Intelligence Laboratory, Jozef Stefan Institute, Ljubljana, Slovenia
2Cycorp Inc, Austin, TX, USA

Definition

▸Text Mining methods allow for the incorporation of textual data within applications of semantic technologies on the Web. Application of these techniques is appropriate when some of the data needed for a Semantic Web use scenario are in

textual form. The techniques range from simple processing of text to reducing vocabulary size, through applying shallow ▸ natural language processing to constructing new semantic features or applying information retrieval to selecting relevant texts for analysis, through complex methods involving integrated visualization of semantic information, semantic search, semiautomatic ontology construction, and large-scale reasoning.

Motivation and Background

Semantic Web applications usually involve deep structured knowledge integrated by means of some kind of ontology. Text mining methods, on the other hand, support the discovery of structure in data and effectively support semantic technologies on data-driven tasks such as (semi)automatic ▸ ontology acquisition, extension, and mapping. Fully automatic text mining approaches are not always the most appropriate for combination with Semantic Web content, because often it is too difficult or too costly to fully integrate the available background domain knowledge into a suitable representation. For such cases, semiautomatic methods, such as ▸Active Learning and ▸ Semi-supervised Text Processing (see ▸ Semi-supervised Learning), can be applied to make use of small pieces of human knowledge to provide guidance toward the desired ontology or other models. Application of these semiautomated techniques can reduce the amount of human effort required to produce training data by an order of magnitude while preserving the quality of results.

To date, Semantic Web applications have typically been associated with data, such as text documents, and corresponding metadata that have been designed to be relatively easily manageable by humans. Humans are, for example, very good at reading and understanding text and tables. General semantic technologies, on the other hand, aim more broadly at handling data modalities including multimedia, signals from emplaced or remote sensors, and the structure and content of communication and transportation graphs and networks. In handling such multimodal data,

much of which is not readily comprehensible by unaugmented humans, there must be significant emphasis on fully or semiautomatic methods offered by knowledge discovery technologies whose application is not limited to a specific data representation (Grobelnik and Mladenic 2005).

Data and the corresponding semantic structures change over time, and semantic technologies also aim at adapting the ontologies that model the data accordingly. For most such scenarios, extensive human involvement in building models and adapting them according to the data is too costly, too inaccurate, and too slow. ▸ Stream mining (Gaber et al. 2005) techniques (Data Streams: Clustering) allow text mining of dynamic data (e.g., notably in handling a stream of news or of public commentary).

Ontology is a fundamental method for organizing knowledge in a structured way and is applied, along with formalized reasoning, in areas from philosophy to scientific discovery to knowledge management and the Semantic Web. In computer science, an ontology generally refers to a graph or network structure consisting of a set of concepts (vertices in a graph), a set of relationships connecting those concepts (directed edges in a graph), and, possibly, a set of distinguished instance concepts assigned to particular class concepts (data records assigned to vertices in a graph). Although much useful knowledge can be represented by the ground binary relations most conveniently encoded as graphs, more complex relationships involving more than two entities are sometimes needed, in which case the graph metaphor is more remote. In many cases, knowledge is structured in one of these ways to allow for automated inference based on a logical formalism such as the ▸ predicate calculus (Barwise and Etchemendy 2002); for these applications, an ontology often further comprises a set of rules for producing new knowledge within the representation from existing knowledge. An ontology containing both instance data and rules for its application is often referred to as a knowledge base (KB) (e.g., Lenat 1995).
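
A minimal sketch of this view, with instance-level relations stored as a set of triples and a single rule that derives new facts from existing ones (all names are invented for illustration):

```python
# Instance-level relations as (subject, relation, object) triples;
# the relation names and instances are purely illustrative.
facts = {
    ("Ada", "worksFor", "AcmeLabs"),
    ("AcmeLabs", "locatedIn", "Ljubljana"),
}

def derive_based_in(facts):
    # Rule: worksFor(x, y) and locatedIn(y, z)  =>  basedIn(x, z)
    new = set()
    for (x, r1, y) in facts:
        for (y2, r2, z) in facts:
            if r1 == "worksFor" and r2 == "locatedIn" and y == y2:
                new.add((x, "basedIn", z))
    return new

print(derive_based_in(facts))  # {('Ada', 'basedIn', 'Ljubljana')}
```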

Machine learning practitioners refer to the task of automatically constructing these ontologies as ▸ ontology learning. From this point of

view, an ontology is seen as a class of models – somewhat more complex than most used in machine learning – which need to be expressed in some ▸Hypothesis Language. This definition of ontology learning (from Grobelnik and Mladenic 2005) enables a decomposition into several machine learning tasks, including ▸ learning concepts, identifying relationships between existing concepts, populating an existing ontology/structure with instances, identifying change in dynamic ontologies, and inducing rules over concepts, background knowledge, and instances.

Following this scheme, text mining methods have been applied to extending existing ontologies based on Web documents, learning semantic relations from text based on collocations, semiautomatic data-driven ontology construction based on ▸ document clustering and classification, extracting semantic graphs from text, transforming text into ▸RDF triples (a commonly used Semantic Web data representation), and mapping triplets to semantic classes using several kinds of lexical and ontological background knowledge. Text mining is also intensively used in the effort to produce a Semantic Web for annotation of text with concepts from an ontology. For instance, a text document is split into sentences, each sentence is represented as a word vector, sentences are clustered, and each cluster is labeled by the most characteristic words from its sentences and mapped onto the concepts of a general ontology. Several approaches that integrate ontology management, knowledge discovery, and human language technologies are described in Davies et al. (2009).
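
A rough sketch of the sentence-clustering-and-labeling step just described, assuming scikit-learn (TF-IDF word vectors and K-means; the sentences are invented, and mapping the resulting labels onto ontology concepts is not shown):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "The gene regulates cell growth in the liver.",
    "Mutations of this gene affect cell division.",
    "The bank reported strong quarterly growth.",
    "Quarterly profits at the bank exceeded forecasts.",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(sentences)          # each sentence as a word vector

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label each cluster with its most characteristic words; mapping these
# labels onto ontology concepts would be a separate step.
terms = vec.get_feature_names_out()
for c in range(2):
    top = np.argsort(km.cluster_centers_[c])[::-1][:3]
    print(f"cluster {c}:", [terms[i] for i in top])
```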

Extending the text mining paradigm, efforts are aimed at giving machines an approximation of the full human ability to acquire knowledge from text. Some of the systems (Curtis et al. 2009; Mitchell 2005; Rusu 2014) actively use background knowledge in the extraction process for disambiguation or knowledge structuring. Machine reading aims at full text understanding by integrating knowledge-base construction and use into syntactically sophisticated natural ▸ language analysis, leading to systems that autonomously improve their ability to extract

further knowledge from text (e.g., Curtis et al. 2009; Etzioni et al. 2007; Mitchell 2005; Starc and Fortuna 2012; Starc and Mladenic 2013).

Biomedical Text Mining

Because of the development and widespread use of high-quality biomedical knowledge bases, such as the Gene Ontology (Ashburner et al. 2000), Cell Ontology (Bard et al. 2005), and Linked Neuron Data (Zeng et al. 2015), and the overwhelming volume of the relevant literature (24 million biomedicine citations in PubMed), biomedical knowledge extraction is the subject of a great deal of research. Relevant shared evaluation tasks include BioCreative (Hirschman et al. 2005) and BioNLP (Cohen et al. 2014). Although much of the work on biological fact extraction still relies on supervised training with closely annotated training data, with the risk of over-constraining the mapping of semantics to particular text substrings, the growing volume of high-quality Semantic Web fact bases has enabled more natural training methods, such as distant supervision (Augenstein et al. 2014).

Cross-References

▸Active Learning
▸Classification
▸Clustering
▸Semi-supervised Learning
▸Semi-supervised Text Processing
▸Text Mining
▸Text Visualization

Recommended Reading

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29
Augenstein I, Maynard D, Ciravegna F (2014) Relation extraction from the web using distant supervision. In: Janowicz K et al (eds) EKAW 2014. LNAI 8876. Springer, pp 26–41
Bard J, Rhee SY, Ashburner M (2005) An ontology for cell types. Genome Biol 6(2):R21
Barwise J, Etchemendy J (2002) Language proof and logic. Center for the study of language and information. ISBN:157586374X
Buitelaar P, Cimiano P, Magnini B (2005) Ontology learning from text: methods, applications and evaluation, frontiers in artificial intelligence and applications. IOS Press, Amsterdam
Cohen K, Demner-Fushman D, Ananiadou S, Tsujii J-i (2014) Proceedings of BioNLP 2014, Baltimore. Association for Computational Linguistics
Curtis J, Baxter D, Wagner P, Cabral J, Schneider D, Witbrock M (2009) Methods of rule acquisition in the TextLearner system. In: Proceedings of the 2009 AAAI spring symposium on learning by reading and learning to read. AAAI Press, Palo Alto, pp 22–28
Davies J, Grobelnik M, Mladenic D (2009) Semantic knowledge management. Springer, Berlin
Etzioni O, Banko M, Cafarella MJ (2007) Machine reading. In: Proceedings of the 2007 AAAI spring symposium on machine reading
Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Rec 34(1):18–26. ISSN:0163-580
Grobelnik M, Mladenic D (2005) Automated knowledge discovery in advanced knowledge management. J Knowl Manag 9:132–149
Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinform 6(Suppl 1):S1
Lenat DB (1995) Cyc: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38
Mitchell T (2005) Reading the web: a breakthrough goal for AI. Celebrating twenty-five years of AAAI: notes from the AAAI-05 and IAAI-05 conferences. AI Mag 26(3):12–16
Rusu D (2014) Text annotation using background knowledge. Doctoral dissertation, Jozef Stefan International Postgraduate School, Ljubljana
Starc J, Fortuna B (2012) Identifying good patterns for relation extraction. In: Proceedings of the 15th international multiconference information society – IS 2012. Institut Jozef Stefan, Ljubljana, pp 205–208
Starc J, Mladenic D (2013) Semi-automatic construction of pattern rules for translation of natural language into semantic representation. In: Proceedings of the 5th Jozef Stefan International Postgraduate School Students Conference, Jozef Stefan International Postgraduate School, pp 199–208
Zeng Y, Wang D, Zhang T (2015) Linked brain data. http://www.linked-neuron-data.org/. Retrieved 11 Jan 2015

Text Spatialization

▸Text Visualization

Text Visualization

John Risch1, Shawn Bohn1, Steve Poteet2, Anne Kao2, Lesley Quach2, and Jason Wu2

1Pacific Northwest National Laboratory, Richland, WA, USA
2Boeing Phantom Works, Seattle, WA, USA

Synonyms

Semantic mapping; Text spatialization; Topic mapping

Definition

The term text visualization describes a class of knowledge discovery techniques that use interactive graphical representations of textual data to enable knowledge discovery via recruitment of human visual pattern recognition and spatial reasoning capabilities. It is a subclass of information visualization, which more generally encompasses visualization of nonphysically based (or "abstract") data of all types. Text visualization is distinguished by its focus on the unstructured (or free text) component of information. While the term "text visualization" has been used to describe a variety of graphical methods for deriving knowledge from text, it is most closely associated with techniques for depicting the semantic characteristics of large document collections. Text visualization systems commonly employ unsupervised machine learning techniques as part of broader strategies for organizing and graphically representing such collections.

Motivation and Background

The Internet enables universal access to vast quantities of information, most of which (despite

admirable efforts (Berners-Lee et al. 2001)) exists in the form of unstructured and unorganized text. Advancements in search technology make it possible to retrieve large quantities of this information with reasonable precision; however, only a tiny fraction of the information available on any given topic can be effectively exploited.

Text visualization technologies, as forms of computer-supported knowledge discovery, aim to improve our ability to understand and utilize the wealth of text-based information available to us. While the term "text visualization" has been used to describe a variety of techniques for graphically depicting the characteristics of free-text data (Havre et al. 2002; Small 1996), it is most closely associated with the so-called semantic clustering or semantic mapping techniques (Chalmers and Chitson 1992; Kohonen et al. 2000; Lin et al. 1991; Wise et al. 1995). These methods attempt to generate summary representations of document collections that convey information about their general topical content and similarity structure, facilitating general domain understanding and analytical reasoning processes.

Text visualization methods are generally based on vector space models of text collections (Salton 1989), which are commonly used in information retrieval, clustering, and categorization. Such models represent the text content of individual documents in the form of vectors of frequencies of the terms (text features) they contain. A document collection is therefore represented as a collection of vectors. Because the number of unique terms present in a document collection is generally in the range of tens of thousands, a dimensionality reduction method such as singular value decomposition (SVD) (Deerwester et al. 1990) or another matrix decomposition method (Kao et al. 2008; Booker et al. 1999) is typically used to eliminate noise terms and reduce the length of the document vectors to a tractable size (e.g., 50–250 dimensions). Some systems attempt to first identify discriminating features in the text and then use mutual probabilities to specify the vector space (York et al. 1995).
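
A minimal sketch of this vector space modeling and dimensionality reduction step, assuming scikit-learn (the documents and the choice of two latent dimensions are illustrative; 50-250 would be more typical in practice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "solar power plants and renewable energy policy",
    "wind and solar energy storage technology",
    "quarterly earnings of the energy sector",
    "stock market reaction to earnings reports",
]

# Term-frequency-based vector space model of the collection
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Reduce the document vectors to a small number of latent dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 2)
```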

To enable visualization, the dimensions must be further reduced to two or three. The goal

is a graphical representation that employs a "spatial proximity means conceptual similarity" metaphor, where topically similar text documents are represented as nearby points in the display. Various regions of the semantic map are subsequently labeled with descriptive terms that convey the primary concepts described by nearby documents. The text visualization can thus serve as a kind of graphical "table of contents" depicting the conceptual similarity structure of the collection.

Text visualization systems therefore typically implement four key functional components, namely,

1. A tokenization component that characterizes the lexical content of text units via extraction, normalization, and selection of key terms

2. A vector space modeling component that generates a computationally tractable vector space representation of a collection of text units

3. A spatialization component that uses the vector space model to generate a 2D or 3D spatial configuration that places the points representing conceptually similar text units in near spatial proximity

4. A labeling component that assigns characteristic text labels to various regions of the semantic map

Although machine learning techniques can be used in several of these steps, their primary usage is in the spatialization stage. An unsupervised learning algorithm is typically used to find meaningful low-dimensional structures hidden in high-dimensional document feature spaces.

Structure of Learning System

Spatialization is a term generically used in ▸ information visualization to describe the process of generating a spatial representation of inherently nonspatial information. In the context of text visualization, this term generally refers to the application of a nonlinear dimensionality reduction algorithm to a collection of text vectors in

order to generate a visually interpretable two- or three-dimensional representation of the similarity structure of the collection. The goal is the creation of a semantic similarity map that positions graphical features representing text units (e.g., documents) conceptually similar to one another near one another in the visualization display. These maps may be further abstracted to produce more general summary representations of text collections that do not explicitly depict the individual text units themselves (Wise et al. 1995).

A key assumption in text visualization is that text units which express similar concepts will employ similar word patterns and that the existence of these word correlations creates coherent structures in high-dimensional text feature spaces. A further assumption is that text feature spaces are nonlinear but that their structural characteristics can be approximated by a smoothly varying low-dimensional manifold. The text spatialization problem thus becomes one of finding an embedding of the feature vectors in a two- or three-dimensional manifold that best approximates this structure. Because the intrinsic dimensionality of the data is invariably much larger than two (or three), significant distortion is unavoidable. However, because the goal of text visualization is not necessarily the development of an accurate representation of interdocument similarities, but rather the depiction of broad (and ambiguously defined) semantic relationships, this distortion is generally considered acceptable.

Text vector spatialization therefore involves the fitting of a model to a collection of observations. Most text visualization systems developed to date have employed some type of unsupervised learning algorithm for this purpose. In general, the desired characteristics of an algorithm used for text spatialization include that it (1) preserves global properties of the input space, (2) preserves the pairwise input distances to the greatest extent possible, (3) supports out-of-sample extension (i.e., the incremental addition of new documents), and (4) has low computational and memory complexity. Computational and memory costs are key considerations, as a primary goal

of text visualization is the management and interpretation of extremely large quantities of textual information.

A leading approach is to iteratively adapt the nodes of a fixed-topology mesh to the high-dimensional feature space via adaptive refinement. This is the basis of the well-known Kohonen feature mapping algorithm, more commonly referred to as the ▸ self-organizing map (SOM) (Kohonen 1997). In a competitive learning process, text vectors are presented one at a time to a (typically triangular) grid, the nodes of which have been randomly initialized to points in the term space. The Euclidean distance to each node is computed, and the node closest to the sample is identified. The position of the winning node, along with those of its topologically nearest neighbors, is incrementally adjusted toward the sample vector. The magnitude of the adjustments is gradually decreased over time. The process is generally repeated using every vector in the set for many hundreds or thousands of cycles until the mesh converges on a solution. At the conclusion, the samples are assigned to their nearest nodes, and the results are presented as a uniform grid. In the final step, the nodes of the grid are labeled with summary terms which describe the key concepts associated with the text units that have been assigned to them.

Although self-organizing maps can be considered primarily a clustering technique, the grid itself theoretically preserves the topological properties of the input feature space. As a consequence, samples that are nearest neighbors in the feature space generally end up in topologically adjacent nodes. However, while SOMs are topology-preserving, they are not distance-preserving. Vectors that are spatially distant in the input space may be presented as proximal in the output, which may be semantically undesirable. SOMs have a number of attractive characteristics, including straightforward out-of-sample extension and low computational and memory complexity. Examples of the use of SOMs in text visualization applications can be found in Lin et al. (1991), Kaski et al. (1998), and Kohonen et al. (2000).
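
The sketch below is a minimal, self-contained SOM in the spirit of the procedure just described (NumPy only; the rectangular grid, the learning-rate and neighborhood schedules, and the random "document vectors" are all illustrative choices, not those of any cited system):

```python
import numpy as np

def train_som(vectors, grid=(10, 10), epochs=200, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal self-organizing map: fit a fixed grid of nodes to the
    input vectors and return node weights plus each vector's winning node."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = vectors.shape[1]
    weights = rng.random((rows, cols, dim))                 # random initialization
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)  # grid positions

    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)                       # decaying learning rate
        sigma = max(sigma0 * (1.0 - t / epochs), 0.5)       # shrinking neighborhood
        for x in vectors[rng.permutation(len(vectors))]:
            dists = np.linalg.norm(weights - x, axis=2)     # distance to every node
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            grid_d2 = np.sum((coords - np.array(winner)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))         # neighborhood function
            weights += lr * h[..., None] * (x - weights)    # pull nodes toward sample

    assign = [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                               (rows, cols)) for x in vectors]
    return weights, assign

# Toy usage: random vectors stand in for reduced document vectors
docs = np.random.default_rng(1).random((30, 20))
_, positions = train_som(docs, grid=(6, 6), epochs=50)
print(positions[:5])
```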

Often, it is considered desirable to attempt to preserve the distances among the samples in the input space to the greatest extent possible in the output. The rationale is that the spatial proximities of the text vectors capture important and meaningful characteristics of the associated text units: spatial "nearness" corresponds to conceptual "nearness." As a consequence, many text visualization systems employ distance-preserving dimensionality reduction algorithms. By far, the most commonly used among these is the class of algorithms known as multidimensional scaling (MDS) algorithms.

Multidimensional scaling is "a term used to describe any procedure which starts with the 'distances' between a set of points (or individuals or objects) and finds a configuration of the points, preferably in a smaller number of dimensions, usually 2 or 3" (Chatfield and Collins 1980, quoted in Chalmers and Chitson 1992). There are two main subclasses of MDS algorithms. Metric (quantitative, also known as classical) MDS algorithms attempt to preserve the pairwise input distances to the greatest extent possible in the output configuration, while nonmetric (qualitative) techniques attempt only to preserve the rank order of the distances. Metric techniques are most commonly employed in text visualization.

Metric MDS maps the points in the input space to the output space while maintaining the pairwise distances among the points to the greatest extent possible. The quality of the mapping is expressed in a stress function, which is minimized using any of a variety of optimization methods, e.g., via eigendecomposition of a pairwise dissimilarity matrix, or using iterative techniques such as generalized Newton–Raphson, simulated annealing, or genetic algorithms. A simple example of a stress function is the raw stress function (Kruskal 1964) defined by

σ(Y) = Σ_ij (||x_i − x_j|| − ||y_i − y_j||)²

in which ||x_i − x_j|| is the Euclidean distance between points x_i and x_j in the high-dimensional space and ||y_i − y_j|| is the distance between the corresponding points in the output space. A variety of alternative stress functions have been proposed (Cox and Cox 2001). In addition to its distance-preserving characteristics, MDS has the added advantage of preserving the global properties of the input space. A major disadvantage of MDS, however, is its high computational complexity, which is approximately O(kN²), where N is the number of data points and k is the dimensionality of the embedding. Although computationally expensive, MDS can be used practically on data sets of up to several hundred documents in size. Another disadvantage is that out-of-sample extension requires reprocessing of the full data set if an optimization method which computes the output coordinates all at once is used.
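
A small sketch relating the two pieces above: the raw stress of an embedding computed directly from the definition, and a metric MDS embedding obtained with scikit-learn (random vectors stand in for document vectors; this is not the optimization scheme of any particular system):

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist

def raw_stress(X, Y):
    # Sum over pairs of (||x_i - x_j|| - ||y_i - y_j||)^2
    return np.sum((pdist(X) - pdist(Y)) ** 2)

# Toy "document vectors" standing in for reduced TF-IDF vectors
X = np.random.default_rng(0).random((40, 50))

# Metric MDS embedding into 2 dimensions (pairwise Euclidean distances)
Y = MDS(n_components=2, random_state=0).fit_transform(X)
print("raw stress:", raw_stress(X, Y))
```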

The popularity of MDS methods has led to the development of a range of strategies for improving on its computational efficiency to enable scaling of the technique to text collections of larger size. One approach is to use either cluster centroids or a randomly sampled subset of input vectors as surrogates for the full set. The surrogates are down-projected independently using MDS, and then the remainder of the data is projected relative to this "framework" using a less expensive algorithm, e.g., distance-based triangulation. This is the basis for the anchored least stress algorithm used in the SPIRE text visualization system (York et al. 1995), as well as the more recently developed Landmark MDS (de Silva and Tenenbaum 2003) algorithm.

While self-organizing maps and multidimensional scaling techniques have received the most attention to date, a number of other machine learning techniques have also been used for text spatialization. The Starlight system (Risch et al. 1999) uses stochastic proximity embedding (Agrafiotis 2003), a high-speed nonlinear manifold learning algorithm. Other approaches have employed methods based on graph layout techniques (Fabrikant 2001). Generally speaking, any of a number of techniques for performing dimensionality reduction in a correlated system of measurements (classified under the rubric of

factor analysis in statistics) may be employed for this purpose.

Machine learning algorithms can also be used in text visualization for tasks other than text vector spatialization. For example, generation of descriptive labels for semantic maps requires partitioning of the text units into related sets. Typically, a partitioning-type ▸ clustering algorithm such as K-means is used for this purpose (see ▸ Partitional Clustering), either as an element of the spatialization strategy (see York et al. 1995) or as a postspatialization step. The labeling process itself may also employ machine learning algorithms. For instance, the TRUST system (Booker et al. 1999; Kao et al. 2008) employed by Starlight generates meaningful labels for document clusters using a kind of ▸ unsupervised learning. By projecting a cluster centroid defined in the reduced dimensional representation (e.g., 50–250 dimensions) back into the full-term space, terms related to the content of the documents in the cluster are identified and used as summary terms. Machine learning techniques can also be applied indirectly during the tokenization phase of text visualization. For example, information extraction systems commonly employ rule sets that have been generated by a supervised learning algorithm (Mooney and Bunescu 2006). Such systems may be used to identify tokens that are most characteristic of the overall topic of a text unit or are otherwise of interest (e.g., the names of people or places). In this way, the dimensionality of the input space can be drastically reduced, accelerating downstream processing while simultaneously improving the quality of the resulting visualizations.
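
A rough sketch of this labeling-by-back-projection idea, assuming scikit-learn (toy documents, and a plain TF-IDF/SVD pipeline standing in for the TRUST engine):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "solar panels and wind turbines for renewable energy",
    "battery storage for solar and wind power",
    "football championship final match report",
    "the team won the football league title",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
X_red = svd.fit_transform(X)                       # reduced document vectors

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_red)

# Label clusters by projecting each centroid (defined in the reduced
# space) back into the full term space and reading off the top terms.
terms = vec.get_feature_names_out()
back = km.cluster_centers_ @ svd.components_       # (n_clusters, n_terms)
for c in range(back.shape[0]):
    top = np.argsort(back[c])[::-1][:3]
    print(f"cluster {c}:", [terms[i] for i in top])
```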

Applications

Sammon

The first text visualization system based on a text vector space model was likely a prototype developed in the 1960s by John Sammon, whose "nonlinear mapping," or NLM, algorithm (today commonly referred to as the Sammon mapping) was applied to organizing text data. The configuration depicted here is the result of applying Sammon's

Text Visualization, Fig. 1 Text visualization on a CRT display using a light pen

algorithm to a collection of 188 documents represented as 17-dimensional vectors determined according to document relevance to 1,125 keywords and phrases. Among other interesting and prescient ideas, Sammon describes techniques for interacting with text visualizations depicted on a "CRT display" using a light pen (Fig. 1).

Lin

Lin's 1991 prototype (Lin et al. 1991) was one of the first to demonstrate the use of self-organizing maps for organizing text documents. Lin formed a 25-dimensional vector space model of a 140-document collection using 25 key index terms extracted from the text. The document vectors were used to train a 140-node feature map, generating the result shown here (the fact that the number of nodes matches the number of documents is coincidental). Lin was also among the first to assign text labels to various regions of the resulting map to improve the interpretability and utility of the resulting product (Fig. 2).

BEAD

The BEAD system (Chalmers and Chitson 1992) was a text visualization prototype

Text Visualization, Fig. 2 Labelled self-organising map of a document collection, with grid regions annotated by index terms (e.g., machine learning, natural language, knowledge, retrieval, expert system, database)

developed during the early 1990s at Rank Xerox EuroPARC. BEAD employed a vector space model constructed using document keywords and a hybrid MDS algorithm based on an optimized form of simulated annealing. Although it did not include a region labeling component, BEAD did support highlighting of visualization features in response to query operations, a now standard text visualization system feature. The BEAD project also pioneered a number of now common interaction techniques and was among the first to explore 3D representations of document collections (Fig. 3).

IN-SPIRE

IN-SPIRE (formerly SPIRE, Spatial Paradigm for Information Retrieval and Exploration) (Wise et al. 1995) was originally developed in 1995 at Pacific Northwest National Laboratory (PNNL). Over the years, IN-SPIRE has evolved from using MDS to anchored least stress to a hybrid clustering/PCA projection scheme. The SPIRE/IN-SPIRE system introduced several new concepts, including the use of a 3D "landscape" abstraction (called a ThemeView) for depicting the general characteristics of large text collections. A recently developed parallelized version of the software is capable of generating visualizations of document collections containing millions of items (Fig. 4).

WEBSOM

WEBSOM (Kaski et al. 1998) was another early application of Kohonen self-organizing maps to text data. Early versions of WEBSOM used an independent SOM to generate reduced-dimensionality text vectors, which were then mapped with a second SOM for visualization purposes. More recent SOM-based text visualization experiments have employed vectors constructed via random projections of weighted word histograms (Kohonen et al. 2000). SOMs have been used to generate semantic maps containing millions of documents (Fig. 5).

Text Visualization, Fig. 3 3D representation of document collections

Text Visualization, Fig. 4 Large-scale 3D representation of document collections

Starlight

Starlight (Risch et al. 1999) is a general-purpose information visualization system developed at PNNL that includes a text visualization component. Starlight's text visualization system uses the Boeing Text Representation Using Subspace Transformation (TRUST) text engine for vector space modeling and text summarization. Text vectors generated by TRUST are clustered, and the cluster centroids are down-projected to 2D and 3D using a nonlinear manifold learning

algorithm. Individual document vectors associated with each cluster are likewise projected within a local coordinate system established at the projected coordinates of their associated cluster centroid, and TRUST is used to generate topical labels for each cluster. Starlight is unique in that it couples text visualization with a range of other information visualization techniques (such as link displays) to depict multiple aspects of information simultaneously (Fig. 6).

Text Visualization, Fig. 5 Semantic map generated by self-organising maps (SOMs)

Text Visualization, Fig. 6 Starlight link display of multiple aspects of information

Cross-References

▸Data Preprocessing
▸Dimensionality Reduction
▸Document Classification
▸Evolutionary Feature Selection and Construction
▸Self-Organizing Maps
▸Text Visualization

Recommended Reading

Agrafiotis DK (2003) Stochastic proximity embedding. J Comput Chem 24(10):1215–1221
Berners-Lee T, Hendler J, Lassila O (2001) The semantic web. Sci Am 284(5):34–43
Booker A, Condliff M, Greaves M, Holt FB, Kao A, Pierce DJ et al (1999) Visualizing text data sets. Comput Sci Eng 1(4):26–35
Chalmers M, Chitson P (1992) Bead: explorations in information visualization. In: SIGIR '92: proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval, Copenhagen. ACM, New York, pp 330–337
Chatfield C, Collins A (1980) Introduction to multivariate analysis. Chapman & Hall, London
Cox MF, Cox MAA (2001) Multidimensional scaling. Chapman & Hall, London
Crouch D (1986) The visual display of information in an information retrieval environment. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, Pisa. ACM, New York, pp 58–67
Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
de Silva V, Tenenbaum JB (2003) Global versus local methods in nonlinear dimensionality reduction. In: Becker S, Thrun S, Obermayer K (eds) Proceedings of the NIPS, Vancouver, vol 15, pp 721–728
Doyle L (1961) Semantic roadmaps for literature searchers. J Assoc Comput Mach 8(4):367–391
Fabrikant SI (2001) Visualizing region and scale in information spaces. In: Proceedings of the 20th international cartographic conference, ICC 2001, Beijing, pp 2522–2529
Havre S, Hetzler E, Whitney P, Nowell L (2002) ThemeRiver: visualizing thematic changes in large document collections. IEEE Trans Vis Comput Graph 8(1):9–20
Huang S, Ward M, Rundensteiner E (2003) Exploration of dimensionality reduction for text visualization. Technical report TR-03-14, Department of Computer Science, Worcester Polytechnic Institute, Worcester
Kao A, Poteet S, Ferng W, Wu J, Quach L (2008) Latent semantic indexing and beyond, to appear. In: Song M, Wu YF (eds) Handbook of research on text and web mining technologies. Idea Group Inc., Hershey
Kaski S, Honkela T, Lagus K, Kohonen T (1998) WEBSOM – self-organizing maps of document collections. Neurocomputing 21:101–117
Kohonen T (1997) Self-organizing maps. Series in information sciences, vol 30, 2nd edn. Springer, Heidelberg
Kohonen T, Kaski S, Lagus K, Salojarvi J, Honkela J, Paatero V et al (2000) Self organization of a massive document collection. IEEE Trans Neural Netw 11(3):574–585
Kruskal JB (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1):1–27
Lin X, Soergel D, Marchionini DA (1991) Self-organizing semantic map for information retrieval. In: Proceedings of the fourteenth annual international ACM/SIGIR conference on research and development in information retrieval, Chicago, pp 262–269
Mooney RJ, Bunescu R (2006) Mining knowledge from text using information extraction. In: Kao K, Poteet S (eds) SIGKDD explorations, pp 3–10
Paulovich FV, Nonato LG, Minghim R (2006) Visual mapping of text collections through a fast high precision projection technique. In: Proceedings of the tenth international conference on information visualisation (IV'06), London, pp 282–290
Risch JS, Rex DB, Dowson ST, Walters TB, May RA, Moon BD (1999) The STARLIGHT information visualization system. In: Card S, Mackinlay J, Shneiderman B (eds) Readings in information visualization: using vision to think. Morgan Kaufmann, San Francisco, pp 551–560
Salton G (1989) Automatic text processing. Addison-Wesley, Reading
Sammon JW (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18(5):401–409
Small D (1996) Navigating large bodies of text. IBM Syst J 35(3&4):514–525
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995) Visualizing the non-visual: spatial analysis and interaction with information from text documents. In: Proceedings of the IEEE information visualization symposium '95, Atlanta, pp 51–58
York J, Bohn S, Pennock K, Lantrip D (1995) Clustering and dimensionality reduction in SPIRE. In: Proceedings, symposium on advanced information processing and analysis, AIPA95, Tysons Corner


TF–IDF

TF–IDF (term frequency–inverse document frequency) is a term weighting scheme commonly used to represent textual documents as vectors (for purposes of classification, clustering, visualization, retrieval, etc.). Let $T = \{t_1, \ldots, t_n\}$ be the set of all terms occurring in the document corpus under consideration. Then a document $d_i$ is represented by an $n$-dimensional real-valued vector $x_i = (x_{i1}, \ldots, x_{in})$ with one component for each possible term from $T$.

The weight $x_{ij}$ corresponding to term $t_j$ in document $d_i$ is usually a product of three parts: one which depends on the presence or frequency of $t_j$ in $d_i$, one which depends on $t_j$'s presence in the corpus as a whole, and a normalization part which depends on $d_i$. The most common TF–IDF weighting is defined by

$$x_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_j \cdot \Big(\sum_j (\mathrm{TF}_{ij}\,\mathrm{IDF}_j)^2\Big)^{-1/2},$$

where $\mathrm{TF}_{ij}$ is the term frequency (i.e., number of occurrences) of $t_j$ in $d_i$, and $\mathrm{IDF}_j$ is the IDF of $t_j$, defined as $\log(N/\mathrm{DF}_j)$, where $N$ is the number of documents in the corpus and $\mathrm{DF}_j$ is the document frequency of $t_j$ (i.e., the number of documents in which $t_j$ occurs). The normalization part ensures that the vector has a Euclidean length of 1.

Several variations on this weighting scheme are also known. Possible alternatives for $\mathrm{TF}_{ij}$ include $\min\{1, \mathrm{TF}_{ij}\}$ (to obtain binary vectors) and $(1 + \mathrm{TF}_{ij}/\max_j \mathrm{TF}_{ij})/2$ (to normalize TF within the document). Possible alternatives for $\mathrm{IDF}_j$ include $1$ (to obtain plain TF vectors instead of TF–IDF vectors) and $\log\big(\sum_i \sum_k \mathrm{TF}_{ik} / \sum_i \mathrm{TF}_{ij}\big)$. The normalization part can be omitted altogether or modified to use some other norm than the Euclidean one.
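As a concrete illustration of the weighting scheme above, the following sketch computes Euclidean-normalized TF–IDF vectors for a toy corpus; the documents and the choice of the natural logarithm are assumptions made only for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors, normalized to Euclidean length 1.

    docs: list of tokenized documents (lists of terms).
    Returns one dict per document, mapping each term t_j to its weight x_ij.
    """
    n_docs = len(docs)
    # Document frequency DF_j: number of documents containing term t_j.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                        # TF_ij
        w = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}    # TF_ij * IDF_j
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0  # Euclidean normalization
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

# Hypothetical toy corpus.
docs = [["sport", "player", "ball"], ["rain", "cloud", "rain"], ["sport", "rain"]]
print(tfidf_vectors(docs))
```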

Threshold Phenomena in Learning

� Phase Transitions in Machine Learning

Time Sequence

�Time Series

Time Series

Eamonn Keogh
University of California-Riverside, Riverside, CA, USA

Synonyms

Temporal data; Time sequence; Trajectory data

Definition

A Time Series is a sequence $T = (t_1, t_2, \ldots, t_n)$ which is an ordered set of $n$ real-valued numbers. The ordering is typically temporal; however, other kinds of data such as color distributions (Hafner et al. 1995), shapes (Ueno et al. 2006), and spectrographs also have a well-defined ordering and can be fruitfully considered "time series" for the purposes of machine learning algorithms.

Motivation and Background

The special structure of time series produces unique challenges for machine learning researchers.

It is often the case that each individual time series object has a very high dimensionality. Whereas classic algorithms often assume a relatively low dimensionality (for example, a few dozen measurements such as "height, weight, blood sugar," etc.), time series learning algorithms must be able to deal with dimensionalities in the hundreds or thousands. The problems created by high-dimensional data are more than mere computation time considerations; the very meaning of normally intuitive terms, such as "similar to" and "cluster forming," becomes


unclear in high-dimensional space. The reason for this is that as dimensionality increases, all objects become essentially equidistant to each other, and thus classification and clustering lose their meaning. This surprising result is known as the � curse of dimensionality and has been the subject of extensive research. The key insight that allows meaningful time series machine learning is that although the actual dimensionality may be high, the intrinsic dimensionality is typically much lower. For this reason, virtually all time series data mining algorithms avoid operating on the original "raw" data; instead, they consider some higher level representation or abstraction of the data. Such algorithms are known as � dimensionality reduction algorithms. There are many general dimensionality reduction algorithms, such as singular value decomposition and random projections, in addition to many reduction algorithms specifically designed for time series, including piecewise linear approximations, Fourier transforms, wavelets, and symbolic approximations (Ding et al. 2008).
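As a minimal illustration of this idea, the sketch below reduces a series to its first few Fourier coefficients, one of the representations named above; the series, the number of retained coefficients, and the use of NumPy are assumptions made for this example.

```python
import numpy as np

def fourier_reduce(series, n_coeffs=8):
    """Represent a 1-D time series by its first n_coeffs Fourier coefficients."""
    return np.fft.rfft(np.asarray(series, dtype=float))[:n_coeffs]

def reduced_distance(a, b, n_coeffs=8):
    """Distance between the reduced representations (an approximation of the
    Euclidean distance between the full series)."""
    return np.linalg.norm(fourier_reduce(a, n_coeffs) - fourier_reduce(b, n_coeffs))

# Hypothetical example: two noisy sine waves of length 256 (dimensionality 256 -> 8).
t = np.linspace(0, 4 * np.pi, 256)
x = np.sin(t) + 0.1 * np.random.randn(256)
y = np.sin(t + 0.2) + 0.1 * np.random.randn(256)
print(reduced_distance(x, y))
```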

In addition to the high dimensionality of individual time series objects, many time series datasets have very high numerosity, resulting in a large volume of data. One implication of high numerosity combined with high dimensionality is that the entire dataset may not fit in main memory. This requires an efficient disk-aware learning algorithm or a careful sampling approach.

A final consideration due to the special nature of time series is the fact that individual data points are typically highly correlated with their neighbors (a phenomenon known as autocorrelation). Indeed, it is this correlation that makes most time series excellent candidates for dimensionality reduction. However, for learning algorithms that assume the independence of features (e.g., �Naive Bayes), this lack of independence must be countered or mitigated in some way.

While virtually every machine learning method has been used to classify time series, the current state-of-the-art method is the nearest neighbor algorithm (Ueno et al. 2006) with a suitable distance measure (Ding et al. 2008). This simple method outperforms neural networks and Bayesian classifiers.
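A minimal sketch of this nearest neighbor approach, with plain Euclidean distance standing in for whichever distance measure is chosen; the toy series and labels are invented for illustration.

```python
import numpy as np

def nn_classify(query, train_series, train_labels):
    """Label a time series with the label of its nearest training series (Euclidean distance)."""
    dists = [np.linalg.norm(np.asarray(query, float) - np.asarray(s, float))
             for s in train_series]
    return train_labels[int(np.argmin(dists))]

# Hypothetical training set of equal-length series.
train = [[0, 1, 2, 3], [3, 2, 1, 0], [0, 1, 2, 2]]
labels = ["up", "down", "up"]
print(nn_classify([0, 1, 1, 3], train, labels))  # -> "up"
```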

The major database (SIGMOD, VLDB, PODS) and data mining (SIGKDD, ICDM, SDM) conferences typically feature several time series machine learning/data mining papers each year. In addition, because of the ubiquity of time series, several other communities have active subgroups that conduct research on time series; for example, the SIGGRAPH conference typically has papers on learning or indexing of motion capture time series, and most medical conferences have tracks devoted to medical time series, such as electrocardiograms and electroencephalograms.

The UCR Time Series Archive has several dozen time series datasets which are widely used to test classification and clustering algorithms, and the UCI Data Mining archive has several additional datasets.

Recommended Reading

Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh EA (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: Proceedings of the VLDB, Auckland. VLDB Endowment

Hafner J, Sawhney H, Equitz W, Flickner M, Niblack W (1995) Efficient color histogram indexing for quadratic form distance functions. IEEE Trans Pattern Anal Mach Intell 17(7):729–736

Ueno K, Xi X, Keogh E, Lee D (2006) Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: Proceedings of IEEE international conference on data mining (ICDM), Hong Kong

Topic Mapping

�Text Visualization

Topic Modeling

�Topic Models for NLP Applications


Topic Models for NLP Applications

Zhiyuan Chen and Bing Liu
University of Illinois at Chicago, Chicago, IL, USA

Abstract

Topic modeling is a machine learning technique for discovering semantic topics from a document collection. It typically assumes that a document is a multinomial distribution over latent topics, and a topic is a multinomial distribution over words. By capturing the co-occurrence statistics of words in the documents, it uncovers these distributions, which indicate important semantic relationships. Topic modeling has been widely studied in machine learning, text mining, and natural language processing (NLP). This chapter gives an introduction to topic modeling. It covers both the fundamental techniques and some of its important applications in NLP.

Synonyms

Topic modeling

Definition

Given a collection of documents, how to discover semantic topics from the documents is an important yet challenging task. It is infeasible to ask human beings to manually read and identify the topics in every available document. This calls for an automated approach to extracting topics. Topic models are statistical machine learning methods that aim to discover a set of latent semantic topics from a document collection or corpus. A topic model is usually represented in a directed graphical model where topics and words are modeled as random variables. In a classic topic model, a document is modeled as an admixture of latent topics, while a topic is regarded as a probability distribution over words. The words are assumed to be generated conditioned on the topics, while topics are assumed to be sampled from a predefined distribution. Topic models are based on "higher-order co-occurrence," i.e., how often words co-occur in different contexts. They usually perform well with a large number of documents which provide reliable co-occurrence statistics.

Motivation and Background

Discovering semantic topics from text corpora is beneficial to many applications in natural language processing. Due to the wide variety, high volume, and dynamic nature of topics, manual topic identification is clearly not scalable. To address this, topic models, such as latent Dirichlet allocation (LDA) (Blei et al. 2003) and probabilistic latent semantic analysis (pLSA) (Hofmann 1999), have been proposed to automatically discover latent topics from text corpora. In general, topic models assume that each document is a multinomial distribution over topics, while each semantic topic is a multinomial distribution over words. The two types of resulting distributions in topic models are document-topic distributions and topic-word distributions, respectively. The intuition is that certain words are more or less likely to be present given the topics of a document. For example, "sport" and "player" will appear more often in documents about sports, and "rain" and "cloud" will appear more frequently in documents about weather.

Structure of the Learning System

Topic modeling represents a class of statistical methods that can automatically extract thematic information from unstructured text documents. Topic models usually assume a generative process to describe how words are generated in documents. We use the most popular topic model, LDA (latent Dirichlet allocation) (Blei et al. 2003), as an example to explain. We denote the number of documents by $M$ and the number of topics by $T$. Each document $m \in \{1, \ldots, M\}$


contains $N_m$ words. The vocabulary in the corpus is denoted by $\{1, \ldots, V\}$. The generative process of LDA is given as follows:

1. For each topic $t \in \{1, \ldots, T\}$:

   (i) Draw a per-topic distribution over words, $\varphi_t \sim \mathrm{Dir}(\beta)$

2. For each document $m \in \{1, \ldots, M\}$:

   (i) Draw a topic distribution, $\theta_m \sim \mathrm{Dir}(\alpha)$

   (ii) For each word position $n$ in document $m$, where $n \in \{1, \ldots, N_m\}$:

       (a) Draw a topic $z_{m,n} \sim \mathrm{Mult}(\theta_m)$

       (b) Emit word $w_{m,n} \sim \mathrm{Mult}(\varphi_{z_{m,n}})$

Here, $\alpha$ and $\beta$ are called Dirichlet priors, representing hyperparameters. $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution and $\mathrm{Mult}(\cdot)$ indicates the multinomial distribution. Note that the Dirichlet distribution is the conjugate prior of the multinomial distribution, which simplifies the model inference derivation. $\theta$ is the document-topic distribution and $\varphi$ is the topic-word distribution.
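The generative process above can be simulated directly. The following sketch samples a small synthetic corpus with NumPy; the number of topics, vocabulary size, document lengths, and hyperparameter values are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, V = 3, 5, 10          # number of topics, documents, vocabulary size
alpha, beta = 0.1, 0.01     # symmetric Dirichlet hyperparameters
N = [20] * M                # words per document

# Step 1: per-topic word distributions, phi_t ~ Dir(beta)
phi = rng.dirichlet([beta] * V, size=T)

corpus = []
for m in range(M):
    theta = rng.dirichlet([alpha] * T)    # Step 2(i): theta_m ~ Dir(alpha)
    doc = []
    for _ in range(N[m]):
        z = rng.choice(T, p=theta)        # Step 2(ii)(a): draw topic z_{m,n}
        w = rng.choice(V, p=phi[z])       # Step 2(ii)(b): emit word w_{m,n}
        doc.append(int(w))
    corpus.append(doc)

print(corpus[0])    # word ids of the first synthetic document
```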

Inference and Parameter Estimation

The posterior inference of the LDA model is intractable and cannot be solved by exact inference. Common approximate inference techniques include collapsed Gibbs sampling (Griffiths and Steyvers 2004), variational methods (Blei et al. 2003), and expectation propagation (Minka and Lafferty 2002). Collapsed Gibbs sampling is the most popular inference approach due to its simplicity.

Gibbs sampling is a special case of the Metropolis-Hastings algorithm, which is an MCMC (Markov chain Monte Carlo) technique. It is usually used to generate samples from a joint probability of many random variables to approximate the marginal distribution. Gibbs sampling is especially useful when it is hard to sample from the joint distribution directly due to its complexity, but sampling from the conditional distribution of the random variables is easy. It is an iterative process that starts with a random initialization of the Markov chain's state. In each iteration, the value of each random variable is updated by drawing a sample from its conditional distribution based on the current state of all other random variables and the data.

The conditional distribution of assigning topic $t$ to a word $w_i$ in the collapsed Gibbs sampler for LDA is stated as below:

$$P(z_i = t \mid z_{-i}, \mathbf{w}, \alpha, \beta) \propto \frac{n^{-i}_{m,t} + \alpha}{\sum_{t'=1}^{T} (n^{-i}_{m,t'} + \alpha)} \cdot \frac{n^{-i}_{t,w_i} + \beta}{\sum_{v'=1}^{V} (n^{-i}_{t,v'} + \beta)} \qquad (1)$$

where $z_{-i}$ are the topic assignments excluding the current topic assignment of $w_i$. $\mathbf{w}$ denotes all the words in the documents. $n^{-i}$ is the count that excludes the current word. $n_{m,t}$ is the number of times that topic $t$ appears in document $m$, and $n_{t,w}$ is the number of occurrences of word $w$ under topic $t$. Equation 1 is quite intuitive: the first ratio expresses the probability of topic $t$ in document $m$, and the second ratio implies the probability of word $w$ under topic $t$. Since this information is sufficient to compute the conditional distribution, Gibbs sampling can be implemented efficiently by caching and updating these counts only.

The estimation of the document-topic distribution $\theta$ and the topic-word distribution $\varphi$ is straightforward given the samples of Gibbs sampling, as below:

$$\theta_{m,t} = \frac{n_{m,t} + \alpha}{\sum_{t'=1}^{T} (n_{m,t'} + \alpha)} \qquad (2)$$

$$\varphi_{t,w} = \frac{n_{t,w} + \beta}{\sum_{v'=1}^{V} (n_{t,v'} + \beta)} \qquad (3)$$
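The following is a compact, unoptimized sketch of the collapsed Gibbs sampler and of the estimates in Eqs. (1), (2), and (3), operating on a corpus given as lists of word ids; the corpus contents, hyperparameters, and iteration count are assumptions made for this illustration.

```python
import numpy as np

def lda_gibbs(corpus, T, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; corpus is a list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_mt = np.zeros((len(corpus), T))   # n_{m,t}: topic counts per document
    n_tw = np.zeros((T, V))             # n_{t,w}: word counts per topic
    n_t = np.zeros(T)                   # total word count per topic
    z = [[int(rng.integers(T)) for _ in doc] for doc in corpus]   # random initialization
    for m, doc in enumerate(corpus):
        for w, t in zip(doc, z[m]):
            n_mt[m, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(iters):
        for m, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                t = z[m][i]
                n_mt[m, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1     # exclude the current word
                # Eq. (1); the document-part denominator is constant in t and can be dropped.
                p = (n_mt[m] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t = int(rng.choice(T, p=p / p.sum()))
                z[m][i] = t
                n_mt[m, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    theta = (n_mt + alpha) / (n_mt + alpha).sum(axis=1, keepdims=True)   # Eq. (2)
    phi = (n_tw + beta) / (n_tw + beta).sum(axis=1, keepdims=True)       # Eq. (3)
    return theta, phi

# Hypothetical tiny corpus over a vocabulary of 6 word ids.
theta, phi = lda_gibbs([[0, 1, 0, 2], [3, 4, 5, 4], [0, 2, 1]], T=2, V=6)
print(theta.round(2))
```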

Nonparametric Topic Models

Classic topic models such as LDA and pLSA generally require the number of topics to be specified by the user before running the actual models. In practice, this number is usually set empirically by conducting some initial experiments, which may not guarantee the optimal parameters for the model. Nonparametric topic


models automatically learn the appropriate number of topics from the data itself without manual setting of the number of topics. The hierarchical Dirichlet process mixture model (Teh et al. 2006) is the first nonparametric topic model. It introduces the Dirichlet process into topic models to automatically estimate the number of topics. The intuition is that there is a set of infinite groups in the data where each observation is generated independently given a group. This is the same as LDA, as there is a set of topics where each word is sampled given a topic, except that LDA fixes the number of topics, while the hierarchical Dirichlet process mixture model assumes that there are infinite topics. For the inference of the hierarchical Dirichlet process mixture model, Gibbs sampling is also used. More details can be found in Teh et al. (2006).

Knowledge-Based Topic Models

Most of the traditional topic models are fully unsupervised. However, researchers have shown that fully unsupervised topic models often produce incoherent topics because the objective functions of topic models do not always correlate well with human judgments (Chang et al. 2009). To address this issue, several knowledge-based topic models (KBTM), also called semi-supervised topic models, have been proposed and used in NLP applications.

DF-LDA (Andrzejewski et al. 2009) is the first KBTM, which incorporates prior knowledge in the forms of must-links and cannot-links, where a must-link states that two words should belong to the same topic and a cannot-link states that two words should not be in the same topic. In a similar but more generic vein, must-sets and cannot-sets are used in MC-LDA (Chen et al. 2013). Mukherjee and Liu (2012) proposed a model that allows the human user to provide some seed words in some topics. Interactive topic models were also proposed to allow the user to interact with the model during its inference process (Hu et al. 2011). In Blei and McAuliffe (2010), document class labels were considered in a supervised setting. However, these works in KBTM require the user to be involved to provide the knowledge or guidance for a superior model performance. As we know, expert knowledge can be hard to obtain. To address it, lifelong topic modeling (LTM) (Chen and Liu 2014) was proposed to automatically mine the prior knowledge from past domains and leverage the knowledge to help discover topics of higher quality in a new domain.

Applications in NLP

Since topic models are primarily designed for analysis of text documents, there are numerous applications in almost every subarea of NLP. It is difficult to describe them all. Here we discuss only a few subareas to give a flavor of the types of NLP applications.

Part-of-speech (POS) tagging is one of the core NLP tasks. The task is to assign a particular part of speech to a given word based on the definition and context of that word. For example, in the sentence "Bob enjoys reading books," the word enjoys is marked up with POS tag VBZ, indicating that this word is a verb in the third-person singular present. The challenge is that some words may have multiple POS tags, e.g., the word move can be a verb or a noun. In such cases, the context of the given word is usually required to decide the correct POS tag.

Topic models have been widely applied in the task of POS tagging. Griffiths et al. (2004) proposed a topic model that can model both the semantic and syntactic information for part-of-speech tagging. Their motivation is that a word in a sentence can have one of two roles: serving a syntactic function or providing a semantic meaning (Griffiths et al. 2004). Syntactic words usually have short-range dependencies, i.e., spanning several words without going beyond the scope of a sentence. In contrast, semantic words tend to have long-range dependencies: some sentences within a document are likely to share similar words and express similar contexts. Based on this, a hidden Markov model (HMM) was used inside the generative model to decide whether a word belongs to the syntactic class or the semantic


class. Obtaining such information is helpful in determining POS tags. For example, knowing that a word "control" in a text corpus belongs to the syntactic class makes it more likely to be a verb than a noun. Toutanova and Johnson (2008) further added a sparse prior to topic models on the distribution over tags for each word. They also explicitly modeled ambiguity classes, i.e., the set of part-of-speech tags that a word can be associated with. In their model, each word type is assigned with a set of possible parts of speech, and each token of this word type is associated with a part-of-speech tag.

Word sense disambiguation (WSD) is another important NLP area where topic models have been popularly applied. Its objective is to identify the sense or meaning of an ambiguous word in its context. For example, the word light in "the light of the sun" refers to the meaning "something that makes things visible," while light in "The box is light to carry" indicates the sense "of little weight." A dictionary, e.g., WordNet (https://wordnet.princeton.edu/), is usually used to help provide word senses. Boyd-Graber et al. (2007) proposed a model called LDAWN (LDA with WordNet) to distinguish word senses. In WordNet, a word sense is represented by a synset (short for synonym set). For example, in the above examples, the synset {light, luminance} is associated with the sense of "something that makes things visible." LDAWN models the synset path, i.e., a path from one synset to another synset, as a hidden variable. It assumes that words under the same topic are likely to share the same meaning as well as their synset path. The posterior inference of LDAWN was conducted using Gibbs sampling to infer the synset path, i.e., the sense, of a word. The key advantage of LDAWN is that it does not need labeled data to disambiguate a corpus. It simultaneously decomposes a corpus into topics with words grouped into their word senses.

Sentiment analysis (or opinion mining) is perhaps one of the biggest application areas of topic models in NLP. The goal of sentiment analysis is to extract subjective information such as opinions, evaluations, appraisals, and emotions from text. Liu (2012) gave a comprehensive survey of the sentiment analysis and opinion mining research. Topic models have been widely applied in aspect-based sentiment analysis, which is a fine-grained analysis of opinions, to infer aspects and opinion words. Aspects in the sentiment analysis context are entity features on which opinions have been expressed. For example, in a review sentence, "The picture looks great," about a camera, the aspect is "picture" and the opinion word is "great." Mei et al. (2007) proposed the topic-sentiment mixture (TSM) model to reveal the latent topics and their associated sentiments in a Weblog collection. They also designed a special HMM structure in the topic model to detect topic life cycles and sentiment dynamics. A semi-supervised topic model was proposed in Lu and Zhai (2008) to integrate opinions expressed in well-written expert reviews and opinions expressed by the general public in sources such as weblogs to generate an aligned and integrated opinion summary. Titov and McDonald (2008) proposed a topic model to distinguish global aspects and local aspects. In their model, global aspects correspond to global properties of objects, e.g., the brand of a product type, while local aspects are the aspects of an object or entity that tend to be rated or evaluated by users.

More recently, Lin and He (2009) proposed the joint sentiment/topic (JST) model that jointly models topics (aspects) and sentiments. Rather than having only one set of latent topic variables as in LDA, JST adds another set of hidden sentiment variables. The advantage of JST is that it is able to model both aspects and sentiments in a fully unsupervised fashion without the need of supervised information such as labels. Based on JST, Jo and Oh (2011) made the assumption that one sentence represents only one aspect, i.e., all the words in a sentence are generated from one aspect. However, these models do not actually separate aspects and opinion words in their results. The maximum entropy model was integrated into a topic model by Zhao et al. (2010) to explicitly separate opinions from aspects. Chen et al. (2014) proposed the AKL (automated knowledge LDA) model that learns prior knowledge from reviews of other products/domains and applies such knowledge to


mine more coherent aspects. The knowledge base is represented by a set of clusters, where each cluster consists of words that are semantically correlated.

Some other NLP applications of topic models include machine translation (Eidelman et al. 2012), summarization (Haghighi and Vanderwende 2009), tagging (Krestel et al. 2009), multi-language topic synchronization (Petterson et al. 2010), topical keyphrase extraction (Zhao et al. 2011), relation extraction between named entities (Yao et al. 2011), entity linking (Han and Sun 2012), and document retrieval (Wei and Croft 2006).

Cross-References

�Bayesian Network
�Graphical Models
�Unsupervised Learning

Recommended Reading

Andrzejewski D, Zhu X, Craven M (2009) Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. In: ICML, Montreal, pp 25–32

Blei DM, McAuliffe JD (2010) Supervised topic models. In: NIPS, Whistler, pp 121–128

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Boyd-Graber JL, Blei DM, Zhu X (2007) A topic model for word sense disambiguation. In: EMNLP-CoNLL, Prague, pp 1024–1033

Chang J, Boyd-Graber J, Chong W, Gerrish S, Blei DM (2009) Reading tea leaves: how humans interpret topic models. In: NIPS, Whistler, pp 288–296

Chen Z, Liu B (2014) Topic modeling using topics from many domains, lifelong learning and big data. In: ICML, Beijing, pp 703–711

Chen Z, Mukherjee A, Liu B, Hsu M, Castellanos M, Ghosh R (2013) Exploiting domain knowledge in aspect extraction. In: EMNLP, Seattle, pp 1655–1667

Chen Z, Mukherjee A, Liu B (2014) Aspect extraction with automated prior knowledge learning. In: ACL, Baltimore, pp 347–358

Eidelman V, Boyd-Graber J, Resnik P (2012) Topic models for dynamic translation model adaptation. In: ACL, Jeju Island, pp 115–119

Griffiths TL, Steyvers M (2004) Finding scientific topics. PNAS 101(Suppl):5228–5235

Griffiths TL, Steyvers M, Blei DM, Tenenbaum JB (2004) Integrating topics and syntax. In: NIPS, Vancouver, pp 537–544

Haghighi A, Vanderwende L (2009) Exploring content models for multi-document summarization. In: ACL, Boulder, pp 362–370

Han X, Sun L (2012) An entity-topic model for entity linking. In: EMNLP, Jeju Island, pp 105–115

Hofmann T (1999) Probabilistic latent semantic analysis. In: UAI, Stockholm, pp 289–296

Hu Y, Boyd-Graber J, Satinoff B (2011) Interactive topic modeling. In: ACL, Portland, pp 248–257

Jo Y, Oh AH (2011) Aspect and sentiment unification model for online review analysis. In: WSDM, Hong Kong, pp 815–824

Krestel R, Fankhauser P, Nejdl W (2009) Latent Dirichlet allocation for tag recommendation. In: RecSys, New York, pp 61–68

Lin C, He Y (2009) Joint sentiment/topic model for sentiment analysis. In: CIKM, Hong Kong, pp 375–384

Liu B (2012) Sentiment analysis and opinion mining. Synth Lect Hum Lang Technol 5(1):1–167

Lu Y, Zhai C (2008) Opinion integration through semi-supervised topic modeling. In: WWW, Beijing, pp 121–130

Mei Q, Ling X, Wondra M, Su H, Zhai C (2007) Topic sentiment mixture: modeling facets and opinions in weblogs. In: WWW, Banff, pp 171–180

Minka T, Lafferty J (2002) Expectation-propagation for the generative aspect model. In: UAI'02, Edmonton, pp 352–359

Mukherjee A, Liu B (2012) Aspect extraction through semi-supervised modeling. In: ACL, Jeju Island, pp 339–348

Petterson J, Smola A, Caetano T, Buntine W, Narayanamurthy S (2010) Word features for latent Dirichlet allocation. In: NIPS, Whistler, pp 1921–1929

Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1–30

Titov I, McDonald R (2008) Modeling online reviews with multi-grain topic models. In: WWW, Beijing, pp 111–120

Toutanova K, Johnson M (2008) A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In: NIPS, Whistler

Wei X, Croft WB (2006) LDA-based document models for ad-hoc retrieval. In: SIGIR, Seattle, pp 178–185

Yao L, Haghighi A, Riedel S, McCallum A (2011) Structured relation discovery using generative models. In: EMNLP, Edinburgh, pp 1456–1466

Zhao WX, Jiang J, He J, Song Y, Achananuparp P, Lim E-P, Li X (2011) Topical keyphrase extraction from twitter. In: ACL, Portland, pp 379–388

Zhao WX, Jiang J, Yan H, Li X (2010) Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In: EMNLP, Cambridge, pp 56–65


Topology

�Topology of a Neural Network

Topology of a Neural Network

Risto Miikkulainen
Department of Computer Science, The University of Texas at Austin, Austin, TX, USA

Synonyms

Architecture; Connectivity; Structure; Topology

Definition

Topology of a neural network refers to the way the neurons are connected, and it is an important factor in how the network functions and learns. A common topology in unsupervised learning is a direct mapping of inputs to a collection of units that represents categories (e.g., � Self-Organizing Maps). The most common topology in supervised learning is the fully connected, three-layer, feedforward network (see �Backpropagation and �Radial Basis Function Networks): All input values to the network are connected to all neurons in the hidden layer (hidden because they are not visible in the input or output), the outputs of the hidden neurons are connected to all neurons in the output layer, and the activations of the output neurons constitute the output of the whole network. Such networks are popular partly because they are known theoretically to be universal function approximators (with, e.g., a sigmoid or Gaussian nonlinearity in the hidden layer neurons), although networks with more layers may be easier to train in practice (e.g., �Cascade-Correlation). In particular, deep learning architectures (see �Deep Learning) utilize multiple hidden layers to form a hierarchy of gradually more structured representations that support a supervised task on top. Layered networks can be extended to processing sequential input and/or output by saving a copy of the hidden layer activations and using it as additional input to the hidden layer in the next time step (see � Simple Recurrent Network). Fully recurrent topologies, where each neuron is connected to all other neurons (and possibly to itself), can also be used to model time-varying behavior, although such networks may be unstable and difficult to train (e.g., with backpropagation; but see also �Boltzmann Machines). Modular topologies, where different parts of the networks perform distinctly different tasks, can improve stability and can also be used to model high-level behavior (e.g., �Echo-State Machines and �Adaptive Resonance Theory). Whatever the topology, in most cases, learning involves modifying the �Weight on the network connections. However, arbitrary network topologies are possible as well and can be constructed as part of the learning (e.g., with backpropagation or �Neuroevolution) to enhance feature selection, recurrent memory, abstraction, or generalization.
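A minimal sketch of the fully connected, three-layer feedforward topology described above; the layer sizes, the sigmoid nonlinearity, and the random weights are illustrative choices only, and no training is performed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: every input feeds every hidden neuron, and every hidden
    neuron feeds every output neuron (a fully connected, three-layer topology)."""
    hidden = sigmoid(w_hidden @ x + b_hidden)
    return sigmoid(w_out @ hidden + b_out)

# Hypothetical network with 4 inputs, 5 hidden neurons, and 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
w1, w2 = rng.normal(size=(5, 4)), rng.normal(size=(2, 5))
print(forward(x, w1, np.zeros(5), w2, np.zeros(2)))
```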

Trace-Based Programming

Pierre Flener1 and Ute Schmid2

1Department of Information Technology, Uppsala University, Uppsala, Sweden
2Faculty of Information Systems and Applied Computer Science, University of Bamberg, Bamberg, Germany

Abstract

Trace-based programming is introduced as a specific approach to inductive programming where a, typically recursive, program is inferred from a small set of example computational traces.

Most of the work by this author was done while on leave of absence in 2006/2007 as a Visiting Faculty Member and Erasmus Exchange Teacher at Sabancı University, Turkey.


Synonyms

Programming from traces

Definition

Trace-based programming addresses the inference of a program from a small set of example computation traces. The induced program is typically a recursive program. A computation trace is a non-recursive expression that describes the transformation of some specific input into the desired output with the help of a predefined set of primitive functions. While the construction of traces is highly dependent on background knowledge or even on knowledge about the program searched for, the inductive generalization is based on syntactical methods of detecting regularities and dependencies between traces, as proposed in classical approaches to � inductive programming or � explanation-based learning. As an alternative to providing traces by hand simulation, AI planning techniques or programming by demonstration can be used.
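Purely for illustration, the hypothetical example below shows what computation traces and an induced recursive program might look like for computing the length of a list; the primitives and the generalization step are invented for this sketch and do not correspond to any particular system.

```python
# Example traces: non-recursive expressions over the primitives "empty test",
# "tail", and the constants 0 and 1, one trace per specific input:
#   length([])        =>  0
#   length([a])       =>  1 + 0
#   length([a, b])    =>  1 + (1 + 0)
#   length([a, b, c]) =>  1 + (1 + (1 + 0))
#
# A recursive program that a trace-based learner might induce by detecting
# the regular nesting pattern shared by the traces:
def length(xs):
    if not xs:                   # the "empty test" primitive
        return 0
    return 1 + length(xs[1:])    # the "tail" primitive

print(length(["a", "b", "c"]))   # 3
```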

Cross-References

�Explanation-Based Learning
� Inductive Programming
� Programming by Demonstration

Recommended Reading

Biermann AW (1972) On the inference of Turing machines from sample computations. Artif Intell 3(3):181–198

Schmid U, Wysotzki F (1998) Induction of recursive program schemes. In: Proceedings of the 10th European conference on machine learning (ECML 1998). Volume 1398 of lecture notes in artificial intelligence. Springer, pp 214–225

Schrodl S, Edelkamp S (1999) Inferring flow of control in program synthesis by example. In: Proceedings of the 23rd annual German conference on artificial intelligence (KI 1999). Volume 1701 of lecture notes in artificial intelligence. Springer, pp 171–182

Shavlik JW (1990) Acquiring recursive and iterative concepts with explanation-based learning. Mach Learn 5:39–70

Wysotzki F (1983) Representation and induction of infinite concepts and recursive action sequences. In: Proceedings of the 8th international joint conference on artificial intelligence (IJCAI 1983). Morgan Kaufmann, pp 409–414

Training Curve

�Learning Curves in Machine Learning

Training Data

Synonyms

Training examples; Training instances

Definition

Training data are data to which a learner is applied.

Cross-References

�Training Set

Training Examples

�Training Data

Training Instances

�Training Data

Training Set

Synonyms

Training data


Definition

A training set is a � data set containing data that are used for learning by a learning system. A training set may be divided further into a � growing set and a � pruning set.

Cross-References

�Data Set
�Training Data

Training Time

A learning algorithm is typically applied at two distinct times. Training time refers to the time when an algorithm is learning a model from � training data. �Test time refers to the time when an algorithm is applying a learned model to make predictions. �Lazy learning usually blurs the distinction between these two times, deferring most learning until test time.

Trait

�Attribute

Trajectory Data

�Time Series

Transductive Learning

� Semi-supervised Learning� Semi-supervised Text Processing

Transfer Learning

� Inductive Transfer

Transfer of Knowledge AcrossDomains

� Inductive Transfer

Transition Probabilities

In a �Markov decision process, the transition probabilities represent the probability of being in state $s'$ at time $t+1$, given that you take action $a$ from state $s$ at time $t$, for all $s$, $a$, and $t$.

Tree Augmented Naive Bayes

Fei Zheng1,2 and Geoffrey I. Webb3

1Monash University, Sydney, NSW, Australia
2Monash University, Clayton, Melbourne, VIC, Australia
3Faculty of Information Technology, Monash University, Victoria, Australia

Synonyms

TAN

Definition

Tree augmented � naive Bayes is a � semi-naive Bayesian Learning method. It relaxes the naive Bayes attribute independence assumption by employing a tree structure, in which each attribute only depends on the class and one other attribute. A maximum weighted spanning tree that maximizes the likelihood of the training data is used to perform classification.

Classification with TAN

Interdependencies between attributes can be addressed directly by allowing an attribute to


depend on other non-class attributes. However, techniques for learning unrestricted Bayesian networks often fail to deliver lower zero-one loss than naive Bayes (Friedman et al. 1997). One possible reason for this is that full �Bayesian networks are oriented toward optimizing the likelihood of the training data rather than the conditional likelihood of the class attribute given a full set of other attributes. Another possible reason is that full Bayesian networks have high variance due to the large number of parameters estimated. An intermediate alternative technique is to use a less restrictive structure than naive Bayes. Tree augmented naive Bayes (TAN) (Friedman et al. 1997) employs a tree structure, allowing each attribute to depend on the class and at most one other attribute. Figure 1 shows Bayesian network representations of the types of model that NB and TAN respectively create.

Chow and Liu (1968) proposed a method that efficiently constructs a maximum weighted spanning tree which maximizes the likelihood that the training data was generated from the tree. The weight of an edge in the tree is the mutual information of the two attributes connected by the edge. TAN extends this method by using conditional mutual information as weights. Since the selection of root does not affect the log-likelihood

Tree Augmented Naive Bayes, Fig. 1 Bayesian network examples of the forms of model created by NB and TAN

of the tree, TAN randomly selects a root attribute and directs all edges away from it. The parent of each attribute $X_i$ is indicated as $\pi(X_i)$ and the parent of the class is $\emptyset$. It assumes that attributes are independent given the class and their parents and classifies the test instance $\mathbf{x} = \langle x_1, \ldots, x_n \rangle$ by selecting

$$\operatorname{argmax}_{y} \; \hat{P}(y) \prod_{1 \le i \le n} \hat{P}(x_i \mid y, \pi(x_i)), \qquad (1)$$

where $\pi(x_i)$ is a value of $\pi(X_i)$ and $y$ is a class label.
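A small sketch of the classification rule in Eq. (1), assuming the tree structure π and the probability estimates have already been learned; the function and parameter names are invented for this illustration. In practice the estimates would be smoothed relative frequencies from the training data, and π would come from the maximum weighted spanning tree built on conditional mutual information.

```python
import math

def tan_classify(x, classes, prior, cond, parent):
    """Classify instance x (a list of attribute values) with a learned TAN model.

    prior[y]              estimate of P(y)
    cond[(i, xi, y, pv)]  estimate of P(X_i = xi | y, parent value pv); pv is None
                          for the root attribute, which has only the class as parent
    parent[i]             index of the parent attribute of attribute i (None for the root)
    """
    best, best_score = None, -math.inf
    for y in classes:
        score = math.log(prior[y])
        for i, xi in enumerate(x):
            pv = x[parent[i]] if parent[i] is not None else None
            score += math.log(cond[(i, xi, y, pv)])   # Eq. (1), computed in log space
        if score > best_score:
            best, best_score = y, score
    return best
```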

Due to the relaxed attribute independence assumption, TAN considerably reduces the � bias of naive Bayes at the cost of an increase in variance. Empirical results (Friedman et al. 1997) show that it substantially reduces the zero-one loss of naive Bayes on many data sets and that, of all data sets examined, it achieves lower zero-one loss than naive Bayes more often than not.

Cross-References

�Averaged One-Dependence Estimators
�Bayesian Network
�Naïve Bayes
� Semi-Naive Bayesian Learning

Recommended Reading

Chow CK, Liu CN (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14:462–467

Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29(2):131–163

Tree Mining

Siegfried Nijssen
Katholieke Universiteit Leuven, Leuven, Belgium

Definition

Tree mining is an instance of constraint-based pattern mining and studies the discovery of tree


patterns in data that is represented as a tree structure or as a set of tree structures. Minimum frequency is the most studied constraint.

Motivation and Background

Tree mining is motivated by the availability of many types of data that can be represented as tree structures. There is a large variety in tree types, for instance, ordered trees, unordered trees, rooted trees, unrooted (free) trees, labeled trees, unlabeled trees, and binary trees; each of these has its own application areas. An example is trees in tree banks, which store sentences annotated with parse trees. In such data, it is not only of interest to find commonly occurring sets of words (for which frequent itemset miners could be used), but also to find commonly occurring parses of these words. Tree miners aim at finding patterns in this structured information. The patterns can be interesting in their own right, or can be used as features in classification algorithms.

Structure of Problem

All tree miners share a similar problem setting. Their input consists of a set of trees and a set of constraints, usually a minimum frequency constraint, and their output consists of all subtrees that fulfill the constraints.

Tree miners differ in the constraints that they are able to deal with, and the types of trees that they operate on. The following types of trees can be distinguished:

Free trees, which are graphs without cycles, and no order on the nodes or edges;

Unordered trees, which are free trees in which one node is chosen to be the root of the tree;

Ordered trees, which are rooted trees in which the nodes are totally ordered.

For each of these types of tree, we can choose to have labels on the nodes, or on the edges, or on both.

The differences between these types of trees are illustrated in Fig. 1. Every graph in this figure can be interpreted as a free tree $F_i$, an unordered tree $U_i$, or an ordered tree $T_i$. When interpreted as ordered trees, none of the trees are equivalent. When we interpret them as unordered trees, $U_1$ and $U_2$ are equivalent representations of the same unordered tree that has B as its root and C and D as its children. Finally, as free trees, not only $F_1$ and $F_2$ are equivalent, but also $F_5$ and $F_7$.

Intuitively, a free tree requires less specification than an ordered tree. The number of possible free trees is smaller than the number of possible ordered trees. On the other hand, to test if two trees are equivalent we need a more elaborate computation for free trees than for ordered trees.

Assume that we have data represented as (a set of) trees; then the data mining problem is to find patterns, represented as trees, that fulfill constraints based on this data. To express these constraints, we need a coverage relation that expresses when one tree can be considered to occur in another tree. Different coverage relations can be expressed for free trees, ordered trees, and unordered trees. We will introduce these relations through operations that can be used to transform trees.

Tree Mining, Fig. 1 The leftmost tree is part of the data; the other trees could be patterns in this tree, depending on the subtree relation that is used


Tree Mining, Fig. 2 Relations between the trees of Fig. 1

As an example, consider the operation that removes a leaf from a tree. We can repeatedly apply this operation to turn a large tree into a smaller one. Given two trees A and B, we say that A occurs in B as

Induced subtree, if A can be obtained from B by repeatedly removing leaves from B. When dealing with rooted trees, the root is here also considered to be a leaf if it has one child;

Root-induced subtree, if A can be obtained from B by repeatedly removing leaves from B. When dealing with rooted trees, the root is not allowed to be removed;

Embedded subtree, if A can be obtained from B by repeatedly either (1) removing a leaf or (2) removing an internal node, reconnecting the children of the removed node with the parent of the removed node;

Bottom-up subtree, if there is a node v in B such that if we remove all nodes from B that are not a descendant of v, we obtain A;

Prefix, if A can be obtained from B by repeatedly removing the last node from the ordered tree B;

Leaf set, if A can be obtained from B by selecting a set of leaves from B, and all their ancestors in B.

For free trees, only the induced subtree relation is well-defined. A prefix is only well-defined for ordered trees; the other relations apply both to ordered and unordered trees. In the case of unordered trees, we assume that each operation maintains the order of the original tree B. The relations are also illustrated in Fig. 2.

Intuitively, we can speak of occurrences (also called embeddings by some authors) of a small


tree in a larger tree. Each such occurrence (or embedding) can be thought of as a function $\varphi$ that maps every node in the small tree to a node in the large tree.

Using an occurrence relation, we can define frequency measures. Assume we are given a forest F of trees, all ordered, unordered, or free. Then the frequency of a tree A can be defined

Transaction-based, where we count the number of trees $B \in F$ such that A is a subtree of B;

Node-based, where we count the number of nodes v in F such that A is a subtree of the bottom-up subtree below v.

Node-based frequency is only applicable in rooted trees, in combination with the root-induced, bottom-up, prefix, or leaf set subtree relations.

Given a definition of frequency, constraints on trees of interest can be expressed:

Minimum frequency, to specify that only trees with a certain minimum number of occurrences are of interest;

Closedness, to specify that a tree is only of interest if its frequency is different from all its supertrees;

Maximality, to specify that a tree is only of interest if none of its supertrees is frequent.

Observe that in all of these constraints, the subtree relation is again important. The subtree relation is not only used to compare patterns with data, but also patterns among themselves.

The tree mining problem can now be stated as follows. Given a forest of trees F (ordered, unordered, or free) and a set of constraints, based on a subtree relation, the task is to find all trees that satisfy the given constraints.

Theory/Solution

The tree mining problem is an instance of the more general problem of constraint-based pattern mining. For more information about the general setting, see the sections on constraint-based mining, itemset mining, and graph mining.

All algorithms iterate a process of generating candidate patterns and testing whether these candidates satisfy the constraints. It is essential to avoid considering every possible tree as a candidate. To this purpose, the algorithms exploit the fact that many frequency measures are anti-monotonic. This property states that for two given trees A and B, where A is a subtree of B, if A is infrequent, then B is also infrequent, and therefore B does not need to be considered as a candidate.

This observation can make it possible to find all trees that satisfy the constraints, if these requirements are fulfilled:

• We have an algorithm to enumerate candidate subtrees, which satisfies these properties:
  – It should be able to enumerate all trees in the search space;
  – It should ensure that no two equivalent subtrees are listed;
  – It should only list a tree after at least one of its subtrees has been listed, to exploit the anti-monotonicity of the frequency constraint;
• We have an algorithm to efficiently compute in how many database trees a pattern tree occurs.

The algorithmic solutions to these problems depend on the type of tree and the subtree relation.

Encoding and Enumerating Trees

We will first consider how tree miners internally represent trees. Two types of encodings have been proposed, both of which are string-based. We will illustrate these encodings for node-labeled trees, and start with ordered trees.

The first encoding is based on a preorder listing of nodes: (1) for a rooted ordered tree T with a single vertex r, the preorder string of T is $S_{T,r} = l_r\,{-1}$, where $l_r$ is the label of the single vertex r, and (2) for a rooted ordered tree T with more than one vertex, assuming the root of T is r (with label $l_r$) and the children of r are $r_1, \ldots, r_K$ from left to right, the preorder string for T is $S_{T,r} = l_r\, S_{T,r_1} \cdots S_{T,r_K}\,{-1}$, where $S_{T,r_1}, \ldots, S_{T,r_K}$ are the preorder strings for the bottom-up subtrees below nodes $r_1, \ldots, r_K$ in T.

The second encoding is based on listing the depths of the nodes together with their labels in prefix-order. The depth of a node v is the length of the path from the root to the node v. The code for a tree is $S_{T,r} = d_r\, l_r\, S_{T,r_1} \cdots S_{T,r_K}$, where $d_r$ is the depth of the node r in tree T.
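Both encodings are easy to generate; the sketch below assumes trees are given as nested (label, children) tuples, an arbitrary representation chosen for this illustration.

```python
def preorder_string(tree):
    """Preorder string: the label, the encodings of the children left to right, then -1."""
    label, children = tree
    return label + "".join(preorder_string(c) for c in children) + "-1"

def depth_sequence(tree, depth=1):
    """Depth sequence: each node's depth and label, listed in preorder."""
    label, children = tree
    return f"{depth}{label}" + "".join(depth_sequence(c, depth + 1) for c in children)

# Hypothetical tree with root A, children B (which has child C) and E.
t = ("A", [("B", [("C", [])]), ("E", [])])
print(depth_sequence(t))    # 1A2B3C2E
print(preorder_string(t))   # ABC-1-1E-1-1
```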

Both encodings are illustrated in Fig. 3. A search space of trees can be visualized as in Fig. 4. In this figure, every node corresponds to the depth encoding of a tree, while the edges visualize the partial order defined by the subtree relation.

Tree   Depth-sequence   Preorder string
T6     1A2B2D           AB-1D-1
T7     1A2B3C3D         ABC-1D-1-1-1
T      1A2B3C3D2E       ABC-1D-1-1E-1
T4     1A2D2E           AD-1E-1-1
T3     1A2E             AE-1-1
T5     1B2A2C2D         BA-1C-1D-1-1
T1     1B2C2D           BC-1D-1-1
T2     1B2D2C           BD-1C-1-1

Tree Mining, Fig. 3 Depth sequences for all the trees of Fig. 1, sorted in lexicographical order. Tree T2 is the canonical form of unordered tree U2, as its depth sequence is the highest among equivalent representations

It can be seen that the number of induced subtree relations between trees is smaller than the number of embedded subtree relations.

The task of the enumeration algorithm is to traverse this search space starting from trees that contain only one node. Most algorithms perform the search by building an enumeration tree over the search space. In this enumeration tree every pattern should have a single parent. The children of a pattern in the enumeration tree are called its extensions or its refinements. An example of an enumeration tree for the induced subtree relation is given in Fig. 5.

In the enumeration tree that is given here, the parent of a tree is its prefix in the depth encoding. An alternative definition is that the parent of a tree can be obtained by removing the last node in a prefix order traversal of the ordered tree. Every refinement in the enumeration has one additional node that is connected to the rightmost path of the parent.
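A sketch of this rightmost-path extension step on depth sequences, here represented as lists of (depth, label) pairs; the label alphabet is an arbitrary assumption. A new node can become a child of any node on the rightmost path, which corresponds to appending a pair whose depth ranges from 2 up to the depth of the last node plus one.

```python
def rightmost_extensions(pattern, labels):
    """All one-node extensions of a depth-sequence pattern along its rightmost path."""
    last_depth = pattern[-1][0]
    return [pattern + [(d, label)]
            for d in range(2, last_depth + 2)
            for label in labels]

# Extensions of the pattern 1A2B over the label alphabet {A, B}.
for ext in rightmost_extensions([(1, "A"), (2, "B")], ["A", "B"]):
    print("".join(f"{d}{l}" for d, l in ext))
# 1A2B2A, 1A2B2B, 1A2B3A, 1A2B3B  (cf. the children of 1A2B in Fig. 5)
```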

The enumeration problem is more complicated for unordered trees. In this case, the trees represented by the strings 1A2A2B and 1A2B2A are equivalent, and we only wish to enumerate one of these strings. This can be achieved by defining a total order on all strings that represent trees, and by defining that only the highest (or lowest) string of a set of equivalent strings should be considered.

For depth encodings, the ordering is usually lexicographical, and the highest string is chosen to be the canonical encoding. In our example, 1A2B2A would be canonical. This code has the desirable property that every prefix of a canonical code is also a canonical code. Furthermore, it can be determined in polynomial time which

Tree Mining, Fig. 4 A search space of ordered trees, where edges denote subtree relationships



Tree Mining, Fig. 5 Part of an enumeration tree for the search space of Fig. 4

extensions of a canonical code lead to a canonical code, such that it is not necessary to consider any code that is not canonical.

Alternative codes have also been proposed, which are not based on a preorder, depth-first traversal of a tree, but on a level-wise listing of the nodes in a tree.

Finally, for free trees we have the additional problem that we do not have a root for the tree. Fortunately, it is known that every free tree either has a uniquely determined center or a uniquely determined bicenter. This (bi)center can be found by determining the longest path between two nodes in a free tree: the node(s) in the middle of this path are the center of the tree. It can be shown that if multiple paths have the same maximal length, they will have the same (bi)center. By appointing one center to be the root, we obtain a rooted tree, for which we can compute a code.

To avoid listing two codes that represent equivalent free trees, several solutions have been proposed. One is based on the idea of first enumerating paths (thus fixing the center of a tree), and for each of these paths enumerating all trees that can be grown around them. Another solution is based on enumerating all rooted, unordered trees under the constraint that at least two different children of the root have a bottom-up subtree of equal, maximal depth. In the first approach, a preorder depth encoding was used; in the second approach a level-wise encoding was used.

Counting Trees

To evaluate the frequency of a tree, the subtree relation between a candidate pattern tree and all trees in the database has to be computed.

Tree Mining, Table 1 Worst-case complexities of the best known algorithms that determine whether a tree relation holds between two trees; m is the number of nodes in the pattern tree, l is the number of leaves in the pattern tree, n the number of nodes in the database tree

Ordered
  Embedding      O(nl)
  Induced        O(nm)
  Root-induced   O(n)
  Leaf-set       O(n)
  Bottom-up      O(n)
  Prefix         O(m)

Unordered
  Embedding      NP-complete
  Induced        O(nm^{1.5} / log m)
  Root-induced   O(nm^{1.5} / log m)
  Leaf-set       O(nm^{1.5} / log m)
  Bottom-up      O(n)

For each of our subtree relations, polynomial algorithms are known to decide the relation; these are summarized in Table 1.

Even though a subtree testing algorithm and an algorithm for enumerating trees are sufficient to compute all frequent subtrees correctly, in practice fine-tuning is needed to obtain an efficient method. There are two reasons for this:

• In some databases, the number of candidates can by far exceed the number of trees that are actually frequent. One way to reduce the number of candidates is to only generate a particular candidate after we have encountered

Page 62: Motivation and Background - Springer · Table Extraction from Text Documents 1231 T to predicate the distinction from other visual objects in a document. Structural Inference For

1290 Tree Mining

at least one occurrence of it in the data (thisis called pattern growth); another way is torequire that a candidate is only generated if atleast two of its subtrees satisfy the constraints(this is called pattern joining).

• The trees in the search space are very similar to each other: a parent only differs from its children by the absence of a single node. If memory allows, it is desirable to reuse the subtree matching information, instead of starting the matching from scratch.

A large number of data structures have been proposed to exploit these observations. We will illustrate these ideas using the FreqT algorithm, which mines induced, ordered subtrees, and uses a depth encoding for the trees.

In FreqT, for a given pattern tree A, a list of (database tree, database node) pointers is stored. Every element (B, v) in this list corresponds to an occurrence of tree A in tree B in which the last node (in terms of the preorder) of A is mapped to node v in database tree B. For a database and three example trees this is illustrated in Fig. 6.

Every tree in the database is stored as follows. Every node is given an index, and for every node, we store the index of its parent, its right-hand sibling, and its first child.

Let us consider how we can compute the occurrences of the subtree 1A2B2B from the occurrences of the tree 1A2B. The first occurrence of 1A2B is (t1, 2), which means that the B-labeled node can be mapped to node 2 in t1. Using the arrays that store the database tree, we can then conclude that node 6, which is the right-hand sibling of node 2, corresponds to an occurrence of the subtree 1A2B2B. Therefore, we add (t1, 6) to the occurrence list of 1A2B2B. Similarly, by scanning the data we find out that the first child of node 2 corresponds to an occurrence of the subtree 1A2B3C, and we add (t1, 3) to the occurrence list of 1A2B3C.

Overall, using the parent, sibling, and child pointers we can scan every node in the data that could correspond to a valid expansion of the subtree 1A2B, and update the corresponding lists. After we have done this for every occurrence of the subtree, we know the occurrence lists of all possible extensions.

From an occurrence list we can determine the frequency of a tree. For instance, the transaction-based frequency can be computed by counting the number of different database trees occurring in the list.
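The following Python sketch illustrates this bookkeeping on the tree t1 of Fig. 6, using the per-node pointers (parent, right-hand sibling, first child) described above. It is a simplified illustration, not the original FreqT implementation: only one database tree is encoded, the labels of the lower nodes are assumed for the example, and the helper names expand and transaction_frequency are invented here.

```python
from collections import namedtuple, defaultdict

# Per-node storage described in the text: for every node we keep its label and the
# indices of its parent, right-hand sibling, and first child (None if absent).
Node = namedtuple("Node", "label parent sibling child")

# Tree t1 from Fig. 6: root A (node 1) with two B children (nodes 2 and 6);
# the labels C, D, E of nodes 3-5 are assumed here for illustration.
t1 = {
    1: Node("A", None, None, 2),
    2: Node("B", 1, 6, 3),
    3: Node("C", 2, 4, None),
    4: Node("D", 2, 5, None),
    5: Node("E", 2, None, None),
    6: Node("B", 1, None, None),
}
database = {"t1": t1}  # only one tree, to keep the sketch short

def expand(occurrences):
    """Simplified expansion step: from each occurrence (tree id, node matched by the
    pattern's last node), follow the right-sibling and first-child pointers and group
    the resulting occurrences of the extended patterns by the label of the new node."""
    sibling_occ, child_occ = defaultdict(list), defaultdict(list)
    for tree_id, v in occurrences:
        node = database[tree_id][v]
        if node.sibling is not None:
            sibling_occ[database[tree_id][node.sibling].label].append((tree_id, node.sibling))
        if node.child is not None:
            child_occ[database[tree_id][node.child].label].append((tree_id, node.child))
    return sibling_occ, child_occ

def transaction_frequency(occurrences):
    """Transaction-based frequency: the number of distinct database trees in the list."""
    return len({tree_id for tree_id, _ in occurrences})

# Occurrence list of the pattern 1A2B in t1: its B node matches nodes 2 and 6.
occ_1A2B = [("t1", 2), ("t1", 6)]
siblings, children = expand(occ_1A2B)
print(siblings["B"])                    # [('t1', 6)]  -> occurrence list of 1A2B2B
print(children["C"])                    # [('t1', 3)]  -> occurrence list of 1A2B3C
print(transaction_frequency(occ_1A2B))  # 1
```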

As we claimed, this example illustrates two features that are commonly seen in tree miners: first, the occurrence list of one tree is used to compute the occurrence list of another tree, thus reusing information; second, the candidates are collected from the data by scanning the nodes that connect to the occurrence of a tree in the data. Furthermore, this example illustrates that a careful design of the data structure that stores the data can ease the frequency evaluation considerably.

Tree Mining, Fig. 6 A tree database (left) and three ordered trees with their occurrence lists according to the FreqT algorithm (right). The data structure that stores t1 in FreqT is given in the table (right)

FreqT does not perform pattern joining. The most well-known example of an algorithm that performs pattern joining is the embedded TreeMiner (Zaki 2002). Both FreqT and the TreeMiner perform the search depth-first, but tree miners that use the traditional level-wise approach of the Apriori algorithm have also been proposed. Both FreqT and the TreeMiner have been extended to unordered trees.

Other Constraints
As the number of frequent subtrees can be very large, approaches have been studied to reduce the number of trees returned by the algorithm, of which closed and maximal trees are the most popular. To find closed or maximal trees, two issues need to be addressed:

• How do we make sure that we only output a tree if it is closed or maximal, that is, how do we determine that none of its supertrees has the same support, or is frequent?

• Can we conclude that some parts of the search space will never contain a closed or maximal tree, thus making the search more efficient?

Two approaches can be used to address the first issue:

• All closed patterns can be stored, and every new pattern can be compared with the stored set of patterns;

• When we evaluate the frequency of a pattern in the data, we also (re)evaluate the frequency of all its possible extensions, and only output the pattern if its support is different.

The second approach requires less memory, but in some cases requires more computation.
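A minimal sketch of the first of the two approaches is given below: all frequent patterns are stored with their supports, and a pattern is reported as closed only if no strict superpattern has the same support. The subpattern test is passed in as a parameter; in the toy usage, the prefix relation on depth codes stands in for a full subtree test, and the function name closed_patterns is an invention of this sketch.

```python
def closed_patterns(patterns, is_subpattern):
    """Report only closed patterns: those for which no strict superpattern in
    `patterns` has the same support.  `patterns` maps a pattern to its support;
    `is_subpattern(p, q)` decides whether p occurs in q (for trees, one of the
    subtree relations discussed above)."""
    closed = {}
    for p, support in patterns.items():
        has_equal_super = any(
            q != p and support == support_q and is_subpattern(p, q)
            for q, support_q in patterns.items()
        )
        if not has_equal_super:
            closed[p] = support
    return closed

# Toy usage: depth codes with the prefix relation standing in for a subtree test.
frequent = {"1A": 2, "1A2B": 2, "1A2B2B": 2, "1A2A": 1}
print(closed_patterns(frequent, lambda p, q: q.startswith(p)))
# {'1A2B2B': 2, '1A2A': 1}
```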

To prune the search space, a common approach is to check all occurrences of a tree in the data. If every occurrence of a tree can be extended into an occurrence of another tree, the smaller tree should not be considered, and the search should continue with the tree that contains all common edges and nodes. In contrast to graph mining, it can be shown that this kind of pruning can safely be done in most cases.

Applications

Examples of databases to which tree mining algorithms have been applied are:

Parse tree analysis: Since the early 1990s large Treebank datasets have been collected consisting of sentences and their grammatical structure. An example is the Penn TreeBank (Marcus et al. 1993). These databases contain rooted, ordered trees. To discover differences in domain languages it is useful to compare commonly occurring grammatical constructions in two different sets of parsed texts, for which tree miners can be used (Sekine 1998).

Computer network analysis: IP multicast is a protocol for sending data to multiple receivers. In an IP multicast session a webserver sends a packet once; routers copy a packet if two different routes are required to reach multiple receivers. During a multicast session rooted trees are obtained in which the root is the sender and the leaves are the receivers. Commonly occurring patterns in the routing data can be discovered by analyzing these unordered rooted trees (Chalmers and Almeroth 2003).

Webserver access log analysis: When users browse a website, this behavior is reflected in the access log files of the webserver. Servers collect information such as the webpage that was visited, the time of the visit, and the webpage that was clicked to reach the webpage. The access logs can be transformed into a set of ordered trees, each of which corresponds to a visitor. Nodes in these trees correspond to webpages; edges are inserted if a user browses from one webpage to another. Nodes are ordered in viewing order. A tool was developed to perform this transformation in a sensible way (Punin et al. 2002).

Phylogenetic trees: One of the largest tree databases currently under construction is the TreeBASE database, which comprises a large number of phylogenetic trees (Morell 1996). The trees in the TreeBASE database are submitted by researchers and are collected from publications. Originating from multiple sources, they can disagree on parts of the phylogenetic tree. To find common agreements between the trees, tree miners have been used (Zhang and Wang 2005). The phylogenetic trees are typically unordered; labels among siblings are unique.

Hypergraph mining: Hypergraphs are graphs in which one edge can have more than two endpoints. Those hypergraphs in which no two nodes share the same label can be transformed into unordered trees, as follows. First, an artificial root is inserted. Second, for each edge of the hypergraph a child node is added to the root, labeled with the label of the hyperedge. Finally, the labels of nodes within hyperedges are added as leaves to the tree. An example of hypergraph data is bibliographic data: if each example corresponds to a paper, nodes in the hypergraph correspond to authors cited by the paper, and hyperedges connect coauthors of cited papers.
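This transformation is small enough to sketch directly. The snippet below assumes the hypergraph is given as a mapping from hyperedge labels to sets of node labels; the function name and the artificial root label ROOT are illustrative choices.

```python
def hypergraph_to_tree(hyperedges, root_label="ROOT"):
    """Encode a hypergraph as an unordered rooted tree in the (label, children)
    representation: an artificial root, one child per hyperedge carrying the edge
    label, and the labels of the nodes in that hyperedge as leaves below it.
    Assumes no two nodes of the hypergraph share the same label."""
    children = []
    for edge_label, node_labels in hyperedges.items():
        leaves = [(node_label, []) for node_label in sorted(node_labels)]
        children.append((edge_label, leaves))
    return (root_label, children)

# Bibliographic example from the text: hyperedges connect coauthors of cited papers.
cited = {"paper1": {"smith", "jones"}, "paper2": {"jones", "lee", "kim"}}
print(hypergraph_to_tree(cited))
# ('ROOT', [('paper1', [('jones', []), ('smith', [])]),
#           ('paper2', [('jones', []), ('kim', []), ('lee', [])])])
```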

Multi-relational data mining: Many multi-relational databases are tree shaped, or a tree-shaped view can be created. For instance, a transaction database in which every transaction is associated with customers and their information can be represented as a tree (Berka 1999).

XML data mining: Several authors have stressed that tree mining algorithms are most suitable for mining XML data. XML is a tree-shaped data format, and tree miners can be helpful when trying to (re)construct Document Type Definitions (DTDs) for such documents.
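As a sketch of this use case, the snippet below maps an XML document to the (label, children) representation used in the earlier sketches, keeping only element tags as labels, so that depth encodings and tree miners can then be applied. It relies only on the standard xml.etree.ElementTree module; the element names in the example are made up.

```python
import xml.etree.ElementTree as ET

def xml_to_tree(element):
    """Map an XML element to the (label, children) representation used in the
    sketches above, keeping only tag names as labels (attributes and text are
    ignored in this simplified view)."""
    return (element.tag, [xml_to_tree(child) for child in element])

doc = ET.fromstring("<order><customer/><items><item/><item/></items></order>")
print(xml_to_tree(doc))
# ('order', [('customer', []), ('items', [('item', []), ('item', [])])])
```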

Cross-References

�Constraint-Based Mining
�Graph Mining

Further Reading

The FreqT algorithm was introduced in (Asai et al. 2002; Wang and Liu 1998; Zaki 2002). The most popular tree miner is the embedded tree miner by Zaki (2002). A more detailed overview of tree miners can be found in Chi et al. (2005). Most implementations of tree miners are available on request from their authors.

Recommended Reading

Asai T, Abe K, Kawasoe S, Arimura H, Sakamoto H, Arikawa S (2002) Efficient substructure discovery from large semi-structured data. In: Proceedings of the second SIAM international conference on data mining, Arlington. SIAM, pp 158–174
Berka P (1999) Workshop notes on discovery challenge PKDD-99 (Technical report). University of Economics, Prague
Chalmers R, Almeroth K (2003) On the topology of multicast trees. IEEE/ACM Trans Netw 11:153–165. IEEE Press/ACM Press
Chi Y, Nijssen S, Muntz RR, Kok JN (2005) Frequent subtree mining – an overview. Fundam Inform 66:161–198. IOS Press
Marcus MP, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19:313–330. MIT Press
Morell V (1996) TreeBASE: the roots of phylogeny. Science 273:569
Punin J, Krishnamoorthy M, Zaki MJ (2002) LOGML – log markup language for web usage mining. In: WEBKDD 2001 – mining web log data across all customers touch points. Third international workshop, San Francisco. Lecture notes in artificial intelligence, vol 2356. Springer, pp 88–112
Sekine S (1998) Corpus-based parsing and sublanguage studies. Ph.D. dissertation. New York University, New York
Wang K, Liu H (1998) Discovering typical structures of documents: a road map approach. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, Melbourne. ACM Press, pp 146–154
Zaki MJ (2002) Efficiently mining frequent trees in a forest. In: Proceedings of the 8th international conference on knowledge discovery and data mining (KDD), Edmonton. ACM Press, pp 71–80
Zhang S, Wang J (2005) Frequent agreement subtree mining. http://aria.njit.edu/mediadb/fast/

Tree-Based Regression

�Regression Trees


True Lift Modeling

�Uplift Modeling

True Negative

True negatives are the negative examples that are correctly classified by a classification model. See �Confusion Matrix for a complete range of related terms.

True Negative Rate

�Specificity

True Positive

True positives are the positive examples that are correctly classified by a classification model. See �Confusion Matrix for a complete range of related terms.

True Positive Rate

�Sensitivity

Type

�Class

Typical Complexity of Learning

�Phase Transitions in Machine Learning