
Detecting Cross-Cultural Differences Using a Multilingual Topic Model


E.D. Gutiérrez¹   Ekaterina Shutova²   Patricia Lichtenstein³   Gerard de Melo⁴   Luca Gilardi⁵

¹ University of California, San Diego   ² Computer Laboratory, University of Cambridge   ³ University of California, Merced   ⁴ IIIS, Tsinghua University   ⁵ ICSI, Berkeley

[email protected] [email protected] [email protected] [email protected] [email protected]

Transactions of the Association for Computational Linguistics, vol. 4, pp. 47–60, 2016. Action Editor: David Chiang. Submission batch: 11/2015; Published 2/2016. © 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Understanding cross-cultural differences has important implications for world affairs and many aspects of the life of society. Yet, the majority of text-mining methods to date focus on the analysis of monolingual texts. In contrast, we present a statistical model that simultaneously learns a set of common topics from multilingual, non-parallel data and automatically discovers the differences in perspectives on these topics across linguistic communities. We perform a behavioural evaluation of a subset of the differences identified by our model in English and Spanish to investigate their psychological validity.

1 Introduction

Recent years have seen a growing interest in text-mining applications aimed at uncovering public opinions and social trends (Fader et al., 2007; Monroe et al., 2008; Gerrish and Blei, 2011; Pennacchiotti and Popescu, 2011). They rest on the assumption that the language we use is indicative of our underlying worldviews. Research in cognitive and sociolinguistics suggests that linguistic variation across communities systematically reflects differences in their cultural and moral models and goes beyond lexicon and grammar (Kövecses, 2004; Lakoff and Wehling, 2012). Cross-cultural differences manifest themselves in text in a multitude of ways, most prominently through the use of explicit opinion vocabulary with respect to a certain topic (e.g. "policies that benefit the poor"), idiomatic and metaphorical language (e.g. "the company is spinning its wheels") and other types of figurative language, such as irony or sarcasm.

The connection between language, culture and reasoning remains one of the central research questions in psychology. Thibodeau and Boroditsky (2011) investigated how metaphors affect our decision-making. They presented two groups of human subjects with two different texts about crime. In the first text, crime was metaphorically portrayed as a virus and in the second as a beast. The two groups were then asked a set of questions on how to tackle crime in the city. As a result, while the first group tended to opt for preventive measures (e.g. stronger social policies), the second group converged on punishment- or restraint-oriented measures. According to Thibodeau and Boroditsky, their results demonstrate that metaphors have profound influence on how we conceptualize and act with respect to societal issues. This suggests that in order to gain a full understanding of social trends across populations, one needs to identify subtle but systematic linguistic differences that stem from the groups' cultural backgrounds, expressed both literally and figuratively. Performing such an analysis by hand is labor-intensive and often impractical, particularly in a multilingual setting where expertise in all of the languages of interest may be rare.

With the rise of blogging and social media, NLP techniques have been successfully used for a number of tasks in political science, including automatically estimating the influence of particular politicians in the US senate (Fader et al., 2007), identifying lexical features that differentiate political rhetoric of opposing parties (Monroe et al., 2008), predicting voting patterns of politicians based on their use of language (Gerrish and Blei, 2011), and predicting political affiliation of Twitter users (Pennacchiotti and Popescu, 2011).


Fang et al. (2012) addressed the problem of automatically detecting and visualising the contrasting perspectives on a set of topics attested in multiple distinct corpora. While successful in their tasks, all of these approaches focused on monolingual data and did not reach beyond literal language. In contrast, we present a method that detects fine-grained cross-cultural differences from multilingual data, where such differences abound, expressed both literally and figuratively. Our method brings together opinion mining and cross-lingual topic modelling techniques for this purpose. Previous approaches to cross-lingual topic modelling (Boyd-Graber and Blei, 2009; Jagarlamudi and Daumé III, 2010) addressed the problem of mining common topics from multilingual corpora. We present a model that learns such common topics, while simultaneously identifying lexical features that are indicative of the underlying differences in perspectives on these topics by speakers of English, Spanish and Russian. These differences are mined from multilingual, non-parallel datasets of Twitter and news data. In contrast to previous work, our model does not merely output a list of monolingual lexical features for manual comparison, but also automatically infers multilingual contrasts.

Our system (1) uses word-document co-occurrence data as input, where the words are labeled as topic words or perspective words; (2) finds the highest-likelihood dictionary between topic words in the two languages given the co-occurrence data; (3) finds cross-lingual topics specified by distributions over topic-words and perspective-words; and (4) automatically detects differences in perspective-word distributions in the two languages. We perform a behavioural evaluation of a subset of the differences identified by the model and demonstrate their psychological validity. Our data and dictionaries are available from the first author upon request.

2 Related work

View detection. Identifying different viewpoints is related to the well-studied area of subjectivity detection, which aims at exposing opinion, evaluation, and speculation in text (Wiebe et al., 2004) and attributing it to specific people (Awadallah et al., 2011; Abu-Jbara et al., 2012). In our work, we are less interested in explicit local forms of subjectivity, instead aiming at detecting more general contrasts across linguistic communities.

Another line of research has focused on inferring author attributes such as gender, age (Garera and Yarowsky, 2009), location (Jones et al., 2007), or political affiliation (Pennacchiotti and Popescu, 2011). Such studies make use of syntactic style, discourse characteristics, as well as lexical choice. The models used for this are typically binary classifiers trained in a fully supervised fashion. In contrast, in our task, we automatically infer the topic distributions and find topic-specific contrasts.

Probabilistic topic models. Probabilistic topic models have proven useful for a variety of semantic tasks, such as selectional-preference induction (Ó Séaghdha, 2010; Ritter et al., 2010), sentiment analysis (Boyd-Graber and Resnik, 2010) and studying the evolution of concepts and ideas (Hall et al., 2008). The goal of a topic model is to characterize observed data in terms of a much smaller set of unobserved, semantically coherent topics. A particularly popular probabilistic topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Under its assumptions, each document has a unique mix of topics, and each topic is a distribution over terms in the vocabulary. A topic is chosen for every word token according to the topic mix of the document to which it belongs, and then the word's identity is drawn from the corresponding topic's distribution.

Handling multilingual corpora. LDA is designed for monolingual text and thus it lacks the structure necessary to model cross-lingually valid topics. While topic models can be trained individually on two languages and the acquired topics can then be matched, the correspondences between the topics in the two languages will be highly unstable. To address this, Boyd-Graber and Blei (2009) (MUTO) and Jagarlamudi and Daumé III (2010) (JOINTLDA) introduced the notion of cross-lingually valid concepts associated with different terms in different languages, using bilingual dictionaries to model topics across languages. Based on a model by Haghighi et al. (2008), MUTO is capable of learning translations, i.e., matchings between terms in the different languages being compared. The Polylingual Topic Model of Mimno et al. (2009) is another approach to finding topics in multilingual corpora, but it requires tuples composed of comparable documents in each language of the corpus.

Topic models for view detection. LDA also assumes that the distribution of each topic is fixed across all documents in a corpus. Therefore, a topic associated with, e.g., war will have the same distribution over the lexicon regardless of whether the document was taken from a pro-war editorial or an anti-war speech. However, in reality we may expect a single topic to exhibit systematic and predictable variations in its distribution based on authorship.

The cross-collection LDA model by Paul and Girju (2009) addresses this by specifically aiming to expose viewpoint differences across different document collections. Ahmed and Xing (2010) proposed a similar model for detecting ideological differences. Fang et al. (2012)'s Cross-Perspective Topic (CPT) model breaks up the terms in the vocabulary into topic terms and perspective terms with different generative processes, and differentiates between different collections of documents within the corpus. The topic terms are assumed to be generated as in LDA. However, the distribution of perspective terms in a document is taken to be dependent on both the topic mixture of the document as well as the collection from which the document is drawn.

More recent work has proposed models for specific types of data. Qiu and Jiang (2013) use user identities and interactions in threaded discussions, while Gottipati et al. (2013) developed a topic model for Debatepedia, a semi-structured resource in which arguments are explicitly enumerated. However, all of these models perform their analyses on monolingual datasets. Thus, they are useful for comparing different ideologies expressed in the same language, but not for cross-linguistic comparisons.

3 Method

The goal of our model is to analyse large, non-parallel, multilingual corpora and present cross-lingually valid topics and the associated perspectives, automatically inferring the differences in conceptualization of these topics across cultures. Following Boyd-Graber and Blei (2009) and Jagarlamudi and Daumé III (2010), our distributions of latent topics range over latent, cross-lingual topic concepts that manifest themselves as language-specific topic words. We use bilingual dictionaries, containing words in one language and their translations in another language, to represent the topic concepts. These are represented as a bipartite graph, with each translation entry being an edge and each topic word in the two languages being a vertex. While the topic words are tied together by the translation dictionary, the perspective words can vary freely across languages. Following Fang et al. (2012), we treat nouns as topic words and verbs and adjectives as perspective words.¹ The model assumes that adjective and verb tokens in each document are assigned to topics in proportion to the topic assignments of the topic-word tokens. Then, the perspective term for this topic is drawn depending on the topic assignment and the language of the speaker.

¹ This approximation was adopted for convenience, computational efficiency and ease of interpretation. However, in principle our method does not depend on it, since it can be applied with all content words as topic or perspective words.

Figure 1: Basic generative model.

3.1 Basic Generative Model

Given the languages $\ell \in \{a, b\}$, our model infers the distributions of multilingual topics and language-specific perspective-words (Fig. 1), as follows:

1. Draw a set $C$ of concepts $(u, v)$ matching topic word $u$ from language $a$ to topic word $v$ from language $b$, where the probability of concept $(u, v)$ is proportional to a prior $\pi_{u,v}$ (e.g. based on information from a translation dictionary).

2. Draw multinomial distributions:

• For topic indices $k \in \{1, \ldots, K\}$, draw language-independent topic-concept distributions $\phi^{w}_{k} \sim \mathrm{Dir}(\beta^{w})$ over pairs $(w_a, w_b) \in C$.

• For topic indices $k \in \{1, \ldots, K\}$ and languages $\ell \in \{a, b\}$, draw language-specific perspective-term distributions $\phi^{\ell,o}_{k} \sim \mathrm{Dir}(\beta^{o})$ over perspective terms in language $\ell$.

3. For each document $d \in \{1, \ldots, D\}$ with language $\ell_d$:

• Draw topic weights $\theta_d \sim \mathrm{Dir}(\alpha)$.

• For each topic-word index $i \in \{1, \ldots, N^{w}_{d}\}$ of document $d$:

– Draw topic $z_i \sim \theta_d$.

– Draw topic concept $c_i = (w_a, w_b) \sim \phi^{w}_{z_i}$, and select $w_{\ell_d}$ as the member of that pair corresponding to language $\ell_d$.

• For each perspective-word index $j \in \{1, \ldots, N^{o}_{d}\}$ of document $d$:

– Draw topic $x_j \sim \mathrm{Uniform}(z^{w}_{1}, \ldots, z^{w}_{N^{w}_{d}})$.

– Draw perspective-word $o_j \sim \phi^{\ell_d,o}_{x_j}$.
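To make the generative story concrete, here is a minimal forward-sampling sketch in Python/NumPy. It is illustrative only: the function and variable names are ours, document languages are drawn uniformly at random for simplicity, and the hyperparameter values are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def sample_corpus(concepts, persp_vocab, K, D, n_topic_words, n_persp_words,
                  alpha=0.1, beta_w=0.02, beta_o=0.02, seed=0):
    """Forward-sample documents from the basic generative model (illustrative sketch).

    concepts    : list of (word_a, word_b) translation pairs, i.e. the concept set C
    persp_vocab : dict {"a": [...], "b": [...]} of perspective words per language
    """
    rng = np.random.default_rng(seed)
    C = len(concepts)
    # language-independent topic-concept distributions  phi^w_k ~ Dir(beta^w)
    phi_w = rng.dirichlet([beta_w] * C, size=K)
    # language-specific perspective-term distributions  phi^{l,o}_k ~ Dir(beta^o)
    phi_o = {l: rng.dirichlet([beta_o] * len(persp_vocab[l]), size=K) for l in "ab"}

    docs = []
    for _ in range(D):
        lang = rng.choice(["a", "b"])                 # language l_d of the document
        theta = rng.dirichlet([alpha] * K)            # topic weights theta_d ~ Dir(alpha)
        z = rng.choice(K, size=n_topic_words, p=theta)
        topic_words = []
        for zi in z:
            pair = concepts[rng.choice(C, p=phi_w[zi])]  # concept (w_a, w_b) ~ phi^w_{z_i}
            topic_words.append(pair[0] if lang == "a" else pair[1])
        persp_words = []
        for _ in range(n_persp_words):
            xj = rng.choice(z)                        # topic drawn uniformly from the topic-word assignments
            persp_words.append(rng.choice(persp_vocab[lang], p=phi_o[lang][xj]))
        docs.append((lang, topic_words, persp_words))
    return docs
```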

3.2 Model Variants

We have experimented with several variants of our model, in order to account for the translation of polysemous words, to adapt the translation model to the corpus used, and to handle words for which no translation is found.

a) SINGLE variants of the model match each topic term in a language with at most one topic term in the other language. MULTIPLE variants allow each term to match multiple words in the other language.

b) INFER variants allow higher-likelihood matchings to be inferred from the data. STATIC variants treat the matchings as fixed, which is equivalent to assigning a probability of 0 or 1 to every edge in our bipartite graph C.

c) RELEGATE variants relegate all unmatched words in each language to a single separate background topic, distinct from the topics that are learned for the matched topic words. This is akin to forcing the probability of currently unmatched words to 0 in all topics except for one, and forcing the probability of all currently matched words to 0 in this topic. INCLUDE variants do not restrict the assignment of unmatched words; they are assigned to the same set of topics as the matched words.

We test the following six variants: SINGLESTATICRELEGATE, SINGLESTATICINCLUDE, SINGLEINFERRELEGATE, SINGLEINFERINCLUDE, MULTIPLESTATICRELEGATE, and MULTIPLESTATICINCLUDE. We do not test MULTIPLEINFER variants because of the complexity of inferring a multiple matching in a bipartite graph.

3.3 Learning & Inference

For all variants, a collapsed Gibbs sampler can be used to infer topics $\phi^{\ell,o}$ and $\phi^{w}$, per-document topic distributions $\theta$, as well as topic assignments $z$ and $x$. This corresponds to the S-step below. For INFER variants, we follow Boyd-Graber and Blei in using an M-step involving a bipartite graph matching algorithm to infer the matching $m$ that maximizes the posterior likelihood of the matching.

S-Step: Sample topics for words in the corpus using a collapsed Gibbs sampler. For topic-word $w_i = u$ belonging to document $d$, if the word occurs in concept $c_i = (u, v)$, then sample the topic and entry according to:

$$p(z_i = k, c_i = (u,v) \mid w_i = u, \mathbf{z}_{-i}, C) \;\propto\; \frac{N_{dk} + \alpha_k}{\sum_j (N_{dj} + \alpha_j)} \times \frac{N_{k(u,v)} + \beta^{w}_{k}}{\sum_{v'} \left(N_{k(u,v')} + \beta^{w}_{k}\right)}$$

where the sum in the denominator of the first term is over all topics, and in the second term is over all words matched to $u$. $N_{dk}$ is the count of topic-words of topic $k$ in document $d$; $N_{k(u,v)}$ is the count of topic-words either of type $u$ or of type $v$ assigned to topic $k$ in all the corpora.² For perspective-word $o_i = n$, sample the topic according to:

$$p(z_i = k \mid o_i = n, \mathbf{z}_{-i}, C) \;\propto\; \frac{N_{dk}}{\sum_j N_{dj}} \times \frac{N^{\ell_d}_{kn} + \beta^{o}_{k}}{\sum_m \left(N^{\ell_d}_{km} + \beta^{o}_{k}\right)}$$

² In RELEGATE variants, for unmatched $u$, $z_i$ is sampled as
$$p(z_i = k \mid w_i = u, \mathbf{z}_{-i}, C) \;\propto\; \frac{N_{dk} + \alpha_k}{\sum_k (N_{dk} + \alpha_k)},$$
which can be seen as $\beta^{w}_{u\cdot} \to \infty$ for unmatched terms.


where the sum in the second term of the denominator is over the perspective-word vocabulary of language $\ell_d$; $N_{dk}$ is the count of topic words in document $d$ with topic $k$; and $N^{\ell_d}_{km}$ is the count of perspective-word $m$ being assigned topic $k$ in language $\ell_d$. Note that in all the counts above, the current word token $i$ is omitted from the count.

Given our sampling assignments, we can then estimate $\theta_d$, $\phi^{\ell,o}$, and $\phi^{w}$ as follows:

$$\theta_{kd} = \frac{N_{dk} + \alpha_k}{\sum_k (N_{dk} + \alpha_k)},$$

$$\phi^{w}_{k(u,v)} = \frac{N_{k(u,v)} + \beta^{w}_{(u,v)}}{\sum_{v'} \left(N_{k(u,v')} + \beta^{w}_{(u,v')}\right)},$$

$$\phi^{\ell,o}_{nk} = \frac{N^{\ell}_{kn} + \beta^{o}_{n}}{\sum_m \left(N^{\ell}_{km} + \beta^{o}_{n}\right)}.$$
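A minimal sketch of the S-step update for a single topic-word token, following the sampling equation above, is given below. The bookkeeping (the count structures N_dk and N_kc) and all names are our own illustrative assumptions, not the authors' implementation; the document-level normalizer is dropped because it is constant across topics.

```python
import numpy as np

def resample_topic_word(i, d, u_concepts, z, c, N_dk, N_kc, alpha, beta_w, rng):
    """One collapsed Gibbs update (S-step) for topic-word token i in document d.

    u_concepts : the concepts (u, v) in C that contain the token's word type u
    z, c       : current topic / concept assignments of all topic-word tokens
    N_dk       : (D x K) array of per-document topic counts for topic words
    N_kc       : dict mapping concept -> length-K array of per-topic counts
    """
    K = N_dk.shape[1]
    # remove the current token from the counts
    N_dk[d, z[i]] -= 1
    N_kc[c[i]][z[i]] -= 1

    # per-topic denominator: sum over all concepts (u, v') containing u
    denom = sum(N_kc[cc] for cc in u_concepts) + beta_w * len(u_concepts)
    # unnormalized p(z_i = k, c_i = (u, v)) for every topic/concept combination
    probs = np.concatenate([(N_dk[d] + alpha) * (N_kc[cc] + beta_w) / denom
                            for cc in u_concepts])
    probs /= probs.sum()

    idx = rng.choice(len(probs), p=probs)
    k_new, c_new = idx % K, u_concepts[idx // K]
    z[i], c[i] = k_new, c_new
    N_dk[d, k_new] += 1
    N_kc[c_new][k_new] += 1
```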

M-Step (for INFER variants only): Run the Jonker-Volgenant (Jonker and Volgenant, 1987) bipartite matching algorithm to find the optimal matching $C$ given some weights. For topic-term $u$ from language $a$ and topic-term $v$ from language $b$, our weights correspond to the log of the posterior odds that the occurrences of $u$ and $v$ come from a matched topic distribution, as opposed to coming from unmatched distributions:

$$\mu_{u,v} = \sum_{k \setminus \{a^{*}, b^{*}\}} \left(N_{k(u,v)} \log \phi^{w}_{k(u,v)}\right) - N_u \log \phi^{w}_{k(u,\cdot)} - N_v \log \phi^{w}_{k(\cdot,v)} + \pi_{u,v},$$

where $N_u$ is the count of topic-term $u$ in the corpus. This expression can also be interpreted as a kind of pointwise mutual information (Haghighi et al., 2008). The Jonker-Volgenant algorithm has time complexity of at most $O(V^3)$, where $V$ is the size of the lexicon (Jonker and Volgenant, 1987).
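The assignment problem in the M-step can be solved with an off-the-shelf solver. For instance, SciPy's linear_sum_assignment (which in recent SciPy versions implements a modified Jonker-Volgenant algorithm) finds a maximum-weight single matching given a weight matrix; the matrix mu below is an illustrative placeholder, not weights from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def infer_matching(mu):
    """Return the single matching that maximizes the total weight.
    mu is a (V_a x V_b) matrix of matching weights mu[u, v];
    linear_sum_assignment minimizes cost, so the weights are negated."""
    row_ind, col_ind = linear_sum_assignment(-mu)
    return list(zip(row_ind, col_ind))

# toy example with 3 source terms and 3 target terms (illustrative weights only)
mu = np.array([[2.0, 0.1, 0.3],
               [0.2, 1.5, 0.0],
               [0.0, 0.4, 0.9]])
print(infer_matching(mu))   # [(0, 0), (1, 1), (2, 2)]
```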

3.4 Inference of Perspective-Word Contrasts

Having learned our model and inferred how likely perspective-terms are for a topic in a given language, we seek to know whether these perspectives differ significantly in the two languages. More precisely, can we infer whether word $m$ in language $a$ and the equivalent word $n$ in language $b$ have significantly different distributions under a topic $k$? To do this, we make the assumption that the perspective-words in languages $a$ and $b$ are in one-to-one correspondence with each other. Recall that, for a given topic $k$ and language $\ell$, $N^{\ell}_{km}$ is the count for term $m$ and $\phi^{\ell,o}_{k,m}$ is the probability for word $m$ in language $\ell$. Just as we collect the probabilities into word-topic distribution vectors $\phi^{\ell,o}_{k}$, we collect the counts into word-topic count vectors $[N^{\ell}_{k1}, N^{\ell}_{k2}, \ldots]$. Then, since our model assumes a prior over the parameter vectors $\phi^{\ell,o}_{k}$, we can infer the likelihood that the observed word-topic counts $N^{a}_{km}$ and $N^{b}_{kn}$ were drawn from a single word-topic-distribution prior denoted by $\phi := \phi^{a,o}_{km} = \phi^{b,o}_{kn}$. Below, all our probabilities are conditioned implicitly on this event as well as on $N^{a}_{k}$ and $N^{b}_{k}$ being fixed.

Denote the total count of word tokens in topic $k$ from language $\ell$ by $N^{\ell}_{k} = \sum_{m} N^{\ell}_{km}$. Now, we derive the probability that we observe a ratio greater than $\delta$ between the proportion of words in topic $k$ that belong to word type $m$ in language $a$ and to the corresponding word type $n$ in language $b$:

$$p\left(\frac{N^{a}_{km}}{N^{a}_{k}} \cdot \frac{N^{b}_{k}}{N^{b}_{kn}} \geq \delta\right) + p\left(\frac{N^{b}_{kn}}{N^{b}_{k}} \cdot \frac{N^{a}_{k}}{N^{a}_{km}} \geq \delta\right) \qquad (1)$$

By symmetry, it suffices to derive an expression for the first term. We note that the inequality in the probability is equivalent to a sum over a range of values of $N^{a}_{km}$ and $N^{b}_{kn}$. By rearranging terms, applying the law of conditional probability to condition on the term $\phi$, and exploiting the conditional independence of $N^{a}_{km}$ and $N^{b}_{kn}$ given $\phi$, $N^{a}_{k}$, and $N^{b}_{k}$, we can rewrite this first term as

$$\sum_{x=0}^{N^{b}_{k}} \; \sum_{y = x\delta N_{a/b}}^{N^{a}_{k}} \int p(N^{b}_{kn} = x \mid \phi)\, p(N^{a}_{km} = y \mid \phi)\, p(\phi)\, d\phi,$$

where $N_{a/b} = N^{a}_{k} / N^{b}_{k}$. Recall that $\phi^{\ell,o}_{k} \sim \mathrm{Dir}(\beta^{o})$ under our model. Assume a symmetric Dirichlet distribution for simplicity. It can then be shown that the marginal distribution of $\phi$ is $\phi \sim \mathrm{Beta}(\beta^{o}, (V-1)\beta^{o})$, where $V$ is the total size of the perspective-word vocabulary. Similarly, it can be shown that the marginal distribution of $N^{\ell}_{km}$ given $\phi$ is $N^{\ell}_{km} \sim \mathrm{Binom}(N^{\ell}_{k}, \phi)$ for $\ell \in \{a, b\}$. Therefore, the integrand above is proportional to the beta-binomial distribution with number of trials $N^{a}_{k} + N^{b}_{k}$, successes $x + y$, and parameters $\beta^{o}$ and $(V-1)\beta^{o}$, but with partition function $\binom{N^{a}_{k}}{y}\binom{N^{b}_{k}}{x}$. Denote the PMF of this distribution by $f(N^{a}_{k} + N^{b}_{k}, x + y, \beta^{o})$. Then expression (1) above becomes:

$$\sum_{x=0}^{N^{b}_{k}} \; \sum_{y = x\delta N_{a/b}}^{N^{a}_{k}} f(N^{a}_{k} + N^{b}_{k}, x + y, \beta^{o}) \;+\; \sum_{x=0}^{N^{a}_{k}} \; \sum_{y = x\delta N_{b/a}}^{N^{b}_{k}} f(N^{a}_{k} + N^{b}_{k}, x + y, \beta^{o}). \qquad (2)$$

We cannot observe $N^{a}_{km}$, $N^{b}_{kn}$, $N^{a}_{k}$ and $N^{b}_{k}$ explicitly, but we can estimate them by obtaining posterior samples from our Gibbs sampler. We substitute these estimates into expression (2).

4 Experiments

4.1 Data

Twitter Data. We gathered Twitter data in English, Spanish and Russian during the first two weeks of December 2013 using the Twitter API. Following previous work (Puniyani et al., 2010), we treated each Twitter user account as a document. We then tagged each document for part-of-speech, and divided the word tokens in it into topic-words and perspective-words. We constructed a lexicon of 2,000 topic terms and 1,500 perspective terms for each language by filtering out any terms that occurred in more than 10% of the documents in that language, and then selecting the remaining terms with the highest frequency. Finally, we kept only documents that contained 4 or more topic words from our lexicon. This left us with 847,560 documents in English (4,742,868 topic-word and 1,907,685 perspective-word tokens); 756,036 documents in Spanish (4,409,888 topic-word and 1,668,803 perspective-word tokens); and 260,981 documents in Russian (1,621,571 topic-word and 981,561 perspective-word tokens).

News Data. We gathered all the articles published online during the year 2013 by the state-run media agencies of the United States (Voice of America or "VOA", in English), Russia (RIA Novosti or "RIA", in Russian), and Venezuela (Agencia Venezolana de Noticias or "AVN", in Spanish). These three news agencies were chosen because they not only provide media in three distinct languages, but are guided by the political world-views of three distinct governments. We treated each news article as a document, and removed duplicates. Once again, we constructed a lexicon of 2,000 topic terms and 1,500 perspective terms using the same criteria as for Twitter, and kept only documents that contained 4 or more topic words from our lexicon. This left us with 23,159 articles (10,410,949 tokens) from VOA, 41,116 articles (11,726,637 tokens) from RIA, and 8,541 articles (2,606,796 tokens) from AVN.

Dictionaries. To create the translation dictionaries, we extracted translations from the English, Spanish, and Russian editions of Wiktionary, both from the translation sections and from the gloss sections if the latter contained single words as glosses. Multi-word expressions were universally removed. We added inverse translations for every original translation. From the resulting collection of translations, we then created separate translation dictionaries for each language and part-of-speech tag combination.

In order to give preference to more important translations, we assigned each translation an initial weight of $1 + \frac{1}{r}$, where $r$ was the rank of the translation within the page. Since a translation (or its inverse) can occur on multiple pages, we aggregated these initial weights and then assigned final weights of $1 + \frac{1}{r'}$, where $r'$ was the rank after aggregation and sorting in descending order of weights.
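A small sketch of this weighting scheme is given below. It assumes (since the text does not spell this out) that ranks are taken per source entry, and the data structures are our own illustration.

```python
from collections import defaultdict

def dictionary_weights(pages):
    """Sketch of the translation-weighting scheme described above.

    `pages` maps a source word to an ordered list of its translations as they
    appear on its Wiktionary page (best-ranked first). Returns a dict
    {source: [(target, final_weight), ...]} sorted by final weight.
    """
    raw = defaultdict(lambda: defaultdict(float))
    for src, translations in pages.items():
        for r, tgt in enumerate(translations, start=1):
            w = 1.0 + 1.0 / r            # initial weight 1 + 1/r
            raw[src][tgt] += w
            raw[tgt][src] += w           # add the inverse translation
    final = {}
    for src, targets in raw.items():
        # rank aggregated weights in descending order and assign 1 + 1/r'
        ranked = sorted(targets.items(), key=lambda kv: kv[1], reverse=True)
        final[src] = [(tgt, 1.0 + 1.0 / rp)
                      for rp, (tgt, _) in enumerate(ranked, start=1)]
    return final
```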

4.2 Experimental Conditions

To evaluate the different variants of our model, we held out 30,000 documents (test set) during training. We plugged in the estimates of $\phi^{w}$ and $C$ acquired during training using the rest of the corpus to produce a likelihood estimate for these held-out documents. All models were initialized with the prior matching determined by the dictionary data. For each number of topics $K$, we set $\alpha$ to $50/K$ and the $\beta$ variables to 0.02, as in Fang et al. (2012). For the MULTIPLE variants, we set $\pi_{i,j} = 1$ if $i$ and $j$ share an entry and 0 otherwise. For INFER variants, only three M-steps were performed to avoid overfitting, at 250, 500, and 750 iterations of Gibbs sampling, following the procedure in Boyd-Graber and Blei (2009).

4.3 Comparison of model variants

In order to compare the variants of our model, we computed the perplexity and coherence for each variant on TWITTER and NEWS, for the English–Spanish and English–Russian language pairs.

Perplexity is a measure of how well a model trained on a training set predicts the co-occurrence of words on an unseen test set $H$. Lower perplexity indicates better model fit. We evaluate the held-out perplexity for topic words $w_i$ and perspective-words $o_i$ separately. For topic words, the perplexity is defined as $\exp\!\left(-\sum_{w_i \in H} \log p(w_i) / N^{w}\right)$. As for standard LDA, exact inference of $p(w_i)$ is intractable under this model. Therefore we adapted the estimator developed by Murray and Salakhutdinov (2009) to our models.

Coherence is a measure inspired by pointwise mutual information (Newman et al., 2010). Let $D(v)$ be the number of documents with at least one token of type $v$ and let $D(v, w)$ be the number of documents containing at least one token of type $v$ and at least one token of type $w$. Then Mimno et al. (2011) define the coherence of topic $k$ as

$$\frac{1}{\binom{M}{2}} \sum_{m=2}^{M} \sum_{\ell=1}^{m-1} \log \frac{D(v^{(k)}_{m}, v^{(k)}_{\ell}) + \epsilon}{D(v^{(k)}_{\ell})},$$

where $V^{(k)} = (v^{(k)}_{1}, \ldots, v^{(k)}_{M})$ is a list of the $M$ most probable words in topic $k$ and $\epsilon$ is a small smoothing constant used to avoid taking the logarithm of zero. Mimno et al. (2011) find that coherence correlates better with human judgments than do likelihood-based measures. Coherence is a topic-specific measure, so for each model variant we trained, we computed the median topic coherence across all the topics learned by the model. We set $\epsilon = 0.1$.
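A straightforward implementation of this coherence score for a single topic is sketched below (document-frequency counting is done naively; the top words are assumed to occur in at least one document, and ε = 0.1 as above).

```python
import math

def topic_coherence(top_words, docs, eps=0.1):
    """Coherence of one topic as defined by Mimno et al. (2011), with the
    1/C(M,2) normalization used above.

    top_words : list of the M most probable words of the topic, in order
    docs      : iterable of token lists (the corpus)
    """
    doc_sets = [set(d) for d in docs]

    def D(*words):
        # number of documents containing all of the given word types
        return sum(all(w in s for w in words) for s in doc_sets)

    M = len(top_words)
    total = 0.0
    for m in range(1, M):          # m = 2..M in the paper's 1-based indexing
        for l in range(m):         # l = 1..m-1
            v_m, v_l = top_words[m], top_words[l]
            total += math.log((D(v_m, v_l) + eps) / D(v_l))
    return total / math.comb(M, 2)
```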

Model performance and analysis. Fig. 2 shows perplexity for the variants as a function of the number of iterations of Gibbs sampling on the English–Spanish NEWS corpus. The figure confirms that 1000 iterations of Gibbs sampling on the NEWS corpus was sufficient for convergence across model variants. We omit figures for English–Russian and for the TWITTER corpus, since the patterns were nearly identical. Figure 3 shows how perplexity varies as a function of the number of topics. We used this information to choose optimal models for the different corpora. The optimal number of topics was K = 175 for the English–Spanish NEWS corpus, K = 200 for the English–Russian NEWS, K = 325 for the English–Spanish TWITTER, and K = 300 for the English–Russian TWITTER. Although the optimal number of topics varied across corpora, the relative performance of the different models was the same. In all of our corpora, the MULTIPLE variants provided better fits than their corresponding SINGLE variants. There are several explanations for this. For one, the MULTIPLE variants are able to exploit the information from multiple translations, unlike the SINGLE variants, which discarded all but one translation per word. For another, the matchings produced by the SINGLEINFER variants can be purely coincidental and the result of overfitting (see some examples below). INCLUDE variants performed markedly better than RELEGATE variants. INFER variants improved model fit compared to STATIC variants, but required more topics to produce optimal fit.

Recall that we performed an M-step in the INFER variants 3 times, at 250, 500, and 750 iterations. As noted in §3.3, the M-step in the INFER variants maximizes the posterior likelihood of the matching. However, Fig. 2 shows that this maximization causes held-out perplexity to increase substantially just after the first matching M-step, around 250 iterations, before decreasing again after about 50 more iterations of Gibbs sampling. We believe that this happens because the M-step is maximizing over expectations that are approximate, since they are estimated using Gibbs sampling. If the sampler has not yet converged, then the M-step's maximization will be unstable. We found support for this explanation when we re-ran the INFER variants using 1000 iterations between M-steps, giving the Markov chain enough time to converge. After this change, perplexity went down immediately after the M-step and kept decreasing monotonically, rather than increasing after the M-step before decreasing. However, this did not result in a significantly lower final perplexity or coherence and thus did not change the relative performance of the models. In addition, Fig. 2 suggests that the second and third M-steps (at 500 and 750 iterations, respectively) had little effect on perplexity. In light of the high computational expense of each inference step, this suggests that in practice a single inference step may be sufficient.


Figure 2: Perplexity of different model variants for different numbers of iterations at K=175.

Fig. 4 shows that the MULTIPLESTATICINCLUDE variant was also the superior model as measured by median topic coherence. Once again, this general pattern held true for the English–Russian pair and TWITTER corpora. Overall, the results show that MULTIPLESTATICINCLUDE provides superior performance across measures, corpora, topic numbers, and languages. We therefore used this variant in further data analysis and evaluation. Incidentally, the observed decrease in topic coherence as K increases is expected, because as K increases, lower-likelihood topics tend to be more incoherent (Mimno et al., 2011). Experiments by Stevens et al. (2012) show that this effect is observed for LDA-, NMF-, and SVD-based topic models.

Cross-linguistic matchings. The matchings inferred by the SINGLEINFERINCLUDE variant were of mixed quality. Some of the matchings corrected low-quality translations in the original dictionary. For instance, our prior dictionary matched passage in English to pasaje in Spanish. Though technically correct, the dominant meaning of pasaje is [travel] ticket. The TWITTER model correctly matched passage to ruta instead. Many of the matchings learned by the model did not provide technically correct translations, yet were still revelatory and interesting. For instance, the dictionary translated the Spanish word pito as cigarette in English. However, in informal usage this word refers specifically to cannabis cigarettes, not tobacco cigarettes. The TWITTER model matches pito to the English slang word weed instead. The Spanish word Siria (Syria) was unmatched in the prior dictionary; the NEWS model matched it to the word chemical, which makes sense in the context of extensive reporting of the usage of chemical weapons in the ongoing Syrian conflict.

Figure 3: Perplexity of different model variants.

4.4 Data analysis and discussion

We have conducted a qualitative analysis of the topics, perspectives and contrasts produced by our models for English–Spanish and English–Russian, TWITTER and NEWS datasets. While the topics were coherent and consistent across languages, sets of perspective words manifested systematic differences revealing interesting cross-cultural contrasts. Fig. 5 and 7 show the top perspective words discovered by the model for the topic of finance and economy in the English and Spanish NEWS and TWITTER corpora, respectively. While some of the perspective words are neutral, mostly literal and occur in both English and Spanish (e.g. balance or authorize), many others represent metaphorical vocabulary (e.g. saddle, gut, evaporate in English, or incendiar, sangrar, abatir in Spanish), pointing at distinct models of conceptualization of the topic. When we applied the contrast detection method (described in §3.4) to these perspective words, it highlighted the differences in metaphorical perspectives, rather than the literal ones, as shown in Fig. 6 and 8.


Figure 4: Coherence of different model variants.

English speakers tend to discuss economic and financial processes using motion terms, such as "slow, drive, boost or sluggish", or a related metaphor of horse-riding, e.g. "rein in debt", "saddle with debt", or even "breed money". In contrast, Spanish speakers tend to talk about the economy in terms of size rather than motion, using verbs such as ampliar or disminuir, and other metaphors, such as sangrar (to bleed) and incendiar (to light up). These examples demonstrate coherent conceptualization patterns that differ in the two languages. Interestingly, this difference manifested itself in both NEWS and TWITTER corpora and echoes the findings of a previous corpus-linguistic study of Charteris-Black and Ennis (2001), who manually analysed metaphors used in English and Spanish financial discourse and reported that motion and navigation metaphors that abound in English were rarely observed in Spanish.

For the majority of the topics we analysed, the model revealed interesting cross-cultural differences. For instance, the Spanish corpora exhibited metaphors of battle when talking about poverty (with poverty seen as an enemy), while in the English corpus poverty was discussed more neutrally, as a social problem that needs a practical solution. English–Russian NEWS experiments revealed a surprising difference with respect to the topic of protests. They suggested that while US media tend to use stronger metaphorical vocabulary, such as clash, erupt or fire, in Russian protests are discussed more neutrally. Generally, the NEWS corpora contained more abstract topics and richer information about conceptual structure and sentiment in all languages. Many of the topics discovered in TWITTER related to everyday concepts, such as pets or concerts, with fewer topics covering societal issues. Yet, a few TWITTER-specific contrasts could be observed: e.g., the sports topic tends to be discussed using war and battle vocabulary in Russian to a greater extent than in English.

Topic EN: budget debt deficit reduction spend balance cut increase limit downtown tax stress addition planet
Topic ES: presupuesto déficit deuda reducción equilibrio disminución gasto aumentación tasa sacerdote
Perspective EN: balance default triple rein accumulate accrue trim incur saddle slash prioritize avert gut burden evaporate borrow pile cap cut tackle
Perspective ES: renegociar mejora etiquetado desplomar recortar endeudar incendiar destinar asignar autorizar aprobado ascender sangrar augurar abatir
Figure 5: Top perspectives in system output for the topic of finance in the NEWS corpus (metaphors in red italics).

Contrasts EN: rein [in debt], saddle [with debt], cap [debt], breed [money], gut [budget], [debt] hit, tackle [debt], boost, slow, drive, sluggish [economy], spur
Contrasts ES: sangrar [dinero], ampliar, disminuir [la economía], superar [la tasa], emitir [deuda]
Figure 6: Contrasts identified by the model in NEWS.

Our models tend to identify two general kinds of differences: (1) cross-corpus differences representing the world views of the particular populations that the corpora characterize (such differences exist both across and within languages; e.g. the metaphors used in the progressive New York Times would be different from the ones in the more conservative Wall Street Journal); and (2) deeply entrenched cross-linguistic differences, such as the motion versus expansion metaphors for the economy in English and Spanish. Such systematic cross-linguistic contrasts can be associated with contrastive behavioural patterns across the different linguistic communities (Casasanto and Boroditsky, 2008; Fuhrman et al., 2011). In both NEWS and TWITTER data, our model effectively identifies and summarises such contrasts, simplifying the manual analysis of the data by highlighting linguistic trends that are indicative of the underlying conceptual differences. However, the conceptual differences are not straightforward to evaluate based on the surface vocabulary alone. In order to investigate this further, we conducted a behavioural experiment testing a subset of the contrasts discovered by our model.

Topic EN: economy growth rate percent bank economist interest reserve market policy
Topic ES: economía crecimiento tasa banco política mercado interés inflación empleo economista
Perspective EN: economic financial grow global expect remain cut boost low slow drive
Perspective ES: económico mundial agregar financiero informal pequeño significar interno bajar
Figure 7: Top perspectives in system output for the economy topic in TWITTER (metaphors in red).

Contrasts EN: slow [the economy], push [the economy], strong [economy], weak [economy], stable [economy], boost [the economy]
Contrasts ES: caer [la economía], disminuir, superar [la economía], ampliar [el crecimiento]
Figure 8: Contrasts identified by the model in TWITTER.

5 Behavioural evaluation

We assessed the relevance of the contrasts through an experimental study with native English-speaking and native Spanish-speaking human subjects. We focused on a linguistic difference in the metaphors used by English speakers versus Spanish speakers when discussing changes in a nation's economy. While English speakers tend to use metaphors involving both locative motion verbs (e.g. slow) as well as expansive/contractive motion verbs (e.g. shrink), Spanish speakers preferentially employ expansive/contractive motion verbs (e.g. disminuir) to describe changes in the economy. These differences could reflect linguistic artefacts (such as collocation frequencies) or could reflect entrenched conceptual differences. Our experiment addresses the question of whether such patterns of behaviour arise cross-linguistically in response to non-linguistic stimuli. If the linguistic differences are indicative of entrenched conceptual differences, then we expect to see responses to the non-linguistic stimuli that correspond to the usage differences in the two languages.

5.1 Experimental setup

We recruited 60 participants from one English-speaking country (the US) and 60 participants from three Spanish-speaking countries (Chile, Mexico, and Spain) using the CrowdFlower crowdsourcing platform. Participants first read a brief description of the experimental task, which introduced them to a fictional country in which economists are devising a simple but effective graphic for "representing change in [the] economy". They then completed a demographic questionnaire including information about their native language. Results from 9 US and 3 non-US participants were discarded for failure to meet the language requirement.

Participants navigated to a new page to complete the experimental task. Stimuli were presented in a 1200 × 700-pixel frame. The center of the frame contained a sphere with a 64-pixel diameter. For each trial, participants clicked on a button to activate an animation of the sphere which involved (1) a positive displacement (in rightward pixels) of 10% or 20%, or a negative displacement (in leftward pixels) of 10% or 20%;³ and (2) an expansion (in increased pixel diameter) of 10% or 20%, or a contraction (in decreased pixel diameter) of 10% or 20%.⁴

Participants saw each of the resulting conditions 3 times. The displacement and size conditions were drawn from a random permutation of 16 conditions using a Fisher-Yates shuffle (Fisher and Yates, 1963). Crucially, half of the stimuli contained conflicts of information with respect to the size and displacement metaphors for economic change (e.g. the sphere could both grow and move to the left). Overall we expected the Spanish speakers' responses to be more closely associated with changes in diameter due to the presence and salience of the size metaphor, and the English speakers' responses to be influenced by both conditions. We expected these differences to be most prominent in the conflicting trials, which force English speakers (unlike Spanish speakers) to choose between two available metaphors. We focus on these conflicting trials in our analysis and discussion of the results.

Figure 9: "Economy Improved" response rate in conflicting stimulus conditions.

³ The use of leftward/rightward horizontal displacement to represent decreases/increases in magnitude is supported by research in numerical cognition showing that people associate smaller magnitudes with the left side of space and larger magnitudes with the right side (Dehaene, 1992; Fias et al., 1995).

⁴ A demonstration of the English experimental interface can be accessed at http://goo.gl/W3YVfC. The Spanish interface is identical, but for a direct translation of the guidelines provided by a native Spanish/fluent English speaker.
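For concreteness, the stimulus schedule just described can be sketched as follows. The interleaving of the three repetitions (as three independently shuffled blocks) and the seed handling are our assumptions; the percentages and the conflict criterion follow the description above.

```python
import random
from itertools import product

def experiment_trials(seed=None):
    """Sketch of the stimulus schedule: 16 displacement x size conditions,
    each shown 3 times, ordered by a Fisher-Yates shuffle per block."""
    rng = random.Random(seed)
    displacements = [+0.10, +0.20, -0.10, -0.20]   # rightward positive, leftward negative
    size_changes  = [+0.10, +0.20, -0.10, -0.20]   # expansion positive, contraction negative
    conditions = list(product(displacements, size_changes))   # 16 conditions

    def fisher_yates(items):
        items = list(items)
        for i in range(len(items) - 1, 0, -1):     # classic Fisher-Yates shuffle
            j = rng.randint(0, i)
            items[i], items[j] = items[j], items[i]
        return items

    trials = [c for _ in range(3) for c in fisher_yates(conditions)]
    # conflicting trials: the two cues point in opposite directions
    conflicting = [(d, s) for d, s in trials if d * s < 0]
    return trials, conflicting
```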

5.2 Results

In trials in which stimuli moving rightward were simultaneously contracting, English speakers responded that the economy improved 66% of the time, whereas Spanish speakers judged the economy to have improved 43% of the time. In trials in which stimuli moving leftward were simultaneously expanding, English speakers judged the economy to have improved 34% of the time, and Spanish speakers responded that the economy improved 55% of the time. The results are illustrated in Figure 9.

These results indicate three effects: (1) English speakers exhibit a pronounced bias for using horizontal displacement rather than expansion/contraction during the decision-making process; (2) Spanish speakers are more biased toward expansion/contraction in formulating a decision; and (3) across the two languages the responses show contrasting patterns. The results support our expectation that English and Spanish speakers rely on different metaphors when reasoning about the economy.

To examine the significance of these effects, we fit a binary logit mixed effects model⁵ to the data. The full analysis modeled judgment with native language, displacement, and size as fully crossed fixed effects and participant as a random effect. This analysis confirmed that native language was associated with judgments about economic change. In particular, it indicated that changes in size affected English speakers' judgments and Spanish speakers' judgments differently (p < 0.001), with an increase in size increasing the odds ($e^{\beta} = 2.5$) of a judgment of IMPROVED by Spanish speakers and decreasing the odds ($e^{\beta} = 0.44$) of a judgment of IMPROVED by English speakers. A Type II Wald test revealed the interaction between language and size to be highly statistically significant (p < 0.001).

⁵ See Fox and Weisberg (2011) for a discussion of such models, including application of the Type II Wald test.

In summary, the patterns we see in the behavioural data are consistent with the patterns uncovered in the output of our model. While much territory remains to be investigated to delimit the nature of this relationship, our results represent a first step toward establishing an association between information mined from large textual data collections and information observed through behavioural responses on a human scale.

6 Conclusion

We presented the first model that detects common topics from multilingual, non-parallel data and automatically uncovers differences in perspectives on these topics across linguistic communities. Our data analysis and behavioural evaluation offer evidence of a symbiotic relationship between ecologically sound corpus experiments and scientifically controlled human subject experiments, paving the way for the use of large-scale text mining to inform cognitive linguistics and psychology research.

We believe that our model represents a good foundation for future projects in this area. A promising direction for further work is developing better methods for identifying contrasts in perspective terms. This could perhaps involve modifying the generative process for perspective terms or incorporating syntactic dependency information. It would also be interesting to investigate the effect of dictionary quality and corpus size on the relative performance of STATIC and INFER variants. Finally, we note that the model can be applied to identify contrastive perspectives in monolingual as well as multilingual data, providing a general tool for the analysis of subtle, yet important, cross-population differences.


Acknowledgments

We would like to thank the anonymous reviewers as well as the TACL editors, Sharon Goldwater and David Chiang, for helpful comments on an earlier draft of this paper. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575. Ekaterina Shutova's research is supported by the Leverhulme Trust Early Career Fellowship. Gerard de Melo's research is supported by China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 61550110504.

References

Amjad Abu-Jbara, Mona Diab, Pradeep Dasigi, and Dragomir Radev. 2012. Subgroup detection in ideological discussions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 399–409, Stroudsburg, PA, USA. Association for Computational Linguistics.

Amr Ahmed and Eric P. Xing. 2010. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1140–1150, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rawia Awadallah, Maya Ramanath, and Gerhard Weikum. 2011. OpinioNetIt: Understanding the Opinions-People network for politically controversial topics. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2481–2484, New York, NY, USA. ACM.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09), pages 75–82. AUAI Press, Arlington, VA, USA.

Jordan Boyd-Graber and Philip Resnik. 2010. Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 45–55.

Daniel Casasanto and Lera Boroditsky. 2008. Time in the mind: Using space to think about time. Cognition, 106(2):579–593.

Jonathan Charteris-Black and Timothy Ennis. 2001. A comparative study of metaphor in Spanish and English financial reporting. English for Specific Purposes, 20:249–266.

Stanislas Dehaene. 1992. Varieties of numerical abilities. Cognition, 44:1–42.

Anthony Fader, Dragomir Radev, Burt L. Monroe, and Kevin M. Quinn. 2007. MavenRank: Identifying influential members of the US senate using lexical centrality. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 658–666.

Yi Fang, Luo Si, Naveen Somasundaram, and Zhengtao Yu. 2012. Mining contrastive opinions on political texts using cross-perspective topic model. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12), pages 63–72, New York, NY, USA. ACM.

Wim Fias, Marc Brysbaert, Frank Geypens, and Gery d'Ydewalle. 1995. The importance of magnitude information in numerical processing: evidence from the SNARC effect. Mathematical Cognition, 2(1):95–110.

Ronald A. Fisher and Frank Yates. 1963. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd, Edinburgh.

John Fox and Sanford Weisberg. 2011. An R Companion to Applied Regression. SAGE Publications, Los Angeles, CA.

Orly Fuhrman, Kelly McCormick, Eva Chen, Heidi Jiang, Dingfang Shu, Shuaimei Mao, and Lera Boroditsky. 2011. How linguistic and cultural forces shape conceptions of time: English and Mandarin time in 3D. Cognitive Science, 35:1305–1328.

Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 710–718, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sean M. Gerrish and David M. Blei. 2011. Predicting legislative roll calls from text. In Proceedings of ICML.

Swapna Gottipati, Minghui Qiu, Yanchuan Sim, Jing Jiang, and Noah A. Smith. 2013. Learning topics and positions from Debatepedia. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1858–1868, Seattle, Washington, USA, October. Association for Computational Linguistics.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL-08: HLT, pages 771–779, Columbus, Ohio, USA.

David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 363–371. Association for Computational Linguistics.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting multilingual topics from unaligned comparable corpora. In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, and Suzanne Little, editors, Proceedings of the 32nd European Conference on Advances in Information Retrieval (ECIR 2010), pages 444–456. Springer-Verlag, Berlin.

Rosie Jones, Ravi Kumar, Bo Pang, and Andrew Tomkins. 2007. "I know what you did last summer": Query logs and user privacy. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 909–914, New York, NY, USA. ACM.

Roy Jonker and Anton Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Zoltán Kövecses. 2004. Introduction: Cultural variation in metaphor. European Journal of English Studies, 8:263–274.

George Lakoff and Elisabeth Wehling. 2012. The Little Blue Book: The Essential Guide to Thinking and Talking Democratic. Free Press, New York.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 880–889. Association for Computational Linguistics.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Iain Murray and Ruslan R. Salakhutdinov. 2009. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems, pages 1137–1144.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444, Uppsala, Sweden. Association for Computational Linguistics.

Michael Paul and Roxana Girju. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, EMNLP '09, pages 1408–1417, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Pennacchiotti and Ana-Maria Popescu. 2011. Democrats, Republicans and Starbucks afficionados: user classification in Twitter. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 430–438.

Kriti Puniyani, Jacob Eisenstein, Shay Cohen, and Eric P. Xing. 2010. Social links from latent topics in microblogs. In Proceedings of the NAACL/HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 19–20. Association for Computational Linguistics.

Minghui Qiu and Jing Jiang. 2013. A latent variable model for viewpoint discovery from threaded forum posts. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1031–1040, Atlanta, Georgia, June. Association for Computational Linguistics.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 424–434. Association for Computational Linguistics.

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961, Jeju Island, Korea.

Paul H. Thibodeau and Lera Boroditsky. 2011. Metaphors we think with: The role of metaphor in reasoning. PLoS ONE, 6(2):e16782.

Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. 2004. Learning subjective language. Computational Linguistics, 30(3):277–308, September.