
emnlp 2013

The 2013 Conference on

Empirical Methods in

Natural Language Processing

Conference Handbook

October 18-21, 2013
Grand Hyatt Hotel

Seattle, Washington


EMNLP Organizing committee

General Chair
David Yarowsky (Johns Hopkins University, USA)

Program Chairs
Timothy Baldwin (University of Melbourne, Australia)
Anna Korhonen (University of Cambridge, UK)

Workshops Chair
Karen Livescu (Toyota Technological Institute at Chicago, USA)

Publication Chair
Steven Bethard (University of Alabama at Birmingham, USA)

Local Arrangements Chair
Priscilla Rasmussen (ACL Business Office)

Webmaster
Francesco Figari (The University of Edinburgh)

Area Chairs

Phonology, Morphology, Tagging, Chunking and Segmentation:
Kemal Oflazer (Carnegie Mellon University Qatar)
Anna Feldman (Montclair State University, USA)

Syntax and Parsing:
Jennifer Foster (Dublin City University, Ireland)
Yoav Goldberg (Bar Ilan University, Israel)

Semantics:
Mark Stevenson (University of Sheffield, UK)
Luke Zettlemoyer (University of Washington, USA)

Discourse, Dialogue, and Pragmatics:
Carolyn Rose (Carnegie Mellon University, USA)
Matt Purver (Queen Mary, University of London, UK)

Language Resources:
Emily M. Bender (University of Washington, USA)
Aline Villavicencio (Federal University of Rio Grande do Sul, Brazil)

Summarization and Generation:
Dragomir Radev (University of Michigan, USA)
Yang Liu (University of Texas at Dallas, USA)

NLP-related Machine Learning: theory, methods and algorithms:
Amir Globerson (The Hebrew University of Jerusalem, Israel)
Antal van den Bosch (Radboud University Nijmegen, Netherlands)

Machine Translation:
Taro Watanabe (NICT, Japan)
Kevin Knight (Information Sciences Institute, USA)

Information Retrieval and Question Answering:
Bernardo Magnini (Fondazione Bruno Kessler, Italy)
Soumen Chakrabarti (Indian Institute of Technology, India)

Information Extraction:
Mausam (Indian Institute of Technology, India)
Heng Ji (City University of New York, USA)

Spoken Language Processing:
Haizhou Li (Institute for Infocomm Research, Singapore)
Amanda Stent (AT&T Labs, USA)

Text Mining and Natural Language Processing Applications:
Hang Li (Huawei Noah’s Ark Lab, Hong Kong)
Kevin Cohen (University of Colorado School of Medicine, USA)

Sentiment Analysis and Opinion Mining:
Janyce Wiebe (University of Pittsburgh, USA)
Bing Liu (University of Illinois at Chicago, USA)

NLP for the Web and Social Media:
Miles Osborne (University of Edinburgh, UK)
Chin-Yew Lin (Microsoft Research Asia, China)

Computational Models of Human Language Acquisition and Processing:
Alessandro Lenci (University of Pisa, Italy)
Afra Alishahi (Tilburg University, Netherlands)


Conference Sponsors

Platinum Sponsor

Gold Sponsors

Allen Institute for Artificial Intelligence


Bronze Sponsors

Supporters


Contents

1 Friday, October 18, 2013: Workshops
    TextGraphs-8: Graph-based Methods for Natural Language Processing
    Statistical Parsing of Morphologically-Rich Languages
    Twenty Years of Bitext

2 Saturday, October 19, 2013: Main Conference
    Information Extraction I
    Language Acquisition and Processing
    NLP for Social Media I
    NLP Applications I
    Semantics I
    Language Resources
    Invited talk: Andrew Ng
    Machine Translation I
    Dialogue and Discourse
    Morphology and Phonology
    Poster Session A
    Poster Session B

3 Sunday, October 20, 2013: Main Conference
    Invited talk: Fernando Pereira
    Machine Learning for NLP
    Summarization and Generation
    Information Extraction and Social Media Analysis
    Machine Translation II
    Semantics II
    Opinion Mining and Sentiment Analysis I
    Machine Translation III
    Information Extraction II
    NLP Applications II


4 Monday, October 21, 2013: Main Conference
    Information Extraction III
    Opinion Mining and Sentiment Analysis II
    NLP for Social Media II
    Parsing
    Semantics III
    NLP Applications III
    Plenary Session I
    Plenary Session II


1 Friday, October 18, 2013: Workshops

Overview

8:00–9:00 Breakfast (Leonesa Foyer)

9:00–5:30 Workshop 1: TextGraphs-8 (Eliza Anderson Amphitheater)

9:00–5:30 Workshop 2: SPMRL-2013 (Blewett Suite)

9:00–5:30 Workshop 3: Twenty Years of Bitext (Portland)


TextGraphs-8: Graph-based Methods for Natural Language Processing
Program Chairs: Zornitsa Kozareva, Irina Matveeva, Gabor Melli, Vivi Nastase

Invited talk: Opportunities for Graph-based Machine Reading
Oren Etzioni

Friday 9:00–10:05 — Eliza Anderson Amphitheater

Event-Centered Information Retrieval Using Kernels on Event Graphs
Goran Glavaš, Jan Šnajder

Friday 10:05–10:20 — Eliza Anderson Amphitheater

Traditional information retrieval models assume keyword-based queries and use unstructured document representations. There is an abundance of event-centered texts (e.g., breaking news) and event-oriented information needs that often involve structure that cannot be expressed using keywords. We present a novel retrieval model that uses a structured event-based representation. We structure queries and documents as graphs of event mentions and employ graph kernels to measure the query-document similarity. Experimental results on two event-oriented test collections show significant improvements over state-of-the-art keyword-based models.
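
The kernel idea can be made concrete with a toy sketch. This is not the paper's kernel: it scores two labeled event graphs by normalized edge overlap (a one-step walk kernel) on made-up event triples.

```python
# Toy graph-kernel similarity (not the paper's kernels): a query and a
# document are sets of labeled event edges; the score is their normalized
# edge overlap, a cosine-style one-step walk kernel.
from math import sqrt

def edge_kernel(g1, g2):
    """g1, g2: sets of (event, relation, event) triples."""
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / sqrt(len(g1) * len(g2))

query = {("earthquake", "precedes", "tsunami")}
doc = {("earthquake", "precedes", "tsunami"),
       ("tsunami", "causes", "evacuation")}
print(edge_kernel(query, doc))  # ~0.71; rank documents by this score
```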

JoBimText Visualizer: A Graph-based Approach to Contextualizing Distributional Similarity
Chris Biemann, Bonaventura Coppola, Michael R. Glass, Alfio Gliozzo, Matthew Hatem, Martin Riedl

Friday 10:20–10:30 — Eliza Anderson Amphitheater

We introduce an interactive visualization component for the JoBimText project. JoBimText is an open source platform for large-scale distributional semantics based on graph representations. First we describe the underlying technology for computing a distributional thesaurus on words using bipartite graphs of words and context features, and contextualizing the list of semantically similar words towards a given sentential context using graph-based ranking. Then we demonstrate the capabilities of this contextualized text expansion technology in an interactive visualization. The visualization can be used as a semantic parser providing contextualized expansions of words in text as well as disambiguation to word senses induced by graph clustering, and is provided as an open source tool.
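
As a rough sketch of the bipartite word-feature idea (toy data, not the JoBimText implementation), two words count as distributionally similar when they share many context features:

```python
# Distributional-thesaurus sketch: words link to context features in a
# bipartite graph; similarity = number of shared features (toy data).
from collections import defaultdict

word_features = {                      # word -> set of context features
    "coffee": {"drink_X", "hot_X", "cup_of_X"},
    "tea":    {"drink_X", "hot_X", "cup_of_X", "herbal_X"},
    "car":    {"drive_X", "park_X"},
}

def similar_words(target, k=2):
    scores = defaultdict(int)
    for other, feats in word_features.items():
        if other != target:
            scores[other] = len(word_features[target] & feats)
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(similar_words("coffee"))  # [('tea', 3), ('car', 0)]
```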

Merging Word Senses
Sumit Bhagwani, Shrutiranjan Satapathy, Harish Karnick

Friday 11:00–11:25 — Eliza Anderson Amphitheater

WordNet, a widely used sense inventory for Word Sense Disambiguation, is often too fine-grained for many Natural Language applications because of its narrow sense distinctions. We present a semi-supervised approach to learn similarity between WordNet synsets using a graph based recursive similarity definition. We seed our framework with sense similarities of all the word-sense pairs, learnt using supervision on human-labelled sense clusterings. Finally we discuss our method to derive coarse sense inventories at arbitrary granularities and show that the coarse grained sense inventory obtained significantly boosts the disambiguation of nouns on standard test sets.

Reconstructing Big Semantic Similarity Networks
Ai He, Shefali Sharma, Chun-Nan Hsu

Friday 11:25–11:50 — Eliza Anderson Amphitheater

Distance metric learning from high (thousands or more) dimensional data with hundreds or thousands of classes is intractable, but in NLP and IR, high dimensionality is usually required to represent data points, such as in modeling semantic similarity. This paper presents algorithms to scale up learning of a Mahalanobis distance metric from a large data graph in a high dimensional space. Our novel contributions include random projection that reduces dimensionality and a new objective function that regularizes intra-class and inter-class distances to handle a large number of classes. We show that the new objective function is convex and can be efficiently optimized by a stochastic-batch subgradient descent method. We applied our algorithm to two different domains: semantic similarity of documents collected from the Web, and phenotype descriptions in genomic data. Experiments show that our algorithm can handle the high-dimensional big data and outperform competing approximations in both domains.
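
A minimal sketch of the two named ingredients, with random data and a fixed (not learned) metric matrix; the paper learns M by stochastic-batch subgradient descent:

```python
# (1) Gaussian random projection to reduce dimensionality, then
# (2) a Mahalanobis distance d(x, y) = sqrt((x-y)^T M (x-y)) with a
# positive semidefinite M. M is fixed here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_high, d_low = 10_000, 100
R = rng.normal(0, 1 / np.sqrt(d_low), size=(d_high, d_low))  # projection

x, y = rng.random(d_high), rng.random(d_high)
x_p, y_p = x @ R, y @ R                  # project down to 100 dims

L = rng.normal(size=(d_low, d_low))
M = L @ L.T                              # L L^T is always PSD
diff = x_p - y_p
print(np.sqrt(diff @ M @ diff))          # Mahalanobis distance, low-dim space
```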

Graph-Based Unsupervised Learning of Word Similarities Using Heterogeneous Feature Types
Avneesh Saluja, Jiri Navratil
Friday 11:50–12:15 — Eliza Anderson Amphitheater

In this work, we propose a graph-based approach to computing similarities between words in an unsupervised manner, and take advantage of heterogeneous feature types in the process. The approach is based on the creation of two separate graphs, one for words and one for features of different types (alignment-based, orthographic, etc.). The graphs are connected through edges that link nodes in the feature graph to nodes in the word graph, the edge weights representing the importance of a particular feature for a particular word. High quality graphs are learned during training, and the proposed method outperforms experimental baselines.

From Global to Local Similarities: A Graph-Based Contextualization Method using Distributional Thesauri
Martin Riedl, Chris Biemann
Friday 12:15–12:30 — Eliza Anderson Amphitheater

After recasting the computation of a distributional thesaurus in a graph-based framework for term similarity, we introduce a new contextualization method that generates, for each term occurrence in a text, a ranked list of terms that are semantically similar and compatible with the given context. The framework is instantiated by the definition of term and context, which we derive from dependency parses in this work. Evaluating our approach on a standard data set for lexical substitution, we show substantial improvements over a strong non-contextualized baseline across all parts of speech. In contrast to comparable approaches, our framework defines an unsupervised generative method for similarity in context and does not rely on the existence of lexical resources as a source for candidate expansions.

Invited talk: Extracting Tractable Probabilistic Knowledge Graphs from Text
Pedro Domingos

Friday 2:00–3:05 — Eliza Anderson Amphitheater

Understanding seed selection in bootstrapping
Yo Ehara, Issei Sato, Hidekazu Oiwa, Hiroshi Nakagawa

Friday 3:05–3:30 — Eliza Anderson Amphitheater

Bootstrapping has recently become the focus of much attention in natural language processing as a way to reduce labeling cost. In bootstrapping, unlabeled instances can be harvested from the initial labeled "seed" set. The selected seed set affects accuracy, but how to select a good seed set is not yet clear. Thus, an "iterative seeding" framework is proposed for bootstrapping to reduce its labeling cost. Our framework iteratively selects the unlabeled instance that has the best "goodness of seed" and adds it, labeled, to the seed set. Our framework deepens understanding of this seeding process in bootstrapping by deriving the dual problem. We propose a method called expected model rotation (EMR) that works well on data that are not well separated, as realistic data frequently are. Experimental results show that EMR can select seed sets that provide significantly higher mean reciprocal rank on realistic data than existing naive selection methods or random seed sets.
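
The iterative seeding loop itself is easy to sketch; the goodness criterion below is a stand-in placeholder chosen only to make the sketch executable, not the paper's expected model rotation:

```python
# Sketch of an iterative seeding loop: repeatedly pick the unlabeled
# instance with the best "goodness of seed", have an oracle label it,
# and add it to the seed set.
def iterative_seeding(unlabeled, oracle_label, goodness_of_seed, budget=10):
    seeds, pool = [], list(unlabeled)
    for _ in range(budget):
        best = max(pool, key=lambda x: goodness_of_seed(x, seeds))
        pool.remove(best)
        seeds.append((best, oracle_label(best)))  # human labels the pick
    return seeds

# Toy run: the placeholder criterion prefers instances far from the
# seeds already chosen, spreading seeds across the space.
data = [1.0, 1.1, 5.0, 5.2, 9.0]
picked = iterative_seeding(
    data,
    oracle_label=lambda x: x > 4,
    goodness_of_seed=lambda x, s: min((abs(x - y) for y, _ in s), default=1e9),
    budget=3,
)
print(picked)  # [(1.0, False), (9.0, True), (5.0, True)]
```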

Graph-Structures Matching for Review Relevance Identification
Lakshmi Ramachandran, Edward Gehringer
Friday 4:00–4:25 — Eliza Anderson Amphitheater

Review quality is determined by identifying the relevance of a review to a submission (the article or paper the review was written for). We identify relevance in terms of the semantic and syntactic similarities between two texts. We use a word order graph, whose vertices, edges and double edges help determine structure-based match across texts. We use WordNet to determine semantic relatedness. Ours is a lexico-semantic approach, which predicts relevance with an accuracy of 66% and f-measure of 0.67.

Automatic Extraction of Reasoning Chains from Textual Reports
Gleb Sizov, Pinar Öztürk

Friday 4:25–4:50 — Eliza Anderson Amphitheater

Many organizations possess large collections of textual reports that document how a problem is solved or analysed, e.g. medical patient records, industrial accident reports, lawsuit records and investigation reports. Effective use of expert knowledge contained in these reports may greatly increase productivity of the organization. In this article, we propose a method for automatic extraction of reasoning chains that contain information used by the author of a report to analyse the problem at hand. For this purpose, we developed a graph-based text representation that makes the relations between textual units explicit. This representation is acquired automatically from a report using natural language processing tools including syntactic and discourse parsers. When applied to aviation investigation reports, our method generates reasoning chains that reveal the connection between initial information about the aircraft incident and its causes.

Graph-based Approaches for Organization Entity Resolution in MapReduce
Hakan Kardes, Deepak Konidena, Siddharth Agrawal, Micah Huff, Ang Sun

Friday 4:50–5:15 — Eliza Anderson Amphitheater

Entity Resolution is the task of identifying which records in a database refer to the same entity. A standard machine learning pipeline for the entity resolution problem consists of three major components: blocking, pairwise linkage, and clustering. The blocking step groups records by shared properties to determine which pairs of records should be examined by the pairwise linker as potential duplicates. Next, the linkage step assigns a probability score to pairs of records inside each block. If a pair scores above a user-defined threshold, the records are presumed to represent the same entity. Finally, the clustering step turns the input records into clusters of records (or profiles), where each cluster is uniquely associated with a single real-world entity. This paper describes the blocking and clustering strategies used to deploy a massive database of organization entities to power a major commercial People Search Engine. We demonstrate the viability of these algorithms for large data sets on a 50-node Hadoop cluster.
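
A single-machine toy version of the three-stage pipeline described above (blocking, pairwise linkage with a trivial name-overlap score, union-find clustering); the paper's MapReduce deployment and learned linker are not reproduced:

```python
# Blocking -> pairwise linkage -> clustering, on toy organization records.
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp",        "city": "Seattle"},
    {"id": 2, "name": "ACME Corporation", "city": "Seattle"},
    {"id": 3, "name": "Bolt Inc",         "city": "Seattle"},
]

# 1) Blocking: group records sharing a cheap key (here: city).
blocks = {}
for r in records:
    blocks.setdefault(r["city"], []).append(r)

# 2) Pairwise linkage: Jaccard overlap of name tokens, thresholded.
def score(a, b):
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

# 3) Clustering via union-find over linked pairs.
parent = {r["id"]: r["id"] for r in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for block in blocks.values():
    for a, b in combinations(block, 2):
        if score(a, b) > 0.3:
            parent[find(a["id"])] = find(b["id"])

print({r["id"]: find(r["id"]) for r in records})  # 1 and 2 share a cluster
```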

A Graph-Based Approach to Skill Extraction from Text
Ilkka Kivimäki, Alexander Panchenko, Adrien Dessy, Dries Verdegem, Pascal Francq, Hugues Bersini, Marco Saerens
Friday 5:15–5:40 — Eliza Anderson Amphitheater

This paper presents a system that performs skill extraction from text documents. It outputs a list of professional skills that are relevant to a given input text. We argue that the system can be practical for hiring and management of personnel in an organization. We make use of the texts and hyperlink graph of Wikipedia, as well as a list of professional skills obtained from the LinkedIn social network.


Statistical Parsing of Morphologically-Rich Languages

Program Chairs: Mohammed Attia, Bernd Bohnet, Marie Candito, Aoife Cahill, Özlem Cetinoglu, Jinho D. Choi, Grzegorz Chrupała, Benoit Crabbé, Gülsen Cebiroglu Eryigit, Michael Elhadad, Richard Farkas, Jennifer Foster, Josef van Genabith, Koldo Gojenola, Spence Green, Samar Husain, Sandra Kübler, Jonas Kuhn, Alberto Lavelli, Joseph Le Roux, Wolfgang Maier, Takuya Matsuzaki, Joakim Nivre, Kemal Oflazer, Adam Przepiórkowski, Owen Rambow, Kenji Sagae, Benoit Sagot, Djamé Seddah, Reut Tsarfaty, Lamia Tounsi, Daniel Zeman

Invited talk: An HDP Model for Inducing CCGs
Julia Hockenmaier

Friday 9:10–10:05 — Blewett Suite

Working with a small dataset - semi-supervised dependency parsing for Irish
Teresa Lynn, Jennifer Foster, Mark Dras

Friday 10:05–10:30 — Blewett Suite

We present a number of semi-supervised parsing experiments on the Irish language carried out using a small seed set of manually parsed trees and a larger, yet still relatively small, set of unlabelled sentences. We take two popular dependency parsers – one graph-based and one transition-based – and compare results for both. Results show that using semi-supervised learning in the form of self-training and co-training yields only very modest improvements in parsing accuracy. We also try to use morphological information in a targeted way and fail to see any improvements.
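
Self-training, one of the two schemes evaluated above, follows a simple loop: parse the unlabeled pool, keep the most confident automatic analyses as extra training data, and retrain. The sketch below assumes a hypothetical parser API (train, parse, confidence):

```python
# Generic self-training loop for a dependency parser (hypothetical API).
def self_train(seed_trees, unlabeled, train, parse, confidence,
               rounds=3, top_k=50):
    labeled = list(seed_trees)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        # Parse the pool and keep the most confident analyses.
        scored = sorted(((s, parse(model, s), confidence(model, s))
                         for s in pool), key=lambda t: -t[2])
        for sent, tree, _ in scored[:top_k]:
            labeled.append(tree)          # treat confident parses as gold
            pool.remove(sent)
        if not pool:
            break
    return train(labeled)
```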

Lithuanian Dependency Parsing with Rich Morphological Features
Jurgita Kapociute-Dzikiene, Joakim Nivre, Algis Krupavicius

Friday 11:00–11:25 — Blewett Suite

We present the first statistical dependency parsing results for Lithuanian, a morphologically rich language in the Baltic branch of the Indo-European family. Using a greedy transition-based parser, we obtain a labeled attachment score of 74.7 with gold morphology and 68.1 with predicted morphology (77.8 and 72.8 unlabeled). We investigate the usefulness of different features and find that rich morphological features improve parsing accuracy significantly, by 7.5 percentage points with gold features and 5.6 points with predicted features. As expected, CASE is the single most important morphological feature, but virtually all available features bring some improvement, especially under the gold condition.
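
The attachment scores reported here are standard metrics and easy to compute: LAS counts tokens whose head and dependency label are both correct, UAS counts correct heads only. A small sketch:

```python
# Labeled (LAS) and unlabeled (UAS) attachment scores.
def attachment_scores(gold, pred):
    """gold, pred: lists of (head_index, label) per token."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return las, uas

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
print(attachment_scores(gold, pred))  # (0.666..., 0.666...)
```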

Parsing Croatian and Serbian by Using Croatian Dependency Treebanks
Željko Agic, Danijela Merkler, Daša Berovic

Friday 11:25–11:50 — Blewett Suite

We investigate statistical dependency parsing of two closely related languages, Croatian and Serbian. As these two morphologically complex languages of relaxed word order are generally under-resourced – with the topic of dependency parsing still largely unaddressed, especially for Serbian – we make use of the two available dependency treebanks of Croatian to produce state-of-the-art parsing models for both languages. We observe parsing accuracy on four test sets from two domains. We give insight into overall parser performance for Croatian and Serbian, impact of preprocessing for lemmas and morphosyntactic tags and influence of selected morphosyntactic features on parsing accuracy.

A Cross-Task Flexible Transition Model for Arabic Tokenization, Affix Detection, Affix Labeling, POS Tagging, and Dependency Parsing
Stephen Tratz
Friday 11:50–12:15 — Blewett Suite

This paper describes cross-task flexible transition models (CTF-TMs) and demonstrates their effectiveness for Arabic natural language processing (NLP). NLP pipelines often suffer from error propagation, as errors committed in lower-level tasks cascade through the remainder of the processing pipeline. By allowing a flexible order of operations across and within multiple NLP tasks, a CTF-TM can mitigate both cross-task and within-task error propagation. Our Arabic CTF-TM models tokenization, affix detection, affix labeling, part-of-speech tagging, and dependency parsing, achieving state-of-the-art results. We present the details of our general framework, our Arabic CTF-TM, and the setup and results of our experiments.

The LIGM-Alpage architecture for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing
Matthieu Constant, Marie Candito, Djamé Seddah
Friday 14:10–14:20 — Blewett Suite

This paper describes the LIGM-Alpage system for the SPMRL 2013 Shared Task. We only participated in the French part of the dependency parsing track, focusing on the realistic setting where the system is informed neither with gold tagging and morphology nor (more importantly) with gold grouping of tokens into multiword expressions (MWEs). While the realistic scenario of predicting both MWEs and syntax has already been investigated for constituency parsing, the SPMRL 2013 shared task datasets offer the possibility to investigate it in the dependency framework. We obtain the best results for French, both for overall parsing and for MWE recognition, using a reparsing architecture that combines several parsers, with both a pipeline architecture (MWE recognition followed by parsing) and a joint architecture (MWE recognition performed by the parser).

Exploring beam-based shift-reduce dependency parsing with DyALog: Results from the SPMRL 2013 shared task
Eric De La Clergerie
Friday 14:20–14:30 — Blewett Suite

The SPMRL 2013 shared task was the opportunity to develop and test, with promising results, a simple beam-based shift-reduce dependency parser on top of the tabular logic programming system DyALog. The parser was also extended to handle ambiguous word lattices, with almost no loss w.r.t. disambiguated input, thanks to specific training, use of oracle segmentation, and large beams. We believe that this result is an interesting new one for shift-reduce parsing.

Effective Morphological Feature Selection with MaltOptimizer at the SPMRL 2013 Shared Task
Miguel Ballesteros
Friday 14:30–14:55 — Blewett Suite

The inclusion of morphological features provides very useful information that helps to enhance the results when parsing morphologically rich languages. MaltOptimizer is a tool that, given a data set, searches for the optimal parameters, parsing algorithm and optimal feature set, achieving the best results that it can find for parsers trained with MaltParser. In this paper, we present an extension of MaltOptimizer that explores, one by one and in combination, the features that are geared towards morphology. From our experiments in the context of the Shared Task on Parsing Morphologically Rich Languages, we extract an in-depth study that shows which features are actually useful for transition-based parsing, and we provide competitive results in a fast and simple way.

Exploiting the Contribution of Morphological Information to Parsing: the BASQUE TEAM system in the SPMRL’2013 Shared Task
Iakes Goenaga, Koldo Gojenola, Nerea Ezeiza
Friday 14:55–15:10 — Blewett Suite

This paper presents a dependency parsing system, presented as BASQUE_TEAM at the SPMRL’2013 Shared Task, based on the analysis of each morphological feature of the languages. Once the specific relevance of each morphological feature is calculated, this system uses the most significant of them to create a series of analyzers using two freely available, state of the art dependency parsers, MaltParser and Mate. Finally, the system combines the previously achieved parses using a voting approach.


The AI-KU System at the SPMRL 2013 Shared Task: Unsupervised Features for Dependency Parsing
Volkan Cirik, Hüsnü Sensoy
Friday 15:10–15:20 — Blewett Suite

We propose the use of the word categories and embeddings induced from raw text as auxiliary features in dependency parsing. To induce word features, we make use of contextual, morphologic and orthographic properties of the words. To exploit the contextual information, we make use of substitute words, the most likely substitutes for target words, generated by using a statistical language model. We generate morphologic and orthographic properties of word types in an unsupervised manner. We use a co-occurrence model with these properties to embed words onto a 25-dimensional unit sphere. The AI-KU system shows improvements for some of the languages it is trained on for the first Shared Task of Statistical Parsing of Morphologically Rich Languages.

SPMRL’13 Shared Task System: The CADIM Arabic Dependency Parser
Yuval Marton, Nizar Habash, Owen Rambow, Sarah Alkuhlani

Friday 15:20–15:30 — Blewett Suite

We describe the submission from the Columbia Arabic & Dialect Modeling group (CADIM) for the Shared Task at the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL’2013). We participate in the Arabic dependency parsing task for predicted POS tags and features. Our system is based on Marton et al. (2013).

A Statistical Approach to Prediction of Empty Categories in Hindi Dependency Treebank
Puneeth Kukkadapu, Prashanth Mannem

Friday 16:00–16:20 — Blewett Suite

In this paper we use statistical dependency parsing techniques to detect NULL or empty categories in Hindi sentences. We have currently worked on the Hindi dependency treebank released as part of the COLING-MTPIL 2012 Workshop. Earlier, rule-based approaches were employed to detect empty heads for Hindi, but statistical learning for automatic prediction has not been explored. In this approach we use a technique of introducing complex labels into the data to predict empty categories in sentences. We also discuss shortcomings and difficulties in this approach and evaluate the performance of this approach on different empty categories.

An Empirical Study on the Effect of Morphological and Lexical Features in Persian Dependency Parsing
Mojtaba Khallash, Ali Hadian, Behrouz Minaei-Bidgoli
Friday 16:00–16:20 — Blewett Suite

This paper investigates the impact of different morphological and lexical information on data-driven dependency parsing of Persian, a morphologically rich language. We explore two state-of-the-art parsers, namely MSTParser and MaltParser, on the recently released Persian dependency treebank and establish some baselines for dependency parsing performance. Three sets of issues are addressed in our experiments: effects of using gold and automatically derived features, finding the best features for the parser, and a suitable way to alleviate the data sparsity problem. The final accuracy is 87.91% and 88.37% labeled attachment scores for MaltParser and MSTParser, respectively.

Constructing a Practical Constituent Parser from a Japanese Treebank with Function Labels
Takaaki Tanaka, Masaaki Nagata
Friday 16:00–16:20 — Blewett Suite

We present an empirical study on constructing a Japanese constituent parser, which can output function labels to deal with more detailed syntactic information. Japanese syntactic parse trees are usually represented as unlabeled dependency structures between bunsetsu chunks; however, such a representation is insufficient to uncover syntactic information about the distinction between complements and adjuncts and about coordination structure, which is required for practical applications such as syntactic reordering in machine translation. We describe a preliminary effort on constructing a Japanese constituent parser from a Penn Treebank style treebank semi-automatically made from a dependency-based corpus. The evaluations show the parser trained on the treebank has comparable bracketing accuracy to conventional bunsetsu-based parsers, and can output function labels such as the grammatical role of the argument and the type of adnominal phrases.

Context Based Statistical Morphological Analyzer and its Effect on Hindi Dependency Parsing
Deepak Kumar Malladi, Prashanth Mannem
Friday 16:00–16:20 — Blewett Suite

This paper revisits the work of Malladi and Mannem (2013), which focused on building a Statistical Morphological Analyzer (SMA) for Hindi, and compares the performance of SMA with another existing statistical analyzer, Morfette. We evaluate SMA in various experimental scenarios and look at how it performs for unseen words. The latter part of the paper presents the effect of the predicted morph features on dependency parsing and extends the work to other morphologically rich languages: Hindi and Telugu, without any language-specific engineering.

Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank
Umut Sulubacak, Gülsen Eryigit
Friday 16:00–16:20 — Blewett Suite

This paper presents our preliminary conclusions as part of an ongoing effort to construct a new dependency representation framework for Turkish. We aim for this new framework to accommodate the highly agglutinative morphology of Turkish as well as to allow the annotation of unedited web data, and shape our decisions around these considerations. In this paper, we firstly describe a novel syntactic representation for morphosyntactic sub-word units (namely inflectional groups (IGs) in Turkish) which allows inter-IG relations to be discerned with perfect accuracy without having to hide lexical information. Secondly, we investigate alternative annotation schemes for coordination structures and present a better scheme (nearly 11% increase in recall scores) than the one in the Turkish Treebank (Oflazer et al., 2003) for both parsing accuracy and compatibility with colloquial language.

(Re)ranking Meets Morphosyntax: State-of-the-art Results from the SPMRL 2013 Shared Task
Anders Björkelund, Ozlem Cetinoglu, Richárd Farkas, Thomas Mueller, Wolfgang Seeker
Friday 16:20–16:45 — Blewett Suite

This paper describes the IMS-SZEGED-CIS contribution to the SPMRL 2013 Shared Task. We participate in both the constituency and dependency tracks, and achieve state-of-the-art results for all languages. For both tracks we make significant improvements through high quality preprocessing and (re)ranking on top of strong baselines. Our system came out first for both tracks.

Invited talk: Dependency Parsing with Selectional Branching
Jinho D. Choi

Friday 16:45–17:10 — Blewett Suite

Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho D. Choi, Richárd Farkas, Jennifer Foster, Iakes Goenaga, Koldo Gojenola Galletebeitia, Yoav Goldberg, Spence Green, Nizar Habash, Marco Kuhlmann, Wolfgang Maier, Yuval Marton, Joakim Nivre, Adam Przepiórkowski, Ryan Roth, Wolfgang Seeker, Yannick Versley, Veronika Vincze, Marcin Wolinski, Alina Wróblewska
Friday 18:00–18:15 — Blewett Suite

This paper reports on the first shared task on statistical parsing of Morphologically Rich Languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the alignment of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and cross-comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios.


Twenty Years of Bitext
Program Chairs: Chris Dyer, Noah A. Smith, Phil Blunsom

What a Translation Model Tastes Like
Kevin Knight, USC/ISI

Friday 9:30–10:00 — Portland

Between Machine Translation and Human Translation
Philipp Koehn, Edinburgh

Friday 10:00–10:30 — Portland

Oh, Yes, Everything’s Right on Schedule, Fred
Peter F. Brown and Robert L. Mercer, Renaissance Technologies

Friday 11:00–12:00 — Portland

Beyond Bitext: Five Open Problems in Machine Translation
Adam Lopez and Matt Post
Friday 3:00–4:00 — Portland

Twenty Flavors of One Text
Daniel Zeman and Ondrej Bojar

Friday 3:00–4:00 — Portland

What SMT Learns
Dekai Wu

Friday 3:00–4:00 — Portland

Violation-Fixing Perceptron and Forced Decoding for Scalable MT Training
Heng Yu, Liang Huang, and Haitao Mi

Friday 3:00–4:00 — Portland

Aligning Words in Bitexts using the Bilingual Web
Jim Chang, Joseph Chee Chang, Jian-cheng Wu, and Jason S. Chang

Friday 3:00–4:00 — Portland

The Pedantic Javadoc Corpus: Comments and Code as Bitext
Jim White

Friday 3:00–4:00 — Portland

Bitexts as Semantic Mirrors
Jörg Tiedemann, Lonneke van der Plas, and Begoña Villada Moirón

Friday 3:00–4:00 — Portland

Using parallel corpora to bootstrap multilingual semantic parsers
Kilian Evang and Johan Bos

Friday 3:00–4:00 — Portland


Bitext as Word Sense Annotation: Lessons from Evaluating Machine Translation on Lexical Semantics Tasks
Marine Carpuat
Friday 3:00–4:00 — Portland

Beyond Parallel Data – A Decipherment Approach to Machine Translation
Qing Dou and Kevin Knight

Friday 3:00–4:00 — Portland

On Applying and Extending Bitext for Entity Translation
Seung-won Hwang

Friday 3:00–4:00 — Portland

Lexicalized Reordering Model in Chart-based Machine Translation
ThuyLinh Nguyen

Friday 3:00–4:00 — Portland

Discrete Log-Linear Autoencoders for Unsupervised Learning of Linguistic Structure
Waleed Ammar, Chris Dyer, and Noah A. Smith

Friday 3:00–4:00 — Portland

Google Translate: Past, Present, Future
Franz Josef Och, Google
Friday 4:00–4:30 — Portland


2 Saturday, October 19, 2013: Main Conference

Overview

7:30 – 9:00    Breakfast (Leonesa Foyer)

9:00 – 9:15    Opening Remarks (Leonesa Ballroom)

9:20 – 10:30   Parallel Sessions
               Information Extraction I (Leonesa I and II)
               Language Acquisition and Processing (Leonesa III)
               NLP for Social Media I (Eliza Anderson Amphitheater)

10:30 – 11:00  Break (Leonesa Foyer)

11:00 – 12:35  Parallel Sessions
               NLP Applications I (Leonesa I and II)
               Semantics I (Leonesa III)
               Language Resources (Eliza Anderson Amphitheater)

12:35 – 2:00   Lunch

2:00 – 3:15    Invited talk: Andrew Ng (Leonesa Ballroom)

3:15 – 3:45    Break (Leonesa Foyer)

3:45 – 5:50    Parallel Sessions
               Machine Translation I (Leonesa I and II)
               Dialogue and Discourse (Leonesa III)
               Morphology and Phonology (Eliza Anderson Amphitheater)

6:00 – 7:30    Poster Session A (Princessa Ballroom and Foyer)

7:30 – 9:00    Poster Session B (Princessa Ballroom and Foyer)


Information Extraction I
Leonesa I and II Chair: Steven Bethard

Event-Based Time Label Propagation for Automatic Dating of News Articles
Tao Ge, Baobao Chang, Sujian Li, Zhifang Sui

Saturday 9:20–9:45 — Leonesa I and II

Since many applications such as timeline summaries and temporal IR involving temporal analysis rely on document timestamps, the task of automatic dating of documents has been increasingly important. Instead of using feature-based methods as conventional models do, our method attempts to date documents at the year level by exploiting relative temporal relations between documents and events, which are very effective for dating documents. Based on this intuition, we propose an event-based time label propagation model called confidence boosting, in which time label information can be propagated between documents and events on a bipartite graph. The experiments show that our event-based propagation model can predict document timestamps with high accuracy, and the model combined with a MaxEnt classifier outperforms the state-of-the-art method for this task, especially when the size of the training set is small.
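
The propagation idea can be sketched on toy data: known document years vote on event years, which in turn date the undated documents. The paper's confidence-boosting scheme is more elaborate than this majority-vote sketch:

```python
# Year-label propagation over a document-event bipartite graph (toy data).
def propagate(doc_year, doc_events, rounds=2):
    """doc_year: known timestamps {doc: year}; doc_events: {doc: [events]}."""
    for _ in range(rounds):
        # documents -> events: an event inherits the majority year
        votes = {}
        for doc, year in doc_year.items():
            for ev in doc_events[doc]:
                votes.setdefault(ev, []).append(year)
        event_year = {ev: max(set(ys), key=ys.count) for ev, ys in votes.items()}
        # events -> documents: date undated docs from their events' years
        for doc, evs in doc_events.items():
            years = [event_year[e] for e in evs if e in event_year]
            if doc not in doc_year and years:
                doc_year[doc] = max(set(years), key=years.count)
    return doc_year

docs = {"d1": ["olympics_beijing"], "d2": ["olympics_beijing", "crash"]}
print(propagate({"d1": 2008}, docs))  # d2 inherits 2008 via the shared event
```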

Exploiting Discourse Analysis for Article-Wide Temporal Classification
Jun-Ping Ng, Min-Yen Kan, Ziheng Lin, Wei Feng, Bin Chen, Jian Su, Chew Lim Tan

Saturday 9:45–10:10 — Leonesa I and II

In this paper we classify the temporal relations between pairs of events on an article-wide basis. This is in contrast to much of the existing literature, which focuses on just event pairs found within the same or adjacent sentences. To achieve this, we leverage discourse analysis, as we believe that it provides more useful semantic information than typical lexico-syntactic features. We propose the use of several discourse analysis frameworks, including 1) Rhetorical Structure Theory (RST), 2) PDTB-styled discourse relations, and 3) topical text segmentation. We explain how features derived from these frameworks can be effectively used with support vector machines (SVM) paired with convolution kernels. Experiments show that our proposal improves on the state-of-the-art significantly, by as much as 16% in terms of F1, even if we only adopt less-than-perfect automatic discourse analyzers and parsers. Making use of more accurate discourse analysis can further boost gains to 35%.

Combining Generative and Discriminative Model Scores for Distant Supervision
Benjamin Roth, Dietrich Klakow

Saturday 10:10–10:30 — Leonesa I and II

Distant supervision is a scheme to generate noisy training data for relation extraction by aligning entities of a knowledge base with text. In this work we combine the output of a discriminative at-least-one learner with that of a generative hierarchical topic model to reduce the noise in distant supervision data. The combination significantly increases the ranking quality of extracted facts and achieves state-of-the-art extraction performance in an end-to-end setting. A simple linear interpolation of the model scores performs better than a parameter-free scheme based on non-dominated sorting.
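
The interpolation mentioned in the last sentence is a one-line combination rule; lambda would be tuned on held-out data (0.5 below is arbitrary):

```python
# Linear interpolation of a generative and a discriminative score.
def combined_score(s_generative, s_discriminative, lam=0.5):
    return lam * s_generative + (1 - lam) * s_discriminative

print(combined_score(0.8, 0.6))  # 0.7 -- rank extracted facts by this score
```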


Language Acquisition and Processing
Leonesa III Chair: Julia Hockenmaier

Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
Stella Frank, Frank Keller, Sharon Goldwater
Saturday 9:20–9:45 — Leonesa III

Children learn various levels of linguistic structure concurrently, yet most existing models of language acquisition deal with only a single level of structure, implicitly assuming a sequential learning process. Developing models that learn multiple levels simultaneously can provide important insights into how these levels might interact synergistically during learning. Here, we present a model that jointly induces syntactic categories and morphological segmentations by combining two well-known models for the individual tasks. We test on child-directed utterances in English and Spanish and compare to single-task baselines. In the morphologically poorer language (English), the model improves morphological segmentation, while in the morphologically richer language (Spanish), it leads to better syntactic categorization. These results provide further evidence that joint learning is useful, but also suggest that the benefits may be different for typologically different languages.

A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
Micha Elsner, Sharon Goldwater, Naomi Feldman, Frank Wood
Saturday 9:45–10:10 — Leonesa III

We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with variable pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.

Animacy Detection with Voting Models
Joshua Moore, Christopher J.C. Burges, Erin Renshaw, Wen-tau Yih

Saturday 10:10–10:30 — Leonesa III

Animacy detection is a problem whose solution has been shown to be beneficial for a number of syntactic and semantic tasks. We present a state-of-the-art system for this task which uses a number of simple classifiers with heterogeneous data sources in a voting scheme. We show how this framework can give us direct insight into the behavior of the system, allowing us to more easily diagnose sources of error.


NLP for Social Media I
Eliza Anderson Amphitheater Chair: Noah Smith

A Log-Linear Model for Unsupervised Text Normalization
Yi Yang, Jacob Eisenstein

Saturday 9:20–9:45 — Eliza Anderson Amphitheater

We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximum-likelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called unLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of unLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
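
The core scoring step of a log-linear normalizer looks like the sketch below. The features and weights are illustrative inventions, not the unLOL system; the point is only the w . f(noisy, candidate) score with a softmax over candidates:

```python
# Log-linear scoring of candidate standard forms for a noisy token.
import math

def is_subsequence(s, t):
    it = iter(t)
    return all(ch in it for ch in s)      # "tmrw" is a subsequence of "tomorrow"

def features(noisy, cand):
    return {
        "subseq": float(is_subsequence(noisy, cand)),
        "same_first": float(noisy[:1] == cand[:1]),
        "len_diff": -abs(len(noisy) - len(cand)),
    }

def normalize(noisy, candidates, w):
    scores = {c: sum(w.get(k, 0.0) * v for k, v in features(noisy, c).items())
              for c in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

w = {"subseq": 3.0, "same_first": 1.0, "len_diff": 0.2}   # toy weights
print(normalize("tmrw", ["tomorrow", "tremor", "team"], w))  # "tomorrow" wins
```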

Paraphrasing 4 Microblog Normalization
Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso

Saturday 9:45–10:10 — Eliza Anderson Amphitheater

Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization — replacing orthographically or lexically idiosyncratic forms with more standard variants — can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.

Question Difficulty Estimation in Community Question Answering Services
Jing Liu, Quan Wang, Chin-Yew Lin, Hsiao-Wuen Hon

Saturday 10:10–10:30 — Eliza Anderson Amphitheater

In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.
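
One plausible reading of "competition-based" is a skill-rating scheme: each answering attempt is treated as a match between user skill and question difficulty. The Elo-style sketch below is an assumption for illustration, not the paper's exact model:

```python
# Elo-style difficulty/skill updates from question-vs-user "matches"
# (illustrative assumption, not the model from the paper).
import math

def elo_update(difficulty, skill, user_won, k=0.5):
    expected_user_win = 1.0 / (1.0 + math.exp(difficulty - skill))
    # If users beat the question less often than expected, it is harder.
    difficulty += k * (expected_user_win - (1.0 if user_won else 0.0))
    skill += k * ((1.0 if user_won else 0.0) - expected_user_win)
    return difficulty, skill

d, s = 0.0, 0.0
for won in [False, False, True]:       # two failed attempts, one success
    d, s = elo_update(d, s, won)
print(d, s)                            # net difficulty stays positive
```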


NLP Applications I
Leonesa I and II Chair: Hang Li

Measuring Ideological Proportions in Political Speeches
Yanchuan Sim, Brice D. L. Acree, Justin H. Gross, Noah A. Smith

Saturday 11:00–11:25 — Leonesa I and II

We seek to measure political candidates’ ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. Presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert-authored hypotheses.

Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation
Dekai Wu, Karteek Addanki, Markus Saers, Meriem Beloucif
Saturday 11:25–11:50 — Leonesa I and II

We present a novel model that learns to improvise rhyming and fluent responses upon being challenged with a line of hip hop lyrics, by combining both bottom-up token based rule induction and top-down rule segmentation strategies to learn a stochastic transduction grammar that simultaneously learns both phrasing and rhyming associations. In this attack on the woefully under-explored natural language genre of music lyrics, we exploit a strictly unsupervised transduction grammar induction approach. Our task is particularly ambitious in that no use of any a priori linguistic or phonetic information is allowed, even though the domain of hip hop lyrics is particularly noisy and unstructured. We evaluate the performance of the learned model against a model learned only using the more conventional bottom-up token based rule induction, and demonstrate the superiority of our combined token based and rule segmentation induction method toward generating higher quality improvised responses, measured on fluency and rhyming criteria as judged by human evaluators. To highlight some of the inherent challenges in adapting other algorithms to this novel task, we also compare the quality of the responses generated by our model to those generated by an out-of-the-box phrase based SMT system. We tackle the challenge of selecting appropriate training data for our task via a dedicated rhyme scheme detection module, which is also acquired via unsupervised learning, and report improved quality of the generated responses. Finally, we report results with French hip hop lyrics indicating that our model performs surprisingly well with no special adaptation to other languages.

Modeling Scientific Impact with Topical Influence Regression
James Foulds, Padhraic Smyth

Saturday 11:50–12:15 — Leonesa I and II

When reviewing scientific literature, it would be useful to have automatic tools that identify the most influential scientific articles as well as how ideas propagate between articles. In this context, this paper introduces topical influence, a quantitative measure of the extent to which an article tends to spread its topics to the articles that cite it. Given the text of the articles and their citation graph, we show how to learn a probabilistic model to recover both the degree of topical influence of each article and the influence relationships between articles. Experimental results on corpora from two well-known computer science conferences are used to illustrate and validate the proposed approach.

Joint Parsing and Disfluency Detection in Linear Time
Mohammad Sadegh Rasooli, Joel Tetreault

Saturday 12:15–12:35 — Leonesa I and II

We introduce a novel method to jointly parse and detect disfluencies in spoken utterances. Our model can use arbitrary features for parsing sentences and adapt itself with out-of-domain data. We show that our method, based on transition-based parsing, performs at a high level of accuracy for both the parsing and disfluency detection tasks. Additionally, our method is the fastest for the joint task, running in linear time.


Semantics I

Leonesa III Chair: Luke Zettlemoyer

Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
Masashi Tsubaki, Kevin Duh, Masashi Shimbo, Yuji Matsumoto
Saturday 11:00–11:25 — Leonesa III

We present a novel vector space model for semantic co-compositionality. Inspired by Generative Lexicon Theory (Pustejovsky, 1995), our goal is a compositional model where both predicate and argument are allowed to modify each others’ meaning representations while generating the overall semantics. This readily addresses some major challenges with current vector space models, notably the polysemy issue and the use of one representation per word type. We implement co-compositionality using prototype projections on predicates/arguments and show that this is effective in adapting their word representations. We further cast the model as a neural network and propose an unsupervised algorithm to jointly train word representations with co-compositionality. The model achieves the best result to date (rho = 0.47) on the semantic similarity task of transitive verbs (Grefenstette and Sadrzadeh, 2011).

Studying the Recursive Behaviour of Adjectival Modification with Compositional Distributional Semantics
Eva Maria Vecchi, Roberto Zamparelli, Marco Baroni
Saturday 11:25–11:50 — Leonesa III

In this study, we use compositional distributional semantic methods to investigate restrictions in adjective ordering. Specifically, we focus on properties distinguishing Adjective-Adjective-Noun phrases in which there is flexibility in the adjective ordering from those bound to a rigid order. We explore a number of measures extracted from the distributional representation of AAN phrases which may indicate a word order restriction. We find that we are able to distinguish the relevant classes and the correct order based primarily on the degree of modification of the adjectives. Our results offer fresh insight into the semantic properties that determine adjective ordering, building a bridge between syntax and distributional semantics.

Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
Min Xiao, Feipeng Zhao, Yuhong Guo
Saturday 11:50–12:15 — Leonesa III

Domain adaptation has been popularly studied on exploiting labeled information from a source domain to learn a prediction model in a target domain. In this paper, we develop a novel representation learning approach to address domain adaptation for text classification with automatically induced discriminative latent features, which are generalizable across domains while informative to the prediction task. Specifically, we propose a hierarchical multinomial Naive Bayes model with latent variables to conduct supervised word clustering on labeled documents from both source and target domains, and then use the produced cluster distribution of each word as its latent feature representation for domain adaptation. We train this latent graphical model using a simple expectation-maximization (EM) algorithm. We empirically evaluate the proposed method with both cross-domain document categorization tasks on the Reuters-21578 dataset and cross-domain sentiment classification tasks on the Amazon product review dataset. The experimental results demonstrate that our proposed approach achieves superior performance compared with alternative methods.

Appropriately Incorporating Statistical Significance in PMI
Om P. Damani, Shweta Ghonge
Saturday 12:15–12:35 — Leonesa III

Two recent measures incorporate the notion of statistical significance in the basic PMI formulation. In some tasks, we find that the new measures perform worse than PMI. Our analysis shows that while the basic ideas in incorporating statistical significance in PMI are reasonable, they have been applied slightly inappropriately. By fixing this, we get new measures that improve performance over not just PMI but other popular co-occurrence measures as well. In fact, the revised measures perform reasonably well compared with more resource-intensive non-co-occurrence-based methods also.
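For reference, PMI itself is computed from co-occurrence counts. The Python sketch below (with hypothetical toy counts) shows only the basic formulation that the paper revises; the significance-corrected variants are not reproduced here.

import math

def pmi(count_xy, count_x, count_y, total):
    """Basic pointwise mutual information from raw corpus counts:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Hypothetical toy counts: a bigram seen 80 times in a 1M-token corpus.
print(pmi(count_xy=80, count_x=500, count_y=120, total=1000000))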


Language Resources
Eliza Anderson Amphitheater Chair: Emily Bender

Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
Stefano Faralli, Roberto Navigli

Saturday 11:00–11:25 — Eliza Anderson Amphitheater

In this paper we present a minimally-supervised approach to the multi-domain acquisition of wide-coverage glossaries. We start from a small number of hypernymy relation seeds and bootstrap glossaries from the Web for dozens of domains using Probabilistic Topic Models. Our experiments show that we are able to extract high-precision glossaries comprising thousands of terms and definitions.

Joint Learning of Phonetic Units and Word Pronunciations for ASR
Chia-ying Lee, Yu Zhang, James Glass

Saturday 11:25–11:50 — Eliza Anderson Amphitheater

The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative - requiring no language-specific knowledge - to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.

MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
Matthew Richardson, Christopher J.C. Burges, Erin Renshaw

Saturday 11:50–12:15 — Eliza Anderson Amphitheater

We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets, or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding the world, yet, by being multiple-choice, still provide a clear metric. By being fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task. We present the scalable crowd-sourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we have ensured that the data is the same quality as another that we manually edited, but at one tenth the editing cost. By being open-domain, yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text.

Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora

Katsuhito Sudoh, Shinsuke Mori, Masaaki Nagata
Saturday 12:15–12:35 — Eliza Anderson Amphitheater

This paper proposes a novel noise-aware character alignment method for bootstrapping statistical machine transliteration from automatically extracted phrase pairs. The model is an extension of a Bayesian many-to-many alignment method for distinguishing non-transliteration (noise) parts in phrase pairs. It worked effectively in experiments on bootstrapping Japanese-to-English statistical machine transliteration in the patent domain using patent bilingual corpora.


Invited talk: Andrew Ng
Leonesa Ballroom

The Online Revolution: Education for Everyone
Dr Andrew Ng, Co-CEO and Co-founder of Coursera

Saturday 2:00–3:15 — Leonesa Ballroom

In 2011, Stanford University offered three online courses, which anyone in the world could enroll in and take for free. Together, these three courses had enrollments of around 350,000 students, making this one of the largest experiments in online education ever performed.

Since the beginning of 2012, we have transitioned this effort into a new venture, Coursera, a social entrepreneurship company whose mission is to make high-quality education accessible to everyone by allowing the best universities to offer courses to everyone around the world, for free. Coursera classes provide a real course experience to students, including video content, interactive exercises with meaningful feedback, using both auto-grading and peer-grading, and a rich peer-to-peer interaction around the course materials.

Currently, Coursera has 80 university and other partners, and 3.6 million students enrolled in its nearly 400 courses. These courses span a range of topics including computer science, business, medicine, science, humanities, social sciences, and more.

In this talk, I’ll report on this far-reaching experiment in education, and why we believe this model can provide both an improved classroom experience for our on-campus students, via a flipped classroom model, as well as a meaningful learning experience for the millions of students around the world who would otherwise never have access to education of this quality.


Machine Translation I
Leonesa I and II Chair: Kevin Knight

Optimal Beam Search for Machine Translation
Alexander Rush, Yin-Wen Chang, Michael Collins

Saturday 3:45–4:10 — Leonesa I and II

Beam search is a fast and empirically effective method for translation decoding, but it lacks any formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than a Lagrangian relaxation-based method for phrase-based translation and 4 times faster for syntax-based translation.
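As background, the following is a minimal Python sketch of generic beam-search decoding, the baseline whose pruning causes the search errors the paper’s certificates detect; expand and is_final are hypothetical stand-ins for a real decoder’s components, and the Lagrangian relaxation itself is not shown.

import heapq

def beam_search(initial, expand, is_final, beam_size=5):
    """Keep only the `beam_size` highest-scoring partial hypotheses
    at each step; the pruning here is the source of search error."""
    beam = [(0.0, initial)]
    while not all(is_final(h) for _, h in beam):
        candidates = []
        for score, hyp in beam:
            if is_final(hyp):
                candidates.append((score, hyp))  # keep finished hypotheses
            else:
                for step_score, nxt in expand(hyp):
                    candidates.append((score + step_score, nxt))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])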

An Efficient Language Model Using Double-Array Structures
Makoto Yasuhara, Toru Tanaka, Jun-ya Norimatsu, Mikio Yamamoto

Saturday 4:10–4:35 — Leonesa I and II

Ngram language models tend to grow in size as the corpus size increases, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related work from the viewpoints of model size and query speed when both optimization methods are used.

Structured Penalties for Log-Linear Language Models
Anil Kumar Nelakanti, Cedric Archambeau, Julien Mairal, Francis Bach, Guillaume Bouchard

Saturday 4:35–5:00 — Leonesa I and II

Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length $m$. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in $nd$, where $n$ is the length of the training corpus and $d$ is the number of observed features. We present a model that grows logarithmically in $d$, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
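As a rough sketch of this setup (notation ours, not taken from the paper), a log-linear language model with a penalized maximum-likelihood objective can be written as:

\[
p(y \mid x; w) = \frac{\exp\big(w^\top \phi(x, y)\big)}{\sum_{y'} \exp\big(w^\top \phi(x, y')\big)},
\qquad
\hat{w} = \operatorname*{arg\,min}_{w} \; -\sum_{i=1}^{n} \log p(y_i \mid x_i; w) + \lambda\, \Omega(w),
\]

where each $x_i$ is a context of up to $m$ preceding words, $\phi$ extracts context features, and $\Omega$ is a tree-structured penalty over the trie of context suffixes rather than a plain $\ell_1$ or $\ell_2$ norm.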

Interactive Machine Translation using Hierarchical Translation Models
Jesús González-Rubio, Daniel Ortíz-Martinez, José-Miguel Benedí, Francisco Casacuberta

Saturday 5:00–5:25 — Leonesa I and II

Current automatic machine translation systems are not able to generate error-free translations, and human intervention is often required to correct their output. Alternatively, an interactive framework that integrates human knowledge into the translation process has been presented in previous works. Here, we describe a new interactive machine translation approach that is able to work with phrase-based and hierarchical translation models, and integrates error correction, all in a unified statistical framework. In our experiments, our approach outperforms previous interactive translation systems, and achieves relative effort reductions of as much as 48% over a traditional post-edition system.

Max-Margin Synchronous Grammar Induction for Machine Translation
Xinyan Xiao, Deyi Xiong

Saturday 5:25–5:50 — Leonesa I and II

Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional two-step pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than the previous max-likelihood estimation method.


Dialogue and Discourse
Leonesa III Chair: Eugene Charniak

Error-Driven Analysis of Challenges in Coreference Resolution
Jonathan K. Kummerfeld, Dan Klein

Saturday 3:45–4:10 — Leonesa III

Coreference resolution metrics quantify errors but do not analyze them. Here, we consider an automated method of categorizing errors in the output of a coreference system into intuitive underlying error types. Using this tool, we first compare the error distributions across a large set of systems, then analyze common errors across the top ten systems, empirically characterizing the major unsolved challenges of the coreference resolution task.

Exploiting Zero Pronouns to Improve Chinese Coreference Resolution
Fang Kong, Hwee Tou Ng

Saturday 4:10–4:35 — Leonesa III

Coreference resolution plays a critical role in discourse analysis. This paper focuses on exploiting zero pronouns to improve Chinese coreference resolution. In particular, a simplified semantic role labeling framework is proposed to identify clauses and to detect zero pronouns effectively, and two effective methods (refining the syntactic parser and refining learning example generation) are employed to exploit zero pronouns for Chinese coreference resolution. Evaluation on the CoNLL-2012 shared task data set shows that zero pronouns can significantly improve Chinese coreference resolution.

Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves
Hannaneh Hajishirzi, Leila Zilles, Daniel S. Weld, Luke Zettlemoyer

Saturday 4:35–5:00 — Leonesa III

Many errors in coreference resolution come from semantic mismatches due to inadequate world knowledge. Errors in named-entity linking (NEL), on the other hand, are often caused by superficial modeling of entity context. This paper demonstrates that these two tasks are complementary. We introduce NECo, a new model for named entity linking and coreference resolution, which solves both problems jointly, reducing the errors made on each. NECo extends the Stanford deterministic coreference system by automatically linking mentions to Wikipedia and introducing new NEL-informed mention-merging sieves. Linking improves mention detection and enables new semantic attributes to be incorporated from Freebase, while coreference provides better context modeling by propagating named-entity links within mention clusters. Experiments show consistent improvements across a number of datasets and experimental conditions, including over 11% reduction in MUC coreference error and nearly 21% reduction in F1 NEL error on ACE 2004 newswire data.

Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data

Varada Kolhatkar, Heike Zinsmeister, Graeme Hirst
Saturday 5:00–5:25 — Leonesa III

Interpreting anaphoric shell nouns (ASNs) such as "this issue" and "this fact" is essential to understanding virtually any substantial natural language text. One obstacle in developing methods for automatically interpreting ASNs is the lack of annotated data. We tackle this challenge by exploiting cataphoric shell nouns (CSNs) whose construction makes them particularly easy to interpret (e.g., "the fact that X"). We propose an approach that uses automatically extracted antecedents of CSNs as training data to interpret ASNs. We achieve precisions in the range of 0.35 (baseline = 0.21) to 0.72 (baseline = 0.44), depending upon the shell noun.


Parsing Entire Discourses as Very Long Strings: Capturing Topic Continuity in Grounded Language Learning [TACL]

Minh-Thang Luong, Michael C. Frank, Mark Johnson
Saturday 5:25–5:50 — Leonesa III

Grounded language learning, the task of mapping from natural language to a representation of meaning, has attracted more and more interest in recent years. In most work on this topic, however, utterances in a conversation are treated independently and discourse structure information is largely ignored. In the context of language acquisition, this independence assumption discards cues that are important to the learner, e.g., the fact that consecutive utterances are likely to share the same referent (Frank et al., 2013). The current paper describes an approach to the problem of simultaneously modeling grounded language at the sentence and discourse levels. We combine ideas from parsing and grammar induction to produce a parser that can handle long input strings with thousands of tokens, creating parse trees that represent full discourses. By casting grounded language learning as a grammatical inference task, we use our parser to extend the work of Johnson et al. (2012), investigating the importance of discourse continuity in children’s language acquisition and its interaction with social cues. Our model boosts performance in a language acquisition task and yields good discourse segmentations compared with human annotators.


Morphology and Phonology
Eliza Anderson Amphitheater Chair: Anders Søgaard

Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

Longkai Zhang, Houfeng Wang, Xu Sun, Mairgup Mansur
Saturday 3:45–4:10 — Eliza Anderson Amphitheater

Nowadays supervised sequence labeling models can reach competitive performance on the task of Chinese word segmentation. However, the ability of these models is restricted by the availability of annotated data and the design of features. We propose a scalable semi-supervised feature engineering approach. In contrast to previous work using pre-defined task-specific features with fixed values, we dynamically extract representations of label distributions from both an in-domain corpus and an out-of-domain corpus. We update the representation values with a semi-supervised approach. Experiments on the benchmark datasets show that our approach achieves good results and reaches an f-score of 0.961. The feature engineering approach proposed here is a general iterative semi-supervised method and is not limited to the word segmentation task.

Efficient Higher-Order CRFs for Morphological Tagging
Thomas Mueller, Helmut Schmid, Hinrich Schütze

Saturday 4:10–4:35 — Eliza Anderson Amphitheater

Training higher-order conditional random fields is prohibitive for huge tag sets. We present an approximated conditional random field using coarse-to-fine decoding and early updating. We show that our implementation yields fast and accurate morphological taggers across six languages with different morphological properties and that across languages higher-order models give significant improvements over first-order models.

The Effects of Syntactic Features in Automatic Prediction of Morphology
Wolfgang Seeker, Jonas Kuhn

Saturday 4:35–5:00 — Eliza Anderson Amphitheater

Morphology and syntax interact considerably in many languages and language processing should pay attention to these interdependencies. We analyze the effect of syntactic features when used in automatic morphology prediction on four typologically different languages. We show that predicting morphology for languages with highly ambiguous word forms profits from taking the syntactic context of words into account and results in state-of-the-art models.

Adaptor Grammars for Learning Non-Concatenative Morphology
Jan A. Botha, Phil Blunsom

Saturday 5:00–5:25 — Eliza Anderson Amphitheater

This paper contributes an approach for expressing non-concatenative morphological phenomena, such as stem derivation in Semitic languages, in terms of a mildly context-sensitive grammar formalism. This offers a convenient level of modelling abstraction while remaining computationally tractable. The nonparametric Bayesian framework of adaptor grammars is extended to this richer grammar formalism to propose a probabilistic model that can learn word segmentation and morpheme lexicons, including ones with discontiguous strings as elements, from unannotated data. Our experiments on Hebrew and three variants of Arabic data find that the additional expressiveness to capture roots and templates as atomic units improves the quality of concatenative segmentation and stem identification. We obtain 74% accuracy in identifying triliteral Hebrew roots, while performing morphological segmentation with an F1-score of 78.1.


Poster Session A
Princessa Ballroom and Foyer

Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game

Anais Cadilhac, Nicholas Asher, Farah Benamara, Alex Lascarides
Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper describes a method that predicts which trades players execute during a win-lose game. Our method uses data collected from chat negotiations of the game "The Settlers of Catan". We use this corpus to learn classifiers that map each utterance to its speech act and to other acts that are pertinent to bargaining. And we develop a symbolic algorithm that, from the classifiers’ output, dynamically constructs a model of each player’s preferences as the conversation proceeds (for instance, the preference to receive a certain resource, or to accept a certain trade). This preference model is based on CP-nets (Boutilier et al., 2004), a logic of preferences for which algorithms for computing equilibrium strategies exist. We adapt those algorithms to predict whether a trade is executed as a result of the players’ negotiations. A comparison of our method against 4 baselines shows that tracking how preferences evolve through the dialogue and reasoning about equilibrium moves are both crucial to success.

Unsupervised Induction of Contingent Event Pairs from Film Scenes
Zhichao Hu, Elahe Rahimtoroghi, Larissa Munishkina, Reid Swanson, Marilyn A. Walker

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Human engagement in narrative is partially driven by reasoning about discourse relations between narrative events, and the expectations about what is likely to happen next that result from such reasoning. Researchers in NLP have tackled modeling such expectations from a range of perspectives, including treating it as the inference of the CONTINGENT discourse relation, or as a type of common-sense reasoning. Our approach is to model likelihood between events by drawing on several of these lines of previous work. We implement and evaluate different unsupervised methods for learning event pairs that are likely to be CONTINGENT on one another. We refine event pairs that we learn from a corpus of film scene descriptions utilizing Web search counts, and evaluate our results by collecting human judgments of contingency. Our results indicate that the use of Web search counts increases the average accuracy of our best method to 85.64% over a baseline of 50%, as compared to an average accuracy of 75.15% without web search.

Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction
Christian Hardmeier, Jörg Tiedemann, Joakim Nivre

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper addresses the task of predicting the correct French translations of third-person subject pronouns in English discourse, a problem that is relevant as a prerequisite for machine translation and that requires anaphora resolution. We present an approach based on neural networks that models anaphoric links as latent variables and show that its performance is competitive with that of a system with separate anaphora resolution while not requiring any coreference-annotated training data. This demonstrates that the information contained in parallel bitexts can successfully be used to acquire knowledge about pronominal anaphora in an unsupervised way.

Towards Situated Dialogue: Revisiting Referring Expression Generation
Rui Fang, Changsong Liu, Lanbo She, Joyce Y. Chai

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

In situated dialogue, humans and agents have mismatched capabilities of perceiving the shared environment. Their representations of the shared world are misaligned. Thus referring expression generation (REG) will need to take this discrepancy into consideration. To address this issue, we developed a hypergraph-based approach to account for group-based spatial relations and uncertainties in perceiving the environment. Our empirical results have shown that this approach outperforms a previous graph-based approach with an absolute gain of 9%. However, while these graph-based approaches perform effectively when the agent has perfect knowledge or perception of the environment (e.g., 84%), they perform rather poorly when the agent has imperfect perception of the environment (e.g., 45%). This big performance gap calls for new solutions to REG that can mediate a shared perceptual basis in situated dialogue.

Open-Domain Fine-Grained Class Extraction from Web Search Queries
Marius Pasca

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper introduces a method for extracting fine-grained class labels (countries with double taxation agreements with india) from Web search queries. The class labels are more numerous and more diverse than those produced by previous extraction methods. Also extracted are representative sets of instances (singapore, united kingdom) for the class labels.

Unsupervised Relation Extraction with General Domain Knowledge
Oier Lopez de Lacalle, Mirella Lapata

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

In this paper we present an unsupervised approach to relational information extraction. Our model partitions tuples representing an observed syntactic relationship between two named entities (e.g., “X was born in Y” and “X is from Y”) into clusters corresponding to underlying semantic relation types (e.g., BornIn, Located). Our approach incorporates general domain knowledge which we encode as First Order Logic rules and automatically combine with a topic model developed specifically for the relation extraction task. Evaluation results on the ACE 2007 English Relation Detection and Categorization (RDC) task show that our model outperforms competitive unsupervised approaches by a wide margin and is able to produce clusters shaped by both the data and the rules.

Efficient Collective Entity Linking with Stacking
Zhengyan He, Shujie Liu, Yang Song, Mu Li, Ming Zhou, Houfeng Wang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in a knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor $g_0$ with learning to rank as the base learner, to generate an initial ranking list of candidates. Second, the top $k$ candidates of related instances are searched for constructing expressive global coherence features. A global predictor $g_1$ is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and the context document, performance is further improved.

Joint Bootstrapping of Corpus Annotations and Entity Types
Hrushikesh Mohapatra, Siddhanth Jain, Soumen Chakrabarti

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Web search can be enhanced in powerful ways if token spans in Web text are annotated with disambiguated entities from large catalogs like Freebase. Entity annotators need to be trained on sample mention snippets. Wikipedia entities and annotated pages offer high-quality labeled data for training and evaluation. Unfortunately, Wikipedia features only one-ninth the number of entities as Freebase, and these are a highly biased sample of well-connected, frequently mentioned “head” entities. To bring hope to “tail” entities, we broaden our goal to a second task: assigning types to entities in Freebase but not Wikipedia. The two tasks are synergistic: knowing the types of unfamiliar entities helps disambiguate mentions, and words in mention contexts help assign types to entities. We present TMI, a bipartite graphical model for joint type-mention inference. TMI attempts no schema integration or entity resolution, but exploits the above-mentioned synergy. In experiments involving 780,000 people in Wikipedia, 2.3 million people in Freebase, 700 million Web pages, and over 20 professional editors, TMI shows considerable annotation accuracy improvement (e.g., 70%) compared to baselines (e.g., 46%), especially for “tail” and emerging entities. We also compare with Google’s recent annotations of the same corpus with Freebase entities, and report considerable improvements within the people domain.


Effectiveness and Efficiency of Open Relation Extraction
Filipe Mesquita, Jordan Schmidek, Denilson Barbosa

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

A large number of Open Relation Extraction approaches have been proposed recently, covering a wide range of NLP machinery, from "shallow" (e.g., part-of-speech tagging) to "deep" (e.g., semantic role labeling, SRL). A natural question then is what is the trade-off between NLP depth (and associated computational cost) and effectiveness. This paper presents a fair and objective experimental comparison of 8 state-of-the-art approaches over 5 different datasets, and sheds some light on the issue. The paper also describes a novel method, EXEMPLAR, which adapts ideas from SRL to less costly NLP machinery, resulting in substantial gains both in efficiency and effectiveness, over binary and n-ary relation extraction tasks.

Automatic Feature Engineering for Answer Selection and Extraction
Aliaksei Severyn, Alessandro Moschitti

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

Improving Web Search Ranking by Incorporating Structured Annotation of Queries
Xiao Ding, Zhicheng Dou, Bing Qin, Ting Liu, Ji-rong Wen

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Web users are increasingly looking for structured data, such as lyrics, jobs, or recipes, using unstructured queries on the web. However, retrieving relevant results from such data is a challenging problem due to the unstructured language of the web queries. In this paper, we propose a method to improve web search rankings by detecting Structured Annotation of queries based on top search results. In a structured annotation, the original query is split into different units that are associated with semantic attributes in the corresponding domain. We evaluate our techniques using real-world queries and achieve significant improvement.

Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Dhouha Bouamor, Adrian Popescu, Nasredine Semmar, Pierre Zweigenbaum

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods have been proposed as an alternative, but they often rely on resources available in a limited number of languages, and their performance is still far behind the quality of manual translations. We introduce a novel approach to the creation of specific-domain bilingual lexicons that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state-of-the-art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.

Document Summarization via Guided Sentence Compression
Chen Li, Fei Liu, Fuliang Weng, Yang Liu

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ’sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary-guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of token-, syntax-, and document-level features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.

Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
Hendra Setiawan, Bowen Zhou, Bing Xiang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Reordering poses one of the greatest challenges in Statistical Machine Translation research as the key contextual information may well be beyond the confines of translation units. We present the “Anchor Graph” (AG) model where we use a graph structure to model global contextual information that is crucial in reordering. The key ingredient of our AG model is the edges that model the relationship between the reordering around a set of selected translation units, which we refer to as anchors. As the edges link anchors that may span multiple translation units at decoding time, our AG model effectively encodes global contextual information that is previously absent. We integrate our proposed model into a state-of-the-art translation system and demonstrate the efficacy of our proposal in a large-scale Chinese-to-English translation task.

Source-Side Classifier Preordering for Machine Translation
Uri Lerner, Slav Petrov

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We present a simple and novel classifier-based preordering approach. Unlike existing preordering models, we train feature-rich discriminative classifiers that directly predict the target-side word order. Our approach combines the strengths of lexical reordering and syntactic preordering models by performing long-distance reorderings using the structure of the parse tree, while utilizing a discriminative model with a rich set of features, including lexical features. We present extensive experiments on 22 language pairs, including preordering into English from 7 other languages. We obtain improvements of up to 1.4 BLEU on language pairs in the WMT 2010 shared task. For languages from different families the improvements often exceed 2 BLEU. Many of these gains are also significant in human evaluations.

Improving Pivot-Based Statistical Machine Translation Using Random Walk
Xiaoning Zhu, Zhongjun He, Hua Wu, Haifeng Wang, Conghui Zhu, Tiejun Zhao

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper proposes a novel approach that utilizes a machine learning method to improve pivot-based statistical machine translation (SMT). For language pairs with few bilingual data, a possible solution is pivot-based SMT, which uses another language as a "bridge" to generate source-target translations. However, one of its weaknesses is that some useful source-target translations cannot be generated if the corresponding source phrase and target phrase connect to different pivot phrases. To alleviate the problem, we utilize Markov random walks to connect possible translation phrases between the source and target languages. Experimental results on European Parliament data, spoken language data and web data show that our method leads to significant improvements on all the tasks over the baseline system.

Improving Alignment of System Combination by Using Multi-objective Optimization
Tian Xia, Zongcheng Ji, Shaodan Zhai, Yidong Chen, Qun Liu, Shaojun Wang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper proposes a multi-objective optimization framework which supports heterogeneous information sources to improve alignment in machine translation system combination techniques. In this area, most techniques utilize confusion networks (CNs) as their central data structure to compactly represent an exponential number of potential hypotheses, and because better hypothesis alignment may benefit the construction of better quality confusion networks, it is natural to add more useful information to improve alignment results. However, this information may be heterogeneous, so the widely-used Viterbi algorithm for searching the best alignment may not apply here. In the multi-objective optimization framework, each information source is viewed as an independent objective, and a new goal of improving all objectives can be searched by mature algorithms. The solutions from this framework, termed Pareto optimal solutions, are then combined to construct confusion networks. Experiments on two Chinese-to-English translation datasets show significant improvements, 0.97 and 1.06 BLEU points over a strong Indirect Hidden Markov Model-based (IHMM) system, and 4.75 and 3.53 points over the best single machine translation systems.

Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

Martin Cmejrek, Haitao Mi, Bowen Zhou
Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Machine translation benefits from system combination. We propose flexible interaction of hypergraphs as a novel technique combining different translation models within one decoder. We introduce features controlling the interactions between the two systems and explore three interaction schemes of hiero and forest-to-string models: specification, generalization, and interchange. The experiments are carried out on large training data with strong baselines utilizing rich sets of dense and sparse features. All three schemes significantly improve results of any single system on four test sets. We find that specification, a more constrained scheme that almost entirely uses forest-to-string rules but optionally uses hiero rules for shorter spans, comes out as the strongest, yielding improvements of up to 0.9 (Ter-Bleu)/2 points. We also provide a detailed experimental and qualitative analysis of the results.

Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation
Zhongqiang Huang, Jacob Devlin, Rabih Zbib
Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper describes a factored approach to incorporating soft source syntactic constraints into a hierarchical phrase-based translation system. In contrast to traditional approaches that directly introduce syntactic constraints to translation rules by explicitly decorating them with syntactic annotations, which often exacerbate the data sparsity problem and cause other problems, our approach keeps translation rules intact and factorizes the use of syntactic constraints through two separate models: 1) a syntax mismatch model that associates each nonterminal of a translation rule with a distribution of tags that is used to measure the degree of syntactic compatibility of the translation rule on source spans; 2) a syntax-based reordering model that predicts whether a pair of sibling constituents in the constituent parse tree of the source sentence should be reordered or not when translated to the target language. The features produced by both models are used as soft constraints to guide the translation process. Experiments on Chinese-English translation show that the proposed approach significantly improves a strong string-to-dependency translation system on multiple evaluation sets.

Recursive Autoencoders for ITG-Based Translation
Peng Li, Yang Liu, Maosong Sun

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

While inversion transduction grammar (ITG) is well suited for modeling ordering shifts between languages, how to make applying the two reordering rules (i.e., straight and inverted) dependent on actual blocks being merged remains a challenge. Unlike previous work that only uses boundary words, we propose to use recursive autoencoders to make full use of the entire merging blocks instead. The recursive autoencoders are capable of generating vector space representations for variable-sized phrases, which enable predicting orders to exploit syntactic and semantic information from a neural language modeling perspective. Experiments on the NIST 2008 dataset show that our system significantly improves over the MaxEnt classifier by 1.07 BLEU points.
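To make the core building block concrete, here is a minimal numpy sketch (ours, with arbitrary dimensions and random weights) of a single recursive-autoencoder step: two child block vectors are composed into a parent vector, and a reconstruction loss measures how well the parent preserves its children. The ITG reordering classifier trained on top of such vectors is not shown.

import numpy as np

rng = np.random.default_rng(0)
d = 4                                            # embedding size (arbitrary)
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))   # composition weights
W_dec = rng.normal(scale=0.1, size=(2 * d, d))   # reconstruction weights

def compose(c1, c2):
    """Merge two child vectors into one parent representation."""
    return np.tanh(W_enc @ np.concatenate([c1, c2]))

def reconstruction_loss(c1, c2):
    """How well the parent vector reconstructs its two children."""
    parent = compose(c1, c2)
    recon = np.tanh(W_dec @ parent)
    return float(np.sum((recon - np.concatenate([c1, c2])) ** 2))

left, right = rng.normal(size=d), rng.normal(size=d)
print(compose(left, right))            # parent vector for the merged block
print(reconstruction_loss(left, right))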

Automatically Classifying Edit Categories in Wikipedia Revisions
Johannes Daxenberger, Iryna Gurevych

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

In this paper, we analyze a novel set of features for the task of automatic edit category classification. Edit category classification assigns categories such as spelling error correction, paraphrase or vandalism to edits in a document. Our features are based on differences between two versions of a document including meta data, textual and language properties, and markup. In a supervised machine learning experiment, we achieve a micro-averaged F1 score of .62 on a corpus of edits from the English Wikipedia. In this corpus, each edit has been multi-labeled according to a 21-category taxonomy. A model trained on the same data achieves state-of-the-art performance on the related task of fluency edit classification. We apply pattern mining to automatically labeled edits in the revision histories of different Wikipedia articles. Our results suggest that high-quality articles show a higher degree of homogeneity with respect to their collaboration patterns as compared to random articles.

Semi-Markov Phrase-Based Monolingual Alignment
Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, Peter Clark

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase-based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model’s alignment score approaches the state of the art.

A Constrained Latent Variable Model for Coreference Resolution
Kai-Wei Chang, Rajhans Samdani, Dan Roth

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Coreference resolution is a well known clustering task in Natural Language Processing. In this paper, we describe the Latent Left Linking model (L3M), a novel, principled, and linguistically motivated latent structured prediction approach to coreference resolution. We show that L3M admits efficient inference and can be augmented with knowledge-based constraints; we also present a fast stochastic gradient based learning method. Experiments on ACE and Ontonotes data show that L3M and its constrained version, CL3M, are more accurate than several state-of-the-art approaches as well as some structured prediction models proposed in the literature.

Centering Similarity Measures to Reduce Hubs
Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, Marco Saerens, Kenji Fukumizu

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

The performance of nearest neighbor methods is degraded by the presence of hubs, i.e., objects in the dataset that are similar to many other objects. In this paper, we show that the classical method of centering, the transformation that shifts the origin of the space to the data centroid, provides an effective way to reduce hubs. We show analytically why hubs emerge and why they are suppressed by centering, under a simple probabilistic model of data. To further reduce hubs, we also move the origin more aggressively towards hubs, through weighted centering. Our experimental results show that (weighted) centering is effective for natural language data; it improves the performance of the k-nearest neighbor classifiers considerably in word sense disambiguation and document classification tasks.
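The centering transformation itself is a one-liner; the following numpy sketch (with hypothetical random data) illustrates the idea along with a simple hubness diagnostic. The weighted variant and the paper’s analysis are not reproduced here.

import numpy as np

def centered_cosine_sim(X):
    """Shift the origin to the data centroid, then compute pairwise
    cosine similarities on the centered vectors."""
    Xc = X - X.mean(axis=0)                      # the centering step
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    Xn = Xc / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                   # 100 objects, 20 features
S = centered_cosine_sim(X)
# Hubness diagnostic: how often each object appears in others' top-10 lists.
k = 10
topk = np.argsort(-S, axis=1)[:, 1:k + 1]        # drop self-similarity
counts = np.bincount(topk.ravel(), minlength=len(X))
print("max k-occurrence:", counts.max())         # lower means fewer hubs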

Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
Raphaël Bailly, Xavier Carreras, Franco M. Luque, Ariadna Quattoni

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We derive a spectral method for unsupervised learning of Weighted Context Free Grammars. We frame WCFG induction as finding a Hankel matrix that has low rank and is linearly constrained to represent a function computed by inside-outside recursions. The proposed algorithm picks the grammar that agrees with a sample and is the simplest with respect to the nuclear norm of the Hankel matrix.

Identifying Phrasal Verbs Using Many Bilingual Corpora
Karl Pichotta, John DeNero

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

Deep Learning for Chinese Word Segmentation and POS Tagging
Xiaoqing Zheng, Hanyang Chen, Tianyu Xu

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover features relevant to the tasks. We leverage large-scale unlabeled data to improve the internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieve close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to the maximum-likelihood method, to speed up the training process and make the learning algorithm easier to implement.
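As a schematic of the kind of architecture the abstract describes, here is a numpy sketch (ours; the vocabulary, window size, and dimensions are hypothetical) of a window-based character tagger that scores BMES segmentation tags from character embeddings; the training procedures discussed in the paper are omitted.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_tags = 5000, 4      # hypothetical char vocab; B/M/E/S tags
emb_dim, window, hidden = 50, 5, 100

E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))      # char embeddings
W1 = rng.normal(scale=0.1, size=(hidden, window * emb_dim))
W2 = rng.normal(scale=0.1, size=(num_tags, hidden))

def tag_scores(char_ids):
    """Score the segmentation tags for each character from a fixed-size
    window of character embeddings centered on it."""
    pad = window // 2
    padded = [0] * pad + list(char_ids) + [0] * pad        # id 0 = padding
    scores = []
    for i in range(len(char_ids)):
        ctx = np.concatenate([E[c] for c in padded[i:i + window]])
        hidden_act = np.tanh(W1 @ ctx)
        scores.append(W2 @ hidden_act)
    return np.stack(scores)         # shape: (sentence_length, num_tags)

print(tag_scores([17, 256, 42]).shape)   # -> (3, 4)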

Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning
Xipeng Qiu, Jiayi Zhao, Xuanjing Huang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Chinese word segmentation and part-of-speech tagging (S&T) are fundamental steps for more advanced Chinese language processing tasks. Recently, exploiting heterogeneous annotation corpora for Chinese S&T has attracted more and more research interest. In this paper, we propose a unified model for Chinese S&T with heterogeneous annotation corpora. We first automatically construct a loose and uncertain mapping between two representative heterogeneous corpora, Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD). Then we regard Chinese S&T with heterogeneous corpora as two “related” tasks and train our model on the two heterogeneous corpora simultaneously. Experiments show that our method can boost the performance on both of the heterogeneous corpora by using the shared information, and achieves significant improvements over the state-of-the-art methods.

The Topology of Semantic Knowledge
Jimmy Dubuisson, Jean-Pierre Eckmann, Christian Scheible, Hinrich Schütze

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Studies of the graph of dictionary definitions (DD) (Picard et al., 2009; Levary et al., 2012) have revealed strong semantic coherence of local topological structures. The techniques used in these papers are simple and the main results are found by understanding the structure of cycles in the directed graph (where words point to definitions). Based on our earlier work (Levary et al., 2012), we study a different class of word definitions, namely those of the Free Association (FA) dataset (Nelson et al., 2004). These are responses by subjects to a cue word, which are then summarized by a directed, free association graph. We find that the structure of this network is quite different from both the Wordnet and the dictionary networks. This difference can be explained by the very nature of free association as compared to the more “logical” construction of dictionaries. It thus sheds some (quantitative) light on the psychology of free association. In NLP, semantic groups or clusters are interesting for various applications such as word sense disambiguation. The FA graph is tighter than the DD graph, because of the large number of triangles. This also makes drift of meaning quite measurable so that FA graphs provide a quantitative measure of the semantic coherence of small groups of words.

Unsupervised Induction of Cross-Lingual Semantic Relations
Mike Lewis, Mark Steedman

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Creating a language-independent meaning representation would benefit many cross-lingual NLP tasks. We introduce the first unsupervised approach to this problem, learning clusters of semantically equivalent English and French relations between referring expressions, based on their named-entity arguments in large monolingual corpora. The clusters can be used as language-independent semantic relations, by mapping clustered expressions in different languages onto the same relation. Our approach needs no parallel text for training, but outperforms a baseline that uses machine translation on a cross-lingual question answering task. We also show how to use the semantics to improve the accuracy of machine translation, by using it in a simple reranker.

Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

Julien Kloetzer, Stijn De Saeger, Kentaro Torisawa, Chikara Hashimoto, Jong-Hoon Oh, Motoki Sano, Kiyonori Ohtake

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

In this paper we propose a two-stage method to acquire contradiction relations between typed lexico-syntactic patterns such as X_drug prevents Y_disease and Y_disease caused by X_drug. In the first stage, we train an SVM classifier to detect contradiction pattern pairs in a large web archive by exploiting the excitation polarity (Hashimoto et al., 2012) of the patterns. In the second stage, we enlarge the first stage classifier’s training data with new contradiction pairs obtained by combining the output of the first stage’s classifier and that of an entailment classifier. In this way we acquired 750,000 typed Japanese contradiction pattern pairs with an estimated precision of 80%. We plan to release this resource to the NLP community.

Sarcasm as Contrast between a Positive Sentiment and Negative Situation
Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, Ruihong Huang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

A common form of sarcasm on Twitter consists of a positive sentiment contrasted with a negative situation. For example, many sarcastic tweets include a positive sentiment, such as "love” or "enjoy”, followed by an expression that describes an undesirable activity or state (e.g., "taking exams” or "being ignored”). We have developed a sarcasm recognizer to identify this type of sarcasm in tweets. We present a novel bootstrapping algorithm that automatically learns lists of positive sentiment phrases and negative situation phrases from sarcastic tweets. We show that identifying contrasting contexts using the phrases learned through bootstrapping yields improved recall for sarcasm recognition.

Collective Personal Profile Summarization with Social Networks
Zhongqing Wang, Shoushan Li, Fang Kong, Guodong Zhou

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization when confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that people with similar academic, business or social connections (e.g., co-major, co-university, and co-corporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.

Optimized Event Storyline Generation based on Mixture-Event-Aspect Model
Lifu Huang, Lian’en Huang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Recently, much research has focused on event storyline generation, which aims to produce a concise, global and temporal event summary from a collection of articles. Generally, each event contains multiple sub-events and the storyline should be composed of the component summaries of all the sub-events. However, different sub-events have different part-whole relationships with the major event, which is important for corresponding to users’ interests but seldom considered in previous work. To distinguish different types of sub-events, we propose a mixture-event-aspect model which models different sub-events into local and global aspects. Combining these local/global aspects with summarization requirements, we utilize an optimization method to generate the component summaries along the timeline. We develop experimental systems on 6 distinctively different datasets. Evaluation and comparison results indicate the effectiveness of our proposed method.

Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach
Tengfei Ma, Hiroshi Nakagawa

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Document summarization is an important task in the area of natural language processing, which aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is manually defined. However, how to find the proper summary length is an open problem, and keeping all summaries restricted to the same length is not always a good choice. It is obviously improper to generate summaries of the same length for two clusters of documents which contain quite different quantities of information. In this paper, we propose a Bayesian nonparametric model for multi-document summarization in order to automatically determine the proper lengths of summaries. Assuming that an original document can be reconstructed from its summary, we describe the "reconstruction" by a Bayesian framework which selects sentences to form a good summary. Experimental results on DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the rationality of the length determination.

A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task
Maria Liakata, Simon Dobnik, Shyamasree Saha, Colin Batchelor, Dietrich Rebholz-Schuhmann

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We present a method which exploits automatically generated scientific discourse annotations to create a content model for the summarisation of scientific articles. Full papers are first automatically annotated using the CoreSC scheme, which captures 11 content-based concepts such as Hypothesis, Result, Conclusion etc. at the sentence level. A content model which follows the sequence of CoreSC categories observed in abstracts is used to provide the skeleton of the summary, making a distinction between dependent and independent categories. Summary creation is also guided by the distribution of CoreSC categories found in the full articles, in order to adequately represent the article content. Finally, we demonstrate the usefulness of the summaries by evaluating them in a complex question answering task. Results are very encouraging, as summaries of papers from automatically obtained CoreSCs enable experts to answer 66% of complex content-related questions designed on the basis of paper abstracts. The questions were answered with a precision of 75%, where the upper bound for human summaries (abstracts) was 95%.

Optimal Incremental Parsing via Best-First Dynamic Programming
Kai Zhao, James Cross, Liang Huang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We present the first provably optimal polynomial-time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probabilistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.

Exploiting Language Models for Visual Recognition
Dieu-Thu Le, Jasper Uijlings, Raffaella Bernardi

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

The problem of learning language models from large text corpora has been widely studied within the computational linguistics community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models is compatible with the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.

Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
Yiping Jin, Min-Yen Kan, Jun-Ping Ng, Xiangnan He

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper presents DefMiner, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions. DefMiner achieves 85% F1 on a Wikipedia benchmark corpus, significantly improving the previous state-of-the-art by 8%. We exploit DefMiner to process the ACL Anthology Reference Corpus (ARC) – a large, real-world digital library of scientific articles in computational linguistics. The resulting automatically-acquired glossary represents the terminology defined over several thousand individual research articles. We highlight several interesting observations: more definitions have been introduced in conference and workshop papers over the years, and multiword terms account for slightly less than half of all terms. Obtaining a list of popular defined terms in a corpus of computational linguistics papers, we find that concepts can often be categorized into one of three categories: resources, methodologies and evaluation metrics.

Joint Learning and Inference for Grammatical Error Correction
Alla Rozovskaya, Dan Roth

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

State-of-the-art systems for grammatical error correction are based on a collection of independently-trained models for specific errors. Such models ignore linguistic interactions at the sentence level and thus do poorly on mistakes that involve grammatical dependencies among several words. In this paper, we identify linguistic structures with interacting grammatical properties and propose to address such dependencies via joint inference and joint learning. We show that it is possible to identify interactions well enough to facilitate a joint approach and, consequently, that joint methods correct incoherent predictions that independently-trained classifiers tend to produce. Furthermore, because the joint learning model considers interacting phenomena during training, it is able to identify mistakes that require making multiple changes simultaneously and that standard approaches miss. Overall, our model significantly outperforms the Illinois system that placed first in the CoNLL-2013 shared task on grammatical error correction.

With Blinkers on: Robust Prediction of Eye Movements across Readers
Franz Matthies, Anders Søgaard

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Nilsson and Nivre (2009) introduced a tree-based model of persons' eye movements in reading. The individual variation between readers reportedly made application across readers impossible. While a tree-based model seems plausible for eye movements, we show that competitive results can be obtained with a linear CRF model. Increasing the inductive bias also makes learning across readers possible. In fact, we observe next-to-no performance drop when evaluating models trained on gaze records of multiple readers on new readers.

Using Paraphrases and Lexical Semantics to Improve the Accuracy and the Robustness of Supervised Models in Situated Dialogue Systems
Claire Gardent, Lina M. Rojas Barahona

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper explores to what extent lemmatisation, lexical resources, distributional semantics and paraphrases can increase the accuracy of supervised models for dialogue management. The results suggest that each of these factors can help improve performance but that the impact will vary depending on their combination and on the evaluation mode.

Cascading Collective Classification for Bridging Anaphora Recognition using a Rich Linguistic Feature Set
Yufang Hou, Katja Markert, Michael Strube

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Recognizing bridging anaphora is difficult due to the wide variation within the phenomenon, the resulting lack of easily identifiable surface markers and their relative rarity. We develop linguistically motivated discourse structure, lexico-semantic and genericity detection features and integrate these into a cascaded minority preference algorithm that models bridging recognition as a subtask of learning fine-grained information status (IS). We substantially improve bridging recognition without impairing performance on other IS classes.

A Synchronous Context Free Grammar for Time Normalization
Steven Bethard

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We present an approach to time normalization (e.g. "the day before yesterday" => 2013-04-12) based on a synchronous context free grammar. Synchronous rules map the source language to formally defined operators for manipulating times (FindEnclosed, StartAtEndOf, etc.). Time expressions are then parsed using an extended CYK+ algorithm, and converted to a normalized form by applying the operators recursively. For evaluation, a small set of synchronous rules for English time expressions was developed. Our model outperforms HeidelTime, the best time normalization system in TempEval 2013, on four different time normalization corpora.
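
The abstract's own example can be reproduced with a toy version of such compositional operators. The two operators below are simplified illustrations (their names and the anchoring convention are assumptions, not the paper's actual operator inventory):

    from datetime import date, timedelta

    def last(anchor, days):            # e.g. "yesterday" = last(anchor, 1)
        return anchor - timedelta(days=days)

    def before(t, days=1):             # "the day before X"
        return t - timedelta(days=days)

    # "the day before yesterday", evaluated against a document anchor date:
    anchor = date(2013, 4, 14)
    print(before(last(anchor, 1)))     # -> 2013-04-12, as in the abstract

A grammar-driven system would produce such operator compositions from the parse and evaluate them recursively, as the abstract describes.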

Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
Laura Chiticariu, Yunyao Li, Frederick R. Reiss

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

The rise of “Big Data” analytics over unstructured text has led to renewed interest in information extraction (IE). We surveyed the landscape of IE technologies and identified a major disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as dead-end technology in academia. We believe the disconnect stems from the way in which the two communities measure the benefits and costs of IE, as well as academia's perception that rule-based IE is devoid of research challenges. We make a case for the importance of rule-based IE to industry practitioners. We then lay out a research agenda for advancing the state-of-the-art in rule-based IE systems, which we believe has the potential to bridge the gap between academic research and industry practice.

Improving Learning and Inference in a Large Knowledge-Base using Latent Syntactic Cues
Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, Tom Mitchell

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Automatically constructed knowledge bases (KBs) are often incomplete and there is a genuine need to improve their coverage. The Path Ranking Algorithm (PRA) is a recently proposed method which aims to improve KB coverage by performing inference directly over the KB graph. For the first time, we demonstrate that adding edges labeled with latent features mined from a large dependency-parsed corpus of 500 million Web documents can significantly outperform previous PRA-based approaches on the KB inference task. We present extensive experimental results validating this finding. The resources presented in this paper are publicly available.

What is Hidden among Translation Rules
Libin Shen, Bowen Zhou

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Most machine translation systems rely on a large set of translation rules. These rules are treated as discrete and independent events. In this work-in-progress short paper, we propose a novel method to model rules as the observed generation output of a compact hidden model, which leads to better generalization capability. We present a preliminary generative model to test this idea. Experimental results show about one point improvement on TER-BLEU over a strong baseline in Chinese-to-English translation.

Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
Rui Wang, Masao Utiyama, Isao Goto, Eiichiro Sumita, Hai Zhao, Bao-Liang Lu

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Neural network language models, or continuous-space language models (CSLMs), have been shown to improve the performance of statistical machine translation (SMT) when they are used for reranking n-best translations. However, CSLMs have not been used in the first-pass decoding of SMT, because using CSLMs in decoding takes a lot of time. In contrast, we propose a method for converting CSLMs into back-off n-gram language models (BNLMs) so that we can use converted CSLMs in decoding. We show that they outperform the original BNLMs and are comparable with the traditional use of CSLMs in reranking.

A Corpus Level MIRA Tuning Strategy for Machine Translation
Ming Tan, Tian Xia, Shaojun Wang, Bowen Zhou

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

MIRA-based tuning methods have been widely used in statistical machine translation (SMT) systems with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentence-level BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler to implement. Experiments on Chinese-to-English translation show its effectiveness over two state-of-the-art MIRA implementations.

Word Level Language Identification in Online Multilingual Communication
Dong Nguyen, A. Seza Dogruoz

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Multilingual speakers switch between languages in online and spoken communication. Analyses of large-scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.
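
A minimal sketch of this two-step scheme, assuming per-language character-bigram Counters, total counts, and word lists (all placeholders):

    import math
    from collections import Counter

    def char_ngram_logprob(word, counts, total, n=2):
        w = "^" + word + "$"
        return sum(math.log((counts[w[i:i + n]] + 1) / (total + 1))
                   for i in range(len(w) - n + 1))

    def tag_words(words, lm_counts, lm_totals, dictionaries):
        # step 1: independent decision per word (dictionary first, then LM)
        labels = []
        for w in words:
            hits = [lang for lang, d in dictionaries.items() if w.lower() in d]
            if len(hits) == 1:
                labels.append(hits[0])
            else:
                labels.append(max(lm_counts, key=lambda lang: char_ngram_logprob(
                    w.lower(), lm_counts[lang], lm_totals[lang])))
        # step 2: incorporate context by majority vote over a small window
        return [Counter(labels[max(0, i - 1):i + 2]).most_common(1)[0][0]
                for i in range(len(labels))]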

Microblog Entity Linking by Leveraging Extra Posts
Yuhang Guo, Bing Qin, Ting Liu, Sheng Li

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Linking a name mention in a microblog post to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblogs. Entity linking in long text has been well studied in previous works. However, little work has focused on short texts like microblog posts. Microblog posts are short and noisy, so previous methods can extract only a few features from the post context. In this paper we propose to use multiple posts for the microblog entity linking task. Experimental results show that our methods significantly improve the linking accuracy over traditional methods.

Automatic Domain Partitioning for Multi-Domain Learning
Di Wang, Chenyan Xiong, William Yang Wang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Multi-Domain Learning (MDL) assumes that the domain labels in the dataset are known. However, when there are multiple metadata attributes available, it is not always straightforward to select a single best attribute for domain partitioning, and it is possible that combining more than one metadata attribute (including continuous attributes) can lead to better MDL performance. In this work, we propose an automatic domain partitioning approach that aims at providing better domain identities for MDL. We use a supervised clustering approach that learns the domain distance between data instances, and then clusters the data into better domains for MDL. Our experiments on real multi-domain datasets show that using our automatically generated domain partition improves over popular MDL methods.

Decipherment with a Million Random Restarts
Taylor Berg-Kirkpatrick, Dan Klein

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

This paper investigates the utility and effect of running numerous random restarts when using EM to attack decipherment problems. We find that simple decipherment models are able to crack homophonic substitution ciphers with high accuracy if a large number of random restarts are used, but almost completely fail with only a few random restarts. For particularly difficult homophonic ciphers, we find that big gains in accuracy are to be had by running upwards of 100K random restarts, which we accomplish efficiently using a GPU-based parallel implementation. We run a series of experiments using millions of random restarts in order to investigate other empirical properties of decipherment problems, including the famously uncracked Zodiac 340.
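
The core recipe reduces to a generic skeleton: run EM from many random initializations and keep the solution with the best likelihood. A sketch, with em standing in for a decipherment model's EM routine:

    import random

    def run_with_restarts(em, init, restarts=100000, seed=0):
        rng = random.Random(seed)
        best_ll, best_model = float("-inf"), None
        for _ in range(restarts):
            model = init(rng)          # a random starting point
            model, ll = em(model)      # run EM to convergence from it
            if ll > best_ll:
                best_ll, best_model = ll, model
        return best_model, best_ll

The paper's point is empirical: for hard homophonic ciphers the maximum over restarts keeps improving well past the handful of restarts commonly used, which is why an efficient parallel implementation matters.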

Russian Stress Prediction using Maximum Entropy Ranking
Keith Hall, Richard Sproat

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We explore a model of stress prediction in Russian using a combination of local contextual features and linguistically motivated features associated with the word's stem and suffix. We frame this as a ranking problem, where the objective is to rank the pronunciation with the correct stress above those with incorrect stress. We train our models using a simple Maximum Entropy ranking framework allowing for efficient prediction. An empirical evaluation shows that a model combining the local contextual features and the linguistically motivated non-local features performs best in predicting both primary and secondary stress.
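
A bare-bones maximum-entropy ranker of this general kind fits in a few lines; feature extraction for stems and suffixes is left abstract, and this is an illustration of the framework rather than the authors' implementation:

    import math
    from collections import defaultdict

    def scores(w, candidates):
        return [sum(w[f] * v for f, v in feats.items()) for feats in candidates]

    def train(data, epochs=10, lr=0.1):
        # data: list of (list of candidate feature dicts, gold candidate index)
        w = defaultdict(float)
        for _ in range(epochs):
            for candidates, gold in data:
                s = scores(w, candidates)
                z = sum(math.exp(x) for x in s)
                for j, feats in enumerate(candidates):
                    resid = (1.0 if j == gold else 0.0) - math.exp(s[j]) / z
                    for f, v in feats.items():
                        w[f] += lr * resid * v
        return w

    def predict(w, candidates):        # highest-scoring stress placement wins
        s = scores(w, candidates)
        return max(range(len(candidates)), key=s.__getitem__)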

Scaling to Large³ Data: An Efficient and Effective Method to Compute Distributional Thesauri
Martin Riedl, Chris Biemann

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

We introduce a new highly scalable approach to computing Distributional Thesauri (DTs). By employing pruning techniques and a distributed framework, we make the computation for very large corpora feasible on comparably small computational resources. We demonstrate this by releasing a DT for the whole vocabulary of Google Books syntactic n-grams. Evaluating against lexical resources using two measures, we show that our approach produces higher quality DTs than previous approaches, and is thus preferable in terms of speed and quality for large corpora.

Discriminative Improvements to Distributional Sentence Similarity
Yangfeng Ji, Jacob Eisenstein

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.
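
One plausible reading of a KL-based discriminative weighting like TF-KLD, with the smoothing and the exact event definition as assumptions here: weight each feature by the KL divergence between its Bernoulli occurrence distribution in paraphrase pairs and in non-paraphrase pairs, then multiply raw term frequencies by these weights.

    import math

    def kl_bernoulli(p, q):
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    def tfkld_weights(pos_pairs, neg_pairs, vocab, eps=0.05):
        # *_pairs: for each sentence pair, the set of features it contains
        weights = {}
        for f in vocab:
            p = (sum(f in s for s in pos_pairs) + eps) / (len(pos_pairs) + 2 * eps)
            q = (sum(f in s for s in neg_pairs) + eps) / (len(neg_pairs) + 2 * eps)
            weights[f] = kl_bernoulli(p, q)
        return weights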

Is Twitter A Better Corpus for Measuring Sentiment Similarity?
Shi Feng, Le Zhang, Binyang Li, Daling Wang, Ge Yu, Kam-Fai Wong

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Extensive experiments have validated the effectiveness of corpus-based methods for classifying a word's sentiment polarity. However, no work has been done to compare different corpora in the polarity classification task. Nowadays, Twitter has aggregated a huge amount of data that is full of people's sentiments. In this paper, we empirically evaluate the performance of different corpora in sentiment similarity measurement, which is the fundamental task for word polarity classification. Experimental results show that the Twitter data can achieve a much better performance than the Google, Web1T and Wikipedia based methods.

Implicit Feature Detection via a Constrained Topic Model and SVM
Wei Wang, Hua Xu, Xiaoqiu Huang

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Implicit feature detection, also known as implicit feature identification, is an essential aspect of feature-specific opinion mining, but previous works have often ignored it. We think that, based on the explicit sentences, several Support Vector Machine (SVM) classifiers can be established to do this task. Nevertheless, we believe it is possible to do better by using a constrained topic model instead of traditional attribute selection methods. Experiments show that this method outperforms the traditional attribute selection methods by a large margin and the detection task can be completed better.

Online Learning for Inexact Hypergraph Search
Hao Zhang, Liang Huang, Kai Zhao, Ryan McDonald

Saturday 6:00–7:30 — Princessa Ballroom and Foyer

Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on the WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.

Saturday, October 19, 2013: Main Conference

Poster Session B

Princessa Ballroom and Foyer

Predicting the Presence of Discourse Connectives
Gary Patterson, Andrew Kehler

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a classification model that predicts the presence or omission of a lexical connective between two clauses, based upon linguistic features of the clauses and the type of discourse relation holding between them. The model is trained on a set of high frequency relations extracted from the Penn Discourse Treebank and achieves an accuracy of 86.6%. Analysis of the results reveals that the most informative features relate to the discourse dependencies between sequences of coherence relations in the text. We also present results of an experiment that provides insight into the nature and difficulty of the task.

Japanese Zero Reference Resolution Considering Exophora and Author/Reader Mentions
Masatsugu Hangyo, Daisuke Kawahara, Sadao Kurohashi

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

In Japanese, zero references often occur, and many of them are categorized as zero exophora, in which a referent is not mentioned in the document. However, previous studies have focused only on zero endophora, in which a referent explicitly appears. We present a zero reference resolution model that considers zero exophora and the author/reader of a document. To deal with zero exophora, our model adds pseudo entities corresponding to zero exophora to the candidate referents of zero pronouns. In addition, we automatically detect mentions that refer to the author and reader of a document by using lexico-syntactic patterns. We represent their particular behavior in a discourse as a feature vector of a machine learning model. We report experimental results on Japanese zero reference resolution and demonstrate the effectiveness of our model for not only zero exophora but also zero endophora.

A Dataset for Research on Short-Text Conversations
Hao Wang, Zhengdong Lu, Hang Li, Enhong Chen

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast number of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversations based on real-world instances from Sina Weibo (a popular Chinese microblog service), which will soon be released to the public. This dataset provides a rich collection of instances for research on finding natural and relevant short responses to a given short text, and is useful for both training and testing of conversation models. The dataset consists of naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic
Qi Zhang, Jin Qian, Huan Chen, Jihua Kang, Xuanjing Huang

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Explanatory sentences are employed to clarify reasons, details, facts, and so on. High quality online product reviews usually include not only positive or negative opinions, but also a variety of explanations of why these opinions were given. These explanations can help readers get easily comprehensible information about the discussed products and aspects. Moreover, explanatory relations can also benefit sentiment analysis applications. In this work, we focus on the task of identifying subjective text segments and extracting their corresponding explanations from product reviews at the discourse level. We propose a novel joint extraction method using first-order logic to model rich linguistic features and long distance constraints. Experimental results demonstrate the effectiveness of the proposed method.

Building Event Threads out of Multiple News Articles
Xavier Tannier, Véronique Moriceau

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present an approach for building multidocument event threads from a large corpus of newswire articles. An event thread is basically a succession of events belonging to the same story. It helps the reader to contextualize the information contained in a single article, by navigating backward or forward in the thread from this article. A specific effort is also made on the detection of reactions to a particular event. In order to build these event threads, we use a cascade of classifiers and other modules, taking advantage of the redundancy of information in the newswire corpus. We also share interesting comments concerning our manual annotation procedure for building a training and testing set.

Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features
Bowei Zou, Guodong Zhou, Qiaoming Zhu

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Scope detection is a key task in information extraction. This paper proposes a new approach for tree kernel-based scope detection by using the structured syntactic parse information. In addition, we have explored the way of selecting compatible features for different part-of-speech cues. Experiments on the BioScope corpus show that both constituent and dependency structured syntactic parse features have the advantage in capturing the potential relationships between cues and their scopes. Compared with state-of-the-art scope detection systems, our system achieves substantial improvement.

A temporal model of text periodicities using Gaussian Processes
Daniel Preotiuc-Pietro, Trevor Cohn

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Temporal variations of text are usually ignored in NLP applications. However, text use changes with time, which can affect many applications. In this paper we model periodic distributions of words over time. Focusing on hashtag frequency in Twitter, we first automatically identify the periodic patterns. We use this for regression in order to forecast the volume of a hashtag based on past data. We use Gaussian Processes, a state-of-the-art Bayesian non-parametric model, with a novel periodic kernel. We demonstrate this in a text classification setting, assigning the tweet hashtag based on the rest of its text. This method shows significant improvements over competitive baselines.
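
The standard periodic covariance function conveys the flavor of such a model; the paper introduces its own periodic kernel, so the form below is illustrative of the family rather than the authors' exact choice.

    import math

    def periodic_kernel(t1, t2, period=24.0, lengthscale=1.0, variance=1.0):
        # covariance is high whenever t1 and t2 are close modulo the period
        return variance * math.exp(
            -2.0 * math.sin(math.pi * abs(t1 - t2) / period) ** 2
            / lengthscale ** 2)

Plugged into GP regression over past hourly hashtag counts, a kernel like this lets the posterior mean extrapolate a daily or weekly cycle forward.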

Automatically Detecting and Attributing Indirect Quotations
Silvia Pareti, Tim O’Keefe, Ioannis Konstas, James R. Curran, Irena Koprinska

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Direct quotations are used for opinion mining and information extraction, as they have an easy-to-extract span and can be attributed to a speaker with high accuracy. However, simply focusing on direct quotations ignores around half of all reported speech, which is in the form of indirect or mixed speech. This work presents the first large-scale experiments in indirect and mixed quotation extraction and attribution. We propose two methods of extracting all quote types from news articles and evaluate them on two large annotated corpora, one of which is a contribution of this work. We further show that direct quotation attribution methods can be successfully applied to indirect and mixed quotation attribution.

Identifying Web Search Query Reformulation using Concept based Matching
Ahmed Hassan

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Web search users frequently modify their queries in hope of receiving better results. This process is referred to as "Query Reformulation". Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can significantly improve query reformulation performance using features of query concepts.

The Answer is at your Fingertips: Improving Passage Retrieval for Web Question Answering with Search Behavior Data
Mikhail Ageev, Dmitry Lagun, Eugene Agichtein

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Passage retrieval is a crucial first step of automatic Question Answering (QA). While existing passage retrieval algorithms are effective at selecting document passages most similar to the question, or those that contain the expected answer types, they do not take into account which parts of the document the searchers actually found useful. We propose, to the best of our knowledge, the first successful attempt to incorporate searcher examination data into passage retrieval for question answering. Specifically, we exploit detailed examination data, such as mouse cursor movements and scrolling, to infer the parts of the document the searcher found interesting, and then incorporate this signal into passage retrieval for QA. Our extensive experiments and analysis demonstrate that our method significantly improves passage retrieval, compared to using textual features alone. As an additional contribution, we make available to the research community the code and the search behavior data used in this study, with the hope of encouraging further research in this area.

Assembling the Kazakh Language Corpus
Olzhas Makhambetov, Aibek Makazhanov, Zhandos Yessenbayev, Bakhyt Matkarimov, Islam Sabyrgaliyev, Anuar Sharafudinov

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primary part, KLC comprises: (i) an annotated sub-corpus, containing segmented documents encoded in the eXtensible Markup Language (XML) that marks complete morphological, syntactic, and structural characteristics of texts; and (ii) a sub-corpus with annotated speech data. KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. KLC is also open to contributors, who are willing to make suggestions, donate texts and help with annotation of existing materials.

Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora
Ramy Eskander, Nizar Habash, Owen Rambow

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a method for automatically learning inflectional classes and associated lemmas from morphologically annotated corpora. The method consists of a core language-independent algorithm, which can be optimized for specific languages. The method is demonstrated on Egyptian Arabic and German, two morphologically rich languages. Our best method for Egyptian Arabic provides an error reduction of 55.6% over a simple baseline; our best method for German achieves a 66.7% error reduction.

Joint Language and Translation Modeling with Recurrent Neural Networks
Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a joint language and translation model based on a recurrent neural network which predicts target words based on an unbounded history of both source and target words. The weaker independence assumptions of this model result in a vastly larger search space compared to related feed-forward-based language or translation models. We tackle this issue with a new lattice rescoring algorithm and demonstrate its effectiveness empirically. Our joint model builds on a well-known recurrent neural network language model (Mikolov, 2012), augmented by a layer of additional inputs from the source language. We show competitive accuracy compared to the traditional channel model features. Our best results improve the output of a system trained on WMT 2012 French-English data by up to 1.5 BLEU, and by 1.1 BLEU on average across several test sets.

Multi-Domain Adaptation for SMT Using Multi-Task Learning
Lei Cui, Xilun Chen, Dongdong Zhang, Shujie Liu, Mu Li, Ming Zhou

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Domain adaptation for SMT usually adapts models to an individual specific domain. However, it often lacks some correlation among different domains where common knowledge could be shared to improve the overall translation quality. In this paper, we propose a novel multi-domain adaptation approach for SMT using Multi-Task Learning (MTL), with in-domain models tailored for each specific domain and a general-domain model shared by different domains. The parameters of these models are tuned jointly via MTL so that they can learn general knowledge more accurately and exploit domain knowledge better. Our experiments on a large-scale English-to-Chinese translation task validate that the MTL-based adaptation approach significantly and consistently improves the translation quality compared to a non-adapted baseline. Furthermore, it also outperforms the individual adaptation of each specific domain.

Translation with Source Constituency and Dependency Trees
Fandong Meng, Jun Xie, Linfeng Song, Yajuan Lü, Qun Liu

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a novel translation model, which simultaneously exploits the constituency and dependency trees on the source side, to combine the advantages of the two types of trees. We take head-dependents relations of dependency trees as the backbone and incorporate phrasal nodes of constituency trees as the source side of our translation rules, with strings as the target side. Our rules hold the property of long distance reorderings and compatibility with phrases. Large-scale experimental results show that our model achieves significant improvements over the constituency-to-string (+2.45 BLEU on average) and dependency-to-string (+0.91 BLEU on average) models, which employ only a single type of tree, and significantly outperforms the state-of-the-art hierarchical phrase-based model (+1.12 BLEU on average), on three Chinese-English NIST test sets.

Monolingual Marginal Matching for Translation Model Adaptation
Ann Irvine, Chris Quirk, Hal Daumé III

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

When using a machine translation (MT) model trained on OLD-domain parallel data to translate NEW-domain text, one major challenge is the large number of out-of-vocabulary (OOV) and new-translation-sense words. We present a method to identify new translations of both known and unknown source language words that uses NEW-domain comparable document pairs. Starting with a joint distribution of source-target word pairs derived from the OLD-domain parallel corpus, our method recovers a new joint distribution that matches the marginal distributions of the NEW-domain comparable document pairs, while minimizing the divergence from the OLD-domain distribution. Adding learned translations to our French-English MT model results in gains of about 2 BLEU points over strong baselines.
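
For KL divergence, the objective of matching new marginals while staying close to an old joint is what iterative proportional fitting solves, so IPF makes a reasonable illustrative stand-in here (the paper's actual optimization procedure may differ):

    import numpy as np

    def match_marginals(old_joint, row_marg, col_marg, iters=200):
        # old_joint: |source| x |target| probability table from OLD-domain data
        q = old_joint.astype(float).copy()
        for _ in range(iters):
            q *= (row_marg / q.sum(axis=1).clip(1e-12))[:, None]  # source marginal
            q *= (col_marg / q.sum(axis=0).clip(1e-12))[None, :]  # target marginal
        return q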

Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
Maryam Siahbani, Baskaran Sankaran, Anoop Sarkar

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Left-to-right (LR) decoding (Watanabe et al., 2006b) is a promising decoding algorithm for hierarchical phrase-based translation (Hiero). It generates the target sentence by extending the hypotheses only on the right edge. LR decoding has complexity O(n²b) for input of n words and beam size b, compared to O(n³) for the CKY algorithm. It requires a single language model (LM) history for each target hypothesis rather than two LM histories per hypothesis as in CKY. In this paper we present an augmented LR decoding algorithm that builds on the original algorithm of Watanabe et al. (2006b). Unlike that algorithm, through experiments over multiple language pairs we show two new results: our LR decoding algorithm provides demonstrably more efficient decoding than CKY Hiero (four times faster), and by introducing new distortion and reordering features for LR decoding, it maintains the same translation quality (in BLEU scores) obtained by phrase-based and CKY Hiero with the same translation model.

A Systematic Exploration of Diversity in Machine Translation
Kevin Gimpel, Dhruv Batra, Chris Dyer, Gregory Shakhnarovich

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.

Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Heng Yu, Liang Huang, Haitao Mi, Kai Zhao

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from the top 100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

Identifying Multiple Userids of the Same Author
Tieyun Qian, Bing Liu

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is very effective.

Gender Inference of Twitter Users in Non-English Contexts
Morgane Ciot, Morgan Sonderegger, Derek Ruths

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

While much work has considered the problem of latent attribute inference for users of social media such as Twitter, little has been done on non-English-based content and users. Here, we conduct the first assessment of latent attribute inference in languages beyond English, focusing on gender inference. We find that the gender inference problem in quite diverse languages can be addressed using existing machinery. Further, accuracy gains can be made by taking language-specific features into account. We identify languages with complex orthography, such as Japanese, as difficult for existing methods, suggesting a valuable direction for future research.

A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Stephen Roller, Sabine Schulte im Walde

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.

Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
Joseph Le Roux, Antoine Rozenknop, Jennifer Foster

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

It has recently been shown that different NLP models can be effectively combined using dual decomposition. In this paper we demonstrate that PCFG-LA parsing models are suitable for combination in this way. We experiment with the different models which result from alternative methods of extracting a grammar from a treebank (retaining or not function labels, left binarization versus right binarization) and achieve a labeled Parseval f-score of 92.4 on Wall Street Journal Section 23. This represents an absolute improvement of 0.7 and an error reduction rate of 7% over a strong PCFG-LA product-model baseline. Although we experiment only with binarization and function labels in this study, there is much scope for applying this approach to other grammar extraction strategies.

Feature Noising for Log-Linear Structured Prediction
Sida Wang, Mengqiu Wang, Stefan Wager, Percy Liang, Christopher D. Manning

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a >1% absolute performance gain over use of standard L2 regularization.
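
The "fake data" view that is being approximated is easy to sketch: replicate each example with feature dropout noise and train as usual. The paper's contribution is an analytic second-order surrogate that avoids generating such copies; the sketch below shows only the naive noising being replaced.

    import numpy as np

    def noised_copies(X, y, copies=10, drop=0.3, seed=0):
        rng = np.random.default_rng(seed)
        Xs, ys = [], []
        for _ in range(copies):
            mask = rng.random(X.shape) >= drop
            # rescale kept features so the expected feature vector is unchanged
            Xs.append(X * mask / (1.0 - drop))
            ys.append(y)
        return np.vstack(Xs), np.concatenate(ys)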

Improvements to the Bayesian Topic N-Gram Models
Hiroshi Noji, Daichi Mochihashi, Yusuke Miyao

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

One of the language phenomena that n-gram language models fail to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all n-grams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to other topics. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumptions, we found it effective to assign a topic to only some parts of a document.

An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
Fan Yang, Paul Vozila

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using character-level features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
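
A generic co-training loop of this kind looks as follows; the segmenter interface (fit, confidence, segment) is assumed purely for illustration:

    def co_train(model_a, model_b, labeled, unlabeled, rounds=10, per_round=100):
        train_a, train_b = list(labeled), list(labeled)
        for _ in range(rounds):
            model_a.fit(train_a)
            model_b.fit(train_b)
            if not unlabeled:
                break
            # each model's most confident outputs feed the *other* model
            conf_a = sorted(unlabeled, key=model_a.confidence, reverse=True)[:per_round]
            conf_b = sorted(unlabeled, key=model_b.confidence, reverse=True)[:per_round]
            train_b += [(x, model_a.segment(x)) for x in conf_a]
            train_a += [(x, model_b.segment(x)) for x in conf_b]
            used = set(conf_a) | set(conf_b)
            unlabeled = [x for x in unlabeled if x not in used]
        return model_a, model_b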

Ubertagging: Joint Segmentation and Supertagging for English
Rebecca Dridan

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

A precise syntacto-semantic analysis of English requires a large detailed lexicon with the possibility of treating multiple tokens as a single meaning-bearing unit, a word-with-spaces. However, parsing with such a lexicon, as included in the English Resource Grammar, can be very slow. We show that we can apply supertagging techniques over an ambiguous token lattice without resorting to previously used heuristics, a process we call ubertagging. Our model achieves an ubertagging accuracy that can lead to a four- to eight-fold speed-up while improving parser accuracy.

Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese
Ryohei Sasano, Daisuke Kawahara, Sadao Kurohashi, Manabu Okumura

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a method for automatically acquiring knowledge for case alternation between the passive and active voices in Japanese. By leveraging several linguistic constraints on alternation patterns and lexical case frames obtained from a large Web corpus, our method aligns a case frame in the passive voice to a corresponding case frame in the active voice and finds an alignment between their cases. We then apply the acquired knowledge to a case alternation task and prove its usefulness.

Exploiting Multiple Sources for Open-Domain Hypernym Discovery
Ruiji Fu, Bing Qin, Ting Liu

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Hypernym discovery aims to extract noun pairs in which one noun is a hypernym of the other. Most previous methods are based on lexical patterns but perform badly on open-domain data. Other work extracts hypernym relations from encyclopedias but has limited coverage. This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i.e., search engine results, encyclopedias, and the morphology of the entity name. First, we extract candidate hypernyms from the above sources. Then, we apply a statistical ranking model to select correct hypernyms. A set of novel features is proposed for the ranking model. We also present a heuristic strategy to build large-scale noisy training data for the model without human annotation. Experimental results demonstrate that our approach outperforms the state-of-the-art methods on a manually labeled test dataset.

A Semantically Enhanced Approach to Determine Textual Similarity
Eduardo Blanco, Dan Moldovan

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper presents a novel approach to determine textual similarity. A layered methodology to transform text into logic forms is proposed, and semantic features are derived from a logic prover. Experimental results show that incorporating the semantic structure of sentences is beneficial. When training data is unavailable, scores obtained from the logic prover in an unsupervised manner outperform supervised methods.

Understanding and Quantifying Creativity in Lexical Composition
Polina Kuznetsova, Jianfu Chen, Yejin Choi

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Why do certain combinations of words such as “disadvantageous peace” or “metal to the petal” appeal to our minds as interesting expressions with a sense of creativity, while other phrases such as “quiet teenager” or “geometrical base” not as much? We present statistical explorations to understand the characteristics of lexical compositions that give rise to the perception of being original, interesting, and at times even artistic. We first examine various correlates of perceived creativity based on information-theoretic measures and the connotation of words, then present experiments based on supervised learning that give us further insights into how different aspects of lexical composition collectively contribute to perceived creativity.

Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet
Marco Guerini, Lorenzo Gatti, Marco Turchi

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Assigning a positive or negative score to a word out of context (i.e. a word's prior polarity) is a challenging task for sentiment analysis. In the literature, various approaches based on SentiWordNet have been proposed. In this paper, we compare the most often used techniques together with newly proposed ones and incorporate all of them in a learning framework to see whether blending them can further improve the estimation of prior polarity scores. Using two different versions of SentiWordNet and testing regression and classification models across tasks and datasets, our learning approach consistently outperforms the single metrics, providing a new state-of-the-art approach in computing words' prior polarity for sentiment analysis. We conclude our investigation by showing interesting biases in calculated prior polarity scores when word Part of Speech and annotator gender are considered.
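
One frequently used SentiWordNet recipe of the kind compared in the paper: average the positive-minus-negative score over a word's senses, weighting earlier (more frequent) senses more heavily. The rank weighting is one common choice, not necessarily the paper's; it requires NLTK with the sentiwordnet and wordnet corpora installed.

    from nltk.corpus import sentiwordnet as swn

    def prior_polarity(word, pos=None):
        senses = list(swn.senti_synsets(word, pos))
        if not senses:
            return 0.0
        weights = [1.0 / rank for rank in range(1, len(senses) + 1)]
        total = sum(w * (s.pos_score() - s.neg_score())
                    for w, s in zip(weights, senses))
        return total / sum(weights)

    print(prior_polarity("love", "n"))      # expected positive
    print(prior_polarity("disaster", "n"))  # expected negative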

Simulating Early-Termination Search for Verbose Spoken Queries
Jerome White, Douglas W. Oard, Nitendra Rajput, Marion Zalk

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Building search engines that can respond to spoken queries with spoken content requires that the system not just be able to find useful responses, but also that it know when it has heard enough about what the user wants to be able to do so. This paper describes a simulation study with queries spoken by non-native speakers which suggests that finding relevant content is often possible within half a minute, and that combining features based on automatically recognized words with features designed for automated prediction of query difficulty can serve as a useful basis for predicting when that useful content has been found.

Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
Shize Xu, Shanshan Wang, Yan Zhang

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

The rapid development of Web 2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news, which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.

Image Description using Visual Dependency RepresentationsDesmond Elliott, Frank Keller

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions, on both automatic measures and human judgements.

Semi-Supervised Feature Transformation for Dependency Parsing
Wenliang Chen, Min Zhang, Yue Zhang

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

In current dependency parsing models, conventional features (i.e. base features) defined over surface words and part-of-speech tags in a relatively high-dimensional feature space may suffer from the data sparseness problem and thus exhibit less discriminative power on unseen data. In this paper, we propose a novel semi-supervised approach to addressing the problem by transforming the base features into high-level features (i.e. meta features) with the help of a large amount of automatically parsed data. The meta features are used together with base features in our final parser. Our studies indicate that our proposed approach is very effective in processing unseen data and features. Experiments on Chinese and English data sets show that the final parser achieves the best-reported accuracy on the Chinese data and comparable accuracy with the best known parsers on the English data.

Leveraging Lexical Cohesion and Disruption for Topic Segmentation
Anca-Roxana Simon, Guillaume Gravier, Pascale Sébillot

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Topic segmentation classically relies on one of two criteria: finding areas with coherent vocabulary use, or detecting discontinuities. In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph-based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However, the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.

This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

William Yang Wang, Edward Lin, John Kominek

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We propose a Laplacian structured sparsity model to study computational branding analytics. To do this, we collected customer reviews from Starbucks, Dunkin' Donuts, and other coffee shops across 38 major cities in the Midwest and Northeastern regions of the USA. We study brand-related language use through these reviews, focusing on brand satisfaction and gender factors. In particular, we perform three tasks: automatic brand identification from raw text, joint brand-satisfaction prediction, and joint brand-gender-satisfaction prediction. This work extends previous studies in text classification by incorporating the dependency and interaction among local features in the form of structured sparsity in a log-linear model. Our quantitative evaluation shows that our approach, which combines the advantages of graphical modeling and sparsity modeling techniques, significantly outperforms various standard and state-of-the-art text classification algorithms. In addition, qualitative analysis of our model reveals important features of the language use associated with the specific brands.

Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs

Jinpeng Wang, Wayne Xin Zhao, Haitian Wei, Hongfei Yan, Xiaoming Li

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Hot trends are likely to bring new business opportunities. For example, "Air Pollution" might lead to a significant increase in the sales of related products, e.g., mouth masks. For e-commerce companies, it is very important to respond rapidly and correctly to these hot trends in order to improve product sales. In this paper, we take the initiative to study the task of how to identify trend related products. The major novelty of our work is that we automatically learn commercial intents revealed from microblogs. We carefully construct a data collection for this task and present quite a few insightful findings. To solve the problem, we further propose a graph-based method which jointly models relevance and associativity. We perform extensive experiments, and the results show that our methods are very effective.

Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

Philip Resnik, Anderson Garron, Rebecca Resnik

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We investigate the value-add of topic modeling in text analysis for depression, and for neuroticism as a strongly associated personality measure. Using Pennebaker's Linguistic Inquiry and Word Count (LIWC) lexicon to provide baseline features, we show that straightforward topic modeling using Latent Dirichlet Allocation (LDA) yields interpretable, psychologically relevant "themes" that add value in prediction of clinical assessments.
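A minimal sketch of the feature setup such a study implies: per-document topic proportions from LDA concatenated with lexicon-based baseline features before classification. The toy documents and the zero-filled placeholder standing in for LIWC category counts are assumptions for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["i feel tired and alone",
        "great day out with friends",
        "cannot sleep again and keep worrying"]

# Bag-of-words counts, then per-document topic proportions from LDA.
counts = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Placeholder for baseline lexicon features (e.g., LIWC category counts);
# in a real study these would be actual counts, not zeros.
lexicon_feats = np.zeros((len(docs), 4))
features = np.hstack([lexicon_feats, theta])
print(features.shape)  # (3, 6): 4 lexicon columns + 2 topic proportions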

Predicting the Resolution of Referring Expressions from User Behavior
Nikos Engonopoulos, Martin Villalba, Ivan Titov, Alexander Koller

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a statistical model for predicting how the user of an interactive, situated NLP system resolved a referring expression. The model makes an initial prediction based on the meaning of the utterance, and revises it continuously based on the user's behavior. The combined model outperforms its components in predicting reference resolution and when to give feedback.

Chinese Zero Pronoun Resolution: Some Recent Advances
Chen Chen, Vincent Ng

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We extend Zhao and Ng's (2007) Chinese anaphoric zero pronoun resolver by (1) using a richer set of features and (2) exploiting the coreference links between zero pronouns during resolution. Results on OntoNotes show that our approach significantly outperforms two state-of-the-art anaphoric zero pronoun resolvers. To our knowledge, this is the first work to report results obtained by an end-to-end Chinese zero pronoun resolver.

Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

Jason Weston, Antoine Bordes, Oksana Yakhnenko, Nicolas Usunier

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper proposes a novel approach for relation extraction from free text which is trained to jointly use information from the text and from existing knowledge. Our model is based on scoring functions that operate by learning low-dimensional embeddings of words, entities and relationships from a knowledge base. We empirically show on New York Times articles aligned with Freebase relations that our approach is able to efficiently use the extra information provided by a large subset of Freebase data (4M entities, 23k relationships) to improve over methods that rely on text features alone.

Simple Customization of Recursive Neural Networks for Semantic Relation Classification
Kazuma Hashimoto, Makoto Miwa, Yoshimasa Tsuruoka, Takashi Chikayama

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

In this paper, we present a recursive neural network (RNN) model that works on a syntactic tree. Our model differs from previous RNN models in that it allows for an explicit weighting of important phrases for the target task. We also propose to average parameters in training. Our experimental results on semantic relation classification show that both phrase categories and task-specific weighting significantly improve the prediction accuracy of the model. We also show that averaging the model parameters is effective in stabilizing the learning and improves generalization capacity. The proposed model achieves scores competitive with state-of-the-art RNN-based models.

Improving Statistical Machine Translation with Word Class Models
Joern Wuebker, Stephan Peitz, Felix Rietig, Hermann Ney

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Automatically clustering words from a monolingual or bilingual training corpus into classes is a widely used technique in statistical natural language processing. We present a very simple and easy to implement method for using these word classes to improve translation quality. It can be applied across different machine translation paradigms and with arbitrary types of models. We show its efficacy on a small German-to-English and a larger French-to-German translation task with both standard phrase-based and hierarchical phrase-based translation systems for a common set of models. Our results show that with word class models, the baseline can be improved by up to 1.4% BLEU and 1.0% TER on the French-to-German task and 0.3% BLEU and 1.1% TER on the German-to-English task.

Shift-Reduce Word Reordering for Machine Translation
Katsuhiko Hayashi, Katsuhito Sudoh, Hajime Tsukada, Jun Suzuki, Masaaki Nagata

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper presents a novel word reordering model that employs a shift-reduce parser for inversion transduction grammars. Our model uses rich syntax parsing features for word reordering and runs in linear time. We apply it to postordering of phrase-based machine translation (PBMT) for Japanese-to-English patent tasks. Our experimental results show that our method achieves a significant improvement of +3.1 BLEU over the baseline PBMT system's 30.15 BLEU.

Decoding with Large-Scale Neural Language Models Improves Translation
Ashish Vaswani, Yinggong Zhao, Victoria Fossum, David Chiang

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We explore the application of neural language models to machine translation. We develop a new model that combines the neural probabilistic language model of Bengio et al., rectified linear units, and noise-contrastive estimation, and we incorporate it into a machine translation system both by reranking k-best lists and by direct integration into the decoder. Our large-scale, large-vocabulary experiments across four language pairs show that our neural language model improves translation quality by up to 1.1 BLEU.
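The k-best reranking route can be pictured as a linear interpolation of the decoder's score with the language model's score. The sketch below is an illustrative assumption throughout (toy scores, a stand-in scoring function, an untuned weight), not the authors' actual integration.

def rerank(kbest, lm_score, lam=0.5):
    # kbest: list of (translation, decoder_score) pairs.
    # Pick the hypothesis maximizing an interpolation of the decoder score
    # and a (neural) language-model log-probability; lam is tuned on dev data.
    return max(kbest, key=lambda hyp: (1 - lam) * hyp[1] + lam * lm_score(hyp[0]))

kbest = [("the house is small", -4.1), ("house the small is", -3.9)]
toy_lm = lambda s: -1.0 if s.startswith("the") else -5.0  # stand-in for a real LM
print(rerank(kbest, toy_lm)[0])  # -> "the house is small"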

Bilingual Word Embeddings for Phrase-Based Machine Translation
Will Y. Zou, Richard Socher, Daniel Cer, Christopher D. Manning

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to naturally constrain translational equivalence. The new embeddings significantly outperform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds nearly half a BLEU point to the results of the NIST08 Chinese-English machine translation task.

Application of Localized Similarity for Web Documents
Peter Reberšek, Mateja Verlic

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

In this paper we present a novel approach to the automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on their similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to the linked documents and compared to a baseline consisting of the originally inserted anchor texts. Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.

Dependency Language Models for Sentence Completion
Joseph Gubbins, Andreas Vlachos

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Sentence completion is a challenging semantic modeling task in which models must choose the most appropriate word from a given set to complete a sentence. While a variety of language models have been applied to this task in previous work, none of the existing approaches has attempted to incorporate syntactic information. In this paper we propose to tackle this task using a pair of simple language models in which the probability of a sentence is estimated as the probability of the lexicalisation of a given syntactic dependency tree. We apply our approach to the Microsoft Research Sentence Completion Challenge and show that it improves on n-gram language models by 8.7 percentage points in accuracy, achieving the highest accuracy reported to date apart from the more complex and expensive-to-train neural language models.
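As a sketch of the underlying idea, the snippet below scores candidate completions by the smoothed probability of each (head, dependent) arc in the candidate's dependency tree. The tiny set of training arcs, the add-alpha smoothing, and the hand-specified candidate trees are all illustrative assumptions.

import math
from collections import Counter

# Hypothetical (head, dependent) arcs harvested from parsed training text.
train_arcs = [("ate", "cat"), ("ate", "fish"), ("cat", "the"),
              ("fish", "the"), ("ate", "dog"), ("dog", "the")]
head_counts = Counter(h for h, _ in train_arcs)
arc_counts = Counter(train_arcs)
vocab = {d for _, d in train_arcs}

def arc_logprob(head, dep, alpha=1.0):
    # Add-alpha smoothed P(dependent | head): one simple way to define a
    # language model over the lexicalisation of a dependency tree.
    return math.log((arc_counts[(head, dep)] + alpha) /
                    (head_counts[head] + alpha * len(vocab)))

def tree_logprob(arcs):
    return sum(arc_logprob(h, d) for h, d in arcs)

# Score each candidate completion via the arcs its choice would create.
candidates = {"fish": [("ate", "cat"), ("ate", "fish")],
              "sofa": [("ate", "cat"), ("ate", "sofa")]}
print(max(candidates, key=lambda w: tree_logprob(candidates[w])))  # -> fish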

A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations

Shashank Srivastava, Dirk Hovy, Eduard Hovy

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

In this paper, we propose a walk-based graph kernel that generalizes the notion of tree kernels to continuous spaces. Our proposed approach subsumes a general framework for word similarity and, in particular, provides a flexible way to incorporate distributed representations. Using vector representations, such an approach captures both distributional semantic similarities among words as well as the structural relations between them (encoded as the structure of the parse tree). We show an efficient formulation to compute this kernel using simple matrix operations. We present our results on three diverse NLP tasks, showing state-of-the-art results.

Automatic Idiom Identification in Wiktionary
Grace Muzny, Luke Zettlemoyer

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

Elephant: Sequence Labeling for Word and Sentence Segmentation
Kilian Evang, Valerio Basile, Grzegorz Chrupała, Johan Bos

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27‰ (English), 0.35‰ (Dutch) and 0.76‰ (Italian) for our best models.
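To make the character-level formulation concrete, the toy sketch below tags every character as token-start (T), token-internal (I), or whitespace (O) using a window-feature classifier. The label scheme, window size, and use of plain logistic regression in place of a sequence model with learned character representations are simplifying assumptions.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def char_features(text, i):
    # A small window of surrounding characters as categorical features.
    return {f"c{o}": text[i + o] for o in (-2, -1, 0, 1, 2)
            if 0 <= i + o < len(text)}

# One tiny annotated string: T = token start, I = inside, O = whitespace.
text = "Dogs bark."
labels = list("TIIIOTIIIT")

vec = DictVectorizer()
X = vec.fit_transform([char_features(text, i) for i in range(len(text))])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

test = "Cats purr."
X_test = vec.transform([char_features(test, i) for i in range(len(test))])
print("".join(clf.predict(X_test)))  # per-character boundary predictions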

Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models

Douwe Kiela, Stephen Clark

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We present a novel unsupervised approach to detecting the compositionality of multiword expressions. We compute the compositionality of a phrase through substituting the constituent words with their "neighbours" in a semantic vector space and averaging over the distance between the original phrase and the substituted neighbour phrases. Several methods of obtaining neighbours are presented. The results are compared to existing supervised results and achieve state-of-the-art performance on a verb-object dataset of human compositionality ratings.
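A minimal sketch of the substitution idea, under several assumptions: toy 2-d vectors in place of a real distributional space, additive composition, and a single hand-picked neighbour pair. Low similarity between the original phrase and its neighbour-substituted variants would suggest a non-compositional (idiomatic) phrase.

import numpy as np

vecs = {"kick": np.array([0.1, 1.0]), "strike": np.array([0.2, 0.9]),
        "bucket": np.array([0.5, 0.5]), "pail": np.array([0.55, 0.45])}

def compose(w1, w2):
    return vecs[w1] + vecs[w2]  # simple additive phrase vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compositionality(w1, w2, neighbour_pairs):
    # Average similarity of the original phrase vector to phrase vectors
    # built from the constituents' nearest neighbours.
    original = compose(w1, w2)
    sims = [cosine(original, compose(n1, n2)) for n1, n2 in neighbour_pairs]
    return sum(sims) / len(sims)

print(compositionality("kick", "bucket", [("strike", "pail")]))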

Naive Bayes Word Sense Induction
Do Kook Choe, Eugene Charniak

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

We introduce an extended naive Bayes model for word sense induction (WSI) and apply it to a WSI task. The extended model incorporates the idea that words closer to the target word are more relevant in predicting its sense. The proposed model is very simple yet effective when evaluated on SemEval-2010 WSI data.
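The sketch below illustrates the proximity-weighting idea with a naive Bayes sense score in which each context word's log-probability is down-weighted by its distance to the target word. The two hand-written sense distributions (which an actual WSI model would induce from unlabeled data) and the 1/distance weighting scheme are assumptions for illustration.

import math

# Hypothetical per-sense word distributions; a real WSI system learns
# these from raw text rather than writing them down.
sense_word_prob = {
    0: {"finance": 0.4, "money": 0.4, "river": 0.1, "water": 0.1},
    1: {"finance": 0.1, "money": 0.1, "river": 0.4, "water": 0.4},
}
sense_prior = {0: 0.5, 1: 0.5}

def sense_score(sense, context, target_idx):
    # Naive Bayes over context words, down-weighting words far from the
    # target: nearby words dominate the sense decision.
    score = math.log(sense_prior[sense])
    for i, w in enumerate(context):
        if i == target_idx or w not in sense_word_prob[sense]:
            continue
        score += math.log(sense_word_prob[sense][w]) / abs(i - target_idx)
    return score

context = "the money in my bank is near a river".split()
t = context.index("bank")
print(max(sense_word_prob, key=lambda s: sense_score(s, context, t)))  # -> 0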

The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs
Joshua K. Hartshorne, Claire Bonial, Martha Palmer

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This research describes efforts to use crowd-sourcing to improve the validity of the semantic predicates in VerbNet, a lexicon of about 6300 English verbs. The current semantic predicates can be thought of as semantic primitives, into which the concepts denoted by a verb can be decomposed. For example, the verb spray (of the Spray class) involves the predicates Motion, Not, and Location, where the event can be decomposed into an Agent causing a Theme that was originally not in a particular location to now be in that location. Although VerbNet's predicates are theoretically well-motivated, systematic empirical data is scarce. This paper describes a recently-launched attempt to address this issue with a series of human judgment tasks, posed to subjects in the form of games.

Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews

Jun Seok Kang, Polina Kuznetsova, Michael Luca, Yejin Choi

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

This paper offers an approach for governments to harness the information contained in social media in order to make public inspections and disclosure more efficient. As a case study, we turn to restaurant hygiene inspections, which are conducted for restaurants throughout the United States and in most of the world and are a frequently cited example of public inspections and disclosure. We present the first empirical study that shows the viability of statistical models that learn the mapping between textual signals in restaurant reviews and the hygiene inspection records from the Department of Public Health. The learned model achieves over 82% accuracy in discriminating severe offenders from places with no violation, and provides insights into salient cues in reviews that are indicative of the restaurant's sanitary conditions. Our study suggests that public disclosure policy can be improved by mining public opinions from social media to target inspections and to provide alternative forms of disclosure to customers.

Automatically Identifying Pseudepigraphic Texts
Moshe Koppel, Shachar Seidman

Saturday 7:30–9:00 — Princessa Ballroom and Foyer

The identification of pseudepigraphic texts - texts not written by the authors to which they are attributed - has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pair-wise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

Sunday, October 20, 2013: Main Conference

Overview

8:00 – 9:00 Breakfast (Leonesa Foyer)

9:15 – 10:30 Invited talk: Fernando Pereira (Leonesa Ballroom)

10:30–11:00 Break (Leonesa Foyer)

11:00–12:35 Parallel Sessions
Machine Learning for NLP (Leonesa I and II)
Summarization and Generation (Leonesa III)
Information Extraction and Social Media Analysis (Eliza Anderson Amphitheater)

12:35 – 2:00 Lunch

2:00 – 3:35 Parallel Sessions
Machine Translation II (Leonesa I and II)
Semantics II (Leonesa III)
Opinion Mining and Sentiment Analysis I (Eliza Anderson Amphitheater)

3:35 – 4:05 Break (Leonesa Foyer)

4:05 – 5:55 Parallel Sessions
Machine Translation III (Leonesa I and II)
Information Extraction II (Leonesa III)
NLP Applications II (Eliza Anderson Amphitheater)

Invited talk: Fernando Pereira
Leonesa Ballroom

Meaning in the Wild
Dr Fernando Pereira, Research Director at Google

Sunday 9:15–10:30 — Leonesa Ballroom

This meeting was founded on the premise that analytical approaches to computational linguistics could be beneficially replaced by machine learning from large corpora exhibiting the linguistic behaviors of interest. The successes of that program have been most notable in speech recognition and machine translation, where the behavior of interest is plentiful in the wild: people transcribe speech and translate texts for practical reasons, creating a voluminous record from which our algorithms can learn.

However, when people understand what they are told or what they read, the output of the process is a hidden mental state change, only partially accessible from whatever observable actions are triggered by the change: the desired input-output behavior is not available in the wild, even accepting the dubious assumption that the output can be abstracted away from the mental state where it came about. The standard escape route of enlisting linguists to create annotated training data is hard enough for parsing, and it quickly falls apart for semantics, even for seemingly "constrained" tasks like coreference.

Nevertheless, meaning is all around us: in how people ask and respond to search queries, in how they write text so that it can be understood by others, in how they annotate their text with hyperlinks, and in many other common behaviors that are to some extent observable on the Web. We also have a growing body of structured text organized so that computers can use it meaningfully, such as WordNet and Freebase. I will take you on a tour of examples from search, coreference, and information extraction that show small successes and big failures in understanding, asking questions about the potential and limitations of our current approaches along the way. I'll not give you a complete recipe for machine understanding, but I hope you'll find the examples and research questions fun and useful.

Machine Learning for NLP

Leonesa I and II Chair: Chris Dyer

Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction [TACL]

Ming-Wei Chang, Wen-tau Yih

Sunday 11:00–11:25 — Leonesa I and II

Due to the nature of complex NLP problems, structured prediction algorithms have been important modeling tools for a wide range of tasks. While there exists evidence showing that the linear Structural Support Vector Machine (SSVM) algorithm performs better than the structured Perceptron, the SSVM algorithm is still less frequently chosen in the NLP community because of its relatively slow training speed. In this paper, we propose a fast and easy-to-implement dual coordinate descent algorithm for SSVMs. Unlike algorithms such as Perceptron and stochastic gradient descent, our method keeps track of dual variables and updates the weight vector more aggressively. As a result, this training process is as efficient as existing online learning methods, and yet derives consistently better models, as evaluated on four benchmark NLP datasets for part-of-speech tagging, named-entity recognition and dependency parsing.

Dynamic Feature Selection for Dependency Parsing
He He, Hal Daumé III, Jason Eisner

Sunday 11:25–11:50 — Leonesa I and II

Feature computation and exhaustive search have significantly restricted the speed of graph-based dependency parsing. We propose a faster framework of dynamic feature selection, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence. We model this as a sequential decision-making problem and solve it by imitation learning techniques. We test our method on 7 languages. Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.

Semi-Supervised Representation Learning for Cross-Lingual Text Classification
Min Xiao, Yuhong Guo

Sunday 11:50–12:15 — Leonesa I and II

Cross-lingual adaptation aims to learn a prediction model in a label-scarce target language by exploiting labeled data from a label-rich source language. An effective cross-lingual adaptation system can substantially reduce the manual annotation effort required in many natural language processing tasks. In this paper, we propose a new cross-lingual adaptation approach for document classification based on learning cross-lingual discriminative distributed representations of words. Specifically, we propose to maximize the log-likelihood of the documents from both language domains under a cross-lingual log-bilinear document model, while minimizing the prediction log-losses of labeled documents. We conduct extensive experiments on cross-lingual sentiment classification tasks of Amazon product reviews. Our experimental results demonstrate the efficacy of the proposed cross-lingual adaptation approach.

Using Crowdsourcing to get Representations based on Regular Expressions
Anders Søgaard, Hector Martinez, Jakob Elming, Anders Johannsen

Sunday 12:15–12:35 — Leonesa I and II

Often the bottleneck in document classification is finding good representations that zoom in on the most important aspects of the documents. Most research uses n-gram representations, but relevant features often occur discontinuously, e.g., 'not ... good' in sentiment analysis. Discontinuous features can be expressed as regular expressions, but even if we limit the regular expressions that we derive from a set of documents to some fixed length, the number becomes astronomical. Some feature combination methods can be seen as digging into the space of regular expressions. In this paper we present experiments getting experts to provide regular expressions, as well as crowdsourced annotation tasks from which regular expressions can be derived. Somewhat surprisingly, it turns out that these crowdsourced feature combinations outperform automatic feature combination methods, as well as expert features, by a very large margin and reduce error by 24-41% over n-gram representations.
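A minimal sketch of how such discontinuous patterns can be turned into document features: each regular expression contributes one binary feature alongside ordinary n-grams. The two patterns here are invented examples of the kind experts or crowds might supply.

import re

patterns = [r"\bnot\b.{0,30}\bgood\b",     # e.g. "not very good"
            r"\bnever\b.{0,30}\bagain\b"]  # e.g. "never eat there again"

def regex_features(doc):
    # One binary feature per pattern, to be concatenated with n-gram counts.
    text = doc.lower()
    return [1 if re.search(p, text) else 0 for p in patterns]

print(regex_features("The food was not very good."))    # [1, 0]
print(regex_features("I will never eat there again."))  # [0, 1]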

Summarization and Generation

Leonesa III Chair: Lucy Vanderwende

Overcoming the Lack of Parallel Data in Sentence Compression
Katja Filippova, Yasemin Altun

Sunday 11:00–11:25 — Leonesa III

A major challenge in supervised sentence compression is making use of rich feature representations because of very scarce parallel data. We address this problem and present a method to automatically build a compression corpus with hundreds of thousands of instances on which deletion-based algorithms can be trained. In our corpus, the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence supervised systems which require a structural alignment between the input and output can be successfully trained. We also extend an existing unsupervised compression method with a learning module. The new system uses structured prediction to learn from lexical, syntactic and other features. An evaluation with human raters shows that the presented data harvesting method indeed produces a parallel corpus of high quality. Also, the supervised system trained on this corpus gets high scores both from human raters and in an automatic evaluation setting, significantly outperforming a strong baseline.

Fast Joint Compression and Summarization via Graph Cuts
Xian Qian, Yang Liu

Sunday 11:25–11:50 — Leonesa III

Extractive summarization typically uses sentences as summarization units. In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. The goal of compressive summarization is to find a subset of words that maximizes the total score of concepts and cut dependency arcs under the grammar constraints and summary length constraint. We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. Our approach first relaxes the length constraint using Lagrangian relaxation. Then we propose to bound the relaxed objective function by a supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. Experimental results on the TAC 2008 dataset demonstrate that our method achieves competitive ROUGE scores and good readability, while being much faster than the integer linear programming (ILP) method.

Inducing Document Plans for Concept-to-Text Generation
Ioannis Konstas, Mirella Lapata

Sunday 11:50–12:15 — Leonesa III

In a language generation system, a content planner selects which elements must be included in the output text and the ordering between them. Recent empirical approaches perform content selection without any ordering and thus have no means to ensure that the output is coherent. In this paper we focus on the problem of generating text from a database and present a trainable end-to-end generation system that includes both content selection and ordering. Content plans are represented intuitively by a set of grammar rules that operate on the document level and are acquired automatically from training data. We develop two approaches: the first one is inspired by Rhetorical Structure Theory and represents the document as a tree of discourse relations between database records; the second one requires little linguistic sophistication and uses tree structures to represent global patterns of database record sequences within a document.

Single-Document Summarization as a Tree Knapsack Problem
Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda, Masaaki Nagata

Sunday 12:15–12:35 — Leonesa III

Recent studies on extractive text summarization formulate it as a combinatorial optimization problem such as a Knapsack Problem, a Maximum Coverage Problem or a Budgeted Median Problem. These methods successfully improved summarization quality, but they did not consider the rhetorical relations between the textual units of a source document. Thus, summaries generated by these methods may lack logical coherence. This paper proposes a single-document summarization method based on the trimming of a discourse tree. This is a two-fold process. First, we propose rules for transforming a rhetorical structure theory-based discourse tree into a dependency-based discourse tree, which allows us to take a tree-trimming approach to summarization. Second, we formulate the problem of trimming a dependency-based discourse tree as a Tree Knapsack Problem, then solve it with integer linear programming (ILP). Evaluation results showed that our method improved ROUGE scores.
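For intuition, a tree knapsack can also be solved exactly with a small dynamic program when the budget is modest; the sketch below, with invented node values and costs, selects a connected subtree containing the root whose total cost stays within the budget. This DP is an illustrative alternative to, not a reproduction of, the paper's ILP formulation.

def tree_knapsack(tree, value, cost, root, budget):
    # best[b] = max value of a connected subtree that contains the current
    # node and costs at most b; children are folded in one at a time.
    def solve(v):
        best = [float("-inf")] * (budget + 1)
        for b in range(cost[v], budget + 1):
            best[b] = value[v]
        for child in tree.get(v, []):
            cbest = solve(child)
            merged = best[:]  # keeping the child out is always allowed
            for b in range(budget + 1):
                if best[b] == float("-inf"):
                    continue
                for cb in range(budget - b + 1):
                    if cbest[cb] != float("-inf"):
                        merged[b + cb] = max(merged[b + cb], best[b] + cbest[cb])
            best = merged
        return best
    return max(solve(root))

# Toy discourse tree: each node is a textual unit with a salience value and
# a length cost; the selected summary must stay connected through the root.
tree = {"S": ["A", "B"], "A": ["C"]}
value = {"S": 5, "A": 4, "B": 3, "C": 6}
cost = {"S": 2, "A": 2, "B": 2, "C": 2}
print(tree_knapsack(tree, value, cost, "S", 6))  # -> 15 (S + A + C)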

Information Extraction and Social Media Analysis

Eliza Anderson Amphitheater Chair: Soumen Chakrabarti

A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

Baichuan Li, Jing Liu, Chin-Yew Lin, Irwin King, Michael R. Lyu

Sunday 11:00–11:25 — Eliza Anderson Amphitheater

Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most UGC is listed as a whole or in pre-defined categories. This "list-based" approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called a "cluster entity tree" (CET). With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CETs. We further evaluate the performance of CETs on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with the CET-based structure, users perform significantly better in knowledge learning than with the traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., the vector space model and the query likelihood language model).

Semantic Parsing on Freebase from Question-Answer Pairs
Jonathan Berant, Andrew Chou, Roy Frostig, Percy Liang

Sunday 11:25–11:50 — Eliza Anderson Amphitheater

In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which are especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question. We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a "bridge" operation to generate additional predicates based on neighboring predicates. On the dataset of Cai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs and improve over a natural baseline.

Scaling Semantic Parsers with On-the-Fly Ontology Matching
Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, Luke Zettlemoyer

Sunday 11:50–12:15 — Eliza Anderson Amphitheater

We consider the challenge of learning semantic parsers that scale to large, open-domain problems, such as question answering with Freebase. In such settings, the sentences cover a wide variety of topics and include many phrases whose meaning is difficult to represent in a fixed target ontology. For example, even simple phrases such as 'daughter' and 'number of people living in' cannot be directly represented in Freebase, whose ontology instead encodes facts about gender, parenthood, and population. In this paper, we introduce a new semantic parsing approach that learns to resolve such ontological mismatches. The parser is learned from question-answer pairs, uses a probabilistic CCG to build linguistically motivated logical-form meaning representations, and includes an ontology matching model that adapts the output logical forms for each target ontology. Experiments demonstrate state-of-the-art performance on two benchmark semantic parsing datasets, including a nine point accuracy improvement on a recent Freebase QA corpus.

Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes
Ruihong Huang, Ellen Riloff

Sunday 12:15–12:35 — Eliza Anderson Amphitheater

The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

Machine Translation II

Leonesa I and II Chair: Kristina Toutanova

Lexical Chain Based Cohesion Models for Document-Level Statistical Machine Translation

Deyi Xiong, Yang Ding, Min Zhang, Chew Lim Tan

Sunday 2:00–2:25 — Leonesa I and II

Lexical chains provide a representation of the lexical cohesion structure of a text. In this paper, we propose two lexical chain based cohesion models to incorporate lexical cohesion into document-level statistical machine translation: 1) a count cohesion model that rewards a hypothesis whenever a chain word occurs in the hypothesis, and 2) a probability cohesion model that further takes chain word translation probabilities into account. We compute lexical chains for each source document to be translated and generate target lexical chains based on the computed source chains via maximum entropy classifiers. We then use the generated target chains to provide constraints for word selection in document-level machine translation through the two proposed lexical chain based cohesion models. We verify the effectiveness of the two models using a hierarchical phrase-based translation system. Experiments on large-scale training data show that they can substantially improve translation quality in terms of BLEU and that the probability cohesion model outperforms previous models based on lexical cohesion devices.

Measuring machine translation errors in new domains [TACL]
Ann Irvine, John Morgan, Marine Carpuat, Hal Daumé III, Dragos Munteanu

Sunday 2:25–2:50 — Leonesa I and II

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.

A Convex Alternative to IBM Model 2
Andrei Simion, Michael Collins, Cliff Stein

Sunday 2:50–3:15 — Leonesa I and II

The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, having multiple local optima. In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. The approach gives the same level of alignment accuracy as IBM Model 2.

Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries
Russell Beckley, Brian Roark

Sunday 3:15–3:35 — Leonesa I and II

Pronunciation dictionaries provide a readily available parallel corpus for learning to transduce between character strings and phoneme strings or vice versa. Translation models can be used to derive character-level paraphrases on either side of this transduction, allowing for the automatic derivation of alternative pronunciations or spellings. We examine finite-state and SMT-based methods for these related tasks, and demonstrate that the tasks have different characteristics - finding alternative spellings is harder than finding alternative pronunciations, and benefits from round-trip algorithms when the other does not. We also show that we can increase accuracy by modeling syllable stress.

Semantics II

Leonesa III Chair: Dipanjan Das

Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
Dimitri Kartsaklis, Mehrnoosh Sadrzadeh

Sunday 2:00–2:25 — Leonesa III

Recent work has shown that compositional-distributional models using element-wise operations on contextual word vectors benefit from the introduction of a prior disambiguation step. The purpose of this paper is to generalise these ideas to tensor-based models, where relational words such as verbs and adjectives are represented by linear maps (higher order tensors) acting on a number of arguments (vectors). We propose disambiguation algorithms for a number of tensor-based models, which we then test on a variety of tasks. The results show that disambiguation can provide better compositional representation even for the case of tensor-based models. Furthermore, we confirm previous findings regarding the positive effect of disambiguation on vector mixture models, and we compare the effectiveness of the two approaches.

Multi-Relational Latent Semantic Analysis
Kai-Wei Chang, Wen-tau Yih, Christopher Meek

Sunday 2:25–2:50 — Leonesa III

We present Multi-Relational Latent Semantic Analysis (MRLSA), which generalizes Latent Semantic Analysis (LSA). MRLSA provides an elegant approach to combining multiple relations between words by constructing a 3-way tensor. Similar to LSA, a low-rank approximation of the tensor is derived using a tensor decomposition. Each word in the vocabulary is thus represented by a vector in the latent semantic space and each relation is captured by a latent square matrix. The degree of two words having a specific relation can then be measured through simple linear algebraic operations. We demonstrate that by integrating multiple relations from both homogeneous and heterogeneous information sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.

A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

Ivan Vulic, Marie-Francine Moens

Sunday 2:50–3:15 — Leonesa III

We present a new language-pair-agnostic approach to inducing bilingual vector spaces from non-parallel data without any other resource in a bootstrapping fashion. The paper systematically introduces and describes all key elements of the bootstrapping procedure: (1) the starting point or seed lexicon, (2) the confidence estimation and selection of new dimensions of the space, and (3) convergence. We test the quality of the induced bilingual vector spaces, and analyze the influence of the different components of the bootstrapping approach in the task of bilingual lexicon extraction (BLE) for two language pairs. Results reveal that, contrary to conclusions from prior work, the seeding of the bootstrapping process has a heavy impact on the quality of the learned lexicons. We also show that our approach outperforms the best performing fully corpus-based BLE methods on these test sets.

Deriving Adjectival Scales from Continuous Space Word Representations
Joo-Kyung Kim, Marie-Catherine de Marneffe

Sunday 3:15–3:35 — Leonesa III

Continuous space word representations extracted from neural network language models have been used effectively for natural language processing, but until recently it was not clear whether the spatial relationships of such representations were interpretable. Mikolov et al. (2013) show that these representations do capture syntactic and semantic regularities. Here, we push the interpretation of continuous space word representations further by demonstrating that vector offsets can be used to derive adjectival scales (e.g., okay < good < excellent). We evaluate the scales on the indirect answers to yes/no questions corpus (de Marneffe et al., 2010). We obtain 72.8% accuracy, which outperforms previous results (~60%) on this corpus and highlights the quality of the scales extracted, providing further support that the continuous space word representations are meaningful.
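One way to picture the offset idea: fix a scale direction with two anchor adjectives and order the remaining adjectives by their projection onto it. The 2-d toy embeddings below are invented stand-ins for vectors from a neural language model, and the anchor-based projection is just one plausible reading of the technique.

import numpy as np

emb = {"okay": np.array([0.1, 0.2]), "good": np.array([0.4, 0.5]),
       "great": np.array([0.7, 0.8]), "excellent": np.array([0.9, 1.0])}

# The offset between two anchor adjectives defines the scale's direction.
direction = emb["excellent"] - emb["okay"]

def adjectival_scale(words):
    # Sort adjectives by their projection onto the scale direction.
    return sorted(words, key=lambda w: float(emb[w] @ direction))

print(adjectival_scale(["good", "excellent", "okay", "great"]))
# -> ['okay', 'good', 'great', 'excellent']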

Opinion Mining and Sentiment Analysis I

Eliza Anderson Amphitheater Chair: Oren Tsur

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, Christopher Potts

Sunday 2:00–2:25 — Eliza Anderson Amphitheater

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag-of-features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.

Open Domain Targeted Sentiment
Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, Benjamin Van Durme

Sunday 2:25–2:50 — Eliza Anderson Amphitheater

We propose a novel approach to sentiment analysis for a low resource setting. The intuition behind this work is that sentiment expressed towards an entity, targeted sentiment, may be viewed as a span of sentiment expressed across the entity. This representation allows us to model sentiment detection as a sequence tagging problem, jointly discovering people and organizations along with whether there is sentiment directed towards them. We compare performance in both Spanish and English on microblog data, using only a sentiment lexicon as an external resource. By leveraging linguistically informed features within conditional random fields (CRFs) trained to minimize empirical risk, our best models in Spanish significantly outperform a strong baseline, and reach around 90% accuracy on the combined task of named entity recognition and sentiment prediction. Our models in English, trained on a much smaller dataset, are not yet statistically significant against their baselines.

Exploiting Domain Knowledge in Aspect Extraction
Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, Riddhiman Ghosh

Sunday 2:50–3:15 — Eliza Anderson Amphitheater

Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MC-LDA outperforms the existing state-of-the-art models markedly.

Machine Translation III

Leonesa I and II Chair: Taro Watanabe

Dependency-Based Decipherment for Resource-Limited Machine Translation
Qing Dou, Kevin Knight

Sunday 4:05–4:30 — Leonesa I and II

We introduce dependency relations into deciphering foreign languages and show that dependency relations help improve the state-of-the-art deciphering accuracy by over 500%. We learn a translation lexicon from large amounts of genuinely non-parallel data with decipherment to improve a phrase-based machine translation system trained with limited parallel data. In experiments, we observe BLEU gains of 1.2 to 1.8 across three different test sets.

Translating into Morphologically Rich Languages with Synthetic Phrases
Victor Chahuneau, Eva Schlinger, Noah A. Smith, Chris Dyer

Sunday 4:30–4:55 — Leonesa I and II

Translation into morphologically rich languages is an important but recalcitrant problem in MT. We present a simple and effective approach that deals with the problem in two phases. First, a discriminative model is learned to predict inflections of target words from rich source-side annotations. Then, this model is used to create additional sentence-specific word- and phrase-level translations that are added to a standard translation model as "synthetic" phrases. Our approach relies on morphological analysis of the target language, but we show that an unsupervised Bayesian model of morphology can successfully be used in place of a supervised analyzer. We report significant improvements in translation quality when translating from English to Russian, Hebrew and Swahili.

Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

Artem Sokolov, Laura Jehl, Felix Hieber, Stefan Riezler

Sunday 4:55–5:20 — Leonesa I and II

We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.

Recurrent Continuous Translation Models
Nal Kalchbrenner, Phil Blunsom

Sunday 5:20–5:55 — Leonesa I and II

We introduce a class of probabilistic continuous translation models called Recurrent Continuous Translation Models that are purely based on continuous representations for words, phrases and sentences and do not rely on alignments or phrasal translation units. The models have a generation and a conditioning aspect. The generation of the translation is modelled with a target Recurrent Language Model, whereas the conditioning on the source sentence is modelled with a Convolutional Sentence Model. Through various experiments, we show first that our models obtain a perplexity with respect to gold translations that is >43% lower than that of state-of-the-art alignment-based translation models. Secondly, we show that they are remarkably sensitive to the word order, syntax, and meaning of the source sentence despite lacking alignments. Finally, we show that they match a state-of-the-art system when rescoring n-best lists of translations.

Information Extraction II

Leonesa III Chair: Ellen Riloff

Learning Biological Processes with Global Constraints
Aju Thalappillil Scaria, Jonathan Berant, Mengqiu Wang, Peter Clark, Justin Lewis, Brittany Harding, Christopher D. Manning

Sunday 4:05–4:30 — Leonesa III

Biological processes are complex phenomena involving a series of events that are related to one another through various relationships. Systems that can understand and reason over biological processes would dramatically improve the performance of semantic applications involving inference such as question answering (QA) - specifically "How?" and "Why?" questions. In this paper, we present the task of process extraction, in which events within a process and the relations between the events are automatically extracted from text. We represent processes by graphs whose edges describe a set of temporal, causal and co-reference event-event relations, and characterize the structural properties of these graphs (e.g., the graphs are connected). Then, we present a method for extracting relations between the events, which exploits these structural properties by performing joint inference over the set of extracted relations. On a novel dataset containing 148 descriptions of biological processes (released with this paper), we show significant improvement compared to baselines that disregard process structure.

Generating Coherent Event Schemas at Scale
Niranjan Balasubramanian, Stephen Soderland, Mausam, Oren Etzioni

Sunday 4:30–4:55 — Leonesa III

Chambers and Jurafsky (2009) demonstrated that event schemas can be automatically induced from text corpora. However, our analysis of their schemas identifies several weaknesses, e.g., some schemas lack a common topic and distinct roles are incorrectly mixed into a single actor. This is due in part to their pair-wise representation that treats subject-verb independently from verb-object. This often leads to subject-verb-object triples that are not meaningful in the real world. We present a novel approach to inducing open-domain event schemas that overcomes these limitations. Our approach uses co-occurrence statistics of semantically typed relational triples, which we call Rel-grams (relational n-grams). In a human evaluation, our schemas outperform Chambers's schemas by wide margins on several evaluation criteria. Both Rel-grams and event schemas are freely available to the research community.
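
As an illustrative sketch only (toy data; not the authors' released code), Rel-gram-style statistics amount to counting how often typed relational triples co-occur within a window:

    # Sketch: count co-occurrences of (subject-type, relation, object-type)
    # triples within a fixed window inside each document. Data invented.
    from collections import Counter

    docs = [
        [("PERSON", "was arrested in", "LOCATION"),
         ("PERSON", "was charged with", "CRIME"),
         ("PERSON", "was convicted of", "CRIME")],
    ]
    WINDOW = 2
    relgram_counts = Counter()
    for triples in docs:
        for i, t1 in enumerate(triples):
            for t2 in triples[i + 1:i + 1 + WINDOW]:
                relgram_counts[(t1, t2)] += 1

    # High-count triple pairs suggest coherent event schemas.
    print(relgram_counts.most_common(2))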

Modeling Missing Data in Distant Supervision for Information Extraction [TACL]
Alan Ritter, Luke Zettlemoyer, Mausam, Oren Etzioni

Sunday 4:55–5:20 — Leonesa III

Distant supervision algorithms learn information extraction models given only large readily available databases and text collections. Most previous work has used heuristics for generating labeled data, for example assuming that facts not contained in the database are not mentioned in the text, and facts in the database must be mentioned at least once. In this paper, we propose a new latent-variable approach that models missing data. This provides a natural way to incorporate side information, for instance modeling the intuition that text will often mention rare entities which are likely to be missing in the database. Despite the added complexity introduced by reasoning about missing data, we demonstrate that a carefully designed local search approach to inference is very accurate and scales to large datasets. Experiments demonstrate improved performance for binary and unary relation extraction when compared to learning with heuristic labels, including on average a 27% increase in area under the precision-recall curve in the binary case.

Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
John Philip McCrae, Philipp Cimiano, Roman Klinger

Sunday 5:20–5:55 — Leonesa III

Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a cross-lingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.

NLP Applications II

Eliza Anderson Amphitheater Chair: Min-Yen Kan

Automated Essay Scoring by Maximizing Human-Machine Agreement
Hongbo Chen, Ben He

Sunday 4:05–4:30 — Eliza Anderson Amphitheater

Previous approaches for automated essay scoring (AES) learn a rating model by minimizing either the classification, regression, or pairwise classification loss, depending on the learning algorithm used. In this paper, we argue that the current AES systems can be further improved by taking into account the agreement between human and machine raters. To this end, we propose a rank-based approach that utilizes listwise learning-to-rank algorithms for learning a rating model, where the agreement between the human and machine raters is directly incorporated into the loss function. Various linguistic and statistical features are utilized to facilitate the learning algorithms. Experiments on the publicly available English essay dataset, Automated Student Assessment Prize (ASAP), show that our proposed approach outperforms the state-of-the-art algorithms and achieves performance comparable to that of professional human raters, which suggests the effectiveness of our proposed method for automated essay scoring.
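
For intuition, a minimal numpy sketch of a listwise ranking loss in the ListNet style; the paper's agreement-oriented loss and features are more elaborate, and the data here are invented:

    # Sketch: cross entropy between top-one probabilities of the human and
    # predicted rankings of a list of essays (ListNet-style; toy data).
    import numpy as np

    def listnet_loss(scores_pred, scores_human):
        p_human = np.exp(scores_human) / np.exp(scores_human).sum()
        p_pred = np.exp(scores_pred) / np.exp(scores_pred).sum()
        return -(p_human * np.log(p_pred)).sum()

    human = np.array([4.0, 2.0, 3.0])   # rater scores for three essays
    good = np.array([1.2, 0.1, 0.8])    # model agrees with the ordering
    bad = np.array([0.1, 1.2, 0.8])     # model inverts the ordering
    print(listnet_loss(good, human) < listnet_loss(bad, human))  # True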

Powergrading: a Clustering Approach to Amplify Human Effort for Short Answer Grading [TACL]

Sumit Basu, Charles Jacobs, Lucy Vanderwende
Sunday 4:30–4:55 — Eliza Anderson Amphitheater

Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages. We also make our code and data freely available.

Success with Style: Using Writing Style to Predict the Success of Novels
Vikas Ganjigunte Ashok, Song Feng, Yejin Choi

Sunday 4:55–5:20 — Eliza Anderson Amphitheater

Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from its less successful counterparts, achieving accuracy of up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

Byron C. Wallace, Thomas A. Trikalinos, M. Barton Laws, Ira B. Wilson, Eugene Charniak
Sunday 5:20–5:55 — Eliza Anderson Amphitheater

We develop a novel generative model of conversation that jointly captures both the topical content and the speech act type associated with each utterance. Our model expresses both token emission and state transition probabilities as log-linear functions of separate components corresponding to topics and speech acts (and their interactions). We apply this model to a dataset comprising annotated patient-physician visits and show that the proposed joint approach outperforms a baseline univariate model.

Monday, October 21, 2013: Main Conference

Overview
8:00 – 9:00 Breakfast (Leonesa Foyer)

9:00 – 10:35 Parallel Sessions
Information Extraction III (Leonesa I and II)
Opinion Mining and Sentiment Analysis II (Leonesa III)
NLP for Social Media II (Princessa)

10:35–11:00 Break (Leonesa Foyer)

11:00–11:45 Parallel Sessions
Parsing (Leonesa I and II)
Semantics III (Leonesa III)
NLP Applications III (Princessa)

11:45–12:15 SIGDAT business meeting (Princessa)

12:15 – 1:45 Lunch

1:45 – 3:15 Plenary session I (Leonesa Ballroom)

3:15 – 3:45 Break (Leonesa Foyer)

3:45 – 4:45 Plenary Session II (Leonesa Ballroom)

4:45 – 5:15 Closing session (Leonesa Ballroom)

Information Extraction III
Leonesa I and II Chair: Andreas Vlachos

Harvesting Parallel News Streams to Generate Paraphrases of Event Relations
Congle Zhang, Daniel S. Weld

Monday 9:00–9:25 — Leonesa I and II

The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. Unfortunately, these methods have several drawbacks, such as confusing synonyms with antonyms and causes with effects. This paper introduces three Temporal Correspondence Heuristics that characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. We present experiments demonstrating that NEWSSPIKE significantly outperforms several competitive baselines. In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE.
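
A toy sketch of the underlying temporal-correspondence intuition (hypothetical data; not the NEWSSPIKE model itself): relation phrases linking the same argument pair on the same day are paraphrase candidates.

    # Sketch: group (date, arg1, arg2) "spikes" across parallel news
    # streams; relation phrases sharing a spike are candidate paraphrases.
    from collections import defaultdict
    from itertools import combinations

    extractions = [
        ("2013-02-10", "Dorner", "was hunted by", "police"),
        ("2013-02-10", "Dorner", "was pursued by", "police"),
        ("2013-02-11", "Dorner", "was hunted by", "police"),
    ]
    by_spike = defaultdict(set)
    for date, a1, rel, a2 in extractions:
        by_spike[(date, a1, a2)].add(rel)

    candidates = set()
    for rels in by_spike.values():
        candidates.update(combinations(sorted(rels), 2))
    print(candidates)  # {('was hunted by', 'was pursued by')}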

Relational Inference for Wikification
Xiao Cheng, Dan Roth

Monday 9:25–9:50 — Leonesa I and II

Wikification, commonly referred to as Disambiguation to Wikipedia (D2W), is the task of identifying concepts and entities in text and disambiguating them into the most specific corresponding Wikipedia pages. Previous approaches to D2W focused on the use of local and global statistics over the given text, Wikipedia articles and their link structures, to evaluate context compatibility among a list of probable candidates. However, these methods fail (often, embarrassingly) when some level of text understanding is needed to support Wikification. In this paper we introduce a novel approach to Wikification by incorporating, along with statistical methods, richer relational analysis of the text. We provide an extensible, efficient and modular Integer Linear Programming (ILP) formulation of Wikification that incorporates the entity-relation inference problem, and show that the ability to identify relations in text helps both candidate generation and the ranking of Wikipedia titles considerably. Our results show significant improvements in both Wikification and the TAC Entity Linking task.

Event Schema Induction with a Probabilistic Entity-Driven Model
Nathanael Chambers

Monday 9:50–10:15 — Leonesa I and II

Event schema induction is the task of learning high-level representations of complex events (e.g., a bombing) and their entity roles (e.g., perpetrator and victim) from unlabeled text. Event schemas have important connections to early NLP research on frames and scripts, as well as modern applications like template extraction. Recent research suggests event schemas can be learned from raw text. Inspired by a pipelined learner based on named entity coreference, this paper presents the first generative model for schema induction that integrates coreference chains into learning. Our generative model is conceptually simpler than the pipelined approach and requires far less training data. It also provides an interesting contrast with a recent HMM-based model. We evaluate on a common dataset for template schema extraction. Our generative model matches the pipeline's performance, and outperforms the HMM by 7 F1 points (20%).

Using Soft Constraints in Joint Inference for Clinical Concept Recognition
Prateek Jindal, Dan Roth

Monday 10:15–10:35 — Leonesa I and II

This paper introduces IQPs (Integer Quadratic Programs) as a way to model joint inference for the task of concept recognition in the clinical domain. IQPs make it possible to easily incorporate soft constraints into the optimization framework while still supporting exact global inference. We show that soft constraints give statistically significant performance improvements when compared to hard constraints.

Opinion Mining and Sentiment Analysis II
Leonesa III Chair: Bing Liu

Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

Svitlana Volkova, Theresa Wilson, David Yarowsky
Monday 9:00–9:25 — Leonesa III

Different demographics, e.g., gender or age, can demonstrate substantial variation in their language use, particularly in informal contexts such as social media. In this paper we focus on learning gender differences in the use of subjective language in English, Spanish, and Russian Twitter data, and explore cross-cultural differences in emoticon and hashtag use for male and female users. We show that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian. Our results show statistically significant relative F-measure improvements over the gender-independent baseline: 1.5% and 1% for Russian, 2% and 0.5% for Spanish, and 2.5% and 5% for English, for polarity and subjectivity classification.

Opinion Mining in Newspaper Articles by Entropy-Based Word Connections
Thomas Scholz, Stefan Conrad
Monday 9:25–9:50 — Leonesa III

A very valuable piece of information in newspaper articles is the tonality of extracted statements. Analyzing the tonality of newspaper articles requires either a large human effort, when it is carried out by media analysts, or an automated approach that is as precise as possible for a Media Response Analysis (MRA). To this end, we compare several state-of-the-art approaches for Opinion Mining in newspaper articles in this paper. Furthermore, we introduce a new technique for extracting entropy-based word connections, which identifies the word combinations that create a tonality. In the evaluation, we use two different corpora consisting of news articles, on which we show that the new approach achieves better results than the four state-of-the-art methods.
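
As a rough illustration of the entropy intuition (hypothetical formulation and toy German example; the paper's measure differs in detail), a word whose co-occurrence mass concentrates on few partners has low entropy, suggesting a strong tonality-bearing connection:

    # Sketch: entropy of a word's co-occurrence distribution. Toy counts.
    import math
    from collections import Counter

    cooc = Counter({("verfehlt", "Ziel"): 8, ("verfehlt", "Wirkung"): 2})

    def entropy(word, cooc):
        counts = [c for (w, _), c in cooc.items() if w == word]
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts)

    print(entropy("verfehlt", cooc))  # ~0.72 bits: concentrated connection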

Collective Opinion Target Extraction in Chinese Microblogs
Xinjie Zhou, Xiaojun Wan, Jianguo Xiao

Monday 9:50–10:15 — Leonesa III

Microblog messages pose severe challenges for current sentiment analysis techniques due to inherent characteristics such as the length limit and informal writing style. In this paper, we study the problem of extracting opinion targets from Chinese microblog messages. Such a fine-grained word-level task has not yet been well investigated in microblogs. We propose an unsupervised label propagation algorithm to address the problem. The opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages may focus on similar opinion targets. Topics in microblogs are identified by hashtags or using clustering algorithms. Experimental results on Chinese microblogs show the effectiveness of our framework and algorithms.
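
A minimal numpy sketch of label propagation over a message-similarity graph, as a simplified stand-in for the paper's algorithm (data hypothetical):

    # Sketch: labels from seed messages spread to similar unlabeled ones.
    import numpy as np

    S = np.array([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.2],
                  [0.1, 0.2, 0.0]])          # pairwise message similarity
    S = S / S.sum(axis=1, keepdims=True)     # row-normalize
    Y = np.array([[1.0, 0.0],                # message 0 seeded: target A
                  [0.0, 0.0],                # message 1 unlabeled
                  [0.0, 1.0]])               # message 2 seeded: target B
    F, alpha = Y.copy(), 0.5
    for _ in range(50):
        F = alpha * (S @ F) + (1 - alpha) * Y
    print(F.argmax(axis=1))  # message 1 inherits target A from message 0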

Detecting Promotional Content in Wikipedia
Shruti Bhosale, Heath Vinicombe, Raymond Mooney

Monday 10:15–10:35 — Leonesa III

This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and meta-features.

NLP for Social Media II
Princessa Chair: Chin-Yew Lin

Learning Topics and Positions from Debatepedia
Swapna Gottipati, Minghui Qiu, Yanchuan Sim, Jing Jiang, Noah A. Smith

Monday 9:00–9:25 — Princessa

We explore Debatepedia, a community-authored encyclopedia of sociopolitical debates, as evidence for inferring a low-dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation's usefulness in attaching opinionated documents to arguments and its consistency with human judgments about positions.

A Unified Model for Topics, Events and Users on Twitter
Qiming Diao, Jing Jiang

Monday 9:25–9:50 — Princessa

With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for people to post short and instant messages. On the one hand, people tweet about their daily lives; on the other hand, when major events happen, people also follow and tweet about them. Moreover, people's posting behaviors on events are often closely tied to their personal interests. In this paper, we try to model topics, events and users on Twitter in a unified way. We propose a model which combines an LDA-like topic model and the Recurrent Chinese Restaurant Process to capture topics and events. We further propose a duration-based regularization component to find bursty events. We also propose to use event-topic affinity vectors to model the association between events and topics. Our experiments show that our model can accurately identify meaningful events and that the event-topic affinity vectors are effective for event recommendation and for grouping events by topics.

Authorship Attribution of Micro-Messages
Roy Schwartz, Oren Tsur, Ari Rappoport, Moshe Koppel

Monday 9:50–10:15 — Princessa

Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author's unique "signature", and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature ("flexible patterns") and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.
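
As a baseline illustration (character n-gram profiles, a standard authorship-attribution technique, not the paper's flexible patterns; data invented):

    # Sketch: attribute a short message to the author whose character
    # trigram profile is most similar under cosine similarity.
    from collections import Counter
    import math

    def profile(texts, n=3):
        c = Counter(g for t in texts for g in
                    (t[i:i + n] for i in range(len(t) - n + 1)))
        norm = math.sqrt(sum(v * v for v in c.values()))
        return {g: v / norm for g, v in c.items()}

    def cosine(p, q):
        return sum(w * q.get(g, 0.0) for g, w in p.items())

    authors = {"alice": profile(["gonna be sooo good :)"]),
               "bob": profile(["Per my last email, kindly advise."])}
    test = profile(["sooo gonna like this :)"])
    print(max(authors, key=lambda a: cosine(authors[a], test)))  # alice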

Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
Wiltrud Kessler, Jonas Kuhn

Monday 10:15–10:35 — Princessa

This short paper presents a pilot study investigating the training of a standard Semantic Role Labeling (SRL) system on product reviews for the new task of detecting comparisons. An (opinionated) comparison consists of a comparative "predicate" and up to three "arguments": the entity evaluated positively, the entity evaluated negatively, and the aspect under which the comparison is made. In user-generated product reviews, the "predicate" and "arguments" are expressed in highly heterogeneous ways; but since the elements are textually annotated in existing datasets, SRL is technically applicable. We address the interesting question of how well training an out-of-the-box SRL model works for English data. We observe that even without any feature engineering or other major adaptations to our task, the system outperforms a reasonable heuristic baseline in all steps (predicate identification, argument identification and argument classification) and on three different datasets.

Parsing
Leonesa I and II Chair: Keith Hall

A Multi-Teraflop Constituency Parser using GPUs
John Canny, David Hall, Dan Klein
Monday 11:00–11:25 — Leonesa I and II

Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers have been modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU's practical maximum speed (a teraflop for recent GPUs), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over one thousand sentences/second (1 trillion rules/sec) for unpruned CKY evaluation of the Berkeley Parser grammar. The techniques we introduce include grammar compilation, recursive symbol clustering and blocking, and cache-sharing.
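
For reference, the computation being accelerated is the classic CKY chart recursion, shown here as a toy Python recognizer (grammar invented):

    # Sketch: CKY recognition over a tiny binary grammar; the paper's
    # contribution is executing this chart loop at near-peak GPU speed.
    from collections import defaultdict

    binary = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
    lexicon = {"the": "Det", "dog": "N", "ate": "V", "bone": "N"}

    def cky(words):
        n = len(words)
        chart = defaultdict(set)
        for i, w in enumerate(words):
            chart[(i, i + 1)].add(lexicon[w])
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                k = i + span
                for j in range(i + 1, k):          # split point
                    for b in chart[(i, j)]:
                        for c in chart[(j, k)]:
                            if (b, c) in binary:
                                chart[(i, k)].add(binary[(b, c)])
        return "S" in chart[(0, n)]

    print(cky("the dog ate the bone".split()))  # True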

Fish Transporters and Miracle Homes: How Compositional Distributional Semantics can Help NP Parsing

Angeliki Lazaridou, Eva Maria Vecchi, Marco Baroni
Monday 11:25–11:45 — Leonesa I and II

In this work, we argue that measures that have been shown to quantify the degree of semantic plausibility of phrases, as obtained from their compositionally-derived distributional semantic representations, can resolve syntactic ambiguities. We exploit this idea to choose the correct parsing of NPs (e.g., (live fish) transporter rather than live (fish transporter)). We show that our plausibility cues outperform a strong baseline and significantly improve performance when used in combination with state-of-the-art features.
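
A toy numpy sketch of the bracketing idea under strong simplifying assumptions (random vectors, additive composition, plausibility approximated by similarity of the composed constituent to its head; the paper's composition functions and cues are richer):

    # Sketch: prefer the bracketing whose composed subconstituent looks
    # more semantically plausible. All vectors are synthetic.
    import numpy as np

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in ["live", "fish", "transporter"]}
    emb["fish"] = emb["live"] + 0.1 * rng.normal(size=50)  # toy: related

    left = cos(emb["live"] + emb["fish"], emb["fish"])      # (live fish)
    right = cos(emb["fish"] + emb["transporter"], emb["transporter"])
    print("(live fish) transporter" if left > right
          else "live (fish transporter)")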

Semantics III
Leonesa III Chair: Martha Palmer

Learning Distributions over Logical Forms for Referring Expression Generation
Nicholas FitzGerald, Yoav Artzi, Luke Zettlemoyer

Monday 11:00–11:25 — Leonesa III

We present a new approach to referring expression generation, casting it as a density estimation problem where the goal is to learn distributions over logical expressions identifying sets of objects in the world. Despite an extremely large space of possible expressions, we demonstrate effective learning of a globally normalized log-linear distribution. This learning is enabled by a new, multi-stage approximate inference technique that uses a pruning model to construct only the most likely logical forms. We train and evaluate the approach on a new corpus of references to sets of visual objects. Experiments show the approach is able to learn accurate models, which generate over 87% of the expressions people used. Additionally, on the previously studied special case of single object reference, we show a 35% relative error reduction over the previous state of the art.

Learning to Rank Lexical Substitutions
György Szarvas, Róbert Busa-Fekete, Eyke Hüllermeier

Monday 11:25–11:45 — Leonesa III

The problem of replacing a word with a synonym that fits well in its sentential context is known as the lexical substitution task. In this paper, we tackle this task as a supervised ranking problem. Given a dataset of target words, their sentential contexts and the potential substitutions for the target words, the goal is to train a model that accurately ranks the candidate substitutions based on their contextual fitness. As a key contribution, we customize and evaluate several learning-to-rank models for the lexical substitution task, including classification-based and regression-based approaches. On two datasets widely used for lexical substitution, our best models significantly advance the state of the art.

NLP Applications III
Princessa Chair: Richard Sproat

Identifying Manipulated Offerings on Review Portals
Jiwei Li, Myle Ott, Claire Cardie

Monday 11:00–11:25 — Princessa

Recent work has developed supervised methods for detecting deceptive opinion spam: fake reviews written to sound authentic and deliberately mislead readers. We are instead interested in identifying which offerings (e.g., hotels, restaurants) are posting fake reviews, and we develop a semi-supervised approach to the task that is based on a manifold ranking algorithm and requires only a small set of labeled truthful and deceptive reviews for training. With no gold standard information to indicate with certainty which offerings are posting fake reviews, we introduce a novel evaluation procedure that ranks artificial instances of real offerings for which the number of deceptive reviews is known. Experiments are conducted on a newly constructed dataset of hotel reviews, and results show that the proposed method outperforms state-of-the-art learning baselines.
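
For intuition, a minimal numpy sketch of generic manifold ranking, the algorithm family the paper builds on (graph and data hypothetical):

    # Sketch: scores spread from a few labeled deceptive reviews to
    # similar unlabeled ones over a similarity graph.
    import numpy as np

    W = np.array([[0.0, 0.8, 0.1],
                  [0.8, 0.0, 0.3],
                  [0.1, 0.3, 0.0]])              # review-review similarity
    D = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D @ W @ D                                # symmetric normalization
    y = np.array([1.0, 0.0, 0.0])                # review 0 labeled deceptive
    f, alpha = y.copy(), 0.9
    for _ in range(100):
        f = alpha * (S @ f) + (1 - alpha) * y
    print(f)  # review 1 ranks above review 2 for deceptiveness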

Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems

Julien Gaillard, Marc El-Beze, Eitan Altman, Emmanuel Ethis
Monday 11:25–11:45 — Princessa

Recommendation systems take advantage of product and user information in order to propose items to consumers. Collaborative, content-based and a few hybrid recommendation systems (RS) have been developed in the past. In contrast, we propose a new domain-independent semantic RS. By providing textually well-argued recommendations, we aim to give the end user more responsibility in the decision of whether or not to purchase the recommended item. The system includes a new similarity measure that maintains both the accuracy of rating predictions and coverage. We propose an innovative way to apply a fast adaptation scheme at a semantic level, providing recommendations and arguments in phase with the very recent past. We have performed several experiments on film data, providing textually well-argued recommendations.

Plenary Session I
Leonesa Ballroom Chair: Anna Korhonen

Regularized Minimum Error Rate Training
Michel Galley, Chris Quirk, Colin Cherry, Kristina Toutanova

Monday 1:45–2:15 — Leonesa Ballroom

Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore easily prone to overfitting. Second, it incorporates a difficult non-convex optimization problem that becomes intricate with many parameters to tune. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as L2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers, L0 and a modification of L2, and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT's exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
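
A heavily simplified, hypothetical sketch of the idea of an L0-regularized line search (toy error surface; real MERT computes the exact piecewise-constant corpus error along each direction):

    # Sketch: along a search direction d, prefer step sizes that keep the
    # error low while zeroing out feature weights (L0 penalty).
    import numpy as np

    def error(w):                       # toy stand-in for corpus error
        return (w[0] - 1.0) ** 2 + 0.1 * w[1] ** 2

    def l0(w, eps=1e-8):                # number of nonzero weights
        return np.sum(np.abs(w) > eps)

    w, d, lam = np.array([0.0, 0.3]), np.array([1.0, -0.3]), 0.05
    gammas = np.linspace(-2, 2, 401)
    scores = [error(w + g * d) + lam * l0(w + g * d) for g in gammas]
    g_best = gammas[int(np.argmin(scores))]
    print(w + g_best * d)  # step of ~1 zeroes w[1] and wins under L0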

Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

Andrew J. Anderson, Elia Bruni, Ulisse Bordignon, Massimo Poesio, Marco Baroni
Monday 2:15–2:45 — Leonesa Ballroom

Traditional distributional semantic models extract word meaning representations from co-occurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the co-occurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained, outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
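
For intuition, a representational-similarity-style sketch of such a correlation analysis with synthetic data (not the paper's pipeline):

    # Sketch: compare model-based and brain-based pairwise similarities
    # for the same concepts via rank correlation. Vectors are synthetic.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)
    model_vecs = rng.normal(size=(10, 50))     # image-based concept vectors
    brain_vecs = model_vecs + 0.5 * rng.normal(size=(10, 50))  # noisy stand-in

    def pair_sims(X):
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sims = Xn @ Xn.T
        iu = np.triu_indices(len(X), k=1)      # upper triangle, no diagonal
        return sims[iu]

    rho, p = spearmanr(pair_sims(model_vecs), pair_sims(brain_vecs))
    print(rho)  # positive: model similarities track "brain" similarities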

Easy Victories and Uphill Battles in Coreference Resolution
Greg Durrett, Dan Klein

Monday 2:45–3:15 — Leonesa Ballroom

Classical coreference systems encode various syntactic, discourse, and semantic phenomena explicitly, using heterogeneous features computed from hand-crafted heuristics. In contrast, we present a state-of-the-art coreference system that captures such phenomena implicitly, with a small number of homogeneous feature templates examining shallow properties of mentions. Surprisingly, our features are actually more effective than the corresponding hand-engineered ones at modeling these key linguistic phenomena, allowing us to win "easy victories" without crafted heuristics. These features are successful on syntax and discourse; however, they do not model semantic compatibility well, nor do we see gains from experiments with shallow semantic features from the literature, suggesting that this approach to semantics is an "uphill battle". Nonetheless, our final system outperforms the Stanford system (Lee et al. (2011), the winner of the CoNLL 2011 shared task) by 3.5% absolute on the CoNLL metric, and outperforms the IMS system (Bjorkelund and Farkas (2012), the best publicly available English coreference system) by 1.9% absolute.

Plenary Session II
Leonesa Ballroom Chair: Tim Baldwin

Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction

Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky
Monday 3:45–4:15 — Leonesa Ballroom

Many statistical learning problems in NLP call for local model search methods. But accuracy tends to suffer with current techniques, which often explore either too narrowly or too broadly: hill-climbers can get stuck in local optima, whereas samplers may be inefficient. We propose to arrange individual local optimizers into organized networks. Our building blocks are operators of two types: (i) transform, which suggests new places to search, via non-random restarts from already-found local optima; and (ii) join, which merges candidate solutions to find better optima. Experiments on grammar induction show that pursuing different transforms (e.g., discarding parts of a learned model or ignoring portions of training data) results in improvements. Groups of locally-optimal solutions can be further perturbed jointly, by constructing mixtures. We used these tools to design several modular dependency grammar induction networks of increasing complexity. Our complete system achieves 48.6% directed dependency accuracy (average over all 19 languages in the 2006/7 CoNLL data): more than 5% higher than the previous state-of-the-art.

Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
Kuzman Ganchev, Dipanjan Das

Monday 4:15–4:45 — Leonesa Ballroom

We present a framework for cross-lingual transfer of sequence information from a resource-rich source language to a resource-impoverished target language that incorporates soft constraints via posterior regularization. To this end, we use automatically word-aligned bitext between the source and target language pair, and learn a discriminative conditional random field model on the target side. Our posterior regularization constraints are derived from simple intuitions about the task at hand and from cross-lingual alignment information. We show improvements over strong baselines for two tasks: part-of-speech tagging and named-entity segmentation.

Conference Venue: Grand Hyatt Seattle

The Blewett Suite is on the 7th floor.
