Are Distributional Dimensions Semantic Features? Katrin Erk University of Texas at Austin Meaning in Context Symposium München September 2015 Joint work

Are Distributional Dimensions Semantic Features?

Katrin Erk, University of Texas at Austin

Meaning in Context Symposium, München, September 2015

Joint work with Gemma Boleda

Semantic features by example: Katz & Fodor

Different meanings of a word characterized by lists of semantic features

Semantic features

• In linguistics: Katz & Fodor, Wierzbicka, Jackendoff, Bierwisch, Pustejovsky, Asher, …

• In computational linguistics/AI: Schank, Wilks, Masterman, Sowa, …

“drink” in preference semantics (Wilks): ((*ANI SUBJ) (((FLOW STUFF) OBJE) (MOVE CAUSE))

Schank, Conceptual Dependencies

Semantic features: Characteristics

• Primitive (not themselves defined), unanalyzable

• Small set

• Lexicalized in all languages

• Combined, they characterize semantics of all lexical expressions in all languages

• Precise, fixed meaning, which is not part of language
  • Wilks: not so

• Individually enable inferences

• Feature lists or complex graphs

Compiled from: Wierzbicka, Geeraerts, Schank

Uses of semantic features

• Event structure in the lexical semantics of verbs (Levin)
  • Change-of-state verbs: [ [ x ACT ] CAUSE [ BECOME [ y <result-state> ] ] ]

• Handle polysemy (Pustejovsky, Asher)

• Characterize selectional constraints (e.g. in VerbNet)

• Characterize synonyms, also cross-linguistically (application: translation)

• Enable inferences: John is a bachelor ⇒ John is unmarried, John is a man

Are distributional dimensions semantic features?

Alligator:

believe-v 0.794065
american-a 2.245667
kill-v 1.946722
consider-v 0.047781
seem-v 0.410991
turn-v 0.919250
side-n 0.098926
serve-v 0.479459
involve-v 0.435661
report-v 0.483651
little-a 1.175299
big-a 1.468021
water-n 1.806485
attack-n 1.795050
much-a 0.011354
…

Computed from ukWaC + Wikipedia + BNC + Gigaword; 2-word window; PPMI transform
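How a vector like the one for “alligator” can arise may be easier to see in code. The following is a minimal sketch, not the pipeline used for the slide: it counts co-occurrences in a symmetric 2-word window and applies a PPMI transform. The toy sentence, the function names, and the choice of log base 2 are assumptions made for illustration; the real vectors also use lemma-and-part-of-speech dimensions such as water-n, which the sketch skips.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count (target, context) pairs within `window` words on each side."""
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Turn raw pair counts into positive pointwise mutual information."""
    total = sum(counts.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), n in counts.items():
        target_totals[target] += n
        context_totals[context] += n
    weights = {}
    for (target, context), n in counts.items():
        # PMI = log( P(t, c) / (P(t) * P(c)) ); base 2 is an assumption.
        pmi = math.log2(n * total / (target_totals[target] * context_totals[context]))
        weights[(target, context)] = max(0.0, pmi)  # clip negatives to zero
    return weights


# Toy corpus, invented for this example.
toy_corpus = (
    "the alligator attacked the swimmer in the water near the big alligator"
).split()
weights = ppmi(cooccurrence_counts(toy_corpus, window=2))

# The distributional vector for one target: context word -> PPMI weight.
alligator_vector = {c: w for (t, c), w in weights.items() if t == "alligator"}
```

Each key of `alligator_vector` is a distributional dimension in the sense discussed here: a context word with an association weight, not a hand-chosen semantic primitive.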


• [The] differences between vector space encoding and more familiar accounts of meaning is easy to exaggerate. For example, a vector space encoding is entirely compatible with the traditional doctrine that concepts are ‘bundles’ of semantic features. Indeed, the latter is a special case of the former, the difference being that […] semantic dimensions are allowed to be continuous.

(Fodor and Lepore 1999: All at Sea in Semantic Space) (About connectionism and particularly Churchland, not distributional models)


• If so, they either address or inherit methodological problems:
  • Coverage of a realistic vocabulary
  • Empirically determining semantic features
  • Meaning creep: Predicates used in Cyc did not stay stable in their meaning over the years (Wilks 2008)


• If so, they inherit theoretical problems
  • Lewis 1970: “Markerese”
  • Fodor et al 1980, Against Definitions; Fodor and Lepore 1999, All at Sea in Semantic Space
  • Asymmetry between words and primitives:
    • What makes the primitives more basic?

• Also, how can people communicate if their semantic spaces differ?

Outline

• Differences between distributional dimensions and semantic features

• Redefining the dichotomy

• No dichotomy after all

• Integrated inference

Semantic features: Characteristics

• Primitive (not themselves defined), unanalyzable

• Small set

• Lexicalized in all languages

• Combined, they characterize semantics of all lexical expressions in all languages

• Precise, fixed meaning, not part of language.

• Individually enable inferences

• Feature lists or complex graphs

Neither primitive nor with a fixed meaning

• Not unanalyzable: Any distributional feature can in principle be a distributional target

• Compare: Target and dimensions as a graph (with similarity determined on the basis of random walks):

[Figure: graph with nodes target, d1, d2, d3, dd1]

Neither primitive nor with a fixed meaning

• But are they treated as unanalyzed in practice?
  • Features in vector usually not analyzed further

• SVD, topic modeling, prediction-based models:
  • induce latent features
  • exploiting distributional properties of features
  • Are latent features unanalyzable? No, linked to original dimensions

• No fixed meaning, distributional features can be ambiguous

Then is it “Markerese”?

• Inference = deriving something non-distributional from distributional representations

• Inference from relation to other words
  • “X cause Y”, “Y trigger X” occur with similar X, Y, hence they are probably close in meaning
  • “alligator” appears in a subset of the contexts of “animal”, hence alligators are probably animals

• Inference from co-occurrence with extralinguistic information
  • Distributional vectors linked to images for the same target
  • Alligators are similar to crocodiles, crocodiles are listed in the ontology as animals, hence alligators are probably animals

No individual inferences

• Distributional representation as a whole, in the aggregate, allows for inferences using aggregate techniques:
  • Distributional similarity
  • Distributional inclusion
  • Whole-vector mappings to visual vectors
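Two of these aggregate techniques can be sketched in a few lines. The example below assumes sparse vectors stored as dictionaries, uses cosine for similarity, and uses a simple weighted-inclusion score (the share of the narrower word's weight that falls on contexts the broader word also has). Both measures are illustrative choices, and the toy weights are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: context -> weight)."""
    shared = set(u) & set(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inclusion(narrow, broad):
    """Share of `narrow`'s total weight on contexts that `broad` also has."""
    total = sum(narrow.values())
    if total == 0.0:
        return 0.0
    covered = sum(w for f, w in narrow.items() if broad.get(f, 0.0) > 0.0)
    return covered / total


# Toy vectors with invented weights.
alligator = {"water-n": 1.8, "attack-n": 1.8, "kill-v": 1.9}
animal = {"water-n": 0.9, "attack-n": 1.1, "kill-v": 1.2, "feed-v": 1.5, "wild-a": 2.0}

similarity = cosine(alligator, animal)
alligator_in_animal = inclusion(alligator, animal)  # all alligator contexts covered
animal_in_alligator = inclusion(animal, alligator)  # only part of animal covered
```

No single dimension licenses the inference here. The asymmetry between the two inclusion scores is a property of the whole vectors, which is what suggests that “alligator” is the narrower term.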

No individual inferences

• Feature-based inference possible with “John Doe” features:
  • Take text representation
  • Take apart into features that are individually almost meaningless
  • Aggregate of such features allows for inferences


Redefining the dichotomy

• Not semantic features versus distributional dimensions: individual features versus aggregate features

• Individual features:
  • Individually allow for inferences
  • May be relevant to grammar
  • Are introspectively salient
  • Not necessarily primitive
  • Also hypernyms and synonyms

• Aggregate features:
  • May be individually almost meaningless
  • Allow for aggregate inference

• Two modes of inference: individual and aggregate

Individual features in distributional representations

• Some distributional dimensions can be cognitively relevant features

• Thill et al 2014: Because distributional models focus on how words are frequently used, they point to how humans experience concepts

• Freedom (features from Baroni & Lenci 2010):
  • positive events: guarantee, secure, grant, defend, respect
  • negative events: undermine, deny, infringe on, violate


• Approaches that find cognitively plausible features distributionally:
  • Almuhareb & Poesio 2004
  • Cimiano & Wenderoth 2007
  • Schulte im Walde et al 2008: German association norms
  • Baroni et al 2010: STRUDEL
  • Baroni & Lenci 2010: Distributional memory
  • Devereux et al 2010: dependency paths extracted from Wikipedia


• Difficult: only small fraction of human-elicited features can be retrieved

• Baroni et al 2010: Distributional features tend to be different from human-elicited features
  • preference for “‘actional’ and ‘situated’ descriptions”
  • motorcycle:
    • elicited: wheels, dangerous, engine, fast
    • distributional: ride, sidecar, park, road


Not a competition

• Use both kinds of features!

• Computational perspective:
  • Distributional features are great
    • learned automatically
    • enable many inferences
  • Human-defined semantic features are great
    • less noisy
    • enable inferences with more certainty
    • enable inferences that distributional models do not provide

• How can we integrate the two?

Speculation: Learning both individual and aggregate features

• Learner makes use of features from textual environment

• Some features almost meaningless, others more meaningful

• Some of them relevant to grammar (CAUSE, BECOME)

• Both meaningful and near-meaningless features enter aggregate inference

• Only certain features allow individual inference

• (Unclear: This should not be feature lists, there is structure! But where does that fit in this picture?)


Inferring individual features from aggregates

• Johns and Jones 2012:
  • Compute weight of feature bird for nightingale as summed similarity of nightingale to known birds

• Fagarasan/Vecchi/Clark 2015:
  • Learn a mapping from distributional vectors to vectors of individual features

• Herbelot/Vecchi 2015:
  • Learn a mapping from distributional space to “set-theoretic space”, vectors of quantified individual features (ALL apes are muscular, SOME apes live on coasts)
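The summed-similarity idea in the first bullet can be sketched as follows. This follows only the one-line description above; the published model may differ in its details. The toy vectors, the feature lists, and the use of cosine as the similarity measure are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_weight(target, words_with_feature, vectors):
    """Weight of a feature for `target`: summed similarity of `target`
    to the words already known to carry that feature."""
    return sum(cosine(vectors[target], vectors[w]) for w in words_with_feature)


# Toy distributional vectors, invented for this example.
vectors = {
    "nightingale": [0.9, 0.8, 0.1],
    "robin": [1.0, 0.7, 0.0],
    "sparrow": [0.8, 0.9, 0.1],
    "trout": [0.1, 0.0, 1.0],
    "salmon": [0.0, 0.1, 0.9],
}

# Words for which an individual feature is already known (hypothetical lists).
known = {"bird": ["robin", "sparrow"], "fish": ["trout", "salmon"]}

weights = {
    feature: feature_weight("nightingale", words, vectors)
    for feature, words in known.items()
}
best_feature = max(weights, key=weights.get)
```

The individual feature is inferred from an aggregate: no dimension of the nightingale vector means “bird”, but its overall closeness to known birds gives that feature a high weight.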

Inferring individual features from aggregates

• Gupta et al 2015:
  • Regression to learn properties of unknown cities/countries from those of known cities/countries

• Snow/Jurafsky/Ng 2006:
  • Infer location of a word in the WordNet hierarchy using a distributional co-hyponymy classifier
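The regression idea can be illustrated with a small ridge regression from distributional vectors to one numeric property. This is a sketch under stated assumptions, not the setup of any cited paper: the two-dimensional vectors, the property values, the entity names, and the penalty are all invented, and the solver is a plain Gaussian elimination so that the example needs no external library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_ridge(xs, ys, penalty=1e-6):
    """Least-squares weights w minimising |X w - y|^2 + penalty * |w|^2."""
    dim = len(xs[0])
    gram = [
        [sum(x[i] * x[j] for x in xs) + (penalty if i == j else 0.0)
         for j in range(dim)]
        for i in range(dim)
    ]
    moment = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(dim)]
    return solve(gram, moment)


def predict(weights, x):
    """Predicted property value for one distributional vector."""
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical entities with known property values (all numbers invented).
known_vectors = {"country_a": [1.0, 0.0], "country_b": [0.0, 1.0], "country_c": [1.0, 1.0]}
known_property = {"country_a": 2.0, "country_b": 3.0, "country_c": 5.0}

names = sorted(known_vectors)
weights = fit_ridge([known_vectors[n] for n in names], [known_property[n] for n in names])

# Predict the property of an entity seen only distributionally.
prediction = predict(weights, [2.0, 1.0])
```

The learned weights range over all dimensions at once, so the predicted property again comes from the vector as an aggregate and not from any one dimension.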

Individual features influencing aggregate representations

• Andrews/Vigliocco/Vinson 2009, Roller/Schulte im Walde 2013: Topic modeling, including known individual features of words in the text

• Faruqui et al 2015: Update vector representation to better match known synonymy, hypernymy, hyponymy information
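The second bullet, updating vectors to match known lexical relations, can be sketched as an iterative neighbour-averaging update. The version below assumes uniform weights, so each word is pulled halfway between its original vector and the mean of its lexicon neighbours; the weighting, the iteration count, the toy vectors, and the toy lexicon are assumptions made for illustration.

```python
import math


def retrofit(vectors, lexicon, iterations=10):
    """Pull each word's vector toward the vectors of its lexicon neighbours
    while keeping it close to its original distributional vector."""
    new = {w: [float(x) for x in v] for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if word not in new or not nbrs:
                continue  # words without usable neighbours stay unchanged
            d = len(nbrs)
            for k in range(len(new[word])):
                nbr_sum = sum(new[n][k] for n in nbrs)
                # Average of the neighbour mean and the original vector.
                new[word][k] = (nbr_sum + d * vectors[word][k]) / (2 * d)
    return new


def distance(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors and a toy synonym lexicon, invented for this example.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [5.0, 5.0]}
lexicon = {"happy": ["glad"], "glad": ["happy"]}

retrofitted = retrofit(vectors, lexicon)
before = distance(vectors["happy"], vectors["glad"])
after = distance(retrofitted["happy"], retrofitted["glad"])
```

Here an individual feature (a synonymy link) reshapes the aggregate representation: the two synonyms move closer together, while “table”, which has no lexicon entry, keeps its original vector.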

Individual features influencing aggregate representations

• Boyd-Graber/Blei/Zhu 2007:
  • WordNet hierarchy as part of a topic model
  • Generate a word: choose topic, then walk down WN hierarchy based on the topic
  • Aim: best WN sense for each word in context

• Riedel et al 2013, Rocktäschel et al 2015: Universal Schema
  • Relation characterized by vector of Named Entity pairs (entity pairs that fill the relation)
  • Both human-defined and corpus-extracted relations
  • Matrix factorization over union of human-defined and corpus-extracted relations
  • Predict whether a relation holds of an entity pair

Conclusion

• Distributional features are not semantic features:
  • Not primitive
  • Inference from relations between word representations, co-occurrence with extra-linguistic information

• Not (necessarily) individually meaningful
  • Inference from the aggregate of features
  • Two modes of inference: individual and aggregate

• Use both individual and aggregate features
  • How to integrate the two, and infer one from the other?

References

• Almuhareb, A., & Poesio, M. (2004). Attribute-based and value-based clustering: an evaluation (pp. 1–8). Presented at the EMNLP.

• Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.

• Asher, N. (2011) Lexical meaning in context: a web of words. Cambridge University Press.

• Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A Corpus-Based Semantic Model Based on Properties and Types. Cognitive Science, 34(2), 222–254.

• Baroni, M., & Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–721.

• Bierwisch, M. (1969) On certain problems of semantic representation. Foundations of Language 5: 153–84.

• Boyd-Graber, J., Blei, D. M., & Zhu, X. (2007). A Topic Model for Word Sense Disambiguation. Presented at the EMNLP.

• Cimiano, Philipp and Johanna Wenderoth. 2007. Automatic acquisition of ranked qualia structures from the Web. In Proceedings of ACL, pages 888–895, Prague.

• Devereux, B., Pilkington, N., Poibeau, T., & Korhonen, A. (2010). Towards Unrestricted, Large-Scale Acquisition of Feature-Based Conceptual Representations from Corpus Data. Research on Language and Computation, 7(2-4), 137–170.

• Fagarasan, L., E. Vecchi, S. Clark (2015). From distributional semantics to feature norms: grounding semantic models in human perceptual data. Proceedings of IWCS.

• Faruqui, M., Dodge, J., Jauhar, S., Dyer, C., Hovy, E., & Smith, N. (2015). Retrofitting Word Vectors to Semantic Lexicons. Presented at the NAACL.

• Fodor, J., Garrett, M. F., Walker, E. C. T., & Parkes, C. H. (1980). Against definitions. Cognition, 8(3), 263–367.

• Fodor, J., & Lepore, E. (1999). All at sea in semantic space: Churchland on meaning similarity. The Journal of Philosophy, 96(8), 381–403.

• Geeraerts, D. (2009) Theories of Lexical Semantics. Oxford University Press.

• Gupta, A., Boleda, G., Baroni, M., & Pado, S. (2015). Distributional vectors encode referential attributes. Proceedings of EMNLP.

• Herbelot, A., & Vecchi, E. M. (2015). Building a shared world: Mapping distributional to model-theoretic semantic spaces. Proceedings of EMNLP.

• Jackendoff, R. (1990) Semantic Structures. MIT Press.

• Johns, B. T., & Jones, M. N. (2012). Perceptual Inference Through Global Lexical Similarity. Topics in Cognitive Science, 4(1), 103–120.

• Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39(2), 170.

• Lewis, D. (1970). General semantics. Synthese, 22(1), 18–67.

• Pustejovsky, J. (1991) The Generative Lexicon. Computational Linguistics 17(4).

• Rappaport Hovav, M., & Levin, B. (2001). An event structure account of English resultatives. Language, 77(4).

• Riedel, S., Yao, L., McCallum, A., & Marlin, B. (2013). Relation Extraction with Matrix Factorization and Universal Schemas. Presented at the NAACL.

• Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting Logical Background Knowledge into Embeddings for Relation Extraction. Presented at the NAACL.

• Roller, S., & Schulte im Walde, S. (2013). A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities. Presented at the EMNLP.

• Schank, R. (1969). A conceptual dependency parser for natural language. Proceedings of COLING 1969.

• Schulte im Walde, S., Melinger, A., Roth, M., & Weber, A. (2008). An Empirical Characterisation of Response Types in German Association Norms. Research on Language and Computation, 6(2), 205–238.

• Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence (pp. 801–808). Presented at the ACL-COLING.

• Sowa, J. (1992). Logical Structures in the Lexicon. In J. Pustejovsky & S. Bergler (Eds.), Lexical semantics and knowledge representation (LNCS, Vol. 627, pp. 39–60).

• Thill, S., Pado, S., & Ziemke, T. (2014). On the Importance of a Rich Embodiment in the Grounding of Concepts: Perspectives From Embodied Cognitive Science and Computational Linguistics. Topics in Cognitive Science, 6(3), 545–558.

• Wierzbicka, A. (1996) Semantics. Primes and Universals. Oxford University Press.

• Wilks, Y. (2008). What would a Wittgensteinian computational linguistics be like? Presented at the AISB workshop on computers and philosophy, Aberdeen.
