How to marry a star: probabilistic constraints for meaning in context

Katrin Erk, University of Texas at Austin
Aurélie Herbelot, University of Trento

arXiv:2009.07936v1 [cs.CL] 16 Sep 2020

Abstract

In this paper, we derive a notion of word meaning in context from Fillmore's 'semantics of understanding', in which a listener draws on their knowledge of both language and the world to 'envision' the situation described in an utterance. We characterize utterance understanding as a combination of cognitive semantics and Discourse Representation Theory, formalized as a situation description system: a probabilistic model which takes utterance understanding to be the mental process of describing one or more situations that would account for an observed utterance. Our model captures the interplay of local and global contexts and their joint influence upon the lexical representation of sentence constituents. We implement the system using a directed graphical model, and apply it to examples containing various contextualisation phenomena.

1 Introduction

Word meaning is flexible. This flexibility is often characterised by distinguishing the 'timeless' meaning of a lexical item (its definition(s) in a dictionary) and its 'speech act' or 'token' meaning – the one it acquires by virtue of being used in the context of a particular sentence (Grice 1968). The generation of a token meaning goes well beyond word sense disambiguation and typically involves speakers' knowledge of the world as well as their linguistic knowledge; for instance, Searle (1980: pp. 222-223) reminds us that it would be inappropriate to cut grass with a knife, or to cut a cake with a lawnmower.

Fillmore's frame semantics (Fillmore 1982) is a prominent example of a linguistic representation which combines aspects of the lexicon with the shared world knowledge of a linguistic community. Frames are "schematic representations of the conceptual structures and patterns of beliefs, practices, institutions, images, etc. that provide a foundation for meaningful interaction in a given speech community" (Fillmore et al. 2003: p. 235). The approach is the product of Fillmore's 'semantics of understanding', or 'U-semantics' (Fillmore 1985), the aim of which is to give "an account of the ability of a native speaker to 'envision' the 'world' of the text under an interpretation of its elements" (p. 235).
The idea behind the notion of envisioning is that the speaker uses the frames that are 'evoked' by the words in the utterance to "[construct] an interpretation of the whole" (p. 233). It seems that this envisioning process should capture at least some aspects of meaning in context: understanding a sentence requires dealing with word senses; in frame semantics, it additionally involves semantic roles and their selectional constraints, contributing to utterance-specific token meanings. Finally, frames also include larger 'scripts' or 'scenarios' (Fillmore 1982: p. 130) that can account for wider, topical influences on meaning. However, the treatment of utterance understanding in frame semantics has mainly been contained within the subfield of 'frame semantics parsing', which deals with automatically identifying the frames that are evoked by a sentence, as well as their relationships (e.g. Das et al. 2014, Johannsen et al. 2015 and especially Ferraro & Van Durme 2016). While a frame semantics parser accounts for part of the envisioning process, it falls short of outputting a full description of a scene, including elements that may not be explicit in the original sentence.

Our main aim in this paper is to explore, and provide a formalization of, the process of envisioning, and the kinds of contextualized meanings that emerge from that process. In particular, we will say that meaning in context emerges from an interaction and integration of different constraints, some of them local, some of them more global. To give an overview of the type of phenomena we want to capture, we present next various types of meaning specializations occurring under the pressure of different constraints. We will then sketch how such constraints might be integrated in a probabilistic U-semantics which can in principle be used for both generation and understanding.

Let's first consider what contextual influences might play a role in shifting the meaning of a word. The first effect that comes to mind might be local context. Specific combinations of predicates and arguments activate given senses of the lexical items involved in the composition. This is known as 'selectional preference' and can be demonstrated with the following example:

(1) She drew a blade.

In this case, where words in both the predicate and the argument positions have multiple senses, the sentence can mean that the agent sketched either a weapon or a piece of grass, or that she randomly sampled either a weapon or a piece of grass, or that she pulled out a weapon. It is much less likely that she pulled out a piece of grass.
But word meaning is not only influenced by semantic-role neighbors. Global context is involved. (2) is a contrast pair adapted from an example by Ray Mooney (p.c.), with different senses of the word ball (sports equipment vs dancing event). Arguably, the sense of the predicate run is the same in (2a) and (2b), so the difference in the senses of ball must come from something other than the syntactic neighbors: some global topical context brought about by the presence of athlete in the first sentence, and violinist in the second.

(2) a. The athlete ran to the ball.
    b. The violinist ran to the ball.

There is even a whole genre of jokes resting on a competition between local and global topical constraints on meaning: the pun. Sentence (3) shows an example.

(3) The astronomer married the star.

This pun rests on two senses of the word star, which can be paraphrased as 'well-known person' and 'sun'. It is interesting that this sentence should even work as a pun: the predicate that applies to star, marry, clearly selects for a person as its theme. So if the influence of local context were to apply strictly before global context, marry should immediately disambiguate star towards the 'person' sense as soon as they combine. But the 'sun' sense is clearly present.[1] So local context and global topical context seem to be competing.

Finally, it is also notable that meaning can undergo modulation at a level more fine-grained than senses. Zeevat et al. (2017) distinguish 78 uses of the verb fall, many of them different but related, as in light fell on a table and her hair falls.

Having considered a range of constraints that act on different levels of granularity of a word's meaning, we now sketch how we think they might be encoded in a 'semantics of understanding', as we proposed earlier. To answer this question, we will posit a U-semantics structured as follows. First, we will have a standard frame semantics, and note that frames can range in 'size' from concepts to whole scenarios, potentially accounting for what we called 'local' and 'global' contexts (Fillmore 1982: p. 111).[2]

[1] In fact, our own intuitions about sentence (3) vary. One of us prominently perceives the reading where the astronomer weds a gigantic ball of fire; for the other one of us, the sentence oscillates between the two different senses of star.

[2] This notion of a frame differs from the one used by Löbner (2014) and Zeevat et al. (2017). There, a frame is a graph encoding attributes and their values.
Next, we will add a fine-grained, feature-based representation of lexical meaning to the standard framework to cater for property modulation, following e.g. Asher (2011), McNally & Boleda (2017), and Zeevat et al. (2017). Finally, we will link the cognitive semantics to a logical representation in order to model the interaction between actual utterances and the representations they evoke in the comprehender.

With this general structure in mind, we can hypothesize constraints of various types. At the conceptual level, certain senses (frames) of a word affect the senses of other lexical items in a sentence. So we require a notion of constraints acting upon combinations of frames. At the logical level, we must respect the formal structure of the utterance and ensure, for instance, that properties pertaining to a single discourse referent are compatible. So the entire constraint system needs to cross over the conceptual and logical levels (an issue also faced by other frameworks such as Asher 2011, McNally & Boleda 2017, Emerson 2018). Overall, the problem is to explain where and how such constraints are implemented and, most crucially, the dynamic process that will activate them given a certain utterance.

Our solution to the above problem is a probabilistic model which we will refer to as a situation description system. As mentioned previously, Fillmore's framework assumes that an interpreter actively 'envisions' the situation evoked by a sentence. Our claim is that envisioning naturally implements lexical constraints, and that modeling the imagination process required to arrive at an interpretation of the utterance automatically provides contextualised meaning representations. We will propose a formalization of the envisioning process – the situation description system – and provide a particular implementation of our proposal. We will illustrate how the system can imagine referents that are not mentioned in the observed utterance and how it builds particular lexical preferences in the process of interpretation – or indeed, remains agnostic about sense in the case of a pun.

Our framework relates to various strands of research in the linguistic and cognitive science literature. To close this introduction, we highlight how our proposal differs from – and complements – major avenues of research. We start by noting that Fillmorian frames can be about prototypical scenes, as well as cultural conventions. Because of that, any approach that builds on frame semantics will be pragmatic to some extent. We emphasize, however, that the kind of pragmatic reasoning involved in frame semantics is very different from what can be found in e.g. Rational Speech Act theory (Frank & Goodman 2012) and Relevance theory (Wilson & Sperber 2004), where the central mechanism is that the listener reasons over the intentions of the individual speaker, and this mechanism is used, among other things, to explain various types of word sense disambiguation.
It is certainly true that the listener reasons over the individual speaker, but we believe that much can already be accounted for by assuming that the listener draws on general conventions in the speaker's community to interpret an utterance, as McMahan & Stone (2015) also argue.

We also depart from Goodman and colleagues (Goodman et al. 2015, Goodman & Lassiter 2015), who do assume that concepts are generative devices that can simulate situations, but hypothesize a strict division between general cognition on the one hand and linguistic knowledge on the other. In their framework, general cognition can 'imagine' situations, but linguistic knowledge cannot do more than check the truth of utterances in the generated situations. We do not make such a distinction. On the contrary, we assume that lexical concepts are able to conjure up objects and events that realize those concepts. In this, our approach is more similar to Emerson (2018).[3]

[3] Like Goodman and colleagues, Emerson argues in favor of a strict division between linguistic knowledge and general cognition, with a linguistic knowledge that is discriminative rather than generative, but his model does not enforce this at a technical level.

Finally, our work can be related to a number of concerns raised by computational linguists about what it means for a system to perform 'Natural Language Understanding' (NLU). Bender & Koller (2020) argue that true NLU requires appropriately connecting linguistic form and meaning. They particularly emphasise the interplay of linguistic expressions with both conventional meanings (what we called 'timeless meaning') and communicative intents (what we called 'speech act meaning'). They propose a new version of the Turing test which specifically requires a model to 'at least hypothesize the underlying communicative intents' of speakers, especially the way that words refer to things and properties in their shared environment. In the same vein, Trott et al. (2020) call for systems that would adequately model construal, that is, the way that a speaker's choice of words reflects a specific way to frame a meaning and constrains the interpretation of the listener.

In what follows, we first give a high-level overview of a situation description system, highlighting its main components and their role in implementing constraints on various aspects of meaning (§2). We then proceed with a general formalization of the proposed framework in §3. Finally, §4 suggests an implementation based on a directed graphical model at three levels of complexity: with global constraints only, with semantic roles, and finally with concept combination at the individual feature level. We finish with illustrations of the system's behaviour, showing how the envisioning process 'fills in' the details of an utterance and how it deals with conflicting constraints, as in the pun example.
2 Model overview

In this section we describe our framework informally, using a simple example sentence, and show how it lends itself to the formalization of utterance understanding, as well as to the description of semantic constraints.

Our model is formulated as a system of constraints on utterance understanding, which we will refer to as a situation description system. In this model, constraints can be regarded as interacting forces which explain how the words of a speaker and the linguistic and world knowledge of a listener jointly contribute to utterance meaning.

We will formalize our model as a two-tier system (Pelletier 2017), combining cognitive aspects of utterance processing with logical description. That is, we have a system made of two components: a logical tier and a conceptual tier. We will assume that our logical tier takes the form of a restricted Discourse Representation Theory (DRT: Kamp 1981, Heim 1982) which we will refer to as eDRT (formal definitions will be provided later in §3). DRT is an ideal fit for our purposes because it specifically presents itself as a mental representation of an ongoing discourse, so it is a good interface between our conceptual tier and the logical aspects of the framework.[4]

[4] The present paper does not look at the discourse phenomena that DRT excels at, but we believe they can be integrated at a later stage.

We will define a situation description to consist of a conceptual representation G and an eDRS D. It represents a situation's logical form together with its associated conceptual content. A situation description can be regarded as a particular interpretation of an utterance. The situation description system defines a probability distribution over situation descriptions. That is, given an utterance such as The star shines, the system might attribute a fairly high probability to a situation description encoding a stargazing situation, and a lower probability to a situation description representing a famous person being particularly witty.

We will consider Fillmore's description of the understanding process as one in which

    "the lexical, grammatical and semantic material of the sentence [serves] as a 'blueprint' (to borrow an image from Fauconnier) off of which the interpreter constructs an interpretation of the whole. The interpreter accomplishes this by bringing to the 'blueprint' a great deal of knowledge, in particular knowledge of the interpretive frames which are evoked [...] but also including knowledge of the larger structure (the 'text') within which the sentence occurs" (Fillmore 1985: p. 233).
We will formalize this idea by having a situation description system produce descriptions that expand on a given utterance. The simple example above, the utterance u: The star shines, corresponds to the eDRS Du:

    [x, e | star(x), shine(e), Theme(e, x)]

One possible expansion could be the following expanded eDRS De:

    [x, e, y | star(x), shine(e), Theme(e, x), sun(x), sky(y), Location(e, y)]

where De adds to Du's content by specifying that the shining star happens to be a sun (thereby disambiguating star), and that the situation description contains an additional sky entity, with the Location of the shining event being the sky.

Figure 1 (a) illustrates a situation description system using the simple example from above, with the conceptual representation G on top and the eDRS at the bottom. The original Du is shown on the left-hand side of the eDRS in a black font. Its 'extended' version De corresponds to the whole eDRS shown in the figure, including the greyed-out referents and conditions. Unpacking the content of that eDRS, we have discourse referents x, e, respectively corresponding to a star entity and a shining event. For De, the discourse referents x, e, y include an additional sky entity. Similarly, we have conditions star(x), shine(e) and Theme(e, x) for Du, while De shows the additional sun(x) and sky(y), as well as Location(e, y).

We now go through the components of the conceptual tier, starting with individual concepts, then local constraints, and finally global constraints.
[Figure 1: An illustration of a situation description in our situation description system, given the utterance A star shines. Panel (a) shows the full description: a scenario mix over STARGAZE scenario tokens, the concepts Star, Shine and Sky, their feature vectors, the semantic roles Shine-THM and Shine-LOC, and the resulting Discourse Representation Structure. In this simple example, all concepts are associated with a single STARGAZING scenario, but more complex utterances may involve mixtures of scenarios. Panels (b) and (c) zoom in on an individual concept, and on a semantic role, respectively.]

We assume the existence of frames that can be described as concepts, which correspond to word senses. They are shown in light blue in Fig. 1, where part (b) zooms in on the concept STAR. Concepts can have instances; for example, a particular star might be an instance of STAR, or a particular shining event an instance of SHINE. The properties of such instances are described in feature vectors (purple in the figure), which describe meaning at a more fine-grained level than concepts.[5] Each concept instance on the conceptual side of the situation description corresponds to a discourse referent in the DRS. For example, the STAR instance corresponds to x. Some of the properties in the feature vector of the STAR instance are turned into conditions in the eDRS – including conditions not mentioned in the original utterance, in this case sun(x).

[5] We use the terms property and feature interchangeably here.

The local constraints that we consider here include selectional constraints imposed by semantic roles. Semantic roles are shown in darker blue in the figure, where part (c) zooms in on the influence that the THEME role of SHINE exerts on the STAR feature vector.
Semantic roles, too, are turned into conditions in the eDRS, such as Theme(e, x) in part (a) of the figure.

We finally come to the global constraints. We assume that the conceptual tier also includes frames that can be described as scenarios (light green in part (a) of the figure). Scenarios are larger settings that can encompass many entities and events, for instance a WEDDING scenario, or a STARGAZING scenario. We assume scenarios to be something like the generalized event knowledge of McRae & Matsuki (2009), who argue that event knowledge drives the listener's expectations for upcoming words.

Overall, our formalization uses exactly two kinds of frames: concept frames and scenario frames. This makes our formalization easier, but it is a simplification. FrameNet makes no such assumption; any frame can have a subframe (corresponding to our scenarios), or be a subframe (corresponding to our concepts), or both.[6]

[6] Another simplification that we adopt for now is that we assume that polysemy always manifests as a single word (or predicate in the DRS) being linked to different frames, like star the celestial object and star the person. Several recent proposals in lexical semantics describe a word as an overlay of multiple senses and sub-senses in a single fine-grained feature representation (McNally & Boleda 2017, Zeevat 2013).

In our example, we have three concepts STAR, SHINE and SKY, matching the repeated scenario STARGAZING. A more complex utterance might have a mix of multiple scenarios; for example, The robber paused in his flight to gaze at the shining star would probably involve an additional ROBBERY scenario. The extent to which different scenarios might be involved in a particular situation description is characterised by the 'scenario mix' shown at the top of Fig. 1. It is in fact this scenario mix that imposes a global constraint on understanding, in particular the constraint that scenario mixes preferably draw on only a few different scenarios rather than many. This means that different words in the utterance will tend to have senses that match the same scenario – like athlete and ball in sentence (2a). To stress this point: the preference for sparse scenario mixes is the only truly global constraint. All other components of the conceptual tier can be considered as lexicalized – concepts evoked by words, and scenarios linked to concepts.

Constraints are shown as lines in Fig. 1, linking components within and across tiers. As we will explain in more detail in §4, our situation description system is a directed graphical model, and our constraints are conditional dependencies. But for now, we can simply regard them as the reciprocal influence that two system components have over each other. For example, the presence of a STARGAZING scenario in the situation description promotes the presence of related concepts such as STAR, SHINE or SKY, making them more likely than, say, the concepts CAKE or FLOWER, thus reflecting world knowledge about stargazing scenarios (the listener is more likely to imagine a sky than a cake when thinking of stargazing). Conversely, knowing that the situation description involves a concept STAR might make us more confident about being in a STARGAZING scenario than, say, a SWIMMING scenario.
A situation description system is formulated as a collection of probability distributions, which express the constraints shown as dotted lines in the figure. These distributions jointly determine the probability of any situation description. So the entire situation description in Fig. 1 is associated with some probability according to the situation description system, and so are alternative situation descriptions for the same utterance The star shines. This concludes our brief sketch of the formalism, which we work out in detail in §4.

3 The situation description framework

In this section, we give a generic definition of our framework, with the idea that it could be the basis for different types of more specific implementations. Since our proposal defines understanding in terms of 'situation descriptions', we will first discuss why we think that cognitive representations of situations are the right building blocks for formalising understanding (§3.1). We will then proceed with the actual definition of situation description systems (§3.2).

3.1 Probability distributions over situation descriptions

We want to model language understanding in terms of probabilities because probabilistic systems provide a general and extensible framework in which we can describe interacting constraints of different strengths. There are different options for the sample space of our probability distribution: for example, it could be a sample space of worlds, or of situations. We next discuss why we choose neither of those two options but use situation descriptions instead. Probability distributions over worlds are used in van Eijck & Lappin (2012), van Benthem et al. (2009), Zeevat (2013), Erk (2016), and Lassiter & Goodman (2015), and probability distributions over situations in Emerson (2018) and Bernardy et al. (2018).

Given a probability distribution over worlds, Nilsson (1986) simply defines the probability of a sentence ϕ as the summed probability of all the worlds that make ϕ true:

    p(ϕ) = ∑_{w : ⟦ϕ⟧_w = 1} p(w)

This has the desirable property that the truth of sentences is evaluated completely classically, as all probabilistic effects come from the probabilities of worlds.
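To make this definition concrete, here is a toy rendering in WebPPL, the probabilistic programming language we use for the illustrations in §4; the four worlds, their probabilities, and the truth values of ϕ are invented purely for the example.

```
// A toy version of Nilsson's definition over an invented finite sample
// space: each world carries its probability and the truth value of a
// sentence phi in that world.
var worlds = [
  {p: 0.4, phiTrue: true},
  {p: 0.3, phiTrue: false},
  {p: 0.2, phiTrue: true},
  {p: 0.1, phiTrue: false}
];
// p(phi) = sum of p(w) over worlds w where phi is true: 0.4 + 0.2 = 0.6
var sentenceProb = sum(map(function(w) { return w.phiTrue ? w.p : 0; },
                           worlds));
```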
A problem with using a sample space of worlds is that a world, or at least the real world, is an unimaginably gigantic object. This is, in fact, the reason why Cooper et al. (2015) say it is unrealistic to assume that a cognizer could represent a whole world in their mind, let alone a distribution over worlds. Their argument is that a world is a maximal set of consistent propositions (Carnap 1947), and no matter the language in which the propositions are expressed, we cannot assume that a cognizer would be able to enumerate them. But cognitive plausibility is not the only problem. Another problem is that we do not know enough about a world as a mathematical object. Rescher (1999) argues that objects in the real world have an infinite number of properties, either actual or dispositional. This seems to imply that worlds can only be represented over an infinite-dimensional probability space. When defining a probability measure, it is highly desirable to use a finite-dimensional probability space – but it is not clear whether that is possible with worlds. We cannot know for certain that a world cannot be 'compressed' into a finite-dimensional vector, but that is because we simply do not know enough about what worlds are to say what types of probability measures we could define over them.

Situations, or partial worlds, may be smaller in size, but they still present a similar problem: how large exactly is, say, a situation where Zoe is playing a sonata? Both Emerson (2018) and Bernardy et al. (2018) assume, when defining a probability distribution over situations, that there is a given utterance (or set of utterances) and that the only entities and properties present in the situation are the ones that are explicitly mentioned in the utterance(s). But arguably, a sonata-playing situation should contain an entity filling some instrument role, even if it is not explicitly mentioned. Going one step further, Clark (1975) discusses inferences that are "an integral part of the message", including bridging references such as "I walked into the room. The windows looked out to the bay." This raises the question of whether any situation containing a room would need to contain all the entities that are available for bridging references, including windows and even possibly a chandelier. (Note that there is little agreement on which entities should count as available for bridging references: see Poesio & Vieira 1988.) The point is not that for Zoe is playing a sonata there is a particular fixed-size situation that comprises entities beyond Zoe and the sonata; the point is that there is no fixed size to the situation at all.

Our solution to the above issues is to use a probability distribution over situation descriptions, which are objects in the mind of the listener rather than in some actual state of affairs. As human minds are finite in size, we can assume that each situation description only comprises a finite number of individuals, with a finite number of possible properties – this addresses the problem that worlds are too huge to imagine. But we also assume that the size of situation descriptions is itself probabilistic rather than fixed, and may be learned by the listener through both situated experience and language exposure. Doing so, we remain agnostic about what might be pertinent for describing a particular situation.
3.2 Definition of a situation description system

We now define situation descriptions and situation description systems. Our situation descriptions are pairs of a conceptual representation and a discourse representation structure (DRS). From the perspective of the speaker, we will posit that there is a certain conceptual content that underlies the words that they utter. From the perspective of the listener, the understanding process involves hypothesizing the conceptual content that might have resulted in the utterance. We express this hypothesis as a probability distribution over situation descriptions.

We first define the logical fragment of DRT that we use. Then we formalize situation descriptions, including the 'glue' between utterance and conceptual components. Finally, we define situation description systems as probability distributions over situation descriptions, and we show how language understanding can be described as conditioning a situation description system on a given utterance.

A DRT fragment. The context influences on lexical meaning that we want to model exhibit complex behaviour even in simple linguistic structures – the examples in the introduction were simple predicate-argument combinations and modifier-noun phrases with intersective adjectives. With this in mind, we work with a very simple fragment of DRT:

• we only have conjunctions of DRS conditions;
• negation only scopes over individual conditions (to avoid disjunction and quantificational effects);
• we only have unary and binary (neo-Davidsonian) predicate symbols;
• the fragment does not contain constants.

Let REF be a set of reference markers, and PS a set of predicate symbols with arities of either one or two. In the following definition, xi ranges over the set REF, and F over the set of predicate symbols PS. The language of eDRSs (existential and conjunctive DRSs) is defined by:

    conditions  C ::= ⊺ | Fx1...xk | ¬Fx1...xk
    eDRSs       D ::= ({x1,...,xn}, {C1,...,Cm})
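As a concrete anchor for the sketches in the rest of this paper, this fragment can be encoded directly as data. The encoding below (the field names and the helper functions) is our own illustrative choice, not part of the formalism.

```
// One possible WebPPL/JS encoding of the fragment: a condition records a
// predicate symbol, its argument list, and a polarity; an eDRS pairs a
// list of reference markers with a list of conditions.
var cond = function(pred, args, positive) {
  return {pred: pred, args: args, positive: positive};
};
var eDRS = function(refs, conds) { return {refs: refs, conds: conds}; };

// The utterance eDRS Du for "The star shines" from Section 2:
var Du = eDRS(['x', 'e'],
              [cond('star', ['x'], true),
               cond('shine', ['e'], true),
               cond('Theme', ['e', 'x'], true)]);
```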
Situation descriptions (SDs). A situation description combines a conceptual representation with a logical form. In the most general case, given a set Fr of conceptual components, a language CON of conceptual representations is a subset of P(Fr), the powerset of Fr. To connect a conceptual representation and an eDRS, we define a mapping g from conditions of the eDRS to conceptual components. More specifically, g is a function that links each condition in D to a single node in the conceptual structure. In §4, this will be a feature vector node for unary conditions, and a semantic role node for binary conditions (as illustrated in Figs. 3 and 5).

Definition (Situation description). Given a language CON of conceptual representations, a language of situation descriptions is a subset of the set of all tuples ⟨G, D, g⟩ for a conceptual representation G ∈ CON, an eDRS D, and a partial mapping g from conditions of D to components of G such that if D = ({x1,...,xn}, {C1,...,Cm}), then for each i ∈ {1,...,m}: Ci is in the domain of g iff Ci ≠ ⊺.

Situation description systems. We characterize an individual's conceptual knowledge as a situation description system, a probability distribution over all situation descriptions that the person can possibly imagine.[7]

[7] We do not assume that a person would hold such a probability distribution in mind explicitly, but we do assume that they have the mental ability to generate situation descriptions according to this knowledge, and that they will be more likely to generate some situation descriptions than others.

One technical detail to take care of is that a situation description system must not give separate portions of probability mass to situation descriptions that are equivalent. For now, we will define equivalence of eDRSs; we will add equivalence of conceptual structures below. Two eDRSs are equivalent if they are the same modulo a variable renaming function v, as in Table 1:

    Table 1: Two equivalent eDRSs
    [x | sheep(x), fluffy(x)]    [y | sheep(y), fluffy(y)]

So we define two situation descriptions S1, S2 as being equivalent, written S1 ≡ S2, if their conceptual representations are the same, their eDRSs are the same modulo variable renaming, and the g-mappings are the same modulo variable renaming on the conditions (see definition in appendix).
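For this simple fragment, equivalence modulo renaming can be checked by canonicalizing variable names. The sketch below, reusing the cond/eDRS encoding from above, renames markers to v1, v2, ... in the order they appear in the refs list; this suffices for examples like Table 1, though in general one would have to search over renamings.

```
// Canonical renaming: map each reference marker to v1, v2, ... in order
// of appearance, rename all conditions accordingly, then compare.
var canonicalize = function(d) {
  var renaming = _.zipObject(
    d.refs,
    mapIndexed(function(i, r) { return 'v' + (i + 1); }, d.refs));
  return eDRS(map(function(r) { return renaming[r]; }, d.refs),
              map(function(c) {
                return cond(c.pred,
                            map(function(a) { return renaming[a]; }, c.args),
                            c.positive);
              }, d.conds));
};
var equivalentEDRS = function(d1, d2) {
  return _.isEqual(canonicalize(d1), canonicalize(d2));
};
```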
We will define a situation description system as only involving a single representative of each equivalence class of situation descriptions.

Definition (Situation description system). Given a language 𝒮 of situation descriptions over CON, a situation description system over 𝒮 is a tuple ⟨S, ∆⟩ where ∆ is a probability distribution over S and where S ⊆ 𝒮 such that

• for any S1, S2 ∈ S, S1 ≢ S2;
• there is a dimensionality N and a function f : S → ℝ^N such that f is injective.

A situation description system ⟨S, ∆⟩ is called conceptually envisioned if there are probability distributions p1, p2 such that for any ⟨G, D, g⟩ ∈ S with D = ({x1,...,xn}, {C1,...,Cm}),

    ∆(⟨G, D, g⟩) = p1(G) ∏_{i=1}^{m} p2(Ci | g(Ci))

The second bullet point in the definition of S ensures that S can be embedded into a finite-dimensional space, addressing our concerns about finiteness from §3.1. A conceptually envisioned situation description system is one that can be decomposed into two probability distributions: one that is a distribution over conceptual representations (p1), and one that describes each DRS condition as probabilistically dependent solely on its g-linked conceptual component (p2). This is the decomposition that we will use in §4, and that we have illustrated in §2.

Situation description system for a given utterance. We have previously defined situation description systems in their most general form. We centrally want to use situation description systems to describe the process of understanding, where the listener has to envision a probability distribution over situation descriptions given a specific utterance. As we have discussed above, we want a situation description system to be able to expand on a given utterance, to construct an "interpretation of the whole" from the "blueprint" that is the utterance (Fillmore 1985: p. 233). We can do this simply by saying that the situation description system for a given utterance ascribes non-zero probabilities only to situation descriptions whose eDRS "contains" the utterance. To be able to specify this relation, we first need to define a subset operation on eDRSs modulo variable renaming. For eDRSs D1 = (X1, C1) and D2 = (X2, C2), we write D1 ⫅ D2 if there is an eDRS D = (X, C) such that D1 ≡ D and X ⊆ X2 and C ⊆ C2.
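The sketch below implements a simplified version of this relation on the encoding from §3.2: it tests plain inclusion of markers and conditions, that is, it assumes the larger eDRS reuses the utterance's variable names (as De reuses Du's in §2); the full relation would additionally search over injective renamings.

```
// Simplified ⫅ check: every reference marker and every condition of d1
// must literally occur in d2 (no renaming search).
var subseteq = function(d1, d2) {
  return _.every(d1.refs, function(r) { return _.includes(d2.refs, r); }) &&
         _.every(d1.conds, function(c) {
           return _.some(d2.conds, function(c2) { return _.isEqual(c, c2); });
         });
};

// The expansion De from Section 2 "contains" the utterance Du:
var De = eDRS(['x', 'e', 'y'],
              Du.conds.concat([cond('sun', ['x'], true),
                               cond('sky', ['y'], true),
                               cond('Location', ['e', 'y'], true)]));
subseteq(Du, De)  // true
```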
To restrict probability mass to eDRSs that "contain" the utterance, we use Dirac's delta, a probability distribution that puts all its probability mass in one point. With this, we can assign zero probability to cases we are not interested in. We now have all the pieces in place to say what the situation description system given an observed utterance Du should be:

Let ⟨S, ∆⟩ be a situation description system over 𝒮, and let Du be an eDRS such that there is some ⟨G, D, g⟩ ∈ S with Du ⫅ D. Then the situation description system given utterance Du is a tuple ⟨S, ∆Du⟩ such that for any ⟨G, D, g⟩ ∈ S,

    ∆Du(⟨G, D, g⟩ | Du) ∝ ∆(⟨G, D, g⟩) δ(Du ⫅ D)

That is, ∆Du assigns non-zero probabilities only to situation descriptions that "contain" Du. In more detail, it assigns a zero probability to all situation descriptions that do not "contain" Du, and normalizes all other probabilities so that all probabilities sum to 1 again. This is the probabilistic meaning of the utterance for the listener.

This concludes the formal exposition of our general framework. Using the above definitions, we will describe in the following section a situation description system where DRS conditions are conditionally dependent on frames that express lexical-conceptual and world knowledge.
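In a probabilistic program, the delta term corresponds to a hard condition: sampled situation descriptions that fail the containment check are discarded, and the remaining probability mass is renormalized. A minimal sketch, assuming a hypothetical prior() that samples situation descriptions (of the kind constructed in §4) as objects with a drs field, and the subseteq check from above:

```
// ∆_Du as rejection: sample from ∆, keep only SDs whose eDRS contains Du;
// Infer renormalizes the surviving probability mass.
var understand = function(Du) {
  return Infer({method: 'rejection', samples: 1000}, function() {
    var sd = prior();                 // sd = ⟨G, D, g⟩, with sd.drs = D
    condition(subseteq(Du, sd.drs));  // the δ(Du ⫅ D) term
    return sd;
  });
};
```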
4 Using situation description systems to model word meaning in context

We will now use the general framework introduced in the previous section to implement a specific system of probabilistic constraints, with a view to modelling word meaning in context. We will introduce a situation description system with a conceptual tier made of frames and of property vectors, and structured as a graph. We will show that our notion of situation description system lets us restrict inferences based on both global and local constraints, guiding the envisioning process towards the appropriate meanings. At the global level, we will inspect the role of scenarios in restricting word senses. At the local level, we will inspect selectional constraints that semantic roles impose on role fillers, and constraints on modifier-head combinations.

We will describe a conceptual representation as a directed graphical model (Koller & Friedman 2009), or Bayesian network. A directed graphical model describes a factorization of a joint probability over random variables: each node is a random variable, directed edges are conditional probabilities, and the graph states that the joint probability over all random variables factorizes into the conditional probabilities in the graph. The illustration in Fig. 1 is a simplified version of the graphical models that we introduce in the current section, and the edges are conditional probabilities. For example, each concept in the figure is conditioned on the scenario that it links to; that is, each concept is sampled from a probability distribution associated with the linked scenario.[8]

[8] Note that inference in a directed graphical model does not always proceed in the direction of the edges. If a scenario is known, we can probabilistically infer which concept is likely to be sampled from it. But likewise, if we know the concept, we can probabilistically infer from which scenario it is likely to be sampled.

We will define our system of constraints in three stages of increasing complexity, starting with global constraints only, then adding semantic roles, and finally modifier-head combinations. At each stage, we will illustrate the behaviour of the system with concrete examples implemented in the probabilistic programming language WebPPL (Goodman & Stuhlmüller 2014).
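As a warm-up for these illustrations, the point of footnote 8 can be seen in a few lines of WebPPL; the scenario and concept inventories are the toy ones introduced below, and the uniform distributions are our own simplifying assumption.

```
// Forward model: a scenario samples one of its concepts. Conditioning on
// the concept makes Infer reason against the direction of the edge.
var conceptGivenScenario = function(scenario) {
  return scenario === 'BASEBALL'
    ? uniformDraw(['BAT-STICK', 'PLAYER'])
    : uniformDraw(['BAT-ANIMAL', 'VAMPIRE']);
};

// p(scenario | concept = BAT-ANIMAL)
var posterior = Infer({method: 'enumerate'}, function() {
  var scenario = uniformDraw(['BASEBALL', 'GOTHIC']);
  var concept = conceptGivenScenario(scenario);
  condition(concept === 'BAT-ANIMAL');
  return scenario;
});
// All posterior mass falls on GOTHIC: BASEBALL never samples BAT-ANIMAL.
```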
4.1 System of constraints, first stage: scenario constraints

We start out with a system that only has global constraints, no local ones. We use conceptual representations that consist of scenarios, concepts, and feature vectors, but no semantic roles, as shown in Fig. 2. As we mentioned in §2, global constraints arise here from the 'scenario mix' having a preference for fewer scenarios. They exert an influence on the co-occurrence of word senses (lexicalized concepts) in an utterance: the 'sun' sense of star is more likely to occur in a situation description in which the STARGAZE scenario is active than in a discussion of theater, where the 'person' sense would be more salient.[9]

[Figure 2: Situation description from Fig. 1, without semantic roles: the scenario mix, STARGAZE scenario tokens, the concepts Star, Shine and Sky, their feature vectors, and the Discourse Representation Structure with referents x, e, y and conditions star(x), shine(e), sun(x), sky(y).]

[9] This may be an unusual constellation from a linguistic point of view, without the familiar local constraints on word meaning. But from a probabilistic modeling point of view, this system is straightforward, a simple variant on topic models (Blei et al. 2003).

In what follows, we formally define a language CON1 of conceptual representations that comprises scenarios, concepts, and properties. We then define languages SD1 of situation descriptions over CON1 and eDRSs. We finally construct a particular conceptually envisioned situation description system over an SD1 language. The construction is illustrated step by step with toy examples involving the two senses of bat.

4.1.1 Definitions

The language CON1 of conceptual representations. We now formalize the shape of conceptual representations such as the one in Fig. 2: they form a graph with nodes for scenario tokens, concept tokens, and feature vectors, where each scenario token samples one concept token, and each concept token samples a feature vector.

Definition (Language CON1 of conceptual representations). We assume a finite set FS of scenarios, a finite set FC of concepts, and a finite set FF of features, where the three sets are disjoint. We write fF for the set of binary vectors of length |FF| (or, equivalently, the set of functions from FF to {0,1}). Then the language CON1 of conceptual representations is the largest set of directed graphs (V, E, ℓ) with node set V, edge set E, and node labeling function ℓ : V → FS ∪ FC ∪ fF such that

• if v is a node with a label in FS, then v has exactly one outgoing edge, and the target has a label from FC;
• if v is a node with an FC label, then it has exactly one incoming edge, and the source has a label from FS; it further has exactly one outgoing edge, and the target has a label from fF;
• if v is a node with a label in fF, then v has exactly one incoming edge and no outgoing edges; the source on the incoming edge has a label from FC.

Note that each representation in CON1 can be described as a set, as defined in §3, where the set members are nodes, edges, and node labels.
Languages SD1 of situation descriptions. Before we define our languages of situation descriptions, we need a few notations in place. First, we had required, in the definition of situation description systems in §3.2, that it must be possible to embed situation descriptions in a finite space. For that reason, we assume from now on that PS, the set of predicate symbols used in eDRSs, is at most as large as ℝ. In the definition below, we restrict the size of the conceptual representation to some maximal number M of scenario tokens, and the size of the eDRS to some maximal number N of referents and of conditions per referent. (We use a single upper limit N for both referents and conditions per referent in order to keep the number of parameters down; the number of conditions per discourse referent does not have to be the same as the number of discourse referents, only the upper limit is re-used.)

As mentioned previously, the conceptual representation characterizes an individual as a feature vector which matches a particular referent in the eDRS. We will want to say that all unary conditions in the eDRS that mention a particular referent x are probabilistically dependent on the same feature vector node, and that all conditions that depend on that feature vector are about x. In order to express that, we define Var(C) for an eDRS condition C as its sequence of variables:

• if C = ⊺, then Var(C) = ⟨⟩;
• if C = Fx1...xk, then Var(C) = ⟨x1,...,xk⟩;
• if C = ¬Fx1...xk, then Var(C) = ⟨x1,...,xk⟩.

We call the length of Var(C) the arity of the condition C; we call a condition unary if the length of Var(C) is 1. Later in this section, we will encounter binary conditions generated by role nodes, for which the length of Var(C) is 2.

Definition (Language SD1^{M,N} of situation descriptions). For M, N ∈ ℕ, SD1^{M,N} is the largest set of situation descriptions ⟨G, D, g⟩ such that

• G ∈ CON1 has at most M nodes with labels from FS;
• D is an eDRS that contains at most N discourse referents and at most N unary conditions per discourse referent, and no conditions with arity greater than 1;
• for any condition C of D, g(C) has a label in fF;
• g is a function from conditions of D to nodes of G such that for any two unary conditions C1, C2 of D, g(C1) = g(C2) iff Var(C1) = Var(C2).
Lemma (Finite embedding of SD1 situation descriptions). For any M, N ∈ ℕ, any situation description in SD1^{M,N} can be embedded in a finite-dimensional space.

Proof. Let ⟨G, D, g⟩ ∈ SD1^{M,N}. Then G has at most M nodes with labels from FS, and the same number of nodes with labels in FC, and the same number of nodes with labels in fF. So there is a finite number of nodes and edges in the graph. The set of possible node labels is also finite (as fF is finite), so G can be embedded in a finite-dimensional space. The number of discourse referents and conditions in D is finite. As the set of predicate symbols is at most as large as ℝ, each predicate symbol can be embedded in a single real-valued dimension.

We need to amend our definition of equivalence classes of situation descriptions. Equivalence among eDRSs is as before, based on variable renaming. But now we additionally have an equivalence relation on the conceptual representation graphs. Two graphs in CON1 are equivalent if there is a homomorphism between them that respects node labels. Two situation descriptions ⟨G1, D1, g1⟩ and ⟨G2, D2, g2⟩ in SD1^{M,N} are equivalent, ⟨G1, D1, g1⟩ ≡ ⟨G2, D2, g2⟩, iff

• D1 ≡ D2 with variable mapping a such that a(D1) = D2,
• G1 ≡ G2 with graph homomorphism h mapping the nodes of G1 to nodes of G2 in a way that respects node labels, and
• for any condition C of D1, h(g1(C)) = g2(a(C)).

4.1.2 A conceptually envisioned situation description system with global constraints

In §3.2, we defined a conceptually envisioned situation description system as being decomposable into two probability distributions: one that is a distribution over conceptual representations (p1), and one that relates each DRS condition to a conceptual node (p2). We will now show that it is possible to define a situation description system that satisfies this definition, that is, one that decomposes into the appropriate p1 and p2 distributions. Formally, we want to show that:

Lemma. There are M, N ∈ ℕ such that there is a situation description system over SD1^{M,N} that is conceptually envisioned.

What follows is a construction of the situation description system, which also serves as the proof for this lemma.
We have already shown that any situation description in SD1^{M,N} has a finite embedding. It remains to show that there is a situation description system, a joint probability distribution over random variables, that operates over some subset S of SD1^{M,N} and that can be factored in the right way:

• there are probability distributions p1, p2 such that the probability of any ⟨G, D, g⟩ ∈ S, with D having conditions C1,...,Cm, can be characterized as p1(G) ∏_{i=1}^{m} p2(Ci | g(Ci));
• S does not contain more than one representative from any SD equivalence class.

M is the number of scenario tokens, and by extension the number of concept tokens and feature vectors. (This is illustrated in Fig. 2, where each referent is associated with a unique token at each level of the conceptual representation.) We will talk of 'representations of individuals' to refer to the combination of a scenario token, a concept token and a feature vector. We do not specify any restrictions on M, the number of individual representations. As for N, the number of discourse referents and the number of conditions per discourse referent, we only consider values of N with N ≥ M, because we need a one-to-one correspondence between discourse referents and feature vector nodes, and N ≥ |FF|, as each feature q in FF will be associated with a predicate corresponding to q, so a feature vector can generate at most |FF| conditions.

We now describe a particular decomposition of the joint probabilities of random variables in the model, in the form of a 'generative story'. A generative story verbalizes the conditional probabilities in a pseudo-procedural form, as a series of sampling steps. However, this is just a procedural way of presenting a declarative formulation, not a description of a procedure. And in fact, in §4.1.3 below we draw inferences in the opposite direction from the 'steps' of the generative story. Our generative story samples some situation description in SD1^{M,N}, starting at the top of the conceptual graph with scenario nodes, going down through concept nodes, and ending with feature vectors which then probabilistically condition the content of the eDRS. The associated steps are defined as follows:

A: Draw a scenario mix, and from it, a collection of scenario frames (where the same scenario frame can appear multiple times to accommodate multiple individuals drawn from the same type).

B: For each token of a scenario frame, draw a token of a concept frame.

C: For each token of a concept frame, draw a feature vector.
At this point, we have a conceptual representation from which conditions in the eDRS can similarly be sampled:

D: For any discourse referent x, there is a single graph node with a label from fF from which we sample all the (unary) conditions involving x, overall at most N conditions.

We now formalize this generative story.

A. Drawing a collection of scenarios. We sample scenarios from some distribution θ (the 'scenario mix' in Fig. 1), where the overall shape of θ controls how many different scenarios will be involved in the situation description. To obtain θ, we assume an |FS|-dimensional Dirichlet (which can be viewed as a distribution over parameters for a multinomial) with a concentration parameter α < 1, which will prefer sparse distributions (distributions that assign non-zero probabilities only to few outcomes). So θ ∼ Dir(α). We thus obtain a distribution giving some non-zero probability mass to a limited number of scenarios. As mentioned before, the concentration parameter α is the only truly global constraint we consider. All other constraints are specific to particular concepts and particular scenarios.

We draw a number m ∈ {1,...,M}, the number of individuals in the conceptual representation, from a discrete uniform distribution. This will also be the number of discourse referents in the situation description. We then draw a multiset H (a bag, meaning that an element can appear multiple times) of m node labels from FS from a multinomial with parameters θ. This gives us m independent draws, with possible repeats of scenario labels, where the probability of a collection of scenarios is independent of the order in which scenarios were drawn from the multinomial: H ∼ Multinomial(m, θ).
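A minimal WebPPL sketch of step A, using the toy inventory of the illustration below (FS = {BASEBALL, GOTHIC}, M = 4, α = 0.5); the encoding of the scenario collection as a plain array is our own choice.

```
var FS = ['BASEBALL', 'GOTHIC'];
var M = 4;

var drawScenarios = function() {
  // theta ~ Dir(alpha): a sparse scenario mix for alpha < 1
  var theta = dirichlet(Vector([0.5, 0.5]));
  // m ~ Uniform({1, ..., M}): number of individuals
  var m = 1 + randomInteger(M);
  // H ~ Multinomial(m, theta): m independent draws of scenario labels
  return repeat(m, function() { return FS[sample(Discrete({ps: theta}))]; });
};
```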
Illustration: In the WebPPL programming language, we can directly implement the sampling process, and draw samples that are situation descriptions according to their probability under the situation description system. We use a tiny toy collection of two scenarios, BASEBALL and GOTHIC. The BASEBALL scenario can sample the concepts BAT-STICK or PLAYER with equal probability (giving 0 probability mass to the other concepts). Similarly, the GOTHIC scenario can sample the concepts BAT-ANIMAL or VAMPIRE. We set the maximum number of scenario tokens (individuals) to M = 4 and sample 2,000 situation descriptions. With a Dirichlet concentration parameter of α = 0.5, we prefer sparser distributions θ, so the majority of our sampled situation descriptions contain one scenario frame only: 38% of all 2,000 SDs include BASEBALL only, 36% GOTHIC only, and 27% include both. If we lower the concentration parameter still more, to α = 0.1, the preference for fewer scenarios becomes more pronounced: we obtain 48% GOTHIC, 44% BASEBALL, and 8% of sampled SDs containing both scenarios.

Table 2 shows the 10 most likely scenario collections in our 2,000 sampled SDs for α = 0.5, with their probabilities. We see for instance that with a probability of approximately 0.13, a single GOTHIC scenario token was sampled. A single BASEBALL scenario token was likewise sampled with a probability of about 0.13. The most likely mixed scenario collection, at a probability of 0.057, has two tokens, one BASEBALL and one GOTHIC.

    Table 2: Ten most likely sampled scenario collections among 2,000 SDs
    (FS = {BASEBALL, GOTHIC}; FC = {BAT-STICK, BAT-ANIMAL, VAMPIRE, PLAYER})

    scenarios                                      p
    GOTHIC                                      0.1295
    BASEBALL                                    0.1280
    BASEBALL, BASEBALL                          0.0930
    GOTHIC, GOTHIC                              0.0895
    BASEBALL, BASEBALL, BASEBALL                0.0880
    GOTHIC, GOTHIC, GOTHIC, GOTHIC              0.0695
    BASEBALL, BASEBALL, BASEBALL, BASEBALL      0.0675
    GOTHIC, GOTHIC, GOTHIC                      0.0675
    BASEBALL, GOTHIC                            0.0565
    BASEBALL, GOTHIC, GOTHIC                    0.0515

B. Drawing a concept frame for each scenario frame. The next step is for each scenario frame to sample a concept. We write ĥ for tokens of scenario frame h, and similarly ẑ for tokens of concept frame z. We assume that each scenario type h is associated with a categorical distribution, parametrized by a probability vector φh, over the concept frame labels in FC. For each scenario token ĥ of h in H, we draw a concept frame type z_ĥ from the categorical distribution associated with h, z_ĥ ∼ Categorical(φh), for a total of m concept tokens.

Illustration: Re-using the same 2,000 sampled scenario collections (for α = 0.5), we sample one concept per scenario using a uniform distribution. Table 3 shows the ten most likely concept collections. As expected given the distribution over scenarios, which prefers single types, we have coherent concept groups: bat-animals co-occur with vampires, while bat-sticks co-occur with players.

    Table 3: Ten most likely sampled concept collections among the same 2,000 SDs

    concepts                                       p
    BAT-ANIMAL                                  0.0670
    BAT-STICK                                   0.0650
    PLAYER                                      0.0630
    VAMPIRE                                     0.0625
    BAT-STICK, PLAYER                           0.0470
    BAT-ANIMAL, VAMPIRE                         0.0440
    BAT-STICK, PLAYER, PLAYER                   0.0350
    BAT-STICK, BAT-STICK, PLAYER                0.0305
    PLAYER, PLAYER                              0.0265
    BAT-ANIMAL, VAMPIRE, VAMPIRE                0.0255
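Continuing the sketch, step B maps each scenario token to a concept token; the uniform φ_h of the illustration is hard-coded here.

```
// phi_h for the toy inventory: each scenario draws uniformly from the
// two concepts it can evoke.
var phi = {
  'BASEBALL': ['BAT-STICK', 'PLAYER'],
  'GOTHIC':   ['BAT-ANIMAL', 'VAMPIRE']
};

// Step B: one concept token per scenario token.
var drawConcepts = function(scenarioTokens) {
  return map(function(h) { return uniformDraw(phi[h]); }, scenarioTokens);
};

// e.g. drawConcepts(drawScenarios()) yields collections like those in Table 3
```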
C. Drawing a feature vector for each concept frame. We assume that each concept frame type z is associated with a vector τz of |FF| Bernoulli probabilities, which lets us sample, for each feature, whether it should be true or false. Abusing notation somewhat by also viewing τz as a function, we write τz(q) for the value in τz that is the Bernoulli probability of feature q ∈ FF. The sampling of feature values is illustrated in Fig. 3, which shows the properties of a particular star entity which happens to be a sun and not to be bright.[10] In this paper, we restrict ourselves to binary properties for simplicity. In the figure, τ specifies that the properties star and object have to be true of any entity that is an instance of STAR, indicated by a τ probability of 1.0, while stars may or may not be bright, with a τ value of 0.4.[11]

[Figure 3: A feature vector. The STARGAZE scenario samples the concept Star, which samples a feature vector via the probabilities p_truth (star: 1.0, object: 1.0, bright: 0.4, ...); the sampled instance has the properties star, object, sun and consisting-of-plasma, and lacks bright, animate and furry (a minus sign in front of a property indicates that the instance does not have that property). Via the salience probabilities p_salience, only some components of the feature vector are turned into conditions in the eDRS, here star(x) and sun(x).]

[10] Note that we are showing here ambiguous properties such as bright, which could mean smart or luminous, but properties can also be disambiguated on the conceptual side without any change to the framework.

[11] Intermediate probabilities like the one for bright allow us to represent typicality or prevalence of properties, while probabilities of 1.0 or 0.0 put hard constraints on category membership. We need to represent typicality because we want the cognizer to be more likely to imagine typical situation descriptions than atypical ones. As a side note, Kamp & Partee (1995) stress the distinction between typicality and membership in a concept, while Hampton (2007) suggests that typicality and membership are both functions of a single cline of similarity to a prototype. Our formalization accommodates both views. Strict property constraints, as for the object property of STAR, correspond to strict membership: STAR will never have an instance that is not an object. Soft property constraints, as for bright above, allow for degrees of typicality of STAR instances.
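A sketch of step C for one concept; the τ values below are invented for BAT-ANIMAL, except have_wings, which the feature norms mentioned in the illustration below put at 1.0 (the full τ distributions are given in the paper's appendix, Table 7).

```
var FF = ['bat', 'vampire', 'player', 'have_wings',
          'fly', 'humanlike', 'athletic', 'wooden'];

// tau_z: Bernoulli probability of each feature, per concept (toy values).
var tau = {
  'BAT-ANIMAL': {bat: 1, vampire: 0, player: 0, have_wings: 1,
                 fly: 0.95, humanlike: 0, athletic: 0, wooden: 0}
};

// Step C: sample a binary feature vector f for a concept token of type z.
var drawFeatureVector = function(z) {
  return _.zipObject(FF, map(function(q) {
    return flip(tau[z][q]) ? 1 : 0;
  }, FF));
};
```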
For each concept frame token ẑ that was previously sampled, we sample a feature vector f (which we write as a function from FF to {0,1}). The probability of sampling feature vector f for concept z is

    (1)  p(f | z) = ∏_{q ∈ FF : f(q)=1} τz(q) · ∏_{q ∈ FF : f(q)=0} (1 − τz(q))

That is, p(f | z) is the product of the probability of truth of the features that are true in f and the probability of falsehood of the features that are false in f.

Illustration: Re-using the same 2,000 scenario and concept collections obtained in the previous steps, we now sample a feature vector for each individual. To do so, we need τ distributions for our four concepts of interest. We can partially build on 'real' distributions by considering the quantifiers that Herbelot & Vecchi (2016) added to the feature norms collected by McRae et al. (2005) (which for instance tell us that on average, annotators judged that an instance of BAT-ANIMAL has a probability of 1.0 of having wings, and that an instance of BAT-STICK has a probability of 0.75 of being wooden). For concepts that are not available from the annotated norms, we manually set the values of τ. All distributions are shown in Table 7 of the appendix.

The sampling process results in a collection of feature vectors for each situation description, one feature vector per individual. Table 4 shows the five most likely collections of individuals, corresponding to (a) a bat-animal; (b) a bat-stick; (c) a player; (d) a vampire; and (e) a bat-stick and a player, together with the associated feature vectors.

    Table 4: Feature vectors for the 5 most likely sampled collections of individuals
    (FS = {BASEBALL, GOTHIC}; FC = {BAT-STICK, BAT-ANIMAL, VAMPIRE, PLAYER};
     FF = {bat, vampire, player, have_wings, fly, humanlike, athletic, wooden})

    concepts     bat  vampire  player  have_wings  fly  humanlike  athletic  wooden      p
    BAT-ANIMAL    1      0       0         1        1       0         0        0      0.0650
    BAT-STICK     1      0       0         0        0       0         0        1      0.0485
    PLAYER        0      0       1         0        0       1         1        0      0.0480
    VAMPIRE       0      1       0         0        0       1         0        0      0.0320
    BAT-STICK     1      0       0         0        0       0         0        1      0.0485
    PLAYER        0      0       1         0        0       1         1        0      0.0480
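Equation (1) in code form, reusing the hypothetical tau table and FF list from the step C sketch above:

```
// p(f | z): multiply tau_z(q) for features true in f, and 1 - tau_z(q)
// for features false in f.
var featureVectorProb = function(f, z) {
  return product(map(function(q) {
    return f[q] === 1 ? tau[z][q] : 1 - tau[z][q];
  }, FF));
};

// e.g. featureVectorProb(drawFeatureVector('BAT-ANIMAL'), 'BAT-ANIMAL')
```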