Experience Grounds Language
Yonatan Bisk* Ari Holtzman* Jesse Thomason* Jacob Andreas Yoshua Bengio Joyce Chai Mirella Lapata Angeliki Lazaridou Jonathan May Aleksandr Nisnevich Nicolas Pinto Joseph Turian

arXiv:2004.10151v1 [cs.CL] 21 Apr 2020

Abstract

Successful linguistic communication relies on a shared experience of the world, and it is this shared experience that makes utterances meaningful. Despite the incredible effectiveness of language processing models trained on text alone, today's best systems still make mistakes that arise from a failure to relate language to the physical world it describes and to the social interactions it facilitates.

Natural Language Processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large text corpora can be deeply enriched from the parallel tradition of research on the contextual and social nature of language. In this article, we consider work on the contextual foundations of language: grounding, embodiment, and social interaction. We describe a brief history and possible progression of how contextual information can factor into our representations, with an eye towards how this integration can move the field forward and where it is currently being pioneered. We believe this framing will serve as a roadmap for truly contextual language understanding.

    "Meaning is not a unique property of language, but a general characteristic of human activity ... We cannot say that each morpheme or word has a single or central meaning, or even that it has a continuous or coherent range of meanings ... there are two separate uses and meanings of language – the concrete ... and the abstract."
    Zellig S. Harris (Distributional Structure, 1954)

Improvements in hardware and data collection have galvanized progress in NLP. Performance peaks have been reached in tasks such as language modeling (Radford et al., 2019; Zellers et al., 2019c; Keskar et al., 2019) and span-selection question answering (Devlin et al., 2019a; Yang et al., 2019; Lan et al., 2019) through massive data and massive models. Now is an excellent time to reflect on the direction of our field, and on the relationship of language to the broader AI, Cognitive Science, and Linguistics communities.

We are interested in how the data and world a language learner is exposed to defines and potentially constrains the scope of the learner's semantics. Meaning is not a unique property of language; it is a universal property of intelligence. Consequently, we must consider what knowledge or concepts are missing from models trained solely on text corpora, even when those corpora are large scale or meticulously annotated.

Large, generative language models emit utterances which violate visual, physical, and social commonsense. Perhaps this is to be expected, since they are tested in terms of Independent and Identically Distributed held-out sets, rather than queried on points meant to probe the granularity of the distinctions they make (Kaushik et al., 2020; Gardner et al., 2020). We draw on previous work in NLP, Cognitive Science, and Linguistics to provide a roadmap towards addressing these gaps. We posit that the universes of knowledge and experience available to NLP models can be defined by successively larger world scopes: from a single corpus to a fully embodied and social context.

We propose the notion of a World Scope (WS) as a lens through which to view progress in NLP. From the beginning of corpus linguistics (Zipf, 1932; Harris, 1954), to the formation of the Penn Treebank (Marcus et al., 1993), NLP researchers have consistently recognized the limitations of corpora in terms of coverage of language and experience. To organize this intuition, we propose five WSs, and note that most current work in NLP operates in the second (internet-scale data). We define five levels of World Scope:

WS1. Corpus (our past)
WS2. Internet (our present)
WS3. Perception
WS4. Embodiment
WS5. Social
1 WS1: Corpora and Representations

The story of computer-aided, data-driven language research begins with the corpus. While by no means the only one of its kind, the Penn Treebank (Marcus et al., 1993) is the canonical example of a sterilized subset of naturally generated language, processed and annotated for the purpose of studying representations, a perfect example of WS1. While initially much energy was directed at finding structure (e.g. syntax trees), this has been largely overshadowed by the runaway success of unstructured, fuzzy representations—from dense word vectors (Mikolov et al., 2013) to contextualized pretrained representations (Peters et al., 2018b; Devlin et al., 2019b).

[Figure: percentage of 2019 citations per year (1960–2020) for Harris (1954), Firth (1957), and Chomsky (1957). Academic interest in Firth and Harris increases dramatically around 2010, perhaps due to the popularization of Firth (1957): "You shall know a word by the company it keeps."]

Yet fuzzy, conceptual word representations have a long history that predates the recent success of deep learning methods. Philosophy (Lakoff, 1973) and linguistics (Coleman and Kay, 1981) recognized that meaning is flexible yet structured. Early experiments on neural networks trained with sequences of words (Elman, 1990) suggested that vector representations could capture both syntax and semantics. Subsequent experiments with larger models, document contexts, and corpora have demonstrated that representations learned from text capture a great deal of information about word meanings in and out of context (Bengio et al., 2003; Collobert and Weston, 2008; Turian et al., 2010; Mikolov et al., 2013; McCann et al., 2017).

It has long been acknowledged that context lends meaning (Firth, 1957; Turney and Pantel, 2010). Local contexts proved powerful when combined with agglomerative clustering guided by mutual information (Brown et al., 1992): a word's position in the resulting hierarchy captures semantic and syntactic distinctions. When the Baum-Welch algorithm (Welch, 2003) was applied to unsupervised Hidden Markov Models, it assigned a class distribution to every word, and that distribution was considered a partial representation of a word's "meaning." If the set of classes was small, syntax-like classes were induced; if the set was large, classes became more semantic. Over the years this "search for meaning" has played out again and again, often with similar themes.

Later approaches discarded structure in favor of larger context windows for better semantics. The most popular generative approach came in the form of Latent Dirichlet Allocation (Blei et al., 2003): sentence structure was discarded and a document generated as a bag-of-words conditioned on topics. However, the most successful vector representations came from Latent Semantic Indexing/Analysis (Deerwester et al., 1988, 1990; Dumais, 2004), which represents words by their co-occurrence with other words. Via singular value decomposition, these matrices are reduced to low-dimensional projections, essentially a bag-of-words.

A related question also existed in connectionism (Pollack, 1987; James and Miikkulainen, 1995): are concepts distributed through edges or local to units in an artificial neural network?

    "... there has been a long and unresolved debate between those who favor localist representations in which each processing element corresponds to a meaningful concept and those who favor distributed representations."
    Hinton (1990), Special Issue on Connectionist Symbol Processing

Unlike work that defined words as distributions over clusters, which were perceived as having intrinsic meaning to the user of a system, in connectionism there was the question of where meaning resides. The tension between modeling symbols and distributed representations was nicely articulated by Smolensky (1990), and alternative representations (Kohonen, 1984; Hinton et al., 1986; Barlow, 1989) and approaches to structure and composition (Erk and Padó, 2008; Socher et al., 2012) span decades of research.

All of these works rely on corpora. The Brown Corpus (Francis, 1964) and Penn Treebank (Marcus et al., 1993) were colossal undertakings that defined linguistic context for decades. Only relatively recently (Baroni et al., 2009) has the cost of annotation decreased enough to enable the introduction of new tasks, and have web-crawls become viable for researchers to collect and process (WS2).1

1 An important parallel discussion centers on the hardware ...
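The Latent Semantic Indexing/Analysis pipeline described in this section can be sketched in a few lines of NumPy. The toy corpus, the choice of rank k = 2, and the scaling of word vectors by singular values are illustrative assumptions, not details of the original systems:

```python
import numpy as np

# Toy corpus; real LSA was estimated from large term-document matrices.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cat and dog are pets",
    "stocks and bonds are assets",
    "the market traded stocks",
]

# Term-document count matrix: each word is represented by its
# co-occurrence pattern across documents.
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[idx[w], j] += 1

# Singular value decomposition reduces the matrix to a low-dimensional
# projection; each row of word_vecs is a k-dimensional word representation.
U, S, _ = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * S[:k]

def sim(a, b):
    """Cosine similarity between two word representations."""
    u, v = word_vecs[idx[a]], word_vecs[idx[b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words that share contexts ("cat"/"dog") land closer together than
# words that do not ("cat"/"stocks").
print(sim("cat", "dog"), sim("cat", "stocks"))
```

Even at this tiny scale, the projection groups words by their distributional profiles, which is exactly what the quote from Harris would predict: meaning difference tracks environment difference.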
2 WS2: The Written World

Corpora in NLP have broadened to include large web-crawls. The use of unstructured, unlabeled, multi-domain, and multilingual data broadens our world scope, in the limit, to everything humanity has ever written. We are no longer constrained to a single author or source, and the temptation for NLP is to believe everything that needs knowing can be learned from the written world.

This move towards large-scale (whether mono- or multilingual) corpora has led to substantial advances in performance on existing and novel community benchmarks (Wang et al., 2019a). These advances have come from the transfer learning enabled by representations in deep models. Traditionally, transfer learning relied on our understanding of model classes, such as English grammar. Domain adaptation could proceed by simply providing a model with sufficient data to capture lexical variation, while assuming most higher-level structure would remain the same. Embeddings—lexical representations built from massive corpora—encompass multiple domains and lexical senses, violating this structural assumption.

The machinery for estimating word distributions has grown ever more refined: the smoothing of Good (1953), the weights of Kneser and Ney (1995); Chen and Goodman (1996), and the power-law distributions of Teh (2006). Modern approaches to learning dense representations allow us to better estimate these distributions from massive corpora. However, modeling lexical co-occurrence, no matter the scale, is still "modeling the written world." Models constructed this way blindly search for symbolic co-occurrences void of meaning.

Regarding the apparent paradox of "impressive results" vs. "diminishing returns," language modeling—the modern workhorse of neural NLP systems—provides a poignant example. Recent pretraining literature has produced results that few could have predicted, crowding leaderboards with results that make claims to super-human accuracy (Rajpurkar et al., 2018). However, there are diminishing returns. For example, on the LAMBADA dataset (Paperno et al., 2016) (designed to capture human intuition), GPT2 (Radford et al., 2019) (1.5B parameters), Megatron-LM (Shoeybi et al., 2019) (8.3B), and TuringNLG (Rosset, 2020) (17B) perform within a few points of each other and very far from perfect.
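The estimation problem behind the Good-through-Kneser-Ney lineage can be made concrete with a bigram model. As a minimal stand-in for those smoothing schemes (add-k smoothing rather than the more involved Kneser-Ney; the corpus and the value of k are illustrative assumptions), it shows how probability mass is reserved for co-occurrences never seen in training:

```python
from collections import Counter

# Toy training text; real n-gram models were estimated from massive corpora.
corpus = "the cat sat on the mat . the dog sat on the log .".split()

V = len(set(corpus))                      # vocabulary size
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_addk(w2, w1, k=0.5):
    """Add-k smoothed bigram probability P(w2 | w1): every possible
    continuation keeps nonzero mass, even if never observed."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * V)

# A seen bigram ("the cat") outscores an unseen one ("cat dog"),
# but the unseen bigram is not assigned zero probability.
print(p_addk("cat", "the"), p_addk("dog", "cat"))
```

However sophisticated the smoothing, the resulting distribution is still fit purely to symbol co-occurrence, which is the sense in which any such model, at any scale, remains a model of the written world.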
3 WS3: The World of Sights and Sounds

    [Epigraph: Eugene Charniak, A Framed PAINTING: The Representation of a Common Sense Knowledge Fragment (1977)]

Language learning needs perception. An agent observing the real world can build axioms from which to extrapolate. Learned, physical heuristics, such as that a falling cat will land quietly, are generalized and abstracted into language metaphors like "as nimble as a cat" (Lakoff, 1980). World knowledge forms the basis for how people make entailment and reasoning decisions. There exists strong evidence that children require grounded sensory input, not just speech, to learn language (Sachs et al., 1981; O'Grady, 2005; Vigliocco et al., 2014).

Perception includes auditory, tactile, and visual input. Even restricted to purely linguistic signals, auditory input is necessary for understanding sarcasm, stress, and meaning implied through prosody. Further, tactile senses give meaning, both physical (Sinapov et al., 2014; Thomason et al., 2016) and abstract, to concepts like "heavy" and "soft." Visual perception is a powerful signal for modeling a vastness of experiences in the world that cannot be documented by text alone (Harnad, 1990).

For example, frames and scripts (Schank and Abelson, 1977; Charniak, 1977; Dejong, 1981; Mooney and Dejong, 1985) require understanding often unstated sets of pre- and post-conditions about the world. To borrow from Charniak (1977), how should we learn the meaning, method, and implications of painting? A web crawl of knowledge from an exponential number of possible how-to, text-only guides and manuals (Bisk et al., 2020) is misdirected without some fundamental referents to ground symbols to. We posit that models must be able to watch and recognize objects, people, and activities to understand the language describing them (Li et al., 2019b; Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016).

While the NLP community has played an important role in the history of scripts and grounding (Mooney, 2008), recently remarkable progress in grounding has taken place in the Computer Vision community. In the last decade, CV researchers have refined and codified the "backbone" of computer vision architectures. These advances have led to parameter-efficient models (Tan and Le, 2019) and real-time object detection on embedded devices and phones (Rastegari et al., 2016; Redmon and Farhadi, 2018).

We find anecdotally that many researchers in the NLP community still assume that vision models tested on the 1,000-class2 ImageNet challenge are limited to extracting a bag of visual words. In reality, these architectures are mature and capable of generalizing, both from the perspective of engineering infrastructure3 and in that the backbones have been tested against other technologies like self-attention (Ramachandran et al., 2019). The stability of these architectures allows for new research into more challenging world modeling. Mottaghi et al. (2016) predict the effects of forces on objects in images. Bakhtin et al. (2019) extend this physical reasoning to complex puzzles of cause and effect. Sun et al. (2019b,a) model scripts and actions. Alternative, unsupervised training regimes (Bachman et al., 2019) also open up research towards automatic concept formation.

While a minority of language researchers make forays into computer vision, researchers in CV are often willing to incorporate language signals. Advances in computer vision architectures and modeling have enabled building semantic representations rich enough to interact with natural language. In the last decade of work descendant from image captioning (Farhadi et al., 2010; Mitchell et al., 2012), a myriad of tasks on visual question answering (Antol et al., 2015; Das et al., 2018), natural language and visual reasoning (Suhr et al., 2019b), visual commonsense (Zellers et al., 2019a), and multilingual captioning and translation via video (Wang et al., 2019b) have emerged. These combined text and vision benchmarks are rich enough to train large-scale, multimodal transformers (Li et al., 2019a; Lu et al., 2019; Zhou et al., 2019) without language pretraining, for example via Conceptual Captions (Sharma et al., 2018), or further broadened to include audio (Tsai et al., 2019).

2 Or the 1,600 classes of Anderson et al. (2017).
3 Torchvision and Detectron both include pretrained versions of over a dozen models.
Semantic level representations emerge from Im- If A and B have some environments in common and ageNet classification pretraining partially due to some not ... we say that they have different meanings, the amount of meaning difference corresponding class hypernyms. For example, the person class roughly to the amount of difference in their sub-divides into many professions and hobbies, like environments ... firefighter, gymnast, and doctor. To differentiate Zellig S. Harris (Distributional Structure 1954) such sibling classes, learned vectors can also en- code lower-level characteristics like clothing, hair, and typical surrounding scenes. These representa- action taking open several new dimensions to un- tions allow for pixel level masks and skeletal mod- derstanding and actively learning about the world. eling, and can be extended to zero-shot settings tar- Queries can be resolved via dialog-based explo- geting all 20,000 ImageNet categories (Chao et al., ration with a human interlocutor (Liu and Chai, 2016; Changpinyo et al., 2017). Modern architec- 2015), even as new object properties like texture, tures are flexible enough to learn to differentiate weight, and sound become available (Thomason instances within a general class, such as face. For et al., 2017). We see the need for embodied lan- example, facial recognition benchmarks require guage with complex meaning when thinking deeply distinguishing over 10,000 unique faces (Liu et al., about even the most innocuous of questions: 2015). While vision is by no means “solved,” vi- sion benchmarks have led to off-the-shelf tools for Is an orange more like a baseball or a building representations rich enough for tens of banana? thousands of objects, scenes, and individuals. That the combination of language and vision su- WS1 is likely not to have an answer beyond that pervision leads to models that produce clear and the objects are common nouns that can both be held. 
coherent sentences, or that their attention can trans- WS2 may also capture that oranges and baseballs late parts of an image to words, indicates they truly both roll, but is unlikely to capture the deformation are learning about language. Moving forward, we strength, surface texture, or relative sizes of these believe benchmarks incorporating auditory, tactile, objects (Elazar et al., 2019). WS3 can begin to un- and visual sensory information together with lan- derstand the relative deformability of these objects, guage will be crucial. Such benchmarks will facili- but is likely to confuse how much force is necessary tate modeling language meaning with respect to an given that baseballs are used much more roughly experienced world. than oranges in widely distributed media. WS4 can appreciate the nuanced nature of the question— 4 WS4: Embodiment and Action the orange and baseball share a similar texture and weight allowing for similar manipulation, while the Many of the phenomena in Winograd (1972)’s Un- orange and banana both contain peels, deform un- derstanding Natural Language require representa- der stress, and are edible. We as humans have rich tion of concepts derivable from perception, such as representations of these items. The words evoke a color, counting, size, shape, spatial reasoning, and myriad of properties, and that richness allows us to stability. Recent work (Gardner et al., 2020) argues reason. that these same dimensions can serve as meaning- Control is where people first learn abstraction ful perturbations of inputs to evaluate models’ rea- and simple examples of post-conditions through soning consistency. However, their presence is in trial and error. The most basic scripts humans learn service of actual interactions with the world. Ac- start with moving our own bodies and achieving tion taking is a natural next rung on the ladder to simple goals as children, such as stacking blocks. broader context. 
In this space, we have unlimited supervision from An embodied agent, whether in a virtual world, the environment and can learn to generalize across such as a 2D Maze (MacMahon et al., 2006), a plans and actions. In general, simple worlds do not Baby AI in a grid world (Chevalier-Boisvert et al., entail simple concepts: even in block worlds (Bisk 2019), Vision-and-Language Navigation (Ander- et al., 2018) where concepts like “mirroring”—one son et al., 2018; Thomason et al., 2019b), or the of a massive set of physical phenomena that we real world (Tellex et al., 2011; Matuszek, 2018; generalize and apply to abstract concepts—appear. Thomason et al., 2020; Tellex et al., 2020) must In addition to learning basic physical proper- translate from language to control. Control and ties of the world from interaction, WS4 also al-
lows the agent to construct rich pre-linguistic rep- In order to talk about concepts, we must understand the resentations from which to generalize. Hespos and importance of mental models... we set up a model of the world which serves as a framework in which to Spelke (2004) show pre-linguistic category forma- organize our thoughts. We abstract the presence of tion within children that are then later codified by particular objects, having properties, and entering into social constructs. Mounting evidence seems to indi- events and relationships. cate that children have trouble transferring knowl- Terry Winograd - 1971 edge from the 2D world of books (Barr, 2013) and iPads (Lin et al., 2017) to the physical 3D world. So while we might choose to believe that we can en- nication in service of real-world cooperation is the code parameters (Chomsky, 1981) more effectively prototypical use of language, and the ability to fa- and efficiently than evolution provided us, develop- cilitate such cooperation remains the final test of a mental experiments indicate doing so without 3D learned agent. interaction may prove insurmountable. The acknowledgement that interpersonal com- Part of the problem is that much of the knowl- munication is the necessary property of a linguistic edge humans hold about the world is intuitive, pos- intelligence is older than the terms “Computational sibly incommunicable by language, but is required Linguistics” or “Artificial Intelligence.” Launched to understand that language. This intuitive knowl- into computing when Alan Turing suggested the edge could be acquired by embodied agents in- “Imitation Game” (Turing, 1950), the first illustra- teracting with their environment, even before lan- tive example of the test brings to bear an issue that guage words are grounded to meanings. has haunted generative NLP since it was possible. 
Robotics and embodiment are not available in Smoke and mirrors cleverness, in which situations the same off-the-shelf manner as computer vision are framed to the advantage of the model, can cre- models. However, there is rapid progress in simu- ate the appearance of genuine content where there lators and commercial robotics, and as language re- is none. This phenomenon has been noted count- searchers we should match these advances at every less times, from Pierce (1969) criticizing Speech step. As action spaces grow, we can study complex Recognition as “deceit and glamour” (Pierce, 1969) language instructions in simulated homes (Shrid- to Marcus and Davis (2019) which complains of har et al., 2019) or map language to physical robot humanity’s “gullibility gap”. control (Blukis et al., 2019; Chai et al., 2018). The Interpersonal dialogue is canonically framed as last few years have seen massive advances in both a grand test for AI (Norvig and Russel, 2002), but high fidelity simulators for robotics (Todorov et al., there are few to no examples of artificial agents 2012; Coumans and Bai, 2016–2019; NVIDIA, one could imagine socializing with in anything 2019; Xiang et al., 2020) and the cost and avail- more than transactional (Bordes et al., 2016) or ability of commodity hardware (Fitzgerald, 2013; extremely limited game scenarios (Lewis et al., Campeau-Lecours et al., 2019; Murali et al., 2019). 2017)—at least not ones that aren’t purposefully As computers transition from desktops to perva- hollow and fixed, where people can project what- sive mobile and edge devices, we must make and ever they wish (e.g. ELIZA (Weizenbaum, 1966)). meet the expectation that NLP can be deployed in More important to our discussion is why social- any of these contexts. Current representations have ization is required as the next rung on the context very limited utility in even the most basic robotic ladder in order to fully ground meaning. 
settings (Scalise et al., 2019), making collaborative robotics (Rosenthal et al., 2010) largely a domain Language that Does Something Work in the of custom engineering rather than science. philosophy of language has long suggested that function is the source of meaning, as famously il- 5 WS5: The Social World lustrated through Wittgenstein’s “language games” (Wittgenstein, 1953, 1958) and J.L. Austin’s “per- Natural language arose to enable interpersonal com- formative speech acts” (Austin, 1975). That func- munication (Dunbar, 1993). Since then, it has been tion is the source of meaning was echoed in linguis- used in everything from laundry instructions to per- tics usage-based theory of language acquisition, sonal diaries, which has given us data and situated which suggests that constructions that are useful scenarios that are the raw material for the previous are the building blocks for everything else (Lan- levels of contextualization. Interpersonal commu- gacker, 1987, 1991). In recent years, this theory
has begun to shed light on what use-cases language Q: Please write me a sonnet on the subject of the Forth presents in both acquisition and its initial origins Bridge. A: Count me out on this one. I never could write poetry. in our species (Tomasello, 2009; Barsalou, 2008). Q: Add 34957 to 70764. WS1, WS2, WS3, and WS4 lend an extra depth A: (Pause about 30 seconds and then give as answer) to the interpretation of language through context, 105621. Q: Do you play chess? because they expand the factorizations of informa- A: Yes. tion available to define meaning. Understanding Q: I have K at my K1, and no other pieces. You have only that what one says can change what people do al- K at K6 and R at R1. It is your move. What do you play? A: (After a pause of 15 seconds) R-R8 mate. lows language to take on its most active role. This is the ultimate goal for natural language generation: A.M. Turing (Computing Machinery and Intelligence 1950) language that does something to the world. Despite the current, passive way generation is created and evaluated (Liu et al., 2016), the ability In the language community, this paradigm has to change the world through other people is the been described under the “Speaker-Listener” model rich learning signal that natural language genera- (Stephens et al., 2010), and a rich theory to describe tion offers. In order to learn the effects language this computationally is being actively developed has on the world, an agent must participate in lin- under the Rational Speech Act Model (Frank and guistic activity, such as negotiation (Lewis et al., Goodman, 2012; Bergen et al., 2016). 
2017; He et al., 2018), collaboration (Chai et al., Recently, a series of challenges that attempt to 2017), visual disambiguation (Liu and Chai, 2015; address this fundamental aspect of communication Anderson et al., 2018; Lazaridou et al., 2017), pro- (Nematzadeh et al., 2018; Sap et al., 2019) have viding emotional support (Rashkin et al., 2019), all been introduced. Using traditional-style datasets of which require inferring mental states and social can be problematic due to the risk of embedding outcomes—a key area of interest in itself (Zadeh spurious patterns (Le et al., 2019). Despite in- et al., 2019). creased scores on datasets due to larger models As an example, what “hate” means in terms and more complex pretraining curricula, it seems of discriminative information is always at ques- unlikely that models understand their listener any tion: it can be defined as “strong dislike,” but what more than they understand the physical world in it tells one about the processes operating in the which they exist. This disconnect is driven home by environment requires social context to determine studies of bias (Gururangan et al., 2018; Glockner (Bloom, 2002). It is the toddler’s social experi- et al., 2018) and techniques like Adversarial Filter- mentation with “I hate you” that gives the word ing (Zellers et al., 2018, 2019b), which elucidate weight and definite intent (Ornaghi et al., 2011). In the biases models exploit in lieu of understanding. other words, the discriminative signal for the most foundational part of a word’s meaning can only Our training data, complex and large as it is, be observed by its effect on the world, and active does not offer the discriminatory signals that make experimentation seems to be key to learning that the hypothesizing of consistent identity or mental effect. This is in stark contrast to the disembodied states an efficient path towards lowering perplexity chat bots that are the focus of the current dialogue or raising accuracy. 
Firstly, there is a lack of inductive bias (Martin et al., 2018): it is not clear that any learned model, not just a neural network, would be able to posit from reading alone that people exist, that they arrange narratives along an abstract single dimension called time, and that they describe things in terms of causality. Models learn what they need to discriminate, and so inductive bias is not just about efficiency; it is about making sure that what is learned will generalize (Mitchell, 1980). Secondly, current cross-entropy training losses actively discourage learning the tail of the distribution properly, as events that are statistically infrequent are drowned out (Pennington et al., 2014; Holtzman et al., 2020). Researchers have just begun to scratch the surface by considering more dynamic evaluations (Zellers et al., 2020; Dinan et al., 2019), but true persistence of identity and adaptation to change are still a long way off.

Language in a Social Context
Whenever language is used between two people, it exists in a concrete social context: status, role, intention, and countless other variables intersect at a specific point (Wardhaugh, 2011). Currently, these complexities are overlooked by selecting labels that crowd workers agree on. That is, our notion of ground truth in dataset construction is based on crowd consensus bereft of social context (Zellers et al., 2020). In a real-world interaction, even one as anonymous as interacting with a cashier at a store, notions of social cognition (Fiske and Taylor, 1991; Carpenter et al., 1998) are present in the style of utterances and in the social script of the exchange. The true evaluation of generative models will require the construction of situations where artificial agents are considered to have enough identity to be granted social standing for these interactions.

Social interaction is a precious signal, and some NLP researchers have begun scratching the surface of it, especially for persuasion-related tasks where interpersonal relationships are key (Wang et al., 2019c; Tan et al., 2016). These initial studies have been strained by the training-validation-test set scenario, as collecting static versions of this data is difficult for most use cases and often impossible, since natural situations cannot be properly facilitated. Learning by participation is a necessary step toward the ultimately social venture of communication. By exhibiting different attributes and sending varying signals, the sociolinguistic construction of identity (Ochs, 1993) could be examined more deeply than ever before, yielding an understanding of social intelligence that simply isn't possible with a fixed corpus. This social layer is the foundation upon which language situates (Tomasello, 2009). Understanding this social layer is not only necessary, but will make clear complexities around implicature and commonsense that so obviously arise in corpora, yet with such sparsity that they cannot properly be learned (Gordon et al., 2018). Once models are commonly expected to be interacted with in order to be tested, probing their decision boundaries for simplifications of reality, as in Kaushik et al. (2020), will become trivial and natural.

… community (Adiwardana et al., 2020; Zhou et al., 2020; Chen et al., 2018; Serban et al., 2017), which often do not learn from individual experiences and are not given enough of a persistent environment to learn about the effects of actions.

Theory of Mind
By attempting to get what we want, we confront people with their own desires and identities. Premack and Woodruff (1978) began to formalize to what extent the ability to consider the feelings and knowledge of others is human-specific and how it actually functions, commonly referred to now as the "Theory of Mind."

6 Self-Evaluation

You can't learn language from the radio.

Nearly every NLP course will at some point make this claim. While the futility of learning language from an ungrounded linguistic signal is intuitive, those NLP courses go on to attempt precisely that learning. In fact, our entire field follows this pattern: trying to learn language from the internet, standing in as the modern radio to deliver limitless language. The need for language to attach to "extralinguistic events" (Ervin-Tripp, 1973) and the requirement for social context (Baldwin et al., 1996) should guide our research.

We use our notion of World Scopes to make the following concrete claims:

You can't learn language ...

... from the radio (internet). WS2 ⊂ WS3
A learner cannot be said to be in WS3 if it can perform its task without sensory perception, such as visual, auditory, or tactile information.

... from a television. WS3 ⊂ WS4
A learner cannot be said to be in WS4 if the space of actions and consequences of its environment can be enumerated.

... by yourself. WS4 ⊂ WS5
A learner cannot be said to be in WS5 if its cooperators can be replaced with cleverly pre-programmed agents to achieve the same goals.

By these definitions, most of NLP research still resides in WS2, and we believe this is at odds with the goals and origins of our science. This does not invalidate the utility or need for any of the research within NLP, but it is to say that much of that existing research targets a different goal than language learning.

Where Should We Start?
Concretely, many in our community are already pursuing the move from WS2 to WS3 by rethinking our existing tasks and investigating whether their semantics can be expanded and grounded. Importantly, this is not a new trend (Chen and Mooney, 2008; Feng and Lapata, 2010; Bruni et al., 2014; Lazaridou et al., 2016), and task and model design can fail to require sensory perception (Thomason et al., 2019a).
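The earlier claim that cross-entropy losses drown out statistically infrequent events can be made concrete with a little arithmetic. Below is a minimal sketch with illustrative numbers (our own, not measurements from any cited system): an event's contribution to the expected loss is weighted by its frequency, so a rare event modeled very badly still matters less to the objective than a common event modeled perfectly.

```python
import math

def expected_loss_contribution(p_true, p_model):
    # Per-occurrence cross-entropy loss, -log p_model,
    # weighted by how often the event actually occurs, p_true.
    return p_true * -math.log(p_model)

# A common event the model fits exactly.
common = expected_loss_contribution(p_true=0.10, p_model=0.10)

# A tail event the model underestimates by two orders of magnitude.
rare = expected_loss_contribution(p_true=0.0001, p_model=0.000001)

# Despite being modeled far worse, the tail event barely moves the loss.
print(common > rare)  # prints True
```

Under these (hypothetical) numbers, the common event contributes roughly 0.23 nats to the expected loss while the badly modeled tail event contributes about 0.0014, which is the sense in which the tail is "drowned out" during training.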
However, research and progress in multimodal, grounded language understanding has accelerated dramatically in the last few years.

Elliott et al. (2016) take the classic problem of machine translation and reframe it in the context of visual observations (Elliott and Kádár, 2017). Wang et al. (2019b) extend this paradigm by exploring machine translation with videos serving as an interlingua, and Regneri et al. (2013) introduced a foundational dataset which aligns text descriptions and semantic annotations of actions with videos. Even core tasks like syntax can be informed by visual information: Shi et al. (2019) investigate the role of visual information in syntax acquisition, which may prove necessary to learn headedness (e.g., choosing the main verb rather than the more common auxiliary as the root of a sentence) (Bisk and Hockenmaier, 2015). Relatedly, Ororbia et al. (2019) investigate the benefits of visual context for language modeling.

Collaborative games have long served as a testbed for studying language (Werner and Dyer, 1991). Recently, Suhr et al. (2019a) introduced a testbed for evaluating language understanding in the service of a shared goal, and Andreas and Klein (2016) use a simpler visual paradigm for studying pragmatics. A communicative goal can also be used to study the emergence of language: Lazaridou et al. (2018) evolve agents whose language aligns to perceptual features of the environment. These paradigms can help us examine how inductive biases and environmental pressures can build towards socialization (WS5).

Most of this research provides resources such as data, code, simulators, and methodology for evaluating the multimodal content of linguistic representations (Silberer and Lapata, 2014; Bruni et al., 2012). Moving forward, we would like to encourage a broader re-examination of how NLP researchers frame the relationship between meaning and context. Computer vision and speech recognition are mature enough for novel investigation of broader linguistic contexts (Section 3). The robotics industry is rapidly developing commodity hardware and sophisticated software that both facilitate new research and expect to incorporate language technologies (Section 4). Our call to action is to encourage the community to lean in to trends prioritizing grounding and agency (Section 5), and to explicitly aim to broaden the corresponding World Scopes available to our models.

7 Conclusions

These problems include the need to bring meaning and reasoning into systems that perform natural language processing, the need to infer and represent causality, the need to develop computationally-tractable representations of uncertainty and the need to develop systems that formulate and pursue long-term goals.
Michael Jordan (Artificial intelligence – the revolution hasn't happened yet, 2019)

Our proposed World Scopes are steep steps, and it is possible that WS5 is AI-complete. WS5 implies a persistent agent experiencing time and a personalized set of experiences. With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that do not have the structure in time from which humans draw correlations about long-range causal dependencies. What happens if a machine is allowed to participate consistently? This is difficult to imagine testing under current evaluation paradigms for generalization. Yet this is the structure of generalization in human development: drawing analogies to episodic memories, gathering a system through non-independent experiments.

While it is fascinating to broadly consider such far-reaching futures, our goal in this call to action is more pedestrian. As with many who have analyzed the history of Natural Language Processing, its trends (Church, 2007), its maturation toward a science (Steedman, 2008), and its major challenges (Hirschberg and Manning, 2015; McClelland et al., 2019), we hope to provide momentum for a direction many are already heading. We call for and embrace the incremental, but purposeful, contextualization of language in human experience. With all that we have learned about what words can tell us and what they keep implicit, now is the time to ask: what tasks, representations, and inductive biases will fill the gaps?

We believe the time is ripe to begin a deeper dive into grounded learning, embodied learning, and ultimately social participation. With recent deep learning representations, researchers can more easily integrate additional signals beyond word tokens into the learned meanings of language. We particularly advocate for the homegrown creation of such merged representations, rather than waiting for the representations and data to come to NLP from other fields.
Acknowledgements

Thanks to Raymond Mooney for discussions and suggestions, Paul Smolensky for discussion and disagreements, and to Catriona Silvey for help with references.

A Invitation to Reply

What we have introduced here is not comprehensive and you may not agree with our arguments. Perhaps they do not go far enough, they miss a relevant literature, or you feel they do not represent the goals of NLP/CL. To this end, we welcome both suggestions for an updated version of this manuscript, as well as response papers on alternate foci and directions for the field.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-up and top-down attention for image captioning and visual question answering. Visual Question Answering Challenge at CVPR 2017.

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Austin, Texas. Association for Computational Linguistics.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

John Langshaw Austin. 1975. How to Do Things with Words, volume 88. Oxford University Press.

Philip Bachman, R. Devon Hjelm, and William Buchwalter. 2019. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. 2019. PHYRE: A new benchmark for physical reasoning.

Dare A. Baldwin, Ellen M. Markman, Brigitte Bill, Renee N. Desjardins, Jane M. Irwin, and Glynnis Tidball. 1996. Infants' reliance on a social criterion for establishing word-object relations. Child Development, 67(6):3135–3153.

H. B. Barlow. 1989. Unsupervised learning. Neural Computation, 1(3):295–311.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Rachel Barr. 2013. Memory constraints on infant learning from picture books, television, and touchscreens. Child Development Perspectives, 7(4):205–210.

Lawrence W. Barsalou. 2008. Grounded cognition. Annual Review of Psychology, 59:617–645.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Leon Bergen, Roger Levy, and Noah Goodman. 2016. Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9.

Yonatan Bisk and Julia Hockenmaier. 2015. Probing the linguistic strengths and limitations of unsupervised grammar induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Yonatan Bisk, Kevin Shih, Yejin Choi, and Daniel Marcu. 2018. Learning interpretable spatial operations in a rich 3D blocks world. In Proceedings of the Thirty-Second Conference on Artificial Intelligence (AAAI-18).

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Paul Bloom. 2002. How Children Learn the Meanings of Words. MIT Press.

Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross A. Knepper, and Yoav Artzi. 2019. Learning to map natural language instructions to physical quadcopter control using simulated flight. In 3rd Conference on Robot Learning (CoRL).
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea. Association for Computational Linguistics.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Alexandre Campeau-Lecours, Hugo Lamontagne, Simon Latour, Philippe Fauteux, Véronique Maheu, François Boucher, Charles Deguire, and Louis-Joseph Caron L'Ecuyer. 2019. Kinova modular robot arms for service robotics applications. In Rapid Automation: Concepts, Methodologies, Tools, and Applications, pages 693–719. IGI Global.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Malinda Carpenter, Katherine Nagell, Michael Tomasello, George Butterworth, and Chris Moore. 1998. Social cognition, joint attention, and communicative competence from 9 to 15 months of age. Monographs of the Society for Research in Child Development, pages i–174.

Joyce Y. Chai, Rui Fang, Changsong Liu, and Lanbo She. 2017. Collaborative language grounding toward situated human-robot dialogue. AI Magazine, 37(4):32–45.

Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. 2018. Language to action: Towards interactive task learning with physical agents. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18).

Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. 2017. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV.

Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. 2016. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52–68, Cham. Springer International Publishing.

Eugene Charniak. 1977. A framed painting: The representation of a common sense knowledge fragment. Cognitive Science, 1(4):355–394.

Chun-Yen Chen, Dian Yu, Weiming Wen, Yi Mang Yang, Jiaping Zhang, Mingyang Zhou, Kevin Jesse, Austin Chau, Antara Bhowmick, Shreenath Iyer, et al. 2018. Gunrock: Building a human-like social bot by leveraging large scale real user data. Alexa Prize Proceedings.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland.

S. F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Association for Computational Linguistics, pages 310–318. Association for Computational Linguistics.

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2019. BabyAI: First steps towards grounded language learning with a human in the loop. In ICLR 2019.

Noam Chomsky. 1981. Lectures on Government and Binding. Mouton de Gruyter.

Kenneth Church. 2007. A pendulum swung too far. Linguistic Issues in Language Technology – LiLT, 2.

L. Coleman and P. Kay. 1981. The English word "lie". Linguistics, 57.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.

Erwin Coumans and Yunfei Bai. 2016–2019. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org.

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2054–2063.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1988. Improving information retrieval with latent semantic indexing. In Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, pages 36–40.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Gerald Dejong. 1981. Generalizations based on explanations. In IJCAI.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019b. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4529–4538.

Susan T. Dumais. 2004. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188–230.

Robin I. M. Dunbar. 1993. Coevolution of neocortical size, group size and language in humans. Behavioral and Brain Sciences, 16(4):681–694.

Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, and Dan Roth. 2019. How large are lions? Inducing distributions over quantitative attributes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3973–3983.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Workshop on Vision and Language at ACL '16.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130–141.

J. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 897–906, Honolulu, Hawaii. Association for Computational Linguistics.

Susan Ervin-Tripp. 1973. Some strategies for the first two years. In Timothy E. Moore, editor, Cognitive Development and Acquisition of Language, pages 261–286. Academic Press, San Diego.

Ali Farhadi, M. Hejrati, M. Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision. Springer.

Yansong Feng and Mirella Lapata. 2010. Topic models for image annotation and text illustration. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 831–839, Los Angeles, California. Association for Computational Linguistics.

J. R. Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis.

Susan T. Fiske and Shelley E. Taylor. 1991. Social Cognition. McGraw-Hill Book Company.

Cliff Fitzgerald. 2013. Developing Baxter. In 2013 IEEE Conference on Technologies for Practical Robot Applications (TePRA).

W. Nelson Francis. 1964. A standard sample of present-day English for use with digital computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199–1209.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating NLP models via contrast sets. arXiv:2004.02709.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655.

I. J. Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237–264.

Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. 2018. IQA: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4089–4098.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.

Stevan Harnad. 1990. The symbol grounding problem. Physica D, 42:335–346.

Zellig S. Harris. 1954. Distributional structure. Word, 10:146–162.

He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. 2018. Decoupling strategy and generation in negotiation dialogues. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2333–2343.

Susan J. Hespos and Elizabeth S. Spelke. 2004. Conceptual precursors to language. Nature, 430.

G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. 1986. Distributed representations. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations.

Geoffrey E. Hinton. 1990. Preface to the special issue on connectionist symbol processing. Artificial Intelligence, 46(1):1–4.

Julia Hirschberg and Christopher D. Manning. 2015. Advances in natural language processing. Science, 349(6245):261–266.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR 2020.

Daniel L. James and Risto Miikkulainen. 1995. SARDNET: A self-organizing feature map for sequences. In Advances in Neural Information Processing Systems 7 (NIPS'94), pages 577–584, Denver, CO. Cambridge, MA: MIT Press.

Michael I. Jordan. 2019. Artificial intelligence – the revolution hasn't happened yet. Harvard Data Science Review.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.

Teuvo Kohonen. 1984. Self-Organization and Associative Memory. Springer.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

George Lakoff. 1973. Hedges: A study in meaning criteria and the logic of fuzzy concepts. Journal of Philosophical Logic, 2:458–508.

George Lakoff. 1980. Metaphors We Live By. University of Chicago Press.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Ronald W. Langacker. 1987. Foundations of Cognitive Grammar: Theoretical Prerequisites, volume 1. Stanford University Press.

Ronald W. Langacker. 1991. Foundations of Cognitive Grammar: Descriptive Application, volume 2. Stanford University Press.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In ICLR 2017.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2016. The red one!: On learning to refer to things based on discriminative properties. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 213–218, Berlin, Germany. Association for Computational Linguistics.

Matthew Le, Y-Lan Boureau, and Maximilian Nickel. 2019. Revisiting the evaluation of theory of mind through question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5871–5876, Hong Kong, China. Association for Computational Linguistics.