On the Evolution of Word Order
Idan Rejwan, Tel Aviv University and AI21 Labs, Tel Aviv, Israel (idanrejwan91@gmail.com)
Avi Caciularu, Bar-Ilan University, Ramat-Gan, Israel (avi.c33@gmail.com)

arXiv:2101.09579v1 [cs.CL] 23 Jan 2021

Abstract

Most natural languages have a predominant or fixed word order. For example, in English, the word order used most often is Subject-Verb-Object. This work attempts to explain this phenomenon, as well as other typological findings regarding word order, from a functional perspective. That is, we target the question of whether a fixed word order gives a functional advantage that may explain why these languages are common. To this end, we consider an evolutionary model of language and show, both theoretically and using a genetic algorithm-based simulation, that an optimal language is one with a fixed word order. We also show that adding information to the sentence, such as case markers and a noun-verb distinction, reduces the need for a fixed word order, in accordance with the typological findings.

1 Introduction

Word order is a linguistic phenomenon concerning the arrangement of syntactic elements in the sentences of different languages. Specifically, it deals with the relative position of the three prominent syntactic roles in a sentence: subject (S), verb (V), and object (O). Theoretically, there are six possible orders of the three syntactic roles in a sentence: SVO, SOV, VSO, VOS, OVS, and OSV. However, typological findings show that most natural languages have one preferred or fixed word order, with the vast majority belonging to the first two families, SVO or SOV (Comrie, 1989). Interestingly, languages that developed spontaneously, without contact with other languages, also show a distinct preference for a particular word order, such as the Al-Sayyid Bedouin Sign Language (Sandler et al., 2005). Another typological finding is related to case markers, morphological indicators (usually prefixes or suffixes) that mark the syntactic role of a word in a sentence. For example, the word for "the boy" in Arabic is al-walad-u if it is the subject and al-walad-a if it is the object. Notably, in languages with a rich system of case markers, such as Arabic, Latin, and Sanskrit, word order is more flexible (Sapir, 1921; McFadden, 2004).

These findings raise several research questions. First, why is there a preferred or fixed order in the majority of natural languages? Second, why do most languages have SVO or SOV as their preferred order? Third, how can we explain the relationship between case markers and flexible word order?

This work addresses these questions from the following evolutionary perspective (Pinker and Bloom, 1990; Nowak et al., 2000). As in the theory of evolution in the animal world, if a particular feature appears in many different species, it is assumed to give a functional advantage to the organisms carrying it, thus helping them survive natural selection. Similarly, linguistic universals that are common to all languages are assumed to provide a functional advantage and enable better communication between speakers.

Following this perspective, a plethora of works have suggested using Genetic Algorithms (GA), a method for finding solutions in large search spaces inspired by evolutionary principles, to investigate typological findings (Batali, 1998; Kirby, 1999; Levin, 1995; Kirby, 2001). In this work, we adopt this framework and use GA to study the phenomenon of word order, which has not yet been investigated in this way. In particular, we attempt to answer whether a fixed word order gives a functional advantage, which might explain its emergence in most natural languages. We also provide a theoretical analysis proving that the grammar with the greatest functional advantage, under some assumptions, is one with a fixed word order.

2 Method

Following Pinker and Bloom (1990), we assume that the role of language is to communicate a literal description of a thought from one person to another, so that at the end of the verbal communication, the thought is fully and accurately conveyed to the hearer. Accordingly, the optimal language is the one that allows the most precise delivery of the thought from the speaker to the hearer.
It should be emphasized that the hearer observes only the sentence received from the speaker, which does not contain any additional information, such as the relationship between the words and their syntactic roles.

Intuitively, word order might affect the way the hearer understands the sentence. For example, compare the English sentence "we ship a man" with "we man a ship". Since the words "ship" and "man" may function as both nouns and verbs, and in the absence of case markers, the meaning is determined solely by the order of the words in the sentence.

Under these assumptions, we use GA to simulate a language's evolution, starting with no preferred word order, and test whether communication constraints enforce the fixation of a particular word order.

2.1 Algorithm

The algorithm is initialized with a lexicon, a fixed list of 1000 words, where each word is a string of three letters. The words are randomly initialized and are given to all agents. We test both the option of a single lexicon for all words and that of two separate lexicons for nouns and verbs (see Section 4).

We initialize a population of 100 grammars, where each grammar is a probability distribution over the six possible word orders, initialized to the uniform distribution. Each grammar is represented by a speaker and a hearer trying to communicate a message over a noisy channel, and their success in communication measures the fitness of the grammar. In other words, for every grammar, the speaker produces a sentence based on the grammar, and the hearer determines the syntactic role of each word in the sentence.

The GA runs for 1000 successive generations. In each generation, the speaker samples three random words from the lexicon together with their syntactic roles. He then samples an order according to the probability distribution in the grammar, concatenates the sampled words accordingly, and sends the sentence to the hearer. Random noise then flips every letter in the sentence with probability p = 0.01. Note that the noisy sentence might contain words that are out of the lexicon.

Next, the hearer receives the noisy sentence and matches the syntactic roles in two steps. First, for every word in the sentence, he removes the noise by looking for the nearest word in the lexicon in terms of Levenshtein distance. Then, he outputs the most probable word order given the grammar and the words.
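The following is a minimal Python sketch of one such speaker-hearer round under the assumptions above. The function names (`speak`, `hear`, `levenshtein`) and the implementation details are ours, not from the paper; we use a standard dynamic-programming edit distance and resolve the hearer's order choice by taking the grammar's most probable order.

```python
import random
import string

ORDERS = ["SVO", "SOV", "VSO", "VOS", "OVS", "OSV"]
# A lexicon of 1000 random three-letter words, shared by all agents.
LEXICON = ["".join(random.choices(string.ascii_lowercase, k=3)) for _ in range(1000)]

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def speak(grammar, words, p=0.01):
    """Sample an order from the grammar, concatenate the role words, add channel noise.

    grammar: list of six probabilities, one per entry of ORDERS.
    words:   dict mapping the roles "S", "V", "O" to lexicon words.
    """
    order = random.choices(ORDERS, weights=grammar)[0]
    sentence = "".join(words[role] for role in order)
    # Flip each letter to a random letter with probability p.
    noisy = "".join(
        random.choice(string.ascii_lowercase) if random.random() < p else ch
        for ch in sentence
    )
    return order, noisy

def hear(grammar, noisy):
    """Denoise each word to its nearest lexicon entry, then pick the most probable order."""
    chunks = [noisy[i:i + 3] for i in range(0, len(noisy), 3)]
    denoised = [min(LEXICON, key=lambda w: levenshtein(w, chunk)) for chunk in chunks]
    guessed_order = max(zip(grammar, ORDERS))[1]
    return guessed_order, denoised
```

In the N-V and Case scenarios, `hear` would presumably also exploit the separate lexicons or the role suffixes before falling back on the grammar, but the paper does not spell out that step.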
Finally, we measure the Hamming distance between the input and the output, i.e., the fraction of syntactic roles that were assigned different positions. For example, the distance between SVO and SOV is 2/3, since O and V are at different locations.

In each generation, the 30% of grammars with the smallest distances are replicated to the next generation, with random mutations applied to their probability distributions: Gaussian noise with variance 0.01 is added to each probability in the grammar, which is then re-normalized to sum to 1. After many generations, the surviving grammars are expected to be the ones that allow optimal communication between the speaker and the hearer.
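A sketch of this selection-and-mutation step, assuming the population is held as a NumPy array with one distribution per row; the helper names and the clipping used to keep probabilities positive are our additions.

```python
import numpy as np

def hamming(order_in: str, order_out: str) -> float:
    """Fraction of syntactic roles at different positions, e.g. hamming("SVO", "SOV") == 2/3."""
    return sum(a != b for a, b in zip(order_in, order_out)) / len(order_in)

def next_generation(population: np.ndarray, distances: np.ndarray,
                    survival: float = 0.3, variance: float = 0.01) -> np.ndarray:
    """Replicate the 30% of grammars with the smallest average distance, then mutate.

    population: shape (n_grammars, 6), each row a distribution over the six orders.
    distances:  average input-output distance per grammar (lower is fitter).
    """
    n = len(population)
    survivors = population[np.argsort(distances)][: max(1, int(survival * n))]
    # Refill the population by replicating survivors at random.
    children = survivors[np.random.randint(len(survivors), size=n)]
    # Add Gaussian noise with the given variance, then re-normalize each row to sum to 1.
    children = children + np.random.normal(0.0, np.sqrt(variance), size=children.shape)
    children = np.clip(children, 1e-9, None)
    return children / children.sum(axis=1, keepdims=True)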
We use this method to examine whether adding information to the sentence, such as a distinction between nouns and verbs or explicit case markers, changes the optimal grammar.

We tested different values for the population size, selection rate, and mutation rate; all the examined values yielded similar trends.

3 Theoretical Analysis

As mentioned, optimal communication occurs when the distance between the input and the output is minimal. In this section, we prove that under this definition of optimality, and assuming the hearer always identifies the words in the lexicon correctly despite the noise (as indeed happens with high probability), the following holds.

Theorem 1. Any optimal grammar has exactly one fixed word order.

Proof. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ be random variables for the occurrence of the possible word orders, associated with the speaker and the hearer, respectively. We define

$$P(X_i = 1) = P(Y_i = 1) = p_i, \quad i = 1, \ldots, n.$$

Let $d_{ij}$ be the distance between $X_i$ and $Y_j$, defined as the number of syntactic roles at different positions in $X_i$ and $Y_j$. Then, the expected distance is given by

$$E[d] = \sum_{(i,j)} d_{ij} \, P(X_i, Y_j),$$

where $P(X_i, Y_j)$ is the probability that the speaker chose order $X_i$ (with probability $p_i$) and the hearer chose order $Y_j$ (with probability $p_j$). Since the hearer has no additional information beyond the observed sentence, he picks the most probable order from his grammar, independently of the speaker. Accordingly, $P(X_i, Y_j) = p_i p_j$, and therefore the expected distance is given by

$$E[d] = \sum_{(i,j)} p_i \, d_{ij} \, p_j.$$

Note that $p_i d_{ij} p_j \geq 0$ for all $i, j$, since $p_i$ and $p_j$ are probabilities and $d_{ij}$ is a distance. Since $d_{ij} = 0$ only for $i = j$, reaching the minimal expected distance of 0 requires that for all $i \neq j$, either $p_i = 0$ or $p_j = 0$. Therefore, $p_i$ can be greater than zero in exactly one entry. In other words, the only probability distributions that yield an expected distance of zero assign probability 1 to one particular word order $X_i$ and zero to the rest. Accordingly, the optimal grammar must have a fixed word order.
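As a quick numerical illustration of Theorem 1 (our addition, not part of the paper), one can evaluate $E[d] = \sum_{(i,j)} p_i d_{ij} p_j$ directly over the six orders: any mixed distribution yields a strictly positive expected distance, while a point-mass distribution yields zero.

```python
import numpy as np

ORDERS = ["SVO", "SOV", "VSO", "VOS", "OVS", "OSV"]

# d_ij: fraction of syntactic roles at different positions between orders i and j.
D = np.array([[sum(a != b for a, b in zip(oi, oj)) / 3 for oj in ORDERS]
              for oi in ORDERS])

def expected_distance(p) -> float:
    """E[d] = sum_ij p_i * d_ij * p_j for a grammar p over the six orders."""
    p = np.asarray(p, dtype=float)
    return float(p @ D @ p)

print(expected_distance(np.full(6, 1 / 6)))   # positive: mixed orders cause mismatches
print(expected_distance([1, 0, 0, 0, 0, 0]))  # 0.0: a fixed order communicates perfectly
```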
4 Experiments

We tested the model in four different scenarios:

1. Base: all words are taken from a single lexicon, without a distinction between nouns and verbs and without case markers.

2. N-V: separate lexicons for nouns and verbs, which enables the hearer to distinguish between nouns and verbs and thus identify the verb in the sentence.

3. Case: each word in the sentence carries a suffix indicating its syntactic role (S, V, or O).

4. N-V & Case: both 2 and 3, i.e., there is a distinction between nouns and verbs, and every word in the sentence carries a case marker.

5 Results

For every scenario, we present the average distance between the input and the output among all grammars in each generation (Figure 1).

Figure 1: Average distance of sentences over generations.

It can be observed that the base scenario starts with the highest distance among the scenarios, as its sentences carry no information regarding the syntactic roles. A similar observation holds for the N-V scenario, in which the hearer can identify the verb in the sentence but cannot distinguish between the subject and the object. In both scenarios, we observe a drop in the average distance, meaning that at some point the hearers successfully parsed the sentences with only a small error. In the two scenarios with case markers, the distance is very close to zero even in the first generation, meaning that the hearers parsed the sentences correctly from the first generation.

To examine whether the grammars converged to a fixed word order, we present the grammars' average entropy in every generation (Figure 2).

Figure 2: Average entropy of grammars over generations.

The entropy is defined by $H = -\sum_i p_i \log p_i$ (normalized so that $0 \leq H \leq 1$), where $H = 1$ represents a uniform distribution (absolute randomness) and $H = 0$ represents probability 1 for one word order and zero for the rest (absolute determinism). The average entropy of the grammars begins at 1 (absolute randomness) in all scenarios, as the probability of every word order is equal. In the base scenario, we observe a fast decline in entropy until it converges to zero, i.e., a fixed word order. A decline is also observed in the N-V scenario; however, it is slower and does not converge to zero even after 1000 generations. As for the scenarios with case markers, in both cases the entropy remains high even after 1000 generations, meaning that no particular word order became fixed.
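The entropy curve in Figure 2 can be computed as below; normalizing by $\log 6$ is our assumption for making $H$ range over $[0, 1]$ with six word orders, matching the bounds stated above.

```python
import numpy as np

def normalized_entropy(p) -> float:
    """H = -sum_i p_i log p_i, scaled by log(6) so that 0 <= H <= 1 for six orders."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum() / np.log(6))

print(normalized_entropy(np.full(6, 1 / 6)))   # 1.0: uniform grammar, no preferred order
print(normalized_entropy([1, 0, 0, 0, 0, 0]))  # 0.0: fixed word order
```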
We also present the grammar that attained the smallest distance after 1000 generations (Figure 3). In the base and N-V scenarios, the grammar converged to an (almost) fixed word order: VOS in the base scenario and VSO in the N-V scenario. In contrast, in both scenarios with case markers, the word order remained flexible even after 1000 generations.

Figure 3: Word order probabilities after 1000 generations.

6 Discussion

In the base scenario, we observed a fast fixation of a single word order. Therefore, in line with the theoretical result, we conclude that given no additional information about the sentence's syntactic relations, a fixed word order is necessary for successful communication. We also observed a fixation in the N-V scenario, meaning that a distinction between nouns and verbs does not provide sufficient information for communication, and thus a fixed word order is also necessary in this case. In contrast, neither scenario with case markers converged to a preferred word order, meaning that the case markers by themselves suffice to allow communication. This finding is consistent with the results of artificial language learning experiments conducted on human learners (Fedzechkina et al., 2011). Note that since the case markers we used were just one letter, they could have been flipped by the random noise, leading to a wrong parse by the hearer. However, the chance that two case markers in the same sentence are both flipped is small, so the hearer can parse the sentence correctly with high probability.

The reason why the base scenario converged to VOS and the N-V scenario converged to VSO is arbitrary, a matter of random drift, since there is no functional advantage to any particular word order over the others.

We note that additional experiments performed with a larger number of syntactic roles yielded similar trends. Hence, we conclude that these findings do not depend on the number of syntactic roles.

7 Conclusions

Theoretically and empirically, this work demonstrates that the absence of information about the syntactic relations in a sentence implies that any optimal language must have a fixed word order.

Empirically, we used GA to test whether additional information can make a fixed word order redundant. We showed that a distinction between nouns and verbs gives additional information to the hearer but is not sufficient. In contrast, case markers provide enough information to enable a flexible word order. These results are in line with the typological findings.

As to why most natural languages prefer SVO or SOV orders over others, the model used in this work does not seem to provide an answer. As we have seen, even in cases where fixation on a single word order was achieved, we obtained various possible orders, not necessarily one of these two. This is because we did not assume any functional advantage for individual word orders over others, so the order fixed at the end is purely the result of random drift. Additional psycho-linguistic considerations must be involved to explain this typological finding, as done in other works (Gibson et al., 2013; Hengeveld et al., 2004).
References

John Batali. 1998. Computational simulations of the emergence of grammar. In Approaches to the Evolution of Language, pages 405–426.

Bernard Comrie. 1989. Language Universals and Linguistic Typology: Syntax and Morphology. University of Chicago Press.

Maryia Fedzechkina, T. Florian Jaeger, and Elissa L. Newport. 2011. Functional biases in language learning: Evidence from word order and case-marking interaction. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.

Edward Gibson, Steven T. Piantadosi, Kimberly Brink, Leon Bergen, Eunice Lim, and Rebecca Saxe. 2013. A noisy-channel account of crosslinguistic word-order variation. Psychological Science, 24(7):1079–1088.

Kees Hengeveld, Jan Rijkhoff, and Anna Siewierska. 2004. Parts-of-speech systems and word order. Journal of Linguistics, pages 527–570.

Simon Kirby. 1999. Syntax out of learning: The cultural evolution of structured communication in a population of induction algorithms. In European Conference on Artificial Life, pages 694–703. Springer.

Simon Kirby. 2001. Spontaneous evolution of linguistic structure: An iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2):102–110.

Michael Levin. 1995. The evolution of understanding: A genetic algorithm model of the evolution of communication. BioSystems, 36(3):167–178.

Thomas McFadden. 2004. The Position of Morphological Case in the Derivation. PhD dissertation, University of Pennsylvania.

Martin A. Nowak, Joshua B. Plotkin, and Vincent A. A. Jansen. 2000. The evolution of syntactic communication. Nature, 404(6777):495–498.

Steven Pinker and Paul Bloom. 1990. Natural language and natural selection. Behavioral and Brain Sciences, 13(4):707–727.

Wendy Sandler, Irit Meir, Carol Padden, and Mark Aronoff. 2005. The emergence of grammar: Systematic structure in a new language. Proceedings of the National Academy of Sciences, 102(7):2661–2665.

Edward Sapir. 1921. Language: An Introduction to the Study of Speech.