Formalizing Inflectional Paradigm Shape with Information Theory - ACL Anthology
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Formalizing Inflectional Paradigm Shape with Information Theory Grace LeFevre Micha Elsner∗ Andrea D. Sims∗ The Ohio State University The Ohio State University The Ohio State University lefevre.33@osu.edu elsner.14@osu.edu sims.120@osu.edu Abstract to previous work focusing exclusively on stem shape organization (e.g. Maiden, 2005; Boyé and “Paradigm shape,” our term for the morpho- Cabredo Hofherr, 2006), our method focuses logical structure formed by implicative rela- equally on stems and affixes. Furthermore, our tions between inflected forms, has not been formally quantified in a gradient manner. We results enable direct analysis of both allomor- develop a method to formalize paradigm shape phic and distributional classes (Baerman et al., by modeling the joint effect of stem alterna- 2017), where most previous shape-based analysis tions and affixes. Applied to Spanish verbs, has been purely distributional. As such, this work our model successfully captures aspects of provides a unified, computational approach to phe- both allomorphic and distributional classes. nomena relating to paradigm shape that have pre- These results are replicable and extendable to dominantly been treated separately in the past. We other languages. implement our method on Spanish in this paper as 1 Introduction a test case.2 In this paper, we explore what we call “paradigm 2 Phenomenon to be modeled shape,”1 which is a type of morphological struc- Morphological structure is characterized by both ture characterized by the implicative relations syntagmatic relations and paradigmatic relations. holding among inflected forms. This structure Syntagmatic relations involve the combinatorial reflects the predictable, patterned ways in which properties of morphemes, such as the relationship stem alternants and even fully suppletive allo- between the Spanish verb stem cant- ‘sing’ and morphs occur in parallel paradigm cells across in- 1 SG suffix -o. Paradigmatic relations involve sub- flection classes in some languages (see the Span- stitutional relationships, such as the relationship ish verbs SENTIR ‘feel’, PENSAR ‘think’, and between 1 SG canto and other inflected forms of MOVER ‘move’ in Table 1). Historically, some the same lexeme, or between canto and the 1 SG Romance verbs shifted to better conform to ex- of other verbs (see Table 1). In this paper we isting paradigm shapes, indicating that this is a seek to model a kind of paradigmatic relation that cognitively real organizing principle for speakers we call “paradigm shape.” A lexeme’s paradigm (Maiden, 2005). As such, it has important impli- shape is defined by the implicative relations hold- cations for language learning and change. ing among its inflected forms, for instance how We develop a computational method to pre- well an unobserved form of some lexeme is pre- cisely quantify similarity in paradigm shape. dicted by an observed form. Implicative rela- Building on previous work measuring the inter- tions of this sort bind the forms in a paradigm to- predictability of word forms (Ackerman et al., gether (Wurzel, 1989) and conceptually, two lex- 2009; Bonami and Beniamine, 2016), we ap- emes have the same paradigm shape to the extent ply information-theoretic entropy to these forms that they exhibit the same paradigmatic implica- to compute sets of values characterizing the tions. shapes of the inflection classes. In contrast Spanish offers an interesting test case be- * Joint second authors. cause, along with other Romance languages, 1 This use of the term should not be confused with the it is well known for having paradigmatically- same term used elsewhere to refer to the number of mor- 2 phosyntactic property sets a lexeme expresses (e.g. Corbett, The code and data for our analysis are available at 2009). github.com/gracelefevre/paradigm-shape. 102 Proceedings of the Society for Computation in Linguistics (SCiL) 2021, pages 102-115. Held on-line February 14-19, 2021
LEXEME GLOSS PRS .1 SG PRS .2 SG PRS .3 SG PRS .1 PL PRS .2 PL PRS .3 PL CANTAR ‘sing’ canto cantas canta cantamos cantáis cantan SUBIR ‘rise’ subo subes sube subimos subı́s suben SENTIR ‘feel’ siento sientes siente sentimos sentı́s sienten PENSAR ‘think’ pienso piensas piensa pensamos pensáis piensan MOVER ‘move’ muevo mueves mueve movemos movéis mueven Table 1: Present indicative forms of verbs from a few Spanish microclasses; stem alternations highlighted. structured stem alternants in verbs. Spanish verbs tifies three major distributions of stem alternants are traditionally grouped into inflectional macro- in Romance verbs: the “L-pattern” (shared alter- classes (terminology from Beniamine et al., 2017) nation in 1 SG present indicative and all present based on the theme vowel that shows up in the in- subjunctive), the “N-pattern” (shared alternation finitive: -a vs. -e vs. -i (Butt et al., 2019). For in 1 SG , 2 SG , 3 SG, and 3 PL of the present in- the words in Table 1, this would group CANTAR dicative), and the “U-pattern” (shared alternation and PENSAR together, define a second group for in 1 SG and 3 PL present indicative and all present MOVER , and delineate a final group for SUBIR and subjunctive).5 In Table 1, SENTIR, PENSAR, and SENTIR . However, there is clearly more to say MOVER belong to the N-pattern. Maiden (2005, p. about the morphological structure of these words. 169) observes that while details differ from one No verb in Table 1 has exactly the same exponence language to another, in the history of Romance as any other, which is to say, they each represent speakers “... actually pass up golden opportunities a distinct inflectional microclass in the sense of to align allomorphs with morphosyntactic proper- Beniamine et al. (2017). S ENTIR, PENSAR and ties...”, instead choosing to maintain these distri- MOVER have stem alternations whereas CANTAR butions, reinforce them, and extend them to new and SUBIR do not. Moreover, the distribution of verbs. We are interested in the role this abstract stem alternants is the same for each lexeme (high- distribution of alternants plays in facilitating or lighted by purple shaded cells), despite the alterna- inhibiting inferences about the inflected forms of tions not involving the same phonology (e~ie vs. lexemes. o~ue).3 At the same time, as can be observed in Table The history of Romance is replete with ex- 1, inflectional suffixes— in particular the theme amples of morphological change motivated by vowels that show up in many inflected forms (e.g. paradigmatic stem distributions of this sort. In PRS .1 PL -amos vs. -emos vs. -imos)— have their Old Spanish, the present indicative forms of IRE own distribution. As noted above, the theme vow- ‘go’ were vo, vas, va, imos, ides, van (Maiden, els group verbs into macroclasses differently than 2005; O’Neill, 2018b), showing full stem supple- the stem alternations do. Theme vowels also im- tion with the same distribution of alternants as in pact how predictable other inflected forms of the Table 1.4 The Old Spanish forms arose from in- same lexeme are. While the a theme vowel ap- cursion, in which two separate lexemes merge to pears relatively consistently across the paradigm, become a single lexeme with suppletive forms. the e and i classes sometimes collapse (compare The fact that the result reproduced an existing dis- PRS .1 PL and PRS .2 SG ). As a result, these two tribution of alternants attests that the distribution macroclasses are more confusable.6 Moreover, was (and presumably still is) cognitively real for the theme vowel does not surface in some cells speakers. (e.g. PRS .1 SG), making these cells poorly infor- More broadly, Maiden (2004, 2005, 2009) iden- mative about other inflected forms of the lexeme. 3 Cells/inflected form thus differ in how informative In Spanish, alternation is related to lexical stress place- ment; in the relevant verbs the diphthong alternants appear 5 when the vowel is stressed and e and o appear when un- The U-pattern does not occur in Modern Spanish, hav- stressed. However, for our purposes this is not material. We ing been replaced with the L-pattern (Maiden, 2005, p. 146). are interested in the effect of the resulting stem distributions Among the modern languages, it is restricted to some Italian on the implicative structure of the paradigm. varieties and Romanian (Maiden, 2009, p. 47). We therefore 4 do not consider it further in this paper. This alternation has been leveled in Modern Spanish, 6 which has present indicative forms voy, vas, va, vamos, vais, See Penny (1972) for changes in the history of Spanish van. that were motivated by this confusability. 103
they are about the inflected forms of other cells as ganization of the stem space, which can be rep- a function of their suffixes. resented as an acyclic graph of implicative rela- We take “paradigm shape” in Spanish to encom- tions. Core insights of this and other formal work pass both the stem alternations and the suffixes. on stem organization are thus that (in Romance) In this paper we develop methods for modeling certain cells predictably have the same stem form, their joint effect on the implicative relations hold- that cells with different stem forms enter into pre- ing among inflected forms and use this to quantify dictable implicative relations, and that these rela- similarity in paradigm shape across lexemes. tions are often parallel across classes. At the same time, the formal theory approach 3 Previous Work has a number of limitations in the context of try- Related work can be roughly divided into two lines ing to quantify the extent to which lexemes have of investigation. The first models the distribution similar paradigm shapes. Two are important here. of stem alternants within a paradigm. The sec- First, there is a limited ability to express par- ond consists of work on inflectional complexity tial similarity in the implicative relations holding which is interested in the predictability of inflected among inflected forms. For example, Boyé and forms. These have points of intersection, since Cabredo Hofherr’s method encodes implicative re- both focus on the distribution of implicative rela- lations holding among stems, but not the extent to tions within the paradigm. However, to the best of which words are similar in their implicative rela- our knowledge our work is among the first seeking tions. In Maiden’s classification into L-, N- and to integrate the insights of each.7 U-patterns, the patterns are discrete and any no- Work modeling the paradigmatic distribution tion of similarity among patterns is left informal of stem alternants (in Romance and elsewhere) at best. Yet intuitively, some paradigms are more has tended to approach it either from diachronic similar in shape than others, without being exactly perspective (Aski, 1995; Herce, 2020; Hippisley the same. For example, VENIR ‘come’, DECIR et al., 2004; Juge, 1999; Maiden, 2004, 2005, ‘say’ and TENER ‘have’ follow the L-pattern but 2009; O’Neill, 2018b; Wheeler, 2011), as noted additionally have the N-pattern alternation (except above, or from the perspective of formal linguis- in the 1 SG indicative present, where the N- and L- tic theory (Bonami and Boyé, 2002; Boyé and patterns overlap). These verbs thus have a “modi- Cabredo Hofherr, 2006, 2010; Hippisley, 1998; fied” L-pattern. We want to quantify this and other Maiden, 2011; Montermini and Bonami, 2013; distributional variation in fully gradient terms. O’Neill, 2018a).8 As an example of the latter ap- Second, formal analyses have tended to abstract proach, Boyé and Cabredo Hofherr (2006) iden- away from affixes, in order to focus on stem or- tify eleven stem ‘zones’ for Spanish verbs— sets ganization. Yet as we note above, inferences of paradigm cells which always exhibit the same about the inflected forms of a Spanish verb depend stem form. No verb has a different stem for each on both stem distributions and affix distributions, zone9 and Boyé and Cabredo Hofherr argue that which are partly independent. So in order to un- the distribution of stems alternants is not random, derstand paradigm shape as an organizing princi- but rather, systematically constrained by the or- ple of inflectional systems, we want to model stem 7 The work of Stump and Finkel (Finkel and Stump, 2007, and affix distributions jointly. 2009; Stump and Finkel, 2013) is also notable for bridg- The second line of research reflects comple- ing these two lines of research. They extensively examine how principal parts structure inflectional systems. Defin- mentary insights and complementary problems. ing different notions of principal parts— static, dynamic, Coming from the literature on inflectional com- and adaptive— allows them to investigate questions of dis- plexity, it consists of work that uses information- tributional parallelism across lexemes and classes. How- ever, since principal parts are defined set-theoretically, their theoretic measures, predominantly conditional en- approach encounters difficulty capturing partial parallelism. tropy, to measure the average uncertainty asso- Ultimately, we take their work as inspiration but we think that ciated with the unobserved form realizing some our approach has a number of advantages. 8 Much of this work engages with the concept of a mor- paradigm cell, given one or more observed forms phome (Aronoff, 1994), meaning structure that is irreducibly of the same lexeme (Ackerman et al., 2009; Ack- morphological in nature and autonomous of both syntax and erman and Malouf, 2013; Bonami and Beniamine, phonology. This issue is beyond the scope of the present pa- per. 2016; Cotterell et al., 2019; Mansfield, 2016; 9 SER ‘be’ has the largest number, with six distinct stems. Parker and Sims, 2020; Sims and Parker, 2016; 104
Stump and Finkel, 2013). This work tends to be form of exponents, also serves to group classes typological in focus. that have different exponents in the same distri- The information-theoretic approach has proven bution.11 The insight behind Maiden’s L- and N- popular for quantifying paradigmatic relations in patterns (and other work on stem organization) is a gradient way. At the same time, this literature that stem classes are distributional in nature. Con- has tended to abstract away stem alternations in ditional entropy as it has been employed in the order to focus on affixes and other ‘primary’ in- inflectional complexity literature cannot capture flectional exponence (e.g. Ackerman and Malouf, such classes unless the input data to entropy calcu- 2013), although there are exceptions (Parker and lations is transformed into a purely distributional Sims, 2020). This reflects in part a tendency to representation (a process we refer to below as ‘dei- rely on hand segmentation of words into morpho- dentification’). logical exponents, a problematic issue (Beniamine In this paper we seek to bridge the gap between and Guzmán Naranjo, 2021) that we return to be- the historical/formal literature on stem space or- low. ganization and the information-theoretic literature More importantly, extending the information- on inflectional complexity and improve on both. theoretic approach to the task of quantifying We draw on information-theoretic measures devel- paradigm shape turns out to be a challenge be- oped in the inflectional complexity literature and cause conditional entropy is insufficient by itself apply them to investigating the extent to which to fully capture the regularities that we are inter- implicative relations exhibit distributional paral- ested in. Specifically, since entropy is calculated lelism across lexemes and classes. As we show over surface exponents, it treats identical distribu- below, by doing so we are able to precisely quan- tions instantiated by different phonological mate- tify similarity in paradigm shape in a way that is rial (as with the stem alternations in SENTIR and replicable and extendable to new languages. Our PENSAR vs. MOVER in Table 1) as formally inde- methods also take into account both stem and af- pendent. Conditional entropy thus misses abstract fix distributions. This allows us to capture insights generalizations of the sort embodied by Maiden’s that have predominantly been treated separately in L- and N-patterns. Ultimately, entropy is appro- previous work. priate to quantifying what Baerman et al. (2017) call ‘allomorphic’ inflection class systems, but it 4 Methods does not automatically capture the kinds of gener- We quantify the strength of implicative relations alizations that instantiate ‘distributional’ systems. between cells by identifying sets of cells that “con- Allomorphic systems are the type of inflection fuse” two microclasses in that system—that is, class system familiar to most linguists: classes are sets among which internal comparisons do not al- defined by inflectional exponents and two lexemes low precise assignment of a verb to a single mi- belong to different classes if they are realized by croclass. Using entropy, we then compute the de- different exponents. In contrast, distributional sys- gree to which each such set of cells helps to iden- tems are ones in which two lexemes are realized by tify the exact inflectional microclass of each verb. the same set of exponents, but these are distributed We structure these values in a matrix of m micro- differently among paradigm cells (Baerman et al., classes × n sets of cells, where each entry corre- 2017).10 Class divisions are thus defined by the sponds to the entropy value associated with a set distribution of exponents, rather than the form of of cells for a particular microclass. These entropy the exponents. Baerman et al. are primarily in- values provide a quantitative basis for analyzing terested in how inflection class distinctions are in- the inflectional system along multiple organiza- stantiated, but from a converse perspective, we ob- tional dimensions. serve that the idea of classes based on the distri- To make precise our definition of “confusion,” bution of exponents, rather than the phonological our method relies on segmentation. Given a set 10 One of Baerman et al.’s canonical examples is from Aza- of forms of a single lexeme, we identify a theme, zulco Otomı́, an Oto-Manguean language of Mexico. In verbs, one class is defined by having the suffix -di in the stem-like material that remains invariant for ev- first, second, and third person realis incompletive, and an- ery form in the set, and a set of distinguishers, other class is defined by having the suffix -di in the first and 11 second person realis completive [pp.13,112]. Which cells -di This is also reflected in the concept of a distillation, as shows up in is the only thing distinguishing these two classes. found in Stump and Finkel (2013). 105
set of forms theme distinguishers nation in PENSAR is now forced into the distin- piensas, pensamos pensas i, mo guisher (since it is not shared with the 1 PL). pienso, piensas, piensa piens o, as, a This notion of confusion is important because piens o, sa, a it enables us to view the classes as locally similar even when they are globally different. As shown Table 2: Local segmentation examples. above, PENSAR and CANTAR are confused by sets which do not show both stem alternants of PEN - form-specific, affix-like material that vary across SAR . On the other hand, PENSAR and SENTIR are the set.12 See Table 2 for examples using forms of confused by sets which do not vary the expression PENSAR . of the theme vowel (such as 3 SG, 1 PL). We can ap- Although segmentation-based analysis of mor- ply the same definition even when the stem is en- phological systems is common in the computa- tirely suppletive. Because our definition is based tional literature, in the morphological literature on internal contrasts within sets of cells, it can rec- there is no accepted standard for what constitutes ognize (for example) that SER ‘be’ is anomalous a ‘correct’ segmentation (Spencer, 2012). The as- among Spanish verbs by virtue of its suppletive signment of phonological material to stems vs. to preterite (1 SG present indicative soy, preterite fui), affixes often reflects language-specific traditions but also that, within the set of preterite forms, its of analysis, unclear analytic criteria, and/or the- conjugation is relatively regular. oretical considerations (Taylor, 2008). Further- Enumerating every set of cells which confuses more, different segmentation strategies can result two microclasses is difficult, since if a large set S in different analyses of inflection class structure confuses two microclasses, each of its exponen- (Beniamine et al., 2017). Our use of “local” tially many subsets does as well. We restrict our segmentation (potentially identifying a different attention to the maximal confusion sets for each theme and distinguishers for each set of forms) pair of microclasses,13 which, we show below, can rather than “global” segmentation (identifying a be efficiently computed. A set of cells S (size >1) single theme/stem for each lexeme) follows Beni- is a maximal confusion set for microclasses A and amine et al. (2017). They show that this method B if no superset of S also confuses A and B. yields better descriptions of lexemes with stem al- Once all maximal confusion sets have been ternations. In such cases, the alternating charac- identified, we evaluate each set’s predictive power ters are included in the theme when only one stem by determining how it groups all the microclasses allomorph is included in the set, but in the dis- in the system into mutually confusable partitions. tinguisher when multiple allomorphs are present We compute how well the set narrows down the (compare rows 1 and 2 in Table 2). Therefore, the identity of each microclass. If a particular mi- analysis of different sets can show both the regu- croclass is confusable with k classes on the basis larity of the affixes (PENSAR takes the -as suffix in of some set of cells, the remaining uncertainty is 2. SG) and the presence of the alternation. − log2 k bits. Entropy’s usefulness as a quantita- We can now define a confusion between micro- tive standard is particularly clear in the case of no classes: a set of cells confuses two classes if, when confusability: if the set uniquely identifies a par- the inflected forms for each class are locally seg- ticular class, the entropy value is zero, indicating mented, they have identical distinguishers. For in- that there is no remaining uncertainty about which stance, [ PRS .1 SG, PRS .2 SG ] confuses CANTAR class the set belongs to. and PENSAR. Local segmentation yields the dis- By applying this process to each maximal con- tinguishers o, as for both. However, if we added fusion set and each microclass, we compute a ma- PRS .1 PL to the set, it would no longer confuse trix of entropy values quantifying the distribution these two classes. The -i- from the stem alter- of predictive relationships across the inflectional 12 system. For this paper, we analyzed the Spanish Ideally, the theme is the longest common subsequence of all the forms; we approximate this NP-hard computation by verbal inflectional system, using 60 morphosyn- aligning the forms one at a time using dynamic programming. tactic property sets of 40 verb microclasses (drawn Each character has an identical insertion cost of 1. Once the 13 theme is identified, we realign each form to the theme and Our method computes maximal confusion sets only for designate the unaligned characters as the distinguisher. Posi- pairs of classes. We believe that sets can be computed for tions of discontinuities are not noted in either the themes or larger numbers of classes as well, but leave this for future the distinguishers. work. 106
from Brodsky (2005)).14 Our method generates to-one correspondences between sets. 290 maximal sets, for a resulting 40 x 290 matrix Using the previously described Spanish data, of entropy values. Further details of the algorithm our deidentified method generates 25,239 maxi- are described below. mal sets, for a resulting 40 x 25239 matrix of en- tropy values. 4.1 Deidentification 4.2 Algorithmic efficiency The procedure just described highlights differ- ences between microclasses based on both the af- We state above that the maximal confusion sets for fixes they take and the distribution of different each pair of microclasses can be efficiently com- stem allomorphs within the paradigm. To fo- puted, although the search space contains expo- cus on purely distributional information, we also nentially many sets. Here, we explain how this develop a “deidentified” analysis which abstracts can be done. We begin with the intuition that ev- away from the forms of the distinguishers. In ery maximal confusion set must be associated with this analysis, we replace the individual characters some theme for each row involved. That is, know- within the distinguishers (which represent affixes, ing that a given set of morphosyntactic properties stem alternations and other local variation within can yield an identical distinguisher set for class the set) with abstract identifiers indicating the po- 1 and class 2 presupposes the existence of some sitions of identical characters. For instance, the theme A for class 1 and some theme B for class distinguishers o, as, a would be represented as α, 2 that produce these distinguishers. Therefore, de- βγ, β. This enables them to match o, es, e, in termining all the possible themes for class 1 and which the theme vowel has changed but its distri- class 2 and finding the largest confusion set for bution has not. each cross-class pairing of themes will necessarily To replace the characters with identifiers, we yield all the maximal confusion sets (along with perform a multi-string alignment of the distin- some non-maximal confusion sets). guishers within each step. We search for multi- Next, we note that themes (longest common way alignments using the A∗ algorithm (Russell subsequences of sets of forms) grow monoton- and Norvig, 2021, p85). Having obtained strings ically shorter as more forms are added. Thus, of abstract identifiers, we want to identify con- all possible themes for a given microclass can be fusion sets between microclasses. This requires computed by aligning every pair of forms within a slight modification of our confusion definition it, then aligning the resulting themes until no fur- for the deidentified case: a set of cells confuses ther themes can be generated. Once all themes for two classes if, when the inflected forms for each a class have been determined, we compare each class are locally segmented, they have deidenti- theme against each form in the class and find the fied distinguishers with a perfect one-to-one cor- largest possible set of cells for which that theme is respondence. This enables us to identify matches valid; denote this set S(T ) for a theme T . between distinguisher sets with different abstract Now, to find confusion sets for a pair of classes identifiers. For example, α, βγ, β and γ, δ, I, J, we test every pair of themes Ai and Bj . δ do not have identical identifiers but do have a We take set S(Ai ) ∩ S(Bj ) and test whether it perfect one-to-one correspondence (α:γ, β:γ, γ:) has at least two members, and whether local seg- and therefore comprise a confusion set. We again mentation of those members actually produces the use the A∗ algorithm to search for maximal one- themes Ai , Bj .15 All sets that meet this specifica- tion are output as potential confusion sets. After 14 We only look at verbs with full paradigms in this work. all confusion sets are generated for a pair of rows, Real language learners may not observe every form for ev- ery verb, due to their Zipfian distribution (Lignos and Yang, any confusion sets that are subsets of any others 2018); we do not address the question of learning shapes from are removed; this ensures that only maximal con- this kind of partial data. We also do not address the issue fusion sets are retained. This method of identify- of inflectional defectiveness (paradigmatic gaps) (Albright, 2003), which causes problems for our method even when all ing maximal confusion sets is followed for every forms of a verb are available. Sims (2015) and others ar- pair of classes in the system. gue that gaps are sometimes irreducible morphological ob- 15 jects, including in Spanish verbs (Boyé and Cabredo Hofherr, Because the intersection may be smaller than the origi- 2010; Maiden and O’Neill, 2010; O’Neill, 2018a). In this nal sets, local segmentation might produce themes which are case, it makes sense to treat defective verbs as defining addi- longer than Ai , Bj , in which case the set is not valid for this tional microclasses. pair of themes, although it might be output for another pair. 107
The final step of the algorithm is applying en- consists of both -er and -ir verbs, so it is clear tropy. After we have collected all maximally con- that the clustering is not driven by inflectional fusable sets for all microclasses, we generate a ma- affixes alone. In fact, Maiden’s stem alterna- trix of m microclasses × n maximally confusable tions explain most of the clustering distinction, as sets. For each maximal set, we iterate through the most verbs in the left-hand cluster exhibit the N- classes; for each class, we determine how many pattern (SENTIR, PEDIR, DORMIR, CONSTRUIR, classes it can be confused with based on the forms ARG ÜIR , O ÍR , PERDER , MOVER , DISCERNIR , and in the maximal set. Two classes are confusable ADQUIRIR ) while most in the right-hand clus- by a set of forms if it is possible for both classes ter have the L-pattern (LUCIR, ASIR, CONOCER, to generate an identical distinguisher set for those COMER , SUBIR , VALER , and SALIR ). forms; similar logic applies to confusability of three or more classes. The corresponding cells in Our deidentified approach is able to take this the matrix are filled with the resulting count val- one step further and draw even finer distributional ues. The entropy is − log2 of the counts. distinctions between classes. The previously men- tioned upper-left-hand cluster in Figure 1 splits 5 Results into two smaller clusters under the deidentified ap- proach in Figure 2. Though the large cluster is We visualize our matrices of entropy values us- united by all of its members having Maiden’s N- ing hierarchical clustering and t-SNE analyses pattern (except for O ÍR, which has the mixed N- (van der Maaten and Hinton, 2008). Though the pattern and L-pattern), the split into smaller clus- underlying clusters we discuss in this section are ters can be explained by another alternation in present in both analyses, we focus on the t-SNE the preterite. As shown in Table 3, the verbs in results, providing the dendrograms as Appendix the first group (SENTIR, PEDIR, DORMIR, CON - A. Figure 1 shows t-SNE visualizations for the STRUIR , ARG ÜIR , and O ÍR ) all have an alternation maximally confusable sets generated by our pri- in their third person singular and third person plu- mary method, and Figure 2 shows the same for ral preterite indicative forms, whereas those in the the deidentified case. For our traditional class second group (PERDER, MOVER, DISCERNIR, and categorizations, we grouped all the classes Brod- ADQUIRIR ) have no alternations in the preterite. sky (2005) deemed “fundamentally irregular” to- gether and then organized the remaining “basi- It is important to note that the verbs in the first cally regular” classes into -ar, -er, and -ir groups group do not all exhibit the same alternation: SEN - based on their infinitive forms. We also catego- TIR and PEDIR have e~i; DORMIR has o~u; and rized our classes based on Maiden’s alternation CONSTRUIR , ARG ÜIR , and O ÍR have i~y. This patterns, identifying the L-pattern, the N-pattern, shows a strength of our local segmentation strat- and a “modified L-pattern” (a mixed N-pattern and egy. The i~y alternation appears at the boundary L-pattern) in our data. between stem and affix, but we are not forced to These visualizations highlight several key com- commit to placing it in one or the other a priori; ponents of paradigm shape in Spanish verbs. Tra- instead, it can be grouped with the other two alter- ditional allomorphic class groupings are readily nations which occur in comparable positions. This distinguishable in Figure 1, most clearly in the illustrates our method’s utility in identifying distri- cluster of red -ar verbs. By contrast, the -er and - butional class structure. ir verbs are somewhat interspersed. This comports with the fact that the -ar classes have a fairly con- Our method provides a gradient, numerical sistent a inflectional suffix across their paradigms, characterization of structural similarities between while the -er and -ir classes demonstrate inconsis- the paradigms of Spanish verbs. In doing so, it tency in the realization of an i vs. e suffix. These captures several pre-existing intuitions about the clustering structures show that our method is cap- implicative structure of Spanish verbal inflections, turing aspects of allomorphic classes delineated by including the traditional inflection classes as well inflectional exponents. as Maiden’s distributional classes. Moreover, it Distributional class groupings are also present. also makes finer distinctions which were not ex- For example, the large swath of classes at the plicitly listed in prior work, but which follow from top of Figure 1 has two main clusters. Each their principles of analysis. 108
Figure 1: Results of t-SNE analysis based on entropy of maximally confusable sets. Colors show the traditional classes; symbols show the Maiden alternation patterns. LEXEME GLOSS PRET.1 SG PRET.2 SG PRET.3 SG PRET.1 PL PRET.2 PL PRET.3 PL SENTIR ‘feel’ sentı́ sentiste sintió sentimos sentisteis sintieron PEDIR ‘ask for’ pedı́ pediste pidió pedimos pedisteis pidieron DORMIR ‘sleep’ dormı́ dormiste durmió dormimos dormistes durmieron CONSTRUIR ‘build’ construı́ construiste construyó construimos construisteis construyeron ARG ÜIR ‘argue’ argüı́ argüiste arguyó argüimos argüisteis arguyeron O ÍR ‘hear’ oı́ oı́ste oyó oı́mos oı́steis oyeron PERDER ‘lose’ perdı́ perdiste perdió perdimos perdisteis perdieron MOVER ‘move’ movı́ moviste movió movimos movistes movieron DISCERNIR ‘discern’ discernı́ discerniste discernió discernimos discernisteis discernieron ADQUIRIR ‘acquire’ adquirı́ adquiriste adquirió adquirimos adquiristeis adquirieron Table 3: Preterite alternation that leads to the cluster split observable in Figure 2 109
References Farrell Ackerman, James P. Blevins, and Robert Mal- ouf. 2009. Parts and wholes: Patterns of relatedness in complex morphological systems and why they matter. In James P. Blevins and Juliette Blevins, ed- itors, Analogy in grammar: Form and acquisition, pages 54–82. Oxford University Press. Farrell Ackerman and Robert Malouf. 2013. Morpho- logical organization: The Low Conditional Entropy Conjecture. Language, 89(3):429–464. Adam Albright. 2003. A quantitative study of spanish paradigm gaps. Proceedings of the West Coast Con- ference on Formal Linguistics, 22:1–14. Mark Aronoff. 1994. Morphology by itself: Stems and inflectional classes. MIT Press. Janice Aski. 1995. Verbal suppletion: An analysis Figure 2: Results of t-SNE analysis based on entropy of Italian, French, and Spanish to go. Linguistics, of maximally confusable sets in the deidentified con- 33:403–432. dition. Coding of colors and shapes is the same as in Figure 1. Matthew Baerman, Dunstan Brown, and Greville G. Corbett. 2017. Morphological complexity. Cam- bridge University Press. 6 Conclusion and future work Sacha Beniamine, Olivier Bonami, and Benoı̂t Sagôt. 2017. Inferring inflection classes with description In this paper, we present a method for pre- length. Journal of Language Modelling, 5:465–525. cisely quantifying paradigm shape in a repli- cable, extendable way. Bridging the gap be- Sacha Beniamine and Matı́as Guzmán Naranjo. 2021. Multiple alignments of inflectional paradigms. Pro- tween formal literature on stem space organiza- ceedings of the Society for Computation in Linguis- tion and information-theoretic literature on inflec- tics, 4. tional complexity, this work models the joint ef- fect of stem alternations and suffixes on the im- Olivier Bonami and Sacha Beniamine. 2016. Joint pre- dictiveness in inflectional paradigms. Word Struc- plicative relations holding among inflected forms. ture, 9(2):156–182. We show that our model captures important allo- morphic and distributional class insights in Span- Olivier Bonami and Gilles Boyé. 2002. Suppletion and dependency in inflectional morphology. In Frank ish verbs. In the future, we would like to extend van Eynde, Lars Hellan, and Dorothee Beermann, our notion of confusion sets beyond the pairwise editors, Proceedings of the 8th International Confer- case. We would also like to develop a systematic ence on Head-Driven Phrase Structure Grammar, way of substantiating the precise structures cap- pages 51–70. CSLI. tured by our approach; though our analysis goes Gilles Boyé and Patricia Cabredo Hofherr. 2006. The some length toward showing what particular as- structure of allomorphy in Spanish verbal inflection. pects of structure the method is sensitive to, a rig- Cuadernos de Lingüı́stica del Instituto Universitario orous substantiation is beyond the current scope of Ortega y Gasset, 13:9–24. our work. Finally, we aim to use our method to an- Gilles Boyé and Patricia Cabredo Hofherr. 2010. De- alyze other Romance languages and to trace how fectiveness as stem suppletion in French and Spanish shape has impacted the historical development of verbs. In Matthew Baerman, Greville G. Corbett, and Dunstan Brown, editors, Defective paradigms: Romance verbs. Missing forms and what they tell us, pages 35– 52. Oxford University Press, in coordination with Acknowledgements British Academy Press. We thank Bob Levine and the members of the David Brodsky. 2005. Spanish verbs made simple(r). Autumn 2019 Undergraduate Research Seminar at University of Texas Press. Ohio State for their feedback on a preliminary ver- John Butt, Carmen Benjamin, and Antonia Mor- sion of this work, as well as three anonymous re- eira Rodrı́guez. 2019. A new reference grammar of viewers. modern Spanish, 6th edition. Routledge. 110
Greville G. Corbett. 2009. Canonical inflection Martin Maiden. 2011. Morphomes and ‘stress- classes. In Fabio Montermini, Gilles Boyé, and conditioned allomorphy’ in Romansh. In Martin Jesse Tseng, editors, Selected proceedings of the 6th Maiden, John Charles Smith, Maria Goldbach, and Décembrettes, pages 1–11. Cascadilla. Marc-Olivier Hinzelin, editors, Morphological au- tonomy: Perspectives from Romance inflectional Ryan Cotterell, Christo Kirov, Mans Hulden, and Ja- morphology, pages 36–50. Oxford University Press. son Eisner. 2019. On the complexity and typol- ogy of inflectional morphological systems. Transac- Martin Maiden and Paul O’Neill. 2010. On mor- tions of the Association for Computational Linguis- phomic defectiveness: Evidence from the Romance tics, 7:327–342. languages of the Iberian Peninsula. In Matthew Baerman, Greville G. Corbett, and Dunstan Brown, Raphael Finkel and Gregory T. Stump. 2007. Princi- editors, Defective paradigms: Missing forms and pal parts and morphological typology. Morphology, what they tell us, pages 103–124. Oxford University 17:39–75. Press, in coordination with British Academy Press. John Mansfield. 2016. Intersecting formatives and in- Raphael Finkel and Gregory T. Stump. 2009. Principal flectional predictability: How do speakers and learn- parts and degrees of paradigmatic transparency. In ers predict the correct form of Murrinhpatha verbs? James P. Blevins and Juliette Blevins, editors, Anal- Word Structure, 9:183–214. ogy in grammar: Form and acquisition, pages 13– 53. Oxford University Press. Fabio Montermini and Olivier Bonami. 2013. Stem spaces and predictability in verbal inflection. Lingue Borja Herce. 2020. Alignment of forms in Spanish ver- e Linguaggio, 12:171–190. bal inflection: The gang poner, tener, venir, salir, valer as a window into the nature of paradigmatic Paul O’Neill. 2018a. Near-synonymy in morpholog- analogy and predictability. Morphology, 30:90–115. ical structures: Why Catalans can abolish consti- tutions but Portuguese and Spanish speakers can’t. Andrew Hippisley. 1998. Indexed stems and Russian Languages in Contrast, 18:6–34. word formation: A network-morphology account Paul O’Neill. 2018b. Velar allomorphy in Ibero- of Russian personal nouns. Linguistics, 36:1093– Romance: Roots, endings and clashes of mor- 1124. phomes. In Miriam Bouzouita, Ioanna Sitaridou, and Enrique Pato, editors, Studies in historical Andrew Hippisley, Marina Chumakina, Greville Cor- Ibero-Romance morpho-syntax, pages 13–46. John bett, and Dunstan Brown. 2004. Suppletion: Fre- Benjamins. quency, categories and distribution of stems. Studies in Language, 28:387–418. Jeff Parker and Andrea D. Sims. 2020. Irregularity, paradigmatic layers, and the complexity of inflec- Matthew Juge. 1999. On the rise of suppletion in ver- tion class systems: A study of Russian nouns. In bal paradigms. In 25th Proceedings of the Annual Francesco Gardani and Peter Arkadiev, editors, The Meeting of the Berkeley Linguistics Society, pages complexities of morphology, pages 23–51. Oxford 183–194. Berkeley Linguistics Society. University Press. Constantine Lignos and Charles Yang. 2018. Morphol- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, ogy and language acquisition. In Andrew Hippisley B. Thirion, O. Grisel, M. Blondel, P. Pretten- and Gregory T. Stump, editors, Cambridge hand- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas- book of morphology, pages 765–791. Cambridge sos, D. Cournapeau, M. Brucher, M. Perrot, and University Press. E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830. Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine Ralph Penny. 1972. Verb-class as a determiner of stem- learning research, 9(Nov):2579–2605. vowel in the historical morphology of Spanish verbs. Revue de Linguistique Romane, 36:342–359. Martin Maiden. 2004. When lexemes become allo- morphs: On the genesis of suppletion. Folia Lin- Stuart Russell and Peter Norvig. 2021. Artificial intel- guistica, 38:227–256. ligence: A modern approach. Pearson. Martin Maiden. 2005. Morphological autonomy and Andrea D. Sims. 2015. Inflectional defectiveness. diachrony. In Geert Booij and Jaap van Marle, edi- Cambridge University Press. tors, Yearbook of morphology 2004, pages 137–175. Andrea D. Sims and Jeff Parker. 2016. How inflection Springer. class systems work: On the informativity of implica- tive structure. Word Structure, 9(2):215–239. Martin Maiden. 2009. From pure phonology to pure morphology: The reshaping of the Romance verb. Andrew Spencer. 2012. Identifying stems. Word Struc- Recherches linguistiques de Vincennes, 38:45–82. ture, 5:88–108. 111
Gregory T. Stump and Raphael A. Finkel. 2013. Mor- phological typology: From word to paradigm. Cam- bridge University Press. Catherine Taylor. 2008. Maximising stems. In Mil- tiadis Kokkonidis, editor, Proceedings of LingO 2007, pages 228–235. Faculty of Linguistics, Philol- ogy and Phonetics, University of Oxford. Max Wheeler. 2011. The evolution of a morphome in Catalan verb inflection. In Martin Maiden, John Charles Smith, Maria Goldbach, and Marc- Olivier Hinzelin, editors, Morphological autonomy: Perspectives from Romance inflectional morphol- ogy, pages 183–209. Oxford University Press. Wolfgang U. Wurzel. 1989. Inflectional morphology and naturalness. Kluwer. 112
Appendix A: Dendrograms Although we find our similarity measurements are most interpretable via the T-SNE visualizations in the main paper, T-SNE is non-deterministic and can sometimes erroneously group points that are not similar in the underlying space. Thus, we also present dendrograms in which the points are clus- tered using the complete method from Scikit Learn (Pedregosa et al., 2011). These show the same clustering structures discussed in the main paper. Figure 3 shows the main results and Figure 4 the deidentified results. 113
Figure 3: Results of hierarchical clustering analysis based on entropy of maximally confusable sets. 114
Figure 4: Results of hierarchical clustering analysis based on entropy of maximally confusable sets in the deiden- tified condition. 115
You can also read