Multiword expressions as discourse markers in Hebrew and Lithuanian
Giedre Valunaite Oleskeviciene
Institute of Humanities, Mykolas Romeris university
Ateities 20 LT-08303, Vilnius, Lithuania
gvalunaite@mruni.eu

Chaya Liebeskind
Department of Computer Science, Jerusalem College of Technology
21 Havaad Haleumi st. 9116001, Jerusalem, Israel
liebchaya@gmail.com

Abstract

Multiword expressions are of key importance in language generation and processing. Certain multiword expressions can also operate as discourse markers. In this research, we combined the alignment model of phrase-based statistical machine translation with manual treatment of the data in order to examine English multiword discourse markers and their equivalents in Lithuanian and Hebrew, by researching their changes in translation. After establishing a full list of multiword discourse markers in our generated parallel corpus, we focused on the two most frequent ones functioning as stance attitudinal discourse markers, I think and you know, aiming to research whether they demonstrate functional stability as stance attitudinal discourse markers in translation and what changes they undergo in Lithuanian and Hebrew translation. Our research shows that the examined multiword discourse markers preserve their function as stance attitudinal discourse markers and tend to remain multiword discourse markers in the Hebrew translation but turn into one-word discourse markers in Lithuanian due to the translation tendency relying on inflections.

1 Introduction

Research on multiword expressions has identified that language is not produced just word by word but usually involves generating certain chunks using a lot of formulaic constructions (Barlow, 2011). Native speakers have a multitude of memorized sequences to perform various functions within language, for example, organizing discourse (Nattinger and DeCarrico, 1992), or processing language by the speaker and the hearer (Siyanova-Chanturia et al., 2011). Formulaic language includes idioms and proverbs, various clichés and collocations, lexical bundles, and phrasal verbs. Biber et al. (2004) observed that lexical bundles constitute a high percentage of the produced language, and the authors identified that one function of lexical bundles is to organize discourse, giving I think as an example of such a bundle, which relates to the research on discourse markers (DMs). Phrases such as you know and I think have also been classified as DMs that perform certain discourse organising functions. However, Maschler and Schiffrin (2015) observe that there is no a priori theoretical classification of DMs and that the analysis of function in the data is necessary.

Research on DMs as tools of discourse management shows that they carry several functions, including signposting, signalling, and rephrasing. Furthermore, there are ongoing attempts to investigate the importance of discourse layers in language production, communication, second language learning, and translation. Additionally, Dobrovoljc (2017) has recently attempted to research multiword expressions as DMs in a corpus of spoken Slovene, identifying structurally fixed discourse marking multiword expressions. The underlying assumption is that the DMs I think and you know are indicators of stance in discourse used to express and understand points of view and beliefs. The purpose of the current research is to examine multiword expressions used as DMs in TED talk English transcripts, focusing on the stance attitudinal DMs I think and you know, and to compare them with their counterparts in Lithuanian and Hebrew, following the observation of Maschler and Schiffrin (2015) on the necessity of closer investigation of their function as stance DMs.

To achieve the aim of the research, the set objectives were to create a parallel research corpus to identify multiword expressions used as stance attitudinal DMs and to analyse their translations in Lithuanian and Hebrew to determine whether they function as stance DMs and are also multiword expressions or one-word translations, or whether they acquire any other linguistic forms. An additional benefit of the study was extending the available resources and providing linguistic processing for several languages by creating a multilingual parallel corpus (including English, Lithuanian, and Hebrew); the created corpus is shared and interlinked via CLARIN open language resources. What is more, the current research could be extended to other languages. Future research envisions applying machine learning and using the model for discourse marker identification in other languages to research how stance signalling is treated.

2 Theoretical background

The literature overview briefly takes into account the research languages, studies related to multiword expressions and their use as DMs, the importance of DMs for discourse management, and certain insights into DM translation.

2.1 Cultural heritage and research languages

First, it is necessary to briefly discuss the cultural heritage of the languages of the research, which, in a way, guided the choice of languages for our study. According to Bieliauskienė (2012), Jewish and Lithuanian cultures coexisted on the same territory from the first half of the 14th century. The author stressed that from the 19th century onwards, in the Republic of Lithuania, Vilnius was called Lithuania’s Jerusalem, attracting knowledgeable people in the field of education and inspiring a flourishing high culture, for example, in theatre, art, and literature. In fact, both languages, Lithuanian and Hebrew, formed the cultural heritage of the region. In this study, we research the Lithuanian and Hebrew corpus in parallel with pivotal English.

Lithuanian is an old surviving Baltic language, retaining forms related to Sanskrit and Latin and preserving most of the phonological and morphological aspects of the Proto-Indo-European language. Thus, it has gained importance in Indo-European language studies and has been researched by many scientists so far, including Ferdinand de Saussure, who considered Lithuanian “the Galapagos of linguistic evolution” (Joseph, 2009). Lithuanian is rich in declensions and cases inside the declensions, and the oldest layer of the Lithuanian vocabulary is related to the Indo-European language, which is dated to be approximately over 5000 years old.

Hebrew is a very old Semitic language and a successful example of a revived dead language. It survived in the medieval period as the language of religious scriptures and was revived in the 19th century as a spoken and literary language (Joslyn-Siemiatkoski, 2007). Hebrew is an important language for researchers specializing in Middle East civilizations and Christian theology studies.

2.2 Multiword expressions as DMs

The research areas of natural language processing (NLP), linguistics, and translation are closely related to discourse research, focusing on discourse relations between clauses or sentences. NLP research focuses in more depth on multiple language-related areas, such as semantic phenomena, dialogue exchange structure, and discourse textual structure (Webber and Joshi, 2012). NLP recognizes that language is not just placing words in the right order but getting the meaning and deeper textual relations as well as organizing ideas into a logical textual flow. According to researchers (Barlow, 2011; Sinclair, 1991), language is not just generated word by word; it is also formulaic. Speakers possess multiple learnt formulaic sequences, which, according to Siyanova-Chanturia et al. (2011), are important in organizing discourse and help the language producer and recipient to manage language processing. However, formulaic language is not as easy to manage and categorize for NLP research as it may seem at first sight, since the sequences that could be considered formulaic vary in length, meaning, fixedness, etc., and the definition of formulaic language has not fully crystallized. It could be considered an umbrella term embracing idioms, proverbs, clichés, phrasal verbs, collocations, and lexical bundles (Wray, 2012). According to Wei and Li (2013), formulaic language covers approximately 60% of written texts in their researched corpus of English academic language. According to Biber et al. (1994, 1999), lexical bundles are groups of words that show a statistical tendency to co-occur and could be considered extended collocations, for example, I think. Biber et al. (2004) identify that lexical bundles have functional purposes, such as organizing discourse, expressing stance, and
referential meaning. Based on the evidence of the formulaic nature of language for communication, research has turned to investigating multiword expressions used as DMs (Dobrovoljc, 2017), identifying structurally fixed discourse marking multiword expressions.

Another important issue in NLP is discourse management, which is related to discourse relations, connecting ideas between sentences and bigger parts of the text. Discourse relations may remain implicit or be expressed explicitly through discourse markers, which support textual coherence and discourse management and are used to segment coherent speech appropriately so that the text can be understood. DMs perform important functions, such as signposting, signalling, and rephrasing, by facilitating discourse organization. They are mainly drawn from the syntactic classes of conjunctions, adverbials, and prepositional phrases (Fraser, 2009), as well as expressions such as you know, you see, and I mean (Schiffrin, 2001; Hasselgren, 2002; Maschler and Schiffrin, 2015). Hasselgren (2002) argued that better DM-signalled fluency contributes to interaction and even makes the speaker sound more ‘native-like’. Recently, research on discourse relations and DMs has gained impetus with corpora annotation for exploring discourse structure in texts, for example, the Penn Discourse Tree Bank (PDTB) (Webber et al., 2016). Furthermore, there has been a rise in annotated multilingual corpora for researching different means of expressing discourse relations and managing discourse (Stede et al., 2016; Zufferey and Degand, 2017; Oleskeviciene et al., 2018).

Language, especially spoken language, is characterised by DM use; however, some DMs (e.g., you know, I think, well) are sometimes referred to in a critical manner, as indicating a lack of fluency (O’Donnell and Todd, 2013). Still, DMs are abundantly used and, according to Crystal (1988), they enhance communication if used appropriately and should not be considered unnecessary or undesirable. As Biber (2006) observed, DMs such as you know or well are very rare in written language. However, they are quite common in spoken discourse and should not be treated as just fancy words, since they serve the function of organizing discourse by signalling, rephrasing, marking, or relating ideas. Svartvik (1980) observed that, if a foreign language learner makes a mistake (e.g., he goed), it can be easily identified and redeemed by the native speaker; however, if a learner misses words such as you know or well, the native speaker cannot identify any error and the speech might sound impolite or even dogmatic. The same idea is also supported by Hasselgren (2002), who observed that DMs enhance interaction. Furthermore, learner corpora have also been used to demonstrate the importance of discourse level knowledge, especially at more advanced levels of language learning (Granger, 2015; Cobb and Boulton, 2015).

2.3 Translation issues of DMs

DMs are used in both written texts and spoken discourse to connect ideas and guide the reader or the listener through expression by ensuring that the ideas are grasped correctly. DMs have been researched by applying various theoretical approaches, such as Rhetorical Structure Theory (Mann and Thompson, 1988), Segmented Discourse Representation Theory (Asher et al., 2003), and PDTB (Prasad et al., 2008), first focusing on the monolingual approach, which later led to multilingual studies focusing on translation (Degand and Pander Maat, 2003; Pit, 2007; Dixon, 2009; Zufferey and Cartoni, 2012). As Zufferey and Cartoni (2012) observed, multilingual studies are more complicated as languages differ in the use of DMs and their expression. The authors also added that DMs are often polysemic, which means that a single DM may express various discourse relations. They provided the example of the English since, which could express temporal or causal discourse relations depending on the surrounding context.

Recently, much research has turned to parallel translated corpora. For example, Dupont and Zufferey (2017) investigated translation corpora to study whether register, translation direction, or translator’s expertise could influence the shifts of meaning and omissions of English and French markers of concession. Hoek et al. (2017) investigated a parallel corpus of English parliamentary debates translated into Dutch, German, French, and Spanish, examining which types of DMs tend to be omitted more frequently in translation. Baker (2018), in her extensive studies on translation, observed that DMs could be used to signal different relations and these relations could
be expressed by a variety of means. The author provided the example that, in English, the expression of causality could be realized through content verbs, such as cause or lead, or more simply, through a DM signalling the causality relation. Further, different languages demonstrate different tendencies: some languages prefer using simpler structures connected by a variety of DMs, while other languages favour complex structures, sparsely using explicit DMs. The author analysed the example of an evident difference between English and Arabic, identifying that, while English prefers signalling discourse relations through DMs, Arabic prefers grouping the information into bigger grammatical chunks and using fewer DMs. The finding is supported by Al-Saif and Markert (2010), who observed that, in Arabic, many discourse relations are expressed via prepositions with nominalizations. Therefore, translation poses a challenge in adapting various preferences of the source and target languages. Translators face various choices: they may insert DMs to make the flow of ideas smoother in the target text, but they risk making the translation sound foreign, or they may transpose the grammatical syntactic structure and end up using different means of expressing DMs or simply omitting them. It appears that it is not always possible to use the word-for-word technique and natural changes in translation are sometimes inevitable. According to Baker (2018), grammatical changes in translation involve certain techniques, such as substitution, transposition, omission, and supplementation.

Substitution is the change of the grammatical category of the source unit in translation. For example, active voice is more common in Lithuanian; therefore, English passive voice units could be changed into active units:

1. He was told the news. – jam pranešė naujienas

Similarly, in the following example, the verb in the source language is changed into a noun in Hebrew translation.

2. We should have broken ten minutes before. – היינו צריכים לצאת להפסקה לפני עשר דקות

Transposition represents a change of position in the order of elements of the source textual unit or changing the part of speech in translation, which implies the change in the order of the elements in the translated text. In Lithuanian translation, we observe a change in the order of the elements in the sentence.

3. After he had left – Jam išėjus.

In the case of Hebrew translation, the change of the order of the elements could be observed in the following example.

4. Classical music – מוזיקה קלאסית

Omission occurs when some elements of the original text could be considered excessive or redundant in translation. In the Lithuanian translation example, the whole phrase I thought is omitted.

5. I thought you said you were alright. – Bet tu sakei, kad viskas gerai.

In the following example in Hebrew, the translation of are is omitted.

6. We still are – אנחנו עדיין

Supplementation involves changes when new elements, which are non-existent in the source text, appear in the translated text in order to ensure structural adequacy of the latter. Such modifications are usually considered structurally or contextually motivated. For example, due to the elliptical nature of the English language, the Lithuanian translation should use supplementation to make the translation understandable.

7. Soap star – muilo operos žvaigždė (although the word opera is omitted in English due to ellipsis, it should be added in Lithuanian translation to make it contextually coherent).

The same technique should be applied in the Hebrew translation.

8. Soap star – כוכב אופרת סבון

As shown above, translation is not a mere process of transposing words from one language into another but requires certain motivated changes. Thus, translation involves grammatical transformations, as a result of the process of looking for approximate correspondences in the translated texts.
2.4 Research data resources

It should be stressed that parallel data resources are not extensive, and researchers still need to work on creating parallel corpora for their research, especially if they would like to cover a variety of languages and areas. One of the most prized parallel multilingual resources is Europarl (Koehn, 2005). It comprises the translations of the European Parliament proceedings (at most 50 million words) in most European languages; however, it covers just one specific domain of parliamentary proceedings. TED talk subtitles seem to be a growing resource of parallel linguistic material, covering a multitude of languages. In addition, being an open and developing resource, TED talks attract the attention of researchers, and their subtitles cover a wide variety of knowledge fields (Cettolo et al., 2012), which makes the data of the talks widely applicable. However, researchers should keep in mind that the talks are translated by volunteers, although with administratively managed quality checks, and the translation is mostly unidirectional from source English subtitles to other target languages. Furthermore, Dupont and Zufferey (2017) identified that such talks contain features of both spoken and written language, as they are semi-prepared speeches by nature. Additionally, Lefer and Grabar (2015) observed that subtitle translation bears certain specificity in itself. Even taking into account the features of TED talks discussed by researchers, TED talks are highly useful, as they are an open resource and could provide large amounts of parallel data for research. Besides, parallel corpora are employed as a pool of data for statistical machine translation systems, and TED talks are one of the most frequent data resources used to explore multilingual Neural MT (NMT) (Aharoni et al., 2019; Chu et al., 2017; Hoang et al., 2018; Khayrallah et al., 2018; Tan et al., 2018; Xiong et al., 2019; Zhang et al., 2019). NMT, as currently the newest technique of MT, stems from the model of the functioning of the neural networks of the human brain, which place information into different layers for processing before generating the outcome. With technological advancements, NMT gained impetus, as it used to be too costly, resource and computation wise, to outdo phrase-based MT, which operates on the basis of translating entire sequences of words. Now, the neural approach of NMT has started challenging the long-lasting prevalence of phrase-based MT techniques.

3 Research methodology

The detailed description of the research procedures is provided in this section. In the current research, phrase-based MT was applied for two main reasons: NMT techniques do not allow extensive processing of phrases, and NMT procedures are not as explicit as phrase-based MT processes. The current study does not involve the full set of phrase-based MT procedures, as it is used just for phrase table construction, which is a single step of the phrase-based MT paradigm. The research aim comprised examining multiword expressions used as DMs in TED talk English transcripts and comparing them with their counterparts in Lithuanian and Hebrew. Thus, there was a need to achieve the double objectives of creating the parallel corpus for the research data and carrying out the research on multiword expressions used as DMs in the studied languages. Rather than working on one language with statistical methods, we drew on parallel-corpus knowledge via an alignment algorithm. Initially, the list of multiword and one-word expressions that could potentially be used as DMs was generated relying on theoretical insights by Schiffrin (1987) and the classification provided by Fraser (2009). Fraser’s extensive classification was taken as a basis, and the theoretical analysis by Huang (2011) of DM characteristics for spoken discourse, for example, you know, you see, I mean, I think, was also included.

3.1 Parallel Corpus creation

First, a parallel corpus meeting the research aim needed to be created. We decided to use TED Talk transcripts, as they are publicly available and provide appropriate material for parallel data. In order to create a substantial parallel corpus containing data in English, Lithuanian, and Hebrew, the talks were extracted automatically using purpose-written code, which ensured that English sentences with the candidate DMs from the theoretically based list were extracted and matched with their Lithuanian and Hebrew counterparts. The process of creating the parallel corpus allows parallelizing the data of any researched languages. While building the corpus, the parallel texts in English, Lithuanian, and Hebrew were extracted from TED talk transcripts. Then, the sentences were aligned to make
a parallel corpus for further research. The corpus contains 87,230 aligned sentences (published in LINDAT/CLARIN-LT repository http://hdl.handle.net/20.500.11821/34).

3.2 Multiword DM extraction

Another stage of the research focuses on multiword expressions that are used as DMs to ensure textual cohesion and, according to Fraser (2009), to relate separate discourse messages. For example, phrases such as you know, I mean, of course, are characteristic of spoken language (Maschler and Schiffrin, 2015; Furkó and Abuczki, 2014; Huang, 2011). Thus, 3,314 aligned sentences containing the earlier mentioned multiword expressions were extracted and manually annotated, spotting the cases in which the expressions were used as DMs. One-word DM identification did not pose much of a challenge; multiword expressions, however, certainly did. For example, to identify whether the expression you know is used as a DM, the context in which it occurs should be examined. Two situations arise: (1) the multiword expression you know is used to introduce a new discourse message, or (2) its words are content words fully integrated into the sentence.

1. You know, this is really an infinite thing.

2. You know exactly what you want to do from one moment to the other.

After that, the variations of the translations of DMs into Lithuanian and Hebrew were extracted automatically for a comparative study, determining the variations in translation. We ran an NLP word-alignment algorithm to extract a phrase table of all the possible translations of the researched DMs, using our parallel corpus (in our case, source = English, target = Lithuanian/Hebrew). The extraction of the translation variations relied on the phrase-based statistical machine translation model introduced by Koehn et al. (2003). The model is illustrated for the research languages in Figures 1 and 2.

Figure 1: Lithuanian – English phrase alignment

Figure 2: English – Hebrew phrase alignment

Figure 1 visualizes Lithuanian–English corresponding phrases marked in respective colours. Figure 2 shows the respective English–Hebrew phrase alignment, with a note for the reader that Hebrew text should be read from right to left.

The model applies the segmentation of the input into sequences of words, which are called phrases, and then each phrase is translated into English phrases that could later be reordered in the output. Such a model ensures the correspondence between the units of phrases. After being extracted, all the possible translations were manually filtered to reject the wrong translation variants and prepare the data for the machine analysis stage. This helped us extract sentences with translations of the researched DMs from the target language corpus and analyse their use.

While analysing the data, we noticed that there was a small amount of data left which did not fit the variations of possible translations. The first supposition was that it might represent cases of omission; however, we decided to analyse it closely to verify. We manually checked the extracted non-attached data and established that most of the analysed cases involved omission, along with some minor cases of grammatical transformation, incorrect translations, and some phrases not included in the possible translations by the machine.

4 Research findings

4.1 Multiword DM distribution

The most frequent multiword expressions used in the study corpus have been extracted and are presented in Table 1. It can be seen in Table 1 that the two most frequent multiword expressions in the corpus are I think and you know.

As mentioned earlier, multiword expressions needed to be manually annotated, spotting the cases when the expressions were used as DMs. The manual annotation revealed that some multiword expressions were used as DMs more frequently, while others were more often used as content words fully integrated into sentences.
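The contrast drawn by the manual annotation can be made concrete by computing, for each expression, the share of its occurrences annotated as DMs. The following is a minimal illustrative sketch, not the authors' code: the counts are copied from Table 2, while the names TABLE_2, dm_share, and rank_by_dm_share are hypothetical and introduced here only for illustration.

```python
# Counts copied from Table 2 of the paper:
# expression -> (occurrences annotated as DM, occurrences annotated as content words).
# All names below are hypothetical helpers, not part of the study's tooling.
TABLE_2 = {
    "I think": (473, 107),
    "You know": (380, 193),
    "That is": (29, 341),
    "Of course": (233, 79),
    "You see": (47, 240),
    "In fact": (217, 39),
    "I mean": (168, 31),
    "For example": (117, 44),
}


def dm_share(expression):
    """Return the fraction of occurrences of `expression` annotated as a DM."""
    dm, content = TABLE_2[expression]
    return dm / (dm + content)


def rank_by_dm_share():
    """Return the expressions ordered from highest to lowest DM share."""
    return sorted(TABLE_2, key=dm_share, reverse=True)


if __name__ == "__main__":
    for expression in rank_by_dm_share():
        dm, content = TABLE_2[expression]
        total = dm + content
        print(f"{expression:<12} {dm:>4} of {total:<4} {dm_share(expression):6.1%}")
```

Under these counts, I think is annotated as a DM in roughly 82% of its occurrences and you know in roughly 66%, whereas that is and you see fall below 20%, which is the contrast the annotation describes.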
Multiword expression    Frequency
I think                 580
You know                573
That is                 370
Of course               312
You see                 287
In fact                 256
I mean                  199
For example             161

Table 1: Multiword expressions in the corpus

Multiword expression    Discourse marker    Content word
I think                 473                 107
You know                380                 193
That is                 29                  341
Of course               233                 79
You see                 47                  240
In fact                 217                 39
I mean                  168                 31
For example             117                 44

Table 2: Multiword expressions used as DMs

Table 2 shows that the multiword expressions That is and You see, although identified as DMs in the theoretical literature, demonstrate only a weak tendency to be used as DMs in this study and are mainly used as content words in the current corpus. By contrast, the multiword expressions I think and you know demonstrate a high tendency to be used as DMs and remain DMs in Lithuanian and Hebrew translation.

4.2 DM ‘I think’ translations

Further, following our research aim, we present a detailed analysis of the translations of the two most frequent multiword expressions used as DMs: I think and you know. The alignment approach allowed extracting direct output of the translations together with the figures of the translation frequency. First, we explore the translations of the most frequent multiword DM, I think.

The most frequent multiword expression in the researched corpus, I think, has a number of translation variants in both researched languages, Hebrew and Lithuanian. The most frequent one in Lithuanian is a one-word expression, the inflected verb manau, which, due to Lithuanian being a highly inflected language (Zinkevičius et al., 2005), fully represents the verb-pronoun cases. Other one-verb variants and multiword expressions do not demonstrate high numbers. A separate case is represented by omission, which comprises 48 situations, showing that such a technique is also chosen by the translators.

Referring to Hebrew, the most frequent translation is אני חושב, which is a male derivative, while the female derivative, אני חושבת, comprises only 51 cases. The assumption could be that the choice of gender in first person pronouns depends on the gender of the speaker. However, Hebrew translation variant choices differ from the Lithuanian ones, as they mostly remain multiword expressions in translation. Another interesting observation in Hebrew is that 70 cases include the additionally integrated connective and in the derivative ואני חושב. It reveals that translators sometimes prefer inserting additional information into the translation, which could be related not to the direct semantic meaning of addition of and but more to the pragmatic inferences drawn by the translators from the surrounding contexts, which relates to the observations of Blakemore and Carston (1999) and Moeschler (1989). Hebrew demonstrates fewer omission cases than Lithuanian for the DM I think. The number of omissions in Hebrew is 23, while the Lithuanian omission number is approximately double in the parallelized corpus sentences.

4.3 DM ‘you know’ translations

Another commonly used multiword DM, you know, demonstrates far more variable translations. A closer investigation into the translations of the DM you know reveals that the most common ones in Lithuanian are also one-word verbs, žinote / žinai / žinot, which represent verb-pronoun cases. Another quite frequent translator choice is the single particle na. Although not numerous, very interesting cases of multiword expressions with particles could be found, such as na jūs žinote or na suprantate, or a single particle juk. Even a single particle is used as a DM, which is characteristic of the Lithuanian language. There are also cases of multiword expressions involving a connective and inflected verb phrases, for example, kaip žinote, bet žinote. The translator’s choice to additionally use particles or connectives appears to be related not to the translation of semantic meaning but more to the pragmatic meaning inferred by them from
the surrounding context. This accords with the observation made by Nau and Ostrowski (2010b) that Lithuanian particles contain the component of subjectivity and inter-subjectivity, and their meaning is mostly coloured by the surrounding context.

In Hebrew, the translation variants for the DM you know are not as variable. The most frequent ones, again, are the variants referring to the male gender, including both plural (191) אתם יודעים and singular (26) אתה יודע, which by far exceed the number of female derivatives in plural (2) אתן יודעות and singular (17) את יודעת. The prevalence of male derivatives could be explained by the nature of the Hebrew language, in which male derivatives are used when addressing purely male and mixed audiences (Tobin, 2001). In Hebrew, this DM is highly prone to omission, as the number of omissions amounts to 113 cases, which is a bit less than the number of the translated cases. Again, multiword expressions remain multiword expressions, with just one case of a one-word choice in translation. The translation choices for the multiword expression you know serving as a DM are more versatile than those for I think, and certain cases of grammatical transformation could be observed in the case of the former.

In Lithuanian, eight cases of grammatical changes were found and, even amongst those, one-word DMs prevail. The multiword DM you know is also translated into a connective, taigi (so), and the adverbs gerai (okay) and iš tiesų (really). However, such translator choices are very rare, considering the size of the dataset.

In Hebrew, the grammatical transformation cases are more numerous, comprising 21 occurrences, and much more versatile. The most interesting cases include: טוב נו (okay), which is a usual colloquial saying in Hebrew, נחשו מה (guess what), and two connectives used successively, כאילו (as if). There are also some cases when a connective is just added, as in the example ואז כמובן (then of course), which could be done by the translator simply to stress the discourse management role of the DM used or possibly to attach a rhetorical function to the integrated connective. Even among the limited cases of grammatical transformation, multiword expressions as DMs prevail in Hebrew. What is similar to Lithuanian is that there are also adverbs used in the Hebrew translation: הרי (indeed), נו (well), ברור (clearly). The reason why different DMs demonstrate different translation choices could lie in the nature of the target language into which the texts are translated; for example, Lithuanian is rich in particles and, as the analysis has demonstrated, translators choose to additionally integrate particles into DMs to add supplementary discourse expressions.

In Hebrew, the male gender prevails in translation, and translators automatically give preference to male derivatives; gender is not expressed in English, so the choice of the gender of the derivative is entirely the translator’s. Another observation regarding Hebrew is that multiword DMs remain multiword because of the translators’ choice to rely more on word-for-word translation, while in Lithuanian there is a tendency to omit the pronoun by using just an inflected verb, and in this way multiword DMs turn into one-word DMs.

5 Conclusions and Future research

The study results showed that the English multiword expressions I think and you know, identified as DMs according to the function-based approach of Maschler and Schiffrin (2015), remain stance attitudinal DMs in Lithuanian and Hebrew translation but demonstrate variability in Lithuanian and Hebrew translations: they are either translated into multiword expressions or one inflected word, or they are completely omitted. In Hebrew translation there is a tendency to use multiword discourse marker translations to express stance, and there is a clear tendency for translators to give preference to male over female derivatives, which is due to the nature of the Hebrew language (Tobin, 2001). However, in Lithuanian, there is a clear tendency observed for one-word DMs in translation. One-word translations mainly include verbs, which, due to Lithuanian being a highly inflected language (Zinkevičius et al., 2005), fully represent the verb-pronoun cases. It should be noted that Lithuanian translations of pronoun-verb multiword expressions and one-word verb cases could be considered almost word-for-word translations. Concerning translation modelling, the research reveals that stance signalling in discourse is preserved as an important element in translation.

More interesting cases include translator choices of particle or connective integration into multiword expressions. The integration of particles for Lithuanian and connectives for both languages might carry the pragmatic meaning that
could have been inferred from the surrounding contexts by the translators (Nau and Ostrowski, 2010a; Blakemore and Carston, 1999; Moeschler, 1989), or translator choices might also be guided by the inner discourse-managing system of the target language. The translator’s choice to insert particles and connectives needs closer investigation and might be studied in future research. Furthermore, keeping in mind that each language is a unique system with unique features, research could be carried out without English as a pivot language, which means furthering the current research by using linguistically linked open data (LLOD) and thus accessing related linguistic data directly and comparing the languages. This has already been done for related languages; for example, Snyder et al. (2010) analysed Ugaritic (an ancient Semitic language spoken in the second millennium BCE) through resources originally developed for Hebrew. However, linked data provide a sound basis and potential for interoperable resources relating across various languages and enable research across languages and areas.

Acknowledgments

The research described in this paper was conducted in the context of the COST Action CA18209 “Nexus Linguarum” (“European network for Web-centred linguistic data science”).

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively Multilingual Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884.

Amal Al-Saif and Katja Markert. 2010. The Leeds Arabic Discourse Treebank: Annotating Discourse Connectives for Arabic. In LREC, pages 2046–2053.

Nicholas Asher and Alex Lascarides. 2003. Logics of conversation. Cambridge University Press.

Mona Baker. 2018. In other words: A coursebook on translation. Routledge.

Michael Barlow. 2011. Corpus linguistics and theoretical linguistics. International Journal of Corpus Linguistics, 16(1):3–44. Publisher: John Benjamins.

Douglas Biber. 2006. Stance in spoken and written university registers. Journal of English for Academic Purposes, 5(2):97–116. https://doi.org/10.1016/j.jeap.2006.05.001

Douglas Biber, Susan Conrad, and Viviana Cortes. 2004. If you look at...: Lexical bundles in university teaching and textbooks. Applied linguistics, 25(3):371–405. Publisher: Oxford University Press.

Douglas Biber, Susan Conrad, and Randi Reppen. 1994. Corpus-based approaches to issues in applied linguistics. Applied linguistics, 15(2):169–189. Publisher: Oxford University Press.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, Edward Finegan, and Randolph Quirk. 1999. Longman grammar of spoken and written English.

Roza Bieliauskienė. 2012. Vilnius–jidiš kalbos Jeruzalė. Krantai, (4):56–61.

Diane Blakemore and Robyn Carston. 1999. The interpretation of and-conjunctions. Iten, C. &.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. Wit3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation, pages 261–268.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391.

Tom Cobb and Alex Boulton. 2015. Classroom applications of corpus analysis.

David Crystal. 1988. Another look at, well, you know.... English Today, 4(1):47–49. Publisher: Cambridge University Press.

Liesbeth Degand and Henk Pander Maat. 2003. A contrastive study of Dutch and French causal connectives on the Speaker Involvement Scale. LOT Occasional Series, 1:175–199. Publisher: LOT, Netherlands Graduate School of Linguistics.

Robert MW Dixon. 2009. The semantics of clause linking in typological perspective. The semantics of clause linking: A cross-linguistic typology, pages 1–55. Publisher: Oxford University Press Oxford.

Kaja Dobrovoljc. 2017. Multi-word discourse markers and their corpus-driven identification: The case of MWDM extraction from the reference corpus of spoken Slovene. International journal of corpus linguistics, 22(4):551–582. Publisher: John Benjamins.

Maïté Dupont and Sandrine Zufferey. 2017. Methodological issues in the use of directional parallel corpora: A case study of English and French concessive connectives. International journal of corpus
linguistics, 22(2):270–297. Publisher: John Benjamins.

Bruce Fraser. 2009. An account of discourse markers. International review of Pragmatics, 1(2):293–320. Publisher: Brill.

Péter Furkó and Ágnes Abuczki. 2014. English discourse markers in mediatised political interviews.

Sylviane Granger. 2015. Contrastive interlanguage analysis: A reappraisal. International Journal of Learner Corpus Research, 1(1):7–24. Publisher: John Benjamins.

Angela Hasselgren. 2002. Learner corpora and language testing: Smallwords as markers of learner fluency. Computer learner corpora, second language acquisition and foreign language teaching, pages 143–174. Publisher: John Benjamins Amsterdam, The Netherlands.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.

Jet Hoek, Sandrine Zufferey, Jacqueline Evers-Vermeul, and Ted JM Sanders. 2017. Cognitive complexity and the linguistic marking of coherence relations: A parallel corpus study. Journal of Pragmatics, 121:113–131. Publisher: Elsevier.

Lan Fen Huang. 2011. Discourse markers in spoken English: A corpus study of native speakers and Chinese non-native speakers. PhD Thesis, University of Birmingham.

John E. Joseph. 2009. Why Lithuanian accentuation mattered to Saussure. Language & History, 52(2):182–198. Publisher: Taylor & Francis.

Daniel Joslyn-Siemiatkoski. 2007. The Cambridge history of Judaism: The late Roman-rabbinic period. Theological Studies, 68(4):924. Publisher: Sage Publications Ltd.

Huda Khayrallah, Brian Thompson, Kevin Duh, and Philipp Koehn. 2018. Regularized training objective for continued training for domain adaptation in neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 36–44.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79–86. Citeseer.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. Technical report, University of Southern California Marina Del Rey Information Science Inst.

Marie-Aude Lefer and Natalia Grabar. 2015. Super-creative and over-bureaucratic: A cross-genre corpus-based study on the use and translation of evaluative prefixation in TED talks and EU parliamentary debates. Across Languages and Cultures, 16(2):187–208. Publisher: Akadémiai Kiadó.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281. Publisher: Berlin.

Yael Maschler and Deborah Schiffrin. 2015. Discourse markers: Language, meaning, and context. The handbook of discourse analysis, 2:189–221. Publisher: Wiley Online Library.

Jacques Moeschler. 1989. Pragmatic connectives, argumentative coherence and relevance. Argumentation, 3(3):321–339. Publisher: Springer.

James R. Nattinger and Jeanette S. DeCarrico. 1992. Lexical phrases and language teaching. Oxford University Press.

Nicole Nau and Norbert Ostrowski. 2010a. Background and perspectives for the study of particles and connectives in Baltic languages. Particles and connectives in Baltic, pages 1–37. Publisher: Vilnius: Vilniaus universitetas, Academia Salensis.

Nicole Nau and Norbert Ostrowski. 2010b. Particles and connectives in Baltic.

William R. O’Donnell and Loreto Todd. 2013. Variety in contemporary English. Routledge.

Giedre Valunaite Oleskeviciene, Deniz Zeyrek, Viktorija Mazeikiene, and Murathan Kurfalı. 2018. Observations on the annotation of discourse relational devices in TED talk transcripts in Lithuanian. In Proceedings of the workshop on annotation in digital humanities co-located with ESSLLI, volume 2155, pages 53–58.

Mirna Pit. 2007. Cross-linguistic analyses of backward causal connectives in Dutch, German and French. Languages in Contrast, 7(1):53–82. Publisher: John Benjamins.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA).

Deborah Schiffrin. 2001. Discourse markers: Language, meaning, and context. The handbook of discourse analysis, 1:54–75. Publisher: Wiley Online Library.

John Sinclair. 1991. Corpus, concordance, collocation. Oxford University Press.
Anna Siyanova-Chanturia, Kathy Conklin, and Walter JB Van Heuven. 2011. Seeing a phrase “time and again” matters: The role of phrasal frequency in the processing of multiword sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(3):776. Publisher: American Psychological Association.

Manfred Stede, Stergos Afantenos, Andreas Peldszus, Nicholas Asher, and Jérémy Perret. 2016. Parallel discourse annotations on a corpus of short texts. In 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 1051–1058.

Jan Svartvik. 1980. Well in conversation. Studies in English Linguistics for Randolph Quirk, 5:167–177. Publisher: London: Longman.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2018. Multilingual Neural Machine Translation with Knowledge Distillation. In International Conference on Learning Representations.

Yishai Tobin. 2001. Gender switch in modern Hebrew. Gender across languages: The linguistic representation of women and men, 1:177–198.

Bonnie Webber and Aravind Joshi. 2012. Discourse structure and computation: past, present and future. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, pages 42–54.

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2016. A discourse-annotated corpus of conjoined VPs. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 22–31.

Naixing Wei and Jingjie Li. 2013. A new computing method for extracting contiguous phraseological sequences from academic text corpora. International Journal of Corpus Linguistics, 18(4):506–535. Publisher: John Benjamins.

Alison Wray. 2012. What do we (think we) know about formulaic language? An evaluation of the current state of play. Annual review of applied linguistics, 32(1):231–254. Publisher: Cambridge University Press.

Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Modeling coherence for discourse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7338–7345.

Yuqi Zhang, Kui Meng, and Gongshen Liu. 2019. Paragraph-Level Hierarchical Neural Machine Translation. In International Conference on Neural Information Processing, pages 328–339. Springer.

Vytautas Zinkevičius, Vidas Daudaravičius, and Erika Rimkutė. 2005. The Morphologically annotated Lithuanian Corpus. In Proceedings of The Second Baltic Conference on Human Language Technologies, pages 365–370.

Sandrine Zufferey and Bruno Cartoni. 2012. English and French causal connectives in contrast. Languages in contrast, 12(2):232–250. Publisher: John Benjamins.

Sandrine Zufferey and Liesbeth Degand. 2017. Annotating the meaning of discourse connectives in multilingual corpora. Corpus Linguistics and Linguistic Theory, 13(2):399–422. Publisher: De Gruyter Mouton.