Multiword expressions as discourse markers in Hebrew and Lithuanian

Page created by Michael Goodwin
 
CONTINUE READING
Multiword expressions as discourse markers in Hebrew and Lithuanian
             Giedre Valunaite Oleskeviciene                     Chaya Liebeskind
                 Institute of Humanities,                Department of Computer Science,
              Mykolas Romeris university                 Jerusalem College of Technology
                  Ateities 20 LT-08303,                  21 Havaad Haleumi st. 9116001,
                     Vilnius, Lietuva                            Jerusalem, Israel
              gvalunaite@mruni.eu                          liebchaya@gmail.com

                      Abstract                         or processing language by the speaker and the
                                                       hearer (Siyanova-Chanturia et al., 2011). Formu-
     Multiword expressions are of key impor-           laic language includes idioms and proverbs, vari-
     tance in language generation and process-         ous clichés and collocations, lexical bundles, and
     ing. Certain multiword expressions also           phrasal verbs. Biber et al. (2004) observed that
     could operate as discourse markers. In            lexical bundles constitute a high percentage of the
     this research, we combined the alignment          produced language and the authors identified that
     model of the phrase-based statistical ma-         one function of lexical bundles is to organize dis-
     chine translation and manual treatment of         course by providing an example of such bundles,
     the data in order to examine English mul-         for example, I think, which relates to the research
     tiword discourse markers and their equiv-         on discourse markers (DMs). Phrases such as you
     alents in Lithuanian and Hebrew, by re-           know and I think have also been classified as DMs
     searching their changes in translation. Af-       that perform certain discourse organising func-
     ter establishing a full list of multiword dis-    tions. However, Maschler and Schiffrin (2015)
     course markers in our generated parallel          observe that there is no a priori theoretical clas-
     corpus, we focused on the two most fre-           sification of DMs and the analysis of function in
     quent ones functioning as stance attitudi-        the data is necessary. Research on DMs as tools of
     nal discourse markers: I think and you            discourse management prove that they carry sev-
     know aiming to research if they demon-            eral functions, including signposting, signalling,
     strate their functional stability as stance at-   and rephrasing. Furthermore, there are ongoing at-
     titudinal discourse markers in translation        tempts to investigate the importance of discourse
     and what changes they undergo in Lithua-          layers in language production, communication,
     nian and Hebrew translation. Our research         second language learning, and translation. Addi-
     proves that the examined multiword dis-           tionally, Dobrovoljc (2017) has recently attempted
     course markers preserve their function as         to research multiword expressions as DMs in a
     stance attitudinal discourse markers and          corpus of spoken Slovene, identifying structurally
     tend to remain multiword discourse mark-          fixed discourse marking multiword expressions.
     ers in the Hebrew translation but turn into
                                                          The underlying assumption is that DMs I think
     one-word discourse markers in Lithuanian
                                                       and you know are indicators of stance in discourse
     due to the translation tendency relying on
                                                       used to express and understand points of view and
     inflections.
                                                       beliefs. The purpose of the current research is to
                                                       examine multiword expressions used as DMs in
1    Introduction
                                                       TED talk English transcripts focusing on stance
Research on multiword expressions has identi-          attitudinal DMs I think and you know and com-
fied that language is not produced just word by        pare them with their counterparts in Lithuanian
word but it usually involves generating certain        and Hebrew by following Maschler and Schiffrin
chunks using a lot of formulaic constructions          (2015) observation on the necessity of closer in-
(Barlow, 2011). Native speakers have a multi-          vestigation on their function as stance DMs. To
tude of memorized sequences to perform various         achieve the aim of the research, the set objec-
functions within language, for example, organiz-       tives were to create a parallel research corpus to
ing discourse (Nattinger and DeCarrico, 1992),         identify multiword expressions used as stance at-
titudinal DMs and to analyse their translations in       sions and the oldest layer of the Lithuanian lan-
Lithuanian and Hebrew to determine if they func-         guage vocabulary is related to the Indo-European
tion as stance DMs and are also multiword ex-            language, which is dated to be approximately over
pressions or one word translations, or if they ac-       5000 years old.
quire any other linguistic forms. An additional             Hebrew is a very old Semitic language and it is a
benefit of the study was extending the available         successful example of a revived dead language. It
resources and providing linguistic processing for        survived in the medieval period as the language of
several languages by creating a multilingual paral-      religious scriptures, being revived, in the 19th cen-
lel corpus (including English, Lithuanian, and He-       tury, into a spoken and literary language (Joslyn-
brew); the created corpus is shared and interlinked      Siemiatkoski, 2007). Hebrew is an important lan-
via CLARIN open language resources. What is              guage for researchers specializing in Middle East
more, the current research could be extended to          civilizations and Christian theology studies.
other languages. The future research envisions ap-
plying machine learning and using the model for          2.2   Multiword expressions as DMs
discourse marker identification in other languages
to research how stance signalling is treated.            The research areas of natural language processing
                                                         (NLP), linguistics, and translation are closely re-
2     Theoretical background                             lated to discourse research, focusing on discourse
                                                         relations between clauses or sentences. NLP re-
The literature overview briefly takes into account       search focuses more and in depth on multiple
the research languages, studies related to multi-        language-related areas, such as semantic phenom-
word expressions and their use as DMs, the im-           ena, dialogue exchange structure, and discourse
portance of DMs for discourse management, and            textual structure (Webber and Joshi, 2012). NLP
certain insights into DM translation.                    recognizes that language is not just placing words
                                                         in the right order but getting the meaning and
2.1    Cultural heritage and research languages
                                                         deeper textual relations as well as organizing ideas
First, it is necessary to briefly discuss the cultural   into a logical textual flow. According to re-
heritage of the languages of the research, which,        searchers (Barlow, 2011; Sinclair, 1991), language
in a way, guided the choice of languages for our         is not just generated word by word; it is also
study. According to Bieliauskienė (2012), Jewish        formulaic. Speakers possess multiple learnt for-
and Lithuanian cultures coexisted on the same ter-       mulaic sequences, which, according to Siyanova-
ritory from the first half of the 14th century. The      Chanturia et al. (2011) are important in organizing
author stressed that from 19th century onwards,          discourse and help the language producer and re-
in the Republic of Lithuania, Vilnius was called         cipient to manage language processing. However,
Lithuania’s Jerusalem, attracting knowledgeable          formulaic language is not easy to manage and cat-
people in the field of education and inspiring a         egorize for NLP research, as it may seem at first
flourishing high culture, for example, in theatre,       sight, since the sequences that could be considered
art, and literature. In fact, both languages, Lithua-    formulaic vary in length, meaning, fixedness, etc.,
nian and Hebrew, formed the cultural heritage of         and the finalized definition of formulaic language
the region. In this study, we research the Lithua-       has not fully crystallized. It could be considered
nian and Hebrew corpus in parallel with pivotal          as an umbrella term embracing idioms, proverbs,
English.                                                 clichés, phrasal verbs, collocations, and lexical
   Lithuanian is an old surviving Baltic language,       bundles (Wray, 2012). According to Wei and Li
retaining forms related to Sanskrit and Latin and        (2013), formulaic language covers approximately
preserving the most phonological and morpholog-          60% of written texts in their researched corpus of
ical aspects of the Proto-Indo-European language.        English academic language. According to Biber
Thus, it has gained importance in Indo-European          et al. (1994, 1999), lexical bundles are groups of
language studies and has been researched by many         words that show a statistical tendency to co-occur
scientists so far, including Ferdinand de Saussure,      and could be considered as extended collocations,
who considered Lithuanian “the Galapagos of lin-         for example, I think. Biber et al. (2004) identify
guistic evolution” (Joseph, 2009). Lithuanian is         that lexical bundles have functional purposes, such
rich in declensions and cases inside the declen-         as organizing discourse, expressing stance, and
referential meaning. Based on the evidence of the     take (e.g., he goed), it can be easily identified
formulaic nature of language for communication,       and redeemed by the native speaker; however, if
research has turned to investigating multiword ex-    a learner misses words such as you know, or well,
pressions used as DMs (Dobrovoljc, 2017), iden-       the native speaker cannot identify any error and
tifying structurally fixed discourse marking multi-   the speech might sound impolite or even dogmatic.
word expressions.                                     The same idea is also supported by Hasselgren
                                                      (2002), who observed that DMs enhance interac-
   Another important issue in NLP is discourse
                                                      tion. Furthermore, it has also been researched us-
management, which is related to discourse re-
                                                      ing learner corpora to demonstrate the importance
lations, connecting ideas between sentences and
                                                      of discourse level knowledge, especially at more
bigger parts of the text. Discourse relations may
                                                      advanced levels of language learning (Granger,
remain implicit or be expressed explicitly through
                                                      2015; Cobb and Boulton, 2015).
discourse markers, which help textual coherence
and discourse management, and are used for mak-
ing coherent speech appropriately segmented to        2.3   Translation issues of DMs
enable textual understanding. DMs perform im-         DMs are used in both written texts and spoken
portant functions, such as signposting, signalling,   discourse to connect ideas and guide the reader
and rephrasing, by facilitating discourse organi-     or the listener through expression by ensuring
zation. They are mainly drawn from syntactic          that the ideas are grasped correctly. DMs have
classes of conjunctions, adverbials, and preposi-     been researched by applying various theoretical
tional phrases (Fraser, 2009), as well as expres-     approaches, such as Rhetorical Structure Theory
sions such as you know, you see, and I mean           (Mann and Thompson, 1988), Segmented Dis-
(Schiffrin, 2001; Hasselgren, 2002; Maschler and      course Representation Theory (Asher et al., 2003),
Schiffrin, 2015). Hasselgren (2002) advocated         and PDTB (Prasad et al., 2008), first focusing
that better DM signalled fluency contributes to in-   on the monolingual approach, which resulted in
teraction and even makes the speaker sound more       multilingual studies focusing on translation (De-
‘native-like’. Recently, discourse relations and      gand and Pander Maat, 2003; Pit, 2007; Dixon,
DM research has gained certain impetus with cor-      2009; Zufferey and Cartoni, 2012). As Zufferey
pora annotation for exploring discourse structure     and Cartoni (2012) observed, multilingual stud-
in texts, for example, the Penn Discourse Tree        ies are more complicated as languages differ in
Bank (PDTB) (Webber et al., 2016). Furthermore,       the use of DMs and their expression. The au-
there was a rise in annotated multilingual corpora    thors also added that often DMs are poly-semic,
for researching different means of expressing dis-    which means that a single expression of a DM may
course relations and managing discourse (Stede        perform in expressing various discourse relations.
et al., 2016; Zufferey and Degand, 2017; Oleske-      They provided an example of the English since,
viciene et al., 2018).                                which could express temporal or causal discourse
  Language, especially spoken, is characterised       relations depending on the surrounding contexts.
by DM use; however, some of them (e.g., you              Recently, much research has gained interest in
know, I think, well) are sometimes referred to in     using parallel translated corpora. For example,
a critical manner, as indicating a lack of fluency    Dupont and Zufferey (2017) focused on the inves-
(O’Donnell and Todd, 2013). Still, DMs are abun-      tigation of translation corpora to study if the effect
dantly used and, according to Crystal (1988), they    of register, translation direction, or translator’s ex-
enhance communication if used appropriately and       pertise could influence the shifts of meaning and
should not be considered unnecessary or undesir-      omissions of English and French markers of con-
able. As Biber (2006) observed, DMs, such as          cession. Hoek et al. (2017) investigated a paral-
you know, or well, are very rare in written lan-      lel corpus on English parliamentary debates trans-
guage. However, they are quite common in spo-         lated into Dutch, German, French, and Spanish,
ken discourse and should not be treated as just       searching what types of DMs might have a higher
fancy words since they serve the function of orga-    tendency to be more frequently omitted in trans-
nizing discourse by signalling, rephrasing, mark-     lation. Baker (2018), in her extensive studies on
ing, or relating ideas. Svartvik (1980) observed      translation, observed that DMs could be used to
that, if a foreign language learner makes a mis-      signal different relations and these relations could
be expressed by a variety of means. The author             In Lithuanian translation, we observe a
provided the example that, in English, the expres-         change in the order of the elements in the sen-
sion of causality could be realized through con-           tence.
tent verbs, such as cause or lead, or more sim-
ply, through a DM signalling the causality rela-        3. After he had left – Jam išėjus.
tion. Further, different languages demonstrate dif-
                                                           In the case of Hebrew translation, the change
ferent tendencies – some languages prefer using
                                                           of the order of the elements could be ob-
simpler structures connected by a variety of DMs,
                                                           served in the following example.
while other languages favour complex structures,
sparsely using explicit DMs. The author analysed        4. Classical music – !‫מוזיקה קלאסית‬
the example of an evident difference between En-
glish and Arabic, identifying that, while English          Omission occurs when some elements of the
prefers signalling discourse relation through DMs,         original text could be considered excessive or
Arabic prefers grouping the information into big-          redundant in translation.
ger grammatical chunks and using fewer DMs.                In the Lithuanian translation example, the
The finding is supported by Al-Saif and Mark-              whole phrase I thought is omitted.
ert (2010), who observed that, in Arabic, many
discourse relations are expressed via prepositions      5. I thought you said you were alright. – Bet tu
with nominalizations. Therefore, translation poses         sakei, kad viskas gerai.
a challenge in adapting various preferences of the
source and target languages. Translators face var-         In the following example in Hebrew, the
ious choices of inserting DMs to make the flow of          translation of are is omitted.
the ideas smoother in the target text, however, they
risk making the translation sound foreign or trans-     6. We still are – !N‫אנחנו עדיי‬
posing the grammatical syntactic structure, ending         Supplementation involves changes when new
up using different means of expressing DMs or              elements, which are non-existent in the
simply omitting them. It appears that it is not al-        source text, appear in the translated text in
ways possible to use the word for word technique           order to ensure structural adequacy of the lat-
and natural changes in translation are sometimes           ter. Such modifications are usually consid-
inevitable. According to Baker (2018), grammat-            ered structurally or contextually motivated.
ical changes in translation involve certain tech-
                                                           For example, due to the elliptical nature of
niques, such as substitution, transposition, omis-
                                                           the English language, the Lithuanian transla-
sion, and supplementation.
                                                           tion should use supplementation to make the
   Substitution is the change of the grammatical
                                                           translation understandable.
category of the source unit in translation.
   For example, active voice is more common in
                                                        7. Soap star – muilo operos žvaigždė (although
Lithuanian; therefore, English passive voice units
                                                           the word opera is omitted in English due
could be changed into active units:
                                                           to ellipsis, it should be added in Lithuanian
  1. He was told the news. – jam pranešė nau-             translation to make it contextually coherent).
     jienas                                                The same technique should be applied in the
     Similarly, in the following example, the verb         Hebrew translation.
     in the source language is changed into a noun
     in Hebrew translation.                             8. Soap star – !N‫כוכב אופרת סבו‬

  2. We should have broken ten minutes before. –          As shown above, translation is not a mere pro-
     !‫ לצאת להפסקה לפני עשר דקות‬M‫היינו צריכי‬           cess of transposing words from one language into
     Transposition represents a change of position     another but requires certain motivated changes.
     in the order of elements of the source textual    Thus, translation involves grammatical transfor-
     unit or changing the part of speech in transla-   mations, as a result of the process of looking
     tion, which implies the change in the order of    for approximate correspondences in the translated
     the elements in the translated text.              texts.
2.4   Research data resources                           lasting prevalence of phrase-based MT techniques.

It should be stressed that parallel data resources      3     Research methodology
are not extensive, and researchers still need to
work on creating parallel corpora for their re-         The detailed description of the research proce-
search, especially if they would like to cover the      dures is provided in the research methodology sec-
variety of languages and areas. One of the most         tion. In the current research, phrase-based MT
prized parallel multilingual resources is Europarl      was applied relying on two main reasons: NMT
(Koehn, 2005). It comprises the translations of the     techniques do not allow extensive processing of
European Parliament proceedings (at most 50 mil-        phrases and NMT procedures are not as explicit
lion words) in most European languages; however,        as phrase-based MT processes. The current study
it covers just one specific domain of parliamentary     does not involve the full set of phrase-based MT
proceedings. TED talks subtitles to their videos        systematic procedures, as it is used just for a
seem to be a growing resource of parallel linguis-      phrase table construction, which is a single step of
tic material, covering a multitude of languages.        the phrase-based MT paradigm. The research aim
In addition, being an open and a developing re-         comprised examining multiword expressions used
source, TED talks attract attention of researchers      as DMs in TED talk English transcripts and com-
and their subtitles cover a wide variety of knowl-      paring them with their counterparts in Lithuanian
edge fields (Cettolo et al., 2012), which makes the     and Hebrew. Thus, there was a need to achieve the
data of the talks widely applicable. However, re-       double objectives of creating the parallel corpus
searchers should keep in mind that the talks are        for the research data and carrying out the research
translated by volunteers although with adminis-         on multiword expressions used as DMs in the stud-
tratively managed quality checks, and the transla-      ied languages. Unlike working on one language
tion is mostly unidirectional from source English       and using statistical methods we used parallel cor-
subtitles to other target languages. Furthermore,       pus knowledge alignment algorithm. Initially, the
Dupont and Zufferey (2017) identified that such         list of multiword and one word expressions that
talks contain features of both spoken and written       could potentially be used as DMs was generated
language, as they are semi-prepared speeches by         relying on theoretical insights by Schiffrin (1987)
nature. Additionally, Lefer and Grabar (2015) ob-       and the classification provided by Fraser (2009).
served that subtitle translation bears certain speci-   Fraser’s extensive classification was taken as a ba-
ficity in itself. Even by taking into account the       sis, and Huang (2011) theoretical analysis of DM
features of TED talks discussed by researchers,         characteristics for spoken discourse, for example,
TED talks are extensively useful as they are an         you know, you see, I mean, I think, was also in-
open resource and could provide large amounts of        cluded.
parallel data for research. Besides, parallel cor-
pora are employed as a pool of data for statisti-       3.1    Parallel Corpus creation
cal machine translation systems and TED talks is        First, a parallel corpus meeting the research aim
one of the most frequent data resources referred to     needed to be created. We decided to use TED Talk
explore multilingual Neural MT (NMT) (Aharoni           transcripts, as they are publicly available and pro-
et al., 2019; Chu et al., 2017; Hoang et al., 2018;     vide appropriate material for parallel data. In order
Khayrallah et al., 2018; Tan et al., 2018; Xiong        to create a substantial parallel corpus containing
et al., 2019; Zhang et al., 2019). NMT, as cur-         data in English, Lithuanian, and Hebrew, the talks
rently the newest technique of MT, stems from the       were extracted automatically using a special code,
model of the functioning of the human brain neu-        which ensured that English sentences with the can-
ral networks, which place information into differ-      didate DMs from the theoretically based list were
ent layers for processing it before generating the      extracted and matched with their Lithuanian and
outcome. With the technological advancements,           Hebrew counterparts. The process of creating
NMT gained impetus, as it used to be, resource          the parallel corpus allows parallelizing the data
and computation wise, too costly to outdo phrase-       of any researched languages. While building the
based MT, which operates on the basis of trans-         corpus, the parallel texts in English, Lithuanian,
lating entire sequences of words. Now, the neu-         and Hebrew were extracted from TED talk tran-
ral approach of NMT started challenging the long-       scripts. Then, the sentences were aligned to make
a parallel corpus for further research. The cor-
                                                           Figure 2: English – Hebrew phrase alignment
pus contains 87.230 aligned sentences (published
in LINDAT/CLARIN-LT repository http://
hdl.handle.net/20.500.11821/34).

3.2   Multiword DM extraction
                                                          Figure 1 visualizes Lithuanian–English corre-
Another stage of the research focuses on multi-
                                                       sponding phrases marked in respective colours.
word expressions that are used as DMs to ensure
                                                       Figure 2 shows English–Hebrew respective phrase
textual cohesion and, according to Fraser (2009),
                                                       alignment, with a note for the reader that Hebrew
to relate separate discourse messages. For exam-
                                                       text should be read from right to left.
ple, phrases such as you know, I mean, of course,
are characteristic of spoken language (Maschler           The model applies the segmentation of the input
and Schiffrin, 2015; Furkó and Abuczki, 2014;          into sequences of words, which are called phrases,
Huang, 2011). Thus, 3.314 aligned sentences            and then each phrase is translated into English
containing the earlier mentioned multiword ex-         phrases that could later be reordered in the out-
pressions were extracted and manually annotated,       put. Such a model ensures the correspondence be-
spotting the cases in which the expressions were       tween the units of phrases. After being extracted,
used as DM. One-word DM identification did not         all the possible translations were manually filtered
represent much challenge; however, turning to          to reject the wrong translation variants and pre-
multiword expressions, they certainly caused chal-     pare the data for the machine analysis stage. This
lenges. For example, to identify if the expression     helped us extract sentences with translations of the
you know is used as a DM, the context in which         researched DMs from the target language corpus
it occurs should be examined by identifying if the     and analyse their use.
expression serves as a DM. As such, two situa-            While analysing the data, we noticed that there
tions arise: (1) the multiword expression you know     was a small amount of data left which did not
is used to introduce a new discourse message, or       fit the variations of possible translations. The
(2) they are content words fully integrated into the   first supposition was that it might represent the
sentence.                                              cases of omissions; however, we decided to anal-
                                                       yse it closely to verify. We checked manually
  1. You know, this is really an infinite thing.       the extracted non-attached data and established
                                                       that most of the analysed cases involved omis-
  2. You know exactly what you want to do from         sion with some minor grammatical transformation
     one moment to the other.                          cases, incorrect translations, and some phrases not
                                                       included in the possible translations by the ma-
After that, the variations of the translations of
                                                       chine.
DMs into Lithuanian and Hebrew were extracted
automatically for a comparative study, determin-
                                                       4     Research findings
ing the variations in translation. We ran an NLP
word-alignment algorithm to extract a phrase table     4.1    Multiword DM distribution
of all the possible translations of the researched
                                                       The most frequent multiword expressions used in
DMs, using our parallel corpus (in our case, source
                                                       the study corpus have been extracted and are pre-
= English, target = Lithuanian/Hebrew). The ex-
                                                       sented in the table below.
traction of the translation variations was depen-
dent on the phrase-based statistical machine trans-       It could be seen in Table 1 that the two most
lation model introduced by Koehn et al. (2003).        frequent multiword expressions in the corpus are I
The model could be visually represented in the re-     think and you know.
search languages by the figures below.                    As mentioned earlier, multiword expressions
                                                       needed to be manually annotated, spotting the
                                                       cases when the expressions were used as DMs.
Figure 1: Lithuanian – English phrase alignment        The manual annotation revealed that some mul-
                                                       tiword expressions were used as DMs more fre-
                                                       quently while others were more often used as con-
                                                       tent words fully integrated into sentences.
Multiword expression        Frequency             2005), fully represents the verb-pronoun cases.
      I think                     580                   Other one-verb variants and multiword expres-
      You know                    573                   sions do not demonstrate high numbers. A sep-
      That is                     370                   arate case is represented by omission, which com-
      Of course                   312                   prises 48 situations, showing that such a technique
      You see                     287                   is also chosen by the translators.
      In fact                     256                      Referring to Hebrew, the most frequent transla-
      I mean                      199                   tion is !‫אני חושב‬, which refers to a male derivative,
      For example                 161                   while the female derivate, !‫אני חושבת‬, comprises
                                                        only 51 cases. The assumption could be that the
  Table 1: Multiword expressions in the corpus          choice of gender in first person pronouns depends
                                                        on the gender of the speaker. However, Hebrew
  Multiword        Discourse        Content             translation variant choices differ from the Lithua-
  expression       marker           word                nian ones, as they mostly remain multiword ex-
  I think          473              107                 pressions in translation. Another interesting ob-
  You know         380              193                 servation in Hebrew is that a number of 70 cases
  That is          29               341                 include the additionally integrated connective and
  Of course        233              79                  into the derivative !‫ואני חושב‬. It reveals that some-
  You see          47               240                 times translators prefer inserting additional infor-
  In fact          217              39                  mation into the translation, which could be re-
  I mean           168              31                  lated not to the direct semantic meaning of ad-
  For example      117              44                  dition of and but more to the pragmatic infer-
                                                        ences drawn by the translators form the surround-
  Table 2: Multiword expressions used as DMs            ing contexts, which relates to the observations of
                                                        Blakemore and Carston (1999), and Moeschler
                                                        (1989). Hebrew demonstrates less omission cases
   It is visible in Table 2 that multiword expres-
                                                        than Lithuanian for the DM I think. The number
sions That is and You see although identified as
                                                        of omissions in Hebrew is 23, while the Lithua-
DMs by the theoretical literature, in this study,
                                                        nian omission number is approximately double in
they demonstrate a weak tendency to be used as
                                                        the parallelized corpus sentences.
DMs and are mainly used as content words in the
current corpus. While multiword expressions I
think and you know demonstrate a high tendency          4.3   DM ‘you know’ translations
of being used as DMs and the stability of remain-       Another commonly used multiword DM, you
ing DMs in Lithuanian and Hebrew translation.           know, demonstrates far more variable translations.
                                                        A closer investigation into the translations of DM
4.2   DM ‘I think’ translations
                                                        you know reveals that the most common ones in
Further, following our research aim, we present         Lithuanian are also one-word verbs žinote/ žinai/
a detailed analysis of the translations of the two      žinot, which represent verb-pronoun cases. An-
most frequent multiword expressions used DMs –          other quite frequent translator choice is the single
I think and you know. The alignment approach al-        particle na. Although not numerous, very interest-
lowed extracting direct output of the translations      ing cases of multiword expressions with particles
together with the figures of the translation fre-       could be found, such as na jūs žinote or na supran-
quency. First, we explore the translations of the       tate, or a single particle juk. Even a single parti-
most frequent multiword DM, I think.                    cle is used as a DM, which is characteristic of the
   The most frequent multiword expression in the        Lithuanian language. There are also cases of mul-
researched corpus, I think, has a number of trans-      tiword expressions involving a connective and in-
lation variants in both researched languages, He-       flected verb phrases, for example, kaip žinote, bet
brew and Lithuanian. The most frequent one              žinote. The translator’s choice to additionally use
in Lithuanian is a one-word expression – an in-         particles or connectives is obviously related not
flected verb, manau, which, due to Lithuanian be-       to the translation of semantic meaning but more
ing a highly inflected language (Zinkevičius et al.,   to the pragmatic meaning inferred by them from
the surrounding context. It connotes with the deep       tion choices could be based on the nature of the
observation made by Nau and Ostrowski (2010b)            target language into which the texts are translated;
that Lithuanian particles contain the component of       for example, Lithuanian is rich in particles and, as
subjectivity and inter-subjectivity, and their mean-     the analysis has demonstrated, translators choose
ing is mostly coloured by the surrounding context.       to additionally integrate particles into DMs to add
    In Hebrew, the translation variants for the DM       supplementary discourse expressions.
 you know are not as variable. The most frequent            In Hebrew, the male gender prevails in transla-
 ones, again, are the variants referring to the male     tion, and translators automatically give preference
 gender, including both plural (191) !M‫ יודעי‬M‫את‬         to male derivatives as in English; the gender is not
 and singular (26) !‫אתה יודע‬, which by far exceeds       expressed in English and the choice of the gen-
 the number of female derivatives in plural (2) N‫את‬      der of the derivative is completely the translator’s
!‫ יודעות‬and singular (17) !‫את יודעת‬. The prevalence      choice. Another observation regarding Hebrew is
 of male derivatives could be explained by the na-       that multiword DMs remain multiword because of
 ture of the Hebrew language, which has the feature      the translator choice to relay more on word for
 that male derivatives are used while addressing         word translation, while in Lithuanian there is a
 purely male and mixed audiences (Tobin, 2001).          tendency to omit the pronoun by using just an in-
 In Hebrew, this DM is much prone to omission,           flected verb, and this way, multiword DMs turn
 as the number of omissions amounts to 113 cases,        into one-word DMs.
 which are a bit less than the number of the trans-
 lated cases. Again, multiword expressions remain        5   Conclusions and Future research
 multiword expressions with just one case of one-
                                                         The study results showed that English multiword
 word choice in translation. The translation choices
                                                         expressions I think and you know, identified as
 for the multiword expression serving as a DM you
                                                         DMs according to Maschler and Schiffrin (2015)
 know are more versatile than those of I think and
                                                         function-based approach, remain stance attitudi-
 certain cases of grammatical transformation could
                                                         nal DMs in Lithuanian and Hebrew translation but
 be observed in the case of the former.
                                                         they demonstrate variability in Lithuanian and He-
   In Lithuanian, eight cases of grammatical             brew translations: they are either translated into
changes were found and, even amongst those, one-         multiword expressions or one inflected word, or
word DMs prevail. The multiword DM you know              they are completely omitted. In Hebrew transla-
is translated also into a connective, taigi (so), and    tion there is a tendency to use multiword discourse
adverbs gerai (okay) and iš tiesu˛ (really). How-        marker translations to express stance, and there
ever, such translator choices are absolutely rare,       is a clear tendency for translators to give pref-
considering the size of the dataset.                     erence to male over female derivatives, which is
   The grammatical transformation cases are more         due to the nature of the Hebrew language (Tobin,
numerous, comprising of 21 occurrences, and              2001). However, in Lithuanian, there is a clear
much more versatile in Hebrew. The most interest-        tendency observed for one-word DMs in transla-
ing cases include: !‫( טוב נו‬okay), which is a usual      tion. One-word translations mainly include verbs,
colloquial saying in Hebrew, !‫( נחשו מה‬guess what),      which, due to Lithuanian being a highly inflected
and two connectives used successively, !‫( כאילו‬as        language (Zinkevičius et al., 2005), fully repre-
if). There are also some cases when a connective         sent the verb-pronoun cases. It should be noted
is just added as in the following example !N‫ואז כמוב‬,    that Lithuanian translations of pronoun-verb mul-
(then of course), which could be done by the trans-      tiword expressions and one-word verb cases could
lator simply to stress the discourse management          be considered almost word-for-word translations.
role of the DM used or possibly attaches a rhetor-       Concerning translation modelling the research re-
ical function to the integrated connective. Even         veals stance signalling in discourse preserved as
among the limited cases of grammatical transfor-         an important element in translation.
mation, multiword expressions as DMs prevail in             More interesting cases include translator
Hebrew. What is similar to Lithuanian is that there      choices of particle or connective integration
are also adverbs used in the Hebrew translation:         into multiword expressions. The integration of
!‫( הרי‬indeed), !‫( נו‬well), !‫( ברור‬clearly). Reflecting   particles for Lithuanian and connectives for both
why different DMs demonstrate different transla-         languages might carry the pragmatic meaning that
could have been inferred from the surrounding             Douglas                  Biber.                 2006.
contexts by the translators (Nau and Ostrowski,             https://doi.org/10.1016/j.jeap.2006.05.001 Stance in
                                                            spoken and written university registers. Journal of
2010a; Blakemore and Carston, 1999; Moeschler,
                                                            English for Academic Purposes, 5(2):97–116.
1989), or translator choices might be also guided
by the inner discourse managing system of the             Douglas Biber, Susan Conrad, and Viviana Cortes.
target language. The translator’s choice to insert          2004. If you look at. . . : Lexical bundles in uni-
                                                            versity teaching and textbooks. Applied linguistics,
particles and connectives needs closer investi-             25(3):371–405. Publisher: Oxford University Press.
gation and might be studied in future research.
Furthermore, keeping in mind that each language           Douglas Biber, Susan Conrad, and Randi Reppen.
                                                            1994. Corpus-based approaches to issues in ap-
is a unique system with unique features, research           plied linguistics. Applied linguistics, 15(2):169–
could be carried out without English as a pivotal           189. Publisher: Oxford University Press.
language, which means furthering the current
                                                          Douglas Biber, Stig Johansson, Geoffrey Leech,
research and using linguistically linked open data          S. Conrad, Eclwarcl Finegan, and Randolph Quirk.
(LLOD) and thus accessing related linguistic                1999. Longman. Grammar of spoken and written
data directly and comparing the languages. This             english.
has already been done for related languages; for          Roza Bieliauskienė. 2012.       Vilnius–jidiš kalbos
example, Snyder et al (2010) analysed Ugaritic              Jeruzalė. Krantai, (4):56–61.
(an ancient Semitic language spoken in the second
millennium BCE) through resources originally              Diane Blakemore and Robyn Carston. 1999. The inter-
                                                            pretation of and-conjunctions. Iten, C. &.
developed for Hebrew. However, linked data pro-
vide a sound basis and potential for interoperable        Mauro Cettolo, Christian Girardi, and Marcello Fed-
resources relating across various languages and            erico. 2012. Wit3: Web inventory of transcribed and
                                                           translated talks. In Conference of european associa-
enable research across languages and areas.                tion for machine translation, pages 261–268.

Acknowledgments                                           Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017.
                                                            An empirical comparison of domain adaptation
The research described in this paper was con-               methods for neural machine translation. In Proceed-
                                                            ings of the 55th Annual Meeting of the Association
ducted in the context of the COST Action                    for Computational Linguistics (Volume 2: Short Pa-
CA18209 “Nexus Linguarum” (“European net-                   pers), pages 385–391.
work for Web-centred linguistic data science”).
                                                          Tom Cobb and Alex Boulton. 2015. Classroom appli-
                                                            cations of corpus analysis.
References                                                David Crystal. 1988. Another look at, well, you
                                                            know. . . . English Today, 4(1):47–49. Publisher:
Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019.        Cambridge University Press.
  Massively Multilingual Neural Machine Transla-
  tion. In Proceedings of the 2019 Conference of          Liesbeth Degand and Henk Pander Maat. 2003. A con-
  the North American Chapter of the Association for         trastive study of Dutch and French causal connec-
  Computational Linguistics: Human Language Tech-           tives on the Speaker Involvement Scale. LOT Occa-
  nologies, Volume 1 (Long and Short Papers), pages         sional Series, 1:175–199. Publisher: LOT, Nether-
  3874–3884.                                                lands Graduate School of Linguistics.

Amal Al-Saif and Katja Markert. 2010. The Leeds           Robert MW Dixon. 2009. The semantics of clause
 Arabic Discourse Treebank: Annotating Discourse            linking in typological perspective. The semantics of
 Connectives for Arabic. In LREC, pages 2046–               clause linking: A cross-linguistic typology, pages 1–
 2053.                                                      55. Publisher: Oxford University Press Oxford.
                                                          Kaja Dobrovoljc. 2017. Multi-word discourse mark-
Nicholas Asher, Nicholas Michael Asher, and Alex            ers and their corpus-driven identification: The case
  Lascarides. 2003. Logics of conversation. Cam-            of MWDM extraction from the reference corpus of
  bridge University Press.                                  spoken Slovene. International journal of corpus
                                                            linguistics, 22(4):551–582. Publisher: John Ben-
Mona Baker. 2018. In other words: A coursebook on           jamins.
 translation. Routledge.
                                                          Maïté Dupont and Sandrine Zufferey. 2017. Method-
Michael Barlow. 2011. Corpus linguistics and theoreti-     ological issues in the use of directional parallel cor-
  cal linguistics. International Journal of Corpus Lin-    pora: A case study of English and French conces-
  guistics, 16(1):3–44. Publisher: John Benjamins.         sive connectives. International journal of corpus
linguistics, 22(2):270–297. Publisher: John Ben-        Marie-Aude Lefer and Natalia Grabar. 2015. Super-
  jamins.                                                  creative and over-bureaucratic: A cross-genre
                                                           corpus-based study on the use and translation of
Bruce Fraser. 2009. An account of discourse markers.       evaluative prefixation in TED talks and EU parlia-
  International review of Pragmatics, 1(2):293–320.        mentary debates. Across Languages and Cultures,
  Publisher: Brill.                                        16(2):187–208. Publisher: Akadémiai Kiadó.

Péter Furkó and Ágnes Abuczki. 2014. English dis-         William C. Mann and Sandra A. Thompson. 1988.
  course markers in mediatised political interviews.        Rhetorical structure theory: Toward a functional the-
                                                            ory of text organization. Text, 8(3):243–281. Pub-
Sylviane Granger. 2015. Contrastive interlanguage           lisher: Berlin.
  analysis: A reappraisal. International Journal of
  Learner Corpus Research, 1(1):7–24. Publisher:          Yael Maschler and Deborah Schiffrin. 2015. Discourse
  John Benjamins.                                           markers: Language, meaning, and context. The
                                                            handbook of discourse analysis, 2:189–221. Pub-
                                                            lisher: Wiley Online Library.
Angela Hasselgren. 2002. Learner corpora and lan-
  guage testing: Smallwords as markers of learner flu-    Jacques Moeschler. 1989. Pragmatic connectives, ar-
  ency. Computer learner corpora, second language            gumentative coherence and relevance. Argumenta-
  acquisition and foreign language teaching, pages           tion, 3(3):321–339. Publisher: Springer.
  143–174. Publisher: John Benjamins Amsterdam,
  The Netherlands.                                        James R. Nattinger and Jeanette S. DeCarrico. 1992.
                                                            Lexical phrases and language teaching. Oxford
Vu Cong Duy Hoang, Philipp Koehn, Gholamreza                University Press.
  Haffari, and Trevor Cohn. 2018. Iterative back-
  translation for neural machine translation. In Pro-     Nicole Nau and Norbert Ostrowski. 2010a. Back-
  ceedings of the 2nd Workshop on Neural Machine            ground and perspectives for the study of particles
  Translation and Generation, pages 18–24.                  and connectives in Baltic languages. Particles and
                                                            connectives in Baltic, pages 1–37. Publisher: Vil-
Jet Hoek, Sandrine Zufferey, Jacqueline Evers-              nius: Vilniaus universitetas, Academia Salensis.
   Vermeul, and Ted JM Sanders. 2017. Cognitive
   complexity and the linguistic marking of coherence     Nicole Nau and Norbert Ostrowski. 2010b. Particles
   relations: A parallel corpus study. Journal of Prag-     and connectives in Baltic.
   matics, 121:113–131. Publisher: Elsevier.
                                                          William R. O’Donnell and Loreto Todd. 2013. Variety
Lan Fen Huang. 2011. Discourse markers in spoken            in contemporary English. Routledge.
  English: A corpus study of native speakers and Chi-
  nese non-native speakers. PhD Thesis, University        Giedre Valunaite Oleskeviciene, Deniz Zeyrek, Vik-
  of Birmingham.                                            torija Mazeikiene, and Murathan Kurfalı. 2018. Ob-
                                                            servations on the annotation of discourse relational
John E. Joseph. 2009. Why Lithuanian accentua-              devices in TED talk transcripts in Lithuanian. In
  tion mattered to Saussure. Language & History,            Proceedings of the workshop on annotation in dig-
  52(2):182–198. Publisher: Taylor & Francis.               ital humanities co-located with ESSLLI, volume
                                                            2155, pages 53–58.
Daniel Joslyn-Siemiatkoski. 2007. The Cambridge his-
                                                          Mirna Pit. 2007. Cross-linguistic analyses of backward
  tory of Judaism: The late Roman-rabbinic period.
                                                            causal connectives in Dutch, German and French.
  Theological Studies, 68(4):924. Publisher: Sage
                                                            Languages in Contrast, 7(1):53–82. Publisher: John
  Publications Ltd.
                                                            Benjamins.
Huda Khayrallah, Brian Thompson, Kevin Duh, and           Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Milt-
  Philipp Koehn. 2018. Regularized training objec-          sakaki, Livio Robaldo, Aravind K. Joshi, and Bon-
  tive for continued training for domain adaptation in      nie L. Webber. 2008. The Penn Discourse Tree-
  neural machine translation. In Proceedings of the         Bank 2.0. In Proceedings of the Sixth International
  2nd Workshop on Neural Machine Translation and            Conference on Language Resources and Evaluation
  Generation, pages 36–44.                                  (LREC’08), Marrakech, Morocco. European Lan-
                                                            guage Resources Association (ELRA).
Philipp Koehn. 2005. Europarl: A parallel corpus for
  statistical machine translation. In MT summit, vol-     Deborah Schiffrin. 2001. Discourse markers: Lan-
  ume 5, pages 79–86. Citeseer.                             guage, meaning, and context. The handbook of dis-
                                                            course analysis, 1:54–75. Publisher: Wiley Online
Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003.        Library.
  Statistical phrase-based translation. Technical re-
  port, University of Southern California Marina Del      John Sinclair. 1991. Corpus, concordance, colloca-
  Rey Information Science Inst.                             tion. Oxford University Press.
Anna Siyanova-Chanturia, Kathy Conklin, and Wal-           Baltic Conference on Human Language Technolo-
  ter JB Van Heuven. 2011. Seeing a phrase “time           gies, pages 365–370.
  and again” matters: The role of phrasal frequency in
  the processing of multiword sequences. Journal of      Sandrine Zufferey and Bruno Cartoni. 2012. English
  Experimental Psychology: Learning, Memory, and           and French causal connectives in contrast. Lan-
  Cognition, 37(3):776. Publisher: American Psycho-        guages in contrast, 12(2):232–250. Publisher: John
  logical Association.                                     Benjamins.

Manfred Stede, Stergos Afantenos, Andreas Peldzsus,      Sandrine Zufferey and Liesbeth Degand. 2017. Anno-
 Nicholas Asher, and Jérémy Perret. 2016. Paral-           tating the meaning of discourse connectives in mul-
 lel discourse annotations on a corpus of short texts.     tilingual corpora. Corpus Linguistics and Linguis-
 In 10th International Conference on Language Re-          tic Theory, 13(2):399–422. Publisher: De Gruyter
 sources and Evaluation (LREC 2016), pages 1051–           Mouton.
 1058.
Jan Svartvik. 1980. Well in conversation. Studies in
   English Linguistics for Randolph Quirk, 5:167–177.
   Publisher: London: Longman.
Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-
  Yan Liu. 2018. Multilingual Neural Machine Trans-
  lation with Knowledge Distillation. In International
  Conference on Learning Representations.
Yishai Tobin. 2001. Gender switch in modern Hebrew.
  Gender across languages: The linguistic represen-
  tation of women and men, 1:177–198.
Bonnie Webber and Aravind Joshi. 2012. Discourse
  structure and computation: past, present and future.
  In Proceedings of the ACL-2012 Special Workshop
  on Rediscovering 50 Years of Discoveries, pages 42–
  54.
Bonnie Webber, Rashmi Prasad, Alan Lee, and Ar-
  avind Joshi. 2016. A discourse-annotated corpus of
  conjoined VPs. In Proceedings of the 10th Linguis-
  tic Annotation Workshop held in conjunction with
  ACL 2016 (LAW-X 2016), pages 22–31.
Naixing Wei and Jingjie Li. 2013. A new computing
  method for extracting contiguous phraseological se-
  quences from academic text corpora. International
  Journal of Corpus Linguistics, 18(4):506–535. Pub-
  lisher: John Benjamins.
Alison Wray. 2012. What do we (think we) know about
  formulaic language? An evaluation of the current
  state of play. Annual review of applied linguistics,
  32(1):231–254. Publisher: Cambridge University
  Press.
Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang.
  2019. Modeling coherence for discourse neural ma-
  chine translation. In Proceedings of the AAAI Con-
  ference on Artificial Intelligence, volume 33, pages
  7338–7345.
Yuqi Zhang, Kui Meng, and Gongshen Liu. 2019.
  Paragraph-Level Hierarchical Neural Machine
  Translation. In International Conference on Neural
  Information Processing, pages 328–339. Springer.
Vytautas Zinkevičius, Vidas Daudaravičius, and Erika
  Rimkutė. 2005. The Morphologically annotated
  Lithuanian Corpus. In Proceedings of The Second
You can also read