Two Demonstrations of the Machine Translation Applications to Historical Documents
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Two Demonstrations of the Machine Translation Applications to Historical Documents Miguel Domingo and Francisco Casacuberta arXiv:2102.01417v1 [cs.CL] 2 Feb 2021 Pattern Recognition and Human Language Technology Research Center Universitat Politècnica de València - Camino de Vera s/n, 46022 Valencia, Spain midobal@prhlt.upv.es, fcn@prhlt.upv.es Abstract erties of these documents—in the past, orthography changed depending on the time period and author We present our demonstration of two machine trans- due to a lack of an spelling convention—and the na- lation applications to historical documents. The first ture of language—which evolves with the passage of task consists in generating a new version of a his- time. This creates a language barrier which increases torical document, written in the modern version of the difficulty of comprehending historical documents. its original language. The second application is lim- Different research topics on historical documents ited to a document’s orthography. It adapts the focus on tackling this language barrier as well as dif- document’s spelling to modern standards in order ferent aspects related to the document’s language to achieve an orthography consistency and account- richness, which is also part of our cultural her- ing for the lack of spelling conventions. We followed itage. For example, historical manuscripts are au- an interactive, adaptive framework that allows the tomatically digitized and transcribed (Toselli et al., user to introduce corrections to the system’s hypoth- 2017). Documents are automatically translated into esis. The system reacts to these corrections by gen- a modern version of their original language to make erating a new hypothesis that takes them into ac- them accessible to a broader audience Domingo and count. Once the user is satisfied with the system’s Casacuberta (2020). Orthography is normalized to hypothesis and validates it, the system adapts its account for the lack of a spelling convention (Porta model following an online learning strategy. This et al., 2013). Search queries can find all occurrences system is implemented following a client–server ar- of one or more words (Ernst-Gerlach and Fuhr, 2006). chitecture. We developed a website which communi- Word frequency lists are generated (Baron et al., cates with the neural models. All code is open-source 2009). And natural language processing tools pro- and publicly available. The demonstration is hosted vide automatic annotations to identify and extract at http://demosmt.prhlt.upv.es/mthd/. linguistic structures such as relative clauses (Hundt et al., 2011) or verb phrases (Pettersson et al., 2013). 1 Introduction In this work, we focus on two of these topics: language modernization and spelling normalization. Despite being an important part of our cultural her- The first one aims to generate a modern version of a itage, historical documents are mostly accessible to historical document to make its content available to a scholars. This is due to both the linguistic prop- broader audience. The second one aims to achieve an 1
orthography consistency—without altering the doc- f (e.g., a word correction). Then, the system sug- ument’s content—to reduce the variability derived gests a new hypothesis ỹ, compatible with the user’s from the lack of spelling conventions and facilitate the feedback. This restricts Eq. (1) by constraining the work of scholars. We present a demonstration that search space: showcases neural machine translation (NMT) appli- cations to these two topics, and provides an interac- ỹ = arg max P r(y | x, f ) (2) tive, adaptive framework for scholars to increase their y compatible with f productivity when working on these tasks. This interactive correction process is repeated un- til the user is satisfied with the system’s suggestion. This validated translation is new valuable data that 2 Interactive, adaptive NMT could be leveraged to continuously improve the sys- framework tem. Each time the user validates a translation hy- pothesis, the system’s models can be incrementally Given a source sentence x, machine translation (MT) updated with this new sample. Therefore, when the aims to find the most likely translation ŷ (Brown system generates a new translation, it will consider et al., 1993): the previous user feedback and it is expected to pro- duce higher quality translations. A common way of generating this adaptive systems consists in following ŷ = arg max P r(y | x) (1) y an online learning paradigm (Ortiz-Martı́nez, 2016; Peris and Casacuberta, 2019). NMT models this equation with a neural network which usually follows an encoder-decoder architec- ture (although others architectures are possible), in 3 Applications which the source sentence is projected into a dis- tributed representation at the encoding step. Then, Our demonstration showcases NMT applications to at the decoding step, the decoder generates, word by two different historical documents research topics: word, its most likely translation using a beam search language modernization and spelling normalization. method (Sutskever et al., 2014). The model param- eters are typically estimated jointly on large parallel corpora, via stochastic gradient descent (Robbins and 3.1 Language modernization Monro, 1951; Rumelhart et al., 1986). At decoding This research topic aims to tackle the language bar- time, the system obtains the most likely translation rier inherent in historical documents in order to make by means of a beam search method. them available to a broader audience. To do so, it Despite the quality improvement in recent years, proposes to automatically generate a new version of NMT is still far from obtaining high-quality transla- a historical document, written in the modern ver- tions (Toral et al., 2018). Thus, human experts need sion of the document’s original language. A com- to supervise and correct the NMT systems outputs mon approach to this problem is to tackle modern- in order to achieve the highest quality. As an alter- ization as a conventional MT task (Domingo et al., native to this static, decoupled strategy for correct- 2017; Sen et al., 2019). Results showed that, while ing systems, protocols such as the iterative-predictive there is still room for improvement, modernization have been studied for years (Barrachina et al., 2009; techniques successfully decrease the comprehension Knowles and Koehn, 2016; Peris and Casacuberta, difficulty of historical documents—indicated by both 2019). automatic metrics and human evaluation (Domingo In this framework, the correction process shifts and Casacuberta, 2020). to a human–computer collaboration. The user in- Additionally, while modernization’s goal is lim- teracts with the system through a feedback signal ited to bringing a better understanding of historical 2
documents to a general audience—language-related that handles the client’s requests. All code is open- losses may appear during the process—scholars are in source an publicly available1 . charge of different tasks that require them to generate Initially, an old sentence is presented to the user in modernizations of the highest quality (e.g., producing the client website. When the user requests an auto- a comprehensive contents document for non-experts matic modernization/normalization (see Section 3), (Monk, 2018)). Thus, we provided our system of the client communicates the server via PHP. Then, an interactive, adaptive framework to increase the the server queries the NMT system, which generates scholars productivity when working on these tasks. an initial hypothesis applying Eq. (1). After that, The modernization system was developed following the hypothesis is sent back to the client website. Domingo and Casacuberta (2020). For the inter- At this point, the interactive-predictive process active, adaptive framework, we followed Peris and starts: The user searches the hypothesis for the first Casacuberta (2019). error and introduces a correction with the keyboard (writing one or more characters). Once the user fin- ishes typing, the client reacts to this feedback by 3.2 Spelling normalization sending a request to the server. This request con- In order to account for the lack of spelling con- tains the old sentence and the user feedback (the se- ventions, which were not created until recent years, quence of characters that conform the prefix). Then, spelling normalization aims to achieve an orthogra- the NMT system applies Eq. (2) to produce an alter- phy consistency by adapting a document’s spelling native hypothesis coherent with the user’s feedback to modern standards. This linguistics problem sup- and sends it back to the client website. This process pose an additional challenge for the effective natural is repeated until the user finds the system’s hypoth- language processing for these documents. Through esis satisfactory. Fig. 1 illustrates one step of this the years, different approaches to this problem have process. been researched (e.g., Baron and Rayson, 2008; Porta Client (web browser) Server et al., 2013). Some of them tackle spelling normal- User x , ..., x 1 J ization as a conventional MT problem (e.g., Boll- Old sentence Encoder y , ..., y PHP Decoder mann, 2018; Scherrer and Erjavec, 2013). We de- Feedback 1 I Initial hypothesis 0 y , ..., y 1 veloped our normalization system following Domingo i PHP and Casacuberta (2019), which approached spelling y , ...,y , y 1 0 , ..., y i 0 PHP i+1 0 I Constrained Search normalization using character-based NMT and ob- tained significant improvements according to several Alternative hypothesis automatic metrics. Like in the previous task, we fol- lowed Peris and Casacuberta (2019) for the interac- Figure 1: System architecture. The client presents tive, adaptive framework. the user an old sentence and a prediction. Then, the user introduces a feedback signal for correcting this prediction (in this example, they are validating the 4 System description prefix y1 , .., yi−1 and correcting the word yi0 ). After that, the old sentence and the user’s feedback is sent Our system is composed of two main elements: the to the server, which generates an alternative hypoth- client and the server. The client is an HTML website, esis that takes into account the user corrections (in which interacts with the user through javascript and this example, a new suffix yi+1 0 , .., yI0 that completes communicates with the server via the HTTP proto- the user’s feedback). col, using the PHP curl tool. The server is the core element. It contains the NMT systems, which were Once the user is satisfied with the system’s hy- developed with NMT-Keras (Peris and Casacuberta, 2018), and it is deployed as a Python HTTP server 1 https://github.com/midobal/mthd. 3
Figure 2: Frontend of the client website. As the button “Modernize” is clicked (or “Normalize”, depending on the task your are performing), an initial hypothesis for the old sentence appears in the right area. Then, the user can introduce corrections of this text. The system will react to each correction, producing alternative hypotheses coherent with the user feedback. Once the user is satisfied with the modernization hypothesis, they can click in the “Validate” button to accept the hypothesis. pothesis, they can validate it. Then, the system is us to create a better visualization of the hypothesis, incrementally updated with this new sample follow- which is specially relevant for the spelling normal- ing an online learning setup (Peris and Casacuberta, ization task. Additionally, relevant attributes of the 2019). Hence, in future interactions, the system will neural system could be visualize in order to help to be progressively updated and able to generate better understand better the model predictions and behav- hypothesis. ior. Fig. 2 illustrates an example of how to performed a task using the client server. After having selected the task to perform, a list of old sentences will ap- Acknowledgments pear. When you click on “Modernize/Normalize”, the system will generate an initial hypothesis. If you The research leading to these results has received desire to improve this hypothesis, you can click on funding from Generalitat Valenciana (GVA) un- the left box and type a correction. The system will, der project PROMETEO/2019/121. We grate- then, generate a new hypothesis to take that correc- fully acknowledge the support of NVIDIA Cor- tion into account. You can repeat this process for as poration with the donation of a GPU used for part many corrections as you desire to make. Finally, you of this research, and Andrés Trapiello and Ediciones can click “Validate” to tell the system that you are Destino for granting us permission to use their book happy with the modernization/normalization and, if in our research. the Learn from sample option is activated (in blue), the system will use the sample to improve its model. References 5 Conclusions and future work Baron, A. and Rayson, P. (2008). VARD2: A tool for dealing with spelling variation in historical cor- We presented a demonstration of two MT appli- pora. Postgraduate conference in corpus linguis- cations to historical documents. We described its tics. client–server architecture and developed a website to facilitate the use of the system. Baron, A., Rayson, P., and Archer, D. (2009). Word As a future work, we would like to improve our frequency and key word statistics in corpus linguis- website’s front end. This improvement could allow tics. Anglistik, 20(1):41–67. 4
Barrachina, S., Bender, O., Casacuberta, F., Civera, https://www.themedievalmonk.com/the- J., Cubel, E., Khadivi, S., Lagarda, A., Ney, H., rochester-customs-book.html. Tomás, J., Vidal, E., and Vilar, J.-M. (2009). Sta- tistical approaches to computer-assisted transla- Ortiz-Martı́nez, D. (2016). Online learning for statis- tion. Computational Linguistics, 35:3–28. tical machine translation. Computational Linguis- tics, 42(1):121–161. Bollmann, M. (2018). Normalization of Histor- ical Texts with Neural Network Models. PhD Peris, A. and Casacuberta, F. (2018). NMT-Keras: a thesis, Sprachwissenschaftliches Institut, Ruhr- Very Flexible Toolkit with a Focus on Interactive Universität. NMT and Online Learning. The Prague Bulletin of Mathematical Linguistics, 111:113–124. Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statis- Peris, Á. and Casacuberta, F. (2019). Online learning tical machine translation: Parameter estimation. for effort reduction in interactive neural machine Computational Linguistics, 19(2):263–311. translation. Computer Speech & Language, 58:98– Domingo, M. and Casacuberta, F. (2019). Enriching 126. character-based neural machine translation with Pettersson, E., Megyesi, B., and Tiedemann, J. modern documents for achieving an orthography (2013). An SMT approach to automatic annota- consistency in historical documents. In Proceedings tion of historical text. In Proceedings of the work- of the International Workshop on Pattern Recogni- shop on computational historical linguistics, pages tion for Cultural Heritage, pages 59–69. 54–69. Domingo, M. and Casacuberta, F. (2020). Modern- izing historical documents: A user study. Pattern Porta, J., Sancho, J.-L., and Gómez, J. (2013). Edit Recognition Letters, 133:151–157. transducers for spelling variation in old spanish. In Proceedings of the workshop on computational Domingo, M., Chinea-Rios, M., and Casacuberta, historical linguistics, pages 70–79. F. (2017). Historical documents modernization. The Prague Bulletin of Mathematical Linguistics, Robbins, H. and Monro, S. (1951). A stochastic ap- 108:295–306. proximation method. The Annals of Mathematical Statistics, pages 400–407. Ernst-Gerlach, A. and Fuhr, N. (2006). Generating search term variants for text collections with his- Rumelhart, D. E., Hinton, G. E., and Williams, toric spellings. In European Conference on Infor- R. J. (1986). Learning representations by back- mation Retrieval, pages 49–60. propagating errors. Nature, 323(6088):533–536. Hundt, M., Denison, D., and Schneider, G. (2011). Scherrer, Y. and Erjavec, T. (2013). Modernizing his- Retrieving relatives from historical data. Literary torical slovene words with character-based smt. In and Linguistic Computing, 27(1):3–16. Proceedings of the Workshop on Balto-Slavic Nat- Knowles, R. and Koehn, P. (2016). Neural interactive ural Language Processing, pages 58–62. translation prediction. In Proceedings of the Asso- ciation for Machine Translation in the Americas, Sen, S., Hasanuzzaman, M., Ekbal, A., Bhat- pages 107–120. tacharyya, P., and Way, A. (2019). Take help from elder brother: Old to modern english nmt with Monk, C. (2018). Custumale Roffense: An phrase pair feedback. In Proceedings of the Inter- overview of the thirteenth-century cus- national Conference on Computational Linguistics tumal of Rochester Cathedral priory. and Intelligent Text Processing. In press. 5
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Se- quence to sequence learning with neural networks. In Proceedings of the Advances in Neural Informa- tion Processing Systems, volume 27, pages 3104– 3112. Toral, A., Castilho, S., Hu, K., and Way, A. (2018). Attaining the unattainable? reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation, pages 113–123. Toselli, A. H., Leiva, L. A., Bordes-Cabrera, I., Hernández-Tornero, C., Bosch, V., and Vidal, E. (2017). Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription. Dig- ital Scholarship in the Humanities, 33(1):173–202. 6
You can also read