Slator 2018 Neural Machine Translation Report
Disclaimer

The mention of any public or private entity in this report does not constitute an official endorsement by Slator. The information used in this report is either compiled from publicly accessible resources (e.g. pricing web pages for technology providers), acquired directly from company representatives, or both. The insights and opinions from Slator's expert respondents are theirs alone, and Slator does not necessarily share the same position where opinions are attributed to respondents.

Acknowledgments

Slator would like to thank the following for their time, guidance, and insight, without which this report could not have come to fruition:

Andrew Rufener, CEO, Omniscien Technologies
Diego Bartolome, Machine Translation Manager, TransPerfect
Jean Senellart, CTO, Systran
John Tinsley, CEO, Iconic Translation Machines
Joss Moorkens, post-doctoral researcher, ADAPT Centre, and Lecturer, Dublin City University
Kirti Vashee, Language Technology Evangelist, SDL
Kyunghyun Cho, Assistant Professor and pioneer researcher, New York University
Maja Popovic, Researcher, DFKI – Language Technology Lab, Berlin, Germany
Mihai Vlad, VP Machine Learning Solutions, SDL
Pavel Levin, Researcher, Booking.com
Rico Sennrich, pioneer and post-doctoral researcher, MT Group, University of Edinburgh
Samad Echihabi, VP of Research and Product Development, SDL
Samuel Läubli, PhD candidate and CTO, TextShuttle
Silvio Picinini, MT Expert, eBay
Spence Green, CEO, Lilt
Tony O'Dowd, CEO, KantanMT
Yannis Evangelou, Founder, LexiQA
Table of Contents

Executive Summary
Neural is the New Black
  The Current NMT Landscape
  So, What Now?
By the End of 2017, Neural MT was Mainstream
  Neural Network-Based Language Technology Providers
  Current Customized NMT Deployments
    EPO's Patent Translate
    Booking.com: A Trial NMT Deployment
  NMT Performance: What NMT Can and Cannot Do
    Exceptional Capabilities of NMT
    Current Limitations of NMT
What's Next in NMT
  How Do You Quantify Quality?
    Replacing BLEU
    Human Evaluation Remains the Ultimate Standard
    Creating a New Quality Standard
    New and Existing NMT QA Processes
    Machines Testing Machines
  Training Data Becomes Big(ger) Business
    Publicly Accessible Corpora
    Building Your Own Corpora
    Buying Corpora from Others
    Quality is Always a Caveat
  Directions of NMT Research
    "So Many" Exciting Research Directions
    "Convolutional Neural Networks are Doomed"
    Pivot Languages and Zero Shot
Buy Vs Build
  Quality Versus Cost
  Productivity and Production Boost?
  To Build or Not To Build
Shifting Paradigms: Changing Models of Working
  From the Experts
Executive Summary

Neural has become the new standard in machine translation. There are now over 20 providers of neural network-based language technology—four times as many as a little over a year earlier.

Both the public and private sectors have shown effective use cases for the emerging technology, for example: reducing 16,000 man-years of work into instant, fluent patent translation, and cutting multilingual content delivery costs by speeding up translation and shortening the pipeline to digital publishing.

Today, even as ongoing research continues to leverage the exceptional capabilities of neural machine translation and come up with workarounds to its limitations, important questions that span the entire language industry are arising.

How does research move forward in terms of automatically defining output quality and incremental gains? How does the industry efficiently infuse automated processes with much-needed human evaluation? And how are human translators going to interact with much improved machine output?

Machine translation, while still niche compared to its human translation counterpart, is a growth market. There are also emerging sub-markets driven by the need for high quality training data. This raises the question, however, of whether commoditized data can actually provide the in-domain, high quality data required by neural engines.

Is commoditized training data worth the millions of dollars that Sogou Inc. invested in UTH International, for instance? Is quality training data so quintessential that Baidu can afford to sell human translation at rock bottom rates to accelerate its acquisition?

Effective translation engines can be built with as little as a few hundred thousand and as much as a billion high quality sentence pairs, depending on the use case, domain, and technology. And attempts are being made at zero-shot models where no reference data is available. Several open-source tools are available to DIY, but expert consensus recommends otherwise.

Indeed, many would "fail" at building their own industry-proof neural machine translation models, according to experts. They need a combination of high quality training data, the right technology, and an expert team in place.
The prices of graphics processing units (GPUs) used to crunch NMT models, as well as costs associated with training, will steadily decline, enabling the machines to feed on ever bigger data sets.

As neural machine translation ripples through the language services supply chain, increasing adoption will disrupt the way the industry operates. The role of the linguist is evolving, shifting further into the language technology pipeline both in terms of pre-translation and post-translation responsibilities. Human and machine interaction is expected to change, both to adapt to new neural network-based technology and to improve the overall usability of existing tools.

Finally, industry-wide changes in pricing models and opportunities for technological thought leadership are afoot. Finding your place in all this technologically enabled change requires an understanding of the catalyst for it: neural machine translation.
Neural is the New Black

Google, Microsoft, IBM, SAP, Systran, Facebook, Amazon, Salesforce, Baidu, Sogou, SDL, DeepL—and this is the short version of a much longer list that includes KantanMT, Omniscien, Lilt, Globalese, TransPerfect (via Tauyou), and many other startup and mid-sized players. These companies have all become involved in neural machine translation (NMT). Some of them offer NMT solutions. Some use proprietary systems for unique problems. Others are researching ways to incorporate NMT into their existing service and product lines.

By now, the generic praise heaped upon the new technology is becoming repetitive: it outperforms statistical machine translation (SMT), it is a genuine breakthrough in AI tech, and it is fast-paced in terms of research and deployment. The industry is well past discussing the emergence of NMT. Clearly, neural is the new black. Now the main concern is to see if you look good in black.

The Current NMT Landscape

It can be argued that the language services market is at an early stage of tech adoption with regards to NMT, since the current NMT landscape is the result of about four years of research and deployment. The fact is, however, that barely four years of NMT research has already eclipsed around two decades of SMT research.

"It's fair to say that the industry has condensed 15 years of statistical research into 3 years of NMT research and produced systems that will outperform the best SMT can offer." —Tony O'Dowd, CEO, KantanMT

Speaking at SlatorCon New York in October 2017, Asst. Professor Kyunghyun Cho of New York University said NMT adoption is incredibly fast-paced. He cited two examples of rapid deployment: Google Translate and Facebook—each taking around a year from zero to full deployment.

"It only took about a year and a half for Google to replace the entire system they built over 12 years into this new system. It's getting shorter and shorter every day. Facebook noticed that Google is doing it, they have to do it as well. If you look at the date [of announcement]—it's one year since Google. Google deploys it in production without telling anybody in detail what they have done, but other companies have been able to catch up within a year. It's an extremely fast-moving field in that every time there is some change, we see the improvement." —Asst. Professor Kyunghyun Cho speaking at SlatorCon New York, October 2017

In 2015, when Slator first started publishing, NMT was only just starting to build momentum. In February 2017, only Google Translate, Microsoft Translator, and Systran's Pure Neural provided NMT. Now, Slator has found there are at least 20 language technology providers of various technologies based on or related to NMT.
So, What Now?

This report is developed with a specific audience in mind—an audience that wants to learn about the current state of NMT. They want to find out what it can and cannot do, delve into use cases to better understand its applications, and discover ongoing and future research directions, as well as the broader trends that NMT has jumpstarted.

This report first delves into NMT developments in 2017, the year the technology irrevocably went mainstream, and will then go briefly through some NMT technology providers and users to illustrate the state of NMT last year. We will look at the exceptional qualities of NMT to clarify what it can really do and what is likely to be hyperbole. Then, we will touch on the limitations of the technology, with input from subject matter experts from both academia and the corporate world.

As the industry adjusts to NMT as a new standard technology alongside existing ones like SMT and translation memory, it also needs to address the increased need for high quality training data and the sub-category of businesses that are starting to cash in on the demand. This report will discuss this demand for more training data and a few efforts towards open sourcing both data and technology to encourage growth in NMT. We will also take a look at quality assurance and post-editing in an NMT-powered industry.

Finally, through a few use cases and insight from subject matter experts, we will address two big questions: should you buy or build an NMT system? And how does this new normal change your existing ways of working?

By the End of 2017, Neural MT was Mainstream

Key Takeaways

▶ 2017 was marked by a number of major NMT release announcements from big players—the number of technology providers spiked from fewer than five a year ago to over twenty, generally categorized into enterprise providers led by Big Tech and independent, boutique providers with more specialized fields.

▶ Current production-level NMT deployments include the European Patent Office and Booking.com, use cases that highlight the production-level capabilities of NMT specifically for translation pipelines with repetition and stringent guidelines (patents), as well as the massive scale of real-time content (Booking.com's property descriptions).

▶ While NMT continues to evolve, there are clear exceptional capabilities of the technology that will continue to give it an edge in the future, even as its current limitations are being tackled through research.
Slator coverage of NMT in 2017 saw Amazon going from buying a machine translation startup, to building up manpower, to actually launching their offering; and saw Facebook declare SMT at end of life and finally complete the switch to NMT.

Then there were initiatives to make sure NMT becomes easier to adopt and develop as a technology, such as Harvard and Systran's OpenNMT project, Google's public tutorial on the use of its neural network library TensorFlow, Facebook's open sourcing of their NMT tech, and Amazon's open sourcing of their machine translation framework Sockeye.

There were also buzz-making launches and announcements, such as DeepL's unveiling and Sogou's million dollar investment in UTH International's training data.

Neural Network-Based Language Technology Providers

The most highly publicized NMT launches in 2017 (and earlier) were from cloud-based enterprise platforms like Google, Microsoft, Baidu, and Amazon. These same "Big Tech" companies often offer free, browser-based, generic engines anyone can use.

Free, Browser-Based, Generic NMT Engines
• Baidu: fanyi.baidu.com
• DeepL: deepl.com
• Google: translate.google.com
• GTCom: yeekit.com
• Microsoft: translator.microsoft.com/neural and also used on bing.com/translator
• PROMT: online-translator.com
• Systran: demo-pnmt.systran.net
• Tilde: translate.tilde.com
• Yandex: translate.yandex.com

This is not an exhaustive list. These free, generic engines are usually provided for demonstration purposes, though the more well-known ones such as Google, Bing Translate, and DeepL are used by millions of people for straightforward translation tasks.

These Big Tech players all share a common rationale behind their move towards NMT: pitch it to their user base of cloud platform clients as a value-add to their existing service areas. This is also Salesforce's plan, once they complete their R&D—NMT is intended to be part of a suite of AI-powered features for international clients with multilingual needs. Interestingly, Amazon pitched its NMT offering to LSPs directly, though again with the caveat that those who would benefit most are existing users of Amazon Web Services. As of this writing, Amazon does not offer a browser-based, free, generic NMT engine.

The pricing structure for Big Tech NMT providers is mostly fixed, compared to boutique or specialized providers, where the flexibility and customizability of services are reflected in equally flexible and bespoke pricing.
The Big Tech providers usually employ either a straightforward pay-as-you-go pricing strategy, or augment that with tiers:

• Amazon Translate - USD 15 per million characters. For the first 12 months from the date of the first translation request, users can translate two million characters a month free of charge.

• Google Translate - USD 20 per million characters. There are custom pricing considerations for the Translation API on embedded devices (e.g. TVs, cars, speakers) as well as for translation volume exceeding a billion characters monthly.

• IBM Watson Language Translator - Lite, Standard, and Advanced plans:
  • At Lite, users can translate one million characters a month free of charge using default translation models.
  • At Standard pricing, the first 250,000 characters translated are free; every thousand characters beyond that costs USD 0.02. The Standard plan includes news, conversational, and patent translation models.
  • The Advanced plan is also priced at USD 0.02 per thousand characters translated using standard models, but translation using custom models costs USD 0.10 per thousand characters. The Advanced plan also includes custom model maintenance at USD 15 monthly per model, pro-rated daily.

• Microsoft Translator - Free use for the first two million characters; pay-as-you-go rate of USD 10 monthly for every million characters after the first two million. Users with regular, massive translation volumes can leverage discounts from tiered monthly commitments:
  • USD 2,055.01 monthly for 250 million characters, with an overage rate of USD 8.22 for every million characters above the limit.
  • USD 6,000 monthly for one billion characters, with an overage rate of USD 6 for every million characters above the limit.
  • USD 45,000 monthly for ten billion characters, with an overage rate of USD 4.50 for every million characters above the limit.

• SAP Translation Hub - EUR 39 (USD 48) for every "bucket" of 100,000 characters per year. Requires an SAP Cloud Platform license, though the service can be requested a la carte.

• Yandex Translate - Starts at USD 15 per million characters, with reduced prices for larger volumes:
  • USD 15 per million characters monthly for less than 50 million characters a month.
  • USD 12 per million characters monthly for over 50 million and under 100 million characters a month.
  • USD 10 per million characters monthly for translation requests between 100 and 200 million characters a month.
  • USD 8 per million characters monthly for translation requests between 200 and 500 million characters a month.
  • USD 6 per million characters monthly for translation requests between 500 million and 1 billion characters a month.
  • Custom pricing for over a billion characters translated every month.

• DeepL - No pricing available, but still a heavily trafficked free online engine. (DeepL Pro, an API for developers, was launched as this report went to print on March 20.)
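To make the list prices above concrete, here is a toy cost comparison in Python based solely on the figures quoted in this section. It is an illustration rather than an official calculator; real invoices depend on free tiers, rounding rules, and contract terms.

```python
# Toy cost comparison for the list prices quoted above (USD).
# Illustrative only: actual billing terms, free allowances, and rounding may differ.

def pay_as_you_go(chars, usd_per_million):
    """Simple per-character pricing, e.g. Google at USD 20 per million characters."""
    return chars / 1_000_000 * usd_per_million

def microsoft_tiered(chars):
    """Microsoft-style monthly commitment tiers with overage, per the figures above."""
    tiers = [  # (included characters, monthly fee, overage USD per extra million)
        (250_000_000, 2_055.01, 8.22),
        (1_000_000_000, 6_000.00, 6.00),
        (10_000_000_000, 45_000.00, 4.50),
    ]
    # Baseline: pay-as-you-go at USD 10/million after the first 2 million free characters.
    best = pay_as_you_go(max(chars - 2_000_000, 0), 10)
    for included, fee, overage in tiers:
        extra = max(chars - included, 0) / 1_000_000 * overage
        best = min(best, fee + extra)
    return best

if __name__ == "__main__":
    volume = 300_000_000  # characters per month
    print(f"Google list price:   USD {pay_as_you_go(volume, 20):,.2f}")
    print(f"Amazon list price:   USD {pay_as_you_go(volume, 15):,.2f}")
    print(f"Microsoft best tier: USD {microsoft_tiered(volume):,.2f}")
```

At 300 million characters a month, for example, the 250-million commitment plus overage works out cheaper than Microsoft's straight pay-as-you-go rate.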
There are also boutique and specialized language technology providers such as Systran, Lilt, and KantanMT, who offer their neural network-based language technology without the requisite cloud platform. Some of these include:

• Globalese MT
• GTCom (Chinese provider of AI-based technology)
• Iconic Translation Machines
• KantanMT
• Lilt (adaptive MT)
• National Institute of Information and Communications Technology (NICT) in Japan
• Omniscien Technologies
• PROMT
• Systran
• Tilde NMT

These companies are typically language technology, machine learning, or ICT companies in general. Some of them focus on a very specific neural network-based technology, such as Lilt's predictive, machine learning, adaptive MT. Some companies like Systran are very early adopters of NMT and continue to contribute to open source initiatives. Others still are active in the research scene. KantanMT and Tilde, for instance, are involved in current consortiums within the European Council's many language technology initiatives.

Finally, some LSPs have also developed NMT offerings in-house or through acquisitions. They typically leverage existing parallel corpora and linguistic expertise for the development of their NMT services. Some LSPs that offer NMT technology today include:

• Pangeanic (PangeaMT)
• SDL
• TransPerfect
• United Language Group (powered by Lucy Software)

Current Customized NMT Deployments

The NMT provider ecosystem is already taking shape and includes large enterprise players as well as boutique providers. In a consumer context, the technology has been deployed extensively by Google, DeepL, Microsoft, and other free online services. As many language service providers are now integrating NMT into their supply chain, we wanted to highlight two high profile (and high volume) customized deployments of NMT.

EPO's Patent Translate

Patent Translate is a free translation service the European Patent Office (EPO) launched in 2013 in association with Google. Slator reported in June 2017 that the EPO had moved this service to NMT. At the time the article was published, the EPO used NMT to translate eight (Chinese, French, German, Japanese, Korean, Portuguese, Spanish, and Turkish) of the 28 supported languages.
The EPO blog post announcing the move noted: "The NMT solution is producing significant improvements in the EPO's quality of translated texts."

EPO President Benoît Battistelli told Slator in that article that the EPO received approximately 15,000 translation requests on average per day, mostly from India, Japan, Russia, and the US, in addition to requests from EPO member states.

The EPO's use case appears to paint a very positive picture for NMT. For instance, according to Battistelli, it would take 16,000 man-years to translate the Chinese patent documentation available at the time into English. Meanwhile, through NMT, Patent Translate provides all that documentation in the EPO's three official languages instantly.

Of course, Patent Translate's NMT engines are trained specifically in this domain. Our article quoted Battistelli saying the EPO sets a threshold of several tens of thousands of human translations in a language "corpus" before it considers offering the language in Patent Translate.

If nothing else, the EPO's case study shows that processing large volumes of automated patent translation with high quality output is feasible. But there's a caveat: the training data. Patent Translate is trained on millions of highly specialized, in-domain texts. In fact, the EPO's Espacenet, a free online service for searching patents and patent applications, includes more than 100 million source patent texts from over 90 patent granting authorities, written in many languages.

According to Jean Senellart, CTO of Systran: "To create a high quality NMT engine, one would need about 20 million sentences, but there is no upper limit—the more data we put, the better it becomes."

Diego Bartolome, Machine Translation Manager of TransPerfect, hedged his bet at 1 billion sentences. He did note, however, that they "produced a fluent NMT engine for Korean with only two million words in the training materials. So it's not necessarily that a neural MT engine needs more data. It depends on the goal and its scope."

"To build a general purpose MT engine the likes of Google Translate, massive volumes of data are needed—tens of millions of sentence pairs," said John Tinsley, CEO of Iconic Translation Machines. He provided supporting context on the not-so-simple relationship between training data and NMT engines: "For domain-specific engines, it totally depends on the use case, the variety of source content to be translated, vocabulary size, etc. We've built production engines with as little as one million sentence pairs and as much as 48 million sentence pairs."

Booking.com: A Trial NMT Deployment

Travel fare aggregator and lodging reservations website Booking.com found that the convergence of three major technology trends gave them a golden opportunity to try out a production-level NMT system: there was demand for local language content, access to cheap computing power in the cloud, and now, open-source NMT frameworks.

They built an NMT system with OpenNMT and tried it out. They published a research paper on arXiv detailing their findings:

1. NMT consistently outperforms SMT;
2. Performance degraded for longer sentences, but even then NMT still outperformed SMT;
3. In the case of German, in-house NMT performed better than online general purpose engines;
4. Fluency of NMT is close to human translation level.

Booking.com reportedly handles 1.55 million room night reservations a day, and the company offers content in 40 different languages.
The research paper indicated that property descriptions (hotels, apartments, B&Bs, hostels, etc.) were a main use case for NMT.
Additionally, the paper goes on to note that in-house MT systems can increase translation efficiency "by increasing its speed and reducing the time it takes for a translated property description to appear online, as well as significantly cutting associated translation costs."

Maxim Khalilov, Commercial Owner at Booking.com, said they identified ten use cases for MT within the company and that they will focus on these according to a list of priorities.

Booking.com's trial run of an NMT system was motivated by corporate priorities and enabled by access to the required technology. Is it a good example of a corporate user that can build its own system instead of relying on service providers?

Pavel Levin, Senior Data Scientist at Booking.com and lead author of the research paper, provides more insight into the matter. "It is true that there are many deep learning frameworks out there which make it relatively easy to build your own NMT system, however your system will only be as good as the data you have," he said. "In our case we have millions of relevant parallel sentences, which makes it a perfect setting for rolling out our own system. However, if you are a small startup with no data, or getting into a new business area, it might be easier to just buy services from one of several existing commercial general-purpose engines, assuming their quality and usage constraints (legal, technical) suit you."

Language Technology Evangelist Kirti Vashee counseled against attempting to build your own system simply because of the availability of technology such as OpenNMT. "It would be unwise to build NMT or DIY without deep expertise in machine learning, data analysis and preparation, and overall process management skills," he said. "My recommendation is to find experts and work closely with them."

"This (building a system from scratch) is not an actual possibility for most of the translation industry players. I would expect most would fail with such a strategy." —Kirti Vashee, Language Technology Evangelist, SDL

Asked if it is a question of maturity—whether you should buy until you can build—Vashee said, "possibly, but for now buy is a wiser strategy."

NMT Performance: What NMT Can and Cannot Do

Most of the hype surrounding NMT is due to its perceived superiority. It is consistently better than predecessor technology in most areas, particularly in terms of translation output fluency.

One of the most widely talked about pieces of NMT-related news in 2017 was the launch of technology provider DeepL, developed by the founders of Linguee.com, an online dictionary launched in 2009.

In Slator's own, wholly unscientific test, we pitted DeepL against Google Translate by having both translate three short paragraphs taken from a Bloomberg article from English to German. Our anecdotal experiment on DeepL's general purpose engine showed that it was indeed somewhat more fluent for shorter sentences. One translation stood out as quite accurate and indeed much more fluent for that particular target sentence compared to Google Translate. Meanwhile, no Google-translated sentence seemed unambiguously better than its DeepL counterpart.
We also noticed that in longer sentences, DeepL and Google Translate both broke down.

DeepL appears to have impressed mainstream media more than the obvious and unavoidable baseline comparison: Google Translate. So is it the industry player to beat? Not necessarily. The only conclusion we can draw is that DeepL's generic NMT engine is usually better than Google's generic NMT engine.

So how much and what sort of data is required to create a fluent engine? "This is the question no one has or can give a definite answer to," said Kyunghyun Cho, Assistant Professor at New York University, one of the pioneers in NMT research. "It depends not only on the problem (target language pairs) or the quality of data (which is after all not even well-defined rigorously), but also on what kind of learning algorithm is used (some learning algorithms are more robust to noise in data, while some others are more sensitive) and what kind of model is used (some neural net architectures are more robust while some others are more sensitive)."

From a corporate perspective, Omniscien CEO Andrew Rufener essentially agreed and said the fluency of any NMT engine "depends very much on the domain and the language pair. The more complex the domain or the broader and more complex the language, the more data is needed."

It turns out "how much" data is the wrong question. It is more about "how good" the data is.

"The general finding is that NMT systems benefit from cleaner and more diverse training corpora rather than a massive unfiltered corpus, which was typically best for phrase-based systems." —Spence Green, CEO, Lilt

Tony O'Dowd, CEO of KantanMT, went further and said "there is no direct correlation between the amount of training data and the quality of a NMT engine." What was important, he said, was to use "highly cleansed and aligned training data for the purposes of building successful NMT engines." "We find we can build NMT engines using less data than our equivalent SMT engines, with the caveat that it's of much higher quality," O'Dowd added.

So training data remains important, but quality is key. Furthermore, the quantity of additional training data dumped into engines will not yield equivalent leaps in performance.

"The learning curve of neural machine translation systems is roughly logarithmic. Every doubling of the training data gives a small bump in quality." —Rico Sennrich, Post-doctoral researcher, University of Edinburgh

Gauging fluency improvements in NMT is not as simple as asking how much data is needed, but there is consensus on NMT's edge over SMT, among other things.
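Sennrich's rule of thumb can be written down informally as follows. This is an illustrative approximation rather than a fitted model, and the constants are deliberately left unspecified.

```latex
% Illustrative approximation of "every doubling gives a small bump" (not a fitted model).
\[
  \text{quality}(D) \;\approx\; q_{0} + k \,\log_{2}\!\left(\frac{D}{D_{0}}\right)
\]
% D   = amount of training data,
% D_0 = a reference corpus size, q_0 = quality measured at that size,
% k   = the small gain added per doubling of the data.
```

On this view, going from one million to two million sentence pairs buys roughly the same improvement as going from ten million to twenty million, whichever quality metric is used.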
Exceptional Capabilities of NMT

At all three SlatorCons in 2017, we had an NMT expert present on various aspects of the popular technological trend. TextShuttle CTO Samuel Läubli's presentation at SlatorCon Zürich in December 2017, in particular, centered on the benefits of NMT compared to predecessor technology and in general. Indeed, from the very first NMT research paper on arXiv to today, there are a few advantages NMT has demonstrated over SMT.

1. NMT is more fluent.

[Photo: Kyunghyun Cho, Asst. Professor, NYU, at SlatorCon New York 2017]

In his SlatorCon presentation, Läubli explained how NMT engines consider entire sentences while SMT considers only a few words at a time, so the result is that NMT's output is often more fluent.

SMT systems would evaluate the fluency of a sentence in the target language a few words at a time using an N-gram language model, Läubli said. "If we have an N-gram model of order 3, when it generates a translation, it will always assess the fluency by looking at n-1 previous words," he said in his presentation. "So in this case it would be two previous words." This means that given a sentence of any length, an SMT system with a 3-gram language model will make sure every three words are fluent together. "The context is always very local. As new words are added, we can always look back, but only to a very limited amount, basically," Läubli said.

On the other hand, NMT models use recurrent neural networks. According to Läubli, "the good thing here is that we can condition the probability of words that are generated at each position on all the previous words in that output sentence." In essence, where SMT is limited to however many words its N-gram model dictates, NMT evaluates fluency over the entire sentence.

2. NMT makes better translation choices.

Both SMT and NMT, in the simplest sense, function using numerical substitution—i.e. they replace words with numbers, and then proceed to perform mathematical operations on those numbers to translate. In an extremely simplified perspective, SMT more or less uses arbitrary numbers, in the sense that two related words would have numbers that aren't related. Läubli gave two example sentences in his presentation where only one word is different, but used the same way: one sentence used "but" and the other used "except." SMT systems, he said, would for example assign values like ID numbers 9 and 2 to the two words respectively, and therefore not relate them in any way. On the other hand, NMT systems would assign values like 3.16 and 3.21, essentially placing them close together if the training data shows their use to be fairly similar. "NMT systems capture the similarity of words and can then benefit from that," Läubli said.
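In symbols, the contrast Läubli draws can be sketched like this; it is a schematic view, not the exact formulation of any particular system.

```latex
% SMT fluency scoring with a 3-gram language model: only the two previous words matter.
\[
  P_{\text{LM}}(y_i \mid y_1, \dots, y_{i-1}) \;=\; P(y_i \mid y_{i-2},\, y_{i-1})
\]
% NMT decoding: each target word is conditioned on the full source sentence x
% and on all previously generated target words.
\[
  P_{\text{NMT}}(y_i \mid y_1, \dots, y_{i-1},\, x)
\]
```

The n-gram window keeps context strictly local, while the NMT factorization lets every earlier output word and the whole source sentence influence the next choice, which is the basis of the fluency difference Läubli highlights.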
3. NMT can choose translations that would rarely occur in training corpora.

In Kyunghyun Cho's presentation at SlatorCon New York, he said NMT was very robust when it came to spelling mistakes, and additionally, "it can actually handle many of those very rare compound words." He explained that the NMT system they trained can translate into compound words that rarely appear in a training corpus the size of 100 million words.

"It can also handle the morphology really well," Cho said. "It can even handle the, let's say, ironical compound word." During the presentation, he pointed to an example in his slide: "This means 'worsening improvement' in German, that's actually ironic—how is the improvement worsening? This character to character level [of NMT] is able to handle that perfectly."

Joss Moorkens, Assistant Professor at Dublin City University and post-doctoral researcher at ADAPT, said "the use of byte pair encoding (breaking words into sub-word chunks) helps with translation of unseen words (that don't occur in the training data), but can also result in NMT neologisms – non-dictionary compound words that the system has created." (A toy sketch of byte pair encoding appears at the end of this subsection.)

4. NMT can automatically code-switch.

According to Cho in his SlatorCon New York presentation, NMT "automatically learns how to handle code-switching inside a sentence." He provided an example: "We decided to train a NMT system to translate from German, Czech, Finnish, and Russian to English. We're going to give it a sentence in any of these four languages and ask the system to translate it into the corresponding English sentence. We didn't give any kind of language identifier."

The system they trained did not need to identify the source languages. "Now, since our model is actually of the same size as before, we are saving around four times the parameters. Still, we get the same level or better performance," Cho said. "In a human evaluation especially in terms of fluency, this model beats any single paired model you can think of."

Cho continued to elaborate on NMT's code-switching: "Once we trained the model we decided to make up these kinds of sentences: it starts with German and then Czech, back to German and into Russian and ends with German. The system did the translation as it is, without any kind of external indication which part of the sentence was written in which language."

5. NMT reduces post-editing effort.

Back at Läubli's SlatorCon Zürich presentation, he said NMT reduces post-editing effort by about 25%. Rufener, CEO of Omniscien, and Tinsley, CEO of Iconic, both agreed that there would be significant gains in productivity given high quality MT engines, though the former cautioned that it again depends on the use case and the latter "will not hazard to suggest a number."

O'Dowd, CEO of KantanMT, turned to client use cases: "Our clients are finding that high fluency clearly leads to business advantages. Since the vast majority of MT output is not post-edited, high fluency will lead to high adequacy which in turn leads to high usability. This means that the translations are closer to 'fit-for-purpose' scenarios than ever previously experienced using SMT approaches. Instant-chat, Technical and Customer Support portals, Knowledge bases, these can all now be translated at levels of quality previously unheard of."
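As referenced under point 3, here is a toy Python sketch of the merge-learning step behind byte pair encoding. It follows the general sub-word approach popularized by Sennrich and colleagues, but the tiny vocabulary and the number of merges are invented for illustration.

```python
# Toy byte pair encoding (BPE): learn sub-word merges from a tiny word list.
# Illustrative only; real systems run over full corpora and a fixed merge budget.
from collections import Counter
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the chosen symbol pair fused into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated characters, with "</w>" marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
# Frequent chunks such as "es" and "est</w>" emerge; unseen words and compounds can
# then be assembled from these sub-word units instead of being treated as unknowns.
```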
Current Limitations of NMT

So NMT has exceptional qualities compared to SMT. In his SlatorCon London presentation, however, John Tinsley, CEO of Iconic, wanted to set expectations straight, pointing out that it is best for the industry at large not to focus only on the hype. What are NMT's limitations and how exactly do they impact possible business applications?

1. A major limitation is handling figures of speech, metaphors, hyperbole, irony, and other ambiguous semantic elements of language.

When Slator talked to inbound marketing company HubSpot for a feature article and asked them about MT, Localization Manager Chris Englund said they do not use MT, and that it was, in fact, a taboo topic. The reason was simple: marketing required much more creative translation—more transcreation than translation in many places. So MT, at its current level of handling ambiguous semantic elements of language, was not an option for them.

In January 2018, Slator published an article on a research paper by Antonio Toral, Assistant Professor at the University of Groningen, and Andy Way, Professor in Computing and Deputy Director of the EU's ADAPT Centre for Digital Content Technology. Their research is titled "What Level of Quality can Neural Machine Translation Attain on Literary Text?". They found that the NMT engine they trained on 100 million words of literary text was able to produce translations that were equal to professional human translation about a fifth to a third of the time.

NMT, like SMT before it, is a data-driven approach that depends on statistical estimation from a parallel corpus of training data. So when NMT encounters an idiom, for instance, with words or phrases used in a way that contradicts how similar words or phrases are used more commonly, it will have a difficult time translating properly. "It is a core property of all data-driven approaches to machine translation that they learn and reproduce patterns that they see in the translations that are used for training," Sennrich said.

"Creative language will remain a challenge because it by definition breaks with the most common patterns in language." —Rico Sennrich, Post-doctoral researcher, University of Edinburgh

"NMT systems only learn from what's in the training data, and while the addition of context for each word (based on training data and words produced so far) is a valuable addition, the process is hardly transcreation." —Joss Moorkens, post-doctoral researcher, ADAPT Centre and Lecturer at Dublin City University

Moorkens pointed out, however, that this does not necessarily mean NMT output cannot be useful for translating creative texts, referring to Toral and Way's research. Cho was optimistic that it was at least not impossible. "As long as such phenomena occur in a corpus, a better learning algorithm with a better model would be able to capture them. It is just a matter of how we advance our science and find those better algorithms and models," he said.
2. Despite the fact that NMT is more fluent than SMT, more complex, longer sentences still suffer poorer output.

Any MT system, regardless of technology or model, stumbles on longer sentences. Indeed, several of the experts we spoke to explained that translating longer sentences is just more difficult (even for humans), period.

"Consider translating the sentence 'Peter saw Mary' to French. You can probably enumerate all possible French translations of that sentence. Now write down a 40-word sentence and try to enumerate all of the possible translations. Longer inputs present harder search and modeling problems." —Spence Green, CEO, Lilt

Samuel Läubli pointed out that NMT quality only degenerates noticeably for very long sentences of over 60 words. He explained that NMT systems have no "intuition" of sentence length; "translations of long input sentences are often too short," he said. This is the same point Tinsley brought up in his SlatorCon London presentation: sometimes NMT engines would translate a 30-word Chinese sentence into just six English words, he said. "Current research focuses on modelling coverage (Tu et al., 2016), i.e., making sure that all words of the input are covered in the output," Läubli said.

Jean Senellart compared the problem of very long sentences to the problem of translating a single sentence without the context of the entire document. "We don't have yet the ability to translate a sentence given the context of the document. A very long sentence is generally multiple sentences put together," he said.

Diego Bartolome said they have achieved improved performance on longer sentences by applying "segmentation techniques" to the source sentence, which is one of the ways SMT output for longer sentences was improved in the past. (A naive sketch of this kind of source-side segmentation appears at the end of this point.)

There is ongoing research focusing on this specific limitation, according to Cho and Sennrich. Sennrich, however, noted that "currently, standard test sets don't strongly incentivize improving performance on long sentences, and most neural systems do not even use long sentences in training for efficiency reasons."

[Photo: Samuel Läubli, CTO, TextShuttle, at SlatorCon Zurich 2017]

With the way NMT research has accelerated, however, only time will tell whether there will be more research specifically into the translation fluency of longer sentences. Using additional layers of technology on NMT engines can help alleviate the issue. "Attention mechanisms" and other hidden layers applied to the model can improve quality, but not completely resolve the problem. Also, they can come at the cost of computing power. Omniscien CEO Rufener pointed out that hybridization of NMT and SMT can also circumvent the problem.
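As flagged above, here is a naive Python illustration of source-side segmentation for very long sentences. It is a generic heuristic sketch under assumed thresholds and punctuation rules, not the proprietary technique Bartolome describes: split an over-long source sentence at clause punctuation, translate the chunks separately, and rejoin them, trading some cross-clause context for stability.

```python
# Toy source-side segmentation for over-long sentences (illustrative heuristic only).
import re

MAX_WORDS = 60  # Läubli notes quality degrades noticeably beyond roughly this length

def segment_long_sentence(sentence, max_words=MAX_WORDS):
    """Split a long sentence into chunks at clause punctuation, each under max_words."""
    if len(sentence.split()) <= max_words:
        return [sentence]
    clauses = re.split(r"(?<=[;:,])\s+", sentence)  # keep the delimiter with its clause
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and len(candidate.split()) > max_words:
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def translate_long(sentence, translate):
    """`translate` is any sentence-level MT callable; chunks are translated independently."""
    return " ".join(translate(chunk) for chunk in segment_long_sentence(sentence))
```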
3. Terminology accuracy may take a hit.

Research into terminology accuracy and input from our respondents confirm that NMT can indeed sometimes be less accurate when it comes to terminology, and part of that is due to its nature. It is less consistent and predictable than SMT (which is also a factor that allows it to potentially come up with better translations and choose rare compound words). Additionally, when training SMT models, you can explicitly force them to learn terminology, which is a trickier concept for NMT.

"General disambiguation depending on a context is problematic," according to Maja Popovic, Researcher at the German Research Centre for Artificial Intelligence (DFKI) Language Technology Lab in Berlin, Germany. She said that in the example "I would give my right arm for this," the word "right" can be translated as "correct" instead of "opposite of left."

"Naturally, NMT engines are very good at learning the structure of the language and are missing the simple ability to memorize a list of words while SMT is (only) good at that." —Jean Senellart, CTO, Systran

However, there are workarounds to this limitation. One way is applying a "focused attention" mechanism to the NMT decoder to constrain how the engine translates specific words, and to indicate which words those are through user dictionaries with specific terminology. This is called constrained decoding. Läubli provides an example: "If «Coffee» is translated as «Kaffee» in our termbase, we force the NMT system to include «Kaffee» in the German output for any English input containing «Coffee»." He noted, however, that the approach can be slow, and gets slower the more output constraints are added. (A toy termbase check illustrating the idea appears at the end of this section.)

"Accuracy on the terminology is also a matter of the specificity of an MT engine," Diego Bartolome added. "We have trained a client-specific engine with 20 million words with NMT, and there are no issues with terminology; it is indeed accurate." Domain adaptation will indeed improve terminology accuracy with relevant training data, according to Sennrich.

4. NMT is a "black box."

The "black box" problem is not quite clear-cut. The supposed problem is that because NMT components are trained together, when there is a problem in translation output, it is harder to tell where the problem originated. This means harder debugging and customization.

"This is correct to an extent," said Tinsley. "In NMT, if we have an issue, there aren't many places to look, aside from the data on which the model was trained, and maybe some of the parameters. Aside from that, it's a black box. With SMT we can see exactly how and why a particular translation was generated, which might point to a component in the process that we could modify."

Andrew Rufener was inclined to disagree with the problem statement: "No, this is not correct. The neural model is definitely more complex and does not allow the same level of control in every single step as the statistical model does. At the same time, neural is not a black box, the same way statistical isn't. The mechanisms for control however, are different and require different approaches and it is definitely correct to state that there is less control than with statistical machine translation."
"There is no doubt that the customization of internal components (such as named entities, placeholders, tags) is more challenging for NMT systems when compared to SMT systems," said Tony O'Dowd. He added, however, that SMT faced similar challenges that were resolved as the technology matured.

The consensus of the experts we talked to amounted to: yes, NMT is more difficult to debug and customize because it does not offer the same level of control over individual components, but it is not a big problem. It appears to be more a matter of changing how to debug and customize from the way the industry is used to doing it in SMT.

Senellart pointed out that "in opposition to previous technology, neural networks absorb individual feedback very easily and can learn from it quite reliably. To fix a neural network, we just need to teach it how to correct some mis-translation." Sennrich concurred: "Changes to a neural system typically involve retraining of the end-to-end model."

Pavel Levin of Booking.com said, "at this point we need to combine NMT with other NLP techniques precisely to be able to control the errors on particularly sensitive parts of text (distances, times, amounts, etc.)"

As for any impact this "black box" problem presents to production-level NMT, Läubli saw none too significant: "From what I've seen in the localization industry, MT engines were mostly 'customized' through preprocessing of the input or post processing of the output. The same is totally possible with NMT systems."

Senellart shared the same sentiment. "The limitation is really virtual—some people are afraid of the fact that this means the output is not predictable so we cannot rely on it," he said. "In terms of customization, I believe neural networks are actually the easiest to customize."

Sennrich also thought neural networks are easier to customize due to their high sensitivity to the training data provided. He ventured that for any limitation encountered due to the opacity of trained NMT models, customization can then be approached through the training data, pre- and post-processing, and hybridization of MT.

O'Dowd shared how, early on in their NMT efforts at KantanMT, these limitations gave them trouble working with marked-up content. Eighteen months down the road, however, "we now have these challenges under control and to a great extent resolved," he said.

Moorkens offered how they resolved a similar issue working on the Translation for Massive Open Online Courses (TraMOOC) project: "Our University of Edinburgh partners in the TraMOOC project have had some success in adapting NMT systems to the MOOC/educational domain, using general data (which will help with new input) and a smaller amount of domain-specific data, and using transfer learning for domain adaptation."
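The termbase constraint from point 3 can at least be checked from the outside with a few lines of Python. This is only a post-hoc quality check under an invented one-entry termbase; actual constrained decoding enforces the required term inside the decoder's beam search, as Läubli describes.

```python
# Toy termbase check: flag MT output that misses a required target term.
# Illustrative only; not the decoder-level constraint mechanism itself.

TERMBASE = {"coffee": "Kaffee"}  # hypothetical English -> German termbase entry

def missing_terms(source, hypothesis, termbase=TERMBASE):
    """Return the target terms that should appear in the output but do not."""
    source_lc, hyp_lc = source.lower(), hypothesis.lower()
    return [
        target
        for src_term, target in termbase.items()
        if src_term in source_lc and target.lower() not in hyp_lc
    ]

violations = missing_terms("Would you like some coffee?", "Möchten Sie etwas Tee?")
print(violations)  # ['Kaffee'] -> this hypothesis would be rejected or penalized
```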
What's Next in NMT

Key Takeaways

▶ As NMT's fluency outmatches predecessor technology, there is a need for new, comprehensive methods for defining quality and quality gains in research, as well as a means to efficiently combine human evaluation with automated processes into new quality standards.

▶ Meanwhile, as NMT highlights the need for high quality training data, not only is there a growing niche market of companies commoditizing parallel corpora, there is also the simultaneous issue of guaranteeing quality and relevance in any training data set.

▶ Finally, training data, in its many aspects, is just one of many "exciting" research directions in the bustling academic research scene, which also includes using pivot languages and zero-shot translation to address low-resource language NMT.

As NMT increasingly becomes a standard technology across all areas of the language industry, it brings with it a few sweeping changes that are already emerging.

How Do You Quantify Quality?

Since NMT is dependent on the quality of the training data more so than its quantity, the question of what constitutes "quality" has become a focus. For instance, most research on NMT uses BLEU (bilingual evaluation understudy) for scoring the quality of translation output. The problem is that BLEU, an automated, algorithm-based metric, does not necessarily reflect actual translation fluency. The limited applications of BLEU are further strained as NMT output becomes increasingly more fluent than SMT.

"BLEU reached its limit for any translation, not only NMT," said Maja Popovic. All of the experts we talked to agree: BLEU remains useful as a yardstick for measuring the rapid advance of MT in terms of quality, but in terms of actually gauging fluency, it leaves much to be desired. Läubli pointed out, however, that BLEU was intended to take multiple reference translations, and if it were used that way, it would not be as problematic.

It is a staple tool for academic research, as it tells us how far the latest findings have progressed from previous ones in a fast, predictable manner, but it is definitely not "industry-proof."
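For readers who have not looked inside the metric, here is a toy sentence-level BLEU in Python. It is a simplified sketch with add-one smoothing; published results normally report corpus-level BLEU with standardized tokenization, and, as Läubli notes, the metric was designed with multiple reference translations in mind.

```python
# Toy sentence-level BLEU: clipped n-gram precision times a brevity penalty.
# Simplified and smoothed for illustration; not a drop-in replacement for standard tooling.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, references, max_n=4):
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram count by its maximum count across the references.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precision = (matched + 1) / (total + 1)  # add-one smoothing for short sentences
        log_precisions.append(math.log(precision))
    # Brevity penalty: punish hypotheses shorter than the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", ["the cat is sitting on the mat"]))
```

The sketch makes the criticism above concrete: a fluent paraphrase that shares few exact n-grams with the reference scores poorly, which is why BLEU says little about fluency on its own.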
Replacing BLEU

So while BLEU has limited uses, various researchers and industry peers have been trying out different methods of gauging fluency. Läubli said metrics like METEOR would be able to reward synonym use in translation output. Popovic places a vote of confidence in character-based scores "such as BEER, chrF and characTER… for their potential for MT evaluation."

O'Dowd said they use a character-based perplexity scoring mechanism in conjunction with F-Measure and TER that provides "a decent (albeit, not always 100% accurate at this point in time) barometer" of quality. He added, though, that since these are machine generated scores, they can only be used during the developmental phase of an NMT engine.

Bartolome said that since they use MT for post-editing, they use Levenshtein distance, or edit distance: "we could enrich it by measuring the 'complexity' associated with changes, but it's not an easy task." Silvio Picinini, Machine Translation Language Specialist at eBay, also said edit distance is a decent metric. He added that time spent by post-editors and keystroke ratios may be interesting. "Interactive MT (which predicts the next words) seems to have been looking at Word Prediction Accuracy (WPA) or next-word accuracy (Koehn et al., 2014)," he said. (A short word-level edit-distance sketch appears below, after the discussion of human evaluation.)

Of course, there is a catch in that all these methods require post-editing work, so there is still a human element.

Human Evaluation Remains the Ultimate Standard

"A well-designed blind human evaluation remains the most trusted quality assessment approach." —Mihai Vlad, VP of Machine Learning Solutions, SDL

All the experts we asked agree that human evaluation is the definitive metric for fluency. There is no way around it. BLEU combined with human assessments is an option, according to Kirti Vashee. Spence Green and Jean Senellart agreed that BLEU should be complemented with human evaluation.

[Photo: Spence Green, CEO, Lilt, at SlatorCon London 2017]

Picinini added that crowdsourced human evaluation may be a feasible approach as it is "a cheap, fast and accurate way to evaluate quality." He added that it would also accommodate quality levels for different purposes of MT in terms of content and audiences, factors that affect quality expectations.

"Human assessment remains the best way of evaluating machine translation that is at our disposal," Sennrich said, noting that the research community regularly performs shared translation tasks with human evaluation.
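As referenced under "Replacing BLEU", here is a small word-level Levenshtein (edit distance) sketch in Python, the quantity behind the post-editing distance Bartolome and Picinini mention. Normalizing by the reference length at the end is one common convention, not a fixed standard; character-level variants are also widely used.

```python
# Toy word-level Levenshtein distance between raw MT output and its post-edited version.
def levenshtein(source_tokens, target_tokens):
    m, n = len(source_tokens), len(target_tokens)
    # prev[j] holds the edit distance between source_tokens[:i-1] and target_tokens[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if source_tokens[i - 1] == target_tokens[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

mt_output = "the hotel is located in city centre".split()
post_edit = "the hotel is located in the city centre".split()
edits = levenshtein(mt_output, post_edit)
print(edits / len(post_edit))  # 1 insertion over 8 reference words = 0.125
```

The fewer edits a post-editor has to make, the lower the score, which is why edit distance is a natural proxy for post-editing effort.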
Creating a New Quality Standard

So if BLEU is not industry-proof, and human evaluation, while the ultimate standard, does not really scale as well, then from a holistic point of view, how does one go about assessing the quality of NMT output?

Yannis Evangelou, Founder and CEO of linguistic QA company LexiQA, illustrated a process for NMT split into three stages: pre-translation, machine translation, and post-editing. Pre-translation includes preparing the training data, which needs to be of the highest quality, as well as cleaning up legacy translation memories used as a reference corpus. The machine translation stage encompasses the engine's encoding/decoding process. The post-editing stage includes revision, quality assurance, and quality assessment.

"The latter two could take place twice," Evangelou said, referring to quality assurance and assessment. "During the quality assurance stage, each segment would be checked for various error classes, especially with a view to addressing locale-specific conversions; at this point, the quality assurance engine could provide suggestions for the user (e.g., alternative date notations in the target locale)."

Evangelou continued: "Following that, an initial quality assessment could take place (the post-editor would then add annotations, spot false negatives and reject false positives; this way the post-editing effort would also be calculated). As soon as the revision is complete, a second quality assurance check should automatically run to make sure that no new errors have been introduced by the post-editor."

By the end of the process, the overall quality assessment can include the initial NMT output, the post-editing effort, and the final output, which is the combined MT and human revision. Each part would be assigned a different weight, the average of which would form the total score.
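A toy version of that weighted aggregation might look like the following. The weights and the 0-100 scale are assumptions made for illustration, not LexiQA's actual scheme.

```python
# Toy weighted quality score combining the three signals Evangelou describes:
# the raw NMT output score, the post-editing effort, and the final revised output score.
def overall_quality(raw_mt_score, post_edit_effort, final_score,
                    weights=(0.3, 0.2, 0.5)):
    """All inputs on a 0-100 scale; post-editing effort is inverted so that
    less effort contributes a higher score. Weights are illustrative assumptions."""
    w_raw, w_effort, w_final = weights
    return (w_raw * raw_mt_score
            + w_effort * (100 - post_edit_effort)
            + w_final * final_score)

print(overall_quality(raw_mt_score=72, post_edit_effort=18, final_score=95))  # 85.5
```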
New and Existing NMT QA Processes

Aside from the computer generated scores used during the developmental phase of an NMT engine, O'Dowd said KantanMT also employs a proprietary new platform so professional translators can rate the quality of MT systems.

For Mihai Vlad of SDL, automatic metrics should indeed be validated by human assessment. "Equally, quality has to be defined in the context of the scenario being used," he said, offering some examples:

• Quality for post-editing is measured by the translator being more productive.
• Quality for multilingual eDiscovery is measured by the accuracy of identifying the right documents.
• Quality for multilingual text analytics is measured by the effectiveness of the analyst in identifying the relevant information.
• Quality for multilingual chat is measured by the feedback rating of the end customer.

Senellart shared the sentiment, agreeing that since MT is always connected to some use case, "the final evaluation always needs to be connected to this use case." Kyunghyun Cho, speaking from the academic side, also highlighted the need for use-case specifics in quality evaluation: "For instance, the quality of MT for dialogues would need to be measured by how well it facilitates dialogue between participants of different native languages."

"I believe the quality would need to be defined with respect to a downstream task in which translation affects its performance." —Kyunghyun Cho, Asst. Professor, New York University

Levin believes that in the near future, the standardization of NMT quality assurance might be as fragmented as demand: "We will be seeing practitioners rolling out their own metrics which are more relevant to their problems (e.g. metrics related to handling of particular named entities, scores from custom QA systems, potentially machine learning based, etc.) and use several of them in combinations." He added that he expects industry players to streamline inevitable human evaluation loops either through in-house resources, as they do at Booking.com, or through external services such as LexiQA or the TAUS DQF framework.

Levin also explained that in Booking.com's case, they use a business sensitivity framework (BSF). In their research paper, Levin and his co-authors write: "One important shortcoming of the BLEU score is that it says nothing about the so-called 'business sensitive' errors. For example, the cost of mistranslating 'Parking is available' to mean 'There is free parking' is much greater than a minor grammatical error in the output."

BSF is a two-stage system. It first identifies sentences that may contain business sensitive aspects and evaluates whether they are translated properly with respect to business sensitive information. It then flags problematic NMT output. (A toy sketch in this spirit appears at the end of this subsection.)

Ultimately, the language services market will most probably not take a single route when it comes to performance benchmarking, quality assessment, and applying metrics to NMT. "We do not foresee this industry-wide," Rufener said. "It may however happen for particular verticals."

Diego Bartolome of TransPerfect offered some examples: "In e-commerce, a metric is more related to conversion rates. For SEO, number of views. For regulatory, no critical errors. That should probably be the way to go."

User satisfaction should also be taken into account, according to Picinini. "This could be as simple as a 'Was this translation helpful to you?' with a Yes/No answer." Google and Facebook actively employ this method in their NMT systems today.
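As referenced above, here is a toy Python sketch in the spirit of Booking.com's two-stage business sensitivity check: stage one spots sentences carrying business-sensitive facts (here, just numbers), stage two verifies those facts survive translation and flags output that drops or alters them. The rules below are illustrative assumptions, not the actual BSF.

```python
# Toy two-stage business-sensitivity check (illustrative only, not Booking.com's BSF).
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def sensitive_facts(sentence):
    """Stage 1: extract candidate business-sensitive tokens (amounts, distances, times)."""
    return NUMBER.findall(sentence)

def flag_if_problematic(source, translation):
    """Stage 2: flag the pair when a sensitive fact from the source is missing."""
    missing = [fact for fact in sensitive_facts(source)
               if fact not in sensitive_facts(translation)]
    return {"flagged": bool(missing), "missing_facts": missing}

print(flag_if_problematic(
    "Parking is available for EUR 15 per day, 300 m from the property.",
    "Kostenlose Parkplätze stehen 300 m von der Unterkunft entfernt zur Verfügung.",
))  # {'flagged': True, 'missing_facts': ['15']} -> the paid parking became free parking
```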