MaTra: A Practical Approach to Fully-Automatic Indicative English-Hindi Machine Translation
Ananthakrishnan R, Kavitha M, Jayprasad J Hegde, Chandra Shekhar, Ritesh Shah, Sawani Bade, Sasikumar M
Centre for Development of Advanced Computing (formerly NCST), Juhu, Mumbai-400049, India
{anand,kavitham,jjhegde,shekhar,ritesh,sawani,sasi}@cdacmumbai.in

Abstract

MaTra is a fully automatic system for indicative English-Hindi Machine Translation (MT) of general-purpose texts. This paper discusses the strengths of the MaTra approach, especially focusing on the robust strategy for parsing and the intuitive intermediate representation used by the system. This approach allows convenient enhancement of the linguistic capabilities of the translation system, while making it possible for us to produce acceptable translations even as the system evolves. This paper also presents encouraging results of automatic evaluation using BLEU/NIST and subjective evaluation by human judges.

1 Introduction and Background

MaTra is a fully automatic system for indicative English-Hindi Machine Translation (MT) of general-purpose texts. This paper discusses the strengths of the MaTra approach, especially focusing on the robust strategy for parsing and the intuitive intermediate representation used by the system.

It is well accepted that Fully-Automatic, High-Quality, General-Purpose MT is not achievable in the foreseeable future, more so for widely divergent languages like English and Hindi. Existing MT systems work by relaxing one or more of these three dimensions. They do this by: (i) focusing on a fairly small subset of the language (in domains such as weather forecasting and official gazettes), or (ii) using human assistance during or after translation, or (iii) producing indicative rather than perfect translations.

MaTra is designed to be fully automatic, and is geared towards translating real-world, general-purpose texts. As a tradeoff, MaTra makes the last of the aforementioned approximations.

The primary testing ground for MaTra is the Web, where many sentences are grammatically incorrect or incomplete (fragments), containing unknown words and abbreviations. In such a noisy environment, with incomplete knowledge, it is impractical to work with perfect models and grammars. In this scenario, the design of MaTra represents a pragmatic approach to engineering a usable system, which aims to produce understandable output for wide coverage, rather than perfect output for a limited range of sentences.

MaTra achieves this by using a judicious mix of (i) corpus-based or statistical tools and techniques for shallow parsing, word-sense disambiguation, abbreviation handling, and transliteration, and (ii) rule-based techniques for lexical and structural transfer.

The structural transfer component has at its core a relatively simple and intuitive intermediate representation (which we call MSIR, for MaTra Structured Intermediate Representation) that can accommodate most types of sentences found in real-world texts. The simplicity and generality of the MSIR is an important aspect of our approach. Its level of detail, both syntactic and semantic, is just enough to capture the divergence between English and Hindi.
In keeping with our engineering outlook, MaTra does not attempt either a deep parse or an elaborate semantic analysis of the English sentence. The parsing strategy is hybrid: a combination of statistical and rule-based techniques. The parsing component is modular by design, with each stage working on a logically separate aspect of structural analysis. One of the primary goals of the design is graceful degradation from full sentence structures all the way down to word-by-word structures. The design of the MSIR and the parsing algorithm is the focus of this paper.

This paper is organized as follows: the next section presents the overall architecture of MaTra and describes the various components briefly. Section 3 discusses the intermediate representation used by MaTra (MSIR). The parsing algorithm used to obtain the MSIR is presented in section 4. Section 5 provides evaluation details. Section 6 concludes the paper.

Fig 1: MaTra Architecture
Sentence    → IndepClause | Conjunct | Fragment | MixedBag
Fragment    → FragClause | Conjunct
Conjunct    → Word, Sentence+
IndepClause → VerbFrame, Modifier*
VerbFrame   → VerbGroup, Subject*, Object*, Modifier*
VerbGroup   → Word+
FragClause  → FragFrame, Modifier*
FragFrame   → VerbGroup*, Subject*, Object*, Modifier*
Subject     → NounPhrase | Sentence, Modifier*
Object      → NounPhrase | Sentence, Modifier*
Modifier    → PrepPhrase | DepClause
DepClause   → Word*, FragFrame, Modifier*
NounPhrase  → Word+
PrepPhrase  → Preposition, NounPhrase
MixedBag    → (Word | NounPhrase | PrepPhrase | DepClause | Fragment | IndepClause)+

Fig 2: Grammar for the MaTra Structured Intermediate Representation (MSIR)

2 MaTra Architecture

Essentially, MaTra follows a structural and lexical transfer approach, using semantic information only when required. Figure 1 shows the overall architecture of the system.

The preprocessing component splits the input text into sentences. Abbreviations, acronyms, dates, numeric expressions, etc. are identified at this stage. Part-of-speech tagging and chunking are done using the fast Transformation Based Learning tool, fnTBL [Ngai and Florian, 2001]. The word sense disambiguation component then chooses the appropriate Hindi mapping for English words and phrases. Next, the sentence-structuring component converts the chunked input into the MSIR, which facilitates Hindi generation. A rule-based generation engine [Rao et al., 1998; Rao et al., 2000; Mehta and Rao, 2003] generates the Hindi translation from the MSIR using rule-bases for preposition-mapping, noun-inflection, verb-inflection, and structural transformation. The transliteration component handles unknown words; it uses Genetic Algorithms to learn transliteration rules from a pronunciation dictionary [Mishra, 2004].
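As an illustration of the data flow just described, a minimal sketch of the pipeline is given below. Every function in it is a trivial placeholder written only for exposition; it is not MaTra's code (the real system uses fnTBL for tagging and chunking, rule-bases for transfer and generation, and GA-learnt transliteration rules).

# Illustrative sketch of the MaTra pipeline; all stages are placeholders.
from typing import List, Tuple

Chunk = Tuple[str, str, str]                       # (word, POS tag, chunk label)

def preprocess(text: str) -> List[str]:
    # Sentence splitting; MaTra also identifies abbreviations, acronyms,
    # dates and numeric expressions at this stage.
    return [s.strip() for s in text.split(".") if s.strip()]

def pos_tag_and_chunk(sentence: str) -> List[Chunk]:
    # Placeholder for the fnTBL tagger/chunker.
    return [(word, "NN", "NP") for word in sentence.split()]

def disambiguate(chunks: List[Chunk]) -> List[Chunk]:
    # Placeholder for word sense disambiguation (choosing Hindi mappings).
    return chunks

def build_msir(chunks: List[Chunk]) -> dict:
    # Placeholder for the sentence-structuring component (section 4).
    return {"IndepClause": chunks}

def generate_and_transliterate(msir: dict) -> str:
    # Placeholder for rule-based generation plus transliteration of
    # words not found in the lexicon.
    return " ".join(word for word, _, _ in msir["IndepClause"])

def translate(text: str) -> List[str]:
    return [generate_and_transliterate(build_msir(disambiguate(pos_tag_and_chunk(s))))
            for s in preprocess(text)]

print(translate("Salim was sent to jail."))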
3 MaTra Structured Intermediate Representation (MSIR)

The core part of our transfer-based framework for translation deals with the transfer of a single clause, which is adequate for handling mono-finite sentences [Rao et al., 2000]. This has then been systematically extended to handle multi-finite sentences and non-finite clauses, thus covering compound and compound-complex sentences [Mehta and Rao, 2003], and further, to fragments and incomplete sentences. However, the framework, as of now, does not handle imperative and interrogative sentences. Figure 2 shows the grammar for the MSIR based on this framework.

In this section, we first discuss how various types of clauses are represented in the MSIR. Then, we look at the MSIR interpretations of Subjects, Objects, and Modifiers, which are simplifications of the traditional definitions. Finally, we look at how incomplete sentences and fragments are represented in the MSIR.

3.1 Clauses

A clause is the basic unit of predication in any language, and forms the basis of the MSIR too. A clause consists of a single verb group, which represents an action, event or state change. The verb group may consist of one or more verbs, including auxiliaries and pre-modifying adverbs, and may be finite or non-finite. Every verb has certain sub-categorization features, which define the number and nature of other constituents that attach with the verb to form the clause, namely, the subject, objects, complements, and modifiers. Each clause modifier may itself be a separate clause. This recursion can be used to create sentences with a hierarchy of clauses.

Clauses may be classified, based on the types of verbs that they contain, into the following three categories:

• Finite clauses: clauses containing a finite verb phrase. E.g., Jayshankar has visited Delhi
• Non-finite clauses: clauses containing a non-finite verb phrase but no finite verb phrase. E.g., Having visited Delhi, Jayshankar …
• Verbless clauses: clauses with no verb phrase. E.g., Jayshankar, then at Delhi, …
Clauses are represented in the MSIR as either independent or dependent clauses. In our representation, independent clauses are always finite, whereas dependent clauses may be finite, non-finite or verbless.

Sentences are classified as Simple, Complex and Compound based on the types of clauses that they contain. Simple sentences contain a single independent clause; complex sentences contain one independent and at least one dependent clause connected by subordinators; and a compound sentence contains more than one independent clause connected by coordinating conjunctions.

In the following three subsections, we look at how the MSIR represents simple, complex and compound sentences respectively.

3.1.1 Representing an Independent Clause – Simple Sentences

The fundamental unit that represents a single independent clause is a frame (VerbFrame) that captures:

(i) the action that is conveyed by the VerbGroup, which includes a single verb group and auxiliaries,
(ii) the entities that are involved in the action (Subject and Object; see subsection 3.2), and
(iii) any adverbials (Modifier; see subsection 3.3).

Simple sentences, which have a single independent clause, are represented by an IndepClause, which in turn contains a VerbFrame.

3.1.2 Representing Dependent Clauses – Complex Sentences

The DepClause modifier, when combined with an IndepClause, allows us to represent complex sentences (see Figure 3).

DepClause differs from IndepClause in the following ways: (i) VerbFrame is replaced by FragFrame to indicate the fact that the verb group is optional, and (ii) it allows a subordinator (Word) in front of the FragFrame; subordinators are words such as wh-words, although, because, etc.

3.1.3 Representing Compound Sentences

Compound sentences are represented by connecting clauses with a Conjunct. Conjuncts include and, but, or, etc., and also correlative conjunctions such as not only-but also and if-then (see Figure 4).

3.2 Subject and Object

Subject and Object in our representation have a more general interpretation than usual. Subjects are phrases (NounPhrase) or clauses with nominal function, which occur before the verb group (VerbGroup) in the sentence. Similarly, Objects are phrases or clauses with nominal function, which occur after the VerbGroup in the sentence. Thus, complements are also represented as Subjects or Objects. These are simply syntactic entities. We do not attempt to determine semantic roles here.

Subjects and Objects may themselves be clauses. Figure 5 shows such an example.

3.3 Modifiers

Modifiers can be attached to any component of a clause (Modifier within Subject, Object, VerbFrame, and FragFrame) or to a complete clause as a whole (Modifier within IndepClause and DepClause). This recursive nesting of clauses can be used to represent complexity of any arbitrary level.

For simplicity, we club all modifiers except prepositional phrases as DepClause, in which phrases are represented as verbless clauses. Prepositional phrases are represented using PrepPhrase.

3.4 Representing Incomplete Sentences and Fragments

We define Fragment to represent incomplete sentences which end at chunk boundaries. This is used to handle captions, titles, news headlines, etc., which are usually not full sentences.

Incomplete sentences that do not end at chunk boundaries usually cannot be parsed in full by the sentence-structuring algorithm. MixedBag is a fallback option that allows us to represent such structures. Any identified clause will be structured in the usual manner, while other parts of the sentence will be represented as chunks (see Figure 6).
Fig 3: MSIR representation for a sentence containing a dependent clause: Salim, who was the prince, was sent to jail

Fig 4: MSIR representation for a compound sentence: If you want a healthy baby, you must eat well

Fig 5: MSIR representation for a sentence containing a clause as a subject: To be the greatest batsman was his dream

Fig 6: MSIR representation for a fragment not ending at a chunk boundary: Companies are using the Globus Toolkit to develop.

The MSIR is a simple representation, which is reasonably easy to attain for most sentence types, using a straightforward parsing algorithm as discussed in the next section. This representation is admittedly rather coarse, and supplies much less syntactic and semantic information to the Hindi generation component than most existing grammars. As described, many categories of sentence elements have been clubbed together for simplicity of representation and parsing. For instance, except for prepositional phrases, modifiers are not distinguished, and complements are clubbed with Subject or Object. Semantic roles are largely ignored. However, based on a preliminary evaluation of MaTra, it is our contention that most of these details are not essential to English-Hindi indicative MT (see section 5 for evaluation details). We are working on larger-scale and more thorough evaluation to substantiate this claim.

Hindi generation from the MSIR (structural and lexical transfer) is described in detail in [Rao et al., 1998; Rao et al., 2000; Mehta and Rao, 2003].
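Purely as an illustration of the grammar in Figure 2, the MSIR could be realized as a tree of typed nodes along the following lines. The class layout and field names are our own assumptions, not MaTra's internal data structures, and Fragment and MixedBag are omitted for brevity.

# Illustrative sketch of the MSIR as typed tree nodes (cf. Fig 2).
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class NounPhrase:
    words: List[str]

@dataclass
class PrepPhrase:
    preposition: str
    noun_phrase: NounPhrase

@dataclass
class VerbGroup:
    words: List[str]

Modifier = Union[PrepPhrase, "DepClause"]

@dataclass
class Subject:
    head: Union[NounPhrase, "Sentence"]
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class Object:
    head: Union[NounPhrase, "Sentence"]
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class VerbFrame:
    verb_group: VerbGroup
    subjects: List[Subject] = field(default_factory=list)
    objects: List[Object] = field(default_factory=list)
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class DepClause:
    subordinator: List[str]          # e.g. wh-words, "although", "because"
    frag_frame: VerbFrame            # the verb group is optional in a FragFrame
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class IndepClause:
    verb_frame: VerbFrame
    modifiers: List[Modifier] = field(default_factory=list)

Sentence = Union[IndepClause, "Conjunct"]   # Fragment and MixedBag omitted

@dataclass
class Conjunct:
    word: str                        # coordinating conjunction
    sentences: List[Sentence] = field(default_factory=list)

# "Salim, who was the prince, was sent to jail" (cf. Fig 3), roughly:
salim = IndepClause(
    verb_frame=VerbFrame(
        verb_group=VerbGroup(["was", "sent"]),
        subjects=[Subject(
            head=NounPhrase(["Salim"]),
            modifiers=[DepClause(
                subordinator=["who"],
                frag_frame=VerbFrame(
                    verb_group=VerbGroup(["was"]),
                    objects=[Object(NounPhrase(["the", "prince"]))]))])],
        modifiers=[PrepPhrase("to", NounPhrase(["jail"]))]))

print(salim.verb_frame.verb_group.words)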
4 The Parsing Algorithm

The source sentence is first processed for punctuation, numerals, contractions, acronyms, and other abbreviations. The sentence is then chunked. Since the clause is the basis of the MSIR, the various clauses are then identified; this is done by the clause boundary detection component. Each clause is then structured independently as discussed in section 3.1. The MSIR for the whole sentence is finally obtained by putting these structured clauses together in a manner that adheres to the MSIR grammar.

Thus, the parsing stage can be divided into three major components: (i) the POS tagging and chunking component, (ii) the clause boundary detection component, and (iii) the sentence-structuring component.

4.1 POS Tagging & Chunking

POS tagging and chunking are done using the fnTBL POS tagger and chunker [Ngai and Florian, 2001], which is based on rules learnt using Transformation Based Learning (TBL) on the Wall Street Journal corpus. The POS tagger chooses the most probable part-of-speech tag (from the UPenn tagset) for each word. The text chunking component of fnTBL groups the words into one of Noun, Verb, Adjective, and Preposition groups.

The following is an example of fnTBL output:

[Salim NNP NP] [, , -] [who WP NP] [was VBD V] [the DT prince NN NP] [, , -] [was VBD sent VBN VP] [to TO PP] [jail NN NP]

4.2 Clause Boundary Detection

The clause boundary detection component works on the chunked sentence. It identifies the various clauses present in a well-formed sentence based on verbs and clause boundary markers called sentinels (who, whom, which, etc.) [Narayana Murthy, 1996], as described below:

Begin
  if verb not present
    label as Fragment
  End if
  if sentence is well-formed
    for each coordinating conjunction
      label left clause of coordinating conjunction as IndepClause
    End for
    label the rest of the sentence as IndepClause
    for each sentinel
      extract subordinate clause and label as DepClause
      mark the position where the subordinate clause was originally present
    End for
  End if
End

The following is an example of clause boundary output:

{[Salim NNP NP] [, , -] DepClause#1 [, , -] [was VBD sent VBN VP] [to TO PP] [jail NN NP]}
{DepClause#1 = [who WP NP] [was VBD VP] [the DT prince NN NP]}

4.3 Sentence Structuring

The last stage of parsing is the sentence-structuring component. This takes the clause boundaries and builds the MSIR from them.

The algorithm for structuring the clauses and putting these structured clauses together to obtain the MSIR is described below.

SentStruct()
Begin
  For each clause
    ClauseStruct()
  End For
  If no well-formed combination of clauses and phrases
    Create a tree with an unstructured collection of clauses and independent phrases
  Else
    If Conjunct present
      Root ← Conjunct
    End If
    Attach each structured Fragment or IndepClause to Root/Conjunct
    Attach dependent clauses to their respective IndepClauses
    Attach clause modifiers to their respective clauses
  End If
End

Each clause (with a single verb group) is then structured in the following way:

ClauseStruct()
Begin
  If not Fragment
    Attach the verb group to the root
    Search and attach the noun group occurring before the verb group as Subject
    Search and attach prepositional/noun groups as modifiers to the Subject
    Search and attach noun groups occurring after the verb group as Object
      (the ordering determines whether the object is direct or indirect)
    For each object
      identify and attach prepositional groups as modifiers to the object
    End For
    Attach the remaining groups as modifiers to the verb group
  Else
    If verb group present
      Root ← verb group
      Search and attach verb group modifiers
      Search and attach noun group as object
      Search and attach prepositional groups as modifiers
    Else
      Root ← object
      Search and attach noun group as object
      Search and attach prepositional groups as modifiers
    End If
  End If
End
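Purely for illustration, the following runnable sketch captures the flavour of the sentinel-based clause boundary detection of section 4.2. The chunk format, sentinel list, and comma-based clause-end heuristic are our own simplifications and do not reproduce the actual component, which also handles coordinating conjunctions, fragments and nested clauses.

# Simplified sketch: extract a dependent clause starting at a sentinel and
# ending at the next comma, leaving a placeholder in the main clause.
from typing import List, Tuple

Chunk = Tuple[str, str, str]                 # (word, POS tag, chunk label)
SENTINELS = {"who", "whom", "which", "that", "whose", "where", "when"}

def detect_clauses(chunks: List[Chunk]):
    main: List = []
    dep_clauses: List[List[Chunk]] = []
    i = 0
    while i < len(chunks):
        word = chunks[i][0]
        if word.lower() in SENTINELS:
            # Collect the subordinate clause: sentinel up to the next comma.
            j = i
            clause: List[Chunk] = []
            while j < len(chunks) and chunks[j][0] != ",":
                clause.append(chunks[j])
                j += 1
            dep_clauses.append(clause)
            main.append(f"DepClause#{len(dep_clauses)}")   # mark original position
            i = j
        else:
            main.append(chunks[i])
            i += 1
    return main, dep_clauses

# "Salim , who was the prince , was sent to jail" in chunked form:
sentence = [("Salim", "NNP", "NP"), (",", ",", "-"),
            ("who", "WP", "NP"), ("was", "VBD", "VP"),
            ("the", "DT", "NP"), ("prince", "NN", "NP"), (",", ",", "-"),
            ("was", "VBD", "VP"), ("sent", "VBN", "VP"),
            ("to", "TO", "PP"), ("jail", "NN", "NP")]

main, deps = detect_clauses(sentence)
print(main)   # main clause with DepClause#1 marking where the clause was
print(deps)   # [[('who', ...), ('was', ...), ('the', ...), ('prince', ...)]]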
The parsing algorithm currently does not support imperative and interrogative sentences. Also, the implementation for non-finite clauses is not complete.

5 Evaluation Strategy & Results

A preliminary test set of 315 sentences has been created by selecting sentences from news archives and other sources. This test set is in two parts.

The first part contains declarative sentences and fragments exhibiting certain basic grammatical phenomena, for example, sentences with: (i) different types of clauses in various roles, (ii) different types of phrases, and (iii) all tense-aspect-modality combinations. Evaluation shows that the MSIR and parsing algorithm can handle such sentences well.

The second part includes sentences and fragments with (i) interrogative, subjunctive, and imperative mood, (ii) phrasal verbs, (iii) elliptical clauses, (iv) idioms, etc. This part is intuitively more difficult for MT. We are currently working on extending the MSIR and parsing algorithm to accommodate these phenomena.

The complete test set of 315 sentences was used to evaluate the translations produced by MaTra. This serves as an indication of the effectiveness of the MSIR, parsing algorithm, and Hindi generation component.

BLEU [Papineni et al., 2002] and NIST [Doddington, 2002] evaluations were done using the NIST evaluation toolkit (http://www.nist.gov/speech/tests/mt/resources/scoring.htm). One reference translation was used for each sentence. N-grams (up to 4-grams) were matched after stemming.

Table 1 shows the scores for the current version of MaTra (called MaTra2) and the previous version. The previous version, in addition to imperative and interrogative sentences, also did not have support for fragments and incomplete sentences. Non-finite clauses were also not supported. The scores suggest that substantial improvement has been made in MaTra2 over the previous version.

        MaTra (Feb 05)   MaTra2 (Mar 06)
BLEU    0.0377           0.0534
NIST    2.1261           3.1494

Table 1: BLEU and NIST scores

Manual inspection indicates that there are differences in lexical choice between the reference and system translations. Structural differences also exist due to the free word-order of Hindi. These are partly the reason for the low absolute scores. To improve the automatic evaluation process, we are increasing the number of reference translations for BLEU/NIST. We are also looking at other automatic evaluation strategies that perform matching of linguistic phrases rather than n-grams. Such a strategy would perhaps be more suited for a free word-order language like Hindi.
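As a side note, the following is a minimal sketch of the single-reference, up-to-4-gram matching that BLEU performs, computed here with NLTK. The toolkit choice, tokenization, and example strings are our own illustrative assumptions; unlike the evaluation above, no stemming is applied.

# Minimal illustration of single-reference BLEU scoring (1- to 4-grams),
# using NLTK as a stand-in for the NIST scoring toolkit mentioned above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["salim", "ko", "jail", "bheja", "gaya", "tha"]   # one reference translation
hypothesis = ["salim", "ko", "jail", "bheja", "gaya"]         # system output (invented example)

# Equal weights over 1- to 4-gram precision; smoothing avoids a zero score
# when some higher-order n-grams do not match.
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.4f}")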
Subjective evaluation was performed by two native Hindi speakers, who are also proficient in English. These judges were able to identify most transliterated words, which may not be true for a person not knowing English. They graded translations on the following 4-point scale [Sumita et al., 1999]. Translations marked as one of A, B, and C can be considered acceptable:

A) Perfect: no problems in either information or grammar
B) Fair: easy to understand, with some unimportant information missing or flawed grammar
C) Acceptable: broken but understandable with effort
D) Nonsense: important information has been translated incorrectly

Grade          % of sentences
A              12.7 %
(A + B)        37.1 %
(A + B + C)    65.4 %

Table 2: Subjective evaluation scores

Table 2 shows the results of the subjective evaluation. Though the number of perfect translations is low (12.7%), it is highly encouraging that more than 65% of the translations were rated as acceptable.

6 Conclusion

In this paper, we have described a practical strategy for designing an English-Hindi machine translation system. The system is geared towards translating real-world, general-purpose texts with a high level of noise, such as those on the Web. The design of the system represents a pragmatic approach to engineering a usable system, which aims to produce understandable output for wide coverage, rather than perfect output for a limited range of sentences. The design is based on an intuitive intermediate representation (MSIR) that especially simplifies the parsing algorithm, as described in the paper. The approach allows convenient enhancement of the linguistic capabilities of the translation system, while making it possible for us to produce acceptable translations even as the system evolves.

Though at an early stage of implementation, preliminary evaluation of the system is very encouraging: more than 65% of translations were rated as acceptable by human judges. The paper also reports automatic evaluation results using BLEU and NIST.

Future work will look at handling interrogative, subjunctive and imperative moods, phrasal verbs, idiomatic usages, etc. Larger-scale and component-wise evaluation of the system is also being planned.

References

Doddington, G. 2002. Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics. Proceedings of the Second International Conference on Human Language Technology (HLT), 2002.

Mehta, V. and Rao, D. 2003. Natural Language Generation of Compound-Complex Sentences for English-Hindi Machine Aided Translation. Proceedings of the Symposium on Translation Support Systems (STRANS), 2003.

Mishra, A. 2004. Discovering Rules for Transliteration from English to Hindi: A Genetic Algorithms Approach. MCA thesis, SSIT Orissa, 2004.

Mohanraj, K., Hegde, J., Dogra, N., and Ananthakrishnan, R. 2003. The MaTra Lexicon. Technical Report, CDAC Mumbai, 2003.

Narayana Murthy, K. 1996. Universal Clause Structure Grammar. Ph.D. thesis, Dept. of Computer and Information Sciences, University of Hyderabad, 1996.

Ngai, G. and Florian, R. 2001. Transformation-Based Learning in the Fast Lane. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.

Rao, D., Bhattacharya, P., and Mamidi, R. 1998. Natural Language Generation for English to Hindi Human-Aided Machine Translation. Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS), 1998.

Rao, D., Mohanraj, K., Hegde, J., Mehta, V., and Mahadane, P. 2000. A Practical Framework for Syntactic Transfer of Compound-Complex Sentences for English-Hindi Machine Translation. Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS), 2000.

Sumita, E., Yamada, S., Yamamoto, K., Paul, M., Kashioka, H., Ishikawa, K., and Shirai, S. 1999. Solutions to Problems Inherent in Spoken-Language Translation: The ATR-MATRIX Approach. Proceedings of MT Summit VII, 1999.