MaTra: A Practical Approach to Fully-Automatic Indicative English-Hindi Machine Translation
Ananthakrishnan R, Kavitha M, Jayprasad J Hegde, Chandra Shekhar, Ritesh Shah, Sawani Bade, Sasikumar M
Centre for Development of Advanced Computing (formerly NCST), Juhu, Mumbai-400049, India
{anand,kavitham,jjhegde,shekhar,ritesh,sawani,sasi}@cdacmumbai.in

Abstract

MaTra is a fully automatic system for indicative English-Hindi Machine Translation (MT) of general-purpose texts. This paper discusses the strengths of the MaTra approach, especially focusing on the robust strategy for parsing and the intuitive intermediate representation used by the system. This approach allows convenient enhancement of the linguistic capabilities of the translation system, while making it possible for us to produce acceptable translations even as the system evolves. This paper also presents encouraging results of automatic evaluation using BLEU/NIST and subjective evaluation by human judges.

1 Introduction and Background

MaTra is a fully automatic system for indicative English-Hindi Machine Translation (MT) of general-purpose texts. This paper discusses the strengths of the MaTra approach, especially focusing on the robust strategy for parsing and the intuitive intermediate representation used by the system.

It is well accepted that Fully-Automatic, High-Quality, General-Purpose MT is not achievable in the foreseeable future, more so for widely divergent languages like English and Hindi. Existing MT systems work by relaxing one or more of these three dimensions. They do this by: (i) focusing on a fairly small subset of the language (in domains such as weather forecasting and official gazettes), or (ii) using human assistance during or after translation, or (iii) producing indicative rather than perfect translations.

MaTra is designed to be fully automatic, and is geared towards translating real-world, general-purpose texts. As a tradeoff, MaTra makes the last of the aforementioned approximations.

The primary testing ground for MaTra is the Web, where many sentences are grammatically incorrect or incomplete (fragments), containing unknown words and abbreviations. In such a noisy environment, with incomplete knowledge, it is impractical to work with perfect models and grammars. In this scenario, the design of MaTra represents a pragmatic approach to engineering a usable system, which aims to produce understandable output for wide coverage, rather than perfect output for a limited range of sentences.

MaTra achieves this by using a judicious mix of (i) corpus-based or statistical tools and techniques for shallow parsing, word-sense disambiguation, abbreviation handling, and transliteration, and (ii) rule-based techniques for lexical and structural transfer.

The structural transfer component has at its core a relatively simple and intuitive intermediate representation (which we call MSIR, for MaTra Structured Intermediate Representation) that can accommodate most types of sentences found in real-world texts. The simplicity and generality of the MSIR is an important aspect of our approach. Its level of detail, both syntactic and semantic, is just enough to capture the divergence between English and Hindi.
In keeping with our engineering outlook, MaTra does not attempt either a deep parse or an elaborate semantic analysis of the English sentence. The parsing strategy is hybrid: a combination of statistical and rule-based techniques. The parsing component is modular by design, with each stage working on a logically separate aspect of structural analysis. One of the primary goals of the design is graceful degradation from full sentence structures all the way down to word-by-word structures. The design of the MSIR and the parsing algorithm is the focus of this paper.

This paper is organized as follows: the next section presents the overall architecture of MaTra and describes the various components briefly. Section 3 discusses the intermediate representation used by MaTra (MSIR). The parsing algorithm used to obtain the MSIR is presented in section 4. Section 5 provides evaluation details. Section 6 concludes the paper.

Fig 1: MaTra Architecture
Sentence    → IndepClause | Conjunct | Fragment | MixedBag
Fragment    → FragClause | Conjunct
Conjunct    → Word, Sentence+
IndepClause → VerbFrame, Modifier*
VerbFrame   → VerbGroup, Subject*, Object*, Modifier*
VerbGroup   → Word+
FragClause  → FragFrame, Modifier*
FragFrame   → VerbGroup*, Subject*, Object*, Modifier*
Subject     → NounPhrase | Sentence, Modifier*
Object      → NounPhrase | Sentence, Modifier*
Modifier    → PrepPhrase | DepClause
DepClause   → Word*, FragFrame, Modifier*
NounPhrase  → Word+
PrepPhrase  → Preposition, NounPhrase
MixedBag    → (Word | NounPhrase | PrepPhrase | DepClause | Fragment | IndepClause)+

Fig 2: Grammar for the MaTra Structured Intermediate Representation (MSIR)

2 MaTra Architecture

Essentially, MaTra follows a structural and lexical transfer approach, using semantic information only when required. Figure 1 shows the overall architecture of the system.

The preprocessing component splits the input text into sentences. Abbreviations, acronyms, dates, numeric expressions, etc. are identified at this stage. Part-of-speech tagging and chunking are done using the fast Transformation Based Learning tool, fnTBL [Ngai and Florian, 2001]. The word sense disambiguation component then chooses the appropriate Hindi mapping for English words and phrases. Next, the sentence-structuring component converts the chunked input into the MSIR, which facilitates Hindi generation. A rule-based generation engine [Rao et al., 1998; Rao et al., 2000; Mehta and Rao, 2003] generates the Hindi translation from the MSIR using rule-bases for preposition-mapping, noun-inflection, verb-inflection, and structural transformation. The transliteration component handles unknown words; it uses Genetic Algorithms to learn transliteration rules from a pronunciation dictionary [Mishra, 2004].
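As an illustration of the data flow just described, a minimal sketch of the pipeline is given below. Every function in it is a trivial placeholder written only for exposition; it is not MaTra's code (the real system uses fnTBL for tagging and chunking, rule-bases for transfer and generation, and GA-learnt transliteration rules).

# Illustrative sketch of the MaTra pipeline; all stages are placeholders.
from typing import List, Tuple

Chunk = Tuple[str, str, str]                       # (word, POS tag, chunk label)

def preprocess(text: str) -> List[str]:
    # Sentence splitting; MaTra also identifies abbreviations, acronyms,
    # dates and numeric expressions at this stage.
    return [s.strip() for s in text.split(".") if s.strip()]

def pos_tag_and_chunk(sentence: str) -> List[Chunk]:
    # Placeholder for the fnTBL tagger/chunker.
    return [(word, "NN", "NP") for word in sentence.split()]

def disambiguate(chunks: List[Chunk]) -> List[Chunk]:
    # Placeholder for word sense disambiguation (choosing Hindi mappings).
    return chunks

def build_msir(chunks: List[Chunk]) -> dict:
    # Placeholder for the sentence-structuring component (section 4).
    return {"IndepClause": chunks}

def generate_and_transliterate(msir: dict) -> str:
    # Placeholder for rule-based generation plus transliteration of
    # words not found in the lexicon.
    return " ".join(word for word, _, _ in msir["IndepClause"])

def translate(text: str) -> List[str]:
    return [generate_and_transliterate(build_msir(disambiguate(pos_tag_and_chunk(s))))
            for s in preprocess(text)]

print(translate("Salim was sent to jail."))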
3 MaTra Structured Intermediate Representation (MSIR)

The core part of our transfer-based framework for translation deals with the transfer of a single clause, which is adequate for handling mono-finite sentences [Rao et al., 2000]. This has then been systematically extended to handle multi-finite sentences and non-finite clauses, thus covering compound and compound-complex sentences [Mehta and Rao, 2003], and further, to fragments and incomplete sentences. However, the framework, as of now, does not handle imperative and interrogative sentences. Figure 2 shows the grammar for the MSIR based on this framework.

In this section, we first discuss how various types of clauses are represented in the MSIR. Then, we look at the MSIR interpretations of Subjects, Objects, and Modifiers, which are simplifications of the traditional definitions. Finally, we look at how incomplete sentences and fragments are represented in the MSIR.

3.1 Clauses

A clause is the basic unit of predication in any language, and forms the basis of the MSIR too. A clause consists of a single verb group, which represents an action, event or state change. The verb group may consist of one or more verbs, including auxiliaries and pre-modifying adverbs, and may be finite or non-finite. Every verb has certain sub-categorization features, which define the number and nature of other constituents that attach with the verb to form the clause, namely, the subject, objects, complements, and modifiers. Each clause modifier may itself be a separate clause. This recursion can be used to create sentences with a hierarchy of clauses.

Clauses may be classified, based on the types of verbs that they contain, into the following three categories:

• Finite clauses: clauses containing a finite verb phrase. E.g., Jayshankar has visited Delhi
• Non-finite clauses: clauses containing a non-finite verb phrase but no finite verb phrase. E.g., Having visited Delhi, Jayshankar …
• Verbless clauses: clauses with no verb phrase. E.g., Jayshankar, then at Delhi, …
Clauses are represented in the MSIR as either independent or dependent clauses. In our representation, independent clauses are always finite, whereas dependent clauses may be finite, non-finite or verbless.

Sentences are classified as Simple, Complex and Compound based on the types of clauses that they contain. Simple sentences contain a single independent clause; complex sentences contain one independent and at least one dependent clause connected by subordinators; and a compound sentence contains more than one independent clause connected by coordinating conjunctions.

In the following three subsections, we look at how the MSIR represents simple, complex and compound sentences respectively.

3.1.1 Representing an Independent Clause – Simple Sentences

The fundamental unit that represents a single independent clause is a frame (VerbFrame) that captures:

(i) the action that is conveyed by the VerbGroup, which includes a single verb group and auxiliaries,
(ii) the entities that are involved in the action (Subject and Object; see subsection 3.2), and
(iii) any adverbials (Modifier; see subsection 3.3).

Simple sentences, which have a single independent clause, are represented by an IndepClause, which in turn contains a VerbFrame.

3.1.2 Representing Dependent Clauses – Complex Sentences

The DepClause modifier, when combined with an IndepClause, allows us to represent complex sentences (see Figure 3).

DepClause differs from IndepClause in the following ways: (i) VerbFrame is replaced by FragFrame to indicate the fact that the verb group is optional, and (ii) it allows a subordinator (Word) in front of the FragFrame; subordinators are words such as wh-words, although, because, etc.

3.1.3 Representing Compound Sentences

Compound sentences are represented by connecting clauses with a Conjunct. Conjuncts include and, but, or, etc., and also correlative conjunctions such as not only-but also and if-then (see Figure 4).

3.2 Subject and Object

Subject and Object in our representation have a more general interpretation than usual. Subjects are phrases (NounPhrase) or clauses with nominal function, which occur before the verb group (VerbGroup) in the sentence. Similarly, Objects are phrases or clauses with nominal function, which occur after the VerbGroup in the sentence. Thus, complements are also represented as Subjects or Objects. These are simply syntactic entities. We do not attempt to determine semantic roles here.

Subjects and Objects may themselves be clauses. Figure 5 shows such an example.

3.3 Modifiers

Modifiers can be attached to any component of a clause (Modifier within Subject, Object, VerbFrame, and FragFrame) or to a complete clause as a whole (Modifier within IndepClause and DepClause). This recursive nesting of clauses can be used to represent complexity of any arbitrary level.

For simplicity, we club all modifiers except prepositional phrases as DepClause, in which phrases are represented as verbless clauses. Prepositional phrases are represented using PrepPhrase.

3.4 Representing Incomplete Sentences and Fragments

We define Fragment to represent incomplete sentences which end at chunk boundaries. This is used to handle captions, titles, news headlines, etc., which are usually not full sentences.

Incomplete sentences that do not end at chunk boundaries usually cannot be parsed in full by the sentence-structuring algorithm. MixedBag is a fallback option that allows us to represent such structures. Any identified clause will be structured in the usual manner, while other parts of the sentence will be represented as chunks (see Figure 6).
Fig 3: MSIR representation for a sentence containing a dependent clause: Salim, who was the prince, was sent to jail

Fig 4: MSIR representation for a compound sentence: If you want a healthy baby, you must eat well

Fig 5: MSIR representation for a sentence containing a clause as a subject: To be the greatest batsman was his dream

Fig 6: MSIR representation for a fragment not ending at a chunk boundary: Companies are using the Globus Toolkit to develop.

The MSIR is a simple representation, which is reasonably easy to attain for most sentence types, using a straightforward parsing algorithm as discussed in the next section. This representation is admittedly rather coarse, and supplies much less syntactic and semantic information to the Hindi generation component than most existing grammars. As described, many categories of sentence elements have been clubbed together for simplicity of representation and parsing. For instance, except for prepositional phrases, modifiers are not distinguished, and complements are clubbed with Subject or Object. Semantic roles are largely ignored. However, based on a preliminary evaluation of MaTra, it is our contention that most of these details are not essential to English-Hindi indicative MT (see section 5 for evaluation details). We are working on larger-scale and more thorough evaluation to substantiate this claim.

Hindi generation from the MSIR (structural and lexical transfer) is described in detail in [Rao et al., 1998; Rao et al., 2000; Mehta and Rao, 2003].
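Purely as an illustration of the grammar in Figure 2, the MSIR could be realized as a tree of typed nodes along the following lines. The class layout and field names are our own assumptions, not MaTra's internal data structures, and Fragment and MixedBag are omitted for brevity.

# Illustrative sketch of the MSIR as typed tree nodes (cf. Fig 2).
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class NounPhrase:
    words: List[str]

@dataclass
class PrepPhrase:
    preposition: str
    noun_phrase: NounPhrase

@dataclass
class VerbGroup:
    words: List[str]

Modifier = Union[PrepPhrase, "DepClause"]

@dataclass
class Subject:
    head: Union[NounPhrase, "Sentence"]
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class Object:
    head: Union[NounPhrase, "Sentence"]
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class VerbFrame:
    verb_group: VerbGroup
    subjects: List[Subject] = field(default_factory=list)
    objects: List[Object] = field(default_factory=list)
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class DepClause:
    subordinator: List[str]          # e.g. wh-words, "although", "because"
    frag_frame: VerbFrame            # the verb group is optional in a FragFrame
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class IndepClause:
    verb_frame: VerbFrame
    modifiers: List[Modifier] = field(default_factory=list)

Sentence = Union[IndepClause, "Conjunct"]   # Fragment and MixedBag omitted

@dataclass
class Conjunct:
    word: str                        # coordinating conjunction
    sentences: List[Sentence] = field(default_factory=list)

# "Salim, who was the prince, was sent to jail" (cf. Fig 3), roughly:
salim = IndepClause(
    verb_frame=VerbFrame(
        verb_group=VerbGroup(["was", "sent"]),
        subjects=[Subject(
            head=NounPhrase(["Salim"]),
            modifiers=[DepClause(
                subordinator=["who"],
                frag_frame=VerbFrame(
                    verb_group=VerbGroup(["was"]),
                    objects=[Object(NounPhrase(["the", "prince"]))]))])],
        modifiers=[PrepPhrase("to", NounPhrase(["jail"]))]))

print(salim.verb_frame.verb_group.words)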
4 The Parsing Algorithm

The source sentence is first processed for punctuation, numerals, contractions, acronyms, and other abbreviations. The sentence is then chunked. Since the clause is the basis of the MSIR, the various clauses are then identified; this is done by the clause boundary detection component. Each clause is then structured independently as discussed in section 3.1. The MSIR for the whole sentence is finally obtained by putting these structured clauses together in a manner that adheres to the MSIR grammar.

Thus, the parsing stage can be divided into three major components: (i) the POS tagging and chunking component, (ii) the clause boundary detection component, and (iii) the sentence-structuring component.

4.1 POS Tagging & Chunking

POS tagging and chunking are done using the fnTBL POS tagger and chunker [Ngai and Florian, 2001], which is based on rules learnt using Transformation Based Learning (TBL) on the Wall Street Journal corpus. The POS tagger chooses the most probable part-of-speech tag (from the UPenn tagset) for each word. The text chunking component of fnTBL groups the words into one of Noun, Verb, Adjective, and Preposition groups.

The following is an example of fnTBL output:

[Salim NNP NP] [, , -] [who WP NP] [was VBD V] [the DT prince NN NP] [, , -] [was VBD sent VBN VP] [to TO PP] [jail NN NP]

4.2 Clause Boundary Detection

The clause boundary detection component works on the chunked sentence. It identifies the various clauses present in a well-formed sentence based on verbs and clause boundary markers called sentinels (who, whom, which, etc.) [Narayana Murthy, 1996], as described below:

Begin
  if verb not present
    label as Fragment
  End if
  if sentence is well-formed
    for each coordinating conjunction
      label left clause of coordinating conjunction as IndepClause
    End for
    label the rest of the sentence as IndepClause
    for each sentinel
      extract subordinate clause and label as DepClause
      mark the position where the subordinate clause was originally present
    End for
  End if
End

The following is an example of clause boundary output:

{[Salim NNP NP] [, , -] DepClause#1 [, , -] [was VBD sent VBN VP] [to TO PP] [jail NN NP]}
{DepClause#1 = [who WP NP] [was VBD VP] [the DT prince NN NP]}

4.3 Sentence Structuring

The last stage of parsing is the sentence-structuring component. This takes the clause boundaries and builds the MSIR from them.

The algorithm for structuring the clauses and putting these structured clauses together to obtain the MSIR is described below.

SentStruct()
Begin
  For each clause
    ClauseStruct()
  End For
  If no well-formed combination of clauses and phrases
    Create a tree with an unstructured collection of clauses and independent phrases
  Else
    If Conjunct present
      Root ← Conjunct
    End If
    Attach each structured Fragment or IndepClause to Root/Conjunct
    Attach dependent clauses to their respective IndepClauses
    Attach clause modifiers to their respective clauses
  End If
End

Each clause (with a single verb group) is then structured in the following way:

ClauseStruct()
Begin
  If not Fragment
    Attach the verb group to the root
    Search and attach the noun group occurring before the verb group as Subject
    Search and attach prepositional/noun groups as modifiers to the Subject
    Search and attach noun groups occurring after the verb group as Object
      (the ordering determines whether the object is direct or indirect)
    For each object
      identify and attach prepositional groups as modifiers to the object
    End For
    Attach the remaining groups as modifiers to the verb group
  Else
    If verb group present
      Root ← verb group
      Search and attach verb group modifiers
      Search and attach noun group as object
      Search and attach prepositional groups as modifiers
    Else
      Root ← object
      Search and attach noun group as object
      Search and attach prepositional groups as modifiers
    End If
  End If
End
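Purely for illustration, the following runnable sketch captures the flavour of the sentinel-based clause boundary detection of section 4.2. The chunk format, sentinel list, and comma-based clause-end heuristic are our own simplifications and do not reproduce the actual component, which also handles coordinating conjunctions, fragments and nested clauses.

# Simplified sketch: extract a dependent clause starting at a sentinel and
# ending at the next comma, leaving a placeholder in the main clause.
from typing import List, Tuple

Chunk = Tuple[str, str, str]                 # (word, POS tag, chunk label)
SENTINELS = {"who", "whom", "which", "that", "whose", "where", "when"}

def detect_clauses(chunks: List[Chunk]):
    main: List = []
    dep_clauses: List[List[Chunk]] = []
    i = 0
    while i < len(chunks):
        word = chunks[i][0]
        if word.lower() in SENTINELS:
            # Collect the subordinate clause: sentinel up to the next comma.
            j = i
            clause: List[Chunk] = []
            while j < len(chunks) and chunks[j][0] != ",":
                clause.append(chunks[j])
                j += 1
            dep_clauses.append(clause)
            main.append(f"DepClause#{len(dep_clauses)}")   # mark original position
            i = j
        else:
            main.append(chunks[i])
            i += 1
    return main, dep_clauses

# "Salim , who was the prince , was sent to jail" in chunked form:
sentence = [("Salim", "NNP", "NP"), (",", ",", "-"),
            ("who", "WP", "NP"), ("was", "VBD", "VP"),
            ("the", "DT", "NP"), ("prince", "NN", "NP"), (",", ",", "-"),
            ("was", "VBD", "VP"), ("sent", "VBN", "VP"),
            ("to", "TO", "PP"), ("jail", "NN", "NP")]

main, deps = detect_clauses(sentence)
print(main)   # main clause with DepClause#1 marking where the clause was
print(deps)   # [[('who', ...), ('was', ...), ('the', ...), ('prince', ...)]]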
The parsing algorithm currently does not support imperative and interrogative sentences. Also, the implementation for non-finite clauses is not complete.

5 Evaluation Strategy & Results

A preliminary test set of 315 sentences has been created by selecting sentences from news archives and other sources. This test set is in two parts.

The first part contains declarative sentences and fragments exhibiting certain basic grammatical phenomena, for example, sentences with: (i) different types of clauses in various roles, (ii) different types of phrases, and (iii) all tense-aspect-modality combinations. Evaluation shows that the MSIR and parsing algorithm can handle such sentences well.

The second part includes sentences and fragments with (i) interrogative, subjunctive, and imperative mood, (ii) phrasal verbs, (iii) elliptical clauses, (iv) idioms, etc. This part is intuitively more difficult for MT. We are currently working on extending the MSIR and parsing algorithm to accommodate these phenomena.

The complete test set of 315 sentences was used to evaluate the translations produced by MaTra. This serves as an indication of the effectiveness of the MSIR, parsing algorithm, and Hindi generation component.

BLEU [Papineni et al., 2002] and NIST [Doddington, 2002] evaluations were done using the NIST evaluation toolkit (http://www.nist.gov/speech/tests/mt/resources/scoring.htm). One reference translation was used for each sentence. N-grams (up to 4-grams) were matched after stemming.

Table 1 shows the scores for the current version of MaTra (called MaTra2) and the previous version. The previous version, in addition to imperative and interrogative sentences, also did not have support for fragments and incomplete sentences. Non-finite clauses were also not supported. The scores suggest that substantial improvement has been made in MaTra2 over the previous version.

        MaTra (Feb 05)   MaTra2 (Mar 06)
BLEU    0.0377           0.0534
NIST    2.1261           3.1494

Table 1: BLEU and NIST scores

Manual inspection indicates that there are differences in lexical choice between the reference and system translations. Structural differences also exist due to the free word-order of Hindi. These are partly the reason for the low absolute scores. To improve the automatic evaluation process, we are increasing the number of reference translations for BLEU/NIST. We are also looking at other automatic evaluation strategies that perform matching of linguistic phrases rather than n-grams. Such a strategy would perhaps be more suited for a free word-order language like Hindi.
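As a side note, the following is a minimal sketch of the single-reference, up-to-4-gram matching that BLEU performs, computed here with NLTK. The toolkit choice, tokenization, and example strings are our own illustrative assumptions; unlike the evaluation above, no stemming is applied.

# Minimal illustration of single-reference BLEU scoring (1- to 4-grams),
# using NLTK as a stand-in for the NIST scoring toolkit mentioned above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["salim", "ko", "jail", "bheja", "gaya", "tha"]   # one reference translation
hypothesis = ["salim", "ko", "jail", "bheja", "gaya"]         # system output (invented example)

# Equal weights over 1- to 4-gram precision; smoothing avoids a zero score
# when some higher-order n-grams do not match.
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.4f}")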
Subjective evaluation was performed by two native Hindi speakers, who are also proficient in English. These judges were able to identify most transliterated words, which may not be true for a person not knowing English. They graded translations on the following 4-point scale [Sumita et al., 1999]. Translations marked as one of A, B, and C can be considered acceptable:

A) Perfect: no problems in either information or grammar
B) Fair: easy to understand, with some unimportant information missing or flawed grammar
C) Acceptable: broken but understandable with effort
D) Nonsense: important information has been translated incorrectly

Grade          % of sentences
A              12.7 %
(A + B)        37.1 %
(A + B + C)    65.4 %

Table 2: Subjective evaluation scores

Table 2 shows the results of the subjective evaluation. Though the number of perfect translations is low (12.7%), it is highly encouraging that more than 65% of the translations were rated as acceptable.

6 Conclusion

In this paper, we have described a practical strategy for designing an English-Hindi machine translation system. The system is geared towards translating real-world, general-purpose texts with a high level of noise, such as those on the Web. The design of the system represents a pragmatic approach to engineering a usable system, which aims to produce understandable output for wide coverage, rather than perfect output for a limited range of sentences. The design is based on an intuitive intermediate representation (MSIR) that especially simplifies the parsing algorithm, as described in the paper. The approach allows convenient enhancement of the linguistic capabilities of the translation system, while making it possible for us to produce acceptable translations even as the system evolves.

Though at an early stage of implementation, preliminary evaluation of the system is very encouraging: more than 65% of translations were rated as acceptable by human judges. The paper also reports automatic evaluation results using BLEU and NIST.

Future work will look at handling interrogative, subjunctive and imperative moods, phrasal verbs, idiomatic usages, etc. Larger-scale and component-wise evaluation of the system is also being planned.

References

Doddington, G. 2002. Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics. Proceedings of the Second International Conference on Human Language Technology (HLT), 2002.

Mehta, V. and Rao, D. 2003. Natural Language Generation of Compound-Complex Sentences for English-Hindi Machine Aided Translation. Proceedings of the Symposium on Translation Support Systems (STRANS), 2003.

Mishra, A. 2004. Discovering Rules for Transliteration from English to Hindi: A Genetic Algorithms Approach. MCA thesis, SSIT Orissa, 2004.

Mohanraj, K., Hegde, J., Dogra, N., and Ananthakrishnan, R. 2003. The MaTra Lexicon. Technical Report, CDAC Mumbai, 2003.

Narayana Murthy, K. 1996. Universal Clause Structure Grammar. Ph.D. thesis, Dept. of Computer and Information Sciences, University of Hyderabad, 1996.

Ngai, G. and Florian, R. 2001. Transformation-Based Learning in the Fast Lane. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.

Rao, D., Bhattacharya, P., and Mamidi, R. 1998. Natural Language Generation for English to Hindi Human-Aided Machine Translation. Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS), 1998.

Rao, D., Mohanraj, K., Hegde, J., Mehta, V., and Mahadane, P. 2000. A Practical Framework for Syntactic Transfer of Compound-Complex Sentences for English-Hindi Machine Translation. Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS), 2000.

Sumita, E., Yamada, S., Yamamoto, K., Paul, M., Kashioka, H., Ishikawa, K., and Shirai, S. 1999. Solutions to Problems Inherent in Spoken-Language Translation: The ATR-MATRIX Approach. Proceedings of MT Summit VII, 1999.