Applying Oxford-PWN English-Polish Dictionary to Machine Translation

Jassem Krzysztof

Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Poznań
[jassem@amu.edu.pl]

The paper reports on the process of converting a large human-readable English-Polish dictionary, further abbreviated as OPEP (Oxford-PWN English-Polish), to a format applicable to machine translation. The report concludes with an evaluation of the Translatica system, which uses the converted data in transfer-based translation.

1. Translatica system
The Translatica system originates from the PolEng project, developed in 1996-2002 at Adam Mickiewicz University (AMU) in Poznań. PolEng developed a transfer-based system translating texts from Polish into English, with the lexicon based on Internet texts. Translatica expands the PolEng capabilities with translation in the reverse direction and with a broader lexicon obtained from the contents of OPEP.
The main features of PolEng inherited by Translatica are:
• bottom-up parsing based on the CYK algorithm
• phrasal structure representation
• Perl-like formalism of transfer and synthesis rules.
It is worth noting that the same formalism for description is used for both directions.
The formalism for the description of grammars is a kind of CFG. The authors tried to use available resources to describe the English grammar, e.g. AGFL (2002), but it turned out that these would hardly comply with the translation engine already developed (see Graliński (2002) for details on the engine). In order to use the engine, we had to compose the grammar rules ourselves.
According to the agreement between the authors of PolEng and the authors of OPEP, the lexicon for the English-to-Polish direction should be based on the OPEP contents. The paper reports the work that was done in 2003 in order to adapt the OPEP contents to Translatica, the PolEng successor.

2. Building a lexicon for an MT system with the Polish language
Two approaches to building an MT lexicon are mainly discussed in the literature. Pinkham and Smets (2002) distinguish between "HanC systems", based on a "Hand-crafted Dictionary", and "Lead systems", based on a "Learned Dictionary". Systems of the first type use traditional bilingual dictionaries to determine the transfer between word senses, whereas in "Lead systems" the transfer part of the dictionary is trained on bilingual text corpora. The authors of the publication show the advantages of the "Lead approach", particularly for new pairs of languages and limited development time.
By contrast, the experiments of Ilaraza, Mayor, and Sarasola (2001) demonstrate the advantages of building an MT lexicon on the basis of large traditional dictionaries. The authors compare the translation produced by a system that uses a raw bilingual dictionary to that given by a system whose dictionary is a merge of a Basque lexical database and the Morris bilingual English-Basque dictionary. The authors state that the output of the latter system is distinctly better.
Another dichotomy is mentioned by Baldwin, Hutchinson and Bond (1999). On the basis of English-Japanese translation, the authors compare systems that store entries as source/target language pairs to those which consider both languages separately. In the source/target approach "a word has as many senses as it has translation equivalents". In the latter approach sense distinctions are specific to each language, which in the authors' opinion is "more cognitively justifiable". The main argument for the "monolingual" approach in MT is that the decision on sense disambiguation may be postponed until the process of generation, whereas in the "bilingual approach" the semantic constraints on the source side are used for disambiguation in both source analysis and transfer. Another argument
against "bilingual" dictionaries is their uni-directionality.
One of the main criteria for choosing the type of lexicon for an MT application is the availability of resources. Because of the scarcity of aligned Polish-English bi-texts, the "Lead approach" lost its main benefit for our purposes, rapid deployment: in order to use the approach it would be necessary to build aligned corpora (the same reason determined the choice of the transfer method for the translation, rather than a corpus-based one). On the other hand, our group had quite large traditional dictionaries at its free disposal: the OPEP English-Polish dictionary and a few Polish dictionaries mentioned in section 4. The situation called for the "HanC approach".
The decision whether the entries should be bilingual or monolingual was to a large extent determined by the conditions of the agreement between the PolEng group and PWN. The dictionary publishers (as well as the authors of the system) wanted the system to use the linguistic knowledge included in the dictionary material to the maximum extent. The system dictionary should mirror the OPEP material as closely as possible. This called for the bilingual description. The uni-directionality was not a counterargument either, as the Polish-to-English part had already been developed.

3. Main goals
The main goals set for the conversion process were:
• to lose as little information as possible from OPEP
• to extract and formalize syntactic and semantic information given in OPEP
• to supplement the data with all information necessary for transfer-based machine translation, in accordance with the Translatica algorithm.

4. Resources
Before the start of the work, the group consolidated the following resources:
1. Lexical resources at free disposal:
   • OPEP in the electronic form, XML format
   • PolEng lexicon (Polish-to-English)
   • Polish lexicon of inflected forms delivered by PWN – the lexicon developed by the organization of Polish Scrabble players, further referred to as the scrabble dictionary
   • other dictionaries of Polish published by PWN, e.g. Bańko (2000)
   • lists of entries (e.g. proper nouns) from PWN encyclopedias.
The PolEng lexicon comprises some syntactic information that could prove useful for the other direction; the scrabble dictionary handles Polish inflection exhaustively; Bańko (2000) includes some information on syntactic features of entries, in a form well suited for computer processing.
2. Lexical resources at limited disposal (via Internet), e.g.
   • Merriam-Webster Dictionary (www.m-w.com)
   • Internet English-Polish dictionary (www.dict.pl)
These and other dictionaries available on-line helped the lexicographers understand the meaning of some entries or suggested alternative equivalents.
3. Text corpora concordancers:
   • British National Corpus (http://sara.natcorp.ox.ac.uk)
   • Collins Cobuild Corpus (http://www.cobuild.collins.co.uk/form.html)
   • WebCorp (http://www.webcorp.org.uk/index.html)
   • PolEng Internet corpus – the corpus of Polish Internet texts collected while working on the Polish-English translation.
Consulting such tools helped to verify syntactic features of words; such information is not given exhaustively in OPEP.
WordNet has proved to play a key role in assigning semantic values. The semantic hierarchy used in Translatica is a subtree of the WordNet lattice.
4. Translation tools:
   • grammar description
   • syntactic-semantic parser
   • tools for transfer and synthesis.
Translation tools impose specific constraints on the type of information that should be stored in the dictionary.

5. Translatica dictionary formalism

The dictionary formalism of the Translatica system assumes one lexicon for each direction (source/target approach). For example, the description of an entry in the English-to-Polish direction gives the constraints under which a word (or a phrase) is translated into appropriate equivalents.

5. Basic problem – time limitations
As is shown in section 6, full and detailed conversion of a single entry consumes a lot of manual work (apart from computer work). The group could afford 27 man-months to accomplish the task of conversion (3 lexicographers, 9 months). The available time was not sufficient to manually elaborate each entry of OPEP (even after automatic pre-processing). The group adopted the following approach:
• function words should be described almost from scratch
• out of 55 900 entries in OPEP, ca 20 000 most frequent ones (according to BNC) should be converted automatically and then elaborated manually
• the rest of the dictionary should be converted only automatically
• errors resulting from manual and automatic conversion should be corrected semi-automatically.

6. Processing of the dictionary
The process of dictionary conversion involved the following stages:
• automatic conversion of dictionary information
• automatic morphological description
• manual description/verification of the 20 000 most frequent lexemes
• semi-automatic correction of errors
• manual correction of errors found while testing the translation system.

6.1. Automatic conversion of dictionary information
Mayfield and McNamee (2002) present an interesting idea that aims at simplifying the conversion of bilingual dictionaries from human-readable to computer-readable form. They have created a language called ABET (APL Bidict Extraction Tool) that allows for automating "the processes that are the same across most extraction tasks". At the time their paper was written the authors had converted 50 MB of on-line "bidicts" of varying formats, and the longest ABET script they needed consisted of a mere twenty-four lines.
Although we think that a language like ABET may prove beneficial in dealing with more than one dictionary of a simple format, we do not think that the idea would work for traditional off-line dictionaries that include deep linguistic knowledge, rarely given in a systematic way. In our attempt to convert the dictionary we came across so many specific "little problems" that it is hard for us to imagine that any generalizing language could be of much help.
The script used for the conversion job was written in Perl. This section lists the major problems in the conversion (and does not mention those "little nuisances" which are not amenable to description in a higher-level language).
The task consisted of the following steps:

Separating entries
In the approach suggested in Mayfield and McNamee (2002) the macro HEADER-FIND is responsible for separating entries. The macro identifies headers according to the specification of the dictionary. Such an approach would not solve the problems that we encountered in separating entries in OPEP:
• Some entries have references to other entries. The entry upon is described in OPEP only with a reference to the entry on (i.e. upon = on). In such a situation the reference was forwarded (the idea of reference is used in the Translatica dictionary as well). Quite a few entries have a reference in only one of their senses. An example is brainstorm, whose informal meaning is described as equal to brainwave (whereas its other senses are described separately). Such a situation requires different treatment ("copy and paste inside"). Another type of reference concerns entries which are word forms of other entries. The entry bidden has a reference to bid as its whole description, whereas the entry built has a reference (to build) in one sense as well as its own set of equivalents for other senses. Our decision was to discard entries like bidden from the dictionary and to discard only the referenced senses in entries like built.

• Entries may have graphical variants. We decided to separate such variants into multiple entries (the other variants received references to the first variant). There are two ways of denoting graphical variants in OPEP: one uses parentheses inside words, e.g. bias(s)ed, the other uses a comma, e.g. baldachin, baldaquin. The comma is not used consistently, however: in the entry births, marriages and deaths it obviously does not separate variants. To resolve this ambiguity automatically we assumed that variants of the same word must share the first letter. On the other hand, graphical variants that differ only in the capitalization of the first letter (e.g. balkanization, Balkanization) should be merged into one entry. The entry backwards has a "graphical alternative" backward, but backward itself constitutes a separate entry in OPEP (giving a reference to backwards!). In cases like this the entries should not be duplicated during conversion.
• Some words are indexed and treated as separate entries in OPEP (e.g. billet as billet1 /an order/ and billet2 /of wood/) although they are the same part of speech. We merge such entries into one (with separate equivalents).
• There are entries that lack direct equivalents (e.g. behalf); the equivalents are given only for phrases that include them. Since we require each entry to have at least one non-empty equivalent, such entries must be automatically tagged and decided individually: either the entry should be removed (only lexical phrases should remain) or an artificial equivalent should be found for the entry.
• Phrasal verbs need special treatment. We decided to separate entries that are described in OPEP as phrasal verbs from simple verbs. This approach requires considerable caution. For example, the phrasal verb get out of is described in OPEP as a separate phrasal verb, but some important uses of the multiword get out of are also mentioned in the description of the phrasal verb get out, and in the description of the simple verb get. To make things worse, these uses partially overlap.
As will be seen in the sections to follow, the process of separating entries also takes place in the next phases of the conversion.

Automatic acquisition of attribute values
This phase consists in extracting from OPEP the values of attributes needed by the Translatica translation algorithm.
• The automatic procedure converts the OPEP inflectional description into the Translatica encoding. Some verbs in English have different inflected forms for different senses (e.g. speed, sped, sped as contrasted with speed, speeded, speeded). In such a case we decided to divide the entry into two.
• The syntactic information on complements has to be inferred from various parts of the OPEP description. Transitive verbs (coded as vt in OPEP) are treated in our approach as having an equivalent with a noun complement in the accusative in Polish. The complementation may partially be derived from PREP elements, which in the OPEP XML describe prepositional complements. Direct complements are listed in OPEP as OBJ elements for verbs and INDIC elements for nouns. Complementation may and should also be inferred from human-readable descriptions, e.g. from strings like on or about smth.
• Assigning semantic attributes from various tags in OPEP has been partially done automatically. For example, the values given in the element COLL, which describes the sense (e.g. for the word speech, the senses are: oration, faculty, language, subject), are automatically converted into the Translatica semantic hierarchy by means of a WordNet query.
• The attribute of context, which comprises "domain", "style" and "dialect" in Translatica, has to be inferred from various tags in OPEP, mainly from qualifiers.

Merging senses
This phase aims at reducing the number of equivalents for entries in order to simplify computations in the translation process.
• Some "second-best" equivalents are replaced: the sense they represent is discarded and the equivalent is inserted as a synonym of a "better" equivalent. A sense is converted into a synonym only if strict conditions are fulfilled: all attributes must have the same value (this automatic procedure may be undone during manual verification).
• It often happens that different senses have the same equivalents. For example, all
first five senses of the verb get in OPEP include the Polish equivalent dostać. In such a case we would like to merge the senses into one (according to the paradigm that in the Translatica dictionary a word has as many senses as it has equivalents). In order to perform the merging it was necessary to process the values of attributes cautiously, e.g. the value of the complementation attribute for the merged entry should include the complementation patterns of all merged senses.

Automatic conversion of lexical phrases
For most word senses the OPEP dictionary gives examples of usage as well as idioms that include the word sense. Both types of phrases were automatically copied into the list of idioms in the Translatica database. It was up to the lexicographers to distinguish between the two types and make appropriate decisions. Idioms in the Translatica dictionary are described with the same set of attributes as single words. Some syntactic and semantic values could be obtained automatically from OPEP in the same way as for single words (e.g. the human semantic value for an object denoted as sb). Still, there was more for the lexicographers to do with idioms than with single words.
Some idioms and their equivalents are denoted in OPEP by slash marks (e.g. a banking/an educational ~ system bankowy/edukacji). Such idioms needed to be separated automatically.

Cleaning up and adjusting
OPEP uses its own metalanguage, which has to be parsed during conversion. For example, the word or serves to indicate either alternative uses, alternative translations, or alternative complements. We decided to treat or in the following way in the conversion process: alternative uses are divided into separate idioms; of the alternative equivalents, all except the first one are discarded; alternative complements are merged. Similar treatment is used for the metaword also. The word beacon, in its third sense, has the following form in OPEP: (also: radio ~). This is converted into a new entry: radio beacon.
The usage of words in parentheses is disambiguated in this phase as well. Usually the parentheses denote optional occurrence, as in the entry bulgur (wheat). If the parentheses appear on the source side, two entries are generated (e.g. bulgur and bulgur wheat); if they appear on the target side, the second alternative is omitted.

6.2. Automatic morphological description
The morphological information for English words is given in OPEP. The main resources that contributed to determining the Polish inflection were the PolEng Polish-to-English dictionary and the PWN scrabble lexicon of inflected forms.

6.3. Manual verification/modification of data
The data produced by automatic conversion needed manual verification and modification. Verification revealed errors in the conversion process which could be corrected merely by adjusting the conversion procedures. Some linguistic aspects required human linguistic knowledge as well as consulting sources other than OPEP.
An important factor was the organization of labor. As the lexicographic group consisted of three (occasionally four) lexicographers who were supervised by a coordinator, it was necessary to find a way in which linguistic data could be modified more than once and the labor could be distributed between persons. We decided to convert the data into a classical SQL database. This made it possible to keep the lexical database in order even if some of the work overlapped.
The main aspect (in view of transfer translation) which is not well handled by automatic conversion is complementation. OPEP delivers almost no explicit information on left-hand complements (specifiers) like prepositions (and their translations) that tend to precede specific nouns. This information had to be encoded manually on the basis of other resources (e.g. the PolEng dictionary). The right-hand complements are treated in OPEP more exhaustively, but many of them are given implicitly in examples of usage, e.g. we'll never get by without him is an example given in OPEP that should, for the translation needs, be treated as the phrasal verb get by complemented by a PP without sb.
The lexicographers were asked to query English corpora in order to check for complements that are not listed in OPEP. This was a hard task that required good knowledge
of English and could not be executed (semi)automatically.
Word senses in the Translatica dictionary are described semantically by means of a subtree from the WordNet semantic hierarchy. OPEP delivers some hints on the semantic values of words and of their complements, but it was up to the lexicographer to choose the most appropriate ones.
The lexicographers found processing idioms and phrases the most challenging. It was up to the lexicographers:
• to convert idioms and phrases into a canonical form
• to distinguish phrases from word usage examples so that only the former are included in the dictionary
• to determine the complementation of idioms (based on the lexicographers' intuition and on-line corpora)
• to describe admissible gaps in phrases
• to determine the syntactic role of a phrase (e.g. to determine that this morning should be treated as an adverb rather than a noun).

6.4. Semi-automatic correction
Before this stage several types of errors were present in the lexicon. The errors might have resulted from erroneous automatic conversion, mistakes of the lexicographers or changes to the description formalism during the project. Semi-automatic correction consisted in an automatic search for errors (spelling errors and syntactic errors, i.e. inconsistencies with the formalism) followed by manual correction.

7. The dictionary status
The status of the dictionary after the conversion process is the following:
• English to Polish: 73294 lexemes; 101878 inflected forms; 120805 Polish equivalents; 79971 lexical phrases
• Polish to English: 49989 lexemes; 974989 inflected forms; 54952 equivalents; 42596 lexical phrases.
It is worth noting that the OPEP dictionary was automatically reversed to enrich the Polish-to-English lexicon with one-to-one equivalents.

8. Evaluation
The evaluation of the Polish-to-English translation (PolEng) was carried out at the Allied Irish Bank (Dublin) in November 2003. At that time the English-into-Polish translation was not yet available. The lexicon included only entries developed manually from various traditional dictionaries (before the OPEP enrichment).
As the system provides add-ins for MS Office applications as well as a plug-in for Internet Explorer, the evaluation dealt with the types of documents specific to those applications. For each type the tester selected 7-15 documents from the banking domain and tried to evaluate the speed, the coverage of words/phrases and the accuracy. The testing was of the "black-box" type: the tester was not able to determine which component of the translation was responsible for erroneous translations. The tester used both relative and absolute evaluation methods. PolEng was compared to PolTrans, a translation system available on-line. Not surprisingly, PolTrans did better as far as speed is concerned, while PolEng scored higher in accuracy. The attempt at an absolute estimation was made in the following way: for each document the number of translated words was divided by the number of all words in the text, giving the word completeness. The tester then divided the texts into "phrases": whole sentences or parts of sentences. The number of phrases translated completely (i.e. with all words translated) was divided by the number of all phrases, giving the phrase completeness coefficient. For each phrase the accuracy of translation was rated as "accurate", "good", "moderate" or "illegible". The number of accurate and good translations divided by the number of all translated phrases gave the accuracy coefficient.
An extract from the evaluation document is presented below. Each range goes from the worst to the best translated document.

Type of document     Word completeness (%)    Phrase completeness (%)    Accuracy (%)
Word 97              95.87 – 99.07            72.45 – 98.84              68.37 – 93.02
Word 2000            95.05 – 99.63            72.45 – 98.84              59.56 – 93.02
Internet Explorer    73.78 – 98.09            46.15 – 94.23              33.85 – 81.69
PowerPoint           94.71 – 99.70            62.64 – 97.65              54.90 – 81.95
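
The three coefficients reported in the table can be computed as in the following minimal sketch. The `Phrase` record and the function names are hypothetical, and the sketch assumes that "translated phrases" in the accuracy coefficient means the phrases that were translated completely; the source does not state this explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Ratings that count towards the accuracy coefficient.
GOOD_RATINGS = {"accurate", "good"}


@dataclass
class Phrase:
    words: int              # number of words in the source phrase
    translated_words: int   # how many of them the system translated
    rating: Optional[str]   # "accurate", "good", "moderate" or "illegible"; None if unrated


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 for an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


def coefficients(phrases: List[Phrase]) -> Dict[str, float]:
    """Compute word completeness, phrase completeness and accuracy for one document."""
    total_words = sum(p.words for p in phrases)
    translated_words = sum(p.translated_words for p in phrases)
    # A phrase is complete when every word in it was translated.
    complete = [p for p in phrases if p.translated_words == p.words]
    # Assumption: accuracy is measured over completely translated phrases only.
    well_translated = [p for p in complete if p.rating in GOOD_RATINGS]
    return {
        "word_completeness": percent(translated_words, total_words),
        "phrase_completeness": percent(len(complete), len(phrases)),
        "accuracy": percent(len(well_translated), len(complete)),
    }
```

Running `coefficients` on every document of one type and taking the minimum and maximum of each value would give ranges of the kind shown in the table.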

Similar tests have not yet been done for English-to-Polish translation. At the time of writing, the quality of the translation in this direction "looks" distinctly lower. The main reason is that the transfer rules in that direction are less elaborated (less time was available). Other possible reasons are discussed in the Conclusions.

9. Conclusions
The Translatica group has reached the following conclusions:
1. Only a small part (smaller than expected) of the conversion of a human-readable dictionary into a machine-readable lexicon could be properly carried out fully automatically.
2. The quality of translation is not proportional to the completeness of the description in the dictionary. A larger number of equivalents (if they are not strongly constrained) often results in decreasing rather than increasing the standard of translation.
3. The dictionary description should be limited: the more information, the slower the translation.
4. WordNet as an entry point for semantic description proved helpful but it has two drawbacks: 1) the lattice structure assumed in WordNet is hard to deal with computationally (as opposed to the structure of a tree), 2) the semantic hierarchy needed for disambiguation between English and Polish rarely subsumes the WordNet hierarchy.
5. It proved very beneficial for the organization of labor to store the lexicon in the form of a standard SQL database.
Another conclusion concerns the translation algorithm itself. The transfer translation in the two directions differs strongly in one specific aspect: the problem of homography. In Polish, homography is quite rare because of rich inflection. An example of a homographic sentence is Przesłał (sent) mi (me) długi (long, debts) list (letter). It should properly be translated as He sent me a long letter (the subject is not obligatory in a Polish sentence), but it might as well be interpreted as A letter sent me debts (owing to the free order of components in a Polish sentence). This problem may be easily handled by semantic disambiguation: a letter is more likely to be an object than a subject of a send action. In translating from English, homography should be disambiguated as soon as possible (e.g. by statistical POS-tagging) in order to achieve good and robust translation.

References

AGFL (2002): http://www.usenix.org/events/usenix02/tech/freenix/full_papers/koster/koster_html/node4.html
Baldwin, T., Hutchinson, B. and Bond, F. (1999): A Valency Dictionary Architecture for Machine Translation. In: Eighth International Conference on Theoretical and Methodological Issues in Machine Translation: TMI-99, Chester, UK, pp. 207-217
Bańko, M. (2000): Inny słownik języka polskiego (A Different Dictionary of Polish), Wydawnictwo Naukowe PWN
Graliński, F. (2002): Wstępujący parser języka polskiego na potrzeby systemu POLENG (Bottom-up Parser of Polish designed for the POLENG System). In: Speech and Language Technology. Volume 6, Poznań 2002, [http://www.ceti.pl/~poleng/zasoby/publikacje/index.html]
Ilaraza, A. Diaz de, Mayor, A. and Sarasola, K. (2001): Building a lexicon for an English-Basque MT system from heterogeneous wide-coverage dictionaries. In: Proceedings of MT 2000, Machine Translation and Multilingual Applications in the New Millennium, University of Exeter, United Kingdom, 20-22 November 2000
Mayfield, J. and McNamee, P. (2002): Converting on-line bilingual dictionaries from human-readable to machine-readable form. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15, 2002, Tampere, Finland
OPEP (2002): Wielki Słownik Angielsko-Polski (English-Polish Dictionary), Wydawnictwo Naukowe PWN
Pinkham, J. and Smets, M. (2002): Modular MT with a learned bilingual dictionary: rapid deployment of a new language pair. In: Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, pp. 800-806.
