Applying Oxford-PWN English-Polish Dictionary to Machine Translation
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Applying Oxford-PWN English-Polish Dictionary to Machine Translation Jassem Krzysztof Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Poznań [jassem@amu.edu.pl] The paper reports on the process of converting 2. Building a lexicon for an MT system a large human-readable English-Polish with the Polish language dictionary further abbreviated as OPEP Two approaches for building an MT lexicon (Oxford-PWN English-Polish), to the format are mainly discussed in the literature. In applicable for machine translation. The report Pinkham and Smets (2002) the authors concludes with an evaluation of the distinguish between "HanC systems" – based Translatica system which uses the converted on "Hand-crafted Dictionary" and the "Lead data in transfer-based translation. systems" – based on "Learned Dictionary". Systems of the first type use traditional 1. Translatica system bilingual dictionaries to determine the transfer The Translatica system originates from the between word senses, whereas in "Lead PolEng project developed in 1996-2002 at systems" the transfer part of the dictionary is Adam Mickiewicz University (AMU) in trained on bilingual text corpora. The authors Poznań. PolEng developed a transfer-based of the publication show the advantages of the system translating texts from Polish into "Lead approach" – particularly for new pairs of English, with the lexicon based on Internet languages and limited time for development. texts. Translatica expands PolEng capabilities By contrast, the experiments of Ilaraza, by translation in the reverse direction and the Mayor, and Sarasola (2001) demonstrate the usage of broader lexicon obtained from the advantages of building an MT lexicon on the contents of OPEP. basis of large traditional dictionaries. The The main features of PolEng inherited by authors compare translation produced by the Translatica are: system that uses raw bilingual dictionary to tha bottom-up parsing based on the CYK given by the system whose dictionary is a algorithm merge of a Basque lexical database and the phrasal structure representation Morris bilingual English-Basque dictionary. Perl-like formalism of transfer and The authors state that the output of the latter synthesis rules. system is distinctly better. It is worth noting that the same formalism for Another dichotomy is mentioned by description is used for both directions Baldwin, Hutchinson and Bond (1999). On the The formalism for the description of basis of English-Japanese translation the grammars is a kind of CFG. The authors have authors compare the systems that store entries tried to use available resources to describe the as source/target language pairs to those which English grammar, e.g. AGFL (2002)but it consider both languages separately. In the turned out that they would hardly comply with source/target approach "a word has as many the elaborated translation engine (see Graliński senses as it has translation equivalents". In the (2002) for details on the engine). In order to latter approach sense distinctions are specific use the engine it was necessary for us to for each language, which in the authors' compose grammar rules ourselves. opinion is "more cognitively justifiable". The According to the agreement between the main argument for the "monolingual" approach authors of PolEng and the authors of OPEP, in MT is that the decision on the sense the lexicon for the English-to-Polish direction disambiguation may be postponed until the should be based on the OPEP contents. The process of generation whereas in the "bilingual paper reports the work that was done in 2003 approach" the semantic constraints on the in order to adopt OPEP contents to Translatica source side are used for disambiguation in both – the PolEng successor. source analysis and transfer. Another argument 98
against "bilingual" dictionaries is their uni- lexicon developed by the directionality. organization of Polish One of the main criteria for choosing Scrabble players, further the type of lexicon for an MT application is the referred to as the scrabble availability of resources. Because of the dictionary scarcity of aligned Polish-English bi-texts the other dictionaries of Polish "Lead approach" has lost its main benefit for published by PWN, e.g. Bańko our purposes: rapid deployment – in order to (2000) use the approach it would be necessary to build lists of entries (e.g. proper nouns) aligned corpora (the same reason has from PWN encyclopedias. determined the choice of the transfer method The PolEng lexicon comprises some used for the translation, rather than the corpus- syntactical information that could prove useful based one). On the other hand, our group have for the other direction; the scrabble dictionary had to free disposal quite large traditional handles Polish inflection exhaustively; Bańko dictionaries: OPEP English-Polish dictionary (2000) includes some information on and a few Polish dictionaries mentioned in syntactical features of entries, in the form well section 4. The situation called for the "HanC suited for computer processing. approach". 2. Lexical resources at limited disposal (via The decision whether the entries should be Internet), e.g. bilingual or monolingual was to large extend Meriam-Webster Dictionary (www.m- determined by the conditions of the agreement w.com) between the PolEng group and PWN. The Internet English-Polish dictionary dictionary publishers (as well as the authors of (www.dict.pl) the system) liked the system to use the These and other dictionaries available on-line linguistic knowledge included in the dictionary helped lexicographers understand the meaning material to the maximum extend. The system of some entries or suggest alternative dictionary should mirror the OPEP material as equivalents closely as possible. This called for the 3. Text corpora concordancers: bilingual description. The uni-directionality British National Corpus was not a counterargument either as the Polish- (http://sara.natcorp.ox.ac.uk) to-English part had already been developed. Collins Cobuild Corpus (http://www.cobuild.collins.co.uk/for 3. Main goals m.html) The main goals posed to the conversion WordCorp process were: (http://www.webcorp.org.uk/index.ht to lose as little information as possible ml) from OPEP PolEng Internet corpus – the corpus of to extract and formalize syntactic and Polish Internet texts collected while semantic information given in OPEP working with the Polish-English to supplement data with all translation. information necessary for transfer- Consulting such tools helped to verify based machine translation – in syntactic features of words – such information accordance with the Translatica is not given exhaustively in OPEP. algorithm. WordNet has proved to play a key role in assigning semantic values. The semantic 4. Resources hierarchy used in Translatica is a subtree of the WordNet lattice. Before the start of the work, the group 4. Translation tools: consolidated the following resources: grammar description 1. Lexical resources at free disposal: syntactic-semantic parser OPEP in the electronic form, XML tools for transfer and synthesis. format Translation tools impose specific constraints PolEng lexicon (Polish-to-English) on the type of information that should be Polish lexicon of inflected forms stored in the dictionary. delivered by PWN – the 5. Translatica dictionary formalism 99
The dictionary formalism of the the time the paper was written the authors had Translatica system assumes one lexicon for converted 50 MB of on-line "bidicts" of each direction (source/target approach). For varying formats and the longest ABET script example, the description of an entry in the they needed consisted of mere twenty-four English-to-Polish direction gives the lines. constraints under which a word (or a phrase) is Although we think that a language like translated into appropriate equivalents. ABET may prove beneficial in dealing with more than one dictionary of a simple format 5. Basic problem – time limitations we do not think that the idea would work for As is shown in section 6, full and detailed traditional off-line dictionaries that include conversion of a single entry consumes a lot of deep linguistic knowledge, rarely given in a man-work (apart from computer work). The systematic way. In our attempt to convert the group could afford 27 man-months to dictionary we have come across so many accomplish the task of conversion (3 specific "little problems" that it is hard to lexicographers, 9 months). The available time imagine for us that any generalizing language was not sufficient to manually elaborate each could be of much help. entry of OPEP (even after automatic pre- The script used for the conversion job processing). The group assumed the following was written in Perl. The section lists the major approach: problems in the conversion (and does not • function words should be described mention those "little nuisances" which are almost from scratch particularly not amenable to description in a • out of 55 900 entries in OPEP, ca 20 language of a higher level). 000 most frequent ones (according to The task consisted in the following BNC) should be conversed steps: automatically and then elaborated Separating entries manually In the approach suggested in Mayfield and • the rest of the dictionary should be McNamee (2002) the macro HEADER-FIND conversed only automatically is responsible for separating entries. The macro • errors resulting from manual and identifies headers according to the automatic conversion should be specification of the dictionary. Such approach corrected semi-automatically. would not solve the problems that we encountered in separating entries in OPEP: 6. Processing of the dictionary • Some entries have references to other The process of dictionary conversion involved entries. The entry upon is described in OPEP the following stages: only with the reference to the entry on (i.e. automatic conversion of dictionary upon= on). In such a situation the reference information was forwarded (the idea of reference is used in automatic morphological description the Translatica dicitionary as well). Quite a manual description/verification of 20 few entries have references only in one of the 000 most frequent lexemes senses. An example may be brainstorm, whose semi-automatic correction of errors informal meaning is described as equal to manual correction of errors found brainwave (whereas other senses are described while testing the translation system. separately). Such situation requires different treatment ("copy and paste inside"). Another 6.1. Automatic conversion of dictionary type of reference concerns entries which are information word forms of other entries. The entry bidden In Mayfield and McNamee (2002) the authors has a reference to bid as its whole description, present an interesting idea that aims at whereas the entry built has a reference (to simplifying the conversion of bilingual build) in one sense as well as its own set of dictionaries from human-readable to computer- equivalents for other senses. Our decision was readable form. They have created a language to discard the entries like bidden from the called ABET (APL Bidict Extraction Tool) dictionary and discard only referenced senses that allows for automating "the processes that in the entries like built. are the same across most extraction tasks". At 100
• Entries may have graphical variants. Automatic acquisition of attribute values We decided to separate such variants into This phase consists in extracting from OPEP multiple entries (tha other variants received the values of attributes needed by the references to the first variant). There are two Translatica translation algorithm. ways of denoting graphical variants in OPEP: • The automatic procedure converts one with the usage of braces inside words, e.g. OPEP flexional description into Translatica bias(s)ed, the other by means of a comma, e.g. encoding. Some verbs in English have baldachin, baldaquin (the comma is not used different inflected forms for different senses consistently, however, e.g. in the entry births, (e.g speed, sped, sped as contrasted to speed, marriages and deaths the comma obviously speeded, speeded). In such a case we decided does not separates variants – to solve that to divide an entry into two. disambiguation automatically we assumed that • The syntactical information on the variants of the same word must share the first complements should be concluded from letter). On the other hand graphical variants various parts of the OPEP description. that differ only with the first letter Transitive verbs (decoded as vt in OPEP) are capitalization (e.g. balkanization, treated in our approach as having an equivalent Balkanization) should be merged to one entry with a noun complement in accusative in only. The entry backwards has a "graphical Polish. The complementation may partially be alternative" backward but backward itself derived from PREP elements which in the constitutes a separate entry in OPEC (giving a OPEP XML describes prepositional reference to backwards!). In cases like this the complements. Direct complements are listed in entries should not be duplicated during OPEP as OBJ elements for verbs and INDIC conversion elements for nouns. Complementation may and • Some words are indexed and treated as should also be concluded from human-readable separated entries in OPEP (e.g. billet as billet1 descriptions, e.g. from strings like on or about /an order/ and billet2 /of wood/) although they smth. are the same parts of speech. We merge such • Assigning semantic attributes from entries into one (with separate equivalents). various tags in OPEP has been partially done • There are entries that lack direct automatically. For example the values given in equivalents (e.g. behalf) – the equivalents are the element COLL, which describes the sense, given only for phrases that include them. Since (e.g. for the word speech, the senses are: we demand each entry to have at least one non- oration, faculty, language, subject) are empty equivalent such entries must be automatically converted into Translatica automatically tagged and individually decided: semantic hierarchy by means of a WordNet either the entry should be removed (only query. lexical phrases should remain) or an artificial • The attribute of context, which equivalent should be found for the entry. comprises "domain", "style" and "dialect" in • Phrasal verbs need special treatment. Translatica should be concluded from various We decided to separate entries that are tags in OPEP, mainly from qualifiers.. described in OPEP as phrasal verbs from simple verbs. This approach requires Merging senses considerable caution. For example, the phrasal This phase aims at diminishing numbers of verb get out of is in OPEP described as a equivalents for entries in order to simplify separate phrasal, but some important uses of computations in the translation process. the multiword get out of are mentioned also in • Some "second-best" equivalents are the description of the phrasal verb get out, and replaced: the sense they represent is discarded in the description of the simple verb get. To and the equivalent is inserted as a synonym of make things worse, these uses partially a "better" equivalent. A sense is converted into overlap. a synonym only if strict conditions are As it will be seen in the sections to follow, the fulfilled: all attributes must have same value process of separating entries takes place also in (this automatic procedure may be undone the next phases of the conversion. during manual verification). • It often happens that different senses have the same equivalents. For example, all 101
first five senses the verb get in OPEP include (e.g. bulgur and bulgur wheat); if they appear the Polish equivalent dostać. In such a case we in the target side, the second alternative is would like to merge the senses into one omitted. (according to the paradigm that in Translatica dictionary a word has as many senses as it has 6.2. Automatic morphological description equivalents). In order to make the merging it The morphological information for English was necessary to cautiously process the values words is given in OPEP. The main resources of attributes, e.g. the value of complementation that contributed to determining the Polish attribute for the merged entry should include inflection were the PolEng Polish-to-English complementation patterns of all merged dictionary and the PWN scrabble lexicon of senses. inflected forms. Automatic conversion of lexical phrases 6.3. Manual verification/modification of For most word-senses the OPEP dictionary data gives examples of usage as well as idioms that include the word-sense. Both types of phrases The data produced by automatic conversion were automatically copied into the list of needed manual verification and modification. idioms in Translatica database. It was up to the Verification showed errors in the conversion lexicogrpahers to distingush between the two process – which could be corrected by mere types and make appropriate decisions. Idioms adjusting conversion procedures. Some in the Translatica dictionary are described linguistic aspects required human linguistic with the same set of attributes as single words. knowledge as well as consulting other sources Some syntactical and semantic values could be than OPEP. obtained automatically from OPEP in the same An important factor was the labor way as for single words (e.g. the human organization. As the lexicographic group semantic value for an object denoted as sb). consisted of three (occasionally four) Still, there was more to do for lexicographers lexicographers who were controlled by a co- with idioms than with single words. ordinator it was necessary to find the way in Some idioms and their equivalents are which linguistic data could be modified more denoted in OPEP by slash marks (e.g. a than once and the labor could be distributed banking/an educational ~ system bankowy/ between persons. We decided to convert the edukacji). Such idioms needed to be separated data into a classical SQL database. This idea automatically. made it possible to keep the lexical database in order even if some of the work overlapped. Cleaning up and adjusting The main aspect (in view of transfer OPEP uses its own metalanguage which translation) which is not well handled by should be parsed during conversion. For automatic conversion is complementation. example, the word or serves to indicate either OPEP delivers almost none explicit alternative uses, alternative translations, or information on left-hand complements alternative complements. We have decided to (specifiers) like prepositions (and their treat or in the following way in the conversion translations) that tend to precede specific process: alternative uses should be divided into nouns. This information should have been separate idioms, from the alternative encoded manually on the basis of other equivalents all except the first one should be resources (e.g. PolEng dictionary). The right- discarded, alternative complements should be hand complements are treated in OPEP more merged. Similar treatment is used for the exhaustively but many of them are given metaword also. The word beacon, in its third implicitly in examples of usage, e.g. we'll sense, has the following form in OPEP: (also: never get by without him is an example given radio ~). This should be conversed into a new in OPEP that should, for the translation needs, entry: radio beacon. be treated as a phrasal verb get by The usage of words in braces should complemented by a PP without sb. be disambiguated in this phase also. Usually The lexicographers were asked to the braces denote optional occurrence, like in query English corpora in order to check for the entry bulgur (wheat). If the braces appear complements that are not listed in OPEP. This at the source side, two entries are generated was a hard task that required good knowledge 102
of English and could not be executed It is worth noting that the OPEP dictionary was (semi)automatically. automatically reversed to enrich the Polish-to- Word senses in the Translatica English lexicon with one-to-one equivalents. dictionary are described semantically by means of a subtree from the WordNet semantic 8. Evaluation hierarchy. OPEP delivers some hints on The evaluation of the Polish-to-English semantic values of words and semantic values translation (PolEng) was executed in the of their complements but it was up to the Allied Irish Bank (Dublin) in November 2003. lexicographer to choose the most appropriate At that time the English-into-Polish translation ones. was not available yet. The lexicon included The lexicographers found processing idioms only entries developed manually from various and phrases the most challenging. It was up to traditional dictionaries (before OPEP lexicographers: enrichment). to convert idioms and phrases into a As the system provides add-ins to MS- canonical form Office application as well as a plug-in to the to distinguish phrases from word usage Internet Explorer, the evaluation dealt with examples so that only the former are types of documents specific to those included in the dictionary applications. For each type the tester selected to determine complementation of 7-15 documents from the banking domain and idioms (basing on lexicographers' tried to evaluate the speed, coverage of intuition and on-line corpora) words/phrases and accuracy. The testing was a to describe admissible gaps in phrases "black-box" type – the tester was not capable to determine the syntactic role of a of estimating which component of the phrase (e.g. to determine that this translation was responsible for erroneous morning should be treated as an translations. The tester used both relative and adverb rather than a noun). absolute evaluation method. PolEng was compared to PolTrans – a translation system 6.4. Semi-automatic correction available on-line. Not surprisingly, PolTrans Before that stage several types of errors were passed the test better as far as the speed is present in the lexicon. The errors might have concerned, PolEng had higher numbers in resulted from erroneous automatic conversion, accuracy. The attempt to find the absolute mistakes of lexicographers or the development estimation was made in the following way: for of description formalism during the project. each document the number of translated words Semi-automatic correction consisted in was divided by the number of all words in the automatic search for errors (spelling and text giving the word completeness. The tester syntactical – inconsistent with a formalism) then divided texts into "phrases": whole and manual correction. sentences or parts of sentences. The number of phrases translated completely (e.g. all word 7. The dictionary status translated) was divided by the number of all The status of the dictionary after the phrases, giving the phrase completeness conversion process is the following: coefficient. For each phrase the accuracy of • English to Polish: translation was estimated in terms of 73294 lexemes; 101878 inflected forms; "accurate", "good", "moderate", "illegible". 120805 Polish equivalents; 79971 lexical The number of accurate and good translation phrases divided by the number of all translated phrase • Polish to English: resulted in the accuracy coefficient. 49989 lexemes, 974989 inflected forms, 54952 An extract of the document is equivalents, 42596 lexical phrases. presented below. The range goes from worst to best translated documents. 103
Type of document Word completeness% Phrases completeness% Accuracy% Word 97 95,87 – 99,07 72,45 – 98,84 68,37 – 93,02 Word 2000 95,05 – 99,63 72,45 – 98,84 59,56 – 93,02 Internet Explorer 73,78 – 98,09 46,15 – 94,23 33,85 – 81,69 PowerPoint 94,71 – 99,7 62,64 – 97,65 54,9 – 81,95 Similar tests have not yet been done for debts) list (letter) which should be properly English-to-Polish translation. At the time this translated into He sent me a long letter (the report is written, the quality of the translation subject is not obligatory in a Polish sentence) in this direction "looks" distinctly lower. The might as well be interpreted as A letter sent me main reason is worse elaboration of transfer debts (according to the free order of rules (less time) in that direction. Other components in a Polish sentence). This possible reasons are discussed in Conclusions. problem may be easily handled by semantic disambiguation: a letter is more likely to be an 9. Conclusions object than a subject of a send action. In The Translatica group has reached the translating from English, homography should following conclusions: be disambiguated as soon as possible (e.g. by 1. Only a small part (smaller than statistical POS-tagging) in order to achieve expected) of the conversion of a good and robust translation. human-readable dictionary into a machine-readable lexicon could be References properly carried out fully automatically. AGFL (2002): http://www.usenix.org/events/ 2. The quality of translation is not usenix02/tech/freenix/full_papers/koster/koster proportional to the completeness of the _html/node4.html Baldwin, T., Hutchinson, B. and Bond, F. (1999): A description in the dictionary. A larger Valency Dictionary Architecture for Machine number of equivalents (if they are not Translation. In: Eighth International constrained strongly) often results in Conference on Theroretical and Method- decreasing rather then increasing the ological Issues in Machine Translation: TMI- standard of translation. 99, Chester, UK, pp. 207-217 3. The dictionary description should be Bańko, M. (2000): Inny słownik języka polskiego (A limited: the more information – the Different Dictionary of Polish), Wydawnictwo slower translation Naukowe PWN 4. WordNet as an entry point for Graliński, F. (2002): Wstępujący parser języka semantic description proved helpful polskiego na potrzeby systemu POLENG but it has two drawbacks: 1) the (Bottom-up Parser of Polish designed for the structure of lattice assumed in POLENG System) In: Speech and Language Technology. Volume 6, Poznań 2002, [http:// WordNet is hard to deal with www.ceti.pl/~poleng/zasoby/publikacje/index. computationally (as opposed to the html] structure of a tree), 2) Semantic Ilaraza, A. Diaz de, Mayor, A. and Sarasola, K. hierarchy needed for disambiguation (2001): Building a lexicon for an English- between English and Polish rarely Basque MT system from heterogeneous wide- subsumes the WordNet hierarchy coverage dictionaries. In: Proceedings of MT 5. It proved very beneficial for the 2000, Machine Translation and Multilingual organization of labor to store the Applications in the New Millenium, University lexicon in the form of a standard SQL of Exeter, United Kingdom, 20-22 November 2000 database. Mayfield, J. and McNamee, P. (2002): Converting Another conclusion concerns the translation on-line bilingual dictionaries from human- algorithm itself. The transfer translation in readable to machine-readable form. In: both direction differs strongly in one specific Proceedings of the 25th Annual International aspect: problem of homography. In Polish ACM SIGIR Conference on Research and homography is quite rare because of rich Development in Information Retrieval, August inflexion. An exemplary homographic 11-15, 2002, Tampere, Finland sentence: Przeslał (sent) mi (me) długi (long, 104
OPEP (2002): Wielki Słownik Angielsko-Polski deployment of a new language pair. In: (English-Polish Dictionary), Wydawnictwo Proceedings of the 19th International Naukowe PWN Conference on Computational Linguistics, Pinkham, J. and Smets, M. (2002): Modular MT Taipei, Taiwan, pp. 800-806. with a learned bilingual dictionary: rapid 105
You can also read