Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links Gilles Sérasset, Mathieu Mangeot GETA-CLIPS IMAG University Joseph Fourier — Grenoble 1 BP 53, 38041 Grenoble cedex 9, FRANCE Gilles.Serasset@imag.fr, Mathieu.Mangeot@imag.fr in very small dictionaries. Also, dictionaries Abstract never contain numeric specifiers, which are as This paper presents a new research important in Japanese as gender and number in and development project called French. On the other hand, the information Papillon. It started as a French- available in paper dictionaries does not exist in Japanese cooperation between machine-readable forms, or is not accessible on laboratories GETA/CLIPS (Grenoble, line. France) and NII (Tokyo, Japan). Its The lack of bilingual resources is also an goal is to build a multilingual lexical obstacle to develop linguistic software database and to extract from it digital applications, for which adapted dictionaries are bilingual dictionaries. a need. As an example, Nippon Telegraph and The database is based on monolingual Telephone in Japan or Lexiquest in France have dictionaries, one for each language of to develop their own dictionaries in a separate the database, linked to an interlingual and time-consuming effort. In the academic dictionary. world, this implies that applications that have From the lexical database, it is been created for French and Japanese offer only planned to derive user customized a reduced scope, while good English-Japanese bilingual dictionaries in multiple pieces of software are available. target formats. It will be possible to Nevertheless, it is a true fact that Japan is very generate human usage dictionaries as interested in the French language. Conversely, a well as specialized dictionaries for growing number of French individuals invest machine translation software. These much energy to learn Japanese. There is a dictionaries will be available under vacuum to be filled. the terms of an open source license. The leveraging of communication that Internet This project, initiated by some offers allows one to think that a convenient computational linguists, aims at being digital dictionary could be produced by a useful and open to all those who are general cooperation between linguists, interested in Japanese and French. It translators, computer scientists, etc., working is also opened to any other language. together through Internet. Moreover, the pivot architecture of A similar project between English and Japanese the database will facilitate the has been active for about a decade. This project addition of new languages and save has allowed the effective building of a free translation efforts. Japanese-English dictionary, available through an Internet server. This Edict project has been created and supported by Pr. Jim Breen from 1. Introduction Monash University, Australia (see bookmark 2). The current JMDict dictionary comprises now There are few French-Japanese usage 70,000 entries of common vocabulary, a specific dictionaries, which are really usable and useful kanji dictionary, and around twenty specialized for French speakers. The main problem is that dictionaries (biology, law, etc). the original Japanese script and the rômaji A different project, fed by volunteers, is phonetic transcription are present together only supported by NEC Corporation. Its aim is to
increase the dictionaries used by the NEC 3. Internal Architecture of the translation tool (see bookmark 3), and to bring Database in new entries on a constant way. We should also mention the SAIKAM project The database will be built using a pivot [1], (see bookmark 5) cooperation between NII architecture based on Dr. G. Sérasset’s Ph.D. (Tokyo, Japan) and NECTEC (Bangkok, thesis (Sérasset 1994) and experimented by Dr. Thailand) active since about 5 years, where Thai E. Blanc in PARAX (Blanc 1999). The students working or having worked in Japan monolingual dictionaries will be linked only have built a sizable Japanese-Thai online dictionary through Internet. In such a context, the GETA/CLIPS laboratory (Grenoble, France) and the National Institute of database. Here are described the architecture of the database, the structure of the entries and the methodology adopted for the project. 2. General View of the Database The lexical database is built on the one hand by integrating existing resources and on the other Example of an interlingual acception encoded in XML hand by writing and correcting new entries (see Figure 1). through a pivot dictionary of interlingual links called acceptions. These acceptions will also be Once the database is homogeneous, users will be linked together by refinement links. They may able to extract their own customized dictionaries also be translated into the UNL language (UNL 1996), (see bookmark User User User 6). Interaction with Each sense or meaning the Dictionaries of each entry of a Dictionary Dictionary monolingual dictionary is linked to one or more Extraction of acceptions of the pivot Dictionaries dictionary. For Lexical example, in French “ carte ” has two Database meanings: “ carte à jouer (card) ” and Integration of “ carte géographique (map) ”. The entry existing resources “ c a r t e ” will consequently be linked Resource Resource Resource to two "lexies" Figure 1. General architecture (corresponding to 2 word senses) in the dynamically from the database and to interact French monolingual dictionary, which in turn with them. will be linked to 2 acceptions in the pivot dictionary: in the example, the first has number 343, with the corresponding UNL "UW" (universal word) “card(icl>play)”, and the
Figure 2. Lexical architecture of the papillon database second one has number 345, with UW 7. Examples: La mésentente pourrait “map(fld>geography)” (see Figure 2). être le mobile du meurtre. 8. Full idioms: _appel au meurtre_ _crier au meurtre_ 4. Structure of the monolingual We chose to encode this dictionary in XML (see dictionaries Annex). With this choice, we are able to The structure of the entries or microstructure of manipulate the dictionary structure using open- the monolingual dictionaries is based on the source tools XSLT processors and DOM parsers structure used for the formal lexical database (such as xalan and xerces from the « apache » DiCo (Polguère 1998) of the OLST laboratory in project, see bookmark 7). Université de Montréal. The encoding methodology is directly borrowed from the 5. Building methodology explanatory and combinatorial lexicology, which is part of the meaning-text theory The building methodology of the lexical (Mel’cuk 1997). database builds on one hand on the reuse of 1. Name of the lexical unit: MEURTRE existing data, the French-English-Malay 2. Grammatical properties: nom, masc dictionary (Gut et al. 1996), (see bookmark 1) and the Japanese-English dictionary of Jim 3. Semantic formula: action de tuer: _ PAR L’individu X DE L’individu Y Breen (see bookmark 2), and on the other hand 4. Government pattern: X = I = de N, A- on the contribution of volunteers working poss Y = II = de N, A-poss through the Internet. 5. (Quasi-)synonyms: {QSyn} Different steps are planned: The first step is the assassinat, homicide#1; crime integration of existing resources. It consists in 6. Semantic derivations and collocations: {V0} preparing a "lexical soup" by merging the two tuer dictionaries thanks to the presence of English. {A0} meurtrier-adj This merging operation will produce correct as {S1} auteur [de ART ] _ well as incorrect acceptions (interlingual links). //meurtrier-n /*Nom pour X*/ These wrong acceptions will be corrected or deleted by lexicologists.
Then the voluntary contributors will index new encourage voluntary contribution. These entries and the lexicologists will correct and contributions will be accepted via a “community integrate them into the database. It will create a web site” where any user should be able to : cycle of edition/correction/modification of the • Communicate and discuss about the entries between the lexicographers/contributors available material and the lexicologists. Different kind of • Consult dictionaries contributors can work on the database: • Correct dictionaries • specialists of one language will write the • Contribute monolingual entries; - By giving personal dictionaries • people with good knowledge of French - By providing entries to other and Japanese like translators will work on dictionaries the links between the monolingual entries A mockup of this web site is currently and the acceptions; developed in Java that integrates web services • people with good knowledge of UNL will with XML processing. translate the acceptions into UNL (UNL About 300 detailed French and some Japanese 1996). entries are currently integrated in the database (in XML form). 6. Encourage voluntary contributions The web side gives access to these entries by providing the users with a dynamically One of the main idea in Papillon project is to generated form obtained via XSL Figure 3. Web page dynamically generated for French entry "meurtre" (murder)
transformations. Electronics and Computer Technology Center Different XSL transformations are provided that (NECTEC/Thailand). give access to different views. Up to now, we The open source license makes all the data only develop a complete view inspired by the available to anyone. Furthermore, we will be Explanatory and Combinatory Dictionary (ECD) able to generate multiple formats from the developed by Igor Mel’cuk (see Figure 3). lexical database. Finally, it should be stressed that such an endeavor will not only need the dedication of as 7. Dictionaries produced many volunteer contributors as possible, but some stable support, in the form of a server and, Several monolingual or bilingual dictionaries more difficult, of a central team of experts can then be extracted from the database. charged of "refining the raw ore" of individual Different types are needed: for human use, via contributions. database and plug in functionalities or via usual That team does not have to be in a single place, dictionary formats, and for machine use. but convenient groupware tools should be developed for it. 1.1 For human use, via database and References plug in functionalities Vuthichai Ampornaramveth, Akiko Aizawa, Persons that interact in foreigner languages Keizo Oyama, Tasanee Methapisit (2000) often can access computers. One of the aims of Implementation of an Internet-Based Dictionary this dictionary is then to provide them with a Development Environment: SAIKAM (in direct help, within their editor, browser, or their Japanese) Research Bulletin of the National daily used personal digital assistant. Center for Science Information Systems vol.12, p.101-109 (2000) 1.2 For human use, via usual dictionary Blanc Étienne (1999) PARAX-UNL: A large formats scale hypertextual multilingual lexical database. We plan to automatically derive from the Proceedings 5th Natural Language Processing database digital presentations for web Pacific Rim Symposium 1999, Tsinghua consultation and paper edition. The FeM (see University Press, Beijing, 1999, p.507-510. bookmark 1) and JMDict (see bookmark 2) Gut Yvan, Puteri Rashida Megat Ramli, formats are the first targeted formats. Zaharin Yusoff, Chuah Choy Kim, Salina A. Samat, Christian Boitet, Nicolas 1.3 For machine use Nédobejkine, Mathieu Lafourcade et al. (1996) Kamus Perancis-Melayu Dewan, The terminology resources available for building dictionnaire français-malais. Dewan Bahasa lingware (linguistic software) are almost null Dan Pustaka, Kuala Lumpur, 667 p. between Japanese and French. The rare available Mel'cuk Igor A. (1997) Vers une linguistique ones have to be radically restructured and Sens-Texte. Leçon inaugurale, Collège de augmented. The orientation of the Papillon France, Chaire internationale, 43 pages. lexical database towards possible use by http://www.fas.umontreal.ca/LING/olst/FrEng/ machines will encourage the realization of melcukColldeFr.pdf lingware including both languages, by providing Polguère Alain (1998) La théorie Sens-Texte. a first support for such projects. Dialangue, Vol. 8-9, Université du Québec à Chicoutimi, pp. 9-30. 8. Conclusion http://www.fas.umontreal.ca/LING/olst/FrEng/P olgIntroTST.pdf The pivot architecture allows an easy integration Sérasset Gilles (1994) Interlingual Lexical of new languages because the reuse of existing Organisation for Multilingual Lexical links will save a lot of time consuming efforts. Databases in NADIA, COLING-94, 5-9 August The Thai language is already about to integrate 1994, vol. 1/2 : pp. 278-282. the project through a cooperation with Kasetsart Tomokiyo Mutsuko, Mathieu Mangeot & University (KU/Thailand), and National Emmanuel Planas (2000) Papillon : a Project
of Lexical Database for English, French and [b2] JMDict Japanese->English: Japanese, using Interlingual Links. Journées http://meshplus.mesh.ne.jp/CRV2/dic/club/do Science et Technologie de l'ambassade de wn.html France au Japon, 13 Novembre 2001, Tokyo, [b3] NEC project: Japon, 3 p. http://meshplus.mesh.ne.jp/CRV2/dic/club/do UNL (1996) Universal Networking Language. wn.html UNL center, Institute of Advanced Studies, The UN University, 1996, 74 p. [b4] Papillon Project: http://vulab.ias.unu.edu/papillon/index.html Bookmarks [b5] SAIKAM Project: http://saikam.nii.ac.jp [b1] FeM Dictionary: http://www- [b6] UNL Project: http://www.unl.ias.unu.edu clips.imag.fr/geta/services/fem [b7] Apache project tools: http://xml.apache.org/ Annex Excerpt of the XML structure that encodes the French entry « meutre » (murder) meurtre meu+rtr(e) n.m. action de tuer: ~ PAR L' individu X DE L' individu Y XI de N A-poss YII de N A-poss assassinat homicide$2 crime tuer … Très choquant atroce affreux brutal horrible inqualifiable odieux …
C'est ici que le double meurtre a été commis. Soupçonné du meurtre de son épouse, il a été arrêté par les gendarmes mercredi. Il devrait comparaître aux assises dans trois semaines comme auteur présumé du meurtre d'un quinquagénaire. La mésentente pourrait être le mobile du meurtre. _appel au meurtre_ _crier au meurtre_
You can also read