Nordic Perspectives on the CLARIN Infrastructure of Common Language Resources
Vol. 5
Proceedings of the NoDaLiDa 2009 Workshop
Nordic Perspectives on the CLARIN Infrastructure of Common Language Resources
May 14, 2009
Odense, Denmark

Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson, Koenraad De Smedt

Northern European Association for Language Technology
Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources
NEALT Proceedings Series, Vol. 5
© 2009 The editors and contributors
ISSN 1736-6305

Published by Northern European Association for Language Technology (NEALT)
http://omilia.uio.no/nealt

Electronically published at Tartu University Library (Estonia)
http://dspace.utlib.ee/dspace/handle/10062/4116

Volume Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson, Koenraad de Smedt

Series Editor-in-Chief: Mare Koit

Series Editorial Board: Lars Ahrenberg, Koenraad De Smedt, Kristiina Jokinen, Joakim Nivre, Patrizia Paggio, Vytautas Rudzionis
Contents

Preface

Maia Andréasson, Lars Borin, Markus Forsberg, Jonas Beskow, Rolf Carlson, Jens Edlund, Kjell Elenius, Kahl Hellmer, David House, Magnus Merkel, Eva Forsbom, Beáta Megyesi, Anders Eriksson & Sven Strömqvist: Swedish CLARIN activities

Hanne Fersøe & Bente Maegaard: CLARIN in Denmark – European and Nordic perspectives

Kimmo Koskenniemi & Antti Arppe: Nordic co-operation in building the language resource infrastructures

Rūta Marcinkevičienė: Two decades of Lithuanian HLT

Einar Meister, Tiit Roosmaa & Jaak Vilo: Estonian language technology Anno 2009

Eiríkur Rögnvaldsson, Hrafn Loftsson, Kristín Bjarnadóttir, Sigrún Helgadóttir, Matthew Whelpton, Anna Björk Nikulásdóttir & Anton Karl Ingason: Icelandic Language Resources and Technology: Status and Prospects

Inguna Skadiņa: CLARIN in Latvia: current situation and future perspectives

Pavel Skrelin, Vera Evdokimova & Karina Evgrafova: The Possible NEALT Role in the Consolidation of the Nordic and Baltic Language Resources

Koenraad de Smedt: CLARIN: Norwegian and Nordic perspectives
Preface

The Nordic countries have a long tradition of cooperating within many areas, including politics, education and science. Many of the languages are closely related, and sometimes the same language is spoken across national boundaries (for example Sámi and Swedish). Language technology is relatively well developed in these countries, but much more is needed to build the infrastructure required for advanced R&D, and to secure the languages of the region for the future. The CLARIN project is an initiative on the European level to meet those challenges by making language resources and technology available and usable.

In recent years, new regions around the Baltic have become parts of the Nordic area. With increased cooperation, coordination and consolidation of common strengths, the Nordic/Baltic countries could strengthen their work in language technology infrastructure and their contribution to CLARIN.

The main topic of the workshop is Nordic strengths and opportunities for cooperation within the NEALT Geographic Region in constructing an infrastructure of common language resources in connection with the European CLARIN (Common Language Resources and Technology Infrastructure) initiative. The purpose is to find ways of cooperating that will strengthen the contribution of the associated countries to the development of an infrastructure of common language resources within the CLARIN initiative.

In the workshop, participants in CLARIN from the NEALT-associated countries will be given the chance to present their national projects and to discuss possible ways of cooperating, sharing resources, coordinating activities and considering new projects. Opportunities and proposals for closer cooperation and coordination will be presented and discussed at the workshop.
CLARIN participants in the NEALT-associated countries were invited to present their national work from the perspective of possible cooperation between groups and projects in the different countries: Denmark, Estonia, Finland, Iceland, Latvia, Lithuania, Norway and Sweden. The workshop is intended for participants with an interest in developing language resources and technology in the NEALT Geographic Region for languages spoken in that region. There will be an opportunity for each such country to present an overview of the status of its national language resource infrastructure.

Eight abstracts, one from each of the above-mentioned countries, were submitted for review by the editors (who did not review contributions from their own country). All of them were accepted.

We, the editors and organizers, want to thank the authors for their contributions. We look forward to a promising workshop in Odense where they will be presented and discussed.

Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson & Koenraad De Smedt
Nordic Perspectives on the CLARIN Infrastructure of Common Language Resources – Workshop Proceedings
May 14, 2009
Odense, Denmark

Nodalida 2009 Workshop website: https://kitwiki.csc.fi/twiki/bin/view/Nealt/SigInfraNodalida2009WorkshopCall
Conference website: http://beta.visl.sdu.dk/nodalida2009/

Organized by the NEALT SigInfra through:

• Rickard Domeij, the Language Council of Sweden, chairman of the Nordic language councils' working group on language technology and of the NEALT SigInfra
• Kimmo Koskenniemi, Prof. of language technology at the University of Helsinki, Department of General Linguistics, Executive Board member of CLARIN
• Steven Krauwer, Utrecht University, CLARIN Coordinator
• Bente Maegaard, Head of Centre for Language Technology, University of Copenhagen, Denmark, Executive Board member of CLARIN
• Eiríkur Rögnvaldsson, Prof. of Icelandic Language, University of Iceland
• Koenraad De Smedt, Prof. of Computational Linguistics, University of Bergen; national coordinator of CLARIN in Norway
Swedish CLARIN Activities

Maia Andréasson, Lars Borin, Markus Forsberg
Språkbanken, Dept of Swedish Language, University of Gothenburg
first.last@svenska.gu.se

Jonas Beskow, Rolf Carlson, Jens Edlund, Kjell Elenius, Kahl Hellmer, David House
Centre for Speech Technology, School of Computer Science and Communication, KTH
(rolf,kjell,davidh)@speech.kth.se

Magnus Merkel
NLP Lab, Dept of Computer Science, Linköping University
mme@ida.liu.se

Eva Forsbom, Beáta Megyesi
Language Technology Unit, Dept of Linguistics and Philology, Uppsala University
first.last@lingfil.uu.se

Anders Eriksson
Phonetics Unit, Dept of Philosophy, Linguistics and Theory of Science, University of Gothenburg
anders.eriksson@ling.gu.se

Sven Strömqvist
Centre for Languages and Literature, Lund University
sven.stromqvist@ling.lu.se

Abstract

Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.

1 Introduction

CLARIN has two partners (Centre for Speech Technology, KTH and the Humanities Lab, Lund University) and a considerable number of members in Sweden, including the sites of the authors of this document.

However, the Swedish Research Council has decided not to allocate national funds for Swedish involvement in the ongoing preparatory phase of CLARIN, which means that any participation by Swedish members beyond that which is covered by EC funding to the two Swedish CLARIN partners must be covered by funds obtained elsewhere.

On the other hand, the Swedish Research Council has increased available funding for research infrastructure in general, and in fact Swedish CLARIN members have been able to secure project funding for some CLARIN-related activities in this way from the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.

CLARIN-related work in Sweden has been considerably aided by the fact that the Swedish language technology community is close-knit – with well-functioning channels and fora of communication and collaboration – and united in its recognition that the realization of the kind of infrastructure that CLARIN engagement requires is a costly endeavor which must be a collective undertaking involving the whole community.

In the next section we describe some of the ongoing CLARIN-related activities in Sweden, for which we have been able to secure funding by the Swedish Research Council.

2 Some CLARIN-related activities in Sweden

2.1 An infrastructure for Swedish language technology

In 2007, the Research Infrastructure Committee of the Swedish Research Council awarded a two-year
planning grant to a national Swedish consortium in language technology, with 7 partner institutions:

• University of Gothenburg (coordinating partner)
• Chalmers University of Technology
• KTH (Royal Institute of Technology)
• Linköping University
• Lund University
• The Swedish Language Council
• Uppsala University

The planning grant was awarded for a proposal entitled An infrastructure for Swedish language technology, with the aim of preparing a project proposal or project proposals for creating an integrated basic Swedish language technology research infrastructure, consisting of

1. a Swedish national corpus (Svensk nationell korpus – SNK);
2. a Basic LAnguage Resource Kit (BLARK) for Swedish.

The practical planning work has been carried out by two working groups, with researchers from Gothenburg and Linköping responsible primarily for the work on SNK, and researchers from KTH and Uppsala having worked mainly on the Swedish BLARK. The two groups have interacted constantly throughout the course of the work, both in physical meetings and by means of electronic communication, e.g., project reports and other documents have been collectively prepared using a project wiki.

The main tasks of the working groups have been:

• to make an inventory of and collect information about existing resources, their character, quality, and not least, availability for research and other purposes;
• to make a survey of the needs of the research community and industry;
• to collect information about similar initiatives – completed, ongoing and planned – in other countries, especially in Europe;
• on the basis of this information, to formulate a concrete funding proposal to VR/KFI, comprising a description of the SNK and the Swedish BLARK, together with an outline work plan and budget for creating the resources.

A funding proposal for an SNK/BLARK combination was submitted to VR/KFI in October 2008. The proposal is now being reviewed by international experts. The amount of funding needed for realizing the SNK and Swedish BLARK in parallel is estimated at 130 million SEK over 7 years. However, the proposal points out that pursuing the two separately would cost on the order of 50 million SEK more, i.e., there is considerable synergy in the proposal.

No doubt in large part as a result of the work in this planning project, the Swedish Research Council has listed language technology as one of a number of national research infrastructure areas of highest priority in its Roadmap to research infrastructure. This spring, a call will be issued for proposals by national consortia in exactly those areas. Thus, it seems there is a good chance that the two years of dedicated work laid down in this project might pay off.

2.2 Safeguarding the future of Språkbanken

Språkbanken (the Swedish Language Bank) at the University of Gothenburg has provided an online service to the research community since 1975, whereby language resources (corpora and lexicons) are made available to the research community and the public. The resources are available free of charge on the internet through a number of search interfaces. Språkbanken possesses a unique combination of competences in the areas of Swedish text corpora, parallel text corpora, Swedish computational lexicons, and LT tools for the processing, annotation and presentation of text corpora, coupled with the kind of stable organization required for sustained large-scale corpus processing and presentation.

Språkbanken's resources are widely used for research and teaching, but also for other related purposes (for checking what is possible or good Swedish, as a reference in popular writings about language usage, etc.). In particular, a good number of PhD theses in Sweden and Finland have used Språkbanken as a data source.

Språkbanken has grown organically over the four decades of its existence. Many of the presently available corpora have been collected on Språkbanken's own initiative, and this is ongoing work; e.g., about 15–20 million words of press text are added annually. However, some of the corpora are the result of independent research
projects conducted by the NLP research group at Gothenburg or by groups at other Swedish universities. In principle, the same situation obtains for the lexicon resources. Tools for browsing and searching resources have been developed in concert with the creation of the resources themselves. This means that resources are stored in Språkbanken in several different formats, with varying amounts of added information. The use of different formats implies that idiosyncratic tools are required for browsing and searching each resource.

A number of language technology tools are used with the resources, which have been developed or adapted in various research projects in the department. There are also tools that have been developed in collaboration with other groups, e.g. morphological processors for modern Swedish and Old Swedish which are being developed jointly with the Language Technology research group at Chalmers University of Technology. The conditions under which such research endeavors are undertaken have not in general been conducive to standardization and wider integration of these tools.

Generally, the kinds of research questions which can be addressed using a large text material such as that found in Språkbanken are heavily dependent on three characteristics of the material and the infrastructure in which it is embedded: (1) the character of the material itself (its representativity w.r.t. the language variety under investigation); (2) the annotations, markup and metadata that the material is provided with (and, more generally, which annotations, etc., are [formally] allowed by a given framework); (3) the level of access to the material, viz. (3a) inspection (search and presentation) access only: (3a1) restricted (individually [login] or by site [IP number]); (3a2) unrestricted; (3b) download access (or other in toto access): (3b1) restricted (individually [login] or by site [IP number]); (3b2) unrestricted.

The ideal would be to have fully representative corpora provided with the maximum possible amount of high-level linguistic annotations and rich metadata, which would be available both via sophisticated online user interfaces and for downloading. There is now an urgent need for integration of the (presently) diverse resources and tools in Språkbanken in a way that also takes into account international standardization work in the field of language (technology) resources. Thus, Språkbanken will be further developed in the following areas, broadly definable as those dealing with infrastructure components (1–5) and user interface/interaction components (6):

1. Standardization of storage and exchange formats;
2. Standardization of annotation, markup and metadata formats;
3. Addition of uniform linguistic annotations to all the corpora of contemporary Swedish;
4. Addition of metadata to existing resources;
5. Definition of a set of processing components and APIs (Application Programming Interfaces) for these components;
6. Development of a set of user interface components for selecting, browsing, searching, annotating, etc., Språkbanken's corpora and lexicons, as well as up- and downloading texts.

Work is well underway in the project on all of these. One aim is to collaborate with other initiatives whenever feasible; thus, the corpus browser frontend Glossa developed by Tekstlaboratoriet, University of Oslo, is now being adapted for use in Språkbanken. This work will be conducted jointly with Tekstlaboratoriet.

The CLARIN preparatory phase work is seen as so important by an institution such as Språkbanken – whose day-to-day activities will be profoundly influenced by the standards, recommendations, best practices, etc., which emerge from CLARIN preparatory phase work – that Språkbanken has decided to use part of the funding for this national project to participate in the preparatory phase of CLARIN; at the present time, this is one of the best ways of safeguarding the future of Språkbanken.

2.3 Spontal: Multimodal database of spontaneous speech in dialog

This section describes the ongoing Swedish speech database project, Spontal: Multimodal database of spontaneous speech in dialog. The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in everyday, face-to-face communicative interaction. Our understanding of vocal and visual cues and interactions in spontaneous speech is growing, but there is a great need for data with which we can make more precise measurements. Currently we have very little
data with which we can measure with precision such important aspects of human communication as the timing relationships between vocal signals and facial and body gestures, or how these gestures vary in spontaneous speech or in different speaking styles.

The goal of the Spontal project is the creation of a Swedish multimodal spontaneous speech database rich enough to capture important variations among speakers and speaking styles to meet the demands of current talk-in-interaction research. An important contemporary trend is the study of everyday spoken language in dialog, which has many characteristics differing from written language or scripted speech. Detailed analysis of spontaneous speech can also be fruitful for phonetic studies of prosody and of reduced and hypoarticulated speech. The Spontal database will make it possible to test hypotheses on the visual and verbal features employed in communicative behavior covering a variety of functions. To increase our understanding of traditional prosodic functions such as prominence lending and grouping and phrasing, the database will enable researchers to study visual and acoustic interaction over several subjects and dialog partners. Moreover, dialog functions such as the signaling of turn-taking, feedback, attitudes and emotion can be studied from a multimodal, dialog perspective. In addition to basic research, one important application area of the database is to gain knowledge to use in creating an animated talking agent (talking head) capable of displaying realistic communicative behavior with the long-term aim of using such an agent in conversational spoken language systems. The database will be freely available for research purposes.

60 hours of dialog consisting of 120 half-hour sessions will be recorded. Each session consists of three consecutive 10 minute blocks. Subjects are told that they are allowed to talk about absolutely anything they want at any point in the session, including meta-comments on the recording environment and suchlike, with the intention of relieving subjects from feeling forced to behave in any particular manner. Subjects are informed about the time after each 10 minute block. After 20 minutes, they are asked to open a wooden box which contains objects whose identity or function is not immediately obvious. The subjects may then hold, examine and discuss the objects taken from the box, but they may also choose to continue whatever discussion they were engaged in or talk about something entirely different. The subjects are all native speakers of Swedish and balanced as to gender and whether the dialogue partners know each other or not. This balance will result in 15 dialogs of each configuration: 15×4×2 for a total of 120 dialogs. Currently (February, 2009), about 25% of the database has been recorded.

In the base configuration, the recordings comprise high-quality audio and high-definition video, with about 5% of the recordings also making use of a motion capture system using infra-red cameras and reflective markers for recording facial gestures in 3D. In addition, the motion capture system is used on virtually all recordings to capture body and head gestures, although resources to treat and annotate this data have yet to be allocated.

2.4 SweDia 2000 – A Swedish dialect database

The SweDia database consists of recorded speech from 107 dialects representing the dialectal variation in Sweden and Swedish-speaking parts of Finland. The recordings were made in 1999 by a previous research project, SweDia 2000. Each dialect is represented by twelve speakers representing two generations with an equal number of male and female speakers. Research questions that may be addressed using the data are: What are the laws that govern language development and change? To what extent does internal structural coherence govern the development of dialects? The database has until now primarily been used by the SweDia group and a circle of researchers who have obtained personal copies on hard disks. The goal of the present work is to make the database available to a much wider circle by placing it on an internet server together with other language databases accessible via a common web-based interface. It should be possible to perform searches at syllable, word or word sequence levels. A first version of (nearly) the entire database already exists, hosted on an IMDI server at the Centre for Languages and Literature at Lund University. The result of a successful search can, for example, be a sound file with the desired items and a time-aligned transcription. It should be possible to listen to it directly or download a file for further analysis. In its present form, only parts of the database
material are transcribed.

A part of the database that comprises informal interviews and semi-spontaneous monologues will be simultaneously hosted on a server at Tekstlaboratoriet at the University of Oslo. This part of the database will be combined with data collected by the Scandinavian Dialect Syntax project.

To make the databases fully searchable they will have to be transcribed at the word level. This work is in progress and substantial parts of the material are already transcribed. Simple analysis tools will also be available. To the extent that it is possible they will be designed to run on-line. Additional tools will be offered for download.

2.5 Litteraturbanken

The project described in this section – Litteraturbanken (the Swedish Literature Bank) – is different from the others described above, in that it has permanent funding by an independent private funding body, the Swedish Academy.

Litteraturbanken is a public digital repository of classical Swedish literary works in scientifically validated editions. It is slated to grow by approximately 100 novel-length works annually. The relevance to CLARIN of this endeavor is found in the following two circumstances:

1. The technical infrastructure of Litteraturbanken was developed by Språkbanken, which is also responsible for developing this infrastructure and maintaining the Litteraturbanken website on its servers. This means that the work on the technical solutions in Litteraturbanken is part of the work in the project described above in section 2.2;
2. Litteraturbanken is developed with the aim that it can serve as a primary data source for research in a number of disciplines in the humanities and social sciences (e.g., literature, various historical disciplines and sociology), using language technology tools, e.g., in the form of text mining.

3 Conclusion

Even though the Swedish Research Council has not set aside funds explicitly intended for CLARIN work, the projects described in the preceding section together represent a funding of 10.6 million SEK (about 1 million Euro), plus about 2.5 million SEK annually to Litteraturbanken. The resources being realized with this funding will be extremely valuable when CLARIN enters its permanent phase.

Acknowledgments

We gratefully acknowledge the following sources of funding for the work described or mentioned above.

The work in the CLARIN preparatory phase by the Centre for Speech Technology, KTH, and Centre for Languages and Literature, Lund University, supported by CLARIN.

The planning project An infrastructure for Swedish language technology 2007–2008 (a national collaboration, coordinated by Språkbanken, University of Gothenburg), supported by the Swedish Research Council's Committee for Research Infrastructures (VR dnr 2006-6763).

The project Safeguarding the future of Språkbanken 2008–2010 (Språkbanken, University of Gothenburg), supported by the Database Infrastructure Committee of the Swedish Research Council's Committee for Research Infrastructures (VR dnr 2007-7430).

The project Spontal: Multimodal database of spontaneous speech in dialog 2007–2009 (Centre for Speech Technology, KTH), supported by the Database Infrastructure Committee of the Swedish Research Council's Committee for Research Infrastructures (VR dnr 2006-7482).

The project SweDia 2000 – A Swedish dialect database 2008–2010 (Phonetics, University of Gothenburg), supported by the Database Infrastructure Committee of the Swedish Research Council's Committee for Research Infrastructures (VR dnr 2007-7432).

Litteraturbanken, supported on a permanent basis by the Swedish Academy.
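As a small consistency check on the Spontal recording plan in section 2.3 of the Swedish paper, the figures quoted there (15 dialogs per configuration, the factors 4 and 2, three 10 minute blocks per session, about 25% recorded) can be multiplied out. This is only an illustrative sketch: the constant and function names are hypothetical and do not come from the project, and the factors 4 and 2 are used exactly as given in the text without interpreting what each one stands for.

```python
# Figures quoted in section 2.3 (Spontal); all names below are illustrative only.
DIALOGS_PER_CONFIGURATION = 15
CONFIGURATION_FACTORS = (4, 2)   # the factors in "15x4x2", as given in the text
BLOCKS_PER_SESSION = 3
BLOCK_MINUTES = 10


def planned_sessions() -> int:
    """Total number of dialogs: 15 per configuration times the stated factors."""
    total = DIALOGS_PER_CONFIGURATION
    for factor in CONFIGURATION_FACTORS:
        total *= factor
    return total


def session_minutes() -> int:
    """Length of one session: three consecutive 10 minute blocks."""
    return BLOCKS_PER_SESSION * BLOCK_MINUTES


def planned_hours() -> float:
    """Total planned recording time in hours."""
    return planned_sessions() * session_minutes() / 60


def recorded_sessions(fraction: float) -> float:
    """Approximate number of sessions recorded for a given completed fraction."""
    return planned_sessions() * fraction


if __name__ == "__main__":
    print(planned_sessions())        # 120 dialogs
    print(session_minutes())         # 30 minutes per session
    print(planned_hours())           # 60.0 hours
    print(recorded_sessions(0.25))   # 30.0 sessions at about 25% completion
```

The results agree with the totals stated in the paper: 120 half-hour sessions and 60 hours of dialog, with roughly 30 sessions recorded as of February 2009.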
CLARIN in Denmark – European and Nordic perspectives

Hanne Fersøe
University of Copenhagen, Centre for Language Technology
Copenhagen, Denmark
hannef@hum.ku.dk

Bente Maegaard
University of Copenhagen, Centre for Language Technology
Copenhagen, Denmark
bmaegaard@hum.ku.dk

Abstract

This paper gives an overview of the Danish CLARIN project (funding background, national strategic goals, formation of consortium etc.) including the very important priority of aiming the results of the project at researchers from the wide range of all fields of humanities research which is based on language sources, i.e. not exclusively at researchers in the fields of linguistics and language technology, but with a much broader scope. Secondly, it discusses future perspectives of European and Nordic cooperation.

1 The European context

The European Strategy Forum on Research Infrastructures (ESFRI) initiated its Roadmap Process in 2001, and in 2006 it published the first European Roadmap for Research Infrastructures (RI), which was updated in 2008¹. The Roadmap gave its recommendation to 6 SSH-projects (Social Sciences and Humanities), and the European CLARIN project is one of those 6 projects.

At the European Commission level a funding model for European Research Infrastructure (RI) projects was developed in the 7th Framework Programme, a call was opened for those recommended by ESFRI, and 34 projects are currently running, including 5 SSH projects. In parallel, the European Commission has work in progress on a Council Regulation to provide a legal form for the long-term organization to run the pan-European RIs in the construction and deployment phases.

The participation in the construction and deployment of pan-European RIs must be funded nationally, so the 27 EU member states have agreed to develop national Roadmaps. Currently approximately half of these are available.

¹ http://cordis.europa.eu/esfri/

2 The National Danish Context

2.1 Funding of Danish RI projects or Danish participation in European RI projects

In parallel with the European interest in research infrastructures, the Danish Ministry of Science, Technology and Innovation commissioned the Danish Council for Strategic Research to survey the needs and propose a strategy for future research infrastructures. Their report was published in December 2005.

Following these preparatory strategy papers a call for proposals of RIs was published in September 2007 with a pool of 200 million DKK (€27 million) per year for a period of three years.

2.2 Danish Roadmap

Additionally, the Danish Agency for Science, Technology and Innovation is preparing the national Danish roadmap for RIs in agreement with the ESFRI and the Commission process.

3 The Danish CLARIN project

The Danish CLARIN consortium applied for the equivalent of four million euros and was awarded two million for the three year period 2008-2010 for the construction of a national research infrastructure for the humanities, focusing on material expressed in language (written or spoken) and tools to treat this material. This means that Denmark is not in a preparatory phase parallel to the EU-CLARIN project, but that we
are actually implementing a national research infrastructure.

3.1 The Consortium

The Danish CLARIN consortium has four universities and four cultural institutions as its members, with the University of Copenhagen coordinating the consortium. The members are:

• University of Copenhagen
• University of Southern Denmark
• University of Aarhus
• Copenhagen Business School
• Society for Danish Language and Literature
• Danish Language Council
• The Royal Library
• The National Museum of Denmark

A total of 11 research groups are participating with funding, and a 12th group has joined as of January 2009 as an observer.

With these partners the consortium is very strong and to the point, as it has a good combination of the necessary skills and experience: humanities, language technology, language resources, and computer science. The consortium will collaborate with EU-CLARIN where possible, and particularly strive to learn from and adhere to standards as decided at the European level in order to pave the way for Denmark to be an active partner in the construction and exploitation phases of the European project. One of the national tasks for the Danish CLARIN consortium is to propose a strategy for the exploitation at the national level.

3.2 Strategic project goal

The vision is to create a researcher's toolbox by establishing a number of digital Danish text, speech and visual resources and associated tools and to integrate these resources into a web-based environment for research, thus creating much needed support for Danish humanities and enhancing its possibilities for European collaboration.

The Danish CLARIN project is eager to follow standards and recommendations developed in the preparatory phase of the European CLARIN project, as far as possible, but as the European project is a preparatory project, the recommendations may not all be available when they are needed for implementation in the Danish project. The European CLARIN project is assessing existing standards and recommendations in order to be able to determine a set of CLARIN specific recommendations and standards in areas such as technical architecture, metadata, interoperability, IPR and copyright issues etc. However, the Danish CLARIN project needs to proceed, in order to be sure of delivering the results foreseen at the end of 2010.

For this reason it was vitally important for the consortium to design the work packages in such a way as to be able to deliver as a result not only the technical infrastructure but also as many types of content as possible. This means that the project plan contains activities both to deliver already existing resources and to produce new resources. The project is organized into thematically defined main work packages, namely written language resources, spoken language resources and collections of constructed data. Each main work package is subdivided into a number of sub work packages, and in each of these the participants are in the process of collecting, annotating and otherwise producing and including different types of resources.

3.3 Written language resources

In the main work package written language resources six different written language resources will be created and made available through the Danish CLARIN infrastructure.

The Danish CLARIN partner Society for Danish Language and Literature (DSL)² is responsible for collecting a contemporary general language corpus of 15 million words of annotated Danish text per year (i.e. a total of 45 million words), mainly from newspapers and periodicals. This new corpus will cover the period around 2010, and as such it will be supplementing the existing KorpusDK³ which contains around 56 million words from the periods around 1990 and around 2000, respectively. The corpus annotations will be expressed according to TEI P5 specifications. Apart from KorpusDK, DSL has many other interesting and relevant digital resources, as can be seen on their web page, and as a part of the project some of these will also be made available through CLARIN.

² http://dsl.dk/
³ http://ordnet.dk/korpusdk
⁴ http://english.cst.ku.dk/

University of Copenhagen, Centre for Language Technology (CST)⁴, together with the
Danish Language Council (DSN)5 is responsible for collecting an 11 million words corpus of annotated sublanguage texts from the period 2000-2010 from broadly selected domains such as health care and medicine, IT, agriculture, construction, meteorology. The corpus will be based on texts originating from experts and semi-experts and with a targeted readership of semi-experts and laymen. At present no such corpus exists for Danish so the sublanguage corpus will represent a truly new type of resource for scientists to work with, and as such it will constitute a valuable supplement to the general language corpus. To learn more about the general language corpus and the sublanguage corpus, see Halskov (to appear).

Another corpus of sublanguage texts will be collected by researchers from the DUDS6 group at University of Copenhagen. They will create a corpus of 250,000 words composed of extracts from non-literary texts for everyman’s use from the period 1500 to 1750. The texts will be extracted from rare books only obtainable from The Royal Library in Copenhagen, and they will cover subjects such as ethics and moral issues, geography and topography, history, housekeeping and cooking, medical science, mathematics and astrology, natural sciences, pedagogics, etc. (Fersøe 2008b). The texts will be scanned and OCR recognized and marked up according to the Multi Level Text (MLT) annotation (Ruus 2002) which handles orthographical variation, and which will be the key to searching the corpus.

The domains covered in the Everyman corpus mentioned above could be richly illustrated by the images found in existing collections belonging to the section Danmarks Nyere Tid (DNT)7 of The National Museum of Denmark. A group from this unit is responsible for creating a pilot corpus of 8,000 images with associated textual descriptions and for making them available on the platform. After deciding the best way of capturing and annotating all the available information from the associated texts, including which language technologies to use for this, they will select more images. Currently there are 50,000 digitized images to choose from. It is not the task of this project to link the Everyman corpus and the DNT images, but this is a future research project. Furthermore the linking could also be extended to the Danish Dictionary of Insular Dialects, DID8, see further down.

Older literary texts will be represented through the work of the Danish writer and Nobel Prize winner, Johannes V. Jensen. Of his work 50 books will be digitized, OCR recognized and annotated, the latter a task which implies adapting the tools, e.g. the PoS-tagger, to older Danish. DSL is responsible for this work together with the Johannes V. Jensen Centre of the University of Århus9. In addition DSL will also be specifying a prototypical lexicon of orthographical variation.

Finally a parallel multilingual resource of at least 20 million words will be collected from available bilingual texts. The work will build on experience gained from previous work carried out by research groups at the University of Copenhagen (Maegaard, Offersgaard et al. 2006). While this previous work focused on older texts, namely The Snowman by the famous Danish fairy tale writer Hans Christian Andersen, the new parallel corpus will focus on contemporary texts. The texts will be collected and subsequently aligned and annotated, and focus will be on Danish-English and Danish-German. CST is responsible for collecting, aligning, and otherwise annotating the multilingual corpus and for making it available.

One of the challenges in connection with collecting and making available current written text resources is the copyright issue. The consortium is asking permission from writers, publishers and other categories of text owners, and only texts for which permission can be obtained will be included.

3.4 Spoken language resources

In the main work package spoken language resources three different spoken language corpora, one of them including video recordings, will be collected, annotated and made available with a number of associated tools.

5 http://www.dsn.dk/
6 http://duds.nordisk.ku.dk/
7 http://www.nationalmuseet.dk/sw6796.asp
8 http://dialektforskning.ku.dk/publikationer/oemaalsordbogen/
9 http://www.nordisk.au.dk/jensen/index
10 http://www.sdu.dk/Om_SDU/Institutter_centre/Isk/Centre/SoPraCon.aspx

A group of researchers from the University of Southern Denmark, USD10, in Kolding will collect video and sound recordings of 20 hours of naturally occurring interaction, mostly from face
to face situations. The corpus will be annotated according to the Conversation Analysis methods (MacWhinney and Wagner, to appear) to encode overlap, pausing, prosody, and a wide variety of non-lexical features. In addition to this, parts of the corpus will also be annotated with multimodality coding according to the MUMIN system (Jokinen, Navarretta et al., 2008) for facial and manual gestures, gaze, posture, and proximity. The corpus will be accompanied by a search engine which allows the data to be searched for interactional features, mainly combinations of verbal material, timing plus features marked in the transcription.

Another spoken corpus will be collected by the researchers from the Danish National Research Foundation Centre for Language Change in Real Time, LANCHART11, at the University of Copenhagen. This group is working with corpora collected over a long period of time, and they are re-interviewing some of the informants that were interviewed earlier in order to be able to compare their language between then and now and thus study language change (Gregersen, 2007). There are, however, various confidentiality restrictions which are making it very difficult – if not impossible – to offer free availability to these corpora, so in the CLARIN context a new small corpus of spoken young Copenhagen Danish will be collected and annotated according to the LANCHART standards. The group will also deliver a tool that can be used for analysis by all researchers who want to handle and study spoken language materials.

The third spoken corpus to be delivered through the Danish CLARIN infrastructure is created at Copenhagen Business School, CBS12. The corpus text is the Danish PAROLE corpus13 of which currently 100,000 tokens exist as sound files in lab quality (Henrichsen, 2007). This corpus will be made available with the sound files and with annotations for PoS, syntactic structures, acoustic measurements, phonetic transcription, and more. These data are unique in Denmark for phonetic studies and speech technology. The data will be extended, revised and reorganized to be made available through CLARIN, and so will the accompanying tools for word-level alignment, verification of phonetic transcription, and acoustically based prosodic analysis.

3.5 Collections of constructed data

The term ‘collections of constructed data’, or technological resources as they are also called, is a loose definition we have used in the Danish CLARIN project to cover resources that are not collected and annotated as they are, such as e.g. written or spoken corpora, but which are carefully selected data put together as a collection according to a specific set of requirements, such as e.g. dictionaries. In the main work package collections of constructed data three different sets of constructed data will be made available.

The Danish WordNet, DanNet14 (Pedersen, Nimb et al. 2008), will be extended from 35,000 to 70,000 synsets in close collaboration between CST and DSL and according to a set of specifications for inclusion of new vocabulary. The extension, more precisely, consists of generation of the new synsets, placing them in the ontological structure of DanNet, determining DanNet equivalents for Base Concepts from Princeton WordNet15, and establishing the links to Princeton WordNet. The existing coding tool will be slightly enhanced, and an xml-format will be developed.

Researchers from the Jens Peter Skautrup Centre16 at the University of Århus have developed Jysk Ordbog17, which is a rich resource of dialects of Jutland. In the CLARIN project the research group will evaluate the current data base format of the dictionary and subsequently redesign it to fit more appropriately with CLARIN standards and formats before making it available through the infrastructure.

Bringing together different types of dictionary resources is scientifically interesting and has obvious benefits for teaching. In the CLARIN project researchers from CST will bring together DanNet and the Danish computational dictionary, STO18, and thus highly improve the potential of both as a computerized representation of Danish vocabulary, providing not only lexical semantic information, but also syntax and morphology. The work will be based on the positive results of a pilot project (Pedersen, Braasch et al. 2008), and will comprise about 9,000 words.

11 http://lanchart.hum.ku.dk/
12 http://isvcbs.dk/~pjuel/index2.html
13 http://korpus.dsl.dk/e-resurser/parole-korpus.html
14 http://www.wordnet.dk/
15 http://wordnet.princeton.edu/
16 http://www.jysk.au.dk/index.jsp
17 http://www.jysk.au.dk/jyskordbog/jyskordbog
18 http://english.cst.ku.dk/sto_ordbase/

The research group from Danish Dictionary of Insular Dialects (DID) mentioned earlier is not a CLARIN partner with funding from the grant.
The group, however, is currently working with some technical issues similar to those of Jysk Ordbog, i.e. formats, meta data, data structure and tools, and therefore the Danish CLARIN consortium has invited the DID group to become observers in the work package regarding the constructed data.

3.6 Technical platform

The technical infrastructure of the Danish CLARIN platform is in the process of being specified, and it is still too early to give a more detailed account of these matters. Currently the infrastructure is seen as a digital repository with a web user interface managing:

• Access rights given to users based on user verification mechanisms
• Access rights for users to specific content based on resource profiling
• Search and retrieval facilities
• A personal work space
• Communication facilities

3.7 The future after 2010

One of the management tasks of the Danish consortium is to propose a plan for future operation and exploitation of the Danish CLARIN infrastructure. Key elements for which future funding must be found are on the one hand the technical inclusion of Danish CLARIN into EU-CLARIN, and on the other hand the continued inclusion of new resources on to the national infrastructure. Another challenge will be the dissemination of the usefulness of the infrastructure for a wide range of humanities research areas.

4 European and Nordic Perspectives

The history of language technology collaboration among the Nordic countries goes back to the early days of computational linguistics. The first Nordic summer school in computational linguistics was held in Marstrand, Sweden, in 1972, followed up by Bergen 1973 and Copenhagen 1974. These summer schools have been instrumental in the creation of a Nordic computational linguistic community. Later on the Nodalida conferences were started by “Den Nordiske Samarbejdsgruppe for datamaskinel sprogbehandling” with the first conference in Gothenburg 1977, and as the latest step in this direction we have the creation of NEALT (Northern European Association for Language Technology) in 2007.

The Nordic collaboration has been very important for the building up of the Nordic computational linguistics communities, not least for preparing for European collaboration.

4.1 Content of the Nordic collaboration

Some Nordic countries have languages that are similar and in this case it is highly recommendable to reuse and accommodate tools, standards etc., wherever possible. E.g. the CST lemmatizer for Danish has been trained for Icelandic and is now being used in Iceland. This kind of collaboration will take place only if information about the existence of language technology tools and methods is available. There are several instruments for knowledge sharing and dissemination: the NorDokNet centres (Fersøe, Rögnvaldsson et al. 2005) were supported by the Nordic Council of Ministers, and even if funding has stopped, the collaboration among the centres survives, albeit at a lower level. Similarly the Nodalida conferences are a great help to disseminate knowledge and support Nordic collaboration.

4.2 Merging of Nordic and European perspectives

CLARIN is a European initiative, and this means that CLARIN will provide everything which the Nordic collaboration provides, just at the larger, European, scale: standards and tools are shared with many more languages, and it is possible to collaborate with many more research groups and to be inspired by many more researchers around Europe.

In a successful CLARIN we see the Nordic and the European perspective merging.

Acknowledgements

This project is supported by the Danish Agency for Science, Technology and Innovation, as well as by all partner institutions.

We thank all participants in the Danish consortium for their contribution to the project.

We also thank all the work package leaders for their work package descriptions, which have served as input particularly to section 3 of this document.

References

Hanne Fersøe 2008a. The Danish CLARIN Project. CLARIN Newsletter, number 2, July 2008.
Hanne Fersøe 2008b. Knowledge for Everyman from the Renaissance to Modern Times. CLARIN Newsletter, number 4, December 2008.

Hanne Fersøe, Eiríkur Rögnvaldsson and Koenraad de Smedt 2005. NorDokNet - Network of Nordic Documentation Centres. Contacts to future Baltic Partners. Nordisk Sprogteknologi. Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000 - 2004. København 2005, side 13-23.

Frans Gregersen 2007. The LANCHART Corpus of Spoken Danish, Report from a corpus in progress, in J.Toivanen & P.Juel Henrichsen (eds.): Current Trends in Research on Spoken Language in the Nordic Countries, Volume II, Oulu University Press, p.130-143, ISBN 978-951-42-8514-1.

Jakob Halskov (to appear). Compiling, annotating and publishing corpora in DK-CLARIN, the Danish incarnation of the pan-European initiative for a common resource infrastructure. To appear in Corpus Linguistics 2009, Liverpool.

Peter Juel Henrichsen 2007. The Danish PAROLE corpus - a merge of speech and writing; in J.Toivanen & al (eds) Current Trends in Research on Spoken Language in the Nordic Countries, vol II; Oulu Univ. Press 2007, pp.84-93

K. Jokinen, C. Navarretta, P. Paggio 2008. Distinguishing the communicative functions of gestures. In A. Popescu-Belis and R. Stiefelhagen (eds.) Proceedings of 5th Joint Workshop on Machine Learning and Multimodal Interaction, Utrecht, September 2008, Springer, 38-49.

Brian MacWhinney, Johannes Wagner (to appear): Transcribing, searching and data sharing: The CLAN software. To appear in Gesprächsforschung 2009 (ISSN 1617–1837).

Bente Maegaard, L. Offersgaard, K.F. Joensen, X. Lepetit, C. Navarretta, J. Pedersen, C. Povlsen. 2006. MULINCO - Korpusplatform til sprog- og oversættelsesstudier. Tidsskrift for Universiteternes efter- og videreuddannelse, nr. 7 s. 1-15: E-læring i sprogfag, Danmark.

Bolette S. Pedersen, S. Nimb, L. Trap-Jensen (2008) DanNet: udvikling og anvendelse af det danske wordnet. Nordiske Studier i Leksikografi 9, Rapport fra konference om leksikografi i Norden pp. 353-371, Akureyri, Island.

Bolette S. Pedersen, A. Braasch, L. Henriksen, S. Olsen, C. Povlsen, 2008. Merging a Syntactic Resource with a WordNet: A Feasibility Study of a Merge between STO and DanNet. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08). European Language Resources Association, 2008. 5 s.

Ruus, Hanne. 2002. A Corpus-based Electronic Dictionary for (Re)search, in EURALEX 2002 Proceedings, pages 175-185.
Nordic co-operation in building the language resource infrastructures

Kimmo Koskenniemi
University of Helsinki
Finland
kimmo.koskenniemi@helsinki.fi

Antti Arppe
University of Helsinki
Finland
antti.arppe@helsinki.fi

Abstract

This paper attempts to identify worthwhile goals when building Nordic language resource infrastructures and the relevant parties who should participate in their planning and construction. Finally, some actions are suggested which could move us closer to the goals which have been set.

1 Background

We have a long tradition of Nordic co-operation within language technology (Koskenniemi et al. 2007), including a long series of NODALIDA conferences, the Nordic Research Program 2001-2004, NGSLT, and we now have the NEALT organization which hosts special interest groups such as the SigInfra dedicated to research infrastructures for language resources. Similar co-operation has also been practiced in linguistics, e.g. the NordForsk summer schools and the Scandinavian Conference of Linguistics (SCL).

The European Common Language Resource and Technology Infrastructure (CLARIN) entered its EC funded preparatory phase 2008-2010 and is creating frameworks according to which the operational CLARIN could be built. All Nordic and Baltic countries are participating in CLARIN in various roles.

In Finland, FIN-CLARIN, a consortium of research institutions involved in linguistics and language technology, was formed in 2007 to strive towards CLARIN objectives at a national level. Currently, FIN-CLARIN encompasses the Universities of Helsinki, Joensuu, Jyväskylä, Oulu, and Tampere, the Research Institute for the Languages of Finland (KOTUS/FOCIS), and CSC – IT Center for Science, but the consortium remains open to all other Finnish academic organizations with an involvement in linguistic research or having language resources and technologies available for such research.

As the first step, the FIN-CLARIN consortium members conducted in 2008 a survey of linguistic research resources and tools that exist within their organizations. In all, 76 distinct collections of resources have been identified in this survey, for which the key descriptive data, identifying the resource, its content, location, and access requirements, are available at the FIN-CLARIN website as well as the general ad hoc registry maintained by CLARIN1.

As a second step, the FIN-CLARIN consortium has commissioned from CSC – IT Center for Science a White Paper concerning the various possibilities for setting up a Finnish national Authorization and Authentication Infrastructure (AAI) for language resources, as well as a proposal covering the requirements specifications and actual construction plan for implementing such an infrastructure in Finland. Such an AA infrastructure is the technical bedrock which allows for the potential use of a language resource at any of the participating Finnish organizations according to the Single-Sign-On (SSO) principle, i.e. requiring a user's identification only at one's own Finnish home organization. In practice, this now completed development plan realizes the technical framework of the envisioned CLARIN infrastructure within Finland, and is planned to be fully conformant with the pan-European CLARIN AAI, the kernel of which is planned to be operational already in 2009. As the third step, the FIN-CLARIN consortium has commissioned from CSC the actual construction of this AAI in Finland within 2009.

1 see http://www.clarin.eu/view_resources

2 Nordic goals

One important goal of Nordic research infrastructures for language resources is obviously to make language and lexical materials accessible
and usable for all those who need them for research, teaching, language planning or similar purposes. The access and use of existing materials should be facilitated, new materials should be created, and measures should be taken in order to secure maximally free availability of the future materials already when the materials will be created.

Just within the Nordic countries, the CLARIN infrastructure should allow for researchers interested in e.g. the overall state of the Swedish language, i.e. Swedish spoken and written both in Sweden and in Finland, to easily access the language resources currently physically located at several institutions, first and foremost Språkbanken (The Swedish Language Bank) in Göteborg, Sweden, CSC – IT Center for Science, Finland, the Department of Scandinavian languages and literature at the University of Helsinki, and the Research Centre of the Languages of Finland, regardless of what their home organization currently is. Likewise, the CLARIN infrastructure should allow for researchers in e.g. the Department of Finno-Ugrian Studies at the University of Helsinki to have easy access to the substantial Sámi resources at the University of Tromsø. In addition to such ease of access, the CLARIN infrastructure aims to provide user-friendly interfaces to aggregate such scattered resources as single virtual corpora, and to conduct the most common search and concordancing operations for researchers lacking extensive skills in language technology and programming, which would be necessary to work by themselves directly with the source format of the resources.

The resources for CLARIN or national language resource infrastructures are limited. In order to proceed fast and get the appropriate high quality services available, the Nordic participants now have an opportunity to get more by smart division of labour and by co-ordination, making the most of the current individual strengths of all the parties.

This paper also discusses how the Nordic countries could better integrate themselves in the European CLARIN which is, of course, the best, if not the only way to offer the Nordic researchers the access to materials and tools in other EU countries.

3 Actors

It is important to get the relevant parties involved, including but not restricted to:

• researchers in various disciplines such as linguistics, language technology, or machine learning who need linguistic materials in their research and who sometimes produce new materials,
• researchers in other disciplines who in fact essentially work with linguistic data, e.g. historians, sociologists, or theologians, just to mention a few fields,
• funders of research projects who can require allowing free access, and compliance with standard formats as new materials are produced as a result of the projects,
• specialists in language planning or language cultivation (språkvård), who utilize the materials in their work and compile new dictionaries, norms for language users, and compile new corpus materials,
• commercial parties such as publishers and broadcasting companies who own or possess written and spoken materials, as well as language technology companies who need written or spoken corpus materials and create language technology tools using these materials,
• libraries, museums, and some commercial companies such as Google and Microsoft Corporation which may have huge archives of materials and which are involved in digitizing and storing these archives,
• organizations of authors and journalists, as well as the organizations which process the copyright fees of authors and performers, and
• experts in copyright legislation.

There is an obvious need for attracting relevant parties to the work because relevant materials exist and are controlled by them. In addition, risks will increase if those parties are not motivated and co-operative.

At first sight, some of these parties might appear to have conflicting interests. It would be nice for the researchers if they could use all published materials on an open access basis. This might, however, conflict with the legitimate commercial interests of the publisher if they intend to print and sell copies of such a work. We think that there may still be workable compro-