CLARIN AAI Vision
Daan Broeder, Max-Planck Institute for Psycholinguistics
DFN meeting, June 7th, Berlin
Contents
• What is the CLARIN project?
• What are language resources?
• A “Holy Grail” CLARIN user scenario
• AAI vision and what needs to be solved to achieve it
What is CLARIN
Common Language Resources and Technology Infrastructure
The CLARIN project is a large-scale pan-European collaborative effort to coordinate language resources and technology and to make them available and readily usable for Language & SSH (Social Sciences & Humanities) researchers.
Language Resources
Any resource used to study language:
• Text corpora
  • Newspapers, …, email, SMS messages
• Multi-media corpora
  • Audio recordings to study phonetics, train speech recognizers
  • Video recordings for sign-language studies
  • Language documentation (language use in cultural context)
• Multi-media lexica
  • Lexical entries linked with pictures, sound
Sign-Language Example
Multi-Media lexicon example
Lexical entries link directly into archived corpora, e.g. via Annex
What is CLARIN
• CLARIN is an EU infrastructure project with 4.2 M€ funding for a 3-year preparatory phase (ends 2010)
• Additional funding from national governments (at this moment at least 14 M€)
• The CLARIN consortium now has 36 partners from 26 EU countries
• The CLARIN community has >180 member organisations in 32 countries (mostly NLP organisations)
• CLARIN is based on many earlier initiatives with many participants: LangWeb, EARL, TELRI, LIRICS and, more recently, DAM-LR
• MPI for Psycholinguistics is responsible for WP2, working on the technical infrastructure
CLARIN Time Plan
2008 - 2010: Preparatory Phase
• Limited set of federated CLARIN centers (10+)
• Showcases, demonstrators
• WP8: investigation of national funding for the construction & maintenance phase
2011 - 2016: Construction Phase
• No direct European funding, but EU assisting projects
• Depends on national project commitments
  • Netherlands already until 2014
  • Currently intensive preparations for CLARIN D (-> 2016)
2016? - …: Operational Phase
• Has to be cost efficient, we have to compete!
CLARIN EU continuation after the preparatory phase is likely to take the form of an ERIC. This is important if only to provide a legal entity that can make contracts with outside parties on behalf of the CLARIN community.
A backbone of CLARIN centers
These together uphold the infrastructure, maintaining it and offering guidance & expertise for its use.
• Have stable repositories for resources and services
• Need strong national support for many years
• Need good teams that have a long time perspective and can provide persistency and continuation of knowledge
This is yet far from reality
• Current situation is one of accidental and temporary collaborations and obligations
• Only a limited number of centers can probably fulfill the criteria of sufficient stability, funding and technological strength
• Currently 25 candidate centers
CLARIN “Holy Grail” User Scenario
• A researcher authenticates at his own organization and creates a “virtual” collection of resources from different repositories. He does this on the basis of browsing a catalogue, searching through metadata, or searching in resource content.
• To be granted access to this distributed dataset he signs the appropriate licenses.
• He is then able to use a workflow specification tool and process this virtual collection using LT tools in the form of reliable distributed web services which he is authorized to use.
• (Intermediate) results are stored in a user-specific workspace.
• After evaluation, the resulting data (including metadata) can be added to a repository and the “virtual” collection specification can be stored for future reference.
For our domain this is ambitious and challenging, but even a partial realization is worthwhile.
CLARIN Infrastructure Components
In the previous scenario we find the following components & functionality:
• Metadata catalog
• Virtual collection registries
• Persistent identification of resources
  • EPIC: European PID Consortium: GWDG, CSC, SARA
• AAI infrastructure
  • Technical issues
  • Organizational
  • Legal
Virtual Language Observatory
[Diagram: metadata sources are combined by OAI-PMH harvesting and transformation; further diagram labels: IMDI domain, indexes, catalogue, faceted browser, GIS overlay]
Sources: CGN (12.000), End.Lang. (35.000), MPI (33.000), BAS (7.400), AILLA (1.800), OLAC (40.000), LRT Inventory (800/137), DFKI Tool Registry (292), ELDA (60), others
Hard problem:
- mapping
- granularity
- curation
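The OAI-PMH harvesting step named on this slide can be illustrated with a minimal sketch. It only builds `ListRecords` request URLs and reads record identifiers and the resumption token from one response page; it does no networking and no transformation. The endpoint `https://repository.example.org/oai` and the sample page are hypothetical, and the slides do not say which metadata prefixes the CLARIN harvester requests (`oai_dc` is simply the prefix every OAI-PMH repository must support).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response page, shortened to the elements used above.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would fetch the first URL, parse the page, and keep requesting `list_records_url(base, resumption_token=token)` until no token is returned. The "hard problems" on the slide (mapping, granularity, curation) start only after this step.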
CLARIN AAI
• Purpose is to create one single domain of CLARIN resources and services for our users
• Users have only one identity, preferably maintained at their home institute (since we hope to have very many users), and can use SSO (single sign-on) between the centers
• Our users are linguists and SSH academics spread out over Europe; CLARIN cannot hope to influence the way their user accounts are set up
• But CLARIN can profit from existing AAI systems in the research & education domain
• CLARIN centers are part of the CLARIN organization and they can be asked to conform to specific standards with respect to AAI
Federated Authentication
• Many countries have a National Identity Federation (IDF) set up by the different NRENs (national research and education networks)
• Such a federation is a collection of IdPs and SPs
• Users have an account at their institute (IdP) and can use resources or services from centers (SPs)
• When a user accesses a resource at an SP he can authenticate at his own IdP
Purpose:
• Provide SSO
• Single user identity
• Limited user information exposure
[Diagram: a user, an IdP and two service providers (SPa, SPb) with their resources, connected by numbered steps 1-6; further labels: user info, processing]
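The three purposes on this slide (SSO, a single identity, limited attribute exposure) can be made concrete with a toy model. This is a sketch only: it is not SAML or any real federation protocol, the class names, the `affiliation` attribute and the access rule are all assumptions made for illustration, and the mapping to the numbered steps in the diagram is a guess.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityProvider:
    """Toy IdP: holds the accounts and releases only selected attributes."""
    accounts: dict                       # username -> {"password", "attributes"}
    released: tuple = ("affiliation",)   # attributes an SP is allowed to see
    sessions: set = field(default_factory=set)
    login_prompts: int = 0               # how often a user had to log in

    def authenticate(self, username, password=None):
        # No session yet: the user must log in at the home institute.
        if username not in self.sessions:
            self.login_prompts += 1
            account = self.accounts.get(username)
            if account is None or account["password"] != password:
                raise PermissionError("authentication failed")
            self.sessions.add(username)
        # Limited exposure: only the released attributes leave the IdP.
        attrs = self.accounts[username]["attributes"]
        return {k: attrs[k] for k in self.released if k in attrs}


@dataclass
class ServiceProvider:
    """Toy SP: never sees the password, only the IdP's assertion."""
    name: str
    idp: IdentityProvider

    def access(self, username, password=None):
        assertion = self.idp.authenticate(username, password)
        # Hypothetical access rule for this sketch.
        allowed = assertion.get("affiliation") == "member"
        return allowed, assertion


idp = IdentityProvider(accounts={
    "alice": {"password": "s3cret",
              "attributes": {"affiliation": "member",
                             "email": "alice@example.org"}},
})
sp_a = ServiceProvider("SPa", idp)
sp_b = ServiceProvider("SPb", idp)

first = sp_a.access("alice", "s3cret")   # logs in once at the IdP
second = sp_b.access("alice")            # reuses the IdP session (SSO)
```

In this model the second service provider grants access without a new login, and neither provider ever receives the e-mail address, because the IdP does not release it.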
CLARIN wide AAI (1)
• The CLARIN SPs become members of their national IDFs
• Rely on the eduGAIN confederation (GEANT 3 project) to provide the trust between the national IDFs
• eduGAIN is not yet functional
  • attribute harmonization issues
  • privacy issues disclosing attributes when crossing national frontiers
• Homeless users?
[Diagram: SP1, SP2, SP3; IDF a, IDF b, IDF c; eduGAIN metadata & trust]
CLARIN wide AAI (2)
• Establish a CLARIN SP organization as a legal entity able to sign contracts where needed with the national IDFs
• The CLARIN SP organization takes care of exchanging the SP specifications with the national IDFs
• Homeless users?
[Diagram: SP1, SP2, SP3; IDF a, IDF b, IDF c; metadata & trust]
How about licenses?
• Many resources are available under a special license (EULA), e.g. “Academic use only”
• CLARIN WP7 investigated possible harmonization
• Should a user have to sign the same EULA repeatedly at different data providers when processing a distributed data set? This would break the SSO!
• Can we store the signed EULA information at the user's IdP as an attribute?
• CLARIN has no way of influencing the IdP organizations, so a CLARIN registry for this would be needed
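The registry idea can be sketched as a small lookup: a signature is recorded once per user and license, and every data provider checks the same record instead of asking again. This is an illustration under stated assumptions, not the CLARIN design; the class, the function, the license identifiers and the resource names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EulaRegistry:
    """Toy central registry of (user, license) signatures."""
    signed: set = field(default_factory=set)

    def sign(self, user_id, license_id):
        self.signed.add((user_id, license_id))

    def has_signed(self, user_id, license_id):
        return (user_id, license_id) in self.signed


def missing_licenses(registry, user_id, collection):
    """List the licenses a user still has to sign for a virtual collection.

    `collection` maps a resource name to the id of the license it is under.
    Each license appears once, however many providers use it.
    """
    required = set(collection.values())
    return sorted(lic for lic in required
                  if not registry.has_signed(user_id, lic))


registry = EulaRegistry()
collection = {
    "spa:corpus1": "academic-use-only",   # held by one provider
    "spb:corpus2": "academic-use-only",   # same license, other provider
    "spb:lexicon": "no-redistribution",
}
before = missing_licenses(registry, "alice", collection)
registry.sign("alice", "academic-use-only")
after = missing_licenses(registry, "alice", collection)
```

Because the check is keyed on the license rather than on the provider, the user signs “Academic use only” once even though two providers hold resources under it, which is what keeps the single sign-on experience intact.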
Virtual Organization Platform
[Diagram: user browser, IdP, SPa, SPb, and the VO Platform acting as an external user attribute authority with an EULA DB]
Create special EULA service. This is part of the CLARIN organization, (probably) independent of the IDFs.
• There is a PoC implementation available
• This is suitable as a basis for a CLARIN EULA service
• Developing this further is part of CLARIN NL
CLARIN SP Test Federation
• The national Identity Federations (IDF) will come together in a single confederation: eduGAIN
• This way users associated with any IdP can use resources from any SP in the confederation
• This is not operational yet
• Therefore CLARIN created an SP federation that can sign contracts with the individual IDFs
• This is an administrative burden, but it works, is extendible and is independent of eduGAIN progress
Current status
• Initial Service Provider Federation: MPI-Psyl, BBAW, IDS, CSC
• Made contract with HAKA Finland, DFN AAI Germany, SURFfed Netherlands
• Successfully demonstrated SSO with a few SPs
Problems encountered
• Federation fees for SPs
  • SURFfed and HAKA require payment from “external” SPs to enter the IDF. All foreign SPs could be considered external.
• Particular IDF requirements
  • Specific X509 certificate issuer(s) (HAKA)
  • IdP-initiated SP connection request (SURFfed)
• Explaining the SP federation model to all participants
  • SP and IDF management and legal people
• Scalability of the contracts
  • Flexibility to add new SPs or national identity federations without too much overhead is important.
  • One representative for the SPs with power of attorney to deal with the national identity federation agreements (1xN instead of NxN signatures).
  • Currently a CLARIN centre, in the future the CLARIN ERIC
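The “1xN instead of NxN” remark can be worked through with the names given elsewhere in these slides. The sketch below assumes a simple model in which, without a representative, every SP signs its own agreement with every IDF; agreements inside the SP federation (such as the power of attorney itself) are not counted. The function names are hypothetical.

```python
def agreements_bilateral(sps, idfs):
    """Every SP signs separately with every IDF."""
    return [(sp, idf) for sp in sps for idf in idfs]


def agreements_via_representative(idfs, representative="CLARIN SP federation"):
    """One representative signs once with each IDF on behalf of all SPs."""
    return [(representative, idf) for idf in idfs]


# Initial SP federation and contracted IDFs as listed on the status slide.
SPS = ["MPI-Psyl", "BBAW", "IDS", "CSC"]
IDFS = ["HAKA", "DFN AAI", "SURFfed"]

bilateral = len(agreements_bilateral(SPS, IDFS))
federated = len(agreements_via_representative(IDFS))
```

With four SPs and three IDFs this gives 12 bilateral agreements against 3 through the representative. Adding a fifth SP would add three agreements in the bilateral model and none in the federated one, which is the scalability argument made on the slide.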
National IDF policy
What can national IDFs do to make (CLARIN) life easy?
• Facilitate/push eduGAIN; that would solve most of our problems.
• Think of harmonizing your contracts (this reduces the number of annexes in the CLARIN SP contract).
• Be flexible; be aware of different situations for SPs from other countries, e.g. the certificate issuer requirement.
• Don’t start asking money for connecting the CLARIN SP federation. We are not commercial publishers.
• Keep cooperating with us, it is going well!
Non-EU collaborations
Regional Archives Initiative: cooperation of MPI-Psyl with other organizations interested in EL archiving
• They use MPI’s LAT archiving software
• Encourage local resource collecting & archiving
• A network of South American archives has been established and contacts with CLARA were made
Non-EU collaborations
How will we accommodate users and SPs from non-EU countries?
• Will we have to wait for a super eduGAIN, or
• can we introduce non-EU IdPs & SPs in the CLARIN federation?
[Diagram label: data sync]
[Diagram: collaborations/interactions; labels: concrete joint plans, projects, cooperations, contribution, PARADE, discussions]
Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230