Developments in Irish Language Technology - Dr. Aodhán Mac Cormaic, Director of Irish, Department of Culture, Heritage and the Gaeltacht, Government ...
Developments in Irish Language Technology
Dr. Aodhán Mac Cormaic
Director of Irish, Department of Culture, Heritage and the Gaeltacht
Government of Ireland
Developments in Irish Language Technology
Achoimre / Summary
• Current status of Irish language technology.
• Details of work already underway.
• Research demonstrates that Irish, like some major world languages, is falling behind English in the Digital Age.
• Government efforts to tackle this problem:
  Ø Digital Plan for the Irish Language
Investment by Department of Culture, Heritage and the Gaeltacht
€9m invested in Irish language digital and technology sector since 2006.
Over €1m p.a. over last 3 years.
Projects Funded:
• Abair.ie – voice synthesis
• Dúchas.ie – Bailiúchán na Scol 1938 / Schools Collection 1938
• Ainm.ie – biographical details of people involved in Irish language activities
• Logainm.ie – place-names online
• Royal Irish Academy – historical Irish language dictionary
• TechSpace as Gaeilge
• Dublin City University – MOOC in Irish and Irish Traditional Culture
Investment by Department of Culture, Heritage and the Gaeltacht (cont.)
Irish Language Terminology for the EU Terminology Database (IATE)
• Annual grant of €231,000 to Dublin City University.
• Irish is now the 13th largest of the languages in the database and the largest of the new languages!
• Over 72,000 terms translated into Irish to date.
• Important work, given the strategy to end the derogation by 2022.
Investment by Department of Culture, Heritage and the Gaeltacht (cont.)
www.gaois.ie
• Search engine on www.gaois.ie site allowing searches for legal texts.
• 9m words in this corpus, half in Irish and half in English.
• Developed by Dublin City University.
Investment by Department of Culture, Heritage and the Gaeltacht (cont.)
Tapadóir: Machine Translation System
• DCU research – statistics-based.
• Trinity College Dublin – rule-based.
• Hybrid system combining both.
European Language Resource Coordination (ELRC)
• EU Commission supporting public-sector bodies across Europe to provide multi-lingual public services.
• Machine Translation system to be developed as part of Connecting Europe Facility (CEF).
• 2nd Phase of data collection process currently underway.
Investment by Foras na Gaeilge
Online Dictionaries
• Irish version of Foclóir Béarla-Gaeilge, 90% of which is complete, available on www.focloir.ie.
• Three other dictionaries – Béarla/Gaeilge (English-Irish, 1959), An Foclóir Beag (monolingual) and Foclóir Gaeilge-Béarla (1978) – all available on www.teanglann.ie.
2012 META-NET Report: The Irish Language in the Digital Age
Language Processing: level of support for language technology for 30 European languages
• Excellent Support: (none)
• Good Support: English
• Reasonable Support: Czech, Dutch, Finnish, French, German, Italian, Portuguese, Spanish
• Intermittent Support: Basque, Bulgarian, Catalan, Danish, Estonian, Galician, Greek, Hungarian, Irish, Norwegian, Polish, Serbian, Slovak, Slovene, Swedish
• Poor or No Support: Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian
2012 META-NET Report: The Irish Language in the Digital Age
Machine Translation: level of support for language technology for 30 European languages
• Excellent Support: (none)
• Good Support: English
• Reasonable Support: French, Spanish
• Intermittent Support: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
• Poor or No Support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish
2012 META-NET Report: The Irish Language in the Digital Age
Text Analysis: level of support for language technology for 30 European languages
• Excellent Support: (none)
• Good Support: English
• Reasonable Support: Dutch, French, German, Italian, Spanish
• Intermittent Support: Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
• Poor or No Support: Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian
2012 META-NET Report: The Irish Language in the Digital Age
Speech and Text Resources: level of support for language technology for 30 European languages
• Excellent Support: (none)
• Good Support: English
• Reasonable Support: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
• Intermittent Support: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
• Poor or No Support: Icelandic, Irish, Latvian, Lithuanian, Maltese
Our New Approach: A Digital Plan for the Irish Language
• Long-term plan required in order to improve technologies in various sectors.
• Expert team from DCU and Trinity College.
• Plan to be published late 2017.
Terms of Reference for Working Group
To prepare cost estimates for a 10-year action plan, taking into account the following:
• that the annual funding which will be available to implement the Plan will be based on the financial resources available in 2015, with the expectation that additional resources will be made available on an incremental annual basis;
• that additional financial resources may be made available in the future, within the timeframe of operation of the Plan, as a result of same being redirected from other projects currently funded by the Department of Culture, Heritage and the Gaeltacht;
• that the ongoing financial implications for the State and/or the private sector be taken into account when making recommendations in regard to expenditure; and,
• that the Plan will recognise those items the development of which cannot be achieved during its timeframe, but that an attempt will be made to supply cost estimates for the development of those items.
Aims of the Plan
• To set up a long-term research and development infrastructure that will, into the future, deliver the state-of-the-art technologies that are increasingly vital for language maintenance.
• In the plan, basic linguistic and phonetic research is seen as providing the essential resources for technology development. These technologies include machine translation, text-to-speech synthesis, speech recognition, and dialogue systems that enable speech-based human-computer interaction.
• These core technologies will enable the development of the growing number of applications that will serve the Irish-speaking public.
• These technologies are particularly vital for the teaching/learning of Irish, as well as for those with disabilities.
Contents of the Digital Plan: Digital documentation and linguistic analysis of the written and spoken dialects
• Types of documentation work required include: phonetics, syntax, phonology, discourse and language acquisition.
• This work is required prior to development of language tools.
Contents of the Digital Plan: Language Resources (Resources, Data and Knowledge Bases)
• Irish language corpora providing detailed information relating to the linguistic structure of the language.
• These corpora are fundamental to all other research work.
Contents of the Digital Plan: Natural Language Processing (NLP)
What are the processing tools required in order to process the data obtained from the corpora?
• Examples: morphological analysis, part-of-speech tagging, parsing, semantic analysis, named entity recognition etc.
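To make "morphological analysis" concrete, here is a minimal sketch of one small step such a tool has to handle for Irish: undoing initial mutations (lenition and eclipsis) to recover a candidate base form. The function name `demutate` and the rule tables are hypothetical and not taken from any project named in these slides. The rules are a naive spelling heuristic; a real analyser would consult a lexicon and cover many more cases.

```python
# Naive "demutation" heuristic: a toy stand-in for one step of
# morphological analysis. All names here are hypothetical.

# Eclipsis spellings mapped to the radical consonant they cover.
# "bhf" is listed first so it is tried before the lenition rule sees "bh".
ECLIPSIS = [("bhf", "f"), ("mb", "b"), ("gc", "c"), ("nd", "d"),
            ("ng", "g"), ("bp", "p"), ("dt", "t")]

# Consonants that can be lenited (written with a following "h").
LENITABLE = set("bcdfgmpst")


def demutate(word: str) -> tuple[str, str]:
    """Return (candidate_base_form, mutation_label) for a single word."""
    w = word.lower()
    # Eclipsis: strip the eclipsing letters and restore the radical consonant.
    for prefix, radical in ECLIPSIS:
        if w.startswith(prefix) and len(w) > len(prefix):
            return radical + w[len(prefix):], "eclipsis"
    # Lenition: drop the "h" that follows the initial consonant.
    if len(w) > 2 and w[0] in LENITABLE and w[1] == "h":
        return w[0] + w[2:], "lenition"
    return w, "none"


if __name__ == "__main__":
    for form in ["bhean", "gcat", "bhfuil", "cat"]:
        print(form, demutate(form))
```

Because the heuristic looks only at spelling, it will mis-analyse words whose dictionary form already begins with one of these letter sequences, which is one reason corpora and lexicons are needed before tools like this can be made reliable.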
Contents of the Digital Plan: Natural Language Understanding (NLU)
• The computer must be taught to understand the text before it rather than merely seeing it as a string of words.
• This technology is used in search engines such as Google, for example.
Contents of the Digital Plan: Speech Synthesis
• Much work already carried out by TCD (abair.ie).
• Review this work and lay out next steps.
Contents of the Digital Plan: Speech Recognition (conversion of spoken word to text)
• A priority – already in use in major languages.
• Examples: mobile phones, automated subtitling, translation of speech in one language to another language.
• Challenging work due to dialects and young people’s voices, for example.
Contents of the Digital Plan: Machine Translation Systems
• Tapadóir, ELRC, MT@EC.
• Further research required.
Contents of the Digital Plan: Dialogue Systems
• User inputs a query in text or speech, computer interprets the query and provides a response.
• These systems combine speech synthesis and speech recognition technologies.
Contents of the Digital Plan: Information Retrieval
• Systems required for accurate retrieval of material in the Irish language, including material that uses a synonymous term (e.g. "nuclear energy" vs. "atomic energy").
• Google only has front-end in Irish!
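The "nuclear energy" vs. "atomic energy" problem can be illustrated with a minimal sketch of synonym-based query expansion. This is an assumption about one possible approach, not a description of any system named in these slides. The synonym table, the function names and the sample documents are all hypothetical, and the English term pair from the slide stands in for equivalent Irish term pairs.

```python
# Toy query expansion: a search for one term also matches its synonyms.
# SYNONYM_GROUPS is a hypothetical hand-made table; a real system might
# draw such groups from a terminology database.
SYNONYM_GROUPS = [
    {"nuclear energy", "atomic energy"},
]


def expand_query(query: str) -> set[str]:
    """Return the query together with any terms listed as its synonyms."""
    q = query.strip().lower()
    expanded = {q}
    for group in SYNONYM_GROUPS:
        if q in group:
            expanded |= group
    return expanded


def search(query: str, documents: dict[str, str]) -> list[str]:
    """Return sorted ids of documents containing the query or a synonym."""
    terms = expand_query(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    )


if __name__ == "__main__":
    docs = {
        "a": "Report on atomic energy policy",
        "b": "Nuclear energy statistics",
        "c": "Wind power",
    }
    print(search("nuclear energy", docs))  # ['a', 'b']
```

Without the expansion step, the same query would return only document "b". Plain substring matching is used here for brevity; it ignores inflection, which matters a great deal for Irish.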
Contents of the Digital Plan: Educational Applications
• Computer-Assisted Language Learning (CALL).
• Interactive games for schools and universities in Ireland and overseas.
• Self-led learning etc.
Contents of the Digital Plan: Access for people with disabilities
• Blind or partially-sighted
• Speech difficulties
• Learning difficulties (e.g. dyslexia)
Contents of the Digital Plan: Role of national and multi-national companies and of Government and the public
• Private sector support essential for localisation of services.
• Public support for crowd-sourcing initiatives (e.g. Meitheal Dúchas) provides added value.
• State incentivising small companies to localise their services (websites etc.).
GaelTeic @ DCU
• Government investment of €621,000 over 4 years.
• Targets areas of Natural Language Processing (NLP) research that require significant attention with respect to the Irish language. The areas have been selected to build rapidly upon existing work, and to achieve maximum impact and reusability of the tools, data and resources created. The work builds upon the pioneering work to date and will provide a firmer grounding on which downstream language applications can be built to support the use of Irish in a digital environment.
GaelTeic @ DCU
• Broad range of objectives, including:
  Ø furthering state-of-the-art syntactic parsing for Irish;
  Ø augmenting syntactically annotated corpora;
  Ø improving tools for understanding Irish social media text;
  Ø establishing Irish as a language used in international shared task projects;
  Ø ultimately, training a new team of language technology experts with Irish language skills, a resource that is significantly lacking and whose absence is considerably impacting progress in this field.
Next Steps
• Publication of Plan by end 2017.
• Ministerial support essential.
• Funding.
• Review and update every 5 years.
End