Developments in Irish Language Technology
Dr. Aodhán Mac Cormaic
Director of Irish
Department of Culture, Heritage and the Gaeltacht
Government of Ireland
Summary
• Current status of Irish technology.

• Details of work already underway.

• Research demonstrates that Irish, like some major
  world languages, is falling behind English in the Digital
  Age.

• Government efforts to tackle this problem:
   – Digital Plan for the Irish Language
Investment by Department of Culture, Heritage and the Gaeltacht
€9m invested in Irish language digital and technology sector since 2006. Over €1m p.a. over last 3 years.

Projects Funded:

• Abair.ie – voice synthesis

• Dúchas.ie – Bailiúchán na Scol 1938 / Schools Collection 1938

• Ainm.ie – biographical details of people involved in Irish language activities

• Logainm.ie – place-names online

• Royal Irish Academy – historical Irish language dictionary

• TechSpace as Gaeilge

• Dublin City University – MOOC in Irish and Irish Traditional Culture
Investment by Department of Culture, Heritage and the Gaeltacht (cont.)
Irish Language Terminology for the EU Terminology Database (IATE)

• Annual grant of €231,000 to Dublin City University.

• Irish is now the 13th largest of the languages in the database and the largest of the new languages!

• Over 72,000 terms translated into Irish to date.

• Important work due to strategy to end derogation by 2022.
Investment by Department of Culture, Heritage and the Gaeltacht (cont.)
www.gaois.ie
• Search engine on the www.gaois.ie site allowing searches for legal texts.
• 9m words in this corpus, half in Irish and half in English.
• Developed by Dublin City University
Investment by Department of Culture, Heritage and the Gaeltacht (cont.)
Tapadóir: Machine Translation System

• DCU research – statistics-based.

• Trinity College Dublin – rule-based.

• Hybrid system combining both.
European Language Resource Coordination (ELRC)

• EU Commission supporting public-sector bodies across Europe to provide multi-lingual public services.
• Machine Translation system to be developed as part of Connecting Europe Facility (CEF).
• 2nd Phase of data collection process currently underway.
Investment by Foras na Gaeilge
Online Dictionaries
• Irish version of Foclóir Béarla-Gaeilge, 90% of which is complete, available on www.focloir.ie.

• Three other dictionaries – Béarla/Gaeilge (English-Irish, 1959), An Foclóir Beag (monolingual) and Foclóir Gaeilge-Béarla (1978) – all available on www.teanglann.ie.
2012 META-NET Report:
The Irish Language in the Digital Age
Language Processing: level of support for language technology for 30 European languages

• Excellent support: none
• Good support: English
• Reasonable support: German, Italian, Finnish, French, Dutch, Portuguese, Spanish, Czech
• Intermittent support: Basque, Bulgarian, Danish, Estonian, Galician, Greek, Irish, Catalan, Norwegian, Polish, Swedish, Serbian, Slovak, Slovene, Hungarian
• Poor or no support: Icelandic, Croatian, Latvian, Lithuanian, Maltese, Romanian
2012 META-NET Report:
The Irish Language in the Digital Age
Machine Translation: level of support for language technology for 30 European languages

• Excellent support: none
• Good support: English
• Reasonable support: French, Spanish
• Intermittent support: German, Italian, Catalan, Dutch, Polish, Romanian, Hungarian
• Poor or no support: Basque, Bulgarian, Danish, Estonian, Finnish, Galician, Greek, Irish, Icelandic, Croatian, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Swedish, Serbian, Slovak, Slovene, Czech
2012 META-NET Report:
The Irish Language in the Digital Age
Text Analysis: level of support for language technology for 30 European languages

• Excellent support: none
• Good support: English
• Reasonable support: German, French, Italian, Dutch, Spanish
• Intermittent support: Basque, Bulgarian, Danish, Finnish, Galician, Greek, Catalan, Norwegian, Polish, Portuguese, Romanian, Swedish, Slovak, Slovene, Czech, Hungarian
• Poor or no support: Estonian, Irish, Icelandic, Croatian, Latvian, Lithuanian, Maltese, Serbian
2012 META-NET Report:
The Irish Language in the Digital Age
Speech and Text Resources: level of support for language technology for 30 European languages

• Excellent support: none
• Good support: English
• Reasonable support: German, French, Italian, Dutch, Polish, Swedish, Spanish, Czech, Hungarian
• Intermittent support: Basque, Bulgarian, Danish, Estonian, Finnish, Galician, Greek, Catalan, Croatian, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
• Poor or no support: Irish, Icelandic, Latvian, Lithuanian, Maltese
Our New Approach:
A Digital Plan for the Irish Language

• Long-term plan required in order to improve technologies in various sectors.

• Expert team from DCU and Trinity College.

• Plan to be published in late 2017.
Terms of Reference for Working Group
To prepare cost estimates for a 10-year action plan, taking into account the following:
• that the annual funding which will be available to implement the Plan will be based on the financial resources available in 2015, with the expectation that additional resources will be made available on an incremental annual basis;
• that additional financial resources may be made available in the future within the timeframe of operation of the Plan as a result of same being redirected from other projects being funded by the Department of Culture, Heritage and the Gaeltacht at the present time;
• that the ongoing financial implications for the State and/or the private sector be taken into account when making recommendations in regard to expenditure; and,
• that the Plan will recognize those items the development of which cannot be achieved during its timeframe but that an attempt will be made to supply cost estimates for the development of those items.
Aims of the Plan
• To set up a long-term research and development infrastructure that will, into the future, deliver those state-of-the-art technologies that are increasingly vital for language maintenance.

• In the plan, the basic linguistic and phonetic research is seen as providing the essential resources for the technology development. These technologies include machine translation, text-to-speech synthesis, speech recognition, and dialogue systems that enable speech-based human-computer interaction.

• These core technologies will enable the development of the growing number of applications that will serve the Irish-speaking public.

• These technologies are particularly vital for the teaching/learning of Irish, as well as for those with disabilities.
Contents of the Digital Plan?

Digital documentation and linguistic analysis of the written and spoken dialects

• Types of documentation work required include: phonetics, syntax, phonology, discourse and language acquisition.

• This work is required prior to development of language tools.
Contents of the Digital Plan?
Language Resources: Resources, Data and Knowledge Bases

• Irish language corpora providing detailed information relating to the linguistic structure of the language.
• These corpora are fundamental to all other research work.
Contents of the Digital Plan?
Natural Language Processing (NLP)
What are the processing tools required in order to process the data obtained from the corpora?
• Examples: Morphological analysis, part-of-speech tagging, parsing, semantic analysis, named entity recognition etc.
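
To make "morphological analysis" concrete, here is a minimal illustrative sketch, not taken from the Plan or from any of the tools named in these slides. It undoes two Irish initial mutations (eclipsis and lenition) so that a mutated word form can be looked up under its base form. The function name `demutate` and the prefix table are assumptions for illustration; a real analyser would also need to handle h- and t-prothesis, inflectional endings and ambiguous forms.

```python
# Toy sketch (hypothetical helper, not a real library API):
# strip Irish initial mutations to recover a candidate base form.

# Eclipsis prefixes mapped to the original initial consonant.
ECLIPSIS = {"mb": "b", "gc": "c", "nd": "d", "bhf": "f",
            "ng": "g", "bp": "p", "dt": "t"}

# Consonants that can be lenited (written with a following "h").
LENITABLE = set("bcdfgmpst")


def demutate(word: str) -> str:
    """Return a candidate unmutated form of `word` (lower-cased)."""
    w = word.lower()
    # Check eclipsis first, longest prefix first, so that "bhf"
    # is not mistaken for a lenited "bh".
    for prefix in sorted(ECLIPSIS, key=len, reverse=True):
        if w.startswith(prefix):
            return ECLIPSIS[prefix] + w[len(prefix):]
    # Lenition: lenitable consonant followed by "h".
    if len(w) > 2 and w[0] in LENITABLE and w[1] == "h":
        return w[0] + w[2:]
    return w


if __name__ == "__main__":
    for form in ["bhean", "mbád", "bhfuil", "gcat", "teach"]:
        print(form, "->", demutate(form))
```

The rule-based approach shown is only one option; statistical or neural analysers trained on the corpora described earlier would be an alternative, and the sketch makes no claim about which approach the working group favours.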
Contents of the Digital Plan?
Natural Language Understanding (NLU)
• The computer must be taught to understand the text before it rather than merely seeing it as a string of words.
• This technology is used in search engines such as Google, for example.
Contents of the Digital Plan?
Speech Synthesis

• Much work already carried out by TCD (abair.ie)

• Review this work and lay out next steps
Contents of the Digital Plan?
Speech Recognition: Conversion of spoken word to text
• A priority – already in use in major languages.
• Examples: mobile phones, automated subtitling, translation of speech in one language to another language.
• Challenging work due to dialects and young people’s voices, for example.
Contents of the Digital Plan?

Machine Translation Systems

• Tapadóir, ELRC, MT@EC.

• Further research required.
Contents of the Digital Plan?
Dialogue Systems
• User inputs a query in text or speech, computer interprets the query and provides a response.
• These systems combine speech synthesis and speech recognition technologies.
Contents of the Digital Plan?

Information Retrieval
• Systems required for accurate retrieval of material in the Irish language (e.g. nuclear energy vs. atomic energy).
• Google only has a front-end in Irish!
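
The "nuclear energy vs. atomic energy" problem can be sketched as simple query expansion: a search for one term should also find documents that use an equivalent term. The following is a minimal illustration under stated assumptions; the `SYNONYMS` table, the function names and the sample documents are all hypothetical, and a production system for Irish would draw its equivalents from terminology resources and would also need to cope with inflected and mutated word forms.

```python
# Toy sketch of synonym-aware retrieval (hypothetical names and data).

# Hand-written table of equivalent terms; in practice this would come
# from a terminology database rather than being hard-coded.
SYNONYMS = {
    "nuclear energy": {"atomic energy"},
    "atomic energy": {"nuclear energy"},
}


def expand(query: str) -> set[str]:
    """Return the query plus any known equivalent terms."""
    q = query.lower().strip()
    return {q} | SYNONYMS.get(q, set())


def search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing the query or one of its equivalents."""
    terms = expand(query)
    return [doc for doc in documents
            if any(term in doc.lower() for term in terms)]


if __name__ == "__main__":
    docs = [
        "Report on atomic energy policy",
        "Wind power in the Gaeltacht",
        "Nuclear energy in Europe",
    ]
    # Both the "atomic" and the "nuclear" documents are returned.
    print(search("nuclear energy", docs))
```

Without the expansion step, the same query would match only the third document, which is the retrieval gap the slide describes.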
Contents of the Digital Plan?

Educational Applications
• Computer-Assisted Language Learning (CALL)
• Interactive games for schools and universities in Ireland and overseas.
• Self-led learning etc.
Contents of the Digital Plan?

Access for people with disabilities
• Blind or partially-sighted
• Speech difficulties
• Learning difficulties (e.g. dyslexia)
Contents of the Digital Plan?
Role of national and multi-national companies and of Government and the public
• Private sector support essential for localisation of services.
• Public support for crowd-sourcing initiatives (e.g. Meitheal Dúchas) provides added value.
• State incentivising small companies to localise their services (websites etc.)
GaelTeic @ DCU
• Government investment of €621,000 over 4 years
• Areas of Natural Language Processing (NLP) research that require significant attention with respect to the Irish language. The areas have been selected to build rapidly upon existing work, and to achieve maximum impact and reusability of the tools, data and resources created. The work here builds upon the pioneering work to date and will serve to provide a firmer grounding on which downstream language applications can be built to support the use of Irish in a digital environment.
GaelTeic @ DCU
• The broad range of objectives includes furthering state-of-the-art syntactic parsing for Irish, augmenting syntactically annotated corpora, improving tools for understanding Irish social media text, establishing Irish as a language used in international shared task projects and ultimately training a new team of language technology experts with Irish language skills, a resource that is significantly lacking and considerably impacting progress in this field.
Next Steps
• Publication of Plan by end 2017
• Ministerial support essential.
• Funding
• Review and update every 5 years.

                                End